Stochastic linguistics deals with the probability of particular patterns occurring in natural language, and is therefore likely to play an important role in future natural language processing (NLP) applications, including the semantic web.
As Graham Coulter-Smith puts it in his essay The Stochastic Revolution in Art and Science:
… the leading edge of web technology developing towards globalized information interflow is almost exclusively based on stochastic technologies.
Stochastic techniques such as n-gram models and latent semantic analysis (LSA) help us identify and classify patterns in natural language, which lets us compare, search and analyze documents in a language-independent manner.
However, they fail to further machine understanding beyond a broad structural analysis. These techniques recognize a very limited set of relationships between terms, which usually come down to “identical” or “interchangeable”. Such relations say nothing about the quality of the connection, i.e. whether two similar terms are similar in structure or in meaning.
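To make the limitation concrete, here is a minimal sketch of an n-gram comparison. It assumes character trigrams and Jaccard overlap, which is only one of many possible n-gram measures; the function names and the example terms other than “it’s 5 o’clock” are made up for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are trivially identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical terms: the score reflects shared surface structure only.
print(ngram_similarity("it's 5 o'clock", "it's 6 o'clock"))
print(ngram_similarity("it's 5 o'clock", "time for tea"))
```

The result is a single number per pair. It says how much surface structure two terms share, but not whether one generalizes the other, answers it, or means the same thing.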
This is where a catch-22 begins to unfold. To provide relations that reflect the meaning of terms, and by which those terms can be understood, machines running stochastic analyses would have to understand the terms first. The only way to resolve this “catch” is to add human intelligence to the mix and extend the network of terms with the missing semantic links, which is how we arrive at content mapping.
Content mapping is a system that does exactly this: participants collectively define and maintain a rich set of relations between bits of content, including natural language patterns. Relations create equivalence classes for content elements, where one term may belong to several classes based on its meaning, structure and function. Synonymy and polysemy, which LSA for instance only hints at, are not only explicitly defined in content mapping but also extended by relations vital to machine understanding, such as generalization and identical meaning.
Let’s see an example. The figure below places the term “it’s 5 o’clock” in a content map. Colored connections are based on the votes of people participating in the mapping process. Outlined white arrows represent generated connections. Similarly, bubbles with solid outlines are actual pieces of content (terms), while those with dotted outlines are generated.
In the map, different relations contribute to language understanding in different ways.
- Generalizations help conceptualize terms.
- Responses indicate contextual relationships between terms by connecting effects to their causes or answers to questions.
- Identical meaning creates pathways from terms with few connections to other relations, such as generalizations or responses.
- Abstractions extract structural similarity from terms to be used later by pattern recognition.
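
The four relation types above could be held in a structure along the following lines. This is a rough sketch and not the actual content mapping implementation: the `ContentMap` class, its method names, the vote weights and every term other than “it’s 5 o’clock” are assumptions made for illustration. It shows the pathway effect of identical meaning: a term with no generalization of its own reaches one through an equivalent term.

```python
from collections import defaultdict

RELATIONS = ("generalization", "response", "identical_meaning", "abstraction")


class ContentMap:
    """Toy content map: typed, directed links between terms, weighted by votes."""

    def __init__(self):
        # (source, relation, target) -> net votes cast by participants
        self.votes = defaultdict(int)

    def vote(self, source, relation, target, weight=1):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.votes[(source, relation, target)] += weight

    def targets(self, source, relation, min_votes=1):
        """Terms linked directly from `source` by `relation` with enough votes."""
        return {
            t for (s, r, t), v in self.votes.items()
            if s == source and r == relation and v >= min_votes
        }

    def same_meaning(self, term):
        """Collect the identical-meaning class of `term` (symmetric, transitive)."""
        seen, queue = {term}, [term]
        while queue:
            current = queue.pop()
            for (s, r, t), v in self.votes.items():
                if r != "identical_meaning" or v < 1:
                    continue
                if s == current and t not in seen:
                    seen.add(t)
                    queue.append(t)
                elif t == current and s not in seen:
                    seen.add(s)
                    queue.append(s)
        return seen

    def generalizations(self, term):
        """Direct generalizations plus those reached through identical meaning."""
        found = set()
        for equivalent in self.same_meaning(term):
            found |= self.targets(equivalent, "generalization")
        return found


# Hypothetical votes; only "it's 5 o'clock" comes from the example above.
cmap = ContentMap()
cmap.vote("it's 5 o'clock", "generalization", "a time of day", weight=3)
cmap.vote("it's five o'clock", "identical_meaning", "it's 5 o'clock", weight=2)
cmap.vote("it's 5 o'clock", "response", "what time is it?")  # answer -> question
print(cmap.generalizations("it's five o'clock"))
```

Keeping the relation type in the key is what lets one term sit in several classes at once, one per relation.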
It takes two
Even though content mapping gives us richer and more reliable information about term relationships, it would take a lot of guesswork before terms that are actually related got connected. To reduce the number of unnecessary passes, LSA could supply connections with a higher-than-normal error rate as clues for the content mapping process to follow up.
A composite solution that unites the two could point in a direction where a language-independent structural and semantic understanding of text finally comes within our reach.
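
One way to picture the hand-off is sketched below. Note that it does not implement LSA: a plain word-count cosine stands in for the LSA score so the example stays self-contained, and any scorer with the same signature could be swapped in. The function names, the threshold value and the sample terms are all assumptions. A deliberately permissive threshold lets weak, error-prone pairs through, since voters rather than the scorer decide what kind of relation, if any, each pair has.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two word-count vectors (a stand-in for an LSA score)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def propose_candidates(terms, score=cosine, threshold=0.25):
    """Return (score, term_a, term_b) tuples worth showing to voters, best first."""
    scored = ((score(a, b), a, b) for a, b in combinations(terms, 2))
    return sorted((c for c in scored if c[0] >= threshold), reverse=True)


# Hypothetical terms; the candidates are untyped clues, not finished relations.
terms = ["it's 5 o'clock", "it's five o'clock", "what time is it?", "time for tea"]
for s, a, b in propose_candidates(terms):
    print(f"{s:.2f}  {a!r} <-> {b!r}")
```

Raising the threshold shortens the voting queue at the cost of missing weaker but genuine connections.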