Posts Tagged ‘content mapping’

Database Options for Content Mapping

February 5, 2010

While writing posts on the relation between content mapping and semantic web-related topics, I’m also working out the technical background for a specific content mapping solution.

Organic network of content

Content mapping is a collective effort to organize content into an organic, “living” network. As opposed to technologies that attempt to understand content by semantic analysis, content mapping facilitates understanding by having humans classify the connections in between. It is built on the presumption that conceptualization in the human brain follows the same pattern, where comprehension manifests through the connections between otherwise meaningless words, sounds, and mental images gathered by experience. Content mapping therefore is not restricted to textual content as NLP is. It’s applicable to images, audio, and video as well.

The purpose of content mapping is to guide users from one piece of content to its most relevant peers. Searching in a content map amounts to looking for the ‘strongest’ paths between a node and the rest of its network.
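
As a minimal sketch (in Python), assuming path strength is the product of edge weights in (0, 1], such a search could rank a node’s peers with a max-product variant of Dijkstra’s algorithm. The function name and the toy data are illustrative only, not something defined in this post.

```python
import heapq
from collections import defaultdict

def strongest_paths(edges, source):
    """Rank every node reachable from `source` by the strength of its
    strongest connecting path, where a path's strength is the product of
    its edge weights in (0, 1]. A max-product variant of Dijkstra."""
    graph = defaultdict(list)
    for a, b, w in edges:                 # undirected, weighted connections
        graph[a].append((b, w))
        graph[b].append((a, w))

    best = {source: 1.0}
    heap = [(-1.0, source)]               # max-heap via negated strengths
    while heap:
        neg_strength, node = heapq.heappop(heap)
        strength = -neg_strength
        if strength < best.get(node, 0.0):
            continue                      # stale heap entry
        for neighbor, weight in graph[node]:
            candidate = strength * weight
            if candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))

    best.pop(source)
    return sorted(best.items(), key=lambda kv: -kv[1])

# Toy map: weights stand in for aggregated connection strengths.
edges = [("cat", "feline", 0.9), ("feline", "mammal", 0.8), ("cat", "dog", 0.4)]
print(strongest_paths(edges, "cat"))      # feline, then mammal, then dog
```

The multiplicative rule is just one choice; any aggregation that decays with path length would fit the same search skeleton.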

Approach & architecture

The technical essence of content mapping is the way we store and manage connections. From a design perspective, I see three approaches.

  • Graph: Querying neighbors at arbitrary depth while aggregating connection properties along paths. Limited in the sense that it only works on networks where paths have no more than a fixed number of edges, e.g. question-answer pairs.
  • Recursive: Crawling all paths in a node’s network while calculating and sorting weights. Resource-hungry due to recursion. Aggregated weights have to be stored until the result is returned, and cached until an affected connection changes.
  • Indexing: Tracking paths as implicit connections on the fly. All implicit connections have to be stored separately to make sure they’re quickly retrievable (a sketch of this approach follows the list).
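
To make the indexing approach more concrete, here is a minimal sketch of how implicit connections could be materialized at write time so that a lookup never has to walk paths. The two-hop derivation, the product-of-weights rule and the `ConnectionIndex` name are illustrative assumptions, not part of the post.

```python
from collections import defaultdict

class ConnectionIndex:
    """Sketch of the indexing approach: whenever an explicit connection is
    added, derive the implicit connections it creates and store them too,
    so a lookup never has to walk paths at query time."""

    def __init__(self):
        self.explicit = defaultdict(dict)   # node -> {peer: weight}
        self.implicit = defaultdict(dict)

    def connect(self, a, b, weight):
        self.explicit[a][b] = weight
        self.explicit[b][a] = weight
        # Materialize two-hop paths through the new edge as implicit links.
        # (A full implementation would also propagate longer chains.)
        for x, wx in self.explicit[a].items():
            if x not in (a, b):
                self._add_implicit(x, b, wx * weight)
        for y, wy in self.explicit[b].items():
            if y not in (a, b):
                self._add_implicit(a, y, weight * wy)

    def _add_implicit(self, a, b, weight):
        # Keep only the strongest derived connection between a pair.
        if weight > self.implicit[a].get(b, 0.0):
            self.implicit[a][b] = weight
            self.implicit[b][a] = weight

    def neighbors(self, node):
        merged = dict(self.implicit[node])
        merged.update(self.explicit[node])  # explicit links take precedence
        return sorted(merged.items(), key=lambda kv: -kv[1])

index = ConnectionIndex()
index.connect("question", "answer", 0.9)
index.connect("answer", "follow-up", 0.6)
print(index.neighbors("question"))  # explicit 'answer' plus derived 'follow-up'
```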

When deciding on an architecture upon which to implement the solution, three choices come to mind.

  • Relational: Traditional RDBMS, mature and familiar. The richness of SQL and its data integrity guarantees are highly valuable for most web applications, but such advantages often come at the price of costly joins, tedious optimizations and poor scalability.
  • Graph: Fits applications dealing with networks. Despite the structural resemblance with content maps, this genre of databases – being relatively young – lacks certain features necessary for a content mapping solution, such as aggregation along paths.
  • Distributed: Scalability and performance are given the highest priority. Consequently, access to resources is constrained, and features common in relational databases, such as references, joins, or transactions, are limited or missing entirely.

The following table summarizes the key characteristics of each of the nine approach-architecture combinations.

              | Graph                            | Recursive                          | Indexing
Relational    | Costly self-joins in fixed depth | Complex, caching required          | Writing is not scalable
Graph         | No aggregation along paths       | Graph architecture not exploitable | Implicit connection as separate edge type
Distributed   | Lacks joins, same as recursive   | Limited access to resources        | Needs concurrency management

Finalists

The table above shows that most options have at least one showstopper: complexity, a lack of features or scalability, costly operations, or an unfitting architecture.

Only two of them seem to satisfy the purpose of content mapping as described in the first section: the graph and distributed implementations of the indexing approach.

  • Graph (indexing): Even though it’s not the graph approach we’re talking about, this combination exploits the advantages of a graph database to their full extent. By storing implicit connections as separate edges, there’s no need to query paths deeper than one neighbor.
  • Distributed (indexing): In a distributed database there are no constraints or triggers, which demands more attention to concurrency management. The graph structure is not supported natively, but scalability and performance make up for it.

Ontologies in Content Mapping

January 28, 2010

An ontology, in computer science, is a formal representation of concepts within a domain. With the emergence of the semantic web, ontologies will take on the role of the anchor to which all content can and should relate.

Ownership

Behind the unspoken conceptualizations in our heads lie formal ontologies, which, being necessarily man-made, pose a question.

Who has ownership over ontologies?

If their purpose is indeed to serve as the lighthouse on the sea of information, ontologies must be unambiguous, and therefore be defined and maintained by a single entity. Will this single entity be a company, a consortium, a committee, an organization or perhaps a government agency? What guarantees that formal ontologies will follow the changes that may occur in the instance domain?

Collective definition

There is one guarantee: collective ontology management. And I’m not thinking of Wikipedia-style collaboration, but a real collective effort where everyone throws in their two cents.

Take a look at this very simple comparison between equivalent fragments of a content map and an ontology. (A content map is based on the collective definition of probabilistic relations between content elements.)

The resemblance is hard to miss. It’s no surprise: both deal with concepts and instances bound into a network through different sorts of relations. But as we take a closer look, it becomes obvious that content mapping is fundamentally different.

  • There’s no distinction between elements such as classes, instances, attributes, et cetera. They’re all content. What constitutes a class from an ontology point of view depends solely on the relation. One element may be instance and class simultaneously.
  • There are fewer, more general types of connections. You can extend an ontology with new relations that specify the way certain elements are connected to each other. Content mapping defines only a few, from which new, implicit ones can be derived via machine learning.
  • Domains don’t have definite borders. It is very likely that elements have connections leading out of a domain, superseding what we call ontology alignment. As an element may be instance and class at the same time, it can also belong to more than one domain. In fact, these are the connections through which cross-ontology relationships emerge.
  • Dynamics are inherently embedded in the system. As content changes, connections follow. Classes are constantly created, updated or deleted through changing generalization connections.

Content mapping creates an organic system where ontologies float on the surface.

Defining ontologies in this environment is no longer necessary; they crystallize through natural progress. We only have to harvest the upper generalization layers to get an understanding of the conceptual connections in any data set. Domains needn’t be defined beforehand either. Instead, we draw their outlines where we deem them fitting.
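
A minimal sketch of what harvesting the upper generalization layers could look like, assuming generalization connections are stored as directed (specific, general) pairs. Grouping elements by generalization depth is one possible interpretation, and the function name is made up.

```python
from collections import defaultdict

def generalization_layers(edges):
    """Group content elements by generalization depth: layer 0 holds pure
    instances (elements that generalize nothing), higher layers hold
    progressively more abstract elements."""
    incoming = defaultdict(set)            # general <- {more specific elements}
    nodes = set()
    for specific, general in edges:
        incoming[general].add(specific)
        nodes.update((specific, general))

    depth = {}
    def depth_of(node, seen=()):
        if node in depth:
            return depth[node]
        if node in seen:                   # guard against vote-created cycles
            return 0
        d = 1 + max((depth_of(s, seen + (node,)) for s in incoming[node]), default=-1)
        depth[node] = d
        return d

    layers = defaultdict(list)
    for node in nodes:
        layers[depth_of(node)].append(node)
    return dict(layers)

# "Is-a" style generalization connections harvested from a content map.
edges = [("tabby", "cat"), ("cat", "mammal"), ("dog", "mammal"), ("mammal", "animal")]
print(generalization_layers(edges))
# layer 0: tabby, dog; layer 1: cat; layer 2: mammal; layer 3: animal
```

The topmost layers of such a grouping are the “crystallized” concepts; the domain outline is simply the cut we choose to make through them.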

Clues to rely on

However flexible content mapping technology may seem in defining and following ontologies, its purpose is to connect previously unconnected content, and therefore it needs clues to follow up on. Prior user input, search indexes, or existing ontologies may provide these clues. Once those clues are there, content mapping simplifies ontology management in several respects.

  • Fewer relations: Only a handful of general relations are explicit, domain specific relations are all derived from those.
  • No need for focused attention: Ontology management requires no supervision as implicit connections change with content.
  • No knowledge of semantics: Connections (both explicit and implicit) can be set or changed without any knowledge of semantics or ontologies.

Stochastic Linguistics and Content Mapping

January 26, 2010

Stochastic linguistics deals with the probabilities of certain patterns occurring in natural language and is therefore very likely to play an important role in future natural language processing (NLP) applications, including the semantic web.

As Graham Coulter-Smith puts it in his essay The Stochastic Revolution in Art and Science:

… the leading edge of web technology developing towards globalized information interflow is almost exclusively based on stochastic technologies.

Stochastic techniques

Stochastic techniques, such as n-gram models and latent semantic analysis (LSA), help us identify and classify patterns in natural language, through which we are able to compare, search and analyze documents in a language-independent manner.

However, they fail to further machine understanding beyond a broad structural analysis. These techniques recognize a very limited set of relationships between terms, which usually narrows down to “identical” or “interchangeable”. Such relations say nothing about the quality of the connection, i.e. whether two similar terms are similar in structure or in meaning.
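
A small illustration of this limitation, using plain numpy and a made-up toy term-document matrix: LSA yields a single similarity score per term pair, and nothing in that score tells us whether the similarity is structural or semantic.

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
terms = ["clock", "o'clock", "time", "tea"]
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 2, 1, 0],
    [0, 1, 0, 2],
], dtype=float)

# Latent semantic analysis: project terms into a low-rank concept space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]            # rank-k term representation

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# LSA yields one similarity score per term pair...
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        score = cosine(term_vectors[i], term_vectors[j])
        print(f"{terms[i]:8s} ~ {terms[j]:8s} {score:+.2f}")
# ...but nothing in that score says whether the pair is related by
# structure, by meaning, or by mere co-occurrence.
```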

This is where a catch-22 begins to unfold. In order to provide relations that reflect the meaning of terms, by which they can be understood, machines running stochastic analyses would have to understand those terms first. The only way to resolve this “catch” is to add human intelligence to the mix, extending the network of terms with the missing semantic links, which is how we arrive at content mapping.

Mapping language

Content mapping is a system that does exactly the above by collectively defining and maintaining a rich set of relations between bits of content, including natural language patterns. Relations create equivalence classes for content elements, where one term may belong to several classes based on its meaning, structure and function. Synonymy and polysemy, which LSA for instance only hints at, are not only explicitly defined in content mapping but also extended by relations vital to machine understanding, like generalization and identical meaning.

Let’s see an example. The figure below places the term “it’s 5 o’clock” in a content map. Colored connections are based on the votes of people participating in the mapping process, while outlined white arrows represent generated connections. Similarly, bubbles with solid outlines are actual pieces of content (terms), while ones with dotted outlines are generated.

In the map, different relations contribute to language understanding in different ways (a data-model sketch follows the list).

  • Generalizations help conceptualize terms.
  • Responses indicate contextual relationships between terms by connecting effects to their causes or answers to questions.
  • Identical meaning creates pathways for terms that have only a few connections of other types, such as generalizations or responses.
  • Abstractions extract structural similarity from terms to be used later by pattern recognition.
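
One possible way to represent such a map is sketched below: explicit, voted connections of the four relation types, plus generated (implicit) ones. The class and field names are hypothetical, only meant to mirror the figure’s distinction between voted and generated connections.

```python
from dataclasses import dataclass, field

RELATION_TYPES = {"generalization", "response", "identical_meaning", "abstraction"}

@dataclass
class Connection:
    source: str
    target: str
    relation: str             # one of RELATION_TYPES
    votes: int = 0            # human votes behind an explicit connection
    generated: bool = False   # True for machine-derived (implicit) connections

@dataclass
class ContentMap:
    connections: list = field(default_factory=list)

    def vote(self, source, target, relation):
        """Register one human vote for an explicit connection."""
        assert relation in RELATION_TYPES
        for c in self.connections:
            if (c.source, c.target, c.relation) == (source, target, relation) and not c.generated:
                c.votes += 1
                return c
        c = Connection(source, target, relation, votes=1)
        self.connections.append(c)
        return c

    def derive(self, source, target, relation):
        """Record a generated connection, e.g. one inferred from an abstraction."""
        c = Connection(source, target, relation, generated=True)
        self.connections.append(c)
        return c

cmap = ContentMap()
cmap.vote("it's 5 o'clock", "statement about time", "generalization")
cmap.vote("it's 5 o'clock", "time for tea", "response")
cmap.derive("it's X o'clock", "statement about time", "generalization")
```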

It takes two

Even though we obtain richer and more reliable information about term relationships through content mapping, it would take a lot of guesswork before actually related terms got connected. To reduce the number of unnecessary passes, LSA could provide candidate connections, even at a higher-than-normal error rate, as clues for the content mapping process to follow up on.

A composite solution that unites the two could point in a direction where a language-independent structural and semantic understanding of text finally comes within our reach.

Entropy and the Future of the Web

January 19, 2010

Inspired by this post by Chris Dixon, I summarized my thoughts on the future of the web in a single tweet:

The fundamental question that will shape the future of the web is how we deal with entropy.

Options

The disorder of the web thrives in both content and the connections between content. Today’s approach to the web of tomorrow depends on how we address this issue. The figure below shows what we can expect as a result of different combinations of low and high entropy in the two layers.

Entropy in the content layer reflects the degree of internal disorder. If we choose to lower content entropy through the addition of relevant metadata or structure, we’ll realize the semantic web. If we don’t, content will remain unorganized and we’ll end up with the noisy web.

Entropy in the connection layer expresses disorder in the network of content. By defining meaningful relations between content elements, we decrease connection entropy, leading to the synaptic web. Should we leave connections in their ad-hoc state, we’ll arrive at the unorganized web.

The study of web entropy becomes interesting when we take a look at the intersections of these domains.

  • Semantic – synaptic: The most organized, ideal form of the web. Content and connections are thoroughly described, transparent and machine readable. Example: linked data.
  • Semantic – unorganized: Semantic content loosely connected throughout the web. Most blog posts have a valid semantic document structure; however, they’re connected by hyperlinks that say nothing about their relation (say, whether a blog entry extends, reflects on or debates the linked one).
  • Noisy – synaptic: Organizes high-entropy content by connecting relevant elements via meaningful relations. Among others, tagging, filtering, recommendation engines and content mapping fall into this domain.
  • Noisy – unorganized: Sparse network of unstructured content. This is the domain we’ve known for a decade and a half, where keyword-based indexing and search still dominate the web. If the web continues to develop in this direction, technologies such as linguistic parsing and topic identification will definitely come into play.

Which one?

The question is obvious: which domain represents the optimal course to take? Based on the domains’ descriptions, semantic – synaptic seems to be the clear choice. But we’re discussing entropy here, and from thermodynamics we know that entropy grows in systems prone to spontaneous change, and that order is restored only at the cost of energy and effort.

Ultimately, the question comes down to this: are we going to fight entropy or not?

Bringing the semantic web into existence is an enormous task. To me, fighting people’s reluctance to adopt metadata and semantic formats seems unimaginable. The synaptic web looks more feasible, as the spread of social media already indicates. But in the end, what matters is which domain or combination of domains will be popular among early adopters. The rest will follow.

Advertising with Content Mapping

January 14, 2010

Advertising is the number one option for monetizing content on the web. Even with the advent of the real-time web, until a viable model of in-stream advertising is conceived, search engines and their means of online marketing, such as SEO, AdWords and AdSense, remain dominant. As in-stream and other real-time web marketing models mature and become relevant and non-intrusive, search engines will have to undergo fundamental change to keep a significant segment of the market.

Keywords are bad

In order to induce fundamental change, we first have to identify the fundamental flaw. At the dawn of the World Wide Web, the paradigm of search was borrowed from text documents, where a certain paragraph is easily spotted by looking up a few words we presume to be in it.

With more extensive content and less prior knowledge about it, this paradigm became harder and harder to apply. However, in our efforts to rank content by its structure, context and user preferences, we kept keywords all along. Moreover, keywords today fuel an entire industry of online advertising, recklessly overlooking the distortion they add between content and the user’s specific preferences.

To make search and search-based ads as relevant and as non-intrusive as the real-time web has to offer, keywords must be forgotten once and for all.

Content mapping

Content mapping connects units of content directly, by user interaction, via a rich, irreducible set of relations. Between your content and others there may be similarities, equivalences, references and other sorts of relations of various significance and strength (relevance). Content that is relevant to yours makes up its ideal context. The first n results a search engine returns are, on the other hand, the actual context.

The distance between the ideal and actual context marks the accuracy of a search engine.

Now, when you’re searching with Google, you’re basically trying to define the ideal context for the content you need. Imagine just how clumsy and inefficient it is to do that through a couple of keywords.

Using a content mapping engine, you type in a piece of content, not a context. That content (or one that’s semantically identical) is probably already placed and centered in its ideal context. You’ll receive the elements inside it as results, in decreasing order of relevance.
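
One way to make the distance between ideal and actual context concrete is an nDCG-style measure between the relevance-weighted ideal context and the ranked result list. The post doesn’t define the metric, so the formula below is only an assumption for illustration.

```python
import math

def context_distance(ideal, actual):
    """Distance between a content element's ideal context (peers weighted by
    relevance, as the content map defines them) and its actual context (the
    ranked result list a search engine returns). 0.0 means the results are
    the most relevant peers in the best possible order; 1.0 means no overlap."""
    # Discounted gain of the actual ranking (DCG-style).
    gained = sum(ideal.get(item, 0.0) / math.log2(rank + 2)
                 for rank, item in enumerate(actual))
    # Best achievable gain: the ideal peers listed in decreasing relevance.
    best = sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(sorted(ideal.values(), reverse=True)[:len(actual)]))
    return 1.0 - (gained / best if best else 0.0)

# Ideal context of a piece of content, weighted by its map connections.
ideal = {"related article": 0.9, "follow-up post": 0.7, "same-topic video": 0.5}

print(context_distance(ideal, ["related article", "follow-up post", "same-topic video"]))  # 0.0
print(context_distance(ideal, ["keyword-matched page", "related article", "unrelated ad"]))  # much larger
```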

SEO

Search engine optimization is an attempt to match the actual context to the ideal. Inevitably, when you tune your webpage for certain keywords, you’re guaranteed to bolt it into the wrong context.

With content mapping, there’s no need for SEO. Not in a fair-use scenario anyway; on the other hand, the well-known SEO exploits (black hat, article spinning, keyword stuffing) obviously won’t work either. If the actual context of your content changes, it will re-position itself automatically to a new context that approximates the ideal as closely as possible.

Shooting in the dark

When it comes to online marketing, SEO is just one of your options. Ranking algorithms may change, and your content easily gets ripped out of the context you worked so hard to match. So, you turn to a different, somewhat more reliable marketing tool, AdWords for example.

What happens from then on is again viewed through the smudgy glass of keywords. First, you take a wild guess at what keywords will best match your ideal context, bid for them and see what happens. If the conversion rates are not satisfactory, repeat the process until you get the best achievable results.

Assuming your campaign was successful, along the way you’ve probably

  • lost a lot of time tweaking
  • lost potential customers / deals
  • paid for the wrong keywords
  • taken an exam or hired a consultant
  • and ended up in the wrong context anyway

In a content mapping environment, however, you land at the center of your ideal context, with no tweaking and no time or money lost.

What’s the catch?

I’ve hinted in the definition that content mapping relies on user input. In fact, it relies on almost nothing but that. I admit that building and maintaining the connection index takes a huge collective effort, but I’m convinced of its feasibility.

We only have to make sure it

  • Provides frictionless tools for contribution: When the entire index has to be collected from the network, it’s vital that the process not demand more time and attention from contributors than necessary.
  • Treats harmful activity as noise: Random noise is natural in content mapping. Useful information within the system – however small its share – is expected to be coherent and thus extractable. In order to suppress useful information, a successful attack would have to insert harmful information of at least equal coherence. Input-gathering tools within the system must be designed with that in mind (a filtering sketch follows this list).
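
A minimal sketch of that filtering idea, assuming votes arrive as pairs of content elements: uncoordinated noise scatters across many distinct pairs and stays near the average, while coherent contributions pile up on the same pairs and clear the threshold. The thresholds and the function name are illustrative.

```python
from collections import Counter

def extract_coherent(votes, min_support=3, dominance=2.0):
    """Keep only connections whose vote count is both above a minimum support
    and clearly above the average vote count per connection (the noise floor).
    Random, uncoordinated votes scatter across many pairs and stay below the
    floor; coherent contributions pile up on the same pairs."""
    counts = Counter(votes)
    noise_floor = len(votes) / len(counts)      # average votes per distinct pair
    return {pair: n for pair, n in counts.items()
            if n >= min_support and n >= dominance * noise_floor}

votes = (
    [("it's 5 o'clock", "time for tea")] * 6    # coherent signal
    + [("it's 5 o'clock", "banana"),            # scattered noise / spam
       ("tea", "spam link"),
       ("clock", "buy pills")]
)
print(extract_coherent(votes))                  # only the coherent pair survives
```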

Regardless of how cleverly we gather information from the network, latency remains an integral property of content mapping. The changing actual context needs time to catch up to the ideal, depending on the size of the network: the bigger the network, the faster the response. At the start of a campaign, one must be clear about the delay before the content gets centered in its context.

Unfortunately, neither of the concerns above is comparable to building the network itself in terms of size and difficulty. The steps through which that can be achieved, however, are yet to be defined.

Updates

  • Click-through rates: It’s sort of self-explanatory, but it may be necessary to emphasize the following: when a piece of content is centered in its ideal context, it will yield the highest click-through rates when placed on a blog or website in AdSense fashion.
  • Similar solutions: MyLikes has implemented a system in which advertisers may reach a higher click-through rate by placing their ads next to (or embedded into) relevant content produced by trusted “influencers”.
  • In-stream solutions: Take a look at this list of Twitter-based marketing tools on oneforty. They may not all be in-stream, but they can give you a general idea of advertising in the real-time web.