
Database Options for Content Mapping

February 5, 2010


While writing posts on the relation between content mapping and semantic web-related topics, I’m also working out the technical background for a specific content mapping solution.

Organic network of content

Content mapping is a collective effort to organize content into an organic, “living” network. As opposed to technologies that attempt to understand content by semantic analysis, content mapping facilitates understanding by having humans classify the connections between pieces of content. It is built on the presumption that conceptualization in the human brain follows the same pattern: comprehension manifests through the connections between otherwise meaningless words, sounds, and mental images gathered by experience. Content mapping is therefore not restricted to textual content the way NLP is; it is applicable to images, audio, and video as well.

The purpose of content mapping is to guide from one piece of content to its most relevant peers. Searching a content map amounts to finding the ‘strongest’ paths between a node and the rest of its network.

Approach & architecture

The technical essence of content mapping is the way we store and manage connections. From a design perspective, I see three approaches.

  • Graph: Querying neighbors at arbitrary depth while aggregating connection properties along paths. Limited in the sense that it only works on networks with no more than a fixed number of edges per path, e.g. question-answer pairs.
  • Recursive: Crawling all paths in a node’s network while calculating and sorting weights. Resource-hungry due to recursion: aggregated weights have to be stored until the result is returned, and cached until an affected connection changes.
  • Indexing: Tracking paths as implicit connections on the fly. All implicit connections have to be stored separately to make sure they’re quickly retrievable.
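As a rough illustration of the indexing approach, here is a minimal Python sketch (the class and names are my own, not part of any actual implementation) that stores one level of implicit connections as regular weighted edges the moment an explicit connection is inserted, so querying never has to look deeper than one neighbor:

```python
from collections import defaultdict

class ContentMap:
    """Sketch of the 'indexing' approach: when an explicit connection is
    added, implicit connections (depth-2 paths through the new edge) are
    computed on the fly and stored as ordinary weighted edges."""

    def __init__(self):
        # edges[a][b] -> weight; explicit and implicit edges live together
        self.edges = defaultdict(dict)

    def connect(self, a, b, weight=1.0):
        # snapshot neighbor lists before mutating anything
        neighbors_a = list(self.edges[a].items())
        neighbors_b = list(self.edges[b].items())
        self._add(a, b, weight)
        # index implicit connections created by the new edge,
        # aggregating weight along the path by multiplication
        for c, w in neighbors_b:
            if c != a:
                self._add(a, c, weight * w)
        for c, w in neighbors_a:
            if c != b:
                self._add(b, c, weight * w)

    def _add(self, a, b, weight):
        # keep the strongest path seen so far, symmetrically
        self.edges[a][b] = max(self.edges[a].get(b, 0.0), weight)
        self.edges[b][a] = max(self.edges[b].get(a, 0.0), weight)

    def strongest(self, node, n=3):
        # querying is now a single-neighbor lookup, sorted by weight
        return sorted(self.edges[node].items(), key=lambda kv: -kv[1])[:n]
```

The trade-off is exactly the one noted above: every implicit connection occupies storage of its own, in exchange for cheap reads.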

When deciding on an architecture upon which to implement the solution, three choices come to mind.

  • Relational: Traditional RDBMS, mature and familiar. The richness of SQL and data integrity is highly valuable for most web applications, but such advantages often come at the price of costly joins, tedious optimizations and poor scalability.
  • Graph: Fits applications dealing with networks. Despite the structural resemblance with content maps, this genre of databases – being relatively young – lacks certain features necessary for a content mapping solution, such as aggregation along paths.
  • Distributed: Scalability and performance are given highest priority. Consequently, access to resources, and features common in relational databases such as references, joins, or transactions are limited or completely missing.

The following table summarizes the key characteristics of each of the nine approach-architecture combinations.

             | Graph                            | Recursive                          | Indexing
Relational   | Costly self-joins in fixed depth | Complex, caching required          | Writing is not scalable
Graph        | No aggregation along paths       | Graph architecture not exploitable | Implicit connection as separate edge type
Distributed  | Lacks joins, same as recursive   | Limited access to resources        | Needs concurrency management


The table above shows that most options have at least one showstopper: complexity, missing features, poor scalability, costly operations, or an unfitting architecture.

Only two of them seem to satisfy the purpose of content mapping as described in the first section: the graph and distributed implementations of the indexing approach.

  • Indexing on a graph database: even though it uses the indexing approach rather than the graph approach, this combination exploits the advantages of a graph database to the fullest. With implicit connections stored as separate edges, no query needs to look deeper than one neighbor.
  • Indexing on a distributed database: with no constraints or triggers available, concurrency management demands extra attention. The graph structure is not supported natively, but scalability and performance make up for it.

Querying randomly from uneven distributions

December 20, 2009


Not long ago I was looking for a way to select random elements from an unevenly distributed set.

Solutions out there

First thing, I googled every combination of keywords that made sense. Yet all I found on the subject was a single forum thread, which came up with the following.

  • Increase row ID by element weight on each element insert.
  • Query a random element by generating a random ID between the first and the latest, returning the row with the nearest lower ID.
  • Update a weight by rebuilding the table from the affected row onward.
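The scheme above can be sketched as follows; `NaiveWeightedSet` and its bookkeeping are my own stand-in for the forum thread’s table, with cumulative weights playing the role of row IDs:

```python
import bisect
import random

class NaiveWeightedSet:
    """Sketch of the forum solution: each element's row ID is the running
    total of weights, so a random ID in [0, total) lands in an element's
    slot with probability proportional to its weight. Changing a weight
    forces a rebuild of every ID from the affected row onward."""

    def __init__(self):
        self.ids = []       # cumulative weight boundary of each row
        self.elements = []

    def insert(self, element, weight):
        total = self.ids[-1] if self.ids else 0
        self.ids.append(total + weight)
        self.elements.append(element)

    def query(self):
        # random ID between zero and the latest; the nearest boundary wins
        r = random.uniform(0, self.ids[-1])
        return self.elements[bisect.bisect_right(self.ids, r)]

    def update(self, index, weight):
        # the costly part: shift every cumulative ID after the change
        prev = self.ids[index - 1] if index else 0
        delta = (prev + weight) - self.ids[index]
        for i in range(index, len(self.ids)):
            self.ids[i] += delta
```

The O(n) loop in `update` is exactly what makes this approach unusable at high element counts.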

Needless to say, this solution scales poorly at high element counts and is outright unacceptable in real-time web applications. So I was forced to put together a solution of my own.

Scraping the surface

The first idea that popped into my mind was an SQL approach with two tables: ‘Elements’ and ‘Multiplicity’. Into ‘Elements’ go all the properties of the actual elements of the distribution, one row for each. In ‘Multiplicity’, rows are inserted or deleted each time an element’s weight is changed, one row per unit of weight. A query is then as simple as this (MySQL):
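The original snippet did not survive the archiving, but based on the description a reconstruction might look like this (the `id` key on Elements and the `element_id` reference in Multiplicity are assumptions):

```sql
-- each element appears in Multiplicity once per unit of weight,
-- so a uniform random pick is automatically weight-proportional
SELECT Elements.*
FROM Multiplicity
JOIN Elements ON Elements.id = Multiplicity.element_id
ORDER BY RAND()
LIMIT 1;
```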


(The join is there only to show that it’s the information in Elements that carries value for us. I could just as well select from Multiplicity and then look up the corresponding element by key. Also, I looked past the inefficiency of ORDER BY RAND() for the sake of compactness.)

While the above does work, there are still problems. For one, only integer weights are supported. But I could live with that. The real problem is that every single unit of weight becomes another row in Multiplicity. Imagine just 1 million rows in Elements, each with an average weight of 1000: I’ve just added three orders of magnitude to my storage needs. No doubt, this concept also fails at large scale, albeit for a different reason.

One that works

So, what’s the solution that’s fast and doesn’t consume the entire Internet? Right after the SQL approach I started experimenting with something similar to the forum solution (inserting and querying being the same as there), but I already knew that updating the weight was not an option. And that was the key: I cannot update the weight in a row, but I can update the element properties.

So what I do is insert the same element again with the new weight, and update the original row by removing all element properties, leaving only the weight. This creates an empty but weighted row: a placeholder for a future element of the same weight. Before inserting an element, I first check for a matching placeholder and insert a new row only if there is none; otherwise I just fill the placeholder. This way the IDs and weights remain consistent, and the method is applicable to real-time applications even with huge datasets.

There are only two minor problems. To keep the number of placeholders low, only a predefined set of weights may be used: powers of two, Fibonacci numbers, et cetera. Secondly, placeholders are counted in the querying process too, distorting the relative weights of real elements. But this is easily resolved by re-running the query until it returns a real element.
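Putting the pieces together, a minimal sketch of the placeholder scheme might look like this; the names are my own, and the in-memory cumulative-ID layout stands in for the actual database table:

```python
import bisect
import random

class PlaceholderWeightedSet:
    """Placeholder scheme: cumulative IDs are never rewritten. A weight
    change empties the old row (keeping its weight) and re-inserts the
    element, reusing a freed row of equal weight when one exists."""

    def __init__(self):
        self.ids = []    # cumulative weight boundary of each row
        self.rows = []   # element payload, or None for a placeholder
        self.free = {}   # weight -> indexes of placeholder rows

    def _row_weight(self, i):
        return self.ids[i] - (self.ids[i - 1] if i else 0)

    def insert(self, element, weight):
        # reuse a placeholder of the same weight if one exists
        if self.free.get(weight):
            self.rows[self.free[weight].pop()] = element
            return
        total = self.ids[-1] if self.ids else 0
        self.ids.append(total + weight)
        self.rows.append(element)

    def update_weight(self, index, new_weight):
        # empty the old row, then re-insert the element with its new weight
        element = self.rows[index]
        self.rows[index] = None
        self.free.setdefault(self._row_weight(index), []).append(index)
        self.insert(element, new_weight)

    def query(self):
        # placeholders distort the draw but are never returned:
        # just re-run until a real element is hit
        while True:
            r = random.uniform(0, self.ids[-1])
            row = self.rows[bisect.bisect_right(self.ids, r)]
            if row is not None:
                return row
```

Both inserts and weight updates are constant-time here, which is what makes the method viable for real-time use.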