Delta Transactions

March 1, 2010


It’s somewhat ironic that I’m starting my second database-programming-related post with the same “I was looking for a certain solution but I couldn’t find any” passage. The difference is that this time the problem is more general than querying a weighted list: how to handle complex changes in a distributed database.

In the RDBMS world, a single transaction lets you do as many operations as you like, on as many tables or rows as you like. Without proper design and forward thinking, though, this freedom more often than not leads to hogging a significant portion of your resources.

Distributed databases can’t let this happen, so they place certain limitations on transactions. In the Google App Engine (GAE) datastore, for example, a transaction can only operate on entities that share a common ancestor (entities, as rows are called in GAE, may have parent-child relationships, and all entities on the same tree, an entity group, are stored on the same server). Bad design will still lead to problems (e.g. storing all your data on the same tree), but this way it’s your app that gets screwed, not the database.

The lazy way

While this is generally good news, you’re faced with a new problem: handling changes that depend on the outcome of others. Here’s an example: you’re adding a new entry to the table (model in GAE) “grades” in a school database. You have to keep the corresponding row in the table “averages”, identified by student, subject and semester, in sync. The lazy way of doing this would be to perform the following steps in one transaction:

  1. Add row to grades
  2. Calculate new average
  3. Update average
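To make this concrete, here is a minimal sketch of the lazy approach against the old google.appengine.ext.db API. The Grade and Average models, their properties and the key scheme are assumptions made up for illustration.

    from google.appengine.ext import db

    # Hypothetical models -- names, properties and the key scheme are assumptions.
    class Grade(db.Model):
        value = db.IntegerProperty()

    class Average(db.Model):                  # keyed by "student|subject|semester"
        total = db.IntegerProperty(default=0)
        count = db.IntegerProperty(default=0)

    def add_grade_lazy(student, subject, semester, value):
        key_name = '|'.join([student, subject, semester])
        avg_key = db.Key.from_path('Average', key_name)

        def txn():
            avg = Average.get(avg_key)
            if avg is None:
                avg = Average(key_name=key_name)
            # 1. Add row to grades -- parented under the average, because a
            #    single transaction can only span one entity group.
            Grade(parent=avg_key, value=value).put()
            # 2. Calculate the new average (kept here as a running sum and count).
            avg.total += value
            avg.count += 1
            # 3. Update the average.
            avg.put()

        db.run_in_transaction(txn)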

‘By’ instead of ‘to’

The catch with the lazy way is that all three steps run in a single, longer transaction spanning multiple entities, which in GAE also forces the grade and its average into the same entity group. There is a solution, however, with two transactions, or more in more complex cases. In the last step of the above example, instead of changing the average (or rather the sum) to a certain value, we change it by a certain value in a separate, “delta” transaction. The figure below demonstrates the difference between the two: at the top you see the timeline of two conventional transactions, and below, their overlapping delta equivalents.
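Reusing the hypothetical Grade and Average models from the sketch above, a delta version might look like this: the grade is written on its own, and the stored sum is then changed by the new value in a separate, short transaction.

    def add_grade_delta(student, subject, semester, value):
        # (Grade, Average and db come from the earlier sketch.)
        # Transaction 1: insert the grade on its own; no shared entity group needed.
        Grade(value=value).put()

        # Transaction 2: change the stored sum *by* the new value, not *to* a
        # recalculated absolute value, so concurrent deltas compose safely.
        key_name = '|'.join([student, subject, semester])

        def apply_delta():
            avg = Average.get_by_key_name(key_name)
            if avg is None:
                avg = Average(key_name=key_name)
            avg.total += value
            avg.count += 1
            avg.put()

        db.run_in_transaction(apply_delta)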

Advantages of delta transactions:

  • Isolated changes remain consistent.
  • You can still roll back a batch of changes from code, since delta transactions don’t care about subsequent changes as long as those are reversible and commutative.
  • Using more, but shorter, transactions (often operating on a single entity) lowers the chance of concurrent modification.
  • Even when concurrent modification does occur, fewer retries are enough to avoid failure.

The obvious constraint to delta transactions is that they are only applicable to numeric values.

Support?

Considering how often this is needed, I was expecting some support for this kind of transaction in NoSQL-based platforms. GAE features a method called get_or_insert(), which is similar in the sense that it wraps a transaction that inserts the specified row (entity) before returning it, in case it doesn’t exist. But there could just as well be a method delta_or_insert() that either inserts the specified row (entity) with the arguments as initial values if it doesn’t exist, or updates it with the arguments added, multiplied, etc. to the currently stored values.
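As a sketch of what such a helper could look like on top of the existing API (delta_or_insert() is not part of GAE; its name and semantics are the hypothetical ones described above, limited here to addition):

    from google.appengine.ext import db

    def delta_or_insert(model_class, key_name, **deltas):
        """Hypothetical helper: atomically add numeric deltas to an entity's
        properties, creating the entity with the deltas as initial values if
        it doesn't exist yet."""
        def txn():
            entity = model_class.get_by_key_name(key_name)
            if entity is None:
                entity = model_class(key_name=key_name, **deltas)
            else:
                for prop, delta in deltas.items():
                    setattr(entity, prop, (getattr(entity, prop) or 0) + delta)
            entity.put()
            return entity
        return db.run_in_transaction(txn)

With the models from the earlier sketch, the second half of add_grade_delta() would collapse into a single call: delta_or_insert(Average, key_name, total=value, count=1).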

Moreover, there could be support for rollback too, using lists of delta transactions, and the possibility to evaluate expressions only when they are needed would be a great addition as well. Features like these, which simplify transactions while increasing application stability and data integrity, would, I think, be much appreciated by many developers new to these recent platforms.


Collaboration versus Collectivity

February 9, 2010


In previous posts, especially in those related to content mapping, I frequently referred to collective actions and efforts in describing certain concepts, but never elaborated on the exact meaning of these terms. One could think that collectivity and collaboration are identical (they often are mentioned in the same context) as both have something to do with individuals working together. In fact, I find it important to highlight their differences for I expect collectivity to play as vital a role in Web 3.0 as collaboration did in Web 2.0.

Transition

Judging from popular Web 2.0 applications such as Wikipedia, Google Docs, or WordPress, we can define collaboration as sharing the workload within a group of individuals who engage in a complex task, work towards a common goal in a managed fashion, and are conscious of the process’s details all the way through.

As the number of participants grows, however, it becomes apparent that collaboration is not scalable beyond a certain level while remaining faithful to the definition outlined above. Although there is such a thing as large-scale collaboration, what it covers is lots of people having the opportunity to contribute but only a few actually doing so. Mass collaboration goes further by blurring the definition of collaboration so much that it practically becomes just another expression for collectivity.

And when I speak of collectivity, I think of a crowd performing a simple, uncoordinated task where participants don’t have to be aware of their involvement in the process while contributing. The outcome of a collective action is merely a statistical aggregation of individual results.

Different realms

Collaboration and collectivity operate in different realms. Collaboration can be thought of as an incremental process (linear) while collectivity is more similar to voting (parallel). On the figure below, arrows represent the timeline of sub-tasks performed by participants.

Suppose one such sub-task were the creation or modification of a Wikipedia entry. In this case collaboration proves more effective, as it offers a higher chance of eliminating factual errors along the way, while a collective approach would preserve the errors of each individual version (and, at best, offer the one with the fewest). The semantic complexity of a document does not fit the more or less hit-and-miss approach of collectivity.

However, if we decrease the complexity of the content, say, to a single sentence, individual solutions can be expected to be just as ‘good’ as the products of collaboration. Collective approaches therefore suit low-complexity content better.

The synaptic web

What content is of lower complexity than the connections within a content network? Different relations such as identity, generalization, abstraction, response or ‘part-of’ require no more than a yes-no answer. Collectivity is cut out for exactly this kind of task.

As the creators of the synaptic web concept put it,

With the advent of the real-time web, however, increasingly effective publishing, sharing and engagement tools are making it easier to find connections between nodes in near-real time by observing human gestures at scale, rather than relying on machine classification.

Hence the synaptic web calls for collectivity. What we need now is more applications that make use of it.

Updates

  • Just one day before my post, @wikinihiltres posted an article comparing the efficiency of collective and collaborative approaches to content production through the examples of Wikipedia and Wikinews, concluding that “the balance that ought to be sought is one that continues to accept the powerful aggregative influence, but that greatly promotes collaboration where possible, since collaboration most reliably produces good results”.

Database Options for Content Mapping

February 5, 2010


While writing posts on the relation between content mapping and semantic web-related topics, I’m also working out the technical background for a specific content mapping solution.

Organic network of content

Content mapping is a collective effort to organize content into an organic, “living” network. As opposed to technologies that attempt to understand content by semantic analysis, content mapping facilitates understanding by having humans classify the connections between pieces of content. It is built on the presumption that conceptualization in the human brain follows the same pattern: comprehension manifests through the connections between otherwise meaningless words, sounds, and mental images gathered by experience. Content mapping is therefore not restricted to textual content the way NLP is; it’s applicable to images, audio, and video as well.

The purpose of content mapping is to guide the user from one piece of content to its most relevant peers. Searching a content map amounts to looking for the ‘strongest’ paths between a node and the rest of the network.
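As an illustration, here is a minimal sketch of what ‘strongest path’ search could mean, assuming connection strengths in (0, 1] and path strength defined as the product of edge strengths along the path; the graph and its values are made up.

    import heapq

    # Toy content map: connection strengths in (0, 1].
    graph = {
        'a': {'b': 0.9, 'c': 0.4},
        'b': {'d': 0.8},
        'c': {'d': 0.9},
        'd': {},
    }

    def strongest_paths(source):
        """Dijkstra-style search maximizing the product of edge strengths."""
        best = {source: 1.0}
        heap = [(-1.0, source)]                  # max-heap via negated strengths
        while heap:
            neg_strength, node = heapq.heappop(heap)
            strength = -neg_strength
            if strength < best.get(node, 0.0):
                continue                         # stale queue entry
            for neighbor, weight in graph[node].items():
                candidate = strength * weight
                if candidate > best.get(neighbor, 0.0):
                    best[neighbor] = candidate
                    heapq.heappush(heap, (-candidate, neighbor))
        return best

    # Strength of the best path from 'a' to every other node, most relevant first.
    print(sorted(strongest_paths('a').items(), key=lambda kv: -kv[1]))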

Approach & architecture

The technical essence of content mapping is the way we store and manage connections. From a design perspective, I see three approaches.

  • Graph: Querying neighbors in arbitrary depth while aggregating connection properties along paths. Limited in the sense that it works only on networks with no more than a fixed number of edges on a path, e.g. question-answer pairs.
  • Recursive: Crawling all paths in a node’s network while calculating and sorting weights. Resource hungry due to recursion; aggregated weights have to be stored until the result is returned, and cached until an affected connection changes.
  • Indexing: Tracking paths as implicit connections on the fly (see the sketch after this list). All implicit connections have to be stored separately to make sure they’re quickly retrievable.
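A minimal sketch of the write-time indexing idea, keeping explicit and implicit connections in separate stores; combining strengths by multiplication and indexing only paths of length two are simplifying assumptions.

    from collections import defaultdict

    # Explicit and implicit connections kept as separate "edge types": implicit
    # edges are pre-computed at write time, so reads never have to walk paths
    # deeper than one neighbor.
    explicit = defaultdict(dict)
    implicit = defaultdict(dict)

    def connect(a, b, strength):
        explicit[a][b] = strength
        # Paths x -> a -> b passing through the new edge.
        for x, edges in explicit.items():
            if a in edges and x != b:
                implicit[x][b] = max(implicit[x].get(b, 0.0), edges[a] * strength)
        # Paths a -> b -> c continuing past the new edge.
        for c, s in explicit[b].items():
            if c != a:
                implicit[a][c] = max(implicit[a].get(c, 0.0), strength * s)

    connect('sparrow', 'bird', 0.9)
    connect('bird', 'animal', 0.8)
    print(implicit['sparrow'])   # {'animal': 0.72}, retrievable without a path query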

When deciding on an architecture upon which to implement the solution, three choices come to mind.

  • Relational: Traditional RDBMS, mature and familiar. The richness of SQL and strong data integrity are highly valuable for most web applications, but such advantages often come at the price of costly joins, tedious optimizations and poor scalability.
  • Graph: Fits applications dealing with networks. Despite the structural resemblance with content maps, this genre of databases – being relatively young – lacks certain features necessary for a content mapping solution, such as aggregation along paths.
  • Distributed: Scalability and performance are given the highest priority. Consequently, access to resources is restricted, and features common in relational databases, such as references, joins, or transactions, are limited or completely missing.

The following table summarizes the key characteristics of each of the nine approach-architecture combinations.

Architecture | Graph approach                   | Recursive approach                  | Indexing approach
Relational   | Costly self-joins in fixed depth | Complex, caching required           | Writing is not scalable
Graph        | No aggregation along paths       | Graph architecture not exploitable  | Implicit connection as separate edge type
Distributed  | Lacks joins, same as recursive   | Limited access to resources         | Needs concurrency management

Finalists

The table above shows that most options have at least one showstopper: either complexity, lack of features and scalability, costly operations or unfitting architecture.

Only two of them seem to satisfy the purpose of content mapping as described in the first section: the graph and distributed implementations of the indexing approach.

  • Indexing on a graph database: even though it’s not the graph approach we’re talking about, this combination exploits the advantages of the graph database to the fullest. By storing implicit connections as separate edges, there’s no need to query paths deeper than one neighbor.
  • Indexing on a distributed database: there are no constraints or triggers, which demands more attention to concurrency management. Graph structure is not supported at a native level, but scalability and performance make up for it.

Ontologies in Content Mapping

January 28, 2010


Ontology in computer science is a formal representation of concepts within a domain. With the emergence of the semantic web, ontologies will take the role of the anchor to which all content can and should relate.

Ownership

Behind the unspoken conceptualizations in our heads lie formal ontologies, which, being necessarily man-made, pose a question.

Who has ownership over ontologies?

If their purpose is indeed to serve as the lighthouse on the sea of information, ontologies must be unambiguous, and therefore be defined and maintained by a single entity. Will this single entity be a company, a consortium, a committee, an organization or perhaps a government agency? What guarantees that formal ontologies will follow the changes that may occur in the instance domain?

Collective definition

There is one guarantee: collective ontology management. And I’m not thinking of Wikipedia-style collaboration, but real collective effort where everyone throws in his/her two cents.

Take a look at this very simple comparison between equivalent fractions of a content map and an ontology. (A content map is based on collective definition of probabilistic relations between content elements.)

The resemblance is hard to miss. It’s no surprise: both deal with concepts and instances bound into a network through different sorts of relations. But as we take a closer look, it becomes obvious that content mapping is fundamentally different.

  • There’s no distinction between elements such as classes, instances, attributes, et cetera. They’re all content. What constitutes a class from an ontology point of view depends solely on the relation; one element may be an instance and a class simultaneously (see the sketch after this list).
  • There are fewer, more general types of connections. You can extend an ontology with new relations that specify the way certain elements are connected to each other. Content mapping defines only a few, from which new, implicit ones can be derived via machine learning.
  • Domains don’t have definite borders. It is very likely that elements have connections leading out of a domain, superseding what we call ontology alignment. As an element may be instance and class at the same time, it can also belong to more than one domain. In fact, these are the connections through which cross-ontology relationships emerge.
  • Dynamics is inherently embedded into the system. As content changes, connections follow. Classes are constantly created, updated or deleted by changing generalization connections.
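To make the first point concrete, here is a toy sketch of the idea that ‘being a class’ is purely a property of an element’s connections, not of the element itself; the element names, relation type and weights are made up for illustration.

    from collections import namedtuple

    Connection = namedtuple('Connection', 'source target relation weight')

    # Everything is just content; roles emerge from generalization connections.
    connections = [
        Connection('sparrow', 'bird', 'generalization', 0.9),
        Connection('bird', 'animal', 'generalization', 0.8),
    ]

    def instances_of(element):
        """Elements that generalize into `element`, i.e. what it is a class of."""
        return [c.source for c in connections
                if c.target == element and c.relation == 'generalization']

    def classes_of(element):
        """Elements that `element` generalizes into, i.e. what it is an instance of."""
        return [c.target for c in connections
                if c.source == element and c.relation == 'generalization']

    # 'bird' plays both roles at once, depending purely on the relation:
    print(instances_of('bird'))  # ['sparrow'] -> 'bird' acts as a class
    print(classes_of('bird'))    # ['animal']  -> 'bird' acts as an instance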

Content mapping creates an organic system where ontologies float on the surface.

Defining ontologies in this environment is no longer necessary; they crystallize through natural progress. We only have to harvest the upper generalization layers to get an understanding of the conceptual connections in any data set. Domains needn’t be defined beforehand either. Instead, we draw their outlines wherever we deem fitting.

Clues to rely on

However flexible content mapping technology may seem in defining and following ontologies, its purpose is to connect previously unconnected content, and for that it needs clues to follow up on. Prior user input, search indexes, or existing ontologies may provide these clues. Once those clues are there, content mapping simplifies ontology management in several respects.

  • Fewer relations: Only a handful of general relations are explicit; domain-specific relations are all derived from those.
  • No need for focused attention: Ontology management requires no supervision as implicit connections change with content.
  • No knowledge of semantics: Connections (both explicit and implicit) can be set or changed without any knowledge of semantics or ontologies.

Stochastic Linguistics and Content Mapping

January 26, 2010


Stochastic linguistics deals with the probabilities of certain patterns occurring in natural language and is therefore very likely to play an important role in future natural language processing (NLP) applications, including the semantic web.

As Graham Coulter-Smith puts it in his essay The Stochastic Revolution in Art and Science:

… the leading edge of web technology developing towards globalized information interflow is almost exclusively based on stochastic technologies.

Stochastic techniques

Stochastic techniques such as n-gram models and latent semantic analysis (LSA) help us identify and classify patterns in natural language, through which we are able to compare, search and analyze documents in a language-independent manner.
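As a minimal sketch of the LSA idea (assuming scikit-learn is available; the corpus and the number of dimensions are made up for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "it's five o'clock",
        "the clock shows five",
        "tea is served in the afternoon",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)             # term-document matrix
    lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)   # low-rank concept space

    # Documents close to each other in the concept space count as "similar" --
    # but LSA cannot tell whether they are similar in structure or in meaning.
    print(cosine_similarity(lsa))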

However, they fail to further machine understanding beyond a broad structural analysis. These techniques recognize a very limited set of relationships between terms, which usually narrows down to “identical” or “interchangeable”. Such relations say nothing about the quality of the connection, i.e. whether two similar terms are similar in structure or in meaning.

This is where a catch-22 begins to unfold. In order to provide relations that reflect the meaning of terms, by which those terms could be understood, machines running stochastic analyses would have to understand the terms first. The only way out of this “catch” is to add human intelligence to the mix, extending the network of terms with the missing semantic links, which is how we arrive at content mapping.

Mapping language

Content mapping is a system that does exactly the above by collectively defining and maintaining a rich set of relations between bits of content, including natural language patterns. Relations create equivalence classes for content elements, where one term may belong to several classes based on its meaning, structure and function. Synonymy and polysemy, which are merely hinted at by LSA for instance, are not only explicitly defined in content mapping, but extended by relations vital to machine understanding, like generalization and identical meaning.

Let’s see an example. The figure below places the term “it’s 5 o’clock” in a content map. Colored connections are based on the votes of people participating in the mapping process, while outlined white arrows represent generated connections. Similarly, bubbles with solid outlines are actual pieces of content (terms), and ones with dotted outlines are generated.

In the map, different relations contribute to language understanding in different ways.

  • Generalizations help conceptualize terms.
  • Responses indicate contextual relationships between terms by connecting effects to their causes or answers to questions.
  • Identical meaning creates pathways for terms with a low number of connections to other relations such as generalizations or responses.
  • Abstractions extract structural similarity from terms to be used later by pattern recognition.

It takes two

Even though we obtain richer and more reliable information about term relationships through content mapping, it would take a lot of guesswork before actually related terms got connected. To reduce the number of unnecessary passes, LSA could provide candidate connections, even at a higher-than-normal error rate, as clues for the content mapping process to follow up on.

A composite solution that unites the two could point in a direction where a language-independent structural and semantic understanding of text finally comes within our reach.