
Entropy and the Future of the Web

January 19, 2010


Inspired by this post by Chris Dixon, I summarized my thoughts on the future of the web in a single tweet:

The fundamental question that will shape the future of the web is how we deal with entropy.

Options

Disorder on the web lives in two layers: the content itself and the connections between pieces of content. Today’s approach to the web of tomorrow depends on how we address entropy in each layer. The list below shows what we can expect from the different combinations of low and high entropy in the two layers.

Entropy in the content layer reflects the degree of internal disorder. If we choose to lower content entropy by adding relevant metadata or structure, we’ll realize the semantic web. If we don’t, content will remain unorganized and we’ll end up with the noisy web.

Entropy in the connection layer expresses disorder in the network of content. By defining meaningful relations between content elements, we decrease connection entropy, leading to the synaptic web. Should we leave connections in their ad-hoc state, we’ll arrive at the unorganized web.

The study of web entropy becomes interesting when we take a look at the intersections of these domains.

  • Semantic – synaptic: The most organized, ideal form of the web. Content and connections are thoroughly described, transparent and machine readable. Example: linked data.
  • Semantic – unorganized: Semantic content loosely connected throughout the web. Most blog posts have the valid semantic structure of documents; however, they’re connected by hyperlinks that say nothing about their relation (say, whether a blog entry extends, reflects on or debates the linked one).
  • Noisy – synaptic: High-entropy content organized by connecting relevant elements via meaningful relations. Among others, tagging, filtering, recommendation engines and content mapping fall into this domain.
  • Noisy – unorganized: A sparse network of unstructured content. This is the domain we’ve known for a decade and a half, where keyword-based indexing and search still dominate the web (a minimal sketch of such indexing follows this list). If the web continues to develop in this direction, technologies such as linguistic parsing and topic identification will definitely come into play.
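To make the keyword-based status quo concrete, here is a minimal sketch of the inverted index at the heart of keyword search. The documents and the whitespace tokenizer are hypothetical stand-ins; real engines add ranking (TF-IDF, PageRank and the like) on top of this structure.

```python
from collections import defaultdict

# Toy corpus standing in for web documents (hypothetical content).
documents = {
    "doc1": "entropy and the future of the web",
    "doc2": "the semantic web relies on metadata",
    "doc3": "web search is driven by keywords",
}

# Inverted index: keyword -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        index[token].add(doc_id)

def search(query):
    """Return the documents that contain every keyword in the query."""
    result = None
    for token in query.split():
        hits = index.get(token, set())
        result = hits if result is None else result & hits
    return result or set()

print(search("semantic web"))  # {'doc2'}
print(search("web"))           # all three documents: keywords, not intent
```

Note what the index can answer: “which documents contain these words”, never “which documents are about this idea”.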

Which one?

The question is obvious: which domain represents the optimal course to take? Based on the domains’ descriptions, semantic – synaptic seems the clear choice. But we’re discussing entropy here, and from thermodynamics we know that entropy grows in systems prone to spontaneous change; order is restored only at the cost of energy and effort.
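Stated loosely in thermodynamic terms (a rough analogy only, not a claim that the web is a closed physical system):

```latex
% Second law: the entropy of an isolated system never decreases.
\Delta S_{\mathrm{isolated}} \ge 0
% Restoring order locally must be paid for by the surroundings:
\Delta S_{\mathrm{system}} < 0
  \;\Longrightarrow\;
  \Delta S_{\mathrm{surroundings}} \ge -\Delta S_{\mathrm{system}}
```

Lowering the entropy of web content or connections is the analogue of that payment: somebody has to spend the energy.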

Ultimately, the question comes down to this: are we going to fight entropy or not?

Bringing the semantic web into existence is an enormous task. To me, fighting people’s reluctance to adopt metadata and semantic formats is an unimaginable undertaking. The synaptic web seems more feasible, as the spread of social media already indicates. But in the end, what matters is which domain or combination of domains becomes popular among early adopters. The rest will follow.

The Noisy Web

January 6, 2010


We are witnessing the decay of Google Search. The recent improvements (categories, promotion, real-time results) are insignificant compared to the magnitude of the problem, namely, poor relevance of results. By relevance I mean the results’ relation to the specific idea in the user’s mind, and not their relation to the keywords.

Keyword-based search increases the distance and distortion between results and what the user is really looking for.

Poor relevance

Why do keywords perform so poorly? After all, they would work perfectly in a world where all data on the web is semantically indexed through relevant metadata. In reality, however, relevant information is buried in so much noise that keywords are likely to match both. The keyword meta tag fiasco around the millennium proved the inefficiency and vulnerability of metadata.

The widely criticized Semantic Web, anticipated to be an integral part of the third-generation web, aims in that direction anyway. How it’s going to deal with obvious obstacles such as entropy and human behavior remains unanswered.

Making sense of noise

Instead of going into the problems posed by metadata, let’s focus on the naturally noisy web. Since reducing entropy in general requires immense effort, let’s turn the problem around and start digging in the noise.

There are two ways to do this:

  • treat the entire set of data as noise and recognize patterns that are interesting to us
  • prepare useful data for extraction from the background noise as we come across them

The first option calls for some sort of AI. While this is a viable solution, I’d question its feasibility. I don’t see algorithms – no matter how complex – covering every single aspect of content recognition and interpretation.

For the second option I can show a very fitting example, sketched below. In digital watermarking, we hide drops of information in a vast ocean of noise. To recover that tiny amount of data, we have to make sure the embedded signal is one or both of the following:

  • significantly more coherent than the background noise (coherence)
  • repeated over and over throughout different domains of the signal (redundancy)
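As a toy illustration of the redundancy route (hypothetical numbers, with plain repetition standing in for a real spread-spectrum watermark): embed one bit many times in a noisy signal, then recover it by averaging, so the payload survives noise that would drown any single copy.

```python
import random

random.seed(42)  # reproducible toy run

# Hypothetical payload: one bit, embedded as a weak +/-0.2 offset.
payload_bit = 1
amplitude = 0.2      # far weaker than the noise (std 1.0)
repetitions = 1000   # redundancy: the same bit embedded 1000 times

# Each sample = weak payload + strong zero-mean noise.
signal = [
    amplitude * (1 if payload_bit else -1) + random.gauss(0.0, 1.0)
    for _ in range(repetitions)
]

# Recovery: averaging cancels the zero-mean noise; the payload survives.
mean = sum(signal) / len(signal)
recovered_bit = 1 if mean > 0 else 0
print(f"mean = {mean:.3f}, recovered bit = {recovered_bit}")
```

Any single sample is useless because the noise dominates, but a thousand redundant copies make the bit recoverable. Content connections created by many users could give useful data the same property.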

We can put the same concept in Web terms by connecting relevant content through user interaction.

Content mapping

There are a couple of attempts at using the crowd to add context to content: Google’s Promote button, Digg and Twitter lists, to name a few. It’s easy to see that these tools don’t connect content to content. They connect content to metadata, which brings us back to the original problem. OWL, the language of the Semantic Web, can be used to define connections indirectly via class connections, but this solution again favors the metadata domain.

Direct content-to-content connections are practically non-existent today, except in online stores, where articles refer to each other through a recommendation system. Those connections are limited by the narrow niche and the very few, specific relations involved (also bought / also viewed / similar). Creating such connections on a grand scale is unquestionably an enormous task, yet a far more feasible one than keeping entropy low. The good news is that tools like the ones mentioned above (Digg, Twitter) spread a completely new user behavior that will fit content mapping perfectly.

If we define a sufficiently rich set of relations for content connections, the map becomes machine readable. It won’t know that, say, a certain text element represents a book author, as it would in a semantic solution, but through a series of connections it will hold implicit knowledge about it.
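A minimal sketch of what such machine-readable content mapping could look like (the relation names and post identifiers are hypothetical illustrations, not a proposed format): content elements connected by typed relations, queried without any metadata on the elements themselves.

```python
from collections import defaultdict

# Hypothetical content map: (source, relation, target) triples created
# through user interaction, connecting content directly to content.
triples = [
    ("post-a", "extends", "post-b"),
    ("post-c", "debates", "post-b"),
    ("post-d", "reflects_on", "post-a"),
]

# Index the triples for simple machine-readable queries.
by_relation = defaultdict(list)
outgoing = defaultdict(list)
for src, rel, dst in triples:
    by_relation[rel].append((src, dst))
    outgoing[src].append((rel, dst))

# Implicit knowledge: post-b never declares itself contested, but the
# incoming 'debates' connection implies it.
print({dst for _, dst in by_relation["debates"]})  # {'post-b'}

# Connections can also be followed outward from any element.
print(outgoing["post-a"])  # [('extends', 'post-b')]
```

None of the posts carries semantic markup; whatever the map “knows” emerges from the typed connections alone.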

The “Google killer” cometh

Whatever follows in the footsteps of Google Search (perhaps a new Google Search?), it’s going to end the era of keywords. Ideally it will feature strong content mapping, induced by fundamentally changed online behavior, mixed with light semantics. It will be dumb enough in terms of algorithmic complexity, yet smart enough to harness the collective intelligence and knowledge of content creators and consumers alike.

Updates

  • In Google abandons Search, Andrew Orlowski elaborates on how real-time results and voting kill PageRank and, through the noise and irrelevance they generate, push the entire Internet back into the chaos from which it emerged.
  • Nova Spivack tears down the hype surrounding search engines in Eliminating the Need for Search, observing that search is an “intermediary stepping stone” that is “in the way” between intention and action. He lists a couple of solutions that aim to break out of the conventional search-engine mold but in the end fail to bring about drastic change. Instead, he proposes the concept of “help engines” that help the user proactively.