
The Noisy Web

January 6, 2010


We are witnessing the decay of Google Search. The recent improvements (categories, promotion, real-time results) are insignificant compared to the magnitude of the problem, namely, poor relevance of results. By relevance I mean the results’ relation to the specific idea in the user’s mind, and not their relation to the keywords.

Keyword-based search increases the distance and distortion between results and what the user is really looking for.

Poor relevance

Why do keywords perform so poorly? After all, they would work perfectly in a world where all data on the web is semantically indexed through relevant metadata. In reality, however, noise outweighs relevant information so heavily that any given keyword is likely to match both. The keyword meta tag fiasco around the millennium demonstrated how inefficient and vulnerable metadata is.

The widely criticized Semantic Web, anticipated to be an integral part of the third-generation Web, aims in that direction anyway. How it is going to deal with obvious obstacles such as entropy and human behavior remains unanswered.

Making sense of noise

Instead of going into the problems posed by metadata, let’s focus on the naturally noisy web. Since reducing entropy in general requires immense effort, let’s turn the problem around and start digging in the noise.

There are two ways to do this:

  • treat the entire set of data as noise and recognize patterns that are interesting to us
  • prepare useful data for extraction from the background noise as we come across them

The first option calls for some sort of AI. While this is a viable solution, I’d question its feasibility. I don’t see algorithms, no matter how complex they are, covering every single aspect of content recognition and interpretation.

For the second option I can show a very fitting example. In digital watermarking we’re hiding drops of information in a vast ocean of noise. In order to recover that tiny amount of data we have to make sure that it’s at least one of the following, and preferably both:

  • significantly more coherent than the background noise (coherent)
  • repeated over and over throughout different domains of the signal (redundant)
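
The redundancy half of this idea can be illustrated with a toy example. This is only a minimal sketch, not a real watermarking scheme: the function names, the Gaussian noise model, and the parameters (`strength`, `noise_sigma`, `repeats`) are assumptions made for illustration. Each bit is hidden as a faint offset, far weaker than the noise, and repeated many times; averaging the copies brings it back.

```python
import random

def embed(bits, repeats, strength, noise_sigma, rng):
    """Hide each bit as a weak +/-strength offset buried in Gaussian noise.

    The whole bit sequence is written `repeats` times, one copy after another.
    """
    signal = []
    for _ in range(repeats):
        for bit in bits:
            offset = strength if bit else -strength
            signal.append(offset + rng.gauss(0.0, noise_sigma))
    return signal

def recover(signal, n_bits):
    """Sum all copies of each bit position; the sign of the sum is the bit.

    The noise averages out across copies while the hidden offset adds up.
    """
    sums = [0.0] * n_bits
    for i, sample in enumerate(signal):
        sums[i % n_bits] += sample
    return [1 if s > 0 else 0 for s in sums]

if __name__ == "__main__":
    rng = random.Random(0)
    message = [1, 0, 1, 1, 0, 0, 1, 0]
    # The offset (0.2) is five times weaker than the noise (sigma = 1.0).
    noisy = embed(message, repeats=2000, strength=0.2, noise_sigma=1.0, rng=rng)
    print(recover(noisy, len(message)) == message)
```

A single copy of the message would be dominated by the noise. Over thousands of copies the offset accumulates linearly while the noise grows only with the square root of the number of copies, which is why repetition alone is enough to make the message readable.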

We can put the same concept in Web terms by connecting relevant content through user interaction.

Content mapping

There are a couple of attempts at using the crowd to add context to content: Google’s Promote button, Digg, and Twitter lists, just to name a few. It’s easy to see that these tools don’t connect content to content. They connect content to metadata, which brings us back to the original problem. OWL, the language of the Semantic Web, can be used to define connections indirectly via class connections, but this solution again favors the metadata domain.

Direct content-to-content connections are practically nonexistent as of today, except in online stores where articles refer to each other through a recommendation system. These connections are limited by the narrow niche and by the very few, very specific relations on offer (also bought / also viewed / similar). Creating such connections on a grand scale is unquestionably an enormous task, yet a far more feasible one than keeping entropy low. The good news is that tools like the ones mentioned above (Digg, Twitter) are spreading a completely new user behavior that will fit content mapping perfectly.

Once content connections are defined through a sufficiently rich set of relations, the mapping becomes machine readable. A machine won’t know that, say, a certain text element represents a book author, as it would in a semantic solution, but through a series of connections it will have implicit knowledge of it.
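
One way to picture such a map is as a graph whose edges carry a relation name. The following is a minimal sketch under stated assumptions: the `ContentMap` class, the relation names (`reviews`, `created_by`), and the page identifiers are all hypothetical and made up for illustration. Nothing in it labels a page as "an author"; that knowledge is only implied by the chain of connections leading to it.

```python
from collections import defaultdict

class ContentMap:
    """Typed content-to-content connections, stored as a labeled graph."""

    def __init__(self):
        # (source, relation) -> set of target content items
        self._edges = defaultdict(set)

    def connect(self, source, relation, target):
        """Record that `source` relates to `target` through `relation`."""
        self._edges[(source, relation)].add(target)

    def related(self, source, relation):
        """Return the items directly connected to `source` via `relation`."""
        return set(self._edges.get((source, relation), ()))

    def follow(self, source, path):
        """Follow a sequence of relations and return every item reached."""
        frontier = {source}
        for relation in path:
            frontier = {t for s in frontier for t in self.related(s, relation)}
        return frontier

if __name__ == "__main__":
    cmap = ContentMap()
    # Hypothetical connections made by users while browsing.
    cmap.connect("review-1", "reviews", "book-page")
    cmap.connect("review-2", "reviews", "book-page")
    cmap.connect("book-page", "created_by", "author-page")
    # Whatever a review points at via "reviews" then "created_by" plays
    # the role of an author, even though no metadata ever says so.
    print(cmap.follow("review-1", ["reviews", "created_by"]))
```

The design choice worth noting is that relations attach to pairs of content items rather than to classes or tags, so the map grows with user interaction instead of relying on someone maintaining metadata.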

The “Google killer” cometh

Whatever follows in the footsteps of Google Search (perhaps a new Google Search?) is going to end the era of keywords. Ideally it will feature strong content mapping, induced by fundamentally changing online behavior, mixed with light semantics. It will be dumb enough in terms of algorithmic complexity, yet smart enough to harness the collective intelligence and knowledge of content creators and consumers alike.

Updates

  • In “Google abandons Search”, Andrew Orlowski elaborates on how real-time results and voting kill PageRank and, through the noise and irrelevance they generate, push the entire Internet back into the chaos from which it emerged.
  • In “Eliminating the Need for Search”, Nova Spivack tears down the hype surrounding search engines by arguing that search is an “intermediary stepping stone” that’s “‘in the way’ between intention and action”. He lists a couple of solutions that aim to break out of the conventional search engine mold but in the end fail to bring about drastic change. Instead, he proposes the concept of “help engines” that supposedly help the user in a proactive way.