Since the official launch of his initiative “blekko”, Rich Skrenta has tried to differentiate his search engine from the other major search engines by challenging them on issues like spam, privacy and transparency. The most apparent difference between Blekko and other horizontal search engines is the use of “slashtags”: custom search engines that are built by the Blekko community around one topic and limit search results to the most authoritative websites on that topic. These slashtags are the manifestation of Blekko’s fifth rule, which challenges the seemingly impersonal nature of the aforementioned search engines by focusing on the context of the user’s search. The first direct attack occurred two months ago, when Skrenta stated in his blog that they are “not afraid of Google”.
This statement may sound courageous considering the fate of previous search engines that tried to rival Google. On the other hand, Blekko’s aspiration to be “the third search engine” may be interpreted as symbolic. Whether or not Blekko becomes one of the major horizontal search engines, it is apparent that Google has set some standards that Blekko will have to meet (this may explain why two members of Blekko’s management team are former Google employees). Now, one year after its public launch, is probably a good time to assess what Blekko still has to achieve in order to compete with Google’s standards.
Natural Language Processing (NLP) Tasks
Misspelling and Mistyping
Google’s spelling correction has evolved constantly over the years and now supports 31 languages. A non-exhaustive examination of some of Wikipedia’s commonly misspelled words suggests that Blekko’s spell-checker is at a level similar to that of the major search engines, even in foreign languages like French or German. However, when using their localized versions, Google, Bing and Yahoo! yield better corrections than Blekko (Blekko declares that it currently focuses on English-language results). Moreover, Google’s spelling correction may improve for queries longer than one word.
Stemming and Synonyms
Although the efficiency of automatic stemming has long been debated, it has become a standard, at least for commercial search engines. Google’s automatic stemming started at the end of 2003. Blekko, on the other hand, either doesn’t have a stemming algorithm or simply does not highlight the word variants (e.g. [run shoes]).
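To make the [run shoes] example concrete, here is a minimal sketch of what stemming buys a search engine. The suffix-stripping function below is a deliberately crude toy (the names `toy_stem` and `matches` are invented for this illustration); it is not Google’s algorithm, nor a real stemmer such as Porter’s.

```python
def toy_stem(word: str) -> str:
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Strip one suffix, but keep at least three characters of the word.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant: "runn" becomes "run".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def matches(query: str, document: str) -> bool:
    """True if every query term shares a stem with some document term."""
    doc_stems = {toy_stem(token) for token in document.split()}
    return all(toy_stem(term) in doc_stems for term in query.split())


print(matches("run shoes", "Best running shoe reviews"))  # True
print(matches("run shoes", "Best hiking boots"))          # False
```

Without the stemming step, neither “run” nor “shoes” would match the first title literally, which is the kind of miss that a query like [run shoes] exposes.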
Google’s automatic synonym expansion is part of the trend to fully automate the process of query reformulation. Google’s algorithm adds, on the fly, synonyms that are relevant to the context of the query (irrelevant synonyms may degrade search results). Blekko apparently doesn’t have a synonym expansion algorithm, but it is important to recall that thesaurus-based expansion is only one of various methods for query reformulation (e.g. relevance feedback). Actually, Blekko has had an auto-slashing feature right from the start (now called slashtag boosting).
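The idea of context-sensitive synonym expansion can be sketched in a few lines. Everything here is an assumption made for illustration: the tiny `THESAURUS` table, its entries and the `expand` function are hypothetical and do not describe how Google or Blekko actually work. The point is only that a synonym is added when other words in the query license it.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical thesaurus: term -> [(synonym, context words that license it)].
# An empty context set means the synonym always applies.
THESAURUS: Dict[str, List[Tuple[str, Set[str]]]] = {
    "gm": [
        ("general motors", {"car", "cars", "trucks"}),
        ("genetically modified", {"food", "crops", "wheat"}),
    ],
    "pictures": [("photos", set())],
}


def expand(query: str) -> List[List[str]]:
    """Return, for each query term, the term plus its context-approved synonyms."""
    terms = query.lower().split()
    expanded = []
    for term in terms:
        context = set(terms) - {term}
        alternatives = [term]
        for synonym, required in THESAURUS.get(term, []):
            # Add the synonym only if it is unconditional or the context fits.
            if not required or required & context:
                alternatives.append(synonym)
        expanded.append(alternatives)
    return expanded


print(expand("gm cars"))  # [['gm', 'general motors'], ['cars']]
print(expand("gm food"))  # [['gm', 'genetically modified'], ['food']]
print(expand("gm"))       # [['gm']]
```

With no context at all, the sketch adds nothing for “gm”, which reflects the remark above that irrelevant synonyms may degrade search results.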
In the last decade, Google has been using n-gram models as the ultimate solution for a variety of NLP tasks like spelling correction, stemming and synonym expansion. Having started its n-gram language exploration with n-grams over the web, Google has recently been experimenting with n-grams over the Google Books digitized corpus.
Although it is a less refined corpus, the World Wide Web may also serve as training data for NLP algorithms at greatly reduced cost. (The problem of accuracy can be mitigated by the vast amount of web-based text.) Blekko can also use Wikipedia’s explosive growth to examine a controlled web-based parallel corpus. Moreover, Wikipedia may be combined with Wiktionary in order to obtain comprehensive open-source parallel corpora. In addition, Google’s one- to five-gram datasets are a good, free, accurate multilingual source. (WordNet is a good source as well, but it is free only for the English version.)
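As a rough illustration of how n-gram counts can drive one of these tasks, the sketch below corrects a misspelled word using unigram and bigram counts from a tiny invented corpus. The corpus, the function names and the scoring rule are all assumptions made for the example; a real system would use web-scale counts and a proper probabilistic model.

```python
from collections import Counter

# Tiny invented corpus, standing in for web-scale text.
CORPUS = (
    "world peace is a goal . world peace is rare . peace talks resumed . "
    "a piece of cake . a piece of paper ."
).split()

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, prev=None):
    """Pick the known candidate that best fits the previous word."""
    if word in UNIGRAMS:
        return word
    candidates = [w for w in edits1(word) if w in UNIGRAMS]
    if not candidates:
        return word
    # Rank by bigram count with the previous word, then by unigram count.
    return max(candidates, key=lambda w: (BIGRAMS[(prev, w)], UNIGRAMS[w], w))


def correct_query(query):
    corrected, prev = [], None
    for word in query.lower().split():
        prev = correct(word, prev)
        corrected.append(prev)
    return " ".join(corrected)


print(correct_query("pece"))            # peace
print(correct_query("a pece of cake"))  # a piece of cake
```

On its own, “pece” is resolved to the more frequent “peace”; inside a longer query the bigram count for “a piece” tips the choice the other way, which shows how extra context can improve a correction.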
Advanced Search Operators
Google has a variety of advanced search operators, some of them undocumented (e.g. the AROUND operator). Blekko also has some “advanced slashtags” as part of its SEO stats pages. However, these operators are much more restricted than Google’s or even Bing’s and Yahoo!’s. Although the slashtags “/inbound”, “/outbound”, and “/internal” are useful for SEO purposes, they cannot be combined with each other, with other slashtags, or even with keywords. Actually, the undocumented “link:” operator has to be used in order to build compound search queries (e.g. [link:Blekko.com /techblogs] or [link:Blekko.com /techblogs /date]).
Recently Blekko launched the Web Grepper service, which allows Blekko’s registered users to request mining of data that cannot be discovered with a standard keyword search. However, these requests have to get enough votes from the Blekko community in order to be implemented. Still, this activity may eventually lead to Blekko’s own set of advanced search operators.
One of the main reasons for the proliferation of topical search engine start-ups some years ago was the failure of horizontal search engines to recognize the context of users’ queries. Topical search engines, people believed, would have an advantage over the big horizontal search engines, at least in one area, with a much smaller budget. Although the major search engines weren’t indifferent to this trend and even bought some of these start-ups, their eventual solution to this failure was personalized search, or adaptive search.
Over the last year there has been a continuing debate over the negative implications of personalized search. The discussion probably started with Eli Pariser’s TED Talk on the “filter bubble”. One of the drawbacks of personalization is its threat to user privacy, usually through recording of the user’s search history. However, it is important to remember that personalization is the price users are silently expected to pay for their reluctance to use search engines’ reformulation and refinement tools, or even to formulate queries of more than two words.
Apparently Blekko does not need users’ search history because it can infer users’ intentions from their slashtags. While typing different slashtags to focus a search on a specific topic may be welcomed by early adopters (i.e. power users), the late majority may be intimidated by the concept of slashtags. Even slashtag suggestions (i.e. refinements) may overwhelm users’ short-term memory. At that stage Blekko will probably want to make slashtags less visible to users through slashtag personalization.
Blekko has recently stated that it will discard user search logs within 48 hours. Although this time frame may be sufficient to infer users’ short-term interests, it is too short to build stable long-term profiles. (The recent trend among major search engines is to keep search logs for up to 18 months.) This statement may sound like a manifestation of Blekko’s tenth rule, but it actually breaks this rule, and it may serve as an escape hatch for extending the time frame as soon as Blekko obtains the critical mass of users needed to guess users’ invariant intentions. Moreover, Blekko may already build long-term profiles, based on their social networks, for users who log in with their Facebook accounts.
Personalized slashtags may make Blekko more similar to Google. However, they may also advance Blekko’s main agenda, namely to combine human curation with algorithmic techniques in order to fight spam. In other words, this feature will inject human wisdom into computers’ artificial intelligence. On this playing field Google won’t beat Blekko unless it relinquishes its “no manual intervention” philosophy.