Thursday, August 26, 2010

Overview of my postings on the Atbrox blog (May 2010 - Aug 2010)

As previously mentioned, I currently write most of my blog postings over on Atbrox (my startup company).

Here are the latest postings:

Sunday, May 30, 2010

Evaluation of Search Predictions made in May 2000

In May 2000 I wrote A few thoughts about the future of Internet Information Retrieval (i.e. search), but how did the predictions actually turn out? I've tried to evaluate them in this posting, with each original prediction in italic font followed by my evaluation.

1a) Prediction - Specialized Services within Search

It seems likely that the specialization in the Internet Information Retrieval (IIR) business will continue. Internet information crawling, pre-processing, indexing, searching and presentation requires different types of technologies and know-how, this might create opportunities for new companies specializing in only one step of the IIR "food chain". One possibility could be that companies doing crawling will do offer extracts of relevant data on request, e.g. a search engine specializing in winter sports could get only relevant data extracted from several regional crawler companies. In other words, the IIR "food chain" might increase in length.


1b) Evaluation
Specialization of search services happened to some degree, but had relatively small impact. Examples include fetching/crawl-related services (e.g. 80legs). The services with the biggest impact are the free search APIs (e.g. Google Ajax Search API and Bing APIs) and the commercial ones (e.g. Yahoo Boss and Wolfram Alpha API). What they have in common is that they offer the last step, i.e. search, and so implicitly cover all the earlier steps. Noteworthy developments in a related direction are cloud computing and the increasing number of large data sets (e.g. the infochimps collection, DBPedia and the Public Terabyte (crawl) dataset).

2a) Prediction about Potential New Search Players

As the importance of Internet Information Retrieval grows, players that have been concentrating on the lower end of the Internet "food chain", i.e. major bandwidth providers (e.g. MCI or British Telecom) and network software/hardware vendors (e.g. 3COM or Cisco) might want to enter the market as providers of partially indexed data to search engines and topic hierarchies.


2b) Evaluation
This didn't happen at all to my knowledge.

3a) Prediction about Potential New Search Technologies

With the increased growth of the amount of data on the Internet, new technologies for doing distributed indexing/search of data will probably occur. This is particularly interesting if processing and indexing of multimedia data (e.g. sound, pictures and video) becomes popular. Processing of multimedia data is considerably more CPU intensive than processing of textual data. Example of such processing could be automatic detection of objects (e.g. a car) in video frames.


3b) Evaluation
(Massively) distributed indexing in the "SETI@home" style didn't happen at large scale, though there are a few examples pursuing distributed indexing/search, e.g. the Majestic project. In retrospect, processing of multimedia data was an obvious development, and it is happening (though the problems involved are not trivial to solve).

Conclusion
If I am kind - 0.5 on prediction 1, 0 on prediction 2 and 0.5 on prediction 3 ~ 33.33% correct?

Sunday, January 24, 2010

My recent reads in Information Retrieval - Indexing


Information Retrieval (IR) - better known as Search - is probably the most exciting research field I know of. The reasons that make IR exciting are:
  • solvability - it can probably never be solved perfectly, but always be improved
  • coverage - it spans all areas of computer science and touches many other sciences (e.g. statistics)
  • importance - it is the most important research area related to supporting human decisions? (~AI)
  • difficulty - it is extremely hard to do well
  • applicability - it can be used practically anywhere (anytime).
Where to start learning about information retrieval?
Before jumping into research papers I suggest reading a book about IR, either:
Search Engines: Information Retrieval in Practice (2009) or
Introduction to Information Retrieval (2008)
They are both good and relatively similar books, written by a mix of authors from the search industry and academic IR research (note: I personally prefer the newest one).

My recent reads in Information Retrieval?


Indexing - algorithms and datastructures for self-indexing
Self-indexing is where (lossless) compression meets indexing, and is an alternative to the classic inverted index. Self-indices have some nice characteristics with respect to compression, performance and query flexibility. Indexing-research-rockstar Gonzalo Navarro even called it the Miracle of Self-indexing (2009).
Two key papers in the field are:
  1. Opportunistic Data Structures with Applications (2000)
    • Introduced the FM-index
  2. High-order entropy-compressed text indexes (2003)
    • Introduced the wavelet tree
Check out Navarro's survey paper Compressed Full-Text Indexes (2007) for a good overview of self-indexing.
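
To give a feel for how a self-index answers queries, here is a minimal Python sketch of the backward-search idea behind the FM-index. It is a toy under stated assumptions, not the construction from the papers: the class name FMIndex and its methods are my own, the "\0" sentinel is assumed not to occur in the text, the suffix array is built by naive sorting, and the occurrence tables and suffix array are stored uncompressed (a real FM-index compresses or samples them, which is where structures such as the wavelet tree come in).

```python
from collections import Counter


class FMIndex:
    """Toy FM-index: counts and locates pattern occurrences via backward search."""

    SENTINEL = "\0"  # assumption: never occurs in the text and sorts before all other characters

    def __init__(self, text):
        text += self.SENTINEL
        # Naive suffix array construction (fine for a demo, far too slow for real data).
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # Burrows-Wheeler transform: the character preceding each sorted suffix.
        self.bwt = "".join(text[i - 1] for i in self.sa)

        # C[c] = number of characters in the text that are strictly smaller than c.
        counts = Counter(self.bwt)
        self.C = {}
        total = 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]

        # occ[c][i] = number of occurrences of c in bwt[:i] (stored uncompressed here).
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c, table in self.occ.items():
                table[i + 1] = table[i] + (1 if c == ch else 0)

    def _range(self, pattern):
        """Return the half-open suffix array range [lo, hi) of suffixes starting with pattern."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):  # process the pattern from its last character to its first
            if c not in self.C:
                return 0, 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0, 0
        return lo, hi

    def count(self, pattern):
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        lo, hi = self._range(pattern)
        return sorted(self.sa[lo:hi])


index = FMIndex("abracadabra")
print(index.count("abra"))   # 2
print(index.locate("abra"))  # [0, 7]
```

Note that the search loop only touches the BWT-derived tables, never the original text. That is the sense in which the index can replace the text it indexes.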

Have a nice read :)