Sunday, October 19, 2003

Googling the Inverted Index

Our readings on Information Retrieval have been quite valuable. I now feel like I have some grounding in the systems side of our discipline. The readings are loaded into my citation manager and I know where to find the formula for constructing vector space models (although I’ll need my undergraduate textbooks to decipher what a “dot product” is). Despite my interest in IR, I have to wonder about our reliance on controlled vocabularies. Where would inverted indexes be without a controlled vocabulary? How does post-coordinated searching work? What about natural language queries?

To understand these issues, I’m taking a quick look at Google—specifically Brin and Page’s work graduate school work from 1998: http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm (Brin & Page, 1998).

It seems that the original goal of Google was to improve the quality of the early search engines and to overcome the increasingly commercial nature of the web. The key to Google’s efficiency is their trademarked PageRank technology. Brin and Page, however, attribute PageRank to concepts of citation matching (without crediting ISI!).

In trying to understand Google, I thought that Brin and Page had created a completely new means of working with large document spaces. In their paper, however, it seems that they have just polished the methods described in our readings. Indeed, their research is rooted in Information Science: they note that the “very large corpus” typically used to test IR systems is completely inapplicable to the web.

From their paper, I learned:

As Google crawls pages, copies of the HMTL are compressed and written to a drive. Each page is given a unique ID. An indexing engine then notes word occurrences. These occurrences are written to a forward index. Indexing terms are broken into various buckets to improve efficiency. These buckets are then analyzed to establish a descriptive lexicon. The lexicon is then used to create an inverted index. It seems that we can’t get away from the inverted index!

One of the strengths of Google lies in the size of its lexicon: 14 million words. LCSH 26th edition, in comparison, contains only 270,000 controlled terms. The Google lexicon, however, can easily be contained in the working memory of a single computer (256 MB).

Despite their age, our readings are still very salient. Even Google is subject to those seemingly antique terms of “precision”, “recall”, and “inverted index”. Given Google’s impending IPO, perhaps we should spend more time in the antiquity of Information Science!


References

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.

0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home