Monday, November 10, 2003

Steps for Google…

The search engine wars are raging. Who will come out on top: Teoma, AllTheWeb, Vivisimo, AltaVista, or the perennial favorite, Google? They are all pretty good. I can generally find what I’m looking for… it’s hard not to find what you’re looking for when you get 2,600,000 hits!

Recall is obviously not a problem with search engines, but precision has become a huge issue. Despite their limitations, the search engines have already adopted many of the standard IR techniques, such as lexical analysis to treat punctuation, the elimination of stopwords, and text compression to improve performance. The search engines, however, are missing some of the major aspects of preprocessing identified by Baeza-Yates. They don’t, for example, seem to perform any stemming to determine word roots, nor do they use a controlled vocabulary to improve precision. Although directory services such as the Google Directory, DMOZ, or Yahoo! provide some aspects of controlled vocabularies, their categories are neither exhaustive nor mutually exclusive in the way that formally engineered thesauri and controlled vocabularies are.
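To make the preprocessing steps concrete, here is a minimal sketch of lexical analysis, stopword elimination, and stemming. The stopword list and the suffix-stripping rule are my own toy assumptions; a real system would use a fuller list and a proper stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption, not any engine's real list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Crude suffix stripping; a stand-in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]
```

With this sketch, "indexing", "indexed", and "indexes" all collapse to the same root, which is the effect stemming is meant to have on matching.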

Where Google excels is in determining similarity between documents and queries. Its ranking appears to combine elements of the vector space model with bibliometric techniques: the similarity weight of each document is adjusted by a “credibility” rating, PageRank, which is determined from citation-like hyperlinking patterns.
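The link-based part of that idea can be sketched as a simple power iteration. The damping value of 0.85 follows the published PageRank description, but the function names and the final step of multiplying similarity by rank are my assumptions; how Google actually combines the two signals is not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


def adjusted_scores(similarity, rank):
    """Scale each document's query similarity by its link-based rank (an assumed combination)."""
    return {doc: similarity[doc] * rank[doc] for doc in similarity}
```

Two pages with the same query similarity would then be ordered by how heavily they are linked to, which is the "credibility" adjustment described above.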

Although Google works well, it certainly could be improved. Google indexes documents automatically, and key concepts are often omitted from the descriptions. When using the OCLC interface for SocAbs, for example, the user can select from various document types such as journal articles or books. Other than allowing the user to filter result sets by file format (e.g., PowerPoint or PDF), this functionality is missing from Google. In addition, Google makes no provision for determining the real validity or credibility of a document as established through a formal peer review process.

Perhaps the one area where Google, and all of the other major search engines, could be improved most is relevance feedback. Why am I not allowed to select a set of documents and then ask the engine to provide similar documents? As it is, Google only allows me to find similar documents based on a single document and not a set of documents.
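One classic way to do this is Rocchio-style relevance feedback from the SMART tradition: move the query vector toward the centroid of the documents I selected and away from the ones I rejected. The sketch below assumes documents are already represented as sparse term-weight dicts; the alpha, beta, and gamma weights are commonly quoted defaults, not anything a search engine is known to use.

```python
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse term-weight vectors (dicts)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, nonrelevant=(), alpha=1.0, beta=0.75, gamma=0.15):
    """Return a query moved toward relevant and away from non-relevant documents."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    if relevant:
        for term, weight in centroid(relevant).items():
            new_query[term] += beta * weight
    if nonrelevant:
        for term, weight in centroid(nonrelevant).items():
            new_query[term] -= gamma * weight
    # Negative weights are conventionally clipped to zero.
    return {t: w for t, w in new_query.items() if w > 0}
```

The modified query picks up terms that occur across the selected set, so the engine would be matching against the whole set rather than a single document.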

Similarity searching based on a set of documents might also bring another facility to Google. Since no formal controlled vocabulary is used, related terms are inherently invisible to the searcher, and Google provides no means to expand a search with synonyms. Perhaps through analysis of the most important, or information-heavy, words in the selected documents, the searcher could discover previously unknown keywords.
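As a rough illustration of finding "information-heavy" words, the sketch below scores terms in the selected documents by TF-IDF and suggests the top ones that are not already in the query. TF-IDF is my choice of weighting here, and the function and parameter names are hypothetical; documents are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter


def suggest_terms(selected_docs, collection, query_terms, top_n=5):
    """Rank terms from the selected documents by summed TF-IDF weight.

    Each document is a list of already-tokenized terms. `collection` is the
    non-empty set of documents used to estimate document frequencies.
    """
    n_docs = len(collection)
    doc_freq = Counter()
    for doc in collection:
        doc_freq.update(set(doc))

    scores = Counter()
    for doc in selected_docs:
        for term, tf in Counter(doc).items():
            if term in query_terms:
                continue  # the searcher already knows these
            # Terms absent from the collection get df = 1 to avoid division by zero.
            idf = math.log(n_docs / max(doc_freq[term], 1))
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(top_n)]
```

A searcher who typed "car" and selected a few good hits might be shown "automobile" this way, without any thesaurus behind the scenes.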

RESEARCH THOUGHT: Salton and McGill introduce the SMART retrieval system. In their discussion, they describe clustering methods and illustrate how to determine centroids for clusters. It would be interesting to compare the clusters resulting from automatic indexing to formal LCSH headings. One could use the OPAC as a sampling frame. Only book entries from the last four years (because they contain detailed abstracts, blurbs, and TOCs) would be considered. The sample could be drawn from particular LCC classes such as engineering or even Z (library and information science).
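A first pass at that comparison might look like the sketch below: compute cluster centroids, assign each record to its nearest centroid by cosine similarity, and then measure how well the clusters agree with the assigned headings. Using purity as the agreement measure is my assumption, one of several possible choices, and the vectors are assumed to be sparse term-weight dicts built from the catalog records.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average the member vectors of one cluster."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {t: w / len(vectors) for t, w in total.items()}


def assign(docs, centroids):
    """Assign each document to the index of its most similar centroid."""
    return [max(range(len(centroids)), key=lambda i: cosine(d, centroids[i]))
            for d in docs]


def purity(cluster_ids, headings):
    """Fraction of documents whose heading is the majority heading of their cluster."""
    by_cluster = defaultdict(list)
    for cid, heading in zip(cluster_ids, headings):
        by_cluster[cid].append(heading)
    agree = sum(Counter(h).most_common(1)[0][1] for h in by_cluster.values())
    return agree / len(headings)
```

A purity near 1.0 would suggest the automatic clusters line up with the catalogers' headings; a low value would suggest the two are carving up the collection differently.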


