Monday, November 03, 2003

Basic IR Definitions

After attempting to tackle a rather formidable treatise on information retrieval (Baeza-Yates & Ribeiro, 1999), I’ve run into a few roadblocks. I need some definitions:



Bit Mask

A pattern of binary values which is combined with some value using bitwise AND with the result that bits in the value in positions where the mask is zero are also set to zero. For example, if, in C, we want to test if bits 0 or 2 of x are set, we can write

int mask = 5; /* binary 101 */

if (x & mask) ...

A bit mask might also be used to set certain bits using bitwise OR, or to invert them using bitwise exclusive OR.

Source: http://dict.die.net/bit%20mask/ (attributed to: The Free On-line Dictionary of Computing (09 FEB 02))



Hashing

Producing hash values for accessing data or for security. A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.
Hashes play a role in security systems where they're used to ensure that transmitted messages have not been tampered with. The sender generates a hash of the message, encrypts it, and sends it with the message itself. The recipient then decrypts both the message and the hash, produces another hash from the received message, and compares the two hashes. If they're the same, there is a very high probability that the message was transmitted intact.

Hashing is also a common method of accessing data records. Consider, for example, a list of names:

· John Smith
· Sarah Jones
· Roger Adams

To create an index, called a hash table, for these records, you would apply a formula to each name to produce a unique numeric value. So you might get something like:

· 1345873 John smith
· 3097905 Sarah Jones
· 4060964 Roger Adams

Then to search for the record containing Sarah Jones, you just need to reapply the formula, which directly yields the index key to the record. This is much more efficient than searching through all the records till the matching record is found.

Source: http://www.webopedia.com/TERM/h/hashing.html



Heap’s Law

For text files, the second important thing is the number of distinct words or vocabulary of each document. We use the Heaps’ Law. This is a very precise law ruling the growth of the vocabulary in natural language texts... Hence, the vocabulary of a text grows sublinearly with the text size, in a proportion close to its square root.

Source: http://www.pnclink.org/annual/annual1998/1998pdf/yates.pdf



Patricia Tree
Definition: A compact representation of a trie where all nodes with one child are merged with their parents.

See also suffix tree, compact DAWG.

Note: A compact directed acyclic word graph (DAWG) merges common suffix trees to save additional space.

Source: http://www.nist.gov/dads/HTML/patriciatree.html



Suffix

Definition: The end characters of a string. More formally a string v is a suffix of a string u if u=u'v for some string u'.

Source: http://www.nist.gov/dads/HTML/suffix.html



Suffix Tree

Definition: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.

See also multi suffix tree, Patricia tree, suffix array, directed acyclic word graph.

Note: A suffix tree is a Patricia tree corresponding to the suffixes of a given string. A directed acyclic word graph (DAWG) is a more compact form.

Source: http://www.nist.gov/dads/HTML/suffixtree.html



Trie

Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.

See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.

Note: The name comes from retrieval and is pronounced, "tree."

Source: http://www.nist.gov/dads/HTML/trie.html



References

Baeza-Yates, R., & Ribeiro, B. d. A. N. (1999). Modern information retrieval. New York: ACM Press.

11 Comments:

Anonymous Anonymous said...

XUnzAb The best blog you have!

2/11/07 03:02  
Anonymous Anonymous said...

rbg5Yi Good job!

2/11/07 10:13  
Anonymous Anonymous said...

Wonderful blog.

2/11/07 11:08  
Anonymous Anonymous said...

Please write anything else!

2/11/07 11:58  
Anonymous Anonymous said...

actually, that's brilliant. Thank you. I'm going to pass that on to a couple of people.

2/11/07 12:46  
Anonymous Anonymous said...

Good job!

2/11/07 13:51  
Anonymous Anonymous said...

Good job!

3/11/07 03:26  
Anonymous Anonymous said...

Good job!

3/11/07 08:01  
Anonymous Anonymous said...

Nice Article.

3/11/07 10:10  
Anonymous Anonymous said...

Thanks to author.

3/11/07 10:58  
Anonymous Anonymous said...

2XNx37 write more, thanks.

4/11/07 10:37  

Post a Comment

Subscribe to Post Comments [Atom]

<< Home