Hash function
From Exampleproblems
A hash function or hash algorithm is a function for summarizing or probabilistically identifying data. Such a summary is known as a hash value or simply a hash, and the process of computing such a value is known as hashing.
A fundamental property of all hash functions is that if two hashes (according to the same function) are different, then the two inputs were different in some way. This property is a consequence of hash functions being deterministic, mathematical functions, but they are generally only surjective functions (i.e. not necessarily one-to-one). Consequently, the equality of two hash values does not guarantee the two inputs were the same, but in some cases, probability theoretic or computability theoretic guarantees apply.
Because of the variety of applications for hash functions (details below), they are often tailored to the application. For example, cryptographic hash functions assume the existence of an adversary who can deliberately try to find inputs with the same hash value. Functions for error detection and correction focus on distinguishing cases in which data has been disturbed by random processes. In any application, a good hash function is one that yields few hash collisions in expected input domains. In hash tables and data processing, collisions inhibit the distinguishing of data, making records more costly to find.
Typical hash functions have an infinite domain, such as byte strings of arbitrary length, and a finite range, such as bit sequences of some fixed length. Functions that follow this paradigm and produce random-looking output are considered stock hash functions, because they can be used directly in or easily adapted for most applications. Cryptographic hash functions such as MD5 are commonly used as stock hash functions.
Contents |
Cryptography
- Main article: cryptographic hash function
Hash functions (a type of one-way function) are fundamental for much of cryptography. In this application, functions are characterized and evaluated in terms of their ability to withstand attack by an adversary. More specifically, given a message x, if it is computationally infeasible to find a message y not equal to x such that H(x) = H(y) then H is said to be a weakly collision-free hash function. A strongly collision-free hash function H is one for which it is computationally infeasible to find any two messages x and y such that H(x) = H(y).
The requirements for a good cryptographic hash function are stronger than those in many other applications (error correction and audio identification not included). For this reason, cryptographic hash functions make good stock hash functions--even functions whose cryptographic security is compromised, such as MD5 and SHA-1. The SHA-2 algorithm, however, has no known compromises.
Hash tables
- Main article: Hash table
Hash tables, a major application for hash functions, enable fast lookup of a data record given its key. (Note: Keys are not usually secret as in cryptography, but both are used to "unlock" or access information.) For example, keys in an English dictionary would be English words, and their associated records would contain definitions. In this case, the hash function must map alphabetic strings to indexes for the hash table's internal array.
The generally impossible/impractical ideal for a hash table's hash function is to map each key to a unique index (see perfect hashing), because this guarantees access to each data record in the first probe into the table. Hash functions that are truly random with uniform output (including most cryptographic hash functions) yield good performance because, on average, only one or two probes will be needed (depending on the load factor). Perhaps as important is that excessive collision rates with random hash functions are highly improbable--if not computationally infeasible for an adversary. However, a small, predictable number of collisions are virtually inevitable (see birthday paradox).
In many cases, a heuristic hash function can yield many fewer collisions than a random hash function. Heuristic functions take advantage of regularities in likely sets of keys. For example, one could design a heuristic hash function such that file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc. map to successive indexes of the table, meaning that such sequences will not collide. Beating a random hash function on "good" sets of keys usually means performing much worse on "bad" sets of keys, which can arise naturally--not just through attacks. Bad perfomance of a hash table's hash function means that lookup can degrade to a costly linear search.
One hash function that exhibits properties of both random hash functions and heuristic hash functions is Bob Jenkins' LOOKUP2 hash function, published in an article in Dr. Dobb's Journal. The hash function performs well as long as there is no adversary, for it is trivially reversible and useless as a cryptographic hash function.
Error correction
- Main article: Error correction and detection
Using a hash function to detect errors in transmission is straightforward. First compute the hash function at the sender and send the hash value along with the data. Then compute the hash function again at the receiver, and if the hash values do not match, there was an error. This is called a redundancy check.
For error correction, a distribution of likely perturbations is assumed at least approximately. Perturbations to a string are then classified into large (improbable) and small (probable) errors. The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x efficiently if s is small. Such hash functions are known as error correction codes. Important sub-class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.
Audio identification
For audio identification such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all identical items. Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences. Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room.
Rabin-Karp string search algorithm
Rabin-Karp string search algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.
Origins of the term
The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to "chop and mix." Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953; the term hash came into use some ten years later.
In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular division.
See also
- Cryptography
- Cryptographic hash function
- HMAC
- Geometric hashing
- Message digest
- Perfect hash function
- Rabin-Karp string search algorithm
- Zobrist hashing
- Bloom filter
- Hash table
- Hash list
- Hash tree
References
External links
- General purpose hash function algorithms
- Hash Functions for Hash Table Lookup by Bob Jenkins
- Hash Functions by Paul Hsieh
- What is a hash function? from RSA Laboratories
- universal hashing by stinson from university of nabraska lincolncs:Hašovací funkce
da:Hashfunktion de:Hash-Funktion es:Función hash fr:Fonction de hachage it:Hash he:פונקציה חד כיוונית lt:Maišos funkcija nl:Hashfunctie ja:ハッシュ関数 pl:Funkcja haszująca pt:Hash ru:Хэш-функция sl:Sekljalna funkcija uk:Хешувальна функція zh:散列函數
