Hash function

From Exampleproblems

Jump to: navigation, search

A hash function or hash algorithm is a function for summarizing or probabilistically identifying data. Such a summary is known as a hash value or simply a hash, and the process of computing such a value is known as hashing.

Image:Hash function.png
A typical hash function at work

A fundamental property of all hash functions is that if two hashes (according to the same function) are different, then the two inputs were different in some way. This property is a consequence of hash functions being deterministic, mathematical functions, but they are generally only surjective functions (i.e. not necessarily one-to-one). Consequently, the equality of two hash values does not guarantee the two inputs were the same, but in some cases, probability theoretic or computability theoretic guarantees apply.

Because of the variety of applications for hash functions (details below), they are often tailored to the application. For example, cryptographic hash functions assume the existence of an adversary who can deliberately try to find inputs with the same hash value. Functions for error detection and correction focus on distinguishing cases in which data has been disturbed by random processes. In any application, a good hash function is one that yields few hash collisions in expected input domains. In hash tables and data processing, collisions inhibit the distinguishing of data, making records more costly to find.

Typical hash functions have an infinite domain, such as byte strings of arbitrary length, and a finite range, such as bit sequences of some fixed length. Functions that follow this paradigm and produce random-looking output are considered stock hash functions, because they can be used directly in or easily adapted for most applications. Cryptographic hash functions such as MD5 are commonly used as stock hash functions.

Contents

Cryptography

Main article: cryptographic hash function

Hash functions (a type of one-way function) are fundamental for much of cryptography. In this application, functions are characterized and evaluated in terms of their ability to withstand attack by an adversary. More specifically, given a message x, if it is computationally infeasible to find a message y not equal to x such that H(x) = H(y) then H is said to be a weakly collision-free hash function. A strongly collision-free hash function H is one for which it is computationally infeasible to find any two messages x and y such that H(x) = H(y).

The requirements for a good cryptographic hash function are stronger than those in many other applications (error correction and audio identification not included). For this reason, cryptographic hash functions make good stock hash functions--even functions whose cryptographic security is compromised, such as MD5 and SHA-1. The SHA-2 algorithm, however, has no known compromises.

Hash tables

Main article: Hash table

Hash tables, a major application for hash functions, enable fast lookup of a data record given its key. (Note: Keys are not usually secret as in cryptography, but both are used to "unlock" or access information.) For example, keys in an English dictionary would be English words, and their associated records would contain definitions. In this case, the hash function must map alphabetic strings to indexes for the hash table's internal array.

The generally impossible/impractical ideal for a hash table's hash function is to map each key to a unique index (see perfect hashing), because this guarantees access to each data record in the first probe into the table. Hash functions that are truly random with uniform output (including most cryptographic hash functions) yield good performance because, on average, only one or two probes will be needed (depending on the load factor). Perhaps as important is that excessive collision rates with random hash functions are highly improbable--if not computationally infeasible for an adversary. However, a small, predictable number of collisions are virtually inevitable (see birthday paradox).

In many cases, a heuristic hash function can yield many fewer collisions than a random hash function. Heuristic functions take advantage of regularities in likely sets of keys. For example, one could design a heuristic hash function such that file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc. map to successive indexes of the table, meaning that such sequences will not collide. Beating a random hash function on "good" sets of keys usually means performing much worse on "bad" sets of keys, which can arise naturally--not just through attacks. Bad perfomance of a hash table's hash function means that lookup can degrade to a costly linear search.

One hash function that exhibits properties of both random hash functions and heuristic hash functions is Bob Jenkins' LOOKUP2 hash function, published in an article in Dr. Dobb's Journal. The hash function performs well as long as there is no adversary, for it is trivially reversible and useless as a cryptographic hash function.

Error correction

Main article: Error correction and detection

Using a hash function to detect errors in transmission is straightforward. First compute the hash function at the sender and send the hash value along with the data. Then compute the hash function again at the receiver, and if the hash values do not match, there was an error. This is called a redundancy check.

For error correction, a distribution of likely perturbations is assumed at least approximately. Perturbations to a string are then classified into large (improbable) and small (probable) errors. The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x efficiently if s is small. Such hash functions are known as error correction codes. Important sub-class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.

Audio identification

For audio identification such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all identical items. Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences. Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room.

Rabin-Karp string search algorithm

Rabin-Karp string search algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

Origins of the term

The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to "chop and mix." Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953; the term hash came into use some ten years later.

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular division.

See also

References

External links

da:Hashfunktion de:Hash-Funktion es:Función hash fr:Fonction de hachage it:Hash he:פונקציה חד כיוונית lt:Maišos funkcija nl:Hashfunctie ja:ハッシュ関数 pl:Funkcja haszująca pt:Hash ru:Хэш-функция sl:Sekljalna funkcija uk:Хешувальна функція zh:散列函數

Argan Oil
Natural Skin Care
Organic Skin Care
visitor stats