|
Soundex
algorithm: method
(from Wikipedia)
Soundex is a phonetic algorithm for indexing names
by their sound when pronounced in English. The basic aim is for names
with the same pronunciation to be encoded to the same string so that
matching can occur despite minor differences in spelling. Soundex is
the most widely known of all phonetic algorithms and is often used
(incorrectly) as a synonym for "phonetic algorithm".
Soundex was developed by Robert Russell and
Margaret Odell and patented in 1918 and 1922. A variation called
American Soundex was used in the 1930s for a retrospective analysis of
the US censuses from 1890 through 1920. The Soundex code came to
prominence in the 1960s when it was the subject of several articles in
the Communications and Journal of the Association for Computing
Machinery (CACM and JACM), and especially when described in Donald
Knuth's magnum opus, The Art of Computer Programming.
The Soundex code for a name consists of a letter
followed by three digits: the letter is the first letter of the name,
and the digits encode the remaining consonants. Similar-sounding
consonants share the same digit so, for example, the labial B, F, P
and V are all encoded as 1. Vowels can affect the coding, but are never
coded directly unless they appear at the start of the name.
The exact algorithm is as follows:
1. Retain the first letter of the string
2. Remove every occurrence of the following letters, except where one is
the first letter: a, e, h, i, o, u, w, y
3. Assign numbers to the remaining letters (after the first) as follows:
| number | letter                 |
|--------|------------------------|
| 1      | B, F, P, V             |
| 2      | C, G, J, K, Q, S, X, Z |
| 3      | D, T                   |
| 4      | L                      |
| 5      | M, N                   |
| 6      | R                      |
4. If two or more letters with the same number
were adjacent in the original name (before step 1), or adjacent except
for any intervening h and w, then omit all but the first.
5. Return the first four characters, padding on the right with 0 if
fewer than four remain.
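
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation, and it rests on a few assumptions not stated in the text: the function name soundex and the table name CODES are made up for this example, characters outside A-Z are simply dropped, an input with no letters yields an empty string, and the first letter's own number takes part in the adjacent-duplicate rule of step 4 (so the F in "Pfister" is dropped because P has the same number).

```python
# Number assigned to each coded consonant, following the table in step 3.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" if name has no A-Z letters."""
    # Assumption: anything that is not an ASCII letter is ignored.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    first = letters[0]
    code = first                  # step 1: retain the first letter
    prev = CODES.get(first, "")   # number of the previous letter, "" if uncoded

    for c in letters[1:]:
        if c in "HW":
            # Step 4: h and w do not separate letters with the same number,
            # so prev is left unchanged.
            continue
        digit = CODES.get(c, "")  # vowels and y have no number (step 2)
        if digit and digit != prev:
            code += digit         # step 3, skipping adjacent repeats (step 4)
        prev = digit              # a vowel or y resets prev, ending a run

    # Step 5: first four characters, right-padded with zeros.
    return (code + "000")[:4]


if __name__ == "__main__":
    for example in ("Robert", "Rupert", "Ashcraft", "Pfister"):
        print(example, soundex(example))
```

Under these assumptions, "Robert" and "Rupert" both encode to R163, "Ashcraft" to A261 (the S and C are merged across the intervening H), and "Pfister" to P236. Handling step 4 in the same pass as steps 2 and 3 avoids having to remember which letters were adjacent before the vowels were removed.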
|