

  
History of Soundex
   

What is Soundex?

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for names with the same pronunciation to be encoded to the same representation so that they can be matched despite minor differences in spelling. Soundex is the most widely known of all phonetic algorithms, and its name is often used, incorrectly, as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.

History

Soundex was developed by Robert Russell and Margaret Odell and patented in 1918 and 1922 (U.S. Patent 1,261,167 and U.S. Patent 1,435,663). A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s, when it was the subject of several articles in the Communications of the Association for Computing Machinery and the Journal of the Association for Computing Machinery (CACM and JACM). The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government. These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".

Limitations

Names that sound alike do not always have the same Soundex code. For example, Lee (L000) and Leigh (L200) are pronounced identically but have different Soundex codes because the silent g in Leigh is given a code. Names that sound alike but start with different letters will always have different Soundex codes. Thus, names such as Carr (C600) and Karr (K600) must be looked up separately.

Soundex is based on English pronunciation, so names from other European languages may not be encoded according to their pronunciation. For example, some French surnames with silent final letters are coded as if those letters were sounded. This is the case with Beaux, where the x is silent. The surname is sometimes also spelled Beau (B000), which is pronounced identically to Beaux (B200), yet the two have different Soundex codes.

Conversely, names that do not sound alike sometimes share a Soundex code. When searching for the surname Powers (P620), you have to wade through Pierce, Price, Perez and Park, which all have the same code. Yet Power (P600), a common spelling of Powers 100 years ago, has a different Soundex code.

Surnames with prefixes were usually coded without the prefix, but not always. If you are searching for a surname such as DiCaprio or LaBianca, try the Soundex both with and without the prefix.

In the US Census Soundex, confusion arises with names such as Ashcraft. If the original coder skipped the H and did not treat it as a separator between S and C, which share the same code, then S and C were treated as adjacent letters and coded only once, giving A261. In the 1920 NY Census, Ashcraft is found under A261. Those who coded the Soundex for the 1880, 1900 and 1910 censuses may or may not have used this rule. They sometimes treated the H as a separator and gave S and C a digit each instead of coding them once. In that case Ashcraft would be A226.

The important thing to know is that the US Census was not consistent in treating the letters H and W as separators between adjacent letters. If you are trying to locate a name in which W or H separates two letters with the same code, it is best to calculate the Soundex both ways. This applies to any name that has any of the letters C, S, G, J, K, Q, X, Z on both sides of an H or W, such as SHC, SHS, CHS, KHZ, SWS, KWS, CWK.

A surname of more than one word, or a surname that customarily comes before the given name, as with some Native American and Chinese names, may have been coded under the name that appears last, even though that might not be the actual surname. In the case of multi-word surnames, only the last word may have been coded.
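To make the H/W ambiguity concrete, here is a minimal Python sketch that computes a code under both conventions so that both can be searched. The function name `census_soundex` and the `hw_separates` flag are hypothetical names introduced for this illustration. The logic is an approximation of the two coding habits described above, not the official NARA rule set.

```python
# Consonant groups, using the letter groups listed in this article.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit


def census_soundex(name, hw_separates):
    """Return a 4-character code (hypothetical helper).

    hw_separates=True  : H and W break up a run of same-coded letters.
    hw_separates=False : H and W are skipped, so same-coded letters on
                         either side of them are coded only once.
    """
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")  # vowels, H, W, Y have no code
    for ch in letters[1:]:
        if ch in "HW" and not hw_separates:
            continue  # skipped: the previous code stays in effect
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


for name in ("Ashcraft", "Powers"):
    both = {census_soundex(name, True), census_soundex(name, False)}
    print(name, sorted(both))
# Ashcraft ['A226', 'A261']
# Powers ['P620']
```

A name such as Powers gives the same code either way, so only names with same-coded letters around an H or W need the second lookup.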

Rules

The Soundex code for a name consists of a letter followed by three digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar-sounding consonants share the same digit so, for example, the labial consonants B, F, P, and V are all encoded as 1. Vowels can affect the coding, but are never coded directly unless they appear at the start of the name.

The exact algorithm is as follows:

A.  Remember the initial letter.

B.  Convert each letter (including the first) according to the following table. Ignore punctuation such as apostrophes, spaces and hyphens.

0 = A, E, I, O, U, W, Y, H

1 = B, P, F, V

2 = C, S, K, G, J, Q, X, Z

3 = D, T 

4 = L

5 = M, N  

6 = R

C.  Collapse each run of consecutive duplicate digits into a single digit, e.g. change 22 to 2.

D.  Replace the first digit by the letter remembered in step A.

E.  Remove all zeros from the string.

F.  Adjust to four characters by truncating or padding to the right with zeros.

The resulting four-character code is the Simplified Soundex for that name.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150".
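The steps above can be sketched in Python as follows. This is a minimal illustration, not an official implementation. The function name `simplified_soundex` is hypothetical, and X is placed in group 2, consistent with the letters listed in the Limitations section.

```python
import re

# Letter-to-digit table from the Rules section (X assumed to be in group 2).
DIGIT_FOR = {}
for digit, letters in (("0", "AEIOUWYH"), ("1", "BPFV"), ("2", "CSKGJQXZ"),
                       ("3", "DT"), ("4", "L"), ("5", "MN"), ("6", "R")):
    for letter in letters:
        DIGIT_FOR[letter] = digit


def simplified_soundex(name):
    """Encode a name with the Simplified Soundex steps (hypothetical helper)."""
    # Keep only A-Z; this drops apostrophes, spaces and hyphens.
    letters = [c for c in name.upper() if c in DIGIT_FOR]
    if not letters:
        return ""
    first = letters[0]                                # remember initial letter
    digits = "".join(DIGIT_FOR[c] for c in letters)   # convert every letter
    digits = re.sub(r"(\d)\1+", r"\1", digits)        # collapse duplicate runs
    code = first + digits[1:]                         # restore initial letter
    code = code[0] + code[1:].replace("0", "")        # remove all zeros
    return (code + "000")[:4]                         # pad or truncate to four


print(simplified_soundex("Robert"))  # R163
print(simplified_soundex("Rupert"))  # R163
print(simplified_soundex("Rubin"))   # R150
```

Because H and W are converted to 0 before duplicates are collapsed, this variant treats them as separators, so Ashcraft comes out as A226 rather than A261.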

Soundex Variants

A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.

The NYSIIS algorithm was introduced by the New York State Identification and Intelligence System as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.

As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm for the same purpose. Philips later developed an improvement to Metaphone, which he called Double-Metaphone. Double-Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.

Daitch-Mokotoff Soundex (D-M Soundex) was developed by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D-M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex", although the authors discourage the use of these nicknames. The D-M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Results of D-M Soundex are returned in an all-numeric format between 100000 and 999999. This algorithm is much more complex than Russell Soundex.

Soundex Calculator

This calculator is based on the coding used in the censuses of 1880, 1900 and 1910, which is also currently used in calculating Driver's License numbers.
     

 
