Rob's TIMSS Blog

My discoveries and ramblings of TIMSS/Personify.

Monday, March 13, 2006

SOUNDEX & De-duping

SOUNDEX is a phonetic coding used to group similar English names. It was developed in the early 1900s and used with Census data to group people with similar last names. Because the Census data was hand written and sometimes given by neighbors and those who couldn't spell, it was difficult to find people based purely on the spelling of a last name.

A SOUNDEX code consists of a letter and 3 numbers. The letter is the first letter of the name. The numbers represent letters that are pronounced similarly. Vowels and the letters H, W, and Y are excluded, and double letters are only coded once. The numbers are as follows:


1- B, F, P, V
2- C, G, J, K, Q, S, X, Z
3- D, T
4- L
5- M, N
6- R

If you run out of letters before getting 3 numbers, the rest are filled in with zeros.

Some examples are:

  • Washington-W252
  • Kaighn- K250
  • Lee- L000
  • Hall- H400
  • Smith- S530
  • Smyth- S530

There is a SOUNDEX function in both Oracle and SQL Server, or you can even go here to get the SOUNDEX of a name.

Since it is based on English pronunciation, it will work best with "English" names. And because nicknames are often substituted for first names that start with different letters (Bob & Robert), SOUNDEX should not be used on first names because the first character is so important in the SOUNDEX of name.

How is SOUNDEX connected to TIMSS? The de-duping process (CUS590) uses SOUNDEX to get groups of customers who have similar names. Then other comparisons are made against the customers in that group (same state, same gender, same SSN, etc) to see if they are a potential match. SOUNDEX is not for determining duplicates, it is only to put those with similarly spelled names together so that they can be further compared.

When running a process to find duplicates, we could just compare everyone to everyone. However, for most customers this is a waste of time since they are obviously not duplicates (i.e. Smith;John & Doe;John). To save effort, we can group similar customers first before doing comparisons.

These groups can be based on any characteristic of the customer. Last name is the obvious choice because everyone has one and it rarely has variations or changes (the spelling of my last name has never changed; a woman's usually only does related to marriage). Street addresses & email addresses do change (I've had 6 mailing addresses and used 7 email addresses in the last 15 years), and other characteristics like gender or SSN cast either to broad a net or not narrow enough (SSN would be a match, but I know I don't have it for all my customers and if it did match it's more likely a typo than a duplicate).

If we create groups for out comparisons based on last names being exactly the same, then we miss the most obvious reason duplicates are created: typos. When a name is entered a vowel is omitted or letters that sound alike are used (Smith & Smyth, or for me Kaighn & Kaign).

So if we group customers with similar last names first before comparing, we save ourselves a lot of effort by not comparing customers that obviously aren't duplicates, and we don't loose customers where the last names aren't an exact match.

Thus SOUNDEX is first used to generate groups of customers with similar names and then comparisons are done on those smaller groups. Putting 2 customers in the same group does not mean they are duplicates, it just means that they are similar enough to do comparisons on to see if they are duplicates based upon their name, address, gender, SSN, etc.

Below are some useful websites when you find people with similar last names:

http://www.census.gov/genealogy/www/namesearch.html - If you have 2 spellings of a name and are unsure which is correct you can search the 1990 Census for last names see which name is more common.

http://www.census.gov/genealogy/names/names_files.html - List all last names by frequency. If you have a very large customer database, you may want to de-dup a particular last name (i.e. SMITH, JOHNSON, WILLIAMS, JONES, BROWN, DAVIS, MILLER). Since the SOUNDEX would be the same for all these customers, that would be your group.

http://www.anywho.com/ - Look up people based upon address, or even do a reverse lookup on a phone number.

Applies to: TIMSS5 & TIMSS6

1 Comments:

At 5:51 AM, Blogger Rob Kaighn said...

This is a re-post. I added some more detail as to why SOUNDEX is used in the CUS590 process.

 

Post a Comment

<< Home