An Introduction to Fuzzy Matching

What is fuzzy matching?

Fuzzy Matching, also known as Approximate String Matching, is a technique to identify whether two strings are similar but not the same. Some everyday uses of fuzzy matching include: spell check, auto-correct, spam filtering, record linkage, and address matching.

How does it work?

In deterministic data matching the result is boolean (1 or 0) and in probabilistic data matching the likelihood of whether the record matches is calculated. If you use a fuzzy address matching algorithm then you can set an acceptance criteria. Altering this threshold can help import the accuracy of your results and reduce false positives. 

Levenshtein Distance

Levenshtein Distance, also known as the minimum edit distance, is the distance between two strings in terms of edits. One edit can be the addition/removal or substitution of a letter between two strings, see the below example:

Soundex Algorithm

Soundex is amongst the early algorithms designed for phonetics based matching. It is most commonly used for genealogical database searches.

The Soundex algorithm outputs a four digit code given a name. Because of this four digit restriction, Soundex has an issue with high cardinality data. It supports up to 26,000 different patterns so if your data has more than this number of unique values you will end up with two very different words having a code collision.

Soundex is also restricted in that it only analyses the initial four ‘sounds’ of a string. Consider “Gaurav Gupta”, it is composed of 6 distinct sounds: ‘gau’,’r’,’av’,’gu’,’p’,’ta’. Soundex will examine the first four sounds and return a string ‘G612’; whereas ‘Gaurav’ only has three sounds so the fourth digit is empty and the returned code is ‘G610’. So the Soundex value for “Gaurav Gupta” vs “Gaurav” is 75%. However in row 4 it can be seen that Soundex struggles to differentiate between two individuals with the same first name ‘Michael’ but very different surnames which happen to start with the same letter.

Original Result Change Distance
Winds Wind Removal of ‘S’ 1
Wind Windy Addition of ‘y’ 1
Wild Wind Substitution of ‘l’ to ‘n’ 1
Wild Windy Addition of ‘y’
Substitution of ‘l’ to ‘n’
2
Name 1 Name 2 Levenshtein Soundex
Stuart Stewart 3 100
Abdul Rasheed Abdul Rashid 2 100
Gaurav Gupta Gaurav 6 75
Michael Smith Michael Sutton 4 100

Metaphones

Metaphone is another phonetic algorithm similar to Soundex, however it considers the entire string and not just the first few sounds. The code length also has no restrictions so it can deal with a large pool of words without any problems.

However, Soundex can ‘outperform’ Metaphone when it comes to missing surnames or where there are differences at the end of a string. In this case you can see that Soundex is more confident of matching “Rebecca Smith-Jones” and “Rebecca Smith” whereas Metaphone is less sure. 

Name 1 Name 2 Levenshtein Soundex Metaphone
Stuart Stewart 3 100 100
Abdul Rasheed Abdul Rashid 2 100 100
Gaurav Gupta Gaurav 6 75 67
Michael Smith Michael Sutton 4 100 67
Rebecca Smith-Jones Rebecca Smith 6 100 80

If you wish to conduct your own research into the fuzzy matching discussed in this article, then please find the code I used below. I hope it gives you as much enjoyment to work with as I had experimenting with it.

References

Gupta, M. (2020) Phonetics based Fuzzy string matching algorithms, url: https://medium.com/data-science-in-your-pocket/phonetics-based-fuzzy-string-matching-algorithms-8399aea04718

Code

# -*- coding: utf-8 -*-

"""

Created on Fri Oct  8 15:27:43 2021

@author: DavidGlover

"""

import Levenshtein as lev

import jellyfish

import phonetics

from fuzzywuzzy import fuzz

# Get Levenshtein distance

print(lev.distance('Windy', 'Wind'))

print(lev.distance('Wind', 'Windy'))

print(lev.distance('Wild', 'Wind'))

print(lev.distance('Wild', 'Windy'))

print(lev.distance('Stuart ', 'Stewart'))

print(lev.distance('Abdul Rashid', 'Abdul Rasheed'))

print(lev.distance('Gaurav Gupta', 'Gaurav'))

print(lev.distance('Michael Smith', 'Michael Sutton'))

print(lev.distance('Rebecca Smith-Jones', 'Rebecca Smith'))

# Get Soundex for the same names

name1 = jellyfish.soundex('Stuart')

name2 = jellyfish.soundex('Stewart')

print(fuzz.ratio(name1,name2))

name3 = jellyfish.soundex('Abdul Rashid')

name4 = jellyfish.soundex('Abdul Rasheed')

print(fuzz.ratio(name3,name4))

name5 = jellyfish.soundex('Gaurav Gupta')

name6 = jellyfish.soundex('Gaurav')

print(fuzz.ratio(name5,name6))

name7 = jellyfish.soundex('Michael Smith')

name8 = jellyfish.soundex('Michael Sutton')

print(fuzz.ratio(name7,name8))

# Get Soundex for the same names

name1 = phonetics.metaphone('Stuart')

name2 = phonetics.metaphone('Stewart')

print(fuzz.ratio(name1,name2))

name3 = phonetics.metaphone('Abdul Rashid')

name4 = phonetics.metaphone('Abdul Rasheed')

print(fuzz.ratio(name3,name4))

name5 = phonetics.metaphone('Gaurav Gupta')

name6 = phonetics.metaphone('Gaurav')

print(fuzz.ratio(name5,name6))

name7 = phonetics.metaphone('Michael Smith')

name8 = phonetics.metaphone('Michael Sutton')

print(fuzz.ratio(name7,name8))

print(fuzz.ratio('Rebecca Smith-Jones','Rebecca'))

name1 = jellyfish.soundex('Rebecca Smith-Jones')

name2 = jellyfish.soundex('Rebecca')

print(fuzz.ratio(name1,name2))

name3 = phonetics.metaphone('Rebecca Smith-Jones')

name4 = phonetics.metaphone('Rebecca Smith')

print(fuzz.ratio(name3,name4))

Previous
Previous

The Affliction of the Data Analyst

Next
Next

Climber to Coder