An Introduction to Fuzzy Matching

6 Dec

What is fuzzy matching?

Fuzzy Matching, also known as Approximate String Matching, is a technique to identify whether two strings are similar but not the same. Some everyday uses of fuzzy matching include: spell check, auto-correct, spam filtering, record linkage, and address matching.

How does it work?

In deterministic data matching the result is boolean (1 or 0) and in probabilistic data matching the likelihood of whether the record matches is calculated. If you use a fuzzy address matching algorithm then you can set an acceptance criteria. Altering this threshold can help import the accuracy of your results and reduce false positives.

Levenshtein Distance

Levenshtein Distance, also known as the minimum edit distance, is the distance between two strings in terms of edits. One edit can be the addition/removal or substitution of a letter between two strings, see the below example:

Soundex Algorithm

Soundex is amongst the early algorithms designed for phonetics based matching. It is most commonly used for genealogical database searches.

The Soundex algorithm outputs a four digit code given a name. Because of this four digit restriction, Soundex has an issue with high cardinality data. It supports up to 26,000 different patterns so if your data has more than this number of unique values you will end up with two very different words having a code collision.

Soundex is also restricted in that it only analyses the initial four ‘sounds’ of a string. Consider “Gaurav Gupta”, it is composed of 6 distinct sounds: ‘gau’,’r’,’av’,’gu’,’p’,’ta’. Soundex will examine the first four sounds and return a string ‘G612’; whereas ‘Gaurav’ only has three sounds so the fourth digit is empty and the returned code is ‘G610’. So the Soundex value for “Gaurav Gupta” vs “Gaurav” is 75%. However in row 4 it can be seen that Soundex struggles to differentiate between two individuals with the same first name ‘Michael’ but very different surnames which happen to start with the same letter.

Original	Result	Change	Distance
Winds	Wind	Removal of ‘S’	1
Wind	Windy	Addition of ‘y’	1
Wild	Wind	Substitution of ‘l’ to ‘n’	1
Wild	Windy	Addition of ‘y’ Substitution of ‘l’ to ‘n’	2

Name 1	Name 2	Levenshtein	Soundex
Stuart	Stewart	3	100
Abdul Rasheed	Abdul Rashid	2	100
Gaurav Gupta	Gaurav	6	75
Michael Smith	Michael Sutton	4	100

Metaphones

Metaphone is another phonetic algorithm similar to Soundex, however it considers the entire string and not just the first few sounds. The code length also has no restrictions so it can deal with a large pool of words without any problems.

However, Soundex can ‘outperform’ Metaphone when it comes to missing surnames or where there are differences at the end of a string. In this case you can see that Soundex is more confident of matching “Rebecca Smith-Jones” and “Rebecca Smith” whereas Metaphone is less sure.

    Name 1
    Name 2
    Levenshtein
    Soundex
    Metaphone
  


    Stuart 
    Stewart
    3
    100
    100
  

    Abdul Rasheed
    Abdul Rashid
    2
    100
    100
  

    Gaurav Gupta
    Gaurav
    6
    75
    67
  

    Michael Smith
    Michael Sutton
    4
    100
    67
  

    Rebecca Smith-Jones
    Rebecca Smith
    6
    100
    80
  

If you wish to conduct your own research into the fuzzy matching discussed in this article, then please find the code I used below. I hope it gives you as much enjoyment to work with as I had experimenting with it.

References

Gupta, M. (2020) Phonetics based Fuzzy string matching algorithms, url: https://medium.com/data-science-in-your-pocket/phonetics-based-fuzzy-string-matching-algorithms-8399aea04718

Code

# -*- coding: utf-8 -*-

"""

Created on Fri Oct 8 15:27:43 2021

@author: DavidGlover

"""

import Levenshtein as lev

import jellyfish

import phonetics

from fuzzywuzzy import fuzz

# Get Levenshtein distance

print(lev.distance('Windy', 'Wind'))

print(lev.distance('Wind', 'Windy'))

print(lev.distance('Wild', 'Wind'))

print(lev.distance('Wild', 'Windy'))

print(lev.distance('Stuart ', 'Stewart'))

print(lev.distance('Abdul Rashid', 'Abdul Rasheed'))

print(lev.distance('Gaurav Gupta', 'Gaurav'))

print(lev.distance('Michael Smith', 'Michael Sutton'))

print(lev.distance('Rebecca Smith-Jones', 'Rebecca Smith'))

# Get Soundex for the same names

name1 = jellyfish.soundex('Stuart')

name2 = jellyfish.soundex('Stewart')

print(fuzz.ratio(name1,name2))

name3 = jellyfish.soundex('Abdul Rashid')

name4 = jellyfish.soundex('Abdul Rasheed')

print(fuzz.ratio(name3,name4))

name5 = jellyfish.soundex('Gaurav Gupta')

name6 = jellyfish.soundex('Gaurav')

print(fuzz.ratio(name5,name6))

name7 = jellyfish.soundex('Michael Smith')

name8 = jellyfish.soundex('Michael Sutton')

print(fuzz.ratio(name7,name8))

# Get Soundex for the same names

name1 = phonetics.metaphone('Stuart')

name2 = phonetics.metaphone('Stewart')

print(fuzz.ratio(name1,name2))

name3 = phonetics.metaphone('Abdul Rashid')

name4 = phonetics.metaphone('Abdul Rasheed')

print(fuzz.ratio(name3,name4))

name5 = phonetics.metaphone('Gaurav Gupta')

name6 = phonetics.metaphone('Gaurav')

print(fuzz.ratio(name5,name6))

name7 = phonetics.metaphone('Michael Smith')

name8 = phonetics.metaphone('Michael Sutton')

print(fuzz.ratio(name7,name8))

print(fuzz.ratio('Rebecca Smith-Jones','Rebecca'))

name1 = jellyfish.soundex('Rebecca Smith-Jones')

name2 = jellyfish.soundex('Rebecca')

print(fuzz.ratio(name1,name2))

name3 = phonetics.metaphone('Rebecca Smith-Jones')

name4 = phonetics.metaphone('Rebecca Smith')

print(fuzz.ratio(name3,name4))

Open SourceData ScienceData AnalysisData Quality

David Dancey

An Introduction to Fuzzy Matching

What is fuzzy matching?

How does it work?

Levenshtein Distance

Soundex Algorithm

Metaphones

References

Code

The Affliction of the Data Analyst

Climber to Coder