An Introduction to Fuzzy Matching
What is fuzzy matching?
Fuzzy Matching, also known as Approximate String Matching, is a technique to identify whether two strings are similar but not the same. Some everyday uses of fuzzy matching include: spell check, auto-correct, spam filtering, record linkage, and address matching.
How does it work?
In deterministic data matching the result is boolean (1 or 0) and in probabilistic data matching the likelihood of whether the record matches is calculated. If you use a fuzzy address matching algorithm then you can set an acceptance criteria. Altering this threshold can help import the accuracy of your results and reduce false positives.
Levenshtein Distance
Levenshtein Distance, also known as the minimum edit distance, is the distance between two strings in terms of edits. One edit can be the addition/removal or substitution of a letter between two strings, see the below example:
Original | Result | Change | Distance | |
---|---|---|---|---|
Winds | Wind | Removal of ‘S’ | 1 | |
Wind | Windy | Addition of ‘y’ | 1 | |
Wild | Wind | Substitution of ‘l’ to ‘n’ | 1 | |
Wild | Windy | Addition of ‘y’ Substitution of ‘l’ to ‘n’ |
2 |
Name 1 | Name 2 | Levenshtein | Soundex |
---|---|---|---|
Stuart | Stewart | 3 | 100 |
Abdul Rasheed | Abdul Rashid | 2 | 100 |
Gaurav Gupta | Gaurav | 6 | 75 |
Michael Smith | Michael Sutton | 4 | 100 |
Metaphones
Metaphone is another phonetic algorithm similar to Soundex, however it considers the entire string and not just the first few sounds. The code length also has no restrictions so it can deal with a large pool of words without any problems.
However, Soundex can ‘outperform’ Metaphone when it comes to missing surnames or where there are differences at the end of a string. In this case you can see that Soundex is more confident of matching “Rebecca Smith-Jones” and “Rebecca Smith” whereas Metaphone is less sure.
Name 1 | Name 2 | Levenshtein | Soundex | Metaphone |
---|---|---|---|---|
Stuart | Stewart | 3 | 100 | 100 |
Abdul Rasheed | Abdul Rashid | 2 | 100 | 100 |
Gaurav Gupta | Gaurav | 6 | 75 | 67 |
Michael Smith | Michael Sutton | 4 | 100 | 67 |
Rebecca Smith-Jones | Rebecca Smith | 6 | 100 | 80 |
If you wish to conduct your own research into the fuzzy matching discussed in this article, then please find the code I used below. I hope it gives you as much enjoyment to work with as I had experimenting with it.
References
Gupta, M. (2020) Phonetics based Fuzzy string matching algorithms, url: https://medium.com/data-science-in-your-pocket/phonetics-based-fuzzy-string-matching-algorithms-8399aea04718
Code
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 8 15:27:43 2021
@author: DavidGlover
"""
import Levenshtein as lev
import jellyfish
import phonetics
from fuzzywuzzy import fuzz
# Get Levenshtein distance
print(lev.distance('Windy', 'Wind'))
print(lev.distance('Wind', 'Windy'))
print(lev.distance('Wild', 'Wind'))
print(lev.distance('Wild', 'Windy'))
print(lev.distance('Stuart ', 'Stewart'))
print(lev.distance('Abdul Rashid', 'Abdul Rasheed'))
print(lev.distance('Gaurav Gupta', 'Gaurav'))
print(lev.distance('Michael Smith', 'Michael Sutton'))
print(lev.distance('Rebecca Smith-Jones', 'Rebecca Smith'))
# Get Soundex for the same names
name1 = jellyfish.soundex('Stuart')
name2 = jellyfish.soundex('Stewart')
print(fuzz.ratio(name1,name2))
name3 = jellyfish.soundex('Abdul Rashid')
name4 = jellyfish.soundex('Abdul Rasheed')
print(fuzz.ratio(name3,name4))
name5 = jellyfish.soundex('Gaurav Gupta')
name6 = jellyfish.soundex('Gaurav')
print(fuzz.ratio(name5,name6))
name7 = jellyfish.soundex('Michael Smith')
name8 = jellyfish.soundex('Michael Sutton')
print(fuzz.ratio(name7,name8))
# Get Soundex for the same names
name1 = phonetics.metaphone('Stuart')
name2 = phonetics.metaphone('Stewart')
print(fuzz.ratio(name1,name2))
name3 = phonetics.metaphone('Abdul Rashid')
name4 = phonetics.metaphone('Abdul Rasheed')
print(fuzz.ratio(name3,name4))
name5 = phonetics.metaphone('Gaurav Gupta')
name6 = phonetics.metaphone('Gaurav')
print(fuzz.ratio(name5,name6))
name7 = phonetics.metaphone('Michael Smith')
name8 = phonetics.metaphone('Michael Sutton')
print(fuzz.ratio(name7,name8))
print(fuzz.ratio('Rebecca Smith-Jones','Rebecca'))
name1 = jellyfish.soundex('Rebecca Smith-Jones')
name2 = jellyfish.soundex('Rebecca')
print(fuzz.ratio(name1,name2))
name3 = phonetics.metaphone('Rebecca Smith-Jones')
name4 = phonetics.metaphone('Rebecca Smith')
print(fuzz.ratio(name3,name4))