If you’ve ever found yourself trying to match up slightly different text strings and done a bit of research you’ve likely run across Levenshtein distance, aka edit distance. It can be a fantastic tool for fuzzy matching, helping you figure out how similar (or different) two strings are by counting the number of changes needed to make one string match another.
Before we dive into the details, let's clarify what Levenshtein distance actually is. In simple terms, it’s a measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.
For example, the Levenshtein distance between "kitten" and "sitting" is 3:
kitten -> sitten (substitution of 'k' with 's')
sitten -> sittin (substitution of 'e' with 'i')
sittin -> sitting (insertion of 'g')
As you can see, a low distance indicates that two strings are very similar, requiring only a few edits to match each other.
Conversely, a high distance suggests that the strings are quite different. For instance, if you're working with a customer database and trying to merge records, you might use Levenshtein distance to identify near matches. If two entries like "John Doe" and "Jon Doe" have a low distance (e.g., 1), it’s a good sign they refer to the same person despite a minor typo.
On the other hand, "John Doe" and "Jane Smith" would have a high distance, indicating they are different individuals. It’s by no means fool-proof, but by setting a threshold for acceptable distances, you can automate the process of identifying near-duplicate records, significantly improving data quality and consistency.
Levenshtein distance is popular in many use cases:
Typically, calculating Levenshtein distance involves a fair bit of coding, often in languages like Python or R. For instance, you might use libraries such as `difflib` in Python to get the job done. But not everyone is comfortable writing code, and even for those who are, it can be time-consuming.
When it comes to spreadsheets, the task becomes even more daunting. Excel and Google Sheets don't have built-in functions for Levenshtein distance, so you’d need to write complex formulas or use VBA scripts. This can be pretty intimidating and cumbersome, especially if you’re not a seasoned coder.
Thankfully, there’s an easier way. Gigasheet has made it super simple to calculate Levenshtein distance without writing a single line of code. Here's how it works:
Start by loading your dataset in Gigasheet. This could be a CSV, Parquet, or Excel file or data from popular sources like Databricks, Snowflake, or Salesforce.
Gigasheet provides a built-in formula for calculating Levenshtein distance. Simply select the columns you want to compare and apply the formula.
LEVENSHTEIN()
Voila! You’ll get the Levenshtein distance between the entries in your columns instantly.
Once you have the Levenshtein distances, you can use this information in various ways in Gigasheet:
The era of battling with intricate coding and spreadsheet acrobatics to determine Levenshtein distance is over. Gigasheet makes it incredibly easy to perform fuzzy matching and clean up your data. Whether you’re a data analyst, a marketer, or just someone who works with data regularly, this tool, along with numerous others within Gigasheet can save you a ton of time and hassle. Sign up for free!
Got any questions or need help getting started? Hit us up in the in-app chat and we'll be happy to assist!