How To
Aug 6, 2024

Easy Fuzzy Matching: Calculate Levenshtein Distance Without Code

If you’ve ever found yourself trying to match up slightly different text strings and done a bit of research you’ve likely run across Levenshtein distance, aka edit distance. It can be a fantastic tool for fuzzy matching, helping you figure out how similar (or different) two strings are by counting the number of changes needed to make one string match another.

What is Levenshtein Distance?

Before we dive into the details, let's clarify what Levenshtein distance actually is. In simple terms, it’s a measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.

For example, the Levenshtein distance between "kitten" and "sitting" is 3:

kitten -> sitten (substitution of 'k' with 's')
sitten -> sittin (substitution of 'e' with 'i')
sittin -> sitting (insertion of 'g')

As you can see, a low distance indicates that two strings are very similar, requiring only a few edits to match each other.

Conversely, a high distance suggests that the strings are quite different. For instance, if you're working with a customer database and trying to merge records, you might use Levenshtein distance to identify near matches. If two entries like "John Doe" and "Jon Doe" have a low distance (e.g., 1), it’s a good sign they refer to the same person despite a minor typo.

On the other hand, "John Doe" and "Jane Smith" would have a high distance, indicating they are different individuals. It’s by no means fool-proof, but by setting a threshold for acceptable distances, you can automate the process of identifying near-duplicate records, significantly improving data quality and consistency.

Why Use Levenshtein Distance?

Levenshtein distance is popular in many use cases:

  • Data Cleaning: Ensuring data consistency by identifying and correcting typos.
  • Record Linking: Matching records from different databases with slight variations in names or addresses.
  • Spell Checkers: Suggesting corrections for misspelled words.
  • Search Engines: Improving search results by matching user queries with relevant results despite spelling errors.

Traditional Approaches to Calculating Levenshtein Distance

Typically, calculating Levenshtein distance involves a fair bit of coding, often in languages like Python or R. For instance, you might use libraries such as `difflib` in Python to get the job done. But not everyone is comfortable writing code, and even for those who are, it can be time-consuming.

When it comes to spreadsheets, the task becomes even more daunting. Excel and Google Sheets don't have built-in functions for Levenshtein distance, so you’d need to write complex formulas or use VBA scripts. This can be pretty intimidating and cumbersome, especially if you’re not a seasoned coder.

Gigasheet Formulas: The No-Code Solution, That Works at Scale

Thankfully, there’s an easier way. Gigasheet has made it super simple to calculate Levenshtein distance without writing a single line of code. Here's how it works:

Use Gigasheet to Calculate Levenshtein Distance Online

1. Load Your Data

Start by loading your dataset in Gigasheet. This could be a CSV, Parquet, or Excel file or data from popular sources like Databricks, Snowflake, or Salesforce.

2. Use the Formula

Gigasheet provides a built-in formula for calculating Levenshtein distance. Simply select the columns you want to compare and apply the formula. 

LEVENSHTEIN()

3. Instant Results

Voila! You’ll get the Levenshtein distance between the entries in your columns instantly.

Practical Applications of Edit Distance 

Once you have the Levenshtein distances, you can use this information in various ways in Gigasheet:

  • Data Deduplication: in addition to Gigasheet’s built-in deduplication, you can use this to Identify and merge duplicate records that have slight variations in text (versus exact matches).
  • Improved Joins: Enhance your matching across sheets with VLOOKUP. Useful for cleaning up customer contacts, product listings, etc.
  • Quality Control: Ensure data quality by flagging entries that exceed a certain distance threshold, indicating potential errors.

Final Thoughts on the Levenshtein Distance

The era of battling with intricate coding and spreadsheet acrobatics to determine Levenshtein distance is over. Gigasheet makes it incredibly easy to perform fuzzy matching and clean up your data. Whether you’re a data analyst, a marketer, or just someone who works with data regularly, this tool, along with numerous others within Gigasheet can save you a ton of time and hassle. Sign up for free!

Got any questions or need help getting started? Hit us up in the in-app chat and we'll be happy to assist!

The ease of a spreadsheet with the power of a data warehouse.

No Code
No Training
No Installation
Sign Up, Free

Similar posts

By using this website, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.