This blog post provides a comprehensive overview of the calculation of the Levenshtein Distance, presenting a clear progression from a basic recursive approach to more sophisticated dynamic programming Python implementations. If you find my content helpful, consider following me on LinkedIn and Twitter.
The Levenshtein distance is a widely used metric in text analysis that measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one string into another. It has applications in various fields, such as natural language processing and DNA sequencing. This post explores three Python implementations for computing it, each with different time and space complexity.
Let us first consider a recursive implementation for calculating the Levenshtein distance. Denote by m and n the lengths of the two strings.
In this implementation:
- If one string is empty, the distance is the length of the other string (all insertions or deletions)
- If the first characters of both strings are the same, move to the next character without increasing the count
- If the first characters are different, recursively evaluate all three possibilities (insertion, deletion, and substitution) and take one plus the minimum of the results.
Although this approach is straightforward, it is inefficient for longer strings due to its exponential time complexity.
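The three rules above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name `levenshtein_recursive` and the parameter names `s1` and `s2` are assumptions chosen for this example.

```python
def levenshtein_recursive(s1: str, s2: str) -> int:
    """Naive recursive Levenshtein distance (exponential time)."""
    # Base cases: if one string is empty, the distance is the length
    # of the other string (all insertions or all deletions).
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)

    # Matching first characters cost nothing; continue with the rest.
    if s1[0] == s2[0]:
        return levenshtein_recursive(s1[1:], s2[1:])

    # Otherwise try all three edits and keep the cheapest one.
    return 1 + min(
        levenshtein_recursive(s1, s2[1:]),      # insertion into s1
        levenshtein_recursive(s1[1:], s2),      # deletion from s1
        levenshtein_recursive(s1[1:], s2[1:]),  # substitution
    )


if __name__ == "__main__":
    print(levenshtein_recursive("kitten", "sitting"))
```

Each call can branch into three further calls, and the same pairs of suffixes are recomputed many times, which is where the exponential running time comes from.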
To optimize the time complexity of the recursive approach, dynamic programming can be used as illustrated through the following implementation:
- Initializes a matrix (edit), where edit[i][j] represents the Levenshtein Distance between the first i characters of s2 and the first j characters of the other string.
- Iteratively fills the matrix, determining the minimum number of operations required at each step.
- Returns the final value edit[-1][-1], which gives the Levenshtein Distance for the entire strings.
This solution reduces the time complexity to O(n*m) but requires O(n*m) space.
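A minimal sketch of the matrix-based approach described above is given below. The names `edit` and `s2` follow the text; the function name `levenshtein_matrix` and the parameter name `s1` are assumptions made for this example.

```python
def levenshtein_matrix(s1: str, s2: str) -> int:
    """Levenshtein distance using a full (n + 1) x (m + 1) matrix."""
    m, n = len(s1), len(s2)

    # edit[i][j] holds the distance between the first i characters
    # of s2 and the first j characters of s1.
    edit = [[0] * (m + 1) for _ in range(n + 1)]

    # Turning an empty prefix into a prefix of length k takes k edits.
    for j in range(m + 1):
        edit[0][j] = j
    for i in range(n + 1):
        edit[i][0] = i

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substitution is free when the characters already match.
            cost = 0 if s2[i - 1] == s1[j - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,         # skip a character of s2
                edit[i][j - 1] + 1,         # skip a character of s1
                edit[i - 1][j - 1] + cost,  # match or substitute
            )

    return edit[-1][-1]


if __name__ == "__main__":
    print(levenshtein_matrix("kitten", "sitting"))
```

Every cell is computed exactly once from three neighbouring cells, which gives the O(n*m) time and space bounds.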
The following optimized version further reduces the space complexity:
In this implementation:
- Space complexity is reduced to O(min(n, m)) by using only two arrays (the current row and p_row, the previous row) instead of a full matrix.
- The arrays are iteratively updated to reflect the current and previous rows of the conceptual matrix.
- The algorithm retains the O(n*m) time complexity but is significantly more space-efficient, especially for large strings.
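The two-row idea can be sketched as follows. The name `p_row` comes from the text; the function name `levenshtein_two_rows`, the parameter names, and the name `c_row` for the current row are assumptions made for this example. Swapping the strings so that the rows span the shorter one is one way to reach the O(min(n, m)) bound.

```python
def levenshtein_two_rows(s1: str, s2: str) -> int:
    """Levenshtein distance keeping only two rows of the matrix."""
    # Make s1 the shorter string so each row has min(n, m) + 1 entries.
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    # Row 0 of the conceptual matrix: distance from the empty prefix.
    p_row = list(range(len(s1) + 1))

    for i, ch2 in enumerate(s2, start=1):
        # First column: i deletions turn a prefix of s2 into "".
        c_row = [i]
        for j, ch1 in enumerate(s1, start=1):
            cost = 0 if ch1 == ch2 else 1
            c_row.append(
                min(
                    p_row[j] + 1,         # cell above
                    c_row[j - 1] + 1,     # cell to the left
                    p_row[j - 1] + cost,  # diagonal cell
                )
            )
        # The current row becomes the previous row for the next step.
        p_row = c_row

    return p_row[-1]


if __name__ == "__main__":
    print(levenshtein_two_rows("kitten", "sitting"))
```

Because each cell depends only on the row above and the cell to its left, older rows can be discarded as soon as the next row has been filled.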
The Levenshtein Distance is a versatile tool in text analysis. While the recursive approach is conceptually simple, it is inefficient for large strings. The basic dynamic programming solution improves time complexity, and the optimized version offers significant space efficiency. Note that the optimal dynamic programming solution can be downloaded from my GitHub repo.