This blog post provides a comprehensive overview of the calculation of the Levenshtein Distance, presenting a clear progression from a basic recursive approach to more sophisticated dynamic programming Python implementations. If you find my content helpful, consider following me on LinkedIn and Twitter.
The Levenshtein distance is a widely used metric in text analysis that measures the minimum number of single-character edits - insertions, deletions, or substitutions - needed to change one string into another. This post explores three Python implementations for finding the Levenshtein Distance, having different space and time complexity. This distance value has applications in various fields, such as natural language processing and DNA sequencing.
Let us first consider a recursive implementation for calculating the Levenshtein distance. Denote by m and n the lengths of the two strings.
# O(3^(n+m)) time
# O(m+n) space
def leven(s1, s2):
if not s1: return len(s2)
if not s2: return len(s1)
if s1[0] == s2[0]:
return leven(s1[1:], s2[1:])
return 1+min(leven(s1[1:], s2),
leven(s1, s2[1:]),
leven(s1[1:], s2[1:]))
In this implementation:
- If one string is empty, the distance is the length of the other string (all insertions or deletions)
- If the first characters of both strings are the same, move to the next character without increasing the count
- If the first characters are different, recursively check all possibilities, insertion, deletion, and substitution.
Although this approach is straightforward, it is inefficient for longer strings due to its exponential time complexity.
To optimize the time complexity of the recursive approach, dynamic programming can be used as illustrated through the following implementation:
# O(n*m) time | O(n*m) space
def leven(s1, s2):
edit = [[x for x in range(len(s1)+1)] for y in range(len(s2)+1)]
for i in range(1, len(s2)+1):
edit[i][0] = edit[i-1][0]+1
for i in range(1, len(s2)+1):
for j in range(1, len(s1)+1):
if s2[i-1] == s1[j-1]:
edit[i][j] = edit[i-1][j-1]
else:
edit[i][j] = 1+min(edit[i-1][j-1], edit[i-1][j], edit[i][j-1])
return edit[-1][-1]
This implementation:
- Initializes a matrix (edit
) where edit[i][j]
represents the Levenshtein Distance between the first i
characters of s2
and the first j
characters of s1
.
- Iteratively fills the matrix, determining the minimum number of operations required at each step.
- Returns the final value edit[-1][-1]
, which gives the Levenshtein Distance for the entire strings.
This solution reduces the time complexity to O(n*m) but requires O(n*m) space.
The below-enclosed optimized version further reduces space complexity:
# O(n*m) time
# O(min(n, m)) space
def leven(s1, s2):
# Swap strings if needed
if len(s1) > len(s2):
s1, s2 = s2, s1
# Previous row
p_row = range(len(s1)+1)
for i, c1 in enumerate(s2):
# Start a new row
c_row = [i+1]
for j, c2 in enumerate(s1):
# Calculate costs
insertions = p_row[j+1]+1
deletions = c_row[j]+1
substitutions = p_row[j]+(c1 != c2)
c_row.append(min(insertions, deletions, substitutions))
# Move to the next row
p_row = c_row
return p_row[-1]
In this implementation:
- Space complexity is reduced to O(min(n, m)) by using only two arrays (c_row
and p_row
) instead of a full matrix.
- The arrays are iteratively updated to reflect the current and previous rows of the conceptual matrix.
- The algorithm retains the O(n*m) time complexity but is significantly more space-efficient, especially for large strings.
The Levenshtein Distance is a versatile tool in text analysis. While the recursive approach is conceptually simple, it is inefficient for large strings. The basic dynamic programming solution improves time complexity, and the optimized version offers significant space efficiency. Note that the optimal dynamic programming solution can be downloaded from my GitHub repo.