Levenshtein Distance Calculation

Dec. 26, 2023 - 2382 views

3 min read

Wissam Fawaz

This blog post provides a comprehensive overview of the calculation of the Levenshtein Distance, presenting a clear progression from a basic recursive approach to more sophisticated dynamic programming Python implementations. If you find my content helpful, consider following me on LinkedIn and Twitter.

The Levenshtein distance is a widely used metric in text analysis that measures the minimum number of single-character edits - insertions, deletions, or substitutions - needed to change one string into another. This post explores three Python implementations for finding the Levenshtein Distance, having different space and time complexity. This distance value has applications in various fields, such as natural language processing and DNA sequencing.

 

1. Finding the Distance Recursively:

Let us first consider a recursive implementation for calculating the Levenshtein distance. Denote by m and n the lengths of the two strings. 

# O(3^(n+m)) time
# O(m+n) space
def leven(s1, s2):
  if not s1: return len(s2)
  if not s2: return len(s1)
  if s1[0] == s2[0]: 
      return leven(s1[1:], s2[1:])
  return 1+min(leven(s1[1:], s2),
  leven(s1, s2[1:]),
  leven(s1[1:], s2[1:]))

In this implementation: 

- If one string is empty, the distance is the length of the other string (all insertions or deletions)

- If the first characters of both strings are the same, move to the next character without increasing the count

- If the first characters are different, recursively check all possibilities, insertion, deletion, and substitution. 

Although this approach is straightforward, it is inefficient for longer strings due to its exponential time complexity. 

 

2. Suboptimal Dynamic Programming Solution:

To optimize the time complexity of the recursive approach, dynamic programming can be used as illustrated through the following implementation: 

# O(n*m) time | O(n*m) space
def leven(s1, s2):
  edit = [[x for x in range(len(s1)+1)] for y in range(len(s2)+1)]
  for i in range(1, len(s2)+1):
    edit[i][0] = edit[i-1][0]+1
  for i in range(1, len(s2)+1):
    for j in range(1, len(s1)+1):
      if s2[i-1] == s1[j-1]:
        edit[i][j] = edit[i-1][j-1]
      else:
        edit[i][j] = 1+min(edit[i-1][j-1], edit[i-1][j], edit[i][j-1]) 
  return edit[-1][-1]

This implementation: 

- Initializes a matrix (edit) where edit[i][j] represents the Levenshtein Distance between the first i characters of s2 and the first j characters of s1.

- Iteratively fills the matrix, determining the minimum number of operations required at each step.

- Returns the final value  edit[-1][-1], which gives the Levenshtein Distance for the entire strings.

This solution reduces the time complexity to O(n*m) but requires O(n*m) space. 

 

3. Optimal Dynamic Programming Solution:

The below-enclosed optimized version further reduces space complexity:

# O(n*m) time
# O(min(n, m)) space
def leven(s1, s2):
  # Swap strings if needed
  if len(s1) > len(s2):
      s1, s2 = s2, s1
  # Previous row
  p_row = range(len(s1)+1)
  for i, c1 in enumerate(s2):
      # Start a new row
      c_row = [i+1]
      for j, c2 in enumerate(s1):
          # Calculate costs
          insertions = p_row[j+1]+1
          deletions = c_row[j]+1
          substitutions = p_row[j]+(c1 != c2)
          c_row.append(min(insertions, deletions, substitutions))
      # Move to the next row
      p_row = c_row
  return p_row[-1]

In this implementation:

- Space complexity is reduced to O(min(n, m)) by using only two arrays (c_row and p_row) instead of a full matrix.

- The arrays are iteratively updated to reflect the current and previous rows of the conceptual matrix.

- The algorithm retains the O(n*m) time complexity but is significantly more space-efficient, especially for large strings.

 

The Levenshtein Distance is a versatile tool in text analysis. While the recursive approach is conceptually simple, it is inefficient for large strings. The basic dynamic programming solution improves time complexity, and the optimized version offers significant space efficiency. Note that the optimal dynamic programming solution can be downloaded from my GitHub repo