Levenshtein distance
Encyclopedia : L : LE : LEV : Levenshtein distance
In information theory and computer science, the Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is useful in applications that need to determine how similar two strings are, such as spell checkers.
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since these three edits change one into the other, and there is no way to do it with fewer than three edits:
- kitten → sitten (substitution of 'k' for 's')
- sitten → sittin (substitution of 'e' for 'i')
- sittin → sitting (insert 'g' at the end)
The algorithm
A commonly-used bottom-up dynamic programming algorithm for computing the Levenshtein distance involves the use of an (n + 1) × (m + 1) matrix, where n and m are the lengths of the two strings. Here is pseudocode for a function LevenshteinDistance that takes two strings, str1 of length lenStr1, and str2 of length lenStr2, and computes the Levenshtein distance between them:
int LevenshteinDistance(char str1[1..lenStr1], char str2[1..lenStr2]) // d is a table with lenStr1+1 rows and lenStr2+1 columns declare int d[0..lenStr1, 0..lenStr2] // i and j are used to iterate over str1 and str2 declare int i, j, costTwo examples of the resulting matrix (the minimum steps to be taken are highlighted):for i from 0 to lenStr1 d[i, 0] := i for j from 0 to lenStr2 d[0, j] := j
for i from 1 to lenStr1 for j from 1 to lenStr2 if str1[i] = str2[j] then cost := 0 else cost := 1 d[i, j] := minimum( d[i-1, j ] + 1, // deletion d[i , j-1] + 1, // insertion d[i-1, j-1] + cost // substitution )
return d[lenStr1, lenStr2]
|
|
The invariant maintained throughout the algorithm is that we can transform the initial segment str1[1..i] into str2[1..j] using a minimum of d[i,j] operations. At the end, the bottom-right element of the array contains the answer.
Possible improvements
Possible improvements to this algorithm include:- We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
- We can store the number of insertions, deletions, and substitutions separately, or even the positions at which they occur, which is always
j. - We can give different penalty costs to insertion, deletion and substitution.
- The initialization of
d[i,0]can be moved inside the main outer loop. - This algorithm parallelizes poorly, due to a large number of data dependencies. However, all the
costs can be computed in parallel, and the algorithm can be adapted to perform theminimumfunction in phases to eliminate dependencies.
Proof of correctness
As mentioned earlier, the invariant is that we can transform the initial segments[1..i] into t[1..j] using a minimum of d[i,j] operations. This invariant holds since:
- It is initially true on row and column 0 because
s[1..i]can be transformed into the empty stringt[1..0]by simply dropping allicharacters. Similarly, we can transforms[1..0]tot[1..j]by simply adding alljcharacters. - The minimum is taken over three distances, each of which is feasible:
- * If we can transform
s[1..i]tot[1..j-1]inkoperations, then we can simply addt[j]afterwards to gett[1..j]ink+1operations. - * If we can transform
s[1..i-1]tot[1..j]inkoperations, then we can do the same operations ons[1..i]and then remove the originals[i]at the end ink+1operations. - * If we can transform
s[1..i-1]tot[1..j-1]inkoperations, we can do the same tos[1..i]and then do a substitution oft[j]for the originals[i]at the end if necessary, requiringk+costoperations. - The operations required to transform
s[1..n]intot[1..m]is of course the number required to transform all ofsinto all oft, and sod[n,m]holds our result.
d[i,j] is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j] is smaller than the minimum of the three, and use this to show one of the three is not minimal.Implementations
The implementations of the Levenshtein algorithm on this page are illustrative only. Applications will, in most cases, use implementations which use heap allocations sparingly, in particular when large lists of words are compared to each other. The following remarks indicate some of the variations on this and related topics:
1. Most implementations use one- or two-dimensional arays to store the distances of prefixes of the words compared. In most applications the size of these structures is previously known. This is the case, when, for instance the distance is relevant only if it is below a certian maximally allowed distance (this happens when words are selected from a dictionary to approximately match a given word). In this case the arrays can be preallocated and reused over the various runs of the algorithm over successive words.
2. Using maximally allowed distances reduces the search. To achieve this, include the maximal distance in the parameter list to the algorithm and keep track of the shortest found distance of prefixes of the words compared.
3. Deletion/Insertion and replacement of characters can be assigned different weights. The usual choice is to set them both to 1 (this is the choice for all of the algorithms on this page). Different values for these weights allows for more flexible search strategies in lists of words.
const unsigned int cost_del = 1;
const unsigned int cost_ins = 1;
const unsigned int cost_sub = 1;
unsigned int edit_distance( const std::string& s1, const std::string& s2 )
unsigned int tmp = dist[ n1*(n2+1) + n2 ]
delete[] dist;
return tmp;
}
Below is a slightly faster version using only linear space.
const unsigned int cost_del = 1;
const unsigned int cost_ins = 1;
const unsigned int cost_sub = 1;
unsigned int edit_distance( const std::string& s1, const std::string& s2 )
r = p;
p = q;
q = r;
}
unsigned int tmp = p[ n2 ]
delete[] p;
delete[] q;
return tmp;
}
public int LevenshteinDistance(string str1, string str2)
}
return d[str1.Length, str2.Length]
}
public int minimum3(int a, int b, int c)
A fast Common Lisp version of the Python code below.
(defun levenshtein-distance (a b)
"Computes the Levenshtein distance between strings a and b."
(let ((al (length a))
(bl (length b)))
;; Swap a and b if a is longer.
(when (> al bl)
(multiple-value-setq (a b al bl) (values b a bl al)))
(loop
for i below bl
for prev = (loop for i from 0 to al collect i)
then (copy-list current)
for current = (cons (1+ i) (make-list al :initial-element 0))
do (loop
for j below al
for (c0 . ccons) on current
for (p1 p0) on prev
for add = (1+ p1)
for delete = (1+ c0)
for change = (+ p0 (if (eql (elt a j) (elt b i))
0 1))
do (rplaca ccons (min add delete change)))
finally (return (car (last current))))))
Emacs lisp implementation maintained at the Emacs Wiki.[link]''
unsigned int edit_distance( const std::string& s1, const std::string& s2 )
unsigned int tmp = dist[ n1*(n2+1) + n2 ] delete[] dist;
return tmp; }
unsigned int edit_distance( const std::string& s1, const std::string& s2 ) r = p; p = q; q = r; }
unsigned int tmp = p[ n2 ] delete[] p; delete[] q;
return tmp; }
public int LevenshteinDistance(string str1, string str2) }return d[str1.Length, str2.Length] }
public int minimum3(int a, int b, int c)
A fast Common Lisp version of the Python code below.
(defun levenshtein-distance (a b)
"Computes the Levenshtein distance between strings a and b."
(let ((al (length a))
(bl (length b)))
;; Swap a and b if a is longer.
(when (> al bl)
(multiple-value-setq (a b al bl) (values b a bl al)))
(loop
for i below bl
for prev = (loop for i from 0 to al collect i)
then (copy-list current)
for current = (cons (1+ i) (make-list al :initial-element 0))
do (loop
for j below al
for (c0 . ccons) on current
for (p1 p0) on prev
for add = (1+ p1)
for delete = (1+ c0)
for change = (+ p0 (if (eql (elt a j) (elt b i))
0 1))
do (rplaca ccons (min add delete change)))
finally (return (car (last current))))))
Emacs lisp implementation maintained at the Emacs Wiki.[link]''
(defun levenshtein-distance (str1 str2) "Return the edit distance between strings STR1 and STR2." (if (not (stringp str1)) (error "Argument was not a string: %s" str1)) (if (not (stringp str2)) (error "Argument was not a string: %s" str2)) (let* ((make-table (function (lambda (columns rows init) (make-vector rows (make-vector columns init))))) (tref (function (lambda (table x y) (aref (aref table y) x)))) (tset (function (lambda (table x y object) (let ((row (copy-sequence (aref table y)))) (aset row x object) (aset table y row) object)))) (length-str1 (length str1)) (length-str2 (length str2)) (d (funcall make-table (1+ length-str1) (1+ length-str2) 0))) (let ((i 0) (j 0)) (while (<= i length-str1) (funcall tset d i 0 i) (setq i (1+ i))) (while (<= j length-str2) (funcall tset d 0 j j) (setq j (1+ j)))) (let ((i 1)) (while (<= i length-str1) (let ((j 1)) (while (<= j length-str2) (let* ((cost (if (equal (aref str1 (1- i)) (aref str2 (1- j))) 0 1)) (deletion (1+ (funcall tref d (1- i) j))) (insertion (1+ (funcall tref d i (1- j)))) (substitution (+ (funcall tref d (1- i) (1- j)) cost))) (funcall tset d i j (min insertion deletion substitution))) (setq j (1+ j)))) (setq i (1+ i)))) (funcall tref d length-str1 length-str2)))
Because Haskell automatically memoizes results of previous calls, it is particularly suited to a simple recursive implementation:
editDistance :: Eq a => [a] -> [a] -> Int editDistance s [] = length s editDistance [] t = length t editDistance (s:ss) (t:ts) = minimum [ (if s == t then 0 else 1) + editDistance ss ts, 1 + editDistance ss (t:ts), 1 + editDistance (s:ss) ts ]
public class LevenshteinDistance
public static int computeLevenshteinDistance(String str1, String str2)
private static int computeLevenshteinDistance(char[] str1, char[] str2)
for(int j=0; j
for(int i=1; i<=str1.length; i++)
for(int j=1;j<=str2.length; j++)
distance[i][j]= minimum(distance[i-1][j]+1,
distance[i][j-1]+1, distance[i-1][j-1] +
((str1[i-1]==str2[j-1])?0:1));
return distance[str1.length][str2.length]
}
}
Written by Magnus Lie Hetland [link].
public static int computeLevenshteinDistance(String str1, String str2)
private static int computeLevenshteinDistance(char[] str1, char[] str2)
for(int j=0; j
return distance[str1.length][str2.length]
}
}
def distance(a,b): "Calculates the Levenshtein distance between a and b." n, m = len(a), len(b) if n > m: # Make sure n <= m, to use O(min(n,m)) space a,b = b,a n,m = m,ncurrent = range(n+1) for i in range(1,m+1): previous, current = current, [i]+[0]*n for j in range(1,n+1): add, delete = previous[j]+1, current[j-1]+1 change = previous[j-1] if a[j-1] != b[i-1]: change = change + 1 current[j] = min(add, delete, change)
return current[n]
Uses srfi-25 and srfi-42
(define add1 (lambda (x) (+ x 1)))
(define sub1 (lambda (x) (- x 1)))
(define levenshtein-distance
(lambda (s1 s2)
(let* ((width (add1 (string-length s1)))
(height (add1 (string-length s2)))
(d (make-array (shape 0 height 0 width) 0)))
(do-ec (:range x width) (array-set! d 0 x x))
(do-ec (:range y height) (array-set! d y 0 y))
(do-ec (:range x (string-length s1))
(:range y (string-length s2))
(array-set!
d (add1 y) (add1 x)
(min
(add1 (array-ref d y (add1 x)))
(add1 (array-ref d (add1 y) x))
(+ (array-ref d y x)
(if (eqv? (string-ref s1 x)
(string-ref s2 y))
0
1)))))
(displarray d)
(array-ref d (sub1 height) (sub1 width)))))
Contributed by Stephan Maier. This is an implementation in Prolog which makes use only of ISO-Prolog
with the exception of the append(?list, ?list, ?list)-rule which, however, is available on most
prolog implementations. The implementation of the algorithm is recursive but the recursion only
serves to simulate two loops over the length of the words compared. Efficiency of the implementation is
contrasted with rather large parameter set for the rule 'lev(...)'.
% Implementation of the computation of the leveshtein distance of two words.
% This implementation works on all ISO-compatible implemenations of Prolog
% which implement append(?list, ?list, ?list) in the usual way.
% levenshtein(+atom, +atom, -integer): Computes the Levenshtein distance between two given words
% W1: Atom
% W2: Atom
% D : Distance
levenshtein(W1, W2, D) :- atom_length(W1, 0), atom_length(W2, L), D is L, !.
levenshtein(W1, W2, D) :- atom_length(W2, 0), atom_length(W1, L), D is L, !.
levenshtein(W1, W2, D) :- atom_length(W1, L1), initialList(L1, StartRow), lev(W1, W2, [], StartRow, 1, 1, 0, D), !.
% lev(+W1, +W2, +CR, +LR, +P1, +P2, +CD, -D): Rule which represents the actual calculation. This rule
% works recursively simulating a loop over the positions (P1, P2) in the words W1 and W2. The outer
% loop is over P2, the inner loop is over P1. To keep track of previously computed distances two lists
% are used.
% W1: Atom
% W2: Atom
% CR: Start of the row (current row)
% LR: Remainder of last row (last row)
% P1: Position in word W1
% P2: Position in word W2
% CD: Current value of distance; this is the last element of CR
% D : Value of distance
% Verify the end-condition for the recursion first: This is achieved when the P2-counter exceeds the length of W2
lev(_, W2, _, _, _, P2, CD, D) :- atom_length(W2, L2), P2 > L2, D is CD, !.
% The initial rule after P2 has been increased by 1
lev(W1, W2, [], LR, 1, P2, _, D) :- CR = [P2], lev(W1, W2, CR, LR, 1, P2, P2, D), !.
% This is the actual computational rule
lev(W1, W2, CR, [H1, H2|LR], P1, P2, CD, D) :-
% T1 is the distance at (P1, P2) obtained from removing the letter at P1 from W1
T1 is CD + 1,
% T2 is the distance at (P1, P2) obtained from adding the letter at P2 from W2
T2 is H2 + 1,
charAt(W1, P1, C1), charAt(W2, P2, C2),
% T3 is the distance at (P1, P2) obtained from exchanging the letters at P1 from W1 and P2 from W2
( C1 = C2 -> T3 is H1; T3 is H1 + 1 ),
% The distance at (P1, P2) is the minimum of the previously computed distances
min(T1, T2, T3, NCD),
% Compute the next value for CR, increase the P1 position by i and the recursively call lev(...)
append(CR, [NCD], NCR), NP1 is P1 + 1,
lev(W1, W2, NCR, [H2|LR], NP1, P2, NCD, D), !.
% This rule verifies the end of line condition for the loop over P1
lev(W1, W2, CR, [_], _, P2, CD, D) :- NP2 is P2 + 1, lev(W1, W2, [], CR, 1, NP2, CD, D), !.
% initialList(+N, -L): Returns a list L of the form [0, 1, 2, ..., N].
initialList(N, _) :- (\+ integer(N); N < 1) -> fail.
initialList(1, L) :- L = [0, 1], !.
initialList(N, L) :- N1 is N - 1, L2 = [N], initialList(N1, L1), append(L1, L2, L).
% charAt(+A, ?N, -C): Returns the character at position N in the atom A
% The position is 1-based
% A: The atom
% N: The position at which to extract the character
% C: The character of A at position N
charAt(A, N, C) :- P is N - 1, sub_atom(A, P, 1, _, C).
% min(...): These rules compute the minimum of the given integer values
% I1, I2, I3: Integer values
% M: The minimum over the values
min(I1, I2, M) :- integer(I1), integer(I2), ( I1 =< I2 -> M is I1; M is I2).
min(I1, I2, I3, M) :- min(I1, I2, A), min(I2, I3, B), min(A, B, M).
Imports System.Math
Public Module Levenshtein
Public Function LevenshteinDistance(ByVal a As String, ByVal b As String) As Integer
Dim d As Integer(,) = New Integer(a.Length + 1, b.Length + 1)
Dim i As Integer
Dim j As Integer
Dim insert_penalty As Integer = 1
Dim delete_penalty As Integer = 1
Dim substitute_penalty As Integer = 1
Dim cost As Integer
For i = 0 To a.Length
d(i, 0) = i
Next
For j = 0 To b.Length
d(0, j) = j
Next
For i = 1 To a.Length
For j = 1 To b.Length
If (a.Chars(i - 1) = b.Chars(j - 1)) Then
cost = 0
Else
cost = substitute_penalty
End If
d(i, j) = Min(Min(d(i - 1, j) + insert_penalty, d(i, j - 1) + delete_penalty), d(i - 1, j - 1) + cost)
Next
Next
Return d(a.Length, b.Length)
End Function
' This may also be useful to some
' Just split two strings into arrays on spaces or on a lot of different things
' A good use would be if you want to check that a user input
' a name correctly or misspelled/misordered the name
Public Function StringArrayLevenshteinDistance(ByVal first As String(), ByVal second As String()) As Integer
' assume first is search string
' assume second is find string
Dim j As Integer
Dim k As Integer
Dim value As Integer
Dim bestMatchIndex = -1
Dim smallestWordLevenshtein As Integer = -1
Dim strLevenshtein As Integer = 0
' go through the parts of the search string
For j = 0 To first.Length - 1
' go through the parts of the find string
For k = 0 To second.Length - 1
' find the part that is closest to the one being searched for
If (first(j) <> "" And second(k) <> "") Then
value = LevenshteinDistance(first(j), second(k))
If (value < smallestWordLevenshtein Or smallestWordLevenshtein = -1) Then
smallestWordLevenshtein = value
bestMatchIndex = k
End If
End If
Next
' add the smallest lev vaule found to the return value for the strings
If (smallestWordLevenshtein <> -1) Then
strLevenshtein += smallestWordLevenshtein
smallestWordLevenshtein = -1
End If
' and drop that section of this string so it's not used later
If (bestMatchIndex <> -1) Then
second.SetValue("", bestMatchIndex)
bestMatchIndex = -1
End If
Next
Return strLevenshtein
End Function
End Module
loFuzzy = CREATEOBJECT("FuzzyMatch")
? loFuzzy.EditDistance("kitten", "sitting")
define class FuzzyMatch as custom
**
** Computes the edit distance between two strings
** using the Levenshtein algorithm. The edit distance is how many
** changes need to be made to make them identical. The edit distance
** for identical strings is 0.0, the edit distance for completely
** different strings depends on what letters need to change or
** be added in order to make them the same.
**
function EditDistance ( ;
sSource as string, ;
sTarget as string ) as decimal
local ;
iLenSource as integer, ;
iLenTarget as integer, ;
iRow as integer, ;
iCol as integer
iLenSource = len( sSource )
iLenTarget = len( sTarget )
if iLenSource == 0.0
return iLenTarget
endif
if iLenTarget == 0.0
return iLenSource
endif
if iLenSource < iLenTarget
sTemp = sSource
sSource = sTarget
sTarget = sTemp
iLenTemp = iLenSource
iLenSource = iLenTarget
iLenTarget = iLenTemp
endif
dimension Levenshtein( iLenTarget+1, iLenSource+1 )
for iRow = 1 to iLenTarget + 1
Levenshtein[iRow,1] = iRow -1
endfor
for iCol = 1 to iLenSource + 1
Levenshtein[1,iCol] = iCol -1
endfor
for iCol = 2 to iLenSource + 1
for iRow = 2 to iLenTarget + 1
local ;
dCost as decimal, ;
cColChar as string, ;
cRowChar as string
dCost = 0.0
cColChar = substr( sSource, iCol-1, 1 )
cRowChar = substr( sTarget, iRow-1, 1 )
if cColChar != cRowChar
dCost = 1.0
endif
Levenshtein[iRow,iCol] = ;
min( ;
Levenshtein[iRow-1,iCol] + 1, ;
Levenshtein[iRow,iCol-1] + 1, ;
Levenshtein[iRow-1,iCol-1] + dCost )
endfor
endfor
return Levenshtein[iLenTarget+1,iLenSource+1]
endfunc
**
** Compares two strings and returns the % match using
** the EditDistance function.
**
function PercentMatch( ;
sSource as string, ;
sTarget as string ) as decimal
local ;
iLenSource as integer, ;
iLenTarget as integer, ;
dResult as decimal, ;
dEditDistance as decimal
iLenSource = len( sSource )
iLenTarget = len( sTarget )
if iLenSource + iLenTarget == 0
return 0.0
else
dEditDistance = this.EditDistance( sSource, sTarget)
dResult = (1.0-(dEditDistance/max(iLenSource,iLenTarget))) * 100.0
return dResult
endif
endfunc
enddefine
Written by Moorthy S V [link]
class LevenshteinDistance
private function compute(a:Array, b:Array)
var distance = new Array();
for(var g=0; g
for(var i=0; i<=a.length; i++)
distance[i][0] = i;
}
for(var j=0; j<=b.length+1; j++)
for(var k=1; k<=a.length; k++)
}
return distance[a.length][b.length]
}
private function minimum(a:Number, b:Number, c:Number)
if (Number(b)<=Number(a) && Number(b)<=Number(c))
if(result == -1 )
return result;
}
}
printf("\ndistance is: %d", dist("mama", "tatko"));
sub dist
push(@d, [@temp]);
}
for($i=0; $i
for($j=0; $j
for($i=1; $i
$d[$i][$j]=min($d[$i-1][$j]+1, $d[$i][$j-1]+1, $d[$i-1][$j-1]+$cost);
}
}
return $d[length($str1)][length($str2)]
}
# Alternatively: use List::Util qw/ min /;
sub min
return $m;
}
# Alternatively (the fastest possible way)
use Text::LevenshteinXS 'distance'; #this provides access to precompiled C implementation of distance() function
Uses linear space, and can be curried with just one string to factor arrays allocations
(* Written by Pierre Etchemaite *)
type levenshtein_costs =
let levenshtein_distance lc =
fun s1 ->
let l1 = String.length s1 in
let m = Array.make (l1 + 1) 0 in
let n = Array.make (l1 + 1) 0 in
fun s2 ->
let l2 = String.length s2 in
m.(0) <- 0;
for i = 1 to l1 do
m.(i) <- m.(i - 1) + lc.delete_cost
done;
let rec aux j m n =
if j = l2 then m.(l1)
else
let c2 = s2.[j] in
n.(0) <- m.(0) + lc.insert_cost;
for i = 1 to l1 do
let i1 = i - 1 in
n.(i) <-
min
(min (n.(i1) + lc.delete_cost)
(m.(i) + lc.insert_cost))
(m.(i1) +
(if s1.[i1] <> c2 then lc.replace_cost else 0))
done;
aux (j + 1) n m in
aux 0 m n
function levenshtein(string1, string2)
local str1, str2, distance = , , ;
str1.len, str2.len = string.len(string1), string.len(string2);
string.gsub(string1, "(.)", function(s) table.insert(str1, s); end);
string.gsub(string2, "(.)", function(s) table.insert(str2, s); end);
for i = 0, str1.len do distance[i][0] = i; end
for i = 0, str2.len do distance[0][i] = i; end
for i = 1, str1.len do
for j = 1, str2.len do
local tmpdist = 1;
if(str1[i-1] == str2[j-1]) then tmpdist = 0; end
distance[i][j] = math.min(
distance[i-1][j] + 1, distance[i][j-1]+1, distance[i-1][j-1] + tmpdist);
end
end
return distance[str1.len][str2.len]
end
levenshtein <- function(string1, string2, case = TRUE, map = NULL)
if(ifelse(case, string1, tolower(string1)) == ifelse(case, string2, tolower(string2))) return(0)
s1 <- strsplit(paste(" ", ifelse(case, string1, tolower(string1)), sep=""), NULL)1
s2 <- strsplit(paste(" ", ifelse(case, string2, tolower(string2)), sep=""), NULL)1
l1 <- length(s1)
l2 <- length(s2)
d <- matrix(nrow = l1, ncol = l2)
for(i in 1:l1) d[i,1] <- i-1
for(i in 1:l2) d[1,i] <- i-1
for(i in 2:l1) for(j in 2:l2) d[i,j] <- min((d[i-1,j]+1) , (d[i,j-1]+1) , (d[i-1,j-1]+ifelse(s1[i] == s2[j], 0, 1)))
d[l1,l2]
}
# Implementation based on http://www.merriampark.com/ld.htm
# Implementation by Alberto Simoes (ambs cpan.org)
#
# .sub main :main
# $S1 = "purl"
# $S2 = "perl"
# $I1 = levenshtein($S1,$S2)
# print $I1
# print "\n"
# end
# .end
# Levenshtein distance. Pass two strings in
.sub levenshtein
.param string s
.param string t
.local int n # s size
.local int m # t size
.local pmc matrix
.local int i
.local int j
.local string schar
.local string tchar
.local int cost
.local int a
.local int b
.local int c
n = length s
m = length t
if n == 0 goto return_m
if m == 0 goto return_n
new matrix, .ResizablePMCArray
i = 0
init_matrix:
new $P0, .ResizableIntegerArray
matrix[i] = $P0
i += 1
if i <= m goto init_matrix
i = 0
init_matrix_1:
matrix[i;0] = i
i += 1
if i <= m goto init_matrix_1
i = 0
init_matrix_2:
matrix[0;i] = i
i += 1
if i <= n goto init_matrix_2
init_matrix_done:
i = 1
cycle:
j = 1
inner_cycle:
$I0 = i - 1
$I1 = j - 1
schar = substr s, $I1, 1
tchar = substr t, $I0, 1
cost = calculate_cost(schar, tchar)
a = matrix[$I0;j]
a += 1
b = matrix[i;$I1]
b += 1
c = matrix[$I0;$I1]
c += cost
cost = minimum(a,b,c)
matrix[i;j] = cost
j += 1
if j <= n goto inner_cycle
i += 1
if i <= m goto cycle
cost = matrix[m;n]
.return(cost)
return_m:
.return(m)
return_n:
.return(n)
.end
.sub calculate_cost
.param string first
.param string second
if first == second goto zero
.return(1)
zero:
.return(0)
.end
.sub minimum
.param int a
.param int b
.param int c
if a < b goto other
if b < c goto b_label
c_label:
.return(c)
a_label:
.return(a)
b_label:
.return(b)
other:
if a < c goto a_label
goto c_label
.end
Since PHP 3.0.17, there is a built-in levenshtein function[link].
Upper and lower bounds
The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
- It is always at least the difference of the sizes of the two strings.
- It is at most the length of the longer string.
- It is zero if and only if the strings are identical.
- If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
- If the strings are called
s and t, the number of characters (not counting duplicates) found in s but not in t is a lower bound.
See also
External links
(define levenshtein-distance (lambda (s1 s2) (let* ((width (add1 (string-length s1))) (height (add1 (string-length s2))) (d (make-array (shape 0 height 0 width) 0))) (do-ec (:range x width) (array-set! d 0 x x)) (do-ec (:range y height) (array-set! d y 0 y)) (do-ec (:range x (string-length s1)) (:range y (string-length s2)) (array-set! d (add1 y) (add1 x) (min (add1 (array-ref d y (add1 x))) (add1 (array-ref d (add1 y) x)) (+ (array-ref d y x) (if (eqv? (string-ref s1 x) (string-ref s2 y)) 0 1))))) (displarray d) (array-ref d (sub1 height) (sub1 width)))))
Contributed by Stephan Maier. This is an implementation in Prolog which makes use only of ISO-Prolog with the exception of the append(?list, ?list, ?list)-rule which, however, is available on most prolog implementations. The implementation of the algorithm is recursive but the recursion only serves to simulate two loops over the length of the words compared. Efficiency of the implementation is contrasted with rather large parameter set for the rule 'lev(...)'.
% Implementation of the computation of the leveshtein distance of two words. % This implementation works on all ISO-compatible implemenations of Prolog % which implement append(?list, ?list, ?list) in the usual way.% levenshtein(+atom, +atom, -integer): Computes the Levenshtein distance between two given words % W1: Atom % W2: Atom % D : Distance levenshtein(W1, W2, D) :- atom_length(W1, 0), atom_length(W2, L), D is L, !. levenshtein(W1, W2, D) :- atom_length(W2, 0), atom_length(W1, L), D is L, !. levenshtein(W1, W2, D) :- atom_length(W1, L1), initialList(L1, StartRow), lev(W1, W2, [], StartRow, 1, 1, 0, D), !.
% lev(+W1, +W2, +CR, +LR, +P1, +P2, +CD, -D): Rule which represents the actual calculation. This rule % works recursively simulating a loop over the positions (P1, P2) in the words W1 and W2. The outer % loop is over P2, the inner loop is over P1. To keep track of previously computed distances two lists % are used. % W1: Atom % W2: Atom % CR: Start of the row (current row) % LR: Remainder of last row (last row) % P1: Position in word W1 % P2: Position in word W2 % CD: Current value of distance; this is the last element of CR % D : Value of distance
% Verify the end-condition for the recursion first: This is achieved when the P2-counter exceeds the length of W2 lev(_, W2, _, _, _, P2, CD, D) :- atom_length(W2, L2), P2 > L2, D is CD, !.
% The initial rule after P2 has been increased by 1 lev(W1, W2, [], LR, 1, P2, _, D) :- CR = [P2], lev(W1, W2, CR, LR, 1, P2, P2, D), !.
% This is the actual computational rule lev(W1, W2, CR, [H1, H2|LR], P1, P2, CD, D) :- % T1 is the distance at (P1, P2) obtained from removing the letter at P1 from W1 T1 is CD + 1, % T2 is the distance at (P1, P2) obtained from adding the letter at P2 from W2 T2 is H2 + 1, charAt(W1, P1, C1), charAt(W2, P2, C2), % T3 is the distance at (P1, P2) obtained from exchanging the letters at P1 from W1 and P2 from W2 ( C1 = C2 -> T3 is H1; T3 is H1 + 1 ), % The distance at (P1, P2) is the minimum of the previously computed distances min(T1, T2, T3, NCD), % Compute the next value for CR, increase the P1 position by i and the recursively call lev(...) append(CR, [NCD], NCR), NP1 is P1 + 1, lev(W1, W2, NCR, [H2|LR], NP1, P2, NCD, D), !.
% This rule verifies the end of line condition for the loop over P1 lev(W1, W2, CR, [_], _, P2, CD, D) :- NP2 is P2 + 1, lev(W1, W2, [], CR, 1, NP2, CD, D), !.
% initialList(+N, -L): Returns a list L of the form [0, 1, 2, ..., N]. initialList(N, _) :- (\+ integer(N); N < 1) -> fail. initialList(1, L) :- L = [0, 1], !. initialList(N, L) :- N1 is N - 1, L2 = [N], initialList(N1, L1), append(L1, L2, L).
% charAt(+A, ?N, -C): Returns the character at position N in the atom A % The position is 1-based % A: The atom % N: The position at which to extract the character % C: The character of A at position N charAt(A, N, C) :- P is N - 1, sub_atom(A, P, 1, _, C).
% min(...): These rules compute the minimum of the given integer values % I1, I2, I3: Integer values % M: The minimum over the values min(I1, I2, M) :- integer(I1), integer(I2), ( I1 =< I2 -> M is I1; M is I2). min(I1, I2, I3, M) :- min(I1, I2, A), min(I2, I3, B), min(A, B, M).
Imports System.Math
Public Module Levenshtein
Public Function LevenshteinDistance(ByVal a As String, ByVal b As String) As Integer
Dim d As Integer(,) = New Integer(a.Length + 1, b.Length + 1)
Dim i As Integer
Dim j As Integer
Dim insert_penalty As Integer = 1
Dim delete_penalty As Integer = 1
Dim substitute_penalty As Integer = 1
Dim cost As Integer
For i = 0 To a.Length
d(i, 0) = i
Next
For j = 0 To b.Length
d(0, j) = j
Next
For i = 1 To a.Length
For j = 1 To b.Length
If (a.Chars(i - 1) = b.Chars(j - 1)) Then
cost = 0
Else
cost = substitute_penalty
End If
d(i, j) = Min(Min(d(i - 1, j) + insert_penalty, d(i, j - 1) + delete_penalty), d(i - 1, j - 1) + cost)
Next
Next
Return d(a.Length, b.Length)
End Function
' This may also be useful to some
' Just split two strings into arrays on spaces or on a lot of different things
' A good use would be if you want to check that a user input
' a name correctly or misspelled/misordered the name
Public Function StringArrayLevenshteinDistance(ByVal first As String(), ByVal second As String()) As Integer
' assume first is search string
' assume second is find string
Dim j As Integer
Dim k As Integer
Dim value As Integer
Dim bestMatchIndex = -1
Dim smallestWordLevenshtein As Integer = -1
Dim strLevenshtein As Integer = 0
' go through the parts of the search string
For j = 0 To first.Length - 1
' go through the parts of the find string
For k = 0 To second.Length - 1
' find the part that is closest to the one being searched for
If (first(j) <> "" And second(k) <> "") Then
value = LevenshteinDistance(first(j), second(k))
If (value < smallestWordLevenshtein Or smallestWordLevenshtein = -1) Then
smallestWordLevenshtein = value
bestMatchIndex = k
End If
End If
Next
' add the smallest lev vaule found to the return value for the strings
If (smallestWordLevenshtein <> -1) Then
strLevenshtein += smallestWordLevenshtein
smallestWordLevenshtein = -1
End If
' and drop that section of this string so it's not used later
If (bestMatchIndex <> -1) Then
second.SetValue("", bestMatchIndex)
bestMatchIndex = -1
End If
Next
Return strLevenshtein
End Function
End Module
loFuzzy = CREATEOBJECT("FuzzyMatch")
? loFuzzy.EditDistance("kitten", "sitting")
define class FuzzyMatch as custom
**
** Computes the edit distance between two strings
** using the Levenshtein algorithm. The edit distance is how many
** changes need to be made to make them identical. The edit distance
** for identical strings is 0.0, the edit distance for completely
** different strings depends on what letters need to change or
** be added in order to make them the same.
**
function EditDistance ( ;
sSource as string, ;
sTarget as string ) as decimal
local ;
iLenSource as integer, ;
iLenTarget as integer, ;
iRow as integer, ;
iCol as integer
iLenSource = len( sSource )
iLenTarget = len( sTarget )
if iLenSource == 0.0
return iLenTarget
endif
if iLenTarget == 0.0
return iLenSource
endif
if iLenSource < iLenTarget
sTemp = sSource
sSource = sTarget
sTarget = sTemp
iLenTemp = iLenSource
iLenSource = iLenTarget
iLenTarget = iLenTemp
endif
dimension Levenshtein( iLenTarget+1, iLenSource+1 )
for iRow = 1 to iLenTarget + 1
Levenshtein[iRow,1] = iRow -1
endfor
for iCol = 1 to iLenSource + 1
Levenshtein[1,iCol] = iCol -1
endfor
for iCol = 2 to iLenSource + 1
for iRow = 2 to iLenTarget + 1
local ;
dCost as decimal, ;
cColChar as string, ;
cRowChar as string
dCost = 0.0
cColChar = substr( sSource, iCol-1, 1 )
cRowChar = substr( sTarget, iRow-1, 1 )
if cColChar != cRowChar
dCost = 1.0
endif
Levenshtein[iRow,iCol] = ;
min( ;
Levenshtein[iRow-1,iCol] + 1, ;
Levenshtein[iRow,iCol-1] + 1, ;
Levenshtein[iRow-1,iCol-1] + dCost )
endfor
endfor
return Levenshtein[iLenTarget+1,iLenSource+1]
endfunc
**
** Compares two strings and returns the % match using
** the EditDistance function.
**
function PercentMatch( ;
sSource as string, ;
sTarget as string ) as decimal
local ;
iLenSource as integer, ;
iLenTarget as integer, ;
dResult as decimal, ;
dEditDistance as decimal
iLenSource = len( sSource )
iLenTarget = len( sTarget )
if iLenSource + iLenTarget == 0
return 0.0
else
dEditDistance = this.EditDistance( sSource, sTarget)
dResult = (1.0-(dEditDistance/max(iLenSource,iLenTarget))) * 100.0
return dResult
endif
endfunc
enddefine
Written by Moorthy S V [link]
Public Module Levenshtein Public Function LevenshteinDistance(ByVal a As String, ByVal b As String) As Integer
Dim d As Integer(,) = New Integer(a.Length + 1, b.Length + 1) Dim i As Integer Dim j As Integer Dim insert_penalty As Integer = 1 Dim delete_penalty As Integer = 1 Dim substitute_penalty As Integer = 1 Dim cost As Integer
For i = 0 To a.Length d(i, 0) = i Next For j = 0 To b.Length d(0, j) = j Next
For i = 1 To a.Length For j = 1 To b.Length If (a.Chars(i - 1) = b.Chars(j - 1)) Then cost = 0 Else cost = substitute_penalty End If d(i, j) = Min(Min(d(i - 1, j) + insert_penalty, d(i, j - 1) + delete_penalty), d(i - 1, j - 1) + cost) Next Next
Return d(a.Length, b.Length)
End Function
' This may also be useful to some ' Just split two strings into arrays on spaces or on a lot of different things
' A good use would be if you want to check that a user input ' a name correctly or misspelled/misordered the name Public Function StringArrayLevenshteinDistance(ByVal first As String(), ByVal second As String()) As Integer ' assume first is search string ' assume second is find string Dim j As Integer Dim k As Integer Dim value As Integer
Dim bestMatchIndex = -1 Dim smallestWordLevenshtein As Integer = -1 Dim strLevenshtein As Integer = 0
' go through the parts of the search string For j = 0 To first.Length - 1 ' go through the parts of the find string For k = 0 To second.Length - 1 ' find the part that is closest to the one being searched for If (first(j) <> "" And second(k) <> "") Then value = LevenshteinDistance(first(j), second(k)) If (value < smallestWordLevenshtein Or smallestWordLevenshtein = -1) Then smallestWordLevenshtein = value bestMatchIndex = k End If End If Next
' add the smallest lev vaule found to the return value for the strings If (smallestWordLevenshtein <> -1) Then strLevenshtein += smallestWordLevenshtein smallestWordLevenshtein = -1 End If ' and drop that section of this string so it's not used later If (bestMatchIndex <> -1) Then second.SetValue("", bestMatchIndex) bestMatchIndex = -1 End If
Next
Return strLevenshtein
End Function End Module
loFuzzy = CREATEOBJECT("FuzzyMatch")
? loFuzzy.EditDistance("kitten", "sitting")
define class FuzzyMatch as custom
**
** Computes the edit distance between two strings
** using the Levenshtein algorithm. The edit distance is how many
** changes need to be made to make them identical. The edit distance
** for identical strings is 0.0, the edit distance for completely
** different strings depends on what letters need to change or
** be added in order to make them the same.
**
function EditDistance ( ;
sSource as string, ;
sTarget as string ) as decimal
local ;
iLenSource as integer, ;
iLenTarget as integer, ;
iRow as integer, ;
iCol as integer
iLenSource = len( sSource )
iLenTarget = len( sTarget )
if iLenSource == 0.0
return iLenTarget
endif
if iLenTarget == 0.0
return iLenSource
endif
if iLenSource < iLenTarget
sTemp = sSource
sSource = sTarget
sTarget = sTemp
iLenTemp = iLenSource
iLenSource = iLenTarget
iLenTarget = iLenTemp
endif
dimension Levenshtein( iLenTarget+1, iLenSource+1 )
for iRow = 1 to iLenTarget + 1
Levenshtein[iRow,1] = iRow -1
endfor
for iCol = 1 to iLenSource + 1
Levenshtein[1,iCol] = iCol -1
endfor
for iCol = 2 to iLenSource + 1
for iRow = 2 to iLenTarget + 1
local ;
dCost as decimal, ;
cColChar as string, ;
cRowChar as string
dCost = 0.0
cColChar = substr( sSource, iCol-1, 1 )
cRowChar = substr( sTarget, iRow-1, 1 )
if cColChar != cRowChar
dCost = 1.0
endif
Levenshtein[iRow,iCol] = ;
min( ;
Levenshtein[iRow-1,iCol] + 1, ;
Levenshtein[iRow,iCol-1] + 1, ;
Levenshtein[iRow-1,iCol-1] + dCost )
endfor
endfor
return Levenshtein[iLenTarget+1,iLenSource+1]
endfunc
**
** Compares two strings and returns the % match using
** the EditDistance function.
**
function PercentMatch( ;
sSource as string, ;
sTarget as string ) as decimal
local ;
iLenSource as integer, ;
iLenTarget as integer, ;
dResult as decimal, ;
dEditDistance as decimal
iLenSource = len( sSource )
iLenTarget = len( sTarget )
if iLenSource + iLenTarget == 0
return 0.0
else
dEditDistance = this.EditDistance( sSource, sTarget)
dResult = (1.0-(dEditDistance/max(iLenSource,iLenTarget))) * 100.0
return dResult
endif
endfunc
enddefine
Written by Moorthy S V [link]
class LevenshteinDistanceprivate function compute(a:Array, b:Array)
var distance = new Array(); for(var g=0; g
for(var i=0; i<=a.length; i++) distance[i][0] = i; } for(var j=0; j<=b.length+1; j++)
for(var k=1; k<=a.length; k++) } return distance[a.length][b.length] }
private function minimum(a:Number, b:Number, c:Number)
if (Number(b)<=Number(a) && Number(b)<=Number(c))
if(result == -1 )
return result; } }
printf("\ndistance is: %d", dist("mama", "tatko"));
sub dist
push(@d, [@temp]);
}
for($i=0; $i
for($j=0; $j
for($i=1; $i
$d[$i][$j]=min($d[$i-1][$j]+1, $d[$i][$j-1]+1, $d[$i-1][$j-1]+$cost);
}
}
return $d[length($str1)][length($str2)]
}
# Alternatively: use List::Util qw/ min /;
sub min
return $m;
}
# Alternatively (the fastest possible way)
use Text::LevenshteinXS 'distance'; #this provides access to precompiled C implementation of distance() function
Uses linear space, and can be curried with just one string to factor arrays allocations
(* Written by Pierre Etchemaite *)
type levenshtein_costs =
let levenshtein_distance lc =
fun s1 ->
let l1 = String.length s1 in
let m = Array.make (l1 + 1) 0 in
let n = Array.make (l1 + 1) 0 in
fun s2 ->
let l2 = String.length s2 in
m.(0) <- 0;
for i = 1 to l1 do
m.(i) <- m.(i - 1) + lc.delete_cost
done;
let rec aux j m n =
if j = l2 then m.(l1)
else
let c2 = s2.[j] in
n.(0) <- m.(0) + lc.insert_cost;
for i = 1 to l1 do
let i1 = i - 1 in
n.(i) <-
min
(min (n.(i1) + lc.delete_cost)
(m.(i) + lc.insert_cost))
(m.(i1) +
(if s1.[i1] <> c2 then lc.replace_cost else 0))
done;
aux (j + 1) n m in
aux 0 m n
function levenshtein(string1, string2)
local str1, str2, distance = , , ;
str1.len, str2.len = string.len(string1), string.len(string2);
string.gsub(string1, "(.)", function(s) table.insert(str1, s); end);
string.gsub(string2, "(.)", function(s) table.insert(str2, s); end);
for i = 0, str1.len do distance[i][0] = i; end
for i = 0, str2.len do distance[0][i] = i; end
for i = 1, str1.len do
for j = 1, str2.len do
local tmpdist = 1;
if(str1[i-1] == str2[j-1]) then tmpdist = 0; end
distance[i][j] = math.min(
distance[i-1][j] + 1, distance[i][j-1]+1, distance[i-1][j-1] + tmpdist);
end
end
return distance[str1.len][str2.len]
end
levenshtein <- function(string1, string2, case = TRUE, map = NULL)
if(ifelse(case, string1, tolower(string1)) == ifelse(case, string2, tolower(string2))) return(0)
s1 <- strsplit(paste(" ", ifelse(case, string1, tolower(string1)), sep=""), NULL)1
s2 <- strsplit(paste(" ", ifelse(case, string2, tolower(string2)), sep=""), NULL)1
l1 <- length(s1)
l2 <- length(s2)
d <- matrix(nrow = l1, ncol = l2)
for(i in 1:l1) d[i,1] <- i-1
for(i in 1:l2) d[1,i] <- i-1
for(i in 2:l1) for(j in 2:l2) d[i,j] <- min((d[i-1,j]+1) , (d[i,j-1]+1) , (d[i-1,j-1]+ifelse(s1[i] == s2[j], 0, 1)))
d[l1,l2]
}
# Implementation based on http://www.merriampark.com/ld.htm
# Implementation by Alberto Simoes (ambs cpan.org)
#
# .sub main :main
# $S1 = "purl"
# $S2 = "perl"
# $I1 = levenshtein($S1,$S2)
# print $I1
# print "\n"
# end
# .end
# Levenshtein distance. Pass two strings in
.sub levenshtein
.param string s
.param string t
.local int n # s size
.local int m # t size
.local pmc matrix
.local int i
.local int j
.local string schar
.local string tchar
.local int cost
.local int a
.local int b
.local int c
n = length s
m = length t
if n == 0 goto return_m
if m == 0 goto return_n
new matrix, .ResizablePMCArray
i = 0
init_matrix:
new $P0, .ResizableIntegerArray
matrix[i] = $P0
i += 1
if i <= m goto init_matrix
i = 0
init_matrix_1:
matrix[i;0] = i
i += 1
if i <= m goto init_matrix_1
i = 0
init_matrix_2:
matrix[0;i] = i
i += 1
if i <= n goto init_matrix_2
init_matrix_done:
i = 1
cycle:
j = 1
inner_cycle:
$I0 = i - 1
$I1 = j - 1
schar = substr s, $I1, 1
tchar = substr t, $I0, 1
cost = calculate_cost(schar, tchar)
a = matrix[$I0;j]
a += 1
b = matrix[i;$I1]
b += 1
c = matrix[$I0;$I1]
c += cost
cost = minimum(a,b,c)
matrix[i;j] = cost
j += 1
if j <= n goto inner_cycle
i += 1
if i <= m goto cycle
cost = matrix[m;n]
.return(cost)
return_m:
.return(m)
return_n:
.return(n)
.end
.sub calculate_cost
.param string first
.param string second
if first == second goto zero
.return(1)
zero:
.return(0)
.end
.sub minimum
.param int a
.param int b
.param int c
if a < b goto other
if b < c goto b_label
c_label:
.return(c)
a_label:
.return(a)
b_label:
.return(b)
other:
if a < c goto a_label
goto c_label
.end
Since PHP 3.0.17, there is a built-in levenshtein function[link].
Upper and lower bounds
The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
- It is always at least the difference of the sizes of the two strings.
- It is at most the length of the longer string.
- It is zero if and only if the strings are identical.
- If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
- If the strings are called
s and t, the number of characters (not counting duplicates) found in s but not in t is a lower bound.
See also
sub dist
push(@d, [@temp]); }
for($i=0; $i
return $d[length($str1)][length($str2)]
}
# Alternatively: use List::Util qw/ min /;
sub min
return $m;
}
# Alternatively (the fastest possible way)
use Text::LevenshteinXS 'distance'; #this provides access to precompiled C implementation of distance() function
Uses linear space, and can be curried with just one string to factor arrays allocations
(* Written by Pierre Etchemaite*) type levenshtein_costs =
let levenshtein_distance lc = fun s1 -> let l1 = String.length s1 in let m = Array.make (l1 + 1) 0 in let n = Array.make (l1 + 1) 0 in fun s2 -> let l2 = String.length s2 in m.(0) <- 0; for i = 1 to l1 do m.(i) <- m.(i - 1) + lc.delete_cost done; let rec aux j m n = if j = l2 then m.(l1) else let c2 = s2.[j] in n.(0) <- m.(0) + lc.insert_cost; for i = 1 to l1 do let i1 = i - 1 in n.(i) <- min (min (n.(i1) + lc.delete_cost) (m.(i) + lc.insert_cost)) (m.(i1) + (if s1.[i1] <> c2 then lc.replace_cost else 0)) done; aux (j + 1) n m in aux 0 m n
function levenshtein(string1, string2)
local str1, str2, distance = , , ;
str1.len, str2.len = string.len(string1), string.len(string2);
string.gsub(string1, "(.)", function(s) table.insert(str1, s); end);
string.gsub(string2, "(.)", function(s) table.insert(str2, s); end);
for i = 0, str1.len do distance[i][0] = i; end
for i = 0, str2.len do distance[0][i] = i; end
for i = 1, str1.len do
for j = 1, str2.len do
local tmpdist = 1;
if(str1[i-1] == str2[j-1]) then tmpdist = 0; end
distance[i][j] = math.min(
distance[i-1][j] + 1, distance[i][j-1]+1, distance[i-1][j-1] + tmpdist);
end
end
return distance[str1.len][str2.len]
end
levenshtein <- function(string1, string2, case = TRUE, map = NULL)
if(ifelse(case, string1, tolower(string1)) == ifelse(case, string2, tolower(string2))) return(0)
s1 <- strsplit(paste(" ", ifelse(case, string1, tolower(string1)), sep=""), NULL)1
s2 <- strsplit(paste(" ", ifelse(case, string2, tolower(string2)), sep=""), NULL)1
l1 <- length(s1)
l2 <- length(s2)
d <- matrix(nrow = l1, ncol = l2)
for(i in 1:l1) d[i,1] <- i-1
for(i in 1:l2) d[1,i] <- i-1
for(i in 2:l1) for(j in 2:l2) d[i,j] <- min((d[i-1,j]+1) , (d[i,j-1]+1) , (d[i-1,j-1]+ifelse(s1[i] == s2[j], 0, 1)))
d[l1,l2]
}
# Implementation based on http://www.merriampark.com/ld.htm
# Implementation by Alberto Simoes (ambs cpan.org)
#
# .sub main :main
# $S1 = "purl"
# $S2 = "perl"
# $I1 = levenshtein($S1,$S2)
# print $I1
# print "\n"
# end
# .end
# Levenshtein distance. Pass two strings in
.sub levenshtein
.param string s
.param string t
.local int n # s size
.local int m # t size
.local pmc matrix
.local int i
.local int j
.local string schar
.local string tchar
.local int cost
.local int a
.local int b
.local int c
n = length s
m = length t
if n == 0 goto return_m
if m == 0 goto return_n
new matrix, .ResizablePMCArray
i = 0
init_matrix:
new $P0, .ResizableIntegerArray
matrix[i] = $P0
i += 1
if i <= m goto init_matrix
i = 0
init_matrix_1:
matrix[i;0] = i
i += 1
if i <= m goto init_matrix_1
i = 0
init_matrix_2:
matrix[0;i] = i
i += 1
if i <= n goto init_matrix_2
init_matrix_done:
i = 1
cycle:
j = 1
inner_cycle:
$I0 = i - 1
$I1 = j - 1
schar = substr s, $I1, 1
tchar = substr t, $I0, 1
cost = calculate_cost(schar, tchar)
a = matrix[$I0;j]
a += 1
b = matrix[i;$I1]
b += 1
c = matrix[$I0;$I1]
c += cost
cost = minimum(a,b,c)
matrix[i;j] = cost
j += 1
if j <= n goto inner_cycle
i += 1
if i <= m goto cycle
cost = matrix[m;n]
.return(cost)
return_m:
.return(m)
return_n:
.return(n)
.end
.sub calculate_cost
.param string first
.param string second
if first == second goto zero
.return(1)
zero:
.return(0)
.end
.sub minimum
.param int a
.param int b
.param int c
if a < b goto other
if b < c goto b_label
c_label:
.return(c)
a_label:
.return(a)
b_label:
.return(b)
other:
if a < c goto a_label
goto c_label
.end
Since PHP 3.0.17, there is a built-in levenshtein function[link].
Upper and lower bounds
The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
- It is always at least the difference of the sizes of the two strings.
- It is at most the length of the longer string.
- It is zero if and only if the strings are identical.
- If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
- If the strings are called
s and t, the number of characters (not counting duplicates) found in s but not in t is a lower bound.
See also
levenshtein <- function(string1, string2, case = TRUE, map = NULL)
if(ifelse(case, string1, tolower(string1)) == ifelse(case, string2, tolower(string2))) return(0)
s1 <- strsplit(paste(" ", ifelse(case, string1, tolower(string1)), sep=""), NULL)1
s2 <- strsplit(paste(" ", ifelse(case, string2, tolower(string2)), sep=""), NULL)1
l1 <- length(s1)
l2 <- length(s2)
d <- matrix(nrow = l1, ncol = l2)
for(i in 1:l1) d[i,1] <- i-1
for(i in 1:l2) d[1,i] <- i-1
for(i in 2:l1) for(j in 2:l2) d[i,j] <- min((d[i-1,j]+1) , (d[i,j-1]+1) , (d[i-1,j-1]+ifelse(s1[i] == s2[j], 0, 1)))
d[l1,l2]
}
# Implementation based on http://www.merriampark.com/ld.htm
# Implementation by Alberto Simoes (ambs cpan.org)
#
# .sub main :main
# $S1 = "purl"
# $S2 = "perl"
# $I1 = levenshtein($S1,$S2)
# print $I1
# print "\n"
# end
# .end
# Levenshtein distance. Pass two strings in
.sub levenshtein
.param string s
.param string t
.local int n # s size
.local int m # t size
.local pmc matrix
.local int i
.local int j
.local string schar
.local string tchar
.local int cost
.local int a
.local int b
.local int c
n = length s
m = length t
if n == 0 goto return_m
if m == 0 goto return_n
new matrix, .ResizablePMCArray
i = 0
init_matrix:
new $P0, .ResizableIntegerArray
matrix[i] = $P0
i += 1
if i <= m goto init_matrix
i = 0
init_matrix_1:
matrix[i;0] = i
i += 1
if i <= m goto init_matrix_1
i = 0
init_matrix_2:
matrix[0;i] = i
i += 1
if i <= n goto init_matrix_2
init_matrix_done:
i = 1
cycle:
j = 1
inner_cycle:
$I0 = i - 1
$I1 = j - 1
schar = substr s, $I1, 1
tchar = substr t, $I0, 1
cost = calculate_cost(schar, tchar)
a = matrix[$I0;j]
a += 1
b = matrix[i;$I1]
b += 1
c = matrix[$I0;$I1]
c += cost
cost = minimum(a,b,c)
matrix[i;j] = cost
j += 1
if j <= n goto inner_cycle
i += 1
if i <= m goto cycle
cost = matrix[m;n]
.return(cost)
return_m:
.return(m)
return_n:
.return(n)
.end
.sub calculate_cost
.param string first
.param string second
if first == second goto zero
.return(1)
zero:
.return(0)
.end
.sub minimum
.param int a
.param int b
.param int c
if a < b goto other
if b < c goto b_label
c_label:
.return(c)
a_label:
.return(a)
b_label:
.return(b)
other:
if a < c goto a_label
goto c_label
.end
Since PHP 3.0.17, there is a built-in levenshtein function[link].
Upper and lower bounds
The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
- It is always at least the difference of the sizes of the two strings.
- It is at most the length of the longer string.
- It is zero if and only if the strings are identical.
- If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
- If the strings are called
s and t, the number of characters (not counting duplicates) found in s but not in t is a lower bound.
See also
# Levenshtein distance. Pass two strings in .sub levenshtein .param string s .param string t
.local int n # s size .local int m # t size
.local pmc matrix .local int i .local int j
.local string schar .local string tchar
.local int cost .local int a .local int b .local int c
n = length s m = length t
if n == 0 goto return_m if m == 0 goto return_n
new matrix, .ResizablePMCArray i = 0 init_matrix: new $P0, .ResizableIntegerArray matrix[i] = $P0 i += 1 if i <= m goto init_matrix i = 0 init_matrix_1: matrix[i;0] = i i += 1 if i <= m goto init_matrix_1 i = 0 init_matrix_2: matrix[0;i] = i i += 1 if i <= n goto init_matrix_2 init_matrix_done: i = 1 cycle: j = 1 inner_cycle: $I0 = i - 1 $I1 = j - 1 schar = substr s, $I1, 1 tchar = substr t, $I0, 1
cost = calculate_cost(schar, tchar)
a = matrix[$I0;j] a += 1
b = matrix[i;$I1] b += 1
c = matrix[$I0;$I1] c += cost
cost = minimum(a,b,c) matrix[i;j] = cost
j += 1 if j <= n goto inner_cycle i += 1 if i <= m goto cycle
cost = matrix[m;n] .return(cost) return_m: .return(m) return_n: .return(n) .end
.sub calculate_cost .param string first .param string second if first == second goto zero .return(1) zero: .return(0) .end
.sub minimum .param int a .param int b .param int c if a < b goto other if b < c goto b_label c_label: .return(c) a_label: .return(a) b_label: .return(b) other: if a < c goto a_label goto c_label .end
s and t, the number of characters (not counting duplicates) found in s but not in t is a lower bound.
External links
- [Levenshtein Distance, in Three Flavors, by Michael Gilleland]
- [NIST's Dictionary of Algorithms and Data Structures: Levenshtein Distance]
- [CSE 590BI, Winter 1996 Algorithms in Molecular Biology] The algorithms from lectures 2, 3 and 4 are based on the Levenshtein distance but implement a different scoring function. The Haskell example was based on these notes.
- [Levenshtein Distance - visualized]
- [Distance between strings - Levenshtein and Hamming] at cut-the-knot
- [edit-distance.scm] - Functional implementation of the algorithm in the Scheme programming language by Rares Vernica.
- [Another Edit Distance Visualization (very fast)]
- [Wie funktioniert der Levenshtein-Algorithmus...?]
- [Continuous variants, spike train metrics, and applications to neurophysiology]
- [another visualization]
- Project Dedupe http://dedupe.sourceforge.net
From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.
