### Author Topic: Levenshtein and Damerau-Levenshtein distance sefuns  (Read 11948 times)

#### chaos

• BFF
• Posts: 291
• Job, school, social life, sleep. Pick 2.5.
##### Levenshtein and Damerau-Levenshtein distance sefuns
« on: November 05, 2008, 12:16:09 PM »
These are LPC implementations of the Levenshtein and Damerau-Levenshtein 'edit distance' algorithms described at http://en.wikipedia.org/wiki/Levenshtein_distance and http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance.

man page for levenshtein_distance():
Quote
levenshtein_distance - compute Levenshtein distance of two strings

SYNOPSIS
int levenshtein_distance(string a, string b)

DESCRIPTION
Computes the Levenshtein distance between strings a and b,
which is the number of insertions, deletions, or replacements
see http://en.wikipedia.org/wiki/Levenshtein_distance.

EXAMPLE
> eval -b levenshtein_distance("kitten", "mittens")
[Integer] 2
> eval -b levenshtein_distance("terrible", "tortuous")
[Integer] 6
> eval -b levenshtein_distance("terrible", "tribble")
[Integer] 2
> eval -b levenshtein_distance("know", "nkow")
[Integer] 2

damerau_levenshtein_distance(sefun), regexp(efun)

source code for levenshtein_distance():
Code: [Select]
`int levenshtein_distance(string a, string b) {    int alen = strlen(a);    int blen = strlen(b);    mixed array dist = allocate(alen + 1, allocate(blen + 1));    for(int j = 1; j <= blen; j++)        dist[0][j] = j;    dist[alen][0] = alen;    for(int i = 0; i < alen; i++) {        dist[i][0] = i;        for(int j = 0; j < blen; j++)            dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]));    }    return dist[alen][blen];}`
man page for damerau_levenshtein_distance():
Quote
damerau_levenshtein_distance - compute Damerau-Levenshtein distance of two strings

SYNOPSIS
int damerau_levenshtein_distance(string a, string b)

DESCRIPTION
Computes the Damarau-Levenshtein distance between strings a and b,
which is the number of insertions, deletions, replacements or
transpositions necessary to transform one to the other provided
see http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance.

EXAMPLE
> eval -b damerau_levenshtein_distance("kitten", "mittens")
[Integer] 2
> eval -b damerau_levenshtein_distance("terrible", "tortuous")
[Integer] 6
> eval -b damerau_levenshtein_distance("terrible", "tribble")
[Integer] 2
> eval -b damerau_levenshtein_distance("know", "nkow")
[Integer] 1

levenshtein_distance(sefun), regexp(efun)

source code for damerau_levenshtein_distance():
Code: [Select]
`int damerau_levenshtein_distance(string a, string b) {    int alen = strlen(a);    int blen = strlen(b);    mixed array dist = allocate(alen + 1, allocate(blen + 1));    for(int j = 1; j <= blen; j++)        dist[0][j] = j;    dist[alen][0] = alen;    for(int i = 0; i < alen; i++) {        dist[i][0] = i;        for(int j = 0; j < blen; j++)            if(i && j && a[i] == b[j - 1] && a[i - 1] == b[j])                dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]), dist[i - 1][j - 1] + (a[i] != b[j]));            else                dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]));    }    return dist[alen][blen];}`

#### wodan

• BFF
• Posts: 434
• Drink and code, you know you want to!
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #1 on: November 05, 2008, 01:29:53 PM »
and that only slightly over a week after I added it to FluffOS cvs as string_difference()

#### chaos

• BFF
• Posts: 291
• Job, school, social life, sleep. Pick 2.5.
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #2 on: November 05, 2008, 02:00:54 PM »
Ha.  Great minds think in the same gutters.

Which one did you implement?  Or can you get both behaviors with bitflags or whatever?

#### wodan

• BFF
• Posts: 434
• Drink and code, you know you want to!
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #3 on: November 05, 2008, 03:03:20 PM »
neither, but woom (@discworld) did the first one.

#### chaos

• BFF
• Posts: 291
• Job, school, social life, sleep. Pick 2.5.
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #4 on: November 13, 2008, 12:44:52 PM »
I regret to say that the implementations I psoted before are, though plausible, incorrect, and will fail on some inputs.  (The man pages contain one; the Levenshtein and Damerau-Levenshtein distances from 'terrible' to 'tribble' are 3, not 2.)  Correct implementations:

Code: [Select]
`int levenshtein_distance(string a, string b) {    int alen = strlen(a);    int blen = strlen(b);    mixed array dist = allocate(alen + 1, allocate(blen + 1));    for(int i = 1; i <= alen; i++)        dist[i][0] = i;    int array dist0 = dist[0];    for(int j = 1; j <= blen; j++)        dist0[j] = j;    for(int i = 0; i < alen; i++)        for(int j = 0; j < blen; j++)            dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]));    return dist[alen][blen];}int damerau_levenshtein_distance(string a, string b) {    int alen = strlen(a);    int blen = strlen(b);    mixed array dist = allocate(alen + 1, allocate(blen + 1));    for(int i = 1; i <= alen; i++)        dist[i][0] = i;    int array dist0 = dist[0];    for(int j = 1; j <= blen; j++)        dist0[j] = j;    for(int i = 0; i < alen; i++)        for(int j = 0; j < blen; j++)            if(i && j && a[i] == b[j - 1] && a[i - 1] == b[j])                dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]), dist[i - 1][j - 1] + (a[i] != b[j]));            else                dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]));    return dist[alen][blen];}`

#### Woom

• Acquaintance
• Posts: 3
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #5 on: November 16, 2008, 02:00:25 PM »
Hi!

I only did the Levenshtein distance since I (as yet) haven't been able to find an algorithm for calculating the Damerau-Levenshtein distance. (No, I haven't searched very hard.)

Your D-L algorithm only finds the optimum string alignment, which for D-L is not the same as the actual distance. Try it with CA -> ABC, which has an edit distance of 2, but optimal alignment of 3.

Neat, though

Woom

#### chaos

• BFF
• Posts: 291
• Job, school, social life, sleep. Pick 2.5.
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #6 on: November 16, 2008, 05:12:26 PM »
Yeah, that's the "provided that no substring is modified more than once" part in the man page.  There's discussion of that on the wikipedia page as well.  I wasn't clear on whether that was part of the actual definition of D-L distance or not, but my intuition about it was that removing the multiple-edit constraint would blow out the doors on the algorithm, making it O(nk) or even O(n!), so probably wasn't worth messing with to begin with.

#### detah

• BFF
• Posts: 190
• Ruler of 2D
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #7 on: November 17, 2008, 08:33:41 AM »
This seems like some cool code.
Are you using it as Step #1 for a fuzzymatch program?
Or do you have some other use for this code?

I have been considering several upgrades for my Help File System over the last 2 years. A couple months ago, I completed an upgrade which lists all help files for which the help string is a substring. eg. Player types "help fire", the Help File System outputs: "There is no help file for fire. See also: fireball, firestorm and wall of fire."

Some type of fuzzymatch routine has been on my mind, but I havent really started to tackle it yet. I am going to focus my attention on 2 kinds of common typing errors, the single letter typo "fiteball" and the transposition typo "ifreball".

Thanks for posting.

#### chaos

• BFF
• Posts: 291
• Job, school, social life, sleep. Pick 2.5.
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #8 on: November 17, 2008, 09:06:03 AM »
You're welcome.  Yeah, it's for fuzzymatch.  Identifying things like 'fiteball' and 'ifreball' is exactly what I had in mind.

#### Woom

• Acquaintance
• Posts: 3
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #9 on: November 17, 2008, 01:20:11 PM »
Same here. Discworld has been using fuzzy matching to suggest help files for as long as I've been around, but this algorithm is an improvement to what we've been using. I had it added to FluffOS for extra speed.

Woom

• Friend
• Posts: 50
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #10 on: November 24, 2008, 11:12:48 AM »
You're welcome.  Yeah, it's for fuzzymatch.  Identifying things like 'fiteball' and 'ifreball' is exactly what I had in mind.

Hmm. How about a Keyboard-based fuzzymatch, to fix common problems like finger-slips.. ie, 'y' instead of 't', 'b' instead of 'n', or hand-shifts, like 'cin' instead of 'com', or 'kik' instead of 'lol'?

#### detah

• BFF
• Posts: 190
• Ruler of 2D
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #11 on: November 25, 2008, 08:04:10 AM »
The program as it is does not care whether the incorrect character is an adjacent key on the keyboard or if it is on the other side of the keyboard. It searches through all letters. Therefore, your keyboard-based concern is a mere subset of the problems that this sefun will solve. That's why it is so cool.
-Detah

#### skullslayer

• Acquaintance
• Posts: 12
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #12 on: January 06, 2009, 09:03:46 PM »
I've done a few tweaks to implement this on our mud, and came across one minor error in the man page, which was most confusing for a while

Quote
> eval -b levenshtein_distance("terrible", "tribble")
[Integer] 2

This distance is 3, not 2 - delete e, r, and insert b

#### chaos

• BFF
• Posts: 291
• Job, school, social life, sleep. Pick 2.5.
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #13 on: January 07, 2009, 06:21:10 AM »
I've done a few tweaks to implement this on our mud, and came across one minor error in the man page

Yeah.  See my post of 2008-11-13.  The fundamental error was in the implementation; the error in the man page was consequential of that.

#### cratylus

• Posts: 1020
##### Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #14 on: January 13, 2009, 11:09:33 AM »
You're welcome.  Yeah, it's for fuzzymatch.  Identifying things like 'fiteball' and 'ifreball' is exactly what I had in mind.

Sweet!

What's the licensing for the code/mans you provided?

I've done a little translating of your code so that it plugs into Dead Souls
as a sefun file. MudOS/FluffOS handles "second-argument-to-allocate-being-a-pointer"
rather, uh...unexpectedly. Here's a gently-tested DS-ified version:

cat fuzzymatch.c
Code: [Select]
`int levenshtein_distance(string a, string b) {    int alen = sizeof(a);    int blen = sizeof(b);    mixed foo, baz = alen+1;    mixed array dist = ({});    while(baz){        dist += ({ allocate(blen +1) });        baz--;    }     foreach(foo in dist){        foo[0] = baz;        baz++;    }    baz = 1;    while(baz < (blen + 1)){        dist[0][baz] = baz;        baz++;    }    for(int i = 0; i < alen; i++){        for(int j = 0; j < blen; j++){            dist[i + 1][j + 1] = min( ({ dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]) }) );        }    }    return dist[alen][blen];}int damerau_levenshtein_distance(string a, string b) {    int alen = sizeof(a);    int blen = sizeof(b);    mixed foo, baz = alen+1;    mixed array dist = ({});    while(baz){        dist += ({ allocate(blen +1) });        baz--;    }    foreach(foo in dist){        foo[0] = baz;        baz++;    }    baz = 1;    while(baz < (blen + 1)){        dist[0][baz] = baz;        baz++;    }    for(int i = 0; i < alen; i++)        for(int j = 0; j < blen; j++)            if(i && j && a[i] == b[j - 1] && a[i - 1] == b[j])                dist[i + 1][j + 1] = min( ({ dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]), dist[i - 1][j - 1] + (a[i] != b[j]) }) );            else                dist[i + 1][j + 1] = min( ({ dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]) }) );    return dist[alen][blen];}`
I did notice a minor difference in your results and mine...hopefully someone will point out where I goofed.

Quote
12:06:47 Dead Souls Omega /secure/sefun > eval return levenshtein_distance("kitten", "mittens")
Result = 2

12:06:56 Dead Souls Omega /secure/sefun > eval return levenshtein_distance("terrible", "tortuous")
Result = 6

12:06:59 Dead Souls Omega /secure/sefun > eval return levenshtein_distance("terrible", "tribble")
Result = 3

12:07:03 Dead Souls Omega /secure/sefun > eval return levenshtein_distance("know", "nkow")
Result = 2

12:07:07 Dead Souls Omega /secure/sefun > eval return damerau_levenshtein_distance("kitten", "mittens")
Result = 2

12:07:13 Dead Souls Omega /secure/sefun > eval return damerau_levenshtein_distance("terrible", "tortuous")
Result = 6

12:07:16 Dead Souls Omega /secure/sefun > eval return damerau_levenshtein_distance("terrible", "tribble")
Result = 3

12:07:22 Dead Souls Omega /secure/sefun > eval return damerau_levenshtein_distance("know", "nkow")
Result = 1

-Crat