Author Topic: Levenshtein and Damerau-Levenshtein distance sefuns  (Read 8843 times)

Offline chaos

  • BFF
  • ***
  • Posts: 291
  • Job, school, social life, sleep. Pick 2.5.
    • View Profile
    • Lost Souls
Levenshtein and Damerau-Levenshtein distance sefuns
« on: November 05, 2008, 01:16:09 PM »
These are LPC implementations of the Levenshtein and Damerau-Levenshtein 'edit distance' algorithms described at http://en.wikipedia.org/wiki/Levenshtein_distance and http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance.

man page for levenshtein_distance():
Quote
levenshtein_distance - compute Levenshtein distance of two strings

SYNOPSIS
    int levenshtein_distance(string a, string b)

DESCRIPTION
    Computes the Levenshtein distance between strings a and b,
    which is the number of insertions, deletions, or replacements
    necessary to transform one to the other.  For more information,
    see http://en.wikipedia.org/wiki/Levenshtein_distance.

EXAMPLE
    > eval -b levenshtein_distance("kitten", "mittens")
    [Integer] 2
    > eval -b levenshtein_distance("terrible", "tortuous")
    [Integer] 6
    > eval -b levenshtein_distance("terrible", "tribble")
    [Integer] 2
    > eval -b levenshtein_distance("know", "nkow")
    [Integer] 2

SEE ALSO
    damerau_levenshtein_distance(sefun), regexp(efun)

source code for levenshtein_distance():
Code: [Select]
int levenshtein_distance(string a, string b) {
    int alen = strlen(a);
    int blen = strlen(b);
    mixed array dist = allocate(alen + 1, allocate(blen + 1));
    for(int j = 1; j <= blen; j++)
        dist[0][j] = j;
    dist[alen][0] = alen;
    for(int i = 0; i < alen; i++) {
        dist[i][0] = i;
        for(int j = 0; j < blen; j++)
            dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]));
    }
    return dist[alen][blen];
}

man page for damerau_levenshtein_distance():
Quote
damerau_levenshtein_distance - compute Damerau-Levenshtein distance of two strings

SYNOPSIS
    int damerau_levenshtein_distance(string a, string b)

DESCRIPTION
    Computes the Damarau-Levenshtein distance between strings a and b,
    which is the number of insertions, deletions, replacements or
    transpositions necessary to transform one to the other provided
    that no substring is modified more than once.  For more information,
    see http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance.

EXAMPLE
    > eval -b damerau_levenshtein_distance("kitten", "mittens")
    [Integer] 2
    > eval -b damerau_levenshtein_distance("terrible", "tortuous")
    [Integer] 6
    > eval -b damerau_levenshtein_distance("terrible", "tribble")
    [Integer] 2
    > eval -b damerau_levenshtein_distance("know", "nkow")
    [Integer] 1

SEE ALSO
    levenshtein_distance(sefun), regexp(efun)

source code for damerau_levenshtein_distance():
Code: [Select]
int damerau_levenshtein_distance(string a, string b) {
    int alen = strlen(a);
    int blen = strlen(b);
    mixed array dist = allocate(alen + 1, allocate(blen + 1));
    for(int j = 1; j <= blen; j++)
        dist[0][j] = j;
    dist[alen][0] = alen;
    for(int i = 0; i < alen; i++) {
        dist[i][0] = i;
        for(int j = 0; j < blen; j++)
            if(i && j && a[i] == b[j - 1] && a[i - 1] == b[j])
                dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]), dist[i - 1][j - 1] + (a[i] != b[j]));
            else
                dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]));
    }
    return dist[alen][blen];
}

Offline wodan

  • BFF
  • ***
  • Posts: 433
  • Drink and code, you know you want to!
    • View Profile
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #1 on: November 05, 2008, 02:29:53 PM »
and that only slightly over a week after I added it to FluffOS cvs as string_difference() :)


Offline chaos

  • BFF
  • ***
  • Posts: 291
  • Job, school, social life, sleep. Pick 2.5.
    • View Profile
    • Lost Souls
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #2 on: November 05, 2008, 03:00:54 PM »
Ha.  Great minds think in the same gutters.

Which one did you implement?  Or can you get both behaviors with bitflags or whatever?

Offline wodan

  • BFF
  • ***
  • Posts: 433
  • Drink and code, you know you want to!
    • View Profile
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #3 on: November 05, 2008, 04:03:20 PM »
neither, but woom (@discworld) did the first one.

Offline chaos

  • BFF
  • ***
  • Posts: 291
  • Job, school, social life, sleep. Pick 2.5.
    • View Profile
    • Lost Souls
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #4 on: November 13, 2008, 01:44:52 PM »
I regret to say that the implementations I psoted before are, though plausible, incorrect, and will fail on some inputs.  (The man pages contain one; the Levenshtein and Damerau-Levenshtein distances from 'terrible' to 'tribble' are 3, not 2.)  Correct implementations:

Code: [Select]
int levenshtein_distance(string a, string b) {
    int alen = strlen(a);
    int blen = strlen(b);
    mixed array dist = allocate(alen + 1, allocate(blen + 1));
    for(int i = 1; i <= alen; i++)
        dist[i][0] = i;
    int array dist0 = dist[0];
    for(int j = 1; j <= blen; j++)
        dist0[j] = j;
    for(int i = 0; i < alen; i++)
        for(int j = 0; j < blen; j++)
            dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]));
    return dist[alen][blen];
}

int damerau_levenshtein_distance(string a, string b) {
    int alen = strlen(a);
    int blen = strlen(b);
    mixed array dist = allocate(alen + 1, allocate(blen + 1));
    for(int i = 1; i <= alen; i++)
        dist[i][0] = i;
    int array dist0 = dist[0];
    for(int j = 1; j <= blen; j++)
        dist0[j] = j;
    for(int i = 0; i < alen; i++)
        for(int j = 0; j < blen; j++)
            if(i && j && a[i] == b[j - 1] && a[i - 1] == b[j])
                dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]), dist[i - 1][j - 1] + (a[i] != b[j]));
            else
                dist[i + 1][j + 1] = min(dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]));
    return dist[alen][blen];
}

Offline Woom

  • Acquaintance
  • *
  • Posts: 3
    • View Profile
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #5 on: November 16, 2008, 03:00:25 PM »
Hi!

I only did the Levenshtein distance since I (as yet) haven't been able to find an algorithm for calculating the Damerau-Levenshtein distance. (No, I haven't searched very hard.)

Your D-L algorithm only finds the optimum string alignment, which for D-L is not the same as the actual distance. Try it with CA -> ABC, which has an edit distance of 2, but optimal alignment of 3.

Neat, though :)

Woom

Offline chaos

  • BFF
  • ***
  • Posts: 291
  • Job, school, social life, sleep. Pick 2.5.
    • View Profile
    • Lost Souls
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #6 on: November 16, 2008, 06:12:26 PM »
Yeah, that's the "provided that no substring is modified more than once" part in the man page.  There's discussion of that on the wikipedia page as well.  I wasn't clear on whether that was part of the actual definition of D-L distance or not, but my intuition about it was that removing the multiple-edit constraint would blow out the doors on the algorithm, making it O(nk) or even O(n!), so probably wasn't worth messing with to begin with.

Offline detah

  • BFF
  • ***
  • Posts: 190
  • Ruler of 2D
    • View Profile
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #7 on: November 17, 2008, 09:33:41 AM »
This seems like some cool code.
Are you using it as Step #1 for a fuzzymatch program?
Or do you have some other use for this code?

I have been considering several upgrades for my Help File System over the last 2 years. A couple months ago, I completed an upgrade which lists all help files for which the help string is a substring. eg. Player types "help fire", the Help File System outputs: "There is no help file for fire. See also: fireball, firestorm and wall of fire." 

Some type of fuzzymatch routine has been on my mind, but I havent really started to tackle it yet. I am going to focus my attention on 2 kinds of common typing errors, the single letter typo "fiteball" and the transposition typo "ifreball".

Thanks for posting.

Offline chaos

  • BFF
  • ***
  • Posts: 291
  • Job, school, social life, sleep. Pick 2.5.
    • View Profile
    • Lost Souls
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #8 on: November 17, 2008, 10:06:03 AM »
You're welcome.  Yeah, it's for fuzzymatch.  Identifying things like 'fiteball' and 'ifreball' is exactly what I had in mind.

Offline Woom

  • Acquaintance
  • *
  • Posts: 3
    • View Profile
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #9 on: November 17, 2008, 02:20:11 PM »
Same here. Discworld has been using fuzzy matching to suggest help files for as long as I've been around, but this algorithm is an improvement to what we've been using. I had it added to FluffOS for extra speed.

Woom

Offline shadyman

  • Friend
  • **
  • Posts: 50
    • View Profile
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #10 on: November 24, 2008, 12:12:48 PM »
You're welcome.  Yeah, it's for fuzzymatch.  Identifying things like 'fiteball' and 'ifreball' is exactly what I had in mind.

Hmm. How about a Keyboard-based fuzzymatch, to fix common problems like finger-slips.. ie, 'y' instead of 't', 'b' instead of 'n', or hand-shifts, like 'cin' instead of 'com', or 'kik' instead of 'lol'?

Offline detah

  • BFF
  • ***
  • Posts: 190
  • Ruler of 2D
    • View Profile
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #11 on: November 25, 2008, 09:04:10 AM »
Shadyman-
The program as it is does not care whether the incorrect character is an adjacent key on the keyboard or if it is on the other side of the keyboard. It searches through all letters. Therefore, your keyboard-based concern is a mere subset of the problems that this sefun will solve. That's why it is so cool.
-Detah

Offline skullslayer

  • Acquaintance
  • *
  • Posts: 12
    • View Profile
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #12 on: January 06, 2009, 10:03:46 PM »
I've done a few tweaks to implement this on our mud, and came across one minor error in the man page, which was most confusing for a while

Quote
  > eval -b levenshtein_distance("terrible", "tribble")
    [Integer] 2

This distance is 3, not 2 - delete e, r, and insert b

Offline chaos

  • BFF
  • ***
  • Posts: 291
  • Job, school, social life, sleep. Pick 2.5.
    • View Profile
    • Lost Souls
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #13 on: January 07, 2009, 07:21:10 AM »
I've done a few tweaks to implement this on our mud, and came across one minor error in the man page

Yeah.  See my post of 2008-11-13.  The fundamental error was in the implementation; the error in the man page was consequential of that.

Offline cratylus

  • Your favorite and best
  • Administrator
  • ***
  • Posts: 1018
  • Cratylus@Dead Souls <ds> np
    • View Profile
    • About Cratylus
Re: Levenshtein and Damerau-Levenshtein distance sefuns
« Reply #14 on: January 13, 2009, 12:09:33 PM »
You're welcome.  Yeah, it's for fuzzymatch.  Identifying things like 'fiteball' and 'ifreball' is exactly what I had in mind.

Sweet!

What's the licensing for the code/mans you provided?

I've done a little translating of your code so that it plugs into Dead Souls
as a sefun file. MudOS/FluffOS handles "second-argument-to-allocate-being-a-pointer"
rather, uh...unexpectedly. Here's a gently-tested DS-ified version:

cat fuzzymatch.c
Code: [Select]
int levenshtein_distance(string a, string b) {
    int alen = sizeof(a);
    int blen = sizeof(b);
    mixed foo, baz = alen+1;
    mixed array dist = ({});
    while(baz){
        dist += ({ allocate(blen +1) });
        baz--;
    }
    foreach(foo in dist){
        foo[0] = baz;
        baz++;
    }
    baz = 1;
    while(baz < (blen + 1)){
        dist[0][baz] = baz;
        baz++;
    }
    for(int i = 0; i < alen; i++){
        for(int j = 0; j < blen; j++){
            dist[i + 1][j + 1] = min( ({ dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]) }) );
        }
    }
    return dist[alen][blen];
}

int damerau_levenshtein_distance(string a, string b) {
    int alen = sizeof(a);
    int blen = sizeof(b);
    mixed foo, baz = alen+1;
    mixed array dist = ({});
    while(baz){
        dist += ({ allocate(blen +1) });
        baz--;
    }
    foreach(foo in dist){
        foo[0] = baz;
        baz++;
    }
    baz = 1;
    while(baz < (blen + 1)){
        dist[0][baz] = baz;
        baz++;
    }
    for(int i = 0; i < alen; i++)
        for(int j = 0; j < blen; j++)
            if(i && j && a[i] == b[j - 1] && a[i - 1] == b[j])
                dist[i + 1][j + 1] = min( ({ dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]), dist[i - 1][j - 1] + (a[i] != b[j]) }) );
            else
                dist[i + 1][j + 1] = min( ({ dist[i][j + 1] + 1, dist[i + 1][j] + 1, dist[i][j] + (a[i] != b[j]) }) );
    return dist[alen][blen];
}


I did notice a minor difference in your results and mine...hopefully someone will point out where I goofed.

Quote
12:06:47 Dead Souls Omega /secure/sefun > eval return levenshtein_distance("kitten", "mittens")
Result = 2

12:06:56 Dead Souls Omega /secure/sefun > eval return levenshtein_distance("terrible", "tortuous")
Result = 6

12:06:59 Dead Souls Omega /secure/sefun > eval return levenshtein_distance("terrible", "tribble")
Result = 3

12:07:03 Dead Souls Omega /secure/sefun > eval return levenshtein_distance("know", "nkow")
Result = 2

12:07:07 Dead Souls Omega /secure/sefun > eval return damerau_levenshtein_distance("kitten", "mittens")
Result = 2

12:07:13 Dead Souls Omega /secure/sefun > eval return damerau_levenshtein_distance("terrible", "tortuous")
Result = 6

12:07:16 Dead Souls Omega /secure/sefun > eval return damerau_levenshtein_distance("terrible", "tribble")
Result = 3

12:07:22 Dead Souls Omega /secure/sefun > eval return damerau_levenshtein_distance("know", "nkow")
Result = 1


-Crat
Link to my favorite Damrau


EDIT: deleted and reposted after I realized I'd goofed in removing the dist0 manipulation.
« Last Edit: January 13, 2009, 12:16:15 PM by cratylus »