Understanding the Definition of Weighted Levenshtein Distance

Asked 2 years ago, Updated 2 years ago, 101 views

I'm asking because I don't really understand the definition of the weighted Levenshtein distance, which measures the distance between strings.

The normal Levenshtein distance is the minimum number of insertions, deletions, and replacements needed to edit string 1 into string 2. The weighted Levenshtein distance is said to be the sum of the weights assigned to the insertions, deletions, and replacements.

http://id.fnshr.info/2011/11/29/damerau/

In this case, which of the following interpretations does the majority of weighted Levenshtein distance algorithms use?

1. First find an edit sequence with the minimum number of operations (as in the ordinary Levenshtein distance), then return the sum of the weights of those operations.
2. Consider every possible edit sequence, and return the minimum total weighted cost over all of them.

In case 2, the optimal edit procedure itself can change depending on the costs assigned to each operation.
I implemented it intuitively assuming interpretation 1, but when I looked into it, most of what I found uses interpretation 2. For example:

go-lsd-parametrized/lsd.go at master · deltam/go-lsd-parametrized
https://github.com/deltam/go-lsd-parametrized/blob/master/lsd.go#L96

weighted-levenshtein/clev.pyx at master · infoscout/weighted-levenshtein
https://github.com/infoscout/weighted-levenshtein/blob/master/weighted_levenshtein/clev.pyx#L457
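
To make the difference concrete, here is a minimal sketch (my own, not taken from either repository above) of interpretation 2: a dynamic program that directly minimizes the total weighted cost, with the insertion, deletion, and replacement weights passed in as parameters. The function and parameter names are illustrative.

# Minimal sketch of interpretation 2: minimize the total weighted cost
# over all possible edit sequences. Names and default weights are illustrative.
def weighted_levenshtein(s, t, w_ins=1.0, w_del=1.0, w_sub=1.0):
    m, n = len(s), len(t)
    # d[i][j] = minimum total cost of editing s[:i] into t[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del              # delete s[i-1]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins              # insert t[j-1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else w_sub
            d[i][j] = min(
                d[i - 1][j] + w_del,               # delete s[i-1]
                d[i][j - 1] + w_ins,               # insert t[j-1]
                d[i - 1][j - 1] + sub,             # replace (or match)
            )
    return d[m][n]

# With all weights equal to 1 this reduces to the ordinary Levenshtein distance.
print(weighted_levenshtein("kitten", "sitting"))   # 3.0

Interpretation 1 would instead run the ordinary unweighted algorithm first, keep the resulting edit sequence, and only then add up the weights of those operations.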

algorithm natural-language-processing

2022-09-30 19:39

2 Answers

2 is correct. Interpretation 1 may not satisfy the distance axioms (specifically, the triangle inequality).

For a distance, taking a detour should never give a smaller total than going directly (the triangle inequality):
d(X, Y) ≤ d(X, Z) + d(Z, Y)

Let's calculate the following edit distances using method 1, with insertion cost = 1, deletion cost = 1, and replacement cost = 3.
d("ABCDEFG", "ABCXEFG")=3
d("ABCDEFG", "ABCEFG") = 1
d("ABCEFG", "ABCXEFG") = 1

d("ABCDEFG", "ABCEFG")+d("ABCEFG", "ABCXEFG")=2
I took a detour, but my score got smaller.

This does not happen when the distance is calculated using method 2.
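
As a quick cross-check under method 2 (this reuses the weighted_levenshtein sketch from the question above, so it assumes that function is in scope, with the same illustrative costs):

# Assumes the weighted_levenshtein sketch from the question is in scope.
costs = dict(w_ins=1.0, w_del=1.0, w_sub=3.0)
print(weighted_levenshtein("ABCDEFG", "ABCXEFG", **costs))  # 2.0: deleting D and inserting X beats a cost-3 replacement
print(weighted_levenshtein("ABCDEFG", "ABCEFG", **costs))   # 1.0: delete D
print(weighted_levenshtein("ABCEFG", "ABCXEFG", **costs))   # 1.0: insert X
# 2.0 <= 1.0 + 1.0, so the detour is not cheaper and the triangle inequality holds.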


2022-09-30 19:39

The definition of the cost itself can be changed according to what you want to do, so either definition is possible in principle, but when people say "weighted edit distance", definition 2 is usually what is meant. From a quick search: Stanford's "Introduction to Information Retrieval" uses definition 2, and the implementation in WikiBooks' "Algorithm Implementation/Strings/Levenshtein distance" also follows definition 2.

Suppose you want to compare how similar two DNA sequences are, and you decide to use edit distance for that. Since DNA changes over time, the probabilities of insertion, deletion, and replacement are presumably not equal. If you want two sequences to be close when one could plausibly have arisen from the other in a short time, you should define the distance accordingly: instead of giving every single change a cost of 1, you can weight each operation according to its probability of occurring. Following this view leads you to definition 2. Definition 1, by contrast, fixes the course of the change first (the path with the fewest edits) and only then adds up the costs along that path, so it does not really take into account which path between the two sequences was the most likely one.
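
As one illustration of that viewpoint (the probabilities below are made up for the example, not real mutation rates), a common way to turn per-operation probabilities into costs is to take negative log-probabilities, so that more probable changes are cheaper:

import math

# Hypothetical per-operation probabilities; illustrative values only.
p_ins, p_del, p_sub = 0.02, 0.05, 0.10

# Negative log-probability: more likely operations get smaller costs.
w_ins = -math.log(p_ins)   # about 3.91
w_del = -math.log(p_del)   # about 3.00
w_sub = -math.log(p_sub)   # about 2.30

# These weights would then be used as the insertion/deletion/replacement
# costs in a definition-2 style dynamic program (such as the
# weighted_levenshtein sketch above), so that the cheapest alignment
# corresponds to the most probable sequence of independent changes.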

However, the exact definition can still vary depending on what is meant by "weighted", regardless of which one is in the majority, so please check the definition used in whatever you are currently reading or implementing.


2022-09-30 19:39


