strdist(X;Y;M;N)

Returns the edit distance between two given strings using a specified method.

Syntax

strdist(X;Y;M;N)
bstrdist(X;Y;M;N)

Input

Argument Type Description
X text The first of the two strings between which to determine the edit distance

A scalar value or the name of a column

Y text The second of the two strings between which to determine the edit distance

A scalar value or the name of a column

M text Specifies the method used to calculate the edit distance between the two given strings

M must be one of the following:

  • 'id' for insert-delete distance
  • 'lev' for Levenshtein distance
  • 'dl' for Damerau-Levenshtein distance
N integer Specifies the type of measure
  • N=0 specifies a non-normalized measure
  • N=1 specifies a normalized measure
  • N=-1 specifies the length of the longest common subsequence instead of an edit distance (when M='id')

Return Value

Returns a decimal number measuring the edit distance between X and Y. The measure is based on the minimum number of edits required to transform one string into the other.

An "edit" is defined as an insertion or deletion for insert-delete distance; as either of those or a single-character substitution for Levenshtein distance; and as any of those or a transposition of adjacent characters for Damerau-Levenshtein distance.
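The three edit definitions above can be illustrated with a short dynamic-programming sketch in Python. This is not the implementation behind strdist; the function name edit_distance is hypothetical, lengths are counted in Python characters (code points), and the 'dl' branch assumes the restricted "optimal string alignment" form of Damerau-Levenshtein, which the text above does not confirm.

```python
def edit_distance(x: str, y: str, method: str = "lev") -> int:
    """Minimum number of edits turning x into y (illustrative sketch only).

    method: 'id'  -> insertions and deletions only
            'lev' -> insertions, deletions, substitutions
            'dl'  -> the above plus transposition of adjacent characters
                     (assumed: restricted / optimal-string-alignment variant)
    """
    if method not in ("id", "lev", "dl"):
        raise ValueError("method must be 'id', 'lev' or 'dl'")

    rows, cols = len(x) + 1, len(y) + 1
    # d[i][j] = distance between the first i chars of x and first j chars of y
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i              # delete all i characters
    for j in range(cols):
        d[0][j] = j              # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            best = min(d[i - 1][j] + 1,    # deletion
                       d[i][j - 1] + 1)    # insertion
            if x[i - 1] == y[j - 1]:
                best = min(best, d[i - 1][j - 1])        # characters match
            elif method != "id":
                best = min(best, d[i - 1][j - 1] + 1)    # substitution
            if (method == "dl" and i > 1 and j > 1
                    and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]):
                best = min(best, d[i - 2][j - 2] + 1)    # adjacent transposition
            d[i][j] = best
    return d[-1][-1]


for m in ("lev", "dl", "id"):
    print(m, edit_distance("formal", "fromage", m))
```

For 'formal' and 'fromage' the sketch gives 4, 3, and 5 for 'lev', 'dl', and 'id' respectively, matching the N=0 rows under Sample Usage.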

If N=0 then the result is the minimum number of edits required.

If N=1 then the result is normalized:
  • If M='lev' or 'dl', it is divided by the length of the longer of X and Y.
  • If M='id', it is divided by the sum of the lengths of X and Y.
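The normalization rule can be sketched in Python as follows. The helper name normalize is hypothetical, it takes an already-computed non-normalized distance as input, and it assumes lengths are counted in characters. The result for two empty strings is not specified above, so returning 0.0 in that case is an assumption.

```python
def normalize(distance: int, x: str, y: str, method: str) -> float:
    """Normalize a raw edit distance as described for N=1 (sketch only)."""
    if method == "id":
        denom = len(x) + len(y)        # sum of the two lengths
    elif method in ("lev", "dl"):
        denom = max(len(x), len(y))    # length of the longer string
    else:
        raise ValueError("method must be 'id', 'lev' or 'dl'")
    # Assumption: both strings empty -> 0.0 (not specified by the source).
    return distance / denom if denom else 0.0


print(normalize(1, "cat", "rat", "lev"))         # 1 / 3
print(normalize(3, "formal", "fromage", "dl"))   # 3 / 7
print(normalize(5, "formal", "fromage", "id"))   # 5 / 13
```

These three calls correspond to the normalized 'cat'/'rat' and 'formal'/'fromage' rows under Sample Usage.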

If N=-1 and M='id', the result is the length of the longest (not necessarily contiguous) subsequence common to the two strings, which is related to the insert-delete edit distance as follows:

edit_distance_between_X_and_Y = length_of_X + length_of_Y - 2*length_of_longest_common_subsequence
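The relationship above can be checked with a small Python sketch. The names lcs_length and id_distance_from_lcs are hypothetical, and the code uses the standard dynamic-programming computation of the longest common subsequence rather than whatever strdist does internally.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(y) + 1)          # LCS lengths for the previous row of x
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a char of x or of y
        prev = curr
    return prev[-1]


def id_distance_from_lcs(x: str, y: str) -> int:
    """Insert-delete distance derived from the LCS length via the formula above."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(lcs_length("formal", "fromage"))            # 4
print(id_distance_from_lcs("formal", "fromage"))  # 6 + 7 - 2*4 = 5
```

Every character outside the common subsequence must be deleted from one string or inserted from the other, which is why the two lengths are added and the shared characters are subtracted twice.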

If either X or Y is N/A:
  • If N=0, the result is the length of the non-N/A argument.
  • If N=1, the result is 1.

Sample Usage

string1 string2 method normalized strdist(string1;string2;method;normalized)
'cat' 'rat' 'lev' 0 1
'cat' 'rat' 'lev' 1 0.333333333333333
'apples' 'oranges' 'lev' 0 5
'apples' 'oranges' 'lev' 1 0.714285714285714
'apples' 'apples' 'lev' 0 0
'apples' 'apples' 'lev' 1 0
'formal' 'fromage' 'lev' 0 4
'formal' 'fromage' 'dl' 0 3
'formal' 'fromage' 'dl' 1 0.4285714285714285
'formal' 'fromage' 'id' 0 5
'formal' 'fromage' 'id' 1 0.3846153846153846
'formal' 'fromage' 'id' -1 4
'' 'fromage' 'id' 0 7
'' 'fromage' 'id' 1 1

Additional Information

  • strdist is Unicode (UTF-8) compliant and will work with Unicode or plain ASCII text fields.
  • If passed a string argument that is not valid Unicode, strdist signals an error by default (this behavior is configurable as a user preference).
  • A corresponding function bstrdist can be used with non-Unicode strings (e.g., binary or legacy encodings).