Levenshtein String Similarity Search

CH 2010-12-30

Name: Levenshtein String Similarity Search

Author: Christian Herhaus, Merck Serono

Version: 1.0

Created: 1/2009

Purpose: The Levenshtein distance (also called "edit distance") measures the similarity between two alphanumeric strings as the minimum number of single-character insertions, deletions, and substitutions (mutations) needed to turn one string into the other. This concept, which comes from bioinformatics, can also be used for "fuzzy" string comparisons, e.g. where the data to be compared may contain errors because it was not dictionary-controlled. The attached component expects two data streams, one of which is tagged as the reference stream. The reference stream has to enter the component first.
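
To make the distance measure concrete, here is a minimal sketch of the standard dynamic-programming calculation. It is written in Python purely for illustration; the component itself relies on Perl on the server, and its actual code is not reproduced here. The function name levenshtein is an assumption of this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    # previous[j] holds the distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca by cb, or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("flaw", "lawn"))       # 2
```

Only two rows of the distance table are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.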

The component compares every record of the data stream with the reference data, using the Levenshtein distance as the similarity measure. It passes all records for which at least one record from the reference stream lies below the user-defined Levenshtein distance threshold. Each record leaving the pass port carries two new array properties, LevenshteinReference and LevenshteinDistance, which hold all matching reference records together with their distances.
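
The filtering behaviour described above can be sketched as follows. This is an illustrative Python sketch, not the component's Perl code. The function similarity_search, the "value" key, and the sample drug names are hypothetical; only the two property names LevenshteinReference and LevenshteinDistance come from the component. "Below the threshold" is read here as strictly smaller, which is an assumption, and records without a match are simply dropped.

```python
def levenshtein(a, b):
    """Edit distance via the usual two-row dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def similarity_search(records, references, threshold):
    """Keep records with at least one reference strictly below threshold.

    Assumption: the boundary is exclusive (distance < threshold).
    """
    passed = []
    for record in records:
        matches = [(ref, levenshtein(record, ref)) for ref in references]
        matches = [(ref, dist) for ref, dist in matches if dist < threshold]
        if matches:
            passed.append({
                "value": record,  # hypothetical key for the compared string
                "LevenshteinReference": [ref for ref, _ in matches],
                "LevenshteinDistance": [dist for _, dist in matches],
            })
    return passed


# Hypothetical sample data: the reference list is available before the records.
references = ["aspirin", "ibuprofen", "paracetamol"]
records = ["asprin", "ibuprofin", "caffeine"]

for hit in similarity_search(records, references, threshold=3):
    print(hit["value"], hit["LevenshteinReference"], hit["LevenshteinDistance"])
# asprin ['aspirin'] [1]
# ibuprofin ['ibuprofen'] [1]
```

Every record is compared against every reference, so the run time grows with the product of the two stream sizes. This also suggests why the reference stream has to arrive first: all references must be known before any record can be tested.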

Requirements: Pipeline Pilot 6.1.5 or later
Perl on Server (part of the distribution)

O/S: Windows and Linux

Limitations: none known

Keyword: string similarity search edit distance levenshtein

Contents: Levenshtein Similarity Search.xml, Levenshtein Search Example.xml

Installation:

1. Unzip the archive.
2. Import the component and/or the example protocol into your user tab in the Pipeline Pilot client by dragging and dropping it into the Explorer window.
3. Open the example protocol.
4. Run the protocol to check that the component works, then try out modified component parameters.