Name: Levenshtein String Similarity Search
Author: Christian Herhaus, Merck Serono
Version: 1.0
Created: 1/2009
Purpose: The Levenshtein distance (also called "edit distance") measures the similarity between two alphanumeric strings as the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. This concept, widely used in bioinformatics, can also be applied to "fuzzy" string comparisons, e.g. in cases where the data to be compared may contain errors because it was not dictionary-controlled. The attached component expects two data streams, one of which is tagged as the reference stream. The reference stream has to enter the component first.
The component compares all records of the data stream with the reference data, using the Levenshtein distance as the similarity measure. It passes all records whose distance to a reference record falls below a user-defined threshold. Records leaving the pass port therefore satisfy the condition that at least one record from the reference stream lies below the user-defined Levenshtein distance threshold. They carry two new array properties, LevenshteinReference and LevenshteinDistance, which hold all matching reference records together with their distances.
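The behaviour described above can be illustrated outside Pipeline Pilot with a minimal Python sketch. It is not the component's actual implementation (which runs Perl on the server). The function names levenshtein and similarity_search, the dictionary layout of the returned records and the strict "less than" reading of "falls below the threshold" are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity_search(records, references, threshold):
    """Return the records with at least one reference below the threshold.

    Assumption: "below" is read as strictly less than the threshold.
    Each passed record lists its matching references and their distances,
    mirroring the LevenshteinReference / LevenshteinDistance properties.
    """
    passed = []
    for value in records:
        matches = []
        for ref in references:
            distance = levenshtein(value, ref)
            if distance < threshold:
                matches.append((ref, distance))
        if matches:
            passed.append({
                "value": value,
                "LevenshteinReference": [ref for ref, _ in matches],
                "LevenshteinDistance": [d for _, d in matches],
            })
    return passed


if __name__ == "__main__":
    hits = similarity_search(["asprin", "ibuprofen", "xyz"],
                             ["aspirin", "ibuprofen"], threshold=2)
    for hit in hits:
        print(hit)
```

With a threshold of 2, the misspelled "asprin" passes with reference "aspirin" at distance 1, "ibuprofen" passes at distance 0, and "xyz" is dropped because no reference lies below the threshold.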
Requirements: Pipeline Pilot 6.1.5 or later
Perl on Server (part of the distribution)
O/S: Windows and Linux
Limitations: none known
Keyword: string similarity search edit distance levenshtein
Contents: Levenshtein Similarity Search.xml, Levenshtein Search Example.xml
Installation:
1. Unzip the archive.
2. Import the component and/or the example protocol into your user tab in the Pipeline Pilot client by dragging and dropping it into the Explorer window.
3. Open the example protocol.
4. Run the protocol to check that the component works, then try modifying the component parameters.