Wildcards: extending dictionaries and concept matching

HP 2009-02-25

We are extending dictionaries and concept matching in the TAC for the next
release. In addition to making them faster, we intend to make the algorithm
for matching terms more flexible and to put it under more direct user
control. If you are interested in matching trailing-wildcard terms in
dictionaries, we would appreciate your input on our plans.

Currently, a dictionary term matches text only if their letters and digits
match exactly, ignoring punctuation and case. Therefore, "AK-ase" and
"AKase" match each other, but neither of them matches "AKases" (for example).

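As a rough illustration of the current behavior (a sketch only, not the
actual TAC code; the function names are made up for this example), the
comparison amounts to stripping everything except letters and digits,
lowercasing, and testing for equality:

```python
def normalize(term: str) -> str:
    """Keep only letters and digits, and fold case (hypothetical helper)."""
    return "".join(ch.lower() for ch in term if ch.isalnum())


def exact_match(dictionary_term: str, text: str) -> bool:
    """Match only when the normalized strings are identical."""
    return normalize(dictionary_term) == normalize(text)


print(exact_match("AK-ase", "AKase"))   # True: punctuation and case are ignored
print(exact_match("AKase", "AKases"))   # False: the trailing "s" breaks the match
```
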
The plan is to use "TAC tokenization" to make this more flexible. When
building the dictionary, the user chooses the kind of tokenization they want
to perform on the dictionary terms and text. For example, they can choose
to lowercase or to enforce case sensitivity. They can choose to apply
stopwords and/or stemming. They can choose to ignore certain punctuation
characters, and so on. At match time, a match is accepted if the tokenized
dictionary term equals the tokenized text. For example, if the user had
chosen lowercase and standard stemming, then "AKases" would tokenize to
"akase", and would match a dictionary term "AKase" (which also tokenizes to
"akase").

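To make the proposal concrete, here is a minimal sketch of tokenization-based
matching. Everything in it is an assumption made for illustration: the option
names, the `tokenize` and `matches` functions, and especially `toy_stem`,
which is a crude suffix stripper standing in for whatever "standard stemming"
turns out to be. It is not the TAC API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerOptions:
    """Hypothetical user-selected tokenization settings."""
    lowercase: bool = True
    stem: bool = True
    ignore_punctuation: str = "-"          # characters removed before matching
    stopwords: frozenset = frozenset()     # tokens dropped entirely


def toy_stem(token: str) -> str:
    """Crude stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if suffix == "s" and token.endswith("ss"):
            continue                        # leave words like "class" alone
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text: str, opts: TokenizerOptions) -> tuple:
    """Apply the chosen options and return the resulting tokens."""
    for ch in opts.ignore_punctuation:
        text = text.replace(ch, "")
    tokens = text.split()
    if opts.lowercase:
        tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in opts.stopwords]
    if opts.stem:
        tokens = [toy_stem(t) for t in tokens]
    return tuple(tokens)


def matches(dictionary_term: str, text: str, opts: TokenizerOptions) -> bool:
    """Accept a match when both sides tokenize to the same sequence."""
    return tokenize(dictionary_term, opts) == tokenize(text, opts)


opts = TokenizerOptions()                   # lowercase + stemming
print(tokenize("AKases", opts))             # ('akase',)
print(matches("AKase", "AKases", opts))     # True
print(matches("recall", "recalling", opts))     # True
print(matches("recall", "recaller-ID", opts))   # False
```

With case sensitivity enforced (`TokenizerOptions(lowercase=False)`), "AKase"
and "akases" would no longer match under this sketch, because the tokens
"AKase" and "akase" differ.
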
Question 1: Does the tokenization approach above meet your
needs, or is wildcarding specifically required? (Note that with stemming,
all of the following would match: recall, recalled, recalling, recalls, BUT
NOT recaller-ID, if such a term existed.)

Question 2: As a result of tokenization, a dictionary term and text will match
even though they are not identical. How much control or insight do you need
into this? For example, we can calculate numerous statistics/properties
about how similar the two strings are -- things like "edit distance",
"capitalization match", "punctuation match", etc. Our plan is to let the
user choose a weight and/or threshold for each of these properties, and to
accept only terms that meet these conditions. For example, if the user
sets "Max Edit Distance" to 2, then "recalled" would match "recall" (2 char
difference), but "recalling" would not match (3 char difference) -- even
though they both stem-match. Is this a useful level of control, or is
more/less control required?
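
The "Max Edit Distance" example can be sketched as follows. This assumes
"edit distance" means the usual Levenshtein distance (single-character
insertions, deletions, and substitutions) computed on the lowercased strings;
the function names and the `max_edit_distance` parameter are illustrative
only, and the stem-match step is assumed to have already succeeded.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def within_threshold(dictionary_term: str, text: str,
                     max_edit_distance: int) -> bool:
    """Filter applied after a stem match: reject pairs that differ too much."""
    distance = edit_distance(dictionary_term.lower(), text.lower())
    return distance <= max_edit_distance


print(edit_distance("recall", "recalled"))         # 2
print(edit_distance("recall", "recalling"))        # 3
print(within_threshold("recall", "recalled", 2))   # True
print(within_threshold("recall", "recalling", 2))  # False
```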

Thanks in advance,
Nancy
________________________
Nancy Miller Latimer
nlatimer@accelrys.com
Biological Sciences & TAC PM Pipeline Pilot™
(office) 858-799-5657
(mobile) 858-229-0290
Accelrys
10188 Telesis Court, Suite 100
San Diego, CA 92121-4779