I need to do "fuzzy searches on strings (multi word business names). to match similar names. I have been experimenting with the levenshtein algorithm and other matching algorithms. Levenshtein is too slow and only seems to work well with misspelled single words.
I think it will be helpful to evaluate how many words match in a given string, and ideally the order in which they match, e..g
regarding the string, HACKENSACK UNIVERSITY HOSPITAL, HACKENSACK UNIVERSITY would be a better match than
HACKENSACK HOSPITAL, if we consider that the first two words matching means more than the first and third. I think that USUALLY this would hold true for many business name searches. Ironically though in the example I just gave, the second string is more likely to be a match. Even so, I want to prioritize multi word matches by how many words match, with a priority given to the words matching in order.
Given this I wanted to be able to go beyond the getwordnum() function. I wanted to be able to get the word count, the length of each word, and the number of spaces between each word. If I don't have an exact match, for example, on a four word string, I want to first check for a match on the first three words, then the first two if that doesn't match, and so on. The only way I could figure out to do this was to develop the function below. It will enable me to concatenate the search string in the ways I need to.
I have tested this on a number of different strings... seems to be working. I'm sure though that someone has done a better job of this than me, or that there are libraries of string handling functions out there that are better than this. Note that I haven't handled other delimiters, nulls, and I'm sure a lot else.
Any feedback or help on this would be appreciated.
LPARAMETERS tc_string
SET STEP ON
CREATE CURSOR cur_words (cwholestring char(250), cword char(100), ncount integer, nnumspaces integer, Nwordlen integer )
lc_string=''
lc_word=''
ln_space_count=0
ln_non_space_count=0
lc_spaces=''
ln_word_count=0
ln_chars_in_word=0
IF RIGHT(tc_string,1) <> CHR(32)
tc_string=tc_string+ SPACE(1)
endif
ln_string_len=LEN(tc_string)
FOR i = 1 TO ln_string_len
lc_char=SUBSTR(tc_string,i,1)
IF asc(lc_char) = 32
IF ln_non_space_count<>0
ln_word_count=ln_word_count+1
APPEND blank IN cur_words
replace cwholestring with tc_string in cur_words
replace cword WITH lc_word IN cur_words
replace nwordlen WITH ln_chars_in_word IN cur_words
ln_chars_in_word=0
ln_non_space_count=0
lc_word=''
ENDIF
lc_spaces=lc_spaces+" "
ln_space_count=ln_space_count+1
ELSE
IF ln_chars_in_word=0
APPEND blank IN cur_words
replace cwholestring with tc_string in cur_words
replace cword WITH '' IN cur_words
replace nnumspaces WITH ln_space_count IN cur_words
ln_space_count=0
endif
ln_chars_in_word=ln_chars_in_word+1
lc_word=lc_word+lc_char
ln_non_space_count=ln_non_space_count+1
endif
endfor