>I need to do "fuzzy searches on strings (multi word business names). to match similar names. I have been experimenting with the levenshtein algorithm and other matching algorithms. Levenshtein is too slow and only seems to work well with misspelled single words.
>
>I think it will be helpful to evaluate how many words match in a given string, and ideally the order in which they match, e..g
>regarding the string, HACKENSACK UNIVERSITY HOSPITAL, HACKENSACK UNIVERSITY would be a better match than
>HACKENSACK HOSPITAL, if we consider that the first two words matching means more than the first and third. I think that USUALLY this would hold true for many business name searches. Ironically though in the example I just gave, the second string is more likely to be a match. Even so, I want to prioritize multi word matches by how many words match, with a priority given to the words matching in order.
>
>Given this I wanted to be able to go beyond the getwordnum() function. I wanted to be able to get the word count, the length of each word, and the number of spaces between each word. If I don't have an exact match, for example, on a four word string, I want to first check for a match on the first three words, then the first two if that doesn't match, and so on. The only way I could figure out to do this was to develop the function below. It will enable me to concatenate the search string in the ways I need to.
>
>I have tested this on a number of different strings... seems to be working. I'm sure though that someone has done a better job of this than me, or that there are libraries of string handling functions out there that are better than this. Note that I haven't handled other delimiters, nulls, and I'm sure a lot else.
>
>Any feedback or help on this would be appreciated.
>
>
>
>
>
>LPARAMETERS tc_string
>SET STEP ON
>CREATE CURSOR cur_words (cwholestring char(250), cword char(100), ncount integer, nnumspaces integer, Nwordlen integer )
>lc_string=''
>lc_word=''
>ln_space_count=0
>ln_non_space_count=0
>lc_spaces=''
>ln_word_count=0
>ln_chars_in_word=0
>IF RIGHT(tc_string,1) <> CHR(32)
> tc_string=tc_string+ SPACE(1)
>endif
>ln_string_len=LEN(tc_string)
>FOR i = 1 TO ln_string_len
> lc_char=SUBSTR(tc_string,i,1)
> IF asc(lc_char) = 32
> IF ln_non_space_count<>0
> ln_word_count=ln_word_count+1
>
> APPEND blank IN cur_words
> replace cwholestring with tc_string in cur_words
> replace cword WITH lc_word IN cur_words
> replace nwordlen WITH ln_chars_in_word IN cur_words
> ln_chars_in_word=0
> ln_non_space_count=0
> lc_word=''
> ENDIF
> lc_spaces=lc_spaces+" "
> ln_space_count=ln_space_count+1
> ELSE
>
> IF ln_chars_in_word=0
> APPEND blank IN cur_words
> replace cwholestring with tc_string in cur_words
> replace cword WITH '' IN cur_words
> replace nnumspaces WITH ln_space_count IN cur_words
> ln_space_count=0
> endif
> ln_chars_in_word=ln_chars_in_word+1
> lc_word=lc_word+lc_char
> ln_non_space_count=ln_non_space_count+1
>
>
>
> endif
>
>endfor
>
>
If your code is working, what improvements are you looking for?
If you're looking for different approaches here are a couple of things to look at:
- Text-search engines such as SQL Server full-text search
- If you want your searches to work "like Google" you could consider the Google Search Appliance:
http://www.google.com/enterprise/search/gsa.html
Regards. Al
"Violence is the last refuge of the incompetent." -- Isaac Asimov
"Never let your sense of morals prevent you from doing what is right." -- Isaac Asimov
Neither a despot, nor a doormat, be
Every app wants to be a database app when it grows up