I have an SQLite database that stores several hundred or thousand strings, I keep an array of these strings that I grow so I can easily search my database more quickly. However the user can search with a search string and I will rank the strings in my database for their closeness to the search string. For example lets say they search “foo”. If I have entries, “foo” “foobar” and “foo foo” in my database, does anyone have any ideas for an algorithm that would rank these strings in order:
1. “foo” (its an exact match)
2. “foo foo” (it contains the search string twice)
3. “foobar” (it contains the search string once)
Does anyone know of or have any ideas for an algorithm that would have this result? I am working in both java and c++ if anyone wishes to post any code snippets, however I’m really just looking for ideas for algorithms.
Note, I would like something like fobar or fuo to also show up in the search results since it is 1 letter off from the search,
When you say you want the ranking to be in linear time, I guess you only want to analyse each string in the set once.
One relatively simple way to do that would be to calculate a score based on some rules you define. Of course, the more rules you have the longer it will all take, but as long as you implement the analysis well it shouldn’t take long even for thousands of strings.
An example would be that you say exact matches gain a score of 100, while containing the search string n number of times achieves a score of 10n, and containing it within another word n times gets 5n, and so on. If you implement your rules in a fairly decoupled way, you could tweak your rules a few times and see how well they perform under real searches until you’re happy with the accuracy of the search.
Once you have a set of scores, you can use some very fast sorting algorithm to sort your results for you in order of best score to worst. Of course, you would exclude results of a score less than x.
(Just as a side note, this technique would make it very easy to implement advanced search features, such as AND/OR/NOT, because you could split the analysis for search terms and combine their scores per result)