Due to a recent requirement, I have had to develop a search algorithm and I wanted to create something which made sense and didn't follow the ranking and age ideology behind Google's search engine. Not that I consider Google's search engine something to scoff at mind you; it's a great search algorithm, but it doesn't suit the purpose of my implementation. My objective has been to create a search engine which searches within the site it's hosted on and I wanted it to be as relevant as possible;
- Search must be human-like
- Search must allow for Fuzzy Matching
- Search must search through Pages
With this in mind, I set about trying to resolve how I would go about achieving this. Since I am searching through Pages and hence HTML, I resolved to first strip out HTML Tags and Special Characters such as To do this, I made use of Regex and put together the following expression:
Which basically translates to:
- Remove anything which starts with '&' and ends with ';'
- Remove anything which starts with '<' and ends with '>'
Quite simple, but effective. Preserving class names, image URLs and what have you, is beyond scope. Now that I'm left with a clean, HTML-Free string, I set about thinking what the most natural and human-like method of searching would be. I put myself in the hypothetical shoes of a poor hypothetical man tasked with searching for a phrase in a book and determined that a human would not search for keywords but would instead search for that phrase. Hence I determined if a user searches for a phrase of a particular length, then the text must be scoured to find those words adjacent to one another. This however presents a problem; human beings do not have infallible memory, so the search might not contain the exact wording. Especially with superfluous words like "the", "and", "or"...etc. Hence the next step would be to strip out the most common words used in the English Language. This technique is known as stop words and a short google search afterwards landed me with the following satisfying list of stop words:
a, about, above, after, again, against, all,am, an, and, any, are, aren't, as,at, be,because, been, before, being, below, between,both, but, by, can't, cannot, could, couldn't,did, didn't, do, does, doesn't, doing, don't,down, during, each, few, for, from, further,had, hadn't, has, hasn't, have, haven't, having,he, he'd, he'll, he's, her, here, here's, hers,herself, him, himself, his, how, how's, i, i'd,i'll, i'm, i've, if, in, into, is, isn't, it,it's, its, itself, let's, me, more, most, mustn't,my, myself, no, nor, not, of, off, on, once,only, or, other, ought, our, ours, ourselves,out, over, own, same, shan't, she, she'd, she'll,she's, should, shouldn't, so, some, such, than,that, that's, the, their, theirs, them, themselves,then, there, there's, these, they, they'd, they'll,they're, they've, this, those, through, to, too,under, until, up, very, was, wasn't, we, we'd,we'll, we're, we've, were, weren't, what, what's,when, when's, where, where's, which, while, who,who's, whom, why, why's, with, won't, would, wouldn't,you, you'd, you'll, you're, you've, your, yours,yourself, yourselves
With that out of the way, I determined that I needed some kind of algorithm which would rank the search results in a non-static fashion. Seeing as I elected to make use of Fuzzy Logic, I knew I would have access to a scoring methodology, hence it would simply be a matter of calculating the average of all the results. All results below that average would not be considered, whereas the rest would be returned, ordered by their score. The testing results are quite accurate so far and given some more testing time, I will be providing a downloadable example.