Mark Mintoff My superpower is common sense

7Mar/131

Regex Split by Comma not Surrounded by Quotes

Recently I had a requirement where I had to programmatically extract data from a very messy CSV file and populate a table. The data contained addresses, which included commas in them. In order to distinguish between a comma which separates columns and a comma in the text, the text was surrounded by double quotes.

The solution is obvious; use Regex. The regular expression however is not so straight-forward.

In C#, with escape characters:

Works like a charm.

VN:F [1.9.22_1171]
Rating: 4.9/5 (19 votes cast)
10Jan/130

Developer Friendly Windows Search

As a developer, I find Windows' search incredibly lacking. I can never find the files I want, especially if I'm trying to find a particular pattern of text within the files. To this end, I resorted to making my own simple search algorithm:

I hope you find this to be as useful as I do.

VN:F [1.9.22_1171]
Rating: 5.0/5 (3 votes cast)
2May/120

Adding Regular Expression Support to SQL Server

First and foremost, it is worth noting that SQL Server supports .Net assemblies, allowing us to create methods which can be used from SQL.

.Net assemblies (dll files) can be loaded into the below folder.

With that out of the way, we can get started on creating a .Net assembly. SQL Server does not support regular expressions natively, so I thought it would be useful to add this functionality as my first assembly.

To start, we can create a Class Library Solution (I called mine 'TextFunctions') and add a single class (I called mine 'StringUtility'). The following is the code I wrote for my StringUtility class:

Very simple bit of code, which allows me to check for Regex matches. No big deal. Now we must install this assembly into the database. First we must check to ensure that the database is CLR enabled, so we run:

Next we will create the assembly. Please note that the dll file must be on the same machine as the database is. You might also want to specify the database you will be adding this to. I added a simple "use master;" to my script.

Now we can install the RGX function I created.

Note that the syntax for EXTERNAL NAME is SolutionName.[Namespace.ClassName].MethodName

That is all! You now have a brand new function which executes C# code and returns a boolean value back! Let us try some simple examples:

This function can also be used in WHERE clause conditions as follows:

Enjoy!

VN:F [1.9.22_1171]
Rating: 4.3/5 (3 votes cast)
9Sep/110

Designing a Search Algorithm

Due to a recent requirement, I have had to develop a search algorithm and I wanted to create something which made sense and didn't follow the ranking and age ideology behind Google's search engine. Not that I consider Google's search engine something to scoff at mind you; it's a great search algorithm, but it doesn't suit the purpose of my implementation. My objective has been to create a search engine which searches within the site it's hosted on and I wanted it to be as relevant as possible;

  • Search must be human-like
  • Search must allow for Fuzzy Matching
  • Search must search through Pages

With this in mind, I set about trying to resolve how I would go about achieving this. Since I am searching through Pages and hence HTML, I resolved to first strip out HTML Tags and Special Characters such as   To do this, I made use of Regex and put together the following expression:

Which basically translates to:

  • Remove anything which starts with '&' and ends with ';'
  • Remove anything which starts with '<' and ends with '>'

Quite simple, but effective. Preserving class names, image URLs and what have you, is beyond scope. Now that I'm left with a clean, HTML-Free string, I set about thinking what the most natural and human-like method of searching would be. I put myself in the hypothetical shoes of a poor hypothetical man tasked with searching for a phrase in a book and determined that a human would not search for keywords but would instead search for that phrase. Hence I determined if a user searches for a phrase of a particular length, then the text must be scoured to find those words adjacent to one another. This however presents a problem; human beings do not have infallible memory, so the search might not contain the exact wording. Especially with superfluous words like "the", "and", "or"...etc. Hence the next step would be to strip out the most common words used in the English Language. This technique is known as stop words and a short google search afterwards landed me with the following satisfying list of stop words:


a, about, above, after, again, against, all,am, an, and, any, are, aren't, as,at, be,because, been, before, being, below, between,both, but, by, can't, cannot, could, couldn't,did, didn't, do, does, doesn't, doing, don't,down, during, each, few, for, from, further,had, hadn't, has, hasn't, have, haven't, having,he, he'd, he'll, he's, her, here, here's, hers,herself, him, himself, his, how, how's, i, i'd,i'll, i'm, i've, if, in, into, is, isn't, it,it's, its, itself, let's, me, more, most, mustn't,my, myself, no, nor, not, of, off, on, once,only, or, other, ought, our, ours, ourselves,out, over, own, same, shan't, she, she'd, she'll,she's, should, shouldn't, so, some, such, than,that, that's, the, their, theirs, them, themselves,then, there, there's, these, they, they'd, they'll,they're, they've, this, those, through, to, too,under, until, up, very, was, wasn't, we, we'd,we'll, we're, we've, were, weren't, what, what's,when, when's, where, where's, which, while, who,who's, whom, why, why's, with, won't, would, wouldn't,you, you'd, you'll, you're, you've, your, yours,yourself, yourselves


With that out of the way, I determined that I needed some kind of algorithm which would rank the search results in a non-static fashion. Seeing as I elected to make use of Fuzzy Logic, I knew I would have access to a scoring methodology, hence it would simply be a matter of calculating the average of all the results. All results below that average would not be considered, whereas the rest would be returned, ordered by their score. The testing results are quite accurate so far and given some more testing time, I will be providing a downloadable example.

VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)