Since the site’s inception, we’ve been amassing large amounts of content on which millions of people have come to depend. We have numerous ways of getting to the content, but the quickest and easiest way to find specific information is to search for it.
AnandTech Search 1.0 (ColdFusion Verity)
The first version of the site used a search server included with ColdFusion named “Verity”. Most people have heard of Verity; they are one of the industry leaders in enterprise search software. The version of Verity that was included with ColdFusion back then was a light version of the full-blown Verity Search server. Although it did quite well at locating content via Boolean searches, it lacked flexibility and wasn’t all that performant.
AnandTech Search 2.0 (Microsoft FullText Search)
After we migrated to Microsoft SQL Server, we decided to use the Full Text search that is built into SQL Server. SQL Server Full Text was introduced in version 7.0, and allows you to create catalogs that can contain multiple indexes on text column types. You can then configure Full Text to index the data in the background, or perform one-time or scheduled indexing of the data.
There are, however, a couple of caveats with Microsoft Full Text search. The first is that it throws errors when your search criteria contain “noise words”. By default, Full Text search is configured with a list of “noise words”. Microsoft (like many other search engine vendors) considers words like “because,been,before,being,between,both,but,by” to be common words that should not be contained in an index. Of course, you can trap this error easily in your application, but realistically, the search engine should just filter the words out of the search phrase itself.
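To illustrate the kind of filtering the search engine could have done on its own, here is a minimal Python sketch that strips noise words from a search phrase before it is turned into a Full Text search condition. This is not the code we used on the site: the function names are hypothetical, the word list is only the eight words quoted above (a real noise-word file is much longer), and the `"word" AND "word"` output is just one assumed way of combining the remaining terms.

```python
import re
from typing import Optional

# Illustrative subset only: the eight words quoted in the article.
# A real noise-word list is considerably longer.
NOISE_WORDS = frozenset(
    "because been before being between both but by".split()
)


def strip_noise_words(phrase: str, noise_words=NOISE_WORDS) -> str:
    """Return the phrase with noise words removed, keeping word order."""
    # Split on whitespace and commas; everything else stays in the token.
    tokens = re.findall(r"[^\s,]+", phrase)
    kept = [t for t in tokens if t.lower() not in noise_words]
    return " ".join(kept)


def build_search_condition(phrase: str) -> Optional[str]:
    """Build a search condition such as '"ddr2" AND "400"'.

    Returns None when nothing is left after filtering, so the caller can
    skip the query instead of sending an empty condition.
    """
    words = []
    for word in strip_noise_words(phrase).split():
        word = word.replace('"', "")  # drop stray quotes inside a term
        if word:
            words.append(f'"{word}"')
    if not words:
        return None
    return " AND ".join(words)


if __name__ == "__main__":
    print(build_search_condition("difference between DDR2 400"))
    print(build_search_condition("been by"))
```

The resulting string would still need to be passed to the database as a bound parameter rather than concatenated into the SQL text.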
The second and more important issue is how Full Text handles acronyms and numerical values in search strings. We never really did get to the bottom of the problem, but even with all of the noise words removed from Full Text, certain search phrases that contained acronyms and numerical data wouldn’t return results. Since our data is full of technical acronyms and numerical model numbers, this was a major issue for us.
48 Comments
glennpratt - Tuesday, September 6, 2005 - link
Add quotes when you want a common word included, i.e.: "i-ram" (http://search.anandtech.com/search?q=%22i-ram%22&a...)
Jeff7181 - Monday, September 5, 2005 - link
I like the article... but more importantly, AnandTech Search isn't completely useless anymore.
Lifted - Monday, September 5, 2005 - link
Hell, he will f---ing kill you!
Hi - Monday, September 5, 2005 - link
i guess anand is dead:(
Verdant - Monday, September 5, 2005 - link
at $3,000 it is a pretty good deal, but the 100,000 limitation is, well, a huge limitation, and the $30,000 pricetag on its big brother is not that competitive with software solutions/crawlers.... especially since IT can only search 500,000 documents.
Googer - Tuesday, September 6, 2005 - link
Are you kidding? It's a freaking P3 1.26GHz! I could do much better for two grand.
rajivdx - Friday, September 9, 2005 - link
Make no mistake about it, it's bloody fast!! It makes the rest of the Anandtech site look *absolutely* slow by comparison.
Verdant - Tuesday, September 6, 2005 - link
you've never looked at the prices of database software licenses, developers or content management packages before, have you?
Calin - Tuesday, September 6, 2005 - link
There are two processors in fact - and you could buy something like that (no hardware) for $1,500 easily (maybe not so easily as the 1.26GHz Pentium III could be hard to find, but you could replace them with Athlons or Pentium4s at the same performance for cheaper).
JustAnAverageGuy - Tuesday, September 6, 2005 - link
You're mostly paying for the software.