Wednesday, March 23, 2005
The objective of this project, of course, was to rank very highly in popular search engines for a given set of keywords. The search engine game is such that no trick worked forever - as soon as the search engine reindexed your pages, you'd probably fall in the rankings as they changed indexing techniques. The time between indexings, however, tended to be months rather than days or weeks. These days Google can reindex your site very quickly, but five years ago the major search engines like Altavista were indexing everything on the internet once every few months. This could be very frustrating, but it ended up working to our advantage in other ways.
At the time, search engine companies were starting to look to diversify their product lines (meaning they weren't making much money from surfers). The "intranet" craze of the late 1990s was not quite dead yet, and so most of the search vendors were beginning to offer "intranet" versions of their technology. You could literally buy HotBot or Excite in a box, either as a 1U server unit or as just a software package. Same engine, just more limited in how many pages it could index. This would prove to be their undoing. With their intranet products, we could reindex our pages and others in a test environment as fast as we liked, rather than wait two or three months for Altavista to reindex the entire internet to see if our pages were successful.
A very simple web spider app was written which would query each of the major search engines for our keywords. The pages in the top 2000 results for each would be downloaded, saved, and later added to the intranet search engine's corpuses. This populated our test environment with the best pages the internet had to offer for our keywords, at least in the eyes of the search engines. With those loaded and indexed, we were ready to start generating our own "counter" pages.
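The spider itself doesn't survive, but its core loop is easy to sketch. This is a minimal reconstruction in Java (the project's language); the endpoint URL and paging parameter names here are hypothetical stand-ins - every engine of that era had its own query format - but the shape is the same: page through the top 2000 results for each keyword, then fetch and save each result page for the intranet engine to index.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of the corpus-collection spider. Endpoint and parameter names
// ("q", "start", "num") are illustrative, not any real engine's API.
public class CorpusSpider {

    // Build the paged query URLs needed to cover the top `total` results.
    static List<String> queryUrls(String endpoint, String keyword,
                                  int total, int pageSize) {
        List<String> urls = new ArrayList<>();
        String q = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        for (int offset = 0; offset < total; offset += pageSize) {
            urls.add(endpoint + "?q=" + q + "&start=" + offset + "&num=" + pageSize);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Top 2000 results in pages of 100 -> 20 fetches per engine per keyword.
        List<String> urls = queryUrls("http://search.example/results",
                                      "blue widgets", 2000, 100);
        System.out.println(urls.size() + " result pages to fetch");
        System.out.println(urls.get(0));
        // The real spider then fetched each URL, extracted the result links,
        // downloaded the linked pages, and fed them to the intranet indexer.
    }
}
```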
I was very, very lucky that XML parsers were immature at the time of the project. For Java, the only working choice was OpenXML - everything else out there was in development or horribly broken. OpenXML had a very good HTML parser included, which solved a number of problems for me. It was rare to find web pages in the wild it could not parse, including those that ranked well in live search engines. Often pages would be ranked high because of poorly formatted HTML - double ALT tags, unbalanced BODY tags, and other things that made the few alternative HTML parsers not choke, but barf. A pretty simple set of changes to the sample code provided with OpenXML for HTML parsing added the ability to do word frequency counts inside page elements. This showed that Excite, for example, favored high keyword density in the NAME attribute of an HTML tag, and it favored it in the IMG tag. That gave me some additional direction for the code that generated the pages.
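The per-element frequency count can be illustrated with a small sketch. The real code walked OpenXML's parse tree; this stand-in uses a crude regex to pull out the text inside one kind of tag, which is enough to show the idea - density is measured per element, not per page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy version of the per-element keyword counter. A regex is a poor
// substitute for a real HTML parser (it will miss nested/malformed tags),
// but it demonstrates the measurement the OpenXML-based code performed.
public class TagDensity {

    // Count case-insensitive occurrences of `keyword` inside <tag>...</tag> bodies.
    static int keywordCountInTag(String html, String tag, String keyword) {
        Pattern p = Pattern.compile("<" + tag + "[^>]*>(.*?)</" + tag + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        String k = keyword.toLowerCase();
        int count = 0;
        while (m.find()) {
            String body = m.group(1).toLowerCase();
            int idx = 0;
            while ((idx = body.indexOf(k, idx)) >= 0) {
                count++;
                idx += k.length();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<title>widgets widgets</title><b>cheap widgets</b>";
        System.out.println(keywordCountInTag(html, "title", "widgets")); // 2
        System.out.println(keywordCountInTag(html, "b", "widgets"));     // 1
    }
}
```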
And this is the point in the story where I should be giving you a nice background summary of what genetic algorithms are, how they work, and all of that. After a few drafts, I'm not. I'll leave it to the experts. This [wikipedia article] explains it better than I could, and this [diagram] is better than what I was able to sketch out. The basic concept is that, as in real-world evolution, you progressively breed a solution to a problem, along the way picking out the offspring that come closest to whatever fitness criteria you set.
In the case of this project, the fitness criterion was how each generated page ranked in the intranet search engine. The generated pages were competing against the top pages downloaded from the live search engines out on the internet, and they had to come out on top. The application would breed a set of pages with random and semi-random word densities in tags, save information on how each page was created (densities per tag, etc.) locally as "genes", and then submit them to the intranet search engine to be indexed. Once they were in the index along with the initial pages from the internet, the application would search the intranet search engine for the target keywords and see how each page ranked. The higher a generated page ranked, the more likely it would be used as a seed in the next cycle, passing its "genes" on to the next generation of pages.
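The breeding loop described above can be sketched as follows. In the real system the fitness score came from the page's rank in the intranet search engine; here a toy fitness function (distance from a hidden "ideal" density profile) stands in so the loop runs on its own. Genes are simply keyword densities per tag, and the selection/crossover/mutation scheme shown is one common choice, not necessarily the one the original code used.

```java
import java.util.Random;

// Skeleton of the page-breeding loop. fitness() is a stand-in for
// "index the page, search, and see where it ranks" - closer to the hidden
// ideal density profile means a better score (higher rank).
public class PageBreeder {

    static final Random RNG = new Random(42);

    // Negative distance from the ideal profile: 0 is a perfect page.
    static double fitness(double[] genes, double[] ideal) {
        double d = 0;
        for (int i = 0; i < genes.length; i++) d += Math.abs(genes[i] - ideal[i]);
        return -d;
    }

    static double[] randomGenes(int n) {
        double[] g = new double[n];
        for (int i = 0; i < n; i++) g[i] = RNG.nextDouble();
        return g;
    }

    // Tournament selection: of two random pages, the better-ranked one breeds.
    static double[] select(double[][] pop, double[] ideal) {
        double[] a = pop[RNG.nextInt(pop.length)], b = pop[RNG.nextInt(pop.length)];
        return fitness(a, ideal) >= fitness(b, ideal) ? a : b;
    }

    // Uniform crossover plus occasional mutation, clamped to [0, 1] density.
    static double[] breed(double[] p1, double[] p2) {
        double[] child = new double[p1.length];
        for (int i = 0; i < child.length; i++) {
            child[i] = RNG.nextBoolean() ? p1[i] : p2[i];
            if (RNG.nextDouble() < 0.1)
                child[i] = Math.min(1, Math.max(0, child[i] + (RNG.nextDouble() - 0.5) * 0.2));
        }
        return child;
    }

    // Run the loop and return the best fitness in the final population.
    static double evolve(double[] ideal, int popSize, int generations) {
        double[][] pop = new double[popSize][];
        for (int i = 0; i < popSize; i++) pop[i] = randomGenes(ideal.length);
        for (int gen = 0; gen < generations; gen++) {
            double[][] next = new double[popSize][];
            for (int i = 0; i < popSize; i++)
                next[i] = breed(select(pop, ideal), select(pop, ideal));
            pop = next;
        }
        double best = Double.NEGATIVE_INFINITY;
        for (double[] g : pop) best = Math.max(best, fitness(g, ideal));
        return best;
    }

    public static void main(String[] args) {
        double[] ideal = {0.8, 0.1, 0.5, 0.3};  // hidden "what the engine likes"
        System.out.println("random best:   " + evolve(ideal, 30, 0));
        System.out.println("after 10 gens: " + evolve(ideal, 30, 10));
    }
}
```

In the real project each "gene vector" would be rendered out as an actual HTML page and submitted to the intranet engine, with the search rank replacing the toy fitness score.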
In testing, this worked surprisingly well. On the first pass, a generated page would be in at least the top 25 search results, often in the top 15-20. Within 3-4 generations (depending on the keyword), a generated page would be first or second in the rankings. This was surprising enough that, at least with the Excite intranet engine, we started randomly changing the indexing parameters as a sanity check (e.g. whether Excite would or would not take poorly formed HTML). Again, within a few generations of page breeding, we had a result in the top five, even with the most restrictive indexing parameters possible.
[ 3/23/2005 02:14:00 PM ]