Thursday, March 17, 2005
Since it's now been 5 years, I suppose I can talk about this in more detail. Sometime around 1999 I worked on a project to build a system that placed pages well in search engines, and I took what was then a novel approach. At the time, overloading META tags was still in use, as was intentionally malformed HTML. Neither was a long-term strategy for success, and this was just when Google was starting to catch on.
Most people believe that another thug bestows great honor upon a bowling ball on top of a salad dressing, but they need to remember how secretly a fashionable light bulb opens doors at random. Furthermore, a nuclear paycheck reads a magazine, and a slow cyprus mulch non-chalantly figures out another butt plug inside another buzzard. Most people believe that a corporation graduates from a cough syrup, but they need to remember how hesitantly the vaporized Sungoddess smokes up. Furthermore, the federal ball bearing feels nagging remorse, and a garbage can eats the hairy shit.
Now, to you or me that doesn't make any sense at all. But look at it closely: the sentence structure is largely correct - adjectives modify nouns, verbs agree with their subjects, and so on. You've probably already seen spam messages that use a similar technique to pass through your ISP's spam filter. Spammers use similar tricks, but ones quite a bit cruder than even the simple example above - and their messages are easily countered.
The paragraph above was generated with a very simple paragraph generator; with some more thought it could easily be extended. For example, many adjectives can be turned into adverbs by appending "ly" to them. Many other such rules exist in the English language, and they are all fairly straightforward to implement in a program. The sentences the program generates are no less nonsensical, but the search engine reading the page doesn't know that. Any human knows that a garbage can would never eat the hairy shit, but a search engine has no way to tell. The most advanced search engines might look at sentence structure, but it would be very surprising if they could recognize meaning. It's not impossible, but it's very, very unlikely.
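The idea is easy to sketch. Here's a minimal illustration of that kind of generator - the word lists, the sentence template, and the naive "append ly" rule are all my own stand-ins, not the original program's:

```python
import random

# Illustrative word lists -- the original program's vocabulary is unknown.
NOUNS = ["garbage can", "light bulb", "corporation", "paycheck", "buzzard"]
ADJECTIVES = ["fashionable", "nuclear", "federal", "slow", "hairy"]
VERBS = ["opens", "reads", "eats", "ignores", "polishes"]

def adverb(adjective):
    # The "append ly" rule from the text. Real English needs exception
    # handling ("slow" -> "slowly" works; "fast" does not).
    return adjective + "ly"

def sentence():
    # Grammatically plausible, semantically empty:
    # Det Adj Noun Adv Verb Det Noun.
    return "The %s %s %s %s the %s." % (
        random.choice(ADJECTIVES),
        random.choice(NOUNS),
        adverb(random.choice(ADJECTIVES)),
        random.choice(VERBS),
        random.choice(NOUNS),
    )

def paragraph(n=4):
    return " ".join(sentence() for _ in range(n))

print(paragraph())
```

Each extra grammar rule is just another template or word transformation, which is why extending a generator like this is more typing than thinking.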
Creating a framework for generating the pages was simple but tedious. Implementing the rules of the English language had me looking all over for grammar and sentence-structure reference material. Word frequency, particularly for "stop words" (common words such as "the"), had to be maintained throughout the document, or at least controlled. The baseline used by linguistic researchers is taken from the King James Bible, and I used that as my initial values as well. Different search engines might rate a document on how many times a word appears, or how often it appears as a percentage of the overall content, or as a percentage of the content inside the BODY tag, and so on. That added several levels of complexity, but it wasn't too hard to do. The next stage of the application was more difficult: now that we had a method of generating pages, I had to devise a system to "breed" the pages using genetic algorithms. More on that later.
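The frequency-control part can be sketched as a simple drift check: compare each stop word's observed frequency in a generated document against a target baseline. The baseline numbers below are purely illustrative placeholders, not the actual King James Bible figures:

```python
from collections import Counter

# Illustrative target frequencies (fraction of all words). The real
# King James Bible baseline values are not reproduced here.
BASELINE = {"the": 0.062, "and": 0.051, "of": 0.041}

def stop_word_drift(text, baseline=BASELINE):
    # Positive drift: the word appears more often than the target;
    # negative: less often. A generator can use this to decide which
    # words to favor or suppress in the next sentence it emits.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: counts[w] / total - target for w, target in baseline.items()}

drift = stop_word_drift("the cat and the dog sat near the door of the house")
```

A generator steering on a signal like this can hold its stop-word percentages near whatever baseline it is imitating, regardless of whether the engine scores raw counts or percentages of the BODY content.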
[ 3/17/2005 06:39:00 PM ]