Tuesday, October 07, 2003

The knowing fish in your ear.

Charles Miller has an interesting [post] about recent spams he has seen that were apparently engineered to counter Bayesian spam filtering.
In the past year, a lot of geeks have pushed Bayesian filters as the be-all and end-all of spam filtering. It doesn't really work that way. [Lexical analysis] (i.e. statistical analysis of human-readable text) will only go so far, and Bayesian techniques are just one tool in a very big toolbox. There are currently hundreds of other "cutting edge" techniques for using computers to analyze text; Bayes is just one of them.
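
To make the limitation concrete, here is a rough sketch of the word-counting idea behind a Bayesian filter, and of one way a spammer might try to dilute it. The training messages, the function names, and the "pad the message with ordinary words" trick are all assumptions made up for illustration; this is not the method of any particular filter, and not necessarily what the spams in Charles's post did.

```python
import math
from collections import Counter

def train(messages):
    """Count how often each word appears in a list of messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts

def spam_probability(text, spam_counts, ham_counts):
    """Naive Bayes estimate of P(spam | words), assuming equal priors."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    log_odds = 0.0
    for word in text.lower().split():
        # Add-one smoothing, so an unseen word does not zero out the score.
        p_spam = (spam_counts[word] + 1) / (spam_total + len(vocab))
        p_ham = (ham_counts[word] + 1) / (ham_total + len(vocab))
        log_odds += math.log(p_spam / p_ham)
    # Convert log odds back into a probability between 0 and 1.
    return 1 / (1 + math.exp(-log_odds))

# Toy training data, invented for this example.
spam_counts = train(["buy cheap pills now", "cheap pills cheap prices"])
ham_counts = train(["meeting notes for the project", "lunch at noon for the team"])

plain = "buy cheap pills"
# Hypothetical countermeasure: append words the filter has seen in real mail.
padded = plain + " meeting notes for the project team lunch"

plain_score = spam_probability(plain, spam_counts, ham_counts)
padded_score = spam_probability(padded, spam_counts, ham_counts)
print(f"plain:  {plain_score:.2f}")
print(f"padded: {padded_score:.2f}")
```

With this toy data the padded message scores lower than the plain one, because every "normal" word pulls the odds back toward legitimate mail. A filter that only counts words has no way to tell that the extra words are filler.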

"On the other hand, if we had a more sophisticated linguistic filter, it wouldn’t be hard for spammers to come up with a program that generated random, but grammatically correct sentences."


Several years ago I worked on a project that did just that. The objective was to beat a system not unlike a Bayesian analyzer - in this case, it was a search engine. We were trying to create search engine placement pages that were theoretically "unbeatable". Part of that system would generate HTML text that would look, to any computer, as if it had been written by a human. It used the different parts of speech correctly, conjugated English verbs correctly, and had a large (but still limited) vocabulary.
Of course, once you read its output, you'd see that it could only create nonsense. Most of its pages were good for a few laughs; I'll have to dig some up and post them.
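
For a feel of how little it takes to get "grammatical but meaningless" text, here is a minimal sketch that expands a tiny hand-written grammar at random. It is not the system from that project; the grammar, the vocabulary, and the function names are invented for illustration, and a real generator would also have to handle conjugation and a far larger word list.

```python
import random

# A toy context-free grammar: each symbol maps to its possible expansions.
# Symbols that do not appear as keys are treated as literal words.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Det", "Adj", "N"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "Det": [["the"], ["a"]],
    "Adj": [["green"], ["quiet"], ["salty"]],
    "N": [["fish"], ["filter"], ["manuscript"], ["spider"]],
    "V": [["reads"], ["decodes"], ["follows"]],
    "P": [["near"], ["under"], ["with"]],
}

def expand(symbol, rng):
    """Recursively expand a grammar symbol into a list of words."""
    if symbol not in GRAMMAR:
        return [symbol]  # a literal word
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words

def sentence(rng):
    """Generate one random sentence that follows the grammar."""
    return " ".join(expand("S", rng)).capitalize() + "."

rng = random.Random(2003)  # fixed seed so runs are repeatable
for _ in range(3):
    print(sentence(rng))
```

Every sentence this produces has a subject, a verb, and an object in the right order, so a purely statistical or syntactic check has little to object to. A human reader sees immediately that none of it means anything.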
During that project's development, I looked at "old school" text analysis techniques. Quite a few good ideas came from [cryptanalysis], but even more came from a far more interesting source: [The Voynich Manuscript].
Whether the Voynich Manuscript is a hoax or not doesn't matter. The various attempts to "decode" it resulted in a number of breakthroughs in text analysis - there are a lot of good computer science lessons to be learned from the [successes and mistakes] of the people trying to read it.

10/07/2003 08:27:59 PM
      

    