The fallacy of corpus anti-spam evaluation
I get asked to review papers on anti-spam by various technical journals, and I am continually surprised by the insidiousness of text classification methods in anti-spam research. For instance, a lot of researchers are now using the TREC spam corpus to justify the effectiveness of their anti-spam technique, and journal editors are insisting on analysis based on this corpus. This is horribly broken.
Text classification research has relied on standard corpora to evaluate the effectiveness of new methods - the Reuters 21578 corpus, now the Reuters Corpus @ NIST, has virtually been a standard for this - and it stands to reason. If a classifier trains on 10% of stories about soccer and is able to detect the remaining 90% correctly, we can be quite certain that it will perform well on future soccer coverage. This method of testing reveals a classifier’s resilience to drift in vocabulary and topics, which is inherent to culture and the evolution of language.
But imagine a team of sports reporters whose job is not to accurately report the highlights of a soccer game but to wordsmith their story to read like a business section editorial. Their goal is to fool a text classifier trained on samples of soccer coverage, which they achieve easily by eschewing the colorful soccer lingo of offsides, red cards and free kicks, and using instead the mundane vocabulary of supply and demand curves, human resources and NASDAQ.
A text classifier that is immune to such wickedness is no longer passively modeling topic drift; rather, it’s trying to predict all the ways in which these fallen journalists will attempt to fool it. In other words, the classifier cannot rely on a corpus from the past to be meaningfully correlated with a corpus from the future. This is like the warning associated with stock trading (past performance is not a reliable indicator of future performance), except it is worse. The spam corpus of today is a function of the anti-spam systems of today; it is a direct result of spammers trying to defeat the anti-spam systems that are deployed. When a new anti-spam system is deployed, the nature of the corpus changes in direct response to the nature of that system. This is entirely unlike text classification research: you are always training on the wrong corpus!
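To make the soccer example concrete, here is a minimal sketch in Python. It assumes a toy multinomial Naive Bayes classifier with add-one smoothing and six made-up training sentences; none of this comes from TREC or any real filter, and the function names are mine. It only illustrates the point that a held-out story written in the old vocabulary is classified correctly, while a story deliberately worded to evade the classifier is not.

```python
import math
from collections import Counter

def train(docs):
    """Count words per label. docs is a list of (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, doc_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        n_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            score += math.log((counts[word] + 1) / (n_words + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy "corpus from the past" (invented sentences, for illustration only).
TRAINING = [
    ("soccer", "striker scored after a free kick"),
    ("soccer", "referee showed a red card for the foul"),
    ("soccer", "goal ruled offside by the referee"),
    ("business", "shares fell as demand weakened on the nasdaq"),
    ("business", "human resources costs hit quarterly earnings"),
    ("business", "supply and demand pushed shares higher"),
]
model = train(TRAINING)

# Passive drift: a new soccer story in the familiar vocabulary.
held_out = "late free kick goal after a red card"
# Active adversary: a soccer report wordsmithed to read like business news.
disguised = "supply of shares fell as demand pushed earnings higher"

print(classify(model, held_out))   # classified as soccer
print(classify(model, disguised))  # classified as business
```

The held-out score says nothing about the disguised story, which is the whole problem: the second test case only exists because the classifier does.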
There are no easy solutions to this, just like there are no easy solutions to predicting the stock market. One way to use corpora more meaningfully is to classify them in a taxonomy by the type of anti-spam technique they were meant to attack. The corpora that attack Naive Bayesian classifiers should be distinct from corpora that attack fingerprint classifiers, which should be distinct from corpora that attack network-based classifiers. Researchers should assess which technique is closest to their proposed system and evaluate their technique against the corresponding corpus. If the proposed system is entirely novel, researchers should acknowledge the inscrutability of their method against existing corpora and use simulation and predictive methods (and good old reasoning) to determine how their method will measure up to an active adversary.
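A taxonomy like this could be as simple as tagging each corpus with the technique it was built to defeat. The sketch below is one possible shape, not an existing tool: the `Technique` categories follow the three named above plus a placeholder for novel systems, and the corpus names in `CATALOG` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Technique(Enum):
    NAIVE_BAYES = "naive_bayes"
    FINGERPRINT = "fingerprint"
    NETWORK = "network"
    NOVEL = "novel"  # a proposed system with no close existing relative

@dataclass(frozen=True)
class Corpus:
    name: str            # hypothetical corpus name
    attacks: Technique   # the technique this spam was written to defeat

def corpora_for(technique, catalog):
    """Return the corpora whose spam was aimed at the given technique."""
    return [corpus for corpus in catalog if corpus.attacks is technique]

# Hypothetical catalog, for illustration only.
CATALOG = [
    Corpus("bayes-poisoning-sample", Technique.NAIVE_BAYES),
    Corpus("hash-busting-sample", Technique.FINGERPRINT),
    Corpus("botnet-relay-sample", Technique.NETWORK),
]

matches = corpora_for(Technique.FINGERPRINT, CATALOG)
print([corpus.name for corpus in matches])

# An empty result for a novel system is the signal to fall back on
# simulation and reasoning rather than corpus scores.
print(corpora_for(Technique.NOVEL, CATALOG))
```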
While a new method must always do well against old corpora, the fact that it does is not a guarantee that it will do well against future corpora. Tuning a method until it scores well on old corpora is a form of overfitting, and dependence on corpus-based evaluation of spam filters results in overfitting on known attack strategies. Strategic overfitting has disastrous effects in security, both electronic and real-world, and I really hope anti-spam research does not get bogged down by poor methodology.
Interesting!
What are all these spam comments? Only the first one is OK (besides mine …). This is ridiculous: so many spam comments appearing on a post about spam.
Ciprian Pop, yeah, someone is pumping over 10 / sec. I’ve cleaned up a whole bunch and added Akismet to the loop.
Don’t you hate spam too?