JURD: Joiner of Un-Readable Documents to Reverse Tokenization Attacks to Content-based Spam Filters

Abstract

Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. More than 85% of received e-mails are spam. Historical approaches to combating these messages, including simple techniques such as sender blacklisting or e-mail signatures, are no longer completely reliable. Many current solutions rely on machine-learning algorithms trained on statistical representations of the terms that most commonly appear in such e-mails. However, some attacks can subvert the filtering capabilities of these methods. Tokenization attacks, in particular, insert characters that split words into fragments, so that the filter builds an incorrect representation of the e-mail. In this paper, we introduce a new method that reverses the effects of tokenization attacks. Our method processes e-mails iteratively: starting from the first token, it builds candidate words and compares them against a common dictionary to which spam words have previously been added. We provide an empirical study of how tokenization attacks affect the filtering capability of a Bayesian classifier, and we show that our method can reverse the effects of these attacks.
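The idea of joining fragments and checking them against a dictionary can be illustrated with a minimal Python sketch. This is not the implementation described in the paper: the greedy longest-match strategy, the names `split_tokens` and `rejoin`, the `MAX_JOIN` limit and the tiny `DICTIONARY` are all assumptions made for illustration only.

```python
import re

# Hypothetical dictionary: a few common words plus spam terms (assumed).
DICTIONARY = {"buy", "cheap", "viagra", "now", "free", "money"}

MAX_JOIN = 10  # assumed upper bound on how many fragments form one word


def split_tokens(text):
    """Split on any non-alphanumeric character, as a simple tokenizer would."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


def rejoin(tokens, dictionary=DICTIONARY, max_join=MAX_JOIN):
    """Greedily merge consecutive tokens into the longest dictionary word."""
    result = []
    i = 0
    while i < len(tokens):
        best_end = None
        candidate = ""
        # Grow the candidate one token at a time, starting from token i.
        for j in range(i, min(i + max_join, len(tokens))):
            candidate += tokens[j]
            if candidate in dictionary:
                best_end = j  # remember the longest match found so far
        if best_end is None:
            result.append(tokens[i])  # no dictionary word starts here
            i += 1
        else:
            result.append("".join(tokens[i:best_end + 1]))
            i = best_end + 1
    return result


print(rejoin(split_tokens("buy ch.e.ap v-i-a-g-r-a now")))
# ['buy', 'cheap', 'viagra', 'now']
```

In this sketch, tokens that never form a dictionary word are passed through unchanged, so legitimate text outside the dictionary is left as it was tokenized.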

Publication
Consumer Communications and Networking Conference (CCNC), 2013 IEEE
