Unsupervised Spam Detection by Document Probability Estimation with Maximal Overlap Method
Title: | Unsupervised Spam Detection by Document Probability Estimation with Maximal Overlap Method |
Authors: | Uemura, Takashi | Ikeda, Daisuke | Kida, Takuya | Arimura, Hiroki |
Keywords: | unsupervised spam detection | document complexity | suffix tree | maximal overlap method | word salad |
Issue Date: | 2011 |
Publisher: | Japanese Society for Artificial Intelligence (人工知能学会) |
Journal Title: | Transactions of the Japanese Society for Artificial Intelligence |
Volume: | 26 |
Issue: | 1 |
Start Page: | 297 |
End Page: | 306 |
Publisher DOI: | 10.1527/tjsai.26.297 |
Abstract: | In this paper, we study content-based detection of spam that is generated by copying a seed document with some random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection. However, because document complexity is an ideal measure, like Kolmogorov complexity, we substitute an estimated occurrence probability of each document for its complexity. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in time linear in the collection's total length. Experimental results show that our algorithm works especially well for word salad spam, which is believed to be difficult to detect automatically. |
Type: | article |
URI: | http://hdl.handle.net/2115/47125 |
Appears in Collections: | Graduate School of Information Science and Technology / Faculty of Information Science and Technology (情報科学院・情報科学研究院) > Peer-reviewed Journal Articles, etc (雑誌発表論文等) |
Submitter: Kida, Takuya (喜田 拓也)