A. Lagopoulos, G. Tsoumakas (2020) Content-Aware Web Robot Detection, Applied Intelligence, Springer
Author(s): A. Lagopoulos, G. Tsoumakas
Keywords: Web Robot, Crawler, Semantics, Supervised Learning, Latent Dirichlet Allocation
Abstract: Web crawlers account for more than a third of total web traffic and threaten the security, privacy and veracity of web applications and their users. Businesses in finance, ticketing and publishing, as well as websites with rich and unique content, are the most affected. To address this problem, we present a novel web robot detection approach that exploits the content of a website, based on the assumption that human users are interested in specific topics, while web robots crawl the web randomly. Our approach extends the typical log-based feature representation of a user session with a novel set of features that capture the semantics of the content of the requested resources. In addition, we contribute a new real-world dataset, which we make publicly available, to help alleviate the scarcity of open data in this field. Empirical results on this dataset validate our assumption and show that our approach outperforms state-of-the-art methods for web robot detection.
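
The idea of extending log-based session features with content semantics can be illustrated with a minimal sketch. It assumes each requested resource has already been mapped to a topic distribution (for example by a Latent Dirichlet Allocation model, as the keywords suggest). The function names, the two semantic features (entropy of the mean topic distribution and mean pairwise cosine similarity), and the log feature names are hypothetical choices made for illustration; they are not the feature set defined in the paper.

```python
import math
from itertools import combinations


def cosine(p, q):
    """Cosine similarity between two topic vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def semantic_session_features(topic_vectors):
    """Summarise how topically focused a session is.

    topic_vectors: one topic distribution per requested resource.
    Both features are illustrative assumptions, not the paper's definitions.
    """
    n = len(topic_vectors)
    if n == 0:
        raise ValueError("a session needs at least one requested resource")
    k = len(topic_vectors[0])
    # Average topic distribution over all resources in the session.
    mean_topics = [sum(v[i] for v in topic_vectors) / n for i in range(k)]
    pairs = list(combinations(topic_vectors, 2))
    # A single-resource session is treated as perfectly coherent.
    mean_sim = sum(cosine(p, q) for p, q in pairs) / len(pairs) if pairs else 1.0
    return {
        "topic_entropy": entropy(mean_topics),
        "mean_pairwise_similarity": mean_sim,
    }


def extend_session(log_features, topic_vectors):
    """Append the semantic features to an existing log-based feature dict."""
    return {**log_features, **semantic_session_features(topic_vectors)}


# Hypothetical sessions: a focused visitor and a visitor hopping across topics.
focused = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
scattered = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

# "total_requests" and "image_ratio" are placeholder log-based features.
focused_row = extend_session({"total_requests": 2, "image_ratio": 0.5}, focused)
scattered_row = extend_session({"total_requests": 3, "image_ratio": 0.0}, scattered)
```

Under the abstract's assumption, a focused session should show lower topic entropy and higher pairwise similarity than a scattered one. The extended rows could then be passed to any supervised classifier alongside the usual log-based features.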