Web (ro)bots constantly request resources from web servers across the Internet without human intervention, indexing and scraping content with the aim of making information reachable and available on demand. Recent industry reports show that 37.9% of all web traffic in 2018 was generated by web robots (42.2% in 2017), affecting every industry worldwide. Bots may access web applications for beneficial purposes, such as indexing and health monitoring. However, around half of all bot traffic is considered malicious, threatening the security and privacy of
a web application and its users. With the ultimate goal of monetizing the information requested, malicious bots perform actions such as price and content scraping, account takeover and creation, credit card fraud, and denial-of-service attacks. Businesses in the finance, ticketing, and education sectors are the most affected by these actions and must deal not only with security issues but also with the unfair competition arising from such fraudulent practices. Furthermore, another common threat that web applications need to deflect is analytics skewing, caused by otherwise benign bots. Websites with unique and rich content, such as data repositories, marketplaces, and digital publishing portals, see their reports and metrics altered, rendering their validity questionable. Therefore, detecting web robots and filtering their activities are important tasks in the fight for a secure and trustworthy web.
Our contribution is a novel web robot detection approach that takes into account the content of a website. The key assumption of the proposed approach is that humans are typically interested in specific topics, subjects, or domains, while robots typically crawl all available resources uniformly, regardless of their content. Based on this assumption, our main contribution is a novel representation of web sessions that quantifies the semantic variance of the web content requested within a session. Correspondingly, our main research question is whether such a content-aware representation can improve over state-of-the-art approaches that neglect content. Furthermore, we contribute a dataset consisting of log file entries obtained from our university's library search engine in two forms: (i) the raw log files as obtained from the server, and (ii) their processed form as a labeled dataset of log entries grouped into sessions along with their extracted features. We make this dataset, the first publicly available one in this domain, openly accessible in order to provide a common ground for testing web robot detection methods, as well as other methods that analyze server logs.
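To illustrate the intuition behind the proposed representation, consider the following minimal sketch (a hypothetical illustration, not the exact representation defined later): assuming each requested page is described by a topic vector (e.g. from a topic model over the site's content), one simple way to quantify the semantic variance of a session is the mean pairwise cosine distance between the topic vectors of its requested pages.

```python
import numpy as np

def semantic_variance(topic_vectors):
    """Mean pairwise cosine distance between the topic vectors of
    pages requested in one session: low for topically focused
    (human-like) sessions, high for sessions that request
    semantically unrelated content (crawler-like)."""
    V = np.asarray(topic_vectors, dtype=float)
    n = len(V)
    if n < 2:
        return 0.0
    # Normalize each vector to unit length so dot products are cosines.
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = V @ V.T  # pairwise cosine similarities
    # Average similarity over distinct pairs (exclude the diagonal).
    mean_sim = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return 1.0 - mean_sim  # cosine distance = 1 - cosine similarity

# A focused session (similar topics) vs. a session crawling
# unrelated pages (orthogonal topics):
focused = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.85, 0.15, 0.0]]
crawler = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(semantic_variance(focused) < semantic_variance(crawler))  # True
```

Under the paper's key assumption, a human session would score low on such a measure while a uniform crawl would score high, making the quantity a candidate feature for a session-level classifier.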