F. Kokkoras, K. Ntonas, N. Bassiliades, “DEiXTo: A web data extraction suite”, 6th Balkan Conference on Informatics (BCI 2013), ACM, pp. 9-12, Thessaloniki, Greece, 19-21 Sep 2013.
Web data extraction (or web scraping) is the process of collecting unstructured or semi-structured information from the World Wide Web, at different levels of automation. It is an important, valuable and practical approach to web reuse, and it can also serve the transition of the web to the semantic web by providing the structured data that the latter requires. In this paper we present DEiXTo, a web data extraction suite that provides an arsenal of features for designing and deploying well-engineered extraction tasks. We focus on presenting the core pattern matching algorithm and the overall architecture, which allows programming of custom-made solutions for hard extraction tasks. DEiXTo consists of both freeware and open source components.