Funding/Industry: EPAnEK 2014-2020
Duration: 36 months (7/2018 - 6/2021)
Scientific Responsible: Grigoris Tsoumakas
Category: National Projects
This project aims to create an innovative information platform that automates the recognition, analysis, labeling and metadata enrichment of cultural assets. The platform will allow organizations and libraries to easily create new layers of information on existing digitized documents, improving interoperability and reuse and optimizing dissemination to both the general public and specialized researchers. The project will also enhance the knowledge that exists in digital form on the “virtual shelves” of cultural organizations through computer science, standards and digital humanities procedures, resulting in an integrated solution that can be employed in Greece or in other countries.
The project objective is to develop an integrated Software as a Service (SaaS) tool that takes digitized documents as input and automatically enriches them with XML files that are compliant with the international standard Text Encoding Initiative ISO/IEC 24610-1:2006 and include structural and semantic information. The initial input to the system will be digitized text documents in various formats (TIFF, JPEG, PDF). The final output will be digital files comprising all available information, such as TEI XML description files, which will be available for access and for proofreading. The SaaS tool will be accessible via a web user interface or an API, and users will be able to purchase work packages that can be spent on specific tasks.
The technologies this project will utilize include machine intelligence to automatically extract entities such as proper names and geographical names, and automated semantic analysis to mine knowledge from the text and link the discovered concepts and topics to existing entities. In parallel, optical character recognition will be optimized by training the commercial package ABBYY FineReader and using it in tandem with open-source OCR engines.
Finally, training and verification of the research results will be carried out by processing and encoding part of the Trikoglou collection, a very important cultural asset of the Aristotle University of Thessaloniki.
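To make the enrichment step described above more concrete, the following is a minimal sketch of how recognized entities might be written into a TEI XML file. It is not the project's implementation: the function `build_tei`, the label names `PERSON` and `PLACE`, the placeholder header text and the sample sentence are all assumptions made for illustration. The sketch assumes that OCR and entity extraction have already produced plain text plus non-overlapping character spans, and it uses only the Python standard library together with the TEI elements `persName` and `placeName`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

# Hypothetical mapping from entity labels to TEI element names.
TAG_FOR_LABEL = {"PERSON": "persName", "PLACE": "placeName"}


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, text, entities):
    """Wrap `text` in a minimal TEI document and tag the given entities.

    `entities` is a list of (start, end, label) character spans that are
    assumed not to overlap (an assumption of this sketch).
    """
    root = ET.Element(tei("TEI"))

    # Minimal header: title, publication and source statements.
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = "Illustrative sketch, not published."
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = "Derived from OCR output."

    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    para = ET.SubElement(body, tei("p"))

    cursor = 0
    last = None  # most recently added entity element, if any
    for start, end, label in sorted(entities):
        plain = text[cursor:start]
        if last is None:
            para.text = (para.text or "") + plain
        else:
            last.tail = (last.tail or "") + plain
        last = ET.SubElement(para, tei(TAG_FOR_LABEL[label]))
        last.text = text[start:end]
        cursor = end

    # Attach whatever plain text follows the last entity.
    remainder = text[cursor:]
    if last is None:
        para.text = (para.text or "") + remainder
    else:
        last.tail = (last.tail or "") + remainder

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample sentence; the spans stand in for entity extraction output.
    sample = "Maria Papadopoulou travelled to Thessaloniki."
    place_start = sample.index("Thessaloniki")
    spans = [
        (0, len("Maria Papadopoulou"), "PERSON"),
        (place_start, place_start + len("Thessaloniki"), "PLACE"),
    ]
    print(build_tei("Sample document", sample, spans))
```

Storing the entities as character offsets keeps the OCR text untouched, so the same text could be re-tagged after proofreading without running recognition again.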