ECARLE aims to create an innovative information platform that automates the recognition, analysis, labeling and metadata enrichment of cultural assets. The platform will allow organizations and libraries to easily create new layers of information on existing digitized documents, improving interoperability and reuse and optimizing dissemination to both the general public and specialized researchers. The project will also enhance the knowledge that exists in digital form on the “virtual shelves” of cultural organizations through computer science, standards and digital humanities procedures, resulting in an integrated solution that can be employed in Greece or in other countries. This integrated software solution, which has specific target markets, will be utilized by DataScouting, which has extensive experience in creating and promoting similar solutions.
The project objective is to develop an integrated Software as a Service (SaaS) tool that takes digitized documents as input and automatically enriches them with XML files that comply with the international standard Text Encoding Initiative ISO/IEC 24610-1:2006 and include structural and semantic information. The initial input to the system will be digitized text documents in various formats (TIFF, JPEG, PDF), while the final output will be digital files comprising all available information, such as TEI XML description files, which will be available for access and for proofreading. The SaaS tool will be accessible via a web user interface or an API, and users will be able to purchase work packages that can be spent on specific tasks.
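To make the intended output more concrete, the following is a minimal sketch of how a TEI XML description file could be assembled from recognized text. It is not the project's implementation: the function name `build_tei`, its parameters and the sample values are assumptions made for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from typing import Sequence

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title: str, source_note: str, paragraphs: Sequence[str]) -> str:
    """Return a minimal TEI document as a string (hypothetical helper)."""
    # Serialize TEI elements without a prefix (default namespace).
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None):
        # Create a namespaced child element with optional text content.
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        node.text = text
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")

    # Header: title, publication and source descriptions.
    header = el(root, "teiHeader")
    file_desc = el(header, "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Unpublished working draft")
    el(el(file_desc, "sourceDesc"), "p", source_note)

    # Body: one <p> element per recognized paragraph.
    body = el(el(root, "text"), "body")
    for para in paragraphs:
        el(body, "p", para)

    return ET.tostring(root, encoding="unicode")


# Illustrative values only; a real run would use OCR output.
tei_xml = build_tei(
    title="Sample page",
    source_note="Transcribed from a scanned TIFF image (hypothetical input)",
    paragraphs=["First paragraph of recognized text."],
)
```

The sketch covers only the structural layer (header and paragraphs); semantic information such as named entities would be added as further TEI elements inside the body.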
The technologies this project will utilize include machine intelligence to automatically extract entities such as proper names and geographical names, and automated semantic analysis to mine knowledge from the text and link the discovered concepts and topics to existing entities. In parallel, optical character recognition (OCR) will be optimized by training the commercial package ABBYY FineReader and using it in tandem with open source OCR engines.
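As an illustration of how extracted entities might end up in the encoded text, the sketch below wraps entity spans in the TEI elements `persName` and `placeName`. It assumes the spans are supplied by some upstream recognizer; the labels `PERSON` and `PLACE`, the function `annotate` and the sample sentence are hypothetical and do not describe the project's actual models or pipeline.

```python
from xml.sax.saxutils import escape

# Mapping from hypothetical entity labels to TEI element names.
TEI_TAGS = {"PERSON": "persName", "PLACE": "placeName"}


def annotate(text, spans):
    """Wrap each (start, end, label) span in the matching TEI element.

    Spans are character offsets into `text`, must not overlap, and are
    assumed to come from an upstream entity recognizer.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported")
        tag = TEI_TAGS[label]
        # Copy the untagged text before the entity, escaping XML characters.
        parts.append(escape(text[cursor:start]))
        # Emit the entity wrapped in its TEI element.
        parts.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    # Copy whatever follows the last entity.
    parts.append(escape(text[cursor:]))
    return "".join(parts)


# Illustrative sentence and hand-written offsets, not real model output.
sentence = "Kavafis was born in Alexandria."
entities = [(0, 7, "PERSON"), (20, 30, "PLACE")]
marked_up = annotate(sentence, entities)
```

Here `marked_up` holds `<persName>Kavafis</persName> was born in <placeName>Alexandria</placeName>.`, a fragment that could be placed inside a TEI paragraph.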
Finally, training and verification of the research results will be carried out by processing and encoding the Trikoglou collection, a very important cultural asset of the Aristotle University of Thessaloniki.
Partners
Intelligent Systems Lab, School of Informatics, Aristotle University of Thessaloniki
Laboratory of Philology and New Technologies, School of Philology, Aristotle University of Thessaloniki
Library and Information Center, Aristotle University of Thessaloniki
Publications
Vasileios Barzokas, Eirini Papagiannopoulou, and Grigorios Tsoumakas. 2020. Studying the Evolution of Greek Words via Word Embeddings. In Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020). Association for Computing Machinery, New York, NY, USA, 118–124. DOI: https://doi.org/10.1145/3411408.3411425
Christina Tzogka, Fotini Koidaki, Stavros Doropoulos, Ioannis Papastergiou, Efthymios Agrafiotis, Katerina Tiktopoulou and Stavros Vologiannidis. OCR Workflow: Facing Printed Texts of Ancient, Medieval and Modern Greek Literature. Presented at Qurator 2021 – Conference on Digital Curation Technologies.
Despina Christou and Grigorios Tsoumakas. 2021. “Improving Distantly-Supervised Relation Extraction through BERT-based Label & Instance Embeddings”. In IEEE Access. https://doi.org/10.1109/ACCESS.2021.3073428
Dimitris Dimitriadis, Sofia Zapounidou and Grigorios Tsoumakas. “Semantic Indexing of 19th-Century Greek Literature Using 21st-Century Linguistic Resources”. Sustainability. 2021; 13(16):8878. https://doi.org/10.3390/su13168878
Despina Christou and Grigorios Tsoumakas. 2021. “Extracting Semantic Relationships in Greek Literary Texts”. In Sustainability 2021; 13(16): 9391. https://doi.org/10.3390/su13169391
Koidaki, Fotini, and Katerina Tiktopoulou. “Encoding semantic relationships in literary texts: A methodological proposal for linking networked entities into semantic relations.” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 – 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.Koidaki01.