Dimitrios Zaikis and Ioannis Vlahavas. “From Pre-Training to Meta-Learning: A Journey in Low-Resource-Language Representation Learning.” In: IEEE Access 11 (Oct. 2023), pp. 115951–115967. IF: 3.9. doi: 10.1109/ACCESS.2023.3326337. url: https://ieeexplore.ieee.org/document/10288436.
Language representation learning is a vital field in Natural Language Processing (NLP) that aims to capture the intricate semantics and contextual information of text. With the advent of deep learning and neural network architectures, representation learning has revolutionized the NLP landscape. However, the majority of research in this field has concentrated on resource-rich languages, putting Low-Resource Languages (LRLs) at a disadvantage due to their limited linguistic resources and the absence of pre-trained models. This paper addresses the significance of language representation learning in a low-resource language, Greek, and its impact on downstream tasks that rely heavily on semantically and contextually enriched language representations. Accurate classification requires an understanding of nuanced linguistic cues and contextual dependencies. Effective representations bridge the gap between raw text data and classification models, encoding semantic meaning, syntactic structures, and contextual information. By leveraging representation learning techniques with Transformer-based Language Models (LMs), such as domain adaptation and contrastive learning, we aim to enhance the performance of text classification in this LRL setting. We explore the challenges and opportunities in developing effective representations and propose a multi-stage LM pre-training and meta-learning approach to improve performance on downstream classification tasks. The proposed approach was evaluated on expert-annotated Greek texts from social media posts, news articles, press clippings, and internet articles such as blog posts and opinion pieces. The results show significant improvements in classification effectiveness, measured by micro-averaged F1-score, on sentiment, irony, hate speech, and emotion classification, as well as on three custom classification tasks.
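The abstract names contrastive learning as one of the representation learning techniques applied to the Transformer-based LMs. As a rough illustration of that idea only (the function name, shapes, and temperature default below are assumptions, not the authors' actual implementation), a supervised contrastive objective pulls embeddings of same-class sentences together and pushes different-class ones apart:

```python
# Minimal sketch of a supervised contrastive loss over sentence embeddings.
# Hypothetical illustration: names, shapes, and the temperature default are
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull same-label embeddings together, push different-label ones apart.

    embeddings: (batch, dim) encoder outputs, e.g. a Transformer's [CLS] vectors
    labels:     (batch,) integer class labels
    """
    z = F.normalize(embeddings, dim=1)          # work in cosine-similarity space
    sim = z @ z.T / temperature                 # (batch, batch) similarity logits
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))   # exclude self-similarity
    # Log-softmax over each row: log-probability of j given anchor i,
    # among all other examples in the batch.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye  # same-label pairs
    pos_counts = pos.sum(dim=1)
    valid = pos_counts > 0                      # anchors with at least one positive
    # Average negative log-probability of the positives for each valid anchor.
    pos_log_prob = log_prob.masked_fill(~pos, 0.0).sum(dim=1)
    return (-pos_log_prob[valid] / pos_counts[valid]).mean()

# Example usage: four sentence embeddings from two classes.
# emb = torch.randn(4, 768)
# y = torch.tensor([0, 0, 1, 1])
# loss = supervised_contrastive_loss(emb, y)
```

In such a setup, the embeddings would typically come from the pre-trained (and domain-adapted) Transformer encoder, and the temperature controls how sharply the loss concentrates on hard positive and negative pairs.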