Zaikis, D., Vlahavas, I. DACL+: domain-adapted contrastive learning for enhanced low-resource language representations in document clustering tasks. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-10589-1
Low-resource languages in natural language processing present unique challenges, marked by limited linguistic resources and sparse data. These challenges extend to document clustering tasks, where meaningful and semantically rich representations are crucial. With the emergence of transformer-based language models (LMs), the need for vast amounts of training data has also increased significantly. To this end, we introduce a domain-adapted contrastive learning approach for low-resource Greek document clustering. We present manually annotated datasets, essential for LM pre-training and clustering tasks, and extend our investigation by combining Greek BERT and Longformer models. We explore the efficacy of various domain adaptation pre-training objectives and of further pre-training the LMs using contrastive learning with diverse loss functions on datasets generated from a classification corpus. By maximizing the similarity between positive examples and minimizing the similarity between negative examples, our proposed approach learns meaningful representations that capture the underlying structure of the documents. We demonstrate that our proposed approach significantly improves clustering accuracy, with an average improvement of up to 50% compared to the base LM, leading to enhanced performance in unsupervised learning tasks. Furthermore, we show how combining language models optimized for different sequence lengths improves performance and compare this approach against an unsupervised graph-based summarization method. Our findings underscore the importance of effective document representations in enhancing the accuracy of clustering tasks in low-resource language settings.
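To illustrate the contrastive objective described in the abstract, the sketch below shows a minimal InfoNCE-style loss with in-batch negatives: embeddings of positive pairs are pulled together while all other documents in the batch act as negatives. This is an assumption for illustration only; the paper evaluates several loss functions, and the function name, temperature value, and in-batch-negative setup here are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_info_nce_loss(anchor_emb, positive_emb, temperature=0.05):
    """Minimal InfoNCE-style contrastive loss with in-batch negatives.

    anchor_emb, positive_emb: (batch, dim) document embeddings, where
    row i of `positive_emb` is the positive example for row i of
    `anchor_emb`; every other row in the batch serves as a negative.
    """
    # L2-normalise so dot products become cosine similarities.
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = anchor @ positive.T / temperature  # shape: (batch, batch)

    # The correct (positive) pair for row i lies on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy maximises similarity for positives and
    # minimises it for the in-batch negatives.
    return F.cross_entropy(logits, targets)
```

In this formulation, lowering the temperature sharpens the softmax and penalises hard negatives more strongly, which is a common design choice when adapting sentence- or document-level encoders with contrastive objectives.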