Dimitrios Zaikis, Stylianos Kokkas, and Ioannis Vlahavas. “DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks.” In: Engineering Applications of Neural Networks. Ed. by Lazaros Iliadis, Ilias Maglogiannis, Serafin Alonso, Chrisina Jayne, and Elias Pimenidis. Vol. 1826. Communications in Computer and Information Science. Cham: Springer Nature Switzerland, 2023, pp. 585–598. isbn: 978-3-031-34203-5, 978-3-031-34204-2. doi: 10.1007/978-3-031-34204-2_47. url: https://link.springer.com/10.1007/978-3-031-34204-2_47.
Clustering in Natural Language Processing (NLP) groups text phrases or documents by semantic meaning or context, producing groups that are useful in several information extraction tasks, such as topic modeling, document retrieval, and text summarization. However, clustering documents in low-resource languages poses unique challenges due to limited linguistic resources and a lack of carefully curated data. These challenges extend to language modeling, where training Transformer-based Language Models (LMs) requires large amounts of data to generate meaningful representations. To this end, we created two new corpora from Greek media sources and present a Transformer-based contrastive learning approach for document clustering tasks. We improve low-resource LMs using in-domain second-phase pre-training (domain adaptation) and learn document representations by contrasting positive examples (i.e., similar documents) with negative examples (i.e., dissimilar documents). By maximizing the similarity of positive pairs and minimizing the similarity of negative pairs, our proposed approach learns meaningful representations that capture the underlying structure of the documents. Additionally, we demonstrate how combining language models that are optimized for different sequence lengths improves performance, and we compare this approach against an unsupervised graph-based summarization method that generates concise and informative summaries for longer documents. By learning effective document representations, our proposed approach can significantly improve the accuracy of clustering tasks such as topic extraction, leading to improved performance in downstream tasks.
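To make the contrastive objective concrete, the following is a minimal sketch of an InfoNCE-style loss over one anchor document, one positive, and a set of negatives. The abstract does not state which loss is used, so the InfoNCE form, the function names `cosine` and `info_nce_loss`, the `temperature` value, and the toy 2-D vectors standing in for document embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not confirmed by the source).

    The loss is low when the anchor is more similar to the positive
    than to any of the negatives, and high otherwise.
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidate logits.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


# Toy embeddings (hypothetical): `near` points almost the same way as
# `anchor`, while `far` is orthogonal to it.
anchor = [1.0, 0.0]
near = [0.9, 0.1]
far = [0.0, 1.0]

good = info_nce_loss(anchor, near, [far])  # similar doc used as the positive
bad = info_nce_loss(anchor, far, [near])   # dissimilar doc used as the positive
print(good < bad)
```

Minimizing a loss of this shape pulls similar documents together and pushes dissimilar ones apart in the embedding space, which is the property a downstream clustering step relies on.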