A Topic Model for Extreme Multi-Label Classification

Papanikolaou, Y., Tsoumakas, G. (2018). Subset Labeled LDA: A Topic Model for Extreme Multi-Label Classification. In Proceedings of the 20th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2018), Regensburg, Germany, September 3-6, 2018.

Labeled Latent Dirichlet Allocation (LLDA) is an extension of the standard unsupervised Latent Dirichlet Allocation (LDA) algorithm that addresses multi-label learning tasks. Previous work has shown it to perform on par with other state-of-the-art multi-label methods. Nonetheless, as the number of labels increases, LLDA encounters scalability issues. In this work, we introduce Subset LLDA, a topic model that extends the standard LLDA algorithm and that not only scales efficiently to problems with hundreds of thousands of labels but also improves over the LLDA state of the art in terms of prediction accuracy. We conduct experiments on eight data sets, with the number of labels ranging from hundreds to hundreds of thousands, comparing our proposed algorithm with the other LLDA algorithms (Prior-LDA, Dep-LDA), as well as with the state of the art in extreme multi-label classification. The results show a consistent advantage of our method over the other LLDA algorithms and competitive results compared to the extreme multi-label classification algorithms.