Mylonas, N., Karlos, S., & Tsoumakas, G. (2023). WeakMeSH: Leveraging provenance information for weakly supervised classification of biomedical articles with emerging MeSH descriptors. Artificial Intelligence in Medicine, 137, 102505. doi:10.1016/j.artmed.2023.102505
Medical Subject Headings (MeSH) is a hierarchically structured thesaurus created by the U.S. National Library of Medicine. The vocabulary is revised every year, which brings about different types of changes. Of particular interest are the changes that introduce new descriptors into the vocabulary, whether brand new or arising as the product of a complex change. These new descriptors often lack ground-truth articles, which renders learning models that require supervision inapplicable. Furthermore, the problem is multi-label, and the descriptors that play the role of classes are fine-grained, so labeling requires expert supervision and substantial human resources. In this work, we alleviate these issues by drawing on the provenance information that MeSH provides for those descriptors to create a weakly labeled training set for them. We also use a similarity mechanism to further filter the weak labels obtained from this provenance information. Our method, called WeakMeSH, was applied to a large-scale subset of the BioASQ 2018 data set consisting of 900 thousand biomedical articles. Its performance was evaluated on BioASQ 2020 against several other approaches that had given competitive results on similar problems in the past or that apply alternative transformations to the proposed one, as well as against variants that showcase the importance of each component of our approach. Finally, an analysis was performed on the different MeSH descriptors of each year to assess the applicability of our method to the thesaurus.
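The two-stage idea in the abstract (weak labels derived from provenance information, then filtered by similarity) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the form of the provenance information or the similarity measure, so the sketch assumes that provenance maps each new descriptor to older descriptors already assigned to articles, and it uses a bag-of-words cosine similarity as a stand-in. The names `weak_label`, `provenance`, `descriptions`, and `threshold`, and all the data, are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real text representation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weak_label(articles, provenance, descriptions, threshold=0.2):
    """Return one set of weak labels per article.

    Assumptions (not from the source):
      provenance   maps a new descriptor to the older descriptors it derives from
      descriptions maps a new descriptor to a short text describing it
      threshold    is an arbitrary similarity cut-off
    """
    weak = []
    for article in articles:
        vector = bag_of_words(article["text"])
        assigned = set()
        for new_desc, old_descs in provenance.items():
            # Stage 1: candidate weak label if the article carries an older descriptor.
            if article["labels"] & old_descs:
                # Stage 2: keep the label only if the article resembles the descriptor text.
                if cosine(vector, bag_of_words(descriptions[new_desc])) >= threshold:
                    assigned.add(new_desc)
        weak.append(assigned)
    return weak


# Made-up toy data for illustration only; not taken from MeSH or BioASQ.
articles = [
    {"text": "Deep learning models for tumor segmentation in MRI scans",
     "labels": {"Machine Learning"}},
    {"text": "Crop yield prediction with gradient boosting",
     "labels": {"Machine Learning"}},
    {"text": "Deep learning for protein folding",
     "labels": {"Proteins"}},
]
provenance = {"Deep Learning": {"Machine Learning"}}
descriptions = {"Deep Learning": "deep learning neural networks with many layers"}

weak = weak_label(articles, provenance, descriptions)
```

In this toy run, only the first article keeps the new label. The second passes the provenance stage but is removed by the similarity filter, and the third never becomes a candidate because it carries none of the older descriptors.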