G. Tzanis, I. Vlahavas, “Mining High Quality Clusters of SAGE Data”, Proceedings of the 2nd VLDB Workshop on Data Mining in Bioinformatics, Vienna, Austria, 2007.
Serial Analysis of Gene Expression (SAGE) is a method that allows the quantitative and simultaneous analysis of the whole gene function of a cell. One advantage of this method is that the experimenter does not have to select a priori the mRNA sequences that will be counted in a sample. This makes SAGE a powerful tool for analyzing gene expression and studying various diseases, such as cancer. An important concern in cancer studies is discovering the differences between healthy and cancerous samples and accurately separating these two groups of samples. However, the high dimensionality of the data, the multiple cell sources (i.e., bulk and cell line), and the multiple cancer subtypes make the effective clustering of SAGE libraries very difficult. Furthermore, the various sources of noise pose an extra challenge to data miners. For these reasons, we propose an approach that involves discretizing the data, selecting the most prominent gene tags, and applying a clustering algorithm in order to obtain more compact and reliable clusters that can assist cancer profiling. We experimented with two families of clustering algorithms, partitional and hierarchical, and used various cluster validity criteria to evaluate the resulting clustering structures. The experimental results show that our approach provides more interesting clustering structures.
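The three steps named in the abstract (discretization, tag selection, clustering) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the data are discretized, how tags are ranked, or which distance is used. The median-based binarization, the balance-based tag ranking, the average-linkage clustering over Hamming distance, the toy counts, and all function names are hypothetical choices made for the example.

```python
from itertools import combinations
from statistics import median

def discretize(libraries):
    """Binarize each tag: 1 if its count exceeds that tag's median
    across all libraries (assumed rule, not taken from the paper)."""
    n_tags = len(libraries[0])
    cutoffs = [median(lib[t] for lib in libraries) for t in range(n_tags)]
    return [[1 if lib[t] > cutoffs[t] else 0 for t in range(n_tags)]
            for lib in libraries]

def select_tags(binary, n_keep):
    """Keep the n_keep tags whose 0/1 values are most evenly split
    across libraries (assumed selection criterion)."""
    n = len(binary)
    def balance(t):
        ones = sum(row[t] for row in binary)
        return min(ones, n - ones)
    ranked = sorted(range(len(binary[0])), key=balance, reverse=True)
    keep = sorted(ranked[:n_keep])
    return [[row[t] for t in keep] for row in binary], keep

def hamming(a, b):
    """Number of positions at which two binary profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def average_linkage(rows, k):
    """Agglomerative clustering: repeatedly merge the two clusters with
    the smallest average pairwise Hamming distance until k remain."""
    clusters = [[i] for i in range(len(rows))]
    def dist(c1, c2):
        total = sum(hamming(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))
    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy tag counts (made up): rows are SAGE libraries, columns are gene tags.
libraries = [
    [50, 2, 0, 7, 3],
    [45, 1, 1, 7, 4],
    [48, 3, 0, 7, 2],
    [5, 30, 9, 7, 3],
    [4, 28, 12, 7, 4],
    [6, 35, 10, 7, 2],
]

binary = discretize(libraries)
reduced, kept = select_tags(binary, n_keep=3)
clusters = average_linkage(reduced, k=2)
print(kept, clusters)
```

On this toy input the uninformative tags (the constant one and the weakly varying one) are dropped, and the first three and last three libraries end up in separate clusters. A cluster validity index could then be computed on `clusters` to compare alternative settings, as the abstract describes.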