Bioinformatics – Intelligent Analysis of Biological and Biomedical Data

Introduction

Recent technological advances have produced a wealth of digital devices and sensors. Together with advances in biotechnology, and in high-throughput sequencing methods in particular, they have led to an unprecedented explosion of data on every aspect of biology and medicine, including life-threatening diseases such as Chronic Lymphocytic Leukemia [1] and Diabetes [2]. The application of machine learning and data mining methods in the biosciences is therefore more vital than ever for intelligently transforming the available information into valuable knowledge and, ultimately, for answering fundamental questions in biology and medicine.

Our contribution

Immunogenetic Data Analysis

Immunity is the capability of the human organism to defend itself against environmental agents that are foreign to it and potentially harmful, such as viruses, bacteria and various other substances. Our work focuses on methodologies that integrate different immunogenetic and clinicobiological data sources, and on data mining methods that analyze the integrated data in order to study the patterns of mutations arising through the process of Somatic Hypermutation (SHM). All methods were applied to data from patients with Chronic Lymphocytic Leukemia (CLL) [3].
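To make the notion of a mutation pattern concrete, the following is a minimal sketch of how SHM mutations could be listed and counted across patients. It assumes that each patient sequence has already been aligned to a common germline sequence of the same length; the function names and the toy sequences are hypothetical and are not taken from the methods in [3].

```python
from collections import Counter


def find_mutations(germline, observed):
    """Return (1-based position, germline base, observed base) for each mismatch.

    Assumes the two sequences are pre-aligned; gaps ("-") and unknown
    bases ("N") are skipped rather than counted as mutations.
    """
    if len(germline) != len(observed):
        raise ValueError("sequences must be pre-aligned to the same length")
    mutations = []
    for i, (g, o) in enumerate(zip(germline.upper(), observed.upper())):
        if g in "-N" or o in "-N":
            continue
        if g != o:
            mutations.append((i + 1, g, o))
    return mutations


def shared_mutations(germline, sequences):
    """Count in how many sequences each individual mutation occurs."""
    counts = Counter()
    for seq in sequences:
        counts.update(find_mutations(germline, seq))
    return counts


# Toy data (hypothetical): one germline and two patient sequences.
germline = "ACGTACGT"
patients = ["ACGTTCGT", "ACGATCGT"]
counts = shared_mutations(germline, patients)
```

Mutations with a high count across patients are candidates for shared mutations; a real analysis would also take into account the gene and region in which each mutation occurs.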

Population Genomic Data Analysis

Population genetics is a subfield of genetics that deals with genetic differences within and between populations, and is a part of evolutionary biology. Studies in this branch of biology examine phenomena such as adaptation, speciation, and population structure. Our work focuses on the analysis of single nucleotide polymorphism (SNP) data, whose main feature is high dimensionality, and in particular on the problem of selecting the most informative markers for assigning individuals to their populations of origin. We have developed data mining methods for informative marker selection [4,5] and TRES [6], a software tool containing algorithms for SNP dataset manipulation together with the state-of-the-art marker evaluation algorithms used in this field. In the same area of population genetics, a microsatellite pattern discovery algorithm has been developed along with a software application called MiGA [7].
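As an illustration of informative marker selection, the sketch below ranks SNPs by delta, the absolute difference in allele frequency between two populations, which is one classic marker evaluation measure. It is a simplified example, not the FIFS or ensemble methods of [4,5]. The genotype coding (0, 1 or 2 copies of a counted allele, None for missing calls) and all function names are assumptions made for the example.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele; genotypes are 0, 1 or 2 copies (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this marker")
    return sum(called) / (2 * len(called))


def delta_scores(pop_a, pop_b):
    """Delta per SNP for two populations given as lists of per-individual genotype lists."""
    n_snps = len(pop_a[0])
    scores = []
    for j in range(n_snps):
        p_a = allele_frequency([ind[j] for ind in pop_a])
        p_b = allele_frequency([ind[j] for ind in pop_b])
        scores.append(abs(p_a - p_b))
    return scores


def top_markers(scores, k):
    """Indices of the k highest-scoring SNPs, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data (hypothetical): two individuals per population, three SNPs.
pop_a = [[2, 0, 1], [2, 0, 1]]
pop_b = [[0, 0, 1], [0, 1, 1]]
scores = delta_scores(pop_a, pop_b)
best = top_markers(scores, 2)
```

With thousands of SNPs, such a per-marker score reduces the dataset to a small panel; because a single measure scores each marker independently, it cannot account for redundancy between the selected markers.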

Links:

TRES [6]: http://mlkd.csd.auth.gr/bio/tres/

MiGA [7]: http://mlkd.csd.auth.gr/bio/miga/

Polyadenylation Site Prediction

Polyadenylation is a process that takes place after transcription termination. It involves cleavage of the new transcript (mRNA), followed by template-independent addition of adenines at its newly formed 3’ end. The cleavage site is called the polyadenylation site or, in short, the poly(A) site. Polyadenylation is considered part of the larger process of producing mature mRNA for translation, and its aim is to protect the mRNA so that it reaches the site of protein synthesis intact. Current research in this field focuses on discovering new cis-regulatory elements and on predicting the poly(A) site accurately. Accurate prediction of the poly(A) site is a crucial step in defining gene boundaries and in gaining insight into transcription termination in eukaryotes, a process that is still not well understood. Our research group has been working on polyadenylation site prediction for several years and has developed tools that can be used for analyzing and predicting poly(A) sites [8,9,10].
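For orientation, the following sketch shows the simplest kind of sequence-based evidence for a poly(A) site: locating the canonical AAUAAA signal (AATAAA in DNA) and reporting a window downstream of it where cleavage typically occurs. It is a naive baseline under stated assumptions, not the PolyA-iEP method of [8,9,10]. The function name and the default gap of 10 to 30 nucleotides, measured from the end of the signal, are choices made for this example.

```python
def find_polya_signals(seq, signal="AATAAA", min_gap=10, max_gap=30):
    """Find poly(A) signal hexamers and the downstream window for a candidate site.

    Returns a list of (signal_start, window_start, window_end) tuples using
    0-based positions; the window is clipped to the end of the sequence.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    hits = []
    start = seq.find(signal)
    while start != -1:
        signal_end = start + len(signal)
        window_start = min(signal_end + min_gap, len(seq))
        window_end = min(signal_end + max_gap, len(seq))
        hits.append((start, window_start, window_end))
        start = seq.find(signal, start + 1)  # allow overlapping matches
    return hits


# Toy sequence (hypothetical): one signal followed by 40 filler bases.
example = "GGG" + "AATAAA" + "C" * 40
hits = find_polya_signals(example)
```

A hexamer match alone gives many false positives and misses sites that use variant signals, which is why prediction methods combine it with additional sequence features.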

Translation Initiation Site Prediction

The prediction of the Translation Initiation Site (TIS) in a genomic sequence is an important problem in biological research. Although several methods have been proposed to deal with it, there is still considerable room for improving their accuracy.

For various reasons, including noise in the data as well as biological factors, TIS prediction is still an open problem and far from a trivial task. Our research group has been working on this problem for several years and has been experimenting on real-world DNA sequences. Our methods are described in the following papers [11,12,13,14,15].
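To illustrate why the task is not trivial, the sketch below enumerates every ATG codon in a sequence as a candidate TIS and scores its context with two well-known positions of the Kozak consensus: a purine at position -3 and a G at position +4. This is a simple heuristic for illustration only, not one of the methods described in [11,12,13,14,15]. The function name and the 0 to 2 scoring scheme are assumptions made for the example.

```python
def candidate_tis(seq):
    """Return (position, score) for every ATG, with a 0-2 Kozak-context score.

    position is the 0-based index of the A of the ATG. One point is given for
    a purine (A or G) three bases upstream, and one for a G right after the ATG.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    candidates = []
    pos = seq.find("ATG")
    while pos != -1:
        minus3 = seq[pos - 3] if pos >= 3 else None
        plus4 = seq[pos + 3] if pos + 3 < len(seq) else None
        score = int(minus3 in ("A", "G")) + int(plus4 == "G")
        candidates.append((pos, score))
        pos = seq.find("ATG", pos + 1)
    return candidates


# Toy sequence (hypothetical) with one strong-context and one weak-context ATG.
example = "CCACCATGGCTTTATGC"
candidates = candidate_tis(example)
```

Most ATG codons in a genomic sequence are not true initiation sites, and some true sites have a weak context, so a fixed rule of this kind leaves many candidates ambiguous; learned classifiers address this by combining many features.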

References

  1. Kavakiotis, I., Xochelli, A., Agathangelidis, A., Tsoumakas, G., Maglaveras, N., Stamatopoulos, K., Hadzidimitriou, A., Vlahavas, I., Chouvarda, I. (2016) “Integrating Multiple Immunogenetic Data Sources For Feature Extraction and Mining Mutation Patterns: The Case of Chronic Lymphocytic Leukemia Shared Mutations”, BMC Bioinformatics, 2016 Jun 6;17 Suppl 5:173. doi: 10.1186/s12859-016-1044-3.
  2. Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., Chouvarda, I. (2017). Machine Learning and Knowledge Discovery Methods in Diabetes Research. Computational and Structural Biotechnology Journal, Volume 15, 2017, Pages 104–116.
  3. Kavakiotis, I., Xochelli, A., Agathangelidis, A., Tsoumakas, G., Maglaveras, N., Stamatopoulos, K., Hadzidimitriou, A., Vlahavas, I., Chouvarda, I. (2014) “Integrating Multiple Immunogenetic Data Sources For Feature Extraction and Mining Mutation Patterns: The Case of Chronic Lymphocytic Leukemia Shared Mutations”, Statistical Methods for Omics Data Integration and Analysis. Heraklion, Crete, Greece, November 10-12, 2014
  4. Kavakiotis I., Samaras P., Triantafyllidis A., Vlahavas I. “FIFS: A Data Mining Method for Informative Marker Selection in High Dimensional Population Genomic Data.” (Under Review)
  5. Kavakiotis I., Triantafyllidis A., Tsoumakas G., Vlahavas I., (2016) “Ensemble Feature Selection using Rank Aggregation Methods for Population Genomic Data.” ACM Proceedings of the 9th Hellenic Conference on Artificial Intelligence, 22, 2016
  6. Kavakiotis I., Triantafyllidis A., Ntelidou D., Alexandri P., Megens H.J., Crooijmans R.P., Groenen M.A., Tsoumakas G., Vlahavas I. (2015) “TRES: Identification of Discriminatory and Informative SNPs from Population Genomic Data.”, Journal of Heredity, Wiley, 2015 Sep-Oct;106(5):672-6. doi: 10.1093/jhered/esv044. Epub 2015 Jul 2.
  7. Kavakiotis I., Triantafyllidis A., Samaras P., Voulgaridis A., Karaiskou N., Konstantinidis E., Vlahavas I. (2014) “Pattern discovery for microsatellite genome analysis”, Computers in Biology and Medicine, Edward John Ciaccio (Ed.), Elsevier, Vol. 46, pp. 71-78, 2014.
  8. Kavakiotis I., Tzanis G., Vlahavas I. (2014) “Polyadenylation site prediction using PolyA-iEP method”, Polyadenylation: Methods and Protocols, Joanna Rorbach and Agnieszka Bobrowicz (Eds.), Springer, Methods In Molecular Biology, 2014;1125:131-40. doi: 10.1007/978-1-62703-971-0_11
  9. Tzanis G., Kavakiotis I., Vlahavas I. (2011) “PolyA-iEP: A Data Mining Method for the Effective Prediction of Polyadenylation Sites”, Expert Systems with Applications, Elsevier, 38(10): 12398-12408, 2011.
  10. Tzanis G., Kavakiotis I., Vlahavas I. (2008) “Polyadenylation Site Prediction Using Interesting Emerging Pattern”, 8th IEEE International Conference on Bioinformatics and Bioengineering, IEEE, Athens, Greece, 2008: 1-7
  11. G. Tzanis, C. Berberidis, and I. Vlahavas, “MANTIS: A Data Mining Methodology for Effective Translation Initiation Site Prediction”, In Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, Lyon, France, August 23-26, 2007.
  12. G. Tzanis, C. Berberidis, and I. Vlahavas, “A Novel Data Mining Approach for the Accurate Prediction of Translation Initiation Sites”, In Proceedings of the 7th International Symposium on Biological and Medical Data Analysis, Nicos Maglaveras et al. (Eds.), Springer-Verlag, Thessaloniki, Greece, December 7-8, 2006.
  13. G. Tzanis, I. Vlahavas, “Prediction of Translation Initiation Sites Using Classifier Selection”, Proc. 4th Hellenic Conference on Artificial Intelligence (to be presented), G. Antoniou, G. Potamias, D. Plexousakis, C. Spyropoulos (Eds.), Springer-Verlag, Heraclion, Crete, 2006.
  14. G. Tzanis, C. Berberidis, A. Alexandridou, I. Vlahavas, “Improving the Accuracy of Classifiers for the Prediction of Translation Initiation Sites in Genomic Sequences”, 10th Panhellenic Conference on Informatics (PCI’2005), P. Bozanis and E.N. Houstis (Eds.), Springer-Verlag, LNCS 3746, pp. 426-436, Volos, Greece, 11-13 November, 2005.
  15. G. Tzanis, C. Berberidis, I. Vlahavas, “StackTIS: A Stacked Generalization Approach for Effective Prediction of Translation Initiation Sites”, Computers in Biology and Medicine, Elsevier, Vol. 42, No. 1, pp. 61-69, 2012.