Bioinformatics – Intelligent Analysis of Biological and Biomedical Data

Introduction

Recent advances in digital instruments and sensors, together with progress in biotechnology and in particular high-throughput sequencing methods, have led to an unprecedented explosion of data on every aspect of biology and medicine, including life-threatening diseases such as Chronic Lymphocytic Leukemia [1] and Diabetes [2]. As a result, applying machine learning and data mining methods in the biosciences is now more vital than ever, both to transform the available information into valuable knowledge and, ultimately, to answer fundamental questions in biology and medicine.

Our contribution

Immunogenetic Data Analysis

Immunity is the capability of the human organism to defend itself against environmental agents that are foreign to it and potentially harmful. These foreign agents include viruses, bacteria and various other substances. Our work focuses on methodologies that integrate different immunogenetic and clinicobiological data sources and apply data mining methods to them, in order to study the patterns of mutations that arise through the process of Somatic Hypermutation (SHM). All methods were applied to data from patients with Chronic Lymphocytic Leukemia (CLL) [3].
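As a rough illustration of what a mutation pattern looks like in practice, the sketch below tallies nucleotide substitutions between a germline sequence and an observed, hypermutated sequence. It is a minimal example, not the methodology of [3]: it assumes the two sequences are already aligned to the same length, and the function names and example sequences are hypothetical.

```python
from collections import Counter


def substitution_spectrum(germline, observed):
    """Count each (germline base, observed base) substitution.

    Assumes the two sequences are already aligned and equally long.
    Positions holding gaps or ambiguity codes are skipped.
    """
    if len(germline) != len(observed):
        raise ValueError("sequences must be aligned to the same length")
    spectrum = Counter()
    for ref, alt in zip(germline.upper(), observed.upper()):
        if ref in "ACGT" and alt in "ACGT" and ref != alt:
            spectrum[(ref, alt)] += 1
    return spectrum


def mutation_frequency(germline, observed):
    """Fraction of comparable positions that carry a substitution."""
    pairs = list(zip(germline.upper(), observed.upper()))
    compared = sum(1 for ref, alt in pairs if ref in "ACGT" and alt in "ACGT")
    if compared == 0:
        return 0.0
    mutated = sum(substitution_spectrum(germline, observed).values())
    return mutated / compared


# Hypothetical toy sequences, for illustration only.
spectrum = substitution_spectrum("ACGTACGT", "ACATACGC")
frequency = mutation_frequency("ACGTACGT", "ACATACGC")
```

Here `spectrum` records one G-to-A and one T-to-C substitution, and `frequency` is the share of compared positions that differ. Real SHM analyses would also need alignment against germline gene segments, which this sketch leaves out.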

Population Genomic Data Analysis

Population genetics is a subfield of genetics, and part of evolutionary biology, that deals with genetic differences within and between populations. Studies in this branch of biology examine phenomena such as adaptation, speciation, and population structure. Our work focuses on the analysis of single nucleotide polymorphism (SNP) data, whose main feature is high dimensionality, and in particular on the problem of selecting the most informative markers for assigning individuals to their populations of origin. We have developed data mining methods for informative marker selection [4,5] and TRES [6], a software tool containing algorithms for SNP dataset manipulation together with the state-of-the-art marker evaluation algorithms used in this field. Also in the area of population genetics, we have developed a microsatellite pattern discovery algorithm along with a software application called MiGA [7].
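To make the idea of marker informativeness concrete, the sketch below ranks biallelic SNPs by the absolute difference in allele frequency between two populations (often called delta), one of the simplest measures of this kind. It is not the method of [4,5] and does not use the TRES interface; the 0/1/2 genotype coding, the function names, and the SNP identifiers are assumptions made for illustration.

```python
def allele_frequency(genotypes):
    """Alternate-allele frequency from genotypes coded as 0, 1 or 2 copies."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    return sum(genotypes) / (2 * len(genotypes))


def rank_snps_by_delta(pop_a, pop_b):
    """Rank SNPs shared by both populations by |p_a - p_b|, largest first.

    pop_a and pop_b map a SNP identifier to a list of genotype codes.
    """
    scores = {
        snp: abs(allele_frequency(pop_a[snp]) - allele_frequency(pop_b[snp]))
        for snp in pop_a
        if snp in pop_b
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical genotypes for two SNPs in two populations.
population_a = {"snp1": [0, 0, 0, 1], "snp2": [1, 1, 1, 1]}
population_b = {"snp1": [2, 2, 2, 1], "snp2": [1, 1, 0, 2]}
ranking = rank_snps_by_delta(population_a, population_b)
```

In this toy data `snp1` has very different allele frequencies in the two populations and ranks first, while `snp2` has the same frequency in both and carries no information for assignment. With hundreds of thousands of SNPs, keeping only the top-ranked markers is one simple way to reduce dimensionality.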

Links:

TRES [6]: http://mlkd.csd.auth.gr/bio/tres/

MiGA [7]: http://mlkd.csd.auth.gr/bio/miga/

Polyadenylation Site Prediction

Polyadenylation is a process that takes place after transcription termination. It involves cleavage of the new transcript (mRNA), followed by template-independent addition of adenines at its newly formed 3’ end. The cleavage site is called the polyadenylation site or, in short, the poly(A) site. Polyadenylation is considered part of the larger process of producing mature mRNA for translation. Its purpose is to protect the mRNA so that it reaches the protein synthesis site intact. Current research in this field focuses on discovering new cis-regulatory elements and on predicting the poly(A) site accurately. Accurate prediction of the poly(A) site is a crucial step in defining gene boundaries and in gaining insight into transcription termination in eukaryotes, a process that is less well understood. Our research group has been working on polyadenylation site prediction for some time and has developed tools for analyzing and predicting poly(A) sites [8,9,10].
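A very simple baseline for locating candidate poly(A) sites is to search for the canonical polyadenylation signal hexamer AATAAA (or its common variant ATTAAA) and to flag the region a short distance downstream of it. The sketch below does only that; it is not one of the tools of [8,9,10]. The 10 to 30 nucleotide gap is a commonly quoted range used here as an assumption, and the function name and parameters are hypothetical.

```python
import re

# Canonical signal and its most common variant, written as DNA.
SIGNALS = ("AATAAA", "ATTAAA")
SIGNAL_PATTERN = re.compile("(?=(" + "|".join(SIGNALS) + "))")


def candidate_polya_windows(seq, min_gap=10, max_gap=30):
    """Return (signal_start, signal, window_start, window_end) tuples.

    The window is the half-open, 0-based range [window_start, window_end)
    where a cleavage site would be expected under the assumed gap range.
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    # The lookahead lets overlapping signal occurrences be reported too.
    for match in SIGNAL_PATTERN.finditer(seq):
        signal_start = match.start()
        signal_end = signal_start + len(match.group(1))
        window_start = signal_end + min_gap
        window_end = min(signal_end + max_gap, len(seq))
        if window_start < len(seq):
            hits.append((signal_start, match.group(1), window_start, window_end))
    return hits


# Hypothetical toy sequence: one signal followed by 40 filler bases.
example = "GGGG" + "AATAAA" + "C" * 40
windows = candidate_polya_windows(example)
```

A signal match alone is a weak predictor, since the hexamer also occurs by chance and many genuine sites use other signals, which is one reason why learned models over the surrounding sequence context are of interest.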

Translation Initiation Site Prediction

Predicting the Translation Initiation Site (TIS) in a genomic sequence is an important problem in biological research. Although several methods have been proposed to address it, there is still considerable room for improving their accuracy.

For various reasons, including noise in the data and the underlying biology, TIS prediction remains an open problem and is far from a trivial task. Our research group has been working on this problem for some time, experimenting on real-world DNA sequences. Our methods are described in the following papers [11,12,13,14,15].
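To show why the task is not trivial, the sketch below implements a naive baseline: it lists every ATG in a sequence and scores its context using two positions of the Kozak consensus, a purine at position -3 and a G at position +4. This is only an illustrative heuristic, not one of the methods of [11,12,13,14,15]; the function name and scoring scheme are hypothetical.

```python
def candidate_tis(seq):
    """Return (position, score) for every ATG, scanning left to right.

    position is the 0-based index of the A in ATG. The score (0 to 2)
    counts how many of two Kozak context positions match: a purine
    three bases upstream, and a G immediately after the ATG.
    """
    seq = seq.upper()
    candidates = []
    pos = seq.find("ATG")
    while pos != -1:
        score = 0
        if pos >= 3 and seq[pos - 3] in "AG":
            score += 1  # purine at position -3
        if pos + 3 < len(seq) and seq[pos + 3] == "G":
            score += 1  # G at position +4
        candidates.append((pos, score))
        pos = seq.find("ATG", pos + 1)
    return candidates


# Hypothetical toy sequence with one strong and one weak context.
sites = candidate_tis("CCACCATGGCTTTATGC")
```

In the toy sequence the first ATG matches both context positions and the second matches neither. On real sequences many non-initiating ATGs also score well, so such a rule produces many false positives.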

Intelligent Systems Lab