A publication in Genome Biology

Reducing bias in scientific research due to genomic contamination




Luc Cornet and Denis Baurain, researchers in phylogenomics at the University of Liège, propose a new approach for comparing contamination detection tools using simulated genomic contamination, in order to help researchers select the most appropriate algorithms for their studies. The approach aims to limit the bias that these contaminants introduce into scientific studies and so improve their reliability. The research has been published in the open-access scientific journal Genome Biology.

Nowadays, genomes, the genetic material of an organism digitized in a computer file, have become the basic building block of many scientific studies. They are used, for example, to study the evolutionary history of species or, in the medical field, to better combat human pathogens. Several hundred thousand genomes are available to researchers, and their number is constantly increasing. While this deluge of data has opened up new research opportunities in comparative genomics and related fields, it has been accompanied by a growing problem: the contamination of a number of genomes published in public databases. "The inclusion of foreign sequences alongside authentic sequences is called 'genome contamination'," explains Prof. Denis Baurain, a biologist and researcher at the InBioS research unit (Faculty of Science) at ULiège. Contaminating sequences can creep into a genome at many stages, from the sampling of the organism in its original environment to the computer analysis of its genome. A very topical example is the study of the microbiome, such as that of the human gut flora, where the multiplicity of organisms sampled at the same time considerably increases the probability of contamination. The quality of these genomes often determines the reliability of the resulting scientific studies. This is why the presence of sequence segments that do not belong to the intended organism has been under the scrutiny of researchers for several years. Contamination is indeed known to be a source of errors in many publications, including those in prestigious scientific journals.

"The decreasing cost of sequencing and the concomitant increase in the number of publicly available genomes have created an acute need for automated software to assess this genomic contamination. In the last six years, eighteen software packages have been released, each with its own strengths and weaknesses. Deciding which tools to use is becoming increasingly difficult without an understanding of the underlying algorithms," explains Luc Cornet, collaborating scientist at ULiège and Principal Investigator of the BELSPO BCCM (Belgian Coordinated Collections of Microorganisms) project, and first author of the article just published in the journal Genome Biology. "We have therefore decided to review these programs, evaluating six of them, with a view to presenting their operating principles. This scientific approach is intended to guide researchers in choosing appropriate tools for specific applications."

Luc Cornet and Denis Baurain, co-authors of numerous publications on the subject, notably on the creation of algorithms for detecting contaminants within genomes, present in this new publication a rigorous comparison of all the available algorithms. They also define, for the first time, many key concepts in the field of genomic contamination. "The importance of contamination is such that a whole series of algorithms is available to assess the quality of genomes," says Luc Cornet. Their publication rate is also very high, with eleven new tools published in the last three years alone. Despite these efforts, the ultimate detection tool does not yet exist, as each has its own strengths and weaknesses.

While training their colleagues in contamination detection in the context of the BELSPO-funded BCCM, GEN-ERA research project, Luc Cornet and Denis Baurain realized that it was difficult to explain the sometimes subtle differences between the various algorithms. They therefore decided to compare these tools on simulated genomic contamination in order to help researchers select the most appropriate ones for their studies. The two researchers conclude that it is important not to rely on a single tool, as is often the case in current scientific studies, and that combining multiple approaches based on complementary principles is necessary when tracking down contaminants.

Scientific reference

Luc Cornet & Denis Baurain, Contamination detection in genomic data: more is not enough, Genome Biology, 2022

Contacts

Luc Cornet

Denis Baurain
