Teachers' Little Helpers Go Genomic

(January 24th 2009) Why do undergraduates waste their time solving theoretical problems? Faced with surging tidal waves of unsorted metagenomic DNA sequences, a French university lecturer has come up with the bright idea of getting his students to work directly on real science problems, reports Jeremy Garwood.

Pascal Hingamp, a Maître de Conférences (lecturer) in Bioinformatics at the University of Marseille (Luminy), has published "Metagenomic Annotation using a distributed grid of undergraduate students" in PLoS Biology (2008; 6(11): e296). Recommending the exploitation of this largely untapped resource, Hingamp describes training hundreds of his undergraduate students to successfully analyse metagenomic sequence data. He claims that his 'Annotathon' approach "combines the excitement of novelty provided by 'hot-off-the-sequencer', as yet unannotated metagenomics data, with a highly structured e-learning Web tool."

"Since most bioinformatics resources are accessible online, almost every type of bioinformatics teaching can be done from a computer room equipped with broadband internet. Early on in their undergraduate studies, students can tackle bioinformatics questions that are at the forefront of current research, and even investigate problems that have not been addressed to date."

Genomic sequence databases are currently overflowing with the input of automated DNA sequences from projects looking at different organisms and disease pathologies. However, the latest fashion for metagenomics is adding a new level of complexity to the jigsaw puzzle that has to be organised out of all those strings of nucleic acid letters. Metagenomics (also known as Environmental Genomics, Ecogenomics or Community Genomics) is the study of genetic material recovered directly from environmental samples. Unlike in traditional microbiology, metagenomics doesn't rely on purified clonal cultures. Instead it looks to sequence all the DNA in a particular environmental sample, arguing that this not only permits us to study organisms that are not easily cultured in the laboratory, but also to look at them in their natural environment. Recent huge metagenomic initiatives include the Human Microbiome Project (from the US National Institutes of Health) which will sequence DNA samples obtained from the mouth, nose, skin, and (where appropriate) vaginas of 250 human volunteers.

Thanks to automated DNA sequencing machines, the generation of DNA sequences from these heterogeneous environmental samples has become routine. However, what is the best way to process and understand this rapidly accumulating mass of data? Hingamp's "in silico" (i.e. computer) analyses "classically begin with open reading frame (ORF) prediction, followed by identification of conserved protein functional domains, as well as similarity searching in sequence databases." When protein-coding homologies are identified in the DNA sequences, the analysis then proceeds to "multiple sequence alignments and phylogenetic tree reconstruction" in order to assign the gene sequence to the original organism from which it came.

Hingamp says that students are expected to apply this workflow to several distinct sequences, typically three, that have been randomly sampled. However, "since even for this limited number of sequences, the annotation effort requires stamina, we have called this procedure 'Annotathon'" (an allusion to 'telethons', the lengthy televised fundraising events that, especially in France, are associated with research into genetic causes of human disease). Hingamp chose the Global Ocean Sampling (GOS) metagenomic sequences as his data source since environmental sequence databases do not usually provide any annotation other than submitter identity and sampling location, but says any other source of sequences with no public annotations could be exploited. Furthermore, "the novelty and biodiversity aspect of these projects undoubtedly contributes to positive student perception." However, to make life a little easier on his in silico novices (third year university students), he has prefiltered the GOS dataset to exclude sequences that do not contain at least a 60 amino acid ORF.
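To make the prefiltering step concrete, here is a minimal Python sketch of how a sequence might be tested for an ORF of at least 60 amino acids. It is an illustration only, not Hingamp's actual filter: the function names and the `min_aa` parameter are hypothetical, the standard genetic code is assumed, and an "ORF" is taken here to be any stop-free stretch in one of the six reading frames, with no start codon required (an assumption that suits short, possibly truncated fragments).

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def longest_open_stretch(dna):
    """Length, in amino acids, of the longest stop-free stretch in any of six frames."""
    dna = dna.upper()
    best = 0
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            protein = translate(strand[frame:])
            for segment in protein.split("*"):  # '*' marks a stop codon
                best = max(best, len(segment))
    return best


def passes_prefilter(dna, min_aa=60):
    """Hypothetical filter: keep a read only if it holds an ORF of >= min_aa residues."""
    return longest_open_stretch(dna) >= min_aa
```

A call such as `passes_prefilter(read)` would then decide whether a given read is handed to a student. A real pipeline would more likely rely on an established ORF finder, and might also insist on a start codon for complete genes.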

After ten hours of theoretical teaching, students had four half-day practical sessions in the computer room but were instructed to continue their annotations outside classes - "connection logs indeed show that students spend on average 42 hours online, of which only 16 correspond to supervised classes." Overall, the 515 students who have taken part in the Annotathon over the past three years have analysed a total of 2.3 Mb of ocean microbial DNA, representing 9,500 hours of cumulative annotation. Hingamp claims that his students' results compare very well to those from the "professional" Global Ocean Sampling program, although he doesn't comment upon the relative quality of established bioinformatics research.

Nevertheless, he does admit that students dropped into the Annotathon did encounter difficulties. Firstly, when delineating ORFs, students didn't understand the logic when told to choose "any initiation codon" when working with short, probably truncated, coding sequences, although the ORF start position was subsequently adjusted as necessary. Secondly, many students were further "destabilized" by the 'trial and error' nature of phylogenetic analyses, finding it hard to grasp the sequence selection strategy for multiple alignments.

Indeed, they often found it difficult to judge the quality of a multiple alignment and failed to identify sequences that should be removed from a suboptimal alignment. "Multiple alignment interpretation is often superficial, and few students confront the conserved regions, identified in multiple alignments, with identified protein domains or known family structural features."
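One crude way to spot a sequence that does not belong in an alignment is to compare each aligned sequence with all the others and flag any whose average identity is unusually low. The Python sketch below illustrates that idea only; it is not part of the Annotathon tool, the function names are hypothetical, and the 0.3 threshold is an arbitrary assumption that would need tuning for real protein families.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def flag_outliers(alignment, min_mean_identity=0.3):
    """Return names whose mean identity to all other sequences falls below the threshold.

    `alignment` maps a sequence name to its aligned (gapped) sequence.
    The threshold is an illustrative assumption, not a recommended value.
    """
    names = list(alignment)
    flagged = []
    for name in names:
        others = [n for n in names if n != name]
        if not others:
            continue
        mean = sum(
            pairwise_identity(alignment[name], alignment[other]) for other in others
        ) / len(others)
        if mean < min_mean_identity:
            flagged.append(name)
    return flagged


# Toy alignment: three similar sequences and one unrelated one.
toy = {
    "seq_a": "MKTAYIAK",
    "seq_b": "MKTAYLAK",
    "seq_c": "MKT-YIAK",
    "seq_d": "WPGC-HDE",
}
suspects = flag_outliers(toy)  # only seq_d falls below the threshold
```

A low score is only a hint that a sequence deserves a second look. Deciding whether to drop it still calls for the kind of judgment the students found difficult, such as checking conserved regions against known protein domains.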

But the single "most challenging in silico analysis faced by students" was the construction and interpretation of phylogenetic trees. Hingamp laments: "They commonly stumble over whether the trees obtained are compatible with known reference phylogeny, or if trees obtained by alternative methods are congruent. Evolutionary events like duplications or horizontal gene transfers are frequently missed." Indeed.

Now, after all this in silico DNA juggling, one might not expect students to grasp all the subtleties of their work, but Hingamp is a hard taskmaster and says he was surprised to find that many students had "considerable difficulties summarizing their findings in their final conclusion and producing a rigorous argumentation. This was a very discriminative point among students, some showing truly remarkable skills, while others remained at a very basic level in their analysis."

Perhaps we're not all cut out to be bioinformaticians. Nevertheless, Pascal Hingamp highly recommends his use of patient undergraduates to sort through the genomic databases. Any teachers (or researchers with access to willing hands) who wish to use the Annotathon are invited to consult the public server at: http://annotathon.univ-mrs.fr/


Last Changes: 03.27.2009