False positive discoveries in ancient metagenomics, and how aMeta attempts to solve the problem: Part 1

by Nikolay Oskolkov* and Zoé Pochon*

*These authors contributed equally

Many researchers delving into metagenomics and ancient metagenomics stumble upon unexpected, exotic discoveries in their samples from time to time. Below, we’ll walk you through a few typical scenarios where you might suspect that something is too good to be true, but where a lack of computational experience may have prevented you from figuring out what went wrong.

Example 1.

Picture this: you’re having your first look at next-generation sequencing data from a metagenomic sample and you’ve got no clue about the organisms present in the sample. You happen to use Kraken2 [1] (because someone recommended it to you) with a database that you have not built yourself (but which is available on the server used by your lab), and you copy a few command lines from a postdoc colleague (who left the lab a few weeks ago), and you get back a list of candidate organisms in your sample similar to the one below.


Figure 1: Visualisation of Kraken2 results reordered by the number of reads and filtered for species-level classification. Each row presents data for a single species. “Percent_Reads” indicates the percentage of total analysed reads assigned to the species, “Reads” shows the number of reads, “TaxReads” displays the reads assigned to the species level, “Rank” indicates the taxonomic level (S for species), “TaxID” is the taxonomic identification number, and “Name” is the species name. Species highlighted in green are bacteria, likely environmental contaminants, while those highlighted in red are animals, with corresponding images shown beside them.
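If you want to reproduce this kind of overview yourself, the standard Kraken2 report (produced with `kraken2 --report`) is a plain tab-separated file with exactly these six columns, so a few lines of Python are enough to filter and reorder it. Here is a minimal sketch using pandas; the report file name is hypothetical:

```python
import pandas as pd

# Column layout of a standard Kraken2 report (`kraken2 --report`):
# percentage of reads in the clade, reads in the clade, reads assigned
# directly to the taxon, rank code, NCBI taxID, indented scientific name.
cols = ["Percent_Reads", "Reads", "TaxReads", "Rank", "TaxID", "Name"]
report = pd.read_csv("sample.kraken2.report", sep="\t", names=cols)

# Keep species-level rows, strip the indentation Kraken2 adds to names,
# and sort by read counts, as in Figure 1.
species = (
    report[report["Rank"] == "S"]
    .assign(Name=lambda df: df["Name"].str.strip())
    .sort_values("Reads", ascending=False)
)
print(species.head(20).to_string(index=False))
```

Sorting by read counts is what pushes unexpected hits like the kiwi into view near the top of the table.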

At first, the microbes at the top of the list make sense - you’d expect a bunch of different bacterial species in any archaeological (metagenomic) sample. But then, as you scroll down, you find traces of Canis lupus familiaris (dog) DNA (>2000 reads), Aquila chrysaetos chrysaetos (eagle) DNA (>1000 reads), and even Apteryx mantelli (kiwi) DNA (>1000 reads). Now, that’s intriguing…!

After discussing the results at the lab meeting and double-checking the sample IDs, you realise that this particular sequenced sample was not a real sample but actually a negative control (blank). Kiwi DNA in a blank sample? Seriously? Now, you’ve got a lot of questions: “What’s up with my analysis? Why am I picking up eagles and kiwis where they shouldn’t be? Are they even there? Did someone sneak kiwi DNA into the lab? Are my reagents contaminated? Is Kraken2 acting up?”

Example 2.

Next up, you’re checking out a soil sample and you detect the common carp fish (Cyprinus carpio). It’s all over the place: around 10% of your reads get assigned to this organism. Seems like a solid case for fish DNA in the soil, right? But hold on - how? Could it be because of the area you sampled? No water in sight - nada! Maybe there was a lake or river there ages ago? Now that would be an intriguing twist! Or is there something “fishy” going on with the data?


Figure 2: Krona visualisation of Kraken2 output from a soil sample showing the presence of DNA from common carp (left), an artefact of adapter contamination in both the query reads and the common carp reference genome (right).

However, when you present this exciting find at a conference, someone points you to a few bioinformatics blog posts claiming that the common carp reference genome is full of Illumina adapters that were never removed. It’s a bit frustrating at first. On the bright side, you’re pretty sure that the pipeline you used did perform adapter trimming, so your processed reads can’t contain adapters. The only catch? You forgot to check the quality control report after adapter trimming… In this case, removing the adapters and making sure they’re gone using a tool like FastQC will help you kick out this false positive fishy hit. But, hey, keep in mind that reference genomes can be contaminated in other ways. This is the case for several parasite reference genomes, which can be quite contaminated by their host’s DNA: for example, the genome of Spirometra erinaceieuropaei, a tapeworm, is contaminated with host (human) DNA [2]. So even if you use a solid pipeline and a decent database, never forget to investigate the alignments to your organism of interest!
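As a quick sanity check between trimming and classification, you can count how many of your processed reads still contain the adapter sequence that FastQC flags as the “Illumina Universal Adapter”. A minimal sketch, assuming gzipped FASTQ input with a hypothetical file name:

```python
import gzip

# First bases shared by the Illumina TruSeq adapters; FastQC reports
# this sequence as the "Illumina Universal Adapter".
ADAPTER = "AGATCGGAAGAGC"

total = hits = 0
with gzip.open("trimmed_reads.fastq.gz", "rt") as fastq:  # hypothetical file
    for i, line in enumerate(fastq):
        if i % 4 == 1:  # in FASTQ, every 4th line starting at line 2 is a sequence
            total += 1
            if ADAPTER in line:
                hits += 1

print(f"{hits} of {total} reads ({100 * hits / max(total, 1):.2f}%) contain the adapter")
```

If this number is not essentially zero after trimming, your quality control report deserves a second look before you trust any downstream hits.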

Example 3.

In the last scenario, your supervisor is very excited about a new collaboration with another lab. They ship over some ancient tooth samples from a mass grave, supposedly victims of a plague outbreak centuries ago. Your supervisor is pretty sure that a DNA analysis will confirm the cause of their demise. Your mission: shotgun sequencing and analysing the metagenomic data to check for the potential presence of Yersinia pestis, the agent of plague. Once you get the data back from sequencing, you download the Yersinia pestis reference genome and align your reads to it. This analysis results in circa 22,000 reads aligned uniquely to the Yersinia pestis genome, which looks like strong evidence for the presence of plague in your samples. You skip checking the alignments, thinking the number of mapped reads is proof enough. Little did you know that, had you checked, for example with the Integrative Genomics Viewer (IGV) [3], you would have noticed that the alignments look a bit off: a “stacked” pattern with lots of mismatches.


Figure 3: Visualisation of a genomic region of Yersinia pestis with mapped reads using IGV. The reference genome is shown at the top, and the reads aligned to this region are displayed below. Variations (mutations) in the reads compared to the reference genome are highlighted in various colours. The reads are stacking up in different short regions and show numerous mutations relative to the reference.
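IGV is interactive, but the same red flag can be quantified. Most aligners record the edit distance of each alignment in the NM tag, so a short pysam sketch can report the average number of mismatches per aligned base; the BAM file name below is hypothetical, and what counts as “suspiciously high” depends on how much damage and divergence you expect:

```python
import pysam

# Mean edit distance per aligned base, using the NM tag that bwa and
# most aligners set on each alignment. The BAM file name is hypothetical.
per_base_distances = []
with pysam.AlignmentFile("sample_vs_ypestis.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        edit_distance = read.get_tag("NM")  # mismatches + indels vs reference
        per_base_distances.append(edit_distance / read.query_alignment_length)

if per_base_distances:
    mean = sum(per_base_distances) / len(per_base_distances)
    print(f"mean edit distance per aligned base: {mean:.3f}")
```

A high mean edit distance combined with the stacked coverage pattern of Figure 3 is a strong hint that the reads come from a related taxon rather than from Yersinia pestis itself.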

During a lab meeting, someone suggests using competitive mapping (detailed explanation below) to validate your findings. What they mean is combining the Yersinia pestis reference genome with human and other bacterial reference genomes, rerunning the alignments against that combined database, and checking how many reads still map uniquely to Yersinia pestis now that they have the option to map to other candidate reference genomes (hence the name “competitive” mapping). Gradually adding more bacterial genomes to your concatenated reference, you discover that the number of reads uniquely mapping to Yersinia pestis drops dramatically.


Figure 4: Effect of increasing reference genome database size on the number of reads uniquely mapping to Yersinia pestis. The Y-axis shows the number of uniquely mapped reads, while the X-axis displays the number of reference genomes on a logarithmic scale. As more genomes are added, the unique reads mapping to Y. pestis decrease significantly, with only 11 reads remaining when all NCBI RefSeq bacterial genomes are included.
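Each point on this curve is a count of “uniquely mapping” reads, and one reasonable proxy for uniqueness is a high mapping quality: with aligners like bwa, a read that fits several genomes in the concatenated database equally well is assigned MAPQ 0, so a MAPQ filter keeps only reads with a single clear home. A sketch with pysam; the BAM name, the contig-name prefix, and the threshold of 30 are assumptions:

```python
import pysam

# Count reads mapping to Yersinia pestis contigs with high mapping
# quality in a BAM produced by competitive mapping against a
# concatenated database. File name, contig prefix, and MAPQ cutoff
# are assumptions for this sketch.
MIN_MAPQ = 30  # bwa assigns multi-mapping reads MAPQ 0, so this filters them

with pysam.AlignmentFile("competitive.bam", "rb") as bam:
    unique = sum(
        1
        for read in bam.fetch(until_eof=True)
        if not read.is_unmapped
        and read.mapping_quality >= MIN_MAPQ
        and read.reference_name.startswith("Yersinia_pestis")
    )

print(f"{unique} reads map uniquely to Yersinia pestis")
```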

Eventually, when all bacterial reference genomes available at NCBI RefSeq are included, you find that only 11 reads map uniquely to Yersinia pestis. Talk about a letdown! However, your supervisor remains the eternal optimist and suggests sequencing deeper, and even using targeted capture, to fish out more Yersinia pestis DNA. He mentions that his collaborators in other labs see 5 reads assigned to a pathogen as a sign of its presence, and in your case, you’ve got a promising 11 reads..!

In all three examples above, you stumbled upon something in your samples that didn’t actually exist: a classic case of a “false positive” discovery. The core goal of our pipeline, aMeta [4], is to counteract such false positive discoveries. We suggest that incorporating two essential steps into any of your analyses can help eliminate many of these false positive findings:

  • using competitive mapping against a very large and diverse collection of reference genomes
  • monitoring the evenness of coverage of reads mapping to the reference as crucial evidence of organism presence in a metagenomic sample

Both of these strategies are successfully implemented in aMeta (more about aMeta is coming up in Part 2 - stay tuned!). Indeed, in the Yersinia pestis example (Example 3), the initial misjudgment about the pathogen’s presence could have been avoided if competitive mapping against a large database of reference genomes had been used instead of direct mapping to the single reference genome of Yersinia pestis, or if there had been a check on how well the reads were aligning. These steps could also have helped prevent the false positive discoveries in the first and second examples. For clarity, let us illustrate what we mean by competitive mapping and evenness of coverage in a more technical manner.

Competitive mapping is a strategy used to avoid false positives by simultaneously aligning DNA sequences to a concatenated database of multiple reference genomes. By providing multiple potential targets (e.g., the genome of the organism of interest, potential contaminants such as human DNA, and other microbial genomes such as species from the same genus), this approach ensures that each sequenced read maps competitively to the most appropriate genome. This reduces the likelihood of sequences being incorrectly attributed to the wrong organism, which can occur when using a single reference genome. In Figure 5 below, we show what could theoretically happen when one compares single mapping and competitive mapping of a mammoth sample.


Figure 5: Illustration of competitive mapping and its use to reduce contamination and false positive sequences. 1. Single genome (mammoth) aDNA mapping: Initial mapping of mammoth aDNA sequences using the elephant reference genome (closest available reference genome). Microbial sequences (red), elephant sequences (blue), and human sequences (green) are included. Many reads mismap to the elephant genome due to the lack of better reference genomes. 2. Correcting for human contamination: Using both elephant and human reference genomes reduces human contaminant sequences (green). 3. Ancient metagenomics mapping: Comprehensive mapping with elephant, human, and microbial reference genomes accurately classifies sequences. This step significantly reduces microbial contamination (red) and human contamination (green), leaving mostly elephant sequences (blue).
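In practice, step 3 of this scheme just means mapping once against a single concatenated FASTA. One detail worth getting right is prefixing every contig name with its source genome, so that each alignment can later be traced back to elephant, human, or a microbe. A minimal sketch, with all file names hypothetical:

```python
from pathlib import Path

# Concatenate reference genomes into one FASTA for competitive mapping,
# prefixing every contig name with its source so hits can be traced back.
# All file names here are hypothetical.
genomes = {
    "Elephant": "loxodonta_africana.fasta",
    "Human": "GRCh38.fasta",
    "Microbial": "refseq_bacteria.fasta",
}

with open("competitive_db.fasta", "w") as out:
    for label, path in genomes.items():
        with Path(path).open() as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # ">NC_012345.1 ..." becomes ">Elephant|NC_012345.1 ..."
                    out.write(f">{label}|{line[1:]}")
                else:
                    out.write(line)
```

The resulting concatenated FASTA is then indexed once, and all reads are mapped against it in a single run, letting the aligner decide which genome fits each read best.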

Evenness of coverage refers to the “uniformness or consistency with which sequenced reads are distributed across the reference genome” [4]. With shotgun sequencing, fragments are randomly sampled from the genomes present in your samples. If the target species is genuinely present, its genome should be sequenced randomly, resulting in a scattered and even distribution across the reference genome (Figure 6). However, if the target species is not present, you might still observe a substantial number of reads aligning to its reference genome due to the absence of a more accurate reference genome for a related species or the presence of conserved regions shared by both species. In these scenarios, the reference genome will exhibit uneven coverage, with reads stacking up in certain regions and leaving other, more specific regions without aligned reads (Figure 6).


Figure 6: Illustration of read alignment scenarios and their impact on microbe detection. (A) When the target species is not present, reads tend to stack up on conserved regions of the reference genome, resulting in uneven coverage despite the same read depth (L_genome). The breadth of coverage in this case is 1/4, indicating the microbe is not detected. (B) When the target species is present, reads align evenly across the reference genome, providing consistent breadth and evenness of coverage, confirming the microbe’s detection. Here, the breadth of coverage is almost 100%. In both cases, L_genome represents the read depth, which is the same, but the breadth and evenness of coverage differ.
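The contrast between panels (A) and (B) is easy to turn into a number. Starting from the per-position output of `samtools depth -a` (contig, 1-based position, depth), the sketch below computes how many fixed-size windows contain covered positions; the file name, window size, and 10% threshold are all arbitrary choices for illustration, and it assumes a single-contig reference:

```python
# Per-window breadth of coverage from `samtools depth -a` output,
# which lists contig, 1-based position, and depth for every position.
WINDOW = 1000  # window size in bp; an arbitrary choice for this sketch

covered = {}       # window index -> number of positions with depth > 0
genome_length = 0
with open("ypestis.depth") as depth_file:  # hypothetical file, one contig
    for line in depth_file:
        contig, pos, depth = line.split()
        pos = int(pos)
        genome_length = max(genome_length, pos)
        if int(depth) > 0:
            win = (pos - 1) // WINDOW
            covered[win] = covered.get(win, 0) + 1

n_windows = (genome_length + WINDOW - 1) // WINDOW
hit = sum(1 for n in covered.values() if n / WINDOW > 0.1)
print(f"{hit} of {n_windows} windows have >10% of their positions covered")
```

A genuinely present organism, as in Figure 6B, should light up most windows; the stacked pattern of Figure 6A lights up only a few, however many reads are involved.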

Conclusion

An important focus in ancient metagenomics is to confirm the presence of a species as truly ancient by displaying deamination pattern plots and read length plots. While this validation step is essential to make sure that you are working on truly ancient material, more emphasis should be placed on showing the true presence of that specific species, by incorporating plots showing how reads map to the reference genome and including filters for evenness of coverage (for example, using sliding windows along the reference genome to measure breadth of coverage per window). Such plots, along with edit distance plots, post-mortem damage histograms, and mapping metrics, are provided by aMeta for all species detected in your sample. In summary, we recommend the use of competitive mapping against very large databases for species detection, and verifying the authenticity of detected species by paying more attention to the evenness of coverage profile. Both of these essential steps are incorporated in aMeta. Stay tuned for Part 2, where we discuss how aMeta addresses the issue of false positive discoveries.

References

[1] Wood, Derrick E., Jennifer Lu, and Ben Langmead. “Improved metagenomic analysis with Kraken 2.” Genome biology 20 (2019): 1-13. DOI: https://doi.org/10.1186/s13059-019-1891-0

[2] Jensen, Theis ZT, Jonas Niemann, Katrine Højholt Iversen, Anna K. Fotakis, Shyam Gopalakrishnan, Åshild J. Vågene, Mikkel Winther Pedersen et al. “A 5700 year-old human genome and oral microbiome from chewed birch pitch.” Nature Communications 10, no. 1 (2019): 5520. DOI: https://doi.org/10.1038/s41467-019-13549-9, see supplementary information.

[3] Thorvaldsdóttir, Helga, James T. Robinson, and Jill P. Mesirov. “Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.” Briefings in bioinformatics 14, no. 2 (2013): 178-192. DOI: https://doi.org/10.1093/bib/bbs017

[4] Pochon, Zoé, Nora Bergfeldt, Emrah Kırdök, Mário Vicente, Thijessen Naidoo, Tom Van Der Valk, N. Ezgi Altınışık, Maja Krzewińska, Love Dalén, Anders Götherström, Claudio Mirabello, Per Unneberg and Nikolay Oskolkov. “aMeta: an accurate and memory-efficient ancient metagenomic profiling workflow.” Genome Biology 24, no. 1 (2023): 242. DOI: https://doi.org/10.1186/s13059-023-03083-9