17 Ancient Metagenomic Pipelines
For this chapter’s exercises, if not already performed, we will need to download the chapter’s dataset, decompress the archive, and create and activate the conda environment.
To do this, use wget or right click and save to download this Zenodo archive: 10.5281/zenodo.13759201, and unpack it.
tar xvf ancient-metagenomic-pipelines.tar.gz
cd ancient-metagenomic-pipelines/
We can then create and subsequently activate the environment with the following commands.
conda env create -f ancient-metagenomic-pipelines.yml
conda activate ancient-metagenomic-pipelines
There are additional software requirements for this chapter. Please see the relevant section in Before we start before continuing with this chapter.
17.1 Introduction
A pipeline is a series of linked computational steps, where the output of one process becomes the input of the next. Pipelines are critical for managing the huge quantities of data that are now being generated regularly as part of ancient DNA analyses. In this chapter we will go through three dedicated ancient DNA pipelines - all with some (or all!) functionality geared to ancient metagenomics - to show how we can speed up the more routine aspects of the basic analyses we've learnt about earlier in this textbook through workflow automation.
We will introduce:
- nf-core/eager (https://nf-co.re/eager) - a generalised aDNA ‘workhorse’ pipeline that can do both ancient genomics and (basic) metagenomics (Fellows Yates et al. 2021)
- aMeta (https://github.com/NBISweden/aMeta) - a pipeline for resource efficient and accurate ancient microbial detection and authentication (Pochon et al. 2022)
- nf-core/mag (https://nf-co.re/mag) - a de novo metagenomics assembly pipeline (Krakau et al. 2022) that includes a dedicated ancient DNA mode for damage correction and validation.
Keep in mind that there are many other pipelines, and picking one often comes down to personal preference and practical considerations, such as which functionality they support, which language they are written in, and whether their computational requirements fit our available resources.
Other examples of ancient DNA genomic pipelines include Paleomix (https://paleomix.readthedocs.io/en/stable, Schubert et al. 2014) and Mapache (https://github.com/sneuensc/mapache, Neuenschwander et al. 2023), and for ancient metagenomics: metaBit (https://bitbucket.org/Glouvel/metabit/src/master/, Louvel et al. 2016) and HAYSTAC (https://github.com/antonisdim/haystac, Dimopoulos et al. 2022).
17.2 Workflow managers
All the pipelines introduced in this chapter utilise workflow managers. These are software packages that allow users to 'chain' together the inputs and outputs of distinct 'atomic' steps of a bioinformatics analysis - e.g., separate terminal commands of different bioinformatic tools - so that 'we don't have to'. We have already seen very basic workflows or 'pipelines' when using the bash 'pipe' (|) in the Introduction to the Command Line chapter, where each row of text output by one command was 'streamed' into the next command.
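For example, a classic text-streaming pipeline might count how many lines of a (hypothetical) samples.tsv file mention each sample name, passing the output of each command directly into the next:

cut -f 1 samples.tsv | sort | uniq -c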
However, in bioinformatics we are often dealing with non-row-based text files, meaning that 'classic' command-line pipelines don't really work. This is where bioinformatic workflow managers come in: they handle the passing of files from one tool to the next, but in a reproducible manner.
Modern computational bioinformatic workflow managers focus on a few main concepts. To summarise Wratten, Wilm, and Göke (2021), these are: data provenance, portability, scalability, and re-entrancy - all of which together contribute to ensuring the reproducibility of bioinformatic analyses.
Data provenance refers to the ability to track and visualise where each file goes and gets processed, as well as metadata about each file and process (e.g., What version of a tool was used? What parameters were used in that step? How much computing resource was used?).
Portability follows from data provenance: it's not just whether the entire execution of the pipeline can be reconstructed, but whether it can also be run with the same results on a different machine. This is important to ensure that we can install and test the pipeline on our laptop, but when we then need to do heavy computation on real data, it will still execute on a high-performance computing cluster (HPC) or on the cloud - both of which have very different configurations. This is normally achieved through the use of reusable software environments such as conda (https://docs.conda.io/en/latest/) or container engines such as docker (https://www.docker.com/), and tight integration with HPC schedulers such as SLURM (https://slurm.schedmd.com/documentation.html).
As mentioned earlier, not having to run each command manually can be a great speed-up to our analysis. However, the workflow also needs to be scalable, i.e., still efficient regardless of whether we're running one or ten thousand samples - modern workflow managers perform resource-requirement optimisation and scheduling to ensure that all steps of the pipeline are executed in the most resource-efficient manner, so the run completes as fast as possible regardless of the amount of input data.
Finally, as workflows get bigger and longer, re-entrancy has become more important, i.e., the ability to re-start a pipeline run that got stuck halfway through due to an error.
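As a brief illustration of what re-entrancy looks like in practice (the file names here are placeholders): with Nextflow we re-run the same command with the -resume flag to reuse cached results, while Snakemake simply re-runs the same command and only repeats rules whose outputs are missing or out of date.

## Nextflow: reuse cached results from a previous (crashed) run
nextflow run nf-core/eager -r 2.4.7 -profile docker --input samples.tsv <...> -resume
## Snakemake: re-invoke the same command; completed outputs are not regenerated
snakemake --snakefile workflow/Snakefile --use-conda -j 4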
All workflow managers have different ways of implementing the concepts above, and these range from very simple (e.g., Makefiles, https://en.wikipedia.org/wiki/Make_(software)) to very powerful and abstract (e.g., Workflow Description Language, https://github.com/openwdl/wdl). In this chapter we will use pipelines built on two popular workflow managers in bioinformatics: Nextflow (https://nextflow.io, Di Tommaso et al. 2017) and Snakemake (https://snakemake.github.io, Mölder et al. 2021).
This chapter will not cover how to write our own workflows, as this would require a whole other textbook. However, it is recommended to learn and use workflow managers when carrying out repetitive or routine bioinformatic analyses (Nextflow and Snakemake being two popular ones in bioinformatics). Use of workflow managers can make our work more efficient (as we only run one command, rather than each step separately), but also more reproducible, by reducing the risk of user error when executing each step: the computer will do exactly what we tell it, and if we don't change anything, it will do the exact same thing every time. If you're interested in writing your own workflows using workflow managers, many training courses and tutorials exist on the internet (e.g., for Nextflow there is the official training: https://training.nextflow.io/ or the one from software carpentries: https://carpentries-incubator.github.io/workflows-nextflow/, and for Snakemake the official tutorial: https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html#tutorial).
17.3 What is nf-core/eager?
nf-core/eager is a computational pipeline specifically designed for preprocessing and analysis of ancient DNA data (Figure 17.1) (Fellows Yates et al. 2021). It is a reimplementation of the previously published EAGER (Efficient Ancient Genome Reconstruction) pipeline (Peltzer et al. 2016) written in the workflow manager Nextflow. In addition to reimplementing the original genome mapping and variant calling pipeline in a more reproducible and portable manner, the pipeline also includes additional new functionality, particularly for researchers interested in the microbial sciences: namely a dedicated genotyper and consensus caller designed for low-coverage genomes, the ability to get breadth and depth coverage statistics for particular genomic features (e.g. virulence genes), but also automated metagenomic screening and authentication of the off-target reads from mapping (e.g. against the host reference genome).
A detailed description of steps in the pipeline is available as part of nf-core/eager’s extensive documentation. For more information, check out the usage documentation (https://nf-co.re/eager/2.5.2/docs/usage/).
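If we just want a quick reminder of the available options without opening the website, nf-core pipelines can also print their parameter documentation directly on the command line, for example:

nextflow run nf-core/eager -r 2.4.7 --help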
Briefly, nf-core/eager takes at a minimum standard input file types that are shared across the genomics field, i.e., raw FASTQ files or aligned reads in BAM format, and a reference FASTA. nf-core/eager performs preprocessing of this raw data, including adapter clipping, read merging, and quality control of adapter-trimmed data. nf-core/eager then carries out mapping using a variety of field-standard short-read alignment tools, with default parameters adapted for short and damaged aDNA sequences. The off-target reads from host DNA mapping can then go into metagenomic classification and authentication (in the case of MALT; Vågene et al. 2018; Herbig et al. 2016). After genomic mapping, BAM files go through deduplication of PCR duplicates, damage profiling and removal, and finally variant calling. A myriad of additional statistics can be generated depending on the user's preference. Finally, nf-core/eager uses MultiQC (https://multiqc.info/, Ewels et al. 2016) to create an integrated HTML report that summarises the output/results from each of the pipeline steps.
Why can't we use a standard genomics mapping and variant calling pipeline, such as nf-core/sarek (Hanssen et al. 2024), for ancient DNA?
Many tools used in standard genomics pipelines assume ‘modern’ data quality, i.e., high coverage, low error rates, and long read lengths.
In ancient DNA we need to use tools better suited to low coverage and short read lengths. Furthermore, we would like additional tools for authenticating our ancient DNA - functionality that we would not expect to need for modern data and that is thus not included in modern pipelines.
17.3.1 Running nf-core/eager
For the practical portion of this chapter, we will utilise sequencing data from four aDNA libraries, which we should have already downloaded from NCBI. If not, please see the Preparation section above. We will use nf-core/eager to perform a typical microbial genomic analysis, i.e., reconstruction of an ancient genome to generate variant calls that can be used for generating phylogenomic trees and other evolutionary analyses, and gene feature coverage statistics to allow insight into the functional aspects of the genome.

These four libraries come from two ancient individuals, GLZ002 and KZL002. GLZ002 comes from the Neolithic Siberian site of Glazkovskoe predmestie (Yu et al. 2020) and was radiocarbon dated to 3081-2913 calBCE. KZL002 is an Iron Age individual from Kazakhstan, radiocarbon dated to 2736-2457 calBCE (Andrades Valtueña et al. 2022). Both individuals were originally analysed for human population genetic analysis, but when undergoing metagenomic screening of the off-target reads, both sets of authors identified reads from Yersinia pestis in these individuals - the bacterium that causes plague. Subsequently the libraries from these individuals were processed using hybridisation capture to increase the number of Y. pestis sequences available for analysis.
Our aims in the following tutorial are to:
- Preprocess the FASTQ files by trimming adapters and merging paired-end reads
- Align reads to the Y. pestis reference and compute the endogenous DNA percentage
- Filter the aligned reads to remove host DNA
- Remove duplicate reads for accurate coverage estimation and genotyping
- Generate statistics on gene features (e.g. virulence factors)
- Merge data by sample and perform genotyping on the combined dataset
- Review quality control data to evaluate the success of the previous steps
Let’s get started!
nf-core/eager v2 requires an older version of Nextflow - if installing manually, ensure we do not have a Nextflow version greater than v22.10.6!
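One way to make sure a compatible Nextflow version is used for a given run (assuming Nextflow is already installed) is to set the NXF_VER environment variable, which tells Nextflow to download and use that specific version for that command only, for example:

NXF_VER=22.10.6 nextflow run nf-core/eager -r 2.4.7 --help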
First, download the latest version of the nf-core/eager repo (or check for updates if we have a previously-installed version).
nextflow pull nf-core/eager
Next we can re-visit AMDirT (see Accessing Ancient Metagenomic Data, Borry et al. 2024) to download a pre-prepared configuration file for nf-core/eager.

- Load AMDirT (AMDirT viewer), and select the latest release and the 'ancientsinglegenome-hostassociated' table
- Filter the sample_name column to just show KZL002 and GLZ002, and select these two rows
  - Press the burger icon on the column
  - Press the filter tab and deselect everything
  - Search GLZ002 and select it in the filter menu
  - Search KZL002 and select it in the filter menu
  - Close the filter menu and select the two rows
- Press the Validate Selection button
- Press the 'Download Curl sample download script', 'Download nf-core/eager input TSV', and 'Download Citations as BibTex' buttons
- Move the downloaded files into the eager/ directory of this tutorial's directory
  - i.e., if the files were downloaded to our Downloads/ folder rather than the chapter's directory, and assuming we've got no other previous downloads, we can run
mv ~/Downloads/AncientMetagenomeDir_* eager/
We won't actually use the BibTex citations file for anything in this tutorial, but it is good habit to always record and save all citations of any software or data we use!
To download the FASTQ files:

- Move into the eager/ directory
- Run bash AncientMetagenomeDir_curl_download_script.sh to download the files (this may take ~3 minutes)
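As a quick sanity check (just a suggestion, not part of the original instructions), we can list the downloaded files; for the four paired-end libraries in the input TSV we expect eight FASTQ files in total.

ls -lh *.fastq.gz
ls *.fastq.gz | wc -l  ## should report 8 (4 libraries x 2 read files)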
Next we can inspect the AMDirT-generated input TSV file for nf-core/eager!
cat AncientMetagenomeDir_nf_core_eager_input_table.tsv
Sample_Name Library_ID Lane Colour_Chemistry SeqType Organism Strandedness UDG_Treatment R1 R2 BAM
GLZ002 ERR4093961 0 4 PE Homo sapiens double half ERR4093961_1.fastq.gz ERR4093961_2.fastq.gz NA
GLZ002 ERR4093962 0 4 PE Homo sapiens double full ERR4093962_1.fastq.gz ERR4093962_2.fastq.gz NA
KZL002 ERR8958768 0 4 PE Homo sapiens double half ERR8958768_1.fastq.gz ERR8958768_2.fastq.gz NA
KZL002 ERR8958769 0 4 PE Homo sapiens double half ERR8958769_1.fastq.gz ERR8958769_2.fastq.gz NA
Here we see 11 columns, all pre-filled. The first two columns correspond to sample/library IDs that will be used for data provenance and grouping. When a library has been sequenced across multiple lanes we can speed things up by preprocessing these independently before merging, so the Lane column can specify this (although it is not used in this case, as we have independent libraries per sample). We can indicate the colour chemistry to show whether our data requires additional preprocessing to remove poly-G tails, and then also the strandedness and UDG damage-treatment status of the libraries if we require further damage manipulation. Finally we provide paths to the FASTQ files or BAM files.
Other than the raw FASTQ files, we will need a reference genome and annotation coordinates of genes present on the genome. In this case we will use a Yersinia pestis (accession: GCF_001293415.1) reference genome (.fna) and gene feature file (.gff) from NCBI Genome (https://www.ncbi.nlm.nih.gov/genome). These have already been placed in the reference/ directory for us.
To download the required reference genome and annotation file yourself, run the following commands, which retrieve them from the NCBI Genome database.
## Download from NCBI
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_001293415.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_001293415.1.zip" -H "Accept: application/zip"
unzip *.zip
mv ncbi_dataset/data/GCF_001293415.1/* .
## We have to sort the gff file to make it eager compatible
gffread genomic.gff -o GCF_001293415.1_ASM129341v1_genomic.gff
With all of these, we can run the pipeline!
First let's enter a screen session to make sure we can leave the pipeline running in the background and continue using our terminal.
screen -R eager
conda activate ancient-metagenomic-pipelines
Now we can construct an eager command from within the data directory so that it looks like this.
nextflow run nf-core/eager -r 2.4.7 \
-profile docker \
--input AncientMetagenomeDir_nf_core_eager_input_table.tsv \
--fasta ../reference/GCF_001293415.1_ASM129341v1_genomic.fna \
--anno_file ../reference/GCF_001293415.1_ASM129341v1_genomic.gff \
--run_bam_filtering --bam_unmapped_type discard \
--skip_preseq \
--run_genotyping --genotyping_tool ug --gatk_ug_out_mode EMIT_ALL_SITES \
--run_bedtools_coverage \
--run_bcftools_stats
When the run starts, we will see a long list of process lines with progress bars slowly appearing over time. We can press ctrl + a and then [ to access a 'navigation' (called 'copy') mode, then use the arrow keys on our keyboard to scroll up and down. To quit this mode just press q.
We don’t normally recommend running analyses in the directory our data is in! It is better to keep data and analysis results in separate directories. Here we are just running eager alongside the data for convenience (i.e., we don’t have to modify the downloaded input TSV)
This run will take between 20-40 minutes on a standard laptop! Time for a break, or we can continue reading this chapter, as the critical output data has already been provided for us in the data directory.
So what is this command doing? The different parameters do the following:
- Tell Nextflow to run the nf-core/eager pipeline with version 2.4.7
- Specify which computing and software environment to use with -profile
  - In this case we are running locally, so we don't specify a computing environment (such as a configuration for an institutional HPC)
  - We use docker as our container engine, which downloads all the software and specific versions needed for nf-core/eager in immutable 'containers', to ensure nothing gets broken and everything is as nf-core/eager expects
- Provide the various paths to the input files (TSV with paths to FASTQ files, a reference FASTA, and the reference FASTA's annotations)
- Activate the various steps of the pipeline we're interested in
  - We turn off preseq (e.g. when we know we can't sequence more)
  - We turn on BAM filtering, and specify what to do with the unmapped reads (here we discard them; alternatively we could keep them as FASTQ files to check off-target reads, e.g. for other pathogens)
  - We turn on genotyping using GATK UnifiedGenotyper (preferred over GATK HaplotypeCaller (Poplin et al. 2018) due to the incompatibility of that method with low-coverage data)
  - We turn on variant statistics (from GATK) using bcftools (Danecek et al. 2021), and coverage statistics of gene features using bedtools (Quinlan and Hall 2010)
For full parameter documentation, see the website (https://nf-co.re/eager/2.5.2/parameters).
And now we wait…
As a reminder, to detach from the screen session type ctrl + a then d. To reattach, run screen -r eager
Specifically for ancient (meta)genomic data, the following parameters and options may be useful to consider when running our own data:
--mapper circularmapper

- This aligner is a variant of bwa aln that allows even mapping across the ends of linear references of circular genomes
- It allows reads spanning the start/end of the sequence to be correctly placed, and this provides better coverage estimates across the entire genome

--hostremoval_input_fastq

- Allows re-generation of the input FASTQ files but with any reads aligned to the reference genome removed
- Can be useful when dealing with modern or ancient data where there are ethical restrictions on the publication of host DNA
- The output can be used as the 'raw data' upload to public data repositories

--run_bam_filtering --bam_unmapped_type

- A pre-requisite for performing the metagenomic analysis steps of nf-core/eager
- Generates FASTQ files of the unmapped reads present in the reference-mapping BAM file

--run_bedtools_coverage --anno_file /<path>/<to>/<genefeature>.bed

- Turns on calculating the depth/breadth of annotations in the provided BED or GFF file, useful for generating e.g. virulence gene presence/absence plots

--run_genotyping --genotyping_tool ug --gatk_ug_out_mode EMIT_ALL_SITES

- Turns on genotyping with GATK UnifiedGenotyper
- A pre-requisite for running MultiVCFAnalyzer (Bos et al. 2014) for consensus sequence creation
- It's recommended to use either GATK UnifiedGenotyper or freeBayes (not MultiVCFAnalyzer compatible!, Garrison and Marth 2012) for low-coverage data
- GATK HaplotypeCaller is not recommended for low-coverage data, as it performs local de novo assembly around possible variant sites, and this will fail with low-coverage short-read data

--run_multivcfanalyzer --write_allele_frequencies

- Turns on SNP table and FASTA consensus sequence generation with MultiVCFAnalyzer (see pre-requisites above)
- By providing --write_allele_frequencies, the SNP table will also report the percentage of reads at each position supporting the call. This can help us evaluate the level of cross-mapping from related (e.g. contaminating environmental) species, which may call into question the reliability of any resulting downstream analyses.

--metagenomic_complexity_filtering

- An additional preprocessing step of raw reads before going into metagenomic screening, to remove low-complexity reads (e.g. mono- or di-nucleotide repeats)
- Removing these will speed up and lower computational requirements during classification, and will not bias profiles, as these sequences provide no taxon-specific information (i.e., they can be aligned against thousands to millions of genomes)

--run_metagenomic_screening

- Turns on metagenomic screening with either MALT or Kraken2
- If running with MALT, we can supply --run_maltextract to get authentication statistics and plots (damage, read lengths etc.) for evaluation
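As an illustration only, here is a hedged sketch of how several of the options above could be combined into a single nf-core/eager command. The input TSV, reference, and BED file paths are placeholders (not files provided with this chapter), and a metagenomic classification database would additionally need to be supplied via the relevant database parameter.

## Sketch only: samples.tsv and all /<path>/<to>/ entries are placeholders
nextflow run nf-core/eager -r 2.4.7 \
-profile docker \
--input samples.tsv \
--fasta /<path>/<to>/<reference>.fna \
--mapper circularmapper \
--hostremoval_input_fastq \
--run_bam_filtering --bam_unmapped_type fastq \
--run_bedtools_coverage --anno_file /<path>/<to>/<genefeature>.bed \
--metagenomic_complexity_filtering \
--run_metagenomic_screening --run_maltextract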
Why is it critical to report versions of all pipelines and tools?
(Bioinformatics) software is rarely stable.
Many tools update frequently, either with bug fixes or with new or optimised functionality.
Both can cause changes in the output of the software, and thus the results of our analyses. If we want our analyses to be completely reproducible, and to increase the trust of other scientists in our results, we need to make sure they can generate the same output as we did.
17.3.2 Top Tips for nf-core/eager success
Screen sessions
Depending on the size of our input data, infrastructure, and analyses, running nf-core/eager can take hours or even days (e.g. the example command above will likely take around 20 minutes!).
To avoid crashes due to loss of power or network connectivity, try running nf-core/eager in a screen or tmux session:

screen -R eager
Multiple ways to supply input data
In this tutorial, we used a TSV file to specify our input data files and formats. This is a powerful approach that allows nf-core/eager to intelligently apply analyses to certain files only (e.g. merging for paired-end but not single-end libraries). However, inputs can also be specified using wildcards, which can be useful for fast analyses with simple input data types (e.g. same sequencing configuration, file location, etc.).
See the online nf-core/eager documentation (https://nf-co.re/eager/usage) for more details.
nextflow run nf-core/eager -r 2.4.7 -profile docker \
--fasta ../reference/GCF_001293415.1_ASM129341v1_genomic.fna \
--input "data/*_{1,2}.fastq.gz" \
--udg_type half <...>
Get our MultiQC report via email

If we have GNU mail or sendmail set up on our system, we can add the following flag to send the MultiQC HTML report to our email upon run completion: --email "your_address@something.com"
Check out the nf-core/eager launch GUI
For researchers who might be less comfortable with the command line, check out the nf-core/eager launch GUI (https://nf-co.re/launch?pipeline=eager&release=2.5.2)! The GUI also provides a full list of all pipeline options with short explanations for those interested in learning more about what the pipeline can do.
When something fails, all is not lost!
When individual jobs fail, nf-core/eager will try to automatically resubmit that job with increased memory and CPUs (up to two times per job).
When the whole pipeline crashes, we can save time and computational resources by resubmitting with the -resume flag. nf-core/eager will retrieve cached results from previous steps as long as the input is the same.

Monitor our pipeline in real time with the Seqera Platform

Regular users may be interested in checking out the Seqera Platform (previously known as Nextflow Tower), a tool for monitoring the progress of Nextflow pipelines in real time. Check the website (https://tower.nf) for more information.
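As a minimal sketch (assuming we have generated an access token on the platform), the token is exported as an environment variable and the -with-tower flag is added to the run command; the token value here is of course a placeholder.

export TOWER_ACCESS_TOKEN=<your_token_here>
nextflow run nf-core/eager -r 2.4.7 -profile docker --input AncientMetagenomeDir_nf_core_eager_input_table.tsv <...> -with-tower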
17.3.3 nf-core/eager output: results files
Once completed, the results/ directory of our nf-core/eager run will contain a range of directories that will have output files and tool logs of all the steps of the pipeline. nf-core/eager tries to only save the 'useful' files.
Everyday useful files for ancient (microbial) (meta)genomics typically are in folders such as:
deduplication/ or merged_bams/

- These contain the most basic BAM files we would want to use for downstream analyses (and used in the rest of the genomic workflow of the pipeline)
- They contain deduplicated BAM files (i.e., with PCR artefacts removed)

merged_bams/

- This directory will contain BAMs where multiple libraries of one sample have been merged into one final BAM file, when these have been supplied

damage_rescaling/ or trimmed_bam/

- These contain the output BAM files from in silico damage removal, if we have turned on e.g. BAM trimming or damage rescaling
- If we have multiple libraries of one sample, the final BAMs we want will be in merged_bams/

genotyping/

- This directory will contain our final VCF files from the genotyping step, e.g. from the GATK or freeBayes tools

MultiVCFAnalyzer/

- This will contain consensus sequences and SNP tables from MultiVCFAnalyzer, which also allows generation of 'multi-allelic' SNP position statistics (useful for the evaluation of cross-mapping from contaminants or relatives)

bedtools/

- This directory will contain the depth and breadth statistics of genomic features, if a gff or bed file has been provided to the pipeline
- The files can be used to generate gene heatmaps, which can be used to visualise the comparative presence/absence of virulence factors across genomes (e.g. for microbial pathogens)

metagenomic_classification/ or malt_extract/

- This directory contains the output RMA6 files from MALT, the profiles and taxon count tables from Kraken2, or the aDNA authentication output statistics from MaltExtract.
Most other folders contain either intermediate files that are only useful for technical evaluation in the case of problems, or statistics files that are otherwise summarised in the run report.
So, before we delve into these folders, it's normally a good idea to do a 'quality check' of the pipeline run. We can do this using the interactive MultiQC pipeline run report.
Why is it important to use deduplicated BAMs for downstream genomic analyses?
Without deduplication, you are artificially increasing the confidence in variant calls during genotyping. Many downstream analyses assume that each base call is made up of multiple independent 'observations' of that particular nucleotide. If you have not deduplicated your alignments, you may have the exact same molecule represented twice (an artefact of amplification), thus violating the 'independent observation' assumption.
17.3.4 nf-core/eager output: run report
cd multiqc/
If we’re impatient, and our nf-core/eager run hasn’t finished yet, we can cancel the run with ctrl + c (possibly a couple of times), and we can open a pre-made file in the data directory under ancient-metagenomic-pipelines/multiqc_report.html
In here we should see a bunch of files, but we should open the multiqc_report.html file in our browser. We can either do this via the command line (e.g. for firefox: firefox multiqc_report.html) or navigate to the file using our file browser and double-clicking on the HTML file.
Once opened we will see a table, and below it many figures and other tables (Figure 17.2). All of these statistics can help us evaluate the quality of our data, the pipeline run, and also possibly some initial results!
Typically we will look at the General Statistics table to get a rough overview of the pipeline run.
If we hover our cursor over the column headers, we can see which tool the column's numbers came from. Generally the columns go in order from left to right, where the left-most columns are from earlier in the pipeline run (e.g. removing adapters), through to variant-calling statistics (e.g. number of variants called). We can also configure which columns to display using the Configure columns button.
It's important to note that in MultiQC tables we may have duplicate rows from the same library or sample. This is due to MultiQC trying to squeeze in as many statistics from as many steps of the pipeline as possible (for example, statistics on each of a pair of FASTQ files, and then statistics on the single merged and mapped BAM file), so we should play around with the column configuration to help us find the best way to initially evaluate our data.
The bulk of the MultiQC report is made up of per-tool summary plots (e.g., bar charts, line charts etc.). Most of these are interactive; we can hover over lines and bars to get more specific numbers for each plot. The visualisations are aimed at helping us identify possible outliers that may represent failed samples or libraries.
Evaluating how good the data is and how well the run went will vary depending on the dataset and the options selected. However, the nf-core/eager tutorials (https://nf-co.re/eager/usage#tutorial—how-to-set-up-nf-coreeager-for-pathogen-genomics) have a good overview of questions we can ask of our MultiQC report to see whether our data is good or not. We shamelessly copy these questions here (as the overlap of authors between the nf-core/eager documentation and this textbook is rather high).
Once completed, we can try going through the MultiQC report of the command we executed above, and compare against the questions below. Keep in mind we have a sample size of two, so many of the questions regarding identifying 'outliers' may be difficult.
The following questions are by no means comprehensive, but rather represent the ‘typical’ questions the nf-core/eager developers asked of their own data and reports. However they can act as a good framework for thinking critically about our own results.
General Stats Table
- Do we see the expected number of raw sequencing reads (summed across each set of FASTQ files per library) that was requested for sequencing?
- Does the percentage of trimmed reads look normal for aDNA, and do lengths after trimming look short as expected of aDNA?
- Does the Endogenous DNA (%) column look reasonable (high enough to indicate we have received enough coverage for downstream analyses), and/or do we lose an unusually high number of reads after filtering?
- Does ClusterFactor or ‘% Dups’ look high (e.g. >2 or >10% respectively - high values suggesting over-amplified or badly preserved samples i.e. low complexity; note that genome-enrichment libraries may by their nature look higher)
- Do we see an increased frequency of C>Ts on the 5’ end of molecules in the mapped reads?
- Do median read lengths look relatively low (normally <= 100 bp) indicating typically fragmented aDNA?
- Does the % coverage decrease relatively gradually at each depth of coverage, rather than dropping extremely drastically?
- Does the Median coverage and percent >3x (or whatever we set) show sufficient coverage for reliable SNP calls and that a good proportion of the genome is covered indicating we have the right reference genome?
- Do we see a high proportion of % Hets, indicating many multi-allelic sites (and possibly presence of cross-mapping from other species, that may lead to false positive or less confident SNP calls)?
FastQC (pre-AdapterRemoval)
- Do we see any very early drop off of sequence quality scores suggesting problematic sequencing run?
- Do we see outlier GC content distributions?
- Do we see high sequence duplication levels?
AdapterRemoval
- Do we see high numbers of singletons or discarded read pairs?
FastQC (post-AdapterRemoval)
- Do we see improved sequence quality scores along the length of reads?
- Do we see reduced adapter content levels?
Samtools Flagstat (pre/post Filter)
- Do we see outliers, e.g. with unusually low levels of mapped reads, (indicative of badly preserved samples) that require downstream closer assessment?
DeDup/Picard MarkDuplicates
- Do we see large numbers of duplicates being removed, possibly indicating over-amplified or badly preserved samples?
MALT (metagenomics only)
- Do we have a reasonable level of mappability?
- Somewhere between 10-30% can be pretty normal for aDNA, whereas e.g. <1% requires careful manual assessment
- Do we have a reasonable taxonomic assignment success?
- We hope to have a large number of the mapped reads (from the mappability plot) that also have taxonomic assignment
PreSeq (genomics only)
- Do we see a large drop off of a sample’s curve away from the theoretical complexity? If so, this may indicate it’s not worth performing deeper sequencing as we will get few unique reads (vs. duplicates that are not any more informative than the reads you’ve already sequenced)
DamageProfiler (genomics only)
- Do we see evidence of damage on the microbial DNA (i.e. a % C>T of more than ~5% in the first few nucleotide positions?) ? If not, possibly our mapped reads are deriving from modern contamination
QualiMap (genomics only)
- Do we see a peak of coverage (X) at a good level, e.g. >= 3x, indicating sufficient coverage for reliable SNP calls?
MultiVCFAnalyzer (genomics only)
- Do we have a good number of called SNPs that suggest the samples have genomes with sufficient nucleotide diversity to inform phylogenetic analysis?
- Do we have a large number of discarded SNP calls?
- Are the % Hets very high, indicating possible cross-mapping from off-target organisms that may confound variant calling?
As above, how to evaluate these outputs will vary, sometimes greatly, depending on the data and/or pipeline settings. However, the extensive output documentation (https://nf-co.re/eager/2.4.7/output) of nf-core/eager can guide us through every single table and plot to assist in continuing any type of ancient DNA project, assisted by fun little cartoony schematic diagrams (Figure 17.3)!
17.3.5 Clean up
Before continuing onto the next section of this chapter, we will need to deactivate from the conda environment.
conda deactivate
If the nf-core/eager run is still not finished, we should enter the screen session with screen -r eager, press ctrl + c until we drop to the prompt, and type exit.
We may also need to delete the contents of the eager directory if we are low on hard-drive space.
cd /<path>/<to>/ancient-metagenomic-pipelines/eager
rm -r *
17.4 What is aMeta?
For this chapter’s exercises, if not already performed, we will need to create the special aMeta conda environment and activate the environment.
While nf-core/eager is a solid pipeline for microbial genomics, and can also perform metagenomic screening via the integrated HOPS pipeline (Hübler et al. 2019) or Kraken2 (Wood, Lu, and Langmead 2019), in some cases we may wish to have a more accurate and resource-efficient pipeline. In this section, we will demonstrate an example of using aMeta, a Snakemake workflow proposed by Pochon et al. (2022) that aims to minimise resource usage by combining low-resource k-mer-based taxonomic profiling with accurate read alignment (Figure 17.4).
Rather than using the very computationally heavy HOPS pipeline (Hübler et al. 2019), which requires extremely large computational nodes with large amounts of RAM (>1 TB) to load MALT databases into memory, aMeta takes a two-step approach. Firstly it uses KrakenUniq (a k-mer based and thus memory-efficient method, Breitwieser, Baker, and Salzberg 2018) to screen sequencing reads against a broad, generalised microbial database. Once all the possible taxa have been detected, aMeta then makes a new database of just the genomes of the taxa reported by KrakenUniq (i.e. a project-specific database), but using MALT. aMeta then uses MALT on this much-reduced database to perform the computationally much heavier alignment against the reference genomes and LCA taxonomic reassignment. The output from MALT is then sent to the MaltExtract program of the HOPS pipeline for ancient DNA authentication statistics.
17.4.1 Running aMeta
In this tutorial we will try running the small test data that comes with aMeta.
aMeta has been written in Snakemake, which means the pipeline has to be installed in a slightly different manner to the nextflow pull command that can be used for nf-core/eager.
Make sure you have followed the instructions in the Before You Start chapter for cloning the aMeta GitHub repository to the ancient-metagenomic-pipelines/ directory. Once we have done this, we can make sure we are in the aMeta directory, if not already.
cd /<path>/<to>/ancient-metagenomic-pipelines/aMeta
And activate the dedicated aMeta conda environment.
conda activate aMeta
As aMeta also includes tools that normally require very large computational resources that cannot fit on a standard laptop, we will instead try to re-use the internal very small ‘fake’ data the aMeta developers use to test the pipeline.
The following steps have already been performed for students at the summer schools, as the conda environment set up in particular will take a very long time!
If you are doing this chapter self-guided, it is critical to perform the following set up steps!
We don't have to worry about trying to understand exactly what the following commands are doing; they will not be important for the rest of the chapter. Generally the commands pull all the relevant software (via conda), make a fake database, download other required files, and then reconstruct the basic directory and file structure required for running the pipeline.
## Change into ~/.test to set up all the required test resources (Databases etc.)
cd .test/
## Set up conda envs
## If we get an error about a 'non-default solver backend', run `conda config --set solver classic` and re-start the command
## If we have installed the standalone mamba, change --conda-frontend to mamba
snakemake --use-conda --show-failed-logs -j 2 --conda-cleanup-pkgs cache --conda-create-envs-only -s ../workflow/Snakefile --conda-frontend conda
source $(dirname $(dirname $CONDA_EXE))/etc/profile.d/conda.sh
## If we had to change the solver back to classic, we can switch to libmamba with `conda config --set solver libmamba`
## Build dummy KrakenUniq database
env=$(grep krakenuniq .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g")
conda activate $env
krakenuniq-build --db resources/KrakenUniq_DB --kmer-len 21 --minimizer-len 11 --jellyfish-bin $(pwd)/$env/bin/jellyfish
conda deactivate
## Get Krona taxonomy tax dump
env=$(grep krona .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
cd $env/opt/krona
./updateTaxonomy.sh taxonomy
cd -
conda deactivate
## Adjust malt max memory usage
env=$(grep hops .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
version=$(conda list malt --json | grep version | sed -e "s/\"//g" | awk '{print $2}')
cd $env/opt/malt-$version
sed -i -e "s/-Xmx64G/-Xmx3G/" malt-build.vmoptions
sed -i -e "s/-Xmx64G/-Xmx3G/" malt-run.vmoptions
cd -
conda deactivate
touch .initdb
## Run a quick test and generate the report (you can open this to check it looks like everything was generated)
snakemake --use-conda --show-failed-logs --conda-cleanup-pkgs cache -s ../workflow/Snakefile $@ --conda-frontend conda -j 4
snakemake -s ../workflow/Snakefile --report --report-stylesheet ../workflow/report/custom.css --conda-frontend conda
## Now we move back into the main repository where we can symlink all the database files back to try running our 'own' test
cd ../
cd resources/
ln -s ../.test/resources/* .
cd ../config
mv config.yaml config.yaml.bkp
mv samples.tsv samples.tsv.bkp
cd ../
ln -s .test/data/ .
ln -s .test/.snakemake/ . ## so we can re-use conda environments from the `.test` directory for the summer school run
## Again get the taxonomy tax dump for Krona, but this time for a real run
## Make sure you're now in the root directory of the repository!
env=$(grep krona .test/.snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
cd $env/opt/krona
./updateTaxonomy.sh taxonomy
cd -
conda deactivate
## And back to the root of the repo for practising aMeta properly!
cd ../
Now hopefully we can forget all this, and imagine we are running data through aMeta as we would normally from scratch.
OK, now that aMeta is all set up, we can simulate running a 'real' pipeline job!
17.4.2 aMeta configuration
First, we will need to configure the workflow. We need to create a tab-delimited samples.tsv file inside aMeta/config/ and provide the names of the input FASTQ files. In a text editor (e.g. nano), write the following names and paths in TSV format.
sample fastq
bar data/bar.fq.gz
foo data/foo.fq.gz
Make sure when copy-pasting into our text editor that tabs are not replaced with spaces, otherwise the file might not be read!
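If we are worried about our editor silently converting tabs to spaces, one alternative (purely a suggestion) is to generate the file on the command line with printf, where \t guarantees literal tab characters, and then check it with cat -A (tabs show up as ^I).

printf 'sample\tfastq\nbar\tdata/bar.fq.gz\nfoo\tdata/foo.fq.gz\n' > config/samples.tsv
cat -A config/samples.tsv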
aMeta (v1.0.0) currently supports single-end or pre-merged data only!
Then we need to write a config file. This tells aMeta where to find things such as database files and other settings.
A minimal example config.yaml file can look like this. It specifies the location of the main samplesheet (the TSV file that lists the FASTQs of the samples we want to analyse), and the paths to all the required database files and reference genomes we may need. These paths and settings go inside the config.yaml file, which must be placed inside aMeta/config/.
Make a configuration file with your text editor of choice (e.g. nano
).
samplesheet: "config/samples.tsv"
krakenuniq_db: resources/KrakenUniq_DB
bowtie2_db: resources/ref.fa
bowtie2_seqid2taxid_db: resources/seqid2taxid.pathogen.map
pathogenomesFound: resources/pathogenomesFound.tab
malt_nt_fasta: resources/ref.fa
malt_seqid2taxid_db: resources/KrakenUniq_DB/seqid2taxid.map
malt_accession2taxid: resources/accession2taxid.map
ncbi_db: resources/ncbi
n_unique_kmers: 1000
n_tax_reads: 200
Once this config file is generated, we can start the run.
As this is only a dummy run (due to the large-ish computational resources required for MALT), we re-use some of the resource files here. While this will produce nonsense output, it is used here to demonstrate how we would execute the pipeline.
17.4.3 Prepare and run aMeta
Make sure we're still in the aMeta conda environment, and that we are still in the main aMeta directory, with the following.
conda activate aMeta
cd /<path>/<to>/ancient-metagenomic-pipelines/aMeta/
Finally, we are ready to run aMeta, which will automatically pick up the config and samplesheet files we placed in config/!
## change conda-frontend to mamba if we set this up earlier!
snakemake --snakefile workflow/Snakefile --use-conda -j 10 --conda-frontend conda
We can modify -j to represent the number of available CPUs we have on our machine.
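For example (assuming a Linux machine where nproc is available), we can let the shell fill in the CPU count for us:

snakemake --snakefile workflow/Snakefile --use-conda -j $(nproc) --conda-frontend conda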
If we are using conda as the frontend, this could be quite slow!
Firstly we'll see displayed on the console a bunch of messages about JSON schemas, building the DAG of jobs, and downloading and installing conda environments. We will know the pipeline has started when we get green messages saying things like Checkpoint and Finished job XX.
We'll know the job is completed once we get the following messages.
Finished job 0.
31 of 31 steps (100%) done
Complete log: .snakemake/log/2023-10-05T155051.524987.snakemake.log
17.4.4 aMeta output
All output files of the workflow are located in the aMeta/results directory. To get a quick overview of the ancient microbes present in our samples, we should check the heatmap in results/overview_heatmap_scores.pdf.
If running during the summer school, you can use the following command to open the PDF file from the command line.
evince results/overview_heatmap_scores.pdf
The heatmap demonstrates microbial species (in rows) authenticated for each sample (in columns) (Figure 17.5).
The colors and the numbers in the heatmap represent authentications scores, i.e. numeric quantification of eight quality metrics that provide information about microbial presence and ancient status.
The authentication scores can vary from 0 to 10; the higher the score, the more likely that a microbe is present in a sample and is ancient.
Typically, scores from 8 to 10 (red color in the heatmap) provide good confidence of ancient microbial presence in a sample. Scores from 5 to 7 (yellow and orange colors in the heatmap) can imply that either: (a) a microbe is present but not ancient, i.e. modern contaminant, or (b) a microbe is ancient (the reads are damaged) but was perhaps aligned to a wrong reference, i.e. it is not the microbe we think about.
The former is a more common case scenario.
The latter often happens when an ancient microbe is correctly detected at the genus level, but we are not confident about the exact species, and might be aligning the damaged reads to a non-optimal reference, which leads to a lot of mismatches or poor evenness of coverage. Scores from 0 to 4 (blue color in the heatmap) typically mean that we have very little statistical evidence (very few reads) to claim presence of a microbe in a sample.
To visually examine the eight quality metrics:
- Deamination profile
- Evenness of coverage
- Edit distance (amount of mismatches) for all reads
- Edit distance (amount of mismatches) for damaged reads
- Read length distribution
- PMD scores distribution
- Number of assigned reads (depth of coverage)
- Average nucleotide identity (ANI)
Corresponding to the numbers and colors of the heatmap, these plots can be found in results/AUTHENTICATION/sampleID/taxID/authentic_<Sample>_<sampleID>.trimmed.rma6_<TaxID>_<taxID>.pdf for each sample sampleID and each authenticated microbe taxID. An example of such quality metrics is shown below in Figure 17.6.
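To list every authentication summary PDF produced for our run (a convenience command, not part of aMeta itself), we can use find:

find results/AUTHENTICATION -name "authentic_*.pdf"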
In our test data, what score does the sample 'foo' get for the hit against Yersinia pestis? Is this a good score?
Inspect the results AUTHENTICATION/xxx/authentic_Sample_foo_*.pdf file. What could have contributed to this particular score?
Hint: Check Supplementary File 2, section S5 of (Pochon et al. 2022) for some hints.
The sample foo gets a score of 4. This is a low score, and indicates that aMeta is not very confident that this is a true hit. The metrics that contribute to this score are:
- Edit distance all reads (+1)
- Deamination plot (+2)
- Reads mapped with identity (+1)
17.4.5 Clean up
Before continuing onto the next section of this chapter, we will need to remove the output files, and deactivate from the conda environment.
rm -r results/ log/
## You can also optionally remove the conda environments if we are running out of space
# rm -r .snakemake/ .test/.snakemake
conda deactivate
17.5 What is nf-core/mag?
nf-core/mag (Krakau et al. 2022) is another Nextflow best-practice pipeline, for the de novo assembly, binning, and annotation of metagenomes (Figure 17.7). While it was originally designed for modern metagenomes (with optional support for co-assembly with Nanopore long reads), it has since been updated to include specific functionality and parameters for ancient metagenomes too.
nf-core/mag covers the same major steps as we would have run if we followed the chapter on De Novo assembly.
Firstly, as with nf-core/eager, nf-core/mag does some initial sequencing QC and cleanup with fastp (Chen et al. 2018) or AdapterRemoval (Schubert, Lindgreen, and Orlando 2016), running FastQC (Andrews 2010) before and after this step to make sure that adapters and overly short reads have been removed. It can then optionally do basic taxonomic classification of raw reads with Kraken2 (Wood, Lu, and Langmead 2019) or Centrifuge (Kim et al. 2016). The processed reads then undergo assembly with MEGAHIT (Li et al. 2015), metaSPAdes (Nurk et al. 2017), or SPAdeshybrid (Antipov et al. 2016) (the latter being for combining short and long reads), with contig annotation by Prodigal (Hyatt et al. 2010) for gene prediction, and assembly statistics with QUAST. A unique feature of nf-core/mag over other de novo assembly pipelines is an aDNA validation and correction step: contigs undergo 'reference-free' damage profiling using pyDamage (Borry et al. 2021), and then any damaged bases that were mistakenly incorporated by the assemblers are corrected using freebayes (Garrison and Marth 2012) and BCFTools. Contigs are then optionally grouped into genomic 'bins' with a range of metagenomic binners (MetaBAT2 (Kang et al. 2019), MaxBin2 (Wu, Simmons, and Singer 2015), and CONCOCT (Alneberg et al. 2014)), with optional binning refinement with DAS Tool (Sieber et al. 2018). Bins are then re-annotated with Prokka (Seemann 2014) and taxonomically classified with either CAT (Meijenfeldt et al. 2019) or GTDB-Tk (Chaumeil et al. 2022). Evaluation of the resulting bins is carried out with BUSCO (Simão et al. 2015), CheckM (Parks et al. 2015), GUNC (Orakov et al. 2021) and/or QUAST (Gurevich et al. 2013) to assess bin completeness. And, as with all nf-core pipelines, all the results are wrapped up into a MultiQC report. Detailed documentation can be seen on the nf-core/mag parameters (https://nf-co.re/mag/2.3.2/parameters) and output (https://nf-co.re/mag/2.3.2/docs/output) pages.
Why do we need ancient DNA specific steps in a de novo assembly pipeline?
Firstly we need a method of validating that contigs are from an ancient genome (in nf-core/mag, damage patterns are generated by pyDamage).
Secondly, we need to correct for falsely included damaged bases in contigs in some contexts. As previously stated in an earlier chapter by Alex Hübner:
In our group we realised that the gene sequences that were annotated on the contigs of MEGAHIT tended to have a higher number of nonsense mutations compared to the known variants of the genes. After manual inspection, we observed that many of these mutations appeared because MEGAHIT chose the damage-derived T allele over the C allele or the damage-derived A allele over a G allele (see Klapper et al. 2023, fig. S1).
To overcome this issue, my colleagues Maxime Borry, James Fellows Yates and I developed a strategy to replace these damage-derived alleles in the contig sequences. This approach consists of aligning the short-read data against the contigs, performing genotyping along them, and finally replacing all alleles for which we have strong support for an allele that is different from the one selected by MEGAHIT.
See the De novo assembly chapter for full context.
17.5.1 Running nf-core/mag
First re-activate the chapter’s conda environment.
conda activate ancient-metagenomic-pipelines
In this tutorial, we will use the same data and aim to replicate the steps taken in the De Novo assembly chapter. It is important to note that de novo assembly is generally very computationally intensive. In many cases it will not be possible to run a de novo assembly on a standard laptop; therefore some of the parameters used here have been tweaked to make the pipeline runnable on powerful laptop-level machines.
This tutorial will re-use the data from the de novo assembly chapter. If we have not run that practical, or have deleted the data, please follow the instructions at the top of the de novo assembly page to download the denovo-assembly.tar.gz file and unpack it into the denovo-assembly directory.
Once we've finished, make sure to change back into this chapter's directory with the following.
cd /<path>/<to>/ancient-metagenomic-pipelines
The manual steps of the previous chapter included:
- Read clean-up with fastp
- Assembly with MEGAHIT
- Ancient DNA assessment and correction with freebayes and pyDamage
- Binning using MetaBAT2 and MaxBin2 (originally via the MetaWRAP pipeline (Uritskiy, DiRuggiero, and Taylor 2018))
- Bin assessment with CheckM
- Contig taxonomic classification with MMSeqs2 (Steinegger and Söding 2017) via the GTDB database
- Genome annotation with Prokka
So, what command would we use to recreate these?
## 🛑 Don't run this yet! 🛑
nextflow run nf-core/mag -r 2.3.2 \
-profile test,docker \
--input '../../denovo-assembly/seqdata_files/*{R1,R2}.fastq.gz' \
--outdir ./results \
--ancient_dna --binning_map_mode own \
--binqc_tool checkm --checkm_db data/checkm_data_2015_01_16/ \
--kraken2_db false --skip_krona --skip_spades --skip_spadeshybrid --skip_maxbin2 --skip_concoct --skip_prodigal --gtdb false --busco_reference false --centrifuge_db false
In fact, nf-core/mag will do most of the steps for us! In this case, we mostly just need to deactivate most of the additional steps (the last line). Otherwise, as with nf-core/eager, we have done the following:
1. Tell Nextflow to run the nf-core/mag pipeline with version 2.3.2
2. Specify which computing and software environment to use with -profile
  - In this case we are running locally, so we don't specify a computing environment (such as a configuration for an institutional HPC)
  - We use docker as our container engine, which downloads all the software and specific versions needed for nf-core/mag in immutable 'containers', to ensure nothing gets broken and everything is as nf-core/mag expects
3. Provide the various paths to the input files - the raw FASTQ files and the output directory
4. Specify we want to run the aDNA mode, with mapping of reads back to the original contigs rather than for co-assembly
The aDNA mode of nf-core/mag only supports mapping reads back to the original contigs. The other mapping modes are when doing co-assembly, where we assume there are similar organisms across all our metagenomes, and wish to map reads from all samples in a group to a single sample’s contig (for example). Doing this on aDNA reads would risk making false positive damage patterns, as the damaged reads may derive from reads from a different sample with damage, that are otherwise not present in the sample used for generating the given contig.
5. Specify we want to generate completeness statistics with CheckM (rather than BUSCO), with the associated pre-downloaded database
Most nf-core pipelines will actually download reference databases, build reference indices for alignment, and in some cases download reference genomes for us, if we do not specify them as pipeline input. These download and reference-building steps are often very time consuming, so it's recommended that once we've let the pipeline download a database once, we move the files somewhere safe so we can re-use them in subsequent pipeline runs.
6. Skip a few steps that are either 'nice to have' (e.g. read taxonomic classification), or require large computational resources (e.g. metaSPAdes, CONCOCT or BUSCO)
The steps we are skipping are: host/artefact removal (Bowtie2 removal of phiX), taxonomic classification of reads (Centrifuge, Kraken2, Krona), the additional assemblers (metaSPAdes and SPAdeshybrid, as these require very large computational resources), additional binners (MaxBin2 and CONCOCT; while they are used in the de novo assembly chapter, they take a long time to run), raw contig annotation (with Prodigal, as we do this at the bin level), and contig taxonomic classification (with GTDB and BUSCO, as they require very large databases).
With this we are almost ready for running the pipeline!
17.5.2 Configuring Nextflow pipelines
Before we execute the command, however, we need to again consider the resource requirements. As described above, de novo assembly in particular (but also many other bioinformatic analyses) requires a lot of computational resources, depending on the type of data and analyses we wish to run.
For most pipelines we must tweak the resource requirements to ensure a) they will fit on our cluster, and b) they will run optimally so we don't under- or over-use our computational resources, where the latter will make either ourselves (the run takes too long) or our cluster/server system administrators (it blocks other users) unhappy!
While nf-core pipelines all have ‘reasonable default’ settings for memory, CPU, and time limits, they will not work in all contexts. Here we will give a brief example of how to tweak the parameters of the pipeline steps so that they can run on our compute node or laptop.
For Nextflow, we must make a special 'config' file that defines the names of each step of the pipeline we wish to tweak and the corresponding resources.

For this nf-core/mag run, we will need to open an empty file called custom.config in our text editor, and copy and paste into it the contents of the next code block.
process {
    withName: FASTP {
        cpus = 8
    }
    withName: BOWTIE2_PHIX_REMOVAL_ALIGN {
        cpus = 8
    }
    withName: BOWTIE2_ASSEMBLY_ALIGN {
        cpus = 8
    }
    withName: PYDAMAGE_ANALYZE {
        cpus = 8
    }
    withName: PROKKA {
        cpus = 4
    }
    withName: MEGAHIT {
        cpus = 8
    }
    withName: CHECKM_LINEAGEWF {
        cpus = 8
        memory = 24.GB
        ext.args = "--reduced_tree"
    }
}
Here we set the number of CPUs for most tools to a maximum of 8, and we limit the amount of memory for CheckM to 24 GB (down from the default (https://github.com/nf-core/mag/blob/66cf53aff834d2a254b78b94fc54cd656b8b7b57/conf/base.config#L37-L41) of 36 GB - which exceeds most laptop memory). We also give CheckM a specific flag (--reduced_tree) to use a smaller database.
We can save this custom.config file and supply it to the Nextflow command with the -c parameter. The good thing about the config file is that we can re-use it across runs on the same machine - so set it once, and never think about it again!
And with that, we can run our nf-core/mag pipeline!
As recommended in the nf-core/eager section above, start a screen session.
screen -R mag
Then execute the command.
nextflow run nf-core/mag -r 2.3.2 \
-profile test,docker \
--input '../../denovo-assembly/seqdata_files/*{R1,R2}.fastq.gz' \
--outdir ./results \
--ancient_dna --binning_map_mode own \
--binqc_tool checkm --checkm_db data/checkm_data_2015_01_16/ \
--kraken2_db false --centrifuge_db false --skip_krona \
--skip_spades --skip_spadeshybrid --skip_maxbin2 --skip_concoct --skip_prodigal \
--gtdb false --busco_reference false \
-c custom.config
Again, as with nf-core/eager, when the run starts we will see a long list of process lines with progress bars slowly appearing over time. If in a screen session, we can press ctrl + a and then [ to access a 'navigation' (called 'copy') mode, then use the arrow keys on our keyboard to scroll up and down. To quit this mode, just press q.
The pipeline run above should take around 20 minutes or so to complete.
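As the run takes a while, it can be convenient to detach from the screen session and come back to it later.
# Detach from the session by pressing ctrl + a followed by d; the pipeline keeps running.
# Later, reattach to check on progress with:
screen -r mag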
Configuration files don't need to be personal nor custom!
We can also generate an HPC- or server-specific profile which can be used with all nf-core pipelines (and even our own!).
See nf-co.re/configs (https://nf-co.re/configs) for all existing institutional configs.
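For example, if our institution already had a config listed there, we could select it with -profile instead of writing our own resource rules. The profile name myinstitute below is a made-up placeholder, and the remaining parameters would be the same as in the full command above.
# Hypothetical example: 'myinstitute' stands in for an institutional profile listed at nf-co.re/configs
nextflow run nf-core/mag -r 2.3.2 \
-profile myinstitute \
<other parameters as above>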
Why should you make sure to configure pipelines that use a workflow manager in the backend?
The primary reason for optimising the configuration of workflow managers is efficiency.
By correctly configuring your pipeline runs, you can ensure:
- A maximal number of jobs can run in parallel
- Runs are not killed (and repeatedly retried) if they request too many resources
- Can reuse files across runs, such as databases and software containers
amongst others.
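Related to re-using files across runs, Nextflow's -resume flag is also worth knowing: if a run is interrupted, or we change a single parameter, re-running the same command with -resume lets Nextflow reuse the cached results in its work/ directory rather than recomputing every completed step. A minimal sketch, where the elided parameters are the same as in the full command above:
nextflow run nf-core/mag -r 2.3.2 \
-profile test,docker \
<same parameters as before> \
-c custom.config \
-resume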
17.5.3 nf-core/mag output: results files
As with nf-core/eager, we will have a results directory containing many folders representing different stages of the pipeline, and a MultiQC report.
Relevant directories for evaluating our assemblies are as follows:
- QC_shortreads/
  - Contains various folders from the sequencing QC and raw read processing (e.g., FastQC, fastp/AdapterRemoval, host-removal results, etc.)
- Taxonomy/
  - If we have not already performed taxonomic classification of our reads, the output of the Kraken2 or Centrifuge integrated steps of nf-core/mag will be deposited here, including an optional Krona pie chart
  - If activated, the contig-level taxonomic assignments will be deposited here (e.g., from CAT or GTDB)
- Assembly/
  - This directory will contain one directory per chosen assembler, which will contain per-sample output files for each assembly
  - Within each of the assembler sub-directories, a QC folder will contain the QUAST output
- Ancient_DNA/
  - This folder will contain the output of pyDamage contig authentication, allowing us to select or filter for just putatively ancient contigs (a quick way to peek at these results is sketched after this list)
- variant_calling/
  - This directory will contain the 'damage-corrected' contigs that will be used in downstream steps of the pipeline, if the aDNA mode is turned on
- GenomeBinning/
  - This directory will contain all the output from each of the selected contig binning tools. As well as the binned FASTAs, we also have read-depth statistics from when input reads are re-mapped back to the contigs
  - The directory will also contain the output from the DASTool binning refinement, if turned on
  - In the QC sub-directory of this folder, we will find the QUAST results on the genome bins (rather than raw contigs as above), as well as BUSCO/CheckM/GUNC MAG completeness estimates
- Prodigal/Prokka/
  - These directories will contain the genome annotation output of the raw contigs (Prodigal) or of the bins (Prokka)
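To have a quick look at the pyDamage results referred to above, we can locate and view the per-contig tables on the command line. The exact sub-directory layout and file name can differ between nf-core/mag versions, so treat the path below as a placeholder and adjust it to whatever the find command reveals.
# Locate the per-contig pyDamage result tables produced by the aDNA mode
find results/Ancient_DNA/ -name "*.csv"
# View one of the tables (path is a placeholder); each row reports the damage statistics of a single contig
column -t -s ',' results/Ancient_DNA/<PATH>/<TO>/pydamage_results.csv | less -S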
17.5.4 nf-core/mag output: run report
For MultiQC, many of the questions asked in the preprocessing section of the nf-core/eager results interpretation will also apply here.
Other than the QUAST results in the Assembly/*/QC/ or GenomeBinning/*/QC/ directories, the main table used for evaluating the output files is the bin_summary.tsv table in the GenomeBinning/ directory.
In this file we would typically want to assess the following columns (a quick command-line filter of this table is sketched after the list):
- % Complete * columns - the higher the percentage of expected domain-specific genes found, the better the quality of the resulting bin, generally
- # contigs (>= * bp) columns - these report the number of contigs at different length thresholds; having fewer short contigs and more long contigs again suggests a better assembly. This can also be evaluated with the N50 or L75 columns (as described in the De novo assembly chapter)
- fastani and closest_placement_taxonomy - these can tell us if our particular bin has a genome very similar to existing species in taxonomic databases
- warnings - for specific comments on each assignment
17.6 (Optional) clean-up
Let’s clean up our working directory by removing all the data and output from this chapter.
The command below will remove the /<PATH>/<TO>/ancient-metagenomic-pipelines directory as well as all of its contents.
Always be VERY careful when using rm -r. Check 3x that the path we are specifying is exactly what we want to delete, and nothing more, before pressing ENTER!
rm -rf /<PATH>/<TO>/ancient-metagenomic-pipelines*
Once deleted we can move elsewhere (e.g., cd ~).
We can also get out of the conda environment with the following.
conda deactivate
To delete the conda environment we can use conda remove.
conda remove --name ancient-metagenomic-pipelines --all -y
17.7 Questions to think about
- Why is it important to use a pipeline for genomic analysis of ancient data?
- How can the design of pipelines such as nf-core/eager help researchers comply with the FAIR principles for management of scientific data?
- What metrics do we use to evaluate the success/failure of ancient DNA sequencing experiments? How can these measures be evaluated when using nf-core/eager for data preprocessing and analysis?