Abstract:
BACKGROUND: Metagenomics allows unprecedented access to uncultured environmental microorganisms. The
analysis of metagenomic sequences facilitates gene prediction and annotation, and enables the assembly of draft
genomes, including those of uncultured members of a community. However, while several platforms have been developed
for this critical step, there is currently no clear framework for the assembly of metagenomic sequence data.
RESULTS: To assist with selection of an appropriate metagenome assembler, we evaluated the capabilities of nine
prominent assembly tools on nine publicly available environmental metagenomes, as well as three simulated
datasets. Overall, we found that SPAdes provided the largest contigs and highest N50 values across 6 of the 9
environmental datasets, followed by MEGAHIT and metaSPAdes. MEGAHIT emerged as a computationally
inexpensive alternative to SPAdes, assembling the most complex dataset using less than 500 GB of RAM
and within 10 hours.
CONCLUSIONS: We found that assembler choice ultimately depends on the scientific question, the available
resources and the bioinformatic competence of the researcher. We provide a concise workflow for the
selection of the best assembly tool.
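The N50 values used above to rank assemblers can be illustrated with a minimal sketch. It assumes the conventional definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly span). The function names and the example contig lengths are hypothetical and are not taken from the study.

```python
def assembly_span(contig_lengths):
    """Total assembly span: the summed length of all contigs."""
    return sum(contig_lengths)


def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly span.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_span = assembly_span(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_span:
            return length
    return 0


# Hypothetical contig lengths (bp), for illustration only.
example_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(assembly_span(example_contigs), n50(example_contigs))
```

Because N50 depends only on the distribution of contig lengths, a high value does not by itself indicate a correct assembly, which is one reason it is usually reported together with other statistics such as assembly span and read mapping rate.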
Description:
Additional file 1: Table S1. Attributes of de novo assemblers used in
this study. Included in this table are the versions of each assembler used
in this study, along with the release date of each version. We provide a
link to each assembler's website accompanied by its reference and
number of citations. We gauge ease of use by providing the
programming language and MPI compatibility of each tool as well as
assessing the completeness of each tool's available documentation.
Table S2. Characteristics of the metagenomic datasets used in this study.
Three metagenomes from each of three distinct environments (soil, aquatic and
human gut) were selected, and we provide accession numbers,
sequencing platforms used and basic sequence characteristics (pre- and
post-filtering) of each metagenome. Table S3. Assembly statistics for the
assembled aquatic metagenomes. Table S4. Assembly statistics for the
assembled soil metagenomes. Table S5. Assembly statistics for the assembled
human gut metagenomes. Table S6. Assembly statistics for the
synthetic metagenomes. Figure S1. Nonpareil estimates of sequence
coverage (redundancy) for the 3 synthetic metagenomes studied. Figure
S2. Computational requirements for the Tara Ocean metagenome. A)
Total assembly span in relation to wall time required. B) Total assembly
span in relation to peak memory usage. Figure S3. Correlation between
assembly span and mapping rate. The exponential trendline indicates a
very strong positive correlation between the amount of data utilized and
the size of the generated assembly (R² = 0.83).