Assembling metagenomes, one community at a time

Van der Walt, Andries Johannes; Van Goethem, Marc W.; Ramond, Jean-Baptiste; Makhalanyane, Thulani Peter; Reva, Oleg N.; Cowan, Don A.

URI: http://hdl.handle.net/2263/61779

Date: 2017-07-10

Abstract:

BACKGROUND: Metagenomics allows unprecedented access to uncultured environmental microorganisms. The analysis of metagenomic sequences facilitates gene prediction and annotation, and enables the assembly of draft genomes, including those of uncultured members of a community. However, while several platforms have been developed for this critical step, there is currently no clear framework for the assembly of metagenomic sequence data. RESULTS: To assist with the selection of an appropriate metagenome assembler, we evaluated the capabilities of nine prominent assembly tools on nine publicly available environmental metagenomes, as well as three simulated datasets. Overall, we found that SPAdes provided the largest contigs and highest N50 values across six of the nine environmental datasets, followed by MEGAHIT and metaSPAdes. MEGAHIT emerged as a computationally inexpensive alternative to SPAdes, assembling the most complex dataset using less than 500 GB of RAM and within 10 hours. CONCLUSIONS: We found that assembler choice ultimately depends on the scientific question, the available resources and the bioinformatic competence of the researcher. We provide a concise workflow for the selection of the best assembly tool.
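The abstract ranks assemblers partly by N50. As a minimal sketch, assuming the conventional definition (the contig length at which contigs of that length or longer account for at least half of the total assembly span), the statistic can be computed from a list of contig lengths as follows. The function name `n50` and the example lengths are illustrative only and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Assumes the conventional definition: the largest length L such that
    contigs of length >= L together cover at least half of the total span.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    half_span = sum(lengths) / 2
    running_total = 0
    # Walk from the longest contig down until half the span is covered.
    for length in lengths:
        running_total += length
        if running_total >= half_span:
            return length


# Hypothetical contig lengths (in bp), for illustration only.
example_lengths = [100, 50, 30, 20]
print(n50(example_lengths))
```

Because N50 depends only on the length distribution, it says nothing about whether contigs are correctly assembled, which is one reason it is usually reported alongside other statistics such as total span and mapping rate.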

Description:

Additional file 1:
Table S1. Attributes of de novo assemblers used in this study. Included in this table are the versions of each assembler used in this study, along with the release date of each version. We provide a link to each assembler's website accompanied by its reference and number of citations. We gauge ease of use by providing the programming language and MPI compatibility of each tool as well as assessing the completeness of each tool's available documentation.
Table S2. Characteristics of the metagenomic datasets used in this study. Three metagenomes from each of three distinct environments (soil, aquatic and human gut) were selected, and we provide accession numbers, sequencing platforms used and basic sequence characteristics (pre- and post-filtering) of each metagenome.
Table S3. Assembly statistics for the assembled aquatic metagenomes.
Table S4. Assembly statistics for the assembled soil metagenomes.
Table S5. Assembly statistics for the assembled human gut metagenomes.
Table S6. Assembly statistics for the synthetic metagenomes.
Figure S1. Nonpareil estimates of sequence coverage (redundancy) for the three synthetic metagenomes studied.
Figure S2. Computational requirements for the Tara Ocean metagenome. A) Total assembly span in relation to wall time required. B) Total assembly span in relation to peak memory usage.
Figure S3. Correlation between assembly span and mapping rate. The exponential trendline indicates a very strong positive correlation between the amount of data utilized and the size of the generated assembly (R² = 0.83).
