For example, the computational time for a dataset of 150,000 reads with an average read length of 100 bp is about 2 to 3 minutes on a laptop with 8 GB RAM and a dual-core 3.06 GHz CPU.

TAMER is also applied to two sets of actual metagenomic data. Archived metagenomic datasets are accessible from several sources, including the NCBI short read archive [22], CAMERA [23], and the MG-RAST server [24]. In this paper we analyze data from eight oral samples and two seawater samples. The eight oral samples, downloaded from the MG-RAST server, were examined in a human metagenome oral cavity study [25]. They represent different degrees of oral health, with two samples for each of four statuses: healthy controls (never with caries), treated for past caries, active caries, and cavities. In total there are about 2 million reads. The smallest sample has about 70,000 reads and the largest about 465,000 reads. The average read length is 425 ± 117 bp. The two seawater datasets were retrieved from the MEGAN database (http://www.megan-db.org/megan-db/) and were studied in [20]. Each dataset consists of 10,000 reads, and both are part of the Sargasso Sea samples studied in [26]. The reads are about 800 bp long in both seawater datasets.

Results

Results for Simulation Study

Using the same abundance setup as in [20], 150,000 reads with an average length of 100 bp are generated for each of the three complexity datasets, simLC, simMC, and simHC. For the simSC dataset, 100 genomes with the same abundance are randomly selected and 150,000 reads are generated. The characteristics of the datasets are listed in Table S1. For this simulation study, we compare TAMER with MEGAN. The percentages of reads correctly (TP) and incorrectly (FP) assigned at different taxonomy ranks are reported in Table 1. Here TP = (number of correctly assigned reads / total number of reads) × 100, and FP = (number of incorrectly assigned reads / total number of reads) × 100.
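As a minimal sketch of the TP and FP definitions above, the following Python snippet computes both percentages from raw read counts. The function name `assignment_rates` and its argument names are illustrative assumptions and are not part of TAMER or MEGAN; the counts in the usage lines are the simLC species-level figures quoted in this section.

```python
def assignment_rates(n_correct: int, n_incorrect: int, n_total: int) -> tuple[float, float]:
    """Return (TP, FP) as percentages of the total number of reads.

    Hypothetical helper: mirrors the TP/FP definitions in the text,
    not any function shipped with TAMER or MEGAN.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct < 0 or n_incorrect < 0:
        raise ValueError("read counts must be non-negative")
    if n_correct + n_incorrect > n_total:
        raise ValueError("assigned reads cannot exceed the total number of reads")
    tp = n_correct / n_total * 100
    fp = n_incorrect / n_total * 100
    return tp, fp


# simLC, rank Species: 146,880 correct and 30 incorrect out of 150,000 reads.
tp, fp = assignment_rates(146_880, 30, 150_000)

# Reads counted as neither correct nor incorrect (for example, reads with no
# hit in the reference database) make up the remainder.
remainder = 100 - tp - fp

print(f"TP = {tp:.2f}, FP = {fp:.2f}, remainder = {remainder:.2f}")
```

Because both rates are divided by the total number of reads rather than by the number of assigned reads, TP and FP need not sum to 100.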
For instance, for the simLC data, 146,880 reads are assigned to the corresponding species correctly and 30 reads are assigned incorrectly, so TP = 146,880 / 150,000 × 100 = 97.92 and FP = 30 / 150,000 × 100 = 0.02. Note that TP and FP do not sum to 100 because some reads have no hits in the reference database.

The simLC dataset consists of 25,926 reads generated from E. coli str. K-12 substr. MG1655 and 124,074 reads generated from Methanoculleus marisnigri JR1. In total there are about 160 million base pairs, and the simulated error rate is 0.027. The probability of observing a mismatched base pair, as estimated by TAMER, is 0.025. Using MegaBLAST, hits are found for 97.94% of the 150,000 reads in 4,407 unique taxa. At rank Species, TAMER assigns 25,221 reads to the species Escherichia coli, which is close to the true value of 25,926 reads, while MEGAN assigns only 5,583 reads to this taxon (Figure 1 (a)). At rank Genus, MEGAN

Simulation Studies

Due to the complexity of metagenomic data, simulation studies with verifiable results are crucial for benchmarking TAMER and comparing it with other existing methods. For the analysis by MEGAN, the default parameters are used.

Simulation study 1. MetaSim [20], a sequencing simulator for genomics and metagenomics, is used to generate sequence reads for the simulation studies. Four benchmark simulation datasets with low (2 genomes, simLC), medium (9 genomes, simMC), high (11 genomes, simHC), and super high (100 genomes, simSC) complexity are used. The first three setups were designed by [20] in conjunction with.
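As a rough check on the simLC species-level counts reported above, the sketch below expresses the number of reads each method assigns to Escherichia coli as a percentage of the true simulated count. The helper name `recovery_percent` is a hypothetical label introduced here for illustration. The ratio compares counts only, so it does not verify that each individual read was assigned correctly.

```python
# Counts quoted in the text for the simLC dataset at rank Species.
TRUE_ECOLI_READS = 25_926
ASSIGNED_ECOLI_READS = {"TAMER": 25_221, "MEGAN": 5_583}


def recovery_percent(assigned_reads: int, true_reads: int) -> float:
    """Assigned read count as a percentage of the true read count.

    Hypothetical helper for illustration; a count ratio, not a per-read
    accuracy measure.
    """
    if true_reads <= 0:
        raise ValueError("true_reads must be positive")
    return assigned_reads / true_reads * 100


recovery = {
    method: recovery_percent(count, TRUE_ECOLI_READS)
    for method, count in ASSIGNED_ECOLI_READS.items()
}

for method, pct in recovery.items():
    print(f"{method}: {pct:.1f}% of the true E. coli read count")
```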
