Analyzing the Human Microbiome: A "How To" Guide for Physicians

Andrea D Tyler, PhD; Michelle I Smith, PhD; Mark S Silverberg, MD, PhD


Am J Gastroenterol. 2014;109(7):983-993. 

Human Microbiome Analysis Techniques—From Sample Collection to Interpretation of Data

Before the development and application of sequence-based molecular tools for microbiome analysis, culture-based methods were used. Such methods rely on the ability to grow viable organisms outside their natural habitat, which can be difficult as many species and strains that are well adapted to life in the human gut are not viable in in vitro conditions. In the past, this difficulty has resulted in underestimation of the complexity of the human gut microbial ecosystem.[19] However, there are several advantages of being able to culture organisms, including the ability to directly obtain information regarding bacterial metabolism and growth requirements, and the potential future use of cultured strains in experimentally evaluating the interaction between microbes and host or in developing clinically useful probiotics.[20,21] To address the limitations of this approach, culture-independent techniques have been applied to the analysis of complex microbial communities, using bacterial DNA sequences as a proxy for estimation of organism identity, relative abundance, and function. Although these new technologies have led to a recent boom in microbiome studies and increased our knowledge and understanding of the microbiome, the challenges and sources of error of sequence-based analyses are important to understand in order to accurately interpret results.

Figure 2 highlights the major steps involved in next-generation microbiome analyses along with considerations and potential limitations.[22,23] It is important to note that, from sample collection to statistical analysis of results, even small differences in experimental technique at any step may affect the observed composition of the human microbiome, often producing differing results for seemingly similar studies.

Figure 2.

Outline of the steps, including considerations and challenges involved in sequence-based microbiome analysis.

Sample Collection

At the time of collection, the type and number of samples to be obtained depend on the specific experimental question to be answered. In evaluating the gut microbiome, the most common samples used are stool and endoscopic biopsies. Although obtaining stool is noninvasive and provides a great deal of sample material, its microbial profile is substantially different from the tissue-associated microbial profile, which itself can vary greatly along the length of the gastrointestinal tract.[24] Biopsies, however, are more difficult to obtain, and the microbiome may be altered by the requirement that patients take laxatives before endoscopy.[22]


Storage of samples at −80 °C vs. immediate extraction of DNA from fresh samples can also influence the structure of the microbiome,[23] as can the use of preservatives such as RNAlater (QIAGEN, Valencia, CA).[25] Although fresh or immediately frozen samples provide the highest yields of bacterial DNA, the use of RNAlater may be of benefit in situations in which rapid access to refrigeration is not possible, in order to preserve DNA quality and prevent alterations in microbial community structure from occurring before extraction.


Once samples have been acquired, total DNA is extracted, typically using a combination of mechanical and enzymatic disruption. The relative efficiency of disrupting organisms is highly dependent on the structure of the bacterial cell, and selection of a specific technique may lead to alterations in apparent community composition.[26–29] Mixed samples contain both organisms that are more easily disrupted, such as Gram-negative species, and those that are more difficult to lyse, including Gram-positive species, mycobacteria, and spores. An extraction technique that is too harsh will shear the DNA of easily lysed organisms, and one that is too mild will prevent difficult-to-lyse organisms from having their DNA extracted.[27–29] Although the advantages of optimizing extraction protocols to provide increasingly accurate views of bacterial community structure are obvious, no methods are currently capable of providing a truly unbiased DNA sample.[30,31]

Library Preparation

To accurately assign sequences to taxonomic groups, comparison of genomic regions between experimental samples and reference data is required. Although the bacterial genome is relatively plastic, certain genetic markers are more stable, and can therefore be considered candidates for phylogenetic analysis. Most commonly used among these is the 16S ribosomal RNA (rRNA) gene, which is present in all Bacteria and Archaea. This gene is ~1,550-bp long and is composed of nine regions of high variability, termed hypervariable regions, flanked by more highly conserved regions (Figure 3). Within the hypervariable regions, sequence differences characterizing certain organisms allow for the taxonomic identification of the bacteria present in a sample. Advantages and disadvantages of using this gene as a marker are described in Table 1. Alternative marker genes have been proposed, including 23S rRNA, cpn60, and rpoB, to address the limitations presented by 16S rRNA-based analyses.[32–35] However, the utility of these markers is limited by the relative incompleteness of reference database collections based on these markers compared with those currently available for use in 16S rRNA-based analyses. As such, 16S rRNA-based sequencing remains the gold standard for sequence-based bacterial analyses.

Figure 3.

Structure of 16S rRNA used in microbiome analyses. (a) Two-dimensional image of 16S rRNA including location of variable regions. (b) Linear model of genomic region of the 16S rRNA gene. Conserved regions are depicted in gray, with hypervariable regions labeled in black. Commonly used regions for analysis are identified, with sites of primer attachment denoted with arrows and amplified regions highlighted with solid bars.

Current next-generation sequencing technologies are limited by the length of sequences that they are able to provide. As such, it is not possible to use the entire length of the 16S rRNA gene, and specific region(s) must be selected to target for analysis. In general, longer sequences are more easily and accurately assigned to taxonomic outcome groups.[36] 16S rRNA gene sequencing methods take advantage of the fact that hypervariable regions are flanked by conserved regions that can be used as binding sites for universal primers (Figure 3). This allows for amplification and sequencing of the hypervariable regions of many different organisms within a sample. Sequence differences between organisms are then used to identify bacteria in samples, although the discriminatory power for differentiating between certain organisms may be limited. For example, certain groups within the Firmicutes, such as organisms related to the Clostridia group XIVa, are notoriously difficult to identify to the genus level based on 16S data. Furthermore, different hypervariable regions have different biases and levels of resolution, which makes it difficult to compare results of experiments evaluating different 16S regions.[36,37] Specific groups such as the Bifidobacteria are consistently under-represented regardless of the region investigated, usually because of universal primer mismatches.[38] Currently, the most commonly used segments are the V1–V3, V4, and V4–V5 regions, each of which is able to provide genus-level sequence resolution, with V1–V3 shown by some to provide slightly more accurate assignments (Figure 3).[39]

To facilitate identification of organisms, several reference databases such as Ribosomal Database Project (RDP), Greengenes, and Silva have been developed, which contain 16S rRNA sequence data on millions of organisms. Although highly curated, even within these references, there is disagreement as to the assignment of several taxa that can result in different sequence assignments. Such considerations have prompted some to evaluate the composite genetic material present in the microbiome, through metagenomic sequencing. However, the added sequencing cost and less well-developed resources for downstream processing make metagenomic analyses significantly more resource-intensive and therefore infeasible for many projects.


Several next-generation sequencers have been used for 16S rRNA sequencing. However, the majority of studies make use of either 454 pyrosequencing (Roche, Indianapolis, IN) or Illumina sequencing-by-synthesis technologies (San Diego, CA). Both 454 pyrosequencing and Illumina offer platforms with different levels of coverage (Box 1) and sequence lengths (base-pair length of the sequence), each using different chemistries to provide sequence information. Although Illumina is able to provide more coverage at a lower cost, 454 pyrosequencing is capable of generating longer sequences, often corresponding to increased taxonomic resolution. Until recently, 454 technology was primarily used for 16S rRNA gene sequencing because of its greater read length; however, continued advances in the Illumina technology allowing longer read lengths and significantly higher read numbers per sequencing run have resulted in a decrease in sequencing cost per sample, and have made this a widely popular approach.

Sequencing error rates and common error types differ between Illumina and 454 pyrosequencing, with Illumina sequencing more prone to mismatches, whereas 454 technology typically has higher rates of insertions and deletions.[40,41] For both technologies, longer sequences are, on average, of higher quality. However, in both cases extended sequencing that approaches or exceeds the maximal sequence length possible for a technology results in a marked rise in sequencing error rate.[37,42] Given the run- and sequence-specific nature of error rates, in order to minimize propagation of errors through workflows, it is important to include standardized control sequences and error-corrected base callers in each run to allow estimation of the true error rate. Furthermore, downstream sequence processing that specifically targets and culls sequences with a higher probability of containing erroneous bases is useful.
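As a rough illustration of how control sequences allow a run-specific error rate to be estimated, the following Python sketch counts mismatches between reads of a known control and its reference. The function name and the example sequences are hypothetical, and the sketch assumes the reads are already aligned to the reference and of equal length, so only substitution errors are counted; insertions and deletions, which matter for 454 data, would require a proper alignment step.

```python
def substitution_error_rate(reference, control_reads):
    """Fraction of mismatched bases across reads of a known control sequence.

    Assumption: every read is already aligned to the reference and has the
    same length, so only substitutions (not insertions/deletions) are counted.
    """
    mismatches = 0
    total_bases = 0
    for read in control_reads:
        if len(read) != len(reference):
            raise ValueError("read and reference must have equal length")
        mismatches += sum(1 for ref_base, read_base in zip(reference, read)
                          if ref_base != read_base)
        total_bases += len(reference)
    return mismatches / total_bases if total_bases else 0.0


# Hypothetical control: three reads of a 10-bp reference, two with one error each.
reference = "ACGTACGTAC"
control_reads = ["ACGTACGTAC", "ACGTACGTAA", "ACCTACGTAC"]
rate = substitution_error_rate(reference, control_reads)
print(f"estimated substitution error rate: {rate:.3f}")
```

In practice the estimate would be made per run, because error rates are run-specific.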

Quality Filtering

Following sequencing, quality filtering removes sequences with low base quality scores, short reads, and those with features suggestive of higher error probability, which are often specifically related to the sequencing technology (e.g., extensive homopolymeric stretches in 454 pyrosequencing). Rather than simply removing sequences, alternative solutions include removing portions of sequences with overall low quality, thus reducing the amount of information that is lost at this stage.[43] Providing more of a challenge for ensuring sequence quality is the ubiquity of chimeric sequences. Chimeras are an artifact of PCR amplification in which different parts of a sequence arise from different parent strands. Such sequences typically result from an incomplete sequence dissociating from its parent and acting as a primer for another different sequence. As these sequences are not a reflection of poor sequencing quality, they are more difficult to detect than errors resulting from low-quality reads. Chimeras are estimated to make up anywhere from 5 to 45% of sequences in a run and can be found in many 16S databases, thus representing a huge source of error for next-generation sequencing approaches.[44,45] Several algorithms can be used to detect and remove ~90% of chimeric sequences and substantially improve quality.[46]
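The filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular pipeline: the function names are hypothetical, the thresholds (minimum length, mean quality, homopolymer length) are arbitrary values chosen for the example, and chimera detection is not attempted. Reads are represented as a sequence string paired with a list of per-base Phred quality scores.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def trim_low_quality_tail(seq, quals, min_q=20):
    """Trim bases from the 3' end until a base with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def quality_filter(reads, min_len=5, min_mean_q=25, max_homopolymer=8):
    """Trim each read, then drop reads that are short, low quality overall,
    or contain a long homopolymer. All thresholds are illustrative only."""
    kept = []
    for seq, quals in reads:
        seq, quals = trim_low_quality_tail(seq, quals)
        if len(seq) < min_len:
            continue  # too short after trimming
        if sum(quals) / len(quals) < min_mean_q:
            continue  # low mean base quality
        if longest_homopolymer(seq) > max_homopolymer:
            continue  # homopolymer stretch suggestive of error
        kept.append((seq, quals))
    return kept


reads = [
    ("ACGTACGTAC", [35] * 8 + [10, 8]),    # low-quality tail is trimmed, read kept
    ("AAAAAAAAAACG", [35] * 12),           # 10-base homopolymer, removed
    ("ACGTAC", [15, 18, 22, 21, 20, 20]),  # mean quality below threshold, removed
    ("ACG", [40, 40, 40]),                 # too short, removed
]
kept = quality_filter(reads)
print(kept)
```

Trimming before filtering retains the usable part of the first read rather than discarding it entirely, which reflects the point about limiting information loss.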

Sequence Identification

Once raw sequences have been quality trimmed, they can be assigned to taxonomic outcome groups (e.g., attributing a sequence to the genus Bacteroides) to generate more meaningful information for downstream analyses. The two main approaches include direct assignment of sequences to phylotypes and operational taxonomic unit (OTU)-based processing.[47] When sequences are assigned directly to a taxonomic group based on comparisons of sequences with known databases (phylotype method), assignment is highly contingent on the accuracy of the sequencing platform and reference database. Assignment problems may arise when reference databases are incomplete, as novel sequences detected in an experiment may not fit into known taxonomic lineages.[47] A widely used alternative approach seeks to group sequences on the basis of similarity into OTUs and conduct analyses on these groups. OTUs are artificial constructs generated by grouping together sequences at a desired level of similarity. Sequences that are 97% similar are usually considered to be the same species. Typically, a representative or average sequence from each OTU is then compared with a reference database to obtain more meaningful data regarding the identity of that OTU. However, given that OTUs are constructed independent of reference data, they may not correspond well with true biological units (true species), and sequences within a given group have the potential to correspond to multiple taxa. OTU generation may also inflate estimates of sample diversity (Box 1).[37,47] Both phylotype and OTU-based methods allow generation of abundance tables (called OTU tables when OTU-based methods are used), which can be used for analysis. To simplify sequence processing and analysis, several publicly available suites of tools (including dotur, mothur, abundantotu, and QIIME)[48–50] have been developed, offering iterations of many common algorithms for sequence processing, organism identification, and analysis. These programs have simplified the analysis of complex microbiome data; however, they must still be used with caution as the selection of an appropriate pipeline is necessary for later accurate interpretation of results.
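To make the idea of grouping sequences at 97% similarity concrete, the following Python sketch performs a simple greedy clustering: each sequence joins the first existing OTU whose representative it matches at or above the threshold, and otherwise founds a new OTU. This is an assumption-laden toy, not the algorithm used by mothur, QIIME, or the other tools named above; the function names are hypothetical, and identity is computed position by position on equal-length sequences, standing in for a real alignment-based distance.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions.

    Assumption: sequences are pre-aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def greedy_otus(sequences, threshold=0.97):
    """Assign each sequence to the first OTU representative it matches at
    >= threshold identity; otherwise start a new OTU with it."""
    representatives = []  # one representative sequence per OTU
    counts = []           # number of sequences assigned to each OTU
    for seq in sequences:
        for i, rep in enumerate(representatives):
            if identity(seq, rep) >= threshold:
                counts[i] += 1
                break
        else:
            representatives.append(seq)
            counts.append(1)
    return representatives, counts


# Hypothetical 100-bp reads: `near` differs from `base` at 2 positions (98%),
# `far` differs at 4 positions (96%) and therefore founds its own OTU.
base = "ACGT" * 25
near = "T" + base[1:50] + "A" + base[51:]
far = "TTTTT" + base[5:]
representatives, counts = greedy_otus([base, near, base, far])
print(counts)
```

The resulting counts form one column of an OTU table. Because the outcome of a greedy approach depends on the order in which sequences are processed, the choice of clustering method is itself a source of variation between pipelines.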

Initial next-generation sequencing studies of the human microbiome focused on high-level taxonomic assignments for analysis, providing information on whether a specific phylum or family was associated with a habitat, phenotypic outcome, or environmental factor in the human host. However, were a phylum-level analysis applied to our human example above, humans would be classified together with 43,000 other species including tunicates, ascidiacea (sea squirts), and amphibians to name just a few. Thus, although some important information is gathered from this knowledge, it is clear that much ambiguity remains as well.

Statistical Analysis

Several methods of analysis are available for microbiome data, each of which explores data in different ways. Many studies make use of traditional ecological indicators of community composition including estimates of alpha and beta diversity. Diversity estimates incorporate information regarding species richness (Box 1) and abundance (Box 1). Alpha diversity is a measure of the mean diversity within a sample or ecosystem and can be calculated by using one of several different algorithms including the Shannon or Simpson diversity index. Beta diversity, on the other hand, is typically considered to involve a comparison of diversities between ecosystems or samples, and for microbiome studies it is often displayed as a principal coordinate or component analysis (PCoA/PCA) plot.[51] In the case of IBD, numerous studies have demonstrated a reduction in microbial alpha diversity among samples acquired from individuals with active disease compared with those who are healthy.[52,53] Such measures provide a broad overview of community structure, but do not investigate specific organisms, and they should therefore only be used in conjunction with more informative analyses.
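For readers who wish to see how these indices are computed, the following Python sketch calculates the Shannon index (H = -Σ p ln p) and the Simpson index in its 1 - Σ p² form from a vector of taxon counts, together with one between-sample dissimilarity. The Bray-Curtis measure used for the between-sample comparison is an assumption made for this example; it is one commonly used input to PCoA plots but is not specified in the text. The count vectors are invented.

```python
import math


def shannon(counts):
    """Shannon diversity index, H = -sum(p * ln p), over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson diversity expressed as 1 - sum(p^2); higher means more diverse."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples (0 = identical, 1 = no overlap).

    Assumption: both vectors list the same taxa in the same order.
    """
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    return numerator / (sum(counts_a) + sum(counts_b))


# Two hypothetical samples of 100 reads spread over the same four taxa.
even_sample = [25, 25, 25, 25]   # all taxa equally abundant
skewed_sample = [97, 1, 1, 1]    # dominated by a single taxon

print(shannon(even_sample), shannon(skewed_sample))
print(simpson(even_sample), simpson(skewed_sample))
print(bray_curtis(even_sample, skewed_sample))
```

Both samples have the same richness (four taxa), yet the evenly distributed sample scores higher on both alpha diversity indices, which shows that these measures combine richness with abundance.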

Analyses that aim to determine whether individual organisms or OTUs are associated with outcome can be used to provide more detailed information regarding associations between environments or phenotypes and microbes. Microbiome data can be analyzed as a dichotomous variable, based on whether an organism is detected in an experiment, or as a semicontinuous variable based on relative abundance (Box 1). Dichotomous analyses are particularly useful in cases where lower sequencing coverage is achieved and are less likely to have significant results obscured by changes in abundance.[54] However, dichotomous analyses may be influenced by the depth of sequencing done in an experiment, making it difficult to make comparisons between sequencing runs or platforms. Relative abundance data are particularly useful in cases where large numbers of samples are available, coverage is high, and differences in relative taxon abundance within a community are thought to be important.[54] This analysis can also be problematic, however, as the lack of standardized internal controls in most studies makes data normalization between batches and experiments difficult. Each of these methods may provide different results as they answer slightly different questions, and it is often of benefit to conduct both types of analyses on a data set.
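The two ways of coding the data can be illustrated with a short Python sketch that converts the same raw counts into presence/absence calls and into relative abundances. The function names, taxon labels other than Bacteroides, and read counts are all hypothetical; the two samples are meant to represent the same community sequenced at different depths.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into proportions of the sample total."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def presence_absence(counts, min_reads=1):
    """Dichotomous coding: a taxon is 'present' if it has at least min_reads reads."""
    return {taxon: n >= min_reads for taxon, n in counts.items()}


# The same hypothetical community sequenced deeply (10,000 reads)
# and shallowly (100 reads).
deep = {"Bacteroides": 9000, "taxon_B": 990, "rare_taxon": 10}
shallow = {"Bacteroides": 90, "taxon_B": 10, "rare_taxon": 0}

print(relative_abundance(deep)["Bacteroides"],
      relative_abundance(shallow)["Bacteroides"])
print(presence_absence(deep)["rare_taxon"],
      presence_absence(shallow)["rare_taxon"])
```

The relative abundance of the dominant taxon is the same at both depths, whereas the rare taxon is called present only in the deeply sequenced sample, which is the depth dependence of dichotomous analyses noted above.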

Further challenges in analyzing the microbiome include the failure of data to conform to a normal distribution, conversion of raw organism or OTU counts to frequencies that are continuous, but bounded by 0 and 1, and the observation that many organisms in microbial communities are detected in only a few samples, resulting in zero inflation of data (Box 1). Such data features require additional data manipulation in order to use standard statistical tests such as analysis of variance. As such, several different methods of data transformation have been applied to microbiome data in order to circumvent these limitations.[55,56] Alternatively, exact and nonparametric statistics can be used on many different types of data, and do not make assumptions about data structure or rely on complex transformations for analysis.
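As one example of an exact, nonparametric approach, the following Python sketch runs an exact permutation test on the difference in mean relative abundance of a single taxon between two groups. It is offered as an illustration under stated assumptions rather than a recommended analysis: the function name and data are hypothetical, the test statistic (difference in means) is one choice among many, and full enumeration of all group splits is feasible only for small sample sizes.

```python
from itertools import combinations


def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    pooled_sum = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (pooled_sum - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    extreme = 0
    n_splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        n_splits += 1
        # Small tolerance guards against floating-point rounding in ties.
        if abs_mean_diff(sum(pooled[i] for i in indices)) >= observed - 1e-12:
            extreme += 1
    return extreme / n_splits


# Hypothetical relative abundances of one taxon; group_b is zero-inflated.
group_a = [0.30, 0.28, 0.35, 0.32, 0.31]
group_b = [0.00, 0.02, 0.00, 0.01, 0.00]
p_value = exact_permutation_test(group_a, group_b)
print(f"p = {p_value:.4f}")
```

No transformation of the data is needed, and the zeros in the second group pose no difficulty. For larger studies, a random sample of permutations is typically used in place of full enumeration.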