Next-generation Sequencing

A Powerful Tool for the Discovery of Molecular Markers in Breast Ductal Carcinoma In Situ

Hitchintan Kaur; Shihong Mao; Seema Shah; David H Gorski; Stephen A Krawetz; Bonnie F Sloane; Raymond R Mattingly


Expert Rev Mol Diagn. 2013;13(2):151-165. 

Next-generation Sequencing

Ultra-high-throughput massively parallel RNA sequencing (RNA-Seq) is a recently developed approach and is rapidly emerging as a more powerful alternative platform to microarrays for whole-genome expression profiling. Such NGS technologies offer many potential advantages compared with microarrays.[112] First, they do not rely on prior sequence information as is required for the probes used for microarrays.[113] This allows the experimental design to be unrestricted. Second, the level of expression is assigned on the basis of the entire transcript and not a few segments. The identification and quantification of gene expression at the whole-genome level without a priori sequence knowledge is unbiased and provides higher confidence when novel targets and network pathways are discovered.[114] Third, sequencing instead of hybridization minimizes concern with regard to cross-hybridization. Microarray cross-hybridization may happen if the probe sequences and the target transcript fragments are similar. Such hybridization noise may not be resolvable by downstream data analysis. In comparison, if optimal criteria are used during NGS alignment, misalignment or 'in silico cross-hybridization' can be effectively minimized. If two or more genome locations have very similar or identical sequences, the NGS short reads mapping to one location will also map to the other locations to produce in silico cross-hybridization. The aligner can classify such reads as 'multiple mapping reads', and they can still be analyzed further provided that this caveat is noted. Fourth, beyond gene expression analysis, NGS can also identify novel isoforms and exons, allele-specific expression, mutations and fusion transcripts. Fifth, NGS data are obtained as digital signals that can be quantified, annotated and reannotated to reflect the current genome consensus. These attributes make NGS ideal for the detection of differentially expressed transcripts.
Using RNA-Seq, for example, Huber-Keener et al. defined gene expression alterations associated with anti-estrogen resistance by comparing the transcriptomes of breast cancer cells that are either sensitive or resistant to tamoxifen and identified differential expression of transcripts regulating ERα functions, cell cycle, transcription/translation and mitochondrial dysfunction.[115]
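
The handling of 'multiple mapping reads' described above can be illustrated with a minimal Python sketch. It assumes a hypothetical input of (read ID, chromosome, position) tuples, one per reported alignment; the function name and the toy alignments are illustrative only and do not correspond to any particular aligner's output format.

```python
from collections import defaultdict

def classify_reads(alignments):
    """Split reads into uniquely mapped and multiple mapping sets.

    alignments: iterable of (read_id, chromosome, position) tuples,
    one tuple per reported alignment (hypothetical input format).
    """
    locations = defaultdict(set)
    for read_id, chrom, pos in alignments:
        locations[read_id].add((chrom, pos))
    # A read with one distinct location is unique; more than one is multi-mapped.
    unique = {r for r, locs in locations.items() if len(locs) == 1}
    multi = {r for r, locs in locations.items() if len(locs) > 1}
    return unique, multi

# Toy alignments, invented for illustration.
alignments = [
    ("read1", "chr1", 1000),
    ("read2", "chr1", 5000),
    ("read2", "chr7", 2500),  # same read reported at a second location
    ("read3", "chr17", 800),
]
unique, multi = classify_reads(alignments)
```

Keeping the multiple mapping reads in a separate set, rather than discarding them, mirrors the point made above that such reads can still be analyzed once the caveat is recorded.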

NGS also enables the detection of rare transcripts, sequence mutations, transcriptional boundaries, alternative splice variants, differential polyadenylation, ncRNAs and antisense transcripts.[116–121] Indeed, whole-genome sequencing revealed the competing evolution of numerous subclones during early stages of breast cancer development,[122] including identification of regions of localized hypermutation.[123] The application of NGS has similarly revealed that the breast cancer susceptibility genes BRCA1 and BRCA2 are associated with homologous patterns of somatic mutations that include both short deletions and base substitutions.[123] Microarray and PCR approaches have been used to identify splice variants in breast cancer samples.[124–127] To do this, estimates of the possible alternative splice variant sites are required before probes/primers can be designed, and thus these methods can only test a defined number of candidate variant sites. In contrast, NGS can potentially detect all splice variants at the whole-genome level. Furthermore, with the advent of paired-end sequencing, it is now possible to identify fusion genes that may encode hybrid proteins with oncogenic potential such as those identified in leukemias and lymphomas.[128] Paired-end sequencing of various breast cancer cell lines has identified multiple high-frequency intrachromosomal rearrangements and, to a lesser extent, interchromosomal rearrangements.[129]
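
As a rough illustration of how paired-end reads can point to rearrangements, the sketch below labels a read pair by where its two mates align. The function name, the coordinate inputs and the 1000-bp insert-size cutoff are assumptions made for this example; real analyses also consider mate orientation, mapping quality and the number of supporting pairs.

```python
def classify_pair(chrom1, pos1, chrom2, pos2, max_insert=1000):
    """Label a read pair by the relative placement of its two mates.

    max_insert is an assumed upper bound on the expected fragment size.
    Mate orientation is deliberately ignored in this simplified sketch.
    """
    if chrom1 != chrom2:
        # Mates on different chromosomes: candidate interchromosomal event.
        return "interchromosomal"
    if abs(pos2 - pos1) > max_insert:
        # Same chromosome but too far apart: candidate intrachromosomal event.
        return "intrachromosomal"
    return "concordant"

# Toy read pairs, invented for illustration.
pairs = [
    ("chr1", 10_000, "chr1", 10_350),
    ("chr1", 10_000, "chr1", 2_500_000),
    ("chr1", 10_000, "chr9", 40_000),
]
labels = [classify_pair(*p) for p in pairs]
```

A single discordant pair is weak evidence on its own; in practice, candidate rearrangements are usually called only when several independent pairs support the same junction.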

A recent issue of the journal Nature featured five studies using NGS approaches for the whole-genome analysis of breast cancer samples, producing many new insights on topics including copy-number variations, new descriptions of driver and other mutations and elevated mutation rates in treatment-resistant tumors.[130–134] Striking observations revealed by the use of NGS were that only approximately 36% of the gene mutations are detected as transcribed[134] and that many of the mutations would encode truncated proteins.[132] The study by Banerji et al. is notable for its inclusion of whole-exome sequencing results from nine DCIS samples (basal, luminal A and B, and Her-2 subtypes),[130] although the subset of results from the DCIS samples is not separately delineated from the entire set of 103 cancer and normal pairs. The overall dataset showed recurrent somatic mutations in five genes (PIK3CA, TP53, AKT1, GATA3 and MAP3K1), and recurrent mutations and deletions were discovered for CBFB and RUNX1. Three of these genes (TP53, GATA3 and RUNX1) encode transcription factors. These three genes were not subject to up- or downregulation in the available NGS data on DCIS models,[43] but it is interesting to note that these factors can be resolved by co-citation networks (Figure 3) and are perhaps candidate drivers regulating some of the 295 differentially expressed genes observed in that analysis of DCIS. The immense amount of information provided by these studies, together with the additional data that will be required to validate their clinical usefulness, points to a need for a sea change in our ability to organize and analyze clinical bioinformatics data.[135]

Figure 3.

Co-citation interaction networks among the ductal carcinoma in situ genes observed in next-generation sequencing analysis of ductal carcinoma in situ models [43]. The co-citation networks among 295 differentially expressed genes, and PIK3CA, TP53, AKT1, GATA3, MAP3K1, CBFB and RUNX1 were constructed using Genomatix Pathway System with default parameters. The gene connections are based on previously published literature; that is, hand-curated co-citations between two genes. This network indicates the co-citations of the input genes. The upregulated genes are shaded in warm colors (red, orange). The downregulated genes are shaded in cooler colors (blue, purple). CBFB does not appear in the network. PIK3CA, TP53, AKT1, GATA3, MAP3K1 and RUNX1 are shaded in yellow-brown. Of these, TP53, GATA3 and RUNX1 encode transcription factors with the motif indicated. (A) Shows the entire network formed. The circle indicates the region expanded in (B). Note the number of genes that share transcription factor-binding sites as indicated by the filled line terminators: diamonds indicate that gene A modulates gene B; arrowheads indicate that gene A activates gene B and stopped circles indicate that gene A inhibits gene B. This overview provides the opportunity to selectively evaluate the veracity of the resulting pathway.

The diversity of NGS techniques such as DNase-Seq, ChIP-Seq (definition of factor binding sites) and RNA-Seq (transcript profiling) has enabled the success of the ENCODE (Encyclopedia of DNA Elements) project consortium. This now includes a compilation of 1640 genome-wide datasets from 147 different cell types. These datasets cover transcription, transcription factor binding, DNase hypersensitive sites, chromatin structure, assembly and histone modification.[136,137] This has provided new insights into the organization and regulation of the human genome. Sequencing primary and processed RNAs has revealed that three quarters of the human genome can be transcribed and that genes are highly interlaced with overlapping transcripts that are transcribed from both DNA strands.[138]

Cell-free circulating nucleic acids have been recognized as potential biomarkers for the early detection and clinical monitoring of human breast cancers. Sequencing circulating nucleic acids in serum of patients with IDC showed that the quantities of specific cell-free transposable elements and endogenous retroviral DNA sequences in blood could distinguish early-stage IDC from normal and nonmalignant controls. The sample size in this study was small (n = 10), but the sensitivities and specificities suggest this could be a useful clinical tool if these results are borne out in larger trials.[139]
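
For readers less familiar with the two metrics mentioned above, the sketch below shows how sensitivity and specificity are computed from a confusion matrix. The counts are invented purely for illustration and are not taken from the cited study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of true cases detected.
    specificity = TN / (TN + FP): fraction of controls correctly excluded.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical counts, NOT data from the study discussed in the text.
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=9, fp=1)
```

With cohorts this small, each single misclassified sample shifts either metric by a large step, which is one reason larger trials are needed before such a test could be adopted clinically.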

ncRNAs are functional RNAs that play important roles in the regulation of gene expression at different transcriptional and translational levels. The relevance of small ncRNA to cancer, particularly to breast cancer, has mainly been studied using NGS. miRNA is the most widely studied class of ncRNAs. Genome-wide miRNA expression profiling of breast cancer patients by SOLiD sequencing showed that the expression of five miRNAs was altered at least fivefold. Two of them, miR-29a and miR-21, were identified as significantly increased in the serum of breast cancer patients.[140] Further analysis of serum genome-wide miRNAs revealed that miR-103, miR-23a, miR-29a, miR-222, miR-23b, miR-24 and miR-25 were coordinately upregulated. Of this suite of miRNAs, miR-222 was significantly increased in the serum of breast cancer patients and has been proposed as a potential biomarker for breast cancer.[141] Moreover, using NGS, a nine-microRNA signature has been identified that differentiates IDC from DCIS. Five miRNAs are associated with time to metastasis and overall survival of IDC patients.[142]
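
A fold-change criterion such as 'altered at least fivefold' can be sketched as a simple filter on normalized read counts. The function name, the pseudocount, the placeholder miRNA names and the counts below are all assumptions made for this example; they do not reproduce the statistical methods of the cited studies, which would also test for significance.

```python
def fold_change_hits(case, control, threshold=5.0, pseudocount=1.0):
    """Return miRNAs whose case/control ratio changes at least `threshold`-fold.

    case, control: dicts mapping miRNA name to a normalized read count.
    pseudocount is an assumed small constant that avoids division by zero.
    """
    hits = {}
    for name in case.keys() & control.keys():
        ratio = (case[name] + pseudocount) / (control[name] + pseudocount)
        # Keep both strong increases and strong decreases.
        if ratio >= threshold or ratio <= 1.0 / threshold:
            hits[name] = ratio
    return hits

# Placeholder names and counts, invented for illustration.
case = {"miR-A": 599, "miR-B": 120, "miR-C": 9}
control = {"miR-A": 99, "miR-B": 100, "miR-C": 99}
hits = fold_change_hits(case, control)
```

Here miR-A (sixfold up) and miR-C (tenfold down) pass the filter, whereas miR-B does not.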

It should be emphasized that different NGS techniques and applications require different approaches and tools. The average length of a small ncRNA (18–30 nucleotides) is much shorter than that of an mRNA transcript. Extraction of small ncRNA and preparation of sequencing libraries thus require specific approaches. Sequence alignment tools that can effectively handle short reads and trim the adapters are also necessary. For some ncRNAs, such as miRNA, potential target genes need to be identified during data analysis. Both RNA-Seq and ChIP-Seq demand deep sequencing and short-read alignment. Optimized peak-finding algorithms,[143] such as MACS[144] or PeakSeq,[145] are essential for the analysis of ChIP-Seq data to identify the binding sites of the transcription factor of interest.[146] Whereas RNA-Seq profiles transcripts, ChIP-Seq can help to determine how transcription factors and other chromatin-associated proteins interact with specific segments of the genome to regulate gene expression. The results can help us to understand biological processes and disease states through the construction of coordinated regulatory networks.
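
Because small ncRNAs are shorter than a typical read, the 3' adapter is usually sequenced along with the insert and must be removed before alignment. The sketch below illustrates exact-match adapter trimming followed by an 18–30-nucleotide length filter. The adapter string, the read sequences and the function names are placeholders chosen for this example; dedicated trimming tools additionally tolerate sequencing errors and use base qualities.

```python
def trim_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter (exact match only) and everything after it."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # The adapter may be cut off at the read end: try shorter adapter prefixes.
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read

def keep_small_rna(reads, adapter, min_len=18, max_len=30):
    """Trim each read, then keep inserts in the small ncRNA length range."""
    trimmed = (trim_adapter(r, adapter) for r in reads)
    return [t for t in trimmed if min_len <= len(t) <= max_len]

# Placeholder sequences, invented for illustration.
ADAPTER = "TGGAATTCTCGG"
INSERT = "ACGGTTCAGGCTAATCCGTAGA"     # 22-nt insert
reads = [
    INSERT + ADAPTER,                 # full adapter present
    "ACGTACGTAC" + ADAPTER,           # 10-nt insert, too short to keep
    INSERT + ADAPTER[:7],             # adapter truncated at the read end
]
kept = keep_small_rna(reads, ADAPTER)
```

The length window follows the 18–30-nucleotide range given in the text; the minimum adapter overlap of five bases is an arbitrary choice that trades missed adapters against accidental trimming of genuine insert sequence.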