Current methods for determining RNA structure with short-read sequencing cannot capture most differences between distinct transcript isoforms. Here we present RNA structure analysis using nanopore sequencing (PORE-cupine), which combines structure probing using chemical modifications with direct long-read RNA sequencing and machine learning to detect secondary structures in cellular RNAs. PORE-cupine also captures global structural features, such as RNA-binding-protein binding sites and reactivity differences at single-nucleotide variants. We show that shared sequences in different transcript isoforms of the same gene can fold into different structures, highlighting the importance of long-read sequencing for obtaining phase information. We also demonstrate that structural differences between transcript isoforms of the same gene lead to differences in translation efficiency. By revealing isoform-specific RNA structure, PORE-cupine will deepen understanding of the role of structures in controlling gene regulation.
Source code for all scripts (R version 3.4.1) and commands used for analysis can be found at http://github.com/awjga/PORE-cupine.
Wan, Y., Kertesz, M., Spitale, R. C., Segal, E. & Chang, H. Y. Understanding the transcriptome through RNA structure. Nat. Rev. Genet. 12, 641–655 (2011).
Kertesz, M. et al. Genome-wide measurement of RNA secondary structure in yeast. Nature 467, 103–107 (2010).
Wan, Y. et al. Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505, 706–709 (2014).
Wan, Y. et al. Genome-wide measurement of RNA folding energies. Mol. Cell 48, 169–181 (2012).
Siegfried, N. A., Busan, S., Rice, G. M., Nelson, J. A. & Weeks, K. M. RNA motif discovery by SHAPE and mutational profiling (SHAPE-MaP). Nat. Methods 11, 959–965 (2014).
Spitale, R. C. et al. Structural imprints in vivo decode RNA regulatory mechanisms. Nature 519, 486–490 (2015).
Lucks, J. B. et al. Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc. Natl Acad. Sci. USA 108, 11063–11068 (2011).
Rouskin, S., Zubradt, M., Washietl, S., Kellis, M. & Weissman, J. S. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature 505, 701–705 (2014).
Zubradt, M. et al. DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo. Nat. Methods 14, 75–82 (2017).
Ding, Y. et al. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature 505, 696–700 (2013).
Strobel, E. J., Yu, A. M. & Lucks, J. B. High-throughput determination of RNA structures. Nat. Rev. Genet. 19, 615–634 (2018).
Sharon, D., Tilgner, H., Grubert, F. & Snyder, M. A single-molecule long-read survey of the human transcriptome. Nat. Biotechnol. 31, 1009–1014 (2013).
Au, K. F. et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc. Natl Acad. Sci. USA 110, E4821–E4830 (2013).
Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).
Tilgner, H. et al. Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events. Nat. Biotechnol. 33, 736–742 (2015).
Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15, 201–206 (2018).
Oikonomopoulos, S., Wang, Y. C., Djambazian, H., Badescu, D. & Ragoussis, J. Benchmarking of the Oxford Nanopore MinION sequencing for quantitative and qualitative assessment of cDNA populations. Sci. Rep. 6, 31602 (2016).
Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407–410 (2017).
Parker, M. T. et al. Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m6A modification. eLife 9, e49658 (2020).
Liu, H. et al. Accurate detection of m6A RNA modifications in native RNA sequences. Nat. Commun. 10, 4079 (2019).
Weeks, K. M. Advances in RNA structure analysis by chemical probing. Curr. Opin. Struct. Biol. 20, 295–304 (2010).
Spitale, R. C. et al. RNA SHAPE analysis in living cells. Nat. Chem. Biol. 9, 18–20 (2013).
Sachsenmaier, N., Handl, S., Debeljak, F. & Waldsich, C. Mapping RNA structure in vitro using nucleobase-specific probes. Methods Mol. Biol. 1086, 79–94 (2014).
Guo, F., Gooding, A. R. & Cech, T. R. Structure of the Tetrahymena ribozyme: base triple sandwich and metal ion at the active site. Mol. Cell 16, 351–362 (2004).
Winkler, W., Nahvi, A. & Breaker, R. R. Thiamine derivatives bind messenger RNAs directly to regulate bacterial gene expression. Nature 419, 952–956 (2002).
Jambhekar, A. et al. Unbiased selection of localization elements reveals cis-acting determinants of mRNA bud localization in Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA 102, 18005–18010 (2005).
Sexton, A. N., Wang, P. Y., Rutenberg-Schoenberg, M. & Simon, M. D. Interpreting reverse transcriptase termination and mutation events for greater insight into the chemical probing of RNA. Biochemistry 56, 4713–4721 (2017).
Li, F. et al. Global analysis of RNA secondary structure in two metazoans. Cell. Rep. 1, 69–82 (2012).
Sun, L. et al. RNA structure maps across mammalian cellular compartments. Nat. Struct. Mol. Biol. 26, 322–330 (2019).
Wilbert, M. L. et al. LIN28 binds messenger RNAs at GGAGA motifs and regulates splicing factor abundance. Mol. Cell 48, 195–206 (2012).
Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
Pan, Q., Shai, O., Lee, L. J., Frey, B. J. & Blencowe, B. J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–1415 (2008).
Moqtaderi, Z., Geisberg, J. V. & Struhl, K. Extensive structural differences of closely related 3′ mRNA isoforms: links to Pab1 binding and mRNA stability. Mol. Cell 72, 849–861 (2018).
Floor, S. N. & Doudna, J. A. Tunable protein synthesis by transcript isoforms in human cells. eLife 5, e10921 (2016).
Aw, J. G. et al. In vivo mapping of eukaryotic RNA interactomes reveals principles of higher-order organization and regulation. Mol. Cell 62, 603–617 (2016).
Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
Mustoe, A. M. et al. Pervasive regulatory functions of mRNA structure revealed by high-resolution SHAPE probing. Cell 173, 181–195 (2018).
Das, R., Laederach, A., Pearlman, S. M., Herschlag, D. & Altman, R. B. SAFA: semi-automated footprinting analysis software for high-throughput quantification of nucleic acid footprinting experiments. RNA 11, 344–354 (2005).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Sovic, I. et al. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat. Commun. 7, 11307 (2016).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Shah, A., Qian, Y., Weyn-Vanhentenryck, S. M. & Zhang, C. CLIP Tool Kit (CTK): a flexible and robust pipeline to analyze CLIP sequencing data. Bioinformatics 33, 566–567 (2017).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
We thank members of the Wan and Tan labs and F. Yao, H. M. Loh, C. C. Khor and M. Sikic for helpful discussions. Y.W. is supported by funding from A*STAR (A*STAR investigatorship 1630700155), the National Research Foundation Singapore (NRF2019-NRF-ISF003-2970 and CRP21-2018-0101), the EMBO Young Investigatorship and the CIFAR global scholarship.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Chemical structures of RNA structure probing compounds, associated reaction products, mapped length and statistics of error rates.
a, Chemical structures of RNA structure probing compounds. Side chains for the carbodiimide of CMCT are highlighted and abbreviated as R’ and R’ for part (b). b, RNA nucleotide triphosphates with chemical adducts formed from reaction with structure probing compounds. Adducts are highlighted in green. c, Median lengths of mapped nanopore reads for unmodified and modified Tetrahymena RNA with different structure probing compounds. d, e, f, Boxplots showing the frequency of mismatch (d), deletion (e) and insertion (f) rates for different structure probing chemicals on Tetrahymena RNA, as compared to unmodified RNA. P-values were calculated using the two-sided Wilcoxon Rank Sum test. h–j, Boxplots showing the AUC-ROC performance of mismapping (h), deletion (i) and insertion (j) rates for the different compounds on the Tetrahymena RNA secondary structure. P-values were calculated using two-sided Wilcoxon rank-sum test. c-j, 6962-42107 reads from different libraries were used for comparisons (Supplementary Table 1). The middle, lower and upper boundary lines in the boxplot correspond to median, first and third quartiles. The upper whisker extends to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range) and the lower whisker extends to the smallest value at most 1.5 × IQR of the hinge.
Extended Data Fig. 2 Distribution of mismatches, insertions, deletions along Tetrahymena RNA sequence.
Line plots of normalized number of mismatches (a), deletions (b) and insertions (c) caused by the different compounds and unmodified, along the length of the Tetrahymena RNA sequence. The red bars on top of the plots indicate the location of single-stranded bases in the secondary structure.
Extended Data Fig. 3 Error characterization of the modifications along the secondary structure of the Tetrahymena RNA.
a, Positions and intensity of mismatches (red), deletions (green) and insertions (purple) caused by the different chemical compounds are mapped along the secondary structure of the Tetrahymena RNA. b, Percentage of observed bases (upper number) and corresponding P-values were shown (lower number) for each observation. P-value was calculated using two-sided chi-square test for all modified versus unmodified comparisons.
Extended Data Fig. 4 Schematic of the bioinformatic workflow of PORE-cupine and characteristics of direct RNA sequencing signal.
Sequenced reads were basecalled using Albacore or Guppy, and mapped to the reference sequences using Graphmap. We used Nanopolish to align the raw signals and to extract current features which were used to train unmodified data using SVM. We then filter for reads that are longer than 50% of annotated lengths and for transcripts that have at least 100 reads in the modified library and 200 reads in the unmodified library for downstream analysis. b, Normalized current dwell time for single-stranded regions on Tetrahymena RNA modified with NAI-N3. With footprinting gels as a guide, the top 10% of the single-stranded regions on Tetrahymena RNA were selected for these plots. c-e, Normalized current mean (c), standard deviation (d) and dwell time (e) distributions for all positions on unmodified Tetrahymena RNA and RNA modified with NAI-N3. f, Bioanalyzer traces of in vitro transcribed, full-length, unmodified, and NAI-N3 (100 mM for 5 mins or 25 mins) modified Tetrahymena RNA. g, Mapping rates for modified versus unmodified Tetrahymena RNA. The total number of sequenced reads for unmodified, NAI-N3 (5 mins), and NAI-N3(25 mins) are 20149, 51760, and 22155 reads respectively. The percentage of mapped reads for unmodified, NAI-N3 (5 mins), and NAI-N3 (25 mins) are 75%, 81%, and 17% respectively. h, Density plots showing the distribution of lengths of sequenced unmodified and modified Tetrahymena RNA. Top: unmodified and NAI-N3 modified (100 mM, 5 min) RNA. Bottom: unmodified and NAI-N3 modified (100 mM, 25 min) RNA. i, Coverage of reads mapping to Tetrahymena RNA along its length, for unmodified (top), NAI-N3 modified (100 mM, 5 min)(middle) and extended NAI-N3 modified (100 mM, 25 min) RNA (bottom).
a, Scatterplot showing the distribution of normalized base reactivity between N = 2 biological replicates of modified Tetrahymena RNA. R = 0.97, CI95% = [0.97,0.98] (Pearson correlation). P-value=2.5×10-262, two-tailed Student’s T-test. b, Distribution of current mean and standard deviation for a unimodal (left) and bimodal (right) position in two biological replicates. c, AUC-ROC performance of the correlation of NAI-N3 reactivities of the training set based on PORE-cupine versus footprinting from 11 transcripts. d,e, Comparison of PORE-cupine reactivity and traditional footprinting. Two replicates of gels were shown for Tetrahymena RNA (d, R = 0.80) and lysine riboswitch (e, R = 0.74). Lane 1 of the footprinting gels show A (left, Tetrahymena) or G (right, Tetrahymena and lysine) ladder. Lane 2 shows unmodified RNA, and lane 3 shows NAI-N3 modified RNA. Quantification of the bands on the gels was done using SAFA. Pearson correlation was used to compare between SAFA and PORE-cupine signals. f, List of RNAs used for training and test. g, Scatter plot of per-base reactivity in two biological replicates of the three test RNAs. P-value = 0 using two-tailed Student’s T-test. R = 0.877, CI95% = [0.87, 0.89], by Pearson correlation. h-j, Line plots showing the per-base reactivity along the length of three test RNAs, for two biological replicates. R > = 0.89, using Pearson correlation. k, Boxplot showing the performance of the SVM parameters on the 3 test RNAs, based on training on the Tetrahymena RNA (left) or on 11 RNAs (right, footprinting gels). l, AUC-ROC performance of SVM parameters on 3 test RNAs (red, based on our current 11 training RNAs) versus test RNAs after random selection of 11/14 RNAs as training, for 20 times. m, Boxplot showing the performance of all, unimodal and bimodal positions on test RNAs using AUC-ROC based on footprinting gels from 3 transcripts. In c, k-m, the middle, lower and upper boundary lines in the boxplot correspond to median, first and third quartiles. The upper whisker extends to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range) and the lower whisker extends to the smallest value at most 1.5 × IQR of the hinge. Outliers are shown as dots.
a, Bioanalyzer traces of unmodified and in vivo NAI-N3 modified (100 mM, 5 min) total B. subtilis RNA. b, Secondary structure model of B. subtilis 16 S rRNA. The structure probed regions are boxed in pink, green and blue. c-e, Comparisons between PORE-cupine and footprinting. Two replicates of gels are shown for each of the three regions along B. subtilis RNA. The gels show G ladder (lane 1), unmodified RNA (lane 2) and NAI-N3 modified RNA (lane 3) and a correlation of R (Pearson)= 0.91 (c), 0.74 (d) and 0.24 (e) between the gels. Quantification of the bands on the gels were done using SAFA. Comparison between SAFA quantification and PORE-cupine for each of the regions is shown as a line plot to the right of the gels. R = 0.52 (c), 0.76 (d), 0.62 (e) by Pearson correlation. f,g, Bioanalyzer traces of unmodified and in vitro NAI-N3 modified RPS29 (100 mM, 5 min) (f) and Adocbl riboswitch (g). h-i, Comparisons between PORE-cupine and footprinting. Two replicates of gels are shown for along RPS29 and Adocbl riboswitch RNA, R(Pearson)= 0.93(h) and 0.73(i). The gels show G ladder (lane 1), unmodified RNA (lane 2) and NAI-N3 modified RNA (lane 3). Quantification of the bands on the gels were done using SAFA. Comparison between SAFA quantification and PORE-cupine for each of the regions is shown to the right of the gels. R(Pearson)=0.68 (h) and 0.69 (i).
Line plots showing 2 replicates of PORE-cupine reactivities along TPP riboswitch in the presence of water (R = 0.86), 250 nM TPP (R = 0.87), 750 nM TPP (R = 0.83), and 10 μM TPP (R = 0.94). Pearson correlation is used to calculate the similarities between the reactivities of the replicates.
a, b, Bioanalyzer traces of unmodified and modified total (a) and polyA(+) selected (b) hESC. c, Barplots showing the number of reads after basecalling (12007032 and 10118432) and mapping (86% and 60%) in unmodified and modified hESC samples respectively. d, Histogram showing the distribution of reads with different amounts of modification in hESC. e, Boxplots showing the performance of reads with different amounts of modification, calculated using AUC-ROC on the test set of 3 RNAs, based on 10 footprinting regions. Reads were grouped into different classes: all reads (current, 147670 reads), with only 1 modification for the strand (only 1, 15461 reads), with 0-1% modification (68025 reads), with 1-2% modification (9771 reads), with 2-3% modification (803 reads), and with 3-4% modifications (76 reads). P-value is calculated using two-sided Wilcoxon Rank Sum test. The middle, lower and upper boundary of the boxplot correspond to the median, first and third quartiles, while the upper and lower whiskers extend from the hinge to the largest and smallest value at most 1.5 × IQR of the hinge respectively. f, Top: Fraction of modified reads mapped across exon-exon junctions (position 0) in hESC (black) and across artificial junctions with 50 base insertions (red). Bottom: Difference between mapping rates across normal versus artificial exon-exon junctions at each base; p-value was calculated using two-tailed Wilcoxon Rank-Sum Test. g, Line graph showing the percentage of bimodal positions observed (for both current standard deviation and mean) for all 1024 kmers. The orange line indicates the top 1% of bimodal signals across all kmers and the identity of the corresponding kmers are labelled above. h-j, Base composition along positions 1,2,3,4,5 of unimodal (left) and bimodal kmers (right). These include kmers that show bimodal current mean (h), bimodal current standard deviation (i), and both bimodal current mean and current standard deviation (j). k, Coverage of unmodified (left) and modified (right) reads along the hESC transcriptome using direct RNA sequencing.
Scatterplot showing the distribution of total reads per position for each hESC transcript, for N = 2 biologically independent replicates, of unmodified (left, p-value =0, using two-tailed Student T test, CI95% = [0.98,0.98], 1613 transcripts) and modified transcripts (right, 1751 transcripts, p-value =0 using two-tailed Student T test, CI95% = [0.97,0.97]). R (Pearson)=0.98 (left) and 0.97 (right). b, Barplot showing the number of transcripts left after abundance and length filter. The number of transcripts in each group is shown above the plot. c, Boxplots showing the distribution of median mapped lengths of unmodified (left) and NAI-N3 modified (middle) hESC mRNAs (1751 trancripts). Annotated refers to the distribution of expected lengths for each transcript based on ENSEMBL GRCh38 annotation (right). d, Histogram showing the distribution of transcripts having different fractions of annotated length in unmodified and modified samples. e, Distribution of Pearson correlations between full-length (>99% of known length) and partial transcripts in hESC (83 transcripts from N = 2 two biological replicates were used). The Y-axis shows the fraction of transcripts with a particular correlation. The X-axis depicts Pearson correlation coefficients. f, Boxplot showing PORE-cupine reactivity of different classes of transcripts. P-values were calculated using two-sided Wilcoxon Rank Sum test.1584 coding genes, 67 pseudogenes, 81 non-coding genes and 4 rRNAs were used. g, Top, Metagene analysis of PORE-cupine-derived mean reactivities aligned according to start (Upper) and stop (Lower) codons for all 559 transcripts. Bottom, Metagene autocorrelation function (ACF) plot for the 5’ UTR, CDS and 3’ UTR. In c and f, the middle, lower and upper boundary of the boxplot correspond to the median, first and third quartiles. The upper and lower whisker extends from the hinge to the largest and smallest value at most 1.5 × IQR of the hinge. Outliers are shown as dots.
a, Upper: Transcript organization of different RPLP0 isoforms. Alternative exons seen in our structural data are highlighted in red. Lower, normalized reactivity profiles for the different isoforms and their aggregate signal. b, Upper, Transcript organization of different RACK1 isoforms. Alternative exon is shown in red (also in inset). Lower, Line plots for the aggregate reactivity signal between the two isoforms are shown (Top). Middle, Line plots showing the expanded view of the reactivity difference between the isoforms. Bottom, Line plots showing the individual reactivity information for each isoform along its length. c, No. of transcripts with two structure changing regions that are more than 100, 200, 300, 400 or 500 bases apart. The value for each group is shown above the bar. d, Schematic of the TrIP-seq workflow. e, Line plot showing the absorbance A260 of each fraction (2-12) after polysome fractionation. f, Pair-wise correlations (Spearman correlation) of the read-counts/transcript for each fraction between two biological replicates. Fractions and batches are denoted as F2-12 and B1-2 respectively. g, Distribution of read-counts across different polysome fractions for two biological replicates of Actin B (left) and Activating transcription factor 4 (right).