Universe of RNA Structures


  1. Introduction
    1. What is URS?
    2. First steps with URS
  2. How to select a set of structures
    1. The Structures page
    2. Query
    3. Restrictions
      1. "General Information"
      2. "Contained Molecules"
      3. "Contained RNA Structure Patterns"
      4. "Contained Interactions"
    4. Results
  3. Statistics
    1. Chains
    2. Base Pairs
    3. Links
    4. Stems
    5. Loops
    6. Pseudoknots
    7. Multiplets
    8. RNA-Protein Interactions
  4. Definitions
    1. Base Pairs, Stems and Links
    2. Elementary Closed Regions, Pseudoknots, Signatures and Descriptions
    3. Stems and Loops
    4. Loop Structure
    5. Dictionary
  5. URSDB
  6. Example of URS session

I. Introduction    


1. What is URS?    

Universe of RNA Structures (URS) is a web interface to the URS database (URSDB), which includes all RNA-containing PDB entries. The data are annotated; in particular, we have marked base pairs (canonical and non-canonical), stems, loops of various types, pseudoknots, elementary closed regions (ECRs), multiplets, etc. Specific characteristics are stored for each structure element. For example, we store pseudoknot signatures and stem descriptions of ECRs.

URS allows one

- to select a set of PDB entries or their structure elements having desired features;

- to obtain statistics of structural elements for a selected subset of entries or for the entire database;

- to analyze the structure elements of the chosen PDB entry.

To annotate base pairs we have used the DSSR program package.


2. First steps with URS    

The work session starts with the main page (see Fig.1).

Home
Figure 1.

The lower blocks describe the two main URS pages, which correspond to the items "Structures" and "Statistics" of the main menu.

The "Structures" page allows one to select the desired set of structures. A URS request specifies a logical combination of restrictions on a PDB entry's name, its molecular content, chain features, structure elements, etc. It is possible to search within previous results or to add the results of a new search to those of a previous one.

Clicking on an element of the search results opens a window for the analysis of the individual entry. It contains a molecular viewer and information on the RNA chains within the entry.

The "Statistics" page allows users to get statistics of various structural elements, e.g. chains, base pairs, links, stems, loops, pseudoknots, multiplets and RNA-protein H-bonds. Depending on the user's request, the statistics can be calculated for the entire database, for the selected set of structures, or for a user-defined PDB entry. To our knowledge, URS is currently the only resource providing such an option.


II. How to select a set of structures    


1. The Structures page    

At the beginning of a working session the "Structures" page consists of two parts that are necessary to form a request, see Fig.2. The upper part ("Query field", with grey background) shows the current query (1) and allows one to edit the query (2), to determine a search mode (3) and to submit a request (4). The lower part ("Restrictions field") contains 4 expandable fields allowing one to include various restrictions on entries in the query. After a search is performed, the results appear below the Restrictions field.

Query
Figure 2.

The Results field consists of three expandable blocks (see Fig.3). The results themselves are shown in the third (lowest) block. The first block helps one to retrace the history of the set selection; the second block allows one to specify which attributes of the entries are shown.

Query + Results
Figure 3.


2. Query    

Simple Search

A user can search for structures using the simple search form. To do that one should type a PDB-ID, author, sequence or any other keyword in the text area and click the Search button (or press the Enter key), see Fig.4A.

Simple Search
Figure 4A.

Advanced Search

URS allows one to formulate a query as a disjunction (OR-junction) of conjunctions (AND-junctions) of elementary queries; each elementary query gives a restriction on entries to be selected. The number of items in a conjunction can be arbitrarily large; the maximal number of items in a disjunction is 6. The figure below explains how to use the Query field.

The left part of the field (see marks 1 - 8) shows the current query and allows one to edit it. Each conjunction is presented in a separate block (see 1, 2). Only one block can be active at a time; the restriction specified in the Restrictions field will be added to this block, see Example 1. Initially the field contains one conjunction block. To add or remove a block use the buttons "OR" and "BACK" (see 7, 8). The right part allows one to set a search mode (see 9), to choose the mode of output (see 10) and to submit the query (see 11). Other explanations are given in the legend.

Query Field
Figure 4B.

1 - Inactive area. Click here to make the area active.

2 - Active area. Entered parameters will be added here.

3 - Click to remove selected conditions from the corresponding area. If nothing is selected the last condition in the corresponding area will be removed.

4 - Click to copy the content from the corresponding area to the active one.

5 - Click to clear all query fields and text fields of restrictions.

6 - Click to clear the Results field.

7 - Click to add a new area. Maximal number of areas is 6.

8 - Click to remove the lower area.

9 - Select a search mode: New Search, Search In Results or Add To Results.

10 - Select the format of results: only first model of each entry or include all models of each entry.

11 - Submit query.


Example 1.

The current query is shown in Fig.5. The query selects all entries containing H-type pseudoknots (signature abAB) or kissing loops (signature abAcBC). The second area is active.

Query Example 1
Figure 5.

An additional restriction on the existence of a ligand will be added to this area, see Fig.6. The button that the user has clicked is highlighted in yellow.

Query Example 1 2
Figure 6.
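The OR-of-ANDs logic of the Query field can be illustrated with a short Python sketch. It is not URS code: the function name `matches` and the representation of an entry as a set of restriction labels are assumptions made for illustration, and it is assumed that the kissing-loop block is the second area of Example 1.

```python
# Hypothetical model: a query is a list of conjunction blocks (OR between
# blocks, AND within a block); an entry is the set of restrictions it satisfies.
MAX_BLOCKS = 6  # the Query field allows at most 6 areas


def matches(query, entry_features):
    """Return True if the entry satisfies at least one conjunction block."""
    if len(query) > MAX_BLOCKS:
        raise ValueError("a query may contain at most 6 conjunction blocks")
    return any(all(r in entry_features for r in block) for block in query)


# Query of Example 1 after the ligand restriction was added to the second block
query = [
    ["Pseudoknot:abAB"],
    ["Pseudoknot:abAcBC", "Ligand:YES"],
]

print(matches(query, {"Pseudoknot:abAB"}))                  # H-type pseudoknot only
print(matches(query, {"Pseudoknot:abAcBC"}))                # kissing loops, no ligand
print(matches(query, {"Pseudoknot:abAcBC", "Ligand:YES"}))  # kissing loops and a ligand
```

An entry with kissing loops but without a ligand is rejected, because the ligand restriction was added to the same conjunction block as the kissing-loop signature.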


3. Restrictions    

One can use 4 types of restrictions; a separate expandable block is provided for each type. To add a restriction to the query one has to specify the parameters of the restriction and click on the corresponding Add button. Composite parameters are outlined by a frame.

The block "General Information", its explanation and examples are given here. In the text fields one can specify both full text (e.g. 1FFA for PDB ID) and only a fragment of the text, e.g. 1F. In the latter case all documents with PDB ID containing 1F will be selected. Lists of all possible variants of input are available.

The block "Contained Molecules", its explanation and examples are given here. The block allows one to select PDB documents according to presence/absence of various molecules. Unlike fields from the "General Information" section only full codes are possible.

The block "Contained RNA Structure Patterns", its explanation and examples are given here.

The block "Contained Interactions", its explanation and examples are given here.


A. "General Information"    

Restrictions 1
Figure 7.

One can specify:

- PDB ID, giving the full ID or text that ID should contain;

- Date (before or after a given date or exactly at a given date);

- Resolution (greater, less or equal to a given threshold);

- Method used to obtain the structure (X-RAY DIFFRACTION, SOLUTION NMR, ELECTRON MICROSCOPY or OTHER);

- Title, Author and Keyword (the corresponding fields of a PDB entry should contain the given texts).

Examples:

Parameter | Example | Comment
PDBID | PDB-ID:1A | 4-letter ID of the PDB entry contains "1A"
Date | Date>2005-06-17 | PDB document was deposited after 17 June 2005
Title | Title:5S_RRNA | PDB entry title contains "5S RRNA"
Resolution | Resolution<2.5A | Resolution of the structure less than 2.5 angstrom
Method | Method:SOLUTION_NMR | Structure was obtained by the Solution NMR method
Author | Author:I.I.Ivanov | I.I. Ivanov is one of the authors of the article corresponding to the PDB entry
Keyword | Keyword:HIV-1 | Keyword section of the PDB entry contains "HIV-1"

B. "Contained Molecules"    

Restrictions 2
Figure 8.

The three left fields allow one to specify the desired number of chains of a biopolymer of interest; one can also restrict the length of the polymers under consideration. The two right fields allow one to require the presence/absence of a certain ligand or metal ion, or of ligands and ions in general. Lists of all possible ligand or ion codes appear automatically when one starts typing.

Hint: use restriction DNA:#Chains=0 to select documents that do not contain DNA chains.

Examples:

Molecule | Example | Comment
RNA | RNA:#Chains>1:Length>100 | PDB entry contains at least 2 RNA chains with length greater than 100
RNA | RNA:#Chains=3 | PDB entry contains exactly 3 RNA chains
DNA | DNA:#Chains>0 | PDB entry contains at least 1 DNA chain
Protein | Protein:#Chains>0:Length>100 | PDB entry contains at least 1 Protein chain with length more than 100
Ligand | Ligand:PO4:NO | PDB entry does not contain ligand PO4
Ligand | Ligand:YES | PDB entry contains at least one ligand of any kind
Metal | Metal:ZN:YES | PDB entry contains ZN metal ions
Metal | Metal:NO | PDB entry does not contain metal ions of any kind

C. "Contained RNA Structure Patterns"    

Restrictions 3
Figure 9.

The block allows one to define a sequence pattern or secondary structure patterns (three types of secondary structure patterns are possible); only entries containing a given pattern will be selected. For the "Structural Elements" field both presence and absence of a pattern can be specified.

The sequence pattern is composite: one can specify both the sequence of a fragment (using the IUPAC 1-letter code) and the desired interactions between nucleotides in dot-bracket notation. One can specify the sequence, the interactions, or both.

Examples:

Example | Comment
Sequence: AGRN; Interactions: (empty) | RNA (or DNA) chains from the PDB entry contain at least one 'AGRN' fragment in extended alphabet
Sequence: (empty); Interactions: .)...]]]..(((. | RNA (or DNA) chains from the PDB entry contain at least one '.)...]]]..(((.' fragment in dot-bracket notation
Sequence: GGCCG; Interactions: ((((. | RNA (or DNA) chains from the PDB entry contain at least one fragment with sequence = 'GGCCG' and dot-bracket description = '((((.'

The first restriction related to RNA secondary structure determines the presence or absence in the document of certain types of secondary structure elements, namely, loops of any type (hairpins, internal loops, bulges, multiple junctions) and pseudoknots. The second field allows one to search for structures containing a pseudoknot with a certain signature. For example, abAB is the signature of an H-type pseudoknot and abAcBC is the signature of kissing loops. The pull-down menu contains a list of all signatures appearing in the PDB. The last field singles out elementary closed regions (ECRs) with a given stem description.

Examples:

Pattern | Example | Comment
ECR Pattern | ECR:1(7;*).2(3,5;*CC).3.-2.-1.-3 | At least one ECR from the PDB entry contains the fragment of stem description matching pattern = '1(7;*).2(3,5;*CC).3.-2.-1.-3'
Pseudoknot Pattern | Pseudoknot:abAcBdCD | PDB entry contains pseudoknots with signature = 'abAcBdCD'
Structural Elements | Elements:Bulges:NO | PDB entry does not contain bulges

Extended Alphabet of Sequence Pattern (IUPAC Nucleotide Nomenclature Table)   

Symbol | Meaning | Group/Origin of Designation | Complementary Symbol
A | A | Adenine | T/U
G | G | Guanine | C
C | C | Cytosine | G
T | T | Thymine | A
U | U | Uracil | A
R | G or A | puRine | Y
Y | T/U or C | pYrimidine | R
M | A or C | aMino | K
K | G or T/U | Keto | M
S | G or C | Strong interactions, 3 H-bonds | S
W | A or T/U | Weak interactions, 2 H-bonds | W
B | G or C or T/U | not A | V
D | A or G or T/U | not C | H
H | A or C or T/U | not G | D
V | A or G or C | not T, not U | B
N | A or G or C or T/U, unknown, or other | aNy | N
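How a sequence pattern in the extended alphabet could be matched against a chain can be sketched in Python by translating it into a regular expression. This is a minimal illustration under stated assumptions, not the matcher used by URS: the names `IUPAC`, `iupac_to_regex` and `contains_pattern` are hypothetical, and 'N' is assumed to match any single symbol.

```python
import re

# Regex fragment for each symbol of the extended alphabet (see the table above).
IUPAC = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "U",
    "R": "[GA]", "Y": "[TUC]", "M": "[AC]", "K": "[GTU]",
    "S": "[GC]", "W": "[ATU]", "B": "[GCTU]", "D": "[AGTU]",
    "H": "[ACTU]", "V": "[AGC]",
    "N": ".",  # assumption: 'N' matches any symbol, including unknown ones
}


def iupac_to_regex(pattern):
    """Translate an IUPAC sequence pattern into a compiled regular expression."""
    return re.compile("".join(IUPAC[symbol] for symbol in pattern.upper()))


def contains_pattern(sequence, pattern):
    """True if the sequence contains at least one fragment matching the pattern."""
    return iupac_to_regex(pattern).search(sequence.upper()) is not None


print(contains_pattern("CCUAGGUAA", "AGRN"))  # the fragment AGGU matches AGRN
print(contains_pattern("CCUAGCUAA", "AGRN"))  # AG is followed by C, which is not a purine
```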

Dot-Bracket notation is a symbolic string representing the secondary structure of an RNA (or DNA) fragment according to the following rules.   

1. Each symbol represents a nucleotide; therefore the dot-bracket representation of a sequence fragment has the same length as the fragment;

2. Symbol '.' denotes an unpaired nucleotide;

3. Left brackets of various types '(', '[', '{', '<', etc. denote a nucleotide paired with a downstream nucleotide;
right brackets of various types ')', ']', '}', '>', etc. denote a nucleotide paired with an upstream nucleotide.
Various types of brackets are needed to represent pseudoknots;

4. Symbol '-' denotes a disordered (missing) nucleotide;

5. Symbol 'x' is used if we don't care whether a nucleotide is paired or not;

6. Symbol '*' means any number (zero included) of any symbols.

Note: dot-bracket notation must not contain '*' symbols if primary sequence of nucleotides is not empty.

More on dot-bracket notation see here
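The rules above can be illustrated with a small Python sketch that converts the dot-bracket string of a complete chain into a list of base pairs. It is an illustration only: the function name `parse_dot_bracket` is hypothetical, only the four common bracket types are handled, and search symbols such as 'x' and '*' are rejected. Fragments of a chain may legitimately contain unmatched brackets, so the sketch is meant for whole structures.

```python
# Bracket types handled in this sketch (one stack per type).
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSE_TO_OPEN = {c: o for o, c in OPEN_TO_CLOSE.items()}


def parse_dot_bracket(structure):
    """Return the sorted list of base pairs (i, j), i < j, with 1-based indices."""
    stacks = {opener: [] for opener in OPEN_TO_CLOSE}
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol in OPEN_TO_CLOSE:
            stacks[symbol].append(pos)
        elif symbol in CLOSE_TO_OPEN:
            stack = stacks[CLOSE_TO_OPEN[symbol]]
            if not stack:
                raise ValueError(f"unmatched '{symbol}' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol not in ".-":  # '.' is unpaired, '-' is a missing nucleotide
            raise ValueError(f"unexpected symbol '{symbol}' at position {pos}")
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched '{opener}' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("(((..[[..)))..]]"))
# [(1, 12), (2, 11), (3, 10), (6, 16), (7, 15)]
```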

Examples:

Comment | Example | Matching Fragments
Stem of length greater than or equal to 7 | (((((((*))))))) | (((((((..]]--{{{..)))))))
 | | (((((((..((..))...)))))))
 | | (((((((..)))))))
Kissing Loops | (*[*)*(*]*) | (---[[.)....(]]...)
 | | ((..[..))...(((.].)))
 | | ((((....[[[...)))).---.((.]]]))
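The matching of the patterns above can be reproduced with a short Python sketch that turns a dot-bracket pattern into a regular expression. It is only an illustration of the rules for '*' and 'x', not the URS search engine; the function name is hypothetical, and it is an assumption that 'x' does not match a missing nucleotide '-'.

```python
import re


def dot_bracket_pattern_to_regex(pattern):
    """Convert a dot-bracket search pattern into a compiled regular expression."""
    parts = []
    for symbol in pattern:
        if symbol == "*":
            parts.append(".*")    # any number (zero included) of any symbols
        elif symbol == "x":
            parts.append("[^-]")  # assumption: paired or unpaired, but not missing
        else:
            parts.append(re.escape(symbol))  # brackets, '.' and '-' match literally
    return re.compile("".join(parts))


stem = dot_bracket_pattern_to_regex("(((((((*)))))))")
kissing = dot_bracket_pattern_to_regex("(*[*)*(*]*)")

for fragment in ["(((((((..]]--{{{..)))))))", "(((((((..)))))))"]:
    print(fragment, stem.fullmatch(fragment) is not None)
for fragment in ["(---[[.)....(]]...)", "((..[..))...(((.].)))"]:
    print(fragment, kissing.fullmatch(fragment) is not None)
```

Using `search` instead of `fullmatch` would look for a matching fragment anywhere inside a longer dot-bracket string.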

Base pairs, Conflicts, brackets, and levels.   

Base pairs (m, n) and (p, q) have a conflict if m < p < n < q or p < m < q < n. The classic pseudoknot-free secondary structures do not contain conflicts, however many RNA structures in PDB do contain conflicting base pairs.

We say that a base pair (p, q) is a base pair of level 0 if it does not have conflicts with any base pair (m, n) such that m < p. All base pairs in pseudoknot-free secondary structures have level 0.

A base pair (p, q) has level K if there are pairs (m_0, n_0), ..., (m_{K-1}, n_{K-1}) such that for all i = 0, ..., K-1

- m_i < p;

- (m_i, n_i) has level i;

- (m_i, n_i) has a conflict with (p, q).

We use brackets '( )', '[ ]', '{ }', '< >' to denote nucleotides from base pairs of levels 0, 1, 2, 3 respectively. To denote base pairs of levels 5, 6 and 7 we use "brackets" '! ?', 'C D' and '6 9'.
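The definitions of conflicts and levels can be made concrete with a Python sketch. It rests on one reading of the definition, stated here as an assumption: the level of a base pair is the smallest level that does not occur among the conflicting base pairs starting upstream of it. The function names are hypothetical, each nucleotide is assumed to take part in at most one base pair, and only the bracket types for levels 0 to 3 are rendered.

```python
def in_conflict(pair_a, pair_b):
    """Base pairs (m, n) and (p, q) conflict if m < p < n < q or p < m < q < n."""
    (m, n), (p, q) = pair_a, pair_b
    return m < p < n < q or p < m < q < n


def pair_levels(pairs):
    """Map each base pair (p, q) to its level, processing pairs from 5' to 3'."""
    levels = {}
    for pair in sorted(pairs):
        # Levels of the already processed (upstream) pairs that conflict with this pair
        seen = {levels[prev] for prev in levels if in_conflict(prev, pair)}
        level = 0
        while level in seen:
            level += 1
        levels[pair] = level
    return levels


BRACKETS = ["()", "[]", "{}", "<>"]  # bracket types for levels 0, 1, 2, 3


def to_dot_bracket(pairs, length):
    """Render base pairs as a dot-bracket string, one bracket type per level."""
    symbols = ["."] * length
    for (p, q), level in pair_levels(pairs).items():
        if level >= len(BRACKETS):
            raise ValueError("only levels 0-3 are supported in this sketch")
        symbols[p - 1], symbols[q - 1] = BRACKETS[level]
    return "".join(symbols)


print(to_dot_bracket([(1, 10), (2, 9), (5, 14), (6, 13)], 14))  # ((..[[..))..]]
print(pair_levels([(1, 4), (2, 5), (3, 6)]))  # {(1, 4): 0, (2, 5): 1, (3, 6): 2}
```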


Stem description of ECRs   

1. Introduction

Stem description (SD) of an elementary closed region (ECR) describes stems belonging to the ECR. To be more precise, SD describes wings, i.e. complementary fragments forming stems. One may specify order of wings within the ECR, their sequences, etc.

Each wing has an ID, an integer. ID's of wings of one stem are opposite numbers; the left wings have positive IDs, and the right ones have negative IDs, e.g. two wings of a stem may have IDs 3 and -3.

Stem description of an ECR is a string of wings descriptors separated by dots. The simplest stem description is "1.-1"; the description "1.2.-1.-2" describes pseudoknot of H-type.

2. Wing descriptors

Three forms of wing descriptor are possible:

1) ID only

Examples: "1", "-3". Maximal possible ID is 99. In the current version of PDB the maximum number of stems in an ECR is 43.

2) ID([length][;Sequence])

Examples: "2(5)"; "3(5;AARAA)"; "-1(7;AA*AA)";
The pattern allows one to specify the length of the wing and (if needed) its sequence. Within a sequence one can use the IUPAC alphabet and stars; a star denotes an arbitrary sequence. E.g. in the last example any sequence of length 7 starting and ending with AA matches the pattern.

3) ID([min_length][,max_length][;Sequence])

Examples: "2(5, 7)"; "3(4, 5;AA*AA)"; "-1(7, 10;AA*GC*TT)";
The pattern is analogous to the previous one, but it specifies a range of possible lengths rather than an exact length. Either length bound may be omitted; an omitted lower bound means 2 and an omitted upper bound means 999.

3. How a stem description matches an ECR

A stem description (SD) matches an ECR if it contains a sequence of wings matching the wing descriptors of SD and, possibly, sub-ECRs and/or unpaired regions between the wings.

Currently stem descriptions must not contain descriptions of sub-ECRs.

Example 1. "1(;*AA).-1(;*GG)"

This SD matches, in particular, the following structures:

a) CCUAAAAUUAGG
   (((((..)))))

The pattern specifies a stem whose left wing ends with AA and whose right wing ends with GG; the length of the stem is not specified. Here and below the descriptors of the wings of a stem and the corresponding sequence fragments are given in the same color.

b) CCUUAAAGGGAAUUUAACCCAAAAAAUUAAGG
   ((((((.(((..[[[..))).]]]..))))))

In this case there are 4 extra wings between the wings specified in the pattern.

Example 2. "1(7;GGGURRN).2(3,5;*CC).3(3,;AC*).-2(;*AG*).-1.-3"

This pattern describes a pseudoknotted ECR. The length of stem 1 must be equal to 7, the length of stem 2 must be between 3 and 5, and the length of stem 3 must be at least 3. In particular, the pattern matches the following ECR:

GGGUAACAUGCUCCGCACUCCGGAGCUUGUUACCCGGAGU
(((((((..(((((..[[[..)))))..)))))))..]]]
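The syntax of stem descriptions can be illustrated with a Python sketch that splits a description into wing descriptors. It only parses the text and does not match it against an ECR; the names `Wing`, `parse_wing` and `parse_stem_description` are hypothetical. It is an assumption of this sketch that the defaults 2 and 999 for omitted bounds also apply when no length is given at all.

```python
import re
from typing import NamedTuple, Optional


class Wing(NamedTuple):
    wing_id: int             # positive for a left wing, negative for a right wing
    min_length: int
    max_length: int
    sequence: Optional[str]  # IUPAC pattern, possibly with '*', or None


DESCRIPTOR = re.compile(r"(-?\d+)(?:\(([^()]*)\))?")


def parse_wing(text):
    """Parse one wing descriptor such as '-3', '2(5)' or '3(3,;AC*)'."""
    match = DESCRIPTOR.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"bad wing descriptor: {text!r}")
    wing_id = int(match.group(1))
    if wing_id == 0 or abs(wing_id) > 99:
        raise ValueError(f"wing ID out of range: {wing_id}")
    low, high, sequence = 2, 999, None  # assumed defaults when nothing is given
    body = match.group(2)
    if body is not None:
        lengths, _, seq = body.partition(";")
        sequence = seq.strip() or None
        bounds = [b.strip() for b in lengths.split(",")]
        if len(bounds) == 1:
            if bounds[0]:                # exact length, e.g. '2(5)'
                low = high = int(bounds[0])
        elif len(bounds) == 2:           # range, e.g. '2(3,5)' or '3(3,)'
            low = int(bounds[0]) if bounds[0] else low
            high = int(bounds[1]) if bounds[1] else high
        else:
            raise ValueError(f"bad length specification: {lengths!r}")
    return Wing(wing_id, low, high, sequence)


def parse_stem_description(description):
    """Split a stem description on dots and parse every wing descriptor."""
    return [parse_wing(part) for part in description.split(".")]


for wing in parse_stem_description("1(7;GGGURRN).2(3,5;*CC).3(3,;AC*).-2(;*AG*).-1.-3"):
    print(wing)
```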


D. "Contained Interactions"    

Restrictions 4
Figure 10.

Restrictions in the left part of the block allow one to select the entries containing base pairs of various types. One can use three base pair classifications, namely, Saenger's classification, the Leontis-Westhof classification and the classification implemented in the DSSR program package.

Examples:

Parameter | Example | Meaning
Saenger | Saenger:VI | PDB entry contains base-pairs of 'VI' type by classification of Saenger
Leontis | Leontis:cWS:A-U | PDB entry contains A-U base-pairs of 'cWS' type by classification of Leontis & Westhof
DSSR | DSSR:cM+M:G-U | PDB entry contains G-U base-pairs of 'cM+M' type by classification from DSSR
Multiplets | Multiplets:nts=7 | PDB entry contains multiplets composed of exactly 7 nucleotides
Multiplets | Multiplets:nts=4,bps=3,wcwb>1 | PDB entry contains multiplets composed of exactly 4 nucleotides and exactly 3 base-pairs at least 1 of which is Watson-Crick or Wobble
H-bonds | H-bonds:RNA=A.N6:Protein=ANY:Dist<3.0A | PDB entry contains H-bonds between N6 atom of adenine and any protein atom at a distance less than 3 angstrom

Saenger Classification of Base-Pairs   

[Saenger W. Principles of nucleic acid structure. - 1984.]

Saenger Classification


Leontis & Westhof Classification of Base-Pairs   

[Leontis N. B., Westhof E. Geometric nomenclature and classification of RNA base pairs. RNA. 2001;7(4):499-512.]

Leontis Classification 1

Leontis Classification 2


DSSR Classification of Base-Pairs   

[http://forum.x3dna.org/rna-structures/dssr-output-base-pair-characteristics/]

Each base has three edges: W for the Watson-Crick edge, M for the major groove edge, and m for the minor groove edge. M corresponds to the Hoogsteen (or C-H) edge of the Leontis-Westhof nomenclature, and for the majority of cases (where the glycosidic bond is anti) m agrees with the 'sugar' edge. Note that in DSSR, the edges are defined purely on the geometry of the base plane as would be in a Watson-Crick base pair, and it is not related to sugar. The DSSR definition applies to RNA as well as DNA, with either syn or anti glycosidic bond.

In some boundary cases, the two bases in a pair may not be directly interacting edge-to-edge, where it is not straightforward to clearly designate which edge is involved. This is where the '.' comes in.

The DSSR notation consists of 4 characters following the pattern [ct][WMm.][+-][WMm.]. The first position is either 'c' for cis or 't' for trans orientation of the two glycosidic bonds. It is defined by the 'virtual' torsion angle tor(N1-C1'-C1'-N9) reported in the DSSR output. The second and fourth positions give the interacting edges of the two bases. The third position is either '+' or '-', and it designates the relative orientation of the two bases (flipped or normal).
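The 4-character pattern can be checked with a regular expression, as in the following Python sketch. The function name `parse_dssr_name` and the keys of the returned dictionary are hypothetical; the sketch only splits the name into its positions and does not interpret the '+' and '-' signs.

```python
import re

# One character per position: cis/trans, edge of base 1, orientation sign, edge of base 2
DSSR_NAME = re.compile(r"([ct])([WMm.])([+-])([WMm.])")


def parse_dssr_name(name):
    """Split a DSSR base pair name such as 'cM+M' into its four positions."""
    match = DSSR_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a DSSR base pair name: {name!r}")
    orientation, edge1, sign, edge2 = match.groups()
    return {
        "glycosidic": "cis" if orientation == "c" else "trans",
        "edge1": edge1,
        "sign": sign,
        "edge2": edge2,
    }


print(parse_dssr_name("cM+M"))
```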


4. Results    

The Results field consists of three expandable blocks (see Fig.11).

Results
Figure 11.

The results themselves are shown in the third (lowest) block. The first block helps one to retrace the history of the set selection; the second one allows one to specify which attributes of the entries are shown.

The selected entries are enumerated; for each entry its PDB ID, number of models (or model number if "Include All Models" was selected) and four characteristics of the entry are given. By default, these are Header, Date, Method and Resolution, and the entries are ordered by PDB ID.

One can change the characteristics to be shown and the order of entries using the block "Select Fields", see Fig.12.

Fields Selection
Figure 12.

The results of the search can be exported using "Export" button in one of 4 formats (CSV, XLS, TXT, XML).

Clicking on the PDB ID of a selected entry one can activate an entry's window, see Fig.13a.

3DView
Figure 13a.

The window has 7 informational tabs and a 3D representation of the structure provided by JSmol. The initially active tab is the summary tab; the other tabs provide various information on the entry (see Fig.13b - 13f).

On the Chains tab one can see all sequences and dot-bracket representations of secondary structures. To view a chain one needs to click on its ID (for example, Chain A) (see Fig.13b).

Chains
Figure 13b.

On the Base Pairs tab one can see a list of base pairs contained in the entry. To view a base pair one needs to click on its Pair value (for example, U-A). The color scheme: Adenine = red; Guanine = green; Cytosine = yellow; Uracil = blue; Modified nucleotide = white (see Fig.13c).

BasePairs
Figure 13c.

On the Stems tab one can see a list of stems contained in the entry. To view a stem one needs to click on its column with the text "view". The color scheme: Adenine = red; Guanine = green; Cytosine = yellow; Uracil = blue; Modified nucleotide = white (see Fig.13d).

Stems
Figure 13d.

On the Loops tab one can see a list of loops contained in the entry. To view a loop one needs to click on its column with the text "view". The color scheme: Corresponding stem = purple; Threads = white; Wings = red; Nested stem-related ECRs (their external stems) = green; Nested pseudoknotted stem-related ECRs (their external wings) = blue (see Fig.13e).

Loops
Figure 13e.

On the Pseudoknots tab one can see a list of pseudoknots contained in the entry. To view a pseudoknot one needs to click on its number (for example, No 1). The color scheme: Stems of level 0 = green; Stems of level 1 = red; Stems of level 2 = blue; Stems of level 3 = purple; Stems of level 4 = brown; Threads = white; Nested ECRs (their external stems) = gray; Nested pseudoknots (their external wings) = yellow (see Fig.13f).

Pseudoknots
Figure 13f.

The "Query History" block allows one to view the series of queries that led to the selected set of entries (see Fig.14).

History
Figure 14.


III. Statistics    

URS allows users to view statistics related to chains, base pairs, stems, loops, pseudoknots, nucleotide multiplets and RNA-protein H-bonds (see Fig.15).

Statistics
Figure 15.

One can get statistics for a previously selected set of entries, for all PDB entries or for a chosen PDB entry. The page contains 9 expandable blocks; each block corresponds to one kind of statistics.


1. Chains    

The Chains block allows one to get statistics for RNA, DNA or protein chains (see Fig.16). One can point out several types of chains.

Statistics Chains
Figure 16.

The results contain both statistics on chains and on monomers (see Fig.17a).

Statistics Chains Results
Figure 17a.

Currently, for all types of chains, if the number of chains is less than or equal to 200, the Results field contains the "Sequences" button allowing one to show/hide all sequences of the corresponding type in the considered entries (see Fig.17b).

Statistics Chains Results
Figure 17b.


2. Base Pairs    

The "Pair", "Saenger", "Leontis-Westhof", and "DSSR" toggles allow one to specify (one or more) classifications of base pairs (see Fig.18a). If two or more classifications are selected a line in the table describes base pairs that are equivalent in all selected classifications (see Fig.18b).

Base Pair and Dinucleotide Step parameters

Statistics Base Pairs
Figure 18a.

Statistics Base Pairs Results
Figure 18b.


3. Links    

Statistics Links Results
Figure 19.


4. Stems    

The Stems block allows one to view statistics on standard, closed or free stems, see Fig.20.

Statistics Stems
Figure 20.


5. Loops    

In the Loops block one first has to specify the classes of loops of interest (see Fig.21); more than one class can be selected.

Statistics Loops
Figure 21.

The output format depends on the class of loops. For hairpins one can view statistics on hairpin length, the number of wings of other stems within the hairpin (#Wings) and the number of link ends (#Link Ends), see Fig.22. The "total" column, here and below, contains the total number of the corresponding values. E.g. all considered hairpins contain 144 ends of links. Of the 14 hairpins found, 7 are classical and 7 are pseudoknotted.

Statistics Hairpins
Figure 22.

"More statistics" button helps one to view detailed statistics based on loop descriptions (see Fig.23).

Statistics Hairpins More
Figure 23.

The main tables for Bulges and Internal loops (see Fig.24a,b) are similar to the one for Hairpins. The difference is the presence of the Isolated loops field.

Statistics Bulges
Figure 24a.

Statistics Internals
Figure 24b.

For bulges, the numbers of left-handed and right-handed bulges are also given (here 270 bulges are left-handed and in 261 cases the non-empty half is the right one). For internal loops, statistics on symmetrical/asymmetrical loops are given.

The main table for multiple loops (see Fig.25) contains information on the number of faces, i.e. ends of stems (#Stem Faces) or complex closed regions (#Block Faces). The numbers of classical, isolated and pseudoknotted loops are also given.

Statistics Junctions
Figure 25.


Loop Description is a symbolic string that describes loop content according to the following rules:   

1. Loop Description is a sequence of sides and faces separated by hyphens;

2. Face can be of two types:

a) "|S|" is a face of stem;

b) "|B|" is a face of complex block.

3. Each side is a sequence of threads (represented by their length) and wings (represented by letter 'w' followed by the length of wing) separated by dots.

Examples:

1) "6" - classical hairpin of length 6;

2) "0-|B|-6" - isolated right-handed bulge of length 6;

3) "2-|S|-3-|S|-0.w2.0-|S|-1" - pseudoknotted 4-way junction of length 8. Its third side consists of two threads of zero length and one wing of length 2.
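The rules and examples above can be turned into a small Python parser. This is a sketch for illustration, not URS code: the function name and the keys of the result are hypothetical. The loop length is computed as the sum of all thread and wing lengths, which agrees with the three examples; treating a loop as pseudoknotted whenever a side contains a wing is an assumption based on the third example.

```python
def parse_loop_description(description):
    """Split a loop description into sides and faces and summarise it."""
    sides, faces = [], []
    for item in description.split("-"):
        if item in ("|S|", "|B|"):
            faces.append(item[1])  # 'S' = face of a stem, 'B' = face of a complex block
            continue
        side = []
        for element in item.split("."):
            if element.startswith("w"):
                side.append(("wing", int(element[1:])))
            else:
                side.append(("thread", int(element)))
        sides.append(side)
    length = sum(size for side in sides for _, size in side)
    has_wing = any(kind == "wing" for side in sides for kind, _ in side)
    return {"sides": sides, "faces": faces, "length": length, "pseudoknotted": has_wing}


for text in ["6", "0-|B|-6", "2-|S|-3-|S|-0.w2.0-|S|-1"]:
    info = parse_loop_description(text)
    print(text, info["length"], info["faces"], info["pseudoknotted"])
```

For the third example the sketch reports length 8 and three stem faces; together with the stem that closes the loop this gives the 4-way junction.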


6. Pseudoknots    

The Pseudoknots table (see Fig.26) contains statistics on pseudoknotted elementary closed regions, namely, statistics on pseudoknot depth (number of parental ECRs), number of different types of brackets (Rank1) and maximal number of loops per thread (Rank2).

Statistics Pseudo
Figure 26.


7. Multiplets    

The Multiplets block is devoted to statistics of nucleotide multiplets (see Fig.27). The main table presents the numbers of nucleotides, base pairs and canonical base pairs per multiplet. An additional table presents a finer partitioning of multiplets depending on the above numbers.

Statistics Nucleotide Multiplets
Figure 27.


8. RNA-Protein Interactions    

Statistics RNA-Protein H-Bonds
Figure 29.


IV. Definitions    


1. Base Pairs, Stems and Links    

We consider an RNA molecule as a sequence of nucleotides, i.e. as a sequence of letters in the alphabet {A, C, G, U}. Nucleotides in a molecule are indexed from the 5'-end to the 3'-end with integers from 1 to L, where L is the length of the sequence.

A Base Pair is a pair of nucleotides (i, j), where i < j, that form hydrogen bonds. We consider not only pairs of complementary nucleotides (A-U and G-C pairs, also known as Watson-Crick pairs) and G-U pairs (Wobble pairs), but also non-canonical pairs.

We say that a base pair (p, q) is a base pair of level 0 if it does not have conflicts with any base pair (m, n) such that m < p. All base pairs in pseudoknot-free secondary structures have level 0.

A base pair (p, q) has level K if there are pairs (m_0, n_0), ..., (m_{K-1}, n_{K-1}) such that for all i = 0, ..., K-1

- m_i < p;

- (m_i, n_i) has level i;

- (m_i, n_i) has a conflict with (p, q).

A Stem (Standard Stem) is a sequence of base pairs of the form (i, j), (i+1, j-1),..., (i+k, j-k) such that

1) k ≥ 1;

2) i+k < j-k;

3) All pairs (i + x, j - x), where x = 0, ..., k, form base pairs, and all of them are Watson-Crick pairs (WC pairs) or Wobble pairs (WB pairs).

Remark. The URS database also contains information on other types of stems, namely closed stems and free stems. All definitions below (wings, threads, etc.) are given for standard stems. However, they can be applied to stems of arbitrary type.

Pair (i, j) is called an external pair of the stem or a face. Pair (i + k, j - k) is called an internal pair of the stem.

For a stem (of any type) (i, j), (i+1, j-1),..., (i+k, j-k) the fragment [i, i + k] of an RNA chain is called a left wing of the stem, and the fragment [j - k, j] is called a right wing.
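The definition of a standard stem and its wings can be illustrated by a Python sketch that extracts maximal standard stems from a sequence and a list of base pairs. It is not the URSDB annotation procedure: the function names are hypothetical, base pairs are assumed to be given as (i, j) with i < j and 1-based indices, and a pair is treated as Watson-Crick or Wobble purely by its two letters.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick and Wobble pairs


def find_standard_stems(sequence, pairs):
    """Return maximal standard stems as triples (i, j, k), see the definition above."""
    canonical = {
        (i, j) for i, j in pairs
        if sequence[i - 1] + sequence[j - 1] in CANONICAL
    }
    stems = []
    for i, j in sorted(canonical):
        if (i - 1, j + 1) in canonical:
            continue  # (i, j) is stacked on an outer pair, so it is not a face
        k = 0
        while (i + k + 1, j - k - 1) in canonical:
            k += 1
        if k >= 1:    # condition 1: a stem contains at least two base pairs
            stems.append((i, j, k))
    return stems


def wings(stem):
    """Return the left wing [i, i + k] and the right wing [j - k, j] of a stem."""
    i, j, k = stem
    return (i, i + k), (j - k, j)


sequence = "CCUAAAAUUAGG"
pairs = [(1, 12), (2, 11), (3, 10), (4, 9), (5, 8)]  # structure (((((..)))))
for stem in find_standard_stems(sequence, pairs):
    print(stem, wings(stem))  # (1, 12, 4) ((1, 5), (8, 12))
```

Condition 2 (i+k < j-k) holds automatically here, because every element of `canonical` is a pair (i, j) with i < j.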

A Thread (or unpaired region) is a fragment [i, j], such that

1) There is no base pair (k, t), such that i ≤ k ≤ j or i ≤ t ≤ j.

2) There are base pairs containing nucleotides i-1 and j+1.

For technical reasons we allow threads of zero length between two wings of stems; the zero length thread is denoted by [i + 1, i], where i is the index of the last nucleotide of the previous wing.

A Tower is a set of N stems of any type such that their wings are located on the chain in the following order: 1L,2L,...,NL,NR,...,2R,1R, where iL is the left wing of i-th stem and iR is the right wing of i-th stem.

Base pairs (m, n) and (p, q) have a conflict if m < p < n < q or p < m < q < n. A base pair has a conflict with a stem if it has a conflict with a base pair from the stem.

A Link is a base pair that does not belong to any stem. A link is fully coordinated (or coordinated) if it does not have conflicts. A link is stem-coordinated (or weakly coordinated) if it does not have conflicts with stems but may have conflicts with other links. A link is stem-independent (or independent) if it has a conflict with a stem.


2. Elementary Closed Regions, Pseudoknots, Signatures and Descriptions   

An Elementary Closed Region (ECR) is a minimal region [i, j] where i < j, such that:

1) There are no base pairs (k, l) such that (i ≤ k ≤ j; l > j) or (k < i; i ≤ l ≤ j);

2) There is no l such that i < l < j and both regions [i,...,l] and [l+1,...,j] satisfy the condition 1);

3) There are base pairs (i, k) and (l, j); possibly, k = j and i = l.

A pair of positions (i,j) is called a face of the ECR [i,j]. Note, that if the positions i and j are paired and belong to a stem then the face of the ECR coincides with the face of the stem.

An ECR [k, l] is a sub-ECR of an ECR [i, j] if i < k < l < j and there is no other ECR [m, n] such that i < m < k < l < n < j.

An ECR is a pseudoknot (or pseudoknotted) if base pairs from its stems have conflicts. Otherwise ECR is called pseudoknot-free or classical.

The classification of pseudoknots used in URSDB is based on the notion of signature. The classification is close to topological classification [Andersen JE, Penner RC, Reidys CM, Waterman MS. Topological classification and enumeration of RNA structures by genus. J Math Biol. 2013 Nov;67(5)]. The main difference between the classifications is that our classification takes into account only stems.

Consider all stems of an ECR and index them with Latin letters according to the positions of their wings from the 5'-end to the 3'-end. The left wing of a stem is denoted with a small letter, e.g. a, the right wing is denoted with the corresponding capital letter, e.g. A, and the stem is denoted with two letters, e.g. aA.

A full signature of an ECR is a sequence of its wing letters given according to the wings positions on the chain from 5'- to 3'-end.


Example 1. See Fig.30a,b. Let ECR [10, 70] contain three stems, ([10, 15]; [65, 70]), ([20, 25]; [45, 50]), ([30, 35]; [55, 60]); here [10, 15] and [65, 70] are wings of the stem ([10, 15]; [65, 70]), etc. Then the stem ([10, 15]; [65, 70]) is the aA stem, the stem ([20, 25]; [45, 50]) is the bB stem, and the stem ([30, 35]; [55, 60]) is the cC stem. The full signature of the ECR is abcBCA. The fragment [20, 60] is a sub-ECR of the initial ECR.

Signature Ex.1a
Signature Ex.1b

Figure 30a. Positions of wings within the ECR from the Example 1; the stem aA is given in red; the stem bB is given in blue, and the stem cC is given in green. A fragment [20, 60] is a sub-ECR of the initial ECR.
Figure 30b. Schematic representation of the secondary structure of the ECR from the Example 1.

Example 2. See Fig.31a,b. Let an ECR contain four stems, ([10, 15]; [70, 75]), ([20, 25]; [50, 55]), ([30, 35]; [40, 45]), ([60, 65]; [80, 85]); here [10, 15] and [70, 75] are the wings of the stem ([10, 15]; [70, 75]), etc. Then the stem ([10, 15]; [70, 75]) is the aA stem, the stem ([20, 25]; [50, 55]) is the bB stem, the stem ([30, 35]; [40, 45]) is the cC stem, and the stem ([60, 65]; [80, 85]) is the dD stem. The full signature of the ECR is abcCBdAD. The fragment [20, 55] is a sub-ECR of the initial ECR, and the fragment [30, 45] is a sub-ECR of the ECR [20, 55].

Signature Ex.2a Signature Ex.2b

Figure 31a. Positions of wings within the ECR from the Example 2; the stem aA is given in green; the stems bB and cC are given in blue, and the stem dD is given in red. A fragment [20, 55] is a sub-ECR of the initial ECR, and a fragment [30, 45] is a sub-ECR of the ECR [20, 55].
Figure 31b. Schematic representation of the secondary structure of the ECR from the Example 2.

Example 3. See Fig.32a,b. Let an ECR contain six stems, ([2, 7]; [90, 95]), ([10, 15]; [80, 85]), ([20, 25]; [50, 55]), ([30, 35]; [40, 45]), ([60, 65]; [120, 125]), ([70, 75]; [110, 115]); here [2, 7] and [90, 95] are the wings of the stem ([2, 7]; [90, 95]), etc. Then the stem ([2, 7]; [90, 95]) is the aA stem, the stem ([10, 15]; [80, 85]) is the bB stem, the stem ([20, 25]; [50, 55]) is the cC stem, the stem ([30, 35]; [40, 45]) is the dD stem, the stem ([60, 65]; [120, 125]) is the eE stem, and the stem ([70, 75]; [110, 115]) is the fF stem. The full signature of the ECR is abcdDCefBAFE. The fragment [20, 55] is a sub-ECR of the initial ECR, and the fragment [30, 45] is a sub-ECR of the ECR [20, 55].

Signature Ex.3a

Figure 32a. Positions of wings within the ECR from the Example 3; the stems aA and bB are given in green; the stems cC and dD are given in blue, and the stems eE and fF are given in red. A fragment [20, 55] is a sub-ECR of the initial ECR, and a fragment [30, 45] is a sub-ECR of the ECR [20, 55].
Signature Ex.3b
Figure 32b. Schematic representation of the secondary structure of the ECR from the Example 3.
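
The full signatures of Examples 1 - 3 can be reproduced with a short Python sketch. It is only an illustration of the definition, not the code used to build URSDB: stems are given as pairs of wings, letters are assigned by the position of the left wing, and the function name full_signature is hypothetical.

```python
def full_signature(stems):
    """Full signature of an ECR from its stems.

    Each stem is ((l_start, l_end), (r_start, r_end)).
    Stems are lettered a, b, c, ... by the start of their left wing;
    the wing letters are then listed in 5' to 3' order.
    """
    ordered = sorted(stems, key=lambda stem: stem[0][0])
    wings = []
    for index, (left, right) in enumerate(ordered):
        letter = chr(ord("a") + index)
        wings.append((left[0], letter))           # left wing: small letter
        wings.append((right[0], letter.upper()))  # right wing: capital letter
    return "".join(letter for _, letter in sorted(wings))

example_1 = [((10, 15), (65, 70)), ((20, 25), (45, 50)), ((30, 35), (55, 60))]
example_2 = [((10, 15), (70, 75)), ((20, 25), (50, 55)),
             ((30, 35), (40, 45)), ((60, 65), (80, 85))]
example_3 = [((2, 7), (90, 95)), ((10, 15), (80, 85)), ((20, 25), (50, 55)),
             ((30, 35), (40, 45)), ((60, 65), (120, 125)), ((70, 75), (110, 115))]

print(full_signature(example_1))  # abcBCA
print(full_signature(example_2))  # abcCBdAD
print(full_signature(example_3))  # abcdDCefBAFE
```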

An upper signature of the ECR is a string obtained from its full signature by

1) deletion of fragments corresponding to sub-ECRs;

2) "renaming" of the letters preserving their order to obtain a string containing all letters of a proper beginning of the alphabet.


Example 4.

The upper signature of the ECR from the Example 1 is aA; the fragment bcBC corresponding to the sub-ECR [20, 60] was deleted from the full signature abcBCA.

The upper signature of the ECR from the Example 2 is abAB. Firstly, the fragment bcCB corresponding to the sub-ECR [20, 55] was deleted from the full signature abcCBdAD; the obtained string is adAD. Then we replace d and D with b and B obtaining abAB.

Analogously, the upper signature of the ECR from the Example 3 is abcdBADC.
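
The upper signatures of Example 4 can be checked with the following Python sketch. It works on the stem level only and makes one simplifying assumption that is not stated in the definitions: a sub-ECR is taken to be a maximal proper fragment of the full signature that contains both wings of each of its stems. The function names are hypothetical.

```python
def is_closed(fragment):
    """True if the fragment contains both wings of every stem it mentions."""
    lefts = {c for c in fragment if c.islower()}
    rights = {c.lower() for c in fragment if c.isupper()}
    return bool(fragment) and lefts == rights

def rename(signature):
    """Rename letters, preserving order, to a proper beginning of the alphabet."""
    mapping = {}
    result = []
    for c in signature:
        if c.lower() not in mapping:
            mapping[c.lower()] = chr(ord("a") + len(mapping))
        new = mapping[c.lower()]
        result.append(new if c.islower() else new.upper())
    return "".join(result)

def upper_signature(full):
    """Delete sub-ECR fragments (assumed: maximal proper closed fragments)."""
    n = len(full)
    kept = []
    p = 0
    while p < n:
        end = None
        # Look for the longest closed fragment starting at p, excluding
        # the whole signature itself.
        for q in range(n, p + 1, -1):
            if (p, q) != (0, n) and is_closed(full[p:q]):
                end = q
                break
        if end is None:
            kept.append(full[p])
            p += 1
        else:
            p = end  # skip the sub-ECR fragment
    return rename("".join(kept))

print(upper_signature("abcBCA"))        # aA
print(upper_signature("abcCBdAD"))      # abAB
print(upper_signature("abcdDCefBAFE"))  # abcdBADC
```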


Stems xX, yY, zZ,... are connected within an upper signature if both the word xyz... and the inverted word ...ZYX are subwords of the upper signature.

A signature (or a reduced signature) of the ECR is a string obtained from its upper signature by

1) deletion of all letters except x and X (the first letter of the left part and the last letter of the right part) corresponding to chains of connected stems;

2) "renaming" of the letters preserving their order to obtain a string containing all letters of a proper beginning of the alphabet.


Example 5.

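The signatures of the ECRs from Examples 1 and 2 coincide with their upper signatures. The signature of the ECR from Example 3 is abAB: in its upper signature abcdBADC the stems aA and bB are connected, as are the stems cC and dD, so only a, A, c and C are kept and then renamed. It coincides with the signature of the ECR from Example 2.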

Typical signatures

  a) H-knot: abAB;

  b) Kissing Loops: abAcBC;

  c) Triple knot: abcABC.
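
The reduction from an upper signature to a signature can be sketched in Python as follows. The sketch treats a stem yY as connected to the preceding stem xX when y directly follows x and X directly follows Y in the upper signature, which is our reading of the definition of connected stems; the function names are hypothetical.

```python
def rename(signature):
    """Rename letters, preserving order, to a proper beginning of the alphabet."""
    mapping = {}
    result = []
    for c in signature:
        if c.lower() not in mapping:
            mapping[c.lower()] = chr(ord("a") + len(mapping))
        new = mapping[c.lower()]
        result.append(new if c.islower() else new.upper())
    return "".join(result)

def reduced_signature(upper):
    """Keep only the first stem of every chain of connected stems."""
    position = {c: i for i, c in enumerate(upper)}
    dropped = set()
    for i in range(1, len(upper)):
        x, y = upper[i - 1], upper[i]
        # y continues a chain if xy and YX are both subwords.
        if (x.islower() and y.islower()
                and position[x.upper()] == position[y.upper()] + 1):
            dropped.add(y)
    kept = [c for c in upper if c.lower() not in dropped]
    return rename("".join(kept))

print(reduced_signature("abcdBADC"))  # abAB (Example 3)
print(reduced_signature("abAB"))      # abAB (H-knot is already reduced)
print(reduced_signature("abAcBC"))    # abAcBC (Kissing Loops)
print(reduced_signature("abcABC"))    # abcABC (Triple knot)
```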


3. Stems and Loops   

Here and below, we assume a fixed chain with a given RNA secondary structure on it. The chain may be considered as an alternating sequence of threads and wings. To preserve generality, we assume that the wings of a fictitious "external stem" are added before the first and after the last nucleotide of the chain.

Each stem is associated with the part of the chain that is internal to it, i.e. the part between the end of the left wing and the start of the right wing of this stem, in other words, between the nucleotides that form the internal pair of the stem. For the fictitious external stem the internal part is the entire original RNA sequence.

Let H be a stem and (i, j) its internal pair.

Definition 1. A position t of the chain is internal to stem H (synonym: lies inside H) if i < t < j. A fragment of the chain is internal to stem H (synonym: lies inside H) if all its positions are internal to H. A stem H1 lies inside stem H (is internal to H) if all the positions of its wings are internal to H.

Definition 2. A position t of the chain belongs to stem H if it is internal to H and there is no stem H1 lying inside H such that x < t < y, where (x, y) is the external pair (face) of H1.

Definition 3. Loop of the stem H is the set of all positions that belong to stem H.

Each position that is not involved in a bond belongs to at least one loop, normal or external. If a position of a thread (wing) belongs to a loop, then the entire thread (wing) belongs to this loop.

If the structure is pseudoknot-free, each loop in terms of Definition 3 is a loop in terms of the Nearest Neighbour Model and vice versa. In addition, each thread belongs to exactly one loop (possibly external), and no wing belongs to any loop. For pseudoknotted structures these properties do not hold in general.
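
Definitions 1 - 3 can be illustrated with a minimal Python sketch that lists the positions of the loop of a stem. Stems are given as pairs of wings, the function names are hypothetical, and the data are the three stems of Example 1 from the previous subsection.

```python
def lies_inside(inner, outer):
    """Definition 1: all wing positions of `inner` are internal to `outer`."""
    (_, i), (j, _) = outer          # (i, j) is the internal pair of `outer`
    (left_start, _), (_, right_end) = inner
    return i < left_start and right_end < j

def loop_positions(stem, stems):
    """Definitions 2 and 3: positions that belong to `stem`."""
    (_, i), (j, _) = stem
    # Faces (external pairs) of all stems lying inside `stem`.
    faces = [(s[0][0], s[1][1]) for s in stems
             if s != stem and lies_inside(s, stem)]
    return [t for t in range(i + 1, j)
            if not any(x < t < y for x, y in faces)]

stem_a = ((10, 15), (65, 70))
stem_b = ((20, 25), (45, 50))
stem_c = ((30, 35), (55, 60))
stems = [stem_a, stem_b, stem_c]

print(loop_positions(stem_a, stems))  # positions 16..20 and 60..64
print(loop_positions(stem_b, stems))  # positions 26..44
```

In this pseudoknotted example the loop of the bB stem contains the left wing [30, 35] of the cC stem, because the cC stem does not lie inside the bB stem.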


4. Loop Structure    

Definition 4. Let H be a stem and (u, v) its internal pair. A region [i, j] is called an H-related ECR (in the general case, a stem-related ECR or S-ECR) if

1) [i, j] lies inside H;

2) There is no pair (k, t) such that (i ≤ k ≤ j < t < v) or (u < k < i ≤ t ≤ j);

3) There are pairs (i, k) and (t, j), where k ≤ j and i ≤ t;

4) There is no region [i', j'] other than [i, j] such that i' ≤ i < j ≤ j' and the region [i', j'] satisfies conditions 1) - 3).

Pair (i, j) is called the face of the stem-related ECR.

Statement 1. Let Z = [f, g] be an H-related ECR and let (u, v) be the internal pair of stem H. Then:

1) The entire section Z lies inside H.

2) A wing lies entirely inside Z or lies entirely outside Z.

3) Z starts with a left wing of a stem H1, lying inside H, and ends with a right wing of a stem H2, lying inside H.

4) If H1 and H2 are one and the same stem, the face (f, g) of region Z is the face of a stem. Otherwise, f is the beginning of the left wing of H1 and g is the end of the right wing of H2.

Proof - follows from Definition 4 and the fact that wings do not overlap.

Definition 5. Let Z be a stem-related ECR. Region Z is called simple if its face is the face of a stem and called pseudoknotted otherwise. Pseudoknotted S-ECRs are also called blocks.

Statement 2. Let H be a stem and (u, v) its internal pair. Then:

1) There are no two H-related ECRs that overlap.

2) Let position t lie inside H. The position t does not belong to H if and only if t lies inside an H-related ECR Z (i.e., lies inside Z, but not in its face).

Proof - follows from Definitions 1, 2 and 4.

Definition 6. Let H be a stem and (u, v) its internal pair. Let (s_1, t_1), ..., (s_n, t_n) be the faces of all H-related ECRs, with s_1 < t_1 < ... < s_n < t_n. For convenience let t_0 = u and s_(n+1) = v. Let k be an integer, 1 ≤ k ≤ n+1. Then the k-th side of the loop of H is the fragment [t_(k-1) + 1, s_k - 1].

If s_k = t_(k-1) + 1, the k-th side of the loop of H has a length of zero.

Statement 3. Let H be a stem and (u, v) its internal pair. Let (s_1, t_1), ..., (s_n, t_n) be the faces of all H-related ECRs, with s_1 < t_1 < ... < s_n < t_n. For convenience let t_0 = u and s_(n+1) = v. Then the loop of H is the union of all the faces of H-related ECRs and the sides located between them.

Proof - follows from Statement 2.
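
The sides of Definition 6 can be computed with a short Python sketch. The function names are hypothetical; the input is the internal pair (u, v) of the stem and the list of faces of its stem-related ECRs.

```python
def loop_sides(internal_pair, faces):
    """Sides of the loop of a stem as (start, end) fragments.

    `faces` are the faces (s, t) of all stem-related ECRs of the stem.
    The k-th side runs from the position after the previous face
    (or after u) to the position before the next face (or before v).
    """
    u, v = internal_pair
    sides = []
    previous_end = u
    for s, t in sorted(faces):
        sides.append((previous_end + 1, s - 1))
        previous_end = t
    sides.append((previous_end + 1, v - 1))
    return sides

def side_length(side):
    """Length of a side; an empty fragment (start > end) has length zero."""
    start, end = side
    return max(0, end - start + 1)

# Stem ([10, 15]; [65, 70]) with one stem-related ECR whose face is (20, 60).
sides = loop_sides((15, 65), [(20, 60)])
print(sides)                            # [(16, 19), (61, 64)]
print([side_length(s) for s in sides])  # [4, 4]
```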

Statement 4. Let H be a stem and (u, v) its internal pair, and let a position x belong to a side (t, s) of the loop of H. Then:

1) The position x is not involved in any bond or belongs to a wing of a stem H', the other wing of which lies outside H.

2) If x belongs to a thread (wing), then this entire thread (wing) belongs to the same side of the loop of H.

Proof - follows from the definition of a side and the fact that wings do not overlap.

Statements 3 and 4 describe the possible structures of loops. Note that in case of pseudoknot-free structures, all closed regions are simple and each side consists of a single thread. Therefore, we give the following definition:

Definition 7. A loop is called classical if it does not contain wings and faces of blocks. A loop is called isolated if it does not contain wings and called pseudoknotted otherwise.

A stem is called pseudoknotted if its loop is pseudoknotted.

Let us extend the Matthews-Turner classification of loops to the generalization introduced above, based on the number of faces included in the loop. Note that in this case the faces can be both faces of stems (in other words, simple closed regions) and faces of blocks (complex closed regions).

Definition 8. A loop is called a hairpin if it does not contain faces and therefore has a single side. A loop is called internal if it contains exactly one face and therefore has two sides. A loop is called a multiple junction if it contains more than one face and therefore more than two sides.

An internal loop is called a bulge if one of its sides has a length of zero.

This classification covers both normal and external (belonging to the "external" stem) loops.
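
Definition 8 translates directly into a small Python sketch. The function name classify_loop is hypothetical; the input is the list of side lengths of a loop, so that a loop with n faces has n + 1 sides.

```python
def classify_loop(side_lengths):
    """Classify a loop by its number of faces (number of sides minus one)."""
    if not side_lengths:
        raise ValueError("a loop has at least one side")
    faces = len(side_lengths) - 1
    if faces == 0:
        return "hairpin"
    if faces == 1:
        # An internal loop with a zero-length side is a bulge.
        return "bulge" if 0 in side_lengths else "internal"
    return "multiple junction"

print(classify_loop([4]))        # hairpin
print(classify_loop([3, 2]))     # internal
print(classify_loop([0, 5]))     # bulge
print(classify_loop([1, 0, 2]))  # multiple junction
```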


5. Dictionary    

A Closed Stem is a sequence of base pairs of the form (i, j), (i+1, j-1),..., (i+k, j-k) such that

1) k ≥ 1;

2) i+k < j-k;

3) All pairs (i + x, j - x), where x = 0, ..., k, form base pairs, where (i, j) and (i + k, j - k) are Watson-Crick pairs (WC pairs) or Wobble pairs (WB pairs).

A Free Stem is a sequence of base pairs of the form (i, j), (i+1, j-1),..., (i+k, j-k) such that

1) k ≥ 1;

2) i+k < j-k;

3) All pairs (i + x, j - x), where x = 0, ..., k, form base pairs of any type.
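
The two definitions can be checked with a minimal Python sketch. The pair-type labels "WC", "WB" and "other" are hypothetical placeholders for the base pair annotation; they are not the labels used in URSDB or DSSR.

```python
TERMINAL_TYPES = {"WC", "WB"}  # Watson-Crick and Wobble pairs (assumed labels)

def is_free_stem(pairs):
    """Conditions 1) - 3) of a Free Stem.

    `pairs` is a list of base pairs (i, j), (i+1, j-1), ..., (i+k, j-k)
    ordered from the outermost pair inward; any pair type is allowed.
    """
    k = len(pairs) - 1
    if k < 1:
        return False
    i, j = pairs[0]
    if any(pair != (i + x, j - x) for x, pair in enumerate(pairs)):
        return False
    return i + k < j - k

def is_closed_stem(pairs, types):
    """A Closed Stem additionally needs WC or WB pairs at both ends."""
    return (is_free_stem(pairs)
            and types[0] in TERMINAL_TYPES
            and types[-1] in TERMINAL_TYPES)

pairs = [(10, 30), (11, 29), (12, 28)]
print(is_free_stem(pairs))                             # True
print(is_closed_stem(pairs, ["WC", "other", "WB"]))    # True
print(is_closed_stem(pairs, ["other", "WC", "WC"]))    # False
```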

Some bases form H-bonds with more than one other base. Some of these bases are parts of multiplets.

A Multiplet is a connected graph (V,E) such that:

1) V is a set of nucleotides;

2) E is a set of base pairs of any type between nucleotides from V.
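
Since a multiplet is defined as a connected graph, candidate multiplets can be found as connected components of the base pair graph. The Python sketch below is an illustration only: the function names are hypothetical, and the threshold of at least three nucleotides is our assumption (the definition above does not state a minimal size).

```python
from collections import defaultdict

def connected_components(base_pairs):
    """Connected components of the graph whose edges are base pairs."""
    graph = defaultdict(set)
    for a, b in base_pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

def multiplets(base_pairs, min_size=3):
    """Components with at least `min_size` nucleotides (assumed threshold)."""
    return [c for c in connected_components(base_pairs) if len(c) >= min_size]

base_pairs = [(1, 10), (10, 20), (2, 9), (30, 40)]
print(connected_components(base_pairs))  # [[1, 10, 20], [2, 9], [30, 40]]
print(multiplets(base_pairs))            # [[1, 10, 20]]
```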


V. URSDB    

URSDB (Universe of RNA Structures Database) is a relational database containing information on all PDB documents containing RNA chains. It is updated each month. URSDB consists of 49 tables that can be divided into four groups:

1) Tables containing processed PDB data;

2) Annotation of H-bonds, stems, etc. based on results of DSSR program package;

3) Other data created by the DSSR package (not in use);

4) Annotation of loops, one-strand regions, pseudoknots, elementary closed regions, and multiplets based on proposed approach.

URSDB, in particular, contains information on the following elements of RNA structures:

- BasePairs of various types;

- Stems of various types;

- Wings, i.e. complementary regions forming stems;

- Links, i.e. base pairs that do not belong to any stem;

- Loops of various types;

- Threads, i.e. one-strand regions;

- Relations between Loops, Wings and Threads;

- Elementary Closed Regions;

- Pseudoknots and their types and ranks;

- Nucleotide Multiplets.


VI. Example of URS session   

To get an impression of how URS works, let's try to solve the following problem: what types of pseudoknots occur in ribozymes?

First of all we need to select all ribozymes from the database. To that end we go to the Structures page and add "ribozyme" to the query form as a keyword (see Fig.33a).

Using example a
Figure 33a.

After we are sure that our parameter is added to the query form, we press the Search button (see Fig.33b).

Using example b
Figure 33b.

We can see that 160 documents with ribozymes were found (see Fig.33c).

Using example c
Figure 33c.

To see pseudoknots from the selected subset, we go to the Statistics page, ensure that the Selected Set radio-button is checked, choose the Pseudoknots field and press the Show button (see Fig.33d,e).

Using example d
Figure 33d.

Using example e
Figure 33e.

Now we can see that only 13 types of pseudoknots occur in ribozymes and that most cases are abAB pseudoknots. To see the full list of selected pseudoknots, we need to click on the "List" block (under the Statistics block), see Fig.33f.

Using example f
Figure 33f.

Now we can see the full list of selected pseudoknots. To see the 3D representation of a pseudoknot, we need to click on its signature (see Fig.33h,i).

Using example h
Figure 33h.

Using example i
Figure 33i.