Pairwise Structure Alignment
What is Structure Alignment?
Structure alignment is a process wherein molecular structures of two or more biopolymers (e.g., proteins or large ribonucleic acids) are compared to establish equivalences in their three-dimensional shapes. Since these comparisons are commonly done on protein structures, this discussion will focus on proteins.
The objective of structure alignment is to identify the maximal set of corresponding pairs of amino acid residues that gives a good structural match when the structures are overlaid, i.e., superposed. Only the positions of the protein’s backbone C-alpha atoms and/or the locations of secondary structural elements are considered in this alignment. The amino acid residue type is ignored.
This tool presents options for structure alignment of a pair of protein chains (either within the same structure or from different structures). This process is called pairwise structure alignment. If molecular structures of three or more proteins are aligned, the process is called multiple structure alignment (reviewed in Ma and Wang, 2014).
A few different types of structural alignments and their rationales are described here.
Rigid Body Alignment
In a rigid body alignment, the relative orientations and positions of atoms within each structure remain fixed during the alignment process. In the resulting superposition, only the overall shapes of the structures are aligned. Rigid body alignments are well-suited for identification of structural equivalences between proteins that are closely evolutionarily related and thus have similar shapes.
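The defining property of a rigid body alignment, that positions within each structure stay fixed relative to one another, can be checked numerically. The following is a minimal sketch, not part of the tool: the coordinates, the rotation angle, and the shift are made-up values, and the function names are hypothetical. It applies one rotation and one translation to every atom and confirms that all interatomic distances are unchanged.

```python
import math

def rotate_z(point, angle_deg):
    """Rotate a 3D point about the z-axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    x, y, z = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def rigid_transform(coords, angle_deg, shift):
    """Apply the same rotation and the same translation to every atom."""
    moved = []
    for p in coords:
        x, y, z = rotate_z(p, angle_deg)
        moved.append((x + shift[0], y + shift[1], z + shift[2]))
    return moved

def pairwise_distances(coords):
    """All interatomic distances, listed in a fixed order."""
    return [math.dist(coords[i], coords[j])
            for i in range(len(coords))
            for j in range(i + 1, len(coords))]

# Hypothetical C-alpha positions (in Angstrom) for a four-residue fragment.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (9.1, 4.2, 1.0)]
moved = rigid_transform(ca, angle_deg=40.0, shift=(10.0, -2.0, 5.0))

before = pairwise_distances(ca)
after = pairwise_distances(moved)
unchanged = all(math.isclose(b, a, abs_tol=1e-9) for b, a in zip(before, after))
print(unchanged)
```

Because the internal geometry cannot change, a rigid body superposition can only succeed when the two structures already have similar overall shapes.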
In a flexible structure alignment, relative mobility between domains or subdomains in each structure is accommodated. When superposition by rigid alignment alone does not yield meaningful results, introducing flexibility into the structural alignment is useful for two main reasons:
- It helps compare two protein chains that have adopted different conformational states, e.g., due to post-translational modifications such as phosphorylation or interaction with other proteins/ligands.
- It also helps identify conserved regions in proteins that may have a distant evolutionary relationship. For example, one of these proteins may contain extra loops or truncations that alter the relative orientation of different domains in the structures.
Most structure alignment algorithms assume that the structural units of two similar proteins appear in the same order (in the N-terminal to C-terminal direction) within their sequences. However, this assumption may not always be true. There are many examples of natural and designed proteins where the spatial arrangement of secondary structural elements or protein domains is maintained but the protein backbone connections between these structural elements are different - i.e., the proteins have different topologies.
One such example is circular permutation, where the relative locations of structural elements (and the N- and C-termini) within two proteins are different, but their overall shape and structure (e.g., secondary structural elements and their relative orientations) are conserved.
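The idea of a circular permutation can be illustrated on ordered lists of structural elements. The sketch below is a toy example only and is not the CE-CP algorithm: the element labels are invented, and `permutation_offset` is a hypothetical helper. It duplicates the first list and searches for the second list inside it, which finds the rotation point if one exists.

```python
def permutation_offset(elements_a, elements_b):
    """Return k such that elements_b == elements_a[k:] + elements_a[:k].

    Returns None when elements_b is not a circular permutation of elements_a.
    """
    n = len(elements_a)
    if n != len(elements_b):
        return None
    doubled = elements_a + elements_a  # every rotation is a window of this list
    for k in range(n):
        if doubled[k:k + n] == elements_b:
            return k
    return None

# Hypothetical secondary structure elements in N- to C-terminal order.
protein_a = ["b1", "b2", "h1", "b3", "b4", "h2"]
protein_b = ["b3", "b4", "h2", "b1", "b2", "h1"]

print(permutation_offset(protein_a, protein_b))
```

A sequence-order-dependent method would match only one of the two halves in such a pair, because the other half appears in the opposite order along the chain.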
When is Structure Alignment useful?
Structure alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Pairwise protein structure comparison can be used for analysis of conformational changes on ligand binding, analysis of structural variation between proteins within an evolutionary family, and identification of common structural domains.
Structure Alignment Interface
The structure alignment tool can be accessed from the “Analyze” options in the left-hand menu of the home page or from the top bar visible on any RCSB PDB page.
The interface allows you to specify two polymer chains to be compared, either by listing the PDB and Chain IDs of the polymer chains or by uploading your own files with coordinates for the polymer chains. When specifying PDB IDs, the options for selecting a chain ID become available. The appropriate chain IDs to compare must be specified. Note that the chain IDs are case-sensitive and correspond to the label_asym_id in the file. The author-defined chain IDs appear in square brackets in the same box.
If only part of a polymer chain should be compared, segments of the polymer chains can be chosen by specifying residue ranges using the PDB residue numbers (sequential numbers from 1 to N, i.e., label_seq_id). Note that if you are matching residues based on the author-specified residue numbers (e.g., as reported in manuscripts), you may first have to convert them to label_seq_id. If no range is specified, all residues of the chain are included in the alignment by default.
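Converting author residue numbers to label_seq_id can be done by reading both columns from the atom_site category of a PDBx/mmCIF file. The following is a minimal sketch under stated assumptions: the sample records are invented, the parser only handles simple whitespace-separated rows without quoted values, and `auth_to_label` is a hypothetical helper. For real files a full mmCIF parser is the safer choice.

```python
# Invented atom_site excerpt: author numbering starts at -1, label numbering at 1.
SAMPLE = """\
loop_
_atom_site.group_PDB
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.auth_seq_id
_atom_site.auth_asym_id
ATOM CA MET A 1 -1 H
ATOM CA GLY A 2 0 H
ATOM CA SER A 3 1 H
ATOM CA LYS A 4 2 H
"""

def auth_to_label(cif_text, label_asym_id):
    """Map author residue numbers (as strings) to label_seq_id for one chain."""
    columns = []
    mapping = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if line.startswith("_atom_site."):
            # Collect column names in the order they are declared.
            columns.append(line.split(".", 1)[1])
        elif line.startswith(("ATOM", "HETATM")) and columns:
            row = dict(zip(columns, line.split()))
            # Non-polymer atoms carry "." instead of a label_seq_id.
            if row["label_asym_id"] == label_asym_id and row["label_seq_id"] != ".":
                mapping[row["auth_seq_id"]] = int(row["label_seq_id"])
    return mapping

mapping = auth_to_label(SAMPLE, "A")
print(mapping["1"])  # label_seq_id for author residue number 1
```

In this invented example, the residue that a manuscript would call residue 1 has label_seq_id 3, so a range typed with author numbers would select the wrong segment.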
When selecting polymers from PDB structures, the polymer chains selected for alignment must contain at least 10 residues and the structure must contain the coordinates of at least the C-alpha backbone atoms.
When uploading coordinate files for structural alignment, the recognized file formats are PDBx/mmCIF (must have a .cif extension), Binary CIF (must have a .bcif extension), and PDB (must have a .pdb or .ent extension). Files in any of these formats that are compressed with gzip (.gz) are also accepted.
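Before uploading, a file name can be checked against these extension rules. The sketch below is illustrative only: `detect_format` is a hypothetical helper, and treating extensions case-insensitively is an assumption that the text above does not confirm.

```python
import os

# Extensions named in the text above, mapped to their formats.
RECOGNIZED = {
    ".cif": "PDBx/mmCIF",
    ".bcif": "Binary CIF",
    ".pdb": "PDB",
    ".ent": "PDB",
}

def detect_format(filename):
    """Return (format_name, is_gzipped), or None if the extension is not recognized."""
    name = filename.lower()  # assumption: case-insensitive matching
    gzipped = name.endswith(".gz")
    if gzipped:
        name = name[:-3]  # strip ".gz" and inspect the inner extension
    extension = os.path.splitext(name)[1]
    if extension in RECOGNIZED:
        return RECOGNIZED[extension], gzipped
    return None

for candidate in ["model.cif", "model.bcif", "pdb1abc.ent.gz", "notes.txt"]:
    print(candidate, detect_format(candidate))
```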
When both chains are selected, the Compare button becomes available to launch the structure alignment.
A number of algorithms are provided to perform pairwise structural alignments. Brief descriptions of these algorithms are included below:
|jFATCAT-rigid||The structure alignment algorithm Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists (FATCAT) allows for flexible protein structure comparison (Ye and Godzik, 2003; Li et al., 2020). This tool uses the Java port of the original FATCAT. The rigid flavor of the algorithm runs a rigid-body superposition that only considers alignments with matching sequence order. For most structures, the performance of this structure alignment is similar to that of CE.|
|jFATCAT-flexible||The flexible flavor of FATCAT introduces twists (hinges) between different parts of the superposed proteins so that these parts are aligned independently. This makes it possible to effectively compare protein structures that undergo conformational changes in specific parts of the molecule such that global (rigid body) superposition cannot capture the underlying similarity between domains, for example when the two polymers being compared are in different functional forms (e.g., bound to partner proteins/ligands), were crystallized under different conditions, or carry mutations. The downside of this approach is that it can lead to false positive matches in unrelated structures, so results should be reviewed carefully.|
|jCE||The original Combinatorial Extension (CE) algorithm (Shindyalov and Bourne, 1998) works by identifying segments of the two structures with similar local structure and then combining those segments to align the maximum number of residues while keeping the root mean square deviation (RMSD) between the pair of structures low. This Java port of the original CE uses a rigid-body alignment algorithm. Relative orientations of atoms in the structures being compared are kept fixed during superposition. It assumes that aligned residues occur in the same order in both proteins (i.e., the alignment is sequence-order dependent).|
|jCE-CP||Some protein pairs are related by a circular permutation, i.e., the N-terminal part of one protein is related to the C-terminal part of the other or vice versa, or the topology of the loops connecting secondary structural elements in a domain is different. Combinatorial Extension with Circular Permutations (CE-CP, Bliven et al., 2015) allows the structural comparison of such circularly permuted proteins.|
|TM-align||TM-align performs sequence-independent protein structure comparison and is sensitive to global topology (Zhang and Skolnick, 2005). It uses iterative dynamic programming to generate sequence-independent residue-to-residue alignments between template and model structures.|
|Smith-Waterman 3D||Smith and Waterman's 1981 algorithm aligns similar sequence segments using Blosum65 scoring matrix. The Smith-Waterman 3D is based on this algorithm and aligns two structures based on the sequence alignment. Note that this method works well for structures with significant sequence similarity and is faster than the structure-based methods. However, any errors in locating gaps, or a small number of badly aligned residues can lead to high RMSD in the resulting superposition.|
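The local sequence alignment step that Smith-Waterman 3D builds on can be sketched in a few lines. This is a textbook version of the Smith and Waterman dynamic programming recurrence, not the tool's implementation: it uses a simple match/mismatch score and a linear gap penalty instead of a substitution matrix, and the scoring values and test sequences are arbitrary choices for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; scores are floored at zero so alignments stay local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # pair a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = smith_waterman("AAAGGGTTT", "CCGGGCC")
print(score, top, bottom)
```

The residue pairs from such an alignment would then be used as the equivalences for superposition, which is why a misplaced gap in the sequence alignment carries over directly into the RMSD.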
Following structure alignment the superposed structures are visualized in 3D using the interactive molecule viewer Mol*. The sequence alignment that results from a selected structure alignment is shown in a box next to the Mol* widget. Pairs of residues that are structurally equivalent are colored orange (the first structure) or blue (the second structure). Using the pairwise structure alignment of hemoglobin (PDB ID 4HHB, chain A) and neuroglobin (PDB ID 1OJ6, chain A) the alignment results are explained below.
|The structure alignment results display: sequence alignment and superposed 3D structures|
The following information is reported about the structures selected/uploaded for superposition:
- Structure ID lists PDB ID or file name of aligned structures
- Description lists the name of the polymer entity for the chains being compared
- Sequence Length is the number of polymeric residues in the deposited sequence for the polymer chains being compared
- Modeled Residues is the number of residues with coordinates used for structure alignment. The difference between sequence length and modeled residues indicates missing residues in the polymer chain
- Coverage is the fraction of residues matched by the superposition (related by spatial proximity) relative to the total number of modeled residues being aligned
The structure alignments are characterized by the following parameters, which are commonly used to describe the extent of overlap or similarity between the two polymers:
- RMSD (root mean square deviation) is computed between aligned pairs of backbone C-alpha atoms in the superposed structures. The lower the RMSD, the better the structure alignment between the pair of structures. This is the most commonly reported metric when comparing two structures, but it is sensitive to local structural deviations: if a few aligned residues in a loop deviate strongly, the RMSD value is large even though the rest of the structure is well aligned
- TM-score (template modeling score) is a measure of topological similarity between the template and model structures (Xu and Zhang, 2010). The TM-score ranges between 0 and 1, where 1 indicates a perfect match and 0 indicates no match between the two structures. Scores < 0.2 usually indicate that the proteins are unrelated, while scores > 0.5 generally indicate that the proteins have the same fold (e.g., as classified by SCOP/CATH)
- Score is a measure of structural similarity that is specific to the alignment method used. For example, in FATCAT-flexible the chaining score is reported. This is a measure based on various parameters such as the length of aligned fragment pairs (AFPs), distance cut-offs, and the maximum number of twists allowed for the alignment. Review the parameters for each method and refer to the alignment algorithm references for more information on the specific scores reported
- SI% (sequence identity percentage) is the percent of paired residues in the alignment that are identical in sequence
- SS% (sequence similarity percentage) is the percent of paired residues in the alignment that are similar in sequence (using the Blosum65 scoring matrix)
- Length is the number of residue pairs that are structurally equivalent in the alignment
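The difference in how RMSD and TM-score respond to a few badly fitting residues can be seen in a small calculation. The sketch below uses the standard definitions (RMSD as the root of the mean squared distance over aligned pairs; TM-score as the length-normalized sum of 1 / (1 + (d/d0)^2) with d0 = 1.24 (L - 15)^(1/3) - 1.8). The coordinates are invented, the structures are assumed to be already superposed, and which chain length the tool uses for normalization is an assumption. Short chains (L of 21 or less) need special handling that is not shown here.

```python
import math

def rmsd(pairs):
    """Root mean square deviation over aligned, already superposed C-alpha pairs."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in pairs) / len(pairs))

def tm_score(pairs, reference_length):
    """TM-score of aligned pairs, normalized by reference_length (must exceed 21)."""
    if reference_length <= 21:
        raise ValueError("this sketch only handles reference lengths above 21")
    d0 = 1.24 * (reference_length - 15) ** (1.0 / 3.0) - 1.8
    total = sum(1.0 / (1.0 + (math.dist(p, q) / d0) ** 2) for p, q in pairs)
    return total / reference_length

def make_pairs(deviations):
    """Build hypothetical atom pairs separated by the given distances (Angstrom)."""
    return [((3.8 * i, 0.0, 0.0), (3.8 * i, d, 0.0))
            for i, d in enumerate(deviations)]

# 95 well-fitting pairs (1 Angstrom apart) and 5 loop residues far apart (20 Angstrom).
pairs = make_pairs([1.0] * 95 + [20.0] * 5)

overall_rmsd = rmsd(pairs)
overall_tm = tm_score(pairs, reference_length=100)
print(round(overall_rmsd, 2), round(overall_tm, 2))
```

In this made-up case, five outliers out of 100 pairs raise the RMSD well above the 1 Å seen for the other 95 pairs, while the TM-score stays far above the 0.5 threshold, because each distant pair contributes almost nothing to the sum instead of dominating it.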
In addition to the structure alignment statistics and scores, the superposed structures can be visualized and downloaded for further analysis. Options for viewing and downloading the structures are described below.
View and Download Options
Options in the pull-down menu "Selecting a View" can be used to change what is currently displayed in the interactive Mol* viewer. The options are shown in the figure and listed below:
|Select View options and the 3 different views of the aligned structures to see only the aligned residues, the protein chains, or the entire models as 3D representation of the alignment results|
- Aligned Residues: these are residues within a distance cutoff, defined for the alignment method. Note that the aligned regions of the two structures are shown in orange and blue
- Polymer Chains: show the full protein chains, including any parts of the polymer chain that are not aligned. Regions of the polymer chain that are not aligned are colored in lighter shades of orange and blue
- Full Structures: shows the full content of the deposited entry for the two structures being compared - including polymers, carbohydrates, ligands and water molecules. Regions of the polymer chain and other polymer entities that are not aligned are colored in lighter shades of orange and blue
The 3D View can be expanded to full screen to provide fine-grained control over the view. Mol* will create designated components for a given selection that can be toggled or removed. Built-in Mol* functionality is available to change coloring and representations. Using the Set Coloring menu option for any component shown in the Mol* full-screen view, the coloring can be changed as desired, or the original (structure alignment) coloring can be restored with the Superpose coloring option.
|Full screen view of the Selected View of the structure alignment in Mol*. The aligned residues are shown in orange and blue, while the parts that are not aligned are shown in lighter shades of the same colors.|
Options in the pull-down menu "Download files" can be used to download coordinates, sequences, and matrices used for the alignment. The options are shown in the figure and listed below:
|Download File options following structure alignment|
- Superposed structures - allows users to download the transformed atomic coordinates in mmCIF format for both structures after superposition
- Sequence alignment - allows users to download the aligned sequences in FASTA format from the selected structure alignment
- Transformation matrices - are the matrices in JSON format, used to transform the structures during structure alignment
Note that downloading the superposed structures will include only the coordinates of the structure that is currently loaded into the viewer (e.g. residues, chains or full structures). The superposed structures can also be downloaded from the Mol* user interface, under the Export panel.
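The downloaded sequence alignment can be used to recompute simple statistics such as the number of paired residues and the sequence identity. The following sketch rests on assumptions that are not confirmed by the text above: that the FASTA file holds two gapped sequences of equal length, that gaps are written as "-", and that every column without a gap counts as a paired position. The record names and sequences are invented.

```python
# Invented two-record alignment in FASTA format.
SAMPLE_FASTA = """\
>query
MKV-LSAGE
>target
MRVTL-AGD
"""

def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def identity_stats(seq_a, seq_b):
    """Return (number of paired columns, percent identical among paired columns)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    paired = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    identical = sum(1 for x, y in paired if x == y)
    percent = 100.0 * identical / len(paired) if paired else 0.0
    return len(paired), percent

records = parse_fasta(SAMPLE_FASTA)
length, identity = identity_stats(records["query"], records["target"])
print(length, round(identity, 1))
```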
In any structure alignment, the first structure (query) is assumed to be rigid, and the second structure (target) is superposed on it. The transformation matrices describe the operations needed to move the coordinates of the target structure onto the query structure. In rigid-body alignments, the transformation matrix of the single block is saved, while in flexible (and circular permutation) alignments a transformation matrix is reported for each flexible region (block) in the downloaded file.
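Applying a transformation matrix to target coordinates can be sketched as follows. The layout of the downloaded JSON is not described here, so everything about the data structure is an assumption: the matrix is taken to be a 4x4 homogeneous matrix stored as nested lists in row-major order (rotation in the upper-left 3x3, translation in the last column), and the per-block dictionary with `residues` and `matrix` keys is hypothetical.

```python
def apply_transform(matrix, point):
    """Apply a 4x4 row-major homogeneous transformation to one 3D point."""
    x, y, z = point
    return tuple(
        matrix[r][0] * x + matrix[r][1] * y + matrix[r][2] * z + matrix[r][3]
        for r in range(3)
    )

def transform_blocks(blocks, coords_by_residue):
    """Move each residue with the matrix of the block that contains it."""
    moved = {}
    for block in blocks:
        for residue in block["residues"]:
            if residue in coords_by_residue:
                moved[residue] = apply_transform(block["matrix"],
                                                 coords_by_residue[residue])
    return moved

# Hypothetical matrices: a 90 degree rotation about z plus a shift, and a pure shift.
ROTATE_AND_SHIFT = [[0.0, -1.0, 0.0, 1.0],
                    [1.0,  0.0, 0.0, 2.0],
                    [0.0,  0.0, 1.0, 3.0],
                    [0.0,  0.0, 0.0, 1.0]]
SHIFT_ONLY = [[1.0, 0.0, 0.0, 5.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]]

# A flexible alignment with two blocks; a rigid alignment would have only one.
blocks = [{"residues": range(1, 3), "matrix": ROTATE_AND_SHIFT},
          {"residues": range(3, 5), "matrix": SHIFT_ONLY}]
target_ca = {1: (1.0, 0.0, 0.0), 2: (0.0, 1.0, 0.0),
             3: (1.0, 1.0, 1.0), 4: (2.0, 2.0, 2.0)}

moved = transform_blocks(blocks, target_ca)
print(moved[1], moved[3])
```

With a single block this reduces to the rigid body case, where one matrix moves the whole target chain.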
Share Alignment Results
The Copy Link option is available when both structures are selected from the PDB archive by providing PDB IDs as input. After clicking the Copy Link button, the URL of the alignment results is copied to your clipboard and can be pasted into an e-mail, document, spreadsheet, notepad, or any other file or web page.
1. Rigid Body Structure Alignment
Alignment of the mammalian tubulin (1TUB.A) with a close structural homolog in prokaryotes, the bacterial cell division protein FtsZ (1FSZ.A), shows that these proteins are structurally similar (with a reported RMSD of 3.02 Å) despite low sequence identity (13.5%).
|Structural alignment of the mammalian tubulin (1TUB.A, in orange) and the bacterial cell division protein FtsZ (1FSZ.A, in blue)|
2. Rigid Body vs Flexible Structure Alignment
The structures of calmodulin with and without calcium bound can be aligned much better using a flexible rather than a rigid-body alignment algorithm. Below is an example of two calmodulin structures, calcium-loaded (1CLL.A) and calcium-free (1QX5.A), aligned with the jFATCAT-flexible (left) and jFATCAT-rigid (right) algorithms.
|Structure alignment of calmodulin proteins in different conformations: calcium-loaded (1CLL.A, in orange) and calcium-free (1QX5.A, in blue). Structures are aligned with the jFATCAT-flexible (left) and jFATCAT-rigid (right) algorithms. Brightly colored regions (blue and orange) show alignment, while regions in lighter shades of the same colors are not aligned.|
3. Sequential vs Circular Permutation Structure Alignment
The proteins in this example, Concanavalin A (PDB ID 3cna, chain A or 3CNA.A) and peanut lectin (PDB ID 2pel, chain A or 2PEL.A), are related by a circular permutation. The 3D folds of the two proteins are highly similar, but the N- and C-termini are located at different positions. While the sequence-order-dependent jCE algorithm can only find part of the alignment, the jCE-CP algorithm can discover a full alignment.
|Structure alignment of Concanavalin A (3CNA.A, in orange) and peanut lectin (2PEL.A, in blue) proteins using jCE-CP (left) and jCE (right). Brightly colored regions (blue and orange) show alignment, while regions in lighter shades of the same colors are not aligned.|
- Bliven, S E, Bourne, P E, Prlić, A, (2015) Detection of circular permutations within protein structures using CE-CP. Bioinformatics, 31(8): 1316–1318. doi: 10.1093/bioinformatics/btu823 (CE-CP)
- Li, Z, Jaroszewski, L, Iyer, M, Sedova, M, Godzik, A (2020) FATCAT 2.0: towards a better understanding of the structural diversity of proteins. Nucleic Acids Research, 48(W1): W60–W64. doi: 10.1093/nar/gkaa443 (FATCAT 2.0)
- Ma, J, and Wang, S (2014). Algorithms, Applications, and Challenges of Protein Structure Alignment. Advances In Protein Chemistry And Structural Biology 121-175. doi: 10.1016/B978-0-12-800168-4.00005-6
- Shindyalov, I N, Bourne, P E (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, Design and Selection, 11(9): 739–747. doi: 10.1093/protein/11.9.739 (CE)
- Smith, T F, Waterman, M S (1981) Identification of common molecular subsequences. Journal of Molecular Biology, 147(1): 195-197. doi: 10.1016/0022-2836(81)90087-5 (for Smith-Waterman 3D)
- Ye Y, Godzik A (2003) Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics, 19 Suppl 2:ii246-55. doi: 10.1093/bioinformatics/btg1086. (FATCAT)
- Zhang, Y, Skolnick, J (2005) TM-align: A protein structure alignment algorithm based on TM-score. Nucleic Acids Research, 33: 2302-2309. doi: 10.1093/nar/gki524 (TM-align)