Pairwise Structure Alignment
Introduction
What is Structure Alignment?
Structure alignment attempts to establish residue-residue correspondence between two or more macromolecular structures based on the optimal superposition of their shape and three-dimensional conformation. Structure alignment requires no prior knowledge of equivalent pairs of residues, does not rely on the sequence alignment, and the type of residues is ignored when the correspondence is established.
This tool presents options for pairwise structure alignment of proteins. In pairwise alignment, structures are always compared in pairs, in contrast to multiple structure alignment (reviewed in Ma and Wang, 2014), which provides a global solution for three or more structures.
Different types of structural alignments and their rationales are described below.
Rigid Body Alignment
In a rigid body alignment, the relative orientations and positions of atoms within each structure remain fixed during the alignment process. In the resulting superposition, only the overall shapes of the structures are aligned. Rigid body alignments are well-suited for identification of structural equivalences between proteins that are closely evolutionarily related and thus have similar shapes.
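As a minimal sketch of what "rigid" means here, the snippet below applies one rotation and one translation to every atom of a toy C-alpha trace and checks that all distances within the structure are unchanged. The coordinates, the rotation axis, and the function names are made up for illustration; this is not the code used by the alignment tool.

```python
import math

def rotate_z(point, angle_rad):
    """Rotate a 3D point about the z-axis."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)

def rigid_transform(coords, angle_rad, shift):
    """Apply the same rotation and the same translation to every atom."""
    moved = []
    for p in coords:
        x, y, z = rotate_z(p, angle_rad)
        moved.append((x + shift[0], y + shift[1], z + shift[2]))
    return moved

def pairwise_distances(coords):
    """All atom-atom distances within one structure."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]

# Toy C-alpha trace (made-up coordinates, in Å).
trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.1, 4.2, 2.0)]
moved = rigid_transform(trace, math.radians(40.0), (10.0, -2.0, 5.0))

# Internal geometry is preserved: only the position and orientation change.
before = pairwise_distances(trace)
after = pairwise_distances(moved)
max_change = max(abs(b - a) for b, a in zip(before, after))
```

Because the internal distances cannot change, a rigid-body method can only bring two structures into register as whole objects, which is why it suits proteins that already have similar shapes.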
Flexible Alignment
In a flexible structure alignment, relative mobility between domains or subdomains in each structure is accommodated. When superposition by rigid alignment alone does not yield meaningful results, introducing flexibility to structural alignment becomes useful for two main reasons:
- It helps compare two protein chains that have adopted different conformational states, e.g., due to post-translational modifications such as phosphorylation or interaction with other proteins/ligands.
- It also helps identify conserved regions in proteins that may have a distant evolutionary relationship. For example, one of these proteins may contain extra loops or truncations that alter the relative orientation of different domains in the structures.
Topology-Independent Alignments
Most structure alignment algorithms assume that the structural units of two similar proteins appear in the same order (in the N-terminal to C-terminal direction) within their sequences. However, this assumption may not always be true. There are many examples of natural and designed proteins where the spatial arrangement of secondary structural elements or protein domains is maintained but the protein backbone connections between these structural elements are different - i.e., the proteins have different topologies.
One such example is circular permutation, where the relative locations of structural elements (and the N- and C-termini) within two proteins are different, but their overall shape and structure (e.g., secondary structural elements and their relative orientations) are conserved.
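The idea of a circular permutation can be illustrated with a toy check on the order of structural elements. The sketch below treats each protein as a list of labeled elements (hypothetical labels, not real data) and tests whether one order is a rotation of the other by searching for it in the doubled reference order. It only illustrates the concept and is not the CE-CP algorithm.

```python
def circular_shift(reference, candidate):
    """Return k such that candidate == reference[k:] + reference[:k], or None.

    Both arguments are lists of labels for structural elements,
    listed in N-terminal to C-terminal order.
    """
    if len(reference) != len(candidate):
        return None
    n = len(reference)
    doubled = reference + reference  # every rotation appears as a window of length n
    for k in range(n):
        if doubled[k:k + n] == candidate:
            return k
    return None

# Hypothetical element orders for three proteins.
ref_order = ["A", "B", "C", "D", "E", "F"]
permuted = ["D", "E", "F", "A", "B", "C"]   # same elements, termini moved
shuffled = ["A", "C", "B", "D", "E", "F"]   # different topology, not a rotation

shift_permuted = circular_shift(ref_order, permuted)   # 3
shift_shuffled = circular_shift(ref_order, shuffled)   # None
```

A sequence-order-dependent method would match at most one of the two halves of `permuted` against `ref_order`, whereas a method that tolerates circular permutations can recover both.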
When is Structure Alignment useful?
Structure alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Pairwise protein structure comparison can be used for analysis of conformational changes on ligand binding, analysis of structural variation between proteins within an evolutionary family, and identification of common structural domains.
Documentation
Structure Alignment Interface
The structure alignment tool provides a simple-to-use, web-accessible interface for performing a wide range of structural superpositions. The tool can be accessed from the “Analyze” section of the menu bar. The interface allows you to align one or more structures to a given reference structure. You can select up to 10 structures for comparison. The first structure will be used as the reference, and all other structures will be aligned to it in a pairwise manner.
The user interface allows selecting protein structures for structure alignment (Figure 1). You can choose any one of the following options to specify the structures in this tool.
- Use Entry ID option to enter an existing PDB or Computed Structure Model (CSM) ID available on RCSB.org (e.g. 1AOB, AF_AFP60325F1)
- Use UniProt ID option to enter an existing UniProtKB ID for sequences with known 3D structure (e.g. P06213)
- Use AlphaFold DB option to fetch predicted structure coordinates from the AlphaFold Protein Structure Database for a given UniProtKB ID
- Use ESMAtlas option to fetch predicted structure coordinates from the ESM Metagenomic Atlas for a given MGnify protein ID existing in the MGnify protein sequence database
- Use File Upload option to upload your own structure file with coordinates. Accepted file formats are PDBx/mmCIF (must have a .cif or .bcif extension) and PDB (must have a .pdb or .ent extension)
- Use File URL option to reference atomic structure coordinates by providing a URL to a file. Links to either mmCIF or PDB format files may be used here
The Chain ID input field must be populated. Note: The polymer chains selected for alignment must be at least 10 residues long and the structure must contain the coordinates of at least the C-alpha backbone atoms for the selected residues.
When a valid Entry ID is provided, the selection of chain IDs will be available listing the proteins with sequences longer than 10 residues. For other options, chain ID must be typed in. Note that the chain IDs are case-sensitive.
When the structure is provided as a file in PDBx/mmCIF format, the chain ID should correspond to the _label_asym_id assigned for each chain during the deposition. See this documentation article for more information on PDB identifiers for macromolecular chains.
If only a part of the polymeric chain should be compared, the segments of polymer chains can be chosen by specifying residue ranges using the PDB residue numbers (sequential numbers from 1 to N using _label_seq_id). Note that if you are matching residues based on the author-specified residue numbers (e.g., reported in manuscripts), you may have to first convert them to the _label_seq_id. If no range is specified, all residues of the chain are included in the alignment by default.
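Converting an author-numbered range to _label_seq_id values can be sketched as a simple lookup. The snippet below assumes you have already extracted, for one chain, the pairs of label_seq_id and auth_seq_id (both are columns of the atom_site category in PDBx/mmCIF files). The function name and the toy chain are hypothetical, and insertion codes are ignored for simplicity.

```python
def auth_to_label_range(residues, auth_beg, auth_end):
    """Convert an author-numbered residue range to a label_seq_id range.

    residues: list of (label_seq_id, auth_seq_id) pairs for one chain.
    Returns (first_label_seq_id, last_label_seq_id) covering the author range.
    """
    label_ids = [label for label, auth in residues if auth_beg <= auth <= auth_end]
    if not label_ids:
        raise ValueError("no residues fall inside the author range")
    return min(label_ids), max(label_ids)

# Made-up 120-residue chain whose author numbering starts at 24,
# while label_seq_id always runs from 1 to N.
chain = [(label, label + 23) for label in range(1, 121)]

# Author residues 50-80 (as they might appear in a manuscript).
label_range = auth_to_label_range(chain, 50, 80)
```

In this toy chain the author range 50-80 corresponds to label_seq_id 27-57, which is the range that would be typed into the residue range fields.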
When at least 2 chains are selected, the Compare button becomes available to launch the structure alignment.
Alignment Methods
A number of algorithms are provided to perform pairwise structural alignments. Brief descriptions of these algorithms are included below:
Algorithm | Brief Summary | Description |
---|---|---|
jFATCAT-rigid | Rigid-body protein structure comparison for identification of the largest structurally conserved core | The structure alignment algorithm Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists (FATCAT) allows for flexible protein structure comparison (Ye and Godzik, 2003, Li et al., 2020). This tool offers use of the Java port of the original FATCAT. The rigid flavor of the algorithm is used to run a rigid-body superposition that only considers alignments with matching sequence order. For most structures the performance of this structure alignment is similar to that of CE |
jFATCAT-flexible | Flexible protein structure comparison for identification of internally rigid domains in the presence of large conformational changes | The flexible flavor of FATCAT introduces twists (hinges) between different parts of the superposed proteins so that these parts are aligned independently. This makes it possible to effectively compare protein structures that undergo conformational changes in specific parts of the molecule such that global (rigid body) superposition cannot capture the underlying similarity between domains. This is the case, for example, when the two polymers being compared are in different functional forms (e.g., bound to partner proteins/ligands), were crystallized under different conditions, or have mutations. The downside of this approach is that it can lead to false positive matches in unrelated structures, requiring that results be carefully reviewed |
jCE | Rigid-body protein structure comparison for identification of the optimal set of substructural similarities | The original Combinatorial Extension (CE) algorithm (Shindyalov and Bourne, 1998) works by identifying segments of the two structures with similar local structure, and then combining those regions to align the maximum number of residues in order to keep the root mean squared deviations (RMSD) between the pair of structures low. This Java port of the original CE uses a rigid-body alignment algorithm. Relative orientations of atoms in the structures being compared are kept fixed during superposition. It assumes that aligned residues occur in the same order in both proteins (i.e., the alignment is sequence-order dependent) |
jCE-CP | Flexible structure comparison for proteins with similar overall three-dimensional shape but different connectivity (circular permutations) | Some protein pairs are related by a circular permutation, i.e., the N-terminal part of one protein is related to the C-terminal part of the other or vice versa, or the topology of loops connecting secondary structural elements in a domain is different. Combinatorial Extension with Circular Permutations (CE-CP, Bliven et al., 2015) allows the structural comparison of such circularly permuted proteins |
TM-align | Fast TM-score based protein structure comparison for proteins with similar global topology | Sequence-independent protein structure comparison TM-align is sensitive to global topology (Zhang and Skolnick, 2005). It uses dynamic programming iterations to generate sequence-independent residue-to-residue alignments between template and model structures |
Smith-Waterman 3D | Sequence-dependent protein structure comparison for close homologues | The Smith-Waterman algorithm (Smith and Waterman, 1981) finds similar sequence segments by local alignment; here it is run with the Blosum65 scoring matrix. Smith-Waterman 3D is based on this algorithm and superposes two structures based on the sequence alignment. Note that this method works well for structures with significant sequence similarity and is faster than the structure-based methods. However, any errors in locating gaps, or a small number of badly aligned residues, can lead to high RMSD in the resulting superposition |
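To make the sequence step behind Smith-Waterman 3D concrete, here is a minimal sketch of Smith-Waterman local alignment. It deliberately simplifies: a flat match/mismatch score stands in for the Blosum65 matrix, the gap penalty is linear, and the parameter values are arbitrary assumptions. The subsequent 3D step (superposing the C-alpha atoms of the aligned residue pairs) is not shown.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # pair residue i with residue j
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Two made-up sequences sharing the segment ACDEF.
result = smith_waterman("XXACDEFXX", "YYACDEFYY")
```

Here `result` is `(10, "ACDEF", "ACDEF")`: only the shared segment is aligned. Because the superposition is computed from these residue pairs, a misplaced gap at this stage carries straight through to the 3D result, which is the sensitivity noted in the table.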
Alignment Results
Following the pairwise structure alignment, measures describing the extent of structural similarity are displayed. For example, the results of aligning the structures of hemoglobin subunit alpha (PDB ID 4HHB, chain A) and neuroglobin (PDB ID 1OJ6, chain A) are characterized by the measures shown in Figure 2:
- RMSD (root mean square deviation) is computed between aligned pairs of the backbone C-alpha atoms in superposed structures in Å. The lower the RMSD, the better the structure alignment between the pair of structures. This is the most commonly reported metric when comparing two structures, but it is sensitive to local structure deviations. Residues in loops that are not well aligned are excluded from the RMSD calculation (i.e., RMSD is calculated only using the residues which can be aligned).
- TM-score (template modeling score) is a measure of topological similarity between the template and model structures (Xu and Zhang, 2010). The TM-score ranges between 0 and 1, where 1 indicates a perfect match and 0 is no match between the two structures. Scores < 0.2 usually indicate that the proteins are unrelated while those > 0.5 generally have the same protein fold (e.g., classified by SCOP/CATH).
- Identity (sequence identity percentage) is the percent of paired residues in the alignment that are identical in sequence
- Equivalent Residues is the number of residue pairs that are structurally equivalent in the alignment
- Sequence Length is the total number of polymeric residues in the deposited sequence for a given chain
- Modeled Residues is the number of residues with coordinates that were used for structure alignment
Note: Since the first structure in the table is the reference molecule, the values for RMSD, TM-score, Identity, and Equivalent Residues are not reported.
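The measures above can be sketched in a few lines. The snippet assumes the structures are already superposed and that you have the aligned C-alpha coordinate pairs. The TM-score uses the standard published form, a sum of 1 / (1 + (d/d0)^2) over aligned pairs divided by a normalization length L, with d0 = 1.24 * (L - 15)^(1/3) - 1.8. Which chain length the tool normalizes by is not stated on this page, so it is left as a parameter; the function names and example data are made up.

```python
import math

def rmsd(pairs):
    """RMSD in Å over aligned C-alpha pairs: a list of ((x, y, z), (x, y, z))."""
    return math.sqrt(sum(math.dist(a, b) ** 2 for a, b in pairs) / len(pairs))

def tm_score(pairs, norm_length):
    """TM-score normalized by norm_length (number of residues in the chosen chain)."""
    if norm_length <= 21:
        # Very short chains need special handling of d0, which this sketch omits.
        raise ValueError("this sketch assumes norm_length > 21")
    d0 = 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8
    total = sum(1.0 / (1.0 + (math.dist(a, b) / d0) ** 2) for a, b in pairs)
    return total / norm_length

def identity_percent(aligned_a, aligned_b):
    """Percent of paired (non-gap) alignment columns with identical residues."""
    paired = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    same = sum(1 for x, y in paired if x == y)
    return 100.0 * same / len(paired)

# Made-up example: 50 aligned pairs, each 2 Å apart, from a 100-residue chain.
pairs = [((float(i), 0.0, 0.0), (float(i) + 2.0, 0.0, 0.0)) for i in range(50)]
example_rmsd = rmsd(pairs)                                # 2.0
example_tm = tm_score(pairs, 100)                         # penalized for covering half the chain
example_identity = identity_percent("ACD-EF", "ACDGEY")   # 80.0
equivalent_residues = len(pairs)                          # 50
```

The example shows why the two scores complement each other: the RMSD looks good because it only sees the 50 aligned pairs, while the TM-score stays below 0.5 because half of the chain is not aligned at all.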
Below the table displaying measures of the structure alignment, the sequence and structures of the superposed polymer chains are displayed in two side-by-side interactive panels. The sequence alignment is shown in the left-hand panel while the 3D structure alignment is shown in the right using the interactive molecular visualization tool, Mol* (Figure 3). The panels are connected and responsive to clicks on either side - i.e., clicking on the sequence alignment will select and focus on that region of the 3D structure and clicking on specific residues in the 3D structure panel, will highlight the corresponding residue(s) in the sequence.
Sequence Alignment panel: The sequence alignment is based on the structure alignment of the specified polymer chains. The reference polymer sequence is marked with an orange colored vertical bar on the left. While the first matched sequence is marked with a blue vertical bar, additional aligned sequences are marked with different colored bars.
Structure Alignment panel: The aligned parts of the structures are displayed superposed in Mol*. Matched regions of the reference structure are colored orange while those of the first matched polymer structure are colored blue. Additional structures in the alignment are colored according to the color representing their respective sequences. Correspondence of the color of the vertical bar in the sequence alignment panel and the 3D structure in the Mol* display is always maintained.
The 3D View can be expanded to fullscreen to provide fine-grained control over the view. Mol* will create designated components for a given selection that can be toggled or removed. Built-in Mol* functionality is available to change coloring and representations. Using the Set Coloring menu option for any given component shown in the Mol* fullscreen view, the coloring can be changed as desired, or the original (structure alignment) coloring can be restored with the Superpose coloring option.
Options on the left hand side of the sequence alignment allow users to selectively turn on and off a polymer chain in the alignment. Clicking on the arrow and entry name (Figure 3) can show or hide a polymer. Once a polymer is displayed you can view it in three different modes using the small boxes that show up next to the polymer entry ID and chain ID listing (Figure 4).
Upon displaying a polymer chain by clicking on the options shown in Figure 4, the structure is activated and the left-most box is colored dark blue.
Clicking on the second or middle box (to make it turn blue) displays additional polymer chains present in the structure (that were not aligned to the reference structure). The additional polymer chains in the structure are shown in a lighter shade of the color used to display the matched portion of the polymer chain.
Clicking on the right hand box displays any ligands or small molecules present in the aligned structures. For instance, in the example above, clicking on the boxes to display the additional polymer chains and ligands in the PDB entry 4hhb would change the display as shown in Figure 5.
Figure 5: Structure alignment of the hemoglobin alpha chain and neuroglobin showing the full hemoglobin molecule along with the heme ligands bound to it.
The sequence and structure alignment panels are connected so that clicking on any specific amino acid in the sequence alignment selects, zooms in and displays the specific amino acid and the interactions in its neighborhood (Figure 6).
Export Options
Options in the "Export" pull-down menu can be used to download coordinates, sequences, and matrices used for the alignment. The options are shown in Figure 7:
Download File options following structure alignment include:
- Superposed Structures - allows downloading the transformed atomic coordinates in mmCIF format for both structures after superposition
- Sequence Alignment - allows downloading the aligned sequences in FASTA format from the selected structure alignment
- Transformation Matrices - allows downloading a JSON file with the 4x4 transformation matrix, in column-major format (j * 4 + i indexing), used to superimpose the structures
Note that downloading the superposed structures will include only the coordinates of the structure that is currently loaded into the viewer (e.g. residues, chains or full structures). The superposed structures can also be downloaded from the Mol* user interface, under the Export panel.
In any structure alignment, the first structure (query) is assumed to be rigid. The second structure (target) is superposed on the query structure. The Transformation Matrices are the operations necessary to move the coordinates of the target structure to match the query structure. In rigid-body alignment, the transformation matrix of the single block is saved, while in flexible (and circular permutation) alignments, a transformation matrix for each flexible region (block) is reported in the downloaded file.
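A column-major 4x4 matrix can be applied to a coordinate as sketched below. The function takes the 16 numbers as a flat list and reads element (row i, column j) from index j * 4 + i, as described above. The exact layout of the downloaded JSON file is not covered on this page, so the snippet does not parse the file, and the example matrix is made up.

```python
def apply_transform(matrix16, point):
    """Apply a 4x4 transform, stored column-major as 16 numbers, to one (x, y, z) point."""
    vec = (point[0], point[1], point[2], 1.0)  # homogeneous coordinate
    out = []
    for i in range(3):  # rows 0-2 give the transformed x, y, z
        out.append(sum(matrix16[j * 4 + i] * vec[j] for j in range(4)))
    return tuple(out)

# Made-up example: rotate 90 degrees about the z-axis, then translate by (10, 0, 0).
# In column-major order the matrix is written one column after another.
matrix = [
    0.0, 1.0, 0.0, 0.0,    # column 0 (rotation)
    -1.0, 0.0, 0.0, 0.0,   # column 1 (rotation)
    0.0, 0.0, 1.0, 0.0,    # column 2 (rotation)
    10.0, 0.0, 0.0, 1.0,   # column 3 (translation)
]

moved = apply_transform(matrix, (1.0, 0.0, 0.0))
```

For this matrix, the point (1, 0, 0) is rotated to (0, 1, 0) and then shifted to (10, 1, 0). For a flexible or circular permutation alignment, the same function would be applied block by block, using each block's own matrix for the residues in that block.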
Share Alignment Results
The Copy Link option allows you to share or bookmark your alignment results. After you click the Copy Link button, the URL is copied to your clipboard and can be pasted into an e-mail, document, spreadsheet, notepad, or any other file or web page.
Structure Alignment API
The Structure Alignment UI is built on the public Alignment API (alignment.rcsb.org). The API can be accessed using a button in the top right corner of the page (Figure 8) and offers a convenient way to run structure alignment calculations programmatically. The Alignment API button opens an API Query Editor and populates it with the most recent query. The editor provides a canvas for customizing API queries, enabling users with varying levels of technical expertise to use the Alignment API.
Refer to the full API Reference for detailed documentation.
Figure 8: Alignment API box for running structure alignment programmatically
Examples:
1. Rigid-body structure alignment
Alignment of the mammalian tubulin (1TUB.A) with a close structural homolog within prokaryotes, the bacterial cell division protein FtsZ (1FSZ.A), shows that these proteins are structurally similar (with a reported RMSD of 3.02 Å, Figure 9) despite low sequence identity (14%).
Rigid-body comparison is performed using the jFATCAT (rigid) method.
Figure 9: Structural alignment of the mammalian tubulin (1TUB.A, in orange) and the bacterial cell division protein FtsZ (1FSZ.A, in blue)
2. Rigid-body vs flexible structure alignment
The structures of calmodulin with and without calcium bound can be much better aligned using a flexible rather than a rigid-body alignment algorithm. An example of two calmodulin structures, calcium-loaded (1CLL.A) and calcium-free (1QX5.A), aligned with the jFATCAT-flexible (left) and jFATCAT-rigid (right) algorithms is shown in Figure 10.
Note: The jFATCAT (flexible) alignment algorithm breaks the polymer chain into domains and aligns each domain separately. While this improves the structural alignment, options for displaying the complete polymer chain and other polymer chains or ligands in the structure are not available.
3. Fixed topology vs Circular Permutation structure alignment
The proteins in this example, Concanavalin A (PDB ID 3cna, chain A or 3CNA.A) and peanut lectin (PDB ID 2pel, chain A or 2PEL.A), are related by a circular permutation. The 3D folds of the two proteins are highly similar but the N- and C-termini are located at different positions. While the sequence-order-dependent jCE algorithm can only find part of the alignment, the jCE-CP algorithm can discover a full alignment (Figure 11).
Note: The CE (with CP) alignment algorithm breaks the polymer chain into parts and aligns them separately. While this improves the overall structural alignment, options for displaying the complete polymer chain and other polymer chains or ligands in the structure are not available.
4. Align multiple structures to an AlphaFold structure
You can overlay multiple proteins onto a common reference structure by structurally aligning them. Up to 10 structures can be selected. This can be useful to produce superpositions of different domains on a full protein. This example (Figure 12) combines the AlphaFold model of human Hepatocyte nuclear factor 4-alpha (AF-P41235-F1, in orange) and two PDB structures: the crystal structure of the human HNF4α DNA binding domain in complex with its DNA target (3CBB C[auth A], in blue) and a complex of HNF-4α bound to a fatty acid ligand and the SRC-1 coactivator peptide (1PZL A, in green).
With the availability of Computed Structure Models (CSMs) from RCSB.org, this example can also be run using the RCSB.org assigned CSM ID for the AlphaFold structure (AF_AFP41235F1) instead of providing a File URL (https://alphafold.ebi.ac.uk/files/AF-P41235-F1-model_v2.cif) to access it. Learn more about CSMs and RCSB.org.
5. Align multiple structures to explore two related enzymes
This example compares the structures of glyceraldehyde-3-phosphate dehydrogenase (G3PD) from various types of Archaea (PDB.chain IDs - 1CF2.A, 2YYY.A, 2CZC.A) and aspartate-semialdehyde dehydrogenase (ASDH) from Eukaryota (PDB.chain IDs - 3HSK.A, 6C8W.A), exploring the NADP-binding pocket. Both of these enzymes are oxidoreductases that act on aldehyde or oxo groups and use NAD+ or NADP+ as acceptors. However, their Enzyme Commission numbers (E.C. numbers) are different. A comparison of these protein structures can shed light on the NAD+/NADP+ binding site and the mechanism of enzyme function (Figure 13).
6. Reference predicted model available in the AlphaFold DB
AlphaFold is an artificial intelligence method for predicting protein structures. The method is described in Highly accurate protein structure prediction with AlphaFold. AlphaFold predictions are freely available from the AlphaFold Protein Structure Database (AlphaFold DB) and now include predictions for over 200 million protein sequences from UniProt 2021_04 dataset.
You can enter an existing UniProtKB accession into the AlphaFold DB input box and structure coordinates will be automatically fetched from the AlphaFold Protein Structure Database.
Figure 14: Reference predicted model available in the AlphaFold DB
7. Reference predicted model available in the ESM Metagenomic Atlas
ESM Metagenomic Atlas is a database created by Meta AI. Meta AI has developed a protein-folding approach that uses large language models to predict three-dimensional protein structures at the scale of hundreds of millions of proteins. ESM Metagenomic Atlas includes 600+ million protein structure predictions encompassing nearly the entire MGnify90 database, a public resource cataloging metagenomic sequences. Sequences are assigned an MGYP accession (e.g. MGYP002782905563).
You can enter an existing MGYP accession into the ESMAtlas input box and structure coordinates will be automatically fetched from the ESM Metagenomic Atlas.
Figure 15: Reference predicted model available in the ESM Metagenomic Atlas
In this example, the experimental structure of human glycogen branching enzyme (GBE1) from PDB entry 4BZY is aligned to the MGYP003603436052 structure; the two share the Alpha-amylase_C domain.
Figure 16: Alignment of the human glycogen branching enzyme (GBE1) structure from the PDB (in orange) and the MGYP003603436052 structure from the ESM Metagenomic Atlas (in blue), which share the Alpha-amylase_C domain (highlighted in green)
Citation
To cite this tool, please reference: Sebastian Bittrich, Joan Segura, Jose M Duarte, Stephen K Burley, Yana Rose, RCSB Protein Data Bank: exploring protein 3D similarities via comprehensive structural alignments, Bioinformatics, Volume 40, Issue 6, June 2024, btae370, https://doi.org/10.1093/bioinformatics/btae370
References
- Bliven, S E, Bourne, P E, Prlić, A, (2015) Detection of circular permutations within protein structures using CE-CP. Bioinformatics, 31(8): 1316–1318. https://doi.org/10.1093/bioinformatics/btu823 (CE-CP)
- Li, Z, Jaroszewski, L, Iyer, M, Sedova, M, Godzik, A (2020) FATCAT 2.0: towards a better understanding of the structural diversity of proteins. Nucleic Acids Research, 48(W1): W60–W64. https://doi.org/10.1093/nar/gkaa443 (FATCAT 2.0)
- Ma, J, and Wang, S (2014). Algorithms, Applications, and Challenges of Protein Structure Alignment. Advances In Protein Chemistry And Structural Biology 121-175. https://doi.org/10.1016/B978-0-12-800168-4.00005-6
- Shindyalov, I N, Bourne, P E (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, Design and Selection, 11(9): 739–747. https://doi.org/10.1093/protein/11.9.739 (CE)
- Smith, T F, Waterman, M S, (1981) Identification of common molecular subsequences, Journal of Molecular Biology. 147(1): 195-197, https://doi.org/10.1016/0022-2836(81)90087-5 (for Smith-Waterman 3D)
- Ye Y, Godzik A (2003) Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics, 19 Suppl 2:ii246-55. https://doi.org/10.1093/bioinformatics/btg1086. (FATCAT)
- Zhang, Y, Skolnick, J (2005) TM-align: A protein structure alignment algorithm based on TM-score. Nucleic Acids Research, 33: 2302-2309. https://doi.org/10.1093/nar/gki524 (TM-align)