Using sequence data to predict the self-assembly of supramolecular collagen structures

Open Access. Published: July 19, 2022. DOI: https://doi.org/10.1016/j.bpj.2022.07.019

      Abstract

      Collagen fibrils are the major constituents of the extracellular matrix, which provides structural support to vertebrate connective tissues. It is widely assumed that the superstructure of collagen fibrils is encoded in the primary sequences of the molecular building blocks. However, the interplay between large-scale architecture and small-scale molecular interactions makes the ab initio prediction of collagen structure challenging. Here, we propose a model that allows us to predict the periodic structure of collagen fibers and the axial offset between the molecules, purely on the basis of simple predictive rules for the interaction between amino acid residues. With our model, we identify the sequence-dependent collagen fiber geometries with the lowest free energy and validate the predicted geometries against the available experimental data. We propose a procedure for searching for optimal staggering distances. Finally, we build a classification algorithm and use it to scan 11 data sets of vertebrate fibrillar collagens and predict the periodicity of the resulting assemblies. We analyze the experimentally observed variance of the optimal stagger distances across species and find that these distances, and the resulting fibrillar phenotypes, are evolutionarily well preserved. Moreover, we observe that the energy minimum at the optimal stagger distance is broad in all cases, suggesting a further evolutionary adaptation that improves the assembly kinetics. Our periodicity predictions agree well with the experimental data on collagen molecular staggering, not only for all collagen types analyzed but also for synthetic peptides. We argue that, with our model, it becomes possible to design tailor-made, periodic collagen structures, thereby enabling the design of novel biomimetic materials based on collagen-mimetic trimers.

      Significance

      The pathway for protein self-assembly is determined by the free energy landscape coded in the noncovalent interactions between the building blocks. We use this basic principle to develop a model that describes the mechanisms involved in the staggering of collagen molecules in fibrillar assemblies. In this work we present a simple, parameter-free model for collagen fibril design that allows us to predict the structure of self-assembling collagen fibers on the basis of the amino acid sequence of the constituent α-chain subunits. We develop a classification algorithm and use it to scan through large data sets of collagen molecules to predict the periodicity of the resulting assemblies. We argue that the interaction model presented in this work provides a foundation for engineering of novel collagen molecules with specific material properties for targeted applications.

      Introduction

      The material properties of connective tissues, such as tendon, skin, bone, and cartilage, are largely controlled by fibrillar assemblies of collagen proteins. Collagen molecules are long (~300 nm), rope-like structures, formed from three monomeric α-chains twisted together into a triple helix (
      • Kadler K.E.
      • Baldock C.
      • Boot-Handford R.P.
      • et al.
      Collagens at a glance.
      ). In vertebrates, there are at least 10 distinct collagen molecules, each comprising 3 monomers, drawn from 12 different α-chains, encoded by 11 genes. The primary structure of the individual α-chains determines the geometrical and biophysical parameters of the collagen helix, which in turn govern the organization of molecules within the fibril, thereby establishing interactions necessary for quaternary structures to form.
      Collagen fibrils are composed of hundreds of aligned helices. The major collagens, types I, II, and III, form wide, long, unbranched fibrils, which are the dominant components of structural tissue, typically in conjunction with smaller quantities of the minor collagens, types V and XI, which are thought to act as fibril nucleators (
      • Kadler K.E.
      • Baldock C.
      • Boot-Handford R.P.
      • et al.
      Collagens at a glance.
      ). TEM studies of these fibrils show periodic dark-light bands along their length with periodicity D ≈ 67 nm, attributed to the constituent molecules being longitudinally staggered relative to their neighbors by integer multiples of D (
      • Petruska J.A.
      • Hodge A.J.
      A subunit model for the tropocollagen macromolecule.
      ,
      • Meek K.M.
      • Chapman J.A.
      • Hardcastle R.A.
      The staining pattern of collagen fibrils. Improved correlation with sequence data.
      ,
      • Smith J.W.
      Molecular pattern in native collagen.
      ,
      • Antipova O.
      • Orgel J.P.R.O.
      In situ D-periodic molecular structure of type II collagen.
      ). Such fibrils are found in tendons, cornea, skin, and cartilage (
      • Bos K.J.
      • Holmes D.F.
      • Bishop P.N.
      • et al.
      Axial structure of the heterotypic collagen fibrils of vitreous humour and cartilage.
      ,
      • Parkin J.D.
      • San Antonio J.D.
      • Savige J.
      • et al.
      The collagen III fibril has a “flexi-rod” structure of flexible sequences interspersed with rigid bioactive domains including two with hemostatic roles.
      ,
      • Holmes D.F.
      • Gilpin C.J.
      • Kadler K.E.
      • et al.
      Corneal collagen fibril structure in three dimensions: structural insights into fibril assembly, mechanical properties, and tissue organization.
      ). However, not all collagen molecular species assemble into these classical periodic fibrils. Regulatory or developmental collagen proteins do not form wide, striated fibrils under physiological conditions. These polymers are incorporated into the structurally defined suprastructure as a result of heterotypic interactions (collagen type V and XI) (
      • Birk D.E.
      • Fitch J.M.
      • Linsenmayer T.F.
      • et al.
      Collagen fibrillogenesis in vitro: interaction of types I and V collagen regulates fibril diameter.
      ). In addition, some collagens form thin, nonbanded assemblies (type XXIV and XXVII) (
      • Ricard-Blum S.
      • Ruggiero F.
      The collagen superfamily: from the extracellular matrix to the cell membrane.
      ,
      • Plumb D.A.
      • Dhir V.
      • Boot-Handford R.P.
      • et al.
      Collagen XXVII is developmentally regulated and forms thin fibrillar structures distinct from those of classical vertebrate fibrillar collagens.
      ,
      • Boot-Handford R.P.
      • Tuckwell D.S.
      • Poulsom R.
      • et al.
      A novel and highly conserved collagen (proα1 (XXVII)) with a unique expression pattern and unusual molecular characteristics establishes a new clade within the vertebrate fibrillar collagen family.
      ,
      • Hjorten R.
      • Hansen U.
      • Pace J.M.
      • et al.
      Type XXVII collagen at the transition of cartilage to bone during skeletogenesis.
      ).
      To unravel the design principles of collagen assembly, we must find a mapping between the primary sequence of the collagen trimer and the phenotypic, structural features of the collagen fibril. Given the primary sequence of the α-chain subunits, is it possible to predict the value of the axial offset between assembled polymers? Previous work has provided evidence for a link between sequence and the supramolecular structure of collagen assemblies (
      • Hulmes D.J.
      • Miller A.
      • Woodhead-Galloway J.
      • et al.
      Analysis of the primary structure of collagen for the origins of molecular packing.
      ,
      • Hofmann H.
      • Fietzek P.P.
      • Kühn K.
      The role of polar and hydrophobic interactions for the molecular packing of type I collagen: a three-dimensional evaluation of the amino acid sequence.
      ,
      • Jones E.Y.
      • Miller A.
      Analysis of structural design features in collagen.
      ,
      • Trus B.L.
      • Piez K.A.
      Molecular packing of collagen: three-dimensional analysis of electrostatic interactions.
      ). In fact, interaction-based scoring systems for linear sequences have been proposed in (
      • Hulmes D.J.
      • Miller A.
      • Woodhead-Galloway J.
      • et al.
      Analysis of the primary structure of collagen for the origins of molecular packing.
      ,
      • Hofmann H.
      • Fietzek P.P.
      • Kühn K.
      The role of polar and hydrophobic interactions for the molecular packing of type I collagen: a three-dimensional evaluation of the amino acid sequence.
      ,
      • Trus B.L.
      • Piez K.A.
      Molecular packing of collagen: three-dimensional analysis of electrostatic interactions.
      ). In what follows, we use a more physically detailed model to arrive at a simple theoretical tool to predict the observed molecular geometry. Given the size of each collagen monomer of around 3000 amino acid residues, and the lack of detailed structural data, a fully atomistic (free) energy optimization procedure to model collagenous assemblies would be prohibitively expensive. Consequently, we take a coarse-grained approach to estimate the free energy of assembly. We make use of well-established empirical estimates of the strength of residue-residue interactions, based on so-called statistical contact potentials (CPs). We integrate these CPs in a simplified representation of collagen molecular structure. The resulting model allows us to estimate the relative stability of various collagen arrangements. We analyzed the primary structures of collagen proteins that can be classified into various functional types, across several vertebrate organisms. We used primary sequence data for collagen types for which experimental data regarding the phenotype of higher-order structure are available (Table 1), to establish a procedure for periodicity prediction. Here, we show that the axial staggering is indeed fully encoded in the collagen helix and can be predicted solely based on the primary structure of the trimer α-chain subunits. Moreover, we provide evidence that the stagger distance between collagen molecules in their fibrils and, as a result, the phenotypic features of those fibrils are well preserved across evolutionary time.
      Table 1. The 11 types of fibrillar proteins analyzed in this work and their corresponding experimentally determined molecular aggregations

      ID | Type | Trimer composition | Length [s] | Fibril periodicity D [nm] | D [s] | Species (reference)
      1 | I (a) | [α1(I)]3 | 1012 | 67 | 234 | calf (Mcbride Jr. D.J., Kadler K.E., Prockop D.J., et al. Self-assembly into fibrils of a homotrimer of type I collagen.) (bovine (c))
      2 | I | [α1(I)]2α2(I) | 1012 | 67 | 234 | human (c), rat (c), bovine (c) (Hulmes D.J., Jesior J.-C., Wolff C., et al. Electron microscopy shows periodic structure in collagen fibril cross sections.; Doyle B.B., Hulmes D.J., Woodhead-Galloway J., et al. A D-periodic narrow filament in collagen.)
      3 | II | [α1(II)]3 | 1012 | 67 | 234 | human (c), lamprey, bovine (c) (Antipova O., Orgel J.P.R.O. In situ D-periodic molecular structure of type II collagen.)
      4 | III | [α1(III)]3 | 1027 | 66.7 ± 0.2 | 234 | calf (Brodsky B., Eikenberry E.F., Cassidy K. An unusual collagen periodicity in skin.) (bovine (c))
      5 | V | α1(V)α2(V)α3(V) | 1012 | unknown | |
      6 | V | [α1(V)]2α2(V) | 1012 | periodic, 67 | 234 | rat (Mizuno K., Adachi E., Hayashi T., et al. The fibril structure of type V collagen triple-helical domain.)
      7 | V | [α1(V)]3 | 1012 | nonperiodic | | calf (Chanut-Delalande H., Fichard A., Ruggiero F., et al. Control of heterotypic fibril formation by collagen V is determined by chain stoichiometry.)
      8 | XI | α1(XI)α2(XI)α3(XI) | 1012 | periodic, 67 | 234 | chick (Hansen U., Bruckner P. Macromolecular specificity of collagen fibrillogenesis: fibrils of collagens I and XI contain a heterotypic alloyed core and a collagen I sheath.)
      9 | XXIV | [α1(XXIV)]3 | 979 | unknown | |
      10 | XXVII | [α1(XXVII)]3 | 988 | nonperiodic | | mouse (Plumb D.A., Dhir V., Boot-Handford R.P., et al. Collagen XXVII is developmentally regulated and forms thin fibrillar structures distinct from those of classical vertebrate fibrillar collagens.)
      11 | I (b) | [α2(I)]3 | 1012 | | |

      IDs 1–10 are natural molecular types of collagen. Some collagen types are known to build periodic assemblies, but not others (see Fibril periodicity column). The length of a trimer is given in helical segments s (see text).
      (a) Variant of collagen type I found in development and disease.
      (b) Not found in vivo, but present in vitro when collagen type I heterotrimers are denatured and then renatured (Leikina E., Mertts M.V., Leikin S., et al. Type I collagen is thermally unstable at body temperature.).
      (c) Test set sequences used for the estimation of the α-parameter (see text).

      Methods

      Model

      Our model aims to strike a balance between simplicity in the description of the structure of the collagen triple helices and realism in the description of the intercollagen interactions. We achieve this compromise by using a knowledge-based representation of the interactions between individual amino acid residues on pairs of triple helices arranged as in the three-dimensional fibril structure.

      Representation of collagen molecules

      On the basis of the available experimental data regarding the structure of collagen molecules, and inspired by earlier models (
      • Morozova S.
      • Muthukumar M.
      Electrostatic effects in collagen fibril formation.
      ,
      • Gautieri A.
      • Vesentini S.
      • Buehler M.J.
      • et al.
      Hierarchical structure and nanomechanics of collagen microfibrils from the atomistic scale up.
      ), we use the following representation of trimeric collagen molecules:
      • 1
        We consider the triple helical domains of the collagen proteins only, based on the widely held understanding that the triple helix domain is the main driver in determining the fibrillar structure. Each triple helix is modeled as a rigid rod denoted by T.
      • 2
        The triple helical rigid rod is considered to be composed of elementary subunits, hereafter referred to as segments {sn}, where n describes the position of a segment along the rod axis, n ∈ {1, …, N}, and N is the total number of segments. In what follows, we denote distances along the collagen triple helix in terms of segments, such that the length of the triple helix is L = N. The segments are thus separated by a distance l ≈ 0.29 nm along the rod axis, the expected rise per residue of the collagen triple helix.
      • 3
        Each segment comprises a group of three coplanar amino acids, i.e., a cross section of the respective triple helix.
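
      The representation above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes three α-chain strings in one-letter code and a one-residue register shift between successive chains (an assumption made here so that every cross section contains one residue from each of the Gly, X, and Y positions). The function name `build_segments` and the toy sequence are hypothetical.

```python
def build_segments(chain_a, chain_b, chain_c):
    """Return a list of segments, each a tuple of three coplanar residues.

    Assumption (not from the source): chain_b is shifted by one residue and
    chain_c by two residues relative to chain_a, so a Gly-X-Y repeat yields
    one G, one X-position and one Y-position residue per cross section.
    """
    n_segments = min(len(chain_a), len(chain_b) - 1, len(chain_c) - 2)
    segments = []
    for n in range(n_segments):
        segments.append((chain_a[n], chain_b[n + 1], chain_c[n + 2]))
    return segments


if __name__ == "__main__":
    toy_chain = "GPPGAKGER"  # hypothetical Gly-X-Y repeat, homotrimer
    for n, segment in enumerate(build_segments(toy_chain, toy_chain, toy_chain), 1):
        print(n, segment)
```

      With this convention the rod length in segments, L = N, is simply the number of tuples returned.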

      Interaction between collagen helices

      Consider two parallel triple helices, Ta and Tb, each composed of N ≈ 1000 segments (Fig. 1). We will consider all possible relative displacements of the two helices. We number these displacements using the index p; we let Δxp denote the distance by which one helix is shifted with respect to the other, such that Δxp changes as p is changed. We measure Δxp in integer numbers of helical segments, so that Δxp ∈ {1, …, N}. For p = 0, the two helices are completely aligned (no relative shift). In this case, Δx0 = 0. Below we explain how we compute the interaction energy as Δxp is increased.
      Figure 1. (A) Schematic representation of collagen trimers, here modeled as rigid rods. Each trimer (Ti) is modeled as having a polymeric structure, where each monomer, or segment (sn), is composed of two solvent-exposed amino acids at positions X and Y and a glycine occluded inside the helix (see A for detail). To investigate how the relative stagger of two collagen trimers affects their interaction energy, we sample possible intermolecular arrangements along the common long axis x. (B) The two-dimensional representation of the longitudinal arrangement of collagen molecules in fibrils (lateral organization is shown in (C)). The longitudinal repeating unit (LRU, red box) and the stagger Δx between interacting molecules positioned relative to a reference molecule (red) are indicated. The length of the longitudinal gap g between successive molecules in the axial direction is proportional to Δx, the magnitude of the stagger. The case shown in the figure corresponds to the situation where the stagger is in the range between L/5 and L/4, where L is the length of a collagen molecule measured in segments. (C) The three-dimensional (top) and lateral cross section (bottom) representation of the collagen fibrils modeled in this work. Any given molecule, e.g., those represented by red circles, is surrounded by six interacting neighbors, where all possible mΔx staggers from m = 1 to (M − 1) are shown. Here, M = 5 is the number of molecules in the LRU.

      Calculation of pairwise interactions between the segments

      First, we need to decide how to score interactions between the helical segments. To a first approximation, we assume that the interaction energy between two segments is given by the average over all possible inter-residue contacts (see Section S1.1 for details). If two segments belonging to neighboring molecules, si = (G, X1, Y1) and sj = (G, X2, Y2), are in proximity, the total interaction energy between these two segments is approximated as:
      $$e(s_i, s_j) = \frac{1}{4} \sum_{x,y} \varepsilon(A_x, A_y), \tag{1}$$


      where ε(Ax, Ay) denotes the energy of contact between the amino acids Ax ∈ {X1, Y1} and Ay ∈ {X2, Y2}. For each of the four possible Ax–Ay interactions, ε(Ax, Ay) is selected from a CP matrix discussed below.
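
      As an illustration of Eq. 1, the following sketch averages the four exposed-residue contacts between two segments. The contact energies in `TOY_CP` are made-up placeholder numbers in arbitrary units, not values from the MJ matrices, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical toy contact potential (arbitrary units); NOT the MJ values.
# Keys are alphabetically sorted residue pairs so the lookup is symmetric.
TOY_CP = {
    ("E", "E"): 0.4,
    ("E", "K"): -1.0,
    ("E", "P"): -0.1,
    ("K", "K"): 0.4,
    ("K", "P"): -0.1,
    ("P", "P"): -0.2,
}


def contact_energy(res_a, res_b, cp):
    """Look up the symmetric contact energy for a pair of residues."""
    return cp[tuple(sorted((res_a, res_b)))]


def segment_energy(exposed_i, exposed_j, cp):
    """Eq. 1: mean contact energy over the four X/Y residue pairs.

    exposed_i and exposed_j are (X, Y) tuples; the occluded glycine of
    each segment is left out because it does not enter the sum.
    """
    pairs = product(exposed_i, exposed_j)
    return sum(contact_energy(a, b, cp) for a, b in pairs) / 4.0


if __name__ == "__main__":
    print(segment_energy(("P", "K"), ("E", "P"), TOY_CP))
```

      Sorting each residue pair before the lookup is a design choice that keeps the potential symmetric, so e(si, sj) = e(sj, si).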

      Energy of intermolecular interactions

      Given two molecular trimers and a specific value of the offset Δxp, we need an energy function H(Δxp) that provides a reasonable estimate of the intermolecular interaction energy. To calculate this total interaction energy, we define the matrix E^Δxp ∈ R^(L×L), such that its elements E_ij^Δxp = e(si, sj) describe the interaction energies that result when the i-th segment of trimer Ta makes contact with the j-th segment of trimer Tb (Fig. 1). The array E^Δxp contains all pairwise interactions between the segments, calculated for the molecular alignment Δxp. Note that for Δx0 = 0, E is symmetric.
      Having defined EΔxp, we must now specify which segments are in fact interacting for a given value Δxp of the stagger. When mapping real-space protein structures on to lattice models, it is commonly assumed that a pair of residues form a contact if the distance between their Cα atoms is less than 0.75 nm. The lateral intermolecular distances in collagen fibrils vary between 1.1 and 1.6 nm (
      • Gautieri A.
      • Pate M.I.
      • Buehler M.J.
      • et al.
      Hydration and distance dependence of intermolecular shearing between collagen molecules in a model microfibril.
      ) and the internal radius of a triple helix falls in the range 0.1–0.2 nm (
      • Chang S.-W.
      • Buehler M.J.
      Molecular biomechanics of collagen molecules.
      ,
      • Rainey J.K.
      • Goh M.C.
      A statistically derived parameterization for the collagen triple-helix.
      ). Thus, we ignore interactions between pairs of amino acids that are separated by more than one segment (see Fig. S1A and B). This information is encoded in the binary contact matrix Q ∈ R^(L×L), where:
      $$Q_{ij} = \begin{cases} 1 & \text{if } |i - j| \le 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{2}$$


      Finally, we compute the total energy of two interacting trimers by simply adding contributions from the relevant elements of E^Δxp, selected for the range defined above by applying the matrix Q. Following the notation described above, this can be written as:
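      $$H(\Delta x_p) = \frac{1}{3} \sum_{(i,j):\,|i-j| \le 1} e(s_i, s_j) = \frac{1}{3} \sum_{i,j} E^{\Delta x_p}_{ij} Q_{ji} = \frac{1}{3} \mathrm{Tr}\left(E^{\Delta x_p} Q\right). \tag{3}$$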


      To summarize, Eq. 3 describes the following: for each given configuration (Δxp), we identify the contacts between neighboring helical segments of the interacting trimers and add together their energies. At every iteration, the matrix E^Δxp, which stores the pairwise energies, is updated. H(Δxp), given by a sum of pair interactions, can be interpreted as the free energy of the molecular complex, computed for the given conformation directly from the amino acid sequences of the interacting trimers. The factor 1/3 in Eq. 3 accounts for the intersegmental interaction range, as given by the matrix Q.
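
      A minimal NumPy sketch of Eqs. 2 and 3 follows. It assumes that segments are held in a Python list, that the staggered partner is obtained by a cyclic shift of that list (the shift direction is an arbitrary choice made here), and that `seg_energy` is any user-supplied scoring function for a pair of segments. All names are hypothetical.

```python
import numpy as np


def contact_matrix(n):
    """Eq. 2: Q_ij = 1 if |i - j| <= 1, else 0."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= 1).astype(float)


def interaction_energy(E, Q):
    """Eq. 3: H = Tr(E Q) / 3, i.e. the tridiagonal band sum of E over 3."""
    return float(np.trace(E @ Q)) / 3.0


def stagger_energy(segments, dx, seg_energy):
    """H(dx) for two identical trimers, one cyclically shifted by dx.

    Assumption: the shift is applied as a rotation of the segment list.
    """
    shifted = segments[dx:] + segments[:dx]
    E = np.array(
        [[seg_energy(si, sj) for sj in shifted] for si in segments],
        dtype=float,
    )
    return interaction_energy(E, contact_matrix(len(segments)))


if __name__ == "__main__":
    E = np.arange(16.0).reshape(4, 4)
    Q = contact_matrix(4)
    # The trace form and the elementwise band sum agree because Q is symmetric.
    print(interaction_energy(E, Q), np.sum(E * Q) / 3.0)
```

      For long helices the full matrix product is wasteful; summing the three diagonals of E directly gives the same number and would be the natural optimization.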

      Constraints imposed on the staggering distance

      In this work we made a few assumptions that constrain the possible values of the stagger Δxp. These constraints, described in the two subsequent paragraphs, result from known features of collagen fibrillar structures, and specific choices that we made to facilitate comparison of our findings with experimental data.

      The longitudinal repeating structural unit

      The energy calculation described above for two interacting triple helices can be easily extended to describe the amino acid-residue interactions that exist in the collagen fibrillar environment. The total interaction energy in the fibril results from the pairwise interactions within one longitudinal repeating structural unit (LRU) of the fibril. The LRU is the minimum set of molecules that have mutual longitudinal overlap. Each molecule in the LRU is defined as a cyclic permutation of the collagen triple helix segments such that the N-terminal end of the triple helix is offset, respectively, by mΔxp, where m = 1, 2, …, (M − 1). Here, M denotes the number of members of the LRU, which depends on the molecular length and the stagger Δxp (see Fig. 1). The possible intermolecular staggers between the M molecules of the LRU are thus Δxp, 2Δxp, …, (M − 1)Δxp. Here, we compute the individual interaction energies for all M possible intermolecular interactions in the LRU, making no assumptions about how the collagen molecules are laterally arranged within the fibril.
      In this work we model interactions between identical collagen proteins. This constraint causes a reduction in the number of stagger possibilities since staggers can become mirror reflections of each other due to the imposed periodic boundary conditions. Therefore, the interaction energies for staggers of mΔxp and (M − m)Δxp are energetically identical (see supporting material for details).
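
      The bookkeeping for the LRU can be sketched as follows. The relation M = ceil(L/Δx) is an assumption made here; it is consistent with the Fig. 1 caption, where a stagger between L/5 and L/4 corresponds to M = 5, but the source does not state the formula explicitly. Function names are hypothetical.

```python
import math


def lru_size(length, stagger):
    """Number of molecules M in the LRU, assuming M = ceil(L / stagger)."""
    return math.ceil(length / stagger)


def lru_staggers(length, stagger):
    """All intermolecular staggers m * stagger for m = 1 .. M - 1."""
    return [m * stagger for m in range(1, lru_size(length, stagger))]


def independent_multiples(length, stagger):
    """Multiples m that remain after identifying m with M - m."""
    m_total = lru_size(length, stagger)
    return [m for m in range(1, m_total) if m <= m_total - m]


if __name__ == "__main__":
    L, dx = 1012, 234  # values taken from Table 1 (length and D in segments)
    print(lru_size(L, dx), lru_staggers(L, dx), independent_multiples(L, dx))
```

      For L = 1012 and Δx = 234 this gives M = 5 with staggers 234, 468, 702, and 936, of which only m = 1 and m = 2 are energetically distinct under the symmetry described above.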

      Gap insertion

      The stagger in the fibril enforces a longitudinal break g between trimers along a common axis when the stagger Δxp is not a divisor of the molecular length L. Concretely, if (q, r) denote, respectively, the quotient and remainder of L/Δx, and if r ≠ 0 and Δx ≠ 0, then the gap g (measured in helical segments) is computed as (q + 1) × Δx − L. It then follows that g depends linearly on Δx and is equal to 0 when L can be written as an integer number times Δx.
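
      The gap rule can be written directly with integer division. This is a small sketch of the formula in the text; the function name is hypothetical.

```python
def gap_length(length, stagger):
    """Gap g in helical segments: (q + 1) * stagger - L when r != 0, else 0."""
    if stagger <= 0:
        raise ValueError("stagger must be a positive number of segments")
    quotient, remainder = divmod(length, stagger)
    if remainder == 0:
        return 0
    return (quotient + 1) * stagger - length


if __name__ == "__main__":
    # L = 1012 and stagger = 234 are the Table 1 values for collagen type I.
    print(gap_length(1012, 234))
```

      For L = 1012 and Δx = 234 the quotient is 4 and the remainder is 76, so g = 5 × 234 − 1012 = 158 segments.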

      Amino acid contact forms

      We select the pairwise energies εAx,Ay with the aid of a statistical protein CP. In general, CPs are free energy functions derived from protein structural data by various principles, and are typically used in protein folding (e.g., to distinguish between the native fold and decoys) or docking (
      • Sippl M.J.
      Calculation of conformational ensembles from potentials of mean force: an approach to the knowledge-based prediction of local structures in globular proteins.
      ,
      • Lu H.
      • Lu L.
      • Skolnick J.
      Development of unified statistical potentials describing protein-protein interactions.
      ,
      • Ravikant D.V.S.
      • Elber R.
      Energy design for protein-protein interactions.
      ,
      • Dosztányi Z.
      • Csizmók V.
      • Simon I.
      • et al.
      The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins.
      ). CPs can differ according to the assumptions made in deriving them, and according to the details of the estimation (e.g., the number of parameters). Empirical CPs are called knowledge-based potentials because they are derived under the assumption that the frequency of observed contacts between a given pair of amino acid residues across protein structures reflects the strength of interaction between those residues. Miyazawa and Jernigan (MJ) used a quasi-chemical approximation to relate the effective contact energy between amino acids x and y to the frequency with which nonbonded contacts between x and y are observed in known protein structures (
      • Sippl M.J.
      Calculation of conformational ensembles from potentials of mean force: an approach to the knowledge-based prediction of local structures in globular proteins.
      ,
      • Miyazawa S.
      • Jernigan R.L.
      Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation.
      ,
      • Miyazawa S.
      • Jernigan R.L.
      Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading.
      ,
      • Miyazawa S.
      • Jernigan R.L.
      Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues.
      ). MJ introduced two types of CPs: one containing strong hydrophobic components is intended to describe the energy of amino acid transfer from solvent to the protein internal environment (MJh), the second describes the interactions between amino acid residues within the average protein environment (MJb).
      In this work, we assumed that the MJb matrices are better suited to study interactions between collagen trimers. The environment within a collagen fibril mimics the internal environment of a globular protein because it is dominated by protein-protein interactions while solvent-protein contacts are limited. The MJ statistical CP matrices used in this work were obtained from the AAindex database (https://www.genome.jp/aaindex/) (
      • Kawashima S.
      • Kanehisa M.
      AAindex: amino acid index database.
      ) using the indexes: MIYS850103, MIYS960102, MIYS990107.

      Detection of periodicity signals

      The goal of our analysis is to identify which collagen trimer types can self-assemble into periodic fibrillar structures. To detect periodicity signals across polypeptide sequences, we developed a simple method to search for the local minima of the interaction energy, which is computed as a function of the staggering position, H = f(Δxp), defined by Eq. 3. First, we consider a point p corresponding to the interhelical distance Δxp (see Fig. 1 A) as a potential periodicity signal if the interaction energy computed at this point, H(Δxp), is significantly lower than the average interaction energy computed over the sampled range of staggering positions. Specifically, we ask whether a stagger Δxp exists for which H(Δxp) satisfies the following inequality:
      $$H(\Delta x_p) \le \bar{H}_{lu} - \alpha\,\sigma_{lu}, \tag{4}$$


      where H̄_lu = [Σ_{i=l}^{u} H(Δxi)] / (u − l + 1) is the average value of the interaction energy computed over the indicated range of stagger positions Δxp, and {l, u} are, respectively, the lower and upper bounds of that range. In this work we chose the range of Δxp values such that the number of molecules in the LRU, M, is between 3 and 7. The parameter σ is the standard deviation of the energy H(Δxp) from the mean H̄_lu, and α is a parameter that sets the dispersion range and is chosen empirically (see Fig. 3 B). We tested the performance of different values of α using the validation sets and selected α ≈ 2.15. For validation, we extracted the subset of trimers from the species for which the periodicity is known experimentally (according to Table 1). The selected value of α is the maximal value for which we make the correct prediction for all known labeled examples: a low α picks up even small minima (false-positive predictions), whereas a high α requires a very deep minimum (false-negative predictions, see Fig. 3 B).
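
      The threshold test of Eq. 4 can be sketched as below. The sketch assumes that `H` is an array indexed by the stagger in segments and that σ is the population standard deviation (NumPy's default); the source does not say which estimator was used. The function name is hypothetical.

```python
import numpy as np


def candidate_staggers(H, lower, upper, alpha=2.15):
    """Return staggers in [lower, upper] with H <= mean - alpha * sigma.

    H is indexed by stagger (in segments); the mean and sigma are taken
    over the inclusive window [lower, upper], as in Eq. 4.
    """
    window = np.asarray(H[lower:upper + 1], dtype=float)
    threshold = window.mean() - alpha * window.std()
    hits = np.nonzero(window <= threshold)[0] + lower
    return hits.tolist(), float(threshold)


if __name__ == "__main__":
    # Synthetic energy curve: flat background with one deep minimum.
    H = np.zeros(100)
    H[40] = -10.0
    print(candidate_staggers(H, 20, 60))
```

      On this synthetic curve only the stagger at index 40 passes the test; lowering `alpha` would admit shallower minima, which mirrors the false-positive behavior described in the text.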
      Having identified, for each trimer type, a set of points p ∈ {p1, …, pn} at which the function H(Δxp) reaches a local minimum, we next check whether any of these candidate points also satisfies the above relation at each integer multiple m of the selected offset distance Δxp. In summary, we assign periodicity p to a given collagen trimer under the following condition:
      $$H(m\,\Delta x_p) \le \bar{H}_{ml,mu} - \alpha\,\sigma_{ml,mu}, \qquad m = 1, \dots, M. \tag{5}$$


      The above requirement states that if we identify an energy minimum at position p (distance Δxp), all of its integer multiples, indexed here by m, which runs over the possible intermolecular displacement choices, m = 1, …, (M − 1), are also local minima of the interaction energy. This asserts that the pairwise interaction energy between the trimers shifted by the distance mΔxp reaches a minimum, whichever the m value is. For practical reasons, we restricted our analysis to M ≤ 7.
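
      A possible reading of the periodicity condition is sketched below. Two interpretations are assumptions made here: the window for the m-th multiple is taken as [m·l, m·u] (clipped to the available data), and m runs from 1 to M − 1 as in the explanatory text. A window with zero variance is treated as carrying no signal. The function name is hypothetical.

```python
import numpy as np


def is_periodic(H, stagger, lower, upper, n_molecules, alpha=2.15):
    """Check that every multiple m * stagger is a significant minimum.

    H is indexed by stagger in segments. For each m = 1 .. n_molecules - 1
    the test H[m * stagger] <= mean - alpha * sigma is applied on the
    scaled window [m * lower, m * upper] (an assumed interpretation).
    """
    H = np.asarray(H, dtype=float)
    for m in range(1, n_molecules):
        lo = m * lower
        hi = min(m * upper, len(H) - 1)
        position = m * stagger
        if not lo <= position <= hi:
            return False
        window = H[lo:hi + 1]
        sigma = window.std()
        if sigma == 0.0:  # flat window: no minimum to speak of
            return False
        if H[position] > window.mean() - alpha * sigma:
            return False
    return True


if __name__ == "__main__":
    # Synthetic curve with minima at multiples of 234 segments (L = 1012).
    H = np.zeros(1012)
    H[[234, 468, 702, 936]] = -10.0
    print(is_periodic(H, 234, 203, 253, n_molecules=5))
```

      Removing any one of the four minima from the synthetic curve makes the function return False, which is the all-multiples requirement stated above.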

      Data sets

      To carry out this analysis we used the data sets of homologous fibrillar collagen α-chain sequences as described below. A multiple sequence alignment (MSA) is built for each fibrillar α-chain allowing the helical region of each pro-collagen α-chain sequence to be identified and extracted. The MSA can be viewed as an array where each sequence occupies a row and the columns correspond to the sequence sites. We then used the extracted regions of each sequence record to build various types of collagen trimer found in each species.

      Data acquisition

      For each collagen α-chain, the set of sequence orthologs was obtained using the National Center for Biotechnology Information Protein Reference Sequences Database resource (RefSeq) (https://www.ncbi.nlm.nih.gov/protein/). We identified homologous sequences using BLAST. This approach makes the assumption that orthologous sequences originate from a shared ancestor via an evolutionary process that includes mutations, insertions, and deletions. The algorithm implemented in the software program MUSCLE (http://www.drive5.com/muscle/) aims to reproduce the pattern of these events by maximizing similarity between aligned sequences (
      • Edgar R.C.
      MUSCLE: a multiple sequence alignment method with reduced time and space complexity.
      ,
      • Edgar R.C.
      MUSCLE: multiple sequence alignment with high accuracy and high throughput.
      ).

      Data processing

      Each data set of sequences was filtered to remove duplicates, leaving a single representative sequence for each collagen α-chain for each species. For each α-chain type we built an MSA with MUSCLE using its default parameters.

      Hetero-trimer construction

      There are various ways in which each hetero-trimer can be constructed by alternating the relative positions of the three component α-chains among leading, middle, and trailing. In cases where a hetero-trimer is composed of three unique α-chains, there are six possible variants, whereas if the hetero-trimer contains two distinct α-chains, then there are just three possible variants. In this study we analyze four known collagen hetero-trimers (Table 1). Each hetero-trimer is constructed assuming the following order of α-chains: type I: α1(I)α1(I)α2(I); type V (a): α1(V)α2(V)α1(V); type V (b): α2(V)α3(V)α1(V); type XI: α1(II)α1(XI)α2(XI).
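      The variant counts quoted above follow from enumerating the distinct assignments of three chains to the leading, middle, and trailing positions. The sketch below reproduces the counts; the ASCII chain labels and the function name are illustrative assumptions.

```python
from itertools import permutations


def chain_orderings(chains):
    """Distinct (leading, middle, trailing) assignments of three chains."""
    # A set removes the orderings that coincide when two chains are identical.
    return sorted(set(permutations(chains)))


three_unique = chain_orderings(("a1(V)", "a2(V)", "a3(V)"))
two_distinct = chain_orderings(("a1(I)", "a1(I)", "a2(I)"))

n_three_unique = len(three_unique)  # six variants
n_two_distinct = len(two_distinct)  # three variants
```

      The ordering assumed in this study for type I, α1(I)α1(I)α2(I), is one of the three entries of two_distinct.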

      Results

      The self-assembly of supramolecular collagen structures is mediated by a large number of interhelix interactions. Our hypothesis is that the preferred arrangements of two (or more) collagen triple helices should have a more favorable interaction energy than alignments that are not observed in experiments.
      To examine how the relative stagger of collagen triple helices affects the energy of fibrillar ensembles, we constructed the energy function (Eq. 3) that provides a first-order estimate of the strength of the sequence-dependent pairwise interaction between two trimers. Figs. 2 and S3–S6 show estimates of the pairwise interaction energy between identical collagen trimers for different M values across the trimer types listed in Table 1. Here, we compute the interaction energies using Eq. 3 for each of the m differently aligned trimers and average them over the data sets of orthologs from diverse species; the error bands show the standard deviation at each Δxp distance. Comparing these curves, we find that a well-defined energy minimum exists universally among both the major and minor collagen types for M = 5. The arrangement with M = 7 (L/7 < Δxp ≤ L/6, Fig. S3) appears improbable: we do not detect an energy minimum at any offset in this range. Similarly, for M = 6 (Fig. S4) and M = 4 (Fig. S5) we do not observe significant signals in the interaction curves that would suggest periodic fibril structures. For the arrangement with three staggers, which spans a broad range of offset values, L/3 < Δxp ≤ L/2, we observe a drop at Δx ≈ 352 for mutual interactions between type III collagen trimers, but not for any other collagen type (Fig. S6). Moreover, for the arrangements with M = 4, 6, and 7 staggers, the energy curves obtained for the different multiples of Δxp are uncorrelated across all the collagen types analyzed. Conversely, the suprastructure with M = 5 gives correlated patterns of interaction, with a clear energetic minimum for all Δxm×p neighbors, where m ∈ {1, …, 4}, for 9 out of 11 trimer types (Fig. 2).
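      The averaging over orthologs described above can be sketched as follows. The energy curves are assumed to be equal-length lists evaluated on the same grid of offsets; the source does not say whether the population or the sample standard deviation was used, so the population form is an assumption, and the function name and toy curves are illustrative.

```python
from statistics import mean, pstdev


def average_curve(curves):
    """Per-offset mean and standard deviation over a set of energy curves."""
    columns = list(zip(*curves))  # one tuple of ortholog values per offset
    means = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]  # population form (assumed)
    return means, spreads


# Toy curves (assumed data): two orthologs evaluated at four offsets.
toy_curves = [
    [0.0, -1.0, -3.0, -1.0],
    [0.0, -2.0, -5.0, -1.0],
]
means, spreads = average_curve(toy_curves)
```

      The list means gives the central curve and spreads gives the half-width of the error band at each offset.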
      Figure 2. Estimated values of pairwise interaction energies between the triple helical regions of two identical fibrillar collagen molecules of various types, computed as a function of the longitudinal stagger Δx between the molecules in units of helical segments. The estimates shown are averages over each data set of orthologous trimers, and the error bands indicate the spread of the distribution (standard deviation) at each Δxp point. The violet curve corresponds to Δxp (and Δx4×p); the yellow curve shows the pattern for Δx2×p (and Δx3×p). Note that the overlap of the interaction curves reported for the two effective staggers does not result from geometrical constraints and is determined purely by the trimer sequence. (b–d) Proteins reported to build periodic fibrils; (f–i) minor regulatory collagens; (a and j) developmental collagens; (c) collagen type I homotrimer relevant in development and disease, e.g., osteoarthritis (
      • Sharma U.
      • Carrique L.
      • Hulmes D.J.S.
      • et al.
      Structural basis of homo-and heterotrimerization of collagen I.
      ); (k) collagen type I homotrimer not known to occur in vivo, but detected in vitro in collagen renaturation experiments (
      • Leikina E.
      • Mertts M.V.
      • Leikin S.
      • et al.
      Type I collagen is thermally unstable at body temperature.
      ). The top panel pairs the energy plots obtained in this work with the experimentally derived collagen fibrillar structures from Plumb et al. (
      • Plumb D.A.
      • Dhir V.
      • Boot-Handford R.P.
      • et al.
      Collagen XXVII is developmentally regulated and forms thin fibrillar structures distinct from those of classical vertebrate fibrillar collagens.
      ); (a) homotrimer of collagen type XXVII and the corresponding aperiodic fibril (A), and (b) heterotrimer of collagen type I and the corresponding periodic fibril (B). To see this figure in color, go online.
      The second factor determining the arrangement topology is entropy: the entropy of the system (lattice) increases with the number of degrees of freedom available for distributing trimers in the fibril lattice. Intuitively, the higher the number of molecules per repeating unit, M, the greater the number of possible molecular configurations, and so the greater the entropy. However, what is the exact relation between the entropy and M? To address this question, we examined the entropy gain per triple helix for different values of M. If all staggers are equiprobable, the entropy of a macrostate (a system with M staggers) can be evaluated with the Boltzmann formula. Our approach is illustrated in Fig. S2 (see figure caption for details). The slow (logarithmic) growth of Sn from S3 = 0 to S7 = 3.22 R/mol is shown in Fig. S2 B. This analysis shows that the entropy gain per triple helix is too small to justify the selection of any particular arrangement (M) over the others. Furthermore, it is sub-extensive: if we doubled the length of the trimers and built a fiber twice the original length, the stagger entropy would not change.
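      To give a sense of scale, the Boltzmann formula for W equiprobable configurations, S = R ln W, can be inverted to recover the multiplicity implied by the quoted entropies. The sketch below assumes that the reported values are expressed in units of the gas constant R; how W is counted for each M is described in Fig. S2 and is not reproduced here, and the function name is illustrative.

```python
import math


def multiplicity_from_entropy(entropy_in_units_of_R: float) -> float:
    """Invert S = R ln W to get the number of equiprobable configurations W."""
    return math.exp(entropy_in_units_of_R)


w3 = multiplicity_from_entropy(0.0)   # S3 = 0 corresponds to one configuration
w7 = multiplicity_from_entropy(3.22)  # S7 = 3.22 R corresponds to about 25
```

      Under this reading, going from M = 3 to M = 7 changes the multiplicity from 1 to roughly 25 configurations, which is why the associated entropy gain per triple helix remains of the order of a few R.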

      Periodicity prediction

      Fibrillar collagens

      For each sample trimer of vertebrate species, we use Eq. 5 to examine the interaction curves presented in Fig. 2 across the various collagen types. The results are summarized in Table 2 and visualized in Fig. 3 C. In general, we detect periodicity signals for 10 out of 11 of the data sets of trimers analyzed, among which 7 trimer types exhibit no or marginal signal variance across the different species. We predict that the analyzed molecular trimers of collagen type I (Fig. 3 Da, b), type II (Fig. 3 Dc), type XI (Fig. 3 Dh), and heterotrimers of collagen type V (Fig. 3 Df, g) are capable, upon self-interaction, of assembly into periodic fibrils, regardless of their phylogenetic origin. Conversely, we anticipate that none of the studied collagen type XXIV trimers aggregate into periodic structures. As for the second developmental collagen, type XXVII, we do not find periodicity signals for a high percentage (91%) of the analyzed trimers, although some bird species provide exceptions (Fig. 3 Dj).
      Table 2. Summary of periodicity prediction for orthologs of collagen triple helices of various types
      Trimer type/molecular composition | Species with predicted periodicity (%) | Δxp values [s] | No. of records | Percentage of records for the Δxp given in ( ) | Predicted periodicity p [s]
      [α1(I)3] | 100 | 235, 234 | 125 | 99.2 (235) 96.80 (234) | 235, 234
      [α1(I)2α2(I)] | 100 | 235, 234 | 118 | 100 (235) 97.46 (234) | 235, 234
      [α1(II)3] | 100 | 235, 234, 236 | 116 | 100 (235) 28.45 (234) 24.14 (236) | 235
      [α1(III)3] | 84.72 | 235, 234 | 144 | 82.64 (235) 66.67 (234) | 235, 234
      [α1(V)3] | 65.04 | 235, 234 | 123 | 63.41 (235) 2.44 (234) | 235
      [α1(V)2α2(V)] | 96.4 | 235, 234 | 111 | 94.59 (235) 7.21 (234) | 235
      [α1(V)α2(V)α3(V)] | 92.19 | 235, 234 | 64 | 92.19 (235) 32.81 (234) 3.12 (236) | 235
      [α1(XI)α2(XI)α1(II)] | 100 | 235, 234 | 75 | 65.33 (235) 100 (234) | 234
      [α1(XXIV)3] | 0 |  | 148 |  | nonperiodic
      [α1(XXVII)3] | 8.33 | 240, 241 | 156 | 7.05 (240) 1.28 (241) | nonperiodic
      [α2(I)3] (a) | 83.08 | 236, 235, 234 | 201 | 8.96 (236) 82.09 (235) 7.46 (234) | 235
      Bold values are intended to highlight the collagen types for which fibril periodicity is completely conserved across species.
      a A collagen type I homotrimer not known to occur in vivo but detected in vitro (
      • Leikina E.
      • Mertts M.V.
      • Leikin S.
      • et al.
      Type I collagen is thermally unstable at body temperature.
      ).
      Figure 3. (A) Predicted values of the axial offset Δxp at which the interaction energy H(Δxp) reaches a local minimum. The results were collected by sampling the offset values such that, for each trimer, Δxp ∈ (L/5, L/4]. The phylogenetic grouping of orthologous trimers is marked by color according to the legend on the left side (top), revealing the variance across organisms. (B) The value of the parameter α was selected using a validation set held out from model fitting, and set to 2.15 (see text). (C) The derived summary statistic enables classification of trimers into two general groups: periodic (build periodic fibrils) and nonperiodic. (D) The classification results for each data set are shown, and the division of species into their corresponding phylogenetic groups is indicated. We predict that molecular trimers of collagen type I (a, b) and collagen type II (c) are capable of assembly into periodic fibrils, regardless of their phylogenetic origin. Conversely, for collagen type XXIV (i) we predict no periodicity signals. For collagen type XXVII (j) we predict that assemblies built by the trimers of selected bird species contain periodicity signals at a staggering value p of 240 [s] (see A(j)). Note that the collagen type I α2(I)3 trimer (part (k) in A and D and bottom of C) is not known to occur in vivo but has been detected in vitro (
      • Leikina E.
      • Mertts M.V.
      • Leikin S.
      • et al.
      Type I collagen is thermally unstable at body temperature.
      ). To see this figure in color, go online.
      For the remaining collagen types, we detected periodicity signals for only a fraction of trimers (Table 2). This raises the possibility that the ability to form periodic structures could be, at least to some extent, species dependent for these trimer types. On the other hand, it could result from sequencing errors or be an artifact of the simple binary classification scheme used (our model returns only the decision; it provides no information about the distance of the measured value from the decision boundary). To explore this further, within each collagen type we classify trimers into nine taxonomic groups according to the class of the corresponding organisms. The results of this analysis are shown in Fig. 3 C and D, where different phylogenetic groups are marked by colors according to the legend. We detect periodicity signals among the majority of collagen type III trimers (86%), except for some species of birds and reptiles (Fig. 3 Dd). The analysis of interactions between type V homotrimers suggests that most trimers of fish and birds do not encode any periodicity signals, whereas those of more evolutionarily advanced mammalian species are likely to carry signals of periodic assembly (Fig. 3 De). At present we do not know the meaning of this observation. We extended our analysis to include a second type of collagen type I homotrimer: α2(I)3. The structure of the C-terminal propeptide of the pro-α2(I)-chain prevents this trimer from occurring in vivo (
      • Sharma U.
      • Carrique L.
      • Hulmes D.J.S.
      • et al.
      Structural basis of homo-and heterotrimerization of collagen I.
      ,
      • Lees J.F.
      • Tasab M.
      • Bulleid N.J.
      Identification of the molecular recognition sequence which determines the type-specific assembly of procollagen.
      ). However, this homotrimer has been detected in vitro in collagen renaturation experiments (
      • Leikina E.
      • Mertts M.V.
      • Leikin S.
      • et al.
      Type I collagen is thermally unstable at body temperature.
      ). We include here an analysis of its putative propensity to assemble into periodic fibrils as the α2-chain is known to influence the biophysical and molecular properties of the collagen type I heterotrimers, such as denaturation temperature or binding to matrix proteins (
      • Kuznetsova N.V.
      • McBride D.J.
      • Leikin S.
      Changes in thermal stability and microunfolding pattern of collagen helix resulting from the loss of α2(I) chain in osteogenesis imperfecta murine.
      ), and thus future research may be aimed at engineering this molecule. For this homotrimer, we detect putative indicators of periodic assembly for 83% of the tested trimers; the remaining 17%, predicted to be nonperiodic, come mostly from different species of fish (Fig. 3 Dk).
      The predicted values of the stagger distance p (Δxp), which corresponds to the periodic minimum of the interhelical interaction energy (i.e., the periodicity) defined by Eq. 5, are listed for each type of trimer in Table 2 and shown across phylogenetic groups in Fig. 3 A. Among the major and regulatory collagen types (Fig. 3 Aa–h, k) we detected three possible values of the interhelical stagger: p = {234, 235, 236} [s]. For a marginal percentage (8%) of collagen type XXVII trimers, from bird species, we predicted a single value, p = 240 [s].

      Collagen model peptides

      The prediction of fiber periodicity is one of the most prominent challenges in the design of fibrous proteins for biomedical applications. Recent studies show advancements in collagen mimetic peptide design, which has succeeded in obtaining collagen-mimetic trimers that are capable of self-assembly into periodic mini-fibers (
      • Kaur P.J.
      • Strawn R.
      • Xu Y.
      • et al.
      The self-assembly of a mini-fibril with axial periodicity from a designed collagen-mimetic triple helix.
      ,
      • Chen F.
      • Strawn R.
      • Xu Y.
      The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
      ,
      • Strawn R.
      • Chen F.
      • Xu Y.
      • et al.
      To achieve self-assembled collagen mimetic fibrils using designed peptides.
      ,
      • Rele S.
      • Song Y.
      • Chaikof E.L.
      • et al.
      D-periodic collagen-mimetic microfibers.
      ). In these cases, sequence design is based on the selection of a fragment of a collagen helix, which is subsequently repeated in tandem (
      • Kaur P.J.
      • Strawn R.
      • Xu Y.
      • et al.
      The self-assembly of a mini-fibril with axial periodicity from a designed collagen-mimetic triple helix.
      ,
      • Chen F.
      • Strawn R.
      • Xu Y.
      The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
      ,
      • Strawn R.
      • Chen F.
      • Xu Y.
      • et al.
      To achieve self-assembled collagen mimetic fibrils using designed peptides.
      ,
      • Rele S.
      • Song Y.
      • Chaikof E.L.
      • et al.
      D-periodic collagen-mimetic microfibers.
      ). Chen et al. (
      • Chen F.
      • Strawn R.
      • Xu Y.
      The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
      ) have produced three collagen-mimetic helical peptides, two of which (designated by the authors as COL108 and COL877) have been shown experimentally to self-assemble into periodic structures. For these two constructs, the authors designed primary sequences that repeat a 108 amino-acid-long triple-helical motif extracted from the human α1(I)-chain three times. These 378-residue-long peptides form a stable helix which self-assembles into periodic mini-fibrils with a periodicity of 35 nm (
      • Kaur P.J.
      • Strawn R.
      • Xu Y.
      • et al.
      The self-assembly of a mini-fibril with axial periodicity from a designed collagen-mimetic triple helix.
      ). As a reference (negative control), the authors constructed a third peptide (named COL108rr), which, in contrast, contains a randomized fragment of the α1(I) helical sequence in the middle (
      • Chen F.
      • Strawn R.
      • Xu Y.
      The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
      ). These helices do not form periodic fibrils but instead build nonspecific aggregates (
      • Chen F.
      • Strawn R.
      • Xu Y.
      The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
      ).
      We tested our model predictions on these three collagen-mimetic helices. First, we retrieved the peptide sequences provided in (
      • Chen F.
      • Strawn R.
      • Xu Y.
      The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
      ,
      • Kaur P.J.
      • Strawn R.
      • Xu Y.
      • et al.
      The self-assembly of a mini-fibril with axial periodicity from a designed collagen-mimetic triple helix.
      ) and constructed corresponding homotrimer models. We then examined the pairwise interaction energies between the trimers for different offset values using the same approach as for the natural fibrillar proteins. The resulting interaction energy curves as a function of Δxp are shown in Figs. 4, S7 and S8. Finally, given the computed energy patterns, we used Eq. 5 to predict periodicity.
      Figure 4. Estimated energies for the triple helical peptide COL877 (
      • Chen F.
      • Strawn R.
      • Xu Y.
      The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
      ) for M = (A) 3; (B) 4; (C) 5; (D) 6; (E) 7. Predicted periodicity p = 123 [s]. To see this figure in color, go online.
      We found that our predictions are in agreement with experimentally measured periodicity values for collagen mimetic peptides. The homotrimers built by assembling COL108 peptides are predicted to form fibrils with a periodicity of 35 nm (0.3 gap, 0.7 overlap), as observed by TEM (
      • Kaur P.J.
      • Strawn R.
      • Xu Y.
      • et al.
      The self-assembly of a mini-fibril with axial periodicity from a designed collagen-mimetic triple helix.
      ). Our model predicts a periodicity value of 122 helical segments ≈ 35 nm (Fig. S7), reproducing the estimate made by the authors using a linear scoring model (
      • Kaur P.J.
      • Strawn R.
      • Xu Y.
      • et al.
      The self-assembly of a mini-fibril with axial periodicity from a designed collagen-mimetic triple helix.
      ). For the peptide COL877, the measured periodicity value equals 32±1.4 nm (
      • Chen F.
      • Strawn R.
      • Xu Y.
      The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
      ), whereas our model predicts p=123 helical segments, which corresponds to 35 nm (Fig. 4). Given the predicted offset value p={122,123}, we anticipate that these collagen-mimetic trimers self-assemble into an arrangement with M = 4 in the LRU, instead of M = 5 as found in the case of natural collagens.
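      The assignment of M from a predicted offset can be checked with a short calculation. The sketch below assumes that an arrangement with M staggers corresponds to the window L/M < p ≤ L/(M − 1), which is our reading of the interval quoted for M = 5 in the Fig. 3 caption. It further assumes L = 1014 segments for the natural trimers (the length used for the randomized sequences) and L = 378 for the mimetic peptides; the function name is illustrative.

```python
def staggers_per_repeat(length: int, p: int) -> int:
    """Return M such that length / M < p <= length / (M - 1)."""
    # The window is equivalent to M - 1 <= length / p < M.
    return length // p + 1


m_natural = staggers_per_repeat(1014, 235)    # natural collagens, p = 235
m_type_iii = staggers_per_repeat(1014, 352)   # offset of the type III drop
m_col108 = staggers_per_repeat(378, 122)      # COL108, p = 122
m_col877 = staggers_per_repeat(378, 123)      # COL877, p = 123
```

      Under these assumptions, p = 235 falls in the M = 5 window for a 1014-segment helix, whereas p = 122 and p = 123 fall in the M = 4 window for a 378-residue peptide, consistent with the arrangement anticipated above.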

      Evaluation with randomized sequences

      To evaluate the likelihood of detecting periodicity signals by chance, we carried out three experiments. First, we generated a set of 1000 pseudorandom sequences of length L = 1014 [s], where the amino acid at each position was drawn from the uniform distribution. We did not detect periodicity signals for any of these samples. To explore further, we generated two additional sets of sequences: 1) 1000 samples obtained by sampling the amino acids at the X and Y positions from the distributions of amino acid occurrences at the X and Y positions estimated for the human α1(I)-chain, and 2) 1000 samples obtained by permuting the order of the GXY tripeptides of the human α1(I)-chain. This α-chain is known to contain a periodicity signal. Therefore, if the sequence composition had a predominant impact on the reported signal, we would expect the signal to be identified in cases 1) and 2). For the first group, we found that just 1.3% of samples are predicted to encode some level of periodicity signal. Finally, when we only permute the GXY triplets that occur in the α1(I) sequence, only 1.4% of the 1000 samples contain some level of periodicity signal.
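      The three randomization schemes can be sketched as follows. The template is a short made-up GXY sequence standing in for the human α1(I) helical region, whose length is assumed to be a multiple of three and to start with G. Sampling the X and Y positions from separate empirical distributions is our reading of the text, and all function names are illustrative.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def uniform_random(length, rng):
    """Every position drawn uniformly from the 20 amino acids."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def composition_matched(template, rng):
    """Keep G in place; resample X and Y from the template's own X and Y residues."""
    xs, ys = template[1::3], template[2::3]
    n_triplets = len(template) // 3
    return "".join("G" + rng.choice(xs) + rng.choice(ys) for _ in range(n_triplets))


def shuffled_triplets(template, rng):
    """Permute the order of the GXY triplets, leaving each triplet intact."""
    triplets = [template[i:i + 3] for i in range(0, len(template), 3)]
    rng.shuffle(triplets)
    return "".join(triplets)


rng = random.Random(0)          # fixed seed so the sketch is reproducible
toy_template = "GPPGAPGEKGPA"   # assumed toy sequence, not the real chain

uniform_sample = uniform_random(1014, rng)
matched_sample = composition_matched(toy_template, rng)
shuffled_sample = shuffled_triplets(toy_template, rng)
```

      Each generated sequence would then be assembled into a homotrimer and passed through the same periodicity test as the natural sequences.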

      Discussion

      The energy of noncovalent interactions drives collagen proteins to form contacts and stabilizes their initial register, while the formation of intermolecular covalent bonds finalizes quaternary structure formation. Thermodynamically, the formation of collagen fibrils in normal conditions (physiological salt, 293–310 K) is an endothermic process, which occurs due to the great positive value of assembly entropy, resulting from solvent rearrangement (
      • Cooper A.
      Thermodynamic studies of the assembly in vitro of native collagen fibrils.
      ,
      • Kadler K.E.
      • Hojima Y.
      • Prockop D.J.
      Assembly of collagen fibrils de novo by cleavage of the type I pC-collagen with procollagen C-proteinase. Assay of critical concentration demonstrates that collagen self-assembly is a classical example of an entropy-driven process.
      ).
      After the initial spontaneous association of trimers, the exact details of the quaternary structure are established by the gain in stabilization energy. This further energy gain can be attributed to the optimal alignment of collagen trimers in the fibril interior. To infer the optimal alignment, we first asked which constraints are imposed on the arrangement of trimers within the fibril. Here, we have provided evidence that the value of the molecular stagger is encoded in the sequence of collagen trimers and that the entropic effects that result from the selection of a specific M are negligible. We examined the free energy change as a function of the mutual orientation between the trimers using a pairwise approximation to the possible intersegmental interactions that result if two trimers are in contact. Our analysis suggests that the emergence of intermolecular stagger and, as a consequence, of the detectable structural features of collagen fibrils results from the drop in the interaction free energy observed when the trimers are shifted by the optimal distance.
      In the genomes of modern vertebrate species, 11 fibrillar collagen genes encode distinct α-polypeptides, which combine to form collagen molecules of seven types (see Table 1). These genes have been grouped into three clades (A, B, and C) by similarity (
      • Boot-Handford R.P.
      • Tuckwell D.S.
      Fibrillar collagen: the key to vertebrate evolution? A tale of molecular incest.
      ,
      • Exposito J.-Y.
      • Cluzel C.
      • Lethias C.
      • et al.
      Evolution of collagens.
      ). Major fibrillar collagens (I, II, and III, built exclusively from chains of subclass A) and minor fibrillar collagens (V and XI, combining chains from clade A and B) constitute the main component of fibrils, forming the core of the extracellular matrix (
      • Parry D.A.
      • Squire J.M.
      Fibrous Proteins: Structures and Mechanisms.
      ). By comparing the interaction energy curves averaged over the data sets of protein orthologs, we identified that a well-defined energy minimum exists universally among the major and minor collagen types for M = 5 (Fig. 2 A–H). Moreover, for M = 5 the patterns are correlated, such that the energy minimum exists for all possible trimer interactions in the LRU, i.e., for all integer multiples of the Δxp stagger. Interestingly, the values of the energy minima are similar for all integer multiples of the Δxp stagger across all collagen types, except for the collagen type II α1(II)3 trimer, implying that, with this exception, there is little energy penalty for altering the lateral arrangement of molecules within collagen fibrils. This suggests that lateral compression of collagen fibrils, for instance, is likely to be relatively facile. We find that the computed energy minima are broad and funnel shaped in all cases, suggesting that collagen assemblies are adapted to recover from longitudinal strains without compromising their molecular registry, and that a component of the fibril elasticity is encoded in the protein sequences. Moreover, it is plausible that the funnel-shaped interaction curves increase the robustness of the assembly kinetics.
      We then employed Eq. 5 to predict periodicity for the analyzed collagen trimers across species. The results are summarized in Table 2. Since a specific stagger distance can clearly be selected over competing ones, we predict that collagen trimer types I, II, III, V, and XI self-assemble into periodic suprastructures. This has been confirmed experimentally for some species (see Table 1 and Fig. 2). Conversely, for the developmental collagen types (XXIV and XXVII, made up of chains from subclass C) we predict the formation of nonspecific aggregates, since no energetically favorable alignment can be identified (see Fig. 2 for comparison). It has been hypothesized that the partial processing and retention of the N-terminal globular extension by the mature form of the protein gives rise to the lack of an experimentally observed banding pattern (
      • Boot-Handford R.P.
      • Tuckwell D.S.
      • Poulsom R.
      • et al.
      A novel and highly conserved collagen (proα1 (XXVII)) with a unique expression pattern and unusual molecular characteristics establishes a new clade within the vertebrate fibrillar collagen family.
      ,
      • Koch M.
      • Laub F.
      • Gordon M.K.
      • et al.
      Collagen XXIV, a vertebrate fibrillar collagen with structural features of invertebrate collagens selective expression in developing cornea and bone.
      ). Our analysis reveals that the lack of structural features observed for these trimer types is encoded, to a large extent, in the helical region of underlying α-polypeptides.

      Conclusions

      The unique structural properties of collagen triple helices endow them with the capability to encode information about self-assembly mechanisms (
      • Fidler A.L.
      • Boudko S.P.
      • Hudson B.G.
      • et al.
      The triple helix of collagens–an ancient protein structure that enabled animal multicellularity and tissue evolution.
      ). In this work we presented a simple, parameter-free model for collagen fibril design that predicts the structure of self-assembling collagen fibers on the basis of the amino acid sequence of the constituent α-chain subunits. The increasing availability of genomic data allows us to test the idea that optimal molecular alignment is dictated by the free energy of intermolecular interactions. Using our simple model we can estimate the free energy of interactions between each pair of trimers from the 11 data sets of vertebrate fibrillar collagens. We have analyzed the variance in the reported optimal stagger value across species for each trimer type, and conclude that the stagger distance between collagen molecules in their fibrils and, as a result, the phenotypic features of those fibrils are well preserved across evolutionary time. Our predicted periodicities are in good agreement with experimental findings concerning the structural features of collagenous fibrils. We believe that the interaction model presented in this work provides a foundation for future studies that aim to design new α-polypeptide sequences for targeted applications. Collagen-mimetic trimers capable of assembly into fibrillar suprastructures with desirable structural features are currently in high demand in medicine and materials engineering. The understanding provided by our model should allow simple prediction of whether a sequence can form periodic fibrils by itself, and should help in designing optimal interactions with other collagenous proteins, e.g., for in vivo prosthetic applications.

      Author contributions

      M.J.D., L.J.C., D.F., and A.M.P. designed the research. A.M.P. collected and processed the data and carried out all calculations. A.M.P., M.J.D., L.J.C., and D.F. wrote the article.

      Acknowledgments

      We thank Ieva Goldberga for helpful discussions related to the collagen TEM studies. A.M.P. was funded by a Raymond and Beverly Sackler Fund for Physics of Medicine (University of Cambridge), the European Research Council, and the Simons Foundation.

      Declaration of interests

      The authors declare no competing interests.

      References

        • Kadler K.E.
        • Baldock C.
        • Boot-Handford R.P.
        • et al.
        Collagens at a glance.
        J. Cell Sci. 2007; 120: 1955-1958
        • Petruska J.A.
        • Hodge A.J.
        A subunit model for the tropocollagen macromolecule.
        Proc. Natl. Acad. Sci. USA. 1964; 51: 871-876
        • Meek K.M.
        • Chapman J.A.
        • Hardcastle R.A.
        The staining pattern of collagen fibrils. Improved correlation with sequence data.
        J. Biol. Chem. 1979; 254: 10710-10714
        • Smith J.W.
        Molecular pattern in native collagen.
        Nature. 1968; 219: 157-158
        • Antipova O.
        • Orgel J.P.R.O.
        In situ D-periodic molecular structure of type II collagen.
        J. Biol. Chem. 2010; 285: 7087-7096
        • Bos K.J.
        • Holmes D.F.
        • Bishop P.N.
        • et al.
        Axial structure of the heterotypic collagen fibrils of vitreous humour and cartilage.
        J. Mol. Biol. 2001; 306: 1011-1022
        • Parkin J.D.
        • San Antonio J.D.
        • Savige J.
        • et al.
        The collαgen III fibril has a “flexi-rod” structure of flexible sequences interspersed with rigid bioactive domains including two with hemostatic roles.
        PLoS One. 2017; 12: e0175582
        • Holmes D.F.
        • Gilpin C.J.
        • Kadler K.E.
        • et al.
        Corneal collagen fibril structure in three dimensions: structural insights into fibril assembly, mechanical properties, and tissue organization.
        Proc. Natl. Acad. Sci. USA. 2001; 98: 7307-7312
        • Birk D.E.
        • Fitch J.M.
        • Linsenmayer T.F.
        • et al.
        Collagen fibrillogenesis in vitro: interaction of types I and V collagen regulates fibril diameter.
        J. Cell Sci. 1990; 95: 649-657
        • Ricard-Blum S.
        • Ruggiero F.
        The collagen superfamily: from the extracellular matrix to the cell membrane.
        Pathol. Biol. 2005; 53: 430-442
        • Plumb D.A.
        • Dhir V.
        • Boot-Handford R.P.
        • et al.
        Collagen XXVII is developmentally regulated and forms thin fibrillar structures distinct from those of classical vertebrate fibrillar collagens.
        J. Biol. Chem. 2007; 282: 12791-12795
        • Boot-Handford R.P.
        • Tuckwell D.S.
        • Poulsom R.
        • et al.
        A novel and highly conserved collagen (proα1 (XXVII)) with a unique expression pattern and unusual molecular characteristics establishes a new clade within the vertebrate fibrillar collagen family.
        J. Biol. Chem. 2003; 278: 31067-31077
        • Hjorten R.
        • Hansen U.
        • Pace J.M.
        • et al.
        Type XXVII collagen at the transition of cartilage to bone during skeletogenesis.
        Bone. 2007; 41: 535-542
        • Hulmes D.J.
        • Miller A.
        • Woodhead-Galloway J.
        • et al.
        Analysis of the primary structure of collagen for the origins of molecular packing.
        J. Mol. Biol. 1973; 79: 137-148
        • Hofmann H.
        • Fietzek P.P.
        • Kühn K.
        The role of polar and hydrophobic interactions for the molecular packing of type I collagen: a three-dimensional evaluation of the amino acid sequence.
        J. Mol. Biol. 1978; 125: 137-165
        • Jones E.Y.
        • Miller A.
        Analysis of structural design features in collagen.
        J. Mol. Biol. 1991; 218: 209-219
        • Trus B.L.
        • Piez K.A.
        Molecular packing of collagen: three-dimensional analysis of electrostatic interactions.
        J. Mol. Biol. 1976; 108: 705-732
        • McBride Jr., D.J.
        • Kadler K.E.
        • Prockop D.J.
        • et al.
        Self-assembly into fibrils of a homotrimer of type I collagen.
        Matrix. 1992; 12: 256-263
        • Hulmes D.J.
        • Jesior J.-C.
        • Wolff C.
        • et al.
        Electron microscopy shows periodic structure in collagen fibril cross sections.
        Proc. Natl. Acad. Sci. USA. 1981; 78: 3567-3571
        • Doyle B.B.
        • Hulmes D.J.
        • Woodhead-Galloway J.
        • et al.
        A D-periodic narrow filament in collagen.
        Proc. R. Soc. Lond. B Biol. Sci. 1974; 186: 67-74
        • Brodsky B.
        • Eikenberry E.F.
        • Cassidy K.
        An unusual collagen periodicity in skin.
        Biochim. Biophys. Acta. 1980; 621: 162-166
        • Mizuno K.
        • Adachi E.
        • Hayashi T.
        • et al.
        The fibril structure of type V collagen triple-helical domain.
        Micron. 2001; 32: 317-323
        • Chanut-Delalande H.
        • Fichard A.
        • Ruggiero F.
        • et al.
        Control of heterotypic fibril formation by collagen V is determined by chain stoichiometry.
        J. Biol. Chem. 2001; 276: 24352-24359
        • Hansen U.
        • Bruckner P.
        Macromolecular specificity of collagen fibrillogenesis: fibrils of collagens I and XI contain a heterotypic alloyed core and a collagen I sheath.
        J. Biol. Chem. 2003; 278: 37352-37359
        • Leikina E.
        • Mertts M.V.
        • Leikin S.
        • et al.
        Type I collagen is thermally unstable at body temperature.
        Proc. Natl. Acad. Sci. USA. 2002; 99: 1314-1318
        • Morozova S.
        • Muthukumar M.
        Electrostatic effects in collagen fibril formation.
        J. Chem. Phys. 2018; 149: 163333
        • Gautieri A.
        • Vesentini S.
        • Buehler M.J.
        • et al.
        Hierarchical structure and nanomechanics of collagen microfibrils from the atomistic scale up.
        Nano Lett. 2011; 11: 757-766
        • Gautieri A.
        • Pate M.I.
        • Buehler M.J.
        • et al.
        Hydration and distance dependence of intermolecular shearing between collagen molecules in a model microfibril.
        J. Biomech. 2012; 45: 2079-2083
        • Chang S.-W.
        • Buehler M.J.
        Molecular biomechanics of collagen molecules.
        Mater. Today. 2014; 17: 70-76
        • Rainey J.K.
        • Goh M.C.
        A statistically derived parameterization for the collagen triple-helix.
        Protein Sci. 2002; 11: 2748-2754
        • Sippl M.J.
        Calculation of conformational ensembles from potentials of mean force: an approach to the knowledge-based prediction of local structures in globular proteins.
        J. Mol. Biol. 1990; 213: 859-883
        • Lu H.
        • Lu L.
        • Skolnick J.
        Development of unified statistical potentials describing protein-protein interactions.
        Biophys. J. 2003; 84: 1895-1901
        • Ravikant D.V.S.
        • Elber R.
        Energy design for protein-protein interactions.
        J. Chem. Phys. 2011; 135: 065102
        • Dosztányi Z.
        • Csizmók V.
        • Simon I.
        • et al.
        The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins.
        J. Mol. Biol. 2005; 347: 827-839
        • Miyazawa S.
        • Jernigan R.L.
        Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation.
        Macromolecules. 1985; 18: 534-552
        • Miyazawa S.
        • Jernigan R.L.
        Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading.
        J. Mol. Biol. 1996; 256: 623-644
        • Miyazawa S.
        • Jernigan R.L.
        Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues.
        Proteins. 1999; 34: 49-68
        • Kawashima S.
        • Kanehisa M.
        AAindex: amino acid index database.
        Nucleic Acids Res. 2000; 28: 374
        • Edgar R.C.
        MUSCLE: a multiple sequence alignment method with reduced time and space complexity.
        BMC Bioinformatics. 2004; 5: 113
        • Edgar R.C.
        MUSCLE: multiple sequence alignment with high accuracy and high throughput.
        Nucleic Acids Res. 2004; 32: 1792-1797
        • Sharma U.
        • Carrique L.
        • Hulmes D.J.S.
        • et al.
        Structural basis of homo- and heterotrimerization of collagen I.
        Nat. Commun. 2017; 8: 14671
        • Lees J.F.
        • Tasab M.
        • Bulleid N.J.
        Identification of the molecular recognition sequence which determines the type-specific assembly of procollagen.
        EMBO J. 1997; 16: 908-916
        • Kuznetsova N.V.
        • McBride D.J.
        • Leikin S.
        Changes in thermal stability and microunfolding pattern of collagen helix resulting from the loss of α2(I) chain in osteogenesis imperfecta murine.
        J. Mol. Biol. 2003; 331: 191-200
        • Kaur P.J.
        • Strawn R.
        • Xu Y.
        • et al.
        The self-assembly of a mini-fibril with axial periodicity from a designed collagen-mimetic triple helix.
        J. Biol. Chem. 2015; 290: 9251-9261
        • Chen F.
        • Strawn R.
        • Xu Y.
        The predominant roles of the sequence periodicity in the self-assembly of collagen-mimetic mini-fibrils.
        Protein Sci. 2019; 28: 1640-1651
        • Strawn R.
        • Chen F.
        • Xu Y.
        • et al.
        To achieve self-assembled collagen mimetic fibrils using designed peptides.
        Biopolymers. 2018; 109: e23226
        • Rele S.
        • Song Y.
        • Chaikof E.L.
        • et al.
        D-periodic collagen-mimetic microfibers.
        J. Am. Chem. Soc. 2007; 129: 14780-14787
        • Cooper A.
        Thermodynamic studies of the assembly in vitro of native collagen fibrils.
        Biochem. J. 1970; 118: 355-365
        • Kadler K.E.
        • Hojima Y.
        • Prockop D.J.
        Assembly of collagen fibrils de novo by cleavage of the type I pC-collagen with procollagen C-proteinase. Assay of critical concentration demonstrates that collagen self-assembly is a classical example of an entropy-driven process.
        J. Biol. Chem. 1987; 262: 15696-15701
        • Boot-Handford R.P.
        • Tuckwell D.S.
        Fibrillar collagen: the key to vertebrate evolution? A tale of molecular incest.
        Bioessays. 2003; 25: 142-151
        • Exposito J.-Y.
        • Cluzel C.
        • Lethias C.
        • et al.
        Evolution of collagens.
        Anat. Rec. 2002; 268: 302-316
        • Parry D.A.
        • Squire J.M.
        Fibrous Proteins: Structures and Mechanisms.
        Vol. 82. Springer, 2017
        • Koch M.
        • Laub F.
        • Gordon M.K.
        • et al.
        Collagen XXIV, a vertebrate fibrillar collagen with structural features of invertebrate collagens: selective expression in developing cornea and bone.
        J. Biol. Chem. 2003; 278: 43236-43244
        • Fidler A.L.
        • Boudko S.P.
        • Hudson B.G.
        • et al.
        The triple helix of collagens – an ancient protein structure that enabled animal multicellularity and tissue evolution.
        J. Cell Sci. 2018; 131: jcs203950