<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1748-7188-2-3</ui>
   <ji>1748-7188</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>A spatio-temporal mining approach towards summarizing and analyzing protein folding trajectories</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Yang</snm>
               <fnm>Hui</fnm>
               <insr iid="I1"/>
               <email>huiyang@sfsu.edu</email>
            </au>
            <au id="A2">
               <snm>Parthasarathy</snm>
               <fnm>Srinivasan</fnm>
               <insr iid="I2"/>
               <email>srini@cse.ohio-state.edu</email>
            </au>
            <au id="A3">
               <snm>Ucar</snm>
               <fnm>Duygu</fnm>
               <insr iid="I2"/>
               <email>ucar@cse.ohio-state.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Computer Science, San Francisco State University, 1600 Holloway Avenue, San Francisco, California, USA</p>
            </ins>
            <ins id="I2">
               <p>Department of Computer Science and Engineering, Ohio State University, 2015 Neil Avenue, Columbus, Ohio, USA</p>
            </ins>
         </insg>
         <source>Algorithms for Molecular Biology</source>
         <issn>1748-7188</issn>
         <pubdate>2007</pubdate>
         <volume>2</volume>
         <issue>1</issue>
         <fpage>3</fpage>
         <url>http://www.almob.org/content/2/1/3</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17407611</pubid>
               <pubid idtype="doi">10.1186/1748-7188-2-3</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>04</day>
               <month>8</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>04</day>
               <month>4</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>04</day>
               <month>4</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Yang et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>Understanding the protein folding mechanism remains a grand challenge in structural biology. In the past several years, computational theories in molecular dynamics have been employed to shed light on the folding process. With high computing power and large-scale storage, researchers can now computationally simulate the protein folding process in atomistic detail at femtosecond temporal resolution. Such simulations often produce a large number of folding trajectories, each consisting of a series of 3D conformations of the protein under study. As a result, effectively managing and analyzing such trajectories is becoming increasingly important.</p>
            <p>In this article, we present a spatio-temporal mining approach to analyze protein folding trajectories. It exploits the simplicity of contact maps, while also integrating 3D structural information in the analysis. It characterizes the dynamic folding process by first identifying spatio-temporal association patterns in contact maps, then studying how such patterns evolve along a folding trajectory. We demonstrate that such patterns can be leveraged to summarize folding trajectories, and to facilitate the detection and ordering of important folding events along a folding path. We also show that such patterns can be used to identify a consensus partial folding pathway across multiple folding trajectories. Furthermore, we argue that such patterns can capture both local and global structural topology in a 3D protein conformation, thereby facilitating effective structural comparison amongst conformations.</p>
            <p>We apply this approach to analyze the folding trajectories of two small synthetic proteins: BBA5 and GSGS (or Beta3s). We show that this approach is a promising way to address the above issues, namely folding trajectory summarization, folding event detection and ordering, and consensus partial folding pathway identification across trajectories.</p>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>1 Background</p>
         </st>
         <p>The three dimensional (3D) native structures of proteins have important implications in proteomics. Understanding such structures enables us to explore the function of a protein, explain substrate and ligand binding, perform realistic drug design and potentially cure diseases caused by protein misfolding. The protein folding problem is therefore one of the most fundamental yet unsolved problems in computational molecular biology. One major challenge in simulating the protein folding process is its complexity. Snow <it>et al</it>. state that performing a Molecular Dynamics (MD) simulation on a mini-protein for just 10 <it>&#956;</it>s would require decades of computation time on a typical CPU <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Researchers in the Folding@home project recently proposed a World Wide Web-based computing model to simulate the protein folding process <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.</p>
         <p>As the volume of folding trajectories produced from high-throughput simulation tools increases drastically, there is an urgent need to compare, analyze, and manage such data. Previously, researchers have examined several summary statistics (e.g. radius of gyration, root mean square deviation (RMSD)) to identify similar 3D conformations in folding trajectories. Although summary statistics are commonly used for comparison, they can only capture biased and limited global properties of the conformation. Recently, Russel <it>et al</it>. <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> suggested using geometric spanners for mapping a simulation to a more discrete combinatorial representation. They apply geometric spanners to discover the proximity between different segments of a protein across a range of scales, and track the changes of such proximity over time.</p>
         <p>To overcome the difficulties in managing and analyzing the large amount of protein folding simulation data, Berrar <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> proposed using a data warehouse system. They embed the warehouse in a grid computing environment to enable data sharing. They also propose implementing a set of data mining algorithms to facilitate commonly needed data analysis tasks.</p>
         <p>In this article, we propose a spatio-temporal mining approach to analyze folding trajectories. We extend the spatio-temporal data mining framework that we developed earlier to analyze and manage such data <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. This framework is designed to analyze spatio-temporal data produced in several scientific domains. Previously, we applied this framework to 8732 proteins taken from the Protein Data Bank to identify structural fingerprints for different protein classes (e.g., <it>&#945;</it>-proteins) <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Each protein is associated with a set of objects that are extracted from its contact map. We then realize the notion of Spatial Object Association Pattern (SOAP) to effectively capture spatial relationships among such objects. Furthermore, by associating SOAPs with proteins in different protein classes, we have identified multiple types of SOAPs that can potentially function as the structural fingerprints for different protein classes. In this article, we extend such strategies to a new application domain: analyzing and characterizing the folding process of a protein.</p>
         <p>Clearly, protein folding trajectories consist of both spatial and temporal components. Each protein in an MD simulation is composed of a number of residues located in 3D space that move over time. Each frame (or snapshot) of the trajectory can be represented as a 2D contact map, which captures the pair-wise 3D distances between residues. We extract non-local bit-patterns from these contact maps. We then use an entropy-based clustering algorithm to cluster such bit-patterns into groups. These bit-patterns are further associated to form spatial object association patterns (SOAPs). By using SOAPs, we are able to effectively summarize and analyze folding trajectories produced by MD simulations. A major advantage of this representation is its appropriateness for cross-comparison across different simulations, as discussed in later sections.</p>
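         <p>To make the contact map representation concrete, the following minimal sketch derives a binary contact map from a list of residue coordinates. It is an illustration only: the function name <it>contact_map</it>, the 8 angstrom cutoff, and the toy coordinates are assumptions introduced here and are not taken from the framework described in this article.</p>

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric 0/1 matrix for a single frame.

    Entry [i][j] is 1 when residues i and j lie within `cutoff` angstroms
    of each other. The default cutoff is an assumed, illustrative value.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cutoff >= math.dist(coords[i], coords[j]):
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Toy frame (hypothetical data): four residues on a line, 3.8 angstroms apart.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(toy)
```

         <p>In this toy frame, the first residue is in contact with the second and third residues but not with the fourth, which lies beyond the assumed cutoff.</p>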
         <p>Compared to our previous work on protein structural analysis <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>, we have made the following contributions:</p>
         <p>&#8226; <it>Propose a contact map-based approach to analyze protein folding trajectories: </it>Our previous work focused on identifying structural signatures in the native conformations of proteins in different classes or folds. Thus, no temporal component was involved. In contrast, a folding trajectory has both spatial and temporal components. In addition, bit-patterns in a folding trajectory interact with each other and evolve over time. Moreover, the proposed approach also effectively integrates 3D structural information in the overall analysis. This is critical in understanding the protein folding mechanism.</p>
         <p>&#8226; <it>Map 2D bit-patterns in contact maps to 3D structural motifs: </it>To better understand and explain the biological meaning of the bit-patterns in contact maps, we have made an effort to establish a mapping between such bit-patterns and well-known structural motifs (e.g., <it>&#945;</it>-helices and <it>&#946;</it>-turns) in 3D conformations. Currently, this task is carried out manually. We are in the process of automating this mapping. Such a 2D-3D mapping is essential to folding data analysis for the following reasons. First, to gain insight into the folding process, it is critical to identify the formation of important local 3D motifs such as <it>&#946;</it>-turns. Second, our previous studies show that by associating multiple bit-patterns in contact maps, one can construct effective structural signatures for different protein classes or folds <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. This leads us to hypothesize that a mapping might exist between 2D bit-patterns in contact maps and 3D local motifs of a protein. In this work, we validate this hypothesis and report the mapping result later. Finally, such a mapping not only enables one to take advantage of the simplicity of working in the 2D space of contact maps, but also allows one to relate to the 3D space of protein conformations. This is important in understanding the protein folding process.</p>
         <p>&#8226; <it>Indirectly capture interactions among structural motifs in 3D space: </it>In our previous work, two bit-patterns are considered spatially proximate if they are located in the same vicinity within a 2D contact map. This is problematic in the context of protein folding, as two bit-patterns can be spatially proximate in a contact map even though their corresponding motifs are distant in the 3D conformation. (See Section 3 for more details.) We address this issue by considering the 3D distance between two bit-patterns.</p>
         <p>&#8226; <it>Propose novel strategies to analyze protein folding trajectories: </it>We propose several novel strategies to analyze protein folding trajectories based on spatial and spatio-temporal association patterns.</p>
         <p>In summary, one can benefit from our mining approach in two main aspects:</p>
         <p>&#8226; <b>Effective, informative and scalable representation of folding simulations</b>: We represent each frame by a set of SOAPs, where each SOAP in turn characterizes the spatial relationship (or interactions in the folding case) among multiple bit-patterns. SOAPs are not only easily obtainable but also, as we will show, able to capture folding events along a folding trajectory.</p>
         <p>&#8226; <b>Cross-analysis of trajectories to reveal a consensus partial folding pathway</b>: By representing each frame as a set of SOAPs, one can carry out analysis across different trajectories. Such analysis includes detecting critical events and identifying consensus partial folding pathways across trajectories.</p>
         <p>The remainder of the article is organized as follows. In Section 2, we describe the two proteins, BBA5 and GSGS, and their trajectories produced from computational simulation. We also identify two main goals in analyzing such trajectories. In Section 3, we present a step-by-step description of our analysis approach. In Section 4, we report the empirical results of analyzing the trajectories of the two proteins, focusing on BBA5. Finally, we conclude and report several ongoing research directions in Section 5.</p>
      </sec>
      <sec>
         <st>
            <p>2 Analyzing Protein Folding Trajectories</p>
         </st>
         <sec>
            <st>
               <p>2.1 Protein Folding Trajectories</p>
            </st>
            <p>Advances in high-performance computing technologies and molecular dynamics have led to successful simulations of folding dynamics for (small) proteins at the atomistic level <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Such simulations result in a large number of <it>folding trajectories</it>, each of which consists of a series of 3D conformations of the protein under simulation. These conformations are usually sampled at regular intervals (e.g., every 200 fs) during a simulation. In this article, we also refer to each conformation as a <it>folding frame </it>or simply a <it>frame</it>. Furthermore, to represent a protein conformation, we adopt a commonly used representation scheme, in which a conformation is represented as a sequence of <it>&#945;</it>-carbons (<it>C</it><sub><it>&#945;</it></sub>) located in 3D space.</p>
            <p>In this article, we focus on the folding trajectories of two mini proteins: BBA5 (Protein Data Bank ID) <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> and GSGS (or Beta3s) <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. Such trajectories were produced by the Folding@home research group at Stanford University <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>.</p>
            <p>BBA5 is a 23-residue protein that folds on the microsecond timescale. The native structure (or fold) of BBA5 shows a <it>&#946;</it>-hairpin involving residues 1&#8211;10 and centered on residues 4&#8211;5. It also includes an <it>&#945;</it>-helix involving the remaining residues 11&#8211;23. By convention, residues are numbered in increasing order from the N-terminus to the C-terminus of a protein. Figure <figr fid="F1">1(a)</figr> illustrates the native conformation of BBA5. The two BBA5 folding trajectories, referred to as <it>T</it><sub>23 </sub>and <it>T</it><sub>24 </sub>respectively, differ in length: <it>T</it><sub>23 </sub>consists of a series of 192 conformations (or frames), while <it>T</it><sub>24 </sub>has 150 frames. Each conformation is described at the atomistic level in the PDB format used by the Protein Data Bank.</p>
            <p>GSGS (or Beta3s) is a 20-residue peptide with an average folding time on the order of microseconds. Its NMR conformation shows a three-stranded anti-parallel <it>&#946;</it>-sheet with turns at residues 6&#8211;7 and 14&#8211;15. Figure <figr fid="F2">2(a)</figr> depicts this native conformation. There are a total of 5 GSGS folding trajectories: <it>T</it><sub>1</sub>, <it>T</it><sub>2</sub>, <it>T</it><sub>3</sub>, <it>T</it><sub>4</sub>, and <it>T</it><sub>5</sub>. The number of conformations in each trajectory is listed in Table <tblr tid="T1">1</tblr>. As with BBA5, each conformation corresponds to one PDB file. Pande <it>et al</it>. describe in detail the simulation model and methods employed to produce such trajectories <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>.</p>
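            <p>Since each frame is stored as one PDB file and is represented here by its <it>&#945;</it>-carbons, a frame can be reduced to a list of coordinates with a few lines of code. The sketch below is an illustration rather than the tooling used in this study: the function name <it>read_ca_coords</it> is hypothetical, and the code relies only on the fixed-column layout of ATOM records in the PDB format.</p>

```python
def read_ca_coords(pdb_lines):
    """Collect (residue_number, (x, y, z)) for every C-alpha ATOM record.

    `pdb_lines` is any iterable of text lines, e.g. an open PDB file.
    """
    coords = []
    for line in pdb_lines:
        # Fixed-column PDB layout: atom name in columns 13-16,
        # residue number in 23-26, x/y/z in 31-38, 39-46 and 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            res_num = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            coords.append((res_num, xyz))
    return coords
```

            <p>Applied to every file of a trajectory in order, this would yield one coordinate list per frame, which is the input assumed by the other sketches in this article.</p>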
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Different conformations of the small protein BBA5, where only the <it>C</it><sub><it>&#945; </it></sub>atoms are shown</p>
               </caption>
               <text>
                  <p><b>Different conformations of the small protein BBA5, where only the <it>C</it><sub><it>&#945; </it></sub>atoms are shown</b>. (a) The native NMR structure of BBA5 based on data from the SCOP website. (b) The initial conformation of both folding trajectories. (c) The last conformation in the first trajectory. (d) The last conformation in the second trajectory.</p>
               </text>
               <graphic file="1748-7188-2-3-1"/>
            </fig>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Different conformations of the GSGS peptide, where only the <it>C</it><sub><it>&#945; </it></sub>atoms are shown</p>
               </caption>
               <text>
                  <p><b>Different conformations of the GSGS peptide, where only the <it>C</it><sub><it>&#945; </it></sub>atoms are shown</b>. (a) The native NMR conformation of GSGS. (b) The initial conformation in all the five folding trajectories. (c) The last conformation in the 1<sup><it>st </it></sup>trajectory. (d) The last conformation in the 3<sup><it>rd </it></sup>trajectory. (e) The last conformation in the 5<sup><it>th </it></sup>trajectory.</p>
               </text>
               <graphic file="1748-7188-2-3-2"/>
            </fig>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>A brief description of the GSGS folding trajectories.</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>Trajectory ID</p>
                     </c>
                     <c ca="center">
                        <p>Total number of conformations</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <it>T</it>
                           <sub>1</sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>25,664</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <it>T</it>
                           <sub>2</sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>30,075</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <it>T</it>
                           <sub>3</sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>19,649</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <it>T</it>
                           <sub>4</sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>25,263</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <it>T</it>
                           <sub>5</sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>25,664</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>2.2 Comparing Conformations of BBA5 and GSGS Across Trajectories</p>
            </st>
            <p>Although both trajectories of BBA5 start from the same extended conformation, shown in Figure <figr fid="F1">1(b)</figr>, the visualized frames suggest that they follow two very different folding processes. Figures <figr fid="F1">1(c)</figr> and <figr fid="F1">1(d)</figr> illustrate the last frame in the two trajectories <it>T</it><sub>23 </sub>and <it>T</it><sub>24 </sub>respectively. The same holds for the five GSGS folding trajectories, each of which starts with the same conformation (Figure <figr fid="F2">2(b)</figr>) but ends at a different conformation (Figures <figr fid="F2">2(c), 2(d)</figr> and <figr fid="F2">2(e)</figr>).</p>
            <p>This apparent difference might be attributed to the stochastic nature of the folding simulation process <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Nevertheless, it is desirable to characterize the similarities (or dissimilarities) across multiple trajectories.</p>
            <p>To compare two trajectories, one must address the following key issue: how can we compare two protein conformations? Several measures have been commonly used for this purpose, including RMSD (root mean square deviation) <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, contact order <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, and native contacts <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. However, all these measures are designed to quantify the global topology of a conformation. Furthermore, based on our empirical analysis of these measures, we notice that they are generally too coarse and thus can often be misleading. Even more importantly, such measures fail to identify similar local structures (or motifs) between conformations. This is especially crucial for small proteins like BBA5. As demonstrated in both experimental and theoretical studies, small proteins often fold hierarchically and begin locally <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. For instance, it has been shown that BBA5 tends to first form secondary structures such as <it>&#946;</it>-turns and <it>&#945;</it>-helices, then conform to its global topology <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Finally, as suggested by Pande <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, both sterics (local motifs) and global topology might play an important role in protein folding. Therefore, a reasonable comparison of conformations of (small) proteins should consider both local and global structures. Moreover, it should also take the native topology of the protein under study into account.</p>
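            <p>As a point of reference for the global measures discussed above, the sketch below computes the RMSD between two conformations given as equal-length lists of <it>C</it><sub><it>&#945;</it></sub> coordinates. It is a simplified illustration: the function name <it>rmsd</it> is hypothetical, and the code assumes that the two conformations have already been superimposed, so no optimal fitting is performed. Because the result is a single number averaged over all residues, it cannot indicate which local region, if any, is shared by the two conformations.</p>

```python
import math


def rmsd(conf_a, conf_b):
    """Root mean square deviation between two equal-length coordinate lists.

    Assumes the conformations are already superimposed; this simplified
    sketch performs no rotation or translation fitting.
    """
    if len(conf_a) != len(conf_b) or not conf_a:
        raise ValueError("conformations must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(conf_a, conf_b))
    return math.sqrt(total / len(conf_a))
```
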
            <p>To meet these requirements, we propose the following two-step approach to compare conformations of BBA5. First, we partition the 23 residues of BBA5 into four fragments: (i) <it>F</it><sub>1</sub>: N-terminal 1&#8211;10 <it>&#946;</it>-hairpin; (ii) <it>F</it><sub>2</sub>: C-terminal 11&#8211;23 <it>&#945;</it>-helix fragment; (iii) <it>F</it><sub>3</sub>: the first half of <it>F</it><sub>1 </sub>and the second half of <it>F</it><sub>2</sub>; and (iv) <it>F</it><sub>4</sub>: the second half of <it>F</it><sub>1 </sub>and the first half of <it>F</it><sub>2</sub>, i.e., the middle section in the primary sequence. This segmentation is also summarized in Table <tblr tid="T2">2</tblr>. Second, we recognize the secondary structure propensity in each fragment. Two conformations are said to be similar if they demonstrate the same secondary structure propensity in the same fragment. For instance, the pair of conformations in Figure <figr fid="F3">3(a)</figr> are similar, as residues in <it>F</it><sub>1</sub>, <it>F</it><sub>2 </sub>and <it>F</it><sub>4 </sub>from both conformations indicate a <it>&#946;</it>-turn-like local motif. Note that the orientation of local motifs does not affect the comparison. For instance, in Figure <figr fid="F3">3(d)</figr>, we say the two conformations have a similar structure in the <it>F</it><sub>1 </sub>fragment, even though the <it>&#946;</it>-turn motifs have different orientations.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Partitions along the primary sequence of BBA5.</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Partition</p>
                     </c>
                     <c ca="left">
                        <p>Amino Acids</p>
                     </c>
                     <c ca="left">
                        <p>Remark</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>1</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>1&#8211;10</p>
                     </c>
                     <c ca="left">
                        <p><it>&#946;</it>-hairpin</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>2</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>11&#8211;23</p>
                     </c>
                     <c ca="left">
                        <p><it>&#945;</it>-helix</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>3</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>1&#8211;6, 16&#8211;23</p>
                     </c>
                     <c ca="left">
                        <p>The 1<sup><it>st </it></sup>half of <it>F</it><sub>1 </sub>and the 2<sup><it>nd </it></sup>half of <it>F</it><sub>2</sub></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>4</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>6&#8211;17</p>
                     </c>
                     <c ca="left">
                        <p>The 2<sup><it>nd </it></sup>half of <it>F</it><sub>1 </sub>and the 1<sup><it>st </it></sup>half of <it>F</it><sub>2</sub></p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
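            <p>The two-step comparison can be sketched as follows. The residue ranges are copied from Table 2, while the function name <it>shared_fragments</it>, the motif labels, and the two example frames are hypothetical inputs introduced for illustration. How the secondary structure propensity of a fragment is recognized is left outside the sketch and is simply supplied as a label.</p>

```python
# Residue ranges copied from Table 2 (1-based, inclusive).
BBA5_FRAGMENTS = {
    "F1": [(1, 10)],
    "F2": [(11, 23)],
    "F3": [(1, 6), (16, 23)],
    "F4": [(6, 17)],
}


def shared_fragments(propensity_a, propensity_b):
    """Return the fragments in which both conformations show the same motif.

    Each argument maps a fragment name to a motif label such as "beta-turn",
    or to None when no motif was recognized (labels are hypothetical inputs).
    """
    return [
        name
        for name in BBA5_FRAGMENTS
        if propensity_a.get(name) is not None
        and propensity_a.get(name) == propensity_b.get(name)
    ]


# Hypothetical labels for two frames, one from each trajectory.
frame_a = {"F1": "beta-turn", "F2": "beta-turn", "F3": None, "F4": "beta-turn"}
frame_b = {"F1": "beta-turn", "F2": "beta-turn", "F3": "alpha-helix", "F4": "beta-turn"}
print(shared_fragments(frame_a, frame_b))  # prints ['F1', 'F2', 'F4']
```

            <p>Because only the label attached to each fragment is compared, the orientation of a motif plays no role in this sketch, in line with the comparison rule stated above.</p>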
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Selected conformation-pairs along the consensus partial folding pathway of BBA5</p>
               </caption>
               <text>
                  <p><b>Selected conformation-pairs along the consensus partial folding pathway of BBA5</b>. The figure illustrates four conformation-pairs, one from each trajectory, along the consensus partial folding pathway identified in the two BBA5 trajectories.</p>
               </text>
               <graphic file="1748-7188-2-3-3"/>
            </fig>
            <p>The same two-step approach is also applied to find similar GSGS conformations, except that a different segmentation strategy is adopted according to the native GSGS structure. A total of seven segments are used to identify the relative location of a motif in GSGS. Table <tblr tid="T3">3</tblr> lists these segments, together with the residues involved in each segment and its biological meaning.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Partitions along the primary sequence of GSGS.</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Partition ID</p>
                     </c>
                     <c ca="left">
                        <p>Amino Acids</p>
                     </c>
                     <c ca="left">
                        <p>Remark</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>1</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>1&#8211;15</p>
                     </c>
                     <c ca="left">
                        <p>The 1<sup><it>st </it></sup><it>&#946;</it>-turn</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>2</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>1&#8211;7</p>
                     </c>
                     <c ca="left">
                        <p>The 1<sup><it>st </it></sup><it>&#946;</it>-strand</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>3</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>3&#8211;10</p>
                     </c>
                     <c ca="left">
                        <p>Critical region of the 1<sup><it>st </it></sup><it>&#946;</it>-turn</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>4</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>6&#8211;15</p>
                     </c>
                     <c ca="left">
                        <p>The 2<sup><it>nd </it></sup><it>&#946;</it>-strand</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>5</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>6&#8211;20</p>
                     </c>
                     <c ca="left">
                        <p>The 2<sup><it>nd </it></sup><it>&#946;</it>-turn</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>6</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>10&#8211;18</p>
                     </c>
                     <c ca="left">
                        <p>Critical region of the 2<sup><it>nd </it></sup><it>&#946;</it>-turn</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>7</sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>14&#8211;20</p>
                     </c>
                     <c ca="left">
                        <p>The 3<sup><it>rd </it></sup><it>&#946;</it>-strand</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>To realize the comparison of conformations, two more issues must still be addressed. First, how can one effectively capture and represent local motifs? Second, how can we represent the global topology of a conformation in terms of local motifs? To address the first issue, we leverage the non-local patterns in protein contact maps. For the second, we characterize the spatial arrangement among non-local patterns. Please see Section 3 for more details.</p>
         </sec>
         <sec>
            <st>
               <p>2.3 Folding Trajectory Analysis: Objectives</p>
            </st>
            <p>There are two main goals we would like to achieve in analyzing the folding trajectories. First, we would like to address the following issues for individual trajectories: (1) to detect (or predict) significant folding events, including the formation of <it>&#946;</it>-turns, <it>&#945;</it>-helices, and native-like conformations; and (2) to recognize the temporal ordering of important folding events in the trajectory. For instance, between the two secondary structures <it>&#945;</it>-helix and <it>&#946;</it>-hairpin in BBA5, which forms earlier? What is the ordering of the two events preceding a <it>&#946;</it>-hairpin formation: formation of two extended strands or formation of the turn?</p>
            <p>In contrast to the first goal, our second goal concerns multiple trajectories. Specifically, we would like to identify a sub-sequence of similar conformations across trajectories. This sub-sequence of conformations is referred to as the <it>consensus partial folding pathway</it>. This is analogous to the Longest Common Sub-sequence (LCS) problem <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, but much more challenging for the following reasons. First, we are dealing with time series of 3D protein structures. Second, we are looking for <it>similar conformations across trajectories</it> rather than exact matches. To address these challenges, we build on our work on mining spatio-temporal data <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>3 Algorithm</p>
         </st>
         <p>In this section, we describe in detail the proposed approach for analyzing protein folding trajectories. As shown in Figure <figr fid="F4">4</figr>, the approach consists of three main phases: (I) Data preprocessing, (II) Spatio-temporal object association pattern mining, and (III) Trajectory analysis. We next discuss each phase in further detail.</p>
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>Algorithm</p>
            </caption>
            <text>
               <p><b>Algorithm</b>. Main steps of summarizing and analyzing protein folding trajectories.</p>
            </text>
            <graphic file="1748-7188-2-3-4"/>
         </fig>
         <sec>
            <st>
               <p>3.1 Data Preprocessing</p>
            </st>
            <p>As in our previous studies on protein structural analysis <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>, we represent 3D protein conformations by contact maps. To keep this article self-contained, we next briefly go over these preprocessing steps. We also explain the rationale of such steps in the context of protein folding.</p>
            <sec>
               <st>
                  <p>Contact Map Generation</p>
               </st>
               <p>When generating contact maps, we consider the Euclidean distances between <it>&#945;</it>-carbons (<it>C</it><sub><it>&#945;</it></sub>) of each amino acid. Two <it>&#945;</it>-carbons are considered to be in contact if their distance is within 8.5 &#197;. Thus, for a protein of <it>N </it>residues, its <it>contact map </it>is an <it>N </it>&#215; <it>N </it>binary matrix, where the cell at (<it>i</it>, <it>j</it>) is 1 if the <it>i</it><sup><it>th </it></sup>and <it>j</it><sup><it>th </it></sup><it>&#945;</it>-carbons are in contact, 0 otherwise. Since contact maps are symmetric across the diagonal, we only consider the bits below the diagonal. Furthermore, we also ignore the pairs of <it>C</it><sub><it>&#945; </it></sub>atoms whose distance in the primary sequence is &#8804; 2, as they are sure to be in contact. This step transforms the two BBA5 trajectories into two series of contact maps, with each contact map of size 23 &#215; 23. By the same token, the 5 GSGS trajectories are transformed into 5 sequences of contact maps.</p>
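               <p>The contact map construction described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name contact_map, its parameters, and the use of NumPy are assumptions made for the example. The 8.5 &#197; cutoff, the restriction to cells below the diagonal, and the exclusion of pairs whose sequence separation is 2 or less follow the text.</p>

```python
import numpy as np

def contact_map(coords, cutoff=8.5, min_seq_sep=3):
    """Return the lower-triangular binary contact map for C-alpha coordinates.

    coords: (N, 3) array-like of C-alpha positions in angstroms.
    Pairs with a sequence separation of 2 or less are ignored, as in the text.
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    # Pairwise Euclidean distances between all C-alpha atoms.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep only cells below the diagonal whose sequence separation is at
    # least min_seq_sep (row index i is larger than column index j).
    i, j = np.indices((n, n))
    keep = (i - j) >= min_seq_sep
    return np.logical_and(cutoff >= dist, keep).astype(np.uint8)
```

               <p>Applied to every frame of a trajectory, such a function would yield the series of contact maps used in the remaining steps.</p>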
            </sec>
            <sec>
               <st>
                  <p>Identifying Maximally Connected Bit-patterns</p>
               </st>
               <p>Every bit in a contact map has eight neighbor bits. For an edge position, we assume its out-of-boundary positions contain 0. In a contact map, a connected bit-pattern is a collection of bit-1 positions, where for each 1, at least one of its neighbors is 1. Correspondingly, we define a <it>maximally-connected bit-pattern </it>(also referred to as a <it>bit-pattern </it>in this article) to be a connected pattern <it>p </it>where every neighbor bit not in <it>p </it>is 0. We apply a simple region growth algorithm to identify all the <it>maximally-connected patterns </it>in each contact map within the two series of contact maps, corresponding to the two folding trajectories of BBA5. Altogether, we identified 352 maximally-connected bit-patterns in such contact maps. For the GSGS folding data, a total of 50,572 unique bit-patterns are constructed. We then represent each identified bit-pattern as a 6-tuple feature vector consisting of the following attributes:</p>
               <p>&#8226; <it>Height</it>: the number of rows contained in the pattern's Minimum Bounding Rectangle (MBR).</p>
               <p>&#8226; <it>Width</it>: the number of columns in the pattern's MBR.</p>
               <p>&#8226; <it>NumOnes</it>: the number of 1s in the pattern.</p>
               <p>&#8226; <it>Slope</it>: the general linear distribution trend of all the 1s in the pattern within its MBR. To compute the angle of a connected pattern we use the least-squares method to estimate the slope of a linear regression line. For a pattern containing <it>n </it>1s, we denote the positions of the 1s as: (<it>x</it><sub>1</sub>, <it>y</it><sub>1</sub>)...(<it>x</it><sub><it>n</it></sub>, <it>y</it><sub><it>n</it></sub>). The least-squares method then estimates the slope <it>&#946;</it><sub>1 </sub>as: <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1748-7188-2-3-i1"><m:semantics><m:mrow><m:msub><m:mi>&#946;</m:mi><m:mn>1</m:mn></m:msub><m:mo>=</m:mo><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>n</m:mi></m:msubsup><m:mrow><m:mo stretchy="false">(</m:mo><m:mo stretchy="false">(</m:mo><m:msub><m:mi>x</m:mi><m:mi>i</m:mi></m:msub><m:mo>&#8722;</m:mo><m:mover accent="true"><m:mi>x</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo stretchy="false">)</m:mo><m:mo>&#8727;</m:mo><m:mo stretchy="false">(</m:mo><m:msub><m:mi>y</m:mi><m:mi>i</m:mi></m:msub><m:mo>&#8722;</m:mo><m:mover accent="true"><m:mi>y</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo stretchy="false">)</m:mo><m:mo stretchy="false">)</m:mo><m:mo>/</m:mo><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>n</m:mi></m:msubsup><m:mrow><m:mo stretchy="false">(</m:mo><m:msup><m:mrow><m:mo stretchy="false">(</m:mo><m:msub><m:mi>x</m:mi><m:mi>i</m:mi></m:msub><m:mo>&#8722;</m:mo><m:mover accent="true"><m:mi>x</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo stretchy="false">)</m:mo></m:mrow><m:mn>2</m:mn></m:msup><m:mo stretchy="false">)</m:mo></m:mrow></m:mstyle></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFYoGydaWgaaWcbaGaeGymaedabeaakiabg2da9maaqadabaGaeiikaGIaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPbqabaGccqGHsislcuWG4baEgaqeaiabcMcaPiabgEHiQiabcIcaOiabdMha5naaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IafmyEaKNbaebacqGGPaqkcqGGPaqkcqGGVaWldaaeWaqaaiabcIcaOiabcIcaOiabdIha4naaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IafmiEaGNbaebacqGGPaqkdaahaaWcbeqaaiabikdaYaaakiabcMcaPaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOBa4ganiabggHiLdaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGUbGBa0GaeyyeIuoaaaa@5A04@</m:annotation></m:semantics></m:math></inline-formula></p>
               <p>&#8226; <it>xStdDev</it>: the standard deviation of all the 1s' x-coordinates (this quantifies how the 1s spread along the x dimension).</p>
               <p>&#8226; <it>yStdDev</it>: the standard deviation of all the 1s' y-coordinates.</p>
               <p>Note that this feature vector captures the main geometric properties of a bit-pattern.</p>
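               <p>As an illustration of the two operations above, the following sketch grows maximally-connected bit-patterns over the eight-neighborhood and computes the 6-tuple feature vector. It is a hedged example, not the authors' code. The names bit_patterns and features are hypothetical. The sketch also assumes that x is the column index and y the row index, that the standard deviations are population standard deviations, that an isolated 1 is returned as a single-cell pattern, and that a pattern whose x-coordinates are all equal is assigned a slope of 0.</p>

```python
from collections import deque
import numpy as np

def bit_patterns(cmap):
    """Return each maximally-connected bit-pattern as a list of (row, col) cells."""
    cmap = np.asarray(cmap)
    rows, cols = cmap.shape
    seen = np.zeros((rows, cols), dtype=bool)
    patterns = []
    for r0, c0 in zip(*np.nonzero(cmap)):
        if seen[r0, c0]:
            continue
        # Grow a region from this seed over the eight neighbours.
        seen[r0, c0] = True
        queue, cells = deque([(r0, c0)]), []
        while queue:
            r, c = queue.popleft()
            cells.append((int(r), int(c)))
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Out-of-boundary positions are treated as 0.
                    inside = rows > rr >= 0 and cols > cc >= 0
                    if inside and cmap[rr, cc] and not seen[rr, cc]:
                        seen[rr, cc] = True
                        queue.append((rr, cc))
        patterns.append(cells)
    return patterns

def features(cells):
    """Return (height, width, num_ones, slope, x_std, y_std) for one pattern."""
    ys = np.array([r for r, _ in cells], dtype=float)
    xs = np.array([c for _, c in cells], dtype=float)
    height = int(ys.max() - ys.min()) + 1   # rows in the MBR
    width = int(xs.max() - xs.min()) + 1    # columns in the MBR
    dx, dy = xs - xs.mean(), ys - ys.mean()
    denom = (dx ** 2).sum()
    # Assumption: the slope is set to 0.0 when all x-coordinates coincide.
    slope = float((dx * dy).sum() / denom) if denom > 0 else 0.0
    return (height, width, len(cells), slope, float(xs.std()), float(ys.std()))
```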
               <p>As discussed in the literature <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>, non-local patterns in contact maps (of which bit-patterns are one type) can effectively capture the secondary structure of proteins. Our previous work <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp> demonstrated that by characterizing the spatial relationship among the above described bit-patterns, one can construct structural signatures for proteins of different classes or folds. In the context of protein folding, we have observed that the above-defined bit-patterns are also capable of capturing a wide range of local 3D structural motifs. They can even approximately measure the strength of secondary structure propensity in a conformation. For instance, we have identified bit-patterns that correspond to "premature" <it>&#945;</it>-helices and native-like <it>&#945;</it>-helices respectively. Henceforth, we refer to the 3D structure formed by all the participating residues of a bit-pattern as the <it>3D motif of the bit-pattern</it>. The relationship between bit-patterns and 3D motifs will be further discussed in the next section.</p>
            </sec>
            <sec>
               <st>
                  <p>Clustering Bit-patterns into Approximately Equivalent Groups</p>
               </st>
               <p>In this step, we partition the extracted bit-patterns into <it>approximately equivalent groups</it>, each of which consists of bit-patterns that exhibit similar geometric properties (e.g., shape and size). To construct such equivalent groups, we run the <it>k</it>-means based clustering algorithm <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> over the bit-patterns' corresponding feature vectors, where <it>k </it>is the number of clusters (or equivalent groups) that will be produced.</p>
               <p>To determine an optimal value of <it>k</it>, we take the following three steps. First, we run the clustering algorithm with different <it>k </it>values. This produces different clustering schemes for the same set of bit-patterns. Second, for each clustering scheme, we compute its entropy. Let <it>c</it><sub>1</sub>, ..., <it>c</it><sub><it>k </it></sub>be the <it>k </it>clusters after clustering the set of <it>N </it>bit-patterns. If each cluster <it>c</it><sub><it>i </it></sub>(1 &#8804; <it>i </it>&#8804; <it>k</it>) has an individual entropy <it>H</it><sub><it>i </it></sub>and contains <it>N</it><sub><it>i </it></sub>elements, then the total entropy of this clustering is given by the following formula: <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1748-7188-2-3-i2"><m:semantics><m:mrow><m:mi>H</m:mi><m:mo>=</m:mo><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>k</m:mi></m:msubsup><m:mrow><m:msub><m:mi>H</m:mi><m:mi>i</m:mi></m:msub><m:mo>&#8727;</m:mo><m:mo stretchy="false">(</m:mo><m:msub><m:mi>N</m:mi><m:mi>i</m:mi></m:msub><m:mo>/</m:mo><m:mi>N</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGibascqGH9aqpdaaeWaqaaiabdIeainaaBaaaleaacqWGPbqAaeqaaOGaey4fIOIaeiikaGIaemOta40aaSbaaSqaaiabdMgaPbqabaGccqGGVaWlcqWGobGtcqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabdUgaRbqdcqGHris5aaaa@3F89@</m:annotation></m:semantics></m:math></inline-formula> The entropy of each individual cluster, i.e., <it>H</it><sub><it>i </it></sub>, is computed by summing up the entropy of each of the six bit-pattern attributes such as its height and width. For an attribute, we compute its entropy in a cluster according to the procedure explained by Shannon <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. In the third and final step, we plot the entropy against the number of clusters, i.e., <it>k</it>, and choose a value <it>k </it>where the entropy plot begins to show a linear trend. For the BBA5 folding data, this clustering step groups the 352 bit-patterns into 10 clusters (or types). As for the GSGS data, 12 clusters are identified.</p>
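               <p>The entropy of one clustering scheme, as defined above, can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' procedure: the function names shannon_entropy and clustering_entropy are hypothetical, entropies are measured in bits, and continuous attributes such as the slope are assumed to have been discretized into bins beforehand so that an empirical distribution can be formed.</p>

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def clustering_entropy(vectors, labels):
    """Total entropy H = sum_i H_i * (N_i / N) of one clustering scheme.

    vectors: list of equal-length tuples of discretized attribute values.
    labels:  cluster index assigned to each vector.
    """
    clusters = {}
    for vec, lab in zip(vectors, labels):
        clusters.setdefault(lab, []).append(vec)
    n = len(vectors)
    total = 0.0
    for members in clusters.values():
        # H_i: sum of the per-attribute entropies inside this cluster.
        h_i = sum(shannon_entropy(col) for col in zip(*members))
        total += h_i * len(members) / n
    return total
```

               <p>Evaluating such a function for a range of <it>k </it>values and plotting the result against <it>k </it>gives the entropy curve from which the number of clusters is chosen.</p>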
               <p>Intuitively, the 3D motifs of the bit-patterns in a cluster will also have similar 3D geometric properties. This is verified by our manual analysis of the BBA5 trajectories. Figure <figr fid="F5">5</figr> illustrates the representative 3D motifs corresponding to 9 of the 10 types of bit-patterns identified in the BBA5 trajectories. We omit type 0, as bit-patterns of this type, unlike the others, correspond to a wide variety of 3D motifs.</p>
               <fig id="F5">
                  <title>
                     <p>Figure 5</p>
                  </title>
                  <caption>
                     <p>Mapping between 2D bit-patterns and 3D sub-structures</p>
                  </caption>
                  <text>
                     <p><b>Mapping between 2D bit-patterns and 3D sub-structures</b>. The figure visualizes the representative 3D sub-structures corresponding to the 10 classes of bit-patterns identified in the contact maps along BBA5's two folding trajectories. The bit-patterns shown here are randomly selected from their respective groups for illustration purposes.</p>
                  </text>
                  <graphic file="1748-7188-2-3-5"/>
               </fig>
               <p>We also observed a similar scenario for the 12 types of bit-patterns identified in the GSGS trajectories. For instance, the typical 3D motifs of type 0 bit-patterns resemble the native conformation of GSGS (see Figure <figr fid="F2">2(a)</figr>), whereas those of type 6 correspond to <it>&#945;</it>-helices.</p>
               <p>Upon a closer look at this 2D-3D mapping illustrated in Figure <figr fid="F5">5</figr>, one can observe the following interesting aspects. First, multiple types of bit-patterns can be associated with a single type of 3D motif. For instance, 3 types of bit-patterns are mapped to an <it>&#945;</it>-helical motif. Second, contrary to a commonly accepted belief that <it>&#946;</it>-turns or <it>&#946;</it>-sheets cannot be captured by maximally connected bit-patterns as defined earlier, our analysis shows that this belief does not hold. To illustrate this point, we take two examples. The first example, illustrated in Figure <figr fid="F6">6</figr>, corresponds to the <it>&#946;</it>-turn structure. As shown in Figure <figr fid="F6">6(b)</figr>, the <it>&#946;</it>-turn formed by the first 10 <it>C</it><sub><it>&#945; </it></sub>atoms of BBA5 can be captured by the maximally connected bit-pattern shown in Figure <figr fid="F6">6(a)</figr>. The second example, shown in Figure <figr fid="F7">7</figr>, illustrates that a two-turn <it>&#946;</it>-sheet (Figure <figr fid="F7">7(b)</figr>) can also be captured by a bit-pattern (Figure <figr fid="F7">7(a)</figr>). Finally, not every type of bit-pattern can be mapped to a typical 3D motif. This might be attributed to our entropy-based criterion for selecting an "optimal" value of the parameter <it>k </it>in the clustering task.</p>
               <fig id="F6">
                  <title>
                     <p>Figure 6</p>
                  </title>
                  <caption>
                     <p><it>&#946;</it>-turns vs. maximally connected bit-patterns: an example</p>
                  </caption>
                  <text>
                     <p><b><it>&#946;</it>-turns vs. maximally connected bit-patterns: an example</b>. (a) A type 8 bit-pattern is identified in the 166<sup><it>th </it></sup>frame of the BBA5 <it>T</it>23 trajectory. This bit-pattern corresponds to the connected 1s in the table, where a '1' indicates that the two corresponding <it>C</it><sub><it>&#945; </it></sub>atoms are in contact, and '-' otherwise. This pattern consists of the first 10 <it>C</it><sub><it>&#945; </it></sub>atoms. (b) The 3D conformation of this frame, where the first 10 <it>C</it><sub><it>&#945; </it></sub>atoms resemble a <it>&#946;</it>-turn.</p>
                  </text>
                  <graphic file="1748-7188-2-3-6"/>
               </fig>
               <fig id="F7">
                  <title>
                     <p>Figure 7</p>
                  </title>
                  <caption>
                     <p><it>&#946;</it>-sheets vs. maximally connected bit-patterns: an example</p>
                  </caption>
                  <text>
                     <p><b><it>&#946;</it>-sheets vs. maximally connected bit-patterns: an example</b>. (a) A type 0 bit-pattern is identified in the 24201<sup><it>st </it></sup>frame of the GSGS <it>T</it><sub>1 </sub>trajectory. This bit-pattern corresponds to the connected 'x'-es in the table, where an 'x' indicates that the two corresponding <it>C</it><sub><it>&#945; </it></sub>atoms are in contact, and '-' otherwise. It consists of <it>C</it><sub><it>&#945; </it></sub>atoms 5 through 20. (b) The 3D conformation of this frame, where <it>C</it><sub><it>&#945; </it></sub>atoms 5&#8211;20 resemble a <it>&#946;</it>-sheet of two turns.</p>
                  </text>
                  <graphic file="1748-7188-2-3-7"/>
               </fig>
               <p>This demonstrates, to a certain extent, the advantage of using 2D contact maps to analyze 3D protein conformations. Using contact maps greatly reduces the computational complexity, though at the cost of some structural information. However, part of this information loss is compensated for by mapping bit-patterns to structural motifs in 3D conformations. More importantly, by exploiting different features in contact maps (bit-patterns in this work), we are able to connect 2D features with features in 3D space. In the BBA5 case, by identifying 10 types of bit-patterns in contact maps, we indirectly recognize 10 different 3D structural motifs in the folding conformations.</p>
            </sec>
            <sec>
               <st>
                  <p>Re-labeling Bit-patterns with the Corresponding Cluster Label</p>
               </st>
               <p>In this step, we re-label all the previously identified bit-patterns with their corresponding cluster label. Let <it>p </it>be a labeled bit-pattern. It can be represented as follows: <it>p </it>= (<it>trajID</it>, <it>frameID</it>, <it>listC</it><sub><it>&#945;</it></sub>, <it>label</it>). Here, <it>trajID </it>identifies a folding trajectory, <it>frameID </it>indicates the frame where <it>p </it>occurs, and <it>listC</it><sub><it>&#945; </it></sub>consists of all participating <it>&#945;</it>-carbons of <it>p</it>, identified by their positions in the primary sequence. Finally, <it>label </it>is the cluster label of <it>p</it>. For BBA5, <it>label </it>&#8712; {<it>g</it><sub>0</sub>, <it>g</it><sub>1</sub>, &#8943;, <it>g</it><sub>9</sub>}, corresponding to the 10 approximately equivalent groups (or types).</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>3.2 Mining Spatio-temporal Object Association Patterns</p>
            </st>
            <p>The preprocessing steps transform a 3D protein conformation into a set of labeled 2D bit-patterns that indirectly capture the local 3D structural characteristics of the conformation. For the two BBA5 trajectories, each conformation contains an average of 6 bit-patterns. As for the five GSGS trajectories, the average number of bit-patterns in each conformation is 4.</p>
            <p>As BBA5 and GSGS fold, the dynamics among their residues is constantly changing until it reaches an equilibrium. This means that two residues previously in contact may become out of contact later. As a result, bit-patterns present in one conformation may be absent in the next. The evolving nature of contacting residues, and in turn of bit-patterns, is essentially the consequence of a variety of weak interactions among amino acids at different levels. Such weak interactions include hydrogen bonds, electrostatic interactions, van der Waals packing and hydrophobic interactions <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. To capture these (potential) interactions, a simple yet effective method is to consider how close two amino acids are to each other in 3D. We also adopt this method here. Specifically, we consider interactions between local 3D motifs captured by labeled bit-patterns. We denote such interactions as "interactions among bit-patterns". Let <it>p</it><sub><it>i </it></sub>and <it>p</it><sub><it>j </it></sub>be two bit-patterns in a protein conformation, and <it>p<sub><it>i</it></sub>.listC</it><sub><it>&#945; </it></sub>and <it>p<sub><it>j</it></sub>.listC</it><sub><it>&#945; </it></sub>be the lists of <it>&#945;</it>-carbons involved in <it>p</it><sub><it>i </it></sub>and <it>p</it><sub><it>j</it></sub>, respectively. We define <it>p</it><sub><it>i </it></sub>and <it>p</it><sub><it>j </it></sub>as <it>interacting bit-patterns </it>if at least one pair of <it>&#945;</it>-carbons, one from <it>p<sub><it>i</it></sub>.listC</it><sub><it>&#945; </it></sub>and one from <it>p<sub><it>j</it></sub>.listC</it><sub><it>&#945;</it></sub>, is located within a short distance <it>&#948;</it>. Note that the value of <it>&#948; </it>should be greater than the distance used to identify contacting <it>&#945;</it>-carbons when generating contact maps (8.5 &#197;). In our analysis, we set <it>&#948; </it>= 10 &#197;.</p>
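            <p>A labeled bit-pattern and the interaction test just defined can be sketched as follows. The sketch is only an illustration under stated assumptions: the class BitPattern, its field names, and the function interacting are hypothetical, the list of participating atoms is stored as 0-based indices into the coordinate array rather than as positions in the primary sequence, and NumPy is used for the distance computation. The default threshold of 10 &#197; is the value of <it>&#948; </it>given in the text.</p>

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class BitPattern:
    traj_id: str     # folding trajectory identifier
    frame_id: int    # frame in which the pattern occurs
    list_ca: tuple   # 0-based indices of the participating C-alpha atoms
    label: int       # cluster label, e.g. 0..9 for BBA5

def interacting(p_i, p_j, coords, delta=10.0):
    """True if some C-alpha of p_i lies within delta angstroms of one of p_j.

    coords: (N, 3) array-like of C-alpha positions for the frame.
    """
    xyz = np.asarray(coords, dtype=float)
    a = xyz[list(p_i.list_ca)]
    b = xyz[list(p_j.list_ca)]
    # All pairwise distances between the atoms of the two patterns.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return bool(delta >= dist.min())
```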
            <p>It is noteworthy that the above notion of interacting bit-patterns is new compared to our previous work, where two bit-patterns are associated if their distance in the 2D contact map space is below a certain threshold. This can be misleading in the context of protein folding analysis. As demonstrated in Figure <figr fid="F8">8</figr>, the two bit-patterns <it>BP #1 </it>and <it>BP #2 </it>are only 2 amino acids apart in the 2D contact map. However, they can be relatively far apart in 3D. On the other hand, although the bit-patterns <it>BP #2 </it>and <it>BP #3 </it>are relatively far apart from each other in the 2D contact map, they are close to each other in 3D. Therefore, measuring the distance between bit-patterns in the actual 3D conformation is more robust with respect to capturing potential interaction among local motifs.</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Discrepancy between distances in 2D and 3D spaces</p>
               </caption>
               <text>
                  <p><b>Discrepancy between distances in 2D and 3D spaces</b>. Bit-patterns that are close to each other in the 2D contact map space, for instance, BP#1 and BP#2, can be distant from each other in 3D. Similarly, bit-patterns that are distant in 2D space, for instance, BP#1 and BP#3, can be close to each other in 3D.</p>
               </text>
               <graphic file="1748-7188-2-3-8"/>
            </fig>
            <p>So far, we have discussed our approach of using bit-patterns in contact maps to characterize local 3D motifs and further represent a protein conformation during folding. We also define the notion of interacting bit-patterns in the folding context. We are now ready to present our method of summarizing folding trajectories to fulfill the two objectives described in Section 2.3. The main idea is that we can summarize a folding trajectory by characterizing the evolutionary behavior of interactions among different types of bit-patterns and in turn, the interactions among local 3D motifs.</p>
            <sec>
               <st>
                  <p>Definition of (minLink = 1) SOAP</p>
               </st>
               <p>As proposed in our previous work <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B25">25</abbr></abbrgrp>, such interactions can be modeled and captured by discovering different types of spatial object association patterns (SOAPs). Essentially, SOAPs characterize the specific way that objects, bit-patterns in this case, are interacting with each other at a given time. Among the proposed SOAP types, after a careful evaluation, we empirically select (<it>minLink </it>= 1) SOAPs to model the interacting bit-patterns in the folding process. Let <it>p </it>= (<it>g</it><sub>1</sub>, <it>g</it><sub>2</sub>, &#8943;, <it>g</it><sub><it>k</it></sub>) be a (<it>minLink </it>= 1) SOAP of size <it>k</it>, where <it>g</it><sub><it>i </it></sub>is one of the 10 types of bit-patterns described above. In the context of folding trajectories, <it>p </it>prescribes that there exist <it>k </it>bit-patterns <it>b</it><sub>1</sub>, <it>b</it><sub>2</sub>, ..., <it>b</it><sub><it>k </it></sub>in a conformation, where <it>b</it><sub><it>i</it></sub>.<it>label </it>= <it>g</it><sub><it>i </it></sub>(1 &#8804; <it>i </it>&#8804; <it>k</it>). Furthermore, each <it>b</it><sub><it>i </it></sub>interacts with at least one of the remaining (<it>k </it>- 1) bit-patterns. Note that the <it>k </it>labels in <it>p </it>need not be distinct. For instance, one can have SOAPs such as (7 9 9), which involves one type 7 bit-pattern and two type 9 bit-patterns.</p>
               <p>We further restrict ourselves to SOAPs that occur frequently during the folding process (<it>frequent SOAPs</it>). However, we are not ruling out rarely-occurring SOAPs in our future studies. A SOAP is said to be frequent if it appears in no fewer than <it>minSupp </it>frames in a trajectory. In our studies, we set <it>minSupp </it>= 5 for BBA5 and 10 for GSGS.</p>
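               <p>One way to read the two definitions above is sketched below. This brute-force check is an illustration only and is not the SOAP mining algorithm of our previous work: the function names soap_occurs and frequent, and the representation of a frame as a list of labels plus a set of interacting pairs, are assumptions made for the example. A SOAP of size 1 is assumed to occur whenever a bit-pattern of that type is present, since it has no remaining bit-patterns to interact with.</p>

```python
from collections import Counter
from itertools import combinations

def soap_occurs(soap, labels, edges):
    """Check whether a (minLink = 1) SOAP occurs in one frame.

    soap:   tuple of cluster labels, e.g. (7, 9, 9).
    labels: labels[b] is the cluster label of bit-pattern b in the frame.
    edges:  set of frozenset({a, b}) for each interacting pair a, b.
    """
    want = Counter(soap)
    for group in combinations(range(len(labels)), len(soap)):
        if Counter(labels[b] for b in group) != want:
            continue
        # minLink = 1: each member interacts with at least one other member.
        linked = all(
            any(frozenset((b, o)) in edges for o in group if o != b)
            for b in group
        )
        if linked or len(group) == 1:
            return True
    return False

def frequent(soap, frames, min_supp):
    """frames: list of (labels, edges) pairs, one per conformation."""
    support = sum(soap_occurs(soap, lab, ed) for lab, ed in frames)
    return support >= min_supp
```

               <p>Enumerating all subsets is affordable here only because a conformation contains few bit-patterns (about 6 on average for BBA5); a practical miner would avoid this enumeration.</p>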
            </sec>
            <sec>
               <st>
                  <p>SOAP Episodes</p>
               </st>
               <p>The next step is to capture the evolutionary nature of the folding process. We do this by identifying the evolutionary nature of SOAPs. As mentioned earlier, small proteins like BBA5 and GSGS often fold hierarchically and begin with local folded structures. As they fold, new SOAPs can be created and existing ones can dissipate. To capture such evolutionary behavior, we proposed the concept of <it>SOAP episodes</it>, which provide an effective approach to model the evolution of interactions among spatial objects over time <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. To reiterate, a SOAP episode <it>E </it>is defined as follows: <it>E </it>= (<it>p</it>, <it>F</it><sub><it>beg</it></sub>, <it>F</it><sub><it>end</it></sub>), where <it>p </it>is a SOAP composed of one or more bit-patterns, and <it>p </it>was created in frame <it>F</it><sub><it>beg </it></sub>and persisted until frame <it>F</it><sub><it>end</it></sub>. Note that a given <it>p </it>can be created more than once during protein folding, and thus can have more than one episode. To discover frequent (<it>minLink </it>= 1) SOAPs and their episodes in the trajectories of BBA5 and GSGS, we apply our SOAP mining algorithm as explained in our previous work <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
               <p>In summary, this mining phase produces the following results: (i) A list of (<it>minLink </it>= 1) SOAPs of bit-patterns that appeared in at least 5 conformations in each folding trajectory for the protein BBA5 and at least 10 for GSGS; and (ii) A list of episodes, ordered by beginning frame <it>F</it><sub><it>beg</it></sub>, associated with each of these SOAPs.</p>
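               <p>The episode list in (ii) can be illustrated with the following sketch. It assumes that an episode is a maximal run of consecutive frames in which the SOAP is present, which is one plausible reading of the definition above; the mining algorithm of our previous work may treat gaps differently. The function name episodes and its input format are hypothetical.</p>

```python
def episodes(soap, frames_present):
    """Return [(soap, f_beg, f_end), ...] ordered by beginning frame.

    frames_present: iterable of frame numbers in which the SOAP occurs.
    """
    result = []
    f_beg = f_end = None
    for f in sorted(set(frames_present)):
        if f_end is not None and f == f_end + 1:
            f_end = f                      # the SOAP persists
        else:
            if f_beg is not None:
                result.append((soap, f_beg, f_end))
            f_beg = f_end = f              # the SOAP is (re)created
    if f_beg is not None:
        result.append((soap, f_beg, f_end))
    return result
```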
            </sec>
         </sec>
         <sec>
            <st>
               <p>3.3 Folding Trajectory Analysis</p>
            </st>
            <p>In this section, we describe our strategy for using SOAPs to summarize a folding trajectory and address the two folding analysis issues described in Section 2.3.</p>
            <sec>
               <st>
                  <p>SOAP-based Trajectory Summarization</p>
               </st>
               <p>The previous mining phase discovers a collection of frequent (<it>minLink </it>= 1) SOAPs and the associated episodes in each trajectory. Therefore, it identifies all the conformations in the trajectories that contain at least one frequent (<it>minLink </it>= 1) SOAP. For instance, the last conformation in trajectory <it>T</it>23 (Figure <figr fid="F1">1(c)</figr>) has two SOAPs of size 2: (5 8) (i.e., association of a type 5 and a type 8 bit-pattern) and (7 8), and three SOAPs of size 1: (5), (7), and (8), while the last conformation in trajectory <it>T</it>24 has three SOAPs: (7 8), (7) and (8). This leads to our SOAP-based approach for folding trajectory summarization.</p>
               <p>To summarize a folding trajectory, we perform the following three steps. First, for each conformation, we identify all the frequent SOAPs that appear in it and use these SOAPs to represent this conformation. Note that not every conformation contains frequent SOAPs, especially when <it>minSupp </it>is set high. Second, for each SOAP-representable conformation, we carry out two tasks on its associated SOAPs. We next use the folding trajectories of BBA5 to explain how these two tasks are carried out.</p>
               <p>In the first task, for each SOAP, we mark the relative location of each involved bit-pattern in the primary sequence of BBA5. We do this by identifying the segment of BBA5 in which the majority of the bit-pattern's <it>&#945;</it>-carbons are located. As described in Section 2.2, the segment is one of the following: <it>F</it><sub>1</sub>, residues 1&#8211;10; <it>F</it><sub>2</sub>, residues 11&#8211;23; <it>F</it><sub>3</sub>, residues 6&#8211;17; and <it>F</it><sub>4</sub>, residues 1&#8211;5 and 18&#8211;23. Take again the last conformation in <it>T</it>24 as an example. It can be summarized by three SOAPs: (7 8), (7) and (8). Examining the <it>&#945;</it>-carbons involved in these bit-patterns, we find that bit-pattern 7 is mainly located in <it>F</it><sub>2</sub> and bit-pattern 8 in <it>F</it><sub>1</sub>. We therefore mark each bit-pattern in the form type.segment and re-arrange the bit-patterns within a SOAP by their relative locations in BBA5, which yields (8.1 7.2), (7.2) and (8.1). This super-imposes BBA5-specific spatial information onto a SOAP. In the second task, we prune away redundant SOAPs after marking each bit-pattern with its relative location in BBA5. A SOAP is redundant if it is embedded in another SOAP. In the previous example, we can prune away (8.1) and (7.2) because both are embedded in (8.1 7.2). After pruning, most conformations of such a small protein can be represented by a single SOAP. We can take this summarization a step further by replacing each bit-pattern with its corresponding 3D motif, as illustrated in Figure <figr fid="F5">5</figr>. For instance, SOAP (8.1 7.2) is transformed into (<it>&#946;</it>.1 <it>&#945;</it>.2). We refer to such SOAPs as <it>generalized SOAPs</it>, and to the corresponding trajectory as <it>a generalized trajectory</it>. Note that in a generalized trajectory, multiple types of bit-patterns can be mapped to a single type of 3D motif.
For instance, the <it>&#945;</it>-motif corresponds to three types of bit-patterns: 4, 7, and 9 (Figure <figr fid="F5">5</figr>). Figure <figr fid="F9">9</figr> shows a segment of each summarized BBA5 folding trajectory before and after being generalized with 3D motifs.</p>
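               <p>The marking, pruning and generalization tasks can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The tuple representation, the function names and the <it>MOTIF_OF</it> table are hypothetical, "embedded" is assumed to mean "proper subset", and the mapping of bit-pattern 8 to the <it>&#946;</it>-motif is inferred from the example in the text.</p>

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a located bit-pattern is (type, segment),
# e.g. (8, 1) stands for "8.1"; a SOAP is a tuple of located bit-patterns.
Located = Tuple[int, int]
Soap = Tuple[Located, ...]


def order_by_location(soap: Soap) -> Soap:
    """Re-arrange the bit-patterns of a SOAP by segment index, then by type."""
    return tuple(sorted(soap, key=lambda item: (item[1], item[0])))


def prune_redundant(soaps: List[Soap]) -> List[Soap]:
    """Drop every SOAP embedded in another (assumed: a proper subset of it)."""
    # Canonical ordering plus order-preserving de-duplication.
    ordered = list(dict.fromkeys(order_by_location(s) for s in soaps))
    return [
        s for s in ordered
        if not any(s != t and set(s).issubset(t) for t in ordered)
    ]


def generalize(soap: Soap, motif_of: Dict[int, str]) -> Tuple[Tuple[str, int], ...]:
    """Replace each bit-pattern type with the name of its 3D motif."""
    return tuple((motif_of[bp_type], segment) for bp_type, segment in soap)


# Assumed mapping: 4, 7 and 9 are alpha-motifs (stated in the text);
# 8 as a beta-motif is inferred from the worked example.
MOTIF_OF = {4: "alpha", 7: "alpha", 9: "alpha", 8: "beta"}

# Last conformation of T24 after marking: (7.2 8.1), (7.2) and (8.1).
marked = [((7, 2), (8, 1)), ((7, 2),), ((8, 1),)]
pruned = prune_redundant(marked)
generalized = [generalize(s, MOTIF_OF) for s in pruned]
print(pruned)       # [((8, 1), (7, 2))]
print(generalized)  # [(('beta', 1), ('alpha', 2))]
```

               <p>In this sketch the three marked SOAPs collapse to the single SOAP (8.1 7.2), which then generalizes to (<it>&#946;</it>.1 <it>&#945;</it>.2).</p>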
               <fig id="F9">
                  <title>
                     <p>Figure 9</p>
                  </title>
                  <caption>
                     <p>SOAP-based folding trajectory summarization</p>
                  </caption>
                  <text>
                     <p><b>SOAP-based folding trajectory summarization</b>. A sample segment of each of the two BBA5 folding trajectories is presented: (a) after superimposing the relative location of each bit-pattern and pruning away redundant SOAPs; (b) after further generalizing each bit-pattern to its corresponding 3D motif.</p>
                  </text>
                  <graphic file="1748-7188-2-3-9"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Detecting Folding Events and Recognizing Ordering Among Events</p>
               </st>
               <p>Once each folding trajectory is summarized into generalized SOAPs, it is straightforward to detect folding events such as the formation of <it>&#945;</it>-helix-like or <it>&#946;</it>-turn-like local structures: we locate the frames that contain the local motif(s) of interest. We can also identify native-like conformations by finding those that contain the generalized SOAP (<it>&#946;</it>.1 <it>&#945;</it>.2). Finally, based on the summarization, one can quickly identify the ordering of folding events in a trajectory. For instance, to check which secondary structure forms more rapidly, <it>&#945;</it>-helix or <it>&#946;</it>-hairpin, one can compare the first occurrence of these structures in the summarized trajectory (Figure <figr fid="F9">9(b)</figr>).</p>
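               <p>A minimal Python sketch of these look-ups is given below. It assumes a summarized trajectory is stored as a list of (frame number, generalized SOAPs) entries. The data layout, the function names and the toy frames are hypothetical and are not taken from the BBA5 data.</p>

```python
from typing import List, Optional, Tuple

# Hypothetical layout: a generalized SOAP is a tuple of (motif, segment) pairs;
# a summarized trajectory is a list of (frame_number, [generalized SOAPs]).
GenSoap = Tuple[Tuple[str, int], ...]
Summary = List[Tuple[int, List[GenSoap]]]


def first_frame_with(trajectory: Summary, motif: str) -> Optional[int]:
    """Return the first frame whose SOAPs contain the motif, or None."""
    for frame_no, soaps in trajectory:
        if any(m == motif for soap in soaps for m, _segment in soap):
            return frame_no
    return None


def frames_with_soap(trajectory: Summary, target: GenSoap) -> List[int]:
    """Return every frame that contains the given generalized SOAP."""
    return [frame_no for frame_no, soaps in trajectory if target in soaps]


def earlier_motif(trajectory: Summary, a: str, b: str) -> Optional[str]:
    """Return whichever motif occurs first; None if either is absent or on a tie."""
    first_a = first_frame_with(trajectory, a)
    first_b = first_frame_with(trajectory, b)
    if first_a is None or first_b is None or first_a == first_b:
        return None
    return b if first_a > first_b else a


# Toy summarized trajectory (made-up frames, for illustration only).
toy = [
    (3, [(("alpha", 2),)]),
    (7, [(("beta", 1),)]),
    (9, [(("beta", 1), ("alpha", 2))]),
]
NATIVE_LIKE = (("beta", 1), ("alpha", 2))

print(first_frame_with(toy, "alpha"))       # 3
print(first_frame_with(toy, "beta"))        # 7
print(earlier_motif(toy, "alpha", "beta"))  # alpha
print(frames_with_soap(toy, NATIVE_LIKE))   # [9]
```

               <p>Because only frequent SOAPs enter the summary, the frame returned by such a look-up is the first occurrence in the summarized trajectory, which may be later than the first occurrence in the full trajectory.</p>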
            </sec>
            <sec>
               <st>
                  <p>Identifying the Consensus Partial Folding Pathway Across Trajectories</p>
               </st>
               <p>To identify a consensus partial folding pathway, we compute the longest common subsequence (LCS) <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> between two summarized trajectories. One can use the summarization either before the 3D motif generalization (Figure <figr fid="F9">9(a)</figr>) or after it (Figure <figr fid="F9">9(b)</figr>). We use the latter in our analysis. Based on the LCS of generalized SOAPs, we construct the consensus folding pathway by identifying pairs of conformations, one from each trajectory, along the LCS of the two summarized trajectories. In other words, the resulting consensus pathway consists of a sequence of conformation-pairs with similar 3D structures. Note that the comparison between 3D protein conformations (as described in Section 2.2) uses bit-patterns to model local structural motifs, and associations of bit-patterns (SOAPs) to characterize the global structure. This forms a hierarchical comparison, in accordance with the hierarchical folding process of small proteins.</p>
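               <p>The pairing step can be sketched with the standard dynamic-programming LCS. This is an illustrative version, not the code used in this work. It assumes that each summarized conformation is represented by a single generalized SOAP and that two conformations match only when their SOAPs are identical. The function name and the toy trajectories are hypothetical.</p>

```python
from typing import Hashable, List, Sequence, Tuple


def lcs_pairs(a: Sequence[Hashable], b: Sequence[Hashable]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with a[i] == b[j] along one LCS of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk forward through the table to recover the matched pairs.
    pairs = []
    i = j = 0
    while i != n and j != m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Toy generalized trajectories (made-up data, one SOAP per conformation).
ALPHA = (("alpha", 2),)
BETA = (("beta", 1),)
NATIVE = (("beta", 1), ("alpha", 2))
traj_a = [ALPHA, BETA, NATIVE, BETA]
traj_b = [BETA, ALPHA, NATIVE, BETA]

print(lcs_pairs(traj_a, traj_b))  # [(1, 0), (2, 2), (3, 3)]
```

               <p>Each returned pair indexes one conformation in each trajectory, so the length of the list corresponds to the length of the consensus pathway. When several longest common subsequences exist, this sketch returns only one of them.</p>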
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>4 Results</p>
         </st>
         <p>In this section, we report results on analyzing the two trajectories of the small synthetic protein BBA5 and the five trajectories of another small protein, GSGS, with a focus on BBA5. In previous sections, we described in detail the structures of BBA5 and GSGS and their folding trajectories. This information is summarized in Table <tblr tid="T4">4</tblr> and Table <tblr tid="T5">5</tblr>.</p>
         <tbl id="T4">
            <title>
               <p>Table 4</p>
            </title>
            <caption>
               <p>A summary of the BBA5 folding trajectories.</p>
            </caption>
            <tblbdy cols="2">
               <r>
                  <c ca="left">
                     <p>Protein</p>
                  </c>
                  <c ca="left">
                     <p>PDB Identifier: BBA5; Primary sequence: 23 residues; Designed protein;</p>
                     <p>Native fold: N-terminal 1&#8211;10 <it>&#946; </it>hairpin, C-terminal 11&#8211;23 <it>&#945;</it>-helix</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Trajectory</p>
                  </c>
                  <c ca="left">
                     <p>Two trajectories: <it>T</it>23 and <it>T</it>24;</p>
                     <p><it>T</it>23: 192 conformations; <it>T</it>24: 150 conformations</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Contact map</p>
                  </c>
                  <c ca="left">
                     <p>Based on contacts between <it>&#945;</it>-carbons.</p>
                     <p>Two <it>&#945;</it>-carbons are in contact if their Euclidean distance is &#8804; 8.5 &#197;</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Bit-patterns</p>
                  </c>
                  <c ca="left">
                     <p>A total of 352 unique maximally connected bit-patterns were identified from all conformations;</p>
                     <p>Average number of bit-patterns per conformation is 6;</p>
                     <p>Bit-patterns are further classified into 10 approximately equivalent types</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Interacting bit-patterns</p>
                  </c>
                  <c ca="left">
                     <p>Two bit-patterns interact if at least one pair of <it>&#945;</it>-carbons, one from each bit-pattern, is at Euclidean distance &#8804; 10 &#197;</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Frequent SOAPs</p>
                  </c>
                  <c ca="left">
                     <p>A SOAP is frequent if it appears in &#8805; 5 conformations;</p>
                     <p>A total of 444 frequent SOAPs identified in trajectory <it>T</it>23, and 258 in <it>T</it>24</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Consensus partial folding pathway</p>
                  </c>
                  <c ca="left">
                     <p>We identified a consensus partial folding pathway across the two trajectories.</p>
                     <p>It is composed of 71 pairs of similar conformations, one from each trajectory</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <tbl id="T5">
            <title>
               <p>Table 5</p>
            </title>
            <caption>
               <p>A summary of the GSGS folding trajectories.</p>
            </caption>
            <tblbdy cols="2">
               <r>
                  <c ca="left">
                     <p>Protein</p>
                  </c>
                  <c ca="left">
                     <p>Name: GSGS or Beta3s; Primary sequence: 20 residues; Designed protein;</p>
                     <p>Native fold: three-stranded anti-parallel <it>&#946;</it>-sheet with turns at 6&#8211;7 and 14&#8211;15</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Trajectory</p>
                  </c>
                  <c ca="left">
                     <p>Five trajectories: <it>T</it>1, <it>T</it>2, <it>T</it>3, <it>T</it>4 and <it>T</it>5;</p>
                     <p><it>T</it>1: 25,664 conformations; <it>T</it>2: 30,075 conformations;</p>
                     <p><it>T</it>3: 19,649 conformations; <it>T</it>4: 25,263 conformations;</p>
                     <p><it>T</it>5: 25,664 conformations</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Contact map</p>
                  </c>
                  <c ca="left">
                     <p>Based on contacts between <it>&#945;</it>-carbons.</p>
                     <p>Two <it>&#945;</it>-carbons are in contact if their Euclidean distance is &#8804; 8.5 &#197;</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Bit-patterns</p>
                  </c>
                  <c ca="left">
                     <p>A total of 50,572 unique maximally connected bit-patterns were identified from all conformations;</p>
                     <p>Average number of bit-patterns per conformation is 4;</p>
                     <p>Bit-patterns are further classified into 12 approximately equivalent types</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Interacting bit-patterns</p>
                  </c>
                  <c ca="left">
                     <p>Two bit-patterns interact if at least one pair of <it>&#945;</it>-carbons, one from each bit-pattern, is at Euclidean distance &#8804; 10 &#197;</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Frequent SOAPs</p>
                  </c>
                  <c ca="left">
                     <p>A SOAP is frequent if it appears in &#8805; 10 conformations;</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <sec>
            <st>
               <p>4.1 Detecting and Ordering Folding Events</p>
            </st>
            <p>We summarize both folding trajectories of BBA5 into a sequence of SOAPs, as illustrated in Figure <figr fid="F9">9</figr>. Coincidentally, both summarized trajectories consist of 64 conformations.</p>
            <p>Based on these summarized trajectories, we can quickly identify the conformations in which the first <it>&#945;</it>-helix-like or <it>&#946;</it>-turn-like local motifs were formed. For trajectory <it>T</it><sub>23</sub>, the first <it>&#945;</it>-helix-like motif was identified in frame 26, and the first <it>&#946;</it>-turn-like local motif in frame 63. For trajectory <it>T</it><sub>24</sub>, the corresponding frames were 29 and 38. This is in accordance with experimental results showing that <it>&#945;</it>-helices generally fold more rapidly than <it>&#946;</it>-turns. However, since we only consider frequent SOAPs, we may miss the actual first formation of such local motifs. To address this issue, we might need to consider rarely occurring SOAPs, which we plan to investigate in the future. For the two events related to <it>&#946;</it>-turn formation, namely the formation of two extended strands and the formation of the turn, we found that in both trajectories the formation of the extended strands preceded the formation of the turn.</p>
            <p>Also, we identify two conformations in each trajectory that show native-like structure. We do this by locating the conformations associated with the generalized SOAP (<it>&#946;</it>.1 <it>&#945;</it>.2). Figure <figr fid="F10">10</figr> presents the 3D structure of these native-like conformations along with the native conformation of BBA5. One can see that our SOAP-based comparison does well in identifying similar 3D conformations.</p>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>The native-like conformations identified in the two BBA5 trajectories</p>
               </caption>
               <text>
                  <p><b>The native-like conformations identified in the two BBA5 trajectories</b>. According to the SOAP-based summarization of the two BBA5 folding trajectories, two native-like conformations are identified in each trajectory.</p>
               </text>
               <graphic file="1748-7188-2-3-10"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>4.2 Consensus Partial Folding Pathway Across Trajectories</p>
            </st>
            <p>Based on the generalized trajectory summarization of BBA5, we identify a consensus partial folding pathway of length 71. In other words, 71 pairs of conformations, one from each trajectory, are considered similar to each other. Figure <figr fid="F3">3</figr> displays four such pairs along this consensus folding pathway. For instance, the two conformations shown in Figure <figr fid="F3">3(c)</figr>, corresponding to the 182<sup><it>nd </it></sup>frame in the <it>T</it><sub>23 </sub>trajectory and the 116<sup><it>th </it></sup>frame in the <it>T</it><sub>24 </sub>trajectory of BBA5 respectively, are considered structurally similar, since both conformations exhibit an <it>&#945;</it>-helix in the left half of the backbone and a <it>&#946;</it>-turn in the right half.</p>
            <p>Figure <figr fid="F11">11</figr> illustrates 5 pairs of conformations along the consensus folding pathway of the 1<sup><it>st </it></sup>and 3<sup><it>rd </it></sup>trajectories of GSGS, and Figure <figr fid="F12">12</figr> illustrates 5 conformation-pairs along the consensus pathway of the 1<sup><it>st </it></sup>and 5<sup><it>th </it></sup>trajectories of GSGS. We are currently in the process of identifying consensus pathways across more than 2 trajectories of GSGS. Note that by using bit-patterns, we naturally realize a rotation-invariant comparison. To illustrate this, consider again the BBA5 conformation pair discussed above. Although the <it>&#946;</it>-turn is oriented differently in the two conformations, our approach still identifies them as structurally similar.</p>
            <fig id="F11">
               <title>
                  <p>Figure 11</p>
               </title>
               <caption>
                  <p>Selected conformation-pairs along the consensus partial folding pathway across the 1<sup><it>st </it></sup>and 3<sup><it>rd </it></sup>trajectories of the GSGS peptide</p>
               </caption>
               <text>
                  <p><b>Selected conformation-pairs along the consensus partial folding pathway across the 1<sup><it>st </it></sup>and 3<sup><it>rd </it></sup>trajectories of the GSGS peptide</b>. The figure illustrates five pairs of conformations, one from each trajectory, along the consensus partial folding pathway identified in the 1<sup><it>st </it></sup>and 3<sup><it>rd </it></sup>trajectories.</p>
               </text>
               <graphic file="1748-7188-2-3-11"/>
            </fig>
            <fig id="F12">
               <title>
                  <p>Figure 12</p>
               </title>
               <caption>
                  <p>Selected conformation-pairs along the consensus partial folding pathway across the 1<sup><it>st </it></sup>and 5<sup><it>th </it></sup>trajectories of the GSGS peptide</p>
               </caption>
               <text>
                  <p><b>Selected conformation-pairs along the consensus partial folding pathway across the 1<sup><it>st </it></sup>and 5<sup><it>th </it></sup>trajectories of the GSGS peptide</b>. The figure illustrates five pairs of conformations, one from each trajectory, along the consensus partial folding pathway identified in the 1<sup><it>st </it></sup>and 5<sup><it>th </it></sup>trajectories.</p>
               </text>
               <graphic file="1748-7188-2-3-12"/>
            </fig>
            <p>Currently, we rely on visual tools to justify these consensus pathways. We attempted to use several measurements that have previously been used to quantify the similarity between 3D protein conformations, namely <it>RMSD</it>, contact order, and native contacts, but without success. Two conformations are said to be a best match if they have the lowest RMSD or the smallest difference in contact order or native contacts. When we identified the pathway based on the best match given by any of these measurements, we often ended up with a very short consensus pathway (as short as 10 frames). Moreover, different best-match measurements produced very different consensus pathways. Finally, we noticed that the best-matched conformations under any of these measurements can often exhibit very different structural characteristics. We are investigating alternative methods for quantitative validation of our results.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>5 Conclusions and Ongoing Work</p>
         </st>
         <p>In this article, we present a novel approach for analyzing protein folding trajectories and a case study on the small proteins BBA5 and GSGS. We capture a variety of structural motifs in the 3D protein conformations by non-local bit-patterns identified in their 2D contact maps. By modeling the interactions or spatial relationships among bit-patterns as SOAPs and SOAP episodes, we effectively characterize the evolutionary nature of the folding process. We also describe two methods to summarize folding trajectories by super-imposing protein-specific information and 3D motifs onto SOAPs. Using the summarized trajectories, we demonstrate that one can detect folding events and the temporal order among events. We also show that by comparing such summarized trajectories, one can identify a partial folding pathway common to multiple trajectories.</p>
         <p>We realize that understanding the folding mechanism of proteins is a very challenging task. Based on our analysis of small proteins only, we are not in a position to make general comments on the protein folding problem. However, the approach presented here is general and applicable to any folding trajectory.</p>
         <p>Presently, we are in the process of addressing several other related issues. First, we are automating the mapping between 2D bit-patterns and 3D motifs. Second, we are further analyzing the identified consensus folding pathways and validating them through other means. Third, it is well-known that the side chains of a protein play a crucial role in the folding process. We are currently investigating different approaches to involve side chains in our analysis. Finally, we are investigating whether bit-patterns can be used to index and manage protein folding simulation data.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Dr. Yusu Wang at The Ohio State University for providing the folding simulation data and sharing many constructive and insightful thoughts with us.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Absolute comparison of simulated and experimental protein-folding dynamics</p>
            </title>
            <aug>
               <au>
                  <snm>Snow</snm>
                  <fnm>CD</fnm>
               </au>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Pande</snm>
                  <fnm>VS</fnm>
               </au>
               <au>
                  <snm>Gruebele</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>420</volume>
            <fpage>102</fpage>
            <lpage>106</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12422224</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Screen savers of the world unite</p>
            </title>
            <aug>
               <au>
                  <snm>Shirts</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Pande</snm>
                  <fnm>VS</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>290</volume>
            <fpage>1903</fpage>
            <lpage>1904</lpage>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Exploring protein folding trajectories using geometric spanners</p>
            </title>
            <aug>
               <au>
                  <snm>Russel</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Guibas</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing</source>
            <pubdate>2005</pubdate>
            <fpage>40</fpage>
            <lpage>51</lpage>
            <xrefbib>
               <pubid idtype="pmpid">15759612</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Towards data warehousing and mining of protein unfolding simulation data</p>
            </title>
            <aug>
               <au>
                  <snm>Berrar</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Stahl</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Silva</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Rodrigues</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Brito</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Dubitzky</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Journal of clinical monitoring and computing</source>
            <pubdate>2005</pubdate>
            <volume>19</volume>
            <issue>4&#8211;5</issue>
            <fpage>307</fpage>
            <lpage>17</lpage>
         </bibl>
         <bibl id="B5">
            <title>
               <p>A generalized framework for mining spatio-temporal patterns in scientific data</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Parthasarathy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mehta</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>KDD '05: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining</source>
            <publisher>New York, NY, USA: ACM Press</publisher>
            <pubdate>2005</pubdate>
            <fpage>716</fpage>
            <lpage>721</lpage>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Mining Spatial Object Patterns in Scientific Data</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Mehta</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Parthasarathy</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI)</source>
            <pubdate>2005</pubdate>
            <fpage>902</fpage>
            <lpage>907</lpage>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Discovering Spatial Relationships Between Approximately Equivalent Patterns</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Marsolo</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Parthasarathy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mehta</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>The Fourth Workshop on Data Mining and Bioinformatics, ACM SIGKDD (BIOKDD)</source>
            <pubdate>2004</pubdate>
            <fpage>62</fpage>
            <lpage>71</lpage>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Meeting halfway on the bridge between protein folding theory and experiment</p>
            </title>
            <aug>
               <au>
                  <snm>Pande</snm>
                  <fnm>VS</fnm>
               </au>
            </aug>
            <source>Proceedings of the National Academy of Sciences</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <issue>7</issue>
            <fpage>3555</fpage>
            <lpage>3556</lpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Absolute comparison of simulated and experimental protein-folding dynamics</p>
            </title>
            <aug>
               <au>
                  <snm>Snow</snm>
                  <fnm>CD</fnm>
               </au>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Pande</snm>
                  <fnm>VS</fnm>
               </au>
               <au>
                  <snm>Gruebele</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>420</volume>
            <fpage>102</fpage>
            <lpage>106</lpage>
            <xrefbib>
               <pubid idtype="pmpid">12422224</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Folding Simulations of a three-stranded antiparallel <it>&#946;</it>-sheet Peptide</p>
            </title>
            <aug>
               <au>
                  <snm>Ferrara</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Caflisch</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci</source>
            <pubdate>2000</pubdate>
            <volume>97</volume>
            <issue>20</issue>
            <fpage>10780</fpage>
            <lpage>10785</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">27100</pubid>
                  <pubid idtype="pmpid" link="fulltext">10984515</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Native Topology or Specific Interactions: What is More Important for Protein Folding?</p>
            </title>
            <aug>
               <au>
                  <snm>Ferrara</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Caflisch</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2001</pubdate>
            <volume>306</volume>
            <fpage>837</fpage>
            <lpage>850</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11243792</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Folding@home</p>
            </title>
            <url>http://folding.stanford.edu/</url>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Native-like Mean Structure in the Unfolded Ensemble of Small Proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Zagrovic</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Snow</snm>
                  <fnm>CD</fnm>
               </au>
               <au>
                  <snm>Khaliq</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Shirts</snm>
                  <fnm>MR</fnm>
               </au>
               <au>
                  <snm>Pande</snm>
                  <fnm>VS</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2002</pubdate>
            <volume>323</volume>
            <fpage>153</fpage>
            <lpage>164</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12368107</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Contact order, transition state placement and the refolding rates of single domain proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Plaxco</snm>
                  <fnm>KW</fnm>
               </au>
               <au>
                  <snm>Simons</snm>
                  <fnm>KT</fnm>
               </au>
               <au>
                  <snm>Baker</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1998</pubdate>
            <volume>277</volume>
            <fpage>985</fpage>
            <lpage>994</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9545386</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>How the folding rate constant of simple, single-domain proteins depends on the number of native contacts</p>
            </title>
            <aug>
               <au>
                  <snm>Makarov</snm>
                  <fnm>DE</fnm>
               </au>
               <au>
                  <snm>Keller</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Plaxco</snm>
                  <fnm>KW</fnm>
               </au>
               <au>
                  <snm>Metiu</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>3535</fpage>
            <lpage>3539</lpage>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Is protein folding hierarchic? I. Local structure and peptide folding</p>
            </title>
            <aug>
               <au>
                  <snm>Baldwin</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Rose</snm>
                  <fnm>GD</fnm>
               </au>
            </aug>
            <source>Trends in Biochemical Sciences</source>
            <pubdate>1999</pubdate>
            <volume>24</volume>
            <fpage>26</fpage>
            <lpage>33</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10087919</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <aug>
               <au>
                  <snm>Garey</snm>
                  <fnm>MR</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>DS</fnm>
               </au>
            </aug>
            <source>Computers and Intractability: A Guide to the Theory of NP-Completeness</source>
            <publisher>WH Freeman</publisher>
            <pubdate>1979</pubdate>
            <volume>A421</volume>
            <fpage>SR10</fpage>
            <note>ISBN 0716710455</note>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Mining Non-local Structural Motifs in Proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Hu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Shen</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Shao</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Bystroff</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Zaki</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>BIOKDD 2002</source>
            <publisher>Edmonton, Canada</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Mining Protein Contact Maps</p>
            </title>
            <aug>
               <au>
                  <snm>Hu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Shen</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Shao</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Bystroff</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Zaki</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <url>http://citeseer.ist.psu.edu/533466.html</url>
         </bibl>
         <bibl id="B20">
            <title>
               <p>101 optimal PDB structure alignments: a branch-and-cut algorithm for the maximum contact map overlap problem</p>
            </title>
            <aug>
               <au>
                  <snm>Lancia</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Carr</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Walenz</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Istrail</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Proceedings of the fifth annual international conference on Computational biology</source>
            <publisher>ACM Press</publisher>
            <pubdate>2001</pubdate>
            <fpage>193</fpage>
            <lpage>202</lpage>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Efficient Dynamics in the Space of Contact Maps</p>
            </title>
            <aug>
               <au>
                  <snm>Vendruscolo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Domany</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Folding &amp; Design</source>
            <pubdate>1998</pubdate>
            <volume>3</volume>
            <issue>5</issue>
            <fpage>329</fpage>
            <lpage>336</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9806935</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Some Methods for Classification and Analysis of Multivariate Observations</p>
            </title>
            <aug>
               <au>
                  <snm>MacQueen</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability</source>
            <publisher>Berkeley: University of California Press</publisher>
            <editor>Le Cam LM, Neyman J</editor>
            <pubdate>1967</pubdate>
            <volume>1</volume>
            <fpage>281</fpage>
            <lpage>297</lpage>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A mathematical theory of communication</p>
            </title>
            <aug>
               <au>
                  <snm>Shannon</snm>
                  <fnm>CE</fnm>
               </au>
            </aug>
            <source>SIGMOBILE Mob Comput Commun Rev</source>
            <pubdate>2001</pubdate>
            <volume>5</volume>
            <fpage>3</fpage>
            <lpage>55</lpage>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Peptide models for protein beta-sheets</p>
            </title>
            <aug>
               <au>
                  <snm>Griffiths-Jones</snm>
                  <fnm>SR</fnm>
               </au>
            </aug>
            <source>PhD thesis</source>
            <publisher>University of Nottingham</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Towards Association Based Spatio-temporal Reasoning</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Parthasarathy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mehta</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Proceedings of the 19th IJCAI Workshop on Spatio-temporal Reasoning</source>
            <pubdate>2005</pubdate>
         </bibl>
      </refgrp>
   </bm>
</art>
