Abstract
Background
Estrogen is a chemical messenger that has an influence on many breast cancers as it helps cells to grow and divide. These cancers are often known as estrogen responsive cancers in which estrogen receptor occupies the surface of the cells. The successful treatment of breast cancers requires understanding gene expression, identifying of tumor markers, acquiring knowledge of cellular pathways, etc. In this paper we introduce our proposed triclustering algorithm δTRIMAX that aims to find genes that are coexpressed over subset of samples across a subset of time points. Here we introduce a novel meansquared residue for such 3D dataset. Our proposed algorithm yields triclusters that have a meansquared residue score below a threshold δ.
Results
We have applied our algorithm on one simulated dataset and one reallife dataset. The reallife dataset is a timeseries dataset in estrogen induced breast cancer cell line. To establish the biological significance of genes belonging to resultant triclusters we have performed gene ontology, KEGG pathway and transcription factor binding site enrichment analysis. Additionally, we represent each resultant tricluster by computing its eigengene and verify whether its eigengene is also differentially expressed at early, middle and late estrogen responsive stages. We also identified hubgenes for each resultant triclusters and verified whether the hubgenes are found to be associated with breast cancer. Through our analysis CCL2, CD47, NFIB, BRD4, HPGD, CSNK1E, NPC1L1, PTEN, PTPN2 and ADAM9 are identified as hubgenes which are already known to be associated with breast cancer. The other genes that have also been identified as hubgenes might be associated with breast cancer or estrogen responsive elements. The TFBS enrichment analysis also reveals that transcription factor POU2F1 binds to the promoter region of ESR1 that encodes estrogen receptor α. Transcription factor E2F1 binds to the promoter regions of coexpressed genes MCM7, ANAPC1 and WEE1.
Conclusions
Thus our integrative approach provides insights into breast cancer prognosis.
Keywords:
Time series gene expression data; Tricluster; Meansquared residue; Eigengene; Affirmation score; Gene ontology; KEGG pathway; TRANSFACBackground
In the context of genomics research, the functional approach is based on the ability to analyze genomewide patterns of gene expression and the mechanisms by which gene expression is coordinated. Microarray technology and other highthroughput methods are used to measure expression values of thousands of genes over different samples/experimental conditions. In recent years the microarray technology has been used to measure in a single experiment expression values of thousands of genes under a huge variety of experimental conditions across different time points. This kind of datasets can be referred to as time series microarray datasets. Because of the large data volume, computational methods are used to analyze such datasets. Clustering is one of the most common methods for identifying coexpressed genes [1]. This kind of analysis is facilitative for constructing gene regulatory networks in which single or groups of genes interact with other genes. Besides this, coexpression analysis also reveals information about some unknown genes that form a cluster with some known genes.
A clustering algorithm is used to group genes that are coexpressed over all conditions/samples or to group experimental conditions over all genes based on some similarity/dissimilarity metric. However clustering may fail to find the group of genes that are similarly expressed over a subset of samples/experimental conditions i.e. clustering algorithms are unable to find such local patterns in the gene expression dataset. To deal with that problem, biclustering algorithms are used. A bicluster can be defined as a subset of genes that are coexpressed over a subset of samples/experimental conditions. The first biclustering algorithm that was used to analyse gene expression datasets was proposed by Cheng and Church and they used a greedy search heuristic approach to retrieve largest possible bicluster having mean squared residue (MSR) under a predefined threshold value δ (δbicluster) [2]. But nowadays, biologists are eager to analyze 3D microarray dataset to answer the question: “Which genes are coexpressed under which subset of experimental conditions/samples across which subset of time points?” Biclustering is not able to deal with such 3D datasets. So, in this case we need some other clustering technique that can mine 3D datasets. Hence the term Triclustering has been defined and a tricluster can be delineated as a subset of genes that are similarly expressed across a subset of experimental conditions/samples over a subset of time points. Zhao and Zaki proposed a triclustering algorithm TRICLUSTER that is based on graphbased approach. They defined coherence of a tricluster as , where e_{ia},e_{ib} denote the expression values of two columns a and b respectively for a row i. A tricluster is valid if it has a ratio below a maximum ratio threshold ϵ[3].
Here we introduce an efficient triclustering algorithm δTRIMAX[4] that aims to cope with noisy 3D gene expression dataset and is less sensitive to input parameters. The normalization method does not influence the performance of our algorithm, as it produces the same results for both normalized and raw datasets. Here we propose a novel extension of MSR [2] for 3D gene expression data and use a greedy search heuristic approach to retrieve triclusters, having MSR values below a threshold δ. Hence the triclusters can be defined as δtricluster.
In this work we have applied our proposed δTRIMAX algorithm on a timeseries gene expression data in estrogen induced breast cancer cell. Estrogen, a chemical messenger plays an instrumental role in normal sexual development, regulating woman’s menstrual cycles and normal development of the breast. Estrogen is also needed for heart and healthy bones. As estrogen plays vital role in stimulating breast cell division, has an effect on other hormones implicated in breast cell division and provides support to the growth of estrogenresponsive tumors, it may be involved in risk for breast cancer [5]. Though since last decade, some research has been done to decipher some unknown questions on breast cancer risk, still some questions such as involvement of genes in breast cancer risk etc. remain unanswered. Here our coexpression analysis reveals some genes that have already been found to be associated with estrogen induced breast cancer and some other genes that might play an important role in this context. Additionally, our coregulation analysis brings out some important information such as which transcription factor binds the promoter regions of genes and play an important role in this context.
In section 2, we have described our proposed triclustering algorithm in detail. Section 3 shows results of our algorithm using one artificial dataset and one reallife dataset. In section 4, we conclude our work.
Methods
Definitions
Definition 1 (Time Series Microarray Gene Expression Dataset). We can model a time series microarray gene expression dataset (D) as a G × C × T matrix and each element of D (d _{ijk}) corresponds to the expression value of gene i over jth sample/experimental condition across time point k, where i ∈ (g_{1},g_{2},...,g _{G}), j ∈ (c_{1},c_{2},...,c _{C}) and k ∈ (t_{1},t_{2},...,t _{T}).
Definition 2 (Tricluster). A tricluster is defined as a submatrix M(I,J,K) = [m _{ijk}], where i ∈ I, j ∈ J and k ∈ K. The submatrix M represents a subset of genes (I) that are coexpressed over a subset of conditions (J) across a subset of time points (K).
Definition 3 (Perfect Shifting Tricluster). A Tricluster M(I,J,K) = m _{ijk}, where i ∈ I, j ∈ J and k ∈ K, is called a perfect shifting tricluster if each element of the submatrix M is represented as: m _{ijk} = Γ + α_{i} + β_{j} + η_{k}, where Γ is a constant value for the tricluster, α_{i}, β_{j} and η_{k} are shifting factors of ith gene, jth samples/experimental condition and kth time point, respectively. As the noise is present in microarray datasets, the deviation from actual value and expected value of each element in the dataset also exists. For this deviation, every tricluster is not a perfect one.
Cheng and Church proposed an algorithm for retrieving large and maximal biclusters that have mean squared residue score (MSR) below a threshold δ in 2D microarray gene expression dataset. They also showed that MSR of a perfect δbicluster and perfect shifting bicluster is zero (S = δ = 0) [2,6]. Now extending this idea, here we present a novel definition of Mean Squared Residue score for 3D microarray gene expression datasets. The MSR (S) of a perfect shifting tricluster becomes also zero, where each element m_{ijk} = Γ + α_{i} + β_{j} + η_{k}. For delineating new MSR score (S), at first we need to define the residue score:
Let the mean of ith gene (m _{ijk}): , the mean of jth sample/experimental condition (m _{ijk}): , the mean of kth time point (m _{ijk}): , and the mean of tricluster (m _{ijk}): . Now the mean of the tricluster can be considered as the value of constant i.e. Γ=m_{IJK}. We can define the shifting factor for the ith gene (α_{i}) as the difference between m _{ijk} and m _{ijk} i.e. α_{i}=m_{iJK}m_{IJK}. Similarly, we can define shifting factor for the jth condition (β_{j}) as β_{j}=m_{IjK}m_{IJK} and shifting factor for the kth time point (η_{k}) can be defined as η_{k}=m_{IJk}m_{IJK}. Hence we can define each element of a perfect shifting tricluster as m_{ijk}=Γ+α_{i}+β_{j}+η_{k}=m_{IJK}+(m_{iJK}m_{IJK})+(m_{IjK}m_{IJK})+(m_{IJk}m_{IJK})=(m_{iJK}+m_{IjK}+m_{IJk}2m_{IJK}). But usually noise is evident in microarray gene expression dataset. Therefore to evaluate the difference between the actual value of an element (m _{ijk}) and its expected value, obtained from above equation, the term “residue” can be used [6]. Thus the residue of a tricluster (r _{ijk}) can be defined as follows: r_{ijk}=m_{ijk}(m_{iJK}+m_{IjK}+m_{IJk}2m_{IJK})=(m_{ijk}m_{iJK}m_{IjK}m_{IJk}+2m_{IJK}).
Definition 4 (Mean Squared Residue). We define the term Mean Squared Residue MSR(I,J,K) or S of a tricluster M(I,J,K) to estimate the quality of a tricluster i.e. the level of coherence among the elements of a tricluster as follows:
Lower residue score represents larger coherence and better quality of a tricluster.
Proposed method
δTRIMAX aims to find largest and maximal triclusters in a 3D microarray gene expression dataset. It is an extension of Cheng and Church biclustering algorithm [2] that deals with 2D microarray datasets. In contrast, our algorithm is capable to mine 3D gene expression dataset. There is always a submatrix in an expression dataset that has a perfect MSR(I,J,K) or S score i.e. S = 0 and this submatrix is each element of the dataset. But as mentioned above, our algorithm finds maximal triclusters having S score under a threshold δ, hence we have used a greedy heuristic approach to find triclusters. Our algorithm therefore starts with the entire dataset containing all genes, all samples/experimental conditions and all time points.
Algorithm 1 (δTRIMAX)
Input. D, a matrix that represents 3D microarray gene expression dataset, λ> 1, an input parameter for multiple node deletion algorithm, δ≥ 0, maximum allowable MSR score.
Output. All possible δtriclusters.
Initialization. Missing elements in D ← random numbers, D’ ← D
Repeat
a. D’_{1} ← Results of Algorithm 2 on D’ using delta and λ. If the no. of genes (conditions/samples and/or no. of time points) is 50 (This value can be choosen experimentally. Large value increases the execution time of the algorithm as it then executes more number of iterations.), then do not apply Algorithm 2 on genes (conditions/samples and/or time points).
b. D’_{2} ← Results of Algorithm 3 on D’_{1} usingδ.
c. D’_{3} ← Results of Algorithm 4 on D’_{2}.
d. Return D’_{3} and replace the elements that exist in D’ and D’_{3} with random numbers.
Until(No gene is found for δtricluster)
Initially, our algorithm removes genes or conditions or time points from the dataset to accomplish largest diminishing of score S; this step is described in the following section in which a node corresponds to a gene or experimental condition or time point in the 3D microarray gene expression dataset.
Algorithm 2 (Multiple node deletion)
Input. D, a matrix of real numbers that represents 3D microarray gene expression dataset; δ≥ 0, maximum allowable MSR threshold, λ> 1, threshold for multiple node deletion. The value of λ has been set experimentally to optimize the speed and performance (to avoid falling into local optimum) of the algorithm.
Output. M _{ijk}, a δtricluster, consisting of a subset(I) of genes, a subset(J) of samples/experimental conditions and a subset of time points, having MSR score (S) less than or equal to δ.
Initialization. I ←{set of all genes }, J ←{set of all experimental conditions/ samples } and K ←{set of all time points } and to M(I,J,K) ← D
Repeat
Calculate m _{ijk}, ∀ i ∈ I; m _{ijk}, ∀ j ∈ J; m _{ijk}, ∀ k ∈ K; m _{ijk} and S.
If S ≤δ return M(I,J,K)
Else
Delete genes i ∈ I that satisfy the following inequality
Recalculate m _{ijk}, ∀ i ∈ I; m _{ijk}, ∀ j ∈ J; m _{ijk}, ∀ k ∈ K; m _{ijk} and S
Delete samples/experimental conditions j ∈ J that satisfy the following inequality
Recalculate m _{ijk}, ∀ i ∈ I; m _{ijk}, ∀ j ∈ J; m _{ijk}, ∀ k ∈ K; m _{ijk} and S
Delete time points k ∈ K that satisfy the following inequality
End if
Until(There is no change in I, J and/or K)
The complexity of this algorithm is O(max(m,n,p)) where m, n and p are the number of genes, samples and time points in the 3D microarray dataset.
In the second step, we delete one node at each iteration from the resultant submatrix, produced by Algorithm 2, until the score S of the resultant submatrix is less than or equal to δ. This step results in a δtricluster.
Algorithm 3 (Single node deletion)
Input. D, a matrix of real numbers that represents 3D microarray gene expression dataset; δ≥ 0, maximum allowable MSR threshold.
Output. M _{ijk}, a δtricluster, consisting of a subset(I) of genes, a subset(J) of samples/experimental conditions and a subset of time points, having MSR score (S) less than or equal to δ.
Initialization.I ←{set of all genes in D }, J ←{set of experimental conditions/samples in D } and K ←{set of time points in D } and to M(I,J,K) ← D
Calculate m _{ijk}, ∀ i ∈ I; m _{ijk}, ∀ j ∈ J; m _{ijk}, ∀ k ∈ K; m _{ijk} and S.
While S >δ
Detect gene i ∈ I that has the highest score
Detect sample/experimental condition j ∈ J that has the highest score
Detect time point k ∈ K that has the highest score
Delete gene or sample/experimental condition or time point that has highest μ score and modify I or J or K. Recalculate m _{ijk}, ∀ i ∈ I; m _{ijk}, ∀ j ∈ J; m _{ijk}, ∀ k ∈ K; m _{ijk} and S.
End while
Return M(I,J,K)
The complexity of first and second steps is O(mnp) as those will iterate (m+n+p) times. The complexity of selection of best genes, samples and time points is O(log m + log n + log p). So it is suggested to use algorithm II before algorithm 3.
As the goal of our algorithm is to find maximal triclusters, having MSR score (S) below the threshold δ, the resultant tricluster M(I,J,K) may not be the largest one. That means some genes and/or experimental conditions/samples and/or time points may be added to the resultant tricluster T produced by node deletion algorithm, so that the MSR score of new tricluster T’ produced after node addition does not exceed the MSR score of T. Now the third step of our algorithm is described below.
Algorithm 4 (Node addition)
Input. D, a matrix of real numbers that represents δtricluster, having a subset of genes (I), a subset of experimental conditions/samples (J) and a subset of time points (K).
Output. M, a δtricluster, consisting of a subset of genes (I’), a subset of samples/experimental conditions (J’) and a subset of time points (K’), such that I ⊂ I ^{′}, J ⊂ J ^{′}, K ⊂ K’ and MSR(I ^{′},J ^{′},K’) ≤ MSR of D.
Initialization. M(I,J,K) ← D
Repeat
Calculate m _{ijk}, ∀ i; m _{ijk}, ∀ j; m _{ijk}, ∀ k; m _{ijk} and S. Add genes i ∉ I that satisfy the following inequality
Recalculate m _{ijk}, ∀ j; m _{ijk}, ∀; m _{ijk} and S
Add samples/experimental conditions j ∉ J that satisfy the following inequality
Recalculate m _{ijk}, ∀ i; m _{ijk}, ∀ k; m _{ijk} and S
Add time points k ∉ K that satisfy the following inequality
Until(There is no change in I, J and/or K)
I’ ← I, J’ ← J, K’ ← K Return I’, J’, K’
The complexity of this algorithm is O(mnp) as each step iterates (m+n+p) times.
Tricluster eigengene
To find tricluster eigengene we applied singular value decomposition method (SVD) on the expression data of each tricluster [7]. For instance, represents the expression matrix of ith tricluster, where g, c and t represent the number of genes, samples and time points of ith tricluster. Now we apply SVD on the data matrix (normalized to mean=0 and variance=1). Now, the SVD of ith tricluster can be represented as,
where U and V are the orthogonal matrices. U^{i} is a g ∗ (c ∗ t) matrix with orthonormal columns, V^{i} is a (c ∗ t) × (c ∗ t) orthogonal matrix and D^{i} is (c ∗ t) × (c ∗ t) diagonal matrix of singular values.
Assuming that singular values in matrix D^{i} are arranged in nondecreasing order, we can represent eigengene of ith tricluster by the first column of matrix V^{i}, i.e.
Results and discussion
Results on simulated dataset
We have produced one simulated dataset SMD of size 2000 × 30 × 30. At first we have implanted three perfect shifting triclusters of size 100 × 6 × 6, 80 × 6 × 6 and 60 × 5 × 5 into the dataset SMD and then implanted three noisy shifting triclusters of the same size mentioned before into it. To estimate the degree of similarity between the implanted and obtained triclusters, we define affirmation score in the same way as Prelic et. al. defined for two sets of biclusters [6,8]. So, overall average affirmation score of T_{1} with respect to T_{2} is as follows, where (SM(T_{1}, T_{2})) is the average gene affirmation score, (SM(T_{1}, T_{2})) is the average sample affirmation score and (SM(T_{1}, T_{2})) is the average time point affirmation score of T_{1} with respect to T_{2}:
Suppose, we have two sets of triclusters T _{im} and T _{res} where T _{im} represents the set of implanted triclusters and T _{res} corresponds to the set of triclusters retrieved by any triclustering algorithm. Hence SM ^{∗}(T_{im},T_{res}) denotes how well the triclustering algorithm finds the true triclusters that have been implanted into the dataset. This score varies from 0 to 1 (if T _{im} = T _{res}).
For the dataset containing perfect shifing triclusters, we have assigned 0.35 and 1.0005 to the parameters δ and λ, respectively. The value of δ varies from one dataset to another dataset. Then we have added noisy triclusters having different levels i.e. standard deviations (σ = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7). To have an idea about the δ value, we have first clustered the genes over all time points and then the time points over the subset of genes for each gene cluster in each sample plane using the Kmeans algorithm. Then we have computed the MSR value(S) of the submatrix, considering a randomly selected sample plane, gene and timepont cluster for 100 times. Then we have taken the lowest value as the value of δ. For these noisy datasets, we have assigned 3.75 and 1.004 to the parameters δ and λ, respectively. In Figure 1 we have compared the performance of our algorithm with that of the TRICLUSTER algorithm [3] in terms of affirmation score using the artificial dataset. Our δTRIMAX algorithm performs better than TRICLUSTER algorithm for the noisy dataset. For perfect additive triclusters, performances of both these algorithms are comparable with each other.
Figure 1. Comparison in terms of Affirmation Scores.a. Comparison of Affirmation scores produced by TRIMAX and TRICLUSTER algorithm. b. Comparison of running time of TRIMAX and TRICLUSTER algorithm on the synthetic dataset.
Results on reallife dataset
Datasets for genomewide analysis of estrogen receptor binding sites
This dataset contains 54675 affymetrix probeset ids, 3 biological replicates and 4 time points. In this experiment MCF7 cells are stimulated with 100 nm estrogen for 0, 3, 6 and 12 hours and the experiments are performed in triplicate. This dataset is publicly available at Gene Expression Omnibus (GEO) (dataset id GSE 11324). It was used for discovering of cisregulatory sites in previously uninvestigated regions and cooperating transcription factors underlying estrogen signaling in breast cancer [9]. We assign 0.012382 and 1.2 to δ and λ respectively. In this case our algorithm results in 115 triclusters. From Figure 2, we observe that the genes in tricluster 4 have similar expression profiles over all three samples across 0, 6 and 12 hours but not at 3 hour.
Figure 2. Expression Profiles. Figures in first row show the expression profiles of genes ESR1, HOXA11, FAM71A, SPEF2, IFIH1, FPR2, SPAG9, NCF4, ADAM3A, CCNYL1 respectively of tricluster 4 over all samples; The redcolored time point (3 hrs.) is not a member of this tricluster. The figures in second row show the expression profiles of the same genes across 0, 6 and 12 hours.
To compare the performance of our proposed algorithm with TRICLUSER algorithm on reallife dataset, we have used three validation indexes.
Coverage
Coverage for any triclustering algorithm can be delineated as
where g_{alg}, c_{alg} and t_{alg} denote total number of genes, experimental samples and time points retrieved by the triclustering algorithm. G, C and T represent number of all genes, experimental samples and time points in the dataset.
Triclustering Quality Index (TQI)
We can elucidate Triclusering Quality Index of a tricluster by equation 4.
where MSR_{i} and Volume_{i} represent meansquared residue and volume of ith tricluster. Lower TQI score represents better quality of tricluster.
Statistical Difference from Background (SDB)
Here we have introduced another quality measurement, termed as Statistical Differences from Background (SDB) [10] as
where n is the total number of triclusters extracted by the algorithm. MSR_{i} represents mean squared residue of ith tricluster retrieved by the algorithm and RMSR_{j} represents mean squared residue of jth random tricluster having the same number of genes, experimental samples and time points as that of ith resultant tricluster. Here higher value of the denominator denotes better quality of the resultant tricluster. Hence, lower SDB score signifies better performance of the algorithm. Table 1 shows the comparison between proposed δTRIMAX algorithm and TRICLUSTER algorithm in terms of coverage, SDB and TQI score.
Table 1. Comparison between δ TRIMAX and TRICLUSTER algorithm using coverage, Statistical Difference of from Background (SDB) and Triclustering Quality Index (TQI)
Biological significance
We have established the biological significance of genes belonging to each resultant tricluster by performing (a) Gene Ontology (GO) and KEGG pathway enrichment analysis, (b) cogitating each tricluster with different estrogenresponsive stages (early (3 hour), middle (6 hour) and late (12 hour)), (c) identifying hub genes of each tricluster and (d) Transcription Factor Binding Site (TFBS) enrichment analysis.
GO and KEGG pathway enrichment analysis
We have used GOStats package [11] in R to perform GO and KEGG pathway enrichment analysis for establishing biological significance of genes belonging to each tricluster. We have adjusted the pvalues using BenjaminiHochberg FDR method [12] and considered those terms as significant ones that have a pvalue below a threshold of 0.05. The smaller pvalue represents higher significance level. We have found statistically enriched GO terms for genes belonging to each tricluster. We have compared the performance of our proposed δTRIMAX algorithm with that of TRICLUSTER algorithm on reallife dataset. For comparison of the performances we have considered GO Biological Processes (GOBP) and KEGG pathway terms that have already been reported to play an important role in estrogen induced breast cancer cell. Table 2 shows the comparison between δTRIMAX and TRICLUSTER algorithm in terms corrected pvalues of GOBP and KEGG pathway terms cell adhesion and Wnt signaling pathway that are observed to be associated with estrogen induced breast cancer [13,14], respectively.
Table 2. Comparison between δ TRIMAX and TRICLUSTER algorithm in terms of pvalues of GO and KEGG pathway term enrichment analysis
Association of triclusters with different stages of response to estrogen stimulus
To cogitate each tricluster with different estrogen responsive stages of the experiment, we represent each tricluster by eigengene. Then we have examined whether the eigengene of each tricluster is differentially expressed at early, middle and late estrogen responsive stages using Limma package in R [15] (FDRBH corrected pvalue cutoff 0.05). If eigengene of one tricluster is found to be differentially expressed at any possible responsive stages, then the genes having highly correlated expression profiles with that of eigengene can also be considered to be significantly expressed at the same stages. In total our algorithm results in 115 triclusters. Eigengene of tricluster 7 has been found to be differentially expressed between 0 hour  6 hours, 0 hour  12 hours, 3 hours  12 hours and 6 hours  12 hours. 429 genes among 505 genes are found to be differentially expressed in this tricluster. KEGG pathway term mTOR signaling pathway is observed to be meliorated in this tricluster and has been reported to be associated with estrogen induced breast cancer cell [16]. Genes PIK3CA, PRKAA1, RPS6, ULK2 participate in that pathway. The genes belonging to tricluster 50 are coexpressed over all samples across 0, 6 and 12 hours. The eigengene of tricluster 50 has been observed to be differentially expressed between 0 hour  12 hours and 6 hours  12 hours. 96% of the genes belonging to this tricluster are found to be differentially expressed. The genes in this tricluster are meliorated with the KEGG pathway term ubiquitin mediated proteolysis (UBE2K, CUL4B, PIAS1, CDC23). It has been reported in a previous study that there is crosstalk between ER α and targets of ER α for ubiquitin mediated proteolysis [17]. In tricluster 71 time points 3, 6 and 12 hours are present in that tricluster and the eigengene is significantly expressed between 3 hours and 12 hours. 44 genes out of 52 genes in this tricluster are significantly expressed between 3 and 12 hours. Genes belonging to this tricluster are also enriched with the KEGG pathway term ubiquitin mediated proteolysis (UBA6, BIRC6, ANAPC1, CUL5). Genes belonging to tricluster 48 are coexpressed across 0, 3 and 12 hours. Eigengene of tricluster 48 is significantly expressed between 0 and 12 hours, 3 and 12 hours. The KEGG pathway term TGFbeta signaling pathway is meliorated in this tricluster and the crosstalk between TGFbeta signaling pathway and ER α has been reported in a previous study [18]. Genes SKP1, BMPR2 are found to play a role in the enriched pathway. Eigengene of tricluster 95 is significantly upregulated between 0 and 12 hours. 60% of all genes belonging to this tricluster are differentially up regulated at late responsive stage. The genes in this tricluster has been found to be coexpressed across 0 hour, 12 hours over all samples. The KEGG pathway term apoptosis (XIAP, IRAK4, CASP6) is observed to be meliorated in this tricluster and it has been found in a recent study that apoptosis can be induced by estrogen in estrogen deprivationresistant breast cancer cell [19]. The genes of that tricluster 75 have been observed to be coexpressed over all samples across 3 and 12 hours. The eigengene of tricluster 75 is differentially expressed between 3 hours and 12 hours. 39 genes among 64 genes are significantly expressed between 3 and 12 hours. In this case we have observed enrichment for KEGG pathway terms ErbB signaling pathway that is found to be associated with estrogen induced breast cancer cell [20]. The coexpressed genes NCK1, SOS2 in this tricluster participate in that pathway.
Identification and roles of hub genes
To identify hub genes of each tricluster, we have computed tricluster membership of each gene by calculating Pearson correlation coefficient between each gene and the eigengene of that tricluster. We have considered the top fifteen genes as hub genes having highest correlation coefficient with the eigengene of that tricluster. For tricluster 1, we have identified NPC1L1, TMEM161BAS1, POU5F1P3, POU5F1P4, POU5F1B, CCL2 as hubgenes that are coexpressed over alltime points. It has been observed in a previous study that high doses of estrogen augment intestinal cholesterol absorption attributable in part to an upregulated expression of NPC1L1 which is known as intestinal sterol influx transporter [21]. CCL2 is found to play an important role in mediating crosstalk between cancer cells and stromal fibroblasts in breast cancer cells [22]. DNAJC3AS1, ITSN2, TRPC1, CD47, ZNF286A, TSC22D2, PHF17, ZNF286B, TMEM67, NFIB, JKAMP, DENND4A, HPGD are identified as hubgenes that are coexpressed over all samples and 0, 3, 6 and 12 hours in tricluster 7. NFIB has been reported as a potential target of ER negative breast cancers [23]. Transient receptor potential cation channel (TRPC1) is known to play an important role in breast cancer [24]. CD47 has been found to intervene killing of breast cancer cells [25]. HPGD plays important role in epithelialmesenchymal transition and migration in breast cancer cells [26]. In tricluster 4, the genes are coexpressed over 0, 6 and 12 hours and IGKV113, FAM69C, SGCD, CSNK1E, TRMU, CRYBA2, IGKV1D13, IGSF11, PACS1, IQCK are identified as hubgenes. CSNK1E has been observed to play an important role proliferation of breast cancer cells and act as a regulator of activated catenin driven transcription [27]. For tricluster 13 we have identified ESYT3, SERINC2, LRRC14, ALDH4A1, RPL10, BRD4, DEC1, ZFP30, TCP11L2, ALDOA as hubgenes. The gene for Aldolase A (ALDOA) plays an instrumental role in hypoxia which is a feature of solid tumors in breast cancer [28]. Besides this BRD4 known as Bromodomain 4 is found to be associated with breast cancer progression [29]. Genes PFKFB1, TAF1, PIKFYVE, MEMO1P1, KIF1B, PHF20L1, ARHGAP24, TSC22D1, AK7, DPY30, MEMO1, PTEN, ADAM9, PTPN2, MTSS1L are found as hubgenes of tricluster 95. PTEN is known to be a tumor suppressor gene in breast cancer [30,31]. PTPN2, ADAM9 have been reported to be associated with breast cancer in previous studies [32,33]. PIKFYVE has been found to intervene epidermal growth factor receptor that is associated with human breast cancers [34]. In case of tricluster 42, ANTXR2, RHBDL2, GSTCD, DENND1B, KLC3, PREP, NOS1, STOML3, CDK5R1, CLEC7A, HGD, FOXC1, MSRB3, TEX34, SLC36A1 are appeared as hubgenes that are coexpressed over all samples and across 0, 12 hours. In a recent work, the activity of RHBDL2 has been identified in many tumour cells including breast cancer [35]. The role of FOXC1 as a regulator of human breast cancer cells by activating NF κB signaling has been discovered in a recent work [36].
TFBS enrichment analysis
To analyse the potential coregulation of coexpressed genes, we have done transcription factor binding site (TFBS) enrichment analysis using the TRANSFAC library (version 2009.4) [37] that contains eukaryotic transcription factors, their experimentally proven binding sites, and regulated genes. Here we used 42,544,964 TFBS predictions that have high affinity scores and are conserved between human, mouse, dog and cow [38]. Out of these 42 million conserved TFBSs we have selected the best 1% for each TRANSFAC matrix individually to identify the most specific regulator (transcription factor)  target interactions. We have used hypergeometric test [39] and Benjamini YekutieliFDR method [40] for pvalue correction to find overrepresented binding sites (pvalue ≤ 0.05) in the upstream regions of genes belonging to each tricluster. Table 3 shows the list of triclusters where we have found statistically meliorated TFBSs. From Table 3, we can observe that the genes in tricluster 26 are enriched with helixturnhelix, zinccoordinating DNAbinding and basic domain transcription factors. The helixturnhelix domain transcription factor E2F1, to which TRANSFAC matrix V$E2F_Q2 is associated acts as a regulator of cell proliferation in estrogeninduced breast cancer cell [41]. The zinc finger transcription factors Sp1 and Sp4, asociated with matrix V$SP1_Q6_01 have already been reported to play an important role in estrogeninduced MCF7 breast cancer cell line [42,43]. In tricluster 17, the basic domain transcrption factor CREB (matrix V$CREB_01) is important for malignancy in breast cancer cell. ATF1, ATF2, ATF3, ATF4, ATF5 (matrix V$CREBATF_Q6) likewise play an important role in breast cancer cell [44]. We have observed enrichment for matrix V$NFAT1_Q6. The corresponding transcription factor (NFATC1) has been found to be associated with clinical characteristics in breast cancer cell [45]. In tricluster 4 POU2F1, the TF associated with matrix V$OCT1_03 is a helixturnhelix domain transcription factor (Oct1) and has been reported to be estrogenresponsive in a previous study [46]. Table 4 shows some statistically enriched KEGG pathway terms for coexpressed and differentially expressed (using adjusted pvalue ≤ 0.05) genes the promoters of which are bound by aforementioned transcription factors.
Conclusion
In this work we have proposed δTRIMAX triclustering algorithm that aims to retrieve large and coherent groups of genes, having an MSR score below a threshold δ. Genes belonging to each tricluster are coexpressed over a subset of samples/ experimental conditions and across subset of time points. The results of GO and KEGG pathway enrichment analysis show that our proposed algorithm is able to extract group of coexpressed genes that are biologically significant. We have performed TFBS enrichment analysis to establish the fact that the promoter regions of the genes having similar expression profile are bound by the same transcription factors. We have compared the performance of our algorithm with that of existing algorithm using one artificial dataset in terms of affirmation score and one reallife dataset in terms of coverage, statistical difference from background and triclustering quality index score. In case of these two datasets our proposed algorithm outperformed the existing one. Additionally, here we have represented the expression profiles of genes belonging to each tricluster by eigengene and then identified hub genes using the profile of eigengene. We have observed that most of the identified hubgenes are previously reported to be associated with breast cancer and estrogen responsive elements. The other identified hub genes might be associated with breast cancer and need to be verified experimentally. Hence our integrative approach and findings might provide new insights into breast cancer prognosis.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AB, MH, AM, UM and SB carried out the literature study and preplanning of this work. AB, MH collected the datasets. MH participated in the prediction of transcription factor binding sites. AB developed the code, did the experiments, analysed the results and wrote the draft of the manuscript. MH, AM, UM, SB and EW corrected the draft. EW supervised the entire work. All authors read and approved the final manuscript.
Acknowledgements
All authors acknowledge the chairs of Workshop on Algorithms in Bioinformatics, 2012 conference for inviting us to extend our paper. Anirban Bhar gratefully acknowledges the financial support from Erasmus Mundus Eurindia Project.
References

Maulik U, Mukhopadhyay A, Bandyopadhyay S: Finding multiple coherent biclusters in microarray data using variable string length multiobjective genetic algorithm.

Cheng Y, Church GM: Biclustering of expression data.
Proc Int Conf Intell Syst Mol Biol (ISMB 2000) 2000, 93103.

Zhao L, Zaki MJ: TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data.
SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data 2005 2005, 694705.

Bhar A, Haubrock M, Mukhopadhyay A, Maulik U, Bandyopadhyay S, Wingender E: δTRIMAX: extracting triclusters and analysing coregulation in time series gene expression data. In Algorithms in Bioinformatics, 12th International Workshop, WABI 2012, Ljubljana, Slovenia, September 1012, 2012, Proceedings. Edited by Raphael B, Tang J. Berlin, Heidelberg: Springer; 2012:165177.

Wolff MS, Collman GW, Barrett JC, Huff J: Breast cancer and environmental risk factors: epidemiological and experimental findings.
Annu Rev Pharmacol Toxicol 1996, 36:573596. PubMed Abstract  Publisher Full Text

Mukhopadhyay A, Maulik U, Bandyopadhyay S: A novel coherence measure for discovering scaling biclusters from gene expression data.
J Bioinform Comput Biol 2009, 7(5):853868. PubMed Abstract  Publisher Full Text

Langfelder P, Horvath S: Eigengene networks for studying the relationships between coexpression modules.
BMC Syst Biol 2007, 1:54. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Prelic A, Bleuler S, Zimmermann P, Wille A, Bhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data.
Bioinformatics 2006, 22:11221129. PubMed Abstract  Publisher Full Text

Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, Wang Q, Bekiranov S, Sementchenko V, FOX EA, Silver PA, Gingeras TR, Liu XS, Brown M: Genomewide analysis of estrogen receptor binding sites.
Nat Genet 2006, 38(11):12891297. PubMed Abstract  Publisher Full Text

Maulik U, Bandyopadhyay S, Mukhopadhyay A: Multiobjective fuzzy biclustering in microarray data: method and a new performance measure.
Evolutionary Computation, 2008. CEC 2008. (IEEE World Congress on Computational Intelligence). IEEE Congress on 2008, 15361543.

Falcon S, Gentleman R: Using GOSTATS to test gene lists for GO term association.
Bioinformatics 2007, 23(2):257258. PubMed Abstract  Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Schlange T, Matsuda Y, Lienhard S, Huber A, Hynes NE: Autocrine WNT signaling contributes to breast cancer cell proliferation via the canonical WNT pathway and EGFR transactivation.
Breast Cancer Res 2007, 9(5):R63. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Maynadier M, Nird P, Ramirez JM, Cathiard AM, Platet N, Chambon M, Garcia M: Role of estrogens and their receptors in adhesion and invasiveness of breast cancer cells.
Adv Exp Med Biol 2008, 617:485491. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.

Boulay A, Rudloff J, Ye J, ZumsteinMecker S, O’Reilly T, Evans DB, Chen S, Lane HA: Dual inhibition of mTOR and estrogen receptor signaling in vitro induces cell death in models of breast cancer.

Chu I, Arnaout A, Loiseau S, Sun J, Seth A, McMahon C, Chun K, Hennessy B, Mills GB, Nawaz Z, Slingerland JM: Src promotes estrogendependent estrogen receptor α proteolysis in human breast cancer.
J Clin Invest 2007, 117:22052215. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Stope M, Popp SL, Knabbe C, Buck MB: Estrogen receptor alpha attenuates transforming growth factorbeta signaling in breast cancer cells independent from agonistic and antagonistic ligands.
Breast Cancer Res Treat 2010, 120(2):357367. PubMed Abstract  Publisher Full Text

Ariazi EA, Cunliffe HE, LewisWambi JS, Slifker MJ, Willis AL, Ramos P, Tapia C, Kim HR, Yerrum S, Sharma CG, Nicolas E, Balagurunathan Y, Ross EA, Jordan VC: Estrogen induces apoptosis in estrogen deprivationresistant breast cancer through stress responses as identified by global gene expression across time.
PNAS 2011, 108(47):1887918886. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

SonneHansen K, Norrie IC, Emdal KB, Benjaminsen RV, Frogne T, Christiansen IJ, Kirkegaard T, Lykkesfeldt AE: Breast cancer cells can switch between estrogen receptor alpha and ErbB signaling and combined treatment against both signaling pathways postpones development of resistance.
Breast Cancer Res Treat 2010, 121(3):601613. PubMed Abstract  Publisher Full Text

Wang HH, Liu M, Clegg DJ, Portincasa P, Wang DQ: New insights into the molecular mechanisms underlying effects of estrogen on cholesterol gallstone formation.
Biochimica etBiophysica Acta 2009, 1791(11):10371047. Publisher Full Text

Tsuyada A, Chow A, Wu J, Somlo G, Chu P, Loera S, Luu T, Li AX, Wu X, Ye W, Chen S, Zhou W, Yu Y, Wang YZ, Ren X, Li H, Scherle P, Kuroki Y, Wang SE: CCL2 mediates crosstalk between cancer cells and stromal fibroblasts that regulates breast cancer stem cells.
Cancer Res 2012, 72(11):27682779. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Moon HG, Hwang KT, Kim JA, Kim HS, Lee MJ, Jung EM, Ko E, Han W, Noh DY: NFIB is a potential target for estrogen receptornegative breast cancers.
Mol Oncol 2011, 5(6):538544. PubMed Abstract  Publisher Full Text

El Hiani Y, Ahidouch A, Lehen’kyi V, Hague F, Gouilleux F, Mentaverri R, Kamel S, Lassoued K, Brl G, OuadidAhidouch H: Extracellular signalregulated kinases 1 and 2 and TRPC1 channels are required for calciumsensing receptorstimulated MCF7 breast cancer cell proliferation.
Cell Physiol Biochem 2009, 23:335346. PubMed Abstract  Publisher Full Text

Manna PP, Frazier WA: CD47 mediates killing of breast tumor cells via Gidependent inhibition of protein kinase A.
Cancer Res 2004, 64:10261036. PubMed Abstract  Publisher Full Text

Lehtinen L, Vainio P, Wikman H, Reemts J, Hilvo M, Issa R, Pollari S, Brandt B, Oresic M, Pantel K, Kallioniemi O, Iljin K: 15Hydroxyprostaglandin dehydrogenase associates with poor prognosis in breast cancer, induces epithelialmesenchymal transition, and promotes cell migration in cultured breast cancer cells.
J Pathol 2012, 226(4):674686. PubMed Abstract  Publisher Full Text

Kim SY, Dunn IF, Firestein R, Gupta P, Wardwell L, Repich K, Schinzel AC, Wittner B, Silver SJ, Root DE, Boehm JS, Ramaswamy S, Lander ES, Hahn WC: CK1ϵ is required for breast cancers dependent on βCatenin activity.
PLOS ONE 2010, 5(2):e8979. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Favaro E, Lord S, Harris AL, Buffa FM: Gene expression and hypoxia in breast cancer.
Genome Med 2011, 3(8):55. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Crawford NP, Alsarraj J, Lukes L, Walker RC, Officewala JS, Yang HH, Lee MP, Ozato K, Hunter KW: Bromodomain 4 activation predicts breast cancer survival.
PNAS 2008, 105(17):63806385. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Brough R, Frankum JR, Sims D, Mackay A, MendesPereira A, Bajrami I, CostaCabral S, Rafiq R, Ahmad A, Cerone M, Natrajan R, Sharpe R, Shiu KK, Wetterskog D, Dedes KJ, Lambros MB, Rawjee T, Linardopoulos S, ReisFilho JS, Turner NC, Lord CJ, Ashworth A: Functional viability profiles of breast cancer.
Cancer Discov 2011, 1(3):260273. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li Y, Prasad A, Jia Y, Roy S, Loison F, Mondal S, Kocjan P, Silberstein L, Ding S, Luo H: Pretreatment with phosphatase and tensin homolog deleted on chromosome 10 (PTEN) inhibitor SF1670 augments the efficacy of granulocyte transfusion in a clinically relevant mouse model.
Blood 2011, 117(24):67026713. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bekhouche I, Finetti P, Adelade J, Ferrari A, Tarpin C, CharafeJauffret E, Charpin C, Houvenaeghel G, Jacquemier J, Bidaut G, Birnbaum D, Viens P, Chaffanet M, Bertucci F: Highresolution comparative genomic hybridization of inflammatory breast cancer and identification of candidate genes.
PLoS One 2011, 6(2):e16950. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sieuwerts AM, Meijervan Gelder ME, Timmermans M, Trapman AM, Garcia RR, Arnold M, Goedheer AJ, Portengen H, Klijn JG, Foekens JA: How ADAM9 and ADAM11 differentially from estrogen receptor predict response to tamoxifen treatment in patients with recurrent breast cancer: a retrospective study.
Clin Cancer Res 2005, 11(20):73117321. PubMed Abstract  Publisher Full Text

Kim J, Jahng WJ, Di Vizio D, Lee JS, Jhaveri R, Rubin MA, Shisheva A, Freeman MR: The phosphoinositide kinase PIKfyve mediates epidermal growth factor receptor trafficking to the nucleus.
Cancer Res 2007, 67:92299237. PubMed Abstract  Publisher Full Text

Adrain C, Strisovsky K, Zettl M, Hu L, Lemberg MK, Freeman M: Mammalian EGF receptor activation by the rhomboid protease RHBDL2.
EMBO Rep 2011, 12(5):421427. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang J, Ray PS, Sim MS, Zhou XZ, Lu KP, Lee AV, Lin X, Bagaria SP, Giuliano AE, Cui X: FOXC1 regulates the functions of human basallike breast cancer cells by activating NFKB signaling.
Oncogene 2012, 31(45):47984802. PubMed Abstract  Publisher Full Text

Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhuser R, Prss M, Schacherer F, Thiele S, Urbach S: The TRANSFAC system on gene expression regulation.
Nucleic Acids Res 2001, 29(29):281283. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, LindbladToh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals.
NATURE 2005, 434(7031):338345. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinderopen source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes.
Bioinformatics 2004, 20(18):37103715. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency.
Ann Stat 2001, 29(4):11651188. Publisher Full Text

Stender JD, Frasor J, Komm B, Chang KC, Kraus WL, Katzenellenbogen BS: Estrogenregulated gene networks in human breast cancer cells: involvement of E2F1 in the regulation of cell proliferation.
Mol Endocrinol 2007, 21(9):21122123. PubMed Abstract  Publisher Full Text

Khan S, Wu F, Liu S, Wu Q, Safe S: Role of specificity protein transcription factors in estrogeninduced gene expression in MCF7 breast cancer cells.
J Mol Endocrinol 2007, 39:289304. PubMed Abstract  Publisher Full Text

Kim K, Barhoumi R, Burghardt R, Safe S: Analysis of estrogen receptor αSp1 interactions in breast cancer cells by fluorescence resonance energy transfer.
Mol Endocrinol 2005, 19(4):843854. PubMed Abstract  Publisher Full Text

Haakenson JK, Kester M, Liu DX: The ATF/CREB family of transcription factors in breast cancer.
In Targeting New Pathways and Cell Death in Breast Cancer Edited by Aft RL. InTech. 2012.

Mancini M, Toker A: NFAT proteins: emerging roles in cancer progression.
Nat Rev Cancer 2009, 9(11):810820. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang C, Yu J, Kallen CB: Two estrogen response element sequences near the PCNA gene are not responsible for its estrogenenhanced expression in MCF7 cells.
PLOS ONE 2008, 3(10):e3523. PubMed Abstract  Publisher Full Text  PubMed Central Full Text