Identification of a prognostic gene signature of colon cancer using integrated bioinformatics analysis

Background Colon cancer is a worldwide leading cause of cancer-related mortality, and the prognosis of colon cancer is still needed to be improved. This study aimed to construct a prognostic model for predicting the prognosis of colon cancer. Methods The gene expression profile data of colon cancer were obtained from the TCGA, GSE44861, and GSE44076 datasets. The WGCNA module genes and common differentially expressed genes (DEGs) were used to screen out the prognosis-associated DEGs, which were used to construct a prognostic model. The performance of the prognostic model was assessed and validated in the TCGA training and microarray validation sets (GSE38832 and GSE17538). At last, the model and prognosis-associated clinical factors were used for the construction of the nomogram. Results Five colon cancer-related WGCNA modules (including 1160 genes) and 1153 DEGs between tumor and normal tissues were identified, inclusive of 556 overlapping DEGs. Stepwise Cox regression analyses identified there were 14 prognosis-associated DEGs, of which 12 DEGs were included in the optimized prognostic gene signature. This prognostic model presented a high forecast ability for the prognosis of colon cancer both in the TCGA training dataset and the validation datasets (GSE38832 and GSE17538; AUC > 0.8). In addition, patients’ age, T classification, recurrence status, and prognostic risk score were associated with the prognosis of TCGA patients with colon cancer. The nomogram was constructed using the above factors, and the predictive 3- and 5-year survival probabilities had high compliance with the actual survival proportions. Conclusions The 12-gene signature prognostic model had a high predictive ability for the prognosis of colon cancer. Supplementary Information The online version contains supplementary material available at 10.1186/s12957-020-02116-y.


Introduction
As one of the most common gastrointestinal malignant diseases, colon cancer is a worldwide leading cause of cancer-related mortality [1,2]. Of the 36 cancers estimated globally in 2018, the number of new cases and related deaths of colon cancer ranked fourth, with estimated new cases of approximately 1,100,000 [2]. The current standard therapeutic strategy for colon cancer is the combination of surgery and adjuvant chemotherapy or radiation therapy [3]. However, the prognosis of patients with colon cancer varies by multiple factors, including the clinical histological subtypes, age, genetic profiles, and treatment responses [4][5][6][7][8]. Also, the unsatisfactory prognostic outcomes still exist due to the complex pathogenesis that involves a variety of molecular or genetic factors [3,[9][10][11][12]. Therefore, the identification of prognostic biomarkers for colon cancer is still necessary.
The advances of biomarkers identified by highthroughput genome sequencing and bioinformatics analysis have attracted a great amount of interest in the last two decades. Computational bioinformatics analysis identifies potential biomarkers by deducing the association with disease status and progression. Most important of all, some of them are verifiable and reliable in clinical trials [13,14]. For instance, Dalerba et al. [15] emphasized that the lack of the caudal-related homoeobox transcription factor 2 (CDX2) is associated with a poor prognosis in patients with stage II/III colon cancers using bioinformatics analysis. Besides, the association between the loss of CDX2 expression and poor diseasefree survival in two Denmark cohorts of patients with colon cancer was validated by Hansen et al. [13]. These results showed that computational bioinformatics tools are of great value for identifying and providing potential prognostic biomarkers before the implements of clinical or preclinical experiments.
In the past decades, a lot of data mining analysis of mRNA, microRNA, long non-coding RNA, and DNA methylation have been performed on human cancers, including colon cancer [16][17][18][19]. As the biomarkers identified by the above techniques are of diagnostic and prognostic values in cancers and the revolution of sequencing technologies and bioinformatics tools facilitates the identification of more potential biomarkers related to disease progression [20][21][22][23], the more potential biomarkers identified, the more recognition and options for the diagnosis and treatment of colon cancer.
This current study aimed to identify a potential prognostic biomarker or gene signature using bioinformatics analysis. An integrated bioinformatics analysis was performed using The Cancer Genome Atlas (TCGA) and microarray datasets in the gene expression omnibus (GEO) database. The differentially expressed genes (DEGs) between the colon tumor and non-tumor control tissues and prognosis-associated genes were identified and used for the construction of a gene signature with prognostic predictive power. The possibility of using the prognostic model as a biomarker for colon cancer was validated using different cohorts. This study may provide a clinical reference for predicting the survival probability of patients with different clinical subtypes.

Data extraction
The public colon cancer gene expression profiles data were preliminarily extracted from the National Center for Biotechnology Information (NCBI) GEO repository (https://www.ncbi.nlm.nih.gov/geo/) using the search words "colon cancer". Datasets selected if they met the following inclusion criteria: (1) human gene expression profiles data, and (2) Fig. 1.

Screening of colon cancer-related gene module
WGCNA has been widely applied to identify the gene module associated with diseases and extract potential therapeutic targets [24]. WGCNA software (version 1.61; https://cran.r-project.org/web/packages/WGCNA/index. html) [25] in R3.4.1 was used to screen the colon cancer-related stable gene modules with the following criteria: min size ≥ 150 and cutHeight = 0.99. The TCGA data were utilized as the training set, and the GSE44861 and GSE44076 datasets were used as the validation sets for the identification of stable gene coexpression modules. The preservation and correlation properties of the above WGCNA modules were analyzed, and modules with a preservation Z-score of > 5.0 and correlation p value of < 0.05 were defined as colon cancer-related stable gene modules.

DEG identification by meta-analysis
The common DEGs across the TCGA, GSE44861, and GSE44076 datasets were identified using the MetaDE.ES methods in the R MetaDE package (https://cran.rproject.org/web/packages/MetaDE/) [26,27]. Briefly, the heterogeneity test of gene expression profiles from different platforms was first conducted according to the statistical tau2, Q value, and Q pval. The common DEGs were screened out according to the following criteria: tau 2 = 0, p < 0.05, Q pval > 0.05, false discovery rate (FDR) < 0.05, and log 2 fold change (FC) had the same differential expression direction across the three datasets (> 0 or < 0). The overlapping genes between the above WGCNA module genes and the common DEGs across the three datasets were retained and used for further functional enrichment analysis and the construction of the prognostic prediction model.

Functional enrichment analysis
To investigate the biological functions associated with the above overlapping genes (DEGs), functional enrichment analyses were performed. The Gene Ontology biological processes and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with these DEGs were identified using the DAVID online tool (version 6.8; https://david.ncifcrf.gov/) [28,29]. Significant enrichment was considered when p value < 0.05.

Construction and evaluation of prognostic prediction model
Before the construction of the prognostic prediction model, the prognosis-associated DEGs were identified using the univariate and multivariate Cox regression analysis in the R survival package (version 2.4, https://cran.rproject.org/web/packages/survival/index.html) [30]. The prognosis-associated DEGs in the TCGA training set (n = 432) were identified when log-rank p value < 0.05. Then, the optimal prognostic gene signature was identified using the L1-penalized least absolute shrinkage and selection operator (LASSO) Cox-proportional hazards (Cox-PH) model (lamba = 1000) in the penalized package (version 0.9-50, http://bioconductor.org/packages/penalized/) [31,32]. Subsequently, the prognosis risk score of each sample was calculated using the following gene signature model: risk score = ∑β gene × Exp gene , where β represents the LASSO coefficient and Exp denotes the expression level. All the samples in the TCGA training set were divided into the high-and low-risk groups according to the median risk score. The Kaplan-Meier (K-M) curve analysis in the R survival package (version 2.41-1) and the receiver operating characteristic (ROC) curve were used to assess the association of the risk score with the overall survival in patients with colon cancer. Similarly, the samples in the validation sets (GSE17538 and GSE38832) were separately divided into the high-and low-risk groups according to the above prognostic model. The performance of the above gene signature model in predicting the prognosis of colon cancer was validated in the validation sets (GSE17538 and GSE38832) using the K-M survival test and ROC curves.

Identification of clinical factors associated with the prognosis of colon cancer
The clinical factors associated with the prognosis of colon cancer were identified in the TCGA training set using the univariate and multivariate Cox regression analysis of the survival package (version 2.41-1) in R3.4.1. The threshold was log-rank p value < 0.05. Also, the K-M survival test was used to validate the performance of the gene signature model in predicting the prognosis of patients with different clinical subtypes.

Nomogram survival model analysis
The final nomogram was established using the "rms" package (Version 5.1-2; https://cran.r-project.org/web/ packages/rms/index.html) in R3.4.0 to estimate the individualized survival probability for patients with colon cancer. The prognosis-associated clinical factors and the gene signature model were used for the construction of the nomogram. Each factor in the nomogram was ascribed points according to its weight. The total point of each sample was calculated and the 3-and 5-year survival probabilities of each sample were predicted accordingly.

Screening of DEGs between the high-and low-risk groups
At last, the DEGs between the samples in the high-and low-risk groups were identified to investigate the different gene expression profiles and features between patients with different survival probabilities. The DEGs between the high-and low-risk groups in the training set were screened using the limma package (Version 3.34.7, https://bioconductor.org/packages/release/bioc/ html/limma.html) [33], with the thresholds of FDR < 0.05 and |log 2 FC| > 0.5.

Extraction of WGCNA modules related to colon cancer
The correlation analysis of RNA-seq data showed there were significant positive correlations (expression correlation coefficient > 0.700 and p < 1e −200 ) and connectivities (p < 1e −06 ) across the TCGA, GSE44861, and GSE44076 datasets ( Figure S1A). Before the identification of the WGCNA modules analysis, the scale-free topology criterion was identified: the soft threshold power = 7 when the scale-free topology model fit R 2 was maximized (R 2 = 0.9; Figure S1B). Then, 8 WGCNA modules were identified in the training dataset according to the criteria: soft threshold power = 7, min size ≥ 150, and cutHeight = 0.99 (Fig. 2a). The same module division was identified in the two validation datasets (GSE44861 and GSE44076; Fig. 2a).
Subsequently, 5 robust modules (blue, brown, green, red, and yellow) with a preservation Z-score of > 5.0 and a p value of < 0.05 were obtained. A total of 1160 genes, including 381, 205, 195, 184, and 195 genes in the blue, brown, green, red, and yellow modules, were obtained ( Table 1). The correlation of these 8 WGCNA modules with clinical factors, including patients' age, gender, history of colon polyps, lymphatic invasion, microsatellite instability, radiation therapy, death, tumor recurrence, pathologic M, pathologic N, pathologic T, and pathologic stage, is shown in Fig. 2b. For instance, the genes in the red module were significantly correlated with the pathologic T classification (cor = 0.54, p < 0.0001).

Identification of common DEGs using the MetaDE analysis
Following the aforementioned criteria for the MetaDE analysis, 1153 common DEGs were identified across the three datasets (TCGA, GSE44861, and GSE44076), including 724 downregulated DEGs and 429 upregulated DEGs. These DEGs had distinctively different expression profiles in the tumor and control samples and showed the same differential expression direction across the three datasets (Fig. 3).

Enrichment analysis of common DEGs
The Venn diagram indicated that 556 genes were overlapped between the five WGCNA module genes (n = 1160) and common DEGs (n = 1153) were obtained (Fig. 4a), including 218, 73, 166, 0, and 99 genes in the blue, brown, green, red, and yellow modules, respectively. The functional enrichment analyses indicated that these common DEGs were significantly associated with 24 biological processes related to immune response and the defense response (Fig. 4b) and 8 KEGG pathways including cytokinecytokine receptor interaction, chemokine signaling pathway, and focal adhesion (Fig. 4b).

Construction of the prognostic model
Based on the univariate Cox regression analysis, 84 prognosis-associated DEGs were identified in the TCGA training dataset. The multivariate Cox regression analysis showed that 14 out of the 84 DEGs were independently correlated with the prognosis of patients with colon cancer (Table S1). Afterward, an optimized prognostic gene signature was identified using the Cox-PH model, which consisted of 12 DEGs, including ADORA3, CPA3, CPM, EDN3, FCRL2, MFNG, NAT1, PCSK5, PPARGC1A, PRRX2, TNFRSF17, and WDR78 ( Table 2).
Most of these 12 genes were in the blue (n = 5) and green modules (n = 6). The prognostic gene model of colon cancer was built according to the following algorithm: prognostic risk score = 0. low-risk (n = 216) groups according to the median prognostic risk score. The K-M survival test indicated that patients with high-risk scores had a significantly shorter survival time compared with patients with low-risk scores (hazard ratio, HR = 3.287, 95% CI 2.082-5.189, p = 4.096e −08 ; Fig. 5a). The ROC curve analysis showed the prognostic model had a high accuracy in predicting the prognosis of colon cancer in the training set (area under the ROC curve, AUC = 0.922; Fig. 5a).

Validation of the prognostic model
Similarly, the samples with clinical overall survival data in the two validation datasets (GSE17538, n = 232; and GSE38832, n = 122) were separately divided into the highand low-risk groups according to the prognostic risk scores (Fig. 5b, c). The K-M survival analysis showed there was a significant difference in the overall survival time between patients in the high and low groups in the two datasets (GSE17538: HR = 1.659, 95% CI 1.042-2.642, p = 3.059e −02 ; GSE38832: HR = 3.247, 95% CI 1.312-9.037, p = 5.273e −03 ; Fig. 5b, c). Besides, the model had high accuracies in predicting the prognosis in the two datasets (GSE17538: AUC = 0.841; GSE38832: AUC = 0.824). These results suggested the high performance of this model in predicting the prognosis of colon cancer.

Nomogram model construction
According to the above analyses, the nomogram model was constructed using the prognosis-associated factors, including patients' age, clinical T classification, and tumor recurrence status (Fig. 7a). According to the The closer the color is to red, the higher the significance nomogram, we found that patients with older age, an advanced T classification, tumor recurrence, and a high risk score had low 3-and 5-year survival probabilities. Take an 85-year-old man (~5 points), with T3 classification (~33.7 points), with tumor recurrence (0 points), and a risk score of 1.5 (~9.3 points), for example, he had a total point of 48. His 3-and 5year survival probabilities were approximately 40% and 28%, respectively (Fig. 7a). What's more, the predicted 3-and 5-year survival probabilities had high compliance with the actual situations (c-index = 0.752 and 0.721; Fig. 7b). These results suggested the clinical applicability of this prognostic model in predicting the prognosis of colon cancer.

The features of the DEGs between patients with different prognosis risk scores
At last, we investigated the differential gene expression profiles between TCGA samples with high-and low-risk scores. A total of 514 DEGs were identified between high-and low-risk groups, including 102 downregulated and 412 upregulated genes (Fig. 8a). The clustering analysis indicated that the expression profiles of these DEGs changed with the risk scores (Fig. 8b), showing the coexpression profiles of these DEGs with the 12-gene signature.

Discussion
In the present study, 5 significantly stable gene modules (including 1160 genes) related to colon cancer were constructed by the WGCNA algorithm. Then, 1153 common DEGs across the TCGA, GSE44861, and GSE44076 datasets were identified between colon cancer tumor and normal tissue samples. Furthermore, the expression features of 12 prognosis-associated DEGs (ADORA3, CPA3, CPM, EDN3, FCRL2, MFNG, NAT1, PCSK5, PPARGC1A, PRRX2, TNFRSF17, and WDR78) were identified as the optimized prognostic gene signature. The corresponding prognostic model presented high performance for predicting the prognosis of colon cancer both in the training dataset and in the validation datasets. Besides, we found that the predicted 3-and 5year survival probabilities using the combination of the model status with clinical factors (including patients' age, pathological T classification, and tumor recurrence status) showed high compliance with the actual 3-and 5-year overall survival proportion. These results indicated that the prognostic gene signature was of great reference value for predicting the prognosis and survival probability of colon cancer.
The advances in mining the genetic properties of various diseases have been enhanced due to the rapid technological development in high-throughput sequencing and bioinformatics [34]. The GEO and TCGA databases, as public available cancer genomic databases, provide the comprehensive data of cancers, including mRNA expression data, miRNA expression data, copy number variation, DNA methylation, and clinical information [35,36]. The TCGA and GEO data have been effectively applied to improve diagnostic and therapeutic methods and potential of cancers [35][36][37]. Thus, this study was performed based on the gene expression profile data and clinical information of colon cancer retrieved from the TCGA and GEO databases. Gene expression profiles have been reported to predict the prognosis outcome of cancers [38][39][40]. Computationally, the Cox regression methods were commonly used to construct the prognostic models and screen prognostic factors [41]. The availability of this model in survival analysis has been confirmed in recent studies [42,43]. Similarly, in this study, the Cox regression model based on the LASSO was applied to screen the optimized gene set with potential prognostic value. The 12-gene prognostic signature constructed by the LASSO Cox regression model showed a higher predictive ability both in the TCGA training data and the two validation sets (GSE17538 and GSE38832; AUC > 0.800).
Besides, this study showed that age, pathological T classification, and tumor recurrence were prognosisassociated factors in patients with colon cancer. Consistent with our results, previous studies have also demonstrated that older age, advanced pathological T, and tumor recurrence are associated with poor prognosis in patients with colon cancer [44][45][46]. Notably, the nomogram analysis in the current study revealed that the   These results further showed that the 12-gene prognostic model had a significant predictive ability for the prognosis of colon cancer. In this study, the prognostic model was constructed based on the signature of 12 prognosis-associated genes, including 12 DEGs, ADORA3, CPA3, CPM, EDN3, FCRL2, MFNG, NAT1, PCSK5, PPARGC1A, PRRX2, TNFRSF17, and WDR78. Specifically, the adenosine receptor A3 (ADORA3) protein encoded by the ADORA3 gene is a G-protein-coupled receptor that functions in inflammatory and immunological responses as well as cancer growth through influencing the nucleotide metabolic process [47][48][49]. There is increasing evidence proving that ADORA3 is overexpressed in several cancers, including breast cancer [50], thyroid cancer [51], bladder cancer [52], and colon cancer [53] and functions as a tumor promoter [54]. Carboxypeptidase A3 (CPA3) is a member of the CPA family of zinc metalloproteases released by mast cells and may be involved in the inactivation of venom-associated peptides and the degradation of endogenous proteins [55]. Previous studies have shown the elevated expression of CPA3 in asthma [56] and anaphylactic shock [57]. However, few studies have investigated the role of CPA3 in cancers. CPM is also an arginine/lysine CP which exerts important roles in angiogenesis, proliferation, and apoptosis through modulating chemokines or kinins in cancer cells [58]. Notably, a recent study reports that CPM/Src-FAK pathway is involved in cell migration and invasion in colon cancer [59]. Endothelin 3 (END3) is reported to participate in the progression of several cancers including malignant melanoma [60], cervical cancer [61], and colon cancer [62]. Fc Receptor Like 2 (FCRL2) is a member of the immunoglobulin receptor superfamily that is involved in the development of lymphoblastic leukemia by immunomodulating B cell function [63][64][65]. Besides, it has been reported that the inherited polymorphism in the acetyltransferase 1 (NAT1) gene increases the risk of colorectal adenocarcinoma [66]. Manic fringe (MFNG) is reported to exhibit antitumor effects in lung cancer [67]. The peroxisome proliferator-activated receptor-γ coactivator 1-α (PPAR GC1A) gene also contributes to tumor growth and metastasis in several cancers [68,69]. In addition, studies have suggested that both the paired related homeobox 2 (PRRX2) gene [70,71] and the tumor necrosis factor receptor superfamily member 17 (TNFRSF17) gene [72,73] are associated with the development of several cancers, while the proprotein convertase subtilisin/kexin type 5 (PCSK5) gene and the WD repeat domain 78 (WDR78) gene have not been reported to be associated with pathogenesis and progression. Thus, the functions of these genes in colon cancer should be further investigated using preclinical and clinical experiments.

Conclusions
In conclusion, the prognostic model based on the signature of the 12 genes (ADORA3, CPA3, CPM, EDN3, FCRL2, MFNG, NAT1, PCSK5, PPARGC1A, PRRX2, TNFRSF17, and WDR78) exhibited a relatively satisfactory and credible predictive power for the prognosis of colon cancer, making it a great potential biomarker. However, the prognostic significance and practicability of the 12-gene prognostic model in colon cancer should be further confirmed in clinical studies.
Additional file 1: Figure S1. Additional file 2: Table S1. The list of the prognosis-associated differentially expressed genes across the three datasets (TCGA, GSE44861, and GSE44076) using the Cox regression analysis.