A novel prognostic model based on epithelial-mesenchymal transition-related genes predicts patient survival in gastric cancer

Background Gastric cancer (GC) represents a major malignancy and is the third deathliest cancer globally. Several lines of evidence indicate that the epithelial-mesenchymal transition (EMT) has a critical function in the development of gastric cancer. Although plentiful molecular biomarkers have been identified, a precise risk model is still necessary to help doctors determine patient prognosis in GC. Methods Gene expression data and clinical information for GC were acquired from The Cancer Genome Atlas (TCGA) database and 200 EMT-related genes (ERGs) from the Molecular Signatures Database (MSigDB). Then, ERGs correlated with patient prognosis in GC were assessed by univariable and multivariable Cox regression analyses. Next, a risk score formula was established for evaluating patient outcome in GC and validated by survival and ROC curves. In addition, Kaplan-Meier curves were generated to assess the associations of the clinicopathological data with prognosis. And a cohort from the Gene Expression Omnibus (GEO) database was used for validation. Results Six EMT-related genes, including CDH6, COL5A2, ITGAV, MATN3, PLOD2, and POSTN, were identified. Based on the risk model, GC patients were assigned to the high- and low-risk groups. The results revealed that the model had good performance in predicting patient prognosis in GC. Conclusions We constructed a prognosis risk model for GC. Then, we verified the performance of the model, which may help doctors predict patient prognosis.

Epithelial-mesenchymal transition (EMT) represents an important molecular program in diverse cellular events such as wound healing, tissue fibrosis, and cancer progression [5]. Studies have demonstrated that EMT is closely related to tumor heterogeneity, metastasis, and therapeutic resistance in cancer cells [6]. EMT can promote cancer cell invasion into the stroma and is correlated with cancer stem cell traits and increased tumorigenicity [7]. Meanwhile, the EMT/inflammation axis contributes to the metastatic stage and poor outcome of multiple cancer entities and confers chemoresistance to primary tumors [8]. Multiple lines of evidence suggest that many genes, such as TSP50, CD44, and FOXA1, regulate EMT in GC [9][10][11]. Because EMT is a vital process in GC progression, studying the role of related genes in GC is valuable.
In this study, based on TCGA and MSigDB data, differentially expressed genes associated with EMT were selected. Then, univariable and multivariable Cox regression analyses were carried out for identifying overall survival (OS)-related genes. Next, a risk model was constructed to predict and evaluate patient prognosis in GC. We finally used additional bioinformatic and statistical methods to validate this model.

Data acquisition and processing
The original gene expression data of the TCGA gastric dataset were acquired from the TCGA database (https:// portal.gdc.cancer.gov/repository). The RNA sequencing data of GC patients were retrieved. Clinical features, including survival status, survival time, age, gender, grade, stage, T stage, N stage, and M stage, were adopted in the current study. The "HALLMARK_EPITHELIAL_MES-ENCHYMAL_TRANSITION" gene list containing 200 genes was obtained from MSigDB v7.2(http://www.gseamsigdb.org/gsea/msigdb/index.jsp). To identify differentially expressed epithelial-mesenchymal transitionrelated genes (ERGs), we utilized the "limma" package in the R software 4.0.2 to analyze mRNA sequencing data for these 200 ERGs, with significance threshold set at p< 0.05.

Identification of EMT-related gene signature
After merging clinical information and the expression data of ERGs, univariable Cox regression analysis was performed for screening genes related to OS with p< 0.05. Then, multivariable Cox proportional regression analysis was performed for the screened genes to identify those independently related to OS.

Construction of prognosis model
A risk score formula was generated for prognosis based on multivariable Cox regression analysis. Cases were classified into two categories, including the high-and low-risk groups, according to the median patient risk score. Then, survival curves, areas under the receiver operating characteristic (ROC) curves (AUCs) for 1-year, 3-year, and 5-year overall survival were utilized to determine the prediction accuracy of the constructed model.

Prognostic value validation of EMT-related genes
In order to assess whether the above risk score independently predicts GC prognosis, univariable and multivariable Cox regression analyses were carried out. The expression levels of the six detected ERGs in normal and tumor tissues were examined, and Kaplan-Meier survival analysis based on different clinicopathological parameters was conducted. We also chose a cohort (GSE62254) from the GEO database (https://www.ncbi.nlm.nih.gov/ geo/) for validating the prognostic value of the established risk model.

Assessment of immunocyte infiltration
The Tumor IMmune Estimation Resource (TIMER) database (https://cistrome.shinyapps.io/timer/) represents a comprehensive platform that can analyze immune infiltration systematically in multiple cancers [12]. By exploring the abundance of B cells, CD8+ T cells, CD4+ T cells, macrophages, neutrophils, and dendritic cells in gastric cancer, we could evaluate the possible associations of the hub genes in GC with the six immune cell types.

Statistical analysis
All of the data analyses were performed using the R software and Perl languages. P < 0.05 indicated statistical significance. Cox regression analyses were conducted to screen survival-associated ERGs. The survival curves were plotted by the Kaplan-Meier method to show differences in OS between high/low-risk patients and the log-rank test was utilized to assess the significance of the differences. And the ROC curves and the corresponding AUC values were used to evaluate the predictive ability of the model by using the "survival ROC" package.

Results
Identification of differentially-expressed epithelialmesenchymal transition-related genes Gene expression data for GC patients and their clinicopathological information were obtained from the TCGA database. Subsequently, 200 ERGs from the "HALL-MARK_EPITHELIAL_MESENCHYMAL_TRANSI-TION" gene set in the MSigDB website were obtained. This database supplies a list of 200 genes that have been shown to be correlated with EMT. Next, we analyzed the retrieved gene expression data and screened for differentially expressed ERGs by using the package "limma" in R (P< 0.05). After screening, we selected 371 gastric cancer patients for analysis. The clinical characteristics of GC patients are shown in Table 1.

Identification of survival associated ERGs in GC
To assess potential associations of ERGs with OS in GC, univariable Cox regression analysis was carried out to identify ERGs that were significantly correlated with survival. A total of 20 ERGs were identified (P < 0.05) ( Table 2). Then, multivariable Cox analysis was carried out to determine whether the associations of ERGs with OS were independent. The most significant ERGs were included in the model, with the corresponding coefficients.

Construction of the ERGs-related prognosis risk model
Based on gene expression data and regression coefficients, a prognostic model was ultimately established.
The prognostic risk score was calculated according to the following formula: risk score = 0.2159 × CDH6 expression − 0.0116 × COL5A2 expression + 0.0208 × ITGAV expression + 0.0488 × MATN3 expression + 0.0379 × PLOD2 expression + 0.0052 × POSTN expression. According to individual risk scores, all patients were categorized into the low-and high-risk groups as described above (Fig. 1a). In survival analysis, individuals with a low-risk score had a higher survival rate compared with patients with a high-risk score (p < 0.001) (Fig. 1b). This suggested that the model was accurate for predicting the prognosis of low-and high-risk gastric cancer patients. Then, GC patients were ranked by risk score to assess the associations of survival status with survival time (Fig. 1c). The scatterplot depicts the   survival statuses of patients with different risk scores, and the mortality rate of patients increased with the risk score. Then, the 1-, 3-, and 5-year ROC curves were generated to verify the ability of the risk model to predict prognostic; the obtained AUCs were 0.662, 0.719, and 0.736, respectively, indicating an acceptable predictive performance (Fig. 2).
Age and risk score are independent prognostic predictors of GC Univariable and multivariable analyses were performed to examine the prognostic strengths of the risk score as well as select clinicopathological parameters. Univariable Cox analysis revealed that age, stage (T or N), and risk score were OS-associated indicators (Fig. 3a). Subsequently, in multivariable Cox analysis, age and risk score were independent predictors of prognosis (p < 0.05) (Fig. 3b).

Expression levels and validation of six ERGs in GC
The expression levels of the six genes in GC and normal tissue specimens were next compared. These six ERGs all showed upregulation in GC specimens versus normal gastric tissue samples (P < 0.05) (Fig. 4).
In addition, clinicopathological parameters were collected from the TCGA database. Kaplan-Meier curve analysis revealed that age, stage, T stage, N stage, and M stage had significant associations with GC patient survival (Fig. 5). Then, we applied the new model to 300 GC samples from GSE62254 to validate our findings. The patients were also divided into two categories by median risk score, and the KM survival curve showed significant results (P < 0.001) (Fig. 6). And the AUC values of the ROC curves (0.673 at 1 year, 0.693 at 3 years, 0.677 at 5 years) indicated that this risk model had a satisfactory prognostic value (Fig. 7). Furthermore, the TIMER website was utilized to evaluate the associations of the six ERGs with tumor purity and immune cell infiltration in gastric cancer (Fig. 8). The results showed that MATN3 and PLOD2 were not associated with tumor purity, whereas all six genes were significantly associated with CD4+ T cell, macrophage, and dendritic cell levels.

Discussion
Gastric cancer remains a globally important pathology, with elevated mortality. GC represents a major social burden and peritoneal metastasis is common in advanced GC cases [13]. Postoperative prognosis varies according to the types of GC, and thus, improvements in GC's treatment are urgently required [14]. In EMT, epithelial cells forfeit their apical-basal polarity and cell-cell adhesion features, becoming invasive cells with mesenchymal characteristics [15]. The relationship between gastric cancer and EMT has been widely reported, with many studies assessing the related genes. For instance, REC8 inhibits EMT in GC cells by controlling EGR1 expression and binding to it [16]. In addition, tumorassociated neutrophils contribute to GC cell migration and invasion through IL-17a-induced EMT [17]. Alpha B-crystallin (CRYAB) also promotes GC cell migration and invasion through EMT under the control of NF-κB signaling [18]. Although EMT-related genes have been assessed for their functions and mechanisms, the link between ERGs and prognosis in GC remains unclear. To date, increasing amounts of attention focus on the construction of cancer models. The prognosis of cancer can be predicted by screening out genes related to biological processes, such as the effect of glycolysis-related genes on the prognosis of hepatocellular carcinoma and the association of autophagy-related genes with the survival of lung cancer [19,20]. However, there is not much discussion about the relationship between EMT-related genes and gastric cancer. Hence, considering the importance of EMT in gastric cancer, it is reasonable to speculate that EMT-associated genes could be applied in GC prognosis. Here, expression data for 371 GC samples and the respective clinical data were retrieved from the TCGA database, as well as a list of 200 ERGs from the MSigDB. Univariable and multivariable Cox regression analyses were performed to obtain six genes associated with survival in GC. According to the established formula, risk scores for individual patients were calculated, based on which cases were significantly stratified. Survival analysis and ROC curves were applied to assess the predictive value of this risk model. As shown above, the risk score and age were determined as independent OSassociated factors. Furthermore, the expression levels of the six genes were higher in tumor tissue samples compared with noncancerous specimens. We also found that the developed prognostic model was correlated with clinicopathological data. And the analysis of the samples in GSE62254 confirmed the predictive ability of the model. The associations of genes with immune cell infiltration were also discussed. We found that the six ERGs are closely related to immune cells, and they are all positively correlated with the degree of CD4+ T cell, macrophage, and Dendritic Cell infiltration in GC. As for B cells, the expression of CDH6 and ITGAV was positively correlated with it, but the expression of COL5A2 and POSTN was negatively correlated with it. Furthermore, the expression of ITGAV and POSTN was positively associated with the degree of CD8+ T cell and neutrophil infiltration, and there was a positive correlation between COL5A2 expression and neutrophil. These findings jointly suggest that the six EMT-related genes play critical roles in GC prognosis, and the new model has a high prognostic predictive value.
Previous studies have reported the roles of these genes in GC. COL5A2 is related to prognosis in patients with GC [21]. In bladder cancer, cases with low COL5A2 amounts show improved clinicopathological phenotypes [22]. COL5A2 is also involved in the carcinogenesis of colorectal cancer [23]. ITGAV, belonging to the integrin family of extracellular matrix receptors, is implicated in multiple cancers. Downregulation of ITGAV can cause the inhibition of GC cell proliferation, migration, and invasion, indicating a critical role for ITGAV in GC progression [24]. MATN3, a protein-coding gene, might affect GC occurrence and development, acting as an oncogene [25]. In addition, MATN3 is differentially expressed and aberrantly methylated in GC [26]. PLOD2 is an important regulator of peritoneal dissemination in GC patients, leading to poor prognosis [27]. POSTN affects cancer cell proliferation via ERK and may promote EMT [28]. POSTN overexpression is a known risk factor for GC, inducing metastasis and aggravating malignant behavior [29]. CDH6, a class II Cadherin, induces EMT during embryonic development and shows abnormal reactivation in malignant tumors [30]. Different from other mesenchymal biomarkers that are broadly found in cancer, CDH6 may preferentially participate in the late stage of EMT, with direct association with cell invasiveness [31]. However, CDH6's function in GC remains undefined. Therefore, further research for discovering the function of CDH6 is urgently required. Besides, there are other studies on gastric cancer. Follistatin-like 1 (FSTL1) is a kind of glycoprotein, its high expression is associated with poor prognosis and has a positive correlation with immune infiltration in GC patients [32]. CTLA-4 polymorphisms have a tight relationship with digestive system malignancies including gastric cancer, it related to predisposition to GC [33]. RNAs are also involved in the development of gastric cancer. For example, miR-489 overexpression can inhibit the survival, invasion, and migration of gastric cancer cells, and circulating lncRNAs play an important role in the early diagnosis of GC [34,35]. Furthermore, the clinical results of gastric cancer can be predicted by constructing nomogram [36]. So far, most studies have assessed the relationship between a single gene and prognosis in GC. Here, we constructed a dependable prognostic risk model for GC based on sequencing data for ERGs and clinicopathological features from the TCGA database. This might provide a reliable method for predicting patient survival in GC. However, some limitations are worth noting. First, we established the risk model by using the TCGA database, and further validation of this model should be carried out in vitro, in vivo, and in clinical trials. Secondly, the immune statuses of these six genes deserve further investigation.

Conclusions
This study has developed a six-gene prognostic risk model based on EMT, which could provide a reference for assessing whether gastric cancer patients are at high risk. This risk model could help better manage patients and predict prognosis. These findings may provide novel insights into the relationship between GC and EMT.