Expression of EMT-related genes in lymph node metastasis in endometrial cancer: a TCGA-based study

Endometrial cancer (EC) with metastasis in pelvic/para-aortic lymph nodes suggests an unsatisfactory prognosis. Nevertheless, there is still rare literature focusing on the role of epithelial-mesenchymal transition (EMT) in lymph node metastasis (LNM) in EC. Transcriptional data were derived from the TCGA database. Patients with stage IA–IIIC2 EC were included, constituting the LN-positive and LN-negative groups. To evaluate the extent of EMT, an EMT signature composed of 315 genes was adopted. The EMT-related genes (ERGs) were obtained from the dbEMT2 database, and the differentially expressed ERGs (DEERGs) between these two groups were screened. On the basis of DEERGs, pathway analysis was carried out. We eventually adopted the logistic regression model to build an ERG-based gene signature with predictive value for LNM in EC. A total of 498 patients were included, with 75 in the LN-positive group. Median EMT score of tumor tissues from LN-negative group was − 0.369, while that from the LN-positive group was − 0.296 (P < 0.001), which clearly exhibited a more mesenchymal phenotype for LNM cases on the EMT continuum. By comparing expression profiles, 266 genes were identified as DEERGs, in which 184 were upregulated and 82 were downregulated. In pathway analysis, various EMT-related pathways were enriched. DEERGs shared between molecular subtypes were comparatively few. The ROC curve and logistic regression analysis screened 7 genes with the best performance to distinguish between the LN-positive and LN-negative group, i.e., CIRBP, DDR1, F2RL2, HOXA10, PPARGC1A, SEMA3E, and TGFB1. A logistic regression model including the 7-gene-based risk score, age, grade, myometrial invasion, and histological subtype was built, with an AUC of 0.850 and a favorite calibration (P = 0.074). In the validation dataset composed of 83 EC patients, the model exhibited a satisfactory predictive value and was well-calibrated (P = 0.42). The EMT status and expression of ERGs varied in LNM and non-LNM EC tissues, involving multiple EMT-related signaling pathways. Aside from that, the distribution of DEERGs differed among molecular subtypes. An ERG-based gene signature including 7 DEERGs exhibited a desirable predictive value for LNM in EC, which required further validation based upon clinical specimens in the future.


Background
In 2022, it is expected to be an incidence of 84,520 of uterine corpus cancer in China and 17,543 new deaths, where endometrial cancer (EC) accounts for the vast majority [1]. EC with metastasis in pelvic/para-aortic lymph nodes (LNs), even with micrometastasis, is associated with at least the International Federation of Gynecology and Obstetrics (FIGO) stage IIIC1/2 and suggests a less satisfactory prognosis [2,3]. LN metastasis (LNM) of cancer is a complex process including lymphangiogenesis and activation of epithelial-mesenchymal transition (EMT) to initiate metastasis and survival in the LN [4][5][6][7][8][9][10][11][12]. A substantial proportion of studies on EMT in EC have been published [13][14][15][16][17], but literature fixing attention on the role of EMT in LNM remains rare to be reported [18]. As persuasively demonstrated by relevant studies, for primary prevention, adherence to cancer prevention guidelines was negatively associated with EC risk [19,20]. Nonetheless, for tertiary prevention of EC, the prediction and treatment of LNM remain unsatisfactory. The mechanism-level description of this process will be advantageous for us to facilitate the management of the lymphatic spread of EC.
For this purpose, we presented a retrospective study on the basis of The Cancer Genome Atlas (TCGA) to discuss the role of EMT in LNM, Apart from that, this research not only probed deep into differentially expressed EMTrelated genes (DEERGs) and pathways associated with them, but also ultimately provided an ERG-based gene signature with predictive value for LNM in EC.

Materials and methods
We extracted mRNA expression profiles from the Uterine Corpus Endometrial Carcinoma (TCGA, PanCancer Atlas) database through cBioPortal [21,22]. Especially, tumor grade and microsatellite status were obtained from Xena [23]. Cases with a non-endometrioid histology were categorized into G3. Patients with FIGO 2009 stage IA-IIIC2 EC were included in this research, and stage IIIC1/2 were considered as stage IA-IIIB with pelvic/para-aortic LNM, constituting two groups (LN-positive and LN-negative). There were cases annotated by FIGO 1988 staging system, and patients with FIGO 1988 stage IIIA were excluded from this study, since peritoneal cytology was contained in this stage, which was absent in FIGO 2009 staging system.
To quantitively locate a tumor tissue on the EMT continuum, a validated transcriptome-based scoring system composed of 315 genes (epithelial: 145, mesenchymal: 170) was adopted. An EMT score was calculated by the maximum vertical distance between the empirical distribution function (ECDF) curves of the mesenchymal and the epithelial gene set by a two-sample Kolmogorov-Smirnov test (i.e., the KS method). The EMT score of a tumor with a higher expression of mesenchymal signatures would be defined as positive, and negative for epithelial ones. Hence, an EMT score in the KS method had a range of [− 1, 1] [24,25] (Supplementary Fig. 1).
With regard to differential expression analysis, 1027 ERGs were obtained from an EMT gene database, "dbEMT2" [26], and DEERGs between the LN-positive and LN-negative groups were screened by the 'limma' package [27]. A false discovery rate (FDR) < 0.05 and |logFC|> 1 were defined as the significance threshold. On the basis of DEERGs, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were carried out. The STRING database and Cytoscape were adopted to construct a protein-protein interaction (PPI) network to visualize the relationships among DEERGs, where 0.4 was defined as the minimum required interaction score, and genes with node degree > 15 were considered as hub genes [28][29][30].
For the sake of effectively screening candidate genes to construct a risk score for LNM, we adopted the "pROC" package to evaluate their predictive value [31]. An area under the curve (AUC) > 0.7 was considered acceptable for further evaluation. The logistic regression model was built to evaluate the predictive value of the genes screened, on the basis of which a nomogram was plotted. According to the probability predicted by the model, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were reported. The consistency between the observed incidence rate and the predicted probability of the nomogram was evaluated by adopting a calibration plot with 2000 bootstrap replications. Meanwhile, the Hosmer and Lemeshow tests were used. We also introduced an EC database composed of 95 patients from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) for external validation [32].
All the analyses were conducted by means of R software (version 4.1.1) [33]. Apart from the statistical methods above, the Wilcoxon test and chi-square test were adopted to compare continuous and categorical variables, respectively. A P < 0.05 was considered statistically significant.

Results
A total of 498 patients diagnosed between 1995 and 2013 were included in this study (Table 1), with 15.1% of patients exhibiting LNM. Tumor grade, histological subtype, myometrial invasion (MI), and molecular subtype were significantly associated with LNM (P < 0.001).

EMT scores stratified by LN status
A total of 314 ERGs (144 epithelial and 170 mesenchymal) were included to evaluate the EMT status, with a pseudogene OR7E14P excluded on account of its absence in the TCGA database.
Coinciding with the epithelial origin of EC, general expressions of ERGs revealed a more epithelial phenotype , with a P < 0.001 by Wilcoxon test, which presented a more mesenchymal phenotype for tumors with LNM on the EMT continuum ( Fig. 1). Further subgroup analysis revealed that this phenotype difference was more significant in G3, MI ≥ 1/2, and POLE cases ( Supplementary  Fig. 2).

Differentially expressed genes and pathway analysis
By comparing mRNA expression profiles of tumor tissues from 423 LN-negative cases and 75 LN-positive cases, 266 genes were identified as DEERGs, in which 184 were upregulated and 82 were downregulated. Expressions of ERGs were exhibited in Supplementary Fig. 3 and Supplementary Table 1.
The PPI network of DEERGs based upon the STRING database was presented in Supplementary Fig. 5. The top 5 hub genes calculated by Edge Percolated Component (EPC) method were MYC, ESR1, KRAS, MAPK3, and HDAC1. We also summarized hub genes by each method in Supplementary Table 4.
As is exhibited in Table 1, molecular subtypes of 96.8% of patients were available, and CNH cases were more likely to develop LNM than the others (χ 2 = 30.5,   Table 5). A Venn graph displaying DEERGs shared between subtypes was depicted in Fig. 4, where DEERGs differed tremendously between subtypes. No DEERGs were shared between all subtypes, and only HOXB7, a gene encoding an activator of the TGF-β pathway, was shared by CNL, MSI-H, and POLE.

Identification of a gene signature effective to predict LNM
On the basis of receiver operating characteristic (ROC) curves adopted to discriminate between LN-positive and LN-negative groups, no DEERG exhibited an AUC > 0.7 (Supplementary Table 6). Consequently, a logistics regression model was constructed to screen candidate genes by including DEERG expression, age, race, grade, MI, and histological subtype, ROC curves from which were employed to screen target genes. As a result, the median AUC was 0.802 (IQR 0.801-0.805), which displays a comparatively high predictive value of the new models. To refine the candidate genes, genes from models with an AUC > 0.7 and P < 0.05 were selected for further analysis. Subsequently, 29 genes were selected (Supplementary Table 7), which were included in a further logistic regression model, and ultimately 7 genes with the best performance were selected: CIRBP, DDR1, F2RL2, HOXA10, PPARGC1A, SEMA3E, TGFB1 (Supplementary Table 8). An EMT-based risk score (ERS) was calculated as the logit from the logistic regression as follows: 1.845 Then we constructed a new logistic model to predict LNM with ERS, race, age, grade, MI, and histological subtype. A further stepwise algorithm filtered out race, eventually reaching an AUC of 0.850 (Fig. 5A). The cut-off value of the probability predicted by the logistic model is 0.176, with a sensitivity of 0.773, a specificity of 0.797, a PPV of 0.403, and an NPV of 0.952. A nomogram based upon the above model was presented in Fig. 6, with ERS having the highest weight in the model compared with traditional clinicopathological parameters. A Hosmer-Lemeshow test exhibited a desirable calibration for this model (χ 2 = 14.3, P = 0.074), with a calibration curve in Supplementary Fig. 6.
In an effort to validate the gene signature externally, we extracted mRNA expression profile of EC tissue from the CPTAC program, from which 83 cases identified as stage IA-IIIC2 were included. A similar logistic regression model was built (Supplementary Table 8), with an AUC of 0.908 (Fig. 5B) and an ideal calibration (χ 2 = 3.9, P = 0.42). Aside from that, proteome from CPTAC exhibited a significantly positive correlation between proteins and mRNA in members of the 7-gene signature (Supplementary Fig. 7). Nevertheless, owing to the absence of F2RL2 and PPARGC1A proteins, the value of a proteinbased ERS was not assessed in this study.

Discussion
LNM had an incidence of 18.1% in EC, which was quite difficult to predict in clinical practice [3]. In our institution, EC patients with MI ≤ 1/2, G1/2, and endometrioid histology, with no further evidence of invasion/metastasis, used to undergo a surgery without pelvic and paraaortic lymphadenectomy, all the others would receive lymphadenectomy, with sentinel lymphadenectomy (SLND) optional for patients of relatively low risk. Nevertheless, non-therapeutic systemic lymphadenectomy still existed in most cases (71.4%, unpublished data), which gives rise to an augment in the complication morbidity, e.g., blood loss, lymphatic cyst, and lower limb edema [34]. Meanwhile, variables employed to determine lymphadenectomy were mostly those significantly correlated with LNM. The molecular subtype has been proven to be significantly correlated with LNM [3], but has not been extensively adopted in the decision of lymphadenectomy yet. At the mechanism level, LNM is a complex biological process where EMT plays a central role in the invasion of Fig. 4 DEERGs shared between molecular subtypes were comparatively few. DEERG: differentially expressed EMT-related genes lymphatic vessels and survival in LNs [35,36], literature concentrating on the role of EMT in LNM in EC remains few [18]. On that account, our team designed this first TCGA-based study combining transcriptomic data with conventional parameters to explore EMT-related molecular characteristics of LNM in EC.
We applied a well-validated EMT quantification method to describe a tumor sample on the EMT continuum [25]. Though unable to discriminate hybrid E/M samples from a mixture of E&M ones, this model strongly suggested samples from LNM cases tended to have a more mesenchymal phenotype, consistent with the association between EMT and LNM in the literature, which exhibited the feasibility of further exploration of this process in EC. Intriguingly, this difference in EMT score was more significant in G3 and MI ≥ 1/2 cases, which indicates an association between invasive phenotypes and the role of EMT in LNM. In all the molecular  Fig. 2). This indicated a possible variation between molecular subtypes regarding the extent of dependence on EMT during LNM. There is a blank of study on EMT in different molecular subtypes, which seems promising when taking into account the abundant pathways and molecules in the EMT process [37]. Among the 1027 ERGs included in this study, over 1/4 showed a differential expression between LNM and non-LNM cases, with multiple pathways being enriched, in which the TGF-β signaling pathway has been proven a potent inducer of EMT in EC [38]. Nonetheless, no published literature discussed the role of the TGF-β pathway for LNM in EC. In the GO analysis based upon DEERGs, various pathways bound up with TGF-β were enriched in the BP group. A similar result was reached by KEGG analysis, with other EMT-related pathways significantly enriched as well, members of which were mostly upregulated (Fig. 3). Hub genes suggested by PPI analysis, e.g., Myc, ESR1, and KRAS, were reported to participate in EMT pathways, i.e., Wnt, TGF-β, and MAPK signaling pathways [39][40][41][42][43][44], and they also showed most interactions with other DEERGs. These hub genes might be associated with the core mechanism and containment strategy of LNM in EC in need of further exploration.
Data from this study showed different LNM rates between molecular subtypes, especially for a significantly higher rate in CNH cases, consistent with the report by Jamieson et al. [3]. DEERGs shared between subtypes were comparatively few, which exhibits possibly different landscapes of LNM between molecular subtypes in EC, as stated above. Detailed pathway analyses were omitted as a result of a small sample size in the subgroup analysis, and a prospective study on EMT in LNM and its correlation with molecular subtype in EC is being carried out in our institution (NSFC81972426).
As mentioned earlier, an accurate method of calculating the LNM probability will contribute to a precise strategy for lymphadenectomy and adjuvant therapy. Several predictive models with desirable performance have been created, predominantly based upon clinicopathological parameters [45,46]. But before the definitive surgery, molecular features of EC can be acquired through hysteroscopy samples as well [47], providing a chance to calculate the risk of LNM more comprehensively. With the logistic regression model, we ultimately reached a gene signature composed of 7 DEERGs, the new model based upon which exhibited a satisfactory discrimination and calibration. It is also pivotal to note that in accordance with the nomogram, ERS accounted for the majority of LNM risk, which demonstrated that molecular features from the primary tumor did provide significant information of LNM, which is independent from traditional clinicopathological parameters. This partly explained the difficulty to predict LNM in clinical practice, for traditional models (e.g., Mayo criteria) lacked paramount molecular features. In the TCGAbased logistic model, a PPV of 0.403 and an NPV of 0.952 were reached, which seems acceptable owing to the comparatively uncommon and unpredictable nature of LNM in EC. In terms of screening candidates for abandoning lymphadenectomy, a model with an NPV high enough is more crucial for medical safety [48,49]. What is more, it is worth mentioning that SLN detection has been proven to accurately detect LNM in EC [50,51], effectively lessen complications associated with systemic lymphadenectomy. For the time being, the number of research literature on molecular characteristics guiding SLN detection applications is still limited (predominantly molecular subtypes) [52], which evidently exhibits a promising prospect for molecular feature-guided lymphadenectomy.
The value of the 7-gene signature was further validated in a new EC database from CPTAC, where most members of the signature exhibited a positive correlation between proteins and mRNA, which conspicuously illustrates that to transform this transcriptome-based gene signature into an immunohistochemistry-based tissue microarray (TMA) from hysteroscopy biopsies is promising in clinical practice. In accordance with the latest report of Jamieson et al. on molecular subtype and LNM in EC, the molecular subtype is a potential predictor of LNM [3]. And the 7 genes proposed in this study are not covered by the commonly used Trans-PORTEC and ProMisE algorithms, i.e., the 7-gene signature might provide independent predictive value preoperatively on the basis of molecular subtype. Moreover, different molecular subtypes of EC had different DEERGs in LNM (Fig. 4), which suggested the existence of different EMT pathways and variant dependence on EMT during the lymphatic invasion in EC. When EMT inhibitors are applied into clinical research to control metastasis in the future (as in anti-PD-1/PD-L1) [53], this study may contribute to precise medication in line with molecular features in EC.
There are several limitations in our research. First, owing to the heterogeneity of transcriptome from different databases, our study did not establish a universally applicable clinical prediction model. But the ERG-based gene signature did exhibit a predictive value independent of clinicopathological parameters across databases, supporting future clinical applications. Aside from that, this study is simply limited to bioinformatics analysis, not verified by specimens from our institution yet. Furthermore, the above limitations can be remedied through developing a TMA with a unified standard. Meanwhile, there is a huge gap for laboratory research to fill on the mechanisms of EMT to induce lymphatic spread in EC.

Conclusions
To sum up, our study suggested the EMT status and expression of ERGs varied in LNM and non-LNM EC tissues, involving multiple EMT-related signaling pathways, and the distribution of DEERGs differed among molecular subtypes. Furthermore, an ERG-based gene signature including 7 DEERGs exhibited a desirable predictive value for LNM in EC, which requires further validation on the basis of clinical specimens in the future.