Comprehensive analysis of the cancer driver genes in breast cancer demonstrates their roles in cancer prognosis and tumor microenvironment

Background Breast cancer is the most common malignancy in women. Cancer driver gene-mediated alterations in the tumor microenvironment are critical factors affecting the biological behavior of breast cancer. The purpose of this study was to identify the expression characteristics and prognostic value of cancer driver genes in breast cancer. Methods The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) datasets are used as the training and test sets. Classified according to cancer and paracancerous tissues, we identified differentially expressed cancer driver genes. We further screened prognosis-associated genes, and candidate genes were submitted for the construction of a risk signature. Functional enrichment analysis and transcriptional regulatory networks were performed to search for possible mechanisms by which cancer driver genes affect breast cancer prognosis. Results We identified more than 200 differentially expressed driver genes and 27 prognosis-related genes. High-risk group patients had a lower survival rate compared to the low-risk group (P<0.05), and risk signature showed high specificity and sensitivity in predicting the patient prognosis (AUC 0.790). Multivariate regression analysis suggested that risk scores can independently predict patient prognosis. Further, we found differences in PD-1 expression, immune score, and stromal score among different risk groups. Conclusion Our study confirms the critical prognosis role of cancer driver genes in breast cancer. The cancer driver gene risk signature may provide a novel biomarker for clinical treatment strategy and survival prediction of breast cancer.


Background
In the past few years, the incidence of breast cancer has been on the rise. Among all malignant tumors in women, breast cancer ranks first with an incidence rate of 30% [1]. Women with breast cancer have the secondhighest mortality rate, and the prognosis of patients with different molecular types varies significantly [2][3][4]. For example, the choice of drugs varies among patients with different hormone receptor status, and patients with triple-negative breast cancer usually have a poorer prognosis [5][6][7]. Besides, traditional tumor staging sometimes cannot objectively represent the tumor status of patients. Patients with the same tumor stage sometimes have significant survival differences [8]. The heterogeneity of breast cancer poses a challenge for clinical treatment strategy choice and prognosis prediction, while studies of different molecular typing are expected to elucidate the differences in heterogeneity and better guide clinical treatment of breast cancer.
The journal of Nature Reviews Cancer recently reported on the important role played by cancer driver genes in the progression of tumor malignancy [9]. 568 cancer driver genes were identified through a large-scale transcriptome analysis, which may mediate complex molecular regulatory networks and changes in the tumor microenvironment. Aberrantly expressed cancer driver genes may result in multiple processes such as uncontrolled tumor cell proliferation, invasion, recurrence, and drug resistance [10,11]. Zhang's study reported the frequency of mutations in the whole genes of breast cancer [12]. The most frequently mutated genes were TP53 (45%), followed by PIK3CA (44%), GATA3 (18%), and MAP3K1 (10%), all of which also were identified as cancer driver genes. Kruse's study [13] identified metastasis driver genes in breast cancer by massive parallel sequencing, and the cancer driver genes DCC and CREBBP were also identified as key metastatic genes.
Tumorigenesis is often associated with alterations in the stromal environment and immune status, mainly manifested as changes in the tumor microenvironment (TME) [14,15]. TME plays a key role in several steps of tumor development, including local drug resistance, immune escape, recurrence, and distant metastasis [16,17]. Tumor cell exposure to TME also contributes to shaping the tissuespecificity of driver genes [18]. Antibodies or inhibitors that target driver gene-mediated signaling pathways may effectively inhibit tumor growth and prolong patient survival [19]. KRAS mutations may affect the TME and patient response to immunotherapy [20]. Cancer driver genes and TME determine the type of liver cancer and can be considered as predictors of patient survival outcomes [21,22]. All these studies provide a reference for clinical decisions. However, there are fewer studies on cancer driver genes affecting the TME and prognosis of breast cancer.
In this study, we systematically analyzed the expression characteristics of 568 cancer driver genes in breast cancer. We screened for risk driver molecules that affect breast cancer prognosis, and the relationship between risk groups with the tumor microenvironment. Our study is expected to find novel molecular subtypes of breast cancer for better guiding clinical treatment strategies.

Data sources
Breast cancer transcriptome data and corresponding clinical data were downloaded from The Cancer Genome Atlas (TCGA) (https://portal.gdc.cancer.gov/) database. As a training set, TCGA transcriptome data contained a total of 112 cases of paracancerous tissues and 1096 cases of cancerous tissues, and the clinical information of patients is shown in Table 1.
The Gene Expression Omnibus (GEO) (https://www. ncbi.nlm.nih.gov/geo/) database serves as an external validation of the TCGA dataset and contains two subsets: GSE7390 and GSE42568. GSE42568 dataset contained 17  transcriptome data from different platforms were normalized using the "limma" and "sva" R packages.

Expression of cancer driver genes in breast cancer tissues
The list of 568 cancer driver genes was obtained from the Integrative OncoGenomics platform (https://www. intogen.org/search). The differentially expressed genes (DEGs) were analyzed according to the classification of cancerous and paracancerous tissues by using the "limma" R package (P<0.05) in GSE42568 and TCG datasets. Venn diagrams were drawn to summarize differentially expressed genes both in TCGA and GEO datasets. These DEGs are considered to be associated with tumor progression and are used for further prognostic molecular screening.

Construction and validation of the risk signature
We used the univariate Cox and Kaplan-Meier method to screen for cancer driver genes associated with overall survival (OS) (P<0.05) in breast cancer patients. Candidate genes that were both differentially expressed and associated with OS were substituted into least absolute shrinkage and selection operator (Lasso) Cox regression and stepwise multivariate Cox proportional regression to construct the risk signature. According to the risk driver genes expression of the prognostic signature, R package Stats was used for Principal Component Analysis (PCA). PCA could confirm the clustering ability of the cancer driver genes risk signature.
We used several methods to analyze the clinical value of the risk signature. Survival curves and receiver operating characteristic (ROC) curves were used to verify the prognostic value of different risk groups. Univariate and multifactorial regression analyses were performed to explore the independent prognostic role of risk scores.
Besides, we further analyzed whether age, tumor stage, and other clinicopathology were associated with risk scores.

Construction of transcription factor regulatory networks and functional enrichment analysis
To clarify the possible mechanisms by which cancer driver genes affect the progression and prognosis of breast cancer, we constructed a transcription factor regulatory network and gene set enrichment analysis (GSEA). The cancer-associated transcription factors were obtained from the Cistrome platform (http:// cistrome.org/). We screened for cancer driver genes associated with transcription factors (|R 2 | > 0.3 and P < 0.05) and constructed a regulatory network using Cytoscape software (https://cytoscape.org/). Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed with the "enrichplot" R package, which suggests possible functional and pathway sets of cancer driver genes affecting breast cancer prognosis.

Correlation analysis of tumor microenvironment
The immune and TME scores were calculated for each sample using single-sample GSEA (ssGSEA) algorithm by the "gsva" R package. We first analyzed the expression levels of PD-1, stromal score, immune score, and estimate score in different risk groups. We likewise analyzed the correlation of risk groups with 16 immune cell scores and 13 immune-related pathway scores. These results contribute to confirm the interaction of tumor microenvironment with cancer driver genes.

Statistical analysis
All data processing was performed on R v3.4.1 (https:// www.r-project.org/). Differences in gene expression between cancerous and paracancerous tissues were analyzed by the Wilcoxon method. The correlation between transcription factors and cancer driver gene expression was performed using the Pearson's correlation coefficient method. Kaplan-Meier method verifies the impact of the candidate driver genes on patient OS. Mann-Whitney U test was used to compare immune or TME scores between the high-risk and low-risk groups. Univariate and multifactorial regression were used to analyze the prognostic value of risk scores.

Results
Differentially expressed cancer driver genes in TCGA and GEO datasets In the TCGA cohort, the expression of 194 cancer driver genes was higher in cancer tissues than in paracancerous tissues (P<0.05), and 257 cancer driver genes were lower in cancer tissues than in paracancerous tissues (P<0.05)  1A). The results of the GEO cohort were consistent with the TCGA cohort, with 210 genes highly expressed and 139 genes lowly expressed in cancer tissues (P<0.05) (Fig. 1B, C). Most cancer driver genes are specifically expressed in cancerous tissues, which suggest that driver genes may be involved in multiple biological processes in breast cancer.
Construction of a thirteen-mRNA signature for predicting patient prognosis Through univariate Cox and Kaplan-Meier analysis (P< 0.05), we screened 27 cancer driver genes associated with OS in breast cancer patients ( Fig. 2A). Substituting these candidate molecules in lasso Cox regression and stepwise multivariate Cox proportional regression, we finally constructed a thirteen-mRNA risk signature (Fig. 2B, C). The coefficients for each risk genes are shown in Table 3. . According to the median values of the risk scores in the TCGA cohort, all patients were divided into high-risk and low-risk groups. PCA Fig. 1 The differentially expressed genes in TCGA cohort and GEO cohort. A The expression of cancer driver genes between tumor tissues and paracancerous tissues in TCGA cohort. B The expression of cancer driver genes between tumor tissues and paracancerous tissues in GEO cohort. C Venn diagrams for differentially expressed genes validates that the different risk groups showed a two-way distribution, indicating the high specificity of the risk signature (Fig. 2D, E).
Survival curves were used to analyze the relationship between risk groups and OS of breast cancer patients. In the training set, the overall survival rate in the high-risk group was lower than in the low-risk group (P<0.05) (Fig. 3A). The area under curve (AUC) of the risk score (AUC 0.790) was higher than other clinical features, suggesting the high precision of the risk signature for predicting patient survival (Fig. 3B). The results of the test set were consistent with the training set. The survival of patients in different risk groups differed significantly (P< 0.05), and the risk score could predict the survival of breast cancer patients well (AUC 0.641) (Fig. 3C, D). Univariate and multifactorial regression analyses suggested that risk score could be an independent risk factor affecting the prognosis of breast cancer patients (P< 0.05) (Fig. 3E, F).
We analyzed the level of risk scores in different clinical characteristics separately. The results suggest that people older than 60 years, advanced tumor stage, and estrogen and progesterone receptor-negative patients have higher risk scores (P<0.05), which may represent a poorer prognosis (Fig. 4). This group belongs to the high-risk group of the risk signature and requires more attention.

Transcription factor regulatory networks and functional enrichment analysis
To explore the possible mechanisms by which cancer driver genes affect breast cancer progression and prognosis, we first tried to explore the co-expression network of transcription factors and cancer driver genes. Through Pearson's correlation coefficient analysis (|R 2 |> 0.3 and P< 0.05), 206 transcription factors were associated with  cancer driver gene expression, and DDX3X, JAK1, EGR2, IKZF3, and CCR7 were the five most enriched cancer driver genes (Fig. 5A). A series of twelve cancer driver gene expressions were identified to be associated with transcription factors, of which CCR7, BRD7, DDX3X, and UBE2A were upregulated in breast cancer tissues, and the others were downregulated in breast cancer tissues. Next, all molecules were substituted into Cytoscape to map the regulatory network (Fig. 5B). GO and KEGG analysis suggest possible mechanisms by which cancer driver genes impact breast cancer prognosis. GO analysis showed that the functional sets of immune response, T cell activity, and immune-related receptors were significantly enriched. KEGG analysis suggests that B cell receptor signaling pathway, T cell receptor signaling pathway, and PD−1 checkpoint pathway were significantly enriched. These results suggest a strong association between cancer driver genes and the immune status, which supports our further studies on the risk signature and tumor microenvironment.

Correlation of the risk signature with tumor microenvironment
Alterations in the tumor microenvironment may be an essential way in which cancer driver genes affect breast cancer. We analyzed the correlation between PD-1 levels, tumor microenvironment, and risk groups. The results suggested that PD-1 expression levels, stromal scores, and immune scores were higher in the low-risk group compared to the high-risk group (P<0.05) (Fig. 6A-D). We likewise analyzed the correlation of immune cells and immune pathways with the risk signature, and the results suggested that most immune function scores differed significantly in the risk groups (P<0.05) (Fig. 6E, F).

Discussion
In the precision medicine era, the mutation and expression characteristics of breast cancer patients' tumor genomes are playing an increasingly critical role in the choice of therapeutic strategies [23]. Tumor heterogeneity is the main reason for the different responses to the treatment of breast cancer patients [24,25]. Different molecular typing is used to explain breast cancer heterogeneity. For example, estrogen and progesterone receptor status can indicate whether a patient is suitable for endocrine therapy [26]. In contrast, traditional tumor markers are gradually showing their limitations, while novel markers at the molecular level, such as mRNA, miRNA, and DNA methylation, are being intensively investigated [27][28][29][30]. Tumor-infiltrating lymphocytes (TIL) are a heterogeneous  [31]. Triple-negative breast cancers have the highest degree of TIL infiltration, followed by HER2+ breast cancers [32]. The KEYNOTE-086 study found that abundant mesenchymal TIL was closely associated with better efficacy of pembrolizumab [33]. The combined consideration of PD-L1 levels in tumor cells and immune cells predicts the possible benefit of receiving immunotherapy in breast cancer patients [34]. Molecular targeted therapy, immunotherapy, and novel molecular biomarkers have broadened the treatment opportunities for breast cancer, improving patients' quality of life and prolonging survival time.
The specific expression of cancer driver genes is a major contributor to the formation of the tumor microenvironment, which further leads to various biological processes such as uncontrolled tumor cell proliferation, invasion, metastasis, recurrence, and drug resistance [21,35]. Martínez's study analyzed the gene expression of over 20,000 tumor samples and mapped cancer driver gene profiles [9]. Algorithms and identification of cancer driver genes have been intensively studied [36,37]; however, the clinical value of newly identified cancer driver genes has been relatively under-explored. Aberrant expression and mutations of cancer driver genes are very common in breast cancer patients, approaching 50% [12]. Comprehensive analysis of the clinical significance of these genes may lead to the discovery of new clinical biomarkers. We studied cancer driver gene expression profiles of more than 1000 breast cancer samples and identified cancer driver genes associated with breast cancer progression and prognosis. For example, Rizeq's study found that activation of the C-C chemokine receptor 7 (CCR7)-related complex increased tumor cell proliferation and migration [38]. Vahidi's study concluded that CD8-positive memory T cells in tumordraining lymph nodes are associated with CCR7 [39]. We found the specific expression and prognostic value of CCR7 in breast cancer. Both Niu [40] and our study confirmed the oncogenic role of bromodomaincontaining 7 (BRD7) molecule in breast cancer. In general, these studies suggest the involvement of cancer driver genes in complex tumor biological behavior and clinically important roles. Based on the identification of the cancer driver genes, we further constructed a thirteen-mRNA prognostic signature with high specificity and sensitivity of prediction. The survival of patients in different risk groups differed significantly, and the risk score was an independent risk factor in predicting breast cancer prognosis.
Several clinical trials have demonstrated favorable results of PD-1/PD-L1 antibodies in the treatment of breast cancer [34,41,42]. Targeting PD-1-related pathways effectively blocks immune surveillance and escape and activates the immune response of T cells against tumor cells [43,44]. The stromal environment, immune cell level, and composition are crucial in the immunotherapy process. Cancer driver genes can indirectly alter the tumor microenvironment through multiple regulatory networks [21], but fewer studies have been conducted. Raskov's study [45] found that mutations in some cancer driver genes could increase accessibility for DNA targeting chemotherapeutics and reduce cytotoxic drug resistance in colorectal cancer. In our study, KEGG analysis revealed significant enrichment of immunerelated pathways such as T cell activation and PD-1-related pathways, suggesting possible interactions between cancer driver genes and immunity. We found that the survival time was longer in the low-risk group compared to the high-risk group, who also had higher PD-1 expression, immune-related scores, and tumor microenvironment scores. Our findings suggest that cancer driver genes may be involved in the body's immune response and have an enhancing effect on the effectiveness of immunotherapy.

Conclusion
In summary, we analyzed the expression profiles of 568 cancer driver genes in breast cancer and identified driver genes associated with breast cancer prognosis. We further constructed a risk signature to predict breast cancer prognosis, and both the training and external test sets performed sensitive predictive efficacy. Cancer driver genes may affect the biological behavior of breast cancer by mediating transcriptional processes and alterations in the immune and tumor microenvironment. Our study confirms the critical role of cancer driver genes in breast cancer, and it is expected to supply a reference for the prognostic stratification and treatment strategy of breast cancer.