Introducing novel and comprehensive models for predicting recurrence in breast cancer using the group LASSO approach: are estimates of early and late recurrence different?

Background: We constructed personalized models for predicting breast cancer (BC) recurrence according to the timing of recurrence (early versus late).
Methods: Group LASSO, an efficient penalized regression algorithm, was used for simultaneous variable selection and risk factor estimation in a logistic regression model.
Results: For recurrence < 5 years, the following were associated with lower odds of recurrence: age (OR 0.96, 95% CI = 0.95–0.97), number of pregnancies (OR 0.94, 95% CI = 0.89–0.99), family history of other cancers (OR 0.73, 95% CI = 0.60–0.89), hormone therapy (OR 0.76, 95% CI = 0.61–0.96), dissected lymph nodes (OR 0.98, 95% CI = 0.97–0.99), right-sided BC (OR 0.87, 95% CI = 0.77–0.99), diabetes (OR 0.77, 95% CI = 0.60–0.98), and history of breast operations (OR 0.38, 95% CI = 0.17–0.88). The following were associated with higher odds of recurrence: smoking (OR 5.72, 95% CI = 2.11–15.55), history of breast disease (OR 3.32, 95% CI = 1.92–5.76), in situ component (OR 1.58, 95% CI = 1.35–1.84), tumor necrosis (OR 1.87, 95% CI = 1.57–2.22), sentinel lymph node biopsy (SLNB) (OR 2.90, 95% CI = 2.05–4.11) and SLNB+axillary node dissection (OR 3.50, 95% CI = 2.26–5.42), grade 3 (OR 1.79, 95% CI = 1.46–2.21), stage 2 (OR 2.71, 95% CI = 2.18–3.35), stages 3 and 4 (OR 5.01, 95% CI = 3.52–7.13), and mastectomy+radiotherapy (OR 2.97, 95% CI = 2.39–3.68). Moreover, relative to mastectomy without radiotherapy (the reference category), quadrantectomy without radiotherapy had a noticeably higher odds ratio than quadrantectomy with radiotherapy for recurrence < 5 years (OR 17.58, 95% CI = 6.70–46.10 vs. OR 2.50, 95% CI = 2–3.12). Accuracy, sensitivity, and specificity of the model were 82%, 75.6%, and 74.9%, respectively.
For recurrence > 5 years, stage 2 cancer (OR 1.67, 95% CI = 1.31–2.14) and radiotherapy+mastectomy (OR 2.45, 95% CI = 1.81–3.32) were significant predictors. Furthermore, relative to mastectomy without radiotherapy (the reference category), quadrantectomy without radiotherapy had a noticeably higher odds ratio than quadrantectomy with radiotherapy for recurrence > 5 years (OR 7.62, 95% CI = 1.52–38.15 vs. OR 1.75, 95% CI = 1.32–2.32). Accuracy, sensitivity, and specificity of the model were 71%, 78.8%, and 55.8%, respectively.
Conclusion: For the first time, we constructed models for estimating recurrence based on timing of recurrence, which are among the most applicable models with excellent accuracy (> 80%).
Electronic supplementary material: The online version of this article (10.1186/s12957-018-1489-0) contains supplementary material, which is available to authorized users.


Background
Breast cancer (BC) is the most common cancer among women and is the second leading cause of cancer-related death in women [1].
The main treatment of BC is surgery. Recurrence is a major concern after surgical treatment of BC and is associated with a large increase in BC-related death, which usually occurs during the first 5 years after diagnosis [2,3].
To date, multiple models have been introduced for predicting BC prognosis, and these have mainly focused on survival [4,5]. Such models have aided in developing guidelines and managing BC patients. However, they have used a limited set of variables on patients' clinical characteristics, and some have relied on machine learning algorithms that are difficult to interpret clinically [6,7].
Considering that most predictors of recurrence (clinicopathological features and tumor-specific characteristics) are highly correlated, we aimed to develop a comprehensive model to predict recurrence that would exclude merely associational factors. In addition, because BC recurrence during the early and late stages of the disease course significantly affects patients' quality of life, we hypothesized that predictors may differ between early and late recurrence. Thus, to determine whether predictors of early recurrence (defined as earlier than 5 years) differ from those of late recurrence (later than 5 years), we developed two further models based on time of recurrence using advanced statistical modeling.

Study settings and patient selection
This study is part of an ongoing BC registry termed the Shiraz Breast Cancer Registry (SBCR), which began registering patients in 2005. The breast clinic is located in Motahhari Medical Clinic, Shiraz, Iran. Patients are referred to the breast clinic from multiple medical health centers within the city and from other provinces (mostly in Southern Iran). Currently, the registry includes more than 6000 registered patients with BC, and data on more than 200 variables covering patient and clinical characteristics have been documented for each individual.
Participants were selected from the SBCR, and all individuals diagnosed with BC since 1995 have been included in the current study. All male cases of BC were excluded.
Patients were categorized into three groups according to their recurrence time: those with recurrence during the first 5 years after their initial diagnosis of BC, those with recurrence more than 5 years after their initial diagnosis, and those with no recurrence for more than 10 years after diagnosis.
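The three-group definition above can be sketched as a small function. This is a minimal illustration and not the registry's actual code: the function and argument names are hypothetical, and assigning a recurrence at exactly 5 years to the late group is an assumption, since the text only specifies < 5 and > 5 years.

```python
from typing import Optional


def recurrence_group(years_to_recurrence: Optional[float],
                     follow_up_years: float) -> Optional[str]:
    """Assign a patient to one of the three study groups.

    years_to_recurrence: time from diagnosis to confirmed recurrence,
        or None if no recurrence has been recorded.
    follow_up_years: time from diagnosis to last follow-up.
    Returns "early", "late", "no_recurrence", or None (not eligible).
    """
    if years_to_recurrence is not None:
        # Assumption: a recurrence at exactly 5 years counts as late.
        return "early" if years_to_recurrence < 5 else "late"
    # Recurrence-free patients qualify only after more than 10 years.
    if follow_up_years > 10:
        return "no_recurrence"
    return None  # recurrence-free but followed for 10 years or less


print(recurrence_group(2.5, 3.0))
print(recurrence_group(None, 12.0))
```

Returning None for recurrence-free patients with 10 years of follow-up or less mirrors the restriction of the comparison group to patients without recurrence for more than 10 years.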

Variable selection
More than 35 variables on baseline characteristics, socioeconomic determinants, obstetrics and gynecological history, family history, history of other diseases and other tumors, BC specifics including side of involvement, type of BC, treatment specifics, staging and grading of tumor, and histopathological features were considered and compared between the groups.
Education level was defined as illiterate, high school or less, or college education.
Occupation was classified as stay-at-home, retired, government employment, or self-employment.
For sports activity, patients reported whether or not they engaged in any sports; the regularity of the activity (regular or irregular) was also recorded.
Axillary management was classified as either sentinel lymph node biopsy (SLNB), axillary lymph node dissection (AND), both, or none.
Breast surgery was classified as either mastectomy or breast conserving surgery (BCS).
The histopathological in situ component and tumor necrosis were each recorded as present or absent.

Statistical analysis
All 1273 individuals were included in the final model for estimating patient recurrence. The end point was metastasis (local, regional, or distant), and the primary outcome was the time from diagnosis to confirmed recurrence. Initially, individuals were classified as either with recurrence or without recurrence (for more than 10 years) and compared. Then, to clarify the differences between early and late recurrence, we categorized patients into three groups based on timing of recurrence: recurrence at < 5 years, recurrence at > 5 years, and no recurrence for more than 10 years. Because we had a large sample from the BC registry, an efficient algorithm called group LASSO was used to simultaneously perform variable selection and estimate risk factors in a logistic regression model. Group LASSO is suggested when variables have several levels and can be expressed through a group of dummy variables, because it selects or drops each group as a whole. Group LASSO also has excellent properties in terms of both variable importance and prediction and avoids over-shrinking large coefficients. By placing a constraint on the absolute value of the regression coefficients, the penalized function shrinks many of the coefficients. Furthermore, by removing redundant variables and introducing a small bias into the model, the group LASSO method controls multicollinearity and performs well when the number of variables is high [8]. Ten-fold cross-validation was used to select the penalty (tuning) parameter, and a bootstrap with 1000 replications was applied to calculate the standard errors of the coefficients.
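For reference, the group LASSO estimate for logistic regression is commonly written as the minimizer of a penalized negative log-likelihood. The notation below (G groups of sizes p_g, linear predictor \eta_i) is introduced here for illustration, and the \sqrt{p_g} group weights follow the usual textbook formulation; the exact scaling applied by the software may differ.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i\,\eta_i - \log\bigl(1 + e^{\eta_i}\bigr)\Bigr]
  \;+\; \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta_g \rVert_2,
\qquad
\eta_i = \beta_0 + \sum_{g=1}^{G} x_{ig}^{\top}\beta_g
```

Here y_i (0 or 1) indicates recurrence for patient i, \beta_g collects the coefficients of the dummy variables that encode the g-th predictor, and \lambda is the tuning parameter. Because the penalty acts on the Euclidean norm of each group, all dummy variables of a categorical predictor enter or leave the model together.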
To investigate the prediction accuracy of the proposed model in classifying patients with and without recurrence, receiver operating characteristic (ROC) curve analysis was performed and the optimal cut-off point for the predicted probability of BC recurrence was reported; in addition, the area under the curve (AUC) and the sensitivity and specificity at that cut-off point were reported. Statistical analysis was performed using SPSS 18.0 and the grpreg package in R 3.3.1. Considering the main research question, we further classified patients into two groups, those with early recurrence (< 5 years) and those with late recurrence (> 5 years), and constructed a separate model to predict recurrence in each group.
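The cut-off evaluation described above can be illustrated with a short sketch. The text does not state which criterion defined the optimal cut-off point, so the use of Youden's J (sensitivity + specificity - 1) here is an assumption, as are the function names and the toy data.

```python
from typing import Dict, Sequence


def metrics_at(probs: Sequence[float], labels: Sequence[int],
               cutoff: float) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy when p >= cutoff predicts recurrence.

    Both classes must be present in labels, otherwise a ratio is undefined.
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def youden_cutoff(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Return the observed probability that maximizes Youden's J (assumed criterion)."""
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(probs)):
        m = metrics_at(probs, labels, cutoff)
        j = m["sensitivity"] + m["specificity"] - 1.0
        if j > best_j:  # ties keep the lowest cut-off
            best_cutoff, best_j = cutoff, j
    return best_cutoff


# Toy data (hypothetical): predicted probabilities and true recurrence status.
toy_probs = [0.1, 0.2, 0.3, 0.7, 0.4, 0.5, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
best = youden_cutoff(toy_probs, toy_labels)
print(best, metrics_at(toy_probs, toy_labels, best))
```

Other criteria, such as the point closest to the top-left corner of the ROC curve, can select a different cut-off on the same data.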
For the evaluation of radiotherapy, considering that its indications differ according to type of breast surgery (mastectomy or BCS), individuals were first categorized by type of surgery, and radiotherapy was then evaluated in each group separately.
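One way to picture this stratification is as a single four-level factor expressed through three dummy variables, which is the kind of grouped predictor that group LASSO keeps or drops as a unit. The sketch below is illustrative only: the variable names are hypothetical, and equating the quadrantectomy categories reported in the results with BCS is an assumption. The reference category, mastectomy without radiotherapy, follows the reported comparisons.

```python
from typing import Dict


def surgery_radiotherapy_dummies(surgery: str, radiotherapy: bool) -> Dict[str, int]:
    """Dummy-code surgery x radiotherapy.

    Reference category (all zeros): mastectomy without radiotherapy.
    surgery must be "mastectomy" or "bcs" (breast conserving surgery).
    """
    if surgery not in ("mastectomy", "bcs"):
        raise ValueError("surgery must be 'mastectomy' or 'bcs'")
    return {
        "mastectomy_with_rt": int(surgery == "mastectomy" and radiotherapy),
        "bcs_without_rt": int(surgery == "bcs" and not radiotherapy),
        "bcs_with_rt": int(surgery == "bcs" and radiotherapy),
    }


print(surgery_radiotherapy_dummies("bcs", True))
```

With this coding, each reported odds ratio for a surgery and radiotherapy combination is read relative to mastectomy without radiotherapy.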
Statistical tests were two-sided, and a p value of less than 0.05 was considered statistically significant.
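As a worked illustration of how a logistic coefficient and its standard error relate to an odds ratio with a 95% confidence interval, the sketch below uses the usual Wald-type interval. The text does not specify how the intervals were derived from the bootstrap standard errors, so this construction is an assumption, and the input values are hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_ci(beta: float, se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (OR, lower, upper) for a coefficient and its standard error.

    Assumes a Wald-type interval: exp(beta +/- z * se).
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def excludes_one(lower: float, upper: float) -> bool:
    """True when the interval lies entirely above or entirely below OR = 1."""
    return lower > 1.0 or upper < 1.0


# Hypothetical coefficient and standard error, for illustration only.
or_value, lower, upper = odds_ratio_ci(0.5, 0.2)
print(round(or_value, 2), round(lower, 2), round(upper, 2), excludes_one(lower, upper))
```

Under this approximation, a 95% interval that excludes 1 corresponds to a two-sided p value below 0.05.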

Results
Patients' baseline characteristics and comparison of individuals with recurrence and those without recurrence are shown in Table 1.
Those with recurrence of > 5 years had higher rates of other types of cancer in family members compared to the < 5 years recurrence group and those without recurrence > 10 years (25.8%, 14.6%, and 22.1%, respectively, p < 0.001), and higher rates of hormone therapy (88.6%, 76.7%, and 86.3%, respectively, p < 0.001).
The three groups were also significantly different regarding invasion status (p < 0.001) and pathological grade (p < 0.001) (Table 2).
Using the model coefficients, the probability of BC recurrence was calculated for each patient. The cut-off point was determined as p = 0.566 in ROC analysis, and the accuracy of the proposed model was 80% (95% CI = 78.2-82.6%). Sensitivity and specificity for group LASSO were 70.1% and 76.8%, respectively. The tuning parameter for this model was 0.006 (Fig. 1).
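The step from coefficients to a predicted class can be sketched as follows. The coefficient names and values are hypothetical placeholders and not the fitted model (the final models are given in the additional file); only the cut-off of 0.566 comes from the text, and treating a probability equal to the cut-off as a predicted recurrence is an assumption.

```python
import math
from typing import Dict


def predicted_probability(intercept: float, coefficients: Dict[str, float],
                          patient: Dict[str, float]) -> float:
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    eta = intercept + sum(b * patient.get(name, 0.0)
                          for name, b in coefficients.items())
    return 1.0 / (1.0 + math.exp(-eta))


def classify(probability: float, cutoff: float = 0.566) -> int:
    """Return 1 (predicted recurrence) when the probability reaches the cut-off."""
    return 1 if probability >= cutoff else 0


# Hypothetical coefficients and patient, for illustration only.
example_coefficients = {"smoking": 0.9, "sports_activity": -0.4}
example_patient = {"smoking": 1.0, "sports_activity": 0.0}
p = predicted_probability(-0.2, example_coefficients, example_patient)
print(round(p, 3), classify(p))
```

Variables removed by group LASSO have zero coefficients and therefore do not contribute to the linear predictor.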
When stratified according to timing of recurrence, our models showed that for recurrence < 5 years, relative to mastectomy without radiotherapy (the reference category), quadrantectomy without radiotherapy had a noticeably higher odds ratio than quadrantectomy with radiotherapy (OR 17.58, 95% CI = 6.70-46.10 vs. OR 2.50, 95% CI = 2-3.12) (Table 4). The cut-off point for this model (< 5-year recurrence) was determined as p = 0.495 in ROC analysis, and the accuracy of the proposed model was 82% (95% CI = 80-84%). Sensitivity and specificity for group LASSO were 75.6% and 74.9%, respectively. The tuning parameter for this model was 0.006 (Fig. 2).
For the > 5-year recurrence model, the cut-off point was determined as p = 0.206 in ROC analysis, and the accuracy of the proposed model was 71% (95% CI = 67-74%). Sensitivity and specificity for group LASSO were 78.8% and 55.8%, respectively. The tuning parameter for this model was 0.007 (Fig. 2). The three final models are provided in Additional file 1 and can be used to estimate recurrence time based on our selected variables; the models will also be available on the breast clinic website at www.bdrc.sums.ac.ir.

Discussion
Here, we aimed to introduce models to predict recurrence in a large sample of individuals over a period of roughly 20 years, from 1995 to 2016. We further defined models to predict early and late recurrence and compared their estimates. In our final model, which included more than 50 variables on different aspects of BC and patient baseline characteristics, we found that, aside from more common and previously known risk factors such as clinical stage and pathological grade, factors such as sports activity, higher age, number of lymph nodes (LNs) dissected in SLNB and AND, and radiotherapy in BCS were protective against recurrence. On the other hand, in situ component in pathology, tumor necrosis, having other breast diseases, smoking, LN management including SLNB and simultaneous SLNB and AND (considering AND as the base for comparison), number of invasive LNs after dissection, and radiotherapy after mastectomy were associated with earlier recurrence. When we stratified our models by early and late recurrence, we found that for early recurrence (< 5 years), in addition to the factors that were significant for overall recurrence, number of pregnancies, family history of other cancers, hormone therapy, right-sided BC, diabetes, and history of breast operations were predictors of better outcome. In contrast, only stage 2 BC and radiotherapy were significant predictors of late recurrence (> 5 years).
Recently, Wu et al. [9] introduced a model for estimating 5-year recurrence in a population of 4505 women. In their final model, age of less than 54 years, alcohol consumption, and adjuvant therapy were protective, whereas African American ethnicity, nuclear grade 3, tumor size, number of positive nodes, and lymphovascular invasion were adverse predictors of 5-year recurrence. Using a Cox analysis approach, they introduced one of the most comprehensive models for estimating 5-year recurrence, drawing on both epidemiological and BC-specific data. Similar to that study, we developed one of the most comprehensive models for predicting BC recurrence, in two phases of early and late recurrence. Furthermore, as we included a wide range of data from our BC registry, our model is more comprehensive, covering baseline characteristics, socioeconomic determinants, obstetrics and gynecological data, pathological data, and personal habits such as smoking and sports activity. In our results, we also found the number of positive nodes and grade to be predictors of worse recurrence.
Considering the clinical value of timing of recurrence, we introduced two models according to time of recurrence, one for early and one for late recurrence. Our models showed that only radiotherapy and stage of cancer remained significant for recurrence at > 5 years. This is an important clinical finding, as it aids significantly in the understanding of late recurrence in BC patients.
In a smaller study in 2016 [10], patients with early (< 5 years) and late (> 5 years) recurrence were compared regarding clinical characteristics. The two groups differed in tumor size, number of positive nodes, grade, ER and PR status, HER2, and adjuvant therapy. In multivariate regression models, tumor size, ER, and HER2 were associated with worse > 5-year recurrence, and grade 2 BC was associated with better late recurrence. That study used regression modeling to estimate predictors of late recurrence in a population of 300 women and did not provide an overview of the differences between patients with early and late recurrence, as it had only a limited set of participants and variables. Another study in 2016 [11] evaluated factors associated with BC recurrence after BCS and found premenopausal state, ER expression, and hormone therapy to be associated with recurrence. Similarly, we found hormone therapy to be significant in our < 5-year recurrence model. Our study presents a novel assessment of BC recurrence, and accordingly, we found some interesting results regarding determinants of BC recurrence using advanced statistical modeling.
Table footnotes: Shows statistical significance (p < 0.05). †Irregular sports activity was considered base for comparison. ‡Grades 1 and 2 were considered base for comparison. §Having axillary lymph node dissection was considered base for comparison. ||Having mastectomy without radiotherapy was considered base for comparison. ¶Stage zero was considered base for comparison.
Among the most interesting findings was the association between sports activity and recurrence. Although sports activity was measured subjectively, with patients asked about their daily routine and physical activity, it was highly protective against BC recurrence. According to a 2006 meta-analysis by McNeely [12], which evaluated the relationship between exercise and BC, studies on recurrence and physical activity in large samples with long-term follow-up were largely missing up to that time. To date, most studies have focused on physical activity and BC outcomes as a whole; however, two more recent studies, one conducted in Germany and another in a Canadian registry, evaluated the association between exercise and BC recurrence. Using data from these two studies, a meta-analysis in 2015 [13] found that exercise had a protective role against recurrence, with an odds ratio very similar to that of our study (OR 0.72; 95% CI = 0.56-0.91). Although the exact mechanism by which exercise decreases recurrence rates remains unknown, studies have shown that exercise improves quality of life in BC patients [12], and others have attributed the effect to changes in adipose tissue and skeletal muscle [14].
Regarding pathology-related parameters, in situ component and tumor necrosis were associated with worse recurrence.
We found that those who had both SLNB and AND were at higher risk of recurrence than those who had SLNB alone or AND alone.
SLNB has only recently been added as a treatment modality to replace isolated AND [15], and isolated SLNB has only recently been accepted and applied in our center and in the literature. Because our study included patients from 1995, before SLNB was introduced to replace AND, some patients who had axillary dissection in the past may have been node negative (a mixture of patients with good and bad prognosis). This may be among the reasons for the higher recorded recurrence rate associated with SLNB (compared to AND); thus, judgment on the matter should be made with caution. Those who had both axillary management modalities had definitively positive LNs and consequently had a worse prognosis. However, all of this mainly concerns locoregional recurrence; distant metastasis is a more complicated phenomenon and may not be easily explained. Although AND is not considered for patients without palpable masses or signs of metastasis on sonographic evaluation, a review in 2013 [16] found that among such individuals, AND improves recurrence rates by 1-3% compared to isolated SLNB, which was similar to our results. In a more recent review by Bromham and colleagues [17] that included RCTs comparing individuals with no axillary surgery and those with AND, no axillary surgery increased locoregional recurrence by 1.10 to 3.06; however, regarding distant metastasis, results were uncertain as to whether no surgery increased metastasis rates (HR 1.06; 95% CI = 0.87-1.30). The comparison of isolated AND and SLNB also showed uncertain results regarding distant metastasis in that review (HR 0.80; 95% CI = 0.42-1.53).
Among the interesting findings was that the number of LNs dissected in AND or SLNB management was associated with better overall recurrence. On the other hand, the number of invasive nodes detected in AND and SLNB was associated with worse recurrence. These findings should be applied clinically with caution, as dissecting a higher number of nodes will ultimately produce more complications such as lymphedema.
In our model, a history of breast disease was a strong risk factor for recurrence; this is a novel finding that has yet to be described and evaluated.
Those with mastectomy who received radiotherapy had earlier recurrence than those who had mastectomy without radiotherapy; this is expected and is attributable to the more advanced stage of patients who receive both mastectomy and radiotherapy. Among other novel findings in our endeavor to identify factors associated with recurrence were the association of smoking with recurrence and the nonsignificant association of waterpipe use with recurrence. As the use of waterpipes continues to grow worldwide, it has become a global epidemic, with recent reports from the Middle East indicating that it has even surpassed cigarette smoking to become the most common form of tobacco use in the region [18]. This is the first study to evaluate waterpipe use in BC recurrence.
We found that multiple obstetrics-related variables, such as number of pregnancies, number of abortions, number of children, and breastfeeding duration, were not significantly associated with overall recurrence; however, number of pregnancies, history of breast operation, hormone therapy, right-sided BC, and history of previous breast disease were significant in our < 5-year recurrence model.
We found diabetes to be associated with lower odds of < 5-year recurrence, which was similar to the finding reported by Chen et al. [19]. Using the Surveillance, Epidemiology and End-Results (SEER)-Medicare database, they found metformin use to be associated with a 31% decrease in BC recurrence (95% CI 0.53-0.90).
Our results indicate that, among treatment modalities, only radiotherapy seems to affect recurrence at > 5 years, with different results depending on the type of BC surgery performed (mastectomy or BCS).
This study was not without limitations. Because some categories contained a limited number of individuals, not all variables in our database could be included in the final model. In addition, individuals added to our registry recently (less than 10 years from their initial diagnosis) may not have had the chance to present signs of recurrence; we therefore restricted the non-recurrence group to those without recurrence for more than 10 years from their initial diagnosis of BC, which decreased the size of the comparison groups.

Conclusion
As the main outcome of our study, we used advanced statistics to construct models based on multiple factors to predict both early and late recurrence. Compared to previous literature, which has included limited variables, our models are among the most applicable and comprehensive models for predicting recurrence based on timing of recurrence, with excellent accuracy (> 80%).