# Predicting 5-Year Survival Status of Patients with Breast Cancer based on Supervised Wavelet Method

## Abstract

### Objectives

Classification of breast cancer patients into different risk classes is very important in clinical applications. It is expected that the advent of high-dimensional gene expression data could improve patient classification. In this study, a new method for transforming high-dimensional gene expression data into a low-dimensional space based on the wavelet transform (WT) is presented.

### Methods

The proposed method was applied to three publicly available microarray data sets. After dimensionality reduction using supervised wavelet, a predictive support vector machine (SVM) model was built upon the reduced dimensional space. In addition, the proposed method was compared with the supervised principal component analysis (PCA).

### Results

The performance of supervised wavelet and supervised PCA based on selected genes was better than that based on the signature genes identified in other studies. Furthermore, the supervised wavelet method generally performed better than supervised PCA for predicting the 5-year survival status of patients with breast cancer based on microarray data. In addition, the proposed method had a relatively acceptable performance compared with other studies.

### Conclusion

The results suggest the possibility of developing a new tool using wavelets for the dimension reduction of microarray data sets in the classification framework.

## 1 Introduction

Metastatic breast cancer is a stage of breast cancer where the disease has spread to distant organs or tissues. Treatments against metastasis exist, but usually further treatments after surgery can have serious side effects and involve high medical costs [1]. An important task to optimize the adjuvant chemotherapy of metastasis related to breast cancer is to diagnose the risk of metastasis accurately [2–4].

Classification of cancer patients into different risk classes is very important in clinical applications. Traditional methods for patient classification were mainly based on a series of clinical and histological features [3]. It is expected that the advent of high-dimensional gene expression data could improve patient classification [5]. Gene expression profiles of breast tumor samples could be used to predict relapse and metastatic patterns in breast cancer patients that could be potential candidate targets for new treatments [4]. It is reasonable to assume that any difference between two tumors should be reflected in some difference in gene expression. However, in microarray studies, the number of samples is relatively small compared to the number of genes per sample. Furthermore, from the biological aspect, only a small portion of genes have predictive power for phenotypes. If all or most of the genes are considered in the predictive model, they can induce substantial noise and thereby lead to poor predictive performance [6]. Thus, in order to obtain good classification accuracy, a crucial step in the application of microarray data is dimension reduction of the gene expression profiles. In recent years, both feature selection and feature extraction methods have been widely used for classifying gene expression data [7]. Bair and Tibshirani [8] and Bair et al. [9] explored the use of supervised principal component analysis (PCA), which is similar to conventional PCA except that it uses a subset of the predictors selected based on their association with the outcome. Wavelet-based methods have also been used to solve the dimension reduction problem. The primary intuition for applying wavelets to gene expression data is that genes are often coexpressed in groups. Therefore, it would be useful to treat the group as a single variable, akin to the motivation behind methods such as PCA [10].
One-dimensional discrete wavelet transform (DWT) is frequently used for feature extraction in the analysis of high-dimensional biomedical data [11]. Studies showed that this method has an acceptable performance in the field of feature extraction in the classification framework [11–15].

The current study aimed to introduce a dimension reduction strategy for transforming high-dimensional gene expression data into a low-dimensional space based on the wavelet transform (WT) in order to predict metastasis of breast cancer. Accordingly, a predictive support vector machine (SVM) model was built upon the reduced dimensional space. Then, the proposed novel supervised wavelet method of feature extraction was compared with supervised PCA.

## 2 Materials and methods

The proposed method was applied to three publicly available microarray data sets related to breast cancer.

### 2.1 Data

#### 2.1.1 Breast cancer data from van't Veer (NKI_97)

The first data set was reported by van't Veer et al [2] and is referred to as NKI_97. The original van't Veer data consist of gene expression profiles and clinical information for 97 samples of primary breast cancer tumors, and each case is described by the expression levels of 24,481 genes. Fifty-one patients remained free from metastasis for at least 5 years and were metastasis-negative, and 46 patients developed metastasis within 5 years and were metastasis-positive. All patients were <55 years old and were lymph node-negative, that is, they had no tumor cells in local lymph nodes [2]. The data used in this study are a filtered version of the van't Veer data including gene expression values of 4948 genes in 97 tumor samples [2]. The data are publicly available in the “cancerdata” R package (http://www.bioconductor.org/packages/release/data/experiment/html/cancerdata.html).

#### 2.1.2 Breast cancer data from van de Vijver (NKI_295)

The second data set was reported by van de Vijver et al [4] and is referred to as NKI_295. The data set provides the gene information for 295 primary breast cancer patients, of which 234 patients were new and the remaining 61 patients were also included in the first data set. Of the total 295 patients, 194 patients were metastasis-negative and 101 patients were metastasis-positive. Of the 234 new patients, 164 patients were metastasis-negative and 70 patients were metastasis-positive. Of the 61 patients included in the first data set, 30 were metastasis-negative and 31 were metastasis-positive. The data are a filtered version of the van de Vijver data including gene expression values of 4948 genes in 295 tumor samples [4]. The data are publicly available in the “cancerdata” R package.

#### 2.1.3 Breast cancer data from the Wang study (VDX_286)

The last data set, reported by Wang et al [16] and referred to as VDX_286, contains 286 lymph node-negative breast cancer patients who had not received any adjuvant systemic treatment [16]. Among them, 106 patients had distant metastasis within 5 years of follow-up and were considered metastatic patients, while the rest were considered nonmetastatic patients. A set of 22,283 genes is available for this data set. The data are publicly available in the “breastCancerVDX” R package.

### 2.2 Wavelet Transform

A wavelet is a “small wave”, which has its energy concentrated in time. In signal processing, a transformation technique is used to transfer data into another domain where hidden information can be extracted. Wavelets offer local description and separation of signal characteristics, and provide a tool for the analysis of transient or time-varying signals [11].

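A wavelet is a set of orthonormal basis functions generated from dilation and translation of a single scaling function or father wavelet ($\phi$) and a mother wavelet ($\psi$).

WTs are classified into two different categories: the continuous WT and the DWT. The DWT is a linear operation that operates on a data vector, transforming it into a vector of wavelet coefficients. The idea underlying DWT is to express any function $f(t)$ in terms of $\psi(t)$ and $\phi(t)$ as follows:

$$f(t) = \sum_{k} c_{j_0}(k)\, 2^{j_0/2}\, \phi\!\left(2^{j_0} t - k\right) + \sum_{j \ge j_0} \sum_{k} d_{j}(k)\, 2^{j/2}\, \psi\!\left(2^{j} t - k\right)$$

where $\phi(t)$, $\psi(t)$, $c_{j_0}$, and $d_j$ denote the scaling function, the wavelet function, the scaling (approximation) coefficients at scale $j_0$, and the detail (wavelet) coefficients at scale $j$, respectively. The variable $k$ is the translation coefficient for the localization of gene expression data. The scales denote the different (low to high) scale bands. The variable symbol $j_0$ is the selected scale number.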

One-dimensional DWT decomposes a signal as a sum of wavelets at different time shifts and scales (frequencies). For this purpose, the signal is passed through a series of high-pass and low-pass filters in order to analyze both the low and the high frequencies in the signal.

At each level, the high-pass filter produces detail coefficients (wavelet coefficients) *d*_{1}, while the low-pass filter associated with the scaling function produces approximation coefficients (scaling coefficients) *c*_{1}. Subsequently, the approximation coefficients *c*_{1} are split into two parts by using the same algorithm and are replaced by *c*_{2} and *d*_{2}, and so on. This decomposition process is repeated until the required level is reached. The coefficient vectors are produced by downsampling and are only half the length of the signal or the coefficient vector at the previous level [12].
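The filtering and downsampling scheme described above can be illustrated for the Haar wavelet with a minimal sketch in plain Python. The function names `haar_step` and `haar_dwt` and the toy signal are illustrative assumptions and are not taken from the study; the sign convention of the high-pass filter is also a choice and differs between software packages.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar DWT.

    Returns (approximation, detail); each is half the length of `signal`.
    Assumes the signal length is even.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / s for a, b in pairs]  # low-pass filter + downsampling
    detail = [(a - b) / s for a, b in pairs]  # high-pass filter + downsampling
    return approx, detail

def haar_dwt(signal, levels):
    """Multi-level decomposition: returns (c_levels, [d_1, ..., d_levels])."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        # Only the approximation coefficients are split again at the next level.
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

# Toy signal of length 8 (hypothetical values).
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
c2, (d1, d2) = haar_dwt(x, levels=2)
print(c2, d2, d1)
```

Because the Haar transform used here is orthonormal, the sum of squares of the signal equals the sum of squares of all coefficients, which is one way to check an implementation.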

The main advantage of the WT is that each basis function is localized jointly in both the time and frequency domains. From a viewpoint of time-frequency, the approximation coefficients correspond to the larger-scale low-frequency components, and the detail coefficients correspond to the small-scale high-frequency components. Generally, the former can be used to approximate the original signal, and the latter represents some local details of the original signal [14,15].

There are different families of wavelets, such as symlets, coiflets, Daubechies, and biorthogonal wavelets. They differ in basic properties such as compactness. Haar wavelets, which belong to the Daubechies wavelet family, are the most commonly used wavelets in the database literature because they are easy to comprehend and fast to compute.

### 2.3 Q-value

It is usual to simultaneously test many hundreds or thousands of genes in microarray studies to determine which are differentially expressed. Each of these tests will produce a *p* value. One main challenge in those studies is to find suitable multiple testing procedures that provide an accurate control of the error rates. Whereas the *p* value is a measure of significance in terms of the false positive rate, the *q* value is an approach used to measure statistical significance based on the concept of the false discovery rate. Similar to the *p* value, the *q* value gives each feature its own individual measure of significance [17].
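As a rough illustration of a false discovery rate based measure, the sketch below computes Benjamini-Hochberg adjusted *p* values in plain Python. This is a stand-in and not the procedure used in the study: the *q* value of [17] additionally estimates the proportion of true null hypotheses, and it coincides with the quantity computed here only when that proportion is taken to be 1. The function name `bh_qvalues` and the example *p* values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that the adjusted values
    # are monotone in the rank of the p values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical p values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = bh_qvalues(pvals)
print(qvals)
```

Each gene keeps its own value, so genes can be ranked by it in the same way as by a *p* value.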

### 2.4 Supervised WT

First, patients who remained free from metastasis for at least 5 years are placed into Class 1, and all others into Class 2. The proposed DWT-based feature selection method then consists of the following steps: (1) a *t* test is used to identify differentially expressed genes, and a list of *q* values is derived. All genes are ranked according to their *q* values; and (2) at each step, the required number of top-ranked genes is selected from this list. This reduced set of genes is then modeled by the one-dimensional DWT using the Haar mother wavelet, and finally the wavelet approximation coefficients at the first and at the second level of decomposition are each used, in turn, as inputs to the SVM model.
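The two steps can be sketched in plain Python as follows. Several details are assumptions made for illustration only: genes are ranked by the absolute Welch *t* statistic as a stand-in for ranking by *q* value, the selected genes are kept in rank order, an odd-length vector is padded by repeating its last value, and the data matrix is a toy example. None of the names below come from the study.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Welch two-sample t statistic for one gene."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def top_genes(X, y, n_top):
    """Indices of the n_top genes with the largest |t| between Class 1 and Class 2."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 1]
        b = [row[g] for row, label in zip(X, y) if label == 2]
        scores.append(abs(t_statistic(a, b)))
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:n_top]

def haar_approx(values, levels=1):
    """Haar approximation coefficients after `levels` decomposition levels."""
    v = list(values)
    for _ in range(levels):
        if len(v) % 2:  # odd length: repeat the last value (assumption)
            v.append(v[-1])
        v = [(v[i] + v[i + 1]) / math.sqrt(2.0) for i in range(0, len(v), 2)]
    return v

# Toy data: 6 patients (rows) x 4 genes (columns); labels 1 and 2 as in the text.
X = [
    [1.0, 5.0, 0.2, 3.0],
    [1.2, 5.1, 0.1, 2.0],
    [0.9, 4.9, 0.3, 4.0],
    [3.0, 5.0, 0.2, 4.5],
    [3.1, 5.2, 0.1, 3.5],
    [2.9, 4.8, 0.3, 4.0],
]
y = [1, 1, 1, 2, 2, 2]

selected = top_genes(X, y, n_top=2)
# One feature vector per patient: level-1 approximation of the selected genes.
features = [haar_approx([row[g] for g in selected], levels=1) for row in X]
print(selected, features)
```

The resulting `features` rows are what a classifier such as an SVM would receive in place of the full expression profile.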

### 2.5 Supervised PCA

Bair and Tibshirani [8] and Bair et al [9] proposed supervised principal components regression. This procedure first picks out a subset of the gene expressions that correlate with the response by using univariate selection, and then applies PCA to this subset. In our analysis, we picked out the top-ranked genes based on *q* values. We then applied PCA to this subset of genes, and in each step included the leading principal components in an SVM model. The leading principal components that together account for at least 75% of the total variance were included in the SVM model.
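The 75% rule for choosing the number of components can be written as a short helper. This is only a sketch of the selection rule: it assumes the variances of the principal components (the eigenvalues of the covariance matrix of the preselected genes) have already been computed by a PCA routine, and the function name and example values are hypothetical.

```python
def n_components_for_variance(component_variances, threshold=0.75):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches `threshold`."""
    total = sum(component_variances)
    ordered = sorted(component_variances, reverse=True)
    cumulative = 0.0
    for k, var in enumerate(ordered, start=1):
        cumulative += var
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical component variances: the first two explain 6/8 = 75%.
k = n_components_for_variance([4.0, 2.0, 1.0, 0.5, 0.5])
print(k)
```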

### 2.6 SVM

The SVM model proposed by Vapnik [18] is a supervised learning method that is widely used in microarray data classification. Unlike many modeling techniques that aim to minimize an objective function (such as the mean square error) over all instances, SVM attempts to find the hyperplane that produces the largest separation between the decision function values for the instances located on the borderline between the two classes. The optimal hyperplane identified in the feature space corresponds to a nonlinear decision boundary in the input space. The SVM takes a set of input data with corresponding class labels and predicts the class to which a new input belongs.

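In the binary classification mode, given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \dots, l$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{1, -1\}$, the SVM requires the solution of the following optimization problem:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{1}$$

where $\xi_i$ are slack variables and $C$ is a user-defined penalty parameter on the training error that controls the trade-off between classification errors and the complexity of the model. By solving the optimization problem (1), that is, by finding the parameters $w$ and $b$ for a given training set, a decision hyperplane over an $n$-dimensional input space that produces the maximal margin in the space is designed. Thus, the decision function can be formulated as follows:

$$f(x) = \operatorname{sign}\left( w^{T} x + b \right) \tag{2}$$

SVM can derive the optimal hyperplane for nonlinearly separable data by mapping the input data into a high-dimensional feature space using a kernel function.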
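The role of the kernel function can be made concrete with a small sketch of the three kernel types used later in the Results (linear, sigmoid, and radial) and of a kernelized decision function of the form sign(Σ α_i y_i K(x_i, x) + b). The parameter names `gamma` and `coef0` follow common SVM software conventions, and the support vectors, multipliers, and parameter values below are hypothetical rather than fitted to any data in the study.

```python
import math

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def radial_kernel(u, v, gamma=0.5):
    # exp(-gamma * squared Euclidean distance)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * linear_kernel(u, v) + coef0)

def decision(x, support_vectors, labels, alphas, b, kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + b, returned as +1 or -1."""
    score = b + sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1

# Hypothetical two-point problem: with the linear kernel these multipliers
# give w = (0.5, 0.5) and b = 0, so both points lie exactly on the margin.
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

print(decision((2.0, 0.0), svs, ys, alphas, b, linear_kernel))
print(decision((-3.0, 1.0), svs, ys, alphas, b, radial_kernel))
```

Swapping the `kernel` argument changes the shape of the decision boundary without changing the rest of the code, which is the practical appeal of the kernel formulation.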

In this study, the goal of SVM modeling was to classify patients who had a high risk of breast cancer recurrence. The predictive performance of the SVM classifier was reported based on sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). These criteria are defined as follows (TP = true positive; TN = true negative; FN = false negative; and FP = false positive):

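Accuracy: $\mathrm{ACC} = \dfrac{TP + TN}{TP + TN + FP + FN}$

Sensitivity: $\mathrm{SN} = \dfrac{TP}{TP + FN}$

Specificity: $\mathrm{SP} = \dfrac{TN}{TN + FP}$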
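A minimal sketch of how these criteria can be computed from true labels, predicted labels, and classifier scores is given below. The AUC is computed here through its rank interpretation (the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half); the function names and the toy labels and scores are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
    }

def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with 4 positive and 4 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

m = classification_metrics(y_true, y_pred)
a = auc(y_true, scores)
print(m, a)
```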

The method was implemented using MATLAB R2012a software (MATLAB Release 2012a, The MathWorks, Inc., Natick, Massachusetts, United States) and the R statistical environment (packages e1071 and qvalue).

### 2.7 Cross data set comparison

To avoid overfitting and to provide a realistic evaluation, the cross data set method was used. In this method, features obtained from one data set were used to construct classifiers for the other data set. In this regard, the patients common to the NKI_295 and NKI_97 data were removed and the remaining data (NKI_234) were used as a test data set. The method was implemented by using the genes selected from the NKI_234 breast cancer data as input to the supervised wavelet method in the NKI_61 data.

## 3 Results

The *t* test statistics were used to identify discriminative genes in each data set. After selecting the top-ranked genes based on *q* values, one-dimensional WT at the first and second levels was applied to these preselected genes. SVMs with three types of kernels (linear, sigmoid, and radial) were used based on wavelet approximation coefficients at the first and second levels of decomposition. For further assessment of the reported subsets of 70 genes selected by van't Veer et al [2] (for NKI_97 and NKI_295) and 76 signature genes selected by Wang et al [16] (for VDX_286), the supervised wavelet method and supervised PCA were also applied to these gene sets. The predictive performance of the SVM models was tested by cross-validation, consisting of 10 repetitions of 10-fold cross-validation. The results of supervised wavelet and supervised PCA for the three data sets are shown in Tables 1–3, respectively.
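The repeated cross-validation scheme can be sketched as an index generator in plain Python. The sketch assumes simple random (unstratified) folds, since the text does not state how the folds were formed; the function name, the seed, and the use of 97 samples (the size of NKI_97) are illustrative choices.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition for every repetition
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield train, test

# 10 repetitions x 10 folds = 100 train/test splits.
splits = list(repeated_kfold(97))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Performance measures would then be averaged over all splits.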

In the NKI_97 data set (Table 1), the results showed that the SVM with a radial kernel based on first-level wavelet approximation coefficients extracted from 58 preselected genes had the best performance, with the highest accuracy (83.11) as well as AUC (83.45). In addition, the SVM with a radial kernel based on the first supervised PCA computed from 84 preselected genes had the best performance in terms of accuracy (79.22) as well as specificity (83.25), sensitivity (75.22), and AUC (79.24). In both methods (supervised wavelet and supervised PCA), the classifier performance based on the 70 genes selected by *q* values was better than that based on the 70-gene signature from the van't Veer study.

In the NKI_295 data set (Table 2), the results showed that the SVM with a radial kernel based on first-level wavelet approximation coefficients extracted from 91 preselected genes had the best performance, with the highest accuracy (75.37) as well as AUC (70.03). In addition, the SVM with a linear kernel based on the first supervised PCA computed from 91 preselected genes had the best performance in terms of accuracy (73.03) as well as AUC (66.63). In both methods (supervised wavelet and supervised PCA), the classifier performance based on the 70 genes selected by *q* values was better than that based on the 70-gene signature from the van't Veer study.

In the VDX_286 data set (Table 3), the results showed that the SVM with a linear kernel based on second-level wavelet approximation coefficients extracted from 67 preselected genes had the best performance, with the highest accuracy (79.21) as well as AUC (76.04). In addition, the SVM with a linear kernel based on the first supervised PCA computed from 67 preselected genes had the best performance in terms of accuracy (76.00) as well as AUC (74.71). In both methods (supervised wavelet and supervised PCA), the classifier performance based on the 76 genes selected using *t* statistics was better than that based on the 76-gene signature identified in the Wang study.

To evaluate the reproducibility of the proposed method, a cross data-set comparison was also performed. As shown in Table 4, the results confirmed that the supervised wavelet method also had an acceptable performance, although the improvements were not as high as in the inner data set comparison. The results of other studies based on the same data sets are shown in Table 5. It can be seen that the proposed method had a higher capability for the prediction of metastasis than the other studies [20–29].

## 4 Discussion

This study proposed a new method based on WT to develop a novel predictive model for the prediction of breast cancer metastasis. Furthermore, the performance of this method was compared with supervised PCA.

The main rationale for feature extraction using WT is that the approximation coefficients usually comprise the majority of the important information [11]. In addition, the powerful capability of the DWT to compress the signal energy makes it a good candidate for feature extraction applications. The DWT compresses most of the energy of the input signal and concentrates it in a few high-magnitude coefficients in the transformed matrix.

Unlike the PCA method, the wavelet feature extraction method does not depend on the training data set to obtain the basis of the feature space. Therefore, the wavelet feature extraction method dramatically reduces the computation load compared to PCA [11,12].

Considering the fact that most genes are irrelevant to patients' metastasis, we analyzed the reduced data set obtained by selecting genes that were significantly related to metastasis based on the *t* test statistics. If the WT is performed directly on all of the genes in a data set, there is no guarantee that the resulting wavelet coefficients will be related to metastasis. Thus, this study introduced a supervised form of WT, referred to here as the supervised wavelet. The supervised wavelet approximation coefficients extracted using the discrete Haar WT had higher predictive performance than the first three principal components. Therefore, our results suggest that wavelet coefficients are an efficient way to characterize the features of high-dimensional microarray data. Because the performance of the proposed supervised wavelet method can likely be improved further relative to some other studies, we conclude that this method is worth further investigation as a tool for cancer patient classification based on gene expression data. For example, to achieve optimal classification performance, a suitable combination of the classifier and the gene selection method needs to be specifically selected for a given data set.

Some studies reported misclassification rates that were obtained by applying their classifier to a single split of the data into training and test sets. For example, van't Veer et al [2] developed a 70-gene classifier predicting distant metastasis of breast cancer. In the training set, the classifier predicted the class of 65/78 cases correctly (i.e., with an accuracy of 83.3%, corresponding to a weighted accuracy of 83.6%), whereas in the test set it predicted the class of 17/19 cases correctly (i.e., with an accuracy of 89.5%, corresponding to a weighted accuracy of 88.7%). However, an evaluation of a classifier based on a single test set is strongly influenced by the data splitting process. Therefore, in the present study, in order to avoid the overfitting problem, we used 10 repetitions of 10-fold cross-validation for evaluating the SVM classifier.

Future investigations can focus on different ways of preselecting genes in the first stage of the proposed method. For example, rather than ranking genes based on their *t* test scores, one could use a different metric to measure the association between a given gene and metastasis occurrence. In addition, other mother wavelets and different levels of decomposition can be studied. In this study, only gene expression data were employed as predictors. However, prediction performance may be improved by adding other covariates such as age, lymph node status, tumor size, and histological grade. It is also likely that the classification performance could be improved with the use of other classifiers.

This study confirmed that the SVM model based on the supervised wavelet feature extraction method was superior in predictive performance to supervised PCA and to some other studies. Gene expression profiling can help to distinguish between patients at high risk and those at low risk for developing distant metastases. Therefore, this technology and other high-throughput techniques are helping to alter our view of breast cancer and provide us with new tools for molecular diagnosis. These results suggest the possibility of developing a new tool using wavelets for the dimension reduction of microarray data sets in the classification framework, and the use of this method in similar classification problems is therefore recommended.

## Conflicts of interest

The authors have no conflicts of interest to declare.

## References

## Acknowledgments

This study is part of a PhD thesis in Biostatistics (Grant no. 16/35/3500). The authors thank the Vice-Chancellor for Research and Technology of Hamadan University of Medical Sciences, Iran, for approving the project and providing financial support.

## Notes

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.