What Are the Better Gene Sets for Transcription-Based Prediction of Response to IFNβ?
Posted by PLOSBiology on 07 May 2009 at 22:12 GMT
Author: Wuju Li
Position: Bioinformatist
Institution: Center of Computational Biology, Beijing Institute of Basic Medical Sciences, P.O.Box 130(3), China
E-mail: wujuli@yahoo.com
Submitted Date: January 06, 2006
Published Date: January 13, 2006
This comment was originally posted as a “Reader Response” on the publication date indicated above. All Reader Responses are now available as comments.
In their recent article (1), Baranzini and coworkers explore the relationship between the expression profiles of 76 carefully selected genes and the response to IFNβ in 52 multiple sclerosis patients (33 good responders and 19 poor responders). The quadratic discriminant analysis-based IBIS was used to search for the best gene triplets, and nine sets of gene triplets were found. The best average prediction accuracy was 86.9%, for the top-scoring triplet (Caspase 2, Caspase 10, and FLIP). Here we address two questions: is it possible to incorporate more genes into the classifier to improve the average prediction accuracy, and if so, which gene sets are better? To answer these questions, we used the Tclass classification system (2).
In the Tclass system, both Bayes and Fisher's linear discriminant analysis are integrated with a stepwise optimization procedure for feature selection. The classification scheme for finding better gene sets is as follows. First, with the whole dataset as the training set and the prediction accuracy from leave-one-out cross-validation as the objective function, both discriminant methods are used to search for a given number of top-performing gene sets. Then, these gene sets are evaluated by randomly dividing the 52 patients into a training set and a test set 500 times, with a partition ratio of 50%, 67%, 75%, or 85%. For a particular partition ratio and gene set, the average prediction accuracy over the 500 test sets is taken as the discriminant power of that gene set. In principle, the discriminant power of the best gene set should be higher than that of any other gene set at every partition ratio. In most cases, however, it is impossible to find the best gene set: the number of possible feature combinations is huge, the partition ratio can vary, and the discriminant power of the same gene set may differ between classification methods. In this letter, we therefore used the stepwise optimization procedure for feature selection to find better gene sets. For comparison, we list only the results of the 500 simulations at the partition ratio of 75%.
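The two-stage scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Tclass implementation: a nearest-centroid rule stands in for the Bayes and Fisher discriminants, the partition ratio is assumed to be the fraction of patients placed in the training set, the stepwise search is a plain forward selection, and all function names are hypothetical.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT the Bayes or Fisher discriminant of Tclass)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def select(X, genes):
    """Keep only the expression columns listed in `genes`."""
    return [[row[g] for g in genes] for row in X]

def loocv_accuracy(X, y, genes):
    """Leave-one-out cross-validation accuracy for one gene set."""
    Xg = select(X, genes)
    hits = 0
    for i in range(len(y)):
        train_X = Xg[:i] + Xg[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, Xg[i]) == y[i]
    return hits / len(y)

def forward_stepwise(X, y, max_genes):
    """Greedy forward selection with LOOCV accuracy as the objective."""
    chosen, best = [], 0.0
    while len(chosen) < max_genes:
        candidates = [g for g in range(len(X[0])) if g not in chosen]
        if not candidates:
            break
        # Ties are broken in favour of the lower gene index.
        acc, gene = max(
            ((loocv_accuracy(X, y, chosen + [g]), g) for g in candidates),
            key=lambda t: (t[0], -t[1]),
        )
        if acc <= best:  # stop when no candidate improves the objective
            break
        chosen.append(gene)
        best = acc
    return chosen, best

def discriminant_power(X, y, genes, train_ratio, n_splits=500, seed=0):
    """Mean test-set accuracy over repeated random train/test partitions."""
    rng = random.Random(seed)
    Xg = select(X, genes)
    n_train = round(train_ratio * len(y))
    accuracies = []
    for _ in range(n_splits):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        train_y = [y[i] for i in train]
        if len(set(train_y)) < 2:  # skip splits missing a class
            continue
        train_X = [Xg[i] for i in train]
        hits = sum(
            nearest_centroid_predict(train_X, train_y, Xg[i]) == y[i]
            for i in test
        )
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)
```

With a 52-patient expression matrix `X` (one row per patient) and labels `y`, a call such as `discriminant_power(X, y, genes, 0.75)` would correspond to the 75% partition reported here, subject to the assumptions above.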
For Bayes discriminant analysis, the better gene set contains seven genes: IL-2Rg, Tbet, JUN, BAX, FLIP, MAP3K1, and CD80. Its average prediction accuracy is 94.3%. The best gene triplet (Caspase 10, FLIP, and NFkBIA) gives an average prediction accuracy of 86.7%, which is very close to the 86.9% reported in their paper (1).
For Fisher discriminant analysis, the better gene set also contains seven genes: IL-4Ra, NFkB-60, JUN, Caspase 2, Caspase 10, GZMB, and MAP3K1. Its average prediction accuracy is 92.4%. The best gene triplet reported in their paper (1) (Caspase 2, Caspase 10, and FLIP) was also found by our classification scheme, with an average prediction accuracy of 87.6%.
Finally, the class labels of the patients (good or poor responders) were randomly permuted in 1000 simulations to assess the significance level of the better gene sets found (1). The significance level for both gene sets is 0.0, which indicates that they are unlikely to have been obtained by chance.
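A label-permutation test of this kind can be sketched as below. This is a minimal illustration, not the procedure used in (1) or in Tclass: `permutation_significance` and `resubstitution_accuracy` are hypothetical names, and the toy scorer, which works on a single expression value per patient, merely stands in for whatever accuracy measure is being tested.

```python
import random

def resubstitution_accuracy(X, y):
    """Toy scorer (an assumption): nearest class mean on one expression value."""
    labels = sorted(set(y))
    means = {
        label: sum(x for x, t in zip(X, y) if t == label) / y.count(label)
        for label in labels
    }
    hits = sum(
        min(labels, key=lambda label: abs(x - means[label])) == t
        for x, t in zip(X, y)
    )
    return hits / len(y)

def permutation_significance(score_fn, X, y, n_permutations=1000, seed=0):
    """Fraction of label permutations scoring at least as well as the true labels."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    permuted = list(y)  # copy so the caller's labels are left untouched
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(permuted)
        if score_fn(X, permuted) >= observed:
            at_least_as_good += 1
    return at_least_as_good / n_permutations
```

Under this definition, a value of 0.0 from 1000 permutations means that no permuted labeling matched the observed score, which bounds the estimated significance level below 0.001 rather than making it exactly zero.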
In summary, we conclude that our classification scheme not only finds the best gene triplet reported in their paper (1) (Caspase 2, Caspase 10, and FLIP), but also finds better gene sets with seven genes. The average prediction accuracy improves from 86.9% to 94.3% (Bayes's method) or 92.4% (Fisher's method). With more than seven genes, the average prediction accuracy decreases gradually, which suggests that the model overfits when more genes are included in the classifier.
References
(1) Baranzini SE, Mousavi P, Rio J, Caillier SJ, Stillman A, et al. (2004) Transcription-based prediction of response to IFNβ using supervised computational methods. PLoS Biol 3(1): e2.
(2) Li W, Xiong M (2002) Tumor classification system based on gene expression profile. Bioinformatics 18: 325-326.