Incorporating Distribution Information into the Structural Risk Minimization Learning Algorithm of Support Vector Machine:
New Algorithms for Expression Data Analysis.

Zhen Zhang

We first derive a new unified maximum separability analysis (UMSA) procedure that allows data distribution information to be incorporated into the structural risk minimization learning algorithm of the Support Vector Machine. We then show that, for linear classification problems, the UMSA procedure unifies the classic linear discriminant analysis method and the optimal margin hyperplane method. We argue that the main advantage of the UMSA procedure is its efficient use of information in problems with a very limited number of samples. Two algorithms based on the UMSA procedure have been developed for biological expression data processing. The first algorithm projects the expression data onto a new component space whose axes correspond to the directions along which two predefined classes of data achieve maximum separability according to UMSA. The second algorithm uses a backward stepwise process to compute a ranking score for each individual variable based on its contribution to the collective effort to separate two predefined classes of data. Both algorithms have been implemented in Java in a software package. Examples will be given using microarray and SELDI protein chip data, and two types of applications will be discussed. The first is to screen for potential biomarkers that, individually or in combination, differentiate two predefined clinical groups of patients. The second is to use UMSA component analysis to "shave off" irrelevant variables and data points before applying cluster analysis, so that the resulting clusters may be more biologically meaningful.
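The backward stepwise ranking described above can be illustrated with a minimal sketch. It is not the UMSA implementation: the abstract does not specify the UMSA objective, so the `separability` function below is a stand-in (the gap between class means over the summed class spreads, measured along the centroid-difference direction), and the names `separability` and `backward_stepwise_rank` are hypothetical. The package itself is in Java; Python is used here only for brevity.

```python
from statistics import mean, pstdev


def separability(X, y, cols):
    """Stand-in two-class separability score (NOT the UMSA criterion).

    Projects both classes onto the direction joining their centroids,
    restricted to the variables in `cols`, and returns the gap between
    the projected class means divided by the summed class spreads.
    """
    a = [[row[c] for c in cols] for row, label in zip(X, y) if label == 1]
    b = [[row[c] for c in cols] for row, label in zip(X, y) if label == 0]
    # Centroid-difference direction, one weight per retained variable.
    w = [mean(ca) - mean(cb) for ca, cb in zip(zip(*a), zip(*b))]
    pa = [sum(wi * v for wi, v in zip(w, row)) for row in a]
    pb = [sum(wi * v for wi, v in zip(w, row)) for row in b]
    spread = pstdev(pa) + pstdev(pb)
    return abs(mean(pa) - mean(pb)) / (spread + 1e-12)


def backward_stepwise_rank(X, y):
    """Return variable indices ordered from most to least important.

    At each step, the variable whose removal leaves the highest score
    for the remaining set is dropped; variables dropped later rank higher.
    """
    remaining = list(range(len(X[0])))
    removed = []
    while len(remaining) > 1:
        drop = max(
            remaining,
            key=lambda c: separability(X, y, [r for r in remaining if r != c]),
        )
        remaining.remove(drop)
        removed.append(drop)
    removed.append(remaining[0])
    return removed[::-1]
```

Calling `backward_stepwise_rank(X, y)` with `X` as a list of sample rows and `y` as 0/1 class labels returns the variable indices in ranked order. A real analysis would replace `separability` with the UMSA criterion and score each variable by its contribution to the combined classifier.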