Incorporating Distribution Information
into the Structural Risk Minimization
Learning Algorithm of Support Vector
Machine:
New Algorithms for Expression
Data Analysis.
Zhen Zhang
We first derive a new unified maximum separability
analysis (UMSA) procedure that allows for the
incorporation of data distribution information into
the structural risk minimization learning algorithm
of the Support Vector Machine. We will then show that, for
linear classification problems, the UMSA procedure
unifies the classic linear discriminant analysis method
and the optimal margin hyperplane method. We argue
that the main advantage of the UMSA procedure is its
efficient use of information in problems with a very
limited number of samples.
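
For orientation, the two classic methods named above are usually
written as follows in textbook form. This is only a reminder of the
standard formulations, not the UMSA derivation itself; the notation
(class means m_0 and m_1, pooled within-class scatter S_W, labels
y_i in {-1, +1}) is assumed here and does not come from the abstract.

```latex
% Linear discriminant analysis (Fisher criterion): uses the class
% distributions through their means and within-class scatter.
\[
  J(w) = \frac{\bigl(w^{\top}(m_1 - m_0)\bigr)^{2}}{w^{\top} S_W\, w},
  \qquad
  w_{\mathrm{LDA}} \propto S_W^{-1}(m_1 - m_0).
\]

% Optimal margin hyperplane (hard-margin SVM): uses only the
% samples closest to the decision boundary.
\[
  \min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
  \quad \text{subject to} \quad
  y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n.
\]
```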
Two algorithms based on the UMSA procedure have been
developed for biological expression data processing. The
first algorithm projects the expression data onto a
new component space. The axes of the component space
correspond to the directions along which two predefined
classes of data achieve maximum separability according
to UMSA. The second algorithm uses a backward stepwise
process to compute a ranking score for individual variables
based on their contributions to the collective effort
to separate two predefined classes of data.
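
The backward stepwise idea can be illustrated with a minimal Java
sketch. It is not the UMSA algorithm or the package described here:
the separability score is a stand-in (the squared Mahalanobis distance
between the two class means, with a small ridge term), and the names
BackwardStepwiseRanking, separability, solve, and rank are
hypothetical. The sketch assumes class labels 0 and 1, at least one
sample in each class, and more than two samples in total.

```java
import java.util.ArrayList;
import java.util.List;

final class BackwardStepwiseRanking {

    // Stand-in separability score (an assumption, NOT the UMSA criterion):
    // d^T (S + ridge*I)^{-1} d over the selected variables, where d is the
    // difference of class means and S the pooled within-class covariance.
    static double separability(double[][] x, int[] y, List<Integer> vars) {
        int p = vars.size(), n = x.length;
        double[][] mean = new double[2][p];
        int[] count = new int[2];
        for (int i = 0; i < n; i++) {
            count[y[i]]++;
            for (int a = 0; a < p; a++) mean[y[i]][a] += x[i][vars.get(a)];
        }
        for (int c = 0; c < 2; c++)
            for (int a = 0; a < p; a++) mean[c][a] /= count[c];

        double[][] m = new double[p][p + 1];   // augmented matrix [S | d]
        for (int i = 0; i < n; i++)
            for (int a = 0; a < p; a++)
                for (int b = 0; b < p; b++)
                    m[a][b] += (x[i][vars.get(a)] - mean[y[i]][a])
                             * (x[i][vars.get(b)] - mean[y[i]][b]) / (n - 2);
        double[] d = new double[p];
        for (int a = 0; a < p; a++) {
            d[a] = mean[1][a] - mean[0][a];
            m[a][a] += 1e-6;                   // ridge keeps S invertible
            m[a][p] = d[a];
        }
        double[] w = solve(m);                 // w = (S + ridge*I)^{-1} d
        double score = 0;
        for (int a = 0; a < p; a++) score += d[a] * w[a];
        return score;
    }

    // Gaussian elimination with partial pivoting on a p x (p+1) augmented matrix.
    static double[] solve(double[][] m) {
        int p = m.length;
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) pivot = r;
            double[] tmp = m[col]; m[col] = m[pivot]; m[pivot] = tmp;
            for (int r = col + 1; r < p; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c <= p; c++) m[r][c] -= f * m[col][c];
            }
        }
        double[] w = new double[p];
        for (int r = p - 1; r >= 0; r--) {
            double sum = m[r][p];
            for (int c = r + 1; c < p; c++) sum -= m[r][c] * w[c];
            w[r] = sum / m[r][r];
        }
        return w;
    }

    // Backward stepwise elimination: repeatedly drop the variable whose removal
    // hurts the joint separability least. Returns variable indices ordered from
    // most important (dropped last) to least important (dropped first).
    static int[] rank(double[][] x, int[] y) {
        int p = x[0].length;
        List<Integer> active = new ArrayList<>();
        for (int j = 0; j < p; j++) active.add(j);
        int[] order = new int[p];
        for (int pos = p - 1; pos >= 1; pos--) {
            int bestIdx = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int k = 0; k < active.size(); k++) {
                List<Integer> trial = new ArrayList<>(active);
                trial.remove(k);               // remove by position, not by value
                double score = separability(x, y, trial);
                if (score > bestScore) { bestScore = score; bestIdx = k; }
            }
            order[pos] = active.remove(bestIdx);
        }
        order[0] = active.get(0);
        return order;
    }
}
```

Calling BackwardStepwiseRanking.rank(x, y) with one row per sample and
one column per variable returns the variable indices in decreasing
order of importance. Because each variable is scored by what the
remaining set loses without it, a variable is judged by its
contribution to the combination rather than in isolation, which is the
point of the backward stepwise process described above.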
The two algorithms have been implemented in Java in a
software package. Examples will be given using microarray
and SELDI protein chip data. Two different types of
applications will be discussed. The first application is
to screen for potential biomarkers that individually or
in combination differentiate two predefined clinical groups
of patients. The second application is to use UMSA
component analysis to "shave off" irrelevant variables
and data points before applying cluster analysis so that
the resultant clusters may be biologically more meaningful.