Support Vector Machine (SVM)


The support vector machine, usually abbreviated SVM, is a supervised learning method that is widely used in statistical classification and regression analysis.

Support vector machines belong to the family of generalized linear classifiers. They can also be regarded as a special case of Tikhonov regularization. A distinguishing property of this family of classifiers is that they simultaneously minimize the empirical error and maximize the geometric margin; for this reason support vector machines are also known as maximum margin classifiers.

Introduction

A support vector machine maps vectors into a higher-dimensional space in which a maximum-margin hyperplane is constructed. Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between these two parallel hyperplanes. The assumption is that the larger the margin or distance between the parallel hyperplanes, the smaller the total error of the classifier. An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition". Van der Walt and Barnard compared support vector machines to other classifiers.

Motivation


There are many classifiers (hyperplanes) that can separate the data, but only one achieves the maximum separation.
We usually want the classification process to be a machine learning process. The data points need not be points in $\mathbb{R}^2$; they can be points in any $\mathbb{R}^p$ (statistics notation) or $\mathbb{R}^n$ (computer science notation). We want to separate these points with an $(n-1)$-dimensional hyperplane; this is usually called a linear classifier. Many classifiers satisfy this requirement, but we further want to find the plane that classifies best, i.e. the one that maximizes the margin between the data points of the two different classes; that plane is called the maximum-margin hyperplane. If we can find this plane, the resulting classifier is called a maximum-margin classifier.

Problem definition


The maximum-margin hyperplane obtained by training an SVM on samples belonging to two classes. The sample points lying on the margin hyperplanes are also called support vectors.
We consider sample points of the form

$$\{(\mathbf{x}_1, c_1), (\mathbf{x}_2, c_2), \ldots, (\mathbf{x}_n, c_n)\},$$

where $c_i$ is either 1 or −1, indicating the class to which the point $\mathbf{x}_i$ belongs. Each $\mathbf{x}_i$ is a $p$-dimensional (statistics notation) or $n$-dimensional (computer science notation) real vector, with every component scaled to [0, 1] or [−1, 1]; the scaling prevents variables with large variance from dominating the classification. We call this the "training data", and we want the support vector machine to separate it correctly with a hyperplane, which can be written mathematically as

$$\mathbf{w} \cdot \mathbf{x} - b = 0,$$

where $\mathbf{x}$ is a point on the hyperplane and $\mathbf{w}$ is the vector perpendicular to the hyperplane.
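A new point is then assigned to a class according to which side of this hyperplane it falls on. As a minimal sketch in Python/NumPy (the values of `w`, `b` and the sample points below are made up purely for illustration):

```python
import numpy as np

def classify(w: np.ndarray, b: float, x: np.ndarray) -> int:
    """Assign x to class +1 or -1 depending on which side of the
    hyperplane w . x - b = 0 it falls on."""
    return 1 if np.dot(w, x) - b >= 0 else -1

# Illustrative 2-D example: the hyperplane x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = 1.0
print(classify(w, b, np.array([2.0, 2.0])))   # +1, above the plane
print(classify(w, b, np.array([0.0, 0.0])))   # -1, below the plane
```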

From geometry we know that the vector $\mathbf{w}$ is perpendicular to the separating hyperplane. The offset $b$ is added so that the margin can be increased; without $b$ the hyperplane would be forced to pass through the origin, restricting the flexibility of the method.

Since we want the maximum margin, we need to know the support vectors as well as the hyperplanes that are parallel to the optimal hyperplane and closest to the support vectors. These parallel hyperplanes can be described by the family of equations

$$\mathbf{w} \cdot \mathbf{x} - b = 1,$$
$$\mathbf{w} \cdot \mathbf{x} - b = -1.$$

Because $\mathbf{w}$ is only the normal vector of the hyperplane and its length is not fixed (it is a variable), the constants 1 and −1 on the right-hand side are chosen merely for computational convenience; any other pair of constants that are negatives of each other would also do.

If the training data are linearly separable, we can select two such hyperplanes so that no sample points lie between them and the distance between them is maximized. By simple geometry, the distance between these two hyperplanes is $2/\|\mathbf{w}\|$, so we want to minimize $\|\mathbf{w}\|$. To keep all data points outside the margin, we also need every $i$ to satisfy one of the two conditions

$$\mathbf{w}\cdot\mathbf{x}_i - b \ge 1 \qquad \text{for } c_i = 1,$$
or
$$\mathbf{w}\cdot\mathbf{x}_i - b \le -1 \qquad \text{for } c_i = -1.$$

These two conditions can be written together as

$$c_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1, \quad 1 \le i \le n. \qquad (1)$$
Primal form

The problem of finding the optimal hyperplane has now become the problem of minimizing $\|\mathbf{w}\|$ subject to constraint (1). This is a quadratic programming (QP) optimization problem.

Stated more explicitly:

Minimize (in $\mathbf{w}, b$)
$$\frac{1}{2}\|\mathbf{w}\|^2$$
subject to
$$c_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1, \quad i = 1, \ldots, n.$$

The factor of 1/2 is included purely for mathematical convenience.
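As a rough illustration (not the way SVM libraries actually solve it), this primal problem can be handed to a general-purpose constrained optimizer; the sketch below uses SciPy's SLSQP routine on made-up, linearly separable toy data:

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable training data: x_i in R^2, labels c_i in {-1, +1}.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
c = np.array([1, 1, 1, -1, -1, -1], dtype=float)

# Optimization variable z = (w_1, w_2, b).
def objective(z):
    w = z[:2]
    return 0.5 * np.dot(w, w)                     # (1/2)||w||^2

constraints = [
    # c_i (w . x_i - b) - 1 >= 0 for every training point, i.e. constraint (1).
    {"type": "ineq", "fun": lambda z, i=i: c[i] * (X[i] @ z[:2] - z[2]) - 1.0}
    for i in range(len(X))
]

# Feasible starting guess: w = (1, 1), b = 2 already separates the toy data.
res = minimize(objective, x0=np.array([1.0, 1.0, 2.0]),
               method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
print("margins:", c * (X @ w - b))                # all >= 1 at the optimum
```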

The usual idea for solving a problem like this might be to apply non-negative Lagrange multipliers $\alpha_i$ to the expression

$$\frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ c_i(\mathbf{w}\cdot\mathbf{x}_i - b) - 1 \right]$$

and minimize it over $\mathbf{w}$, $b$ and $\boldsymbol{\alpha}$. This, however, can go wrong. The reason is the following: suppose we can find a family of hyperplanes that separate the points; then $c_i(\mathbf{w}\cdot\mathbf{x}_i - b) - 1 \ge 0$ for every $i$, so we could drive the expression to its minimum by letting all the $\alpha_i$ tend to $+\infty$. That minimum would be attained by every member of the family, rather than by the optimal solution of the original problem.

However, the earlier constrained problem can be expressed as

$$\min_{\mathbf{w},\,b} \max_{\boldsymbol{\alpha} \ge 0} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ c_i(\mathbf{w}\cdot\mathbf{x}_i - b) - 1 \right] \right\}.$$

This says that we are looking for a saddle point. In this formulation, all the points that can be separated strictly, i.e. with $c_i(\mathbf{w}\cdot\mathbf{x}_i - b) - 1 > 0$, become irrelevant, because we must set the corresponding $\alpha_i$ to zero.

This problem can now be solved by standard quadratic programming techniques and programs. The solution can be expressed as a linear combination of the training vectors:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i c_i \mathbf{x}_i.$$

Only a few of the $\alpha_i$ will be greater than zero. The corresponding $\mathbf{x}_i$ are exactly the support vectors, which lie on the margin and satisfy $c_i(\mathbf{w}\cdot\mathbf{x}_i - b) = 1$. From this one can derive that the support vectors also satisfy

$$\mathbf{w}\cdot\mathbf{x}_i - b = 1/c_i = c_i \quad \Longleftrightarrow \quad b = \mathbf{w}\cdot\mathbf{x}_i - c_i,$$

which allows us to define the offset $b$. In practice, it is more robust to average over all $N_{SV}$ support vectors:

$$b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}} \left( \mathbf{w}\cdot\mathbf{x}_i - c_i \right).$$
Dual form

Writing the classification rule of the primal form in its dual form reveals that the classifier is in fact a function of only the support vectors, i.e. the training samples that lie on the edge of the margin.

Using $\|\mathbf{w}\|^2 = \mathbf{w}\cdot\mathbf{w}$ and substituting $\mathbf{w} = \sum_{i=1}^{n} \alpha_i c_i \mathbf{x}_i$, the dual form of the support vector machine is obtained as follows:

Maximize (in $\alpha_i$)
$$\tilde{L}(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j c_i c_j\, \mathbf{x}_i^{\mathsf{T}} \mathbf{x}_j$$

subject to (for all $i = 1, \ldots, n$)
$$\alpha_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i c_i = 0.$$
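As a rough illustration, this dual can also be handed to a general-purpose constrained optimizer; the sketch below uses SciPy's SLSQP routine on made-up separable data (a real implementation would use a dedicated QP or SMO solver) and then recovers $\mathbf{w}$ and $b$ as described above:

```python
import numpy as np
from scipy.optimize import minimize

# Same kind of toy separable data as in the primal sketch.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
c = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n = len(X)

G = (c[:, None] * X) @ (c[:, None] * X).T      # G_ij = c_i c_j x_i . x_j

def neg_dual(alpha):
    # Negative of L~(alpha); minimizing it maximizes the dual objective.
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual,
               x0=np.full(n, 0.1),
               method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ c}])  # sum_i alpha_i c_i = 0

alpha = res.x
w = (alpha * c) @ X                            # w = sum_i alpha_i c_i x_i
sv = alpha > 1e-6                              # support vectors: alpha_i > 0
b = np.mean(X[sv] @ w - c[sv])                 # offset, averaged over the support vectors
print("alpha =", np.round(alpha, 3))
print("w =", w, "b =", b)
```

At the optimum, only the multipliers belonging to the support vectors come out nonzero, which is exactly the sparsity property described above.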

Posterior SVM

Posterior probabilities matter to a classifier: the classifier's raw output has to be combined with a posterior probability before a final decision can be made, and posterior probabilities can also be used to improve the generalization ability of the hyperplane.
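One standard way to attach posterior class probabilities to an SVM's raw outputs is Platt scaling, which fits a sigmoid to the decision values; scikit-learn exposes this through the `probability=True` option of `SVC`. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D two-class data, just to illustrate the API.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, (50, 1)), rng.normal(2, 1, (50, 1))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", probability=True)   # enables Platt-style calibration internally
clf.fit(X, y)

print(clf.predict([[0.1]]))                 # hard class label
print(clf.predict_proba([[0.1]]))           # calibrated posterior probabilities
```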

Soft margin

In 1995, Corinna Cortes and Vapnik proposed a modified maximum-margin method that can handle mislabeled samples. If no hyperplane exists that can split the positive and negative examples, the "soft margin" method chooses a hyperplane that splits the samples as cleanly as possible while still maximizing the distance to the nearest cleanly split samples. This work popularized the term "support vector machine" (or "SVM"). The method introduces slack variables $\xi_i$, which measure the degree of misclassification of the data $\mathbf{x}_i$:

$$c_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1 - \xi_i, \quad 1 \le i \le n. \qquad (2)$$

The objective function is then increased by a function that penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes

$$\min_{\mathbf{w},\,\boldsymbol{\xi},\,b} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \right\}$$

subject to (for all $i = 1, \ldots, n$)

$$c_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
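In practice, this soft-margin formulation is what most library SVM implementations solve. A minimal sketch with scikit-learn, where the parameter `C` is the weight on the slack penalty and the two-class data is made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, slightly overlapping two-class data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Larger C -> heavier penalty on slack (harder margin); smaller C -> wider, softer margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# scikit-learn's decision function is w . x + intercept_, so b = -intercept_
# in the w . x - b convention used in this article.
print("w =", clf.coef_[0], " b =", -clf.intercept_[0])
print("number of support vectors per class:", clf.n_support_)
```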

Nonlinear classification


Kernel machine
The original optimal hyperplane algorithm proposed by Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al.[4]) to maximum-margin hyperplanes.[5] The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high dimensional; thus though the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space.
If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimensions. Maximum margin classifiers are well regularized, so the infinite dimensions do not spoil the results. Some common kernels include:
Polynomial (homogeneous): $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^d$
Polynomial (inhomogeneous): $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$
Gaussian radial basis function: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, for $\gamma > 0$. Sometimes parametrized using $\gamma = 1/(2\sigma^2)$
Hyperbolic tangent: $k(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\kappa\, \mathbf{x}_i \cdot \mathbf{x}_j + c)$, for some (not every) $\kappa > 0$ and $c < 0$
The kernel is related to the transform $\varphi(\mathbf{x}_i)$ by the equation $k(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$. The value w is also in the transformed space, with $\mathbf{w} = \sum_i \alpha_i c_i \varphi(\mathbf{x}_i)$. Dot products with w for classification can again be computed by the kernel trick, i.e. $\mathbf{w} \cdot \varphi(\mathbf{x}) = \sum_i \alpha_i c_i\, k(\mathbf{x}_i, \mathbf{x})$. However, there does not in general exist a value w’ such that $\mathbf{w} \cdot \varphi(\mathbf{x}) = k(\mathbf{w}', \mathbf{x})$.
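The kernels listed above translate directly into code. A small NumPy sketch (the parameter names `d`, `gamma`, `kappa` and `c0` correspond to the symbols above; the sample vectors are arbitrary):

```python
import numpy as np

def poly_homogeneous(x, z, d=2):
    """Homogeneous polynomial kernel (x . z)^d."""
    return np.dot(x, z) ** d

def poly_inhomogeneous(x, z, d=2):
    """Inhomogeneous polynomial kernel (x . z + 1)^d."""
    return (np.dot(x, z) + 1.0) ** d

def rbf(x, z, gamma=0.5):
    """Gaussian radial basis function exp(-gamma ||x - z||^2), gamma > 0."""
    return np.exp(-gamma * np.dot(x - z, x - z))

def tanh_kernel(x, z, kappa=1.0, c0=-1.0):
    """Hyperbolic tangent kernel tanh(kappa x . z + c0); only a valid
    (positive semidefinite) kernel for some choices of kappa and c0."""
    return np.tanh(kappa * np.dot(x, z) + c0)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_homogeneous(x, z), poly_inhomogeneous(x, z), rbf(x, z), tanh_kernel(x, z))
```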
Properties

SVMs belong to a family of generalized linear classifiers and can be interpreted as an extension of the perceptron. They can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.
A comparison of the SVM to other classifiers has been made by Meyer, Leisch and Hornik.[6]
Parameter selection
The effectiveness of SVM depends on the selection of kernel, the kernel’s parameters, and soft margin parameter C.
A common choice is a Gaussian kernel, which has a single parameter γ. The best combination of C and γ is often selected by a grid search with exponentially growing sequences of C and γ, for example, $C \in \{2^{-5}, 2^{-3}, \ldots, 2^{13}, 2^{15}\}$; $\gamma \in \{2^{-15}, 2^{-13}, \ldots, 2^{1}, 2^{3}\}$. Typically, each combination of parameter choices is checked using cross validation, and the parameters with the best cross-validation accuracy are picked. The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.[7]
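A sketch of this grid search with cross-validation using scikit-learn; the synthetic data and the exact grid ranges are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "C":     2.0 ** np.arange(-5, 16, 2),    # C in {2^-5, 2^-3, ..., 2^15}
    "gamma": 2.0 ** np.arange(-15, 4, 2),    # gamma in {2^-15, 2^-13, ..., 2^3}
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
# Refitting on the whole training set with the chosen parameters happens
# automatically (refit=True is the GridSearchCV default), giving the final model:
final_model = search.best_estimator_
```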
Issues
Potential drawbacks of the SVM are the following three aspects:
Uncalibrated class membership probabilities
The SVM is only directly applicable for two-class tasks. Therefore, algorithms that reduce the multi-class task to several binary problems have to be applied; see the multi-class SVM section.
Parameters of a solved model are difficult to interpret.
Extensions

Multiclass SVM
Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.
The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.[8] Common methods for such reduction include:[8] [9]
Building binary classifiers which distinguish between (i) one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
Directed Acyclic Graph SVM (DAGSVM)[10]
error-correcting output codes[11]
Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[12] See also Lee, Lin and Wahba.[13][14]
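Both reduction strategies described above are available as generic wrappers in scikit-learn; a minimal sketch on a built-in three-class dataset, chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)           # three classes

# One-versus-rest: one binary SVM per class, winner-takes-all on the outputs.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One-versus-one: one binary SVM per pair of classes, max-wins voting.
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```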
Transductive support vector machines
Transductive support vector machines extend SVMs in that they could also treat partially labeled data in semi-supervised learning by following the principles of transduction. Here, in addition to the training set $\mathcal{D}$, the learner is also given a set

$$\mathcal{D}^\star = \{\mathbf{x}^\star_i \mid \mathbf{x}^\star_i \in \mathbb{R}^p\}_{i=1}^{k}$$

of test examples to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem:[15]
Minimize (in $\mathbf{w}, b, \mathbf{c}^\star$)

$$\frac{1}{2}\|\mathbf{w}\|^2$$

subject to (for any $i = 1, \ldots, n$ and any $j = 1, \ldots, k$)

$$c_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1,$$
$$c^\star_j(\mathbf{w}\cdot\mathbf{x}^\star_j - b) \ge 1,$$

and

$$c^\star_j \in \{-1, 1\}.$$
Transductive support vector machines were introduced by Vladimir N. Vapnik in 1998.
Structured SVM
SVMs have been generalized to structured SVMs, where the label space is structured and of possibly infinite size.
Regression
A version of SVM for regression was proposed in 1996 by Vladimir N. Vapnik, Harris Drucker, Christopher J. C. Burges, Linda Kaufman and Alexander J. Smola.[16] This method is called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction (within a threshold $\varepsilon$). Another SVM version known as least squares support vector machine (LS-SVM) has been proposed by Suykens and Vandewalle.[17]
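A brief sketch of support vector regression with scikit-learn's `SVR`, where `epsilon` is the threshold $\varepsilon$ within which training points incur no cost; the noisy sine data is synthetic:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of a sine curve as a stand-in regression problem.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# Points predicted within +/- epsilon of their target contribute no loss,
# so the fitted model depends only on the samples outside that tube.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)
reg.fit(X, y)

print("support vectors used:", len(reg.support_))
print("prediction at x=2.5:", reg.predict([[2.5]]))
```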
Implementation

The parameters of the maximum-margin hyperplane are derived by solving the optimization. There exist several specialized algorithms for quickly solving the QP problem that arises from SVMs, mostly relying on heuristics for breaking the problem down into smaller, more-manageable chunks.
A common method is Platt’s Sequential Minimal Optimization (SMO) algorithm, which breaks the problem down into 2-dimensional sub-problems that may be solved analytically, eliminating the need for a numerical optimization algorithm.
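To make the 2-dimensional sub-problem concrete, here is a heavily simplified sketch of the analytic update SMO performs on one pair of multipliers, written with the $\mathbf{w}\cdot\mathbf{x} - b$ sign convention used earlier in this article; Platt's heuristics for choosing which pair to optimize, the outer loop, and the error cache are all omitted:

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, c, K, C, tol=1e-12):
    """Analytically re-optimize the pair (alpha_i, alpha_j) of the soft-margin dual,
    holding every other multiplier fixed. K is the precomputed kernel matrix."""
    # Current outputs f(x_k) = sum_l alpha_l c_l K(x_l, x_k) - b and their errors.
    f = (alpha * c) @ K - b
    E_i, E_j = f[i] - c[i], f[j] - c[j]

    # Box [L, H] that keeps 0 <= alpha_j <= C and sum_k alpha_k c_k unchanged.
    if c[i] != c[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]   # curvature along the feasible line
    if H - L < tol or eta >= 0:
        return alpha, b                        # this pair cannot improve the objective

    a_i_old, a_j_old = alpha[i], alpha[j]
    alpha = alpha.copy()
    alpha[j] = np.clip(a_j_old - c[j] * (E_i - E_j) / eta, L, H)
    alpha[i] = a_i_old + c[i] * c[j] * (a_j_old - alpha[j])

    # Re-derive the offset b from whichever multiplier ends up strictly inside (0, C).
    b1 = b + E_i + c[i] * (alpha[i] - a_i_old) * K[i, i] + c[j] * (alpha[j] - a_j_old) * K[i, j]
    b2 = b + E_j + c[i] * (alpha[i] - a_i_old) * K[i, j] + c[j] * (alpha[j] - a_j_old) * K[j, j]
    if 0.0 < alpha[i] < C:
        b = b1
    elif 0.0 < alpha[j] < C:
        b = b2
    else:
        b = (b1 + b2) / 2.0
    return alpha, b
```

The full algorithm simply loops this update over heuristically chosen pairs until the Karush–Kuhn–Tucker conditions are satisfied to within a tolerance.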
Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush–Kuhn–Tucker conditions of the primal and dual problems.[18] Instead of solving a sequence of broken down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low rank approximation to the matrix is often used in the kernel trick.
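As an illustration of the low-rank idea (independent of any particular interior-point solver), the Nyström method in scikit-learn replaces the full kernel matrix by a rank-`n_components` feature map, after which a plain linear SVM can be trained:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic data in place of a large real training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Rank-100 approximation of the RBF kernel matrix, followed by a linear SVM,
# instead of forming and factoring the full 1000 x 1000 kernel matrix.
model = make_pipeline(Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0),
                      LinearSVC())
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```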


————— From Wikipedia: http://en.wikipedia.org/wiki/Support_vector_machine

http://zh.wikipedia.org/wiki/%E6%94%AF%E6%8C%81%E5%90%91%E9%87%8F%E6%9C%BA