Machine Learning Study Note - Kernel Machines
Last updated: 2017-09-25 19:30:12 PDT.
Kernel machines are maximum margin methods that allow the model to be written as a sum of the influences of a subset of the training instances.
Support Vector Machine - Introduction
Support vector machines (SVMs) is a kernel machine. It has been popular because:
- It is a discriminant-based method and uses Vapnik’s principle to never solve a more complex problem as a first step before the actual problem.
- After training, the parameters can be written down as a subset of training set, which are called support vectors.
- The output is written as a sum of influences of support vectors and theses are given by kernel functions that are application-specific measures of similarity between data instances.
- Typically data points are represented as vectors, and either dot product or Euclidean distance is used. A kernel function allowes us to go beyond that.
- Kernel-based algorithms are formulated as convex optimization problems, and there is a single optimum that we can solve for analytically.
Optimal Separating Hyperplane
Let's start with two classes and use lables for the each class. The sample is where . We would like to find and such thatwhich can be rewritten as Not only do we want the instances to be on the right side of the hyperplane, but we also want them some distanc away, for better generalization. This forms a optimal separating hperplane. Thus we would like to maximize some value
We want to maximize but there are an infinite number of solutions that we can get by scaling and for a unique solution, we need to fix and thus, to maximize the margin, we minimize . The task can therefore be defined as to
This needs to be generalized when the problem is not linearly separable. One way to do that is to increase the number of parameters. We are interested in a methods whose complexity does not depend on the input dimensionality.
To get the new formulation, we plugin Lagrange multipliers :This should be optimized with respect to , and maximized with respect to .
This is a convex quadratic optimization problem therefore we can equavalently solve the dual problem making use of the Karush-Huhn-Tucker conditions. The dual is to maximize with respect to , subject to the constraints that the gradient of with respect to and are , and also that :there fore we get the dual
This is a quadratic optimization problem, it can be solved in time. Once we solve for , we see that though there are of them, most vanish with and only a small percentage have . The set of whose are the support vectors. These are the that satisfyand lie on the margin. We can use this fact to calculate from any support vector as The eventual bias will be the average of over the support vectors. The discriminant thus found is called the support vector machine.
Being a discriminant-based method, the SVM cares only about the instances close to the boundary and discards those that lie in the interior. Using this idea, it is possible to use a simpler classifier before the SVM to filter out a large portion of such instances, thereby decreasing the complexity of the optimization step of the SVM.
The Nonseparable Case: Soft Margin Hyperplane
If the data is not linearly separable, we look for the one that incurs the least error.
Define slack variables, which store the deviation from the margin. Relaxing our condition, we require
If , is correctly classified but in the margin. If , is misclassified. We define soft error asand add this as a penalty term: is a hyperparameter. Adding the constraints, the Lagrangian becomes: where are the new Lagrange multipliers to guarantee the positivity of . Optimize the Lagrangian The last implies . The dual problem: subject to
The support vectors are those whose and we use those whose to calculate since they have .
It is shown that the expected test error rate is
The definition implies that we define error if the instance is on the wrong side or if the margin is less than . This is called the hinge loss. If is the output and is desired output, hinge loss is defined as
is the regularization parameter. When is high, we have a high penalty for nonseparable points, and we may store many support vectors and overfit. If it is too small, we may find too simple solution that underfit. Typically, one choose from log scale by looking at the accuracy on a validation set.
In order to solve a non-linear problem, we can map the problem to a new space by doing nonlinear transformation using suitably chose basis functions and then use a linear model in this new space. The linear model in the new space corresponds to a nonlinear model in the oriignal space.
Let us say we have the new dimensions calculated through the basis functionsmapping form the -dimensional space to the -dimensional space where we write the discriminant as The dual Lagrangian is then
The kernel trick is to replace with a kernel function . This implies that if we have the kernel function, we do not have to map the vector to another space. The kernel function also shows up in the discriminant
Furthermore, for any valid kernel, there exists a corresponding mapping function, but it may be much simpler to use the kernel than calculating the mapping and take the dot product. Many algorithms have be kernelized, as we will see in the later sections.
The mose popular, general-purpose kernel functions are
- polynomial of degree In the case of , it corresponds to the basis function When , it corresponds to the original formulation.
- radial-based functions One can have a Mahalanobis kernel, generalized from the Euclidean distance: where is the covariance matrix Or, in the most general case, where is some general distance function.
- sigmoidal functions: This has the same shape as sigmoid functions except it ranges between and . It is similar to MLPs.
Multiple Kernel Learning
It is possible to construct new kernels by combining simpler kernesl. If and are kernels, and a constant, then , , are also valid. Given multiple models, you can also combine them by direct product. If and are kernels in space and , then you can construct a kernel in space , for example,We can generalize it to a number of kernels which works as similar to an average of kernels. It is also possible to use a weighted sum and also learn the weights from data: subject to , with or without the constaint of (simplex constraint). This is called multiple kernel learning. The objective becomes
After training, will take values depending on how the corresponding kernel is useful in discriminating. It is also possible to localize kernels by defining kernel weights as a parameterized function of the input , rather like the gating function in mixture of experts
and the gating parameters are learned together with the support vector machine parameters.
Kernel Machines for Regression
We start with a linear model
We use the square of the difference as errorwhereas in support vector regression, we use the -sensitive loss function: means we tolerate errors up to and also that errors beyond have a linear loss and not a quadratic one. As in the hinge loss, there is a region of no loss.
Similar to classification, we introduce slack variables to account for deviations out of the -zone and we getsubject to
We use two types of slack variables, one for positive and one for negative deviations. The lagrangian istaking the derivatives The dual is subject to Once we solve this, we see that all instances that fall in the tube have . There are the instances that are fitted with enough precision. The support vectors satisfy either or and are of two types.
Eventually, the fitted function is a weighted sum of the support vectors