Machine Learning Study Note - Supervised Learning Algorithm Overview
Last updated: 2017-09-25 19:30:12 PDT.
Formilizing Supervised Learning Algorithm
Given a sampleThe sample is IID: independently draw from the the same distribution . is a -dimsional exclusive binary vector (where exactly one of the dimensions is and others ).
The aim is to build a approximation to using the model . There are three decisions we must make:
- Model where is the input, and is call the parameter. defines the hypothesis class .
- Loss function, , to compute the difference between the desired output and our approximation . The appoximation error, or loss, is the sum of losses over the individual instances
- Optimization procedure to find that minimizes the loss
Bayesian Decision Theory
The information in the world is not alway known to the observer. The extra knowledge we do not have access to are named the unobservable variables. Otherwise, called observable variables. Denoting the unobservables by and the observable as , in reality we havewhere is a deterministic function that defines the outcome from the unobservable piece of knowledge.
Since we cannot model the world this way, we define the outcome as a random variable drawn from a probability distribution . can be described as a cumulative distribution function (CDF) , or a probablity density function (PDF) such that:
or in the case of discrete variable, probability mass function (PMF) such that:
Note: The two are related by the Dirac delta function defined as the PDF of the unit step CDF, that is, the distribution function of a random variable that always equals to zero. From physics and signal processing perspective:which is not well-defined at . But we can integrate it around zero by allowing iterated integral. For all :
This mean it is indeed the PDF of unit step CDF. We can then describe a PMF in terms of PDF:and keeps the CDF unchanged.
If we do not know , we want to estimate this from a given sample . The goal is to build an estimator that given an instance , estimates its likelihood.
In the case of classification, given classes , and sample , we want to estimate the conditional probability of the class given , . Using Bayes' rule, it can be written asis called the prior. is call the posterior. is called the class likelihood. is calle the evidence. We can rewrite this equation as
Also, the Law of total probability:We can then rewrite the Bayes' rule as
A Bayes' classifier chooses the class with the highest posterior probability
Losses and Risks
It may be the case the decisions are not equally good or costly. Let us define action as the decision to assign the input to class and be the loss for taking action when the input actually belongs to . Then expected risk for taking action isand we choose the action with minimum risk:
If we consider all mistakes to have the same loss, we get the 0/1 loss functions:
Thus to minimize risk, we choose the most probable case.
In some applications, wrong decisions have very high cost, we can add a reject or doubt action to th the actions. A possible loss function iswhere . The decision rule becomes in other words, reject if the maximum likelihood does not exceed certain threshold .
Classification can also be seen as a set of Discriminant functions, such that we
It is equivalent to Bayes' classifier for we can define the discrimitant function asWhen using 0/1 loss, we have: or This divides the feature space into decision regions , where . The regions are separated by decision boundaries.
When there is only one case, we can define a single discriminantand we
When , the classification system is called a dichotomizer and for , it is a polychotomizer.
We now generalize expected risk to utility theory, which is concerned with making rational decisions when we are uncertain about the state.
Given evidence , the probability of state is calculated as . If we define a utility function , which measures how good it is to take action when the state is . The expected utility is
A rational decision maker chooses the action that maximizes the expected utility.
In the context of classification, maximizing the EU is equivalent to minimize expected risk.