###
Machine Learning Study Note - Supervised Learning Algorithm Overview

Last updated: 2017-09-25 19:30:12 PDT.

### Formilizing Supervised Learning Algorithm

Given a sample

The sample is IID: independently draw from the the same distribution . is a -dimsional exclusive binary vector (where exactly one of the dimensions is and others ).The aim is to build a approximation to using the model . There are three decisions we must make:

*Model*where is the input, and is call the parameter. defines the hypothesis class .*Loss function*, , to compute the difference between the desired output and our approximation . The*appoximation error*, or*loss*, is the sum of losses over the individual instances*Optimization procedure*to find that minimizes the loss

## Bayesian Decision Theory

The information in the world is not alway known to the observer. The extra knowledge we do not have access to are named the *unobservable variables*. Otherwise, called *observable variables*. Denoting the unobservables by and the observable as , in reality we have

Since we cannot model the world this way, we define the outcome as a random variable drawn from a probability distribution . can be described as a cumulative distribution function (CDF) , or a probablity density function (PDF) such that:

or in the case of discrete variable, probability mass function (PMF) such that:

Note: The two are related by the Dirac delta function defined as the PDF of the unit step CDF, that is, the distribution function of a random variable that always equals to zero. From physics and signal processing perspective:

which is not well-defined at . But we can integrate it around zero by allowing iterated integral. For all :

This mean it is indeed the PDF of unit step CDF. We can then describe a PMF in terms of PDF:

and keeps the CDF unchanged.

If we do not know , we want to estimate this from a given sample . The goal is to build an estimator that given an instance , estimates its likelihood.

In the case of classification, given classes , and sample , we want to estimate the conditional probability of the class given , . Using Bayes' rule, it can be written as

is called the*prior*. is call the

*posterior*. is called the

*class likelihood*. is calle the

*evidence*. We can rewrite this equation as

Also, the Law of total probability:

We can then rewrite the Bayes' rule asA Bayes' classifier chooses the class with the highest posterior probability

## Losses and Risks

It may be the case the decisions are not equally good or costly. Let us define action as the decision to assign the input to class and be the *loss* for taking action when the input actually belongs to . Then *expected risk* for taking action is

If we consider all mistakes to have the same loss, we get the 0/1 loss functions:

Thus to minimize risk, we choose the most probable case.

In some applications, wrong decisions have very high cost, we can add a *reject* or *doubt* action to th the actions. A possible loss function is

## Discriminant functions

Classification can also be seen as a set of *Discriminant functions*, such that we

It is equivalent to Bayes' classifier for we can define the discrimitant function as

When using 0/1 loss, we have: or This divides the feature space into*decision regions*, where . The regions are separated by

*decision boundaries*.

When there is only one case, we can define a single discriminant

and weWhen , the classification system is called a dichotomizer and for , it is a polychotomizer.

## Utility Theory

We now generalize expected risk to *utility theory*, which is concerned with making rational decisions when we are uncertain about the state.

Given evidence , the probability of state is calculated as . If we define a *utility function* , which measures how good it is to take action when the state is . The *expected utility* is

A rational decision maker chooses the action that maximizes the expected utility.

In the context of classification, maximizing the EU is equivalent to minimize expected risk.