# Bei's Study Notes

### Machine Learning Study Notes - Bayesian Estimation

Last updated: 2017-09-25 19:30:12 PDT.


We assume the sample is drawn from a distribution that is controlled by a small number of parameters. For example, if we assume the sample is drawn from a normal distribution, then we can use the mean and variance to describe it. These parameters are called the sufficient statistics of the distribution. We estimate the parameters by maximum likelihood estimation.

We start with density estimation, which is the general case of estimating $p(x)$.

## Maximum likelihood estimation

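Let $\mathcal{X} = \{x^t\}_{t=1}^N$ be an IID sample such that

$$x^t \sim p(x|\theta)$$

We want to find the $\theta$ that makes the sample $\mathcal{X}$ as likely as possible. The likelihood of parameter $\theta$ is

$$l(\theta|\mathcal{X}) \equiv p(\mathcal{X}|\theta) = \prod_{t=1}^N p(x^t|\theta)$$

Maximizing this is equivalent to maximizing its log, which is more computationally tractable. The log likelihood is defined as

$$\mathcal{L}(\theta|\mathcal{X}) \equiv \log l(\theta|\mathcal{X}) = \sum_{t=1}^N \log p(x^t|\theta)$$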
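The tractability point can be seen numerically. The sketch below is an illustration under assumed settings (a standard normal sample of 2000 points; `normal_pdf` is a helper name introduced here): the raw product of densities underflows to zero in floating point, while the sum of logs stays finite.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Direct product: every factor is below 0.4, so 2000 of them underflow to 0.0.
likelihood = 1.0
for x in sample:
    likelihood *= normal_pdf(x)

# Sum of logs: same information, but a finite and usable number.
log_likelihood = sum(math.log(normal_pdf(x)) for x in sample)

print(likelihood, log_likelihood)
```

Because the log is monotonic, the parameter value that maximizes the sum of logs is the same one that maximizes the product.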

### Bernoulli Density

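If the distribution is Bernoulli with a single parameter $p$:

$$P(x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$$

The log likelihood is

$$\mathcal{L}(p|\mathcal{X}) = \log \prod_{t=1}^N p^{x^t}(1-p)^{1-x^t} = \sum_t x^t \log p + \left(N - \sum_t x^t\right)\log(1-p)$$

To maximize $\mathcal{L}$, we solve $d\mathcal{L}/dp = 0$. The estimate is

$$\hat{p} = \frac{\sum_t x^t}{N}$$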

Note that the estimate is a function of the sample and is another random variable.
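As a small illustration (the two samples are made up, and `bernoulli_mle` is a name introduced here), the estimate is the fraction of ones in the sample, and two different samples give two different estimates:

```python
def bernoulli_mle(sample):
    """MLE of the Bernoulli parameter: the fraction of ones in a 0/1 sample."""
    return sum(sample) / len(sample)

sample_a = [1, 0, 1, 1, 0, 1, 0, 1]  # 5 ones out of 8
sample_b = [0, 0, 1, 0, 1, 0, 0, 1]  # 3 ones out of 8

p_hat_a = bernoulli_mle(sample_a)
p_hat_b = bernoulli_mle(sample_b)
print(p_hat_a, p_hat_b)  # 0.625 0.375
```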

The MLE of is

### Gaussian Density

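The density function is

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

The MLE is obtained by setting the partial derivatives of the log likelihood to zero:

$$m = \frac{\sum_t x^t}{N}, \qquad s^2 = \frac{\sum_t (x^t - m)^2}{N}$$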
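A minimal sketch of the Gaussian MLE in plain Python, assuming the standard result that the estimate of the mean is the sample average and the estimate of the variance divides by $N$ (not $N-1$). The function name and data are illustrative.

```python
def gaussian_mle(sample):
    """Return (m, s2): the MLE of the mean and of the variance."""
    n = len(sample)
    m = sum(sample) / n
    s2 = sum((x - m) ** 2 for x in sample) / n  # divides by N, not N - 1
    return m, s2

m, s2 = gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(m, s2)  # 5.0 4.0
```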

### Evaluating the estimator: bias and variance

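An estimator $d = d(\mathcal{X})$ is a random variable based on a sample. To evaluate the quality of an estimator we use the mean square error (MSE)

$$r(d, \theta) = E\left[(d(\mathcal{X}) - \theta)^2\right]$$

where $\theta$ is the true parameter of the distribution that $\mathcal{X}$ is drawn from.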

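The bias of an estimator is

$$b_\theta(d) = E[d(\mathcal{X})] - \theta$$

If $b_\theta(d) = 0$ for all $\theta$ values, we say that $d$ is an unbiased estimator of $\theta$. Now let's check the MLE of the variance, $s^2 = \sum_t (x^t - m)^2 / N$:

$$E[s^2] = \frac{N-1}{N}\sigma^2 \neq \sigma^2$$

So the MLE of $\sigma^2$ is not unbiased. However, when $N$ is large, the difference is negligible. This is called an asymptotically unbiased estimator.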
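The bias can be checked by simulation. This is a sketch under assumed settings (samples of size $N = 5$ from a normal with variance 4, 20000 repetitions, fixed seed); the average of the variance MLE should land near $(N-1)/N \cdot \sigma^2 = 3.2$ rather than 4.

```python
import random

def mle_variance(sample):
    """Variance MLE: squared deviations from the sample mean, divided by N."""
    n = len(sample)
    m = sum(sample) / n
    return sum((x - m) ** 2 for x in sample) / n

random.seed(1)  # fixed seed so the run is repeatable
N, trials, sigma2 = 5, 20000, 4.0

# Average the estimator over many independent samples to approximate E[s^2].
avg_s2 = sum(
    mle_variance([random.gauss(0.0, sigma2 ** 0.5) for _ in range(N)])
    for _ in range(trials)
) / trials

expected = (N - 1) / N * sigma2  # value predicted by the bias formula
print(avg_s2, expected)
```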

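The MSE can be written as the sum of the variance and the squared bias:

$$r(d,\theta) = E\left[(d - E[d])^2\right] + (E[d] - \theta)^2$$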
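A short derivation of this decomposition, writing $d$ for $d(\mathcal{X})$ and $r(d,\theta)$ for the MSE (notation assumed here): add and subtract $E[d]$ inside the square.

$$
\begin{aligned}
r(d,\theta) &= E\left[(d-\theta)^2\right] \\
&= E\left[(d - E[d] + E[d] - \theta)^2\right] \\
&= E\left[(d-E[d])^2\right] + 2\,(E[d]-\theta)\,E\left[d - E[d]\right] + (E[d]-\theta)^2 \\
&= \underbrace{E\left[(d-E[d])^2\right]}_{\text{variance}} + \underbrace{(E[d]-\theta)^2}_{\text{bias}^2}
\end{aligned}
$$

The cross term vanishes because $E[d - E[d]] = 0$ and $E[d]-\theta$ is a constant.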

## Bayes' Estimator

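Sometimes we have some prior information about the parameter $\theta$ we try to estimate, expressed as a density $p(\theta)$. This is called the prior density. We combine it with the data and get the posterior density of $\theta$

$$p(\theta|\mathcal{X}) = \frac{p(\mathcal{X}|\theta)\,p(\theta)}{p(\mathcal{X})} = \frac{p(\mathcal{X}|\theta)\,p(\theta)}{\int p(\mathcal{X}|\theta')\,p(\theta')\,d\theta'}$$

Then we estimate the density at $x$

$$p(x|\mathcal{X}) = \int p(x|\theta, \mathcal{X})\,p(\theta|\mathcal{X})\,d\theta = \int p(x|\theta)\,p(\theta|\mathcal{X})\,d\theta$$

where $p(x|\theta,\mathcal{X}) = p(x|\theta)$ because $\theta$ is the sufficient statistic. When the integral is intractable, we can assume the peak of $p(\theta|\mathcal{X})$ is narrow and use only the maximum a posteriori (MAP) estimate:

$$\theta_{MAP} = \arg\max_\theta p(\theta|\mathcal{X})$$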

If we have no prior reason to favor some values of $\theta$, then the prior density is flat, the posterior has the same form as the likelihood, and the MAP estimate coincides with the maximum likelihood estimate.

Another estimator is the Bayes' estimator, which is defined as the expected value of the posterior density

$$\theta_{Bayes} = E[\theta|\mathcal{X}] = \int \theta\,p(\theta|\mathcal{X})\,d\theta$$

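In the case of a normal density, if the prior density of $\theta$ is $\mathcal{N}(\mu_0, \sigma_0^2)$, and $x^t \sim \mathcal{N}(\theta, \sigma^2)$, then

$$p(\mathcal{X}|\theta) = \frac{1}{(2\pi)^{N/2}\sigma^N}\exp\left[-\frac{\sum_t (x^t-\theta)^2}{2\sigma^2}\right], \qquad p(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left[-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right]$$

Then the Bayes' estimator is the mean of the posterior

$$p(\theta|\mathcal{X}) \propto p(\mathcal{X}|\theta)\,p(\theta) \propto \exp\left[-\frac{\sum_t (x^t-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right]$$

The exponent is quadratic in $\theta$. Since we know this is a proper density, this means the posterior is Gaussian, with mean

$$E[\theta|\mathcal{X}] = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,m + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0$$

where $m = \sum_t x^t / N$ is the sample mean. Thus the Bayes' estimator of $\theta$ is a weighted average of the prior mean $\mu_0$ and the sample mean $m$, with weights inversely proportional to their variances.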
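The weighting can be sketched in a few lines of Python. This assumes the standard conjugate-normal result with known data variance; the names `mu0`, `sigma0_sq`, and `sigma_sq` (prior mean, prior variance, data variance) and the numbers are illustrative.

```python
def bayes_estimate_mean(sample, mu0, sigma0_sq, sigma_sq):
    """Posterior mean of a normal mean under a normal prior, known variance."""
    n = len(sample)
    m = sum(sample) / n                 # sample mean
    data_precision = n / sigma_sq       # weight carried by the data
    prior_precision = 1.0 / sigma0_sq   # weight carried by the prior
    w = data_precision / (data_precision + prior_precision)
    return w * m + (1.0 - w) * mu0

sample = [4.0, 6.0, 5.0, 5.0]  # sample mean is 5.0
estimate = bayes_estimate_mean(sample, mu0=0.0, sigma0_sq=1.0, sigma_sq=4.0)
print(estimate)  # 2.5: data and prior carry equal weight here
```

With more data, or with a wider prior, the weight `w` moves toward 1 and the estimate approaches the sample mean.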

Both MAP and Bayes' estimators reduce the whole posterior density to a single point and lose information unless the posterior is unimodal and makes a narrow peak around these points. We can instead use a Monte Carlo approach that generates samples from the posterior density. There are also approximation methods to evaluate the full integral.

## Estimating the Parameter of a Distribution

### Discrete Variables

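Given a multinomial variable taking one of $K$ distinct values, let $x_i^t = 1$ if instance $t$ takes value $i$, or $0$ otherwise. The parameter we need to estimate is a vector $\boldsymbol{q} = (q_1, \dots, q_K)^T$ with $q_i \geq 0$ and $\sum_i q_i = 1$. The likelihood is

$$p(\mathcal{X}|\boldsymbol{q}) = \prod_{t=1}^N \prod_{i=1}^K q_i^{x_i^t}$$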

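The prior distribution we use is the Dirichlet distribution

$$\text{Dirichlet}(\boldsymbol{q}|\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^K q_i^{\alpha_i - 1}$$

where the normalization factor is the multivariate Beta function

$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma(\alpha_0)}, \qquad \alpha_0 = \sum_{i=1}^K \alpha_i$$

where $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)^T$ are the parameters of the prior, called hyperparameters. $\Gamma(x)$ is the Gamma function defined as

$$\Gamma(x) = \int_0^\infty u^{x-1}e^{-u}\,du$$

Note that $\Gamma(n) = (n-1)!$ for a positive integer $n$.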

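Given the prior and the likelihood, we can derive the posterior

$$p(\boldsymbol{q}|\mathcal{X}) \propto p(\mathcal{X}|\boldsymbol{q})\,p(\boldsymbol{q}|\boldsymbol{\alpha}) \propto \prod_{i=1}^K q_i^{\alpha_i + N_i - 1}$$

where $N_i = \sum_{t=1}^N x_i^t$.

We can see that the posterior distribution and the prior distribution are of the same form. We call such priors conjugate priors. We have

$$p(\boldsymbol{q}|\mathcal{X}) = \text{Dirichlet}(\boldsymbol{q}|\boldsymbol{\alpha} + \boldsymbol{n})$$

where $\boldsymbol{n} = (N_1, \dots, N_K)^T$ and $\sum_i N_i = N$.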

Looking at the posterior, we can obtain an intuition for the hyperparameters $\alpha_i$. Just as $N_i$ are counts of occurrences of value $i$ in a sample of $N$, we can view $\alpha_i$ as counts of occurrences of value $i$ in an imaginary sample of $\alpha_0 = \sum_i \alpha_i$ instances. Note that a larger $\alpha_0$ implies that we have a higher confidence (a more peaked distribution) in our subjective proportions.
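The "imaginary sample" reading can be made concrete with a small sketch. It assumes the conjugate update adds observed counts to the hyperparameters, and uses the known fact that a Dirichlet with parameters $\alpha_i$ has mean $\alpha_i / \sum_j \alpha_j$. The function names and numbers are illustrative.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet: each parameter divided by their sum."""
    total = sum(alpha)
    return [a / total for a in alpha]

alpha = [2.0, 2.0, 2.0]  # prior: 6 imaginary instances, spread evenly
counts = [7, 2, 1]       # observed counts in a sample of 10 instances

posterior_alpha = dirichlet_posterior(alpha, counts)
posterior_mean = dirichlet_mean(posterior_alpha)
print(posterior_alpha)  # [9.0, 4.0, 3.0]
print(posterior_mean)   # [0.5625, 0.25, 0.1875]
```

The posterior mean sits between the prior proportions (1/3 each) and the observed frequencies (0.7, 0.2, 0.1), pulled toward the data because the real sample is larger than the imaginary one.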

When the variable is binary, the multinomial becomes Bernoulli and the Dirichlet distribution becomes the Beta distribution.

### Continuous Variables

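Suppose $x^t \sim \mathcal{N}(\mu, \sigma^2)$, with parameters $\mu$ and $\sigma^2$. The likelihood is

$$p(\mathcal{X}|\mu,\sigma^2) = \prod_{t=1}^N \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x^t-\mu)^2}{2\sigma^2}\right]$$

The conjugate prior for $\mu$ is Gaussian, $p(\mu) \sim \mathcal{N}(\mu_0, \sigma_0^2)$, and the posterior is

$$p(\mu|\mathcal{X}) \sim \mathcal{N}(\mu_N, \sigma_N^2)$$

where

$$\mu_N = \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,m, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2}+\frac{N}{\sigma^2}$$

where $m = \sum_t x^t / N$ is the sample average. The prior acts as if we had done some imaginary experiments as well.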

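It is more intuitive to work with the reciprocal of the variance than with the variance itself. Let $\lambda \equiv 1/\sigma^2$ be the precision. The likelihood is written as:

$$p(\mathcal{X}|\lambda) = \prod_{t=1}^N \frac{\lambda^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\lambda}{2}(x^t-\mu)^2\right] = \lambda^{N/2}(2\pi)^{-N/2}\exp\left[-\frac{\lambda}{2}\sum_t (x^t-\mu)^2\right]$$

The conjugate prior for the precision is the Gamma distribution:

$$p(\lambda) \sim \text{gamma}(a_0, b_0) = \frac{1}{\Gamma(a_0)}\,b_0^{a_0}\,\lambda^{a_0-1}\exp(-b_0\lambda)$$

The posterior is

$$p(\lambda|\mathcal{X}) \sim \text{gamma}(a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \quad b_N = b_0 + \frac{N}{2}s^2$$

where $s^2 = \sum_t (x^t-\mu)^2 / N$ is the sample variance.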
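A minimal sketch of the precision update, assuming a known mean `mu` and a Gamma prior in shape/rate form with hyperparameters named `a0` and `b0` (names and numbers are illustrative). A Gamma distribution with shape $a$ and rate $b$ has mean $a/b$, which gives a point estimate of the precision.

```python
def gamma_posterior(sample, mu, a0, b0):
    """Update Gamma(shape a0, rate b0) on the precision of N(mu, 1/precision)."""
    n = len(sample)
    s2 = sum((x - mu) ** 2 for x in sample) / n  # sample variance around mu
    a_n = a0 + n / 2.0
    b_n = b0 + n / 2.0 * s2
    return a_n, b_n

a_n, b_n = gamma_posterior([1.0, 3.0, 2.0, 2.0], mu=2.0, a0=1.0, b0=1.0)
print(a_n, b_n, a_n / b_n)  # 3.0 2.0 1.5
```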

## Bayesian Estimation of the Parameters of a Function

### Regression

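Given a linear regression model:

$$r = \boldsymbol{w}^T\boldsymbol{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1/\beta)$$

where $\beta$ is the precision of the additive noise.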

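The parameters are the weights $\boldsymbol{w}$ and we have a sample $\mathcal{X} = \{\boldsymbol{x}^t, r^t\}_{t=1}^N$. We can break the sample down into a matrix of inputs and a vector of desired outputs as $\mathcal{X} = [\mathbf{X}, \boldsymbol{r}]$, where each row of $\mathbf{X}$ is an instance. We have

$$p(r^t|\boldsymbol{x}^t, \boldsymbol{w}, \beta) \sim \mathcal{N}(\boldsymbol{w}^T\boldsymbol{x}^t, 1/\beta)$$

and, dropping $\log p(\mathbf{X})$, which does not depend on $\boldsymbol{w}$, the log likelihood is

$$\mathcal{L}(\boldsymbol{w}|\mathcal{X}) = \log p(\boldsymbol{r}|\mathbf{X},\boldsymbol{w},\beta) = -N\log\sqrt{2\pi} + N\log\sqrt{\beta} - \frac{\beta}{2}\sum_{t=1}^N (r^t - \boldsymbol{w}^T\boldsymbol{x}^t)^2$$

For the ML estimate, we find the $\boldsymbol{w}$ that maximizes the likelihood or, equivalently, minimizes the sum of squared errors in the last term

$$E = \sum_{t=1}^N (r^t - \boldsymbol{w}^T\boldsymbol{x}^t)^2 = (\boldsymbol{r} - \mathbf{X}\boldsymbol{w})^T(\boldsymbol{r} - \mathbf{X}\boldsymbol{w})$$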

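Taking the derivative with respect to $\boldsymbol{w}$ and setting it to $0$, we get the maximum likelihood estimator

$$\boldsymbol{w}_{ML} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{r}$$

The predicted output for a test input $\boldsymbol{x}'$ is

$$r' = \boldsymbol{w}_{ML}^T\boldsymbol{x}'$$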
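A minimal NumPy sketch of the least-squares estimate, assuming a toy data set generated exactly by $r = 1 + 2x$ with a constant column for the intercept (the data and variable names are illustrative). It solves the normal equations rather than forming the inverse explicitly, which is a common numerical choice and not something the notes prescribe.

```python
import numpy as np

# Each row of X is one instance: a constant 1 for the intercept, then x.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
r = np.array([1.0, 3.0, 5.0, 7.0])  # outputs from r = 1 + 2x, no noise

# Solve (X^T X) w = X^T r for the maximum likelihood weights.
w_ml = np.linalg.solve(X.T @ X, X.T @ r)

# Predict the output for a new input x' = 4 (with its intercept term).
x_new = np.array([1.0, 4.0])
r_new = w_ml @ x_new
print(w_ml, r_new)
```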

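In the Bayesian approach, we define an independent Gaussian prior on the weights

$$p(\boldsymbol{w}) \sim \mathcal{N}(\boldsymbol{0}, (1/\alpha)\mathbf{I})$$

The posterior is

$$p(\boldsymbol{w}|\mathbf{X},\boldsymbol{r}) \sim \mathcal{N}(\boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$$

where

$$\boldsymbol{\mu}_N = \beta\,\boldsymbol{\Sigma}_N\mathbf{X}^T\boldsymbol{r}, \qquad \boldsymbol{\Sigma}_N = (\alpha\mathbf{I} + \beta\mathbf{X}^T\mathbf{X})^{-1}$$

The mean output for a new input $\boldsymbol{x}'$ is

$$r' = \int (\boldsymbol{w}^T\boldsymbol{x}')\,p(\boldsymbol{w}|\mathbf{X},\boldsymbol{r})\,d\boldsymbol{w}$$

If we want to use a point estimate, the MAP gives

$$\boldsymbol{w}_{MAP} = \boldsymbol{\mu}_N = \beta(\alpha\mathbf{I} + \beta\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{r}$$

and the estimate is

$$r' = (\boldsymbol{x}')^T\boldsymbol{\mu}_N$$

with variance

$$\text{Var}(r') = \frac{1}{\beta} + (\boldsymbol{x}')^T\boldsymbol{\Sigma}_N\boldsymbol{x}'$$

Comparing with the ML estimate, the MAP estimate under the Gaussian prior reduces to $\boldsymbol{w}_{ML}$ when $\alpha = 0$.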
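The posterior and the predictive variance can be sketched in NumPy as follows. This assumes the standard zero-mean Gaussian prior with precision `alpha` and noise precision `beta`; the function names, data, and hyperparameter values are illustrative.

```python
import numpy as np

def posterior(X, r, alpha, beta):
    """Mean and covariance of the Gaussian posterior over the weights."""
    d = X.shape[1]
    Sigma_N = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
    mu_N = beta * Sigma_N @ X.T @ r
    return mu_N, Sigma_N

def predict(x_new, mu_N, Sigma_N, beta):
    """Point prediction and its variance for one new input."""
    mean = x_new @ mu_N
    var = 1.0 / beta + x_new @ Sigma_N @ x_new  # noise + weight uncertainty
    return mean, var

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
r = np.array([1.0, 3.0, 5.0, 7.0])

mu_N, Sigma_N = posterior(X, r, alpha=1.0, beta=25.0)
mean, var = predict(np.array([1.0, 4.0]), mu_N, Sigma_N, beta=25.0)
print(mu_N, mean, var)
```

The variance has two parts: the noise term $1/\beta$, which no amount of data removes, and a term from the remaining uncertainty in the weights, which shrinks as the sample grows.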

### Ridge Regression

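The log of the posterior is

$$\log p(\boldsymbol{w}|\mathbf{X},\boldsymbol{r}) = -\frac{\beta}{2}\sum_t (r^t - \boldsymbol{w}^T\boldsymbol{x}^t)^2 - \frac{\alpha}{2}\boldsymbol{w}^T\boldsymbol{w} + c$$

where $c$ collects the terms that do not depend on $\boldsymbol{w}$. We maximize this to find the MAP estimate. In general, we can write an augmented error function

$$E_{ridge}(\boldsymbol{w}|\mathcal{X}) = \sum_t (r^t - \boldsymbol{w}^T\boldsymbol{x}^t)^2 + \lambda\sum_i w_i^2$$

where $\lambda \equiv \alpha/\beta$. This is known as parameter shrinkage or ridge regression.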

In neural networks, this is known as L2 regularization, or weight decay.
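A short sketch of the link between the MAP estimate and ridge regression, assuming a Gaussian prior with precision `alpha`, noise precision `beta`, and penalty `lam = alpha / beta` (names, data, and values are illustrative). Minimizing the augmented error in closed form gives the same weights as the posterior mean.

```python
import numpy as np

def ridge_fit(X, r, lam):
    """Minimizer of sum of squared errors + lam * sum of squared weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)

def ridge_error(w, X, r, lam):
    """Augmented error: squared error plus the L2 penalty."""
    resid = r - X @ w
    return resid @ resid + lam * (w @ w)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
r = np.array([1.1, 2.9, 5.2, 6.8])

alpha, beta = 2.0, 4.0
lam = alpha / beta

w_ridge = ridge_fit(X, r, lam)
# MAP estimate written in terms of alpha and beta, for comparison.
w_map = beta * np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X) @ X.T @ r
print(w_ridge, w_map)
```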

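This does reduce the squares of the weights, but as a weight gets smaller, the pressure on it reduces as well, so weights are rarely driven exactly to zero and ridge regression cannot be used for feature selection. For this, one can use a Laplacian prior that uses the $L_1$ norm instead of the $L_2$ norm

$$p(\boldsymbol{w}|\alpha) = \prod_i \frac{\alpha}{2}\exp(-\alpha|w_i|)$$

The log posterior is

$$\log p(\boldsymbol{w}|\mathbf{X},\boldsymbol{r}) = -\frac{\beta}{2}\sum_t (r^t-\boldsymbol{w}^T\boldsymbol{x}^t)^2 - \alpha\sum_i |w_i| + c$$

The posterior is no longer Gaussian and the MAP estimate is found by minimizing

$$E_{lasso}(\boldsymbol{w}|\mathcal{X}) = \sum_t (r^t-\boldsymbol{w}^T\boldsymbol{x}^t)^2 + \frac{2\alpha}{\beta}\sum_i |w_i|$$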

This is known as lasso (least absolute shrinkage and selection operator).
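The selection effect can be seen on a tiny example. The sketch below uses coordinate descent with soft thresholding, which is one common way to minimize the lasso objective and is not a method taken from these notes. Here `lam` stands for the whole coefficient multiplying $\sum_i |w_i|$, and the data (an orthogonal design whose third feature has only a weak effect) are made up for illustration.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; anything within [-t, t] becomes 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, r, lam, n_sweeps=100):
    """Minimize sum_t (r^t - w^T x^t)^2 + lam * sum_i |w_i|."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution added back.
            partial = r - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial
            z = X[:, j] @ X[:, j]
            # Exact minimizer over w[j] with the other weights held fixed.
            w[j] = soft_threshold(rho, lam / 2.0) / z
    return w

X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, -1.0],
              [1.0, -1.0, -1.0]])
r = np.array([5.2, -0.8, 4.8, -1.2])
lam = 2.0

w_lasso = lasso_coordinate_descent(X, r, lam)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ r)
print(w_lasso)  # third weight is exactly zero
print(w_ridge)  # third weight is small but nonzero
```

Lasso removes the weak third feature entirely, while ridge with the same penalty coefficient only shrinks it.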