We assume the sample is drawn from a distribution that is controlled by a small number of parameters. For example, if we assume the sample is drawn from a normal distribution, then the mean and variance are enough to describe it. These parameters are called the sufficient statistics of the distribution. We estimate the parameters by maximum likelihood estimation.
We start with density estimation, which is the general case of estimating $p(x)$.
Maximum likelihood estimation
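Let $\mathcal{X} = \{x^t\}_{t=1}^N$ be an independent and identically distributed (IID) sample such that
$$x^t \sim p(x \mid \theta)$$
We want to find the $\theta$ that makes the sample $\mathcal{X}$ as likely as possible. The likelihood of the parameter $\theta$ given $\mathcal{X}$ is
$$l(\theta \mid \mathcal{X}) \equiv p(\mathcal{X} \mid \theta) = \prod_{t=1}^N p(x^t \mid \theta)$$
Maximizing this is equivalent to maximizing its log, which is more computationally tractable. The log likelihood is defined as
$$\mathcal{L}(\theta \mid \mathcal{X}) \equiv \log l(\theta \mid \mathcal{X}) = \sum_{t=1}^N \log p(x^t \mid \theta)$$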
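The tractability point can be illustrated numerically. The sketch below is only an illustration under assumed values: the sample of 2000 instances, each with density 0.1, is hypothetical, and the function names are not from the text. It shows the product of densities underflowing in double precision while the sum of logs stays usable.

```python
import math

def likelihood(densities):
    """Product of per-instance densities p(x^t | theta)."""
    result = 1.0
    for p in densities:
        result *= p
    return result

def log_likelihood(densities):
    """Sum of per-instance log densities; it has the same maximizer as the product."""
    return sum(math.log(p) for p in densities)

# Hypothetical sample: 2000 instances, each with density 0.1 under theta.
densities = [0.1] * 2000

print(likelihood(densities))      # underflows to 0.0 in double precision
print(log_likelihood(densities))  # about -4605.17, still usable
```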
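If the distribution is Bernoulli with a single parameter $p$:
$$P(x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$
The log likelihood is
$$\mathcal{L}(p \mid \mathcal{X}) = \log \prod_{t=1}^N p^{x^t} (1 - p)^{1 - x^t} = \sum_t x^t \log p + \left(N - \sum_t x^t\right) \log (1 - p)$$
To maximize $\mathcal{L}$, we solve $d\mathcal{L}/dp = 0$. The estimate is
$$\hat{p} = \frac{\sum_t x^t}{N}$$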
Note that the estimate is a function of the sample and is another random variable.
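As a quick check of the Bernoulli case, the sketch below compares the closed-form estimate (the fraction of ones) with a brute-force search over a grid of candidate values. The sample and the grid resolution are assumptions made for this illustration.

```python
import math

def bernoulli_log_likelihood(p, sample):
    """L(p | X) = sum_t x^t log p + (N - sum_t x^t) log(1 - p)."""
    ones = sum(sample)
    n = len(sample)
    return ones * math.log(p) + (n - ones) * math.log(1.0 - p)

def bernoulli_mle(sample):
    """Closed-form maximizer: the fraction of ones in the sample."""
    return sum(sample) / len(sample)

sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical data, 7 ones out of 10
p_hat = bernoulli_mle(sample)

# Brute-force check on a grid of candidate values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, sample))

print(p_hat, p_grid)  # both 0.7
```

A different sample would give a different estimate, which is the sense in which the estimate is itself a random variable.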
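For a multinomial variable that takes one of $K$ states with probabilities $p_i$, where $x_i^t = 1$ if instance $t$ is in state $i$ and $0$ otherwise, the MLE of $p_i$ is
$$\hat{p}_i = \frac{\sum_t x_i^t}{N}$$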
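For a Gaussian (normal) distribution with mean $\mu$ and variance $\sigma^2$, the density function is
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$$
Given a sample $\{x^t\}_{t=1}^N$, the log likelihood is
$$\mathcal{L}(\mu, \sigma \mid \mathcal{X}) = -\frac{N}{2} \log(2\pi) - N \log \sigma - \frac{\sum_t (x^t - \mu)^2}{2\sigma^2}$$
The MLE is obtained by setting the partial derivatives of the log likelihood to zero:
$$m = \frac{\sum_t x^t}{N}, \qquad s^2 = \frac{\sum_t (x^t - m)^2}{N}$$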
Evaluating the estimator: bias and variance
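An estimator $d = d(\mathcal{X})$ is a random variable based on a sample $\mathcal{X}$. To evaluate the quality of an estimator we measure its mean squared error (MSE)
$$r(d, \theta) = E\left[(d(\mathcal{X}) - \theta)^2\right]$$
where $\theta$ is the true parameter of the distribution that $\mathcal{X}$ is drawn from.
The bias of an estimator is
$$b_\theta(d) = E[d(\mathcal{X})] - \theta$$
If $b_\theta(d) = 0$ for all values of $\theta$, we say that $d$ is an unbiased estimator of $\theta$.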
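Now let's check $s^2 = \sum_t (x^t - m)^2 / N$, the MLE of the variance $\sigma^2$ of a distribution with mean $\mu$, where $m = \sum_t x^t / N$ is the sample mean. Since $\sum_t (x^t - m)^2 = \sum_t (x^t)^2 - N m^2$, $E[(x^t)^2] = \sigma^2 + \mu^2$ and $E[m^2] = \sigma^2 / N + \mu^2$, we get
$$E[s^2] = \frac{N(\sigma^2 + \mu^2) - N(\sigma^2 / N + \mu^2)}{N} = \frac{N - 1}{N}\, \sigma^2 \neq \sigma^2$$
So the MLE of $\sigma^2$ is not unbiased. However, when $N$ is large, the difference is negligible. This is called an asymptotically unbiased estimator.
For any estimator $d$ of a parameter $\theta$, the MSE can be written as the variance plus the squared bias:
$$E\left[(d - \theta)^2\right] = E\left[(d - E[d])^2\right] + (E[d] - \theta)^2$$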
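Both claims can be checked exactly on a small example. The sketch below assumes a hypothetical distribution that is uniform over three values and enumerates every possible sample of size 3, so expectations are plain averages computed with exact fractions. The distribution and sample size are choices made for this illustration only.

```python
from fractions import Fraction
from itertools import product

values = [0, 1, 2]   # hypothetical distribution: uniform over these values
N = 3                # sample size

mu = Fraction(sum(values), len(values))
sigma2 = sum((Fraction(v) - mu) ** 2 for v in values) / len(values)

def s2(sample):
    """MLE of the variance: divides by N, not N - 1."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / len(sample)

# Every length-N sample is equally likely, so expectations are plain averages.
samples = list(product(values, repeat=N))
estimates = [s2(s) for s in samples]

expected = sum(estimates) / len(estimates)
bias = expected - sigma2
variance = sum((d - expected) ** 2 for d in estimates) / len(estimates)
mse = sum((d - sigma2) ** 2 for d in estimates) / len(estimates)

print(expected, Fraction(N - 1, N) * sigma2)  # 4/9 4/9
print(mse == variance + bias ** 2)            # True
```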
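Sometimes we have some prior information about the parameter $\theta$ we try to estimate. This is expressed as the prior density $p(\theta)$. We combine it with the data $\mathcal{X}$ and get the posterior density of $\theta$
$$p(\theta \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{p(\mathcal{X})} = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{\int p(\mathcal{X} \mid \theta')\, p(\theta')\, d\theta'}$$
Then we estimate the density at $x$
$$p(x \mid \mathcal{X}) = \int p(x \mid \theta, \mathcal{X})\, p(\theta \mid \mathcal{X})\, d\theta = \int p(x \mid \theta)\, p(\theta \mid \mathcal{X})\, d\theta$$
where $p(x \mid \theta, \mathcal{X}) = p(x \mid \theta)$ because $\theta$ is the sufficient statistics: once $\theta$ is known, the sample adds nothing. When the integral is intractable, we can assume the peak of $p(\theta \mid \mathcal{X})$ is narrow and use only the maximum a posteriori (MAP) estimate:
$$\theta_{MAP} = \arg\max_\theta\, p(\theta \mid \mathcal{X})$$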
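If we have no prior reason to favor some values of $\theta$, then the prior density is flat, the posterior has the same form as the likelihood $p(\mathcal{X} \mid \theta)$, and the MAP estimate coincides with the maximum likelihood estimate.
Another estimator is the Bayes' estimator, which is defined as the expected value of the posterior density
$$\theta_{Bayes} = E[\theta \mid \mathcal{X}] = \int \theta\, p(\theta \mid \mathcal{X})\, d\theta$$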
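To make the difference between these estimators concrete, here is a minimal sketch that approximates a posterior on a grid. It assumes a Bernoulli model with a flat prior and a hypothetical sample of four instances; the grid approximation is a device chosen for this illustration, not a method from the text.

```python
def grid_posterior(sample, prior, grid):
    """Posterior on a grid: likelihood times prior, normalized to sum to 1."""
    ones = sum(sample)
    zeros = len(sample) - ones
    unnorm = [(th ** ones) * ((1 - th) ** zeros) * prior(th) for th in grid]
    total = sum(unnorm)
    return [u / total for u in unnorm]

sample = [1, 1, 0, 1]                       # hypothetical Bernoulli data
grid = [i / 1000 for i in range(1, 1000)]   # candidate theta values

flat = grid_posterior(sample, lambda th: 1.0, grid)
theta_map = grid[max(range(len(grid)), key=lambda i: flat[i])]
theta_bayes = sum(th * w for th, w in zip(grid, flat))

print(theta_map)    # 0.75, the same as the ML estimate sum(sample) / N
print(theta_bayes)  # close to 2/3: the posterior mean differs from the mode
```

With the flat prior the MAP estimate lands on the ML estimate, while the Bayes' estimator is pulled toward the middle because the posterior is skewed.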
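In the case of a normal density, if the prior density of $\theta$ is $\mathcal{N}(\mu_0, \sigma_0^2)$ and $x^t \sim \mathcal{N}(\theta, \sigma^2)$, with $\mu_0$, $\sigma_0^2$ and $\sigma^2$ known, then
$$p(\mathcal{X} \mid \theta) = \frac{1}{(2\pi)^{N/2} \sigma^N} \exp\left[-\frac{\sum_t (x^t - \theta)^2}{2\sigma^2}\right], \qquad p(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left[-\frac{(\theta - \mu_0)^2}{2\sigma_0^2}\right]$$
Keeping only the terms that depend on $\theta$, and writing $m = \sum_t x^t / N$ for the sample mean, the posterior satisfies
$$p(\theta \mid \mathcal{X}) \propto p(\mathcal{X} \mid \theta)\, p(\theta) \propto \exp\left[-\frac{1}{2}\left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\theta^2 + \left(\frac{N m}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\theta\right]$$
Since we know this is a proper density, it must be a normal density, and completing the square gives its mean, which is the Bayes' estimator:
$$\theta_{Bayes} = E[\theta \mid \mathcal{X}] = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, m + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0$$
Thus the Bayes' estimator of $\theta$ is the weighted mean of the prior mean $\mu_0$ and the sample mean $m$, with weights inversely proportional to their variances.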
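The weighting can be sketched in a few lines. The function name, the measurements and the prior settings below are hypothetical; the formula is the precision-weighted mean of the sample mean and the prior mean for a normal likelihood with known variance and a normal prior.

```python
def bayes_estimate_normal_mean(sample, sigma2, mu0, sigma0_2):
    """E[theta | X] for x^t ~ N(theta, sigma2) with prior theta ~ N(mu0, sigma0_2)."""
    n = len(sample)
    m = sum(sample) / n
    data_precision = n / sigma2        # weight on the sample mean
    prior_precision = 1.0 / sigma0_2   # weight on the prior mean
    total = data_precision + prior_precision
    return (data_precision / total) * m + (prior_precision / total) * mu0

sample = [4.8, 5.1, 5.3, 4.9, 5.4]   # hypothetical measurements, mean 5.1

# Informative prior centered at 0: the estimate is pulled toward 0.
tight = bayes_estimate_normal_mean(sample, sigma2=1.0, mu0=0.0, sigma0_2=1.0)
# Nearly flat prior: the estimate is essentially the sample mean.
vague = bayes_estimate_normal_mean(sample, sigma2=1.0, mu0=0.0, sigma0_2=1e6)

print(tight, vague)
```

As the sample grows or the prior variance grows, the weight on the sample mean approaches 1.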
Both MAP and Bayes' estimators reduce the whole posterior density to a single point and lose information unless the posterior is unimodal and makes a narrow peak around these points. We can instead use a Monte Carlo approach that generates samples from the posterior density. There are also approximation methods to evaluate the full integral.
Estimating the Parameter of a Distribution
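Given a multinomial variable taking one of $K$ distinct values, let $x_i^t = 1$ if instance $t$ takes value $i$, or $0$ otherwise. The parameter we need to estimate is a vector of probabilities $\mathbf{q} = [q_1, \ldots, q_K]^T$ with $q_i \geq 0$ and $\sum_i q_i = 1$. The likelihood of a sample $\mathcal{X}$ of $N$ instances is
$$p(\mathcal{X} \mid \mathbf{q}) = \prod_{t=1}^N \prod_{i=1}^K q_i^{x_i^t}$$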
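The prior distribution we use for the probability vector $\mathbf{q} = [q_1, \ldots, q_K]^T$ is the Dirichlet distribution
$$\text{Dirichlet}(\mathbf{q} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^K q_i^{\alpha_i - 1}$$
where the normalization factor is the multivariate Beta function
$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}$$
where $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_K]^T$ are the parameters of the prior, called hyperparameters. $\Gamma(x)$ is the Gamma function, defined as
$$\Gamma(x) = \int_0^\infty u^{x - 1} e^{-u}\, du$$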
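Given the Dirichlet prior with hyperparameters $\alpha_i$ and the multinomial likelihood, we can derive the posterior over the probabilities $q_i$
$$p(\mathbf{q} \mid \mathcal{X}) \propto p(\mathcal{X} \mid \mathbf{q})\, p(\mathbf{q} \mid \boldsymbol{\alpha}) \propto \prod_{i=1}^K q_i^{\alpha_i + N_i - 1}$$
We can see that the posterior and the prior have the same form. We call such priors conjugate priors. We have
$$p(\mathbf{q} \mid \mathcal{X}) = \text{Dirichlet}(\mathbf{q} \mid \boldsymbol{\alpha} + \mathbf{n})$$
where $\mathbf{n} = [N_1, \ldots, N_K]^T$, $N_i = \sum_t x_i^t$ is the number of instances that take value $i$, and $\sum_i N_i = N$.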
Looking at the posterior, we can obtain an intuition for the hyperparameters $\alpha_i$. Just as $N_i$ are counts of occurrences of value $i$ in a sample of $N$, we can view $\alpha_i$ as counts of occurrences of value $i$ in an imaginary sample of $\alpha_0 = \sum_i \alpha_i$ instances. Note that a larger $\alpha_0$ implies that we have a higher confidence (a more peaked distribution) in our subjective proportions.
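The pseudo-count reading of the hyperparameters can be sketched as follows. The helper names, the sample and the two priors are hypothetical; the update is the conjugate one (add the observed counts to the hyperparameters), and the mean of a Dirichlet distribution is each hyperparameter divided by their sum.

```python
def dirichlet_posterior(alpha, sample, num_values):
    """Add the observed counts N_i to the prior hyperparameters alpha_i."""
    counts = [0] * num_values
    for value in sample:
        counts[value] += 1
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i divided by the sum of all alpha_i."""
    alpha0 = sum(alpha)
    return [a / alpha0 for a in alpha]

sample = [0, 2, 2, 1, 2, 0, 2, 2]   # hypothetical draws over K = 3 values
weak = dirichlet_posterior([1, 1, 1], sample, 3)       # 3 imaginary instances
strong = dirichlet_posterior([20, 20, 20], sample, 3)  # 60 imaginary instances

print(weak, dirichlet_mean(weak))      # [3, 2, 6], mean follows the data
print(strong, dirichlet_mean(strong))  # [22, 21, 25], mean stays near uniform
```

With only eight real instances, the prior built from 60 imaginary instances dominates, while the prior built from 3 lets the data speak.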
When the variable is binary, the multinomial becomes Bernoulli and the Dirichlet distribution becomes the Beta distribution.
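When $x^t \sim \mathcal{N}(\mu, \sigma^2)$, with parameters $\mu$ and $\sigma^2$, the likelihood is
$$p(\mathcal{X} \mid \mu, \sigma^2) = \prod_{t=1}^N \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x^t - \mu)^2}{2\sigma^2}\right]$$
The conjugate prior for $\mu$ is Gaussian, $p(\mu) \sim \mathcal{N}(\mu_0, \sigma_0^2)$, and the posterior is
$$p(\mu \mid \mathcal{X}) \sim \mathcal{N}(\mu_N, \sigma_N^2), \qquad \mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, m, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$
where $m = \sum_t x^t / N$ is the sample average. The prior acts as if we had also done some imaginary experiments: $\mu_N$ is a weighted average of the prior mean $\mu_0$ and the sample mean $m$.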
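It is more intuitive to work with the reciprocal of the variance than with the variance itself. Let $\lambda = 1/\sigma^2$ be the precision. With the mean $\mu$ known, the likelihood is written as:
$$p(\mathcal{X} \mid \lambda) = \prod_{t=1}^N \frac{\lambda^{1/2}}{\sqrt{2\pi}} \exp\left[-\frac{\lambda}{2}(x^t - \mu)^2\right] = \lambda^{N/2} (2\pi)^{-N/2} \exp\left[-\frac{\lambda}{2} \sum_t (x^t - \mu)^2\right]$$
The conjugate prior for the precision is the Gamma distribution:
$$p(\lambda) \sim \text{gamma}(a_0, b_0) = \frac{1}{\Gamma(a_0)}\, b_0^{a_0}\, \lambda^{a_0 - 1} \exp(-b_0 \lambda)$$
The posterior is
$$p(\lambda \mid \mathcal{X}) \sim \text{gamma}(a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{N}{2}\, s^2$$
where $s^2 = \sum_t (x^t - \mu)^2 / N$ is the sample variance.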
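A minimal sketch of this conjugate update, assuming the mean is known and a Gamma prior in the shape and rate form. The function name, data and prior values are hypothetical. The mean of a Gamma distribution with shape $a$ and rate $b$ is $a/b$, which gives a point estimate of the precision.

```python
def gamma_posterior(sample, mu, a0, b0):
    """Posterior gamma(a_N, b_N) for the precision of N(mu, 1/lambda), with mu known."""
    n = len(sample)
    s2 = sum((x - mu) ** 2 for x in sample) / n   # sample variance about mu
    a_n = a0 + n / 2.0
    b_n = b0 + (n / 2.0) * s2
    return a_n, b_n

sample = [1.0, -1.0, 2.0, -2.0]   # hypothetical data with known mean mu = 0
a_n, b_n = gamma_posterior(sample, mu=0.0, a0=1.0, b0=1.0)

print(a_n, b_n)    # 3.0 6.0
print(a_n / b_n)   # 0.5, the posterior mean of the precision
```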
Bayesian Estimation of the Parameters of a Function
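Given a linear regression model:
$$r = \mathbf{w}^T \mathbf{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1/\beta)$$
where $\beta$ is the precision of the additive noise.
The parameters are the weights $\mathbf{w}$ and we have a sample $\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^N$. We can break down the sample into a matrix of inputs and a vector of desired outputs as $\mathcal{X} = [\mathbf{X}, \mathbf{r}]$, where each row of $\mathbf{X}$ is an instance. We have
$$p(r^t \mid \mathbf{x}^t, \mathbf{w}, \beta) \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}^t, 1/\beta)$$
and the log likelihood is
$$\mathcal{L}(\mathbf{w} \mid \mathcal{X}) = \log \prod_{t=1}^N p(r^t \mid \mathbf{x}^t, \mathbf{w}, \beta) = -N \log \sqrt{2\pi} + N \log \sqrt{\beta} - \frac{\beta}{2} \sum_t (r^t - \mathbf{w}^T \mathbf{x}^t)^2$$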
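For the ML estimate, we find the $\mathbf{w}$ that maximizes the likelihood or, equivalently, minimizes the sum of squared errors
$$E = \sum_t (r^t - \mathbf{w}^T \mathbf{x}^t)^2 = (\mathbf{r} - \mathbf{X}\mathbf{w})^T (\mathbf{r} - \mathbf{X}\mathbf{w}) = \mathbf{r}^T \mathbf{r} - 2\mathbf{w}^T \mathbf{X}^T \mathbf{r} + \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}$$
where each row of $\mathbf{X}$ is an input $\mathbf{x}^t$ and $\mathbf{r}$ is the vector of desired outputs. Taking the derivative with respect to $\mathbf{w}$ and setting it to $0$ gives $-2\mathbf{X}^T \mathbf{r} + 2\mathbf{X}^T \mathbf{X} \mathbf{w} = 0$, so the maximum likelihood estimator is
$$\mathbf{w}_{ML} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{r}$$
The predicted result for a test input $\mathbf{x}'$ is
$$r' = (\mathbf{x}')^T \mathbf{w}_{ML}$$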
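A minimal NumPy sketch of the maximum likelihood fit, assuming a small hypothetical data set generated without noise from $r = 1 + 2x$, with a column of ones added for the bias weight. Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly; that is a numerical choice made here, not something the text prescribes.

```python
import numpy as np

def fit_ml(X, r):
    """w_ML = (X^T X)^{-1} X^T r, computed by solving the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ r)

# Hypothetical noise-free data from r = 1 + 2x; the first column is the bias input.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
r = 1.0 + 2.0 * x

w_ml = fit_ml(X, r)
x_new = np.array([1.0, 4.0])   # bias input and x' = 4
prediction = x_new @ w_ml

print(w_ml)         # approximately [1. 2.]
print(prediction)   # approximately 9.0
```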
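In the Bayesian approach, we define an independent Gaussian prior on the weights $\mathbf{w}$, with precision $\alpha$:
$$p(\mathbf{w}) \sim \mathcal{N}(\mathbf{0}, (1/\alpha)\mathbf{I})$$
With input matrix $\mathbf{X}$, output vector $\mathbf{r}$ and noise precision $\beta$, the posterior is
$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{r}) \sim \mathcal{N}(\boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N), \qquad \boldsymbol{\mu}_N = \beta \boldsymbol{\Sigma}_N \mathbf{X}^T \mathbf{r}, \qquad \boldsymbol{\Sigma}_N = (\alpha \mathbf{I} + \beta \mathbf{X}^T \mathbf{X})^{-1}$$
The mean output for a new input $\mathbf{x}'$ is obtained by integrating over the full posterior
$$r' = \int (\mathbf{x}')^T \mathbf{w}\; p(\mathbf{w} \mid \mathbf{X}, \mathbf{r})\, d\mathbf{w}$$
If we want to use a point estimate, the MAP gives
$$\mathbf{w}_{MAP} = \boldsymbol{\mu}_N = \beta (\alpha \mathbf{I} + \beta \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{r}$$
and the estimate is
$$r' = (\mathbf{x}')^T \mathbf{w}_{MAP}$$
Comparing with the ML estimate $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{r}$, the MAP estimate reduces to it when $\alpha = 0$, that is, when the Gaussian prior is flat.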
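The posterior and its limiting behavior can be sketched with NumPy. The data, the function name and the values of the prior precision `alpha` and the noise precision `beta` are hypothetical. The sketch computes the posterior mean and covariance of the weights under a zero-mean Gaussian prior and checks that a nearly flat prior recovers the least-squares solution.

```python
import numpy as np

def posterior(X, r, alpha, beta):
    """Mean and covariance of p(w | X, r) under the prior N(0, (1/alpha) I)."""
    d = X.shape[1]
    sigma_n = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
    mu_n = beta * sigma_n @ X.T @ r
    return mu_n, sigma_n

x = np.array([0.0, 1.0, 2.0, 3.0])           # hypothetical inputs
X = np.column_stack([np.ones_like(x), x])    # first column is the bias input
r = np.array([1.1, 2.9, 5.2, 6.8])           # hypothetical noisy outputs

w_ml = np.linalg.solve(X.T @ X, X.T @ r)
w_map, sigma_n = posterior(X, r, alpha=1.0, beta=4.0)
w_flat, _ = posterior(X, r, alpha=1e-9, beta=4.0)   # nearly flat prior

print(w_ml)     # least-squares weights
print(w_map)    # pulled toward the prior mean 0
print(w_flat)   # nearly identical to w_ml
```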
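With a zero-mean Gaussian prior of precision $\alpha$ on the weights and noise precision $\beta$, the log of the posterior is
$$\log p(\mathbf{w} \mid \mathbf{X}, \mathbf{r}) = -\frac{\beta}{2} \sum_t (r^t - \mathbf{w}^T \mathbf{x}^t)^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + c$$
where $c$ does not depend on $\mathbf{w}$. We maximize this to find the MAP estimate. In general, we can write an augmented error function
$$E_{ridge}(\mathbf{w} \mid \mathcal{X}) = \sum_t (r^t - \mathbf{w}^T \mathbf{x}^t)^2 + \lambda \|\mathbf{w}\|_2^2$$
where $\lambda \equiv \alpha / \beta$. This is known as parameter shrinkage or ridge regression.
In neural networks, this is L2 regularization, or weight decay.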
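The link between the penalty weight and the two precisions can be checked numerically. In the sketch below, the data and the values of the prior precision `alpha` and noise precision `beta` are hypothetical. It minimizes the augmented error in closed form with `lam = alpha / beta` and compares the result with the MAP expression $\beta(\alpha \mathbf{I} + \beta \mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{r}$.

```python
import numpy as np

def ridge_error(w, X, r, lam):
    """Augmented error: sum of squared errors plus lam * ||w||_2^2."""
    residual = r - X @ w
    return residual @ residual + lam * (w @ w)

def fit_ridge(X, r, lam):
    """Minimizer of the augmented error: (X^T X + lam I)^{-1} X^T r."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)

x = np.array([0.0, 1.0, 2.0, 3.0])           # hypothetical inputs
X = np.column_stack([np.ones_like(x), x])    # first column is the bias input
r = np.array([1.1, 2.9, 5.2, 6.8])           # hypothetical noisy outputs

alpha, beta = 1.0, 4.0
w_ridge = fit_ridge(X, r, lam=alpha / beta)
w_map = beta * np.linalg.solve(alpha * np.eye(2) + beta * X.T @ X, X.T @ r)

print(w_ridge, w_map)   # the two agree
```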
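This does reduce the squares of the weights, but as the weights get smaller, the pressure to shrink them reduces as well, so they do not reach exactly zero. It cannot be used for feature selection. For this, one can use a Laplacian prior, which uses the $L_1$ norm instead of the $L_2$ norm
$$p(\mathbf{w} \mid \alpha) = \prod_i \frac{\alpha}{2} \exp(-\alpha |w_i|)$$
With noise precision $\beta$, the log posterior is
$$\log p(\mathbf{w} \mid \mathbf{X}, \mathbf{r}) = -\frac{\beta}{2} \sum_t (r^t - \mathbf{w}^T \mathbf{x}^t)^2 - \alpha \sum_i |w_i| + c$$
where $c$ does not depend on $\mathbf{w}$. The posterior is no longer Gaussian and the MAP estimate is found by minimizing
$$E_{lasso}(\mathbf{w} \mid \mathcal{X}) = \sum_t (r^t - \mathbf{w}^T \mathbf{x}^t)^2 + \frac{2\alpha}{\beta} \sum_i |w_i|$$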
This is known as lasso (least absolute shrinkage and selection operator).
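The lasso objective has no closed-form minimizer in general. One common way to minimize it, used here as an assumption and not taken from the text, is coordinate descent with soft-thresholding. The sketch below uses a small hypothetical design with orthogonal columns, where only the second feature carries real signal, and a penalty weight `gamma` that stands for the coefficient in front of the $L_1$ term. It contrasts the result with an $L_2$ penalty of the same weight.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, and clip to exactly zero inside [-t, t]."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def fit_lasso(X, r, gamma, n_sweeps=50):
    """Minimize sum_t (r^t - w^T x^t)^2 + gamma * sum_i |w_i| by coordinate descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution removed.
            partial = r - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial
            w[j] = soft_threshold(rho, gamma / 2.0) / (X[:, j] @ X[:, j])
    return w

# Hypothetical design with orthogonal columns; feature 1 dominates the output.
X = np.array([[1.0,  1.0,  1.0],
              [1.0, -1.0,  1.0],
              [1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0]])
r = 3.0 * X[:, 1] + 0.1 * X[:, 2]

w_lasso = fit_lasso(X, r, gamma=2.0)
w_ridge = np.linalg.solve(X.T @ X + 2.0 * np.eye(3), X.T @ r)

print(w_lasso)   # weights 0 and 2 are exactly zero, weight 1 is 2.75
print(w_ridge)   # weight 2 is shrunk but stays nonzero
```

The weakly relevant feature is removed entirely by the $L_1$ penalty, which is what makes lasso usable for feature selection.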