Machine Learning Study Note - Bayesian Network
Last updated: 2017-10-10 15:39:14 PDT.
Graphical models are graphs of nodes and edges, where each node represents a random variable and the corresponding probability distribution of that random variable. If there is a directed arc from node $X$ to node $Y$, this indicates that $X$ has a direct influence on $Y$. This influence is specified by the conditional probability $P(Y \mid X)$.
A Bayesian network is a directed acyclic graph. The canonical cases for conditional independence include:
- head-to-tail (serial, $X \to Z \to Y$): $X$ and $Y$ become independent given $Z$;
- tail-to-tail (diverging, $X \leftarrow Z \to Y$): $X$ and $Y$ become independent given $Z$;
- head-to-head (converging, $X \to Z \leftarrow Y$): $X$ and $Y$ are independent, but become dependent once $Z$ (or any of its descendants) is observed.
We say that $Z$ blocks the path from $X$ to $Y$, or in other words, it separates them in the sense that if $Z$ is removed, there is no path between $X$ and $Y$.
One major advantage of using a Bayesian network is that we do not need to explicitly designate certain variables as input and certain others as output. Any subset of the random variables can be observed, and we can infer the rest; the difference between unsupervised and supervised learning becomes blurry.
There may also be hidden variables whose values are never known from the evidence.
Examples of Graphical Models
Naive Bayes' Classifier
It is a tail-to-tail Bayesian network in which the class $C$ is the parent of the observables $x_1, \ldots, x_d$, which are assumed conditionally independent given the class:
$$p(x \mid C) = \prod_{j=1}^{d} p(x_j \mid C).$$
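A minimal numerical sketch of this factorization, with hypothetical class priors and made-up per-feature Bernoulli parameters:

```python
import numpy as np

# Hypothetical two-class problem with three binary features.
prior = np.array([0.5, 0.5])           # P(C)
# theta[c, j] = P(x_j = 1 | C = c); made-up numbers for illustration.
theta = np.array([[0.8, 0.6, 0.1],
                  [0.2, 0.3, 0.9]])

def naive_bayes_posterior(x):
    """P(C | x) under the conditional-independence assumption."""
    x = np.asarray(x)
    like = np.prod(theta**x * (1 - theta)**(1 - x), axis=1)  # prod_j p(x_j | C)
    post = prior * like
    return post / post.sum()

p = naive_bayes_posterior([1, 1, 0])   # feature pattern typical of class 0
```

Because the likelihood factorizes, each feature contributes one multiplicative term per class, so training and prediction scale linearly in the number of features.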
Hidden Markov Model
HMMs are examples of Bayesian networks where each observable $O_t$ is associated with a hidden state $S_t$; each state is linked with the previous state through a transition probability matrix $A$, and each observable is linked with the current state through an observation probability matrix $B$. There are several variations of the HMM:
- Input-Output HMM: observable inputs $x_t$, hidden states $S_t$, and observable outputs $y_t$, where both the transitions $P(S_t \mid S_{t-1}, x_t)$ and the outputs $P(y_t \mid S_t, x_t)$ are also conditioned on the input.
- Factorial HMM: observable outputs $O_t$ and several parallel hidden state sequences $S_t^{(1)}, \ldots, S_t^{(K)}$; the output at each time step depends on all of them, $P(O_t \mid S_t^{(1)}, \ldots, S_t^{(K)})$.
- Coupled HMM: two types of observable outputs $O_t^1, O_t^2$ and two types of hidden states $S_t^1, S_t^2$, where the two state chains influence each other's transitions.
- Switching HMM: there are parallel independent hidden state sequences, and a switch variable at any one time picks one of them; the chosen one generates the output. That is, we switch between state sequences as we go along.
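For the basic HMM, filtering with the transition matrix $A$, observation matrix $B$, and an initial state distribution is done by the forward algorithm; a minimal sketch with made-up parameters:

```python
import numpy as np

# Two hidden states, two observation symbols; numbers are for illustration only.
pi = np.array([0.6, 0.4])                 # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # A[i, j] = P(S_t = j | S_{t-1} = i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # B[j, k] = P(O_t = k | S_t = j)

def forward(obs):
    """Return log P(O_1..T) and the filtered distributions P(S_t | O_1..t)."""
    alpha = pi * B[:, obs[0]]                      # P(O_1, S_1)
    loglik, filtered = np.log(alpha.sum()), [alpha / alpha.sum()]
    for o in obs[1:]:
        alpha = (filtered[-1] @ A) * B[:, o]       # predict, then weight by evidence
        loglik += np.log(alpha.sum())              # accumulate P(O_t | O_1..t-1)
        filtered.append(alpha / alpha.sum())       # rescale to avoid underflow
    return loglik, filtered

loglik, filtered = forward([0, 0, 1])
```

The rescaling at each step is the standard trick that keeps the recursion numerically stable on long sequences while still recovering the exact log-likelihood.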
In an HMM, observables can take continuous values, but states take discrete values. In a linear dynamical system, also known as the Kalman filter, both the state and the observations are continuous (although the states themselves still have discrete time labels, $s_1, s_2, \ldots$). In the basic case, the state at time $t$ is a linear function of the state at $t-1$ plus additive zero-mean Gaussian noise, and the observation is a linear function of the state plus additive zero-mean Gaussian noise. The two linear mappings and the covariances of the two noise sources make up the parameters.
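A one-dimensional sketch of the predict/update cycle just described; the scalar mappings `a`, `c` and the noise variances `q`, `r` are made-up parameters:

```python
# 1-D Kalman filter: s_t = a*s_{t-1} + process noise (variance q),
#                    o_t = c*s_t     + observation noise (variance r).
def kalman_filter(obs, a=1.0, c=1.0, q=0.1, r=0.5, mu0=0.0, p0=1.0):
    mu, p, estimates = mu0, p0, []
    for o in obs:
        mu_pred, p_pred = a * mu, a * a * p + q   # predict forward one step
        k = p_pred * c / (c * c * p_pred + r)     # Kalman gain
        mu = mu_pred + k * (o - c * mu_pred)      # correct mean with the innovation
        p = (1 - k * c) * p_pred                  # shrink the variance
        estimates.append(mu)
    return estimates

est = kalman_filter([1.0, 1.2, 0.9, 1.1])
```

The gain `k` trades off the prediction against the new observation: large observation noise `r` pushes `k` toward 0 (trust the model), while large predictive variance pushes it toward `1/c` (trust the data).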
Linear regression can be considered a graphical model: the parameter vector $w$ is a parent of every output $r^t$, each of which also depends on its input $x^t$. Then, given a new input $x'$, we estimate $r'$ as
$$p(r' \mid x', X, r) = \int p(r' \mid x', w)\, p(w \mid X, r)\, dw,$$
where $p(w \mid X, r) \propto p(r \mid X, w)\, p(w)$ and
$$p(r \mid X, w) = \prod_t p(r^t \mid x^t, w);$$
the factorization in the last line is due to the i.i.d. assumption.
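With a Gaussian prior $p(w) = \mathcal{N}(0, \alpha^{-1} I)$ and Gaussian noise of precision $\beta$, the integral above is available in closed form; a sketch with assumed settings for $\alpha$, $\beta$, and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, 20)])  # bias + one feature
w_true = np.array([0.5, 2.0])
r = X @ w_true + rng.normal(0, 0.1, 20)

alpha, beta = 1.0, 100.0                    # prior and noise precision (assumed)
S_inv = alpha * np.eye(2) + beta * X.T @ X  # posterior precision of w
S = np.linalg.inv(S_inv)
m = beta * S @ X.T @ r                      # posterior mean of w

x_new = np.array([1.0, 0.3])
pred_mean = x_new @ m                       # integrates p(r'|x',w) over p(w|X,r)
pred_var = 1 / beta + x_new @ S @ x_new     # noise variance + parameter uncertainty
```

Note that the predictive variance has two parts: irreducible observation noise and a term that grows as `x_new` moves away from the training inputs.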
We now generalize the concepts of blocking and separation under the name of d-separation, and we define it in a way such that for arbitrary subsets of nodes $A$, $B$, and $C$, we can check whether $A$ and $B$ are independent given $C$. The algorithm that checks this is called the Bayes' Ball algorithm.
To check whether $A$ and $B$ are d-separated given $C$, we consider all possible paths between any node in $A$ and any node in $B$. Any such path is blocked if
- the directions of the edges on the path meet head-to-tail or tail-to-tail at a node and that node is in $C$, or
- the directions of the edges meet head-to-head at a node and neither that node nor any of its descendants is in $C$.
If all paths are blocked, we say that $A$ and $B$ are d-separated, that is, conditionally independent given $C$.
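The two blocking rules translate directly into a brute-force d-separation checker; a sketch for small graphs, with the DAG given as a map from each node to its set of parents:

```python
from collections import defaultdict

def is_d_separated(parents, A, B, C):
    """Check whether node sets A and B are d-separated given C.

    Enumerates all undirected simple paths between A and B and applies the
    blocking rules for head-to-tail/tail-to-tail and head-to-head nodes.
    (Exponential in graph size; fine for small illustrative graphs.)
    """
    children = defaultdict(set)
    for node, ps in parents.items():
        for p in ps:
            children[p].add(node)

    def neighbors(n):
        return parents.get(n, set()) | children[n]

    def descendants_and_self(n):
        out, stack = {n}, [n]
        while stack:
            for c in children[stack.pop()]:
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def paths(a, b):                       # undirected simple paths a -> b
        stack = [(a, [a])]
        while stack:
            node, path = stack.pop()
            if node == b:
                yield path
                continue
            for nb in neighbors(node):
                if nb not in path:
                    stack.append((nb, path + [nb]))

    def blocked(path):
        for i in range(1, len(path) - 1):
            prev, node, nxt = path[i - 1], path[i], path[i + 1]
            head_in = prev in parents.get(node, set())   # arrow prev -> node
            head_out = nxt in parents.get(node, set())   # arrow nxt -> node
            if head_in and head_out:                     # head-to-head
                if not (descendants_and_self(node) & C):
                    return True
            elif node in C:                 # head-to-tail or tail-to-tail
                return True
        return False

    return all(blocked(p) for a in A for b in B for p in paths(a, b))
```

For example, in the serial graph $X \to Z \to Y$, `is_d_separated({'Z': {'X'}, 'Y': {'Z'}}, {'X'}, {'Y'}, {'Z'})` is true, while in the collider $X \to Z \leftarrow Y$ observing $Z$ makes $X$ and $Y$ dependent.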
Given a graph, we are now interested in an algorithm that can answer queries of the form $P(X \mid E)$, where $X$ is any query node in the graph and $E$ is the set of evidence nodes.
If the evidence is in the ancestors of the query node, we can do causal inference and propagate the evidence down the chain; if the evidence is in the descendants, we do diagnostic inference and propagate upward using Bayes' rule. For example, consider a chain with query node $X$, and split the evidence into the part $E^+$ upstream of $X$ and the part $E^-$ downstream of $X$.

Since $X$ blocks $E^+$ and $E^-$, we have
$$P(E \mid X) = P(E^+, E^- \mid X) = P(E^+ \mid X)\, P(E^- \mid X).$$
Then
$$P(X \mid E) = \frac{P(E \mid X)\, P(X)}{P(E)} = \alpha\, P(X \mid E^+)\, P(E^- \mid X),$$
where $\alpha$ is a normalizing constant that does not depend on $X$.

In order to calculate $P(E^- \mid X)$, we recursively use marginalization: if $Y$ is the child of $X$ on the chain,
$$P(E^- \mid X) = \sum_Y P(E^- \mid Y)\, P(Y \mid X).$$

We define $\lambda(X) \equiv P(E^- \mid X)$ and $\pi(X) \equiv P(X \mid E^+)$, so that $P(X \mid E) = \alpha\, \pi(X)\, \lambda(X)$, with the recursions $\lambda(X) = \sum_Y \lambda(Y)\, P(Y \mid X)$ and $\pi(Y) = \sum_X P(Y \mid X)\, \pi(X)$. When evidence nodes are set to a value, they initiate traffic, and nodes continue updating until there is convergence.
Note: this looks like path integral!
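As a sanity check, the $\lambda$/$\pi$ decomposition on a three-node chain $X_1 \to X_2 \to X_3$ can be verified against brute-force summation over the joint (numbers are made up for illustration):

```python
import numpy as np

# Chain X1 -> X2 -> X3, each variable binary.
p_x1 = np.array([0.6, 0.4])              # P(X1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])   # A[i, j] = P(X2 = j | X1 = i)
B = np.array([[0.9, 0.1], [0.4, 0.6]])   # B[j, k] = P(X3 = k | X2 = j)

# Evidence: X3 = 1, so lambda(X3) is an indicator on the observed value.
lam_x3 = np.array([0.0, 1.0])

pi_x2 = p_x1 @ A                          # pi(X2) = sum_x1 P(X2|x1) pi(x1)
lam_x2 = B @ lam_x3                       # lambda(X2) = sum_x3 lambda(x3) P(x3|X2)

posterior = pi_x2 * lam_x2                # P(X2 | E) up to normalization
posterior /= posterior.sum()              # alpha normalizes

# Brute-force check: enumerate the full joint and marginalize.
joint = p_x1[:, None, None] * A[:, :, None] * B[None, :, :]
brute = joint[:, :, 1].sum(axis=0)
brute /= brute.sum()
assert np.allclose(posterior, brute)
```

The message-passing version touches each table once, whereas the brute-force sum is exponential in the number of variables; that gap is the entire point of belief propagation.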
Given a graphical model that forms a tree, the same belief propagation also applies, except that a node may now have several children: $\lambda_Y(X)$ denotes the $\lambda$-message $X$ receives from its child $Y$, and $X$ sends different $\pi$-messages to its children, with $\pi_Y(X)$ denoting the message $X$ sends to its child $Y$.
Again, if $X$ separates the two parts of the evidence $E^+$ and $E^-$, we have $P(X \mid E) = \alpha\, \pi(X)\, \lambda(X)$. To compute the two terms, we propagate up: if $X$ is a parent of $Y$,
$$\lambda_Y(X) = \sum_Y \lambda(Y)\, P(Y \mid X), \qquad \lambda(X) = \prod_{Y \in \mathrm{ch}(X)} \lambda_Y(X).$$
Then, to propagate down:
$$\pi(Y) = \sum_X P(Y \mid X)\, \pi_Y(X).$$
The trick to compute $\pi_Y(X)$ is as follows. Given $Y$ and $Z$ as children of $X$, the message sent to $Y$ uses everything $X$ knows except what came from $Y$ itself:
$$\pi_Y(X) = \alpha\, \pi(X)\, \lambda_Z(X),$$
that is, $\pi(X)$ multiplied by the $\lambda$-messages from all children other than $Y$.
Undirected Graph: Markov Random Field
The treatment of undirected graphs is simpler. It is much easier to check whether $A$ and $B$ are independent given $C$: we check whether removing the nodes in $C$ removes all the paths between $A$ and $B$.
To turn a directed graph into an undirected one, we moralize the graph: we add links between nodes that have common children and then drop the directions of the arcs.
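Moralization is a few lines of code; a sketch that takes the DAG as a parent map and returns the undirected edge set:

```python
from itertools import combinations

def moralize(parents):
    """Moralize a DAG: marry co-parents, then drop edge directions."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))     # original arcs, now undirected
        for p, q in combinations(sorted(ps), 2):
            edges.add(frozenset((p, q)))         # link parents sharing a child
    return edges

# In X -> Z <- Y, moralization adds the edge X - Y.
g = moralize({'Z': {'X', 'Y'}})
```

Marrying the co-parents is what preserves the head-to-head dependence of the directed graph: without the added edge, the undirected separation test would wrongly declare $X$ and $Y$ independent given $Z$.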
Belief propagation on a general graph is conducted by converting the graph into a factor graph and using the sum-product algorithm.