Bayesian Analysis
Recall that when we discussed maximum likelihood estimation, we viewed the parameter $\theta$ as a fixed but unknown quantity which we wished to estimate based on some observations, $X_1, X_2, \ldots$. In Bayesian analysis, the other major methodology for statistical estimation, we view $\theta$ as a randomly determined quantity about which the observations $X_1, X_2, \ldots$ provide us information. In this lecture, we will write the likelihood function, previously written as $f_\theta{\left(x^n\right)}$, as $f_{X^n|\Theta}{\left(x^n,\theta\right)}$, reflecting the viewpoint that the parameter, written as $\Theta$ when viewed as a random variable, is random.
Since $\Theta$ is random, in Bayesian analysis one presumes knowledge of its distribution, which we denote by $f_{\Theta}{\left(\theta\right)}$ and refer to as the prior distribution. The focus of Bayesian analysis is on the distribution of the parameters given the observations, known as the posterior distribution, which we will write as $f_{\Theta|X^n}{\left(\theta,x^n\right)}$. Note that in maximum likelihood estimation, we choose a particular value as the estimate for the unknown parameters. By contrast, in Bayesian analysis, the estimate is the entire posterior distribution of the unknown parameters. This makes it more apparent, for example, how much uncertainty there is in the parameter, since we have an entire distribution rather than a single value.
To calculate the posterior distribution, one can use Bayes' theorem, which we first state in terms of point mass functions and then in terms of densities:
For a point mass function $f_\Theta$ and density $f_{X^n|\Theta}$: \begin{eqnarray}\label{Bayespmf} f_{\Theta|X^n}{\left(\theta,x^n\right)} = \frac{f_{X^n|\Theta}{\left(x^n,\theta\right)}f_\Theta{\left(\theta\right)}}{\sum_\theta f_{X^n|\Theta}{\left(x^n,\theta\right)}f_\Theta{\left(\theta\right)}} \end{eqnarray} For a density function $f_\Theta$ and density function $f_{X^n|\Theta}$: \begin{eqnarray}\label{Bayes} f_{\Theta|X^n}{\left(\theta,x^n\right)} = \frac{f_{X^n|\Theta}{\left(x^n,\theta\right)}f_\Theta{\left(\theta\right)}}{\int_\theta f_{X^n|\Theta}{\left(x^n,\theta\right)}f_\Theta{\left(\theta\right)}d\theta} \end{eqnarray}Note that these follow immediately from the definition of conditional probability (and the law of total probability). In fact, I view these more as definitions than theorems.
Note that the right-hand side of (\ref{Bayes}) is just the joint distribution of $\theta$ and $x^n$ normalized to be a distribution on $\theta$. This observation will be extremely useful in computing the posterior distribution.
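To make this concrete, here is a minimal sketch in Python of the point mass version (\ref{Bayespmf}); the parameter values, prior weights, and data counts are made up for illustration. It computes a posterior over a small discrete set of coin biases by normalizing the joint distribution, exactly as just described:

```python
import numpy as np

# Hypothetical discrete parameter space: the coin bias is one of three values.
thetas = np.array([0.3, 0.5, 0.7])    # candidate values of theta
prior = np.array([0.25, 0.50, 0.25])  # prior f_Theta(theta)

# Illustrative data: 8 ones in 10 IID Bernoulli(theta) observations.
n, n1 = 10, 8

# Likelihood f_{X^n|Theta}(x^n, theta) for each candidate theta.
likelihood = thetas**n1 * (1 - thetas)**(n - n1)

# Bayes' theorem: normalize the joint distribution over theta.
joint = likelihood * prior
posterior = joint / joint.sum()
print(posterior)  # most of the mass moves to theta = 0.7
```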
We now go through several examples.
Example: IID Normal Mean
Let us assume we have observations $X_1, X_2, \ldots$ which are IID normal with unknown mean $\mu$ and known variance $\sigma^2$. The likelihood of $X_i$ is given by: \begin{eqnarray*} f_{X_i|M}{\left(x_i,\mu\right)} = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(x_i - \mu\right)^2}{2\sigma^2}\right) \end{eqnarray*} As a prior distribution, we assume that the unknown mean $\mu$ is distributed normally with known mean $\mu_0$ and known variance $\sigma_0^2$: \begin{eqnarray*} f_M{\left(\mu\right)} = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left(-\frac{\left(\mu - \mu_0\right)^2}{2\sigma_0^2}\right) \end{eqnarray*} We will now derive the posterior distribution. Note that from (\ref{Bayes}), the posterior distribution is the joint distribution of $\mu$ and $x^n$ divided by the integral of the joint distribution over $\mu$. Hence, we first derive the joint distribution: \begin{eqnarray*} f_{M,X^n}{\left(\mu, x^n\right)} & = & \left[\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(x_i - \mu\right)^2}{2\sigma^2}\right)\right]\frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left(-\frac{\left(\mu - \mu_0\right)^2}{2\sigma_0^2}\right)\\ & = & \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{\sum_{i=1}^n\left(x_i - \mu\right)^2}{2\sigma^2}\right)\frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left(-\frac{\left(\mu - \mu_0\right)^2}{2\sigma_0^2}\right)\\ & = & \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{\sum_{i=1}^n\left(x_i^2 - 2x_i\mu + \mu^2\right)}{2\sigma^2}\right)\frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left(-\frac{\left(\mu^2 - 2\mu\mu_0 + \mu_0^2\right)}{2\sigma_0^2}\right)\\ & = & \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{\sum_{i=1}^nx_i^2}{2\sigma^2} + \frac{2\mu\sum_ix_i}{2\sigma^2} - \frac{n\mu^2}{2\sigma^2}\right)\frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left(-\frac{\mu^2}{2\sigma_0^2} + \frac{2\mu\mu_0}{2\sigma_0^2} - \frac{\mu_0^2}{2\sigma_0^2}\right)\\ & = & \frac{1}{\sqrt{2^{n+1}\pi^{n+1}}\sigma^n\sigma_0}\exp\left(-\frac{\sum_{i=1}^nx_i^2}{2\sigma^2} + \frac{2\mu\sum_ix_i}{2\sigma^2} - \frac{n\mu^2}{2\sigma^2}-\frac{\mu^2}{2\sigma_0^2} + \frac{2\mu\mu_0}{2\sigma_0^2} - \frac{\mu_0^2}{2\sigma_0^2}\right)\\ & = & \frac{1}{\sqrt{2^{n+1}\pi^{n+1}}\sigma^n\sigma_0}\exp\left(-\mu^2\left(\frac{n}{2\sigma^2}+\frac{1}{2\sigma_0^2}\right) + \mu\left(\frac{\sum_{i=1}^nx_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) - \frac{\sum_{i=1}^nx_i^2}{2\sigma^2}-\frac{\mu_0^2}{2\sigma_0^2}\right) \end{eqnarray*} We wish to look at this as a distribution on $\mu$. Note that, in terms of $\mu$, it has the form $\exp\left(a_n\mu^2 + b_n\mu\right)$, up to a constant factor, for some values $a_n$ and $b_n$. Since the normal density has this form, we will find that when we normalize the distribution, it will be normal with variance $\sigma_n^2 = -\frac{1}{2a_n}$ and mean $\mu_n=b_n\sigma_n^2$: \begin{eqnarray*} \sigma_n^2 & = & \frac{1}{2\left(\frac{n}{2\sigma^2}+\frac{1}{2\sigma_0^2}\right)} = \frac{1}{\frac{n\sigma_0^2 + \sigma^2}{\sigma^2\sigma_0^2}} = \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2}\\ \mu_n & = & \left(\frac{\sum_{i=1}^nx_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\sigma_n^2 = \left(\frac{\sum_{i=1}^nx_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2} = \frac{\sigma_0^2\sum_{i=1}^nx_i + \sigma^2\mu_0}{n\sigma_0^2+\sigma^2} \end{eqnarray*} After we normalize by the integral of the joint distribution, the posterior distribution of $\mu$ given $x^n$ will be normal with mean $\mu_n$ and variance $\sigma_n^2$.
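The closed-form update is easy to check numerically. The following is a minimal sketch (the data are simulated, and the prior hyperparameters $\mu_0$ and $\sigma_0$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data; sigma is assumed known.
mu_true, sigma, n = 2.0, 1.5, 50
x = rng.normal(mu_true, sigma, size=n)

# Prior: our guess mu_0 and our uncertainty sigma_0 around it.
mu0, sigma0 = 0.0, 3.0

# Posterior parameters from the derivation above.
sigma_n2 = (sigma**2 * sigma0**2) / (n * sigma0**2 + sigma**2)
mu_n = (sigma0**2 * x.sum() + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)

print(f"posterior mean {mu_n:.3f}, posterior sd {sigma_n2**0.5:.3f}")
print(f"MLE (sample mean) {x.mean():.3f}")  # mu_n shrinks this toward mu0
```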
$\mu_0$ represents our guess of where the mean is before seeing any data, and $\sigma_0$ represents our initial uncertainty around this value. We illustrate this with the following remarks:
- As $\sigma_0$ gets large: \begin{eqnarray*} \lim_{\sigma_0\rightarrow\infty} \mu_n = \lim_{\sigma_0\rightarrow\infty} \frac{\sigma_0^2\sum_{i=1}^nx_i + \sigma^2\mu_0}{n\sigma_0^2+\sigma^2} = \frac{\sum_{i=1}^nx_i}{n} \end{eqnarray*} which is exactly the maximum likelihood estimator. Hence, as our uncertainty about the parameter grows, the mean of the posterior distribution approaches the maximum likelihood estimator.
- As $\sigma_0$ gets small: \begin{eqnarray*} \lim_{\sigma_0\rightarrow 0} \mu_n = \lim_{\sigma_0\rightarrow 0} \frac{\sigma_0^2\sum_{i=1}^nx_i + \sigma^2\mu_0}{n\sigma_0^2+\sigma^2} = \mu_0 \end{eqnarray*} Hence, as our certainty about our initial guess grows, we begin to ignore the data, and the posterior mean approaches the initial value $\mu_0$.
- $\frac{\sigma^2}{\sigma_0^2}$ can be interpreted as the number of observations' worth of weight given to our initial estimate: \begin{eqnarray*} \mu_n = \frac{\sigma_0^2\sum_{i=1}^nx_i + \sigma^2\mu_0}{n\sigma_0^2+\sigma^2} = \frac{\sum_{i=1}^nx_i + \frac{\sigma^2}{\sigma_0^2}\mu_0}{n+\frac{\sigma^2}{\sigma_0^2}} \end{eqnarray*} From this equation, we can see that if we had $\frac{\sigma^2}{\sigma_0^2}$ additional observations which were all exactly $\mu_0$ and used the maximum likelihood estimator, we would get the same estimate as the mean of the posterior distribution.
Example: The Binomial Distribution
Let $X_1, X_2, \ldots$ be a sequence of IID random variables such that: \[ X_i = \left\{\begin{array}{rl}0 & \mbox{w.p. $1-p$}\\ 1 & \mbox{w.p. $p$}\\\end{array}\right. \] As a prior distribution, we assume that the unknown probability $p$ has a uniform distribution on $[0,1]$: \begin{eqnarray*} f_P{\left(p\right)} = \left\{\begin{array}{rl}1 & \mbox{if $p\in[0,1]$}\\ 0 & \mbox{otherwise}\end{array}\right. \end{eqnarray*} Again, we derive the posterior distribution by first forming the joint distribution of $p$ and $x^n$, which for $p\in[0,1]$ is: \begin{eqnarray*} f_{P,X^n}{\left(p, x^n\right)} = \prod_{i=1}^n p^{x_i}\left(1-p\right)^{1-x_i} = p^{n_1{\left(x^n\right)}}\left(1-p\right)^{n_0{\left(x^n\right)}} \end{eqnarray*} where $n_1{\left(x^n\right)}$ and $n_0{\left(x^n\right)}$ denote the number of ones and zeros in $x^n$, respectively. Again, we would like to normalize this into a distribution on $p$. In this case, the normalized distribution is the $\beta$ distribution: \begin{eqnarray*} f_{\alpha,\beta}{\left(p\right)} = \frac{\Gamma\left(\alpha+\beta\right)}{\Gamma\left(\alpha\right)\Gamma\left(\beta\right)} p^{\alpha-1}\left(1-p\right)^{\beta-1} \end{eqnarray*} where $\Gamma(x)$ is the $\Gamma$ function, a generalization of the factorial to non-integer arguments satisfying $\Gamma\left(n+1\right) = n!$. Matching exponents, the posterior is a $\beta$ distribution with: \begin{eqnarray*} \alpha=n_1{\left(x^n\right)}+1\\ \beta=n_0{\left(x^n\right)}+1\\ \end{eqnarray*} The mean of the $\beta$ distribution is given by $\frac{\alpha}{\alpha+\beta}$, so that the posterior mean is: \begin{eqnarray*} \mu_n = \frac{n_1{\left(x^n\right)}+1}{n+2}\\ \end{eqnarray*} This posterior mean was first derived by Laplace and is sometimes called Laplace's rule of succession.
We point out the following differences between the previous example, the posterior distribution of the normal mean, and this example, the posterior distribution of the binomial:
- In the case of the normal mean, one of the parameters of the prior, $\sigma_0$, corresponded with the uncertainty in the parameter. In the binomial case, the uniform prior distribution is equivalent to maximum likelihood with exactly $1$ observation of $1$ and $1$ observation of $0$ added.
- In the case of the normal mean, the posterior distribution had the same form (normal) as the prior distribution. In the binomial case, the posterior distribution is a $\beta$ distribution while the prior distribution is a uniform distribution. Note that the uniform distribution is the $\beta$ distribution with parameters $\alpha=1$ and $\beta=1$.
Both of these differences go away if we start with a $\beta$ distribution as a prior. In particular, if the prior is $\beta$ with parameters $\alpha_0$ and $\beta_0$, then the posterior is $\beta$ with parameters: \begin{eqnarray*} \alpha_n=n_1{\left(x^n\right)}+\alpha_0\\ \beta_n=n_0{\left(x^n\right)}+\beta_0\\ \end{eqnarray*} Hence, the posterior mean becomes: \begin{eqnarray*} \mu_n = \frac{n_1{\left(x^n\right)}+\alpha_0}{n+\alpha_0+\beta_0} \end{eqnarray*} which is equivalent to maximum likelihood with $\alpha_0$ observations of $1$ and $\beta_0$ observations of $0$ added.
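The conjugate update amounts to adding counts to the prior parameters. A minimal sketch (simulated coin flips; $\alpha_0$ and $\beta_0$ are illustrative choices, with $\alpha_0=\beta_0=1$ recovering the uniform prior and Laplace's rule):

```python
import numpy as np

rng = np.random.default_rng(1)

p_true, n = 0.35, 100
x = rng.random(n) < p_true      # IID Bernoulli(p_true) observations
n1 = int(x.sum())               # n_1(x^n): number of ones
n0 = n - n1                     # n_0(x^n): number of zeros

alpha0, beta0 = 1.0, 1.0        # Beta(1, 1) = uniform prior
alpha_n = n1 + alpha0           # posterior Beta parameters
beta_n = n0 + beta0

posterior_mean = alpha_n / (alpha_n + beta_n)
print(posterior_mean)           # equals (n1 + 1) / (n + 2): Laplace's rule
```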
Example: The Multinomial Distribution
We consider the generalization of the preceding example to the multinomial distribution where each observation $X_i$ takes on values in $\left\{1,\ldots,m\right\}$ with probabilities $p_1, p_2, \ldots, p_m$. We now require a prior distribution on probability vectors. The Dirichlet distribution is a generalization of the $\beta$ distribution to probability vectors. Given an $m$-vector $\alpha$ of positive parameters, the density of the Dirichlet distribution is given by: \begin{eqnarray*} f_\alpha{\left(p\right)} = \frac{\Gamma\left(\sum_{i=1}^m\alpha_i\right)}{\prod_{i=1}^m\Gamma\left(\alpha_i\right)}\prod_{i=1}^m p_i^{\alpha_i-1} \end{eqnarray*} Given a Dirichlet prior with parameter vector $\alpha$, the joint distribution of $p$ and $x^n$ is given by: \begin{eqnarray*} f_{P,X^n}{\left(p,x^n\right)} & = & \frac{\Gamma\left(\sum_{i=1}^m\alpha_i\right)}{\prod_{i=1}^m\Gamma\left(\alpha_i\right)}\prod_{i=1}^m p_i^{\alpha_i-1} \prod_{j=1}^n p_{x_j}\\ & = & \frac{\Gamma\left(\sum_{i=1}^m\alpha_i\right)}{\prod_{i=1}^m\Gamma\left(\alpha_i\right)}\prod_{i=1}^m p_i^{\alpha_i-1} \prod_{i=1}^m p_i^{n_i{\left(x^n\right)}}\\ & = & \frac{\Gamma\left(\sum_{i=1}^m\alpha_i\right)}{\prod_{i=1}^m\Gamma\left(\alpha_i\right)} \prod_{i=1}^m p_i^{n_i{\left(x^n\right)}+\alpha_i-1}\\ \end{eqnarray*} where $n_i{\left(x^n\right)}$ denotes the number of observations in $x^n$ equal to $i$. When we consider the functional dependence of the above with respect to $p$, we see that it is Dirichlet with parameters: \begin{eqnarray*} \alpha'_i = n_i{\left(x^n\right)} + \alpha_i \end{eqnarray*} Note that the mean of $p_i$ under the Dirichlet distribution with parameter vector $\alpha$ is given by $\frac{\alpha_i}{\sum_{j=1}^m \alpha_j}$. Hence, the posterior mean of $p_i$ is given by: \begin{eqnarray*} \mu'_i = \frac{n_i{\left(x^n\right)} + \alpha_i}{n + \sum_{j=1}^m\alpha_j} \end{eqnarray*} Again, the parameter $\alpha_i$ of the prior corresponds with how many additional observations of $i$ one would assume.
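The same pattern holds here, with vector counts. A minimal sketch (simulated categorical data; the prior vector $\alpha$ is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)

m = 3
p_true = np.array([0.2, 0.5, 0.3])
n = 200
x = rng.choice(m, size=n, p=p_true)   # observations coded as {0, ..., m-1}

alpha = np.ones(m)                    # symmetric Dirichlet prior
counts = np.bincount(x, minlength=m)  # n_i(x^n)

alpha_post = counts + alpha           # posterior Dirichlet parameters
print(alpha_post / alpha_post.sum())  # posterior mean, close to p_true
```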
Example: Markov Chains
For fitting Markov chains, just as in the case of maximum likelihood, Bayesian analysis is very similar to that for the multinomial distribution. In this case, we must specify the prior distribution for each row of the transition matrix. As in the multinomial case, we use the Dirichlet distribution for this. Let $\alpha_i$ for $i\in\left\{1,\ldots,m\right\}$ be non-negative $m$-vectors. We assume that $P_i$, the $i$th row of the transition matrix, is distributed Dirichlet with parameter $\alpha_i$. We also assume that $P_i$ and $P_j$ are independent for $i\neq j$. The prior distribution is given by: \begin{eqnarray*} f_P{\left(p\right)} = \prod_{i=1}^m \left(\frac{\Gamma\left(\sum_{j=1}^m \alpha_{i,j}\right)}{\prod_{j=1}^m\Gamma\left(\alpha_{i,j}\right)}\prod_{j=1}^m p_{i,j}^{\alpha_{i,j}-1}\right) \end{eqnarray*} The joint distribution of $p$ and $x^n$ is: \begin{eqnarray*} f_{P,X^n}{\left(p,x^n\right)} & = & \prod_{i=1}^m \left(\frac{\Gamma\left(\sum_{j=1}^m \alpha_{i,j}\right)}{\prod_{j=1}^m\Gamma\left(\alpha_{i,j}\right)}\prod_{j=1}^m p_{i,j}^{\alpha_{i,j}-1}\right) \prod_{i=1}^m\prod_{j=1}^m p_{i,j}^{n_{i,j}{\left(x^n\right)}}\\ & = & \prod_{i=1}^m \frac{\Gamma\left(\sum_{j=1}^m \alpha_{i,j}\right)}{\prod_{j=1}^m\Gamma\left(\alpha_{i,j}\right)} \prod_{i=1}^m\prod_{j=1}^m p_{i,j}^{n_{i,j}{\left(x^n\right)}+\alpha_{i,j}-1}\\ \end{eqnarray*} where $n_{i,j}{\left(x^n\right)}$ is the number of transitions from $i$ to $j$ in $x^n$. Here again, the posterior distribution of each row is Dirichlet, independently across rows, with parameters: \begin{eqnarray*} \alpha'_{i,j} = \alpha_{i,j} + n_{i,j}{\left(x^n\right)} \end{eqnarray*} so that the posterior mean is equivalent to maximum likelihood estimation with $\alpha_{i,j}$ transitions from $i$ to $j$ added in.
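A minimal sketch of the row-by-row update (a simulated two-state chain; the prior parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

m = 2
P_true = np.array([[0.9, 0.1],
                   [0.4, 0.6]])

# Simulate a chain x_1, ..., x_n from P_true.
n = 500
x = np.zeros(n, dtype=int)
for t in range(1, n):
    x[t] = rng.choice(m, p=P_true[x[t - 1]])

# Count transitions n_{i,j}(x^n).
counts = np.zeros((m, m))
for s, t in zip(x[:-1], x[1:]):
    counts[s, t] += 1

alpha = np.ones((m, m))          # Dirichlet prior parameters for each row
alpha_post = counts + alpha      # posterior parameters, row by row
P_mean = alpha_post / alpha_post.sum(axis=1, keepdims=True)
print(P_mean)                    # posterior mean transition matrix
```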
Example: Model Order Selection
Bayesian analysis also provides a solution to the problem of model order selection that we discussed when fitting Markov chains. Consider a set of finitely parameterized models which we denote by $\Theta_1, \Theta_2, \ldots$. On each finitely parameterized model, we put a prior distribution $f_{\Theta_i}$. We then place a prior distribution on models by choosing model $\Theta_i$ with some probability $p_i$. The combined prior distribution is given by: \begin{eqnarray*} f_{\Theta}\left(\theta_1, \theta_2, \ldots\right) = \sum_{i=1}^\infty p_i f_{\Theta_i}{\left(\theta_i\right)} \end{eqnarray*} As we will see in the next section, Bayes estimates are consistent under broad conditions, and that result applies to this situation as well, which we will demonstrate empirically.
Bayesian Consistency
Recall that consistency in the case of maximum likelihood meant that the maximum likelihood estimator converges to the true parameter as the number of observations drawn from the distribution with that true parameter grows. Bayesian analysis takes a different viewpoint, in which we produce a distribution over the parameter rather than a specific value. Hence, we will say that the Bayesian estimate is consistent if the posterior distribution converges to the distribution which puts all probability mass on the true parameter, $\theta^*$, given a growing stream of observations from that true parameter. In this sense, Bayesian analysis is consistent under a wider set of conditions than maximum likelihood estimation:
If for every $\theta$ and $\theta'$ such that $\theta\neq\theta'$, there is some $x^n$ such that $f_{X^n|\Theta}{\left(x^n,\theta\right)} \neq f_{X^n|\Theta}{\left(x^n,\theta'\right)}$, then, except for true parameters in a set of prior probability $0$, the posterior distribution converges to the distribution with all mass on the true parameter.
Note that the condition in the antecedent of the theorem is called identifiability and is also required for consistency of the maximum likelihood estimator. However, for the maximum likelihood estimator, there are additional conditions required.
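We can see this empirically by reusing the $\beta$ posterior from the binomial example: as $n$ grows, the posterior mean approaches the true parameter and the posterior standard deviation shrinks toward $0$ (a sketch with simulated data and a uniform prior):

```python
import numpy as np

rng = np.random.default_rng(6)
p_true = 0.35

for n in [10, 100, 1_000, 10_000]:
    x = rng.random(n) < p_true
    n1 = int(x.sum())
    a, b = n1 + 1, n - n1 + 1                  # posterior Beta parameters
    mean = a / (n + 2)
    sd = (a * b / ((n + 2) ** 2 * (n + 3))) ** 0.5
    print(n, round(mean, 4), round(sd, 4))     # sd -> 0 as n grows
```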
Markov Chain Monte Carlo
We have seen several examples where Bayesian analysis can be performed in closed form. In these cases, we were able to avoid calculating the normalization constant by recognizing the joint distribution as a known distribution up to normalization. Unfortunately, in many models that people build, this is not the case.
Notice that, in calculating the posterior distribution (\ref{Bayes}), we need to integrate over the $\theta$ space. This integral is often difficult to compute. Markov chain Monte Carlo, or MCMC, is a method of drawing samples from the posterior distribution, which we refer to as posterior sampling, without calculating the normalizing constant. While EM specifically helps eliminate an exponential calculation for the maximum of the likelihood function when there are hidden variables, MCMC can help with a much wider range of models.
The idea behind Markov chain Monte Carlo is to draw a sequence of samples, $\theta_k$, which evolve according to a Markov chain whose stationary distribution is the distribution we wish to draw from, in this case, the posterior distribution. We know from Markov chain theory that the distribution of $\theta_k$ will converge to the stationary distribution, $\pi$, of the chain. It is possible, in several different ways, to define a Markov chain with stationary distribution $\pi$ even if $\pi$ is only known up to a constant multiple. Hence, we can sample values from the posterior distribution without knowing the normalization constant of the distribution. Furthermore, for any function $g$, we will have the law of large numbers: \begin{eqnarray*} \lim_{k\rightarrow\infty} \frac{1}{k}\sum_{i=1}^k g\left(\theta_i\right) = E_\pi\left[g\left(\Theta\right)\right] \end{eqnarray*} where $E_\pi$ denotes expectation with respect to the stationary distribution $\pi$. Hence, we can calculate any expected values that we wish, such as the probability of some set $A$, by using the indicator function of $A$: \begin{eqnarray*} g\left(\theta\right) = \left\{\begin{array}{rl}1 & \mbox{if $\theta\in A$}\\ 0 & \mbox{otherwise}\end{array}\right. \end{eqnarray*}
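For instance, given samples $\theta_1, \ldots, \theta_k$ produced by some MCMC method, a posterior probability is just an average of indicator values (a sketch; the normal `samples` array is a stand-in for actual MCMC output):

```python
import numpy as np

# Stand-in for theta_1, ..., theta_k drawn by an MCMC method.
samples = np.random.default_rng(4).normal(1.0, 0.5, size=10_000)

# Estimate P(theta in A) for A = [1.5, infinity) as an indicator average.
print(np.mean(samples >= 1.5))
```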
The Metropolis-Hastings Algorithm
We now present an MCMC algorithm referred to as the Metropolis-Hastings algorithm. We assume that we have some distribution $\pi$, known as the target distribution, from which we wish to sample. However, we only know $\pi$ up to a constant multiple; that is, we can calculate ratios, $\frac{\pi\left(y\right)}{\pi\left(x\right)}$, but not $\pi$ directly. In practice, for MCMC, we will use $\pi\left(\theta\right) = f_{\Theta|X^n}{\left(\theta,x^n\right)}$ for some fixed observed data $x^n$. Since the normalization is the same, we will then have: \begin{eqnarray*} \frac{\pi\left(\theta'\right)}{\pi\left(\theta\right)} = \frac{f_{\Theta|X^n}{\left(\theta',x^n\right)}}{f_{\Theta|X^n}{\left(\theta,x^n\right)}} = \frac{\frac{f_{\Theta,X^n}{\left(\theta',x^n\right)}}{f_{X^n}{\left(x^n\right)}}}{\frac{f_{\Theta,X^n}{\left(\theta,x^n\right)}}{f_{X^n}{\left(x^n\right)}}} = \frac{f_{\Theta,X^n}{\left(\theta',x^n\right)}}{f_{\Theta,X^n}{\left(\theta,x^n\right)}} \end{eqnarray*} Hence, we only need to know the joint distribution of the parameters and data, which is easier to calculate than the posterior distribution since we don't need to calculate the normalization.
Note that the Metropolis-Hastings algorithm chooses a vector $\theta_{k+1}$ randomly based only on the previous vector $\theta_k$, so that the sequence $\theta_1, \theta_2, \ldots$ is Markovian. However, $\theta_{k+1}$ and $\theta_k$ are typically real vectors rather than discrete quantities, so we are no longer in the realm of discrete space Markov chains. Some of the theory of Markov chains carries over to continuous spaces, but we need some definitions. We will call a function $k:\mathbb{R}^m \times \mathbb{R}^m\rightarrow\mathbb{R}$ a transition kernel if it is a density for each fixed value of its first argument, that is, for all $x\in\mathbb{R}^m$: \begin{eqnarray*} k(x,y) & \geq & 0 \mbox{ for all $y\in\mathbb{R}^m$}\\ \int k(x,y) dy & = & 1\\ \end{eqnarray*} Note that $k(x,y)$ represents the conditional density of $y$ given that the previous value was $x$. The transition kernel is the analogue, for a continuous state space Markov process, of the probability transition matrix.
We will need an auxiliary transition kernel $q\left(y|x\right)$ called the proposal distribution, which is fairly arbitrary, though we will see later some conditions that it needs to satisfy. Given an initial value $\theta_1$, the algorithm proceeds as follows:
- For each $k = 1, 2, \ldots$:
    - Choose $\theta'_k$ from the distribution $q\left(\theta'_k|\theta_k\right)$
    - Choose $\theta_{k+1}$ from: \begin{eqnarray*} \theta_{k+1} = \left\{\begin{array}{rl}\theta'_k & \mbox{with probability $\rho\left(\theta_k,\theta'_k\right)$}\\ \theta_k & \mbox{with probability $1-\rho\left(\theta_k,\theta'_k\right)$}\end{array}\right. \end{eqnarray*} where $\rho(x,y) = \min\left(1,\frac{\pi\left(y\right)q\left(x|y\right)}{\pi\left(x\right)q\left(y|x\right)}\right)$.
Note that this algorithm is fairly simple to implement in comparison to other algorithms for doing statistics for complex models, such as EM.
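Below is a minimal sketch of the algorithm in Python. The assumptions here are illustrative: the target is one-dimensional, `log_joint` is a stand-in for the log of the unnormalized posterior $f_{\Theta,X^n}{\left(\theta,x^n\right)}$, and the proposal is a symmetric Gaussian random walk, in which case the $q$ ratio cancels from $\rho$:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_joint(theta):
    # Stand-in for log f_{Theta,X^n}(theta, x^n): an unnormalized normal
    # density with mean 2.0 and standard deviation 0.5.
    return -0.5 * ((theta - 2.0) / 0.5) ** 2

def metropolis_hastings(log_target, theta1, k_max, step=1.0):
    theta = theta1
    thetas = [theta]
    for _ in range(k_max):
        # Symmetric proposal: q(x|y) = q(y|x), so rho = min(1, pi(y)/pi(x)).
        proposal = theta + step * rng.normal()
        log_rho = min(0.0, log_target(proposal) - log_target(theta))
        if np.log(rng.random()) < log_rho:
            theta = proposal            # accept with probability rho
        thetas.append(theta)            # on rejection, repeat current value
    return np.array(thetas)

samples = metropolis_hastings(log_joint, theta1=0.0, k_max=20_000)
burned = samples[5_000:]                # discard an initial burn-in period
print(burned.mean(), burned.std())      # approximately 2.0 and 0.5
```

Working with logarithms of the joint distribution, as above, avoids numerical underflow when the likelihood is a product of many terms.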
We now discuss the stationary distribution of the algorithm. For this, we will need to introduce the detailed balance condition. A Markov chain with transition kernel $k$ satisfies the detailed balance condition if there is a function $f$ such that: \begin{eqnarray*} k(x,y)f(x) = k(y,x)f(y) \end{eqnarray*} Note that, if $k$ satisfies detailed balance, then: \begin{eqnarray*} \int f(x) k(x,y) dx = \int f(y) k(y,x) dx = f(y)\int k(y,x) dx = f(y) \end{eqnarray*} Hence, if $f$ is a density, it is a stationary distribution of a Markov process having transition kernel $k$.
We now calculate the transition kernel of the Metropolis-Hastings algorithm: \begin{eqnarray*} k(x,y) = \rho(x,y)q(y|x) + (1 - r(x))\delta_x(y) \end{eqnarray*} where $r(x) = \int \rho(x,y)q(y|x)dy$ and $\delta_x$ denotes a point mass at $x$. From this we can show detailed balance by showing: \begin{eqnarray} \rho(x,y)q(y|x)\pi(x) = \rho(y,x)q(x|y)\pi(y)\label{accept}\\ (1-r(x))\delta_x(y)\pi(x) = (1-r(y))\delta_y(x)\pi(y)\label{reject} \end{eqnarray} To show (\ref{accept}), let $f(x,y) = \frac{\pi(y)q(x|y)}{\pi(x)q(y|x)}$. We have that $f(y,x) = \frac{1}{f(x,y)}$ and $\rho(x,y) = \min\left(1,f(x,y)\right)$. Hence, $\rho(y,x)=\min\left(1,\frac{1}{f(x,y)}\right)$, so that either $\rho(y,x) = 1$, when $f(x,y)\leq 1$, or $\rho(x,y)=1$ otherwise. We assume, without loss of generality, that $\rho(y,x)= 1$, in which case $\rho(x,y) = f(x,y)$ and: \begin{eqnarray*} \rho(x,y)q(y|x)\pi(x) = \frac{\pi(y)q(x|y)}{\pi(x)q(y|x)}q(y|x)\pi(x) = q(x|y)\pi(y) = \rho(y,x)q(x|y)\pi(y) \end{eqnarray*} so that (\ref{accept}) holds. To show (\ref{reject}), note that when $x=y$ the two sides are identical, while when $x\neq y$, $\delta_x(y) = \delta_y(x) = 0$, so the equation reads $0=0$.
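On a discrete state space, where the kernel can be written out exactly, detailed balance for the Metropolis-Hastings kernel can be verified numerically (a sketch with an arbitrary three-state target and proposal):

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])     # target distribution
q = np.array([[0.1, 0.6, 0.3],     # proposal q(y|x): row x, column y
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

m = len(pi)
k = np.zeros((m, m))
for x in range(m):
    for y in range(m):
        if x != y:
            rho = min(1.0, pi[y] * q[y, x] / (pi[x] * q[x, y]))
            k[x, y] = rho * q[x, y]      # probability of an accepted move
    k[x, x] = 1.0 - k[x].sum()           # rejected mass stays at x

# Detailed balance: pi(x) k(x, y) == pi(y) k(y, x) for all x, y.
flows = pi[:, None] * k
print(np.allclose(flows, flows.T))       # True
```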
Note that since we are now talking about Markov processes rather than discrete Markov chains, there are many new subtleties. For example, a Markov process can be irreducible and aperiodic (though these notions need more careful definitions in the case of general Markov processes) but still not converge to a stationary distribution. Care must be taken to show convergence in the case of the Metropolis-Hastings algorithm and other MCMC algorithms. Furthermore, even if the algorithm is known to converge, it might not do so on a reasonable time scale; in some cases this has only been discovered after a study relying on the algorithm was published. In general, the theoretical bounds for the speed of convergence of MCMC algorithms are not very tight. There is still a great deal of art in the use of these methods.
References
Robert, Christian P. The Bayesian Choice. Springer, 2006.
This is a good book on Bayesian analysis.
Robert, Christian P. and Casella, George. Monte Carlo Statistical Methods. Springer, 2004.
This is a good book on Markov chain Monte Carlo and other Monte Carlo methods in statistics.