Tutorial on Bayesian Statistics for Geophysicists
The primary authors of this tutorial are L.M. Berliner and N. Cressie.
Essence of Bayesian Reasoning
One rationale for the Bayesian approach is that it is a mathematically rigorous method for combining information in the presence of uncertainty. Uncertainty is a notion that applies to observations, models, and parameters, in differing amounts depending on the problem under investigation. This tutorial in combination with the Web Project: Ice Streams, demonstrates how uncertainty can be accounted for and scientific inference can be made on ice-stream dynamics using the Bayesian approach.
The primary computational tool used is Bayes' Theorem, which is simply a result in probability theory relating various conditional distributions. However, the Bayesian view of statistical modeling and analysis involves more than simple application of probability theory. Indeed, it is a paradigm that involves the modeling of unknowns as random variables and using observations to update that modeling effort.
Two primary sources of information are available for inference on the unknown quantities of interest: (i) observations or data that convey some information regarding those unknowns, and (ii) prior information, based on scientific reasoning regarding the unknowns, as well as past experience and data. Both sources of information are subject to uncertainties. The basis of the Bayesian approach is that all such uncertainties are modeled probabilistically. Specifically, a data model or statistical distribution for the data is developed conditional upon the unknowns; the result is denoted by $[\mbox{data} \mid \mbox{unknowns}]$. Similarly, prior information and uncertainty are summarized through a prior probability distribution, denoted by $[\mbox{unknowns}]$.
Construction of a data model does not require that we actually know the values of the unknowns (fortunately!). Rather, this distribution is viewed as a function of both the unknowns and the data. For example, if we observe the surface velocity of an ice-stream at some point in space, we would not believe that the resulting observation, $U$, is exactly equal to the true value, $u$. The presence of an unobservable measurement error, $e$, would be anticipated. A common assumption is that \begin{equation} U = u + e. \end{equation} To complete the statistical argument, we would suggest a probability distribution for $e$. Suppose that it is plausible to assert that $e$ is a Gaussian or normally distributed random variable with mean zero and a known (say from previous calibration studies) variance $\sigma^2$; in shorthand, $e \sim N(0,\sigma^2)$, where $\sim$ is read "is distributed as". This suggests a statistical model for $U$, conditional upon $u$: \begin{equation} U \mid u \, \sim \, N(u, \sigma^2). \end{equation}
The development of a prior distribution, $[\mbox{unknowns}]$, is the subject of much of the Web Project: Ice Streams. For now, suppose this step has been accomplished. Bayes' Theorem then yields the solution to our problem, which is the posterior distribution of the unknowns conditional upon the observed (i.e., now fixed) data, given by \begin{equation} [\mbox{unknowns} \mid \mbox{data}] = \frac{[\mbox{data} \mid \mbox{unknowns}] \, [\mbox{unknowns}]}{[\mbox{data}]}. \end{equation} The denominator, $[\mbox{data}]$, in this expression can be viewed simply as that constant, depending on the observed data but not the unknowns, that normalizes the numerator, ensuring that the posterior is indeed a probability distribution (physicists may note the analogy to the partition function in statistical physics).
Pursuing our simple example regarding velocity, suppose that our prior information suggests that the true velocity $u$ is normally distributed:
\begin{equation} u\sim N(\mu,\sigma^2_u). \end{equation}
One can show that the posterior distribution $[u{\mid}U]$ is also Gaussian (e.g., Berger 1985, pp.127-128) with mean
\begin{equation} \frac{\sigma^2_u}{\sigma^2_u+\sigma^2}U+\frac{\sigma^2}{\sigma^2_u+\sigma^2}\mu \end{equation}
and variance
\begin{equation} \frac{\sigma^2\sigma^2_u}{\sigma^2_u+\sigma^2}. \end{equation}
Inspection of these formulas indicates how Bayesian analysis provides combinations of data and prior information. First, note that the posterior mean of $u$ is a weighted average of the prior mean (or guess) $\mu$ and the purely-data-based estimate $U$. Also, note that if the datum is extremely precise relative to our prior information ($\sigma^2$ is small compared with $\sigma^2_u$), the posterior mean is very close to $U$. Next, note that the posterior variance is smaller than either $\sigma^2$ and $\sigma^2_u$; that is, we learn more via the combination of data and prior than each tells us separately. (These aspects are very common, but not universal, aspects of Bayesian analysis.)
It is instructive to compare the Bayesian approach to a conventional statistical approach. In the latter case, primary attention is directed to the data model Eq.(2) only. In particular, the optimal estimator in the sense of both maximum likelihood and minimum squared errors of $u$ is the observation $U$. But notice that if we simply choose the prior distribution to be uniform or constant everywhere, Eq.(3) reduces to \begin{equation} [\mbox{unknowns} \mid \mbox{data}] \propto [\mbox{data} \mid \mbox{unknowns}], \end{equation} suggesting that many traditional estimation procedures are interpretable as modes (i.e., maxima) of posterior distributions corresponding to uniform priors. This applies quite generally to least-squares estimates, maximum-likelihood estimates, including spline smoothers and certain three- and four-dimensional variational data assimilation procedures, and control methods applied in glaciology (see below). In this sense, the Bayesian approach is more general since geophysical knowledge typically leads to a prior distribution $[\mbox{unknowns}]$ that is not uniform. For example, the force of gravity counterbalanced by stresses on an ice stream can be incorporated into $[\mbox{unknowns}]$, making it far less uncertain than the uniform model that yields the conventional statistical approach.
A key issue in statistical inference is that it involves not only the production of estimates, but also quantifications of uncertainly associated with them. The Bayesian approach responds with the production of the full posterior distribution. By contrast, conventional statistical approaches must rely only on the data model. In our example, the only quantification of variability in $U\mid u\sim N(u,\sigma^2)$ is $\sigma^2$. In classical statistics, this quantity can be used to produce confidence intervals for the true value $u$: $U\pm z\sigma$, where $z$ depends on the level of confidence desired (e.g., a 95% interval arises if $z$=1.96). However, the correct interpretation of a 95% confidence interval is that if data were gathered under identical conditions 100 times, one would expect that 95 out of the 100 confidence intervals would contain the true value $u$. Under the conventional statistical approach, $u$ is a constant and hence is not amenable to a probability statement about its value.
Rather than velocity at a single point, we are interested in velocity profiles or velocity fields. Suppose that a vector of observations $U$ is available for estimation of the surface velocity profile $u(x)$ for locations $x$ in some domain. By analogy to Eq.(2), we might postulate the data model, $[{\bf U}\mid {\bf u}]$ as a multivariate normal (Gaussian) distribution, $N({\bf u}, \sigma^2 I)$, where $u$ is the vector of the true velocities at the observation locations. Choosing the simple estimate $U$ to estimate $u$ is the result of using conventional statistics, but again it accounts for none of our scientific understanding regarding the structure of velocity fields and, furthermore, it does not suggest a method for estimating velocity at unobserved locations. A purely statistical modeling approach would postulate the true profile to be a stochastic process with some mean function and some spatial covariance function, followed by spatial prediction of the field $u(x)$ for all $x$ in the domain. This would require the tasks of formulation and estimation of the covariances. Such approaches are known as kriging or "objective analysis" (though the modifier "objective" is subject to misinterpretation); kriging and related spatial prediction methods are discussed in an introductory article by Cressie (1989).
Physical-statistical modeling offers the opportunity to develop models based on physical reasoning, and we shall use this to build the distribution, $[\mbox{unknowns}]$; see Web Project: Ice Streams. We note that control methods (see below) also make formal use of a physical model, though quantification of uncertainty remains a difficult issue. The physical-statistical Bayesian approach relies on physically based interrelationships among (unknown and sometimes imperfectly observed) quantities of interest, but it focuses on probability distributional interrelationships. To accomplish our goal of scientific inference on the unknown quantities of interest, we use the technology of hierarchical statistical modeling.