[BAYES] Lesson 9: Bayesian simulation techniques

May 28, 2021 06:36 · 2681 words · 13 minute read

Welcome to unit nine, the last unit of this course on Bayesian probability theory. My name is Wolfgang von der Linden, and I will enable you to help Captain Bayes and her crew get familiar with Bayesian simulation techniques. This unit is dedicated to Monte Carlo methods and efficient stochastic algorithms. • We will learn how stochastic algorithms can be used to compute quantities of interest, like the number pi, • we will learn how to quantify the uncertainty and to derive the probability distribution for the desired quantity, • and we will learn clever Monte Carlo techniques, including Nested Sampling.

00:44 - In the adventure Captain Bayes proposed to determine pi - the ratio of the circumference of a circle to its diameter - by a random experiment.

00:56 - You just need a stick of a certain length l and some straight parallel lines which are a distance d apart. Then throw the stick repeatedly onto the array of lines and count how often it cuts a line. Then you can estimate pi from the ratio given by the number of cuts K divided by the number of throws N.

01:18 - Let’s see how we can justify this relation that Captain Bayes had in mind.

01:24 - To keep the discussion simple, we assume from now on that the length of the sticks is equal to the distance between the lines.

01:32 - By a we denote the distance of the midpoint of the stick to the nearest line.

01:38 - The angle between the stick and the lines is denoted by phi. This is enough information to decide whether a stick cuts a line or not.

01:47 - The stick cuts the nearest line if the perpendicular extent of the half-stick towards the line, (L/2)·sin(phi), exceeds the distance a.

01:55 - Let's visualize the possible and favorable stick orientations in the following a − phi plane: • phi ranges from zero to π/2 and the parameter a is between zero and L/2. Therefore the area of possible events is given by the product of the ranges, (π/2)·(L/2) = πL/4.

02:18 - • The favorable orientations of the stick - so that it intersects a line - are given by the area under the curve L/2 times sine phi.

02:27 - That results in an area of favourable events of L/2. Thus, the probability that a randomly thrown stick cuts one of the parallel lines is given by the ratio of the areas, resulting in 2/π.
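Written out explicitly, the ratio of favourable to possible area is

$$
P(\text{cut}) = \frac{\int_{0}^{\pi/2} \frac{L}{2}\sin\varphi \,\mathrm{d}\varphi}{\frac{\pi}{2}\cdot\frac{L}{2}}
= \frac{L/2}{\pi L/4} = \frac{2}{\pi} .
$$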

02:43 - In the limit of infinitely many throws, the law of large numbers tells us that the ratio K/N approaches the intrinsic probability. That implies that 2N/K approaches pi.

02:59 - For a finite number of throws, 2N/K corresponds to the approximation proposed by Captain Bayes.
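Here is a minimal sketch of the throwing experiment in Julia, assuming l = d = 1; the function name and the choice of six million throws (anticipating the answer to Lyra's question below) are only for illustration.

```julia
# Minimal sketch of Captain Bayes' stick experiment (stick length l equal to
# the line spacing d, both set to 1); names are illustrative.
function buffon_estimate(N)
    K = 0                            # number of throws that cut a line
    for _ in 1:N
        a   = rand() / 2             # distance of the midpoint to the nearest line, uniform on [0, 1/2]
        phi = rand() * π / 2         # angle between stick and lines, uniform on [0, π/2]
        K  += (sin(phi) / 2 >= a)    # the stick cuts the line iff (l/2)·sin(phi) exceeds a
    end
    return 2N / K                    # Captain Bayes' estimator 2N/K for pi
end

buffon_estimate(6_000_000)           # typically reproduces the first few digits of pi
```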

03:07 - A warning is in order here. We know that K/N is an unbiased estimator for the intrinsic probability q. The inverse N/K, however, is not an unbiased estimator for 1/q because the mean of 1/X is not the same as 1 over the mean of X.

03:30 - The warning becomes even more urgent when looking at transformations of errors. Assume you have a measure Δq for the error of q, for instance the standard error; then the error of 1/q won't simply be 1 over Δq.
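To leading order, standard linear error propagation (not spelled out in the lesson, but consistent with the side note that follows) gives

$$
\Delta\!\left(\frac{1}{q}\right) \approx \left|\frac{\mathrm{d}}{\mathrm{d}q}\,\frac{1}{q}\right| \Delta q = \frac{\Delta q}{q^{2}} .
$$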

03:46 - As a side note, it is worth mentioning that the propagation of experimental errors can also be treated with the rules of probability theory.

03:56 - Although Captain Venn, who just arrived, would be satisfied with this asymptotic limit, Captain Bayes would like to answer Lyra’s question about how long it takes to get a reliable result for the first four digits. Captain Venn could argue that he knows the variance of the 2N/K estimator for pi, based on the law of large numbers. However, the latter applies to the ratio K/N and not directly to its inverse N/K. One can do a Taylor expansion to overcome this problem for very large values of N, but for general N, and for a systematic approach, the Bayesian way is to determine the posterior probability density for the numerical value ξ of pi, based on a finite number of throws N and the observed number of intersected lines K.

04:51 - From the posterior density she would then compute the moments in the usual way. For the posterior we invoke Bayes’ theorem to obtain the product of likelihood and prior.

05:04 - The moments can then be calculated as integrals over likelihood times prior, with explicit normalization. Now it is advantageous to introduce a variable transformation from ξ to the cut probability q, because then likelihood and prior are easier to grasp.

05:24 - The likelihood then turns into the binomial distribution, and for the prior p(q) we assume for simplicity the uniform distribution on the unit interval. The remaining integrals involve Beta functions and can be determined analytically. The final result for the mean is almost identical to the approximation Captain Bayes proposed, and it converges towards pi for N going to infinity, for the same reason outlined before.
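A sketch of what this gives, under the uniform-prior assumption just stated and using the standard Beta-function identities:

$$
p(q \mid N, K) = \frac{q^{K}(1-q)^{N-K}}{B(K+1,\, N-K+1)},
\qquad
\langle \xi \rangle = \left\langle \frac{2}{q} \right\rangle
= 2\,\frac{B(K,\, N-K+1)}{B(K+1,\, N-K+1)} = \frac{2(N+1)}{K} ,
$$

which indeed differs from Captain Bayes' proposal 2N/K only by the replacement of N with N + 1.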

Finally, to answer Lyra's question, we consider the variance. As we want to know how many throws it takes to reach four-digit accuracy, we can assume that both N and K are much greater than one. Then the variance, and with it the uncertainty Δ, simplify further: we obtain the ubiquitous one-over-square-root-of-N dependence, with a variance pre-factor that is roughly six.
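Making the pre-factor explicit with a large-N propagation-of-errors sketch around q = 2/π (an approximation, not the exact posterior variance):

$$
\mathrm{Var}(\xi) \approx \left(\frac{2}{q^{2}}\right)^{2} \frac{q(1-q)}{N}\,\Bigg|_{q = 2/\pi}
= \frac{\pi^{3}/2 - \pi^{2}}{N} \approx \frac{5.6}{N},
\qquad
\Delta \approx \sqrt{\frac{6}{N}} .
$$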

06:26 - In order to have an accuracy of m fractional digits, Δ has to be less than 10 to the minus m; with Δ ≈ √(6/N) we obtain N ≳ 6 · 10^(2m).

06:35 - Lyra was eager to see how long it takes to get the first four digits right, which corresponds to 3 fractional digits. The answer for Lyra: it takes about 6 million throws.

06:50 - In the previous example the integral over the posterior was one-dimensional and could be evaluated analytically. Unfortunately, this is the exception rather than the rule.

07:03 - Say we are interested in the density profile within a cross section of a fusion plasma. Since the temperature of a plasma is typically around 100 million Kelvin, you can’t put measuring devices inside the plasma; they would just melt away. But the plasma emits X-rays, and their emissivity can be measured by cameras outside the plasma. The emissivity is proportional to the integral of the density along the line of sight along which the camera looks into the plasma.

Say we are interested in the density in a 2D cross section at 100 × 100 points and we want to determine it without resorting to a parameterised model. This so-called form-free reconstruction then has 10000 unknown parameters.

07:57 - The problem is related to the Abel transform if the plasma density is axially symmetric and the data are noise-free. Here, both conditions are not met, and typically there are far fewer line integrals than unknowns. So we have an underdetermined, ill-posed inversion problem. It could not be worse. Nevertheless, Bayes’ theorem allows us to infer the density vector ρ, corresponding to the values on the grid.

08:25 - Z is, as always, the normalization. Here, as with all underdetermined problems, the prior plays an important role, and a generalized entropic prior is well suited: one that takes into account that the testable information provided by the data is corrupted by noise.

08:46 - This approach is called Quantified Maximum Entropy. This example is only meant to show you a real-world problem in which a high-dimensional integral occurs. We do not want to go into the details of the posterior here, but rather demonstrate how we can compute mean values of observables f(ρ), like the mean density and the covariance. To this end you have to evaluate 10000-dimensional integrals of the following form.
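Schematically, writing the likelihood as p(d | ρ) for the measured line integrals d (this notation is an assumption of this write-up, not spelled out in the lesson):

$$
\langle f \rangle = \frac{1}{Z} \int f(\boldsymbol{\rho})\, p(d \mid \boldsymbol{\rho})\, p(\boldsymbol{\rho})\, \mathrm{d}^{10000}\rho,
\qquad
Z = \int p(d \mid \boldsymbol{\rho})\, p(\boldsymbol{\rho})\, \mathrm{d}^{10000}\rho .
$$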

09:17 - If we used standard quadrature methods with, say, 10 integration points per dimension, we would have to evaluate the posterior 10 to the 10000 times. Even if each evaluation could be done on an atomic time scale, say in femtoseconds, it would still take 10 to the 9977 years. The Big Bang happened roughly 10 to the 10 years ago. So it would take 10 to the 9967 big bangs before the computer spits out the result; not to mention how it could survive it all.

This impressively illustrates the curse of dimensionality. Fortunately, we can overcome it by using probabilistic ideas. We have learned in previous lessons that we can estimate the mean of a function f(ρ) based on independent samples drawn from the posterior, and we also know how to estimate the variance to have a measure for the uncertainty of the estimate of the mean. The crucial aspect is that in many cases the uncertainty is more or less independent of the dimension, and it decreases as one over the square root of N, so fairly reliable results can be reached.

So we see that the curse of dimensionality disappears and that the problem boils down to generating independent samples from the posterior.
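A minimal sketch of this estimator in Julia; `f` and `samples` are placeholders for an observable and a collection of (assumed independent) posterior samples.

```julia
# Sample-mean estimate of ⟨f⟩ and its uncertainty from independent samples.
using Statistics

function mc_mean_and_error(f, samples)
    values = f.(samples)                        # evaluate the observable on every sample
    n      = length(values)
    return mean(values), std(values) / sqrt(n)  # estimate of ⟨f⟩ and its 1/√n uncertainty
end
```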

10:52 - The task of generating independent samples from a pdf was already solved in the Manhattan Project some 80 years ago. The idea is pretty simple, but the proof is a bit tricky. Here is the recipe.

11:06 - We start with a randomly selected initial density profile. Let’s skip some initial iterations and assume n elements of the sample have already been created. Starting from the current density, the next element is generated by the following steps.

11:25 - • Make a small change to the current density and call it the trial density. That can be done, for instance, by adding a Gaussian random number to one of the components of the density vector. • Next, compute the ratio of the posterior for the trial and the current density, and compute from that the so-called acceptance probability. • The trial density is accepted with that probability: draw a uniform random number r from the unit interval; if r is less than the acceptance probability, the trial density is accepted, otherwise it is rejected.

12:06 - • The next element of the sample is the trial density, if it is accepted, otherwise it is again the current density.

12:14 - Here is a question for you: what is this stochastic process called? Right, it’s a Markov process, and that is why the approach is called MCMC: Markov Chain Monte Carlo. More precisely, we are using the Metropolis algorithm for the individual steps.
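Here is a minimal sketch of the recipe above in Julia, for a generic unnormalized posterior. The function names, the single-component Gaussian proposal width σ, and the chain layout are illustrative assumptions, not the exact implementation used in the course.

```julia
# One Metropolis step: perturb one component of ρ and accept or reject.
function metropolis_step(posterior, ρ; σ = 0.1)
    ρ_trial     = copy(ρ)
    i           = rand(1:length(ρ))                              # pick one component of the density vector
    ρ_trial[i] += σ * randn()                                    # small Gaussian change → trial density
    p_accept    = min(1.0, posterior(ρ_trial) / posterior(ρ))    # acceptance probability (normalization drops out)
    return rand() < p_accept ? ρ_trial : ρ                       # accept the trial, otherwise keep the current density
end

# Build a chain by repeating the step; the chain elements form the sample.
function metropolis_chain(posterior, ρ0, n_samples)
    chain = [ρ0]
    for _ in 1:n_samples
        push!(chain, metropolis_step(posterior, chain[end]))
    end
    return chain
end

# Example usage with a hypothetical unnormalized posterior on a 3-component ρ:
# chain = metropolis_chain(ρ -> exp(-sum(ρ .^ 2) / 2), randn(3), 10_000)
```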

12:38 - We have seen that for the evaluation of mean values we do not need the normalization of the posterior, as it drops out in the acceptance probability. For that reason these MCMC algorithms cannot be used to compute the normalization.

12:54 - But for many problems it is of central importance, for instance for model comparison as outlined earlier. So we need a different tool to compute high dimensional integrals over likelihood times prior.

13:08 - A similar expression is at the heart of statistical physics, the partition function. Once we know the partition function we can determine all other thermodynamic variables. A particularly challenging situation is encountered in physical systems at first order phase transitions. There are a few methods that allow the numerical computation of such integrals but the most promising one has been found in the treasure box by Captain Venn: nested sampling. That is an approach that recently has been proposed by John Skilling.

Nested sampling makes use of Lebesgue integrals by transforming the high-dimensional integral over the space spanned by the vectors x into a one-dimensional integral over the possible values of the integrand f = f(x).

14:03 - Here X(f) is the integral over the prior enclosed by the contour given by f(x) = f.

14:14 - This object is therefore the restricted prior mass. Let’s consider a very simple example with a prior p(x) = x restricted to the interval [0, 2] and a function f(x) = x^2 that ranges over [0, 4]. Then the integral of interest yields 4.

14:37 - The constraint according to the contour then results in the restricted prior mass, and from that the Lebesgue integral also yields the value 4.
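Worked out explicitly for this toy example (keeping the unnormalized prior p(x) = x used above):

$$
\int_{0}^{2} f(x)\, p(x)\, \mathrm{d}x = \int_{0}^{2} x^{3}\, \mathrm{d}x = 4,
\qquad
X(f) = \int_{x:\, f(x) \ge f} p(x)\, \mathrm{d}x = \int_{\sqrt{f}}^{2} x\, \mathrm{d}x = 2 - \frac{f}{2},
\qquad
\int_{0}^{4} X(f)\, \mathrm{d}f = 4 .
$$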

14:49 - As outlined in the attached notes of John Skilling the plausibility argument for Nested Sampling is the following. Draw an ensemble of n random locations x according to the prior pdf, each of which has its own value f(x), which we can place in ascending order. Crudely, if we discarded the lower half of the values, the survivors would be random samples from the prior, taken within the restricted volume corresponding to f(x) > f_median. And, statistically, that restricted volume would be roughly half the original volume because that’s what a median is.

So by ordering the samples according to their f-values we can infer the prior masses. Repeating that procedure r times would compress the prior mass X(f) by a factor of roughly 2 to the power r. This is the exponential behaviour required to overcome the curse of dimensionality.

15:57 - So how does Nested Sampling actually work? As mentioned, we rewrite the high-dimensional integral as a one-dimensional integral. The next logical step is a discretization in f.

16:10 - The only, but decisive, problem is that we don’t know the constrained prior mass X(f). John Skilling introduced an ingenious algorithm for choosing f_n statistically in the following sense, slightly modified compared to the scheme mentioned at the beginning (a code sketch of the full loop follows after the X-values have been discussed below). • Initialize a counter n = 1. • Set the first value f_n to the lower bound of the function f(x). • Step one: generate a sample of size N according to the prior with the constraint f(x) > f_n. • Step two: compute the corresponding f-values, determine the smallest one and call it f*. • Step three: increment the counter n and set f_n equal to f*. • Now repeat steps one to three until suitable convergence is achieved.

17:12 - The estimation of the integral now uses the sequence of f-values generated in this way.

17:19 - We still don’t know the X-values X_n corresponding to the f-values, but now we can determine them statistically. The explanation in a nutshell goes as follows.

17:33 - The X-values X_n corresponding to the sequence of f-values f_n are uniformly distributed.

17:42 - The smallest f-value of the sample corresponds to the largest X-value, because by construction X(f) is a monotonically decreasing function of f. We can then use order statistics to determine the pdf of X_n, and from that generate a sample of X-values X_n that corresponds to the sequence of f-values.
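Here is a minimal nested-sampling sketch in Julia, assuming we can draw from the (normalized) prior directly via a user-supplied `sample_prior()` and that `f` is the integrand. The plain rejection step and the deterministic shrinkage X_n ≈ exp(-n/N) are common simplifications of the statistical treatment described above, not the full algorithm from the lesson.

```julia
# Toy nested sampling: estimate ∫ f(x) p(x) dx via the one-dimensional X-integral.
function nested_sampling(sample_prior, f, N, n_iter)
    live   = [sample_prior() for _ in 1:N]   # N "live" points drawn from the prior
    fs     = f.(live)
    X_prev = 1.0                             # total (normalized) prior mass
    Z      = 0.0                             # running estimate of the integral
    for n in 1:n_iter
        i_min  = argmin(fs)                  # worst live point; its f-value is f*
        X_n    = exp(-n / N)                 # estimated prior mass still enclosed by f(x) > f*
        Z     += fs[i_min] * (X_prev - X_n)  # weight f* with the prior-mass shell it closes off
        X_prev = X_n
        # Replace the worst point by a fresh prior sample with f(x) > f*
        # (plain rejection sampling; only sensible for toy problems).
        x_new, f_new = live[i_min], fs[i_min]
        while f_new <= fs[i_min]
            x_new = sample_prior()
            f_new = f(x_new)
        end
        live[i_min], fs[i_min] = x_new, f_new
    end
    return Z                                 # contribution of the remaining live points omitted for brevity
end

# Toy check with the example above, prior normalized to p(x) = x/2 on [0, 2]:
# nested_sampling(() -> 2 * sqrt(rand()), x -> x^2, 100, 2000)  # ≈ 2, i.e. half of 4 since p is normalized
```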

18:14 - Nested sampling can even be used to solve really tough physical problems. For instance, it allows one to determine the critical exponents and many other observables of the Potts model, which exhibits a phase transition of second or first order depending on a parameter of the model.

18:30 - This concludes Unit 9 and we have reached the end of the entire course. In this unit, we learned how stochastic simulation techniques work and how Bayesian probability theory helps to estimate errors.

18:45 - We have seen how Markov Chain Monte Carlo methods can be used to tame high-dimensional integrals and calculate the desired moments of a probability distribution, and also how to evaluate partition functions and normalization terms using the recently proposed Nested Sampling technique.

19:07 - Have a look at the bonus material and the interactive Pluto notebooks to get deeper insights into these techniques and explore the vast configuration space of Sudoku. Please feel free to ask questions in the forum and feel encouraged to test your knowledge in the quiz! We hope you enjoyed this course and our adventures with Captain Bayes and her crew. You are now equipped to apply Bayesian probability theory in your daily applications, you have learned the skills to update your knowledge base, and you should feel better prepared to face uncertainty.