[BAYES] Lesson 6: Bayesian deep reasoning

Apr 27, 2021 10:10 · 3108 words · 15 minute read

Welcome to unit six of the course on Bayesian probability theory.

00:08 - My name is Wolfgang von der Linden and I will enable you to help Captain Bayes and her crew in Bayesian deep reasoning.

00:16 - This unit is dedicated to the central part of this course, namely Bayes’ theorem.

00:22 - We will learn how to apply Bayes’ theorem to solve inverse problems, such as finding the probability of having found treasure island; we will learn how to assign probabilities in the Bayesian way using the principle of maximum entropy; and we will learn how to update probabilities based on new information. Now we come to the philosopher’s stone of probability theory: Bayes’ theorem.

00:49 - It turns noisy data and erroneous information into precious results. You can use it if you want to infer the parameters of your model from measured data, to quantify the uncertainty of those parameters, to assess the validity of your model, or to assess the validity of certain hypotheses. Let A and B be two arbitrary propositions and I the background information. Then the product rule tells us that P(A ∧ B | I) = P(A | B, I) P(B | I). Alternatively, since A ∧ B = B ∧ A, we find P(A ∧ B | I) = P(B | A, I) P(A | I). The right-hand sides have to be equal, which leads to Bayes’ theorem: P(A | B, I) = P(B | A, I) P(A | I) / P(B | I).

01:31 - Let’s assume A is the object of interest, e.g. the parameters of a model, and B is the measured data. In this context, the terms in Bayes’ theorem have the following names and meanings. The term on the left-hand side is the posterior probability: it is the probability for the quantity of interest after adding new information, hence posterior.

01:57 - The counterpart is the prior probability. It encodes our knowledge prior to adding the new information into the inference.

02:06 - The meaning of prior and posterior is not necessarily to be understood in a temporal sense, but only in the order in which we consider the information in the inference process. The term P(B | A), the probability of B given A, is the so-called likelihood. You may think to yourself: that’s a probability after all, so why isn’t it called that? And you are partly right: it is a probability for B, but we are interested in A. And in terms of A it is not a probability.

02:38 - In many cases, however, if the likelihood increases when A is varied, that also means that A becomes more likely. It is good to make this distinction in the naming of the terms, because the rules of probability do not apply to the likelihood as far as A is concerned: in general, P(B | A) is not equal to P(A | B). If you disregard this inequality you could end up in prison! Disregarding it has led to a catastrophically wrong judgment in a court case.

03:07 - We finally arrive at the term in the denominator of Bayes’ theorem.

03:12 - It guarantees the normalization of the posterior and is called the normalization constant.

03:18 - In a deeper sense it is the data evidence. Let’s assume A is part of a set of complete and exclusive propositions A_n. Do you still remember what that means? Then the denominator has a simple form: P(B | I) = Σ_n P(B | A_n, I) P(A_n | I). In the following we omit the background information. The first and simplest application of Bayes’ theorem arises if the set consists only of one proposition A and its complement Ā. I leave it up to you to solve the following question: given that a rapid antigen test gave the result positive or negative, what is the probability of having the virus or of not having the virus? To answer this question, we need the sensitivity and the specificity of the test. In addition, we need the prior probability for a person to have the virus without additional information.
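As a minimal sketch of the antigen-test question, the two-proposition form of Bayes’ theorem can be written in a few lines of Python. The function name and the numbers for sensitivity, specificity and prior are hypothetical and chosen only for illustration; they are not taken from the lecture.

```python
def posterior_infected(positive, sensitivity, specificity, prior):
    """Return P(virus | test result) for the propositions 'virus' and 'no virus'."""
    if positive:
        like_virus = sensitivity            # P(positive | virus)
        like_healthy = 1.0 - specificity    # P(positive | no virus)
    else:
        like_virus = 1.0 - sensitivity      # P(negative | virus)
        like_healthy = specificity          # P(negative | no virus)
    # Denominator: sum over the complete and exclusive propositions.
    evidence = like_virus * prior + like_healthy * (1.0 - prior)
    return like_virus * prior / evidence

# Hypothetical test characteristics and prior, for illustration only.
p_pos = posterior_infected(True, sensitivity=0.9, specificity=0.98, prior=0.01)
p_neg = posterior_infected(False, sensitivity=0.9, specificity=0.98, prior=0.01)
print(f"P(virus | positive) = {p_pos:.4f}")
print(f"P(virus | negative) = {p_neg:.5f}")
```

With a small prior, even a positive result from a fairly specific test can leave the posterior well below certainty, which is why the prior is needed in addition to sensitivity and specificity.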

04:17 - Now we want to apply Bayes’ theorem to the treasure island problem. Captain Bayes correctly pointed out that the draftsman of the map wanted to give us a hint with the name “Inverse sea”, because this is indeed an inverse problem. And of course the letter of her brother Thomas Bayes has the solution to it. Why inverse? Because the likelihood corresponds, in a sense, to the forward problem. If we know the island, we also know the percentage of frogfish, and then the likelihood of catching K frogfish is computed in a straightforward way.

It is given by the probability for K frogfish in a catch of size N, and as such it is a binomial distribution.

05:02 - But we are interested in the inverse probability: the probability that this is a particular island, given the number K of frogfish in a catch of 100 fish in total. We introduce the propositions I_i concerning the island, where i = 1 stands for treasure island, i = 2 for paradox island and i = 3, as conjectured by Pascal, for an unknown island. The desired probability is given by Thomas Bayes’ theorem: P(I_i | K, N) = P(K | I_i, N) P(I_i) / P(K | N).

05:37 - Here the second factor in the numerator is the prior probability, the one without the result of the fishing. Let’s assume that the prior probability for the unknown island is alpha and very small, while the prior probabilities for treasure and paradox island are the same, namely (1 − alpha)/2 each. Then the likelihood for the two known islands is binomial, where Q_i is the corresponding frogfish ratio: Q_1 = 0.1 for treasure island and Q_2 = 0.2 for paradox island. The remaining unknown probability is the likelihood for the third island.

If the island is unknown, then the frogfish ratio is also unknown, and thus this probability is independent of K.

06:24 - Finally, the normalization has the form given before for complete and exclusive events.

06:31 - By now we have everything we need to apply it to the Bayesian adventure, where they counted, or in the terms of Captain Venn measured, 14 frogfish in a catch of 100 fish. To begin with, we assume there are just the two islands mentioned in the treasure map. Then alpha is equal to zero and we obtain that the posterior probability for treasure island is 0.605. Therefore, the crew will decide that the island they found is almost certainly treasure island.

But what if we allow for the third alternative and assume, since we don’t know any better, that all three islands have the same prior probability of one third? Then we obtain completely different posterior probabilities.
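The island posterior can be sketched as follows. The frogfish ratios, the catch (14 of 100) and the two prior choices come from the lecture. The likelihood 1/(N + 1) for the unknown island is an assumption made here: it is what results from averaging the binomial over a uniform prior on the unknown frogfish ratio, and it is indeed independent of K. The function names are illustrative.

```python
from math import comb

def binomial(k, n, q):
    """Probability of k frogfish in a catch of n fish for frogfish ratio q."""
    return comb(n, k) * q**k * (1.0 - q) ** (n - k)

def island_posterior(k, n, alpha):
    """Posterior for (treasure, paradox, unknown) island given the catch."""
    priors = [(1.0 - alpha) / 2.0, (1.0 - alpha) / 2.0, alpha]
    likelihoods = [
        binomial(k, n, 0.1),   # treasure island, Q_1 = 0.1
        binomial(k, n, 0.2),   # paradox island, Q_2 = 0.2
        1.0 / (n + 1),         # unknown island (assumed uniform prior on its ratio)
    ]
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    evidence = sum(joint)      # normalization over complete, exclusive islands
    return [j / evidence for j in joint]

two_islands = island_posterior(14, 100, alpha=0.0)
three_islands = island_posterior(14, 100, alpha=1.0 / 3.0)
print("alpha = 0:  ", [round(p, 3) for p in two_islands])
print("alpha = 1/3:", [round(p, 3) for p in three_islands])
```

Under this assumption the third island takes probability mass away from both known islands, so the posterior for treasure island drops below its two-island value.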

07:21 - When Captain Bayes visited Claire’s pottery store she was wondering: “It would be an intriguing question whether you can guess this distribution just from the mean starfish ratings?” Guessing a distribution just from its mean value is a very general question; it concerns the assignment of prior probabilities. For the task of assigning prior probabilities we so far only know the principle of indifference, based on symmetry or indifferent properties, which leads to a uniform distribution.

07:54 - We will now learn how to extend this technique, that is, how to use additional so-called testable information, such as the mean starfish rating, which will lead us to a non-uniform distribution. The general approach is the so-called maximum entropy method. The main idea is to evaluate Shannon’s entropy for a probability distribution. It is a measure of the uncertainty encoded in the probability distribution. Before we show that this statement is plausible, guided by a simple example, let us discuss some details of the entropy.

08:33 - The individual summands −Q_i ln Q_i are greater than or equal to zero, and they are zero only in the case of certainty, which means the corresponding event will definitely happen (Q equal to 1) or will definitely never happen (Q equal to 0). In between, each summand looks like an inverted parabola.

08:54 - The total Shannon entropy is zero if the entire probability mass is concentrated at a single index. Zero entropy means zero uncertainty: we are certain about the outcome of the corresponding event. The most uncertain or vague probability distribution is the one that maximizes the entropy but still fulfils the normalization constraint.

09:19 - The most uncertain distribution fulfilling that constraint will soon be proven to be the uniform distribution. This is very reasonable, as the entropy does not distinguish between the individual terms.

09:33 - So the principle of indifference still holds. The entropy for a uniform distribution is the logarithm of the number of outcomes.

09:43 - Let’s consider some simple examples of distributions which all yield a mean value of 3.

09:51 - One possibility for the intrinsic starfish rating could be a distribution that encodes certainty, because it always predicts a rating of 3 starfish.

10:01 - The corresponding entropy is zero, as outlined before.

10:05 - Next we consider a distribution which is a little more uncertain, as it predicts the rating values 1 and 5 stars with equal probability. The corresponding entropy is ln(2), which is obviously greater than zero.

10:21 - Finally we consider the uniform distribution. This is the least committed distribution and it has, as we know already, the greatest entropy, ln(5).
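The three example distributions can be checked numerically with a short sketch. The helper names are illustrative; the convention 0 · ln 0 = 0 is used for vanishing probabilities.

```python
from math import log

RATINGS = (1, 2, 3, 4, 5)

def shannon_entropy(probs):
    """Shannon entropy -sum p ln p, with the convention 0 * ln 0 = 0."""
    return sum(-p * log(p) for p in probs if p > 0.0)

def mean_rating(probs):
    """Mean starfish rating of a distribution over the ratings 1..5."""
    return sum(r * p for r, p in zip(RATINGS, probs))

examples = {
    "certain (always 3)": [0.0, 0.0, 1.0, 0.0, 0.0],
    "1 or 5, equally likely": [0.5, 0.0, 0.0, 0.0, 0.5],
    "uniform": [0.2, 0.2, 0.2, 0.2, 0.2],
}
for name, probs in examples.items():
    print(f"{name}: mean = {mean_rating(probs):.1f}, entropy = {shannon_entropy(probs):.4f}")
```

All three distributions have mean 3, but their entropies are 0, ln(2) and ln(5) respectively.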

10:32 - So, if all we know about a normalized probability distribution for the starfish rating is that its mean is 3, then the uniform distribution is the best choice. It fulfils the constraint but adds no extra information.

10:49 - Any other distribution favours some starfish-rating values over others, which is not supported by the mean value of 3. If we prefer a distribution that has a lower entropy than the maximum possible given the constraints, then there has to be additional information to support it. By now, you should be able to guess how to proceed if additional so-called testable information, like the mean starfish rating, is given.

11:18 - I am sure you got it right. The entropy has to be maximized while fulfilling all the constraints, including normalization. For the sake of clarity we will only use one constraint in addition to the normalization.

11:34 - Maximizing the entropy under those constraints leads to probabilities that are consistent with the constraints but otherwise as uninformative or uncommitted as possible. If you draw conclusions based on the maximum entropy probability distribution, the deviation from the true result is thereby minimized on average. Constrained maximization can be achieved most elegantly by the method of Lagrange multipliers. We define the so-called Lagrangian.

12:07 - The maximization condition is that the derivative with respect to the probabilities vanishes.

12:13 - The derivative can be computed easily and we obtain an exponential expression for the probabilities. The normalization Z follows readily.

12:23 - The second Lagrange parameter λ follows from the second constraint.
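The steps just described can be written out as follows. This is a reconstruction under assumptions, not the slide itself: the symbols Q_r for the rating probabilities and λ_0 for the multiplier of the normalization constraint are chosen here, and the sign of λ is fixed so that λ → −∞ favours r = 1, in line with the limits discussed in the lecture.

```latex
\begin{align}
  \mathcal{L} &= -\sum_{r=1}^{5} Q_r \ln Q_r
      + \lambda_0 \Big( \sum_{r=1}^{5} Q_r - 1 \Big)
      + \lambda \Big( \sum_{r=1}^{5} r\, Q_r - \mu \Big), \\
  \frac{\partial \mathcal{L}}{\partial Q_r} &= -\ln Q_r - 1 + \lambda_0 + \lambda r = 0
  \quad\Longrightarrow\quad
  Q_r = \frac{e^{\lambda r}}{Z}, \qquad Z = \sum_{r=1}^{5} e^{\lambda r}, \\
  \mu(\lambda) &= \frac{1}{Z} \sum_{r=1}^{5} r\, e^{\lambda r}.
\end{align}
```

The last line is the second constraint, which determines λ for a given mean µ.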

12:29 - We cannot solve that in closed form, but the result can be obtained easily by numerical means. The behaviour of µ(λ) can be estimated qualitatively.

12:41 - In the limit λ → −∞ only the r = 1 term survives, resulting in µ = 1, and in the opposite case (λ → ∞) only r = 5 survives, leading to µ = 5. In between, the function increases monotonically, as you can see in the figure. Since the curve increases monotonically, for each mean µ there exists a unique solution for the Lagrange parameter λ, as illustrated in the figure, which shows the result for a mean value µ = 4.5. That leads to the following rating probabilities.

13:24 - For the numerical determination of the Lagrange-parameter λ the Newton-Raphson method is ideally suited in this case. The result depends on the intrinsic or true mean µ and reads.
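A minimal Newton-Raphson sketch for this problem is given below. It assumes the exponential form exp(λ r)/Z for the rating probabilities and uses the fact that the derivative of µ(λ) is the variance of the rating under that distribution. Function names, starting value, tolerance and iteration cap are choices made here, not taken from the course notebook.

```python
import math

RATINGS = (1, 2, 3, 4, 5)

def maxent_probs(lam):
    """Rating probabilities exp(lam * r) / Z, shifted for numerical stability."""
    top = max(lam * r for r in RATINGS)
    weights = [math.exp(lam * r - top) for r in RATINGS]
    z = sum(weights)
    return [w / z for w in weights]

def solve_lambda(mu, tol=1e-12, max_iter=100):
    """Find lam such that the maximum-entropy mean equals mu (1 < mu < 5)."""
    if not 1.0 < mu < 5.0:
        raise ValueError("mu must lie strictly between 1 and 5")
    lam = 0.0  # start from the uniform distribution (mean 3)
    for _ in range(max_iter):
        probs = maxent_probs(lam)
        mean = sum(r * p for r, p in zip(RATINGS, probs))
        var = sum(r * r * p for r, p in zip(RATINGS, probs)) - mean**2
        step = (mean - mu) / var  # Newton step: f(lam) / f'(lam), with f' = variance
        lam -= step
        if abs(step) < tol:
            break
    return lam

lam = solve_lambda(4.5)
probs = maxent_probs(lam)
print(f"lambda = {lam:.4f}")
print("rating probabilities:", [round(p, 4) for p in probs])
```

For a target mean above 3 the solution has λ > 0, so the probabilities increase with the rating; for a mean of exactly 3 the solver stays at λ = 0 and returns the uniform distribution.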

13:38 - You will find a Pluto notebook, where you can experiment with different mean starfish ratings and obtain the corresponding rating probabilities.

13:50 - Captain Bayes found the starfish rating interesting, but slightly confusing, as the number of votes was different for the various items. We will discuss the reliability of the voting in two steps, first by a sort of statistical experiment and then in the correct Bayesian framework.

14:10 - Since we can’t try all four pots, we want to find out how good the pots really are by a statistical experiment. Take the measured ratings and think of what our own vote would be and how it would change the overall rating.

14:25 - For simplicity, let’s assume that starfish ratings are binary outcomes, which means we only have “like” or “dislike”. So to describe the tin pot rating of 3.9 out of 5 starfish with 250 votes in terms of binary votes, we have 195 likes and 55 dislikes. Let’s assume Laplace has no preference, so his prior prejudice is 50% like and 50% dislike. Adding his own rating as one proportional vote leads to 195.5 likes and 55.5 dislikes and a new rating of 3.89 starfish, so a very modest reduction.

15:14 - For the iron vessel with 5 out of 5 starfish and only two votes, the situation is different. Adding Laplace’s own vote leads to 4.2 starfish, so a strong reduction that makes the other pots more likely to be a good choice. This is plausible, because you wouldn’t rely on the opinion of just two strangers.

15:38 - Now this technique can be adapted in two ways. First, Laplace might have concerns about the pots that are similar to the broken one, say the iron and the tin pot, because they have similar colors to the old pot, so he might change his mind and give only 0.3 likes to the tin and iron pot, for example.

15:59 - Second, he could give more weight to his opinion by contributing his opinion as 2 or more votes rather than as a single vote.
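The statistical experiment, including both adaptations, fits in a small helper. The function and parameter names are illustrative; the ratings and vote counts are the ones quoted in the lecture.

```python
def updated_rating(stars, votes, own_like=0.5, weight=1.0, max_stars=5):
    """Rating after adding one's own opinion to binary like/dislike votes.

    own_like is the fraction of the own vote counted as a like,
    weight is the number of votes the own opinion counts for.
    """
    likes = stars / max_stars * votes       # e.g. 3.9 / 5 * 250 = 195 likes
    new_likes = likes + own_like * weight
    return max_stars * new_likes / (votes + weight)

tin = updated_rating(3.9, 250)    # 195.5 likes out of 251 votes
iron = updated_rating(5.0, 2)     # 2.5 likes out of 3 votes
print(f"tin pot:     {tin:.2f} starfish")
print(f"iron vessel: {iron:.2f} starfish")

# Adaptations: a more sceptical opinion (0.3 likes) and a double-weight vote.
print(f"iron, sceptical:     {updated_rating(5.0, 2, own_like=0.3):.2f}")
print(f"iron, double weight: {updated_rating(5.0, 2, weight=2.0):.2f}")
```

The tin pot barely moves because one extra vote is diluted by 250 existing ones, while the iron vessel with two votes reacts strongly.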

16:09 - There is a Pluto notebook waiting for you, where you can vary your own opinion and appropriate weightings, and also explore how a price can change your personal decision.

16:21 - The historical Laplace had asked himself how likely it was that the sun would rise again the next day, based on an estimated number of days it had risen before. This is nowadays called the Laplace law of succession.

16:37 - For the starfish rating the corresponding question would be: what is the probability that the next vote will be r starfish, given the earlier votes? This is a nice exercise in Bayesian probability theory. Some details will be given in the supplementary material.
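For orientation, the well-known results can be stated as follows, assuming a uniform prior over the unknown probabilities; the supplementary material may use a different prior or notation. Here n_r is the number of earlier votes with r starfish and N the total number of votes.

```latex
\begin{align}
  P(\text{sunrise tomorrow} \mid \text{sun rose on all } n \text{ days}) &= \frac{n+1}{n+2}, \\
  P(\text{next vote} = r \mid n_1, \dots, n_5) &= \frac{n_r + 1}{N + 5},
  \qquad N = \sum_{r=1}^{5} n_r .
\end{align}
```

For the iron vessel with two 5-starfish votes this would give 3/7 for another 5-starfish vote and 1/7 for each of the other ratings.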

16:55 - Now we turn to the correct Bayesian approach for the starfish rating. We are interested in the probability for the true or intrinsic rating µ_R, given the mean rating and the number of votes. The true rating is what you would get with an infinite number of votes. The first step is to invoke Bayes’ theorem to express the posterior in terms of likelihood and prior. To be able to compute the likelihood we need the individual ratings R_i of all N voters, which we introduce via the marginalization rule.

Here we used the fact that the individual votes are uncorrelated. The factors in the product are nothing but the rating probability we derived in the maximum entropy section.

17:45 - That should be enough information for you to perform the remaining steps on the Bayesian path yourself. But don’t be disappointed: the final expression has to be evaluated numerically. That is the case in most real-world problems. The figures show the results for the copper pot and the iron vessel. We see that the distribution for the iron vessel is much broader, as it is based on only 2 votes. We have used an uninformative uniform prior.
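One possible numerical evaluation is sketched below; it is not the course notebook. It assumes the exponential maximum-entropy form exp(λ r)/Z for the individual rating probabilities. The product over N independent votes then depends on the votes only through their sum, and the number of vote combinations with that sum does not depend on µ, so it cancels in the normalization. The posterior is evaluated on a grid with a uniform prior on µ. The tin pot and iron vessel numbers are those quoted earlier; grid size and function names are choices made here.

```python
import math

RATINGS = (1, 2, 3, 4, 5)

def log_z(lam):
    """Logarithm of Z(lam) = sum_r exp(lam * r), computed stably."""
    top = max(lam * r for r in RATINGS)
    return top + math.log(sum(math.exp(lam * r - top) for r in RATINGS))

def mean_rating(lam):
    """Mean of the distribution exp(lam * r) / Z(lam)."""
    lz = log_z(lam)
    return sum(r * math.exp(lam * r - lz) for r in RATINGS)

def lam_of_mu(mu, lo=-50.0, hi=50.0):
    """Invert the monotonic map lam -> mean by bisection."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mean_rating(mid) < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def posterior_summary(mean_vote, n_votes, grid=400):
    """Posterior mean and standard deviation of the intrinsic rating mu."""
    total = mean_vote * n_votes  # sum of all individual ratings
    mus = [1.0 + 4.0 * (k + 0.5) / grid for k in range(grid)]  # grid in (1, 5)
    log_post = []
    for mu in mus:
        lam = lam_of_mu(mu)
        # log likelihood up to a mu-independent constant; uniform prior adds nothing
        log_post.append(lam * total - n_votes * log_z(lam))
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    norm = sum(weights)
    mean = sum(mu * w for mu, w in zip(mus, weights)) / norm
    var = sum((mu - mean) ** 2 * w for mu, w in zip(mus, weights)) / norm
    return mean, math.sqrt(var)

tin_mean, tin_std = posterior_summary(3.9, 250)
iron_mean, iron_std = posterior_summary(5.0, 2)
print(f"tin pot:     {tin_mean:.2f} +/- {tin_std:.2f}")
print(f"iron vessel: {iron_mean:.2f} +/- {iron_std:.2f}")
```

Under these assumptions the posterior for the tin pot is narrow and centred near 3.9, while the posterior for the iron vessel is much broader because it rests on two votes only.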

18:17 - We can characterize the results by their mean plus or minus standard deviation. This quantifies the observation of the previous statistical experiment that the rating of the iron vessel is not very reliable and the iron vessel is not really more popular than the tin pot.

18:38 - You will find a Pluto notebook with all the details, and you can experiment with how the rating probability distribution depends on the mean rating and the number of votes.

18:49 - Bayes’ theorem can not only be used for inverse problems, it can also be considered as an update rule for additional information. Assume we have two independent data sets D_1 and D_2 and we want to infer some parameters or answer any other question, which we will generically denote as proposition X. By the definition of the conditional probability and the product rule we obtain a remarkable formula: P(X | D_1, D_2) is proportional to P(D_2 | X) P(X | D_1). So instead of taking both data sets into account in one step, we can use Bayes’ theorem iteratively.

19:25 - In that case, D_1 defines the prior for the next iteration step, where D_2 is taken into account. The posterior can again be summarized by its mode or maximum, which is called the MAP, or by its mean, to obtain an estimate for the parameter set X.
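The equivalence of the one-step and the iterative use of Bayes’ theorem can be checked with a small sketch based on the two known islands. The first catch (14 of 100) is the one from the lecture; the second catch (9 of 50) is hypothetical and only serves as a second independent data set.

```python
from math import comb

RATIOS = (0.1, 0.2)  # frogfish ratios of treasure and paradox island

def binomial(k, n, q):
    """Probability of k frogfish in a catch of n fish for frogfish ratio q."""
    return comb(n, k) * q**k * (1.0 - q) ** (n - k)

def bayes(prior, likelihoods):
    """One application of Bayes' theorem over exclusive, complete propositions."""
    joint = [like * p for like, p in zip(likelihoods, prior)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

prior = [0.5, 0.5]
like_d1 = [binomial(14, 100, q) for q in RATIOS]  # catch from the lecture
like_d2 = [binomial(9, 50, q) for q in RATIOS]    # hypothetical second catch

# Iterative use: the posterior after D_1 becomes the prior for D_2.
after_d1 = bayes(prior, like_d1)
sequential = bayes(after_d1, like_d2)

# One-step use: independent data sets multiply in the likelihood.
batch = bayes(prior, [a * b for a, b in zip(like_d1, like_d2)])

print("sequential:", [round(p, 6) for p in sequential])
print("batch:     ", [round(p, 6) for p in batch])
```

Both routes give the same posterior up to rounding, and the order in which the two data sets are processed does not matter.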

19:46 - A more reliable and robust estimator for the parameters is given by the mean.

19:52 - This also allows us to quantify its uncertainty by the corresponding standard deviation. Studying Bayes’ theorem, and especially the relation of its parts, in more detail reveals many findings.

20:06 - If the prior is constant then the posterior is proportional to the likelihood and the MAP solution is equal to what is called the Maximum-Likelihood solution.

20:17 - If the likelihood is Gaussian, then Maximum-Likelihood is the same as the standard least squares approach. We leave it up to you to prove this statement.
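As a hint for the proof, here is a sketch under assumed notation (not taken from the slides): data points d_i with independent Gaussian errors of known standard deviation σ_i, and a model function f(x_i; a) with parameters a.

```latex
\begin{align}
  P(d \mid a) &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_i^2}}
      \exp\!\Big( -\frac{\big(d_i - f(x_i; a)\big)^2}{2\sigma_i^2} \Big), \\
  \ln P(d \mid a) &= -\frac{1}{2} \sum_{i=1}^{N}
      \frac{\big(d_i - f(x_i; a)\big)^2}{\sigma_i^2} + \text{const}.
\end{align}
```

Since the constant does not depend on a, maximizing the likelihood with respect to a is the same as minimizing the weighted sum of squared residuals.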

20:28 - So far the normalization was a mere nuisance, but it actually has a deeper meaning. Suppose B encodes the measured data, for instance the positions of Captain Bayes’ ship on the ocean measured on consecutive days, and A_n describes exclusive and complete propositions about how these positions are aligned. Say the positions are on a straight line, on a parabola, or completely random. Then the probability for B is what is called the data evidence, which is the probability of finding the measured data if these assumptions are correct.

21:09 - This concludes unit six. We have learned Bayes’ theorem and how to apply it to inverse problems such as parameter estimation, model selection (think of treasure island) or hypothesis testing, which in the Bayesian frame are one and the same thing. We learned how to update probabilities based on new information and how to assign prior probabilities using the maximum entropy principle. Now it’s your turn to study more inverse problems in the bonus material and to have a look at the interactive Pluto notebooks.

21:45 - Please feel free to ask questions in the forum and feel encouraged to test your knowledge in the quiz!