Probabilities

🧠 Introduction to AI

Andy Weeger

Neu-Ulm University of Applied Sciences

February 13, 2024

Introduction

Discussion

What are challenges related to logic as an approach to knowledge representation?

Qualification problem

Logic is good, but often conclusions under uncertainty need to be drawn, as …

  • the agent’s knowledge of the world is incomplete (not enough information) or uncertain (sensors are unreliable);
  • the agent must consider every possible explanation for its percepts (no matter how unlikely);
  • this leads to a large belief-state full of unlikely possibilities and arbitrarily large contingent plans; and
  • rules about the world are often incomplete (e.g., are all preconditions for an action known?) or even incorrect.

Example

Goal: Be in Ulm at 8:15 to give a lecture

There are several plans that achieve the goal:

  • \(P_1:\) Get up at 6:00, take the bike, arrive at 7:30, take a shower, …
  • \(P_2:\) Get up at 6:30, take the car at 7:00, arrive at 7:45, …
  • …

All these plans are correct, but they imply different costs and different probabilities of actually achieving the goal.

\(P_2\) is eventually the plan of choice, as the success rate of \(P_1\) is only 80%

Uncertainty in rules

Take an expert dental diagnosis system as an example.

\(Toothache \implies Cavity\)

→ This rule is incorrect, better:

\(Toothache \implies Cavity \lor GumProblem \lor Abscess ...\)

→ However, we don’t know all the causes [1]

Perhaps a causal rule is better?

\(Cavity \implies Toothache\)

→ This is still wrong and does not allow us to reason from symptoms to causes.

Discussion

What do we learn from that?

Learnings

  • We cannot enumerate all possible causes (i.e., laziness)
  • And even if we could, we would not know how correct the rules are (i.e., theoretical ignorance [2])
  • And even if we did, there would always be uncertainty about the particular patient (i.e., practical ignorance [3])

→ Without perfect knowledge (i.e., “the rules of the game”), logical rules do not help much

Discussion

What issues might arise when localizing a robot?

Uncertainty in facts

Let us suppose we wanted to support the localization of a robot with (constant) landmarks. With the availability of landmarks, we can narrow down the area.

Problem: sensors can be imprecise

  • From the fact that a landmark was perceived, we cannot conclude with certainty that the robot is at that location
  • The same is true when no landmark is perceived
  • Only the probability increases or decreases

Probability theory

  • We (and other agents) are convinced by facts and rules only up to a certain degree
  • One possibility for expressing the degree of belief is to use probabilities
  • The agent is 90% (or 0.9) convinced by its sensor information
    (the agent believes that in 9 out of 10 cases the information is correct)
  • Probabilities summarize the “uncertainty” that stems from lack of knowledge.
  • Probabilities are not to be confused with vagueness:
    the predicate tall is vague; the statement “A man is 1.75–1.80 m tall” is uncertain.

Decisions under uncertainty

Decision theory

= utility theory + probability theory

  • We have a choice of actions (or plans)
  • These can lead to different solutions with different probabilities
  • The actions have different (subjective) costs
  • The results have different (subjective) utilities

It would be rational to choose the action with the maximum expected utility (MEU): the average of the outcome utilities, weighted by the probability of each outcome, minus the costs of the actions leading to it.
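A minimal Python sketch of the MEU computation for the two lecture plans. The 80% success rate of \(P_1\) comes from the example above; all other numbers (costs, utilities, the success rate of \(P_2\)) are made up for illustration:

```python
# MEU sketch: each plan has a (subjective) action cost and a set of
# (probability, utility) outcome pairs. All numbers except P1's 0.80
# success rate are hypothetical.
plans = {
    "P1 (bike)": {"cost": 2.0, "outcomes": [(0.80, 10.0), (0.20, 0.0)]},
    "P2 (car)":  {"cost": 1.0, "outcomes": [(0.95, 10.0), (0.05, 0.0)]},
}

def expected_utility(plan):
    """Probability-weighted mean of the outcome utilities minus the action cost."""
    return sum(p * u for p, u in plan["outcomes"]) - plan["cost"]

for name, plan in plans.items():
    print(f"{name}: EU = {expected_utility(plan):.2f}")  # P1: 6.00, P2: 8.50

print("MEU choice:", max(plans, key=lambda n: expected_utility(plans[n])))
```

With these numbers, \(P_2\) has the higher expected utility, matching the choice in the example.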

Unconditional probabilities

\(P(A)\) denotes the unconditional probability or prior probability that \(A\) will appear in the absence of any other information, e.g. \(P(Cavity) = 0.1\)

  • Prior probabilities can be obtained from statistical analysis or general rules
  • A random variable [4] can take on values from some range—the set of possible values (e.g., numbers, Booleans, arbitrary tokens), e.g. \(P(Weather=sunny)=0.6\)
  • Logical connectors can be used to build probabilistic propositions, e.g. \(P(Cavity \land \neg Insured) = 0.06\)
  • Propositions can contain equations over random variables, e.g. \(P(NoonTemp=x) = \textrm{Uniform}(x;18C;26C)\) [5], usually given by a probability density function

Conditional probabilities

New information (usually called evidence) can change the probability, e.g. the probability of a cavity increases if we know the patient has a toothache

\(P(a|b)\) is the conditional probability of \(a\) given that all we know is \(b\)

Conditional probabilities (or posterior probabilities) are defined in terms of unconditional probabilities as follows:

For any propositions \(a\) and \(b\) (if \(P(b)>0\)), we have

\[ \begin{flalign} P(a|b) = \frac{P(a \land b)}{P(b)} = \frac{P(a, b)}{P(b)}&& \end{flalign} \]

Or in a different form as product rule

\(P(a \land b) = P(a,b) = P(a|b)P(b) = P(b|a)P(a)\) [6]

Example

The product rule for all possible values of Weather and Cavity can be written as a single equation:

\(P(Weather,Cavity)=P(Weather|Cavity)P(Cavity)\)

which corresponds to a system of equations (using abbreviations W and C):

\(P(W=sun,C=true) = P(W=sun|C=true)P(C=true)\)

\(P(W=rain,C=true) = P(W=rain|C=true)P(C=true)\)

…

\(P(W=snow,C=false) = P(W=snow|C=false)P(C=false)\)

Marginalization

Marginalization in probability refers to obtaining the probability of a random variable X by “summing out” the other variable(s) from the joint probability distribution of X with those variables.

For any sets of variables X and Z we have

\[ \begin{flalign} P(X=x) = \sum_{z} P(X=x,Z=z) && \end{flalign} \]

To find \(P(X=x)\), we sum all the probability values where \(X=x\) occurs with all the possible values of Z.

We can abbreviate \(P(X=x)\) by \(P(x)\) and \(P(X=x,Z=z)\) by \(P(x,z)\).

Using the product rule, we can replace \(P(x,z)\) by \(P(x|z)P(z)\), a rule known as conditioning:

\[ \begin{flalign} P(x) = \sum_{z} P(x|z)P(z) && \end{flalign} \]
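A small Python sketch of marginalization and conditioning over a joint distribution of Weather and Cavity. The joint values are invented, but they sum to 1 and are chosen to be consistent with \(P(Weather=sunny)=0.6\) and \(P(Cavity)=0.1\) used elsewhere in this section:

```python
# Joint distribution P(Weather, Cavity) as a dict mapping atomic events
# (weather, cavity) to probabilities. Values are hypothetical.
joint = {
    ("sun", True): 0.07,   ("sun", False): 0.53,
    ("rain", True): 0.02,  ("rain", False): 0.18,
    ("cloud", True): 0.01, ("cloud", False): 0.14,
    ("snow", True): 0.00,  ("snow", False): 0.05,
}

def p_weather(w):
    """Marginalization: P(Weather=w) = sum_z P(w, Z=z)."""
    return sum(p for (weather, _), p in joint.items() if weather == w)

def p_weather_by_conditioning(w):
    """Conditioning: P(Weather=w) = sum_z P(w | z) P(z)."""
    total = 0.0
    for z in (True, False):
        p_z = sum(p for (_, cav), p in joint.items() if cav == z)
        total += (joint[(w, z)] / p_z) * p_z
    return total

print(p_weather("sun"))                  # 0.6
print(p_weather_by_conditioning("sun"))  # 0.6 (same result, as expected)
```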

Probability axioms

The basic axioms of probability theory say that every possible world has a probability between 0 and 1 and that the total probability of the set of possible worlds is 1:

\(0 \leq P(\omega) \leq 1 \quad \textrm{for every } \omega \in \Omega \quad \textrm{and} \quad \sum_{\omega \in \Omega}P(\omega)=1\)

The probability of a disjunction is given by the following formula, sometimes called the inclusion-exclusion principle:

\(P(a \lor b) = P(a) + P(b) - P(a \land b)\)

All other properties can be derived from these axioms, e.g.

\(P(\neg A) = 1 - P(A)\)

follows from \(P(A \lor \neg A) = 1\) and \(P(A \land \neg A) = 0\)

Joint probability

Probabilities can be assigned to every proposition; an atomic event is an assignment of values to all random variables (i.e., a complete specification of a state)

Example: Let \(X\) and \(Y\) be Boolean variables, leading to the following four atomic events: \(X \land Y, \neg X \land Y, X \land \neg Y, \neg X \land \neg Y\)

The joint probability distribution \(P(x_1,...,x_n)\) assigns a probability to every atomic event.

          Toothache   ¬Toothache
Cavity    0.04        0.06
¬Cavity   0.01        0.89

Table 1: Probabilities of the atomic events (full joint distribution for the Toothache, Cavity world)

Since all atomic events are disjoint, the sum of all fields is 1.

Working with joint probabilities

All relevant probabilities can be computed using the joint probability by expressing them as a disjunction of atomic events. Example:

\[ \begin{align} P(Cavity \lor Toothache) & = P(Cavity \land Toothache) \\ &+ P(\neg Cavity \land Toothache) \\ &+ P(Cavity \land \neg Toothache) \end{align} \]
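With the values from Table 1, this disjunction evaluates to \(0.04 + 0.01 + 0.06 = 0.11\), which agrees with the inclusion-exclusion principle: \(P(Cavity) + P(Toothache) - P(Cavity \land Toothache) = 0.1 + 0.05 - 0.04 = 0.11\).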

Unconditional probabilities are obtained by adding across a row or column:

\(P(Cavity) = P(Cavity \land Toothache) + P(Cavity \land \neg Toothache)\)

Conditional probabilities are obtained by using the product rule:

\(P(Cavity|Toothache) = \frac{P(Cavity \land Toothache)}{P(Toothache)}=\frac{0.04}{0.04+0.01}=0.80\)
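The same computations, sketched in Python directly from the full joint distribution in Table 1:

```python
# Full joint distribution from Table 1, keyed by the atomic events
# (cavity, toothache).
joint = {
    (True, True): 0.04,  (True, False): 0.06,
    (False, True): 0.01, (False, False): 0.89,
}

def p(event):
    """Probability of a proposition, given as a predicate over atomic events."""
    return sum(prob for (cav, tooth), prob in joint.items() if event(cav, tooth))

p_cavity = p(lambda cav, tooth: cav)           # 0.10, the row sum
p_toothache = p(lambda cav, tooth: tooth)      # 0.05, the column sum
p_either = p(lambda cav, tooth: cav or tooth)  # 0.11, the disjunction above
print(p(lambda cav, tooth: cav and tooth) / p_toothache)  # P(Cavity|Toothache) = 0.8
```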

Discussion

What challenges arise when dealing with joint probabilities?

Problems with joint probabilities

We can easily obtain all probabilities from the joint probability. The joint probability, however, involves \(k^n\) values, if there are \(n\) random variables with \(k\) values.

→ Implies difficulties of representation and assessment

Questions:

  • Is there a more compact way of representing joint probabilities?
  • Is there an efficient method to work with this representation?

Not in general, but it can work in many cases. Modern systems work directly with conditional probabilities and make assumptions on the independence of variables [7] in order to simplify calculations.

Individual work

Work through the Bayes’ rule chapter, try to understand it, and note any remaining questions.

Bayes’ rule

Derivation

We already know the product rule that can be written in two forms:

\(P(a \land b) = P(a|b) P(b) \quad \textrm{and} \quad P(a \land b) = P(b|a) P(a)\)

By equating the right-hand sides and dividing by \(P(a)\), we get

\(P(b|a) = \frac{P(a|b)P(b)}{P(a)}\quad\) (known as Bayes’ rule)

For multi-valued variables (set of equalities):

\(P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}\)

Generalization (conditioning on background evidence \(e\)):

\(P(Y|X,e) = \frac{P(X|Y,e) P(Y|e)}{P(X|e)}\)

Value of Bayes’ rule

Bayes’ rule only allows us to compute the single term \(P(b|a)\) in terms of three terms:

\(P(a|b)\), \(P(b)\), and \(P(a)\)

That is useful in practice as there are many cases where we do have good probability estimates for these three numbers and need to compute the fourth

Often, we perceive as evidence the effect of some known cause and we would like to determine that cause. In that case, Bayes’ rule becomes:

\(P(cause|\mathit{effect}) = \frac{P(\mathit{effect}|cause)P(cause)}{P(\mathit{effect})}\)

Example

\[ \begin{align} P(Toothache | Cavity) = {} & 0.4 \\ P(Cavity) = {} & 0.1 \\ P(Toothache) = {} & 0.05 \\ P(Cavity | Toothache) = {} & \frac{0.4 \times 0.1}{0.05} = 0.8 \end{align} \]

  • Here, only information that quantifies the relationship between Cavity and Toothache in the causal direction is given; the diagnosis is derived by applying Bayes’ rule
  • If there is quantitative information in the diagnostic direction, from symptoms to causes, there is no need to use Bayes’ rule
  • However, causal knowledge is more robust than diagnostic knowledge (imagine, e.g., a cavity epidemic [8])

Relative probability

Often we have to consider multiple causes. For example, a dentist would also like to consider the probability that the patient has gum disease.

\[\begin{align} P(Toothache | GumDisease) = {} & 0.7 \\ P(GumDisease) = {} & 0.02 \end{align}\]

Which diagnosis is more probable?

\(P(c|t) = \frac{P(t | c)P(c)}{P(t)} \quad\) or \(\quad P(g | t) = \frac{P(t | g)P(g)}{P(t)}\)

If we are only interested in the relative probability, we need not assess \(P(t)\):

\(\frac{P(c|t)}{P(g|t)} = \frac{P(t|c)P(c)}{P(t)} \frac{P(t)}{P(t|g)P(g)} = \frac{P(t|c)P(c)}{P(t|g)P(g)} = \frac{0.4*0.1}{0.7*0.02} = 2.857\)

→ Important for excluding possible diagnoses

Normalization

If we wish to determine the absolute probability of \(P(c | t)\) and we do not know \(P(t)\), we can also carry out a complete case analysis (e.g. for \(c\) and \(\neg c\))

We would use the fact that

\(P(c | t) + P(\neg c | t) = 1\) (here \(c\) and \(t\) are boolean variables)

\[ \begin{flalign} P(c | t) = {} & \frac{P(t|c)P(c)}{P(t)} && (1.1) \\ P(\neg c | t) = {} & \frac{P(t|\neg c)P(\neg c)}{P(t)} && (1.2) \\ P(c | t) + P(\neg c | t) = {} & \frac{P(t|c)P(c)}{P(t)} + \frac{P(t|\neg c)P(\neg c)}{P(t)} && (1.3) \\ P(t) = {} & P(t|c) P(c) + P(t|\neg c) P(\neg c) && (1.4) \end{flalign} \]

By substituting equation (1.4) into the first equation (1.1), we get:

\[ \begin{flalign} P(c|t) = {} & \frac{P(t|c)P(c)}{P(t|c)P(c)+P(t|\neg c)P(\neg c)} && (1.5) \end{flalign} \]

Example

Your doctor tells you that you have tested positive for a serious but rare (1/10000) disease (\(D\)). This test (\(T\)) is correct to 99% (1% false positive & 1% false negative results). What does this mean for you?

\[ \begin{flalign} P(D|T) = \frac{P(T|D) P(D)}{P(T)} = \frac{P(T|D) P(D)}{P(T|D) P(D) + P(T| \neg D) P( \neg D)} && \end{flalign} \] This applies marginalization to Bayes’ rule with Boolean variables.

\[ \begin{flalign} P(D) = 0.0001 \quad P(T|D) = 0.99 \quad P(T | \neg D) = 0.01 && \end{flalign} \]

\[ \begin{flalign} P(D|T) = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} = \frac{0.000099}{0.010098} \approx 0.01 && \end{flalign} \]

Bottom line: if the inaccuracy of the test is much greater than the frequency of occurrence of the disease, then a positive result is not as threatening as one might think.
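A quick Python check of this computation, using equation (1.5):

```python
# Bayes' rule with normalization (equation 1.5) for the rare-disease test.
p_d = 0.0001            # prior P(D): the disease affects 1 in 10,000
p_t_given_d = 0.99      # P(T|D): 1% false negatives
p_t_given_not_d = 0.01  # P(T|not D): 1% false positives

numerator = p_t_given_d * p_d
p_t = numerator + p_t_given_not_d * (1 - p_d)  # P(T) by case analysis (1.4)
print(numerator / p_t)  # ~0.0098: under 1% despite the positive test
```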

Normalizing constant

For random variables with multiple values:

\[ \begin{flalign} P(Y|X) = {} & \frac{P(X|Y)P(Y)}{P(X)} && \\ \end{flalign} \]

\[ \begin{flalign} \alpha = \frac{1}{P(X)} = \frac{1}{\sum_i P(X|Y_i)P(Y_i)} && \\ \end{flalign} \]

Determining the unknown term \(P(X)\) requires summing over all possible values of \(Y\) (for continuous variables, an integral); thus, the normalization constant \(\alpha\) is often determined approximately. We get:

\[ \begin{flalign} P(Y|X) = {} & \alpha P(X|Y) P(Y) && \end{flalign} \]

Multiple evidence

A dentist’s probe catches in the aching tooth of a patient.

Using Bayes’ rule, we can calculate:

\[ \begin{flalign} P(Cav | Catch) = 0.95 && \end{flalign} \]

But how does the combined evidence help?

With the help of Bayes’ rule, the dentist could conclude:

\[ \begin{flalign} P(Cav | Toothache \land Catch) = \frac{P(Toothache \land Catch|Cav)P(Cav)}{P(Toothache \land Catch)} && \end{flalign} \]

\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache \land Catch|Cav)P(Cav)} && (2.0) \end{flalign} \]

Problem: The dentist needs to know the conditional probabilities of the conjunction \(Toothache \land Catch\) for each value of \(Cav\), i.e., diagnostic knowledge of all combinations of symptoms in the general case. This might be possible for just two evidence variables, but does not scale up.

As it would reduce the complexity of the inference problem, it would be nice if \(Toothache\) and \(Catch\) were independent, but they are not: if a probe catches in the tooth, the tooth probably has a cavity, and that cavity probably causes a toothache.

However, given that we know whether the tooth has a cavity, these variables are independent, as each is directly caused by the cavity, but neither has a direct effect on the other [9]:

\[ \begin{flalign} P(Toothache \land Catch | Cav) = P(Toothache | Cav) P(Catch | Cav) && (2.1) \end{flalign} \]

Conditional independence

The general definition of conditional independence of two variables \(X\) and \(Y\) given a third variable \(Z\) is:

\[ \begin{flalign} P(X,Y|Z) = P(X|Z)P(Y|Z) && \end{flalign} \]

In the dentist example, it seems reasonable to assert conditional independence of the variables \(Toothache\) and \(Catch\) given \(Cav\); thus, we can insert equation (2.1) into (2.0), leading to (2.2):

\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache \land Catch|Cav)P(Cav)} && (2.0) \end{flalign} \]

\[ \begin{flalign} P(Toothache \land Catch | Cav) = P(Toothache | Cav) P(Catch | Cav) && (2.1) \end{flalign} \]

\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache | Cav) P(Catch | Cav)P(Cav)} && (2.2) \end{flalign} \]
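A Python sketch of equation (2.2) with explicit normalization. \(P(Cavity)=0.1\) and \(P(Toothache|Cavity)=0.04/0.1=0.4\) follow from Table 1, while the probe probabilities \(P(Catch|\cdot)\) are hypothetical values for illustration:

```python
# Combining two pieces of evidence under conditional independence (eq. 2.2).
prior = {True: 0.1, False: 0.9}               # P(Cavity), from Table 1
p_toothache = {True: 0.4, False: 0.01 / 0.9}  # P(Toothache|Cavity), from Table 1
p_catch = {True: 0.9, False: 0.2}             # P(Catch|Cavity): hypothetical

# Unnormalized posterior P(cav) * P(Toothache|cav) * P(Catch|cav) per value.
unnorm = {cav: prior[cav] * p_toothache[cav] * p_catch[cav]
          for cav in (True, False)}
alpha = 1 / sum(unnorm.values())  # normalizing constant
posterior = {cav: alpha * u for cav, u in unnorm.items()}
print(posterior[True])  # ~0.95 with these numbers
```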

Naive Bayes models

Recursive Bayesian updating

Multiple evidence can be reduced to prior probabilities and conditional probabilities (assuming conditional independence).

The general combination rule, if \(Z_1\) and \(Z_2\) are independent given \(C\), is

\[ \begin{flalign} P(C|Z_1, Z_2) = \alpha P(C) P(Z_1|C) P(Z_2|C) && \end{flalign} \]

where \(\alpha\) is the normalization constant.

The generalization is called recursive Bayesian updating:

\[ \begin{flalign} P(Cause|e) = \alpha P(Cause) \prod_j P(e_j|Cause) && \end{flalign} \]

Types of variables

Variables can be discrete or continuous

Discrete variables

  • \(Weather\): sunny, rain, cloudy, snow
  • \(Cavity\): true, false (boolean)

Continuous variables

  • Tomorrow’s maximum temperature in Berkeley
  • Domain can be the entire real line or any subset
  • Distributions for continuous variables are typically given by probability density functions

Text classification

Imagine the following task: given a text, decide which of a predefined set of classes or categories it belongs to.

  • The cause is the \(Category\) variable
  • The effect variables are the presence or absence of certain key words, \(HasWord_i\)

Example sentences:

  • Stocks rallied on Monday, with major indexes gaining 1% as optimism persisted over the first-quarter earnings season.
  • Heavy rain continued to pound much of the east coast on Monday, with flood warnings issued in New York City and other locations

The task is to classify each sentence into a \(Category\), one of the major sections of the newspaper (news, sports, business, weather, entertainment)

The naive Bayes model consists of

  • the prior probabilities \(P(Category)\) and
  • the conditional probabilities \(P(HasWord_i|Category)\)

For each category \(c\), \(P(Category=c)\) is estimated as the fraction of all previously seen documents that are of category \(c\) [10].

Similarly, \(P(HasWord_i|Category)\) is estimated as the fraction of documents of each category that contain word \(i\) [11].

To categorize a new document, we check which key words appear in the document and then apply recursive Bayesian updating to obtain the posterior probability distribution over categories:

\[ \begin{flalign} P(Cause|e) = \alpha P(Cause) \prod_j P(e_j|Cause) && \end{flalign} \]

The category with the highest posterior probability is taken.
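A minimal naive Bayes classifier sketch in Python. The priors and word probabilities are invented for illustration (in practice they would be estimated from document counts as described above), only two of the five categories are included for brevity, and only key words present in the document are scored, a common simplification:

```python
from math import log

# Hypothetical model parameters.
prior = {"business": 0.30, "weather": 0.09}  # P(Category)
p_word = {  # P(HasWord_i = true | Category)
    "business": {"stocks": 0.37, "rain": 0.01, "earnings": 0.30},
    "weather":  {"stocks": 0.01, "rain": 0.40, "earnings": 0.01},
}

def classify(words):
    """Pick the category with the highest posterior. Scores are kept in
    log space to avoid underflow; the normalization constant alpha is the
    same for every category and therefore cancels out."""
    scores = {}
    for cat in prior:
        score = log(prior[cat])
        for w in words:
            if w in p_word[cat]:  # score only key words present in the text
                score += log(p_word[cat][w])
        scores[cat] = score
    return max(scores, key=scores.get)

print(classify(["stocks", "earnings"]))  # business
print(classify(["rain"]))                # weather
```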

Summary

  • Uncertainty is unavoidable in complex, dynamic worlds in which agents are ignorant.
  • Probabilities express the agent’s inability to reach a definite decision. They summarize the agent’s beliefs.
  • Conditional and unconditional probabilities can be formulated over propositions.
  • Bayes’ rule allows us to calculate unknown probabilities from known probabilities.
  • Multiple evidence (assuming independence) can be effectively incorporated using recursive Bayesian updating.

✏️ Exercises

Conditional probability

Show from the definition of conditional probability that \(P(a | b \land a) = 1\).

Beliefs

Would it be rational for an agent to hold the three beliefs \(P(A)=0.4\), \(P(B)=0.3\), and \(P(A \lor B)=0.5\)? If so, what range of probabilities would be rational for the agent to hold for \(A \land B\)? Make up a table like the one in Table 1, and show how it supports your argument about rationality.

Medical tests

Consider two medical tests, A and B, for a virus. Test A is 95% effective at recognizing the virus when it is present, but has a 10% false positive rate (indicating that the virus is present, when it is not). Test B is 90% effective at recognizing the virus, but has a 5% false positive rate. The two tests use independent methods of identifying the virus. The virus is carried by 1% of all people.

Say that a person is tested for the virus using only one of the tests, and that test comes back positive for carrying the virus. Which test returning positive is more indicative of someone really carrying the virus? Justify your answer mathematically.

Conditional independence

Show that the statement of conditional independence

\[ \begin{flalign} P(A,B|C)=P(A|C)P(B|C) && \end{flalign} \]

is equivalent to each of the statements

\[ \begin{flalign} P(A|B,C)=P(A|C) \quad \textrm{and} \quad P(B|A,C)=P(B|C) && \end{flalign} \]

Pacman

Pacman has developed a hobby of fishing. Over the years, he has learned that a day can be considered fit or unfit for fishing (\(Y\)), which affects three features: whether or not Ms. Pacman shows up (\(M\)), the temperature of the day (\(T\)), and how high the water level is (\(W\)). Pacman models this as a Naive Bayes classification problem.

We wish to calculate the probability a day is fit for fishing given features of the day. Consider the conditional probability tables that Pacman has estimated over the years:

\(Y\)    \(P(Y)\)
yes    0.1
no     0.9

\(M\)    \(Y\)    \(P(M|Y)\)
yes    yes    0.5
no     yes    0.5
yes    no     0.2
no     no     0.8

\(W\)    \(Y\)    \(P(W|Y)\)
high   yes    0.1
low    yes    0.9
high   no     0.5
low    no     0.5

\(T\)    \(Y\)    \(P(T|Y)\)
cold   yes    0.2
warm   yes    0.3
hot    yes    0.5
cold   no     0.5
warm   no     0.2
hot    no     0.3

Using the method of Naive Bayes, if Ms. Pacman is available, the weather is cold, and the water level is high, do we predict that the day is fit for fishing?

Literature

Russell, Stuart, and Peter Norvig. 2022. Artificial Intelligence: A Modern Approach. Harlow: Pearson Education.

Footnotes

  1. See qualification problem

  2. Medical science has no complete theory for the domain

  3. For instance, the coincidence of having a toothache and a cavity that are unrelated, or the fact that not all tests have been run

  4. Variables in probability theory are called random variables

  5. Represents the belief that the temperature at noon is distributed uniformly between 18 and 26 degrees Celsius

  6. \(P(a \land b) = P(a|b)P(b)\) equals \(P(a \land b) = P(b|a)P(a)\)

  7. Factoring large joint distributions into smaller joint distributions, using absolute independence. For instance, weather and dental problems are independent; thus, their joint distribution can be split.

  8. If there is a cavity epidemic and \(P(Cavity)\) increases, \(P(Toothache | Cavity)\) does not change, but \(P(Toothache)\) and \(P(Cavity | Toothache)\) will change proportionally

  9. Toothache depends on the state of the nerves in the tooth, whereas the probe’s accuracy depends primarily on the dentist’s skill, to which the toothache is irrelevant.

  10. For example, if 9% of the articles are about weather, we set \(P(Category=weather)=0.09\).

  11. For example, if 37% of articles about business contain word 6, “stocks,” then \(P(HasWord_6=true|Category=business)\) is set to 0.37