Probability Theory

Introduction to AI (I2AI)

Andy Weeger

Neu-Ulm University of Applied Sciences

April 19, 2026

Agenda

  • Warm-up 10 min
  • Why probability? 10 min
  • Joint probability & marginalization 20 min
  • Conditional probability & Bayes 30 min
  • Naive Bayes in action 15 min
  • Wrap-up 5 min

Why probability?

Sources of uncertainty

Probability is the response to a fundamental limitation of logic:
Agents rarely have complete, reliable information.

  • Laziness: enumerating all possible causes is infeasible in practice
  • Theoretical ignorance: even known rules are incomplete or approximate
  • Practical ignorance: even with perfect theory, facts about the specific case are unavailable
    (e.g., sensors fail, history is missing)

Decision-making under uncertainty

Expected utility weighs each possible outcome's utility by its probability:

\[EU(a) = \sum_{s} P(s \mid a) \cdot U(s)\]

  • A rational agent selects the action with maximum expected utility (MEU)
  • Probability quantifies the degree of belief, not just frequency
  • Probability is the basis for rational choice in uncertain environments.

Probability theory is thus a foundation for the design of rational AI agents.
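The MEU rule can be sketched in a few lines of Python. The umbrella scenario, its probabilities, and its utilities below are invented purely for illustration:

```python
# MEU sketch: pick the action whose probability-weighted utility is highest.
# All numbers are made-up illustrative values.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

actions = {
    "take_umbrella":  [(0.3, 70), (0.7, 80)],   # (P(rain), utility if rain), ...
    "leave_umbrella": [(0.3, 0),  (0.7, 100)],
}

best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, round(expected_utility(actions[best]), 2))
```

Here `take_umbrella` wins (expected utility 77 vs. 70), even though leaving the umbrella has the single best possible outcome.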

True or false?

Decide: true or false and give a one-sentence justification.

  1. \(P(A) + P(\neg A) = 1\) always holds.
  2. If \(P(A) = 0.5\) and \(P(B) = 0.5\), then \(P(A \land B) = 0.25\).
  3. \(P(A | B) = P(B | A)\) whenever \(P(A) = P(B)\).
  4. A highly accurate test returning positive for a rare disease means you probably have it.

Joint probability & marginalization

Prior probability

The prior probability \(P(x)\) expresses the degree of belief in \(x\) in the absence of any other information (Russell & Norvig, 2022).

Every valid probability mass function satisfies three axioms:

  • Non-negativity: \(p_X(x) \geq 0\) for all \(x\)
    (A probability can never be a negative number.)
  • Normalization: \(\sum_{x \in V(X)} p_X(x) = 1\)
    (If you add up the probabilities of every possible outcome, the total must equal exactly 1.)
  • Support: \(p_X(x) = 0\) for all \(x \notin V(X)\)
    (If an outcome is not in the set of possible values (\(V(X)\)), its probability is zero.)

Probabilities do not need to come from frequencies; they can encode expert belief (i.e., probability can represent a degree of certainty or expert judgment about a unique event that hasn’t happened yet).
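The three axioms can be checked mechanically for a finite distribution. A minimal sketch (the helper `is_valid_pmf` is hypothetical, not a library function):

```python
import math

def is_valid_pmf(pmf):
    """Check the axioms for a finite probability mass function,
    given as a dict {outcome: probability}. Support is implicit:
    outcomes missing from the dict have probability 0."""
    non_negative = all(p >= 0 for p in pmf.values())       # non-negativity
    normalized = math.isclose(sum(pmf.values()), 1.0)      # normalization
    return non_negative and normalized

print(is_valid_pmf({"cavity": 0.1, "no_cavity": 0.9}))  # True
print(is_valid_pmf({"a": 0.5, "b": 0.6}))               # False (sums to 1.1)
```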

Joint distribution

The full joint distribution over \(N\) variables is the collection of probabilities for every possible combination of values.

                           Toothache (\(y_1\))   \(\neg\)Toothache (\(y_2\))   Marginal \(P(X)\)
Cavity (\(x_1\))                  0.04                     0.06                       0.1
\(\neg\)Cavity (\(x_2\))          0.01                     0.89                       0.9

Table 1: Full joint distribution for the Toothache/Cavity world (Russell & Norvig, 2022)

Marginalization law: \(P(x_i) = \sum_{y_j \in V(Y)} P(x_i, y_j)\)

“I want to know the probability of X happening, and I don’t care what happens with Y.” To “marginalize” a variable is to remove it from the equation by accounting for every possible way it could have occurred.
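The marginalization law applied to Table 1, as a minimal Python sketch:

```python
# Joint distribution from Table 1: P(Cavity, Toothache).
joint = {
    ("cavity", "toothache"): 0.04,
    ("cavity", "no_toothache"): 0.06,
    ("no_cavity", "toothache"): 0.01,
    ("no_cavity", "no_toothache"): 0.89,
}

def marginal(var_index, value):
    """Sum out every other variable: add the joint entries
    whose assignment at position var_index equals value."""
    return sum(p for world, p in joint.items() if world[var_index] == value)

print(round(marginal(0, "cavity"), 2))     # 0.1, matching Table 1's margin
print(round(marginal(1, "toothache"), 2))  # 0.05
```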

Exercise: passing I2AI

A study tracks student performance in the I2AI exam: Passing (R) and Coffee Consumption (C).

            \(C=T\)   \(C=F\)
\(R=T\)      0.60      0.20
\(R=F\)      0.10      0.10

Table 2: Joint distribution for Exam Result / Coffee

Tasks

  1. Compute \(P(R=T)\) and \(P(C=T)\).
  2. Are \(R\) and \(C\) independent? Show your working.
  3. You meet a student who drank coffee (\(C=T\)). What is the probability they passed (\(R=T\))?
Time: 10 minutes

Conditional probability & Bayes

Conditional Probability

New evidence narrows the sample space to only those outcomes consistent with the data. Because the total probability of the sample space must always equal 1, the evidence doesn’t create new probability mass but redistributes the existing mass into a smaller, more concentrated area.

\[P(x_i | y_j) = \frac{P(x_i, y_j)}{P(y_j)} \qquad \text{(provided } P(y_j) > 0\text{)}\]

Imagine 10,000 people. A disease affects 1% of them, and a test is 99% accurate.

            Test Positive   Test Negative    Total
Sick                   99               1      100
Healthy                99           9,801    9,900
Total                 198           9,802   10,000

The Update: Once you receive a Positive result, the “Test Negative” column is eliminated. The total sample space shrinks from 10,000 to 198.

Your probability of being sick is now: \[\frac{99}{198} = 50\%\]
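The same update, reproduced directly from the counts in the table above:

```python
# Counts from the 10,000-person table: (condition, test result) -> people.
counts = {
    ("sick", "pos"): 99,    ("sick", "neg"): 1,
    ("healthy", "pos"): 99, ("healthy", "neg"): 9801,
}

# Conditioning on a positive test: discard the "neg" column, renormalize.
positives = sum(n for (_, result), n in counts.items() if result == "pos")
p_sick_given_pos = counts[("sick", "pos")] / positives

print(positives)         # 198
print(p_sick_given_pos)  # 0.5
```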

Bayes’ theorem

\[P(x_i | y_j) = \frac{P(y_j | x_i) \cdot P(x_i)}{P(y_j)}\]

  • \(P(x_i)\): prior: belief before seeing evidence
  • \(P(y_j | x_i)\): likelihood: how probable is the evidence if \(x_i\) is true?
  • \(P(y_j)\): evidence: total probability of observing \(y_j\) (normalizing constant)
  • \(P(x_i | y_j)\): posterior: updated belief after seeing evidence

The Law of Total Probability calculates the overall probability of an outcome by summing its likelihood across every possible, mutually exclusive scenario that could lead to it.

\[P(y_j) = \sum_{x_k \in V(X)} P(y_j | x_k) \cdot P(x_k) \quad \text{(law of total probability)}\]
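Plugging in the screening test from the previous section (1% prevalence, 99% accuracy), with the law of total probability supplying the denominator:

```python
# Bayes' theorem: posterior = likelihood * prior / evidence.
prior = 0.01               # P(disease): 1% prevalence
p_pos_given_d = 0.99       # likelihood: P(pos | disease)
p_pos_given_not_d = 0.01   # false-positive rate: P(pos | no disease)

# Law of total probability: P(pos) summed over both scenarios.
evidence = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)
posterior = p_pos_given_d * prior / evidence

print(round(posterior, 2))  # 0.5, matching the 99/198 count argument
```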

Beliefs table

An agent holds three beliefs: \(P(A) = 0.4\), \(P(B) = 0.3\), and \(P(A \lor B) = 0.5\).

Tasks

  1. Create a \(2 \times 2\) joint distribution table whose entries \(a, b, c, d\) are the probabilities of the four possible worlds.
  2. Using the table above, write four equations that represent the agent’s beliefs and the fundamental rules of probability.
  3. Solve your equations to find the values for \(a, b, c, \text{and } d\).
  4. Are \(A\) and \(B\) independent?
Time: 8 minutes

Medical test

A test screens for a disease affecting 1% of the population.

  • Sensitivity: \(P(\text{pos} | \text{disease}) = 0.95\)
  • Specificity: \(P(\text{neg} | \text{no disease}) = 0.98\)

A random person tests positive.

Tasks

  1. What is \(P(\text{disease} | \text{positive})\)?
    Use Bayes’ theorem and the law of total probability for the denominator.
  2. The person tests positive a second time (independent test).
    Use your answer from (1) as the new prior. What is the updated probability now?
  3. Would you recommend treatment after two positive tests? Why or why not?
Time: 15 minutes

Naive Bayes in action

Naive Bayes Classifier

How do we predict a category (\(Y\)) based on multiple clues (\(X\))?
We treat each clue as an independent “vote” for the outcome.

\[P(\text{Class} \mid \text{Features}) \propto P(\text{Class}) \cdot P(\text{Feature}_1 \mid \text{Class}) \cdot P(\text{Feature}_2 \mid \text{Class}) \dots\]

The “naive” assumption: we assume every feature is conditionally independent of the others given the class (\(\propto\) means “proportional to”).

  • Example: In a spam filter, we assume the word “money” appearing has nothing to do with the word “free” appearing.
  • In reality this is almost always false, but the math still works surprisingly well for ranking.
  • Thus, we estimate each word’s frequency individually and multiply the results into an “unnormalized score” for each class.
  • To get an actual % probability, we divide each class score by the sum of all scores.
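A minimal spam-filter sketch of this scoring procedure. The words, priors, and likelihoods below are invented for illustration, not taken from the slides:

```python
# Naive Bayes: unnormalized score = prior * product of per-word likelihoods.
priors = {"spam": 0.4, "ham": 0.6}    # made-up class priors
likelihoods = {                        # made-up P(word | class) values
    "spam": {"money": 0.30, "free": 0.25},
    "ham":  {"money": 0.02, "free": 0.05},
}

def unnormalized_scores(words):
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for w in words:
            score *= likelihoods[cls][w]   # "naive": words treated as independent
        scores[cls] = score
    return scores

raw = unnormalized_scores(["money", "free"])
total = sum(raw.values())
probs = {cls: s / total for cls, s in raw.items()}  # normalize to probabilities
print({cls: round(p, 3) for cls, p in probs.items()})
```

For "money free", spam scores 0.4 · 0.30 · 0.25 = 0.03 against ham's 0.6 · 0.02 · 0.05 = 0.0006, so the message is classified as spam with a normalized probability of about 0.98.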

Classify the patient

A symptom-based diagnostic system distinguishes Cold (C) from Flu (F). From clinic records:

Symptom \(P(\cdot | \text{Cold})\) \(P(\cdot | \text{Flu})\)
Fever=T 0.30 0.90
Cough=T 0.80 0.95

Class priors: \(P(\text{Cold}) = 0.70\), \(P(\text{Flu}) = 0.30\).

Tasks

  1. Patient A: Fever=T, Cough=T. Compute unnormalized scores and classify.
  2. Patient B: Fever=F, Cough=T. Compute unnormalized scores and classify.
  3. Which symptom is more diagnostic for flu: fever or cough? Why?
Time: 12 minutes

Wrap-up

Key takeaways

Why probability

  • Logic fails when knowledge is incomplete, rules are approximate, or sensor data is unreliable
  • Probability provides a principled calculus for degrees of belief — not just frequencies

Joint distributions, marginalization, and conditional probability

  • The full joint distribution contains all information; other distributions are derived from it by summation
  • Conditioning on evidence narrows the sample space and rescales probabilities

Bayes’ theorem

  • Prior \(\times\) likelihood / evidence = posterior; each observed datum updates belief
  • Base rates matter: a rare event remains improbable even after a positive test, until enough evidence accumulates
  • Sequential Bayesian updating (using the last posterior as the next prior) is correct and powerful

Key takeaways #2

Naive Bayes

  • The conditional independence assumption replaces a single intractable joint likelihood with a product of simple ones
  • Despite the “naive” assumption being almost always false, the classifier is often surprisingly accurate
  • The prior shapes classification: high base-rate classes resist being overridden by weak evidence

Q&A

Literature

Russell, S., & Norvig, P. (2022). Artificial intelligence: A modern approach. Pearson Education.