Probability Theory

Introduction to AI (I2AI)

Andy Weeger

Neu-Ulm University of Applied Sciences

April 19, 2026

Agenda

  • Warm-up 10 min
  • Why probability? 10 min
  • Joint probability & marginalization 20 min
  • Conditional probability & Bayes 30 min
  • Naive Bayes in action 15 min
  • Wrap-up 5 min

Why probability?

Sources of uncertainty

Probability is the response to a fundamental limitation of logic:
Agents rarely have complete, reliable information.

  • Laziness: enumerating all possible causes is infeasible in practice
  • Theoretical ignorance: even known rules are incomplete or approximate
  • Practical ignorance: even with perfect theory, facts about the specific case are unavailable
    (e.g., sensors fail, history is missing)

Decision-making under uncertainty

Expected utility weighs each possible outcome's utility by its probability:

\[EU(a) = \sum_{s} P(s \mid a) \cdot U(s)\]

  • A rational agent selects the action with maximum expected utility (MEU)
  • Probability quantifies the degree of belief, not just frequency
  • Probability is the basis for rational choice in uncertain environments.

Probability theory is thus a foundation for the design of rational AI agents.
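The MEU rule can be sketched in a few lines of Python. The umbrella scenario, its probabilities, and its utilities below are invented purely for illustration:

```python
# MEU sketch: pick the action whose probability-weighted utility is highest.
# All numbers are made-up illustrative values.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

actions = {
    "take_umbrella":  [(0.3, 70), (0.7, 80)],   # (P(rain), utility if rain), ...
    "leave_umbrella": [(0.3, 0),  (0.7, 100)],
}

best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, round(expected_utility(actions[best]), 2))
```

Here `take_umbrella` wins (expected utility 77 vs. 70), even though leaving the umbrella has the single best possible outcome.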

True or false?

Decide: true or false and give a one-sentence justification.

  1. \(P(A) + P(\neg A) = 1\) always holds.
  2. If \(P(A) = 0.5\) and \(P(B) = 0.5\), then \(P(A \land B) = 0.25\).
  3. \(P(A | B) = P(B | A)\) whenever \(P(A) = P(B)\).
  4. A highly accurate test returning positive for a rare disease means you probably have it.

Joint probability & marginalization

Prior probability

The prior probability \(P(x)\) expresses the degree of belief in \(x\) in the absence of any other information (Russell & Norvig, 2022).

Every valid probability mass function satisfies three axioms:

  • Non-negativity: \(p_X(x) \geq 0\) for all \(x\)
    (A probability can never be a negative number.)
  • Normalization: \(\sum_{x \in V(X)} p_X(x) = 1\)
    (If you add up the probabilities of every possible outcome, the total must equal exactly 1.)
  • Support: \(p_X(x) = 0\) for all \(x \notin V(X)\)
    (If an outcome is not in the set of possible values (\(V(X)\)), its probability is zero.)

Probabilities do not need to come from frequencies; they can encode expert belief (i.e., probability can represent a degree of certainty or expert judgment about a unique event that hasn’t happened yet).
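The three axioms can be checked mechanically for a finite distribution. A minimal sketch (the helper `is_valid_pmf` is hypothetical, not a library function):

```python
import math

def is_valid_pmf(pmf):
    """Check the axioms for a finite probability mass function,
    given as a dict {outcome: probability}. Support is implicit:
    outcomes missing from the dict have probability 0."""
    non_negative = all(p >= 0 for p in pmf.values())       # non-negativity
    normalized = math.isclose(sum(pmf.values()), 1.0)      # normalization
    return non_negative and normalized

print(is_valid_pmf({"cavity": 0.1, "no_cavity": 0.9}))  # True
print(is_valid_pmf({"a": 0.5, "b": 0.6}))               # False (sums to 1.1)
```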

Joint distribution

The full joint distribution over \(N\) variables is the collection of probabilities for every possible combination of values.

                           Toothache (\(y_1\))   \(\neg\)Toothache (\(y_2\))   Marginal \(P(X)\)
Cavity (\(x_1\))                  0.04                     0.06                       0.1
\(\neg\)Cavity (\(x_2\))          0.01                     0.89                       0.9

Table 1: Full joint distribution for the Toothache/Cavity world (Russell & Norvig, 2022)

Marginalization law: \(P(x_i) = \sum_{y_j \in V(Y)} P(x_i, y_j)\)

“I want to know the probability of X happening, and I don’t care what happens with Y.” To “marginalize” a variable is to remove it from the equation by accounting for every possible way it could have occurred.
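The marginalization law applied to Table 1, as a minimal Python sketch:

```python
# Joint distribution from Table 1: P(Cavity, Toothache).
joint = {
    ("cavity", "toothache"): 0.04,
    ("cavity", "no_toothache"): 0.06,
    ("no_cavity", "toothache"): 0.01,
    ("no_cavity", "no_toothache"): 0.89,
}

def marginal(var_index, value):
    """Sum out every other variable: add the joint entries
    whose assignment at position var_index equals value."""
    return sum(p for world, p in joint.items() if world[var_index] == value)

print(round(marginal(0, "cavity"), 2))     # 0.1, matching Table 1's margin
print(round(marginal(1, "toothache"), 2))  # 0.05
```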

Exercise: passing I2AI

A study tracks student performance in the I2AI exam: Passing (R) and Coffee Consumption (C).

            \(C=T\)   \(C=F\)
\(R=T\)      0.60      0.20
\(R=F\)      0.10      0.10

Table 2: Joint distribution for Exam Result / Coffee

Tasks

  1. Compute \(P(R=T)\) and \(P(C=T)\).
  2. Are \(R\) and \(C\) independent? Show your working.
  3. You meet a student who drank coffee (\(C=T\)). What is the probability they passed (\(R=T\))?
Time: 10 minutes

Conditional probability & Bayes

Conditional Probability

New evidence narrows the sample space to only those outcomes consistent with the data. Because the total probability of the sample space must always equal 1, the evidence doesn’t create new probability mass but redistributes the existing mass into a smaller, more concentrated area.

\[P(x_i | y_j) = \frac{P(x_i, y_j)}{P(y_j)} \qquad \text{(provided } P(y_j) > 0\text{)}\]

Imagine 10,000 people. A disease affects 1% of them, and a test is 99% accurate.

            Test Positive   Test Negative    Total
Sick                   99               1      100
Healthy                99           9,801    9,900
Total                 198           9,802   10,000

The Update: Once you receive a Positive result, the “Test Negative” column is eliminated. The total sample space shrinks from 10,000 to 198.

Your probability of being sick is now: \[\frac{99}{198} = 50\%\]
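The same update, reproduced directly from the counts in the table above:

```python
# Counts from the 10,000-person table: (condition, test result) -> people.
counts = {
    ("sick", "pos"): 99,    ("sick", "neg"): 1,
    ("healthy", "pos"): 99, ("healthy", "neg"): 9801,
}

# Conditioning on a positive test: discard the "neg" column, renormalize.
positives = sum(n for (_, result), n in counts.items() if result == "pos")
p_sick_given_pos = counts[("sick", "pos")] / positives

print(positives)         # 198
print(p_sick_given_pos)  # 0.5
```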

Bayes’ theorem

\[P(x_i | y_j) = \frac{P(y_j | x_i) \cdot P(x_i)}{P(y_j)}\]

  • \(P(x_i)\): prior: belief before seeing evidence
  • \(P(y_j | x_i)\): likelihood: how probable is the evidence if \(x_i\) is true?
  • \(P(y_j)\): evidence: total probability of observing \(y_j\) (normalizing constant)
  • \(P(x_i | y_j)\): posterior: updated belief after seeing evidence

The Law of Total Probability calculates the overall probability of an outcome by summing its likelihood across every possible, mutually exclusive scenario that could lead to it.

\[P(y_j) = \sum_{x_k \in V(X)} P(y_j | x_k) \cdot P(x_k) \quad \text{(law of total probability)}\]
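Plugging in the screening test from the previous section (1% prevalence, 99% accuracy), with the law of total probability supplying the denominator:

```python
# Bayes' theorem: posterior = likelihood * prior / evidence.
prior = 0.01               # P(disease): 1% prevalence
p_pos_given_d = 0.99       # likelihood: P(pos | disease)
p_pos_given_not_d = 0.01   # false-positive rate: P(pos | no disease)

# Law of total probability: P(pos) summed over both scenarios.
evidence = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)
posterior = p_pos_given_d * prior / evidence

print(round(posterior, 2))  # 0.5, matching the 99/198 count argument
```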

Beliefs table

An agent holds three beliefs: \(P(A) = 0.4\), \(P(B) = 0.3\), and \(P(A \lor B) = 0.5\).

Tasks

  1. Create a \(2 \times 2\) joint distribution table whose entries \(a, b, c, d\) are the probabilities of the four possible worlds.
  2. Using the table above, write four equations that represent the agent’s beliefs and the fundamental rules of probability.
  3. Solve your equations to find the values for \(a, b, c, \text{and } d\).
  4. Are \(A\) and \(B\) independent?
Time: 8 minutes

Medical test

A test screens for a disease affecting 1% of the population.

  • Sensitivity: \(P(\text{pos} | \text{disease}) = 0.95\)
  • Specificity: \(P(\text{neg} | \text{no disease}) = 0.98\)

A random person tests positive.

Tasks

  1. What is \(P(\text{disease} | \text{positive})\)?
    Use Bayes’ theorem and the law of total probability for the denominator.
  2. The person tests positive a second time (independent test).
    Use your answer from (1) as the new prior. What is the updated probability now?
  3. Would you recommend treatment after two positive tests? Why or why not?
Time: 15 minutes

Naive Bayes in action

Naive Bayes Classifier

How do we predict a category (\(Y\)) based on multiple clues (\(X\))?
We treat each clue as an independent “vote” for the outcome.

\[P(\text{Class} \mid \text{Features}) \propto P(\text{Class}) \cdot P(\text{Feature}_1 \mid \text{Class}) \cdot P(\text{Feature}_2 \mid \text{Class}) \dots\]

The “naive” assumption: we assume every feature is conditionally independent of the others given the class (\(\propto\) means “proportional to”).

  • Example: In a spam filter, we assume the word “money” appearing has nothing to do with the word “free” appearing.
  • In reality this is almost always false, but the math still works surprisingly well for ranking.
  • Thus, we estimate each word’s frequency individually and multiply the results into an “unnormalized score” for each class.
  • To get an actual % probability, we divide each class score by the sum of all scores.
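A minimal spam-filter sketch of this scoring procedure. The words, priors, and likelihoods below are invented for illustration, not taken from the slides:

```python
# Naive Bayes: unnormalized score = prior * product of per-word likelihoods.
priors = {"spam": 0.4, "ham": 0.6}    # made-up class priors
likelihoods = {                        # made-up P(word | class) values
    "spam": {"money": 0.30, "free": 0.25},
    "ham":  {"money": 0.02, "free": 0.05},
}

def unnormalized_scores(words):
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for w in words:
            score *= likelihoods[cls][w]   # "naive": words treated as independent
        scores[cls] = score
    return scores

raw = unnormalized_scores(["money", "free"])
total = sum(raw.values())
probs = {cls: s / total for cls, s in raw.items()}  # normalize to probabilities
print({cls: round(p, 3) for cls, p in probs.items()})
```

For "money free", spam scores 0.4 · 0.30 · 0.25 = 0.03 against ham's 0.6 · 0.02 · 0.05 = 0.0006, so the message is classified as spam with a normalized probability of about 0.98.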

Classify the patient

A symptom-based diagnostic system distinguishes Cold (C) from Flu (F). From clinic records:

Symptom \(P(\cdot | \text{Cold})\) \(P(\cdot | \text{Flu})\)
Fever=T 0.30 0.90
Cough=T 0.80 0.95

Class priors: \(P(\text{Cold}) = 0.70\), \(P(\text{Flu}) = 0.30\).

Tasks

  1. Patient A: Fever=T, Cough=T. Compute unnormalized scores and classify.
  2. Patient B: Fever=F, Cough=T. Compute unnormalized scores and classify.
  3. Which symptom is more diagnostic for flu: fever or cough? Why?
Time: 12 minutes

Wrap-up

Key takeaways

Why probability

  • Logic fails when knowledge is incomplete, rules are approximate, or sensor data is unreliable
  • Probability provides a principled calculus for degrees of belief — not just frequencies

Joint distributions, marginalization, and conditional probability

  • The full joint distribution contains all information; other distributions are derived from it by summation
  • Conditioning on evidence narrows the sample space and rescales probabilities

Bayes’ theorem

  • Prior \(\times\) likelihood / evidence = posterior; each observed datum updates belief
  • Base rates matter: a rare event remains improbable even after a positive test, until enough evidence accumulates
  • Sequential Bayesian updating (using the last posterior as the next prior) is correct and powerful

Key takeaways #2

Naive Bayes

  • The conditional independence assumption replaces a single intractable joint likelihood with a product of simple ones
  • Despite the “naive” assumption being almost always false, the classifier is often surprisingly accurate
  • The prior shapes classification: high base-rate classes resist being overridden by weak evidence

Q&A

Literature

Russell, S., & Norvig, P. (2022). Artificial intelligence: A modern approach. Pearson Education.