Introduction to AI (I2AI)
Neu-Ulm University of Applied Sciences
April 19, 2026
Probability is the response to a fundamental limitation of logic:
Agents rarely have complete, reliable information.
\[\text{expected utility} = (\text{utility} - \text{cost}) \times \text{probability}\]
Probability theory is, thus, a foundation for the design of rational AI agents.
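The expected-utility formula above can be sketched in a few lines of Python. The actions, payoffs, costs, and probabilities below are illustrative assumptions, not course data:

```python
# Minimal sketch: rank candidate actions by
# expected utility = (utility - cost) * probability.
def expected_utility(utility, cost, probability):
    """Expected utility of an action with a payoff, a cost, and a success probability."""
    return (utility - cost) * probability

# Hypothetical actions (illustrative numbers only).
actions = {
    "safe_plan":  expected_utility(utility=100, cost=20, probability=0.9),   # 72.0
    "risky_plan": expected_utility(utility=500, cost=50, probability=0.2),   # 90.0
}
best = max(actions, key=actions.get)
print(best)  # risky_plan
```

A rational agent picks the action with the highest expected utility; here the risky plan wins despite its low success probability, because its payoff is much larger.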
Decide: true or false and give a one-sentence justification.
The prior probability \(P(x)\) expresses the degree of belief in \(x\) in the absence of any other information (Russell & Norvig, 2022).
Every valid probability mass function satisfies three axioms:
1. Non-negativity: \(P(x) \geq 0\) for every event \(x\).
2. Normalization: the probabilities over the whole sample space sum to 1.
3. Additivity: for mutually exclusive events, \(P(x \lor y) = P(x) + P(y)\).
Probabilities need not come from frequencies; they can encode expert belief (i.e., probability can represent a degree of certainty or expert judgment about a unique event that has not happened yet).
The full joint distribution over \(N\) variables is the collection of probabilities for every possible combination of values.
| | Toothache (\(y_1\)) | \(\neg\)Toothache (\(y_2\)) | Marginal \(P(X)\) |
|---|---|---|---|
| Cavity (\(x_1\)) | 0.04 | 0.06 | 0.1 |
| \(\neg\)Cavity (\(x_2\)) | 0.01 | 0.89 | 0.9 |
Marginalization law: \(P(x_i) = \sum_{y_j \in V(Y)} P(x_i, y_j)\)
“I want to know the probability of X happening, and I don’t care what happens with Y.” To “marginalize” a variable is to remove it from the equation by accounting for every possible way it could have occurred.
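The marginalization law can be demonstrated directly on the cavity/toothache table above; a minimal sketch, representing the joint distribution as a Python dictionary:

```python
# The full joint distribution P(Cavity, Toothache) from the table above.
joint = {
    ("cavity", "toothache"): 0.04,
    ("cavity", "no_toothache"): 0.06,
    ("no_cavity", "toothache"): 0.01,
    ("no_cavity", "no_toothache"): 0.89,
}

def marginal_x(joint, x):
    """P(x) = sum over all y of P(x, y): sum out ("marginalize") Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

print(marginal_x(joint, "cavity"))     # ≈ 0.1
print(marginal_x(joint, "no_cavity"))  # ≈ 0.9
```

The results match the "Marginal \(P(X)\)" column of the table: summing each row removes \(Y\) from consideration by accounting for every value it could take.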
A study tracks student performance in the I2AI exam: Passing (R) and Coffee Consumption (C).
| | \(C=T\) | \(C=F\) |
|---|---|---|
| \(R=T\) | 0.60 | 0.20 |
| \(R=F\) | 0.10 | 0.10 |
Tasks
10:00
New evidence narrows the sample space to only those outcomes consistent with the data. Because the total probability of the sample space must always equal 1, the evidence doesn’t create new probability mass but redistributes the existing mass into a smaller, more concentrated area.
\[P(x_i | y_j) = \frac{P(x_i, y_j)}{P(y_j)} \qquad \text{(provided } P(y_j) > 0\text{)}\]
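The definition above translates directly into code. A minimal sketch, reusing the cavity/toothache joint distribution as the example:

```python
# The full joint distribution P(Cavity, Toothache) from the earlier table.
joint = {
    ("cavity", "toothache"): 0.04,
    ("cavity", "no_toothache"): 0.06,
    ("no_cavity", "toothache"): 0.01,
    ("no_cavity", "no_toothache"): 0.89,
}

def conditional(joint, x, y):
    """P(x | y) = P(x, y) / P(y), assuming P(y) > 0."""
    p_y = sum(p for (_, yj), p in joint.items() if yj == y)  # marginalize X out
    return joint[(x, y)] / p_y

print(conditional(joint, "cavity", "toothache"))  # ≈ 0.8
```

Observing a toothache narrows the sample space to the "toothache" column (total mass 0.05), and the cavity's share of that column is \(0.04 / 0.05 = 0.8\).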
Imagine 10,000 people. A disease affects 1% of them, and a test is 99% accurate.
| | Test Positive | Test Negative | Total |
|---|---|---|---|
| Sick | 99 | 1 | 100 |
| Healthy | 99 | 9,801 | 9,900 |
| Total | 198 | 9,802 | 10,000 |
The Update: Once you receive a Positive result, the “Test Negative” column is eliminated. The total sample space shrinks from 10,000 to 198.
Your probability of being sick is now: \[\frac{99}{198} = 50\%\]
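The frequency table above can be rebuilt from its two parameters (1% prevalence, 99% accuracy), which makes the update easy to check:

```python
# Rebuild the 10,000-person frequency table from prevalence and test accuracy.
population = 10_000
sick = population // 100                   # 1% prevalence -> 100 sick
healthy = population - sick                # 9,900 healthy

true_positives = round(sick * 0.99)        # 99 sick people test positive
false_positives = round(healthy * 0.01)    # 99 healthy people test positive
total_positive = true_positives + false_positives  # 198 positives in total

# Conditioning on a positive result shrinks the sample space to those 198 people.
p_sick_given_positive = true_positives / total_positive
print(p_sick_given_positive)  # 0.5
```

Even with a 99%-accurate test, a positive result only means a 50% chance of being sick, because the healthy group is so much larger that it contributes just as many positives.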
\[P(x_i | y_j) = \frac{P(y_j | x_i) \cdot P(x_i)}{P(y_j)}\]
The Law of Total Probability calculates the overall probability of an outcome by summing its likelihood across every possible, mutually exclusive scenario that could lead to it.
\[P(y_j) = \sum_{x_k \in V(X)} P(y_j | x_k) \cdot P(x_k) \quad \text{(law of total probability)}\]
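Bayes' theorem and the law of total probability combine into a single update step: the evidence term \(P(y_j)\) in the denominator is computed by summing over all hypotheses. A minimal sketch, using the disease-screening numbers from above:

```python
def posterior(prior, likelihood):
    """Bayes' theorem: P(x_k | y) for every hypothesis x_k,
    given priors P(x_k) and likelihoods P(y | x_k).
    The evidence P(y) comes from the law of total probability."""
    p_y = sum(likelihood[x] * prior[x] for x in prior)
    return {x: likelihood[x] * prior[x] / p_y for x in prior}

# Disease screening: 1% prevalence, 99% accurate test.
prior = {"sick": 0.01, "healthy": 0.99}
likelihood_positive = {"sick": 0.99, "healthy": 0.01}  # P(positive | x)

print(posterior(prior, likelihood_positive)["sick"])  # 0.5
```

This reproduces the 50% result from the frequency table, now derived symbolically rather than by counting people.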
An agent holds three beliefs: \(P(A) = 0.4\), \(P(B) = 0.3\), and \(P(A \lor B) = 0.5\).
Tasks
08:00
A test screens for a disease affecting 1% of the population.
A random person tests positive.
Tasks
15:00
How do we predict a category (\(Y\)) based on multiple clues (\(X\))?
We treat each clue as an independent “vote” for the outcome.
\[P(\text{Class} \mid \text{Features}) \propto P(\text{Class}) \cdot P(\text{Feature}_1 \mid \text{Class}) \cdot P(\text{Feature}_2 \mid \text{Class}) \dots\]
The “naive” assumption: we treat every feature as independent of the others given the class (\(\propto\) means “proportional to”).
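The product rule above is a few lines of code. A minimal sketch with a toy spam-filter example; the classes, features, and numbers are illustrative assumptions, not course data:

```python
# Naive Bayes sketch: each observed feature contributes one independent
# likelihood factor, multiplied onto the class prior.
def naive_bayes_scores(priors, likelihoods, observed_features):
    """Unnormalized P(class | features) ∝ P(class) * Π P(feature | class)."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed_features:
            score *= likelihoods[cls][feature]
        scores[cls] = score
    return scores

# Hypothetical spam-filter parameters (illustrative only).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"has_link": 0.7, "all_caps": 0.5},
    "ham":  {"has_link": 0.2, "all_caps": 0.1},
}

scores = naive_bayes_scores(priors, likelihoods, ["has_link", "all_caps"])
print(max(scores, key=scores.get))  # spam
```

Because we only compare classes, the scores never need to be normalized by \(P(\text{Features})\); the \(\propto\) in the formula is exactly this shortcut.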
A symptom-based diagnostic system distinguishes Cold (C) from Flu (F). From clinic records:
| Symptom | \(P(\cdot \mid \text{Cold})\) | \(P(\cdot \mid \text{Flu})\) |
|---|---|---|
| Fever=T | 0.30 | 0.90 |
| Cough=T | 0.80 | 0.95 |
Class priors: \(P(\text{Cold}) = 0.70\), \(P(\text{Flu}) = 0.30\).
Tasks
12:00
Why probability
Joint distributions, marginalization, and conditional probability
Bayes’ theorem
Naive Bayes