Probability Theory

Introduction to AI (I2AI)

Andy Weeger

Neu-Ulm University of Applied Sciences

April 18, 2026

Introduction

Discussion

What are challenges related to logic as an approach to knowledge representation?

Qualification problem

Logic is powerful, but conclusions often need to be drawn under uncertainty, as:

  • the knowledge of the world is incomplete (not enough information) or uncertain (sensors are unreliable);
  • every possible explanation for given percepts needs to be considered (no matter how unlikely), which leads to a large belief state full of unlikely possibilities and arbitrarily large contingent plans; and
  • rules about the world are often incomplete (e.g., are all preconditions for an action known?) or even incorrect

Uncertainty in rules

Take an expert dental diagnosis system as an example.

\(Toothache \implies Cavity\)

This rule is incorrect, as there are other causes of toothache; a better rule would be:

\(Toothache \implies Cavity \lor GumProblem \lor Abscess \lor ...\)

However, as we don’t know all the causes, this rule for diagnostic reasoning is still incomplete1.

What about a rule for causal reasoning?

\(Cavity \implies Toothache\)

This is still wrong, as a cavity does not always cause a toothache. Further, it does not allow us to reason from symptoms to causes.

Discussion

What do we learn from that?

Learnings

  • We cannot enumerate all possible causes (i.e., laziness).
  • And even if we could, we do not know how correct the rules are (i.e., theoretical ignorance2).
  • And even if we did, there would always be uncertainty about the patient (i.e., practical ignorance3).
  • And even if there were no uncertainty about the case, our sensors could be imprecise (e.g., the recognition of a cavity).

Thus, formal logical reasoning systems have significant limitations when dealing with real-world problems where we lack complete information.

Probability theory

Probabilities

  • We (and other agents) are convinced by facts and rules only up to a certain degree
  • One possibility for expressing the degree of belief is to use probabilities
    (e.g., we expect that the information is correct in 9 out of 10 cases — probability of 0.9)
  • Probabilities summarize the “uncertainty” that stems from a lack of knowledge.
  • Probabilities are not to be confused with vagueness
    (e.g., the predicate tall is vague; the statement, “A man is 1.75–1.80m tall” is uncertain)

Example

Goal: Be in Ulm at 8:15 to give a lecture

There are several plans that achieve the goal:

  • \(P_1:\) Get up at 6:00, take the bike, arrive at 7:30, take a shower, …
  • \(P_2:\) Get up at 6:30, take the car at 7:00, arrive at 7:45, …

All these plans are correct, but they imply different costs and different probabilities of actually achieving the goal.

\(P_2\) is probably the plan of choice, as the success rate of \(P_1\) is only 80%, though the rewards are high.

Decision-making under uncertainty

(utility - cost) * probability

  • We have a choice of actions (or plans)
  • These can lead to different solutions with different probabilities
  • The actions have different (subjective) costs
  • The results have different (subjective) utilities

It would be rational to choose the action with the maximum expected utility (MEU) — the “average”, or “statistical mean” of the outcome utilities minus the costs of the actions leading to the outcome, weighted by the probability of the outcome.
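This decision rule can be sketched in a few lines of Python; the probabilities, utilities, and costs below are invented illustration values, not taken from the lecture example.

```python
# Maximum expected utility (MEU) sketch: pick the plan whose
# probability-weighted (utility - cost) is largest.
def expected_utility(outcomes):
    """outcomes: list of (probability, utility, cost) per possible result."""
    return sum(p * (utility - cost) for p, utility, cost in outcomes)

# Hypothetical plans: P1 is risky (fails 20% of the time) and costly,
# P2 is cheap and succeeds 95% of the time.
p1 = [(0.80, 100, 30), (0.20, 0, 30)]   # (success, failure) outcomes
p2 = [(0.95, 100, 10), (0.05, 0, 10)]

eu1 = expected_utility(p1)  # mathematically 0.8*70 - 0.2*30 = 50
eu2 = expected_utility(p2)  # mathematically 0.95*90 - 0.05*10 = 85
best = "P2" if eu2 > eu1 else "P1"
```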

Discrete and continuous variables

A random variable4 \(X\) is a variable that can take multiple values \(X=x_i\) depending on the outcome of a random event. We denote the set of possible values the variable can take by \(V(X)\).

  • If the outcomes are finite5 or at least countable, the random variable is said to be discrete.
  • If the possible outcomes are not finite, for example, drawing a real number \(x \in \left[0,1\right] \subset \mathbb{R}\), the random variable is said to be continuous.

The probability that the random variable \(X\) takes the value \(x\) is denoted by \(P(X=x)\) or for short \(P(x)\).

The description of the probabilities \(P(x)\) for all possible \(x \in V(X)\) is called the probability distribution of variable \(X\).

Unconditional probability

Prior probability

\(P(x)\) denotes the unconditional probability or prior probability that \(x\) will appear in the absence of any other information, e.g. \(P(Cavity) = 0.1\)

  • Prior probabilities can be obtained from statistical analysis or general rules
  • Logical connectors can be used to build probabilistic propositions, e.g. \(P(Cavity \land \neg Insured) = 0.06\)
  • Propositions can contain equations over random variables, e.g. \(P(NoonTemp=x) = \textrm{Uniform}(x;18C;26C)\)6, usually called a probability density function

Probability mass function

In the case of discrete random variables \(X\), the probability distribution is called probability mass function (PMF) \(p_X(x)\), it assigns to all \(x \in V(X)\) the corresponding probability \(P(x)\). For all probabilities the following conditions must be satisfied:

1. Non-negativity

The probability assigned to any value of \(X\) must be greater than or equal to zero.

\[ \begin{flalign} p_X(x) \geq 0 \quad \text{for all } x && \end{flalign} \]

2. Normalization

The sum of the probabilities of all possible values that the random variable \(X\) can take is equal to 1.

\[ \begin{flalign} \sum_{x \in V(X)} p_X(x) = 1 && \end{flalign} \]

3. Support

The probability that the random variable \(X\) takes a value outside its set of possible values is zero. The only values \(X\) can take with non-zero probability are those in its support.

\[ \begin{flalign} p_X(x) = 0 \quad \text{for all } x \notin V(X) && \end{flalign} \]
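A minimal Python sketch that checks these three conditions for a concrete PMF, here a fair die:

```python
# PMF of a fair die: V(X) = {1, ..., 6}, each value with probability 1/6.
pmf = {x: 1 / 6 for x in range(1, 7)}

non_negative = all(p >= 0 for p in pmf.values())   # 1. non-negativity
normalized = abs(sum(pmf.values()) - 1.0) < 1e-9   # 2. normalization
outside = pmf.get(7, 0)                            # 3. support: P(X=7) = 0
```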

Cumulative distribution function

The cumulative distribution function (CDF) \(F_X(x)\) of a real-valued random variable \(X\) is the probability that \(X\) will take a value less than or equal to \(x\).

\[ \begin{flalign} F_X(x) = P(X \leq x) && \end{flalign} \]

Hence, the probability that \(X\) takes a value greater than \(a\) and less than or equal to \(b\) is:

\[ \begin{flalign} P(a <x \leq b) = F_X(b) - F_X(a) && \end{flalign} \]

Every CDF is non-decreasing and has a maximum value of 1.
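The relation \(P(a < X \leq b) = F_X(b) - F_X(a)\) can be checked numerically; this sketch builds the CDF of a fair die from its PMF:

```python
# CDF of a fair die, accumulated from the PMF.
pmf = {x: 1 / 6 for x in range(1, 7)}

def cdf(x):
    """F_X(x) = P(X <= x): sum the PMF over all values v <= x."""
    return sum(p for v, p in pmf.items() if v <= x)

# P(2 < X <= 5) = F(5) - F(2) = 5/6 - 2/6 = 1/2
prob = cdf(5) - cdf(2)
```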

Discrete uniform distribution

For the experiment “rolling a die” the possible outcomes are 1, 2, 3, 4, 5 and 6. The corresponding discrete random variable \(X\) has the probability mass function \(p_X(x)\), which is defined by:

\[ \begin{flalign} P(X=1) &= P(X=2) = P(X=3) =P(X=4) \\ &= P(X=5) = P(X=6) = \frac{1}{6} \end{flalign} \]

Such a distribution, for which \(P(x)\) is equal for all \(x \in V(X)\) is called a uniform distribution.

Bernoulli distribution

A discrete binary random variable \(X\) has only two possible outcomes: \(0\) or \(1\), i.e. in this case \(V(X) = \lbrace 0,1 \rbrace\). The corresponding probability mass function is the Bernoulli distribution, which is defined as follows:

\[ p_X(x) = \left\{ \begin{array}{ll} p & \mbox{ for } x=1 \\ 1-p & \mbox{ for } x=0 \end{array} \right. \]

Binomial distribution

The Binomial distribution helps us calculate the probability of getting a specific number of successful outcomes (let’s say \(k\)) in a fixed number of independent trials (let’s say \(n\)).

\[ \left( \begin{array}{c} n \\ k \end{array} \right) p^k (1-p)^{n-k} \]

For example, in a coin-tossing experiment the probability of success is \(P(X=1)=p=0.5\). If we toss the coin \(n=5\) times, the probability that exactly \(k=4\) tosses yield success is

\[ \left( \begin{array}{c} 5 \\ 4 \end{array} \right) 0.5^4 (1-0.5)^{5-4} = 0.15625 \]
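This calculation can be reproduced directly with Python's `math.comb`:

```python
from math import comb

# Binomial PMF: probability of exactly k successes in n independent trials.
def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# The example from the text: n = 5 tosses, k = 4 successes, p = 0.5
prob = binomial_pmf(4, 5, 0.5)  # comb(5, 4) * 0.5**5 = 5/32 = 0.15625
```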

Geometric distribution

The Geometric distribution models the number of tries you need to keep going until you finally get a success. Think about rolling a die until you roll a 6. The Geometric distribution helps you find the probability that it takes exactly \(k\) rolls. Every roll is independent, has two potential outcomes (getting a 6 or not), and the probability of rolling a 6 (\(p\)) remains constant.

\[ (1-p)^{(k-1)} \cdot p \]

For the coin-tossing experiment, the probability that the first success comes …

  • … at the first toss (\(k=1\)) is \(0.5^0 \cdot 0.5 = 0.5\)
  • … at the second toss (\(k=2\)) is \(0.5^1 \cdot 0.5 = 0.25\)
  • … at the third toss (\(k=3\)) is \(0.5^2 \cdot 0.5 = 0.125\)
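The three cases above follow directly from the formula; a small sketch:

```python
# Geometric PMF: probability that the first success occurs on trial k,
# given success probability p on each independent trial.
def geometric_pmf(k, p):
    return (1 - p) ** (k - 1) * p

# Coin-tossing example from the text (p = 0.5), for k = 1, 2, 3
probs = [geometric_pmf(k, 0.5) for k in (1, 2, 3)]  # [0.5, 0.25, 0.125]
```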

Continuous random variables

For continuous random variables, we use the probability density function (PDF), written as \(p_X(x)\), to understand how the variable’s values are distributed. Think of it like the probability mass function (PMF) for discrete variables.

The PDF tells us the relative likelihood of the random variable \(X\) taking on a specific value \(x\). While the exact probability of a continuous variable being any single value is zero (because there are infinitely many possibilities), the PDF helps us compare how likely different values are. A higher PDF value at one point means that values around that point are more likely to occur than values around a point with a lower PDF.

More precisely, the PDF is used to find the probability of the random variable falling within a range of values. This probability is the area under the PDF curve over that range.

Key properties of a PDF:

  • It’s always non-negative: \(p_X(x) \geq 0\).
  • The total area under the curve is 1: \(\int_{-\infty}^{\infty} p_X(x) \cdot dx = 1\).
  • Unlike the PMF, the PDF’s value can be greater than 1.

Cumulative distribution function

The CDF, written as \(F_X(x)\), tells us the probability that the random variable \(X\) will be less than or equal to a specific value \(x\). For continuous variables, this is the area under the PDF curve from negative infinity up to \(x\):

\[ F_X(x)= \int_{-\infty}^{x} p_X(t) \cdot dt \]

Using the CDF, we can find the probability that \(X\) falls within a range between \(a\) and \(b\) (where \(a < b\)):

\[ P(a< x \leq b) = F_X(b)-F_X(a)=\int_{a}^{b} p_X(t) \cdot dt \]

The Gaussian distribution

The Gaussian distribution (also called the normal distribution) is a common continuous distribution. Its PDF has a bell shape and is defined by its mean (\(\mu\)) and standard deviation (\(\sigma\)):

\[p_X(x)=\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
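The formula transcribes directly into Python; one quick sanity check is that the density peaks at the mean:

```python
from math import sqrt, pi, exp

# Gaussian (normal) PDF with mean mu and standard deviation sigma.
def gaussian_pdf(x, mu, sigma):
    return 1 / (sqrt(2 * pi) * sigma) * exp(-((x - mu) ** 2) / (2 * sigma**2))

# For mu = 0, sigma = 1 the maximum is at x = 0, with value 1/sqrt(2*pi)
peak = gaussian_pdf(0.0, 0.0, 1.0)
```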

Joint probability

Definition

The joint probability of two random variables \(X\) and \(Y\), denoted as \(P(X=x_i, Y=y_j)\) or simply \(P(x_i, y_j)\), represents the probability that \(X\) takes the specific value \(x_i\) and \(Y\) simultaneously takes the specific value \(y_j\). The comma “,” signifies the logical “and”.

The joint probability distribution of two random variables \(X\) and \(Y\) is the set of all possible joint probabilities for all possible value combinations of \(X\) (from its value set \(V(X)\)) and \(Y\) (from its value set \(V(Y)\)):

\[ P(X=x_i, Y=y_j) \quad \forall \quad x_i \in V(X), y_j \in V(Y) \]

Example

Consider two binary random variables: Toothache (True or False, with False written \(\neg\)Toothache) and Cavity (True or False, with False written \(\neg\)Cavity).

|          | Toothache | ¬Toothache |
|----------|-----------|------------|
| Cavity   | 0.04      | 0.06       |
| ¬Cavity  | 0.01      | 0.89       |

Table 1: Probabilities of the atomic events (full joint distribution for the Toothache, Cavity world)

From this table, we can directly read the joint probability of any combination of Toothache and Cavity. For example, the probability of having a toothache and a cavity is:

\[ P(\text{Toothache=True, Cavity=True}) = 0.04 \]

Similarly, the probability of not having a toothache and not having a cavity is:

\[ P(\text{Toothache=False, Cavity=False}) = 0.89 \]

Generalization

For \(N\) random variables \(X_1, X_2, \ldots, X_N\), the joint probability is:

\[ P(X_1=x_{i_1}, X_2=x_{i_2}, \ldots, X_N=x_{i_N}) \quad \text{or} \quad P(x_{i_1}, x_{i_2}, \ldots, x_{i_N}) \]

This represents the probability that \(X_1\) takes value \(x_{i_1}\) and \(X_2\) takes value \(x_{i_2}\) and so on, up to \(X_N\) taking value \(x_{i_N}\).

The joint probability distribution for these \(N\) variables is the collection of all such probabilities for all possible combinations of their values.

For continuous random variables, the joint probability distribution is described by a joint cumulative distribution function (CDF) or a joint probability density function (PDF). For discrete random variables, it’s described by a probability mass function (PMF) or the CDF.

Independence of random variables

Two random variables \(X\) and \(Y\) are considered independent if the value taken by one variable does not influence the value taken by the other.

Example: Two consecutive rolls of a fair die are independent events. The outcome of the second roll is not affected by the outcome of the first roll. In contrast, in a Lotto draw where balls are drawn without replacement, the outcomes are dependent. The probability of drawing a specific number on the second draw depends on which number was drawn first.

Two random variables \(X\) and \(Y\) are independent if and only if their joint probability can be factored into the product of their individual probabilities (also known as marginal probabilities):

\[ P(X=x_{i}, Y=y_j) = P(X=x_{i}) \cdot P(Y=y_j) \]

Example: For two independent dice rolls, the probability of getting a 1 on the first roll (\(X=1\)) and a 2 on the second roll (\(Y=2\)) is:

\[ P(X=1, Y=2) = P(X=1) \cdot P(Y=2) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36} \]
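For independent variables the joint distribution factorizes completely; the dice example can be checked with exact fractions:

```python
from fractions import Fraction

# Marginals of two independent fair dice, kept exact with Fraction.
p_x = {i: Fraction(1, 6) for i in range(1, 7)}
p_y = {j: Fraction(1, 6) for j in range(1, 7)}

# Joint distribution built from the independence factorization
joint = {(i, j): p_x[i] * p_y[j] for i in p_x for j in p_y}
```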

Marginal probability

If we have a set of random variables \(X_1, X_2, \ldots, X_N\) and we know their joint probability distribution \(P(X_1=x_{i_1}, X_2=x_{i_2}, \ldots, X_N=x_{i_N})\), we can find the marginal probability distribution for a subset \(\{X_{i_1}, X_{i_2}, \ldots, X_{i_Z}\}\) by marginalizing (summing or integrating over) the variables that are not in the subset.

Marginalization law: For two discrete random variables \(X\) and \(Y\) with a known joint probability distribution \(P(x_i, y_j)\), the marginal probability of \(X\) taking the value \(x_i\) is obtained by summing the joint probabilities over all possible values of \(Y\):

\[ P(x_i) = \sum_{y_j \in V(Y)} P(x_i, y_j) \]

Similarly, the marginal probability of \(Y\) taking the value \(y_j\) is:

\[ P(y_j) = \sum_{x_i \in V(X)} P(x_i, y_j) \]

The variables we are interested in (like \(X\) in the first equation) are called marginal variables.

Generalization to multiple variables

For three random variables \(X, Y, Z\), the marginal probability of \(X\) is:

\[ P(x_i) = \sum_{y_j \in V(Y)} \sum_{z_k \in V(Z)} P(x_i, y_j, z_k) \]

This concept extends to any number of variables and any subset.

Example

Let’s calculate the marginal probability of having a cavity (\(P(\text{Cavity=True})\)). To do this, we sum the joint probabilities where Cavity is True, across all possible values of Toothache:

\[ \begin{flalign} P(\text{Cavity=True}) &= P(\text{Toothache=True, Cavity=True}) \\ &+ P(\text{Toothache=False, Cavity=True}) \end{flalign} \]

Substituting the values from the table:

\[ P(\text{Cavity=True}) = 0.04 + 0.06 = 0.10 \]

So, the marginal probability of having a cavity is 0.10 or 10%. This probability considers all individuals in our population, regardless of whether they have a toothache or not.

Similarly, we can calculate the marginal probability of not having a cavity (\(P(\text{Cavity=False})\)) by summing the joint probabilities where Cavity is False:

\[ \begin{flalign} P(\text{Cavity=False}) &= P(\text{Toothache=True, Cavity=False}) \\ &+ P(\text{Toothache=False, Cavity=False}) \end{flalign} \]

Substituting the values from the table:

\[ P(\text{Cavity=False}) = 0.01 + 0.89 = 0.90 \]

The marginal probability of not having a cavity is 0.90 or 90%.

These calculations demonstrate how we can obtain the probability distribution for a single variable (Cavity) by marginalizing over the other variable (Toothache) in the joint probability distribution.
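Both marginalizations can be reproduced directly from the joint distribution of Table 1:

```python
# Joint distribution from Table 1, keyed by (toothache, cavity).
joint = {
    (True, True): 0.04,    # toothache and cavity
    (False, True): 0.06,   # no toothache, cavity
    (True, False): 0.01,   # toothache, no cavity
    (False, False): 0.89,  # no toothache, no cavity
}

# Marginalize over Toothache to obtain the distribution of Cavity.
p_cavity = sum(p for (t, c), p in joint.items() if c)
p_no_cavity = sum(p for (t, c), p in joint.items() if not c)
```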

Conditional probability

Definition

New information (usually called evidence) can change the probability; e.g., the probability of a cavity increases if we know the patient has a toothache.

The conditional probability of a random variable \(X\) taking the value \(x_i\) given that another random variable \(Y\) has taken the value \(y_j\), denoted as \(P(X=x_i | Y=y_j)\) or \(P(x_i | y_j)\), is the probability of \(X=x_i\) occurring under the condition that \(Y=y_j\) is already known or has been observed.

Marginal probability is the probability of an event occurring on its own, whereas conditional probability considers the occurrence of one event given that another has already happened. This introduces a dependency in the probability calculation.

The conditional probability of \(X\) given \(Y\) is calculated as the ratio of the joint probability of \(X\) and \(Y\) to the marginal probability of \(Y\) (provided \(P(y_j) > 0\)):

\[ P(x_i | y_j) = \frac{P(x_i \land y_j)}{P(y_j)} = \frac{P(x_i, y_j)}{P(y_j)} \]

Product rule

By rearranging the conditional probability equation we can calculate a joint probability as a product of a conditional probability and an a-priori probability:

\[ P(x_i,y_j)= P(x_i|y_j)\cdot P(y_j) \]

This is the simplest case of the product rule.

For three variables we can write:

\[ P(x_i,y_j,z_k)= P(x_i|y_j,z_k)\cdot P(y_j,z_k). \]

Since the last factor on the right-hand side of this equation can again be written as

\[ P(y_j,z_k)= P(y_j|z_k)\cdot P(z_k), \]

we finally obtain:

\[ P(x_i,y_j,z_k)= P(x_i|y_j,z_k)\cdot P(y_j|z_k)\cdot P(z_k) \]

Generalization

The joint probability can be expressed as a product of conditional probabilities and an a-priori probability.

This can be generalized to the case of \(N\) random variables. The general form of the chain rule is:

\[ \begin{flalign} P(x_{i_1}, x_{i_2}, \ldots x_{i_{N}}) &= P(x_{i_1} | x_{i_2}, \ldots x_{i_{N}} ) \cdot P(x_{i_2} | x_{i_3}, \ldots x_{i_{N}}) \\ & \cdot P(x_{i_3} | x_{i_4}, \ldots x_{i_{N}}) \cdots P(x_{i_{N}}) \\ &= \prod\limits_{j=1}^N P(x_{i_j} | x_{i_{j+1}}, \ldots x_{i_{N}} ) \end{flalign} \]

Bayesian inference

Bayes’ Theorem

Bayes’ Theorem is a fundamental concept in probability theory and serves as the cornerstone of modern AI and machine learning. It allows us to update our beliefs based on new evidence.

At its core, Bayes’ Theorem relates conditional probabilities. Starting from the conditional probability formula:

\[ P(x_i | y_j) = \frac{P(x_i, y_j)}{P(y_j)} \]

We can derive Bayes’ Theorem as:

\[ P(x_i | y_j)=\frac{P(y_j | x_i) P(x_i)}{P(y_j)} \]

Computing the evidence term

The denominator \(P(y_j)\) can be expanded using the law of total probability7:

\[ P(y_j) = \sum_{x_k \in V(X)} P(y_j | x_k) P(x_k) \]

Where \(V(X)\) represents all possible values of the random variable \(X\).

This gives us the complete form of Bayes’ Theorem:

\[ P(x_i | y_j)=\frac{P(y_j | x_i) P(x_i)}{\sum\limits_{x_k \in V(X)}P(y_j | x_k) P(x_k)} \]
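This complete form maps directly onto a few lines of Python. The sketch below applies it to the Toothache/Cavity numbers; the likelihoods \(P(toothache \mid x)\) are derived from the joint table (Table 1).

```python
def bayes_posterior(prior, likelihood):
    """Posterior P(x | y) for each x, given priors P(x) and likelihoods P(y | x).

    The evidence P(y) is computed via the law of total probability."""
    p_y = sum(likelihood[x] * prior[x] for x in prior)
    return {x: likelihood[x] * prior[x] / p_y for x in prior}

# Priors and likelihoods consistent with Table 1:
prior = {"cavity": 0.1, "no_cavity": 0.9}
likelihood = {"cavity": 0.04 / 0.1, "no_cavity": 0.01 / 0.9}
posterior = bayes_posterior(prior, likelihood)
# mathematically: posterior["cavity"] = 0.04 / (0.04 + 0.01) = 0.8
```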

Bayesian inference in practice

Bayesian inference is the process of applying Bayes’ Theorem to update probabilities as new information becomes available. Here’s how it works in practice:

  1. Start with a prior probability \(P(x_i)\) based on existing knowledge
  2. Collect new evidence \(y_j\)
  3. Calculate how likely that evidence would be under different scenarios using the likelihood \(P(y_j|x_i)\)
  4. Update your belief to the posterior probability \(P(x_i|y_j)\) using Bayes’ Theorem

Naive Bayes classifiers

Given a feature vector8 \(X = (x_1, x_2, ..., x_n)\) and a class variable9 \(Y\), Bayes’ Theorem gives us:

\[P(Y|X) = \frac{P(X|Y) \cdot P(Y)}{P(X)}\]

The naive independence assumption states that:

\[P(X|Y) = P(x_1|Y) \cdot P(x_2|Y) \cdot ... \cdot P(x_n|Y)\]

This gives us:

\[ \begin{flalign} P(Y|X) &= \frac{P(X|Y) \cdot P(Y)}{P(X)}\\ &= \frac{P(x_1|Y) \cdot P(x_2|Y) \cdot ... \cdot P(x_n|Y) \cdot P(Y)}{P(x_1, x_2, \ldots, x_n)}\\ \end{flalign} \]

In practice, often the unnormalized probabilities are calculated for each class:

\[P(x_1|Y) \cdot P(x_2|Y) \cdot ... \cdot P(x_n|Y) \cdot P(Y)\]

And then normalized by dividing by the total probability:

\[P(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{m} \left[\prod_{i=1}^{n} P(x_i|Y=y_j)\right] \cdot P(Y=y_j)\]

Overall, Naive Bayes simplifies the calculation dramatically: we now only need to estimate these simpler conditional probabilities from the training data.
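A minimal sketch of this score-then-normalize procedure; the toy spam-filter priors and conditional tables below are invented illustration values, not from the text.

```python
from math import prod

def naive_bayes_posterior(priors, cond, features):
    """
    priors:   class -> P(Y=class)
    cond:     (feature, value, class) -> P(feature=value | Y=class)
    features: feature -> observed value
    """
    # Unnormalized score per class: P(Y) * product of the P(x_i | Y)
    scores = {
        y: priors[y] * prod(cond[(f, v, y)] for f, v in features.items())
        for y in priors
    }
    total = sum(scores.values())  # P(x_1, ..., x_n), the normalizer
    return {y: s / total for y, s in scores.items()}

# Hypothetical two-feature spam filter
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    ("contains_free", True, "spam"): 0.5, ("contains_free", True, "ham"): 0.1,
    ("contains_meeting", False, "spam"): 0.9, ("contains_meeting", False, "ham"): 0.4,
}
posterior = naive_bayes_posterior(
    priors, cond, {"contains_free": True, "contains_meeting": False})
```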

Exercises

Conditional probability

Show from the definition of conditional probability that \(P(a | b \land a) = 1\).

Beliefs

An agent holds the three beliefs: \(P(A)=0.4\), \(P(B)=0.3\), and \(P(A \lor B)=0.5\).

What is the probability of \(A \land B\)?

Make up a table like the one in Table 1 to answer that question.

Medical tests

A medical test is being used to screen for a rare disease that affects 1% of the population. The test has the following characteristics:

  • If a person has the disease, the test will correctly identify them as positive 95% of the time (sensitivity).
  • If a person does not have the disease, the test will correctly identify them as negative 98% of the time (specificity).

A random person from the population takes the test and receives a positive result.

Questions

  1. What is the probability that this person actually has the disease?
  2. If the person takes the test a second time and again receives a positive result, what is the updated probability that they have the disease? Assume that test results are independent for the same person.
  3. Would you recommend that this person undergo treatment based on these test results? Why or why not? Consider the reliability of the diagnosis given the test results.

Conditional independence

Show that the statement of conditional independence

\[ \begin{flalign} P(A,B|C)=P(A|C)P(B|C) && \end{flalign} \]

is equivalent to each of the statements

\[ \begin{flalign} P(A|B,C)=P(A|C) \quad \textrm{and} \quad P(B|A,C)=P(B|C) && \end{flalign} \]

Pacman

Pacman has developed a hobby of fishing. Over the years, he has learned that a day can be considered fit or unfit for fishing (\(Y\)), which relates to three features: whether or not Ms. Pacman can show up (\(M\)), the temperature of the day (\(T\)), and how high the water level is (\(W\)). Pacman models it as a Naive Bayes classification problem.

We wish to calculate the probability a day is fit for fishing given features of the day. Consider the conditional probability tables that Pacman has estimated over the years:

| \(Y\) | \(P(Y)\) |
|-------|----------|
| yes   | 0.1      |
| no    | 0.9      |

| \(M\) | \(Y\) | \(P(M \mid Y)\) |
|-------|-------|-----------------|
| yes   | yes   | 0.5             |
| no    | yes   | 0.5             |
| yes   | no    | 0.2             |
| no    | no    | 0.8             |

| \(W\) | \(Y\) | \(P(W \mid Y)\) |
|-------|-------|-----------------|
| high  | yes   | 0.1             |
| low   | yes   | 0.9             |
| high  | no    | 0.5             |
| low   | no    | 0.5             |

| \(T\) | \(Y\) | \(P(T \mid Y)\) |
|-------|-------|-----------------|
| cold  | yes   | 0.2             |
| warm  | yes   | 0.3             |
| hot   | yes   | 0.5             |
| cold  | no    | 0.5             |
| warm  | no    | 0.2             |
| hot   | no    | 0.3             |

Determine if a day is fit for fishing given the following conditions:

  • Ms. Pacman is available: M = yes
  • The weather is cold: T = cold
  • The water level is high: W = high

Footnotes

  1. See qualification problem.

  2. Theoretical ignorance refers to our incomplete understanding of the precise mechanisms and relationships between conditions and symptoms, even when we can identify them.

  3. Practical ignorance refers to our unavoidable uncertainty about the specific details and circumstances of an individual case, even when we have comprehensive theoretical knowledge about the conditions involved (e.g., individual pain thresholds differ significantly).

  4. Variables in probability theory are called random variables

  5. Finite outcomes are, e.g., the 6 possibilities in a dice-rolling event.

  6. Represents the belief that the temperature at noon is distributed uniformly between 18 and 26 degrees Celsius

  7. The law of total probability states that if you have a sample space divided into mutually exclusive and exhaustive events (often called a partition), then the probability of any event \(y_j\) can be calculated by adding up the conditional probabilities of \(y_j\) given each event in the partition \(X\), weighted by the probabilities of those partition events (i.e., representing all possible paths or scenarios through which the event can occur).

  8. A feature vector is an n-dimensional vector of numerical features that represent an object. It’s essentially an n-dimensional vector where each dimension represents a specific attribute. Features can be numerical (age, income), categorical (color, brand), binary (yes/no), or derived (calculated from other features).

  9. A class variable (Y) is what we’re trying to predict or classify. It represents the outcome or category that our machine learning model should determine. Examples include: spam/not spam, disease present/absent, sentiment (positive/negative/neutral). Can be binary (two options), multi-class (several distinct categories), or multi-label (multiple categories can apply simultaneously).