Introduction to AI (I2AI)
Neu-Ulm University of Applied Sciences
April 18, 2026
What are challenges related to logic as an approach to knowledge representation?
Logic is powerful, but conclusions must often be drawn under uncertainty. Take an expert dental diagnosis system as an example.
\(Toothache \implies Cavity\)
This rule is incorrect as there are other causes for toothache, thus a better rule would be:
\(Toothache \implies Cavity \lor GumProblem \lor Abscess \lor ...\)
However, as we don’t know all the causes, this rule for diagnostic reasoning is still incomplete[^1].
What about a rule for causal reasoning?
\(Cavity \implies Toothache\)
This is still wrong, as a cavity does not always cause a toothache. Further, it does not allow reasoning from symptoms to causes.
What do we learn from that?
Formal logical reasoning systems have significant limitations when dealing with real-world problems where we lack complete information.
Goal: Be in Ulm at 8:15 to give a lecture
There are several plans that achieve the goal:
All these plans are correct, but they imply different costs and different probabilities of actually achieving the goal.
\(P_2\) is probably the plan of choice: although \(P_1\) promises high rewards, its success rate is only 80%.

The expected utility of a plan is

\[ (\text{utility} - \text{cost}) \cdot \text{probability} \]

It would be rational to choose the action with the maximum expected utility (MEU): the “average”, or “statistical mean”, of the outcome utilities minus the costs of the actions leading to each outcome, weighted by the probability of that outcome.
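As an illustration, the MEU rule can be sketched in a few lines of Python. The plan names, utilities, and costs below are made-up values for the sake of the example; only \(P_1\)’s 80% success rate is taken from the text.

```python
# Illustrative MEU calculation; utilities and costs are assumptions,
# only P1's 80% success rate comes from the lecture.
plans = {
    "P1": {"utility": 100.0, "cost": 20.0, "p_success": 0.80},
    "P2": {"utility": 80.0,  "cost": 10.0, "p_success": 0.95},
}

def expected_utility(plan):
    # (utility - cost), weighted by the probability of success
    return (plan["utility"] - plan["cost"]) * plan["p_success"]

# A rational agent picks the plan with maximum expected utility
best = max(plans, key=lambda name: expected_utility(plans[name]))
print(best)  # P2
```

Even though \(P_1\) has the higher raw reward, \(P_2\) wins once cost and success probability are factored in.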
A random variable[^4] \(X\) is a variable that can take multiple values \(X=x_i\) depending on the outcome of a random event. We denote the set of possible values that the variable can take by \(V(X)\).
The probability that the random variable \(X\) takes the value \(x\) is denoted by \(P(X=x)\), or \(P(x)\) for short.
The description of the probabilities \(P(x)\) for all possible \(x \in V(X)\) is called the probability distribution of variable \(X\).
\(P(x)\) denotes the unconditional probability or prior probability that \(x\) will appear in the absence of any other information, e.g. \(P(Cavity) = 0.1\)
In the case of discrete random variables \(X\), the probability distribution is called probability mass function (PMF) \(p_X(x)\), it assigns to all \(x \in V(X)\) the corresponding probability \(P(x)\). For all probabilities the following conditions must be satisfied:
1. Non-negativity
The probability assigned to any value \(x\) must be greater than or equal to zero.
\[ \begin{flalign} p_X(x) \geq 0 \quad \text{for all } x && \end{flalign} \]
2. Normalization
The sum of the probabilities of all possible values that the random variable \(X\) can take is equal to 1.
\[ \begin{flalign} \sum_{x \in V(X)} p_X(x) = 1 && \end{flalign} \]
3. Support
The probability that the random variable \(X\) takes a value outside its set of possible values is zero; the only values \(X\) can take with non-zero probability are those in its support.
\[ \begin{flalign} p_X(x) = 0 \quad \text{for all } x \notin V(X) && \end{flalign} \]
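The three conditions can be checked mechanically. A minimal Python sketch, where representing a PMF as a dict mapping values to probabilities is just one possible encoding:

```python
def is_valid_pmf(pmf, support, tol=1e-9):
    """Check the three PMF conditions for a discrete distribution
    given as a dict mapping values to probabilities."""
    non_negative = all(p >= 0 for p in pmf.values())          # 1. non-negativity
    normalized = abs(sum(pmf.values()) - 1.0) < tol           # 2. normalization
    on_support = all(x in support for x in pmf if pmf[x] > 0)  # 3. support
    return non_negative and normalized and on_support

fair_die = {x: 1/6 for x in range(1, 7)}
print(is_valid_pmf(fair_die, support=range(1, 7)))  # True
```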
The cumulative distribution function (CDF) \(F_X(x)\) of a real-valued random variable \(X\) is the probability that \(X\) will take a value less than or equal to \(x\).
\[ \begin{flalign} F_X(x) = P(X \leq x) && \end{flalign} \]
Hence, the probability that \(X\) takes a value greater than \(a\) and less than or equal to \(b\) is:
\[ \begin{flalign} P(a < X \leq b) = F_X(b) - F_X(a) && \end{flalign} \]
Every CDF is non-decreasing and has a maximum value of 1.
For the experiment “rolling a die” the possible outcomes are 1, 2, 3, 4, 5, and 6. The corresponding discrete random variable \(X\) has the probability mass function \(p_X(x)\), which is defined by:
\[ \begin{flalign} P(X=1) &= P(X=2) = P(X=3) =P(X=4) \\ &= P(X=5) = P(X=6) = \frac{1}{6} \end{flalign} \]
Such a distribution, for which \(P(x)\) is equal for all \(x \in V(X)\) is called a uniform distribution.
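For the fair die, the CDF and an interval probability can be computed directly from the uniform PMF; a short sketch:

```python
# Uniform PMF of a fair die
pmf = {x: 1/6 for x in range(1, 7)}

def cdf(x):
    # F_X(x) = P(X <= x): sum the PMF over all values up to x
    return sum(p for value, p in pmf.items() if value <= x)

# P(a < X <= b) = F_X(b) - F_X(a); here the outcomes 3, 4, 5
p_interval = cdf(5) - cdf(2)
print(p_interval)  # ≈ 0.5
```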
A discrete binary random variable \(X\) has only two possible outcomes: \(0\) or \(1\), i.e. in this case \(V(X) = \lbrace 0,1 \rbrace\). The corresponding probability mass function is the Bernoulli distribution, which is defined as follows:
\[ p_X(x) = \left\{ \begin{array}{ll} p & \mbox{ for } x=1 \\ 1-p & \mbox{ for } x=0 \end{array} \right. \]
The Binomial distribution helps us calculate the probability of getting a specific number of successful outcomes (let’s say \(k\)) in a fixed number of independent trials (let’s say \(n\)).
\[ P(X=k) = \left( \begin{array}{c} n \\ k \end{array} \right) p^k (1-p)^{n-k} \]
where \(p\) is the probability of success in a single trial.
For example, in a coin-tossing experiment the probability of success is \(P(X=1)=p=0.5\). If we toss the coin \(n=5\) times, the probability that exactly \(k=4\) tosses yield success is
\[ \left( \begin{array}{c} 5 \\ 4 \end{array} \right) 0.5^4 (1-0.5)^{5-4} = 0.15625 \]
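The calculation above can be reproduced with Python’s standard library, where `math.comb` provides the binomial coefficient:

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly 4 successes in 5 fair coin tosses
print(binomial_pmf(4, n=5, p=0.5))  # 0.15625
```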
The Geometric distribution models the number of tries you need to keep going until you finally get a success. Think about rolling a die until you roll a 6. The Geometric distribution helps you find the probability that it takes exactly \(k\) rolls. Every roll is independent, has two potential outcomes (getting a 6 or not), and the probability of rolling a 6 (\(p\)) remains constant.
\[ P(X=k) = (1-p)^{k-1} \cdot p \]
For the coin-tossing experiment, the probability that the first success comes on the \(k\)-th toss is \((1-0.5)^{k-1} \cdot 0.5 = 0.5^k\).
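A small sketch of the geometric PMF; the choice of \(k=3\) below is just an illustrative value:

```python
def geometric_pmf(k, p):
    # P(first success on trial k) = (1-p)^(k-1) * p
    return (1 - p)**(k - 1) * p

# Coin tossing: probability that the first head appears on toss 3
print(geometric_pmf(3, p=0.5))  # 0.125
```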
For continuous random variables, we use the probability density function (PDF), written as \(p_X(x)\), to understand how the variable’s values are distributed. Think of it like the probability mass function (PMF) for discrete variables.
The PDF tells us the relative likelihood of the random variable \(X\) taking on a specific value \(x\). While the exact probability of a continuous variable being any single value is zero (because there are infinitely many possibilities), the PDF helps us compare how likely different values are. A higher PDF value at one point means that values around that point are more likely to occur than values around a point with a lower PDF.
More precisely, the PDF is used to find the probability of the random variable falling within a range of values. This probability is the area under the PDF curve over that range.
Key properties of a PDF (mirroring the PMF conditions above):

1. Non-negativity: \(p_X(x) \geq 0\) for all \(x\).
2. Normalization: \(\int_{-\infty}^{\infty} p_X(x) \, dx = 1\).
The CDF, written as \(F_X(x)\), tells us the probability that the random variable \(X\) will be less than or equal to a specific value \(x\). For continuous variables, this is the area under the PDF curve from negative infinity up to \(x\):
\[ F_X(x)= \int_{-\infty}^{x} p_X(t) \, dt \]
Using the CDF, we can find the probability that \(X\) falls within a range between \(a\) and \(b\) (where \(a < b\)):
\[ P(a < X \leq b) = F_X(b)-F_X(a)=\int_{a}^{b} p_X(t) \, dt \]
The Gaussian distribution (also called the normal distribution) is a common continuous distribution. Its PDF has a bell shape and is defined by its mean (\(\mu\)) and standard deviation (\(\sigma\)):
\[p_X(x)=\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
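Since any single point of a continuous variable has probability zero, the PDF is used through integration. The sketch below evaluates the Gaussian PDF and approximates \(P(\mu-\sigma < X \leq \mu+\sigma)\) with a simple midpoint Riemann sum; the step count is an arbitrary choice.

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # PDF of the normal distribution with mean mu and std dev sigma
    return 1.0 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)**2 / (2 * sigma**2))

def prob_interval(a, b, mu, sigma, steps=10_000):
    # Approximate the area under the PDF between a and b (midpoint rule)
    width = (b - a) / steps
    return sum(gaussian_pdf(a + (i + 0.5) * width, mu, sigma)
               for i in range(steps)) * width

# Should be close to the well-known 68% within one standard deviation
print(prob_interval(-1, 1, mu=0, sigma=1))  # ≈ 0.6827
```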
The joint probability of two random variables \(X\) and \(Y\), denoted as \(P(X=x_i, Y=y_j)\) or simply \(P(x_i, y_j)\), represents the probability that \(X\) takes the specific value \(x_i\) and \(Y\) simultaneously takes the specific value \(y_j\). The comma “,” signifies the logical “and”.
The joint probability distribution of two random variables \(X\) and \(Y\) is the set of all possible joint probabilities for all possible value combinations of \(X\) (from its value set \(V(X)\)) and \(Y\) (from its value set \(V(Y)\)):
\[ P(X=x_i, Y=y_j) \quad \forall \quad x_i \in V(X), y_j \in V(Y) \]
Consider two binary random variables: Toothache (True, or False denoted ¬Toothache) and Cavity (True, or False denoted ¬Cavity).
|  | Toothache | ¬Toothache |
|---|---|---|
| Cavity | 0.04 | 0.06 |
| ¬Cavity | 0.01 | 0.89 |
From this table, we can directly read the joint probability of any combination of Toothache and Cavity. For example, the probability of having a toothache and a cavity is:
\[ P(\text{Toothache=True, Cavity=True}) = 0.04 \]
Similarly, the probability of not having a toothache and not having a cavity is:
\[ P(\text{Toothache=False, Cavity=False}) = 0.89 \]
For \(N\) random variables \(X_1, X_2, \ldots, X_N\), the joint probability is:
\[ P(X_1=x_{i_1}, X_2=x_{i_2}, \ldots, X_N=x_{i_N}) \quad \text{or} \quad P(x_{i_1}, x_{i_2}, \ldots, x_{i_N}) \]
This represents the probability that \(X_1\) takes value \(x_{i_1}\) and \(X_2\) takes value \(x_{i_2}\) and so on, up to \(X_N\) taking value \(x_{i_N}\).
The joint probability distribution for these \(N\) variables is the collection of all such probabilities for all possible combinations of their values.
For continuous random variables, the joint probability distribution is described by a joint cumulative distribution function (CDF) or a joint probability density function (PDF). For discrete random variables, it’s described by a probability mass function (PMF) or the CDF.
Two random variables \(X\) and \(Y\) are considered independent if the value taken by one variable does not influence the value taken by the other.
Example: Two consecutive rolls of a fair die are independent events. The outcome of the second roll is not affected by the outcome of the first roll. In contrast, in a Lotto draw where balls are drawn without replacement, the outcomes are dependent. The probability of drawing a specific number on the second draw depends on which number was drawn first.
Two random variables \(X\) and \(Y\) are independent if and only if their joint probability can be factored into the product of their individual probabilities (also known as marginal probabilities):
\[ P(X=x_{i}, Y=y_j) = P(X=x_{i}) \cdot P(Y=y_j) \]
Example: For two independent dice rolls, the probability of getting a 1 on the first roll (\(X=1\)) and a 2 on the second roll (\(Y=2\)) is:
\[ P(X=1, Y=2) = P(X=1) \cdot P(Y=2) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36} \]
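The dice example can be sketched in code, building the joint distribution of two independent rolls as the product of the marginals:

```python
from itertools import product

# Marginal distributions of two independent fair-die rolls
p_x = {x: 1/6 for x in range(1, 7)}
p_y = {y: 1/6 for y in range(1, 7)}

# For independent variables the joint factors into the marginals
joint = {(x, y): p_x[x] * p_y[y] for x, y in product(range(1, 7), repeat=2)}

print(joint[(1, 2)])        # ≈ 1/36 ≈ 0.0278
print(sum(joint.values()))  # ≈ 1.0, a valid joint distribution
```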
If we have a set of random variables \(X_1, X_2, \ldots, X_N\) and we know their joint probability distribution \(P(X_1=x_{i_1}, X_2=x_{i_2}, \ldots, X_N=x_{i_N})\), we can find the marginal probability distribution for a subset \(\{X_{i_1}, X_{i_2}, \ldots, X_{i_Z}\}\) by marginalizing (summing or integrating) over the variables that are not in the subset.
Marginalization law: For two discrete random variables \(X\) and \(Y\) with a known joint probability distribution \(P(x_i, y_j)\), the marginal probability of \(X\) taking the value \(x_i\) is obtained by summing the joint probabilities over all possible values of \(Y\):
\[ P(x_i) = \sum_{y_j \in V(Y)} P(x_i, y_j) \]
Similarly, the marginal probability of \(Y\) taking the value \(y_j\) is:
\[ P(y_j) = \sum_{x_i \in V(X)} P(x_i, y_j) \]
The variables we are interested in (like \(X\) in the first equation) are called marginal variables.
For three random variables \(X, Y, Z\), the marginal probability of \(X\) is:
\[ P(x_i) = \sum_{y_j \in V(Y)} \sum_{z_k \in V(Z)} P(x_i, y_j, z_k) \]
This concept extends to any number of variables and any subset.
Let’s calculate the marginal probability of having a cavity (\(P(\text{Cavity=True})\)). To do this, we sum the joint probabilities where Cavity is True, across all possible values of Toothache:
\[ \begin{flalign} P(\text{Cavity=True}) &= P(\text{Toothache=True, Cavity=True}) \\ &+ P(\text{Toothache=False, Cavity=True}) \end{flalign} \]
Substituting the values from the table:
\[ P(\text{Cavity=True}) = 0.04 + 0.06 = 0.10 \]
So, the marginal probability of having a cavity is 0.10 or 10%. This probability considers all individuals in our population, regardless of whether they have a toothache or not.
Similarly, we can calculate the marginal probability of not having a cavity (\(P(\text{Cavity=False})\)) by summing the joint probabilities where Cavity is False:
\[ \begin{flalign} P(\text{Cavity=False}) &= P(\text{Toothache=True, Cavity=False}) \\ &+ P(\text{Toothache=False, Cavity=False}) \end{flalign} \]
Substituting the values from the table:
\[ P(\text{Cavity=False}) = 0.01 + 0.89 = 0.90 \]
The marginal probability of not having a cavity is 0.90 or 90%.
These calculations demonstrate how we can obtain the probability distribution for a single variable (Cavity) by marginalizing over the other variable (Toothache) in the joint probability distribution.
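The same marginalization can be sketched in code, representing the joint table as a dictionary keyed by truth values:

```python
# Joint distribution from the toothache/cavity table,
# keyed by (toothache, cavity) truth values
joint = {
    (True, True): 0.04, (False, True): 0.06,
    (True, False): 0.01, (False, False): 0.89,
}

def p_cavity(cavity):
    # Marginalize: sum the joint over all values of Toothache
    return sum(p for (toothache, cav), p in joint.items() if cav == cavity)

print(p_cavity(True))   # ≈ 0.10
print(p_cavity(False))  # ≈ 0.90
```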
New information (usually called evidence) can change the probability, e.g. the probability of a cavity increases if we know the patient has a toothache.
The conditional probability of a random variable \(X\) taking the value \(x_i\) given that another random variable \(Y\) has taken the value \(y_j\), denoted as \(P(X=x_i | Y=y_j)\) or \(P(x_i | y_j)\), is the probability of \(X=x_i\) occurring under the condition that \(Y=y_j\) is already known or has been observed.
Marginal probability is the probability of an event occurring on its own, whereas conditional probability considers the occurrence of one event given that another has already happened. This introduces a dependency in the probability calculation.
The conditional probability of \(X\) given \(Y\) is calculated as the ratio of the joint probability of \(X\) and \(Y\) to the marginal probability of \(Y\) (provided \(P(y_j) > 0\)):
\[ P(x_i | y_j) = \frac{P(x_i \land y_j)}{P(y_j)} = \frac{P(x_i, y_j)}{P(y_j)} \]
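Applied to the toothache/cavity table from above, this formula can be sketched as follows. The conditional probability of a cavity jumps well above the prior once a toothache is observed:

```python
# Joint probabilities from the toothache/cavity table,
# keyed by (toothache, cavity) truth values
joint = {
    (True, True): 0.04, (False, True): 0.06,
    (True, False): 0.01, (False, False): 0.89,
}

# P(Toothache = True), obtained by summing over Cavity
p_toothache = joint[(True, True)] + joint[(True, False)]

# P(Cavity = True | Toothache = True) = P(Cavity, Toothache) / P(Toothache)
p_cavity_given_toothache = joint[(True, True)] / p_toothache
print(p_cavity_given_toothache)  # ≈ 0.8, up from the prior P(Cavity) = 0.1
```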
By rearranging the conditional probability equation we can calculate a joint probability as a product of a conditional probability and an a-priori probability:
\[ P(x_i,y_j)= P(x_i|y_j)\cdot P(y_j) \]
This is the simplest case of the product rule.
For three variables we can write:
\[ P(x_i,y_j,z_k)= P(x_i|y_j,z_k)\cdot P(y_j,z_k). \]
Since the last factor on the right-hand side of this equation can again be written as
\[ P(y_j,z_k)= P(y_j|z_k)\cdot P(z_k), \]
we finally obtain:
\[ P(x_i,y_j,z_k)= P(x_i|y_j,z_k)\cdot P(y_j|z_k)\cdot P(z_k) \]
The joint probability can be expressed as a product of conditional probabilities and an a-priori probability.
This can be generalized to the case of \(N\) random variables. The general form of the chain rule is:
\[ \begin{flalign} P(x_{i_1}, x_{i_2}, \ldots x_{i_{N}}) &= P(x_{i_1} | x_{i_2}, \ldots x_{i_{N}} ) \cdot P(x_{i_2} | x_{i_3}, \ldots x_{i_{N}}) \\ & \cdot P(x_{i_3} | x_{i_4}, \ldots x_{i_{N}}) \cdots P(x_{i_{N}}) \\ &= \prod\limits_{j=1}^N P(x_{i_j} | x_{i_{j+1}}, \ldots x_{i_{N}} ) \end{flalign} \]
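The chain rule can be verified numerically for three binary variables. The joint distribution below is randomly generated, so the particular numbers carry no meaning; the point is that the factorization holds for any joint distribution:

```python
from itertools import product
import random

random.seed(0)
# A random joint distribution over three binary variables X, Y, Z
outcomes = list(product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(x=None, y=None, z=None):
    # Sum the joint over every variable that is left unspecified
    return sum(p for (xi, yi, zi), p in joint.items()
               if (x is None or xi == x)
               and (y is None or yi == y)
               and (z is None or zi == z))

# Chain rule: P(x, y, z) = P(x | y, z) * P(y | z) * P(z)
lhs = joint[(1, 0, 1)]
rhs = (marginal(x=1, y=0, z=1) / marginal(y=0, z=1)) \
      * (marginal(y=0, z=1) / marginal(z=1)) \
      * marginal(z=1)
print(abs(lhs - rhs) < 1e-12)  # True
```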
Bayes’ Theorem is a fundamental concept in probability theory and serves as the cornerstone of modern AI and machine learning. It allows us to update our beliefs based on new evidence.
At its core, Bayes’ Theorem relates conditional probabilities. Starting from the conditional probability formula:
\[ P(x_i | y_j) = \frac{P(x_i, y_j)}{P(y_j)} \]
Using the symmetry of the joint probability, \(P(x_i, y_j) = P(y_j, x_i) = P(y_j | x_i)\, P(x_i)\), we can derive Bayes’ Theorem as:
\[ P(x_i | y_j)=\frac{P(y_j | x_i) P(x_i)}{P(y_j)} \]
The denominator \(P(y_j)\) can be expanded using the law of total probability[^7]:
\[ P(y_j) = \sum_{x_k \in V(X)} P(y_j | x_k) P(x_k) \]
Where \(V(X)\) represents all possible values of the random variable \(X\).
This gives us the complete form of Bayes’ Theorem:
\[ P(x_i | y_j)=\frac{P(y_j | x_i) P(x_i)}{\sum\limits_{x_k \in V(X)}P(y_j | x_k) P(x_k)} \]
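A sketch of this complete form with the expanded denominator. The disease-screening numbers below (1% prevalence, 95% sensitivity, 10% false-positive rate) are hypothetical placeholders chosen for illustration, not values from the text:

```python
def bayes(prior, likelihood):
    """Posterior P(x | y) via Bayes' Theorem; the denominator P(y)
    is computed with the law of total probability.
    prior:      dict x -> P(x)
    likelihood: dict x -> P(y | x) for the observed evidence y
    """
    total = sum(likelihood[x] * prior[x] for x in prior)  # P(y)
    return {x: likelihood[x] * prior[x] / total for x in prior}

# Hypothetical screening test for a rare disease (made-up numbers)
prior = {"disease": 0.01, "healthy": 0.99}
likelihood_pos = {"disease": 0.95, "healthy": 0.10}  # P(positive | x)

posterior = bayes(prior, likelihood_pos)
print(posterior["disease"])  # ≈ 0.0876
```

Even with a fairly accurate test, the posterior stays below 10% because the disease is rare: the prior dominates.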
Bayesian inference is the process of applying Bayes’ Theorem to update probabilities as new information becomes available. Here’s how it works in practice:
Given a feature vector[^8] \(X = (x_1, x_2, ..., x_n)\) and a class variable[^9] \(Y\), Bayes’ Theorem gives us:
\[P(Y|X) = \frac{P(X|Y) \cdot P(Y)}{P(X)}\]
The naive independence assumption states that:
\[P(X|Y) = P(x_1|Y) \cdot P(x_2|Y) \cdot ... \cdot P(x_n|Y)\]
This gives us:
\[ \begin{flalign} P(Y|X) &= \frac{P(X|Y) \cdot P(Y)}{P(X)}\\ &= \frac{P(x_1|Y) \cdot P(x_2|Y) \cdot ... \cdot P(x_n|Y) \cdot P(Y)}{P(x_1, x_2, \ldots, x_n)}\\ \end{flalign} \]
In practice, the unnormalized probabilities are often calculated for each class first:
\[P(x_1|Y) \cdot P(x_2|Y) \cdot ... \cdot P(x_n|Y) \cdot P(Y)\]
And then normalized by dividing by the total probability:
\[P(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{m} \left[\prod_{i=1}^{n} P(x_i|Y=y_j)\right] \cdot P(Y=y_j)\]
Overall, Naive Bayes simplifies the calculation dramatically, as we now only need to estimate these simpler conditional probabilities from the training data.
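The two steps, unnormalized per-class scores followed by normalization, can be sketched as a small function. The spam example and all its numbers are made up for illustration:

```python
def naive_bayes_posterior(priors, likelihoods, features):
    """Naive Bayes: score P(y) * prod_i P(x_i | y) per class,
    then normalize by the total probability.
    priors:      dict y -> P(y)
    likelihoods: dict feature_name -> dict (value, y) -> P(value | y)
    features:    dict feature_name -> observed value
    """
    scores = {}
    for y, p_y in priors.items():
        score = p_y
        for name, value in features.items():
            score *= likelihoods[name][(value, y)]  # naive independence
        scores[y] = score
    total = sum(scores.values())  # P(x_1, ..., x_n)
    return {y: s / total for y, s in scores.items()}

# Made-up one-feature spam example
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "contains_link": {(True, "spam"): 0.7, (False, "spam"): 0.3,
                      (True, "ham"): 0.2, (False, "ham"): 0.8},
}
post = naive_bayes_posterior(priors, likelihoods, {"contains_link": True})
print(post["spam"])  # ≈ 0.7
```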
Show from the definition of conditional probability that \(P(a | b \land a) = 1\).
An agent holds the three beliefs: \(P(A)=0.4\), \(P(B)=0.3\), and \(P(A \lor B)=0.5\).
What is the probability of \(A \land B\)?
Make up a table like the one in Table 1 to answer that question.
A medical test is being used to screen for a rare disease that affects 1% of the population. The test has the following characteristics:
A random person from the population takes the test and receives a positive result.
Questions
Show that the statement of conditional independence
\[ \begin{flalign} P(A,B|C)=P(A|C)P(B|C) && \end{flalign} \]
is equivalent to each of the statements
\[ \begin{flalign} P(A|B,C)=P(A|C) \quad \textrm{and} \quad P(B|A,C)=P(B|C) && \end{flalign} \]
Pacman has developed a hobby of fishing. Over the years, he has learned that a day can be considered fit or unfit for fishing (\(Y\)), which he relates to three features: whether or not Ms. Pacman can show up (\(M\)), the temperature of the day (\(T\)), and how high the water level is (\(W\)). Pacman models it as a Naive Bayes classification problem.
We wish to calculate the probability a day is fit for fishing given features of the day. Consider the conditional probability tables that Pacman has estimated over the years:
| \(Y\) | \(P(Y)\) |
|---|---|
| yes | 0.1 |
| no | 0.9 |
| \(M\) | \(Y\) | \(P(M|Y)\) |
|---|---|---|
| yes | yes | 0.5 |
| no | yes | 0.5 |
| yes | no | 0.2 |
| no | no | 0.8 |
| \(W\) | \(Y\) | \(P(W|Y)\) |
|---|---|---|
| high | yes | 0.1 |
| low | yes | 0.9 |
| high | no | 0.5 |
| low | no | 0.5 |
| \(T\) | \(Y\) | \(P(T|Y)\) |
|---|---|---|
| cold | yes | 0.2 |
| warm | yes | 0.3 |
| hot | yes | 0.5 |
| cold | no | 0.5 |
| warm | no | 0.2 |
| hot | no | 0.3 |
Determine if a day is fit for fishing given the following conditions:
[^1]: See the qualification problem.
[^2]: Theoretical ignorance refers to our incomplete understanding of the precise mechanisms and relationships between conditions and symptoms, even when we can identify them.
[^3]: Practical ignorance refers to our unavoidable uncertainty about the specific details and circumstances of an individual case, even when we have comprehensive theoretical knowledge about the conditions involved (e.g., individual pain thresholds differ significantly).
[^4]: Variables in probability theory are called random variables.
[^5]: Finite outcomes are, e.g., the 6 possibilities in a dice-rolling event.
[^6]: Represents the belief that the temperature at noon is distributed uniformly between 18 and 26 degrees Celsius.
[^7]: The law of total probability states that if a sample space is divided into mutually exclusive and exhaustive events (often called a partition), then the probability of any event \(y_j\) can be calculated by adding up the conditional probabilities of \(y_j\) given each event in the partition \(X\), weighted by the probabilities of those partition events (i.e., representing all possible paths or scenarios through which the event can occur).
[^8]: A feature vector is an n-dimensional vector of features that represents an object, where each dimension corresponds to a specific attribute. Features can be numerical (age, income), categorical (color, brand), binary (yes/no), or derived (calculated from other features).
[^9]: A class variable (\(Y\)) is what we are trying to predict or classify. It represents the outcome or category that our machine learning model should determine. Examples include spam/not spam, disease present/absent, and sentiment (positive/negative/neutral). It can be binary (two options), multi-class (several distinct categories), or multi-label (multiple categories can apply simultaneously).