🧠 Introduction to AI
Neu-Ulm University of Applied Sciences
February 13, 2024
What are challenges related to logic as an approach to knowledge representation?
Logic is good, but conclusions often need to be drawn under uncertainty, as …
Goal: Be in Ulm at 8:15 to give a lecture
There are several plans that achieve the goal:
All these plans are correct, but they imply different costs and different probabilities of actually achieving the goal.
\(P_2\) is ultimately the plan of choice, as the success rate of \(P_1\) is only 80%
Take an expert dental diagnosis system as an example.
\(Toothache \implies Cavity\)
→ This rule is incorrect, better:
\(Toothache \implies Cavity \lor GumProblem \lor Abscess ...\)
→ However, we don’t know all the causes[^1]
Perhaps a causal rule is better?
\(Cavity \implies Toothache\)
→ This is still not fully correct, and it does not allow us to reason from symptoms to causes.
What do we learn from that?
→ Without perfect knowledge (i.e., “the rules of the game”), logical rules do not help much
What issues might arise when localizing a robot?
Let us suppose we wanted to support the localization of a robot with (constant) landmarks. With landmarks available, we can narrow down the area the robot is in.
Problem: sensors can be imprecise
Decision theory = utility theory + probability theory
It would be rational to choose the action with the maximum expected utility (MEU)—the “average”, or “statistical mean” of the outcome utilities minus the costs of the actions leading to the outcome, weighted by the probability of the outcome.
\(P(A)\) denotes the unconditional probability or prior probability that \(A\) will appear in the absence of any other information, e.g. \(P(Cavity) = 0.1\)
New information (usually called evidence) can change the probability, e.g. the probability of a cavity increases if we know the patient has a toothache
\(P(a|b)\) is the conditional probability of \(a\) given that all we know is \(b\)
Conditional probabilities (or posterior probabilities) are defined in terms of unconditional probabilities as follows:
For any propositions \(a\) and \(b\) (if \(P(b)>0\)), we have
\[ \begin{flalign} P(a|b) = \frac{P(a \land b)}{P(b)} = \frac{P(a, b)}{P(b)}&& \end{flalign} \]
Or in a different form as product rule
\(P(a \land b) = P(a,b) = P(a|b)P(b) = P(b|a)P(a)\)[^6]
The product rules for all possible values of Weather and Cavity can be written as a single equation:
\(P(Weather,Cavity)=P(Weather|Cavity)P(Cavity)\)
which corresponds to a system of equations (using abbreviations W and C):
\(P(W=sun,C=true) = P(W=sun|C=true)P(C=true)\)
\(P(W=rain,C=true) = P(W=rain|C=true)P(C=true)\)
…
\(P(W=snow,C=false) = P(W=snow|C=false)P(C=false)\)
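The definitions above can be checked numerically. Here is a minimal sketch using an illustrative joint distribution over two Boolean variables \(a\) and \(b\) (the numbers are assumed for illustration):

```python
# Numeric check of the definition of conditional probability and the
# product rule, over an illustrative joint distribution of (a, b).
joint = {
    (True, True): 0.04,    # P(a ∧ b)
    (True, False): 0.06,   # P(a ∧ ¬b)
    (False, True): 0.01,   # P(¬a ∧ b)
    (False, False): 0.89,  # P(¬a ∧ ¬b)
}

def marginal(index, value):
    """Marginal probability that the variable at `index` (0 = a, 1 = b) equals `value`."""
    return sum(p for event, p in joint.items() if event[index] == value)

p_a = marginal(0, True)                   # P(a)
p_b = marginal(1, True)                   # P(b)
p_a_given_b = joint[(True, True)] / p_b   # P(a|b) = P(a ∧ b) / P(b)
p_b_given_a = joint[(True, True)] / p_a   # P(b|a) = P(a ∧ b) / P(a)

# Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
assert abs(joint[(True, True)] - p_a_given_b * p_b) < 1e-12
assert abs(joint[(True, True)] - p_b_given_a * p_a) < 1e-12
```

Both forms of the product rule recover the same joint probability, as expected.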
Marginalisation in probability refers to “summing out” the probability of a random variable X given the joint probability distribution of X with other variable(s).
For any sets of variables X and Z we have
\[ \begin{flalign} P(X=x) = \sum_{z} P(X=x,Z=z) && \end{flalign} \]
To find \(P(X=x)\), we sum all the probability values where \(X=x\) occurs with all the possible values of Z.
We can abbreviate \(P(X=x)\) by \(P(x)\) and \(P(X=x,Z=z)\) by \(P(x,z)\).
Using the product rule, we can replace \(P(x,z)\) by \(P(x|z)P(z)\),
\[ \begin{flalign} P(x) = \sum_{z} P(x|z)P(z) && \end{flalign} \]
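Marginalization can be sketched in a few lines; the joint distribution below is hypothetical, chosen only so the values sum to 1:

```python
# Marginalization ("summing out"): recover P(X=x) from a small
# hypothetical joint distribution P(X, Z).
joint = {
    ("sun", "cavity"): 0.07, ("sun", "no_cavity"): 0.57,
    ("rain", "cavity"): 0.02, ("rain", "no_cavity"): 0.14,
    ("snow", "cavity"): 0.01, ("snow", "no_cavity"): 0.19,
}

def p_x(x):
    # P(X=x) = Σ_z P(X=x, Z=z)
    return sum(p for (xv, _), p in joint.items() if xv == x)

assert abs(sum(joint.values()) - 1.0) < 1e-12  # joint distribution sums to 1
print(round(p_x("sun"), 2))  # 0.64
```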
The basic axioms of probability theory say that every possible world has a probability between 0 and 1 and that the total probability of the set of possible worlds is 1:
\(0 \leq P(\omega) \leq 1 \textrm{ for every } \omega \quad \textrm{and} \quad \sum_{\omega \in \Omega}P(\omega)=1\)
The probability of a disjunction can be derived by following formula, sometimes called inclusion-exclusion principle:
\(P(a \lor b) = P(a) + P(b) - P(a \land b)\)
All other properties can be derived from these axioms, e.g.
\(P(\neg A) = 1 - P(A)\)
follows from \(P(A \lor \neg A) = 1\) and \(P(A \land \neg A) = 0\)
Probabilities can be assigned to every proposition; an atomic event is an assignment of values to all random variables (i.e., a complete specification of a state)
Example: Let \(X\) and \(Y\) be Boolean variables, leading to the following four atomic events: \(X \land Y, \neg X \land Y, X \land \neg Y, \neg X \land \neg Y\)
The joint probability distribution \(P(x_1,...,x_n)\) assigns a probability to every atomic event.
|  | Toothache | ¬Toothache |
|---|---|---|
| Cavity | 0.04 | 0.06 |
| ¬Cavity | 0.01 | 0.89 |
Since all atomic events are disjoint, the sum of all fields is 1.
All relevant probabilities can be computed using the joint probability by expressing them as a disjunction of atomic events. Example:
\[ \begin{align} P(Cavity \lor Toothache) & = P(Cavity \land Toothache) \\ &+ P(\neg Cavity \land Toothache) \\ &+ P(Cavity \land \neg Toothache) \end{align} \]
Unconditional probabilities are obtained by adding across a row or column:
\(P(Cavity) = P(Cavity \land Toothache) + P(Cavity \land \neg Toothache)\)
While conditional probabilities are obtained by using the product rule:
\(P(Cavity|Toothache) = \frac{P(Cavity \land Toothache)}{P(Toothache)}=\frac{0.04}{0.04+0.01}=0.80\)
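The calculations above can be re-derived directly from the joint table; a short sketch:

```python
# Re-deriving the dentist example's numbers from the joint distribution
# P(Cavity, Toothache) given in the table.
joint = {
    ("cavity", "toothache"): 0.04,
    ("cavity", "no_toothache"): 0.06,
    ("no_cavity", "toothache"): 0.01,
    ("no_cavity", "no_toothache"): 0.89,
}

p_cavity = sum(p for (c, _), p in joint.items() if c == "cavity")        # row sum
p_toothache = sum(p for (_, t), p in joint.items() if t == "toothache")  # column sum
p_cavity_or_toothache = 1 - joint[("no_cavity", "no_toothache")]         # disjunction
p_cavity_given_toothache = joint[("cavity", "toothache")] / p_toothache  # product rule

print(round(p_cavity, 2), round(p_toothache, 2))  # 0.1 0.05
print(round(p_cavity_given_toothache, 2))         # 0.8
```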
What challenges arise when dealing with joint probabilities?
We can easily obtain all probabilities from the joint probability. The joint probability, however, involves \(k^n\) values, if there are \(n\) random variables with \(k\) values.
→ Implies difficulties of representation and assessment
Question: Can we avoid using the full joint probability distribution?
Not in general, but it can work in many cases. Modern systems work directly with conditional probabilities and make assumptions on the independence of variables[^7] in order to simplify calculations.
Work through the Bayes’ rule chapter, try to understand it, and note any issues of understanding.
We already know the product rule that can be written in two forms:
\(P(a \land b) = P(a|b) P(b) \quad \textrm{and} \quad P(a \land b) = P(b|a) P(a)\)
By equating the right-hand sides and dividing by \(P(a)\), we get
\(P(b|a) = \frac{P(a|b)P(b)}{P(a)}\quad\) (known as Bayes’ rule)
For multi-valued variables (set of equalities):
\(P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}\)
Generalization (conditioning on background evidence \(e\)):
\(P(Y|X,e) = \frac{P(X|Y,e) P(Y|e)}{P(X|e)}\)
Bayes’ rule only allows us to compute the single term \(P(b|a)\) in terms of three terms:
\(P(a|b)\), \(P(b)\), and \(P(a)\)
That is useful in practice as there are many cases where we do have good probability estimates for these three numbers and need to compute the fourth
Often, we perceive as evidence the effect of some known cause and we would like to determine that cause. In that case, Bayes’ rule becomes:
\(P(cause|\mathit{effect}) = \frac{P(\mathit{effect}|cause)P(cause)}{P(\mathit{effect})}\)
\[ \begin{align} P(Toothache | Cavity) = {} & 0.4 \\ P(Cavity) = {} & 0.1 \\ P(Toothache) = {} & 0.05 \\ P(Cavity | Toothache) = {} & \frac{0.4 \times 0.1}{0.05} = 0.8 \end{align} \]
Often we would consider multiple causes. For example, a dentist would also like to consider the probability that the patient has a gum disease.
\[\begin{align} P(Toothache | GumDisease) = {} & 0.7 \\ P(GumDisease) = {} & 0.02 \end{align}\]
Which diagnosis is more probable?
\(P(c|t) = \frac{P(t | c)P(c)}{P(t)} \quad\) or \(\quad P(g | t) = \frac{P(t | g)P(g)}{P(t)}\)
If we are only interested in the relative probability, we need not assess \(P(T)\):
\(\frac{P(c|t)}{P(g|t)} = \frac{P(t|c)P(c)}{P(t)} \frac{P(t)}{P(t|g)P(g)} = \frac{P(t|c)P(c)}{P(t|g)P(g)} = \frac{0.4*0.1}{0.7*0.02} = 2.857\)
→ Important for excluding possible diagnoses
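The ratio can be checked with a few lines, using the probabilities from the example; note that \(P(t)\) cancels:

```python
# Relative probability of the two diagnoses (cavity vs. gum disease),
# using the numbers from the example above. P(t) cancels in the ratio.
p_t_c, p_c = 0.4, 0.1    # P(t|c), P(c)
p_t_g, p_g = 0.7, 0.02   # P(t|g), P(g)

ratio = (p_t_c * p_c) / (p_t_g * p_g)
print(round(ratio, 3))  # 2.857
```

A cavity is thus almost three times as probable as gum disease, given a toothache.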
If we wish to determine the absolute probability of \(P(c | t)\) and we do not know \(P(t)\), we can also carry out a complete case analysis (e.g. for \(c\) and \(\neg c\))
We would use the fact that
\(P(c | t) + P(\neg c | t) = 1\) (here \(c\) and \(t\) are boolean variables)
\[ \begin{flalign} P(c | t) = {} & \frac{P(t|c)P(c)}{P(t)} && (1.1) \\ P(\neg c | t) = {} & \frac{P(t|\neg c)P(\neg c)}{P(t)} && (1.2) \\ P(c | t) + P(\neg c | t) = {} & \frac{P(t|c)P(c)}{P(t)} + \frac{P(t|\neg c)P(\neg c)}{P(t)} && (1.3) \\ P(t) = {} & P(t|c) P(c) + P(t|\neg c) P(\neg c) && (1.4) \end{flalign} \]
By substituting equation (1.4) into the first equation (1.1), we get:
\[ \begin{flalign} P(c|t) = {} & \frac{P(t|c)P(c)}{P(t|c)P(c)+P(t|\neg c)P(\neg c)} && (1.5) \end{flalign} \]
Your doctor tells you that you have tested positive for a serious but rare (1/10000) disease (\(D\)). This test (\(T\)) is correct to 99% (1% false positive & 1% false negative results). What does this mean for you?
\[ \begin{flalign} P(D|T) = \frac{P(T|D) P(D)}{P(T)} = \frac{P(T|D) P(D)}{P(T|D) P(D) + P(T| \neg D) P( \neg D)} && \end{flalign} \] Applying marginalization to Bayes’ rule with Boolean variables.
\[ \begin{flalign} P(D) = 0.0001 \quad P(T|D) = 0.99 \quad P(T | \neg D) = 0.01 && \end{flalign} \]
\[ \begin{flalign} P(D|T) = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} = \frac{0.000099}{0.010098} \approx 0.0098 && \end{flalign} \]
Bottom line: if the inaccuracy of the test is much greater than the frequency of occurrence of the disease, then a positive result is not as threatening as one might think.
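As a quick sanity check, the calculation can be reproduced in a few lines:

```python
# Verifying the rare-disease calculation: Bayes' rule with the complete
# case analysis (marginalization) over D and ¬D in the denominator.
def posterior(prior, p_pos_given_d, p_pos_given_not_d):
    """P(D|T) for a Boolean disease variable D and positive test T."""
    num = p_pos_given_d * prior
    return num / (num + p_pos_given_not_d * (1 - prior))

p = posterior(prior=0.0001, p_pos_given_d=0.99, p_pos_given_not_d=0.01)
print(round(p, 4))  # 0.0098
```

Despite the positive test, the posterior probability of the disease is below 1%.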
For random variables with multiple values:
\[ \begin{flalign} P(Y|X) = {} & \frac{P(X|Y)P(Y)}{P(X)} && \\ \end{flalign} \]
\[ \begin{flalign} \alpha = \frac{1}{P(X)} = \frac{1}{\sum_i P(X|Y_i)P(Y_i)} && \\ \end{flalign} \]
Computing \(P(X)\) directly is often difficult, as the sum may run over very many (or infinitely many) values. Thus, the normalization constant \(\alpha\) is often determined approximately. We get:
\[ \begin{flalign} P(Y|X) = {} & \alpha P(X|Y) P(Y) && \end{flalign} \]
A dentist’s probe catches in the aching tooth of a patient.
Using Bayes’ rule, we can calculate:
\[ \begin{flalign} P(Cav | Catch) = 0.95 && \end{flalign} \]
But how does the combined evidence help?
With the help of Bayes’ rule, the dentist could conclude:
\[ \begin{flalign} P(Cav | Toothache \land Catch) = \frac{P(Toothache \land Catch|Cav)P(Cav)}{P(Toothache \land Catch)} && \end{flalign} \]
\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache \land Catch|Cav)P(Cav)} && (2.0) \end{flalign} \]
Problem: The dentist needs to know the conditional probabilities of the conjunction \(Toothache \land Catch\) for each value of \(Cav\), i.e., diagnostic knowledge of all combinations of symptoms in the general case. This might be possible for just two evidence variables, but does not scale up.
As it would reduce the complexity of the inference problem, it would be nice if \(Toothache\) and \(Catch\) were independent, but they are not: if a probe catches in the tooth, the tooth probably has a cavity, which probably causes toothache.
However, given that we know whether the tooth has a cavity, these variables are independent, as each is directly caused by the cavity, but neither has a direct effect on the other[^9]:
\[ \begin{flalign} P(Toothache \land Catch | Cav) = P(Toothache | Cav) P(Catch | Cav) && (2.1) \end{flalign} \]
The general definition of conditional independence of two variables X and \(Y\) given a third variable Z is:
\[ \begin{flalign} P(X,Y|Z) = P(X|Z)P(Y|Z) && \end{flalign} \]
In the dentist example, it seems reasonable to assert conditional independence of the variables \(Toothache\) and \(Catch\) given \(Cav\), thus we can insert equation (2.1) into (2.0) leading to (2.2)
\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache \land Catch|Cav)P(Cav)} && (2.0) \end{flalign} \]
\[ \begin{flalign} P(Toothache \land Catch | Cav) = P(Toothache | Cav) P(Catch | Cav) && (2.1) \end{flalign} \]
\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache | Cav) P(Catch | Cav)P(Cav)} && (2.2) \end{flalign} \]
Multiple evidence can be reduced to prior probabilities and conditional probabilities (assuming conditional independence).
The general combination rule, if \(Z_1\) and \(Z_2\) are conditionally independent given \(C\), is
\[ \begin{flalign} P(C|Z_1, Z_2) = \alpha P(C) P(Z_1|C) P(Z_2|C) && \end{flalign} \]
where \(\alpha\) is the normalization constant.
The generalization is called recursive Bayesian updating:
\[ \begin{flalign} P(Cause|e) = \alpha P(Cause) \prod_j P(e_j|Cause) && \end{flalign} \]
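Recursive Bayesian updating can be sketched as folding in one piece of evidence at a time and renormalizing. The priors below match the dentist example; the likelihoods for the probe catching are assumed for illustration:

```python
# Recursive Bayesian updating: multiply the current belief by the
# likelihood of each new piece of evidence, then renormalize.
def update(belief, likelihood):
    """belief: {cause: P(cause | evidence so far)}; likelihood: {cause: P(e_j | cause)}."""
    unnorm = {c: belief[c] * likelihood[c] for c in belief}
    alpha = 1 / sum(unnorm.values())  # normalization constant
    return {c: alpha * p for c, p in unnorm.items()}

belief = {"cavity": 0.1, "no_cavity": 0.9}                         # priors
belief = update(belief, {"cavity": 0.4, "no_cavity": 0.01 / 0.9})  # evidence: toothache
print(round(belief["cavity"], 2))  # 0.8
belief = update(belief, {"cavity": 0.9, "no_cavity": 0.2})         # evidence: probe catch (assumed)
```

After the toothache evidence, the belief matches \(P(Cavity|Toothache)=0.8\) from the joint table; the assumed probe-catch likelihoods then raise it further.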
Variables can be discrete or continuous
Discrete variables
Continuous variables
Imagine the following task: given a text, decide which of a predefined set of classes or categories it belongs to.
Example sentences:
The task is to classify each sentence into a Category—the major sections of the newspaper ([news, sports, business, weather, entertainment]).
The naive Bayes model consists of
For each category \(c\), \(P(Category=c)\) is estimated as the fraction of all previously seen documents that are of category \(c\)[^10].
Similarly, \(P(HasWord_i|Category)\) is estimated as the fraction of documents of each category that contain word \(i\)[^11].
To categorize a new document, we check which key words appear in the document and then apply recursive Bayesian updating to obtain the posterior probability distribution over categories:
\[ \begin{flalign} P(Cause|e) = \alpha P(Cause) \prod_j P(e_j|Cause) && \end{flalign} \]
The category with the highest posterior probability is taken.
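A toy version of this classifier can be sketched as follows. All priors, keywords, and likelihoods are invented for illustration (only the 0.37 for “stocks” in business articles echoes the estimate mentioned above):

```python
# Toy naive Bayes text classifier: priors P(Category) and per-category
# word likelihoods P(HasWord_i | Category). All numbers are illustrative.
priors = {"news": 0.4, "sports": 0.3, "business": 0.2,
          "weather": 0.09, "entertainment": 0.01}
word_given_cat = {
    "news":          {"stocks": 0.05, "game": 0.05, "rain": 0.05},
    "sports":        {"stocks": 0.01, "game": 0.60, "rain": 0.05},
    "business":      {"stocks": 0.37, "game": 0.02, "rain": 0.03},
    "weather":       {"stocks": 0.01, "game": 0.01, "rain": 0.70},
    "entertainment": {"stocks": 0.01, "game": 0.10, "rain": 0.02},
}

def classify(words_present):
    scores = {}
    for cat, prior in priors.items():
        score = prior
        for w, p_w in word_given_cat[cat].items():
            # multiply P(word|cat) if the word is present, 1 - P(word|cat) if absent
            score *= p_w if w in words_present else 1 - p_w
        scores[cat] = score
    alpha = 1 / sum(scores.values())  # normalization constant
    posterior = {c: alpha * s for c, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

best, posterior = classify({"stocks"})
print(best)  # business
```

The document containing “stocks” gets the highest posterior for the business category, as one would hope.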
Show from the definition of conditional probability that \(P(a | b \land a) = 1\).
Would it be rational for an agent to hold the three beliefs \(P(A)=0.4\), \(P(B)=0.3\), and \(P(A \lor B)=0.5\)? If so, what range of probabilities would be rational for the agent to hold for \(A \land B\)? Make up a table like the one in Table 1, and show how it supports your argument about rationality.
Consider two medical tests, A and B, for a virus. Test A is 95% effective at recognizing the virus when it is present, but has a 10% false positive rate (indicating that the virus is present, when it is not). Test B is 90% effective at recognizing the virus, but has a 5% false positive rate. The two tests use independent methods of identifying the virus. The virus is carried by 1% of all people.
Say that a person is tested for the virus using only one of the tests, and that test comes back positive for carrying the virus. Which test returning positive is more indicative of someone really carrying the virus? Justify your answer mathematically.
Show that the statement of conditional independence
\[ \begin{flalign} P(A,B|C)=P(A|C)P(B|C) && \end{flalign} \]
is equivalent to each of the statements
\[ \begin{flalign} P(A|B,C)=P(A|C) \quad \textrm{and} \quad P(B|A,C)=P(B|C) && \end{flalign} \]
Pacman has developed a hobby of fishing. Over the years, he has learned that a day can be considered fit or unfit for fishing (\(Y\)), which influences three features: whether or not Ms. Pacman shows up (\(M\)), the temperature of the day (\(T\)), and how high the water level is (\(W\)). Pacman models this as a Naive Bayes classification problem.
We wish to calculate the probability a day is fit for fishing given features of the day. Consider the conditional probability tables that Pacman has estimated over the years:
| \(Y\) | \(P(Y)\) |
|---|---|
| yes | 0.1 |
| no | 0.9 |
| \(M\) | \(Y\) | \(P(M\mid Y)\) |
|---|---|---|
| yes | yes | 0.5 |
| no | yes | 0.5 |
| yes | no | 0.2 |
| no | no | 0.8 |
| \(W\) | \(Y\) | \(P(W\mid Y)\) |
|---|---|---|
| high | yes | 0.1 |
| low | yes | 0.9 |
| high | no | 0.5 |
| low | no | 0.5 |
| \(T\) | \(Y\) | \(P(T\mid Y)\) |
|---|---|---|
| cold | yes | 0.2 |
| warm | yes | 0.3 |
| hot | yes | 0.5 |
| cold | no | 0.5 |
| warm | no | 0.2 |
| hot | no | 0.3 |
Using the method of Naive Bayes, if Ms. Pacman is available, the weather is cold, and the water level is high, do we predict that the day is fit for fishing?
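To check your answer, the naive Bayes scores for the observed day can be computed by multiplying the prior \(P(Y)\) by the three conditionals from the tables above:

```python
# Naive Bayes score for the observed day (M=yes, T=cold, W=high):
# P(Y) * P(M=yes|Y) * P(T=cold|Y) * P(W=high|Y), for Y in {yes, no}.
p_y = {"yes": 0.1, "no": 0.9}
p_m_given_y = {"yes": 0.5, "no": 0.2}   # P(M=yes | Y)
p_t_given_y = {"yes": 0.2, "no": 0.5}   # P(T=cold | Y)
p_w_given_y = {"yes": 0.1, "no": 0.5}   # P(W=high | Y)

score = {y: p_y[y] * p_m_given_y[y] * p_t_given_y[y] * p_w_given_y[y]
         for y in ("yes", "no")}
prediction = max(score, key=score.get)  # class with the higher unnormalized score
```

Normalization is unnecessary here, since only the argmax over \(Y\) is needed.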
[^1]: See the qualification problem.
[^2]: Medical science has no complete theory for the domain.
[^3]: For instance, the coincidence of having a toothache and a cavity that are unrelated, or the fact that not all tests have been run.
[^4]: Variables in probability theory are called random variables.
[^5]: Represents the belief that the temperature at noon is distributed uniformly between 18 and 26 degrees Celsius.
[^6]: \(P(a \land b) = P(a|b)P(b)\) equals \(P(a \land b) = P(b|a)P(a)\).
[^7]: Factoring large joint distributions into smaller joint distributions, using absolute independence. For instance, weather and dental problems are independent, so they can be split.
[^8]: If there is a cavity epidemic and \(P(Cavity)\) increases, \(P(Toothache | Cavity)\) does not change, but \(P(Toothache)\) and \(P(Cavity | Toothache)\) will change proportionally.
[^9]: Toothache depends on the state of the nerves in the tooth, whereas the probe’s accuracy depends primarily on the dentist’s skill, to which the toothache is irrelevant.
[^10]: For example, if 9% of the articles are about weather, we set \(P(Category=weather)=0.09\).
[^11]: For example, if 37% of articles about business contain word 6, “stocks,” then \(P(HasWord_6=true|Category=business)\) is set to 0.37.