Introduction
Qualification problem
Logic is useful, but conclusions often need to be drawn under uncertainty, because:
- the agent’s knowledge of the world is incomplete (not enough information) or uncertain (sensors are unreliable);
- the agent must consider every possible explanation for its percepts (no matter how unlikely);
- this leads to a large belief-state full of unlikely possibilities and arbitrarily large contingent plans; and
- rules about the world are often incomplete (e.g., are all preconditions for an action known?) or even incorrect.
In philosophy and AI, the qualification problem is concerned with the impossibility of listing all the preconditions required for a real-world action to have its intended effect.
As we have learned, there is no complete solution to this problem within logic: often, not all the conditions an action needs to achieve its intended effect can be known, and even when they can, enumerating them leads to a large belief-state full of unlikely possibilities. System designers must therefore use good judgment in deciding how much detail to specify in their model and which details to omit.
However, there is good news: probability theory allows all these exceptions to be grouped together without naming them explicitly.
Example
Goal: Be in Ulm at 8:15 to give a lecture
There are several plans that achieve the goal:
- \(P_1:\) Get up at 6:00, take the bike, arrive at 7:30, take a shower, …
- \(P_2:\) Get up at 6:30, take the car at 7:00, arrive at 7:45, …
- …
All these plans are correct, but they imply different costs and different probabilities of actually achieving the goal.
In the end, \(P_2\) is the plan of choice, as the success rate of \(P_1\) is only 80%.
Uncertainty in rules
Take an expert dental diagnosis system as an example.
\(Toothache \implies Cavity\)
→ This rule is incorrect, better:
\(Toothache \implies Cavity \lor GumProblem \lor Abscess \dots\)[^2]
→ However, we don’t know all the causes[^1]
Perhaps a causal rule is better?
\(Cavity \implies Toothache\)
→ This is still wrong[^3] and does not allow us to reason from symptoms to causes.
Learnings
→ Without perfect knowledge (i.e., “the rules of the game”), logical rules do not help much
Uncertainty in facts
Let us suppose we want to support the localization of a robot with (fixed) landmarks. When a landmark is perceived, we can narrow down the area the robot may be in.
Problem: sensors can be imprecise
- From the fact that a landmark was perceived, we cannot conclude with certainty that the robot is at that location
- The same is true when no landmark is perceived
- Only the probability increases or decreases
Probability theory
- We (and other agents) are convinced by facts and rules only up to a certain degree
- One possibility for expressing the degree of belief is to use probabilities
- The agent is 90% (or 0.9) convinced by its sensor information
(in 9 out of 10 cases, the agent believes, the information is correct)
- Probabilities summarize the “uncertainty” that stems from a lack of knowledge
- Probabilities are not to be confused with vagueness
The predicate tall is vague; the statement, “A man is 1.75–1.80m tall” is uncertain.
Decisions under uncertainty
Decision theory
= utility theory + probability theory
- We have a choice of actions (or plans)
- These can lead to different solutions with different probabilities
- The actions have different (subjective) costs
- The results have different (subjective) utilities
It would be rational to choose the action with the maximum expected utility (MEU): the “average”, or “statistical mean”, of the outcome utilities, weighted by the probability of each outcome, minus the costs of the actions leading to it.
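To make this concrete, here is a minimal MEU sketch in Python for the lecture example above. Only the 80% success rate of \(P_1\) comes from the text; all other numbers (costs, the success rate of \(P_2\), outcome utilities) are illustrative assumptions.

```python
# Sketch of choosing the action with maximum expected utility (MEU).
# All numbers except P1's 80% success rate are illustrative assumptions.
U_SUCCESS, U_FAILURE = 100.0, 0.0   # assumed utilities of the two outcomes

plans = {
    "P1_bike": {"p_success": 0.80, "cost": 2.0},   # cost values are made up
    "P2_car":  {"p_success": 0.95, "cost": 5.0},
}

def expected_utility(plan):
    """Probability-weighted mean of outcome utilities, minus the action cost."""
    p = plan["p_success"]
    return p * U_SUCCESS + (1 - p) * U_FAILURE - plan["cost"]

best = max(plans, key=lambda name: expected_utility(plans[name]))
print({name: expected_utility(p) for name, p in plans.items()})
print("MEU choice:", best)  # 'P2_car' under these assumptions
```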
Unconditional probabilities
\(P(A)\) denotes the unconditional probability or prior probability that \(A\) is true in the absence of any other information, e.g. \(P(Cavity) = 0.1\)
- Prior probabilities can be obtained from statistical analysis or general rules
- A random variable[^4] can take on values from some range—the set of possible values (e.g., numbers, Booleans, arbitrary tokens), e.g. \(P(Weather=sunny)=0.6\)
- Logical connectors can be used to build probabilistic propositions, e.g. \(P(Cavity \land \neg Insured) = 0.06\)
- Propositions can contain equations over random variables, e.g. \(P(NoonTemp=x) = \textrm{Uniform}(x;18C;26C)\)[^5], usually called a probability density function
Conditional probabilities
New information (usually called evidence) can change the probability, e.g. the probability of a cavity increases if we know the patient has a toothache
\(P(a|b)\) is the conditional probability of \(a\) given that all we know is \(b\)
Conditional probabilities (or posterior probabilities) are defined in terms of unconditional probabilities as follows:
For any propositions \(a\) and \(b\) (if \(P(b)>0\)), we have
\[ \begin{flalign} P(a|b) = \frac{P(a \land b)}{P(b)} = \frac{P(a, b)}{P(b)}&& \end{flalign} \]
Or, in a different form, as the product rule:
\(P(a \land b) = P(a,b) = P(a|b)P(b) = P(b|a)P(a)\)[^6]
Example
The product rules for all possible values of Weather and Cavity can be written as a single equation:
\(P(Weather,Cavity)=P(Weather|Cavity)P(Cavity)\)
which corresponds to a system of equations (using abbreviations W and C):
\(P(W=sun,C=true) = P(W=sun|C=true)P(C=true)\)
\(P(W=rain,C=true) = P(W=rain|C=true)P(C=true)\)
…
\(P(W=snow,C=false) = P(W=snow|C=false)P(C=false)\)
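A small Python sketch of this system of equations. The prior \(P(Cavity)=0.1\) appears earlier in the text; the weather conditionals are assumed for illustration (both rows use the same numbers, since weather and dental problems are plausibly independent).

```python
# Product rule: P(W=w, C=c) = P(W=w | C=c) * P(C=c),
# one equation per combination of values.
p_cavity = {True: 0.1, False: 0.9}
p_weather_given_cavity = {  # assumed values; each row sums to 1
    True:  {"sun": 0.6, "rain": 0.1, "cloud": 0.29, "snow": 0.01},
    False: {"sun": 0.6, "rain": 0.1, "cloud": 0.29, "snow": 0.01},
}

joint = {(w, c): p_weather_given_cavity[c][w] * p_cavity[c]
         for c in p_cavity for w in p_weather_given_cavity[c]}

print(joint[("sun", True)])    # P(W=sun, C=true)   = 0.6  * 0.1 = 0.06
print(joint[("snow", False)])  # P(W=snow, C=false) = 0.01 * 0.9 = 0.009
```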
Marginalization
Marginalization in probability refers to “summing out” the probability of a random variable \(X\) given its joint probability distribution with other variable(s).
For any sets of variables \(X\) and \(Z\) we have
\[ \begin{flalign} P(X=x) = \sum_{z} P(X=x,Z=z) && \end{flalign} \]
To find \(P(X=x)\), we sum all the probability values where \(X=x\) occurs with all the possible values of Z.
We can abbreviate \(P(X=x)\) by \(P(x)\) and \(P(X=x,Z=z)\) by \(P(x,z)\).
Using the product rule, we can replace \(P(x,z)\) by \(P(x|z)P(z)\):
\[ \begin{flalign} P(x) = \sum_{z} P(x|z)P(z) && \end{flalign} \]
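A sketch of marginalization over a small joint distribution. The joint table below is assumed for illustration; its entries sum to 1.

```python
# Marginalization: P(X=x) = sum over z of P(X=x, Z=z).
# Toy joint distribution P(Weather, Cavity); the numbers are assumptions.
joint = {
    ("sun",  True): 0.06, ("sun",  False): 0.54,
    ("rain", True): 0.04, ("rain", False): 0.36,
}

def marginal(joint, x):
    """Sum out Z: add up all joint entries in which X = x."""
    return sum(p for (xv, _z), p in joint.items() if xv == x)

print(marginal(joint, "sun"))   # 0.06 + 0.54 = 0.6
print(marginal(joint, "rain"))  # 0.04 + 0.36 = 0.4
```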
Probability axioms
The basic axioms of probability theory say that every possible world has a probability between 0 and 1 and that the total probability of the set of possible worlds is 1:
\(0 \leq P(\omega) \leq 1 \ \textrm{for every } \omega \in \Omega \quad \textrm{and} \quad \sum_{\omega \in \Omega}P(\omega)=1\)
The probability of a disjunction can be derived by the following formula, sometimes called the inclusion-exclusion principle:
\(P(a \lor b) = P(a) + P(b) - P(a \land b)\)
All other properties can be derived from these axioms, e.g.
\(P(\neg A) = 1 - P(A)\)
follows from \(P(A \lor \neg A) = 1\) and \(P(A \land \neg A) = 0\)
Joint probability
Probabilities can be assigned to every proposition; an atomic event is an assignment of values to all random variables (i.e., a complete specification of a state)
Example: Let \(X\) and \(Y\) be Boolean variables, leading to the following four atomic events: \(X \land Y, \neg X \land Y, X \land \neg Y, \neg X \land \neg Y\)
The joint probability distribution \(P(x_1,...,x_n)\) assigns a probability to every atomic event.
|  | Toothache | ¬Toothache |
|---|---|---|
| Cavity | 0.04 | 0.06 |
| ¬Cavity | 0.01 | 0.89 |
Since all atomic events are disjoint, the sum of all fields is 1.
Working with joint probabilities
All relevant probabilities can be computed using the joint probability by expressing them as a disjunction of atomic events. Example:
\[ \begin{align} P(Cavity \lor Toothache) & = P(Cavity \land Toothache) \\ &+ P(\neg Cavity \land Toothache) \\ &+ P(Cavity \land \neg Toothache) \end{align} \]
Unconditional probabilities are obtained by adding across a row or column:
\(P(Cavity) = P(Cavity \land Toothache) + P(Cavity \land \neg Toothache)\)
Conditional probabilities are obtained using the product rule:
\(P(Cavity|Toothache) = \frac{P(Cavity \land Toothache)}{P(Toothache)}=\frac{0.04}{0.04+0.01}=0.80\)
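These computations are easy to reproduce; here is a sketch in Python using the joint distribution from the table above:

```python
# Joint distribution from the Toothache/Cavity table above.
joint = {
    ("cavity",    "toothache"): 0.04, ("cavity",    "no_toothache"): 0.06,
    ("no_cavity", "toothache"): 0.01, ("no_cavity", "no_toothache"): 0.89,
}

# Marginals: add across a row or column.
p_cavity    = joint[("cavity", "toothache")] + joint[("cavity", "no_toothache")]  # 0.10
p_toothache = joint[("cavity", "toothache")] + joint[("no_cavity", "toothache")]  # 0.05

# Disjunction as a sum of atomic events (or via inclusion-exclusion).
p_cavity_or_toothache = p_cavity + p_toothache - joint[("cavity", "toothache")]   # 0.11

# Conditional probability via the product rule.
p_cavity_given_toothache = joint[("cavity", "toothache")] / p_toothache           # 0.80
print(p_cavity, p_toothache, p_cavity_or_toothache, p_cavity_given_toothache)
```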
Problems with joint probabilities
We can easily obtain all probabilities from the joint probability. The joint probability, however, involves \(k^n\) values, if there are \(n\) random variables with \(k\) values.
→ Implies difficulties of representation and assessment
Questions:
- Is there a more compact way of representing joint probabilities?
- Is there an efficient method to work with this representation?
Not in general, but it can work in many cases. Modern systems work directly with conditional probabilities and make assumptions about the independence of variables[^7] in order to simplify calculations.
Bayes’ rule
Derivation
We already know the product rule that can be written in two forms:
\(P(a \land b) = P(a|b) P(b) \quad \textrm{and} \quad P(a \land b) = P(b|a) P(a)\)
By equating the right-hand sides and dividing by \(P(a)\), we get
\(P(b|a) = \frac{P(a|b)P(b)}{P(a)}\quad\) (known as Bayes’ rule)
For multi-valued variables (set of equalities):
\(P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}\)
Generalization (conditioning on background evidence \(e\)):
\(P(Y|X,e) = \frac{P(X|Y,e) P(Y|e)}{P(X|e)}\)
Value of Bayes’ rule
Bayes’ rule only allows us to compute the single term \(P(b|a)\) in terms of three terms:
\(P(a|b)\), \(P(b)\), and \(P(a)\)
That is useful in practice as there are many cases where we do have good probability estimates for these three numbers and need to compute the fourth
Often, we perceive as evidence the effect of some known cause and we would like to determine that cause. In that case, Bayes’ rule becomes:
\(P(cause|\mathit{effect}) = \frac{P(\mathit{effect}|cause)P(cause)}{P(\mathit{effect})}\)
Example
\[ \begin{align} P(Toothache | Cavity) = {} & 0.4 \\ P(Cavity) = {} & 0.1 \\ P(Toothache) = {} & 0.05 \\ P(Cavity | Toothache) = {} & \frac{0.4 \times 0.1}{0.05} = 0.8 \end{align} \]
- Here, only information that quantifies the relationship between Cavity and Toothache in the causal direction is given; the diagnosis is derived by applying Bayes’ rule
- If there is quantitative information in the diagnostic direction, from symptoms to causes, there is no need to use Bayes’ rule
- However, causal knowledge is more robust than diagnostic knowledge (imagine, e.g., a cavity epidemic[^8])
Relative probability
Often we have to consider multiple possible causes. For example, a dentist would also like to consider the probability that the patient has a gum disease.
\[\begin{align} P(Toothache | GumDisease) = {} & 0.7 \\ P(GumDisease) = {} & 0.02 \end{align}\]
Which diagnosis is more probable?
\(P(c|t) = \frac{P(t | c)P(c)}{P(t)} \quad\) or \(\quad P(g | t) = \frac{P(t | g)P(g)}{P(t)}\)
If we are only interested in the relative probability, we need not assess \(P(t)\):
\(\frac{P(c|t)}{P(g|t)} = \frac{P(t|c)P(c)}{P(t)} \frac{P(t)}{P(t|g)P(g)} = \frac{P(t|c)P(c)}{P(t|g)P(g)} = \frac{0.4 \times 0.1}{0.7 \times 0.02} = 2.857\)
→ Important for excluding possible diagnoses
Normalization
If we wish to determine the absolute probability of \(P(c | t)\) and we do not know \(P(t)\), we can also carry out a complete case analysis (e.g. for \(c\) and \(\neg c\))
We would use the fact that
\(P(c | t) + P(\neg c | t) = 1\) (here \(c\) and \(t\) are boolean variables)
\[ \begin{flalign} P(c | t) = {} & \frac{P(t|c)P(c)}{P(t)} && (1.1) \\ P(\neg c | t) = {} & \frac{P(t|\neg c)P(\neg c)}{P(t)} && (1.2) \\ P(c | t) + P(\neg c | t) = {} & \frac{P(t|c)P(c)}{P(t)} + \frac{P(t|\neg c)P(\neg c)}{P(t)} && (1.3) \\ P(t) = {} & P(t|c) P(c) + P(t|\neg c) P(\neg c) && (1.4) \end{flalign} \]
By substituting equation (1.4) into the first equation (1.1), we get:
\[ \begin{flalign} P(c|t) = {} & \frac{P(t|c)P(c)}{P(t|c)P(c)+P(t|\neg c)P(\neg c)} && (1.5) \end{flalign} \]
Example
Your doctor tells you that you have tested positive for a serious but rare (1/10000) disease (\(D\)). The test (\(T\)) is 99% accurate (1% false positive and 1% false negative results). What does this mean for you?
\[ \begin{flalign} P(D|T) = \frac{P(T|D) P(D)}{P(T)} = \frac{P(T|D) P(D)}{P(T|D) P(D) + P(T| \neg D) P( \neg D)} && \end{flalign} \]
Here, marginalization over the Boolean variable \(D\) is applied to the denominator of Bayes’ rule.
\[ \begin{flalign} P(D) = 0.0001 \quad P(T|D) = 0.99 \quad P(T | \neg D) = 0.01 && \end{flalign} \]
\[ \begin{flalign} P(D|T) = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} = \frac{0.000099}{0.010098} \approx 0.01 && \end{flalign} \]
Bottom line: if the inaccuracy of the test is much greater than the frequency of occurrence of the disease, then a positive result is not as threatening as one might think.
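A quick numerical check of this calculation, as a sketch in Python (the variable names are mine):

```python
# Rare-disease example: P(D | T) via case analysis over D (equation 1.5).
p_d       = 0.0001   # P(D): prior probability of the disease
p_t_d     = 0.99     # P(T | D): true positive rate
p_t_not_d = 0.01     # P(T | ~D): false positive rate

evidence = p_t_d * p_d + p_t_not_d * (1 - p_d)   # P(T) by marginalization
print(p_t_d * p_d / evidence)                     # ~0.0098, i.e. about 1%
```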
Normalizing constant
For random variables with multiple values:
\[ \begin{flalign} P(Y|X) = {} & \frac{P(X|Y)P(Y)}{P(X)} && \\ \end{flalign} \]
\[ \begin{flalign} \alpha = \frac{1}{P(X)} = \frac{1}{\sum_i P(X|Y_i)P(Y_i)} && \\ \end{flalign} \]
Since the unknown factor \(P(X)\) is often hard to assess directly, the normalization constant \(\alpha\) is instead determined (exactly or approximately) from the requirement that the resulting distribution sums to 1. We get:
\[ \begin{flalign} P(Y|X) = {} & \alpha P(X|Y) P(Y) && \end{flalign} \]
Generalization
Let \(e\) be the observed value of an evidence variable \(E\), and let \(Y\) be the remaining unobserved variables.
The query is \(P(X|e)\) and can be evaluated as
\[ \begin{flalign} P(X|e) = \frac{P(X,e)}{P(e)} = \alpha P(X,e) = \alpha \sum_y P(X,e,y) && \end{flalign} \]
Since \(e\) is known, the factor \(1 / P(e)\) is the same for all values of \(X\).
In fact, it can be viewed as a normalization constant for the distribution \(P(X|e)\), ensuring that it adds up to 1.
The summation is over all possible \(y\)s (i.e., all possible combinations of values of the unobserved variables \(Y\)). Notice that together the variables \(X\), \(E\), and \(Y\) constitute the complete set of variables for the domain, so \(P(X,e,y)\) is simply a subset of probabilities from the full joint distribution.
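A sketch of this query-by-enumeration scheme over a toy full joint distribution for the dental domain; the probability values below are illustrative assumptions.

```python
# P(X | e) = alpha * sum_y P(X, e, y), summing out the unobserved variables.
# Toy full joint over (Cavity, Toothache, Catch); the numbers are assumed.
joint = {
    (True,  True,  True): 0.108, (True,  True,  False): 0.012,
    (True,  False, True): 0.072, (True,  False, False): 0.008,
    (False, True,  True): 0.016, (False, True,  False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def cavity_given_toothache(joint, toothache=True):
    """P(Cavity | Toothache=toothache), summing out the unobserved Catch."""
    unnorm = {c: sum(p for (cav, t, _catch), p in joint.items()
                     if cav == c and t == toothache)
              for c in (True, False)}
    alpha = 1.0 / sum(unnorm.values())   # 1 / P(e), the normalization constant
    return {c: alpha * p for c, p in unnorm.items()}

print(cavity_given_toothache(joint))  # {True: 0.6, False: 0.4} (up to rounding)
```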
Multiple evidence
A dentist’s probe catches in the aching tooth of a patient.
Using Bayes’ rule, we can calculate:
\[ \begin{flalign} P(Cav | Catch) = 0.95 && \end{flalign} \]
But how does the combined evidence help?
With the help of Bayes’ rule, the dentist could conclude:
\[ \begin{flalign} P(Cav | Toothache \land Catch) = \frac{P(Toothache \land Catch|Cav)P(Cav)}{P(Toothache \land Catch)} && \end{flalign} \]
\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache \land Catch|Cav)P(Cav)} && (2.0) \end{flalign} \]
Problem: The dentist needs to know the conditional probabilities of the conjunction \(Toothache \land Catch\) for each value of \(Cav\), i.e., diagnostic knowledge of all combinations of symptoms in the general case. This might be possible for just two evidence variables, but does not scale up.
As it would reduce the complexity of the inference problem, it would be nice if \(Toothache\) and \(Catch\) were independent, but they are not: if a probe catches in the tooth, the tooth probably has a cavity, and the cavity probably causes toothache.
However, the variables are independent given whether or not the tooth has a cavity: each is directly caused by the cavity, but neither has a direct effect on the other[^9]:
\[ \begin{flalign} P(Toothache \land Catch | Cav) = P(Toothache | Cav) P(Catch | Cav) && (2.1) \end{flalign} \]
Conditional independence
The general definition of conditional independence of two variables X and \(Y\) given a third variable Z is:
\[ \begin{flalign} P(X,Y|Z) = P(X|Z)P(Y|Z) && \end{flalign} \]
In the dentist example, it seems reasonable to assert conditional independence of the variables \(Toothache\) and \(Catch\) given \(Cav\), thus we can insert equation (2.1) into (2.0) leading to (2.2)
\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache \land Catch|Cav)P(Cav)} && (2.0) \end{flalign} \]
\[ \begin{flalign} P(Toothache \land Catch | Cav) = P(Toothache | Cav) P(Catch | Cav) && (2.1) \end{flalign} \]
\[ \begin{flalign} P(Cav | Toothache \land Catch) = \alpha{P(Toothache | Cav) P(Catch | Cav)P(Cav)} && (2.2) \end{flalign} \]
Naive Bayes models
The dentistry example illustrates a commonly occurring pattern in which a single cause directly influences a number of effects, all of which are conditionally independent, given the cause.
The full joint distribution can be written as
\[ \begin{flalign} P(Cause, \mathit{Effect_1},..., \mathit{Effect_n}) = P(Cause) \prod_{i=1}^{n} P(\mathit{Effect_i}|Cause) && \end{flalign} \]
Such a probability distribution is called a naive Bayes model—“naive” because it is often used (as a simplifying assumption) in cases where the “effect” variables are not strictly independent given the “cause” variable.
In practice, naive Bayes systems often work very well, even when the conditional independence assumption is not strictly true.
Recursive Bayesian updating
Multiple evidence can be reduced to prior probabilities and conditional probabilities (assuming conditional independence).
The general combination rule, if \(Z_1\) and \(Z_2\) are independent given \(C\), is
\[ \begin{flalign} P(C|Z_1, Z_2) = \alpha P(C) P(Z_1|C) P(Z_2|C) && \end{flalign} \]
where \(\alpha\) is the normalization constant.
The generalization is called recursive Bayesian updating:
\[ \begin{flalign} P(Cause|e) = \alpha P(Cause) \prod_j P(e_j|Cause) && \end{flalign} \]
Reinterpreting this equation in words: for each possible cause, multiply the prior probability of the cause by the product of the conditional probabilities of the observed effects given the cause; then normalize the result.
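In code, this update can literally be done one piece of evidence at a time. A minimal sketch follows; the dental numbers are assumptions.

```python
# Recursive Bayesian updating: fold in one conditionally independent
# piece of evidence at a time, normalizing after each step.
def update(prior, likelihoods, evidence):
    """prior: {cause: P(cause)}; likelihoods: {cause: {e: P(e | cause)}}."""
    posterior = dict(prior)
    for e in evidence:
        posterior = {c: posterior[c] * likelihoods[c][e] for c in posterior}
        z = sum(posterior.values())                  # normalization constant
        posterior = {c: p / z for c, p in posterior.items()}
    return posterior

# Assumed numbers for the dental example.
prior = {"cavity": 0.2, "no_cavity": 0.8}
likelihoods = {
    "cavity":    {"toothache": 0.6, "catch": 0.9},
    "no_cavity": {"toothache": 0.1, "catch": 0.2},
}
print(update(prior, likelihoods, ["toothache", "catch"]))  # cavity ~ 0.87
```

Note that normalizing after each piece of evidence or only once at the end gives the same result; the per-step normalization is what makes the updating “recursive”.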
Types of variables
Variables can be discrete or continuous
Discrete variables
- \(Weather\): sunny, rain, cloudy, snow
- \(Cavity\): true, false (boolean)
Continuous variables
- Tomorrow’s maximum temperature in Berkeley
- Domain can be the entire real line or any subset
- Distributions for continuous variables are typically given by probability density functions
Text classification
Imagine the following task: given a text, decide to which class or category from a predefined set it belongs.
- Cause is the \(Category\) variable
- Effect variables are the presence or absence of certain key words, \(HasWord_i\)
Example sentences:
- Stocks rallied on Monday, with major indexes gaining 1% as optimism persisted over the first quarter earnings season.
- Heavy rain continued to pound much of the east coast on Monday, with flood warnings issued in New York City and other locations.
The task is to classify each sentence into a \(Category\): the major sections of the newspaper (news, sports, business, weather, entertainment).
The naive Bayes model consists of
- the prior probabilities \(P(Category)\) and
- the conditional probabilities \(P(HasWord_i|Category)\)
For each category \(c\), \(P(Category=c)\) is estimated as the fraction of all previously seen documents that are of category \(c\)[^10].
Similarly, \(P(HasWord_i|Category)\) is estimated as the fraction of documents of each category that contain word \(i\)[^11].
To categorize a new document, we check which key words appear in the document and then apply recursive Bayesian updating to obtain the posterior probability distribution over categories:
\[ \begin{flalign} P(Cause|e) = \alpha P(Cause) \prod_j P(e_j|Cause) && \end{flalign} \]
The category with the highest posterior probability is taken.
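A minimal sketch of such a classifier. Apart from the two values taken from the footnotes (\(P(Category=weather)=0.09\) and \(P(HasWord_{stocks}|business)=0.37\)), all priors and word likelihoods are illustrative assumptions; for brevity, only the observed keywords are folded in, not the absence of the others.

```python
import math

# Priors over categories (0.09 for weather from the footnote; rest assumed).
priors = {"news": 0.26, "sports": 0.25, "business": 0.30,
          "weather": 0.09, "entertainment": 0.10}

# Assumed P(HasWord_i = true | Category), one entry per keyword
# (0.37 for "stocks" given business is taken from the footnote).
likelihoods = {
    "news":          {"stocks": 0.05, "rain": 0.10, "earnings": 0.05},
    "sports":        {"stocks": 0.02, "rain": 0.05, "earnings": 0.02},
    "business":      {"stocks": 0.37, "rain": 0.01, "earnings": 0.30},
    "weather":       {"stocks": 0.01, "rain": 0.40, "earnings": 0.01},
    "entertainment": {"stocks": 0.01, "rain": 0.03, "earnings": 0.02},
}

def classify(words):
    """Posterior over categories for the observed keywords (log space)."""
    scores = {c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
              for c in priors}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

posterior = classify(["stocks", "earnings"])
print(max(posterior, key=posterior.get))  # 'business'
```

Working in log space avoids numerical underflow when the probabilities of many words are multiplied together.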
The naive Bayes model assumes that words occur independently in documents, with frequencies determined by the document category. This independence assumption is clearly violated in practice. For example, the phrase “first quarter” occurs more frequently in business (or sports) articles than would be suggested by multiplying the probabilities of “first” and “quarter.” The violation of independence usually means that the final posterior probabilities will be much closer to 1 or 0 than they should be; in other words, the model is overconfident in its predictions. On the other hand, even with these errors, the ranking of the possible categories is often quite accurate.

Naive Bayes models are widely used for language determination, document retrieval, spam filtering, and other classification tasks. For tasks such as medical diagnosis, where the actual values of the posterior probabilities really matter, for example in deciding whether to perform an appendectomy, one would usually prefer to use more sophisticated models such as Bayesian networks (Russell and Norvig 2022, 422).
Summary
- Uncertainty is unavoidable in complex, dynamic worlds in which agents are ignorant.
- Probabilities express the agent’s inability to reach a definite decision. They summarize the agent’s beliefs.
- Conditional and unconditional probabilities can be formulated over propositions.
- Bayes’ rule allows us to calculate unknown probabilities from known probabilities.
- Multiple evidence (assuming independence) can be effectively incorporated using recursive Bayesian updating.
✏️ Exercises
Conditional probability
Show from the definition of conditional probability that \(P(a | b \land a) = 1\).
Open only if you need help.
The definition of conditional probability states that the conditional probability of A given B is the probability of the intersection of A and B divided by the probability of B.
For any propositions \(a\) and \(b\) (if \(P(b)>0\)), we have
\[ \begin{flalign} P(a|b)=\frac{P(a \land b)}{P(b)} && \end{flalign} \]
Thus:
\[ \begin{flalign} P(a|b \land a) = \frac{P(a\land(b \land a))}{P(b \land a)} && \end{flalign} \] which is equal to
\[ \begin{flalign} P(a|b \land a) = \frac{P(b \land a)}{P(b \land a)} = 1 && \end{flalign} \]
Beliefs
Would it be rational for an agent to hold the three beliefs \(P(A)=0.4\), \(P(B)=0.3\), and \(P(A \lor B)=0.5\)? If so, what range of probabilities would be rational for the agent to hold for \(A \land B\)? Make up a table like the one in Table 1, and show how it supports your argument about rationality.
Open only if you need help.
|  | B | ¬B |
|---|---|---|
| A | a | b |
| ¬A | c | d |
- \(P(A) = a + b = 0.4\)
- \(P(B) = a + c = 0.3\)
- \(P(A \lor B) = a + b + c = 0.5\)
- \(P(True) = a + b + c + d = 1\)
From these, it is straightforward to infer that a = 0.2, b = 0.2, c = 0.1, and d = 0.5.
Therefore, \(P(A \land B) = a = 0.2\). Thus the probabilities given are consistent with a rational assignment, and the probability \(P(A \land B)\) is exactly determined.
Medical tests
Consider two medical tests, A and B, for a virus. Test A is 95% effective at recognizing the virus when it is present, but has a 10% false positive rate (indicating that the virus is present, when it is not). Test B is 90% effective at recognizing the virus, but has a 5% false positive rate. The two tests use independent methods of identifying the virus. The virus is carried by 1% of all people.
Say that a person is tested for the virus using only one of the tests, and that test comes back positive for carrying the virus. Which test returning positive is more indicative of someone really carrying the virus? Justify your answer mathematically.
Open only if you need help.
Let \(V\) be the statement that the patient has the virus, and \(A\) and \(B\) the statements that the medical tests A and B returned positive, respectively. The problem statement gives:
\[ \begin{flalign} P(V ) = 0.01 && \\ P(A|V ) = 0.95 && \\ P(A| \neg V ) = 0.10 && \\ P(B| V ) = 0.90 && \\ P(B| \neg V ) = 0.05 && \\ \end{flalign} \]
The test whose positive result is more indicative of the virus being present is the one whose posterior probability, \(P(V |A)\) or \(P(V |B)\) is largest.
One can compute these probabilities directly from the information given, finding that \(P(V |A) = 0.0876\) and \(P(V |B) = 0.1538\), so B is more indicative.
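A quick numerical check, as a sketch in Python:

```python
# Posterior P(V | positive test) via the normalization trick.
def posterior(prior, tpr, fpr):
    """prior = P(V), tpr = P(test | V), fpr = P(test | ~V)."""
    return tpr * prior / (tpr * prior + fpr * (1 - prior))

print(posterior(0.01, 0.95, 0.10))  # P(V | A) ~ 0.0876
print(posterior(0.01, 0.90, 0.05))  # P(V | B) ~ 0.1538
```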
Conditional independence
Show that the statement of conditional independence
\[ \begin{flalign} P(A,B|C)=P(A|C)P(B|C) && \end{flalign} \]
is equivalent to each of the statements
\[ \begin{flalign} P(A|B,C)=P(A|C) \quad \textrm{and} \quad P(B|A,C)=P(B|C) && \end{flalign} \]
Open only if you need help.
The key to this exercise is rigorous and frequent application of the definition of conditional probability, \(P(X|Y) = \frac{P(X,Y)}{P(Y)}\).
The original statement that we are given is:
\(P(A,B|C) = P(A|C)P(B|C)\)
We start by applying the definition of conditional probability to two of the terms in this statement:
\[ \begin{flalign} P(A,B|C) = \frac{P(A,B,C)}{P(C)} \quad \textrm{and} \quad P(B|C) = \frac{P(B,C)}{P(C)} && \end{flalign} \]
Now we substitute the right-hand side of these definitions for the left-hand sides in the original statement to get:
\[ \begin{flalign} \frac{P(A,B,C)}{P(C)} = P(A|C) \frac{P(B,C)}{P(C)} && \end{flalign} \]
Now we need the definition of conditional probability once more:
\[ \begin{flalign} P(X|Y) = \frac{P(X,Y)}{P(Y)} \quad \textrm{or, equivalently,} \quad P(X,Y) = P(X|Y)P(Y) && \end{flalign} \]
With \(X = A\) and \(Y = B,C\), this gives:
\[ \begin{flalign} P(A,B,C) = P(A|B,C)P(B,C) && \end{flalign} \]
We substitute this right-hand side for \(P(A,B,C)\) to get:
\[ \begin{flalign} \frac{P(A|B,C)P(B,C)}{P(C)} = P(A|C) \frac{P(B,C)}{P(C)} && \end{flalign} \]
Finally, we cancel \(P(B,C)\) and \(P(C)\) to get:
\(P(A|B,C) = P(A|C)\)
The second part of the exercise follows by a similar derivation, or by noticing that \(A\) and \(B\) are interchangeable in the original statement (because multiplication is commutative and \(A,B\) means the same as \(B,A\)).
Pacman
Pacman has developed a hobby of fishing. Over the years, he has learned that a day can be considered fit or unfit for fishing (\(Y\)), which influences three observable features: whether or not Ms. Pacman shows up (\(M\)), the temperature of the day (\(T\)), and how high the water level is (\(W\)). Pacman models this as a naive Bayes classification problem.
We wish to calculate the probability a day is fit for fishing given features of the day. Consider the conditional probability tables that Pacman has estimated over the years:
| \(Y\) | \(P(Y)\) |
|---|---|
| yes | 0.1 |
| no | 0.9 |

| \(M\) | \(Y\) | \(P(M|Y)\) |
|---|---|---|
| yes | yes | 0.5 |
| no | yes | 0.5 |
| yes | no | 0.2 |
| no | no | 0.8 |

| \(W\) | \(Y\) | \(P(W|Y)\) |
|---|---|---|
| high | yes | 0.1 |
| low | yes | 0.9 |
| high | no | 0.5 |
| low | no | 0.5 |

| \(T\) | \(Y\) | \(P(T|Y)\) |
|---|---|---|
| cold | yes | 0.2 |
| warm | yes | 0.3 |
| hot | yes | 0.5 |
| cold | no | 0.5 |
| warm | no | 0.2 |
| hot | no | 0.3 |
Using the method of Naive Bayes, if Ms. Pacman is available, the weather is cold, and the water level is high, do we predict that the day is fit for fishing?
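If you want to check your answer, here is a sketch of the computation using only the tables above (the variable names are mine):

```python
# Naive Bayes decision: compare the unnormalized scores
# P(Y) * P(M=yes | Y) * P(T=cold | Y) * P(W=high | Y) for Y in {yes, no}.
p_y = {"yes": 0.1, "no": 0.9}
p_m = {("yes", "yes"): 0.5, ("no", "yes"): 0.5, ("yes", "no"): 0.2, ("no", "no"): 0.8}
p_w = {("high", "yes"): 0.1, ("low", "yes"): 0.9, ("high", "no"): 0.5, ("low", "no"): 0.5}
p_t = {("cold", "yes"): 0.2, ("warm", "yes"): 0.3, ("hot", "yes"): 0.5,
       ("cold", "no"): 0.5, ("warm", "no"): 0.2, ("hot", "no"): 0.3}

scores = {y: p_y[y] * p_m[("yes", y)] * p_t[("cold", y)] * p_w[("high", y)]
          for y in ("yes", "no")}
print(scores)  # {'yes': 0.001, 'no': 0.045} -> predict "no", not fit for fishing
```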
Literature
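Russell, Stuart J., and Peter Norvig. 2022. *Artificial Intelligence: A Modern Approach*. 4th ed. Pearson.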
Footnotes
[^1]: See qualification problem
[^2]: Medical science has no complete theory for the domain
[^3]: For instance, the coincidence of having a toothache and a cavity that are unrelated, or the fact that not all tests have been run
[^4]: Variables in probability theory are called random variables
[^5]: Represents the belief that the temperature at noon is distributed uniformly between 18 and 26 degrees Celsius
[^6]: \(P(a \land b) = P(a|b)P(b)\) equals \(P(a \land b) = P(b|a)P(a)\)
[^7]: Factoring large joint distributions into smaller joint distributions, using absolute independence. For instance, weather and dental problems are independent; thus, these can be split.
[^8]: If there is a cavity epidemic and \(P(Cavity)\) increases, \(P(Toothache | Cavity)\) does not change, but \(P(Toothache)\) and \(P(Cavity | Toothache)\) will change proportionally
[^9]: Toothache depends on the state of the nerves in the tooth, whereas the probe’s accuracy depends primarily on the dentist’s skill, to which the toothache is irrelevant.
[^10]: For example, if 9% of the articles are about weather, we set \(P(Category=weather)=0.09\).
[^11]: For example, if 37% of the articles about business contain word 6, “stocks”, \(P(HasWord_6=true|Category=business)\) is set to 0.37.