Definition
Learning agents are those that can improve their behavior through diligent study of past experiences and predictions of the future Russel and Norvig (2022, 668)
A learning agent
- uses so-called machine learning (ML), if it is a computer;
- improves performance based on experience (i.e., observations of the world);
- is required when the designer lacks omniscience (i.e., in unknown environments) and/or
- have no idea how to program a solution themselves (e.g., recognizing faces)
The learning agent
So far an agent’s percepts have only served to help the agent choose its actions. Now they will also serve to improve future behavior.
Visualization
Building blocks
Performance element: Processes percepts and chooses actions (relates to the basics of AI we have studied so far)
Learning element: Carries out improvements. Requires awareness and feedback on how the agent is doing in the environment
Critic: Evaluation of the agent’s behavior based on a given external behavioral measure (i.e., feedback)
Problem generator: Suggests explorative actions that lead the agent to new experiences
The performance elements of the agent designs described in chapter Intelligent agents are composed of
- a direct mapping from conditions to the current state of actions;
- a means to infer relevant properties of the world from the percept sequence;
- information about the way the world evolves and about the results of possible actions the agent can take;
- utility information indicating the desirability of actions; and/or
- goals that describe the most desirable states;
The learning element
The design of the learning element is influenced by four important aspects:
- Which component of the performance element is to be improved?
- What representation should be chosen (i.e., what model)?
- What prior information is available (i.e., prior knowledge that influences the model)?
- What form of feedback is available?
Types of feedback
The type of feedback available for learning is usually the most important factor in determining the nature of the learning problem.
- Supervised learning: Involves learning a function from examples of its inputs and outputs ➞ correct answer for each training instance
- Unsupervised learning: The agent has to learn patterns in the input when no specific output values are given ➞ reward sequence, no correct answers
- Reinforcement learning: The most general form of learning in which the agent is not told what to do by a teacher. Rather, it must learn from reinforcements (punishments or rewards). It typically involves learning how the environment works ➞ “just make sense of the data”
- Supervised learning
- Image classification — inputs can be camera images, each one accompanied by an output saying, e.g., “bus” or “pedestrian”. An output like this is called a label. The agents learns a function that, when given a new image, predicts the appropriate label.
- Unsupervised learning
- Clustering — detecting potentially useful clusters of input examples. For instance, when shown millions of images, a computer vision system could identify large cluster of similar images (without “knowing” what is shown on these).
- Reinforcement learning
- A chess agent — imagine, it is told at the end of a game that it has won (a reward) or lost (a punishment). Based on that feedback, it has to decide which of the actions prior to the reinforcement were most responsible for it, and to alter its actions to aim towards more rewards in future.
Why learning works
How can we be sure that our learned hypothesis will predict well for previously unseen inputs? I.e., how do we know that the hypothesis \(h\) is close to the target function \(f\) when \(f\) is unknown?
The underlying principle of computational learning theory is, that any hypothesis that is seriously wrong will almost certainly be “found out” with high probability after a small number of examples
Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct (PAC)
Inductive learning
The task of learning is to find good hypotheses about the world.
- An example is a pair \((x, f(x))\) (input and output)
- The complete set of examples is called the training set (supervised learning)
- Pure inductive inference: for a collection of examples for \(f\) (the target function), return a function \(h\) (hypothesis) that approximates \(f\)
- The function h typically is member of a hypothesis space \(H\)
- A good hypothesis should generalize the data well (i. e., will predict unseen examples correctly)
- A hypothesis is consistent with the data set if it agrees with all the data
How do we choose from among multiple consistent hypotheses?
Ockham’s razor: prefer the simplest hypothesis that matches the data
Ockham’s razor is a choice between more complex, low-bias hypotheses that fit the training data well and simple, low-variance hypotheses that may generalize better. Wililiam of Ockham stated in the first century the principle that “plurality [of entities] should not be posited without necessity — the so-called Ockham’s razor that”shaves off” dubious explanations.
A hypothesis is underfitting when it fails to find a pattern in the data (i.e., the model has not learned enough from the data). In turn, a hypothesis is overfitting the data when it pays too much attention to the particular data set it is trained on, causing it to perform poorly on unseen data (i.e., the generalization of the model is unreliable).
Example: curve-fitting
For plots of best-fit functions (\(h\)) from four different hypothesis spaces (\(H\)) trained on a data set (a = linear; b = degree-7 polynomial, c = degree-6 polynomial, d = sinusoidal)
Decision trees
Learning decision trees
A decision tree is a representation of a function that maps a vector of attribute values to a single output value—a “decision”.
Search is used to find a decision (i.e., performing a sequence of tests).
In Boolean decision trees, the input is a set of vector of input attributes \(X\) and a single Boolean output value \(y\)
Learning process: Definition of the goal predicate in the form of a decision tree.
- An internal node of the decision tree represents a test of a property
- Branches are labeled with the possible values of the test
- Each leaf node specifies the Boolean value to be returned if that leaf is reached
Expressiveness
A Boolean decision tree is equivalent to a logical statement of the form
\[ \begin{flalign} Output \iff (Path_1 \lor Path_2 \lor ...) \end{flalign} \]
where each \(Path_i\) is a conjunction of the form \((A_m = v_x \; \land \; A_n = v_y \; \land \; ...)\) of attribute-value tests corresponding to a path from the root to a \(true\) leaf.
Any function in propositional logic can be represented by a decision tree by translating every row of a truth table to a path in the tree.
This can lead to a tree whose size is exponential in the number of attributes.
How many distinct decision trees with \(n\) Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with \(2^n\) rows = \(2^{2^n}\)
E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
How many purely conjunctive hypotheses (e.g., \(Hungry \land \neg Rain\))?
Each attribute can be in (positive), in (negative), or out (i.e., not part of the sentence):
\(3^n\) distinct conjunctive hypotheses
Limitations
Although decision trees can represent functions with smaller trees, there are functions that require an exponentially large decision tree, e.g.
- Majority function, which returns true if more than half of the inputs are true
- Parity function, which returns true if and only if an even number of inputs are true
- \(y > A_1 + A_2\) with real-valued attributes 1
Summary: decision trees are good for some kinds of functions and bad for others.
Example problem
Supervised learning problem of deciding whether to wait for a table at a restaurant (Russel and Norvig 2022, 668)
The output (\(y\)) is a Boolean variable WillWait
The input (\(x\)) is a vector of ten attributes values (discrete values):
- Alternate – Is there an alternative? (T/F)
- Bar – Does the restaurant have a bar to wait in? (T/F)
- Fri – Is it Friday or Saturday? (T/F)
- Hungry – Am I hungry? (T/F)
- Patrons – How many guests are there? (none, some, full)
- Price – How expensive is the food? (€, €€, €€€)
- WaitEstimate – How long do we have to wait? (0-10, 10-30, 30-60, >60)
- Reservation – Have I made a reservation? (T/F)
- Raining – Is it raining outside? (T/F)
- Type – What kind of restaurant is it? (French, Italian, Thai, Burger)
Example decision tree
Underlying training set
Example | Alt | Bar | Fri | Hun | Pat | Price | Rain | Res | Type | Est | WillWait |
---|---|---|---|---|---|---|---|---|---|---|---|
\(x_1\) | Yes | No | No | Yes | Some | €€€ | No | Yes | French | 0-10 | \(y_1 =\) Yes |
\(x_2\) | Yes | No | No | Yes | Full | € | No | No | Thai | 30-60 | \(y_2 =\) No |
\(x_3\) | No | Yes | No | No | Some | € | No | No | Burger | 0-10 | \(y_3 =\) Yes |
\(x_4\) | Yes | No | Yes | Yes | Full | € | Yes | No | Thai | 10-30 | \(y_4 =\) Yes |
\(x_5\) | Yes | No | Yes | No | Full | €€€ | No | Yes | French | >60 | \(y_5 =\) No |
\(x_6\) | No | es | No | Yes | Some | €€ | Yes | Yes | Italian | 0-10 | \(y_6 =\) Yes |
\(x_7\) | No | Yes | No | No | None | € | Yes | No | Burger | 0-10 | \(y_7 =\) No |
\(x_8\) | No | No | No | Yes | Some | €€ | Yes | Yes | Thai | 0-10 | \(y_8 =\) Yes |
\(x_9\) | No | Yes | Yes | No | Full | € | Yes | No | Burger | >60 | \(y_9 =\) No |
\(x_{10}\) | Yes | Yes | Yes | Yes | Full | €€€ | No | Yes | Italian | 10-30 | \(y_{10}=\) No |
\(x_{11}\) | No | No | No | No | None | € | No | No | Thai | 0-10 | \(y_{11} =\) No |
\(x_{12}\) | Yes | Yes | Yes | Yes | Full | € | No | No | Burger | 30-60 | \(y_{12} =\) Yes |
Inducing decision trees from examples
To get a naive solution, we simply construct a tree with one path to a leaf for each example.
- We test all the attributes along the path and attach the classification of the example to the leaf
- Whereas the resulting tree will correctly classify all given examples, it will not say much about other cases
- It just memorizes the observations and does not generalize
How to find a tree that is consistent with the training set (Table 1), yet as small as possible?
- It is intractable to find a guaranteed smallest consistent tree.
- However, with some simple heuristics, we can efficiently find one that is close to the smallest (i.e., “smallish” tree).
- Decision tree learning adopts a greedy divide-and-conquer strategy.
Divide-and-conquer strategy
Always use the most important attribute first, then recursively solve the smaller subproblems
- The “most important attribute” is the one that makes the most difference2
- Split the training set into subsets each corresponding to a particular value of that attribute
- Now that we have divided the training set into several smaller training sets, we can recursively apply this process to the smaller training sets
That way, we hope to get to the correct classification with a small number of tests3
Recursive learning process
In each recursive step there are four cases to consider:
- Positive and negative examples: choose a new attribute
- Common outcome (only positive or only negative examples): done (answer is Yes or No).
- No examples: there was no example with the desired property. Answer Yes if the majority of the parent node’s examples is positive, otherwise No.
- No attributes left, but there are still examples with different classifications: there were errors in the data (i.e., noise) or the attributes do not give sufficient information. Answer Yes if the majority of examples is positive, otherwise No.
Resulting tree
Properties of the learning outcome:
- The resulting tree is considerably simpler than the one originally given (and from which the training examples were generated)
- The learning algorithm outputs a tree that is consistent with all examples it has seen
- The tree does not need to agree with the correct function, e.g. it suggests not to wait if we are not hungry. If we are, there are cases in which it tells us to wait.
- Some tests (Raining, Reservation) are not included since the algorithm can classify the examples without them
Performance assessment
To assess the power of the prediction, the following method can be applied:
- Collect a large number of examples
- Divide it into two disjoint sets: the training set and the test set
- Use the training set to generate \(h\)
- Measure the percentage of examples of the test set that are correctly classified by \(h\)
- Repeat the process for randomly-selected training sets of different sizes
As the training set grows, the prediction quality increases
Generalization
Pruning reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances.
- Pruning reduces the complexity of the final hypothesis (tree size) and reduces overfitting4
- Predictive accuracy as measured by a cross-validation set5
- One of the simplest forms of pruning is reduced error pruning:
- For each leave, each node is replaced with its most popular output
- If the prediction accuracy is not affected then the change is kept
- It is somewhat naive, but simple and speedy
Summary
- Decision trees are one possibility for representing (Boolean) functions
- They can be exponential in the number of attributes
- It is often too difficult to find the minimal decision tree
- One method for generating decision trees that are as flat as possible is based on ranking the attributes
- The ranks are computed based on the information gain
Statistical ML
Motivation
As discussed in chapter probability, probability and utility theory allow agents to deal with uncertainty
To apply probabilistic reasoning, however, the agents must first learn their probabilistic theories of the world from experience
We will discuss statistical learning methods as robust ways to learn probabilistic models
Bayesian learning
Learning can be viewed as Bayesian updating of a probability distribution over the hypothesis space
\(H\) is the hypothesis variable, values \(h_1, h_2, . . .\), prior \(P(H)\)
\(j\)th observation of \(d_j\) gives the outcome of random variable \(D_j\)
training data \(d = d_1,..., d_N\)
Given the data so far, each hypothesis has a posterior probability:
\[ \begin{flalign} P(h_i|d) = \alpha P(d|h_i)P(h_i) \end{flalign} \]
where \(P(d|h_i)\) is called the likelihood
Predictions use a likelihood-weighted average over the hypotheses:
\[ \begin{flalign} P(X|d) = \sum_i{P(X|d,h_i)P(h_i|d)} = \sum_i{P(X|h_i)P(h_i|d)} \end{flalign} \]
Example
Suppose there are five kinds of bags of candies:
- 10% are \(h_1\) : 100% cherry candies
- 20% are \(h_2\) : 75% cherry and 25% lime candies
- 40% are \(h_3\) : 50% cherry and 50% lime candies
- 20% are \(h_4\) : 25% cherry and 75% lime candies
- 10% are \(h_5\) : 100% lime candies
Then we draw 10 candies from some bag (\(d_1,...,d_10\)), which are all lime candies.
What kind of bag is it?
What flavor will the next candy be?
Posterior probability of hypotheses
\(P(h_k|d) = αP(d|h_k)P(h_k)\)
\(P(h1 | 5\;limes) = αP(5\;limes | h1)P(h1) = \alpha · 0.0^5 · 0.1 = 0\) \(P(h2 | 5\;limes) = αP(5\;limes | h2)P(h2) = \alpha · 0.25^5· 0.2 = 0.000195α\) \(P(h3 | 5\;limes) = αP(5\;limes | h3)P(h3) = \alpha · 0.5^5 · 0.4 = 0.0125α\) \(P(h4 | 5\;limes) = αP(5\;limes | h4)P(h4) = \alpha · 0.75^5 · 0.2 = 0.0475α\) \(P(h5 | 5\;limes) = αP(5\;limes | h5)P(h5) = \alpha · 1.0^5 · 0.1 = 0.1α\)
\(\alpha = 1/(0 + 0.000195 + 0.0125 + 0.0475 + 0.1) = 6.2424\)
\(P(h_1 | 5\;limes) = 0\)
\(P(h_2 | 5\;limes) = 0.00122\)
\(P(h_3 | 5\;limes) = 0.07803\)
\(P(h_4 | 5\;limes) = 0.29650\)
\(P(h_5 | 5\;limes) = 0.62424\)
Evolution of the five hypotheses
Prediction probability
\[ \begin{flalign} P(d_{j+1}|d) = \sum_k{P(d_{j+1}|d,h_k)P(h_k|X)} = \sum_k{P(d_{j+1}|h_k)P(h_k|d)} \end{flalign} \] \[ \begin{align} P(lime \; on \; 6 | 5 \; limes) & = P(lime \; on \; 6 | h1)P(h1 | 5 \; limes) \\ & + P(lime \; on \; 6 | h2)P(h2 | 5 \; limes) \\ & + P(lime \; on \; 6 | h3)P(h3 | 5 \; limes) \\ & + P(lime \; on \; 6 | h4)P(h4 | 5 \; limes) \\ & + P(lime \; on \; 6 | h5)P(h5 | 5 \; limes) \\ & = 0 × 0 \\ & + 0.25 × 0.00122 \\ & + 0.5 × 0.07830 \\ & + 0.75 × 0.29650 \\ & + 1.0 × 0.62424 \\ & = 0.88607 \\ \end{align} \]
Likelihood-weighted average
Observations
The Bayesian prediction eventually agrees with the true hypothesis
- For any fixed prior that does not rule out the true hypothesis, the posterior of any false hypothesis will eventually vanish
- The Bayesian prediction is optimal and, given the hypothesis prior, any other prediction will be correct less often
However, summing over the hypothesis space is often intractable
➞ Real problems require us to resort to approximate or simplified methods
MAP learning
Summing over the hypothesis space is often intractable (e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes).
A common approximation is to make predictions based on a single most probable hypothesis
Maximum a posteriori (MAP)6 learning: choose \(h_{MAP}\) maximizing \(P(h_i|d)\)7.
For large data sets, prior becomes irrelevant.
Maximum likelihood (ML) learning8 means, thus, to choose \(h_{ML}\) maximizing \(P(d|h_i)\).
\[ \begin{flalign} P(h_i|d) = P(d|h_i)P(h_i) \approx P(d|h_{MAP}) \end{flalign} \]
As more data arrive, MAP and Bayesian predictions become closer
➞ Finding MAP hypotheses is often much easier than Bayesian learning9
Candy example
- In the candy example, \(h_{MAP}=h5\) after three lime candies in a row
- The MAP learner the predicts that the fourth candy is lime with probability 1.0, whereas the Bayesian prediction is still 0.8
Summary
- Bayesian learning techniques formulate learning as a form of probabilistic inference
- Full Bayesian learning gives best possible predictions but is intractable
- Maximum a posterior learning selects the most likely hypothesis given the data (balancing complexity with accuracy on training data)
- Maximum likelihood assumes uniform prior, OK for large data sets
Deep learning
Introduction
Conventional machine-learning techniques as those discussed above are limited in their ability to process natural data in their raw form.
Careful engineering and considerable domain expertise are required to transform the raw data into a suitable internal representation from which the learning system could detect or classify patterns in some input.
In representation learning the system uses methods to automatically discover the representations needed for detection or classification.
Deep-learning methods creates multiple levels of representation by transforming the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.
With the composition of enough such transformations, very complex functions can be learned.
Representations
The key aspect of deep learning is that the levels of representations (i.e., layers of features) are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
Example: object detection in images
- An image comes in the form of an array of pixel values (raw input; input layer)
- The first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image (hidden layer 1)
- The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions (hidden layer 2)
- The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects (hidden layer 3)
- Subsequent layers would detect objects as combinations of the parts of familiar objects
Visualization
Application
Deep learning methods can discover intricate structures in high-dimensional data.
Using this capability, deep learning is making major advances in solving problems that could not have been solved before (LeCun, Bengio, and Hinton (2015)), e.g.,
- image and speech recognition,
- analysing particle accelerator data,
- reconstructing brain circuits,
- predicting the effects of DNA mutations on gene expression and disease
- natural language understanding (e.g., topic classification, sentiment analysis, question answering and language translation)
Convolutional neural networks
A convolutional neural network (ConvNet) is a specific kind of neural network with multiple layers,designed to process data that come in the form of multiple arrays10
The role of the ConvNet is to reduce the arrays into a form which is easier to process, without losing features which are critical for getting a good prediction.
A ConvNet is structured as a series of stages, which are composed of two types of layers: convolutional layers and pooling layers.
- The convolution layer: extraction of the high-level features from the input (by applying convolutional operations/filters)
- The pooling layer: reduction of the spatial size of the convoled feature to decrease the computational power required to process the data through dimensionality reduction.
Good read: A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way
Development process
Overview
We are still in the early stages of defining a methodology for machine learning projects; the tools and processes are not as well developed as in software engineering
Russel and Norvig (2022, 722ff) propose a process that involves following typical steps
- Problem formulation
- Data collection, assessment, and management
- Model selection and training
- Checking trustworthiness of the system
- Operation, monitoring, and maintenance
Problem
Figuring out what problem you want to solve compromises three parts:
- Problem: What problem do I solve for my users?
➞ Find an objective that you can track and that relates to your “true goals” - Suitability: What parts of the problem(s) can be solved by ML?
➞ Often not all parts of the problem require ML - Approach: What kind of learning is appropriate?
➞ Often a semi-supervised learning approach is suitable (few labeled examples, large collection of unlabeled examples)
Data
Real data are messy
ML needs data, a lot of data, of which at least a subset is labeled
Manufacturing these data can be done by own labor or by crowdsourcing (paid, volunteers, users); one might also start with publicly available general-purpose dataset (or a model that has been pretrained) and then add specific data
Maintain a data provenance for all data (i.e., for each columns of your data set, you should know the exact definition, where the data come from, what the possible values are, and who has worked on it)
When data are limited, data augmentation can help (i.e., creating multiple versions of each image by rotating, translating, cropping, or scaling each image, or by changing brightness or color balancing or adding noise)
Questions to be asked are - Is this the right data for my task? - Does it capture enough of the right inputs to give us a chance of learning a model? - Does it contain the outputs I want to predict? - If not, can I build an unsupervised model? - Or can I label a portion of the data and then do semi-supervised learning? - Is it relevant data? - How much data do I need? Recommendation: draw a learning curve to see if more data will help. - Could there be data entry errors? - What to be done with missing data fields? - Are there spelling errors or inconsistent terminology in text data? - Are there outliers in the data?
Feature engineering
After correcting overt errors, the data should be preprocessed so that they can be handled more easily
- Quantization: forcing a continuous valued input into fixed bins (e.g., waiting time in 0-10, 10-30, 30-60, or >60 minutes)
- Normalization data: transforming data so that it has a standard deviation of 1
- Separating categorical attributes: transform the data into separate Boolean attributes, where exactly one of it is true
Model
Before starting with building a model, you might start with getting an intuitive feel for the data (e.g., by means of exploratory data analysis)
There is no guaranteed way to pick the best model class, but there are some rough guidelines:
- Random forests are good then there are a lot of categorical features, where many of these may be irrelevant
- Nonparametric methods are good when having a lot of data and no prior model
- Logistic regression does well when the data are linearly separable (or can be converted to be so)
Do what worked well in similar past problems—and search: run experiments with multiple possible models
Trustworthiness
Doing well with test data is a necessary but not sufficient condition for trust in the model (by you and your stakeholders), it also requires
- Verification and validation: you test on the training, validation, and datasets, do code reviews, monitoring, and set measures for accountability
➞ Do the best to ensure that the system will not be wrong and, for the case, set responsibilities - Interpretability: you understand how answers relate to inputs ➞ Do the best to inspect and interpret your model
- Explainability: the model helps you to understand why a certain output has been produced for a given input ➞ Do the best to explain what your model does11
Operation
After the model is deployed to the users, additional challenges will arise
- Long tail of user inputs: you might see inputs that were never tested before and you need to know whether your model generalizes well for them (i.e., you need to monitor the performance)
- Nonstationarity: the world changes over time (e.g., spammers adapt their tactics); you need to consider how often to adapt the model (i.e., find a tradeoff between a well tested model and a model that is built from the latest data)
Tests for features and data (1) Feature expectations are captured in a schema. (2) AU feature. are beneficial. (3) No feature’s cost is too much. ( 4) Features adhere to meta-level requirements. (5) The data pipeline has appropriate privacy controls. (6) New features can be added quickly. (7) All input feature code is tested.
Tests for model development (1) Every model specification undergoes a code review. (2) Every model is checked in to a repository. (3) Offline proxy metrics correlate with actual metrics (4) All hyperparameters have been tuned. (5) The impact of model staleness is known. (6) A simpler model is not better. (7) Model quality is sufficient on all important data slices. The model has been tested for considerations of inclusion.
Tests for ML infrastructure (1) Training is reproducible. (2) Model specification code is unit tested. (3) The full ML pipeline is integration tested. (4) Model quality is validated before attempting to serve it. (5) The model allows debugging by observing the step-by-step computation of training or inference on a single example. (6) Models are tested via a canary process before they enter production serving environments. (7) Models can be quickly and safely rolled back to a previous serving version.
Monitoring tests for ML (1) Dependency changes result in notification. (2) Data invariants hold in training and serving inputs. (3) Training and serving features compute the same values (4) Models are not too tales (5) The model is numerically stable. (6) The model ha not experienced regressions in training speed, serving latency, throughput, or RAM usage. (7) The model has not experienced a regression in prediction quality on served data.
Source: Russel and Norvig (2022, 731)
✏️ Exercises
I2AI_7 E1
Consider following problems:
- Me learning to play tennis
- An infant learning to speak
Discuss following questions:
- Explain how this process fits into the general learning model.
- Describe the percepts and actions of the player.
- What types of learning I must do?
- What example data is available?
I2AI_7 E2
Create the decision tree by applying the divide-and-conquer approach on the restaurant examples (approx. 20 minutes).
Compare the naive tree with the tree gained by applying the divide-and-conquer heuristic. What differences do you see? (approx. 5 minutes)
I2AI_7 E3
Describe the differences between supervised, unsupervised, and reinforcement learning.
I2AI_7 E4
Define the following machine-learning terms in your own words
- Training set
- Hypothesis
- Bias
- Variance
I2AI_7 E5
Draw a decision tree for the problem of deciding whether to move forward at a road intersection, given that the light has just turned green.
What problems do you see? Argue based on the qualification problem discussed in chapter probability.
I2AI_7 E6
We never test the same attribute twice along one path in a decision tree. Why not?
I2AI_7 E7
You want to teach a robot to recognize cat, using a set of features (the presence of blue eyes b
, stripe s
, or spot m
). You also have several pictures of dogs to use as negative examples. The following table summarizes the content of training samples. Each binary feature is represented as 1 (meaning the feature is present) or 0 (meaning it is absent). The subject of y
of the photo is encoded as 1 for cat or -1 for dog.
m |
b |
s |
Subject y |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 |
1 | 1 | 0 | 1 |
0 | 1 | 1 | 1 |
1 | 0 | 1 | -1 |
1 | 1 | 1 | -1 |
- Assuming no smoothing, given estimates for the parameters of the classification rule based on the training samples.
- Suppose a subject which has stripe but not spot or blue eyes. What would happen if the robot had to guess the identity of the subject?
I2AI_7 E8
Two statisticians go to the doctor and are both given the same prognosis: A 40% chance that the problem is the deadly disease A, and a 60% chance of the fatal disease B. Fortunately, there are anti-A and anti-B drugs that are inexpensive, 100% effective, and free of side-effects. The statisticians have the choice of taking one drug, both, or neither.
What will the first statistician (an avid Bayesian) do? How about the second statistician, who always uses the maximum likelihood hypothesis?
The doctor does some research and discovers that disease B actually comes in two versions, dextro-B and levo-B, which are equally likely and equally treatable by the anti-B drug.
Now that there are three hypotheses, what will the two statisticians do?
Literature
Footnotes
The decision is a diagonal line, and all decision tree tests divide the space up into rectangular, axis-aligned boxes. We would have to stack a lot of boxes to closely approximate the diagonal line.↩︎
The selection is implemented by means of a choosing attribute test, which is based on information theory and measures the information gain from the attribute tests. For more details please refer to Russel and Norvig (2022, 681 ff)↩︎
Meaning that all paths in the tree will be short and the tree as a whole will be shallow↩︎
A tree that is too large risks overfitting the training data and poorly generalizing to new samples.↩︎
The training set is divided into two groups; 70% of the training set is used to build the tree, and the remaining 30% for validation; leading to three data sets (training, validation, test)↩︎
Pronounced “em-ay-pee”↩︎
Which is equal to maximizing \(P(d|h_i)P(h_i)\) or \(\ log P(d|h_i) + \log P(h_i)\)↩︎
Maximum likelihood is the “standard” (non-Bayesian) statistical learning method↩︎
MAP learning provides a natural embodiment of Ockham’s razor↩︎
1D for signals an sequences such as language, 2D for images or audio spectograms, 3D for video↩︎
Regulations such as the European GDPR require systems to provide explanations↩︎