Learning

🧠 Introduction to AI

Andy Weeger

Neu-Ulm University of Applied Sciences

February 13, 2024

Definition

Learning agents are those that can improve their behavior through diligent study of past experiences and predictions of the future Russel and Norvig (2022, 668)

A learning agent

uses so-called machine learning (ML), if it is a computer;
improves performance based on experience (i.e., observations of the world);
is required when the designer lacks omniscience (i.e., in unknown environments) and/or
has no idea how to program a solution themselves (e.g., recognizing faces)

The learning agent

Visualization

Figure 1: A learning agent based on Russel and Norvig (2022, 74)

Building blocks

Performance element: Processes percepts and chooses actions (relates to the basics of AI we have studied so far)

Learning element: Carries out improvements. Requires awareness and feedback on how the agent is doing in the environment

Critic: Evaluation of the agent’s behavior based on a given external behavioral measure (i.e., feedback)

Problem generator: Suggests explorative actions that lead the agent to new experiences

The learning element

The design of the learning element is influenced by four important aspects:

Which component of the performance element is to be improved?
What representation should be chosen (i.e., what model)?
What prior information is available (i.e., prior knowledge that influences the model)?
What form of feedback is available?

Types of feedback

The type of feedback available for learning is usually the most important factor in determining the nature of the learning problem.

Supervised learning: Involves learning a function from examples of its inputs and outputs ➞ correct answer for each training instance
Unsupervised learning: The agent has to learn patterns in the input when no specific output values are given ➞ reward sequence, no correct answers
Reinforcement learning: The most general form of learning in which the agent is not told what to do by a teacher. Rather, it must learn from reinforcements (punishments or rewards). It typically involves learning how the environment works ➞ “just make sense of the data”

Examples

Supervised learning: Image classification — inputs can be camera images, each one accompanied by an output saying, e.g., “bus” or “pedestrian”. An output like this is called a label. The agents learns a function that, when given a new image, predicts the appropriate label.
Unsupervised learning: Clustering — detecting potentially useful clusters of input examples. For instance, when shown millions of images, a computer vision system could identify large cluster of similar images (without “knowing” what is shown on these).
Reinforcement learning: A chess agent — imagine, it is told at the end of a game that it has won (a reward) or lost (a punishment). Based on that feedback, it has to decide which of the actions prior to the reinforcement were most responsible for it, and to alter its actions to aim towards more rewards in future.

Why learning works

How can we be sure that our learned hypothesis will predict well for previously unseen inputs? I.e., how do we know that the hypothesis \(h\) is close to the target function \(f\) when \(f\) is unknown?

The underlying principle of computational learning theory is, that any hypothesis that is seriously wrong will almost certainly be “found out” with high probability after a small number of examples

Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct (PAC)

Inductive learning

The task of learning is to find good hypotheses about the world.

An example is a pair \((x, f(x))\) (input and output)
The complete set of examples is called the training set (supervised learning)
Pure inductive inference: for a collection of examples for \(f\) (the target function), return a function \(h\) (hypothesis) that approximates \(f\)
The function h typically is member of a hypothesis space \(H\)
A good hypothesis should generalize the data well (i. e., will predict unseen examples correctly)
A hypothesis is consistent with the data set if it agrees with all the data

How do we choose from among multiple consistent hypotheses?

Ockham’s razor: prefer the simplest hypothesis that matches the data

Example: curve-fitting

Figure 2: Finding hypotheses to fit data

For plots of best-fit functions (h) from four different hypothesis spaces (H) trained on a data set (a = linear; b = degree-7 polynomial, c = degree-6 polynomial, d = sinusoidal)

Decision trees

Learning decision trees

A decision tree is a representation of a function that maps a vector of attribute values to a single output value—a “decision”.

Search is used to find a decision (i.e., performing a sequence of tests).

In Boolean decision trees, the input is a set of vector of input attributes X and a single Boolean output value y

Learning process: Definition of the goal predicate in the form of a decision tree.

An internal node of the decision tree represents a test of a property
Branches are labeled with the possible values of the test
Each leaf node specifies the Boolean value to be returned if that leaf is reached

Expressiveness

A Boolean decision tree is equivalent to a logical statement of the form

\[ \begin{flalign} Output \iff (Path_1 \lor Path_2 \lor ...) \end{flalign} \]

where each \(Path_i\) is a conjunction of the form \((A_m = v_x \; \land \; A_n = v_y \; \land \; ...)\) of attribute-value tests corresponding to a path from the root to a \(true\) leaf.

Any function in propositional logic can be represented by a decision tree by translating every row of a truth table to a path in the tree.

This can lead to a tree whose size is exponential in the number of attributes.

Limitations

Although decision trees can represent functions with smaller trees, there are functions that require an exponentially large decision tree, e.g.

Majority function, which returns true if more than half of the inputs are true
Parity function, which returns true if and only if an even number of inputs are true
\(y > A_1 + A_2\) with real-valued attributes ¹

Summary: decision trees are good for some kinds of functions and bad for others.

Example problem

Supervised learning problem of deciding whether to wait for a table at a restaurant (Russel and Norvig 2022, 668)

The output (y) is a Boolean variable WillWait

The input (x) is a vector of ten attributes values (discrete values):

Alternate – Is there an alternative? (T/F)
Bar – Does the restaurant have a bar to wait in? (T/F)
Fri – Is it Friday or Saturday? (T/F)
Hungry – Am I hungry? (T/F)
Patrons – How many guests are there? (none, some, full)
Price – How expensive is the food? (€, €€, €€€)
WaitEstimate – How long do we have to wait? (0-10, 10-30, 30-60, >60)
Reservation – Have I made a reservation? (T/F)
Raining – Is it raining outside? (T/F)
Type – What kind of restaurant is it? (French, Italian, Thai, Burger)

Example decision tree

Figure 3: Decision tree restaurant example based on Russel and Norvig (2022, 674)

Underlying training set

Example	Alt	Bar	Fri	Hun	Pat	Price	Rain	Res	Type	Est	WillWait
\(x_1\)	Yes	No	No	Yes	Some	€€€	No	Yes	French	0-10	\(y_1 =\) Yes
\(x_2\)	Yes	No	No	Yes	Full	€	No	No	Thai	30-60	\(y_2 =\) No
\(x_3\)	No	Yes	No	No	Some	€	No	No	Burger	0-10	\(y_3 =\) Yes
\(x_4\)	Yes	No	Yes	Yes	Full	€	Yes	No	Thai	10-30	\(y_4 =\) Yes
\(x_5\)	Yes	No	Yes	No	Full	€€€	No	Yes	French	>60	\(y_5 =\) No
\(x_6\)	No	es	No	Yes	Some	€€	Yes	Yes	Italian	0-10	\(y_6 =\) Yes
\(x_7\)	No	Yes	No	No	None	€	Yes	No	Burger	0-10	\(y_7 =\) No
\(x_8\)	No	No	No	Yes	Some	€€	Yes	Yes	Thai	0-10	\(y_8 =\) Yes
\(x_9\)	No	Yes	Yes	No	Full	€	Yes	No	Burger	>60	\(y_9 =\) No
\(x_{10}\)	Yes	Yes	Yes	Yes	Full	€€€	No	Yes	Italian	10-30	\(y_{10}=\) No
\(x_{11}\)	No	No	No	No	None	€	No	No	Thai	0-10	\(y_{11} =\) No
\(x_{12}\)	Yes	Yes	Yes	Yes	Full	€	No	No	Burger	30-60	\(y_{12} =\) Yes

Table 1: Examples for the restaurant domain

Inducing decision trees from examples

To get a naive solution, we simply construct a tree with one path to a leaf for each example.

We test all the attributes along the path and attach the classification of the example to the leaf
Whereas the resulting tree will correctly classify all given examples, it will not say much about other cases
It just memorizes the observations and does not generalize

How to find a tree that is consistent with the training set (Table 1), yet as small as possible?

It is intractable to find a guaranteed smallest consistent tree.
However, with some simple heuristics, we can efficiently find one that is close to the smallest (i.e., “smallish” tree).
Decision tree learning adopts a greedy divide-and-conquer strategy.

Divide-and-conquer strategy

Always use the most important attribute first, then recursively solve the smaller subproblems

The “most important attribute” is the one that makes the most difference²
Split the training set into subsets each corresponding to a particular value of that attribute
Now that we have divided the training set into several smaller training sets, we can recursively apply this process to the smaller training sets

That way, we hope to get to the correct classification with a small number of tests³

Recursive learning process

In each recursive step there are four cases to consider:

Positive and negative examples: choose a new attribute
Common outcome (only positive or only negative examples): done (answer is Yes or No).
No examples: there was no example with the desired property. Answer Yes if the majority of the parent node’s examples is positive, otherwise No.
No attributes left, but there are still examples with different classifications: there were errors in the data (i.e., noise) or the attributes do not give sufficient information. Answer Yes if the majority of examples is positive, otherwise No.

Resulting tree

Properties of the learning outcome:

The resulting tree is considerably simpler than the one originally given (and from which the training examples were generated)
The learning algorithm outputs a tree that is consistent with all examples it has seen
The tree does not need to agree with the correct function, e.g. it suggests not to wait if we are not hungry. If we are, there are cases in which it tells us to wait.
Some tests (Raining, Reservation) are not included since the algorithm can classify the examples without them

Performance assessment

To assess the power of the prediction, the following method can be applied:

Collect a large number of examples
Divide it into two disjoint sets: the training set and the test set
Use the training set to generate \(h\)
Measure the percentage of examples of the test set that are correctly classified by \(h\)
Repeat the process for randomly-selected training sets of different sizes

As the training set grows, the prediction quality increases

Generalization

Pruning reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances.

Pruning reduces the complexity of the final hypothesis (tree size) and reduces overfitting⁴
Predictive accuracy as measured by a cross-validation set⁵
One of the simplest forms of pruning is reduced error pruning:
- For each leave, each node is replaced with its most popular output
- If the prediction accuracy is not affected then the change is kept
- It is somewhat naive, but simple and speedy

Summary

Decision trees are one possibility for representing (Boolean) functions
They can be exponential in the number of attributes
It is often too difficult to find the minimal decision tree
One method for generating decision trees that are as flat as possible is based on ranking the attributes
The ranks are computed based on the information gain

Statistical ML

Motivation

As discussed in chapter probability, probability and utility theory allow agents to deal with uncertainty

To apply probabilistic reasoning, however, the agents must first learn their probabilistic theories of the world from experience

We will discuss statistical learning methods as robust ways to learn probabilistic models

Bayesian learning

Learning can be viewed as Bayesian updating of a probability distribution over the hypothesis space

\(H\) is the hypothesis variable, values \(h_1, h_2, . . .\), prior \(P(H)\)

\(j\)th observation of \(d_j\) gives the outcome of random variable \(D_j\)
training data \(d = d_1,..., d_N\)

Given the data so far, each hypothesis has a posterior probability:

\[ \begin{flalign} P(h_i|d) = \alpha P(d|h_i)P(h_i) \end{flalign} \]

where \(P(d|h_i)\) is called the likelihood

Predictions use a likelihood-weighted average over the hypotheses:

\[ \begin{flalign} P(H|d) = \sum_i{P(H|d,h_i)P(h_i|d)} = \sum_i{P(H|h_i)P(h_i|d)} \end{flalign} \]

Example

Suppose there are five kinds of bags of candies:

10% are \(h_1\) : 100% cherry candies
20% are \(h_2\) : 75% cherry and 25% lime candies
40% are \(h_3\) : 50% cherry and 50% lime candies
20% are \(h_4\) : 25% cherry and 75% lime candies
10% are \(h_5\) : 100% lime candies

Then we draw 5 candies from some bag (\(d_1,...,d_10\)), which are all lime candies.

What kind of bag is it?
What flavor will the next candy be?

Posterior probability of hypotheses

\(P(h_k|d) = αP(d|h_k)P(h_k)\)

\(\alpha = 1/(0 + 0.000195 + 0.0125 + 0.0475 + 0.1) = 6.2424\)

Evolution of the five hypotheses

The posterior probabilties change the more evidence we gain.

Figure 4: Posterior probability of candy-hypotheses

Prediction probability

Let’s calculate the probabilit that we draw a lime on our 6th attempt given the data we have (types of bags and observations made).

\[ \begin{flalign} P(d_{j+1}|d) = \sum_k{P(d_{j+1}|d,h_k)P(h_k|X)} = \sum_k{P(d_{j+1}|h_k)P(h_k|d)} \end{flalign} \] \[ \begin{align} P(lime \; on \; 6 | 5 \; limes) & = P(lime \; on \; 6 | h1)P(h1 | 5 \; limes) \\ & + P(lime \; on \; 6 | h2)P(h2 | 5 \; limes) \\ & + P(lime \; on \; 6 | h3)P(h3 | 5 \; limes) \\ & + P(lime \; on \; 6 | h4)P(h4 | 5 \; limes) \\ & + P(lime \; on \; 6 | h5)P(h5 | 5 \; limes) \\ & = 0 × 0 \\ & + 0.25 × 0.00122 \\ & + 0.5 × 0.07830 \\ & + 0.75 × 0.29650 \\ & + 1.0 × 0.62424 \\ & = 0.88607 \\ \end{align} \]

Likelihood-weighted average

Figure 5: Probability that next candy is lime given d

Observations

The Bayesian prediction eventually agrees with the true hypothesis

For any fixed prior that does not rule out the true hypothesis, the posterior of any false hypothesis will eventually vanish
The Bayesian prediction is optimal and, given the hypothesis prior, any other prediction will be correct less often

However, summing over the hypothesis space is often intractable

➞ Real problems require us to resort to approximate or simplified methods

MAP learning

Summing over the hypothesis space is often intractable (e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes).

A common approximation is to make predictions based on a single most probable hypothesis.

Maximum a posteriori (MAP)⁶ learning: choose \(h_{MAP}\) maximizing \(P(h_i|d)\)⁷.

For large data sets, prior becomes irrelevant.

Maximum likelihood (ML) learning⁸ means, thus, to choose \(h_{ML}\) maximizing \(P(d|h_i)\).

\[ \begin{flalign} P(h_i|d) = \frac{P(d|h_i)P(h_i)}{P(d)} \approx P(d|h_i)P(h_i) \approx P(d|h_{MAP}) \end{flalign} \]

We don’t need to directly calculate \(P(d)\) (i.e., the evidence term) for finding the MAP estimate as \(P(d)\) acts like a constant and does not affect which value maximizes the posterior probability.

MAP learning aims to find the value that maximizes the product of likelihood (\(P(d|h_i)\)) and prior probability (\(P(h_i)\)).

As more data arrive, MAP and Bayesian predictions become closer

➞ Finding MAP hypotheses is often much easier than Bayesian learning⁹

Usually specialized machine learning algorithms handle the complex calculations behind finding the MAP estimate.

Summary

Bayesian learning techniques formulate learning as a form of probabilistic inference
Full Bayesian learning gives best possible predictions but is intractable
Maximum a posterior learning selects the most likely hypothesis given the data (balancing complexity with accuracy on training data)
Maximum likelihood assumes uniform prior, OK for large data sets

Deep learning

Introduction

Conventional machine-learning techniques as those discussed above are limited in their ability to process natural data in their raw form.

Careful engineering and considerable domain expertise are required to transform the raw data into a suitable internal representation from which the learning system could detect or classify patterns in some input.

In representation learning the system uses methods to automatically discover the representations needed for detection or classification.

Deep-learning methods creates multiple levels of representation by transforming the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.

With the composition of enough such transformations, very complex functions can be learned.

Representations

The key aspect of deep learning is that the levels of representations (i.e., layers of features) are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Example: object detection in images

An image comes in the form of an array of pixel values (raw input; input layer)
The first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image (hidden layer 1)
The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions (hidden layer 2)
The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects (hidden layer 3)
Subsequent layers would detect objects as combinations of the parts of familiar objects

Visualization

Figure 6: Illustration of a deep neural network consisting of multiple representation layers

Application

Deep learning methods can discover intricate structures in high-dimensional data.

Using this capability, deep learning is making major advances in solving problems that could not have been solved before (LeCun, Bengio, and Hinton (2015)), e.g.,

image and speech recognition,
analysing particle accelerator data,
reconstructing brain circuits,
predicting the effects of DNA mutations on gene expression and disease
natural language understanding (e.g., topic classification, sentiment analysis, question answering and language translation)

Convolutional neural networks

A convolutional neural network (ConvNet) is a specific kind of neural network with multiple layers,designed to process data that come in the form of multiple arrays¹⁰

The role of the ConvNet is to reduce the arrays into a form which is easier to process, without losing features which are critical for getting a good prediction.

A ConvNet is structured as a series of stages, which are composed of two types of layers: convolutional layers and pooling layers.

The convolution layer: extraction of the high-level features from the input (by applying convolutional operations/filters)
The pooling layer: reduction of the spatial size of the convoled feature to decrease the computational power required to process the data through dimensionality reduction.

Good read: A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way

Development process

Overview

We are still in the early stages of defining a methodology for machine learning projects; the tools and processes are not as well developed as in software engineering

Russel and Norvig (2022, 722ff) propose a process that involves following typical steps

Problem formulation
Data collection, assessment, and management
Model selection and training
Checking trustworthiness of the system
Operation, monitoring, and maintenance

Problem

Figuring out what problem you want to solve compromises three parts:

Problem: What problem do I solve for my users?
➞ Find an objective that you can track and that relates to your “true goals”
Suitability: What parts of the problem(s) can be solved by ML?
➞ Often not all parts of the problem require ML
Approach: What kind of learning is appropriate?
➞ Often a semi-supervised learning approach is suitable (few labeled examples, large collection of unlabeled examples)

Data

Real data are messy

ML needs data, a lot of data, of which at least a subset is labeled

Manufacturing these data can be done by own labor or by crowdsourcing (paid, volunteers, users); one might also start with publicly available general-purpose dataset (or a model that has been pretrained) and then add specific data

Maintain a data provenance for all data (i.e., for each columns of your data set, you should know the exact definition, where the data come from, what the possible values are, and who has worked on it)

When data are limited, data augmentation can help (i.e., creating multiple versions of each image by rotating, translating, cropping, or scaling each image, or by changing brightness or color balancing or adding noise)

Feature engineering

After correcting overt errors, the data should be preprocessed so that they can be handled more easily

Quantization: forcing a continuous valued input into fixed bins (e.g., waiting time in 0-10, 10-30, 30-60, or >60 minutes)
Normalization data: transforming data so that it has a standard deviation of 1
Separating categorical attributes: transform the data into separate Boolean attributes, where exactly one of it is true

Importance

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. Domingos (2012)

Model

Before starting with building a model, you might start with getting an intuitive feel for the data (e.g., by means of exploratory data analysis)

There is no guaranteed way to pick the best model class, but there are some rough guidelines:

Random forests are good then there are a lot of categorical features, where many of these may be irrelevant
Nonparametric methods are good when having a lot of data and no prior model
Logistic regression does well when the data are linearly separable (or can be converted to be so)

Do what worked well in similar past problems—and search: run experiments with multiple possible models

Trustworthiness

Doing well with test data is a necessary but not sufficient condition for trust in the model (by you and your stakeholders), it also requires

Verification and validation: you test on the training, validation, and datasets, do code reviews, monitoring, and set measures for accountability
➞ Do the best to ensure that the system will not be wrong and, for the case, set responsibilities
Interpretability: you understand how answers relate to inputs ➞ Do the best to inspect and interpret your model
Explainability: the model helps you to understand why a certain output has been produced for a given input ➞ Do the best to explain what your model does¹¹

Operation

After the model is deployed to the users, additional challenges will arise

Long tail of user inputs: you might see inputs that were never tested before and you need to know whether your model generalizes well for them (i.e., you need to monitor the performance)
Nonstationarity: the world changes over time (e.g., spammers adapt their tactics); you need to consider how often to adapt the model (i.e., find a tradeoff between a well tested model and a model that is built from the latest data)

A set of criteria to see how well you are doing at deploying ML models

Tests for features and data (1) Feature expectations are captured in a schema. (2) AU feature. are beneficial. (3) No feature’s cost is too much. ( 4) Features adhere to meta-level requirements. (5) The data pipeline has appropriate privacy controls. (6) New features can be added quickly. (7) All input feature code is tested.

Tests for model development (1) Every model specification undergoes a code review. (2) Every model is checked in to a repository. (3) Offline proxy metrics correlate with actual metrics (4) All hyperparameters have been tuned. (5) The impact of model staleness is known. (6) A simpler model is not better. (7) Model quality is sufficient on all important data slices. The model has been tested for considerations of inclusion.

Tests for ML infrastructure (1) Training is reproducible. (2) Model specification code is unit tested. (3) The full ML pipeline is integration tested. (4) Model quality is validated before attempting to serve it. (5) The model allows debugging by observing the step-by-step computation of training or inference on a single example. (6) Models are tested via a canary process before they enter production serving environments. (7) Models can be quickly and safely rolled back to a previous serving version.

Monitoring tests for ML (1) Dependency changes result in notification. (2) Data invariants hold in training and serving inputs. (3) Training and serving features compute the same values (4) Models are not too tales (5) The model is numerically stable. (6) The model ha not experienced regressions in training speed, serving latency, throughput, or RAM usage. (7) The model has not experienced a regression in prediction quality on served data.

Source: Russel and Norvig (2022, 731)

✏️ Exercises

Exercise

Learning scenarios

Consider following problems:

Me learning to play tennis
An infant learning to speak

Discuss following questions:

Explain how this process fits into the general learning model.
Describe the percepts and actions of the player.
What types of learning I must do?
What example data is available?

Learning types

Describe the differences between supervised, unsupervised, and reinforcement learning.

Ockham’s razor

In your own words, explain to us what Ockham’s razor is. Find an example that you can use to enrich your explanation of the concept.

ML concepts

Define the following machine-learning terms in your own words

Inductive learning
Training set
Hypothesis
Bias
Variance
Curve fitting

Qualification problem

Draw a decision tree for the problem of deciding whether to move forward at a road intersection, given that the light has just turned green.

What problems do you see? Argue based on the qualification problem discussed in chapter probability.

Generalize your findings and describe for which kind of problems decision trees are not suitable.

Divide-and-conquer

Create the decision tree by applying the divide-and-conquer approach on the restaurant examples.

Compare the naive tree with the tree gained by applying the divide-and-conquer heuristic. What differences do you see?.

Decision tree

Explain how decision trees can be used to create a learning agent. Relate your answers to the components outlined in the lecture notes.

We never test the same attribute twice along one path in a decision tree. Why not?

Cat recognition

You want to teach a robot to recognize cat, using a set of features (the presence of blue eyes b, stripe s, or spot m). You also have several pictures of dogs to use as negative examples. The following table summarizes the content of training samples. Each binary feature is represented as 1 (meaning the feature is present) or 0 (meaning it is absent). The subject of y of the photo is encoded as 1 for cat or -1 for dog.

`m`	`b`	`s`	Subject `y`
0	0	0	1
1	0	0	1
1	1	0	1
0	1	1	1
1	0	1	-1
1	1	1	-1

Table 2: Observations

Suppose the robot has a Naive Bayes-based brain. Write the Naive Bayes classification rule for this problem (i.e. write a formula which, given a data point x = (m, b, s), returns the most likely subject y). Write the formula in terms of likelyhood (conditional probabilties) and prior probabilities.
Give estimates for the parameters of the classification rule based on the training samples.
Suppose a subject which has stripes but no spots or blue eyes. What would happen if the robot had to guess the identity of the subject?

Statisticians

Two statisticians go to the doctor and are both given the same prognosis: A 40% chance that the problem is the deadly disease A, and a 60% chance of the fatal disease B. Fortunately, there are anti-A and anti-B drugs that are inexpensive, 100% effective, and free of side-effects. The statisticians have the choice of taking one drug, both, or neither.

What will the first statistician (an avid Bayesian) do? How about the second statistician, who always uses the maximum likelihood hypothesis?

The doctor does some research and discovers that disease B actually comes in two versions, dextro-B and levo-B, which are equally likely and equally treatable by the anti-B drug.

Now that there are three hypotheses, what will the two statisticians do?

Literature

Domingos, Pedro. 2012. “A Few Useful Things to Know about Machine Learning.” Communications of the ACM 55 (10): 78–87.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44.

Russel, Stuart, and Peter Norvig. 2022. Artificial Intelligence: A Modern Approach. Harlow: Pearson Education.

Footnotes

The decision is a diagonal line, and all decision tree tests divide the space up into rectangular, axis-aligned boxes. We would have to stack a lot of boxes to closely approximate the diagonal line.
The selection is implemented by means of a choosing attribute test, which is based on information theory and measures the information gain from the attribute tests. For more details please refer to Russel and Norvig (2022, 681 ff)
Meaning that all paths in the tree will be short and the tree as a whole will be shallow
A tree that is too large risks overfitting the training data and poorly generalizing to new samples.
The training set is divided into two groups; 70% of the training set is used to build the tree, and the remaining 30% for validation; leading to three data sets (training, validation, test)
Pronounced “em-ay-pee”
Which is equal to maximizing \(P(d|h_i)P(h_i)\) or \(\ log P(d|h_i) + \log P(h_i)\)
Maximum likelihood is the “standard” (non-Bayesian) statistical learning method
MAP learning provides a natural embodiment of Ockham’s razor
1D for signals an sequences such as language, 2D for images or audio spectograms, 3D for video
Regulations such as the European GDPR require systems to provide explanations