Learning

🧠 Introduction to AI

Andy Weeger

Neu-Ulm University of Applied Sciences

February 13, 2024

Definition

Learning agents are those that can improve their behavior through diligent study of past experiences and predictions of the future Russel and Norvig (2022, 668)

A learning agent

  • uses so-called machine learning (ML), if it is a computer;
  • improves performance based on experience (i.e., observations of the world);
  • is required when the designer lacks omniscience (i.e., in unknown environments) and/or
  • has no idea how to program a solution themselves (e.g., recognizing faces)

The learning agent

Visualization

Figure 1: A learning agent based on Russel and Norvig (2022, 74)

Building blocks

Performance element: Processes percepts and chooses actions (relates to the basics of AI we have studied so far)

Learning element: Carries out improvements. Requires awareness and feedback on how the agent is doing in the environment

Critic: Evaluation of the agent’s behavior based on a given external behavioral measure (i.e., feedback)

Problem generator: Suggests explorative actions that lead the agent to new experiences

The learning element

The design of the learning element is influenced by four important aspects:

  • Which component of the performance element is to be improved?
  • What representation should be chosen (i.e., what model)?
  • What prior information is available (i.e., prior knowledge that influences the model)?
  • What form of feedback is available?

Types of feedback

The type of feedback available for learning is usually the most important factor in determining the nature of the learning problem.

  • Supervised learning: Involves learning a function from examples of its inputs and outputs âžž correct answer for each training instance
  • Unsupervised learning: The agent has to learn patterns in the input when no specific output values are given âžž reward sequence, no correct answers
  • Reinforcement learning: The most general form of learning in which the agent is not told what to do by a teacher. Rather, it must learn from reinforcements (punishments or rewards). It typically involves learning how the environment works âžž “just make sense of the data”

Why learning works

How can we be sure that our learned hypothesis will predict well for previously unseen inputs? I.e., how do we know that the hypothesis \(h\) is close to the target function \(f\) when \(f\) is unknown?

The underlying principle of computational learning theory is, that any hypothesis that is seriously wrong will almost certainly be “found out” with high probability after a small number of examples

Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct (PAC)

Inductive learning

The task of learning is to find good hypotheses about the world.

  • An example is a pair \((x, f(x))\) (input and output)
  • The complete set of examples is called the training set (supervised learning)
  • Pure inductive inference: for a collection of examples for \(f\) (the target function), return a function \(h\) (hypothesis) that approximates \(f\)
  • The function h typically is member of a hypothesis space \(H\)
  • A good hypothesis should generalize the data well (i. e., will predict unseen examples correctly)
  • A hypothesis is consistent with the data set if it agrees with all the data

How do we choose from among multiple consistent hypotheses?

Ockham’s razor: prefer the simplest hypothesis that matches the data

Example: curve-fitting

Figure 2: Finding hypotheses to fit data

For plots of best-fit functions (h) from four different hypothesis spaces (H) trained on a data set (a = linear; b = degree-7 polynomial, c = degree-6 polynomial, d = sinusoidal)

Decision trees

Learning decision trees

A decision tree is a representation of a function that maps a vector of attribute values to a single output value—a “decision”.

Search is used to find a decision (i.e., performing a sequence of tests).

In Boolean decision trees, the input is a set of vector of input attributes X and a single Boolean output value y

Learning process: Definition of the goal predicate in the form of a decision tree.

  • An internal node of the decision tree represents a test of a property
  • Branches are labeled with the possible values of the test
  • Each leaf node specifies the Boolean value to be returned if that leaf is reached

Expressiveness

A Boolean decision tree is equivalent to a logical statement of the form

\[ \begin{flalign} Output \iff (Path_1 \lor Path_2 \lor ...) \end{flalign} \]

where each \(Path_i\) is a conjunction of the form \((A_m = v_x \; \land \; A_n = v_y \; \land \; ...)\) of attribute-value tests corresponding to a path from the root to a \(true\) leaf.

Any function in propositional logic can be represented by a decision tree by translating every row of a truth table to a path in the tree.

This can lead to a tree whose size is exponential in the number of attributes.

Limitations

Although decision trees can represent functions with smaller trees, there are functions that require an exponentially large decision tree, e.g.

  • Majority function, which returns true if more than half of the inputs are true
  • Parity function, which returns true if and only if an even number of inputs are true
  • \(y > A_1 + A_2\) with real-valued attributes 1

Summary: decision trees are good for some kinds of functions and bad for others.

Example problem

Supervised learning problem of deciding whether to wait for a table at a restaurant (Russel and Norvig 2022, 668)

The output (y) is a Boolean variable WillWait

The input (x) is a vector of ten attributes values (discrete values):

  • Alternate – Is there an alternative? (T/F)
  • Bar – Does the restaurant have a bar to wait in? (T/F)
  • Fri – Is it Friday or Saturday? (T/F)
  • Hungry – Am I hungry? (T/F)
  • Patrons – How many guests are there? (none, some, full)
  • Price – How expensive is the food? (€, €€, €€€)
  • WaitEstimate – How long do we have to wait? (0-10, 10-30, 30-60, >60)
  • Reservation – Have I made a reservation? (T/F)
  • Raining – Is it raining outside? (T/F)
  • Type – What kind of restaurant is it? (French, Italian, Thai, Burger)

Example decision tree

Figure 3: Decision tree restaurant example based on Russel and Norvig (2022, 674)

Underlying training set

Example Alt Bar Fri Hun Pat Price Rain Res Type Est WillWait
\(x_1\) Yes No No Yes Some €€€ No Yes French 0-10 \(y_1 =\) Yes
\(x_2\) Yes No No Yes Full € No No Thai 30-60 \(y_2 =\) No
\(x_3\) No Yes No No Some € No No Burger 0-10 \(y_3 =\) Yes
\(x_4\) Yes No Yes Yes Full € Yes No Thai 10-30 \(y_4 =\) Yes
\(x_5\) Yes No Yes No Full €€€ No Yes French >60 \(y_5 =\) No
\(x_6\) No es No Yes Some €€ Yes Yes Italian 0-10 \(y_6 =\) Yes
\(x_7\) No Yes No No None € Yes No Burger 0-10 \(y_7 =\) No
\(x_8\) No No No Yes Some €€ Yes Yes Thai 0-10 \(y_8 =\) Yes
\(x_9\) No Yes Yes No Full € Yes No Burger >60 \(y_9 =\) No
\(x_{10}\) Yes Yes Yes Yes Full €€€ No Yes Italian 10-30 \(y_{10}=\) No
\(x_{11}\) No No No No None € No No Thai 0-10 \(y_{11} =\) No
\(x_{12}\) Yes Yes Yes Yes Full € No No Burger 30-60 \(y_{12} =\) Yes
Table 1: Examples for the restaurant domain

Inducing decision trees from examples

To get a naive solution, we simply construct a tree with one path to a leaf for each example.

  • We test all the attributes along the path and attach the classification of the example to the leaf
  • Whereas the resulting tree will correctly classify all given examples, it will not say much about other cases
  • It just memorizes the observations and does not generalize

How to find a tree that is consistent with the training set (Table 1), yet as small as possible?

  • It is intractable to find a guaranteed smallest consistent tree.
  • However, with some simple heuristics, we can efficiently find one that is close to the smallest (i.e., “smallish” tree).
  • Decision tree learning adopts a greedy divide-and-conquer strategy.

Divide-and-conquer strategy

Always use the most important attribute first, then recursively solve the smaller subproblems

  • The “most important attribute” is the one that makes the most difference2
  • Split the training set into subsets each corresponding to a particular value of that attribute
  • Now that we have divided the training set into several smaller training sets, we can recursively apply this process to the smaller training sets

That way, we hope to get to the correct classification with a small number of tests3

Recursive learning process

In each recursive step there are four cases to consider:

  • Positive and negative examples: choose a new attribute
  • Common outcome (only positive or only negative examples): done (answer is Yes or No).
  • No examples: there was no example with the desired property. Answer Yes if the majority of the parent node’s examples is positive, otherwise No.
  • No attributes left, but there are still examples with different classifications: there were errors in the data (i.e., noise) or the attributes do not give sufficient information. Answer Yes if the majority of examples is positive, otherwise No.

Resulting tree

Properties of the learning outcome:

  • The resulting tree is considerably simpler than the one originally given (and from which the training examples were generated)
  • The learning algorithm outputs a tree that is consistent with all examples it has seen
  • The tree does not need to agree with the correct function, e.g. it suggests not to wait if we are not hungry. If we are, there are cases in which it tells us to wait.
  • Some tests (Raining, Reservation) are not included since the algorithm can classify the examples without them

Performance assessment

To assess the power of the prediction, the following method can be applied:

  • Collect a large number of examples
  • Divide it into two disjoint sets: the training set and the test set
  • Use the training set to generate \(h\)
  • Measure the percentage of examples of the test set that are correctly classified by \(h\)
  • Repeat the process for randomly-selected training sets of different sizes

As the training set grows, the prediction quality increases

Generalization

Pruning reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances.

  • Pruning reduces the complexity of the final hypothesis (tree size) and reduces overfitting4
  • Predictive accuracy as measured by a cross-validation set5
  • One of the simplest forms of pruning is reduced error pruning:
    • For each leave, each node is replaced with its most popular output
    • If the prediction accuracy is not affected then the change is kept
    • It is somewhat naive, but simple and speedy

Summary

  • Decision trees are one possibility for representing (Boolean) functions
  • They can be exponential in the number of attributes
  • It is often too difficult to find the minimal decision tree
  • One method for generating decision trees that are as flat as possible is based on ranking the attributes
  • The ranks are computed based on the information gain

Statistical ML

Motivation

As discussed in chapter probability, probability and utility theory allow agents to deal with uncertainty

To apply probabilistic reasoning, however, the agents must first learn their probabilistic theories of the world from experience

We will discuss statistical learning methods as robust ways to learn probabilistic models

Bayesian learning

Learning can be viewed as Bayesian updating of a probability distribution over the hypothesis space

\(H\) is the hypothesis variable, values \(h_1, h_2, . . .\), prior \(P(H)\)

\(j\)th observation of \(d_j\) gives the outcome of random variable \(D_j\)
training data \(d = d_1,..., d_N\)

Given the data so far, each hypothesis has a posterior probability:

\[ \begin{flalign} P(h_i|d) = \alpha P(d|h_i)P(h_i) \end{flalign} \]

where \(P(d|h_i)\) is called the likelihood

Predictions use a likelihood-weighted average over the hypotheses:

\[ \begin{flalign} P(H|d) = \sum_i{P(H|d,h_i)P(h_i|d)} = \sum_i{P(H|h_i)P(h_i|d)} \end{flalign} \]

Example

Suppose there are five kinds of bags of candies:

  • 10% are \(h_1\) : 100% cherry candies
  • 20% are \(h_2\) : 75% cherry and 25% lime candies
  • 40% are \(h_3\) : 50% cherry and 50% lime candies
  • 20% are \(h_4\) : 25% cherry and 75% lime candies
  • 10% are \(h_5\) : 100% lime candies

Then we draw 5 candies from some bag (\(d_1,...,d_10\)), which are all lime candies.

What kind of bag is it?
What flavor will the next candy be?

Posterior probability of hypotheses

\(P(h_k|d) = αP(d|h_k)P(h_k)\)

\(P(h1 | 5\;limes) = αP(5\;limes | h1)P(h1) = \alpha · 0.0^5 · 0.1 = 0\) \(P(h2 | 5\;limes) = αP(5\;limes | h2)P(h2) = \alpha · 0.25^5· 0.2 = 0.000195α\) \(P(h3 | 5\;limes) = αP(5\;limes | h3)P(h3) = \alpha · 0.5^5 · 0.4 = 0.0125α\) \(P(h4 | 5\;limes) = αP(5\;limes | h4)P(h4) = \alpha · 0.75^5 · 0.2 = 0.0475α\) \(P(h5 | 5\;limes) = αP(5\;limes | h5)P(h5) = \alpha · 1.0^5 · 0.1 = 0.1α\)

\(\alpha = 1/(0 + 0.000195 + 0.0125 + 0.0475 + 0.1) = 6.2424\)

\(P(h_1 | 5\;limes) = 0\)
\(P(h_2 | 5\;limes) = 0.00122\)
\(P(h_3 | 5\;limes) = 0.07803\)
\(P(h_4 | 5\;limes) = 0.29650\)
\(P(h_5 | 5\;limes) = 0.62424\)

Evolution of the five hypotheses

The posterior probabilties change the more evidence we gain.

Figure 4: Posterior probability of candy-hypotheses

Prediction probability

Let’s calculate the probabilit that we draw a lime on our 6th attempt given the data we have (types of bags and observations made).

\[ \begin{flalign} P(d_{j+1}|d) = \sum_k{P(d_{j+1}|d,h_k)P(h_k|X)} = \sum_k{P(d_{j+1}|h_k)P(h_k|d)} \end{flalign} \] \[ \begin{align} P(lime \; on \; 6 | 5 \; limes) & = P(lime \; on \; 6 | h1)P(h1 | 5 \; limes) \\ & + P(lime \; on \; 6 | h2)P(h2 | 5 \; limes) \\ & + P(lime \; on \; 6 | h3)P(h3 | 5 \; limes) \\ & + P(lime \; on \; 6 | h4)P(h4 | 5 \; limes) \\ & + P(lime \; on \; 6 | h5)P(h5 | 5 \; limes) \\ & = 0 Ă— 0 \\ & + 0.25 Ă— 0.00122 \\ & + 0.5 Ă— 0.07830 \\ & + 0.75 Ă— 0.29650 \\ & + 1.0 Ă— 0.62424 \\ & = 0.88607 \\ \end{align} \]

Likelihood-weighted average

Figure 5: Probability that next candy is lime given d

Observations

The Bayesian prediction eventually agrees with the true hypothesis

  • For any fixed prior that does not rule out the true hypothesis, the posterior of any false hypothesis will eventually vanish
  • The Bayesian prediction is optimal and, given the hypothesis prior, any other prediction will be correct less often

However, summing over the hypothesis space is often intractable

âžž Real problems require us to resort to approximate or simplified methods

MAP learning

Summing over the hypothesis space is often intractable (e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes).

A common approximation is to make predictions based on a single most probable hypothesis.

Maximum a posteriori (MAP)6 learning: choose \(h_{MAP}\) maximizing \(P(h_i|d)\)7.

For large data sets, prior becomes irrelevant.

Maximum likelihood (ML) learning8 means, thus, to choose \(h_{ML}\) maximizing \(P(d|h_i)\).

\[ \begin{flalign} P(h_i|d) = \frac{P(d|h_i)P(h_i)}{P(d)} \approx P(d|h_i)P(h_i) \approx P(d|h_{MAP}) \end{flalign} \]

We don’t need to directly calculate \(P(d)\) (i.e., the evidence term) for finding the MAP estimate as \(P(d)\) acts like a constant and does not affect which value maximizes the posterior probability.

MAP learning aims to find the value that maximizes the product of likelihood (\(P(d|h_i)\)) and prior probability (\(P(h_i)\)).

As more data arrive, MAP and Bayesian predictions become closer

âžž Finding MAP hypotheses is often much easier than Bayesian learning9

Usually specialized machine learning algorithms handle the complex calculations behind finding the MAP estimate.

Summary

  • Bayesian learning techniques formulate learning as a form of probabilistic inference
  • Full Bayesian learning gives best possible predictions but is intractable
  • Maximum a posterior learning selects the most likely hypothesis given the data (balancing complexity with accuracy on training data)
  • Maximum likelihood assumes uniform prior, OK for large data sets

Deep learning

Introduction

Conventional machine-learning techniques as those discussed above are limited in their ability to process natural data in their raw form.

Careful engineering and considerable domain expertise are required to transform the raw data into a suitable internal representation from which the learning system could detect or classify patterns in some input.

In representation learning the system uses methods to automatically discover the representations needed for detection or classification.

Deep-learning methods creates multiple levels of representation by transforming the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.

With the composition of enough such transformations, very complex functions can be learned.

Representations

The key aspect of deep learning is that the levels of representations (i.e., layers of features) are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Example: object detection in images

  • An image comes in the form of an array of pixel values (raw input; input layer)
  • The first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image (hidden layer 1)
  • The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions (hidden layer 2)
  • The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects (hidden layer 3)
  • Subsequent layers would detect objects as combinations of the parts of familiar objects

Visualization

Figure 6: Illustration of a deep neural network consisting of multiple representation layers

Application

Deep learning methods can discover intricate structures in high-dimensional data.

Using this capability, deep learning is making major advances in solving problems that could not have been solved before (LeCun, Bengio, and Hinton (2015)), e.g.,

  • image and speech recognition,
  • analysing particle accelerator data,
  • reconstructing brain circuits,
  • predicting the effects of DNA mutations on gene expression and disease
  • natural language understanding (e.g., topic classification, sentiment analysis, question answering and language translation)

Convolutional neural networks

A convolutional neural network (ConvNet) is a specific kind of neural network with multiple layers,designed to process data that come in the form of multiple arrays10

The role of the ConvNet is to reduce the arrays into a form which is easier to process, without losing features which are critical for getting a good prediction.

A ConvNet is structured as a series of stages, which are composed of two types of layers: convolutional layers and pooling layers.

  • The convolution layer: extraction of the high-level features from the input (by applying convolutional operations/filters)
  • The pooling layer: reduction of the spatial size of the convoled feature to decrease the computational power required to process the data through dimensionality reduction.

Development process

Overview

We are still in the early stages of defining a methodology for machine learning projects; the tools and processes are not as well developed as in software engineering

Russel and Norvig (2022, 722ff) propose a process that involves following typical steps

  • Problem formulation
  • Data collection, assessment, and management
  • Model selection and training
  • Checking trustworthiness of the system
  • Operation, monitoring, and maintenance

Problem

Figuring out what problem you want to solve compromises three parts:

  1. Problem: What problem do I solve for my users?
    ➞ Find an objective that you can track and that relates to your “true goals”
  2. Suitability: What parts of the problem(s) can be solved by ML?
    âžž Often not all parts of the problem require ML
  3. Approach: What kind of learning is appropriate?
    âžž Often a semi-supervised learning approach is suitable (few labeled examples, large collection of unlabeled examples)

Data

Real data are messy

ML needs data, a lot of data, of which at least a subset is labeled

Manufacturing these data can be done by own labor or by crowdsourcing (paid, volunteers, users); one might also start with publicly available general-purpose dataset (or a model that has been pretrained) and then add specific data

Maintain a data provenance for all data (i.e., for each columns of your data set, you should know the exact definition, where the data come from, what the possible values are, and who has worked on it)

When data are limited, data augmentation can help (i.e., creating multiple versions of each image by rotating, translating, cropping, or scaling each image, or by changing brightness or color balancing or adding noise)

Feature engineering

After correcting overt errors, the data should be preprocessed so that they can be handled more easily

  • Quantization: forcing a continuous valued input into fixed bins (e.g., waiting time in 0-10, 10-30, 30-60, or >60 minutes)
  • Normalization data: transforming data so that it has a standard deviation of 1
  • Separating categorical attributes: transform the data into separate Boolean attributes, where exactly one of it is true

Importance

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. Domingos (2012)

Model

Before starting with building a model, you might start with getting an intuitive feel for the data (e.g., by means of exploratory data analysis)

There is no guaranteed way to pick the best model class, but there are some rough guidelines:

  • Random forests are good then there are a lot of categorical features, where many of these may be irrelevant
  • Nonparametric methods are good when having a lot of data and no prior model
  • Logistic regression does well when the data are linearly separable (or can be converted to be so)

Do what worked well in similar past problems—and search: run experiments with multiple possible models

Trustworthiness

Doing well with test data is a necessary but not sufficient condition for trust in the model (by you and your stakeholders), it also requires

  • Verification and validation: you test on the training, validation, and datasets, do code reviews, monitoring, and set measures for accountability
    âžž Do the best to ensure that the system will not be wrong and, for the case, set responsibilities
  • Interpretability: you understand how answers relate to inputs âžž Do the best to inspect and interpret your model
  • Explainability: the model helps you to understand why a certain output has been produced for a given input âžž Do the best to explain what your model does11

Operation

After the model is deployed to the users, additional challenges will arise

  • Long tail of user inputs: you might see inputs that were never tested before and you need to know whether your model generalizes well for them (i.e., you need to monitor the performance)
  • Nonstationarity: the world changes over time (e.g., spammers adapt their tactics); you need to consider how often to adapt the model (i.e., find a tradeoff between a well tested model and a model that is built from the latest data)

✏️ Exercises

Exercise

Learning scenarios

Consider following problems:

  1. Me learning to play tennis
  2. An infant learning to speak

Discuss following questions:

  • Explain how this process fits into the general learning model.
  • Describe the percepts and actions of the player.
  • What types of learning I must do?
  • What example data is available?

Learning types

Describe the differences between supervised, unsupervised, and reinforcement learning.

Ockham’s razor

In your own words, explain to us what Ockham’s razor is. Find an example that you can use to enrich your explanation of the concept.

ML concepts

Define the following machine-learning terms in your own words

  1. Inductive learning
  2. Training set
  3. Hypothesis
  4. Bias
  5. Variance
  6. Curve fitting

Qualification problem

Draw a decision tree for the problem of deciding whether to move forward at a road intersection, given that the light has just turned green.

What problems do you see? Argue based on the qualification problem discussed in chapter probability.

Generalize your findings and describe for which kind of problems decision trees are not suitable.

Divide-and-conquer

Create the decision tree by applying the divide-and-conquer approach on the restaurant examples.

Compare the naive tree with the tree gained by applying the divide-and-conquer heuristic. What differences do you see?.

Decision tree

Explain how decision trees can be used to create a learning agent. Relate your answers to the components outlined in the lecture notes.

We never test the same attribute twice along one path in a decision tree. Why not?

Cat recognition

You want to teach a robot to recognize cat, using a set of features (the presence of blue eyes b, stripe s, or spot m). You also have several pictures of dogs to use as negative examples. The following table summarizes the content of training samples. Each binary feature is represented as 1 (meaning the feature is present) or 0 (meaning it is absent). The subject of y of the photo is encoded as 1 for cat or -1 for dog.

m b s Subject y
0 0 0 1
1 0 0 1
1 1 0 1
0 1 1 1
1 0 1 -1
1 1 1 -1
Table 2: Observations
  1. Suppose the robot has a Naive Bayes-based brain. Write the Naive Bayes classification rule for this problem (i.e. write a formula which, given a data point x = (m, b, s), returns the most likely subject y). Write the formula in terms of likelyhood (conditional probabilties) and prior probabilities.
  2. Give estimates for the parameters of the classification rule based on the training samples.
  3. Suppose a subject which has stripes but no spots or blue eyes. What would happen if the robot had to guess the identity of the subject?

Statisticians

Two statisticians go to the doctor and are both given the same prognosis: A 40% chance that the problem is the deadly disease A, and a 60% chance of the fatal disease B. Fortunately, there are anti-A and anti-B drugs that are inexpensive, 100% effective, and free of side-effects. The statisticians have the choice of taking one drug, both, or neither.

What will the first statistician (an avid Bayesian) do? How about the second statistician, who always uses the maximum likelihood hypothesis?

The doctor does some research and discovers that disease B actually comes in two versions, dextro-B and levo-B, which are equally likely and equally treatable by the anti-B drug.

Now that there are three hypotheses, what will the two statisticians do?

Literature

Domingos, Pedro. 2012. “A Few Useful Things to Know about Machine Learning.” Communications of the ACM 55 (10): 78–87.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44.
Russel, Stuart, and Peter Norvig. 2022. Artificial Intelligence: A Modern Approach. Harlow: Pearson Education.

Footnotes

  1. The decision is a diagonal line, and all decision tree tests divide the space up into rectangular, axis-aligned boxes. We would have to stack a lot of boxes to closely approximate the diagonal line.

  2. The selection is implemented by means of a choosing attribute test, which is based on information theory and measures the information gain from the attribute tests. For more details please refer to Russel and Norvig (2022, 681 ff)

  3. Meaning that all paths in the tree will be short and the tree as a whole will be shallow

  4. A tree that is too large risks overfitting the training data and poorly generalizing to new samples.

  5. The training set is divided into two groups; 70% of the training set is used to build the tree, and the remaining 30% for validation; leading to three data sets (training, validation, test)

  6. Pronounced “em-ay-pee”

  7. Which is equal to maximizing \(P(d|h_i)P(h_i)\) or \(\ log P(d|h_i) + \log P(h_i)\)

  8. Maximum likelihood is the “standard” (non-Bayesian) statistical learning method

  9. MAP learning provides a natural embodiment of Ockham’s razor

  10. 1D for signals an sequences such as language, 2D for images or audio spectograms, 3D for video

  11. Regulations such as the European GDPR require systems to provide explanations