Introduction to ML

🧠 Introduction to AI

Andy Weeger

Neu-Ulm University of Applied Sciences

April 15, 2025

Introduction

Characteristics

Learning agents are those that can improve their behavior through diligent study of past experiences and predictions of the future. Russell and Norvig (2022, 668)

At its core, a learning agent (LA):

  • Uses machine learning (ML) when realized as a computer system
  • Improves performance based on experience (observations)
  • Is necessary when designers lack complete knowledge of environments
  • Solves problems that are difficult to program explicitly (e.g., face recognition)

Definition

A computer is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Mitchell (1997, 2)

Any ML project needs to clearly specify:

  • The task T (what problem are we solving?)
  • The experience E (what data will the system learn from?)
  • The performance measure P (how will we evaluate success?)

Exercise

Select one of the specific applications for ML below and define T, E, and P.

  1. Face recognition system
  2. Language translation service
  3. Credit card fraud detection

Why learning works

How can we be sure that our learned hypothesis will predict well for previously unseen inputs? I.e., how do we know that the hypothesis \(h\) is close to the target function \(f\) when \(f\) is unknown?

The underlying principle of computational learning theory is that any hypothesis that is seriously wrong will almost certainly be “found out” after a small number of examples, because it will make an incorrect prediction on at least some of them.

Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct (PAC).
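
For a finite hypothesis space \(\mathcal{H}\), this intuition can be made precise with a standard PAC sample-complexity bound, where \(\epsilon\) is the allowed error and \(\delta\) the allowed failure probability:

\[N \geq \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|\mathcal{H}|\right)\]

If the learner sees at least \(N\) examples, then with probability at least \(1-\delta\) any hypothesis that is consistent with all of them has error at most \(\epsilon\).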

ML x AI

ML constitutes one of the four main categories of AI and is currently the dominant approach in AI applications.

flowchart TD
    AI[Artificial Intelligence] --> ML[Machine Learning]
    AI --> SP[Search and Planning]
    AI --> KI[Knowledge and Inference]
    AI --> MU[Modeling of Uncertainty]
    
    ML --> SL[Supervised Learning]
    ML --> UL[Unsupervised Learning]
    ML --> RL[Reinforcement Learning]
    
    style AI fill:#000,stroke:#000,stroke-width:1px,color:#fff
    style ML fill:#0333ff,stroke:#0333ff,stroke-width:1px,color:#fff

LA architecture

Visualization

Figure 1: A learning agent based on Russell and Norvig (2022, 74)

Building blocks

Performance element: Processes percepts and chooses actions (relates to the basics of AI we have studied so far)

Learning element: Carries out improvements. Requires awareness and feedback on how the agent is doing in the environment

Critic: Evaluates the agent’s behavior against a given external performance standard and provides feedback

Problem generator: Suggests explorative actions that lead the agent to new experiences
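
To make the interplay concrete, a structural sketch in Python follows; the class and callable names are hypothetical and heavily simplified relative to a real agent:

# Structural sketch of the learning-agent loop (hypothetical names, heavily simplified).
class LearningAgent:
    def __init__(self, performance_element, learning_element, critic, problem_generator):
        self.performance_element = performance_element  # maps percepts to actions
        self.learning_element = learning_element        # improves the performance element
        self.critic = critic                            # scores behavior against an external standard
        self.problem_generator = problem_generator      # proposes exploratory actions

    def step(self, percept):
        feedback = self.critic(percept)                             # how well is the agent doing?
        self.learning_element(self.performance_element, feedback)   # use feedback to improve
        action = self.performance_element(percept)                  # choose the next action
        return self.problem_generator(action)                       # possibly replace it with an exploratory one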

The learning element

The design of the learning element is influenced by four important aspects:

  • Which component of the performance element is to be improved?
  • What representation should be chosen?
    (i.e., model type)
  • What prior information is available?
    (i.e., prior knowledge that influences the model)
  • What form of feedback is available?

Supervised learning

Visualization

Figure 2: Training phase of supervised learning
Figure 3: Application phase of supervised learning

Key challenges

  • Getting enough labeled data
  • Ensuring labels are accurate
  • Dealing with imbalanced data classes
  • Feature selection and engineering

Practical applications

  • Sentiment analysis
  • Spam detection
  • Loan approval prediction
  • Image classification
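
A minimal sketch of the training and application phases, assuming scikit-learn; the synthetic dataset, the model choice (logistic regression), and the split ratio are illustrative assumptions:

# Minimal supervised-learning sketch (assumes scikit-learn is installed).
# Task T: classify points into two classes; experience E: labeled examples;
# performance measure P: accuracy on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# E: synthetic labeled data (features X, labels y)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training phase: fit the model to labeled examples
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Application phase: predict labels for unseen inputs
y_pred = model.predict(X_test)

# P: evaluate how well the predictions match the true labels
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.2f}")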

Unsupervised learning

Visualization

Figure 4: Training phase of unsupervised learning
Figure 5: Application phase of unsupervised learning

Key challenges

  • Evaluating the quality of results without ground truth
  • Determining the optimal number of groups or patterns
  • Interpreting the discovered patterns meaningfully
  • Dealing with high-dimensional data

Practical applications

Computer vision: when shown millions of images, a computer vision system can identify large clusters of similar images (without “knowing” what these images depict).

  • Customer segmentation
  • Anomaly detection
  • Topic modeling in text
  • Recommender systems
  • Image compression
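
A minimal clustering sketch, again assuming scikit-learn; k-means, the choice of three clusters, and the silhouette score as a stand-in for the missing ground truth are illustrative assumptions:

# Minimal unsupervised-learning sketch (assumes scikit-learn is installed).
# There are no labels: the algorithm groups similar points on its own.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# E: unlabeled data only (the true group memberships are discarded)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Training phase: discover structure (assuming 3 clusters here)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Without ground truth, evaluation relies on internal criteria such as the
# silhouette score (higher means tighter, better-separated clusters).
print(f"Silhouette score: {silhouette_score(X, labels):.2f}")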

Reinforcement learning

Visualization

Figure 6: Reinforcement learning

Key challenges

  • Designing appropriate reward functions
  • Balancing exploration vs. exploitation
  • Dealing with delayed rewards and credit assignment
  • Sample efficiency in real-world applications
  • Transferring learning across different environments

Practical applications

Game playing: imagine an agent that is told only at the end of a game whether it has won (a reward) or lost (a punishment). Based on that feedback, it must determine which of the actions taken before the reinforcement were most responsible for the outcome and adjust its behavior to earn more rewards in the future.

  • Game playing (Chess, Go, video games)
  • Robotics and control systems
  • Resource management and scheduling
  • Recommendation systems
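
As an illustration, a minimal tabular Q-learning sketch on a hypothetical one-dimensional corridor; the environment, reward scheme, and hyperparameters are all illustrative assumptions, not a production setup:

# Minimal tabular Q-learning sketch on a hypothetical one-dimensional corridor.
# The agent starts in state 0 and is rewarded only upon reaching the goal state,
# so it must work out which earlier actions were responsible for the delayed reward.
import numpy as np

n_states, goal = 5, 4                       # states 0..4, reward only in state 4
actions = [-1, +1]                          # move left or move right
Q = np.zeros((n_states, len(actions)))      # action-value table
alpha, gamma, epsilon = 0.1, 0.9, 0.2       # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != goal:
        if rng.random() < epsilon:          # exploration: try a random action
            a = int(rng.integers(len(actions)))
        else:                               # exploitation: best known action, random tie-break
            a = int(rng.choice(np.flatnonzero(Q[state] == Q[state].max())))
        next_state = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update: nudge the estimate towards reward + discounted future value
        Q[state, a] += alpha * (reward + gamma * Q[next_state].max() - Q[state, a])
        state = next_state

print(np.argmax(Q, axis=1))                 # learned policy: "move right" (index 1) for states 0-3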

The learning process

Phases

  • Training phase
  • Validation phase
  • Test phase
  • Operational phase
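
A sketch of how the first three phases map onto data splits in practice (assuming scikit-learn; the 60/20/20 ratios are purely illustrative):

# Sketch of splitting data for the phases listed above (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First hold out a test set, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Training phase:    fit candidate models on (X_train, y_train)
# Validation phase:  compare models and tune hyperparameters on (X_val, y_val)
# Test phase:        estimate generalization once, on (X_test, y_test)
# Operational phase: deploy the chosen model and monitor it on live data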

Model complexity

The bias-variance tradeoff

In ML, selecting the appropriate model complexity is a fundamental challenge. Here, this challenge is demonstrated through polynomial curve fitting, one of the simplest yet most illustrative examples of the bias-variance tradeoff.

When building a machine learning model, we must balance two competing concerns:

  1. Bias: The error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss relevant relations between features and outputs (underfitting).
  2. Variance: The error from sensitivity to small fluctuations in the training set. High variance models can fit the training data very well but perform poorly on new, unseen data (overfitting).

Polynomial curve fitting example

In this visualization, we’ll see how polynomials of different degrees fit a dataset generated from a true quadratic function with some added noise:

Figure 7: Comparison of linear underfitting, quadratic good fit, and high-degree overfitting to noisy data generated from a quadratic function.
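
A short NumPy sketch can reproduce this kind of comparison; the true function, noise level, and polynomial degrees below are illustrative assumptions rather than the exact settings behind Figure 7:

# NumPy sketch: fit polynomials of increasing degree to noisy samples of a
# quadratic function and compare training vs. test error.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 1.0 + 2.0 * x - 1.5 * x**2                       # the "true" quadratic function

x_train = np.linspace(0, 1, 15)
y_train = f(x_train) + rng.normal(scale=0.1, size=x_train.size)   # noisy training data
x_test = np.linspace(0, 1, 200)
y_test = f(x_test) + rng.normal(scale=0.1, size=x_test.size)

for degree in (1, 2, 12):                                   # underfit, good fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)           # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

The typical outcome is that the high-degree fit has the lowest training error but a clearly higher test error than the quadratic fit, which is the overfitting pattern described below.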

Underfitting

The linear model (blue line) is too simple to capture the underlying pattern. It has:

  • High bias: The model makes strong assumptions about the data structure (linearity)
  • Low variance: Different training sets would produce similar models
  • High training and test error: The model fails to capture the fundamental pattern

Good fit

The quadratic model (purple line) captures the underlying pattern well. It has:

  • Balanced bias and variance: The model makes appropriate assumptions
  • Low training and test error: Performs well on both seen and unseen data
  • Good generalization: Likely to predict new data points accurately

Overfitting

The high-degree polynomial (red line) fits training data closely but wiggles excessively. It has:

  • Low bias: Makes few assumptions about the data structure
  • High variance: Would change dramatically with different training sets
  • Low training error but high test error: Memorizes the training data rather than learning the pattern

The learning principle

This example illustrates Ockham’s razor in action:

The simplest model that adequately explains the data is likely to have the best predictive power. While we could create a complex polynomial that passes through every training point perfectly, such a model would likely perform poorly on new data.

Conclusion

  • Machine learning fundamentally changes how we approach problem-solving with computers
  • Instead of explicit programming, we design systems that learn from data and experience
  • The field combines statistics, optimization, and domain knowledge
  • Understanding the core concepts helps in developing effective learning systems
  • Key tradeoffs include:
    • Bias vs. variance
    • Model complexity vs. generalization
    • Exploration vs. exploitation
    • Accuracy vs. interpretability

Literature

Mitchell, T. 1997. Machine Learning (McGraw-Hill International Editions). McGraw-Hill Education. https://books.google.de/books?id=dMp2uwEACAAJ.
Russell, Stuart, and Peter Norvig. 2022. Artificial Intelligence: A Modern Approach. Harlow: Pearson Education.

Footnotes

  1. Examples of Goodhart’s Law in ML: A recommendation system optimized solely for clicks might discover that clickbait titles and thumbnails maximize this metric, even if content quality suffers and user satisfaction decreases in the long term. If a content filter is optimized only to minimize false negatives (letting harmful content through), it might become overly restrictive and block large amounts of legitimate content.

  2. To illustrate the curse of dimensionality: imagine a unit hypercube (with sides of length 1) in different dimensions. In 2D, it has area 1. In 3D, volume 1. In 100D, to capture just 1% of the hypercube’s volume, you’d need to extend 0.955 units along each dimension, meaning 99% of the volume is in the “corners.” This explains why data points become increasingly distant from each other and distance metrics become less useful as dimensions increase. When implementing unsupervised methods, success depends on careful feature selection, appropriate distance metrics, and clear alignment with the underlying questions you’re trying to answer.
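
    A quick check of that number: a sub-cube capturing a fraction \(p\) of a unit hypercube’s volume in \(d\) dimensions needs side length \(p^{1/d}\); with \(p = 0.01\) and \(d = 100\), this gives \(0.01^{1/100} \approx 0.955\).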

  3. Lift measures how much more likely the consequent (Y) is when the antecedent (X) is present, compared to when the antecedent is absent.

    \[\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)}\]

    Where:

    \[\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \text{ and } Y)}{\text{Support}(X)}\] \[\text{Support}(Y) = \frac{\text{Count}(Y)}{\text{Total Transactions}}\]

    Interpretation:
    Lift > 1: Positive correlation (products appear together more than expected by chance)
    Lift = 1: No correlation (independence)
    Lift < 1: Negative correlation (products appear together less than expected by chance)
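
    As a worked example with hypothetical numbers: if \(\text{Support}(X \text{ and } Y) = 0.2\), \(\text{Support}(X) = 0.4\), and \(\text{Support}(Y) = 0.25\), then \(\text{Confidence}(X \rightarrow Y) = 0.2 / 0.4 = 0.5\) and \(\text{Lift}(X \rightarrow Y) = 0.5 / 0.25 = 2\), i.e., \(Y\) is twice as likely to appear when \(X\) is present than on average.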

  4. Example of reward hacking: a reinforcement learning agent tasked with playing a video game might discover an unintended bug that produces high scores without completing the actual objective. Rather than learning the intended gameplay strategy, it optimizes for exploiting this glitch.