Introduction to AI (I2AI)
Neu-Ulm University of Applied Sciences
June 1, 2025
What makes handwritten digit recognition trivial for humans but extremely difficult for traditional programming?
Traditional programming approaches fail at tasks that humans find effortless.
For instance:
Traditional programming:
\(Input + Program \rightarrow Output\)
Machine learning:
\(Input + Output \rightarrow Program\)
Differences
Neural networks solve problems that traditional programming cannot handle:
If you had to design a learning system inspired by the brain, what key components would you include?
A neuron
receives inputs → weights them → sums up → activates
The hierarchical organization of neural networks mirrors how human visual processing works, and this parallel isn’t coincidental — it’s one of the key insights that makes deep learning so powerful.
Each connection between neurons has a weight (positive or negative) — a number that gets adjusted during learning.
Weight mechanics
Positive vs. negative weights:
Weight magnitude:
The mathematical foundation
This weighted sum with bias is the fundamental computation in neural networks. The weights determine how much influence each input has on the output, while the bias determines the baseline level of activation.
The sigmoid function \(\sigma(x) = \frac{1}{1 + e^{-x}}\) serves as a “squashing” function that ensures the output stays between 0 and 1, regardless of how large or small the weighted sum becomes. This is crucial for maintaining the “activation” interpretation of neuron outputs. Other commonly used activation functions include tanh, ReLU, and leaky ReLU.
The bias is particularly important because it allows the neuron to fire even when all inputs are zero, or to require a higher threshold before firing. Without bias, neurons could only learn patterns that pass through the origin, severely limiting the network’s expressiveness.
Understanding this computation is key to grasping how neural networks work: each neuron computes a weighted combination of its inputs, adds a bias, and applies a nonlinear function to produce its output. This forms the basis of the backpropagation algorithm developed by Rumelhart, Hinton, and Williams (1986).
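In code, this single-neuron computation can be sketched as follows (the input values, weights, and bias below are invented purely for illustration):

```python
import math

def sigmoid(x):
    # Squashing function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through the activation
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical example values
print(neuron(inputs=[0.5, 0.9, 0.1], weights=[0.8, -0.4, 1.2], bias=-0.3))
```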
This perspective - viewing neural networks as complex mathematical functions - is crucial for understanding their power and limitations. The Universal Approximation Theorem (Cybenko 1989; Hornik, Stinchcombe, and White 1989) tells us that neural networks with sufficient hidden units can approximate any continuous function to arbitrary accuracy.
The weights and biases represent the “knobs and dials” that can be adjusted to make the network compute any function we want (within the constraints of the architecture). Training is the process of finding the right setting for these parameters.
The power of neural networks comes from this massive number of adjustable parameters, which allows them to learn complex patterns in data. However, this also presents challenges: how do we find the right values for all these parameters? This is where the learning algorithms come in.
Example architecture for detecting digits of the MNIST dataset1:
28×28 pixels → Neural Network → 10 probabilities
The architecture of our digit recognition network represents a carefully designed pipeline for transforming raw pixel data into digit classifications. Let’s understand why this specific structure makes sense:
Input layer (784 neurons):
Hidden layer 1 (16 neurons):
Hidden layer 2 (16 neurons):
Output layer (10 neurons):
Total Parameters:
This seems like a lot, but it’s actually quite modest by modern standards. Large language models can have billions of parameters. The key insight is that all these parameters work together to create a flexible function that can map any 28×28 image to a probability distribution over the 10 digit classes.
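The parameter count quoted below can be checked directly from the layer sizes (784, 16, 16, 10): each pair of adjacent layers contributes one weight per connection plus one bias per neuron in the receiving layer.

```python
# Layer sizes of the example network: 784 -> 16 -> 16 -> 10
layers = [784, 16, 16, 10]

total = 0
for n_in, n_out in zip(layers, layers[1:]):
    total += n_in * n_out   # weights between the two layers
    total += n_out          # one bias per neuron in the receiving layer

print(total)  # 13002
```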
Goal: Find the values of all k parameters that make the network classify digits correctly.
Challenge: This is a k-dimensional optimization problem!
(In our digit example it is 13,002-dimensional)
We need a systematic way to:
Optimizing in 13,002 dimensions is conceptually challenging for humans to visualize, but mathematically tractable. Each dimension represents one parameter (weight or bias) in the network.
The challenge is immense: with 13,002 parameters, there are potentially infinite ways to set these values. Most combinations will perform poorly, and we need to find the tiny subset that actually works well for digit recognition.
Traditional optimization approaches (like trying random combinations or exhaustive search) would take longer than the age of the universe. We need smarter approaches that can navigate this high-dimensional space efficiently.
The key insight is that we can use calculus — specifically derivatives — to determine the direction of steepest improvement. This allows us to make educated guesses about how to adjust parameters rather than random exploration.
Let’s measure “wrongness”
For a single training example, if the network outputs \((a_0, a_1, ..., a_9)\) but the correct answer is digit \(k\):
Desired output: \((0, 0, ..., 1, ..., 0)\) (1 in position \(k\), 0 elsewhere)
Cost for this example:
\(C = \sum_{j=0}^{9} (a_j - y_j)^2\)
where \(y_j\) is the desired output for neuron \(j\).
Why squared differences?
The squared error cost function has several nice properties:
For digit recognition, if the correct answer is “3”, we want the output neuron for “3” to be close to 1 and all other output neurons to be close to 0.
The cost function measures how far we are from this ideal. When the network is confident and correct, the cost is low. When the network is uncertain or wrong, the cost is high.
Alternative cost functions exist (like cross-entropy), but squared error is conceptually simpler and works well for educational purposes.
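As a small sketch, the cost for one training example whose correct label is 3 could be computed like this (the network outputs shown are invented for illustration):

```python
def squared_error_cost(outputs, target_digit):
    # Desired output: 1.0 at the target digit's position, 0.0 everywhere else
    desired = [1.0 if j == target_digit else 0.0 for j in range(len(outputs))]
    return sum((a - y) ** 2 for a, y in zip(outputs, desired))

# Hypothetical network outputs for an image whose correct label is 3
outputs = [0.1, 0.0, 0.2, 0.7, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0]
print(squared_error_cost(outputs, target_digit=3))  # fairly confident and correct: low cost
```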
Intuition: Imagine the cost function as a landscape with hills and valleys. We want to find the lowest valley (minimum cost).
Gradient descent algorithm:
The geography of optimization
The landscape metaphor is powerful but limited. In 13,002 dimensions, we can’t visualize the actual landscape, but the mathematical principles remain the same.
Key insights about gradient descent:
Local vs global minima: Like a real landscape, the cost function may have multiple valleys. Gradient descent finds a local minimum (nearby valley) but might miss the global minimum (deepest valley overall).
Learning rate: This is a crucial hyperparameter:
High-dimensional intuition: In high dimensions, most points are neither maxima nor minima, but saddle points. This actually helps optimization because there are usually many directions that lead downhill.
Why it works: Even though we can’t visualize 13,002-dimensional space, the mathematical guarantee is that moving in the negative gradient direction will decrease the cost (at least for small steps).
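A minimal sketch of the update rule, using a one-dimensional toy cost rather than the network’s 13,002-dimensional cost (the cost function and learning rate are illustrative):

```python
def cost(w):
    # Toy cost with a single minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the toy cost
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # step size, a crucial hyperparameter

for step in range(50):
    w -= learning_rate * gradient(w)  # move against the gradient, i.e. downhill

print(w, cost(w))  # w approaches 3, the cost approaches 0
```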
Challenge: How do we compute the gradient of the cost function with respect to all k parameters efficiently?
Backpropagation algorithm:
This elegant algorithm, formalized by Rumelhart, Hinton, and Williams (1986), makes training deep networks computationally feasible (Sanderson 2017b).
The mathematical elegance of backpropagation
Backpropagation is essentially an efficient application of the chain rule from calculus. The key insight is that we can compute gradients by working backwards through the network.
Forward Pass Example: Input → Layer 1 → Layer 2 → Output → Cost
Backward Pass: Cost → ∂Cost/∂Output → ∂Cost/∂Layer2 → ∂Cost/∂Layer1 → ∂Cost/∂Weights
For each parameter, we ask: “If I change this parameter by a tiny amount, how much does the cost change?” The chain rule lets us compute this efficiently by decomposing the influence into steps.
Think of it as tracing cause and effect:
Why “Backpropagation”?: We propagate the error backwards through the network. Starting from the final cost, we compute how much each layer contributed to that cost, then how much each neuron contributed, and finally how much each weight contributed.
This algorithm is remarkably efficient: computing the gradient for all parameters takes roughly the same computational time as computing the network’s output itself. This efficiency made training deep networks practical (Sanderson 2017b).
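As a sketch of this chain-rule bookkeeping, consider a toy chain with one neuron per layer (all values below are invented); the analytic gradient of the cost with respect to the first weight matches a brute-force numerical estimate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(a0, w1, b1, w2, b2):
    # Tiny chain: input a0 -> hidden a1 -> output a2, one neuron per layer
    a1 = sigmoid(w1 * a0 + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return a1, a2

a0, y = 0.8, 1.0                       # invented input activation and target
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2   # invented parameters

a1, a2 = forward(a0, w1, b1, w2, b2)

# Backward pass: chain rule from the cost C = (a2 - y)^2 back to w1
dC_da2 = 2.0 * (a2 - y)
da2_dz2 = a2 * (1.0 - a2)    # derivative of sigmoid at the output neuron
dz2_da1 = w2
da1_dz1 = a1 * (1.0 - a1)    # derivative of sigmoid at the hidden neuron
dz1_dw1 = a0
dC_dw1 = dC_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Sanity check: nudge w1 slightly and measure how the cost actually changes
eps = 1e-6
cost_plus = (forward(a0, w1 + eps, b1, w2, b2)[1] - y) ** 2
cost_minus = (forward(a0, w1 - eps, b1, w2, b2)[1] - y) ** 2
print(dC_dw1, (cost_plus - cost_minus) / (2 * eps))   # the two values agree
```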
Through millions of such cycles, the network gradually learns to recognize even complex patterns.
The remarkable thing is that complex behaviors (like recognizing handwriting) emerge from this simple process of error correction.
This transformation from random guesses to intelligent recognition happens purely through this iterative process of prediction, error measurement, and weight adjustment. No human explicitly programs the features - the network discovers these patterns automatically through experience.
There are three main approaches to gradient descent:
Batch Gradient Descent: Use all training examples to compute gradient
Stochastic Gradient Descent (SGD): Use one example at a time
Mini-batch SGD: Use small batches (typically 16-256 examples)
Mini-batches provide several advantages:
The choice of batch size is another hyperparameter that affects training dynamics and final performance (Sanderson 2017a).
Mini-batch stochastic gradient descent:
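In code, the training loop looks roughly like the following sketch, which trains a simple linear model instead of the full digit network (the toy data, learning rate, batch size, and epoch count are all illustrative choices):

```python
import random

# Toy dataset for a linear model y ≈ 2x + 1 (stands in for the digit data)
data = [(x, 2.0 * x + 1.0 + random.uniform(-0.1, 0.1)) for x in [i / 100 for i in range(100)]]

w, b = 0.0, 0.0          # the two parameters we want to learn
learning_rate = 0.3
batch_size = 16          # a typical mini-batch size
epochs = 100             # one epoch = one full pass through the training data

for epoch in range(epochs):
    random.shuffle(data)                      # stochastic: new batch composition every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the mean squared error, averaged over the mini-batch
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= learning_rate * grad_w           # one gradient descent step per mini-batch
        b -= learning_rate * grad_b

print(w, b)  # roughly w = 2 and b = 1
```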
Neural networks excel at
We’ve learned how neural networks can recognize digits. How might we extend this approach to understand and generate language?
From images to language
Key differences between images and text:
We need architectures designed specifically for sequential data with long-range dependencies.
Standard neural networks vs. language
Standard neural networks, like our digit classifier, have limitations for language:
Early attempts to solve this included:
The breakthrough came with Transformers (Vaswani et al. 2017), which solved these problems through a fundamentally different approach: attention mechanisms that allow every word to directly interact with every other word in the sequence.
A transformer is a neural network architecture specifically designed for processing sequences.
The attention mechanism is the key innovation — it allows every element in the sequence to “attend to” every other element.
Transformer architecture
The Transformer architecture, introduced in the landmark 2017 paper “Attention Is All You Need” (Vaswani et al. 2017), revolutionized natural language processing. The key insight was that attention mechanisms could replace recurrent and convolutional layers entirely.
Before transformers, most language models were based on RNNs or CNNs, which processed sequences step-by-step or with limited context windows. This made them slow to train and limited in their ability to capture long-range dependencies.
The attention mechanism allows for:
The impact has been enormous:
The name “Transformer” comes from its ability to transform input sequences into output sequences through the attention mechanism.
Consider these sentences:
The word “tower” should mean different things in different contexts:
The attention mechanism allows context words to update the meaning of other words.
Tokenization means that text is broken down into small chunks called tokens — a crucial preprocessing step that bridges human language and machine processing.
Each token gets converted to a high-dimensional vector (e.g., 12,288 dimensions for GPT-3) — a so-called embedding vector.
This vector representation is what the transformer actually processes - it never sees raw text, only these numerical vectors (Sanderson 2024a).
Directions in embedding space can encode semantic relationships.
Examples:
The embedding layer learns to place semantically related words close together in the vector space.
The geometry of meaning
Word embeddings reveal that meaning has geometric structure. This isn’t just a mathematical curiosity - it reflects how language itself is structured:
Analogical reasoning — the famous “king - man + woman = queen” example shows that semantic relationships can be captured as vector operations. This suggests that certain directions in the embedding space consistently encode specific semantic properties.
Semantic clusters — words with similar meanings cluster together:
Hierarchical structure — the space can capture hierarchies:
Cultural and linguistic biases — embeddings can capture societal biases present in training data:
Training process — These embeddings aren’t hand-crafted but learned from data. The model discovers these geometric relationships by seeing how words are used together in context (Sanderson 2024a).
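A toy sketch of this vector arithmetic, using made-up 3-dimensional embeddings rather than learned ones (real embeddings have hundreds or thousands of dimensions and are learned from data):

```python
import math

# Hypothetical 3-dimensional embeddings; real models learn these from data
embeddings = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# king - man + woman
target = [k - m + w for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])]

# The nearest word by cosine similarity turns out to be "queen"
print(max(embeddings, key=lambda word: cosine(embeddings[word], target)))
```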
Rather than having fixed embeddings for each word, attention allows the embedding to be dynamically updated based on what other words are present in the context. This creates context-sensitive representations that can capture these nuanced meanings.
Goal: Update the embedding of a word based on the context of that word.
Three key matrices (learned during training)4:
Process:
Let’s trace how attention helps resolve the ambiguity of “bank” (financial institution vs. riverbank).
The target word is “bank” (it needs contextual disambiguation); the context word is “flooded”.
The attention process
The ambiguity is resolved: we’re talking about a riverbank, not a financial institution
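A minimal numerical sketch of this query/key/value computation, with tiny invented dimensions and random matrices standing in for the learned \(W_Q\), \(W_K\), and \(W_V\) (a trained model would have learned them from data):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_head = 8, 4                  # tiny, invented dimensions
tokens = 3                              # a tiny toy sequence, e.g. "... bank ... flooded"
X = rng.normal(size=(tokens, d_model))  # stand-in token embeddings

# Projection matrices (random here; learned by backpropagation in practice)
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Attention scores: how relevant is each token to each other token?
scores = Q @ K.T / np.sqrt(d_head)

# Softmax turns each row of scores into weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each token's update is a weighted mix of the value vectors of all tokens
update = weights @ V
print(weights.round(2))   # each row: how strongly one token attends to every token
print(update.shape)       # (3, 4): one contextual update per token
```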
In reality different types of relationships matter simultaneously, such as
Each head learns to specialize in different types of patterns and relationships.
GPT-3 example: 96 attention heads per layer × 96 layers = 9,216 total attention heads
After attention, each token passes through a feed-forward network (FFN).
FFNs are the “thinking” components that sit between attention layers in transformers. While attention figures out what information to gather, FFNs decide what to do with that information.
Example:
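A rough sketch of such a feed-forward block, assuming the common choice of expanding to four times the embedding dimension (an assumption for illustration, not something specified here):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8                       # tiny, invented embedding size

# Two weight matrices: expand to a larger hidden size, then project back
W1, b1 = rng.normal(size=(d_model, 4 * d_model)), np.zeros(4 * d_model)
W2, b2 = rng.normal(size=(4 * d_model, d_model)), np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU nonlinearity
    return hidden @ W2 + b2               # back to the embedding dimension

token_embedding = rng.normal(size=d_model)
print(ffn(token_embedding).shape)         # (8,): same shape as the input embedding
```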
Residual connections and layer normalization are two further ingredients that make deep transformers stable and trainable.
A residual connection means you add the input back to the output of a layer:
output = Layer(input) + input
Or more specifically:
contextualized_embedding = attention(original_embedding) + original_embedding
final_output = ffn(contextualized_embedding) + contextualized_embedding
Without residual connections: Information can get “lost” or distorted as it passes through many layers. With residual connections: The original information is always preserved and combined with the processed version.
Layer normalization standardizes the values within each embedding vector to have mean close to 0 and standard deviation close to 1.
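A compact sketch of both ideas together (the post-norm arrangement shown here is one common variant; others exist):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standardize the values within one embedding vector
    return (x - x.mean()) / (x.std() + eps)

def block(x, sublayer):
    # Residual connection: add the input back to the sublayer's output,
    # then normalize the result
    return layer_norm(sublayer(x) + x)

x = np.array([2.0, -1.0, 0.5, 3.0])
out = block(x, sublayer=lambda v: v * 0.1)      # stand-in for attention or the FFN
print(out.mean().round(6), out.std().round(6))  # close to 0 and 1
```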
From vectors back to text.
The unembedding process is how transformers convert their internal vector representations back into text predictions. It’s the crucial final step that makes language generation possible.
Process
Example
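A sketch of this final step with tiny invented sizes (GPT-3 uses an embedding dimension of 12,288 and a vocabulary of 50,257 tokens; the unembedding matrix here is random rather than learned):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab_size = 8, 10        # tiny, invented sizes

W_unembed = rng.normal(size=(d_model, vocab_size))  # unembedding matrix (learned in practice)
final_vector = rng.normal(size=d_model)             # the last token's processed embedding

logits = final_vector @ W_unembed                   # one raw score per vocabulary token
probs = np.exp(logits) / np.exp(logits).sum()       # softmax turns scores into probabilities
print(probs.round(3), probs.sum())                  # a probability distribution over the next token
```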
Key principle: Information flows through many layers of attention and processing (i.e., built through deep learning), allowing complex reasoning to emerge.
No explicit labels needed — the text itself provides the training signal.
Next-token prediction seems simple but is remarkably powerful (Radford et al. 2019):
As models scale up, they develop capabilities that weren’t explicitly programmed:
Complex intelligence seems to emerge from the simple objective of predicting the next word.
Despite their impressive capabilities, current language models have significant limitations:
Please check the resources provided by 3Blue1Brown on the basics of neural networks, and the math behind how they learn.
Given what we’ve learned about neural networks and transformers, what do you think are the most important challenges we need to solve to make AI systems more reliable and beneficial?
Design a neural network for classifying emails as spam or not spam. Specify:
Discuss the advantages and challenges of this approach compared to rule-based spam filtering.
Solution notes
Input representation options
Output interpretation
Training data requirements
Advantages over rules
Challenges
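One possible design, sketched with invented sizes and untrained weights, just to make the pieces concrete (a real classifier would be trained with gradient descent on labelled emails):

```python
import numpy as np

rng = np.random.default_rng(3)

# One possible design: bag-of-words input, one hidden layer, sigmoid output
vocab_size, hidden = 1000, 16              # invented sizes for illustration
W1, b1 = rng.normal(scale=0.1, size=(vocab_size, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(scale=0.1, size=(hidden, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_spam_probability(word_counts):
    # word_counts: vector of length vocab_size with word frequencies for one email
    h = np.maximum(0, word_counts @ W1 + b1)      # hidden layer with ReLU
    return sigmoid(h @ W2 + b2)[0]                # single output: P(spam)

email = rng.integers(0, 3, size=vocab_size).astype(float)  # fake email representation
print(predict_spam_probability(email))                     # > 0.5 could be labelled "spam"
```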
Consider the sentence: “The red car that John bought yesterday broke down on the highway.”
Solution notes
Strong attention relationships
Three attention head types
Car embedding updates
You’re training a small transformer to complete simple mathematical expressions like “2 + 3 = ?”
Solution notes
Tokenization strategies
Training data and objective
Challenges and solutions
Evaluation strategies
Evidence of understanding
A company wants to deploy a large language model for automated customer service. Consider the following scenario:
Situation: The AI occasionally provides incorrect information about product returns, leading to customer frustration and potential financial losses.
Solution notes
Potential risks and harms:
Mitigation strategies:
Monitoring metrics:
Human oversight triggers:
You are working with a language model that produces the following raw scores (logits) for the next token after the prompt “The weather today is”:
Raw scores: [sunny: 2.0, cloudy: 1.8, rainy: 1.2, snowy: 0.8, windy: 0.6]
\(P(token_i) = \frac{e^{score_i/T}}{\sum_j e^{score_j/T}}\)
Solution Notes
Probabilities
\(P(token_i) = \frac{e^{score_i/T}}{\sum_j e^{score_j/T}}\)
T = 0.5 (low/focused)
Sum = 110.5
Probabilities: [0.49, 0.33, 0.10, 0.04, 0.03]
T = 1.0 (balanced)
Sum = 20.8
Probabilities: [0.36, 0.29, 0.16, 0.11, 0.09]
T = 2.0 (high/creative)
Sum = 9.8
Probabilities: [0.28, 0.25, 0.19, 0.15, 0.14]
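These distributions can be reproduced with a few lines of Python:

```python
import math

scores = {"sunny": 2.0, "cloudy": 1.8, "rainy": 1.2, "snowy": 0.8, "windy": 0.6}

def softmax_with_temperature(scores, T):
    # Divide each raw score by the temperature, exponentiate, and normalize
    exps = {token: math.exp(s / T) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(scores, T)
    print(T, {token: round(p, 2) for token, p in probs.items()})
```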
Analysis of effects
Practical implications
Key Insights
The MNIST (Modified National Institute of Standards and Technology) dataset is a popular dataset used for training and testing image classification systems, especially in the world of machine learning. It contains 60,000 training images and 10,000 test images of handwritten digits.
For a visual explanation see 3blue1brown — Visualizing the chain rule and product rule
An epoch is one complete pass through the entire training dataset. During one epoch, the model sees every training example exactly once. Training might stop after a certain number of epochs or when performance plateaus.
During training by means of backpropagation, the attention matrices \(W_Q\), \(W_K\), and \(W_V\) learn patterns. Thus, these are essentially weights in the neural network — they’re learned parameters just like weights in any other layer.
An embedding dimension of 12,288 means each word/token is represented as a vector with 12,288 numbers. Each position captures some aspect of meaning, though the individual dimensions are not directly interpretable to humans.
A vocabulary size of 50,257 tokens means the model knows 50,257 different tokens (words, word pieces, punctuation, etc.).
High temperature → more random/creative; low temperature → more focused/deterministic