Learning
Learning scenarios
The first step is to appreciate the variety of knowledge that goes under the heading “language.”
The infant must learn to recognize and produce speech, learn vocabulary, learn grammar, learn the semantic and pragmatic interpretation of a speech act, and learn strategies for disambiguation, among other things.
The performance elements for these abilities (in humans) and their associated learning mechanisms are obviously very complex, and as yet little is known about them.
A naive model of the learning environment considers just the exchange of speech sounds. In reality, the physical context of each utterance is crucial: a child must see the context in which “watermelon” is uttered in order to learn to associate “watermelon” with watermelons.
Thus, the environment consists not just of other humans but also the physical objects and events about which discourse takes place. Auditory sensors detect speech sounds, while other senses (primarily visual) provide information on the physical context. The relevant effectors are the speech organs and the motor capacities that allow the infant to respond to speech or that elicit verbal feedback.
The performance standard could simply be the infant’s general utility function, however that is realized, so that the infant performs reinforcement learning to perform and respond to speech acts so as to improve its well-being—for example, by obtaining food and attention.
However, humans’ built-in capacity for mimicry suggests that producing sounds similar to those produced by other humans is a goal in itself. The child (once he or she learns to differentiate sounds and learns about pointing or other means of indicating salient objects) is also exposed to examples of supervised learning: an adult says “shoe” or “belly button” while indicating the appropriate object. So sentences produced by adults provide labeled positive examples, and the responses of adults to the infant’s speech acts provide further classification feedback.
Mostly, it seems that adults do not correct the child’s speech, so there are very few negative classifications of the child’s attempted sentences. This is significant because early work on language learning, such as that of Gold (1967), concentrated just on identifying the set of strings that are grammatical, assuming a particular grammatical formalism. If there are only positive examples, then there is nothing to rule out the grammar \(S \to \mathit{Word}^*\), which generates every string of words.

Some theorists (notably Chomsky and Fodor) used what they call the “poverty of the stimulus” argument to claim that the basic universal grammar of languages must be innate, because otherwise (given the lack of negative examples) there would be no way a child could learn a language (under the assumption that language learning means learning a set of grammatical strings). Critics have dubbed this the “poverty of the imagination” argument: I can’t think of a learning mechanism that would work, so it must be innate. Indeed, if we move to probabilistic context-free grammars, it is possible to learn a language without negative examples.
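To make the last point concrete, here is a minimal sketch of the simplest probabilistic case: maximum-likelihood estimation of PCFG rule probabilities by counting rule expansions in parsed positive examples. The toy treebank, the nonterminal names, and the tuple encoding of trees are all assumptions made up for illustration; a full grammar-induction algorithm would work from unparsed strings (e.g., via inside-outside).

```python
from collections import Counter

# Toy "treebank" of positive examples only. Each tree is a tuple
# (nonterminal, child, child, ...); leaves are plain word strings.
trees = [
    ("S", ("NP", "I"), ("VP", ("V", "like"), ("NP", "watermelons"))),
    ("S", ("NP", "you"), ("VP", ("V", "like"), ("NP", "shoes"))),
]

rule_counts = Counter()   # counts of (lhs, rhs) rule expansions
lhs_counts = Counter()    # counts of each left-hand-side symbol

def count_rules(tree):
    """Recursively tally every rule expansion in a parse tree."""
    if isinstance(tree, str):      # a leaf word: no rule to count
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(lhs, rhs)] += 1
    lhs_counts[lhs] += 1
    for child in children:
        count_rules(child)

for t in trees:
    count_rules(t)

# Maximum-likelihood estimate: P(A -> rhs) = count(A -> rhs) / count(A).
for (lhs, rhs), n in sorted(rule_counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}    p = {n / lhs_counts[lhs]:.2f}")
```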
What about learning tennis?
Learning tennis is much simpler than learning to speak. The requisite skills can be divided into movement, playing strokes, and strategy.
The environment consists of the court, ball, opponent, and one’s own body. The relevant sensors are the visual system and proprioception (the sense of forces on and position of one’s own body parts). The effectors are the muscles involved in moving to the ball and hitting the stroke.
The learning process involves both supervised learning and reinforcement learning. Supervised learning occurs in acquiring the predictive transition models, e.g., where the opponent will hit the ball, where the ball will land, and what trajectory the ball will have after one’s own stroke (e.g., if I hit a half-volley this way, it goes into the net, but if I hit it that way, it clears the net).
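As a toy illustration of the supervised part, here is a sketch that fits a transition model from labeled input–output pairs: stroke parameters in, landing distance out. The features, the data-generating function, the units, and the noise level are all fabricated for the example; a real model would be learned from observed shots.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stroke features: [racket speed, racket angle].
# The "true" relationship below is invented for illustration.
X = rng.uniform([10.0, 0.1], [30.0, 0.6], size=(200, 2))
y = 0.4 * X[:, 0] - 8.0 * X[:, 1] + rng.normal(0, 0.5, 200)

# Supervised learning of the transition model: a least-squares fit
# from labeled pairs (stroke features -> landing distance).
A = np.column_stack([X, np.ones(len(X))])   # append an intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)

speed, angle = 22.0, 0.3
prediction = w @ [speed, angle, 1.0]
print(f"predicted landing distance: {prediction:.2f}")
```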
Reinforcement learning occurs when points are won and lost—this is particularly important for strategic aspects of play such as shot placement and positioning (e.g., in 60% of the points where I hit a lob in response to a cross-court shot, I end up losing the point).
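Here is a bandit-style sketch of how that win/loss signal could drive shot selection in one situation. The shot options, the epsilon value, and the simulated win probabilities (the lob’s 0.40 is chosen to match the 60%-loss statistic above; the passing shot’s 0.55 is invented) are assumptions for illustration.

```python
import random
from collections import defaultdict

random.seed(0)
SHOTS = ["lob", "passing shot"]

# Simulated win probabilities when responding to a cross-court shot.
TRUE_WIN_PROB = {"lob": 0.40, "passing shot": 0.55}

wins = defaultdict(int)    # points won per shot
plays = defaultdict(int)   # times each shot was tried

def win_rate(shot):
    return wins[shot] / plays[shot] if plays[shot] else 0.0

for _ in range(1000):
    # Epsilon-greedy: usually exploit the best-looking shot, sometimes explore.
    if random.random() < 0.1:
        shot = random.choice(SHOTS)
    else:
        shot = max(SHOTS, key=win_rate)
    plays[shot] += 1
    if random.random() < TRUE_WIN_PROB[shot]:   # reinforcement: point won?
        wins[shot] += 1

for shot in SHOTS:
    print(f"{shot}: estimated win rate {win_rate(shot):.2f} "
          f"({plays[shot]} plays)")
```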
In the early stages, reinforcement also occurs when a shot succeeds in clearing the net and landing in the opponent’s court. Achieving this small success is itself a sequential process involving many motor control commands, and there is no teacher available to tell the learner’s motor cortex which motor control commands to issue.
Learning types
In supervised learning, the training data consists of input–output pairs, and the labeled outputs are what we are trying to learn to predict. In unsupervised learning, there is no labeled output, and the goal is to find patterns or clusters in the input. In reinforcement learning, the learner is not told the correct outputs; instead it receives occasional rewards and punishments, and must learn to act so as to maximize its expected reward over time.
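The supervised and reinforcement cases are illustrated in the tennis sketches above; here is a minimal sketch of the unsupervised case, where plain k-means discovers cluster structure in unlabeled 2-D points. The synthetic blobs and the choice of k = 2 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabeled inputs only: two synthetic blobs of 2-D points.
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])

centers = X[rng.choice(len(X), size=2, replace=False)]
for _ in range(10):
    # Assign each point to its nearest center, then recompute centers.
    dists = ((X[:, None, :] - centers) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print("cluster centers:\n", centers)
```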
ML concepts
- Training set: A set of input–output pair examples, used as input to a machine learning program to create a hypothesis.
- Hypothesis: A function, drawn from the hypothesis space and learned from the training data, that maps inputs to outputs.
- Bias: The amount by which the output of a hypothesis consistently varies from the true answer in a particular direction, regardless of the exact training data.
- Variance: The amount by which the output of a hypothesis randomly varies from the true answer, when trained on slightly different data sets.
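A small experiment can make the last two definitions concrete: train a deliberately restricted hypothesis (a constant function) on many resampled data sets and measure the consistent offset (bias) and the spread (variance) of its predictions at one query point. The target function, noise level, and sample sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return 2.0 * x                # the target we are trying to learn

# A deliberately restricted hypothesis space: constant functions only.
def fit_constant(y):
    return y.mean()

x_query, n_sets, n_points = 1.0, 500, 20
preds = []
for _ in range(n_sets):
    x = rng.uniform(-1, 1, n_points)
    y = true_f(x) + rng.normal(0, 0.1, n_points)
    preds.append(fit_constant(y))    # a constant predicts this everywhere

preds = np.array(preds)
bias = preds.mean() - true_f(x_query)  # consistent offset at the query point
variance = preds.var()                 # spread across resampled training sets
print(f"bias = {bias:+.3f}, variance = {variance:.4f}")
# Expect a large negative bias (a constant cannot track 2x at x = 1)
# and a small variance (the fit barely changes between data sets).
```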
Decision tree
In standard decision trees, attribute tests divide examples according to the attribute’s value. Any example reaching a second test on the same attribute therefore already has a known value for it, and the second test is redundant.
In some decision tree systems, however, all tests are Boolean even if the attributes are multivalued or continuous. In this case, additional tests of the same attribute can be used to check different values or subdivide the range further (e.g., first check whether X > 0, and then, if it is, check whether X > 10).
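A minimal sketch of such a tree in code, reusing the thresholds from the example above: two Boolean tests on the same continuous attribute, the second subdividing the range passed by the first.

```python
def classify(x):
    """Tiny decision tree with two Boolean threshold tests on the
    same continuous attribute (thresholds from the example above)."""
    if x > 0:
        if x > 10:
            return "x > 10"
        return "0 < x <= 10"
    return "x <= 0"

for value in (-3, 4, 25):
    print(value, "->", classify(value))
```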
Statisticians
The Bayesian approach would be to take both drugs. The maximum likelihood approach would be to take the anti-B drug.
In the case where there are two versions of B, the Bayesian approach still recommends taking both drugs, while the maximum-likelihood approach now recommends the anti-A drug, since A has a 40% chance of being correct versus 30% for each of the two B variants.
This is, of course, a caricature, and you would be hard-pressed to find a doctor, even a rabid maximum-likelihood advocate, who would prescribe like this.
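For concreteness, here is a sketch of the arithmetic in the two cases, using the probabilities implied above (40% for A, and either a single B at 60% or two B variants at 30% each): maximum likelihood commits to the single most probable disease, while the Bayesian prescribes against every disease with appreciable probability.

```python
cases = {
    "one kind of B":  {"A": 0.40, "B": 0.60},
    "two kinds of B": {"A": 0.40, "B1": 0.30, "B2": 0.30},
}

for name, p in cases.items():
    # Maximum likelihood: treat only the single most probable disease.
    ml = max(p, key=p.get)
    # Bayesian: act against everything with nonzero probability.
    bayes = [d for d in p if p[d] > 0]
    print(f"{name}: ML treats {ml} (p = {p[ml]:.0%}); "
          f"Bayesian treats {', '.join(bayes)}")
```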