Probabilistic Machine Learning

Introduction

What is machine learning?

A popular definition of machine learning or ML, due to Tom Mitchell [Mit97], is as follows:

A computer program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Thus there are many different kinds of machine learning, depending on the nature of the tasks T we wish the system to learn, the nature of the performance measure P we use to evaluate the system, and the nature of the training signal or experience E we give it.

In this book, we will cover the most common types of ML, but from a probabilistic perspective. Roughly speaking, this means that we treat all unknown quantities (e.g., predictions about the future value of some quantity of interest, such as tomorrow’s temperature, or the parameters of some model) as random variables, that are endowed with probability distributions which describe a weighted set of possible values the variable may have. (See Chapter 2 for a quick refresher on the basics of probability, if necessary.)

There are two main reasons we adopt a probabilistic approach. First, it is the optimal approach to decision making under uncertainty, as we explain in Section 5.1. Second, probabilistic modeling is the language used by most other areas of science and engineering, and thus provides a unifying framework between these fields. As Shakir Mohamed, a researcher at DeepMind, put it:

Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. It is, of course, not the only view. But it is through this view that we can connect what we do in machine learning to every other computational science, whether that be in stochastic optimisation, control theory, operations research, econometrics, information theory, statistical physics or bio-statistics. For this reason alone, mastery of probabilistic thinking is essential.

Supervised learning

The most common form of ML is supervised learning. In this problem, the task T is to learn a mapping f from inputs x ∈ X to outputs y ∈ Y. The inputs x are also called the features, covariates, or predictors; this is often a fixed-dimensional vector of numbers, such as the height and weight of a person, or the pixels in an image. In this case, X = ℝ^D, where D is the dimensionality of the vector (i.e., the number of input features). The output y is also known as the label, target, or response. The experience E is given in the form of a set of N input-output pairs D = {(x_n, y_n)}_{n=1}^N, known as the training set. (N is called the sample size.) The performance measure P depends on the type of output we are predicting, as we discuss below.
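As a minimal sketch of these objects (using NumPy, which is an assumption, with toy numbers invented for illustration), the training set is simply N paired inputs and outputs:

    import numpy as np

    N, D = 6, 2  # sample size and number of features (toy values)
    # Each row of X is one input x_n, e.g., (height in m, weight in kg).
    X = np.array([[1.6, 60.0], [1.7, 65.0], [1.8, 80.0],
                  [1.5, 50.0], [1.9, 90.0], [1.6, 55.0]])
    y = np.array([0, 0, 1, 0, 1, 0])  # one output y_n per input

    assert X.shape == (N, D) and y.shape == (N,)
    # A supervised learner maps this training set to a function f: X -> Y
    # that can then be applied to new, unseen inputs.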

Classification

In classification problems, the output space is a set of C unordered and mutually exclusive labels known as classes, Y = {1, 2, …, C}. The problem of predicting the class label given an input is also called pattern recognition. (If there are just two classes, often denoted by y ∈ {0, 1} or y ∈ {−1, +1}, it is called binary classification.)
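A probabilistic classifier outputs a distribution over the C classes, and the predicted label is typically the most probable class. A minimal sketch (NumPy assumed, with invented probabilities):

    import numpy as np

    C = 3
    probs = np.array([0.1, 0.7, 0.2])    # hypothetical p(y = c | x), c = 1..C
    assert np.isclose(probs.sum(), 1.0)  # classes are mutually exclusive

    y_hat = np.argmax(probs) + 1  # most probable label in {1, ..., C}
    print(y_hat)                  # -> 2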

Example: classifying Iris flowers

As an example, consider the problem of classifying Iris flowers into their 3 subspecies, Setosa, Versicolor and Virginica. Figure 1.1 shows one example of each of these classes.

In image classification, the input space X is the set of images, which is a very high-dimensional space: for a color image with C = 3 channels (e.g., RGB) and D1 × D2 pixels, we have X = ℝ^D, where D = C × D1 × D2. (In practice we represent each pixel intensity with an integer, typically from the range {0, 1, …, 255}, but we assume real-valued inputs for notational simplicity.) Learning a mapping f : X → Y from images to labels is quite challenging, as illustrated in Figure 1.2. However, it can be tackled using certain kinds of functions, such as a convolutional neural network or CNN, which we discuss in Section 14.1.
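A quick sanity check of this dimensionality formula, for a hypothetical 64 × 64 RGB image (NumPy assumed):

    import numpy as np

    C, D1, D2 = 3, 64, 64
    image = np.random.randint(0, 256, size=(D1, D2, C))  # integer pixel intensities
    x = image.reshape(-1)             # flatten the image into a single vector
    assert x.shape == (C * D1 * D2,)  # D = 3 * 64 * 64 = 12288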

Fortunately for us, some botanists have already identified 4 simple, but highly informative, numeric features (sepal length, sepal width, petal length, petal width) which can be used to distinguish the three kinds of Iris flowers. In this section, we will use this much lower-dimensional input space, X = ℝ^4, for simplicity. The Iris dataset is a collection of 150 labeled examples of Iris flowers, 50 of each type, described by these 4 features. It is widely used as an example, because it is small and simple to understand. (We will discuss larger and more complex datasets later in the book.)
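The dataset ships with scikit-learn (assuming it is installed), so the numbers above are easy to verify:

    from sklearn.datasets import load_iris

    iris = load_iris()
    X, y = iris.data, iris.target
    print(X.shape)             # (150, 4): N = 150 examples, D = 4 features
    print(iris.feature_names)  # sepal/petal length and width, in cm
    print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']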

When we have small datasets of features, it is common to store them in an N × D matrix, in which each row represents an example, and each column represents a feature. This is known as a design matrix; see Table 1.1 for an example. The Iris dataset is an example of tabular data. When the inputs are of variable size (e.g., sequences of words, or social networks), rather than fixed-length vectors, the data is usually stored in some other format rather than in a design matrix. However, such data is often converted to a fixed-sized feature representation (a process known as featurization), thus implicitly creating a design matrix for further processing. We give an example of this in Section 1.5.4.1, where we discuss the “bag of words” representation for sequence data.
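As a small featurization sketch (scikit-learn assumed), variable-length text inputs can be converted to fixed-size bag-of-words count vectors, implicitly creating an N × D design matrix:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog ate my homework"]  # N = 2 documents
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)  # sparse N x D matrix of word counts

    print(X.shape)                             # (2, D), D = vocabulary size
    print(vectorizer.get_feature_names_out())  # the learned vocabulary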

Attribution

Kevin P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press (2022). URL: http://probml.ai/

This work is licensed under the Creative Commons CC BY-NC-ND 4.0 license:
https://creativecommons.org/licenses/by-nc-nd/4.0/
