The MIT Press, October 2004, ISBN 0-262-01211-1
I am no longer maintaining this page; please refer to the
second edition.
The book can be ordered through The MIT Press, Amazon (CA, DE, FR, JP, UK, US), Barnes & Noble (US), or Pandora (TR).
Find in a Library
Foreign Editions:
- Prentice-Hall of India (IN) has published
a reprint of the book in the Eastern Economy Edition series. ISBN:
81-203-2791-8 (2005).
- Maschinelles Lernen (German-language edition, translated by Simone Linke), Oldenbourg Verlag, ISBN-13: 978-3-486-58114-0, May 2008.
- 机器学习导论
(simplified Chinese edition, translated by Ming Fan), Huazhang Press, ISBN: 7111265246/9787111265245,
June 2009 (Amazon
China).
- A Turkish-language edition will be published by Bogazici University Press.
Description, Reviews, Table of Contents, Courses, Figures, Lecture Slides, Errata,
Solutions to Exercises
Description:
The goal of machine learning is to program computers to use example data or
past experience to solve a given problem. Many successful applications of
machine learning exist already, including systems that analyze past sales data
to predict customer behavior, recognize faces or speech, optimize robot
behavior so that a task can be completed using minimum resources, and extract knowledge
from bioinformatics data. Introduction to Machine Learning is a
comprehensive textbook on the subject, covering a broad array of topics not
usually included in introductory machine learning texts. It discusses many
methods based in different fields, including statistics, pattern recognition,
neural networks, artificial intelligence, signal processing, control, and data
mining, in order to present a unified treatment of machine learning problems
and solutions. All learning algorithms are explained so that the student can
easily move from the equations in the book to a computer program. The book can
be used by advanced undergraduates and graduate students who have completed
courses in computer programming, probability, calculus, and linear algebra. It
will also be of interest to engineers in the field who are concerned with the
application of machine learning methods.
After an introduction that
defines machine learning and gives examples of machine learning applications,
the book covers supervised learning, Bayesian decision theory, parametric
methods, multivariate methods, dimensionality reduction, clustering,
nonparametric methods, decision trees, linear discrimination, multilayer
perceptrons, local models, hidden Markov models, assessing and comparing classification
algorithms, combining multiple learners, and reinforcement learning.
Reviews:
Table of Contents
- Series
Foreword xiii
- Figures xv
- Tables xxiii
- Preface
xxv
- Acknowledgments xxvii
- Notations xxix
- 1
Introduction 1
- 1.1 What Is Machine Learning? 1
- 1.2 Examples of Machine Learning
Applications 3
- 1.2.1 Learning Associations 3
- 1.2.2 Classification 4
- 1.2.3 Regression 8
- 1.2.4 Unsupervised Learning 10
- 1.2.5 Reinforcement Learning 11
- 1.3 Notes 12
- 1.4 Relevant Resources 14
- 1.5 Exercises 15
- 1.6 References 16
- 2 Supervised Learning 17
- 2.1 Learning a Class from
Examples 17
- 2.2 Vapnik-Chervonenkis (VC)
Dimension 22
- 2.3 Probably Approximately
Correct (PAC) Learning 24
- 2.4 Noise 25
- 2.5 Learning Multiple Classes 27
- 2.6 Regression 29
- 2.7 Model Selection and
Generalization 32
- 2.8 Dimensions of a Supervised
Machine Learning Algorithm 35
- 2.9 Notes 36
- 2.10 Exercises 37
- 2.11 References 38
- 3 Bayesian Decision Theory 39
- 3.1 Introduction 39
- 3.2 Classification 41
- 3.3 Losses and Risks 43
- 3.4 Discriminant Functions 45
- 3.5 Utility Theory 46
- 3.6 Value of Information 47
- 3.7 Bayesian Networks 48
- 3.8 Influence Diagrams 55
- 3.9 Association Rules 56
- 3.10 Notes 57
- 3.11 Exercises 57
- 3.12 References 58
- 4 Parametric Methods 61
- 4.1 Introduction 61
- 4.2 Maximum Likelihood Estimation
62
- 4.2.1 Bernoulli Density 62
- 4.2.2 Multinomial Density 63
- 4.2.3 Gaussian (Normal) Density 64
- 4.3 Evaluating an Estimator: Bias
and Variance 64
- 4.4 The Bayes' Estimator 67
- 4.5 Parametric Classification 69
- 4.6 Regression 73
- 4.7 Tuning Model Complexity:
Bias/Variance Dilemma 76
- 4.8 Model Selection Procedures 79
- 4.9 Notes 82
- 4.10 Exercises 82
- 4.11 References 83
- 5 Multivariate Methods 85
- 5.1 Multivariate Data 85
- 5.2 Parameter Estimation 86
- 5.3 Estimation of Missing Values
87
- 5.4 Multivariate Normal
Distribution 88
- 5.5 Multivariate Classification
92
- 5.6 Tuning Complexity 98
- 5.7 Discrete Features 99
- 5.8 Multivariate Regression 100
- 5.9 Notes 102
- 5.10 Exercises 102
- 5.11 References 103
- 6 Dimensionality Reduction 105
- 6.1 Introduction 105
- 6.2 Subset Selection 106
- 6.3 Principal Components Analysis
108
- 6.4 Factor Analysis 116
- 6.5 Multidimensional Scaling 121
- 6.6 Linear Discriminant Analysis
124
- 6.7 Notes 127
- 6.8 Exercises 130
- 6.9 References 130
- 7 Clustering 133
- 7.1 Introduction 133
- 7.2 Mixture Densities 134
- 7.3 k-Means Clustering 135
- 7.4 Expectation-Maximization
Algorithm 139
- 7.5 Mixtures of Latent Variable
Models 144
- 7.6 Supervised Learning after
Clustering 145
- 7.7 Hierarchical Clustering 146
- 7.8 Choosing the Number of
Clusters 149
- 7.9 Notes 149
- 7.10 Exercises 150
- 7.11 References 150
- 8 Nonparametric Methods 153
- 8.1 Introduction 153
- 8.2 Nonparametric Density
Estimation 154
- 8.2.1 Histogram Estimator 155
- 8.2.2 Kernel Estimator 157
- 8.2.3 k-Nearest Neighbor
Estimator 158
- 8.3 Generalization to
Multivariate Data 159
- 8.4 Nonparametric Classification
161
- 8.5 Condensed Nearest Neighbor
162
- 8.6 Nonparametric Regression:
Smoothing Models 164
- 8.6.1 Running Mean Smoother 165
- 8.6.2 Kernel Smoother 166
- 8.6.3 Running Line Smoother 167
- 8.7 How to Choose the Smoothing
Parameter 168
- 8.8 Notes 169
- 8.9 Exercises 170
- 8.10 References 170
- 9 Decision Trees 173
- 9.1 Introduction 173
- 9.2 Univariate Trees 175
- 9.2.1 Classification Trees 176
- 9.2.2 Regression Trees 180
- 9.3 Pruning 182
- 9.4 Rule Extraction from Trees
185
- 9.5 Learning Rules from Data 186
- 9.6 Multivariate Trees 190
- 9.7 Notes 192
- 9.8 Exercises 195
- 9.9 References 195
- 10 Linear Discrimination 197
- 10.1 Introduction 197
- 10.2 Generalizing the Linear
Model 199
- 10.3 Geometry of the Linear
Discriminant 200
- 10.3.1 Two Classes 200
- 10.3.2 Multiple Classes 202
- 10.4 Pairwise Separation 204
- 10.5 Parametric Discrimination
Revisited 205
- 10.6 Gradient Descent 206
- 10.7 Logistic Discrimination 208
- 10.7.1 Two Classes 208
- 10.7.2 Multiple Classes 211
- 10.8 Discrimination by Regression
216
- 10.9 Support Vector Machines 218
- 10.9.1 Optimal Separating
Hyperplane 218
- 10.9.2 The Nonseparable Case:
Soft Margin Hyperplane 221
- 10.9.3 Kernel Functions 223
- 10.9.4 Support Vector Machines
for Regression 225
- 10.10 Notes 227
- 10.11 Exercises 227
- 10.12 References 228
- 11 Multilayer Perceptrons 229
- 11.1 Introduction 229
- 11.1.1 Understanding the Brain
230
- 11.1.2 Neural Networks as a
Paradigm for Parallel Processing 231
- 11.2 The Perceptron 233
- 11.3 Training a Perceptron 236
- 11.4 Learning Boolean Functions
239
- 11.5 Multilayer Perceptrons 241
- 11.6 MLP as a Universal
Approximator 244
- 11.7 Backpropagation Algorithm
245
- 11.7.1 Nonlinear Regression 246
- 11.7.2 Two-Class Discrimination
248
- 11.7.3 Multiclass Discrimination
250
- 11.7.4 Multiple Hidden Layers
252
- 11.8 Training Procedures 252
- 11.8.1 Improving Convergence 252
- Momentum 253
- Adaptive Learning Rate 253
- 11.8.2 Overtraining 253
- 11.8.3 Structuring the Network
254
- 11.8.4 Hints 257
- 11.9 Tuning the Network Size 259
- 11.10 Bayesian View of Learning
262
- 11.11 Dimensionality Reduction
263
- 11.12 Learning Time 266
- 11.12.1 Time Delay Neural
Networks 266
- 11.12.2 Recurrent Networks 267
- 11.13 Notes 268
- 11.14 Exercises 270
- 11.15 References 271
- 12 Local Models 275
- 12.1 Introduction 275
- 12.2 Competitive Learning 276
- 12.2.1 Online k-Means 276
- 12.2.2 Adaptive Resonance Theory
281
- 12.2.3 Self-Organizing Maps 282
- 12.3 Radial Basis Functions 284
- 12.4 Incorporating Rule-Based
Knowledge 290
- 12.5 Normalized Basis Functions
291
- 12.6 Competitive Basis Functions
293
- 12.7 Learning Vector Quantization
296
- 12.8 Mixture of Experts 296
- 12.8.1 Cooperative Experts 299
- 12.8.2 Competitive Experts 300
- 12.9 Hierarchical Mixture of
Experts 300
- 12.10 Notes 301
- 12.11 Exercises 302
- 12.12 References 302
- 13 Hidden Markov Models 305
- 13.1 Introduction 305
- 13.2 Discrete Markov Processes
306
- 13.3 Hidden Markov Models 309
- 13.4 Three Basic Problems of HMMs
311
- 13.5 Evaluation Problem 311
- 13.6 Finding the State Sequence
315
- 13.7 Learning Model Parameters
317
- 13.8 Continuous Observations 320
- 13.9 The HMM with Input 321
- 13.10 Model Selection in HMM 322
- 13.11 Notes 323
- 13.12 Exercises 325
- 13.13 References 325
- 14 Assessing and Comparing Classification Algorithms
327
- 14.1 Introduction 327
- 14.2 Cross-Validation and
Resampling Methods 330
- 14.2.1 K-Fold
Cross-Validation 331
- 14.2.2 5x2 Cross-Validation 331
- 14.2.3 Bootstrapping 332
- 14.3 Measuring Error 333
- 14.4 Interval Estimation 334
- 14.5 Hypothesis Testing 338
- 14.6 Assessing a Classification
Algorithm's Performance 339
- 14.6.1 Binomial Test 340
- 14.6.2 Approximate Normal Test 341
- 14.6.3 Paired t Test 341
- 14.7 Comparing Two Classification
Algorithms 341
- 14.7.1 McNemar's Test 342
- 14.7.2 K-Fold
Cross-Validated Paired t Test 342
- 14.7.3 5x2 cv Paired t
Test 343
- 14.7.4 5x2 cv Paired F
Test 344
- 14.8 Comparing Multiple
Classification Algorithms: Analysis of Variance 345
- 14.9 Notes 348
- 14.10 Exercises 349
- 14.11 References 350
- 15 Combining Multiple Learners 351
- 15.1 Rationale 351
- 15.2 Voting 354
- 15.3 Error-Correcting Output
Codes 357
- 15.4 Bagging 360
- 15.5 Boosting 360
- 15.6 Mixture of Experts Revisited
363
- 15.7 Stacked Generalization 364
- 15.8 Cascading 366
- 15.9 Notes 368
- 15.10 Exercises 369
- 15.11 References 370
- 16 Reinforcement Learning 373
- 16.1 Introduction 373
- 16.2 Single State
Case: K-Armed Bandit 375
- 16.3 Elements of Reinforcement
Learning 376
- 16.4 Model-Based Learning 379
- 16.4.1 Value Iteration 379
- 16.4.2 Policy Iteration 380
- 16.5 Temporal Difference Learning
380
- 16.5.1 Exploration Strategies
381
- 16.5.2 Deterministic Rewards and
Actions 382
- 16.5.3 Nondeterministic Rewards
and Actions 383
- 16.5.4 Eligibility Traces 385
- 16.6 Generalization 387
- 16.7 Partially Observable States
389
- 16.8 Notes 391
- 16.9 Exercises 393
- 16.10 References 394
- A Probability 397
- A.1 Elements of Probability 397
- A.1.1 Axioms of Probability 398
- A.1.2 Conditional Probability
398
- A.2 Random Variables 399
- A.2.1 Probability Distribution
and Density Functions 399
- A.2.2 Joint Distribution and
Density Functions 400
- A.2.3 Conditional Distributions
400
- A.2.4 Bayes' Rule 401
- A.2.5 Expectation 401
- A.2.6 Variance 402
- A.2.7 Weak Law of Large Numbers
403
- A.3 Special Random Variables 403
- A.3.1 Bernoulli Distribution 403
- A.3.2 Binomial Distribution 404
- A.3.3 Multinomial Distribution
404
- A.3.4 Uniform Distribution 404
- A.3.5 Normal (Gaussian) Distribution 405
- A.3.6 Chi-Square Distribution
406
- A.3.7 t Distribution 407
- A.3.8 F Distribution 407
- A.4 References 407
- Index
409
Courses:
The book is used in the following courses, either as the main textbook or as a reference book. I would be happy to hear of others.
- Textbook:
- Reference book:
Figures:
The complete set of figures can be retrieved as a pdf file (2
MB). Instructors using the book are welcome to use these figures in their
lecture slides as long as the use is non-commercial and the source is cited.
Lecture Slides: The following lecture slides (pdf and ppt) are made available for instructors using the book.
Errata:
- (p. 20-22): S and G need not be unique. (Luc de Raedt)
Depending on the training set and the hypothesis class, there may be several S_i and G_j which respectively make up the S-set and the G-set. Every member of the S-set is consistent with all the instances, and there are no consistent hypotheses that are more specific. Similarly, every member of the G-set is consistent with all the instances, and there are no consistent hypotheses that are more general. These two make up the boundary sets, and any hypothesis between them is consistent and is part of the version space. There is an algorithm called candidate elimination that incrementally updates the S- and G-sets as it sees training instances one by one; a short illustrative sketch is given after this errata list. See (Mitchell, 1997; Russell and Norvig, 1995).
- (p. 30): Eq. 2.15: w_1 x + w_0 should be w_1 x^t +
w_0 (Mike Colagrosso)
- (p. 30): Eq. 2.15: Not needed, but the summation
should be multiplied by 1/N to match Eq. 2.12 (Mike Colagrosso)
- (p. 35): Eq. 2.19: Missing closing ')' (Mike
Colagrosso)
- (p. 58): Ref (Agrawal et al., 1996): The second
author's name should be "Mannila." (Kai Puolamäki)
- (p. 59): Ref. The title of Pearl's 1988 book is "Probabilistic
Reasoning in INTELLIGENT Systems." (Yang-feng Ji)
- (p. 62): Eq. 4.1: l(\theta) should be l(\theta|X)
(Chris Mansley)
- (p. 63): Eq. 4.5: p(x_1, x_2, \dots, x_K) should be
P(x_1, x_2, \dots, x_K). That is, P should be uppercase. (Mike Colagrosso)
- (p. 86): Eq. 5.3: '[' missing after the first 'E'.
(Hakan Haberdar)
- (p. 89): Eq at the bottom of the page: + (plus) before 2 should be − (minus) (Barış Can Daylık).
- (p. 90): Figure 5.2: Upper-left figure should be a
circle, but the plot is squashed. x_1 axis is longer than the x_2 axis.
(Mike Colagrosso)
- (p. 118): Equation at the bottom: In the second equality, the last C should be transposed. (Chulhong Min)
- (p. 124): Eq. 6.31: It should be x^t. (Stijn
Vanderlooy)
- (p. 157): Figure 8.2: h values are twice the actual
values. The titles should read 2h=2, 2h=1 and 2h=0.5. (Onder Eker, Alex
Kogan)
- (p. 160): The first sentence of the second paragraph
should read: "For example, the use of the Euclidean norm in equation
8.11 implies that ..." (Stijn Vanderlooy)
- (p. 176): Second line of fourth paragraph should
read: "... number of bits needed to encode the class code of an
instance" (Stijn Vanderlooy)
- (p. 178): Eq. 9.8: log should be base 2. (Ismail Ari)
- (p. 187 and 196): The name of the author for the Irep reference is Fürnkranz. (Stijn Vanderlooy)
- (p. 189): Third paragraph, line 5 from top:
"instances of all other classes are taken as [negative]
examples." (Ismail Ari)
- (p. 191): Figure 9.8: w_{11} x_1 + w_{12} x_2 +
w_{10} = 0 should be w_{11} x_1 + w_{12} x_2 + w_{10} > 0 (Mike
Colagrosso)
- (p. 198): Fourth line from the bottom of the page:
"magnitude" is misspelt. (Michael Dominguez)
- (p. 203): Eq. 10.7: w_{i0} shouldn't be bold. It's a
scalar, not a vector, as in the sentence above and Eq. 10.6. (Mike
Colagrosso)
- (p. 209): Eq. 10.23: E(w, w_o | X) should be E(w, w_0
| X). That is, the subscript should be a zero, not an "oh."
(Mike Colagrosso)
- (p. 210): Fig 10.6. The 11th line (that starts with
\Delta w_j) should also be enclosed in a for loop of j=0,\ldots,d. (Didem
Unat)
- (p. 222): Seventh line from the bottom of the page: The number of misclassifications is \#\{\xi^t \ge 1\}. (Ming Fan)
- (p. 227): First sentence of 10.10: Change
"discriminant" to "discrimination" (Mike Colagrosso)
- (p. 227): Exercise 1: change "function" to
"functions" (Mike Colagrosso)
- (p. 235): Fig. 11.2 caption mentions w_{ij} but there
is no w_{ij} in the figure. w_{ij} is the weight of the connection from
input x_j to output y_i. w_i (the vector of weights to output y_i) is shown in the figure. (Stijn Vanderlooy)
- (p. 236): The first line after eq. 11.6: ..., the
linear model can also be used ... (Ming Fan, Michael Orlov)
- (p. 238): In the first cross-entropy eq on the top of
the page, the summation over i and all i subscripts should be omitted.
(David Warde-Farley)
- (p. 239): First word in the Figure 11.3 narrative
should be "Perceptron" instead of "Percepton." (Mike
Colagrosso)
- (p. 240): In the line below the equation, it should read: "Note that y=s(x_1+x_2-1.5) satisfies ..." (Ming Fan)
- (p. 245): On the third line from the bottom of the
page, it should read z_h and not h_j. (Joel Kammet)
- (p. 252): sigmoid() is missing in the second terms on the right-hand side of the equations defining z_{1h} and z_{2l}.
- (p. 257): Insert "is" before "as"
in the last sentence of the first paragraph to read "..., it is as if
the training set ..." (Tunga Gungor)
- (p. 267): Fig. 11.20: The input units
x^{t-\tau},...,x^{t-1},x^t should be labeled in the opposite order; or
equivalently, the arrows should point to the left. x^t is the current
input seen (the latest) and x^{t-\tau} is the input seen \tau steps in the
past (delayed \tau times).
- (p. 279): Fig 12.2: On line 5 of the pseudocode, m_j should be m_i (Murat Semerci)
- (p. 282): Eq. 12.9: On the third line, x should be
x^t. (Ismail Ari)
- (p. 288): Remove the extra "the" in the
first sentence. (Tunga Gungor)
- (p. 308): Eq. 13.8: The denominator should read
"#{sequences}"; "number of" in the curly brackets is
redundant. (Can Kavaklioglu)
- (p. 313): Fig 13.3, legend: "...computation of
\alpha_{t+1}(j)..." (Ismail Ari)
- (p. 317): Fig. 13.4: Below the node for state j, '1'
should follow the line O_{t+}; that is, the observation is named O_{t+1}.
- (p. 319): Eq. 13.32: In estimating b_j(m), t should
range from 1 to T_k (and not T_k-1) in both the numerator and the
denominator. (Cem Keskin)
- (p. 320): Eq. 13.35: Drop j in P(G_{jl}). (Cem
Keskin)
- (p. 327): On the second line from the bottom of the page, "to" is missing before "say which one ..." (Hussein Issa).
- (p. 329): On line 6 from the bottom of the page, "of" is missing between "both" and "these." (Hussein Issa).
- (p. 330): "than" on line 16 should be
changed to "then." (Tunga Gungor)
- (p. 340): Eq. 14.12: The summation should start from
j=0. (Alex Kogan)
- (p. 343): Eq. 14.17: In the first term to the right, division by \sigma is missing in the numerator. It should be changed to \frac{p^{(1)}_1 / \sigma}{\sqrt{M/5}}. (Alex Kogan)
- (p. 362): Fig 15.2: On line 11, "Then" is
missing after the condition. It should read: If y^t_j=r^t Then
p^t_{j+1}\leftarrow \beta_j p^t_j Else p^t_{j+1}\leftarrow p^t_j (Stijn
Vanderlooy)
- (p. 375): First paragraph of 16.2: classification is
misspelled. (Mike Colagrosso)
- (p. 379): Eq. 16.9: V*(s_t) should be changed to
V*(s_{t+1}). (Omer Korcak)
- (p. 380): Fig 16.3, first line: Initialize a policy
\Pi' arbitrarily (Stijn Vanderlooy)
- (p. 381): Eqs. 16.10 and 16.11: Replace the denominators with \sum_{b\in{\cal A}} \exp ... (Stijn Vanderlooy)
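To make the candidate elimination note in the first erratum above more concrete, here is a minimal sketch. It is not code from the book or its solutions manual: it uses conjunctive hypotheses over nominal attributes (as in Mitchell, 1997) rather than the rectangle hypothesis class of Chapter 2, and the attribute domains and example data are made up for illustration. For this hypothesis class the S-set collapses to a single most specific hypothesis, while the G-set can contain several maximally general ones.

```python
# Minimal sketch of candidate elimination for conjunctive hypotheses over
# nominal attributes (Mitchell, 1997); an illustration, not the book's code.
# '?' matches any value; None marks the maximally specific "matches nothing".

ANY = '?'

def covers(h, x):
    """True if hypothesis h labels instance x as positive."""
    return all(hv == ANY or hv == xv for hv, xv in zip(h, x))

def at_least_as_general(h1, h2):
    """True if every instance covered by h2 is also covered by h1."""
    return all(a == ANY or b is None or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    """examples: list of (instance tuple, bool label); domains: per-attribute value lists."""
    d = len(domains)
    S = tuple([None] * d)           # single most specific hypothesis
    G = [tuple([ANY] * d)]          # set of most general hypotheses
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]   # drop general hypotheses that miss x
            # minimally generalize S so that it covers x
            S = tuple(xv if sv is None else (sv if sv == xv else ANY)
                      for sv, xv in zip(S, x))
        else:
            specializations = []
            for g in G:
                if not covers(g, x):
                    specializations.append(g)    # already excludes the negative instance
                    continue
                for i in range(d):               # minimal specializations of g that exclude x
                    if g[i] == ANY:
                        for v in domains[i]:
                            if v != x[i]:
                                s = g[:i] + (v,) + g[i + 1:]
                                if at_least_as_general(s, S):
                                    specializations.append(s)
            specializations = list(dict.fromkeys(specializations))
            # keep only the maximally general members
            G = [g for g in specializations
                 if not any(h != g and at_least_as_general(h, g) for h in specializations)]
    return S, G

# Tiny made-up example with two nominal attributes.
domains = [['red', 'blue'], ['small', 'large']]
examples = [(('red', 'small'), True), (('blue', 'small'), False)]
print(candidate_elimination(examples, domains))
# S = ('red', 'small'), G = [('red', '?')]
```

In this toy run the version space contains the two hypotheses between S and G, ('red', 'small') and ('red', '?'), both consistent with the one positive and one negative example.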
Solutions to Exercises: Available as a gzipped tar or zip archive for instructors who have adopted the book for course use. Please contact The MIT Press for the user name and password. The manual contains solutions to exercises and example Matlab programs.
Created on
Oct 24, 2004 by E. Alpaydin (my_last_name AT
boun DOT edu DOT tr)
- Jan 14, 2005: Added links to more online booksellers.
- Jan 31, 2005: Added link to the pdf file of figures.
- Apr 3, 2005: Added Errata.
- June 1, 2005: Further errata.
- July 12, 2005: Added more bookseller links.
- July 20, 2005: Added more bookseller links and the
lecture slides of Chapters 1, 2 and 11.
- July 28, 2005: Added all lecture slides.
- Sep 26, 2005: Added ppt of all lecture slides. This
new version (V1-1) is the same as the previously available V1-0 except
that I retyped all equations using Microsoft Equation Editor.
- Oct 25, 2005: Further errata.
- Nov 12, 2005: Added reviews and courses.
- Dec 14, 2005: Added links to MIT Press for sample
pdfs of Foreword, Preface, and Chapter 1.
- Feb 1, 2006: Added links to 2006 courses.
- Apr 27, 2006: Added new course links and errata.
- Jul 4, 2006: Added errata.pdf
- Sep 1, 2006: Added links to Fall 2006 courses.
- Nov 14, 2006: Added info on Foreign Editions.
- Jan 12, 2007: Added Solutions to Exercises. Thanks to Oya Aran, our web admin, for her help in password-protecting the file.
- Feb 5, 2007: Added links to Find-In-A-Library and new
courses.
- Jul 16, 2007: Some more errata.
- Sep 18, 2007: Added links to 2007 courses.
- May 1, 2008: Added an erratum and a review.
- August 20, 2009: Added info about the Chinese
edition.