The MIT Press, October 2004, ISBN 0-262-01211-1
I am no longer maintaining this page; please refer to the
second edition.
The book can be ordered through The MIT Press, Amazon (CA, DE, FR, JP, UK, US), Barnes & Noble (US), or Pandora (TR).
Find in a Library
Foreign Editions:
- Prentice-Hall of India (IN) has published
a reprint of the book in the Eastern Economy Edition series. ISBN:
81-203-2791-8 (2005).
- Maschinelles Lernen (German-language edition, translated by Simone Linke), Oldenbourg Verlag, ISBN-13: 978-3-486-58114-0, May 2008.
- 机器学习导论
(simplified Chinese edition, translated by Ming Fan), Huazhang Press, ISBN: 7111265246/9787111265245,
June 2009 (Amazon
China).
- A Turkish-language edition will be published by Bogazici University Press.
Description, Reviews, Table of Contents, Courses, Figures, Lecture Slides, Errata,
Solutions to Exercises
Description:
The goal of machine learning is to program computers to use example data or
past experience to solve a given problem. Many successful applications of
machine learning exist already, including systems that analyze past sales data
to predict customer behavior, recognize faces or speech, optimize robot
behavior so that a task can be completed using minimum resources, and extract knowledge
from bioinformatics data. Introduction to Machine Learning is a
comprehensive textbook on the subject, covering a broad array of topics not
usually included in introductory machine learning texts. It discusses many
methods based in different fields, including statistics, pattern recognition,
neural networks, artificial intelligence, signal processing, control, and data
mining, in order to present a unified treatment of machine learning problems
and solutions. All learning algorithms are explained so that the student can
easily move from the equations in the book to a computer program. The book can
be used by advanced undergraduates and graduate students who have completed
courses in computer programming, probability, calculus, and linear algebra. It
will also be of interest to engineers in the field who are concerned with the
application of machine learning methods.
After an introduction that
defines machine learning and gives examples of machine learning applications,
the book covers supervised learning, Bayesian decision theory, parametric
methods, multivariate methods, dimensionality reduction, clustering,
nonparametric methods, decision trees, linear discrimination, multilayer
perceptrons, local models, hidden Markov models, assessing and comparing classification
algorithms, combining multiple learners, and reinforcement learning.
Reviews:
Table of Contents
- Series
Foreword xiii
- Figures xv
- Tables xxiii
- Preface
xxv
- Acknowledgments xxvii
- Notations xxix
- 1
Introduction 1
- 1.1 What Is Machine Learning? 1
- 1.2 Examples of Machine Learning
Applications 3
- 1.2.1 Learning Associations 3
- 1.2.2 Classification 4
- 1.2.3 Regression 8
- 1.2.4 Unsupervised Learning 10
- 1.2.5 Reinforcement Learning 11
- 1.3 Notes 12
- 1.4 Relevant Resources 14
- 1.5 Exercises 15
- 1.6 References 16
- 2 Supervised Learning 17
- 2.1 Learning a Class from
Examples 17
- 2.2 Vapnik-Chervonenkis (VC)
Dimension 22
- 2.3 Probably Approximately
Correct (PAC) Learning 24
- 2.4 Noise 25
- 2.5 Learning Multiple Classes 27
- 2.6 Regression 29
- 2.7 Model Selection and
Generalization 32
- 2.8 Dimensions of a Supervised
Machine Learning Algorithm 35
- 2.9 Notes 36
- 2.10 Exercises 37
- 2.11 References 38
- 3 Bayesian Decision Theory 39
- 3.1 Introduction 39
- 3.2 Classification 41
- 3.3 Losses and Risks 43
- 3.4 Discriminant Functions 45
- 3.5 Utility Theory 46
- 3.6 Value of Information 47
- 3.7 Bayesian Networks 48
- 3.8 Influence Diagrams 55
- 3.9 Association Rules 56
- 3.10 Notes 57
- 3.11 Exercises 57
- 3.12 References 58
- 4 Parametric Methods 61
- 4.1 Introduction 61
- 4.2 Maximum Likelihood Estimation
62
- 4.2.1 Bernoulli Density 62
- 4.2.2 Multinomial Density 63
- 4.2.3 Gaussian (Normal) Density 64
- 4.3 Evaluating an Estimator: Bias
and Variance 64
- 4.4 The Bayes' Estimator 67
- 4.5 Parametric Classification 69
- 4.6 Regression 73
- 4.7 Tuning Model Complexity:
Bias/Variance Dilemma 76
- 4.8 Model Selection Procedures 79
- 4.9 Notes 82
- 4.10 Exercises 82
- 4.11 References 83
- 5 Multivariate Methods 85
- 5.1 Multivariate Data 85
- 5.2 Parameter Estimation 86
- 5.3 Estimation of Missing Values
87
- 5.4 Multivariate Normal
Distribution 88
- 5.5 Multivariate Classification
92
- 5.6 Tuning Complexity 98
- 5.7 Discrete Features 99
- 5.8 Multivariate Regression 100
- 5.9 Notes 102
- 5.10 Exercises 102
- 5.11 References 103
- 6 Dimensionality Reduction 105
- 6.1 Introduction 105
- 6.2 Subset Selection 106
- 6.3 Principal Components Analysis
108
- 6.4 Factor Analysis 116
- 6.5 Multidimensional Scaling 121
- 6.6 Linear Discriminant Analysis
124
- 6.7 Notes 127
- 6.8 Exercises 130
- 6.9 References 130
- 7 Clustering 133
- 7.1 Introduction 133
- 7.2 Mixture Densities 134
- 7.3 k-Means Clustering 135
- 7.4 Expectation-Maximization
Algorithm 139
- 7.5 Mixtures of Latent Variable
Models 144
- 7.6 Supervised Learning after
Clustering 145
- 7.7 Hierarchical Clustering 146
- 7.8 Choosing the Number of
Clusters 149
- 7.9 Notes 149
- 7.10 Exercises 150
- 7.11 References 150
- 8 Nonparametric Methods 153
- 8.1 Introduction 153
- 8.2 Nonparametric Density
Estimation 154
- 8.2.1 Histogram Estimator 155
- 8.2.2 Kernel Estimator 157
- 8.2.3 k-Nearest Neighbor
Estimator 158
- 8.3 Generalization to
Multivariate Data 159
- 8.4 Nonparametric Classification
161
- 8.5 Condensed Nearest Neighbor
162
- 8.6 Nonparametric Regression:
Smoothing Models 164
- 8.6.1 Running Mean Smoother 165
- 8.6.2 Kernel Smoother 166
- 8.6.3 Running Line Smoother 167
- 8.7 How to Choose the Smoothing
Parameter 168
- 8.8 Notes 169
- 8.9 Exercises 170
- 8.10 References 170
- 9 Decision Trees 173
- 9.1 Introduction 173
- 9.2 Univariate Trees 175
- 9.2.1 Classification Trees 176
- 9.2.2 Regression Trees 180
- 9.3 Pruning 182
- 9.4 Rule Extraction from Trees
185
- 9.5 Learning Rules from Data 186
- 9.6 Multivariate Trees 190
- 9.7 Notes 192
- 9.8 Exercises 195
- 9.9 References 195
- 10 Linear Discrimination 197
- 10.1 Introduction 197
- 10.2 Generalizing the Linear
Model 199
- 10.3 Geometry of the Linear
Discriminant 200
- 10.3.1 Two Classes 200
- 10.3.2 Multiple Classes 202
- 10.4 Pairwise Separation 204
- 10.5 Parametric Discrimination
Revisited 205
- 10.6 Gradient Descent 206
- 10.7 Logistic Discrimination 208
- 10.7.1 Two Classes 208
- 10.7.2 Multiple Classes 211
- 10.8 Discrimination by Regression
216
- 10.9 Support Vector Machines 218
- 10.9.1 Optimal Separating
Hyperplane 218
- 10.9.2 The Nonseparable Case:
Soft Margin Hyperplane 221
- 10.9.3 Kernel Functions 223
- 10.9.4 Support Vector Machines
for Regression 225
- 10.10 Notes 227
- 10.11 Exercises 227
- 10.12 References 228
- 11 Multilayer Perceptrons 229
- 11.1 Introduction 229
- 11.1.1 Understanding the Brain
230
- 11.1.2 Neural Networks as a
Paradigm for Parallel Processing 231
- 11.2 The Perceptron 233
- 11.3 Training a Perceptron 236
- 11.4 Learning Boolean Functions
239
- 11.5 Multilayer Perceptrons 241
- 11.6 MLP as a Universal
Approximator 244
- 11.7 Backpropagation Algorithm
245
- 11.7.1 Nonlinear Regression 246
- 11.7.2 Two-Class Discrimination
248
- 11.7.3 Multiclass Discrimination
250
- 11.7.4 Multiple Hidden Layers
252
- 11.8 Training Procedures 252
- 11.8.1 Improving Convergence 252
- Momentum 253
- Adaptive Learning Rate 253
- 11.8.2 Overtraining 253
- 11.8.3 Structuring the Network
254
- 11.8.4 Hints 257
- 11.9 Tuning the Network Size 259
- 11.10 Bayesian View of Learning
262
- 11.11 Dimensionality Reduction
263
- 11.12 Learning Time 266
- 11.12.1 Time Delay Neural
Networks 266
- 11.12.2 Recurrent Networks 267
- 11.13 Notes 268
- 11.14 Exercises 270
- 11.15 References 271
- 12 Local Models 275
- 12.1 Introduction 275
- 12.2 Competitive Learning 276
- 12.2.1 Online k-Means 276
- 12.2.2 Adaptive Resonance Theory
281
- 12.2.3 Self-Organizing Maps 282
- 12.3 Radial Basis Functions 284
- 12.4 Incorporating Rule-Based
Knowledge 290
- 12.5 Normalized Basis Functions
291
- 12.6 Competitive Basis Functions
293
- 12.7 Learning Vector Quantization
296
- 12.8 Mixture of Experts 296
- 12.8.1 Cooperative Experts 299
- 12.8.2 Competitive Experts 300
- 12.9 Hierarchical Mixture of
Experts 300
- 12.10 Notes 301
- 12.11 Exercises 302
- 12.12 References 302
- 13 Hidden Markov Models 305
- 13.1 Introduction 305
- 13.2 Discrete Markov Processes
306
- 13.3 Hidden Markov Models 309
- 13.4 Three Basic Problems of HMMs
311
- 13.5 Evaluation Problem 311
- 13.6 Finding the State Sequence
315
- 13.7 Learning Model Parameters
317
- 13.8 Continuous Observations 320
- 13.9 The HMM with Input 321
- 13.10 Model Selection in HMM 322
- 13.11 Notes 323
- 13.12 Exercises 325
- 13.13 References 325
- 14 Assessing and Comparing Classification Algorithms
327
- 14.1 Introduction 327
- 14.2 Cross-Validation and
Resampling Methods 330
- 14.2.1 K-Fold
Cross-Validation 331
- 14.2.2 5x2 Cross-Validation 331
- 14.2.3 Bootstrapping 332
- 14.3 Measuring Error 333
- 14.4 Interval Estimation 334
- 14.5 Hypothesis Testing 338
- 14.6 Assessing a Classification
Algorithm's Performance 339
- 14.6.1 Binomial Test 340
- 14.6.2 Approximate Normal Test 341
- 14.6.3 Paired t Test 341
- 14.7 Comparing Two Classification
Algorithms 341
- 14.7.1 McNemar's Test 342
- 14.7.2 K-Fold
Cross-Validated Paired t Test 342
- 14.7.3 5x2 cv Paired t
Test 343
- 14.7.4 5x2 cv Paired F
Test 344
- 14.8 Comparing Multiple
Classification Algorithms: Analysis of Variance 345
- 14.9 Notes 348
- 14.10 Exercises 349
- 14.11 References 350
- 15 Combining Multiple Learners 351
- 15.1 Rationale 351
- 15.2 Voting 354
- 15.3 Error-Correcting Output
Codes 357
- 15.4 Bagging 360
- 15.5 Boosting 360
- 15.6 Mixture of Experts Revisited
363
- 15.7 Stacked Generalization 364
- 15.8 Cascading 366
- 15.9 Notes 368
- 15.10 Exercises 369
- 15.11 References 370
- 16 Reinforcement Learning 373
- 16.1 Introduction 373
- 16.2 Single State
Case: K-Armed Bandit 375
- 16.3 Elements of Reinforcement
Learning 376
- 16.4 Model-Based Learning 379
- 16.4.1 Value Iteration 379
- 16.4.2 Policy Iteration 380
- 16.5 Temporal Difference Learning
380
- 16.5.1 Exploration Strategies
381
- 16.5.2 Deterministic Rewards and
Actions 382
- 16.5.3 Nondeterministic Rewards
and Actions 383
- 16.5.4 Eligibility Traces 385
- 16.6 Generalization 387
- 16.7 Partially Observable States
389
- 16.8 Notes 391
- 16.9 Exercises 393
- 16.10 References 394
- A Probability 397
- A.1 Elements of Probability 397
- A.1.1 Axioms of Probability 398
- A.1.2 Conditional Probability
398
- A.2 Random Variables 399
- A.2.1 Probability Distribution
and Density Functions 399
- A.2.2 Joint Distribution and
Density Functions 400
- A.2.3 Conditional Distributions
400
- A.2.4 Bayes' Rule 401
- A.2.5 Expectation 401
- A.2.6 Variance 402
- A.2.7 Weak Law of Large Numbers
403
- A.3 Special Random Variables 403
- A.3.1 Bernoulli Distribution 403
- A.3.2 Binomial Distribution 404
- A.3.3 Multinomial Distribution
404
- A.3.4 Uniform Distribution 404
- A.3.5 Normal (Gaussian) Distribution 405
- A.3.6 Chi-Square Distribution
406
- A.3.7 t Distribution 407
- A.3.8 F Distribution 407
- A.4 References 407
- Index
409
Courses:
The book is used in the following courses, either as the main textbook or as a reference book. I would be happy to hear of others.
- Textbook:
- Reference book:
Figures:
The complete set of figures can be retrieved as a pdf file (2
MB). Instructors using the book are welcome to use these figures in their
lecture slides as long as the use is non-commercial and the source is cited.
Lecture Slides: The following lecture slides (pdf and ppt) are made available for instructors using the book.
Errata:
- (p. 20-22): S and G need not be unique. (Luc de Raedt)
Depending on the training set and the hypothesis class, there may be several S_i and G_j which respectively make up the S-set and the G-set. Every member of the S-set is consistent with all the instances, and there are no consistent hypotheses that are more specific. Similarly, every member of the G-set is consistent with all the instances, and there are no consistent hypotheses that are more general. These two make up the boundary sets, and any hypothesis between them is consistent and is part of the version space. There is an algorithm called candidate elimination that incrementally updates the S- and G-sets as it sees training instances one by one; a short illustrative sketch is given after this errata list. See (Mitchell, 1997; Russell and Norvig, 1995).
- (p. 30): Eq. 2.15: w_1 x + w_0 should be w_1 x^t +
w_0 (Mike Colagrosso)
- (p. 30): Eq. 2.15: Not needed, but the summation
should be multiplied by 1/N to match Eq. 2.12 (Mike Colagrosso)
- (p. 35): Eq. 2.19: Missing closing ')' (Mike
Colagrosso)
- (p. 58): Ref (Agrawal et al., 1996): The second
author's name should be "Mannila." (Kai Puolamäki)
- (p. 59): Ref. The title of Pearl's 1988 book is "Probabilistic
Reasoning in INTELLIGENT Systems." (Yang-feng Ji)
- (p. 62): Eq. 4.1: l(\theta) should be l(\theta|X)
(Chris Mansley)
- (p. 63): Eq. 4.5: p(x_1, x_2, \dots, x_K) should be
P(x_1, x_2, \dots, x_K). That is, P should be uppercase. (Mike Colagrosso)
- (p. 86): Eq. 5.3: '[' missing after the first 'E'.
(Hakan Haberdar)
- (p. 89): Eq at the bottom of the page: + (plus) before 2 should be − (minus) (Barış Can Daylık).
- (p. 90): Figure 5.2: Upper-left figure should be a
circle, but the plot is squashed. x_1 axis is longer than the x_2 axis.
(Mike Colagrosso)
- (p. 118): Equation at the bottom: In the second equality, the last C should be transposed. (Chulhong Min)
- (p. 124): Eq. 6.31: It should be x^t. (Stijn
Vanderlooy)
- (p. 157): Figure 8.2: h values are twice the actual
values. The titles should read 2h=2, 2h=1 and 2h=0.5. (Onder Eker, Alex
Kogan)
- (p. 160): The first sentence of the second paragraph
should read: "For example, the use of the Euclidean norm in equation
8.11 implies that ..." (Stijn Vanderlooy)
- (p. 176): Second line of fourth paragraph should
read: "... number of bits needed to encode the class code of an
instance" (Stijn Vanderlooy)
- (p. 178): Eq. 9.8: log should be base 2. (Ismail Ari)
- (p. 187 and 196): The name of the author for the Irep reference is Fürnkranz. (Stijn Vanderlooy)
- (p. 189): Third paragraph, line 5 from top:
"instances of all other classes are taken as [negative]
examples." (Ismail Ari)
- (p. 191): Figure 9.8: w_{11} x_1 + w_{12} x_2 +
w_{10} = 0 should be w_{11} x_1 + w_{12} x_2 + w_{10} > 0 (Mike
Colagrosso)
- (p. 198): Fourth line from the bottom of the page:
"magnitude" is misspelt. (Michael Dominguez)
- (p. 203): Eq. 10.7: w_{i0} shouldn't be bold. It's a
scalar, not a vector, as in the sentence above and Eq. 10.6. (Mike
Colagrosso)
- (p. 209): Eq. 10.23: E(w, w_o | X) should be E(w, w_0
| X). That is, the subscript should be a zero, not an "oh."
(Mike Colagrosso)
- (p. 210): Fig 10.6. The 11th line (that starts with
\Delta w_j) should also be enclosed in a for loop of j=0,\ldots,d. (Didem
Unat)
- (p. 222): Seventh line from the bottom of the page: The number of misclassifications is \#\{\xi^t \ge 1\}. (Ming Fan)
- (p. 227): First sentence of 10.10: Change
"discriminant" to "discrimination" (Mike Colagrosso)
- (p. 227): Exercise 1: change "function" to
"functions" (Mike Colagrosso)
- (p. 235): Fig. 11.2 caption mentions w_{ij} but there
is no w_{ij} in the figure. w_{ij} is the weight of the connection from
input x_j to output y_i. w_i (the vector of weights to output y_i) is shown in the figure. (Stijn Vanderlooy)
- (p. 236): The first line after eq. 11.6: ..., the
linear model can also be used ... (Ming Fan, Michael Orlov)
- (p. 238): In the first cross-entropy eq on the top of
the page, the summation over i and all i subscripts should be omitted.
(David Warde-Farley)
- (p. 239): First word in the Figure 11.3 narrative
should be "Perceptron" instead of "Percepton." (Mike
Colagrosso)
- (p. 240): In the line below the equation, it should read: "Note that y=s(x_1+x_2-1.5) satisfies ..." (Ming Fan)
- (p. 245): On the third line from the bottom of the
page, it should read z_h and not h_j. (Joel Kammet)
- (p. 252): sigmoid() is missing in the second terms on the right-hand side of the equations defining z_{1h} and z_{2l}.
- (p. 257): Insert "is" before "as"
in the last sentence of the first paragraph to read "..., it is as if
the training set ..." (Tunga Gungor)
- (p. 267): Fig. 11.20: The input units
x^{t-\tau},...,x^{t-1},x^t should be labeled in the opposite order; or
equivalently, the arrows should point to the left. x^t is the current
input seen (the latest) and x^{t-\tau} is the input seen \tau steps in the
past (delayed \tau times).
- (p. 279): Fig 12.2: On line 5 of the pseudocode, m_j should be m_i (Murat Semerci)
- (p. 282): Eq. 12.9: On the third line, x should be
x^t. (Ismail Ari)
- (p. 288): Remove the extra "the" in the
first sentence. (Tunga Gungor)
- (p. 308): Eq. 13.8: The denominator should read
"#{sequences}"; "number of" in the curly brackets is
redundant. (Can Kavaklioglu)
- (p. 313): Fig 13.3, legend: "...computation of
\alpha_{t+1}(j)..." (Ismail Ari)
- (p. 317): Fig. 13.4: Below the node for state j, '1'
should follow the line O_{t+}; that is, the observation is named O_{t+1}.
- (p. 319): Eq. 13.32: In estimating b_j(m), t should
range from 1 to T_k (and not T_k-1) in both the numerator and the
denominator. (Cem Keskin)
- (p. 320): Eq. 13.35: Drop j in P(G_{jl}). (Cem
Keskin)
- (p. 327): On the second line from the bottom of the page, "to" is missing before "say which one ..." (Hussein Issa).
- (p. 329): On line 6 from the bottom of the page, "of" is missing between "both" and "these." (Hussein Issa).
- (p. 330): "than" on line 16 should be
changed to "then." (Tunga Gungor)
- (p. 340): Eq. 14.12: The summation should start from
j=0. (Alex Kogan)
- (p. 343): Eq. 14.17: In the first term to the right, division by \sigma is missing in the numerator. It should be changed to \frac{p^{(1)}_1 / \sigma}{\sqrt{M/5}}. (Alex Kogan)
- (p. 362): Fig 15.2: On line 11, "Then" is
missing after the condition. It should read: If y^t_j=r^t Then
p^t_{j+1}\leftarrow \beta_j p^t_j Else p^t_{j+1}\leftarrow p^t_j (Stijn
Vanderlooy)
- (p. 375): First paragraph of 16.2: classification is
misspelled. (Mike Colagrosso)
- (p. 379): Eq. 16.9: V*(s_t) should be changed to
V*(s_{t+1}). (Omer Korcak)
- (p. 380): Fig 16.3, first line: Initialize a policy
\Pi' arbitrarily (Stijn Vanderlooy)
- (p. 381): Eqs. 16.10 and 16.11: Replace the denominators with \sum_{b\in{\cal A}} \exp ... (Stijn Vanderlooy)
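To make the candidate elimination note in the first erratum above more concrete, here is a minimal sketch. It is not code from the book or its solutions manual: it uses conjunctive hypotheses over nominal attributes (as in Mitchell, 1997) rather than the rectangle hypothesis class of Chapter 2, and the attribute domains and example data are made up for illustration. For this hypothesis class the S-set collapses to a single most specific hypothesis, while the G-set can contain several maximally general ones.

```python
# Minimal sketch of candidate elimination for conjunctive hypotheses over
# nominal attributes (Mitchell, 1997); an illustration, not the book's code.
# '?' matches any value; None marks the maximally specific "matches nothing".

ANY = '?'

def covers(h, x):
    """True if hypothesis h labels instance x as positive."""
    return all(hv == ANY or hv == xv for hv, xv in zip(h, x))

def at_least_as_general(h1, h2):
    """True if every instance covered by h2 is also covered by h1."""
    return all(a == ANY or b is None or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    """examples: list of (instance tuple, bool label); domains: per-attribute value lists."""
    d = len(domains)
    S = tuple([None] * d)           # single most specific hypothesis
    G = [tuple([ANY] * d)]          # set of most general hypotheses
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]   # drop general hypotheses that miss x
            # minimally generalize S so that it covers x
            S = tuple(xv if sv is None else (sv if sv == xv else ANY)
                      for sv, xv in zip(S, x))
        else:
            specializations = []
            for g in G:
                if not covers(g, x):
                    specializations.append(g)    # already excludes the negative instance
                    continue
                for i in range(d):               # minimal specializations of g that exclude x
                    if g[i] == ANY:
                        for v in domains[i]:
                            if v != x[i]:
                                s = g[:i] + (v,) + g[i + 1:]
                                if at_least_as_general(s, S):
                                    specializations.append(s)
            specializations = list(dict.fromkeys(specializations))
            # keep only the maximally general members
            G = [g for g in specializations
                 if not any(h != g and at_least_as_general(h, g) for h in specializations)]
    return S, G

# Tiny made-up example with two nominal attributes.
domains = [['red', 'blue'], ['small', 'large']]
examples = [(('red', 'small'), True), (('blue', 'small'), False)]
print(candidate_elimination(examples, domains))
# S = ('red', 'small'), G = [('red', '?')]
```

In this toy run the version space contains the two hypotheses between S and G, ('red', 'small') and ('red', '?'), both consistent with the one positive and one negative example.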
Solutions to Exercises: Available as a gzipped tar or zip archive for instructors who have adopted the book for course use. Please contact The MIT Press for the user name and password. The manual contains solutions to exercises and example Matlab programs.
Created on
Oct 24, 2004 by E. Alpaydin (my_last_name AT
boun DOT edu DOT tr)
- Jan 14, 2005: Added links to more online booksellers.
- Jan 31, 2005: Added link to the pdf file of figures.
- Apr 3, 2005: Added Errata.
- June 1, 2005: Further errata.
- July 12, 2005: Added more bookseller links.
- July 20, 2005: Added more bookseller links and the
lecture slides of Chapters 1, 2 and 11.
- July 28, 2005: Added all lecture slides.
- Sep 26, 2005: Added ppt of all lecture slides. This
new version (V1-1) is the same as the previously available V1-0 except
that I retyped all equations using Microsoft Equation Editor.
- Oct 25, 2005: Further errata.
- Nov 12, 2005: Added reviews and courses.
- Dec 14, 2005: Added links to MIT Press for sample
pdfs of Foreword, Preface, and Chapter 1.
- Feb 1, 2006: Added links to 2006 courses.
- Apr 27, 2006: Added new course links and errata.
- Jul 4, 2006: Added errata.pdf
- Sep 1, 2006: Added links to Fall 2006 courses.
- Nov 14, 2006: Added info on Foreign Editions.
- Jan 12, 2007: Added Solutions to Exercises. Thanks to Oya Aran, our web admin, for her help in password-protecting the file.
- Feb 5, 2007: Added links to Find-In-A-Library and new
courses.
- Jul 16, 2007: Some more errata.
- Sep 18, 2007: Added links to 2007 courses.
- May 1, 2008: Added an erratum and a review.
- August 20, 2009: Added info about the Chinese
edition.