Discussion on Data Mining

1.1 What Is Data Mining?
Data mining is the process of automatically discovering useful information in
large data repositories. Data mining techniques are deployed to scour large
data sets in order to find novel and useful patterns that might otherwise
remain unknown. They also provide the capability to predict the outcome of a
future observation, such as the amount a customer will spend at an online or a
brick-and-mortar store.
Not all information discovery tasks are considered to be data mining.
Simple queries, e.g., looking up individual records in a database or finding
web pages that contain a particular set of keywords, are not. This is because
such tasks can be accomplished through simple interactions with a database
management system or an information retrieval system. These systems rely
on traditional computer science techniques, which include sophisticated
indexing structures and query processing algorithms, for efficiently
organizing and retrieving information from large data repositories.
Nonetheless, data mining techniques have been used to enhance the
performance of such systems by improving the quality of the search results
based on their relevance to the input queries.
Data Mining and Knowledge Discovery in Databases
Data mining is an integral part of knowledge discovery in databases
(KDD), which is the overall process of converting raw data into useful
information, as shown in Figure 1.1. This process consists of a series of steps,
from data preprocessing to postprocessing of data mining results.
Figure 1.1.
The process of knowledge discovery in databases (KDD).
The input data can be stored in a variety of formats (flat files, spreadsheets, or
relational tables) and may reside in a centralized data repository or be
distributed across multiple sites. The purpose of preprocessing is to
transform the raw input data into an appropriate format for subsequent
analysis. The steps involved in data preprocessing include fusing data from
multiple sources, cleaning data to remove noise and duplicate observations,
and selecting records and features that are relevant to the data mining task at
hand. Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming step in the
overall knowledge discovery process.
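The preprocessing steps named above (fusing sources, cleaning, and selecting features) can be sketched in a few lines of Python. This is only an illustrative sketch: the two record lists, the field names, and the choice to treat a missing value as "noise" are assumptions made for the example, not part of the KDD process as defined in the text.

```python
# Hypothetical records from two sources; all field names are illustrative only.
store_sales = [
    {"customer_id": 1, "amount": 25.0, "channel": "store", "clerk": "A"},
    {"customer_id": 2, "amount": None, "channel": "store", "clerk": "B"},  # missing value
]
online_sales = [
    {"customer_id": 3, "amount": 40.0, "channel": "online", "clerk": None},
    {"customer_id": 1, "amount": 25.0, "channel": "store", "clerk": "A"},  # duplicate
]


def preprocess(sources, features):
    """Fuse sources, drop incomplete and duplicate records, keep chosen features."""
    fused = [record for source in sources for record in source]  # data fusion
    cleaned, seen = [], set()
    for record in fused:
        selected = tuple(record[f] for f in features)  # feature selection
        if any(value is None for value in selected):   # crude stand-in for noise removal
            continue
        if selected in seen:                           # duplicate removal
            continue
        seen.add(selected)
        cleaned.append(dict(zip(features, selected)))
    return cleaned


prepared = preprocess([store_sales, online_sales], ["customer_id", "amount", "channel"])
print(prepared)
```

Real preprocessing pipelines are usually far more involved; the sketch only shows how the three steps fit together.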
“Closing the loop” is a phrase often used to refer to the process of integrating
data mining results into decision support systems. For example, in business
applications, the insights offered by data mining results can be integrated
with campaign management tools so that effective marketing promotions can
be conducted and tested. Such integration requires a postprocessing step to
ensure that only valid and useful results are incorporated into the decision
support system. An example of postprocessing is visualization, which allows
analysts to explore the data and the data mining results from a variety of
viewpoints. Hypothesis testing methods can also be applied during
postprocessing to eliminate spurious data mining results. (See Chapter 10.)
1.2 Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often
encountered practical difficulties in meeting the challenges posed by big data
applications. The following are some of the specific challenges that
motivated the development of data mining.
Scalability
Because of advances in data generation and collection, data sets with sizes of
terabytes, petabytes, or even exabytes are becoming common. If data mining
algorithms are to handle these massive data sets, they must be scalable. Many
data mining algorithms employ special search strategies to handle
exponential search problems. Scalability may also require the implementation
of novel data structures to access individual records in an efficient manner.
For instance, out-of-core algorithms may be necessary when processing data
sets that cannot fit into main memory. Scalability can also be improved by
using sampling or developing parallel and distributed algorithms. A general
overview of techniques for scaling up data mining algorithms is given in
Appendix F.
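As one illustration of how sampling can help with data that does not fit into main memory, the sketch below uses reservoir sampling, which draws a fixed-size uniform random sample in a single pass over a stream. Reservoir sampling is chosen here only as an example of a sampling technique; the text does not prescribe it, and the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, index)       # both bounds inclusive
            if j < k:
                reservoir[j] = item         # replace with probability k / (index + 1)
    return reservoir


# Only k items are ever held in memory, however long the stream is.
sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

Because the stream is read once and only k items are stored, the same idea applies to files that are read record by record from disk.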
High Dimensionality
It is now common to encounter data sets with hundreds or thousands of
attributes instead of the handful common a few decades ago. In
bioinformatics, progress in microarray technology has produced gene
expression data involving thousands of features. Data sets with temporal or
spatial components also tend to have high dimensionality. For example,
consider a data set that contains measurements of temperature at various
locations. If the temperature measurements are taken repeatedly for an
extended period, the number of dimensions (features) increases in proportion
to the number of measurements taken. Traditional data analysis techniques
that were developed for low-dimensional data often do not work well for
such high-dimensional data due to issues such as the curse of dimensionality (to
be discussed in Chapter 2). Also, for some data analysis algorithms, the
computational complexity increases rapidly as the dimensionality (the
number of features) increases.
Heterogeneous and Complex Data
Traditional data analysis methods often deal with data sets containing
attributes of the same type, either continuous or categorical. As the role of
data mining in business, science, medicine, and other fields has grown, so has
the need for techniques that can handle heterogeneous attributes. Recent
years have also seen the emergence of more complex data objects. Examples
of such non-traditional types of data include web and social media data
containing text, hyperlinks, images, audio, and videos; DNA data with
sequential and three-dimensional structure; and climate data that consists of
measurements (temperature, pressure, etc.) at various times and locations on
the Earth’s surface. Techniques developed for mining such complex objects
should take into consideration relationships in the data, such as temporal and
spatial autocorrelation, graph connectivity, and parent-child relationships
between the elements in semi-structured text and XML documents.
Data Ownership and Distribution
Sometimes, the data needed for an analysis is not stored in one location or
owned by one organization. Instead, the data is geographically distributed
among resources belonging to multiple entities. This requires the
development of distributed data mining techniques. The key challenges faced
by distributed data mining algorithms include the following: (1) how to
reduce the amount of communication needed to perform the distributed
computation, (2) how to effectively consolidate the data mining results
obtained from multiple sources, and (3) how to address data security and
privacy issues.
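Challenges (1) and (2) can be illustrated with a deliberately simple computation, a global mean over data held at several sites. In this sketch, which assumes hypothetical site names and values, each site sends only a count and a sum to a coordinator, and the coordinator consolidates those summaries.

```python
def local_summary(values):
    """Computed at each site: only two numbers leave the site."""
    return len(values), sum(values)


def combine(summaries):
    """Computed at the coordinator: merge per-site summaries into a global mean."""
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(subtotal for _, subtotal in summaries)
    return total_sum / total_count


# Hypothetical data held at three separate sites.
sites = {"site_a": [2.0, 4.0, 6.0], "site_b": [8.0], "site_c": [10.0, 12.0]}
global_mean = combine([local_summary(values) for values in sites.values()])
print(global_mean)
```

Note that simply averaging the three site means would give a different, incorrect answer because the sites hold different numbers of records; consolidating results correctly is part of the challenge.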
Non-traditional Analysis
The traditional statistical approach is based on a hypothesize-and-test
paradigm. In other words, a hypothesis is proposed, an experiment is
designed to gather the data, and then the data is analyzed with respect to the
hypothesis. Unfortunately, this process is extremely labor-intensive. Current
data analysis tasks often require the generation and evaluation of thousands
of hypotheses, and consequently, the development of some data mining
techniques has been motivated by the desire to automate the process of
hypothesis generation and evaluation. Furthermore, the data sets analyzed in
data mining are typically not the result of a carefully designed experiment
and often represent opportunistic samples of the data, rather than random
samples.
1.3 The Origins of Data Mining
While data mining has traditionally been viewed as an intermediate process
within the KDD framework, as shown in Figure 1.1, it has emerged over the
years as an academic field within computer science, focusing on all aspects of
KDD, including data preprocessing, mining, and postprocessing. Its origin
can be traced back to the late 1980s, following a series of workshops
organized on the topic of knowledge discovery in databases. The workshops
brought together researchers from different disciplines to discuss the
challenges and opportunities in applying computational techniques to extract
actionable knowledge from large databases. The workshops quickly grew
into hugely popular conferences that were attended by researchers and
practitioners from both academia and industry. The success of these
conferences, along with the interest shown by businesses and industry in
recruiting new hires with a data mining background, has fueled the
tremendous growth of this field.
The field was initially built upon the methodology and algorithms that
researchers had previously used. In particular, data mining researchers draw
upon ideas, such as (1) sampling, estimation, and hypothesis testing from
statistics and (2) search algorithms, modeling techniques, and learning
theories from artificial intelligence, pattern recognition, and machine
learning. Data mining has also been quick to adopt ideas from other areas,
including optimization, evolutionary computing, information theory, signal
processing, visualization, and information retrieval, and to extend them to
solve the challenges of mining big data.
A number of other areas also play key supporting roles. In particular,
database systems are needed to provide support for efficient storage,
indexing, and query processing. Techniques from high performance (parallel)
computing are often important in addressing the massive size of some data
sets. Distributed techniques can also help address the issue of size and are
essential when the data cannot be gathered in one location. Figure 1.2 shows
the relationship of data mining to other areas.
Figure 1.2.
Data mining as a confluence of many disciplines.
Data Science and Data-Driven Discovery
Data science is an interdisciplinary field that studies and applies tools and
techniques for deriving useful insights from data. Although data science is
regarded as an emerging field with a distinct identity of its own, the tools and
techniques often come from many different areas of data analysis, such as
data mining, statistics, AI, machine learning, pattern recognition, database
technology, and distributed and parallel computing. (See Figure 1.2.)
The emergence of data science as a new field is a recognition that, often,
none of the existing areas of data analysis provides a complete set of tools for
the data analysis tasks encountered in emerging applications.
Instead, a broad range of computational, mathematical, and statistical skills is
required. To illustrate the challenges that arise in analyzing such data,
consider the following example. Social media and the Web present new
opportunities for social scientists to observe and quantitatively measure
human behavior on a large scale. To conduct such a study, social scientists
work with analysts who possess skills in areas such as web mining, natural
language processing (NLP), network analysis, data mining, and statistics.
Compared to more traditional research in social science, which is often based
on surveys, this analysis requires a broader range of skills and tools, and
involves far larger amounts of data. Thus, data science is, by necessity, a
highly interdisciplinary field that builds on the continuing work of many
fields.
The data-driven approach of data science emphasizes the direct discovery of
patterns and relationships from data, especially in large quantities of data,
often without the need for extensive domain knowledge. A notable example
of the success of this approach is represented by advances in neural networks,
i.e., deep learning, which have been particularly successful in areas that
have long proved challenging, e.g., recognizing objects in photos or videos
and words in speech, as well as in other application areas. However, note that
this is just one example of the success of data-driven approaches, and
dramatic improvements have also occurred in many other areas of data
analysis. Many of these developments are topics described later in this book.
Some cautions on potential limitations of a purely data-driven approach are
given in the Bibliographic Notes.
1.4 Data Mining Tasks
Data mining tasks are generally divided into two major categories:
Predictive tasks. The objective of these tasks is to predict the value of a
particular attribute based on the values of other attributes. The attribute to be
predicted is commonly known as the target or dependent variable, while
the attributes used for making the prediction are known as the explanatory or
independent variables.
Descriptive tasks. Here, the objective is to derive patterns (correlations,
trends, clusters, trajectories, and anomalies) that summarize the underlying
relationships in data. Descriptive data mining tasks are often exploratory in
nature and frequently require postprocessing techniques to validate and
explain the results.
Figure 1.3 illustrates four of the core data mining tasks that are described in
the remainder of this book.
Figure 1.3.
Four of the core data mining tasks.
Predictive modeling refers to the task of building a model for the target
variable as a function of the explanatory variables. There are two types of
predictive modeling tasks: classification, which is used for discrete target
variables, and regression, which is used for continuous target variables. For
example, predicting whether a web user will make a purchase at an online
bookstore is a classification task because the target variable is binary-valued.
On the other hand, forecasting the future price of a stock is a regression task
because price is a continuous-valued attribute. The goal of both tasks is to
learn a model that minimizes the error between the predicted and true values
of the target variable. Predictive modeling can be used to identify customers
who will respond to a marketing campaign, predict disturbances in the
Earth’s ecosystem, or judge whether a patient has a particular disease based
on the results of medical tests.
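The notion of "error between the predicted and true values" depends on the task. The sketch below shows two common choices, the misclassification rate for classification and the mean squared error for regression. These particular measures and the small data values are assumptions made for illustration; the text above does not commit to a specific error measure.

```python
def error_rate(y_true, y_pred):
    """Fraction of discrete predictions that differ from the true labels."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between continuous predictions and true values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: will a web user make a purchase? (hypothetical labels)
purchase_true = ["yes", "no", "no", "yes"]
purchase_pred = ["yes", "no", "yes", "yes"]
classification_error = error_rate(purchase_true, purchase_pred)

# Regression: future stock price (hypothetical values)
price_true = [10.0, 12.0, 11.0]
price_pred = [11.0, 12.0, 9.0]
regression_error = mean_squared_error(price_true, price_pred)

print(classification_error, regression_error)
```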
Example 1.1 (Predicting the Type of a Flower).
Consider the task of predicting a species of flower based on the
characteristics of the flower. In particular, consider classifying an Iris flower
as one of the following three Iris species: Setosa, Versicolour, or Virginica.
To perform this task, we need a data set containing the characteristics of
various flowers of these three species. A data set with this type of
information is the well-known Iris data set from the UCI Machine Learning
Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a
flower, this data set contains four other attributes: sepal width, sepal length,
petal length, and petal width. Figure 1.4 shows a plot of petal width versus
petal length for the 150 flowers in the Iris data set. Petal width is broken into
the categories low, medium, and high, which correspond to the intervals
[0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. Also, petal length is broken into
the categories low, medium, and high, which correspond to the intervals [0, 2.5),
[2.5, 5), and [5, ∞), respectively. Based on these categories of petal width and
length, the following rules can be derived:
Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
While these rules do not classify all the flowers, they do a good (but not
perfect) job of classifying most of the flowers. Note that flowers from the
Setosa species are well separated from the Versicolour and Virginica species
with respect to petal width and length, but the latter two species overlap
somewhat with respect to these attributes.
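The three rules of Example 1.1 can be written directly as a small classifier. The sketch below uses the interval boundaries given in the example; the function names and the sample measurements passed to it are hypothetical. A flower that matches none of the three rules is left unclassified, which reflects the statement that the rules do not classify all the flowers.

```python
def categorize(value, low_upper, medium_upper):
    """Map a measurement to 'low', 'medium', or 'high' using half-open intervals."""
    if value < low_upper:
        return "low"
    if value < medium_upper:
        return "medium"
    return "high"


def classify_iris(petal_width, petal_length):
    """Apply the three rules from Example 1.1; return None if no rule applies."""
    width = categorize(petal_width, 0.75, 1.75)    # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length = categorize(petal_length, 2.5, 5.0)    # [0, 2.5), [2.5, 5), [5, inf)
    rules = {
        ("low", "low"): "Setosa",
        ("medium", "medium"): "Versicolour",
        ("high", "high"): "Virginica",
    }
    return rules.get((width, length))


print(classify_iris(0.2, 1.4))   # width low, length low
print(classify_iris(1.5, 5.1))   # width medium, length high: no rule applies
```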
Figure 1.4.
Petal width versus petal length for 150 Iris flowers.
Association analysis is used to discover patterns that describe strongly
associated features in the data. The discovered patterns are typically
represented in the form of implication rules or feature subsets. Because of the
exponential size of its search space, the goal of association analysis is to
extract the most interesting patterns in an efficient manner. Useful
applications of association analysis include finding groups of genes that have
related functionality, identifying web pages that are accessed together, or
understanding the relationships between different elements of Earth’s climate
system.
Example 1.2 (Market Basket Analysis).
The transactions shown in Table 1.1 illustrate point-of-sale data collected at
the checkout counters of a grocery store. Association analysis can be applied
to find items that are frequently bought together by customers. For example,
we may discover the rule {Diapers}→{Milk}, which suggests that customers
who buy diapers also tend to buy milk. This type of rule can be used to
identify potential cross-selling opportunities among related items.
Table 1.1. Market basket data.
Transaction ID Items
1 {Bread, Butter, Diapers, Milk}
2 {Coffee, Sugar, Cookies, Salmon}
3 {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4 {Bread, Butter, Salmon, Chicken}
5 {Eggs, Bread, Butter}
6 {Salmon, Diapers, Milk}
7 {Bread, Tea, Sugar, Eggs}
8 {Coffee, Sugar, Chicken, Eggs}
9 {Bread, Diapers, Milk, Salt}
10 {Tea, Eggs, Cookies, Diapers, Milk}
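To make the rule {Diapers}→{Milk} concrete, the sketch below counts how often the items occur together in the ten transactions of Table 1.1. The measures used, support and confidence, are standard in association analysis but are not defined in this section, so their use here is an assumption made for illustration.

```python
# The ten transactions of Table 1.1.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"Diapers", "Milk"})
rule_confidence = confidence({"Diapers"}, {"Milk"})
print(rule_support, rule_confidence)
```

In this table, every transaction that contains diapers also contains milk, which is why the rule stands out.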
Cluster analysis seeks to find groups of closely related observations so that
observations that belong to the same cluster are more similar to each other
than observations that belong to other clusters. Clustering has been used to
group sets of related customers, find areas of the ocean that have a significant
impact on the Earth’s climate, and compress data.
Example 1.3 (Document Clustering).
The collection of news articles shown in Table 1.2 can be grouped based on
their respective topics. Each article is represented as a set of word-frequency
pairs (w : c), where w is a word and c is the number of times the word appears
in the article. There are two natural clusters in the data set. The first cluster
consists of the first four articles, which correspond to news about the
economy, while the second cluster contains the last four articles, which
correspond to news about health care. A good clustering algorithm should be
able to identify these two clusters based on the similarity between words that
appear in the articles.
Table 1.2. Collection of news articles.
Article Word-frequency pairs
1 dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2 machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3 job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4 domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5 patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor:
6 pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7 death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8 medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
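A minimal sketch of how the two clusters of Example 1.3 could be recovered from word overlap alone is given below. It uses cosine similarity between word-frequency vectors and groups articles that are linked by a chain of pairs with nonzero similarity. This simple linking scheme is an assumption chosen for brevity, not the clustering method of the text, and the count for "doctor" in article 5 is omitted because it is missing from the table as printed here.

```python
from math import sqrt

# Word-frequency pairs from Table 1.2 ("doctor" in article 5 omitted: count missing).
articles = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    3: {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
    4: {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
    5: {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2},
    6: {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
    7: {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    8: {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
}


def cosine(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def cluster(docs, threshold=0.0):
    """Group articles connected by a chain of pairs with similarity above the threshold."""
    unvisited = set(docs)
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = [d for d in unvisited if cosine(docs[current], docs[d]) > threshold]
            for d in linked:
                unvisited.remove(d)
                group.append(d)
                frontier.append(d)
        clusters.append(sorted(group))
    return clusters


clusters = cluster(articles)
print(clusters)
```

The approach works on this small collection because the economy and health care articles share no words at all; realistic document collections need more robust clustering techniques.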
Anomaly detection is the task of identifying observations whose
characteristics are significantly different from the rest of the data. Such
observations are known as anomalies or outliers. The goal of an anomaly
detection algorithm is to discover the real anomalies and avoid falsely
labeling normal objects as anomalous. In other words, a good anomaly
detector must have a high detection rate and a low false alarm rate.
Applications of anomaly detection include the detection of fraud, network
intrusions, unusual patterns of disease, and ecosystem disturbances, such as
droughts, floods, fires, and hurricanes.
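The two quantities mentioned above can be computed from a set of labeled observations. The sketch below assumes the usual definitions, which are not spelled out in this section: the detection rate is the fraction of true anomalies that are flagged, and the false alarm rate is the fraction of normal objects that are flagged. The labels are hypothetical.

```python
def detection_and_false_alarm_rates(is_anomaly, flagged):
    """Return (detection rate, false alarm rate) for boolean truth and prediction lists."""
    pairs = list(zip(is_anomaly, flagged))
    true_pos = sum(a and f for a, f in pairs)            # anomalies that were flagged
    false_neg = sum(a and not f for a, f in pairs)       # anomalies that were missed
    false_pos = sum((not a) and f for a, f in pairs)     # normal objects flagged
    true_neg = sum((not a) and (not f) for a, f in pairs)
    return true_pos / (true_pos + false_neg), false_pos / (false_pos + true_neg)


# Hypothetical ground truth and detector output for ten observations.
truth = [True, True, False, False, False, False, False, False, False, False]
flagged = [True, False, False, True, False, False, False, False, False, False]
detection_rate, false_alarm_rate = detection_and_false_alarm_rates(truth, flagged)
print(detection_rate, false_alarm_rate)
```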
Example 1.4 (Credit Card Fraud Detection).
A credit card company records the transactions made by every credit card
holder, along with personal information such as credit limit, age, annual
income, and address. Since the number of fraudulent cases is relatively small
compared to the number of legitimate transactions, anomaly detection
techniques can be applied to build a profile of legitimate transactions for the
users. When a new transaction arrives, it is compared against the profile of
the user. If the characteristics of the transaction are very different from the
previously created profile, then the transaction is flagged as potentially
fraudulent.
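A very simple version of the profile idea in Example 1.4 is sketched below. It assumes that a user's profile is just the mean and standard deviation of past transaction amounts and that a new transaction is "very different" when it lies more than three standard deviations from the mean. The threshold, the single attribute, and the transaction history are all hypothetical; practical systems use much richer profiles.

```python
from statistics import mean, stdev


def build_profile(amounts):
    """Summarize a user's legitimate transaction amounts."""
    return {"mean": mean(amounts), "stdev": stdev(amounts)}


def is_suspicious(amount, profile, threshold=3.0):
    """Flag a transaction whose amount is far from the user's usual spending."""
    z_score = abs(amount - profile["mean"]) / profile["stdev"]
    return z_score > threshold


# Hypothetical history of one card holder's transaction amounts.
history = [42.0, 18.5, 63.2, 25.0, 37.8, 51.3, 29.9, 44.1]
profile = build_profile(history)

print(is_suspicious(40.0, profile))    # close to the usual amounts
print(is_suspicious(950.0, profile))   # far outside the usual range
```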
1.5 Scope and Organization of the Book
This book introduces the major principles and techniques used in data mining
from an algorithmic perspective. A study of these principles and techniques is
essential for developing a better understanding of how data mining
technology can be applied to various kinds of data. This book also serves as a
starting point for readers who are interested in doing research in this field.
We begin the technical discussion of this book with a chapter on data
(Chapter 2), which discusses the basic types of data, data quality,
preprocessing techniques, and measures of similarity and dissimilarity.
Although this material can be covered quickly, it provides an essential
foundation for data analysis. Chapters 3 and 4 cover classification. Chapter 3
provides a foundation by discussing decision tree classifiers and several
issues that are important to all classification: overfitting, underfitting, model
selection, and performance evaluation. Using this foundation, Chapter 4
describes a number of other important classification techniques: rule-based
systems, nearest neighbor classifiers, Bayesian classifiers, artificial neural
networks, including deep learning, support vector machines, and ensemble
classifiers, which are collections of classifiers. The multiclass and
imbalanced class problems are also discussed. These topics can be covered
independently.
Association analysis is explored in Chapters 5 and 6. Chapter 5 describes the
basics of association analysis: frequent itemsets, association rules, and some
of the algorithms used to generate them. Specific types of frequent itemsets—
maximal, closed, and hyperclique—that are important for data mining are
also discussed, and the chapter concludes with a discussion of evaluation
measures for association analysis. Chapter 6 considers a variety of more
advanced topics, including how association analysis can be applied to
categorical and continuous data or to data that has a concept hierarchy. (A
concept hierarchy is a hierarchical categorization of objects, e.g., store
items→clothing→shoes→sneakers.) This chapter also describes
how association analysis can be extended to find sequential patterns (patterns
involving order), patterns in graphs, and negative relationships (if one item is
present, then the other is not).
Cluster analysis is discussed in Chapters 7 and 8. Chapter 7 first describes the
different types of clusters, and then presents three specific clustering
techniques: K-means, agglomerative hierarchical clustering, and DBSCAN.
This is followed by a discussion of techniques for validating the results of a
clustering algorithm. Additional clustering concepts and techniques are
explored in Chapter 8, including fuzzy and probabilistic clustering,
Self-Organizing Maps (SOM), graph-based clustering, spectral clustering, and
density-based clustering. There is also a discussion of scalability issues and
factors to consider when selecting a clustering algorithm.
Chapter 9 is on anomaly detection. After some basic definitions, several
different types of anomaly detection are considered: statistical,
distance-based, density-based, clustering-based, reconstruction-based, one-class
classification, and information theoretic. The last chapter, Chapter 10,
supplements the discussions in the other chapters with a discussion of the
statistical concepts important for avoiding spurious results, and then
discusses those concepts in the context of data mining techniques studied in
the previous chapters. These techniques include statistical hypothesis testing,
p-values, the false discovery rate, and permutation testing. Appendices A
through F give a brief review of important topics that are used in portions of
the book: linear algebra, dimensionality reduction, statistics, regression,
optimization, and scaling up data mining techniques for big data.
The subject of data mining, while relatively young compared to statistics or
machine learning, is already too large to cover in a single book. Selected
references to topics that are only briefly covered, such as data quality, are
provided in the Bibliographic Notes section of the appropriate chapter.
References to topics not covered in this book, such as mining streaming data
and privacy-preserving data mining, are provided in the Bibliographic Notes
of this chapter.
1.6 Bibliographic Notes
The topic of data mining has inspired many textbooks. Introductory textbooks
include those by Dunham [16], Han et al. [29], Hand et al. [31], Roiger and
Geatz [50], Zaki and Meira [61], and Aggarwal [2]. Data mining books with
a stronger emphasis on business applications include the works by Berry and
Linoff [5], Pyle [47], and Parr Rud [45]. Books with an emphasis on
statistical learning include those by Cherkassky and Mulier [11], and Hastie
et al. [32]. Similar books with an emphasis on machine learning or pattern
recognition are those by Duda et al. [15], Kantardzic [34], Mitchell [43],
Webb [57], and Witten and Frank [58]. There are also some more specialized
books: Chakrabarti [9] (web mining), Fayyad et al. [20] (collection of early
articles on data mining), Fayyad et al. [18] (visualization), Grossman et al.
[25] (science and engineering), Kargupta and Chan [35] (distributed data
mining), Wang et al. [56] (bioinformatics), and Zaki and Ho [60] (parallel
data mining).
There are several conferences related to data mining. Some of the main
conferences dedicated to this field include the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), the IEEE
International Conference on Data Mining (ICDM), the SIAM International
Conference on Data Mining (SDM), the European Conference on Principles
and Practice of Knowledge Discovery in Databases (PKDD), and the
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data
mining papers can also be found in other major conferences such as the
Conference and Workshop on Neural Information Processing Systems
(NIPS), the International Conference on Machine Learning (ICML), the ACM
SIGMOD/PODS conference, the International Conference on Very Large
Data Bases (VLDB), the Conference on Information and Knowledge
Management (CIKM), the International Conference on Data Engineering
(ICDE), the National Conference on Artificial Intelligence (AAAI), the IEEE
International Conference on Big Data, and the IEEE International Conference
on Data Science and Advanced Analytics (DSAA).
Journal publications on data mining include IEEE Transactions on
Knowledge and Data Engineering, Data Mining and Knowledge Discovery,
Knowledge and Information Systems, ACM Transactions on Knowledge
Discovery from Data, Statistical Analysis and Data Mining, and Information
Systems. Various open-source data mining software packages are available,
including Weka [27] and Scikit-learn [46]. More recently, data mining
software such as Apache Mahout and Apache Spark has been developed for
large-scale problems on distributed computing platforms.
There have been a number of general articles on data mining that define the
field or its relationship to other fields, particularly statistics. Fayyad et al.
[19] describe data mining and how it fits into the total knowledge discovery
process. Chen et al. [10] give a database perspective on data mining.
Ramakrishnan and Grama [48] provide a general discussion of data mining
and present several viewpoints. Hand [30] describes how data mining differs
from statistics, as does Friedman [21]. Lambert [40] explores the use of
statistics for large data sets and provides some comments on the respective
roles of data mining and statistics. Glymour et al. [23] consider the lessons
that statistics may have for data mining. Smyth et al. [53] describe how the
evolution of data mining is being driven by new types of data and
applications, such as those involving streams, graphs, and text. Han et al. [28]
consider emerging applications in data mining and Smyth [52] describes
some research challenges in data mining. Wu et al. [59] discuss how
developments in data mining research can be turned into practical tools. Data
mining standards are the subject of a paper by Grossman et al. [24]. Bradley
[7] discusses how data mining algorithms can be scaled to large data sets.
The emergence of new data mining applications has produced new challenges
that need to be addressed. For instance, concerns about privacy breaches as a
result of data mining have escalated in recent years, particularly in
application domains such as web commerce and health care. As a result, there
is growing interest in developing data mining algorithms that maintain user
privacy. Developing techniques for mining encrypted or randomized data is
known as privacy-preserving data mining. Some general references in this
area include papers by Agrawal and Srikant [3], Clifton et al. [12] and
Kargupta et al. [36]. Vassilios et al. [55] provide a survey. Another area of
concern is the bias in predictive models that may be used for some
applications, e.g., screening job applicants or deciding prison parole [39].
Assessing whether such applications are producing biased results is made
more difficult by the fact that the predictive models used for such
applications are often black box models, i.e., models that are not interpretable
in any straightforward way.
Data science, its constituent fields, and more generally, the new paradigm of
knowledge discovery they represent [33], have great potential, some of which
has been realized. However, it is important to emphasize that data science
works mostly with observational data, i.e., data that was collected by various
organizations as part of their normal operation. The consequence of this is
that sampling biases are common and the determination of causal factors
becomes more problematic. For this and a number of other reasons, it is often
hard to interpret the predictive models built from this data [42, 49]. Thus,
theory, experimentation and computational simulations will continue to be
the methods of choice in many areas, especially those related to science.
More importantly, a purely data-driven approach often ignores the existing
knowledge in a particular field. Such models may perform poorly, for
example, predicting impossible outcomes or failing to generalize to new
situations. However, if the model does work well, e.g., has high predictive
accuracy, then this approach may be sufficient for practical purposes in some
fields. But in many areas, such as medicine and science, gaining insight into
the underlying domain is often the goal. Some recent work attempts to
address these issues in order to create theory-guided data science, which
takes pre-existing domain knowledge into account [17, 37].
Recent years have witnessed a growing number of applications that rapidly
generate continuous streams of data. Examples of stream data include
network traffic, multimedia streams, and stock prices. Several issues must be
considered when mining data streams, such as the limited amount of memory
available, the need for online analysis, and the change of the data over time.
Data mining for stream data has become an important area in data mining.
Some selected publications are Domingos and Hulten [14] (classification),
Giannella et al. [22] (association analysis), Guha et al. [26] (clustering), Kifer
et al. [38] (change detection), Papadimitriou et al. [44] (time series), and Law
et al. [41] (dimensionality reduction).
Another area of interest is recommender and collaborative filtering systems
[1, 6, 8, 13, 54], which suggest movies, television shows, books, products,
etc. that a person might like. In many cases, this problem, or at least a
component of it, is treated as a prediction problem and thus, data mining
techniques can be applied [4, 51].
Bibliography
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of
recommender systems: A survey of the state-of-the-art and possible
extensions. IEEE transactions on knowledge and data engineering,
17(6):734–749, 2005.
[2] C. Aggarwal. Data mining: The Textbook. Springer, 2009.
[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc.
of 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 439–
450, Dallas, Texas, 2000. ACM Press.
[4] X. Amatriain and J. M. Pujol. Data mining methods for
recommender systems. In Recommender Systems Handbook, pages 227–
262. Springer, 2015.
[5] M. J. A. Berry and G. Linoff. Data Mining Techniques: For
Marketing, Sales, and Customer Relationship Management. Wiley
Computer Publishing, 2nd edition, 2004.
[6] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez.
Recommender systems survey. Knowledge-based systems, 46:109–132,
[7] P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling
mining algorithms to large databases. Communications of the ACM,
45(8):38–43, 2002.
[8] R. Burke. Hybrid recommender systems: Survey and experiments.
User modeling and user-adapted interaction, 12(4):331–370, 2002.
[9] S. Chakrabarti. Mining the Web: Discovering Knowledge from
Hypertext Data. Morgan Kaufmann, San Francisco, CA, 2003.
[10] M.-S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from
a Database Perspective. IEEE Transactions on Knowledge and Data
Engineering, 8(6):866–883, 1996.
[11] V. Cherkassky and F. Mulier. Learning from Data: Concepts,
Theory, and Methods. Wiley-IEEE Press, 2nd edition, 1998.
[12] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for
data mining. In National Science Foundation Workshop on Next
Generation Data Mining, pages 126–133, Baltimore, MD, November
[13] C. Desrosiers and G. Karypis. A comprehensive survey of
neighborhood-based recommendation methods. Recommender systems
handbook, pages 107–144, 2011.
[14] P. Domingos and G. Hulten. Mining high-speed data streams. In
Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining,
pages 71–80, Boston, Massachusetts, 2000. ACM Press.
[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification.
John Wiley & Sons, Inc., New York, 2nd edition, 2001.
[16] M. H. Dunham. Data Mining: Introductory and Advanced Topics.
Prentice Hall, 2006.
[17] J. H. Faghmous, A. Banerjee, S. Shekhar, M. Steinbach, V. Kumar,
A. R. Ganguly, and N. Samatova. Theory-guided data science for
climate change. Computer, 47(11):74–78, 2014.
[18] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information
Visualization in Data Mining and Knowledge Discovery. Morgan
Kaufmann Publishers, San Francisco, CA, September 2001.
[19] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data
Mining to Knowledge Discovery: An Overview. In Advances in
Knowledge Discovery and Data Mining, pages 1–34. AAAI Press, 1996.
[20] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
editors. Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press, 1996.
[21] J. H. Friedman. Data Mining and Statistics: What’s the Connection?
Unpublished. www-stat.stanford.edu/~jhf/ftp/dm-stat.ps, 1997.
[22] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent
Patterns in Data Streams at Multiple Time Granularities. In H. Kargupta,
A. Joshi, K. Sivakumar, and Y. Yesha, editors, Next Generation Data
Mining, pages 191–212. AAAI/MIT, 2003.
[23] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical
Themes and Lessons for Data Mining. Data Mining and Knowledge
Discovery, 1(1):11–28, 1997.
[24] R. L. Grossman, M. F. Hornick, and G. Meyer. Data mining
standards initiatives. Communications of the ACM, 45(8):59–61, 2002.
[25] R. L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R.
Namburu, editors. Data Mining for Scientific and Engineering
Applications. Kluwer Academic Publishers, 2001.
[26] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L.
O’Callaghan. Clustering Data Streams: Theory and Practice. IEEE
Transactions on Knowledge and Data Engineering, 15(3):515–528,
May/June 2003.
[27] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.
H. Witten. The WEKA Data Mining Software: An Update. SIGKDD
Explorations, 11(1), 2009.
[28] J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon.
Emerging scientific applications in data mining. Communications of the
ACM, 45(8):54–58, 2002.
[29] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and
Techniques. Morgan Kaufmann Publishers, San Francisco, 3rd edition,
[30] D. J. Hand. Data Mining: Statistics and More? The American
Statistician, 52(2):112–118, 1998.
[31] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining.
MIT Press, 2001.
[32] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of
Statistical Learning: Data Mining, Inference, Prediction. Springer, 2nd
edition, 2009.
[33] T. Hey, S. Tansley, K. M. Tolle, et al. The fourth paradigm:
data-intensive scientific discovery, volume 1. Microsoft Research,
Redmond, WA, 2009.
[34] M. Kantardzic. Data Mining: Concepts, Models, Methods, and
Algorithms. Wiley-IEEE Press, Piscataway, NJ, 2003.
[35] H. Kargupta and P. K. Chan, editors. Advances in Distributed and
Parallel Knowledge Discovery. AAAI Press, September 2002.
[36] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the Privacy
Preserving Properties of Random Data Perturbation Techniques. In
Proc. of the 2003 IEEE Intl. Conf. on Data Mining, pages 99–106,
Melbourne, Florida, December 2003. IEEE Computer Society.
[37] A. Karpatne, G. Atluri, J. Faghmous, M. Steinbach, A. Banerjee, A.
Ganguly, S. Shekhar, N. Samatova, and V. Kumar. Theory-guided Data
Science: A New Paradigm for Scientific Discovery from Data. IEEE
Transactions on Knowledge and Data Engineering, 2017.
[38] D. Kifer, S. Ben-David, and J. Gehrke. Detecting Change in Data
Streams. In Proc. of the 30th VLDB Conf., pages 180–191, Toronto,
Canada, 2004. Morgan Kaufmann.
[39] J. Kleinberg, J. Ludwig, and S. Mullainathan. A Guide to Solving
Social Problems with Machine Learning. Harvard Business Review,
December 2016.
[40] D. Lambert. What Use is Statistics for Massive Data? In ACM
SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery, pages 54–62, 2000.
[41] M. H. C. Law, N. Zhang, and A. K. Jain. Nonlinear Manifold
Learning for Data Streams. In Proc. of the SIAM Intl. Conf. on Data
Mining, Lake Buena Vista, Florida, April 2004. SIAM.
[42] Z. C. Lipton. The mythos of model interpretability. arXiv preprint
arXiv:1606.03490, 2016.
[43] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[44] S. Papadimitriou, A. Brockwell, and C. Faloutsos. Adaptive,
unsupervised stream mining. VLDB Journal, 13(3):222–239, 2004.
[45] O. Parr Rud. Data Mining Cookbook: Modeling Data for
Marketing, Risk and Customer Relationship Management. John Wiley
& Sons, New York, NY, 2001.
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J.
Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E.
Duchesnay. Scikit-learn: Machine Learning in Python. Journal of
Machine Learning Research, 12:2825–2830, 2011.
[47] D. Pyle. Business Modeling and Data Mining. Morgan Kaufmann,
San Francisco, CA, 2003.
[48] N. Ramakrishnan and A. Grama. Data Mining: From Serendipity to
Science—Guest Editors’ Introduction. IEEE Computer, 32(8):34–37,
[49] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should I trust you?:
Explaining the predictions of any classifier. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pages 1135–1144. ACM, 2016.
[50] R. Roiger and M. Geatz. Data Mining: A Tutorial Based Primer.
Addison-Wesley, 2002.
[51] J. Schafer. The Application of Data-Mining to Recommender
Systems. Encyclopedia of data warehousing and mining, 1:44–48, 2009.
[52] P. Smyth. Breaking out of the Black-Box: Research Challenges in
Data Mining. In Proc. of the 2001 ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge Discovery, 2001.
[53] P. Smyth, D. Pregibon, and C. Faloutsos. Data-driven evolution of
data mining algorithms. Communications of the ACM, 45(8):33–37,
[54] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering
techniques. Advances in artificial intelligence, 2009:4, 2009.
[55] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin,
and Y. Theodoridis. State-of-the-art in privacy preserving data mining.
SIGMOD Record, 33(1):50–57, 2004.
[56] J. T. L. Wang, M. J. Zaki, H. Toivonen, and D. E. Shasha, editors.
Data Mining in Bioinformatics. Springer, September 2004.
[57] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons,
2nd edition, 2002.
[58] I. H. Witten and E. Frank. Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, 3rd edition, 2011.
[59] X. Wu, P. S. Yu, and G. Piatetsky-Shapiro. Data Mining: How
Research Meets Practical Development? Knowledge and Information
Systems, 5(2):248–261, 2003.
[60] M. J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data
Mining. Springer, September 2002.
[61] M. J. Zaki and W. Meira Jr. Data Mining and Analysis:
Fundamental Concepts and Algorithms. Cambridge University Press,
New York, 2014.
1.7 Exercises
1. Discuss whether or not each of the following activities is a data
mining task.
   1. Dividing the customers of a company according to their gender.
   2. Dividing the customers of a company according to their
   3. Computing the total sales of a company.
   4. Sorting a student database based on student identification numbers.
   5. Predicting the outcomes of tossing a (fair) pair of dice.
   6. Predicting the future stock price of a company using historical
   7. Monitoring the heart rate of a patient for abnormalities.
   8. Monitoring seismic waves for earthquake activities.
   9. Extracting the frequencies of a sound wave.
2. Suppose that you are employed as a data mining consultant for an
Internet search engine company. Describe how data mining can help the
company by giving specific examples of how techniques, such as
clustering, classification, association rule mining, and anomaly detection
can be applied.
3. For each of the following data sets, explain whether or not data
privacy is an important issue.
   1. Census data collected from 1900–1950.
   2. IP addresses and visit times of web users who visit your website.
   3. Images from Earth-orbiting satellites.
   4. Names and addresses of people from the telephone book.
   5. Names and email addresses collected from the Web.
2 Data
This chapter discusses several data-related issues that are important for
successful data mining:
The Type of Data. Data sets differ in a number of ways. For example, the
attributes used to describe data objects can be of different types—quantitative
or qualitative—and data sets often have special characteristics; e.g., some
data sets contain time series or objects with explicit relationships to one
another. Not surprisingly, the type of data determines which tools and
techniques can be used to analyze the data. Indeed, new research in data
mining is often driven by the need to accommodate new application areas and
their new types of data.
The Quality of the Data. Data is often far from perfect. While most data
mining techniques can tolerate some level of imperfection in the data, a focus
on understanding and improving data quality typically improves the quality
of the resulting analysis. Data quality issues that often need to be addressed
include the presence of noise and outliers; missing, inconsistent, or duplicate
data; and data that is biased or, in some other way, unrepresentative of the
phenomenon or population that the data is supposed to describe.
Preprocessing Steps to Make the Data More Suitable for Data Mining. Often,
the raw data must be processed in order to make it suitable for analysis.
While one objective may be to improve data quality, other goals focus on
modifying the data so that it better fits a specified data mining technique or
tool. For example, a continuous attribute, e.g., length, sometimes needs to be
transformed into an attribute with discrete categories, e.g., short, medium, or
long, in order to apply a particular technique. As another example, the
number of attributes in a data set is often reduced because many techniques
are more effective when the data has a relatively small number of attributes.
Analyzing Data in Terms of Its Relationships. One approach to data analysis
is to find relationships among the data objects and then perform the
remaining analysis using these relationships rather than the data objects
themselves. For instance, we can compute the similarity or distance between
pairs of objects and then perform the analysis—clustering, classification, or
anomaly detection—based on these similarities or distances. There are many
such similarity or distance measures, and the proper choice depends on the
type of data and the particular application.
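Two of the points above, discretizing a continuous attribute and analyzing objects through their pairwise distances, can be sketched in a few lines of Python. This is only an illustration: the cut points `short_max` and `medium_max`, the function names, and the sample values are assumptions introduced here, and Euclidean distance is just one of the many possible measures.

```python
import math


def discretize_length(length, short_max=10.0, medium_max=100.0):
    """Map a continuous length to short, medium, or long.

    The cut points are hypothetical; real ones would come from domain
    knowledge or from the data.
    """
    if length <= short_max:
        return "short"
    if length <= medium_max:
        return "medium"
    return "long"


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(objects):
    """All pairwise distances; later analysis can use this matrix alone."""
    return [[euclidean(p, q) for q in objects] for p in objects]


# Hypothetical lengths and two-dimensional data objects.
lengths = [2.5, 10.0, 37.2, 250.0]
categories = [discretize_length(x) for x in lengths]

objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(objects)
```

Once `D` has been computed, a clustering or anomaly detection method that works from distances no longer needs the original objects.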
Example 2.1 (An Illustration of Data-Related Issues).
To further illustrate the importance of these issues, consider the following
hypothetical situation. You receive an email from a medical researcher
concerning a project that you are eager to work on.
I’ve attached the data file that I mentioned in my previous email. Each
line contains the information for a single patient and consists of five
fields. We want to predict the last field using the other fields. I don’t
have time to provide any more information about the data since I’m
going out of town for a couple of days, but hopefully that won’t slow
you down too much. And if you don’t mind, could we meet when I get
back to discuss your preliminary results? I might invite a few other
members of my team.
Thanks and see you in a couple of days.
Despite some misgivings, you proceed to analyze the data. The first few rows
of the file are as follows:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6

A brief look at the data reveals nothing strange. You put your doubts aside
and start the analysis. There are only 1000 lines, a smaller data file than you
had hoped for, but two days later, you feel that you have made some
progress. You arrive for the meeting, and while waiting for others to arrive,
you strike up a conversation with a statistician who is working on the project.
When she learns that you have also been analyzing the data from the project,
she asks if you would mind giving her a brief overview of your results.
Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven’t had much time for analysis, but I do have a few
interesting results.
Statistician: Amazing. There were so many data issues with this set of
patients that I couldn’t do much.
Data Miner: Oh? I didn’t hear about any possible problems.
Statistician: Well, first there is field 5, the variable we want to predict. It’s
common knowledge among people who analyze this type of data that results
are better if you work with the log of the values, but I didn’t discover this
until later. Was it mentioned to you?
Data Miner: No.
Statistician: But surely you heard about what happened to field 4? It’s
supposed to be measured on a scale from 1 to 10, with 0 indicating a missing
value, but because of a data entry error, all 10’s were changed into 0’s.
Unfortunately, since some of the patients have missing values for this field,
it’s impossible to say whether a 0 in this field is a real 0 or a 10. Quite a few
of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you
probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
Statistician: Anyway, given all those problems, I’m surprised you were able
to accomplish anything.
Data Miner: True, but my results are really quite good. Field 1 is a very
strong predictor of field 5. I’m surprised that this wasn’t noticed before.
Statistician: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh, no! I just remembered. We assigned ID numbers after we
sorted the records based on field 5. There is a strong connection, but it’s
meaningless. Sorry.
Although this scenario represents an extreme situation, it emphasizes the
importance of “knowing your data.” To that end, this chapter will address
each of the four issues mentioned above, outlining some of the basic
challenges and standard approaches.
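The last problem in the conversation, an ID number that "predicts" the target, is easy to reproduce. The sketch below uses made-up target values and a hand-written Pearson correlation; all names and numbers are assumptions chosen for illustration, not the data from the example.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values of the field to be predicted, in arrival order.
targets = [427.6, 10.7, 210.1, 95.3, 310.0, 55.8]

# Case 1: IDs assigned in arrival order.
ids_arrival = list(range(1, len(targets) + 1))
r_arrival = pearson(ids_arrival, targets)

# Case 2: records sorted by the target first, then IDs assigned.
# The ID is now simply the rank of the target value.
sorted_targets = sorted(targets)
ids_after_sort = list(range(1, len(sorted_targets) + 1))
r_sorted = pearson(ids_after_sort, sorted_targets)

print(f"arrival order: r = {r_arrival:.2f}")
print(f"after sorting: r = {r_sorted:.2f}")
```

In the second case the correlation is close to 1, yet the ID carries no information that could be used to predict the target for a new patient.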
2.1 Types of Data
A data set can often be viewed as a collection of data objects. Other names
for a data object are record, point, vector, pattern, event, case, sample,
instance, observation, or entity. In turn, data objects are described by a
number of attributes that capture the characteristics of an object, such as the
mass of a physical object or the time at which an event occurred. Other
names for an attribute are variable, characteristic, field, feature, or
Example 2.2 (Student Information).
Often, a data set is a file, in which the objects are records (or rows) in the file
and each field (or column) corresponds to an attribute. For example, Table
2.1 shows a data set that consists of student information. Each row
corresponds to a student and each column is an attribute that describes some
aspect of a student, such as grade point average (GPA) or identification
number (ID).
Table 2.1. A sample data set containing student information.
Student ID | Year | Grade Point Average (GPA) | …
1034262 | Senior | 3.24 | …
1052663 | Freshman | 3.51 | …
1082246 | Sophomore | 3.62 | …
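As a rough sketch, a record data set like the one in Table 2.1 can be represented as a list of records, one per data object. The class and field names below are assumptions, and only the three columns that the table shows are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Student:
    """One data object (row); each field is an attribute (column)."""
    student_id: int
    year: str
    gpa: float


students = [
    Student(1034262, "Senior", 3.24),
    Student(1052663, "Freshman", 3.51),
    Student(1082246, "Sophomore", 3.62),
]

# An attribute (column) is obtained by reading the same field of every row.
gpas = [s.gpa for s in students]
mean_gpa = sum(gpas) / len(gpas)
```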
Although record-based data sets are common, either in flat files or relational
database systems, there are other important types of data sets and systems for
storing data. In Section 2.1.2, we will discuss some of the types of data sets
that are commonly encountered in data mining. However, we first consider
2.1.1 Attributes and Measurement
In this section, we consider the types of attributes used to describe data
objects. We first define an attribute, then consider what we mean by the type
of an attribute, and finally describe the types of attributes that are commonly
What Is an Attribute?
We start with a more detailed definition of an attribute.
Definition 2.1.
An attribute is a property or characteristic of an object that can vary, either
from one object to another or from one time to another.
For example, eye color varies from person to person, while the temperature
of an object varies over time. Note that eye color is a symbolic attribute with
a small number of possible values {brown, black, blue, green, hazel, etc.},
while temperature is a numerical attribute with a potentially unlimited
number of values.
At the most basic level, attributes are not about numbers or symbols.
However, to discuss and more precisely analyze the characteristics of objects,
we assign numbers or symbols to them. To do this in a well-defined way, we
need a measurement scale.
Definition 2.2.
A measurement scale is a rule (function) that associates a numerical or
symbolic value with an attribute of an object.
Formally, the process of measurement is the application of a measurement
scale to associate a value with a particular attribute of a specific object. While
this may seem a bit abstract, we engage in the process of measurement all the
time. For instance, we step on a bathroom scale to determine our weight, we
classify someone as male or female, or we count the number of chairs in a
room to see if there will be enough to seat all the people coming to a meeting.
In all these cases, the “physical value” of an attribute of an object is mapped
to a numerical or symbolic value.
With this background, we can discuss the type of an attribute, a concept that
is important in determining if a particular data analysis technique is
consistent with a specific type of attribute.
The Type of an Attribute
It is common to refer to the type of an attribute as the type of a
measurement scale. It should be apparent from the previous discussion that
an attribute can be described using different measurement scales and that the
properties of an attribute need not be the same as the properties of the values
used to measure it. In other words, the values used to represent an attribute
can have properties that are not properties of the attribute itself, and vice
versa. This is illustrated with two examples.
Example 2.3 (Employee Age and ID
Two attributes that might be associated with an employee are ID and age (in
years). Both of these attributes can be represented as integers. However,
while it is reasonable to talk about the average age of an employee, it makes
no sense to talk about the average employee ID. Indeed, the only aspect of
employees that we want to capture with the ID attribute is that they are
distinct. Consequently, the only valid operation for employee IDs is to test
whether they are equal. There is no hint of this limitation, however, when
integers are used to represent the employee ID attribute. For the age attribute,
the properties of the integers used to represent age are very much the
properties of the attribute. Even so, the correspondence is not complete
because, for example, ages have a maximum, while integers do not.
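One way to make the limitation explicit in code is to wrap the ID in a type that supports only equality. The sketch below is a possible approach rather than a standard recipe; `EmployeeId` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmployeeId:
    """An identifier: the dataclass provides ==, but no order or arithmetic."""
    value: int


ages = [34, 41, 29]
ids = [EmployeeId(1001), EmployeeId(1002), EmployeeId(1001)]

# Ages behave like the integers that represent them.
average_age = sum(ages) / len(ages)

# IDs can be tested for equality.
same_employee = ids[0] == ids[2]

# Averaging IDs is rejected, because EmployeeId defines no addition.
try:
    sum(ids) / len(ids)
    id_average_allowed = True
except TypeError:
    id_average_allowed = False
```

Had the IDs been stored as plain integers, the same averaging expression would have run without complaint and produced a meaningless number.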
Example 2.4 (Length of Line Segments).
Consider Figure 2.1, which shows some objects—line segments—and how
the length attribute of these objects can be mapped to numbers in two
different ways. Each successive line segment, going from the top to the
bottom, is formed by appending the topmost line segment to itself. Thus, the
second line segment from the top is formed by appending the topmost line
segment to itself twice, the third line segment from the top is formed by
appending the topmost line segment to itself three times, and so forth. In a
very real (physical) sense, all the line segments are multiples of the first. This
fact is captured by the measurements on the right side of the figure, but not
by those on the left side. More specifically, the measurement scale on the left
side captures only the ordering of the length attribute, while the scale on the
right side captures both the ordering and additivity properties. Thus, an
attribute can be measured in a way that does not capture all the properties of
the attribute.
Figure 2.1.
The measurement of the length of line segments on two different
scales of measurement.
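The difference between the two scales in Example 2.4 can be checked mechanically. In the sketch below each segment is identified by the number of copies of the topmost segment it contains; the particular numbers assigned by the order-only scale are assumptions made for illustration.

```python
# Segment k consists of k copies of the topmost (unit) segment.
segments = [1, 2, 3, 4, 5]

# Two hypothetical measurement scales for the length attribute.
order_only_scale = {1: 1, 2: 3, 3: 7, 4: 8, 5: 10}
additive_scale = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}


def preserves_order(scale):
    """Longer segments must receive larger values."""
    values = [scale[s] for s in segments]
    return all(a < b for a, b in zip(values, values[1:]))


def preserves_additivity(scale):
    """Appending segment a to segment b yields segment a + b, so the
    measured values should add in the same way."""
    return all(
        scale[a] + scale[b] == scale[a + b]
        for a in segments
        for b in segments
        if a + b in scale
    )
```

Both scales pass `preserves_order`, but only the additive one passes `preserves_additivity`.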
Knowing the type of an attribute is important because it tells us which
properties of the measured values are consistent with the underlying
properties of the attribute, and therefore, it allows us to avoid foolish actions,
such as computing the average employee ID.
The Different Types of Attributes
A useful (and simple) way to specify the type of an attribute is to identify the
properties of numbers that correspond to underlying properties of the
attribute. For example, an attribute such as length has many of the properties
of numbers. It makes sense to compare and order objects by length, as well as
to talk about the differences and ratios of length. The following properties
(operations) of numbers are typically used to describe attributes.
1. Distinctness = and ≠
2. Order <, ≤, >, and ≥
3. Addition + and −
4. Multiplication × and /
Given these properties, we can define four types of attributes: nominal,
ordinal, interval, and ratio. Table 2.2 gives the definitions of these types,
along with information about the statistical operations that are valid for each
type. Each attribute type possesses all of the properties and operations of the
attribute types above it. Consequently, any property or operation that is valid
for nominal, ordinal, and interval attributes is also valid for ratio attributes. In
other words, the definition of the attribute types is cumulative. However, this
does not mean that the statistical operations appropriate for one attribute type
are appropriate for the attribute types above it.
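The cumulative nature of the four types can be expressed as a small lookup. This is a minimal sketch: the enum, the ASCII operator strings, and the function name are assumptions introduced here.

```python
from enum import IntEnum


class AttributeType(IntEnum):
    """Attribute types, ordered from fewest to most valid operations."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Operations that first become meaningful at each level
# (distinctness, order, addition, multiplication).
INTRODUCED = {
    AttributeType.NOMINAL: {"==", "!="},
    AttributeType.ORDINAL: {"<", "<=", ">", ">="},
    AttributeType.INTERVAL: {"+", "-"},
    AttributeType.RATIO: {"*", "/"},
}


def valid_operations(attr_type):
    """Union of the operations introduced at this level and all lower ones."""
    ops = set()
    for level in AttributeType:
        if level <= attr_type:
            ops |= INTRODUCED[level]
    return ops
```

For example, `valid_operations(AttributeType.ORDINAL)` contains the comparison operators but neither `"+"` nor `"/"`.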
Table 2.2. Different attribute types.
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠) | zip codes, …, eye color, … | …, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of … | …, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | …, temperature in Celsius … | …, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (×, /) | temperature in Kelvin, …, counts, age, … | …
Nominal and ordinal attributes are collectively referred to as categorical or
qualitative attributes. As the name suggests, qualitative attributes, such as
employee ID, lack most of the properties of numbers. Even if they are
represented by numbers, i.e., integers, they should be treated more like
symbols. The remaining two types of attributes, interval and ratio, are
collectively referred to as quantitative or numeric attributes. Quantitative
attributes are represented by numbers and have most of the properties of
numbers. Note that quantitative attributes can be integer-valued or
The types of attributes can also be described in terms of transformations that
do not change the meaning of an attribute. Indeed, S. Smith Stevens, the
psychologist who originally defined the types of attributes shown in Table
2.2, defined them in terms of these permissible transformations. For
example, the meaning of a length attribute is unchanged if it is measured in
meters instead of feet.
The statistical operations that make sense for a particular type of attribute are
those that will yield the same results when the attribute is transformed by
using a transformation that preserves the attribute’s meaning. To illustrate,
the average length of a set of objects is different when measured in meters
rather than in feet, but both averages represent the same length. Table 2.3
shows the meaning-preserving transformations for the four attribute types of
Table 2.2.
Table 2.3. Transformations that define attribute levels.
Attribute Type | Transformation | Comment
Nominal | Any one-to-one mapping, e.g., a permutation of values. | If all employee ID numbers are reassigned, it will not make any difference.
Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a × old_value + b, a and b constants. | The Fahrenheit and Celsius scales differ in the location of their zero value and the size of a degree (unit).
Ratio | new_value = a × old_value | Length can be measured in meters or feet.
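The link between permissible transformations and valid statistics can be checked numerically. The sketch below uses hypothetical lengths, a meters-to-feet factor as the ratio transformation, and cubing as an arbitrary monotonic transformation; it suggests why the mean suits ratio attributes while the median remains meaningful for ordinal ones.

```python
import math
from statistics import mean, median

FEET_PER_METER = 3.28084

# Hypothetical lengths in meters (an odd count, so the median is a data value).
lengths_m = [1.0, 2.0, 4.0, 10.0, 20.0]


def to_feet(x):
    """Ratio transformation: new_value = a * old_value."""
    return FEET_PER_METER * x


def cube(x):
    """Order-preserving transformation that is not of the form a * x."""
    return x ** 3


# The mean commutes with a change of units: both routes give the same length.
mean_ok = math.isclose(
    to_feet(mean(lengths_m)), mean([to_feet(x) for x in lengths_m])
)

# The median commutes with any monotonic transformation.
median_ok = cube(median(lengths_m)) == median([cube(x) for x in lengths_m])

# The mean does not commute with a merely monotonic transformation.
mean_breaks = not math.isclose(
    cube(mean(lengths_m)), mean([cube(x) for x in lengths_m])
)
```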
Example 2.5 (Temperature Scales).
Temperature provides a good illustration of some of the concepts that have
been described. First, temperature can be either an interval or a ratio attribute,
depending on its measurement scale. When measured on the Kelvin scale, a
temperature of 2° is, in a physically meaningful way, twice that of a
temperature of 1°. This is not true when temperature is measured on either the
Celsius or Fahrenheit scales, because, physically, a temperature of 1°
Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit
(Celsius). The problem is that the zero points of the Fahrenheit and Celsius
scales are, in a physical sense, arbitrary, and therefore, the ratio of two
Celsius or Fahrenheit temperatures is not physically meaningful.
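A short calculation makes the point concrete. The sketch assumes the usual conversion K = C + 273.15; the variable names are introduced here for illustration.

```python
import math


def celsius_to_kelvin(c):
    """Interval transformation new_value = a * old_value + b with a = 1."""
    return c + 273.15


# Naive ratio of the Celsius readings 2 and 1.
naive_ratio = 2.0 / 1.0

# Ratio of the same two temperatures on the Kelvin scale, whose zero
# point is not arbitrary.
physical_ratio = celsius_to_kelvin(2.0) / celsius_to_kelvin(1.0)

# Differences, by contrast, survive the change of scale.
difference_kept = math.isclose(
    celsius_to_kelvin(2.0) - celsius_to_kelvin(1.0), 2.0 - 1.0
)
```

Here `physical_ratio` is about 1.004, so the temperature at 2 degrees Celsius is only slightly higher than at 1 degree Celsius, not twice as high.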
Describing Attributes by the Number of Values
An independent way of distinguishing between attributes is by the number of
values they can take.
Discrete. A discrete attribute has a finite or countably infinite set of
values. Such attributes can be categorical, such as zip codes or ID
numbers, or numeric, such as counts. Discrete attributes are often
represented using integer variables. Binary attributes are a special case
of discrete attributes and assume only two values, e.g., true/false, yes/no,
male/female, or 0/1. Binary attributes are often represented as Boolean
variables, or as integer variables that only take the values 0 or 1.
Continuous. A continuous attribute is one whose values are real numbers.
Examples include attributes such as temperature, height, or weight.
Continuous attributes are typically represented as floating-point
variables. Practically, real values can be measured and represented only
with limited precision.
In theory, any of the measurement scale types—nominal, ordinal, interval,
and ratio—could be combined with any of the types based on the number of
attribute values—binary, discrete, and continuous. However, some
combinations occur only infrequently or do not make much sense. For
instance, it is difficult to think of a realistic data set that contains a continuous
binary attribute. Typically, nominal and ordinal attributes are binary or
discrete, while interval and ratio attributes are continuous. However, count
attributes, which are discrete, are also ratio attributes.
Asymmetric Attributes
For asymmetric attributes, only presence—a non-zero attribute value—is
regarded as important. Consider a data set in which each object is a student
and each attribute records whether a student took a particular course at a
university. For a specific student, an attribute has a value of 1 if the student
took the course associated with that attribute and a value of 0 otherwise.
Because students take only a small fraction of all available courses, most of
the values in such a data set would be 0. Therefore, it is more meaningful and
more efficient to focus on the non-zero values. To illustrate, if students are
compared on the basis of the courses they don’t take, then most students
would seem very similar, at least if the number of courses is large. Binary
attributes where only non-zero values are important are called asymmetric
binary attributes. This type of attribute is particularly important for
association analysis, which is discussed in Chapter 5. It is also possible to
have discrete or continuous asymmetric features. For instance, if the number
of credits associated with each course is recorded, then the resulting data set
will consist of asymmetric discrete or continuous attributes.
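The course example can be made concrete by comparing two students in two ways: once counting shared absences as agreement, and once ignoring them. The sketch uses the simple matching coefficient and the Jaccard coefficient, two measures not introduced in this section; the course indices and the number of courses are hypothetical.

```python
def simple_matching(x, y):
    """Fraction of attributes on which x and y agree, counting 0-0 matches."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Agreement on presences only; 0-0 matches are ignored."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0


n_courses = 1000
student_a = [0] * n_courses
student_b = [0] * n_courses

# Each student took three courses, with no course in common.
for i in (3, 17, 42):
    student_a[i] = 1
for i in (5, 99, 640):
    student_b[i] = 1

smc = simple_matching(student_a, student_b)
jac = jaccard(student_a, student_b)
```

The two students share no courses, yet `smc` is 0.994 because of the 994 courses that neither took, while `jac` is 0.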
General Comments on Levels of Measurement
As described in the rest of this chapter, there are many diverse types of data.
The previous discussion of measurement scales, while useful, is not complete
and has some limitations. We provide the following comments and guidance.
Distinctness, order, and meaningful intervals and ratios are only four
properties of data—many others are possible. For instance, some data is
inherently cyclical, e.g., position on the surface of the Earth or time. As
another example, consider set valued attributes, where each attribute
value is a set of elements, e.g., the set of movies seen in the last year.
Define one set of elements (movies) to be greater (larger) than a second
set if the second set is a subset of the first. However, such a relationship
defines only a partial order that does not match any of the attribute types
just defined.
The numbers or symbols used to capture attribute values may not
capture all the properties of the attributes or may suggest properties that
are not there. An illustration of this for integers was presented in
Example 2.3, i.e., averages of IDs and out-of-range ages.
Data is often transformed for the purpose of analysis—see Section 2.3.7.
This often changes the distribution of the observed variable to a
distribution that is easier to analyze, e.g., a Gaussian (normal)
distribution. Often, such transformations only preserve the order of the
original values, and other properties are lost. Nonetheless, if the desired
outcome is a statistical test of differences or a predictive model, such a
transformation is justified.
The final evaluation of any data analysis, including operations on
attributes, is whether the results make sense from a domain point of view.
In summary, it can be challenging to determine which operations can be
performed on a particular attribute or a collection of attributes without
compromising the integrity of the analysis. Fortunately, established practice
often serves as a reliable guide. Occasionally, however, standard practices are
erroneous or have limitations.
2.1.2 Types of Data Sets
There are many types of data sets, and as the field of data mining develops
and matures, a greater variety of data sets become available for analysis. In
this section, we describe some of the most common types. For convenience,
we have grouped the types of data sets into three groups: record data,
graph-based data, and ordered data. These categories do not cover all possibilities
and other groupings are certainly possible.
General Characteristics of Data Sets
Before providing details of specific kinds of data sets, we discuss three
characteristics that apply to many data sets and have a significant impact on
the data mining techniques that are used: dimensionality, distribution, and
resolution.
The dimensionality of a data set is the number of attributes that the objects in
the data set possess. Analyzing data with a small number of dimensions tends
to be qualitatively different from analyzing moderate or high-dimensional
data. Indeed, the difficulties associated with the analysis of high-dimensional
data are sometimes referred to as the curse of dimensionality. Because of
this, an important motivation in preprocessing the data is dimensionality
reduction. These issues are discussed in more depth later in this chapter and
in Appendix B.
The distribution of a data set is the frequency of occurrence of various values
or sets of values for the attributes comprising data objects. Equivalently, the
distribution of a data set can be considered as a description of the
concentration of objects in various regions of the data space. Statisticians
have enumerated many types of distributions, e.g., Gaussian (normal), and
described their properties. (See Appendix C.) Although statistical approaches
for describing distributions can yield powerful analysis techniques, many
data sets have distributions that are not well captured by standard statistical
distributions.
As a result, many data mining algorithms do not assume a particular
statistical distribution for the data they analyze. However, some general
aspects of distributions often have a strong impact. For example, suppose a
categorical attribute is used as a class variable, where one of the categories
occurs 95% of the time, while the other categories together occur only 5% of
the time. This skewness in the distribution can make classification difficult as
discussed in Section 4.11. (Skewness has other impacts on data analysis that
are not discussed here.)
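One way to see why such skew is troublesome is to look at a trivial baseline.
The sketch below uses made-up labels with the 95%/5% split from the example
and a "classifier" that always predicts the most frequent class; it is an
illustration only, not a technique from Section 4.11.

```python
from collections import Counter

# Hypothetical class labels: one category covers 95% of the objects.
labels = ["A"] * 95 + ["B"] * 3 + ["C"] * 2

# A baseline that ignores all attributes and always predicts the majority class.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```

The baseline reaches 95% accuracy without ever identifying a rare-class
object, so plain accuracy says little about how well a model handles the
categories that occur only 5% of the time.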
A special case of skewed data is sparsity. For sparse binary, count or
continuous data, most attributes of an object have values of 0. In many cases,
fewer than 1% of the values are non-zero. In practical terms, sparsity is an
advantage because usually only the non-zero values need to be stored and
manipulated. This results in significant savings with respect to computation
time and storage. Indeed, some data mining algorithms, such as the
association rule mining algorithms described in Chapter 5, work well only for
sparse data. Finally, note that often the attributes in sparse data sets are
asymmetric attributes.
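The storage and computation savings can be sketched as follows. The
dictionary-based representation and the helper names to_sparse and sparse_dot
are assumptions made for this illustration; real systems typically use
dedicated sparse matrix formats.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {position: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparse_dot(a, b):
    """Dot product that visits only the non-zero entries of the shorter vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)


# Two hypothetical objects with ten attributes each, mostly zero.
x = to_sparse([0, 0, 3, 0, 0, 0, 1, 0, 0, 0])
y = to_sparse([0, 2, 4, 0, 0, 0, 0, 0, 0, 5])

print(x)                 # two stored entries instead of ten
print(sparse_dot(x, y))  # only position 2 is non-zero in both
```

The work done by sparse_dot depends on the number of non-zero entries rather
than on the full dimensionality of the data.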
It is frequently possible to obtain data at different levels of resolution, and
often the properties of the data are different at different resolutions. For
instance, the surface of the Earth seems very uneven at a resolution of a few
meters, but is relatively smooth at a resolution of tens of kilometers. The
patterns in the data also depend on the level of resolution. If the resolution is
too fine, a pattern may not be visible or may be buried in noise; if the
resolution is too coarse, the pattern can disappear. For example, variations in
atmospheric pressure on a scale of hours reflect the movement of storms and
other weather systems. On a scale of months, such phenomena are not
detectable.
Record Data
Much data mining work assumes that the data set is a collection of records
(data objects), each of which consists of a fixed set of data fields (attributes).
See Figure 2.2(a). For the most basic form of record data, there is no explicit
relationship among records or data fields, and every record (object) has the
same set of attributes. Record data is usually stored either in flat files or in
relational databases. Relational databases are certainly more than a collection
of records, but data mining often does not use any of the additional
information available in a relational database. Rather, the database serves as a
convenient place to find records. Different types of record data are described
below and are illustrated in Figure 2.2.
Figure 2.2.
Different variations of record data.
Transaction or Market Basket Data
Transaction data is a special type of record data, where each record
(transaction) involves a set of items. Consider a grocery store. The set of
products purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased are the items.
This type of data is called market basket data because the items in each
record are the products in a person’s “market basket.” Transaction data is a
collection of sets of items, but it can be viewed as a set of records whose
fields are asymmetric attributes. Most often, the attributes are binary,
indicating whether an item was purchased, but more generally, the attributes
can be discrete or continuous, such as the number of items purchased or the
amount spent on those items. Figure 2.2(b) shows a sample transaction data
set. Each row represents the purchases of a particular customer at a particular
time.
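The two views of transaction data, a collection of item sets and a set of
records with binary asymmetric attributes, can be related with a short sketch.
The transactions below are invented and are not the contents of Figure 2.2(b).

```python
# Hypothetical market baskets: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers"},
]

# One binary attribute per distinct item, in a fixed (alphabetical) order.
items = sorted(set().union(*transactions))

# 1 means the item was purchased in that transaction, 0 means it was not.
binary_records = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)
for record in binary_records:
    print(record)
```

Only the 1 entries carry information here, which is what makes the attributes
asymmetric.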
The Data Matrix
If all the data objects in a collection of data have the same fixed set of
numeric attributes, then the data objects can be thought of as points (vectors)
in a multidimensional space, where each dimension represents a distinct
attribute describing the object. A set of such data objects can be interpreted as
an m by n matrix, where there are m rows, one for each object, and n
columns, one for each attribute. (A representation that has data objects as
columns and attributes as rows is also fine.) This matrix is called a data
matrix or a pattern matrix. A data matrix is a variation of record data, but
because it consists of numeric attributes, standard matrix operations can be
applied to transform and manipulate the data. Therefore, the data matrix is the
standard data format for most statistical data. Figure 2.2(c) shows a sample
data matrix.
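As a small illustration of treating records as a matrix, the sketch below
stores an invented 3 by 2 data matrix as a list of rows, centers each column on
its mean, and forms the transposed (attributes as rows) representation. Plain
Python lists are used only to keep the example self-contained; a numerical
library would normally be used instead.

```python
# Hypothetical data matrix: m = 3 objects (rows), n = 2 numeric attributes (columns).
data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
]
m, n = len(data), len(data[0])

# Column means: one value per attribute.
column_means = [sum(row[j] for row in data) / m for j in range(n)]

# Centering: subtract each attribute's mean from every object's value.
centered = [[row[j] - column_means[j] for j in range(n)] for row in data]

# Transpose: the alternative layout with attributes as rows and objects as columns.
transposed = [list(column) for column in zip(*data)]

print(column_means)
print(centered)
print(transposed)
```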
The Sparse Data Matrix
A sparse data matrix is a special case of a data matrix where the attributes are
of the same type and are asymmetric; i.e., only non-zero values are important.
Transaction data is an example of a sparse data matrix that has only 0–1
entries. Another common example is document data. In particular, if the order
of the terms (words) in a document is ignored—the “bag of words” approach
—then a document can be represented as a term vector, where each term is a
component (attribute) of the vector and the value of each component is the
number of times the corresponding term occurs in the document. This
representation of a collection of documents is often called a document-term
matrix. Figure 2.2(d) shows a sample document-term matrix. The documents
are the rows of this matrix, while the terms are the columns. In practice, only
the non-zero entries of sparse data matrices are stored.
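A minimal sketch of the bag-of-words construction follows. The two documents
are invented, tokenization is simplified to splitting on whitespace, and
collections.Counter serves as the sparse term vector because it stores only
the terms that actually occur.

```python
from collections import Counter

# Hypothetical documents; word order is ignored once they are counted.
documents = [
    "the team won the game",
    "the coach lost the game",
]

# Sparse term vectors: each Counter holds only the non-zero term counts.
term_vectors = [Counter(doc.split()) for doc in documents]

# Dense document-term matrix: rows are documents, columns are terms.
vocabulary = sorted(set().union(*term_vectors))
matrix = [[vector[term] for term in vocabulary] for vector in term_vectors]

print(vocabulary)
for row in matrix:
    print(row)
```

With a realistic vocabulary of many thousands of terms, nearly every entry of
the dense matrix would be 0, so only the Counter-style representation is
practical.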
Graph-Based Data
A graph can sometimes be a convenient and powerful representation for data.
We consider two specific cases: (1) the graph captures relationships among
data objects and (2) the data objects themselves are represented as graphs.
Data with Relationships among Objects
The relationships among objects frequently convey important information. In
such cases, the data is often represented as a graph. In particular, the data
objects are mapped to nodes of the graph, while the relationships among
objects are captured by the links between objects and link properties, such as
direction and weight. Consider web pages on the World Wide Web, which
contain both text and links to other pages. In order to process search queries,
web search engines collect and process web pages to extract their contents. It
is well-known, however, that the links to and from each page provide a great
deal of information about the relevance of a web page to a query, and thus,
must also be taken into consideration. Figure 2.3(a) shows a set of linked web
pages. Another important example of such graph data is the social network,
where data objects are people and the relationships among them are their
interactions via social media.
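The mapping of objects to nodes and relationships to directed links can be
sketched with an adjacency list. The page names are hypothetical, and counting
incoming links is only a crude stand-in for the link-based relevance signals
that search engines actually use.

```python
# Hypothetical directed links between web pages: (source page, target page).
links = [
    ("page_a", "page_b"),
    ("page_a", "page_c"),
    ("page_b", "page_c"),
    ("page_d", "page_c"),
]

out_links = {}  # node -> list of nodes it links to
in_degree = {}  # node -> number of links pointing at it

for source, target in links:
    out_links.setdefault(source, []).append(target)
    out_links.setdefault(target, [])
    in_degree[target] = in_degree.get(target, 0) + 1
    in_degree.setdefault(source, 0)

print(out_links)
print(in_degree)
```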
Data with Objects That Are Graphs
If objects have structure, that is, the objects contain subobjects that have
relationships, then such objects are frequently represented as graphs. For
example, the structure of chemical compounds can be represented by a graph,
where the nodes are atoms and the links between nodes are chemical bonds.
Figure 2.3(b) shows a ball-and-stick diagram of the chemical compound
benzene, which contains atoms of carbon (black) and hydrogen (gray). A
graph representation makes it possible to determine which substructures
occur frequently in a set of compounds and to ascertain whether the presence
of any of these substructures is associated with the presence or absence of
certain chemical properties, such as melting point or heat of formation.
Frequent graph mining, which is a branch of data mining that analyzes such
data, is considered in Section 6.5.
Figure 2.3.
Different variations of graph data.
Ordered Data
For some types of data, the attributes have relationships that involve order in
time or space. Different types of ordered data are described next and are
shown in Figure 2.4.
Sequential Transaction Data
Sequential transaction data can be thought of as an extension of transaction
data, where each transaction has a time associated with it. Consider a retail
transaction data set that also stores the time at which the transaction took
place. This time information makes it possible to find patterns such as “candy
sales peak before Halloween.” A time can also be associated with each
attribute. For example, each record could be the purchase history of a
customer, with a listing of items purchased at different times. Using this
information, it is possible to find patterns such as “people who buy DVD
players tend to buy DVDs in the period immediately following the purchase.”
Figure 2.4(a) shows an example of sequential transaction data. There are five
different times—t1, t2, t3, t4, and t5; three different customers—C1, C2, and
C3; and five different items—A, B, C, D, and E. In the top table, each row
corresponds to the items purchased at a particular time by each customer. For
instance, at time t3, customer C2 purchased items A and D. In the bottom
table, the same information is displayed, but each row corresponds to a
particular customer. Each row contains information about each transaction
involving the customer, where a transaction is considered to be a set of items
and the time at which those items were purchased. For example, customer C3
bought items A and C at time t2.
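The relationship between the two tables can be sketched as a regrouping of
the same rows. Only the entries for customer C2 at time t3 and customer C3 at
time t2 are taken from the text; the other rows are invented, so this is not a
reproduction of Figure 2.4(a).

```python
from collections import defaultdict

# Time-oriented view: one row per (time, customer) with the items purchased.
rows = [
    ("t1", "C1", {"A", "B"}),   # invented row
    ("t2", "C3", {"A", "C"}),   # from the text
    ("t3", "C2", {"A", "D"}),   # from the text
    ("t4", "C1", {"E"}),        # invented row
]

# Customer-oriented view: one record per customer listing (time, items) pairs.
by_customer = defaultdict(list)
for time, customer, items in rows:
    by_customer[customer].append((time, sorted(items)))

for customer in sorted(by_customer):
    print(customer, by_customer[customer])
```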
Time Series Data
Time series data is a special type of ordered data where each record is a time
series, i.e., a series of measurements taken over time. For example, a
financial data set might contain objects that are time series of the daily prices
of various stocks. As another example, consider Figure 2.4(c), which shows a
time series of the average monthly temperature for Minneapolis during the
years 1982 to 1994. When working with temporal data, such as time series, it
is important to consider temporal autocorrelation; i.e., if two measurements
are close in time, then the values of those measurements are often very
similar.
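Temporal autocorrelation can be quantified in several ways. The sketch below
uses one common estimator, the lag-k sample autocorrelation, on two invented
series; the function name and the data are assumptions made for illustration.

```python
def autocorrelation(series, lag=1):
    """Lag-k sample autocorrelation: lagged covariance divided by total variance."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# A smoothly changing series: neighboring values are similar.
smooth = [1, 2, 3, 4, 5, 6, 7, 8]
# A series that flips sign at every step: neighboring values are dissimilar.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

print(autocorrelation(smooth), autocorrelation(alternating))
```

A value near 1 indicates that measurements close in time tend to be similar,
which is the situation described above; a negative value indicates the
opposite.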
Figure 2.4.
Different variations of ordered data.
Sequence Data
Sequence data consists of a data set that is a sequence of individual entities,
such as a sequence of words or letters. It is quite similar to sequential data,
except that there are no time stamps; instead, there are positions in an ordered
sequence. For example, the genetic information of plants and animals can be
represented in the form of sequences of nucleotides that are known as genes.
Many of the problems associated with genetic sequence data involve
predicting similarities in the structure and function of genes from similarities
in nucleotide sequences. Figure 2.4(b) shows a section of the human genetic
code expressed using the four nucleotides from which all DNA is
constructed: A, T, G, and C.
Spatial and Spatio-Temporal Data
Some objects have spatial attributes, such as positions or areas, in addition to
other types of attributes. An example of spatial data is weather data
(precipitation, temperature, pressure) that is collected for a variety of
geographical locations. Often such measurements are collected over time, and
thus, the data consists of time series at various locations. In that case, we
refer to the data as spatio-temporal data. Although analysis can be conducted
separately for each specific time or location, a more complete analysis of
spatio-temporal data requires consideration of both the spatial and temporal
aspects of the data.
An important aspect of spatial data is spatial autocorrelation; i.e., objects
that are physically close tend to be similar in other ways as well. Thus, two
points on the Earth that are close to each other usually have similar values for
temperature and rainfall. Note that spatial autocorrelation is analogous to
temporal autocorrelation.
Important examples of spatial and spatio-temporal data are the science and
engineering data sets that are the result of measurements or model output
taken at regularly or irregularly distributed points on a two- or
three-dimensional grid or mesh. For instance, Earth science data sets record the
temperature or pressure measured at points (grid cells) on latitude–longitude
spherical grids of various resolutions, e.g., 1° by 1°. See Figure 2.4(d). As
another example, in the simulation of the flow of a gas, the speed and
direction of flow at various instants in time can be recorded for each grid
point in the simulation. A different type of spatio-temporal data arises from
tracking the trajectories of objects, e.g., vehicles, in time and space.
Handling Non-Record Data
Most data mining algorithms are designed for record data or its variations,
such as transaction data and data matrices. Record-oriented techniques can be
applied to non-record data by extracting features from data objects and using
these features to create a record corresponding to each object. Consider the
chemical structure data that was described earlier. Given a set of common
substructures, each compound can be represented as a record with binary
attributes that indicate whether a compound contains a specific substructure.
Such a representation is actually a transaction data set, where the transactions
are the compounds and the items are the substructures.
In some cases, it is easy to represent the data in a record format, but this type
of representation does not capture all the information in the data. Consider
spatio-temporal data consisting of a time series from each point on a spatial
grid. This data is often stored in a data matrix, where each row represents a
location and each column represents a particular point in time. However, such
a representation does not explicitly capture the time relationships that are
present among attributes and the spatial relationships that exist among
objects. This does not mean that such a representation is inappropriate, but
rather that these relationships must be taken into consideration during the
analysis. For example, it would not be a good idea to use a data mining
technique that ignores the temporal autocorrelation of the attributes or the
spatial autocorrelation of the data objects, i.e., the locations on the spatial
grid.
2.2 Data Quality
Data mining algorithms are often applied to data that was collected for
another purpose, or for future, but unspecified applications. For that reason,
data mining cannot usually take advantage of the significant benefits of “addressing quality issues at the source.” In contrast, much of statistics deals
with the design of experiments or surveys that achieve a prespecified level of
data quality. Because preventing data quality problems is typically not an
option, data mining focuses on (1) the detection and correction of data quality
problems and (2) the use of algorithms that can tolerate poor data quality.
The first step, detection and correction, is often called data cleaning.
The following sections discuss specific aspects of data quality. The focus is
on measurement and data collection issues, although some application-related
issues are also discussed.
2.2.1 Measurement and Data Collection Issues
It is unrealistic to expect that data will be perfect. There may be problems due
to human error, limitations of measuring devices, or flaws in the data
collection process. Values or even entire data objects can be missing. In other
cases, there can be spurious or duplicate objects; i.e., multiple data objects
that all correspond to a single “real” object. For example, there might be two
different records for a person who has recently lived at two different
addresses. Even if all the data is present and “looks fine,” there may be
inconsistencies, e.g., a person has a height of 2 meters, but weighs only 2
kilograms.
In the next few sections, we focus on aspects of data quality that are related to
data measurement and collection. We begin with a definition of measurement
and data collection errors and then consider a variety of problems that
involve measurement error: noise, artifacts, bias, precision, and accuracy. We
conclude by discussing data quality issues that involve both measurement and
data collection problems: outliers, missing and inconsistent values, and
duplicate data.
Measurement and Data Collection Errors
The term measurement error refers to any problem resulting from the
measurement process. A common problem is that the value recorded differs
from the true value to some extent. For continuous attributes, the numerical
difference of the measured and true value is called the error. The term data
collection error refers to errors such as omitting data objects or attribute
values, or inappropriately including a data object. For example, a study of
animals of a certain species might include animals of a related species that
are similar in appearance to the species of interest. Both measurement errors
and data collection errors can be either systematic or random.
We will only consider general types of errors. Within particular domains,
certain types of data errors are commonplace, and well-developed techniques
often exist for detecting and/or correcting these errors. For example,
keyboard errors are common when data is entered manually, and as a result,
many data entry programs have techniques for detecting and, with human
intervention, correcting such errors.
Noise and Artifacts
Noise is the random component of a measurement error. It typically involves
the distortion of a value or the addition of spurious objects. Figure 2.5 shows
a time series before and after it has been disrupted by random noise. If a bit
more noise were added to the time series, its shape would be lost. Figure 2.6
shows a set of data points before and after some noise points (indicated by
‘+’s) have been added. Notice that some of the noise points are intermixed
with the non-noise points.
Figure 2.5.
Noise in a time series context.
Figure 2.6.
Noise in a spatial context.
The term noise is often used in connection with data that has a spatial or
temporal component. In such cases, techniques from signal or image
processing can frequently be used to reduce noise and thus, help to discover
patterns (signals) that might be “lost in the noise.” Nonetheless, the
elimination of noise is frequently difficult, and much work in data mining
focuses on devising robust algorithms that produce acceptable results even
when noise is present.
Data errors can be the result of a more deterministic phenomenon, such as a
streak in the same place on a set of photographs. Such deterministic
distortions of the data are often referred to as artifacts.
Precision, Bias, and Accuracy
In statistics and experimental science, the quality of the measurement process
and the resulting data are measured by precision and bias. We provide the
standard definitions, followed by a brief discussion. For the following
definitions, we assume that we make repeated measurements of the same
underlying quantity.
Definition 2.3 (Precision).
The closeness of repeated measurements (of the same quantity) to one another.
Definition 2.4 (Bias).
A systematic variation of measurements from the quantity being measured.
Precision is often measured by the standard deviation of a set of values, while
bias is measured by taking the difference between the mean of the set of
values and the known value of the quantity being measured. Bias can be
determined only for objects whose measured quantity is known by means
external to the current situation. Suppose that we have a standard laboratory
weight with a mass of 1 g and want to assess the precision and bias of our new
laboratory scale. We weigh the mass five times, and obtain the following five
values: {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is
1.001, and hence, the bias is 0.001. The precision, as measured by the
standard deviation, is 0.013.
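The arithmetic in this example can be checked with a few lines of code. The
variable names are chosen for illustration. Note that the stated precision of
0.013 corresponds to the sample standard deviation (dividing by n - 1); the
population form (dividing by n) would round to 0.012.

```python
import statistics

true_mass = 1.0  # grams; the known mass of the reference weight
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

mean = statistics.mean(measurements)
bias = mean - true_mass                      # systematic offset from the true value
precision = statistics.stdev(measurements)   # sample standard deviation (n - 1)

print(f"mean={mean:.3f} bias={bias:.3f} precision={precision:.3f}")
# prints: mean=1.001 bias=0.001 precision=0.013
```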
It is common to use the more general term, accuracy, to refer to the degree
of measurement error in data.
Definition 2.5 (Accuracy).
The closeness of measurements to the true value of the quantity being
measured.
Accuracy depends on precision and bias, but there is no specific formula for
accuracy in terms of these two quantities.
One important aspect of accuracy is the use of significant digits. The goal is
to use only as many digits to represent the result of a measurement or
calculation as are justified by the precision of the data. For example, if the
length of an object is measured with a meter stick whose smallest markings
are millimeters, then we should record the length only to the nearest
millimeter. The precision of such a measurement would be ±0.5 mm. We do
not review the details of working with significant digits because most readers
will have encountered them in previous courses and they are covered in
considerable depth in science, engineering, and statistics textbooks.
Issues such as significant digits, precision, bias, and accuracy are sometimes
overlooked, but they are important for data mining as well as statistics and
science. Many times, data sets do not come with information about the
precision of the data, and furthermore, the programs used for analysis return
results without any such information. Nonetheless, without some
understanding of the accuracy of the data and the results, an analyst runs the
risk of committing serious data analysis blunders.
Outliers
Outliers are either (1) data objects that, in some sense, have characteristics
that are different from most of the other data objects in the data set, or (2)
values of an attribute that are unusual with respect to the typical values for
that attribute. Alternatively, they can be referred to as anomalous objects or
values. There is considerable leeway in the definition of an outlier, and many
different definitions have been proposed by the statistics and data mining
communities. Furthermore, it is important to distinguish between the notions
of noise and outliers. Unlike noise, outliers can be legitimate data objects or
values that we are interested in detecting. For instance, in fraud and network
intrusion detection, the goal is to find unusual objects or events from among a
large number of normal ones. Chapter 9 discusses anomaly detection in more
detail.
Missing Values
It is not unusual for an object to be missing one or more attribute values. In
some cases, the information was not collected; e.g., some people decline to
give their age or weight. In other cases, some attributes are not applicable to
all objects; e.g., often, forms have conditional parts that are filled out only
when a person answers a previous question in a certain way, but for
simplicity, all fields are stored. Regardless, missing values should be taken
into account during the data analysis.
There are several strategies (and variations on these strategies) for dealing
with missing data, each of which is appropriate in certain circumstances.
These strategies are listed next, along with an indication of their advantages
and disadvantages.
Eliminate Data Objects or Attributes
A simple and effective strategy is to eliminate objects with missing values.
However, even a partially specified data object contains some information,
and if many objects have missing values, then a reliable analysis can be
difficult or impossible. Nonetheless, if a data set has only a few objects that
have missing values, then it may be expedient to omit them. A related
strategy is to eliminate attributes that have missing values. This should be
done with caution, however, because the eliminated attributes may be the
ones that are critical to the analysis.
Estimate Missing Values
Sometimes missing data can be reliably estimated. For example, consider a
time series that changes in a reasonably smooth fashion, but has a few,
widely scattered missing values. In such cases, the missing values can be
estimated (interpolated) by using the remaining values. As another example,
consider a data set that has many similar data points. In this situation, the
attribute values of the points closest to the point with the missing value are
often used to estimate the missing value. If the attribute is continuous, then
the average attribute value of the nearest neighbors is used; if the attribute is
categorical, then the most commonly occurring attribute value can be taken.
For a concrete illustration, consider precipitation measurements that are
recorded by ground stations. For areas not containing a ground station, the
precipitation can be estimated using values observed at nearby ground
stations.
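For the smooth time series case, one simple estimation scheme is linear
interpolation between the nearest known values. The sketch below assumes that
missing values are marked with None and that the series is evenly spaced; the
function name and the data are invented for illustration.

```python
def interpolate_missing(series):
    """Fill interior None values by linear interpolation between known neighbors.

    Missing values before the first or after the last known value are left as None.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


readings = [10.0, None, 14.0, 15.0, None, None, 18.0]
print(interpolate_missing(readings))
```

This works well only under the conditions described above: the series changes
smoothly and the missing values are few and widely scattered.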
Ignore the Missing Value during Analysis
Many data mining approaches can be modified to ignore missing values. For
example, suppose that objects are being clustered and the similarity between
pairs of data objects needs to be calculated. If one or both objects of a pair
have missing values for some attributes, then the similarity can be calculated
by using only the attributes that do not have missing values. It is true that the
similarity will only be approximate, but unless the total number of attributes
is small or the number of missing values is high, this degree of inaccuracy
may not matter much. Likewise, many classification schemes can be
modified to work with missing values.
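A minimal sketch of this idea is given below, using Euclidean distance as the
proximity measure (an assumption; the text speaks of similarity in general)
and None as the marker for a missing value. The function name is hypothetical.

```python
import math


def distance_ignoring_missing(a, b):
    """Euclidean distance over the attributes that are present in both objects.

    Returns None when the two objects share no non-missing attribute.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


obj1 = [1.0, None, 3.0, 4.0]
obj2 = [4.0, 2.0, None, 8.0]

# Only the first and last attributes are available in both objects.
print(distance_ignoring_missing(obj1, obj2))
```

Because different pairs of objects may be compared on different numbers of
attributes, the resulting values are only approximately comparable, which is
the inaccuracy noted above.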
Inconsistent Values
Data can contain inconsistent values. Consider an address field, where both a
zip code and city are listed, but the specified zip code area is not contained in
that city. It is possible that the individual entering this information transposed
two digits, or perhaps a digit was misread when the information was scanned
from a handwritten form. Regardless of the cause of the inconsistent values,
it is important to detect and, if possible, correct such problems.
Some types of inconsistencies are easy to detect. For instance, a person’s
height should not be negative. In other cases, it can be necessary to consult an
external source of information. For example, when an insurance company
processes claims for reimbursement, it checks the names and addresses on the
reimbursement forms against a database of its customers.
Once an inconsistency has been detected, it is sometimes possible to correct
the data. A product code may have “check” digits, or it may be possible to
double-check a product code against a list of known product codes, and then
correct the code if it is incorrect, but close to a known code. The correction of
an inconsistency requires additional or redundant information.
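Checking a code against a list of known codes can be sketched with the
standard library function difflib.get_close_matches. The product codes, the
function name check_code, and the 0.8 similarity cutoff are all assumptions
made for this illustration.

```python
import difflib

# Hypothetical list of valid product codes (the redundant information).
KNOWN_CODES = ["P-1001", "P-2040", "Q-7731"]


def check_code(code, known=KNOWN_CODES, cutoff=0.8):
    """Return the code if valid, the closest known code if one is similar enough,
    or None if the code cannot be matched."""
    if code in known:
        return code
    matches = difflib.get_close_matches(code, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(check_code("Q-7731"))  # already a known code
print(check_code("P-2004"))  # two digits transposed in P-2040
print(check_code("Z-0000"))  # not close to any known code
```

An automatic correction of this kind can pick the wrong code when several
known codes are equally close, so in practice suggested corrections are often
reviewed before they are applied.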
Example 2.6 (Inconsistent Sea Surface Temperature).
This example illustrates an inconsistency in actual time series data that
measures the sea surface temperature (SST) at various points on the ocean.
SST data was originally collected using ocean-based measurements from
ships or buoys, but more recently, satellites have been used to gather the data.
To create a long-term data set, both sources of data must be used. However,
because the data comes from different sources, the two parts of the data are
subtly different. This discrepancy is visually displayed in Figure 2.7, which
shows the correlation of SST values between pairs of years. If a pair of years
has a positive correlation, then the location corresponding to the pair of years
is colored white; otherwise it is colored black. (Seasonal variations were
removed from the data since, otherwise, all the years would be highly
correlated.) There is a distinct change in behavior where the data has been put
together in 1983. Years within each of the two groups, 1958–1982 and 1983–
1999, tend to have a positive correlation with one another, but a negative
correlation with years in the other group. This does not mean that this data
should not be used, only that the analyst should consider the potential impact
of such discrepancies on the data mining analysis.
Figure 2.7.
Correlation of SST data between pairs of years. White areas
indicate positive correlation. Black areas indicate negative
correlation.
Duplicate Data
A data set can include data objects that are duplicates, or almost duplicates,
of one another. Many people receive duplicate mailings because they appear
in a database multiple times under slightly different names. To detect and
eliminate such duplicates, two main issues must be addressed. First, if there
are two objects that actually represent a single object, then one or more
values of corresponding attributes are usually different, and these inconsistent
values must be resolved. Second, care needs to be taken to avoid accidentally
combining data objects that are similar, but not duplicates, such as two
distinct people with identical names. The term deduplication is often used to
refer to the process of dealing with these issues.
In some cases, two or more objects are identical with respect to the attributes
measured by the database, but they still represent different objects. Here, the
duplicates are legitimate, but can still cause problems for some algorithms if
the possibility of identical objects is not specifically accounted for in their
design. An example of this is given in Exercise 13 on page 108.
2.2.2 Issues Related to Applications
Data quality issues can also be considered from an application viewpoint as
expressed by the statement “data is of high quality if it is suitable for its
intended use.” This approach to data quality has proven quite useful,
particularly in business and industry. A similar viewpoint is also present in
statistics and the experimental sciences, with their emphasis on the careful
design of experiments to collect the data relevant to a specific hypothesis. As
with quality issues at the measurement and data collection level, many issues
are specific to particular applications and fields. Again, we consider only a
few of the general issues.
Some data starts to age as soon as it has been collected. In particular, if the
data provides a snapshot of some ongoing phenomenon or process, such as
the purchasing behavior of customers or web browsing patterns, then this
snapshot represents reality for only a limited time. If the data is out of date,
then so are the models and patterns that are based on it.
The available data must contain the information necessary for the application.
Consider the task of building a model that predicts the accident rate for
drivers. If information about the age and gender of the driver is omitted, then
it is likely that the model will have limited accuracy unless this information is
indirectly available through other attributes.
Making sure that the objects in a data set are relevant is also challenging. A
common problem is sampling bias, which occurs when a sample does not
contain different types of objects in proportion to their actual occurrence in
the population. For example, survey data describes only those who respond to
the survey. (Other aspects of sampling are discussed further in Section 2.3.2.)
Because the results of a data analysis can reflect only the data that is present,
sampling bias will typically lead to erroneous results when applied to the
broader population.
Knowledge about the Data
Ideally, data sets are accompanied by documentation that describes different
aspects of the data; the quality of this documentation can either aid or hinder
the subsequent analysis. For example, if the documentation identifies several
attributes as being strongly related, these attributes are likely to provide
highly redundant information, and we usually decide to keep just one.
(Consider sales tax and purchase price.) If the documentation is poor,
however, and fails to tell us, for example, that the missing values for a
particular field are indicated with a -9999, then our analysis of the data may
be faulty. Other important characteristics are the precision of the data, the
type of features (nominal, ordinal, interval, ratio), the scale of measurement
(e.g., meters or feet for length), and the origin of the data.
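To make the missing-value point concrete, here is a minimal sketch of how an undocumented -9999 code can distort even a simple summary. The readings are invented for illustration.

```python
# Hypothetical temperature readings; -9999 is the documented missing-value code.
MISSING = -9999
readings = [21.5, 19.0, MISSING, 22.5, MISSING, 20.0]

# Naive mean: the missing-value code is treated as a real measurement.
naive_mean = sum(readings) / len(readings)

# Mean over the values that are actually present.
present = [r for r in readings if r != MISSING]
clean_mean = sum(present) / len(present)

print(naive_mean)
print(clean_mean)
```

Without the documentation, nothing in the data itself flags the first result as wrong.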
2.3 Data Preprocessing
In this section, we consider which preprocessing steps should be applied to
make the data more suitable for data mining. Data preprocessing is a broad
area and consists of a number of different strategies and techniques that are
interrelated in complex ways. We will present some of the most important
ideas and approaches, and try to point out the interrelationships among them.
Specifically, we will discuss the following topics:
Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Variable transformation
Roughly speaking, these topics fall into two categories: selecting data objects
and attributes for the analysis or for creating/changing the attributes. In both
cases, the goal is to improve the data mining analysis with respect to time,
cost, and quality. Details are provided in the following sections.
A quick note about terminology: In the following, we sometimes use
synonyms for attribute, such as feature or variable, in order to follow
common usage.
2.3.1 Aggregation
Sometimes “less is more,” and this is the case with aggregation, the
combining of two or more objects into a single object. Consider a data set
consisting of transactions (data objects) recording the daily sales of products
in various store locations (Minneapolis, Chicago, Paris, …) for different days
over the course of a year. See Table 2.4. One way to aggregate transactions
for this data set is to replace all the transactions of a single store with a single
storewide transaction. This reduces the hundreds or thousands of transactions
that occur daily at a specific store to a single daily transaction, and the
number of data objects per day is reduced to the number of stores.
Table 2.4. Data set containing information about customer purchases.
Transaction ID   Item      Store Location   Date       Price    …
⋮                ⋮         ⋮                ⋮          ⋮
101123           Watch     Chicago          09/06/04   $25.99   …
101123           Battery   Chicago          09/06/04   $5.99    …
101124           Shoes     Minneapolis      09/06/04   $75.00   …
An obvious issue is how an aggregate transaction is created; i.e., how the
values of each attribute are combined across all the records corresponding to
a particular location to create the aggregate transaction that represents the
sales of a single store or date. Quantitative attributes, such as price, are
typically aggregated by taking a sum or an average. A qualitative attribute,
such as item, can either be omitted or summarized in terms of a higher level
category, e.g., televisions versus electronics.
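As a minimal sketch of this kind of aggregation, the following groups the rows shown in Table 2.4 by store location and date and sums the price attribute. The function name and the choice to omit the item attribute are assumptions made for illustration.

```python
from collections import defaultdict

# Rows modeled on Table 2.4: (transaction ID, item, store location, date, price).
transactions = [
    (101123, "Watch", "Chicago", "09/06/04", 25.99),
    (101123, "Battery", "Chicago", "09/06/04", 5.99),
    (101124, "Shoes", "Minneapolis", "09/06/04", 75.00),
]

def aggregate_by_store_and_date(rows):
    """Sum prices over all rows that share a store location and date."""
    totals = defaultdict(float)
    for _tid, _item, store, date, price in rows:
        totals[(store, date)] += price  # the qualitative item attribute is omitted
    return dict(totals)

daily_sales = aggregate_by_store_and_date(transactions)
for (store, date), total in sorted(daily_sales.items()):
    print(store, date, round(total, 2))
```

Replacing the sum with an average, or mapping each item to a higher-level category before grouping, follows the same pattern.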
The data in Table 2.4 can also be viewed as a multidimensional array, where
each attribute is a dimension. From this viewpoint, aggregation is the process
of eliminating attributes, such as the type of item, or reducing the number of
values for a particular attribute; e.g., reducing the possible values for date
from 365 days to 12 months. This type of aggregation is commonly used in
Online Analytical Processing (OLAP). References to OLAP are given in the
Bibliographic Notes.
There are several motivations for aggregation. First, the smaller data sets
resulting from data reduction require less memory and processing time, and
hence, aggregation often enables the use of more expensive data mining
algorithms. Second, aggregation can act as a change of scope or scale by
providing a high-level view of the data instead of a low-level view. In the
previous example, aggregating over store locations and months gives us a
monthly, per store view of the data instead of a daily, per item view. Finally,
the behavior of groups of objects or attributes is often more stable than that of
individual objects or attributes. This statement reflects the statistical fact that
aggregate quantities, such as averages or totals, have less variability than the
individual values being aggregated. For totals, the actual amount of variation
is larger than that of individual objects (on average), but the percentage of the
variation is smaller, while for means, the actual amount of variation is less
than that of individual objects (on average). A disadvantage of aggregation is
the potential loss of interesting details. In the store example, aggregating over
months loses information about which day of the week has the highest sales.
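The statement about totals and means can be made concrete under a simplifying assumption that is not part of the text: the n values being aggregated are independent, each with mean μ > 0 and standard deviation σ. Under that assumption, the standard deviations of the total T and the mean are as follows.

```latex
\begin{align*}
T &= \sum_{i=1}^{n} X_i, &
\operatorname{sd}(T) &= \sqrt{n}\,\sigma, &
\frac{\operatorname{sd}(T)}{\mathbb{E}[T]} &= \frac{\sigma}{\sqrt{n}\,\mu}, \\
\bar{X} &= \frac{T}{n}, &
\operatorname{sd}(\bar{X}) &= \frac{\sigma}{\sqrt{n}}, &
\frac{\operatorname{sd}(\bar{X})}{\mathbb{E}[\bar{X}]} &= \frac{\sigma}{\sqrt{n}\,\mu}.
\end{align*}
```

For n > 1, the total varies more than a single value in absolute terms, but its variation relative to its size shrinks by a factor of √n, while the mean varies less in absolute terms as well.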
Example 2.7 (Australian Precipitation).
This example is based on precipitation in Australia from the period 1982–
1993. Figure 2.8(a) shows a histogram for the standard deviation of average
monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, while
Figure 2.8(b) shows a histogram for the standard deviation of the average
yearly precipitation for the same locations. The average yearly precipitation
has less variability than the average monthly precipitation. All precipitation
measurements (and their standard deviations) are in centimeters.
Figure 2.8.
Histograms of standard deviation for monthly and yearly
precipitation in Australia for the period 1982–1993.
2.3.2 Sampling
Sampling is a commonly used approach for selecting a subset of the data
objects to be analyzed. In statistics, it has long been used for both the
preliminary investigation of the data and the final data analysis. Sampling can
also be very useful in data mining. However, the motivations for sampling in
statistics and data mining are often different. Statisticians use sampling
because obtaining the entire set of data of interest is too expensive or time
consuming, while data miners usually sample because it is too
computationally expensive in terms of the memory or time required to
process all the data. In some cases, using a sampling algorithm can reduce the
data size to the point where a better, but more computationally expensive
algorithm can be used.
The key principle for effective sampling is the following: Using a sample will
work almost as well as using the entire data set if the sample is
representative. In turn, a sample is representative if it has approximately the
same property (of interest) as the original set of data. If the mean (average) of
the data objects is the property of interest, then a sample is representative if it
has a mean that is close to that of the original data. Because sampling is a
statistical process, the representativeness of any particular sample will vary,
and the best that we can do is choose a sampling scheme that guarantees a
high probability of getting a representative sample. As discussed next, this
involves choosing the appropriate sample size and sampling technique.
Sampling Approaches
There are many sampling techniques, but only a few of the most basic ones
and their variations will be covered here. The simplest type of sampling is
simple random sampling. For this type of sampling, there is an equal
probability of selecting any particular object. There are two variations on
random sampling (and other sampling techniques as well): (1) sampling
without replacement, in which each object is removed from the set of all
objects that together constitute the population as it is selected, and
(2) sampling with replacement, in which objects are not removed from the
population as they are
selected for the sample. In sampling with replacement, the same object can be
picked more than once. The samples produced by the two methods are not
much different when samples are relatively small compared to the data set
size, but sampling with replacement is simpler to analyze because the
probability of selecting any object remains constant during the sampling
process.
When the population consists of different types of objects, with widely
different numbers of objects, simple random sampling can fail to adequately
represent those types of objects that are less frequent. This can cause
problems when the analysis requires proper representation of all object types.
For example, when building classification models for rare classes, it is critical
that the rare classes be adequately represented in the sample. Hence, a
sampling scheme that can accommodate differing frequencies for the object
types of interest is needed. Stratified sampling, which starts with
prespecified groups of objects, is such an approach. In the simplest version,
equal numbers of objects are drawn from each group even though the groups
are of different sizes. In another variation, the number of objects drawn from
each group is proportional to the size of that group.
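The sampling schemes described above can be sketched with Python's standard random module. The function names, the 95/5 class split, and the fixed seed are assumptions made for illustration; the stratified version shown is the simplest variant, which draws equal numbers from each group.

```python
import random
from collections import defaultdict

def sample_without_replacement(population, k, rng):
    """Each object can appear at most once in the sample."""
    return rng.sample(population, k)

def sample_with_replacement(population, k, rng):
    """The same object can be picked more than once."""
    return [rng.choice(population) for _ in range(k)]

def stratified_sample(objects, group_of, per_group, rng):
    """Draw the same number of objects from each prespecified group."""
    groups = defaultdict(list)
    for obj in objects:
        groups[group_of(obj)].append(obj)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical data: 95 objects of a common class and 5 of a rare class.
data = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]

simple = sample_without_replacement(data, 10, rng)
boot = sample_with_replacement(data, 10, rng)
strat = stratified_sample(data, lambda obj: obj[0], 5, rng)
print(sum(1 for label, _ in simple if label == "rare"), "rare objects in the simple random sample")
print(sum(1 for label, _ in strat if label == "rare"), "rare objects in the stratified sample")
```

The proportional variant would replace the fixed per_group count with a count computed from each group's share of the data.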
Example 2.8 (Sampling and Loss of Information).
Once a sampling technique has been selected, it is still necessary to choose
the sample size. Larger sample sizes increase the probability that a sample
will be representative, but they also eliminate much of the advantage of
sampling. Conversely, with smaller sample sizes, patterns can be missed or
erroneous patterns can be detected. Figure 2.9(a) shows a data set that
contains 8000 two-dimensional points, while Figures 2.9(b) and 2.9(c) show
samples from this data set of size 2000 and 500, respectively. Although most
of the structure of this data set is present in the sample of 2000 points, much
of the structure is missing in the sample of 500 points.
Figure 2.9.
Example of the loss of structure with sampling.
Example 2.9 (Determining the
Proper Sample Size).
To illustrate that determining the proper sample size requires a methodical
approach, consider the following task.
Given a set of data consisting of a small number of almost equal-sized
groups, find at least one representative point for each of the groups.
Assume that the objects in each group are highly similar to each other,
but not very similar to objects in different groups. Figure 2.10(a) shows
an idealized set of clusters (groups) from which these points might be
drawn.
Figure 2.10.
Finding representative points from 10 groups.
This problem can be efficiently solved using sampling. One approach is to
take a small sample of data points, compute the pairwise similarities between
points, and then form groups of points that are highly similar. The desired set
of representative points is then obtained by taking one point from each of
these groups. To follow this approach, however, we need to determine a
sample size that would guarantee, with a high probability, the desired
outcome; that is, that at least one point will be obtained from each cluster.
Figure 2.10(b) shows the probability of getting one object from each of the
10 groups as the sample size runs from 10 to 60. Interestingly, with a sample
size of 20, there is little chance (20%) of getting a sample that includes all 10
clusters. Even with a sample size of 30, there is still a moderate chance
(almost 40%) of getting a sample that doesn’t contain objects from all 10
clusters. This issue is further explored in the context of clustering by Exercise
4 on page 603.
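The probabilities quoted in this example can be reproduced with a short calculation. The sketch below assumes exactly equal-sized groups and independent draws (sampling with replacement, or a population much larger than the sample); neither assumption is stated in the example, and the function name is hypothetical.

```python
from math import comb

def prob_all_groups_sampled(num_groups, sample_size):
    """Probability that a sample contains at least one object from every group.

    Assumes equal-sized groups and independent draws. Uses
    inclusion-exclusion over the number of groups that are missed.
    """
    g = num_groups
    return sum(
        (-1) ** k * comb(g, k) * ((g - k) / g) ** sample_size
        for k in range(g + 1)
    )

for n in (10, 20, 30, 40, 50, 60):
    print(n, round(prob_all_groups_sampled(10, n), 3))
```

Under these assumptions the value for a sample of 20 is a little over 0.2, and for a sample of 30 it is about 0.63, which agrees with the figures quoted above.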
Progressive Sampling
The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start
with a small sample, and then increase the sample size until a sample of
sufficient size has been obtained. While this technique eliminates the need to
determine the correct sample size initially, it requires that there be a way to
evaluate the sample to judge if it is large enough.
Suppose, for instance, that progressive sampling is used to learn a predictive
model. Although the accuracy of predictive models increases as the sample
size increases, at some point the increase in accuracy levels off. We want to
stop increasing the sample size at this leveling-off point. By keeping track of
the change in accuracy of the model as we take progressively larger samples,
and by taking other samples close to the size of the current one, we can get an
estimate of how close we are to this leveling-off point, and thus, stop
sampling.
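A minimal sketch of such a scheme follows. The evaluate callback, the doubling schedule, and the tolerance are all assumptions for illustration. The sketch compares only consecutive sample sizes and does not take additional samples near the current size, so it is a simplification of the procedure described above.

```python
import math

def progressive_sample_size(evaluate, start=100, growth=2.0, tol=0.01, max_size=100_000):
    """Grow the sample until the gain in accuracy falls below `tol`.

    `evaluate(n)` is a hypothetical callback that draws a sample of size n,
    trains the model on it, and returns an accuracy estimate.
    """
    size = start
    accuracy = evaluate(size)
    while size < max_size:
        next_size = min(int(size * growth), max_size)
        next_accuracy = evaluate(next_size)
        if next_accuracy - accuracy < tol:
            return size, accuracy  # the curve has leveled off
        size, accuracy = next_size, next_accuracy
    return size, accuracy

def toy_learning_curve(n):
    """Stand-in for a real model: accuracy rises and then saturates near 0.9."""
    return 0.9 * (1.0 - math.exp(-n / 500.0))

chosen_size, chosen_accuracy = progressive_sample_size(toy_learning_curve)
print(chosen_size, round(chosen_accuracy, 3))
```

With a real model, evaluate would be noisy, which is one reason to average over several samples of similar size before deciding to stop.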
2.3.3 Dimensionality Reduction
Data sets can have a large number of features. Consider a set of documents,
where each document is represented by a vector whose components are the
frequencies with which each word occurs in the document. In such cases,
there are typically thousands or tens of thousands of attributes (components),
one for each word in the vocabulary. As another example, consider a set of
time series consisting of the daily closing price of various stocks over a
period of 30 years. In this case, the attributes, which are the prices on specific
days, again number in the thousands.
There are a variety of benefits to dimensionality reduction. A key benefit is
that many data mining algorithms work better if the dimensionality—the
number of attributes in the data—is lower. This is partly because
dimensionality reduction can eliminate irrelevant features and reduce noise
and partly because of the curse of dimensionality, which is explained below.
Another benefit is that a reduction of dimensionality can lead to a more
understandable model because the model usually involves fewer attributes.
Also, dimensionality reduction may allow the data to be more easily
visualized. Even if dimensionality reduction doesn’t reduce the data to two or
three dimensions, data is often visualized by looking at pairs or triplets of
attributes, and the number of such combinations is greatly reduced. Finally,
the amount of time and memory required by the data mining algorithm is
reduced with a reduction in dimensionality.
The term dimensionality reduction is often reserved for those techniques that
reduce the dimensionality of a data set by creating new attributes that are a
combination of the old attributes. The reduction of dimensionality by
selecting attributes that are a subset of the old is known as feature subset
selection or feature selection. It will be discussed in Section 2.3.4.
In the remainder of this section, we briefly introduce two important topics:
the curse of dimensionality and dimensionality reduction techniques based on
linear algebra approaches such as principal components analysis (PCA).
More details on dimensionality reduction can be found in Appendix B.
The Curse of Dimensionality
The curse of dimensionality refers to the phenomenon that many types of
data analysis become significantly harder as the dimensionality of the data
increases. Specifically, as dimensionality increases, the data becomes
increasingly sparse in the space that it occupies. Thus, the data objects we
observe are quite possibly not a representative sample of all possible objects.
For classification, this can mean that there are not enough data objects to
allow the creation of a model that reliably assigns a class to all possible
objects. For clustering, the differences in density and in the distances between
points, which are critical for clustering, become less meaningful. (This is
discussed further in Sections 8.1.2, 8.4.6, and 8.4.8.) As a result, many
clustering and classification algorithms (and other data analysis algorithms)
have trouble with high-dimensional data leading to reduced classification
accuracy and poor quality clusters.
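One way to see why distances become less meaningful is to measure how much farther the farthest point is than the nearest one as the dimensionality grows. The sketch below uses uniformly random points in the unit hypercube, which is an assumption chosen for simplicity; real data sets may behave differently.

```python
import math
import random

def relative_contrast(num_points, dim, rng):
    """(farthest - nearest) / nearest distance from one query point to a set
    of points drawn uniformly at random from the unit hypercube."""
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

rng = random.Random(0)  # fixed seed so the sketch is repeatable
contrast = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for d, value in contrast.items():
    print(d, round(value, 2))
```

In low dimensions the nearest point is many times closer than the farthest one; in high dimensions the two distances are nearly the same, so "nearest" carries little information.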
Linear Algebra Techniques for
Dimensionality Reduction
Some of the most common approaches for dimensionality reduction,
particularly for continuous data, use techniques from linear algebra to project
the data from a high-dimensional space into a lower-dimensional space.
Principal Components Analysis (PCA) is a linear algebra technique for
continuous attributes that finds new attributes (principal components) that (1)
are linear combinations of the original attributes, (2) are orthogonal
(perpendicular) to each other, and (3) capture the maximum amount of
variation in the data. For example, the first two principal components capture
as much of the variation in the data as is possible with two orthogonal
attributes that are linear combinations of the original attributes. Singular
Value Decomposition (SVD) is a linear algebra technique that is related to
PCA and is also commonly used for dimensionality reduction. For additional
details, see Appendices A and B.
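For two attributes, the first principal component can be computed in closed form from the covariance matrix, which makes for a compact illustration. This is a sketch for the two-dimensional case only, with invented data; general-purpose PCA would use an eigen-decomposition or SVD routine from a numerical library.

```python
import math

def first_principal_component_2d(points):
    """Return (direction, fraction_of_variance) for the first principal
    component of two-dimensional data, using the closed-form
    eigen-decomposition of the 2x2 covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    largest = half_trace + spread

    # Angle of the corresponding eigenvector.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    return direction, largest / (sxx + syy)

# Hypothetical measurements that lie close to the line y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, explained = first_principal_component_2d(data)
print(direction, round(explained, 4))
```

Because the points lie almost on a line, a single new attribute (the projection onto that direction) captures nearly all of the variation, and the second attribute could be dropped with little loss.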
2.3.4 Feature Subset Selection
Another way to reduce the dimensionality is to use only a subset of the
features. While it might seem that such an approach would lose information,
this is not the case if redundant and irrelevant features are present.
Redundant features duplicate much or all of the information contained in
one or more other attributes. For example, the purchase price of a product and
the amount of sales tax paid contain much of the same information.
Irrelevant features contain almost no useful information for the data mining
task at hand. For instance, students’ ID numbers are irrelevant to the task of
predicting students’ grade point averages. Redundant and irrelevant features
can reduce classification accuracy and the quality of the clusters that are
found.
While some irrelevant and redundant attributes can be eliminated
immediately by using common sense or domain knowledge, selecting the best
subset of features frequently requires a systematic approach. The ideal
approach to feature selection is to try all possible subsets of features as input
to the data mining algorithm of interest, and then take the subset that
produces the best results. This method has the advantage of reflecting the
objective and bias of the data mining algorithm that will eventually be used.
Unfortunately, since the number of subsets involving n attributes is 2^n, such
an approach is impractical in most situations and alternative strategies are
needed. There are three standard approaches to feature selection: embedded,
filter, and wrapper.
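The ideal approach described above, and the reason it does not scale, can be sketched as follows. The score function stands in for running the data mining algorithm of interest; it and the feature names are invented for illustration.

```python
from itertools import combinations

def all_nonempty_subsets(features):
    """Yield every non-empty subset of `features` (2**n - 1 of them)."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def best_subset_exhaustive(features, score):
    """Try every subset and keep the one with the highest score.

    `score(subset)` is a hypothetical stand-in for running the data mining
    algorithm of interest on that subset and measuring the result.
    """
    return max(all_nonempty_subsets(features), key=score)

# Toy score: reward two "useful" features, penalize subset size slightly.
useful = {"age", "mileage"}

def toy_score(subset):
    return len(useful & set(subset)) - 0.1 * len(subset)

features = ["age", "gender", "mileage", "driver_id"]
best = best_subset_exhaustive(features, toy_score)
print(best)
print(2 ** 30)  # number of subsets with only 30 features
```

Four features give 15 non-empty subsets, but 30 features already give more than a billion, each of which would require a full run of the algorithm.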
Embedded approaches
Feature selection occurs naturally as part of the data mining algorithm.
Specifically, during the operation of the data mining algorithm, the algorithm
itself decides which attributes to use and which to ignore. Algorithms for
building decision tree classifiers, which are discussed in Chapter 3, often
operate in this manner.
Filter approaches
Features are selected before the data mining algorithm is run, using some
approach that is independent of the data mining task. For example, we might
select sets of attributes whose pairwise correlation is as low as possible so
that the attributes are non-redundant.
Wrapper approaches
These methods use the target data mining algorithm as a black box to find the
best subset of attributes, in a way similar to that of the ideal algorithm
described above, but typically without enumerating all possible subsets.
Because the embedded approaches are algorithm-specific, only the filter and
wrapper approaches will be discussed further here.
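As one possible illustration of a filter approach, the sketch below keeps a feature only if it is not highly correlated with any feature already kept. The greedy order, the 0.95 threshold, and the data are assumptions; this is one simple filter, not the only one.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_redundant(columns, threshold=0.95):
    """Greedy filter: keep a feature only if its absolute correlation with
    every feature already kept is below `threshold`."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns; sales tax is a fixed 7% of the purchase price.
columns = {
    "price": [10.0, 25.0, 40.0, 55.0, 80.0],
    "sales_tax": [0.70, 1.75, 2.80, 3.85, 5.60],
    "items_in_cart": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept = drop_redundant(columns)
print(kept)
```

Note that the selection never consults the data mining algorithm, which is what makes it a filter. The sketch does not handle constant columns, for which the correlation is undefined.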
An Architecture for Feature Subset Selection
It is possible to encompass both the filter and wrapper approaches within a
common architecture. The feature selection process is viewed as consisting of
four parts: a measure for evaluating a subset, a search strategy that controls
the generation of a new subset of features, a stopping criterion, and a
validation procedure. Filter methods and wrapper methods differ only in the
way in which they evaluate a subset of features. For a wrapper method,
subset evaluation uses the target data mining algorithm, while for a filter
approach, the evaluation technique is distinct from the target data mining
algorithm. The following discussion provides some details of this approach,
which is summarized in Figure 2.11.
Figure 2.11.
Flowchart of a feature subset selection process.
Conceptually, feature subset selection is a search over all possible subsets of
features. Many different types of search strategies can be used, but the search
strategy should be computationally inexpensive and should find optimal or
near optimal sets of features. It is usually not possible to satisfy both
requirements, and thus, trade-offs are necessary.
An integral part of the search is an evaluation step to judge how the current
subset of features compares to others that have been considered. This requires
an evaluation measure that attempts to determine the goodness of a subset of
attributes with respect to a particular data mining task, such as classification
or clustering. For the filter approach, such measures attempt to predict how
well the actual data mining algorithm will perform on a given set of
attributes. For the wrapper approach, where evaluation consists of actually
running the target data mining algorithm, the subset evaluation function is
simply the criterion normally used to measure the result of the data mining.
Because the number of subsets can be enormous and it is impractical to
examine them all, some sort of stopping criterion is necessary. This strategy
is usually based on one or more conditions involving the following: the
number of iterations, whether the value of the subset evaluation measure is
optimal or exceeds a certain threshold, whether a subset of a certain size has
been obtained, and whether any improvement can be achieved by the options
available to the search strategy.
Finally, once a subset of features has been selected, the results of the target
data mining algorithm on the selected subset should be validated. A
straightforward validation approach is to run the algorithm with the full set of
features and compare the full results to results obtained using the subset of
features. Hopefully, the subset of features will produce results that are better
than or almost as good as those produced when using all features. Another
validation approach is to use a number of different feature selection
algorithms to obtain subsets of features and then compare the results of
running the data mining algorithm on each subset.
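The four parts of this architecture can be seen in a greedy forward search, sketched below. Forward selection is only one of many possible search strategies, and the evaluate function and its numbers are invented for illustration.

```python
def forward_selection(features, evaluate, max_features=None, min_gain=1e-6):
    """Greedy forward search over feature subsets.

    `evaluate(subset)` is the subset evaluation measure. For a wrapper it
    would run the target algorithm; for a filter it would be a cheaper proxy.
    Stops when no candidate improves the score by at least `min_gain` or
    when `max_features` have been selected.
    """
    selected = []
    best_score = float("-inf")
    limit = len(features) if max_features is None else max_features
    while len(selected) < limit:
        candidates = [f for f in features if f not in selected]
        # Search strategy: extend the current subset by one feature.
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, feature = max(scored)
        if score - best_score < min_gain:
            break  # stopping criterion: no worthwhile improvement
        selected.append(feature)
        best_score = score
    return selected, best_score

# Toy evaluation measure (an assumption for illustration only).
gains = {"age": 0.30, "mileage": 0.20, "gender": 0.05, "driver_id": 0.0}

def toy_evaluate(subset):
    return 0.5 + sum(gains[f] for f in subset) - 0.02 * len(subset)

subset, score = forward_selection(list(gains), toy_evaluate)
full_score = toy_evaluate(list(gains))  # validation: compare with all features
print(subset, round(score, 2), round(full_score, 2))
```

The final comparison against the full feature set corresponds to the straightforward validation approach described above.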
Feature Weighting
Feature weighting is an alternative to keeping or eliminating features. More
important features are assigned a higher weight, while less important features
are given a lower weight. These weights are sometimes assigned based on
domain knowledge about the relative importance of features. Alternatively,
they can sometimes be determined automatically. For example, some
classification schemes, such as support vector machines (Chapter 4), produce
classification models in which each feature is given a weight. Features with
larger weights play a more important role in the model. The normalization of
objects that takes place when computing the cosine similarity (Section 2.4.5)
can also be regarded as a type of feature weighting.
2.3.5 Feature Creation
It is frequently possible to create, from the original attributes, a new set of
attributes that captures the important information in a data set much more
effectively. Furthermore, the number of new attributes can be smaller than
the number of original attributes, allowing us to reap all the previously
described benefits of dimensionality reduction. Two related methodologies
for creating new attributes are described next: feature extraction and mapping
the data to a new space.
Feature Extraction
The creation of a new set of features from the original raw data is known as
feature extraction. Consider a set of photographs, where each photograph is
to be classified according to whether it contains a human face. The raw data
is a set of pixels, and as such, is not suitable for many types of classification
algorithms. However, if the data is processed to provide higher-level features,
such as the presence or absence of certain types of edges and areas that are
highly correlated with the presence of human faces, then a much broader set
of classification techniques can be applied to this problem.
Unfortunately, in the sense in which it is most commonly used, feature
extraction is highly domain-specific. For a particular field, such as image
processing, various features and the techniques to extract them have been
developed over a period of time, and often these techniques have limited
applicability to other fields. Consequently, whenever data mining is applied
to a relatively new area, a key task is the development of new features and
feature extraction methods.
Although feature extraction is often complicated, Example 2.10 illustrates
that it can be relatively straightforward.
Example 2.10 (Density).
Consider a data set consisting of information about historical artifacts, which,
along with other information, contains the volume and mass of each artifact.
For simplicity, assume that these artifacts are made of a small number of
materials (wood, clay, bronze, gold) and that we want to classify the artifacts
with respect to the material of which they are made. In this case, a density
feature constructed from the mass and volume features, i.e., density =
mass/volume, would most directly yield an accurate classification.
Although there have been some attempts to automatically perform such
simple feature extraction by exploring basic mathematical combinations of
existing attributes, the most common approach is to construct features using
domain expertise.
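A minimal sketch of the density example follows. The artifacts and the density cutoffs are invented for illustration and are only rough approximations of real material densities.

```python
# Hypothetical artifacts: (name, mass in grams, volume in cubic centimeters).
artifacts = [
    ("bowl", 340.0, 200.0),
    ("figurine", 1740.0, 200.0),
    ("ring", 96.5, 5.0),
    ("spoon", 30.0, 50.0),
]

# Approximate upper density bounds in g/cm^3, chosen for illustration only.
MATERIAL_BOUNDS = [(1.0, "wood"), (3.0, "clay"), (12.0, "bronze")]

def classify_by_density(mass, volume):
    """Derive the density feature and map it to a material."""
    density = mass / volume
    for upper, material in MATERIAL_BOUNDS:
        if density < upper:
            return density, material
    return density, "gold"

for name, mass, volume in artifacts:
    density, material = classify_by_density(mass, volume)
    print(name, round(density, 2), material)
```

Neither mass nor volume alone separates the materials, since artifacts of any material can be large or small, but their ratio does.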
Mapping the Data to a New Space
A totally different view of the data can reveal important and interesting
features. Consider, for example, time series data, which often contains
periodic patterns. If there is only a single periodic pattern and not much
noise, then the pattern is easily detected. If, on the other hand, there are a
number of periodic patterns and a significant amount of noise, then these
patterns are hard to detect. Such patterns can, nonetheless, often be detected
by applying a Fourier transform to the time series in order to change to a
representation in which frequency information is explicit. In Example 2.11, it
will not be necessary to know the details of the Fourier transform. It is
enough to know that, for each time series, the Fourier transform produces a
new data object whose attributes are related to frequencies.
Example 2.11 (Fourier Analysis).
The time series presented in Figure 2.12(b) is the sum of three other time
series, two of which are shown in Figure 2.12(a) and have frequencies of 7
and 17 cycles per second, respectively. The third time series is random noise.
Figure 2.12(c) shows the power spectrum that can be computed after
applying a Fourier transform to the original time series. (Informally, the
power spectrum is proportional to the square of each frequency attribute.) In
spite of the noise, there are two peaks that correspond to the periods of the
two original, non-noisy time series. Again, the main point is that better
features can reveal important aspects of the data.
Figure 2.12.
Application of the Fourier transform to identify the underlying
frequencies in time series data.
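The idea of this example can be reproduced with a small discrete Fourier transform. The sketch below builds a signal from sine waves at 7 and 17 cycles per second plus random noise and locates the two largest peaks in its power spectrum. The sampling rate, noise level, and seed are assumptions; this is not the data behind Figure 2.12, and the naive transform is used only to keep the sketch self-contained.

```python
import cmath
import math
import random

def power_spectrum(signal):
    """Squared magnitude of the discrete Fourier transform for the
    non-negative frequencies (naive O(N^2) implementation)."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample_rate = 128  # samples per second (an assumption for this sketch)
times = [t / sample_rate for t in range(sample_rate)]  # one second of data
signal = [
    math.sin(2 * math.pi * 7 * t) + math.sin(2 * math.pi * 17 * t) + rng.gauss(0, 0.5)
    for t in times
]

power = power_spectrum(signal)
# With one second of data, bin k corresponds to k cycles per second.
top_two = sorted(sorted(range(len(power)), key=power.__getitem__)[-2:])
print(top_two)
```

In the time domain the noisy sum is hard to read, but in the frequency representation the two underlying frequencies stand out as the two largest values.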
Many other sorts of transformations are also possible. Besides the Fourier
transform, the wavelet transform has also proven very useful for time series
and other types of data.
2.3.6 Discretization and Binarization
Some data mining algorithms, especially certain classification algorithms,
require that the data be in the form of categorical attributes. Algorithms that
find association patterns require that the data be in the form of binary
attributes. Thus, it is often necessary to transform a continuous attribute into
a categorical attribute (discretization), and both continuous and discrete
attributes may need to be transformed into one or more binary attributes
(binarization). Additionally, if a categorical attribute has a large number of
values (categories), or some values occur infrequently, then it can be
beneficial for certain data mining tasks to reduce the number of categories by
combining some of the values.
As with feature selection, the best discretization or binarization approach is
the one that “produces the best result for the data mining algorithm that will
be used to analyze the data.” It is typically not practical to apply such a
criterion directly. Consequently, discretization or binarization is performed in
a way that satisfies a criterion that is thought to have a relationship to good
performance for the data mining task being considered. In general, the best
discretization depends on the algorithm being used, as well as the other
attributes being considered. Typically, however, the discretization of each
attribute is considered in isolation.
A simple technique to binarize a categorical attribute is the following: If there
are m categorical values, then uniquely assign each original value to an
integer in the interval [0, m−1]. If the attribute is ordinal, then order must be
maintained by the assignment. (Note that even if the attribute is originally
represented using integers, this process is necessary if the integers are not in
the interval [0, m−1].) Next, convert each of these m integers to a binary
number. Since n = ⌈log2(m)⌉ binary digits are required to represent these
integers, represent these binary numbers using n binary attributes. To
illustrate, a categorical variable with 5 values {awful, poor, OK, good, great}
would require three binary variables x1, x2, and x3. The conversion is shown
in Table 2.5.
Table 2.5. Conversion of a
categorical attribute to three
binary attributes.
Categorical Value Integer Value x1 x2 x3
awful 0 0 0 0
poor 1 0 0 1
OK 2 0 1 0
good 3 0 1 1
great 4 1 0 0
Such a transformation can cause complications, such as creating unintended
relationships among the transformed attributes. For example, in Table 2.5,
attributes x2 and x3 are correlated because information about the good value
is encoded using both attributes. Furthermore, association analysis requires
asymmetric binary attributes, where only the presence of the attribute
(value = 1) is important. For association problems, it is therefore necessary to
introduce one asymmetric binary attribute for each categorical value, as
shown in Table 2.6. If the number of resulting attributes is too large, then the
techniques described in the following sections can be used to reduce the
number of categorical values before binarization.
Table 2.6. Conversion of a
categorical attribute to five
asymmetric binary attributes.
Categorical Value Integer Value x1 x2 x3 x4 x5
awful 0 1 0 0 0 0
poor 1 0 1 0 0 0
OK 2 0 0 1 0 0
good 3 0 0 0 1 0
great 4 0 0 0 0 1
Likewise, for association problems, it can be necessary to replace a single
binary attribute with two asymmetric binary attributes. Consider a binary
attribute that records a person’s gender, male or female. For traditional
association rule algorithms, this information needs to be transformed into two
asymmetric binary attributes, one that is a 1 only when the person is male and
one that is a 1 only when the person is female. (For asymmetric binary
attributes, the information representation is somewhat inefficient in that two
bits of storage are required to represent each bit of information.)
Discretization of Continuous Attributes
Discretization is typically applied to attributes that are used in classification
or association analysis. Transformation of a continuous attribute to a
categorical attribute involves two subtasks: deciding how many categories, n,
to have and determining how to map the values of the continuous attribute to
these categories. In the first step, after the values of the continuous attribute
are sorted, they are then divided into n intervals by specifying n−1 split
points. In the second, rather trivial step, all the values in one interval are
mapped to the same categorical value. Therefore, the problem of
discretization is one of deciding how many split points to choose and where
to place them. The result can be represented either as a set of intervals
{(x0, x1], (x1, x2],…, (xn−1, xn)}, where x0 and xn can be −∞ and +∞,
respectively, or equivalently, as a series of inequalities x0<x≤x1, …, xn−1<x<xn.
Unsupervised Discretization
A basic distinction between discretization methods for classification is
whether class information is used (supervised) or not (unsupervised). If class
information is not used, then relatively simple approaches are common. For
instance, the equal width approach divides the range of the attribute into a
user-specified number of intervals each having the same width. Such an
approach can be badly affected by outliers, and for that reason, an equal
frequency (equal depth) approach, which tries to put the same number of
objects into each interval, is often preferred. As another example of
unsupervised discretization, a clustering method, such as K-means (see
Chapter 7), can also be used. Finally, visually inspecting the data can
sometimes be an effective approach.
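The equal width and equal frequency approaches can be sketched as follows. This is a simplified illustration on made-up data with one outlier: the function names are hypothetical, and placing each equal frequency split midway between neighboring values is an assumption, since other conventions are possible.

```python
def equal_width_splits(values, n):
    """Return n - 1 split points that divide the range into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [lo + i * width for i in range(1, n)]


def equal_frequency_splits(values, n):
    """Return n - 1 split points that put roughly the same number of values in each interval."""
    ordered = sorted(values)
    m = len(ordered)
    splits = []
    for i in range(1, n):
        idx = i * m // n
        # Assumption: split midway between the two values that straddle the boundary.
        splits.append((ordered[idx - 1] + ordered[idx]) / 2)
    return splits


def assign_interval(x, splits):
    """Index of the interval (left-open, right-closed) that contains x."""
    return sum(x > s for s in splits)


values = [1, 2, 3, 4, 5, 6, 7, 100]  # illustrative data; 100 is an outlier
ew = equal_width_splits(values, 4)
ef = equal_frequency_splits(values, 4)
print([assign_interval(v, ew) for v in values])
print([assign_interval(v, ef) for v in values])
```

With equal width, the outlier stretches the range so that all other values fall into the first interval, while equal frequency spreads the values evenly.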
Example 2.12 (Discretization Techniques).
This example demonstrates how these approaches work on an actual data set.
Figure 2.13(a) shows data points belonging to four different groups, along
with two outliers—the large dots on either end. The techniques of the
previous paragraph were applied to discretize the x values of these data points
into four categorical values. (Points in the data set have a random y
component to make it easy to see how many points are in each group.)
Visually inspecting the data works quite well, but is not automatic, and thus,
we focus on the other three approaches. The split points produced by the
techniques equal width, equal frequency, and K-means are shown in Figures
2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as
dashed lines.
Figure 2.13.
Different discretization techniques.
In this particular example, if we measure the performance of a discretization
technique by the extent to which different objects that clump together have
the same categorical value, then K-means performs best, followed by equal
frequency, and finally, equal width. More generally, the best discretization
will depend on the application and often involves domain-specific
discretization. For example, the discretization of people into low income,
middle income, and high income is based on economic factors.
Supervised Discretization
If classification is our application and class labels are known for some data
objects, then discretization approaches that use class labels often produce
better classification. This should not be surprising, since an interval
constructed with no knowledge of class labels often contains a mixture of
class labels. A conceptually simple approach is to place the splits in a way
that maximizes the purity of the intervals, i.e., the extent to which an interval
contains a single class label. In practice, however, such an approach requires
potentially arbitrary decisions about the purity of an interval and the
minimum size of an interval.
To overcome such concerns, some statistically based approaches start with
each attribute value in a separate interval and create larger intervals by
merging adjacent intervals that are similar according to a statistical test. An
alternative to this bottom-up approach is a top-down approach that starts by
bisecting the initial values so that the resulting two intervals give minimum
entropy. This technique only needs to consider each value as a possible split
point, because it is assumed that intervals contain ordered sets of values. The
splitting process is then repeated with another interval, typically choosing the
interval with the worst (highest) entropy, until a user-specified number of
intervals is reached, or a stopping criterion is satisfied.
Entropy-based approaches are one of the most promising approaches to
discretization, whether bottom-up or top-down. First, it is necessary to define
entropy. Let k be the number of different class labels, $m_i$ be the number of
values in the ith interval of a partition, and $m_{ij}$ be the number of values of
class j in interval i. Then the entropy $e_i$ of the ith interval is given by the
equation
$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},$
where $p_{ij} = m_{ij}/m_i$ is the probability (fraction of values) of class j in the ith
interval. The total entropy, e, of the partition is the weighted average of the
individual interval entropies, i.e.,
$e = \sum_{i=1}^{n} w_i e_i,$
where m is the number of values, $w_i = m_i/m$ is the fraction of values in the ith
interval, and n is the number of intervals. Intuitively, the entropy of an
interval is a measure of the purity of an interval. If an interval contains only
values of one class (is perfectly pure), then the entropy is 0 and it contributes
nothing to the overall entropy. If the classes of values in an interval occur
equally often (the interval is as impure as possible), then the entropy is a
maximum.
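The entropy definitions above, together with a single top-down bisection step, can be sketched as follows. The function names and the tiny data set are assumptions made for illustration, and placing the split midway between adjacent values is one possible convention.

```python
import math
from collections import Counter


def interval_entropy(labels):
    """Entropy of one interval: minus the sum over classes of p * log2(p)."""
    m_i = len(labels)
    counts = Counter(labels)
    return -sum((c / m_i) * math.log2(c / m_i) for c in counts.values())


def partition_entropy(intervals):
    """Weighted average of the interval entropies, weighted by interval size."""
    m = sum(len(labels) for labels in intervals)
    return sum(len(labels) / m * interval_entropy(labels)
               for labels in intervals if labels)


def best_split(values, labels):
    """Try every boundary between distinct sorted values; keep the lowest entropy."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = partition_entropy([left, right])
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or e < best[1]:
            best = (split, e)
    return best


xs = [1, 2, 3, 10, 11, 12]          # illustrative attribute values
ys = ["a", "a", "a", "b", "b", "b"]  # illustrative class labels
split, entropy = best_split(xs, ys)
print(split, abs(entropy))
```

Repeating best_split on the interval with the highest entropy would give the full top-down procedure; a stopping criterion is not included in this sketch.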
Example 2.13 (Discretization of Two Attributes).
The top-down method based on entropy was used to independently discretize
both the x and y attributes of the two-dimensional data shown in Figure 2.14.
In the first discretization, shown in Figure 2.14(a), the x and y attributes were
both split into three intervals. (The dashed lines indicate the split points.) In
the second discretization, shown in Figure 2.14(b), the x and y attributes were
both split into five intervals.
Figure 2.14.
Discretizing x and y attributes for four groups (classes) of points.
This simple example illustrates two aspects of discretization. First, in two
dimensions, the classes of points are well separated, but in one dimension,
this is not so. In general, discretizing each attribute separately often
yields suboptimal results. Second, five intervals work better than three,
but six intervals do not improve the discretization much, at least in terms of
entropy. (Entropy values and results for six intervals are not shown.)
Consequently, it is desirable to have a stopping criterion that automatically
finds the right number of partitions.
Categorical Attributes with Too
Many Values
Categorical attributes can sometimes have too many values. If the categorical
attribute is an ordinal attribute, then techniques similar to those for
continuous attributes can be used to reduce the number of categories. If the
categorical attribute is nominal, however, then other approaches are needed.
Consider a university that has a large number of departments. Consequently,
a department name attribute might have dozens of different values. In this
situation, we could use our knowledge of the relationships among different
departments to combine departments into larger groups, such as engineering,
social sciences, or biological sciences. If domain knowledge does not serve
as a useful guide or such an approach results in poor classification
performance, then it is necessary to use a more empirical approach, such as
grouping values together only if such a grouping results in improved
classification accuracy or achieves some other data mining objective.
2.3.7 Variable Transformation
A variable transformation refers to a transformation that is applied to all the
values of a variable. (We use the term variable instead of attribute to adhere
to common usage, although we will also refer to attribute transformation on
occasion.) In other words, for each object, the transformation is applied to the
value of the variable for that object. For example, if only the magnitude of a
variable is important, then the values of the variable can be transformed by
taking the absolute value. In the following section, we discuss two important
types of variable transformations: simple functional transformations and
normalization.
Simple Functions
For this type of variable transformation, a simple mathematical function is
applied to each value individually. If x is a variable, then examples of such
transformations include $x^k$, log x, $e^x$, $\sqrt{x}$, 1/x, sin x, or |x|. In statistics, variable
transformations, especially sqrt, log, and 1/x, are often used to transform data
that does not have a Gaussian (normal) distribution into data that does. While
this can be important, other reasons often take precedence in data mining.
Suppose the variable of interest is the number of data bytes in a session, and
the number of bytes ranges from 1 to 1 billion. This is a huge range, and it
can be advantageous to compress it by using a log10 transformation. In this
case, sessions that transferred $10^8$ and $10^9$ bytes would be more similar to
each other than sessions that transferred 10 and 1000 bytes
(9−8=1 versus 3−1=2). For some applications, such as network intrusion
detection, this may be what is desired, since the first two sessions most likely
represent transfers of large files, while the latter two sessions could be two
quite distinct types of sessions.
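A short sketch of the log transformation on the byte counts mentioned above; the variable names are illustrative only.

```python
import math

session_bytes = [10, 1000, 10**8, 10**9]
log_bytes = [math.log10(b) for b in session_bytes]

# Differences on the log scale: small sessions versus large sessions.
small_gap = log_bytes[1] - log_bytes[0]   # log10(1000) - log10(10)
large_gap = log_bytes[3] - log_bytes[2]   # log10(10**9) - log10(10**8)
print(small_gap, large_gap)
```

On the original scale the two large sessions differ by 900 million bytes, yet on the log scale they are closer together than the two small sessions.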
Variable transformations should be applied with caution because they change
the nature of the data. While this is what is desired, there can be problems if
the nature of the transformation is not fully appreciated. For instance, the
transformation 1/x reduces the magnitude of values that are 1 or larger, but
increases the magnitude of values between 0 and 1. To illustrate, the values
{1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus,
for all sets of positive values, the transformation 1/x reverses the order. To help clarify
the effect of a transformation, it is important to ask questions such as the
following: What is the desired property of the transformed attribute? Does the
order need to be maintained? Does the transformation apply to all values,
especially negative values and 0? What is the effect of the transformation on
the values between 0 and 1? Exercise 17 on page 109 explores other aspects
of variable transformation.
Normalization or Standardization
The goal of standardization or normalization is to make an entire set of values
have a particular property. A traditional example is that of “standardizing a
variable” in statistics. If $\bar{x}$ is the mean (average) of the attribute values and
$s_x$ is their standard deviation, then the transformation $x' = (x - \bar{x})/s_x$ creates a
new variable that has a mean of 0 and a standard deviation of 1. If different
variables are to be used together, e.g., for clustering, then such a
transformation is often necessary to avoid having a variable with large values
dominate the results of the analysis. To illustrate, consider comparing people
based on two variables: age and income. For any two people, the difference
in income will likely be much higher in absolute terms (hundreds or
thousands of dollars) than the difference in age (less than 150). If the
differences in the range of values of age and income are not taken into
account, then the comparison between people will be dominated by
differences in income. In particular, if the similarity or dissimilarity of two
people is calculated using the similarity or dissimilarity measures defined
later in this chapter, then in many cases, such as that of Euclidean distance,
the income values will dominate the calculation.
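The effect of standardization on a distance calculation can be sketched as follows. The ages and incomes are made-up values, the helper names are hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the text does not say which variant is meant.

```python
import math
import statistics


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: sample standard deviation
    return [(x - mean) / s for x in values]


ages = [25, 40, 60]
incomes = [30000, 32000, 90000]
z_ages = standardize(ages)
z_incomes = standardize(incomes)


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Distance between the first two people, before and after standardization.
raw = euclidean((ages[0], incomes[0]), (ages[1], incomes[1]))
scaled = euclidean((z_ages[0], z_incomes[0]), (z_ages[1], z_incomes[1]))
print(raw, scaled)
```

Before standardization the distance is almost entirely the income difference; afterwards both variables contribute on a comparable scale.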
The mean and standard deviation are strongly affected by outliers, so the
above transformation is often modified. First, the mean is replaced by the
median, i.e., the middle value. Second, the standard deviation is replaced by
the absolute standard deviation. Specifically, if x is a variable, then the
absolute standard deviation of x is given by $\sigma_A = \sum_{i=1}^{m} |x_i - \mu|$, where $x_i$ is the
ith value of the variable, m is the number of objects, and μ is either the mean
or median. Other approaches for computing estimates of the location (center)
and spread of a set of values in the presence of outliers are described in
statistics books. These more robust measures can also be used to define a
standardization transformation.
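A sketch of the modified transformation, using the median as the center and the absolute standard deviation exactly as defined above (a plain sum of absolute deviations). The function names and the data are illustrative assumptions.

```python
import statistics


def absolute_standard_deviation(values, center):
    """Sum of absolute deviations from the chosen center (mean or median)."""
    return sum(abs(x - center) for x in values)


def robust_standardize(values):
    """Center on the median and scale by the absolute standard deviation."""
    med = statistics.median(values)
    sigma_a = absolute_standard_deviation(values, med)
    return [(x - med) / sigma_a for x in values]


data = [1, 2, 3, 4, 100]  # illustrative values; 100 is an outlier
sigma_a = absolute_standard_deviation(data, statistics.median(data))
robust = robust_standardize(data)
print(sigma_a, robust)
```

The median of this data is unaffected by the outlier, whereas the mean would be pulled toward it.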
2.4 Measures of Similarity and Dissimilarity
Similarity and dissimilarity are important because they are used by a number
of data mining techniques, such as clustering, nearest neighbor classification,
and anomaly detection. In many cases, the initial data set is not needed once
these similarities or dissimilarities have been computed. Such approaches can
be viewed as transforming the data to a similarity (dissimilarity) space and
then performing the analysis. Indeed, kernel methods are a powerful
realization of this idea. These methods are introduced in Section 2.4.7 and are
discussed more fully in the context of classification in Section 4.9.4.
We begin with a discussion of the basics: high-level definitions of similarity
and dissimilarity, and a discussion of how they are related. For convenience,
the term proximity is used to refer to either similarity or dissimilarity. Since
the proximity between two objects is a function of the proximity between the
corresponding attributes of the two objects, we first describe how to measure
the proximity between objects having only one attribute.
We then consider proximity measures for objects with multiple attributes.
This includes measures such as the Jaccard and cosine similarity measures,
which are useful for sparse data, such as documents, as well as correlation
and Euclidean distance, which are useful for non-sparse (dense) data, such as
time series or multi-dimensional points. We also consider mutual
information, which can be applied to many types of data and is good for
detecting nonlinear relationships. In this discussion, we restrict ourselves to
objects with relatively homogeneous attribute types, typically binary or
real-valued.
Next, we consider several important issues concerning proximity measures.
This includes how to compute proximity between objects when they have
heterogeneous types of attributes, and approaches to account for differences
of scale and correlation among variables when computing distance between
numerical objects. The section concludes with a brief discussion of how to
select the right proximity measure.
Although this section focuses on the computation of proximity between data
objects, proximity can also be computed between attributes. For example, for
the document-term matrix of Figure 2.2(d), the cosine measure can be used to
compute similarity between a pair of documents or a pair of terms (words).
Knowing that two variables are strongly related can, for example, be helpful
for eliminating redundancy. In particular, the correlation and mutual
information measures discussed later are often used for that purpose.
2.4.1 Basics
Informally, the similarity between two objects is a numerical measure of the
degree to which the two objects are alike. Consequently, similarities are
higher for pairs of objects that are more alike. Similarities are usually nonnegative and are often between 0 (no similarity) and 1 (complete similarity).
The dissimilarity between two objects is a numerical measure of the degree
to which the two objects are different. Dissimilarities are lower for more
similar pairs of objects. Frequently, the term distance is used as a synonym
for dissimilarity, although, as we shall see, distance often refers to a special
class of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1],
but it is also common for them to range from 0 to ∞.
Transformations are often applied to convert a similarity to a dissimilarity, or
vice versa, or to transform a proximity measure to fall within a particular
range, such as [0,1]. For instance, we may have similarities that range from 1
to 10, but the particular algorithm or software package that we want to use
may be designed to work only with dissimilarities, or it may work only with
similarities in the interval [0,1]. We discuss these issues here because we will
employ such transformations later in our discussion of proximity. In addition,
these issues are relatively independent of the details of specific proximity
measures.
Frequently, proximity measures, especially similarities, are defined or
transformed to have values in the interval [0,1]. Informally, the motivation
for this is to use a scale in which a proximity value indicates the fraction of
similarity (or dissimilarity) between two objects. Such a transformation is
often relatively straightforward. For example, if the similarities between
objects range from 1 (not at all similar) to 10 (completely similar), we can
make them fall within the range [0, 1] by using the transformation s′=(s−1)/9,
where s and s′ are the original and new similarity values, respectively. In the
more general case, the transformation of similarities to the interval [0, 1] is
given by the expression s′=(s−min_s)/(max_s−min_s), where max_s and
min_s are the maximum and minimum similarity values, respectively.
Likewise, dissimilarity measures with a finite range can be mapped to the
interval [0,1] by using the formula d′=(d−min_d)/(max_d−min_d). This is an
example of a linear transformation, which preserves the relative distances
between points. In other words, if points, x1 and x2, are twice as far apart as
points, x3 and x4, the same will be true after a linear transformation.
However, there can be complications in mapping proximity measures to the
interval [0, 1] using a linear transformation. If, for example, the proximity
measure originally takes values in the interval [0,∞], then max_d is not
defined and a nonlinear transformation is needed. Values will not have the
same relationship to one another on the new scale. Consider the
transformation d′=d/(1+d) for a dissimilarity measure that ranges from 0 to ∞.
The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the
new dissimilarities 0, 0.33, 0.67, 0.91, 0.99, and 0.999, respectively. Larger
values on the original dissimilarity scale are compressed into the range of
values near 1, but whether this is desirable depends on the application.
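Both kinds of rescaling can be sketched in a few lines. The function names are hypothetical; the inputs are the values used in the discussion above.

```python
def rescale_to_unit_interval(value, lo, hi):
    """Linear map of a proximity with finite range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)


def compress_unbounded(d):
    """Nonlinear map of a dissimilarity in [0, infinity) into [0, 1)."""
    return d / (1 + d)


# Similarities on a 1-to-10 scale mapped linearly to [0, 1].
scaled_similarities = [rescale_to_unit_interval(s, 1, 10) for s in (1, 5.5, 10)]

# Unbounded dissimilarities compressed into [0, 1).
compressed = [compress_unbounded(d) for d in (0, 0.5, 2, 10, 100, 1000)]
print(scaled_similarities)
print([round(c, 3) for c in compressed])
```

The linear map keeps ratios of differences intact, while the nonlinear map squeezes all large dissimilarities toward 1.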
Note that mapping proximity measures to the interval [0, 1] can also change
the meaning of the proximity measure. For example, correlation, which is
discussed later, is a measure of similarity that takes values in the interval
[−1, 1]. Mapping these values to the interval [0,1] by taking the absolute
value loses information about the sign, which can be important in some
applications. See Exercise 22 on page 111.
Transforming similarities to dissimilarities and vice versa is also relatively
straightforward, although we again face the issues of preserving meaning and
changing a linear scale into a nonlinear scale. If the similarity (or
dissimilarity) falls in the interval [0,1], then the dissimilarity can be defined
as d=1−s (s=1−d). Another simple approach is to define similarity as the
negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0,
1, 10, and 100 can be transformed into the similarities 0, −1, −10, and −100,
respectively.
The similarities resulting from the negation transformation are not restricted
to the range [0, 1], but if that is desired, then transformations such as
s=1/(d+1), s=e^(−d), or s=1−(d−min_d)/(max_d−min_d) can be used. For the
transformation s=1/(d+1), the dissimilarities 0, 1, 10, 100 are transformed into
1, 0.5, 0.09, 0.01, respectively. For s=e^(−d), they become 1.00, 0.37, 0.00,
0.00, respectively, while for s=1−(d−min_d)/(max_d−min_d) they become 1.00,
0.99, 0.90, 0.00, respectively. In this discussion, we have focused on
converting dissimilarities to similarities. Conversion in the opposite direction
is considered in Exercise 23 on page 111.
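The three dissimilarity-to-similarity transformations can be checked numerically with a short sketch. The function names are illustrative assumptions; the inputs are the dissimilarities 0, 1, 10, and 100 from the text.

```python
import math


def sim_reciprocal(d):
    return 1 / (d + 1)


def sim_exponential(d):
    return math.exp(-d)


def sim_linear(d, min_d, max_d):
    return 1 - (d - min_d) / (max_d - min_d)


dissimilarities = [0, 1, 10, 100]
reciprocal = [round(sim_reciprocal(d), 2) for d in dissimilarities]
exponential = [round(sim_exponential(d), 2) for d in dissimilarities]
linear = [round(sim_linear(d, 0, 100), 2) for d in dissimilarities]
print(reciprocal, exponential, linear)
```

All three functions are monotonically decreasing in d, so each preserves the ranking of pairs while distorting the scale differently.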
In general, any monotonic decreasing function can be used to convert
dissimilarities to similarities, or vice versa. Of course, other factors also must
be considered when transforming similarities to dissimilarities, or vice versa,
or when transforming the values of a proximity measure to a new scale. We
have mentioned issues related to preserving meaning, distortion of scale, and
requirements of data analysis tools, but this list is certainly not exhaustive.
2.4.2 Similarity and Dissimilarity
between Simple Attributes
The proximity of objects with a number of attributes is typically defined by
combining the proximities of individual attributes, and thus, we first discuss
proximity between objects having a single attribute. Consider objects
described by one nominal attribute. What would it mean for two such objects
to be similar? Because nominal attributes convey only information about the
distinctness of objects, all we can say is that two objects either have the same
value or they do not. Hence, in this case similarity is traditionally defined as
1 if attribute values match, and as 0 otherwise. A dissimilarity would be
defined in the opposite way: 0 if the attribute values match, and 1 if they do
not.
For objects with a single ordinal attribute, the situation is more complicated
because information about order should be taken into account. Consider an
attribute that measures the quality of a product, e.g., a candy bar, on the scale
{poor, fair, OK, good, wonderful}. It would seem reasonable that a product,
P1, which is rated wonderful, would be closer to a product P2, which is rated
good, than it would be to a product P3, which is rated OK. To make this
observation quantitative, the values of the ordinal attribute are often mapped
to successive integers, beginning at 0 or 1, e.g.,
{poor=0, fair=1, OK=2, good=3, wonderful=4}. Then, d(P1, P2)=4−3=1 or,
if we want the dissimilarity to fall between 0 and 1, d(P1, P2)=(4−3)/4=0.25. A
similarity for ordinal attributes can then be defined as s=1−d.
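The ordinal case can be sketched as follows, assuming the equal-interval mapping to successive integers described above. The names SCALE, RANK, and the two functions are hypothetical.

```python
SCALE = ["poor", "fair", "OK", "good", "wonderful"]
RANK = {value: position for position, value in enumerate(SCALE)}


def ordinal_dissimilarity(a, b):
    """Absolute rank difference, divided by n - 1 so the result lies in [0, 1]."""
    return abs(RANK[a] - RANK[b]) / (len(SCALE) - 1)


def ordinal_similarity(a, b):
    return 1 - ordinal_dissimilarity(a, b)


print(ordinal_dissimilarity("wonderful", "good"))
print(ordinal_dissimilarity("wonderful", "OK"))
```

As expected, the product rated wonderful is closer to the one rated good than to the one rated OK.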
This definition of similarity (dissimilarity) for an ordinal attribute should
make the reader a bit uneasy since this assumes equal intervals between
successive values of the attribute, and this is not necessarily so. Otherwise,
we would have an interval or ratio attribute. Is the difference between the
values fair and good really the same as that between the values OK and
wonderful? Probably not, but in practice, our options are limited, and in the
absence of more information, this is the standard approach for defining
proximity between ordinal attributes.
For interval or ratio attributes, the natural measure of dissimilarity between
two objects is the absolute difference of their values. For example, we might
compare our current weight and our weight a year ago by saying “I am ten
pounds heavier.” In cases such as these, the dissimilarities typically range
from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio
attributes is typically expressed by transforming a dissimilarity into a
similarity, as previously described.
Table 2.7 summarizes this discussion. In this table, x and y are two objects
that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the
dissimilarity and similarity between x and y, respectively. Other approaches
are possible; these are the most common ones.
Table 2.7. Similarity and dissimilarity for simple attributes.
Type                Dissimilarity                           Similarity
Nominal             d = 0 if x = y, d = 1 if x ≠ y          s = 1 if x = y, s = 0 if x ≠ y
Ordinal             d = |x−y|/(n−1)                         s = 1−d
                    (values mapped to integers 0 to n−1,
                    where n is the number of values)
Interval or Ratio   d = |x−y|                               s = −d, s = 1/(1+d), s = e^(−d),
                                                            s = 1−(d−min_d)/(max_d−min_d)
The following two sections consider more complicated measures of
proximity between objects that involve multiple attributes: (1) dissimilarities
between data objects and (2) similarities between data objects. This division
allows us to more naturally display the underlying motivations for employing
various proximity measures. We emphasize, however, that similarities can be
transformed into dissimilarities and vice versa using the approaches described
earlier.
2.4.3 Dissimilarities between Data Objects
In this section, we discuss various kinds of dissimilarities. We begin with a
discussion of distances, which are dissimilarities with certain properties, and
then provide examples of more general kinds of dissimilarities.
We first present some examples, and then offer a more formal description of
distances in terms of the properties common to all distances. The Euclidean
distance, d, between two points, x and y, in one-, two-, three-, or
higher-dimensional space, is given by the following familiar formula:
$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2},$ (2.1)
where n is the number of dimensions and xk and yk are, respectively, the kth
attributes (components) of x and y. We illustrate this formula with Figure
2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y
coordinates of these points, and the distance matrix containing the pairwise
distances of these points.
Figure 2.15.
Four two-dimensional points.
The Euclidean distance measure given in Equation 2.1 is generalized by the
Minkowski distance metric shown in Equation 2.2,
$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r},$ (2.2)
where r is a parameter. The following are the three most common examples
of Minkowski distances.
r=1. City block (Manhattan, taxicab, L1 norm) distance. A common
example is the Hamming distance, which is the number of bits that
differ between two objects that have only binary attributes, i.e.,
between two binary vectors.
r=2. Euclidean distance (L2 norm).
r=∞. Supremum (Lmax or L∞ norm) distance. This is the maximum
difference between any attribute of the objects. More formally, the L∞
distance is defined by Equation 2.3
$d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}.$ (2.3)
The r parameter should not be confused with the number of dimensions (attributes) n. The Euclidean, Manhattan, and supremum distances are defined
for all values of n: 1, 2, 3, …, and specify different ways of combining the
differences in each dimension (attribute) into an overall distance.
Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and
L∞ distances using data from Table 2.8. Notice that all these distance
matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In
Table 2.9, for instance, the fourth row of the first column and the fourth
column of the first row both contain the value 5.1.
Table 2.8. x and y coordinates of four points.
point   x coordinate   y coordinate
p1      0              2
p2      2              0
p3      3              1
p4      5              1
Table 2.9. Euclidean distance matrix for Table 2.8.
      p1    p2    p3    p4
p1    0.0   2.8   3.2   5.1
p2    2.8   0.0   1.4   3.2
p3    3.2   1.4   0.0   2.0
p4    5.1   3.2   2.0   0.0
Table 2.10. L1 distance matrix for Table 2.8.
L1    p1    p2    p3    p4
p1    0.0   4.0   4.0   6.0
p2    4.0   0.0   2.0   4.0
p3    4.0   2.0   0.0   2.0
p4    6.0   4.0   2.0   0.0
Table 2.11. L∞ distance matrix for Table 2.8.
L∞    p1    p2    p3    p4
p1    0.0   2.0   3.0   5.0
p2    2.0   0.0   1.0   3.0
p3    3.0   1.0   0.0   2.0
p4    5.0   3.0   2.0   0.0
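The distance matrices above can be reproduced with a small sketch of the Minkowski distance. The function name minkowski and the use of math.inf to select the supremum distance are choices made for this illustration.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


# Coordinates of the four points from Table 2.8.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

for r in (1, 2, math.inf):
    print("r =", r)
    for name, p in points.items():
        row = [round(minkowski(p, q, r), 1) for q in points.values()]
        print(name, row)
```

Taking the maximum directly for r = ∞ avoids evaluating the limit numerically.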
Distances, such as the Euclidean distance, have some well-known properties.
If d(x, y) is the distance between two points, x and y, then the following
properties hold.
1. Positivity
   (a) d(x, y)≥0 for all x and y,
   (b) d(x, y)=0 only if x=y.
2. Symmetry: d(x, y)=d(y, x) for all x and y.
3. Triangle Inequality: d(x, z)≤d(x, y)+d(y, z) for all points x, y, and z.
Measures that satisfy all three properties are known as metrics. Some people
use the term distance only for dissimilarity measures that satisfy these
properties, but that practice is often violated. The three properties described
here are useful, as well as mathematically pleasing. Also, if the triangle
inequality holds, then this property can be used to increase the efficiency of
techniques (including clustering) that depend on distances possessing this
property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy
one or more of the metric properties. Example 2.14 illustrates such a
measure.
Example 2.14 (Non-metric
Dissimilarities: Set Differences).
This example is based on the notion of the difference of two sets, as defined
in set theory. Given two sets A and B, A−B is the set of elements of A that are
not in B. For example, if A={1, 2, 3, 4} and B={2, 3, 4}, then A−B={1} and
B−A=∅, the empty set. We can define the distance d between two sets A and B
as d(A, B)=size(A−B), where size is a function returning the number of
elements in a set. This distance measure, which is an integer value greater
than or equal to 0, does not satisfy the second part of the positivity property,
the symmetry property, or the triangle inequality. However, these properties
can be made to hold if the dissimilarity measure is modified as follows:
d(A, B)=size(A−B)+size(B−A). See Exercise 21 on page 110.
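Python's built-in sets make the two measures from this example easy to check. The function names are illustrative.

```python
def set_difference_distance(a, b):
    """d(A, B) = size(A - B); not symmetric."""
    return len(a - b)


def symmetric_set_distance(a, b):
    """d(A, B) = size(A - B) + size(B - A)."""
    return len(a - b) + len(b - a)


A = {1, 2, 3, 4}
B = {2, 3, 4}
print(set_difference_distance(A, B), set_difference_distance(B, A))
print(symmetric_set_distance(A, B), symmetric_set_distance(B, A))
```

The first measure gives d(B, A) = 0 even though A and B differ, which violates both symmetry and the second part of positivity; the modified measure does not have this problem.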
2.4.4 Similarities between Data Objects
For similarities, the triangle inequality (or the analogous property) typically
does not hold, but symmetry and positivity typically do. To be explicit, if s(x,
y) is the similarity between points x and y, then the typical properties of
similarities are the following:
1. s(x, y)=1 only if x=y. (0≤s≤1)
2. s(x, y)=s(y, x) for all x and y. (Symmetry)
There is no general analog of the triangle inequality for similarity measures.
It is sometimes possible, however, to show that a similarity measure can
easily be converted to a metric distance. The cosine and Jaccard similarity
measures, which are discussed shortly, are two examples. Also, for specific
similarity measures, it is possible to derive mathematical bounds on the
similarity between two objects that are similar in spirit to the triangle
inequality.
Example 2.15 (A Non-symmetric
Similarity Measure).
Consider an experiment in which people are asked to classify a small set of
characters as they flash on a screen. The confusion matrix for this
experiment records how often each character is classified as itself, and how
often each is classified as another character. Using the confusion matrix, we
can define a similarity measure between a character x and a character y as the
number of times that x is misclassified as y, but note that this measure is not
symmetric. For example, suppose that “0” appeared 200 times and was
classified as a “0” 160 times, but as an “o” 40 times. Likewise, suppose that
“o” appeared 200 times and was classified as an “o” 170 times, but as “0”
only 30 times. Then, s(0,o)=40, but s(o, 0)=30. In such situations, the
similarity measure can be made symmetric by setting
s′(x, y)=s′(y, x)=(s(x, y)+s(y, x))/2, where s′ indicates the new similarity measure.
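The symmetrization step can be sketched with the counts from this example. The dictionary layout and function names are assumptions made for illustration.

```python
# Number of times the first character was classified as the second.
confusion = {
    ("0", "0"): 160, ("0", "o"): 40,
    ("o", "o"): 170, ("o", "0"): 30,
}


def similarity(x, y):
    """Non-symmetric similarity: how often x was misclassified as y."""
    return confusion[(x, y)]


def symmetric_similarity(x, y):
    """Average of the two directed similarities."""
    return (similarity(x, y) + similarity(y, x)) / 2


print(similarity("0", "o"), similarity("o", "0"))
print(symmetric_similarity("0", "o"), symmetric_similarity("o", "0"))
```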
2.4.5 Examples of Proximity Measures
This section provides specific examples of some similarity and dissimilarity
measures.
Similarity Measures for Binary Data
Similarity measures between objects that contain only binary attributes are
called similarity coefficients , and typically have values between 0 and 1. A
value of 1 indicates that the two objects are completely similar, while a value
of 0 indicates that the objects are not at all similar. There are many rationales
for why one coefficient is better than another in specific instances.
Let x and y be two objects that consist of n binary attributes. The comparison
of two such objects, i.e., two binary vectors, leads to the following four
quantities (frequencies):
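f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1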
Simple Matching Coefficient
One commonly used similarity coefficient is the simple matching coefficient
(SMC), which is defined as
$\mathrm{SMC} = \frac{\text{number of matching attribute values}}{\text{number of attributes}} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}.$ (2.4)
This measure counts both presences and absences equally. Consequently, the
SMC could be used to find students who had answered questions similarly on
a test that consisted only of true/false questions.
Jaccard Coefficient
Suppose that x and y are data objects that represent two rows (two
transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric
binary attribute corresponds to an item in a store, then a 1 indicates that the
item was purchased, while a 0 indicates that the product was not purchased.
Because the number of products not purchased by any customer far
outnumbers the number of products that were purchased, a similarity measure
such as SMC would say that all transactions are very similar. As a result, the
Jaccard coefficient is frequently used to handle objects consisting of
asymmetric binary attributes. The Jaccard coefficient, which is often
symbolized by J, is given by the following equation:
$$J = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$$
Example 2.16 (The SMC and
Jaccard Similarity Coefficients).
To illustrate the difference between these two similarity measures, we
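calculate SMC and J for the following two binary vectors.
x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
f01 = 2   the number of attributes where x was 0 and y was 1
f10 = 1   the number of attributes where x was 1 and y was 0
f00 = 7   the number of attributes where x was 0 and y was 0
f11 = 0   the number of attributes where x was 1 and y was 1
$$\mathrm{SMC} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}} = \frac{0 + 7}{2 + 1 + 0 + 7} = 0.7$$
$$J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}} = \frac{0}{2 + 1 + 0} = 0$$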
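The two coefficients can be checked with a short Python sketch. The function names are our own, and the handling of two all-zero vectors in `jaccard` is an assumption, since the definition leaves that case open.

```python
def binary_counts(x, y):
    """Return (f00, f01, f10, f11) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    f = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, y):
        f[(a, b)] += 1
    return f[(0, 0)], f[(0, 1)], f[(1, 0)], f[(1, 1)]

def smc(x, y):
    """Simple matching coefficient: matches (1-1 and 0-0) over all attributes."""
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over attributes not involved in 0-0 matches."""
    f00, f01, f10, f11 = binary_counts(x, y)
    denom = f01 + f10 + f11
    # Assumption: two all-zero vectors are assigned similarity 0.
    return f11 / denom if denom else 0.0

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(binary_counts(x, y))       # (7, 2, 1, 0)
print(smc(x, y), jaccard(x, y))  # 0.7 0.0
```

Because the seven 0-0 matches are counted by SMC but ignored by the Jaccard coefficient, the same pair of vectors looks fairly similar under one measure and completely dissimilar under the other.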
Cosine Similarity
Documents are often represented as vectors, where each component
(attribute) represents the frequency with which a particular term (word)
occurs in the document. Even though documents have thousands or tens of
thousands of attributes (terms), each document is sparse since it has relatively
few nonzero attributes. Thus, as with transaction data, similarity should not
depend on the number of shared 0 values because any two documents are
likely to “not contain” many of the same words, and therefore, if 0–0 matches
are counted, most documents will be highly similar to most other documents.
Therefore, a similarity measure for documents needs to ignore 0–0 matches
like the Jaccard measure, but also must be able to handle non-binary vectors.
The cosine similarity , defined next, is one of the most common measures of
document similarity. If x and y are two document vectors, then
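$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}, \tag{2.6}$$
where ′ indicates vector or matrix transpose and 〈 x, y 〉 indicates the inner
product of the two vectors,
$$\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{k=1}^{n} x_k y_k = \mathbf{x}'\mathbf{y}, \tag{2.7}$$
and ∥x∥ is the length of vector x,
$$\|\mathbf{x}\| = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle} = \sqrt{\mathbf{x}'\mathbf{x}}.$$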
The inner product of two vectors works well for asymmetric attributes since it
depends only on components that are non-zero in both vectors. Hence, the
similarity between two documents depends only upon the words that appear
in both of them.
Example 2.17 (Cosine Similarity
between Two Document Vectors).
This example calculates the cosine similarity for the following two data
objects, which might represent document vectors:
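x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
$$\begin{aligned}
\langle \mathbf{x}, \mathbf{y} \rangle &= 3 \times 1 + 2 \times 0 + 0 \times 0 + 5 \times 0 + 0 \times 0 + 0 \times 0 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 2 = 5 \\
\|\mathbf{x}\| &= \sqrt{3^2 + 2^2 + 5^2 + 2^2} = \sqrt{42} \approx 6.48 \\
\|\mathbf{y}\| &= \sqrt{1^2 + 1^2 + 2^2} = \sqrt{6} \approx 2.45 \\
\cos(\mathbf{x}, \mathbf{y}) &= \frac{5}{6.48 \times 2.45} \approx 0.31
\end{aligned}$$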
As indicated by Figure 2.16, cosine similarity really is a measure of the
(cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the
angle between x and y is 0°, and x and y are the same except for length. If the
cosine similarity is 0, then the angle between x and y is 90°, and they do not
share any terms (words).
Figure 2.16.
Geometric illustration of the cosine measure.
Equation 2.6 also can be written as Equation 2.8.
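$$\cos(\mathbf{x}, \mathbf{y}) = \left\langle \frac{\mathbf{x}}{\|\mathbf{x}\|}, \frac{\mathbf{y}}{\|\mathbf{y}\|} \right\rangle = \langle \mathbf{x}', \mathbf{y}' \rangle, \tag{2.8}$$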
where x′=x/∥x∥ and y′=y/∥y∥. Dividing x and y by their lengths
normalizes them to have a length of 1. This means that cosine similarity does
not take the length of the two data objects into account when computing
similarity. (Euclidean distance might be a better choice when length is
important.) For vectors with a length of 1, the cosine measure can be
calculated by taking a simple inner product. Consequently, when many
cosine similarities between objects are being computed, normalizing the
objects to have unit length can reduce the time required.
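As a rough illustration, the following Python sketch computes the cosine similarity of the two document vectors from Example 2.17 both directly (Equation 2.6) and as the inner product of the unit-length vectors (Equation 2.8). The helper names are our own, and zero vectors are not handled.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(dot(x, x))

def cosine(x, y):
    """Cosine similarity as in Equation 2.6 (undefined for zero vectors)."""
    return dot(x, y) / (norm(x) * norm(y))

def unit(x):
    """Scale x to unit length, as in Equation 2.8."""
    n = norm(x)
    return [a / n for a in x]

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(x, y), 4))            # 0.315
print(round(dot(unit(x), unit(y)), 4))   # 0.315
```

When many pairwise similarities are needed, calling `unit` once per object and then using only `dot` avoids recomputing the two lengths for every pair.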
Extended Jaccard Coefficient
(Tanimoto Coefficient)
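The extended Jaccard coefficient can be used for document data and
reduces to the Jaccard coefficient in the case of binary attributes. This
coefficient, which we shall represent as EJ, is defined by the following
equation:
$$EJ(\mathbf{x}, \mathbf{y}) = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - \langle \mathbf{x}, \mathbf{y} \rangle} = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - \mathbf{x}'\mathbf{y}}. \tag{2.9}$$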
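The claim that EJ reduces to the Jaccard coefficient for binary data can be checked numerically. In the sketch below the function names and the two binary vectors are hypothetical; for 0/1 vectors, ∥x∥² counts the 1s in x, so the denominator becomes f10 + f01 + f11.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient, Equation 2.9."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

def jaccard(x, y):
    """Jaccard coefficient f11 / (f01 + f10 + f11) for 0/1 vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return f11 / (f11 + mismatches)

# Hypothetical binary vectors: f11 = 2, f10 = 1, f01 = 1.
bx = (1, 1, 0, 1, 0)
by = (1, 0, 0, 1, 1)
print(extended_jaccard(bx, by), jaccard(bx, by))  # 0.5 0.5
```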
Correlation
Correlation is frequently used to measure the linear relationship between two
sets of values that are observed together. Thus, correlation can measure the
relationship between two variables (height and weight) or between two
objects (a pair of temperature time series). Correlation is used much more
frequently to measure the similarity between attributes since the values in two
data objects come from different attributes, which can have very different
attribute types and scales. There are many types of correlation, and indeed
correlation is sometimes used in a general sense to mean the relationship
between two sets of values that are observed together. In this discussion, we
will focus on a measure appropriate for numerical values.
Specifically, Pearson’s correlation between two sets of numerical values,
i.e., two vectors, x and y, is defined by the following equation:
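$$\mathrm{corr}(\mathbf{x}, \mathbf{y}) = \frac{\mathrm{covariance}(\mathbf{x}, \mathbf{y})}{\mathrm{standard\_deviation}(\mathbf{x}) \times \mathrm{standard\_deviation}(\mathbf{y})} = \frac{s_{xy}}{s_x s_y}, \tag{2.10}$$
where we use the following standard statistical notation and definitions:
$$\mathrm{covariance}(\mathbf{x}, \mathbf{y}) = s_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}) \tag{2.11}$$
$$\mathrm{standard\_deviation}(\mathbf{x}) = s_x = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2}$$
$$\mathrm{standard\_deviation}(\mathbf{y}) = s_y = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (y_k - \bar{y})^2}$$
$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k \text{ is the mean of } \mathbf{x}$$
$$\bar{y} = \frac{1}{n} \sum_{k=1}^{n} y_k \text{ is the mean of } \mathbf{y}$$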
Example 2.18 (Perfect Correlation).
Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that
x and y have a perfect positive (negative) linear relationship; that is,
xk=ayk+b, where a and b are constants. The following two vectors x and y
illustrate cases where the correlation is −1 and +1, respectively. In the first
case, the means of x and y were chosen to be 0, for simplicity.
x = (−3, 6, 0, 3, −6), y = (1, −2, 0, −1, 2): corr(x, y) = −1, xk = −3yk
x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2): corr(x, y) = 1, xk = 3yk
Example 2.19 (Nonlinear Relationships).
If the correlation is 0, then there is no linear relationship between the two sets
of values. However, nonlinear relationships can still exist. In the following
example, yk=xk2, but their correlation is 0.
x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
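The vectors of Examples 2.18 and 2.19 can be run through a direct implementation of Pearson's correlation. This is a minimal sketch with function names of our own choosing; it uses the n − 1 form of the covariance from Equation 2.11 and does not guard against constant vectors, whose standard deviation is 0.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(x, y):
    """Sample covariance with the 1/(n - 1) factor of Equation 2.11."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def std(x):
    """Sample standard deviation, the square root of covariance(x, x)."""
    return math.sqrt(covariance(x, x))

def corr(x, y):
    """Pearson's correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (std(x) * std(y))

# Example 2.18: perfect negative and perfect positive linear relationships.
print(round(corr((-3, 6, 0, 3, -6), (1, -2, 0, -1, 2)), 6))  # -1.0
print(round(corr((3, 6, 0, 3, 6), (1, 2, 0, 1, 2)), 6))      # 1.0

# Example 2.19: y is the square of x, yet the correlation is 0.
x = (-3, -2, -1, 0, 1, 2, 3)
y = tuple(v * v for v in x)
print(corr(x, y))  # 0.0
```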
Example 2.20 (Visualizing Correlation).
It is also easy to judge the correlation between two vectors x and y by
plotting pairs of corresponding values of x and y in a scatter plot. Figure 2.17
shows a number of these scatter plots when x and y consist of a set of 30
pairs of values that are randomly generated (with a normal distribution) so
that the correlation of x and y ranges from −1 to 1. Each circle in a plot
represents one of the 30 pairs of x and y values; its x coordinate is the value
of that pair for x, while its y coordinate is the value of the same pair for y.
Figure 2.17.
Scatter plots illustrating correlations from −1 to 1.
If we transform x and y by subtracting off their means and then normalizing
them so that their lengths are 1, then their correlation can be calculated by
taking the dot product. Let us refer to these transformed vectors of x and y as
x′ and y′, respectively. (Notice that this transformation is not the same as the
standardization used in other contexts, where we subtract the means and
divide by the standard deviations, as discussed in Section 2.3.7.) This
transformation highlights an interesting relationship between the correlation
measure and the cosine measure. Specifically, the correlation between x and
y is identical to the cosine between x′ and y′. However, the cosine between x
and y is not the same as the cosine between x′ and y′, even though they both
have the same correlation measure. In general, the correlation between two
vectors is equal to the cosine measure only in the special case when the
means of the two vectors are 0.
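The relationship between correlation and cosine can be verified numerically. The sketch below uses two hypothetical vectors with nonzero means; the helper names are our own. It computes the correlation from its definition, then the dot product of the centered, unit-length vectors x′ and y′, and finally the plain cosine of x and y for comparison.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def corr(x, y):
    """Pearson's correlation from covariance and standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

def center_and_normalize(v):
    """Subtract the mean, then scale the result to unit length."""
    m = sum(v) / len(v)
    centered = [a - m for a in v]
    length = math.sqrt(dot(centered, centered))
    return [a / length for a in centered]

def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

x = (1, 2, 3, 4, 5)   # hypothetical values with nonzero mean
y = (2, 1, 4, 3, 6)
xp, yp = center_and_normalize(x), center_and_normalize(y)
print(corr(x, y), dot(xp, yp))  # the two values agree up to rounding
print(cosine(x, y))             # differs, because the means are not 0
```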
Differences Among Measures For
Continuous Attributes
In this section, we illustrate the difference among the three proximity
measures for continuous attributes that we have just defined: cosine,
correlation, and Minkowski distance. Specifically, we consider two types of
data transformations that are commonly used, namely, scaling
(multiplication) by a constant factor and translation (addition) by a constant
value. A proximity measure is considered to be invariant to a data
transformation if its value remains unchanged even after performing the
transformation. Table 2.12 compares the behavior of cosine, correlation, and
Minkowski distance measures regarding their invariance to scaling and
translation operations. It can be seen that while correlation is invariant to both
scaling and translation, cosine is only invariant to scaling but not to
translation. Minkowski distance measures, on the other hand, are sensitive to
both scaling and translation and are thus invariant to neither.
Table 2.12. Properties of cosine,
correlation, and Minkowski
distance measures.
Property                                 Cosine   Correlation   Minkowski Distance
Invariant to scaling (multiplication)    Yes      Yes           No
Invariant to translation (addition)      No       Yes           No
Let us consider an example to demonstrate the significance of these
differences among different proximity measures.
Example 2.21 (Comparing
proximity measures).
Consider the following two vectors x and y with seven numeric attributes.
x = (1, 2, 4, 3, 0, 0, 0)
y = (1, 2, 3, 4, 0, 0, 0)
It can be seen that both x and y have 4 non-zero values, and the values in the
two vectors are mostly the same, except for the third and the fourth
components. The cosine, correlation, and Euclidean distance between the two
vectors can be computed as follows.
$$\cos(\mathbf{x}, \mathbf{y}) = \frac{29}{\sqrt{30} \times \sqrt{30}} = 0.9667$$
$$\mathrm{correlation}(\mathbf{x}, \mathbf{y}) = \frac{2.4524}{1.6183 \times 1.6183} = 0.9364$$
$$\text{Euclidean distance}(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = 1.4142$$
Not surprisingly, x and y have a cosine and correlation measure close to 1,
while the Euclidean distance between them is small, indicating that they are
quite similar. Now let us consider the vector ys, which is a scaled version of y
(multiplied by a constant factor of 2), and the vector yt, which is constructed
by translating y by 5 units as follows.
ys = 2 × y = (2, 4, 6, 8, 0, 0, 0)
yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)
We are interested in finding whether ys and yt show the same proximity with
x as shown by the original vector y. Table 2.13 shows the different measures
of proximity computed for the pairs (x, y), (x, ys), and (x, yt). It can be seen
that the value of correlation between x and y remains unchanged even after
replacing y with ys or yt. However, the value of cosine remains equal to
0.9667 when computed for (x, y) and (x, ys), but drops to
0.8259 when computed for (x, yt). This highlights the fact that cosine is
invariant to the scaling operation but not to the translation operation, in
contrast with the correlation measure. The Euclidean distance, on the other
hand, shows different values for all three pairs of vectors, as it is sensitive to
both scaling and translation.
Table 2.13. Similarity between
(x, y), (x, ys), and (x, yt).
Measure              (x, y)    (x, ys)   (x, yt)
Cosine               0.9667    0.9667    0.8259
Correlation          0.9364    0.9364    0.9364
Euclidean Distance   1.4142    5.8310    13.3041
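The invariance pattern summarized in Table 2.12 can be checked by recomputing the three measures for the pairs (x, y), (x, ys), and (x, yt). The following Python sketch is one way to do so; the function names are our own, and correlation is computed as the cosine of the mean-centered vectors, which is equivalent to Equation 2.10.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

def correlation(x, y):
    """Pearson's correlation, computed as the cosine of the centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = (1, 2, 4, 3, 0, 0, 0)
y = (1, 2, 3, 4, 0, 0, 0)
ys = tuple(2 * v for v in y)   # y scaled by a factor of 2
yt = tuple(v + 5 for v in y)   # y translated by 5 units

# One row per pair: cosine, correlation, Euclidean distance.
for name, other in (("y", y), ("ys", ys), ("yt", yt)):
    print(name,
          round(cosine(x, other), 4),
          round(correlation(x, other), 4),
          round(euclidean(x, other), 4))
```

The cosine column changes only for yt, the correlation column does not change at all, and the Euclidean distance changes for both ys and yt.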
We can observe from this example that different proximity measures behave
differently when scaling or translation operations are applied on the data. The
choice of the right proximity measure thus depends on the desired notion of
similarity between data objects that is meaningful for a given application. For
example, if x and y represented the frequencies of different words in a
document-term matrix, it would be meaningful to use a proximity measure
that remains unchanged when y is replaced by ys, because ys is just a scaled
version of y with the same distribution of words occurring in the document.
However, yt is different from y, since it contains a large number of words
with non-zero frequencies that do not occur in y. Because cosine is invariant
to scaling but not to translation, it will be an ideal choice of proximity
measure for this application.
Consider a different scenario in which x represents a location’s temperature
measured on the Celsius scale for seven days. Let y, ys, and yt be the
temperatures measured on those days at a different location, but using three
different measurement scales. Note that different units of temperature have
different offsets (e.g. Celsius and Kelvin) and different scaling factors (e.g.
Celsius and Fahrenheit). It is thus desirable to use a proximity measure that
captures the proximity between temperature values without being affected by
the measurement scale. Correlation would then be the ideal choice of
proximity measure for this application, as it is invariant to both scaling and
translation.
As another example, consider a scenario where x represents the amount of
precipitation (in cm) measured at seven locations. Let y, ys, and yt be
estimates of the precipitation at these locations, which are predicted using
three different models. Ideally, we would like to choose a model that
accurately reconstructs the measurements in x without making any error. It is
evident that y provides a good approximation of the values in x, whereas ys
and yt provide poor estimates of precipitation, even though they do capture
the trend in precipitation across locations. Hence, we need to choose a
proximity measure that penalizes any difference in the model estimates from
the actual observations, and is sensitive to both the scaling and translation
operations. The Euclidean distance satisfies this property and thus would be
the right choice of proximity measure for this application. Indeed, the
Euclidean distance is commonly used in computing the accuracy of models,
which will be discussed later in Chapter 3.
2.4.6 Mutual Information
Like correlation, mutual information is a measure of similarity between two
sets of paired values. It is sometimes used as an alternative to correlation,
particularly when a nonlinear relationship is suspected between the pairs of
values. This measure comes from information theory, which is
the study of how to formally define and quantify information. Indeed, mutual
information is a measure of how much information one set of values provides
about another, given that the values come in pairs, e.g., height and weight. If
the two sets of values are independent, i.e., the value of one tells us nothing
about the other, then their mutual information is 0. On the other hand, if the
two sets of values are completely dependent, i.e., knowing the value of one
tells us the value of the other and vice-versa, then they have maximum
mutual information. Mutual information does not have a maximum value, but
we will define a normalized version of it that ranges between 0 and 1.
To define mutual information, we consider two sets of values, X and Y ,
which occur in pairs (X, Y). We need to measure the average information in a
single set of values, i.e., either in X or in Y , and in the pairs of their values.
This is commonly measured by entropy. More specifically, assume X and Y
are discrete, that is, X can take m distinct values, u1, u2, …, um, and Y can
take n distinct values, v1, v2, …, vn. Then their individual and joint entropy
can be defined in terms of the probabilities of each value and pair of values as
follows:
$$H(X) = -\sum_{j=1}^{m} P(X = u_j) \log_2 P(X = u_j) \tag{2.12}$$
$$H(Y) = -\sum_{k=1}^{n} P(Y = v_k) \log_2 P(Y = v_k) \tag{2.13}$$
$$H(X, Y) = -\sum_{j=1}^{m} \sum_{k=1}^{n} P(X = u_j, Y = v_k) \log_2 P(X = u_j, Y = v_k) \tag{2.14}$$
where if the probability of a value or combination of values is 0, then
0 log2(0) is conventionally taken to be 0.
The mutual information of X and Y can now be defined straightforwardly:
I(X, Y)=H(X)+H(Y)−H(X, Y) (2.15)
Note that H(X, Y) is symmetric, i.e., H(X, Y)=H(Y, X), and thus mutual
information is also symmetric, i.e., I(X, Y)=I(Y, X).
Practically, X and Y are either the values in two attributes or two rows of the
same data set. In Example 2.22, we will represent those values as two vectors
x and y and calculate the probability of each value or pair of values from the
frequency with which values or pairs of values occur in x, y and (xi, yi),
where xi is the ith component of x and yi is the ith component of y. Let us
illustrate using a previous example.
Example 2.22 (Evaluating Nonlinear
Relationships with Mutual
Information).
Recall Example 2.19 where yk=xk2, but their correlation was 0.
x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
From Figure 2.18, I(x, y)=H(x)+H(y)−H(x, y)=1.9502. Although a variety of
approaches to normalize mutual information are possible—see Bibliographic
Notes—for this example, we will apply one that divides the mutual
information by log2(min(m, n)) and produces a result between 0 and 1. This
yields a value of 1.9502/log2(4)=0.9751. Thus, we can see that x and y are
strongly related. They are not perfectly related because given a value of y
there is, except for y=0, some ambiguity about the value of x. Notice that for
y=−x, the normalized mutual information would be 1.
Figure 2.18.
Computation of mutual information.
Table 2.14. Entropy for x
xj P(x=xj) −P(x=xj)log2 P(x=xj)
−3 1/7 0.4011
−2 1/7 0.4011
−1 1/7 0.4011
0 1/7 0.4011
1 1/7 0.4011
2 1/7 0.4011
3 1/7 0.4011
H(x) 2.8074
Table 2.15. Entropy for y
yk P(y=yk) −P(y=yk)log2 P(y=yk)
9 2/7 0.5164
4 2/7 0.5164
1 2/7 0.5164
0 1/7 0.4011
H(y) 1.9502
Table 2.16. Joint entropy
for x and y
xj yk P(x=xj, y=yk) −P(x=xj, y=yk)log2 P(x=xj, y=yk)
−3 9 1/7 0.4011
−2 4 1/7 0.4011
−1 1 1/7 0.4011
0 0 1/7 0.4011
1 1 1/7 0.4011
2 4 1/7 0.4011
3 9 1/7 0.4011
H(x, y) 2.8074
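The entries of Tables 2.14 through 2.16 can be reproduced with a few lines of Python. This is a sketch under the assumptions of the example: probabilities are estimated from value frequencies, and the normalization divides by log2(min(m, n)), which is undefined when either vector has a single distinct value. The function names are our own.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy (in bits) of the empirical distribution of the given values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(x, y) = H(x) + H(y) - H(x, y), with pairs formed component-wise."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def normalized_mi(x, y):
    """Mutual information divided by log2(min(m, n)); m, n = distinct value counts."""
    m, n = len(set(x)), len(set(y))
    return mutual_information(x, y) / math.log2(min(m, n))

x = (-3, -2, -1, 0, 1, 2, 3)
y = tuple(v * v for v in x)
print(round(entropy(x), 4), round(entropy(y), 4))  # 2.8074 1.9502
print(round(mutual_information(x, y), 4))          # 1.9502
print(round(normalized_mi(x, y), 4))               # 0.9751
```

Replacing y with the negation of x makes every value of one vector determine the value of the other, and `normalized_mi` then returns 1 up to rounding.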
2.4.7 Kernel Functions*
It is easy to understand how similarity and distance might be useful in an
application such as clustering, which tries to group similar objects together.
What is much less obvious is that many other data analysis tasks, including
predictive modeling and dimensionality reduction, can be expressed in terms
of pairwise “proximities” of data objects. More specifically, many data
analysis problems can be mathematically formulated to take as input a
kernel matrix, K, which can be considered a type of proximity matrix. Thus,
an initial preprocessing step is used to convert the input data into a kernel
matrix, which is the input to the data analysis algorithm.
More formally, if a data set has m data objects, then K is an m by m matrix. If
xi and xj are the ith and jth data objects, respectively, then kij, the ijth entry
of K, is computed by a kernel function:
kij=κ(xi, xj) (2.16)
As we will see in the material that follows, the use of a kernel matrix allows
both wider applicability of an algorithm to various kinds of data and an
ability to model nonlinear relationships with algorithms that are designed
only for detecting linear relationships.
Kernels make an algorithm data independent
If an algorithm uses a kernel matrix, then it can be used with any type of data
for which a kernel function can be designed. This is illustrated by Algorithm
2.1. Although only some data analysis algorithms can be modified to use a
kernel matrix as input, this approach is extremely powerful because it allows
such an algorithm to be used with almost any type of data for which an
appropriate kernel function can be defined. Thus, a classification algorithm
can be used, for example, with record data, string data, or graph data. If an
algorithm can be reformulated to use a kernel matrix, then its applicability to
different types of data increases dramatically. As we will see in later chapters,
many clustering, classification, and anomaly detection algorithms work only
with similarities or distances, and thus, can be easily modified to work with
kernels.
Algorithm 2.1 Basic kernel algorithm.
1. Read in the m data objects in the data set.
2. Compute the kernel matrix, K by applying the kernel function, κ, to each
pair of data objects.
3. Run the data analysis algorithm with K as input.
4. Return the analysis result, e.g., predicted class or cluster labels.
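The four steps can be sketched as follows. Everything named here is hypothetical: `bigram_kernel` is a toy string kernel (the number of distinct character bigrams two strings share, which is the inner product of their binary bigram-indicator vectors), and `most_similar` is a stand-in for a real analysis algorithm. The point is only that the analysis step sees the kernel matrix K and never the raw strings.

```python
def kernel_matrix(objects, kernel):
    """Step 2: apply the kernel function to every pair of data objects."""
    return [[kernel(a, b) for b in objects] for a in objects]

def bigram_kernel(s, t):
    """Hypothetical string kernel: count of distinct character bigrams shared by s and t."""
    def grams(w):
        return {w[i:i + 2] for i in range(len(w) - 1)}
    return len(grams(s) & grams(t))

def most_similar(K):
    """Stand-in for step 3: for each object, the index of its most similar other object."""
    m = len(K)
    return [max((j for j in range(m) if j != i), key=lambda j: K[i][j])
            for i in range(m)]

words = ["kernel", "kernels", "colonel", "matrix"]   # step 1: the data objects
K = kernel_matrix(words, bigram_kernel)              # step 2
result = most_similar(K)                             # step 3
print(result)                                        # step 4: return the result
```

Swapping in a different kernel function, for example one defined on graphs or records, would leave `kernel_matrix` and `most_similar` unchanged.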
Mapping data into a higher
dimensional data space can allow
modeling of nonlinear relationships
There is yet another, equally important, aspect of kernel-based data analysis
algorithms: their ability to model nonlinear relationships with algorithms
that model only linear relationships. Typically, this works by first
transforming (mapping) the data from a lower dimensional data space to a
higher dimensional space.
Example 2.23 (Mapping Data to a
Higher Dimensional Space).
Consider the relationship between two variables x and y given by the
following equation, which defines an ellipse in two dimensions (Figure
2.19(a)):
$$4x^2 + 9xy + 7y^2 = 10 \tag{2.17}$$
Figure 2.19.
Mapping data to a higher dimensional space: two to three
dimensions.
We can map our two dimensional data to three dimensions by creating three
new variables, u, v, and w, which are defined as follows:
$$u = x^2, \qquad v = xy, \qquad w = y^2$$
As a result, we can now express Equation 2.17 as a linear one:
$$4u + 9v + 7w = 10 \tag{2.18}$$
This equation describes a plane in three dimensions. Points on the ellipse
will lie on that plane, while points inside and outside the ellipse will lie on
opposite sides of the plane. See Figure 2.19(b). The viewpoint of this 3D plot
is along the surface of the separating plane so that the plane appears as a
line.
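A small numerical check makes the mapping concrete. The sketch assumes the mapping u = x², v = xy, w = y², which is the one that turns the ellipse equation into the linear Equation 2.18; the three test points and the function names are our own choices.

```python
import math

def to_3d(x, y):
    """Map a two-dimensional point to (u, v, w) = (x^2, xy, y^2)."""
    return x * x, x * y, y * y

def plane_side(u, v, w):
    """Value of 4u + 9v + 7w - 10: zero on the plane, and its sign gives the side."""
    return 4 * u + 9 * v + 7 * w - 10

on_ellipse = (math.sqrt(2.5), 0.0)   # satisfies 4x^2 + 9xy + 7y^2 = 10
inside = (0.0, 0.0)                  # center of the ellipse
outside = (2.0, 2.0)                 # well outside the ellipse

for label, point in (("on", on_ellipse), ("inside", inside), ("outside", outside)):
    print(label, plane_side(*to_3d(*point)))
```

The point on the ellipse gives a value of 0 up to rounding, while the inside and outside points give values of opposite sign, so a single linear test in (u, v, w) separates them.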
The Kernel Trick
The approach illustrated above shows the value in mapping data to higher
dimensional space, an operation that is integral to kernel-based methods.
Conceptually, we first define a function φ that maps data points x and y to
data points φ(x) and φ(y) in a higher dimensional space such that the inner
product 〈 φ(x), φ(y) 〉 gives the desired measure of proximity of x and y. It may seem
that we have potentially sacrificed a great deal by using such an approach,
because we can greatly expand the size of our data, increase the
computational complexity of our analysis, and encounter problems with the
curse of dimensionality by computing similarity in a high-dimensional space.
However, this is not the case since these problems can be avoided by defining
a kernel function κ that can compute the same similarity value, but with the
data points in the original space, i.e., κ(x, y)=〈 φ(x), φ(y) 〉. This is known as
the kernel trick. Despite the name, the kernel trick has a very solid
mathematical foundation and is a remarkably powerful approach for data
analysis.
Not every function of a pair of data objects satisfies the properties needed for
a kernel function, but it has been possible to design many useful kernels for a
wide variety of data types. For example, three common kernel functions are
the polynomial, Gaussian (radial basis function (RBF)), and sigmoid kernels.
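If x and y are two data objects, specifically, two data vectors, then these three
kernel functions can be expressed as follows, respectively:
$$\kappa(\mathbf{x}, \mathbf{y}) = (\mathbf{x}'\mathbf{y} + c)^d \tag{2.19}$$
$$\kappa(\mathbf{x}, \mathbf{y}) = \exp\left(-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2\right) \tag{2.20}$$
$$\kappa(\mathbf{x}, \mathbf{y}) = \tanh(\alpha \mathbf{x}'\mathbf{y} + c) \tag{2.21}$$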
where α and c≥0 are constants, d is an integer parameter that gives the
polynomial degree, ǁ x−y ǁ is the length of the vector x−y and σ>0 is a
parameter that governs the “spread” of a Gaussian.
Example 2.24 (The Polynomial Kernel).
Note that the kernel functions presented in the previous section are
computing the same similarity value as would be computed if we actually
mapped the data to a higher dimensional space and then computed an inner
product there. For example, for the polynomial kernel of degree 2, let φ be
the function that maps a two-dimensional data vector x=(x1, x2) to the higher
dimensional space. Specifically, let
$$\varphi(\mathbf{x}) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2c}\,x_1,\; \sqrt{2c}\,x_2,\; c\right). \tag{2.22}$$
For the higher dimensional space, let the proximity be defined as the inner
product of φ(x) and φ(y), i.e., 〈 φ(x), φ(y) 〉. Then, as previously mentioned, it
can be shown that
κ(x, y)=〈 φ(x), φ(y) 〉 (2.23)
where κ is defined by Equation 2.19 above. Specifically, if x=(x1, x2) and y=
(y1, y2), then
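$$\kappa(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2c\,x_1 y_1 + 2c\,x_2 y_2 + c^2 = (\mathbf{x}'\mathbf{y} + c)^2. \tag{2.24}$$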
More generally, the kernel trick depends on defining κ and φ so that Equation
2.23 holds. This has been done for a wide variety of kernels.
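The identity between the kernel value and the inner product in the mapped space can be confirmed numerically for the degree-2 polynomial kernel. The sketch below uses the feature map with the square-root factors (√2 and √(2c)) that make the identity hold; the two sample vectors and the choice c = 1 are arbitrary assumptions.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel (x'y + c)^d, evaluated in the original space."""
    return (dot(x, y) + c) ** d

def phi(x, c=1.0):
    """Explicit six-dimensional feature map for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2, c)

x, y = (1.0, 2.0), (3.0, -1.0)    # arbitrary two-dimensional vectors
in_original_space = poly_kernel(x, y)
in_mapped_space = dot(phi(x), phi(y))
print(in_original_space, in_mapped_space)  # both equal 4 up to rounding
```

The kernel needs one two-dimensional inner product, whereas the explicit route builds two six-dimensional vectors first; for higher degrees and dimensions that gap grows quickly.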
This discussion of kernel-based approaches was intended only to provide a
brief introduction to this topic and has omitted many details. A fuller
discussion of the kernel-based approach is provided in Section 4.9.4, which
discusses these issues in the context of nonlinear support vector machines for
classification. More general references for kernel-based analysis can be
found in the Bibliographic Notes of this chapter.
2.4.8 Bregman Divergence*
This section provides a brief description of Bregman divergences, which are
a family of proximity functions that share some common properties. As a
result, it is possible to construct general data mining algorithms, such as
clustering algorithms, that work with any Bregman divergence. A concrete
example is the K-means clustering algorithm (Section 7.2). Note that this
section requires knowledge of vector calculus.
Bregman divergences are loss or distortion functions. To understand the idea
of a loss function, consider the following. Let x and y be two points, where y
is regarded as the original point and x is some distortion or approximation of
it. For example, x may be a point that was generated by adding random noise
to y. The goal is to measure the resulting distortion or loss that results if y is
approximated by x. Of course, the more similar x and y are, the smaller the
loss or distortion. Thus, Bregman divergences can be used as dissimilarity
functions.
More formally, we have the following definition.
Definition 2.6 (Bregman divergence)
Given a strictly convex function ϕ (with a few modest restrictions that are
generally satisfied), the Bregman divergence (loss function) D(x, y)
generated by that function is given by the following equation:
$$D(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) - \phi(\mathbf{y}) - \langle \nabla\phi(\mathbf{y}), (\mathbf{x} - \mathbf{y}) \rangle \tag{2.25}$$
where ∇ϕ(y) is the gradient of ϕ evaluated at y, x−y is the vector difference
between x and y, and 〈 ∇ϕ(y), (x−y) 〉 is the inner product between ∇ϕ(y)
and (x−y). For points in Euclidean space, the inner product is just the dot
product.
D(x, y) can be written as D(x, y)=ϕ(x)−L(x), where L(x)=ϕ(y)+〈 ∇ϕ(y), (x−y) 〉
represents the equation of a plane that is tangent to the function ϕ at
y. Using calculus terminology, L(x) is the linearization of ϕ around the point
y, and the Bregman divergence is just the difference between a function and a
linear approximation to that function. Different Bregman divergences are
obtained by using different choices for ϕ.
Example 2.25.
We provide a concrete example using squared Euclidean distance, but restrict
ourselves to one dimension to simplify the mathematics. Let x and y be real
numbers and ϕ(t) be the real-valued function, ϕ(t)=t2. In that case, the
gradient reduces to the derivative, and the dot product reduces to
multiplication. Specifically, Equation 2.25 becomes Equation 2.26.
$$D(x, y) = x^2 - y^2 - 2y(x - y) = x^2 - 2xy + y^2 = (x - y)^2 \tag{2.26}$$
The graph for this example, with y=1, is shown in Figure 2.20. The Bregman
divergence is shown for two values of x: x=2 and x=3.
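The one-dimensional case is easy to evaluate directly. The sketch below implements Equation 2.25 for real numbers, where the gradient is the derivative and the inner product is ordinary multiplication; the function and variable names are our own.

```python
def bregman(phi, grad_phi, x, y):
    """One-dimensional Bregman divergence: phi(x) - phi(y) - phi'(y) * (x - y)."""
    return phi(x) - phi(y) - grad_phi(y) * (x - y)

def square(t):
    return t * t

def square_grad(t):
    return 2 * t

# With phi(t) = t^2 the divergence equals the squared distance (x - y)^2.
print(bregman(square, square_grad, 2, 1))  # 1
print(bregman(square, square_grad, 3, 1))  # 4
```

The two printed values are the gaps between ϕ and its tangent line at y=1, evaluated at x=2 and x=3.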
Figure 2.20.
Illustration of Bregman divergence.
2.4.9 Issues in Proximity Calculation
This section discusses several important issues related to proximity measures:
(1) how to handle the case in which attributes have different scales and/or are
correlated, (2) how to calculate proximity between objects that are composed
of different types of attributes, e.g., quantitative and qualitative, and (3) how
to handle proximity calculations when attributes have different weights; i.e.,
when not all attributes contribute equally to the proximity of objects.
Standardization and Correlation for
Distance Measures
An important issue with distance measures is how to handle the situation
when attributes do not have the same range of values. (This situation is often
described by saying that “the variables have different scales.”) In a previous
example, Euclidean distance was used to measure the distance between
people based on two attributes: age and income. Unless these two attributes
are standardized, the distance between two people will be dominated by
income.
A related issue is how to compute distance when there is correlation between
some of the attributes, perhaps in addition to differences in the ranges of
values. A generalization of Euclidean distance, the Mahalanobis distance, is
useful when attributes are correlated, have different ranges of values
(different variances), and the distribution of the data is approximately
Gaussian (normal). Correlated variables have a large impact on standard
distance measures since a change in any of the correlated variables is
reflected in a change in all the correlated variables. Specifically, the
Mahalanobis distance between two objects (vectors) x and y is defined as
$$\mathrm{Mahalanobis}(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})' \, \Sigma^{-1} (\mathbf{x} - \mathbf{y})}, \tag{2.27}$$
where Σ⁻¹ is the inverse of the covariance matrix of the data. Note that the
covariance matrix Σ is the matrix whose ijth entry is the covariance of the ith
and jth attributes as defined by Equation 2.11.
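For two attributes, the Mahalanobis distance can be computed without a linear algebra library by using the closed-form inverse of a 2 × 2 matrix. The sketch below is illustrative only: it assumes the square-root form of the distance, estimates the covariance with the 1/(n − 1) factor of Equation 2.11, and uses a small set of hypothetical points that are spread mainly along the diagonal.

```python
import math

def covariance_matrix_2d(points):
    """Return (sxx, sxy, syy), the entries of the 2 x 2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(x, y, cov):
    """sqrt((x - y)' S^-1 (x - y)), using the closed-form inverse of a 2 x 2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Hypothetical, positively correlated points.
points = [(0, 0), (1, 1), (2, 2), (3, 3), (1, 2), (2, 1)]
cov = covariance_matrix_2d(points)
print(mahalanobis_2d((0, 0), (3, 3), cov))  # along the direction of largest variance
print(mahalanobis_2d((1, 2), (2, 1), cov))  # across that direction
```

The first pair is about 4.24 apart in Euclidean terms but closer under the Mahalanobis distance, while the second pair, only about 1.41 apart in Euclidean terms, ends up slightly farther apart than the first.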
Example 2.26.
In Figure 2.21, there are 1000 points, whose x and y attributes have a
correlation of 0.6. The distance between the two large points at the opposite
ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but
only 6 with respect to Mahalanobis distance. This is because the Mahalanobis
distance gives less emphasis to the direction of largest variance. In practice,
computing the Mahalanobis distance is expensive, but can be worthwhile for
data whose attributes are correlated. If the attributes are relatively
uncorrelated, but have different ranges, then standardizing the variables is
sufficient.
Figure 2.21.
Set of two-dimensional points. The Mahalanobis distance between
the two points represented by large dots is 6; their Euclidean
distance is 14.7.
Combining Similarities for
Heterogeneous Attributes
The previous definitions of similarity were based on approaches that assumed
all the attributes were of the same type. A general approach is needed when
the attributes are of different types. One straightforward approach is to
compute the similarity between each attribute separately using Table 2.7, and
then combine these similarities using a method that results in a similarity
between 0 and 1. One possible approach is to define the overall similarity as
the average of all the individual attribute similarities. Unfortunately, this
approach does not work well if some of the attributes are asymmetric
attributes. For example, if all the attributes are asymmetric binary attributes,
then the similarity measure suggested previously reduces to the simple
matching coefficient, a measure that is not appropriate for asymmetric binary
attributes. The easiest way to fix this problem is to omit asymmetric attributes
from the similarity calculation when their values are 0 for both of the objects
whose similarity is being computed. A similar approach also works well for
handling missing values.
In summary, Algorithm 2.2 is effective for computing an overall similarity
between two objects, x and y, with different types of attributes. This
procedure can be easily modified to work with dissimilarities.
Algorithm 2.2 Similarities of
heterogeneous objects.
1. For the kth attribute, compute a similarity, sk(x, y), in the range [0, 1].
2. Define an indicator variable, δk, for the kth attribute as follows:
   δk = 0 if the kth attribute is an asymmetric attribute and both objects have
   a value of 0, or if one of the objects has a missing value for the kth
   attribute;
   δk = 1 otherwise.
3. Compute the overall similarity between the two objects using the
following formula:
$$\mathrm{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} \delta_k s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} \delta_k} \tag{2.28}$$
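Algorithm 2.2 can be sketched in Python as follows. The representation is our own assumption: each attribute is described by a pair (similarity function, asymmetric flag), missing values are encoded as `None`, and an object pair for which every attribute is skipped is given similarity 0. The two per-attribute similarity functions are simple stand-ins for the entries of Table 2.7.

```python
def heterogeneous_similarity(x, y, attributes):
    """Average the per-attribute similarities over the attributes with delta_k = 1."""
    num = den = 0.0
    for xk, yk, (sim, asymmetric) in zip(x, y, attributes):
        missing = xk is None or yk is None
        both_zero = asymmetric and xk == 0 and yk == 0
        if missing or both_zero:
            continue  # delta_k = 0: the attribute is left out of both sums
        num += sim(xk, yk)
        den += 1
    # Assumption: if every attribute was skipped, report similarity 0.
    return num / den if den else 0.0

def match(a, b):
    """Similarity for nominal or binary attributes: 1 if equal, else 0."""
    return 1.0 if a == b else 0.0

def scaled_difference(lo, hi):
    """Similarity for a numeric attribute with known range [lo, hi]."""
    return lambda a, b: 1.0 - abs(a - b) / (hi - lo)

# Hypothetical attributes: color, age, and two asymmetric purchase indicators.
attributes = [(match, False), (scaled_difference(0, 100), False),
              (match, True), (match, True)]
x = ("red", 30, 1, 0)
y = ("red", 50, 1, 0)
print(heterogeneous_similarity(x, y, attributes))
```

Here the last attribute is a 0-0 match on an asymmetric attribute, so it is ignored and the result is the average of the remaining three similarities, (1 + 0.8 + 1)/3.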
Using Weights
In much of the previous discussion, all attributes were treated equally when
computing proximity. This is not desirable when some attributes are more
important to the definition of proximity than others. To address these
situations, the formulas for proximity can be modified by weighting the
contribution of each attribute.
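With attribute weights, wk, Equation 2.28 becomes
$$\mathrm{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} w_k \delta_k}. \tag{2.29}$$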
The definition of the Minkowski distance can also be modified as follows:
$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} w_k \, |x_k - y_k|^r \right)^{1/r}. \tag{2.30}$$
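A direct transcription of the weighted Minkowski distance is short. The function name and the sample vectors and weights below are our own; with all weights equal to 1 and r = 2 the function reduces to the ordinary Euclidean distance.

```python
def weighted_minkowski(x, y, w, r=2):
    """Weighted Minkowski distance: (sum_k w_k * |x_k - y_k|^r)^(1/r)."""
    return sum(wk * abs(a - b) ** r for a, b, wk in zip(x, y, w)) ** (1 / r)

x, y = (0, 0), (3, 4)
print(weighted_minkowski(x, y, w=(1, 1), r=2))  # 5.0, the Euclidean distance
print(weighted_minkowski(x, y, w=(2, 1), r=1))  # 10.0, weighted city block distance
```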
2.4.10 Selecting the Right Proximity Measure
A few general observations may be helpful. First, the type of proximity
measure should fit the type of data. For many types of dense, continuous
data, metric distance measures such as Euclidean distance are often used.
Proximity between continuous attributes is most often expressed in terms of
differences, and distance measures provide a well-defined way of combining
these differences into an overall proximity measure. Although attributes can
have different scales and be of differing importance, these issues can often be
dealt with as described earlier, such as normalization and weighting of
attributes.
For sparse data, which often consists of asymmetric attributes, we typically
employ similarity measures that ignore 0–0 matches. Conceptually, this
reflects the fact that, for a pair of complex objects, similarity depends on the
number of characteristics they both share, rather than the number of
characteristics they both lack. The cosine, Jaccard, and extended Jaccard
measures are appropriate for such data.
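The effect of ignoring 0–0 matches can be seen by comparing the simple matching coefficient, which counts them, with the Jaccard coefficient, which does not. The sketch below is illustrative only; the function names are hypothetical, and returning 0.0 when two vectors contain no 1s at all is a convention assumed here.

```python
from typing import Sequence


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Fraction of positions where the two binary vectors agree (0-0 or 1-1)."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """1-1 matches divided by the positions where at least one vector has a 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0  # assumed convention for all-zero pairs


# Two sparse vectors that share no 1s but agree on seven 0s.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

print(simple_matching(x, y))  # high, driven entirely by shared absences
print(jaccard(x, y))          # zero, since no characteristic is shared
```

For sparse data the simple matching coefficient would rate almost every pair as similar, which is why measures of the Jaccard type are preferred there.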
There are other characteristics of data vectors that often need to be
considered. Invariance to scaling (multiplication) and to translation (addition)
were previously discussed with respect to Euclidean distance and the cosine
and correlation measures. The practical implications of such considerations
are that, for example, cosine is more suitable for sparse document data where
only scaling is important, while correlation works better for time series,
where both scaling and translation are important. Euclidean distance or other
types of Minkowski distance are most appropriate when two data vectors are
to match as closely as possible across all components (features).
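The invariance properties mentioned above can be checked numerically. The following sketch uses only the Python standard library; the helper names and the particular vectors are assumptions chosen for illustration, and correlation is computed as the cosine of the mean-centered vectors.

```python
import math
from typing import Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, written as the cosine of the centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = [1, 2, 4, 3, 0, 0, 0]
y = [1, 2, 3, 4, 0, 0, 0]
scaled = [2 * v for v in y]       # multiplication by a constant
shifted = [v + 5 for v in y]      # addition of a constant
both = [2 * v + 5 for v in y]     # scaling and translation together

# Cosine is unchanged by scaling but changes under translation.
print(cosine(x, y), cosine(x, scaled), cosine(x, shifted))

# Correlation is unchanged by scaling and by translation.
print(correlation(x, y), correlation(x, both))

# Euclidean distance changes under either transformation.
print(euclidean(x, y), euclidean(x, scaled), euclidean(x, shifted))
```

The printed values agree only up to floating-point rounding, so a tolerance-based comparison such as `math.isclose` is the appropriate way to test for equality.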
In some cases, transformation or normalization of the data is needed to obtain
a proper similarity measure. For instance, time series can have trends or
periodic patterns that significantly impact similarity. Also, a proper
computation of similarity often requires that time lags be taken into account.
Finally, two time series may be similar only over specific periods of time. For
example, there is a strong relationship between temperature and the use of
natural gas, but only during the heating season.
Practical considerations can also be important. Sometimes, one or more
proximity measures are already in use in a particular field, and thus, others
will have answered the question of which proximity measures should be
used. Other times, the software package or clustering algorithm being used
can drastically limit the choices. If efficiency is a concern, then we may want
to choose a proximity measure that has a property, such as the triangle
inequality, that can be used to reduce the number of proximity calculations.
(See Exercise 25.)
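One way the triangle inequality can save work is sketched below for a range query in Euclidean space. This is an illustrative sketch under assumptions, not a prescribed method: it supposes that the distance from every point to a single reference point (called `pivot` here) has been computed in advance, and the function and variable names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def range_query_with_pivot(
    points: Sequence[Point],
    pivot_dists: Sequence[float],
    pivot: Point,
    query: Point,
    eps: float,
) -> Tuple[List[Point], int]:
    """Return the points within eps of query and the number of distances computed.

    pivot_dists[i] is the precomputed distance from pivot to points[i].
    By the triangle inequality, |d(pivot, z) - d(pivot, query)| <= d(query, z),
    so a point whose lower bound already exceeds eps can be skipped.
    """
    d_pivot_query = euclidean(pivot, query)
    hits: List[Point] = []
    computed = 0
    for z, d_pivot_z in zip(points, pivot_dists):
        if abs(d_pivot_z - d_pivot_query) > eps:
            continue  # cannot be within eps; no distance calculation needed
        computed += 1
        if euclidean(query, z) <= eps:
            hits.append(z)
    return hits, computed


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
pivot = (0.0, 0.0)
pivot_dists = [euclidean(pivot, z) for z in points]  # computed once, reused per query

hits, computed = range_query_with_pivot(points, pivot_dists, pivot, (9.0, 0.0), 1.5)
print(hits, computed)
```

The pruning is only valid for proximity measures that satisfy the triangle inequality, which is the reason that property matters when efficiency is a concern.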
However, if common practice or practical restrictions do not dictate a choice,
then the proper choice of a proximity measure can be a time-consuming task
that requires careful consideration of both domain knowledge and the
purpose for which the measure is being used. A number of different similarity
measures may need to be evaluated to see which ones produce results that
make the most sense.
2.5 Bibliographic Notes
It is essential to understand the nature of the data that is being analyzed, and
at a fundamental level, this is the subject of measurement theory. In
particular, one of the initial motivations for defining types of attributes was to
be precise about which statistical operations were valid for what sorts of data.
We have presented the view of measurement theory that was initially
described in a classic paper by S. S. Stevens [112]. (Tables 2.2 and 2.3 are
derived from those presented by Stevens [113].) While this is the most
common view and is reasonably easy to understand and apply, there is, of
course, much more to measurement theory. An authoritative discussion can
be found in a three-volume series on the foundations of measurement theory
[88, 94, 114]. Also of interest is a wide-ranging article by Hand [77], which
discusses measurement theory and statistics, and is accompanied by
comments from other researchers in the field. Numerous critiques and
extensions of the approach of Stevens have been made [66, 97, 117]. Finally,
many books and articles describe measurement issues for particular areas of
science and engineering.
Data quality is a broad subject that spans every discipline that uses data.
Discussions of precision, bias, accuracy, and significant figures can be found
in many introductory science, engineering, and statistics textbooks. The view
of data quality as “fitness for use” is explained in more detail in the book by
Redman [103]. Those interested in data quality may also be interested in
MIT’s Information Quality (MITIQ) Program [95, 118]. However, the
knowledge needed to deal with specific data quality issues in a particular
domain is often best obtained by investigating the data quality practices of
researchers in that field.
Aggregation is a less well-defined subject than many other preprocessing
tasks. However, aggregation is one of the main techniques used by the
database area of Online Analytical Processing (OLAP) [68, 76, 102]. There
has also been relevant work in the area of symbolic data analysis (Bock and
Diday [64]). One of the goals in this area is to summarize traditional record
data in terms of symbolic data objects whose attributes are more complex
than traditional attributes. Specifically, these attributes can have values that
are sets of values (categories), intervals, or sets of values with weights
(histograms). Another goal of symbolic data analysis is to be able to perform
clustering, classification, and other kinds of data analysis on data that consists
of symbolic data objects.
Sampling is a subject that has been well studied in statistics and related
fields. Many introductory statistics books, such as the one by Lindgren [90],
have some discussion about sampling, and entire books are devoted to the
subject, such as the classic text by Cochran [67]. A survey of sampling for
data mining is provided by Gu and Liu [74], while a survey of sampling for
databases is provided by Olken and Rotem [98]. There are a number of other
data mining and database-related sampling references that may be of interest,
including papers by Palmer and Faloutsos [100], Provost et al. [101],
Toivonen [115], and Zaki et al. [119].
In statistics, the traditional techniques that have been used for dimensionality
reduction are multidimensional scaling (MDS) (Borg and Groenen [65],
Kruskal and Uslaner [89]) and principal component analysis (PCA) (Jolliffe
[80]), which is similar to singular value decomposition (SVD) (Demmel
[70]). Dimensionality reduction is discussed in more detail in Appendix B.
Discretization is a topic that has been extensively investigated in data mining.
Some classification algorithms work only with categorical data, and
association analysis requires binary data, and thus, there is a significant
motivation to investigate how to best binarize or discretize continuous
attributes. For association analysis, we refer the reader to work by Srikant
and Agrawal [111], while some useful references for discretization in the area
of classification include work by Dougherty et al. [71], Elomaa and Rousu
[72], Fayyad and Irani [73], and Hussain et al. [78].
Feature selection is another topic well investigated in data mining. A broad
coverage of this topic is provided in a survey by Molina et al. [96] and two
books by Liu and Motoda [91, 92]. Other useful papers include those by
Blum and Langley [63], Kohavi and John [87], and Liu et al. [93].
It is difficult to provide references for the subject of feature transformations
because practices vary from one discipline to another. Many statistics books
have a discussion of transformations, but typically the discussion is restricted
to a particular purpose, such as ensuring the normality of a variable or
making sure that variables have equal variance. We offer two references:
Osborne [99] and Tukey [116].
While we have covered some of the most commonly used distance and
similarity measures, there are hundreds of such measures and more are being
created all the time. As with so many other topics in this chapter, many of
these measures are specific to particular fields, e.g., in the area of time series
see papers by Kalpakis et al. [81] and Keogh and Pazzani [83]. Clustering
books provide the best general discussions. In particular, see the books by
Anderberg [62], Jain and Dubes [79], Kaufman and Rousseeuw [82], and
Sneath and Sokal [109].
Information-based measures of similarity have become more popular lately
despite the computational difficulties and expense of calculating them. A
good introduction to information theory is provided by Cover and Thomas
[69]. Computing the mutual information for continuous variables can be
straightforward if they follow a well-known distribution, such as Gaussian.
However, this is often not the case, and many techniques have been
developed. As one example, the article by Khan, et al. [85] compares various
methods in the context of comparing short time series. See also the
information and mutual information packages for R and Matlab. Mutual
information has been the subject of considerable recent attention due to papers
by Reshef et al. [104, 105] that introduced an alternative measure, albeit one
based on mutual information, which was claimed to have superior properties.
Although this approach had some early support, e.g., [110], others have
pointed out various limitations [75, 86, 108].
Two popular books on the topic of kernel methods are [106] and [107]. The
latter also has a website with links to kernel-related materials [84]. In
addition, many current data mining, machine learning, and statistical learning
textbooks have some material about kernel methods. Further references for
kernel methods in the context of support vector machine classifiers are
provided in the Bibliographic Notes of Section 4.9.4.
[62] M. R. Anderberg. Cluster Analysis for Applications. Academic
Press, New York, December 1973.
[63] A. Blum and P. Langley. Selection of Relevant Features and
Examples in Machine Learning. Artificial Intelligence, 97(1–2):245–
271, 1997.
[64] H. H. Bock and E. Diday. Analysis of Symbolic Data: Exploratory
Methods for Extracting Statistical Information from Complex Data
(Studies in Classification, Data Analysis, and Knowledge Organization).
Springer-Verlag Telos, January 2000.
[65] I. Borg and P. Groenen. Modern Multidimensional Scaling—Theory
and Applications. Springer-Verlag, February 1997.
[66] N. R. Chrisman. Rethinking levels of measurement for cartography.
Cartography and Geographic Information Systems, 25(4):231–242,
[67] W. G. Cochran. Sampling Techniques. John Wiley & Sons, 3rd
edition, July 1977.
[68] E. F. Codd, S. B. Codd, and C. T. Smalley. Providing OLAP (Online
Analytical Processing) to User-Analysts: An IT Mandate. White
Paper, E.F. Codd and Associates, 1993.
[69] T. M. Cover and J. A. Thomas. Elements of information theory.
John Wiley & Sons, 2012.
[70] J. W. Demmel. Applied Numerical Linear Algebra. Society for
Industrial & Applied Mathematics, September 1997.
[71] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and
Unsupervised Discretization of Continuous Features. In Proc. of the
12th Intl. Conf. on Machine Learning, pages 194–202, 1995.
[72] T. Elomaa and J. Rousu. General and Efficient Multisplitting of
Numerical Attributes. Machine Learning, 36(3):201–244, 1999.
[73] U. M. Fayyad and K. B. Irani. Multi-interval discretization of
continuous-valued attributes for classification learning. In Proc. 13th Int.
Joint Conf. on Artificial Intelligence, pages 1022–1027. Morgan
Kaufman, 1993.
[74] F. H. Gaohua Gu and H. Liu. Sampling and Its Application in Data
Mining: A Survey. Technical Report TRA6/00, National University of
Singapore, Singapore, 2000.
[75] M. Gorfine, R. Heller, and Y. Heller. Comment on Detecting novel
associations in large data sets. Unpublished (available at http://emotion.
technion. ac. il/ gorfinm/files/science6. pdf on 11 Nov. 2012), 2012.
[76] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M.
Venkatrao, F. Pellow, and H. Pirahesh. Data Cube: A Relational
Aggregation Operator Generalizing Group-By, Cross-Tab, and SubTotals. Journal Data Mining and Knowledge Discovery, 1(1): 29–53,
[77] D. J. Hand. Statistics and the Theory of Measurement. Journal of
the Royal Statistical Society: Series A (Statistics in Society),
159(3):445–492, 1996.
[78] F. Hussain, H. Liu, C. L. Tan, and M. Dash. TRC6/99:
Discretization: an enabling technique. Technical report, National
University of Singapore, Singapore, 1999.
[79] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data.
Prentice Hall Advanced Reference Series. Prentice Hall, March 1988.
[80] I. T. Jolliffe. Principal Component Analysis. Springer Verlag, 2nd
edition, October 2002.
[81] K. Kalpakis, D. Gada, and V. Puttagunta. Distance Measures for
Effective Clustering of ARIMA Time-Series. In Proc. of the 2001 IEEE
Intl. Conf. on Data Mining, pages 273–280. IEEE Computer Society,
[82] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An
Introduction to Cluster Analysis. Wiley Series in Probability and
Statistics. John Wiley and Sons, New York, November 1990.
[83] E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping
for datamining applications. In KDD, pages 285–289, 2000.
[84] Kernel Methods for Pattern Analysis Website. http://www.kernelmethods.net/, 2014.
[85] S. Khan, S. Bandyopadhyay, A. R. Ganguly, S. Saigal, D. J.
Erickson III, V. Protopopescu, and G. Ostrouchov. Relative performance
of mutual information estimation methods for quantifying the
dependence among short and noisy data. Physical Review E,
76(2):026209, 2007.
[86] J. B. Kinney and G. S. Atwal. Equitability, mutual information, and
the maximal information coefficient. Proceedings of the National
Academy of Sciences, 2014.
[87] R. Kohavi and G. H. John. Wrappers for Feature Subset Selection.
Artificial Intelligence, 97(1–2):273–324, 1997.
[88] D. Krantz, R. D. Luce, P. Suppes, and A. Tversky. Foundations of
Measurements: Volume 1: Additive and polynomial representations.
Academic Press, New York, 1971.
[89] J. B. Kruskal and E. M. Uslaner. Multidimensional Scaling. Sage
Publications, August 1978.
[90] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.
[91] H. Liu and H. Motoda, editors. Feature Extraction, Construction
and Selection: A Data Mining Perspective. Kluwer International Series
in Engineering and Computer Science, 453. Kluwer Academic
Publishers, July 1998.
[92] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery
and Data Mining. Kluwer International Series in Engineering and
Computer Science, 454. Kluwer Academic Publishers, July 1998.
[93] H. Liu, H. Motoda, and L. Yu. Feature Extraction, Selection, and
Construction. In N. Ye, editor, The Handbook of Data Mining, pages
22–41. Lawrence Erlbaum Associates, Inc., Mahwah, NJ, 2003.
[94] R. D. Luce, D. Krantz, P. Suppes, and A. Tversky. Foundations of
Measurements: Volume 3: Representation, Axiomatization, and
Invariance. Academic Press, New York, 1990.
[95] MIT Information Quality (MITIQ) Program. http://mitiq.mit.edu/,
[96] L. C. Molina, L. Belanche, and A. Nebot. Feature Selection
Algorithms: A Survey and Experimental Evaluation. In Proc. of the
2002 IEEE Intl. Conf. on Data Mining, 2002.
[97] F. Mosteller and J. W. Tukey. Data analysis and regression: a
second course in statistics. Addison-Wesley, 1977.
[98] F. Olken and D. Rotem. Random Sampling from Databases—A
Survey. Statistics & Computing, 5(1):25–42, March 1995.
[99] J. Osborne. Notes on the Use of Data Transformations. Practical
Assessment, Research & Evaluation, 28(6), 2002.
[100] C. R. Palmer and C. Faloutsos. Density biased sampling: An
improved method for data mining and clustering. ACM SIGMOD
Record, 29(2):82–92, 2000.
[101] F. J. Provost, D. Jensen, and T. Oates. Efficient Progressive
Sampling. In Proc. of the 5th Intl. Conf. on Knowledge Discovery and
Data Mining, pages 23–32, 1999.
[102] R. Ramakrishnan and J. Gehrke. Database Management Systems.
McGraw-Hill, 3rd edition, August 2002.
[103] T. C. Redman. Data Quality: The Field Guide. Digital Press,
January 2001.
[104] D. Reshef, Y. Reshef, M. Mitzenmacher, and P. Sabeti.
Equitability analysis of the maximal information coefficient, with
comparisons. arXiv preprint arXiv:1301.6314, 2013.
[105] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G.
McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C.
Sabeti. Detecting novel associations in large data sets. Science,
334(6062):1518–1524, 2011.
[106] B. Schölkopf and A. J. Smola. Learning with kernels: support
vector machines, regularization, optimization, and beyond. MIT press,
[107] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern
analysis. Cambridge university press, 2004.
[108] N. Simon and R. Tibshirani. Comment on “Detecting Novel
Associations In Large Data Sets” by Reshef Et Al, Science Dec 16,
2011. arXiv preprint arXiv:1401.7645, 2014.
[109] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman,
San Francisco, 1971.
[110] T. Speed. A correlation for the 21st century. Science,
334(6062):1502–1503, 2011.
[111] R. Srikant and R. Agrawal. Mining Quantitative Association Rules
in Large Relational Tables. In Proc. of 1996 ACM-SIGMOD Intl. Conf.
on Management of Data, pages 1–12, Montreal, Quebec, Canada,
August 1996.
[112] S. S. Stevens. On the Theory of Scales of Measurement. Science,
103(2684):677–680, June 1946.
[113] S. S. Stevens. Measurement. In G. M. Maranell, editor, Scaling: A
Sourcebook for Behavioral Scientists, pages 22–41. Aldine Publishing
Co., Chicago, 1974.
[114] P. Suppes, D. Krantz, R. D. Luce, and A. Tversky. Foundations of
Measurements: Volume 2: Geometrical, Threshold, and Probabilistic
Representations. Academic Press, New York, 1989.
[115] H. Toivonen. Sampling Large Databases for Association Rules. In
VLDB96, pages 134–145. Morgan Kaufman, September 1996.
[116] J. W. Tukey. On the Comparative Anatomy of Transformations.
Annals of Mathematical Statistics, 28(3):602–632, September 1957.
[117] P. F. Velleman and L. Wilkinson. Nominal, ordinal, interval, and
ratio typologies are misleading. The American Statistician, 47(1):65–72,
[118] R. Y. Wang, M. Ziad, Y. W. Lee, and Y. R. Wang. Data Quality.
The Kluwer International Series on Advances in Database Systems,
Volume 23. Kluwer Academic Publishers, January 2001.
[119] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of
Sampling for Data Mining of Association Rules. Technical Report
TR617, Rensselaer Polytechnic Institute, 1996.
2.6 Exercises
1. In the initial example of Chapter 2, the statistician says, “Yes, fields 2
and 3 are basically the same.” Can you tell from the three lines of
sample data that are shown why she says that?
2. Classify the following attributes as binary, discrete, or continuous.
Also classify them as qualitative (nominal or ordinal) or quantitative
(interval or ratio). Some cases may have more than one interpretation, so
briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
1. Time in terms of AM or PM.
2. Brightness as measured by a light meter.
3. Brightness as measured by people’s judgments.
4. Angles as measured in degrees between 0 and 360.
5. Bronze, Silver, and Gold medals as awarded at the Olympics.
6. Height above sea level.
7. Number of patients in a hospital.
8. ISBN numbers for books. (Look up the format on the Web.)
9. Ability to pass light in terms of the following values: opaque,
translucent, transparent.
10. Military rank.
11. Distance from the center of campus.
12. Density of a substance in grams per cubic centimeter.
13. Coat check number. (When you attend an event, you can often give
your coat to someone who, in turn, gives you a number that you can
use to claim your coat when you leave.)
3. You are approached by the marketing director of a local company,
who believes that he has devised a foolproof way to measure customer
satisfaction. He explains his scheme as follows: “It’s so simple that I
can’t believe that no one has thought of it before. I just keep track of the
number of customer complaints for each product. I read in a data mining
book that counts are ratio attributes, and so, my measure of product
satisfaction must be a ratio attribute. But when I rated the products based
on my new customer satisfaction measure and showed them to my boss,
he told me that I had overlooked the obvious, and that my measure was
worthless. I think that he was just mad because our bestselling product
had the worst satisfaction since it had the most complaints. Could you
help me set him straight?”
1. Who is right, the marketing director or his boss? If you answered,
his boss, what would you do to fix the measure of satisfaction?
2. What can you say about the attribute type of the original product
satisfaction attribute?
4. A few months later, you are again approached by the same marketing
director as in Exercise 3. This time, he has devised a better approach to
measure the extent to which a customer prefers one product over other
similar products. He explains, “When we develop new products, we
typically create several variations and evaluate which one customers
prefer. Our standard procedure is to give our test subjects all of the
product variations at one time and then ask them to rank the product
variations in order of preference. However, our test subjects are very
indecisive, especially when there are more than two products. As a
result, testing takes forever. I suggested that we perform the
comparisons in pairs and then use these comparisons to get the rankings.
Thus, if we have three product variations, we have the customers
compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our
testing time with my new procedure is a third of what it was for the old
procedure, but the employees conducting the tests complain that they
cannot come up with a consistent ranking from the results. And my boss
wants the latest product evaluations, yesterday. I should also mention
that he was the person who came up with the old product evaluation
approach. Can you help me?”
1. Is the marketing director in trouble? Will his approach work for
generating an ordinal ranking of the product variations in terms of
customer preference? Explain.
2. Is there a way to fix the marketing director’s approach? More
generally, what can you say about trying to create an ordinal
measurement scale based on pairwise comparisons?
3. For the original product evaluation scheme, the overall rankings of
each product variation are found by computing its average over all
test subjects. Comment on whether you think that this is a
reasonable approach. What other approaches might you take?
5. Can you think of a situation in which identification numbers would be
useful for prediction?
6. An educational psychologist wants to use association analysis to
analyze test results. The test consists of 100 questions with four possible
answers each.
1. How would you convert this data into a form suitable for
association analysis?
2. In particular, what type of attributes would you have and how many
of them are there?
7. Which of the following quantities is likely to show more temporal
autocorrelation: daily rainfall or daily temperature? Why?
8. Discuss why a document-term matrix is an example of a data set that
has asymmetric discrete or asymmetric continuous features.
9. Many sciences rely on observation instead of (or in addition to)
designed experiments. Compare the data quality issues involved in
observational science with those of experimental science and data mining.
10. Discuss the difference between the precision of a measurement and
the terms single and double precision, as they are used in computer
science, typically to represent floating-point numbers that require 32 and
64 bits, respectively.
11. Give at least two advantages to working with data stored in text files
instead of in a binary format.
12. Distinguish between noise and outliers. Be sure to consider the
following questions.
1. Is noise ever interesting or desirable? Outliers?
2. Can noise objects be outliers?
3. Are noise objects always outliers?
4. Are outliers always noise objects?
5. Can noise make a typical value into an unusual one, or vice versa?
Algorithm 2.3 Algorithm for finding k-nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order.
     (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first k distances of the sorted list
5: end for
13. Consider the problem of finding the k-nearest neighbors of a data
object. A programmer designs Algorithm 2.3 for this task.
1. Describe the potential problems with this algorithm if there are
duplicate objects in the data set. Assume the distance function will
return a distance of 0 only for objects that are the same.
2. How would you fix this problem?
14. The following attributes are measured for members of a herd of
Asian elephants: weight, height, tusk length, trunk length, and ear area.
Based on these measurements, what sort of proximity measure from
Section 2.4 would you use to compare or group these elephants? Justify
your answer and explain any special circumstances.
15. You are given a set of m objects that is divided into k groups, where
the ith group is of size mi. If the goal is to obtain a sample of size n<m,
what is the difference between the following two sampling schemes?
(Assume sampling with replacement.)
1. We randomly select n×mi/m elements from each group.
2. We randomly select n elements from the data set, without regard
for the group to which an object belongs.
16. Consider a document-term matrix, where $tf_{ij}$ is the frequency of the
ith word (term) in the jth document and m is the number of documents.
Consider the variable transformation that is defined by
\[ tf'_{ij} = tf_{ij} \times \log \frac{m}{df_i}, \qquad (2.31) \]
where $df_i$ is the number of documents in which the ith term appears,
which is known as the document frequency of the term. This
transformation is known as the inverse document frequency transformation.
1. What is the effect of this transformation if a term occurs in one
document? In every document?
2. What might be the purpose of this transformation?
17. Assume that we apply a square root transformation to a ratio
attribute x to obtain the new attribute x*. As part of your analysis, you
identify an interval (a, b) in which x* has a linear relationship to another
attribute y.
1. What is the corresponding interval (A, B) in terms of x ?
2. Give an equation that relates y to x.
18. This exercise compares and contrasts some similarity and distance
measures.
1. For binary data, the L1 distance corresponds to the Hamming
distance; that is, the number of bits that are different between two
binary vectors. The Jaccard similarity is a measure of the similarity
between two binary vectors. Compute the Hamming distance and
the Jaccard similarity between the following two binary vectors.
2. Which approach, Jaccard or Hamming distance, is more similar to
the Simple Matching Coefficient, and which approach is more
similar to the cosine measure? Explain. (Note: The Hamming
measure is a distance, while the other three measures are
similarities, but don’t let this confuse you.)
3. Suppose that you are comparing how similar two organisms of
different species are in terms of the number of genes they share.
Describe which measure, Hamming or Jaccard, you think would be
more appropriate for comparing the genetic makeup of two
organisms. Explain. (Assume that each animal is represented as a
binary vector, where each attribute is 1 if a particular gene is
present in the organism and 0 otherwise.)
4. If you wanted to compare the genetic makeup of two organisms of
the same species, e.g., two human beings, would you use the
Hamming distance, the Jaccard coefficient, or a different measure
of similarity or distance? Explain. (Note that two human beings
share >99.9% of the same genes.)
19. For the following vectors, x and y, calculate the indicated similarity
or distance measures.
1. x=(1, 1, 1, 1), y=(2, 2, 2, 2) cosine, correlation, Euclidean
2. x=(0, 1, 0, 1), y=(1, 0, 1, 0) cosine, correlation, Euclidean, Jaccard
3. x=(0, −1, 0, 1), y=(1, 0, −1, 0) cosine, correlation, Euclidean
4. x=(1, 1, 0, 1, 0, 1), y=(1, 1, 1, 0, 0, 1) cosine, correlation, Jaccard
5. x=(2, −1, 0, 2, 0, −3), y=( −1, 1, −1, 0, 0, −1) cosine, correlation
20. Here, we further explore the cosine and correlation measures.
1. What is the range of values possible for the cosine measure?
2. If two objects have a cosine measure of 1, are they identical?
3. What is the relationship of the cosine measure to correlation, if
any? (Hint: Look at statistical measures such as mean and standard
deviation in cases where cosine and correlation are the same and
different.)
4. Figure 2.22(a) shows the relationship of the cosine measure to
Euclidean distance for 100,000 randomly generated points that
have been normalized to have an L2 length of 1. What general
observation can you make about the relationship between
Euclidean distance and cosine similarity when vectors have an L2
norm of 1?
Figure 2.22.
Graphs for Exercise 20.
5. Figure 2.22(b) shows the relationship of correlation to Euclidean
distance for 100,000 randomly generated points that have been
standardized to have a mean of 0 and a standard deviation of 1.
What general observation can you make about the relationship
between Euclidean distance and correlation when the vectors have
been standardized to have a mean of 0 and a standard deviation of 1?
6. Derive the mathematical relationship between cosine similarity and
Euclidean distance when each data object has an L2 length of 1.
7. Derive the mathematical relationship between correlation and
Euclidean distance when each data point has been
standardized by subtracting its mean and dividing by its standard
deviation.
21. Show that the set difference metric given by
d(A, B)=size(A−B)+size(B−A) (2.32)
satisfies the metric axioms given on page 77. A and B are sets and A−B
is the set difference.
22. Discuss how you might map correlation values from the interval
[−1, 1] to the interval [0, 1]. Note that the type of transformation that
you use might depend on the application that you have in mind. Thus,
consider two applications: clustering time series and predicting the
behavior of one time series given another.
23. Given a similarity measure with values in the interval [0, 1], describe
two ways to transform this similarity value into a dissimilarity value in
the interval [0, ∞].
24. Proximity is typically defined between a pair of objects.
1. Define two ways in which you might define the proximity among a
group of objects.
2. How might you define the distance between two sets of points in
Euclidean space?
3. How might you define the proximity between two sets of data
objects? (Make no assumption about the data objects, except that a
proximity measure is defined between any pair of objects.)
25. You are given a set of points S in Euclidean space, as well as the
distance of each point in S to a point x. (It does not matter if x∈S.)
1. If the goal is to find all points within a specified distance ε of point
y, y≠x, explain how you could use the triangle inequality and the
already calculated distances to x to potentially reduce the number
of distance calculations necessary. Hint: The triangle inequality,
d(x, z)≤d(x, y)+d(y, z), can be rewritten as d(x, y)≥d(x, z)−d(y, z).
2. In general, how would the distance between x and y affect the
number of distance calculations?
3. Suppose that you can find a small subset of points S′, from the
original data set, such that every point in the data set is within a
specified distance ε of at least one of the points in S′, and that you
also have the pairwise distance matrix for S′. Describe a technique
that uses this information to compute, with a minimum of distance
calculations, the set of all points within a distance of β of a
specified point from the data set.
26. Show that 1 minus the Jaccard similarity is a distance measure
between two data objects, x and y, that satisfies the metric axioms given
on page 77. Specifically, d(x, y)=1−J(x, y).
27. Show that the distance measure defined as the angle between two
data vectors, x and y, satisfies the metric axioms given on page 77.
Specifically, d(x, y)=arccos(cos(x, y)).
28. Explain why computing the proximity between two attributes is
often simpler than computing the similarity between two objects.
3 Classification: Basic Concepts and
Humans have an innate ability to classify things into categories, e.g.,
mundane tasks such as filtering spam email messages or more specialized
tasks such as recognizing celestial objects in telescope images (see Figure
3.1). While manual classification often suffices for small and simple data sets
with only a few attributes, larger and more complex data sets require an
automated solution.
Figure 3.1.
Classification of galaxies from telescope images taken from the
NASA website.
This chapter introduces the basic concepts of classification and describes
some of its key issues such as model overfitting, model selection, and model
evaluation. While these topics are illustrated using a classification technique
known as decision tree induction, most of the discussion in this chapter is
also applicable to other classification techniques, many of which are covered
in Chapter 4.
3.1 Basic Concepts
Figure 3.2 illustrates the general idea behind classification. The data for a
classification task consists of a collection of instances (records). Each such
instance is characterized by the tuple (x, y), where x is the set of attribute
values that describe the instance and y is the class label of the instance. The
attribute set x can contain attributes of any type, while the class label y must
be categorical.
Figure 3.2.
A schematic illustration of a classification task.
A classification model is an abstract representation of the relationship
between the attribute set and the class label. As will be seen in the next two
chapters, the model can be represented in many ways, e.g., as a tree, a
probability table, or simply, a vector of real-valued parameters. More
formally, we can express it mathematically as a target function f that takes as
input the attribute set x and produces an output corresponding to the predicted
class label. The model is said to classify an instance (x, y) correctly if f(x)=y.
Table 3.1 shows examples of attribute sets and class labels for various
classification tasks. Spam filtering and tumor identification are examples of
binary classification problems, in which each data instance can be
categorized into one of two classes. If the number of classes is larger than 2,
as in the galaxy classification example, then it is called a multiclass
classification problem.
Table 3.1. Examples of classification tasks.
Task | Attribute set | Class label
Spam filtering | Features extracted from email message header and content | spam or non-spam
Tumor identification | Features extracted from magnetic resonance imaging (MRI) scans | malignant or benign
Galaxy classification | Features extracted from telescope images | elliptical, spiral, or irregular-shaped
We illustrate the basic concepts of classification in this chapter with the
following two examples.
Example 3.1. Vertebrate Classification
Table 3.2 shows a sample data set for classifying vertebrates into mammals,
reptiles, birds, fishes, and amphibians. The attribute set includes
characteristics of the vertebrate such as its body temperature, skin cover, and
ability to fly. The data set can also be used for a binary classification task
such as mammal classification, by grouping the reptiles, birds, fishes, and
amphibians into a single category called non-mammals.
Table 3.2. A sample data set for the vertebrate classification problem.
Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates
human | warm-blooded | hair | yes | no | no | yes | no
python | cold-blooded | scales | no | no | no | no | yes
salmon | cold-blooded | scales | no | yes | no | no | no
whale | warm-blooded | hair | yes | yes | no | no | no
frog | cold-blooded | none | no | semi | no | yes | yes
komodo dragon | cold-blooded | scales | no | no | no | yes | no
bat | warm-blooded | hair | yes | no | yes | yes | yes
pigeon | warm-blooded | feathers | no | no | yes | yes | no
cat | warm-blooded | fur | yes | no | no | yes | no
leopard shark | cold-blooded | scales | yes | yes | no | no | no
turtle | cold-blooded | scales | no | semi | no | yes | no
penguin | warm-blooded | feathers | no | semi | no | yes | no
porcupine | warm-blooded | quills | yes | no | no | yes | yes
eel | cold-blooded | scales | no | yes | no | no | no
salamander | cold-blooded | none | no | semi | no | yes | yes
Example 3.2. Loan Borrower Classification
Consider the problem of predicting whether a loan borrower will repay the
loan or default on the loan payments. The data set used to build the
classification model is shown in Table 3.3. The attribute set includes personal
information of the borrower such as marital status and annual income, while
the class label indicates whether the borrower had defaulted on the loan
payments.
Table 3.3. A sample data set for the loan borrower classification problem.
ID | Home Owner | Marital Status | Annual Income | Defaulted?
1 | Yes | Single | 125000 | No
2 | No | Married | 100000 | No
3 | No | Single | 70000 | No
4 | Yes | Married | 120000 | No
5 | No | Divorced | 95000 | Yes
6 | No | Single | 60000 | No
7 | Yes | Divorced | 220000 | No
8 | No | Single | 85000 | Yes
9 | No | Married | 75000 | No
10 | No | Single | 90000 | Yes
A classification model serves two important roles in data mining. First, it is
used as a predictive model to classify previously unlabeled instances. A
good classification model must provide accurate predictions with a fast
response time. Second, it serves as a descriptive model to identify the
characteristics that distinguish instances from different classes. This is
particularly useful for critical applications, such as medical diagnosis, where
it is insufficient to have a model that makes a prediction without justifying
how it reaches such a decision.
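For example, a classification model induced from the vertebrate data set
shown in Table 3.2 can be used to predict the class label of the following
vertebrate:
Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates
cold-blooded | scales | no | no | no | yes | yes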
In addition, it can be used as a descriptive model to help determine
characteristics that define a vertebrate as a mammal, a reptile, a bird, a fish,
or an amphibian. For example, the model may identify mammals as
warm-blooded vertebrates that give birth to their young.
There are several points worth noting regarding the previous example. First,
although all the attributes shown in Table 3.2 are qualitative, there are no
restrictions on the type of attributes that can be used as predictor variables.
The class label, on the other hand, must be of nominal type. This
distinguishes classification from other predictive modeling tasks such as
regression, where the predicted value is often quantitative. More information
about regression can be found in Appendix D.
Another point worth noting is that not all attributes may be relevant to the
classification task. For example, the average length or weight of a vertebrate
may not be useful for classifying mammals, as these attributes can show the same
value for both mammals and non-mammals. Such an attribute is typically
discarded during preprocessing. The remaining attributes might not be able to
distinguish the classes by themselves, and thus, must be used in concert with
other attributes. For instance, the Body Temperature attribute is insufficient
to distinguish mammals from other vertebrates. When it is used together with
Gives Birth, the classification of mammals improves significantly. However,
when additional attributes, such as Skin Cover, are included, the model
becomes overly specific and no longer covers all mammals. Finding the
optimal combination of attributes that best discriminates instances from
different classes is the key challenge in building classification models.
3.2 General Framework for Classification
Classification is the task of assigning labels to unlabeled data instances and a
classifier is used to perform such a task. A classifier is typically described in
terms of a model as illustrated in the previous section. The model is created
using a given set of instances, known as the training set, which contains
attribute values as well as class labels for each instance. The systematic
approach for learning a classification model given a training set is known as a
learning algorithm. The process of using a learning algorithm to build a
classification model from the training data is known as induction. This
process is also often described as “learning a model” or “building a model.”
The process of applying a classification model on unseen test instances to
predict their class labels is known as deduction. Thus, the process of
classification involves two steps: applying a learning algorithm to training
data to learn a model, and then applying the model to assign labels to
unlabeled instances. Figure 3.3 illustrates the general framework for
classification.
Figure 3.3.
General framework for building a classification model.
A classification technique refers to a general approach to classification, e.g.,
the decision tree technique that we will study in this chapter. This
classification technique, like most others, consists of a family of related
models and a number of algorithms for learning these models. In Chapter 4,
we will study additional classification techniques, including neural networks
and support vector machines.
A couple of notes on terminology. First, the terms “classifier” and “model” are
often taken to be synonymous. If a classification technique builds a single,
global model, then this is fine. However, while every model defines a
classifier, not every classifier is defined by a single model. Some classifiers,
such as k-nearest neighbor classifiers, do not build an explicit model (Section
4.3), while other classifiers, such as ensemble classifiers, combine the output
of a collection of models (Section 4.10). Second, the term “classifier” is often
used in a more general sense to refer to a classification technique. Thus, for
example, “decision tree classifier” can refer to the decision tree classification
technique or a specific classifier built using that technique. Fortunately, the
meaning of “classifier” is usually clear from the context.
In the general framework shown in Figure 3.3, the induction and deduction
steps should be performed separately. In fact, as will be discussed later in
Section 3.6, the training and test sets should be independent of each other to
ensure that the induced model can accurately predict the class labels of
instances it has never encountered before. Models that deliver such predictive
insights are said to have good generalization performance. The
performance of a model (classifier) can be evaluated by comparing the
predicted labels against the true labels of instances. This information can be
summarized in a table called a confusion matrix. Table 3.4 depicts the
confusion matrix for a binary classification problem. Each entry f_{ij} denotes
the number of instances from class i predicted to be of class j. For example,
f_{01} is the number of instances from class 0 incorrectly predicted as class 1.
The number of correct predictions made by the model is (f_{11}+f_{00}) and the
number of incorrect predictions is (f_{10}+f_{01}).
Table 3.4. Confusion matrix for a binary classification problem.
 | Predicted Class=1 | Predicted Class=0
Actual Class=1 | f_{11} | f_{10}
Actual Class=0 | f_{01} | f_{00}
Although a confusion matrix provides the information needed to determine
how well a classification model performs, summarizing this information into
a single number makes it more convenient to compare the relative
performance of different models. This can be done using an evaluation
metric such as accuracy, which is computed in the following way:
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}. (3.1)
For binary classification problems, the accuracy of a model is given by
\text{Accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}. (3.2)
Error rate is another related metric, which is defined as follows for binary
classification problems:
\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}. (3.3)
The learning algorithms of most classification techniques are designed to
learn models that attain the highest accuracy, or equivalently, the lowest error
rate when applied to the test set. We will revisit the topic of model evaluation
in Section 3.6.
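The accuracy and error rate computations can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the example label vectors are hypothetical, and class labels are assumed to be encoded as 0 and 1.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return f[(i, j)]: number of instances of actual class i predicted as class j."""
    counts = Counter(zip(actual, predicted))
    return {(i, j): counts.get((i, j), 0) for i in (0, 1) for j in (0, 1)}

def accuracy(f):
    """Fraction of correct predictions, (f11 + f00) / total."""
    return (f[(1, 1)] + f[(0, 0)]) / sum(f.values())

def error_rate(f):
    """Fraction of wrong predictions, (f10 + f01) / total."""
    return (f[(1, 0)] + f[(0, 1)]) / sum(f.values())

# Hypothetical true and predicted labels, for illustration only
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_counts(actual, predicted)
print(f)
print("accuracy:", accuracy(f), "error rate:", error_rate(f))
```

Because every prediction is either correct or wrong, the two values always sum to 1.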
3.3 Decision Tree Classifier
This section introduces a simple classification technique known as the
decision tree classifier. To illustrate how a decision tree works, consider the
classification problem of distinguishing mammals from non-mammals using
the vertebrate data set shown in Table 3.2. Suppose a new species is
discovered by scientists. How can we tell whether it is a mammal or a
non-mammal? One approach is to pose a series of questions about the
characteristics of the species. The first question we may ask is whether the
species is cold- or warm-blooded. If it is cold-blooded, then it is definitely
not a mammal. Otherwise, it is either a bird or a mammal. In the latter case,
we need to ask a follow-up question: Do the females of the species give birth
to their young? Those that do give birth are definitely mammals, while those
that do not are likely to be non-mammals (with the exception of egg-laying
mammals such as the platypus and spiny anteater).
The previous example illustrates how we can solve a classification problem
by asking a series of carefully crafted questions about the attributes of the test
instance. Each time we receive an answer, we could ask a follow-up question
until we can conclusively decide on its class label. The series of questions
and their possible answers can be organized into a hierarchical structure
called a decision tree. Figure 3.4 shows an example of the decision tree for
the mammal classification problem. The tree has three types of nodes:
A root node, with no incoming links and zero or more outgoing links.
Internal nodes, each of which has exactly one incoming link and two or
more outgoing links.
Leaf or terminal nodes, each of which has exactly one incoming link
and no outgoing links.
Every leaf node in the decision tree is associated with a class label. The
non-terminal nodes, which include the root and internal nodes, contain attribute
test conditions that are typically defined using a single attribute. Each
possible outcome of the attribute test condition is associated with exactly one
child of this node. For example, the root node of the tree shown in Figure 3.4
uses the attribute Body Temperature to define an attribute test condition that
has two outcomes, warm and cold, resulting in two child nodes.
Figure 3.4.
A decision tree for the mammal classification problem.
Given a decision tree, classifying a test instance is straightforward. Starting
from the root node, we apply its attribute test condition and follow the
appropriate branch based on the outcome of the test. This will lead us either
to another internal node, for which a new attribute test condition is applied, or
to a leaf node. Once a leaf node is reached, we assign the class label
associated with the node to the test instance. As an illustration, Figure 3.5
traces the path used to predict the class label of a flamingo. The path
terminates at a leaf node labeled as Non-mammals.
Figure 3.5.
Classifying an unlabeled vertebrate. The dashed lines represent the
outcomes of applying various attribute test conditions on the
unlabeled vertebrate. The vertebrate is eventually assigned to the
Non-mammals class.
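The traversal from the root to a leaf can be sketched as follows. The nested-dictionary representation and the function name classify are assumptions made for illustration. The tree follows the two questions described in the text (Body Temperature, then Gives Birth), and the flamingo is assumed to be warm-blooded and not to give birth.

```python
# Internal nodes are dicts holding an attribute test condition and one child
# per outcome; leaf nodes are plain class-label strings.
mammal_tree = {
    "attribute": "Body Temperature",
    "children": {
        "cold": "Non-mammals",
        "warm": {
            "attribute": "Gives Birth",
            "children": {"yes": "Mammals", "no": "Non-mammals"},
        },
    },
}

def classify(tree, instance):
    """Follow the branch matching each test outcome until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        outcome = instance[node["attribute"]]
        node = node["children"][outcome]
    return node

# Assumed attribute values for the unlabeled vertebrate
flamingo = {"Body Temperature": "warm", "Gives Birth": "no"}
print(classify(mammal_tree, flamingo))
```

Only the attributes that appear on the path are consulted, so a cold-blooded instance is labeled after a single test.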
3.3.1 A Basic Algorithm to Build a
Decision Tree
There are many possible decision trees that can be constructed from a particular data
set. While some trees are better than others, finding an optimal one is
computationally expensive due to the exponential size of the search space.
Efficient algorithms have been developed to induce a reasonably accurate,
albeit suboptimal, decision tree in a reasonable amount of time. These
algorithms usually employ a greedy strategy to grow the decision tree in a
top-down fashion by making a series of locally optimal decisions about
which attribute to use when partitioning the training data. One of the earliest
methods is Hunt’s algorithm, which is the basis for many current
implementations of decision tree classifiers, including ID3, C4.5, and CART.
This subsection presents Hunt’s algorithm and describes some of the design
issues that must be considered when building a decision tree.
Hunt’s Algorithm
In Hunt’s algorithm, a decision tree is grown in a recursive fashion. The tree
initially contains a single root node that is associated with all the training
instances. If a node is associated with instances from more than one class, it
is expanded using an attribute test condition that is determined using a
splitting criterion. A child leaf node is created for each outcome of the
attribute test condition and the instances associated with the parent node are
distributed to the children based on the test outcomes. This node expansion
step can then be recursively applied to each child node, as long as it has
labels of more than one class. If all the instances associated with a leaf node
have identical class labels, then the node is not expanded any further. Each
leaf node is assigned a class label that occurs most frequently in the training
instances associated with the node.
To illustrate how the algorithm works, consider the training set shown in
Table 3.3 for the loan borrower classification problem. Suppose we apply
Hunt’s algorithm to fit the training data. The tree initially contains only a
single leaf node as shown in Figure 3.6(a). This node is labeled as Defaulted
= No, since the majority of the borrowers did not default on their loan
payments. The training error of this tree is 30% as three out of the ten
training instances have the class label Defaulted = Yes. The leaf node can
therefore be further expanded because it contains training instances from
more than one class.
Figure 3.6.
Hunt’s algorithm for building decision trees.
Let Home Owner be the attribute chosen to split the training instances. The
justification for choosing this attribute as the attribute test condition will be
discussed later. The resulting binary split on the Home Owner attribute is
shown in Figure 3.6(b). All the training instances for which Home Owner =
Yes are propagated to the left child of the root node and the rest are
propagated to the right child. Hunt’s algorithm is then recursively applied to
each child. The left child becomes a leaf node labeled Defaulted = No, since
all instances associated with this node have identical class label
Defaulted = No. The right child has instances from each class label. Hence,
we split it further. The resulting subtrees after recursively expanding the right
child are shown in Figures 3.6(c) and (d).
Hunt’s algorithm, as described above, makes some simplifying assumptions
that are often not true in practice. In the following, we describe these
assumptions and briefly discuss some of the possible ways for handling them.
1. Some of the child nodes created in Hunt’s algorithm can be empty if
none of the training instances have the particular attribute values. One
way to handle this is by declaring each of them as a leaf node with a
class label that occurs most frequently among the training instances
associated with their parent nodes.
2. If all training instances associated with a node have identical attribute
values but different class labels, it is not possible to expand this node
any further. One way to handle this case is to declare it a leaf node and
assign it the class label that occurs most frequently in the training
instances associated with this node.
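The recursive procedure, together with the two special cases just listed, can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it handles only categorical attributes with multiway splits, it uses the lowest weighted Gini index as a placeholder splitting criterion, and all function and variable names are hypothetical. Annual Income is omitted because it is continuous, and the split chosen by this placeholder criterion need not match the one shown in Figure 3.6.

```python
from collections import Counter

def majority(labels):
    """Most frequent class label (ties broken by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, labels, attr):
    """Weighted Gini index of the children of a multiway split on attr."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def hunt(rows, labels, attrs, domains):
    """Grow a tree recursively; leaves are labels, internal nodes are dicts."""
    if len(set(labels)) == 1:            # all instances share one class: stop
        return labels[0]
    identical = all(row[a] == rows[0][a] for row in rows for a in attrs)
    if identical:                        # cannot split further: majority leaf
        return majority(labels)
    best = min(attrs, key=lambda a: weighted_gini(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in domains[best]:
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        if not subset:                   # empty child: parent's majority class
            children[value] = majority(labels)
        else:
            child_rows = [r for r, _ in subset]
            child_labels = [y for _, y in subset]
            children[value] = hunt(child_rows, child_labels, remaining, domains)
    return {"attribute": best, "children": children}

# Categorical attributes and class labels taken from Table 3.3
ATTRS = ["Home Owner", "Marital Status"]
DOMAINS = {"Home Owner": ["Yes", "No"],
           "Marital Status": ["Single", "Married", "Divorced"]}
data = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Single", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
rows = [dict(zip(ATTRS, rec[:2])) for rec in data]
labels = [rec[2] for rec in data]

tree = hunt(rows, labels, ATTRS, DOMAINS)
print(tree)
```

Passing the attribute domains explicitly is a design choice that lets an empty child arise, so that the first special case is actually exercised when a value never occurs at a node.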
Design Issues of Decision Tree
Hunt’s algorithm is a generic procedure for growing decision trees in a greedy
fashion. To implement the algorithm, there are two key design issues that
must be addressed.
1. What is the splitting criterion? At each recursive step, an attribute must
be selected to partition the training instances associated with a node into
smaller subsets associated with its child nodes. The splitting criterion
determines which attribute is chosen as the test condition and how the
training instances should be distributed to the child nodes. This will be
discussed in Sections 3.3.2 and 3.3.3.
2. What is the stopping criterion? The basic algorithm stops expanding a
node only when all the training instances associated with the node have
the same class labels or have identical attribute values. Although these
conditions are sufficient, there are reasons to stop expanding a node
much earlier even if the leaf node contains training instances from more
than one class. This process is called early termination and the condition
used to determine when a node should be stopped from expanding is
called a stopping criterion. The advantages of early termination are
discussed in Section 3.4.
3.3.2 Methods for Expressing
Attribute Test Conditions
Decision tree induction algorithms must provide a method for expressing an
attribute test condition and its corresponding outcomes for different attribute
types.
Binary Attributes
The test condition for a binary attribute generates two potential outcomes, as
shown in Figure 3.7.
Figure 3.7.
Attribute test condition for a binary attribute.
Nominal Attributes
Since a nominal attribute can have many values, its attribute test condition
can be expressed in two ways, as a multiway split or a binary split as shown
in Figure 3.8. For a multiway split (Figure 3.8(a)), the number of outcomes
depends on the number of distinct values for the corresponding attribute. For
example, if an attribute such as marital status has three distinct values—
single, married, or divorced—its test condition will produce a three-way split.
It is also possible to create a binary split by partitioning all values taken by
the nominal attribute into two groups. For example, some decision tree
algorithms, such as CART, produce only binary splits by considering all
2^{k−1}−1 ways of creating a binary partition of k attribute values. Figure 3.8(b)
illustrates three different ways of grouping the attribute values for marital
status into two subsets.
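The count of 2^{k−1}−1 can be checked by enumerating the partitions directly. The sketch below is illustrative only; the function name is hypothetical and the attribute values are assumed to be distinct.

```python
from itertools import combinations

def binary_partitions(values):
    """Enumerate all ways to split distinct values into two non-empty groups."""
    values = list(values)
    first, rest = values[0], values[1:]
    partitions = []
    # Keep the first value in the left group so that a partition and its
    # mirror image are not counted twice; stop before the left group
    # would contain every value.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            partitions.append((left, right))
    return partitions

for left, right in binary_partitions(["Single", "Married", "Divorced"]):
    print(sorted(left), "|", sorted(right))
```

For the three marital status values this yields three groupings, in line with 2^{3−1}−1 = 3.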
Figure 3.8.
Attribute test conditions for nominal attributes.
Ordinal Attributes
Ordinal attributes can also produce binary or multi-way splits. Ordinal
attribute values can be grouped as long as the grouping does not violate the
order property of the attribute values. Figure 3.9 illustrates various ways of
splitting training records based on the Shirt Size attribute. The groupings
shown in Figures 3.9(a) and (b) preserve the order among the attribute values,
whereas the grouping shown in Figure 3.9(c) violates this property because it
combines the attribute values Small and Large into the same partition while
Medium and Extra Large are combined into another partition.
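For an ordinal attribute with k ordered values, the binary groupings that preserve the order correspond to the k−1 cut points between adjacent values. A small sketch, using a hypothetical function name and the Shirt Size values mentioned above:

```python
def ordinal_binary_splits(ordered_values):
    """Return every binary grouping that keeps adjacent values together."""
    return [
        (ordered_values[:i], ordered_values[i:])
        for i in range(1, len(ordered_values))
    ]

sizes = ["Small", "Medium", "Large", "Extra Large"]
for left, right in ordinal_binary_splits(sizes):
    print(left, "|", right)
```

A grouping such as {Small, Large} versus {Medium, Extra Large} never appears in the output, because it cannot be produced by a single cut point.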
Figure 3.9.
Different ways of grouping ordinal attribute values.
Continuous Attributes
For continuous attributes, the attribute test condition can be expressed as a
comparison test (e.g., A<v) producing a binary split, or as a range query of
the form v_i ≤ A < v_{i+1}, for i=1, …, k, producing a multiway split. The
difference between these approaches is shown in Figure 3.10. For the binary
split, any possible value v between the minimum and maximum attribute
values in the training data can be used for constructing the comparison test
A<v. However, it is sufficient to only consider distinct attribute values in the
training set as candidate split positions. For the multiway split, any possible
collection of attribute value ranges can be used, as long as they are mutually
exclusive and cover the entire range of attribute values between the minimum
and maximum values observed in the training set. One approach for
constructing multiway splits is to apply the discretization strategies described
in Section 2.3.6 on page 63. After discretization, a new ordinal value is
assigned to each discretized interval, and the attribute test condition is then
defined using this newly constructed ordinal attribute.
Figure 3.10.
Test condition for continuous attributes.
3.3.3 Measures for Selecting an
Attribute Test Condition
There are many measures that can be used to determine the goodness of an
attribute test condition. These measures try to give preference to attribute test
conditions that partition the training instances into purer subsets in the child
nodes, which mostly have the same class labels. Having purer nodes is useful
since a node that has all of its training instances from the same class does not
need to be expanded further. In contrast, an impure node containing training
instances from multiple classes is likely to require several levels of node
expansions, thereby increasing the depth of the tree considerably. Larger trees
are less desirable as they are more susceptible to model overfitting, a
condition that may degrade the classification performance on unseen
instances, as will be discussed in Section 3.4. They are also difficult to
interpret and incur more training and test time as compared to smaller trees.
In the following, we present different ways of measuring the impurity of a
node and the collective impurity of its child nodes, both of which will be used
to identify the best attribute test condition for a node.
Impurity Measure for a Single Node
The impurity of a node measures how dissimilar the class labels are for the
data instances belonging to a common node. Following are examples of
measures that can be used to evaluate the impurity of a node t:
\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t), (3.4)
\text{Gini index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2, (3.5)
\text{Classification error} = 1 - \max_i [p_i(t)], (3.6)
where p_i(t) is the relative frequency of training instances that belong to class i
at node t, c is the total number of classes, and 0 \log_2 0 = 0 in entropy
calculations. All three measures give a zero impurity value if a node contains
instances from a single class and maximum impurity if the node has equal
proportion of instances from multiple classes.
Figure 3.11 compares the relative magnitude of the impurity measures when
applied to binary classification problems. Since there are only two classes,
p_0(t)+p_1(t)=1. The horizontal axis p refers to the fraction of instances that
belong to one of the two classes. Observe that all three measures attain their
maximum value when the class distribution is uniform (i.e., p_0(t)=p_1(t)=0.5)
and minimum value when all the instances belong to a single class (i.e., either
p_0(t) or p_1(t) equals 1). The following examples illustrate how the values
of the impurity measures vary as we alter the class distribution.
Figure 3.11.
Comparison among the impurity measures for binary classification
problems.
Node N1 (Class=0: 0, Class=1: 6)
Gini = 1 − (0/6)^2 − (6/6)^2 = 0
Entropy = −(0/6) \log_2(0/6) − (6/6) \log_2(6/6) = 0
Error = 1 − max[0/6, 6/6] = 0
Node N2 (Class=0: 1, Class=1: 5)
Gini = 1 − (1/6)^2 − (5/6)^2 = 0.278
Entropy = −(1/6) \log_2(1/6) − (5/6) \log_2(5/6) = 0.650
Error = 1 − max[1/6, 5/6] = 0.167
Node N3 (Class=0: 3, Class=1: 3)
Gini = 1 − (3/6)^2 − (3/6)^2 = 0.5
Entropy = −(3/6) \log_2(3/6) − (3/6) \log_2(3/6) = 1
Error = 1 − max[3/6, 3/6] = 0.5
Based on these calculations, node N1 has the lowest impurity value, followed
by N2 and N3. This example, along with Figure 3.11, shows the consistency
among the impurity measures, i.e., if a node N1 has lower entropy than node
N2, then the Gini index and error rate of N1 will also be lower than that of
N2. Despite their agreement, the attribute chosen as splitting criterion by the
impurity measures can still be different (see Exercise 6 on page 187).
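The three impurity measures are straightforward to compute from the class counts at a node. The following sketch, with hypothetical function names, evaluates them for the nodes N1, N2, and N3 using the counts given above.

```python
from math import log2

def class_fractions(counts):
    """Relative frequency p_i(t) of each class at a node."""
    n = sum(counts)
    return [c / n for c in counts]

def entropy(counts):
    # Classes with zero count are skipped, which applies 0 * log2(0) = 0.
    return sum(-p * log2(p) for p in class_fractions(counts) if p > 0)

def gini(counts):
    return 1.0 - sum(p * p for p in class_fractions(counts))

def classification_error(counts):
    return 1.0 - max(class_fractions(counts))

# Class counts (Class=0, Class=1) for the three example nodes
nodes = {"N1": [0, 6], "N2": [1, 5], "N3": [3, 3]}
for name, counts in nodes.items():
    print(name,
          "Gini:", round(gini(counts), 3),
          "Entropy:", round(entropy(counts), 3),
          "Error:", round(classification_error(counts), 3))
```

All three functions return 0 for the pure node N1 and reach their maximum for the evenly split node N3.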
Collective Impurity of Child Nodes
Consider an attribute test condition that splits a node containing N training
instances into k children, {v_1, v_2, ⋯, v_k}, where every child node represents a
partition of the data resulting from one of the k outcomes of the attribute test
condition. Let N(v_j) be the number of training instances associated with a
child node v_j, whose impurity value is I(v_j). Since a training instance in the
parent node reaches node v_j a fraction N(v_j)/N of the time, the collective
impurity of the child nodes can be computed by taking a weighted sum of the
impurities of the child nodes, as follows:
I(\text{children}) = \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j), (3.7)
Example 3.3. Weighted Entropy
Figure 3.12.
Examples of candidate attribute test conditions.
Consider the candidate attribute test conditions shown in Figures 3.12(a) and
(b) for the loan borrower classification problem. Splitting on the Home
Owner attribute will generate two child nodes whose weighted entropy can be
calculated as follows:
I(\text{Home Owner} = \text{yes}) = -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0
I(\text{Home Owner} = \text{no}) = -\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7} = 0.985
I(\text{Home Owner}) = \frac{3}{10} \times 0 + \frac{7}{10} \times 0.985 = 0.690
Splitting on Marital Status, on the other hand, leads to three child nodes with
a weighted entropy given by
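I(\text{Marital Status} = \text{Single}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971
I(\text{Marital Status} = \text{Married}) = -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0
I(\text{Marital Status} = \text{Divorced}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1.000
I(\text{Marital Status}) = \frac{5}{10} \times 0.971 + \frac{3}{10} \times 0 + \frac{2}{10} \times 1 = 0.686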
Thus, Marital Status has a lower weighted entropy than Home Owner.
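The weighted entropy calculation in this example can be reproduced with a short script. This is a sketch with hypothetical function names; the records are the Home Owner, Marital Status, and Defaulted columns of Table 3.3. The script keeps full precision throughout, so its results can differ from the figures above in the third decimal place, since the text rounds each child entropy before weighting.

```python
from collections import Counter, defaultdict
from math import log2

# (Home Owner, Marital Status, Defaulted) for the ten borrowers in Table 3.3
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Single", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

def entropy(labels):
    """Entropy of a list of class labels (only observed classes contribute)."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_impurity(records, attr_index, impurity=entropy):
    """Weighted impurity of the children of a multiway split on one attribute."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attr_index]].append(rec[-1])   # class label is the last field
    n = len(records)
    return sum(len(g) / n * impurity(g) for g in groups.values())

home_owner = weighted_impurity(records, 0)
marital_status = weighted_impurity(records, 1)
print(f"Home Owner: {home_owner:.4f}  Marital Status: {marital_status:.4f}")
```

The impurity function is passed as a parameter so that the same routine could be reused with another measure, such as the Gini index.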
Identifying the Best Attribute Test Condition
To determine the goodness of an attribute test condition, we need to compare
the degree of impurity of the parent node (before splitting) with the weighted
degree of impurity of the child nodes (after splitting). The larger their
difference, the better the test condition. This difference, Δ, also termed as the
gain in purity of an attribute test condition, can be defined as follows:
\Delta = I(\text{parent}) - I(\text{children}), (3.8)
where I(parent) is the impurity of a node before splitting and I(children) is the
weighted impurity measure after splitting. It can be shown that the gain is
non-negative since I(parent)≥I(children) for any reasonable measure such as
those presented above. The higher the gain, the purer are the classes in the
child nodes relative to the parent node. The splitting criterion in the decision
tree learning algorithm selects the attribute test condition that shows the
maximum gain. Note that maximizing the gain at a given node is equivalent
to minimizing the weighted impurity measure of its children since I(parent) is
the same for all candidate attribute test conditions. Finally, when entropy is
used as the impurity measure, the difference in entropy is commonly known
as information gain, \Delta_{info}.
Figure 3.13.
Splitting criteria for the loan borrower classification problem using
Gini index.
In the following, we present illustrative approaches for identifying the best
attribute test condition given qualitative or quantitative attributes.
Splitting of Qualitative Attributes
Consider the first two candidate splits shown in Figure 3.12 involving
qualitative attributes Home Owner and Marital Status. The initial class
distribution at the parent node is (0.3, 0.7), since there are 3 instances of class
Yes and 7 instances of class No in the training data. Thus,
I(\text{parent}) = -\frac{3}{10}\log_2\frac{3}{10} - \frac{7}{10}\log_2\frac{7}{10} = 0.881.
The information gains for Home Owner and Marital Status are given by
\Delta_{info}(\text{Home Owner}) = 0.881 - 0.690 = 0.191
\Delta_{info}(\text{Marital Status}) = 0.881 - 0.686 = 0.195
Marital Status has the higher information gain because of its lower
weighted entropy, so it is the attribute chosen for splitting.
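The gain computation can be expressed directly in terms of class counts. The sketch below uses hypothetical function names, and the (Yes, No) counts for each child are tallied from Table 3.3. Because the script does not round intermediate values, the printed gains may differ from the hand calculation in the third decimal place, but the ranking of the two attributes is unaffected.

```python
from math import log2

def entropy(counts):
    """Entropy from class counts; zero counts are skipped (0 * log2(0) = 0)."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def gain(parent_counts, children_counts, impurity=entropy):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * impurity(c) for c in children_counts)
    return impurity(parent_counts) - weighted

# Class counts are (Defaulted = Yes, Defaulted = No)
parent = [3, 7]
home_owner = [[0, 3], [3, 4]]            # Home Owner = Yes, No
marital_status = [[2, 3], [0, 3], [1, 1]]  # Single, Married, Divorced

print("Home Owner:", round(gain(parent, home_owner), 4))
print("Marital Status:", round(gain(parent, marital_status), 4))
```

Since the parent impurity is the same for both candidates, ranking by gain is the same as ranking by the weighted entropy of the children.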
Binary Splitting of Qualitative Attributes
Consider building a decision tree using only binary splits and the Gini index
as the impurity measure. Figure 3.13 shows examples of four candidate
splitting criteria for the Home Owner and Marital Status attributes. Since
there are 3 borrowers in the training set who defaulted and 7 others who
repaid their loan (see Table in Figure 3.13), the Gini index of the parent node
before splitting is
1 - \left(\frac{3}{10}\right)^2 - \left(\frac{7}{10}\right)^2 = 0.420.
If Home Owner is chosen as the splitting attribute, the Gini index for the child
nodes N1 and N2 are 0 and 0.490, respectively. The weighted average Gini
index for the children is
\frac{3}{10} \times 0 + \frac{7}{10} \times 0.490 = 0.343,
where the weights represent the proportion of training instances assigned to
each child. The gain using Home Owner as splitting attribute is
0.420−0.343=0.077. Similarly, we can apply a binary split on the Marital
Status attribute. However, since Marital Status is a nominal attribute with
three outcomes, there are three possible ways to group the attribute values
into a binary split. The weighted average Gini index of the children for each
candidate binary split is shown in Figure 3.13. Based on these results, Home
Owner and the last binary split using Marital Status are clearly the best
candidates, since they both produce the lowest weighted average Gini index.
Binary splits can also be used for ordinal attributes, if the binary partitioning
of the attribute values does not violate the ordering property of the values.
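A minimal Python sketch of the Gini computation for the Home Owner split, together with an enumeration of the binary groupings of a nominal attribute, is given below. The function names are hypothetical, and the child counts (0, 3) and (3, 4) are assumed values that reproduce the child Gini indices 0 and 0.490 stated above. For an attribute with k values, the generator yields 2^(k-1) - 1 groupings, which is three for Marital Status.

```python
from itertools import combinations

def gini(counts):
    """Gini index of a node, given its per-class instance counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children_counts):
    """Average Gini index of the children, weighted by child size."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

def binary_groupings(values):
    """Yield every way to split a set of nominal values into two non-empty groups."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    # Keep `first` on the left side so that each partition is produced only once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            yield left, set(values) - left

parent = (3, 7)                  # (defaulted, repaid) counts at the parent node
home_owner = [(0, 3), (3, 4)]    # assumed counts for child nodes N1 and N2

parent_gini = gini(parent)                   # about 0.420
children_gini = weighted_gini(home_owner)    # about 0.343
gain = parent_gini - children_gini           # about 0.077

groupings = list(binary_groupings(["Single", "Married", "Divorced"]))
for left, right in groupings:
    print(sorted(left), "|", sorted(right))
```

Scoring each grouping with weighted_gini requires the class counts per marital status, which are listed in the table of Figure 3.13 and are not repeated here.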
Binary Splitting of Quantitative Attributes
Consider the problem of identifying the best binary split Annual Income≤τ
for the preceding loan approval classification problem. As discussed
previously, even though τ can take any value between the minimum and
maximum values of annual income in the training set, it is sufficient to only
consider the annual income values observed in the training set as candidate
split positions. For each candidate τ, the training set is scanned once to count
the number of borrowers with annual income less than or equal to τ and the
number with annual income greater than τ, along
with their class proportions. We can then compute the Gini index at each
candidate split position and choose the τ that produces the lowest value.
Computing the Gini index at each candidate split position requires O(N)
operations, where N is the number of training instances. Since there are at
most N possible candidates, the overall complexity of this brute-force method
is O(N²). It is possible to reduce the complexity of this problem to O(N log
N) by using a method described as follows (see illustration in Figure 3.14). In
this method, we first sort the training instances based on their annual income,
a one-time cost that requires O(N log N) operations. The candidate split
positions are given by the midpoints between every two adjacent sorted
values: $55,000, $65,000, $72,500, and so on. For the first candidate, since
none of the instances has an annual income less than or equal to $55,000, the
Gini index for the child node with Annual Income ≤ $55,000 is equal to zero.
In contrast, there are 3 training instances of class Yes and 7 instances of class
No with annual income greater than $55,000. The Gini index for this node is
0.420. The weighted average Gini index for the first candidate split position,
τ=$55,000, is equal to 0×0+1×0.420=0.420.
Figure 3.14.
Splitting continuous attributes.
For the next candidate, τ=$65,000, the class distribution of its child nodes can
be obtained with a simple update of the distribution for the previous
candidate. This is because, as τ increases from $55,000 to $65,000, there is
only one training instance affected by the change. By examining the class
label of the affected training instance, the new class distribution is obtained.
For example, as τ increases to $65,000, there is only one borrower in the
training set, with an annual income of $60,000, affected by this change. Since
the class label for the borrower is No, the count for class No increases from 0
to 1 (for Annual Income≤$65,000) and decreases from 7 to 6 (for
Annual Income>$65,000), as shown in Figure 3.14. The distribution for the
Yes class remains unaffected. The updated Gini index for this candidate split
position is 0.400.
This procedure is repeated until the Gini index has been computed for every candidate.
The best split position corresponds to the one that produces the lowest Gini
index, which occurs at τ=$97,500. Since the Gini index at each candidate
split position can be updated in O(1) time, finding the
best split position takes O(N) time once the values have been sorted, a one-time
operation that takes O(N log N) time. The overall complexity of this method
is thus O(N log N), which is much smaller than the O(N²) time taken by the
brute-force method. The amount of computation can be further reduced by
considering only candidate split positions located between two adjacent
sorted instances with different class labels. For example, we do not need to
consider candidate split positions located between $60,000 and $75,000
because all three instances with annual income in this range ($60,000,
$70,000, and $75,000) have the same class labels. Choosing a split position
within this range only increases the degree of impurity, compared to a split
position located outside this range. Therefore, the candidate split positions at
τ=$65,000 and τ=$72,500 can be ignored. Similarly, we do not need to
consider the candidate split positions at $87,500, $92,500, $110,000,
$122,500, and $172,500 because they are located between two adjacent
instances with the same labels. This strategy reduces the number of candidate
split positions to consider from 9 to 2 (excluding the two boundary cases
τ=$55,000 and τ=$230,000).
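The sort-then-scan procedure described above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, incomes are expressed in thousands of dollars, and the ten (income, class) pairs are reconstructed from the candidate positions and class changes mentioned in the text, so they should be treated as an assumption about Figure 3.14. Only midpoints between adjacent sorted values are evaluated; the two boundary candidates are omitted.

```python
from collections import Counter

def gini(class_counts, total):
    """Gini index of a node from its per-class counts."""
    return 1.0 - sum((c / total) ** 2 for c in class_counts.values())

def best_threshold(values, labels):
    """Return (tau, weighted Gini) of the best binary split value <= tau."""
    pairs = sorted(zip(values, labels))          # one-time O(N log N) sort
    n = len(pairs)
    left = Counter()                             # class counts for value <= tau
    right = Counter(label for _, label in pairs) # class counts for value > tau
    best_tau, best_score = None, float("inf")
    for i in range(n - 1):
        value, label = pairs[i]
        # Moving tau past one instance changes a single count on each side.
        left[label] += 1
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue                             # no midpoint between equal values
        n_left, n_right = i + 1, n - (i + 1)
        score = (n_left / n) * gini(left, n_left) + (n_right / n) * gini(right, n_right)
        if score < best_score:
            best_tau, best_score = (value + next_value) / 2, score
    return best_tau, best_score

# Assumed data: annual income in thousands of dollars and the class label.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

tau, score = best_threshold(incomes, labels)
print(tau, round(score, 3))
```

With these values the scan selects τ = 97.5 (that is, $97,500) with a weighted Gini index of 0.300, in agreement with the text. The number of classes is treated as a constant, so each update costs O(1).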
Gain Ratio
One potential limitation of impurity measures such as entropy and Gini index
is that they tend to favor qualitative attributes with large number of distinct
values. Figure 3.12 shows three candidate attributes for partitioning the data
set given in Table 3.3. As previously mentioned, the attribute Marital
Status is a better choice than the attribute Home Owner, because it provides a
larger information gain. However, if we compare them against Customer ID,
the latter produces the purest partitions with the maximum information gain,
since the weighted entropy and Gini index are both equal to zero for its children.
Yet, Customer ID is not a good attribute for splitting because it has a unique
value for each instance. Even though a test condition involving Customer ID
will accurately classify every instance in the training data, we cannot use
such a test condition on new test instances with Customer ID values that
haven’t been seen before during training. This example suggests having a low
impurity value alone is insufficient to find a good attribute test condition for a
node. As we will see later in Section 3.4, having a larger number of child nodes
can make a decision tree more complex and consequently more susceptible to
overfitting. Hence, the number of children produced by the splitting attribute
should also be taken into consideration while deciding the best attribute test
condition.
There are two ways to overcome this problem. One way is to generate only
binary decision trees, thus avoiding the difficulty of handling attributes with
varying number of partitions. This strategy is employed by decision tree
classifiers such as CART. Another way is to modify the splitting criterion to
take into account the number of partitions produced by the attribute. For
example, in the C4.5 decision tree algorithm, a measure known as gain ratio
is used to compensate for attributes that produce a large number of child
nodes. This measure is computed as follows:
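\[ \text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}} = \frac{\text{Entropy(Parent)} - \sum_{i=1}^{k} \frac{N(v_i)}{N}\,\text{Entropy}(v_i)}{-\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}} \tag{3.9} \]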
where N(vi) is the number of instances assigned to node vi and k is the total
number of splits. The split information measures the entropy of splitting a
node into its child nodes and evaluates if the split results in a larger number
of equally-sized child nodes or not. For example, if every partition has the
same number of instances, then ∀i:N(vi)/N=1/k and the split information
would be equal to log2 k. Thus, if an attribute produces a large number of
splits, its split information is also large, which in turn, reduces the gain ratio.
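The claim that equally sized partitions give a split information of log2 k follows directly from the denominator of Equation (3.9). The short derivation below substitutes N(v_i)/N = 1/k and is included only as a worked check.

```latex
\begin{aligned}
\text{Split Info}
  &= -\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}
   = -\sum_{i=1}^{k} \frac{1}{k} \log_2 \frac{1}{k} \\
  &= -k \cdot \frac{1}{k} \cdot \left(-\log_2 k\right)
   = \log_2 k .
\end{aligned}
```

For instance, a split into k = 2 equal partitions has a split information of 1, whereas a split into k = 20 equal partitions has a split information of log2 20, which is approximately 4.32.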
Example 3.4. Gain Ratio
Consider the data set given in Exercise 2 on page 185. We want to select the
best attribute test condition among the following three attributes: Gender, Car
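Type, and Customer ID. The entropy before splitting is
\[ \text{Entropy(parent)} = -\frac{10}{20}\log_2\frac{10}{20} - \frac{10}{20}\log_2\frac{10}{20} = 1. \]
If Gender is used as attribute test condition:
\[ \text{Entropy(children)} = \frac{10}{20}\left[-\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10}\right] \times 2 = 0.971 \]
\[ \text{Gain Ratio} = \frac{1 - 0.971}{-\frac{10}{20}\log_2\frac{10}{20} - \frac{10}{20}\log_2\frac{10}{20}} = \frac{0.029}{1} = 0.029 \]
If Car Type is used as attribute test condition:
\[ \text{Entropy(children)} = \frac{4}{20}\left[-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right] + \frac{8}{20} \times 0 + \frac{8}{20}\left[-\frac{1}{8}\log_2\frac{1}{8} - \frac{7}{8}\log_2\frac{7}{8}\right] = 0.380 \]
\[ \text{Gain Ratio} = \frac{1 - 0.380}{-\frac{4}{20}\log_2\frac{4}{20} - \frac{8}{20}\log_2\frac{8}{20} - \frac{8}{20}\log_2\frac{8}{20}} = \frac{0.620}{1.52} = 0.41 \]
Finally, if Customer ID is used as attribute test condition:
\[ \text{Entropy(children)} = \frac{1}{20}\left[-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right] \times 20 = 0 \]
\[ \text{Gain Ratio} = \frac{1 - 0}{-\frac{1}{20}\log_2\frac{1}{20} \times 20} = \frac{1}{4.32} = 0.23 \]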
Thus, even though Customer ID has the highest information gain, its gain
ratio is lower than that of Car Type since it produces a larger number of splits.
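The three gain ratios in this example can be recomputed with the Python sketch below. The helper names are hypothetical, and the per-partition class counts are an assumption about the data set of Exercise 2: they were chosen to reproduce the child entropies 0.971, 0.380, and 0 and the partition sizes that appear in the split information terms.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    """Information gain divided by split information, as in Equation (3.9)."""
    n = sum(parent_counts)
    sizes = [sum(child) for child in children_counts]
    weighted = sum(s / n * entropy(child) for s, child in zip(sizes, children_counts))
    info_gain = entropy(parent_counts) - weighted
    split_info = entropy(sizes)      # entropy of the partition sizes
    return info_gain / split_info    # undefined when there is a single partition

parent = (10, 10)                               # (C0, C1) counts before splitting
gender = [(6, 4), (4, 6)]                       # assumed counts per gender
car_type = [(1, 3), (8, 0), (1, 7)]             # assumed counts per car type
customer_id = [(1, 0)] * 10 + [(0, 1)] * 10     # one instance per customer

ratios = {
    "Gender": gain_ratio(parent, gender),
    "Car Type": gain_ratio(parent, car_type),
    "Customer ID": gain_ratio(parent, customer_id),
}
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the values round to 0.029, 0.41, and 0.23, so Car Type has the highest gain ratio even though Customer ID has the highest information gain.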
3.3.4 Algorithm for Decision Tree Induction
Algorithm 3.1 presents pseudocode for a decision tree induction algorithm.
The input to this algorithm is a set of training instances E along with the
attribute set F . The algorithm works by recursively selecting the best
attribute to split the data (Step 7) and expanding the nodes of the tree (Steps
11 and 12) until the stopping criterion is met (Step 1). The details of this
algorithm are explained below.
1. The createNode() function extends the decision tree by creating a new
node. A node in the decision tree either has a test condition, denoted as
node.test_cond, or a class label, denoted as node.label.
2. The find_best_split() function determines the attribute test condition
for partitioning the training instances associated with a node. The
splitting attribute chosen depends on the impurity measure used. The
popular measures include entropy and the Gini index.
3. The Classify() function determines the class label to be assigned to a
leaf node. For each leaf node t, let p(i|t) denote the fraction of training
instances from class i associated with the node t. The label assigned to
the leaf node is typically the one that occurs most frequently in the
training instances that are associated with this node.
Algorithm 3.1 A skeleton decision tree induction algorithm.
TreeGrowth(E, F)
1: if stopping_cond(E, F) = true then
2:   leaf = createNode().
3:   leaf.label = Classify(E).
4:   return leaf.
5: else
6:   root = createNode().
7:   root.test_cond = find_best_split(E, F).
8:   let V = {v | v is a possible outcome of root.test_cond}.
9:   for each v ∈ V do
10:    E_v = {e | root.test_cond(e) = v and e ∈ E}.
11:    child = TreeGrowth(E_v, F).
12:    add child as descendent of root and label the edge (root → child) as v.
13:  end for
14: end if
15: return root.
\[ \text{leaf.label} = \operatorname*{argmax}_{i} \; p(i \mid t), \tag{3.10} \]
where the argmax operator returns the class i that maximizes p(i|t).
Besides providing the information needed to determine the class label of
a leaf node, p(i|t) can also be used as a rough estimate of the probability
that an instance assigned to the leaf node t belongs to class i. Sections
4.11.2 and 4.11.4 in the next chapter describe how such probability
estimates can be used to determine the performance of a decision tree
under different cost functions.
4. The stopping_cond() function is used to terminate the tree-growing
process by checking whether all the instances have identical class labels
or attribute values. Since decision tree classifiers employ a top-down,
recursive partitioning approach for building a model, the number of
training instances associated with a node decreases as the depth of the
tree increases. As a result, a leaf node may contain too few training
instances to make a statistically significant decision about its class label.
This is known as the data fragmentation problem. One way to avoid
this problem is to disallow splitting of a node when the number of
instances associated with the node falls below a certain threshold. A more
systematic way to control the size of a decision tree (number of leaf
nodes) will be discussed in Section 3.5.4.
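The skeleton in Algorithm 3.1 can be turned into a small runnable Python sketch. The version below is one possible reading, not the definitive algorithm: it assumes categorical attributes with multiway splits, uses information gain inside find_best_split, represents each instance as a (attribute dictionary, label) pair, and only creates children for outcomes observed in E. The Node class and the predict helper are hypothetical additions. As a sanity check, the tree is grown on all 16 assignments of the Boolean function (A∧B)∨(C∧D) discussed in Section 3.3.6.

```python
import math
from collections import Counter
from itertools import product

class Node:
    def __init__(self):
        self.test_cond = None   # attribute tested at an internal node
        self.label = None       # class label stored at a leaf node
        self.children = {}      # maps each outcome v to a child node

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def stopping_cond(E, F):
    """Stop when all instances share a class label or all attribute values."""
    same_label = len({y for _, y in E}) <= 1
    same_attrs = len({tuple(x[a] for a in F) for x, _ in E}) <= 1
    return same_label or same_attrs

def classify(E):
    """Majority class among the instances at a leaf (Equation 3.10)."""
    return Counter(y for _, y in E).most_common(1)[0][0]

def find_best_split(E, F):
    """Attribute with the largest information gain among those that vary in E."""
    parent = entropy([y for _, y in E])

    def gain(a):
        groups = {}
        for x, y in E:
            groups.setdefault(x[a], []).append(y)
        return parent - sum(len(g) / len(E) * entropy(g) for g in groups.values())

    # Attributes with a single observed value cannot partition E, so skip them.
    candidates = [a for a in F if len({x[a] for x, _ in E}) > 1]
    return max(candidates, key=gain)

def tree_growth(E, F):
    if stopping_cond(E, F):
        leaf = Node()
        leaf.label = classify(E)
        return leaf
    root = Node()
    root.test_cond = find_best_split(E, F)
    for v in {x[root.test_cond] for x, _ in E}:
        E_v = [(x, y) for x, y in E if x[root.test_cond] == v]
        root.children[v] = tree_growth(E_v, F)
    return root

def predict(node, x):
    """Follow test conditions from the root until a leaf is reached."""
    while node.label is None:
        node = node.children[x[node.test_cond]]
    return node.label

F = ["A", "B", "C", "D"]
E = [(dict(zip(F, bits)), (bits[0] and bits[1]) or (bits[2] and bits[3]))
     for bits in product([False, True], repeat=4)]
tree = tree_growth(E, F)
print(all(predict(tree, x) == y for x, y in E))
```

Because the training set contains every attribute combination exactly once, the grown tree reproduces the function on all 16 assignments. An attribute value that never occurs in the training data would raise a KeyError in predict; handling such cases is left out to keep the sketch close to the skeleton.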
3.3.5 Example Application: Web
Robot Detection
Consider the task of distinguishing the access patterns of web robots from
those generated by human users. A web robot (also known as a web crawler)
is a software program that automatically retrieves files from one or more
websites by following the hyperlinks extracted from an initial set of seed
URLs. These programs have been deployed for various purposes, from
gathering web pages on behalf of search engines to more malicious activities
such as spamming and committing click fraud in online advertisements.
Figure 3.15.
Input data for web robot detection.
The web robot detection problem can be cast as a binary classification task.
The input data for the classification task is a web server log, a sample of
which is shown in Figure 3.15(a). Each line in the log file corresponds to a
request made by a client (i.e., a human user or a web robot) to the web server.
The fields recorded in the web log include the client’s IP address, timestamp
of the request, URL of the requested file, size of the file, and user agent,
which is a field that contains identifying information about the client. For
human users, the user agent field specifies the type of web browser or mobile
device used to fetch the files, whereas for web robots, it should technically
contain the name of the crawler program. However, web robots may conceal
their true identities by declaring their user agent fields to be identical to
known browsers. Therefore, user agent is not a reliable field to detect web
robots.
The first step toward building a classification model is to precisely define a
data instance and associated attributes. A simple approach is to consider each
log entry as a data instance and use the appropriate fields in the log file as its
attribute set. This approach, however, is inadequate for several reasons. First,
many of the attributes are nominal-valued and have a wide range of domain
values. For example, the number of unique client IP addresses, URLs, and
referrers in a log file can be very large. These attributes are undesirable for
building a decision tree because their split information is extremely high (see
Equation (3.9)). In addition, it might not be possible to classify test instances
containing IP addresses, URLs, or referrers that are not present in the training
data. Finally, by considering each log entry as a separate data instance, we
disregard the sequence of web pages retrieved by the client—a critical piece
of information that can help distinguish web robot accesses from those of a
human user.
A better alternative is to consider each web session as a data instance. A web
session is a sequence of requests made by a client during a given visit to the
website. Each web session can be modeled as a directed graph, in which the
nodes correspond to web pages and the edges correspond to hyperlinks
connecting one web page to another. Figure 3.15(b) shows a graphical
representation of the first web session given in the log file. Every web session
can be characterized using some meaningful attributes about the graph that
contain discriminatory information. Figure 3.15(c) shows some of the
attributes extracted from the graph, including the depth and breadth of its
corresponding tree rooted at the entry point to the website. For example, the
depth and breadth of the tree shown in Figure 3.15(b) are both equal to two.
The derived attributes shown in Figure 3.15(c) are more informative than the
original attributes given in the log file because they characterize the behavior
of the client at the website. Using this approach, a data set containing 2916
instances was created, with equal numbers of sessions due to web robots
(class 1) and human users (class 0). 10% of the data were reserved for
training while the remaining 90% were used for testing. The induced decision
tree, shown in Figure 3.16, has an error rate of 3.8% on the
training set and 5.3% on the test set. In addition to its low error rate, the tree
also reveals some interesting properties that can help discriminate web robots
from human users:
1. Accesses by web robots tend to be broad but shallow, whereas accesses
by human users tend to be more focused (narrow but deep).
2. Web robots seldom retrieve the image pages associated with a web page.
3. Sessions due to web robots tend to be long and contain a large number
of requested pages.
4. Web robots are more likely to make repeated requests for the same web
page than human users since the web pages retrieved by human users are
often cached by the browser.
3.3.6 Characteristics of Decision
Tree Classifiers
The following is a summary of the important characteristics of decision tree
induction algorithms.
1. Applicability: Decision trees are a nonparametric approach for building
classification models. This approach does not require any prior
assumption about the probability distribution governing the class and
attributes of the data, and thus, is applicable to a wide variety of data
sets. It is also applicable to both categorical and continuous data without
requiring the attributes to be transformed into a common representation
via binarization, normalization, or standardization. Unlike some binary
classifiers described in Chapter 4, it can also deal with multiclass
problems without the need to decompose them into multiple binary
classification tasks. Another appealing feature of decision tree classifiers
is that the induced trees, especially the shorter ones, are relatively easy
to interpret. The accuracies of the trees are also quite comparable to
other classification techniques for many simple data sets.
2. Expressiveness: A decision tree provides a universal representation for
discrete-valued functions. In other words, it can encode any function of
discrete-valued attributes. This is because every discrete-valued function
can be represented as an assignment table, where every unique
combination of discrete attributes is assigned a class label. Since every
combination of attributes can be represented as a leaf in the decision
tree, we can always find a decision tree whose label assignments at the
leaf nodes match the assignment table of the original function.
Decision trees can also help in providing compact representations of
functions when some of the unique combinations of attributes can be
represented by the same leaf node. For example, Figure 3.17 shows the
assignment table of the Boolean function (A∧B)∨(C∧D) involving
four binary attributes, resulting in a total of 2^4 = 16 possible assignments.
The tree shown in Figure 3.17 shows a compressed encoding of this
assignment table. Instead of requiring a fully-grown tree with 16 leaf
nodes, it is possible to encode the function using a simpler tree with only
7 leaf nodes. Nevertheless, not all decision trees for discrete-valued
attributes can be simplified. One notable example is the parity function,
whose value is 1 when there is an even number of true values among its
Boolean attributes, and 0 otherwise. Accurate modeling of such a
function requires a full decision tree with 2^d nodes, where d is the
number of Boolean attributes (see Exercise 1 on page 185).
Figure 3.16.
Decision tree model for web robot detection.
Figure 3.17.
Decision tree for the Boolean function (A∧B)∨(C∧D).
3. Computational Efficiency: Since the number of possible decision trees
can be very large, many decision tree algorithms employ a heuristic-based
approach to guide their search in the vast hypothesis space. For
example, the algorithm presented in Section 3.3.4 uses a greedy, top-down,
recursive partitioning strategy for growing a decision tree. For
many data sets, such techniques quickly construct a reasonably good
decision tree even when the training set size is very large. Furthermore,
once a decision tree has been built, classifying a test record is extremely
fast, with a worst-case complexity of O(w), where w is the maximum
depth of the tree.
4. Handling Missing Values: A decision tree classifier can handle missing
attribute values in a number of ways, both in the training and the test
sets. When there are missing values in the test set, the classifier must
decide which branch to follow if the value of a splitting node attribute is
missing for a given test instance. One approach, known as the
probabilistic split method, which is employed by the C4.5 decision
tree classifier, distributes the data instance to every child of the splitting
node according to the probability that the missing attribute has a
particular value. In contrast, the CART algorithm uses the surrogate
split method, where the instance whose splitting attribute value is
missing is assigned to one of the child nodes based on the value of
another non-missing surrogate attribute whose splits most resemble the
partitions made by the missing attribute. Another approach, known as
the separate class method, is used by the CHAID algorithm, where the
missing value is treated as a separate categorical value distinct from
other values of the splitting attribute. Figure 3.18 shows an example of
the three different ways for handling missing values in a decision tree
classifier. Other strategies for dealing with missing values are based on
data preprocessing, where the instance with missing value is either
imputed with the mode (for categorical attribute) or mean (for
continuous attribute) value or discarded before the classifier is trained.
Figure 3.18.
Methods for handling missing attribute values in decision tree classifiers.
During training, if an attribute v has missing values in some of the
training instances associated with a node, we need a way to measure the
gain in purity if v is used for splitting. One simple way is to exclude
instances with missing values of v in the counting of instances
associated with every child node, generated for every possible outcome
of v. Further, if v is chosen as the attribute test condition at a node,
training instances with missing values of v can be propagated to the
child nodes using any of the methods described above for handling
missing values in test instances.
5. Handling Interactions among Attributes: Attributes are considered
interacting if they are able to distinguish between classes when used
together, but individually they provide little or no information. Due to
the greedy nature of the splitting criteria in decision trees, such attributes
could be passed over in favor of other attributes that are not as useful.
This could result in more complex decision trees than necessary. Hence,
decision trees can perform poorly when there are interactions among
attributes.
To illustrate this point, consider the three-dimensional data shown in
Figure 3.19(a), which contains 2000 data points from one of two classes,
denoted as + and ∘ in the diagram. Figure 3.19(b) shows the
distribution of the two classes in the two-dimensional space involving
attributes X and Y , which is a noisy version of the XOR Boolean
function. We can see that even though the two classes are well-separated
in this two-dimensional space, neither of the two attributes contains
sufficient information to distinguish between the two classes when used
alone. For example, the entropies of the following attribute test
conditions: X≤10 and Y≤10, are close to 1, indicating that neither X nor
Y provides any reduction in the impurity measure when used individually.
X and Y thus represent a case of interaction among attributes. The data
set also contains a third attribute, Z, in which both classes are distributed
uniformly, as shown in Figures 3.19(c) and 3.19(d), and hence, the
entropy of any split involving Z is close to 1. As a result, Z is as likely to
be chosen for splitting as the interacting but useful attributes, X and Y .
For further illustration of this issue, readers are referred to Example 3.7
in Section 3.4.1 and Exercise 7 at the end of this chapter.

Figure 3.19.
Example of XOR data involving X and Y, along with an
irrelevant attribute Z.
6. Handling Irrelevant Attributes: An attribute is irrelevant if it is not
useful for the classification task. Since irrelevant attributes are poorly
associated with the target class labels, they will provide little or no gain
in purity and thus will be passed over in favor of other, more relevant features.
Hence, the presence of a small number of irrelevant attributes will not
impact the decision tree construction process. However, not all attributes
that provide little to no gain are irrelevant (see Figure 3.19). Hence, if
the classification problem is complex (e.g., involving interactions among
attributes) and there are a large number of irrelevant attributes, then
some of these attributes may be accidentally chosen during the tree-growing
process, since they may provide a better gain than a relevant
attribute just by random chance. Feature selection techniques can help to
improve the accuracy of decision trees by eliminating the irrelevant
attributes during preprocessing. We will investigate the issue of too
many irrelevant attributes in Section 3.4.1.
7. Handling Redundant Attributes: An attribute is redundant if it is strongly
correlated with another attribute in the data. Since redundant attributes
show similar gains in purity if they are selected for splitting, only one of
them will be selected as an attribute test condition in the decision tree
algorithm. Decision trees can thus handle the presence of redundant
attributes.
8. Using Rectilinear Splits: The test conditions described so far in this
chapter involve using only a single attribute at a time. As a consequence,
the tree-growing procedure can be viewed as the process of partitioning
the attribute space into disjoint regions until each region contains
records of the same class. The border between two neighboring regions
of different classes is known as a decision boundary. Figure 3.20 shows
the decision tree as well as the decision boundary for a binary
classification problem. Since the test condition involves only a single
attribute, the decision boundaries are rectilinear; i.e., parallel to the
coordinate axes. This limits the expressiveness of decision trees in
representing decision boundaries of data sets with continuous attributes.
Figure 3.21 shows a two-dimensional data set involving binary classes
that cannot be perfectly classified by a decision tree whose attribute test
conditions are defined based on single attributes. The binary classes in
the data set are generated from two skewed Gaussian distributions,
centered at (8,8) and (12,12), respectively. The true decision boundary is
represented by the diagonal dashed line, whereas the rectilinear decision
boundary produced by the decision tree classifier is shown by the thick
solid line. In contrast, an oblique decision tree may overcome this
limitation by allowing the test condition to be specified using more than
one attribute. For example, the binary classification data shown in
Figure 3.21 can be easily represented by an oblique decision tree with a
single root node with test condition x + y < 20.
Figure 3.20.
Example of a decision tree and its decision boundaries for a
two-dimensional data set.
Figure 3.21.
Example of data set that cannot be partitioned optimally using
a decision tree with single attribute test conditions. The true
decision boundary is shown by the dashed line.
Although an oblique decision tree is more expressive and can produce
more compact trees, finding the optimal test condition is
computationally more expensive.
9. Choice of Impurity Measure: It should be noted that the choice of
impurity measure often has little effect on the performance of decision
tree classifiers since many of the impurity measures are quite consistent
with each other, as shown in Figure 3.11 on page 129. Instead, the
strategy used to prune the tree has a greater impact on the final tree than
the choice of impurity measure.
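The point made under characteristic 5 (interactions among attributes) can be illustrated numerically. The Python sketch below is a simplified, noise-free stand-in for the data of Figure 3.19: it assumes three binary attributes X, Y, and Z, with the class equal to X XOR Y and Z irrelevant, and the helper names are hypothetical. It computes the information gain of each attribute at the root and again after a split on X.

```python
import math
from collections import Counter
from itertools import product

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Information gain of a multiway split of `rows` on attribute `attr`."""
    groups = {}
    for x, y in rows:
        groups.setdefault(x[attr], []).append(y)
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([y for _, y in rows]) - weighted

# Assumed toy data: every combination of X, Y, Z once; class = X XOR Y.
rows = [({"X": x, "Y": y, "Z": z}, x ^ y) for x, y, z in product([0, 1], repeat=3)]

root_gains = {a: info_gain(rows, a) for a in ("X", "Y", "Z")}

# Restrict to the branch X = 0 and evaluate the remaining attributes there.
branch = [(x, y) for x, y in rows if x["X"] == 0]
branch_gains = {a: info_gain(branch, a) for a in ("Y", "Z")}

print(root_gains)
print(branch_gains)
```

At the root, all three attributes have zero gain, so a greedy criterion has no basis for preferring X or Y over the irrelevant Z. Once the data has been split on X, the gain of Y becomes 1 while that of Z remains 0, which shows that the useful information only emerges when X and Y are used together.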
3.4 Model Overfitting
Methods presented so far try to learn classification models that show the
lowest error on the training set. However, as we will show in the following
example, even if a model fits well over the training data, it can still show
poor generalization performance, a phenomenon known as model overfitting.
Figure 3.22.
Examples of training and test sets of a two-dimensional
classification problem.
Figure 3.23.
Effect of varying tree size (number of leaf nodes) on training and
test errors.
Example 3.5. Overfitting and
Underfitting of Decision Trees
Consider the two-dimensional data set shown in Figure 3.22(a). The data set
contains instances that belong to two separate classes, represented as + and
∘ , respectively, where each class has 5400 instances. All instances
belonging to the ∘ class were generated from a uniform distribution. For the
+ class, 5000 instances were generated from a Gaussian distribution centered
at (10,10) with unit variance, while the remaining 400 instances were
sampled from the same uniform distribution as the ∘ class. We can see from
Figure 3.22(a) that the + class can be largely distinguished from the ∘ class
by drawing a circle of appropriate size centered at (10,10). To learn a
classifier using this two-dimensional data set, we randomly sampled 10% of
the data for training and used the remaining 90% for testing. The training set,
shown in Figure 3.22(b), looks quite representative of the overall data. We
used Gini index as the impurity measure to construct decision trees of
increasing sizes (number of leaf nodes), by recursively expanding a node into
child nodes till every leaf node was pure, as described in Section 3.3.4.
Figure 3.23(a) shows changes in the training and test error rates as the size of
the tree varies from 1 to 8. Both error rates are initially large when the tree
has only one or two leaf nodes. This situation is known as model
underfitting. Underfitting occurs when the learned decision tree is too
simplistic, and thus, incapable of fully representing the true relationship
between the attributes and the class labels. As we increase the tree size from
1 to 8, we can observe two effects. First, both the error rates decrease since
larger trees are able to represent more complex decision boundaries. Second,
the training and test error rates are quite close to each other, which indicates
that the performance on the training set is fairly representative of the
generalization performance. As we further increase the size of the tree from 8
to 150, the training error continues to steadily decrease till it eventually
reaches zero, as shown in Figure 3.23(b). However, in a striking contrast, the
test error rate ceases to decrease any further beyond a certain tree size, and
then it begins to increase. The training error rate thus grossly under-estimates
the test error rate once the tree becomes too large. Further, the gap between
the training and test error rates keeps on widening as we increase the tree
size. This behavior, which may seem counter-intuitive at first, can be
attributed to the phenomenon of model overfitting.
3.4.1 Reasons for Model Overfitting
Model overfitting is the phenomenon where, in the pursuit of minimizing the
training error rate, an overly complex model is selected that captures specific
patterns in the training data but fails to learn the true nature of relationships
between attributes and class labels in the overall data. To illustrate this,
Figure 3.24 shows decision trees and their corresponding decision boundaries
(shaded rectangles represent regions assigned to the + class) for two trees of
sizes 5 and 50. We can see that the decision tree of size 5 appears quite
simple and its decision boundaries provide a reasonable approximation to the
ideal decision boundary, which in this case corresponds to a circle centered
around the Gaussian distribution at (10, 10). Although its training and test
error rates are non-zero, they are very close to each other, which indicates
that the patterns learned in the training set should generalize well over the test
set. On the other hand, the decision tree of size 50 appears much more
complex than the tree of size 5, with complicated decision boundaries. For
example, some of its shaded rectangles (assigned the + class) attempt to cover
narrow regions in the input space that contain only one or two + training
instances. Note that the prevalence of + instances in such regions is highly
specific to the training set, as these regions are mostly dominated by ∘
instances in the overall data. Hence, in an attempt to perfectly fit the training
data, the decision tree of size 50 starts fine-tuning itself to specific patterns in
the training data, leading to poor performance on an independently chosen
test set.
Figure 3.24.
Decision trees with different model complexities.
Figure 3.25.
Performance of decision trees using 20% data for training (twice
the original training size).
There are a number of factors that influence model overfitting. In the
following, we provide brief descriptions of two of the major factors: limited
training size and high model complexity. Though they are not exhaustive, the
interplay between them can help explain most of the common model
overfitting phenomena in real-world applications.
Limited Training Size
Note that a training set consisting of a finite number of instances can only
provide a limited representation of the overall data. Hence, it is possible that
the patterns learned from a training set do not fully represent the true patterns
in the overall data, leading to model overfitting. In general, as we increase the
size of a training set (number of training instances), the patterns learned from
the training set start resembling the true patterns in the overall data. Hence,
the effect of overfitting can be reduced by increasing the training size, as
illustrated in the following example.
Example 3.6. Effect of Training Size
Suppose that we use twice as many training instances as we used
in the experiments conducted in Example 3.5. Specifically, we use 20% of the
data for training and use the remainder for testing. Figure 3.25(b) shows the
training and test error rates as the size of the tree is varied from 1 to 150.
There are two major differences in the trends shown in this figure and those
shown in Figure 3.23(b) (using only 10% of the data for training). First, even
though the training error rate decreases with increasing tree size in both
figures, its rate of decrease is much smaller when we use twice the training
size. Second, for a given tree size, the gap between the training and test error
rates is much smaller when we use twice the training size. These differences
suggest that the patterns learned using 20% of data for training are more
generalizable than those learned using 10% of data for training.
Figure 3.25(a) shows the decision boundaries for the tree of size 50, learned
using 20% of data for training. In contrast to the tree of the same size learned
using 10% of the data for training (see Figure 3.24(d)), we can see that the decision
tree is not capturing specific patterns of noisy + instances in the training set.
Instead, the high model complexity of 50 leaf nodes is being effectively used
to learn the boundaries of the + instances centered at (10, 10).
High Model Complexity
Generally, a more complex model has a better ability to represent complex
patterns in the data. For example, decision trees with a larger number of leaf
nodes can represent more complex decision boundaries than decision trees
with fewer leaf nodes. However, an overly complex model also has a
tendency to learn specific patterns in the training set that do not generalize
well over unseen instances. Models with high complexity should thus be
judiciously used to avoid overfitting.
One measure of model complexity is the number of “parameters” that need to
be inferred from the training set. For example, in the case of decision tree
induction, the attribute test conditions at internal nodes correspond to the
parameters of the model that need to be inferred from the training set. A
decision tree with a larger number of attribute test conditions (and
consequently more leaf nodes) thus involves more “parameters” and hence is
more complex.
Given a class of models with a certain number of parameters, a learning
algorithm attempts to select the best combination of parameter values that
maximizes an evaluation metric (e.g., accuracy) over the training set. If the
number of parameter value combinations (and hence the complexity) is large,
the learning algorithm has to select the best combination from a large number
of possibilities, using a limited training set. In such cases, there is a high
chance for the learning algorithm to pick a spurious combination of
parameters that maximizes the evaluation metric just by random chance. This
is similar to the multiple comparisons problem (also referred to as the multiple
testing problem) in statistics.
As an illustration of the multiple comparisons problem, consider the task of
predicting whether the stock market will rise or fall in the next ten trading
days. If a stock analyst simply makes random guesses, the probability that her
prediction is correct on any trading day is 0.5. However, the probability that
she will predict correctly at least nine out of ten times is
$$\frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = \frac{11}{1024} \approx 0.0107,$$
which is extremely low.
Suppose we are interested in choosing an investment advisor from a pool of
200 stock analysts. Our strategy is to select the analyst who makes the most
correct predictions in the next ten trading days. The flaw in this
strategy is that even if all the analysts make their predictions in a random
fashion, the probability that at least one of them makes at least nine correct
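predictions is
$$1 - (1 - p)^{200} \approx 0.8847,$$
where p = 11/1024 ≈ 0.0107 is the probability that a single randomly guessing
analyst makes at least nine correct predictions. This probability
is very high. Although each analyst has a low probability of predicting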
at least nine times correctly, considered together, we have a high probability
of finding at least one analyst who can do so. However, there is no guarantee
in the future that such an analyst will continue to make accurate predictions
by random guessing.
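The two probabilities in this illustration can be checked numerically. The following Python sketch assumes independent analysts who each guess correctly with probability 0.5 on every day; the function names are illustrative and do not come from the text.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def prob_any_succeeds(p_single: float, m: int) -> float:
    """P(at least one of m independent guessers succeeds)."""
    return 1 - (1 - p_single) ** m

# One random analyst: at least nine correct calls in ten trading days.
p_one = prob_at_least(9, 10)
# Best of 200 random analysts: at least one of them reaches nine correct calls.
p_any = prob_any_succeeds(p_one, 200)
print(f"single analyst: {p_one:.4f}, any of 200: {p_any:.4f}")
```

Raising the pool size from 200 to a larger number pushes the second probability even closer to 1, which is the core of the multiple comparisons problem.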
How does the multiple comparisons problem relate to model overfitting? In
the context of learning a classification model, each combination of parameter
values corresponds to an analyst, while the number of training instances
corresponds to the number of days. Analogous to the task of selecting the
best analyst who makes the most accurate predictions on consecutive days,
the task of a learning algorithm is to select the best combination of
parameters that results in the highest accuracy on the training set. If the
number of parameter combinations is large but the training size is small, it is
highly likely for the learning algorithm to choose a spurious parameter
combination that provides high training accuracy just by random chance. In
the following example, we illustrate the phenomenon of overfitting due to
multiple comparisons in the context of decision tree induction.
Figure 3.26.
Example of a two-dimensional (X-Y) data set.
Figure 3.27.
Training and test error rates illustrating the effect of multiple
comparisons problem on model overfitting.
Example 3.7. Multiple Comparisons and Overfitting
Consider the two-dimensional data set shown in Figure 3.26 containing 500 +
and 500 ∘ instances, which is similar to the data shown in Figure 3.19. In
this data set, the distributions of both classes are well-separated in the
two-dimensional (X-Y) attribute space, but neither of the two attributes (X or Y) is
sufficiently informative to be used alone for separating the two classes.
Hence, splitting the data set based on any value of an X or Y attribute will
provide close to zero reduction in an impurity measure. However, if X and Y
attributes are used together in the splitting criterion (e.g., splitting X at 10
and Y at 10), the two classes can be effectively separated.
Figure 3.28.
Decision tree with 6 leaf nodes using X and Y as attributes. Splits
have been numbered from 1 to 5 in order of their occurrence in the tree.
Figure 3.27(a) shows the training and test error rates for learning decision
trees of varying sizes, when 30% of the data is used for training and the
remainder of the data for testing. We can see that the two classes can be
separated using a small number of leaf nodes. Figure 3.28 shows the decision
boundaries for the tree with six leaf nodes, where the splits have been
numbered according to their order of appearance in the tree. Note that
even though splits 1 and 3 provide trivial gains, their subsequent splits (2, 4,
and 5) provide large gains, resulting in effective discrimination of the two
classes.
Assume we add 100 irrelevant attributes to the two-dimensional X-Y data.
Learning a decision tree from this resultant data will be challenging because
the number of candidate attributes to choose for splitting at every internal
node will increase from two to 102. With such a large number of candidate
attribute test conditions to choose from, it is quite likely that spurious
attribute test conditions will be selected at internal nodes because of the
multiple comparisons problem. Figure 3.27(b) shows the training and test
error rates after adding 100 irrelevant attributes to the training set. We can see
that the test error rate remains close to 0.5 even after using 50 leaf nodes,
while the training error rate keeps on declining and eventually becomes 0.
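The effect of irrelevant attributes can be imitated with a much simpler experiment than the one behind Figure 3.27(b). The sketch below is a deliberate simplification and not the experiment from the text: instead of growing a full decision tree, it selects the single best binary attribute (a one-level stump) among n_attrs purely random attributes. Because labels and attributes are generated independently, any training accuracy above 0.5 is spurious by construction. All names and parameter values are hypothetical.

```python
import random

def best_random_stump(n_train, n_test, n_attrs, seed=0):
    """Pick the random binary attribute that best fits the training labels.

    Labels and attributes are generated independently, so every attribute
    is irrelevant and the true accuracy of any stump is 0.5.
    Returns (training accuracy, test accuracy) of the selected stump.
    """
    rng = random.Random(seed)
    y_train = [rng.randint(0, 1) for _ in range(n_train)]
    y_test = [rng.randint(0, 1) for _ in range(n_test)]
    best_train, best_test = -1.0, 0.0
    for _ in range(n_attrs):
        x_train = [rng.randint(0, 1) for _ in range(n_train)]
        x_test = [rng.randint(0, 1) for _ in range(n_test)]
        for flip in (0, 1):  # predict the attribute value itself, or its complement
            train_acc = sum((x ^ flip) == y for x, y in zip(x_train, y_train)) / n_train
            if train_acc > best_train:
                test_acc = sum((x ^ flip) == y for x, y in zip(x_test, y_test)) / n_test
                best_train, best_test = train_acc, test_acc
    return best_train, best_test

# More candidate attributes -> higher (spurious) training accuracy,
# while test accuracy stays near 0.5.
for m in (1, 10, 100):
    train_acc, test_acc = best_random_stump(n_train=30, n_test=1000, n_attrs=m)
    print(m, round(train_acc, 3), round(test_acc, 3))
```

With a fixed seed the candidate sets are nested, so the selected training accuracy can only stay the same or grow as n_attrs increases, mirroring the analyst-selection argument.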
3.5 Model Selection
There are many possible classification models with varying levels of model
complexity that can be used to capture patterns in the training data. Among
these possibilities, we want to select the model that shows the lowest
generalization error rate. The process of selecting a model with the right level
of complexity, which is expected to generalize well over unseen test
instances, is known as model selection. As described in the previous section,
the training error rate cannot be reliably used as the sole criterion for model
selection. In the following, we present three generic approaches to estimate
the generalization performance of a model that can be used for model
selection. We conclude this section by presenting specific strategies for using
these approaches in the context of decision tree induction.
3.5.1 Using a Validation Set
Note that we can always estimate the generalization error rate of a model by
using “out-of-sample” estimates, i.e., by evaluating the model on a separate
validation set that is not used for training the model. The error rate on the
validation set, termed as the validation error rate, is a better indicator of
generalization performance than the training error rate, since the validation
set has not been used for training the model. The validation error rate can be
used for model selection as follows.
Given a training set D.train, we can partition D.train into two smaller
subsets, D.tr and D.val, such that D.tr is used for training while D.val is used
as the validation set. For example, two-thirds of D.train can be reserved as
D.tr for training, while the remaining one-third is used as D.val for
computing validation error rate. For any choice of classification model m that
is trained on D.tr, we can estimate its validation error rate on D.val,
errval(m). The model that shows the lowest value of errval(m) can then be
selected as the preferred choice of model.
The use of a validation set provides a generic approach for model selection.
However, one limitation of this approach is that it is sensitive to the sizes of
D.tr and D.val, obtained by partitioning D.train. If the size of D.tr is too
small, it may result in the learning of a poor classification model with substandard performance, since a smaller training set will be less representative
of the overall data. On the other hand, if the size of D.val is too small, the
validation error rate might not be reliable for selecting models, as it would be
computed over a small number of instances.
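The procedure of partitioning D.train into D.tr and D.val and choosing the model with the lowest validation error rate can be sketched as follows. This is a minimal illustration under stated assumptions: instances are (x, y) pairs with one numeric attribute, and the two candidate learners (a majority-class predictor and a single-threshold rule) as well as the toy data are made up for the example.

```python
import random

def split_train_validation(d_train, frac_tr=2 / 3, seed=0):
    """Randomly partition D.train into D.tr and D.val."""
    shuffled = list(d_train)
    random.Random(seed).shuffle(shuffled)
    cut = round(frac_tr * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def error_rate(model, data):
    """Fraction of (x, y) instances that the model misclassifies."""
    return sum(model(x) != y for x, y in data) / len(data)

def fit_majority(d_tr):
    """Hypothetical learner 1: always predict the majority class of D.tr."""
    label = int(2 * sum(y for _, y in d_tr) >= len(d_tr))
    return lambda x: label

def fit_threshold(d_tr):
    """Hypothetical learner 2: predict 1 if x > t, with t chosen on D.tr."""
    def train_errors(t):
        return sum(int(x > t) != y for x, y in d_tr)
    t = min((x for x, _ in d_tr), key=train_errors)
    return lambda x: int(x > t)

def select_model(learners, d_tr, d_val):
    """Train each candidate on D.tr; return the name with the lowest err_val."""
    err_val = {name: error_rate(fit(d_tr), d_val) for name, fit in learners.items()}
    return min(err_val, key=err_val.get), err_val

# Toy labeled data (made up for illustration): class 1 iff x >= 15.
d_train = [(x, int(x >= 15)) for x in range(30)]
d_tr, d_val = split_train_validation(d_train)
learners = {"majority": fit_majority, "threshold": fit_threshold}
best, err_val = select_model(learners, d_tr, d_val)
print(best, err_val)
```

Changing frac_tr exposes the trade-off described above: a small D.tr weakens the learned models, while a small D.val makes the error estimates noisy.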
Figure 3.29.
Class distribution of validation data for the two decision trees
shown in Figure 3.30.
Example 3.8. Validation Error
In the following example, we illustrate one possible approach for using a
validation set in decision tree induction. Figure 3.29 shows the predicted
labels at the leaf nodes of the decision trees generated in Figure 3.30. The
counts given beneath the leaf nodes represent the proportion of data instances
in the validation set that reach each of the nodes. Based on the predicted
labels of the nodes, the validation error rate for the left tree is
errval(TL)=6/16=0.375, while the validation error rate for the right tree is
errval(TR)=4/16=0.25. Based on their validation error rates, the right tree is
preferred over the left one.
3.5.2 Incorporating Model Complexity
Since the chance for model overfitting increases as the model becomes more
complex, a model selection approach should not only consider the training
error rate but also the model complexity. This strategy is inspired by a
well-known principle called Occam’s razor or the principle of parsimony,
which suggests that given two models with the same errors, the simpler
model is preferred over the more complex model. A generic approach to
account for model complexity while estimating generalization performance is
formally described as follows.
Given a training set D.train, let us consider learning a classification model m
that belongs to a certain class of models, M. For example, if M represents the
set of all possible decision trees, then m can correspond to a specific decision
tree learned from the training set. We are interested in estimating the
generalization error rate of m, gen.error(m). As discussed previously, the
training error rate of m, train.error(m, D.train), can under-estimate
gen.error(m) when the model complexity is high. Hence, we represent
gen.error(m) as a function of not just the training error rate but also the
model complexity of M, complexity(M), as follows:
gen.error(m)=train.error(m, D.train)+α×complexity(M), (3.11)
where α is a hyper-parameter that strikes a balance between minimizing
training error and reducing model complexity. A higher value of α gives more
emphasis to the model complexity in the estimation of generalization
performance. To choose the right value of α, we can make use of the
validation set in a similar way as described in Section 3.5.1. For example, we can
iterate through a range of values of α and for every possible value, we can
learn a model on a subset of the training set, D.tr, and compute its validation
error rate on a separate subset, D.val. We can then select the value of α that
provides the lowest validation error rate.
Equation 3.11 provides one possible approach for incorporating model
complexity into the estimate of generalization performance. This approach is
at the heart of a number of techniques for estimating generalization
performance, such as the structural risk minimization principle, Akaike’s
Information Criterion (AIC), and the Bayesian Information Criterion (BIC).
The structural risk minimization principle serves as the building block for
learning support vector machines, which will be discussed later in Chapter 4.
For more details on AIC and BIC, see the Bibliographic Notes.
In the following, we present two different approaches for estimating the
complexity of a model, complexity(M). While the former is specific to
decision trees, the latter is more generic and can be used with any class of
models.
Estimating the Complexity of Decision Trees
In the context of decision trees, the complexity of a decision tree can be
estimated as the ratio of the number of leaf nodes to the number of training
instances. Let k be the number of leaf nodes and Ntrain be the number of
training instances. The complexity of a decision tree can then be described as
k/Ntrain. This reflects the intuition that for a larger training size, we can learn
a decision tree with a larger number of leaf nodes without it becoming overly
complex. The generalization error rate of a decision tree T can then be
computed using Equation 3.11 as follows:
errgen(T)=err(T)+Ω×k/Ntrain,
where err(T) is the training error of the decision tree and Ω is a
hyper-parameter that makes a trade-off between reducing training error and
minimizing model complexity, similar to the use of α in Equation 3.11. Ω can
be viewed as the cost of adding a leaf node relative to incurring a
training error. In the literature on decision tree induction, the above approach
for estimating generalization error rate is also termed as the pessimistic error
estimate. It is called pessimistic as it assumes the generalization error rate to
be worse than the training error rate (by adding a penalty term for model
complexity). On the other hand, simply using the training error rate as an
estimate of the generalization error rate is called the optimistic error
estimate or the resubstitution estimate.
Example 3.9. Generalization Error
Consider the two binary decision trees, TL and TR, shown in Figure 3.30.
Both trees are generated from the same training data and TL is generated by
expanding three leaf nodes of TR. The counts shown in the leaf nodes of the
trees represent the class distribution of the training instances. If each leaf
node is labeled according to the majority class of training instances that reach
the node, the training error rate for the left tree will be err(TL)=4/24=0.167,
while the training error rate for the right tree will be err(TR)=6/24=0.25.
Based on their training error rates alone, TL would be preferred over TR, even
though TL is more complex (contains a larger number of leaf nodes) than TR.
Figure 3.30.
Example of two decision trees generated from the same training data.
Now, assume that the cost associated with each leaf node is Ω=0.5. Then, the
generalization error estimate for TL will be
errgen(TL)=4/24+0.5×7/24=7.5/24=0.3125
and the generalization error estimate for TR will be
errgen(TR)=6/24+0.5×4/24=8/24=0.3333.
Since TL has a lower generalization error rate, it will still be preferred over
TR. Note that Ω=0.5 implies that a node should always be expanded into its
two child nodes if it improves the prediction of at least one training instance,
since expanding a node is less costly than misclassifying a training instance.
On the other hand, if Ω=1, then the generalization error rate for TL is
errgen(TL)=11/24=0.458 and for TR is errgen(TR)=10/24=0.417. In this
case, TR will be preferred over TL because it has a lower generalization error
rate. This example illustrates that different choices of Ω can change our
preference of decision trees based on their generalization error estimates.
However, for a given choice of Ω, the pessimistic error estimate provides an
approach for modeling the generalization performance on unseen test
instances. The value of Ω can be selected with the help of a validation set.
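The arithmetic of this example can be reproduced with a few lines of Python. The sketch assumes 7 leaf nodes for TL and 4 for TR; these counts are not stated explicitly in the text but are inferred from the example's own totals (11/24 and 10/24 at Ω=1, with 4 and 6 training errors respectively). The function name is illustrative.

```python
def pessimistic_error(n_errors, n_leaves, n_train, omega):
    """err_gen(T) = err(T) + omega * k / N_train, with err(T) = n_errors / N_train."""
    return (n_errors + omega * n_leaves) / n_train

# (training errors, leaf nodes) for the two trees; N_train = 24.
# Leaf counts are an inference from the example's totals, not a quoted value.
trees = {"TL": (4, 7), "TR": (6, 4)}

for omega in (0.5, 1.0):
    estimates = {
        name: pessimistic_error(errors, leaves, 24, omega)
        for name, (errors, leaves) in trees.items()
    }
    preferred = min(estimates, key=estimates.get)
    print(omega, {n: round(v, 4) for n, v in estimates.items()}, "->", preferred)
```

Sweeping omega over a finer grid shows where the preference flips from TL to TR, which is the kind of choice a validation set can settle.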
Minimum Description Length
Another way to incorporate model complexity is based on an
information-theoretic approach known as the minimum description length or MDL
principle. To illustrate this approach, consider the example shown in Figure
3.31. In this example, both person A and person B are given a set of instances
with known attribute values x. Assume person A knows the class label y for
every instance, while person B has no such information. A would like to share
the class information with B by sending a message containing the labels. The
message would contain Θ(N) bits of information, where N is the number of
instances.
Figure 3.31.
An illustration of the minimum description length principle.
Alternatively, instead of sending the class labels explicitly, A can build a
classification model from the instances and transmit it to B. B can then apply
the model to determine the class labels of the instances. If the model is 100%
accurate, then the cost for transmission is equal to the number of bits required
to encode the model. Otherwise, A must also transmit information about
which instances are misclassified by the model so that B can reproduce the
same class labels. Thus, the overall transmission cost, which is equal to the
total description length of the message, is
Cost(model, data)=Cost(data|model)+α×Cost(model), (3.12)
where the first term on the right-hand side is the number of bits needed to
encode the misclassified instances, while the second term is the number of
bits required to encode the model. There is also a hyper-parameter α that
trades off the relative costs of the misclassified instances and the model.
Notice the similarity between this equation and the generic equation for
generalization error rate presented in Equation 3.11. A good model must have
a total description length less than the number of bits required to encode the
entire sequence of class labels. Furthermore, given two competing models,
the model with lower total description length is preferred. An example
showing how to compute the total description length of a decision tree is
given in Exercise 10 on page 189.
3.5.3 Estimating Statistical Bounds
Instead of using Equation 3.11 to estimate the generalization error rate of a
model, an alternative way is to apply a statistical correction to the training
error rate of the model that is indicative of its model complexity. This can be
done if the probability distribution of training error is available or can be
assumed. For example, the number of errors committed by a leaf node in a
decision tree can be assumed to follow a binomial distribution. We can thus
compute an upper bound on the observed training error rate that can be
used for model selection, as illustrated in the following example.
Example 3.10. Statistical Bounds on Training Error
Consider the left-most branch of the binary decision trees shown in Figure
3.30. Observe that the left-most leaf node of TR has been expanded into two
child nodes in TL. Before splitting, the training error rate of the node is
2/7=0.286. By approximating a binomial distribution with a normal
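distribution, the following upper bound of the training error rate e can be
derived:
$$e_{upper}(N, e, \alpha) = \frac{e + \dfrac{z_{\alpha/2}^{2}}{2N} + z_{\alpha/2}\sqrt{\dfrac{e(1-e)}{N} + \dfrac{z_{\alpha/2}^{2}}{4N^{2}}}}{1 + \dfrac{z_{\alpha/2}^{2}}{N}}, \qquad (3.13)$$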
where α is the confidence level, zα/2 is the standardized value from a
standard normal distribution, and N is the total number of training instances
used to compute e. By substituting α=25%, N=7, and e=2/7, the upper bound for
the error rate is eupper(7, 2/7, 0.25)=0.503, which corresponds to
7×0.503=3.521 errors. If we expand the node into its child nodes as shown in
TL, the training error rates for the child nodes are 1/4=0.250 and 1/3=0.333,
respectively. Using Equation (3.13), the upper bounds of these error rates are
eupper(4, 1/4, 0.25)=0.537 and eupper(3, 1/3, 0.25)=0.650, respectively. The
overall training error of the child nodes is 4×0.537+3×0.650=4.098, which is
larger than the estimated error for the corresponding node in TR, suggesting
that it should not be split.
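The numbers in this example can be reproduced with a short Python sketch of the upper bound. It assumes that the standardized value is the standard normal quantile at 1 − α/2 (about 1.15 for α=0.25); this reading is an assumption, adopted because it matches the values 0.503, 0.537, and 0.650 quoted above. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def e_upper(n, e, alpha):
    """Upper bound on a training error rate e observed over n instances.

    Assumption: z is the standard normal quantile at 1 - alpha/2.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    z2 = z * z
    numerator = e + z2 / (2 * n) + z * sqrt(e * (1 - e) / n + z2 / (4 * n * n))
    return numerator / (1 + z2 / n)

# Estimated number of errors before and after splitting the left-most node.
before = 7 * e_upper(7, 2 / 7, 0.25)
after = 4 * e_upper(4, 1 / 4, 0.25) + 3 * e_upper(3, 1 / 3, 0.25)
print(round(before, 2), round(after, 2), "split" if after < before else "do not split")
```

Because the bound widens as N shrinks, small child nodes are penalized more heavily than their parent, which is why the split is rejected here even though it lowers the raw training error.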
3.5.4 Model Selection for Decision Trees
Building on the generic approaches presented above, we present two
commonly used model selection strategies for decision tree induction.
Prepruning (Early Stopping Rule)
In this approach, the tree-growing algorithm is halted before generating a
fully grown tree that perfectly fits the entire training data. To do this, a more
restrictive stopping condition must be used; e.g., stop expanding a leaf node
when the observed gain in the generalization error estimate falls below a
certain threshold. This estimate of the generalization error rate can be
computed using any of the approaches presented in the preceding three
subsections, e.g., by using pessimistic error estimates, by using validation
error estimates, or by using statistical bounds. The advantage of prepruning is
that it avoids the computations associated with generating overly complex
subtrees that overfit the training data. However, one major drawback of this
method is that, even if no significant gain is obtained using one of the
existing splitting criteria, subsequent splitting may result in better subtrees.
Such subtrees would not be reached if prepruning is used because of the
greedy nature of decision tree induction.
Post-pruning
In this approach, the decision tree is initially grown to its maximum size. This
is followed by a tree-pruning step, which proceeds to trim the fully grown
tree in a bottom-up fashion. Trimming can be done by replacing a subtree
with (1) a new leaf node whose class label is determined from the majority
class of instances affiliated with the subtree (approach known as subtree
replacement), or (2) the most frequently used branch of the subtree
(approach known as subtree raising). The tree-pruning step terminates when
no further improvement in the generalization error estimate is observed
beyond a certain threshold. Again, the estimates of generalization error rate
can be computed using any of the approaches presented in the previous three
subsections. Post-pruning tends to give better results than prepruning because
it makes pruning decisions based on a fully grown tree, unlike prepruning,
which can suffer from premature termination of the tree-growing process.
However, for post-pruning, the additional computations needed to grow the
full tree may be wasted when the subtree is pruned.
Figure 3.32 illustrates the simplified decision tree model for the web robot
detection example given in Section 3.3.5. Notice that the subtree rooted at
depth=1 has been replaced by one of its branches corresponding to
breadth<=7, width>3, and MultiP=1, using subtree raising. On the other
hand, the subtree corresponding to depth>1 and MultiAgent=0 has been
replaced by a leaf node assigned to class 0, using subtree replacement. The
subtree for depth>1 and MultiAgent=1 remains intact.
Figure 3.32.
Post-pruning of the decision tree for web robot detection.
3.6 Model Evaluation
The previous section discussed several approaches for model selection that
can be used to learn a classification model from a training set D.train. Here
we discuss methods for estimating its generalization performance, i.e., its
performance on unseen instances outside of D.train. This process is known as
model evaluation.
Note that model selection approaches discussed in Section 3.5 also compute
an estimate of the generalization performance using the training set D.train.
However, these estimates are biased indicators of the performance on unseen
instances, since they were used to guide the selection of the classification model.
For example, if we use the validation error rate for model selection (as
described in Section 3.5.1), the resulting model would be deliberately chosen
to minimize the errors on the validation set. The validation error rate may
thus under-estimate the true generalization error rate, and hence cannot be
reliably used for model evaluation.
A correct approach for model evaluation would be to assess the performance
of a learned model on a labeled test set that has not been used at any stage of
model selection. This can be achieved by partitioning the entire set of labeled
instances D into two disjoint subsets: D.train, which is used for model
selection, and D.test, which is used for computing the test error rate, errtest. In
the following, we present two different approaches for partitioning D into
D.train and D.test, and computing the test error rate, errtest.
3.6.1 Holdout Method
The most basic technique for partitioning a labeled data set is the holdout
method, where the labeled set D is randomly partitioned into two disjoint
sets, called the training set D.train and the test set D.test. A classification
model is then induced from D.train using the model selection approaches
presented in Section 3.5, and its error rate on D.test, errtest, is used as an
estimate of the generalization error rate. The proportion of data reserved for
training and for testing is typically at the discretion of the analysts, e.g.,
two-thirds for training and one-third for testing.
Similar to the trade-off faced while partitioning D.train into D.tr and D.val in
Section 3.5.1, choosing the right fraction of labeled data to be used for
training and testing is non-trivial. If the size of D.train is small, the
classification model may be poorly learned from an insufficient number
of training instances, resulting in a biased estimation of generalization
performance. On the other hand, if the size of D.test is small, errtest may be
less reliable as it would be computed over a small number of test instances.
Moreover, errtest can have a high variance as we change the random
partitioning of D into D.train and D.test.
The holdout method can be repeated several times to obtain a distribution of
the test error rates, an approach known as random subsampling or repeated
holdout method. This method produces a distribution of the error rates that
can be used to understand the variance of errtest.
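Random subsampling can be sketched in a few lines of Python. The example below is illustrative only: the learner (a majority-class predictor), the toy data, the two-thirds split, and the number of repetitions are all assumptions chosen to keep the code self-contained.

```python
import random
from statistics import mean, pvariance

def holdout_error(data, fit, frac_train=2 / 3, seed=0):
    """One holdout run: random split into D.train and D.test, return err_test."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(frac_train * len(shuffled))
    d_train, d_test = shuffled[:cut], shuffled[cut:]
    model = fit(d_train)
    return sum(model(x) != y for x, y in d_test) / len(d_test)

def repeated_holdout(data, fit, repeats=20, frac_train=2 / 3):
    """Repeat the holdout method with a different random partition each time."""
    return [holdout_error(data, fit, frac_train, seed=r) for r in range(repeats)]

def fit_majority(d_train):
    """Hypothetical learner: always predict the majority class of D.train."""
    label = int(2 * sum(y for _, y in d_train) >= len(d_train))
    return lambda x: label

# Made-up data: one third of the 60 instances are positive.
data = [(x, int(x % 3 == 0)) for x in range(60)]
errors = repeated_holdout(data, fit_majority)
print("mean err_test:", round(mean(errors), 3), "variance:", round(pvariance(errors), 5))
```

The spread of the values in errors gives a direct view of how much errtest depends on the particular random partition.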
3.6.2 Cross-Validation
Cross-validation is a widely-used model evaluation method that aims to make
effective use of all labeled instances in D for both training and testing. To
illustrate this method, suppose that we are given a labeled set that we have
randomly partitioned into three equal-sized subsets, S1, S2, and S3, as shown
in Figure 3.33. For the first run, we train a model using subsets S2 and S3
(shown as empty blocks) and test the model on subset S1. The test error rate
on S1, denoted as err(S1), is thus computed in the first run. Similarly, for the
second run, we use S1 and S3 as the training set and S2 as the test set, to
compute the test error rate, err(S2), on S2. Finally, we use S1 and S2 for
training in the third run, while S3 is used for testing, thus resulting in the test
error rate err(S3) for S3. The overall test error rate is obtained by summing
up the number of errors committed in each test subset across all runs and
dividing it by the total number of instances. This approach is called three-fold
cross-validation.
Figure 3.33.
Example demonstrating the technique of 3-fold cross-validation.
The k-fold cross-validation method generalizes this approach by segmenting
the labeled data D (of size N) into k equal-sized partitions (or folds). During
the ith run, one of the partitions of D is chosen as D.test(i) for testing, while
the rest of the partitions are used as D.train(i) for training. A model m(i) is
learned using D.train(i) and applied on D.test(i) to obtain the sum of test
errors, errsum(i). This procedure is repeated k times. The total test error rate,
errtest, is then computed as
$$err_{test} = \frac{\sum_{i=1}^{k} err_{sum}(i)}{N}.$$
Every instance in the data is thus used for testing exactly once and for
training exactly (k−1) times. Also, every run uses (k−1)/k fraction of the data
for training and 1/k fraction for testing.
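The k-fold procedure can be sketched as follows. This is a minimal illustration, assuming instances are (x, y) pairs and using a hypothetical majority-class learner as a stand-in for any model selection approach; when N is not divisible by k, the folds differ in size by at most one instance.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validation_error(data, fit, k, seed=0):
    """err_test = (sum over runs of errors on the held-out fold) / N."""
    folds = k_fold_indices(len(data), k, seed)
    total_errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        d_train = [inst for j, inst in enumerate(data) if j not in held_out]
        d_test = [data[j] for j in test_idx]
        model = fit(d_train)  # m(i), learned on D.train(i)
        total_errors += sum(model(x) != y for x, y in d_test)  # err_sum(i)
    return total_errors / len(data)

def fit_majority(d_train):
    """Hypothetical learner: always predict the majority class of D.train."""
    label = int(2 * sum(y for _, y in d_train) >= len(d_train))
    return lambda x: label

# Made-up data: one quarter of the 40 instances are positive.
data = [(x, int(x % 4 == 0)) for x in range(40)]
print(cross_validation_error(data, fit_majority, k=5))
```

Setting k equal to len(data) turns the same function into the leave-one-out approach, at the cost of N model fits.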
The right choice of k in k-fold cross-validation depends on a number of
characteristics of the problem. A small value of k will result in a smaller
training set at every run, which will result in a larger estimate of
generalization error rate than what is expected of a model trained over the
entire labeled set. On the other hand, a high value of k results in a larger
training set at every run, which reduces the bias in the estimate of
generalization error rate. In the extreme case, when k=N, every run uses
exactly one data instance for testing and the remainder of the data for training.
This special case of k-fold cross-validation is called the leave-one-out
approach. This approach has the advantage of utilizing as much data as
possible for training. However, leave-one-out can produce quite misleading
results in some special scenarios, as illustrated in Exercise 11. Furthermore,
leave-one-out can be computationally expensive for large data sets as the
cross-validation procedure needs to be repeated N times. For most practical
applications, the choice of k between 5 and 10 provides a reasonable
approach for estimating the generalization error rate, because each run is
able to make use of 80% to 90% of the labeled data for training.
The k-fold cross-validation method, as described above, produces a single
estimate of the generalization error rate, without providing any information
about the variance of the estimate. To obtain this information, we can run
k-fold cross-validation for every possible partitioning of the data into k
partitions, and obtain a distribution of test error rates computed for every
such partitioning. The average test error rate across all possible partitionings
serves as a more robust estimate of generalization error rate. This approach of
estimating the generalization error rate and its variance is known as the
complete cross-validation approach. Even though such an estimate is quite
robust, it is usually too expensive to consider all possible partitionings of a
large data set into k partitions. A more practical solution is to repeat the
cross-validation approach multiple times, using a different random
partitioning of the data into k partitions each time, and use the average test
error rate as the estimate of generalization error rate. Note that since there is
only one possible partitioning for the leave-one-out approach, it is not
possible to estimate the variance of generalization error rate, which is another
limitation of this method.
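Repeated cross-validation with a fresh random partitioning at every repetition can be sketched as follows. The helper names (`cv_error`, `repeated_cv`), the toy `nearest_mean_learner`, and the synthetic Gaussian data are assumptions made for illustration only.

```python
import random
import statistics

def nearest_mean_learner(X_train, y_train):
    """Toy 1-D learner (an assumption): predict the class whose mean is closest."""
    means = {}
    for label in sorted(set(y_train)):
        vals = [x for x, t in zip(X_train, y_train) if t == label]
        means[label] = sum(vals) / len(vals)
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def cv_error(X, y, k, train_fn, rng):
    """One k-fold cross-validation run using a random partitioning drawn from rng."""
    n = len(X)
    idx = list(range(n))
    rng.shuffle(idx)
    errors = 0
    for f in range(k):
        test_idx = idx[f::k]
        held_out = set(test_idx)
        train_idx = [j for j in range(n) if j not in held_out]
        predict = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        errors += sum(predict(X[j]) != y[j] for j in test_idx)
    return errors / n

def repeated_cv(X, y, k, train_fn, repeats=10, seed=0):
    """Mean and sample variance of the error rate over several random partitionings."""
    rng = random.Random(seed)
    rates = [cv_error(X, y, k, train_fn, rng) for _ in range(repeats)]
    return statistics.mean(rates), statistics.variance(rates)

# Synthetic 1-D data (an assumption): class c is drawn from a Gaussian centred at c.
data_rng = random.Random(42)
y = [i % 2 for i in range(60)]
X = [data_rng.gauss(label, 1.0) for label in y]

mean_err, var_err = repeated_cv(X, y, k=5, train_fn=nearest_mean_learner)
# With k = N every shuffle yields the same set of folds, so the variance is zero.
loo_mean, loo_var = repeated_cv(X, y, k=len(X), train_fn=nearest_mean_learner, repeats=3)
```

The zero variance in the leave-one-out case reflects the point made above: there is only one possible partitioning, so repetition gives no information about the variability of the estimate.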
The k-fold cross-validation does not guarantee that the fraction of positive
and negative instances in every partition of the data is equal to the fraction
observed in the overall data. A simple solution to this problem is to perform a
stratified sampling of the positive and negative instances into k partitions, an
approach called stratified cross-validation.
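One simple way to stratify is to deal the instances of each class into the folds in round-robin fashion. The sketch below assumes this scheme; the name `stratified_folds` is illustrative, and other stratification schemes are possible.

```python
import random
from collections import defaultdict

def stratified_folds(y, k, seed=0):
    """Split indices into k folds so that each fold roughly preserves class fractions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    offset = 0  # carried across classes so that fold sizes stay balanced
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for m in members:
            folds[offset % k].append(m)
            offset += 1
    return folds

# 20% positive instances overall; every fold should also contain 20% positives.
y = [1] * 10 + [0] * 40
folds = stratified_folds(y, k=5)
positives_per_fold = [sum(y[i] for i in fold) for fold in folds]
```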
In k-fold cross-validation, a different model is learned at every run and the
performance of every one of the k models on their respective test folds is then
aggregated to compute the overall test error rate, errtest. Hence, errtest does
not reflect the generalization error rate of any of the k models. Instead, it
reflects the expected generalization error rate of the model selection
approach, when applied on a training set of the same size as one of the
training folds (N(k−1)/k). This is different from the errtest computed in the
holdout method, which exactly corresponds to the specific model learned
over D.train. Hence, although effectively utilizing every data instance in D
for training and testing, the errtest computed in the cross-validation method
does not represent the performance of a single model learned over a specific
D.train.
Nonetheless, in practice, errtest is typically used as an estimate of the
generalization error of a model built on D. One motivation for this is that
when the size of the training folds is closer to the size of the overall data
(when k is large), then errtest resembles the expected performance of a model
learned over a data set of the same size as D. For example, when k is 10,
every training fold is 90% of the overall data. The errtest then should
approach the expected performance of a model learned over 90% of the
overall data, which will be close to the expected performance of a model
learned over D.
3.7 Presence of Hyper-parameters
Hyper-parameters are parameters of learning algorithms that need to be
determined before learning the classification model. For instance, consider
the hyper-parameter α that appeared in Equation 3.11, which is repeated here
for convenience. This equation was used for estimating the generalization
error for a model selection approach that used an explicit representation of
model complexity. (See Section 3.5.2.)
gen.error(m)=train.error(m, D.train)+α×complexity(M)
For other examples of hyper-parameters, see Chapter 4.
Unlike regular model parameters, such as the test conditions in the internal
nodes of a decision tree, hyper-parameters such as α do not appear in the final
classification model that is used to classify unlabeled instances. However, the
values of hyper-parameters need to be determined during model selection—a
process known as hyper-parameter selection—and must be taken into
account during model evaluation. Fortunately, both tasks can be effectively
accomplished via slight modifications of the cross-validation approach
described in the previous section.
3.7.1 Hyper-parameter Selection
In Section 3.5.2, a validation set was used to select α and this approach is
generally applicable for hyper-parameter selection. Let p be the hyper-parameter that needs to be selected from a finite range of values,
P = {p1, p2, …, pn}. Partition D.train into D.tr and D.val. For every choice of
hyper-parameter value pi, we can learn a model mi on D.tr, and apply this
model on D.val to obtain the validation error rate errval(pi). Let p* be the
hyper-parameter value that provides the lowest validation error rate. We can
then use the model m* corresponding to p* as the final choice of
classification model.
The above approach, although useful, uses only a subset of the data, D.tr,
for training and a subset, D.val, for validation. The framework of cross-validation, presented in Section 3.6.2, addresses both of those issues, albeit in
the context of model evaluation. Here we indicate how to use a cross-validation approach for hyper-parameter selection. To illustrate this approach,
let us partition D.train into three folds as shown in Figure 3.34. At every run,
one of the folds is used as D.val for validation, and the remaining two folds
are used as D.tr for learning a model, for every choice of hyper-parameter
value pi. The overall validation error rate corresponding to each pi is
computed by summing the errors across all the three folds. We then select the
hyper-parameter value p* that provides the lowest validation error rate, and
use it to learn a model m* on the entire training set D.train.
Figure 3.34.
Example demonstrating the 3-fold cross-validation framework for
hyper-parameter selection using D.train.
Algorithm 3.2 generalizes the above approach using a k-fold cross-validation
framework for hyper-parameter selection. At the ith run of cross-validation,
the data in the ith fold is used as D.val(i) for validation (Step 4), while the
remainder of the data in D.train is used as D.tr(i) for training (Step 5). Then
for every choice of hyper-parameter value pi, a model is learned on D.tr(i)
(Step 7), which is applied on D.val(i) to compute its validation error (Step 8).
This is used to compute the validation error rate corresponding to models
learned using pi over all the folds (Step 11). The hyper-parameter value p*
that provides the lowest validation error rate (Step 12) is now used to learn
the final model m* on the entire training set D.train (Step 13). Hence, at the
end of this algorithm, we obtain the best choice of the hyper-parameter value
as well as the final classification model (Step 14), both of which are obtained
by making an effective use of every data instance in D.train.
Algorithm 3.2 Procedure model-select(k, P, D.train)
1: Ntrain = |D.train|. {Size of D.train.}
2: Divide D.train into k partitions, D.train1 to D.traink.
3: for each run i = 1 to k do
4:   D.val(i) = D.traini. {Partition used for validation.}
5:   D.tr(i) = D.train \ D.traini. {Partitions used for training.}
6:   for each parameter p ∈ P do
7:     m = model-train(p, D.tr(i)). {Train model.}
8:     errsum(p, i) = model-test(m, D.val(i)). {Sum of validation errors.}
9:   end for
10: end for
11: errval(p) = ∑_{i=1}^{k} errsum(p, i) / Ntrain. {Compute validation error rate.}
12: p* = argmin_p errval(p). {Select best hyper-parameter value.}
13: m* = model-train(p*, D.train). {Learn final model on D.train.}
14: return (p*, m*).
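The steps of Algorithm 3.2 translate fairly directly into Python. The sketch below is one possible rendering under stated assumptions: `model_train(p, X, y)` returns a model, `model_test(model, X, y)` returns a count of errors, and the k-nearest-neighbor learner (with the number of neighbors as the hyper-parameter) and the small data set are hypothetical stand-ins chosen only to make the example runnable.

```python
from collections import Counter

def model_select(k, P, X, y, model_train, model_test):
    """Select a hyper-parameter value by k-fold cross-validation over the training set."""
    n = len(X)                                          # Step 1
    folds = [list(range(i, n, k)) for i in range(k)]    # Step 2 (assumes shuffled data)
    err_sum = {p: 0 for p in P}
    for val_idx in folds:                               # Steps 3-10
        held_out = set(val_idx)
        tr_idx = [j for j in range(n) if j not in held_out]
        X_tr, y_tr = [X[j] for j in tr_idx], [y[j] for j in tr_idx]
        X_val, y_val = [X[j] for j in val_idx], [y[j] for j in val_idx]
        for p in P:
            m = model_train(p, X_tr, y_tr)              # Step 7
            err_sum[p] += model_test(m, X_val, y_val)   # Step 8
    err_val = {p: err_sum[p] / n for p in P}            # Step 11
    p_star = min(P, key=lambda p: err_val[p])           # Step 12
    m_star = model_train(p_star, X, y)                  # Step 13
    return p_star, m_star, err_val                      # Step 14 (err_val added for inspection)

# Hypothetical learner: k-nearest neighbors on 1-D data; p is the number of neighbors.
def knn_train(p, X, y):
    return (p, list(X), list(y))

def knn_predict(model, x):
    p, X, y = model
    nearest = sorted(range(len(X)), key=lambda j: abs(X[j] - x))[:p]
    return Counter(y[j] for j in nearest).most_common(1)[0][0]

def knn_test(model, X, y):
    return sum(knn_predict(model, x) != t for x, t in zip(X, y))

X = [0.1, 0.4, 0.5, 0.9, 1.1, 1.3, 2.0, 2.2, 2.6, 2.9, 3.3, 3.5]
y = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
p_star, m_star, err_val = model_select(3, [1, 3, 5], X, y, knn_train, knn_test)
```

Returning `err_val` alongside (p*, m*) is a convenience added here; as discussed in the text, these validation error rates guide selection and should not be reported as estimates of generalization error.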
3.7.2 Nested Cross-Validation
The approach of the previous section provides a way to effectively use all the
instances in D.train to learn a classification model when hyper-parameter
selection is required. This approach can be applied over the entire data set D
to learn the final classification model. However, applying Algorithm 3.2 on D
would only return the final classification model m* but not an estimate of its
generalization performance, errtest. Recall that the validation error rates used
in Algorithm 3.2 cannot be used as estimates of generalization performance,
since they are used to guide the selection of the final model m*. However, to
compute errtest, we can again use a cross-validation framework for
evaluating the performance on the entire data set D, as described originally in
Section 3.6.2. In this approach, D is partitioned into D.train (for training) and
D.test (for testing) at every run of cross-validation. When hyper-parameters
are involved, we can use Algorithm 3.2 to train a model using D.train at
every run, thus “internally” using cross-validation for model selection. This
approach is called nested cross-validation or double cross-validation.
Algorithm 3.3 describes the complete approach for estimating errtest using
nested cross-validation in the presence of hyper-parameters.
As an illustration of this approach, see Figure 3.35 where the labeled set D is
partitioned into D.train and D.test, using a 3-fold cross-validation method.
Figure 3.35.
Example demonstrating 3-fold nested cross-validation for
computing errtest.
At the ith run of this method, one of the folds is used as the test set, D.test(i),
while the remaining two folds are used as the training set, D.train(i). This is
represented in Figure 3.35 as the ith “outer” run. In order to select a model
using D.train(i), we again use an “inner” 3-fold cross-validation framework
that partitions D.train(i) into D.tr and D.val at every one of the three inner
runs (iterations). As described in Section 3.7, we can use the inner cross-validation framework to select the best hyper-parameter value p*(i) as well as
its resulting classification model m*(i) learned over D.train(i). We can then
apply m*(i) on D.test(i) to obtain the test error at the ith outer run. By
repeating this process for every outer run, we can compute the average test
error rate, errtest, over the entire labeled set D. Note that in the above
approach, the inner cross-validation framework is being used for model
selection while the outer cross-validation framework is being used for model
evaluation.
Algorithm 3.3 The nested cross-validation approach for computing errtest.
1: Divide D into k partitions, D1 to Dk.
2: for each outer run i = 1 to k do
3:   D.test(i) = Di. {Partition used for testing.}
4:   D.train(i) = D \ Di. {Partitions used for model selection.}
5:   (p*(i), m*(i)) = model-select(k, P, D.train(i)). {Inner cross-validation.}
6:   errsum(i) = model-test(m*(i), D.test(i)). {Sum of test errors.}
7: end for
8: errtest = ∑_{i=1}^{k} errsum(i) / N. {Compute test error rate.}
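Nested cross-validation can be sketched by wrapping an inner selection routine inside an outer evaluation loop. The code below is self-contained and therefore carries its own compact version of the inner routine. The interfaces `model_train(p, X, y)` and `model_test(model, X, y)` (the latter returning an error count) and the helper names are assumptions, and the data are assumed to have been shuffled beforehand because the folds are formed by simple interleaving.

```python
def split_folds(n, k):
    """Interleaved partition of indices 0..n-1 into k folds."""
    return [list(range(i, n, k)) for i in range(k)]

def take(seq, idx):
    return [seq[j] for j in idx]

def model_select(k, P, X, y, model_train, model_test):
    """Inner cross-validation: return (p*, m*) for the given training set."""
    n = len(X)
    err_sum = {p: 0 for p in P}
    for val_idx in split_folds(n, k):
        held_out = set(val_idx)
        tr_idx = [j for j in range(n) if j not in held_out]
        for p in P:
            m = model_train(p, take(X, tr_idx), take(y, tr_idx))
            err_sum[p] += model_test(m, take(X, val_idx), take(y, val_idx))
    p_star = min(P, key=lambda p: err_sum[p])
    return p_star, model_train(p_star, X, y)

def nested_cv(k, P, X, y, model_train, model_test):
    """Outer cross-validation: return the test error rate over the labeled set."""
    n = len(X)
    total_errors = 0
    for test_idx in split_folds(n, k):                    # Steps 2-7
        held_out = set(test_idx)
        train_idx = [j for j in range(n) if j not in held_out]
        X_train, y_train = take(X, train_idx), take(y, train_idx)
        # Step 5: the test fold plays no part in selecting p* or learning m*.
        _, m_star = model_select(k, P, X_train, y_train, model_train, model_test)
        # Step 6: count errors of the selected model on the held-out test fold.
        total_errors += model_test(m_star, take(X, test_idx), take(y, test_idx))
    return total_errors / n                               # Step 8
```

Note that `model_select` is called once per outer run, so a k-fold outer loop with a k-fold inner loop and |P| candidate values trains on the order of k × k × |P| models.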
3.8 Pitfalls of Model Selection and Evaluation
Model selection and evaluation, when used effectively, serve as excellent
tools for learning classification models and assessing their generalization
performance. However, when using them in practical settings,
there are several pitfalls that can result in improper and often misleading
conclusions. Some of these pitfalls are simple to understand and easy to
avoid, while others are quite subtle in nature and difficult to catch. In the
following, we present two of these pitfalls and discuss best practices to avoid
them.
3.8.1 Overlap between Training and
Test Sets
One of the basic requirements of a clean model selection and evaluation
setup is that the data used for model selection (D.train) must be kept separate
from the data used for model evaluation (D.test). If there is any overlap
between the two, the test error rate errtest computed over D.test cannot be
considered representative of the performance on unseen instances.
Comparing the effectiveness of classification models using errtest can then be
quite misleading, as an overly complex model can show an inaccurately low
value of errtest due to model overfitting (see Exercise 12 at the end of this
chapter).
To illustrate the importance of ensuring no overlap between D.train and
D.test, consider a labeled data set where all the attributes are irrelevant, i.e.
they have no relationship with the class labels. Using such attributes, we
should expect no classification model to perform better than random
guessing. However, if the test set involves even a small number of data
instances that were used for training, there is a possibility for an overly
complex model to show better performance than random, even though the
attributes are completely irrelevant. As we will see later in Chapter 10, this
scenario can actually be used as a criterion to detect overfitting due to an
improper experimental setup. If a model shows better performance than a
random classifier even when the attributes are irrelevant, it is an indication of
a potential feedback between the training and test sets.
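The effect of overlap can be seen in a small simulation. The sketch below assumes a single irrelevant attribute drawn independently of the label and uses a 1-nearest-neighbor classifier as the overly complex model; both choices, as well as the sample sizes, are illustrative assumptions.

```python
import random

def one_nn(train):
    """Overly complex model (an assumption): memorize the training pairs (x, label)."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def accuracy(predict, data):
    return sum(predict(x) == label for x, label in data) / len(data)

rng = random.Random(0)
# Irrelevant attribute: x is drawn independently of the class label.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, clean_test = data[:200], data[200:]
# Leaky test set: half of its instances were also used for training.
leaky_test = clean_test[:100] + train[:100]

model = one_nn(train)
acc_clean = accuracy(model, clean_test)
acc_leaky = accuracy(model, leaky_test)
acc_seen = accuracy(model, train[:100])
```

Because the model memorizes its training instances, `acc_seen` is 1.0, which lifts `acc_leaky` well above what the clean test set reports, even though the attribute carries no information about the class. On the clean test set the accuracy should hover around that of random guessing.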
3.8.2 Use of Validation Error as
Generalization Error
The validation error rate errval serves an important role during model
selection, as it provides “out-of-sample” error estimates of models on D.val,
which is not used for training the models. Hence, errval serves as a better
metric than the training error rate for selecting models and hyper-parameter
values, as described in Sections 3.5.1 and 3.7, respectively. However, once
the validation set has been used for selecting a classification model m*, errval
no longer reflects the performance of m* on unseen instances.
To realize the pitfall in using validation error rate as an estimate of
generalization performance, consider the problem of selecting a hyper-parameter value p from a range of values P, using a validation set D.val. If
the number of possible values in P is quite large and the size of D.val is
small, it is possible to select a hyper-parameter value p* that shows favorable
performance on D.val just by random chance. Notice the similarity of this
problem with the multiple comparisons problem discussed in Section 3.4.1.
Even though the classification model m* learned using p* would show a low
validation error rate, it would lack generalizability on unseen test instances.
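A short simulation makes this selection effect concrete. It assumes an extreme setting in which every candidate hyper-parameter value yields a model that guesses at random (true error rate 0.5); the sizes of the validation set, the test set, and the candidate pool are arbitrary illustrative choices.

```python
import random

rng = random.Random(1)
n_val, n_test, n_candidates = 20, 2000, 200

def error_rate(n, rng):
    """Observed error rate of a random guesser on n binary-labeled instances."""
    return sum(rng.random() < 0.5 for _ in range(n)) / n

# Validation error of each candidate on the same small validation set size.
val_errors = [error_rate(n_val, rng) for _ in range(n_candidates)]
best = min(range(n_candidates), key=lambda i: val_errors[i])
err_val_best = val_errors[best]

# Error of the selected candidate on a fresh, independent test set.
err_test_best = error_rate(n_test, rng)
```

The selected candidate's validation error is the minimum of many noisy estimates and is therefore optimistically low, whereas its error on the independent test set stays close to 0.5.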
The correct approach for estimating the generalization error rate of a model
m* is to use an independently chosen test set D.test that hasn’t been used in
any way to influence the selection of m*. As a rule of thumb, the test set
should never be examined during model selection, to ensure the absence of
any form of overfitting. If the insights gained from any portion of a labeled
data set help in improving the classification model even in an indirect way,
then that portion of data must be discarded during testing.
3.9 Model Comparison*
One difficulty when comparing the performance of different classification
models is whether the observed difference in their performance is statistically
significant. For example, consider a pair of classification models, MA and
MB. Suppose MA achieves 85% accuracy when evaluated on a test set
containing 30 instances, while MB achieves 75% accuracy on a different test
set containing 5000 instances. Based on this information, is MA a better
model than MB? This example raises two key questions regarding the
statistical significance of a performance metric:
1. Although MA has a higher accuracy than MB, it was tested on a smaller
test set. How much confidence do we have that the accuracy for MA is
actually 85%?
2. Is it possible to explain the difference in accuracies between MA and
MB as a result of variations in the composition of their test sets?
The first question relates to the issue of estimating the confidence interval of
model accuracy. The second question relates to the issue of testing the
statistical significance of the observed deviation. These issues are
investigated in the remainder of this section.
3.9.1 Estimating the Confidence
Interval for Accuracy
To determine the confidence interval for accuracy, we need to establish the probability
distribution for sample accuracy. This section describes an approach for
deriving the confidence interval by modeling the classification task as a
binomial random experiment. The following describes characteristics of such
an experiment:
1. The random experiment consists of N independent trials, where each
trial has two possible outcomes: success or failure.
2. The probability of success, p, in each trial is constant.
An example of a binomial experiment is counting the number of heads that
turn up when a coin is flipped N times. If X is the number of successes
observed in N trials, then the probability that X takes a particular value is
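given by a binomial distribution with mean Np and variance Np(1−p):
\[ P(X=v) = \binom{N}{v} p^{v} (1-p)^{N-v}. \]
For example, if the coin is fair (p=0.5) and is flipped fifty times, then the
probability that the head shows up 20 times is
\[ P(X=20) = \binom{50}{20} (0.5)^{20} (1-0.5)^{30} = 0.0419. \]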
If the experiment is repeated many times, then the average number of heads
expected to show up is 50×0.5=25, while its variance is 50×0.5×0.5=12.5.
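The coin-flip figures can be checked numerically. The function name `binomial_pmf` is an illustrative choice; `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binomial_pmf(v, n, p):
    """Probability of exactly v successes in n independent trials with success rate p."""
    return comb(n, v) * p**v * (1 - p)**(n - v)

prob_20_heads = binomial_pmf(20, 50, 0.5)   # approximately 0.0419

# Mean and variance recovered from the distribution itself.
pmf = [binomial_pmf(v, 50, 0.5) for v in range(51)]
mean = sum(v * q for v, q in enumerate(pmf))                     # N p = 25
variance = sum((v - mean) ** 2 * q for v, q in enumerate(pmf))   # N p (1 - p) = 12.5
```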
The task of predicting the class labels of test instances can also be considered
as a binomial experiment. Given a test set that contains N instances, let X be
the number of instances correctly predicted by a model and p be the true
accuracy of the model. If the prediction task is modeled as a binomial
experiment, then X has a binomial distribution with mean Np and variance
Np(1−p). It can be shown that the empirical accuracy, acc=X/N, also has a
binomial distribution with mean p and variance p(1−p)/N (see Exercise 14).
The binomial distribution can be approximated by a normal distribution when
N is sufficiently large. Based on the normal distribution, the confidence
interval for acc can be derived as follows:
\[ P\left( -Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/N}} \le Z_{1-\alpha/2} \right) = 1 - \alpha, \tag{3.15} \]
where Zα/2 and Z1−α/2 are the upper and lower bounds obtained from a
standard normal distribution at confidence level (1−α). Since a standard
normal distribution is symmetric around Z=0, it follows that Zα/2=Z1−α/2.
Rearranging this inequality leads to the following confidence interval for p:
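\[ \frac{2 \times N \times acc + Z_{\alpha/2}^{2} \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^{2} + 4 N\, acc - 4 N\, acc^{2}}}{2 (N + Z_{\alpha/2}^{2})}. \tag{3.16} \]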
The following table shows the values of Zα/2 at different confidence levels:
1−α     0.99   0.98   0.95   0.9    0.8    0.7    0.5
Zα/2    2.58   2.33   1.96   1.65   1.28   1.04   0.67
Example 3.11. Confidence Interval for Accuracy
Consider a model that has an accuracy of 80% when evaluated on 100 test
instances. What is the confidence interval for its true accuracy at a 95%
confidence level? The confidence level of 95% corresponds to Zα/2=1.96
according to the table given above. Inserting this term into Equation 3.16
yields a confidence interval between 71.1% and 86.7%. The following table
shows the confidence interval when the number of instances, N, increases:
N                     20            50            100           500           1000          5000
Confidence interval   0.584–0.919   0.670–0.888   0.711–0.867   0.763–0.833   0.774–0.824   0.789–0.811
Note that the confidence interval becomes tighter when N increases.
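The interval in Equation 3.16 is easy to evaluate programmatically. The function name `accuracy_confidence_interval` is an illustrative choice, and the value of Zα/2 is passed in directly from the table of standard normal bounds.

```python
from math import sqrt

def accuracy_confidence_interval(acc, n, z):
    """Lower and upper bounds on the true accuracy p, following Equation 3.16."""
    center = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

# Accuracy of 80% at a 95% confidence level (Z = 1.96) for increasing test set sizes.
intervals = {n: accuracy_confidence_interval(0.8, n, 1.96)
             for n in (20, 50, 100, 500, 1000, 5000)}
low_100, high_100 = intervals[100]   # approximately 0.711 and 0.867
```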
3.9.2 Comparing the Performance
of Two Models
Consider a pair of models, M1 and M2, which are evaluated on two
independent test sets, D1 and D2. Let n1 denote the number of instances in
D1 and n2 denote the number of instances in D2. In addition, suppose the
error rate for M1 on D1 is e1 and the error rate for M2 on D2 is e2. Our goal
is to test whether the observed difference between e1 and e2 is statistically
significant.
Assuming that n1 and n2 are sufficiently large, the error rates e1 and e2 can
be approximated using normal distributions. If the observed difference in the
error rate is denoted as d=e1−e2, then d is also normally distributed with
mean dt, its true difference, and variance σd^2. The variance of d can be
computed as follows:
\[ \sigma_d^2 \simeq \hat{\sigma}_d^2 = \frac{e_1(1-e_1)}{n_1} + \frac{e_2(1-e_2)}{n_2}, \tag{3.17} \]
where e1(1−e1)/n1 and e2(1−e2)/n2 are the variances of the error rates.
Finally, at the (1−α)% confidence level, it can be shown that the confidence
interval for the true difference dt is given by the following equation:
\[ d_t = d \pm z_{\alpha/2} \hat{\sigma}_d. \tag{3.18} \]
Example 3.12. Significance Testing
Consider the problem described at the beginning of this section. Model MA
has an error rate of e1=0.15 when applied to n1=30 test instances, while
model MB has an error rate of e2=0.25 when applied to n2=5000 test
instances. The observed difference in their error rates is d=|0.15−0.25|=0.1. In
this example, we are performing a two-sided test to check whether dt=0 or
dt≠0. The estimated variance of the observed difference in error rates can be
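computed as follows:
\[ \hat{\sigma}_d^2 = \frac{0.15(1-0.15)}{30} + \frac{0.25(1-0.25)}{5000} = 0.0043, \]
or σ^d=0.0655. Inserting this value into Equation 3.18, we obtain the
following confidence interval for dt at 95% confidence level:
\[ d_t = 0.1 \pm 1.96 \times 0.0655 = 0.1 \pm 0.128. \]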
As the interval spans the value zero, we can conclude that the observed
difference is not statistically significant at a 95% confidence level.
At what confidence level can we reject the hypothesis that dt=0? To do this,
we need to determine the value of Zα/2 such that the confidence interval for
dt does not span the value zero. We can reverse the preceding computation
and look for the value Zα/2 such that d > Zα/2 σ^d. Replacing the values of d
and σ^d gives Zα/2 < 1.527. For a two-sided test, this bound is first met when
(1−α) ≲ 0.873, which is consistent with the table of Zα/2 values given earlier
(Zα/2=1.28 at 0.8 and 1.65 at 0.9). The result suggests that the null hypothesis
can be rejected at a confidence level of 87.3% or lower.
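The computations in this example can be reproduced with the standard library. The function name `difference_interval` is an illustrative choice; `statistics.NormalDist` (Python 3.8 or later) supplies the standard normal cumulative distribution function Φ.

```python
from math import sqrt
from statistics import NormalDist

def difference_interval(e1, n1, e2, n2, z):
    """Confidence interval for the true difference in error rates (Equations 3.17 and 3.18)."""
    d = abs(e1 - e2)
    sigma_d = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d - z * sigma_d, d + z * sigma_d, sigma_d

low, high, sigma_d = difference_interval(0.15, 30, 0.25, 5000, 1.96)
spans_zero = low < 0 < high               # True: not significant at the 95% level

# Largest Z for which the interval still excludes zero.
z_max = abs(0.15 - 0.25) / sigma_d        # approximately 1.527
phi = NormalDist().cdf(z_max)             # standard normal CDF at z_max, about 0.937
two_sided_level = 2 * phi - 1             # two-sided confidence level, about 0.873
```

The last two lines separate the value of Φ at the bound from the two-sided level 2Φ(z) − 1, since the two are easy to confuse when reading off a normal table.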
3.10 Bibliographic Notes
Early classification systems were developed to organize various collections
of objects, from living organisms to inanimate ones. Examples abound, from
Aristotle’s cataloguing of species to the Dewey Decimal and Library of
Congress classification systems for books. Such a task typically requires
considerable human efforts, both to identify properties of the objects to be
classified and to organize them into well distinguished categories.
With the development of statistics and computing, automated classification
has been a subject of intensive research. The study of classification in
classical statistics is sometimes known as discriminant analysis, where the
objective is to predict the group membership of an object based on its
corresponding features. A well-known classical method is Fisher’s linear
discriminant analysis [142], which seeks to find a linear projection of the data
that produces the best separation between objects from different classes.
Many pattern recognition problems also require the discrimination of objects
from different classes. Examples include speech recognition, handwritten
character identification, and image classification. Readers who are interested
in the application of classification techniques for pattern recognition may
refer to the survey articles by Jain et al. [150] and Kulkarni et al. [157] or
classic pattern recognition books by Bishop [125], Duda et al. [137], and
Fukunaga [143]. The subject of classification is also a major research topic in
neural networks, statistical learning, and machine learning. An in-depth
treatment on the topic of classification from the statistical and machine
learning perspectives can be found in the books by Bishop [126], Cherkassky
and Mulier [132], Hastie et al. [148], Michie et al. [162], Murphy [167], and
Mitchell [165]. Recent years have also seen the release of many publicly
available software packages for classification, which can be embedded in
programming languages such as Java (Weka [147]) and Python (scikit-learn).
An overview of decision tree induction algorithms can be found in the survey
articles by Buntine [129], Moret [166], Murthy [168], and Safavian et al.
[179]. Examples of some well-known decision tree algorithms include CART
[127], ID3 [175], C4.5 [177], and CHAID [153]. Both ID3 and C4.5 employ
the entropy measure as their splitting function. An in-depth discussion of the
C4.5 decision tree algorithm is given by Quinlan [177]. The CART algorithm
was developed by Breiman et al. [127] and uses the Gini index as its splitting
function. CHAID [153] uses the statistical χ2 test to determine the best split
during the tree-growing process.
The decision tree algorithm presented in this chapter assumes that the
splitting condition at each internal node contains only one attribute. An
oblique decision tree can use multiple attributes to form the attribute test
condition in a single node [149, 187]. Breiman et al. [127] provide an option
for using linear combinations of attributes in their CART implementation.
Other approaches for inducing oblique decision trees were proposed by Heath
et al. [149], Murthy et al. [169], Cantú-Paz and Kamath [130], and Utgoff
and Brodley [187]. Although an oblique decision tree helps to improve the
expressiveness of the model representation, the tree induction process
becomes computationally challenging. Another way to improve the
expressiveness of a decision tree without using oblique decision trees is to
apply a method known as constructive induction [161]. This method
simplifies the task of learning complex splitting functions by creating
compound features from the original data.
Besides the top-down approach, other strategies for growing a decision tree
include the bottom-up approach by Landeweerd et al. [159] and Pattipati and
Alexandridis [173], as well as the bidirectional approach by Kim and
Landgrebe [154]. Schuermann and Doster [181] and Wang and Suen [193]
proposed using a soft splitting criterion to address the data fragmentation
problem. In this approach, each instance is assigned to different branches of
the decision tree with different probabilities.
Model overfitting is an important issue that must be addressed to ensure that
a decision tree classifier performs equally well on previously unlabeled data
instances. The model overfitting problem has been investigated by many
authors including Breiman et al. [127], Schaffer [180], Mingers [164], and
Jensen and Cohen [151]. While the presence of noise is often regarded as one
of the primary reasons for overfitting [164, 170], Jensen and Cohen [151]
viewed overfitting as an artifact of failure to compensate for the multiple
comparisons problem.
Bishop [126] and Hastie et al. [148] provide an excellent discussion of model
overfitting, relating it to a well-known framework of theoretical analysis,
known as bias-variance decomposition [146]. In this framework, the
prediction of a learning algorithm is considered to be a function of the
training set, which varies as the training set is changed. The generalization
error of a model is then described in terms of its bias (the error of the average
prediction obtained using different training sets), its variance (how different
are the predictions obtained using different training sets), and noise (the
irreducible error inherent to the problem). An underfit model is considered to
have high bias but low variance, while an overfit model is considered to have
low bias but high variance. Although the bias-variance decomposition was
originally proposed for regression problems (where the target attribute is a
continuous variable), a unified analysis that is applicable for classification
has been proposed by Domingos [136]. The bias variance decomposition will
be discussed in more detail while introducing ensemble learning methods in
Chapter 4.
Various learning principles, such as the Probably Approximately Correct
(PAC) learning framework [188], have been developed to provide a
theoretical framework for explaining the generalization performance of
learning algorithms. In the field of statistics, a number of performance
estimation methods have been proposed that make a trade-off between the
goodness of fit of a model and the model complexity. Most noteworthy
among them are Akaike’s Information Criterion [120] and the Bayesian
Information Criterion [182]. They both apply corrective terms to the training
error rate of a model, so as to penalize more complex models. Another
widely-used approach for measuring the complexity of any general model is
the Vapnik-Chervonenkis (VC) Dimension [190]. The VC dimension of a
class of functions C is defined as the maximum number of points that can be
shattered (every point can be distinguished from the rest) by functions
belonging to C, for any possible configuration of points. The VC dimension
lays the foundation of the structural risk minimization principle [189], which
is extensively used in many learning algorithms, e.g., support vector
machines, which will be discussed in detail in Chapter 4.
The Occam’s razor principle is often attributed to the philosopher William of
Occam. Domingos [135] cautioned against the pitfall of misinterpreting
Occam’s razor as comparing models with similar training errors, instead of
generalization errors. A survey on decision tree-pruning methods to avoid
overfitting is given by Breslow and Aha [128] and Esposito et al. [141].
Some of the typical pruning methods include reduced error pruning [176],
pessimistic error pruning [176], minimum error pruning [171], critical value
pruning [163], cost-complexity pruning [127], and error-based pruning [177].
Quinlan and Rivest proposed using the minimum description length principle
for decision tree pruning in [178].
The discussion in this chapter on the significance of cross-validation error
estimates is inspired by Chapter 7 in Hastie et al. [148]. It is also an
excellent resource for understanding “the right and wrong ways to do cross-validation”, which is similar to the discussion on pitfalls in Section 3.8 of this
chapter. A comprehensive discussion of some of the common pitfalls in using
cross-validation for model selection and evaluation is provided in Krstajic et
al. [156].
The original cross-validation method was proposed independently by Allen
[121], Stone [184], and Geisser [145] for model assessment (evaluation).
Even though cross-validation can be used for model selection [194], its usage
for model selection is quite different than when it is used for model
evaluation, as emphasized by Stone [184]. Over the years, the distinction
between the two usages has often been ignored, resulting in incorrect
findings. One of the common mistakes while using cross-validation is to
perform pre-processing operations (e.g., hyper-parameter tuning or feature
selection) using the entire data set and not “within” the training fold of every
cross-validation run. Ambroise and McLachlan [124], using a number of gene expression
studies as examples, provide an extensive discussion of the selection
bias that arises when feature selection is performed outside cross-validation.
Useful guidelines for evaluating models on microarray data have also been
provided by Allison et al. [122].
The use of the cross-validation protocol for hyper-parameter tuning has been
described in detail by Dudoit and van der Laan [138]. This approach has been
called “grid-search cross-validation.” The correct approach in using
cross-validation for both hyper-parameter selection and model evaluation, as
discussed in Section 3.7 of this chapter, is extensively covered by Varma and
Simon [191]. This combined approach has been referred to as “nested cross-validation” or “double cross-validation” in the existing literature. Recently,
Tibshirani and Tibshirani [185] have proposed a new approach for hyper-parameter selection and model evaluation. Tsamardinos et al. [186] compared
this approach to nested cross-validation. The experiments they performed
found that, on average, both approaches provide conservative estimates of
model performance with the Tibshirani and Tibshirani approach being more
computationally efficient.
Kohavi [155] has performed an extensive empirical study to compare the
performance metrics obtained using different estimation methods such as
random subsampling and k-fold cross-validation. The results suggest that
the best estimation method is ten-fold, stratified cross-validation.
An alternative approach for model evaluation is the bootstrap method, which
was presented by Efron in 1979 [139]. In this method, training instances are
sampled with replacement from the labeled set, i.e., an instance previously
selected to be part of the training set is equally likely to be drawn again. If the
original data has N instances, it can be shown that, on average, a bootstrap
sample of size N contains about 63.2% of the instances in the original data.
Instances that are not included in the bootstrap sample become part of the test
set. The bootstrap procedure for obtaining training and test sets is repeated b
times, resulting in a different error rate on the test set, err(i), at the ith run. To
obtain the overall error rate, errboot, the .632 bootstrap approach combines
err(i) with the error rate obtained from a training set containing all the
labeled examples, errs, as follows:
\[ \mathit{err}_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \times \mathit{err}(i) + 0.368 \times \mathit{err}_s \right). \tag{3.19} \]
Efron and Tibshirani [140] provided a theoretical and empirical comparison
between cross-validation and a bootstrap method known as the .632+ rule.
While the .632 bootstrap method presented above provides a robust estimate
of the generalization performance with low variance in its estimate, it may
produce misleading results for highly complex models in certain conditions,
as demonstrated by Kohavi [155]. This is because the overall error rate is not
truly an out-of-sample error estimate as it depends on the training error rate,
errs, which can be quite small if there is overfitting.
Current techniques such as C4.5 require that the entire training data set fit
into main memory. There has been considerable effort to develop parallel and
scalable versions of decision tree induction algorithms. Some of the proposed
algorithms include SLIQ by Mehta et al. [160], SPRINT by Shafer et al.
[183], CMP by Wang and Zaniolo [192], CLOUDS by Alsabti et al. [123],
RainForest by Gehrke et al. [144], and ScalParC by Joshi et al. [152]. A
survey of parallel algorithms for classification and other data mining tasks is
given in [158]. More recently, there has been extensive research to implement
large-scale classifiers on the compute unified device architecture (CUDA)
[131, 134] and MapReduce [133, 172] platforms.
[120] H. Akaike. Information theory and an extension of the maximum
likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–
213. Springer, 1998.
[121] D. M. Allen. The relationship between variable selection and data
agumentation and a method for prediction. Technometrics, 16(1):125–
127, 1974.
[122] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour. Microarray
data analysis: from disarray to consolidation and consensus. Nature
reviews genetics, 7(1):55–65, 2006.
[123] K. Alsabti, S. Ranka, and V. Singh. CLOUDS: A Decision Tree
Classifier for Large Datasets. In Proc. of the 4th Intl. Conf. on
Knowledge Discovery and Data Mining, pages 2–8, New York, NY,
August 1998.
[124] C. Ambroise and G. J. McLachlan. Selection bias in gene
extraction on the basis of microarray gene-expression data. Proceedings
of the national academy of sciences, 99 (10):6562–6566, 2002.
[125] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford
University Press, Oxford, U.K., 1995.
[126] C. M. Bishop. Pattern Recognition and Machine Learning.
Springer, 2006.
[127] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone.
Classification and Regression Trees. Chapman & Hall, New York,
[128] L. A. Breslow and D. W. Aha. Simplifying Decision Trees: A
Survey. Knowledge Engineering Review, 12(1):1–40, 1997.
[129] W. Buntine. Learning classification trees. In Artificial Intelligence
Frontiers in Statistics, pages 182–201. Chapman & Hall, London, 1993.
[130] E. Cantú-Paz and C. Kamath. Using evolutionary algorithms to
induce oblique decision trees. In Proc. of the Genetic and Evolutionary
Computation Conf., pages 1053–1060, San Francisco, CA, 2000.
[131] B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector
machine training and classification on graphics processors. In
Proceedings of the 25th International Conference on Machine Learning,
pages 104–111, 2008.
[132] V. Cherkassky and F. M. Mulier. Learning from Data: Concepts,
Theory, and Methods. Wiley, 2nd edition, 2007.
[133] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and
K. Olukotun. Map-reduce for machine learning on multicore. Advances
in neural information processing systems, 19:281, 2007.
[134] A. Cotter, N. Srebro, and J. Keshet. A GPU-tailored Approach for
Training Kernelized SVMs. In Proceedings of the 17th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
pages 805–813, San Diego, California, USA, 2011.
[135] P. Domingos. The Role of Occam’s Razor in Knowledge
Discovery. Data Mining and Knowledge Discovery, 3(4):409–425,
[136] P. Domingos. A unified bias-variance decomposition. In
Proceedings of 17th International Conference on Machine Learning,
pages 231–238, 2000.
[137] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification.
John Wiley & Sons, Inc., New York, 2nd edition, 2001.
[138] S. Dudoit and M. J. van der Laan. Asymptotics of cross-validated
risk estimation in estimator selection and performance assessment.
Statistical Methodology, 2(2):131–154, 2005.
[139] B. Efron. Bootstrap methods: another look at the jackknife. In
Breakthroughs in Statistics, pages 569–593. Springer, 1992.
[140] B. Efron and R. Tibshirani. Cross-validation and the Bootstrap:
Estimating the Error Rate of a Prediction Rule. Technical report,
Stanford University, 1995.
[141] F. Esposito, D. Malerba, and G. Semeraro. A Comparative
Analysis of Methods for Pruning Decision Trees. IEEE Trans. Pattern
Analysis and Machine Intelligence, 19(5):476–491, May 1997.
[142] R. A. Fisher. The use of multiple measurements in taxonomic
problems. Annals of Eugenics, 7:179–188, 1936.
[143] K. Fukunaga. Introduction to Statistical Pattern Recognition.
Academic Press, New York, 1990.
[144] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest—A
Framework for Fast Decision Tree Construction of Large Datasets. Data
Mining and Knowledge Discovery, 4(2/3):127–162, 2000.
[145] S. Geisser. The predictive sample reuse method with applications.
Journal of the American Statistical Association, 70(350):320–328, 1975.
[146] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and
the bias/variance dilemma. Neural computation, 4(1):1–58, 1992.
[147] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.
H. Witten. The WEKA Data Mining Software: An Update. SIGKDD
Explorations, 11(1), 2009.
[148] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of
Statistical Learning: Data Mining, Inference, and Prediction. Springer,
2nd edition, 2009.
[149] D. Heath, S. Kasif, and S. Salzberg. Induction of Oblique Decision
Trees. In Proc. of the 13th Intl. Joint Conf. on Artificial Intelligence,
pages 1002–1007, Chambery, France, August 1993.
[150] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical Pattern
Recognition: A Review. IEEE Tran. Patt. Anal. and Mach. Intellig.,
22(1):4–37, 2000.
[151] D. Jensen and P. R. Cohen. Multiple Comparisons in Induction
Algorithms. Machine Learning, 38(3):309–338, March 2000.
[152] M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A New
Scalable and Efficient Parallel Classification Algorithm for Mining
Large Datasets. In Proc. of 12th Intl. Parallel Processing Symp.
(IPPS/SPDP), pages 573–579, Orlando, FL, April 1998.
[153] G. V. Kass. An Exploratory Technique for Investigating Large
Quantities of Categorical Data. Applied Statistics, 29:119–127, 1980.
[154] B. Kim and D. Landgrebe. Hierarchical decision classifiers in
high-dimensional and large class data. IEEE Trans. on Geoscience and
Remote Sensing, 29(4):518–528, 1991.
[155] R. Kohavi. A Study on Cross-Validation and Bootstrap for
Accuracy Estimation and Model Selection. In Proc. of the 15th Intl.
Joint Conf. on Artificial Intelligence, pages 1137–1145, Montreal,
Canada, August 1995.
[156] D. Krstajic, L. J. Buturovic, D. E. Leahy, and S. Thomas. Cross-validation pitfalls when selecting and assessing regression and
classification models. Journal of cheminformatics, 6(1):1, 2014.
[157] S. R. Kulkarni, G. Lugosi, and S. S. Venkatesh. Learning Pattern
Classification—A Survey. IEEE Tran. Inf. Theory, 44(6):2178–2206,
[158] V. Kumar, M. V. Joshi, E.-H. Han, P. N. Tan, and M. Steinbach.
High Performance Data Mining. In High Performance Computing for
Computational Science (VECPAR 2002), pages 111–125. Springer,
[159] G. Landeweerd, T. Timmers, E. Gersema, M. Bins, and M. Halic.
Binary tree versus single level tree classification of white blood cells.
Pattern Recognition, 16:571–577, 1983.
[160] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A Fast Scalable
Classifier for Data Mining. In Proc. of the 5th Intl. Conf. on Extending
Database Technology, pages 18–32, Avignon, France, March 1996.
[161] R. S. Michalski. A theory and methodology of inductive learning.
Artificial Intelligence, 20:111–116, 1983.
[162] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine
Learning, Neural and Statistical Classification. Ellis Horwood, Upper
Saddle River, NJ, 1994.
[163] J. Mingers. Expert Systems—Rule Induction with Statistical Data.
J Operational Research Society, 38:39–47, 1987.
[164] J. Mingers. An empirical comparison of pruning methods for
decision tree induction. Machine Learning, 4:227–243, 1989.
[165] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[166] B. M. E. Moret. Decision Trees and Diagrams. Computing
Surveys, 14(4):593–623, 1982.
[167] K. P. Murphy. Machine Learning: A Probabilistic Perspective.
MIT Press, 2012.
[168] S. K. Murthy. Automatic Construction of Decision Trees from
Data: A Multi-Disciplinary Survey. Data Mining and Knowledge
Discovery, 2(4):345–389, 1998.
[169] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of
oblique decision trees. J of Artificial Intelligence Research, 2:1–33,
[170] T. Niblett. Constructing decision trees in noisy domains. In Proc.
of the 2nd European Working Session on Learning, pages 67–78, Bled,
Yugoslavia, May 1987.
[171] T. Niblett and I. Bratko. Learning Decision Rules in Noisy
Domains. In Research and Development in Expert Systems III,
Cambridge, 1986. Cambridge University Press.
[172] I. Palit and C. K. Reddy. Scalable and parallel boosting with
mapreduce. IEEE Transactions on Knowledge and Data Engineering,
24(10):1904–1916, 2012.
[173] K. R. Pattipati and M. G. Alexandridis. Application of heuristic
search and information theory to sequential fault diagnosis. IEEE Trans.
on Systems, Man, and Cybernetics, 20(4):872–887, 1990.
[174] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J.
Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E.
Duchesnay. Scikit-learn: Machine Learning in Python. Journal of
Machine Learning Research, 12:2825–2830, 2011.
[175] J. R. Quinlan. Discovering rules by induction from large collection
of examples. In D. Michie, editor, Expert Systems in the Micro
Electronic Age. Edinburgh University Press, Edinburgh, UK, 1979.
[176] J. R. Quinlan. Simplifying Decision Trees. Intl. J. Man-Machine
Studies, 27:221–234, 1987.
[177] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufmann Publishers, San Mateo, CA, 1993.
[178] J. R. Quinlan and R. L. Rivest. Inferring Decision Trees Using the
Minimum Description Length Principle. Information and Computation,
80(3):227–248, 1989.
[179] S. R. Safavian and D. Landgrebe. A Survey of Decision Tree
Classifier Methodology. IEEE Trans. Systems, Man and Cybernetics,
22:660–674, May/June 1998.
[180] C. Schaffer. Overfitting avoidance as bias. Machine Learning,
10:153–178, 1993.
[181] J. Schuermann and W. Doster. A decision-theoretic approach in
hierarchical classifier design. Pattern Recognition, 17:359–369, 1984.
[182] G. Schwarz et al. Estimating the dimension of a model. The annals
of statistics, 6(2): 461–464, 1978.
[183] J. C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A Scalable
Parallel Classifier for Data Mining. In Proc. of the 22nd VLDB Conf.,
pages 544–555, Bombay, India, September 1996.
[184] M. Stone. Cross-validatory choice and assessment of statistical
predictions. Journal of the Royal Statistical Society. Series B
(Methodological), pages 111–147, 1974.
[185] R. J. Tibshirani and R. Tibshirani. A bias correction for the
minimum error rate in cross-validation. The Annals of Applied Statistics,
pages 822–829, 2009.
[186] I. Tsamardinos, A. Rakhshani, and V. Lagani. Performance-estimation properties of cross-validation-based protocols with
simultaneous hyper-parameter optimization. In Hellenic Conference on
Artificial Intelligence, pages 1–14. Springer, 2014.
[187] P. E. Utgoff and C. E. Brodley. An incremental method for finding
multivariate splits for decision trees. In Proc. of the 7th Intl. Conf. on
Machine Learning, pages 58–65, Austin, TX, June 1990.
[188] L. Valiant. A theory of the learnable. Communications of the
ACM, 27(11):1134–1142, 1984.
[189] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience,
[190] V. N. Vapnik and A. Y. Chervonenkis. On the uniform
convergence of relative frequencies of events to their probabilities. In
Measures of Complexity, pages 11–30. Springer, 2015.
[191] S. Varma and R. Simon. Bias in error estimation when using
cross-validation for model selection. BMC bioinformatics, 7(1):1, 2006.
[192] H. Wang and C. Zaniolo. CMP: A Fast Decision Tree Classifier
Using Multivariate Predictions. In Proc. of the 16th Intl. Conf. on Data
Engineering, pages 449–460, San Diego, CA, March 2000.
[193] Q. R. Wang and C. Y. Suen. Large tree classifier with heuristic
search and global training. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 9(1):91–102, 1987.
[194] Y. Zhang and Y. Yang. Cross-validation for selecting a model
selection procedure. Journal of Econometrics, 187(1):95–112, 2015.
3.11 Exercises
1. Draw the full decision tree for the parity function of four Boolean
attributes, A, B, C, and D. Is it possible to simplify the tree?
2. Consider the training examples shown in Table 3.5 for a binary
classification problem.
Table 3.5. Data set for Exercise 2.
Customer ID Gender Car Type Shirt Size Class
1 M Family Small C0
2 M Sports Medium C0
3 M Sports Medium C0
4 M Sports Large C0
5 M Sports Extra Large C0
6 M Sports Extra Large C0
7 F Sports Small C0
8 F Sports Small C0
9 F Sports Medium C0
10 F Luxury Large C0
11 M Family Large C1
12 M Family Extra Large C1
13 M Family Medium C1
14 M Luxury Extra Large C1
15 F Luxury Small C1
16 F Luxury Small C1
17 F Luxury Medium C1
18 F Luxury Medium C1
19 F Luxury Medium C1
20 F Luxury Large C1
1. Compute the Gini index for the overall collection of training examples.
2. Compute the Gini index for the Customer ID attribute.
3. Compute the Gini index for the Gender attribute.
4. Compute the Gini index for the Car Type attribute using multiway split.
5. Compute the Gini index for the Shirt Size attribute using multiway split.
6. Which attribute is better, Gender, Car Type, or Shirt Size?
7. Explain why Customer ID should not be used as the attribute test
condition even though it has the lowest Gini.
3. Consider the training examples shown in Table 3.6 for a binary
classification problem.
Table 3.6. Data set for Exercise 3.
Instance a1 a2 a3 Target Class
1 T T 1.0 +
2 T T 6.0 +
3 T F 5.0 −
4 F F 4.0 +
5 F T 7.0 −
6 F T 3.0 −
7 F F 8.0 −
8 T F 7.0 +
9 F T 5.0 −
1. What is the entropy of this collection of training examples with
respect to the class attribute?
2. What are the information gains of a1 and a2 relative to these
training examples?
3. For a3, which is a continuous attribute, compute the information
gain for every possible split.
4. What is the best split (among a1, a2, and a3) according to the
information gain?
5. What is the best split (between a1 and a2) according to the
misclassification error rate?
6. What is the best split (between a1 and a2) according to the Gini index?
4. Show that the entropy of a node never increases after splitting it into
smaller successor nodes.
5. Consider the following data set for a binary class problem.
A B Class Label
T F +
T T +
T T +
T F −
T T +
F F −
F F −
F F −
T T −
T F −
1. Calculate the information gain when splitting on A and B. Which
attribute would the decision tree induction algorithm choose?
2. Calculate the gain in the Gini index when splitting on A and B.
Which attribute would the decision tree induction algorithm choose?
3. Figure 3.11 shows that entropy and the Gini index are both
monotonically increasing on the range [0, 0.5] and they are both
monotonically decreasing on the range [0.5, 1]. Is it possible that
information gain and the gain in the Gini index favor different
attributes? Explain.
6. Consider splitting a parent node P into two child nodes, C1 and C2,
using some attribute test condition. The composition of labeled training
instances at every node is summarized in the table below.
P C1 C2
Class 0 7 3 4
Class 1 3 0 3
1. Calculate the Gini index and misclassification error rate of the
parent node P .
2. Calculate the weighted Gini index of the child nodes. Would you
consider this attribute test condition if Gini is used as the impurity measure?
3. Calculate the weighted misclassification rate of the child nodes.
Would you consider this attribute test condition if misclassification
rate is used as the impurity measure?
7. Consider the following set of training examples.
X Y Z No. of Class C1 Examples No. of Class C2 Examples
0 0 0 5 40
0 0 1 0 15
0 1 0 10 5
0 1 1 45 0
1 0 0 10 5
1 0 1 25 0
1 1 0 5 20
1 1 1 0 15
(a) Compute a two-level decision tree using the greedy approach
described in this chapter. Use the classification error rate as the
criterion for splitting. What is the overall error rate of the induced tree?
(b) Repeat part (a) using X as the first splitting attribute and then
choose the best remaining attribute for splitting at each of the two
successor nodes. What is the error rate of the induced tree?
(c) Compare the results of parts (a) and (b). Comment on the suitability
of the greedy heuristic used for splitting attribute selection.
8. The following table summarizes a data set with three attributes A, B,
C and two class labels +, −. Build a two-level decision tree.
A B C Number of Instances
+ −
T T T 5 0
F T T 0 20
T F T 20 0
F F T 0 5
T T F 0 0
F T F 25 0
T F F 0 0
F F F 0 25
(a) According to the classification error rate, which attribute would be
chosen as the first splitting attribute? For each attribute, show the
contingency table and the gains in classification error rate.
(b) Repeat for the two children of the root node.
(c) How many instances are misclassified by the resulting decision tree?
(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.
(e) Use the results in parts (c) and (d) to draw a conclusion about the greedy
nature of the decision tree induction algorithm.
9. Consider the decision tree shown in Figure 3.36.
Figure 3.36. Decision tree and data sets for Exercise 9.
1. Compute the generalization error rate of the tree using the
optimistic approach.
2. Compute the generalization error rate of the tree using the
pessimistic approach. (For simplicity, use the strategy of adding a
factor of 0.5 to each leaf node.)
3. Compute the generalization error rate of the tree using the
validation set shown above. This approach is known as reduced
error pruning.
10. Consider the decision trees shown in Figure 3.37. Assume they are
generated from a data set that contains 16 binary attributes and 3 classes,
C1, C2, and C3.
Compute the total description length of each decision tree according to
the following formulation of the minimum description length principle.
The total description length of a tree is given by
Cost(tree, data) = Cost(tree) + Cost(data|tree).
Figure 3.37. Decision trees for Exercise 10.
Each internal node of the tree is encoded by the ID of the splitting
attribute. If there are m attributes, the cost of encoding each
attribute is log2 m bits.
Each leaf is encoded using the ID of the class it is associated with.
If there are k classes, the cost of encoding a class is log2 k bits.
Cost(tree) is the cost of encoding all the nodes in the tree. To
simplify the computation, you can assume that the total cost of the
tree is obtained by adding up the costs of encoding each internal
node and each leaf node.
Cost(data|tree) is encoded using the classification errors the tree
commits on the training set. Each error is encoded by log2 n bits,
where n is the total number of training instances.
Which decision tree is better, according to the MDL principle?
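The bookkeeping behind this formulation can be sketched as a small helper. The function name `description_length` and its parameters are hypothetical, and the sketch assumes the total cost is the sum Cost(tree) + Cost(data|tree), with node and error counts read off a given tree by hand.

```python
import math


def description_length(n_internal, n_leaves, n_errors, m, k, n):
    """Total description length in bits under the encoding described above.

    n_internal: number of internal nodes, each costing log2(m) bits
    n_leaves:   number of leaf nodes, each costing log2(k) bits
    n_errors:   number of training errors, each costing log2(n) bits
    m, k, n:    number of attributes, classes, and training instances
    """
    cost_tree = n_internal * math.log2(m) + n_leaves * math.log2(k)
    cost_data_given_tree = n_errors * math.log2(n)
    return cost_tree + cost_data_given_tree
```

Under the MDL principle, the tree with the smaller total description length is preferred.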
11. This exercise, inspired by the discussions in [155], highlights one of
the known limitations of the leave-one-out model evaluation procedure.
Let us consider a data set containing 50 positive and 50 negative
instances, where the attributes are purely random and contain no
information about the class labels. Hence, the generalization error rate of
any classification model learned over this data is expected to be 0.5. Let
us consider a classifier that assigns the majority class label of training
instances (ties resolved by using the positive label as the default class) to
any test instance, irrespective of its attribute values. We call this
approach the majority inducer classifier. Determine the error rate of
this classifier using the following methods.
1. Leave-one-out.
2. 2-fold stratified cross-validation, where the proportion of class
labels at every fold is kept the same as that of the overall data.
3. From the results above, which method provides a more reliable
evaluation of the classifier’s generalization error rate?
12. Consider a labeled data set containing 100 data instances, which is
randomly partitioned into two sets A and B, each containing 50
instances. We use A as the training set to learn two decision trees, T10
with 10 leaf nodes and T100 with 100 leaf nodes. The accuracies of the
two decision trees on data sets A and B are shown in Table 3.7.
Table 3.7. Comparing the test accuracy of decision trees T10 and T100.
Data Set T10 T100
A 0.86 0.97
B 0.84 0.77
1. Based on the accuracies shown in Table 3.7, which classification
model would you expect to have better performance on unseen instances?
2. Now, you tested T10 and T100 on the entire data set (A+B) and
found that the classification accuracy of T10 on data set (A+B) is
0.85, whereas the classification accuracy of T100 on the data set
(A+B) is 0.87. Based on this new information and your
observations from Table 3.7, which classification model would you
finally choose for classification?
13. Consider the following approach for testing whether a classifier A
beats another classifier B. Let N be the size of a given data set, pA be the
accuracy of classifier A, pB be the accuracy of classifier B, and p=
(pA+pB)/2 be the average accuracy for both classifiers. To test whether
classifier A is significantly better than B, the following Z-statistic is used:
$$Z = \frac{p_A - p_B}{\sqrt{2p(1-p)/N}}.$$
Classifier A is assumed to be better than classifier B if Z>1.96.
Table 3.8 compares the accuracies of three different classifiers, decision
tree classifiers, naïve Bayes classifiers, and support vector machines, on
various data sets. (The latter two classifiers are described in Chapter 4.)
Summarize the performance of the classifiers given in Table 3.8 using
the following 3×3 table:
win-loss-draw           Decision tree  Naïve Bayes  Support vector machine
Decision tree           0 – 0 – 23
Naïve Bayes                            0 – 0 – 23
Support vector machine                              0 – 0 – 23
Table 3.8. Comparing the accuracy of various classification methods.
Data Set  Size (N)  Decision Tree (%)  Naïve Bayes (%)  Support vector machine (%)
Anneal 898 92.09 79.62 87.19
Australia 690 85.51 76.81 84.78
Auto 205 81.95 58.05 70.73
Breast 699 95.14 95.99 96.42
Cleve 303 76.24 83.50 84.49
Credit 690 85.80 77.54 85.07
Diabetes 768 72.40 75.91 76.82
German 1000 70.90 74.70 74.40
Glass 214 67.29 48.59 59.81
Heart 270 80.00 84.07 83.70
Hepatitis 155 81.94 83.23 87.10
Horse 368 85.33 78.80 82.61
Ionosphere 351 89.17 82.34 88.89
Iris 150 94.67 95.33 96.00
Labor 57 78.95 94.74 92.98
Led7 3200 73.34 73.16 73.56
Lymphography 148 77.03 83.11 86.49
Pima 768 74.35 76.04 76.95
Sonar 208 78.85 69.71 76.92
Tic-tac-toe 958 83.72 70.04 98.33
Vehicle 846 71.04 45.04 74.94
Wine 178 94.38 96.63 98.88
Zoo 101 93.07 93.07 96.04
Each cell in the table contains the number of wins, losses, and draws
when comparing the classifier in a given row to the classifier in a given column.
14. Let X be a binomial random variable with mean Np and variance
Np(1−p). Show that the ratio X/N also has a binomial distribution with
mean p and variance p(1−p)/N.
4 Classification: Alternative Techniques
The previous chapter introduced the classification problem and presented a
technique known as the decision tree classifier. Issues such as model
overfitting and model evaluation were also discussed. This chapter presents
alternative techniques for building classification models—from simple
techniques such as rule-based and nearest neighbor classifiers to more
sophisticated techniques such as artificial neural networks, deep learning,
support vector machines, and ensemble methods. Other practical issues such
as the class imbalance and multiclass problems are also discussed at the end
of the chapter.
4.1 Types of Classifiers
Before presenting specific techniques, we first categorize the different types
of classifiers available. One way to distinguish classifiers is by considering
the characteristics of their output.
Binary versus Multiclass
Binary classifiers assign each data instance to one of two possible labels,
typically denoted as +1 and −1. The positive class usually refers to the
category we are more interested in predicting correctly compared to the
negative class (e.g., the spam category in email classification problems). If
there are more than two possible labels available, then the technique is known
as a multiclass classifier. As some classifiers were designed for binary classes
only, they must be adapted to deal with multiclass problems. Techniques for
transforming binary classifiers into multiclass classifiers are described in
Section 4.12.
Deterministic versus Probabilistic
A deterministic classifier assigns a discrete-valued label to each data
instance it classifies whereas a probabilistic classifier assigns a continuous
score between 0 and 1 to indicate how likely it is that an instance belongs to a
particular class, where the probability scores for all the classes sum up to 1.
Some examples of probabilistic classifiers include the naïve Bayes classifier,
Bayesian networks, and logistic regression. Probabilistic classifiers provide
more information about the confidence in assigning an instance to a
class than deterministic classifiers. A data instance is typically assigned to the
class with the highest probability score, except when the cost of
misclassifying the class with lower probability is significantly higher. We
will discuss the topic of cost-sensitive classification with probabilistic outputs
in Section 4.11.2.
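The difference between the two decision strategies can be illustrated with a short sketch. It assumes a generic minimum-expected-cost rule, which is not necessarily the formulation used in Section 4.11.2; the function names, the probability scores, and the cost values are all hypothetical.

```python
def predict_label(probs):
    """Assign the class with the highest probability score."""
    return max(probs, key=probs.get)


def predict_min_cost(probs, cost):
    """Assign the class with the lowest expected misclassification cost.

    cost[true][pred] is the (assumed) cost of predicting `pred`
    when the actual class is `true`.
    """
    def expected_cost(pred):
        return sum(probs[true] * cost[true][pred] for true in probs)

    return min(probs, key=expected_cost)


# Hypothetical probability scores for one email; they sum to 1.
probs = {"spam": 0.3, "ham": 0.7}
# Hypothetical costs: missing a spam message is ten times worse
# than flagging a legitimate message.
cost = {
    "spam": {"spam": 0, "ham": 10},
    "ham": {"spam": 1, "ham": 0},
}
```

With these numbers the highest-probability rule picks `ham`, whereas the expected-cost rule picks `spam` (expected cost 0.7 versus 3.0), which is the exception described above.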
Another way to distinguish the different types of classifiers is based on their
technique for discriminating instances from different classes.
Linear versus Nonlinear
A linear classifier uses a linear separating hyperplane to discriminate
instances from different classes whereas a nonlinear classifier enables the
construction of more complex, nonlinear decision surfaces. We illustrate an
example of a linear classifier (perceptron) and its nonlinear counterpart
(multi-layer neural network) in Section 4.7. Although the linearity
assumption makes the model less flexible in terms of fitting complex data,
linear classifiers are also less susceptible to model overfitting than
nonlinear classifiers. Furthermore, one can transform the original set of
attributes, $x=(x_1, x_2, \ldots, x_d)$, into a more complex feature set, e.g.,
$\Phi(x)=(x_1, x_2, x_1x_2, x_1^2, x_2^2, \ldots)$, before applying the linear classifier. Such feature
transformation allows the linear classifier to fit data sets with nonlinear
decision surfaces (see Section 4.9.4).
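As a minimal sketch of this idea, the code below trains a simple perceptron on four XOR-like points, once on the raw attributes and once on the transformed features Φ(x). The data set, the training loop, and the function names are illustrative assumptions and are not taken from Section 4.7 or Section 4.9.4.

```python
def phi(x):
    """Quadratic feature map for two attributes: (x1, x2, x1*x2, x1^2, x2^2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2, x1 ** 2, x2 ** 2)


def train_perceptron(X, y, epochs=100):
    """Basic perceptron updates; stops early once an epoch has no mistakes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified (or on the boundary)
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def accuracy(w, b, X, y):
    """Fraction of instances whose predicted sign matches the label."""
    correct = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        correct += (1 if score > 0 else -1) == yi
    return correct / len(X)


# XOR-like data: the label is +1 when x1 and x2 have the same sign.
X_raw = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
y = [1, -1, -1, 1]
X_phi = [phi(x) for x in X_raw]

w_raw, b_raw = train_perceptron(X_raw, y)
w_phi, b_phi = train_perceptron(X_phi, y)
acc_raw = accuracy(w_raw, b_raw, X_raw, y)
acc_phi = accuracy(w_phi, b_phi, X_phi, y)
```

No linear boundary in the original two attributes separates these four points, so `acc_raw` stays below 1, while the product feature x1x2 makes the classes linearly separable in the transformed space.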
Global versus Local
A global classifier fits a single model to the entire data set. Unless the model
is highly nonlinear, this one-size-fits-all strategy may not be effective when
the relationship between the attributes and the class labels varies over the
input space. In contrast, a local classifier partitions the input space into
smaller regions and fits a distinct model to training instances in each region.
The k-nearest neighbor classifier (see Section 4.3) is a classic example of
local classifiers. While local classifiers are more flexible in terms of fitting
complex decision boundaries, they are also more susceptible to the model
overfitting problem, especially when the local regions contain few training examples.
Generative versus Discriminative
Given a data instance x, the primary objective of any classifier is to predict
the class label, y, of the data instance. However, apart from predicting the
class label, we may also be interested in describing the underlying
mechanism that generates the instances belonging to every class label. For
example, in the process of classifying spam email messages, it may be useful
to understand the typical characteristics of email messages that are labeled as
spam, e.g., specific usage of keywords in the subject or the body of the email.
Classifiers that learn a generative model of every class in the process of
predicting class labels are known as generative classifiers. Some examples of
generative classifiers include the naïve Bayes classifier and Bayesian
networks. In contrast, discriminative classifiers directly predict the class
labels without explicitly describing the distribution of every class label. They
solve a simpler problem than generative models since they do not have the
onus of deriving insights about the generative mechanism of data instances.
They are thus sometimes preferred over generative models, especially when it
is not crucial to obtain information about the properties of every class. Some
examples of discriminative classifiers include decision trees, rule-based
classifiers, nearest neighbor classifiers, artificial neural networks, and support
vector machines.
4.2 Rule-Based Classifier
A rule-based classifier uses a collection of “if …then…” rules (also known as
a rule set) to classify data instances. Table 4.1 shows an example of a rule set
generated for the vertebrate classification problem described in the previous
chapter. Each classification rule in the rule set can be expressed in the
following way:
$$r_i : (\mathit{Condition}_i) \rightarrow y_i.$$ (4.1)
The left-hand side of the rule is called the rule antecedent or precondition.
It contains a conjunction of attribute test conditions:
$$\mathit{Condition}_i = (A_1\ \mathit{op}\ v_1) \wedge (A_2\ \mathit{op}\ v_2) \wedge \cdots \wedge (A_k\ \mathit{op}\ v_k),$$ (4.2)
where (Aj, vj) is an attribute-value pair and op is a comparison operator
chosen from the set {=, ≠, <, >, ≤, ≥}. Each attribute test (Aj op vj) is also
known as a conjunct. The right-hand side of the rule is called the rule
consequent, which contains the predicted class yi.
A rule r covers a data instance x if the precondition of r matches the attributes
of x. r is also said to be fired or triggered whenever it covers a given instance.
For an illustration, consider the rule r1 given in Table 4.1 and the following
attributes for two vertebrates: hawk and grizzly bear.
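Table 4.1. Example of a rule set for the vertebrate classification problem.
r1: (Gives Birth = no) ∧ (Aerial Creature = yes) → Birds
r2: (Gives Birth = no) ∧ …
r3: (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals
r4: (Gives Birth = no) ∧ (Aerial Creature = no) → Reptiles
r5: … → Amphibians

Name          Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates
hawk          warm-blooded      feather     no           no                yes              yes       no
grizzly bear  warm-blooded      fur         yes          no                no               yes       yes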
r1 covers the first vertebrate because its precondition is satisfied by the
hawk’s attributes. The rule does not cover the second vertebrate because
grizzly bears give birth to their young and cannot fly, thus violating the
precondition of r1.
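The coverage test for r1 can be sketched as follows. The representation of a rule as a list of (attribute, operator, value) conjuncts, the `OPS` table, and the function name `covers` are illustrative assumptions; ASCII spellings are used for the comparison operators ≠, ≤, and ≥. Only the two attributes that r1 tests are recorded for each vertebrate.

```python
import operator

# Comparison operators allowed in a conjunct (ASCII spellings assumed).
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def covers(antecedent, instance):
    """A rule covers an instance if every conjunct in its antecedent holds."""
    return all(OPS[op](instance[attr], value) for attr, op, value in antecedent)


r1 = {
    "antecedent": [("Gives Birth", "=", "no"), ("Aerial Creature", "=", "yes")],
    "consequent": "Birds",
}

hawk = {"Gives Birth": "no", "Aerial Creature": "yes"}
grizzly_bear = {"Gives Birth": "yes", "Aerial Creature": "no"}

hawk_covered = covers(r1["antecedent"], hawk)
bear_covered = covers(r1["antecedent"], grizzly_bear)
```

Here `hawk_covered` is true and `bear_covered` is false, matching the discussion above.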
The quality of a classification rule can be evaluated using measures such as
coverage and accuracy. Given a data set D and a classification rule r : A→y,
the coverage of the rule is the fraction of instances in D that trigger the rule r.
On the other hand, its accuracy or confidence factor is the fraction of
instances triggered by r whose class labels are equal to y. The formal
definitions of these measures are
$$\mathrm{Coverage}(r) = \frac{|A|}{|D|}, \qquad \mathrm{Accuracy}(r) = \frac{|A \cap y|}{|A|},$$ (4.3)
where |A| is the number of instances that satisfy the rule antecedent, |A∩y| is
the number of instances that satisfy both the antecedent and consequent, and
|D| is the total number of instances.
Example 4.1.
Consider the data set shown in Table 4.2. The rule
(Gives Birth=yes)∧(Body Temperature=warm-blooded)→Mammals
has a coverage of 33% since five of the fifteen instances support the rule
antecedent. The rule accuracy is 100% because all five vertebrates covered by
the rule are mammals.
Table 4.2. The vertebrate data set.
Name        Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates
human       warm-blooded      hair        yes          no                no               yes       no
python      cold-blooded      scales      no           no                no               no        yes
salmon      cold-blooded      scales      no           yes               no               no        no
whale       warm-blooded      hair        yes          yes               no               no        no
frog        cold-blooded      none        no           semi              no               yes       yes
dragon      cold-blooded      scales      no           no                no               yes       no
bat         warm-blooded      hair        yes          no                yes              yes       yes
pigeon      warm-blooded      feathers    no           no                yes              yes       no
cat         warm-blooded      fur         yes          no                no               yes       no
guppy       cold-blooded      scales      yes          yes               no               no        no
alligator   cold-blooded      scales      no           semi              no               yes       no
penguin     warm-blooded      feathers    no           semi              no               yes       no
porcupine   warm-blooded      quills      yes          no                no               yes       yes
eel         cold-blooded      scales      no           yes               no               no        no
salamander  cold-blooded      none        no           semi              no               yes       yes
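The coverage and accuracy reported in Example 4.1 can be checked with a short Python sketch. The attribute values below are copied from Table 4.2. The is_mammal flag is an assumption inferred from the statement that all five covered vertebrates are mammals, and the function and variable names are illustrative only.

```python
# (name, body_temperature, gives_birth, is_mammal)
# is_mammal is an assumption inferred from Examples 4.1 and 4.2.
VERTEBRATES = [
    ("human", "warm-blooded", "yes", True),
    ("python", "cold-blooded", "no", False),
    ("salmon", "cold-blooded", "no", False),
    ("whale", "warm-blooded", "yes", True),
    ("frog", "cold-blooded", "no", False),
    ("dragon", "cold-blooded", "no", False),
    ("bat", "warm-blooded", "yes", True),
    ("pigeon", "warm-blooded", "no", False),
    ("cat", "warm-blooded", "yes", True),
    ("guppy", "cold-blooded", "yes", False),
    ("alligator", "cold-blooded", "no", False),
    ("penguin", "warm-blooded", "no", False),
    ("porcupine", "warm-blooded", "yes", True),
    ("eel", "cold-blooded", "no", False),
    ("salamander", "cold-blooded", "no", False),
]


def antecedent(record):
    """(Gives Birth = yes) AND (Body Temperature = warm-blooded)."""
    _, body_temperature, gives_birth, _ = record
    return gives_birth == "yes" and body_temperature == "warm-blooded"


def coverage_and_accuracy(data, antecedent, is_target):
    covered = [r for r in data if antecedent(r)]      # instances counted in |A|
    correct = [r for r in covered if is_target(r)]    # instances counted in |A ∩ y|
    coverage = len(covered) / len(data)               # |A| / |D|
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


coverage, accuracy = coverage_and_accuracy(VERTEBRATES, antecedent, lambda r: r[3])
print(f"coverage = {coverage:.3f}, accuracy = {accuracy:.3f}")
```

The rule fires for five of the fifteen records, so the sketch reproduces the 33% coverage and 100% accuracy stated in the example.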
4.2.1 How a Rule-Based Classifier Works
A rule-based classifier classifies a test instance based on the rule triggered by
the instance. To illustrate how a rule-based classifier works, consider the rule
set shown in Table 4.1 and the following vertebrates:
Name           Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates
lemur          warm-blooded      fur         yes          no                no               yes       yes
turtle         cold-blooded      scales      no           semi              no               yes       no
dogfish shark  cold-blooded      scales      yes          yes               no               no        no
The first vertebrate, which is a lemur, is warm-blooded and gives birth
to its young. It triggers the rule r3, and thus, is classified as a mammal.
The second vertebrate, which is a turtle, triggers the rules r4 and r5.
Since the classes predicted by the rules are contradictory (reptiles versus
amphibians), their conflicting classes must be resolved.
None of the rules are applicable to a dogfish shark. In this case, we need
to determine what class to assign to such a test instance.
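The three cases above can be traced with a small Python sketch. Rules r1, r3, and r4 follow the description in the text. The antecedents of r2 and r5, the consequent of r2, and all identifier names are assumptions made only so that the example runs.

```python
# (name, antecedent, predicted class)
RULES = [
    ("r1", {"gives_birth": "no", "aerial": "yes"}, "Birds"),
    ("r2", {"gives_birth": "no", "aquatic": "yes"}, "Fishes"),   # assumed rule
    ("r3", {"gives_birth": "yes", "body_temp": "warm-blooded"}, "Mammals"),
    ("r4", {"gives_birth": "no", "aerial": "no"}, "Reptiles"),
    ("r5", {"aquatic": "semi"}, "Amphibians"),                   # assumed antecedent
]


def covers(antecedent, instance):
    """A rule covers an instance if every conjunct matches its attributes."""
    return all(instance.get(attr) == value for attr, value in antecedent.items())


def triggered(instance):
    """Names of all rules fired by the instance."""
    return [name for name, antecedent, _ in RULES if covers(antecedent, instance)]


lemur = {"body_temp": "warm-blooded", "gives_birth": "yes", "aquatic": "no", "aerial": "no"}
turtle = {"body_temp": "cold-blooded", "gives_birth": "no", "aquatic": "semi", "aerial": "no"}
shark = {"body_temp": "cold-blooded", "gives_birth": "yes", "aquatic": "yes", "aerial": "no"}

for name, instance in [("lemur", lemur), ("turtle", turtle), ("dogfish shark", shark)]:
    print(name, triggered(instance))
```

Under these assumptions the lemur fires exactly one rule, the turtle fires two rules with conflicting classes, and the dogfish shark fires none, which are the three situations a rule-based classifier has to handle.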
4.2.2 Properties of a Rule Set
The rule set generated by a rule-based classifier can be characterized by the
following two properties.
Definition 4.1 (Mutually Exclusive
Rule Set).
The rules in a rule set R are mutually exclusive if no two rules in R are
triggered by the same instance. This property ensures that every instance is
covered by at most one rule in R.
Definition 4.2 (Exhaustive Rule Set).
A rule set R has exhaustive coverage if there is a rule for each combination of
attribute values. This property ensures that every instance is covered by at
least one rule in R.
Table 4.3. Example of a mutually exclusive and exhaustive rule set.
r1: (Body Temperature = cold-blooded) → Non-mammals
r2: (Body Temperature = warm-blooded) ∧ (Gives Birth = yes) → Mammals
r3: (Body Temperature = warm-blooded) ∧ (Gives Birth = no) → Non-mammals
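Definitions 4.1 and 4.2 can be checked by brute force for the rule set in Table 4.3. The sketch below enumerates every combination of attribute values and counts how many rules fire. It assumes that only the two attributes used by the rules matter and that each has the two values shown; the helper names are hypothetical.

```python
from itertools import product

# Assumed attribute domains for the two attributes used in Table 4.3.
DOMAINS = {
    "body_temp": ["cold-blooded", "warm-blooded"],
    "gives_birth": ["yes", "no"],
}

RULES = [
    ({"body_temp": "cold-blooded"}, "Non-mammals"),
    ({"body_temp": "warm-blooded", "gives_birth": "yes"}, "Mammals"),
    ({"body_temp": "warm-blooded", "gives_birth": "no"}, "Non-mammals"),
]


def covers(antecedent, instance):
    return all(instance[attr] == value for attr, value in antecedent.items())


def rules_fired_per_instance(rules, domains):
    """Number of rules fired for each combination of attribute values."""
    names = list(domains)
    counts = []
    for values in product(*(domains[n] for n in names)):
        instance = dict(zip(names, values))
        counts.append(sum(covers(antecedent, instance) for antecedent, _ in rules))
    return counts


counts = rules_fired_per_instance(RULES, DOMAINS)
mutually_exclusive = all(c <= 1 for c in counts)   # at most one rule per instance
exhaustive = all(c >= 1 for c in counts)           # at least one rule per instance
print(counts, mutually_exclusive, exhaustive)
```

Enumeration is only practical for a handful of categorical attributes, but it makes the two definitions concrete: every combination fires exactly one rule.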
Together, these two properties ensure that every instance is covered by
exactly one rule. An example of a mutually exclusive and exhaustive rule set
is shown in Table 4.3. Unfortunately, many rule-based classifiers, including
the one shown in Table 4.1, do not have such properties. If the rule set is not
exhaustive, then a default rule, rd: ()→yd, must be added to cover the
remaining cases. A default rule has an empty antecedent and is triggered
when all other rules have failed. yd is known as the default class and is
typically assigned to the majority class of training instances not covered by
the existing rules. If the rule set is not mutually exclusive, then an instance
can be covered by more than one rule, some of which may predict conflicting
classes.
Definition 4.3 (Ordered Rule Set).
The rules in an ordered rule set R are ranked in decreasing order of their
priority. An ordered rule set is also known as a decision list.
The rank of a rule can be defined in many ways, e.g., based on its accuracy or
total description length. When a test instance is presented, it will be classified
by the highest-ranked rule that covers the instance. This avoids the problem
of having conflicting classes predicted by multiple classification rules if the
rule set is not mutually exclusive.
An alternative way to handle a non-mutually exclusive rule set without
ordering the rules is to consider the consequent of each rule triggered by a
test instance as a vote for a particular class. The votes are then tallied to
determine the class label of the test instance. The instance is usually assigned
to the class that receives the highest number of votes. The vote may also be
weighted by the rule’s accuracy. Using unordered rules to build a rule-based
classifier has both advantages and disadvantages. Unordered rules are less
susceptible to errors caused by the wrong rule being selected to classify a test
instance, unlike classifiers based on ordered rules, which are sensitive to the
choice of rule-ordering criteria. Model building is also less expensive because
the rules do not need to be kept in sorted order. Nevertheless, classifying a
test instance can be quite expensive because the attributes of the test instance
must be compared against the precondition of every rule in the rule set.
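The difference between a decision list and a voting scheme can be sketched as follows. The three rules, their accuracies, and the default class are hypothetical values chosen for illustration. The list order is taken to be the rule ranking.

```python
from collections import defaultdict

# Hypothetical rules in rank order: (antecedent, predicted class, accuracy).
RULES = [
    ({"aquatic": "semi"}, "Amphibians", 0.60),
    ({"gives_birth": "no", "aerial": "no"}, "Reptiles", 0.55),
    ({"skin": "scales"}, "Reptiles", 0.50),
]
DEFAULT_CLASS = "Mammals"  # hypothetical default class


def covers(antecedent, instance):
    return all(instance.get(attr) == value for attr, value in antecedent.items())


def classify_ordered(instance):
    """Decision list: the highest-ranked rule that covers the instance wins."""
    for antecedent, label, _ in RULES:
        if covers(antecedent, instance):
            return label
    return DEFAULT_CLASS


def classify_by_vote(instance, weighted=False):
    """Unordered rules: each triggered rule votes, optionally weighted by accuracy."""
    votes = defaultdict(float)
    for antecedent, label, accuracy in RULES:
        if covers(antecedent, instance):
            votes[label] += accuracy if weighted else 1.0
    return max(votes, key=votes.get) if votes else DEFAULT_CLASS


turtle = {"gives_birth": "no", "aquatic": "semi", "aerial": "no", "skin": "scales"}
print(classify_ordered(turtle))                 # first matching rule decides
print(classify_by_vote(turtle))                 # one vote per triggered rule
print(classify_by_vote(turtle, weighted=True))  # votes weighted by rule accuracy
```

With these made-up numbers the decision list returns the class of the top-ranked rule, while both voting variants prefer the class supported by two lower-ranked rules. The voting functions also have to test every rule, which is the classification cost mentioned above.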
In the next two sections, we present techniques for extracting an ordered rule
set from data. A rule-based classifier can be constructed using (1) direct
methods, which extract classification rules directly from data, and (2) indirect
methods, which extract classification rules from more complex classification
models, such as decision trees and neural networks. Detailed discussions of
these methods are presented in Sections 4.2.3 and 4.2.4, respectively.
4.2.3 Direct Methods for Rule Extraction
To illustrate the direct method, we consider a widely used rule induction
algorithm called RIPPER. This algorithm scales almost linearly with the
number of training instances and is particularly suited for building models
from data sets with imbalanced class distributions. RIPPER also works well
with noisy data because it uses a validation set to prevent model overfitting.
RIPPER uses the sequential covering algorithm to extract rules directly from
data. Rules are grown in a greedy fashion one class at a time. For binary class
problems, RIPPER chooses the majority class as its default class and learns
the rules to detect instances from the minority class. For multiclass problems,
the classes are ordered according to their prevalence in the training set. Let
(y1, y2, … ,yc) be the ordered list of classes, where y1 is the least prevalent
class and yc is the most prevalent class. All training instances that belong to
y1 are initially labeled as positive examples, while those that belong to other
classes are labeled as negative examples. The sequential covering algorithm
learns a set of rules to discriminate the positive from negative examples.
Next, all training instances from y2 are labeled as positive, while those from
classes y3, y4, ⋯, yc are labeled as negative. The sequential covering
algorithm would learn the next set of rules to distinguish y2 from other
remaining classes. This process is repeated until we are left with only one
class, yc, which is designated as the default class.
Algorithm 4.1 Sequential covering algorithm.
Let E be the training instances and A be the set of attribute-value pairs, {(Aj, vj)}.
Let Yo be an ordered set of classes {y1, y2, …, yk}.
Let R = { } be the initial rule list.
for each class y ∈ Yo − {yk} do
    while stopping condition is not met do
        r ← Learn-One-Rule(E, A, y).
        Remove training instances from E that are covered by r.
        Add r to the bottom of the rule list: R ← R ∨ r.
    end while
end for
Insert the default rule, {} → yk, to the bottom of the rule list R.
A summary of the sequential covering algorithm is shown in Algorithm 4.1.
The algorithm starts with an empty decision list, R, and extracts rules for each
class based on the ordering specified by the class prevalence. It iteratively
extracts the rules for a given class y using the Learn-One-Rule function. Once
such a rule is found, all the training instances covered by the rule are
eliminated. The new rule is added to the bottom of the decision list R. This
procedure is repeated until the stopping criterion is met. The algorithm then
proceeds to generate rules for the next class.
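The control flow of sequential covering can be sketched in Python as follows. This is a minimal illustration and not RIPPER itself: the learn_one_rule stand-in greedily adds the conjunct with the highest accuracy (an assumed criterion), the stopping condition is simply that no instance of the current class remains, and the toy data set is hypothetical.

```python
def covers(antecedent, x):
    return all(x[a] == v for a, v in antecedent.items())


def learn_one_rule(examples, target):
    """Greedy stand-in for Learn-One-Rule: add the most accurate conjunct
    until the rule is pure or no conjunct improves its accuracy."""
    antecedent, covered = {}, list(examples)
    while True:
        positives = sum(y == target for _, y in covered)
        if positives == len(covered):
            break  # the rule covers only positive examples
        best = None
        candidates = {(a, v) for x, _ in covered for a, v in x.items()
                      if a not in antecedent}
        for a, v in sorted(candidates):
            subset = [(x, y) for x, y in covered if x[a] == v]
            p = sum(y == target for _, y in subset)
            score = (p / len(subset), p)
            if p > 0 and (best is None or score > best[0]):
                best = (score, a, v, subset)
        if best is None or best[0][0] <= positives / len(covered):
            break  # no conjunct improves the accuracy
        antecedent[best[1]] = best[2]
        covered = best[3]
    return antecedent


def sequential_covering(examples, ordered_classes):
    """ordered_classes runs from least to most prevalent; the last is the default."""
    remaining, rule_list = list(examples), []
    for target in ordered_classes[:-1]:
        while any(y == target for _, y in remaining):   # simplified stopping condition
            antecedent = learn_one_rule(remaining, target)
            rule_list.append((antecedent, target))      # add to the bottom of the list
            remaining = [(x, y) for x, y in remaining if not covers(antecedent, x)]
    rule_list.append(({}, ordered_classes[-1]))         # default rule
    return rule_list


# Hypothetical toy data: (attributes, class label).
data = [
    ({"gives_birth": "yes", "body_temp": "warm"}, "mammal"),
    ({"gives_birth": "yes", "body_temp": "warm"}, "mammal"),
    ({"gives_birth": "no", "body_temp": "warm"}, "bird"),
    ({"gives_birth": "no", "body_temp": "cold"}, "reptile"),
    ({"gives_birth": "no", "body_temp": "cold"}, "reptile"),
    ({"gives_birth": "yes", "body_temp": "cold"}, "reptile"),
]
rules = sequential_covering(data, ["bird", "mammal", "reptile"])
for antecedent, label in rules:
    print(antecedent, "->", label)
```

Because the instances covered by each new rule are removed before the next rule is learned, the rule for mammals can be shorter than it would be on the full data: the only warm-blooded non-mammal has already been claimed by the rule for birds.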
Figure 4.1 demonstrates how the sequential covering algorithm works for a
data set that contains a collection of positive and negative examples. The rule
R1, whose coverage is shown in Figure 4.1(b), is extracted first because it
covers the largest fraction of positive examples. All the training instances
covered by R1 are subsequently removed and the algorithm proceeds to look
for the next best rule, which is R2.
Learn-One-Rule Function
Finding an optimal rule is computationally expensive due to the exponential
search space to explore. The Learn-One-Rule function addresses this problem
by growing the rules in a greedy fashion. It generates an initial rule r: {}→+,
where the left-hand side is an empty set and the right-hand side corresponds
to the positive class. It then refines the rule until a certain stopping criterion is
met. The accuracy of the initial rule may be poor because some of the
training instances covered by the rule belong to the negative class. A new
conjunct must be added to the rule antecedent to improve its accuracy.
Figure 4.1.
An example of the sequential covering algorithm.
RIPPER uses the FOIL’s information gain measure to choose the best
conjunct to be added into the rule antecedent. The measure takes into
consideration both the gain in accuracy and support of a candidate rule,
where support is defined as the number of positive examples covered by the
rule. For example, suppose the rule r: A→+ initially covers p0 positive
examples and n0 negative examples. After adding a new conjunct B, the
extended rule r′: A∧B→+ covers p1 positive examples and n1 negative
examples. The FOIL’s information gain of the extended rule is computed as
$$\text{FOIL's information gain} = p_1 \times \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right). \tag{4.4}$$
RIPPER chooses the conjunct with highest FOIL’s information gain to extend
the rule, as illustrated in the next example.
Example 4.2. [FOIL’s Information Gain]
Consider the training set for the vertebrate classification problem shown in
Table 4.2. Suppose the target class for the Learn-One-Rule function is
mammals. Initially, the antecedent of the rule {}→Mammals covers 5
positive and 10 negative examples. Thus, the accuracy of the rule is only
0.333. Next, consider the following three candidate conjuncts to be added to
the left-hand side of the rule: Skin cover=hair, Body temperature=warm, and
Has legs=No. The number of positive and negative examples covered by the
rule after adding each conjunct along with their respective accuracy and
FOIL’s information gain are shown in the following table.
Candidate rule                     p1  n1  Accuracy  Info Gain
{Skin Cover=hair}→mammals          3   0   1.000     4.755
{Body temperature=warm}→mammals    5   2   0.714     5.498
{Has legs=No}→mammals              1   4   0.200     −0.737
Although Skin cover=hair has the highest accuracy among the three
candidates, the conjunct Body temperature=warm has the highest FOIL’s
information gain. Thus, it is chosen to extend the rule (see Figure 4.2). This
process continues until adding new conjuncts no longer improves the
information gain measure.
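The gains in Example 4.2 can be recomputed directly from Equation 4.4. In the sketch below, the initial rule covers p0 = 5 positive and n0 = 10 negative examples, and the counts for each candidate are the ones consistent with the accuracies listed in the example (3/3 = 1.000, 5/7 = 0.714, and 1/5 = 0.200). The function name and the convention of returning 0 when p1 = 0 are assumptions.

```python
from math import log2


def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain of extending a rule (Equation 4.4)."""
    if p1 == 0:
        return 0.0  # assumed convention: a rule with no positives gains nothing
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))


P0, N0 = 5, 10  # coverage of the initial rule {} -> Mammals

# candidate conjunct -> (p1, n1)
candidates = {
    "Skin cover=hair": (3, 0),
    "Body temperature=warm": (5, 2),
    "Has legs=No": (1, 4),
}

gains = {name: foil_gain(P0, N0, p1, n1) for name, (p1, n1) in candidates.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name:24s} {gain:7.3f}")
print("chosen conjunct:", best)
```

The factor p1 rewards conjuncts that keep many positive examples, which is why the warm-blooded test is preferred even though the hair test is more accurate.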
Rule Pruning
The rules generated by the Learn-One-Rule function can be pruned to
reduce their generalization error. RIPPER prunes the rules based on their
performance on the validation set. The following metric is computed to
determine whether pruning is needed: (p−n)/(p+n), where p (n) is the number
of positive (negative) examples in the validation set covered by the rule. This
metric is monotonically related to the rule’s accuracy on the validation set. If
the metric improves after pruning, then the conjunct is removed. Pruning is
done starting from the last conjunct added to the rule. For example, given a
rule ABCD→y, RIPPER checks whether D should be pruned first, followed
by CD, BCD, etc. While the original rule covers only positive examples, the
pruned rule may cover some of the negative examples in the training set.
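The pruning step can be sketched as a search over prefixes of the rule antecedent. The code below is a simplified illustration rather than RIPPER's exact procedure: it evaluates (p−n)/(p+n) on a hypothetical validation set after dropping the last one, two, and so on conjuncts, and keeps a shorter rule only if the metric strictly improves. Keeping at least one conjunct is an assumption.

```python
def covers(conjuncts, x):
    return all(x.get(a) == v for a, v in conjuncts)


def pruning_metric(conjuncts, validation, target):
    """(p - n) / (p + n) on the validation set."""
    p = sum(1 for x, y in validation if covers(conjuncts, x) and y == target)
    n = sum(1 for x, y in validation if covers(conjuncts, x) and y != target)
    return (p - n) / (p + n) if p + n else float("-inf")


def prune(conjuncts, validation, target):
    """Conjuncts are listed in the order they were added to the rule."""
    best = list(conjuncts)
    best_score = pruning_metric(best, validation, target)
    # For a rule ABCD: try dropping D, then CD, then BCD.
    for keep in range(len(conjuncts) - 1, 0, -1):
        candidate = list(conjuncts[:keep])
        score = pruning_metric(candidate, validation, target)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical rule and validation set.
rule = [("a", 1), ("b", 1), ("c", 1)]
validation = [
    ({"a": 1, "b": 1, "c": 1}, "+"),
    ({"a": 1, "b": 1, "c": 0}, "+"),
    ({"a": 1, "b": 1, "c": 0}, "+"),
    ({"a": 1, "b": 0, "c": 0}, "-"),
    ({"a": 1, "b": 0, "c": 1}, "-"),
    ({"a": 1, "b": 1, "c": 1}, "-"),
]
pruned = prune(rule, validation, "+")
print(pruned)
```

On this toy data the full rule scores 0, dropping the last conjunct raises the metric to 0.5, and dropping two conjuncts brings it back to 0, so only the last conjunct is pruned.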
Building the Rule Set
After generating a rule, all the positive and negative examples covered by the
rule are eliminated. The rule is then added into the rule set as long as it does
not violate the stopping condition, which is based on the minimum
description length principle. If the new rule increases the total description
length of the rule set by at least d bits, then RIPPER stops adding rules into
its rule set (by default, d is chosen to be 64 bits). Another stopping condition
used by RIPPER is that the error rate of the rule on the validation set must
not exceed 50%.
Figure 4.2.
General-to-specific and specific-to-general rule-growing strategies.
RIPPER also performs additional optimization steps to determine whether
some of the existing rules in the rule set can be replaced by better alternative
rules. Readers who are interested in the details of the optimization method
may refer to the reference cited at the end of this chapter.
Instance Elimination
After a rule is extracted, RIPPER eliminates the positive and negative
examples covered by the rule. The rationale for doing this is illustrated in the
next example.
Figure 4.3 shows three possible rules, R1, R2, and R3, extracted from a
training set that contains 29 positive examples and 21 negative examples. The
accuracies of R1, R2, and R3 are 12/15 (80%), 7/10 (70%), and 8/12 (66.7%),
respectively. R1 is generated first because it has the highest accuracy. After
generating R1, the algorithm must remove the examples covered by the rule
so that the next rule generated by the algorithm is different from R1. The
question is, should it remove the positive examples only, negative examples
only, or both? To answer this, suppose the algorithm must choose between
generating R2 or R3 after R1. Even though R2 has a higher accuracy than R3
(70% versus 66.7%), observe that the region covered by R2 is disjoint from
R1, while the region covered by R3 overlaps with R1. As a result, R1 and R3
together cover 18 positive and 5 negative examples (resulting in an overall
accuracy of 78.3%), whereas R1 and R2 together cover 19 positive and 6
negative examples (resulting in a lower overall accuracy of 76%). If the
positive examples covered by R1 are not removed, then we may overestimate
the effective accuracy of R3. If the negative examples covered by R1 are not
removed, then we may underestimate the accuracy of R3. In the latter case,
we might end up preferring R2 over R3 even though half of the false positive
errors committed by R3 have already been accounted for by the preceding
rule, R1. This example shows that the effective accuracy after adding R2 or
R3 to the rule set becomes evident only when both positive and negative
examples covered by R1 are removed.
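The arithmetic behind this example can be made explicit. The counts below are read off the accuracies and combined coverages quoted in the text; the overlap between R1 and R3 is then obtained by inclusion-exclusion. The variable names are illustrative.

```python
# (positive, negative) examples covered by each rule, from the stated accuracies.
R1 = (12, 3)
R2 = (7, 3)
R3 = (8, 4)
R1_OR_R3 = (18, 5)  # combined coverage of R1 and R3 stated in the text

# Examples covered by both R1 and R3 (inclusion-exclusion).
overlap_pos = R1[0] + R3[0] - R1_OR_R3[0]
overlap_neg = R1[1] + R3[1] - R1_OR_R3[1]


def accuracy(pos, neg):
    return pos / (pos + neg)


acc_r2 = accuracy(*R2)                                           # R2 is disjoint from R1
acc_r3_nothing_removed = accuracy(*R3)
acc_r3_negatives_kept = accuracy(R3[0] - overlap_pos, R3[1])     # only positives removed
acc_r3_positives_kept = accuracy(R3[0], R3[1] - overlap_neg)     # only negatives removed
acc_r3_both_removed = accuracy(R3[0] - overlap_pos, R3[1] - overlap_neg)

print(f"overlap: {overlap_pos} positive, {overlap_neg} negative")
print(f"R2: {acc_r2:.3f}")
print(f"R3, nothing removed:      {acc_r3_nothing_removed:.3f}")
print(f"R3, only positives removed: {acc_r3_negatives_kept:.3f}")
print(f"R3, only negatives removed: {acc_r3_positives_kept:.3f}")
print(f"R3, both removed:         {acc_r3_both_removed:.3f}")
```

Once both the positive and negative examples covered by R1 are removed, R3 has an effective accuracy of 6/8 = 75%, which exceeds the 70% of R2. Leaving R1's negatives in place pulls the estimate for R3 down to 60%, and leaving R1's positives in place inflates it to 80%.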
Figure 4.3.
Elimination of training instances by the sequential covering
algorithm. R1, R2, and R3 represent regions covered by three
different rules.
4.2.4 Indirect Methods for Rule Extraction
This section presents a method for generating a rule set from a decision tree.
In principle, every path from the root node to the leaf node of a decision tree
can be expressed as a classification rule. The test conditions encountered
along the path form the conjuncts of the rule antecedent, while the class label
at the leaf node is assigned to the rule consequent. Figure 4.4 shows an
example of a rule set generated from a decision tree. Notice that the rule set is
exhaustive and contains mutually exclusive rules. However, some of the rules
can be simplified as shown in the next example.
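The path-to-rule conversion described above can be sketched with a recursive traversal. The tree below is a small hypothetical example over binary attributes P, Q, and R, encoded as nested tuples; the encoding and the function name are assumptions made for illustration.

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string.
TREE = ("P", {
    "No": ("Q", {"No": "-", "Yes": "+"}),
    "Yes": ("R", {
        "No": "+",
        "Yes": ("Q", {"No": "-", "Yes": "+"}),
    }),
})


def extract_rules(node, path=()):
    """Return one (antecedent, class) pair per root-to-leaf path."""
    if isinstance(node, str):              # leaf: emit the accumulated conjuncts
        return [(list(path), node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, path + ((attribute, value),)))
    return rules


rules = extract_rules(TREE)
for antecedent, label in rules:
    conjuncts = " ∧ ".join(f"({a}={v})" for a, v in antecedent)
    print(f"{conjuncts} → {label}")
```

Each instance follows exactly one path through the tree, so the extracted rules are mutually exclusive and exhaustive by construction.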
Figure 4.4.
Converting a decision tree into classification rules.
Example 4.3.
Consider the following three rules from Figure 4.4:
Observe that the rule set always predicts a positive class when the value of Q
is Yes. Therefore, we may simplify the rules as follows:
r3 is retained to cover the remaining instances of the positive class. Although
the rules obtained after simplification are no longer mutually exclusive, they
are less complex and are easier to interpret.
In the following, we describe an approach used by the C4.5rules algorithm to
generate a rule set from a decision tree. Figure 4.5 shows the decision tree
and resulting classification rules obtained for the data set given in Table 4.2.
Rule Generation
Classification rules are extracted for every path from the root to one of the
leaf nodes in the decision tree. Given a classification rule r:A→y, we
consider a simplified rule, r′:A′→y where A′ is obtained by removing one of
the conjuncts in A. The simplified rule with the lowest pessimistic error rate
is retained provided its error rate is less than that of the original rule. The
rule-pruning step is repeated until the pessimistic error of the rule cannot be
improved further. Because some of the rules may become identical after
pruning, the duplicate rules are discarded.
Figure 4.5.
Classification rules extracted from a decision tree for the vertebrate
classification problem.
Rule Ordering
After generating the rule set, C4.5rules uses the class-based ordering scheme
to order the extracted rules. Rules that predict the same class are grouped
together into the same subset. The total description length for each subset is
computed, and the classes are arranged in increasing order of their total
description length. The class that has the smallest description length is given
the highest priority because it is expected to contain the best set of rules. The
total description length for a class is given by Lexception+g×Lmodel, where
Lexception is the number of bits needed to encode the misclassified
examples, Lmodel is the number of bits needed to encode the model, and g is
a tuning parameter whose default value is 0.5. The tuning parameter depends
on the number of redundant attributes present in the model. The value of the
tuning parameter is small if the model contains many redundant attributes.
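The class-based ordering can be illustrated with a few lines of Python. The bit counts below are hypothetical; only the formula Lexception + g × Lmodel and the default g = 0.5 come from the text.

```python
def total_description_length(l_exception, l_model, g=0.5):
    """L_exception + g * L_model, in bits."""
    return l_exception + g * l_model


# Hypothetical (L_exception, L_model) values for the rule subset of each class.
bits = {
    "Mammals": (12.0, 40.0),
    "Birds": (8.0, 30.0),
    "Reptiles": (20.0, 36.0),
}

# Classes with a smaller total description length receive higher priority.
order = sorted(bits, key=lambda c: total_description_length(*bits[c]))
for c in order:
    print(c, total_description_length(*bits[c]))
```

A smaller g discounts the model term, so a class whose rules contain many redundant conjuncts is penalized less for its model size.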
4.2.5 Characteristics of Rule-Based Classifiers
1. Rule-based classifiers have characteristics very similar to those of decision trees.
The expressiveness of a rule set is almost equivalent to that of a decision
tree because a decision tree can be represented by a set of mutually
exclusive and exhaustive rules. Both rule-based and decision tree
classifiers create rectilinear partitions of the attribute space and assign a
class to each partition. However, a rule-based classifier can allow
multiple rules to be triggered for a given instance, thus enabling the
learning of more complex models than decision trees.
2. Like decision trees, rule-based classifiers can handle varying types of
categorical and continuous attributes and can easily work in multiclass
classification scenarios. Rule-based classifiers are generally used to
produce descriptive models that are easier to interpret but give
comparable performance to the decision tree classifier.
3. Rule-based classifiers can easily handle the presence of redundant
attributes that are highly correlated with one another. This is because once
an attribute has been used as a conjunct in a rule antecedent, the
remaining redundant attributes would show little to no FOIL’s
information gain and would thus be ignored.
4. Since irrelevant attributes show poor information gain, rule-based
classifiers can avoid selecting irrelevant attributes if there are other
relevant attributes that show better information gain. However, if the
problem is complex and there are interacting attributes that can
collectively distinguish between the classes but individually show poor
information gain, it is likely for an irrelevant attribute to be accidentally
favored over a relevant attribute just by random chance. Hence, rule-based
classifiers can show poor performance in the presence of
interacting attributes, when the number of irrelevant attributes is large.
5. The class-based ordering strategy adopted by RIPPER, which
emphasizes giving higher priority to rare classes, is well suited for
handling training data sets with imbalanced class distributions.
6. Rule-based classifiers are not well-suited for handling missing values in
the test set. This is because the position of rules in a rule set follows a
certain ordering strategy and even if a test instance is covered by
multiple rules, they can assign different class labels depending on their
position in the rule set. Hence, if a certain rule involves an attribute that
is missing in a test instance, it is difficult to ignore the rule and proceed
to the subsequent rules in the rule set, as such a strategy can result in
incorrect class assignments.
4.3 Nearest Neighbor Classifiers
The classification framework shown in Figure 3.3 involves a two-step process:
(1) an inductive step for constructing a classification model from data, and
(2) a deductive step for applying the model to test examples. Decision tree
and rule-based classifiers are examples of eager learners because they are
designed to learn a model that maps the input attributes to the class label as
soon as the training data becomes available. An opposite strategy would be to
delay the process of modeling the training data until it is needed to classify
the test instances. Techniques that employ this strategy are known as lazy
learners. An example of a lazy learner is the Rote classifier, which
memorizes the entire training data and performs classification only if the
attributes of a test instance match one of the training examples exactly. An
obvious drawback of this approach is that some test instances may not be
classified because they do not match any training example.
One way to make this approach more flexible is to find all the training
examples that are relatively similar to the attributes of the test instances.
These examples, which are known as nearest neighbors, can be used to
determine the class label of the test instance. The justification for using
nearest neighbors is best exemplified by the following saying: “If it walks
like a duck, quacks like a duck, and looks like a duck, then it’s probably a
duck.” A nearest neighbor classifier represents each example as a data point
in a d-dimensional space, where d is the number of attributes. Given a test
instance, we compute its proximity to the training instances according to one
of the proximity measures described in Section 2.4. The k-nearest
neighbors of a given test instance z refer to the k training examples that are
closest to z.
Figure 4.6 illustrates the 1-, 2-, and 3-nearest neighbors of a test instance
located at the center of each circle. The instance is classified based on the
class labels of its neighbors. In the case where the neighbors have more than
one label, the test instance is assigned to the majority class of its nearest
neighbors. In Figure 4.6(a), the 1-nearest neighbor of the instance is a
negative example. Therefore the instance is assigned to the negative class. If
the number of nearest neighbors is three, as shown in Figure 4.6(c), then the
neighborhood contains two positive examples and one negative example.
Using the majority voting scheme, the instance is assigned to the positive
class. In the case where there is a tie between the classes (see Figure 4.6(b)),
we may randomly choose one of them to classify the data point.
Figure 4.6.
The 1-, 2-, and 3-nearest neighbors of an instance.
The preceding discussion underscores the importance of choosing the right
value for k. If k is too small, then the nearest neighbor classifier may be
susceptible to overfitting due to noise, i.e., mislabeled examples in the
training data. On the other hand, if k is too large, the nearest neighbor
classifier may misclassify the test instance because its list of nearest
neighbors includes training examples that are located far away from its
neighborhood (see Figure 4.7).
Figure 4.7.
k-nearest neighbor classification with large k.
4.3.1 Algorithm
A high-level summary of the nearest neighbor classification method is given
in Algorithm 4.2. The algorithm computes the distance (or similarity)
between each test instance z=(x′, y′) and all the training examples (x, y)∈D
to determine its nearest neighbor list, Dz. Such computation can be costly if
the number of training examples is large. However, efficient indexing
techniques are available to reduce the computation needed to find the nearest
neighbors of a test instance.
Algorithm 4.2 The k-nearest neighbor classifier.
Let k be the number of nearest neighbors and D be the set of training examples.
for each test instance z=(x′, y′) do
    Compute d(x′, x), the distance between z and every example (x, y)∈D.
    Select Dz⊆D, the set of k closest training examples to z.
    $y' = \operatorname{argmax}_v \sum_{(x_i, y_i) \in D_z} I(v = y_i)$
end for
Once the nearest neighbor list is obtained, the test instance is classified based
on the majority class of its nearest neighbors:
$$\text{Majority Voting:} \quad y' = \operatorname{argmax}_v \sum_{(x_i, y_i) \in D_z} I(v = y_i), \tag{4.5}$$
where v is a class label, yi is the class label for one of the nearest neighbors,
and I(⋅) is an indicator function that returns the value 1 if its argument is true
and 0 otherwise.
In the majority voting approach, every neighbor has the same impact on the
classification. This makes the algorithm sensitive to the choice of k, as shown
in Figure 4.6. One way to reduce the impact of k is to weight the influence of
each nearest neighbor xi according to its distance: $w_i = 1/d(x', x_i)^2$. As a
result, training examples that are located far away from z have a weaker
impact on the classification compared to those that are located close to z.
Using the distance-weighted voting scheme, the class label can be determined
as follows:
$$\text{Distance-Weighted Voting:} \quad y' = \operatorname{argmax}_v \sum_{(x_i, y_i) \in D_z} w_i \times I(v = y_i). \tag{4.6}$$
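Both voting schemes can be sketched in a few lines of Python. The example below uses a brute-force search with Euclidean distance, which is only one possible proximity measure; the training points, the treatment of zero distances, and the function name are assumptions.

```python
from collections import defaultdict
from math import dist


def knn_predict(train, x_test, k, weighted=False):
    """Classify x_test from its k nearest training examples.

    train is a list of (attribute tuple, class label) pairs.
    """
    neighbors = sorted(train, key=lambda ex: dist(ex[0], x_test))[:k]
    votes = defaultdict(float)
    for x_i, y_i in neighbors:
        if weighted:
            d = dist(x_i, x_test)
            # w_i = 1 / d^2; an exact match gets infinite weight (assumed convention)
            votes[y_i] += 1.0 / (d * d) if d > 0 else float("inf")
        else:
            votes[y_i] += 1.0            # majority voting: every neighbor counts once
    return max(votes, key=votes.get)


# Hypothetical two-dimensional training set.
train = [
    ((1.0, 1.0), "-"),
    ((2.0, 2.0), "+"),
    ((2.0, 0.0), "+"),
    ((5.0, 5.0), "-"),
    ((6.0, 5.0), "-"),
]
x_test = (1.2, 1.0)

print(knn_predict(train, x_test, k=1))
print(knn_predict(train, x_test, k=3))
print(knn_predict(train, x_test, k=3, weighted=True))
```

For this test point the 1-nearest neighbor is negative, the unweighted 3-nearest neighbor vote is positive, and the distance-weighted vote returns to negative because the single very close neighbor outweighs the two more distant ones.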
4.3.2 Characteristics of Nearest
Neighbor Classifiers
1. Nearest neighbor classification is part of a more general technique
known as instance-based learning, which does not build a global model,
but rather uses the training examples to make predictions for a test
instance. (Thus, such classifiers are often said to be “model free.”) Such
algorithms require a proximity measure to determine the similarity or
distance between instances and a classification function that returns the
predicted class of a test instance based on its proximity to other instances.
2. Although lazy learners, such as nearest neighbor classifiers, do not
require model building, classifying a test instance can be quite expensive
because we need to compute the proximity values individually between
the test and training examples. In contrast, eager learners often spend the
bulk of their computing resources for model building. Once a model has
been built, classifying a test instance is extremely fast.
3. Nearest neighbor classifiers make their predictions based on local
information. (This is equivalent to building a local model for each test
instance.) By contrast, decision tree and rule-based classifiers attempt to
find a global model that fits the entire input space. Because the
classification decisions are made locally, nearest neighbor classifiers
(with small values of k) are quite susceptible to noise.
4. Nearest neighbor classifiers can produce decision boundaries of
arbitrary shape. Such boundaries provide a more flexible model
representation compared to decision tree and rule-based classifiers that
are often constrained to rectilinear decision boundaries. The decision
boundaries of nearest neighbor classifiers also have high variability
because they depend on the composition of training examples in the
local neighborhood. Increasing the number of nearest neighbors may
reduce such variability.
5. Nearest neighbor classifiers have difficulty handling missing values in
both the training and test sets since proximity computations normally
require the presence of all attributes. Although the subset of attributes
present in two instances can be used to compute a proximity, such an
approach may not produce good results since the proximity measures
may be different for each pair of instances and thus hard to compare.
6. Nearest neighbor classifiers can handle the presence of interacting
attributes, i.e., attributes that have more predictive power taken in
combination than by themselves, by using appropriate proximity
measures that can incorporate the effects of multiple attributes together.
7. The presence of irrelevant attributes can distort commonly used
proximity measures, especially when the number of irrelevant attributes
is large. Furthermore, if there are a large number of redundant attributes
that are highly correlated with each other, then the proximity measure
can be overly biased toward such attributes, resulting in improper
estimates of distance. Hence, the presence of irrelevant and redundant
attributes can adversely affect the performance of nearest neighbor classifiers.
8. Nearest neighbor classifiers can produce wrong predictions unless the
appropriate proximity measure and data preprocessing steps are taken.
For example, suppose we want to classify a group of people based on
attributes such as height (measured in meters) and weight (measured in
pounds). The height attribute has a low variability, ranging from 1.5 m
to 1.85 m, whereas the weight attribute may vary from 90 lb. to 250 lb.
If the scale of the attributes is not taken into consideration, the
proximity measure may be dominated by differences in the weights of a person.
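The effect of attribute scale in the last item can be seen numerically. The sketch below compares Euclidean distances before and after z-score standardization, which is one common preprocessing choice and is not prescribed by the text. The four (height, weight) records are hypothetical values inside the ranges mentioned above.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical people: (height in meters, weight in pounds).
people = [(1.50, 90.0), (1.60, 240.0), (1.85, 100.0), (1.80, 250.0)]


def standardize(rows):
    """Rescale every attribute to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]


z = standardize(people)

# Person 0 versus person 1 (similar height, very different weight)
# and person 0 versus person 2 (similar weight, very different height).
print("raw:         ", round(dist(people[0], people[1]), 3), round(dist(people[0], people[2]), 3))
print("standardized:", round(dist(z[0], z[1]), 3), round(dist(z[0], z[2]), 3))
```

On the raw data the distance is almost entirely the weight difference, so person 2 looks far closer to person 0 than person 1 does. After standardization both attributes contribute, and the large height difference between persons 0 and 2 is no longer hidden.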
4.4 Naïve Bayes Classifier
Many classification problems involve uncertainty. First, the observed
attributes and class labels may be unreliable due to imperfections in the
measurement process, e.g., due to the limited precision of sensor devices.
Second, the set of attributes chosen for classification may not be fully
representative of the target class, resulting in uncertain predictions. To
illustrate this, consider the problem of predicting a person’s risk for heart
disease based on a model that uses their diet and workout frequency as
attributes. Although most people who eat healthily and exercise regularly
have less chance of developing heart disease, they may still be at risk due to
other latent factors, such as heredity, excessive smoking, and alcohol abuse,
that are not captured in the model. Third, a classification model learned over
a finite training set may not be able to fully capture the true relationships in
the overall data, as discussed in the context of model overfitting in the
previous chapter. Finally, uncertainty in predictions may arise due to the
inherent random nature of real-world systems, such as those encountered in
weather forecasting problems.
In the presence of uncertainty, there is a need to not only make predictions of
class labels but also provide a measure of confidence associated with every
prediction. Probability theory offers a systematic way for quantifying and
manipulating uncertainty in data, and thus, is an appealing framework for
assessing the confidence of predictions. Classification models that make use
of probability theory to represent the relationship between attributes and class
labels are known as probabilistic classification models. In this section, we
present the naïve Bayes classifier, which is one of the simplest and most
widely used probabilistic classification models.
4.4.1 Basics of Probability Theory
Before we discuss how the naïve Bayes classifier works, we first introduce
some basics of probability theory that will be useful in understanding the
probabilistic classification models presented in this chapter. This involves
defining the notion of probability and introducing some common approaches
for manipulating probability values.
Consider a variable X, which can take any discrete value from the set
{x1, …, xk}. When we have multiple observations of that variable, such as in
a data set where the variable describes some characteristic of data objects,
then we can compute the relative frequency with which each value occurs.
Specifically, suppose that X has the value xi for ni data objects. The relative
frequency with which we observe the event X=xi is then ni/N, where N
denotes the total number of occurrences ($N = \sum_{i=1}^{k} n_i$). These relative
frequencies characterize the uncertainty that we have with respect to what
value X may take for an unseen observation and motivate the notion of
probability.
More formally, the probability of an event e, e.g., P(X=xi), measures how
likely it is for the event e to occur. The most traditional view of probability is
based on relative frequency of events (frequentist), while the Bayesian
viewpoint (described later) takes a more flexible view of probabilities. In
either case, a probability is always a number between 0 and 1. Further, the
sum of probability values of all possible events, e.g., outcomes of a variable
X, is equal to 1. Variables that have probabilities associated with each possible
outcome (value) are known as random variables.
Now, let us consider two random variables, X and Y, that can each take k
discrete values. Let nij be the number of times we observe X=xi and Y=yj,
out of a total number of N occurrences. The joint probability of observing
X=xi and Y=yj together can be estimated as
$$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}. \tag{4.7}$$
(This is an estimate since we typically have only a finite subset of all possible
observations.) Joint probabilities can be used to answer questions such as
“what is the probability that there will be a surprise quiz today AND I will be
late for the class.” Joint probabilities are symmetric, i.e.,
P(X=x, Y=y)=P(Y=y, X=x). For joint probabilities, it is useful to consider
their sum with respect to one of the random variables, as described in the
following equation:
$$\sum_{j=1}^{k} P(X = x_i, Y = y_j) = \frac{\sum_{j=1}^{k} n_{ij}}{N} = \frac{n_i}{N} = P(X = x_i), \tag{4.8}$$
where ni is the total number of times we observe X=xi irrespective of the
value of Y. Notice that ni/N is essentially the probability of observing X=xi.
Hence, by summing out the joint probabilities with respect to a random
variable Y, we obtain the probability of observing the remaining variable X.
This operation is called marginalization and the probability value P(X=xi)
obtained by marginalizing out Y is sometimes called the marginal
probability of X. As we will see later, joint probability and marginal
probability form the basic building blocks of a number of probabilistic
classification models discussed in this chapter.
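The estimates in Equations 4.7 and 4.8 can be sketched in a few lines of Python. The observations below are made up for illustration and the variable names are our own; the sketch only shows that summing the estimated joint probabilities over Y recovers the relative frequency of each value of X.

```python
from collections import Counter

# Hypothetical paired observations (x, y); the values are made up for illustration.
observations = [
    ("sunny", "late"), ("sunny", "on_time"), ("rainy", "late"),
    ("rainy", "late"), ("sunny", "on_time"), ("rainy", "on_time"),
    ("sunny", "on_time"), ("rainy", "late"),
]

N = len(observations)
joint_counts = Counter(observations)   # n_ij for every observed pair

# Equation 4.7: joint probability estimated as n_ij / N.
joint = {pair: n / N for pair, n in joint_counts.items()}

# Equation 4.8: marginalize out Y by summing the joint over all values of Y.
marginal_x = {}
for (x, _y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

print(marginal_x)
```

The marginal obtained this way agrees with counting the occurrences of each value of X directly, which is exactly what Equation 4.8 states.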
Notice that in the previous discussions, we used P(X=xi) to denote the
probability of a particular outcome of a random variable X. This notation can
easily become cumbersome when a number of random variables are involved.
Hence, in the remainder of this section, we will use P(X) to denote the
probability of any generic outcome of the random variable X, while P(xi) will
be used to represent the probability of the specific outcome xi.
Bayes Theorem
Suppose you have invited two of your friends Alex and Martha to a
dinner party. You know that Alex attends 40% of the parties he is invited
to. Further, if Alex is going to a party, there is an 80% chance of
Martha coming along. On the other hand, if Alex is not going to the
party, the chance of Martha coming to the party is reduced to 30%. If
Martha has responded that she will be coming to your party, what is the
probability that Alex will also be coming?
Bayes theorem presents the statistical principle for answering questions like
the previous one, where evidence from multiple sources has to be combined
with prior beliefs to arrive at predictions. Bayes theorem can be briefly
described as follows.
Let P(Y|X) denote the conditional probability of observing the random
variable Y whenever the random variable X takes a particular value. P(Y|X) is
often read as the probability of observing Y conditioned on the outcome of X.
Conditional probabilities can be used for answering questions such as “given
that it is going to rain today, what will be the probability that I will go to the
class.” Conditional probabilities of X and Y are related to their joint
probability in the following way:
$$P(Y|X) = \frac{P(X, Y)}{P(X)}, \quad \text{which implies} \tag{4.9}$$
$$P(X, Y) = P(Y|X) \times P(X) = P(X|Y) \times P(Y). \tag{4.10}$$
Rearranging the last two expressions in Equation 4.10 leads to Equation 4.11,
which is known as Bayes theorem:
$$P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}. \tag{4.11}$$
Bayes theorem provides a relationship between the conditional probabilities
P(Y|X) and P(X|Y). Note that the denominator in Equation 4.11 involves the
marginal probability of X, which can also be represented as
$$P(X) = \sum_{i=1}^{k} P(X, y_i) = \sum_{i=1}^{k} P(X|y_i) \times P(y_i).$$
Using the previous expression for P(X), we can obtain the following equation
for P(Y|X) solely in terms of P(X|Y) and P(Y):
$$P(Y|X) = \frac{P(X|Y) P(Y)}{\sum_{i=1}^{k} P(X|y_i) P(y_i)}. \tag{4.12}$$
Example 4.4. [Bayes Theorem]
Bayes theorem can be used to solve a number of inferential questions about
random variables. For example, consider the problem stated earlier of
inferring whether Alex will come to the party. Let P(A=1) denote the
probability of Alex going to a party, while P(A=0) denotes the probability of
him not going to a party. We know that
$$P(A=1) = 0.40, \qquad P(A=0) = 0.60.$$
Further, let P(M=1|A) denote the conditional probability of Martha going to a
party conditioned on whether Alex is going to the party. P(M=1|A) takes the
following values:
$$P(M=1|A=1) = 0.80, \qquad P(M=1|A=0) = 0.30.$$
We can use the above values of P(M|A) and P(A) to compute the probability
of Alex going to the party given Martha is going to the party, P(A=1|M=1),
as follows:
$$P(A=1|M=1) = \frac{P(M=1|A=1) P(A=1)}{P(M=1|A=0) P(A=0) + P(M=1|A=1) P(A=1)} = \frac{0.80 \times 0.40}{0.30 \times 0.60 + 0.80 \times 0.40} = 0.64.$$
Notice that even though the prior probability P(A) of Alex going to the party
is low, the observation that Martha is going, M=1, affects the conditional
probability P(A=1|M=1). This shows the value of Bayes theorem in
combining prior assumptions with observed outcomes to make predictions.
Since P(A=1|M=1)>0.5, it is more likely for Alex to join if Martha is going to
the party.
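The arithmetic of this example can be checked with a short Python sketch. It uses only the percentages stated in the party problem; the dictionary names are our own choice.

```python
# Quantities stated in the party example.
p_a = {1: 0.4, 0: 0.6}             # P(A): Alex attends / does not attend
p_m1_given_a = {1: 0.8, 0: 0.3}    # P(M=1 | A)

# Denominator of Equation 4.12: P(M=1) = sum over a of P(M=1 | a) P(a).
p_m1 = sum(p_m1_given_a[a] * p_a[a] for a in p_a)

# Bayes theorem: P(A=1 | M=1) = P(M=1 | A=1) P(A=1) / P(M=1).
p_a1_given_m1 = p_m1_given_a[1] * p_a[1] / p_m1

print(round(p_m1, 2), round(p_a1_given_m1, 2))  # 0.5 0.64
```

Observing that Martha is coming raises the probability of Alex attending from the prior value of 0.40 to a posterior value of 0.64.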
Using Bayes Theorem for Classification
For the purpose of classification, we are interested in computing the
probability of observing a class label y for a data instance given its set of
attribute values x. This can be represented as P(y|x), which is known as the
posterior probability of the target class. Using the Bayes Theorem, we can
represent the posterior probability as
$$P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y) P(y)}{P(\mathbf{x})} \tag{4.14}$$
Note that the numerator of the previous equation involves two terms, P(x|y)
and P(y), both of which contribute to the posterior probability P(y|x). We
describe both of these terms in the following.
The first term P(x|y) is known as the class-conditional probability of the
attributes given the class label. P(x|y) measures the likelihood of observing x
from the distribution of instances belonging to y. If x indeed belongs to class
y, then we should expect P(x|y) to be high. From this point of view, the use of
class-conditional probabilities attempts to capture the process from which the
data instances were generated. Because of this interpretation, probabilistic
classification models that involve computing class-conditional probabilities
are known as generative classification models. Apart from their use in
computing posterior probabilities and making predictions, class-conditional
probabilities also provide insights about the underlying mechanism behind
the generation of attribute values.
The second term in the numerator of Equation 4.14 is the prior probability
P(y). The prior probability captures our prior beliefs about the distribution of
class labels, independent of the observed attribute values. (This is the
Bayesian viewpoint.) For example, we may have a prior belief that the
likelihood of any person to suffer from a heart disease is α, irrespective of
their diagnosis reports. The prior probability can either be obtained using
expert knowledge, or inferred from historical distribution of class labels.
The denominator in Equation 4.14 involves the probability of evidence, P(x).
Note that this term does not depend on the class label and thus can be treated
as a normalization constant in the computation of posterior probabilities.
Further, the value of P(x) can be calculated as $P(\mathbf{x}) = \sum_{i} P(\mathbf{x}|y_i) P(y_i)$.
Bayes theorem provides a convenient way to combine our prior beliefs with
the likelihood of obtaining the observed attribute values. During the training
phase, we are required to learn the parameters for P(y) and P(x|y). The prior
probability P(y) can be easily estimated from the training set by computing
the fraction of training instances that belong to each class. To compute the
class-conditional probabilities, one approach is to consider the fraction of
training instances of a given class for every possible combination of attribute
values. For example, suppose that there are two attributes X1 and X2 that can
each take a discrete value from c1 to ck. Let $n^0$ denote the number of training
instances belonging to class 0, out of which $n_{ij}^0$ training instances
have X1=ci and X2=cj. The class-conditional probability can then be given as
$$P(X_1 = c_i, X_2 = c_j | Y = 0) = \frac{n_{ij}^{0}}{n^{0}}.$$
This approach can easily become computationally prohibitive as the number
of attributes increases, due to the exponential growth in the number of
attribute value combinations. For example, if every attribute can take k
discrete values, then the number of attribute value combinations is equal to
$k^d$, where d is the number of attributes. The large number of attribute value
combinations can also result in poor estimates of class-conditional
probabilities, since every combination will have fewer training instances
when the size of training set is small.
In the following, we present the naïve Bayes classifier, which makes a
simplifying assumption about the class-conditional probabilities, known as
the naïve Bayes assumption. The use of this assumption significantly helps
in obtaining reliable estimates of class-conditional probabilities, even when
the number of attributes is large.
4.4.2 Naïve Bayes Assumption
The naïve Bayes classifier assumes that the class-conditional probability of
all attributes x can be factored as a product of class-conditional probabilities
of every attribute xi, as described in the following equation:
$$P(\mathbf{x}|y) = \prod_{i=1}^{d} P(x_i|y), \tag{4.15}$$
where every data instance x consists of d attributes, {x1, x2, …, xd}. The
basic assumption behind the previous equation is that the attribute values xi
are conditionally independent of each other, given the class label y. This
means that the attributes are influenced only by the target class and if we
know the class label, then we can consider the attributes to be independent of
each other. The concept of conditional independence can be formally stated
as follows.
Conditional Independence
Let X1, X2, and Y denote three sets of random variables. The variables in
X1 are said to be conditionally independent of X2, given Y, if the following
condition holds:
P(X1|X2, Y)=P(X1|Y). (4.16)
This means that conditioned on Y, the distribution of X1 is not influenced by
the outcomes of X2, and hence is conditionally independent of X2. To
illustrate the notion of conditional independence, consider the relationship
between a person’s arm length (X1) and his or her reading skills (X2). One
might observe that people with longer arms tend to have higher levels of
reading skills, and thus consider X1 and X2 to be related to each other.
However, this relationship can be explained by another factor, which is the
age of the person (Y). A young child tends to have short arms and lacks the
reading skills of an adult. If the age of a person is fixed, then the observed
relationship between arm length and reading skills disappears. Thus, we can
conclude that arm length and reading skills are not directly related to each
other and are conditionally independent when the age variable is fixed.
Another way of describing conditional independence is to consider the joint
conditional probability, P(X1, X2|Y), as follows:
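$$\begin{aligned}
P(X_1, X_2 | Y) &= \frac{P(X_1, X_2, Y)}{P(Y)} \\
&= \frac{P(X_1, X_2, Y)}{P(X_2, Y)} \times \frac{P(X_2, Y)}{P(Y)} \\
&= P(X_1 | X_2, Y) \times P(X_2 | Y) \\
&= P(X_1 | Y) \times P(X_2 | Y),
\end{aligned} \tag{4.17}$$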
where Equation 4.16 was used to obtain the last line of Equation 4.17. The
previous description of conditional independence is quite useful from an
operational perspective. It states that the joint conditional probability of X1
and X2 given Y can be factored as the product of conditional probabilities of
X1 and X2 considered separately. This forms the basis of the naïve Bayes
assumption stated in Equation 4.15.
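The arm length and reading skill illustration can be reproduced numerically. In the following Python sketch, all probability values are assumed for illustration only; the joint distribution is built from the factorized form of Equation 4.17, and Equation 4.16 is then checked by computing conditional probabilities from that joint.

```python
from itertools import product

# Hypothetical distributions (assumed numbers): Y = age group,
# X1 = arm length, X2 = reading skill.
p_y = {"child": 0.5, "adult": 0.5}
p_x1_given_y = {"child": {"short": 0.9, "long": 0.1},
                "adult": {"short": 0.1, "long": 0.9}}
p_x2_given_y = {"child": {"low": 0.8, "high": 0.2},
                "adult": {"low": 0.2, "high": 0.8}}

# Joint distribution built with the factorization of Equation 4.17.
joint = {(x1, x2, y): p_y[y] * p_x1_given_y[y][x1] * p_x2_given_y[y][x2]
         for y in p_y
         for x1, x2 in product(("short", "long"), ("low", "high"))}

def prob(x1=None, x2=None, y=None):
    """Sum the joint over every variable left unspecified (marginalization)."""
    return sum(p for (a, b, c), p in joint.items()
               if (x1 is None or a == x1)
               and (x2 is None or b == x2)
               and (y is None or c == y))

# Without conditioning on age, arm length and reading skill look related:
# P(long, high) differs from P(long) * P(high).
p_long_and_high = prob(x1="long", x2="high")
p_long_times_high = prob(x1="long") * prob(x2="high")

# Equation 4.16: once age is fixed, knowing X2 does not change P(X1 | Y).
p_long_given_high_adult = prob("long", "high", "adult") / prob(x2="high", y="adult")
p_long_given_adult = prob(x1="long", y="adult") / prob(y="adult")

print(p_long_and_high, p_long_times_high)
print(p_long_given_high_adult, p_long_given_adult)
```

With these assumed numbers, the two variables are dependent when age is ignored, yet the two conditional probabilities on the last line coincide, as required by conditional independence.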
How a Naïve Bayes Classifier Works
Using the naïve Bayes assumption, we only need to estimate the conditional
probability of each xi given Y separately, instead of computing the class-
conditional probability for every combination of attribute values. For
example, if $n_i^0$ and $n_j^0$ denote the number of training instances belonging to
class 0 with X1=ci and X2=cj, respectively, then the class-conditional
probability can be estimated as
$$P(X_1 = c_i, X_2 = c_j | Y = 0) = \frac{n_i^0}{n^0} \times \frac{n_j^0}{n^0}.$$
In the previous equation, we only need to count the number of training
instances for every one of the k values of an attribute X, irrespective of the
values of other attributes. Hence, the number of parameters needed to learn
class-conditional probabilities is reduced from $k^d$ to $dk$. This greatly
simplifies the expression for the class-conditional probability and makes it
more amenable to learning parameters and making predictions, even in
high-dimensional settings.
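To get a feel for the size of this reduction, the following Python sketch compares, for a single class, the number of attribute-value combinations that a full table would need with the number of per-attribute estimates needed under the naïve Bayes assumption. The function names and the choice of k=3 are ours.

```python
def full_joint_table_size(k: int, d: int) -> int:
    """Attribute-value combinations per class without the naive Bayes assumption."""
    return k ** d

def naive_bayes_table_size(k: int, d: int) -> int:
    """Per-attribute conditional probabilities per class under Equation 4.15."""
    return d * k

# k = 3 discrete values per attribute, for a growing number of attributes d.
for d in (2, 5, 10, 20):
    print(d, full_joint_table_size(3, d), naive_bayes_table_size(3, d))
```

The first count grows exponentially in d, while the second grows only linearly.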
The naïve Bayes classifier computes the posterior probability for a test
instance x by using the following equation:
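$$P(y|\mathbf{x}) = \frac{P(y) \prod_{i=1}^{d} P(x_i|y)}{P(\mathbf{x})} \tag{4.18}$$
Since P(x) is fixed for every y and only acts as a normalizing constant to
ensure that P(y|x)∈[0, 1], we can write
$$P(y|\mathbf{x}) \propto P(y) \prod_{i=1}^{d} P(x_i|y).$$
Hence, it is sufficient to choose the class that maximizes $P(y) \prod_{i=1}^{d} P(x_i|y)$.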
One of the useful properties of the naïve Bayes classifier is that it can easily
work with incomplete information about data instances, when only a subset
of attributes are observed at every instance. For example, if we only observe
p out of d attributes at a data instance, then we can still compute
$P(y) \prod_{i=1}^{p} P(x_i|y)$ using those p attributes and choose the class with the
maximum value. The naïve Bayes classifier can thus naturally handle missing
values in test instances. In fact, in the extreme case where no attributes are
observed, we can still use the prior probability P(y) as an estimate of the
posterior probability. As we observe more attributes, we can keep refining the
posterior probability to better reflect the likelihood of observing the data
instance.
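The prediction rule, including the treatment of unobserved attributes, can be sketched as follows. This is a minimal illustration rather than a full implementation: the two-class model, its attribute names, and all probability values are assumed, and a missing attribute is marked with None.

```python
def naive_bayes_scores(priors, cond, instance):
    """Return P(y) * prod_i P(x_i | y) per class, skipping unobserved attributes.

    priors:   {class: P(y)}
    cond:     {class: {attribute: {value: P(x_i | y)}}}
    instance: {attribute: value or None}; None marks a missing value
    """
    scores = {}
    for y, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            if value is not None:          # use only the observed attributes
                score *= cond[y][attr][value]
        scores[y] = score
    return scores

def posterior(scores):
    """Normalize the scores so that they sum to 1 (division by P(x))."""
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Hypothetical two-class model with two binary attributes (assumed numbers).
priors = {"pos": 0.4, "neg": 0.6}
cond = {
    "pos": {"a": {0: 0.2, 1: 0.8}, "b": {0: 0.5, 1: 0.5}},
    "neg": {"a": {0: 0.7, 1: 0.3}, "b": {0: 0.9, 1: 0.1}},
}

full = posterior(naive_bayes_scores(priors, cond, {"a": 1, "b": 1}))
partial = posterior(naive_bayes_scores(priors, cond, {"a": 1, "b": None}))
nothing = posterior(naive_bayes_scores(priors, cond, {"a": None, "b": None}))

print(nothing, partial, full)
```

When no attribute is observed the posterior equals the prior, and each additional observed attribute refines it further.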
In the next two subsections, we describe several approaches for estimating
the conditional probabilities P(xi|y) for categorical and continuous attributes
from the training set.
Estimating Conditional
Probabilities for Categorical Attributes
For a categorical attribute Xi, the conditional probability P(Xi=c|y) is
estimated according to the fraction of training instances in class y where Xi
takes on a particular categorical value c:
$$P(X_i = c | y) = \frac{n_c}{n},$$
where n is the number of training instances belonging to class y, out of which
nc number of instances have Xi=c. For example, in the training set given in
Figure 4.8, seven people have the class label Defaulted Borrower=No, out of
which three people have Home Owner=Yes while the remaining four have
Home Owner=No. As a result, the conditional probability for
P(Home Owner=Yes|Defaulted Borrower=No) is equal to 3/7. Similarly, the
conditional probability for defaulted borrowers with Marital Status=Single is
given by P(Marital Status=Single|Defaulted Borrower=Yes)=2/3. Note that
the sum of conditional probabilities over all possible outcomes of Xi is equal
to one, i.e., $\sum_{c} P(X_i = c | y) = 1$.
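The fraction-based estimate is simple to compute from the attribute values of the training instances of one class. The Python sketch below uses the Home Owner counts quoted above for the class Defaulted Borrower=No (three Yes and four No); the function name is our own.

```python
from collections import Counter
from fractions import Fraction

def conditional_probabilities(values):
    """Estimate P(X_i = c | y) = n_c / n from the attribute values of one class."""
    n = len(values)
    return {c: Fraction(n_c, n) for c, n_c in Counter(values).items()}

# Home Owner values of the seven Defaulted Borrower = No instances,
# using the counts quoted in the text (three Yes, four No).
home_owner_given_no = conditional_probabilities(["Yes"] * 3 + ["No"] * 4)

print(home_owner_given_no["Yes"])          # 3/7
print(sum(home_owner_given_no.values()))   # 1
```

Exact fractions are used here only to make the result easy to compare with the values in the text; floating-point division would work equally well.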
Figure 4.8.
Training set for predicting the loan default problem.
Estimating Conditional
Probabilities for Continuous Attributes
There are two ways to estimate the class-conditional probabilities for
continuous attributes:
1. We can discretize each continuous attribute and then replace the
continuous values with their corresponding discrete intervals. This
approach transforms the continuous attributes into ordinal attributes, and
the simple method described previously for computing the conditional
probabilities of categorical attributes can be employed. Note that the
estimation error of this method depends on the discretization strategy (as
described in Section 2.3.6 on page 63), as well as the number of discrete
intervals. If the number of intervals is too large, every interval may have
an insufficient number of training instances to provide a reliable
estimate of P(Xi|Y). On the other hand, if the number of intervals is too
small, then the discretization process may lose information about the
true distribution of continuous values, and thus result in poor
predictions.
2. We can assume a certain form of probability distribution for the
continuous variable and estimate the parameters of the distribution using
the training data. For example, we can use a Gaussian distribution to
represent the conditional probability of continuous attributes. The
Gaussian distribution is characterized by two parameters, the mean, μ,
and the variance, σ². For each class yj, the class-conditional probability
for attribute Xi is
$$P(X_i = x_i | Y = y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left[-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right]. \tag{4.19}$$
The parameter μij can be estimated using the sample mean of Xi (x̄) for
all training instances that belong to yj. Similarly, σij² can be estimated
from the sample variance (s²) of such training instances. For example,
consider the annual income attribute shown in Figure 4.8. The sample
mean and variance for this attribute with respect to the class No are
Given a test instance with taxable income equal to $120K, we can use
the following value as its conditional probability given class No:
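Equation 4.19 can be evaluated directly once the mean and variance have been estimated. The following Python sketch shows one way to do so; the income values are hypothetical and are not the values of Figure 4.8, and the function name is our own.

```python
import math
from statistics import mean, variance

def gaussian_density(x: float, mu: float, var: float) -> float:
    """Equation 4.19: Gaussian density with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical incomes (in $K) for the training instances of one class;
# these are NOT the values of Figure 4.8.
incomes = [60.0, 75.0, 90.0, 110.0, 125.0]

mu_hat = mean(incomes)        # sample mean
var_hat = variance(incomes)   # sample variance (divides by n - 1)

density_at_100 = gaussian_density(100.0, mu_hat, var_hat)
print(mu_hat, var_hat, density_at_100)
```

Note that the returned value is a density rather than a probability, so it can exceed 1 when the variance is small. This does not affect classification, because only the relative sizes of the class scores matter.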
Example 4.5. [Naïve Bayes Classifier]
Consider the data set shown in Figure 4.9(a), where the target class is
Defaulted Borrower, which can take two values Yes and No. We can
compute the class-conditional probability for each categorical attribute and
the sample mean and variance for the continuous attribute, as summarized in
Figure 4.9(b).
We are interested in predicting the class label of a test instance x=
(Home Owner=No, Marital Status=Married, Annual Income=$120K). To do
this, we first compute the prior probabilities by counting the number of
training instances belonging to every class. We thus obtain P(Yes)=0.3 and
P(No)=0.7. Next, we can compute the class-conditional probability as
Figure 4.9.
The naïve Bayes classifier for the loan classification problem.
P(x|No)=P(Home Owner=No|No)×P(Status=Married|No)×P(Annual Income=$120K|No).
Notice that the class-conditional probability for class Yes has become 0
because there are no instances belonging to class Yes with Status=Married in
the training set. Using these class-conditional probabilities, we can estimate
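the posterior probabilities as
$$P(\text{No}|\mathbf{x}) = \alpha \times 0.7 \times P(\mathbf{x}|\text{No}), \qquad P(\text{Yes}|\mathbf{x}) = \alpha \times 0.3 \times 0 = 0,$$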
where α=1/P(x) is a normalizing constant. Since P(No|x)>P(Yes|x), the
instance is classified as No.
Handling Zero Conditional Probabilities
The preceding example illustrates a potential problem with using the naïve
Bayes assumption in estimating class-conditional probabilities. If the
conditional probability for any of the attributes is zero, then the entire
expression for the class-conditional probability becomes zero. Note that zero
conditional probabilities arise when the number of training instances is small
and the number of possible values of an attribute is large. In such cases, it
may happen that a combination of attribute values and class labels is never
observed, resulting in a zero conditional probability.
In a more extreme case, if the training instances do not cover some
combinations of attribute values and class labels, then we may not be able to
even classify some of the test instances. For example, if
P(Marital Status=Divorced|No) is zero instead of 1/7, then a data instance
with attribute set x=
(Home Owner=Yes, Marital Status=Divorced, Income=$120K) has the
following class-conditional probabilities:
Since both the class-conditional probabilities are 0, the naïve Bayes classifier
will not be able to classify the instance. To address this problem, it is
important to adjust the conditional probability estimates so that they are not
as brittle as simply using fractions of training instances. This can be achieved
by using the following alternate estimates of conditional probability:
$$\text{Laplace estimate:} \quad P(X_i = c | y) = \frac{n_c + 1}{n + v}, \tag{4.20}$$
$$\text{m-estimate:} \quad P(X_i = c | y) = \frac{n_c + mp}{n + m}, \tag{4.21}$$
where n is the number of training instances belonging to class y, nc is the
number of training instances with Xi=c and Y=y, v is the total number of
attribute values that Xi can take, p is some initial estimate of P(Xi=c|y) that is
known a priori, and m is a hyper-parameter that indicates our confidence in
using p when the fraction of training instances is too brittle. Note that even if
nc=0, both Laplace and m-estimate provide non-zero values of conditional
probabilities. Hence, they avoid the problem of vanishing class-conditional
probabilities and thus generally provide more robust estimates of posterior
probabilities.
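Both smoothed estimates are one-line formulas. The Python sketch below applies them to the case Status=Married within class Yes, for which the plain fraction is zero. It assumes that Marital Status takes exactly three values (Single, Married, Divorced); the choices m=3 and p=1/3 are assumptions made for illustration.

```python
from fractions import Fraction

def laplace_estimate(n_c: int, n: int, v: int) -> Fraction:
    """Equation 4.20: (n_c + 1) / (n + v)."""
    return Fraction(n_c + 1, n + v)

def m_estimate(n_c: int, n: int, m: int, p: Fraction) -> Fraction:
    """Equation 4.21: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Status=Married within class Yes: no such instance among the three
# defaulted borrowers, so the plain fraction n_c / n is 0.
n_c, n = 0, 3
v = 3                      # assumed values: Single, Married, Divorced
m, p = 3, Fraction(1, 3)   # assumed hyper-parameter and initial estimate

print(laplace_estimate(n_c, n, v))   # 1/6
print(m_estimate(n_c, n, m, p))      # 1/6
```

With m equal to v and p equal to 1/v, the m-estimate coincides with the Laplace estimate, as the two printed values show. A larger m pulls the estimate more strongly toward p.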
Characteristics of Naïve Bayes Classifiers
1. Naïve Bayes classifiers are probabilistic classification models that are
able to quantify the uncertainty in predictions by providing posterior
probability estimates. They are also generative classification models as
they treat the target class as the causative factor for generating the data
instances. Hence, apart from computing posterior probabilities, naïve
Bayes classifiers also attempt to capture the underlying mechanism
behind the generation of data instances belonging to every class. They
are thus useful for gaining predictive as well as descriptive insights.
2. By using the naïve Bayes assumption, they can easily compute
class-conditional probabilities even in high-dimensional settings, provided
that the attributes are conditionally independent of each other given the
class labels. This property makes naïve Bayes classifier a simple and
effective classification technique that is commonly used in diverse
application problems, such as text classification.
3. Naïve Bayes classifiers are robust to isolated noise points because such
points are not able to significantly impact the conditional probability
estimates, as they are often averaged out during training.
4. Naïve Bayes classifiers can handle missing values in the training set by
ignoring the missing values of every attribute while computing its
conditional probability estimates. Further, naïve Bayes classifiers can
effectively handle missing values in a test instance, by using only the
non-missing attribute values while computing posterior probabilities. If
the frequency of missing values for a particular attribute value depends
on the class label, then this approach will not accurately estimate posterior
probabilities.
5. Naïve Bayes classifiers are robust to irrelevant attributes. If Xi is an
irrelevant attribute, then P(Xi|Y) becomes almost uniformly distributed
for every class y. The class-conditional probabilities for every class thus
receive similar contributions of P(Xi|Y), resulting in negligible impact
on the posterior probability estimates.
6. Correlated attributes can degrade the performance of naïve Bayes
classifiers because the naïve Bayes assumption of conditional
independence no longer holds for such attributes. For example, consider
the following probabilities:
where A is a binary attribute and Y is a binary class variable. Suppose
there is another binary attribute B that is perfectly correlated with A
when Y=0, but is independent of A when Y=1. For simplicity, assume
that the conditional probabilities for B are the same as for A. Given an
instance with attributes A=0, B=0, and assuming conditional
independence, we can compute its posterior probabilities as follows:
$$P(Y=0|A=0, B=0) = \frac{P(A=0|Y=0) P(B=0|Y=0) P(Y=0)}{P(A=0, B=0)} = \frac{0.16 \times P(Y=0)}{P(A=0, B=0)}.$$
If P(Y=0)=P(Y=1), then the naïve Bayes classifier would assign the instance
to class 1. However, the truth is,
P(A=0, B=0|Y=0)=P(A=0|Y=0)=0.4,
because A and B are perfectly correlated when Y=0. As a result, the posterior
probability for Y=0 is
$$P(Y=0|A=0, B=0) = \frac{P(A=0, B=0|Y=0) P(Y=0)}{P(A=0, B=0)} = \frac{0.4 \times P(Y=0)}{P(A=0, B=0)},$$
which is larger than that for Y=1. The instance should have been classified as
class 0. Hence, the naïve Bayes classifier can produce incorrect results when
the attributes are not conditionally independent given the class labels. Naïve
Bayes classifiers are thus not well-suited for handling redundant or
interacting attributes.
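The effect of the correlated attribute can be reproduced with a few lines of Python. The value P(A=0|Y=0)=0.4 is taken from the discussion above, whereas P(A=0|Y=1)=0.6 is an assumed value chosen only for illustration. Since the denominator P(A=0, B=0) is common to both classes, it is enough to compare numerators.

```python
# P(A=0 | Y=0) = 0.4 is given in the text; P(A=0 | Y=1) = 0.6 is an
# ASSUMED value chosen for illustration.
p_a0_given_y = {0: 0.4, 1: 0.6}
prior = {0: 0.5, 1: 0.5}   # equal priors, P(Y=0) = P(Y=1)

# Naive Bayes numerator: P(A=0|y) * P(B=0|y) * P(y), with B distributed like A.
naive = {y: p_a0_given_y[y] ** 2 * prior[y] for y in prior}

# Exact numerator: for Y=0, B duplicates A, so P(A=0, B=0 | Y=0) = P(A=0 | Y=0);
# for Y=1, A and B really are independent.
exact = {0: p_a0_given_y[0] * prior[0],
         1: p_a0_given_y[1] ** 2 * prior[1]}

naive_class = max(naive, key=naive.get)
exact_class = max(exact, key=exact.get)
print(naive_class, exact_class)   # 1 0
```

Under the naïve Bayes assumption the evidence from A is counted twice for class 0, which shrinks its score from 0.4 to 0.16 before the prior is applied and flips the decision.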
4.5 Bayesian Networks
The conditional independence assumption made by naïve Bayes classifiers
may seem too rigid, especially for classification problems where the
attributes are dependent on each other even after conditioning on the class
labels. We thus need an approach to relax the naïve Bayes assumption so that
we can capture more generic representations of conditional independence
among attributes.
In this section, we present a flexible framework for modeling probabilistic
relationships between attributes and class labels, known as Bayesian
Networks. By building on concepts from probability theory and graph
theory, Bayesian networks are able to capture more generic forms of
conditional independence using simple schematic representations. They also
provide the necessary computational structure to perform inferences over
random variables in an efficient way. In the following, we first describe the
basic representation of a Bayesian network, and then discuss methods for
performing inference and learning model parameters in the context of
classification.
4.5.1 Graphical Representation
Bayesian networks belong to a broader family of models for capturing
probabilistic relationships among random variables, known as probabilistic
graphical models. The basic concept behind these models is to use graphical
representations where the nodes of the graph correspond to random variables
and the edges between the nodes express probabilistic relationships. Figures
4.10(a) and 4.10(b) show examples of probabilistic graphical models using
directed edges (with arrows) and undirected edges (without arrows),
respectively. Directed graphical models are also known as Bayesian networks
while undirected graphical models are known as Markov random fields.
The two approaches use different semantics for expressing relationships
among random variables and are thus useful in different contexts. In the
following, we briefly describe Bayesian networks that are useful in the
context of classification.
A Bayesian network (also referred to as a belief network) involves directed
edges between nodes, where every edge represents a direction of influence
among random variables. For example, Figure 4.10(a) shows a Bayesian
network where variable C depends upon the values of variables A and B, as
indicated by the arrows pointing toward C from A and B. Consequently, the
variable C influences the values of variables D and E. Every edge in a
Bayesian network thus encodes a dependence relationship between random
variables with a particular directionality.
Figure 4.10.
Illustrations of two basic types of graphical models.
Bayesian networks are directed acyclic graphs (DAG) because they do not
contain any directed cycles such that the influence of a node loops back to the
same node. Figure 4.11 shows some examples of Bayesian networks that
capture different types of dependence structures among random variables. In
a directed acyclic graph, if there is a directed edge from X to Y, then X is
called the parent of Y and Y is called the child of X. Note that a node can
have multiple parents in a Bayesian network, e.g., node D has two parent
nodes, B and C, in Figure 4.11(a). Furthermore, if there is a directed path in
the network from X to Z, then X is an ancestor of Z, while Z is a descendant
of X. For example, in the diagram shown in Figure 4.11(b), A is a descendant
of D and D is an ancestor of B. Note that there can be multiple directed paths
between two nodes of a directed acyclic graph, as is the case for nodes A and
D in Figure 4.11(a).
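The parent, ancestor, and descendant relations can be computed from a list of directed edges. The following Python sketch uses the network described earlier for Figure 4.10(a), in which A and B point to C, and C points to D and E; the function names are our own.

```python
# Directed edges of the network described for Figure 4.10(a):
# A -> C, B -> C, C -> D, C -> E.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]

def parents(node, edges):
    """Nodes with a directed edge into `node`."""
    return {src for src, dst in edges if dst == node}

def ancestors(node, edges):
    """All nodes from which a directed path leads to `node`."""
    found = set()
    stack = list(parents(node, edges))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents(current, edges))
    return found

def descendants(node, edges):
    """All nodes reachable from `node` along directed edges."""
    nodes = {n for edge in edges for n in edge}
    return {n for n in nodes if node in ancestors(n, edges)}

print(sorted(parents("C", edges)))      # ['A', 'B']
print(sorted(ancestors("D", edges)))    # ['A', 'B', 'C']
print(sorted(descendants("C", edges)))  # ['D', 'E']
```

Because the graph is acyclic, no node ever appears among its own ancestors.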
Figure 4.11.
Examples of Bayesian networks.
Conditional Independence
An important property of a Bayesian network is its ability to represent
varying forms of conditional independence among random variables. There
are several ways of describing the conditional independence assumptions
captured by Bayesian networks. One of the most generic ways of expressing
conditional independence is the concept of d-separation, which can be used
to determine if any two sets of nodes A and B are conditionally independent
given another set of nodes C. Another useful concept is that of the Markov
blanket of a node Y, which denotes the minimal set of nodes X that makes Y
independent of the other nodes in the graph, when conditioned on X. (See
Bibliographic Notes for more details on d-separation and Markov blanket.)
However, for the purpose of classification, it is sufficient to describe a
simpler expression of conditional independence in Bayesian networks, known
as the local Markov property.
Property 1 (Local Markov Property).
A node in a Bayesian network is conditionally independent of its
non-descendants, if its parents are known.
To illustrate the local Markov property, consider the Bayesian network shown in
Figure 4.11(b). We can state that A is conditionally independent of both B
and D given C, because C is the parent of A and nodes B and D are
non-descendants of A. The local Markov property helps in interpreting
parent-child relationships in Bayesian networks as representations of conditional
probabilities. Since a node is conditionally independent of its
non-descendants given its parents, the conditional independence assumptions
imposed by a Bayesian network are often sparse in structure. Nonetheless,
Bayesian networks are able to express a richer class of conditional
independence statements among attributes and class labels than the naïve
Bayes classifier. In fact, the naïve Bayes classifier can be viewed as a special
type of Bayesian network, where the target class Y is at the root of a tree and
every attribute Xi is connected to the root node by a directed edge, as shown
in Figure 4.12(a).
Figure 4.12.
Comparing the graphical representation of a naïve Bayes classifier
with that of a generic Bayesian network.
Note that in a naïve Bayes classifier, every directed edge points from the
target class to the observed attributes, suggesting that the class label is a
factor behind the generation of attributes. Inferring the class label can thus be
viewed as diagnosing the root cause behind the observed attributes. On the
other hand, Bayesian networks provide a more generic structure of
probabilistic relationships, since the target class is not required to be at the
root of a tree but can appear anywhere in the graph, as shown in Figure
4.12(b). In this diagram, inferring Y not only helps in diagnosing the factors
influencing X3 and X4, but also helps in predicting the influence of X1 and
X2.
Joint Probability
The local Markov property can be used to succinctly express the joint
probability of the set of random variables involved in a Bayesian network. To
realize this, let us first consider a Bayesian network consisting of d nodes, X1
to Xd, where the nodes have been numbered in such a way that Xi is an
ancestor of Xj only if i<j. The joint probability of X={X1, …, Xd} can be
generically factorized using the chain rule of probability as
$$P(\mathbf{X}) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) \cdots P(X_d|X_1, \ldots, X_{d-1}) = \prod_{i=1}^{d} P(X_i|X_1, \ldots, X_{i-1}) \tag{4.22}$$
By the way we have constructed the graph, note that the set {X1, …, Xi−1}
contains only non-descendants of Xi. Hence, by using the local Markov
property, we can write P(Xi|X1, …, Xi−1) as P(Xi|pa(Xi)), where pa(Xi)
denotes the parents of Xi. The joint probability can then be represented as
$$P(\mathbf{X}) = \prod_{i=1}^{d} P(X_i|\mathrm{pa}(X_i)) \tag{4.23}$$
It is thus sufficient to represent the probability of every node Xi in terms of
its parent nodes, pa(Xi), for computing P(X). This is achieved with the help of
probability tables that associate every node to its parent nodes as follows:
1. The probability table for node Xi contains the conditional probability
values P(Xi|pa(Xi)) for every combination of values in Xi and pa(Xi).
2. If Xi has no parents (pa(Xi)=∅), then the table contains only the prior
probability P(Xi).
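The role of the probability tables in Equation 4.23 can be illustrated with a small Python sketch. The three-node network and all of its probability values are hypothetical (they are not those of Figure 4.13); every variable is binary, and each table stores the probability that the node is True for every combination of parent values.

```python
from itertools import product

# Hypothetical three-node network: Exercise -> HeartDisease <- Diet.
# All numbers are assumed for illustration and are not those of Figure 4.13.
parents = {"Exercise": (), "Diet": (), "HeartDisease": ("Exercise", "Diet")}

# Probability tables: P(node = True | parent values).
# Nodes without parents use the empty tuple as key and hold a prior probability.
p_true = {
    "Exercise": {(): 0.7},
    "Diet": {(): 0.25},
    "HeartDisease": {(True, True): 0.25, (True, False): 0.45,
                     (False, True): 0.55, (False, False): 0.75},
}

def joint_probability(assignment):
    """Equation 4.23: product over nodes of P(X_i | pa(X_i))."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[parent] for parent in node_parents)
        p_node_true = p_true[node][parent_values]
        prob *= p_node_true if assignment[node] else 1.0 - p_node_true
    return prob

# Summing the joint over all assignments must give 1.
nodes = list(parents)
total = sum(joint_probability(dict(zip(nodes, values)))
            for values in product((True, False), repeat=len(nodes)))
print(total)
```

Only 1 + 1 + 4 = 6 table entries are stored here, whereas an explicit joint table over three binary variables would need 7 independent entries; the saving grows quickly as the network becomes larger and sparser.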
Example 4.6. [Probability Tables]
Figure 4.13 shows an example of a Bayesian network for modeling the
relationships between a patient’s symptoms and risk factors. The probability
tables are shown at the side of every node in the figure. The probability tables
associated with the risk factors (Exercise and Diet) contain only the prior
probabilities, whereas the tables for heart disease, heartburn, blood pressure,
and chest pain, contain the conditional probabilities.
Figure 4.13.
A Bayesian network for detecting heart disease and heartburn in
patients.
Use of Hidden Variables
A Bayesian network typically involves two types of variables: observed
variables that are clamped to specific observed values, and unobserved
variables, whose values are not known and need to be inferred from the
network. To distinguish between these two types of variables, observed
variables are generally represented using shaded nodes while unobserved
variables are represented using empty nodes. Figure 4.14 shows an example
of a Bayesian network with observed variables (A, B, and E) and unobserved
variables (C and D).
Figure 4.14.
Observed and unobserved variables are represented using shaded
and unshaded circles, respectively.
In the context of classification, the observed variables correspond to the set of
attributes X, while the target class is represented using an unobserved
variable Y that needs to be inferred during testing. However, note that a
generic Bayesian network may contain many other unobserved variables
apart from the target class, as represented in Figure 4.15 as the set of
variables H. These unobserved variables represent hidden or confounding
factors that affect the probabilities of attributes and class labels, although they
are never directly observed. The use of hidden variables enhances the
expressive power of Bayesian networks in representing complex probabilistic
relationships between attributes and class labels. This is one of the key
distinguishing properties of Bayesian networks as compared to naïve Bayes classifiers.
4.5.2 Inference and Learning
Given the probability tables corresponding to every node in a Bayesian
network, the problem of inference corresponds to computing the probabilities
of different sets of random variables. In the context of classification, one of
the key inference problems is to compute the probability of a target class Y
taking on a specific value y, given the set of observed attributes at a data
instance, x. This can be represented using the following conditional
probability:
P(Y = y \mid x) = \frac{P(y, x)}{P(x)} = \frac{P(y, x)}{\sum_{y'} P(y', x)}    (4.24)
The previous equation involves marginal probabilities of the form P(y, x).
They can be computed by marginalizing out the hidden variables H from the
joint probability as follows:
P(y, x) = \sum_{H} P(y, x, H),    (4.25)
where the joint probability P(y, x, H) can be obtained by using the
factorization described in Equation 4.23. To understand the nature of
computations involved in estimating P(y, x), consider the example Bayesian
network shown in Figure 4.15, which involves a target class, Y , three
observed attributes, X1 to X3, and four hidden variables, H1 to H4. For this
network, we can express P(y, x) as
Figure 4.15.
An example of a Bayesian network with four hidden variables, H1
to H4, three observed attributes, X1 to X3, and one target class Y .
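P(y, x) = \sum_{h_1} \sum_{h_2} \sum_{h_3} \sum_{h_4} P(y, x_1, x_2, x_3, h_1, h_2, h_3, h_4)
= \sum_{h_1} \sum_{h_2} \sum_{h_3} \sum_{h_4} \big[ P(h_1)\, P(h_2)\, P(x_2)\, P(h_4)\, P(x_1 \mid h_1, h_2) \times P(h_3 \mid x_2, h_2)\, P(y \mid x_1, h_3)\, P(x_3 \mid h_3, h_4) \big]    (4.26)
= \sum_{h_1} \sum_{h_2} \sum_{h_3} \sum_{h_4} f(h_1, h_2, h_3, h_4),    (4.27)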
where f is a factor that depends on the values of h1 to h4. In the previous
simplistic expression of P(y, x), a different summand is considered for every
combination of values, h1 to h4, in the hidden variables, H1 to H4. If we
assume that every variable in the network can take k discrete values, then the
summation has to be carried out a total of k^4 times. The
computational complexity of this approach is thus O(k^4). Moreover, the
number of computations grows exponentially with the number of hidden
variables, making it difficult to use this approach with networks that have a
large number of hidden variables. In the following, we present different
computational techniques for efficiently performing inferences in Bayesian
networks.
Variable Elimination
To reduce the number of computations involved in estimating P(y, x), let us
closely examine the expressions in Equations 4.26 and 4.27. Notice that
although f(h1, h2, h3, h4) depends on the values of all four hidden variables,
it can be decomposed as a product of several smaller factors, where every
factor involves only a small number of hidden variables. For example, the
factor P(h4) depends only on the value of h4, and thus acts as a constant
multiplicative term when summations are performed over h1, h2, or h3.
Hence, if we place P(h4) outside the summations of h1 to h3, we can save
some repeated multiplications occurring inside every summand.
In general, we can push every summation as far inside as possible, so that the
factors that do not depend on the summing variable are placed outside the
summation. This will help reduce the number of wasteful computations by
using smaller factors at every summation. To illustrate this process, consider
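the following sequence of steps for computing P(y, x), by rearranging the
order of summations in Equation 4.26.
P(y, x) = P(x_2) \sum_{h_4} P(h_4) \sum_{h_3} P(y \mid x_1, h_3)\, P(x_3 \mid h_3, h_4) \times \sum_{h_2} P(h_2)\, P(h_3 \mid x_2, h_2) \sum_{h_1} P(h_1)\, P(x_1 \mid h_1, h_2)    (4.28)
= P(x_2) \sum_{h_4} P(h_4) \sum_{h_3} P(y \mid x_1, h_3)\, P(x_3 \mid h_3, h_4) \times \sum_{h_2} P(h_2)\, P(h_3 \mid x_2, h_2)\, f_1(h_2)    (4.29)
= P(x_2) \sum_{h_4} P(h_4) \sum_{h_3} P(y \mid x_1, h_3)\, P(x_3 \mid h_3, h_4)\, f_2(h_3)    (4.30)
= P(x_2) \sum_{h_4} P(h_4)\, f_3(h_4)    (4.31)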
where fi represents the intermediate factor term obtained by summing out hi.
To check if the previous rearrangements provide any improvements in
computational efficiency, let us count the number of computations occurring
at every step of the process. At the first step (Equation 4.28), we perform a
summation over h1 using factors that depend on h1 and h2. This requires
considering every pair of values in h1 and h2, resulting in O(k^2)
computations. Similarly, the second step (Equation 4.29) involves summing
out h2 using factors of h2 and h3, leading to O(k^2) computations. The third
step (Equation 4.30) again requires O(k^2) computations as it involves
summing out h3 over factors depending on h3 and h4. Finally, the fourth step
(Equation 4.31) involves summing out h4 using factors depending on h4,
resulting in O(k) computations.
The overall complexity of the previous approach is thus O(k^2), which is
considerably smaller than the O(k^4) complexity of the basic approach.
Hence, by merely rearranging summations and using algebraic
manipulations, we are able to improve the computational efficiency in
computing P(y, x). This procedure is known as variable elimination.
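The saving can be checked numerically. The sketch below, written with NumPy, evaluates P(y, x) for the factorization of Equation 4.26 in two ways: by brute force over all k^4 combinations of h1 to h4, and by summing out h1, h2, h3, and h4 in turn as in Equations 4.28 to 4.31. The table values are random placeholders and the array names (A, B, C, D) are assumptions made for illustration; the observed values x1, x2, x3, and y are taken as already fixed, so each table is sliced down to the hidden variables only.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
k = 3  # number of discrete values per hidden variable (assumed)


def random_dist(*shape):
    """Random table whose last axis sums to 1."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)


p_h1, p_h2, p_h4 = random_dist(k), random_dist(k), random_dist(k)
p_x2 = 0.4                  # P(x2) at the observed value of x2 (placeholder)
A = rng.random((k, k))      # A[h1, h2] = P(x1 | h1, h2) at the observed x1
B = random_dist(k, k)       # B[h2, h3] = P(h3 | x2, h2) at the observed x2
C = rng.random(k)           # C[h3]     = P(y | x1, h3) at the observed y, x1
D = rng.random((k, k))      # D[h3, h4] = P(x3 | h3, h4) at the observed x3

# Brute force (Equation 4.27): one summand per combination of h1..h4.
brute = 0.0
for h1, h2, h3, h4 in itertools.product(range(k), repeat=4):
    brute += (p_h1[h1] * p_h2[h2] * p_x2 * p_h4[h4]
              * A[h1, h2] * B[h2, h3] * C[h3] * D[h3, h4])

# Variable elimination (Equations 4.28-4.31): each step costs O(k^2) or less.
f1 = p_h1 @ A                    # f1[h2] = sum_h1 P(h1) P(x1 | h1, h2)
f2 = (p_h2 * f1) @ B             # f2[h3] = sum_h2 P(h2) P(h3 | x2, h2) f1(h2)
f3 = (C * f2) @ D                # f3[h4] = sum_h3 P(y | x1, h3) P(x3 | h3, h4) f2(h3)
eliminated = p_x2 * (p_h4 @ f3)  # P(x2) sum_h4 P(h4) f3(h4)
```

Both routes give the same number up to floating-point rounding; only the amount of arithmetic differs.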
The basic concept that variable elimination exploits to reduce the number of
computations is the distributive nature of multiplication over addition
operations. For example, consider the following multiplication and addition
Notice that the right-hand side of the previous equation involves three
multiplications and three additions, while the left-hand side involves only one
multiplication and three additions, thus saving on two arithmetic operations.
This property is utilized by variable elimination in pushing out constant terms
outside the summation, such that they are multiplied only once.
Note that the efficiency of variable elimination depends on the order of
hidden variables used for performing summations. Hence, we would ideally
like to find the optimal order of variables that results in the smallest number of
computations. Unfortunately, finding the optimal order of summations for a
generic Bayesian network is an NP-hard problem, i.e., no algorithm is known
that can find the optimal ordering in
polynomial time. However, there exist efficient techniques for handling
special types of Bayesian networks, e.g., those involving tree-like graphs, as
described in the following.
Sum-Product Algorithm for Trees
Note that in Equations 4.28 and 4.29, whenever a variable hi is eliminated
during marginalization, it results in the creation of a factor fi that depends on
the neighboring nodes of hi. fi is then absorbed in the factors of neighboring
variables and the process is repeated until all unobserved variables have
been marginalized. This phenomenon of variable elimination can be viewed as
transmitting a local message from the variable being marginalized to its
neighboring nodes. This idea of message passing utilizes the structure of the
graph for performing computations, thus making it possible to use graph-theoretic approaches for making effective inferences. The sum-product
algorithm builds on the concept of message passing for computing marginal
and conditional probabilities on tree-based graphs.
Figure 4.16 shows an example of a tree involving five variables, X1 to X5. A
key characteristic of a tree is that every node in the tree has at most one
parent (the root has none), and there is only one directed edge between any two nodes in the tree.
For the purpose of illustration, let us consider the problem of estimating the
marginal probability of X2, P(X2). This can be obtained by marginalizing out
every variable in the graph except X2 and rearranging the summations to
obtain the following expression:
Figure 4.16.
An example of a Bayesian network with a tree structure.
where mij(xj) has been conveniently chosen to represent the factor of xj that
is obtained by summing out xi. We can view mij(xj) as a local message
passed from node xi to node xj, as shown using arrows in Figure 4.17(a).
These local messages capture the influence of eliminating nodes on the
marginal probabilities of neighboring nodes.
Before we formally describe the formula for computing mij(xj) and P(xj), we
first define a potential function ψ(⋅) that is associated with every node and edge of
the graph. We can define the potential of a node Xi as
\psi(X_i) = \begin{cases} P(X_i), & \text{if } X_i \text{ is the root node,} \\ 1, & \text{otherwise.} \end{cases}    (4.32)
Figure 4.17.
Illustration of message passing in the sum-product algorithm.
Similarly, we can define the potential of an edge between nodes Xi and Xj
(where Xi is the parent of Xj) as
ψ(Xi, Xj)=P(Xj|Xi).
Using ψ(Xi) and ψ(Xi, Xj), we can represent mij(xj) using the following
equation:
m_{ij}(x_j) = \sum_{x_i} \Big( \psi(x_i)\, \psi(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{ki}(x_i) \Big),    (4.33)
where N(i) represents the set of neighbors of node Xi. The message mij that is
transmitted from Xi to Xj can thus be recursively computed using the
messages incident on Xi from its neighboring nodes excluding Xj. Note that
the formula for mij involves taking a sum over all possible values of Xi, after
multiplying the factors obtained from the neighbors of Xi. This approach of
message passing is thus called the “sum-product” algorithm. Further, since
mij represents a notion of “belief” propagated from Xi to Xj, this algorithm is
also known as belief propagation. The marginal probability of a node Xi
is then given as
P(x_i) = \psi(x_i) \prod_{j \in N(i)} m_{ji}(x_i).    (4.34)
A useful property of the sum-product algorithm is that it allows the messages
to be reused for computing a different marginal probability in the future. For
example, if we had to compute the marginal probability for node X3, we
would require the following messages from its neighboring nodes:
m23(x3), m43(x3), and m53(x3). However, note that m43(x3), and m53(x3)
have already been computed in the process of computing the marginal
probability of X2 and thus can be reused.
Notice that the basic operations of the sum-product algorithm resemble a
message passing protocol over the edges of the network. A node sends out a
message to all its neighboring nodes only after it has received incoming
messages from all its neighbors. Hence, we can initialize the message passing
protocol from the leaf nodes, and transmit messages till we reach the root
node. We can then run a second pass of messages from the root node back to
the leaf nodes. In this way, we can compute the messages for every edge in
both directions, using just O(2|E|) operations, where |E| is the number of
edges. Once we have transmitted all possible messages as shown in Figure
4.17(b), we can easily compute the marginal probability of every node in the
graph using Equation 4.34.
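The message-passing recursion of Equations 4.32 to 4.34 can be sketched in a few lines of Python. The tree used here (X1 is the root, with edges X1→X2, X2→X3, X3→X4, and X3→X5) is an assumption made for illustration and may differ from Figure 4.16; the probability tables are random placeholders. Messages are cached so that a message computed for one marginal is reused for the others, and a brute-force summation over the full joint is included only as a check.

```python
import itertools
from functools import lru_cache

import numpy as np

rng = np.random.default_rng(1)
k = 2                                  # values per variable (assumed)
nodes = [1, 2, 3, 4, 5]
parent = {2: 1, 3: 2, 4: 3, 5: 3}      # assumed tree: child -> parent, node 1 is the root


def random_dist(*shape):
    """Random table whose last axis sums to 1."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)


prior = random_dist(k)                           # P(X1)
cpt = {c: random_dist(k, k) for c in parent}     # cpt[c][x_parent, x_c] = P(x_c | x_parent)

neighbors = {n: set() for n in nodes}
for child, par in parent.items():
    neighbors[child].add(par)
    neighbors[par].add(child)


def node_potential(i):
    """psi(X_i): the prior for the root node, 1 otherwise (Equation 4.32)."""
    return prior if i == 1 else np.ones(k)


def edge_potential(i, j):
    """psi(x_i, x_j) as a table indexed [x_i, x_j]."""
    if parent.get(j) == i:       # X_i is the parent of X_j
        return cpt[j]
    return cpt[i].T              # X_j is the parent of X_i


@lru_cache(maxsize=None)
def message(i, j):
    """m_ij(x_j): sum over x_i of psi(x_i) psi(x_i, x_j) prod_k m_ki(x_i) (Equation 4.33)."""
    incoming = node_potential(i)
    for n in neighbors[i] - {j}:
        incoming = incoming * message(n, i)
    return incoming @ edge_potential(i, j)


def marginal(i):
    """P(x_i) = psi(x_i) prod_j m_ji(x_i) (Equation 4.34)."""
    result = node_potential(i)
    for n in neighbors[i]:
        result = result * message(n, i)
    return result


def brute_force_marginal(i):
    """Check only: sum the full joint over every other variable."""
    out = np.zeros(k)
    for values in itertools.product(range(k), repeat=len(nodes)):
        x = dict(zip(nodes, values))
        p = prior[x[1]]
        for child, par in parent.items():
            p *= cpt[child][x[par], x[child]]
        out[x[i]] += p
    return out
```

Calling marginal(i) for every node computes each of the 2|E| directed messages at most once, because repeated requests are served from the cache.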
In the context of classification, the sum-product algorithm can be easily
modified for computing the conditional probability of the class label y given
the set of observed attributes x̂, i.e., P(y|x̂). This basically amounts to
computing P(y, X=x̂) in Equation 4.24, where X is clamped to the observed
values x̂. To handle the scenario where some of the random variables are
fixed and do not need to be marginalized, we consider the following
modification.
If Xi is a random variable that is fixed to a specific value x̂i, then we can
simply modify ψ(Xi) and ψ(Xi, Xj) as follows:
\psi(X_i) = \begin{cases} 1, & \text{if } X_i = \hat{x}_i, \\ 0, & \text{otherwise.} \end{cases}    (4.35)
\psi(X_i, X_j) = \begin{cases} P(X_j \mid \hat{x}_i), & \text{if } X_i = \hat{x}_i, \\ 0, & \text{otherwise.} \end{cases}    (4.36)
We can run the sum-product algorithm using these modified values for every
observed variable and thus compute P(y, X=x̂).
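A small sketch of clamping, under the assumption of a hypothetical tree in which the class Y is the root and two observed attributes X1 and X2 are its children. The indicator potential of Equation 4.35 zeroes out every value of an observed variable except the observed one, so the message it sends to Y reduces to the corresponding column of its probability table. All numbers are made up for illustration.

```python
import numpy as np

# Hypothetical tables, indexed [y, x]: each row is P(X | Y = y).
p_y = np.array([0.6, 0.4])                           # P(Y)
p_x1_given_y = np.array([[0.9, 0.1], [0.3, 0.7]])    # P(X1 | Y)
p_x2_given_y = np.array([[0.8, 0.2], [0.5, 0.5]])    # P(X2 | Y)
observed = {"x1": 1, "x2": 0}                        # clamped values (assumed)


def clamp(value, k=2):
    """Indicator node potential for an observed variable (Equation 4.35)."""
    psi = np.zeros(k)
    psi[value] = 1.0
    return psi


# Message from each observed child to Y: sum over x_i of psi(x_i) P(x_i | y).
m_x1_to_y = p_x1_given_y @ clamp(observed["x1"])
m_x2_to_y = p_x2_given_y @ clamp(observed["x2"])

joint = p_y * m_x1_to_y * m_x2_to_y    # P(y, X = x_hat) for each value of y
posterior = joint / joint.sum()        # normalization as in Equation 4.24
```

Here joint holds P(y, X=x̂) for both class values, and dividing by its sum gives the conditional probability P(y|x̂).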
Figure 4.18.
Example of a poly-tree and its corresponding factor graph.
Generalizations for Non-Tree Graphs
The sum-product algorithm is guaranteed to optimally converge in the case of
trees using a single run of message passing in both directions of every edge.
This is because any two nodes in a tree have a unique path for the
transmission of messages. Furthermore, since every node in a tree has at
most one parent, the joint probability involves only factors of at most two
variables. Hence, it is sufficient to consider potentials over edges and not
other generic substructures in the graph.
Both of the previous properties are violated in graphs that are not trees, thus
making it difficult to directly apply the sum-product algorithm for making
inferences. However, a number of variants of the sum-product algorithm have
been devised to perform inferences on a broader family of graphs than trees.
Many of these variants transform the original graph into an alternative tree-based representation, and then apply the sum-product algorithm on the
transformed tree. In this section, we briefly discuss one such transformation
known as factor graphs.
Factor graphs are useful for making inferences over graphs that violate the
condition that every node has a single parent. Nonetheless, they still require
the absence of multiple paths between any two nodes, to guarantee
convergence. Such graphs are known as poly-trees. An example of a poly-tree is shown in Figure 4.18(a).
A poly-tree can be transformed into a tree-based representation with the help
of factor graphs. These graphs consist of two types of nodes, variable nodes
(that are represented using circles) and factor nodes (that are represented
using squares). The factor nodes represent conditional independence
relationships among the variables of the poly-tree. In particular, every
probability table can be represented as a factor node. The edges in a factor
graph are undirected in nature and relate a variable node to a factor node if
the variable is involved in the probability table corresponding to the factor
node. Figure 4.18(b) presents the factor graph representation of the poly-tree
shown in Figure 4.18(a).
Note that the factor graph of a poly-tree always forms a tree-like structure,
where there is a unique path of influence between any two nodes in the factor
graph. Hence, we can apply a modified form of the sum-product algorithm to
transmit messages between variable nodes and factor nodes, which is
guaranteed to converge to optimal values.
Learning Model Parameters
In all our previous discussions on Bayesian networks, we had assumed that
the topology of the Bayesian network and the values in the probability tables
of every node were already known. In this section, we discuss approaches for
learning both the topology and the probability table values of a Bayesian
network from the training data.
Let us first consider the case where the topology of the network is known and
we are only required to compute the probability tables. If there are no
unobserved variables in the training data, then we can easily compute the
probability table for P(Xi|pa(Xi)), by counting the fraction of training
instances for every value of Xi and every combination of values in pa(Xi).
However, if there are unobserved variables in Xi or pa(Xi), then computing
the fraction of training instances for such variables is non-trivial and requires
the use of advanced techniques such as the Expectation-Maximization
algorithm (described later in Chapter 8).
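For the fully observed case, the counting procedure can be sketched as follows. The eight training instances and the helper name estimate_cpt are hypothetical; each table entry is simply the number of instances with a given child value and parent combination, divided by the number of instances with that parent combination.

```python
from collections import Counter

# Hypothetical fully observed instances: (exercise, diet, heart_disease).
data = [
    ("yes", "healthy", "no"), ("yes", "healthy", "no"),
    ("yes", "unhealthy", "yes"), ("no", "healthy", "no"),
    ("no", "unhealthy", "yes"), ("no", "unhealthy", "yes"),
    ("no", "unhealthy", "no"), ("yes", "unhealthy", "no"),
]


def estimate_cpt(rows, child, parent_idx):
    """Estimate P(child | parents) as a fraction of matching training instances."""
    joint_counts = Counter()
    parent_counts = Counter()
    for row in rows:
        pa = tuple(row[i] for i in parent_idx)
        joint_counts[(row[child], pa)] += 1
        parent_counts[pa] += 1
    # Keys are (child value, parent values); unseen combinations are absent.
    return {key: n / parent_counts[key[1]] for key, n in joint_counts.items()}


cpt_heart_disease = estimate_cpt(data, child=2, parent_idx=(0, 1))
prior_exercise = estimate_cpt(data, child=0, parent_idx=())
```

With an empty parent tuple the same function returns the prior probabilities of a node that has no parents.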
Learning the structure of the Bayesian network is a much more challenging
task than learning the probability tables. Although there are some scoring
approaches that attempt to find a graph structure that maximizes the training
likelihood, they are often computationally infeasible when the graph is large.
Hence, a common approach for constructing Bayesian networks is to use the
subjective knowledge of domain experts.
4.5.3 Characteristics of Bayesian Networks
1. Bayesian networks provide a powerful approach for representing
probabilistic relationships between attributes and class labels with the
help of graphical models. They are able to capture complex forms of
dependencies among variables. Apart from encoding prior beliefs, they
are also able to model the presence of latent (unobserved) factors as
hidden variables in the graph. Bayesian networks are thus quite
expressive and provide predictive as well as descriptive insights about
the behavior of attributes and class labels.
2. Bayesian networks can easily handle the presence of correlated or
redundant attributes, as opposed to the naïve Bayes classifier. This is
because Bayesian networks do not use the naïve Bayes assumption
about conditional independence, but instead are able to express richer
forms of conditional independence.
3. Similar to the naïve Bayes classifier, Bayesian networks are also quite
robust to the presence of noise in the training data. Further, they can
handle missing values during training as well as testing. If a test instance
contains an attribute Xi with a missing value, then a Bayesian network
can perform inference by treating Xi as an unobserved node and
marginalizing out its effect on the target class. Hence, Bayesian
networks are well-suited for handling incompleteness in the data, and
can work with partial information. However, unless the pattern with
which missing values occur is completely random, their presence
will likely introduce some degree of error and/or bias into the analysis.
4. Bayesian networks are robust to irrelevant attributes that contain no
discriminatory information about the class labels. Such attributes show
no impact on the conditional probability of the target class, and are thus
rightfully ignored.
5. Learning the structure of a Bayesian network is a cumbersome task that
often requires assistance from expert knowledge. However, once the
structure has been decided, learning the parameters of the network can
be quite straightforward, especially if all the variables in the network are
observed.
6. Due to their additional ability of representing complex forms of
relationships, Bayesian networks are more susceptible to overfitting as
compared to the naïve Bayes classifier. Furthermore, Bayesian networks
typically require more training instances for effectively learning the
probability tables than the naïve Bayes classifier.
7. Although the sum-product algorithm provides computationally efficient
techniques for performing inference over tree-like graphs, the
complexity of the approach increases significantly when dealing with
generic graphs of large sizes. In situations where exact inference is
computationally infeasible, it is quite common to use approximate
inference techniques.
4.6 Logistic Regression
The naïve Bayes and the Bayesian network classifiers described in the
previous sections provide different ways of estimating the conditional
probability of an instance x given class y, P(x|y). Such models are known as
probabilistic generative models. Note that the conditional probability P(x|y)
essentially describes the behavior of instances in the attribute space that are
generated from class y. However, for the purpose of making predictions, we
are finally interested in computing the posterior probability P(y|x). For
example, computing the following ratio of posterior probabilities is sufficient
for inferring class labels in a binary classification problem:
\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}
This ratio is known as the odds. If this ratio is greater than 1, then x is
classified as y=1. Otherwise, it is assigned to class y=0. Hence, one may
simply learn a model of the odds based on the attribute values of training
instances, without having to compute P(x|y) as an intermediate quantity in the
Bayes theorem.
Classification models that directly assign class labels without computing
class-conditional probabilities are called discriminative models. In this
section, we present a probabilistic discriminative model known as logistic
regression, which directly estimates the odds of a data instance x using its
attribute values. The basic idea of logistic regression is to use a linear
predictor, z = w^T x + b, for representing the odds of x as follows:
\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = e^{z} = e^{w^T x + b},    (4.37)
where w and b are the parameters of the model and a^T denotes the transpose
of a vector a. Note that if w^T x + b > 0, then x belongs to class 1 since its odds are
greater than 1. Otherwise, x belongs to class 0.
Figure 4.19.
Plot of sigmoid (logistic) function, σ(z).
Since P(y=0|x)+P(y=1|x)=1, we can re-write Equation 4.37 as
\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = e^{z}.
This can be further simplified to express P(y=1|x) as a function of z.
P(y = 1 \mid x) = \frac{1}{1 + e^{-z}} = \sigma(z),    (4.38)
where the function σ(⋅) is known as the logistic or sigmoid function. Figure
4.19 shows the behavior of the sigmoid function as we vary z. We can see
that σ(z)≥0.5 only when z≥0. We can also derive P(y=0|x) using σ(z) as
P(y = 0 \mid x) = 1 - \sigma(z) = \frac{1}{1 + e^{z}}    (4.39)
Hence, if we have learned a suitable value of parameters w and b, we can use
Equations 4.38 and 4.39 to estimate the posterior probabilities of any data
instance x and determine its class label.
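Equations 4.37 to 4.39 translate directly into code. The sketch below uses NumPy; the parameter values w and b and the instance x are made up for illustration, and the function name predict is an assumption rather than part of any library.

```python
import numpy as np


def sigmoid(z):
    """Logistic function of Equation 4.38."""
    return 1.0 / (1.0 + np.exp(-z))


def predict(x, w, b):
    """Return P(y=1|x), P(y=0|x), and the predicted class label."""
    z = w @ x + b              # linear predictor z = w^T x + b
    p1 = sigmoid(z)            # Equation 4.38
    p0 = 1.0 - p1              # Equation 4.39
    label = int(z > 0)         # odds greater than 1 exactly when z > 0
    return p1, p0, label


# Hypothetical parameters and instance.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 1.5])

p1, p0, label = predict(x, w, b)
odds = p1 / p0                 # equals e^z by Equation 4.37
```

For these values z = 1, so the odds equal e and the instance is assigned to class 1.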
4.6.1 Logistic Regression as a
Generalized Linear Model
Since the posterior probabilities are real-valued, their estimation using the
previous equations can be viewed as solving a regression problem. In fact,
logistic regression belongs to a broader family of statistical regression
models, known as generalized linear models (GLM). In these models, the
target variable y is considered to be generated from a probability distribution
P(y|x), whose mean μ can be estimated using a link function g(⋅) as follows:
g(\mu) = z = w^T x + b.    (4.40)
For binary classification using logistic regression, y follows a Bernoulli
distribution (y can either be 0 or 1) and μ is equal to P(y=1|x). The link
function g(⋅) of logistic regression, called the logit function, can thus be
represented as
g(\mu) = \log\left( \frac{\mu}{1 - \mu} \right).
Depending on the choice of link function g(⋅) and the form of probability
distribution P(y|x), GLMs are able to represent a broad family of regression
models, such as linear regression and Poisson regression. They require
different approaches for estimating their model parameters, (w, b). In this
chapter, we will only discuss approaches for estimating the model parameters
of logistic regression, although methods for estimating parameters of other
types of GLMs are often similar (and sometimes even simpler). (See
Bibliographic Notes for more details on GLMs.)
Note that even though logistic regression has relationships with regression
models, it is a classification model since the computed posterior probabilities
are eventually used to determine the class label of a data instance.
4.6.2 Learning Model Parameters
The parameters of logistic regression, (w, b), are estimated during training
using a statistical approach known as the maximum likelihood estimation
(MLE) method. This method involves computing the likelihood of observing
the training data given (w, b), and then determining the model parameters
(w*, b*) that yield maximum likelihood.
Let D.train={(x1, y1), (x2, y2), … , (xn, yn)} denote a set of n training
instances, where yi is a binary variable (0 or 1). For a given training instance
xi, we can compute its posterior probabilities using Equations 4.38 and 4.39.
We can then express the likelihood of observing yi given xi, w, and b as
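P(y_i \mid x_i, w, b) = P(y = 1 \mid x_i)^{y_i} \times P(y = 0 \mid x_i)^{1 - y_i}
= \sigma(z_i)^{y_i} \times \big( 1 - \sigma(z_i) \big)^{1 - y_i}
= \sigma(w^T x_i + b)^{y_i} \times \big( 1 - \sigma(w^T x_i + b) \big)^{1 - y_i},    (4.41)
where σ(⋅) is the sigmoid function as described above. Equation 4.41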
basically means that the likelihood P(yi|xi, w, b) is equal to P(y=1|xi) when
yi=1, and equal to P(y=0|xi) when yi=0. The likelihood of all training
instances, L(w, b), can then be computed by taking the product of individual
likelihoods (assuming independence among training instances) as follows:
L(w, b) = \prod_{i=1}^{n} P(y_i \mid x_i, w, b) = \prod_{i=1}^{n} P(y = 1 \mid x_i)^{y_i} \times P(y = 0 \mid x_i)^{1 - y_i}.    (4.42)
The previous equation involves multiplying a large number of probability
values, each of which is smaller than or equal to 1. Since this naïve
computation can easily become numerically unstable when n is large, a more
practical approach is to consider the negative logarithm (to base e) of the
likelihood function, also known as the cross entropy function:
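-\log L(w, b) = -\sum_{i=1}^{n} \big[ y_i \log P(y = 1 \mid x_i) + (1 - y_i) \log P(y = 0 \mid x_i) \big]
= -\sum_{i=1}^{n} \big[ y_i \log \sigma(w^T x_i + b) + (1 - y_i) \log\big( 1 - \sigma(w^T x_i + b) \big) \big].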
The cross entropy is a loss function that measures how unlikely it is for the
training data to be generated from the logistic regression model with
parameters (w, b). Intuitively, we would like to find model parameters
(w*, b*) that result in the lowest cross entropy, −logL(w*, b*).
(w^*, b^*) = \arg\min_{(w, b)} E(w, b) = \arg\min_{(w, b)} -\log L(w, b)    (4.43)
where E(w, b)=−logL(w, b) is the loss function. It is worth emphasizing that
E(w, b) is a convex function, i.e., any local minimum of E(w, b) is also a global
minimum. Hence, we can use any of the standard convex optimization
techniques to solve Equation 4.43, which are mentioned in Appendix E. Here,
we briefly describe the Newton-Raphson method that is commonly used for
estimating the parameters of logistic regression. For ease of representation,
we will use a single vector to describe w̃ = (w^T b)^T, which is of size one
greater than w. Similarly, we will consider the concatenated feature vector x̃ =
(x^T 1)^T, such that the linear predictor z = w^T x + b can be succinctly written as
z = w̃^T x̃. Also, the concatenation of all training labels, y1 to yn, will be
represented as y, the set consisting of σ(z1) to σ(zn) will be represented as σ,
and the concatenation of x̃1 to x̃n, stacked as the rows of a matrix, will be represented as X̃.
The Newton-Raphson method is an iterative method for finding w̃* that uses the
following equation to update the model parameters at every iteration:
\tilde{w}^{(new)} = \tilde{w}^{(old)} - H^{-1} \nabla E(\tilde{w}),    (4.44)
where ∇E(w̃) and H are the first- and second-order derivatives of the loss
function E(w̃) with respect to w̃, respectively. The key intuition behind
Equation 4.44 is to move the model parameters in the direction opposite to the
gradient, rescaled by the inverse of H, such that w̃ takes larger steps when ∇E(w̃) is large. When w̃
arrives at a minimum after some number of iterations, ∇E(w̃)
becomes equal to 0, resulting in convergence. Hence, we start with some
initial values of w̃ (either randomly assigned or set to 0) and use Equation
4.44 to iteratively update w̃ till there are no significant changes in its value
(beyond a certain threshold).
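The first-order derivative of E(w̃) is given by
\nabla E(\tilde{w}) = \sum_{i=1}^{n} \big( \sigma(\tilde{w}^T \tilde{x}_i) - y_i \big) \tilde{x}_i = \tilde{X}^T (\sigma - y),    (4.45)
where we have used the fact that dσ(z)/dz=σ(z)(1−σ(z)). Using ∇E(w̃), we
can compute the second-order derivative of E(w̃) as
H = \nabla \nabla E(\tilde{w}) = \sum_{i=1}^{n} \sigma(\tilde{w}^T \tilde{x}_i) \big( 1 - \sigma(\tilde{w}^T \tilde{x}_i) \big) \tilde{x}_i \tilde{x}_i^T = \tilde{X}^T R \tilde{X},    (4.46)
where R is a diagonal matrix whose ith diagonal element Rii=σi(1−σi). We
can now use the first- and second-order derivatives of E(w̃) in Equation 4.44
to obtain the following update equation at the kth iteration:
\tilde{w}^{(k+1)} = \tilde{w}^{(k)} - (\tilde{X}^T R_k \tilde{X})^{-1} \tilde{X}^T (\sigma_k - y)    (4.47)
where the subscript k under Rk and σk refers to using w̃(k) to compute both
quantities.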
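The update of Equation 4.47 can be sketched with NumPy as follows. This is an illustrative implementation under simplifying assumptions: the function name fit_logistic_newton is made up, the toy data are synthetic, no safeguard is included for linearly separable data (where the Hessian becomes ill-conditioned), and the linear system is solved directly instead of forming the inverse of H.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson updates of Equation 4.47 on augmented feature vectors."""
    # Rows of X_aug are the augmented vectors (x_i^T, 1).
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X_aug.shape[1])                 # augmented parameters, initialized to 0
    for _ in range(n_iter):
        s = sigmoid(X_aug @ w)                   # sigma_k
        grad = X_aug.T @ (s - y)                 # Equation 4.45
        r = s * (1.0 - s)                        # diagonal of R_k
        H = X_aug.T @ (X_aug * r[:, None])       # Equation 4.46
        step = np.linalg.solve(H, grad)          # H^{-1} grad without an explicit inverse
        w = w - step
        if np.max(np.abs(step)) < tol:           # no significant change in w
            break
    return w[:-1], w[-1]


# Synthetic, noisy (hence non-separable) toy data with made-up parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([1.5, -2.0]), 0.5
y = (rng.random(200) < sigmoid(X @ true_w + true_b)).astype(float)

w_hat, b_hat = fit_logistic_newton(X, y)
```

At convergence the gradient X̃^T(σ − y) is numerically zero, which is a convenient check that the returned parameters minimize the cross entropy on the training data.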
4.6.3 Characteristics of Logistic Regression
1. Logistic Regression is a discriminative model for classification that
directly computes the posterior probabilities without making any
assumption about the class conditional probabilities. Hence, it is quite
generic and can be applied in diverse applications. It can also be easily
extended to multiclass classification, where it is known as multinomial
logistic regression. However, its expressive power is limited to learning
only linear decision boundaries.
2. Because there are different weights (parameters) for every attribute, the
learned parameters of logistic regression can be analyzed to understand
the relationships between attributes and class labels.
3. Because logistic regression does not involve computing densities and
distances in the attribute space, it can work more robustly even in high-dimensional settings than distance-based methods such as nearest
neighbor classifiers. However, the objective function of logistic
regression does not involve any term relating to the complexity of the
model. Hence, logistic regression does not provide a way to make a
trade-off between model complexity and training performance, as
compared to other classification models such as support vector
machines. Nevertheless, variants of logistic regression can easily be
developed to account for model complexity, by including appropriate
terms in the objective function along with the cross entropy function.
4. Logistic regression can handle irrelevant attributes by learning weight
parameters close to 0 for attributes that do not provide any gain in
performance during training. It can also handle interacting attributes
since the learning of model parameters is achieved in a joint fashion by
considering the effects of all attributes together. Furthermore, if there are
redundant attributes that are duplicates of each other, then logistic
regression can learn equal weights for every redundant attribute, without
degrading classification performance. However, the presence of a large
number of irrelevant or redundant attributes in high-dimensional settings
can make logistic regression susceptible to model overfitting.
5. Logistic regression cannot handle data instances with missing values,
since the posterior probabilities are only computed by taking a weighted
sum of all the attributes. If there are missing values in a training
instance, it can be discarded from the training set. However, if there are
missing values in a test instance, then logistic regression would fail to
predict its class label.
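As an illustration of the variants mentioned in item 3, one possible complexity term is a squared-norm penalty on the weights added to the cross entropy. The sketch below assumes this particular penalty and a trade-off parameter named lam; both are choices made here for illustration, not something prescribed by the text. The toy data and parameter values are made up.

```python
import numpy as np


def regularized_cross_entropy(w, b, X, y, lam):
    """Cross entropy plus an assumed penalty 0.5 * lam * ||w||^2."""
    z = X @ w + b
    # -log(sigmoid(z)) = log(1 + e^{-z}) and -log(1 - sigmoid(z)) = log(1 + e^{z}),
    # both evaluated stably with logaddexp.
    cross_entropy = np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))
    return cross_entropy + 0.5 * lam * np.dot(w, w)


# Hypothetical toy data and parameters.
X_toy = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.0]])
y_toy = np.array([1.0, 0.0, 1.0, 0.0])
w_toy, b_toy = np.array([-1.0, 1.0]), 0.0

plain = regularized_cross_entropy(w_toy, b_toy, X_toy, y_toy, lam=0.0)
penalized = regularized_cross_entropy(w_toy, b_toy, X_toy, y_toy, lam=0.1)
```

Larger values of lam favor smaller weights at the expense of training fit, which is the kind of trade-off between model complexity and training performance that the plain cross entropy does not provide.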
4.7 Artificial Neural Network
Artificial neural networks (ANN) are powerful classification models that are
able to learn highly complex and nonlinear decision boundaries purely from
the data. They have gained widespread acceptance in several applications
such as vision, speech, and language processing, where they have been
repeatedly shown to outperform other classification models (and in some
cases even human performance). Historically, the study of artificial neural
networks was inspired by attempts to emulate biological neural systems. The
human brain consists primarily of nerve cells called neurons, linked together
with other neurons via strands of fiber called axons. Whenever a neuron is
stimulated (e.g., in response to a stimulus), it transmits nerve activations via
axons to other neurons. The receptor neurons collect these nerve activations
using structures called dendrites, which are extensions from the cell body of
the neuron. The strength of the contact point between a dendrite and an axon,
known as a synapse, determines the connectivity between neurons.
Neuroscientists have discovered that the human brain learns by changing the
strength of the synaptic connection between neurons upon repeated
stimulation by the same impulse.
The human brain consists of approximately 100 billion neurons that are interconnected in complex ways, making it possible for us to learn new tasks and
perform regular activities. Note that a single neuron only performs a simple
modular function, which is to respond to the nerve activations coming from
sender neurons connected at its dendrite, and transmit its activation to
receptor neurons via axons. However, it is the composition of these simple
functions that together is able to express complex functions. This idea is at
the basis of constructing artificial neural networks.
Analogous to the structure of a human brain, an artificial neural network is
composed of a number of processing units, called nodes, that are connected
with each other via directed links. The nodes correspond to neurons that
perform the basic units of computation, while the directed links correspond to
connections between neurons, consisting of axons and dendrites. Further, the
weight of a directed link between two neurons represents the strength of the
synaptic connection between neurons. As in biological neural systems, the
primary objective of ANN is to adapt the weights of the links until they fit the
input-output relationships of the underlying data.
The basic motivation behind using an ANN model is to extract useful
features from the original attributes that are most relevant for classification.
Traditionally, feature extraction has been achieved by using dimensionality
reduction techniques such as PCA (introduced in Chapter 2), which show
limited success in extracting nonlinear features, or by using hand-crafted
features provided by domain experts. By using a complex combination of
inter-connected nodes, ANN models are able to extract much richer sets of
features, resulting in good classification performance. Moreover, ANN
models provide a natural way of representing features at multiple levels of
abstraction, where complex features are seen as compositions of simpler
features. In many classification problems, modeling such a hierarchy of
features turns out to be very useful. For example, in order to detect a human
face in an image, we can first identify low-level features such as sharp edges
with different gradients and orientations. These features can then be
combined to identify facial parts such as eyes, nose, ears, and lips. Finally, an
appropriate arrangement of facial parts can be used to correctly identify a
human face. ANN models provide a powerful architecture to represent a
hierarchical abstraction of features, from lower levels of abstraction (e.g.,
edges) to higher levels (e.g., facial parts).
Artificial neural networks have a long history of development spanning
more than five decades of research. Although classical models of ANN suffered
from several challenges that hindered progress for a long time, they have
re-emerged with widespread popularity because of a number of developments
in the last decade, collectively known as deep learning. In this
section, we examine classical approaches for learning ANN models, starting
from the simplest model called perceptrons to more complex architectures
called multi-layer neural networks. In the next section, we discuss some of
the recent advancements in the area of ANN that have made it possible to
effectively learn modern ANN models with deep architectures.
4.7.1 Perceptron
A perceptron is a basic type of ANN model that involves two types of nodes:
input nodes, which are used to represent the input attributes, and an output
node, which is used to represent the model output. Figure 4.20 illustrates the
basic architecture of a perceptron that takes three input attributes, x1, x2, and
x3, and produces a binary output y. The input node corresponding to an
attribute xi is connected via a weighted link wi to the output node. The
weighted link is used to emulate the strength of a synaptic connection
between neurons.
Figure 4.20.
Basic architecture of a perceptron.
The output node is a mathematical device that computes a weighted sum of
its inputs, adds a bias factor b to the sum, and then examines the sign of the
result to produce the output $\hat{y}$ as follows:
$$\hat{y} = \begin{cases} 1, & \text{if } \mathbf{w}^T \mathbf{x} + b > 0, \\ -1, & \text{otherwise.} \end{cases}$$ (4.48)
To simplify notation, w and b can be concatenated to form $\tilde{\mathbf{w}} = (\mathbf{w}^T\ b)^T$,
while x can be appended with 1 at the end to form $\tilde{\mathbf{x}} = (\mathbf{x}^T\ 1)^T$. The output of
the perceptron $\hat{y}$ can then be written:
$$\hat{y} = \operatorname{sign}(\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}),$$
where the sign function acts as an activation function by providing an output
value of +1 if its argument is positive and −1 otherwise.
Learning the Perceptron
Given a training set, we are interested in learning parameters $\tilde{\mathbf{w}}$ such that $\hat{y}$
closely resembles the true y of training instances. This is achieved by using
the perceptron learning algorithm given in Algorithm 4.3. The key
computation for this algorithm is the iterative weight update formula given in
Step 8 of the algorithm:
$$w_j^{(k+1)} = w_j^{(k)} + \lambda \left(y_i - \hat{y}_i^{(k)}\right) x_{ij},$$ (4.49)
where $w_j^{(k)}$ is the weight parameter associated with the jth input link after the
kth iteration, λ is a parameter known as the learning rate, and $x_{ij}$ is the value
of the jth attribute of the training example $\mathbf{x}_i$. The justification for Equation
4.49 is rather intuitive. Note that $(y_i - \hat{y}_i)$ captures the discrepancy between
$y_i$ and $\hat{y}_i$, such that its value is 0 only when the true label and the predicted
output match. Assume $x_{ij}$ is positive. If $\hat{y}_i = -1$ and $y_i = +1$, then $w_j$ is increased at
the next iteration so that $\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i$ can become positive. On the other hand, if
$\hat{y}_i = +1$ and $y_i = -1$, then $w_j$ is decreased so that $\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i$ can become negative.
Hence, the weights are modified at every iteration to reduce the discrepancies
between $\hat{y}$ and y across all training instances. The learning rate λ, a
parameter whose value is between 0 and 1, can be used to control the amount
of adjustment made in each iteration. The algorithm halts when the average
discrepancy is smaller than a threshold γ.
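As a one-step illustration of Equation 4.49, consider the following worked update. The numbers are made up for this example and do not come from the text: the learning rate is λ = 0.1, the augmented instance is $\tilde{\mathbf{x}} = (1, 2, 1)^T$ with true label y = +1, and the current weights are $\tilde{\mathbf{w}} = (0.2, -0.4, 0.1)^T$.

```latex
\begin{aligned}
\tilde{\mathbf{w}}^T \tilde{\mathbf{x}} &= 0.2(1) - 0.4(2) + 0.1(1) = -0.5
  \;\Rightarrow\; \hat{y} = -1, \\
\tilde{\mathbf{w}}^{\text{new}} &= \tilde{\mathbf{w}} + 0.1\,\bigl(1 - (-1)\bigr)\,\tilde{\mathbf{x}}
  = (0.2,\ -0.4,\ 0.1)^T + 0.2\,(1,\ 2,\ 1)^T = (0.4,\ 0.0,\ 0.3)^T, \\
(\tilde{\mathbf{w}}^{\text{new}})^T \tilde{\mathbf{x}} &= 0.4(1) + 0.0(2) + 0.3(1) = 0.7 > 0
  \;\Rightarrow\; \hat{y} = +1 .
\end{aligned}
```

In this instance a single update is enough to correct the prediction; in general, several passes over the training set may be needed.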
Algorithm 4.3 Perceptron learning
1: Let D.train = {($\tilde{\mathbf{x}}_i$, $y_i$) | i = 1, 2, …, n} be the set of training instances.
2: Set k ← 0.
3: Initialize the weight vector $\tilde{\mathbf{w}}^{(0)}$ with random values.
4: repeat
5:   for each training instance ($\tilde{\mathbf{x}}_i$, $y_i$) ∈ D.train do
6:     Compute the predicted output $\hat{y}_i^{(k)}$ using $\tilde{\mathbf{w}}^{(k)}$.
7:     for each weight component $w_j$ do
8:       Update the weight, $w_j^{(k+1)} = w_j^{(k)} + \lambda (y_i - \hat{y}_i^{(k)}) x_{ij}$.
9:     end for
10:     Update k ← k + 1.
11:   end for
12: until $\sum_{i=1}^{n} |y_i - \hat{y}_i^{(k)}| / n$ is less than a threshold γ
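Algorithm 4.3 can be sketched in Python with NumPy roughly as follows. This is a minimal illustration rather than a reference implementation: the function names, the `max_epochs` cap, the `seed` argument, the initialization range, and the toy AND data set are all assumptions introduced here, and the stopping test is evaluated once per pass using the weights at the end of that pass.

```python
import numpy as np

def perceptron_predict(w_aug, X):
    """Return sign(w~^T x~) for each row of X, with sign(0) mapped to -1 as in Eq. 4.48."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append the constant 1 for the bias
    return np.where(X_aug @ w_aug > 0, 1, -1)

def perceptron_train(X, y, lam=0.1, gamma=0.01, max_epochs=1000, seed=0):
    """Sketch of Algorithm 4.3 for labels y in {-1, +1}.

    max_epochs and seed are additions for this sketch (not part of the algorithm);
    the cap keeps the loop finite when the data are not linearly separable.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    X_aug = np.hstack([X, np.ones((n, 1))])
    w = rng.uniform(-0.5, 0.5, size=d + 1)        # step 3: random initial weights
    for epoch in range(1, max_epochs + 1):
        for x_i, y_i in zip(X_aug, y):            # step 5
            y_hat = 1 if w @ x_i > 0 else -1      # step 6
            w = w + lam * (y_i - y_hat) * x_i     # step 8, all components at once
        # step 12: average absolute discrepancy over the training set
        avg_discrepancy = np.mean(np.abs(y - perceptron_predict(w, X)))
        if avg_discrepancy < gamma:
            break
    return w, epoch

# Toy, linearly separable data (logical AND with -1/+1 labels); purely illustrative.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([-1, -1, -1, 1])
w_and, epochs_used = perceptron_train(X_and, y_and)
print(perceptron_predict(w_and, X_and), "after", epochs_used, "epoch(s)")
```

Because the AND data are linearly separable, the loop stops once every training instance is classified correctly; the particular boundary found depends on the random initial weights.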
The perceptron is a simple classification model that is designed to learn linear
decision boundaries in the attribute space. Figure 4.21 shows the decision
boundary obtained by applying the perceptron learning algorithm to the data
set provided on the left of the figure. However, note that there can be multiple
decision boundaries that can separate the two classes, and the perceptron
arbitrarily learns one of these boundaries depending on the random initial
values of parameters. (The selection of the optimal decision boundary is a
problem that will be revisited in the context of support vector machines in
Section 4.9.) Further, the perceptron learning algorithm is guaranteed to
converge only when the classes are linearly separable. If the classes are
not linearly separable, the algorithm fails to converge. Figure 4.22 shows an
example of data that are not linearly separable, given by the XOR function. The
perceptron cannot find a solution for these data because there is no
linear decision boundary that can perfectly separate the training instances.
Thus, the stopping condition at line 12 of Algorithm 4.3 would never be met
and hence, the perceptron learning algorithm would fail to converge. This is a
major limitation of perceptrons since real-world classification problems often
involve nonlinearly separable classes.
Figure 4.21.
Perceptron decision boundary for the data given on the left (+
represents a positively labeled instance while o represents a
negatively labeled instance).
Figure 4.22.
XOR classification problem. No linear hyperplane can separate the
two classes.
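The failure to converge on XOR can be observed with a short experiment. The sketch below applies the update of Equation 4.49 to the four XOR points for a fixed number of passes and records how many points remain misclassified after each pass. The label coding (−1/+1), the learning rate, the initialization range, and the cap of 200 passes are assumptions made for this illustration.

```python
import numpy as np

# XOR data with -1/+1 labels; a constant 1 is appended to each instance for the bias.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, 1, 1, -1])
X_aug = np.hstack([X_xor, np.ones((4, 1))])

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=3)   # random initial weights (assumed range)
lam = 0.1                            # learning rate (assumed value)

errors_per_epoch = []
for epoch in range(200):             # fixed cap, since the stopping condition is never met
    for x_i, y_i in zip(X_aug, y_xor):
        y_hat = 1 if w @ x_i > 0 else -1
        w = w + lam * (y_i - y_hat) * x_i
    preds = np.where(X_aug @ w > 0, 1, -1)
    errors_per_epoch.append(int(np.sum(preds != y_xor)))

print("fewest misclassified points in any pass:", min(errors_per_epoch))
```

Since no single hyperplane separates the XOR classes, at least one point is misclassified after every pass, no matter how long the loop runs.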
4.7.2 Multi-layer Neural Network
A multi-layer neural network generalizes the basic concept of a perceptron to
more complex architectures of nodes that are capable of learning nonlinear
decision boundaries. A generic architecture of a multi-layer neural network is
shown in Figure 4.23 where the nodes are arranged in groups called layers.
These layers are commonly organized in the form of a chain such that every
layer operates on the outputs of its preceding layer. In this way, the layers
represent different levels of abstraction that are applied on the input features
in a sequential manner. The composition of these abstractions generates the
final output at the last layer, which is used for making predictions. In the
following, we briefly describe the three types of layers used in multi-layer
neural networks.
Figure 4.23.
Example of a multi-layer artificial neural network (ANN).
The first layer of the network, called the input layer, is used for representing
inputs from attributes. Every numerical or binary attribute is typically
represented using a single node on this layer, while a categorical attribute is
either represented using a different node for each categorical value, or by
encoding the k-ary attribute using $\lceil \log_2 k \rceil$ input nodes. These inputs are fed
into intermediary layers known as hidden layers, which are made up of
processing units known as hidden nodes. Every hidden node operates on
signals received from the input nodes or hidden nodes at the preceding layer,
and produces an activation value that is transmitted to the next layer. The
final layer is called the output layer and processes the activation values from
its preceding layer to produce predictions of output variables. For binary
classification, the output layer contains a single node representing the binary
class label. Because the signals in this architecture are propagated only in the
forward direction from the input layer to the output layer, such networks are
also called feedforward neural networks.
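The two input-layer encodings of a categorical attribute mentioned above (one node per categorical value, or $\lceil \log_2 k \rceil$ nodes) can be sketched as follows. The function names and the 5-valued example attribute are hypothetical and only serve to show the number of input nodes each encoding requires.

```python
import numpy as np

def one_hot_encode(value_index, k):
    """One input node per categorical value: k nodes, exactly one of them set to 1."""
    code = np.zeros(k, dtype=int)
    code[value_index] = 1
    return code

def binary_encode(value_index, k):
    """Compact encoding with ceil(log2(k)) input nodes (binary digits of the index)."""
    n_bits = max(1, (k - 1).bit_length())   # equals ceil(log2(k)) for k >= 2
    return np.array([(value_index >> bit) & 1 for bit in range(n_bits - 1, -1, -1)])

# Hypothetical 5-valued attribute; the value with index 4 is encoded both ways.
print(one_hot_encode(4, 5))   # uses 5 input nodes
print(binary_encode(4, 5))    # uses 3 input nodes, since ceil(log2(5)) = 3
```

The one-hot form uses more input nodes but treats all categorical values symmetrically, whereas the binary form is more compact at the cost of imposing an arbitrary bit pattern on the values.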
A major difference between multi-layer neural networks and perceptrons is
the inclusion of hidden layers, which dramatically improves their ability to
represent arbitrarily complex decision boundaries. For example, consider the
XOR problem described in the previous section. The instances can be
classified using two hyperplanes that partition the input space into their
respective classes, as shown in Figure 4.24(a). Because a perceptron can
create only one hyperplane, it cannot separate the two classes. However, this
problem can be addressed by using a hidden layer consisting of two nodes, as
shown in Figure 4.24(b). Intuitively, we can think of each hidden node as a
perceptron that tries to construct one of the two hyperplanes, while the output
node simply combines the results of the perceptrons to yield the decision
boundary shown in Figure 4.24(a).
Figure 4.24.
A two-layer neural network for the XOR problem.
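The idea of two hidden perceptrons whose outputs are combined by an output node can be checked numerically. In the sketch below the weights are hand-picked for illustration and are not necessarily those shown in Figure 4.24: the first hidden node acts as a logical OR, the second as a logical NAND, and the output node fires only when both hidden nodes fire. Labels are coded as −1/+1.

```python
import numpy as np

def sign(z):
    """Sign activation with sign(0) mapped to -1, as in Eq. 4.48."""
    return np.where(z > 0, 1, -1)

# Hand-picked (assumed) parameters: one row of W_hidden per hidden node.
W_hidden = np.array([[1.0, 1.0],     # hidden node 1: x1 + x2 - 0.5 > 0  (OR)
                     [-1.0, -1.0]])  # hidden node 2: -x1 - x2 + 1.5 > 0 (NAND)
b_hidden = np.array([-0.5, 1.5])
w_out = np.array([1.0, 1.0])         # output node: h1 + h2 - 1.5 > 0 (both hidden nodes +1)
b_out = -1.5

def xor_net(X):
    H = sign(X @ W_hidden.T + b_hidden)   # hidden-layer activations in {-1, +1}
    return sign(H @ w_out + b_out)        # output-layer prediction

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor_predictions = xor_net(X)
print(xor_predictions)
```

Each hidden node contributes one hyperplane, and the output node selects the region lying between them, which is exactly the region containing the two positive XOR instances.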
The hidden nodes can be viewed as learning latent representations or features
that are useful for distinguishing between the classes. While the first hidden
layer directly operates on the input attributes and thus captures simpler
features, the subsequent hidden layers are able to combine them and construct
more complex features. From this perspective, multi-layer neural networks
learn a hierarchy of features at different levels of abstraction that are finally
combined at the output nodes to make predictions. Further, there are
combinatorially many ways we can combine the features learned at the
hidden layers of ANN, making them highly expressive. This property chiefly
distinguishes ANN from other classification models such as decision trees,
which can learn partitions in the attribute space but are unable to combine
them in exponentially many ways.
Figure 4.25.
Schematic illustration of the parameters of an ANN model with (L−1)
hidden layers.
To understand the nature of computations happening at the hidden and output
nodes of ANN, consider the ith node at the lth layer of the network (l>0),
where the layers are numbered from 0 (input layer) to L (output layer), as
shown in Figure 4.25. The activation value generated at this node, $a_i^l$, can be
represented as a function of the inputs received from nodes at the preceding
layer. Let $w_{ij}^l$ represent the weight of the connection from the jth node at
layer (l−1) to the ith node at layer l. Similarly, let us denote the bias term at
this node as $b_i^l$. The activation value $a_i^l$ can then be expressed as
$$a_i^l = f(z_i^l) = f\Big(\sum_j w_{ij}^l a_j^{l-1} + b_i^l\Big),$$
where $z_i^l$ is called the linear predictor and f(⋅) is the activation function that
converts $z_i^l$ to $a_i^l$. Further, note that, by definition, $a_j^0 = x_j$ at the input layer and
$a^L = \hat{y}$ at the output node.
There are a number of alternative activation functions apart from the sign
function that can be used in multi-layer neural networks. Some examples
include linear, sigmoid (logistic), and hyperbolic tangent functions, as shown
in Figure 4.26. These functions are able to produce real-valued and nonlinear
activation values. Among these activation functions, the sigmoid σ(⋅) has
been widely used in many ANN models, although the use of other types of
activation functions in the context of deep learning will be discussed in
Section 4.8. We can thus represent $a_i^l$ as
$$a_i^l = \sigma(z_i^l) = \frac{1}{1 + e^{-z_i^l}}.$$ (4.50)
Figure 4.26.
Types of activation functions used in multi-layer neural networks.
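The layer-by-layer computation of activation values with the sigmoid function of Equation 4.50 can be sketched as a forward pass. The list-of-matrices layout, in which `weights[l-1]` holds the weights $w_{ij}^l$ of layer l with one row per node i, is a convention chosen for this sketch, and the example weights are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation of Eq. 4.50."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations [a^0, a^1, ..., a^L] for a single instance x.

    weights[l-1] has shape (nodes in layer l, nodes in layer l-1);
    biases[l-1] has one entry per node in layer l.
    """
    activations = [np.asarray(x, dtype=float)]        # a^0 = x at the input layer
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b                   # linear predictor z^l
        activations.append(sigmoid(z))                # a^l = sigma(z^l)
    return activations

# Arbitrary (assumed) parameters for a network with 2 inputs, 2 hidden nodes, 1 output.
example_weights = [np.array([[0.5, -0.5], [1.0, 1.0]]), np.array([[1.0, -1.0]])]
example_biases = [np.array([0.0, -1.0]), np.array([0.5])]
example_activations = forward([1.0, 2.0], example_weights, example_biases)
print(example_activations[-1])   # a^L, the prediction of the network
```

Keeping every intermediate activation, rather than only the final output, is convenient because the same values are needed again when the partial derivatives are computed.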
Learning Model Parameters
The weights and bias terms (w, b) of the ANN model are learned during
training so that the predictions on training instances match the true labels.
This is achieved by using a loss function
$$E(\mathbf{w}, \mathbf{b}) = \sum_{k=1}^{n} \text{Loss}(y_k, \hat{y}_k),$$ (4.51)
where $y_k$ is the true label of the kth training instance and $\hat{y}_k$ is equal to $a^L$,
produced by using $\mathbf{x}_k$. A typical choice of the loss function is the squared
loss function:
$$\text{Loss}(y_k, \hat{y}_k) = (y_k - \hat{y}_k)^2.$$ (4.52)
Note that E(w, b) is a function of the model parameters (w, b) because the
output activation value aL depends on the weights and bias terms. We are
interested in choosing (w, b) that minimizes the training loss E(w, b).
Unfortunately, because of the use of hidden nodes with nonlinear activation
functions, E(w, b) is not a convex function of w and b, which means that E(w,
b) can have local minima that are not globally optimal. However, we can still
apply standard optimization techniques such as the gradient descent method
to arrive at a locally optimal solution. In particular, the weight parameter $w_{ij}^l$
and the bias term $b_i^l$ can be iteratively updated using the following equations:
$$w_{ij}^l \leftarrow w_{ij}^l - \lambda \frac{\partial E}{\partial w_{ij}^l},$$ (4.53)
$$b_i^l \leftarrow b_i^l - \lambda \frac{\partial E}{\partial b_i^l},$$ (4.54)
where λ is a hyper-parameter known as the learning rate. The intuition behind
these equations is to move the parameters in a direction that reduces the training
loss. If we arrive at a minimum using this procedure, the gradient of the
training loss will be close to 0, eliminating the second term and resulting in
the convergence of weights. The weights are commonly initialized with
values drawn randomly from a Gaussian or a uniform distribution.
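The dependence of gradient descent on its starting point for a non-convex objective can be seen on a one-dimensional toy function. The function $f(w) = w^4 - 3w^2 + w$ below is a stand-in chosen for this illustration (it is not an ANN training loss), and the learning rate and number of steps are likewise assumptions.

```python
def f(w):
    """Toy non-convex objective with two minima (an assumed stand-in for E)."""
    return w**4 - 3 * w**2 + w

def grad_f(w):
    """Derivative of f with respect to w."""
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lam=0.01, n_steps=1000):
    """Repeatedly apply w <- w - lam * df/dw, in the spirit of Eq. 4.53."""
    w = w0
    for _ in range(n_steps):
        w = w - lam * grad_f(w)
    return w

w_left = gradient_descent(-2.0)    # started to the left of both minima
w_right = gradient_descent(2.0)    # started to the right of both minima
print(w_left, f(w_left))
print(w_right, f(w_right))
```

Both runs end at points where the gradient is essentially zero, but the run started at 2.0 settles in a minimum with a higher objective value than the run started at −2.0, which is the one-dimensional analogue of reaching a locally but not globally optimal set of weights.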
A necessary step for updating weights in Equation 4.53 is to compute the
partial derivative of E with respect to $w_{ij}^l$. This computation is nontrivial,
especially at hidden layers (l<L), since $w_{ij}^l$ does not directly affect $\hat{y} = a^L$
(and hence the training loss), but influences it through chains of
activation values at subsequent layers. To address this problem, a technique
known as backpropagation was developed, which propagates the derivatives
backward from the output layer to the hidden layers. This technique can be
described as follows.
Recall that the training loss E is simply the sum of individual losses at
training instances. Hence the partial derivative of E can be decomposed as a
sum of partial derivatives of individual losses:
$$\frac{\partial E}{\partial w_{ij}^l} = \sum_{k=1}^{n} \frac{\partial\, \text{Loss}(y_k, \hat{y}_k)}{\partial w_{ij}^l}.$$
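To simplify the discussion, we will consider only the derivatives of the loss at
the kth training instance, which will be generically represented as Loss(y, $a^L$).
By using the chain rule of differentiation, we can represent the partial
derivatives of the loss with respect to $w_{ij}^l$ as
$$\frac{\partial\, \text{Loss}}{\partial w_{ij}^l} = \frac{\partial\, \text{Loss}}{\partial a_i^l} \times \frac{\partial a_i^l}{\partial z_i^l} \times \frac{\partial z_i^l}{\partial w_{ij}^l}.$$ (4.55)
The last term of the previous equation can be written as
$$\frac{\partial z_i^l}{\partial w_{ij}^l} = \frac{\partial \big(\sum_{j'} w_{ij'}^l a_{j'}^{l-1} + b_i^l\big)}{\partial w_{ij}^l} = a_j^{l-1}.$$
Also, if we use the sigmoid activation function, then
$$\frac{\partial a_i^l}{\partial z_i^l} = \frac{\partial\, \sigma(z_i^l)}{\partial z_i^l} = a_i^l (1 - a_i^l).$$
Equation 4.55 can thus be simplified as
$$\frac{\partial\, \text{Loss}}{\partial w_{ij}^l} = \delta_i^l \times a_i^l (1 - a_i^l) \times a_j^{l-1}, \quad \text{where } \delta_i^l = \frac{\partial\, \text{Loss}}{\partial a_i^l}.$$ (4.56)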
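A similar formula for the partial derivatives with respect to the bias terms $b_i^l$
is given by
$$\frac{\partial\, \text{Loss}}{\partial b_i^l} = \delta_i^l \times a_i^l (1 - a_i^l).$$ (4.57)
Hence, to compute the partial derivatives, we only need to determine $\delta_i^l$.
Using a squared loss function, we can easily write $\delta^L$ at the output node as
$$\delta^L = \frac{\partial\, \text{Loss}}{\partial a^L} = \frac{\partial\, (y - a^L)^2}{\partial a^L} = 2(a^L - y).$$ (4.58)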
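However, the approach for computing $\delta_j^l$ at hidden nodes (l<L) is more
involved. Notice that $a_j^l$ affects the activation values $a_i^{l+1}$ of all nodes at the
next layer, which in turn influence the loss. Hence, again using the chain
rule of differentiation, $\delta_j^l$ can be represented as
$$\begin{aligned} \delta_j^l = \frac{\partial\, \text{Loss}}{\partial a_j^l} &= \sum_i \left( \frac{\partial\, \text{Loss}}{\partial a_i^{l+1}} \times \frac{\partial a_i^{l+1}}{\partial a_j^l} \right) \\ &= \sum_i \left( \frac{\partial\, \text{Loss}}{\partial a_i^{l+1}} \times \frac{\partial a_i^{l+1}}{\partial z_i^{l+1}} \times \frac{\partial z_i^{l+1}}{\partial a_j^l} \right) \\ &= \sum_i \left( \delta_i^{l+1} \times a_i^{l+1} (1 - a_i^{l+1}) \times w_{ij}^{l+1} \right). \end{aligned}$$ (4.59)
The previous equation provides a concise representation of the $\delta_j^l$ values at
layer l in terms of the $\delta_i^{l+1}$ values computed at layer l+1. Hence, proceeding
backward from the output layer L to the hidden layers, we can recursively
apply Equation 4.59 to compute $\delta_i^l$ at every hidden node. $\delta_i^l$ can then be used
in Equations 4.56 and 4.57 to compute the partial derivatives of the loss with
respect to $w_{ij}^l$ and $b_i^l$, respectively. Algorithm 4.4 summarizes the complete
approach for learning the model parameters of ANN using backpropagation
and the gradient descent method.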
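Algorithm 4.4 Learning ANN using backpropagation and gradient descent
1: Let D.train = {($\mathbf{x}_k$, $y_k$) | k = 1, 2, …, n} be the set of training instances.
2: Set counter c ← 0.
3: Initialize the weight and bias terms ($\mathbf{w}^{(0)}$, $\mathbf{b}^{(0)}$) with random values.
4: repeat
5:   for each training instance ($\mathbf{x}_k$, $y_k$) ∈ D.train do
6:     Compute the set of activations $(a_i^l)_k$ by making a forward pass using $\mathbf{x}_k$.
7:     Compute the set $(\delta_i^l)_k$ by backpropagation using Equations 4.58 and 4.59.
8:     Compute $(\partial\, \text{Loss}/\partial w_{ij}^l,\ \partial\, \text{Loss}/\partial b_i^l)_k$ using Equations 4.56 and 4.57.
9:   end for
10:   Compute $\partial E/\partial w_{ij}^l \leftarrow \sum_{k=1}^{n} (\partial\, \text{Loss}/\partial w_{ij}^l)_k$.
11:   Compute $\partial E/\partial b_i^l \leftarrow \sum_{k=1}^{n} (\partial\, \text{Loss}/\partial b_i^l)_k$.
12:   Update ($\mathbf{w}^{(c+1)}$, $\mathbf{b}^{(c+1)}$) by gradient descent using Equations 4.53 and 4.54.
13:   Update c ← c + 1.
14: until ($\mathbf{w}^{(c+1)}$, $\mathbf{b}^{(c+1)}$) and ($\mathbf{w}^{(c)}$, $\mathbf{b}^{(c)}$) converge to the same value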
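A compact NumPy sketch of Algorithm 4.4 for sigmoid activations and the squared loss is given below. It is an illustration under several assumptions rather than a definitive implementation: the function names and list-of-matrices layout are choices made here, the convergence test of line 14 is replaced by a fixed number of passes, and the 2-3-1 network, learning rate, and XOR data with 0/1 labels are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """Return [a^0, a^1, ..., a^L]; Ws[l] and bs[l] feed layer l+1 from layer l."""
    acts = [np.asarray(x, dtype=float)]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts

def backprop(x, y, Ws, bs):
    """Per-instance gradients of the squared loss (y - a^L)^2."""
    acts = forward(x, Ws, bs)
    delta = 2.0 * (acts[-1] - y)                   # Eq. 4.58
    grads_W, grads_b = [None] * len(Ws), [None] * len(bs)
    for l in range(len(Ws) - 1, -1, -1):           # output layer first, then backward
        a = acts[l + 1]
        local = delta * a * (1.0 - a)              # delta_i * a_i (1 - a_i)
        grads_W[l] = np.outer(local, acts[l])      # Eq. 4.56
        grads_b[l] = local                         # Eq. 4.57
        delta = Ws[l].T @ local                    # Eq. 4.59
    return grads_W, grads_b

def train(X, Y, Ws, bs, lam=0.5, n_epochs=2000):
    """Full-batch gradient descent (Eqs. 4.53 and 4.54); updates Ws and bs in place."""
    for _ in range(n_epochs):                      # fixed pass count (assumed stopping rule)
        sum_W = [np.zeros_like(W) for W in Ws]
        sum_b = [np.zeros_like(b) for b in bs]
        for x, y in zip(X, Y):                     # lines 5-9
            gW, gb = backprop(x, y, Ws, bs)
            for l in range(len(Ws)):
                sum_W[l] += gW[l]                  # lines 10-11
                sum_b[l] += gb[l]
        for l in range(len(Ws)):                   # line 12
            Ws[l] -= lam * sum_W[l]
            bs[l] -= lam * sum_b[l]
    return Ws, bs

def total_loss(X, Y, Ws, bs):
    """Training loss E of Eq. 4.51 with the squared loss of Eq. 4.52."""
    return sum(float((y - forward(x, Ws, bs)[-1][0]) ** 2) for x, y in zip(X, Y))

# Illustrative 2-3-1 network on XOR with 0/1 labels.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [rng.normal(size=3), rng.normal(size=1)]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])
loss_before = total_loss(X, Y, Ws, bs)
train(X, Y, Ws, bs)
print(loss_before, total_loss(X, Y, Ws, bs))
```

A useful sanity check for any such implementation is to compare the backpropagated derivatives with finite-difference estimates of the loss. Whether training reaches a good solution depends on the random initial values, since the loss is non-convex.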
4.7.3 Characteristics of ANN
1. Multi-layer neural networks with at least one hidden layer are universal
approximators; i.e., given enough hidden nodes, they can approximate any
continuous target function to arbitrary accuracy. They are thus highly
expressive and can be used to learn
complex decision boundaries in diverse applications. ANN can also be
used for multiclass classification and regression problems, by
appropriately modifying the output layer. However, the high model
complexity of classical ANN models makes them susceptible to overfitting,
which can be overcome to some extent by using deep learning
techniques discussed in Section 4.8.3.
2. ANN provides a natural way to represent a hierarchy of features at
multiple levels of abstraction. The outputs at the final hidden layer of the
ANN model thus represent features at the highest level of abstraction
that are most useful for classification. These features can also be used as
inputs in other supervised classification models, e.g., by replacing the
output node of the ANN by any generic classifier.
3. ANN represents complex high-level features as compositions of simpler
lower-level features that are easier to learn. This provides ANN the
ability to gradually increase the complexity of representations, by
adding more hidden layers to the architecture. Further, since simpler
features can be combined in combinatorial ways, the number of complex
features learned by ANN is much larger than traditional classification
models. This is one of the main reasons behind the high expressive
power of deep neural networks.
4. ANN can easily handle irrelevant attributes, by using zero weights for
attributes that do not help in improving the training loss. Also, redundant
attributes receive similar weights and do not degrade the quality of the
classifier. However, if the number of irrelevant or redundant attributes is
large, the learning of the ANN model may suffer from overfitting,
leading to poor generalization performance.
5. Since learning an ANN model involves minimizing a non-convex
function, the solutions obtained by gradient descent are not guaranteed
to be globally optimal. For this reason, ANN has a tendency to get stuck
in local minima, a challenge that can be addressed by using deep
learning techniques discussed in Section 4.8.4.
6. Training an ANN is a time-consuming process, especially when the
number of hidden nodes is large. Nevertheless, test examples can be
classified rapidly.
7. Just like logistic regression, ANN can learn in the presence of
interacting variables, since the model parameters are jointly learned over
all variables together. However, ANN cannot handle instances with
missing values in the training or testing phase.
4.8 Deep Learning
As described above, the use of hidden layers in ANN is based on the general
belief that complex high-level features can be constructed by combining
simpler lower-level features. Typically, the greater the number of hidden
layers, the deeper the hierarchy of features learned by the network. This
motivates the learning of ANN models with long chains of hidden layers,
known as deep neural networks. In contrast to “shallow” neural networks
that involve only a small number of hidden layers, deep neural networks are
able to represent features at multiple levels of abstraction and often require
far fewer nodes per layer to achieve generalization performance similar to
shallow networks.
Despite the huge potential in learning deep neural networks, it has remained
challenging to learn ANN models with a large number of hidden layers using
classical approaches. Apart from reasons related to limited computational
resources and hardware architectures, there have been a number of
algorithmic challenges in learning deep neural networks. First, learning a
deep neural network with low training error has been a daunting task because
of the saturation of sigmoid activation functions, resulting in slow
convergence of gradient descent. This problem becomes even more serious as
we move away from the output node to the hidden layers, because of the
compounded effects of saturation at multiple layers, known as the vanishing
gradient problem. For this reason, classical ANN models have
suffered from slow and ineffective learning, leading to poor training and test
performance. Second, the learning of deep neural networks is quite sensitive
to the initial values of model parameters, chiefly because of the non-convex
nature of the optimization function and the slow convergence of gradient
descent. Third, deep neural networks with a large number of hidden layers
have high model complexity, making them susceptible to overfitting. Hence,
even if a deep neural network has been trained to show low training error, it
can still suffer from poor generalization performance.
These challenges have deterred progress in building deep neural networks for
several decades and it is only recently that we have started to unlock their
immense potential with the help of a number of advances being made in the
area of deep learning. Although some of these advances have been around for
some time, they have only gained mainstream attention in the last decade,
with deep neural networks continually beating records in various
competitions and solving problems that were too difficult for other
classification approaches.
There are two factors that have played a major role in the emergence of deep
learning techniques. First, the availability of larger labeled data sets, e.g., the
ImageNet data set contains more than 10 million labeled images, has made it
possible to learn more complex ANN models than ever before, without
falling easily into the traps of model overfitting. Second, advances in
computational abilities and hardware infrastructures, such as the use of
graphics processing units (GPUs) for distributed computing, have greatly
helped in experimenting with deep neural networks with larger architectures
that would not have been feasible with traditional resources.
In addition to the previous two factors, there have been a number of
algorithmic advancements to overcome the challenges faced by classical
methods in learning deep neural networks. Some examples include the use of
more responsive combinations of loss functions and activation functions,
better initialization of model parameters, novel regularization techniques,
more agile architecture designs, and better techniques for model learning and
hyper-parameter selection. In the following, we describe some of the deep
learning advances made to address the challenges in learning deep neural
networks. Further details on recent developments in deep learning can be
obtained from the Bibliographic Notes.
4.8.1 Using Synergistic Loss Functions
One of the major realizations leading to deep learning has been the
importance of choosing appropriate combinations of activation and loss
functions. Classical ANN models commonly made use of the sigmoid
activation function at the output layer, because of its ability to produce real-
valued outputs between 0 and 1, which was combined with a squared loss
objective to perform gradient descent. It was soon noticed that this particular
combination of activation and loss function resulted in the saturation of
output activation values, which can be described as follows.
Saturation of Outputs
Although the sigmoid has been widely used as an activation function, it
easily saturates at inputs of large magnitude that are far away from 0.
Observe from Figure 4.27(a) that σ(z) varies appreciably only when
z is close to 0. For this reason, ∂σ(z)/∂z is noticeably non-zero for only a
small range of z around 0, as shown in Figure 4.27(b). Since ∂σ(z)/∂z is one of the
components in the gradient of loss (see Equation 4.55), we get a diminishing
gradient value when z is far from 0.
Figure 4.27.
Plots of sigmoid function and its derivative.
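The saturation of the sigmoid can also be checked numerically. The sketch below evaluates the derivative σ(z)(1−σ(z)) at a few values of z chosen here for illustration; it peaks at 0.25 for z = 0 and falls to roughly 0.0066 at z = ±5 and roughly 0.000045 at z = ±10.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z_values = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])   # sample points (assumed)
grads = sigmoid_grad(z_values)
for z, g in zip(z_values, grads):
    print(f"z = {z:6.1f}   dsigma/dz = {g:.6f}")
```

Because this derivative multiplies every gradient that passes through a sigmoid node, a linear predictor of even moderate magnitude is enough to shrink the gradient by several orders of magnitude.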
To illustrate the effect of saturation on the learning of model parameters at
the output node, consider the partial derivative of loss with respect to the
weight $w_j^L$ at the output node. Using the squared loss function, we can write
this as
$$\frac{\partial\, \text{Loss}}{\partial w_j^L} = 2(a^L - y) \times \sigma(z^L)(1 - \sigma(z^L)) \times a_j^{L-1}.$$ (4.60)
In the previous equation, notice that when $z^L$ is highly negative, $\sigma(z^L)$ (and
hence the gradient) is close to 0. On the other hand, when $z^L$ is highly
positive, $(1 - \sigma(z^L))$ becomes close to 0, nullifying the value of the gradient.
Hence, irrespective of whether the prediction $a^L$ matches the true label y or
not, the gradient of the loss with respect to the weights is close to 0 whenever
$z^L$ is highly positive or negative. This causes an unnecessarily slow
convergence of the model parameters of the ANN model, often resulting in
poor learning.
Note that it is the combination of the squared loss function and the sigmoid
activation function at the output node that together results in diminishing
gradients (and thus poor learning) upon saturation of outputs. It is thus
important to choose a synergistic combination of loss function and activation
function that does not suffer from the saturation of outputs.
Cross Entropy Loss Function
The cross entropy loss function, which was described in the context of
logistic regression in Section 4.6.2, largely avoids the problem of
saturating outputs when used in combination with the sigmoid activation
function. The cross entropy loss function of a real-valued prediction
$\hat{y} \in (0, 1)$ on a data instance with binary label $y \in \{0, 1\}$ can be defined as
$$\text{Loss}(y, \hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}),$$ (4.61)
where log represents the natural logarithm (to base e) and 0 log(0) = 0 for
convenience. The cross entropy function has foundations in information
theory and measures the amount of disagreement between y and $\hat{y}$. The
partial derivative of this loss function with respect to $\hat{y} = a^L$ can be given as
$$\delta^L = \frac{\partial\, \text{Loss}}{\partial a^L} = \frac{-y}{a^L} + \frac{(1 - y)}{(1 - a^L)} = \frac{(a^L - y)}{a^L (1 - a^L)}.$$ (4.62)
Using this value of $\delta^L$ in Equation 4.56, we can obtain the partial derivative
of the loss with respect to the weight $w_j^L$ at the output node as
$$\frac{\partial\, \text{Loss}}{\partial w_j^L} = \frac{(a^L - y)}{a^L (1 - a^L)} \times a^L (1 - a^L) \times a_j^{L-1} = (a^L - y) \times a_j^{L-1}.$$ (4.63)
Notice the simplicity of the previous formula using the cross entropy loss
function. The partial derivatives of the loss with respect to the weights at the
output node depend on the output activation only through the difference
between the prediction $a^L$ and the true label y. In contrast to Equation 4.60,
they do not involve terms such as
$\sigma(z^L)(1 - \sigma(z^L))$ that can be impacted by saturation of $z^L$. Hence, the
gradients are high whenever $(a^L - y)$ is large, promoting effective learning of
the model parameters at the output node. This has been a major breakthrough
in the learning of modern ANN models and it is now a common practice to
use the cross entropy loss function with sigmoid activations at the output
node.
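The contrast between Equations 4.60 and 4.63 can be made concrete with a saturated but wrong prediction. In the sketch below, the values $z^L = -8$, y = 1, and $a_j^{L-1} = 1$ are assumptions chosen so that the output is confidently incorrect; the function names are likewise introduced only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_squared_loss(z_L, y, a_prev):
    """Eq. 4.60: gradient w.r.t. an output weight under squared loss."""
    a_L = sigmoid(z_L)
    return 2.0 * (a_L - y) * a_L * (1.0 - a_L) * a_prev

def grad_cross_entropy(z_L, y, a_prev):
    """Eq. 4.63: gradient w.r.t. an output weight under cross entropy loss."""
    a_L = sigmoid(z_L)
    return (a_L - y) * a_prev

z_L, y, a_prev = -8.0, 1.0, 1.0     # saturated output with the wrong prediction (assumed values)
g_squared = grad_squared_loss(z_L, y, a_prev)
g_cross_entropy = grad_cross_entropy(z_L, y, a_prev)
print("squared loss gradient:  ", g_squared)
print("cross entropy gradient: ", g_cross_entropy)
```

With the squared loss the gradient magnitude is below 0.001 even though the prediction is almost as wrong as it can be, whereas with the cross entropy loss it is close to 1, so the weight receives a strong corrective update.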
4.8.2 Using Responsive Activation Functions
Even though the cross entropy loss function helps in overcoming the problem
of saturating outputs, it still does not solve the problem of saturation at
hidden layers, arising due to the use of sigmoid activation functions at hidden
nodes. In fact, the effect of saturation on the learning of model parameters is
even more aggravated at hidden layers, a problem known as the vanishing
gradient problem. In the following, we describe the vanishing gradient
problem and the use of a more responsive activation function, called the
rectified linear unit (ReLU), to overcome this problem.
Vanishing Gradient Problem
The impact of saturating activation values on the learning of model
parameters increases at deeper hidden layers that are farther away from the
output node. Even if the activation in the output layer does not saturate, the
repeated multiplications performed as we backpropagate the gradients from
the output layer to the hidden layers may lead to decreasing gradients in the
hidden layers. This is called the vanishing gradient problem, which has been
one of the major hindrances in learning deep neural networks.
To illustrate the vanishing gradient problem, consider an ANN model that
consists of a single node at every hidden layer of the network, as shown in
Figure 4.28. This simplified architecture involves a single chain of hidden
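nodes where a single weighted link $w^l$ connects the node at layer l−1 to the
node at layer l. Using Equations 4.56 and 4.59, we can represent the partial
derivative of the loss with respect to $w^l$ as
$$\frac{\partial\, \text{Loss}}{\partial w^l} = \delta^l \times a^l (1 - a^l) \times a^{l-1}, \quad \text{where } \delta^l = 2(a^L - y) \times \prod_{r=l}^{L-1} \left( a^{r+1} (1 - a^{r+1}) \times w^{r+1} \right).$$ (4.64)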
Notice that if any of the linear predictors $z^{r+1}$ saturates at subsequent layers,
then the term $a^{r+1}(1 - a^{r+1})$ becomes close to 0, thus diminishing the overall
gradient. The saturation of activations thus gets compounded and has
multiplicative effects on the gradients at hidden layers, making them highly
unstable and thus, unsuitable for use with gradient descent. Even though the
previous discussion only pertains to the simplified architecture involving a
single chain of hidden nodes, a similar argument can be made for any generic
ANN architecture involving multiple chains of hidden nodes. Note that the
vanishing gradient problem primarily arises because of the use of sigmoid
activation function at hidden nodes, which is known to easily saturate
especially after repeated multiplications.
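The compounding effect described by Equation 4.64 can be observed on a small chain of sigmoid nodes. The sketch below makes several simplifying assumptions for illustration: there are no bias terms, every weight equals 1, the chain has ten layers, and the input and label are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradients(x, y, weights):
    """Gradients dLoss/dw^l for l = 1..L in a single chain of sigmoid nodes.

    weights[l - 1] plays the role of w^l; bias terms are omitted (a simplification).
    """
    acts = [float(x)]                                   # a^0
    for w in weights:
        acts.append(sigmoid(w * acts[-1]))              # a^l = sigma(w^l * a^(l-1))
    L = len(weights)
    delta = 2.0 * (acts[L] - y)                         # delta^L for squared loss (Eq. 4.58)
    grads = [0.0] * L
    for l in range(L, 0, -1):
        grads[l - 1] = delta * acts[l] * (1.0 - acts[l]) * acts[l - 1]   # Eq. 4.64
        delta = delta * acts[l] * (1.0 - acts[l]) * weights[l - 1]       # delta^(l-1)
    return grads

chain_grads = chain_gradients(x=0.5, y=1.0, weights=[1.0] * 10)
for layer, g in enumerate(chain_grads, start=1):
    print(f"layer {layer:2d}: |dLoss/dw| = {abs(g):.3e}")
```

Each step backward multiplies the gradient by a factor $a(1-a)\,w$ that is at most 0.25 here, so the gradient at the first layer is several orders of magnitude smaller than at the last layer, even though no node in this example is strongly saturated.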
Figure 4.28.
An example of an ANN model with only one node at every hidden layer.
Figure 4.29.
Plot of the rectified linear unit (ReLU) activation function.
Rectified Linear Units (ReLU)
To overcome the vanishing gradient problem, it is important to use an
activation function f(z) at the hidden nodes that provides a stable and
significant value of the gradient whenever a hidden node is active, i.e., z>0.
This is achieved by using rectified linear units (ReLU) as activation functions
at hidden nodes, which can be defined as
$$a = f(z) = \begin{cases} z, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases}$$ (4.65)
The idea of ReLU is inspired by biological neurons, which are
either in an inactive state (f(z)=0) or show an activation value proportional to
the input. Figure 4.29 shows a plot of the ReLU function. We can see that it
is linear with respect to z when z>0. Hence, the gradient of the activation
value with respect to z can be written as
$$\frac{\partial a}{\partial z} = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{if } z < 0. \end{cases}$$ (4.66)
Although f(z)is not differentiable at 0, it is common practice to use ∂a/∂z=0
when z=0. Since the gradient of the ReLU activation function is equal to 1
whenever z>0, it avoids the problem of saturation at hidden nodes, even after
repeated multiplications. Using ReLU, the partial derivatives of the loss with
respect to the weight and bias parameters can be given by
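\[ \frac{\partial\,\text{Loss}}{\partial w_{ij}^{l}} = \delta_i^{l} \times I(z_i^{l}) \times a_j^{l-1}, \tag{4.67} \]
\[ \frac{\partial\,\text{Loss}}{\partial b_i^{l}} = \delta_i^{l} \times I(z_i^{l}), \tag{4.68} \]
\[ \text{where } \delta_i^{l} = \sum_{i=1}^{n} \bigl( \delta_i^{l+1} \times I(z_i^{l+1}) \times w_{ij}^{l+1} \bigr), \quad \text{and } I(z) = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \]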
Notice that ReLU shows a linear behavior in the activation values whenever a
node is active, as compared to the nonlinear properties of the sigmoid
function. This linearity promotes better flows of gradients during
backpropagation, and thus simplifies the learning of ANN model parameters.
The ReLU is also highly responsive at large values of z away from 0, as
opposed to the sigmoid activation function, making it more suitable for
gradient descent. These differences give ReLU a major advantage over the
sigmoid function. Indeed, ReLU is used as the preferred choice of activation
function at hidden layers in most modern ANN models.
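The contrast between the two activation functions can be made concrete with a short sketch of Equations 4.65 and 4.66. The chain depth of 10 and the pre-activation value z = 2.0 are illustrative assumptions. The code multiplies only the activation derivatives along a chain of active nodes and ignores the weights.

```python
import math

def relu(z):
    """Equation 4.65."""
    return z if z > 0 else 0.0

def relu_grad(z):
    """Equation 4.66, using the convention that the gradient is 0 at z = 0."""
    return 1.0 if z > 0 else 0.0

def sigmoid_grad(z):
    a = 1.0 / (1.0 + math.exp(-z))
    return a * (1.0 - a)

# Product of activation derivatives along a hypothetical chain of 10 active
# nodes, each with the same pre-activation value z = 2.0.
depth, z = 10, 2.0
relu_chain = relu_grad(z) ** depth
sigmoid_chain = sigmoid_grad(z) ** depth
print(relu_chain, sigmoid_chain)
```

The ReLU product stays at exactly 1 for any depth as long as every node is active, whereas the sigmoid product shrinks geometrically with depth.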
4.8.3 Regularization
A major challenge in learning deep neural networks is the high model
complexity of ANN models, which grows with the addition of hidden layers
in the network. This can become a serious concern, especially when the
training set is small, due to the phenomenon of model overfitting. To
overcome this challenge, it is important to use techniques that can help in
reducing the complexity of the learned model, known as regularization
techniques. Classical approaches for learning ANN models did not have an
effective way to promote regularization of the learned model parameters.
Hence, they had often been sidelined by other classification methods, such as
support vector machines (SVM), which have in-built regularization
mechanisms. (SVMs will be discussed in more detail in Section 4.9).
One of the major advancements in deep learning has been the development of
novel regularization techniques for ANN models that are able to offer
significant improvements in generalization performance. In the following, we
discuss one of the regularization techniques for ANN, known as the dropout
method, that has gained a lot of attention in several applications.
The main objective of dropout is to avoid the learning of spurious features at
hidden nodes, occurring due to model overfitting. It uses the basic intuition
that spurious features often “co-adapt” themselves such that they show good
training performance only when used in highly selective combinations. On
the other hand, relevant features can be used in a diversity of feature
combinations and hence are quite resilient to the removal or modification of
other features. The dropout method uses this intuition to break complex “co-adaptations” in the learned features by randomly dropping input and hidden
nodes in the network during training.
Dropout belongs to a family of regularization techniques that use the criterion
of resilience to random perturbations as a measure of the robustness (and
hence, simplicity) of a model. For example, one approach to regularization is
to inject noise in the input attributes of the training set and learn a model with
the noisy training instances. If a feature learned from the training data is
indeed generalizable, it should not be affected by the addition of noise.
Dropout can be viewed as a similar regularization approach that perturbs the
information content of the training set not only at the level of attributes but
also at multiple levels of abstractions, by dropping input and hidden nodes.
The dropout method draws inspiration from the biological process of gene
swapping in sexual reproduction, where half of the genes from both parents
are combined together to create the genes of the offspring. This favors the
selection of parent genes that are not only useful but can also inter-mingle
with diverse combinations of genes coming from the other parent. On the
other hand, co-adapted genes that function only in highly selective
combinations are soon eliminated in the process of evolution. This idea is
used in the dropout method for eliminating spurious co-adapted features. A
simplified description of the dropout method is provided in the rest of this section.
Figure 4.30.
Examples of sub-networks generated in the dropout method using
Let (wk, bk) represent the model parameters of the ANN model at the kth
iteration of the gradient descent method. At every iteration, we randomly
select a fraction γ of input and hidden nodes to be dropped from the network,
where γ∈(0, 1) is a hyper-parameter that is typically chosen to be 0.5. The
weighted links and bias terms involving the dropped nodes are then
eliminated, resulting in a “thinned” sub-network of smaller size. The model
parameters of the sub-network (wsk, bsk) are then updated by computing
activation values and performing backpropagation on this smaller sub-network. These updated values are then added back in the original network to
obtain the updated model parameters, (wk+1, bk+1), to be used in the next iteration.
Figure 4.30 shows some examples of sub-networks that can be generated at
different iterations of the dropout method, by randomly dropping input and
hidden nodes. Since every sub-network has a different architecture, it is
difficult to learn complex co-adaptations in the features that can result in
overfitting. Instead, the features at the hidden nodes are learned to be more
agile to random modifications in the network structure, thus improving their
generalization ability. The model parameters are updated using a different
random sub-network at every iteration, till the gradient descent method converges.
Let (wkmax, bkmax) denote the model parameters at the last iteration kmax
of the gradient descent method. These parameters are finally scaled down by
a factor of (1−γ), to produce the weights and bias terms of the final ANN
model, as follows:
\[ (w^*, b^*) = \bigl( (1-\gamma) \times w^{k_{max}}, \; (1-\gamma) \times b^{k_{max}} \bigr) \]
We can now use the complete neural network with model parameters
(w*, b*) for testing. The dropout method has been shown to provide
significant improvements in the generalization performance of ANN models
in a number of applications. It is computationally cheap and can be applied in
combination with any of the other deep learning techniques. It also has a
number of similarities with a widely-used ensemble learning method known
as bagging, which learns multiple models using random subsets of the
training set, and then uses the average output of all the models to make
predictions. (Bagging will be presented in more detail later in Section 4.10.4).
In a similar vein, it can be shown that the predictions of the final network
learned using dropout approximate the average output of all possible 2^n sub-networks that can be formed using n nodes. This is one of the reasons behind
the superior regularization abilities of dropout.
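The link between the (1−γ) scaling and the average over sub-networks can be illustrated in a setting where it holds exactly. The sketch below assumes a single linear node without a bias term and γ = 0.5, so that every one of the 2^n on/off patterns of the n input nodes is equally likely. The weights and inputs are hypothetical. For networks with nonlinear hidden nodes the relationship is only approximate, as stated above.

```python
from itertools import product

def thinned_output(w, x, mask):
    """Output of a linear node when inputs with mask value 0 are dropped."""
    return sum(wi * xi * mi for wi, xi, mi in zip(w, x, mask))

def average_over_subnetworks(w, x):
    """Average output over all 2^n sub-networks formed from n input nodes."""
    outputs = [thinned_output(w, x, mask)
               for mask in product((0, 1), repeat=len(w))]
    return sum(outputs) / len(outputs)

def scaled_output(w, x, gamma):
    """Output of the full node after scaling its weights by (1 - gamma)."""
    return sum((1.0 - gamma) * wi * xi for wi, xi in zip(w, x))

w = [0.4, -1.2, 2.0]   # hypothetical weights
x = [1.0, 0.5, -3.0]   # hypothetical input
avg = average_over_subnetworks(w, x)
scaled = scaled_output(w, x, gamma=0.5)
print(avg, scaled)
```

Both quantities agree up to floating-point rounding, which is why a single scaled network can stand in for the ensemble of thinned networks at test time in this simplified case.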
4.8.4 Initialization of Model Parameters
Because of the non-convex nature of the loss function used by ANN models,
it is possible to get stuck in locally optimal but globally inferior solutions.
Hence, the initial choice of model parameter values plays a significant role in
the learning of ANN by gradient descent. The impact of poor initialization is
even more aggravated when the model is complex, the network architecture is
deep, or the classification task is difficult. In such cases, it is often advisable
to first learn a simpler model for the problem, e.g., using a single hidden
layer, and then incrementally increase the complexity of the model, e.g., by
adding more hidden layers. An alternate approach is to train the model for a
simpler task and then use the learned model parameters as initial parameter
choices in the learning of the original task. The process of initializing ANN
model parameters before the actual training process is known as pretraining.
Pretraining helps in initializing the model to a suitable region in the
parameter space that would otherwise be inaccessible by random
initialization. Pretraining also reduces the variance in the model parameters
by fixing the starting point of gradient descent, thus reducing the chances of
overfitting due to multiple comparisons. The models learned by pretraining
are thus more consistent and provide better generalization performance.
Supervised Pretraining
A common approach for pretraining is to incrementally train the ANN model
in a layer-wise manner, by adding one hidden layer at a time. This approach,
known as supervised pretraining, ensures that the parameters learned at
every layer are obtained by solving a simpler problem, rather than learning all
model parameters together. These parameter values thus provide a good
choice for initializing the ANN model. The approach for supervised
pretraining can be briefly described as follows.
We start the supervised pretraining process by considering a reduced ANN
model with only a single hidden layer. By applying gradient descent on this
simple model, we are able to learn the model parameters of the first hidden
layer. At the next run, we add another hidden layer to the model and apply
gradient descent to learn the parameters of the newly added hidden layer,
while keeping the parameters of the first layer fixed. This procedure is
recursively applied such that while learning the parameters of the lth hidden
layer, we consider a reduced model with only l hidden layers, whose first (l
−1) hidden layers are not updated on the lth run but are instead fixed using
pretrained values from previous runs. In this way, we are able to learn the
model parameters of all (L−1) hidden layers. These pretrained values are
used to initialize the hidden layers of the final ANN model, which is fine-tuned by applying a final round of gradient descent over all the layers.
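The layer-wise procedure can be sketched on a toy model. The code below assumes a chain with one sigmoid node per hidden layer, a temporary sigmoid output node that is re-learned at every run, and a squared loss. To keep the sketch short, gradients are estimated with central finite differences rather than backpropagation, and the final fine-tuning round is omitted. All names, the data set, and the hyper-parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(hidden, out, x):
    """hidden: list of (w, b) pairs, one per hidden layer; out: (w, b) of the output node."""
    a = x
    for w, b in hidden:
        a = sigmoid(w * a + b)
    return sigmoid(out[0] * a + out[1])

def loss(hidden, out, data):
    return sum((predict(hidden, out, x) - y) ** 2 for x, y in data)

def train_last_layer(frozen, data, steps=200, lr=0.5, eps=1e-5):
    """Learn one new hidden node and a temporary output node by gradient
    descent while the `frozen` hidden layers stay fixed."""
    params = [0.5, 0.0, 0.5, 0.0]  # new hidden (w, b), then output (w, b)

    def f(p):
        return loss(frozen + [(p[0], p[1])], (p[2], p[3]), data)

    for _ in range(steps):
        grad = []
        for i in range(4):
            hi = params[:]
            hi[i] += eps
            lo = params[:]
            lo[i] -= eps
            grad.append((f(hi) - f(lo)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
    return (params[0], params[1]), (params[2], params[3])

def supervised_pretrain(data, n_hidden):
    hidden, out = [], None
    for _ in range(n_hidden):
        new_layer, out = train_last_layer(hidden, data)
        hidden = hidden + [new_layer]  # earlier layers are never modified
    return hidden, out

data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
hidden, out = supervised_pretrain(data, n_hidden=3)
print(hidden, loss(hidden, out, data))
```

The returned `hidden` list would then serve as the initial values of the hidden layers before fine-tuning all layers together.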
Unsupervised Pretraining
Supervised pretraining provides a powerful way to initialize model
parameters, by gradually growing the model complexity from shallower to
deeper networks. However, supervised pretraining requires a sufficient
number of labeled training instances for effective initialization of the ANN
model. An alternate pretraining approach is unsupervised pretraining,
which initializes model parameters by using unlabeled instances that are often
abundantly available. The basic idea of unsupervised pretraining is to
initialize the ANN model in such a way that the learned features capture the
latent structure in the unlabeled data.
Figure 4.31.
The basic architecture of a single-layer autoencoder.
Unsupervised pretraining relies on the assumption that learning the
distribution of the input data can indirectly help in learning the classification
model. It is most helpful when the number of labeled examples is small and
the features for the supervised problem bear resemblance to the factors
generating the input data. Unsupervised pretraining can be viewed as a
different form of regularization, where the focus is not explicitly toward
finding simpler features but instead toward finding features that can best
explain the input data. Historically, unsupervised pretraining has played an
important role in reviving the area of deep learning, by making it possible to
train any generic deep neural network without requiring specialized
Use of Autoencoders
One simple and commonly used approach for unsupervised pretraining is to
use an unsupervised ANN model known as an autoencoder. The basic
architecture of an autoencoder is shown in Figure 4.31. An autoencoder
attempts to learn a reconstruction of the input data by mapping the attributes
x to latent features c, and then re-projecting c back to the original attribute
space to create the reconstruction x^. The latent features are represented
using a hidden layer of nodes, while the input and output layers represent the
attributes and contain the same number of nodes. During training, the goal is
to learn an autoencoder model that provides the lowest reconstruction error,
RE(x, x^), on all input data instances. A typical choice of the reconstruction
error is the squared loss function:
\[ RE(x, \hat{x}) = \lVert x - \hat{x} \rVert^2. \]
The model parameters of the autoencoder can be learned by using a similar
gradient descent method as the one used for learning supervised ANN models
for classification. The key difference is the use of the reconstruction error on
all training instances as the training loss. Autoencoders that have multiple
hidden layers are known as stacked autoencoders.
Autoencoders are able to capture complex representations of the input data by
the use of hidden nodes. However, if the number of hidden nodes is large, it
is possible for an autoencoder to learn the identity relationship, where the
input x is just copied and returned as the output x^, resulting in a trivial
solution. For example, if we use as many hidden nodes as the number of
attributes, then it is possible for every hidden node to copy an attribute and
simply pass it along to an output node, without extracting any useful
information. To avoid this problem, it is common practice to keep the number
of hidden nodes smaller than the number of input attributes. This forces the
autoencoder to learn a compact and useful encoding of the input data, similar
to a dimensionality reduction technique. An alternate approach is to corrupt
the input instances by adding random noise, and then learn the autoencoder to
reconstruct the original instance from the noisy input. This approach is
known as the denoising autoencoder, which offers strong regularization
capabilities and is often used to learn complex features even in the presence
of a large number of hidden nodes.
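The trivial identity solution and the effect of a bottleneck can be seen in a small sketch. It assumes a linear autoencoder without bias terms, with hand-picked rather than learned weights. The helper names and the numbers are hypothetical and serve only to evaluate the squared reconstruction error RE(x, x^) in the two cases.

```python
def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def reconstruct(W_enc, W_dec, x):
    c = matvec(W_enc, x)       # latent features c
    return matvec(W_dec, c)    # reconstruction x_hat

def reconstruction_error(x, x_hat):
    """Squared loss between an instance and its reconstruction."""
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))

x = [2.0, -1.0, 0.5]

# As many hidden nodes as attributes: each hidden node copies one attribute.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
re_identity = reconstruction_error(x, reconstruct(identity, identity, x))

# One hidden node: the encoder averages the attributes and the decoder
# copies that single latent value to every output node.
W_enc = [[1 / 3, 1 / 3, 1 / 3]]
W_dec = [[1.0], [1.0], [1.0]]
re_bottleneck = reconstruction_error(x, reconstruct(W_enc, W_dec, x))
print(re_identity, re_bottleneck)
```

The identity weights give zero reconstruction error without extracting any structure, while the single hidden node is forced to summarize the instance and therefore incurs a non-zero error.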
To use an autoencoder for unsupervised pretraining, we can follow a layer-wise
approach similar to supervised pretraining. In particular, to pretrain the
model parameters of the lth hidden layer, we can construct a reduced ANN
model with only l hidden layers and an output layer that contains the same
number of nodes as the attributes and is used for reconstruction. The
parameters of the lth hidden layer of this network are then learned using a
gradient descent method to minimize the reconstruction error. The use of
unlabeled data can be viewed as providing hints to the learning of parameters
at every layer that aid in generalization. The final model parameters of the
ANN model are then learned by applying gradient descent over all the layers,
using the initial values of parameters obtained from pretraining.
Hybrid Pretraining
Unsupervised pretraining can also be combined with supervised pretraining
by using two output layers at every run of pretraining, one for reconstruction
and the other for supervised classification. The parameters of the lth hidden
layer are then learned by jointly minimizing the losses on both output layers,
usually weighted by a trade-off hyper-parameter α. Such a combined
approach often shows better generalization performance than either of the
approaches, since it provides a way to balance between the competing
objectives of representing the input data and improving classification performance.
4.8.5 Characteristics of Deep Learning
Apart from the basic characteristics of ANN discussed in Section 4.7.3, the
use of deep learning techniques provides the following additional characteristics:
1. An ANN model trained for some task can be easily re-used for a
different task that involves the same attributes, by using pretraining
strategies. For example, we can use the learned parameters of the
original task as initial parameter choices for the target task. In this way,
ANN promotes re-usability of learning, which can be quite useful when
the target application has a smaller number of labeled training instances.
2. Deep learning techniques for regularization, such as the dropout method,
help in reducing the model complexity of ANN and thus promoting
good generalization performance. The use of regularization techniques is
especially useful in high-dimensional settings, where the number of
training labels is small but the classification problem is inherently
3. The use of an autoencoder for pretraining can help eliminate irrelevant
attributes that are not related to other attributes. Further, it can help
reduce the impact of redundant attributes by representing them as copies
of the same attribute.
4. Although the learning of an ANN model can succumb to finding inferior
and locally optimal solutions, there are a number of deep learning
techniques that have been proposed to ensure adequate learning of an
ANN. Apart from the methods discussed in this section, some other
techniques involve novel architecture designs such as skip connections
between the output layer and lower layers, which aid the easy flow of
gradients during backpropagation.
5. A number of specialized ANN architectures have been designed to
handle a variety of input data sets. Some examples include
convolutional neural networks (CNN) for two-dimensional gridded
objects such as images, and recurrent neural networks (RNN) for
sequences. While CNNs have been extensively used in the area of
computer vision, RNNs have found applications in processing speech
and language.
4.9 Support Vector Machine (SVM)
A support vector machine (SVM) is a discriminative classification model that
learns linear or nonlinear decision boundaries in the attribute space to
separate the classes. Apart from maximizing the separability of the two
classes, SVM offers strong regularization capabilities, i.e., it is able to control
the complexity of the model in order to ensure good generalization
performance. Due to its unique ability to innately regularize its learning,
SVM is able to learn highly expressive models without suffering from
overfitting. It has thus received considerable attention in the machine learning
community and is commonly used in several practical applications, ranging
from handwritten digit recognition to text categorization. SVM has strong
roots in statistical learning theory and is based on the principle of structural
risk minimization. Another unique aspect of SVM is that it represents the
decision boundary using only a subset of the training examples that are most
difficult to classify, known as the support vectors. Hence, it is a
discriminative model that is impacted only by training instances near the
boundary of the two classes, in contrast to learning the generative distribution
of every class.
To illustrate the basic idea behind SVM, we first introduce the concept of the
margin of a separating hyperplane and the rationale for choosing such a
hyperplane with maximum margin. We then describe how a linear SVM can
be trained to explicitly look for this type of hyperplane. We conclude by
showing how the SVM methodology can be extended to learn nonlinear
decision boundaries by using kernel functions.
4.9.1 Margin of a Separating Hyperplane
The generic equation of a separating hyperplane can be written as
\[ w^T x + b = 0, \]
where x represents the attributes and (w, b) represent the parameters of the
hyperplane. A data instance xi can belong to either side of the hyperplane
depending on the sign of (wTxi+b). For the purpose of binary classification,
we are interested in finding a hyperplane that places instances of both classes
on opposite sides of the hyperplane, thus resulting in a separation of the two
classes. If there exists a hyperplane that can perfectly separate the classes in
the data set, we say that the data set is linearly separable. Figure 4.32 shows
an example of linearly separable data involving two classes, squares and
circles. Note that there can be infinitely many hyperplanes that can separate
the classes, two of which are shown in Figure 4.32 as lines B1 and B2. Even
though every such hyperplane will have zero training error, they can provide
different results on previously unseen instances. Which separating hyperplane
should we thus finally choose to obtain the best generalization performance?
Ideally, we would like to choose a simple hyperplane that is robust to small
perturbations. This can be achieved by using the concept of the margin of a
separating hyperplane, which can be briefly described as follows.
Figure 4.32.
Margin of a hyperplane in a two-dimensional data set.
For every separating hyperplane Bi, let us associate a pair of parallel
hyperplanes, bi1 and bi2, such that they touch the closest instances of both
classes, respectively. For example, if we move B1 parallel to its direction, we
can touch the first square using b11 and the first circle using b12. bi1 and bi2
are known as the margin hyperplanes of Bi and the distance between them
is known as the margin of the separating hyperplane Bi. From the diagram
shown in Figure 4.32, notice that the margin for B1 is considerably larger
than that for B2. In this example, B1 turns out to be the separating hyperplane
with the maximum margin, known as the maximum margin hyperplane.
Rationale for Maximum Margin
Hyperplanes with large margins tend to have better generalization
performance than those with small margins. Intuitively, if the margin is
small, then any slight perturbation in the hyperplane or the training instances
located at the boundary can have quite an impact on the classification
performance. Small margin hyperplanes are thus more susceptible to
overfitting, as they are barely able to separate the classes, leaving very little
room for perturbations. On the other hand, a hyperplane that is farther
away from training instances of both classes has sufficient leeway to be
robust to minor modifications in the data, and thus shows superior
generalization performance.
The idea of choosing the maximum margin separating hyperplane also has
strong foundations in statistical learning theory. It can be shown that the
margin of such a hyperplane is inversely related to the VC-dimension of the
classifier, which is a commonly used measure of the complexity of a model.
As discussed in Section 3.4 of the last chapter, a simpler model should be
preferred over a more complex model if they both show similar training
performance. Hence, maximizing the margin results in the selection of a
separating hyperplane with the lowest model complexity, which is expected
to show better generalization performance.
4.9.2 Linear SVM
A linear SVM is a classifier that searches for a separating hyperplane with the
largest margin, which is why it is often known as a maximal margin
classifier. The basic idea of SVM can be described as follows.
Consider a binary classification problem consisting of n training instances,
where every training instance xi is associated with a binary label yi∈{−1, 1}.
Let wTx+b=0 be the equation of a separating hyperplane that separates the
two classes by placing them on opposite sides. This means that
\[ w^T x_i + b > 0 \ \text{ if } y_i = 1, \qquad w^T x_i + b < 0 \ \text{ if } y_i = -1. \]
The distance of any point x from the hyperplane is then given by
\[ D(x) = \frac{|w^T x + b|}{\lVert w \rVert}, \]
where |⋅| denotes the absolute value and ǁ ⋅ ǁ denotes the length of a vector.
Let the distance of the closest point from the hyperplane with y=1 be k+>0.
Similarly, let k−>0 denote the distance of the closest point from class −1.
This can be represented using the following constraints:
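\[ \frac{w^T x_i + b}{\lVert w \rVert} \ge k_+ \ \text{ if } y_i = 1, \qquad \frac{w^T x_i + b}{\lVert w \rVert} \le -k_- \ \text{ if } y_i = -1. \tag{4.69} \]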
The previous equations can be succinctly represented by using the product of
yi and (wTxi+b) as
\[ y_i(w^T x_i + b) \ge M \lVert w \rVert, \tag{4.70} \]
where M is a parameter related to the margin of the hyperplane, i.e., if k+ = k− = M,
then margin = k+ + k− = 2M. In order to find the maximum margin
hyperplane that adheres to the previous constraints, we can consider the
following optimization problem:
\[ \max_{w,\,b} \; M \quad \text{subject to} \quad y_i(w^T x_i + b) \ge M \lVert w \rVert. \tag{4.71} \]
To find the solution to the previous problem, note that if w and b satisfy the
constraints of the previous problem, then any scaled version of w and b would
satisfy them too. Hence, we can conveniently choose ǁwǁ=1/M to simplify the
right-hand side of the inequalities. Furthermore, maximizing M amounts to
minimizing ǁwǁ^2. Hence, the optimization problem of SVM is commonly
represented in the following form:
\[ \min_{w,\,b} \; \frac{\lVert w \rVert^2}{2} \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1. \tag{4.72} \]
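The rescaling argument can be checked on a small example. The sketch below uses a hypothetical two-dimensional data set and a separating hyperplane that lies midway between the two classes, so that k+ = k−. It computes the margin from the distance formula D(x) and then rescales (w, b) so that the smallest value of yi(wTxi+b) equals 1, at which point the margin equals 2/ǁwǁ. The function names are illustrative, and the rescaling assumes that the hyperplane separates the data.

```python
import math

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

def f(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin(w, b, X, y):
    """Sum of the distances k+ and k- from the hyperplane to the closest
    instance of each class."""
    k_pos = min(abs(f(w, b, x)) / norm(w) for x, yi in zip(X, y) if yi == 1)
    k_neg = min(abs(f(w, b, x)) / norm(w) for x, yi in zip(X, y) if yi == -1)
    return k_pos + k_neg

def canonical_scale(w, b, X, y):
    """Rescale (w, b) so that the smallest value of y_i (w^T x_i + b) is 1."""
    s = min(yi * f(w, b, x) for x, yi in zip(X, y))
    return [wi / s for wi in w], b / s

# Hypothetical linearly separable data and a separating hyperplane x1 + x2 = 2.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
w, b = [1.0, 1.0], -2.0

m = margin(w, b, X, y)
w_c, b_c = canonical_scale(w, b, X, y)
print(m, 2.0 / norm(w_c))
```

Scaling does not move the hyperplane, so the margin is unchanged. It only fixes the representation so that maximizing the margin becomes the same as minimizing ǁwǁ.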
Learning Model Parameters
Equation 4.72 represents a constrained optimization problem with linear
inequalities. Since the objective function is convex and quadratic with respect
to w, it is known as a quadratic programming problem (QPP), which can be
solved using standard optimization techniques, as described in Appendix E.
In the following, we present a brief sketch of the main ideas for learning the
model parameters of SVM.
First, we rewrite the objective function in a form that takes into account the
constraints imposed on its solutions. The new objective function is known as
the Lagrangian primal problem, which can be represented as follows,
\[ L_P = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \lambda_i \bigl( y_i(w^T x_i + b) - 1 \bigr), \tag{4.73} \]
where the parameters λi≥0 correspond to the constraints and are called the
Lagrange multipliers. Next, to minimize the Lagrangian, we take the
derivative of LP with respect to w and b and set them equal to zero:
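\[ \frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \lambda_i y_i x_i, \tag{4.74} \]
\[ \frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \lambda_i y_i = 0. \tag{4.75} \]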
Note that using Equation 4.74, we can represent w completely in terms of the
Lagrange multipliers. There is another relationship between (w, b) and λi that
is derived from the Karush-Kuhn-Tucker (KKT) conditions, a commonly
used technique for solving QPP. This relationship can be described as
\[ \lambda_i \bigl[ y_i(w^T x_i + b) - 1 \bigr] = 0. \tag{4.76} \]
Equation 4.76 is known as the complementary slackness condition, which
sheds light on a valuable property of SVM. It states that the Lagrange
multiplier λi is strictly greater than 0 only when xi satisfies the equation
yi(wTxi+b)=1, which means that xi lies exactly on a margin hyperplane.
However, if xi is farther away from the margin hyperplanes such that
yi(wTxi+b)>1, then λi is necessarily 0. Hence, λi>0 for only a small number
of instances that are closest to the separating hyperplane, which are known as
support vectors. Figure 4.33 shows the support vectors of a hyperplane as
filled circles and squares. Further, if we look at Equation 4.74, we will
observe that training instances with λi=0 do not contribute to the weight
parameter w. This suggests that w can be concisely represented only in terms
of the support vectors in the training data, which are far fewer than the
overall number of training instances. This ability to represent the decision
function only in terms of the support vectors is what gives this classifier the
name support vector machines.
Figure 4.33.
Support vectors of a hyperplane shown as filled circles and squares.
Using equations 4.74, 4.75, and 4.76 in Equation 4.73, we obtain the
following optimization problem in terms of the Lagrange multipliers λi:
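\[ \max_{\lambda_i} \; \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j x_i^T x_j \quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \ \lambda_i \ge 0. \tag{4.77} \]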
The previous optimization problem is called the dual optimization problem.
Maximizing the dual problem with respect to λi is equivalent to minimizing
the primal problem with respect to w and b.
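The substitution that leads from the Lagrangian primal to the dual objective can be written out step by step. The following is a sketch of that algebra using Equations 4.74 and 4.75. It expands Equation 4.73, replaces w by the sum over the Lagrange multipliers, and drops the term multiplied by b.

```latex
\begin{align*}
L_P &= \tfrac{1}{2} w^T w
       - \sum_{i=1}^{n} \lambda_i y_i \, w^T x_i
       - b \sum_{i=1}^{n} \lambda_i y_i
       + \sum_{i=1}^{n} \lambda_i \\
    &= \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j x_i^T x_j
       - \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j x_i^T x_j
       - b \cdot 0
       + \sum_{i=1}^{n} \lambda_i \\
    &= \sum_{i=1}^{n} \lambda_i
       - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j x_i^T x_j .
\end{align*}
```

The second line uses w = ∑ λj yj xj in both the quadratic term and the term involving wTxi, and uses ∑ λi yi = 0 to remove b. The result depends on the training instances only through the inner products xiTxj.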
The key differences between the dual and primal problems are as follows:
1. Solving the dual problem helps us identify the support vectors in the
data that have non-zero values of λi. Further, the solution of the dual
problem is influenced only by the support vectors that are closest to the
decision boundary of SVM. This helps in summarizing the learning of
SVM solely in terms of its support vectors, which are easier to manage
computationally. Further, it represents a unique ability of SVM to be
dependent only on the instances closest to the boundary, which are
harder to classify, rather than the distribution of instances farther away
from the boundary.
2. The objective of the dual problem involves only terms of the form xiTxj,
which are basically inner products in the attribute space. As we will see
later in Section 4.9.4, this property will prove to be quite useful in
learning nonlinear decision boundaries using SVM.
Because of these differences, it is useful to solve the dual optimization
problem using any of the standard solvers for QPP. Having found an optimal
solution for λi, we can use Equation 4.74 to solve for w. We can then use
Equation 4.76 on the support vectors to solve for b as follows:
\[ b = \frac{1}{n_S} \sum_{i \in S} \frac{1 - y_i w^T x_i}{y_i}, \tag{4.78} \]
where S represents the set of support vectors (S={i|λi>0}) and nS is the
number of support vectors. The maximum margin hyperplane can then be
expressed as
\[ f(x) = \Bigl( \sum_{i=1}^{n} \lambda_i y_i x_i^T x \Bigr) + b = 0. \tag{4.79} \]
Using this separating hyperplane, a test instance x can be assigned a class
label using the sign of f(x).
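The steps from the Lagrange multipliers to a classifier can be sketched as follows. The data set and the multipliers are hypothetical: the values of λi were chosen by hand so that they satisfy Equations 4.75 and 4.76 for this data, and no quadratic programming solver is run. The code applies Equation 4.74 to obtain w, Equation 4.78 to obtain b, and the sign of f(x) from Equation 4.79 to label a test instance.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical training set and Lagrange multipliers (not from the text).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
lam = [0.25, 0.0, 0.25, 0.0]

# Equation 4.74: w is a weighted sum of the training instances.
w = [sum(l * yi * x[d] for l, yi, x in zip(lam, y, X)) for d in range(2)]

# Equation 4.78: average the bias implied by each support vector.
S = [i for i, l in enumerate(lam) if l > 0]
b = sum((1 - y[i] * dot(w, X[i])) / y[i] for i in S) / len(S)

def f(x):
    """Equation 4.79, written in terms of inner products with training instances."""
    return sum(l * yi * dot(xi, x) for l, yi, xi in zip(lam, y, X)) + b

def classify(x):
    return 1 if f(x) > 0 else -1

print(w, b, classify((4.0, 4.0)), classify((0.0, 1.0)))
```

Only the instances in S contribute to w and f(x), so the two instances with λi = 0 could be removed from the training set without changing the classifier.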
Example 4.7.
Consider the two-dimensional data set shown in Figure 4.34, which contains
eight training instances. Using quadratic programming, we can solve the
optimization problem stated in Equation 4.77 to obtain the Lagrange
multiplier λi for each training instance. The Lagrange multipliers are depicted
in the last column of the table. Notice that only the first two instances have
non-zero Lagrange multipliers. These instances correspond to the support
vectors for this data set.
Let w=(w1, w2) and b denote the parameters of the decision boundary. Using
Equation 4.74, we can solve for w1 and w2 in the following way:
Figure 4.34.
Example of a linearly separable data set.
The bias term b can be computed using Equation 4.76 for each support vector:
Averaging these values, we obtain b=7.93. The decision boundary
corresponding to these parameters is shown in Figure 4.34.
4.9.3 Soft-margin SVM
Figure 4.35 shows a data set that is similar to Figure 4.32, except it has two
new examples, P and Q. Although the decision boundary B1 misclassifies the
new examples, while B2 classifies them correctly, this does not mean that B2
is a better decision boundary than B1 because the new examples may
correspond to noise in the training data. B1 should still be preferred over B2
because it has a wider margin, and thus, is less susceptible to overfitting.
However, the SVM formulation presented in the previous section only
constructs decision boundaries that are mistake-free.
Figure 4.35.
Decision boundary of SVM for the non-separable case.
This section examines how the formulation of SVM can be modified to learn
a separating hyperplane that is tolerant of a small number of training errors
using a method known as the soft-margin approach. More importantly, the
method presented in this section allows SVM to learn linear hyperplanes even
in situations where the classes are not linearly separable. To do this, the
learning algorithm in SVM must consider the trade-off between the width of
the margin and the number of training errors committed by the linear
To introduce the concept of training errors in the SVM formulation, let us
relax the inequality constraints to accommodate for some violations on a
small number of training instances. This can be done by introducing a slack
variable ξi≥0 for every training instance xi as follows:
\[ y_i(w^T x_i + b) \ge 1 - \xi_i. \tag{4.80} \]
The variable ξi allows for some slack in the inequalities of the SVM such that
every instance xi does not need to strictly satisfy yi(wTxi+b)≥1. Further, ξi is
non-zero only if the margin hyperplanes are not able to place xi on the same
side as the rest of the instances belonging to yi. To illustrate this, Figure 4.36
shows a circle P that falls on the opposite side of the separating hyperplane as
the rest of the circles, and thus satisfies wTx+b=−1+ξ. The distance between
P and the margin hyperplane wTx+b=−1 is equal to ξ/ǁ w ǁ. Hence, ξi
provides a measure of the error of SVM in representing xi using soft
inequality constraints.
Figure 4.36.
Slack variables used in soft-margin SVM.
In the presence of slack variables, it is important to learn a separating
hyperplane that jointly maximizes the margin (ensuring good generalization
performance) and minimizes the values of slack variables (ensuring low
training error). This can be achieved by modifying the optimization problem
of SVM as follows:
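\[ \min_{w,\,b,\,\xi_i} \; \frac{\lVert w \rVert^2}{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \tag{4.81} \]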
where C is a hyper-parameter that makes a trade-off between maximizing the
margin and minimizing the training error. A large value of C pays more
emphasis on minimizing the training error than maximizing the margin.
Notice the similarity of the previous equation with the generic formula of
generalization error rate introduced in Section 3.4 of the previous chapter.
Indeed, SVM provides a natural way to balance between model complexity
and training error in order to maximize generalization performance.
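The role of C in Equation 4.81 can be checked numerically. The sketch below is illustrative only: the toy instances, the hyperplane (w, b), and the function names slack_variables and soft_margin_objective are assumptions made here, not part of the original formulation. It relies on the fact that, for a fixed hyperplane, the smallest slack satisfying both constraints is max(0, 1 − yi(wTxi+b)).

```python
def slack_variables(X, y, w, b):
    """Smallest xi_i >= 0 satisfying y_i (w.x_i + b) >= 1 - xi_i."""
    slacks = []
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return slacks


def soft_margin_objective(X, y, w, b, C):
    """Value of ||w||^2 / 2 + C * sum(xi_i), the objective of Equation 4.81."""
    norm_sq = sum(wj * wj for wj in w)
    return 0.5 * norm_sq + C * sum(slack_variables(X, y, w, b))


# Hypothetical toy data: the last instance lies on the wrong side.
X = [(2.0, 0.0), (3.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

slacks = slack_variables(X, y, w, b)
small_C = soft_margin_objective(X, y, w, b, C=0.1)
large_C = soft_margin_objective(X, y, w, b, C=10.0)
```

For this hyperplane only the last instance needs slack, so the objective equals 0.5 + 1.5C. With a small C the margin term dominates, whereas with a large C the single violation dominates the cost.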
To solve Equation 4.81 we apply the Lagrange multiplier method and convert
the primal problem to its corresponding dual problem, similar to the approach
described in the previous section. The Lagrangian primal problem of
Equation 4.81 can be written as follows:
\[ L_P = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\lambda_i\bigl(y_i(w^T x_i + b) - 1 + \xi_i\bigr) - \sum_{i=1}^{n}\mu_i\xi_i, \tag{4.82} \]
where λi≥0 and μi≥0 are the Lagrange multipliers corresponding to the
inequality constraints of Equation 4.81. Setting the derivative of LP with
respect to w, b, and ξi equal to 0, we obtain the following equations:
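\[ \frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n}\lambda_i y_i x_i. \tag{4.83} \]
\[ \frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\lambda_i y_i = 0. \tag{4.84} \]
\[ \frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; \lambda_i + \mu_i = C. \tag{4.85} \]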
We can also obtain the complementary slackness conditions by using the
following KKT conditions:
λi(yi(wTxi+b)−1+ξi)=0, (4.86)
μiξi=0. (4.87)
Equation 4.86 suggests that λi is zero for all training instances except those
that reside on the margin hyperplanes wTxi+b=±1, or have ξi>0. These
instances with λi>0 are known as support vectors. On the other hand, μi given
in Equation 4.87 is zero for any training instance that is misclassified, i.e.,
ξi>0. Further, λi and μi are related with each other by Equation 4.85. This
results in the following three configurations of (λi, μi):
1. If λi=0 and μi=C, then xi does not reside on the margin hyperplanes and
is correctly classified on the same side as other instances belonging to
yi.
2. If λi=C and μi=0, then xi is misclassified and has a non-zero slack
variable ξi.
3. If 0<λi<C and 0<μi<C, then xi resides on one of the margin
hyperplanes.
Substituting Equations 4.83 to 4.87 into Equation 4.82, we obtain the
following dual optimization problem:
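\[ \max_{\lambda_i} \; \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j x_i^T x_j \quad \text{subject to} \quad \sum_{i=1}^{n}\lambda_i y_i = 0,\; 0 \le \lambda_i \le C. \tag{4.88} \]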
Notice that the previous problem looks almost identical to the dual problem
of SVM for the linearly separable case (Equation 4.77), except that λi is
required to be not only non-negative but also no larger than a constant value
C. Clearly, as C approaches infinity, the previous optimization problem
becomes equivalent to Equation 4.77, where the learned hyperplane perfectly
separates the classes (with no training errors). However, by capping the
values of λi at C, the learned hyperplane is able to tolerate a few training
errors that have ξi>0.
Figure 4.37.
Hinge loss as a function of yy^.
As before, Equation 4.88 can be solved by using any of the standard solvers
for QPP, and the optimal value of w can be obtained by using Equation 4.83.
To solve for b, we can use Equation 4.86 on the support vectors that reside on
the margin hyperplanes as follows:
\[ b = \frac{1}{n_S}\sum_{i\in S}\frac{1 - y_i w^T x_i}{y_i} \tag{4.89} \]
where S represents the set of support vectors residing on the margin
hyperplanes (S={i|0<λi<C}) and nS is the number of elements in S.
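Equations 4.83 and 4.89 translate directly into code once the multipliers are known. The following sketch assumes that the dual problem has already been solved by some QPP solver. The function name recover_hyperplane, the tolerance tol, and the two-instance toy problem (whose multipliers λ1=λ2=0.25 were worked out by hand for this illustration) are assumptions, not part of the original text.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recover_hyperplane(X, y, lam, C, tol=1e-9):
    """Recover (w, b) from dual multipliers, following Equations 4.83 and 4.89."""
    d = len(X[0])
    # Equation 4.83: w is a weighted sum of the training instances.
    w = [sum(l * yi * xi[j] for l, yi, xi in zip(lam, y, X)) for j in range(d)]
    # S: support vectors on the margin hyperplanes, i.e. 0 < lambda_i < C.
    S = [i for i, l in enumerate(lam) if tol < l < C - tol]
    if not S:
        raise ValueError("no support vector lies strictly inside (0, C)")
    # Equation 4.89: average the value of b implied by each instance in S.
    b = sum((1.0 - y[i] * dot(w, X[i])) / y[i] for i in S) / len(S)
    return w, b, S


# Hypothetical toy problem with one instance per class.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
lam = [0.25, 0.25]  # hand-derived dual solution for this toy problem
w, b, S = recover_hyperplane(X, y, lam, C=1.0)
```

Here both instances lie on the margin hyperplanes, so both belong to S, and the recovered hyperplane is w = (0.5, 0.5) with b = 0.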
SVM as a Regularizer of Hinge Loss
SVM belongs to a broad class of regularization techniques that use a loss
function to represent the training errors and a norm of the model parameters
to represent the model complexity. To realize this, notice that the slack
variable ξ, used for measuring the training errors in SVM, is equivalent to the
hinge loss function, which can be defined as follows:
Loss (y, y^) =max(0, 1−yy^),
where y∈{+1, −1}. In the case of SVM, y^ corresponds to wTx+b. Figure
4.37 shows a plot of the hinge loss function as we vary yy^. We can see that
the hinge loss is equal to 0 as long as y and y^ have the same sign and |y^|≥1.
However, the hinge loss is positive whenever y and y^ are of the
opposite sign or |y^|<1, and it grows linearly as yy^ decreases below 1. This is
similar to the notion of the slack variable ξ,
which is used to measure the distance of a point from its margin hyperplane.
Hence, the optimization problem of SVM can be represented in the following
equivalent form:
\[ \min_{w,\,b} \; \frac{\|w\|^2}{2} + C\sum_{i=1}^{n}\text{Loss}(y_i,\, w^T x_i + b) \tag{4.90} \]
Note that using the hinge loss ensures that the optimization problem is
convex and can be solved using standard optimization techniques. However,
if we use a different loss function, such as the squared loss function that was
introduced in Section 4.7 on ANN, it will result in a different optimization
problem that may or may not remain convex. Nevertheless, different loss
functions can be explored to capture varying notions of training error,
depending on the characteristics of the problem.
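One way to see Equation 4.90 as an ordinary regularized loss is to minimize it directly. The sketch below uses plain batch subgradient descent, which is a substitute chosen here for brevity and not the QPP-based approach described earlier. The learning rate, the number of epochs, the toy data, and the function names are all hypothetical.

```python
def hinge_loss(y, y_hat):
    """Loss(y, y_hat) = max(0, 1 - y * y_hat)."""
    return max(0.0, 1.0 - y * y_hat)


def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize ||w||^2 / 2 + C * sum of hinge losses by subgradient descent."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regularizer ||w||^2 / 2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # hinge loss is active for this instance
                for j in range(d):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical linearly separable toy data.
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
```

Because the objective is convex, this simple procedure settles near the same trade-off that a dedicated solver would find, although a fixed step size makes it oscillate slightly around the optimum.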
Another interesting property of SVM that relates it to a broader class of
regularization techniques is the concept of a margin. Although minimizing
‖w‖² has the geometric interpretation of maximizing the margin of a
separating hyperplane, it is essentially the squared L2 norm of the model
parameters, ‖w‖₂². In general, the Lq norm of w, ‖w‖q, is equal to the
Minkowski distance of order q from w to the origin, i.e.,
\[ \|w\|_q = \Bigl(\sum_{i=1}^{p} |w_i|^q\Bigr)^{1/q}, \]
where p is the number of components of w.
Minimizing the Lq norm of w to achieve lower model complexity is a generic
regularization concept that has several interpretations. For example,
minimizing the L2 norm amounts to finding a solution on a hypersphere of
smallest radius that shows suitable training performance. To visualize this in
two-dimensions, Figure 4.38(a) shows the plot of a circle with constant radius
r, where every point has the same L2 norm. On the other hand, using the L1
norm ensures that the solution lies on the surface of a hypercube with
smallest size, with vertices along the axes. This is illustrated in Figure 4.38(b)
as a square with vertices on the axes at a distance of r from the origin. The L1
norm is commonly used as a regularizer to obtain sparse model parameters
with only a small number of non-zero parameter values, such as the use of
Lasso in regression problems (see Bibliographic Notes).
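As a small numerical illustration of the Lq norm, the following sketch evaluates it for q = 1 and q = 2. The function name lq_norm and the example vectors are assumptions made for illustration.

```python
import math


def lq_norm(w, q):
    """L_q norm of w: (sum_i |w_i|^q)^(1/q), for q >= 1."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(wi) ** q for wi in w) ** (1.0 / q)


r = 1.0
on_axis = (r, 0.0)                                   # a point on an axis
diagonal = (r / math.sqrt(2), r / math.sqrt(2))      # a point on the diagonal

l2_axis, l2_diag = lq_norm(on_axis, 2), lq_norm(diagonal, 2)
l1_axis, l1_diag = lq_norm(on_axis, 1), lq_norm(diagonal, 1)
```

Both points have the same L2 norm r, but the diagonal point has the larger L1 norm. For a fixed L2 norm, the L1 norm is smallest on the axes, where some components are exactly zero, which is one informal way to see why L1 regularization favors sparse parameters.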
Figure 4.38.
Plots showing the behavior of two-dimensional solutions with
constant L2 and L1 norms.
In general, depending on the characteristics of the problem, different
combinations of Lq norms and training loss functions can be used for
learning the model parameters, each requiring a different optimization solver.
This forms the backbone of a wide range of modeling techniques that attempt
to improve the generalization performance by jointly minimizing training
error and model complexity. However, in this section, we focus only on the
squared L2 norm and the hinge loss function, resulting in the classical
formulation of SVM.
4.9.4 Nonlinear SVM
The SVM formulations described in the previous sections construct a linear
decision boundary to separate the training examples into their respective
classes. This section presents a methodology for applying SVM to data sets
that have nonlinear decision boundaries. The basic idea is to transform the
data from its original attribute space in x into a new space φ(x) so that a linear
hyperplane can be used to separate the instances in the transformed space,
using the SVM approach. The learned hyperplane can then be projected back
to the original attribute space, resulting in a nonlinear decision boundary.
Figure 4.39.
Classifying data with a nonlinear decision boundary.
Attribute Transformation
To illustrate how attribute transformation can lead to a linear decision
boundary, Figure 4.39(a) shows an example of a two-dimensional data set
consisting of squares (classified as y=1) and circles (classified as y=−1). The
data set is generated in such a way that all the circles are clustered near the
center of the diagram and all the squares are distributed farther away from the
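center. Instances of the data set can be classified using the following
equation:
\[ y = \begin{cases} 1 & \text{if } \sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} > 0.2, \\ -1 & \text{otherwise.} \end{cases} \tag{4.91} \]
The decision boundary for the data can therefore be written as follows:
\[ \sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} = 0.2, \]
which can be further simplified into the following quadratic equation:
\[ x_1^2 - x_1 + x_2^2 - x_2 = -0.46. \]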
A nonlinear transformation φ is needed to map the data from its original
attribute space into a new space such that a linear hyperplane can separate the
classes. This can be achieved by using the following simple transformation:
\[ \varphi : (x_1, x_2) \rightarrow (x_1^2 - x_1,\; x_2^2 - x_2). \tag{4.92} \]
Figure 4.39(b) shows the points in the transformed space, where we can see
that all the circles are located in the lower left-hand side of the diagram. A
linear hyperplane with parameters w and b can therefore be constructed in the
transformed space, to separate the instances into their respective classes.
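The effect of the transformation in Equation 4.92 can be verified with a short sketch. Since (x1−0.5)² + (x2−0.5)² = (x1²−x1) + (x2²−x2) + 0.5, any circular boundary centered at (0.5, 0.5) becomes a straight line in the transformed space. The threshold on the squared distance is left as a parameter r2, and the value used below, the random points, and the function names are illustrative assumptions.

```python
import random


def phi(x1, x2):
    """Transformation of Equation 4.92."""
    return (x1 * x1 - x1, x2 * x2 - x2)


def label_original(x1, x2, r2):
    """+1 outside the circle of squared radius r2 centered at (0.5, 0.5)."""
    return 1 if (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2 > r2 else -1


def label_transformed(z1, z2, r2):
    """The same rule in the transformed space: a linear test on z1 + z2."""
    return 1 if z1 + z2 > r2 - 0.5 else -1


R2 = 0.04  # illustrative squared radius
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
agree = sum(
    label_original(x1, x2, R2) == label_transformed(*phi(x1, x2), R2)
    for x1, x2 in points
)
```

The count agree should match the number of points (up to floating-point rounding for points lying almost exactly on the boundary), which confirms that a hyperplane with w = (1, 1) separates the classes after the transformation.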
One may think that because the nonlinear transformation possibly increases
the dimensionality of the input space, this approach can suffer from the curse
of dimensionality that is often associated with high-dimensional data.
However, as we will see in the following section, nonlinear SVM is able to
avoid this problem by using kernel functions.
Learning a Nonlinear SVM Model
Using a suitable function, φ(⋅), we can transform any data instance x to φ(x).
(The details on how to choose φ(⋅) will become clear later.) The linear
hyperplane in the transformed space can be expressed as wTφ(x)+b=0. To
learn the optimal separating hyperplane, we can substitute φ(x) for x in the
formulation of SVM to obtain the following optimization problem:
\[ \min_{w,\,b,\,\xi_i} \; \frac{\|w\|^2}{2} + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i(w^T\varphi(x_i) + b) \ge 1 - \xi_i,\; \xi_i \ge 0. \tag{4.93} \]
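Using Lagrange multipliers λi, this can be converted into a dual optimization
problem:
\[ \max_{\lambda_i} \; \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j \langle \varphi(x_i), \varphi(x_j) \rangle \quad \text{subject to} \quad \sum_{i=1}^{n}\lambda_i y_i = 0,\; 0 \le \lambda_i \le C, \tag{4.94} \]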
where 〈 a, b 〉 denotes the inner product between vectors a and b. Also, the
equation of the hyperplane in the transformed space can be represented using
λi as follows:
∑i=1nλiyi〈φ(xi), φ(x) 〉+b=0. (4.95)
Further, b is given by
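\[ b = \frac{1}{n_S}\Bigl(\sum_{i\in S}\frac{1}{y_i} - \sum_{i\in S}\frac{\sum_{j=1}^{n}\lambda_j y_i y_j \langle \varphi(x_i), \varphi(x_j) \rangle}{y_i}\Bigr), \tag{4.96} \]
where S={i|0<λi<C} is the set of support vectors residing on the margin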
hyperplanes and nS is the number of elements in S.
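Equations 4.95 and 4.96 involve the transformed instances only through inner products, which the following sketch makes explicit. It is a toy illustration under stated assumptions: the transformation is the one from Equation 4.92, the data set has a single instance per class, and for that special case the dual solution λ1 = λ2 = 2/‖φ(x1) − φ(x2)‖² was derived by hand rather than obtained from a solver. The function names are hypothetical.

```python
def phi(x):
    """Transformation of Equation 4.92 (used here only as an example)."""
    x1, x2 = x
    return (x1 * x1 - x1, x2 * x2 - x2)


def inner(u, v):
    """Inner product <phi(u), phi(v)> in the transformed space."""
    return sum(a * c for a, c in zip(phi(u), phi(v)))


def bias_term(X, y, lam, C, tol=1e-9):
    """Equation 4.96, averaged over support vectors with 0 < lambda_i < C."""
    S = [i for i, l in enumerate(lam) if tol < l < C - tol]
    if not S:
        raise ValueError("no support vector lies strictly inside (0, C)")
    total = 0.0
    for i in S:
        s = sum(lam[j] * y[j] * inner(X[i], X[j]) for j in range(len(X)))
        total += (1.0 - y[i] * s) / y[i]
    return total / len(S)


def decision_value(x, X, y, lam, b):
    """Left-hand side of Equation 4.95 evaluated at x."""
    return sum(l * yi * inner(xi, x) for l, yi, xi in zip(lam, y, X)) + b


# Hypothetical two-instance problem; lam is the hand-derived dual solution.
X = [(2.0, 0.0), (0.5, 0.5)]
y = [1, -1]
gap = [p - q for p, q in zip(phi(X[0]), phi(X[1]))]
lam_value = 2.0 / sum(g * g for g in gap)
lam = [lam_value, lam_value]
b = bias_term(X, y, lam, C=1.0)
values = [decision_value(x, X, y, lam, b) for x in X]
```

Both instances are support vectors on the margin hyperplanes, so the decision values come out as +1 and −1. Note that decision_value never forms w explicitly; it needs only inner products between transformed instances.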
Note that in order to solve the dual optimization problem in Equation 4.94, or
to use the learned model parameters to make predictions using Equations
4.95 and 4.96, we need only inner products of φ(x). Hence, even though φ(x)
may be nonlinear and high-dimensional, it suffices to use a function of the
inner products of φ(x) in the transformed space. This can be achieved by
using a kernel trick, which can be described as follows.
The inner product between two vectors is often regarded as a measure of
similarity between the vectors. For example, the cosine similarity described
in Section 2.4.5 on page 79 can be defined as the dot product between two
vectors that are normalized to unit length. Analogously, the inner product
〈φ(xi), φ(xj)〉 can also be regarded as a measure of similarity between two
instances, xi and xj, in the transformed space. The kernel trick is a method
for computing this similarity as a function of the original attributes.
Specifically, the kernel function K(u, v) between two instances u and v can be
defined as follows:
K(u, v)=〈φ(u), φ(v) 〉=f(u, v) (4.97)
where f(⋅) is a function that satisfies certain conditions stated by
Mercer's theorem. Although the details of this theorem are outside the scope
of the book, we provide a list of some of the commonly used kernel
functions:
Polynomial kernel
\[ K(u, v) = (u^T v + 1)^p \tag{4.98} \]
Radial Basis Function kernel
\[ K(u, v) = e^{-\|u - v\|^2 / (2\sigma^2)} \tag{4.99} \]
Sigmoid kernel
\[ K(u, v) = \tanh(k\,u^T v - \delta) \tag{4.100} \]
By using a kernel function, we can directly work with inner products in the
transformed space without dealing with the exact forms of the nonlinear
transformation function φ. Specifically, this allows us to use
high-dimensional transformations (sometimes even involving infinitely many
dimensions), while performing calculations only in the original attribute
space. Computing the inner products using kernel functions is also
considerably cheaper than using the transformed attribute set φ(x). Hence, the
use of kernel functions provides a significant advantage in representing
nonlinear decision boundaries, without suffering from the curse of
dimensionality. This has been one of the major reasons behind the
widespread usage of SVM in highly complex and nonlinear problems.
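The kernel trick can be checked by hand for a small case. The sketch below takes the polynomial kernel with p = 2 on two-dimensional inputs and compares it with an explicit six-dimensional feature map. The particular map phi_poly2 is one standard choice that reproduces this kernel; it and the test vectors are assumptions introduced for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, p=2):
    """Polynomial kernel K(u, v) = (u.v + 1)^p, computed in the original space."""
    return (dot(u, v) + 1.0) ** p


def phi_poly2(u):
    """Explicit feature map whose inner product equals poly_kernel with p = 2."""
    u1, u2 = u
    s = math.sqrt(2.0)
    return (u1 * u1, u2 * u2, s * u1 * u2, s * u1, s * u2, 1.0)


u, v = (0.3, -1.2), (2.0, 0.7)
via_kernel = poly_kernel(u, v)                    # one dot product in 2 dimensions
via_features = dot(phi_poly2(u), phi_poly2(v))    # dot product in 6 dimensions
```

The two values agree up to rounding, even though the kernel evaluation never constructs the six-dimensional vectors. For larger p or for the RBF kernel the explicit map grows very large or becomes infinite-dimensional, while the cost of evaluating K(u, v) stays the same.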
Figure 4.40.
Decision boundary produced by a nonlinear SVM with polynomial kernel.
Figure 4.40 shows the nonlinear decision boundary obtained by SVM using
the polynomial kernel function given in Equation 4.98. We can see that the
learned decision boundary is quite close to the true decision boundary shown
in Figure 4.39(a). Although the choice of kernel function depends on the
characteristics of the input data, a commonly used kernel function is the
radial basis function (RBF) kernel, which involves a single hyper-parameter
σ, known as the standard deviation of the RBF kernel.
4.9.5 Characteristics of SVM
1. The SVM learning problem can be formulated as a convex optimization
problem, in which efficient algorithms are available to find the global
minimum of the objective function. Other classification methods, such
as rule-based classifiers and artificial neural networks, employ a greedy
strategy to search the hypothesis space. Such methods tend to find only
locally optimum solutions.
2. SVM provides an effective way of regularizing the model parameters by
maximizing the margin of the decision boundary. Furthermore, it is able
to create a balance between model complexity and training errors by
using a hyper-parameter C. This trade-off is generic to a broader class of
model learning techniques that capture the model complexity and the
training loss using different formulations.
3. Linear SVM can handle irrelevant attributes by learning zero weights
corresponding to such attributes. It can also handle redundant attributes
by learning similar weights for the duplicate attributes. Furthermore, the
ability of SVM to regularize its learning makes it more robust to the
presence of a large number of irrelevant and redundant attributes than
other classifiers, even in high-dimensional settings. For this reason,
nonlinear SVMs are less impacted by irrelevant and redundant attributes
than other highly expressive classifiers that can learn nonlinear decision
boundaries such as decision trees.
To compare the effect of irrelevant attributes on the performance of
nonlinear SVMs and decision trees, consider the two-dimensional data
set shown in Figure 4.41(a) containing 500 instances of class + and 500
instances of class o, where
the two classes can be easily separated using a nonlinear decision
boundary. We incrementally add irrelevant attributes to this data set and
compare the performance of two classifiers: decision tree and nonlinear
SVM (using radial basis function kernel), using 70% of the data for
training and the rest for testing. Figure 4.41(b) shows the test error rates
of the two classifiers as we increase the number of irrelevant attributes.
We can see that the test error rate of decision trees swiftly reaches 0.5
(same as random guessing) in the presence of even a small number of
irrelevant attributes. This can be attributed to the problem of multiple
comparisons while choosing splitting attributes at internal nodes as
discussed in Example 3.7 of the previous chapter. On the other hand,
nonlinear SVM shows a more robust and steady performance even after
adding a moderately large number of irrelevant attributes. Its test error
rate gradually increases and eventually reaches close to 0.5 after adding
125 irrelevant attributes, at which point it becomes difficult to discern
the discriminative information in the original two attributes from the
noise in the remaining attributes for learning nonlinear decision
boundaries.
Figure 4.41.
Comparing the effect of adding irrelevant attributes on the
performance of nonlinear SVMs and decision trees.
4. SVM can be applied to categorical data by introducing dummy variables
for each categorical attribute value present in the data. For example, if
Marital Status has three values {Single,Married,Divorced}, we can
introduce a binary variable for each of the attribute values.
5. The SVM formulation presented in this chapter is for binary class
problems. However, multiclass extensions of SVM have also been
proposed.
6. Although the training time of an SVM model can be large, the learned
parameters can be succinctly represented with the help of a small
number of support vectors, making the classification of test instances
quite fast.
4.10 Ensemble Methods
This section presents techniques for improving classification accuracy by
aggregating the predictions of multiple classifiers. These techniques are
known as ensemble or classifier combination methods. An ensemble
method constructs a set of base classifiers from training data and performs
classification by taking a vote on the predictions made by each base
classifier. This section explains why ensemble methods tend to perform better
than any single classifier and presents techniques for constructing the
classifier ensemble.
4.10.1 Rationale for Ensemble
The following example illustrates how an ensemble method can improve a
classifier’s performance.
Example 4.8.
Consider an ensemble of 25 binary classifiers, each of which has an error rate
of ε=0.35. The ensemble classifier predicts the class label of a test example
by taking a majority vote on the predictions made by the base classifiers. If
the base classifiers are identical, then all the base classifiers will commit the
same mistakes. Thus, the error rate of the ensemble remains 0.35. On the
other hand, if the base classifiers are independent—i.e., their errors are
uncorrelated—then the ensemble makes a wrong prediction only if more than
half of the base classifiers predict incorrectly. In this case, the error rate of the
ensemble classifier is
\[ e_{\text{ensemble}} = \sum_{i=13}^{25}\binom{25}{i}\varepsilon^{i}(1-\varepsilon)^{25-i} = 0.06, \tag{4.101} \]
which is considerably lower than the error rate of the base classifiers.
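The value in Equation 4.101 is a binomial tail probability and can be reproduced in a few lines. The function name ensemble_error is an assumption; the computation itself follows the equation, with the additional assumption that the number of base classifiers n is odd so that ties cannot occur.

```python
from math import comb


def ensemble_error(eps, n=25):
    """Probability that more than half of n independent base classifiers,
    each with error rate eps, are wrong at the same time (n assumed odd)."""
    k_min = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i) for i in range(k_min, n + 1))


e_035 = ensemble_error(0.35)  # base classifiers better than random guessing
e_050 = ensemble_error(0.50)  # base classifiers equal to random guessing
e_060 = ensemble_error(0.60)  # base classifiers worse than random guessing
```

The first value is approximately 0.06, as stated above. At ε=0.5 the ensemble is no better than its members, and for ε above 0.5 the majority vote amplifies the errors instead of reducing them.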
Figure 4.42 shows the error rate of an ensemble of 25 binary classifiers
(e_ensemble) for different base classifier error rates (ε). The diagonal line
represents the case in which the base classifiers are identical, while the solid
line represents the case in which the base classifiers are independent. Observe
that the ensemble classifier performs worse than the base classifiers when ε
is larger than 0.5.
The preceding example illustrates two necessary conditions for an ensemble
classifier to perform better than a single classifier: (1) the base classifiers
should be independent of each other, and (2) the base classifiers should do
better than a classifier that performs random guessing. In practice, it is
difficult to ensure total independence among the base classifiers.
Nevertheless, improvements in classification accuracies have been observed
in ensemble methods in which the base classifiers are somewhat correlated.
4.10.2 Methods for Constructing an
Ensemble Classifier
A logical view of the ensemble method is presented in Figure 4.43. The basic
idea is to construct multiple classifiers from the original data and then
aggregate their predictions when classifying unknown examples. The
ensemble of classifiers can be constructed in many ways:
1. By manipulating the training set. In this approach, multiple training sets
are created by resampling the original data according to some sampling
distribution and constructing a classifier from each training set. The
sampling distribution determines how likely it is that an example will be
selected for training, and it may vary from one trial to another. Bagging
and boosting are two examples of ensemble methods that manipulate
their training sets. These methods are described in further detail in
Sections 4.10.4 and 4.10.5.
Figure 4.42.
Comparison between errors of base classifiers and errors of
the ensemble classifier.
Figure 4.43.
A logical view of the ensemble learning method.
2. By manipulating the input features. In this approach, a subset of input
features is chosen to form each training set. The subset can be either
chosen randomly or based on the recommendation of domain experts.
Some studies have shown that this approach works very well with data
sets that contain highly redundant features. Random forest, which is
described in Section 4.10.6, is an ensemble method that manipulates its
input features and uses decision trees as its base classifiers.
3. By manipulating the class labels. This method can be used when the
number of classes is sufficiently large. The training data is transformed
into a binary class problem by randomly partitioning the class labels into
two disjoint subsets, A0 and A1. Training examples whose class label
belongs to the subset A0 are assigned to class 0, while those that belong
to the subset A1 are assigned to class 1. The relabeled examples are then
used to train a base classifier. By repeating this process multiple times,
an ensemble of base classifiers is obtained. When a test example is
presented, each base classifier Ci is used to predict its class label. If the
test example is predicted as class 0, then all the classes that belong to A0
will receive a vote. Conversely, if it is predicted to be class 1, then all
the classes that belong to A1 will receive a vote. The votes are tallied
and the class that receives the highest vote is assigned to the test
example. An example of this approach is the error-correcting output
coding method described on page 331.
4. By manipulating the learning algorithm. Many learning algorithms can
be manipulated in such a way that applying the algorithm several times
on the same training data will result in the construction of different
classifiers. For example, an artificial neural network can change its
network topology or the initial weights of the links between neurons.
Similarly, an ensemble of decision trees can be constructed by injecting
randomness into the tree-growing procedure. For example, instead of
choosing the best splitting attribute at each node, we can randomly
choose one of the top k attributes for splitting.
The first three approaches are generic methods that are applicable to any
classifier, whereas the fourth approach depends on the type of classifier used.
The base classifiers for most of these approaches can be generated
sequentially (one after another) or in parallel (all at once). Once an ensemble
of classifiers has been learned, a test example x is classified by combining the
predictions made by the base classifiers Ci(x):
C*(x)=f(C1(x), C2(x), …, Ck(x)).
where f is the function that combines the ensemble responses. One simple
approach for obtaining C*(x) is to take a majority vote of the individual
predictions. An alternate approach is to take a weighted majority vote, where
the weight of a base classifier denotes its accuracy or relevance.
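The two combination functions mentioned above can be sketched as follows. The function names and the example weights are hypothetical, and ties are broken arbitrarily (in favor of the label encountered first), which is a simplification made here.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total weight, where weights[i]
    reflects the accuracy or relevance of base classifier i."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


base_predictions = [1, -1, 1]          # C1(x), C2(x), C3(x)
base_weights = [0.2, 0.9, 0.3]         # hypothetical classifier weights
plain = majority_vote(base_predictions)
weighted = weighted_majority_vote(base_predictions, base_weights)
```

In this example the plain vote returns +1, while the weighted vote returns −1 because the single dissenting classifier carries more weight than the other two combined.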
Ensemble methods show the most improvement when used with unstable
classifiers, i.e., base classifiers that are sensitive to minor perturbations in the
training set, because of high model complexity. Although unstable classifiers
may have a low bias in finding the optimal decision boundary, their
predictions have a high variance for minor changes in the training set or
model selection. This trade-off between bias and variance is discussed in
detail in the next section. By aggregating the responses of multiple unstable
classifiers, ensemble learning attempts to minimize their variance without
worsening their bias.
4.10.3 Bias-Variance Decomposition
Bias-variance decomposition is a formal method for analyzing the
generalization error of a predictive model. Although the analysis is slightly
different for classification than regression, we first discuss the basic intuition
of this decomposition by using an analogue of a regression problem.
Consider the illustrative task of reaching a target y by firing projectiles from a
starting position x, as shown in Figure 4.44. The target corresponds to the
desired output at a test instance, while the starting position corresponds to its
observed attributes. In this analogy, the projectile represents the model used
for predicting the target using the observed attributes. Let y^ denote the point
where the projectile hits the ground, which is analogous to the prediction of
the model.
Figure 4.44.
Bias-variance decomposition.
Ideally, we would like our predictions to be as close to the true target as
possible. However, note that different trajectories of projectiles are possible
based on differences in the training data or in the approach used for model
selection. Hence, we can observe a variance in the predictions y^ over
different runs of the projectile. Further, the target in our example is not fixed but
has some freedom to move around, resulting in a noise component in the true
target. This can be understood as the non-deterministic nature of the output
variable, where the same set of attributes can have different output values.
Let y^avg represent the average prediction of the projectile over multiple
runs, and yavg denote the average target value. The difference between y^avg
and yavg is known as the bias of the model.
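In the terms of this analogy, bias and variance are simple summary statistics over repeated runs. The sketch below computes them for invented numbers; the function name and the data are assumptions, and the variance is taken as the population variance of the predictions.

```python
from statistics import mean, pvariance


def bias_and_variance(predictions, targets):
    """Bias: difference between the average prediction and the average target.
    Variance: spread of the predictions around their own average."""
    y_hat_avg = mean(predictions)
    y_avg = mean(targets)
    return y_hat_avg - y_avg, pvariance(predictions)


# Hypothetical landing points of the projectile over three runs,
# and the (noisy) target positions observed in the same runs.
predictions = [9.0, 10.0, 11.0]
targets = [12.5, 11.5, 12.0]
bias, variance = bias_and_variance(predictions, targets)
```

Here the projectile lands two units short of the target on average (bias of −2), and its landing points vary from run to run with variance 2/3.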
In the context of classification, it can be shown that the generalization error
of a classification model m can be decomposed into terms involving the bias,
variance, and noise components of the model in the following way:
\[ \text{gen.error}(m) = c_1 \times \text{noise} + \text{bias}(m) + c_2 \times \text{variance}(m), \]
where c1 and c2 are constants that depend on the characteristics of training
and test sets. Note that while the noise term is intrinsic to the target class, the
bias and variance terms depend on the choice of the classification model. The
bias of a model represents how close the average prediction of the model is to
the average target. Models that are able to learn complex decision boundaries,
e.g., models produced by k-nearest neighbor and multi-layer ANN, generally
show low bias. The variance of a model captures the stability of its
predictions in response to minor perturbations in the training set or the model
selection approach.
We can say that a model shows better generalization performance if it has a
lower bias and lower variance. However, if the complexity of a model is high
but the training size is small, we generally expect to see a lower bias but
higher variance, resulting in the phenomenon of overfitting. This phenomenon
is pictorially represented in Figure 4.45(a). On the other hand, an overly
simplistic model that suffers from underfitting may show a lower variance
but would suffer from a high bias, as shown in Figure 4.45(b). Hence, the
trade-off between bias and variance provides a useful way for interpreting the
effects of underfitting and overfitting on the generalization performance of a
model.
Figure 4.45.
Plots showing the behavior of two-dimensional solutions with
constant L2 and L1 norms.
The bias-variance trade-off can be used to explain why ensemble learning
improves the generalization performance of unstable classifiers. If a base
classifier shows low bias but high variance, it can become susceptible to
overfitting, as even a small change in the training set will result in different
predictions. However, by combining the responses of multiple base
classifiers, we can expect to reduce the overall variance. Hence, ensemble
learning methods show better performance primarily by lowering the
variance in the predictions, although they can even help in reducing the bias.
One of the simplest approaches for combining predictions and reducing their
variance is to compute their average. This forms the basis of the bagging
method, described in the following subsection.
4.10.4 Bagging
Bagging, which is also known as bootstrap aggregating, is a technique that
repeatedly samples (with replacement) from a data set according to a uniform
probability distribution. Each bootstrap sample has the same size as the
original data. Because the sampling is done with replacement, some instances
may appear several times in the same training set, while others may be
omitted from the training set. On average, a bootstrap sample Di contains
approximately 63% of the original training data because each instance has a
probability 1−(1−1/N)^N of being selected in each Di. If N is sufficiently
large, this probability converges to 1−1/e≃0.632. The basic procedure for
bagging is summarized in Algorithm 4.5. After training the k classifiers, a test
instance is assigned to the class that receives the highest number of votes.
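The 63% figure can be checked both analytically and by simulation. In the sketch below, the function names, the sample size, and the random seed are arbitrary choices made for illustration.

```python
import math
import random


def inclusion_probability(N):
    """Probability that a given instance appears at least once in a
    bootstrap sample of size N drawn with replacement."""
    return 1.0 - (1.0 - 1.0 / N) ** N


def bootstrap_indices(N, rng):
    """Draw N indices uniformly at random with replacement."""
    return [rng.randrange(N) for _ in range(N)]


p_small = inclusion_probability(10)
p_large = inclusion_probability(10 ** 6)
limit = 1.0 - math.exp(-1.0)

# Empirical check: fraction of distinct instances in one bootstrap sample.
rng = random.Random(0)
N = 10000
fraction_distinct = len(set(bootstrap_indices(N, rng))) / N
```

Even for N = 10 the probability is already close to the limit 1 − 1/e, and the fraction of distinct instances in a simulated sample lands near the same value.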
To illustrate how bagging works, consider the data set shown in Table 4.4.
Let x denote a one-dimensional attribute and y denote the class label. Suppose
we use only one-level binary decision trees, with a test condition x≤k, where
k is a split point chosen to minimize the entropy of the leaf nodes. Such a tree
is also known as a decision stump.
Table 4.4. Example of data set used to construct an ensemble of bagging
classifiers.
x |  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y |    1    1    1   −1   −1   −1   −1    1    1    1
Without bagging, the best decision stump we can produce splits the instances
at either x≤0.35 or x≤0.75. Either way, the accuracy of the tree is at most
70%. Suppose we apply the bagging procedure on the data set using 10
bootstrap samples. The examples chosen for training in each bagging round
are shown in Figure 4.46. On the right-hand side of each table, we also
describe the decision stump being used in each round.
We classify the entire data set given in Table 4.4 by taking a majority vote
among the predictions made by each base classifier. The results of the
predictions are shown in Figure 4.47. Since the class labels are either −1 or
+1, taking the majority vote is equivalent to summing up the predicted values
of y and examining the sign of the resulting sum (refer to the second to last
row in Figure 4.47). Notice that the ensemble classifier perfectly classifies all
10 examples in the original data.
Algorithm 4.5 Bagging algorithm.
1: Let k be the number of bootstrap samples.
2: for i = 1 to k do
3:   Create a bootstrap sample of size N, Di.
4:   Train a base classifier Ci on the bootstrap sample Di.
5: end for
6: C*(x) = argmax_y ∑_i δ(Ci(x) = y).
   {δ(⋅) = 1 if its argument is true and 0 otherwise.}


Figure 4.46.
Example of bagging.
The preceding example illustrates another advantage of using ensemble
methods in terms of enhancing the representation of the target function. Even
though each base classifier is a decision stump, combining the classifiers can
lead to a decision boundary that mimics a decision tree of depth 2.
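The bagging procedure of Algorithm 4.5, applied to decision stumps on the data of Table 4.4, can be sketched as follows. This is a minimal illustration and not the exact procedure behind Figures 4.46 and 4.47: the bootstrap samples depend on the random seed, so the individual stumps and the resulting ensemble accuracy will generally differ from those figures. The function names and the tie-breaking rule (ties go to +1) are assumptions.

```python
import math
import random

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]


def entropy(labels):
    p = sum(1 for label in labels if label == 1) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


def majority(labels):
    return 1 if sum(labels) >= 0 else -1  # ties go to +1 (assumption)


def train_stump(xs, ys):
    """Return (k, left_label, right_label) for the test x <= k that
    minimizes the weighted entropy of the two leaf nodes."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        k = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= k]
        right = [y for x, y in zip(xs, ys) if x > k]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, k, majority(left), majority(right))
    if best is None:  # the sample contains a single distinct x value
        label = majority(ys)
        return (values[0], label, label)
    return best[1:]


def predict_stump(stump, x):
    k, left_label, right_label = stump
    return left_label if x <= k else right_label


def bagging(xs, ys, k, rng):
    """Train k stumps, each on a bootstrap sample of the same size as the data."""
    n = len(xs)
    stumps = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_ensemble(stumps, x):
    """Majority vote: sign of the summed +1/-1 predictions."""
    return 1 if sum(predict_stump(s, x) for s in stumps) >= 0 else -1


single = train_stump(X, Y)
single_acc = sum(predict_stump(single, x) == y for x, y in zip(X, Y)) / len(X)
ensemble = bagging(X, Y, k=10, rng=random.Random(0))
ensemble_pred = [predict_ensemble(ensemble, x) for x in X]
```

A single stump trained on the full data splits near 0.35 or 0.75 and reaches 70% accuracy. The ensemble can do better because stumps trained on different bootstrap samples place their splits at different points, and the summed votes can change sign more than once along the x axis.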
Bagging improves generalization error by reducing the variance of the base
classifiers. The performance of bagging depends on the stability of the base
classifier. If a base classifier is unstable, bagging helps to reduce the errors
associated with random fluctuations in the training data. If a base classifier is
stable, i.e., robust to minor perturbations in the training set, then the error of
the ensemble is primarily caused by bias in the base classifier. In this
situation, bagging may not be able to improve the performance of the base
classifiers significantly. It may even degrade the classifier’s performance
because the effective size of each training set is about 37% smaller than the
original data.
Figure 4.47.
Example of combining classifiers constructed using the bagging approach.
4.10.5 Boosting
Boosting is an iterative procedure used to adaptively change the distribution
of training examples for learning base classifiers so that they increasingly
focus on examples that are hard to classify. Unlike bagging, boosting assigns
a weight to each training example and may adaptively change the weight at
the end of each boosting round. The weights assigned to the training
examples can be used in the following ways:
1. They can be used to inform the sampling distribution used to draw a set
of bootstrap samples from the original data.
2. They can be used to learn a model that is biased toward examples with
higher weight.
This section describes an algorithm that uses weights of examples to
determine the sampling distribution of its training set. Initially, the examples
are assigned equal weights, 1/N, so that they are equally likely to be chosen
for training. A sample is drawn according to the sampling distribution of the
training examples to obtain a new training set. Next, a classifier is built from
the training set and used to classify all the examples in the original data. The
weights of the training examples are updated at the end of each boosting
round. Examples that are classified incorrectly will have their weights
increased, while those that are classified correctly will have their weights
decreased. This forces the classifier to focus on examples that are difficult to
classify in subsequent iterations.
The following table shows the examples chosen during each boosting round,
when applied to the data shown in Table 4.4.
Boosting (Round 1): 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2): 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3): 4 4 8 10 4 5 4 6 3 4
Initially, all the examples are assigned the same weights. However, some
examples may be chosen more than once, e.g., examples 3 and 7, because the
sampling is done with replacement. A classifier built from the data is then
used to classify all the examples. Suppose example 4 is difficult to classify.
The weight for this example will be increased in future iterations as it gets
misclassified repeatedly. Meanwhile, examples that were not chosen in the
previous round, e.g., examples 1 and 5, also have a better chance of being
selected in the next round since their predictions in the previous round were
likely to be wrong. As the boosting rounds proceed, examples that are the
hardest to classify tend to become even more prevalent. The final ensemble is
obtained by aggregating the base classifiers obtained from each boosting round.
Over the years, several implementations of the boosting algorithm have been
developed. These algorithms differ in terms of (1) how the weights of the
training examples are updated at the end of each boosting round, and (2) how
the predictions made by each classifier are combined. An implementation
called AdaBoost is explored in the next section.
Figure 4.48.
Plot of α as a function of training error ε.
Let $\{(x_j, y_j) \mid j = 1, 2, \ldots, N\}$ denote a set of N training examples. In the
AdaBoost algorithm, the importance of a base classifier $C_i$ depends on its
error rate, which is defined as
$$\epsilon_i = \frac{1}{N}\left[\sum_{j=1}^{N} w_j\, I\big(C_i(x_j) \neq y_j\big)\right], \tag{4.102}$$
where I(p)=1 if the predicate p is true, and 0 otherwise. The importance of a
classifier Ci is given by the following parameter,
$$\alpha_i = \frac{1}{2}\ln\left(\frac{1-\epsilon_i}{\epsilon_i}\right).$$
Note that αi has a large positive value if the error rate is close to 0 and a large
negative value if the error rate is close to 1, as shown in Figure 4.48.
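The behavior of the importance parameter can be checked numerically. The sketch below evaluates α = ½ ln((1 − ε)/ε) for a few error rates. The function name is an illustrative assumption, and the input is restricted to 0 < ε < 1, where the logarithm is defined.

```python
import math


def classifier_importance(error_rate):
    """alpha = 0.5 * ln((1 - error) / error), defined for 0 < error < 1."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - error_rate) / error_rate)


if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
        print(eps, classifier_importance(eps))
```

The value is zero at an error rate of 0.5 and antisymmetric around it, so a classifier that is wrong more often than right receives a negative importance.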
The αi parameter is also used to update the weight of the training examples.
To illustrate, let wi(j) denote the weight assigned to example (xi, yi) during
the jth boosting round. The weight update mechanism for AdaBoost is given
by the equation:
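$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i, \\ e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i, \end{cases} \tag{4.103}$$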
where $Z_j$ is the normalization factor used to ensure that $\sum_i w_i^{(j+1)} = 1$. The
weight update formula given in Equation 4.103 increases the weights of
incorrectly classified examples and decreases the weights of those classified
correctly.
Instead of using a majority voting scheme, the prediction made by each
classifier Cj is weighted according to αj. This approach allows AdaBoost to
penalize models that have poor accuracy, e.g., those generated at the earlier
boosting rounds. In addition, if any intermediate round produces an error rate
higher than 50%, the weights are reset to their original uniform
values, wi=1/N, and the resampling procedure is repeated. The AdaBoost
algorithm is summarized in Algorithm 4.6.
Algorithm 4.6 AdaBoost algorithm.
1: $w = \{w_j = 1/N \mid j = 1, 2, \ldots, N\}$. {Initialize the weights for all N examples.}
2: Let k be the number of boosting rounds.
3: for i = 1 to k do
4:   Create training set $D_i$ by sampling (with replacement) from D according to w.
5:   Train a base classifier $C_i$ on $D_i$.
6:   Apply $C_i$ to all examples in the original training set, D.
7:   $\epsilon_i = \frac{1}{N}\left[\sum_j w_j\, \delta(C_i(x_j) \neq y_j)\right]$ {Calculate the weighted error.}
8:   if $\epsilon_i > 0.5$ then
9:     $w = \{w_j = 1/N \mid j = 1, 2, \ldots, N\}$. {Reset the weights for all N examples.}
10:     Go back to Step 4.
11:   end if
12:   $\alpha_i = \frac{1}{2}\ln\frac{1-\epsilon_i}{\epsilon_i}$.
13:   Update the weight of each example according to Equation 4.103.
14: end for
15: $C^*(x) = \arg\max_y \sum_{j=1}^{k} \alpha_j\, \delta(C_j(x) = y)$.
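The AdaBoost procedure can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not part of the text:

- The base classifier is a one-dimensional decision stump.
- Each stump is trained directly on the weighted examples (the second use of weights listed at the start of this section) rather than on a weighted bootstrap sample, so that the run is deterministic.
- The weights are kept normalized to sum to 1, so the weighted error is simply the total weight of the misclassified examples.
- The toy data XS, YS and all function names are hypothetical.

Because a stump that tries both polarities never has a weighted error above 0.5, the reset of Steps 8 to 11 is not needed in this variant.

```python
import math

# Hypothetical toy data: ten 1-D points with labels +1 / -1.
XS = list(range(1, 11))
YS = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]


def stump_predict(stump, x):
    """A stump (t, left) predicts `left` when x <= t and `-left` otherwise."""
    t, left = stump
    return left if x <= t else -left


def train_stump(xs, ys, w):
    """Return the stump with the lowest weighted error on (xs, ys)."""
    values = sorted(set(xs))
    # Candidate thresholds: below all points, between neighbours, above all points.
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)
    best_err, best_stump = None, None
    for t in candidates:
        for left in (+1, -1):
            err = sum(wj for x, y, wj in zip(xs, ys, w)
                      if stump_predict((t, left), x) != y)
            if best_err is None or err < best_err:
                best_err, best_stump = err, (t, left)
    return best_stump


def adaboost(xs, ys, k):
    """Run k boosting rounds; return the list of (alpha, stump) and final weights."""
    n = len(xs)
    w = [1.0 / n] * n                                  # uniform initial weights
    ensemble = []
    for _ in range(k):
        stump = train_stump(xs, ys, w)                 # weighted training (assumption)
        miss = [stump_predict(stump, x) != y for x, y in zip(xs, ys)]
        eps = sum(wj for wj, m in zip(w, miss) if m)   # weighted error
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)        # guard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        # Raise the weights of misclassified examples, lower the others.
        w = [wj * math.exp(alpha if m else -alpha) for wj, m in zip(w, miss)]
        z = sum(w)                                     # normalization factor
        w = [wj / z for wj in w]
        ensemble.append((alpha, stump))
    return ensemble, w


def ensemble_predict(ensemble, x):
    """Alpha-weighted vote; for labels +1/-1 this is the sign of the weighted sum."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

On this toy data the best single stump classifies 7 of the 10 examples correctly, while the weighted vote of three stumps classifies all 10 correctly.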
Let us examine how the boosting approach works on the data set shown in
Table 4.4. Initially, all the examples have identical weights. After three
boosting rounds, the examples chosen for training are shown in Figure
4.49(a). The weights for each example are updated at the end of each
boosting round using Equation 4.103, as shown in Figure 4.49(b).
Without boosting, the accuracy of the decision stump is, at best, 70%. With
AdaBoost, the results of the predictions are given in Figure 4.50(b). The final
prediction of the ensemble classifier is obtained by taking a weighted average
of the predictions made by each base classifier, which is shown in the last
row of Figure 4.50(b). Notice that AdaBoost perfectly classifies all the
examples in the training data.
Figure 4.49.
Example of boosting.
An important analytical result of boosting shows that the training error of the
ensemble is bounded by the following expression:
$$e_{\text{ensemble}} \le \prod_i \left[ 2\sqrt{\epsilon_i (1-\epsilon_i)} \right], \tag{4.104}$$
where $\epsilon_i$ is the error rate of each base classifier i. If the error rate of the base
classifier is less than 50%, we can write $\epsilon_i = 0.5 - \gamma_i$, where $\gamma_i$ measures how
much better the classifier is than random guessing. The bound on the training
error of the ensemble becomes
$$e_{\text{ensemble}} \le \prod_i \sqrt{1 - 4\gamma_i^2} \le \exp\left(-2\sum_i \gamma_i^2\right). \tag{4.105}$$
Hence, as long as each base classifier is better than random guessing (γi>0),
the bound on the training error of the ensemble decreases exponentially with
the number of boosting rounds, which leads to the fast convergence of the
algorithm. By focusing on examples that are difficult for the base classifiers
to classify, boosting is able to reduce the bias of the
final predictions along with the variance. AdaBoost has been shown to
provide significant improvements in performance over base classifiers on a
range of data sets. Nevertheless, because of its tendency to focus on training
examples that are wrongly classified, the boosting technique can be
susceptible to overfitting, resulting in poor generalization performance in
some scenarios.
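The step from the bound on the training error in terms of the error rates to the exponential bound in terms of γ can be spelled out as follows. This is a sketch, not a derivation given in the text. It assumes that the per-round factor in the bound is $2\sqrt{\epsilon_i(1-\epsilon_i)}$ and uses the elementary inequality $1 - x \le e^{-x}$.

```latex
\begin{align*}
2\sqrt{\epsilon_i(1-\epsilon_i)}
  &= 2\sqrt{(0.5-\gamma_i)(0.5+\gamma_i)}
   = 2\sqrt{0.25-\gamma_i^2}
   = \sqrt{1-4\gamma_i^2}, \\
\sqrt{1-4\gamma_i^2}
  &\le \sqrt{e^{-4\gamma_i^2}}
   = e^{-2\gamma_i^2}
   \qquad \text{(since } 1-x \le e^{-x}\text{)}, \\
\prod_i \sqrt{1-4\gamma_i^2}
  &\le \prod_i e^{-2\gamma_i^2}
   = \exp\Bigl(-2\sum_i \gamma_i^2\Bigr).
\end{align*}
```

For instance, if every base classifier satisfies γi ≥ 0.1, then after 100 rounds the bound is at most exp(−2) ≈ 0.135.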
Figure 4.50.
Example of combining classifiers constructed using the
AdaBoost approach.
4.10.6 Random Forests
Random forests attempt to improve the generalization performance by
constructing an ensemble of decorrelated decision trees. Random forests
build on the idea of bagging to use a different bootstrap sample of the
training data for learning decision trees. However, a key distinguishing
feature of random forests from bagging is that at every internal node of a tree,
the best splitting criterion is chosen among a small set of randomly selected
attributes. In this way, random forests construct ensembles of decision trees
by not only manipulating training instances (by using bootstrap samples
similar to bagging), but also the input attributes (by using different subsets of
attributes at every internal node).
Given a training set D consisting of n instances and d attributes, the basic
procedure of training a random forest classifier can be summarized using the
following steps:
1. Construct a bootstrap sample Di of the training set by randomly
sampling n instances (with replacement) from D.
2. Use Di to learn a decision tree Ti as follows. At every internal node of
Ti, randomly sample a set of p attributes and choose an attribute from
this subset that shows the maximum reduction in an impurity measure
for splitting. Repeat this procedure till every leaf is pure, i.e., containing
instances from the same class.
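The two steps above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not a reference implementation:

- Attributes are numeric, and splits are thresholds placed midway between observed values.
- The impurity measure is the Gini index.
- When none of the p sampled attributes can split a node, the node becomes a majority-class leaf instead of growing until pure.
- The majority vote in forest_predict is the aggregation step applied after the trees are built.
- All function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, attrs):
    """Lowest weighted-Gini (attribute, threshold) among attrs, or None when
    none of the sampled attributes takes more than one value."""
    best, best_score = None, float("inf")
    for a in attrs:
        values = sorted({r[a] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (a, t), score
    return best


def grow_tree(rows, labels, p, rng):
    """Grow an unpruned tree, sampling p candidate attributes at every node."""
    if len(set(labels)) == 1:
        return labels[0]                                  # pure leaf
    attrs = rng.sample(range(len(rows[0])), p)            # random attribute subset
    split = best_split(rows, labels, attrs)
    if split is None:                                     # assumption: majority leaf
        return Counter(labels).most_common(1)[0][0]
    a, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return (a, t,
            grow_tree([r for r, _ in left], [y for _, y in left], p, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], p, rng))


def tree_predict(node, row):
    """Follow the splits until a leaf (a bare class label) is reached."""
    while isinstance(node, tuple):
        a, t, left, right = node
        node = left if row[a] <= t else right
    return node


def random_forest(rows, labels, n_trees, p, seed=0):
    """Train n_trees trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], p, rng))
    return forest


def forest_predict(forest, row):
    """Majority vote over the trees."""
    votes = Counter(tree_predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

Class labels are assumed not to be tuples, since tuples are used here to represent internal nodes.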
Once an ensemble of decision trees has been constructed, their average
prediction (majority vote) on a test instance is used as the final prediction of
the random forest. Note that the decision trees involved in a random forest
are unpruned trees, as they are allowed to grow to their largest possible size
till every leaf is pure. Hence, the base classifiers of random forest represent
unstable classifiers that have low bias but high variance, because of their
large size.
Another property of the base classifiers learned in random forests is the lack
of correlation among their model parameters and test predictions. This can be
attributed to the use of an independently sampled data set Di for learning
every decision tree Ti, similar to the bagging approach. However, random
forests have the additional advantage of choosing a splitting criterion at every
internal node using a different (and randomly selected) subset of attributes.
This property significantly helps in breaking the correlation structure, if any,
among the decision trees Ti.
To realize this, consider a training set involving a large number of attributes,
where only a small subset of attributes are strong predictors of the target
class, whereas other attributes are weak indicators. Given such a training set,
even if we consider different bootstrap samples Di for learning Ti, we would
mostly be choosing the same attributes for splitting at internal nodes, because
the weak attributes would be largely overlooked when compared with the
strong predictors. This can result in a considerable correlation among the
trees. However, if we restrict the choice of attributes at every internal node to
a random subset of attributes, we can ensure the selection of both strong and
weak predictors, thus promoting diversity among the trees. This principle is
utilized by random forests for creating decorrelated decision trees.
By aggregating the predictions of an ensemble of strong and decorrelated
decision trees, random forests are able to reduce the variance of the trees
without negatively impacting their low bias. This makes random forests quite
robust to overfitting. Additionally, because of their ability to consider only a
small subset of attributes at every internal node, random forests are
computationally fast and robust even in high-dimensional settings.
The number of attributes to be selected at every node, p, is a hyper-parameter
of the random forest classifier. A small value of p can reduce the correlation
among the classifiers but may also reduce their strength. A large value can
improve their strength but may result in correlated trees similar to bagging.
Although common suggestions for p in the literature include $\sqrt{d}$ and $\log_2 d + 1$, a
suitable value of p for a given training set can always be selected by tuning it
over a validation set, as described in the previous chapter. However, there is
an alternative way for selecting hyper-parameters in random forests, which
does not require using a separate validation set. It involves computing a
reliable estimate of the generalization error rate directly during training,
known as the out-of-bag (oob) error estimate. The oob estimate can be
computed for any generic ensemble learning method that builds independent
base classifiers using bootstrap samples of the training set, e.g., bagging and
random forests. The approach for computing oob estimate can be described
as follows.
Consider an ensemble learning method that uses an independent base
classifier Ti built on a bootstrap sample Di of the training set. Since every
training instance x will be used for training approximately 63% of base
classifiers, we can call x an out-of-bag sample for the remaining 37% of
base classifiers that did not use it for training. If we use these remaining 37%
of classifiers to make predictions on x, we can obtain the oob error on x by
taking their majority vote and comparing it with its class label. Note that the
oob error estimates the error of 37% of the classifiers on an instance that was
not used for training those classifiers. Hence, the oob error can be considered as a
reliable estimate of generalization error. By taking the average of oob errors
of all training instances, we can compute the overall oob error estimate. This
can be used as an alternative to the validation error rate for selecting
hyper-parameters. Hence, random forests do not need to use a separate partition of
the training set for validation, as they can simultaneously train the base
classifiers and compute generalization error estimates on the same data set.
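The oob computation can be sketched for any bagging-style ensemble as follows. In this illustration the base learner is passed in as two callables, train and predict, and a one-nearest-neighbor classifier on one-dimensional data stands in for the decision trees. These choices and all names are assumptions made to keep the example short.

```python
import random
from collections import Counter


def oob_error(xs, ys, n_models, train, predict, seed=0):
    """Out-of-bag error: each instance is judged only by the models whose
    bootstrap sample did not contain it."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        model = train([xs[i] for i in idx], [ys[i] for i in idx])
        models.append((set(idx), model))                  # remember what was used
    errors, counted = 0, 0
    for j in range(n):
        votes = Counter(predict(m, xs[j]) for used, m in models if j not in used)
        if not votes:
            continue                                      # x_j was in every bag
        counted += 1
        if votes.most_common(1)[0][0] != ys[j]:           # majority vote vs. label
            errors += 1
    return errors / counted if counted else float("nan")


def train_1nn(xs, ys):
    """Stand-in base learner: store the training pairs."""
    return list(zip(xs, ys))


def predict_1nn(model, x):
    """Label of the stored point closest to x."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]
```

To choose a hyper-parameter such as p, one would compute this estimate for each candidate value and keep the value with the lowest oob error.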
Random forests have been empirically found to provide significant
improvements in generalization performance that are often comparable, if not
superior, to the improvements provided by the AdaBoost algorithm. Random
forests are also more robust to overfitting and run much faster than the
AdaBoost algorithm.
4.10.7 Empirical Comparison
among Ensemble Methods
Table 4.5 shows the empirical results obtained when comparing the
performance of a decision tree classifier against bagging, boosting, and
random forest. The base classifiers used in each ensemble method consist of
50 decision trees. The classification accuracies reported in this table are
obtained from tenfold cross-validation. Notice that the ensemble classifiers
generally outperform a single decision tree classifier on many of the data sets.
Table 4.5. Comparing the accuracy of a decision tree classifier against three
ensemble methods.
Data Set      Number of (Attributes,  Decision  Bagging  Boosting  RF
              Classes, Instances)     Tree (%)  (%)      (%)       (%)
Anneal        (39, 6, 898)            92.09     94.43    95.43     95.43
Australia     (15, 2, 690)            85.51     87.10    85.22     85.80
Auto          (26, 7, 205)            81.95     85.37    85.37     84.39
Breast        (11, 2, 699)            95.14     96.42    97.28     96.14
Cleve         (14, 2, 303)            76.24     81.52    82.18     82.18
Credit        (16, 2, 690)            85.8      86.23    86.09     85.8
Diabetes      (9, 2, 768)             72.40     76.30    73.18     75.13
German        (21, 2, 1000)           70.90     73.40    73.00     74.5
Glass         (10, 7, 214)            67.29     76.17    77.57     78.04
Heart         (14, 2, 270)            80.00     81.48    80.74     83.33
Hepatitis     (20, 2, 155)            81.94     81.29    83.87     83.23
Horse         (23, 2, 368)            85.33     85.87    81.25     85.33
Ionosphere    (35, 2, 351)            89.17     92.02    93.73     93.45
Iris          (5, 3, 150)             94.67     94.67    94.00     93.33
Labor         (17, 2, 57)             78.95     84.21    89.47     84.21
Led7          (8, 10, 3200)           73.34     73.66    73.34     73.06
Lymphography  (19, 4, ?)              77.03     79.05    85.14     82.43
Pima          (9, 2, 768)             74.35     76.69    73.44     77.60
Sonar         (61, 2, 208)            78.85     78.85    84.62     85.58
Tic-tac-toe   (10, 2, 958)            83.72     93.84    98.54     95.82
Vehicle       (19, 4, 846)            71.04     74.11    78.25     74.94
Waveform      (22, 3, 5000)           76.44     83.30    83.90     84.04
Wine          (14, 3, 178)            94.38     96.07    97.75     97.75
Zoo           (17, 7, 101)            93.07     93.07    95.05     97.03
4.11 Class Imbalance Problem
In many data sets there is a disproportionate number of instances that
belong to different classes, a property known as skew or class imbalance. For
example, consider a health-care application where diagnostic reports are used
to decide whether a person has a rare disease. Because of the infrequent
nature of the disease, we can expect to observe a smaller number of subjects
who are positively diagnosed. Similarly, in credit card fraud detection,
fraudulent transactions are greatly outnumbered by legitimate transactions.
The degree of imbalance between the classes varies across different
applications and even across different data sets from the same application.
For example, the risk for a rare disease may vary across different populations
of subjects depending on their dietary and lifestyle choices. However, despite
their infrequent occurrences, a correct classification of the rare class often has
greater value than a correct classification of the majority class. For example,
it may be more dangerous to ignore a patient suffering from a disease than to
misdiagnose a healthy person.
More generally, class imbalance poses two challenges for classification. First,
it can be difficult to find sufficiently many labeled samples of a rare class.
Note that many of the classification methods discussed so far work well only
when the training set has a balanced representation of both classes. Although
some classifiers are more effective at handling imbalance in the training data
than others, e.g., rule-based classifiers and k-NN, they are all impacted if the
minority class is not well-represented in the training set. In general, a
classifier trained over an imbalanced data set shows a bias toward improving
its performance over the majority class, which is often not the desired
behavior. As a result, many existing classification models, when trained on
an imbalanced data set, may not effectively detect instances of the rare class.
Second, accuracy, which is the traditional measure for evaluating
classification performance, is not well-suited for evaluating models in the
presence of class imbalance in the test data. For example, if 1% of the credit
card transactions are fraudulent, then a trivial model that predicts every
transaction as legitimate will have an accuracy of 99% even though it fails to
detect any of the fraudulent activities. Thus, there is a need to use alternative
evaluation metrics that are sensitive to the skew and can capture different
criteria of performance than accuracy.
In this section, we first present some of the generic methods for building
classifiers when there is class imbalance in the training set. We then discuss
methods for evaluating classification performance and adapting classification
decisions in the presence of a skewed test set. In the remainder of this
section, we will consider binary classification problems for simplicity, where
the minority class is referred to as the positive (+) class while the majority class
is referred to as the negative (−) class.
4.11.1 Building Classifiers with
Class Imbalance
There are two primary considerations for building classifiers in the presence
of class imbalance in the training set. First, we need to ensure that the
learning algorithm is trained over a data set that has adequate representation
of both the majority as well as the minority classes. Some common
approaches for ensuring this include oversampling and
undersampling the training set. Second, having learned a classification
model, we need a way to adapt its classification decisions (and thus create an
appropriately tuned classifier) to best match the requirements of the
imbalanced test set. This is typically done by converting the outputs of the
classification model to real-valued scores, and then selecting a suitable
threshold on the classification score to match the needs of a test set. Both
these considerations are discussed in detail in the following.
Oversampling and Undersampling
The first step in learning with imbalanced data is to transform the training set
to a balanced training set, where both classes have nearly equal
representation. The balanced training set can then be used with any of the
existing classification techniques (without making any modifications in the
learning algorithm) to learn a model that gives equal emphasis to both
classes. In the following, we present some of the common techniques for
transforming an imbalanced training set to a balanced one.
A basic approach for creating balanced training sets is to generate a sample of
training instances where the rare class has adequate representation. There are
two types of sampling methods that can be used to enhance the representation
of the minority class: (a) undersampling, where the frequency of the
majority class is reduced to match the frequency of the minority class, and (b)
oversampling, where artificial examples of the minority class are created to
make them equal in proportion to the number of negative instances.
To illustrate undersampling, consider a training set that contains 100 positive
examples and 1000 negative examples. To overcome the skew among the
classes, we can select a random sample of 100 examples from the negative
class and use them with the 100 positive examples to create a balanced
training set. A classifier built over the resultant balanced set will then be
unbiased toward both classes. However, one limitation of undersampling is
that some of the useful negative examples (e.g., those closer to the actual
decision boundary) may not be chosen for training, resulting in an
inferior classification model. Another limitation is that the smaller sample of
100 negative instances may have a higher variance than the larger set of 1000.
Oversampling attempts to create a balanced training set by artificially
generating new positive examples. A simple approach for oversampling is to
duplicate every positive instance n−/n+ times, where n+ and n− are the
numbers of positive and negative training instances, respectively. Figure 4.51
illustrates the effect of oversampling on the learning of a decision boundary
using a classifier such as a decision tree. Without oversampling, only the
positive examples at the bottom right-hand side of Figure 4.51(a) are
classified correctly. The positive example in the middle of the diagram is
misclassified because there are not enough examples to justify the creation of
a new decision boundary to separate the positive and negative instances.
Oversampling provides the additional examples needed to ensure that the
decision boundary surrounding the positive example is not pruned, as
illustrated in Figure 4.51(b). Note that duplicating a positive instance is
analogous to doubling its weight during the training stage. Hence, the effect
of oversampling can be alternatively achieved by assigning higher weights to
positive instances than negative instances. This method of weighting
instances can be used with a number of classifiers such as logistic regression,
ANN, and SVM.
Figure 4.51.
Illustrating the effect of oversampling of the rare class.
One limitation of the duplication method for oversampling is that the
replicated positive examples have an artificially lower variance when
compared with their true distribution in the overall data. This can bias the
classifier to the specific distribution of training instances, which may not be
representative of the overall distribution of test instances, leading to poor
generalizability. To overcome this limitation, an alternative approach for
oversampling is to generate synthetic positive instances in the neighborhood
of existing positive instances. In this approach, called the Synthetic Minority
Oversampling Technique (SMOTE), we first determine the k-nearest positive
neighbors of every positive instance x, and then generate a synthetic positive
instance at some intermediate point along the line segment joining x to one of
its k-nearest neighbors, xk, chosen at random. This process is repeated until the
desired number of positive instances is reached. However, one limitation of
this approach is that it can only generate new positive instances in the convex
hull of the existing positive class. Hence, it does not help improve the
representation of the positive class outside the boundary of existing positive
instances. Despite their complementary strengths and weaknesses,
undersampling and oversampling provide useful directions for generating
balanced training sets in the presence of class imbalance.
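The three sampling strategies described in this section can be sketched in Python as follows. This is a simplified illustration. Instances are tuples of numbers, the positive class is assumed to contain at least two instances, and the SMOTE-style function picks the starting instance x at random for each synthetic point rather than looping over every positive instance. All function names are hypothetical.

```python
import math
import random


def undersample(pos, neg, rng):
    """Keep every positive and a random subset of negatives of the same size."""
    return pos, rng.sample(neg, len(pos))


def oversample_by_duplication(pos, neg):
    """Repeat every positive instance about n-/n+ times."""
    copies = max(1, round(len(neg) / len(pos)))
    return pos * copies, neg


def smote_like(pos, n_new, k, rng):
    """Create n_new synthetic positives, each on the line segment between a
    positive instance and one of its k nearest positive neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(pos)
        others = [q for q in pos if q is not x]
        neighbours = sorted(others, key=lambda q: math.dist(x, q))[:k]
        xk = rng.choice(neighbours)                  # random one of the k nearest
        lam = rng.random()                           # position along the segment
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, xk)))
    return synthetic
```

Because every synthetic point lies on a segment between two existing positives, the output never leaves the convex hull of the positive class, which is the limitation noted above.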
Assigning Scores to Test Instances
If a classifier returns an ordinal score s(x) for every test instance x such that a
higher score denotes a greater likelihood of x belonging to the positive class,
then for every possible value of score threshold, sT, we can create a new
binary classifier where a test instance x is classified positive only if s(x)>sT.
Thus, every choice of sT can potentially lead to a different classifier, and we
are interested in finding the classifier that is best suited for our needs.
Ideally, we would like the classification score to vary monotonically with the
actual posterior probability of the positive class, i.e., if s(x1) and s(x2) are the
scores of any two instances, x1 and x2, then
$s(x_1) \ge s(x_2) \Rightarrow P(y=1 \mid x_1) \ge P(y=1 \mid x_2)$. However, this is difficult to guarantee in
practice as the properties of the classification score depend on several
factors such as the complexity of the classification algorithm and the
representative power of the training set. In general, we can only expect the
classification score of a reasonable algorithm to be weakly related to the
actual posterior probability of the positive class, even though the relationship
may not be strictly monotonic. Most classifiers can be easily modified to
produce such a real-valued score. For example, the signed distance of an
instance from the positive margin hyperplane of SVM can be used as a
classification score. As another example, test instances belonging to a leaf in
a decision tree can be assigned a score based on the fraction of training
instances labeled as positive in the leaf. Also, probabilistic classifiers such as
naïve Bayes, Bayesian networks, and logistic regression naturally output
estimates of posterior probabilities, P(y=1|x). Next, we discuss some
evaluation measures for assessing the goodness of a classifier in the presence
of class imbalance.
Table 4.6. A confusion matrix for a binary classification problem in which the
classes are not equally important.
                    Predicted Class
                    +           −
Actual Class   +    f++ (TP)    f+− (FN)
               −    f−+ (FP)    f−− (TN)
4.11.2 Evaluating Performance with
Class Imbalance
The most basic approach for representing a classifier’s performance on a test
set is to use a confusion matrix, as shown in Table 4.6. This table is
essentially the same as Table 3.4, which was introduced in the context of
evaluating classification performance in Section 3.2. A confusion matrix
summarizes the number of instances predicted correctly or incorrectly by a
classifier using the following four counts:
True positive (TP) or f++, which corresponds to the number of positive
examples correctly predicted by the classifier.
False positive (FP) or f−+ (also known as Type I error), which
corresponds to the number of negative examples wrongly predicted as
positive by the classifier.
False negative (FN) or f+− (also known as Type II error), which
corresponds to the number of positive examples wrongly predicted as
negative by the classifier.
True negative (TN) or f−−, which corresponds to the number of negative
examples correctly predicted by the classifier.
The confusion matrix provides a concise representation of classification
performance on a given test data set. However, it is often difficult to interpret
and compare the performance of classifiers using the four-dimensional
representations (corresponding to the four counts) provided by their
confusion matrices. Hence, the counts in the confusion matrix are often
summarized using a number of evaluation measures. Accuracy is an
example of one such measure that combines these four counts into a single
value, which is used extensively when classes are balanced. However, the
accuracy measure is not suitable for handling data sets with imbalanced class
distributions as it tends to favor classifiers that correctly classify the majority
class. In the following, we describe other possible measures that capture
different criteria of performance when working with imbalanced classes.
A basic evaluation measure is the true positive rate (TPR), which is defined
as the fraction of positive test instances correctly predicted by the classifier:
$$\text{TPR} = \frac{TP}{TP + FN}.$$
In the medical community, TPR is also known as sensitivity, while in the
information retrieval literature, it is also called recall (r). A classifier with a
high TPR has a high chance of correctly identifying the positive instances of
the data.
Analogously to TPR, the true negative rate (TNR) (also known as
specificity) is defined as the fraction of negative test instances correctly
predicted by the classifier, i.e.,
$$\text{TNR} = \frac{TN}{FP + TN}.$$
A high TNR value signifies that the classifier has a high chance of correctly
classifying a randomly chosen negative instance in the test set. A commonly used
evaluation measure that is closely related to TNR is the false positive rate
(FPR), which is defined as 1−TNR:
$$\text{FPR} = \frac{FP}{FP + TN}.$$
Similarly, we can define the false negative rate (FNR) as 1−TPR:
$$\text{FNR} = \frac{FN}{FN + TP}.$$
Note that the evaluation measures defined above do not take into account the
skew among the classes, which can be formally defined as α=P/(P+N), where
P and N denote the number of actual positives and actual negatives,
respectively. As a result, changing the relative numbers of P and N will have
no effect on TPR, TNR, FPR, or FNR, since they depend only on the fraction
of correct classifications for every class, independently of the other class.
Furthermore, knowing the values of TPR and TNR (and consequently FNR
and FPR) does not by itself help us uniquely determine all four entries of the
confusion matrix. However, together with information about the skew factor,
α, and the total number of instances, N, we can compute the entire confusion
matrix using TPR and TNR, as shown in Table 4.7.
Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew, α, and
total number of instances, N.
           Predicted +        Predicted −     Total
Actual +   TPR×α×N            (1−TPR)×α×N     α×N
Actual −   (1−TNR)×(1−α)×N    TNR×(1−α)×N     (1−α)×N
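The entries of Table 4.7 translate directly into code. The sketch below rebuilds the four counts from TPR, TNR, the skew α, and the total number of instances. The function name and the argument order are illustrative assumptions.

```python
def confusion_from_rates(tpr, tnr, alpha, n):
    """Return (TP, FN, FP, TN) given TPR, TNR, the skew alpha (fraction of
    actual positives), and the total number of instances n."""
    positives = alpha * n            # number of actual positives
    negatives = (1.0 - alpha) * n    # number of actual negatives
    tp = tpr * positives
    fn = (1.0 - tpr) * positives
    fp = (1.0 - tnr) * negatives
    tn = tnr * negatives
    return tp, fn, fp, tn


if __name__ == "__main__":
    # 100 actual positives among 1100 instances, TPR = 1.0, TNR = 0.9.
    print(confusion_from_rates(1.0, 0.9, 100 / 1100, 1100))
```

The results are floating-point values, so the counts may differ from whole numbers by rounding error.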
An evaluation measure that is sensitive to the skew is precision, which can
be defined as the fraction of correct predictions of the positive class over the
total number of positive predictions, i.e.,
$$\text{Precision, } p = \frac{TP}{TP + FP}.$$
Precision is also referred to as the positive predictive value (PPV). A classifier
that has a high precision is likely to have most of its positive predictions
correct. Precision is a useful measure for highly skewed test sets where the
positive predictions, even though small in numbers, are required to be mostly
correct. A measure that is closely related to precision is the false discovery
rate (FDR), which can be defined as 1−p.
Although both FDR and FPR focus on FP, they are designed to capture
different evaluation objectives and thus can take quite contrasting values,
especially in the presence of class imbalance. To illustrate this, consider a
classifier with the following confusion matrix.
                    Predicted Class
                    +      −
Actual Class   +    100    0
               −    100    900
Since half of the positive predictions made by the classifier are incorrect, it
has a FDR value of 100/(100+100)=0.5. However, its FPR is equal to
100/(100+900)=0.1, which is quite low. This example shows that in the
presence of high skew (i.e., very small value of α), even a small FPR can
result in high FDR. See Section 10.6 for further discussion of this issue.
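The measures introduced so far can be computed from the four counts of a confusion matrix, as in the following sketch. The function name and the dictionary keys are illustrative assumptions, and the function assumes that none of the denominators is zero.

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Rates derived from the four confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),            # sensitivity / recall
        "TNR": tn / (tn + fp),            # specificity
        "FPR": fp / (fp + tn),            # 1 - TNR
        "FNR": fn / (fn + tp),            # 1 - TPR
        "precision": tp / (tp + fp),
        "FDR": fp / (tp + fp),            # 1 - precision
        "skew": (tp + fn) / (tp + fn + fp + tn),
    }


if __name__ == "__main__":
    # The confusion matrix from the example above.
    for name, value in imbalance_metrics(100, 0, 100, 900).items():
        print(name, value)
```

For the matrix above this gives an FDR of 0.5 and an FPR of 0.1, with a skew of 100/1100 ≈ 0.09.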
Note that the evaluation measures defined above provide an incomplete
representation of performance, because they either only capture the effect of
false positives (e.g., FPR and precision) or the effect of false negatives (e.g.,
TPR or recall), but not both. Hence, if we optimize only one of these
evaluation measures, we may end up with a classifier that shows low FN but
high FP, or vice-versa. For example, a classifier that declares every instance
to be positive will have a perfect recall, but high FPR and very poor
precision. On the other hand, a classifier that is very conservative in
classifying an instance as positive (to reduce FP) may end up having high
precision but very poor recall. We thus need evaluation measures that
account for both types of misclassifications, FP and FN. Some examples of
such evaluation measures are summarized by the following definitions.
$$\text{Positive Likelihood Ratio} = \frac{TPR}{FPR}.$$
$$F_1 \text{ measure} = \frac{2rp}{r+p} = \frac{2 \times TP}{2 \times TP + FP + FN}.$$
While some of these evaluation measures are invariant to the skew (e.g., the
positive likelihood ratio), others (e.g., precision and the F1 measure) are
sensitive to skew. Further, different evaluation measures capture the effects
of different types of misclassification errors in various ways. For example,
the F1 measure represents a harmonic mean between recall and precision, i.e.,
$$F_1 = \frac{2}{\frac{1}{r} + \frac{1}{p}}.$$
Because the harmonic mean of two numbers tends to be closer to the smaller
of the two numbers, a high value of F1-measure ensures that both precision
and recall are reasonably high. Similarly, the G measure represents the
geometric mean between recall and precision, i.e.,
$$G = \sqrt{rp}.$$
A comparison among harmonic, geometric, and arithmetic means is given in the
next example.
Example 4.9.
Consider two positive numbers a=1 and b=5. Their arithmetic mean is
$\mu_a = (a+b)/2 = 3$ and their geometric mean is $\mu_g = \sqrt{ab} = 2.236$. Their
harmonic mean is $\mu_h = (2 \times 1 \times 5)/6 = 1.667$, which is closer to the
smaller value between a and b than the arithmetic and geometric means.
A generic extension of the F1 measure is the Fβ measure, which can be
defined as follows.
$$F_\beta = \frac{(\beta^2+1)\,rp}{r+\beta^2 p} = \frac{(\beta^2+1)\times TP}{(\beta^2+1)\,TP+\beta^2 FN+FP}. \tag{4.106}$$
Both precision and recall can be viewed as special cases of Fβ by setting β=0
and β=∞, respectively. Low values of β make Fβ closer to precision, and high
values make it closer to recall.
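The limiting behavior of Fβ can be checked numerically. The following is a minimal sketch, assuming hypothetical counts (TP=40, FP=10, FN=60) and a helper name `f_beta` that does not appear in the text; it evaluates Fβ through precision p and recall r and assumes TP>0 so that neither denominator vanishes.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta computed from precision p and recall r (assumes tp > 0)."""
    p = tp / (tp + fp)   # precision
    r = tp / (tp + fn)   # recall
    b2 = beta ** 2
    return (b2 + 1) * r * p / (r + b2 * p)

# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 40, 10, 60
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(tp, fp, fn, beta=1.0)
harmonic = 2 / (1 / recall + 1 / precision)   # harmonic mean of r and p

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
print("beta=0    ->", f_beta(tp, fp, fn, beta=0.0))    # coincides with precision
print("beta=1000 ->", f_beta(tp, fp, fn, beta=1000.0))  # approaches recall
```

Because recall is the smaller of the two values here, F1 lies closer to recall than to precision, as expected of a harmonic mean.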
A more general measure that captures Fβ as well as accuracy is the weighted
accuracy measure, which is defined by the following equation:
$$\text{Weighted accuracy} = \frac{w_1 TP + w_4 TN}{w_1 TP + w_2 FN + w_3 FP + w_4 TN}. \tag{4.107}$$
The relationship between weighted accuracy and other performance measures
is summarized in the following table:
Measure      w1      w2    w3    w4
Recall       1       1     0     0
Precision    1       0     1     0
Fβ           β²+1    β²    1     0
Accuracy     1       1     1     1
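The way weighted accuracy subsumes the other measures can be sketched in a few lines. This is an illustrative sketch, not the book's code: the function `weighted_accuracy`, its keyword names (each weight is named after the count it multiplies, to keep the pairing explicit), and the confusion-matrix counts are all assumptions made for the example.

```python
def weighted_accuracy(tp, fp, fn, tn, *, w_tp, w_fp, w_fn, w_tn):
    """Weighted ratio of correct predictions to all predictions."""
    correct = w_tp * tp + w_tn * tn
    total = w_tp * tp + w_fp * fp + w_fn * fn + w_tn * tn
    return correct / total

# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 60, 890
beta = 2.0
b2 = beta ** 2

recall = weighted_accuracy(tp, fp, fn, tn, w_tp=1, w_fp=0, w_fn=1, w_tn=0)
precision = weighted_accuracy(tp, fp, fn, tn, w_tp=1, w_fp=1, w_fn=0, w_tn=0)
accuracy = weighted_accuracy(tp, fp, fn, tn, w_tp=1, w_fp=1, w_fn=1, w_tn=1)
f_beta = weighted_accuracy(tp, fp, fn, tn, w_tp=b2 + 1, w_fp=1, w_fn=b2, w_tn=0)

print(recall, precision, accuracy, f_beta)
```

Setting the weight on TN to zero is what makes recall, precision, and Fβ ignore correctly rejected negatives, which is why they respond differently to skew than accuracy does.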
4.11.3 Finding an Optimal Score Threshold
Given a suitably chosen evaluation measure E and a distribution of
classification scores, s(x), on a validation set, we can obtain the optimal score
threshold s* on the validation set using the following approach:
1. Sort the scores in increasing order of their values.
2. For every unique value of score, s, consider the classification model that
assigns an instance x as positive only if s(x)>s. Let E(s) denote the
performance of this model on the validation set.
3. Find s* that maximizes the evaluation measure, E(s).
$$s^* = \arg\max_s E(s).$$
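The three steps above can be sketched as a direct search over the unique validation scores. This is a minimal sketch under stated assumptions: the names `best_threshold` and `f1_measure`, the choice of F1 as the evaluation measure E, and the toy scores and labels (1 for positive, 0 for negative) are all hypothetical. The confusion matrix is recomputed for every candidate threshold, which is simple rather than efficient.

```python
def f1_measure(tp, fp, fn, tn):
    """F1 from confusion-matrix counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(scores, labels, measure):
    """Return (s_star, E(s_star)), trying every unique score as a threshold."""
    best_s, best_e = None, float("-inf")
    for s in sorted(set(scores)):              # steps 1 and 2: unique scores, increasing
        tp = fp = fn = tn = 0
        for score, y in zip(scores, labels):
            predicted_positive = score > s      # positive only if s(x) > s
            if predicted_positive and y == 1:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif y == 1:
                fn += 1
            else:
                tn += 1
        e = measure(tp, fp, fn, tn)
        if e > best_e:                          # step 3: keep the maximizer
            best_s, best_e = s, e
    return best_s, best_e

# Hypothetical validation scores and labels.
val_scores = [0.10, 0.30, 0.35, 0.60, 0.80, 0.90]
val_labels = [0, 0, 1, 0, 1, 1]

s_star, e_star = best_threshold(val_scores, val_labels, f1_measure)
print(s_star, round(e_star, 3))
```

Ties in E(s) are resolved here in favor of the lowest threshold; a different tie-breaking rule would be an equally valid design choice.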
Note that s* can be treated as a hyper-parameter of the classification
algorithm that is learned during model selection. Using s*, we can assign a
positive label to a future test instance x only if s(x)>s*. If the evaluation
measure E is skew invariant (e.g., Positive Likelihood Ratio), then we can
select s* without knowing the skew of the test set, and the resultant classifier
formed using s* can be expected to show optimal performance on the test set
(with respect to the evaluation measure E). On the other hand, if E is sensitive
to the skew (e.g., precision or F1-measure), then we need to ensure that the
skew of the validation set used for selecting s* is similar to that of the test set,
so that the classifier formed using s* shows optimal test performance with
respect to E. Alternatively, given an estimate of the skew of the test data, α,
we can use it along with the TPR and TNR on the validation set to estimate
all entries of the confusion matrix (see Table 4.7), and thus the estimate of
any evaluation measure E on the test set. The score threshold s* selected
using this estimate of E can then be expected to produce optimal test
performance with respect to E. Furthermore, the methodology of selecting s*
on the validation set can help in comparing the test performance of different
classification algorithms, by using the optimal values of s* for each algorithm.
4.11.4 Aggregate Evaluation of Performance
Although the above approach helps in finding a score threshold s* that
provides optimal performance with respect to a desired evaluation measure
and a certain amount of skew, α, sometimes we are interested in evaluating
the performance of a classifier on a number of possible score thresholds, each
corresponding to a different choice of evaluation measure and skew value.
Assessing the performance of a classifier over a range of score thresholds is
called aggregate evaluation of performance. In this style of analysis, the
emphasis is not on evaluating the performance of a single classifier
corresponding to the optimal score threshold, but on assessing the general
quality of the ranking produced by the classification scores on the test set. In general,
this helps in obtaining robust estimates of classification performance that are
not sensitive to specific choices of score thresholds.
ROC Curve
One of the widely-used tools for aggregate evaluation is the receiver
operating characteristic (ROC) curve. An ROC curve is a graphical
approach for displaying the trade-off between TPR and FPR of a classifier,
over varying score thresholds. In an ROC curve, the TPR is plotted along the
y-axis and the FPR is shown on the x-axis. Each point along the curve
corresponds to a classification model generated by placing a threshold on the
test scores produced by the classifier. The following procedure describes the
generic approach for computing an ROC curve:
1. Sort the test instances in increasing order of their scores.
2. Select the lowest ranked test instance (i.e., the instance with lowest
score). Assign the selected instance and those ranked above it to the
positive class. This approach is equivalent to classifying all the test
instances as positive class. Because all the positive examples are
classified correctly and the negative examples are misclassified, TPR=FPR=1.
3. Select the next test instance from the sorted list. Classify the selected
instance and those ranked above it as positive, while those ranked below
it as negative. Update the counts of TP and FP by examining the actual
class label of the previously selected instance, which has just moved to
the predicted negatives. If that instance belongs to the positive class,
the TP count is decremented and the FP count remains the same as before.
If it belongs to the negative class, the FP count is decremented and the
TP count remains the same as before.
4. Repeat Step 3 and update the TP and FP counts accordingly until the
highest ranked test instance has been selected and then also assigned to
the negative class. At this final threshold, TPR=FPR=0, as all instances
are labeled as negative.
5. Plot the TPR against FPR of the classifier.
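The procedure above can be sketched as follows. This is an illustrative sketch, not the book's implementation: the function name `roc_points` and the five-instance test set are assumptions, labels are taken to be 1 (positive) or 0 (negative), and scores are assumed to be distinct, since tied scores would need to be moved across the threshold together.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) pairs obtained by sweeping the score threshold upward."""
    ranked = sorted(zip(scores, labels))          # increasing order of score
    n_pos = sum(1 for _, y in ranked if y == 1)
    n_neg = len(ranked) - n_pos

    tp, fp = n_pos, n_neg                         # everything predicted positive
    points = [(fp / n_neg, tp / n_pos)]           # starts at (1, 1)
    for _, y in ranked:                           # move one instance to the negative side
        if y == 1:
            tp -= 1                               # a true positive is lost
        else:
            fp -= 1                               # a false positive is removed
        points.append((fp / n_neg, tp / n_pos))
    return points                                 # ends at (0, 0)

# Hypothetical test scores and labels.
test_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
test_labels = [0, 1, 0, 1, 1]

points = roc_points(test_scores, test_labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each step changes only one of the two rates, which is the source of the staircase shape of the plotted curve.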
Example 4.10. [Generating ROC Curve]
Figure 4.52 shows an example of how to compute the TPR and FPR values
for every choice of score threshold. There are five positive examples and five
negative examples in the test set. The class labels of the test instances are
shown in the first row of the table, while the second row corresponds to the
sorted score values for each instance. The next six rows contain the counts of
TP, FP, TN, and FN, along with their corresponding TPR and FPR. The
table is then filled from left to right. Initially, all the instances are predicted to
be positive. Thus, TP=FP=5 and TPR=FPR=1. Next, we assign the test
instance with the lowest score as the negative class. Because the selected
instance is actually a positive example, the TP count decreases from 5 to 4
and the FP count is the same as before. The FPR and TPR are updated
accordingly. This process is repeated until we reach the end of the list, where
TPR=0 and FPR=0. The ROC curve for this example is shown in Figure 4.53.
Figure 4.52.
Computing the TPR and FPR at every score threshold.
Figure 4.53.
ROC curve for the data shown in Figure 4.52.
Note that in an ROC curve, the TPR monotonically increases with FPR,
because the inclusion of a test instance in the set of predicted positives can
either increase the TPR or the FPR. The ROC curve thus has a staircase
pattern. Furthermore, there are several critical points along an ROC curve
that have well-known interpretations:
(TPR=0, FPR=0): Model predicts every instance to be a negative class.
(TPR=1, FPR=1): Model predicts every instance to be a positive class.
(TPR=1, FPR=0): The perfect model with zero misclassifications.
A good classification model should be located as close as possible to the
upper left corner of the diagram, while a model that makes random guesses
should reside along the main diagonal, connecting the points
(TPR=0, FPR=0) and (TPR=1, FPR=1). Random guessing means that an
instance is classified as a positive class with a fixed probability p, irrespective
of its attribute set. For example, consider a data set that contains n+ positive
instances and n− negative instances. The random classifier is expected to
correctly classify pn+ of the positive instances and to misclassify pn− of the
negative instances. Therefore, the TPR of the classifier is (pn+)/n+=p, while
its FPR is (pn−)/n−=p. Hence, this random classifier will reside at the point (p,
p) in the ROC curve along the main diagonal.
Figure 4.54.
ROC curves for two different classifiers.
Since every point on the ROC curve represents the performance of a classifier
generated using a particular score threshold, they can be viewed as different
operating points of the classifier. One may choose one of these operating
points depending on the requirements of the application. Hence, an ROC
curve facilitates the comparison of classifiers over a range of operating
points. For example, Figure 4.54 compares the ROC curves of two classifiers,
M1 and M2, generated by varying the score thresholds. We can see that M1
is better than M2 when FPR is less than 0.36, as M1 shows better TPR than
M2 for this range of operating points. On the other hand, M2 is superior
when FPR is greater than 0.36, since the TPR of M2 is higher than that of M1
for this range. Clearly, neither of the two classifiers dominates (is strictly
better than) the other, i.e., shows higher values of TPR and lower values of
FPR over all operating points.
To summarize the aggregate behavior across all operating points, one of the
commonly used measures is the area under the ROC curve (AUC). If the
classifier is perfect, then its area under the ROC curve will be equal to 1. If the
algorithm simply performs random guessing, then its area under the ROC
curve will be equal to 0.5.
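One way to compute the AUC without drawing the curve is the rank-based formulation: the AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The sketch below uses that pairwise formulation rather than integrating the plotted curve; the function name `auc` and the toy data are assumptions for illustration, and the quadratic loop is meant for clarity rather than speed.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A pair counts 1 if the positive outranks the negative, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test scores and labels.
test_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
test_labels = [0, 1, 0, 1, 1]

print(auc(test_scores, test_labels))                 # imperfect ranking
print(auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))       # perfect ranking
print(auc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))       # uninformative constant score
```

A constant score ranks every pair as a tie, which reproduces the random-guessing value.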
Although the AUC provides a useful summary of aggregate performance,
there are certain caveats in using the AUC for comparing classifiers. First,
even if the AUC of algorithm A is higher than the AUC of another algorithm
B, this does not mean that algorithm A is always better than B, i.e., the ROC
curve of A dominates that of B across all operating points. For example, even
though M1 shows a slightly lower AUC than M2 in Figure 4.54, we can see
that both M1 and M2 are useful over different ranges of operating points and
neither of them is strictly better than the other across all possible operating
points. Hence, we cannot use the AUC to determine which algorithm is
better, unless we know that the ROC curve of one of the algorithms
dominates the other.
Second, although the AUC summarizes the aggregate performance over all
operating points, we are often interested in only a small range of operating
points in most applications. For example, even though M1 shows slightly
lower AUC than M2, it shows higher TPR values than M2 for small FPR
values (smaller than 0.36). In the presence of class imbalance, the behavior of
an algorithm over small FPR values (also termed as early retrieval) is often
more meaningful for comparison than the performance over all FPR values.
This is because, in many applications, it is important to assess the TPR
achieved by a classifier in the first few instances with highest scores, without
incurring a large FPR. Hence, in Figure 4.54, due to the high TPR values of
M1 during early retrieval (FPR<0.36), we may prefer M1 over M2 for
imbalanced test sets, despite the lower AUC of M1. Hence, care must be
taken while comparing the AUC values of different classifiers, usually by
visualizing their ROC curves rather than just reporting their AUC.
A key characteristic of ROC curves is that they are agnostic to the skew in
the test set, because both the evaluation measures used in constructing ROC
curves (TPR and FPR) are invariant to class imbalance. Hence, ROC curves
are not suitable for measuring the impact of skew on classification
performance. In particular, we will obtain the same ROC curve for two test
data sets that have very different skew.
Figure 4.55.
PR curves for two different classifiers.
Precision-Recall Curve
An alternative tool for aggregate evaluation is the precision-recall curve (PR
curve). The PR curve plots the precision and recall values of a classifier on
the y and x axes respectively, by varying the threshold on the test scores.
Figure 4.55 shows an example of PR curves for two hypothetical classifiers,
M1 and M2. The approach for generating a PR curve is similar to the
approach described above for generating an ROC curve. However, there are
some key distinguishing features in the PR curve:
1. PR curves are sensitive to the skew factor α=P/(P+N), and different PR
curves are generated for different values of α.
2. When the score threshold is lowest (every instance is labeled as
positive), the precision is equal to α while recall is 1. As we increase the
score threshold, the number of predicted positives can stay the same or
decrease. Hence, the recall monotonically declines as the score threshold
increases. In general, the precision may increase or decrease for the
same value of recall, upon addition of an instance into the set of
predicted positives. For example, if the kth ranked instance belongs to
the negative class, then including it will result in a drop in the precision
without affecting the recall. The precision may improve at the next step,
which adds the (k+1)th ranked instance, if this instance belongs to the
positive class. Hence, the PR curve is not a smooth, monotonically
increasing curve like the ROC curve, and generally has a zigzag pattern.
This pattern is more prominent in the left part of the curve, where even a
small change in the number of false positives can cause a large change
in precision.
3. As we increase the imbalance among the classes (reduce the value of
α), the rightmost points of all PR curves will move downwards. At and
near the leftmost point on the PR curve (corresponding to larger values
of score threshold), the recall is close to zero, while the precision is
equal to the fraction of positives in the top ranked instances of the
algorithm. Hence, different classifiers can have different values of
precision at the leftmost points of the PR curve. Also, if the
classification score of an algorithm monotonically varies with the
posterior probability of the positive class, we can expect the PR curve to
gradually decrease from a high value of precision on the leftmost point
to a constant value of α at the rightmost point, albeit with some ups and
downs. This can be observed in the PR curve of algorithm M1 in Figure
4.55, which starts from a higher value of precision on the left that
gradually decreases as we move towards the right. On the other hand,
the PR curve of algorithm M2 starts from a lower value of precision on
the left and shows more drastic ups and downs as we move right,
suggesting that the classification score of M2 shows a weaker
monotonic relationship with the posterior probability of the positive class.
4. A random classifier that assigns an instance to be positive with a fixed
probability p has a precision of α and a recall of p. Hence, a classifier
that performs random guessing has a horizontal PR curve with y=α, as
shown using a dashed line in Figure 4.55. Note that the random baseline
in PR curves depends on the skew in the test set, in contrast to the fixed
main diagonal of ROC curves that represents random classifiers.
5. Note that the precision of an algorithm is impacted more strongly by
false positives in the top ranked test instances than the FPR of the
algorithm. For this reason, the PR curve generally helps to magnify the
differences between classifiers in the left portion of the PR curve.
Hence, in the presence of class imbalance in the test data, analyzing the
PR curves generally provides a deeper insight into the performance of
classifiers than the ROC curves, especially in the early retrieval range of
operating points.
6. The classifier corresponding to (precision=1, recall=1) represents the
perfect classifier. Similar to AUC, we can also compute the area under
the PR curve of an algorithm, known as AUC-PR. The AUC-PR of a
random classifier is equal to α, while that of a perfect algorithm is equal
to 1. Note that AUC-PR varies with changing skew in the test set, in
contrast to the area under the ROC curve, which is insensitive to the
skew. The AUC-PR helps in accentuating the differences between
classification algorithms in the early retrieval range of operating points.
Hence, it is more suited for evaluating classification performance in the
presence of class imbalance than the area under the ROC curve.
However, similar to ROC curves, a higher value of AUC-PR does not
guarantee the superiority of a classification algorithm over another. This
is because the PR curves of two algorithms can easily cross each other,
such that they both show better performances in different ranges of
operating points. Hence, it is important to visualize the PR curves before
comparing their AUC-PR values, in order to ensure a meaningful evaluation.
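Several of the properties listed above can be seen on a small example. The following is a minimal sketch, assuming a hypothetical helper `pr_points`, labels coded as 1 (positive) and 0 (negative), distinct scores, and a five-instance test set invented for illustration. It adds instances to the set of predicted positives from the highest score downward and records (recall, precision) after each addition.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs as the threshold is lowered one instance at a time."""
    ranked = sorted(zip(scores, labels), reverse=True)   # decreasing order of score
    n_pos = sum(y for _, y in ranked)
    tp = fp = 0
    points = []
    for _, y in ranked:
        if y == 1:
            tp += 1            # recall and precision both change
        else:
            fp += 1            # precision drops, recall is unchanged
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

# Hypothetical test scores and labels.
test_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
test_labels = [0, 1, 0, 1, 1]
alpha = sum(test_labels) / len(test_labels)              # skew: P / (P + N)

points = pr_points(test_scores, test_labels)
for recall, precision in points:
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```

In this toy run the precision falls when the third-ranked (negative) instance is added and rises again with the fourth, which is the zigzag pattern; the final point has recall 1 and precision equal to α.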
4.12 Multiclass Problem
Some of the classification techniques described in this chapter are originally
designed for binary classification problems. Yet there are many real-world
problems, such as character recognition, face identification, and text
classification, where the input data is divided into more than two categories.
This section presents several approaches for extending the binary classifiers
to handle multiclass problems. To illustrate these approaches, let Y=
{y1, y2, …, yK} be the set of classes of the input data.
The first approach decomposes the multiclass problem into K binary
problems. For each class yi∈Y, a binary problem is created where all
instances that belong to yi are considered positive examples, while the
remaining instances are considered negative examples. A binary classifier is
then constructed to separate instances of class yi from the rest of the classes.
This is known as the one-against-rest (1-r) approach.
The second approach, which is known as the one-against-one (1-1) approach,
constructs K(K−1)/2 binary classifiers, where each classifier is used to
distinguish between a pair of classes, (yi, yj). Instances that do not belong to
either yi or yj are ignored when constructing the binary classifier for (yi, yj).
In both 1-r and 1-1 approaches, a test instance is classified by combining the
predictions made by the binary classifiers. A voting scheme is typically
employed to combine the predictions, where the class that receives the
highest number of votes is assigned to the test instance. In the 1-r approach, if
an instance is classified as negative, then all classes except for the positive
class receive a vote. This approach, however, may lead to ties among the
different classes. Another possibility is to transform the outputs of the binary
classifiers into probability estimates and then assign the test instance to the
class that has the highest probability.
Example 4.11.
Consider a multiclass problem where Y={y1, y2, y3, y4}. Suppose a test
instance is classified as (+, −, −, −) according to the 1-r approach. In other
words, it is classified as positive when y1 is used as the positive class and
negative when y2, y3, and y4 are used as the positive class. Using a simple
majority vote, notice that y1 receives the highest number of votes, which is
four, while the remaining classes receive only two votes each. The test instance is
therefore classified as y1.
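The vote count in this example can be reproduced mechanically. The sketch below is illustrative only: the function name `one_vs_rest_votes` and the "+"/"-" string encoding of predictions are assumptions. It applies the voting rule described earlier, in which a positive prediction gives one vote to that classifier's own class and a negative prediction gives one vote to every other class.

```python
def one_vs_rest_votes(classes, predictions):
    """Tally votes from K one-against-rest classifiers, one prediction per class."""
    votes = {c: 0 for c in classes}
    for positive_class, pred in zip(classes, predictions):
        if pred == "+":
            votes[positive_class] += 1        # vote for the classifier's own class
        else:
            for c in classes:
                if c != positive_class:       # vote for every class except its own
                    votes[c] += 1
    return votes

classes = ["y1", "y2", "y3", "y4"]
votes = one_vs_rest_votes(classes, ["+", "-", "-", "-"])
winner = max(votes, key=votes.get)
print(votes, "->", winner)
```

Here y1 collects a vote from every classifier, so it wins outright; with predictions such as (+, −, +, −) two classes would share the top count and a tie-breaking rule would be needed.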
Example 4.12.
Suppose the test instance is classified using the 1-1 approach as follows:
Binary pair of classes    +: y1    +: y1    +: y1    +: y2    +: y2    +: y3
                          −: y2    −: y3    −: y4    −: y3    −: y4    −: y4
Classification            +        +        −        +        −        +
The first two rows in this table correspond to the pair of classes (yi, yj)
chosen to build the classifier and the last row represents the predicted class
for the test instance. After combining the predictions, y1 and y4 each receive
two votes, while y2 and y3 each receive only one vote. The test instance is
therefore classified as either y1 or y4, depending on the tie-breaking procedure.
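The tally in this example can be sketched in the same way. This is a minimal sketch with assumed names (`one_vs_one_votes`, the "+"/"-" encoding); it relies on `itertools.combinations` producing the class pairs in the same order as the table, with the first class of each pair treated as the positive class.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_votes(classes, predictions):
    """Tally votes from the K(K-1)/2 pairwise classifiers."""
    votes = Counter({c: 0 for c in classes})
    pairs = list(combinations(classes, 2))        # (y1,y2), (y1,y3), ..., (y3,y4)
    for (pos, neg), pred in zip(pairs, predictions):
        votes[pos if pred == "+" else neg] += 1   # each classifier casts one vote
    return votes

classes = ["y1", "y2", "y3", "y4"]
predictions = ["+", "+", "-", "+", "-", "+"]

votes = one_vs_one_votes(classes, predictions)
top = max(votes.values())
tied = sorted(c for c, v in votes.items() if v == top)
print(dict(votes), "tied:", tied)
```

The function leaves the tie unresolved on purpose; how to break it (for example, by comparing probability estimates) is a separate design decision.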
Error-Correcting Output Coding
A potential problem with the previous two approaches is that they may be
sensitive to binary classification errors. For the 1-r approach given in
Example 4.11, if at least one of the binary classifiers makes a mistake in its
prediction, then the classifier may end up declaring a tie between classes or
making a wrong prediction. For example, suppose the test instance is
classified as (+, −, +, −) due to misclassification by the third classifier. In this
case, it will be difficult to tell whether the instance should be classified as y1
or y3, unless the probability associated with each class prediction is taken
into account.
The error-correcting output coding (ECOC) method provides a more robust
way for handling multiclass problems. The method is inspired by an
information-theoretic approach for sending messages across noisy channels.
The idea behind this approach is to add redundancy into the transmitted
message by means of a codeword, so that the receiver may detect errors in the
received message and perhaps recover the original message if the number of
errors is small.
For multiclass learning, each class yi is represented by a unique bit string of
length n known as its codeword. We then train n binary classifiers to predict
each bit of the codeword string. The predicted class of a test instance is
the class whose codeword is closest, in Hamming distance, to the codeword
produced by the binary classifiers. Recall that the Hamming distance between
a pair of bit strings is given by the number of bits that differ.
Example 4.13.
Consider a multiclass problem where Y={y1, y2, y3, y4}. Suppose we
encode the classes using the following seven bit codewords:
Class Codeword
y1 1 1 1 1 1 1 1
y2 0 0 0 0 1 1 1
y3 0 0 1 1 0 0 1
y4 0 1 0 1 0 1 0
Each bit of the codeword is used to train a binary classifier. If a test instance
is classified as (0,1,1,1,1,1,1) by the binary classifiers, then the Hamming
distance between the codeword and y1 is 1, while the Hamming distance to
the remaining classes is 3. The test instance is therefore classified as y1.
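The decoding step in this example can be sketched as a nearest-codeword search. The codewords are those of the example; the names `CODEWORDS`, `hamming`, and `decode` are assumptions made for illustration.

```python
# Seven-bit codewords from the example, one per class.
CODEWORDS = {
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(output, codewords):
    """Return the class whose codeword is nearest to the classifiers' output."""
    return min(codewords, key=lambda c: hamming(output, codewords[c]))

output = (0, 1, 1, 1, 1, 1, 1)     # bits predicted by the seven binary classifiers
distances = {c: hamming(output, w) for c, w in CODEWORDS.items()}
print(distances, "->", decode(output, CODEWORDS))
```

If two codewords were equally close, `min` would return whichever comes first in the dictionary, so a real system would need an explicit tie-breaking rule.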
An interesting property of an error-correcting code is that if the minimum
Hamming distance between any pair of codewords is d, then up to ⌊(d−1)/2⌋
errors in the output code can be corrected using its nearest codeword. In
Example 4.13, because the minimum Hamming distance between any pair of
codewords is 4, the classifier may tolerate errors made by one of the seven
binary classifiers. If there is more than one classifier that makes a mistake,
then the classifier may not be able to compensate for the error.
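The error-correcting capacity claimed for Example 4.13 can be verified directly. The following sketch, with assumed helper names (`hamming`, `nearest`), computes the minimum pairwise Hamming distance of the four codewords and then flips every single bit of every codeword to confirm that nearest-codeword decoding still recovers the right class.

```python
from itertools import combinations

CODEWORDS = {
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    """Number of positions in which two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(word):
    """Class whose codeword is closest to the given bit string."""
    return min(CODEWORDS, key=lambda c: hamming(word, CODEWORDS[c]))

# Minimum distance d over all pairs of codewords, and the correctable error count.
d = min(hamming(a, b) for a, b in combinations(CODEWORDS.values(), 2))
correctable = (d - 1) // 2

# Flip each bit of each codeword in turn and decode the corrupted word.
all_single_errors_fixed = all(
    nearest(code[:i] + (1 - code[i],) + code[i + 1:]) == cls
    for cls, code in CODEWORDS.items()
    for i in range(len(code))
)
print(d, correctable, all_single_errors_fixed)
```

With two flipped bits a corrupted word can sit at distance 2 from two different codewords, which is why no guarantee holds beyond a single error for this code.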
An important issue is how to design the appropriate set of codewords for
different classes. From coding theory, a vast number of algorithms have been
developed for generating n-bit codewords with bounded Hamming distance.
However, the discussion of these algorithms is beyond the scope of this book.
It is worthwhile mentioning that there is a significant difference between the
design of error-correcting codes for communication tasks compared to those
used for multiclass learning. For communication, the codewords should
maximize the Hamming distance between the rows so that error correction
can be performed. Multiclass learning, however, requires that both the
row-wise and column-wise distances of the codewords be well separated. A
larger column-wise distance ensures that the binary classifiers are mutually
independent, which is an important requirement for ensemble learning methods.
4.13 Bibliographic Notes
Mitchell [278] provides excellent coverage on many classification techniques
from a machine learning perspective. Extensive coverage on classification
can also be found in Aggarwal [195], Duda et al. [229], Webb [307],
Fukunaga [237], Bishop [204], Hastie et al. [249], Cherkassky and Mulier
[215], Witten and Frank [310], Hand et al. [247], Han and Kamber [244], and
Dunham [230].
Direct methods for rule-based classifiers typically employ the sequential
covering scheme for inducing classification rules. Holte’s 1R [255] is the
simplest form of a rule-based classifier because its rule set contains only a
single rule. Despite its simplicity, Holte found that for some data sets that
exhibit a strong one-to-one relationship between the attributes and the class
label, 1R performs just as well as other classifiers. Other examples of
rule-based classifiers include IREP [234], RIPPER [218], CN2 [216, 217], AQ
[276], RISE [224], and ITRULE [296]. Table 4.8 shows a comparison of the
characteristics of four of these classifiers.
Table 4.8. Comparison of various rule-based classifiers.
[Table body largely lost in extraction. Recoverable entries: column labels include "(ordered)" and "AQR"; evaluation metrics include FOIL's information gain, Laplace, and entropy; rule pruning is reduced error pruning for one classifier and none for the other three; search strategies are greedy and beam search.]
For rule-based classifiers, the rule antecedent can be generalized to include
any propositional or first-order logical expression (e.g., Horn clauses).
Readers who are interested in first-order logic rule-based classifiers may refer
to references such as [278] or the vast literature on inductive logic
programming [279]. Quinlan [287] proposed the C4.5rules algorithm for
extracting classification rules from decision trees. An indirect method for
extracting rules from artificial neural networks was given by Andrews et al.
in [198].
Cover and Hart [220] presented an overview of the nearest neighbor
classification method from a Bayesian perspective. Aha provided both
theoretical and empirical evaluations for instance-based methods in [196].
PEBLS, which was developed by Cost and Salzberg [219], is a nearest
neighbor classifier that can handle data sets containing nominal attributes.
Each training example in PEBLS is also assigned a weight factor that
depends on the number of times the example helps make a correct prediction.
Han et al. [243] developed a weight-adjusted nearest neighbor algorithm, in
which the feature weights are learned using a greedy, hill-climbing
optimization algorithm. A more recent survey of k-nearest neighbor
classification is given by Steinbach and Tan [298].
Naïve Bayes classifiers have been investigated by many authors, including
Langley et al. [267], Ramoni and Sebastiani [288], Lewis [270], and
Domingos and Pazzani [227]. Although the independence assumption used in
naïve Bayes classifiers may seem rather unrealistic, the method has worked
surprisingly well for applications such as text classification. Bayesian
networks provide a more flexible approach by allowing some of the attributes
to be interdependent. An excellent tutorial on Bayesian networks is given by
Heckerman in [252] and Jensen in [258]. Bayesian networks belong to a
broader class of models known as probabilistic graphical models. A formal
introduction to the relationships between graphs and probabilities was
presented in Pearl [283]. Other great resources on probabilistic graphical
models include books by Bishop [205], and Jordan [259]. Detailed
discussions of concepts such as d-separation and Markov blankets are
provided in Geiger et al. [238] and Russell and Norvig [291].
Generalized linear models (GLM) are a rich class of regression models that
have been extensively studied in the statistical literature. They were
formulated by Nelder and Wedderburn in 1972 [280] to unify a number of
regression models such as linear regression, logistic regression, and Poisson
regression, which share some similarities in their formulations. An extensive
discussion of GLMs is provided in the book by McCullagh and Nelder [274].
Artificial neural networks (ANN) have witnessed a long and winding history
of developments, involving multiple phases of stagnation and resurgence.
The idea of a mathematical model of a neural network was first introduced in
1943 by McCulloch and Pitts [275]. This led to a series of computational
machines to simulate a neural network based on the theory of neural
plasticity [289]. The perceptron, which is the simplest prototype of modern
ANNs, was developed by Rosenblatt in 1958 [290]. The perceptron uses a
single layer of processing units that can perform basic mathematical
operations such as addition and multiplication. However, the perceptron can
only learn linear decision boundaries and is guaranteed to converge only
when the classes are linearly separable. Despite the interest in learning
multi-layer networks to overcome the limitations of the perceptron, progress in this
area remained halted until the invention of the backpropagation algorithm by
Werbos in 1974 [309], which allowed for the quick training of multi-layer
ANNs using the gradient descent method. This led to an upsurge of interest in
the artificial intelligence (AI) community to develop multi-layer ANN
models, a trend that continued for more than a decade. Historically, ANNs
mark a paradigm shift in AI from approaches based on expert systems (where
knowledge is encoded using if-then rules) to machine learning approaches
(where the knowledge is encoded in the parameters of a computational
model). However, there were still a number of algorithmic and computational
challenges in learning large ANN models, which remained unresolved for a
long time. This hindered the development of ANN models to the scale
necessary for solving real-world problems. Slowly, ANNs started getting
outpaced by other classification models such as support vector machines,
which provided better performance as well as theoretical guarantees of
convergence and optimality. It is only recently that the challenges in learning
deep neural networks have been circumvented, owing to better computational
resources and a number of algorithmic improvements in ANNs since 2006.
This re-emergence of ANN has been dubbed as “deep learning,” which has
often outperformed existing classification models and gained widespread popularity.
Deep learning is a rapidly evolving area of research with a number of
potentially impactful contributions being made every year. Some of the
landmark advancements in deep learning include the use of large-scale
restricted Boltzmann machines for learning generative models of data [201,
253], the use of autoencoders and their variants (denoising autoencoders) for
learning robust feature representations [199, 305, 306], and sophisticated
architectures to promote sharing of parameters across nodes such as
convolutional neural networks for images [265, 268] and recurrent neural
networks for sequences [241, 242, 277]. Other major improvements include
the approach of unsupervised pretraining for initializing ANN models [232],
the dropout technique for regularization [254, 297], batch normalization for
fast learning of ANN parameters [256], and maxout networks for effective
usage of the dropout technique [240]. Even though the discussions in this
chapter on learning ANN models were centered around the gradient descent
method, most of the modern ANN models involving a large number of
hidden layers are trained using the stochastic gradient descent method since it
is highly scalable [207]. An extensive survey of deep learning approaches has
been presented in review articles by Bengio [200], LeCun et al. [269], and
Schmidhuber [293]. An excellent summary of deep learning approaches can
also be obtained from recent books by Goodfellow et al. [239] and Nielsen
Vapnik [303, 304] has written two authoritative books on Support Vector
Machines (SVM). Other useful resources on SVM and kernel methods
include the books by Cristianini and Shawe-Taylor [221] and Schölkopf and
Smola [294]. There are several survey articles on SVM, including those
written by Burges [212], Bennet et al. [202], Hearst [251], and Mangasarian
[272]. SVM can also be viewed as an L2 norm regularizer of the hinge loss
function, as described in detail by Hastie et al. [249]. The L1 norm
regularizer of the square loss function can be obtained using the least
absolute shrinkage and selection operator (Lasso), which was introduced by
Tibshirani in 1996 [301]. The Lasso has several interesting properties such as
the ability to simultaneously perform feature selection as well as
regularization, so that only a subset of features are selected in the final model.
An excellent review of Lasso can be obtained from a book by Hastie et al.
A survey of ensemble methods in machine learning was given by Dietterich
[222]. The bagging method was proposed by Breiman [209]. Freund and
Schapire [236] developed the AdaBoost algorithm. Arcing, which stands for
adaptive resampling and combining, is a variant of the boosting algorithm
proposed by Breiman [210]. It uses the non-uniform weights assigned to
training examples to resample the data for building an ensemble of training
sets. Unlike AdaBoost, the votes of the base classifiers are not weighted
when determining the class label of test examples. The random forest method
was introduced by Breiman in [211]. The concept of bias-variance
decomposition is explained in detail by Hastie et al. [249]. While the
bias-variance decomposition was initially proposed for regression problems with
squared loss function, a unified framework for classification problems
involving 0–1 losses was introduced by Domingos [226].
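The contrast drawn above between arcing (unweighted votes) and AdaBoost (weighted votes) can be sketched as follows. This is only an illustration of the two voting schemes, not of either training procedure. Labels are assumed to be in {-1, +1}, and the predictions and importance weights are hypothetical.

```python
def unweighted_vote(predictions):
    """Majority vote in which every base classifier counts equally."""
    total = sum(predictions)
    return 1 if total >= 0 else -1


def weighted_vote(predictions, alphas):
    """Vote in which each base classifier is scaled by its importance alpha."""
    total = sum(a * p for a, p in zip(alphas, predictions))
    return 1 if total >= 0 else -1


# Hypothetical outputs of three base classifiers on one test example.
preds = [+1, -1, -1]
# Hypothetical importance weights; the first classifier is the most reliable.
alphas = [1.5, 0.4, 0.6]

print(unweighted_vote(preds))        # two of three classifiers say -1
print(weighted_vote(preds, alphas))  # 1.5 - 0.4 - 0.6 = 0.5, so +1
```

The same base predictions lead to different ensemble labels once the votes are weighted.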
Related work on mining rare and imbalanced data sets can be found in the
survey papers written by Chawla et al. [214] and Weiss [308]. Sampling-based
methods for mining imbalanced data sets have been investigated by
many authors, such as Kubat and Matwin [266], Japkowicz [257], and
Drummond and Holte [228]. Joshi et al. [261] discussed the limitations of
boosting algorithms for rare class modeling. Other algorithms developed for
mining rare classes include SMOTE [213], PNrule [260], and CREDOS
[262].
Various alternative metrics that are well-suited for class imbalanced problems
are available. The precision, recall, and F1-measure are widely-used metrics
in information retrieval [302]. ROC analysis was originally used in signal
detection theory for performing aggregate evaluation over a range of score
thresholds. A method for comparing classifier performance using the convex
hull of ROC curves was suggested by Provost and Fawcett in [286]. Bradley
[208] investigated the use of area under the ROC curve (AUC) as a
performance metric for machine learning algorithms. Despite the vast body of
literature on optimizing the AUC measure in machine learning models, it is
well-known that AUC suffers from certain limitations. For example, the AUC
can be used to compare the quality of two classifiers only if the ROC curve of
one classifier strictly dominates the other. However, if the ROC curves of
two classifiers intersect at any point, then it is difficult to assess the relative
quality of classifiers using the AUC measure. An in-depth discussion of the
pitfalls in using AUC as a performance measure can be obtained in works by
Hand [245, 246], and Powers [284]. The AUC has also been considered to be
an incoherent measure of performance, i.e., it uses different scales while
comparing the performance of different classifiers, although a coherent
interpretation of AUC has been provided by Ferri et al. [235]. Berrar and
Flach [203] describe some of the common caveats in using the ROC curve for
clinical microarray research. An alternate approach for measuring the
aggregate performance of a classifier is the precision-recall (PR) curve,
which is especially useful in the presence of class imbalance [292].
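The difficulty with intersecting ROC curves noted above can be made concrete with a small sketch. The two ROC curves below are hypothetical, and the area is computed with the trapezoidal rule over (false positive rate, true positive rate) points. The helper names are illustrative only.

```python
def auc(points):
    """Trapezoidal area under a ROC curve given (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def tpr_at(points, fpr):
    """Linearly interpolated TPR of a ROC curve at a given FPR."""
    pts = sorted(points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= fpr <= x1 and x1 > x0:
            return y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
    raise ValueError("fpr outside the range covered by the curve")


# Two hypothetical classifiers whose ROC curves cross.
roc_a = [(0.0, 0.0), (0.2, 0.6), (1.0, 1.0)]
roc_b = [(0.0, 0.0), (0.4, 0.8), (1.0, 1.0)]

print(auc(roc_a), auc(roc_b))                  # both areas are about 0.70
print(tpr_at(roc_a, 0.1), tpr_at(roc_b, 0.1))  # A is better at low FPR
print(tpr_at(roc_a, 0.6), tpr_at(roc_b, 0.6))  # B is better at high FPR
```

Both curves enclose the same area, yet each classifier is preferable in a different operating region, which a single AUC value cannot express.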
An excellent tutorial on cost-sensitive learning can be found in a review
article by Ling and Sheng [271]. The properties of a cost matrix were
studied by Elkan [231]. Margineantu and Dietterich [273] examined
various methods for incorporating cost information into the C4.5 learning
algorithm, including wrapper methods, class distribution-based methods, and
loss-based methods. Other cost-sensitive learning methods that are
algorithm-independent include AdaCost [233], MetaCost [225], and costing [312].
Extensive literature is also available on the subject of multiclass learning.
This includes the works of Hastie and Tibshirani [248], Allwein et al. [197],
Kong and Dietterich [264], and Tax and Duin [300]. The error-correcting
output coding (ECOC) method was proposed by Dietterich and Bakiri [223].
They also investigated techniques for designing codes that are suitable
for solving multiclass problems.
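The decoding step of ECOC can be sketched in a few lines. Each class is assigned a binary codeword, one binary classifier predicts each bit, and the test example receives the class whose codeword is closest in Hamming distance to the predicted bit string. The codewords and predicted bits below are hypothetical, and the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(codewords, predicted_bits):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(codewords, key=lambda label: hamming(codewords[label], predicted_bits))


# Hypothetical 7-bit codewords for a four-class problem.
codewords = {
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}

# Hypothetical outputs of the seven binary classifiers; the first bit is wrong for y1.
predicted = (0, 1, 1, 1, 1, 1, 1)

print({label: hamming(code, predicted) for label, code in codewords.items()})
print(ecoc_decode(codewords, predicted))
```

Because the codewords are far apart from one another, a single erroneous bit still decodes to the intended class.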
Apart from exploring algorithms for traditional classification settings where
every instance has a single set of features with a unique categorical label,
there has been a lot of recent interest in non-traditional classification
paradigms, involving complex forms of inputs and outputs. For example, the
paradigm of multi-label learning allows for an instance to be assigned
multiple class labels rather than just one. This is useful in applications such
as object recognition in images, where a photo image may include more than
one classification object, such as grass, sky, trees, and mountains. A survey
on multi-label learning can be found in [313]. As another example, the
paradigm of multi-instance learning considers the problem where the
instances are available in the form of groups called bags, and training labels
are available at the level of bags rather than individual instances.
Multi-instance learning is useful in applications where an object can exist as
multiple instances in different states (e.g., the different isomers of a chemical
compound), and even if a single instance shows a specific characteristic, the
entire bag of instances associated with the object needs to be assigned the
relevant class. A survey on multi-instance learning is provided in [314].
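The bag-level labeling described above can be sketched under the common assumption that a bag is positive if at least one of its instances is positive. The instance scores, the threshold, and the bag names are hypothetical.

```python
def bag_score(instance_scores):
    """Score of a bag: the score of its most positive instance."""
    return max(instance_scores)


def classify_bag(instance_scores, threshold=0.5):
    """A bag is labeled positive if any single instance exceeds the threshold."""
    return bag_score(instance_scores) >= threshold


# Hypothetical bags: each list holds scores for different states of one object.
bags = {
    "compound_a": [0.1, 0.2, 0.9],  # one instance shows the characteristic
    "compound_b": [0.1, 0.3, 0.2],  # no instance shows it
}

labels = {name: classify_bag(scores) for name, scores in bags.items()}
print(labels)
```

Other aggregation rules are possible; taking the maximum is simply the most direct reading of the single-instance criterion.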
In a number of real-world applications, it is often the case that the training
labels are scarce in quantity, because of the high costs associated with
obtaining gold-standard supervision. However, we almost always have
abundant access to unlabeled test instances, which do not have supervised
labels but contain valuable information about the structure or distribution of
instances. Traditional learning algorithms, which only make use of the
labeled instances in the training set for learning the decision boundary, are
unable to exploit the information contained in unlabeled instances. In
contrast, learning algorithms that make use of the structure in the unlabeled
data for learning the classification model are known as semi-supervised
learning algorithms [315, 316]. The use of unlabeled data is also explored in
the paradigm of multi-view learning [299, 311], where every object is
observed in multiple views of the data, involving diverse sets of features. A
common strategy used by multi-view learning algorithms is co-training [206],
where a different model is learned for every view of the data, but the model
predictions from every view are constrained to be identical to each other on
the unlabeled test instances.
Another learning paradigm that is commonly explored when labeled
training data is the framework of active learning, which attempts to seek the
smallest set of label annotations to learn a reasonable classification model.
Active learning expects the annotator to be involved in the process of model
learning, so that the labels are requested incrementally over the most relevant
set of instances, given a limited budget of label annotations. For example, it
may be useful to obtain labels over instances closer to the decision boundary
that can play a bigger role in fine-tuning the boundary. A review on active
learning approaches can be found in [285, 295].
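One simple way to pick instances near the decision boundary is uncertainty sampling, sketched below for a binary problem. It assumes the current model supplies a probability of the positive class for each unlabeled instance. The probabilities and the budget are hypothetical, and this is only one of several possible query strategies.

```python
def select_queries(probabilities, budget):
    """Return the indices of the `budget` instances whose predicted
    probability of the positive class is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


# Hypothetical P(+) estimates for five unlabeled instances.
probs = [0.95, 0.52, 0.10, 0.45, 0.70]

# With a budget of two labels, the annotator is asked about the two
# instances the model is least sure of.
queries = select_queries(probs, budget=2)
print(queries)
```

After the requested labels arrive, the model would be retrained and the selection repeated until the labeling budget is exhausted.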
In some applications, it is important to simultaneously solve multiple learning
tasks together, where some of the tasks may be similar to one another. For
example, if we are interested in translating a passage written in English into
different languages, the tasks involving lexically similar languages (such as
Spanish and Portuguese) would require similar learning of models. The
paradigm of multi-task learning helps in simultaneously learning across all
tasks while sharing the learning among related tasks. This is especially useful
when some of the tasks do not contain sufficiently many training samples, in
which case borrowing the learning from other related tasks helps in the
learning of robust models. A special case of multi-task learning is transfer
learning, where the learning from a source task (with a sufficient number of
training samples) has to be transferred to a destination task (with a paucity of
training data). An extensive survey of transfer learning approaches is
provided by Pan and Yang [282].
Most classifiers assume every data instance must belong to a class, which is
not always true for some applications. For example, in malware detection,
due to the ease with which new malware is created, a classifier trained on
existing classes may fail to detect new ones even if the features of the new
malware are considerably different from those of existing malware.
Another example is in critical applications such as medical diagnosis, where
prediction errors are costly and can have severe consequences. In this
situation, it would be better for the classifier to refrain from making any
prediction on a data instance if it is unsure of its class. This approach, known
as a classifier with a reject option, does not need to classify every data instance
unless it determines the prediction is reliable (e.g., if the class probability
exceeds a user-specified threshold). Instances that are unclassified can be
presented to domain experts for further determination of their true class.
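A reject option based on a probability threshold can be sketched as follows. The class names, probabilities, and threshold are hypothetical. A return value of None stands for "no prediction", meaning the instance would be passed to a domain expert.

```python
def classify_with_reject(class_probs, threshold=0.8):
    """Return the most probable class if its probability exceeds the
    user-specified threshold; otherwise return None (reject)."""
    label = max(class_probs, key=class_probs.get)
    if class_probs[label] > threshold:
        return label
    return None


# Hypothetical class probabilities produced by some classifier.
confident = {"benign": 0.95, "malignant": 0.05}
uncertain = {"benign": 0.55, "malignant": 0.45}

print(classify_with_reject(confident))  # prediction is made
print(classify_with_reject(uncertain))  # prediction is withheld
```

Raising the threshold trades coverage for reliability: fewer instances are classified, but those that are classified carry higher confidence.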
Classifiers can also be distinguished in terms of how the classification model
is trained. A batch classifier assumes all the labeled instances are available
during training. This strategy is applicable when the training set size is not
too large and for stationary data, where the relationship between the attributes
and classes does not vary over time. An online classifier, on the other hand,
trains an initial model using a subset of the labeled data [263]. The model is
then updated incrementally as more labeled instances become available. This
strategy is effective when the training set is too large or when there is concept
drift due to changes in the distribution of the data over time.
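The incremental updating performed by an online classifier can be sketched with a perceptron that processes one labeled instance at a time. The perceptron is chosen here only for brevity and is not the kernel-based method of [263]. The class name, the data stream, and the learning rate are hypothetical.

```python
class OnlinePerceptron:
    """Linear classifier for labels in {-1, +1}, updated one instance at a time."""

    def __init__(self, n_features, learning_rate=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else -1

    def update(self, x, y):
        """Adjust the model only when the new instance is misclassified."""
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y


# Hypothetical stream of labeled instances arriving over time.
stream = [([2.0, 1.0], 1), ([-1.0, -2.0], -1), ([1.5, 0.5], 1), ([-2.0, -0.5], -1)]

model = OnlinePerceptron(n_features=2)
for x, y in stream:
    model.update(x, y)  # the full training set is never held in memory

print(model.w, model.b)
```

Because each update uses only the current instance, the same loop can keep running as new data arrives, which lets the model follow gradual changes in the data distribution.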
[195] C. C. Aggarwal. Data classification: algorithms and applications.
CRC Press, 2014.
[196] D. W. Aha. A study of instance-based algorithms for supervised
learning tasks: mathematical, empirical, and psychological evaluations.
PhD thesis, University of California, Irvine, 1990.
[197] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing Multiclass
to Binary: A Unifying Approach to Margin Classifiers. Journal of
Machine Learning Research, 1: 113–141, 2000.
[198] R. Andrews, J. Diederich, and A. Tickle. A Survey and Critique of
Techniques For Extracting Rules From Trained Artificial Neural
Networks. Knowledge Based Systems, 8(6):373–389, 1995.
[199] P. Baldi. Autoencoders, unsupervised learning, and deep
architectures. ICML unsupervised and transfer learning, 27(37-50):1,
[200] Y. Bengio. Learning deep architectures for AI. Foundations and
Trends® in Machine Learning, 2(1):1–127, 2009.
[201] Y. Bengio, A. Courville, and P. Vincent. Representation learning:
A review and new perspectives. IEEE transactions on pattern analysis
and machine intelligence, 35(8): 1798–1828, 2013.
[202] K. Bennett and C. Campbell. Support Vector Machines: Hype or
Hallelujah. SIGKDD Explorations, 2(2):1–13, 2000.
[203] D. Berrar and P. Flach. Caveats and pitfalls of ROC analysis in
clinical microarray research (and how to avoid them). Briefings in
bioinformatics, page bbr008, 2011.
[204] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford
University Press, Oxford, U.K., 1995.
[205] C. M. Bishop. Pattern Recognition and Machine Learning.
Springer, 2006.
[206] A. Blum and T. Mitchell. Combining labeled and unlabeled data
with co-training. In Proceedings of the eleventh annual conference on
Computational learning theory, pages 92–100. ACM, 1998.
[207] L. Bottou. Large-scale machine learning with stochastic gradient
descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer,
[208] A. P. Bradley. The use of the area under the ROC curve in the
Evaluation of Machine Learning Algorithms. Pattern Recognition,
30(7):1145–1149, 1997.
[209] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123–
140, 1996.
[210] L. Breiman. Bias, Variance, and Arcing Classifiers. Technical
Report 486, University of California, Berkeley, CA, 1996.
[211] L. Breiman. Random Forests. Machine Learning, 45(1):5–32,
[212] C. J. C. Burges. A Tutorial on Support Vector Machines for
Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121–
167, 1998.
[213] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer.
SMOTE: Synthetic Minority Over-sampling Technique. Journal of
Artificial Intelligence Research, 16: 321–357, 2002.
[214] N. V. Chawla, N. Japkowicz, and A. Kolcz. Editorial: Special
Issue on Learning from Imbalanced Data Sets. SIGKDD Explorations,
6(1):1–6, 2004.
[215] V. Cherkassky and F. Mulier. Learning from Data: Concepts,
Theory, and Methods. Wiley Interscience, 1998.
[216] P. Clark and R. Boswell. Rule Induction with CN2: Some Recent
Improvements. In Machine Learning: Proc. of the 5th European Conf.
(EWSL-91), pages 151–163, 1991.
[217] P. Clark and T. Niblett. The CN2 Induction Algorithm. Machine
Learning, 3(4): 261–283, 1989.
[218] W. W. Cohen. Fast Effective Rule Induction. In Proc. of the 12th
Intl. Conf. on Machine Learning, pages 115–123, Tahoe City, CA, July
[219] S. Cost and S. Salzberg. A Weighted Nearest Neighbor Algorithm
for Learning with Symbolic Features. Machine Learning, 10:57–78,
[220] T. M. Cover and P. E. Hart. Nearest Neighbor Pattern
Classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[221] N. Cristianini and J. Shawe-Taylor. An Introduction to Support
Vector Machines and Other Kernel-based Learning Methods.
Cambridge University Press, 2000.
[222] T. G. Dietterich. Ensemble Methods in Machine Learning. In First
Intl. Workshop on Multiple Classifier Systems, Cagliari, Italy, 2000.
[223] T. G. Dietterich and G. Bakiri. Solving Multiclass Learning
Problems via Error-Correcting Output Codes. Journal of Artificial
Intelligence Research, 2:263–286, 1995.
[224] P. Domingos. The RISE system: Conquering without separating.
In Proc. of the 6th IEEE Intl. Conf. on Tools with Artificial Intelligence,
pages 704–707, New Orleans, LA, 1994.
[225] P. Domingos. MetaCost: A General Method for Making
Classifiers Cost-Sensitive. In Proc. of the 5th Intl. Conf. on Knowledge
Discovery and Data Mining, pages 155–164, San Diego, CA, August
[226] P. Domingos. A unified bias-variance decomposition. In
Proceedings of 17th International Conference on Machine Learning,
pages 231–238, 2000.
[227] P. Domingos and M. Pazzani. On the Optimality of the Simple
Bayesian Classifier under Zero-One Loss. Machine Learning, 29(2-
3):103–130, 1997.
[228] C. Drummond and R. C. Holte. C4.5, Class imbalance, and Cost
sensitivity: Why under-sampling beats over-sampling. In ICML’2003
Workshop on Learning from Imbalanced Data Sets II, Washington, DC,
August 2003.
[229] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification.
John Wiley & Sons, Inc., New York, 2nd edition, 2001.
[230] M. H. Dunham. Data Mining: Introductory and Advanced Topics.
Prentice Hall, 2006.
[231] C. Elkan. The Foundations of Cost-Sensitive Learning. In Proc. of
the 17th Intl. Joint Conf. on Artificial Intelligence, pages 973–978,
Seattle, WA, August 2001.
[232] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent,
and S. Bengio. Why does unsupervised pre-training help deep learning?
Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[233] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. AdaCost:
misclassification cost-sensitive boosting. In Proc. of the 16th Intl. Conf.
on Machine Learning, pages 97–105, Bled, Slovenia, June 1999.
[234] J. Fürnkranz and G. Widmer. Incremental reduced error pruning.
In Proc. of the 11th Intl. Conf. on Machine Learning, pages 70–77, New
Brunswick, NJ, July 1994.
[235] C. Ferri, J. Hernández-Orallo, and P. A. Flach. A coherent
interpretation of AUC as a measure of aggregated classification
performance. In Proceedings of the 28th International Conference on
Machine Learning (ICML-11), pages 657–664, 2011.
[236] Y. Freund and R. E. Schapire. A decision-theoretic generalization
of on-line learning and an application to boosting. Journal of Computer
and System Sciences, 55(1): 119–139, 1997.
[237] K. Fukunaga. Introduction to Statistical Pattern Recognition.
Academic Press, New York, 1990.
[238] D. Geiger, T. S. Verma, and J. Pearl. d-separation: From theorems
to algorithms. arXiv preprint arXiv:1304.1505, 2013.
[239] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book
in preparation for MIT Press, 2016.
[240] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and
Y. Bengio. Maxout networks. ICML (3), 28:1319–1327, 2013.
[241] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and
J. Schmidhuber. A novel connectionist system for unconstrained
handwriting recognition. IEEE transactions on pattern analysis and
machine intelligence, 31(5):855–868, 2009.
[242] A. Graves and J. Schmidhuber. Offline handwriting recognition
with multidimensional recurrent neural networks. In Advances in neural
information processing systems, pages 545–552, 2009.
[243] E.-H. Han, G. Karypis, and V. Kumar. Text Categorization Using
Weight Adjusted k-Nearest Neighbor Classification. In Proc. of the 5th
Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Lyon,
France, 2001.
[244] J. Han and M. Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers, San Francisco, 2001.
[245] D. J. Hand. Measuring classifier performance: a coherent
alternative to the area under the ROC curve. Machine learning,
77(1):103–123, 2009.
[246] D. J. Hand. Evaluating diagnostic tests: the area under the ROC
curve and the balance of errors. Statistics in medicine, 29(14):1502–
1510, 2010.
[247] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining.
MIT Press, 2001.
[248] T. Hastie and R. Tibshirani. Classification by pairwise coupling.
Annals of Statistics, 26(2):451–471, 1998.
[249] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of
Statistical Learning: Data Mining, Inference, and Prediction. Springer,
2nd edition, 2009.
[250] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning
with sparsity: the lasso and generalizations. CRC Press, 2015.
[251] M. Hearst. Trends & Controversies: Support Vector Machines.
IEEE Intelligent Systems, 13(4):18–28, 1998.
[252] D. Heckerman. Bayesian Networks for Data Mining. Data Mining
and Knowledge Discovery, 1(1):79–119, 1997.
[253] G. E. Hinton and R. R. Salakhutdinov. Reducing the
dimensionality of data with neural networks. Science, 313(5786):504–
507, 2006.
[254] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.
R. Salakhutdinov. Improving neural networks by preventing
co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[255] R. C. Holte. Very Simple Classification Rules Perform Well on
Most Commonly Used Data sets. Machine Learning, 11:63–91, 1993.
[256] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep
network training by reducing internal covariate shift. arXiv preprint
arXiv:1502.03167, 2015.
[257] N. Japkowicz. The Class Imbalance Problem: Significance and
Strategies. In Proc. of the 2000 Intl. Conf. on Artificial Intelligence:
Special Track on Inductive Learning, volume 1, pages 111–117, Las
Vegas, NV, June 2000.
[258] F. V. Jensen. An introduction to Bayesian networks, volume 210.
UCL press London, 1996.
[259] M. I. Jordan. Learning in graphical models, volume 89. Springer
Science & Business Media, 1998.
[260] M. V. Joshi, R. C. Agarwal, and V. Kumar. Mining Needles in a
Haystack: Classifying Rare Classes via Two-Phase Rule Induction. In
Proc. of 2001 ACM-SIGMOD Intl. Conf. on Management of Data, pages
91–102, Santa Barbara, CA, June 2001.
[261] M. V. Joshi, R. C. Agarwal, and V. Kumar. Predicting rare
classes: can boosting make any weak learner strong? In Proc. of the 8th
Intl. Conf. on Knowledge Discovery and Data Mining, pages 297–306,
Edmonton, Canada, July 2002.
[262] M. V. Joshi and V. Kumar. CREDOS: Classification Using Ripple
Down Structure (A Case for Rare Classes). In Proc. of the SIAM Intl.
Conf. on Data Mining, pages 321–332, Orlando, FL, April 2004.
[263] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning
with kernels. IEEE transactions on signal processing, 52(8):2165–2176,
[264] E. B. Kong and T. G. Dietterich. Error-Correcting Output Coding
Corrects Bias and Variance. In Proc. of the 12th Intl. Conf. on Machine
Learning, pages 313–321, Tahoe City, CA, July 1995.
[265] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In Advances in
neural information processing systems, pages 1097–1105, 2012.
[266] M. Kubat and S. Matwin. Addressing the Curse of Imbalanced
Training Sets: One Sided Selection. In Proc. of the 14th Intl. Conf. on
Machine Learning, pages 179–186, Nashville, TN, July 1997.
[267] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian
classifiers. In Proc. of the 10th National Conf. on Artificial Intelligence,
pages 223–228, 1992.
[268] Y. LeCun and Y. Bengio. Convolutional networks for images,
speech, and time series. The handbook of brain theory and neural
networks, 3361(10):1995, 1995.
[269] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature,
521(7553):436–444, 2015.
[270] D. D. Lewis. Naive Bayes at Forty: The Independence
Assumption in Information Retrieval. In Proc. of the 10th European
Conf. on Machine Learning (ECML 1998), pages 4–15, 1998.
[271] C. X. Ling and V. S. Sheng. Cost-sensitive learning. In
Encyclopedia of Machine Learning, pages 231–235. Springer, 2011.
[272] O. Mangasarian. Data Mining via Support Vector Machines.
Technical Report 01-05, Data Mining Institute, May
[273] D. D. Margineantu and T. G. Dietterich. Learning Decision Trees
for Loss Minimization in Multi-Class Problems. Technical Report 99-
30-03, Oregon State University, 1999.
[274] P. McCullagh and J. A. Nelder. Generalized linear models,
volume 37. CRC press, 1989.
[275] W. S. McCulloch and W. Pitts. A logical calculus of the ideas
immanent in nervous activity. The bulletin of mathematical biophysics,
5(4):115–133, 1943.
[276] R. S. Michalski, I. Mozetic, J. Hong, and N. Lavrac. The Multi-
Purpose Incremental Learning System AQ15 and Its Testing Application
to Three Medical Domains. In Proc. of 5th National Conf. on Artificial
Intelligence, Orlando, August 1986.
[277] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S.
Khudanpur. Recurrent neural network based language model. In
Interspeech, volume 2, page 3, 2010.
[278] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[279] S. Muggleton. Foundations of Inductive Logic Programming.
Prentice Hall, Englewood Cliffs, NJ, 1995.
[280] J. A. Nelder and R. J. Baker. Generalized linear models.
Encyclopedia of statistical sciences, 1972.
[281] M. A. Nielsen. Neural networks and deep learning. Published
online: http://neuralnetworksanddeeplearning.com/ (visited: 10.15.2016), 2015.
[282] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE
Transactions on knowledge and data engineering, 22(10):1345–1359,
[283] J. Pearl. Probabilistic reasoning in intelligent systems: networks of
plausible inference. Morgan Kaufmann, 2014.
[284] D. M. Powers. The problem of area under the curve. In 2012 IEEE
International Conference on Information Science and Technology, pages
567–573. IEEE, 2012.
[285] M. Prince. Does active learning work? A review of the research.
Journal of engineering education, 93(3):223–231, 2004.
[286] F. J. Provost and T. Fawcett. Analysis and Visualization of
Classifier Performance: Comparison under Imprecise Class and Cost
Distributions. In Proc. of the 3rd Intl. Conf. on Knowledge Discovery
and Data Mining, pages 43–48, Newport Beach, CA, August 1997.
[287] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan
Kaufmann Publishers, San Mateo, CA, 1993.
[288] M. Ramoni and P. Sebastiani. Robust Bayes classifiers. Artificial
Intelligence, 125: 209–226, 2001.
[289] N. Rochester, J. Holland, L. Haibt, and W. Duda. Tests on a cell
assembly theory of the action of the brain, using a large digital
computer. IRE Transactions on information Theory, 2(3):80–93, 1956.
[290] F. Rosenblatt. The perceptron: a probabilistic model for
information storage and organization in the brain. Psychological review,
65(6):386, 1958.
[291] S. J. Russell, P. Norvig, J. F. Canny, J. M. Malik, and D. D.
Edwards. Artificial intelligence: a modern approach, volume 2. Prentice
hall Upper Saddle River, 2003.
[292] T. Saito and M. Rehmsmeier. The precision-recall plot is more
informative than the ROC plot when evaluating binary classifiers on
imbalanced datasets. PloS one, 10(3): e0118432, 2015.
[293] J. Schmidhuber. Deep learning in neural networks: An overview.
Neural Networks, 61:85–117, 2015.
[294] B. Schölkopf and A. J. Smola. Learning with Kernels: Support
Vector Machines, Regularization, Optimization, and Beyond. MIT Press,
[295] B. Settles. Active learning literature survey. University of
Wisconsin, Madison, 52 (55-66):11, 2010.
[296] P. Smyth and R. M. Goodman. An Information Theoretic
Approach to Rule Induction from Databases. IEEE Trans. on Knowledge
and Data Engineering, 4(4):301–316, 1992.
[297] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R.
Salakhutdinov. Dropout: a simple way to prevent neural networks from
overfitting. Journal of Machine Learning Research, 15(1):1929–1958,
[298] M. Steinbach and P.-N. Tan. kNN: k-Nearest Neighbors. In X. Wu
and V. Kumar, editors, The Top Ten Algorithms in Data Mining.
Chapman and Hall/CRC Reference, 1st edition, 2009.
[299] S. Sun. A survey of multi-view machine learning. Neural
Computing and Applications, 23(7-8):2031–2038, 2013.
[300] D. M. J. Tax and R. P. W. Duin. Using Two-Class Classifiers for
Multiclass Classification. In Proc. of the 16th Intl. Conf. on Pattern
Recognition (ICPR 2002), pages 124–127, Quebec, Canada, August
[301] R. Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society. Series B (Methodological), pages
267–288, 1996.
[302] C. J. van Rijsbergen. Information Retrieval.
Butterworth-Heinemann, Newton, MA, 1978.
[303] V. Vapnik. The Nature of Statistical Learning Theory. Springer
Verlag, New York, 1995.
[304] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New
York, 1998.
[305] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol.
Extracting and composing robust features with denoising autoencoders.
In Proceedings of the 25th international conference on Machine
learning, pages 1096–1103. ACM, 2008.
[306] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A.
Manzagol. Stacked denoising autoencoders: Learning useful
representations in a deep network with a local denoising criterion.
Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[307] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons,
2nd edition, 2002.
[308] G. M. Weiss. Mining with Rarity: A Unifying Framework.
SIGKDD Explorations, 6 (1):7–19, 2004.
[309] P. Werbos. Beyond regression: new tools for prediction and
analysis in the behavioral sciences. PhD thesis, Harvard University,
[310] I. H. Witten and E. Frank. Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations. Morgan
Kaufmann, 1999.
[311] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv
preprint arXiv:1304.5634, 2013.
[312] B. Zadrozny, J. C. Langford, and N. Abe. Cost-Sensitive Learning
by Cost-Proportionate Example Weighting. In Proc. of the 2003 IEEE
Intl. Conf. on Data Mining, pages 435–442, Melbourne, FL, August
[313] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning
algorithms. IEEE transactions on knowledge and data engineering,
26(8):1819–1837, 2014.
[314] Z.-H. Zhou. Multi-instance learning: A survey. Department of
Computer Science & Technology, Nanjing University, Tech. Rep, 2004.
[315] X. Zhu. Semi-supervised learning. In Encyclopedia of machine
learning, pages 892–897. Springer, 2011.
[316] X. Zhu and A. B. Goldberg. Introduction to semi-supervised
learning. Synthesis lectures on artificial intelligence and machine
learning, 3(1):1–130, 2009.
4.14 Exercises
1. Consider a binary classification problem with the following set of
attributes and attribute values:
Air Conditioner={Working, Broken}
Engine={Good, Bad}
Mileage={High, Medium, Low}
Rust={Yes, No}
Suppose a rule-based classifier produces the following rule set:
Mileage=High → Mileage=High
Mileage=Low → Value=High
Air Conditioner
(a) Are the rules mutually exclusive?
(b) Is the rule set exhaustive?
(c) Is ordering needed for this set of rules?
(d) Do you need a default class for the rule set?
2. The RIPPER algorithm (by Cohen [218]) is an extension of an earlier
algorithm called IREP (by Fürnkranz and Widmer [234]). Both
algorithms apply the reduced-error pruning method to determine
whether a rule needs to be pruned. The reduced error pruning method
uses a validation set to estimate the generalization error of a classifier.
Consider the following pair of rules:
R2 is obtained by adding a new conjunct, B, to the left-hand side of R1.
For this question, you will be asked to determine whether R2 is
preferred over R1 from the perspectives of rule-growing and rule-pruning.
To determine whether a rule should be pruned, IREP computes
the following measure:
vIREP = (p + (N − n)) / (P + N),
where P is the total number of positive examples in the validation set, N
is the total number of negative examples in the validation set, p is the
number of positive examples in the validation set covered by the rule,
and n is the number of negative examples in the validation set covered
by the rule. vIREP is actually similar to classification accuracy for the
validation set. IREP favors rules that have higher values of vIREP. On
the other hand, RIPPER applies the following measure to determine
whether a rule should be pruned:
vRIPPER = (p − n) / (p + n).
(a) Suppose R1 covers 350 positive examples and 150 negative
examples, while R2 covers 300 positive examples and 50
negative examples. Compute FOIL’s information gain for the
rule R2 with respect to R1.
(b) Consider a validation set that contains 500 positive examples and
500 negative examples. For R1, suppose the number of positive
examples covered by the rule is 200, and the number of negative
examples covered by the rule is 50. For R2, suppose the number of
positive examples covered by the rule is 100 and the number of
negative examples is 5. Compute vIREP for both rules. Which rule
does IREP prefer?
(c) Compute vRIPPER for the previous problem. Which rule does
RIPPER prefer?
3. C4.5rules is an implementation of an indirect method for generating
rules from a decision tree. RIPPER is an implementation of a direct
method for generating rules directly from data.
(a) Discuss the strengths and weaknesses of both methods.
(b) Consider a data set that has a large difference in the class size (i.e.,
some classes are much bigger than others). Which method
(between C4.5rules and RIPPER) is better in terms of finding high
accuracy rules for the small classes?
4. Consider a training set that contains 100 positive examples and 400
negative examples. For each of the following candidate rules,
R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),
determine which is the best and worst candidate rule according to:
(a) Rule accuracy.
(b) FOIL’s information gain.
(c) The likelihood ratio statistic.
(d) The Laplace measure.
(e) The m-estimate measure (with k=2 and p+=0.2).
5. Figure 4.3 illustrates the coverage of the classification rules R1, R2,
and R3. Determine which is the best and worst rule according to:
(a) The likelihood ratio statistic.
(b) The Laplace measure.
(c) The m-estimate measure (with k=2 and p+=0.58).
(d) The rule accuracy after R1 has been discovered, where none of the
examples covered by R1 are discarded.
(e) The rule accuracy after R1 has been discovered, where only the
positive examples covered by R1 are discarded.
(f) The rule accuracy after R1 has been discovered, where both
positive and negative examples covered by R1 are discarded.
6.
1. Suppose the fraction of undergraduate students who smoke is 15%
and the fraction of graduate students who smoke is 23%. If one-fifth
of the college students are graduate students and the rest are
undergraduates, what is the probability that a student who smokes
is a graduate student?
2. Given the information in part (a), is a randomly chosen college
student more likely to be a graduate or undergraduate student?
3. Repeat part (b) assuming that the student is a smoker.
4. Suppose 30% of the graduate students live in a dorm but only 10%
of the undergraduate students live in a dorm. If a student smokes
and lives in the dorm, is he or she more likely to be a graduate or
undergraduate student? You can assume independence between
students who live in a dorm and those who smoke.
7. Consider the data set shown in Table 4.9.
Table 4.9. Data set for
Exercise 7.
Instance A B C Class
1 0 0 0 +
2 0 0 1 −
3 0 1 1 −
4 0 1 1 −
5 0 0 1 +
6 1 0 1 +
7 1 0 1 −
8 1 0 1 −
9 1 1 1 +
10 1 0 1 +
1. Estimate the conditional probabilities for
P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).
2. Use the estimate of conditional probabilities given in the previous
question to predict the class label for a test sample
(A=0, B=1, C=0) using the naïve Bayes approach.
3. Estimate the conditional probabilities using the m-estimate
approach, with p=1/2 and m=4.
4. Repeat part (b) using the conditional probabilities given in part (c).
5. Compare the two methods for estimating probabilities. Which
method is better and why?
8. Consider the data set shown in Table 4.10.
Table 4.10. Data set for
Exercise 8.
Instance A B C Class
1 0 0 1 −
2 1 0 1 +
3 0 1 0 −
4 1 0 0 −
5 1 0 1 +
6 0 0 1 +
7 1 1 0 −
8 0 0 0 −
9 0 1 0 +
10 1 1 1 +
1. Estimate the conditional probabilities for
P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−)
using the same approach as in the previous problem.
2. Use the conditional probabilities in part (a) to predict the class label
for a test sample (A=1, B=1, C=1) using the naïve Bayes approach.
3. Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationships
between A and B.
4. Repeat the analysis in part (c) using P(A=1), P(B=0), and
P(A=1, B=0).
5. Compare P(A=1, B=1|Class=+) against P(A=1|Class=+) and
P(B=1|Class=+). Are the variables conditionally independent given
the class?
9.
1. Explain how naïve Bayes performs on the data set shown in Figure
4.56.
2. If each class is further divided such that there are four classes (A1,
A2, B1, and B2), will naïve Bayes perform better?
3. How will a decision tree perform on this data set (for the two-class
problem)? What if there are four classes?
10. Figure 4.57 illustrates the Bayesian network for the data set shown
in Table 4.11. (Assume that all the attributes are binary).
1. Draw the probability table for each node in the network.
2. Use the Bayesian network to compute
P(Engine=Bad, Air Conditioner=Broken).
11. Given the Bayesian network shown in Figure 4.58, compute the
following probabilities:
Figure 4.56.
Data set for Exercise 9.
Figure 4.57.
Bayesian network.
1. P(B=good,F=empty, G=empty, S=yes).
2. P(B=bad,F=empty, G=not empty, S=no).
3. Given that the battery is bad, compute the probability that the car
will start.
12. Consider the one-dimensional data set shown in Table 4.12.
1. Classify the data point x=5.0 according to its 1-, 3-, 5-, and 9-
nearest neighbors (using majority vote).
2. Repeat the previous analysis using the distance-weighted voting
approach described in Section 4.3.1.
Table 4.11. Data set for Exercise 10.
Mileage  Engine  Air Conditioner  Number of Instances   Number of Instances
                                  with Car Value=Hi     with Car Value=Lo
Hi       Good    Working          3                     4
Hi       Good    Broken           1                     2
Hi       Bad     Working          1                     5
Hi       Bad     Broken           0                     4
Lo       Good    Working          9                     0
Lo       Good    Broken           5                     1
Lo       Bad     Working          1                     2
Lo       Bad     Broken           0                     2
Figure 4.58.
Bayesian network for Exercise 11.
13. The nearest neighbor algorithm described in Section 4.3 can be
extended to handle nominal attributes. A variant of the algorithm called
PEBLS (Parallel Exemplar-Based Learning System) by Cost and
Salzberg [219] measures the distance between two values of a nominal
attribute using the modified value difference metric (MVDM). Given a
pair of nominal attribute values, V1 and V2, the distance between them
is defined as follows:
$d(V_1, V_2) = \sum_{i=1}^{k} \left| \frac{n_{i1}}{n_1} - \frac{n_{i2}}{n_2} \right|,$ (4.108)
where nij is the number of examples from class i with attribute value Vj
and nj is the number of examples with attribute value Vj.
Table 4.12. Data set for
Exercise 12.
x 0.5 3.0 4.5 4.6 4.9 5.2 5.3 5.5 7.0 9.5
y − − + + + − − + − −
Consider the training set for the loan classification problem shown in
Figure 4.8. Use the MVDM measure to compute the distance between
every pair of attribute values for the Home Owner and Marital Status
attributes.
14. For each of the Boolean functions given below, state whether the
problem is linearly separable.
3. (A OR B) AND (A OR C)
4. (A XOR B) AND (A OR B)
15.
1. Demonstrate how the perceptron model can be used to represent the
AND and OR functions between a pair of Boolean variables.
2. Comment on the disadvantage of using linear functions as
activation functions for multi-layer neural networks.
16. You are asked to evaluate the performance of two classification
models, M1 and M2. The test set you have chosen contains 26 binary
attributes, labeled as A through Z. Table 4.13 shows the posterior
probabilities obtained by applying the models to the test set. (Only the
posterior probabilities for the positive class are shown). As this is a
two-class problem, P(−)=1−P(+) and P(−|A, …, Z)=1−P(+|A, …, Z). Assume
that we are mostly interested in detecting instances from the positive
class.
1. Plot the ROC curve for both M1 and M2. (You should plot them on
the same graph.) Which model do you think is better? Explain your
2. For model M1, suppose you choose the cutoff threshold to be t=0.5.
In other words, any test instances whose posterior probability is
greater than t will be classified as a positive example. Compute the
precision, recall, and F-measure for the model at this threshold
3. Repeat the analysis for part (b) using the same cutoff threshold on
model M2. Compare the F-measure results for both models. Which
model is better? Are the results consistent with what you expect
from the ROC curve?
4. Repeat part (b) for model M1 using the threshold t=0.1. Which
threshold do you prefer, t=0.5 or t=0.1? Are the results consistent
with what you expect from the ROC curve?
Table 4.13. Posterior probabilities for Exercise 16.
Instance True Class P(+|A, …, Z, M1) P(+|A, …, Z, M2)
1 + 0.73 0.61
2 + 0.69 0.03
3 − 0.44 0.68
4 − 0.55 0.31
5 + 0.67 0.45
6 + 0.47 0.09
7 − 0.08 0.38
8 − 0.15 0.05
9 + 0.45 0.01
10 − 0.35 0.04
17. Following is a data set that contains two attributes, X and Y, and two
class labels, “+” and “−”. Each attribute can take three different values:
0, 1, or 2.
X Y Number of Instances
+ −
0 0 0 100
1 0 0 0
2 0 0 100
0 1 10 100
1 1 10 0
2 1 10 100
0 2 0 100
1 2 0 0
2 2 0 100
The concept for the “+” class is Y=1 and the concept for the “−” class is
1. Build a decision tree on the data set. Does the tree capture the “+”
and “−” concepts?
2. What are the accuracy, precision, recall, and F1-measure of the
decision tree? (Note that precision, recall, and F1-measure are
defined with respect to the “+” class.)
3. Build a new decision tree with the following cost function:
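$C(i, j) = \begin{cases} 0, & \text{if } i = j; \\ 1, & \text{if } i = +,\ j = -; \\ \dfrac{\text{Number of } - \text{ instances}}{\text{Number of } + \text{ instances}}, & \text{if } i = -,\ j = +. \end{cases}$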
(Hint: only the leaves of the old decision tree need to be changed.)
Does the decision tree capture the “+” concept?
4. What are the accuracy, precision, recall, and F1-measure of the new
decision tree?
18. Consider the task of building a classifier from random data, where
the attribute values are generated randomly irrespective of the class
labels. Assume the data set contains instances from two classes, “+” and
“−.” Half of the data set is used for training while the remaining half is
used for testing.
1. Suppose there are an equal number of positive and negative
instances in the data and the decision tree classifier predicts every
test instance to be positive. What is the expected error rate of the
classifier on the test data?
2. Repeat the previous analysis assuming that the classifier predicts
each test instance to be positive class with probability 0.8 and
negative class with probability 0.2.
3. Suppose two-thirds of the data belong to the positive class and the
remaining one-third belong to the negative class. What is the
expected error of a classifier that predicts every test instance to be
positive?
4. Repeat the previous analysis assuming that the classifier predicts
each test instance to be positive class with probability 2/3 and
negative class with probability 1/3.
19. Derive the dual Lagrangian for the linear SVM with non-separable
data where the objective function is
$f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C\left(\sum_{i=1}^{N} \xi_i\right)^2.$
20. Consider the XOR problem where there are four training points:
(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).
Transform the data into the following feature space:
$\varphi = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2, x_1^2, x_2^2).$
Find the maximum margin linear decision boundary in the transformed
space.
21. Given the data sets shown in Figure 4.59, explain how the decision
tree, naïve Bayes, and k-nearest neighbor classifiers would perform on
these data sets.

Figure 4.59.
Data set for Exercise 21.
5 Association Analysis: Basic
Concepts and Algorithms
Many business enterprises accumulate large quantities of data from their
day-to-day operations. For example, huge amounts of customer purchase data are
collected daily at the checkout counters of grocery stores. Table 5.1 gives an
example of such data, commonly known as market basket transactions.
Each row in this table corresponds to a transaction, which contains a unique
identifier labeled TID and a set of items bought by a given customer.
Retailers are interested in analyzing the data to learn about the purchasing
behavior of their customers. Such valuable information can be used to
support a variety of business-related applications such as marketing
promotions, inventory management, and customer relationship management.
Table 5.1. An example of
market basket transactions.
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
This chapter presents a methodology known as association analysis, which is
useful for discovering interesting relationships hidden in large data sets. The
uncovered relationships can be represented in the form of sets of items
present in many transactions, which are known as frequent itemsets, or
association rules, that represent relationships between two itemsets. For
example, the following rule can be extracted from the data set shown in Table
5.1:
{Diapers}→{Beer}.
The rule suggests a relationship between the sale of diapers and beer because
many customers who buy diapers also buy beer. Retailers can use these types
of rules to help them identify new opportunities for cross-selling their
products to the customers.
Besides market basket data, association analysis is also applicable to data
from other application domains such as bioinformatics, medical diagnosis,
web mining, and scientific data analysis. In the analysis of Earth science data,
for example, association patterns may reveal interesting connections among
the ocean, land, and atmospheric processes. Such information may help Earth
scientists develop a better understanding of how the different elements of the
Earth system interact with each other. Even though the techniques presented
here are generally applicable to a wider variety of data sets, for illustrative
purposes, our discussion will focus mainly on market basket data.
There are two key issues that need to be addressed when applying association
analysis to market basket data. First, discovering patterns from a large
transaction data set can be computationally expensive. Second, some of the
discovered patterns may be spurious (happen simply by chance) and even for
non-spurious patterns, some are more interesting than others. The remainder
of this chapter is organized around these two issues. The first part of the
chapter is devoted to explaining the basic concepts of association analysis
and the algorithms used to efficiently mine such patterns. The second part of
the chapter deals with the issue of evaluating the discovered patterns in order
to help prevent the generation of spurious results and to rank the patterns in
terms of some interestingness measure.
5.1 Preliminaries
This section reviews the basic terminology used in association analysis and
presents a formal description of the task.
Binary Representation
Market basket data can be represented in a binary format as shown in Table
5.2, where each row corresponds to a transaction and each column
corresponds to an item. An item can be treated as a binary variable whose
value is one if the item is present in a transaction and zero otherwise. Because
the presence of an item in a transaction is often considered more important
than its absence, an item is an asymmetric binary variable. This
representation is a simplistic view of real market basket data because it
ignores important aspects of the data such as the quantity of items sold or the
price paid to purchase them. Methods for handling such non-binary data will
be explained in Chapter 6.
Table 5.2. A binary 0/1
representation of market
basket data.
TID Bread Milk Diapers Beer Eggs Cola
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1
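The conversion from a list of baskets to this 0/1 layout can be sketched in a few lines of Python. The helper name to_binary and the use of dictionaries and lists are assumptions made only for illustration; the column order follows Table 5.2.

```python
# Market basket transactions keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Cola"},
}
items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]


def to_binary(transactions, items):
    """Map each TID to a 0/1 row: 1 if the item is in the basket, else 0."""
    return {
        tid: [1 if item in basket else 0 for item in items]
        for tid, basket in transactions.items()
    }


binary = to_binary(transactions, items)
for tid, row in binary.items():
    print(tid, row)
```

Note that the binary rows record only presence or absence, so quantities and prices are lost, as the text points out.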
Itemset and Support Count
Let I={i1, i2, … , id} be the set of all items in a market basket data set and
T={t1, t2, …, tN} be the set of all transactions. Each transaction ti contains a
subset of items chosen from I. In association analysis, a collection of zero or
more items is termed an itemset. If an itemset contains k items, it is called a
k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset.
The null (or empty) set is an itemset that does not contain any items.
A transaction tj is said to contain an itemset X if X is a subset of tj. For
example, the second transaction shown in Table 5.2 contains the itemset
{Bread, Diapers} but not {Bread, Milk}. An important property of an
itemset is its support count, which refers to the number of transactions that
contain a particular itemset. Mathematically, the support count, σ(X), for an
itemset X can be stated as follows:
σ(X)=|{ti|X⊆ti, ti∈T}|,
where the symbol |⋅| denotes the number of elements in a set. In the data set
shown in Table 5.2, the support count for {Beer, Diapers, Milk} is equal to
two because there are only two transactions that contain all three items.
Often, the property of interest is the support, which is the fraction of
transactions in which an itemset occurs:
$s(X) = \frac{\sigma(X)}{N}.$
An itemset X is called frequent if s(X) is greater than some user-defined
threshold, minsup.
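As a minimal sketch, the support count and the support of an itemset can be computed directly from their definitions. The function names support_count and support are hypothetical, and transactions are assumed to be stored as Python sets.

```python
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset, transactions):
    """Number of transactions t such that itemset is a subset of t."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


print(support_count({"Beer", "Diapers", "Milk"}, TRANSACTIONS))  # 2
print(support({"Beer", "Diapers", "Milk"}, TRANSACTIONS))        # 0.4
```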
Association Rule
An association rule is an implication expression of the form X→Y, where X
and Y are disjoint itemsets, i.e., X∩Y=∅. The strength of an association rule
can be measured in terms of its support and confidence. Support determines
how often a rule is applicable to a given data set, while confidence
determines how frequently items in Y appear in transactions that contain X.
The formal definitions of these metrics are
Support, $s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}$; (5.1)
Confidence, $c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$. (5.2)
Example 5.1.
Consider the rule {Milk, Diapers}→{Beer}. Because the support count for
{Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule’s
support is 2/5=0.4. The rule’s confidence is obtained by dividing the support
count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}.
Since there are 3 transactions that contain milk and diapers, the confidence
for this rule is 2/3=0.67.
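The arithmetic in this example can be reproduced with a short sketch. The function rule_metrics is a hypothetical helper, and it assumes the antecedent X occurs in at least one transaction so that the confidence is defined.

```python
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence) of the rule lhs -> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)  # sigma(X u Y)
    n_lhs = sum(1 for t in transactions if lhs <= t)           # sigma(X)
    return n_both / len(transactions), n_both / n_lhs


s, c = rule_metrics({"Milk", "Diapers"}, {"Beer"}, TRANSACTIONS)
print(f"support={s:.2f}, confidence={c:.2f}")  # support=0.40, confidence=0.67
```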
Why Use Support and Confidence?
Support is an important measure because a rule that has very low support
might occur simply by chance. Also, from a business perspective a low
support rule is unlikely to be interesting because it might not be profitable to
promote items that customers seldom buy together (with the exception of the
situation described in Section 5.8). For these reasons, we are interested in
finding rules whose support is greater than some user-defined threshold. As
will be shown in Section 5.2.1, support also has a desirable property that can
be exploited for the efficient discovery of association rules.
Confidence, on the other hand, measures the reliability of the inference made
by a rule. For a given rule X→Y, the higher the confidence, the more likely it
is for Y to be present in transactions that contain X. Confidence also provides
an estimate of the conditional probability of Y given X.
Association analysis results should be interpreted with caution. The inference
made by an association rule does not necessarily imply causality. Instead, it
can sometimes suggest a strong co-occurrence relationship between items in
the antecedent and consequent of the rule. Causality, on the other hand,
requires knowledge about which attributes in the data capture cause and
effect, and typically involves relationships occurring over time (e.g.,
greenhouse gas emissions lead to global warming). See Section 5.7.1 for
additional discussion.
Formulation of the Association Rule
Mining Problem
The association rule mining problem can be formally stated as follows:
Definition 5.1. (Association Rule Discovery).
Given a set of transactions T , find all the rules having support ≥ minsup and
confidence ≥ minconf, where minsup and minconf are the corresponding
support and confidence thresholds.
A brute-force approach for mining association rules is to compute the support
and confidence for every possible rule. This approach is prohibitively
expensive because there are exponentially many rules that can be extracted
from a data set. More specifically, assuming that neither the left nor the
right-hand side of the rule is an empty set, the total number of possible
rules, R, extracted from a data set that contains d items is
$R = 3^d - 2^{d+1} + 1.$ (5.3)
The proof for this equation is left as an exercise to the readers (see Exercise 5
on page 440). Even for the small data set shown in Table 5.1, this approach
requires us to compute the support and confidence for $3^6 - 2^7 + 1 = 602$ rules.
More than 80% of the rules are discarded after applying minsup=20% and
minconf=50%, thus wasting most of the computations. To avoid performing
needless computations, it would be useful to prune the rules early without
having to compute their support and confidence values.
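The rule count can be checked empirically by enumerating every pair of non-empty, disjoint itemsets (X, Y) over d items. The sketch below is only a sanity check of the closed-form expression for small d, not a proof; the function name count_rules is an assumption.

```python
from itertools import combinations


def count_rules(items):
    """Count rules X -> Y with X and Y non-empty and disjoint."""
    items = list(items)
    count = 0
    # The antecedent may use at most d-1 items so the consequent is non-empty.
    for k in range(1, len(items)):
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            for m in range(1, len(rest) + 1):
                count += sum(1 for _ in combinations(rest, m))
    return count


for d in range(2, 7):
    assert count_rules(range(d)) == 3**d - 2**(d + 1) + 1

print(count_rules(range(6)))  # 602
```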
An initial step toward improving the performance of association rule mining
algorithms is to decouple the support and confidence requirements. From
Equation 5.1, notice that the support of a rule X→Y is the same as the
support of its corresponding itemset, X∪Y. For example, the following rules
have identical support because they involve items from the same itemset,
{Beer, Diapers, Milk}:
{Beer, Diapers}→{Milk},{Beer, Milk}→{Diapers},{Diapers, Milk}
→{Beer},{Beer}→{Diapers, Milk},{Milk}→{Beer, Diapers},{Diapers}
→{Beer, Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned
immediately without our having to compute their confidence values.
Therefore, a common strategy adopted by many association rule mining
algorithms is to decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets
that satisfy the minsup threshold.
2. Rule Generation, whose objective is to extract all the high confidence
rules from the frequent itemsets found in the previous step. These rules
are called strong rules.
The computational requirements for frequent itemset generation are generally
more expensive than those of rule generation. Efficient techniques for
generating frequent itemsets and association rules are discussed in Sections
5.2 and 5.3, respectively.
5.2 Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets.
Figure 5.1 shows an itemset lattice for I={a, b, c, d, e}. In general, a data set
that contains k items can potentially generate up to $2^k - 1$ frequent itemsets,
excluding the null set. Because k can be very large in many practical
applications, the search space of itemsets that need to be explored is
exponentially large.
Figure 5.1.
An itemset lattice.
A brute-force approach for finding frequent itemsets is to determine the
support count for every candidate itemset in the lattice structure. To do this,
we need to compare each candidate against every transaction, an operation
that is shown in Figure 5.2. If the candidate is contained in a transaction, its
support count will be incremented. For example, the support for {Bread,
Milk} is incremented three times because the itemset is contained in
transactions 1, 4, and 5. Such an approach can be very expensive because it
requires O(NMw) comparisons, where N is the number of transactions,
$M = 2^k - 1$ is the number of candidate itemsets, and w is the maximum transaction
width. Transaction width is the number of items present in a transaction.
Figure 5.2.
Counting the support of candidate itemsets.
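The brute-force approach can be sketched as follows for the transactions of Table 5.1. The function name and the minimum support count of 3 are assumptions chosen for illustration. The sketch also tallies the number of candidate-versus-transaction comparisons, which is N × M.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def brute_force_frequent(transactions, min_count):
    """Count the support of every non-empty itemset in the lattice."""
    items = sorted(set().union(*transactions))
    # All 2^k - 1 non-empty itemsets over the k distinct items.
    candidates = [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
    ]
    counts = dict.fromkeys(candidates, 0)
    comparisons = 0
    for t in transactions:
        for c in candidates:
            comparisons += 1
            if c <= t:  # candidate is contained in the transaction
                counts[c] += 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    return frequent, len(candidates), comparisons


frequent, n_candidates, n_comparisons = brute_force_frequent(TRANSACTIONS, 3)
print(n_candidates, n_comparisons, len(frequent))  # 63 315 8
```

Even on this tiny data set, 315 subset tests are needed to find 8 frequent itemsets, which illustrates why the candidate set must be reduced.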
There are three main approaches for reducing the computational complexity
of frequent itemset generation.
1. Reduce the number of candidate itemsets (M). The Apriori principle,
described in the next section, is an effective way to eliminate some of
the candidate itemsets without counting their support values.
2. Reduce the number of comparisons. Instead of matching each candidate
itemset against every transaction, we can reduce the number of
comparisons by using more advanced data structures, either to store the
candidate itemsets or to compress the data set. We will discuss these
strategies in Sections 5.2.4 and 5.6, respectively.
3. Reduce the number of transactions (N). As the size of candidate itemsets
increases, fewer transactions will be supported by the itemsets. For
instance, since the width of the first transaction in Table 5.1 is 2, it
would be advantageous to remove this transaction before searching for
frequent itemsets of size 3 and larger. Algorithms that employ such a
strategy are discussed in the Bibliographic Notes.
5.2.1 The Apriori Principle
This section describes how the support measure can be used to reduce the
number of candidate itemsets explored during frequent itemset generation.
The use of support for pruning candidate itemsets is guided by the following
principle.
Theorem 5.1 (Apriori Principle).
If an itemset is frequent, then all of its subsets must also be frequent.
To illustrate the idea behind the Apriori principle, consider the itemset lattice
shown in Figure 5.3. Suppose {c, d, e} is a frequent itemset. Clearly, any
transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e},
{d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets
of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.
Figure 5.3.
An illustration of the Apriori principle. If {c, d, e} is frequent, then
all subsets of this itemset are frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets
must be infrequent too. As illustrated in Figure 5.4, the entire subgraph
containing the supersets of {a, b} can be pruned immediately once {a, b} is
found to be infrequent. This strategy of trimming the exponential search
space based on the support measure is known as support-based pruning.
Such a pruning strategy is made possible by a key property of the support
measure, namely, that the support for an itemset never exceeds the support
for its subsets. This property is also known as the anti-monotone property of
the support measure.
Figure 5.4.
An illustration of support-based pruning. If {a, b} is infrequent,
then all supersets of {a, b} are infrequent.
Definition 5.2. (Anti-monotone Property).
A measure f possesses the anti-monotone property if for every itemset X that
is a proper subset of itemset Y, i.e. X⊂Y, we have f(Y)≤f(X).
More generally, a large number of measures—see Section 5.7.1—can be
applied to itemsets to evaluate various properties of itemsets. As will be
shown in the next section, any measure that has the anti-monotone property
can be incorporated directly into an itemset mining algorithm to effectively
prune the exponential search space of candidate itemsets.
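The anti-monotone property can be verified empirically on a small data set. The sketch below is a hypothetical checker: it compares each itemset Y only against its subsets with one fewer item, which is sufficient because the inequality chains through successively smaller subsets. It tests non-empty itemsets only.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
ITEMS = ["Beer", "Bread", "Cola", "Diapers", "Eggs", "Milk"]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def is_anti_monotone(measure, items):
    """Check f(Y) <= f(X) for every itemset Y and each subset X with |X| = |Y| - 1."""
    for size in range(2, len(items) + 1):
        for y in combinations(items, size):
            for x in combinations(y, size - 1):
                if measure(y) > measure(x):
                    return False
    return True


print(is_anti_monotone(lambda s: support_count(s, TRANSACTIONS), ITEMS))  # True
print(is_anti_monotone(len, ITEMS))  # False: itemset size grows with supersets
```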
5.2.2 Frequent Itemset Generation
in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of
support-based pruning to systematically control the exponential growth of
candidate itemsets. Figure 5.5 provides a high-level illustration of the
frequent itemset generation part of the Apriori algorithm for the transactions
shown in Table 5.1. We assume that the support threshold is 60%, which is
equivalent to a minimum support count equal to 3.
Figure 5.5.
Illustration of frequent itemset generation using the Apriori
algorithm.
Initially, every item is considered as a candidate 1-itemset. After counting
their supports, the candidate itemsets {Cola} and {Eggs} are discarded
because they appear in fewer than three transactions. In the next iteration,
candidate 2-itemsets are generated using only the frequent 1-itemsets because
the Apriori principle ensures that all supersets of the infrequent 1-itemsets
must be infrequent. Because there are only four frequent 1-itemsets, the
number of candidate 2-itemsets generated by the algorithm is $\binom{4}{2}=6$. Two of
these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently
found to be infrequent after computing their support values. The remaining
four candidates are frequent, and thus will be used to generate candidate 3-
itemsets. Without support-based pruning, there are $\binom{6}{3}=20$ candidate 3-
itemsets that can be formed using the six items given in this example. With
the Apriori principle, we only need to keep candidate 3-itemsets whose
subsets are frequent. The only candidate that has this property is {Bread,
Diapers, Milk}. However, even though the subsets of {Bread, Diapers,
Milk} are frequent, the itemset itself is not.
The effectiveness of the Apriori pruning strategy can be shown by counting
the number of candidate itemsets generated. A brute-force strategy of
enumerating all itemsets (up to size 3) as candidates will produce
$\binom{6}{1}+\binom{6}{2}+\binom{6}{3}=6+15+20=41$
candidates. With the Apriori principle, this number decreases to
$\binom{6}{1}+\binom{4}{2}+1=6+6+1=13$
candidates, which represents a 68% reduction in the number of candidate
itemsets even in this simple example.
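The candidate counts behind this comparison follow from the numbers given above: six items, four frequent 1-itemsets, and a single surviving candidate 3-itemset. A quick check with binomial coefficients:

```python
from math import comb

# Brute force: every itemset of size 1, 2, or 3 over the six items.
brute_force = sum(comb(6, k) for k in (1, 2, 3))

# Apriori: six candidate 1-itemsets, pairs of the four frequent items,
# and the single candidate 3-itemset that survives pruning.
apriori = comb(6, 1) + comb(4, 2) + 1

reduction = 1 - apriori / brute_force
print(brute_force, apriori, f"{reduction:.0%}")  # 41 13 68%
```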
The pseudocode for the frequent itemset generation part of the Apriori
algorithm is shown in Algorithm 5.1. Let Ck denote the set of candidate
k-itemsets and Fk denote the set of frequent k-itemsets:
The algorithm initially makes a single pass over the data set to
determine the support of each item. Upon completion of this step, the set
of all frequent 1-itemsets, F1, will be known (steps 1 and 2).
Next, the algorithm will iteratively generate new candidate k-itemsets
and prune unnecessary candidates that are guaranteed to be infrequent
given the frequent (k−1)-itemsets found in the previous iteration (steps 5
and 6). Candidate generation and pruning is implemented using the
functions candidate-gen and candidate-prune, which are described in
Section 5.2.3.
To count the support of the candidates, the algorithm needs to make an
additional pass over the data set (steps 7–12). The subset function is
used to determine all the candidate itemsets in Ck that are contained in
each transaction t. The implementation of this function is described in
Section 5.2.4.
After counting their supports, the algorithm eliminates all candidate
itemsets whose support counts are less than N×minsup (step 13).
The algorithm terminates when there are no new frequent itemsets
generated, i.e., Fk=∅ (step 14).
The frequent itemset generation part of the Apriori algorithm has two
important characteristics. First, it is a level-wise algorithm; i.e., it traverses
the itemset lattice one level at a time, from frequent 1-itemsets to the
maximum size of frequent itemsets. Second, it employs a generate-and-test
strategy for finding frequent itemsets. At each iteration (level), new candidate
itemsets are generated from the frequent itemsets found in the previous
iteration. The support for each candidate is then counted and tested against
the minsup threshold. The total number of iterations needed by the algorithm
is kmax+1, where kmax is the maximum size of the frequent itemsets.
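The level-wise, generate-and-test loop can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name apriori, the use of a minimum support count instead of a fraction, and the candidate generation step (taking unions of two frequent (k−1)-itemsets that contain exactly k items) are all assumptions made to keep the example short.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def apriori(transactions, min_count):
    """Return {frequent itemset: support count} using a level-wise search."""
    # First pass: count single items and keep the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(frequent)
    k = 1
    while frequent:  # stop when a level yields no frequent itemsets
        k += 1
        prev = list(frequent)
        # Generate: unions of two frequent (k-1)-itemsets having k items.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune: drop candidates with any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Test: one pass over the data to count the surviving candidates.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
    return result


result = apriori(TRANSACTIONS, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 3, the sketch finds four frequent 1-itemsets and four frequent 2-itemsets, and the loop stops at k = 3 because the only surviving candidate, {Bread, Diapers, Milk}, has a support count of 2.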
5.2.3 Candidate Generation and
Pruning
The candidate-gen and candidate-prune functions shown in Steps 5 and 6 of
Algorithm 5.1 generate candidate itemsets and prune unnecessary ones by
performing the following two operations, respectively:
1. Candidate Generation. This operation generates new candidate
k-itemsets based on the frequent (k−1)-itemsets found in the previous
iteration.
Algorithm 5.1 Frequent itemset generation of the Apriori algorithm.
1: k = 1.
2: Fk = {i | i ∈ I ∧ σ({i}) ≥ N × minsup}. {Find all frequent 1-itemsets.}
3: repeat
4:   k = k + 1.
5:   Ck = candidate-gen(Fk−1). {Generate candidate itemsets.}
6:   Ck = candidate-prune(Ck, Fk−1). {Prune candidate itemsets.}
7:   for each transaction t ∈ T do
8:     Ct = subset(Ck, t). {Identify all candidates that belong to t.}
9:     for each candidate itemset c ∈ Ct do
10:      σ(c) = σ(c) + 1. {Increment support count.}
11:    end for
12:  end for
13:  Fk = {c | c ∈ Ck ∧ σ(c) ≥ N × minsup}. {Extract the frequent k-itemsets.}
14: until Fk = ∅
15: Result = ∪ Fk.
2. Candidate Pruning. This operation eliminates some of the candidate
k-itemsets using support-based pruning, i.e., by removing k-itemsets whose
subsets are known to be infrequent in previous iterations. Note that this
pruning is done without computing the actual support of these k-itemsets
(which could have required comparing them against each transaction).
Candidate Generation
In principle, there are many ways to generate candidate itemsets. An effective
candidate generation procedure must be complete and non-redundant. A
candidate generation procedure is said to be complete if it does not omit any
frequent itemsets. To ensure completeness, the set of candidate itemsets must
subsume the set of all frequent itemsets, i.e., ∀k:Fk⊆Ck. A candidate
generation procedure is non-redundant if it does not generate the same
candidate itemset more than once. For example, the candidate itemset {a, b,
c, d} can be generated in many ways—by merging {a, b, c} with {d}, {b, d}
with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates leads
to wasted computations and thus should be avoided for efficiency reasons.
Also, an effective candidate generation procedure should avoid generating
too many unnecessary candidates. A candidate itemset is unnecessary if at
least one of its subsets is infrequent, and thus, eliminated in the candidate
pruning step.
Next, we will briefly describe several candidate generation procedures,
including the one used by the candidate-gen function.
Brute-Force Method
The brute-force method considers every k-itemset as a potential candidate and
then applies the candidate pruning step to remove any unnecessary candidates
whose subsets are infrequent (see Figure 5.6). The number of candidate
itemsets generated at level k is equal to $\binom{d}{k}$, where d is the total number of
items. Although candidate generation is rather trivial, candidate pruning
becomes extremely expensive because a large number of itemsets must be
examined.
Figure 5.6.
A brute-force method for generating candidate 3-itemsets.
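A minimal sketch of the brute-force method for k = 3 is given below. The function names are hypothetical, and the frequent 2-itemsets are the ones found for Table 5.1 with a minimum support count of 3.

```python
from itertools import combinations

ITEMS = ["Beer", "Bread", "Cola", "Diapers", "Eggs", "Milk"]
# Frequent 2-itemsets for Table 5.1 with a minimum support count of 3.
F2 = {
    frozenset({"Beer", "Diapers"}),
    frozenset({"Bread", "Diapers"}),
    frozenset({"Bread", "Milk"}),
    frozenset({"Diapers", "Milk"}),
}


def brute_force_candidates(items, k):
    """Every k-itemset over the items is treated as a candidate."""
    return [frozenset(c) for c in combinations(items, k)]


def prune(candidates, frequent_prev, k):
    """Keep candidates whose (k-1)-subsets are all frequent."""
    return [
        c for c in candidates
        if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))
    ]


c3 = brute_force_candidates(ITEMS, 3)
kept = prune(c3, F2, 3)
print(len(c3), [sorted(c) for c in kept])  # 20 [['Bread', 'Diapers', 'Milk']]
```

Generation itself is a single call to combinations, but pruning must inspect all 20 candidates to retain just one, which is where the cost of this method lies.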
Fk−1×F1 Method
An alternative method for candidate generation is to extend each frequent (k
−1)-itemset with frequent items that are not part of the (k−1)-itemset. Figure
5.7 illustrates how a frequent 2-itemset such as {Beer, Diapers} can be
augmented with a frequent item such as Bread to produce a candidate 3-
itemset {Beer, Diapers, Bread}.
Figure 5.7.
Generating and pruning candidate k-itemsets by merging a frequent
(k−1)-itemset with a frequent item. Note that some of the
candidates are unnecessary because their subsets are infrequent.
The procedure is complete because every frequent k-itemset is composed of a
frequent (k−1)-itemset and a frequent 1-itemset. Therefore, all frequent k-itemsets
are part of the candidate k-itemsets generated by this procedure.
Figure 5.7 shows that the Fk−1×F1 candidate generation method only
produces four candidate 3-itemsets, instead of the
$\binom{6}{3}=20$ itemsets produced by the brute-force method. The Fk−1×F1 method
generates fewer candidates because every candidate is guaranteed
to contain at least one frequent (k−1)-itemset. While this procedure is a
substantial improvement over the brute-force method, it can still produce a
large number of unnecessary candidates, as the remaining subsets of a
candidate itemset can still be infrequent.
Note that the approach discussed above does not prevent the same candidate
itemset from being generated more than once. For instance, {Bread, Diapers,
Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread,
Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid
generating duplicate candidates is by ensuring that the items in each frequent
itemset are kept sorted in their lexicographic order. For example, itemsets
such as {Bread, Diapers}, {Bread, Diapers, Milk}, and {Diapers, Milk}
follow lexicographic order as the items within every itemset are arranged
alphabetically. Each frequent (k−1)-itemset X is then extended with frequent
items that are lexicographically larger than the items in X. For example, the
itemset {Bread, Diapers} can be augmented with {Milk} because Milk is
lexicographically larger than Bread and Diapers. However, we should not
augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers}
because they violate the lexicographic ordering condition. Every candidate k-itemset
is thus generated exactly once, by merging the lexicographically
largest item with the remaining k−1 items in the itemset. If the Fk−1×F1
method is used in conjunction with lexicographic ordering, then only two
candidate 3-itemsets will be produced in the example illustrated in Figure 5.7.
{Beer, Bread, Diapers} and {Beer, Bread, Milk} will not be generated
because {Beer, Bread} is not a frequent 2-itemset.
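The effect of the lexicographic ordering condition can be illustrated with a short Python sketch. The function name extend_with_items and its ordered flag are hypothetical, and the lists F1 and F2 are assumed from the itemsets mentioned in the text, not taken from Figure 5.7 itself.

```python
def extend_with_items(frequent_prev, frequent_items, ordered=True):
    """F(k-1) x F1 candidate generation on sorted tuples.

    With ordered=True, an itemset is extended only by items that are
    lexicographically larger than its last (largest) item.
    """
    candidates = []
    for itemset in frequent_prev:
        for item in sorted(frequent_items):
            if item in itemset:
                continue
            if ordered and item <= itemset[-1]:
                continue  # would violate the lexicographic ordering condition
            candidates.append(tuple(sorted(itemset + (item,))))
    return candidates

# Assumed frequent items and frequent 2-itemsets.
F1 = ["Beer", "Bread", "Diapers", "Milk"]
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]

unordered = extend_with_items(F2, F1, ordered=False)
ordered = extend_with_items(F2, F1, ordered=True)
print(len(unordered), len(set(unordered)))  # 8 4
print(ordered)  # [('Beer', 'Diapers', 'Milk'), ('Bread', 'Diapers', 'Milk')]
```

Without the ordering condition, eight merges produce only four distinct candidates; with it, each of the two remaining candidates is produced exactly once.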
Fk−1×Fk−1 Method
This candidate generation procedure, which is used in the candidate-gen
function of the Apriori algorithm, merges a pair of frequent (k−1)-itemsets
only if their first k−2 items, arranged in lexicographic order, are identical. Let
A={a1, a2, …, ak−1} and B={b1, b2, …, bk−1} be a pair of frequent (k−1)-
itemsets, arranged lexicographically. A and B are merged if they satisfy the
following conditions:
ai=bi (for i=1, 2, …, k−2).
Note that in this case, ak−1≠bk−1 because A and B are two distinct itemsets.
The candidate k-itemset generated by merging A and B consists of the first k
−2 common items followed by ak−1 and bk−1 in lexicographic order. This
candidate generation procedure is complete, because for every
lexicographically ordered frequent k-itemset, there exist two
lexicographically ordered frequent (k−1)-itemsets that have identical items in
the first k−2 positions.
In Figure 5.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are
merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm
does not have to merge {Beer, Diapers} with {Diapers, Milk} because the
first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a
viable candidate, it would have been obtained by merging {Beer, Diapers}
with {Beer, Milk} instead. This example illustrates both the completeness of
the candidate generation procedure and the advantages of using lexicographic
ordering to prevent duplicate candidates. Also, if we order the frequent (k−1)-
itemsets according to their lexicographic rank, itemsets with identical first k
−2 items would take consecutive ranks. As a result, the Fk−1×Fk−1 candidate
generation method would consider merging a frequent itemset only with ones
that occupy the next few ranks in the sorted list, thus saving some
computations.
Figure 5.8.
Generating and pruning candidate k-itemsets by merging pairs of
frequent (k−1)-itemsets.
Figure 5.8 shows that the Fk−1×Fk−1 candidate generation procedure results
in only one candidate 3-itemset. This is a considerable reduction from the
four candidate 3-itemsets generated by the Fk−1×F1 method. This is because
the Fk−1×Fk−1 method ensures that every candidate k-itemset contains at
least two frequent (k−1)-itemsets, thus greatly reducing the number of
candidates that are generated in this step.
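The merging rule can be sketched in Python as follows. The function name merge_frequent_pairs is hypothetical, itemsets are assumed to be stored as sorted tuples, and the frequent 2-itemsets are again assumed from the itemsets named in the text.

```python
def merge_frequent_pairs(frequent_prev):
    """F(k-1) x F(k-1): merge itemsets whose first k-2 items are identical."""
    ranked = sorted(frequent_prev)  # lexicographic rank of the (k-1)-itemsets
    candidates = []
    for i, a in enumerate(ranked):
        for b in ranked[i + 1:]:
            if a[:-1] != b[:-1]:
                break  # itemsets with the same prefix occupy consecutive ranks
            candidates.append(a + (b[-1],))  # a[-1] < b[-1] because ranked is sorted
    return candidates

# Assumed frequent 2-itemsets.
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
C3 = merge_frequent_pairs(F2)
print(C3)  # [('Bread', 'Diapers', 'Milk')]
```

The early break relies on the observation that itemsets sharing their first k−2 items take consecutive ranks in the sorted list, so the inner loop rarely scans more than a few entries.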
Note that there can be multiple ways of merging two frequent (k−1)-itemsets
in the Fk−1×Fk−1 procedure, one of which is merging if their first k−2 items
are identical. An alternate approach could be to merge two frequent (k−1)-
itemsets A and B if the last k−2 items of A are identical to the first k−2
items of B. For example, {Bread, Diapers} and {Diapers, Milk} could be
merged using this approach to generate the candidate 3-itemset {Bread,
Diapers, Milk}. As we will see later, this alternate Fk−1×Fk−1 procedure is
useful in generating sequential patterns, which will be discussed in Chapter 6.
Candidate Pruning
To illustrate the candidate pruning operation for a candidate k-itemset, X=
{i1, i2, …, ik}, consider its k proper subsets, X−{ij}(∀j=1, 2, …, k). If any
of them are infrequent, then X is immediately pruned by using the Apriori
principle. Note that we don’t need to explicitly ensure that all subsets of X of
size less than k−1 are frequent (see Exercise 7). This approach greatly
reduces the number of candidate itemsets considered during support
counting. For the brute-force candidate generation method, candidate pruning
requires checking only k subsets of size k−1 for each candidate k-itemset.
However, since the Fk−1×F1 candidate generation strategy ensures that at
least one of the (k−1)-size subsets of every candidate k-itemset is frequent,
we only need to check for the remaining k−1 subsets. Likewise, the Fk−1×Fk
−1 strategy requires examining only k−2 subsets of every candidate k-itemset,
since two of its (k−1)-size subsets are already known to be frequent in the
candidate generation step.
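A minimal sketch of candidate pruning for the Fk−1×Fk−1 strategy is given below. It assumes that every candidate is a sorted tuple produced by merging two frequent (k−1)-itemsets with a common prefix, so the two subsets obtained by dropping one of the last two items are already known to be frequent. The function name prune_merged and the example itemsets are hypothetical.

```python
def prune_merged(candidates, frequent_prev):
    """Check only the k-2 subsets that drop one of the first k-2 items."""
    frequent_prev = set(frequent_prev)
    kept = []
    for cand in candidates:
        k = len(cand)
        # Dropping position j < k-2 gives a subset not seen during merging.
        if all(cand[:j] + cand[j + 1:] in frequent_prev for j in range(k - 2)):
            kept.append(cand)
    return kept

# Hypothetical frequent 2-itemsets and the candidates obtained by merging them.
F2 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
C3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d")]
print(prune_merged(C3, F2))  # [('a', 'b', 'c')]
```

For k=3 only one subset per candidate is examined: {a, b, d} is pruned because {b, d} is infrequent, and {a, c, d} is pruned because {c, d} is infrequent.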
5.2.4 Support Counting
Support counting is the process of determining the frequency of occurrence
for every candidate itemset that survives the candidate pruning step. Support
counting is implemented in steps 6 through 11 of Algorithm 5.1. A brute-force
approach for doing this is to compare each transaction against every
candidate itemset (see Figure 5.2) and to update the support counts of
candidates contained in a transaction. This approach is computationally
expensive, especially when the numbers of transactions and candidate
itemsets are large.
An alternative approach is to enumerate the itemsets contained in each
transaction and use them to update the support counts of their respective
candidate itemsets. To illustrate, consider a transaction t that contains five
items, {1, 2, 3, 5, 6}. There are $\binom{5}{3}=10$ itemsets of size 3 contained in this
transaction. Some of the itemsets may correspond to the candidate 3-itemsets
under investigation, in which case, their support counts are incremented.
Other subsets of t that do not correspond to any candidates can be ignored.
Figure 5.9 shows a systematic way for enumerating the 3-itemsets contained
in t. Assuming that each itemset keeps its items in increasing lexicographic
order, an itemset can be enumerated by specifying the smallest item first,
followed by the larger items. For instance, given t={1, 2, 3, 5, 6}, all the 3-
itemsets contained in t must begin with item 1, 2, or 3. It is not possible to
construct a 3-itemset that begins with items 5 or 6 because there are only two
items in t whose labels are greater than or equal to 5. The number of ways to
specify the first item of a 3-itemset contained in t is illustrated by the Level 1
prefix tree structure depicted in Figure 5.9. For instance, the node 1 [2 3 5 6]
(prefix item first, remaining items in brackets) represents
a 3-itemset that begins with item 1, followed by two more items chosen from
the set {2, 3, 5, 6}.
Figure 5.9.
Enumerating subsets of three items from a transaction t.
After fixing the first item, the prefix tree structure at Level 2 represents the
number of ways to select the second item. For example, the node 1 2 [3 5 6] corresponds
to itemsets that begin with the prefix (1 2) and are followed by the items 3, 5,
or 6. Finally, the prefix tree structure at Level 3 represents the complete set of
3-itemsets contained in t. For example, the 3-itemsets that begin with prefix
{1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix
{2 3} are {2, 3, 5} and {2, 3, 6}.
The prefix tree structure shown in Figure 5.9 demonstrates how itemsets
contained in a transaction can be systematically enumerated, i.e., by
specifying their items one by one, from the leftmost item to the rightmost
item. We still have to determine whether each enumerated 3-itemset
corresponds to an existing candidate itemset. If it matches one of the
candidates, then the support count of the corresponding candidate is
incremented. In the next Section, we illustrate how this matching operation
can be performed efficiently using a hash tree structure.
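The level-by-level enumeration can be sketched as a small recursive generator in Python. The function name enumerate_subsets and the candidate itemsets in the counts dictionary are hypothetical; the transaction is the one used in the text.

```python
def enumerate_subsets(transaction, k, prefix=()):
    """Yield the k-itemsets contained in a sorted transaction, one item at a time."""
    need = k - len(prefix)
    if need == 0:
        yield prefix
        return
    # The next item must leave at least need-1 larger items after it.
    for i in range(len(transaction) - need + 1):
        yield from enumerate_subsets(transaction[i + 1:], k, prefix + (transaction[i],))

t = (1, 2, 3, 5, 6)
three_itemsets = list(enumerate_subsets(t, 3))
print(len(three_itemsets))  # 10
print(three_itemsets[:3])   # [(1, 2, 3), (1, 2, 5), (1, 2, 6)]

# Hypothetical candidates: only enumerated itemsets that match are incremented.
counts = {(1, 2, 5): 0, (2, 3, 6): 0, (4, 5, 6): 0}
for s in three_itemsets:
    if s in counts:
        counts[s] += 1
print(counts)  # {(1, 2, 5): 1, (2, 3, 6): 1, (4, 5, 6): 0}
```

At the first level the loop runs over items 1, 2, and 3 only, which mirrors the observation that no 3-itemset contained in t can begin with item 5 or 6.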
Support Counting Using a Hash Tree
In the Apriori algorithm, candidate itemsets are partitioned into different
buckets and stored in a hash tree. During support counting, itemsets
contained in each transaction are also hashed into their appropriate buckets.
That way, instead of comparing each itemset in the transaction with every
candidate itemset, it is matched only against candidate itemsets that belong to
the same bucket, as shown in Figure 5.10.
Figure 5.10.
Counting the support of itemsets using hash structure.
Figure 5.11 shows an example of a hash tree structure. Each internal node of
the tree uses the following hash function, h(p)=(p−1) mod 3, where mod
refers to the modulo (remainder) operator, to determine which branch of the
current node should be followed next. For example, items 1, 4, and 7 are
hashed to the same branch (i.e., the leftmost branch) because they have the
same remainder after dividing the number by 3. All candidate itemsets are
stored at the leaf nodes of the hash tree. The hash tree shown in Figure 5.11
contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.
Figure 5.11.
Hashing a transaction at the root node of a hash tree.
Consider the transaction, t={1, 2, 3, 5, 6}. To update the support counts of
the candidate itemsets, the hash tree must be traversed in such a way that all
the leaf nodes containing candidate 3-itemsets belonging to t are visited
at least once. Recall that the 3-itemsets contained in t must begin with items
1, 2, or 3, as indicated by the Level 1 prefix tree structure shown in Figure
5.9. Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the
transaction are hashed separately. Item 1 is hashed to the left child of the root
node, item 2 is hashed to the middle child, and item 3 is hashed to the right
child. At the next level of the tree, the transaction is hashed on the second
item listed in the Level 2 tree structure shown in Figure 5.9. For example,
after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction
are hashed. Based on the hash function, items 2 and 5 are hashed to the
middle child, while item 3 is hashed to the right child, as shown in Figure
5.12. This process continues until the leaf nodes of the hash tree are reached.
The candidate itemsets stored at the visited leaf nodes are compared against
the transaction. If a candidate is a subset of the transaction, its support count
is incremented. Note that not all the leaf nodes are visited while traversing the
hash tree, which helps in reducing the computational cost. In this example, 5
out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared
against the transaction.
Figure 5.12.
Subset operation on the leftmost subtree of the root of a candidate
hash tree.
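The traversal described above can be sketched with a small hash tree class in Python. This is a simplified illustration and not the exact structure of Figure 5.11: the class name HashTree, the leaf capacity max_leaf_size, the splitting policy, and the list of candidate 3-itemsets are all assumptions made for the example, so the shape of the resulting tree may differ from the figure. Only the hash function h(p)=(p−1) mod 3 is taken from the text.

```python
class HashTree:
    """Candidate hash tree with h(p) = (p - 1) mod 3 at every internal node."""

    def __init__(self, k, max_leaf_size=3, depth=0):
        self.k = k                  # size of the stored candidate itemsets
        self.max_leaf_size = max_leaf_size
        self.depth = depth
        self.children = None        # branch -> HashTree once the node is internal
        self.itemsets = []          # candidates held while the node is a leaf

    @staticmethod
    def h(item):
        return (item - 1) % 3

    def insert(self, itemset):
        if self.children is None:
            self.itemsets.append(itemset)
            # Split an overfull leaf unless all k item positions are used up.
            if len(self.itemsets) > self.max_leaf_size and self.depth < self.k:
                pending, self.itemsets, self.children = self.itemsets, [], {}
                for s in pending:
                    self.insert(s)
            return
        branch = self.h(itemset[self.depth])
        if branch not in self.children:
            self.children[branch] = HashTree(self.k, self.max_leaf_size, self.depth + 1)
        self.children[branch].insert(itemset)

    def leaves_for(self, transaction, start=0, visited=None):
        """Return each leaf that may hold a candidate contained in the transaction."""
        visited = [] if visited is None else visited
        if self.children is None:
            if not any(leaf is self for leaf in visited):
                visited.append(self)
            return visited
        # Hash only items that leave enough later items to complete a k-itemset.
        stop = len(transaction) - (self.k - self.depth) + 1
        for i in range(start, stop):
            child = self.children.get(self.h(transaction[i]))
            if child is not None:
                child.leaves_for(transaction, i + 1, visited)
        return visited


def count_support(candidates, transactions, k=3):
    tree = HashTree(k)
    for c in candidates:
        tree.insert(c)
    counts = {c: 0 for c in candidates}
    for t in transactions:
        items = set(t)
        for leaf in tree.leaves_for(tuple(sorted(t))):
            for cand in leaf.itemsets:
                if items.issuperset(cand):  # compare candidate against transaction
                    counts[cand] += 1
    return counts


# Hypothetical candidate 3-itemsets, stored as sorted tuples of integer items.
candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
              (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7),
              (6, 8, 9), (3, 6, 7), (3, 6, 8)]
counts = count_support(candidates, [(1, 2, 3, 5, 6)])
print(sorted(c for c, n in counts.items() if n > 0))
# [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

Each leaf is recorded at most once per transaction, so a candidate is never counted twice for the same transaction even when several hashing paths lead to the same leaf.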
5.2.5 Computational Complexity
The computational complexity of the Apriori algorithm, which includes both
its runtime and storage, can be affected by the following factors.
Support Threshold
Lowering the support threshold often results in more itemsets being declared
as frequent. This has an adverse effect on the computational complexity of
the algorithm because more candidate itemsets must be generated and
counted at every level, as shown in Figure 5.13. The maximum size of
frequent itemsets also tends to increase with lower support thresholds. This
increases the total number of iterations to be performed by the Apriori
algorithm, further increasing the computational cost.
Figure 5.13.
Effect of support threshold on the number of candidate and
frequent itemsets obtained from a benchmark data set.
Number of Items (Dimensionality)
As the number of items increases, more space will be needed to store the
support counts of items. If the number of frequent items also grows with the
dimensionality of the data, the runtime and storage requirements will increase
because of the larger number of candidate itemsets generated by the
algorithm.
Number of Transactions
Because the Apriori algorithm makes repeated passes over the transaction
data set, its run time increases with a larger number of transactions.
Average Transaction Width
For dense data sets, the average transaction width can be very large. This
affects the complexity of the Apriori algorithm in two ways. First, the
maximum size of frequent itemsets tends to increase as the average
transaction width increases. As a result, more candidate itemsets must be
examined during candidate generation and support counting, as illustrated in
Figure 5.14. Second, as the transaction width increases, more itemsets are
contained in the transaction. This will increase the number of hash tree
traversals performed during support counting.
A detailed analysis of the time complexity for the Apriori algorithm is
presented next.
Figure 5.14.
Effect of average transaction width on the number of candidate and
frequent itemsets obtained from a synthetic data set.
Generation of frequent 1-itemsets
For each transaction, we need to update the support count for every item
present in the transaction. Assuming that w is the average transaction width,
this operation requires O(Nw) time, where N is the total number of
transactions.
Candidate generation
To generate candidate k-itemsets, pairs of frequent (k−1)-itemsets are merged
to determine whether they have at least k−2 items in common. Each merging
operation requires at most k−2 equality comparisons. Every merging step can
produce at most one viable candidate k-itemset, while in the worst-case, the
algorithm must try to merge every pair of frequent (k−1)-itemsets found in
the previous iteration. Therefore, the overall cost of merging frequent
itemsets is
$$\sum_{k=2}^{w} (k-2)\,|C_k| < \text{Cost of merging} < \sum_{k=2}^{w} (k-2)\,|F_{k-1}|^2,$$
where w is the maximum transaction width. A hash tree is also constructed
during candidate generation to store the candidate itemsets. Because the
maximum depth of the tree is k, the cost for populating the hash tree with
candidate itemsets is $O\left(\sum_{k=2}^{w} k\,|C_k|\right)$. During candidate pruning, we need to
verify that the k−2 subsets of every candidate k-itemset are frequent. Since
the cost for looking up a candidate in a hash tree is O(k), the candidate
pruning step requires $O\left(\sum_{k=2}^{w} k(k-2)\,|C_k|\right)$ time.
Support counting
Each transaction of width |t| produces $\binom{|t|}{k}$ itemsets of size k. This is also the
effective number of hash tree traversals performed for each transaction. The
cost for support counting is $O\left(N \sum_{k} \binom{w}{k} \alpha_k\right)$, where w is the maximum
transaction width and $\alpha_k$ is the cost for updating the support count of a
candidate k-itemset in the hash tree.
5.3 Rule Generation
This Section describes how to extract association rules efficiently from a
given frequent itemset. Each frequent k-itemset, Y, can produce up to $2^k-2$
association rules, ignoring rules that have empty antecedents or consequents
(∅→Y or Y→∅). An association rule can be extracted by partitioning the
itemset Y into two non-empty subsets, X and Y−X, such that X→Y−X
satisfies the confidence threshold. Note that all such rules must have already
met the support threshold because they are generated from a frequent itemset.
Example 5.2.
Let X={a, b, c} be a frequent itemset. There are six candidate association
rules that can be generated from X: {a, b}→{c}, {a, c}→{b}, {b, c}→{a},
{a}→{b, c}, {b}→{a, c}, and {c}→{a, b}. Because the support of each rule is
identical to the support of X, all the rules satisfy the support threshold.
Computing the confidence of an association rule does not require additional
scans of the transaction data set. Consider the rule {1, 2}→{3}, which is
generated from the frequent itemset X={1, 2, 3}. The confidence for this rule
is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone
property of support ensures that {1, 2} must be frequent, too. Since the
support counts for both itemsets were already found during frequent itemset
generation, there is no need to read the entire data set again.
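As a small illustration, the confidence of a rule can be obtained with two dictionary lookups once the support counts are stored. The function name confidence and the support counts below are hypothetical values chosen for the example.

```python
def confidence(support, antecedent, consequent):
    """conf(X -> Y) = sigma(X u Y) / sigma(X), read from stored support counts."""
    x = frozenset(antecedent)
    return support[x | frozenset(consequent)] / support[x]

# Hypothetical support counts found during frequent itemset generation.
support = {frozenset({1, 2}): 4, frozenset({1, 2, 3}): 3}
conf = confidence(support, {1, 2}, {3})
print(conf)  # 0.75
```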
5.3.1 Confidence-Based Pruning
Confidence does not show the anti-monotone property in the same way as the
support measure. For example, the confidence for X→Y can be larger,
smaller, or equal to the confidence for another rule X˜→Y˜, where X˜⊆X
and Y˜⊆Y (see Exercise 3 on page 439). Nevertheless, if we compare rules
generated from the same frequent itemset Y, the following theorem holds for
the confidence measure.
Theorem 5.2.
Let Y be an itemset and X a subset of Y. If a rule X→Y−X does not satisfy
the confidence threshold, then any rule X˜→Y−X˜, where X˜ is a subset of X,
must not satisfy the confidence threshold as well.
To prove this theorem, consider the following two rules: X˜→Y−X˜ and
X→Y−X, where X˜⊂X. The confidences of the rules are σ(Y)/σ(X˜) and
σ(Y)/σ(X), respectively. Since X˜ is a subset of X, σ(X˜)≥σ(X). Therefore, the
former rule cannot have a higher confidence than the latter rule.
5.3.2 Rule Generation in Apriori
The Apriori algorithm uses a level-wise approach for generating association
rules, where each level corresponds to the number of items that belong to the
rule consequent. Initially, all the high confidence rules that have only one
item in the rule consequent are extracted. These rules are then used to
generate new candidate rules. For example, if {acd}→{b} and {abd}→{c}
are high confidence rules, then the candidate rule {ad}→{bc} is generated by
merging the consequents of both rules. Figure 5.15 shows a lattice structure
for the association rules generated from the frequent itemset {a, b, c, d}. If
any node in the lattice has low confidence, then according to Theorem 5.2,
the entire subgraph spanned by the node can be pruned immediately. Suppose
the confidence for {bcd}→{a} is low. All the rules containing item a in their
consequents, including {cd}→{ab}, {bd}→{ac}, {bc}→{ad}, and {d}→{abc},
can be discarded.
Figure 5.15.
Pruning of association rules using the confidence measure.
Pseudocode for the rule generation step is shown in Algorithms 5.2 and
5.3. Note the similarity between the ap-genrules procedure given in
Algorithm 5.3 and the frequent itemset generation procedure given in
Algorithm 5.1. The only difference is that, in rule generation, we do not have
to make additional passes over the data set to compute the confidence of the
candidate rules. Instead, we determine the confidence of each rule by using
the support counts computed during frequent itemset generation.
Algorithm 5.2 Rule generation of
the Apriori algorithm.
1: for each frequent k-itemset fk, k ≥ 2 do
2:   H1 = {i | i ∈ fk} {1-item consequents of the rule.}
3:   call ap-genrules(fk, H1).
4: end for
Algorithm 5.3 Procedure ap-genrules(fk, Hm).
1: k = |fk| {size of frequent itemset.}
2: m = |hm|, for any hm ∈ Hm {size of rule consequent.}
3: if k > m + 1 then
4:   Hm+1 = candidate-gen(Hm).
5:   Hm+1 = candidate-prune(Hm+1, Hm).
6:   for each hm+1 ∈ Hm+1 do
7:     conf = σ(fk)/σ(fk − hm+1).
8:     if conf ≥ minconf then
9:       output the rule (fk − hm+1) → hm+1.
10:    else
11:      delete hm+1 from Hm+1.
12:    end if
13:  end for
14:  call ap-genrules(fk, Hm+1).
15: end if
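The two procedures can be sketched in Python as follows. This is one possible reading of the pseudocode, not a definitive implementation. It assumes that support is a dictionary mapping every frequent itemset (as a sorted tuple) to its support count, and that the 1-item consequents are checked against minconf before the recursion starts; both are assumptions of this sketch, as are the function names and the support counts in the example.

```python
from itertools import combinations

def candidate_gen(consequents):
    """Merge m-item consequents sharing their first m-1 items."""
    ranked = sorted(consequents)
    merged = []
    for i, a in enumerate(ranked):
        for b in ranked[i + 1:]:
            if a[:-1] != b[:-1]:
                break
            merged.append(a + (b[-1],))
    return merged

def candidate_prune(candidates, consequents):
    """Keep candidates whose every m-item subset is a surviving consequent."""
    kept = set(consequents)
    return [c for c in candidates
            if all(s in kept for s in combinations(c, len(c) - 1))]

def ap_genrules(itemset, consequents, support, minconf, rules):
    if not consequents:
        return
    k, m = len(itemset), len(consequents[0])
    if k <= m + 1:
        return
    survivors = []
    for h in candidate_prune(candidate_gen(consequents), consequents):
        antecedent = tuple(i for i in itemset if i not in h)
        conf = support[itemset] / support[antecedent]
        if conf >= minconf:
            rules.append((antecedent, h, conf))
            survivors.append(h)  # low-confidence consequents are dropped
    ap_genrules(itemset, survivors, support, minconf, rules)

def generate_rules(support, minconf):
    rules = []
    for itemset in support:
        if len(itemset) < 2:
            continue
        # Assumption: 1-item consequents are confidence-checked here.
        level1 = []
        for item in itemset:
            antecedent = tuple(i for i in itemset if i != item)
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                rules.append((antecedent, (item,), conf))
                level1.append((item,))
        ap_genrules(itemset, level1, support, minconf, rules)
    return rules

# Hypothetical support counts for all frequent itemsets of a small data set.
support = {("a",): 4, ("b",): 4, ("c",): 3, ("a", "b"): 3, ("a", "c"): 2,
           ("b", "c"): 3, ("a", "b", "c"): 2}
rules = generate_rules(support, minconf=0.7)
for antecedent, consequent, conf in rules:
    print(antecedent, "->", consequent, round(conf, 2))
```

No transaction is read at any point; every confidence value comes from the stored support counts. With minconf set to 0, the 3-itemset in this example yields all six of its candidate rules.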
5.3.3 An Example: Congressional
Voting Records
This Section demonstrates the results of applying association analysis to the
voting records of members of the United States House of Representatives.
The data is obtained from the 1984 Congressional Voting Records Database,
which is available at the UCI machine learning data repository. Each
transaction contains information about the party affiliation for a
representative along with his or her voting record on 16 key issues. There are
435 transactions and 34 items in the data set. The items are listed in
Table 5.3.
Table 5.3. List of binary
attributes from the 1984 United
States Congressional Voting
Records. Source: The UCI
machine learning repository.
1. Republican
2. Democrat
3. handicapped-infants=yes
4. handicapped-infants=no
5. water project cost sharing=yes
6. water project cost sharing=no
7. budget-resolution=yes
8. budget-resolution=no
9. physician fee freeze=yes
10. physician fee freeze=no
11. aid to El Salvador=yes
12. aid to El Salvador=no
13. religious groups in schools=yes
14. religious groups in schools=no
15. anti-satellite test ban=yes
16. anti-satellite test ban=no
17. aid to Nicaragua=yes
18. aid to Nicaragua=no
19. MX-missile=yes
20. MX-missile=no
21. immigration=yes
22. immigration=no
23. synfuel corporation cutback=yes
24. synfuel corporation cutback=no
25. education spending=yes
26. education spending=no
27. right-to-sue=yes
28. right-to-sue=no
29. crime=yes
30. crime=no
31. duty-free-exports=yes
32. duty-free-exports=no
33. export administration act=yes
34. export administration act=no
The Apriori algorithm is then applied to the data set with minsup=30% and
minconf=90%. Some of the high confidence rules extracted by the algorithm
are shown in Table 5.4. The first two rules suggest that most of the members
who voted yes for aid to El Salvador and no for budget resolution and MX
missile are Republicans; while those who voted no for aid to El Salvador and
yes for budget resolution and MX missile are Democrats. These high
confidence rules show the key issues that divide members from both political
parties.
Table 5.4. Association rules
extracted from the 1984 United
States Congressional Voting
Records.
Association Rule                                                                  Confidence
{budget resolution=no, MX-missile=no, aid to El Salvador=yes}→{Republican}        91.0%
{budget resolution=yes, MX-missile=yes, aid to El Salvador=no}→{Democrat}         97.5%
{crime=yes, right-to-sue=yes, physician fee freeze=yes}→{Republican}              93.5%
{crime=no, right-to-sue=no, physician fee freeze=no}→{Democrat}                   100%
5.4 Compact Representation of
Frequent Itemsets
In practice, the number of frequent itemsets produced from a transaction data
set can be very large. It is useful to identify a small representative set of
frequent itemsets from which all other frequent itemsets can be derived. Two
such representations are presented in this Section in the form of maximal and
closed frequent itemsets.
5.4.1 Maximal Frequent Itemsets
Definition 5.3. (Maximal Frequent Itemset.)
A frequent itemset is maximal if none of its immediate supersets are frequent.
To illustrate this concept, consider the itemset lattice shown in Figure 5.16.
The itemsets in the lattice are divided into two groups: those that are frequent
and those that are infrequent. A frequent itemset border, which is represented
by a dashed line, is also illustrated in the diagram. Every itemset located
above the border is frequent, while those located below the border (the
shaded nodes) are infrequent. Among the itemsets residing near the border,
{a, d}, {a, c, e}, and {b, c, d, e} are maximal frequent itemsets because all of
their immediate supersets are infrequent. For example, the itemset {a, d} is
maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d},
and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one
of its immediate supersets, {a, c, e}, is frequent.
Figure 5.16.
Maximal frequent itemset.
Maximal frequent itemsets effectively provide a compact representation of
frequent itemsets. In other words, they form the smallest set of itemsets from
which all frequent itemsets can be derived. For example, every frequent
itemset in Figure 5.16 is a subset of one of the three maximal frequent
itemsets, {a, d}, {a, c, e}, and {b, c, d, e}. If an itemset is not a proper subset
of any of the maximal frequent itemsets, then it is either infrequent (e.g., {a,
d, e}) or maximal frequent itself (e.g., {b, c, d, e}). Hence, the maximal
frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a compact
representation of the frequent itemsets shown in Figure 5.16. Enumerating all
the subsets of maximal frequent itemsets generates the complete list of all
frequent itemsets.
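The relationship between maximal and frequent itemsets can be checked with a short Python sketch that uses the three maximal frequent itemsets named above. The function names expand and maximal_itemsets are hypothetical. The sketch tests for frequent proper supersets of any size, which is equivalent to testing immediate supersets because every subset of a frequent itemset is itself frequent.

```python
from itertools import combinations

def expand(maximal):
    """All non-empty subsets of the maximal frequent itemsets."""
    frequent = set()
    for m in maximal:
        for k in range(1, len(m) + 1):
            frequent.update(frozenset(s) for s in combinations(sorted(m), k))
    return frequent

def maximal_itemsets(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return {f for f in frequent if not any(f < g for g in frequent)}

maximal = [frozenset("ad"), frozenset("ace"), frozenset("bcde")]
frequent = expand(maximal)
print(len(frequent))                               # 20
print(maximal_itemsets(frequent) == set(maximal))  # True
```

The twenty frequent itemsets are recovered from only three maximal ones, but their support counts are not.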
Maximal frequent itemsets provide a valuable representation for data sets that
can produce very long, frequent itemsets, as there are exponentially many
frequent itemsets in such data. Nevertheless, this approach is practical only if
an efficient algorithm exists to explicitly find the maximal frequent itemsets.
We briefly describe one such approach in Section 5.5.
Despite providing a compact representation, maximal frequent itemsets do
not contain the support information of their subsets. For example, the supports
of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} do not
provide any information about the supports of their subsets except that they
meet the support threshold. An additional pass over the data set is therefore
needed to determine the support counts of the non-maximal frequent itemsets.
In some cases, it is desirable to have a minimal representation of itemsets that
preserves the support information. We describe such a representation in the
next Section.
5.4.2 Closed Itemsets
Closed itemsets provide a minimal representation of all itemsets without
losing their support information. A formal definition of a closed itemset is
presented below.
Definition 5.4. (Closed Itemset.)
An itemset X is closed if none of its immediate supersets has exactly the same
support count as X.
Put another way, X is not closed if at least one of its immediate supersets has
the same support count as X. Examples of closed itemsets are shown in
Figure 5.17. To better illustrate the support count of each itemset, we have
associated each node (itemset) in the lattice with a list of its corresponding
transaction IDs. For example, since the node {b, c} is associated with
transaction IDs 1, 2, and 3, its support count is equal to three. From the
transactions given in this diagram, notice that the support for {b} is identical
to that of {b, c}. This is because every transaction that contains b also contains c.
Hence, {b} is not a closed itemset. Similarly, since c occurs in every
transaction that contains both a and d, the itemset {a, d} is not closed as it
has the same support as its superset {a, c, d}. On the other hand, {b, c} is a
closed itemset because it does not have the same support count as any of its
supersets.
Figure 5.17.
An example of the closed frequent itemsets (with minimum support
equal to 40%).
An interesting property of closed itemsets is that if we know their support
counts, we can derive the support count of every other itemset in the itemset
lattice without making additional passes over the data set. For example,
consider the 2-itemset {b, e} in Figure 5.17. Since {b, e} is not closed, its
support must be equal to the support of one of its immediate supersets, {a, b,
e}, {b, c, e}, and {b, d, e}. Further, none of the supersets of {b, e} can have a
support greater than the support of {b, e}, due to the anti-monotone nature of
the support measure. Hence, the support of {b, e} can be computed by
examining the support counts of all of its immediate supersets of size three
and taking their maximum value. If an immediate superset is closed (e.g., {b,
c, e}), we would know its support count. Otherwise, we can recursively
compute its support by examining the supports of its immediate supersets of
size four. In general, the support count of any non-closed (k−1)-itemset can
be determined as long as we know the support counts of all k-itemsets.
Hence, one can devise an iterative algorithm that computes the support
counts of itemsets at level k−1 using the support counts of itemsets at level k,
starting from the level kmax, where kmax is the size of the largest closed
itemset.
Even though closed itemsets provide a compact representation of the support
counts of all itemsets, they can still be exponentially large in number.
Moreover, for most practical applications, we only need to determine the
support count of all frequent itemsets. In this regard, closed frequent itemsets provide a compact representation of the support counts of all frequent
itemsets, which can be defined as follows.
Definition 5.5. (Closed Frequent Itemset.)
An itemset is a closed frequent itemset if it is closed and its support is greater
than or equal to minsup.
In the previous example, assuming that the support threshold is 40%, {b, c} is
a closed frequent itemset because its support is 60%. In Figure 5.17, the
closed frequent itemsets are indicated by the shaded nodes.
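Definitions 5.4 and 5.5 can be illustrated with the following Python sketch. The five transactions are an assumption: they are chosen to be consistent with the support counts quoted in the text (for example, {b, c} occurring in transactions 1, 2, and 3), since the figure itself is not reproduced here. The function names are hypothetical, and the subset enumeration is practical only for very small data sets.

```python
from itertools import combinations

def support_counts(transactions):
    """Count every itemset contained in at least one transaction."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for sub in combinations(items, k):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1
    return counts

def closed_itemsets(counts, all_items):
    """Itemsets with no immediate superset of identical support count."""
    closed = {}
    for x, n in counts.items():
        supersets = (x | {i} for i in all_items if i not in x)
        if all(counts.get(s, 0) != n for s in supersets):
            closed[x] = n
    return closed

# Assumed transactions with IDs 1 through 5.
transactions = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"b", "c", "e"},
                {"a", "c", "d", "e"}, {"d", "e"}]
counts = support_counts(transactions)
closed = closed_itemsets(counts, set("abcde"))

minsup_count = 0.4 * len(transactions)
closed_frequent = {x: n for x, n in closed.items() if n >= minsup_count}

print(closed[frozenset("bc")])      # 3
print(frozenset("b") in closed)     # False
print(frozenset("ad") in closed)    # False
```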
Algorithms are available to explicitly extract closed frequent itemsets from a
given data set. Interested readers may refer to the Bibliographic Notes at the
end of this chapter for further discussions of these algorithms. We can use
closed frequent itemsets to determine the support counts for all non-closed
frequent itemsets. For example, consider the frequent itemset {a, d} shown in
Figure 5.17. Because this itemset is not closed, its support count must be
equal to the maximum support count of its immediate supersets, {a, b, d}, {a,
c, d}, and {a, d, e}. Also, since {a, d} is frequent, we only need to consider
the support of its frequent supersets. In general, the support count of every
non-closed frequent k-itemset can be obtained by considering the support of
all its frequent supersets of size k+1. For example, since the only frequent
superset of {a, d} is {a, c, d}, its support is equal to the support of {a, c, d},
which is 2. Using this methodology, an algorithm can be developed to
compute the support for every frequent itemset. The pseudocode for this
algorithm is shown in Algorithm 5.4. The algorithm proceeds in a specific-to-general
fashion, i.e., from the largest to the smallest frequent itemsets. This is
because, in order to find the support for a non-closed frequent itemset, the
support for all of its supersets must be known. Note that the set of all frequent
itemsets can be easily computed by taking the union of all subsets of frequent
closed itemsets.
Algorithm 5.4 Support counting
using closed frequent itemsets.
1: Let C denote the set of closed frequent itemsets and F denote the set of all frequent itemsets.
2: Let kmax denote the maximum size of closed frequent itemsets.
3: Fkmax = {f | f ∈ C, |f| = kmax} {Find all frequent itemsets of size kmax.}
4: for k = kmax − 1 down to 1 do
5:   Fk = {f | f ∈ F, |f| = k} {Find all frequent itemsets of size k.}
6:   for each f ∈ Fk do
7:     if f ∉ C then
8:       f.support = max{f′.support | f′ ∈ Fk+1, f ⊂ f′}
9:     end if
10:  end for
11: end for
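Algorithm 5.4 can be sketched in Python as shown below. The function name supports_from_closed is hypothetical, and the input is assumed to be a dictionary that maps each closed frequent itemset to its support count. The example uses the four closed frequent itemsets of the data set in Table 5.5, with support counts read from that table.

```python
from itertools import combinations

def supports_from_closed(closed):
    """Derive the support count of every frequent itemset from the closed ones."""
    # The frequent itemsets are the non-empty subsets of closed frequent itemsets.
    frequent = set()
    for c in closed:
        for k in range(1, len(c) + 1):
            frequent.update(frozenset(s) for s in combinations(sorted(c), k))
    support = dict(closed)
    kmax = max(len(c) for c in closed)
    # Proceed from the largest itemsets to the smallest (specific to general).
    for k in range(kmax - 1, 0, -1):
        for f in [f for f in frequent if len(f) == k]:
            if f not in closed:
                support[f] = max(support[g] for g in frequent
                                 if len(g) == k + 1 and f < g)
    return support

group = lambda p: frozenset(f"{p}{i}" for i in range(1, 6))
closed = {frozenset({"a3", "a4"}): 4, group("a"): 3, group("b"): 3, group("c"): 4}
support = supports_from_closed(closed)

print(len(support))                       # 93
print(support[frozenset({"a3"})])         # 4
print(support[frozenset({"a1", "a3"})])   # 3
```

A non-closed frequent itemset always has a frequent immediate superset with the same support, so the max in the inner loop is never taken over an empty collection when the input is a valid set of closed frequent itemsets.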
To illustrate the advantage of using closed frequent itemsets, consider the
data set shown in Table 5.5, which contains ten transactions and fifteen items.
The items can be divided into three groups: (1) Group A, which contains
items a1 through a5; (2) Group B, which contains items b1 through b5; and
(3) Group C, which contains items c1 through c5. Assuming that the support
threshold is 20%, itemsets involving items from the same group are frequent,
but itemsets involving items from different groups are infrequent. The total
number of frequent itemsets is thus 3 × (2^5 − 1) = 93. However, there are only
four closed frequent itemsets in the data:
{a3, a4}, {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5}. It
is often sufficient to present only the closed frequent itemsets to the analysts
instead of the entire set of frequent itemsets.
Table 5.5. A transaction data
set for mining closed itemsets.
TID a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
4 0 0 1 1 0 1 1 1 1 1 0 0 0 0 0
5 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
6 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
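The counts quoted for Table 5.5 can be checked by brute force. The sketch below is illustrative only: it encodes the ten rows of the table as Python sets, enumerates itemsets level by level, and marks an itemset as closed when no immediate superset has the same support count. The variable names are assumptions made for this example.

```python
from itertools import combinations

A = [f"a{i}" for i in range(1, 6)]
B = [f"b{i}" for i in range(1, 6)]
C = [f"c{i}" for i in range(1, 6)]
# Rows of Table 5.5: TIDs 1-3, TID 4, TIDs 5-6, TIDs 7-10.
transactions = [set(A)] * 3 + [{"a3", "a4", *B}] + [set(B)] * 2 + [set(C)] * 4
items = A + B + C
min_count = 2                                   # 20% of 10 transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = {}
for k in range(1, len(items) + 1):
    level = {}
    for combo in combinations(items, k):
        s = support(frozenset(combo))
        if s >= min_count:
            level[frozenset(combo)] = s
    if not level:                               # no frequent k-itemsets: stop
        break
    frequent.update(level)

# Closed: no immediate superset has the same support count.
closed = {f: s for f, s in frequent.items()
          if not any(support(f | {i}) == s for i in items if i not in f)}

print(len(frequent), len(closed))
for itemset, s in closed.items():
    print(sorted(itemset), s)
```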
Finally, note that all maximal frequent itemsets are closed because none of
the maximal frequent itemsets can have the same support count as their
immediate supersets. The relationships among frequent, closed, closed
frequent, and maximal frequent itemsets are shown in Figure 5.18.
Figure 5.18.
Relationships among frequent, closed, closed frequent, and
maximal frequent itemsets.
5.5 Alternative Methods for
Generating Frequent Itemsets*
Apriori is one of the earliest algorithms to have successfully addressed the
combinatorial explosion of frequent itemset generation. It achieves this by
applying the Apriori principle to prune the exponential search space. Despite
its significant performance improvement, the algorithm still incurs
considerable I/O overhead since it requires making several passes over the
transaction data set. In addition, as noted in Section 5.2.5, the performance of
the Apriori algorithm may degrade significantly for dense data sets because
of the increasing width of transactions. Several alternative methods have been
developed to overcome these limitations and improve upon the efficiency of
the Apriori algorithm. The following is a high-level description of these methods.
Traversal of Itemset Lattice
A search for frequent itemsets can be conceptually viewed as a traversal on
the itemset lattice shown in Figure 5.1. The search strategy employed by an
algorithm dictates how the lattice structure is traversed during the frequent
itemset generation process. Some search strategies are better than others,
depending on the configuration of frequent itemsets in the lattice. An
overview of these strategies is presented next.
General-to-Specific versus Specific-to-General: The Apriori algorithm
uses a general-to-specific search strategy, where pairs of frequent (k−1)-
itemsets are merged to obtain candidate k-itemsets. This general-to-specific search strategy is effective, provided the maximum length of a
frequent itemset is not too long. The configuration of frequent itemsets
that works best with this strategy is shown in Figure 5.19(a), where the
darker nodes represent infrequent itemsets. Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first,
before finding the more general frequent itemsets. This strategy is useful
to discover maximal frequent itemsets in dense transactions, where the
frequent itemset border is located near the bottom of the lattice, as
shown in Figure 5.19(b). The Apriori principle can be applied to prune
all subsets of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its
subsets of size k−1. However, if the candidate k-itemset is infrequent,
we need to check all of its k−1 subsets in the next iteration. Another
approach is to combine both general-to-specific and specific-to-general
search strategies. This bidirectional approach requires more space to
store the candidate itemsets, but it can help to rapidly identify the
frequent itemset border, given the configuration shown in Figure 5.19(c).
Figure 5.19.
General-to-specific, specific-to-general, and bidirectional search.
Equivalence Classes: Another way to envision the traversal is to first
partition the lattice into disjoint groups of nodes (or equivalence
classes). A frequent itemset generation algorithm searches for frequent
itemsets within a particular equivalence class first before moving to
another equivalence class. As an example, the level-wise strategy used
in the Apriori algorithm can be considered to be partitioning the lattice
on the basis of itemset sizes; i.e., the algorithm discovers all frequent 1-
itemsets first before proceeding to larger-sized itemsets. Equivalence
classes can also be defined according to the prefix or suffix labels of an
itemset. In this case, two itemsets belong to the same equivalence class
if they share a common prefix or suffix of length k. In the prefix-based
approach, the algorithm can search for frequent itemsets starting with
the prefix a before looking for those starting with prefixes b, c, and so
on. Both prefix-based and suffix-based equivalence classes can be
demonstrated using the tree-like structure shown in Figure 5.20.
Figure 5.20.
Equivalence classes based on the prefix and suffix labels of itemsets.
Breadth-First versus Depth-First: The Apriori algorithm traverses the
lattice in a breadth-first manner, as shown in Figure 5.21(a). It first
discovers all the frequent 1-itemsets, followed by the frequent 2-
itemsets, and so on, until no new frequent itemsets are generated. The
itemset lattice can also be traversed in a depth-first manner, as shown in
Figures 5.21(b) and 5.22. The algorithm can start from, say, node a in
Figure 5.22, and count its support to determine whether it is frequent. If
so, the algorithm progressively expands the next level of nodes, i.e., ab,
abc, and so on, until an infrequent node is reached, say, abcd. It then
backtracks to another branch, say, abce, and continues the search from there.
Figure 5.21.
Breadth-first and depth-first traversals.
Figure 5.22.
Generating candidate itemsets using the depth-first approach.
The depth-first approach is often used by algorithms designed to find
maximal frequent itemsets. This approach allows the frequent itemset
border to be detected more quickly than using a breadth-first approach.
Once a maximal frequent itemset is found, substantial pruning can be
performed on its subsets. For example, if the node bcde shown in Figure
5.22 is maximal frequent, then the algorithm does not have to visit the
subtrees rooted at bd, be, c, d, and e because they will not contain any
maximal frequent itemsets. However, if abc is maximal frequent, only
the nodes such as ac and bc are not maximal frequent (but the subtrees
of ac and bc may still contain maximal frequent itemsets). The depth-first approach also allows a different kind of pruning based on the
support of itemsets. For example, suppose the support for {a, b, c} is
identical to the support for {a, b}. The subtrees rooted at abd and abe
can be skipped because they are guaranteed not to have any maximal
frequent itemsets. The proof of this is left as an exercise to the readers.
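The depth-first traversal described above can be sketched as a recursive walk over a prefix tree. This is a minimal illustration, not one of the published depth-first algorithms: the function name `dfs_frequent` is hypothetical, the ten transactions are an assumed toy data set over items a through e, and support is counted by a naive scan. The pruning for maximal frequent itemsets is not shown.

```python
transactions = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]

def dfs_frequent(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    found = {}                                   # insertion order = visiting order

    def expand(prefix, start):
        for idx in range(start, len(items)):
            candidate = prefix + (items[idx],)
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                found[candidate] = count
                expand(candidate, idx + 1)       # go deeper before moving sideways
            # Infrequent node: backtrack. Every node below it is a superset,
            # so the whole subtree can be skipped.

    expand((), 0)
    return found

found = dfs_frequent(transactions, min_count=2)
print(list(found)[:4])
```

On this toy data the walk visits a, ab, and abc, finds abcd and abce infrequent, and then backtracks to abd, which mirrors the order described in the text.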
Representation of Transaction Data
There are many ways to represent a transaction data set. The choice of
representation can affect the I/O costs incurred when computing the support
of candidate itemsets. Figure 5.23 shows two different ways of representing
market basket transactions. The representation on the left is called a
horizontal data layout, which is adopted by many association rule mining
algorithms, including Apriori. Another possibility is to store the list of
transaction identifiers (TID-list) associated with each item. Such a
representation is known as the vertical data layout. The support for each
candidate itemset is obtained by intersecting the TID-lists of its subset items.
The length of the TID-lists shrinks as we progress to larger sized itemsets.
However, one problem with this approach is that the initial set of TID-lists
might be too large to fit into main memory, thus requiring more sophisticated
techniques to compress the TID-lists. We describe another effective approach
to represent the data in the next section.
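As a small illustration of the vertical layout, the sketch below converts a horizontal data set into TID-lists and obtains support counts by set intersection. The data are an assumed toy example, not the contents of Figure 5.23, and the helper names `to_vertical` and `support_count` are hypothetical.

```python
from collections import defaultdict

# Horizontal layout: TID -> items in that transaction (toy data).
horizontal = {
    1: {"a", "b"}, 2: {"b", "c", "d"}, 3: {"a", "c", "d", "e"},
    4: {"a", "d", "e"}, 5: {"a", "b", "c"}, 6: {"a", "b", "c", "d"},
    7: {"a"}, 8: {"a", "b", "c"}, 9: {"a", "b", "d"}, 10: {"b", "c", "e"},
}

def to_vertical(horizontal):
    """Build one TID-list (here a set of TIDs) per item."""
    tidlists = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)

def support_count(itemset, tidlists):
    """Support = size of the intersection of the items' TID-lists."""
    tids = set.intersection(*(tidlists.get(i, set()) for i in itemset))
    return len(tids)

tidlists = to_vertical(horizontal)
print(sorted(tidlists["e"]))
print(support_count({"d", "e"}, tidlists))
```

The intersection for a larger itemset is never longer than the TID-list of any of its items, which is why the lists shrink as the itemsets grow.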
Figure 5.23.
Horizontal and vertical data format.
5.6 FP-Growth Algorithm*
This section presents an alternative algorithm called FP-growth that takes a
radically different approach to discovering frequent itemsets. The algorithm
does not subscribe to the generate-and-test paradigm of Apriori. Instead, it
encodes the data set using a compact data structure called an FP-tree and
extracts frequent itemsets directly from this structure. The details of this
approach are presented next.
5.6.1 FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed
by reading the data set one transaction at a time and mapping each transaction
onto a path in the FP-tree. As different transactions can have several items in
common, their paths might overlap. The more the paths overlap with one
another, the more compression we can achieve using the FP-tree structure. If
the size of the FP-tree is small enough to fit into main memory, this will
allow us to extract frequent itemsets directly from the structure in memory
instead of making repeated passes over the data stored on disk.
Figure 5.24 shows a data set that contains ten transactions and five items. The
structures of the FP-tree after reading the first three transactions are also
depicted in the diagram. Each node in the tree contains the label of an item
along with a counter that shows the number of transactions mapped onto the
given path. Initially, the FP-tree contains only the root node represented by
the null symbol. The FP-tree is subsequently extended in the following way:
Figure 5.24.
Construction of an FP-tree.
1. The data set is scanned once to determine the support count of each
item. Infrequent items are discarded, while the frequent items are sorted
in decreasing support counts inside every transaction of the data set. For
the data set shown in Figure 5.24, a is the most frequent item, followed
by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a
and b are created. A path is then formed from null →a→b to encode the
transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b, c, d}, a new set of nodes is
created for items b, c, and d. A path is then formed to represent the
transaction by connecting the nodes null →b→c→d. Every node along
this path also has a frequency count equal to one. Although the first two
transactions have an item in common, which is b, their paths are disjoint
because the transactions do not share a common prefix.
4. The third transaction, {a, c, d, e}, shares a common prefix item (which
is a) with the first transaction. As a result, the path for the third
transaction, null →a→c→d→e, overlaps with the path for the first
transaction, null →a→b. Because of their overlapping path, the
frequency count for node a is incremented to two, while the frequency
counts for the newly created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one
of the paths given in the FP-tree. The resulting FP-tree after reading all
the transactions is shown at the bottom of Figure 5.24.
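The two-pass construction above can be sketched as follows. This is an illustrative sketch under assumptions: the class `FPNode`, the function `build_fp_tree`, and the `header` dictionary (a stand-in for the pointers that link nodes carrying the same item) are hypothetical names. The ten transactions are an assumed toy data set whose first three transactions match those described in the steps; they are not copied from Figure 5.24.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}            # item -> FPNode

def build_fp_tree(transactions, min_count):
    # Pass 1: support count of each item; infrequent items are discarded.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = FPNode(None)               # the null root
    header = {}                       # item -> list of nodes with that item
    # Pass 2: map each transaction onto a path, items in decreasing support.
    for t in transactions:
        path = sorted((i for i in set(t) if i in frequent),
                      key=lambda i: (-counts[i], i))
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:         # no shared prefix here: new node
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1          # one more transaction along this path
            node = child
    return root, header

transactions = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]
root, header = build_fp_tree(transactions, min_count=2)
print(sorted(root.children), root.children["a"].count)
```

Ties in support are broken alphabetically here, which is an arbitrary choice made only to keep the ordering deterministic.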
The size of an FP-tree is typically smaller than the size of the uncompressed
data because many transactions in market basket data often share a few items
in common. In the best-case scenario, where all the transactions have the
same set of items, the FP-tree contains only a single branch of nodes. The
worst-case scenario happens when every transaction has a unique set of
items. As none of the transactions have any items in common, the size of the
FP-tree is effectively the same as the size of the original data. However, the
physical storage requirement for the FP-tree is higher because it requires
additional space to store pointers between nodes and counters for each item.
The size of an FP-tree also depends on how the items are ordered. The notion
of ordering items in decreasing order of support counts relies on the
possibility that the high support items occur more frequently across all paths
and hence must be used as most commonly occurring prefixes. For example,
if the ordering scheme in the preceding example is reversed, i.e., from lowest
to highest support item, the resulting FP-tree is shown in Figure 5.25. The
tree appears to be denser because the branching factor at the root node has
increased from 2 to 5 and the number of nodes containing the high support
items such as a and b has increased from 3 to 12. Nevertheless, ordering by
decreasing support counts does not always lead to the smallest tree,
especially when the high support items do not occur frequently together with
the other items. For example, suppose we augment the data set given in
Figure 5.24 with 100 transactions that contain {e}, 80 transactions that
contain {d}, 60 transactions that contain {c}, and 40 transactions that contain
{b}. Item e is now most frequent, followed by d, c, b, and a. With the
augmented transactions, ordering by decreasing support counts will result in
an FP-tree similar to Figure 5.25, while a scheme based on increasing support
counts produces a smaller FP-tree similar to Figure 5.24(iv).
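The effect of item ordering on tree size can be checked numerically. The sketch below is a hedged illustration: it does not build a pointer-based FP-tree, but counts the distinct prefixes of the ordered transactions, since each distinct prefix corresponds to one node. The function name `prefix_tree_stats` is hypothetical, and the base data are an assumed toy data set chosen to agree with the counts quoted in the text, not a copy of Figure 5.24.

```python
from collections import Counter

base = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]

def prefix_tree_stats(transactions, decreasing=True):
    """Return (root branching factor, number of nodes, nodes labeled a or b)."""
    counts = Counter(i for t in transactions for i in t)
    sign = -1 if decreasing else 1
    nodes = set()
    for t in transactions:
        path = tuple(sorted(t, key=lambda i: (sign * counts[i], i)))
        # Each distinct prefix of an ordered transaction is one tree node.
        nodes.update(path[:n] for n in range(1, len(path) + 1))
    root_branches = sum(1 for p in nodes if len(p) == 1)
    ab_nodes = sum(1 for p in nodes if p[-1] in ("a", "b"))
    return root_branches, len(nodes), ab_nodes

print(prefix_tree_stats(base, decreasing=True))
print(prefix_tree_stats(base, decreasing=False))

# Augmented data from the text: many single-item transactions.
augmented = base + [{"e"}] * 100 + [{"d"}] * 80 + [{"c"}] * 60 + [{"b"}] * 40
print(prefix_tree_stats(augmented, decreasing=True)[1],
      prefix_tree_stats(augmented, decreasing=False)[1])
```

On the augmented data the decreasing-support ordering yields the larger tree, which is the exception the paragraph describes.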
Figure 5.25.
An FP-tree representation for the data set shown in Figure 5.24
with a different item ordering scheme.
An FP-tree also contains a list of pointers connecting nodes that have the
same items. These pointers, represented as dashed lines in Figures 5.24 and
5.25, help to facilitate the rapid access of individual items in the tree. We
explain how to use the FP-tree and its corresponding pointers for frequent
itemset generation in the next section.
5.6.2 Frequent Itemset Generation
in FP-Growth Algorithm
FP-growth is an algorithm that generates frequent itemsets from an FP-tree
by exploring the tree in a bottom-up fashion. Given the example tree shown
in Figure 5.24, the algorithm looks for frequent itemsets ending in e first,
followed by d, c, b, and finally, a. This bottom-up strategy for finding
frequent itemsets ending with a particular item is equivalent to the suffix-based approach described in Section 5.5. Since every transaction is mapped
onto a path in the FP-tree, we can derive the frequent itemsets ending with a
particular item, say, e, by examining only the paths containing node e. These
paths can be accessed rapidly using the pointers associated with node e. The
extracted paths are shown in Figure 5.26 (a). Similar paths for itemsets
ending in d, c, b, and a are shown in Figures 5.26 (b), (c), (d), and (e), respectively.
Figure 5.26.
Decomposing the frequent itemset generation problem into multiple
subproblems, where each subproblem involves finding frequent
itemsets ending in e, d, c, b, and a.
FP-growth finds all the frequent itemsets ending with a particular suffix by
employing a divide-and-conquer strategy to split the problem into smaller
subproblems. For example, suppose we are interested in finding all frequent
itemsets ending in e. To do this, we must first check whether the itemset {e}
itself is frequent. If it is frequent, we consider the subproblem of finding
frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of
these subproblems is further decomposed into smaller subproblems. By
merging the solutions obtained from the subproblems, all the frequent
itemsets ending in e can be found. Finally, the set of all frequent itemsets can
be generated by merging the solutions to the subproblems of finding frequent
itemsets ending in e, d, c, b, and a. This divide-and-conquer approach is the
key strategy employed by the FP-growth algorithm.
For a more concrete example on how to solve the subproblems, consider the
task of finding frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial
paths are called prefix paths and are shown in Figure 5.27(a).
Figure 5.27.
Example of applying the FP-growth algorithm to find frequent
itemsets ending in e.
2. From the prefix paths shown in Figure 5.27(a), the support count for e is
obtained by adding the support counts associated with node e. Assuming
that the minimum support count is 2, {e} is declared a frequent itemset
because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of
finding frequent itemsets ending in de, ce, be, and ae. Before solving
these subproblems, it must first convert the prefix paths into a
conditional FP-tree, which is structurally similar to an FP-tree, except
it is used to find frequent itemsets ending with a particular suffix. A
conditional FP-tree is obtained in the following way:
1. First, the support counts along the prefix paths must be updated
because some of the counts include transactions that do not contain
item e. For example, the rightmost path shown in Figure 5.27(a),
null → b:2 → c:2 → e:1, includes a transaction {b, c} that does
not contain item e. The counts along the prefix path must therefore
be adjusted to 1 to reflect the actual number of transactions
containing {b, c, e}.
2. The prefix paths are truncated by removing the nodes for e. These
nodes can be removed because the support counts along the prefix
paths have been updated to reflect only transactions that contain e
and the subproblems of finding frequent itemsets ending in de, ce,
be, and ae no longer need information about node e.
3. After updating the support counts along the prefix paths, some of
the items may no longer be frequent. For example, the node b
appears only once and has a support count equal to 1, which means
that there is only one transaction that contains both b and e. Item b
can be safely ignored from subsequent analysis because all itemsets
ending in be must be infrequent.
The conditional FP-tree for e is shown in Figure 5.27(b). The tree looks
different than the original prefix paths because the frequency counts
have been updated and the nodes b and e have been eliminated.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of
finding frequent itemsets ending in de, ce, and ae. To find the frequent
itemsets ending in de, the prefix paths for d are gathered from the
conditional FP-tree for e (Figure 5.27(c)). By adding the frequency
counts associated with node d, we obtain the support count for {d, e}.
Since the support count is equal to 2, {d, e} is declared a frequent
itemset. Next, the algorithm constructs the conditional FP-tree for de
using the approach described in step 3. After updating the support
counts and removing the infrequent item c, the conditional FP-tree for
de is shown in Figure 5.27(d). Since the conditional FP-tree contains
only one item, a, whose support is equal to minsup, the algorithm
extracts the frequent itemset {a, d, e} and moves on to the next
subproblem, which is to generate frequent itemsets ending in ce. After
processing the prefix paths for c, {c, e} is found to be frequent.
However, the conditional FP-tree for ce will have no frequent items and
thus will be eliminated. The algorithm proceeds to solve the next
subproblem and finds {a, e} to be the only frequent itemset remaining.
This example illustrates the divide-and-conquer approach used in the FP-growth algorithm. At each recursive step, a conditional FP-tree is constructed
by updating the frequency counts along the prefix paths and removing all
infrequent items. Because the subproblems are disjoint, FP-growth will not
generate any duplicate itemsets. In addition, the counts associated with the
nodes allow the algorithm to perform support counting while generating the
common suffix itemsets.
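The recursion can be sketched compactly in Python. Note the simplifying assumption: instead of a pointer-linked conditional FP-tree, this sketch keeps each conditional structure as a flat list of (prefix path, count) pairs, so it illustrates the divide-and-conquer logic rather than the compression. The function names `fp_growth` and `mine` are hypothetical, and the ten transactions are an assumed toy data set chosen to agree with the counts in the worked example.

```python
from collections import Counter, defaultdict

def fp_growth(paths, min_count, suffix=()):
    """paths: list of (ordered item tuple, count). Returns itemset -> support
    for every frequent itemset that ends with `suffix`."""
    counts = defaultdict(int)
    for path, n in paths:
        for item in path:
            counts[item] += n
    found = {}
    for item, support in counts.items():
        if support < min_count:
            continue                           # infrequent in this subproblem
        itemset = (item,) + suffix
        found[itemset] = support
        # Prefix paths for `item`, truncated just before the item itself.
        conditional = [(p[:p.index(item)], n) for p, n in paths if item in p]
        found.update(fp_growth(conditional, min_count, itemset))
    return found

def mine(transactions, min_count):
    counts = Counter(i for t in transactions for i in t)
    paths = [(tuple(sorted(t, key=lambda i: (-counts[i], i))), 1)
             for t in transactions]
    return fp_growth(paths, min_count)

transactions = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]
result = mine(transactions, min_count=2)
ending_in_e = {k: v for k, v in result.items() if k[-1] == "e"}
print(ending_in_e)
```

Because each itemset is reached only through its last item in the global ordering, then its second-to-last, and so on, no itemset is produced twice.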
FP-growth is an interesting algorithm because it illustrates how a compact
representation of the transaction data set helps to efficiently generate frequent
itemsets. In addition, for certain transaction data sets, FP-growth outperforms
the standard Apriori algorithm by several orders of magnitude. The run-time
performance of FP-growth depends on the compaction factor of the data set.
If the resulting conditional FP-trees are very bushy (in the worst case, a full
prefix tree), then the performance of the algorithm degrades significantly
because it has to generate a large number of subproblems and merge the
results returned by each subproblem.
5.7 Evaluation of Association Patterns
Although the Apriori principle significantly reduces the exponential search
space of candidate itemsets, association analysis algorithms still have the
potential to generate a large number of patterns. For example, although the
data set shown in Table 5.1 contains only six items, it can produce hundreds
of association rules at particular support and confidence thresholds. As the
size and dimensionality of real commercial databases can be very large, we
can easily end up with thousands or even millions of patterns, many of which
might not be interesting. Identifying the most interesting patterns from the
multitude of all possible ones is not a trivial task because “one person’s trash
might be another person’s treasure.” It is therefore important to establish a set
of well-accepted criteria for evaluating the quality of association patterns.
The first set of criteria can be established through a data-driven approach to
define objective interestingness measures. These measures can be used to
rank patterns—itemsets or rules—and thus provide a straightforward way of
dealing with the enormous number of patterns that are found in a data set.
Some of the measures can also provide statistical information, e.g., itemsets
that involve a set of unrelated items or cover very few transactions are
considered uninteresting because they may capture spurious relationships in
the data and should be eliminated. Examples of objective interestingness
measures include support, confidence, and correlation.
The second set of criteria can be established through subjective arguments. A
pattern is considered subjectively uninteresting unless it reveals unexpected
information about the data or provides useful knowledge that can lead to
profitable actions. For example, the rule {Butter}→{Bread} may not be
interesting, despite having high support and confidence values, because the
relationship represented by the rule might seem rather obvious. On the other
hand, the rule {Diapers}→{Beer} is interesting because the relationship is
quite unexpected and may suggest a new cross-selling opportunity for
retailers. Incorporating subjective knowledge into pattern evaluation is a
difficult task because it requires a considerable amount of prior information
from domain experts. Readers interested in subjective interestingness
measures may refer to resources listed in the bibliography at the end of this chapter.
5.7.1 Objective Measures of Interestingness
An objective measure is a data-driven approach for evaluating the quality of
association patterns. It is domain-independent and requires only that the user
specifies a threshold for filtering low-quality patterns. An objective measure
is usually computed based on the frequency counts tabulated in a
contingency table. Table 5.6 shows an example of a contingency table for a
pair of binary variables, A and B. We use the notation A¯ (B¯) to indicate that
A (B) is absent from a transaction. Each entry fij in this 2×2 table denotes a
frequency count. For example, f11 is the number of times A and B appear
together in the same transaction, while f01 is the number of transactions that
contain B but not A. The row sum f1+ represents the support count for A,
while the column sum f+1 represents the support count for B. Finally, even
though our discussion focuses mainly on asymmetric binary variables, note
that contingency tables are also applicable to other attribute types such as
symmetric binary, nominal, and ordinal variables.
Table 5.6. A 2-way contingency table for variables A and B.
        B       B¯
A       f11     f10     f1+
A¯      f01     f00     f0+
        f+1     f+0     N
Limitations of the Support-Confidence Framework
The classical association rule mining formulation relies on the support and
confidence measures to eliminate uninteresting patterns. The drawback of
support, which is described more fully in Section 5.8, is that many potentially
interesting patterns involving low support items might be eliminated by the
support threshold. The drawback of confidence is more subtle and is best
demonstrated with the following example.
Example 5.3.
Suppose we are interested in analyzing the relationship between people who
drink tea and coffee. We may gather information about the beverage
preferences among a group of people and summarize their responses into a
contingency table such as the one shown in Table 5.7.
Table 5.7. Beverage preferences among a group of 1000 people.
        Coffee  Coffee¯
Tea     150     50      200
Tea¯    650     150     800
        800     200     1000
The information given in this table can be used to evaluate the association
rule {Tea}→{Coffee}. At first glance, it may appear that people who drink
tea also tend to drink coffee because the rule’s support (15%) and confidence
(75%) values are reasonably high. This argument would have been acceptable
except that the fraction of people who drink coffee, regardless of whether
they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is
only 75%. Thus knowing that a person is a tea drinker actually decreases her
probability of being a coffee drinker from 80% to 75%! The rule {Tea}
→{Coffee} is therefore misleading despite its high confidence value.
Now consider a similar problem where we are interested in analyzing the
relationship between people who drink tea and people who use honey in their
beverage. Table 5.8 summarizes the information gathered over the same
group of people about their preferences for drinking tea and using honey. If
we evaluate the association rule {Tea}→{Honey} using this information, we
will find that the confidence value of this rule is merely 50%, which might be
easily rejected using a reasonable threshold on the confidence value, say
70%. One thus might consider that the preference of a person for drinking tea
has no influence on her preference for using honey. However, the fraction of
people who use honey, regardless of whether they drink tea, is only 12%.
Hence, knowing that a person drinks tea significantly increases her
probability of using honey from 12% to 50%. Further, among people who do not
drink tea, only 2.5% use honey. This suggests that knowing a person drinks
tea is definitely informative about her preference for using honey.
The rule {Tea}→{Honey} may therefore be falsely
rejected if confidence is used as the evaluation measure.
Table 5.8. Information about people who drink tea and people who use honey in their beverage.
        Honey   Honey¯
Tea     100     100     200
Tea¯    20      780     800
        120     880     1000
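The percentages quoted in Example 5.3 follow directly from the two contingency tables. The short sketch below recomputes them; the function name `rule_stats` and the dictionary keys are assumptions made for this illustration, with the cell counts taken from Tables 5.7 and 5.8.

```python
def rule_stats(f11, f10, f01, f00):
    """Statistics for the rule A -> B from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    return {
        "support": f11 / n,                        # s(A, B)
        "confidence": f11 / (f11 + f10),           # P(B | A)
        "consequent_support": (f11 + f01) / n,     # s(B)
        "confidence_given_absent": f01 / (f01 + f00),  # P(B | not A)
    }

tea_coffee = rule_stats(150, 50, 650, 150)   # Table 5.7
tea_honey = rule_stats(100, 100, 20, 780)    # Table 5.8

for name, stats in [("Tea -> Coffee", tea_coffee), ("Tea -> Honey", tea_honey)]:
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

Comparing `confidence` with `consequent_support` in each case shows why a fixed confidence threshold accepts the first rule and rejects the second, even though only the second reflects a positive relationship.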
Note that if we take the support of coffee drinkers into account, we would not
be surprised to find that many of the people who drink tea also drink coffee,
since the overall number of coffee drinkers is quite large by itself. What is
more surprising is that the fraction of tea drinkers who drink coffee is
actually less than the overall fraction of people who drink coffee, which
points to an inverse relationship between tea drinkers and coffee drinkers.
Similarly, if we account for the fact that the support of using honey is
inherently small, it is easy to understand that the fraction of tea drinkers who
use honey will naturally be small. Instead, what is important to measure is the
change in the fraction of honey users, given the information that they drink tea.
The limitations of the confidence measure are well-known and can be
understood from a statistical perspective as follows. The support of a variable
measures the probability of its occurrence, while the support s(A, B) of a pair
of variables A and B measures the probability of the two variables occurring
together. Hence, the joint probability P(A, B) can be written as
$$P(A, B) = s(A, B) = \frac{f_{11}}{N}.$$
If we assume A and B are statistically independent, i.e. there is no relationship
between the occurrences of A and B, then P(A, B)=P(A)×P(B). Hence, under
the assumption of statistical independence between A and B, the support
sindep(A, B) of A and B can be written as
$$s_{\mathrm{indep}}(A, B) = s(A) \times s(B) \quad \text{or equivalently,} \quad s_{\mathrm{indep}}(A, B) = \frac{f_{1+}}{N} \times \frac{f_{+1}}{N}. \tag{5.4}$$
If the support between two variables, s(A, B), is equal to sindep(A, B), then A
and B can be considered to be unrelated to each other. However, if s(A, B) is
widely different from sindep(A, B), then A and B are most likely dependent.
Hence, any deviation of s(A, B) from s(A)×s(B) can be seen as an indication
of a statistical relationship between A and B. Since the confidence measure
only considers the deviance of s(A, B) from s(A) and not from s(A)×s(B), it
fails to account for the support of the consequent, namely s(B). This results in
the detection of spurious patterns (e.g., {Tea}→{Coffee}) and the rejection
of truly interesting patterns (e.g., {Tea}→{Honey}), as illustrated in the
previous example.
Various objective measures have been used to capture the deviance of s(A, B)
from sindep(A, B), that are not susceptible to the limitations of the
confidence measure. Below, we provide a brief description of some of these
measures and discuss some of their properties.
Interest Factor
The interest factor, which is also called the “lift,” can be defined as
$$I(A, B) = \frac{s(A, B)}{s(A) \times s(B)} = \frac{N f_{11}}{f_{1+} f_{+1}}. \tag{5.5}$$
Notice that s(A)×s(B)=sindep(A, B). Hence, the interest factor measures the
ratio of the support of a pattern s(A, B) against its baseline support sindep(A,
B) computed under the statistical independence assumption. Using Equations
5.5 and 5.4, we can interpret the measure as follows:
$$I(A, B) \begin{cases} = 1, & \text{if } A \text{ and } B \text{ are independent;} \\ > 1, & \text{if } A \text{ and } B \text{ are positively related;} \\ < 1, & \text{if } A \text{ and } B \text{ are negatively related.} \end{cases} \tag{5.6}$$
For the tea-coffee example shown in Table 5.7, I = 0.15/(0.2 × 0.8) = 0.9375, thus
suggesting a slight negative relationship between tea drinkers and coffee
drinkers. Also, for the tea-honey example shown in Table 5.8,
I = 0.1/(0.12 × 0.2) = 4.1667, suggesting a strong positive relationship between
people who drink tea and people who use honey in their beverage. We can
thus see that the interest factor is able to detect meaningful patterns in the tea-coffee and tea-honey examples. Indeed, the interest factor has a number of
statistical advantages over the confidence measure that make it a suitable
measure for analyzing statistical independence between variables.
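The interest factor can be computed directly from the four cell counts of a contingency table using the count form of Equation 5.5. The function name `interest_factor` is an assumption made for this sketch; the inputs are the counts in Tables 5.7 and 5.8.

```python
def interest_factor(f11, f10, f01, f00):
    """Interest factor (lift) of A and B: N * f11 / (f1+ * f+1)."""
    n = f11 + f10 + f01 + f00
    f1_plus = f11 + f10      # support count of A
    f_plus1 = f11 + f01      # support count of B
    return n * f11 / (f1_plus * f_plus1)

lift_tea_coffee = interest_factor(150, 50, 650, 150)   # Table 5.7
lift_tea_honey = interest_factor(100, 100, 20, 780)    # Table 5.8
print(round(lift_tea_coffee, 4), round(lift_tea_honey, 4))
```

A value below 1 for tea and coffee and well above 1 for tea and honey matches the interpretation given in Equation 5.6.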
Piatetsky-Shapiro (PS) Measure
Instead of computing the ratio between s(A, B) and sindep(A, B)=s(A)×s(B),
the PS measure considers the difference between s(A, B) and s(A)×s(B) as
$$PS = s(A, B) - s(A) \times s(B) = \frac{f_{11}}{N} - \frac{f_{1+} f_{+1}}{N^2} \tag{5.7}$$
The PS value is 0 when A and B are mutually independent of each other.
Otherwise, PS>0 when there is a positive relationship between the two
variables, and PS<0 when there is a negative relationship.
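As a worked illustration, the PS values for the two running examples can be computed from the counts in Tables 5.7 and 5.8. These values are derived here from the tables and are not quoted from the text.

```latex
\begin{align*}
PS(\text{Tea}, \text{Coffee}) &= \frac{150}{1000} - \frac{200 \times 800}{1000^2}
  = 0.15 - 0.16 = -0.01, \\
PS(\text{Tea}, \text{Honey}) &= \frac{100}{1000} - \frac{200 \times 120}{1000^2}
  = 0.10 - 0.024 = 0.076.
\end{align*}
```

The signs agree with the interpretation above: slightly negative for tea and coffee, positive for tea and honey.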
Correlation Analysis
Correlation analysis is one of the most popular techniques for analyzing
relationships between a pair of variables. For continuous variables,
correlation is defined using Pearson’s correlation coefficient (see Equation
2.10 on page 83). For binary variables, correlation can be measured using the
ϕ-coefficient, which is defined as
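$$\phi = \frac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}. \tag{5.8}$$
If we rearrange the terms in Equation 5.8, we can show that the ϕ-coefficient can be
rewritten in terms of the support measures of A, B, and {A, B} as follows:
$$\phi = \frac{s(A, B) - s(A) \times s(B)}{\sqrt{s(A) \times (1 - s(A)) \times s(B) \times (1 - s(B))}}. \tag{5.9}$$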
Note that the numerator in the above equation is identical to the PS measure.
Hence, the ϕ-coefficient can be understood as a normalized version of the PS
measure, whose value ranges from −1 to +1.
From a statistical viewpoint, the correlation captures the normalized
difference between s(A, B) and sindep(A, B). A correlation value of 0 means
no relationship, while a value of +1 suggests a perfect positive relationship
and a value of −1 suggests a perfect negative relationship. The correlation
measure has a statistical meaning and hence is widely used to evaluate the
strength of statistical independence among variables. For instance, the
correlation between tea and coffee drinkers in Table 5.7 is −0.0625, which is
slightly less than 0. On the other hand, the correlation between people who
drink tea and people who use honey in Table 5.8 is 0.5847, suggesting a
positive relationship.
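As a rough check of Equations 5.7 and 5.9, the PS measure and the ϕ-coefficient can be computed directly from supports. The sketch below uses the supports quoted earlier for the tea-coffee and tea-honey examples; the function names are illustrative and not part of the original text.

```python
import math

def ps_measure(s_ab, s_a, s_b):
    """Difference between the observed support and the support expected under independence."""
    return s_ab - s_a * s_b

def phi_from_supports(s_ab, s_a, s_b):
    """PS normalized by the square root of s(A)(1 - s(A)) s(B)(1 - s(B)), as in Equation 5.9."""
    return ps_measure(s_ab, s_a, s_b) / math.sqrt(s_a * (1 - s_a) * s_b * (1 - s_b))

phi_tea_coffee = phi_from_supports(0.15, 0.2, 0.8)
phi_tea_honey = phi_from_supports(0.10, 0.2, 0.12)

print(round(phi_tea_coffee, 4))  # -0.0625
print(round(phi_tea_honey, 4))   # 0.5847
```

The sign of ϕ always matches the sign of PS, since the denominator is positive whenever both supports lie strictly between 0 and 1.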
IS Measure
IS is an alternative measure for capturing the relationship between s(A, B) and
s(A)×s(B). The IS measure is defined as follows:
$$IS(A, B) = \sqrt{I(A, B) \times s(A, B)} = \frac{s(A, B)}{\sqrt{s(A) s(B)}} = \frac{f_{11}}{\sqrt{f_{1+} f_{+1}}}. \tag{5.10}$$
Although the definition of IS looks quite similar to the interest factor, the
two measures differ in some interesting ways. Since IS is the geometric mean of
the interest factor and the support of a pattern, IS is large when both the
interest factor and support are large. Hence, if the interest factors of two
patterns are identical, IS prefers the pattern with
higher support. It is also possible to show that IS is mathematically equivalent
to the cosine measure for binary variables (see Equation 2.6 on page 81). The
value of IS thus varies from 0 to 1, where an IS value of 0 corresponds to no
co-occurrence of the two variables, while an IS value of 1 denotes a perfect
relationship, since the variables occur in exactly the same transactions. For the
tea-coffee example shown in Table 5.7, the value of IS is equal to 0.375, while
the value of IS for the tea-honey example in Table 5.8 is 0.6455. The IS
measure thus gives a higher value for the {Tea}→{Honey} rule than the
{Tea}→{Coffee} rule, which is consistent with our understanding of the two
rules.
Alternative Objective
Interestingness Measures
Note that all of the measures defined in the previous section use different
techniques to capture the deviance between s(A, B) and sindep(A, B)=s(A)×s(B).
Some measures use the ratio between s(A, B) and sindep(A, B), e.g., the
interest factor and IS, while some other measures consider the difference
between the two, e.g., the PS and the ϕ-coefficient. Some measures are
bounded in a particular range, e.g., the IS and the ϕ-coefficient, while others
are unbounded and do not have a defined maximum or minimum value, e.g.,
the Interest Factor. Because of such differences, these measures behave
differently when applied to different types of patterns. Indeed, the measures
defined above are not exhaustive and there exist many alternative measures
for capturing different properties of relationships between pairs of binary
variables. Table 5.9 provides the definitions for some of these measures in
terms of the frequency counts of a 2×2 contingency table.
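Table 5.9. Examples of objective measures for the itemset {A, B}.
Measure (Symbol)          Definition
Correlation (ϕ)           $\frac{N f_{11} - f_{1+} f_{+1}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}$
Odds ratio (α)            $\frac{f_{11} f_{00}}{f_{10} f_{01}}$
Kappa (κ)                 $\frac{N f_{11} + N f_{00} - f_{1+} f_{+1} - f_{0+} f_{+0}}{N^2 - f_{1+} f_{+1} - f_{0+} f_{+0}}$
Interest (I)              $\frac{N f_{11}}{f_{1+} f_{+1}}$
Cosine (IS)               $\frac{f_{11}}{\sqrt{f_{1+} f_{+1}}}$
Piatetsky-Shapiro (PS)    $\frac{f_{11}}{N} - \frac{f_{1+} f_{+1}}{N^2}$
Collective strength (S)   $\frac{f_{11} + f_{00}}{f_{1+} f_{+1} + f_{0+} f_{+0}} \times \frac{N - f_{1+} f_{+1} - f_{0+} f_{+0}}{N - f_{11} - f_{00}}$
Jaccard (ζ)               $\frac{f_{11}}{f_{1+} + f_{+1} - f_{11}}$
All-confidence (h)        $\min\left[\frac{f_{11}}{f_{1+}}, \frac{f_{11}}{f_{+1}}\right]$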
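Most of the definitions in Table 5.9 translate directly into code. The following is a minimal sketch, assuming the counts are given in the order f11, f10, f01, f00 and that no marginal total is zero (no guard against division by zero is included). Collective strength is left out, and the function name `measures` and the dictionary keys are illustrative.

```python
import math

def measures(f11, f10, f01, f00):
    """Evaluate several Table 5.9 measures from the counts of a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01   # marginal counts f1+ and f+1
    f0p, fp0 = f01 + f00, f10 + f00   # marginal counts f0+ and f+0
    return {
        "phi": (n * f11 - f1p * fp1) / math.sqrt(f1p * fp1 * f0p * fp0),
        "odds_ratio": (f11 * f00) / (f10 * f01),
        "kappa": (n * f11 + n * f00 - f1p * fp1 - f0p * fp0)
                 / (n ** 2 - f1p * fp1 - f0p * fp0),
        "interest": n * f11 / (f1p * fp1),
        "cosine": f11 / math.sqrt(f1p * fp1),
        "ps": f11 / n - f1p * fp1 / n ** 2,
        "jaccard": f11 / (f1p + fp1 - f11),
        "all_confidence": min(f11 / f1p, f11 / fp1),
    }

# Counts of contingency table E8 from Table 5.10.
e8 = measures(4000, 2000, 2000, 2000)
for name, value in e8.items():
    print(f"{name}: {value:.4f}")
```

Returning all values in one dictionary keeps the marginal totals in a single place, which makes it easier to compare how different measures react to the same table.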
Consistency among Objective Measures
Given the wide variety of measures available, it is reasonable to question
whether the measures can produce similar ordering results when applied to a
set of association patterns. If the measures are consistent, then we can choose
any one of them as our evaluation metric. Otherwise, it is important to
understand what their differences are in order to determine which measure is
more suitable for analyzing certain types of patterns.
Suppose the measures defined in Table 5.9 are applied to rank the ten
contingency tables shown in Table 5.10. These contingency tables are chosen
to illustrate the differences among the existing measures. The ordering
produced by these measures is shown in Table 5.11 (with 1 as the most
interesting and 10 as the least interesting table). Although some of the
measures appear to be consistent with each other, others produce quite
different ordering results. For example, the rankings given by the
ϕ-coefficient agree mostly with those provided by κ and collective strength,
but are quite different from the rankings produced by interest factor.
Furthermore, a contingency table such as E10 is ranked lowest according to
the ϕ-coefficient, but highest according to interest factor.
Table 5.10. Example of
contingency tables.
Example f11 f10 f01 f00
E1 8123 83 424 1370
E2 8330 2 622 1046
E3 3954 3080 5 2961
E4 2886 1363 1320 4431
E5 1500 2000 500 6000
E6 4000 2000 1000 3000
E7 9481 298 127 94
E8 4000 2000 2000 2000
E9 7450 2483 4 63
E10 61 2483 4 7452
Table 5.11. Rankings of
contingency tables using the
measures given in Table 5.9.
ϕ α κ I IS PS S ζ h
E1 1 3 1 6 2 2 1 2 2
E2 2 1 2 7 3 5 2 3 3
E3 3 2 4 4 5 1 3 6 8
E4 4 8 3 3 7 3 4 7 5
E5 5 7 6 2 9 6 6 9 9
E6 6 9 5 5 6 4 5 5 7
E7 7 6 7 9 1 8 7 1 1
E8 8 10 8 8 8 7 8 8 7
E9 9 4 9 10 4 9 9 4 4
E10 10 5 10 1 10 10 10 10 10
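The disagreement between rankings can be checked for two of the measures. The sketch below ranks the ten tables of Table 5.10 by the ϕ-coefficient and by the interest factor, using the definitions in Table 5.9; the helper names (`TABLES`, `ranking`) are illustrative.

```python
import math

# Counts (f11, f10, f01, f00) copied from Table 5.10.
TABLES = {
    "E1": (8123, 83, 424, 1370),
    "E2": (8330, 2, 622, 1046),
    "E3": (3954, 3080, 5, 2961),
    "E4": (2886, 1363, 1320, 4431),
    "E5": (1500, 2000, 500, 6000),
    "E6": (4000, 2000, 1000, 3000),
    "E7": (9481, 298, 127, 94),
    "E8": (4000, 2000, 2000, 2000),
    "E9": (7450, 2483, 4, 63),
    "E10": (61, 2483, 4, 7452),
}

def phi(f11, f10, f01, f00):
    num = f11 * f00 - f01 * f10
    den = math.sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return num / den

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def ranking(measure):
    """Table names ordered from most to least interesting under `measure`."""
    return sorted(TABLES, key=lambda name: measure(*TABLES[name]), reverse=True)

print("phi:     ", ranking(phi))
print("interest:", ranking(interest))
```

Under this sketch E10 comes last in the ϕ ordering and first in the interest-factor ordering, in line with the corresponding columns of Table 5.11.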
Properties of Objective Measures
The results shown in Table 5.11 suggest that the measures greatly differ from
each other and can provide conflicting information about the quality of a
pattern. In fact, no measure is universally best for all applications. In the
following, we describe some properties of the measures that play an
important role in determining if they are suited for a certain application.
Inversion Property
Consider the binary vectors shown in Figure 5.28. The 0/1 value in each
column vector indicates whether a transaction (row) contains a particular item
(column). For example, the vector A indicates that the item appears in the
first and last transactions, whereas the vector B indicates that the item is
contained only in the fifth transaction. The vectors A¯ and B¯ are the
inverted versions of A and B, i.e., their 1 values have been changed to 0
values (presence to absence) and vice versa. Applying this transformation to
a binary vector is called inversion. If a measure is invariant under the
inversion operation, then its value for the vector pair {A¯, B¯} should be
identical to its value for {A, B}. The inversion property of a measure can be
tested as follows.
Figure 5.28.
Effect of the inversion operation. The vectors A¯ and B¯ are
inversions of vectors A and B, respectively.
Definition 5.6. (Inversion Property.)
An objective measure M is invariant under the inversion operation if its value
remains the same when exchanging the frequency counts f11 with f00 and
f10 with f01.
Measures that are invariant under the inversion operation include the correlation
(ϕ-coefficient), odds ratio, κ, and collective strength. These measures are
especially useful in scenarios where the presence (1’s) of a variable is as
important as its absence (0’s). For example, if we compare two sets of
answers to a series of true/false questions where 0’s (true) and 1’s (false) are
equally meaningful, we should use a measure that gives equal importance to
occurrences of 0–0’s and 1–1’s. For the vectors shown in Figure 5.28, the ϕ-
coefficient is equal to -0.1667 regardless of whether we consider the pair {A,
B} or pair {A¯, B¯}. Similarly, the odds ratio for both pairs of vectors is
equal to a constant value of 0. Note that even though the ϕ-coefficient and the
odds ratio are invariant to inversion, they can still show different results, as
will be shown later.
Measures that do not remain invariant under the inversion operation include
the interest factor and the IS measure. For example, the IS value for the pair
{A¯, B¯} in Figure 5.28 is 0.825, which reflects the fact that the 1’s in A¯
and B¯ occur frequently together. However, the IS value of its inverted pair
{A, B} is equal to 0, since A and B do not have any co-occurrence of 1’s. For
asymmetric binary variables, e.g., the occurrence of words in documents, this
is indeed the desired behavior. A desired similarity measure between
asymmetric variables should not be invariant to inversion, since for these
variables, it is more meaningful to capture relationships based on the
presence of a variable rather than its absence. On the other hand, if we are
dealing with symmetric binary variables where the relationships between 0’s
and 1’s are equally meaningful, care should be taken to ensure that the chosen
measure is invariant to inversion.
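The behavior of the two kinds of measures under inversion can be illustrated with a short sketch. The vectors below follow the description of Figure 5.28 in the text (A present in the first and last transactions, B only in the fifth); the length of ten transactions is an assumption, chosen because it is consistent with the values reported above.

```python
import math

# Assumed reconstruction of the vectors in Figure 5.28 (ten transactions).
A = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
B = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

def invert(v):
    """Flip every 0 to 1 and every 1 to 0."""
    return [1 - x for x in v]

def counts(x, y):
    """Return (f11, f10, f01, f00) for two binary vectors of equal length."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00

def phi(f11, f10, f01, f00):
    num = f11 * f00 - f01 * f10
    den = math.sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return num / den

def cosine(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

original = counts(A, B)
inverted = counts(invert(A), invert(B))

print(round(phi(*original), 4), round(phi(*inverted), 4))        # phi is unchanged
print(round(cosine(*original), 3), round(cosine(*inverted), 3))  # IS changes
```

Inversion simply swaps f11 with f00 and f10 with f01, which is why a measure that treats those pairs symmetrically, such as ϕ, keeps its value.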
Although the values of the interest factor and IS change with the inversion
operation, they can still be inconsistent with each other. To illustrate this,
consider Table 5.12, which shows the contingency tables for two pairs of
variables, {p, q} and {r, s}. Note that r and s are inverted transformations of
p and q, respectively, where the roles of 0’s and 1’s have just been reversed.
The interest factor for {p, q} is 1.02 and for {r, s} is 4.08, which means that
the interest factor finds the inverted pair {r, s} more related than the original
pair {p, q}. On the contrary, the IS value decreases upon inversion from
0.946 for {p, q} to 0.286 for {r, s}, suggesting quite an opposite trend to
that of the interest factor. Even though these measures conflict with each
other for this example, they may be the right choice of measure in different
applications.
Table 5.12. Contingency tables for the pairs {p, q} and {r, s}.
         p     p¯
q      880     50    930
q¯      50     20     70
       930     70   1000

         r     r¯
s       20     50     70
s¯      50    880    930
        70    930   1000
Scaling Property
Table 5.13 shows two contingency tables for gender and the grades achieved
by students enrolled in a particular course. These tables can be used to study
the relationship between gender and performance in the course. The second
contingency table has data from the same population but has twice as many
males and three times as many females. The actual number of males or
females can depend upon the samples available for study, but the relationship
between gender and grade should not change just because of differences in
sample sizes. Similarly, if the number of students with high and low grades
are changed in a new study, a measure of association between gender and
grades should remain unchanged. Hence, we need a measure that is invariant
to scaling of rows or columns. The process of multiplying a row or column of
a contingency table by a constant value is called a row or column scaling
operation. A measure that is invariant to scaling does not change its value
after any row or column scaling operation.
Table 5.13. The grade-gender example.
(a) Sample data of size 100.
         Male   Female   Total
High       30       20      50
Low        40       10      50
Total      70       30     100
(b) Sample data of size 230.
         Male   Female   Total
High       60       60     120
Low        80       30     110
Total     140       90     230
Definition 5.7. (Scaling Invariance Property.)
Let T be a contingency table with frequency counts [f11; f10; f01; f00]. Let T′
be the transformed contingency table with scaled frequency counts
[k1k3f11; k2k3f10; k1k4f01; k2k4f00], where k1, k2, k3, k4 are positive
constants used to scale the two rows and the two columns of T. An objective
measure M is invariant under the row/column scaling operation if M(T)=M(T′).
Note that the use of the term ‘scaling’ here should not be confused with the
scaling operation for continuous variables introduced in Chapter 2 on page
23, where all the values of a variable were being multiplied by a constant
factor, instead of scaling a row or column of a contingency table.
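The scaling operation of Definition 5.7 can be tried out on Table 5.13. In the sketch below the cell order is (f11, f10, f01, f00) with rows High/Low and columns Male/Female, which is an assumed mapping; `scale`, `odds_ratio`, and `phi` are illustrative helper names.

```python
import math

def scale(table, k1, k2, k3, k4):
    """Scale the columns by k1, k2 and the rows by k3, k4, as in Definition 5.7."""
    f11, f10, f01, f00 = table
    return (k1 * k3 * f11, k2 * k3 * f10, k1 * k4 * f01, k2 * k4 * f00)

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def phi(f11, f10, f01, f00):
    num = f11 * f00 - f01 * f10
    den = math.sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return num / den

table_a = (30, 20, 40, 10)                       # Table 5.13(a)
table_b = scale(table_a, k1=2, k2=3, k3=1, k4=1)  # twice the males, three times the females

print(table_b)                                   # (60, 60, 80, 30), i.e., Table 5.13(b)
print(odds_ratio(*table_a), odds_ratio(*table_b))
print(round(phi(*table_a), 4), round(phi(*table_b), 4))
```

The odds ratio is unchanged because every scaling constant appears once in its numerator and once in its denominator, whereas the ϕ-coefficient takes a different value on the two tables.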
Scaling of rows and columns in contingency tables occurs in multiple ways in
different applications. For example, if we are measuring the effect of a
particular medical procedure on two sets of subjects, healthy and diseased,
the ratio of healthy and diseased subjects can widely vary across different
studies involving different groups of participants. Further, the fraction of
healthy and diseased subjects chosen for a controlled study can be quite
different from the true fraction observed in the complete population. These
differences can result in a row or column scaling in the contingency tables for
different populations of subjects. In general, the frequencies of items in a
contingency table depend closely on the sample of transactions used to
generate the table. Any change in the sampling procedure may result in a row or
column scaling transformation. A measure that is expected to be invariant to
differences in the sampling procedure must not change with row or column
scaling.
Of all the measures introduced in Table 5.9, only the odds ratio (α) is
invariant to row and column scaling operations. For example, the value of
odds ratio for both the tables in Table 5.13 is equal to 0.375. All other
measures such as the ϕ-coefficient, κ, IS, interest factor, and collective
strength (S) change their values when the rows and columns of the
contingency table are rescaled. Indeed, the odds ratio is a preferred choice of
measure in the medical domain, where it is important to find relationships
that do not change with differences in the population sample chosen for a
study.
Null Addition Property
Suppose we are interested in analyzing the relationship between a pair of
words, such as data and mining, in a set of documents. If a collection of
articles about ice fishing is added to the data set, should the association
between data and mining be affected? This process of adding unrelated data
(in this case, documents) to a given data set is known as the null addition
operation.
Definition 5.8. (Null Addition Property.)
An objective measure M is invariant under the null addition operation if it is
not affected by increasing f00, while all other frequencies in the contingency
table stay the same.
For applications such as document analysis or market basket analysis, we
would like to use a measure that remains invariant under the null addition
operation. Otherwise, the relationship between words can be made to change
simply by adding enough documents that contain neither word!
Examples of measures that satisfy this property include cosine (IS) and
Jaccard (ζ) measures, while those that violate this property include interest
factor, PS, odds ratio, and the ϕ-coefficient.
To demonstrate the effect of null addition, consider the two contingency
tables T1 and T2 shown in Table 5.14. Table T2 has been obtained from T1
by adding 1000 extra transactions with both A and B absent. This operation
only affects the f00 entry of Table T2, which has increased from 100 to 1100,
whereas all the other frequencies in the table (f11, f10, and f01) remain the
same. Since IS is invariant to null addition, it gives a constant value of 0.875
to both the tables. However, the addition of 1000 extra transactions with
occurrences of 0–0’s changes the value of the interest factor from 1.094 for T1
to 2.188 for T2, doubling the apparent strength of the positive relationship.
Similarly, the value of the odds ratio increases from 7 for T1 to 77
for T2. Hence, when the interest factor or odds ratio is used as the
association measure, the relationship between variables changes with the
addition of null transactions where both the variables are absent. In contrast,
the IS measure is invariant to null addition, since it considers two variables to
be related only if they frequently occur together. Indeed, the IS measure
(cosine measure) is widely used to measure similarity among documents,
which is expected to depend only on the joint occurrences (1’s) of words in
documents, but not their absences (0’s).
Table 5.14. An example demonstrating the effect of null addition.
(a) Table T1.
         B     B¯
A      700    100    800
A¯     100    100    200
       800    200   1000
(b) Table T2.
         B     B¯
A      700    100    800
A¯     100   1100   1200
       800   1200   2000
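The effect of null addition on tables T1 and T2 can be verified with a few lines of code. This is a sketch using the definitions of Table 5.9, with counts given as (f11, f10, f01, f00); the function names are illustrative.

```python
import math

def cosine(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

t1 = (700, 100, 100, 100)
t2 = (700, 100, 100, 1100)   # T1 plus 1000 transactions containing neither A nor B

print(cosine(*t1), cosine(*t2))          # identical: f00 does not appear in the formula
print(odds_ratio(*t1), odds_ratio(*t2))  # grows with f00
print(interest(*t2) / interest(*t1))     # interest scales with N, which has doubled
```

Since f00 enters the interest factor only through N, doubling the number of transactions while keeping f11, f1+, and f+1 fixed doubles its value.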
Table 5.15 provides a summary of properties for the measures defined in
Table 5.9. Even though this list of properties is not exhaustive, it can serve as
a useful guide for selecting the right choice of measure for an application.
Ideally, if we know the specific requirements of a certain application, we can
ensure that the selected measure shows properties that adhere to those
requirements. For example, if we are dealing with asymmetric variables, we
would prefer to use a measure that is invariant to null addition but not to
inversion. On the other hand, if we require the measure to remain invariant to
changes in the sample size, we would like to use a measure that does not
change with scaling.
Table 5.15. Properties of
symmetric measures.
Symbol Measure Inversion Null Addition Scaling
ϕ ϕ-coefficient Yes No No
α odds ratio Yes No Yes
κ Cohen’s Yes No No
I Interest No No No
IS Cosine No Yes No
PS Piatetsky-Shapiro’s Yes No No
S Collective strength Yes No No
ζ Jaccard No Yes No
h All-confidence No Yes No
s Support No No No
Asymmetric Interestingness Measures
Note that in the discussion so far, we have only considered measures that do
not change their value when the order of the variables is reversed. More
specifically, if M is a measure and A and B are two variables, then M(A, B) is
equal to M(B, A) if the order of the variables does not matter. Such measures
are called symmetric. On the other hand, measures that depend on the order
of variables (M(A, B)≠M(B, A)) are called asymmetric measures. For
example, the interest factor is a symmetric measure because its value is
identical for the rules A→B and B→A. In contrast, confidence is an
asymmetric measure since the confidence for A→B and B→A may not be
the same. Note that the use of the term ‘asymmetric’ to describe a particular
type of measure of relationship—one in which the order of the variables is
important—should not be confused with the use of ‘asymmetric’ to describe a
binary variable for which only 1’s are important. Asymmetric measures are
more suitable for analyzing association rules, since the items in a rule do
have a specific order. Even though we only considered symmetric measures
to discuss the different properties of association measures, the above
discussion is also relevant for the asymmetric measures. See Bibliographic
Notes for more information about different kinds of asymmetric measures
and their properties.
5.7.2 Measures beyond Pairs of
Binary Variables
The measures shown in Table 5.9 are defined for pairs of binary variables
(e.g., 2-itemsets or association rules). However, many of them, such as
support and all-confidence, are also applicable to larger-sized itemsets. Other
measures, such as interest factor, IS, PS, and Jaccard coefficient, can be
extended to more than two variables using the frequency tables tabulated in a
multidimensional contingency table. An example of a three-dimensional
contingency table for a, b, and c is shown in Table 5.16. Each entry fijk in
this table represents the number of transactions that contain a particular
combination of items a, b, and c. For example, f101 is the number of
transactions that contain a and c, but not b. On the other hand, a marginal
frequency such as f1+1 is the number of transactions that contain a and c,
irrespective of whether b is present in the transaction.
Table 5.16. Example of a three-dimensional contingency table.
c
         b       b¯
a      f111    f101    f1+1
a¯     f011    f001    f0+1
       f+11    f+01    f++1
c¯
         b       b¯
a      f110    f100    f1+0
a¯     f010    f000    f0+0
       f+10    f+00    f++0
Given a k-itemset {i1, i2, …, ik}, the condition for statistical independence
can be stated as follows:
$$f_{i_1 i_2 \ldots i_k} = \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}}{N^{k-1}}. \tag{5.11}$$
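With this definition, we can extend objective measures such as interest factor
and PS, which are based on deviations from statistical independence, to more
than two variables:
$$I = \frac{N^{k-1} \times f_{i_1 i_2 \ldots i_k}}{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}}$$
$$PS = \frac{f_{i_1 i_2 \ldots i_k}}{N} - \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}}{N^k}$$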
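In terms of supports, the independence condition of Equation 5.11 says that the support of the itemset equals the product of the supports of its items. A minimal sketch of the resulting k-item interest factor and PS measure is given below; the toy transactions and the function names are hypothetical and serve only as an illustration.

```python
from math import prod

# Hypothetical toy data: eight transactions over the items a, b, and c.
transactions = [
    {"a", "b", "c"}, {"a", "b", "c"}, {"a"}, {"b"}, {"c"}, set(), set(), set(),
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def expected_support(itemset):
    """Support of the itemset if its items were statistically independent."""
    return prod(support({item}) for item in itemset)

def interest_k(itemset):
    return support(itemset) / expected_support(itemset)

def ps_k(itemset):
    return support(itemset) - expected_support(itemset)

abc = {"a", "b", "c"}
print(round(interest_k(abc), 4))  # observed support relative to the independence baseline
print(round(ps_k(abc), 4))        # observed support minus the independence baseline
```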
Another approach is to define the objective measure as the maximum,
minimum, or average value for the associations between pairs of items in a
pattern. For example, given a k-itemset X={i1, i2, …, ik}, we may define the
ϕ-coefficient for X as the average ϕ-coefficient between every pair of items
(ip, iq) in X. However, because the measure considers only pairwise
associations, it may not capture all the underlying relationships within a
pattern. Also, care should be taken in using such alternate measures for more
than two variables, since they may not always show the anti-monotone
property in the same way as the support measure, making them unsuitable for
mining patterns using the Apriori principle.
Analysis of multidimensional contingency tables is more complicated
because of the presence of partial associations in the data. For example, some
associations may appear or disappear when conditioned upon the value of
certain variables. This problem is known as Simpson’s paradox and is
described in Section 5.7.3. More sophisticated statistical techniques are
available to analyze such relationships, e.g., loglinear models, but these
techniques are beyond the scope of this book.
5.7.3 Simpson’s Paradox
It is important to exercise caution when interpreting the association between
variables because the observed relationship may be influenced by the
presence of other confounding factors, i.e., hidden variables that are not
included in the analysis. In some cases, the hidden variables may cause the
observed relationship between a pair of variables to disappear or reverse its
direction, a phenomenon that is known as Simpson’s paradox. We illustrate
the nature of this paradox with the following example.
Consider the relationship between the sale of high-definition televisions
(HDTV) and exercise machines, as shown in Table 5.17. The rule
{HDTV=Yes}→{Exercise machine=Yes} has a confidence of 99/180=55%
and the rule {HDTV=No}→{Exercise machine=Yes} has a confidence of
54/120=45%. Together, these rules suggest that customers who buy
high-definition televisions are more likely to buy exercise machines than those
who do not buy high-definition televisions.
Table 5.17. A two-way contingency table between the sale of high-definition
television and exercise machine.
              Buy Exercise Machine
Buy HDTV      Yes      No     Total
Yes            99      81       180
No             54      66       120
Total         153     147       300
However, a deeper analysis reveals that the sales of these items depend on
whether the customer is a college student or a working adult. Table 5.18
summarizes the relationship between the sale of HDTVs and exercise
machines among college students and working adults. Notice that the support
counts given in the table for college students and working adults sum up to
the frequencies shown in Table 5.17. Furthermore, there are more working
adults than college students who buy these items.
Table 5.18. Example of a three-way contingency table.
                                 Buy Exercise Machine
Customer Group      Buy HDTV     Yes      No     Total
College Students    Yes            1       9        10
                    No             4      30        34
Working Adult       Yes           98      72       170
                    No            50      36        86
For college students:
c({HDTV=Yes}→{Exercise machine=Yes})=1/10=10%,
c({HDTV=No}→{Exercise machine=Yes})=4/34=11.8%,
while for working adults:
c({HDTV=Yes}→{Exercise machine=Yes})=98/170=57.7%,
c({HDTV=No}→{Exercise machine=Yes})=50/86=58.1%.
The rules suggest that, for each group, customers who do not buy
high-definition televisions are more likely to buy exercise machines, which
contradicts the previous conclusion when data from the two customer groups
are pooled together. Even if alternative measures such as correlation, odds
ratio, or interest are applied, we still find that the sale of HDTV and exercise
machine is positively related in the combined data but is negatively related in
the stratified data (see Exercise 21 on page 449). The reversal in the direction
of association is known as Simpson’s paradox.
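The reversal can be reproduced from the counts in Table 5.18. The sketch below stores, for each customer group, the number of exercise-machine buyers and the group size for HDTV buyers and non-buyers; the dictionary layout and names are illustrative.

```python
# (exercise-machine buyers, customers) for HDTV buyers and non-buyers, from Table 5.18.
strata = {
    "college students": {"hdtv_yes": (1, 10), "hdtv_no": (4, 34)},
    "working adults": {"hdtv_yes": (98, 170), "hdtv_no": (50, 86)},
}

def confidence(hits, total):
    return hits / total

def pooled(key):
    """Add up the counts of all strata for HDTV buyers or non-buyers."""
    hits = sum(group[key][0] for group in strata.values())
    total = sum(group[key][1] for group in strata.values())
    return hits, total

for name, group in strata.items():
    yes, no = confidence(*group["hdtv_yes"]), confidence(*group["hdtv_no"])
    print(f"{name}: HDTV=Yes {yes:.3f}, HDTV=No {no:.3f}")

pooled_yes = confidence(*pooled("hdtv_yes"))
pooled_no = confidence(*pooled("hdtv_no"))
print(f"pooled: HDTV=Yes {pooled_yes:.3f}, HDTV=No {pooled_no:.3f}")
```

Within each group the HDTV=No rule has the higher confidence, while in the pooled data the HDTV=Yes rule does.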
The paradox can be explained in the following way. First, notice that most
customers who buy HDTVs are working adults. This is reflected in the high
confidence of the rule {HDTV=Yes}→{Working Adult}(170/180=94.4%).
Second, the high confidence of the rule {Exercise machine=Yes}
→{Working Adult}(148/153=96.7%) suggests that most customers who buy
exercise machines are also working adults. Since working adults form the
largest fraction of customers for both HDTVs and exercise machines, they
both look related and the rule {HDTV=Yes}→{Exercise machine=Yes}
turns out to be stronger in the combined data than it would have been if
the data had been stratified. Hence, customer group acts as a hidden variable that
affects both the fraction of customers who buy HDTVs and those who buy
exercise machines. If we factor out the effect of the hidden variable by
stratifying the data, we see that the relationship between buying HDTVs and
buying exercise machines is not direct, but shows up as an indirect
consequence of the effect of the hidden variable.
Simpson’s paradox can also be illustrated mathematically as follows. Suppose
$$\frac{a}{b} < \frac{c}{d} \quad \text{and} \quad \frac{p}{q} < \frac{r}{s},$$
where a/b and p/q may represent the confidence of the rule A→B in two
different strata, while c/d and r/s may represent the confidence of the rule
A¯→B in the two strata. When the data is pooled together, the confidence
values of the rules in the combined data are (a+p)/(b+q) and (c+r)/(d+s),
respectively. Simpson’s paradox occurs when
$$\frac{a+p}{b+q} > \frac{c+r}{d+s},$$
thus leading to the wrong conclusion about the relationship between the
variables. The lesson here is that proper stratification is needed to avoid
generating spurious patterns resulting from Simpson’s paradox. For example,
market basket data from a major supermarket chain should be stratified
according to store locations, while medical records from various patients
should be stratified according to confounding factors such as age and gender.
5.8 Effect of Skewed Support Distribution
The performances of many association analysis algorithms are influenced by
properties of their input data. For example, the computational complexity of
the Apriori algorithm depends on properties such as the number of items in
the data, the average transaction width, and the support threshold used. This
section examines another important property that has significant influence on
the performance of association analysis algorithms as well as the quality of
extracted patterns. More specifically, we focus on data sets with skewed
support distributions, where most of the items have relatively low to
moderate frequencies, but a small number of them have very high
frequencies.
Figure 5.29.
A transaction data set containing three items, p, q, and r, where p is
a high support item and q and r are low support items.
Figure 5.29 shows an illustrative example of a data set that has a skewed
support distribution of its items. While p has a high support of 83.3% in the
data, q and r are low-support items with a support of 16.7%. Despite their
low support, q and r always occur together in the limited number of
transactions in which they appear and hence are strongly related. A pattern mining
algorithm therefore should report {q, r} as interesting.
However, note that choosing the right support threshold for mining itemsets
such as {q, r} can be quite tricky. If we set the threshold too high (e.g., 20%),
then we may miss many interesting patterns involving low-support items
such as {q, r}. Conversely, setting the support threshold too low can be
detrimental to the pattern mining process for the following reasons. First, the
computational and memory requirements of existing association analysis
algorithms increase considerably with low support thresholds. Second, the
number of extracted patterns also increases substantially with low support
thresholds, which makes their analysis and interpretation difficult. In
particular, we may extract many spurious patterns that relate a
high-frequency item such as p to a low-frequency item such as q. Such patterns,
which are called cross-support patterns, are likely to be spurious because the
association between p and q is largely influenced by the frequent occurrence
of p instead of the joint occurrence of p and q together. Because the support
of {p, q} is quite close to the support of {q, r}, we may easily select {p, q} if
we set the support threshold low enough to include {q, r}.
An example of a real data set that exhibits a skewed support distribution is
shown in Figure 5.30. The data, taken from the PUMS (Public Use Microdata
Sample) census data, contains 49,046 records and 2113 asymmetric binary
variables. We shall treat the asymmetric binary variables as items and records
as transactions. While more than 80% of the items have support less than 1%,
a handful of them have support greater than 90%. To understand the effect of
skewed support distribution on frequent itemset mining, we divide the items
into three groups, G1, G2, and G3, according to their support levels, as
shown in Table 5.19. We can see that more than 82% of items belong to G1
and have a support less than 1%. In market basket analysis, such low support
items may correspond to expensive products (such as jewelry) that are
seldom bought by customers, but whose patterns are still interesting to
retailers. Patterns involving such low-support items, though meaningful, can
easily be rejected by a frequent pattern mining algorithm with a high support
threshold. On the other hand, setting a low support threshold may result in the
extraction of spurious patterns that relate a high-frequency item in G3 to a
low-frequency item in G1. For example, at a support threshold equal to
0.05%, there are 18,847 frequent pairs involving items from G1 and G3. Out
of these, 93% of them are cross-support patterns; i.e., the patterns contain
items from both G1 and G3.
Figure 5.30.
Support distribution of items in the census data set.
Table 5.19. Grouping the items
in the census data set based on
their support values.
Group G1 G2 G3
Support <1% 1%−90% >90%
Number of Items 1735 358 20
This example shows that a large number of weakly related cross-support
patterns can be generated when the support threshold is sufficiently low. Note
that finding interesting patterns in data sets with skewed support distributions
is not just a challenge for the support measure, but similar statements can be
made about many other objective measures discussed in the previous
sections. Before presenting a methodology for finding interesting patterns
and pruning spurious ones, we formally define the concept of cross-support
patterns.
Definition 5.9. (Cross-support Pattern.)
Let us define the support ratio, r(X), of an itemset X={i1, i2, …, ik} as
$$r(X) = \frac{\min[s(i_1), s(i_2), \ldots, s(i_k)]}{\max[s(i_1), s(i_2), \ldots, s(i_k)]} \tag{5.12}$$
Given a user-specified threshold hc, an itemset X is a cross-support pattern if
r(X)<hc.
Example 5.4.
Suppose the support for milk is 70%, while the support for sugar is 10% and
caviar is 0.04%. Given hc=0.01, the frequent itemset {milk, sugar, caviar} is
a cross-support pattern because its support ratio is
r=min[0.7, 0.1, 0.0004]/max[0.7, 0.1, 0.0004]=0.0004/0.7≈0.00057<0.01.
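The check in Example 5.4 amounts to a one-line computation. The sketch below applies Equation 5.12 to the item supports given in the example; the function names and the default threshold argument are illustrative.

```python
def support_ratio(item_supports):
    """Smallest item support divided by the largest item support (Equation 5.12)."""
    return min(item_supports) / max(item_supports)

def is_cross_support(item_supports, hc):
    """True if the support ratio falls below the user-specified threshold hc."""
    return support_ratio(item_supports) < hc

# Supports of milk, sugar, and caviar from Example 5.4.
supports = [0.7, 0.1, 0.0004]
r = support_ratio(supports)

print(r)
print(is_cross_support(supports, hc=0.01))  # True
```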
Existing measures such as support and confidence may not be sufficient to
eliminate cross-support patterns. For example, if we assume hc=0.3 for the
data set presented in Figure 5.29, the itemsets {p, q}, {p, r}, and {p, q, r} are
cross-support patterns because their support ratios, which are equal to 0.2, are
less than the threshold hc. However, their supports are comparable to that of
{q, r}, making it difficult to eliminate cross-support patterns without losing
interesting ones using a support-based pruning strategy. Confidence pruning
also does not help because the confidence of the rules extracted from
cross-support patterns can be very high. For example, the confidence for {q}→{p}
is 80% even though {p, q} is a cross-support pattern. The fact that the
cross-support pattern can produce a high confidence rule should not come as a
surprise because one of its items (p) appears very frequently in the data.
Therefore, p is expected to appear in many of the transactions that contain q.
Meanwhile, the rule {q}→{r} also has high confidence even though {q, r} is
not a cross-support pattern. This example demonstrates the difficulty of using
the confidence measure to distinguish between rules extracted from
cross-support patterns and interesting patterns involving strongly connected but
low-support items.
Even though the rule {q}→{p} has very high confidence, notice that the rule
{p}→{q} has very low confidence because most of the transactions that
contain p do not contain q. In contrast, the rule {r}→{q}, which is derived
from {q, r}, has very high confidence. This observation suggests that
cross-support patterns can be detected by examining the lowest confidence rule that
can be extracted from a given itemset. An approach for finding the rule with
the lowest confidence given an itemset can be described as follows.
1. Recall the following anti-monotone property of confidence:
$$\mathrm{conf}(\{i_1, i_2\} \rightarrow \{i_3, i_4, \ldots, i_k\}) \le \mathrm{conf}(\{i_1, i_2, i_3\} \rightarrow \{i_4, i_5, \ldots, i_k\}).$$
This property suggests that confidence never increases as we shift more
items from the left- to the right-hand side of an association rule. Because
of this property, the lowest confidence rule extracted from a frequent
itemset contains only one item on its left-hand side. We denote the set of
all rules with only one item on its left-hand side as R1.
2. Given a frequent itemset {i1, i2, …, ik}, the rule
$$\{i_j\} \rightarrow \{i_1, i_2, \ldots, i_{j-1}, i_{j+1}, \ldots, i_k\}$$
has the lowest confidence in R1 if s(ij)=max[s(i1), s(i2), …, s(ik)]. This
follows directly from the definition of confidence as the ratio between
the rule’s support and the support of the rule antecedent. Hence, the
confidence of a rule will be lowest when the support of the antecedent is
highest.
3. Summarizing the previous points, the lowest confidence attainable from
a frequent itemset {i1, i2, …, ik} is
$$\frac{s(\{i_1, i_2, \ldots, i_k\})}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}.$$
This expression is also known as the h-confidence or all-confidence
measure. Because of the anti-monotone property of support, the
numerator of the h-confidence measure is bounded by the minimum
support of any item that appears in the frequent itemset. In other words,
the h-confidence of an itemset X={i1, i2, …, ik} must not exceed the
following expression:
$$\text{h-confidence}(X) \le \frac{\min[s(i_1), s(i_2), \ldots, s(i_k)]}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}.$$
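The three steps above can be checked numerically. The sketch below searches every rule that can be extracted from an itemset by brute force and compares the lowest confidence found with the h-confidence expression. The transaction data and all function names are hypothetical, chosen only so that one item (p) is much more frequent than the others.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent, consequent, transactions):
    """conf(A -> C) = support(A and C together) / support(A)."""
    joint = support_count(set(antecedent) | set(consequent), transactions)
    return joint / support_count(antecedent, transactions)


def lowest_confidence_rule(itemset, transactions):
    """Brute-force search over every rule that splits the itemset in two."""
    items = sorted(itemset)
    best = None
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            conf = confidence(antecedent, consequent, transactions)
            if best is None or conf < best[0]:
                best = (conf, antecedent, consequent)
    return best


def h_confidence(itemset, transactions):
    """Itemset support divided by the largest single-item support."""
    joint = support_count(itemset, transactions)
    return joint / max(support_count([i], transactions) for i in itemset)


# Hypothetical data: p is very frequent, q and r are rare but co-occur.
transactions = [{"p"}] * 7 + [{"p", "q", "r"}] * 2 + [{"q", "r"}]

conf, antecedent, consequent = lowest_confidence_rule({"p", "q", "r"}, transactions)
print(antecedent, "->", consequent, round(conf, 3))
print(round(h_confidence({"p", "q", "r"}, transactions), 3))
```

On this data the lowest-confidence rule has the most frequent item, p, as its only antecedent, and its confidence matches the h-confidence of the itemset, so the exhaustive search is unnecessary in practice.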
Note that the upper bound of h-confidence in the above equation is exactly
the same as the support ratio (r) given in Equation 5.12. Because the support ratio for
a cross-support pattern is always less than hc, the h-confidence of the pattern
is also guaranteed to be less than hc. Therefore, cross-support patterns can be
eliminated by ensuring that the h-confidence values for the patterns exceed
hc. As a final note, the advantages of using h-confidence go beyond
eliminating cross-support patterns. The measure is also anti-monotone: adding
an item to an itemset can never increase its h-confidence.








Table of Contents

1.1.         Introduction/background/overview/significance
1.2.         Related research/Literature Review
1.3.         Motivation
1.4.         Referencing

2.1.         Overview
2.2.         Research Question/Hypothesis

3.1.         Aim
3.2.         Objectives

Phase 1: Starting the project
Phase 2: Development of an analytical framework for efficient bank marketing
Phase 3: Development of the efficient classifier
Phase 4: Verification and optimization of the model

6.1.         Overview
6.2.         Population and Study Sample
6.3.         Sample Size and Selection of Sample
6.4.         Sources of Data
6.5.         Collection of Data
6.6.         Exposure Assessment
6.7.         Data Management
6.8.         Data Analysis Strategies
6.9.         Research Design and Prototype
6.10.        Clients/Stakeholders
6.11.        Methods

  1. RESULTS

11.1.        Time Schedule (Gantt Chart)
11.2.        Activity Sequencing

12.1.        Strengths
12.2.        Weaknesses

  1. BUDGET

13.1.        Resource requirements
13.2.        Budget/Funding

Appendix 1: Questionnaire
Appendix 2:


 Related research/Literature Review



This project focuses on the design and development of an intelligent bank marketing system for more effective bank marketing. The project will be carried out in four phases, with a significant milestone at the end of each phase. In the first phase, we will download and prepare the bank marketing data. We will then survey the literature to gain a clear understanding of the bank marketing problem and its existing solutions.

The second phase will concentrate on developing an analytical framework that identifies the main components of bank marketing and analyzes the performance of different data mining algorithms.

The third phase will concentrate on the design and implementation of an efficient, reliable, and robust data mining tool that achieves better classification accuracy. Moreover, an ensemble of classifiers will be developed to increase classification performance.

The fourth and final phase will concentrate on preparing and submitting a conference paper. More details on the different phases are as follows:

Phase 1: Starting the project

First, we will download the bank marketing data and prepare it for use with the WEKA data mining tool. Because the data is imbalanced, we will balance it using the SMOTE technique to obtain efficient and accurate classification. We will then review previous studies in the same field to gain a clear understanding of the bank marketing problem, its solutions, and the methods applied so far.

Phase 2: Development of an analytical framework for efficient bank marketing

In this phase, an efficient, intelligent, and robust bank marketing system will be developed. The system will use the bank marketing data with robust attribute selection, and the extracted attributes will be fed to data mining techniques to create a more intelligent and efficient bank marketing environment. This platform will serve as the basis for designing and developing an application-oriented bank marketing system. The downloaded data can be processed with a variety of feature extraction and attribute selection techniques, and the different feature extraction techniques, attribute selection techniques, and data mining algorithms will then be compared. The feature extraction and attribute selection techniques employed should be robust and intelligent so that reliable bank marketing can be realized. The proposed platform will be used to understand and implement the intelligent system for bank marketing. Finding robust features and attributes in the bank marketing data will provide important insights into the effects of various parameters on the performance of the bank marketing system, and will show which feature extraction and attribute selection techniques are the most effective and least time-consuming.

A classification problem is referred to as imbalanced when the instances in one or several classes, known as the majority classes, outnumber the instances of the other classes, called the minority classes. The synthetic minority over-sampling technique (SMOTE) (Chawla et al., 2002) is a widely used over-sampling method. In SMOTE, instead of merely duplicating existing data, the positive class is over-sampled by creating synthetic instances in the feature space formed by the positive instances and their K-nearest neighbours.
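The interpolation idea behind SMOTE can be illustrated with a short, dependency-free sketch. It is a simplified version written for this proposal, not the WEKA implementation: the base instance for each synthetic point is picked at random rather than cycling through every minority instance, and the function name, parameters, and sample points are all assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class instances with two numeric attributes.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
synthetic = smote(minority, n_synthetic=6, k=2, seed=42)
for point in synthetic:
    print(point)
```

Each synthetic instance lies on a line segment between two real minority instances, which is why the technique adds variety that plain duplication does not.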


In this step, the feature extraction, attribute selection, and data mining algorithms will be implemented off-line. The feature extraction and attribute selection methods are data processing tools that extract features from the bank marketing data, and they are among the most important steps in data classification. Such processing is a highly effective way of selecting attributes and is frequently applied to complex, high-dimensional, multivariate data. When the features are not appropriate for the given classification problem, the resulting performance will be unsatisfactory: even if the classification algorithm is optimally chosen for the problem, it cannot achieve high performance with improper features or attributes. It is therefore essential to find and extract suitable features from the raw data in order to obtain good classification results. Several feature extraction and attribute selection techniques will be applied, including the CFS Subset Evaluator and the InfoGain Attribute Evaluator.

Phase 3: Development of the efficient classifier

The marketing data is multi-dimensional. It is therefore difficult to find a robust combination of feature extraction, attribute selection, and data mining algorithms for bank marketing. Data mining algorithms are able to distinguish between different types of data. In this phase, the focus will be on comparing different data mining algorithms in the field of bank marketing. The outcome will be an analysis and a choice of the most appropriate set of feature extraction, attribute selection, and data mining techniques for bank marketing. After different feature extraction and attribute selection algorithms have been applied, existing data mining algorithms (such as ANN, k-NN, SVM, and decision tree algorithms) capable of dealing with network data will be implemented and tested. For this purpose, various classification schemes for a particular data classification task will be developed. The aim of using data mining techniques is to improve bank marketing. Even though one of the algorithms will produce the best overall performance, an ensemble classifier approach, in which more than one classification scheme is considered, can give better classification accuracy. Classifier ensembles, which are multiple classifier systems trained on different data or feature subsets, will be used to obtain better performance and accuracy.

Phase 4: Verification and optimization of the model

The fourth and final phase will concentrate on preparing and submitting a conference paper. Here, performance metrics such as the area under the ROC curve, F-measure, kappa statistic, and total classification accuracy will be computed with the WEKA software. The experimental results will then be used to verify the accuracy, reliability, and robustness of the developed models and will provide feedback for improving and fine-tuning the models from phases 1, 2, and 3. The results of the ensemble classifiers will be benchmarked against classical single classifiers. The goal is to arrive at efficient, intelligent, and robust methods for bank marketing.
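For reference, three of the metrics named above can be computed directly from a binary confusion matrix. The sketch below is illustrative only: the counts are made up, the function name is an assumption, and the area under the ROC curve is left out because it needs ranked prediction scores rather than a single confusion matrix.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F-measure, and Cohen's kappa from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column totals.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "f_measure": f_measure, "kappa": kappa}


# Hypothetical counts for the positive (minority) class.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data, kappa and F-measure are usually more informative than total accuracy, because a classifier that always predicts the majority class can still score a high accuracy.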


2.1.             Overview

Use headings 2 and 3 as appropriate, and use these headings if appropriate.


2.2.            Population and Study Sample


2.3.            Sample Size and Selection of Sample


2.4.            Sources of Data


2.5.            Collection of Data


2.6.            Exposure Assessment


2.7.            Data Management


2.8.            Data Analysis Strategies

2.9.            Research Design and Prototype

  • Use case Diagrams
  • Database Diagrams
  • Relation Diagrams


2.10.     Clients/Stakeholders


3.     RESULTS

  • Give the tables and results of your research project one by one
  • Give the explanation on the results.


  • Give discussion on the results of your research project
  • Give comments on the results.


  • Conclude and summarize the results of your research project.



7.1.             Time Schedule (Gantt Chart)

Change the activities and relate them to the topic, which is COVID-19.

The proposed schedule of steps is shown in the table below.


Activity                          Months 1 to 36
ANN Stock Market Prediction
Literature Survey
Literature Search
Literature Review
Completed Literature Review
Get Stock Market Data
Develop ANN
Investigate and Evaluate ANN
Design ANN
Develop and Test ANN
Train ANN
Use Stock Market Models
Review Statistical Tests
Analyse and Evaluate
Complete Report
Project Completed


7.2.            Activity Sequencing

An activity-on-the-node diagram represents the tasks you are performing in your project as nodes connected by arrows (Dawson, 2005).



8.1.          Strengths

8.2.          Weaknesses


9.             REFERENCES











Appendix 1: Questionnaire



Appendix 2:


Check the related slides and Rubric for the project report details




Because h-confidence is anti-monotone, i.e., h-confidence({i1, i2, …, ik}) ≥ h-confidence({i1, i2, …, ik+1}),
it can be incorporated directly into the mining algorithm. Furthermore,
h-confidence ensures that the items contained in an itemset are strongly
associated with each other. For example, suppose the h-confidence of an
itemset X is 80%. If one of the items in X is present in a transaction, there is
at least an 80% chance that the rest of the items in X also belong to the same
transaction. Such strongly associated patterns involving low-support items
are called hyperclique patterns.
Definition 5.10. (Hyperclique Pattern)
An itemset X is a hyperclique pattern if h-confidence(X)>hc, where hc is a
user-specified threshold.
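Definition 5.10 translates into a simple filter. The sketch below enumerates every itemset of size two or more and keeps those whose h-confidence exceeds hc. It is a brute-force illustration on hypothetical transactions and does not use the anti-monotone property for pruning, as a real mining algorithm would; all names are assumptions.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def h_confidence(itemset, transactions):
    """Itemset support divided by the largest single-item support."""
    joint = support_count(itemset, transactions)
    return joint / max(support_count([i], transactions) for i in itemset)


def hyperclique_patterns(transactions, hc):
    """Return {itemset: h-confidence} for itemsets with h-confidence > hc."""
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            h = h_confidence(itemset, transactions)
            if h > hc:
                found[itemset] = h
    return found


# Hypothetical data: p is very frequent, q and r are rare but co-occur.
transactions = [{"p"}] * 7 + [{"p", "q", "r"}] * 2 + [{"q", "r"}]

patterns = hyperclique_patterns(transactions, hc=0.5)
print(patterns)
```

With hc = 0.5, only the pair of rare but strongly associated items (q, r) survives, while every itemset that mixes them with the very frequent item p is discarded.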
