Machine Learning with Python – Techniques

This chapter discusses each of the techniques used in machine learning in detail.


Classification

Classification is a machine learning technique that uses known data to determine how the new data should be classified into a set of existing categories.

Consider the following examples to understand classification technique −

A credit card company receives tens of thousands of applications for new credit cards. These applications contain information about several different features such as age, location, sex, annual salary and credit record. The task of the algorithm here is to classify the card applicants into categories such as those who have a good credit record, those who have a bad credit record and those who have a mixed credit record.

In a hospital, the emergency room has more than 15 features (age, blood pressure, heart condition, severity of ailment, etc.) to analyze before deciding whether a given patient should be put in an intensive care unit. Because intensive care is a costly proposition, only those patients who can survive and afford the cost are given top priority. The problem here is to classify the patients into high-risk and low-risk patients based on the available features or parameters.

While classifying a given set of data, the classifier system performs the following actions −

  • Initially a new data model is prepared using any of the learning algorithms.

  • Then the prepared data model is tested.

  • Later, this data model is used to examine the new data and to determine its class.
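The three steps above can be sketched in plain Python. This is only an illustrative sketch: the nearest-centroid rule, the toy applicant data (age, annual salary in thousands) and the function names are assumptions made for this example, not a method prescribed by this chapter.

```python
def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train_centroids(samples, labels):
    """Step 1: prepare a model, here the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Step 3: assign the class whose centroid is closest to the new data."""
    return min(model, key=lambda label: euclidean(model[label], features))

def accuracy(model, samples, labels):
    """Step 2: test the model on held-out data whose classes are known."""
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)

# Hypothetical applicants: (age, annual salary in thousands) -> credit record
train_x = [(25, 20), (30, 25), (45, 90), (50, 110)]
train_y = ["bad", "bad", "good", "good"]
test_x = [(28, 22), (48, 100)]
test_y = ["bad", "good"]

model = train_centroids(train_x, train_y)        # prepare the model
test_accuracy = accuracy(model, test_x, test_y)  # test the model
new_class = predict(model, (40, 85))             # classify new data
print(test_accuracy, new_class)
```

In practice the features would usually be scaled first, since a distance-based rule like this one is dominated by whichever feature has the largest numeric range.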

Classification is also called categorization, and the existing categories are also referred to as labels or classes.

In classification tasks, the program must learn to predict discrete values for the dependent or output variables from one or more independent or input variables. That is, the program must predict the most probable class, category or label for new observations. Applications of classification include predicting whether it will rain on a given day, predicting whether a certain company’s share price will rise or fall, and deciding whether an article belongs to the sports or entertainment section.

Classification is a form of supervised learning. Mail service providers like Gmail, Yahoo and others use this technique to classify a new mail as spam or not spam. The classification algorithm trains itself by analyzing which mails users mark as spam. Based on that information, the classifier decides whether a new mail should go into the inbox or into the spam folder.

Applications of Classification

  • Detection of credit card fraud – The classification method is used to predict credit card fraud. Using historical records of previous frauds, the classifier can predict which future transactions are likely to be fraudulent.

  • E-mail spam – Depending on the features of previous spam mails, the classifier determines whether a newly received e-mail should be sent to the spam folder.

Naive Bayes Classifier Technique

Classification techniques include the naive Bayes classifier, which is a simple technique for constructing classifiers. It is not a single algorithm for training such classifiers, but a family of algorithms that share a common assumption: given the class, the value of each feature is independent of the value of every other feature. A naive Bayes classifier builds a model from the available data and uses it to assign a class to each problem instance.

An important feature of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters necessary for classification. For some types of models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.

In spite of their oversimplified assumptions, naive Bayes classifiers have worked well in many complex real-world situations, including spam filtering and document classification.
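As a rough illustration of how such a classifier can be built for spam filtering, the sketch below implements one member of the naive Bayes family (a word-count model with add-one smoothing) in plain Python. The training mails, the label names and the function names are all hypothetical and chosen only for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count documents and words per class; these counts are the model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Return the class with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Naive assumption: words are independent given the class,
            # so the per-word log-likelihoods simply add up.
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training mails and labels
mails = ["win money now", "cheap money offer", "meeting at noon", "lunch at noon tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(mails, labels)
spam_result = classify(model, "cheap money")
ham_result = classify(model, "meeting tomorrow")
print(spam_result, ham_result)
```

Even with four training mails the model separates the two test messages, which reflects the point above that only a small amount of training data is needed to estimate the parameters.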


Regression

In regression tasks, the program predicts the value of a continuous output or response variable from the input or explanatory variables. Examples of regression problems include predicting the sales for a new product, or the salary for a job based on its description. Like classification, regression problems require supervised learning.
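A minimal sketch of a regression task is a straight-line fit with ordinary least squares. The data below (years of experience against salary in thousands) and the function name `fit_line` are assumptions made purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: years of experience -> salary in thousands
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # continuous prediction for an unseen input
print(slope, intercept, predicted)
```

For this toy data the fitted line is salary = 5 * years + 30, so the predicted salary for 6 years of experience is 60. Unlike a classifier, the output is a number on a continuous scale rather than one of a fixed set of labels.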


Recommendation

Recommendation is a popular technique that provides closely matched suggestions based on user information such as history of purchases, clicks, and ratings. Google and Amazon use this method to display a list of recommended items for their users, based on information from their past actions. Recommender engines work in the background to capture user behavior and recommend selected items based on earlier user actions. Facebook also uses the recommender method to identify people and send friend suggestions to its users.

A recommendation engine is a model that predicts what a user may be interested in based on their past record and behavior. Applied in the context of movies, it becomes a movie-recommendation engine. We filter items in the movie database by predicting how a user might rate them, which helps connect users with the right content from the movie database. This technique is useful in two ways. First, in a massive database of movies, users may not find content relevant to their tastes on their own. Second, by recommending relevant content, we can increase consumption and attract more users.

Netflix, Amazon Prime and similar movie rental companies rely heavily on recommendation engines to keep their users engaged. Recommendation engines usually produce a list of recommendations using either collaborative filtering or content-based filtering. The difference between the two types is in the way the recommendations are extracted. Collaborative filtering constructs a model from the past behavior of the current user as well as ratings given by other users. This model is then used to predict what this user might be interested in. Content-based filtering, on the other hand, uses the features of the item itself in order to recommend more items to the user. The similarity between items is the main motivation here. Of the two, collaborative filtering is often the more widely used.
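To make the collaborative filtering idea concrete, here is a small user-based sketch in plain Python. It assumes a hypothetical ratings table and predicts a missing rating as a similarity-weighted average of other users' ratings. The cosine similarity measure and all names in the code are illustrative choices, not the method of any particular service.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "bob":  {"A": 4, "B": 5, "C": 2, "D": 5},
    "cara": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users over the movies both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def predict_rating(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[movie]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None

# "ann" has not rated movie "D"; estimate how she might rate it.
prediction = predict_rating("ann", "D")
print(round(prediction, 2))
```

Because ann's past ratings resemble bob's far more than cara's, the prediction is pulled toward bob's high rating for movie D. A content-based approach would instead compare the features of movie D with those of movies ann already liked.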


Clustering

Groups of related observations are called clusters. A common unsupervised learning task is to find clusters within the training data.

We can also define clustering as a procedure to organize items of a given collection into groups based on some similar features. For example, online news publishers group their news articles using clustering.

Applications of Clustering

Clustering finds applications in many fields such as market research, pattern recognition, data analysis, and image processing. Some of these applications are discussed here −

  • Helps marketers to discover distinct groups in their customer base and characterize their customer groups based on purchasing patterns.

  • In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality and gain insight into structures inherent in populations.

  • Helps in identification of areas of similar land use in an earth observation database.

  • Helps in classifying documents on the web for information discovery.

  • Used in outlier detection applications such as detection of credit card fraud.

  • Cluster analysis serves as a data mining tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

Clustering, also called cluster analysis, assigns observations to groups such that observations within a group are more similar to each other, based on some similarity measure, than they are to observations in other groups.

Clustering is often used to explore a dataset. For example, given a collection of movie reviews, a clustering algorithm might discover sets of positive and negative reviews. The system will not be able to label the clusters as “positive” or “negative”; without supervision, it will only have knowledge that the grouped observations are similar to each other by some measure. A common application of clustering is discovering segments of customers within a market for a product. By understanding what attributes are common to particular groups of customers, marketers can decide what aspects of their campaigns need to be emphasized. Clustering is also used by Internet radio services; for example, given a collection of songs, a clustering algorithm might be able to group the songs according to their genres. Using different similarity measures, the same clustering algorithm might group the songs by their keys, or by the instruments they contain.

Unsupervised learning tasks include clustering, in which observations are organized into groups according to some similar feature. Clustering is used to form groups or clusters of similar data based on common characteristics.

Clustering is a form of unsupervised learning. Search engines such as Google, Bing and Yahoo! use clustering techniques to group data with similar characteristics. Newsgroups use clustering techniques to group various articles based on related topics.

The clustering engine goes through the input data completely and, based on the characteristics of each item, decides which cluster the item should be grouped under. The following points may be noted while clustering −

  • A suitable clustering algorithm must be selected to group the elements into clusters.

  • A rule is required to verify the similarity between the newly encountered elements and the elements in the groups.

  • A stopping condition is required to define the point where no further clustering is required.
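The three points above can be seen in a very small one-pass sketch. The algorithm chosen here is the simple "leader" scheme, the similarity rule is a Euclidean distance threshold, and the stopping condition is that every input element has been assigned. These choices, the threshold value and the sample points are assumptions for illustration only.

```python
def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def leader_clustering(points, threshold):
    """Group points in a single pass over the input.

    Similarity rule: a point joins the first cluster whose leader
    (its first member) lies within `threshold`; otherwise the point
    starts a new cluster of its own.
    Stopping condition: every input point has been assigned.
    """
    clusters = []  # each cluster is a list; cluster[0] is its leader
    for point in points:
        for cluster in clusters:
            if euclidean(point, cluster[0]) <= threshold:
                cluster.append(point)
                break
        else:
            # No existing leader was close enough.
            clusters.append([point])
    return clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5)]
clusters = leader_clustering(points, threshold=3.0)
print(clusters)
```

With this data the points near (1, 1) end up in one cluster and the points near (8, 8) in another. Changing the threshold or the distance function changes the grouping, which is why the similarity rule has to be chosen with care.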

Types of Clustering

There are two types of clustering – flat clustering and hierarchical clustering.

Flat clustering creates a flat set of clusters without any clear structure that can relate clusters to each other. Hierarchical clustering gives a hierarchy of clusters as output, a structure that yields more information than the unstructured set of clusters returned by flat clustering. Hierarchical clustering also does not require us to specify the number of clusters beforehand. These advantages come at the cost of lower efficiency.

In general, we select flat clustering when efficiency is important, and hierarchical clustering when one of the potential problems of flat clustering, such as the lack of structure or the need to fix the number of clusters in advance, is an issue. Moreover, many researchers believe that hierarchical clustering produces better clusters than flat clustering.
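The following sketch shows one common way to build such a hierarchy: bottom-up (agglomerative) clustering with single linkage, written in plain Python. The linkage rule, the sample points and the function names are assumptions for this example. Other linkage rules exist and can give different hierarchies.

```python
def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(c1, c2):
    """Distance between two clusters: that of their closest pair of points."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]  # start with one cluster per point
    history = []
    while len(clusters) > 1:
        # Compare every pair of clusters: simple, but slow on large data.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        history.append(merged)
        # Delete j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return history

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
history = agglomerative(points)
for level, cluster in enumerate(history, start=1):
    print(level, cluster)
```

No number of clusters is passed in: the merge history is the hierarchy, and it can be cut at any level to obtain as many clusters as needed. The all-pairs comparison inside the loop also hints at why this approach is less efficient than flat clustering.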

Clustering Algorithms

A clustering algorithm is needed to cluster a given data set. Two frequently used algorithms are canopy clustering and k-means clustering.

The canopy clustering algorithm is an unsupervised pre-clustering algorithm that is often used as a preprocessing step for the k-means algorithm or the hierarchical clustering algorithm. It is used to speed up clustering operations on large data sets, where using another algorithm directly may not be practical because of the size of the data.
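A minimal sketch of canopy clustering is given below, assuming the usual formulation with a loose distance threshold `t1` and a tight threshold `t2` (with `t1 > t2`). The threshold values and sample points are made up for illustration, and the sketch takes the first remaining point as each canopy center so that the result is repeatable, where an implementation would often pick one at random.

```python
def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def canopy_clustering(points, t1, t2):
    """Cheap pre-clustering with a loose threshold t1 and a tight threshold t2."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # any remaining point can serve as a center
        canopy = [center]
        still_remaining = []
        for point in remaining:
            distance = euclidean(center, point)
            if distance < t1:
                canopy.append(point)  # loosely close: member of this canopy
            if distance >= t2:
                # Not tightly close: the point may still start or join
                # another canopy, so canopies are allowed to overlap.
                still_remaining.append(point)
        remaining = still_remaining
        canopies.append((center, canopy))
    return canopies

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (5, 5)]
canopies = canopy_clustering(points, t1=5.0, t2=2.0)
centers = [center for center, _ in canopies]
print(centers)
```

Each point is compared only against the current canopy center, which keeps the pass cheap. The resulting canopy centers can then be handed to a more expensive algorithm such as k-means as starting points.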

K-means clustering is an important clustering algorithm. The k in the k-means clustering algorithm represents the number of clusters the data is to be divided into. For example, if the k value specified in the algorithm is 3, then the algorithm will divide the data into 3 clusters.

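Each object is represented as a vector in space. Initially, k points are chosen at random by the algorithm and treated as centers, and every object is assigned to the cluster of its closest center. Each center is then recomputed as the mean of the objects assigned to it, and the two steps repeat until the assignments stop changing. The k-means algorithm requires vector files as input, so the data must first be converted into vectors. After creating the vectors, we proceed with the k-means algorithm.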
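The procedure can be sketched in plain Python as follows. Objects are given directly as tuples rather than read from vector files, and the sample data, the fixed random seed and the function names are assumptions made for this illustration.

```python
import random

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_means(vectors, k, max_iterations=100, seed=0):
    """Divide the vectors into k clusters; return (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # k objects chosen at random as centers
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its closest center.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(v, centers[c])) for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clustering has converged
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignments

vectors = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centers, assignments = k_means(vectors, k=2)
print(centers)
print(assignments)
```

With k set to 2, the three points near the origin end up in one cluster and the three points near (8.5, 9) in the other. On less cleanly separated data the result can depend on the random starting centers, so the algorithm is often run several times with different seeds.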