# Machine Learning

Traditional software development requires programmers to specify precise instructions to a computer via a programming language. In Machine Learning, however, a programmer provides a template (more formally, a model) to a computer for a given task. The computer then attempts to learn the precise instructions (compliant with the provided model) automatically, via pre-processed data (Supervised Learning, Unsupervised Learning) or interaction with an environment (Reinforcement Learning).

Machine Learning has received a lot of media attention due to its recent successes. It is playing an increasingly important role in our lives, and many popular tech companies use a Machine Learning arsenal to improve their products. To get you interested, here is a list of exciting breakthroughs in Machine Learning -

Some useful libraries for Machine Learning tasks:

• Scikit-Learn is one of the most versatile & famous machine learning libraries in Python.
• The relatively new PyCaret is a machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.
• GraphLab Create is another Python library, backed by a C++ engine, for quickly building large-scale, high-performance data products.

# Basics of Probability for ML

Machine learning, and data science in general, depend heavily on statistical models for making predictions. In most cases where we apply artificial intelligence, there is inherent uncertainty both in the system and in the datasets that we use for making predictions, so it is essential to use probability to decide which decision or choice to take. Most of the probability that you have covered for JEE serves as the backbone of probability theory for ML. You will often model your data in terms of various probability distributions, so it is essential to understand the different distributions that data scientists use. Most of the data that you model will have inherent variability, so it is essential to understand how the data is spread out, which we describe with statistical parameters like variance and standard deviation. You will also often compare two or more models, and then you will need to study their correlation. Probability theory therefore serves as the theoretical background that will help you make better choices about which techniques to use for your predictions.

• This book provides a good theoretical introduction into probability for data science
• This lecture covers the basics of statistics and probability from the ground up.
• This course page has lecture notes related to data analysis and probability for data science that you can consult for further information.

# Linear Regression (incl. Regularization)

Linear Regression is one of the most fundamental models in Machine Learning. It assumes a linear relationship between the input variables (x) and the single output variable (y). Stated formally, (y) can be calculated from a linear combination of the input variables (x). When we have a single variable (x) as the input, the model is called Simple Linear Regression, and when we have multiple input variables, the model is called Multiple Linear Regression.

In Machine Learning, the input variable (x) corresponds to the features of our dataset. For example, when we are predicting housing prices from a dataset, the features will be things such as the area of the house, the locality, etc., and the output variable (y) will be the housing price. The general representation of a Simple Linear Regression model is:

```
y = θ(0) + θ(1)*x
```

where θ(0) is known as the bias term and θ(1) is the weight of the feature x; together, the parameters θ form the weight vector. This link here explains in detail how we arrived at such a representation of the model. For multiple linear regression, the equation changes as follows:

```
y = transpose(Θ)*(X)
```

where X is a vector containing all the features and Θ is a vector containing all the corresponding weights i.e. the parameters of the linear model.

We define a function called the Cost Function that accounts for the prediction errors of the model. We try to minimize the cost function so that we obtain a model that fits the data as well as possible. To reach the minimum of the cost function, we employ Gradient Descent, a method that is used in almost all of machine learning.

• This article's sections 2 and 3 explain the concept of Linear Regression as a Statistical Model, where you can calculate certain statistical quantities to determine and improve the accuracy of your model. This is known as Optimisation.
• This link here explains Gradient Descent, an optimisation technique for Machine Learning models (Linear Regression in our case). It is used to reach the minimum of the Cost Function and so arrive at the optimum parameters/weights, i.e., to find a value of Θ (the weight vector) for which the predictions of our model are the most accurate. This is what we mean by Model Training: we provide training data to the model, the model is optimised according to that data, and its predictions become more accurate. This process is repeated every time we provide new training data to the model.
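The training loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the mean squared error as the Cost Function, and the function name, learning rate, epoch count and toy data are all hypothetical choices made for this example.

```python
def fit_simple_linear_regression(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = theta0 + theta1 * x by batch gradient descent on the MSE cost."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Gradients of J = (1 / 2n) * sum(error^2) w.r.t. theta0 and theta1.
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Step against the gradient to reduce the cost.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Hypothetical noise-free toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = fit_simple_linear_regression(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

On this toy data the learned bias and weight approach 1 and 2. A learning rate that is too large makes the updates diverge, so it usually needs tuning for each dataset.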

## Regularisation

Before we get to know what Regularisation means, we first have to understand the problem of Overfitting. Overfitting arises when our model has been fitted too closely to the training data, so much so that the Cost Function on the training set approaches zero. This is a problem because our model now captures 'all' the patterns in our training set, even the undesirable ones: the outliers of the general pattern. To understand this, let us return to our earlier example of a dataset for predicting housing prices. Let's say our training data coincidentally contains a pattern in which houses with a green door are priced relatively higher than other houses. In case of overfitting, our model will capture this pattern, and when it makes predictions on a new dataset, the predictions will not be accurate because this 'unwanted' pattern pushes the predicted value away from the actual value.

• Head over to this link to get a better understanding of overfitting and other common problems with machine learning models.

To solve this problem, we have a technique called Regularisation. It is a form of regression that constrains (regularises), or shrinks, the coefficient estimates towards zero. Regularisation can significantly reduce the variance of the model without a substantial increase in its bias. λ (the Regularisation Parameter) is the tuning parameter that decides how much we want to penalise the flexibility of our model.

• You can check out this link for a detailed explanation on Regularisation and how it works.
• Along with that, you can check out this link for the application of Regularisation techniques specifically to a Linear Regression model.
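To see the shrinking effect of λ concretely, here is a small sketch under simplifying assumptions: a single feature, no bias term, and an L2 (ridge) penalty, for which the minimiser has a closed form. The function name and the data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form minimiser of sum((y - theta * x)^2) + lam * theta^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives theta * (sxx + lam) = sxy.
    return sxy / (sxx + lam)


# Hypothetical centred data that roughly follows y = 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

# A larger lambda penalises large weights more, so the slope shrinks towards zero.
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
for lam, slope in slopes.items():
    print(lam, round(slope, 3))
```

With λ = 0 this reduces to ordinary least squares. As λ grows the fitted slope moves steadily towards zero, trading a little bias for lower variance.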

# Naive Bayes & Logistic Regression Classifiers

Naive Bayes classifiers refer to a set of classification algorithms that share one underlying assumption: given the class, every pair of features being classified is independent of each other.

Let's understand what this means. Generally, we would like to classify an n-dimensional vector of input variables (which can be discrete or continuous) into one of various classes. We can imagine this vector representing binary input data, for instance, where each element of the vector is a boolean. The Naive Bayes algorithms assume that, given the class, the probability of one of those input variables (say the kth element of the vector) equaling a particular value is independent of the values of the other input variables.

This may not seem very useful, but it immensely reduces the number of parameters we have to estimate, because we can now learn the distribution of each input feature independently. This allows us to feasibly set up a classifier based on Bayes' rule.

• This tutorial will walk you through a solved example of applying the Naive Bayes algorithm for predictive & classification tasks. Towards the end, it also has code for implementing the Naive Bayes classifier available in `sklearn` for the Iris dataset.
• Implementing algorithms from scratch is a good way to consolidate your understanding of the specifics of an algorithm. Here is a tutorial to help you implement a Naive Bayes classifier from scratch in Python by creating your very own toy dataset. Once you are familiar with it, you could extend this to more complex datasets.
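As a rough illustration of the independence assumption, the sketch below trains a Naive Bayes classifier on boolean features by estimating each feature's probability separately for every class. The function names, the Laplace smoothing constant `alpha` and the toy "spam" dataset are assumptions made for this example and are not taken from the tutorials above.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(samples, labels, alpha=1.0):
    """Estimate P(class) and P(feature_k = 1 | class) with Laplace smoothing."""
    n_features = len(samples[0])
    class_counts = defaultdict(int)
    ones = defaultdict(lambda: [0] * n_features)
    for x, y in zip(samples, labels):
        class_counts[y] += 1
        for k, value in enumerate(x):
            ones[y][k] += value
    model = {}
    for c, count in class_counts.items():
        prior = count / len(samples)
        # Each feature is estimated on its own: this is the "naive" assumption.
        probs = [(ones[c][k] + alpha) / (count + 2 * alpha) for k in range(n_features)]
        model[c] = (prior, probs)
    return model


def predict(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for value, p in zip(x, probs):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical features: [message contains "offer", message contains "meeting"].
samples = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train_bernoulli_nb(samples, labels)
print(predict(model, [1, 0]), predict(model, [0, 1]))
```

Working with log probabilities avoids numerical underflow when the number of features is large.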

Logistic regression is a classifier that uses regression to obtain the probability of the input data belonging to one of various classes. We can derive the expression for logistic regression under the Naive Bayes assumptions, although logistic regression remains valid even when those assumptions do not hold. Assuming Naive Bayes does allow simple expressions for the weights in terms of the means and standard deviations of the distributions of the input variables; however, we then have to model in terms of the Naive Bayes assumptions, which may not always be helpful. An alternate formulation of logistic regression can then be made where the parameters are estimated using gradient descent.

• This article covers the major aspects of Logistic Regression & should give you a firm grasp of the mathematics behind this algorithm.
• While implementing ML algorithms on datasets for real-life applications, there is much more to do than simply building, training & testing the model. A lot has to be done in terms of data preprocessing, data exploration, feature engineering & choosing the right metrics to test the model. This is an elaborate tutorial that implements logistic regression & highlights techniques like the Synthetic Minority Oversampling Technique (SMOTE) & Recursive Feature Elimination (RFE) (along with Python implementations), as well as testing metrics for classification tasks such as the Confusion Matrix, F-Measure & Receiver Operating Characteristic (ROC) curve.
• Here is another example for logistic regression, but this time for multi-class classification
• This chapter from Machine Learning by Tom Mitchell talks about the intuition & math behind classifiers based on Bayes Rule (including Naive Bayes & Logistic Regression)
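The gradient descent formulation of logistic regression mentioned above can be sketched as follows for a single input feature. This is a minimal example under stated assumptions: the average log loss is minimised, and the function names, hyperparameters and the "hours studied vs. passed" data are hypothetical.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # For the log loss, the gradient is the mean of (prediction - label) * input.
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic_regression(xs, ys)
print(round(sigmoid(w * 0.5 + b), 3), round(sigmoid(w * 4.0 + b), 3))
```

The classes in this toy data overlap on purpose. On perfectly separable data the weights keep growing, which is one reason regularisation is usually added in practice.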

# Basics of Statistical Learning Theory

Statistical learning theory is regarded as one of the most beautifully developed branches of artificial intelligence. It provides the theoretical basis for many of today's machine learning algorithms. It also goes by other names such as statistical pattern recognition, non-parametric classification and estimation, and supervised learning. Statistical learning theory deals with the problem of finding a predictive function based on data.

Stated more formally, statistical learning theory takes the perspective that there is some unknown probability distribution over the product space of the input & output vectors & that the training set is made up of n samples from this probability distribution. The goal is to find a function that maps the input vector space to the output vector space with minimum error (the error metric can be defined in many ways).

• Here is an article which explains the key concepts & terminologies involved in understanding statistical learning.
• For people who wish to dig deeper into this topic & gain more mathematical insights, this advanced tutorial will introduce the algorithms such as Risk Minimization, Regularization & the mathematical insights to several related concepts.
• These notes by Andrew Ng also elaborately sum up the concepts & techniques important for understanding SLT.
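The formal statement above can be written compactly. The notation here is an assumption made for illustration: P is the unknown distribution, L is whichever error metric (loss) is chosen, and f is a candidate function. The expected risk is what we would like to minimise, while the empirical risk is what we can actually compute from the n training samples:

```latex
R(f) = \mathbb{E}_{(x, y) \sim P}\big[ L(f(x), y) \big],
\qquad
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big)
```

Minimising the empirical risk in place of the expected risk is the idea behind the Risk Minimization algorithms mentioned in the resources above.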

The two main sub-components of prediction error, bias and variance, are extremely important to understand. These concepts are directly related to faulty ML models, which are said to either overfit or underfit the training data. Bias & variance for a model always have a tradeoff associated with them. Here are some resources to get started with this core concept:

• Here are some cool infographics that will help in developing an intuition for the bias-variance tradeoff.
• This article will help develop a good overview of the concepts of overfitting & underfitting of a model.
• To develop a deeper understanding of how overfitting & underfitting can be caused, here is a python tutorial which demonstrates a toy example. Try it out yourself & play around with the model. Try using a different dataset & explore similar results.
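For squared error, the tradeoff can be stated exactly. As a sketch, assume the data follows y = f(x) + ε, where the noise ε has zero mean and variance σ², and let f̂ be the model learned from a random training set (these symbols are assumptions introduced for this formula). The expected prediction error at a point x then splits into three terms:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

An underfitting model has a large bias term, an overfitting model has a large variance term, and no model can remove the noise term.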

# Decision Trees

Since the goal of supervised learning is to predict the value of a target attribute, decision tree analysis offers one way to do it. Decision trees are a non-parametric supervised learning method that can serve as both a regressor and a classifier, which makes them a really useful and unique model. They are versatile in the sense that they can capture non-linearities in the data. The aim is to develop a model that can predict the value of the target variable/attribute by learning decision rules inferred from the available data attribute values.

• The ID3 algorithm is one of the best-known algorithms for building Decision Trees. It chooses each split using an Information Gain function based on the entropy of the training set. Using Gain Ratio instead (as the later C4.5 algorithm does) reduces the bias of Information Gain towards attributes with many distinct values. For a brief introduction to the algorithm, head here.
• To get into the mathematical insights of the concepts and the implementation of the algorithm, this is a really helpful article.
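The entropy and Information Gain computations at the heart of ID3 are short enough to sketch directly. The helper names and the tiny weather-style dataset below are hypothetical and only meant to show the arithmetic.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical rows of (outlook, windy) with a play/stay decision.
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["play", "play", "stay", "stay"]
gain_outlook = information_gain(rows, labels, 0)
gain_windy = information_gain(rows, labels, 1)
print(gain_outlook, gain_windy)
```

Here the outlook feature separates the classes perfectly while the windy feature tells us nothing, so ID3 would split on outlook first.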

But decision trees have their demerits too. They suffer from over-fitting: if allowed to grow uncontrolled, they can split the data into n terminal nodes, one per training example, and hence give 100 percent accuracy on the training set but perform poorly on the validation set. There are ways to regularise the model, including Pruning, Ensembling, Random Forests, Boosting etc.

We will look at these in the next sections.

# MLE vs MAP Estimation

Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation are methods for estimating the parameters of a probability distribution. They are referred to as point estimators because they produce a single value rather than a distribution, unlike full Bayesian inference. There is a subtle difference between the two methods.

As the name suggests, MLE maximises the likelihood. MAP, in contrast, is based on the Bayesian setting, where the posterior is maximised. The posterior is evaluated in terms of the prior and the likelihood using Bayes' Theorem, so it comes out proportional to the product of the likelihood and the prior. At a very high level, MAP can therefore be seen as a weighted MLE, with the prior supplying the weights, and MLE is the special case of MAP in which the weights are equal for all values of the parameter (a uniform prior).

• To get an overview of the settings of the two methods, head here or here.
• This article explains the concepts related to MLE in good detail.
• To delve into the details of the concepts, refer to the slides.
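A coin-toss example makes the relationship between the two estimators concrete. The sketch below assumes a Bernoulli likelihood with a Beta(a, b) prior on the probability of heads, a standard textbook pairing that is not taken from the resources above. The function names and the numbers are hypothetical.

```python
def mle_bernoulli(heads, tosses):
    """Value of p that maximises the likelihood of the observed tosses."""
    return heads / tosses


def map_bernoulli(heads, tosses, a, b):
    """Mode of the Beta(a + heads, b + tosses - heads) posterior.

    The formula is valid when both posterior parameters are greater than 1.
    """
    return (heads + a - 1) / (tosses + a + b - 2)


heads, tosses = 7, 10
p_mle = mle_bernoulli(heads, tosses)
# A Beta(1, 1) prior is uniform, so MAP coincides with MLE.
p_map_uniform = map_bernoulli(heads, tosses, a=1, b=1)
# A Beta(5, 5) prior favours a fair coin and pulls the estimate towards 0.5.
p_map_informed = map_bernoulli(heads, tosses, a=5, b=5)
print(p_mle, p_map_uniform, round(p_map_informed, 3))
```

As more tosses are observed, the likelihood dominates the prior and the two estimates move closer together.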

# Parameter Estimation

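Parameter Estimation for statistical models can be done using either Maximum Likelihood Estimation (MLE) or Maximum A Posteriori (MAP) estimation. Both also serve as a stepping stone to full Bayesian inference, which works with the whole posterior distribution instead of a single point estimate.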

• Bored of articles? Watch this video for a more intuitive explanation.
• Head here to know how to calculate Bayesian inference.

# SVMs + Kernel Methods

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection. They are popular because they are easy to use and require very little tuning. In SVMs, each data point is viewed as an n-dimensional vector, and we try to separate the points of different classes with an (n-1)-dimensional plane (called a hyperplane).
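To make the hyperplane picture concrete, the sketch below classifies 2-dimensional points by the side of a line they fall on and measures the margin, i.e. the distance from the hyperplane to the closest point. The hyperplane here is hand-picked for illustration and is an assumption; an actual SVM solver would search for the hyperplane that maximises this margin.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def classify(w, b, x):
    """Predict +1 or -1 depending on the side of the hyperplane."""
    return 1 if signed_distance(w, b, x) >= 0 else -1


# Hypothetical, hand-picked hyperplane x1 + x2 = 5.5 (not learned from data).
w, b = (1.0, 1.0), -5.5
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

predictions = [classify(w, b, p) for p in points]
# The margin is the smallest distance of a correctly classified point to the hyperplane.
margin = min(y * signed_distance(w, b, p) for p, y in zip(points, labels))
print(predictions, round(margin, 4))
```

The points that attain the minimum distance are the ones that would act as support vectors for this hyperplane.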

# Clustering Algorithms

Here, we're going to discuss the two most famous algorithms for clustering of datapoints - K-Means and Gaussian Mixture Models.

K-means clustering is as simple as it sounds: its goal is to partition the points in our dataset into k distinct, non-overlapping sub-groups. Here is a short video to get you started.

Also, note that K-means is different from K-Nearest Neighbours. Here is the difference.
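The K-means procedure alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. Below is a minimal sketch of that loop. Initialising the centroids with the first k points is an assumption made to keep the example deterministic (random initialisation is more common), and the function name and data are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Cluster points into k groups with the standard assign/update loop."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((a - c) ** 2 for a, c in zip(p, centroid))
                         for centroid in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignments


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments, centroids)
```

K-means can converge to different solutions from different starting centroids, so in practice it is usually run several times.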

A Gaussian Mixture Model (GMM) is quite similar to k-means, except that it is more flexible because it takes the variance of each sub-group into account as well (the word Gaussian gives it away). This allows soft classification, i.e. it provides us with the probabilities that a point belongs to each sub-group. If the word Gaussian intimidates you, check out this video for an exceptionally detailed explanation; it also has links to tutorials in its description. You could read this article as well for a deeper understanding of GMMs.
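The soft classification step can be sketched for one-dimensional data as follows. The mixture parameters are fixed by hand here, which is an assumption for illustration. A full GMM fit would also re-estimate them, typically with the EM algorithm. The function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def responsibilities(x, weights, means, variances):
    """Probability that x belongs to each component of the mixture."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical mixture of two equally weighted, unit-variance components.
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
midpoint = responsibilities(0.0, weights, means, variances)
right_side = responsibilities(2.0, weights, means, variances)
print(midpoint, [round(r, 3) for r in right_side])
```

A point halfway between the two means is shared equally, while a point near one mean is assigned to that component with high probability but never with complete certainty.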

# Principal Component Analysis (PCA)

When our data has too many variables or dimensions, we can cut computation by removing the least important variables, which are identified through a few matrix operations. This is called dimensionality reduction. It can be done using Feature Extraction or Feature Elimination.

Feature Elimination, as the name suggests, is achieved by directly eliminating some of the variables, but the disadvantage is that we can't gain any information from those variables now.

Feature Extraction, on the other hand, involves creating n new variables as combinations of the original n variables in such a way that they best predict our dependent variable, and then eliminating some of the new variables. Even if we eliminate some of the new variables, we still keep information from all the old ones, since the new variables are combinations of the old ones. This is what we do in Principal Component Analysis.

Confused? This video explains an application of PCA, while this set of 3 videos discusses an intuition for PCA - 1 2 3.
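As a rough sketch of the matrix operations involved, the code below centres the data, builds the covariance matrix, and finds the direction of largest variance (the first principal component) with power iteration. Power iteration is a choice made here to avoid external libraries, and the function name and data are hypothetical. Library implementations typically use an eigendecomposition or SVD instead.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the leading principal direction and the projected scores."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    # Assumption: the data has non-zero variance and the all-ones start vector
    # is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Each score is a combination of the original variables, as described above.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical 2-D data lying on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(rows)
print([round(x, 4) for x in direction], [round(s, 4) for s in scores])
```

Because the points lie on a single line, one new variable captures all of the variance and the second component could be dropped without losing information.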

Some implementations of PCA in Python:

• This tutorial uses PCA on Iris & MNIST dataset.
• This tutorial uses PCA for analysing Breast Cancer & CIFAR-10 dataset.
• This article neatly summarizes the use of PCA on several datasets, including its application to face recognition using the EigenFaces algorithm.

# Ensemble Learning (Boosting / Bagging)

Till now, we have seen a diverse collection of algorithms that can be used to perform several prediction & classification tasks. But we also know that all these individual models may inadvertently have large bias, variance or prediction errors. Ensemble learning is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model. There are several ways of creating an ensemble of learners & using them for the task at hand. Some prominent ensembling techniques include bagging (to decrease variance), boosting (to decrease bias) & stacking (to improve predictions).

• This elaborate tutorial will walk you through the various techniques that are used widely, some specific algorithms to implement these techniques along with python code using `scikit-learn` to test them out yourself.
• This tutorial on bagging demonstrates how to implement bagging from scratch on the Sonar dataset using an ensemble of decision trees. Taking motivation from this, you can start implementing similar algorithms for other types of models that you have come across.
• This elaborate tutorial on boosting will take you from using the existing AdaBoost classifier in `scikit-learn` & also illustrate how to develop the algorithm from scratch using Python.
• We talked about Decision Trees earlier. In the next section, we will look at Random Forests, an important ensemble technique for improving the accuracy of those models.

## Random Forest

We discussed the problem of over-fitting in Decision Trees, and a really promising solution is Random Forest. Random Forest is an algorithm that was developed by incorporating the bagging concept into the decision tree model.

Bagging generally means building multiple models, each trained on a subset of the data, with repetitions allowed across subsets. The final prediction is then the mean (for regression) or the mode (for classification) of the individual predictions. This has an issue with respect to decision trees: the different trees tend to learn the same split features, so the trees are correlated and we cannot reap the benefits of an ensemble to the full extent.

Random Forest goes a step further than plain bagging. It restricts the number of features available for a split at any node, which makes the trees less correlated and hence gives a greater regularisation effect.
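The two sources of randomness described above, plus the final aggregation, can be sketched without building the trees themselves. All function names and the toy rows below are hypothetical, and the tree-growing step is deliberately left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the resampling step of bagging)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, max_features, rng):
    """Pick the features a single split may consider (the Random Forest step)."""
    return sorted(rng.sample(range(n_features), max_features))


def aggregate(predictions, task="classification"):
    """Combine per-tree predictions: mode for classification, mean for regression."""
    if task == "classification":
        return Counter(predictions).most_common(1)[0][0]
    return sum(predictions) / len(predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=4, max_features=2, rng=rng)
vote = aggregate(["cat", "dog", "cat"])
average = aggregate([1.0, 2.0, 3.0], task="regression")
print(sample, features, vote, average)
```

Each tree in the forest would be trained on its own bootstrap sample, and a fresh feature subset would be drawn at every node of every tree.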

# Fundamentals of Hidden Markov Models (HMMs)

A Hidden Markov Model (HMM) is a statistical Markov Model in which the system being modeled is assumed to be a Markov process with unobservable ("hidden") states. To understand this statement, we must first understand the fundamentals of Markov Chains (or observed Markov Models) & the terminology associated with them:

• Markov chains and Hidden Markov Models are both extensions of finite state automata.
• A Markov chain is a stochastic process, but it differs from a general stochastic process in that a Markov chain must be "memory-less", i.e., (the probability of) future actions does not depend on the steps that led up to the present state. This is called the Markov property.
• Here is a brilliant (pun intended :P) explanation about Markov Chains & their properties.
• Markov Chains are useful when we need to compute probabilities of a sequence of events that we can observe in the world. But in several cases, events that we are interested in may not be directly observable in the world.

This is what brings us to the topic of discussion, which is Hidden Markov Models. Here we talk about both the observed events and the hidden events (which form the causal factors in our probabilistic models). We assume that the hidden events follow a Markov Model and that each observed event depends probabilistically on the current hidden event. Hidden Markov Models (HMMs) are especially known for their application in reinforcement learning and temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics. HMMs are characterized by 3 fundamental problems:

• Computing Likelihood: Given an HMM with its transition & emission probabilities & an observation sequence, determine the likelihood of that observation sequence (done using the forward algorithm).
• Decoding: Given an observation sequence & an HMM with transition & emission probabilities, discover the best hidden state sequence (done using the Viterbi algorithm).
• Learning: Given an observation sequence & a set of HMM states, learn the HMM parameters, i.e., the transition & emission probabilities (done using the Baum-Welch algorithm).
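The first two problems can be sketched compactly with dictionaries. The weather-and-activity model below is a commonly used toy example whose numbers are assumptions for illustration, and the function names are hypothetical. The learning problem (Baum-Welch) is longer and is not shown.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Likelihood of the observation sequence, summed over all hidden paths."""
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
                 for s in states}
    return sum(alpha.values())


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden state sequence and its probability."""
    # best[s] holds (probability, path) of the best path that ends in state s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(((best[prev][0] * trans_p[prev][s], best[prev][1])
                              for prev in states), key=lambda item: item[0])
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical model: hidden weather, observed activity.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
observations = ["walk", "shop", "clean"]

likelihood = forward(observations, states, start_p, trans_p, emit_p)
best_prob, best_path = viterbi(observations, states, start_p, trans_p, emit_p)
print(round(likelihood, 6), best_path, round(best_prob, 5))
```

The forward algorithm sums over all hidden paths while Viterbi keeps only the best one, which is why the likelihood is larger than the probability of the single best path. For long sequences both are usually computed in log space to avoid underflow.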