Rootstrap Blog

Data Demystified — Machine Learning

The main goal of this article is to cover the most important concepts of machine learning and lay out the landscape. By the end, the reader should be able to recognize what kind of solution matches a specific kind of problem, and know where to look for more specific knowledge when diving into a real-life project.

What is Machine Learning?

I’ll start with a definition that is more than 60 years old but still valid today:

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel (1959)

The name is pretty self-explanatory, and the definition reinforces the same concept.

Machine learning ecosystem

For those with a taste for a more mathematical definition, we can turn to a 1998 definition:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Tom Mitchell (1998)

Let’s use a typical example to clarify these concepts: classifying an email as spam or not spam. In terms of the definition above:

  1. Task T is classifying emails as spam or not spam.
  2. Experience E is watching how users manually classify their emails, tagging them as spam.
  3. Measure P is the percentage of emails correctly classified as spam or not spam.

We have machine learning if P improves as experience E grows.
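To make T, E and P concrete, here is a minimal sketch in Python. The word-counting rule, the function names and the tiny hand-made email lists are my own assumptions for illustration; this is not how a real spam filter is built, only a way to watch P change once some experience is available.

```python
from collections import Counter

def train(labeled_emails):
    """Experience E: count how often each word shows up in spam vs. non-spam."""
    spam_words, ham_words = Counter(), Counter()
    for text, is_spam in labeled_emails:
        target = spam_words if is_spam else ham_words
        target.update(text.lower().split())
    return spam_words, ham_words

def predict(model, text):
    """Task T: flag an email as spam when its words were seen more often in spam."""
    spam_words, ham_words = model
    score = sum(spam_words[w] - ham_words[w] for w in text.lower().split())
    return score > 0

def performance(model, labeled_emails):
    """Measure P: fraction of emails that were classified correctly."""
    hits = sum(predict(model, text) == is_spam for text, is_spam in labeled_emails)
    return hits / len(labeled_emails)

# Hypothetical emails tagged by users (True means spam).
user_labels = [
    ("win free money now", True),
    ("meeting agenda for monday", False),
    ("free prize claim now", True),
    ("lunch on monday", False),
]
# Hypothetical emails the model has never seen, used only to measure P.
held_out = [
    ("claim your free money", True),
    ("agenda for lunch meeting", False),
    ("win a prize now", True),
    ("monday meeting notes", False),
]

p_before = performance(train([]), held_out)           # no experience yet
p_after = performance(train(user_labels), held_out)   # after watching users
print(f"P before: {p_before:.2f}, P after: {p_after:.2f}")
```

With no experience the model never flags anything, so it only gets the non-spam emails right. After seeing the tagged emails, P goes up on the held-out set, which is exactly what the definition asks for.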

We have two main categories of machine learning: supervised and unsupervised learning.

Supervised learning

It means that we have a training set: a list of inputs paired with the “right values” (the expected outputs). Each input is typically a vector of features, and the goal is to train the algorithm on these examples so that it can finally predict the output for unseen cases.

Supervised learning through manual categorization

Some examples could be:

  • Predict the price of a property based on its size and amenities.
  • Recognize objects in images.
  • Predict the score of a student in an exam, based on previous exam results.

Training sets can be huge but for the sake of simplicity, this is what a really simple one could look like.

Real Estate properties

Each of the columns represents one of the inputs of the training set, and the Price represents the output. Our goal is to create an algorithm that, with experience (Experience E), becomes better at predicting the price (Task T), minimizing the prediction error.
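As a rough sketch of what "predicting the price while minimizing the error" can mean, the snippet below fits a straight line to a made-up training set using ordinary least squares. The sizes, prices and function names are hypothetical, and I only use one input (size) to keep the arithmetic readable; a real training set would have several columns.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_squared_error(slope, intercept, xs, ys):
    """Average squared gap between predicted and real prices."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training set: size in square meters and price in thousands.
sizes = [50, 70, 90, 110]
prices = [150, 200, 260, 310]

slope, intercept = fit_line(sizes, prices)
training_error = mean_squared_error(slope, intercept, sizes, prices)

# Predict the price of an unseen 100 square meter property.
predicted_price = slope * 100 + intercept
print(f"price = {slope:.2f} * size + {intercept:.2f}")
print(f"training error: {training_error:.2f}, predicted price: {predicted_price:.1f}")
```

The fitted line is the one with the smallest mean squared error on the training set, which is one common way (among others) to define "the error" that the algorithm minimizes.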

A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems.

Some of the most common are:

  • Linear Regression
  • Logistic Regression
  • Naive Bayes
  • Neural Networks
  • Decision trees

Unsupervised Learning

It is also known as self-organization and allows modeling the probability densities of the given inputs. Basically, it tries to detect patterns in the data without being told the right answers.

Let’s say that we have a data set and we’re not told what each data point is. Instead, we’re just told, here is a data set. We ask our algorithm if it can find some structure in the data.

As an example, an Unsupervised Learning algorithm might decide that the data lives in two different clusters.
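Here is a small sketch of how an algorithm could find those two clusters on its own. It is a bare-bones version of k-means written for this article; the points and the choice of starting centroids are assumptions made for illustration, and real implementations pick starting points more carefully.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data: nobody tells the algorithm which group is which.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Start from two of the points; this starting choice is an assumption of the sketch.
centroids, clusters = kmeans(points, centroids=[points[0], points[-1]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Note that we never passed any labels. The structure (two groups) comes only from how close the points are to each other.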

One example where clustering is used is Google News. Every day it looks at tens of thousands or hundreds of thousands of new stories on the web and groups them into cohesive news stories.

Some of the most common approaches used in unsupervised learning are:

  • Clustering
  • Anomaly detection
  • Some neural network approaches, such as Hebbian learning or Generative Adversarial Networks
  • Blind signal separation techniques

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

What’s the right approach?

There is no silver bullet. Knowing many techniques and the kind of problems that each one can solve is the single most important skill to have.

Below, we can see many different problems and the algorithms that you should investigate further when tackling them.

Regression and Clustering

  • Estimate a numerical output based on lots of inputs: Linear Regression
  • Estimate a category based on inputs: Logistic Regression
  • Partition n observations into k clusters: K-Means
  • Finding outliers (anomaly detection): DBSCAN algorithm


  • Complex relationships, MAGIC: Neural networks
  • Group membership based on proximity: K-NN
  • Non-contiguous data (if/then/else): Decision tree
  • Combine many decision trees, each built on random subsets of the data and features: Random Forest
  • Maximum margin classifier (very important): SVM
  • Update knowledge step by step with new info: Naive Bayes
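To give a feel for one entry in the list above, "group membership based on proximity", here is a minimal K-NN sketch. The points, labels and the function name are hypothetical; the idea is simply that a new point gets the label that is most common among its k closest neighbors.

```python
from collections import Counter

def knn_classify(training_data, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    by_distance = sorted(
        training_data,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled points: (features, group).
training_data = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((8.0, 8.0), "B"), ((9.0, 8.5), "B"), ((8.5, 9.0), "B"),
]

near_a = knn_classify(training_data, (2.0, 2.0))
near_b = knn_classify(training_data, (8.0, 9.0))
print(near_a, near_b)
```

There is no training step here at all: the "model" is the stored data, and all the work happens at prediction time.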

Feature Reduction

  • Visualize high-dimensional data: t-distributed stochastic neighbor embedding (t-SNE)
  • Distill feature space into components that describe the greatest variance: Principal Component Analysis (PCA)
  • Making sense of cross-covariance matrices: Canonical Correlation Analysis
  • Linear combination of features that separates classes: Linear Discriminant Analysis
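As an illustration of "components that describe the greatest variance", the sketch below finds the first principal component of a few two-dimensional points. It uses the closed-form eigenvalue of a 2x2 covariance matrix, which only works for two features; the data and names are hypothetical, and in practice you would lean on a library for anything bigger.

```python
import math

def first_principal_component(points):
    """Return the direction of greatest variance, its share of variance, and the projections."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector (this form assumes b != 0), scaled to unit length.
    vx, vy = b, largest - a
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    # Share of the total variance captured by this single component.
    explained = largest / (a + c)
    # Each 2D point reduced to one number: its position along the component.
    projected = [
        (x - mean_x) * direction[0] + (y - mean_y) * direction[1]
        for x, y in points
    ]
    return direction, explained, projected

# Hypothetical data where the two features move together almost perfectly.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8)]
direction, explained, projected = first_principal_component(points)
print(direction, explained)
print(projected)
```

Because the two features are strongly correlated, one component keeps almost all of the variance, so the two columns can be replaced by a single one with little loss.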

Some concepts that you REALLY have to know

Neural networks

They are inspired by our biological neurons. An ANN (Artificial neural network) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.
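A minimal sketch of that idea: each artificial neuron takes the signals it receives, weighs them, and sends a new signal forward. The weights, bias values and the sigmoid activation below are assumptions picked for illustration; in a real network the weights are learned from data.

```python
import math

def neuron(inputs, weights, bias):
    """Weigh the incoming signals, add a bias, and squash the result into (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation

# Hypothetical input signals.
signals = [0.5, -1.0]

# Two neurons receive the input signals and process them...
hidden = [
    neuron(signals, weights=[0.8, 0.2], bias=0.1),
    neuron(signals, weights=[-0.4, 0.9], bias=0.0),
]
# ...and pass their own signals on to a neuron connected to them.
output = neuron(hidden, weights=[1.0, -1.0], bias=0.0)
print(hidden, output)
```

Stacking more neurons and more layers in this way is what gives a network its topology.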

In a nutshell, a neural network is a decision-maker. There is an unlimited number of topologies for neural networks, and some kinds are better suited to some problems than others, but that’s beyond the scope of this article.

Bias Variance Tradeoff, underfitting, and overfitting

It is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.

The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

This tradeoff applies to all forms of supervised learning.
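One way to see underfitting and overfitting side by side is to compare the error on the training set with the error on unseen data. The sketch below does that for three deliberately simple models on hand-made noisy data that roughly follows y = 2x. The data and the three model choices are assumptions for illustration, not a formal measurement of bias and variance.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Too simple: always predicts the average output (high bias)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """A straight line fitted by least squares."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_nearest(xs, ys):
    """Too flexible: memorizes the training set, noise included (high variance)."""
    def model(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return model

# Hypothetical noisy samples of a roughly linear relation.
train_x, train_y = [0, 1, 2, 3, 4, 5], [1, 1, 5, 5, 9, 9]
test_x, test_y = [0.4, 1.4, 2.4, 3.4, 4.4], [0, 4, 4, 8, 8]

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("nearest", fit_nearest)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:8s} train error: {results[name][0]:.2f}  test error: {results[name][1]:.2f}")
```

The constant model is wrong everywhere (underfitting). The nearest-neighbor model is perfect on the training set but noticeably worse on new data because it copied the noise (overfitting). The line sits in between on the training set and does best on the unseen points.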

Accuracy function

In classification algorithms, accuracy is the ratio between the number of correct predictions and the total number of input samples.

Precision function

What proportion of positive identifications was actually correct?

Recall function (also known as Sensitivity)

What proportion of actual positives was identified correctly?

Specificity function

What proportion of actual negatives was identified correctly?
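The four measures are easier to tell apart when computed from the same predictions. The sketch below counts true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) and derives each measure from them. The labels and predictions are made up for illustration, and the function name is my own.

```python
def classification_metrics(actual, predicted):
    """Compute accuracy, precision, recall and specificity from boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # positives found
    fp = sum(1 for a, p in pairs if not a and p)      # false alarms
    tn = sum(1 for a, p in pairs if not a and not p)  # negatives left alone
    fn = sum(1 for a, p in pairs if a and not p)      # positives missed
    return {
        "accuracy": (tp + tn) / len(pairs),  # correct predictions / all samples
        "precision": tp / (tp + fp),         # predicted positives that were right
        "recall": tp / (tp + fn),            # actual positives that were found
        "specificity": tn / (tn + fp),       # actual negatives that were found
    }

# Hypothetical ground truth and model output (True means positive).
actual =    [True, True, True, True,  False, False, False, False, False, False]
predicted = [True, True, True, False, True,  True,  False, False, False, False]

metrics = classification_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On this data precision and recall disagree: the model finds three of the four actual positives, but two of its five positive predictions are false alarms. This sketch assumes there is at least one predicted positive, one actual positive and one actual negative; otherwise the divisions would fail.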


The following libraries are arguably the most important ones in the field:

  • Pandas
  • scikit-learn
  • TensorFlow
  • PySpark
  • NumPy
  • Bokeh
  • Keras
  • SciPy


This article is an overview of many different areas in the machine learning space. The main intention is to spark interest so that readers can go deeper into any of these areas. There are many possible areas of specialization, and this technology is opening up exciting opportunities for the future.



