AI / Machine Learning
November 19, 2020

Understanding Statistics for Machine Learning Models

If you want to understand machine learning algorithms, it is very important to understand basic statistics and what is behind them.

Understanding how the algorithm operates gives you the option of configuring the model according to what you need, as well as explaining with more confidence the results obtained from the execution of the model.

This article provides a very brief summary that tries to list all basic statistical concepts necessary to face machine learning problems. The article is presented schematically, as the main purpose is to get familiar with terminologies and definitions that are used in machine learning algorithms.

In addition, we provide basic information about distributions and explain statistical concepts such as the statistical hypothesis test, which is used, for instance, to answer questions about sample data and to validate assumptions. We also discuss sampling distributions and the relationship between variance and bias.

To get started, we provide an outline of definitions about probability and matrices.


First, some probability terms are described. These terms are then used to explain the fundamentals of machine learning algorithms.

Sample: a set of observations drawn from a population. Samples are necessary because it is usually impractical to study the entire population. Population refers to the set of all the data.

Sample space: set of all possible outcomes that can happen in a chance situation.

Event: a subset of the sample space. A probability is assigned to the event.

Probability: a measure of the likelihood that an event will occur in a random experiment. It is quantified as a number between 0 and 1, where 0 means the event cannot occur and 1 means it is certain. The higher the value, the more likely the event. When all outcomes are equally likely:

Probability = # desired outcomes / # possible outcomes

Probability Rules:

[Figure: Basic Probability Rules]

Independent events: The occurrence of one event has no effect on the probability of occurrence of the other event. If A,B are independent, then P(A and B) = P(A) x P(B)
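The counting formula and the independence condition above can be checked by enumerating a small sample space. The sketch below assumes two fair six-sided dice; the dice, the chosen events and the helper name `probability` are illustrative and do not come from the article.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two six-sided dice.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """P(event) = # desired outcomes / # possible outcomes (equally likely outcomes)."""
    desired = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(desired), len(sample_space))

p_sum_7 = probability(lambda o: o[0] + o[1] == 7)

# Two events that concern different dice, so they should be independent.
p_a = probability(lambda o: o[0] % 2 == 0)                      # first die is even
p_b = probability(lambda o: o[1] == 6)                          # second die shows 6
p_a_and_b = probability(lambda o: o[0] % 2 == 0 and o[1] == 6)  # both at once

print(p_sum_7)                  # 1/6
print(p_a_and_b == p_a * p_b)   # True
```

Exact fractions are used instead of floats so that the equality P(A and B) = P(A) x P(B) can be tested without rounding error.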

Joint probability: the probability that two events occur together.

Marginal probability: the probability of an outcome for a single variable, regardless of the values of the other variables.

Conditional probability: the probability of an event A given that the event B has occurred. It is written as P(A|B) = P(A and B) / P(B), provided that P(B) > 0.

Multiplication rule:

- P(A and B)= P(A|B) x P(B)

- P(A and B)= P(B|A) x P(A)

Bayes Rule: P(A|B)=P(B|A) x P(A)/P(B)
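As a minimal sketch of how the multiplication rule and Bayes rule fit together, the snippet below works through a made-up example. The three input probabilities are hypothetical values chosen only for illustration, and P(B) is obtained with the law of total probability, which is not stated in the list above.

```python
# Hypothetical inputs, chosen only for illustration.
p_a = 0.01              # P(A): prior probability of event A
p_b_given_a = 0.95      # P(B|A)
p_b_given_not_a = 0.10  # P(B|not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Multiplication rule: P(A and B) = P(B|A) x P(A)
p_a_and_b = p_b_given_a * p_a

# Bayes rule: P(A|B) = P(B|A) x P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B)       = {p_b:.4f}")
print(f"P(A and B) = {p_a_and_b:.4f}")
print(f"P(A|B)     = {p_a_given_b:.4f}")
```

With these numbers P(A|B) stays below 10% even though P(B|A) is 95%, because the prior P(A) is small.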


  • Probability tree: a diagram that represents the different outcomes as a function of the events that occur.
[Figure: Probability Tree - Example]

| Event | Probability |
| A | 0.15 |
| B | 0.35 |
| C | 0.50 |
Probability Table - Example

Random Variables

A random variable describes the probability for an uncertain future numerical outcome of a random process. It is a function that maps an outcome of a random experiment to a numerical value. Understanding this concept is fundamental for statistics, machine learning algorithms and simulations.

For instance, consider the experiment of flipping a coin twice. The sample space is S={HH,TT,HT,TH}, where H corresponds to heads and T to tails. Let the random variable X be the number of heads: it is a function that maps each outcome to the number of heads flipped. Thus, X takes the following values:

HH -> 2

TT -> 0

HT -> 1

TH -> 1

Then, the random variable X can take the values {0,1,2}, corresponding to the possible cases. Observe that although the sample space had 4 cases, the random variable can only take 3 values.

Discrete random variable: the set of possible values is finite or countably infinite.

Continuous random variable: can take any value within an interval.

Expected value: a weighted average of the possible values, where each value is weighted by its probability. If x1..xn are the possible values of the discrete random variable X, the formula is:

E(X) = x1*p(x1) + x2*p(x2) + … + xn*p(xn)

Variance: describes how spread out the data is around the mean value. It is defined as the expected value of the squared deviation of X from the mean m.

Var(X) = E[(X - m)^2]

Here the function is g(X) = (X - m)^2, so applying the formula for the expected value of a function, we get:

Var(X) = E[(X - m)^2] = (x1 - m)^2*p(x1) + (x2 - m)^2*p(x2) + … + (xn - m)^2*p(xn) = Σ_{i=1}^{n} (xi - m)^2*p(xi)

Standard deviation: the square root of Var(X). It is denoted as σx.
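The expected value, variance and standard deviation can be computed directly from these definitions. The sketch below reuses the random variable X (number of heads in two coin flips) and assumes the coin is fair, so the four outcomes are equally likely; the function names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# Distribution of X = number of heads in two fair coin flips.
# HH, HT, TH and TT are equally likely, so P(X=1) = 2/4.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expected_value(pmf):
    """E(X) = sum of x * p(x) over all possible values x."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - m)^2] = sum of (x - m)^2 * p(x)."""
    m = expected_value(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

mean = expected_value(pmf)   # 1
var = variance(pmf)          # 1/2
std = sqrt(var)              # about 0.707

print(mean, var, round(std, 3))
```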

Covariance: measures how two random variables vary together.

  • Positive covariance: the variables tend to move in the same direction.
  • Negative covariance: the variables tend to move in inverse directions.
    It is important to notice that the covariance shows the direction of the relationship between the two variables, but not the strength of it.

Correlation: measures the strength of the linear relationship between variables. The correlation coefficient ranges from -1 to 1.

  • Positive correlation: the variables are correlated and they move in the same direction.
  • Negative correlation: the variables are correlated and they move in opposite directions.
  • No correlation: when the coefficient is 0, there is no linear relationship between the variables. This does not by itself mean that the variables are independent, although independent variables always have zero correlation.
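To make the difference between direction and strength concrete, here is a small sketch that computes the sample covariance and the Pearson correlation coefficient by hand. The data and function names are hypothetical, and the sample version (dividing by n - 1) is one of several conventions.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: y grows with x, z shrinks as x grows.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
z = [50.0, 40.0, 30.0, 20.0, 10.0]

print(covariance(x, y), correlation(x, y))   # positive covariance, correlation 1
print(covariance(x, z), correlation(x, z))   # negative covariance, correlation -1
```

The covariance of x and z is five times larger in magnitude than that of x and y only because z is on a larger scale; the correlation removes that scale and reports the same strength for both pairs.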

Distance matrix: square matrix that contains the pairwise distances between the elements of a set. The most common distance is the Euclidean distance, but other distances can be used.
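A distance matrix for a handful of points might be built as follows. The three points are hypothetical, and `math.dist` (available from Python 3.8) is used for the Euclidean distance.

```python
from math import dist  # Euclidean distance between two points

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # hypothetical observations

# Entry [i][j] holds the distance between point i and point j.
distance_matrix = [[dist(p, q) for q in points] for p in points]

for row in distance_matrix:
    print(row)
```

The result is symmetric with zeros on the main diagonal, since the distance from a point to itself is 0.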

I.i.d. (independent and identically distributed) random variables: random variables that all have the same probability distribution and are mutually independent. Machine learning algorithms often make this assumption, which implies that all samples come from the same process and do not depend on past samples.

Matrices basics

Basic knowledge about matrices is necessary to understand deep learning algorithms that handle images.

A (m x n): a matrix with m rows and n columns.

Square matrix: when m=n

Column vector: a matrix with only 1 column

Row vector: a matrix with only 1 row

Transpose matrix: interchange rows and columns. Notation: A'=t(A)

Diagonal matrix: square matrix in which all elements outside the main diagonal are 0

Symmetric matrix: square matrix remains unchanged when it is transposed. A'=A

Identity matrix: diagonal matrix with all elements of the diagonal equal to 1. Notation: I

Matrix multiplication: A (l x m) x B (m x n) = C (l x n), where c_ij = Σ_{k=1}^{m} a_ik * b_kj

Element-wise multiplication: A (n x m) ∘ B (n x m) = C (n x m), where c_ij = a_ij * b_ij

Inverse matrix: A(A^-1)=I

Trace: sum of the elements of the diagonal.

Determinant: Notation: det(A)=|A|


Eigenvalues and eigenvectors

A x̄ = λ x̄

λ is a scalar, the eigenvalue of A 

x̄ is the eigenvector belonging to λ.

Any nonzero multiple of x̄ will be an eigenvector

To find λ : |A - λI|=0
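The matrix operations above can be sketched in plain Python for small matrices stored as lists of rows. This is an illustration under stated assumptions, not a general implementation: the helper names are made up, the determinant and eigenvalue helpers only handle 2 x 2 matrices, and the eigenvalue helper assumes real eigenvalues.

```python
from math import sqrt

def transpose(a):
    """Interchange rows and columns."""
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    """(l x m) times (m x n) gives (l x n): c_ij = sum over k of a_ik * b_kj."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def det2(a):
    """Determinant of a 2 x 2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def eigenvalues2(a):
    """Real eigenvalues of a 2 x 2 matrix from |A - lambda I| = 0.

    For a 2 x 2 matrix this equation is lambda^2 - trace * lambda + det = 0.
    Assumes a non-negative discriminant (always true for symmetric matrices).
    """
    trace = a[0][0] + a[1][1]
    root = sqrt(trace ** 2 - 4 * det2(a))
    return (trace - root) / 2, (trace + root) / 2

A = [[2.0, 1.0], [1.0, 2.0]]   # hypothetical symmetric matrix
lam_small, lam_large = eigenvalues2(A)

# (1, 1) is an eigenvector for the larger eigenvalue: A x = lambda x.
x = [[1.0], [1.0]]
Ax = matmul(A, x)

print(transpose(A) == A)               # True, A is symmetric
print(det2(A), lam_small, lam_large)   # 3.0 1.0 3.0
print(Ax)                              # [[3.0], [3.0]]
```

In practice a numerical library would be used for larger matrices; the point here is only to show each definition as a few lines of code.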


Cheat sheets are very useful for keeping all the concepts in one document; a cheat sheet is available for this part.

What is a Distribution?

A probability distribution is a summary of the probabilities for the values of a random variable.

Measurements: The distribution also has general properties that can be measured. Important properties of a probability distribution are: expected value, variance, skewness and kurtosis. 

A discrete probability distribution summarizes the probability for a discrete random variable. In the same way, the summary for a continuous random variable is a continuous probability distribution.

Discrete data takes values from a finite (or countable) set of possible values, while continuous data can take infinitely many values within an interval.

Discrete Distributions

Bernoulli:  distribution for a binary variable, represents the probability of a single experiment with 2 possible outcomes (probability p and 1-p).


Binomial: represents the number of successes in n independent Bernoulli trials, each with 2 possible outcomes and the same probability of success p.

Poisson distribution: models the number of events produced by a random process during a fixed interval of time or space. Lambda (λ) is the average rate of events or arrivals per interval, and the counts in disjoint intervals are independent.
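The three discrete distributions can be written down with their standard probability mass functions. The sketch below uses only the `math` module; the function names and the example parameters are illustrative.

```python
from math import comb, exp, factorial

def bernoulli_pmf(k, p):
    """P(X = k) for a single trial, with k in {0, 1}."""
    return p if k == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(k successes in n independent Bernoulli trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(k events in an interval) when events arrive at average rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

print(bernoulli_pmf(1, 0.3))
print(binomial_pmf(2, 4, 0.5))   # 6 * 0.5**4 = 0.375
print(poisson_pmf(0, 2.0))       # exp(-2)
```

A binomial distribution with n = 1 reduces to the Bernoulli case, which is a quick way to sanity-check the two functions against each other.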


Continuous Distributions


A normal distribution is symmetric about the mean, where data near the mean is more frequent in occurrence than data far from the mean. It has a bell shape.
Notation: N(μ, σ), where μ is the mean and σ is the standard deviation.


Empirical rule: for a normal distribution, approximately:

  • 68% of the data falls within one standard deviation of the mean
  • 95% of the data falls within two standard deviations of the mean 
  • 99.7% of the data falls within three standard deviations of the mean
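These percentages can be reproduced from the cumulative distribution function of the normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `within_k_sigma` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))
```

The exact values are slightly different from the rounded figures in the rule (about 68.27%, 95.45% and 99.73%).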

Truncated Normal

A truncated normal distribution is a variation of the normal distribution in which the random variable is bounded from below, from above, or both.


A uniform distribution describes an experiment in which all outcomes within certain bounds are equally likely. The bounds a and b are the minimum and maximum values of the interval, which can be either open (a,b) or closed [a,b].


An exponential distribution describes the time between events in a Poisson point process.


A triangular distribution is defined by a lower limit, an upper limit and a mode.

Bimodal Distribution

A bimodal distribution has 2 different modes (peaks).


A multimodal distribution has more than 2 modes (the data comes from more than 2 groups).


Skewness measures the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Thus, skewness can be described as a measure of the distortion from a symmetric distribution such as the normal distribution.

A distribution that has negative skewness commonly indicates that the tail is on the left side, and in the same way positive skew indicates that the tail is on the right. 


Kurtosis measures extreme values in either the left or the right tail. It indicates whether the data is heavy-tailed in comparison with a normal distribution. When kurtosis is large, the tails are heavier than those of the normal distribution, which suggests that the data set has outliers. Data sets with low kurtosis tend to have light tails, or a lack of outliers.
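One common way to compute skewness and kurtosis is through standardized central moments. The sketch below follows that convention and reports excess kurtosis (kurtosis minus 3), so that a normal distribution scores about 0; both the convention and the two small data sets are assumptions made for illustration.

```python
def central_moment(data, order):
    """Average of (x - mean)^order over the data."""
    m = sum(data) / len(data)
    return sum((x - m) ** order for x in data) / len(data)

def skewness(data):
    """Third standardized moment; 0 for perfectly symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (the value for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 2.0, 2.0, 3.0, 20.0]   # one large value on the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The single large value in `right_tailed` produces both a positive skew (tail on the right) and a positive excess kurtosis (heavier tail than a normal distribution).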


Quantiles are cut points that divide a distribution into portions of equal probability. The image shows the quartiles, which divide a normal distribution into 4 parts. The divisions correspond to 25%, 50% and 75% of the total distribution.

The inter-quartile range is the difference between Q3 and Q1, and the 2nd quartile is the median.
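As a short illustration, the quartiles and the inter-quartile range of a small data set can be obtained with `statistics.quantiles` (Python 3.8 or later). The data is hypothetical, and the function's default "exclusive" method is used; other interpolation methods can give slightly different cut points.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7]   # hypothetical observations

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)       # 2.0 4.0 6.0 4.0
print(q2 == median(data))    # True, the 2nd quartile is the median
```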

[Figure: quartiles of a normal distribution]

Confidence Interval

A confidence interval is a range of values, computed from a sample, that is likely to contain the true value of a population parameter. The confidence level is the proportion of such intervals, over repeated sampling, that contain the true value of the parameter. The graph represents a 95% confidence interval: the level of confidence is 95%, and the proportion of intervals that miss the true population parameter is α, in this case α = 1 - 0.95 = 0.05.

[Figure: Confidence Interval]
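A confidence interval for a population mean can be sketched as follows. This is a minimal example that assumes the normal approximation (sample mean plus or minus z times the standard error); for small samples such as the hypothetical one below, a t-based interval would usually be somewhat wider. The function name and data are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)          # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(len(sample))   # z times the standard error
    centre = mean(sample)
    return centre - margin, centre + margin

sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]  # hypothetical data
low, high = mean_confidence_interval(sample)
low99, high99 = mean_confidence_interval(sample, confidence=0.99)

print(round(low, 3), round(high, 3))
print(round(low99, 3), round(high99, 3))
```

Raising the confidence level from 95% to 99% widens the interval: a smaller α requires a larger range.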

Now let's take a look at some explanations for statistical concepts, as well as sampling distribution, and the relationship between variance and bias.

Statistical Hypothesis Testing

A statistical hypothesis test evaluates an assumption about a quantity at a chosen confidence level. Commonly, the assumption concerns a comparison between two samples, or between a sample and a population parameter. The result of the test allows us to interpret whether the assumption holds or has been violated. The assumption of a statistical test is called the null hypothesis, or H0.

p-value: the level of marginal significance. It represents the probability of obtaining a result at least as extreme as the one observed, under the assumption that the null hypothesis is correct. It is used to quantify the result of the test and either reject or fail to reject the null hypothesis. This is done by comparing the p-value to the desired significance level α. A result is statistically significant when the p-value is less than or equal to the significance level.

* If p-value > α : Fail to reject the null hypothesis
* If p-value <= α : Reject the null hypothesis

The p-value is the smallest significance level at which H0 can be rejected.
Generally, the significance level is 0.05. A smaller significance level requires stronger evidence before H0 is rejected.

Type I and type II errors:

There are two different types of errors (type I and type II). Since the p-value is based on probability, there is always a chance of reaching the wrong conclusion when rejecting or failing to reject the null hypothesis. For a fixed sample size, the chances of making these errors are inversely related: if the type I error rate increases, the type II error rate decreases, and vice versa.

| | Type I error | Type II error |
| Definition | Rejection of a true null hypothesis | Non-rejection of a false null hypothesis |
| Meaning | Taking action when unnecessary | Failing to take an appropriate action |
| Can only occur | When H0 is true | When H0 is false |

Z-test and T-test:

There are different statistical tests according to what we want to test.

| Z-test | T-test |
| Hypothesis test to determine whether two population means are different. | Hypothesis test to determine if there is a significant difference between two population means. |
| Standard deviations or variances are known | Standard deviations are unknown |
| Large sample size | Small sample size |
| Based on a normal distribution | Based on the t-distribution (heavier tails, less mass in the center) |
| A z-statistic, or z-score, is a number representing the result from the z-test. | A t-statistic, or t-score, is a number representing the result from the t-test. |
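To tie the p-value decision rule to a concrete test, here is a minimal sketch of a two-sided, two-sample z-test. It assumes the population standard deviations are known and the samples are large, as in the z-test column above; the function name and the summary statistics are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean1, mean2, sigma1, sigma2, n1, n2):
    """Two-sided z-test for H0: the two population means are equal.

    Assumes the population standard deviations sigma1 and sigma2 are known.
    Returns the z-statistic and the p-value.
    """
    z = (mean1 - mean2) / sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

alpha = 0.05  # significance level

# Hypothetical summary statistics for two large samples.
z, p_value = two_sample_z_test(mean1=52.0, mean2=50.0,
                               sigma1=5.0, sigma2=5.0, n1=100, n2=100)

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(round(z, 3), round(p_value, 4), decision)
```

A t-test follows the same pattern but estimates the standard deviations from the samples and compares the statistic with a t-distribution, which is not available in the Python standard library.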

Sampling Distribution

Sometimes in machine learning problems there is so much data that it is impractical to use all of it. Therefore, we use sampling to extract a group of data from the total.

Sampling distribution: The sampling distribution shows how a statistic varies from sample to sample.

Randomization: ensures that on average a sample mimics the population in order to avoid bias.

Sample size: do not get confused! Larger populations do not require larger samples.

Stratified random sample: divides the sampling frame into subsets before selecting the sample.

Sample size condition for the sample mean to be normally distributed, as a function of the kurtosis k4:

Control limits: boundaries in a control chart that determine whether a process should be stopped or allowed to continue. A control chart is a graph of a process statistic as a function of time.

  • UCL - upper control limit
  • LCL - lower control limit 

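With these limits, you can find a balance between type I and type II errors. You cannot reduce both errors by moving the limits: widening them lowers the type I error rate but raises the type II error rate, and narrowing them does the opposite. For instance, for a normal distribution, the limits are commonly set at the mean plus and minus 3 standard deviations.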
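The 3 standard deviation limits might be computed and applied as in the sketch below. It treats each measurement as one point on the chart and estimates the mean and standard deviation from a baseline period assumed to be in control; the data, the variable names and the helper `out_of_control` are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical in-control measurements used to set the limits.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.0]

centre = mean(baseline)
sigma = stdev(baseline)
ucl = centre + 3 * sigma   # upper control limit
lcl = centre - 3 * sigma   # lower control limit

def out_of_control(value):
    """True when a new observation falls outside the control limits."""
    return value > ucl or value < lcl

new_observations = [10.05, 9.95, 11.2]
flags = [out_of_control(v) for v in new_observations]

print(round(lcl, 3), round(ucl, 3), flags)
```

Replacing the factor 3 with a smaller number narrows the limits, which flags more points (more type I errors, fewer type II errors).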

s-chart: control chart that tracks the sample standard deviation

R-chart: control chart that tracks the sample range

X-bar chart: control chart that tracks the mean of a process

Central Limit Theorem: if the sample size is large enough, the sampling distribution of x̄ is approximately normal regardless of the distribution of the population, where x̄ is the sample mean.
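A small simulation can illustrate the theorem. The sketch below draws repeated samples from an exponential population, which is strongly skewed, and looks at how the sample means behave. The population, the sample size of 50, the number of samples and the fixed seed are all arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50
NUM_SAMPLES = 2000

# Population: exponential with rate 1, so its mean and standard deviation are both 1.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
expected_sd = 1.0 / SAMPLE_SIZE ** 0.5   # population sd divided by sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_sd, 3))
```

The sample means cluster around the population mean with a spread close to σ/√n, and a histogram of `sample_means` would look roughly bell-shaped even though the population itself is not.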

Recommended books

Manly, B. F. J., & Navarro, A. J. A. (2017). Multivariate statistical methods: A primer. Florida: CRC Press.

Stine, R. A., & Foster, D. P. (2018). Statistics for business: Decision making and Analysis.