If you want to understand machine learning algorithms, it is very important to have basic statistical knowledge to understand what is behind them. Understanding how the algorithm operates gives you the option of configuring the model according to what you need, as well as explaining with more confidence the results obtained from the execution of the model.

This article provides a very brief summary that tries to list all basic statistical concepts necessary to face machine learning problems. The article is presented schematically, as the main purpose is to get familiar with terminologies and definitions that are used in machine learning algorithms.

In addition, we will be providing basic information about distributions, as well as explanations for statistical concepts such as the Statistical hypothesis test, used for answering questions about sample data and validating assumptions. We will also discuss sampling distribution and the relationship between variance and bias.

To get started, an outline of definitions about probability and matrices is provided.

**Probability**

First, some probability terms are defined. These terms are then used to explain the fundamentals of the algorithms.

**Sample:** a set of observations drawn from a population. Samples are needed because it is usually impractical or impossible to study the whole population. The population is the set of all the data.

**Sample space:** the set of all possible outcomes of a chance situation.

**Event:** a subset of the sample space. A probability is assigned to the event.

**Probability:** a measure of the likelihood that an event will occur in a random experiment. It is quantified as a number between 0 and 1, where 0 means the event cannot occur and 1 means it is certain. The higher the value, the more likely the event.

Probability = # desired outcomes / # possible outcomes (when all outcomes are equally likely)
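
The ratio above can be sketched in a few lines of Python. This is a minimal illustration that assumes a fair six-sided die, so that all outcomes are equally likely. The function name `classical_probability` is a hypothetical helper chosen for this example.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """Desired outcomes divided by possible outcomes (equally likely outcomes assumed)."""
    desired = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(desired), len(sample_space))

die = {1, 2, 3, 4, 5, 6}   # sample space of one roll of a fair die
even = {2, 4, 6}           # event: the roll is even

p_even = classical_probability(even, die)
print(p_even)              # 1/2
```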

**Probability Rules**:

**Independent events:** The occurrence of one event has no effect on the probability of occurrence of the other event. If A,B are independent, then **P(A and B) = P(A) x P(B)**

**Joint probability:** the probability that two events occur together, P(A and B).

**Marginal probability:** the probability of an outcome for a single variable, regardless of the values of the other variables.

**Conditional probability:** the probability of an event A given that the event B has occurred. It is written as: *P(A|B) = P(A and B) / P(B)*

**Multiplication rule:**

– *P(A and B)= P(A|B) x P(B)*

– *P(A and B)= P(B|A) x P(A)*

**Bayes Rule:** *P(A|B)=P(B|A) x P(A)/P(B)*
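
As a rough worked example of the multiplication rule and Bayes rule, the sketch below uses made-up probabilities for two events A and B. The marginal P(B) is obtained by summing over the two cases "A" and "not A", a step that is not spelled out in the rules above. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Hypothetical inputs, chosen only for illustration
p_a = Fraction(1, 5)                # P(A)
p_b_given_a = Fraction(1, 2)        # P(B|A)
p_b_given_not_a = Fraction(1, 20)   # P(B|not A)

# Multiplication rule: P(A and B) = P(B|A) x P(A)
p_a_and_b = p_b_given_a * p_a

# Marginal probability of B: add the cases where A happens and where it does not
p_b = p_a_and_b + p_b_given_not_a * (1 - p_a)

# Bayes rule: P(A|B) = P(B|A) x P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_a_and_b, p_b, p_a_given_b)  # 1/10 7/50 5/7
```

Note that the result agrees with the definition of conditional probability, since P(A and B) / P(B) gives the same value.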

**Representation:**

**Probability tree:** a diagram that represents the different outcomes as a function of the occurrence of the events.

**Probability table:** another way of representing probabilities.

| Event | Probability |
|-------|-------------|
| A | 0.15 |
| B | 0.35 |
| C | 0.50 |

**Random Variables**

A **random variable** describes the uncertain numerical outcome of a random process. It is a function that maps each outcome of a random experiment to a numerical value.

For instance, when flipping a coin twice, the sample space is S={HH,TT,HT,TH}, where H corresponds to heads and T to tails. Let the random variable X be the number of heads: it is a function that determines, from each outcome, how many heads were flipped. Thus, X takes the following values:

HH -> 2

TT -> 0

HT -> 1

TH -> 1

Then, the random variable X can take the values {0,1,2}, corresponding to the possible cases. Observe that although the sample space had 4 cases, the random variable can only take 3 values.
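
The coin example can be reproduced with a short sketch. It enumerates the sample space, applies the random variable to each outcome and, assuming a fair coin (an assumption added here, so that the four outcomes are equally likely), tabulates the probability of each value of X.

```python
from collections import Counter
from itertools import product

# Sample space of two coin flips: ['HH', 'HT', 'TH', 'TT']
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

def number_of_heads(outcome):
    """The random variable X: maps an outcome to the number of heads."""
    return outcome.count("H")

values = {outcome: number_of_heads(outcome) for outcome in sample_space}

# With a fair coin every outcome has probability 1/4
counts = Counter(values.values())
distribution = {x: counts[x] / len(sample_space) for x in sorted(counts)}

print(values)        # {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}
print(distribution)  # {0: 0.25, 1: 0.5, 2: 0.25}
```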

**Discrete random variable:** the set of possible values is finite or countably infinite.

**Continuous random variable:** can take any value within an interval.

**Expected value:** a weighted average of the possible outcomes, using their probabilities as weights. It is the sum of every value multiplied by its probability. If x1, …, xn are the values that the discrete random variable X can take, the formula is:

$$E(X) = x_1\,p(x_1) + x_2\,p(x_2) + \dots + x_n\,p(x_n)$$

**Variance:** describes how spread out the data is around the mean. It is defined as the expected value of the squared deviation of X from the mean m.

$$\mathrm{Var}(X) = E[(X-m)^2]$$

Here the function is $g(X) = (X-m)^2$, so applying the formula for the expected value of a function gives:

$$\mathrm{Var}(X) = E[(X-m)^2] = (x_1-m)^2\,p(x_1) + (x_2-m)^2\,p(x_2) + \dots + (x_n-m)^2\,p(x_n) = \sum_{i=1}^{n} (x_i-m)^2\,p(x_i)$$

**Standard deviation:** the square root of Var(X). It is denoted as $\sigma_X$.
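
To make the formulas concrete, here is a small sketch that computes the expected value, variance and standard deviation of a discrete random variable. The distribution used is the number of heads in two flips of a fair coin. The fair coin is an assumption, and the function names are hypothetical.

```python
import math

# Values of X and their probabilities (number of heads in two fair coin flips)
distribution = {0: 0.25, 1: 0.5, 2: 0.25}

def expected_value(dist):
    """E(X): every value weighted by its probability."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X): expected squared deviation from the mean m."""
    m = expected_value(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

mean = expected_value(distribution)
var = variance(distribution)
std = math.sqrt(var)

print(mean, var, round(std, 4))  # 1.0 0.5 0.7071
```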

**Covariance:** measures how two random variables vary together.

- Positive covariance: the variables tend to move in the same direction.
- Negative covariance: the variables tend to move in inverse directions.

It is important to notice that the covariance shows the direction of the relationship between the two variables, but not the strength of it.

**Correlation:** measures the strength and direction of the linear relationship between variables. The coefficient takes values between -1 and 1.

- Positive correlation: the variables are correlated and they move in the same direction.
- Negative correlation: the variables are correlated and they move in opposite directions.
- No correlation: when the coefficient is 0, there is no linear relationship between the variables. Independent variables have zero correlation, but zero correlation alone does not imply that the variables are independent.
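
The following sketch computes the sample covariance and the Pearson correlation coefficient by hand on made-up data. The second data set is included to show that a coefficient of 0 only rules out a linear relationship: there, y is completely determined by x and the coefficient is still 0.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]   # moves in the same direction as x
y_squared = [v ** 2 for v in x]     # depends on x, but not linearly

print(covariance(x, y_linear))      # 5.0 (positive: same direction)
print(correlation(x, y_linear))     # 1.0
print(correlation(x, y_squared))    # 0.0
```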

**Distance matrix:** a square matrix that contains the pairwise distances between the elements of a set. The most common distance is the Euclidean distance, but other distances can be used.
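
A distance matrix can be built directly from its definition. The sketch below uses three made-up points and the Euclidean distance from Python's standard library (`math.dist`, available from Python 3.8).

```python
import math

# Hypothetical observations in two dimensions
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Entry [i][j] holds the Euclidean distance between point i and point j
distance_matrix = [[math.dist(p, q) for q in points] for p in points]

for row in distance_matrix:
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```

The result is square, symmetric and has zeros on the main diagonal, because each point is at distance 0 from itself.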

**I.i.d. (independent and identically distributed) random variables:** random variables that have the same probability distribution and are mutually independent. This assumption is often made in machine learning algorithms to state that all samples come from the same process, which does not depend on previously generated samples.

**Matrices**

Basic knowledge about matrices is necessary in order to understand some of the math behind the algorithms and handle images.

**m×n matrix:** a matrix A with m rows and n columns.

**Square matrix:** a matrix with m=n.

**Column vector:** a matrix with only 1 column.

**Row vector:** a matrix with only 1 row.

**Transpose matrix:** the matrix obtained by interchanging rows and columns. Notation: A’=t(A)

**Diagonal matrix:** a matrix whose entries are all 0 except on the main diagonal.

**Symmetric matrix:** a square matrix that is unchanged when it is transposed: A’=A

**Identity matrix:** a diagonal matrix with all elements of the diagonal equal to 1. Notation: I

**Matrix multiplication:** $A_{l \times m} \, B_{m \times n} = C_{l \times n}$

**Element-wise multiplication:** $A_{n \times m} \circ B_{n \times m} = C_{n \times m}$

**Inverse matrix:** $A A^{-1} = A^{-1} A = I$

**Trace:** the sum of the elements of the main diagonal.

**Determinant:** Notation: det(A)=|A|
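
Most of these operations are one-liners in NumPy. The sketch below assumes NumPy is installed and uses two small made-up matrices. It is meant only to connect each definition to a call, not as a complete reference.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[0.0, 1.0],
              [4.0, 2.0]])

transpose = A.T                          # rows and columns interchanged
is_symmetric = np.array_equal(A, A.T)    # True: A equals its transpose
product = A @ B                          # matrix multiplication
elementwise = A * B                      # element-wise multiplication
identity = np.eye(2)                     # 2x2 identity matrix
inverse = np.linalg.inv(A)               # inverse matrix
trace = np.trace(A)                      # 2 + 3 = 5
determinant = np.linalg.det(A)           # 2*3 - 1*1 = 5

print(product)
print(elementwise)
print(np.allclose(A @ inverse, identity))  # True
```

Note that `*` on NumPy arrays is element-wise, while `@` performs matrix multiplication. Confusing the two is a common source of bugs.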

**Eigenvalues and eigenvectors**

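For a square matrix A, a scalar λ and a nonzero vector x̄ that satisfy A x̄ = λ x̄ form an eigenvalue and eigenvector pair.

λ is a scalar and is called an eigenvalue of A.

x̄ is the eigenvector belonging to λ.

Any nonzero multiple of x̄ is also an eigenvector belonging to λ.

To find λ, solve |A – λ*I*|=0.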
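
These properties can be checked numerically. The sketch below assumes NumPy is installed and uses a small made-up symmetric matrix. For each eigenvalue and eigenvector pair returned by `np.linalg.eig`, it verifies that multiplying by A only rescales the eigenvector, that a nonzero multiple of the eigenvector behaves the same way, and that the determinant of A minus λ times the identity is numerically zero.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Column i of `eigenvectors` is the eigenvector belonging to eigenvalues[i]
eigenvalues, eigenvectors = np.linalg.eig(A)

checks = []
for lam, v in zip(eigenvalues, eigenvectors.T):
    checks.append(np.allclose(A @ v, lam * v))                          # A v = lambda v
    checks.append(np.allclose(A @ (5 * v), lam * (5 * v)))              # a multiple is still an eigenvector
    checks.append(np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0))  # |A - lambda I| = 0

print(all(checks))  # True
```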

**Roadmap**

Cheat sheets are very useful for keeping all the concepts in one document. Here you can find a cheat sheet for this part.

**What is a Distribution?**

A probability distribution is a summary of the probabilities for the values of a random variable.

Measurements: The distribution also has general properties that can be measured. Important properties of a probability distribution are: expected value, variance, skewness and kurtosis.

The probabilities for a discrete random variable can be summarized with a discrete probability distribution. In the same way, the summary for a continuous random variable is called a continuous probability distribution.

Discrete data involves a finite or countable set of possible values, while continuous data can take infinitely many values within an interval.

**Discrete Distributions:**

**Bernoulli:** the distribution of a binary variable. It represents a single experiment with 2 possible outcomes, which occur with probabilities p and 1-p.

**Binomial:** represents the number of successes in n Bernoulli trials. The trials are independent, and each one has 2 possible outcomes with the same success probability p.

**Poisson distribution:** models the number of events produced by a random process during a fixed interval of time or space. Lambda is the average rate of events or arrivals per interval, and the counts in disjoint intervals are independent.
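
The three discrete distributions can be written down as probability mass functions. The formulas below are the standard textbook ones rather than something stated in this article, and the function names are hypothetical. The sketch needs only the standard library (`math.comb` requires Python 3.8 or later).

```python
import math

def bernoulli_pmf(k, p):
    """P(X = k) for one trial, where k is 1 (success) or 0 (failure)."""
    return p if k == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli trials)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

print(bernoulli_pmf(1, 0.3))              # 0.3
print(binomial_pmf(2, 4, 0.5))            # 0.375
print(round(poisson_pmf(0, 2.0), 4))      # 0.1353
```

A binomial with n = 1 reduces to the Bernoulli case, which is a quick sanity check on the implementation.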

**Continuous Distributions**:

**Normal**

A normal distribution is symmetric about the mean, and data near the mean occur more frequently than data far from the mean. It has a bell shape. Notation: *N*(μ, σ)

The empirical rule establishes:

- Approximately 68% of the data falls within one standard deviation of the mean
- Approximately 95% of the data falls within two standard deviations of the mean
- Approximately 99.7% of the data falls within three standard deviations of the mean
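
The empirical rule can be verified from the cumulative distribution function of the normal distribution. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later). The standard normal is used here, but the percentages are the same for any mean and standard deviation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass within k standard deviations of the mean
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```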

**Truncated Normal**

A truncated normal distribution is a variation of the normal distribution in which the values of the random variable are bounded from below, from above, or both.

**Uniform**

A uniform distribution describes an experiment in which all outcomes within certain bounds are equally likely. The bounds a and b are the minimum and maximum values of the interval, which can be either open (a,b) or closed [a,b].

**Exponential**

An exponential distribution describes the time between events in a Poisson point process.

**Triangular**

A triangular distribution is defined by a lower limit, an upper limit and the mode.

**Bimodal Distribution**

A bimodal distribution has 2 different modes.

**Multimodality**:

A multimodal distribution has more than 2 modes (the data comes from more than 2 groups).

**Skewness**:

Skewness measures the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Skewness can therefore be described as a measure of the distortion from a symmetric distribution such as the normal distribution.

A distribution that has negative skewness commonly indicates that the tail is on the left side, and in the same way positive skew indicates that the tail is on the right.

**Kurtosis**

Kurtosis measures the weight of extreme values in the tails. It indicates whether the data is heavy-tailed or light-tailed in comparison with a normal distribution. When the kurtosis is large, the tails are heavier than those of the normal distribution, which suggests that the data set has outliers. Data sets with low kurtosis tend to have light tails, or a lack of outliers.
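
Both measures can be computed from central moments. The sketch below uses the simple moment-based definitions (there are other, bias-corrected variants) and reports kurtosis as excess kurtosis, that is, relative to the value 3 of a normal distribution. The data sets are made up for illustration.

```python
def central_moment(data, order):
    """Average of (x - mean) raised to the given power."""
    m = sum(data) / len(data)
    return sum((x - m) ** order for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth central moment scaled by the variance squared, minus 3 (the normal value)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 15]   # one large value stretches the right tail

print(skewness(symmetric))           # 0.0
print(skewness(right_tailed) > 0)    # True: the tail is on the right
print(excess_kurtosis(right_tailed) > excess_kurtosis(symmetric))  # True
```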

**Quantiles**

Quantiles are cut points that divide a distribution into portions of equal probability. The image shows the quantiles that divide a normal distribution into 4 parts. When the divisions correspond to 25%, 50% and 75% of the total distribution, they are called quartiles. The inter-quartile range is the difference between Q3 and Q1, and the 2nd quartile is the median.
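
Quartiles and the inter-quartile range can be computed with `statistics.quantiles` from the standard library (Python 3.8 or later). The data set below is made up, and the default interpolation method of the function is used. Other methods can give slightly different cut points on small samples.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 asks for the three cut points that split the data into four equal parts
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)                  # 3.0 6.0 9.0 6.0
print(q2 == statistics.median(data))    # True: the 2nd quartile is the median
```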

**Confidence Interval**

A confidence interval is a range of values, computed from a sample, that is likely to contain the true value of a population parameter. The confidence level represents the proportion of such intervals, over repeated samples, that contain the true value of the parameter. The graph represents a 95% confidence interval. The level of confidence is 95%, and α is the proportion of intervals that miss the true population parameter, in this case α = 1 - 0.95 = 0.05.
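
The "proportion of intervals" reading can be illustrated with a simulation. The sketch below assumes normally distributed data with a known population standard deviation, so that a z-based interval for the mean applies. The parameter values, the seed and the function name `z_confidence_interval` are all choices made for this example.

```python
import random
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Interval for the mean when the population standard deviation is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 1.96 for 95%
    half_width = z * sigma / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width

rng = random.Random(0)                # fixed seed so the run is repeatable
true_mean, sigma, n, trials = 10.0, 2.0, 30, 1000

hits = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    low, high = z_confidence_interval(sample, sigma)
    hits += low <= true_mean <= high  # count intervals that contain the true mean

coverage = hits / trials
print(coverage)                       # expected to be close to 0.95
```

Each individual interval either contains the true mean or it does not. The 95% refers to how often the procedure succeeds over many repeated samples.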

*Now let's take a look at some explanations for statistical concepts, as well as sampling distribution and the relationship between variance and bias.*

**Statistical Hypothesis Testing**

A statistical hypothesis test evaluates an assumption about a quantity and reports how confident we can be in it given the data. Commonly, the assumption to be tested is based on a comparison between two samples, or between a sample and a population parameter. The result of the test allows us to interpret whether the assumption holds or has been violated. The assumption of a statistical test is called the **null hypothesis or H0**.

**p-value:** the level of marginal significance. It represents the probability of obtaining a result at least as extreme as the one observed, under the assumption that the null hypothesis is correct. It is used to quantify the result of the test and either reject or fail to reject the null hypothesis. This is done by comparing the p-value to the desired significance level α. A result is statistically significant when the p-value is less than or equal to the significance level.

- If p-value > α: fail to reject the null hypothesis
- If p-value <= α: reject the null hypothesis

The p-value is the smallest significance level at which H0 can be rejected.

The significance level is generally set to 0.05. A smaller value requires stronger evidence before the null hypothesis is rejected.
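
As an illustration of the decision rule, the sketch below runs a two-sided one-sample z-test. It assumes the population standard deviation is known, and all the numbers are made up. The function name `two_sided_z_test` is hypothetical.

```python
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Test H0: the population mean equals mu0, with known sigma."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Probability, under H0, of a statistic at least as extreme as z in either tail
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    reject = p_value <= alpha
    return z, p_value, reject

z, p_value, reject = two_sided_z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
print(round(z, 2), round(p_value, 4), reject)   # 2.0 0.0455 True
```

With the same data and a stricter level of α = 0.01, the null hypothesis would not be rejected, since 0.0455 > 0.01.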

**Type I and type II errors:**

Two different types of errors (type I and type II) can occur. Since the p-value is based on probability, there is always a chance of reaching the wrong conclusion when rejecting or failing to reject the null hypothesis. For a fixed sample size, the chances of making these errors trade off against each other: if the type I error rate increases, the type II error rate decreases, and vice versa.

| | Type I error | Type II error |
|---|---|---|
| Definition | Rejection of a true null hypothesis | Non-rejection of a false null hypothesis |
| Meaning | Taking action when unnecessary | Failing to take an appropriate action |
| Can only occur | When H0 is true | When H0 is false |
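
The trade-off between the two error rates can be sketched for a one-sided z-test. The type I error rate is the significance level α, and the type II error rate (often called β) is computed here for one specific alternative. The effect size, standard deviation and sample size below are assumptions chosen only for illustration.

```python
from statistics import NormalDist

def type_ii_error_rate(alpha, effect, sigma, n):
    """Beta for a one-sided z-test of H0: mu = mu0 when the true mean is mu0 + effect."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1.0 - alpha)   # reject H0 when z exceeds this value
    shift = effect / (sigma / n ** 0.5)            # true effect in standard-error units
    # Probability that the statistic stays below the threshold although H0 is false
    return std_normal.cdf(critical_z - shift)

betas = {
    alpha: type_ii_error_rate(alpha, effect=0.5, sigma=2.0, n=50)
    for alpha in (0.10, 0.05, 0.01)
}

for alpha, beta in betas.items():
    print(f"alpha={alpha:.2f}  beta={beta:.3f}")
```

Lowering α makes β grow in this setup. Both can only be reduced together by changing something else, for example by collecting a larger sample.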

**Z-test and T-test**:

There are different statistical tests, depending on what we want to test.

| Z-test | T-test |
|---|---|
| Hypothesis test to determine whether two population means are different. | Hypothesis test to determine whether there is a significant difference between two population means. |
| Standard deviations or variances are known | Standard deviations are unknown |
| Large sample size | Small sample size |
| Based on a normal distribution | Based on the t-distribution (heavier tails, less mass in the center) |
| A z-statistic, or z-score, is a number representing the result from the z-test. | A t-statistic, or t-score, is a number representing the result from the t-test. |

**Sampling Distribution**

Sometimes there is too much data to use all of it. In that case, we use sampling to extract a group of data from the total.

**Sampling distribution:** shows how a statistic varies from sample to sample.

**Randomization:** ensures that, on average, a sample mimics the population, in order to avoid bias.

**Sample size:** larger populations do not require larger samples.

**Stratified random sample:** divides the sampling frame into subsets (strata) before the sample is selected.

**Sample size condition for a normal sampling distribution:** the required sample size is a function of the kurtosis, K4.

**Control limits:** boundaries that determine whether a process should be stopped or allowed to continue in a control chart. A control chart is a graph of a statistic as a function of time.

- UCL – upper control limit
- LCL – lower control limit

These limits let you balance type I and type II errors. You cannot reduce both errors by moving the limits: widening them lowers the type I error rate but raises the type II error rate. For instance, for a normal distribution, the limits are commonly set at the mean plus and minus 3 standard deviations.
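
A minimal sketch of 3-sigma control limits follows. The process mean, standard deviation and measurements are made up, and the function name `control_limits` is hypothetical. The last step computes the chance that a point from an in-control normal process falls outside the limits, which is the type I error rate per point.

```python
from statistics import NormalDist

def control_limits(mean, std, k=3.0):
    """Return (LCL, UCL) placed k standard deviations around the mean."""
    return mean - k * std, mean + k * std

lcl, ucl = control_limits(mean=100.0, std=2.0)       # hypothetical process
measurements = [99.1, 101.7, 100.4, 106.5, 98.2]     # hypothetical readings

# Points outside the limits signal that the process may be out of control
out_of_control = [x for x in measurements if not lcl <= x <= ucl]

# False alarm probability per point for a normal, in-control process
false_alarm_rate = 2.0 * (1.0 - NormalDist().cdf(3.0))

print(lcl, ucl, out_of_control)      # 94.0 106.0 [106.5]
print(round(false_alarm_rate, 4))    # 0.0027
```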

**S-chart:** control chart that tracks the sample standard deviation

**R-chart:** control chart that tracks the range of the sample observations

**X-bar chart:** control chart that tracks the mean of a process

**Central Limit Theorem:** if the sample size is large enough, the sampling distribution of x̄ is approximately normal regardless of the distribution of the population, where x̄ is the sample mean.
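
The theorem can be observed in a simulation. The sketch below draws repeated samples from an exponential population, which is strongly skewed, and looks at how the sample means behave. The rate, sample size, number of repetitions and seed are arbitrary choices for this example.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)          # fixed seed so the run is repeatable
n, repetitions = 50, 2000

# Exponential population with rate 1: skewed, with mean 1 and standard deviation 1
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]

centre = mean(sample_means)      # expected to be close to the population mean, 1
spread = stdev(sample_means)     # expected to be close to 1 / sqrt(50), about 0.141

# Share of sample means within one standard deviation of the centre
within_one = sum(abs(m - centre) <= spread for m in sample_means) / repetitions

print(round(centre, 3), round(spread, 3), round(within_one, 3))
```

If the sample means are approximately normal, the share within one standard deviation should land near 68%, in line with the empirical rule, even though the underlying population is far from bell-shaped.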

**Recommended books**

Manly, B. F. J., & Navarro, A. J. A. (2017). *Multivariate statistical methods: A primer*. Florida: CRC Press.

Stine, R. A., & Foster, D. P. (2018). *Statistics for business: Decision making and Analysis*.
