# Understanding Basic Statistics for Machine Learning Models – Part 3

This article explains statistical hypothesis testing, which is used to answer questions about sample data and to validate assumptions. It also provides a list of concepts related to sampling distributions. Finally, we discuss the relationship between variance and bias.

# Statistical hypothesis testing

A statistical hypothesis test quantifies how confident we can be in a conclusion about a quantity, under a stated assumption. Commonly, the assumption being tested compares two samples, or a sample statistic against a population parameter. The result of the test lets us decide whether the data are consistent with the assumption or contradict it. The assumption of a statistical test is called the null hypothesis, or H0.

p-value: the observed level of significance. It is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true. It is used to quantify the result of the test and to either reject or fail to reject the null hypothesis, by comparing the p-value to the chosen significance level. A result is statistically significant when the p-value is less than the significance level.

The p-value is the smallest significance level at which H0 can be rejected.
The significance level is commonly set to 0.05. A smaller value demands stronger evidence before H0 is rejected, so a false rejection becomes less likely.
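To make the comparison between the p-value and the significance level concrete, here is a minimal sketch of a two-sided one-sample z-test using only the Python standard library. The data, the hypothesized mean `mu0`, and the known standard deviation `sigma` are made-up values for illustration, and `z_test_p_value` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    n = len(sample)
    # Distance between sample mean and mu0, in standard-error units
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Probability of a |Z| at least this large if H0 is true
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.5, 5.4, 5.9]  # made-up data
alpha = 0.05  # significance level

z, p = z_test_p_value(sample, mu0=5.0, sigma=0.5)
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"z = {z:.3f}, p-value = {p:.4f}: {decision}")
```

With these numbers the p-value falls below 0.05, so H0 is rejected. The same data would not be significant at a much stricter level such as 0.001.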

## Type I and type II errors

A hypothesis test can go wrong in two ways. A type I error rejects a null hypothesis that is actually true (a false positive); a type II error fails to reject a null hypothesis that is actually false (a false negative). Since the p-value is based on probability, there is always a chance that the decision to reject or not reject the null hypothesis is wrong. For a fixed sample size, the two error rates trade off against each other: if the type I error rate increases, the type II error rate decreases, and vice versa.
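The trade-off can be illustrated numerically. The sketch below assumes a one-sided z-test with known standard deviation and a made-up true shift of the mean; `type_ii_error` is a hypothetical helper that returns the probability of failing to reject H0 when that shift is real.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, effect, sigma, n):
    """Type II error rate of a one-sided z-test (H1: mean > mu0) for a true shift `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold for the z statistic
    shift = effect / (sigma / sqrt(n))       # true shift in standard-error units
    return std_normal.cdf(z_crit - shift)    # P(fail to reject H0 | H1 is true)


alphas = (0.10, 0.05, 0.01)
betas = [type_ii_error(a, effect=0.5, sigma=1.0, n=20) for a in alphas]

for a, b in zip(alphas, betas):
    print(f"type I rate (alpha) = {a:.2f} -> type II rate (beta) = {b:.3f}")
```

As the type I error rate shrinks, the type II error rate grows. The only way to lower both in this setting is to collect more data, that is, to increase `n`.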

## Z-test and T-test

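Different statistical tests exist, depending on what we want to test. A z-test compares means when the population standard deviation is known or the sample is large; a t-test is used when the standard deviation has to be estimated from the sample, which matters most for small samples.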
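As a rough illustration of why the choice of test matters for small samples, the sketch below computes a one-sample test statistic with the standard deviation estimated from the data, then compares it against both the normal and the t critical value. The sample is made up, `one_sample_t` is a hypothetical helper, and the t critical value for 9 degrees of freedom is hardcoded from a standard t-table because the standard library has no t distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sample_t(sample, mu0):
    """Test statistic for H0: population mean == mu0, sigma estimated from the sample."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))


sample = [12.1, 11.8, 12.4, 12.0, 12.3, 11.9, 12.2, 12.5, 12.1, 12.2]  # made-up data
t_stat = one_sample_t(sample, mu0=12.0)

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value, normal model
T_CRIT_DF9 = 2.262                    # two-sided 5% critical value, t with 9 df (t-table)

reject_with_z = abs(t_stat) > Z_CRIT      # treats the estimated sigma as if it were known
reject_with_t = abs(t_stat) > T_CRIT_DF9  # accounts for the uncertainty in sigma

print(f"statistic = {t_stat:.3f}")
print(f"reject using normal critical value: {reject_with_z}")
print(f"reject using t critical value:      {reject_with_t}")
```

With only ten observations the t critical value is noticeably larger than the normal one, so the t-test is the more conservative of the two.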

# Sampling distribution

Sometimes a data set is too large to use in full. In that case, we use sampling to draw a subset of observations from the total.

Sampling distribution: shows how a statistic varies from sample to sample.
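A small simulation can make this idea tangible. The sketch below builds a made-up, skewed population, draws many random samples of a given size, and records the mean of each one; the collection of those means approximates the sampling distribution of the mean. `sample_means` is a hypothetical helper, and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the simulation is repeatable

# Made-up skewed population (exponential with mean 1)
population = [random.expovariate(1.0) for _ in range(100_000)]


def sample_means(pop, n, repeats=2000):
    """Means of `repeats` random samples of size n drawn from pop."""
    return [mean(random.sample(pop, n)) for _ in range(repeats)]


results = {}
for n in (5, 50):
    means = sample_means(population, n)
    results[n] = (mean(means), stdev(means))
    print(f"n = {n:2d}: mean of sample means = {results[n][0]:.3f}, "
          f"spread of sample means = {results[n][1]:.3f}")
```

The sample means centre on the population mean, and their spread shrinks as the sample size grows.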

Randomization: ensures that on average a sample mimics the population in order to avoid bias.

Sample size: the precision of an estimate depends on the size of the sample, not on the size of the population, so larger populations do not require larger samples.

Stratified random sample: divides the sampling frame into subsets (strata) before the sample is selected, then draws a random sample from each subset.
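A minimal sketch of stratified sampling follows. It assumes the same sampling fraction is applied to every stratum (proportional allocation), which is only one of several possible designs. The records, the region labels, and the `stratified_sample` helper are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction at random from every stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)  # split the sampling frame into strata
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one record per stratum
        sample.extend(rng.sample(group, k))       # simple random sample within the stratum
    return sample


# Made-up sampling frame: (region, record id)
records = (
    [("north", i) for i in range(600)]
    + [("south", i) for i in range(300)]
    + [("west", i) for i in range(100)]
)

sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
counts = Counter(region for region, _ in sample)
print(counts)
```

Every region is represented in the sample in the same proportion as in the sampling frame, which a simple random sample guarantees only on average.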

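Sample size condition: the sample size needed for the sampling distribution of the mean to be approximately normal depends on the kurtosis, K4, of the data. One rule of thumb is n > 10|K4|.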
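As an illustration, the check below assumes the rule of thumb n > 10|K4|, where K4 is taken to be the excess kurtosis (the mean fourth power of the standardized values, minus 3). Both the rule and this particular kurtosis estimator are assumptions of the sketch, and the helper names and data are made up.

```python
from statistics import fmean, pstdev


def excess_kurtosis(data):
    """Mean fourth power of the standardized values, minus 3 (0 for a normal model)."""
    m = fmean(data)
    s = pstdev(data)
    return fmean(((x - m) / s) ** 4 for x in data) - 3


def meets_sample_size_condition(data):
    """Assumed rule of thumb: n must exceed 10 times |K4|."""
    return len(data) > 10 * abs(excess_kurtosis(data))


evenly_spread = list(range(1, 21))  # 20 values, no outliers
with_outlier = [1] * 19 + [100]     # 20 values, one extreme outlier

for name, data in (("evenly spread", evenly_spread), ("with outlier", with_outlier)):
    k4 = excess_kurtosis(data)
    print(f"{name}: n = {len(data)}, K4 = {k4:.2f}, "
          f"condition met: {meets_sample_size_condition(data)}")
```

The outlier inflates the kurtosis so much that 20 observations are far too few under this rule, while the evenly spread data pass.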

Control limits: boundaries on a control chart that determine whether a process should be stopped or allowed to continue. A control chart plots a statistic of the process as a function of time.

• UCL – upper control limit
• LCL – lower control limit

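Choosing these limits is a balance between type I errors (stopping a process that is working properly) and type II errors (letting a faulty process continue). Moving the limits cannot reduce both errors: wider limits give fewer false alarms but miss more real problems, and narrower limits do the opposite. For instance, for a normally distributed statistic, the limits are usually set at the mean plus and minus 3 standard deviations.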
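The effect of moving the limits can be sketched for a normally distributed statistic. In the code below, `k` is the number of standard deviations between the centre line and each limit, and `shift` is a made-up change in the process mean, also in standard deviations. Both helper functions are hypothetical names.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def false_alarm_rate(k):
    """Type I: P(point outside mean +/- k sigma) while the process is in control."""
    return 2 * (1 - STD_NORMAL.cdf(k))


def miss_rate(k, shift):
    """Type II: P(point stays inside the limits) after the mean moves by `shift` sigma."""
    return STD_NORMAL.cdf(k - shift) - STD_NORMAL.cdf(-k - shift)


for k in (2, 3):
    print(f"limits at +/-{k} sigma: false alarm rate = {false_alarm_rate(k):.4f}, "
          f"miss rate for a 2-sigma shift = {miss_rate(k, shift=2):.4f}")
```

Limits at 3 standard deviations give a false alarm on roughly 0.27% of in-control points, at the cost of missing most single points after a 2-sigma shift. Tightening the limits to 2 standard deviations reverses that balance.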

s-chart: control chart that tracks the sample standard deviation

R-chart: control chart that tracks the range of the observations in each sample

X-bar chart: control chart that tracks the sample mean of a process
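The sketch below computes the three statistics these charts track and derives X-bar limits from a set of baseline samples. It is a simplification: the process standard deviation is estimated by pooling the within-sample variances, whereas control chart practice often uses tabulated constants instead. The measurements and helper names are made up.

```python
from math import sqrt
from statistics import mean, stdev, variance


def sample_stats(sample):
    """Statistics tracked by the X-bar, R and s charts for one sample."""
    return {"xbar": mean(sample), "range": max(sample) - min(sample), "s": stdev(sample)}


def xbar_limits(baseline, k=3):
    """LCL and UCL for the sample mean, from in-control baseline samples of equal size."""
    n = len(baseline[0])
    centre = mean(mean(s) for s in baseline)
    sigma = sqrt(mean(variance(s) for s in baseline))  # pooled within-sample estimate
    half_width = k * sigma / sqrt(n)
    return centre - half_width, centre + half_width


# Made-up baseline measurements, four observations per sample
baseline = [
    [10.1, 9.9, 10.0, 10.2],
    [10.0, 10.1, 9.8, 10.1],
    [9.9, 10.0, 10.1, 10.0],
    [10.2, 10.0, 9.9, 10.1],
]
lcl, ucl = xbar_limits(baseline)

new_sample = [10.8, 10.9, 10.7, 11.0]  # made-up sample taken later
stats = sample_stats(new_sample)
in_control = lcl <= stats["xbar"] <= ucl

print(f"LCL = {lcl:.3f}, UCL = {ucl:.3f}")
print(f"new sample: {stats}, in control: {in_control}")
```

The new sample mean falls above the upper control limit, which on an X-bar chart would signal that the process mean has shifted.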

## Recommended books

Manly, B. F. J., & Navarro, A. J. A. (2017). Multivariate statistical methods: A primer. Florida: CRC Press.

Stine, R. A., & Foster, D. P. (2018). Statistics for business: Decision making and Analysis.