AI / Machine Learning
-
April 27, 2020

Understanding Basic Statistics for Machine Learning Models - Part 3

This article explains statistical concepts such as the statistical hypothesis test, which is used to answer questions about sample data and to validate assumptions. It also provides a list of concepts related to sampling distributions. Finally, we discuss the relationship between variance and bias.

Statistical hypothesis testing

A statistical hypothesis test evaluates an assumption about a quantity and tells us how confident we can be in a result calculated under that assumption. Commonly, the assumption to be tested is a comparison between two samples, or between a sample and a population parameter. The result of the test allows us to interpret whether the assumption holds or has been violated. The assumption of a statistical test is called the null hypothesis, or H0.

p-value: also called the level of marginal significance, it is the probability of obtaining a result at least as extreme as the one observed, under the assumption that the null hypothesis is correct. It is used to quantify the result of the test and either reject or fail to reject the null hypothesis. This is done by comparing the p-value to the desired significance level (α). A result is statistically significant when the p-value is less than or equal to the significance level.

* If p-value > α : Fail to reject the null hypothesis
* If p-value <= α : Reject the null hypothesis

The p-value is the smallest significance level at which H0 can be rejected.
The significance level (α) is generally set to 0.05. A smaller value, such as 0.01, demands stronger evidence before H0 is rejected.
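The decision rule above can be sketched in a few lines. This is a minimal illustration, assuming a two-sided z-test of a single mean with a known population standard deviation; the function names and the numbers are hypothetical and chosen only for the example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability, under H0, of a statistic at least as extreme as |z|.
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p-value <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# Hypothetical numbers, for illustration only (here z = 2).
p = z_test_p_value(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
print(round(p, 4), decide(p))          # 0.0455 reject H0
print(decide(p, alpha=0.01))           # fail to reject H0
```

The same p-value leads to different conclusions at α = 0.05 and α = 0.01, which is why the significance level should be chosen before looking at the data.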

Type I and type II errors

Two types of error (type I and type II) can occur. Because the p-value is a probability, there is always a chance of reaching the wrong conclusion when rejecting or failing to reject the null hypothesis. For a fixed sample size, the two error rates trade off against each other: if the type I error rate increases, the type II error rate decreases, and vice versa.

| | Type I error | Type II error |
|---|---|---|
| Definition | Rejection of a true null hypothesis | Non-rejection of a false null hypothesis |
| Meaning | Taking action when unnecessary | Failing to take an appropriate action |
| Can only occur | When H0 is true | When H0 is false |
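To see how the two error rates move against each other, the type II error rate (β) can be computed for several values of α. The sketch below assumes a one-sided z-test with known standard deviation and a specific alternative mean; the function name, the effect size, and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error_rate(alpha, effect, sigma, n):
    """beta for a one-sided z-test of H0: mu = mu0 against mu = mu0 + effect."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)  # reject H0 when z > critical_z
    shift = effect * sqrt(n) / sigma            # mean of z when the alternative is true
    # Probability of NOT rejecting H0 although the alternative is true.
    return std_normal.cdf(critical_z - shift)


# Lowering alpha (fewer type I errors) raises beta (more type II errors).
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error_rate(alpha, effect=2.0, sigma=10.0, n=100)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}")
```

In this setup the only way to lower both rates at once is to collect more data: increasing `n` reduces β while α stays fixed.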

Z-test and T-test

There are different statistical tests, depending on what we want to test.

| | Z-test | T-test |
|---|---|---|
| Purpose | Hypothesis test to determine whether two population means are different | Hypothesis test to determine whether there is a significant difference between two population means |
| Standard deviation | Standard deviations or variances are known | Standard deviations are unknown |
| Sample size | Large | Small |
| Distribution | Based on the normal distribution | Based on the t-distribution (heavier tails, less probability mass in the center) |
| Result | A z-statistic, or z-score, is a number representing the result of the z-test | A t-statistic, or t-score, is a number representing the result of the t-test |
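The difference between the two statistics is mostly in where the standard deviation comes from. The sketch below computes a two-sample z-statistic using known population standard deviations, and a t-statistic using standard deviations estimated from the samples. Note the assumptions: the t-test shown is Welch's variant (unequal variances), which is only one of several forms of the t-test, and the data and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(a, b, sigma_a, sigma_b):
    """z-statistic and two-sided p-value when population std devs are known."""
    standard_error = sqrt(sigma_a**2 / len(a) + sigma_b**2 / len(b))
    z = (mean(a) - mean(b)) / standard_error
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def welch_t(a, b):
    """t-statistic and degrees of freedom when std devs are estimated from data."""
    va = variance(a) / len(a)  # sample variance (n - 1 denominator) over n
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


# Hypothetical measurements, for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5]
group_b = [4.4, 4.8, 4.6, 5.0, 4.5, 4.7]
print(two_sample_z(group_a, group_b, sigma_a=0.3, sigma_b=0.3))
print(welch_t(group_a, group_b))
```

Turning the t-statistic into a p-value requires the t-distribution with `df` degrees of freedom, which the Python standard library does not provide; a statistics package would normally be used for that step.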

Sampling distribution

Sometimes there is too much data to use all of it. In that case, we use sampling to extract a subset of the data from the total.

Sampling distribution: The sampling distribution shows how a statistic varies from sample to sample.

Randomization: ensures that on average a sample mimics the population in order to avoid bias.

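Sample size: do not be misled by the size of the population. Larger populations do not require larger samples; the precision of an estimate depends mainly on the size of the sample itself.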

Stratified random sample: divides the sampling frame into subsets before the sample is selected.
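As an illustration of stratified random sampling, the sketch below splits a sampling frame into strata and then draws a simple random sample of the same fraction from each one. The function name, the proportional allocation, and the customer data are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from each stratum of the sampling frame."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)  # split the frame first
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))       # simple random sample within
    return sample


# Hypothetical frame: 80 "retail" and 20 "enterprise" customers.
frame = [("retail", i) for i in range(80)] + [("enterprise", i) for i in range(20)]
sample = stratified_sample(frame, stratum_of=lambda r: r[0], fraction=0.1)
print(len(sample))
```

Because each stratum is sampled separately, the small "enterprise" group is guaranteed to appear in the sample, which a simple random sample of the whole frame would not ensure.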

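Sample size condition for normality: the sample size needed for the sampling distribution of the mean to be approximately normal depends on the kurtosis (K4) of the data. A rule of thumb given by Stine and Foster is n > 10|K4|.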
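One way to check a sample size condition based on kurtosis is sketched below. It assumes the rule of thumb n > 10|K4| from Stine and Foster, with K4 taken as the excess kurtosis (the average fourth power of the z-scores, minus 3). The function names and the choice of the population standard deviation for the z-scores are assumptions of this sketch.

```python
from statistics import mean, pstdev


def excess_kurtosis(data):
    """K4: average fourth power of the z-scores, minus 3 (0 for a normal model)."""
    m, s = mean(data), pstdev(data)  # assumes the data are not all identical
    return mean([((x - m) / s) ** 4 for x in data]) - 3


def sample_size_ok(data):
    """Assumed rule of thumb: the sample size n must exceed 10 * |K4|."""
    return len(data) > 10 * abs(excess_kurtosis(data))


# Hypothetical samples: one evenly spread, one dominated by an outlier.
evenly_spread = list(range(1, 21))
with_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]
print(sample_size_ok(evenly_spread), sample_size_ok(with_outlier))
```

A single extreme value inflates the kurtosis, so a sample like the second one would need to be much larger before a normal model for the mean becomes reasonable.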

Control limits: boundaries on a control chart that determine whether a process should be stopped or allowed to continue. A control chart is a graph of a process statistic over time.

  • UCL - upper control limit
  • LCL - lower control limit

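These limits set the balance between type I errors (stopping a process that is working correctly) and type II errors (letting a faulty process continue). Moving the limits cannot reduce both: wider limits lower the type I error rate but raise the type II error rate, and narrower limits do the opposite. For instance, for a normally distributed statistic the limits are commonly set at the mean ± 3 standard deviations.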
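The 3-standard-deviation limits can be sketched for a chart of subgroup means. This example assumes that the process mean and standard deviation are already known; in practice they are usually estimated from historical subgroups. The function names and measurements are hypothetical.

```python
from math import sqrt
from statistics import mean


def xbar_limits(process_mean, process_sigma, subgroup_size, width=3.0):
    """LCL and UCL for subgroup means: mean -/+ width * sigma / sqrt(n)."""
    margin = width * process_sigma / sqrt(subgroup_size)
    return process_mean - margin, process_mean + margin


def out_of_control(subgroups, lcl, ucl):
    """Indices of subgroups whose mean falls outside the control limits."""
    return [i for i, g in enumerate(subgroups) if not lcl <= mean(g) <= ucl]


# Hypothetical process: target 10.0, sigma 0.2, subgroups of 4 measurements.
lcl, ucl = xbar_limits(process_mean=10.0, process_sigma=0.2, subgroup_size=4)
subgroups = [
    [10.1, 9.9, 10.0, 10.2],
    [10.4, 10.5, 10.3, 10.6],
    [9.8, 10.0, 10.1, 9.9],
]
print(lcl, ucl, out_of_control(subgroups, lcl, ucl))
```

The `width` parameter is where the trade-off lives: a larger value produces fewer false alarms but makes real shifts in the process harder to detect.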

s-chart: control chart that tracks the sample standard deviations

R-chart: control chart that tracks the sample ranges

X-bar chart: control chart that tracks the sample means of a process

Central Limit Theorem: if the sample size is large enough, the sampling distribution of the sample mean x̄ is approximately normal, regardless of the shape of the population distribution (provided the population has a finite variance).
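A small simulation makes the theorem easier to believe. The sketch below draws many samples from a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen as an assumption for the example) and looks at how the sample means behave. The function name and the sample sizes are hypothetical.

```python
import random
from statistics import mean, stdev


def sample_means(draw, sample_size, n_samples, seed=0):
    """Means of n_samples independent samples, each of sample_size draws."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [
        mean([draw(rng) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


# Skewed population: exponential with mean 1 and standard deviation 1.
means = sample_means(lambda rng: rng.expovariate(1.0), sample_size=50, n_samples=2000)
print(mean(means))   # should be close to the population mean, 1
print(stdev(means))  # should be close to 1 / sqrt(50), about 0.14
```

Although individual observations are far from normal, the sample means cluster symmetrically around the population mean, with a spread close to the population standard deviation divided by the square root of the sample size.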

Recommended books

Manly, B. F. J., & Navarro, A. J. A. (2017). Multivariate statistical methods: A primer. Florida: CRC Press.

Stine, R. A., & Foster, D. P. (2018). Statistics for business: Decision making and analysis.