Those of us who work with data are all too familiar with how difficult it can be to fully understand what we are presented with. However simple or complex it is, data can be hard to digest and make sense of.
Data visualization is an effective technique that helps us understand all types of data and their distributions, and gives us insight into our hypotheses.
When used correctly, data visualization can help to clearly display and explain results from data analysis, allowing us to make comparisons between certain models.
There are multiple visualization techniques that can be used to effectively extract information from data. Let's take a look at some of them.
A histogram is a representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable.
This technique gives an idea of the shape of the distribution and can reveal outliers in the data in question.
To build one, the range of the data is first split into intervals, known as bins. You then count the number of observations that fall into each bin.
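The binning and counting steps can be sketched in a few lines of plain Python. This is only an illustration: the function name `histogram_counts`, the choice of equal-width bins, and the sample `ages` are assumptions made up for the example, and plotting libraries typically handle this step for you.

```python
def histogram_counts(values, num_bins):
    """Split the range of `values` into equal-width bins and count each bin.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value sits on the right edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical sample data, invented for illustration.
ages = [21, 22, 22, 25, 29, 31, 33, 34, 38, 40, 41, 60]
edges, counts = histogram_counts(ages, num_bins=4)

# Crude text histogram: one '#' per observation in the bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

In this made-up sample, the single value far to the right ends up alone in the last bin, which is how a histogram hints at a possible outlier.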
Using dots, Scatter Plots show possible relationships between two variables, where each axis corresponds to one variable.
This graph helps reveal the correlation between the variables while also exposing outliers. Because it shows two variables, it represents bivariate data.
If we want to represent more than two variables in a two-dimensional graph, we can use the color and the size of the dots to encode the additional variables.
A Scatterplot Matrix is a square, symmetric grid of scatterplots. It's made up of rows and columns, each corresponding to a specific variable. Each cell displays a scatterplot of the variable for its column (X) against the variable for its row (Y).
A correlation matrix is a table that shows the relationship between variables. Each cell (i,j) corresponds to the value of the correlation between the variables i and j.
The correlation is a value between -1 and 1 that represents the strength and direction of the relationship between the variables. The example graph provided is a heatmap, which is a graphical representation of the correlation matrix in which the correlation values are represented by colors.
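As a rough sketch of how such a matrix is built, the snippet below computes the Pearson correlation for every pair of columns. The Pearson coefficient is assumed here as the correlation measure, and the `data` columns are hypothetical values invented for the example. A heatmap would then simply map each number in `matrix` to a color.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical columns, invented for illustration.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 88],
    "errors": [9, 7, 6, 3, 1],
}
names = list(data)

# Cell (i, j) holds the correlation between variable i and variable j.
matrix = [[pearson(data[i], data[j]) for j in names] for i in names]

for name, row in zip(names, matrix):
    print(f"{name:>7}", " ".join(f"{value:6.2f}" for value in row))
```

The diagonal is 1 because every variable is perfectly correlated with itself, and the matrix is symmetric because the correlation between i and j equals the correlation between j and i.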
A Biplot displays information on both the samples and the variables of a dataset in a single graph. This kind of plot is generally used to visualize how the principal components relate to each other and to the original variables.
The samples are represented by points and the variables by vectors. The length of each vector represents the variance of the corresponding variable.
The angle between the arrows represents the correlation, where a smaller angle corresponds to a higher correlation. Points that are close to each other are viewed as similar observations.
The above example shows the relationship between some products (fish, vegetables, milk, etc.) for each country (Portugal, Spain, etc.). This shows how a Biplot is useful when trying to understand the relationships among variables, among observations, and between variables and observations.
A Boxplot is a graphical representation of the distribution of the data based on its quartiles. The quartiles divide the data into four equal parts.
The first quartile Q1 is the middle value between the median and the smallest value in the dataset. This corresponds to the 25th percentile.
The second quartile Q2 is the median for the dataset. This corresponds to the 50th percentile.
The third quartile Q3 is the middle value between the median and the highest value in the dataset. This corresponds to the 75th percentile.
These values are represented inside a box, and the distance between the different parts of the box represents the level of dispersion of the data. The boxplot may represent outliers by points. The IQR value corresponds to the difference between Q3 and Q1.
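The quantities behind a boxplot can be computed with Python's standard `statistics` module, as in the sketch below. The sample `values` are invented for the example. Two choices here are assumptions rather than anything fixed by the text: `method="inclusive"` is only one of several conventions for computing quartiles, and the 1.5 × IQR fences are a common convention for deciding which points to draw as outliers.

```python
from statistics import quantiles

# Hypothetical sample, invented for illustration.
values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

# n=4 returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print("outliers:", outliers)
```

With this sample, only the value 30 falls outside the fences, so it is the one that would be drawn as a separate point.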
For the normal distribution, the data is symmetrically distributed around the mean, and as a result, the marks in the boxplot will be placed symmetrically around the median. We can see this in the following graph.
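The same property can be checked numerically. The sketch below uses `statistics.NormalDist` to compute the theoretical quartiles of a normal distribution. The standard normal parameters (mean 0, standard deviation 1) are an arbitrary choice for the example, and any other values show the same symmetry.

```python
from statistics import NormalDist

# Arbitrary choice for illustration: the standard normal distribution.
dist = NormalDist(mu=0.0, sigma=1.0)

# Theoretical quartiles, obtained from the inverse cumulative distribution.
q1 = dist.inv_cdf(0.25)
median = dist.inv_cdf(0.50)
q3 = dist.inv_cdf(0.75)

# For a symmetric distribution, both halves of the box have the same height.
lower_half = median - q1
upper_half = q3 - median

print(f"Q1={q1:.4f}, median={median:.4f}, Q3={q3:.4f}")
print(f"lower half={lower_half:.4f}, upper half={upper_half:.4f}")
```

For a skewed distribution the two halves would differ, which is what makes a boxplot a quick visual check for asymmetry.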
Other types of plots, such as bar plots, pie plots, and map plots, can help to visualize certain information. You can plot the original data to check or validate assumptions about certain characteristics. You can also plot your results by comparing the predicted values on the test set with the real values.
In addition, you can plot the clustering results to see how the data is grouped, highlighting the variables that are affecting the determination of the clusters.
There are also more plots available, such as 3-dimensional scatter plots and other representations that might be useful for more complicated scenarios.
As highlighted throughout, data visualization is an important and effective technique for helping us understand data.
First off, it helps you to understand and make sense of data, find outliers, and gain insights into the distribution of the data. Secondly, it can provide information and examples of how certain variables are related to one another.
Finally, it can help you to explain the output of your model, as well as compare the performance of different models against one another.
So, when used correctly, data visualization is a technique that can help not only data scientists working with complex data, but also anyone dealing with the simpler types of data that we are presented with on a day-to-day basis.