Rootstrap Blog

Category: Data Science

Total 8 Posts

Data Samples and error visualization techniques

Why we should choose representative samples with error in mind when we build data visualizations. A brief overview of uncertain bar charts and uncertain ranked lists.

The type of data samples that populate our visualizations can add uncertainty to our results. Some common data displays like bar and pie charts work better than others for making that uncertainty understandable. This article explores how to understand our data samples and create the most suitable graphs for visualizing what they represent.
In general, the goals of data science are to understand data and generate predictive models that help us make better decisions. For a more thorough overview of data visualization, see “Data visualization and The Truthful Art.”
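To make the idea of sample uncertainty concrete, here is a minimal sketch of the kind of number an error bar on a bar chart might encode. It assumes a roughly normal sampling distribution and uses the common 1.96 multiplier for an approximate 95% interval. The function name and the sample values are hypothetical, not taken from the article.

```python
import math
import statistics


def mean_with_error(sample, z=1.96):
    """Return (mean, margin), where margin is an approximate 95% half-width.

    Assumption: the sample mean is roughly normally distributed, so
    mean +/- z * standard_error is a reasonable error bar.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations to estimate error")
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * standard_error


# Hypothetical measurements, for illustration only.
small_sample = [4.8, 5.1, 5.3, 4.9]
# Repeating the same values stands in for a larger sample with a similar spread.
large_sample = small_sample * 10

small_mean, small_margin = mean_with_error(small_sample)
large_mean, large_margin = mean_with_error(large_sample)

print(f"n={len(small_sample)}: {small_mean:.3f} +/- {small_margin:.3f}")
print(f"n={len(large_sample)}: {large_mean:.3f} +/- {large_margin:.3f}")
```

The larger sample yields a narrower margin around the same mean, which is why the size and representativeness of a sample matter before any chart is drawn.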

Continue Reading

Data Revolution Inside Organizations

How to be prepared for the change that will transform the business landscape forever.

Worldwide access to vast amounts of data has changed the business landscape. Competitive marketing depends on knowing how to manage, process, and analyze that data. This article describes the path organizations need to take from collecting data to maximizing its use.
Today’s organizations are undergoing a challenging transformation of their technical systems. The static software platforms that once stored and processed a business’s data are no longer sustainable in the current web environment. Enterprises need cutting-edge technology to collect big data in real time, analyze that data, and then get the information they need to stay competitive in today’s marketplace.

Continue Reading

Correlation is not causation

Why the confusion of these concepts has profound implications, from healthcare to business management

In correlated data, a pair of variables are related in that one thing is likely to change when the other does. This relationship might lead us to assume that a change to one thing causes the change in the other. This article clarifies that kind of faulty thinking by explaining correlation, causation, and the bias that often lumps the two together.
The human brain simplifies incoming information, so we can make sense of it. Our brains often do that by making assumptions about things based on slight relationships, or bias. But that thinking process isn’t foolproof. An example is when we mistake correlation for causation. Bias can make us conclude that one thing must cause another if both change in the same way at the same time. This article clears up the misconception that correlation equals causation by exploring both of those subjects and the human brain’s tendency toward bias.
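One way to see how correlation can appear without causation is a small simulation with a hidden common cause. The sketch below is an illustration under stated assumptions: the variable names, noise levels, and the seed are all hypothetical, and neither `x` nor `y` influences the other.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# A hidden factor drives both observed variables.
confounder = [rng.gauss(0, 1) for _ in range(n)]
x = [z + rng.gauss(0, 0.5) for z in confounder]
y = [z + rng.gauss(0, 0.5) for z in confounder]

# x and y look strongly related even though neither causes the other.
r_raw = pearson(x, y)

# Once the hidden factor is subtracted out, the relationship vanishes.
x_adjusted = [a - z for a, z in zip(x, confounder)]
y_adjusted = [b - z for b, z in zip(y, confounder)]
r_adjusted = pearson(x_adjusted, y_adjusted)

print(f"correlation of x and y:            {r_raw:.2f}")
print(f"correlation after removing factor: {r_adjusted:.2f}")
```

In real data the hidden factor is rarely observed directly, which is part of why the bias described above is so easy to fall into.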

Continue Reading

Data Demystified — Machine Learning

A bird’s-eye view of the machine learning landscape.

The main goal of this article is to cover the most important concepts of machine learning and lay out the landscape. The reader will come away able to recognize what kind of solution matches a specific kind of problem, and should be able to find more specific knowledge after diving into a real-life project.

I’ll start with a definition that is 60 years old but still valid today:

The name is pretty self-explanatory, and the definition reinforces the same concept.

Continue Reading

Improve data quality by using the pandas library and Python

Data quality is a broad concept with multiple dimensions. I detail that information in another introductory article. This tutorial explores a real-life example. We identify what we want to improve, create the code to achieve our goals, and wrap up with some comments about things that can happen in real-life situations. To follow along, you need a basic understanding of Python.

Python Data Analysis Library (pandas) is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

You can install pandas by running this command in a terminal: python3 -m pip install --upgrade pandas
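After installing, it can help to confirm that Python can actually find the package. The following is a minimal sketch that uses only the standard library. The helper name `is_installed` is hypothetical, and the check only reports whether the package is discoverable, not whether it is up to date.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found by the interpreter."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("pandas"):
    print("pandas is available")
else:
    print("pandas was not found; try the install command again")
```

If the check fails right after a successful install, a common cause is that pip installed the package for a different Python interpreter than the one running the script.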

Continue Reading

Data Visualization and The Truthful Art

An amazing book about data visualization that I can’t recommend enough is The Truthful Art by Alberto Cairo.

In The Truthful Art, Cairo explains the principles of good data visualization. He describes five qualities that should be your foundation when you work with data visualization: truthful, functional, beautiful, insightful, and enlightening. Cairo also gives some great examples of biased and dishonest visualization.

Before I dive into the “Five Qualities of Great Visualizations,” there’s another related concept that I want to cover: data-ink ratio, introduced by Edward Tufte in The Visual Display of Quantitative Information.

Continue Reading

Data Demystified — Data Quality

Explaining conceptually what it really means, and why it matters.

This article outlines a mental framework for organizing our work around data quality. In terms of the well-known DIKW Pyramid, data quality is the enabler that allows us to take raw data and use it to generate information.

In this piece, we’ll go over a few common scenarios, review some theory, and finally outline some advice for anyone facing this increasingly common issue.

The amount of data being generated every second is almost impossible to comprehend. Current estimates say that 294 billion emails and 65 billion WhatsApp messages are sent every single day, and all of it leaves a data trail. The World Economic Forum estimates that the digital universe will reach 44 zettabytes by 2020. To give you an idea of what that means, take a look at the byte prefixes and remember that each one is 1,000 times the previous one: kilo, mega, giga, tera, peta, exa, zetta.
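The prefix arithmetic can be sketched in a few lines. This assumes the decimal (SI) meaning of the prefixes, where each step is a factor of 1,000. The function name is hypothetical.

```python
# Decimal (SI) prefixes in ascending order, as listed in the text.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta"]


def bytes_per_unit(prefix):
    """Return how many bytes one unit of the given prefix represents."""
    power = PREFIXES.index(prefix) + 1  # kilo is 1000**1, mega is 1000**2, ...
    return 1000 ** power


for prefix in PREFIXES:
    print(f"1 {prefix}byte = {bytes_per_unit(prefix):,} bytes")

# The 44 zettabyte estimate written out in plain bytes.
digital_universe_bytes = 44 * bytes_per_unit("zetta")
print(f"44 zettabytes = {digital_universe_bytes:,} bytes")
```

Zetta sits seven steps up the list, so one zettabyte is 1000 to the seventh power, or 10 to the 21st power, bytes.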

Continue Reading

Data Demystified — DIKW model

Understanding the big picture first will set the stage for success in this journey.

Data is one of the biggest new trends in both tech and business in general. Data “experts” are quickly becoming some of the best-paid individuals in the industry, and every single company wants to surf the wave of data capabilities.

It is becoming a fundamental way of understanding the world around us. We can think of data science as an epistemology, a way of knowing. We can also think of it as a way to approach problems and solve them.

But as with any new trend, we have to ask ourselves: what do all these buzzwords actually mean?

What is a data scientist? In short, a person who is better at statistics than any software engineer and better at software engineering than any statistician.

Continue Reading