Data quality is a broad concept with multiple dimensions. I detail that information in another introductory article. This tutorial explores a real-life example. We identify what we want to improve, create the code to achieve our goals, and wrap up with some comments about things that can happen in real-life situations. To follow along, you need a basic understanding of Python.
Python Data Analysis Library (pandas) is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
You can install pandas by entering this command in a terminal: python3 -m pip install --upgrade pandas.
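Once the installation finishes, a quick sanity check is to import pandas and build a small DataFrame (the column names here are just illustrative):

```python
import pandas as pd

# Confirm the import works and the version is available.
print(pd.__version__)

# Build a tiny DataFrame to verify the installation end to end.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
print(df.shape)  # (2, 2): two rows, two columns
```

If the import fails, the install command above did not run against the same Python interpreter you are using, which is why invoking pip via python3 -m pip is recommended.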
What data quality really means, conceptually, and why it matters.
This article outlines a mental framework for organizing our work around data quality. In terms of the well-known DIKW Pyramid, data quality is the enabler that lets us turn raw data into information.
In this piece, we’ll go over a few common scenarios, review some theory, and finally outline some advice for anyone facing this increasingly common issue.
The amount of data being generated every second is almost impossible to comprehend. Current estimates say that 294 billion emails and 65 billion WhatsApp messages are sent every single day, and all of it leaves a data trail. The World Economic Forum estimates that the digital universe will reach 44 zettabytes by 2020. To give you an idea of what that means, take a look at the byte prefixes and remember that each one multiplies by 1000: kilo, mega, giga, tera, peta, exa, zetta.
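To make that scale concrete, here is a quick back-of-the-envelope calculation of 44 zettabytes in bytes, using SI prefixes (each step is a factor of 1000):

```python
# SI byte prefixes, each a factor of 1000:
# kilo, mega, giga, tera, peta, exa, zetta
ZETTA = 1000 ** 7  # one zettabyte in bytes (10^21)

digital_universe = 44 * ZETTA
print(f"{digital_universe:.2e} bytes")  # 4.40e+22 bytes
```

That is a 44 followed by 21 zeros, a number with no everyday frame of reference.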