Data Quality Part 2: What Are the Components of Data Quality?
Posted: May 22nd, 2023
Authors: Eugene Y., Aditya S.
In the first article of this series we started with this thought: Defining data quality and implementing a data quality program furthers the goal that the data collected serve the intended purpose, i.e., informed decision making. We walked through an example of different methods to make a measurement; the upshot of that was that we needed to match the measurement to the end use of the result. If we need better, tighter numbers, we need to use a better, tighter tool (and in this context, tool means both the physical tool and the procedure). In this article, we will discuss the questions that help us frame our decision making around methodology. (And we’re going to put fancy words around the questions.)
- Does the data need to be repeatable? Of course it does, right? But how repeatable? In the extension cord example in the first article of this series, it just needs to be repeatable enough that I don’t have to string together two extension cords. When considering data quality, repeatability is called PRECISION.
- Does the data need to be right? Again, of course it does. But how tightly does it need to agree with the actual value? In the lumber estimation example in the first article of this series, it just needs to be close enough to avoid another trip to Home Depot. When considering data quality, agreement with the correct value is called ACCURACY.
- Under what conditions are we making the measurement? For my overall deck case, did I lay out the deck in the right place, and did I start from the right points on the house? Or did I lay out the deck and measure planning to use 2×8 boards, and then buy 2×6 boards? When considering data quality, making sure that the data reflects the intended physical state is called REPRESENTATIVENESS.
- Is my measurement something someone else can use (and vice versa)? If we are sharing, combining, or comparing our data with someone else’s data, we need to have used similar (or near identical) methodologies. I built my deck out of wood and my neighbor built his out of concrete. We’re comparing notes: I bought 700 board feet of decking and he bought 4 tons of concrete. We can’t compare. When considering data quality, similarity between measurements is called COMPARABILITY.
- Did I get all the data I need? To stretch my deck analogy, did I forget anything? Yes, I did; while I covered the decking material itself, I forgot anything that had to do with raising or lifting the deck off the ground and making it level. I didn’t measure the drop over the dimensions of the deck, so I didn’t buy any risers or in‑ground support for my risers (back to Home Depot I go…). When considering data quality, getting enough data is called COMPLETENESS.
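Three of the five components above lend themselves to simple numbers. The sketch below is one common way to express them, not the only one: precision as relative standard deviation, accuracy as percent bias against a known value, and completeness as the share of planned measurements that produced valid results. The function names and the sample readings are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def relative_std_dev(values):
    """Precision: sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


def percent_bias(values, true_value):
    """Accuracy: offset of the mean from the known value, in percent."""
    return 100.0 * (mean(values) - true_value) / true_value


def percent_complete(valid_count, planned_count):
    """Completeness: valid results as a percentage of planned results."""
    return 100.0 * valid_count / planned_count


# Hypothetical repeat readings of a reference standard with a known value of 50.0
readings = [49.0, 50.0, 51.0]

print(relative_std_dev(readings))    # spread of the readings around their mean
print(percent_bias(readings, 50.0))  # how far the mean is from the known value
print(percent_complete(9, 10))       # 9 valid results out of 10 planned
```

Representativeness and comparability are left out on purpose: as described above, they depend on the conditions and methodology of the measurement rather than on a single computed value.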
As we set up a measurement program, we want to define the data quality objectives. We want to identify and define all the specifications to get the quantity and quality of data necessary to answer the underlying question. To do that, we address all the things mentioned above: precision, accuracy, representativeness, comparability, and completeness.
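As a rough illustration of what writing down data quality objectives can look like, the sketch below records numeric limits for the quantifiable components and reports which ones a set of results fails to meet. The class name, field names, and every threshold value are assumptions made for this example; real limits come from the underlying question the data must answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataQualityObjectives:
    """Hypothetical numeric limits for one measurement program."""
    max_rsd_percent: float           # precision limit
    max_abs_bias_percent: float      # accuracy limit
    min_completeness_percent: float  # completeness limit


def unmet_objectives(dqo, rsd_percent, bias_percent, completeness_percent):
    """Return the names of the components that miss their objective."""
    failures = []
    if rsd_percent > dqo.max_rsd_percent:
        failures.append("precision")
    if abs(bias_percent) > dqo.max_abs_bias_percent:
        failures.append("accuracy")
    if completeness_percent < dqo.min_completeness_percent:
        failures.append("completeness")
    return failures


# Illustrative limits and results only; none of these values come from a standard.
dqo = DataQualityObjectives(
    max_rsd_percent=10.0,
    max_abs_bias_percent=15.0,
    min_completeness_percent=90.0,
)
result = unmet_objectives(
    dqo, rsd_percent=4.2, bias_percent=-18.0, completeness_percent=95.0
)
print(result)  # the bias of -18% exceeds the 15% limit, so accuracy is flagged
```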
For a small measurement event, the identification and assessment of many of those data quality components need to be external to our own measurements. For example, three stack test measurements are insufficient to assess precision. Instead, we must assess precision using any of a number of tools: repeat measurements, repeat analysis of known standards, or multiple spiked samples. There are also procedural approaches to data quality; examples here include training, use of standardized procedures, implementation of published methods, and robust and standardized systems. The ribbon that wraps up our data quality assessment and helps make sure that the data collected will address our underlying question is our selection and assignment of data quality objectives.
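Two of the tools mentioned above, repeat measurements and spiked samples, are often summarized with simple ratios. The sketch below assumes the common conventions of relative percent difference for a pair of duplicate results and percent recovery for a spiked sample. The function names and example values are hypothetical, and a published method may define these quantities differently.

```python
def relative_percent_difference(first, second):
    """Difference between two duplicate results as a percentage of their mean."""
    return 100.0 * abs(first - second) / ((first + second) / 2.0)


def spike_recovery_percent(spiked_result, unspiked_result, spike_added):
    """Share of a known added amount that the measurement actually recovered."""
    return 100.0 * (spiked_result - unspiked_result) / spike_added


# Hypothetical duplicate pair: results of 9.0 and 11.0 in the same units
rpd = relative_percent_difference(9.0, 11.0)

# Hypothetical spike: 5.0 units added to a sample that measured 10.0 unspiked
recovery = spike_recovery_percent(
    spiked_result=14.5, unspiked_result=10.0, spike_added=5.0
)

print(rpd, recovery)
```

A small relative percent difference speaks to precision, while a recovery near 100% speaks to accuracy; how close is close enough is exactly what the data quality objectives are meant to specify.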
Feel free to contact either of us:
- Gene Youngerman, firstname.lastname@example.org, 512.649.2571
- Aditya Shivkumar, email@example.com, 281.201.1239