The Three Phases of Open Data Quality Control
By Dennis D. McDonald, Ph.D., firstname.lastname@example.org
In my previous post about open data quality, the solutions I suggested related not just to adhering to standards but also to making sure that the processes by which open data are published and maintained are efficiently and effectively managed. In this post I drill down a bit more on those management processes.
When discussing open data it helps to look at open data projects with tasks divided into at least three related phases:
Assessment and planning
Data preparation and publishing
Ongoing maintenance and support
Different tools and processes are relevant to each phase. Each can have an impact on the quality of the data as well as its perceived quality.
Phase 1. Assessment and planning
Critical to data quality at this first phase of an open data project is an understanding of the "who, where, how, how much, and why" of the data. If the goals of the project include making data from multiple systems and departments accessible and reusable, there's no substitute for having a good understanding of what the source data actually look like early in the project. Developing an understanding of the level of effort involved in preparing the data for public access is critical. Understanding who will be responsible for making changes and corrections on an ongoing basis will also be important.
Data issues (e.g., missing data, lack of standard identifiers, transposed fields, even outright errors) that may have limited impact on traditional internal users may loom large when the data are made public. Data inconsistencies that matter little internally, even if they are not outright errors, may cause embarrassment and can be labeled as "errors" by those whose understanding of data and data management is meager.
This is not to say that outright errors aren't important; of course they are. But nuances such as the distinction between outright errors and “outliers” or inconsistently tagged or labeled fields may be lost on some members of the public or press. Variations in data management literacy should be expected and planned for.
Given the effort required by data preparation work (see Phase 2), there's no substitute for taking the time during Phase 1 to perform an objective sampling of the source data including, where possible, test runs to see how the tools to be used in managing and accessing the data will behave when faced with live data. Validation tools that check for data formatting and standards compliance at this stage will be very useful. If data are "clean" and error-free, data prep and Phase 2 will run smoothly. If there are significant issues with the data, the earlier you know about them the better.
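As an illustration of what an objective sampling pass might look like, here is a minimal Python sketch that draws a random sample from a source extract and counts missing values in required fields. The column names, the sample data, and the function name profile_sample are all hypothetical; a real project would substitute its own fields and likely add format and standards checks.

```python
import csv
import io
import random
from collections import Counter

def profile_sample(rows, required_fields, sample_size=100, seed=0):
    """Count missing values per required field in a random sample of rows."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    missing = Counter()
    for row in sample:
        for field in required_fields:
            value = row.get(field)
            # Treat absent, empty, and whitespace-only values as missing.
            if value is None or not value.strip():
                missing[field] += 1
    return {"sampled": len(sample), "missing": dict(missing)}

# Hypothetical extract; the column names are illustrative only.
SAMPLE_CSV = """record_id,department,amount
1,Parks,100.00
2,,250.50
3,Roads,
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_sample(rows, ["record_id", "department", "amount"])
print(report)
```

Even a simple report like this, run against each source system early on, gives a first estimate of how much data preparation effort Phase 2 will require.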
Phase 2. Data preparation and publishing
This is the "production" phase of the project where plans are put in motion and initial releases of the data are prepared along with the web-based tools that link users with the data. For large volumes of data it’s not unusual at this stage for contractors to be involved with initial extract, transform, and load activities as well as programming and API development tasks. Appropriate testing tools and techniques can answer questions such as these:
Was the number of records extracted from the source system the same as the number of records loaded into the open data portal?
Are predefined filters or data visualization features behaving correctly with varying types and volumes of data?
Are data anonymization strategies impacting the types of analyses that can be conducted with the data?
Are basic statistics being calculated correctly, and are missing or incorrectly coded data being tagged for special processing?
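Two of the questions above, record count reconciliation and tagging of missing or incorrectly coded values, lend themselves to simple automated checks. The following Python sketch shows one possible approach; the function names, the record identifiers, and the choice of a numeric "amount" field are assumptions made for illustration, not a description of any particular portal's tooling.

```python
def reconcile_counts(source_ids, loaded_ids):
    """Compare record identifiers extracted from the source with those loaded."""
    source, loaded = set(source_ids), set(loaded_ids)
    return {
        # Counts use the raw lists so that duplicates are not hidden.
        "source_count": len(source_ids),
        "loaded_count": len(loaded_ids),
        "missing_from_portal": sorted(source - loaded),
        "unexpected_in_portal": sorted(loaded - source),
    }

def tag_amount(raw):
    """Return (value, tag) where tag is 'ok', 'missing', or 'miscoded'."""
    if raw is None or raw.strip() == "":
        return None, "missing"
    try:
        return float(raw), "ok"
    except ValueError:
        # Not parseable as a number, so flag it for special processing.
        return None, "miscoded"

# Hypothetical identifiers and values for illustration.
result = reconcile_counts([1, 2, 3, 4], [1, 2, 4])
tags = [tag_amount(v)[1] for v in ["10.5", "", "N/A"]]
print(result)
print(tags)
```

Tagging problem values rather than silently dropping them keeps them out of basic statistics while preserving a record of what needs follow-up.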
Making extensive amounts of data available for public scrutiny may mean that some data context will be missing. Because of this some users may lack an understanding of how to interpret the data and may not understand what's significant and what isn't. Something that looks like an anomaly or error might actually be correct.
Supplying such context has less to do with quality control than with how well equipped the user is to make sense of the data. If two different departments use two different address formats or two different expenditure categories for check writing, data files combining these two sources without some indication of such contextual information may lead to a perception of error even though the source data are technically correct.
Detecting the possibility of such inconsistencies is a Phase 1 task. Resolving such inconsistencies on a production or volume basis will be a Phase 2 task and may involve manual and automated processes as well as the development of ancillary services such as help files or even online support resources.
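One way the automated part of that resolution might look is sketched below in Python: department-specific expenditure categories are mapped to a shared vocabulary, the original value is preserved as context, and anything unmapped is flagged for manual review. The department names, category codes, and field names are invented for this example.

```python
# Hypothetical mapping from (department, local category) to a shared label.
CATEGORY_MAP = {
    ("finance", "OFFICE-SUP"): "Office supplies",
    ("public_works", "Supplies - Office"): "Office supplies",
}

def harmonize(record):
    """Add a shared category while keeping the original value as context."""
    key = (record["source_department"], record["category"])
    shared = CATEGORY_MAP.get(key)
    return {
        **record,
        "category_original": record["category"],
        "category_shared": shared,
        # Unmapped categories go to a person rather than being guessed at.
        "needs_review": shared is None,
    }

mapped = harmonize({"source_department": "finance", "category": "OFFICE-SUP"})
unmapped = harmonize({"source_department": "parks", "category": "Misc"})
print(mapped)
print(unmapped)
```

Keeping the source department and original category alongside the shared label gives users the context they need to see that differing values reflect differing conventions, not errors.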
Phase 3. Ongoing maintenance and support
Once the open data service goes “live” there need to be ongoing quality management processes that monitor and report to management on the condition of the data. Error detection and error correction systems and processes need to be in place, including a channel for users to provide feedback and corrections. This feedback mechanism is important given that one of the guiding principles of the open data movement is that users are free to use data as they please. Some of these uses may never have been anticipated or tested for and may reveal data issues that need to be addressed.
Finally, ongoing monitoring of source data is needed to remain aware of possible changes to source data that might have an impact later on. Some upgrades to source data systems, even when basic formats are controlled by well accepted data standards, might introduce format or encoding changes that have downstream impacts.
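A lightweight monitoring check along these lines could compare each new extract against a recorded baseline. The Python sketch below looks only at column names and at whether the raw bytes decode as UTF-8; the column names are hypothetical, and the assumption that the portal expects UTF-8 is made purely for the example.

```python
def detect_drift(baseline_columns, current_columns):
    """Report columns added, removed, or reordered relative to a baseline."""
    baseline, current = list(baseline_columns), list(current_columns)
    return {
        "added": [c for c in current if c not in baseline],
        "removed": [c for c in baseline if c not in current],
        # Same set of columns but in a different order.
        "reordered": sorted(baseline) == sorted(current) and baseline != current,
    }

def decodes_as_utf8(raw_bytes):
    """Return True if the bytes are valid UTF-8 (the assumed portal encoding)."""
    try:
        raw_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical baseline and a later extract where a column was renamed.
drift = detect_drift(["id", "dept", "amount"], ["id", "amount", "department"])
latin1_ok = decodes_as_utf8("café".encode("latin-1"))
utf8_ok = decodes_as_utf8("café".encode("utf-8"))
print(drift, latin1_ok, utf8_ok)
```

A renamed column shows up as one removal plus one addition, which is usually enough to prompt a conversation with the owners of the source system before the change reaches users.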
Summary and conclusions
Data quality management in the context of open data programs should not be considered as something “extra” but as part of the ongoing program management process. Outright data errors must be stamped out as early as possible before they have a chance to proliferate.
Much of the data provided in open data programs are the byproduct of human activities that have a natural tendency to change over time. This raises the possibility that errors and inconsistencies will arise in even well managed data programs. The solution: pay attention to quality management details at all stages of the process so that good data are provided and costs associated with error correction are minimized.