DATA QUALITY – THE 2ND WHITE HOUSE OPEN DATA ROUNDTABLE
DATA QUALITY IS AN ISSUE THAT THREATENS THE VIABILITY OF OPEN DATA
Project Open Data is designed to make data held by government agencies more accessible to the public, both to promote transparency in government operations and to support business through the incorporation of open data into commercial products. The April 27th roundtable focused on the issue of data quality in open data.
DATA QUALITY IS A GLOBAL PROBLEM
In September 2014, Martin Doyle wrote an article for Business2Community on the quality of data releases by the UK Government. His contention was that badly encoded (UTF-8) CSVs released by the UK’s Open Data program obfuscated government activities rather than making them transparent. Not stated in the article is the fact that this was an oversight by the UK’s Government Digital Service. These are not the only problems with the UK’s Open Data Program. It is worth noting that the Open Data Institute, of which Tim Berners-Lee is the president, also advises the UK on matters relating to Open Data.
The US also has a data quality issue, hence the roundtable put together by Joel Gurin’s team.
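The encoding problem described above is the kind of thing a publisher can check for before release. What follows is a minimal sketch, not the check used by any government program: the function name `check_csv_encoding` and the sample rows are hypothetical, and it assumes the publisher intends the file to be UTF-8. It tries to decode the raw bytes, then counts rows whose column count differs from the header.

```python
import csv
import io


def check_csv_encoding(raw: bytes) -> dict:
    """Report whether raw CSV bytes decode as UTF-8 and parse into even rows."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        return {"valid_utf8": False, "error_offset": err.start,
                "rows": 0, "ragged_rows": 0}

    rows = list(csv.reader(io.StringIO(text, newline="")))
    width = len(rows[0]) if rows else 0
    # A "ragged" row has a different number of columns than the header row.
    ragged = sum(1 for row in rows if len(row) != width)
    return {"valid_utf8": True, "error_offset": None,
            "rows": len(rows), "ragged_rows": ragged}


# Hypothetical sample: the same content saved in two different encodings.
sample = "department,amount\nHealth,£1200\n"
good = sample.encode("utf-8")
bad = sample.encode("latin-1")  # the pound sign becomes a byte that is invalid UTF-8

print(check_csv_encoding(good))
print(check_csv_encoding(bad))
```

A check like this only catches files that are not valid UTF-8. It will not catch text that was decoded with the wrong encoding and then re-saved, which still needs a human look at the data.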
AND A LOCAL PROBLEM: USASPENDING.GOV MISSING $619 BILLION
This is one example, unfortunately one of many, of data being released in the name of transparency. It is also an example of both a data fail and a transparency fail. USASpending.gov was intended to make federal spending more transparent. A government audit has found that the site was missing at least $619 billion from 302 federal programs.
And the data that does exist is inaccurate, according to the Government Accountability Office, which looked at 2012 spending data. Only 2-7% of spending data on USASpending.gov is “fully consistent with agencies’ records,” according to the report. The official report can be found here.
Among the data missing from the 6-year-old federal website:
- The Department of Health and Human Services failed to report nearly $544 billion, mostly in direct assistance programs like Medicare. The Department admitted that it should have reported aggregate numbers of spending on those programs.
- The Department of the Interior did not report spending for 163 of its 265 assistance programs because, as the department said, its accounting systems were not compatible with the data formats required by USASpending.gov. The result? $5.3 billion in spending was missing from the website.
- The White House itself failed to report any of the programs for which it is directly responsible. At the Office of National Drug Control Policy, which is part of the White House, officials said they thought HHS was responsible for reporting their spending.
- For more than 22% of federal awards, the spending website doesn’t know where the money went. The “place of performance” of federal contracts was most likely to be wrong.
The report comes as the Obama administration begins to implement the Digital Accountability and Transparency Act (DATA), which Congress passed last year to expand the amount of federal spending data available to the public. The report said the Office of Management and Budget needed to exercise greater oversight of federal agencies reporting spending data. “Until these weaknesses are addressed, any effort to use the data will be hampered by uncertainties about accuracy,” the report said.
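To make the idea of spending data being “fully consistent with agencies’ records” concrete, here is a rough sketch of how such a rate could be computed. It is not the GAO's methodology: the function name, the award IDs, the field names, and the sample values are all hypothetical. An award counts as consistent only if every checked field matches the agency's own record.

```python
def consistency_rate(published, agency_records, fields):
    """Share of published awards whose checked fields all match the agency record."""
    if not published:
        return 0.0
    consistent = 0
    for award_id, pub in published.items():
        source = agency_records.get(award_id)
        # An award with no agency record, or any mismatched field, is inconsistent.
        if source is not None and all(pub.get(f) == source.get(f) for f in fields):
            consistent += 1
    return consistent / len(published)


# Hypothetical sample data, keyed by a made-up award ID.
FIELDS = ("amount", "place_of_performance")
published = {
    "A1": {"amount": 100, "place_of_performance": "VA"},
    "A2": {"amount": 250, "place_of_performance": "MD"},
    "A3": {"amount": 75, "place_of_performance": "DC"},
    "A4": {"amount": 40, "place_of_performance": "TX"},
}
agency_records = {
    "A1": {"amount": 100, "place_of_performance": "VA"},  # matches
    "A2": {"amount": 250, "place_of_performance": "PA"},  # wrong place
    "A3": {"amount": 90, "place_of_performance": "DC"},   # wrong amount
    # "A4" has no agency record at all
}

print(f"{consistency_rate(published, agency_records, FIELDS):.0%} fully consistent")
```

The strictness is the point of the sketch: one wrong field, such as the place of performance, is enough to make an otherwise correct record count as inconsistent.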
QUALITY APPROPRIATE TO PURPOSE AND IMPACT
There is a lot of hand-wringing about the quality of data released by governments around the world. The US is no exception. Several publications have pointed to data quality as a vehicle to enable secrecy in governments. Indeed, in my last post, I discussed the ambiguity of open data as it relates to open government.
One of the first sound bites I tweeted that day came from a comment made by one of the other attendees. “Data quality should be appropriate to purpose and impact.” Well said; this should be held as the standard by which we measure data quality.
The problem of data quality is more than the heated discussion over the White House’s version of DCAT. This is an acculturation issue. Data should be released to support programs. The 180k datasets on data.gov are probably not as useful as well-thought-out programs that report on transparency through rigorously vetted datasets.
TRANSPARENCY IS SOMETHING THAT IS DIFFICULT TO MEASURE AND OFTEN BACKFIRES
Consider the examples of USASpending.gov and Data.gov.uk. In both cases, the primary driver for releasing open data to the public was transparency. The failure ultimately lies in the culture of the organizations that set out to be transparent.
USASpending.gov is a large and very complex project. The failure happened because there was no accountability for the data that was released. This is the same reason that Data.gov.uk has had data quality issues.
Now that the White House OSTP is reaching out to the private sector, academia, and other public agencies, we need to communicate that change has to happen first within the organization.
- Each transparency initiative needs a scope in terms of size and impact
- The granularity of the data to be released should be part of the scope of the project
- The impact should directly correlate to the effort to clean the data and to assure the public that the data being released adheres to open data standards (open format, machine readable, sufficient metadata)
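The open data standards named in the last point (open format, machine readable, sufficient metadata) lend themselves to a simple pre-release checklist. The sketch below is only an illustration: the list of accepted formats, the required metadata fields, and the dataset descriptors are my own assumptions, not an official standard or schema.

```python
# Assumption: a short illustrative list of open, machine-readable formats.
OPEN_FORMATS = {"csv", "json", "xml"}
# Assumption: an illustrative minimum of metadata, not an official schema.
REQUIRED_METADATA = ("title", "description", "publisher", "modified")


def release_checklist(dataset: dict) -> list:
    """Return a list of problems that should block the release of a dataset."""
    problems = []
    fmt = str(dataset.get("format", "")).lower()
    if fmt not in OPEN_FORMATS:
        problems.append(f"format '{fmt}' is not in the open, machine-readable list")
    # Empty strings count as missing metadata.
    missing = [key for key in REQUIRED_METADATA if not dataset.get(key)]
    if missing:
        problems.append("missing metadata: " + ", ".join(missing))
    return problems


# Hypothetical dataset descriptors.
ready = {"title": "Grant awards", "description": "Awards by program",
         "publisher": "Example Agency", "modified": "2015-04-27", "format": "CSV"}
not_ready = {"title": "Grant awards", "description": "Awards by program",
             "publisher": "Example Agency", "format": "pdf"}

print(release_checklist(ready))
print(release_checklist(not_ready))
```

How strict the checklist should be is a scoping decision: a high-impact transparency program would justify a longer list of required fields and checks than a low-impact release.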
HOPE THROUGH ITERATIONS AND FAILING FAST
Open data is still in its infancy. We should not take criticism as a reason to avoid these types of projects. The ambition for opening data to the public should be encouraged. Every open data failure should be seen as a step towards a better product.
I was honored to be invited to the roundtable by Joel, Katherine, and the team at the Center for Open Data Enterprise.
I was honored to be invited back a second time to the fourth roundtable on Public-Private collaboration. Take a look at my recap of that roundtable where we once again brought up the subject of data quality.