Toward an Open Data Maturity Model

Last week was an exciting week in the world of open data. In the US, we held our first International Open Government Data Conference, and in London, there was Open Government Data Camp. Meanwhile, there was some discussion around data journalism at Public Media Camp, and “big data” was a topic of discussion at the Gov 2.0 Startup Lab event at George Washington University. The discussions were lively, and there’s a lot of work to be done.

We’ve been working on the open data question for a long time[1], but there are new questions emerging. The open data community continues to ruminate on some fundamentals. For instance, what is “high value data?” Also, what does “open data” really mean?

The truth of the matter is, the definition of “high value” data differs based on the context in which it will be used and the purpose to which it will be applied. Under the Information Quality Act, agencies were told that they needed to “ensure and maximize the quality, utility, objectivity and integrity of the information that they disseminate.”[2] The Open Government Directive told agencies that high value data are data that can be used to:

· increase agency accountability and responsiveness;

· improve public knowledge of the agency and its operations;

· further the core mission of the agency;

· create economic opportunity;

· respond to need and demand as identified through public consultation.

All of those dimensions, though, are different ways to measure the utility of a given data set – and while those may be important under the current administration, priorities may shift. When there’s an oil spill in the Gulf, certain data sets (oil company enforcement actions, for instance) suddenly become a lot more useful. Does this spike in relevance really make a data set higher value than it was before? I think the Open Government Directive and our recent experiences with people clamoring for data during the oil spill point to one thing: if the government collects it, it’s probably useful – and if it’s useful, it should be publicly available. Frankly, spending a lot of time on what’s “high value” only serves to delay the process of releasing data. This isn’t to say that agencies don’t need to prioritize the release of their data, but there are alternatives that can be considered – for instance, if a data set doesn’t contain any sensitive information, why not store it in a public cloud and provide immediate, unfettered access?

This brings us to the more contentious argument. There are a lot of definitions of open data floating around out there. There’s the Open Definition: “Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike” and then there’s Tim Berners-Lee’s five-star system for linked open data:

  • 1 Star for putting data on the Web at all, with an open license. PDFs get 1 star.
  • 2 Stars if it’s machine-readable. Excel qualifies, though Berners-Lee prefers XML or CSVs.
  • 3 Stars for machine-readable, non-proprietary formats.
  • 4 Stars if the data uses open linked data standards like RDF and SPARQL.
  • 5 Stars when people have gone through the trouble of linking it.
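
As a rough illustration, the five-star ladder can be read as a cumulative checklist in which each star presupposes the ones below it. The Python sketch below is written under that assumption; the `Dataset` fields and the `star_rating` function are hypothetical names invented for this example, not part of Berners-Lee’s scheme, and reducing each criterion to a yes/no flag is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published data set (illustrative fields only)."""
    on_web_with_open_license: bool = False
    machine_readable: bool = False
    non_proprietary_format: bool = False
    uses_linked_data_standards: bool = False  # e.g. published as RDF
    linked_to_other_data: bool = False


def star_rating(ds: Dataset) -> int:
    """Count stars cumulatively: stop at the first criterion that is not met."""
    criteria = [
        ds.on_web_with_open_license,
        ds.machine_readable,
        ds.non_proprietary_format,
        ds.uses_linked_data_standards,
        ds.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# A PDF report posted online under an open license.
pdf_report = Dataset(on_web_with_open_license=True)
# A CSV file posted online under an open license.
csv_file = Dataset(True, True, True)
print(star_rating(pdf_report), star_rating(csv_file))  # prints "1 3"
```

Treating the stars as cumulative means a machine-readable file with no open license still scores zero, which matches the idea that the first star is about being on the Web under an open license at all.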

There’s also the Sunlight Foundation’s Nine Principles of Open Data, David Eaves’ Three Laws of Open Government Data, and the Sebastopol, CA Working Group’s Eight Principles of Open Data. Reading those definitions, you get the sense that there are two very different levels at play here: the first is about what constitutes “open” and the second is about formats.

All this discussion of formats leads me to the one thing I took away from all the activity around open data last week: we need an Open Data Maturity Model. The principles of the model are:

  • Discussions around physical openness and technical openness can (and should) be harmonized into one framework.
  • The framework can be applied at the agency level or at the individual data set level.
  • Agencies should be able to use the framework to evaluate where they are and where they want to be.
  • Agencies should be able to use the framework to prioritize actions that need to be taken to improve the availability and the utility of their data.

The last bullet is important: it assumes that agencies accept the premise that their data should be open. From there, it’s a matter of selecting an appropriate, achievable degree of openness using the Open Data Maturity Model. I propose the following skeleton for this model (with credit to the Open Knowledge Foundation’s evolving Open Data Manual and Justin Grimes):

| Dimension | Level 1 – Emerging | Level 2 – Practicing | Level 3 – Enabling | Level 4 – Leading |
| --- | --- | --- | --- | --- |
| Strategy & Policy | Some people release data because they want to | Everyone releases data because they want to and/or are required to | Everyone releases data because it’s aligned with an information access strategy | Everyone releases data because information access achieves mission objectives |
| Availability | Data is available, but is human-readable, either in the form of reports or in web-based data mining tools | Data is available and machine-readable, in bulk download | Data is available and machine-readable, dynamically accessible through APIs | Data is available and machine-readable, linkable through semantic open standards |
| Description and Documentation Practices | Context and meaning for the data is dependent on pages within a report | Data is described in an unattached data dictionary or record layout | Data is described and documented within the data using metadata standards | Data is described, documented, and linkable through RDF |

The maturity model posited here establishes a baseline: not releasing data doesn’t count. Further, the model tries to avoid a singular discussion of formats. Formats are a matter of choice, a consequence of availability and documentation practices. When it comes to availability, note that the levels on the ladder are not mutually exclusive. Also, note that as you move up the scale on the maturity model, there is an increasing level of abstraction from human interaction with data.
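
To make the self-assessment idea a little more concrete, here is a minimal Python sketch of how an agency might record where it is and where it wants to be on each dimension, then list the gaps. Everything in it is an assumption made for illustration: the `Level` enum, the `gaps` function, and the choice to rank dimensions by the number of levels left to climb are not part of the model itself. It also records only the highest level reached per dimension, which glosses over the point that availability levels are not mutually exclusive.

```python
from enum import IntEnum


class Level(IntEnum):
    """Maturity levels from the skeleton; 0 is an assumed marker for unreleased data."""
    NOT_RELEASED = 0
    EMERGING = 1
    PRACTICING = 2
    ENABLING = 3
    LEADING = 4


DIMENSIONS = (
    "Strategy & Policy",
    "Availability",
    "Description and Documentation Practices",
)


def gaps(current: dict[str, Level], target: dict[str, Level]) -> list[tuple[str, int]]:
    """Return (dimension, levels left to climb) pairs, largest gap first."""
    result = []
    for dim in DIMENSIONS:
        distance = target[dim] - current[dim]
        if distance > 0:
            result.append((dim, distance))
    # sorted() is stable, so ties keep the order of DIMENSIONS.
    return sorted(result, key=lambda item: item[1], reverse=True)


# Hypothetical agency: where it is today and where it wants to be.
current = {
    "Strategy & Policy": Level.PRACTICING,
    "Availability": Level.EMERGING,
    "Description and Documentation Practices": Level.EMERGING,
}
target = {
    "Strategy & Policy": Level.ENABLING,
    "Availability": Level.ENABLING,
    "Description and Documentation Practices": Level.PRACTICING,
}

for dimension, distance in gaps(current, target):
    print(f"{dimension}: {distance} level(s) to go")
```

Ranking by gap size is only one possible way to prioritize; an agency could just as reasonably weight dimensions by cost or by public demand.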

There’s room for improvement, but this is designed to get the discussion moving forward. The dimensions and levels can (and will) be more clearly described and analyzed in greater detail in future blog posts and, hopefully, can evolve into a self-assessment that organizations can use.

What else do you think needs to be included in an Open Data Maturity Model?

  • Are there other dimensions that need to be included (e.g., security, quality, privacy, and confidentiality)?
  • Should the model expressly consider infrastructure and Web architecture?
  • While strategy can drive culture change, what explicit levels of maturity could we lay out on a culture dimension?
