Making Howard County government data of value to us all

tl;dr: Before Howard County’s next county executive goes off on a high-profile “open government data” initiative, they (and we) should think more about what such a project can and can’t do, and how best to make it successful.

Among their other policy proposals, both candidates for Howard County Executive have proposed new initiatives to make data about the workings of county government more available to residents. Allan Kittleman has promoted what he calls “HoCoStat,” a “platform to hold government accountable” that “will link data to long-term impacts” and “measure . . . response and process times for various government functions.” Courtney Watson’s corresponding initiative doesn’t have a catchy name, but her “open government” vision includes a promise to “leverage technology to improve and maintain government transparency, efficiency and communication” by creating “an intuitive and interactive web portal that provides public access to information in usable and searchable formats.”

As someone who’s written my share of data-heavy blog posts, I might be expected to cheer these plans on wildly from the sidelines. However, as someone who’s also seen my share of technology hype cycles, of which “big data” is only the latest, I also feel compelled to throw a little cold water on at least some aspects of these proposals. To be specific:

First, yes, open government, big data, and related topics are hot and sexy. But in the end the goal of Howard County government is to make Howard County a better place to live for its residents. In that respect providing access to government data (and in particular building high-profile web portals, dashboards, and so on, to display that data) is a means, not an end. This applies more generally to accountability, transparency, and all those other nice things candidates are promising and activists are demanding. We shouldn’t confuse process with products: Transparency is nice, but transparency in and of itself is arguably useless.

Second, as James Howard noted in a recent post, Howard County isn’t really big enough for big data. To take but one example, systems like those created in New York City, Baltimore, and so on, are often touted as enabling better law enforcement, for example by identifying detailed geographic patterns in particular types of crimes. But those large cities have lots of crimes, enough that any patterns in the data stand a good chance of being significant. Given the generally small number of crimes in Howard County, it’s quite possible that a lot of the patterns in county crime data simply represent statistical noise and don’t add a lot of information beyond what Howard County police already know based on their lived experience. That’s certainly true for very low-frequency crimes like murder. In 2013 there were only four homicides in Howard County, and I personally knew three of the victims. Is there any significance to that fact? None whatsoever—it’s simply random coincidence at work.
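To put a rough number on the “statistical noise” point, here is a minimal sketch. It assumes, purely for illustration, that homicides in a year follow a Poisson distribution with a stable long-run average of four per year (the 2013 figure above). The function name `poisson_tail` and the choice of “double the average” as the threshold are my own, not anything the county or its police use.

```python
import math

def poisson_tail(expected, threshold):
    """Probability of seeing `threshold` or more events in a period,
    assuming the count is Poisson-distributed with mean `expected`."""
    # Sum the probabilities of 0 .. threshold-1 events, then take the complement.
    below = sum(
        math.exp(-expected) * expected**k / math.factorial(k)
        for k in range(threshold)
    )
    return 1.0 - below

# Assumed long-run average of 4 per year; how often would chance alone
# produce a year with 8 or more?
p = poisson_tail(4, 8)
print(f"Chance of a 'doubled' year from randomness alone: {p:.1%}")
```

Under these assumptions the answer comes out to roughly 5%, or about one year in twenty. In other words, a year in which the count doubles would look alarming on a dashboard even if nothing at all had changed underneath.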

Next, data without context is not that useful, and may be actively harmful. A good example is school test scores. As Julia McCready recently pointed out, it’s unclear that school test scores are actually useful for identifying “good” schools versus “bad” schools. It’s quite possible that test scores for a given school are simply reflecting the characteristics of the students who go to that school, and not whether that school is better than others in educating students. A system that doesn’t provide context for data is a system whose data is likely to be misinterpreted and misused.

Related to the previous point, data without (policy) experimentation is also not all that useful. Data in and of itself isn’t necessarily that informative about what policies should be implemented, because it doesn’t necessarily indicate which underlying factors are driving the results we see, and how we might achieve better results. Determining that typically requires actually making some policy changes to see what happens, and doing so in a controlled manner that permits some statistically valid conclusions to be drawn. (See for example Jim Manzi’s book Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society.) But making policy changes is hard enough in the first place; doing randomized controlled trials of different policy options (especially when one option in a proper trial is “do nothing”) is even more difficult. (It’s the same phenomenon as with drug trials: No one wants to be in the group taking the placebo.)

Finally, all the data in the world won’t necessarily change people’s minds about what policies to adopt. People of all political persuasions are quite capable of holding on to their opinions and political positions no matter what the data indicates (and note that I myself can be as susceptible to this as anyone). Smart people in particular (the kind of people who like to visit data portals and are arguing for their creation) are really good at finding reasons to doubt what the data appears to be telling us. So if in the end we switch from arguing about policies to arguing about data and methodologies, have we really achieved anything?

Despite all I’ve written above I’m not a total skeptic about the possibility of Howard County doing more to provide access to government data. I’d just like the county government and in particular the new County Executive to embark on this task with a proper sense of humility. In particular I have the following recommendations:

First, start simple, start small, underpromise and overdeliver. Do we really need to spend potentially millions of taxpayer dollars on a high-profile system that’s at a relatively high risk of failing to meet its goals? Why not incrementally extend existing efforts? For example, there’s already a site data.howardcountymd.gov. Does anyone use it? If not, why not? Could this site be relatively inexpensively improved to make it more valuable and attractive to Howard County residents? Could data already provided by other county agencies be consolidated onto this existing site?

Next, for many if not most cases I suggest that the county provide only data, and let the private and nonprofit sector add value to it. A lot of the data generated by Howard County government is of interest to relatively small groups of people. Why bother spending a lot of time and money creating a fancy data portal just for those groups? Just give them the raw data, in as simple a form as possible, for example as so-called “comma-separated values” or CSV-formatted files that can be loaded into any desktop spreadsheet program or open source statistical package. Then let those groups decide how best to analyze the data and prepare it for public dissemination. If the county wants to do more, “teach people to fish”: work with the Howard County Library System, Howard Community College, and local volunteers to organize classes for businesses, nonprofit organizations, and local activists in how to use common “data science” tools and how to build data-driven web sites.
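To give a sense of how little is needed once raw CSV data is available, here is a minimal sketch using only Python’s standard library. The sample data, the column names (`agency`, `request_type`, `days_to_close`), and the function `average_days_by_agency` are all invented for illustration; they don’t correspond to any actual Howard County data set.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: column names and values are made up for illustration.
SAMPLE_CSV = """agency,request_type,days_to_close
Public Works,pothole,3
Public Works,streetlight,7
Recreation and Parks,field permit,2
Recreation and Parks,field permit,4
"""

def average_days_by_agency(csv_file):
    """Return the mean of the days_to_close column for each agency."""
    totals = defaultdict(lambda: [0.0, 0])  # agency -> [sum of days, row count]
    for row in csv.DictReader(csv_file):
        entry = totals[row["agency"]]
        entry[0] += float(row["days_to_close"])
        entry[1] += 1
    return {agency: total / count for agency, (total, count) in totals.items()}

# In practice the file object would come from open("some_download.csv", newline="").
averages = average_days_by_agency(io.StringIO(SAMPLE_CSV))
for agency, avg in sorted(averages.items()):
    print(f"{agency}: {avg:.1f} days on average")
```

The same file would open just as easily in a spreadsheet program or a statistical package, which is the whole point: a plain CSV file doesn’t lock anyone into a particular tool.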

If the county does want to provide its own system, please, please, please don’t do so under an arrangement that gives an outside contractor a measure of control over the data, how it’s distributed, and what can be done with it. If the county releases data then that data should be available to everyone, in a form everyone can use, and for whatever purposes people want to make use of it.

Related to the previous point, treat providing data to the public as a core government function, to be budgeted as such, and not as an adjunct task for which an agency needs to pursue “cost recovery” or that it even (heaven forbid) tries to turn into a profit center. It is not the business of government to be “in business,” especially in an era when the marginal cost of disseminating raw data products via the Internet is so low. Budget for collecting the data and preparing it for public release at no charge, not for implementing complicated schemes by which access to data can be controlled and sold.

Government data ultimately belongs to all of us, a public resource for all to use, and government itself is not necessarily best equipped to analyze, present, and build on that data. Let’s have Howard County government data be made available to all in a way that makes the most efficient use of taxpayer dollars and leverages the creative energies of the multitude of organizations and individuals in the private and civic sectors. I think that’s an approach that anyone can get behind, no matter their political affiliation.

Jessie Newburn - 2014-11-04 12:55

Ever and always, brilliant!

Trevor Greene (trevordentist@gmail.com) - 2014-11-04 14:01

I can think of one good reason to put all this data out there. Once all the data is readily available, I’ll be able to read a series of Frank Hecker blog posts analyzing said data.

hecker - 2014-11-06 17:34

Trevor, good to see your comments again. But… I don’t think Howard County government needs to spend millions just to keep me happy and you entertained :-)
