It seems like everywhere we look, we see big data calling our name. The profession of “data scientist” has been deemed the next hot job. Big data promises to solve all kinds of problems, from business logistics to tracking and treating diseases. These promises seem to have cropped up almost overnight. Think back to five years ago: would you have known what big data was? When things change so rapidly, it’s easy to think, “I should jump on the bandwagon now so I don’t get left behind.” But have we drunk the proverbial big data Kool-Aid without actually evaluating whether this was the right bandwagon?
Here’s my question of the day: Before we turn to big data, have we made the most of the small data available to us?
No matter whether you classify your data set as big or small, we are drowning in data. That’s a good thing. It means that there’s a wealth of information at our fingertips, waiting for us to glean something from it. As a scientist, it’s my job to generate new data and interpret it. A project typically goes like this:
1. Plan a series of experiments.
2. Execute these experiments, generating data along the way.
3. Spend time sifting through the massive piles of data that we just created.
4. Draw conclusions about what’s really happening in the experiments.
I find that I really enjoy the first two steps, the planning and execution of experiments. Often I find myself less excited about the data analysis portion of the project. Why? Although there are a number of possible explanations, sometimes I feel it’s because there’s so much data to sift through that the process seems a bit overwhelming at first. Once I get going with the data analysis, I tend to do just fine and don’t find it nearly as distasteful as I thought I would, but sometimes getting started is tough. In addition, my pool of data is actually greater than the sum of its parts, because not only do I try to analyze the results of each individual experiment, I also step back to the bigger picture and look for correlations between different experiments, or even different sets of experiments. And if it’s your practice to zoom out to at least one “meta” level every time you analyze, you can see that even just a handful of experiments can quickly turn into a huge pool of data to be mined.
What’s impressive is that this isn’t even “big” data. This scenario fits under the more traditional definition of data, and even then there’s more than enough to go around. While I believe taking a “meta” approach (at least to the first degree) is not only efficient but sometimes also critical for making discoveries, this approach renders even ordinary data sets somewhat overwhelming at times. With so much information available, sometimes I wonder if I’m leaving anything on the table. So before I take the next step and graduate to big data, I have to ask myself, “Have I made the most of the data I have already?” “Have I already maxed this out?”
Perhaps if the answer to either of those questions is no, then I’m not quite ready to leave behind my traditional data world and take things to the next, or simply a different, level. My question to you is the same: Have you maxed out the data you already have? Is it because we’ve exhausted our existing pools of data that we’ve been driven to the “big data” realm, or is big data simply the next new toy, made possible by the computing power now available to us? If we leave our traditional data worlds behind in favor of big data, are we leaving anything on the table?
Erica Bakota is part of the GovLoop Featured Blogger program, where we feature blog posts by government voices from all across the country (and world!).
The good news is: you have more data. The bad news is: you have more data.
The really bad news is: you know there is possibly, maybe even probably, more knowledge that could be created with the data you already have, but you just don’t know what it is yet.
Acquiring data is often easier than forging hunches and hypotheses about what one wishes to do with it. I know the times I am proudest of myself are when I have a deep inarticulable hunch that I ought to be gathering a particular type of data – not unlike Richard Dreyfuss carving up mashed potatoes in “Close Encounters”, muttering “This means something” – and that data turns out to be pivotal 2 or 3 years later. The times I lament are when I knew the data would be crucial, but was overruled because I couldn’t present a strong-enough case, and then…just when we really needed it…the data wasn’t there.
In some ways, “big data” is like having a Ferrari in the driveway. It is laden with so much potential (lookit how fast that baby can accelerate!), yet none of the thoroughfares one normally drives, or even can drive, permit the potential of the vehicle to be realized, and the insurance risks of theft or damage to the finish are so intimidating that it never leaves the garage. So what’s the point of having it? Sometimes it’s better to have a dinky little 6-year-old 4-cyl hatchback that corresponds exactly to what your daily needs are.
This is an oblique way of saying that acquiring data should be preceded by clear, or at least marginally formed, purposes (sometimes mashed-potato carvings are enough). Building large data repositories merely for their own sake, and for the perceived potential they might hold, will only lead to aimless wandering through the desert in search of hypotheses. And of course, when one has datasets of sufficient magnitude, everything ends up with teensy p-values, doesn’t it? The trap that can create is producing dozens of statistical distractions that end up obscuring what really needs to be learned.
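To make that teensy-p-value point concrete, here is a minimal sketch (my own illustration, not from the original discussion), assuming Python with NumPy and SciPy; the million-observation sample size, the 0.005-standard-deviation effect, and the two-sample t-test are all hypothetical choices made just to show that at this scale a practically negligible difference still comes out “statistically significant”:

```python
# A minimal sketch, assuming NumPy and SciPy are available: with a huge
# sample, a practically meaningless difference still yields a tiny p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000                                        # observations per group
group_a = rng.normal(loc=0.000, scale=1.0, size=n)
group_b = rng.normal(loc=0.005, scale=1.0, size=n)   # true effect: 0.005 SD

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.1e}")        # p typically far below 0.05
print(f"observed difference ≈ {group_b.mean() - group_a.mean():.4f} SD")
```

The p-value only says the difference is unlikely to be exactly zero; it says nothing about whether a 0.005-standard-deviation effect is worth learning about, which is exactly the kind of statistical distraction described above.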
Bigger data produces an epistemological dilemma, really. The goal is knowledge, and certainly the necessary path to that knowledge is via data, but data is not knowledge. Good, clear, vital questions that are compellingly addressed by data: THAT’s knowledge.
I won’t discount the importance of serendipity in knowledge creation, but as they say, “Chance favours the prepared mind.” So even serendipitous findings in one’s data require some minimum degree of mashed-potato carving. Data holdings will yield little of true value if you treat them like a magical bag of secrets that you can just reach into blindly and pull out a prize every single time.
You have some excellent points, Mark. One thing you said that I think is particularly important was about the difference between data and knowledge. You hit the nail on the head: having lots of data is not the same as having new knowledge. I think a lot of people and organizations make this mistake and jump headfirst into the big data craze, only to discover that they are now sitting on mountains of data they don’t know what to do with. In my mind, data are not inherently useful; it’s knowledge (i.e., what we can learn from the data) that is useful.
On the other hand, I want to pose a question: you mentioned that data acquisition should take place with a clear view of how the data will be used. P-values aside, does data acquisition with the goal of explicitly investigating a single question decrease the chances of serendipitously discovering an unlikely, but possibly groundbreaking, connection?