On My Mind

A number of themes have been on my mind over the last few months and years. I’ll probably spell these out in more detail later, but I thought it worth writing them down now to document them.

Data Driven Questions

There is a lot of talk about “Data Driven Answers” or “Data Driven Decisions.” But, the answers we get are only as good as the questions we ask. Often, we don’t ask very good questions or don’t understand the true context of our decisions.

I think there is an important role for analyzing data in order to help us ask better questions. What can we learn from data to help us better understand what’s interesting about the problem and what we should be investigating next? That next step might be another data analysis to answer the new question, or it might be seeking out a different type of quantitative or qualitative information, or something else. Does the data change our assumptions about how we think the system might be working? Does it suggest that we should try a different approach to the problem?

Dark Data

The data that we don’t see is as important as, or more important than, the data we do look at. What’s not included in our dataset? How does that change how we understand the data?

For example, if a disaster relief organization uses phone calls to understand where the most critical needs are, they’re going to miss areas that have lost cell service. Similarly, the more we use technology related to smartphones, the more we’re limiting our sample to people who use smartphones. This happens in science as well. For a long time the importance of RNA was underestimated because it degrades quickly, which made it harder to capture in experimental settings. In social settings, if you’re looking at wage data, it’s inherently limited to those who have jobs. So, if unemployment rates are changing, that will likely have secondary effects on the wage data through who is or isn’t included in the sample. When we do polls, are we only calling people who have landline phones? Or who are at home?
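To make the wage example concrete, here is a minimal sketch in Python. The workers and their wages are made up purely for illustration, and `observed_mean_wage` is a hypothetical helper, not a real statistical method. It only shows how an average can move when the sample changes, even though no individual wage does.

```python
from statistics import mean


def observed_mean_wage(workers):
    """Mean wage over employed workers only, as a wage dataset would record it."""
    return mean(w["wage"] for w in workers if w["employed"])


# Hypothetical workforce; the hourly wages are invented for illustration.
workers = [
    {"wage": 10, "employed": True},
    {"wage": 12, "employed": True},
    {"wage": 30, "employed": True},
    {"wage": 40, "employed": True},
]

before = observed_mean_wage(workers)

# Assumed downturn: the two lowest-paid workers lose their jobs.
# Nobody's wage changes; only who appears in the data changes.
after_workers = [dict(w, employed=w["wage"] >= 30) for w in workers]
after = observed_mean_wage(after_workers)

print(before, after)
```

In this toy case the observed average rises from 23 to 35 while nobody got a raise. The people who dropped out of the data are the whole story.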

The “data that is missing” is particularly important because rarely are we able to get data that’s exactly representative of what we want. Often we have to infer what we’re interested in from related datasets. This is often a good thing, but we need to be aware of the assumptions that we’re making in those leaps.

These aren’t new ideas. But, we don’t talk about them enough when we report data or build dashboards. Just as we should ask who the source is for written information, we should ask where our data comes from and what isn’t included. And, when we report data in writing or in visualizations, we should include our sources just as we would when quoting a person in a news article.

Integrity is critical to good data analysis

Good data analysis is much harder than weak data analysis, yet that difference is not necessarily obvious in the final product.

Good data analysis requires checking assumptions, writing tests, going down false paths, and many other things that take time but don’t show up in the final product. It requires a deeper understanding of both the methods you’re using and where the data is coming from.

It’s more difficult, takes longer, and requires more technical and problem solving skills.

Beyond that, it often requires personal strength and conviction when it means presenting results that you know your audience doesn’t want to hear. And, often the audience is also the person paying you, giving you a contract, or who will be judging your work.

I think a lot of people who do analysis or lead analyst teams think about this deeply. They know the pain of spending a day, a week, or months checking corner cases, confirming assumptions, or exploring directions that end up being uninteresting. But, how do we build a culture around valuing good data analysis and welcoming information even if it’s not the answer we want to hear?

Tech is not new

I was quite inspired by Eyeo’s choice to kick off the conference with Frieder Nake and Lillian Schwartz, both of whom were pioneers of the field and were working on computer-generated art in the 1960s. Similarly, I enjoyed Bret Victor’s talk about the Future of Programming, given as if it were 1973, which highlighted many innovative works from the 1960s and early 1970s.

The tech scene often worships youth, the new, and the future. It’s too often about the next new thing and the innovator who got a huge IPO before they were 30. It’s also about a sense of possibility and optimism. But, I think it’s important to have the humility to realize that we’re part of a rich history. And, that makes us stronger. Understanding this history can help us learn from experience and build a better future. It can even remind us of the value of curiosity, of belief in trying things without knowing what they will become, and of pushing boundaries.

Written on October 12, 2014