Big Data and Deep Data

I'm officially done with my dissertation: it's been handed in to my committee and I couldn't change anything even if I wanted to. This puts me in an odd position. For the past 24 months, most of my days were spent working on my dissertation: analyzing my interviews, outlining my ideas, writing or editing. Being done has left a pretty big hole in my daily schedule. I've started work on a few other projects to fill this gap, projects that have me working with entirely different types of data than my dissertation did. My dissertation research was interview-based. I conducted 110 interviews, which produced something like 70 hours of tape and over 3,000 pages of transcripts. I have lots of detail on the 80 entrepreneurs I talked to. I know how and why they started their companies, how they raised money from investors or why they've avoided it, the challenges they've faced and what they did to overcome them, and whether they've networked with other entrepreneurs and what they talked about.

This data is amazingly deep, but in the grand scheme of things it's very small. I talked to about a third of the high-tech entrepreneurs in each city who happened to be listed in the business directory I used. So when I found really cool things in my interviews, like the fact that most entrepreneurs in Waterloo actively searched through their own social networks to find mentors while those in Ottawa mostly relied on their parents or former business partners for business advice, it's hard to say whether this is true for everyone in the city or just a coincidence. There are a few statistical tests that try to separate what's real from what's an illusion, but they can only go so far.
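As a rough illustration of what such a test looks like, here is a minimal sketch of Fisher's exact test for a 2x2 table, a common choice for small samples. It is not necessarily one of the tests used in the dissertation, and the counts in the example are made up purely for illustration; the function name `fisher_exact_two_sided` is likewise hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability, under the null hypothesis of no association,
    of seeing a table at least as unlikely as the observed one, holding
    the row and column totals fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Made-up counts, for illustration only:
# rows = city, columns = (found mentor through own network, did not).
p_value = fisher_exact_two_sided(12, 3, 5, 10)
print(f"p = {p_value:.4f}")
```

A small p-value would suggest the difference between the two cities is unlikely to be a coincidence, but with samples this small the test has little power, which is the sense in which such tests only go so far.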

The new project I'm working on gives me access to fantastic datasets about innovation and economic development in Canada. This includes the famous Dun and Bradstreet directory, which is the biggest dataset I've ever played with. Clocking in at 1.5 gigabytes, it contains information on more than 1.5 million Canadian firms. I would consider this to be on the very small end of 'big data.' For someone studying entrepreneurship, this is a godsend. I can now tell you, for instance, that between 2001 and 2006 there were 669 new high-tech firms founded in Toronto* and that the average sales of these firms are around $360,000. I can also make really cool pictures like this one, which shows that there is a positive relationship between the proportion of immigrants in a region and the proportion of high-tech firms in every province except Saskatchewan and New Brunswick.

But as I work more and more with this data, I'm beginning to see its limitations. I know something about a whole lot of firms, but I don't know much about any one of them. With the D&B data, I essentially know a firm's name, its address, what year it was founded, what industry it's in, how many employees it has and an estimate of its sales. In aggregate, these data can tell me many things: which regions have the most startups, which industries seem to grow the fastest, what the relationship is between workers and sales across the entire country. But the data also raise lots of questions that they can never answer.
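To make the shape of this kind of data concrete, here is a minimal sketch of the aggregate questions it can answer. The `Firm` fields mirror the ones listed above, but the field names, the helper functions and every record are invented for illustration; none of this reflects the actual D&B file layout.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Firm:
    # Hypothetical field names; the real directory's schema may differ.
    name: str
    city: str
    year_founded: int
    industry: str
    employees: int
    est_sales: float


# Made-up records, for illustration only.
firms = [
    Firm("Alpha Widgets", "Toronto", 2002, "software", 4, 300_000),
    Firm("Beta Labs", "Toronto", 2005, "software", 10, 900_000),
    Firm("Gamma Consulting", "Ottawa", 2003, "consulting", 1, 120_000),
    Firm("Delta Systems", "Waterloo", 1998, "hardware", 25, 2_000_000),
]


def startups_by_city(records, start_year, end_year):
    """Count firms founded in each city within an inclusive year range."""
    return Counter(
        f.city for f in records if start_year <= f.year_founded <= end_year
    )


def sales_per_employee_by_industry(records):
    """Total estimated sales divided by total employees, per industry."""
    totals = defaultdict(lambda: [0.0, 0])  # industry -> [sales, employees]
    for f in records:
        totals[f.industry][0] += f.est_sales
        totals[f.industry][1] += f.employees
    return {
        industry: sales / staff
        for industry, (sales, staff) in totals.items()
        if staff > 0
    }


print(startups_by_city(firms, 2001, 2006))
print(sales_per_employee_by_industry(firms))
```

Everything here is a count or a ratio over a handful of columns. Nothing in a record says why a firm was founded or why it is located where it is.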

Looking at one record at random, I know that Bait Consulting Inc. of Thornhill is a consulting company that was formed in 2001 and has one employee and an estimated $120,000 in sales. But unlike in my dissertation research, I don't know anything more. I don't know why the company was founded, and I don't know why it was founded in Thornhill instead of Toronto or Mississauga or Cambridge. I don't know how its founder learns about the market or finds new customers. It's difficult to figure out from this data whether a government policy is working, or how an entrepreneur is affected by where they live.

That's the big difference between big data and what I'd call deep data. Big data can tell you a small number of things about a whole lot of things. You can do a whole lot with this, but you always need to be aware of what it's not telling you. Only so many different questions can be asked on surveys: the more you ask, the fewer people will respond.

Qualitative data collected through long, semi-structured interviews is deep data. I know a lot about the people I talked to. Not everything, and many of the responses are biased by the respondents wanting me to think they are really skilled entrepreneurs. But I know more than a binary variable: I know what they did, why they did it, and what resulted. I can understand the practices they used to start and grow their firms and relate those back to their larger cultural context. But again, there's that tradeoff: I know a lot about a very small number of people. And I have it easy. People doing ethnography or observational research will have hundreds and hundreds of hours of recordings or notes about an even smaller range of people.

It would be nice to think that we can meet in the middle, but working with big qualitative datasets requires a totally different set of skills than working with big quantitative datasets. Very few people are equally able to produce a grounded analysis of a collection of interviews and a Bayesian analysis of a census dataset. But there is value in each, and the challenge is figuring out the right way to collect data to solve a problem. The Platonic ideal is for quantitative and qualitative data to be used together to prove a larger point, but this kind of research is expensive and rare. Still, it might be the only way to get a real sense of what's going on in the world around us.

*This seems really low to me, and I'm already working with librarians and others to figure out what proportion of all firms the D&B directory accounts for.