Finishing up and starting again

I haven't posted for quite a long time, but I do have the best excuses in the world. I was busy defending my dissertation and interviewing for jobs! I'm happy to say that I defended successfully and am now a Doctor of Philosophy, and even more importantly, I've accepted a position as Chancellor's Fellow at the University of Edinburgh Business School. I'll be working on the development of entrepreneurial ecosystems and their relationship to firm strategy in Canada and the dusky moors and wrens of Scotland (I'm still developing my Scottish accent). And now that I'm an official Expert in Entrepreneurship, I'd like to say how much I agree with this article by Melba Kurman about the dark side of entrepreneurship policy. In our constant desire to boost technology entrepreneurship, we often forget that there's a large population of people who really can't benefit from this kind of entrepreneurship: people without the human capital to start or work in high-tech firms; poor people without the savings to endure the wait for revenues to start flowing in, or the low pay and high insecurity of startups; and single mothers unable to work the long hours these kinds of startups require.

More than that, I think we also may overestimate the actual economic development created by these kinds of firms. In the extreme, you have startups like Instagram, which had only 13 workers when it was acquired for a billion dollars. The value of internet companies is in their IP, not their capital or equipment. Even in the most fortuitous circumstances, when an internet startup gets all the VC investment and angels and invitations to TED talks, it may be worth a lot of money but employ very few people, and therefore have limited economic spillovers to the community.

There are exceptions to this. Miovision in Waterloo has all the sparkle of a UW technology spinoff (which it is), but employs a lot of people in manufacturing and maintenance, as well as in engineering and development. However, companies like this don't fit well into the existing accelerator to incubator to VC pipeline many technology entrepreneurship programs are implicitly designed around.

Big Data and Deep Data

I'm officially done with my dissertation: it's been handed in to my committee and I couldn't change anything, even if I wanted to. This puts me in an odd position: for the past 24 months most of my days were spent working on my dissertation, either analyzing my interviews, outlining my ideas, writing or editing. Being done with this has left a pretty big hole in my daily schedule. I've started work on a few other projects to fill this gap, projects that have me working with entirely different types of data than I used in my dissertation. My dissertation research was interview based. I conducted 110 interviews, which produced something like 70 hours of tape and over 3000 pages of transcripts. I have lots of detail on the 80 entrepreneurs I talked to. I know how and why they started their companies, how they raised money from investors or why they've avoided it, the challenges they've faced and what they did to overcome them, and whether they've networked with other entrepreneurs and what they talked about.

This data is amazingly deep, but in the grand scheme of things it's very small. I talked to about a third of the high-tech entrepreneurs in each city who were listed in the business directory I used. So when I found really cool things in my interviews, like the fact that most entrepreneurs in Waterloo actively searched through their own social networks to find mentors while those in Ottawa mostly relied on their parents or former business partners for business advice, it's hard to say whether this is something true for everyone in the city or just a coincidence. There are a few statistical tests to try to figure out what's real and what's an illusion, but they can only go so far.

The new project I'm working on gives me access to fantastic datasets about innovation and economic development in Canada. This includes the famous Dun & Bradstreet directory, which is the biggest dataset I've ever played with. Clocking in at 1.5 gigabytes, it contains information on more than 1.5 million Canadian firms. I would consider this to be on the very small end of 'big data.' For someone studying entrepreneurship, this is a godsend. I can now tell you, for instance, that between 2001 and 2006 there were 669 new high-tech firms founded in Toronto* and that the average sales of these firms is around $360,000. I can also make really cool pictures like this, which shows that there is a positive relationship between the proportion of immigrants in a region and the proportion of high-tech firms in every province except Saskatchewan and New Brunswick.
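To give a sense of how simple these queries are, here's a rough pandas sketch. The file name and column names are hypothetical stand-ins, not the actual layout of the D&B data.

```python
# Sketch of the Toronto query described above, assuming the directory has
# been exported to a CSV with (hypothetical) city/industry/year_founded/sales columns.
import pandas as pd

firms = pd.read_csv("dnb_canada.csv")  # made-up file name for the exported directory

toronto_hightech = firms[
    (firms["city"] == "Toronto")
    & (firms["industry"] == "high tech")
    & firms["year_founded"].between(2001, 2006)
]

print(len(toronto_hightech))             # how many new high-tech firms
print(toronto_hightech["sales"].mean())  # their average reported sales
```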

But as I work more and more with this data, I'm beginning to see its limitations. I know a few things about a whole lot of firms, but I don't know much about any one of them. With the D&B data, I essentially know a firm's name, its address, what year it was founded, what industry it's in, how many employees it has and a guess at its sales. In aggregate, these data can tell me many things: which regions have the most startups, which industries seem to grow the fastest, what the relationship is between workers and sales across the entire country. But it also raises lots of questions that the data can never answer.
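Those aggregate questions are the same kind of one-liners, continuing the hypothetical frame from the sketch above (region and employees are also invented column names):

```python
# Aggregate views of the same hypothetical firms table.
startups_by_region = firms.groupby("region").size()                 # which regions have the most firms
employment_by_industry = firms.groupby("industry")["employees"].sum()  # a crude proxy for industry size
workers_vs_sales = firms["employees"].corr(firms["sales"])          # employees vs. sales, country-wide

print(startups_by_region.sort_values(ascending=False).head())
print(workers_vs_sales)
```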

Looking at one record at random, I know that Bait Consulting Inc. of Thornhill is a consulting company that was formed in 2001, has one employee and has an estimated $120,000 in sales. But unlike in my dissertation research, I don't know anything more. I don't know why the company was founded, or why it was founded in Thornhill instead of Toronto or Mississauga or Cambridge. I don't know how its founder learns about the market or finds new customers. It's difficult to figure out from this data whether a government policy is working, or how an entrepreneur is affected by where they live.

That's the big difference between big data and what I'd call deep data. Big data can tell you a small number of things about a whole lot of things. You can do a whole lot with this, but you always need to be aware of what it's not telling you. Only so many different questions can be asked on a survey: the more you ask, the fewer people will respond.

Qualitative data collected through long, semi-structured interviews is deep data. I know a lot about the people I talked to. Not everything, and many of the responses are biased by the respondent wanting me to think they are a really skilled entrepreneur. But I know more than a binary variable: I know what they did, why they did it, and what that has caused. I can understand what practices they used to start and grow their firms and relate those back to their larger cultural context. But again, there's that tradeoff: I know a lot about a very small number of people. And I have it easy; people doing ethnography or observational research will have hundreds and hundreds of hours of recordings or notes about an even smaller range of people.

It would be nice to think that we can meet in the middle, but working with big qualitative datasets requires a totally different set of skills than working with big quantitative datasets. Very few people are equally able to produce a grounded analysis of a collection of interviews and a Bayesian analysis of a census dataset. But there is value in each, and the challenge is figuring out the right way to collect data to solve a problem. The Platonic ideal is for quantitative and qualitative data to be used together to prove a larger point, but this kind of research is expensive and rare. It might be the only way, though, to get a real sense of what's going on in the world around us.

*This seems really low to me, and I'm already working with librarians and others to figure out what proportion of all firms the D&B directory accounts for.

Using technology in the qualitative social sciences - I

So I'm a geek. This means that I have a peculiar relationship with technology. Despite much evidence to the contrary, I see technology as a source of all things good and pure in this world. However, I'm also a social scientist who uses primarily qualitative methods: I interview entrepreneurs and investors and, through those interviews, try to better understand the connections between their actions and the cultural and economic environments they're embedded in (yes, I assure you this is geography. There will be a map.) The data I collect is fairly voluminous: the 109 interviews I've conducted take up 70 hours of tape and come out to about 1200-1300 pages. In addition to this, I've read many hundreds and hundreds of articles, reports and working papers, and even took some notes on a few. All this means that I have several thousand megabytes, representing some tens of thousands of pages, of very, very messy data. This is beyond what can be usefully comprehended by my brain.

My philosophy for using technology to deal with this messy situation is to recognize that I'm good at some things and that my computer is good at other things. The computer, with its gigabytes and gigahertz, is fantastically good at keeping track of things. With a bit of high-tech processing, it can be pretty good at finding connections between things. But the computer is less good at figuring out whether those things are important and what they contribute to a bigger picture. For that you need the human brain with its analog processing.

My term for this is turning dumb data into smart data. By this I mean that the data has to be converted from its original format into a format that can be interpreted and used by a computer program. This can range from something completely automatic, like OCRing a scan of an article so that the text can be understood by a computer, to manually coding and classifying an interview so it can be used by Qualitative Analysis Software (QAS) like Dedoose or NVivo. The advantage of smart data is that it can be analyzed by both a computer and a human, each doing what they do best, producing better research faster.
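The automatic end of that spectrum really is just a few lines of code. Here's a minimal sketch of OCRing a scanned page with the open-source Tesseract engine, via the pytesseract and Pillow packages; the file names are made up.

```python
# Turn a scanned page (dumb pixels) into searchable text (smart data).
from PIL import Image
import pytesseract

scan = Image.open("scanned_article_page.png")   # hypothetical scan of a journal page
text = pytesseract.image_to_string(scan)        # OCR: image -> plain text

with open("scanned_article_page.txt", "w") as f:
    f.write(text)                               # now the computer can search it
```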

In this post, I'm going to focus mainly on DevonThink, a document management program. I am not so much a fan of DevonThink as I am completely and hopelessly dependent on it. Despite the fact that my database is backed up in 6 different locations, including on-site, near-site, off-site, and in The Cloud, and I'm pretty sure it could survive a limited nuclear war, I would most likely drop out of grad school if the database got corrupted beyond recovery. Basically, DevonThink is a document manager, like Yojimbo, Papers, or even a folder with a bunch of PDFs in it. But what separates DevonThink from the others is its ability to generate a limited understanding of a document and do two things: (1) suggest other documents that are similar, and (2) suggest what folder a new document belongs in. I don't use the first feature too much, since finding links between articles is something better suited to the human brain than the cold, robotic logic of a computron, but the second feature is invaluable.
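To make the folder-suggestion idea concrete: I have no idea what DevonThink actually does under the hood, but a toy version of the concept is to compare a new document's word profile against the documents already filed in each folder. Here's a sketch along those lines; the documents and folder names are invented.

```python
# Toy folder suggestion: TF-IDF word profiles + cosine similarity.
# Emphatically NOT DevonThink's algorithm, just an illustration of the idea.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

filed_docs = [
    "habitus, capital and field in economic life",      # already filed in "Bourdieu"
    "regional venture capital and angel investment",    # already filed in "Finance"
]
folders = ["Bourdieu", "Finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(filed_docs)

new_doc = vec.transform(["Bourdieu, habitus and field in entrepreneurship research"])
scores = cosine_similarity(new_doc, X).ravel()   # similarity to each filed document
print("Suggested folder:", folders[scores.argmax()])
```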

The thing about any pile of documents, whether made of actual paper or of tiny bits on a hard drive, is that it quickly gets unmanageable. My RSS reader subscribes to 39 journals, and I get e-mail updates for several working paper series, along with my regular trolling for interesting articles. Even if I only add a few new articles per month (there are months with very few additions, but when I'm starting a new project I can easily add several dozen a day), this still builds up. As close as I can estimate, my DevonThink database contains about 625 academic articles and notes on about 500 of those. These papers are thrown into folders that are largely created on demand as I begin new projects and explore new topics. To give you an example, here is part of the folder structure for research relating to my dissertation on culture and entrepreneurship:

  • Culture
    • Bourdieu
      • Bourdieu and Geography
      • Bourdieu and Entrepreneurship
    • Cultural Turn
    • Defining Culture
    • Economic Geography and Culture
    • Institutional Economic Geography
    • Markusen Debate
    • Mitchell Debate
  • Management and Culture
  • Ethnic and National Entrepreneurship
    • Hofstede Debate
  • Family Entrepreneurship
  • And this list goes on for another 20 or so lines. It turns out I have a lot of folders I forgot about.

The point of all that is to show that even though I started my dissertation research by sitting down and trying to make logical, sensible categories, all those plans got thrown out the window once I started working on papers and proposals. You start exploring avenues you hadn't thought of, and your organization system gets increasingly complicated. DevonThink's ability to suggest which folder a paper best fits in is a lifesaver, because it avoids the eternal purgatory that is the 'other' or 'read later' folder or pile. Even if I don't get around to reading a paper immediately, just being in a topical folder means that I see it when I'm looking at what I've read about the subject, and being seen means being read.

The bigger problem is that as my filing system expands to fit my needs, it becomes much easier to lose track of papers. There is no way to keep the details of the several hundred papers I've read or downloaded in my head, and it's very easy to forget about papers. If a paper gets misfiled, in all likelihood I'll never see it again; if I do happen across it while looking for something else, I'll ignore it because it wasn't what I was looking for. Even with well organized, topical folders, it's easy to miss the one paper that would provide the exact citation I need, because it's just one document among many.

For this, DevonThink's search function is a lifesaver. If I remember just one snippet of text, one term that was used in a paper, I can find it, no matter how many years ago I read it or how deeply it's buried in the file structure. It's a snap to call up all the papers I have by a single author, quickly scan through them for keywords, and identify the ones I need. DevonThink has some kind of statistical AI, so it's not just looking for how often a word like "institutions" is used, but in what context and whether it's near the other words I'm searching for. Unanalyzed, the words in a document (or in an unscanned book or un-OCRed PDF) are dumb: sure, I can read them, but the computer can't. It can't do anything with them. But when you start using a computer program that is designed to do something with those words, suddenly you have smart data that can be used in all sorts of ways.
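That's what I mean by the words becoming smart: once the text is machine-readable, even a crude little script can sweep a whole folder of papers for a term. This is just a stand-in to illustrate the point, not a description of how DevonThink searches, and the folder name is made up.

```python
# Minimal "find every paper that mentions a term" sweep over plain-text files.
from pathlib import Path

def find_papers(folder: str, term: str) -> list[str]:
    """Return paths of .txt files under `folder` whose text contains `term`."""
    term = term.lower()
    return [
        str(p) for p in Path(folder).rglob("*.txt")
        if term in p.read_text(errors="ignore").lower()
    ]

print(find_papers("papers", "institutions"))   # hypothetical folder of OCRed papers
```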

Now, it would be theoretically possible to do all of this through the Finder in OS X, or even with actual, physical folders in my actual, physical desk. There would be a few things that would be harder (DevonThink lets you make duplicates of documents, so that any changes are reflected in all of the duplicates, and put them in different folders; this would be hard to do in the Finder and impossible with physical files), but it's feasible, and thousands of people do it this way. But the point is that DevonThink lets me take advantage of my computer's ability to almost instantaneously scan through a document and compare it with others in a way that my brain can't. A computer is a perfect tool for organizing documents. It can't do everything; it can't define the scope of a project or know how *I* want to organize a project based on *my* needs, but I can tell it how to do these things.

That's the essence of my thoughts on smart data. Smart data means my computer can do something with the data. It can make my life easier, helping me find exactly what I need more quickly than if it were dumb. The computer makes life easier for me in the same way that spellcheck does: it can do something better than I can, so I let it do that.

A little something from the lab

I haven't posted much; a drive from Calgary to Toronto was followed by throwing myself into the dissertation work. I'll try to do more short posts instead of fewer longer ones. Here's a little thing that I dug up for a class I'm prepping for next semester called "New Economic Spaces." This is a graph of trademark registrations in the United States from 1883-2009, drawn from WIPO data. It's a fairly amazing dataset, with information from over 100 countries, but the spreadsheet is laid out in such a way as to make importing it into a GIS very annoying, so no pretty map this time. Anyway, here is a good example of the rising importance of symbolic content in the valuation of commodities. Want to know more? Plenty of time to apply to U of T and enrol in the class.
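For the curious, the graph itself is trivial to redraw once the spreadsheet has been wrestled into a simple two-column table. This sketch assumes a cleaned-up CSV of years and registration counts, which is emphatically not how WIPO ships the data; the file and column names are mine.

```python
# Rough sketch of the time-series plot, assuming a pre-cleaned extract.
import pandas as pd
import matplotlib.pyplot as plt

tm = pd.read_csv("wipo_us_trademarks.csv")   # hypothetical: columns "year", "registrations"

plt.plot(tm["year"], tm["registrations"])
plt.xlabel("Year")
plt.ylabel("US trademark registrations")
plt.title("Trademark registrations, 1883-2009 (WIPO)")
plt.show()
```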

Finally there

Well, it took 1 year, 2 months and 23 days, but I finally finished my PhD fieldwork. Here are some stats: 109 interviews; that's 80 entrepreneurs, 13 economic development officials, 4 angel investors, 7 bankers and 7 venture capitalists.

Average length of interview: 40 minutes and 23 seconds.

Total tape collected: 69 hours and 40 minutes.

Shortest interview: 20 minutes, 35 seconds.

Longest interview: 78 minutes, 6 seconds.

Interviews delayed due to earthquake: 1 (I didn't feel it, but they evacuated the building).

Now I've got till next May to go through all of this and write a dissertation, before they stop paying me.