So I'm a geek. This means that I have a peculiar relationship with technology: despite much evidence to the contrary, I see it as the source of all things good and pure in this world. However, I'm also a social scientist who works primarily with qualitative methods. I interview entrepreneurs and investors, and through those interviews I try to better understand the connections between their actions and the cultural and economic environments they're embedded in (yes, I assure you this is geography; there will be a map).

The data I collect is fairly voluminous: the 109 interviews I've conducted take up 70 hours of tape and come out to about 1,200 to 1,300 pages. On top of that, I've read many hundreds of articles, reports, and working papers, and even took notes on a few. All this means I have several gigabytes of very, very messy data, representing tens of thousands of pages: far more than my brain can usefully comprehend.

My philosophy for using technology to deal with this messy situation is to recognize that I'm good at some things and that my computer is good at others. The computer, with its gigabytes and gigahertz, is fantastically good at keeping track of things. With a bit of high-tech processing, it can be pretty good at finding connections between things. But it is less good at figuring out whether those things are important and what they contribute to a bigger picture. For that you need the human brain, with its analog processing.
My term for this is turning dumb data into smart data. By this I mean that the data has to be converted from its original format into one that can be interpreted and used by a computer program. This ranges from the completely automatic, like OCRing a scanned article so a computer can read its text, to the entirely manual, like coding and classifying an interview so it can be used by Qualitative Analysis Software (QAS) such as Dedoose or NVivo. The advantage of smart data is that it can be analyzed by both a computer and a human, each doing what they do best, producing better research faster.
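To make the idea concrete, here's a toy sketch of what "smart" interview data looks like once it's been coded: excerpts tagged with analytic codes that a program can filter, which is roughly what QAS packages let you do. The excerpts, codes, and the `by_code` helper are all invented for illustration, not taken from any real package.

```python
# Toy "smart data": interview excerpts tagged with analytic codes so that
# a program can filter them. All excerpts and codes here are invented.
segments = [
    {"interview": 12, "codes": {"risk", "funding"},
     "text": "We almost ran out of money in the first year."},
    {"interview": 12, "codes": {"culture"},
     "text": "Everyone here has a side project; failure is normal."},
    {"interview": 47, "codes": {"funding"},
     "text": "Local angels only invest in people they already know."},
]

def by_code(segments, code):
    """Return every excerpt tagged with the given code."""
    return [s["text"] for s in segments if code in s["codes"]]

print(by_code(segments, "funding"))
```

The same excerpts as plain transcript text would be "dumb": readable by me, but opaque to any program trying to pull together everything interviewees said about funding.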
In this post, I'm going to focus mainly on DevonThink, a document management program. I am not so much a fan of DevonThink as I am completely and hopelessly dependent on it. My database is backed up in six different locations, including on-site, near-site, off-site, and in The Cloud, and I'm pretty sure it could survive a limited nuclear war; even so, I would most likely drop out of grad school if it got corrupted beyond recovery. At its core, DevonThink is a document manager, like Yojimbo, Papers, or even a folder with a bunch of PDFs in it. What separates it from the others is its ability to build a limited understanding of a document and do two things: (1) suggest other documents that are similar, and (2) suggest which folder a new document belongs in. I don't use the first feature much (finding links between articles is better suited to the human brain than to the cold, robotic logic of a computron), but the second is invaluable.
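DevonThink doesn't publish how its classifier works, but the folder-suggestion idea can be sketched as a bag-of-words comparison: pool the text already filed in each folder, and suggest the folder whose word profile is most similar to the new document. This is a toy stand-in for illustration only, not DevonThink's actual algorithm, and the folder names and texts below are made up.

```python
from collections import Counter
import math

def tokens(text):
    # Crude tokenizer: lowercase words, letters only
    return [w for w in text.lower().split() if w.isalpha()]

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two word-count vectors
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def suggest_folder(folders: dict, new_doc: str) -> str:
    """Pick the folder whose pooled text is most similar to the new document."""
    doc_vec = Counter(tokens(new_doc))
    return max(folders,
               key=lambda f: cosine(Counter(tokens(" ".join(folders[f]))), doc_vec))

# Hypothetical mini-library: folder name -> texts already filed there
folders = {
    "Bourdieu": ["bourdieu habitus field cultural capital practice"],
    "Family Entrepreneurship": ["family firm succession small business founders"],
}
print(suggest_folder(folders, "habitus and cultural capital in economic practice"))
# -> Bourdieu
```

A real classifier would weight rare words more heavily and handle ties and empty folders, but even this crude version captures why the suggestions get better as a folder fills up: more filed text means a sharper word profile to match against.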
The thing about any pile of documents, whether made of actual paper or of tiny bits on a hard drive, is that it quickly becomes unmanageable. My RSS reader subscribes to 39 journals, I get e-mail updates from several working-paper series, and I regularly troll for interesting articles besides. Even if I add only a few new articles per month (there are months with very few additions, but when I'm starting a new project I can easily add several dozen a day), it builds up. As best I can estimate, my DevonThink database contains about 625 academic articles and notes on about 500 of those. These papers are thrown into folders that are largely created on demand as I begin new projects and explore new topics. To give you an example, here is part of the folder structure for research relating to my dissertation on culture and entrepreneurship (I can't get nested lists to work properly, so take these as folders and subfolders):
- Culture
- Bourdieu
- Bourdieu and Geography
- Bourdieu and Entrepreneurship
- Cultural Turn
- Defining Culture
- Economic Geography and Culture
- Institutional Economic Geography
- Markusen Debate
- Mitchell Debate
- Management and Culture
- Ethnic and National Entrepreneurship
- Hofstede Debate
- Family Entrepreneurship
- And this list goes on for another 20 or so lines. It turns out I have a lot of folders I forgot about.
The point of all that is to show that even though I started my dissertation research by sitting down and trying to make logical, sensible categories, all those plans got thrown out the window once I started working on papers and proposals. You start exploring avenues you hadn't thought of, and your organization system gets increasingly complicated. DevonThink's ability to suggest which folder a paper best fits is a lifesaver, because it avoids the eternal purgatory that is the 'other' or 'read later' folder or pile. Even if I don't get around to reading a paper immediately, just being in a topical folder means I see it whenever I look at what I've read on the subject, and being seen means getting read.
The bigger problem is that as my filing system expands to fit my needs, it becomes much easier to lose track of papers. There is no way to keep the details of several hundred papers in my head, and it's very easy to forget about one entirely. If a paper gets misfiled, in all likelihood I'll never see it again; if I do happen across it while looking for something else, I'll ignore it because it wasn't what I was looking for. Even with well-organized, topical folders, it's easy to miss the one paper that would provide the exact citation I need, because it's just one document among many.

For this, DevonThink's search function is a lifesaver. If I remember just one snippet of text, one term that was used in a paper, I can find it no matter how many years ago I read it or how deeply it's buried in the file structure. It's a snap to call up all the papers I have by a single author, quickly scan through them for keywords, and identify the ones I need. DevonThink has some kind of statistical AI, so it's not just looking at how often a word like "institutions" is used, but at the context it appears in and whether it's near the other words I'm searching for. Unanalyzed, the words in a document (or in an unscanned book or un-OCRed PDF) are dumb: sure, I can read them, but the computer can't do anything with them. But when you start using a program that is designed to do something with those words, suddenly you have smart data that can be used in all sorts of ways.
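The "near other words" part of that kind of search can be sketched as a simple proximity query: find a term only where it appears within a few words of another term. This is a toy illustration of the idea, not DevonThink's search engine, and the example sentence is invented.

```python
import re

def proximity_hits(text, term, near, window=10):
    """Find occurrences of `term` within `window` words of `near`;
    return the surrounding snippet for each hit."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, w in enumerate(words):
        if w == term:
            lo, hi = max(0, i - window), i + window + 1
            if near in words[lo:hi]:
                hits.append(" ".join(words[lo:hi]))
    return hits

doc = ("Formal institutions shape entrepreneurship, while informal "
       "institutions operate through culture and everyday practice.")
print(proximity_hits(doc, "institutions", "culture"))
```

A plain keyword search would return every paper that mentions "institutions" at all; a proximity query like this only surfaces the passages where it shows up alongside the concept you actually care about.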
Now, it would theoretically be possible to do all of this through the Finder in OS X, or even with the actual, physical folders in my actual, physical desk. A few things would be harder (DevonThink lets you make replicants of a document, copies that live in several folders at once and reflect any change made to one of them; that would be hard to do in the Finder and impossible with physical files), but it's feasible, and thousands of people work this way. The point, though, is that DevonThink lets me take advantage of my computer's ability to almost instantaneously scan through documents and compare them with others in a way my brain can't. A computer is a perfect tool for organizing documents. It can't do everything: it can't define the scope of a project or know how *I* want to organize it based on *my* needs, but I can supply that direction and let it handle the rest.
That's the essence of my thoughts on smart data. Smart data means my computer can do something with the data: it can make my life easier and help me find exactly what I need faster than if the data were dumb. The computer makes my life easier in the same way spellcheck does: it can do certain things better than I can, so I let it.