Crowdsourced Material | Richard Carlin: Digital Humanities

Crowdsourcing lets archives that provide online access to scanned documents and other materials enlist the general public in performing tasks that would usually be handled by in-house staff. Currently, this work focuses primarily on transcribing digital documents or annotating digital scans. Crowdsourcing has several advantages: it draws new users to a site; expands the “reach” of a collection beyond its core users in the academic or specialist community; and helps achieve the goal of efficiently transcribing or otherwise annotating documents or other digital images without the investment of hiring outside freelancers or enlisting current staff.

For our assignment, we looked primarily at sites that use crowdsourcing to transcribe handwritten documents (The Collected Works of Jeremy Bentham and Papers of the War Department), correct machine-generated transcriptions (Trove), and annotate digital imagery (NYPL Building Inspector). The findings of all these sites were remarkably consistent: most of the work is done by a small coterie of “power users,” and users tend to be highly educated, retired, and driven by a sense of serving the “common good.” While many institutions initiate a crowdsourcing project because they lack the budget or manpower to do the work on their own, I thought it was telling that the one site that evaluated the cost benefits of crowdsourcing (the Jeremy Bentham papers) found that the money spent on hiring two project managers to oversee the work of the volunteer transcribers could have been better spent on having the two managers perform the transcription work themselves. Of course, this doesn’t factor in the cost of having outside editors review the work of the two managers, which would also be necessary.

Indeed, crowdsourcing sites must be carefully designed to be easy to use, with few barriers to participation; otherwise, few volunteers will complete the work. Further, full-time curators are needed to assist the volunteers, which, as the Bentham experience shows, is not inexpensive. Site management and design can be quite costly to implement, and there is not much information yet on the long-term benefits of this approach. Will people continue to be engaged with a site for a sufficiently long period to transcribe what are often massive amounts of papers or annotate a great number of digital scans?

My own main motivation to participate in these activities was an interest in the content itself. The task of transcribing is fairly tedious, and the handwritten documents are difficult to read. Then again, those who are fascinated by the subject matter are willing to perform what can be time-consuming work. I am personally skeptical of the thinking behind NYPL’s Building Inspector project that individuals will use their spare time waiting on line to correct the tracing of the footprints of buildings on old fire insurance maps. This is not the kind of engaging “gamification” that one finds in Candy Crush or similar addictive apps and websites. It will be interesting to see over time if enough material is reviewed to achieve the project’s goals.

Although most people use Wikipedia like a traditional encyclopedia to answer factual questions, not many are familiar with how each entry is created and what this may mean for its accuracy, bias, and reliability. Many have heard the term “crowdsourcing” but may not understand that it can have different meanings depending on the formal and informal rules and regulations used in its implementation.

Although Wikipedia was founded on the idea of “crowdsourcing” (that each entry would be written and revised by its users), there is a good deal of policing of the site by a group of editors and guardians who enforce organizational rules that have evolved over time. There is also considerable vigilance in weeding out spammers and those promoting a specific bias or point of view, particularly those who may be promoting their own work. This has led to controversy, as some newer users accuse the “old guard” of limiting their contributions. Site editors can even block users from the site if the editors feel those users are not following the rules.

Nonetheless, Wikipedia does offer a good deal of transparency into the editorial process, mostly through the ability to examine the “History” of each entry. Taking the Digital Humanities entry as an example (https://en.wikipedia.org/wiki/Digital_humanities), the user can track the history of the entry back to its origins in 2006, when it was begun by Elijah Meeks, a DH librarian at Stanford University. Each change made over time can be examined individually, with the ability to compare the changed text with the previous version. This is especially illuminating in this entry, as it shows how (not surprisingly in a new field) the definition of what constitutes DH has expanded over time, leading the field to embrace many new types of projects and approaches.

Another key feature of Wikipedia is that you can find out the background of many of the contributors by clicking on their names in the history tab. Some choose not to create a biography or never “log in” as users, but many at least offer a generic biography that points to their background. Not surprisingly, in this field that is dominated by academic discourse, most of the major content providers to this entry are academics themselves who work in the DH field.

Another key feature is the requirement that all factual information be sourced. The DH entry offers 94 footnotes and an extensive bibliography. This encourages the reader to go beyond this entry to engage more fully in the discussions and debates in the field.

This type of analysis is most appropriate for those who are seeking to expand their study of a topic beyond the basic “just the facts” approach that is offered by Wikipedia, and indeed by any encyclopedia. Encyclopedias are best for answering basic factual questions (although even simple facts like birth dates can be contested) but are not the be-all and end-all of research.

