HTRC UnCamp 2018 has ended

Welcome to the fourth iteration of the semi-annual HathiTrust Research Center (HTRC) UnCamp. This is where members of the HTRC community gather to explore the latest developments in using HTRC tools and services to analyze the HathiTrust Digital Library corpus. Visit https://www.hathitrust.org/htrc_uncamp2018 for more information or see our online proceedings at https://osf.io/view/htrc_uncamp2018 hosted by OSF Meetings.

Thursday, January 25 • 3:00pm - 4:00pm
HTRC Advanced Collaborative Support (ACS) Awardee Project Panel

Session Moderator: Eleanor Dickson

The impact of OCR quality on natural language processing (David Bamman)
The rise of large-scale digitized book collections such as the HathiTrust is enabling a fundamentally new kind of text analysis that exploits the scale of collections to ask questions not possible with smaller corpora. One prerequisite for this work is high-quality optical character recognition (OCR), in which image scans of individual pages in a book are converted to text. While OCR errors can complicate even simple analyses of word frequency (as seen in Google Ngram data), they pose an even greater challenge for structured representations of language in NLP, such as part-of-speech tagging or syntactic parsing. In this short talk, I'll describe research into quantifying the impact of OCR quality in the HathiTrust on the quality of downstream NLP, and sketch out approaches for automatically assessing OCR accuracy without recourse to gold-standard reference transcriptions.

HathiTrust ACS Report: A Writer’s Workshop Workset with the Program Era Project (Loren Glass, Nick Kelly, Nicole White)
How can computer-assisted text analysis help us document and explore the history of Creative Writing programs in the University? This presentation looks at how the Program Era Project, a DH initiative at the University of Iowa, is working to build an online, public-facing database of information on the Iowa Writers’ Workshop, its writers, and their work. We will focus on the Project’s recent ACS collaboration with HathiTrust and how we plan to incorporate text analysis data from an HT-provided corpus of Workshop-affiliated writings into our larger database of institutional and biographical information on Workshop writers. In addition to providing background on the Project, the text analysis tools it has developed, and the necessities that led to the HathiTrust collaboration, we will discuss our progress (and impediments) with the collaboration and the next steps we plan to take in working with our HathiTrust data capsule.

Evaluating the History of the Chicago School: Why Supervised Algorithms? (Dan Baciu)
The history of the Chicago School has become digital, but what does this fact mean for data accessibility, research, and future dissemination? At the HathiTrust, the term "Chicago School" is found in over 100,000 books and periodicals covering the last two centuries. Can digital tools analyze this massive history of publication? The values of the Chicago Schools have been disseminated, translated, and transformed. The present work attempts a first computer-aided, systematic, and critical evaluation. The research has been supported by several institutions including the HathiTrust Research Center (HTRC), the Fulbright Program, and the Swiss National Science Foundation (SNF). Digital technology is implemented on three levels: analysis of the historic text data, interpretation of the results, and scholarly exchange.
Our collaboration with the HathiTrust Research Center and the Cognitive Computation Group at the University of Illinois provided sufficient data to evaluate the complete history of publication of the Chicago Schools. We succeeded in implementing a knowledge-based approach on a massive, previously unattempted scale and found that it offered significant advantages over using the unstructured data alone. From our large Chicago School corpus, we also built three additional datasets together with a framework for non-consumptive research which allows us to filter, classify, and cross-validate the results. Among the 2016 ACS projects, ours was the only one to rely primarily on a supervised, knowledge-based approach. Most other projects worked with unsupervised algorithms. This UnCamp presentation will focus on our choice of supervised algorithms and their methodological role within the framework for non-consumptive research.

Using contemporary technology to analyze historical social movements (Laura Nelson)
New methods in computational text and network analysis have opened up exciting possibilities to better understand the complex historical dynamics within large, diverse, and recurrent social movements such as women's movements, labor movements, and civil rights movements. The methods are readily available, but they require rich, digitized data that can capture multifaceted and temporal intra-movement dynamics. While libraries have made great progress providing digitized "collections as data" to researchers, documents produced by social movement actors are not systematically included in standard categorical collections. In this talk I discuss my experience working with HathiTrust, using metadata as well as vector space models, to identify and collect digitized texts produced by a diverse array of individuals and organizations involved in the women's movement between 1860 and 1975. I discuss the challenges involved in collecting such a corpus, as well as the new types of historical and cultural analyses these data enable.

Scalable Detection of Text Reuse (Doug Duhaime)
In 2016 Yale University's Digital Humanities Lab began work on a full-stack web application that allows users to detect and visualize text reuse in large collections. A prototype of the app is available here: (Try searching for Thomas Gray).
This project builds on research begun during an Advanced Collaborative Research Grant with the HTRC, and implements a number of features that distinguish it from recently released packages for text reuse. During this talk, I will give a brief overview of the data processing pipeline, discuss the front-end UI options we've prioritized and sketched to date, then open the floor for suggestions of other features that could help users study text reuse in large text collections.


David Bamman

UC Berkeley

Doug Duhaime

Yale University
Data analysis and visualization!

Loren Glass

University of Iowa
Loren Glass is Associate Professor of English at the University of Iowa. He writes on celebrity, obscenity, modernism, and the avant-garde. He is currently completing a history of Grove Press which will appear in the Post*45 Series with Stanford University Press.

Nick Kelly

University of Iowa

Laura Nelson

Northeastern University

Nicole White

University of Iowa

Thursday January 25, 2018 3:00pm - 4:00pm
Moffitt Library, 5th floor
