
Preparing a Corpus for Textual Analysis: Home

This guide gives resources for finding and cleaning your texts.

Preparing a Corpus for Textual Analysis

This guide walks you through the steps and provides you with resources for finding and cleaning your texts--the necessary preliminaries to textual analysis projects. This isn't the most glamorous part of digital scholarship, but with these resources, you can move quickly and avoid the most common pitfalls.

Why would you need to prepare a corpus? Most textual analysis projects rely on plain-text digital versions of documents, which are easily read and analyzed by a computer. These documents need to be relatively clean of spelling errors, clearly labeled, and all collected in one place for the results to be accurate.
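As a rough illustration of "all collected in one place," here is a minimal Python sketch that reads every plain-text file in one folder and labels each text by its file name. The function name `load_corpus`, the sample file names, and the sample sentence are made up for this example, and it assumes your files are saved as UTF-8 `.txt` files; your own corpus may need different handling.

```python
import tempfile
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in `folder` into a dict keyed by file name (without extension)."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: files are UTF-8 plain text.
        corpus[path.stem] = path.read_text(encoding="utf-8")
    return corpus

# Demo with a temporary folder and hypothetical files, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample_novel.txt").write_text("It was a quiet morning.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("Not part of the corpus.", encoding="utf-8")
    corpus = load_corpus(tmp)

print(list(corpus))  # only the .txt file is picked up
```

Keying each text by a clear file name is one simple way to keep documents labeled; a spreadsheet of metadata would work just as well.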

Use the tabs at the top to explore the most common steps of preparing a digital corpus: finding, cleaning, and analyzing.

Digital Scholarship?

Stephen Ramsay says that digital scholarship is about building things. That's one definition; you'll find many others here.

Looking for an overview? Here's a free online textbook on digital scholarship.

Picking a Topic

Remember, on its most basic level, all textual analysis is counting and classifying words. Some textual analysis tools are more advanced than others, of course!

A few tips:

  • Start with the question, not with the tool you want to use. Just as with any other scholarship, there needs to be a critical gap that your research will fill.
  • Start out by dreaming big--what can you imagine learning that is new, exciting, useful?
  • Then, scale back a bit. Look at the resources available to you. Are your texts in the public domain and already digitized? (If not, your work isn’t impossible, but it can get tricky!) Do you know what sort of corpus you might want to work with?

Introduction to Computer-Assisted Textual Analysis

What does computer-assisted textual analysis mean?

Textual analysis means looking at a text and pulling out a deeper meaning. Computer-assisted textual analysis just means that you use some type of computing to help.

The main reason to do this is that it can show you something you might not otherwise be able to see.

What does it help you do that you couldn’t otherwise do?

  1. Read more stuff than you can really read. It would take you weeks to count how many times different words appear in even one text. Computers can count words almost instantly. They can also analyze which words appear near each other or seem to be related. Some programs can pull out place names or proper names and give you information on those.
  2. Automatically classify texts and parts of texts. Computers can “learn” to recognize different authors, and then can use this information to classify anonymous texts in a more scientific way. Computers can make pretty accurate stabs at genre, and are good at recognizing things like whether a text is fiction or non-fiction based on the presence of dialogue and so on. Big companies use sophisticated textual analysis software for “Sentiment Analysis,” or to see whether people are saying positive or negative things about their products.
  3. Find and represent networks. Computers offer us the potential to see patterns of influence, tracing ideas as they transition from news into fiction, or showing us repeated tropes or ideas. They’ve also been used to visualize networks of who knew and was communicating with whom, and to represent the migration of ideas across time and space.
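To make the first item in the list above concrete, here is a minimal Python sketch of counting words and finding which words appear near a chosen word. It uses only the standard library; the function names, the sample sentence, and the very simple tokenizer (lowercase letters and straight apostrophes only) are assumptions made for illustration, not the method of any particular tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (letters and straight apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(text):
    """Count how many times each word appears."""
    return Counter(tokenize(text))

def neighbors(text, target, window=2):
    """Count the words that appear within `window` words of `target`."""
    tokens = tokenize(text)
    nearby = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            nearby.update(tokens[start:i])              # words just before
            nearby.update(tokens[i + 1:i + 1 + window])  # words just after
    return nearby

# Hypothetical sample text for the demo.
sample = "The whale surfaced. The white whale dived, and the crew watched the whale."
counts = word_counts(sample)
print(counts.most_common(2))                   # [('the', 4), ('whale', 3)]
print(neighbors(sample, "whale", window=1))
```

Real projects usually need more careful tokenizing (curly apostrophes, hyphens, accented letters), but the underlying idea is the same: split the text into words, then count.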

Tip: If you’ve got a question that can be answered by a small number of texts that you can realistically and easily read, it might not be the best question for computer-assisted textual analysis.

Tip: Textual analysis software is changing very quickly, so there is probably a new tool out there today that wasn’t there yesterday and that will let you do all of this and more!

Features of the Best Projects

  • Big datasets. You’re looking at so many texts that it would be impossible to read and process them all comprehensively by hand.
  • Unique questions. You’re looking at things which are very hard to understand otherwise.
  • Transformative. This project offers the potential to change the way we understand something--a historical movement, a person, literary history...