Research Guides: Introducing the Digital Humanities: Text Analytics

Example Projects

Hildegard of Bingen: Authorship and Stylometry
Using stylometry to determine the author of historic works.
Index Thomisticus
The original digital humanities project built in collaboration between Father Roberto Busa and IBM. Work began on this project in 1949.
Women Writers
This project aims to encode the texts of early modern women writers in TEI and make them available for research. It also has some built-in tools for textual analysis of the collections.

What is text analysis?

The first digital humanities project was a text analysis project. As more text becomes available in machine-readable digital formats, it becomes easier to apply text analysis methods to answer research questions. These techniques, often referred to as distant reading, allow you to examine a large volume of texts, known as a corpus, to learn about the information they contain. This contrasts with traditional close reading, which many people are more familiar with, where you closely scrutinize individual texts.

Specific methods include concordances, which list each occurrence of a word in a text so that you can see how often it appears and in what context; topic modelling, which identifies the recurring themes of texts based on computational linguistics and co-occurring words; and stylometry, the statistical analysis of writing style, which is useful in research such as authorship studies.
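To make the idea of a concordance concrete, here is a minimal Python sketch of a keyword-in-context listing together with a simple word frequency count. The function names (`tokenize`, `concordance`), the sample sentence, and the very simple tokenizer are assumptions made for illustration; dedicated tools such as AntConc or Voyant do considerably more.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, window=3):
    """Return (position, left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((i, left, tok, right))
    return hits


# Hypothetical sample text, used only for illustration.
sample = "The sea was calm. The ship left the harbour, and the sea grew rough."
tokens = tokenize(sample)

freq = Counter(tokens)                       # how often each word occurs
hits = concordance(tokens, "sea", window=2)  # where "sea" occurs, with context

for pos, left, kw, right in hits:
    print(f"{pos:>3}  {left:>12} [{kw}] {right}")
```

The same token list feeds both the frequency count and the concordance, which is why tokenizing and cleaning choices made early on affect every later result.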

Before you can apply textual analysis methods, you must assemble and clean your corpus. This is often the most laborious part of a text analysis project, as you may need to find machine-readable copies of your textual materials or make them machine-readable through OCR (optical character recognition), remove formatting and tags from XML or HTML, and correct OCR mistakes.
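Two of the cleaning steps described above, removing tags and tidying the extracted text, can be sketched with Python's standard library alone. This is a minimal illustration, not a complete cleaning pipeline: the names `TextExtractor`, `strip_tags`, and `clean_text` are hypothetical, and the hyphen rule is an assumption that will also merge genuinely hyphenated words that happen to break across lines.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content of a document, discarding the tags."""

    def __init__(self):
        super().__init__()  # character references such as &amp; are decoded by default
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of an HTML or simple XML fragment without its tags."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def clean_text(markup):
    """Strip tags, rejoin words hyphenated across line breaks, collapse whitespace."""
    text = strip_tags(markup)
    # Assumption: a hyphen at a line end marks a split word ("manu-\nscript").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical input, used only for illustration.
raw = "<p>The <i>manu-\nscript</i> was   copied &amp; bound.</p>"
cleaned = clean_text(raw)
print(cleaned)
```

Real OCR output usually needs further, project-specific corrections, so it is worth spot-checking cleaned text against the page images before analysis.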

Computational Textual Analysis Libguide
To learn how to create a textual data set and analyze your corpus, visit our Libguide expressly on the topic!
Stylometry Libguide
To learn how to analyze texts specifically in order to determine authorship, check out our guide on the topic.

Tools

Voyant
Voyant is a web-based set of tools for reading and analyzing digital text. You can load corpora of websites or files into Voyant for textual analysis. Note that Voyant includes many tools for different types of analysis, including topic modelling, concordance and even some network analysis.
R
R is a powerful and popular language and environment for statistical computing and graphics. R is particularly useful for textual analysis when coupled with the tm (text mining) package.
MALLET
MALLET is a tool for analyzing text through topic modelling.
stylo
A package for R to perform stylometric analyses. This package was used to generate the authorship results for the Hildegard of Bingen project in the Example Projects box on this libguide.
AntConc
AntConc is used to perform concordance studies of texts.
Google NGram Viewer
A tool by Google for charting how often words and phrases appear in over 5 million of the books in Google Books, covering works published up to the year 2008.
TEI (Text Encoding Initiative)
A standard for representing digital texts, TEI provides a way to mark up text for machine readability. This markup can improve both how well digital texts are understood and what you can do with them.

Tools for Corpus Cleaning

OpenRefine
This is a handy tool for working with large data sets and cleaning up files for text analysis and other needs.
VARD 2
This software uses techniques derived from modern spell checkers to find candidate modern-form replacements for spelling variants found within historical texts. The user can choose to process texts manually, selecting a candidate replacement offered by the system; automatically, allowing the system to use the best candidate replacement found; or semi-automatically, training the tool on a sample of the corpus.
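The general idea of proposing modern replacements for historical spelling variants can be sketched with Python's `difflib` module. This is not the method the tool above uses; it is a simplified stand-in based on string similarity, and the word list, the function names (`suggest`, `normalise`), and the 0.6 similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical modern word list; a real project would use a full dictionary.
LEXICON = ["and", "have", "love", "never", "virtue"]


def suggest(word, lexicon=LEXICON, n=3, cutoff=0.6):
    """Return up to n modern candidates for a variant, most similar first."""
    return difflib.get_close_matches(word, lexicon, n=n, cutoff=cutoff)


def normalise(text, lexicon=LEXICON):
    """Replace each unknown word with its best candidate, if one exists.

    This mirrors an automatic mode; a manual mode would instead show the
    candidates from suggest() to the user and let them choose.
    """
    out = []
    for word in text.split():
        if word in lexicon:
            out.append(word)                 # already a modern form
        else:
            candidates = suggest(word, lexicon)
            out.append(candidates[0] if candidates else word)
    return " ".join(out)


print(suggest("loue"))
print(normalise("haue loue and vertue"))
```

Fully automatic replacement can silently introduce errors, so reviewing a sample of the substitutions is advisable before running any analysis on the normalised text.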

Temple University

University Libraries

Need help? Email us at asktulibrary@temple.edu