Research Guides: Stylometry Methods and Practices: Methods

Getting Started

Before you load any texts into a stylometric program, take a step back and prepare your corpus: collect all of your texts, organize them into readable files, and remove any metadata.

Here is a comprehensive guide to doing just that: Preparing a Corpus for Textual Analysis

This step is especially important when working in a programming language like R, since the functions won't work if your folders and files are not named correctly.
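As a rough illustration of why tidy files matter, here is a minimal Python sketch that reads a folder of plain-text files into memory. The function name `load_corpus` and the assumption that every text is a UTF-8 `.txt` file in a single folder are hypothetical choices for this example, not requirements of any particular stylometric program.

```python
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in `folder` into a dict keyed by file name.

    Assumption for this sketch: one text per file, saved as UTF-8 plain text.
    """
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # path.stem is the file name without ".txt"; it becomes the text's label.
        # errors="replace" keeps the loop going if a file contains stray bytes.
        corpus[path.stem] = path.read_text(encoding="utf-8", errors="replace")
    return corpus
```

Calling `load_corpus` on your own folder returns one entry per text. Files that are misnamed or saved in another format are silently skipped here, which is one reason consistent naming is worth the effort.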

Principles

Stylometry looks at a variety of features of an author’s style such as:

Word length
Sentence length
Paragraph length
Punctuation
Function words (for example: the, and, a, of, to, in, that, with, but, it)
Letters
Character n-grams, such as bigrams and trigrams (two or three characters in a row)
Word n-grams, also called bi-words and tri-words (two or three words occurring in a certain order)
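To make a few of the features above concrete, here is a minimal Python sketch that measures average word length, average sentence length, and the relative frequency of the function words listed above. The function name `style_features`, the simple regular expressions, and the sentence-splitting rule are assumptions made for illustration; dedicated stylometric tools handle tokenization far more carefully.

```python
import re
from collections import Counter

# The function words given as examples in the list above.
FUNCTION_WORDS = ["the", "and", "a", "of", "to", "in", "that", "with", "but", "it"]


def style_features(text):
    """Return a few simple style measurements for an English text."""
    # Rough sentence split: break on runs of ., ! or ? and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Rough word split: lowercase letters, optionally with an inner apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words or not sentences:
        raise ValueError("text contains no words to measure")

    counts = Counter(words)
    total = len(words)
    return {
        "mean_word_length": sum(len(w) for w in words) / total,
        "mean_sentence_length": total / len(sentences),  # in words per sentence
        "function_word_freq": {w: counts[w] / total for w in FUNCTION_WORDS},
    }
```

Running `style_features` on several texts by the same author and comparing the numbers gives a first impression of how stable these measurements are.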

General Tips

The larger or more specific your corpus, the better! Research has shown that testing larger amounts of data (more than one novel or text) per author produces more accurate results.
Have an extensive corpus of “control” texts available if you are trying to determine unknown authorship “blindly” (without a subset of suspected authors).

N-grams?

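An n-gram is a contiguous sequence of n characters or words taken from a text. Counting how often each n-gram occurs lets a program estimate how probable that sequence is in an author’s writing. Let’s use the first sentence of Leo Tolstoy’s Anna Karenina as an example.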

Original Sentence:

“Happy families are all alike; every unhappy family is unhappy in its own way.”

Word n-grams of size two (bigrams), after lowercasing the text and removing punctuation:

"happy families" "families are" "are all" "all alike" "alike every" "every unhappy" "unhappy family" "family is" "is unhappy" "unhappy in" "in its" "its own" "own way"

Character n-grams of size three (trigrams), with each space shown as an underscore:

"hap" "app" "ppy" "py_" "y_f" "_fa" "fam" "ami" "mil" "ili" "lie" "ies" "es_" "s_a" "_ar" "are" "re_" "e_a" "_al" "all" "ll_" "l_a" "_al" "ali" "lik" "ike" "ke_" "e_e" "_ev" "eve" "ver" "ery" "ry_" "y_u" "_un" "unh" "nha" "hap" "app" "ppy" "py_" "y_f" "_fa" "fam" "ami" "mil" "ily" "ly_" "y_i" "_is" "is_" "s_u" "_un" "unh" "nha" "hap" "app" "ppy" "py_" "y_i" "_in" "in_" "n_i" "_it" "its" "ts_" "s_o" "_ow" "own" "wn_" "n_w" "_wa" "way"

Producing these lists by hand quickly becomes tedious, so this work is best left to a program.
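As a minimal sketch of how a program might build the word bigrams and character trigrams for the Tolstoy sentence, consider the following Python example. The helper names `normalize`, `word_ngrams`, and `char_ngrams` are hypothetical, and the normalization step (lowercasing and dropping everything except the letters a to z and spaces) is an assumption that only suits plain English text.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^a-z\s]", "", text.lower())  # English-only assumption
    return " ".join(text.split())


def word_ngrams(text, n):
    """Return every run of n consecutive words."""
    words = normalize(text).split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return every run of n consecutive characters, spaces included."""
    chars = normalize(text)
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


sentence = ("Happy families are all alike; "
            "every unhappy family is unhappy in its own way.")

bigrams = word_ngrams(sentence, 2)
trigrams = char_ngrams(sentence, 3)

print(bigrams[:3])              # ['happy families', 'families are', 'are all']
print(trigrams[:5])             # ['hap', 'app', 'ppy', 'py ', 'y f']
print(Counter(trigrams)["hap"])  # 3
```

The final line counts how often one trigram occurs. Stylometric comparisons are built on frequency counts of this kind, gathered across whole texts instead of a single sentence.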