Authorship of the Federalist Papers

Author: Toby Surtees

Problem Description

What are The Federalist Papers?

The Federalist Papers are a collection of 85 essays written by Alexander Hamilton, James Madison, and John Jay under the pseudonym 'Publius' in 1787-88. The aim of these articles was to persuade New Yorkers to support ratification of the US Constitution.

Ideas explored in The Federalist Papers include opposition to a Bill of Rights (Federalist No. 84), on the grounds that enumerating specific rights could limit the people to only those listed and give the government licence to infringe on rights left unlisted. Other essays cover foreign influence, interstate conflict, power and taxation, and, most extensively, the US Constitution, the powers of Congress, and the structure of the new federal government.

Disputed Authorship

At the time of publication, the authors of the Federalist Papers hid their identities behind the collective pseudonym 'Publius', leading to conflicting claims over who wrote each essay. Today, researchers generally agree on the authorship of most of the essays; however, 12 remain disputed between Hamilton and Madison.

The modern consensus is as follows:

  • Hamilton: 51 articles (1, 6-9, 11-13, 15-17, 21-36, 59-61, 65-85).
  • Madison: 29 articles (10, 14, 18-20, 37-58, 62-63).
  • Jay: 5 articles (2-5, 64).

The dispute centres on the authorship of articles 49-58 (claimed by Madison), 18-20 (claimed by both Madison and Hamilton) and 64 (claimed by Jay). The claims conflict: Hamilton claimed 63 essays, including 3 written jointly with Madison, while Madison claimed 29 essays for himself, attributing the discrepancy to rushed clerical work on Hamilton's list around the time of his death in 1804.

Most modern research strongly favours Madison's claim of authorship over that of Hamilton, but the debate is not considered fully closed.

Project Aims

The aim of this project is to contribute to the research around The Federalist Papers by using linguistics and Natural Language Processing (NLP) methods to attempt to discern which of the three authors is most likely to have written the 12 disputed papers.

At the time of writing, this will mostly consist of comparison and statistical analysis of the text: examining other works by the authors and comparing writing style and commonalities between each work, as well as against other confirmed articles in the same paper.

Notes on Methodology

  1. LLM or Bayesian Models - Research into this specific case has shown no significant performance improvement from using Large Language Models (LLMs) over more traditional Bayesian or topic models (3), suggesting the best route for this project is a more typical Bayesian approach.
  2. Feature Selection - In tasks like this one, where writing style is the main differentiator between authors, feature selection is critical to producing useful results. Previous approaches have relied on function words alone (3,4), which are less affected by subject matter and thus better suited to attributing authorship; topic embeddings of function words in particular have been shown to give the best attribution performance (3). A sketch of function-word frequency extraction follows this list.
  3. Stylistic Signals - Looking beyond word frequency into stylistic signals such as rhetoric and discourse structure can help in article attribution (1), a potential path for exploring beyond statistical analysis.
  4. Types of Classification - Using multiple types of classification and combining the results can increase the robustness of results (5), so I should explore more than one classification technique.
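
As a concrete illustration of point 2, here is a minimal sketch of function-word frequency extraction. The FUNCTION_WORDS set shown is a small hypothetical stand-in for the James O'Shea list credited at the end of this README.

```python
import re
from collections import Counter

# Small stand-in set; the project uses the James O'Shea function word list.
FUNCTION_WORDS = {"the", "of", "and", "to", "by", "which", "upon", "while"}

def function_word_frequencies(text: str) -> dict[str, float]:
    """Relative frequency of each function word per 1,000 tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return {w: 1000 * counts[w] / len(tokens) for w in FUNCTION_WORDS}

sample = "To the People of the State of New York: after an unequivocal experience..."
print(function_word_frequencies(sample))
```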

Language and Technique

Python is the obvious choice for this project: I'm familiar with the language, it offers many powerful libraries for NLP and text analysis, and it has mature tooling for testing code correctness.

Supervised learning seems like the best choice for this application, as there are plenty of other works attributed to each author whose authorship is not in dispute. Knowing for a fact who wrote these, I can use them as a baseline for comparison against the disputed articles. The same applies to the uncontested articles within The Federalist Papers themselves; I will use both in this project.

Features

  1. Lexical Features - Average sentence length in words, variation in sentence length, and lexical diversity.
  2. Punctuation Features - Average number of commas, semicolons and colons per sentence (a sketch of these first two feature groups follows this list).
  3. Bag-of-Words (BoW) - Frequencies of different words in each article, restricted to topic-independent keywords so that comparisons hold across different papers and subjects.
  4. Syntactic Features - Part-of-Speech (POS) tags and lexical category frequencies.
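
Below is a minimal sketch of how the lexical and punctuation features might be computed. The regex-based sentence splitting and token pattern are illustrative assumptions; a proper tokenizer such as nltk.sent_tokenize would be more robust in practice.

```python
import re
from statistics import mean, pstdev

def lexical_features(text: str) -> dict[str, float]:
    """Average sentence length, its variation, and lexical diversity."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "avg_sentence_len": mean(lengths),
        "sentence_len_sd": pstdev(lengths),
        "lexical_diversity": len(set(tokens)) / len(tokens),  # type/token ratio
    }

def punctuation_per_sentence(text: str) -> dict[str, float]:
    """Average count of commas, semicolons and colons per sentence."""
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return {mark: text.count(mark) / n_sentences for mark in ",;:"}
```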

Data Structures

Stylistic Analysis

Name: federalist_dict
Type: Dict (Int : Str)
Purpose: Holds each unprocessed article indexed to its corresponding article number. For example, Article 1 is stored at Key 1.
Example: {1: "after an unequivocal experience of the inefficacy of the subsisting federal government, you are called upon to deliberate on a new constitution for..."}

Name: author_dict
Type: Dict (Int : Str)
Purpose: Stores each article number and corresponding author. In the case that an article is disputed, the author is marked as Disputed.
Example: {1: 'HAMILTON', 2: 'JAY', 3: 'JAY', 4: 'JAY', ...}

Name: total_punctuation
Type: Dict (Str : Dict (Char : Float))
Purpose: For each author, stores the normalized average punctuation usage for each punctuation that occurs, across all articles attributed to that author.
Example: {'HAMILTON': {',': 7.608161512126157, '.': 3.1401708329525233, ';': 0.7993101464541578}}

Name: disputed_punctuation
Type: Dict (Int : Dict (Char : Float))
Purpose: For each disputed article number, store the normalized average punctuation for each type of punctuation that occurs in that article.
Example: {18: {',': 9.13091309130913, '.': 4.9504950495049505, ';': 1.3751375137513753}}

Name: total_averages
Type: Dict (Str : Float)
Purpose: For each author, store the average number of words per sentence across all attributed articles.
Example: {'HAMILTON': 34.25, 'JAY': 38.41, 'MADISON': 34.51}

Name: disputed_averages
Type: Dict (Int : Float)
Purpose: For each disputed article number, stores the average words per sentence for that article.
Example: {18: 23.62, 19: 29.55, 20: 27.47, 49: 28.33, 50: 26.19, 51: 31.34, 52: 31.25, 53: 31.33, 54: 30.69, 55: 33.43, 56: 31.09, 57: 28.29, 58: 34.15, 64: 41.14}
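
The sketch below shows one way total_punctuation and disputed_punctuation might be populated from federalist_dict and author_dict, assuming the per-100-words normalization described in the Results section. Averaging per-article rates is one design choice; pooling raw counts across all of an author's articles would be a reasonable alternative.

```python
from collections import defaultdict

def punct_per_100_words(text: str, marks: str = ",.;:()-?'`") -> dict[str, float]:
    """Occurrences of each punctuation mark per 100 words of one article."""
    n_words = len(text.split())
    return {m: 100 * text.count(m) / n_words for m in marks if m in text}

# federalist_dict and author_dict are assumed populated as described above.
sums = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for number, text in federalist_dict.items():
    author = author_dict[number]
    if author == "Disputed":
        continue  # disputed articles get their own per-article profiles below
    counts[author] += 1
    for mark, rate in punct_per_100_words(text).items():
        sums[author][mark] += rate

total_punctuation = {
    author: {mark: total / counts[author] for mark, total in marks.items()}
    for author, marks in sums.items()
}

disputed_punctuation = {
    number: punct_per_100_words(text)
    for number, text in federalist_dict.items()
    if author_dict[number] == "Disputed"
}
```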

Results

The First Results - Punctuation Analysis

The first major results for this project come from punctuation analysis confined to the articles in the paper itself. Punctuation analysis is the first step in stylistic analysis: by measuring the punctuation usage of each author, we can begin to establish a writing style for each individual, which can then be compared directly against the usage in each article. The author whose style most closely resembles a given article's becomes our best guess at who wrote it.

For this first step into punctuation analysis, I have restricted the search to The Federalist Papers only. Naturally, the next logical step is to expand the analysis to previous works of each author, building up the author styles mentioned above. For now, however, I have used the current data to run a preliminary comparison and check that I'm heading in the right direction. The metric compares total punctuation usage per 100 words of each article, per author, across the entire publication. I then used Total Absolute Difference (TAD): for each punctuation mark, take the absolute difference between the author's total average (the average usage across all undisputed papers attributed to that author) and the average of the individual contested paper, then sum these differences.

As an example, take Article 18. The figure below shows the punctuation usage per sentence across the article; I average these values per 100 words to get a normalized profile. I then take each author's total average, Hamilton for example, and compute the difference between that total average and Article 18's average for each type of punctuation. The resulting per-mark differences are shown in the table below the figure.

Figure: Average punctuation usage per sentence for Article 18 of The Federalist Papers.

article         ,         .         ;         (         )         -          :         ?         `         '    author
     18  1.522752  1.810324  0.575827  0.077005  0.077005  0.114279        NaN       NaN       NaN       NaN  HAMILTON

I then take the sum of each of these differences to get the TAD, which works out to be 4.177193 for this particular article and author. Repeating this for each author, we also get the values 4.005356 for Jay and 3.045500 for Madison. We then compare these values and take the lowest value, as this represents the closest match to the overall punctuation usage for an author and that used in the particular article. So in the case of Article 18, we attribute Madison, as his TAD is closest to 0.
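
A minimal sketch of the TAD comparison, assuming the total_punctuation and disputed_punctuation dictionaries described above; marks missing from one profile are treated as zero:

```python
def total_absolute_difference(author_profile: dict[str, float],
                              article_profile: dict[str, float]) -> float:
    """Sum of absolute per-mark differences between two punctuation profiles."""
    marks = set(author_profile) | set(article_profile)
    return sum(abs(author_profile.get(m, 0.0) - article_profile.get(m, 0.0))
               for m in marks)

def best_match(article_profile: dict[str, float],
               total_punctuation: dict[str, dict[str, float]]) -> str:
    """Author whose overall profile has the lowest TAD against the article."""
    return min(total_punctuation,
               key=lambda author: total_absolute_difference(
                   total_punctuation[author], article_profile))

# e.g. best_match(disputed_punctuation[18], total_punctuation) -> 'MADISON'
```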

Repeating this for every disputed paper, we get the following results.

          HAMILTON_TAD   JAY_TAD  MADISON_TAD Best Match
article
18           4.177193  4.005356     3.045500    MADISON
19           5.186822  4.994623     4.094340    MADISON
20           5.694233  5.395003     4.575195    MADISON
49           1.705146  2.290675     2.108466   HAMILTON
50           4.397100  4.108221     4.082052    MADISON
51           1.164875  2.187222     2.103554   HAMILTON
52           1.699144  1.678109     1.311747    MADISON
53           1.335050  1.245165     0.918898    MADISON
54           1.278345  1.320628     1.033813    MADISON
55           0.789800  1.657792     1.750226   HAMILTON
56           2.056060  2.641643     2.623698   HAMILTON
57           3.133577  4.182924     4.199794   HAMILTON
58           1.322836  1.456486     1.253914    MADISON
64           2.584629  1.598785     2.071018        JAY

You will notice that this does not line up exactly with the modern-day consensus, but that is what I expected, as this is only one piece of a larger puzzle. The results do show promise, however: enough of the attributions match the consensus to suggest that punctuation usage and style are a worthwhile avenue to explore more deeply.

As I mentioned earlier, the next logical step would be to expand this search to works outside of The Federalist Papers, as well as to build more on this stylistic profile for each author.

Euclidean Distance

The second step of this analysis was to use a different measure for comparing punctuation. For this I chose Euclidean distance, whose benefit over TAD is that larger per-mark differences are penalized more heavily. It rests on the same basic idea as TAD, so the results are expected to be similar, if not identical; I intend it as reinforcement rather than as a totally different measure of similarity.

Implementing Euclidean distance is fairly easy, using the following equation, where $a$ and $b$ are the two punctuation profiles being compared over $n$ punctuation marks:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
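
Under the same dictionary shapes as the TAD sketch above, only the distance function changes (marks missing from one profile again treated as zero):

```python
from math import sqrt

def euclidean_distance(author_profile: dict[str, float],
                       article_profile: dict[str, float]) -> float:
    """Root of summed squared per-mark differences; large gaps dominate."""
    marks = set(author_profile) | set(article_profile)
    return sqrt(sum((author_profile.get(m, 0.0) - article_profile.get(m, 0.0)) ** 2
                    for m in marks))
```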

The results we get from applying this equation are as expected, with very little variation in the final result (closest match) compared with TAD, though the measured values naturally differ.

Euclidean Distance Per Article Per Author

    HAMILTON       JAY   MADISON Best Match
18  2.476729  2.538480  2.029672    MADISON
19  2.806521  2.876214  2.294644    MADISON
20  3.045540  2.780510  2.354734    MADISON
49  0.923194  1.311699  1.121732   HAMILTON
50  2.999860  2.520699  2.430505    MADISON
51  0.636588  1.408510  1.263493   HAMILTON
52  0.853494  0.935949  0.613632    MADISON
53  0.780391  0.575158  0.425526    MADISON
54  0.730146  0.698429  0.488848    MADISON
55  0.483086  0.981196  0.972721   HAMILTON
56  0.872656  1.153955  1.188977   HAMILTON
57  2.411042  3.163500  3.130167   HAMILTON
58  0.713367  0.837343  0.729615   HAMILTON
64  1.476969  0.749335  0.910483        JAY

We can see the direct comparison in the following table, where the only difference is in Article 58:

     Euclidean      TAD
18     MADISON      MADISON
19     MADISON      MADISON
20     MADISON      MADISON
49    HAMILTON      HAMILTON
50     MADISON      MADISON
51    HAMILTON      HAMILTON
52     MADISON      MADISON
53     MADISON      MADISON
54     MADISON      MADISON
55    HAMILTON      HAMILTON
56    HAMILTON      HAMILTON
57    HAMILTON      HAMILTON
58    HAMILTON      MADISON
64         JAY      JAY

This may look boring, but it's actually an important step towards building a robust model. Confirming the results across more than one form of measurement shows that the attributions are resistant to changes in methodology, which gives me confidence to continue down this route of punctuation analysis.

We also see that one article, 58, is a mismatch between the two measures. This suggests that Article 58 may be ambiguous, sitting near the boundary between Madison's and Hamilton's punctuation styles in particular, so punctuation analysis may be less effective here, as two authors share a similar style to the one used in the article. If we wanted to keep going with punctuation analysis for Article 58, the next logical step would be to introduce a third metric, which I intend to do regardless: cosine similarity.
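
For reference, a sketch of what the planned cosine similarity metric might look like over the same punctuation profiles. Note that, unlike TAD and Euclidean distance, higher values here mean greater similarity, so the best match would be the maximum rather than the minimum:

```python
from math import sqrt

def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two punctuation-profile vectors."""
    marks = set(a) | set(b)
    dot = sum(a.get(m, 0.0) * b.get(m, 0.0) for m in marks)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)
```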

References

[1] Collins, J. et al. (2004) ‘Detecting Collaborations in Text Comparing the Authors’ Rhetorical Language Choices in The Federalist Papers’, Computers and the Humanities, 38(1), pp. 15–36. Available at: https://doi.org/10.1023/B:CHUM.0000009291.06947.52.

[2] Fung, G. (2003) ‘The disputed federalist papers: SVM feature selection via concave minimization’, in Proceedings of the 2003 Conference on Diversity in Computing (TAPIA ’03), Atlanta, Georgia, USA: ACM, pp. 42–46. Available at: https://doi.org/10.1145/948542.948551.

[3] Jeong, S.W. and Ročková, V. (2025) ‘From Small to Large Language Models: Revisiting the Federalist Papers’. arXiv. Available at: https://doi.org/10.48550/arXiv.2503.01869.

[4] Mosteller, F. and Wallace, D.L. (2012 [1984]) Applied Bayesian and Classical Inference: The Case of The Federalist Papers. Springer Science & Business Media.

[5] Savoy, J. (2013) ‘The Federalist Papers revisited: A collaborative attribution scheme’, Proceedings of the American Society for Information Science and Technology, 50(1), pp. 1–8. Available at: https://doi.org/10.1002/meet.14505001036.

Credits

All credit for 'The Federalist Papers' resource used in this project goes to Project Gutenberg (https://www.gutenberg.org/), where the documents are available to download for free.

The list of function words used in this project is credited to James O’Shea and is available at https://semanticsimilarity.wordpress.com/function-word-lists/.
