Society for the Study of Speciation: Bibliographic Analysis Using LLM

Illustration of "ring species" concept, or Rassenkreis
Alfred E. Emerson at University of Chicago

LLMs offer tools for historical analysis of primary sources. In a previous post, I produced synthetic data for an LLM-based analysis of the 1939 International Genetical Congress in Edinburgh. In this post, I provide data for an LLM-based analysis of the short-lived Society for the Study of Speciation (SSS), organised by Alfred Emerson. The documents under study are three items from the 1941 circular Emerson produced for the Society:

  • Bibliography, which contains 1250 citations
  • Address List, which contains addresses for 374 individuals
  • News and Information, which contains notes from individuals about their current work as well as one long review essay written by Alfred Emerson about the 1940 volume, The New Systematics, edited by Julian Huxley.

I previously published an open-access scholarly edition of Emerson’s circulars in Regular Contact With Anyone Interested. Documents of the Society for the Study of Speciation. 2nd edition (Cain 2007), and a scholarly journal article about the society’s formation and its short-lived activities:

Analysis of the circular material is ideal for LLM-powered tools because (1) the content consists of long lists of structured, factual information, (2) those lists carry notations that imply a hierarchical organisation, but that hierarchy is difficult to use in the circulars as printed, and (3) the overall amount of content is small by LLM standards, so analysis is not computationally overwhelming. The goal of using an LLM in this case is pattern-finding, to create grist for further historical analysis.

To examine Emerson’s circulars I used Google NotebookLM because it can work within a closed content ecosystem rather than drawing on general internet knowledge. For instance, it’s possible to restrict the analysis to one specific document or to a select combination of documents. NotebookLM is powered by Google’s Gemini platform; in this case, Gemini 2.5 Flash.

My method consisted of uploading the three SSS documents as PDFs derived from Cain (2007). A separate PDF of Emerson’s classification (Cain 2007: 103, here named “SSS classification”) was uploaded for ease of reference. As a first step, I ran simple queries to confirm that the upload was complete and the tool was working. The LLM needed some additional instruction. (For instance, the Bibliography contains 1250 citations, but the numbering runs only to 1249. At first the LLM asserted that #1250 was missing. In fact, for some reason, Emerson used #320 twice (as #320(1) and #320(2)), which makes the total 1250 citations under 1249 numbers. This pattern had to be described to the LLM.)
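The numbering quirk described above can also be checked outside the LLM. The following is a minimal sketch, not part of the method used in this post. It assumes the citation labels have been transcribed into a plain list of strings; the function name `summarize_numbering` and the sample labels are hypothetical and merely reproduce the pattern of a reused #320.

```python
import re
from collections import Counter


def summarize_numbering(labels):
    """Return (entry_count, distinct_numbers, reused_numbers) for citation labels.

    Labels are assumed to be strings such as "319", "320(1)", "320(2)".
    """
    # Keep only the leading digits, so "320(1)" and "320(2)" both map to 320.
    base_numbers = [int(re.match(r"\d+", label).group()) for label in labels]
    counts = Counter(base_numbers)
    reused = sorted(n for n, c in counts.items() if c > 1)
    return len(labels), len(counts), reused


# Hypothetical labels: 1 to 1249, with 320 appearing as 320(1) and 320(2).
labels = [str(n) for n in range(1, 1250) if n != 320] + ["320(1)", "320(2)"]

entries, distinct, reused = summarize_numbering(labels)
print(entries, distinct, reused)  # 1250 1249 [320]
```

A check of this kind gives a fixed reference point for judging whether the LLM's totals are correct.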

Analysis of Society for the Study of Speciation Using LLM

NotebookLM can generate an audio overview of content. This is the “deep dive” summary of the “News and Information” section:

Audio overview of Emerson’s “News and Information” circular for Society for the Study of Speciation, 1941. This overview imports additional content, i.e., content from outside the closed content ecosystem in this study. The ending is abrupt.

Sample Queries

Below is a sample of queries regarding Emerson’s circulars. Some queries focus on one of the three documents. Others seek connections across documents and draw on more than one of the three.

For the Address List, rank all institutions with more than one individual.

For the Address List, sort the addresses by continent, country, and US state. Provide the results in terms of highest to lowest numbers.

List the individuals in the Address List located outside the United States or Canada.

For all names on the Address List, count the different types of institution. For example “University”, “College”, “Institute”, “Museum”, “Station”, etc. Rank in terms of most frequent to least frequent.

Count the number of citations in the Bibliography for each category in Emerson’s classification. In cases where multiple categories are provided, count the citation for each single category.

Count the number of category codes for each citation in the bibliography and list the number of citations that have 1 category code, 2 codes, 3 codes, etc.

For citations with more than 1 category code, rank the most frequent combinations.

Count the number of citations in each language. Rank the languages from most frequent to least frequent.

Using the Address List and the Bibliography, rank the top 20 institutions in terms of the highest number of citations in the bibliography. Use the author addresses to identify the institutions.

List the organisms named in the citations for the top 20 institutions (in terms of the highest number of citations in the Bibliography). Use the author addresses to identify the institutions. I want to know if specific organisms or groups of organisms are associated with specific institutions.
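Several of the Bibliography queries above are simple tallies, so the LLM's answers can be cross-checked with a short script. The sketch below is an illustration under stated assumptions, not the procedure used in this post. It assumes each citation has been transcribed as a list of category codes keyed by citation label; the function name `tally_categories` and the codes "A" to "D" are hypothetical placeholders, not Emerson's actual classification.

```python
from collections import Counter


def tally_categories(citations):
    """Tally category codes for a mapping of citation label -> list of codes.

    Returns three Counters:
      per_category       - citations per single category code
      codes_per_citation - how many citations carry 1 code, 2 codes, ...
      combinations       - frequency of each multi-code combination
    """
    per_category = Counter()
    codes_per_citation = Counter()
    combinations = Counter()
    for codes in citations.values():
        # Deduplicate and sort so ("C", "A") and ("A", "C") count as one combination.
        unique = sorted(set(codes))
        per_category.update(unique)
        codes_per_citation[len(unique)] += 1
        if len(unique) > 1:
            combinations[tuple(unique)] += 1
    return per_category, codes_per_citation, combinations


# Hypothetical sample data; real codes would come from the SSS classification.
sample = {
    "1": ["A"],
    "2": ["A", "C"],
    "3": ["C", "A"],
    "4": ["B", "C", "D"],
}

per_category, codes_per_citation, combinations = tally_categories(sample)
print(per_category.most_common())   # citations per category
print(sorted(codes_per_citation.items()))  # citations with 1, 2, 3 ... codes
print(combinations.most_common())   # most frequent combinations first
```

A citation with several codes is counted once under each code, matching the instruction in the query to "count the citation for each single category".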