• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!


Data Collections and Datasets

This version was saved 8 years, 4 months ago View current version     Page history
Saved by Alan Liu
on February 11, 2016 at 11:02:05 pm
Digital Humanities projects collage



DH Toychest:



Data Collections and Datasets
           (curated by Alan Liu)

 Guides to Digital Humanities | Tutorials | Tools | Examples | Data Collections & Datasets



Starter Kit:

Demo Corpora  (Small to moderate-sized text collections for teaching, text-analysis workshops, etc.)

Quick Start  (Common text analysis tools & resources to get instructors and students started quickly)


Demo Corpora (Text Collections Ready for Use)


Demo corpora are sample or toy collections of texts that are ready-to-go for demonstration purposes or hands-on tutorials--e.g., for teaching text analysis, topic modeling, etc.  Ideal collections for this purpose are public domain or open access, plain-text, relatively modest in number of files, organized neatly in a folder(s), and downloadable as a zip file. (Contributions welcome: if you have demo collections you have created, please email ayliu@english.ucsb.edu.)

       (Note: see separate section in the DH Toychest  for linguistic corpora--i.e., corpora usually of various representative texts or excerpts designed for such purposes as corpus linguistics.)


arrow-right Plain-text collections downloadable as zip files:

  • General Collections
  • Historical Materials
  • Literature
  • Miscellaneous
    • Demo text collections assembled by David Bamman, UC Berkeley School of Information:
      • Book summaries (2,000 book summaries from Wikipedia) (zip file)
      • Film summaries (2,000 movie summaries from Wikipedia) (zip file)
    • U.S. patents related to the humanities, 1976-2015 (patents mentioning "humanities" or "liberal arts" located through the U.S  Patent Office's searchable archive of fully-digital patent descriptions since 1976. Collected for the 4Humanities WhatEvery1Says project. Idea by Jeremy Douglass; files collected and scraped as plain text by Alan Liu)
      • Metadata (Excel spreadsheet)
      • Humanities Patents (76 patents related to humanities or liberal arts) (zip file)
      • Humanities Patents - Extended Set (336 additional patents that mention the humanities or liberal arts in a peripheral way--e.g., only in reference citations and institutional names of patent holders, as minor or arbitrary examples, etc.) (zip file)
      • Humanities Patents - Total Set (412 patents; combined set of above "Humanities Patents" and "Humanities Patents - Extended Set") (zip file)


arrow-right Sites containing large numbers of books, magazines, etc., in plain-text format (among others) that must be downloaded individually:



Quick Start (Common Text-Analysis Tools & Resources)


A minimal selection of common tools and resources to get instructors and students started working with text collections quickly. (These tools are suitable for use with moderate-scale collections of texts, and do not require setting up a Python, R, or other programming-language development environment, which is typical for advanced, large-scale text analysis.)


  • arrow-right Text-Preparation Tools
    (Start here to clean, section, and wrangle texts to optimize them for text analysis)
    • Lexos (Included in the Lexos online text-analysis workflow platform are tools for uploading, cleaning [scrubbing] texts, sectioning texts [cutting or chunking], and applying stopwords, lemmatization, and phrase consolidations)
    • [See also fuller list of Text-Preparation Tools indexed on the tools page of DH Toychest]
  • arrow-right Text Analysis Tools
    • AntConc (Concordance program with multiple capabilities commonly used by the corpus linguistics research community; with versions for Windows, Mac & Linux; site includes video tutorials. The tool can be used to generate a word frequency list, view words in a KWIC concordance view, locate word clusters and n-grams, and compare word usage to a reference corpus such as a corpus linguistics corpus of typical language use in a nation and time period)
    • Lexos (Online integrated workflow platform for text analysis that allows users to upload texts, prepare and clean them in various ways, and then perform cluster analyses of various kinds and create visualizations from the results; can also be installed on one's own server) (Note: Lexos has recently added the capability to import Mallet 'topic-counts' output files to visualize topics as word clouds, and also to convert topic-counts files into so-called "topic documents" that facilitate cluster analyses of topics. See Lexos > Visualize > Multicloud)
    • Overview (open-source web-based tool designed originally for journalists needing to sort large numbers of stories automatically and cluster them by subject/topic; includes visualization and reading interface; allows for import of documents in PDF and other formats.  "Overview has been used to analyze emails, declassified document dumps, material from Wikileaks releases, social media posts, online comments, and more." Can also be installed on one's own server)
    • Voyant Tools (Online text reading and analysis environment with multiple capabilities that presents statistics, concordance views, visualizations, and other analytical perspectives on texts in a dashboard-like interface. Works with plain text, HTML, XML, PDF, RTF, and MS Word files [multiple files best uploaded as a zip file]. Also comes with two pre-loaded sets of texts to work on (Shakespeare's works and the Humanist List archives [click the “Open” button on the main page to see these sets])
    • Topic Modeling: (Currently, Mallet is the standard, off-the-shelf tool that scholars in the humanities use for topic modeling. [Specifically, Mallet is is a "LDA" or Latent Dirichlet Allocation topic modeling tool. For good "simple" explanations of LDA topic modeling intended for humanist and other scholars, see Edwin Chen and Ted Underwood's posts.] It is a command-line tool that requires students to fuss with exact strings of commands and path names. But the few GUI-interface implementations that now exist such as the Topic Modeling Tool do not allow for enough customization of options for serious exploration.)
      • Mallet (Download source for Mallet, which must be installed in your local computer's root directory. See the Programming Historian's excellent tutorial for installing and starting with Mallet.)
    • [See also fuller list of Text-Analysis Tools indexed on the tools page of DH Toychest]
  • arrow-right Stopwords (Lists of frequent, functional, and other words with little independent semantic value that text-analyis tools can be instructed to ignore--e.g., by loading a stopword list into Antconc [instructions, at 8:15 in tutorial video] or Mallet [instructions]. Researchers conducting certain kinds of text analysis (such as topic modeling) where common words, spelled-out numbers, names of months/days, or proper nouns such as "John" indiscriminately connect unrelated themes often apply stopword lists. Usually they start with a standard stopword list such as the Buckley-Salton or Fox lists and add words tailored to the specific texts they are studying. For instance, typos, boilerplate words or titles, and other features specific to a body of materials can be stopped out. For other purposes in text analysis such as stylistic study of authors, nations, or periods, common words may be included rather than being stopped out because their frequency and participation in patterns offer meaningful clues.)
    • English:
      • Buckley-Salton Stoplist (571 words) (1971; information about list)
      • Fox Stoplist (421 words) (1989; information about list
      • Mallet Stoplist (523 words) (default English stop list hard-coded into Mallet topic modeling tool; stoplists for German, French, Finnish, German, and Japanese also included in "stoplists" folder in local Mallet installations) (Note: Use the default English stoplist in Mallet by adding the option "--remove-stopwords " in the command string when inputting a folder of texts. To add stopwords of your own or another stopwords file, create a separate text file and additionally use the command "--extra-stopwords filename". See Mallet stopwords instructions)
      • Jockers Expanded Stoplist (5,621 words, including many proper names) ("the list of stop words I used in topic modeling a corpus of 3,346 works of 19th-century British, American, and Irish fiction. The list includes the usual high frequency words (“the,” “of,” “an,” etc) but also several thousand personal names.")
      • Goldstone-Underwood Stoplist (6,032 words) (2013; stopword list used by Andrew Goldstone and Ted Underwood in their topic modeling work with English-language literary studies journals) (text file download link)
    • Other Languages:
      • Kevin Bougé Stopword Lists Page ("stop words for Arabic, Armenian, Brazilian, Bulgarian, Chinese, Czech, Danish, Dutch, English, Farsi, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish")
      • Ranks NL Stopword Lists Page (stop words for Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Latvian, Lithuanian, Marathi, Norwegian, Persian, Polish, Portugese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukranian, Urdu)
  • arrow-right Reference Linguistic Corpora (Curated corpus-linguistics corpora of language samples from specific nations and time periods designed to be representative of prevalent linguistic usage. May be used online [or downloaded for use in Antconc] as reference corpora for the purpose of comparing linguistic usage in specific texts against broader usage)
    • Corpus.byu.edu (Mark Davies and Brigham Young University's excellent collection of linguistic corpora)



Document/Image Collections



API = has API for automated data access or export
Img = Image archives in public domain


  • ARTFL: Public Databases (American and French Research on the Treasury of the French Language) (expansive collection of French-language resources in the humanities and other fields from the 17th to 20th centuries)
  • Avant-garde and Modernist Magazines (Monoskop guide to modernist avant-garde magazines; includes extensive links to online archives and collections of the magazines)
  • British Library Images ("We have released over a million images onto Flickr Commons for anyone to use, remix and repurpose. These images were taken from the pages of 17th, 18th and 19th century books digitised by Microsoft who then generously gifted the scanned images to us, allowing us to release them back into the Public Domain")
  • BT (British Telecom) Digital Archives ("Bhas teamed up with Coventry University and The National Archives to create a searchable digital resource of almost half a million photographs, reports and items of correspondence preserved by BT since 1846.... collection showcases Britain’s pioneering role in the development of telecommunications and the impact of the technology on society")
  • CELL (created by the Electronic Literature Organization, "Consortium on Electronic Literature (CELL) is an open access, non-commercial resource offering centralized access to literary databases, archives, and institutional programs in the literary arts and scholarship, with a  focus on electronic literature")
  • Creative Commons  Img
  • Digging Into Data Challenge - List of Data Repositories 
  • Digital Public Library of America ("brings together the riches of America’s libraries, archives, and museums, and makes them freely available to the world") API
    • Apps and API's drawing on DPLA:
  • Digital Repositories from around the world (listed by EHPS, European History Primary Resources)
  • ELMCIP Electronic Literature Knowledge Base ("cross-referenced, contextualized information about authors, creative works, critical writing, and practices")
  • EEBO-TCP Texts (public-domain digitized texts from the Early English Books Online / Text Creation Partnership; "EEBO-TCP is a partnership with ProQuest and with more than 150 libraries to generate highly accurate, fully-searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online Database.... these trace the history of English thought from the first book printed in English in 1475 through to 1700.")
  • EPC (Electronic Poetry Center)
  • Europeana (" digitised collections of museums, libraries, archives and galleries across Europe") API
  • Flickr Commons Img
  • Folger Library Digital Image Collection  Img (" tens of thousands of high resolution images from the Folger Shakespeare Library, including books, theater memorabilia, manuscripts, and art. Users can show multiple images side-by-side, zoom in and out, view cataloging information when available, export thumbnails, and construct persistent URLs linking back to items or searches")
  • French Revolution Digital Archive Img (collaboration of the Stanford University Libraries and the Bibliothèque nationale de France to put online "high-resolution digital images of approximately 12,000 individual visual items, primarily prints, but also illustrations, medals, coins, and other objects, which display aspects of the Revolution")
  • Gallica (documents and images from Gallica, the "digital library of the Bibliothèque nationale de France and its partners")
  • GDELT: The Global Database of Events, Language, and Tone (downloadable datasets from "initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first "realtime social sciences earth observatory"; "Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present with daily updates")
  • Getty Embeddable Images (major collection of stock and archival photos that includes over 30 million images that can be embedded in an iframe on a user's web page; in the embeddable image collection, hover over a photo and select the "</>" icon to get the HTML code for embedding) 
  • Google Advanced Image Search  Img (can be used to search by usage rights)
  • Google Maps Gallery (maps and map data from Google and content creators publishing their data on Google Maps)
  • HathiTrust Digital Library ("international partnership of more than 50 research institutions and libraries ... working together to ensure the long-term preservation and accessibility of the cultural record"; "more than 8 million volumes, digitized from the partnering library collections"; "more than 2 million of these volumes are in the public domain and freely viewable on the Web. Texts of approximately 120,000 public domain volumes in HathiTrust are available immediately to interested researchers. Up to 2 million more may be available through an agreement with Google that must be signed by an institutional sponsor. More information about obtaining the texts, including the agreement with Google, is available at http://www.hathitrust.org/datasets")
  • HuNI ("unlocking and uniting Australia's cultural datasets")
  • Internet Archive
    • Digital Books Collections
    • Book Images  Img (millions of images from books that the Internet Archive has uploaded to its Flickr account; images are accompanied by extensive metadata, including information on location in original book, text immediately before and after the image, any copyright restrictions that apply, etc. Images also tagged to enhance searching)
  • Isidore (French-language portal that "allows researchers to search across a wide range of data from the social sciences and humanities. ISIDORE harvests records, metadata and full text from databases, scientific and news websites that have chosen to use international standards for interoperability")
  • The Japanese American Evacuation and Resettlement: A Digital Archive (UC Berkeley) (Approximately 100,000 images and pages of text; searches by creator, genre of document, and confinement location produce records with downloadable PDF's of original documents [OCR'd]
  • The Mechanical Curator ("public-domain "randomly selected small illustrations and ornamentations, posted on the hour. Rediscovered artwork from the pages of 17th, 18th and 19th Century books")
  • Media History Digital Library ("non-profit initiative dedicated to digitizing collections of classic media periodicals that belong in the public domain for full public access"; "digital scans of hundreds of thousands of magazine pages from thousands of magazine issues from 1904 to 1963") (to download plain-text versions of material, select a magazine and volume, click on the "IA Page" link, and on the resulting Internet Archive page for the volume click on the "All Files: HTTPS:" option; then save the file that ends "_djvu.txt")
  • Metadata Explorer (searches Digital Public Library of America, Europeana, Digital New Zealand, Harvard ,and other major library and collected repository metadata; then generates interactive network graphs for exploring the collections)
  • Metropolitan Museum of Art (NYC)  Img (400,000 downloadable hi-res public domain images from the museum's collection, identified with an icon for "OASC" or Open Access for Scholarly Content"; see FAQ for OASC images)
  • National Archives (U. S.)
  • Nebraska Newspapers ("300,000 full-text digitized pages of 19th and early 20th Century newspapers from selected communities in Nebraska that can be used for text mining...TIFF images, JPEG2000, and PDFs with hidden text. Optical character recognition has been performed on the scanned images, resulting in dirty OCR")
  • New York Times TimesMachine (requires NY Times subscription; provides searchable facsimile and full-text PDF access to historical archives of the Times before 1980)
  • OAKsearch (portal for searching across multiple open-access collections for scholarly articles that are "digital, online, free-of-charge, and free of most copyright and licensing restrictions")
  • Open Images  Img ("open media platform that offers online access to audiovisual archive material to stimulate creative reuse. Footage from audiovisual collections can be downloaded and remixed into new works. Users of Open Images also have the opportunity to add their own material to the platform and thus expand the collection")
  • Open Library ("We have well over 20 million edition records online, provide access to 1.7 million scanned versions of books, and link to external sources like WorldCat and Amazon when we can. The secondary goal is to get you as close to the actual document you're looking for as we can, whether that is a scanned version courtesy of the Internet Archive, or a link to Powell's where you can purchase your own copy")
  • Oxford University Text Archive 
  • Perseus Digital Library ("Perseus has a particular focus upon the Greco-Roman world and upon classical Greek and Latin.... Early modern English, the American Civil War, the History and Topography of London, the History of Mechanics, automatic identification and glossing of technical language in scientific documents, customized reading support for Arabic language, and other projects that we have undertaken allow us to maintain a broader focus and to demonstrate the commonalities between Classics and other disciplines in the humanities and beyond")
  • Powerhouse Museum (Sydney) API
  • Project Gutenberg (42,000 free ebooks) [limited automated access (see also tips on downloading from Project Gutenberg)]
  • Ranking Web of Repositories (extensive ranked listings, with links, to world online repositories that include peer-reviewed papers)
  • Shared Self Commons ("free, open-access library of images. Search and browse collections with tools to zoom, print, export, and share images")
  • SNAC (Social Networks & Archival Contexts) | Prototype
  • Trove ("Find and get over 355,846,887 Australian and online resources: books, images, historic newspapers, maps, music, archives and more") API
  • VADS Online Source for Visual Arts  Img ("visual art collections comprising over 100,000 images that are freely available and copyright cleared for use in learning, teaching and research in the UK")
  • YAGO2s ("huge semantic knowledge base, derived from Wikipedia WordNet and GeoNames") [downloadable metadata]


Linguistic Corpora

(A "corpus" is a large collection of writings, sentences, and phrases. In linguistics, corpora cover particular nationalities, periods, and other kinds of language for use in the study of language.)


Map Collections

  • David Rumsey Map Collection (includes historical maps)
  • Library of Congress "American Memory: Map Collections" (focused on "Americana and Cartographic Treasures of the Library of Congress. These images were created from maps and atlases and, in general, are restricted to items that are not covered by copyright protection")
  • Mapping History (provides "interactive and animated representations of fundamental historical problems and/or illustrations of historical events, developments, and dynamics.  The material is copyrighted, but is open and available to academic users...")
  • National Historial Geographical Information System (NHGIS) ("free of charge, aggregate census data and GIS-compatible boundary files for the United States between 1790 and 2011")
  • Old Maps Online ("easy-to-use gateway to historical maps in libraries around the world"; "All copyright and IPR rights relating to the map images and metadata records are retained by the host institution which provided them and any query regarding them must be addressed to that particular host institution")


Datasets (Public / Open Datasets)

(Includes some datasets accessible only though API's, especially if accompanied by code samples or embeddable code for using the API's.) [Currently this section is being collected; organization into sub-categories by discipline or topic may occur in the future]




























Comments (0)

You don't have permission to comment on this page.