Data Collections and Datasets








Digital Humanities Resources for Project Building  — Data Collections & Datasets


(curated by Alan Liu)

(DH Toychest started 2013; last update 2017)




Data Collections & Datasets: Demo Corpora | Document/Image Collections | Linguistic Corpora | Map Collections | Datasets


Starter Kit:

Demo Corpora  (Small to moderate-sized text collections for teaching, text-analysis workshops, etc.)

Quick Start  (Common text analysis tools & resources to get instructors and students started quickly)


Demo Corpora (Text Collections Ready for Use)


Demo corpora are sample or toy collections of texts that are ready to go for demonstrations or hands-on tutorials, e.g., for teaching text analysis, topic modeling, etc. Ideal collections for this purpose are public domain or open access, in plain text, relatively modest in number of files, organized neatly in one or more folders, and downloadable as a zip file. (Contributions welcome: if you have created demo collections, please email the curator.)

(Note: see the separate section in the DH Toychest for linguistic corpora, i.e., corpora usually made up of various representative texts or excerpts and designed for such purposes as corpus linguistics.)
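For instructors who do want to work with a zipped plain-text collection programmatically, the following is a minimal sketch in Python, not a recommended workflow from the DH Toychest. It assumes a zip archive containing UTF-8 `.txt` files; the file names, the sample sentences, and the helper names `read_corpus`, `word_counts`, and `make_demo_zip` are all hypothetical, and the archive is built in memory so the example runs without downloading anything.

```python
import io
import re
import zipfile
from collections import Counter


def read_corpus(zip_source):
    """Return {filename: text} for every .txt file inside a zip archive."""
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                # Assumption: files are UTF-8; undecodable bytes are replaced.
                raw = archive.read(name)
                texts[name] = raw.decode("utf-8", errors="replace")
    return texts


def word_counts(text):
    """Count lowercase word tokens made of ASCII letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def make_demo_zip():
    """Build a tiny stand-in corpus in memory (hypothetical file names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo/one.txt", "The whale, the sea, and the ship.")
        archive.writestr("demo/two.txt", "A ship on the sea.")
    buffer.seek(0)
    return buffer


# Read the corpus and total the word counts across all files.
corpus = read_corpus(make_demo_zip())
totals = Counter()
for text in corpus.values():
    totals.update(word_counts(text))

print(totals.most_common(3))
```

To try this on a downloaded demo corpus, pass the path of the zip file to `read_corpus` in place of `make_demo_zip()`. The tokenizer here is deliberately crude and handles only unaccented English letters.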


Plain-text collections downloadable as zip files:


Sites containing large numbers of books, magazines, etc., in plain-text format (among others) that must be downloaded individually:



Quick Start (Common Text-Analysis Tools & Resources)


A minimal selection of common tools and resources to get instructors and students started working with text collections quickly. (These tools are suitable for use with moderate-scale collections of texts, and do not require setting up a Python, R, or other programming-language development environment, which is typical for advanced, large-scale text analysis.)




Document/Image Collections



API = has API for automated data access or export
Img = Image archives in public domain



Linguistic Corpora

(A "corpus" is a large collection of writings, sentences, and phrases. In linguistics, corpora cover particular nationalities, periods, and other kinds of language for use in the study of language.)


Map Collections


Datasets (Public / Open Datasets)

(Includes some datasets accessible only through APIs, especially if accompanied by code samples or embeddable code for using the APIs.) [Currently this section is being collected; organization into sub-categories by discipline or topic may occur in the future.]



DH Toychest was started in 2013, and supersedes Alan Liu's older "Toy Chest".