Data Collections and Datasets


Digital Humanities Resources for Project Building  — Data Collections & Datasets

   

(curated by Alan Liu)

(DH Toychest started 2013; last update 2017)

 


 

Data Collections & Datasets: Demo corpora | Document/Image Collections | Linguistic Corpora | Map Collections | Datasets

 

Starter Kit:


Demo Corpora  (Small to moderate-sized text collections for teaching, text-analysis workshops, etc.)

Quick Start  (Common text analysis tools & resources to get instructors and students started quickly)

 

Demo Corpora (Text Collections Ready for Use)

 

Demo corpora are sample or toy collections of texts that are ready to use for demonstrations or hands-on tutorials, e.g., for teaching text analysis or topic modeling. Ideal collections for this purpose are public domain or open access, in plain-text format, relatively modest in number of files, organized neatly in one or more folders, and downloadable as a zip file. (Contributions welcome: if you have created demo collections, please email ayliu@english.ucsb.edu.)

(Note: see the separate section of the DH Toychest for linguistic corpora, i.e., corpora usually made up of representative texts or excerpts and designed for purposes such as corpus linguistics.)
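
For readers who do want to script against a demo corpus of the kind described above (plain-text files, neatly foldered, packaged as a zip), the following is a minimal Python sketch and not part of the Toychest itself. It assumes the files are UTF-8 `.txt` files; the function names (`load_corpus`, `word_counts`, `make_demo_zip`) and the tiny in-memory archive are hypothetical stand-ins for a real downloaded collection.

```python
import io
import zipfile
from collections import Counter


def load_corpus(zip_source):
    """Read every .txt member of a zip archive into a {name: text} dict."""
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                # Assumption: files are UTF-8; undecodable bytes are replaced.
                texts[name] = archive.read(name).decode("utf-8", errors="replace")
    return texts


def word_counts(text):
    """Count lowercased, whitespace-separated tokens with edge punctuation stripped."""
    tokens = (t.strip(".,;:!?\"'()[]") for t in text.lower().split())
    return Counter(t for t in tokens if t)


def make_demo_zip():
    """Build a tiny in-memory archive standing in for a downloaded demo corpus."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo/one.txt", "The cat sat. The cat slept.")
        archive.writestr("demo/two.txt", "A dog barked at the cat.")
        archive.writestr("demo/README.md", "not part of the corpus")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    corpus = load_corpus(make_demo_zip())
    for name, text in sorted(corpus.items()):
        print(name, word_counts(text).most_common(3))
```

To use it with a real collection, pass the path of the downloaded zip file to `load_corpus` in place of `make_demo_zip()`. The tokenization here is deliberately crude; dedicated text-analysis tools handle punctuation, stop words, and encodings far more carefully.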


 

Plain-text collections downloadable as zip files:

 

Sites containing large numbers of books, magazines, etc., in plain-text format (among others) that must be downloaded individually:

 

 

Quick Start (Common Text-Analysis Tools & Resources)

 

A minimal selection of common tools and resources to get instructors and students working with text collections quickly. (These tools are suitable for moderate-scale collections of texts and do not require setting up a Python, R, or other programming-language development environment of the kind typical for advanced, large-scale text analysis.)


 

 

 

Document/Image Collections

 

 

API = has API for automated data access or export
Img = Image archives in public domain

 

 

Linguistic Corpora

(A "corpus" is a large collection of writings, sentences, and phrases. In linguistics, corpora cover particular nationalities, periods, and other kinds of language for use in the study of language.)

 

Map Collections

 

Datasets (Public / Open Datasets)

(Includes some datasets accessible only through APIs, especially if accompanied by code samples or embeddable code for using the APIs.) [This section is still being compiled; it may later be organized into sub-categories by discipline or topic.]

 

 


DH Toychest was started in 2013, and supersedes Alan Liu's older "Toy Chest".