Data Collections and Datasets
Last edited by Alan Liu 9 months, 3 weeks ago
Digital Humanities Resources for Project Building — Data Collections & Datasets
(DH Toychest started 2013; last update 2017)
Guides to Digital Humanities | Tutorials | Tools | Examples | Data Collections & Datasets
Starter Kit:
● Demo Corpora (Small to moderate-sized text collections for teaching, text-analysis workshops, etc.)
● Quick Start (Common text analysis tools & resources to get instructors and students started quickly)
Demo Corpora (Text Collections Ready for Use)
Demo corpora are sample or toy collections of texts that are ready-to-go for demonstration purposes or hands-on tutorials--e.g., for teaching text analysis, topic modeling, etc. Ideal collections for this purpose are public domain or open access, plain-text, relatively modest in number of files, organized neatly in a folder(s), and downloadable as a zip file. (Contributions welcome: if you have demo collections you have created, please email ayliu@english.ucsb.edu.)
(Note: see separate section in the DH Toychest for linguistic corpora--i.e., corpora usually of various representative texts or excerpts designed for such purposes as corpus linguistics.)
Plain-text collections downloadable as zip files:
- General Collections
- Historical Materials
- U.S. Presidents
- U.S. Presidents' Inaugural Speeches (all 57 inaugural speeches from Washington through Obama collected from the American Presidency Project with the assistance of project co-director John T. Woolley; assembled as individual plain-text files by Alan Liu) (zip file)
- Abraham Lincoln
- Lincoln Speeches & Letters (84 works and excerpts from Project Gutenberg's version of Speeches and Letters of Abraham Lincoln, 1832-1865, ed. Merwin Roe, 1907; assembled by Alan Liu as separate files for each work) (zip file) (metadata)
- DocSouth Data (selected collections from the Documenting The American South initiative at the University of North Carolina, Chapel Hill. Contains collections that have been packaged for text analysis. Each is a zip file in which a folder named "data" includes a "toc.csv" file of metadata and subfolders for both plain text and xml versions of the documents in the collection) (See additional literary materials in DocSouth Data)
- The Church in the Southern Black Community
- First-Person Narratives of the American South ("a collection of diaries, autobiographies, memoirs, travel accounts, and ex-slave narratives written by Southerners. The majority of materials in this collection are written by those Southerners whose voices were less prominent in their time, including African Americans, women, enlisted men, laborers, and Native Americans")
- North American Slave Narratives ("The North American Slave Narratives collection at the University of North Carolina contains 344 items and is the most extensive collection of such documents in the world")
- Michigan State University Libraries Text Collections
- Literature
- DocSouth Data (selected collections from the Documenting The American South initiative at the University of North Carolina, Chapel Hill. Contains collections that have been packaged for text analysis. Each is a zip file in which a folder named "data" includes a "toc.csv" file of metadata and subfolders for both plain text and xml versions of the documents in the collection) (see additional historical materials in DocSouth Data)
- First-Person Narratives of the American South ("a collection of diaries, autobiographies, memoirs, travel accounts, and ex-slave narratives written by Southerners. The majority of materials in this collection are written by those Southerners whose voices were less prominent in their time, including African Americans, women, enlisted men, laborers, and Native Americans")
- Library of Southern Literature
- North American Slave Narratives ("The North American Slave Narratives collection at the University of North Carolina contains 344 items and is the most extensive collection of such documents in the world")
- Fiction from the 1880s (sample corpora assembled from Project Gutenberg by students in Alan Liu's English 197 course, Fall 2014 at UC Santa Barbara) (zip files below)
- Shakespeare plays (24 plays from Project Gutenberg assembled by David Bamman, UC Berkeley School of Information) (zip file)
- txtLAB450 - A Multilingual Data Set of Novels for Teaching and Research (from the .txtLab at McGill U.: a collection of 450 novels "published in English, French, and German during the long nineteenth century (1770-1930). The novels are labeled according to language, year of publication, author, title, author gender, point of view, and word length. They have been labeled as well for use with the stylo package in R. They are drawn exclusively from full-text collections and thus should not have errors comparable to OCR’d texts." Download the plain-text novels as a zip file | Download the associated metadata as a .csv file.)
- William Wordsworth
- (with Samuel Taylor Coleridge) Lyrical Ballads, 1798 (assembled by Alan Liu from Project Gutenberg as separate files for each poem and the "Advertisement") (zip file) (metadata)
- The Prelude, 1850 version (William Knight's 1896 edition assembled by Alan Liu from Project Gutenberg as separate plain-text files for each book and cleaned of line numbers, notes, and note numbers) (zip file) (metadata)
- Miscellaneous
- Demo text collections assembled by David Bamman, UC Berkeley School of Information:
- Book summaries (2,000 book summaries from Wikipedia) (zip file)
- Film summaries (2,000 movie summaries from Wikipedia) (zip file)
- U.S. patents related to the humanities, 1976-2015 (patents mentioning "humanities" or "liberal arts" located through the U.S. Patent Office's searchable archive of fully-digital patent descriptions since 1976. Collected for the 4Humanities WhatEvery1Says project. Idea by Jeremy Douglass; files collected and scraped as plain text by Alan Liu)
- Metadata (Excel spreadsheet)
- Humanities Patents (76 patents related to humanities or liberal arts) (zip file)
- Humanities Patents - Extended Set (336 additional patents that mention the humanities or liberal arts in a peripheral way--e.g., only in reference citations and institutional names of patent holders, as minor or arbitrary examples, etc.) (zip file)
- Humanities Patents - Total Set (412 patents; combined set of above "Humanities Patents" and "Humanities Patents - Extended Set") (zip file)
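Most of the zip files listed above unpack to a folder of plain-text files, one per work. For instructors who want a quick sanity check on a download before a workshop, the following is a minimal Python sketch (not part of any collection's own tooling) that reads every .txt file in a zip archive and tallies word frequencies. The function names and the tiny in-memory demo archive are hypothetical stand-ins; the UTF-8 encoding and the very simple letters-only tokenizer are assumptions that may need adjusting for a given collection.

```python
import io
import re
import zipfile
from collections import Counter


def read_corpus(zip_source):
    """Return {member name: text} for every .txt file in a zip archive.

    zip_source may be a path to a downloaded zip or a file-like object.
    Assumes UTF-8; undecodable bytes are replaced rather than raising.
    """
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                raw = archive.read(name)
                texts[name] = raw.decode("utf-8", errors="replace")
    return texts


def word_counts(text):
    """Lowercase the text and count runs of letters (apostrophes allowed)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_demo_zip():
    """Create a tiny in-memory archive standing in for a real download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo/one.txt", "The ship sailed. The sea was calm.")
        archive.writestr("demo/two.txt", "The sea rose against the ship.")
        archive.writestr("demo/readme.md", "not part of the corpus")
    buffer.seek(0)
    return buffer


corpus = read_corpus(build_demo_zip())
totals = Counter()
for text in corpus.values():
    totals.update(word_counts(text))
print(totals.most_common(3))
```

To try this on a real collection, pass the path of the downloaded zip file to read_corpus in place of build_demo_zip().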
Sites containing large numbers of books, magazines, etc., in plain-text format (among others) that must be downloaded individually:
Quick Start (Common Text-Analysis Tools & Resources)
A minimal selection of common tools and resources to get instructors and students started working with text collections quickly. (These tools are suitable for use with moderate-scale collections of texts, and do not require setting up a Python, R, or other programming-language development environment, which is typical for advanced, large-scale text analysis.)
- Text-Preparation Tools
(Start here to clean, section, and wrangle texts to optimize them for text analysis)
- Lexos (Included in the Lexos online text-analysis workflow platform are tools for uploading, cleaning [scrubbing] texts, sectioning texts [cutting or chunking], and applying stopwords, lemmatization, and phrase consolidations)
- [See also fuller list of Text-Preparation Tools indexed on the tools page of DH Toychest]
- Text Analysis Tools
- AntConc (Concordance program with multiple capabilities commonly used by the corpus linguistics research community; with versions for Windows, Mac & Linux; site includes video tutorials. The tool can be used to generate a word frequency list, view words in a KWIC concordance view, locate word clusters and n-grams, and compare word usage to a reference corpus such as a corpus linguistics corpus of typical language use in a nation and time period)
- Lexos (Online integrated workflow platform for text analysis that allows users to upload texts, prepare and clean them in various ways, and then perform cluster analyses of various kinds and create visualizations from the results; can also be installed on one's own server) (Note: Lexos has recently added the capability to import Mallet 'topic-counts' output files to visualize topics as word clouds, and also to convert topic-counts files into so-called "topic documents" that facilitate cluster analyses of topics. See Lexos > Visualize > Multicloud)
- Overview (open-source web-based tool designed originally for journalists needing to sort large numbers of stories automatically and cluster them by subject/topic; includes visualization and reading interface; allows for import of documents in PDF and other formats. "Overview has been used to analyze emails, declassified document dumps, material from Wikileaks releases, social media posts, online comments, and more." Can also be installed on one's own server)
- Voyant Tools (Online text reading and analysis environment with multiple capabilities that presents statistics, concordance views, visualizations, and other analytical perspectives on texts in a dashboard-like interface. Works with plain text, HTML, XML, PDF, RTF, and MS Word files [multiple files best uploaded as a zip file]. Also comes with two pre-loaded sets of texts to work on (Shakespeare's works and the Humanist List archives [click the “Open” button on the main page to see these sets]))
- Topic Modeling: (Currently, Mallet is the standard, off-the-shelf tool that scholars in the humanities use for topic modeling. [Specifically, Mallet is an "LDA" or Latent Dirichlet Allocation topic modeling tool. For good "simple" explanations of LDA topic modeling intended for humanist and other scholars, see Edwin Chen and Ted Underwood's posts.] It is a command-line tool that requires students to fuss with exact strings of commands and path names. But the few GUI-interface implementations that now exist such as the Topic Modeling Tool do not allow for enough customization of options for serious exploration.)
- Mallet (Download source for Mallet, which must be installed in your local computer's root directory. See the Programming Historian's excellent tutorial for installing and starting with Mallet.)
- [See also fuller list of Text-Analysis Tools indexed on the tools page of DH Toychest]
- Stopwords (Lists of frequent, functional, and other words with little independent semantic value that text-analysis tools can be instructed to ignore--e.g., by loading a stopword list into AntConc [instructions, at 8:15 in tutorial video] or Mallet [instructions]. Researchers conducting certain kinds of text analysis (such as topic modeling) where common words, spelled-out numbers, names of months/days, or proper nouns such as "John" indiscriminately connect unrelated themes often apply stopword lists. Usually they start with a standard stopword list such as the Buckley-Salton or Fox lists and add words tailored to the specific texts they are studying. For instance, typos, boilerplate words or titles, and other features specific to a body of materials can be stopped out. For other purposes in text analysis such as stylistic study of authors, nations, or periods, common words may be included rather than being stopped out because their frequency and participation in patterns offer meaningful clues.)
- English:
- Buckley-Salton Stoplist (571 words) (1971; information about list)
- Fox Stoplist (421 words) (1989; information about list)
- Mallet Stoplist (523 words) (default English stop list hard-coded into Mallet topic modeling tool; stoplists for German, French, Finnish, and Japanese also included in "stoplists" folder in local Mallet installations) (Note: Use the default English stoplist in Mallet by adding the option "--remove-stopwords" in the command string when inputting a folder of texts. To add stopwords of your own or another stopwords file, create a separate text file and additionally use the command "--extra-stopwords filename". See Mallet stopwords instructions)
- Jockers Expanded Stoplist (5,621 words, including many proper names) ("the list of stop words I used in topic modeling a corpus of 3,346 works of 19th-century British, American, and Irish fiction. The list includes the usual high frequency words (“the,” “of,” “an,” etc) but also several thousand personal names.")
- Goldstone-Underwood Stoplist (6,032 words) (2013; stopword list used by Andrew Goldstone and Ted Underwood in their topic modeling work with English-language literary studies journals) (text file download link)
- Other Languages:
- Kevin Bougé Stopword Lists Page ("stop words for Arabic, Armenian, Brazilian, Bulgarian, Chinese, Czech, Danish, Dutch, English, Farsi, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish")
- Ranks NL Stopword Lists Page (stop words for Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Latvian, Lithuanian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu)
- Reference Linguistic Corpora (Curated corpus-linguistics corpora of language samples from specific nations and time periods designed to be representative of prevalent linguistic usage. May be used online [or downloaded for use in AntConc] as reference corpora for the purpose of comparing linguistic usage in specific texts against broader usage)
- Corpus.byu.edu (Mark Davies and Brigham Young University's excellent collection of linguistic corpora)
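The stopword practice described in the Quick Start list (start from a standard list, then add words specific to the texts being studied) can be illustrated outside any particular tool. The following is a minimal Python sketch under stated assumptions: the two short word lists are hypothetical stand-ins, far smaller than a real list such as Buckley-Salton or Fox, and the tokenizer simply keeps runs of letters. It is not how AntConc or Mallet implement stopword removal internally.

```python
import re


def load_stoplist(lines):
    """Build a stopword set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(text, stopwords):
    """Tokenize on runs of letters and drop tokens found in the stopword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


# Stand-in for a standard list (hypothetical and far shorter than a real one).
standard = load_stoplist(["the", "of", "and", "a", "in", "to", "was"])
# Corpus-specific additions, e.g. a proper name and boilerplate words.
custom = load_stoplist(["john", "chapter", "gutenberg"])
stopwords = standard | custom

sample = "Chapter 1. John walked to the harbor and watched the ships."
kept = remove_stopwords(sample, stopwords)
print(kept)
```

Because load_stoplist accepts any iterable of lines, a downloaded stoplist saved as a one-word-per-line text file can be passed in as an open file object. For stylistic studies where common words carry the signal, the same tokens can be counted without the filtering step.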
Document/Image Collections
API = has API for automated data access or export
Img = Image archives in public domain
- ARTFL: Public Databases (American and French Research on the Treasury of the French Language) (expansive collection of French-language resources in the humanities and other fields from the 17th to 20th centuries)
- Avant-garde and Modernist Magazines (Monoskop guide to modernist avant-garde magazines; includes extensive links to online archives and collections of the magazines)
- British Library Images ("We have released over a million images onto Flickr Commons for anyone to use, remix and repurpose. These images were taken from the pages of 17th, 18th and 19th century books digitised by Microsoft who then generously gifted the scanned images to us, allowing us to release them back into the Public Domain")
- BT (British Telecom) Digital Archives ("BT has teamed up with Coventry University and The National Archives to create a searchable digital resource of almost half a million photographs, reports and items of correspondence preserved by BT since 1846.... collection showcases Britain’s pioneering role in the development of telecommunications and the impact of the technology on society")
- CELL (created by the Electronic Literature Organization, "Consortium on Electronic Literature (CELL) is an open access, non-commercial resource offering centralized access to literary databases, archives, and institutional programs in the literary arts and scholarship, with a focus on electronic literature")
- Creative Commons Img
- Digging Into Data Challenge - List of Data Repositories
- Digital Public Library of America ("brings together the riches of America’s libraries, archives, and museums, and makes them freely available to the world") API
- Apps and API's drawing on DPLA:
- Digital Repositories from around the world (listed by EHPS, European History Primary Resources)
- ELMCIP Electronic Literature Knowledge Base ("cross-referenced, contextualized information about authors, creative works, critical writing, and practices")
- EEBO-TCP Texts (public-domain digitized texts from the Early English Books Online / Text Creation Partnership; "EEBO-TCP is a partnership with ProQuest and with more than 150 libraries to generate highly accurate, fully-searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online Database.... these trace the history of English thought from the first book printed in English in 1475 through to 1700.")
- EPC (Electronic Poetry Center)
- Europeana ("digitised collections of museums, libraries, archives and galleries across Europe") API
- Flickr Commons Img
- Folger Library Digital Image Collection Img ("tens of thousands of high resolution images from the Folger Shakespeare Library, including books, theater memorabilia, manuscripts, and art. Users can show multiple images side-by-side, zoom in and out, view cataloging information when available, export thumbnails, and construct persistent URLs linking back to items or searches")
- French Revolution Digital Archive Img (collaboration of the Stanford University Libraries and the Bibliothèque nationale de France to put online "high-resolution digital images of approximately 12,000 individual visual items, primarily prints, but also illustrations, medals, coins, and other objects, which display aspects of the Revolution")
- Gallica (documents and images from Gallica, the "digital library of the Bibliothèque nationale de France and its partners")
- GDELT: The Global Database of Events, Language, and Tone (downloadable datasets from "initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first "realtime social sciences earth observatory"; "Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present with daily updates")
- Getty Embeddable Images (major collection of stock and archival photos that includes over 30 million images that can be embedded in an iframe on a user's web page; in the embeddable image collection, hover over a photo and select the "</>" icon to get the HTML code for embedding)
- Google Advanced Image Search Img (can be used to search by usage rights)
- Google Maps Gallery (maps and map data from Google and content creators publishing their data on Google Maps)
- HathiTrust Digital Library ("international partnership of more than 50 research institutions and libraries ... working together to ensure the long-term preservation and accessibility of the cultural record"; "more than 8 million volumes, digitized from the partnering library collections"; "more than 2 million of these volumes are in the public domain and freely viewable on the Web. Texts of approximately 120,000 public domain volumes in HathiTrust are available immediately to interested researchers. Up to 2 million more may be available through an agreement with Google that must be signed by an institutional sponsor. More information about obtaining the texts, including the agreement with Google, is available at http://www.hathitrust.org/datasets")
- HuNI ("unlocking and uniting Australia's cultural datasets")
- Internet Archive
- Digital Books Collections
- Book Images Img (millions of images from books that the Internet Archive has uploaded to its Flickr account; images are accompanied by extensive metadata, including information on location in original book, text immediately before and after the image, any copyright restrictions that apply, etc. Images also tagged to enhance searching)
- Isidore (French-language portal that "allows researchers to search across a wide range of data from the social sciences and humanities. ISIDORE harvests records, metadata and full text from databases, scientific and news websites that have chosen to use international standards for interoperability")
- The Japanese American Evacuation and Resettlement: A Digital Archive (UC Berkeley) (Approximately 100,000 images and pages of text; searches by creator, genre of document, and confinement location produce records with downloadable PDF's of original documents [OCR'd])
- The Mechanical Curator (public-domain "randomly selected small illustrations and ornamentations, posted on the hour. Rediscovered artwork from the pages of 17th, 18th and 19th Century books")
- Media History Digital Library ("non-profit initiative dedicated to digitizing collections of classic media periodicals that belong in the public domain for full public access"; "digital scans of hundreds of thousands of magazine pages from thousands of magazine issues from 1904 to 1963") (to download plain-text versions of material, select a magazine and volume, click on the "IA Page" link, and on the resulting Internet Archive page for the volume click on the "All Files: HTTPS:" option; then save the file that ends "_djvu.txt")
- Metadata Explorer (searches Digital Public Library of America, Europeana, Digital New Zealand, Harvard, and other major library and collected repository metadata; then generates interactive network graphs for exploring the collections)
- Metropolitan Museum of Art (NYC) Img (400,000 downloadable hi-res public domain images from the museum's collection, identified with an icon for "OASC" or "Open Access for Scholarly Content"; see FAQ for OASC images)
- National Archives (U. S.)
- Nebraska Newspapers ("300,000 full-text digitized pages of 19th and early 20th Century newspapers from selected communities in Nebraska that can be used for text mining...TIFF images, JPEG2000, and PDFs with hidden text. Optical character recognition has been performed on the scanned images, resulting in dirty OCR")
- New York Times TimesMachine (requires NY Times subscription; provides searchable facsimile and full-text PDF access to historical archives of the Times before 1980)
- OAKsearch (portal for searching across multiple open-access collections for scholarly articles that are "digital, online, free-of-charge, and free of most copyright and licensing restrictions")
- Open Images Img ("open media platform that offers online access to audiovisual archive material to stimulate creative reuse. Footage from audiovisual collections can be downloaded and remixed into new works. Users of Open Images also have the opportunity to add their own material to the platform and thus expand the collection")
- Open Library ("We have well over 20 million edition records online, provide access to 1.7 million scanned versions of books, and link to external sources like WorldCat and Amazon when we can. The secondary goal is to get you as close to the actual document you're looking for as we can, whether that is a scanned version courtesy of the Internet Archive, or a link to Powell's where you can purchase your own copy")
- Oxford University Text Archive
- Perseus Digital Library ("Perseus has a particular focus upon the Greco-Roman world and upon classical Greek and Latin.... Early modern English, the American Civil War, the History and Topography of London, the History of Mechanics, automatic identification and glossing of technical language in scientific documents, customized reading support for Arabic language, and other projects that we have undertaken allow us to maintain a broader focus and to demonstrate the commonalities between Classics and other disciplines in the humanities and beyond")
- Powerhouse Museum (Sydney) API
- Project Gutenberg (42,000 free ebooks) [limited automated access (see also tips on downloading from Project Gutenberg)]
- Ranking Web of Repositories (extensive ranked listings, with links, to world online repositories that include peer-reviewed papers)
- Shared Shelf Commons ("free, open-access library of images. Search and browse collections with tools to zoom, print, export, and share images")
- SNAC (Social Networks & Archival Contexts) | Prototype
- Trove ("Find and get over 355,846,887 Australian and online resources: books, images, historic newspapers, maps, music, archives and more") API
- VADS Online Source for Visual Arts Img ("visual art collections comprising over 100,000 images that are freely available and copyright cleared for use in learning, teaching and research in the UK")
- YAGO2s ("huge semantic knowledge base, derived from Wikipedia WordNet and GeoNames") [downloadable metadata]
- Wellcome Collection of Historical Images (CC licensed)
Linguistic Corpora
(A "corpus" is a large collection of writings, sentences, and phrases. In linguistics, corpora cover particular nationalities, periods, and other kinds of language for use in the study of language.)
Map Collections
- David Rumsey Map Collection (includes historical maps)
- Library of Congress "American Memory: Map Collections" (focused on "Americana and Cartographic Treasures of the Library of Congress. These images were created from maps and atlases and, in general, are restricted to items that are not covered by copyright protection")
- Mapping History (provides "interactive and animated representations of fundamental historical problems and/or illustrations of historical events, developments, and dynamics. The material is copyrighted, but is open and available to academic users...")
- National Historical Geographical Information System (NHGIS) ("free of charge, aggregate census data and GIS-compatible boundary files for the United States between 1790 and 2011")
- Old Maps Online ("easy-to-use gateway to historical maps in libraries around the world"; "All copyright and IPR rights relating to the map images and metadata records are retained by the host institution which provided them and any query regarding them must be addressed to that particular host institution")
Datasets (Public / Open Datasets)
(Includes some datasets accessible only through API's, especially if accompanied by code samples or embeddable code for using the API's.) [Currently this section is being collected; organization into sub-categories by discipline or topic may occur in the future]
- Datasets of Humanities Materials (including materials related to culture, popular culture, the arts)
- Major Federated/Collected Humanities Dataset Sources
- Digital Public Library of America (DPLA): Developers (API's and bulk downloads)
- Europeana Data Downloads ("File dumps for 20 million objects from Europeana providers are downloadable in RDF at http://data.europeana.eu/download/2.0/. A detailed overview of the datasets and their data providers is available as a Google spreadsheet")
- HathiTrust Research Center (HTRC) Portal (allows registered users to search the HathiTrust's ~3 million public domain works, create collections, upload worksets of data in CSV format, and perform algorithmic analysis -- e.g., word clouds, semantic analysis, topic modeling) (Sign-up for login to HTRC portal; parts of the search and analysis platform requiring institutional membership also require a userid for the user's university)
- Features Extracted From the HTRC ("A great deal of fruitful research can be performed using non-consumptive pre-extracted features. For this reason, HTRC has put together a select set of page-level features extracted from the HathiTrust's non-Google-digitized public domain volumes. The source texts for this set of feature files are primarily in English. Features are notable or informative characteristics of the text. We have processed a number of useful features, including part-of-speech tagged token counts, header and footer identification, and various line-level information. This is all provided per-page.... The primary pre-calculated feature that we are providing is the token (unigram) count, on a per-page basis"; data is returned in JSON format)
- National Archive or National Centralized Dataset Sources
- Library & Bibliographical Datasets
- Datahub: Bibliographic Data ("open bibliographic datasets")
- British National Library Linked Open Data: Bulk Downloads
- Cambridge University Library Open Data
- Harvard Library Bibliographic Dataset ("over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials")
- JISC Open Bibliography British National Bibliography Dataset ("records of all new books published in the UK since 1950")
- JISC Open Bib Cambridge University Library Bibliographic Dataset
- National Library of France/Bibliothèque nationale de France: Semantic Web & Data Model Page (includes RDF data dumps)
- National Library of the Netherlands (Koninklijke Bibliotheek) - Data Services / API's
- New York Public Library: Data
- OCLC Developer Resources - API's (API's for access to WorldCat library data)
- Open Bibliographic Data Guide - Examples
- Open Library Data Dumps ("Open Library is an open, editable library catalog, building towards a web page for every book ever published.... provides dumps of all the data in various formats.... generated every month")
- The Open University - Linked Data (Datasets)
- Publications New Zealand: Metadata (bibliographic records created by the New Zealand "National Library and other New Zealand libraries, and a small number of bibliographic records from OCLC’s WorldCat")
- Museum Datasets
- Other Humanities Datasets (datasets for projects, disciplines, etc.; includes general-purpose datasets of use in humanities research)
- 100 Interesting Datasets for Statistics (Robb Seaton, 2014)
- 20th-Century American Bestsellers Database (compiled "by undergraduate and graduate students at the University of Virginia, the University of Illinois, Catholic University, and Brandeis University, from 1998 to the present")
- Australian Data Archive (ADA): Historical
- Bach Choral Harmony Data Set
- BaroqueArt ("data collection of Hispanic Baroque painters and paintings from 1550 to 1850"; downloadable dataset)
- Chronicling America Historic American Newspapers - Data Access and API's
- Connected Histories API (data access to Connected Histories British History Sources, 1500-1900, which facilitates searches for "digital resources related to early modern and nineteenth century Britain")
- CORE (COnnecting REpositories) - API's (data access to CORE portal providing "free access to scholarly publications distributed across many systems")
- Cultural Policy & the Arts National Data Archive (CPANDA) ("interactive digital archive of policy-relevant data on the arts and cultural policy in the United States")
- David Copperfield Word Adjacencies ("adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens. Please cite M. E. J. Newman, Phys. Rev. E 74, 036104 (2006)"; from Mark Newman's list of network datasets)
- DBPedia (structured data extracted from Wikipedia; includes downloads and datasets)
- English Heritage Places ("metadata for about 400,000 nationally important places as recorded by English Heritage, the UK Government's statutory adviser on the historic environment")
- French Old Regime Bureaucrats: Intendants de Province, 1661-1790 (data on "administrative elites for the years 1661-1770, with a limited amount of biographical information before these dates")
- Google Books Ngram Viewer Downloadable Datasets
- Hansard Parliamentary Debates Archive (downloadable datasets in XML)
- HathiTrust Research Center (HTRC) - Extracted Features Dataset ("dataset, consisting of page-level features extracted from a quarter-million text volumes")
- Historical Census Browser (U. S.)
- History Data Service (HDS) ("collects, preserves, and promotes the use of digital resources, which result from or support historical research, learning and teaching")
- ImageNet ("image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images")
- Kinomatics Project ("collects, explores, analyses and represents data about the creative industries.... Current focus is on the spatial and temporal dimensions of international film flow and the location of Australian live music gigs"; also includes visualizations and tools for film impact rating)
- Les Miserables ("coappearance network of characters in the novel Les Miserables. Please cite D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA (1993)"; from Mark Newman's list of network datasets)
- Marvel (Comics) Universe Social Graph (datasets include Gephi files and related files for: Comic and Hero Network, Marvel Social Network Comic and Hero Network Data, & Hero Social Network Data)
- Million Song Dataset ("a freely-available collection of audio features and metadata for a million contemporary popular music tracks")
- Modernist Journals Project - Datasets ("In addition to the MODS catalogue records and TEI text transcripts for each journal, we’re now making available three .txt-file datasets for each journal we post on the MJP Lab's Sourceforge repository. These datasets are derived from the MODS files, aggregating most of the data recorded there; they thus give users the MJP’s catalogue information about each journal without the hassle of having to extract, concatenate, and organize the data themselves")
- Movie Data Set (from UC Irvine Machine Learning Repository; "contains a list of over 10000 films including many older, odd, and cult films. There is information on actors, casts, directors, producers, studios, etc.")
- Movie Review Data ("distribution site for movie-review data for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity")
- New York Times - Developers (Search by API) ("why just read the news when you can hack it?"; API's for accessing headlines, abstracts, first paragraphs, links, etc. to NYT data; includes API's for articles, best sellers, comments by users, most popular items, newswire, and other parts of the Times) (Note: See details of access by time range to full-text, partial-text, PDF versions of the articles discoverable through the NYT API)
- Nobelprize.org Data (API and as Linked Data for "information about who has been awarded the Nobel Prize, when, in what prize category and the motivation, as well as basic information about the Nobel Laureates such as birth data and the affiliation at the time of the award")
- Old Bailey API (data for the Proceedings of the Old Bailey project with its "access to over 197,000 trials and biographical details of approximately 2,500 men and women executed at Tyburn"; the Old Bailey API includes "Old Bailey API Demonstrator to build queries and export texts to Voyant Tools")
- Online Catasto of 1427 ("searchable database of tax information for the city of Florence in 1427-29; c. 10,000 records")
- Open Images - API (data access to Open Images platform offering "online access to audiovisual archive material to stimulate creative reuse. Footage from audiovisual collections can be downloaded and remixed into new works. Users of Open Images also have the opportunity to add their own material to the platform and thus expand the collection")
- PastPlace API (historical gazetteer service for A Vision of Britain through Time)
- Polymath Virtual Library ("aims to bring together the works of the most important Hispanic polymaths and to establish semantic relationships between them, expressing the different schools of thought, from Seneca to Octavio Paz")
- Theatricalia ("database of past and future theatre productions.... nearly 20,000 productions involving over 60,000 people at over 1,500 theatres")
- Trans-Atlantic Slave Trade Database
- Datasets in Other Disciplines (sample datasets for social-science, demographic, ethnicity/diversity, economic, health, media, and other research)
- American National Election Studies datasets
- Association of Religion Data Archives (ARDA) ("collection of surveys, polls, and other data submitted by researchers and made available online ... nearly 775 data files")
- Australian Data Archive (ADA)
- Berman Jewish DataBank (North America)
- Bureau of Labor Statistics (U. S.)
- CAIDA (Cooperative Association for Internet Data Analysis): Datasets, Monitors, and Reports ("Collection and sharing of data for scientific analysis of Internet traffic, topology, routing, performance, and security-related events is one of CAIDA's core objectives")
- Cat Dataset ("10,000 cat images. For each image, we annotate the head of cat with nine points, two for eyes, one for mouth, and six for ears")
- CDC Wonder (public health data and related data from the U. S. Centers for Disease Control & Prevention)
- Census Bureau Data Tools and Apps (data from the U. S. Census Bureau)
- Center for International Earth Science Information Network (CIESIN) Data Links By Subject
- Center for Population Research in LGBT Health: Data Resources
- China Data Center (from U. Michigan)
- Data is Plural (extensive directory of public datasets of many kinds by Jeremy Singer-Vine, including many curious and small ones--e.g., the number of squirrels in New York's Central Park according to a squirrel count)
- DataFerrett ("data analysis and extraction tool to customize [U. S.] federal, state, and local data to suit your requirements.... you can develop an unlimited array of customized spreadsheets that are as versatile and complex as your usage demands then turn those spreadsheets into graphs and maps without any additional software")
- Data and Story Library (DASL) (pedagogically oriented site with large number of sample datasets accompanied by "stories" that apply "a particular statistical method to a set of data")
- Data.gov ("home of the US government’s open data. You can find Federal, state and local data, tools, and resources to conduct research, build apps, design data visualizations, and more")
- Data.gov.uk
- Data on the Net ("Search or browse our listing of 363 Internet sites of numeric Social Science statistical data, data catalogs, data libraries, social science gateways, addresses and more")
- Datasets for Data Mining and Data Science (from KDnuggets)
- DiversityData.org (U. S.) ("Create customized reports describing over 100 measures of diversity, opportunity, and quality of life for 362 metropolitan areas")
- Economic Policy Institute Datazone
- Enron Email Dataset ("data from about 150 users, mostly senior management of Enron ... contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation")
- EnronSent Corpus ("special preparation of a portion of the Enron Email Dataset designed specifically for use in Corpus Linguistics and language analysis ... created by cleaning up a portion of the original Enron Corpus. It contains 96,107 messages from the "Sent Mail" directories of all the users in the corpus. ... an attempt has been made to remove as much non-human generated text as possible from the raw messages in the original data")
- Global Health Observatory (GHO) (from the World Health Organization)
- Global Terrorism Database (GTD) ("open-source database including information on terrorist events around the world from 1970 through 2012 [with annual updates planned for the future].... includes systematic data on [U. S.] domestic as well as international terrorist incidents ... includes more than 113,000 cases")
- Google Datasets Search Engine
- Higher Education Datasets (U. S.) (from Data.gov)
- Homeland Security (U. S.) Data and Statistics (from U. S. Department of Homeland Security)
- ICWSM-2009 Blogs Dataset ("44 million blog posts made between August 1st and October 1st, 2008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. ... formatted in XML and is further arranged into tiers approximating to some degree search engine ranking.... To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form is processed ... you will be sent a URL and password where you can download the collection")
- Immigration Statistics (from U. S. Department of Homeland Security)
- Infochimps Financial Datasets (including stock market historical datasets in csv format)
- Mexican Migration Project (MMP)
- National Archive of Criminal Justice Data (U. S.)
- National Atlas Data Download (U. S.)
- New York Times - Developers (Search by API) ("why just read the news when you can hack it?"; API's for accessing headlines, abstracts, first paragraphs, links, etc. to NYT data; includes API's for articles, best sellers, comments by users, most popular items, newswire, and other parts of the Times)
- Open Context: Downloadable Data Tables (archaeological research)
- Pew Research Center Datasets (datasets for the following Pew Research projects: People & the Press; Journalism; Hispanic Trends; Global Attitudes; Internet and American Life; Social & Demographic Trends; Religion & Public Life)
- Public Datasets (list created by Vivek Patil)
- Qualitative Data Repository (QDR) ("selects, ingests, curates, archives, manages, durably preserves, and provides access to digital data used in qualitative and multi-method social inquiry"; Syracuse U.)
- Quora List of Large Public Datasets
- Reddit 2.5 Million Posts ("dataset of the all-time top 1,000 posts, from the top 2,500 subreddits by subscribers, pulled from reddit between August 15–20, 2013")
- Réseau Quetelet (French social-science datasets)
- Resource Center for Minority Data (RCMD) (U. S.)
- Robert Seaton, "100+ Interesting Data Sets for Statistics" ("Looking for interesting data sets? Here's a list of more than 100 of the best stuff, from dolphin relationships to political campaign donations to death row prisoners") (2014)
- Sharing Ancient Wisdoms (SAW): Data (RDF data files and other data for project on "collections of ideas and opinions – ranging from pithy sayings to short passages from longer philosophical texts - which make up the ancient genre of Wisdom Literature")
- SMS Spam Collection ("public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-encoded messages, tagged according being legitimate (ham) or spam")
- Social Computing Data Repository (datasets for social media research from Arizona State U.)
- Spambase Data Set (University of California, Irvine's dataset)
- Stanford Large Network Dataset Collection (datasets of nodes and edges for social networks, communication networks, citation networks, collaboration networks, Amazon networks, road networks, Twitter, online communities, etc.)
- Terrorism and Preparedness Data Resource Center (U. Michigan; "archives and distributes data collected by government agencies, non-governmental organizations (NGOs), and researchers about the nature of intra- (domestic) and international terrorism incidents, organizations, perpetrators, and victims; governmental and nongovernmental responses to terror, including primary, secondary, and tertiary interventions; and citizen's attitudes towards terrorism, terror incidents, and the response to terror")
- Texas Department of Criminal Justice Death Row Information (dataset of last words of prisoners executed since 1984)
- Time Series Data Library
- Twitter Data set for Arabic Sentiment Analysis Data Set
- UK Data Archive
- UNdata (data from the Statistics Division of the United Nations Department of Economic and Social Affairs; includes data access by API)
- University of California, Irvine, Machine Learning Repository Datasets
- University College Dublin's Open Data Sets of Social Networks (overview)
- U. S. National Survey on Drug Use and Health, 2012 (datasets)
- U. S. Survey of Inmates in State and Federal Correctional Facilities, 2004 (datasets)
- World Bank Poverty & Equity Data
- World Data Center for Human Interactions in the Environment (from Columbia U.)
- World Values Survey (WVS) ("global research project that explores people’s values and beliefs, their stability or change over time and their impact on social and political development of the societies in different countries of the world")
- Yelp Academic Dataset ("Yelp is providing all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research. We've provided some examples on GitHub to get you started. To get them running, you will need to install MRJob, our python framework for Map-Reduce computing")
DH Toychest was started in 2013, and supersedes Alan Liu's older "Toy Chest".