Digital Humanities Tools

This version was saved 9 years, 11 months ago
Saved by Alan Liu
on April 25, 2014 at 3:21:45 pm
 

Digital Humanities Resources for Student Project-Building

Curated by Alan Liu

 

 


 

 

Digital Humanities Tools

(Online or downloadable tools that are either free or have free student licenses or generous trial periods. Bias toward tools that run online or on a personal computer without needing to be installed on a local institutional server.)
For a more comprehensive tools list, see the following:
  • Bamboo DiRT (Digital Research Tools) (annotated tool directory; includes both commercial and free tools; can filter for "free")
  • TAPoR 2 Portal (annotated tool directory focused on "tools used in sophisticated text analysis and retrieval"; includes tool reviews)
  • Digital Textuality Resource Pages (listing of tools kept by Kimberly Knight and her students at U. Texas, Dallas, for text production, visualization, still image work, sound work, and video and animation; includes some student reviews of tools)
Students may also be interested in online hosting services for their own domains or sites. Some providers offer suites of content management systems like WordPress, Omeka, etc. Providers include:
checkmark = Currently a tool that is prevalent, canonical, or has "buzz" in the digital humanities community.
checkmark blue = Other tools with high power or general application (some caveats may apply for scholarly use)

 


 

  • Animation & Storyboarding
    • FrameByFrame (stop-motion animation tool for Mac; creates stop-motion animation videos using any webcam/video camera connected to your Mac, including iSight)
    • Pencil (2D animation software suitable for beginners at animation)
    • Popcorn Maker (creates interactive videos; "helps you easily remix web video, audio and images into cool mashups that you can embed on other websites. Drag and drop content from the web, then add your own comments and links . . . ; videos are dynamic, full of links and unique with every view") | Tutorial by Miriam Posner
    • Storyteller ("application from Amazon Studios that lets you turn a movie script into a storyboard. You choose the backgrounds, characters, and props to visually tell a story")

  • Audio Tools
    • checkmark blue Audiotool (free, web-based application for electronic music production; meant to serve as a fully functioning virtual studio. Users drag and drop synthesizers, drum machines, sequencers, filters, samples, and note sequences into the workspace from a toolbar)
    • Augmented Notes ("integrates scores and audio files to produce interactive multimedia websites in which measures of the score are highlighted in time with music.")
    • Praat (free software package for phonetic analysis; designed to analyse, synthesize, manipulate, and visualize speech)
    • Sonic Visualiser (program to facilitate study of musical recordings; "of particular interest to musicologists, archivists, signal-processing researchers and anyone else looking for a friendly way to take a look at what lies inside the audio file")

  • Authoring/Annotation/Editing/Publishing Platforms & Tools (including collaborative platforms) (see also under Content Management Systems and Exhibition Platforms & Tools) 
    • Annotation Studio ("suite of tools for collaborative web-based annotation.... Currently supporting the multimedia annotation of texts... will ultimately allow students to annotate video, image, and audio sources")
    • CommentPress ("open source theme and plugin for the WordPress blogging engine that allows readers to comment paragraph-by-paragraph, line-by-line or block-by-block in the margins of a text. Annotate, gloss, workshop, debate: ... do all of these things on a finer-grained level, turning a document into a conversation")
    • INKE Tools and Prototypes (tools and platforms developed by the INKE project)
    • Inklewriter (free online tool "designed to allow anyone to write and publish interactive stories. It’s perfect for writers who want to try out interactivity, but also for teachers and students looking to mix computer skills and creative writing"; "keeps your branching story organised, so you can concentrate on what’s important – the writing." Also allows export of stories to Kindle with hyperlinks for the interactive features of a story.)
    • NewRadial (visualization interface from the INKE project designed to facilitate studying, commenting on, and social editing of texts)
    • Oppia (Google's tool for making "embeddable interactive educational 'explorations' that let people learn by doing"; "Oppia aims to simulate the one-on-one interaction that a student has with a teacher by capturing and generalizing 'interaction dialogues'"; explorations can contain maps, images, text)
    • Prism ("a tool for "crowdsourcing interpretation." Users are invited to provide an interpretation of a text by highlighting words according to different categories, or "facets." Each individual interpretation then contributes to the generation of a visualization which demonstrates the combined interpretation of all the users. We envision Prism as a tool for both pedagogical use and scholarly exploration, revealing patterns that exist in the subjective experience of reading a text.")
    • checkmark Scalar (multi-modal authoring platform: "free, open source authoring and publishing platform that’s designed to make it easy for authors to write long-form, born-digital scholarship online. Scalar enables users to assemble media from multiple sources and juxtapose them with their own writing in a variety of ways, with minimal technical expertise required")
    • Scroll Kit (drag-and-drop online platform for creating scrollable multimedia narratives that also scale for mobile device screens; "Make stories people will want to touch.  Scroll Kit is a powerful visual content editor . . . typography, images, motion")

  • Content Management Systems
    • checkmark blue PBWorks (content management system hosted online with strong educational user base; particularly robust as a wiki platform for project or course sites; free education-user licenses)
    • checkmark WordPress (content management system based originally on blog paradigm; hosted online or downloadable for installation on local server)

  • Crowdsourcing
    • checkmark blue AllOurIdeas ("social data collection" wiki platform that solicits information online by survey "while still allowing for new information to 'bubble up' from respondents as happens in interviews, participant observation, and focus groups")

  • Design

  • Exhibition/Collection/Digital Edition Publication Platforms & Tools (see also Visualization subsections on Infographics and Timelines) 
    • CollectiveAccess ("cataloguing tool and web-based application for museums, archives and digital collections")
    • Exhibit (downloadable software for creating "web pages with advanced text search and filtering functionalities, with interactive maps, timelines, and other visualizations"; part of the Simile Widgets suite)
    • checkmark Omeka ("create complex narratives and share rich collections, adhering to Dublin Core standards with Omeka on your server, designed for scholars, museums, libraries, archives, and enthusiasts"; hosted online or downloadable for installation on server) | Getting Started
    • Open Exhibits ("free multitouch & multiuser software initiative for museums, education, nonprofits, and students")
    • checkmark Neatline ("allows scholars, students, and curators to tell stories with maps and timelines. As a suite of add-on tools for Omeka, it opens new possibilities for hand-crafted, interactive spatial and temporal interpretation"; downloadable for installation on server)
    • checkmark blue Prezi (alternative to PowerPoint; uses an infinite canvas metaphor rather than a slide metaphor; free online production and viewing version; offline production version by subscription)
    • ViewShare ("free platform for generating and customizing views (interactive maps, timelines, facets, tag clouds) that allow users to experience your digital collections")
    • checkmark blue Simile Widgets (embeddable code for visualizing time-based data, including Timeline, Timeplot, Runway, and Exhibition)
    • TextGrid ("a virtual research environment (VRE) for humanities scholars in which various tools and services are available for the creation, analysis, editing, and publication of texts and images"; provides "a variety of tested tools, services, and resources, allowing for the complete workflow of, for example, generating a critical textual edition"; "also supports the storage and re-use of research data through the integration of the TextGrid Repository")

  • Mapping
    • CartoDB (online tools for visualizing and analyzing geospatial data; free plan includes up to 5 tables and 5 MB of data)
    • ChartsBin (creates interactive maps)
    • Esri Story Maps: Storytelling with Maps ("Story maps combine intelligent Web maps with Web applications and templates that incorporate text, multimedia, and interactive functions")
    • checkmark blue Google Earth
      • Google Lit Trips (site unaffiliated with Google that provides "free downloadable files that mark the journeys of characters from famous literature on the surface of Google Earth. At each location along the journey there are placemarks with pop-up windows containing a variety of resources including relevant media, thought provoking discussion starters, and links to supplementary information about 'real world' references made in that particular portion of the story.  The focus is on creating engaging and relevant literary experiences for students."  Includes documentation about how to make lit trips.)
    • Google Maps "My Maps" ("create and share maps of your world, marked with the locations, routes and regions of interest that matter to you")
    • checkmark Neatline ("allows scholars, students, and curators to tell stories with maps and timelines. As a suite of add-on tools for Omeka, it opens new possibilities for hand-crafted, interactive spatial and temporal interpretation"; downloadable for installation on server)
    • Power Map Preview for Excel (download) (tool from Microsoft Research Labs that allows users to generate map visualizations from Excel spreadsheets, with geolocation, 2D and 3D data mapping, and interactive "video tours")
    • Timemap ("Javascript library to help use online maps, including Google, OpenLayers, and Bing, with a SIMILE timeline. The library allows you to load one or more datasets in JSON, KML, or GeoRSS onto both a map and a timeline simultaneously")
    • WorldMap (open source platform "to lower barriers for scholars who wish to explore, visualize, edit, collaborate with, and publish geospatial information.  WorldMap is Open Source software.... provides researchers with the ability to: upload large datasets and overlay them up with thousands of other layers; create and edit maps and link map features to rich media content; share edit or view access with small or large groups; export data to standard formats; make use of powerful online cartographic tools; georeference paper maps online...; publish one’s data to the world or to just a few collaborators")

  • Mind-Mapping (Conceptualization Tools)
    • DebateGraph (collaborative mindmapping platform that allows individuals or groups to: facilitate group dialogue, make shared decisions, report on conferences, make and share posters, tell non-linear stories, explore the connections between subjects, etc.)

  • Modeling & Simulation
    • checkmark NetLogo (downloadable software for agent-based simulations: "NetLogo is a programmable modeling environment for simulating natural and social phenomena. . . . NetLogo is particularly well suited for modeling complex systems developing over time. Modelers can give instructions to hundreds or thousands of independent 'agents' all operating concurrently. This makes it possible to explore the connection between the micro-level behavior of individuals and the macro-level patterns that emerge from the interaction of many individuals. NetLogo lets students open simulations and 'play' with them, exploring their behavior under various conditions. It is also an authoring environment which enables students, teachers and curriculum developers to create their own models. NetLogo is simple enough that students and teachers can easily run simulations or even build their own. And, it is advanced enough to serve as a powerful tool for researchers in many fields. NetLogo has extensive documentation and tutorials. It also comes with a Models Library, which is a large collection of pre-written simulations that can be used and modified. These simulations address many content areas in the natural and social sciences, including biology and medicine, physics and chemistry, mathematics and computer science, and economics and social psychology")
    • Scratch (visual programming platform developed by the MIT Media Lab to teach children about programming by allowing them to use a visual interface to create interactive programs, games, etc.; useful for allowing advanced humanities scholars without programming skills to program dynamic, interactive visual scenes and learn about programming logic)
    • checkmark blue Second Life (general-purpose, Internet-based, immersive, 3D, and highly scalable (massively multi-user) "virtual world" where users can create an avatar, create richly rendered spaces and objects, and interact with each other as well as with various media sources)
    • SET (Simulated Environment for Theatre) ("3D environment for reading, exploring, and directing plays. Designed and developed by a multidisciplinary team of researchers, SET uses the Unity game engine to allow users to both author and playback digital theatrical productions")

  • Network Analysis / Social Network Analysis (see also under Visualization)
    • checkmark Gephi ("interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs")
    • Netlytic ("cloud-based text and social networks analyzer that can automatically summarize large volumes of text and discover social networks from online conversations on social media sites such as Twitter, Youtube, blogs, online forums and chats")
    • TAGS v5.0 (Twitter Archiving Google Spreadsheet)

  • Statistics (see also under Visualization)
    • checkmark "R" (R Project for Statistical Computing) ("language and environment for statistical computing and graphics.... provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible")

  • Text Analysis (see also "Text Preparation" and "Topic Modeling") (See the TAPoR 2 portal of text-analysis tools for an omnibus listing with reviews, ratings, difficulty levels, etc.) 
    • checkmark blue BookLamp.org ("home of the Book Genome Project. Similar to how Pandora.com matches music lovers to new music, BookLamp helps you find books through a computer-based analysis of written DNA") | "Understanding the Book Genome Project"
      • Sentiment Browser ("observe a book's emotional ebbs & flows")
      • StoryDNA Browser ("allows you to search the Booklamp corpus for combinations of StoryDNA. Want to find books that have both Musical Performances and Sea Voyages? How about Medieval Weapons and Libraries? Assuming those books exist, this is the tool that will aid you in finding them")
      • Stream Graph Viewer ("we don't just record what StoryDNA we see inside of a book, we also look at where in the book that StoryDNA appears. This labs project is meant to show this data and allow users to explore the StoryDNA of books")
    • CLAWS ("grammatical tagger that analyzes words in a text by part of speech. Based on the approximately 10 million words of the British National Corpus")   
    • Concordance Programs:
    • Corpus Linguistics Programs/Resources:
    • checkmark blue Google Ngram Viewer
    • Lexos - Integrated Lexomics Workflow ("online tool ... to "scrub" (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations")
    • Macro-Etymological Analyzer (program by Jonathan Reeve that runs a frequency analysis of plain-text documents, looking up each word using the Etymological Wordnet, and tallying the words according to origin language family)
    • Named Entity Parsers
    • checkmark Poem Viewer ("web-based tool for visualizing poems in support of close reading")
    • Prospero ([documentation in French] text-analysis suite designed for humanists working with historical and diachronic textual series; focused on exploring "complex cases")
    • checkmark blue Sentiment Analysis (interactive demo plus information and research paper for the analysis of degrees of positive/negative "sentiment" in text passages based on an extensive "sentiment bank"; site includes downloadable dataset and code)
    • Signature Stylometric System ("program designed to facilitate "stylometric" analysis and comparison of texts, with a particular emphasis on author identification")
    • checkmark TAPoR (Text Analysis Portal) (collection of online text-analysis tools--ranging from the basic to sophisticated)
      • TAPoR 2.0 (current, redesigned TAPoR portal; includes tool descriptions and reviews; also includes documentation of some historical or legacy tools)
    • Textometrica (web-based text analysis tool designed to "analyse large amounts of text in several different ways. For example, you can examine the frequency of individual words, see how often one term is linked to another, and see which words together form ideas and concepts in the text.  Users can also create different visualisations and graphs from their text in order to gain a better overview of the structure of the text")
    • Textal ("free smartphone app that allows you to analyze websites, tweet streams, and documents, as you explore the relationships between words in the text via an intuitive word cloud interface. You can generate graphs and statistics, as well as share the data and visualizations in any way you like")
    • twXplorer (online service that provides search tools for Twitter tweets, terms, links, and hashtags in relation to each other; provides a first-pass analytical view of a tweet or term, for example, in its relevant context)
    • checkmark TXM (Textométrie) ("The TXM platform combines powerful and original techniques for the analysis of large text corpora using modular components and open-source.... Helps users to build and analyze any type of digital textual corpus possibly labeled and structured in XML... Distributed as a Windows, Linux or Mac software application ... and as an online portal run by a web application")
    • Voyant Tools
    • checkmark blue Word and Phrase.info (powerful tool that allows users to match texts they enter against the 450-million word Corpus of Contemporary American English [COCA] to analyze their text by word frequencies, word lists, collocates, concordance, and related phrases in COCA)
    • checkmark Word2Vec ("deep-learning" neural network analysis tool from Google that seeks out relationships (vectors) between words in texts)
    • checkmark WordHoard ("Powerful text-analysis tool for a select group of "highly canonical literary texts"--currently, all of early Greek epic (in original and translation), all of Chaucer and Shakespeare, and Edmund Spenser's Faerie Queene and Shepheardes Calendar")
    • WordSeer ("web-based text analysis and sensemaking environment for humanists and social scientists")
    • WordSimilarity (Word 2 Word) ("open-source tool to plot and visualize semantic spaces, allowing researchers to rapidly explore patterns in visual data representative of statistical relations between words. Words are visualized as nodes and word similarities as directed edges of varying strengths or thicknesses.... system contains a large library of ready to use, modern, statistical relationship models along with an interface to teach them from various language sources")
    • Word Tree (tool for online, interactive word trees for texts submitted by users)
    • WordWanderer ("We are experimenting with visual ways in which we can enhance people's engagement with language. By fusing the information we can obtain from corpus searches, concordance outputs and word clouds we are aiming to enable and encourage people to notice and wander through the words they read, write and speak")
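
Several of the tools listed above (for example Lexos and Textometrica) describe counting word frequencies and graphing rolling averages of a word's frequency across a text. The following is only a minimal sketch of that general idea in plain Python, not the method any of these tools actually uses; the function names, the sample sentence, and the window size are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens, top_n=5):
    """Return the top_n most common tokens with their counts."""
    return Counter(tokens).most_common(top_n)


def rolling_average(tokens, target, window=4):
    """For each window-sized slice of tokens, give the share equal to target."""
    hits = [1 if tok == target else 0 for tok in tokens]
    return [
        sum(hits[i:i + window]) / window
        for i in range(len(hits) - window + 1)
    ]


# Invented sample sentence, used only to exercise the functions.
sample = "the sea was calm and the ship sailed on the sea"
tokens = tokenize(sample)
print(word_frequencies(tokens, top_n=2))
print(rolling_average(tokens, "the", window=4))
```

On a real text the window would normally be hundreds or thousands of words, and the resulting list could be passed to any charting tool.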

  • Text Collation
    • checkmark Juxta Commons ("a tool that allows you to compare and collate versions of the same textual work")
    • Versioning Machine, version 4.0 ("a framework and an interface for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI) Guidelines")
    • Visualizing Variation ("code library of free, open-source, browser-based visualization prototypes that textual scholars can use in digital editions, online exhibitions, born-digital articles, and other projects. All of the visualization prototypes offered here deal with different aspects of the bibliographical phenomenon of textual variation: the tendency of words, lines, passages, images, prefatory material, and other aspects of texts to change from one edition to the next, and even between supposedly identical copies of the same edition. Variants are material reminders of the complex social lives of texts")
    • VVV (Version Variation Visualization) ("explore great works with their world-wide translations")
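
To give a rough sense of what comparing and collating versions of a text involves, here is a small sketch that uses Python's standard difflib module to list word-level differences between two versions of a passage. It is an illustration under simple assumptions (whitespace tokenization, two invented sample sentences, a hypothetical function name) and is not how Juxta Commons or the other tools above are implemented.

```python
import difflib


def collate(version_a, version_b):
    """List the word-level differences between two versions of a passage."""
    words_a = version_a.split()
    words_b = version_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    variants = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            # Record the kind of change and the differing words on each side.
            variants.append(
                (tag, " ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return variants


# Invented sample versions, not drawn from any real edition.
first = "the quick brown fox jumps"
second = "the quick red fox leaps"
for variant in collate(first, second):
    print(variant)
```

Each tuple reports the kind of change (replace, insert, or delete) together with the differing words from the first and second version.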

  • Text Preparation for Digital Work: Harvesting, Scraping, Cleaning, Classifying, etc. (also known as data "wrangling") -- See also tutorials on text wrangling
    • Python Resources
      • checkmark Python ("a clear and powerful object-oriented programming language" often used for text and data wrangling)
      • Beautiful Soup ("Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work")
    • Web Browser Automation (can be used for scripting web-scraping)
    • Data Science Tool Kit (variety of tools for such purposes as converting/mapping street address to geographical coordinates, coordinates to political areas, coordinates to statistics, IP address to coordinates, text to sentences [i.e., removing boilerplate from text passages], text to sentiment, HTML to text, HTML to story, text to people, and files to text [e.g., PDF, Word docs, and Excel spreadsheets to text])
    • Import.io ("Turn any website into a table of data or an API in minutes without writing any code")
    • Jeroen Janssens, "7 Command-line Tools for Data Science" (2013) (tools for "obtaining, scrubbing, exploring, modeling, and interpreting data")
    • checkmark blue Lexos - Integrated Lexomics Workflow ("online tool ... to "scrub" (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations")
    • NameChanger ("Rename a list of files quickly and easily. See how the names will change as you type")
    • NEX - Named Entity eXtraction (Web tool from dataTXT to identify names, concepts, etc. in short texts; also allows API access)
    • OpenRefine ("tool for working with messy data, cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase")
    • OutWit Hub (standalone program or Firefox extension for data extraction using Firefox: "contents extracted from a Web page are presented in an easy and visual way, without requiring any programming skills or advanced technical knowledge. Users can easily extract links, images, email addresses, RSS news, data tables, etc. from series of pages without ever seeing the source code. Extracted data can be exported to CSV, HTML, Excel or SQL databases, while images and documents, are directly saved to your hard disk"; paid "pro" version has more capabilities and capacity)
    • Overview ("automatically sorts thousands of documents into topics and sub-topics, by reading the full text of each one")
    • pdf2htmlEX  ("renders PDF files in HTML, utilizing modern Web technologies. It aims to provide an accurate rendering, while keeping optimized for Web display")
    • PhoTransEdit (text-to-phonetics online transcriber for turning English text into phonetic transcription using IPA symbols; also has free downloadable version)
    • Scan Tailor ("interactive tool for post-processing of scanned pages. It gives the ability to cut or crop pages, compensate for skew angle, and add / delete content fields and margins, among others. You begin with raw scans, and end up with tiff's that are ready for printing or assembly in PDF or DjVu file")
    • Scraper ("simple data mining extension for Google Chrome"; "to use it: highlight a part of the webpage you'd like to scrape, right-click and choose "Scrape similar...". Anything that's similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs")
    • ScraperWiki (free tools for scraping from Twitter and table extraction from PDFs)
    • ScraperWiki Classic (archive of user-created scraping tools for specific purposes and resources; includes resources and tutorials for creating your own scraper)
    • Scrapy (downloadable Python-based "fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing")
    • TET Plugin (Plugin for Adobe Acrobat designed to extract text from PDFs)
    • VARD 2 ("software produced in Java designed to assist users of historical corpora in dealing with spelling variation, particularly in Early Modern English texts. The tool is intended to be a pre-processor to other corpus linguistic tools such as keyword analysis, collocations," etc.)
    • Text Preparation "Recipes" for Topic Modeling Work:
      • Matthew Jockers
        • "'Secret' Recipe for Topic Modeling Themes" (guidance on creating stop lists, using parts-of-speech taggers to filter text, and "chunking" texts into suitable-length sections to optimize topic-modeling results)
        • "Expanded Stopwords List"  ("Below is the list of stop words I used in topic modeling a corpus of 3,346 works of 19th-century British, American, and Irish fiction. The list includes the usual high frequency words (“the,” “of,” “an,” etc) but also several thousand personal names.")
      • Andrew Goldstone & Ted Underwood, "Code Used ... in Analyzing Topic Models of Literary-studies Journals" (GitHub repository of stoplist, code, and resources for Goldstone and Underwood's topic modeling project)
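
The "recipes" above mention two common preparation steps before topic modeling: filtering a text against a stop list and "chunking" it into sections of suitable length. The sketch below illustrates both steps in plain Python. The tiny stop list, the chunk size, the sample sentence, and the function names are all hypothetical and chosen only for illustration; they are not taken from Jockers's or Goldstone and Underwood's code.

```python
import re

# Hypothetical, very short stop list; real lists can run to thousands of entries.
STOPWORDS = {"the", "of", "an", "a", "and", "to", "in"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Tokenize the text and drop any token found in the stop list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most size tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Invented sample sentence, used only to exercise the functions.
text = "The captain of the ship sailed to an island in the morning and slept"
kept = remove_stopwords(text)
for number, piece in enumerate(chunk(kept, size=3), start=1):
    print(number, " ".join(piece))
```

In practice the chunk size would be far larger than three tokens, and each chunk would be saved as its own document for the topic-modeling tool.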

  • Topic Modeling (see also Text Preparation "Recipes" for Topic Modeling Work above) (see also tutorials for topic modeling)
    • checkmark blue DFR-Browser (browser-based visualization interface created by Andrew Goldstone for exploring JSTOR articles [facilitated by the JSTOR "Data for Research" (DFR) site] through topic-modeling)
    • Gensim ("free Python library: scalable statistical semantics, analyze plain-text documents for semantic structure, retrieve semantically similar documents")
    • Glimmer.rstudio.com Topic Modeling (LDA) visualization tool (allows users to upload their own data to generate scatterplots and bar charts)
    • checkmark blue In-Browser Topic Modeling ("Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations"; by David Mimno.) Note: the files for this tool can be downloaded and run locally; download from GitHub here.
    • checkmark Mallet
      • Mallet (MAchine Learning for LanguagE Toolkit)
        • GRMM (GRaphical Models in Mallet)
    • The Networked Corpus ("a Python script that generates a collection of Web pages like the ones we have created for The Spectator.... designed to work with MALLET."  The Networked Corpus project "provides a new way to navigate large collections of texts. Using a statistical method called topic modeling, it creates links between passages that share common vocabularies, while also showing in detail the way in which the topic modeling program has “read” the texts.")
    • Stanford Topic Modeling Toolbox ("brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features that ability to: * Import and manipulate text from cells in Excel and other spreadsheets; * Train topic models (LDA, Labeled LDA, and PLDA new) to create summaries of the text; * Select parameters (such as the number of topics) via a data-driven process; * Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data")
    • TMVE ("basic implementation of a topic model visualization engine")
    • checkmark blue Topic Modeling Tool (Java-based "graphical user interface tool for Latent Dirichlet Allocation topic modeling" by David Newman; comes with test input files [look in "Downloads" tab on site]. Input files should be .txt files saved in the same directory; the input files are formatted with returns between each separate document)
    • "Two Topic Browsers" by Jonathan Goodwin
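
The Topic Modeling Tool entry above notes that its input files are .txt files with a return between each separate document. One reading of that note, offered here as an assumption rather than as the tool's documented requirement, is a plain-text file with one document per line. The sketch below shows how a set of documents might be flattened into that shape; the file name corpus.txt, the function name, and the sample documents are placeholders.

```python
import tempfile
from pathlib import Path


def to_one_document_per_line(documents):
    """Collapse internal whitespace so each document occupies a single line."""
    return "\n".join(" ".join(doc.split()) for doc in documents)


# Invented sample documents, used only to exercise the function.
documents = [
    "First document,\nspread over two lines.",
    "Second   document with  extra spaces.",
]
combined = to_one_document_per_line(documents)

# Write to a temporary directory; the file name is a placeholder.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "corpus.txt"
    target.write_text(combined + "\n", encoding="utf-8")
    print(target.read_text(encoding="utf-8"))
```

Collapsing line breaks inside each document matters because, under this reading, any stray return would be treated as the start of a new document.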

  • Video

  • Visual Programming
    • Scratch (visual programming platform developed by the MIT Media Lab to teach children about programming by allowing them to use a visual interface to create interactive programs, games, etc.; useful for allowing advanced humanities scholars without programming skills to program dynamic, interactive visual scenes and learn about programming logic) | Scratch 2.0 Offline Editor
    • Yahoo Pipes ("powerful composition tool to aggregate, manipulate, and mashup content from around the web....  Simple commands can be combined together to create output that meets your needs: combine many feeds into one, then sort, filter and translate it; geocode your favorite feeds and browse the items on an interactive map....")

  • Visualization (including data viz, graphing, and network visualization [subsections for infographics, timelines, word clouds])
    • General or Multiple Purpose Viz Tools:
      • Chart and Image Gallery: 30+ Free Tools for Data Visualization and Analysis (gathering of tools by Sharon Machlis)
      • Circos ("software package for visualizing data and information ... in a circular layout ... ideal for exploring relationships between objects or positions")
      • checkmarkD3.js ("a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS")
        • Scott Murray, "Made with D3.js" (curated gallery of example projects made with D3.js) (2013)
      • GapMinder World (online or desktop data/statistics animation)
      • checkmarkGephi ("interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs")
      • Google Fusion Tables (Google's "experimental data visualization web application to gather, visualize, and share larger data tables. Visualize bigger table data online; Filter and summarize across hundreds of thousands of rows.  Then try a chart, map, network graph, or custom layout  and embed or share it")
      • ImageJ  (image processing program that can create composite or average images)
        • ImagePlot ("free software tool that visualizes collections of images and video of any size.... implemented as a macro which works with the open source image processing program ImageJ")
      • checkmark blueManyEyes (powerful, flexible, online suite of dataset-to-diagrammatic visualization tools)
      • OpenHeatMap (creates "heat map" visualizations from spreadsheets)
      • Palladio (data visualization tool designed for humanities work; "web-based app that allows you to upload, visualize, and filter your data on-the-fly")
      • checkmark bluePixlr Editor (highly capable, free, Photoshop-like photo editor that runs entirely in Flash in a browser; allows users to import images from local or online sources, edit, resize, crop, adjust image features, apply filters, etc.; Pixlr also has versions for mobile devices)
      •   checkmarkProcessing ("Processing is a simple programming environment that was created to make it easier to develop visually oriented applications with an emphasis on animation and providing users with instant feedback through interaction. The developers wanted a means to “sketch” ideas in code. As its capabilities have expanded over the past decade, Processing has come to be used for more advanced production-level work in addition to its sketching role. Originally built as a domain-specific extension to Java targeted towards artists and designers, Processing has evolved into a full-blown design and prototyping tool used for large-scale installation work, motion graphics, and complex data visualization")
      • checkmark-red"R" (R Project for Statistical Computing) ("language and environment for statistical computing and graphics.... provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible")
      • checkmark-blueRAW (open web app that allows users to use a simple interface to upload data from a spreadsheet, choose and configure a vector graphics visualization, and export the results; built on top of the D3.js library)
      • checkmark-blueTableau Public ("within minutes, our free data visualization tool can help you create an interactive viz and embed it in your website or share it")
      • Viewshare ("free platform for generating and customizing views (interactive maps, timelines, facets, tag clouds) that allow users to experience your digital collections")
      • VisualSense ("interactive visualization and analysis tool ... developed for textual and numerical data extracted from image analysis of images from different cultures and influences")
      • WiGis ("visualization of large-scale, highly interactive graphs in a user's web browser. Our software is delivered natively in your web browser and does not require any plug-ins or add-ons. Our method produces clean, smooth animation in a browser through asynchronous data transfer (AJAX), and access to rich server side resources without the need for technologies such as Flash, Java Applets, Flex or Silverlight. We believe that our new techniques have broad reaching potential across the web")
      • yEd (downloadable and online diagramming tools. Functions include the automatic layout of networks and diagrams: "the yFiles library offers the user many advantages, one of which is its ability to automatically draw networks and diagrams. yFiles layout algorithms enable the clear presentation of flow charts, UML diagrams, organization charts, genealogies, business process diagrams, etc.")
    • Diagramming & Graphing Tools:
      • aiSee Graph Visualization ("graphing program for Windows, Mac OS X, and Linux")
      • Gliffy (online diagramming and flow-charting)
      • yEd (downloadable and online diagramming tools. Functions include the automatic layout of networks and diagrams: "the yFiles library offers the user many advantages, one of which is its ability to automatically draw networks and diagrams. yFiles layout algorithms enable the clear presentation of flow charts, UML diagrams, organization charts, genealogies, business process diagrams, etc.")
    • Infographics:
    • Network Visualization Tools (see also under "General or Multiple Purpose Viz Tools" above):
      • checkmarkD3.js ("a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS")
      • checkmarkGephi ("interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs")
      • NodeXL ("free, open-source template for Microsoft® Excel® 2007, 2010 and (possibly) 2013 that makes it easy to explore network graphs.  With NodeXL, you can enter a network edge list in a worksheet, click a button and see your graph, all in the familiar environment of the Excel window")
      • Textexture (online tool that allows users to "visualize any text as a network. The resulting graph can be used to get a quick visual summary of the text, read the most relevant excerpts (by clicking on the nodes), and find similar texts")
      • yEd (downloadable and online diagramming tools. Functions include the automatic layout of networks and diagrams: "the yFiles library offers the user many advantages, one of which is its ability to automatically draw networks and diagrams. yFiles layout algorithms enable the clear presentation of flow charts, UML diagrams, organization charts, genealogies, business process diagrams, etc.")
    • Text Viz (Specialized text viz tools, including word clouds, text difference, text variation):
      • checkmark blueHistory Flow Visualization (tool for visualizing the evolution of documents created by multiple authors) (download site for tool)
      • Tagxedo (word cloud from multiple sources)
      • Textexture (online tool that allows users to "visualize any text as a network. The resulting graph can be used to get a quick visual summary of the text, read the most relevant excerpts (by clicking on the nodes), and find similar texts")
      • Wordle (online word cloud tool)
      • Word Tree (tool for online, interactive word trees for texts submitted by users) 
    • Time Lines:
      • ChronoZoom (open-source project that allows users to create zoomable, Prezi-like timeline-history exhibitions "of everything," on various scales of time-space)
      • Dipity (timeline infographics)
      • checkmarkSimile Widgets (embeddable code for visualizing time-based data, including Timeline, Timeplot, Runway, and Exhibition)
      • Tiki-Toki (web-based platform for creating timelines with multimedia; capable of "3D" timelines)
      • Timeline Builder (online tool for building interactive Flash-based timelines from the Roy Rosenzweig Center for History and New Media)
      • Timeline JS
      • Timemap ("Javascript library to help use online maps, including Google, OpenLayers, and Bing, with a SIMILE timeline. The library allows you to load one or more datasets in JSON, KML, or GeoRSS onto both a map and a timeline simultaneously")
    • Twitter Visualization:
      • TAGSExplorer (step-by-step instructions with tools for archiving Twitter event hashtags and creating interactive visualizations of the conversations)
      • TweetBeam (creates "Twitter Wall" to "visualize the conversation around your event")
      • TweetsMap (analyzes and maps geographical location of one's Twitter followers)
      • Visible Tweets ("Visible Tweets is a visualisation of Twitter messages designed for display in public space") 

  • "Deformance" Tools: (While many tools can be used against-the-grain to "deform" materials for play or discovery, the following are tools expressly designed for this purpose.  On "deformance" in the digital humanities, see for example Mark Sample, "Notes Towards a Deformed Humanities")
    • The Eater of Meaning ("tool for extracting the message from the medium. Format and presentation are unaffected, but words and letters are subjected to an elaborate nonsensification process that eliminates semantics root and branch")
    • GIFMelter (creates dynamic, flowing distortions of online images)
    • Glitch Images (interactive interface with sliders to "glitch" imported .jpg images)
    • The N + 7 Machine (English version only; "The N+7 procedure, invented by Jean Lescure of Oulipo, involves replacing each noun in a text with the seventh one following it in a dictionary")


 Posted: October 12, 2012 | Author: Jens Finnäs | Filed under: Tutorials | Tags: google chrome, parliament, politics, scraper, screen scraping, tutorial |18 Comments »

How do you get information from a website into an Excel spreadsheet? The answer is screenscraping. There are a number of tools and platforms (such as OutWit Hub, Google Docs and ScraperWiki) that help you do this, but none of them are – in my opinion – as easy to use as the Google Chrome extension Scraper, which has become one of my absolute favourite data tools.
What is a screenscraper?

I like to think of a screenscraper as a small robot that reads websites and extracts pieces of information. When you are able to unleash a scraper on hundreds, thousands or even more pages, it can be an incredibly powerful tool.

In its simplest form, the one that we will look at in this blog post, it gathers information from one webpage only.
Google Chrome’s Scraper

Scraper is a Google Chrome extension that can be installed for free from the Chrome Web Store.

Now if you installed the extension correctly you should be able to see the option “Scrape similar” if you right-click any element on a webpage.

The Task: Scraping the contact details of all Swedish MPs

This is the site we’ll be working with, a list of all Swedish MPs, including their contact details. Start by right-clicking the name of any person and choose Scrape similar. This should open the following window.

Understanding XPaths
At w3schools you’ll find a broader introduction to XPaths.

Before we move on to the actual scrape, let me briefly introduce XPaths. XPath is a language for finding information in an XML structure, for example an HTML file. It is a way to select tags (or rather “nodes”) of interest. In this case we use XPaths to define which parts of the webpage we want to collect.

A typical XPath might look something like this:

    //div[@id="content"]/table[1]/tr

Which in plain English translates to:

    // - Search the whole document...
    div[@id="content"] - ...for the div tag with the id "content".
    table[1] -  Select the first table.
    tr - And in that table, grab all rows.
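
If you want to try this XPath outside the browser, here is a minimal sketch in Python using only the standard library. The markup is made up for the illustration, and two things are assumptions of this sketch rather than anything Scraper does: Python's ElementTree only supports a subset of XPath, and it wants the path to start with `.//` instead of `//`.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup for illustration only.
PAGE = """
<html><body>
  <div id="sidebar">
    <table><tr><td>ignored</td></tr></table>
  </div>
  <div id="content">
    <table>
      <tr><td>row 1</td></tr>
      <tr><td>row 2</td></tr>
    </table>
    <table>
      <tr><td>second table</td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Same idea as //div[@id="content"]/table[1]/tr, written relative to the root.
rows = root.findall(".//div[@id='content']/table[1]/tr")

# Collect the text of the first cell in every matched row.
row_texts = [row.find("td").text for row in rows]
print(row_texts)
```

Only the rows of the first table inside the "content" div are matched; the sidebar table and the second table are left out.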

Over to Scraper then. I’m given the following suggested XPath:

    //section[1]/div/div/div/dl/dt/a

The results look pretty good, but it seems we only get names starting with an A. And we would also like to collect phone numbers and party names. So let’s go back to the webpage and look at the HTML structure.

Right-click one of the MPs and choose Inspect element. We can see that each alphabetical list is contained in a section tag with the class “grid_6 alpha omega searchresult container clist”.

And if we open the section tag we find the list of MPs in div tags.

We will do this scrape in two steps. Step one is to select the tags containing all information about the MPs with one XPath. Step two is to pick the specific pieces of data that we are interested in (name, e-mail, phone number, party) and place them in columns.

Writing our XPaths

In step one we want to get as deep into the HTML structure as possible without losing any of the elements we are interested in. Hover over the tags in the Elements window to see which tags correspond to which elements on the page.

In our case this is the last tag that contains all the data we are looking for:

    //section[@class="grid_6 alpha omega searchresult container clist"]/div/div/div/dl

Click Scrape to test run the XPath. It should give you a list that looks something like this.

Scroll down the list to make sure it has 349 rows. That is the number of MPs in the Swedish parliament. The second step is to split this data into columns. Go back to the webpage and inspect the HTML code.

I have highlighted the parts that we want to extract. Grab them with the following XPaths:

    name: dt/a
    party: dd[1]
    region: dd[2]/span[1]
    seat: dd[2]/span[2]
    phone: dd[3]
    e-mail: dd[4]/span/a

Insert these paths in the Columns field and click Scrape to run the scraper.
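
The same two-step idea (one XPath for the rows, then one relative XPath per column) can be sketched in Python with the standard library. This is not how Scraper works internally, just an illustration: the markup, names, parties and phone numbers below are invented, and the real page would need a more forgiving HTML parser than ElementTree.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented, well-formed markup that imitates the structure described above.
PAGE = """
<html><body>
<section class="grid_6 alpha omega searchresult container clist">
  <div><div><div>
    <dl>
      <dt><a href="#">Andersson, Anna</a></dt>
      <dd>Example Party</dd>
      <dd><span>Stockholm</span><span>12</span></dd>
      <dd>08-000 00 00</dd>
      <dd><span><a href="mailto:anna@example.org">anna@example.org</a></span></dd>
    </dl>
    <dl>
      <dt><a href="#">Berg, Bo</a></dt>
      <dd>Sample Party</dd>
      <dd><span>Uppsala</span><span>34</span></dd>
      <dd>08-111 11 11</dd>
      <dd><span><a href="mailto:bo@example.org">bo@example.org</a></span></dd>
    </dl>
  </div></div></div>
</section>
</body></html>
"""

# Step one: the XPath that selects one node per MP.
ROW_XPATH = (
    './/section[@class="grid_6 alpha omega searchresult container clist"]'
    "/div/div/div/dl"
)

# Step two: one XPath per column, relative to each row node.
COLUMNS = {
    "name": "dt/a",
    "party": "dd[1]",
    "region": "dd[2]/span[1]",
    "seat": "dd[2]/span[2]",
    "phone": "dd[3]",
    "e-mail": "dd[4]/span/a",
}

root = ET.fromstring(PAGE.strip())
rows = []
for dl in root.findall(ROW_XPATH):
    # findtext returns None when nothing matches, so fall back to "".
    rows.append({col: (dl.findtext(path) or "").strip()
                 for col, path in COLUMNS.items()})

# Write the rows as CSV text, ready to open in a spreadsheet.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(COLUMNS))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each dictionary in `rows` corresponds to one line in Scraper's result table, and the keys correspond to the column names.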

Click Export to Google Docs to get the data into a spreadsheet. Here is my output.

Now that wasn’t so difficult, was it?
Taking it to the next level

Congratulations. You’ve now learnt the basics of screenscraping. Scraper is a very useful tool, but it also has its limitations. Most importantly, you are only able to scrape one page at a time. If you need to collect data from several pages, Scraper is no longer very efficient.

To take your screenscraping to the next level you need to learn a bit of programming. But fear not. It is not rocket science. If you understand XPaths you’ve come a long way.

Here is how you continue from here:

    Read Scraping for journalists by Paul Bradshaw. This e-book introduces scraping to non-programmers using tools such as Google Docs, OutWit Hub and ScraperWiki.
    Learn Ruby. Dan Nguyen shows you how with his Bastards Book of Ruby and Coding for Journalists 101.
    Or learn Python. One way to get started is the P2PU course Python for Journalists.
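
To give a taste of what that programming might look like, here is a rough Python sketch that runs the same row and column XPaths over several pages in a loop. Everything specific in it is an assumption made for the example: the URLs are placeholders, the page structure is invented, and the helper names `fetch` and `scrape_pages` are hypothetical. Real-world HTML is rarely well-formed enough for ElementTree, so in practice you would likely swap in a lenient third-party parser.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

ROW_XPATH = ".//dl"                          # one node per record
COLUMNS = {"name": "dt/a", "party": "dd[1]"}  # relative XPaths per column


def fetch(url):
    """Download a page and return its markup as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def scrape_pages(urls, fetch_page=fetch):
    """Apply the same XPaths to every page and collect all records."""
    records = []
    for url in urls:
        root = ET.fromstring(fetch_page(url))
        for row in root.findall(ROW_XPATH):
            record = {"source": url}
            for column, path in COLUMNS.items():
                record[column] = (row.findtext(path) or "").strip()
            records.append(record)
    return records


# Offline demonstration: a dictionary stands in for the website,
# so the sketch runs without any network access.
FAKE_SITE = {
    "http://example.org/mps?page=1":
        "<html><body><dl><dt><a>Andersson, Anna</a></dt>"
        "<dd>Example Party</dd></dl></body></html>",
    "http://example.org/mps?page=2":
        "<html><body><dl><dt><a>Berg, Bo</a></dt>"
        "<dd>Sample Party</dd></dl></body></html>",
}

records = scrape_pages(list(FAKE_SITE), fetch_page=FAKE_SITE.__getitem__)
for record in records:
    print(record)
```

Passing the download function in as a parameter keeps the loop testable offline; calling `scrape_pages` with a list of real URLs and the default `fetch` would hit the network instead.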

18 Comments on “Get started with screenscraping using Google Chrome’s Scraper extension”

    Dokumentation: Slidesen, klippen, tweetsen « Fajk
    November 26, 2012 at 23:12    

    [...] Blogg: Get started with screenscraping using Google Chrome’s Scraper extension [...]
    Reply    
    #Tip: Scrape web pages using this Chrome extension | Editors Blog | Journalism.co.uk
    June 28, 2013 at 10:10    

    […] Jens Finnäs has provided a tutorial on his Dataist blog. He explains how he used the extension to scrape the contact details of all Swedish […]
    Reply    
    DreamHost Discount
    July 11, 2013 at 08:17    

    This scaper plugin is really cool. I didn’t know that this kind of plugin really exists on the earth. Thanks for giving such a detailed instructions. I am gonna install it right now.
    Reply    
    Anton
    July 11, 2013 at 09:59    

    This plugin is not so bad, but http://convextra.com can do the same faster in 1 click..
    Reply    
    Tools, Slides and Links from NICAR13 // Ricochet by Chrys Wu
    September 16, 2013 at 07:51    

    […] • Scrape screen scraper Chrome extension. Journalist Jens Finnäs wrote a tutorial for it on Dataists. • Time Flow by Martin Wattenberg & Fernanda Viegas • Stately – a symbol font […]
    Reply    
    Craig Gooden
    September 26, 2013 at 14:55    

    I was wondering if anyone can help me – I’m very new to Google data scraping (and loving it!). I need to get data from a website (1000s of pages) – the information I’m after is in the source code – XX12345 01/2013 – what I need for each URL is the ‘XX12345 01/2013’. I have a list of the URLs in Excel / CSV format. Is there any automated way of doing it? Something like the Google Spreadsheet function importHTML. Any help appreciated.
    Reply    
        Craig Gooden
        September 26, 2013 at 14:57    

        Sorry, post has removed my code… I need to find the text in the code…
        XX12345 01/2013
        the bit I need is XX12345 01/2013
        Reply    
    Craig Gooden
    September 26, 2013 at 14:58    

    ah….

    __XX12345 01/2013__
    Reply    
    Craig Gooden
    September 26, 2013 at 14:59    

    I give up!!!!

    XX12345 01/2013
    Reply    
    Craig Gooden
    September 26, 2013 at 15:00    

    Basically it’s the text from within , XX12345 01/2013 , < , / , p
    Reply    
    Craig Gooden
    September 26, 2013 at 15:07    

    Ok, this site basically strips out the code I want to show you – I’m trying to get the text which is after the phrase ” dateCode ” on every webpage. I have a list of URLs in Excel/CSV. Can I use Google spreadsheet function like importHTML?
    Reply    
    solarscourge
    December 6, 2013 at 10:56    

    You should try GrabzIt’s on-line screen scraping tool, which offers more advanced features and is much more customizable: http://grabz.it/scraper
    Reply    
    john bole
    January 11, 2014 at 05:52    

    hello I was trying to scrape the business name, address, phone and URL from this list. The problem is that the scraper first needs to enter the hyperlink for each company and scrape info from second page. Can anybody help?

    Below is the website I need to get the info as you can see there are 37 pages

    http://ces14.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=1#GotoResults

    thanks
    Reply    

What’s this?

This blog is about finding, exploring and presenting data online. Or simply data journalism.

The author is Jens Finnäs, a freelance journalist from Finland, currently living in Stockholm, Sweden.
Get in touch

    jens.finnas@gmail.com
    Twitter: @jensfinnas
    Twitter: @the_dataist
