University of Kansas: Analytical methods in XML

[University of Kansas, Spooner Hall]

Maintained by: David J. Birnbaum. License: Creative Commons BY-NC-SA 3.0 Unported. Last modified: 2013-08-24T11:06:36+0000


David J. Birnbaum
University of Pittsburgh

Jeffrey A. Rydberg-Cox
University of Missouri, Kansas City

Description and goals

This workshop focuses on the use of analytical tools (especially the statistical package R and the topic-modeling toolkit Mallet) and methods (especially Bayesian classification and SVG visualization) to discover and explore information within XML data. By the end of the sessions participants will have learned how to apply the techniques and methods discussed to the analysis and visualization of their own XML texts.

The workshop is intended for beginners, and no prior experience with any of the technologies is required, although participants will need to prepare the outside readings (see below) before each of the two working days. The workshop sessions will then guide the participants through the process of selecting a text, preparing it for processing with XML-related tools, and analyzing the text using R and Mallet.

Day 1 (6 hours of instruction) provides an overview of XML and XML-related technologies, including the tools needed to extract information from the XML in the formats required by the toolkits. Day 2 (6.5 hours of instruction) concentrates on the actual analysis of the data and on formatting it for textual and graphic presentation.

Prepare in advance

Instructor/host preparation

Before day 1

Before day 2



Day 1, before session 1 (8:30–9:00, 1/2 hour)

Goals: Ensure that participants’ computers are configured properly before the beginning of the workshop

Topics: Installing software

Which users | Which tools | Notes
All users | <oXygen/> | Patrick (and others)
Windows users | Cygwin

Day 1, session 1 (9:00–12:00, 3 hours)

Goals: Getting started

Topics: Using the command line and XML

Time | Topic | Notes
9:00–10:00 (60) | Using the command line (lecture and hands-on) | pwd, cd, less, cp, mv, grep (including regex), wc; redirection and piping (slides)
    Prep: Jeff
10:00–10:15 (15) | Coffee break with chocolate chip cookies
10:15–10:45 (30) | Introduction to XML and TEI Lite (lecture) | Document analysis, OHCO, elements, attributes, well-formedness, validity, entities and character references
    Prep: David
    TEI Lite tutorial
10:45–11:30 (45) | XML tagging (hands-on) | Examine a TEI Lite text (Wordhoard Hamlet)
    Autotag a plain text with regex search-and-replace (Gutenberg Hamlet)
    Prep: David
    Autotag Arienne’s dataset
    Prep: Jeff
11:30–11:45 (15) | Brief overview of the XML family of standards | Schema languages, Schematron, namespaces, XPath, XSLT, XQuery
    Prep: David
11:45–12:00 (15) | Brief overview of web standards | (X)HTML and HTML5, CSS (drive-by JavaScript, PHP)
    Prep: Jeff
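
As a rough illustration of the regex autotagging exercise in this session, the sketch below wraps each speech of a plain-text play in TEI-style markup. It is written in Python only for convenience (the session itself works with regex search-and-replace in an editor), and the input layout (a speaker name in capitals, a period, then the speech on the same line) and the sample lines are assumptions made for the example. Real Gutenberg texts are laid out differently and will need a different pattern.

```python
import re

# Hypothetical plain-text layout, invented for this sketch:
# SPEAKER in capitals, a period, a space, then the speech on one line.
plain = """HAMLET. To be, or not to be.
OPHELIA. Good my lord."""

# ^ and $ match at line boundaries because of re.MULTILINE.
# Group 1 captures the speaker name, group 2 the rest of the line.
pattern = re.compile(r"^([A-Z]+)\. (.+)$", flags=re.MULTILINE)

# Replace each matching line with a tagged speech; \1 and \2 refer
# back to the captured groups.
tagged = pattern.sub(r"<sp><speaker>\1</speaker><p>\2</p></sp>", plain)

print(tagged)
```

The same search and replacement expressions can be tried in an editor's find-and-replace dialog, although the backreference syntax (\1 versus $1) depends on the tool. The sketch does not escape characters such as & or < in the source text, so the result is not guaranteed to be well-formed XML for arbitrary input.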

Day 1, lunch (12:00–1:00, 60 minutes)

Lunch (provided) (12:00–1:00)

Day 1, session 2 (1:00–4:00, 3 hours)

Goals: Getting data out

Topics: Using XPath and XSLT

Time | Topic | Notes
1:00–1:30 (30) | XPath paths and axes (lecture and hands-on) | Prep: David
1:30–2:15 (45) | XPath predicates and functions (lecture and hands-on) | Prep: David
2:15–2:30 (15) | Coffee break with chocolate chip cookies
2:30–4:00 (90) | XSLT (lecture and hands-on) | Output should be plain text that can be used as input on day 2, including use of xsl:result-document
    Drive-by TEI-to-HTML
    Prep: Jeff
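
To give a sense of what "getting data out" with a path and a predicate looks like, here is a minimal sketch that pulls one speaker's lines out of a small XML sample as plain text. It uses Python's standard xml.etree.ElementTree module, which implements only a limited subset of XPath, as a stand-in for the XPath and XSLT tools used in the session. The sample document, its element names and the who attribute values are simplified assumptions for the example, not the markup of any text used in the workshop.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-like sample with no namespace.
sample = """<text>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>To be, or not to be</l></sp>
  <sp who="Ophelia"><speaker>Ophelia</speaker><l>Good my lord</l></sp>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>I humbly thank you</l></sp>
</text>"""

root = ET.fromstring(sample)

# Path plus predicate: every <l> child of an <sp> descendant whose
# who attribute equals "Hamlet".
hamlet_lines = [line.text for line in root.findall(".//sp[@who='Hamlet']/l")]

# Join the selected lines into plain text, one line of verse per line.
plain_text = "\n".join(hamlet_lines)
print(plain_text)
```

A real TEI document is in the TEI namespace, so the element names in the path would need to be namespace-qualified.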

Day 2, before session 1 (8:30–9:00, 1/2 hour)

Goals: Ensure that participants’ computers are configured properly before the beginning of the workshop

Topics: Installing software

Which users | Which tools | Notes
All users | R, Mallet, Saxon | Patrick (and others)

Day 2, session 1 (9:00–12:00, 3 hours)

Goals: Is this statistically unusual or interesting? What is it about?

Topics: Using XSLT, Perl, and R; calculating TFxIDF

Time | Topic | Notes
9:00–9:55 (55) | Introduction to Perl
    Tokenizing, stemming and parsing, counting features using Perl hashes
    Bayesian classification (lecture and hands-on)
    Do men and women speak differently in Hamlet?
    Count stuff in Arienne’s dataset
    Prep: Jeff
9:55–10:10 (15) | Coffee break with chocolate chip cookies
10:10–11:00 (50) | Variation, standard deviation and z-scores in R (lecture and hands-on; R for digital humanities [web site]) | Are Hamlet’s speeches significantly longer than anyone else’s?
    Correlation and difference in Arienne’s dataset (test significance based on counts of etymology)
    Prep: Jeff
11:00–12:00 (60) | Quantifying what a text is about; keyword extraction (lecture)
    Calculating TFxIDF as a keyword metric (hands-on)
    What words characterize each speaker’s vocabulary in Hamlet?
    Prep: Jeff
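
For orientation, TFxIDF weights a term by how often it occurs in one document (term frequency) and discounts it by how many documents contain it (inverse document frequency). The sketch below shows one common variant, relative frequency times log(N / document frequency), treating each speaker's words as one document. It is written in Python as a stand-in for the Perl and R used in the session; the function name tf_idf, the speaker labels and the toy word lists are assumptions for the example, and other TFxIDF weightings exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a label (e.g. a speaker) to a list of tokens.
    Returns a dict: label -> {term: TFxIDF score}."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for label, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[label] = {
            # Term frequency (relative) times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical toy data, not taken from Hamlet.
toy = {
    "speaker_a": ["the", "king", "the", "crown"],
    "speaker_b": ["the", "flowers", "river"],
}
result = tf_idf(toy)

# List each speaker's terms from most to least characteristic.
for label, term_scores in result.items():
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    print(label, ranked)
```

A word used by every speaker, such as "the" in the toy data, scores zero because log(N / N) is 0. That is why the metric highlights vocabulary that is distinctive for one speaker.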

Day 2, lunch (12:00–1:00)

Lunch (provided) (12:00–1:00)

Day 2, session 2 (1:00–4:30, 3.5 hours)

Goals: What is this text about? How can I make the information accessible?

Topics: Using Mallet; creating SVG and other visualizations

Time | Topic | Notes
1:00–2:00 (60) | Mallet (lecture and hands-on) | What do the topic models look like for Hamlet?
    Prep: Jeff
2:00–2:30 (30) | Why SVG: scalability, integration with HTML and JavaScript
    SVG basics: lines, circles, rectangles, text; the coordinate space and transformations
    Prep: David
2:30–2:45 (15) | Coffee break with chocolate chip cookies
2:45–4:30 (105) | XSLT transformation to SVG (hands-on) | Bar chart of speech lengths by character with z-score thresholds
    Scatter plot of average sentence length (y) over text chunks (x)?
    Visualize some of Arienne’s materials?
    Alternative visualizations
    Prep: David
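
The bar chart exercise combines two ideas from day 2: a z-score, (value - mean) / standard deviation, and SVG rectangles positioned in a coordinate space whose y axis grows downward. The sketch below shows both in Python rather than in the XSLT used in the session. The function names, the speech-length numbers, the threshold of 1.0, the colors and the layout constants are all assumptions chosen for the example, and the population standard deviation is used where a sample standard deviation would be an equally reasonable choice.

```python
import statistics
from xml.sax.saxutils import escape

def z_scores(values):
    """Return (value - mean) / population standard deviation for each value."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:  # all values identical: no variation to measure
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

def bar_chart_svg(lengths, threshold=1.0, bar_width=40, gap=10, scale=2):
    """Draw one bar per name; bars with |z| above the threshold are red."""
    names = list(lengths)
    values = [lengths[name] for name in names]
    zs = z_scores(values)
    label_space = 20  # room under the bars for the text labels
    height = max(values) * scale + label_space
    width = len(names) * (bar_width + gap) + gap
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{width}" height="{height}">']
    for i, (name, value, z) in enumerate(zip(names, values, zs)):
        x = gap + i * (bar_width + gap)
        bar_height = value * scale
        # SVG y grows downward, so a taller bar starts at a smaller y.
        y = height - label_space - bar_height
        fill = "red" if abs(z) > threshold else "gray"
        parts.append(f'<rect x="{x}" y="{y}" width="{bar_width}" '
                     f'height="{bar_height}" fill="{fill}"/>')
        parts.append(f'<text x="{x}" y="{height - 5}" '
                     f'font-size="10">{escape(name)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)

# Hypothetical average speech lengths, not measured from Hamlet.
toy_lengths = {"A": 10, "B": 12, "C": 11, "D": 40}
svg = bar_chart_svg(toy_lengths)
print(svg)
```

Saving the printed output to a file with an .svg extension and opening it in a browser should display the chart. In an XSLT version the same arithmetic would typically appear as XPath expressions inside attribute value templates.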