University of Kansas: Analytical methods in XML

Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2013-08-24T11:06:36+0000

Instructors

David J. Birnbaum
University of Pittsburgh
Email: djbpitt@gmail.com
URL: http://www.obdurodon.org

Jeffrey A. Rydberg-Cox
University of Missouri, Kansas City
Email: rydbergcoxj@umkc.edu
URL: http://r.web.umkc.edu/rydbergcoxj/

Description and goals

This workshop focuses on the use of analytical tools (especially the statistical package R and the topic-modeling toolkit Mallet) and methods (especially Bayesian classification and SVG visualization) to discover and explore information within XML data. By the end of the sessions participants will have learned how to apply the techniques and methods discussed to the analysis and visualization of their own XML texts.

The workshop is intended for beginners, and no prior experience with any of the technologies is required, although participants will need to prepare the outside readings (see below) before each of the two working days. The workshop sessions will then guide the participants through the process of selecting a text, preparing it for processing with XML-related tools, and analyzing the text using R and Mallet.

Day 1 (6 hours of instruction) provides an overview of XML and XML-related technologies, including the tools needed to extract information from the XML in the formats required by the toolkits. Day 2 (6.5 hours of instruction) concentrates on the actual analysis of the data and on formatting it for textual and graphic presentation.

Prepare in advance

Instructor/host preparation

Create wireless accounts for all users
Prepare jump drives with
- Data files
- Saxon, Mallet, R, <oXygen/>

Before day 1

Read
Install all major browsers on your laptops and ensure that they’re up to date
- Windows users: Chrome, Firefox, Internet Explorer, Opera, Safari
- Mac users: Chrome, Firefox, Opera, Safari
- Linux users: Firefox
Install <oXygen/> on your laptop. A free, fully functional temporary license is available from the download site.

Before day 2

Read

Syllabus

Day 1, before session 1 (8:30–9:00, 1/2 hour)

Goals: Ensure that participants’ computers are configured properly before the beginning of the workshop

Topics: Installing software

Which users	Which tools	Notes
All users	<oXygen/>	Patrick (and others)
Windows users	Cygwin	Patrick (and others)

Day 1, session 1 (9:00–12:00, 3 hours)

Goals: Getting started

Topics: Using the command line and XML

Time	Topic	Notes
9:00–10:00 (60)	Using the command line (lecture and hands-on)	pwd, cd, less, cp, mv, grep (including regex), wc; redirection and piping (slides). Prep: Jeff
10:00–10:15 (15)	Coffee break with chocolate chip cookies
10:15–10:45 (30)	Introduction to XML and TEI Lite (lecture)	Document analysis, OHCO, elements, attributes, well-formedness, validity, entities and character references. Prep: David. See: TEI Lite tutorial
10:45–11:30 (45)	XML tagging (hands-on)	Examine a TEI Lite text (WordHoard Hamlet). Autotag a plain text with regex search-and-replace (Gutenberg Hamlet). Prep: David. Autotag Arienne’s dataset. Prep: Jeff
11:30–11:45 (15)	Brief overview of the XML family of standards	Schema languages, Schematron, namespaces, XPath, XSLT, XQuery. Prep: David
11:45–12:00 (15)	Brief overview of web standards	(X)HTML and HTML5, CSS (drive-by JavaScript, PHP). Prep: Jeff

Day 1, lunch (12:00–1:00, 60 minutes)

Lunch (provided) (12:00–1:00)

Day 1, session 2 (1:00–4:00, 3 hours)

Goals: Getting data out

Topics: Using XPath and XSLT

Time	Topic	Notes
1:00–1:30 (30)	XPath paths and axes (lecture and hands-on)	Prep: David
1:30–2:15 (45)	XPath predicates and functions (lecture and hands-on)	Prep: David
2:15–2:30 (15)	Coffee break with chocolate chip cookies
2:30–4:00 (90)	XSLT (lecture and hands-on)	Output should be plain text that can be used as input on day 2, including the use of xsl:result-document. Drive-by TEI-to-HTML. Prep: Jeff
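
As a rough illustration of what "getting data out" means here, the following sketch pulls speeches out of a small TEI-like fragment and writes one plain-text line per speech. It is written in Python rather than XSLT, which is a substitution made for this example only; the sample markup, the function name speeches_as_text, and the tab-separated output layout are all assumptions, not the workshop's actual files or stylesheets.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment. Real TEI P5 files declare a namespace,
# which would have to be included in every path step.
SAMPLE = """<text><body>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>To be, or not to be,</l><l>that is the question.</l></sp>
  <sp who="Ophelia"><speaker>Ophelia</speaker><l>Good my lord,</l></sp>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>I humbly thank you.</l></sp>
</body></text>"""


def speeches_as_text(xml_string, who=None):
    """Return one tab-separated line per speech: speaker, then the spoken text."""
    root = ET.fromstring(xml_string)
    # ElementTree supports only a small XPath subset: descendant steps
    # (.//sp) and simple attribute predicates ([@who='...']).
    path = ".//sp" if who is None else f".//sp[@who='{who}']"
    rows = []
    for sp in root.findall(path):
        # Join the verse lines of the speech, skipping the <speaker> label.
        words = " ".join((line.text or "").strip() for line in sp.findall("l"))
        rows.append(f"{sp.get('who')}\t{words}")
    return rows


if __name__ == "__main__":
    print("\n".join(speeches_as_text(SAMPLE)))
```

Passing who="Hamlet" restricts the output to one speaker, which plays the role of an XPath predicate on the who attribute.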

Day 2, before session 1 (8:30–9:00, 1/2 hour)

Goals: Ensure that participants’ computers are configured properly before the beginning of the workshop

Topics: Installing software

Which users	Which tools	Notes
All users	R, Mallet, Saxon	Patrick (and others)

Day 2, session 1 (9:00–12:00, 3 hours)

Goals: Is this statistically unusual or interesting? What is it about?

Topics: Using XSLT, Perl, and R; calculating TFxIDF

Time	Topic	Notes
9:00–9:55 (55)	Introduction to Perl. Tokenizing, stemming and parsing, counting features using Perl hashes. Bayesian classification (lecture and hands-on)	Do men and women speak differently in Hamlet? Count stuff in Arienne’s dataset. Prep: Jeff
9:55–10:10 (15)	Coffee break with chocolate chip cookies
10:10–11:00 (50)	Variation, standard deviation and z-scores in R (lecture and hands-on; R for digital humanities [web site])	Are Hamlet’s speeches significantly longer than anyone else’s? Correlation and difference in Arienne’s dataset (test significance based on counts of etymology). Prep: Jeff
11:00–12:00 (60)	Quantifying what a text is about; keyword extraction (lecture). Calculating TFxIDF as a keyword metric (hands-on)	What words characterize each speaker’s vocabulary in Hamlet? Prep: Jeff
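
The TFxIDF idea from the last row can be sketched in a few lines. This example uses Python instead of the Perl and R used in the session, and it assumes one common variant of the metric: term frequency as count divided by document length, and inverse document frequency as log(N / number of documents containing the term). The token lists are invented toy data, not counts from Hamlet, and the function names are hypothetical.

```python
import math
from collections import Counter

# Invented toy data: each "document" is the token list for one speaker.
DOCS = {
    "Hamlet": ["the", "question", "the", "dream", "sleep"],
    "Ophelia": ["the", "lord", "flowers", "lord"],
    "Claudius": ["the", "crown", "queen", "crown"],
}


def tf_idf(docs):
    """Return {label: {term: score}} with tf = count / doc length and
    idf = log(N / number of docs containing the term)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for label, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[label] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, label, k=3):
    """Highest-scoring terms for one label; ties are broken alphabetically."""
    ranked = sorted(scores[label].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    scores = tf_idf(DOCS)
    for speaker in DOCS:
        print(speaker, top_terms(scores, speaker))
```

A word that every speaker uses (here "the") gets an idf of log(1) = 0 and drops out, while words concentrated in one speaker's vocabulary rise to the top.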

Day 2, lunch (12:00–1:00, 60 minutes)

Lunch (provided) (12:00–1:00)

Day 2, session 2 (1:00–4:30, 3.5 hours)

Goals: What is this text about? How can I make the information accessible?

Topics: Using Mallet; creating SVG and other visualizations

Time	Topic	Notes
1:00–2:00 (60)	Mallet (lecture and hands-on)	What do the topic models look like for Hamlet? Prep: Jeff
2:00–2:30 (30)	Why SVG: scalability, integration with HTML and JavaScript. SVG basics: lines, circles, rectangles, text; the coordinate space and transformations	Prep: David
2:30–2:45 (15)	Coffee break with chocolate chip cookies
2:45–4:30 (105)	XSLT transformation to SVG (hands-on)	Bar chart of speech lengths by character with z-score thresholds. Scatter plot of average sentence length (y) over text chunks (x)? Visualize some of Arienne’s materials? Alternative visualizations. Prep: David
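
To make the bar-chart exercise concrete, here is a minimal sketch of the same idea in Python rather than XSLT (a substitution for illustration only). It computes a z-score for each character's speech length and draws one SVG rectangle per character, coloring bars whose z-score magnitude reaches a threshold. The lengths, the threshold of 1.0, the colors, and the function names are all assumptions, not values from the workshop data.

```python
import statistics
from xml.sax.saxutils import escape

# Invented toy values standing in for average speech length per character.
TOY_LENGTHS = {"Hamlet": 30, "Claudius": 12, "Gertrude": 10, "Ophelia": 8}


def z_scores(values):
    """z = (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def bar_chart_svg(lengths, threshold=1.0, bar_width=40, gap=20, height=200):
    """Return an SVG bar chart; bars with |z| >= threshold are highlighted."""
    names = list(lengths)
    values = [lengths[name] for name in names]
    zs = z_scores(values)
    baseline = height - 20                 # leave room for labels below the bars
    scale = (baseline - 20) / max(values)  # tallest bar stops 20 units from the top
    width = gap + len(names) * (bar_width + gap)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (name, value, z) in enumerate(zip(names, values, zs)):
        x = gap + i * (bar_width + gap)
        bar_height = value * scale
        # SVG y grows downward, so a bar's top edge is baseline minus its height.
        y = baseline - bar_height
        fill = "crimson" if abs(z) >= threshold else "steelblue"
        parts.append(
            f'<rect x="{x}" y="{y:.1f}" width="{bar_width}" '
            f'height="{bar_height:.1f}" fill="{fill}"/>'
        )
        parts.append(
            f'<text x="{x + bar_width / 2}" y="{height - 5}" '
            f'text-anchor="middle" font-size="10">{escape(name)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(bar_chart_svg(TOY_LENGTHS))
```

Saving the printed output to a file with an .svg extension and opening it in a browser should show four bars, with only the first one highlighted for this toy data.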