University of Kansas: Creating and using XML

Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2015-09-25T19:30:36+0000

Friday, 2015-09-25

David J. Birnbaum
University of Pittsburgh
Email: djbpitt@gmail.com
URL: http://www.obdurodon.org

Jeffrey A. Rydberg-Cox
University of Missouri, Kansas City
Email: rydbergcoxj@umkc.edu
URL: http://daedalus.umkc.edu

Description and goals

These two workshops are designed to help digital humanists with basic XML experience refine their skills in document analysis, markup, and XML processing. The morning workshop (Creating literary and linguistic annotation, 3 hours of instruction) concentrates on document analysis and advanced-beginner-level XML annotation. The afternoon workshop (Using literary and linguistic annotation once you’ve created it, 3 hours of instruction) introduces the use of XPath and XSLT to transform and query XML documents. The two workshops are independent of each other; participants may register for either or for both. Both workshops will take place in Watson Library room 455 and are part of the University of Kansas DH Forum 2015: Peripheries, barriers, hierarchies.

Before the workshops

Download <oXygen/> from http://www.oxygenxml.com and install it on your laptop. You will need to request a free, fully functional temporary license, which you can do at the download site.
Read for both the morning and afternoon workshops: An even gentler introduction to XML
Read for the afternoon workshop (Using literary and linguistic annotation once you’ve created it): What can XPath do for me?

Workshop I: Creating literary and linguistic annotation

This workshop will concentrate on document analysis, project design, and making markup decisions in complex cases, such as those involving overlap or dependencies on external documents. Examples will be drawn from data supplied in advance by participants and from other sources. Participants should already have hands-on experience tagging XML documents and should already have read or re-read An even gentler introduction to XML.

9:00–9:30: Brief re-introduction to XML, using Julius Segall’s Wo ist des Armen Vaterland?
- Scanned image
- XML transcription
- Plain text (for autotagging exercise)
9:30–10:15: Working session #1, using Jarring Prov. 105 (Turki legal document from 1855–56)
This document includes missing text due to holes in the manuscript, marginalia, a seal, and interesting indigenous paper. Possible topics for discussion may include:
- Handling right-to-left text, including diacritic marks (for short vowels, doubled consonants, etc.).
- Using regular expressions in XSLT to replace text in predictable contexts (e.g., if a word begins with two consonants, insert an i between them).
- Some superscript characters, such as ʰ, have a dedicated Unicode code point, but others don’t. What’s the best way to represent them (e.g., by wrapping them in tags so that they will later be rendered as superscripts)? (For Unicode values, see the Unicode 8.0 Character Code Charts. To identify the Unicode code points in an existing digital text, see Richard Ishida’s Unicode code converter. Mac OS X users can download earthlingsoft’s free Unicode checker application.)
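The last two topics can be sketched in a few lines. The workshop itself uses XSLT, so the Python below only illustrates the logic under stated assumptions: the consonant inventory is a hypothetical Latin-script transliteration alphabet, the element names w and sup are invented for the example, and the sample strings are made up rather than taken from the Turki document. The first function inserts an i after a word-initial consonant that is followed by another consonant. The second replaces any character that Unicode decomposes as a superscript form (such as ʰ) with its base letter wrapped in a tag, so that superscripts with and without a dedicated code point could be marked up the same way.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Assumption: a hypothetical Latin-script transliteration in which every
# letter other than a, e, i, o, u counts as a consonant.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

# A consonant at the start of a word, followed (lookahead) by another consonant.
INITIAL_CLUSTER = re.compile(
    rf"\b([{CONSONANTS}])(?=[{CONSONANTS}])", re.IGNORECASE
)


def insert_epenthetic_i(text):
    """Insert an i between two consonants that begin a word."""
    return INITIAL_CLUSTER.sub(r"\1i", text)


def wrap_superscripts(text, tag="sup"):
    """Build a <w> element in which superscript code points become <tag> children."""
    root = ET.Element("w")  # "w" is an invented wrapper element
    last = None  # most recent child; None while we are still filling root.text
    for ch in text:
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith("<super> "):
            # e.g. U+02B0 decomposes as "<super> 0068", i.e. superscript h
            base = "".join(chr(int(cp, 16)) for cp in decomposition.split()[1:])
            last = ET.SubElement(root, tag)
            last.text = base
        elif last is None:
            root.text = (root.text or "") + ch
        else:
            last.tail = (last.tail or "") + ch
    return root


# Made-up sample strings, for illustration only.
print(insert_epenthetic_i("stan kitab"))  # sitan kitab
print(ET.tostring(wrap_superscripts("pʰal"), encoding="unicode"))  # <w>p<sup>h</sup>al</w>
```

In XSLT 2.0 the substitution would go through the XPath replace() function, whose regular-expression dialect has neither \b nor lookahead, so the word-initial context would have to be expressed differently there.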
Prov.11
10:15–10:30: Break
10:30–11:15: Working session #2, using Krieg und Frieden (1917-09-08)
- Scanned image
- XML transcription
Goals: encode as much of the formatting as possible in order to provide truly diplomatic transcriptions. Challenges include multiple columns, centering, indentation of individual lines, footnotes, and annotations.
11:15–12:00: Working session #3, using Rudyard Kipling’s The truce of the bear (1914-08-31)
- Scanned image
- XML transcription
Goals: encode as much of the formatting as possible in order to provide truly diplomatic transcriptions. Challenges include multiple columns, centering, indentation of individual lines, footnotes, and annotations.

This poem is particularly challenging but fairly typical of the poems we are finding in periodicals: centered poem title, centered prose after the title and before the poem, two-column layout, unusual indentation and spacing within the poem text, etc.

Workshop II: Using literary and linguistic annotation once you’ve created it

This workshop will introduce participants to querying and transforming XML documents using XPath and XSLT and to validating documents using XPath and Schematron. Examples will be drawn from data supplied in advance by participants and from other sources. Participants should already have hands-on experience tagging XML documents and should already have read or re-read An even gentler introduction to XML and What can XPath do for me? Although it is not required, those who are interested may read ahead in our introductory XSLT and Schematron tutorials. The first workshop is not a prerequisite for the second; participants may enroll in either or in both.

1:00–1:30: Processing XML with XSLT
1:30–2:15: Working session #1, using Jarring Prov. 105 (Turki legal document from 1855–56)
This document includes missing text due to holes in the manuscript, marginalia, a seal, and interesting indigenous paper. Processing tasks include managing editorial annotations (seal, holes, marginalia, paper, etc.).

Possible topics for discussion:
- Handling right-to-left text, including diacritic marks (for short vowels, doubled consonants, etc.).
- Reading and writing XML comments. The sample Turki text contains a commented-out <orth> tier. What’s the best way to process that? And conversely, what’s the best way to write the output of a function into a comment?
- Using regular expressions in XSLT to replace text in predictable contexts (e.g., if a word begins with two consonants, insert an i between them).
- Some superscript characters, such as ʰ, have a dedicated Unicode code point, but others don’t. What’s the best way to represent them (e.g., by wrapping them in tags so that they will later be rendered as superscripts)?
- Transforming XML to HTML with CSS. (On this last point, see the Using XSLT to create HTML section of our HTML basics tutorial.)
- Using Schematron to verify whether parallel tiers in Leipzig-like transcriptions agree in the number of spaces and hyphens. Sample code is available at http://ku.obdurodon.org/schematron.xhtml (originally a Schematron homework assignment in a DH course at the University of Pittsburgh).
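The Schematron sample linked above is the reference for this check. As a rough illustration of the underlying test only, the Python sketch below counts spaces and hyphens in each tier of a phrase and reports the phrases whose tiers disagree. The element names phrase, translit, and gloss, the n attribute, and the sample content are invented placeholders, not the markup of the workshop files.

```python
import xml.etree.ElementTree as ET

# Invented placeholder data: phrase 1 is consistent, phrase 2 is not.
SAMPLE = """
<text>
  <phrase n="1">
    <translit>aa-bb-cc dd</translit>
    <gloss>X-Y-Z W</gloss>
  </phrase>
  <phrase n="2">
    <translit>aa-bb</translit>
    <gloss>X Y</gloss>
  </phrase>
</text>
"""


def separator_counts(tier_text):
    """Return (number of spaces, number of hyphens) after normalizing whitespace."""
    normalized = " ".join(tier_text.split())
    return normalized.count(" "), normalized.count("-")


def mismatched_phrases(root, tiers=("translit", "gloss")):
    """List (phrase number, per-tier counts) for phrases whose tiers disagree."""
    problems = []
    for phrase in root.iter("phrase"):
        counts = {
            tier: separator_counts(phrase.findtext(tier, default=""))
            for tier in tiers
        }
        # All tiers must share the same (spaces, hyphens) pair.
        if len(set(counts.values())) > 1:
            problems.append((phrase.get("n"), counts))
    return problems


for number, counts in mismatched_phrases(ET.fromstring(SAMPLE)):
    print(f"phrase {number}: tiers disagree {counts}")
```

Normalizing whitespace first keeps line breaks and indentation inside a tier from being counted as word boundaries; whether that matches the transcription conventions of a given project is a decision to make per project.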
2:15–2:30: Break
2:30–3:15: Working session #2, using Krieg und Frieden (1917-09-08)
- Scanned image
- XML transcription
Goals: encode as much of the formatting as possible in order to provide truly diplomatic transcriptions. Challenges include multiple columns, centering, indentation of individual lines, footnotes, and annotations.
3:15–4:00: Working session #3, using Rudyard Kipling’s The truce of the bear (1914-08-31)
- Scanned image
- XML transcription
Goals: encode as much of the formatting as possible in order to provide truly diplomatic transcriptions. Challenges include multiple columns, centering, indentation of individual lines, footnotes, and annotations.

This poem is particularly challenging but fairly typical of the poems we are finding in periodicals: centered poem title, centered prose after the title and before the poem, two-column layout, unusual indentation and spacing within the poem text, etc.

One of the hardest formatting challenges for indented poetry is echeloned lines, of the sort popularized by Vladimir Vladimirovič Majakovskij and Frank O’Hara. See our tutorial Formatting echeloned poetry for how to handle that sort of indentation in a plain-text-to-XML-to-HTML transformation process.
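The tutorial describes the approach we recommend. Purely to make the shape of such a pipeline concrete, here is a minimal Python sketch that is not taken from the tutorial: it records the leading spaces of each plain-text line as an attribute on an XML line element, then turns that attribute into a CSS left margin in HTML. The element and attribute names (poem, l, indent), the use of the CSS ch unit, and the three-line sample are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample: each step starts further to the right than the one before.
SAMPLE = "one\n   two\n      three"


def lines_to_xml(plain):
    """Plain text to XML: one <l> per non-blank line, indentation kept as @indent."""
    poem = ET.Element("poem")
    for raw in plain.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        stripped = raw.lstrip(" ")
        line = ET.SubElement(poem, "l", indent=str(len(raw) - len(stripped)))
        line.text = stripped.rstrip()
    return poem


def xml_to_html(poem):
    """XML to HTML: each <l> becomes a <p> whose left margin reflects @indent."""
    div = ET.Element("div", {"class": "poem"})
    for line in poem.iter("l"):
        p = ET.SubElement(div, "p", style=f"margin-left: {line.get('indent')}ch")
        p.text = line.text
    return div


html = xml_to_html(lines_to_xml(SAMPLE))
print(ET.tostring(html, encoding="unicode"))
```

Counting characters only approximates the printed layout when the output font is proportional, which is one reason echeloned lines deserve the fuller treatment in the tutorial.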

Auxiliary materials

Materials from David J. Birnbaum’s Computational methods in the humanities course at the University of Pittsburgh
vaterlandXSL_class-indent.xsl
cristVaterlandCorrected.xml