Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2015-09-24T01:10:07+0000
This tutorial is based on an assignment that required students to create a Schematron schema that would take input like:
<sentence>
    <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
    <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
    <ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
    <free>Marko and I went to Peredelkino by bus.</free>
</sentence>
and verify that the first three tiers (<orth>, <translit>, and <ilg>) all have the same number of spaces and the same number of hyphens.
In order to verify that we could test our Schematron against a document that contained more than one sentence, we created a slightly enlarged example:
<?xml version="1.0" encoding="UTF-8"?>
<stuff>
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
        <ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
    <sentence>
        <orth>Мы с Марин-ой поеха-л-и поезд-ом в Казань</orth>
        <translit>My s Marin-oj poexa-l-i poezd-om v Kazan′.</translit>
        <ilg>we with Marina-INS go-PST-P train-by to Kazan.</ilg>
        <free>Marina and I went to Kazan by train.</free>
    </sentence>
</stuff>
If there’s a discrepancy between two tiers, it seemed most natural to report that as an error at the level of the sentence, rather than of one or another of the tiers, since although Schematron can easily recognize when the tiers don’t agree, there’s no way for it to tell which of the discrepant tiers contains an error.
Here is our simple solution:
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="sentence">
            <let name="orthSpaces" value="string-length(orth) - string-length(translate(orth,' ',''))"/>
            <let name="translitSpaces" value="string-length(translit) - string-length(translate(translit,' ',''))"/>
            <let name="ilgSpaces" value="string-length(ilg) - string-length(translate(ilg,' ',''))"/>
            <let name="orthHyphens" value="string-length(orth) - string-length(translate(orth,'-',''))"/>
            <let name="translitHyphens" value="string-length(translit) - string-length(translate(translit,'-',''))"/>
            <let name="ilgHyphens" value="string-length(ilg) - string-length(translate(ilg,'-',''))"/>
            <report test="($orthSpaces, $translitSpaces, $ilgSpaces) != avg(($orthSpaces, $translitSpaces, $ilgSpaces))"
                >The spaces don’t match: orth (<value-of select="$orthSpaces"/>) ~ translit (<value-of select="$translitSpaces"/>) ~ ilg (<value-of select="$ilgSpaces"/>)</report>
            <report test="($orthHyphens, $translitHyphens, $ilgHyphens) != avg(($orthHyphens, $translitHyphens, $ilgHyphens))"
                >The hyphens don’t match: orth (<value-of select="$orthHyphens"/>) ~ translit (<value-of select="$translitHyphens"/>) ~ ilg (<value-of select="$ilgHyphens"/>)</report>
        </rule>
    </pattern>
</schema>
@context attribute of the Schematron <rule> element
Because we want to report the error at the level of the <sentence> element (see above), we set the value of the @context attribute on our Schematron <rule> element to sentence.
A popular strategy for counting the spaces or hyphens in a string is to strip out the character in question (using the XPath translate() function) and subtract the length of the stripped string from the length of the original. We use this strategy to set up six variables, recording the number of spaces and hyphens in our <orth>, <translit>, and <ilg> tiers. There are alternative strategies that will achieve the same result, but we find this easiest to understand and to write.
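As an illustration of one such alternative (not the approach used in this tutorial), the fragment below counts hyphens by deleting everything that is not a hyphen and measuring what remains. It is a minimal sketch that assumes the XPath 2.0 replace() function is available, which it is under queryBinding="xslt2", and it is meant to sit inside the <rule context="sentence"> element:

```xml
<!-- Alternative count: remove every character that is not a hyphen,
     then take the length of what is left (XPath 2.0 only) -->
<let name="orthHyphens" value="string-length(replace(orth, '[^\-]', ''))"/>
<let name="translitHyphens" value="string-length(replace(translit, '[^\-]', ''))"/>
<let name="ilgHyphens" value="string-length(replace(ilg, '[^\-]', ''))"/>
```

The translate() version has the advantage of also working in XPath 1.0, where replace() does not exist.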
We need to run a three-way test, that is, we need to compare the counts of spaces or hyphens in three strings to determine whether they are all the same. Some programming languages permit chained comparisons, and in those languages you could write something like a = b = c to test whether a, b, and c are all equal. XPath is not that kind of language, though, so we need an alternative strategy. One straightforward approach would combine two tests in one:
$orthSpaces eq $translitSpaces and $orthSpaces eq $ilgSpaces
In a combined test with the and operator, the result of the test is false unless both parts succeed. We don’t need to run a third test, comparing transliteration to interlinear glossing directly, because equality is transitive: if a = b and b = c, it is also true that a = c.
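For comparison, here is a minimal sketch of how the two-part test might be wired into a complete schema. It is not the solution adopted in this tutorial, and the shortened message text is our own. Because a <report> fires when its test is true, the "all equal" condition has to be negated:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="sentence">
            <let name="orthSpaces" value="string-length(orth) - string-length(translate(orth,' ',''))"/>
            <let name="translitSpaces" value="string-length(translit) - string-length(translate(translit,' ',''))"/>
            <let name="ilgSpaces" value="string-length(ilg) - string-length(translate(ilg,' ',''))"/>
            <!-- The inner expression is true when all three counts agree;
                 not() turns it into the error condition that <report> needs -->
            <report test="not($orthSpaces eq $translitSpaces and $orthSpaces eq $ilgSpaces)"
                >The spaces don’t match.</report>
        </rule>
    </pattern>
</schema>
```

An <assert> with the un-negated test would be an equivalent way to express the same check.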
We opted, though, for a more elegant approach. We created a sequence of the three values and compared that, using the general comparison operator !=, to the average of the three values. A general comparison succeeds if any item in the sequence satisfies it, so if any of the three values is not equal to the average (that is, if the test for general nonequality succeeds), they are not all the same.
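Two worked cases may make the behavior of the sequence-versus-average test easier to see. The literal numbers below are illustrative values of our own, not counts taken from the sample sentences, and the fragment is meant to be dropped into any <rule> for experimentation:

```xml
<!-- Case 1: (3, 3, 3). The average is 3 and no item differs from it,
     so the test is false and nothing is reported. -->
<report test="(3, 3, 3) != avg((3, 3, 3))">Never fires: all three counts are equal.</report>

<!-- Case 2: (3, 3, 4). The average is roughly 3.33 and at least one item
     differs from it, so the test is true and the report fires. -->
<report test="(3, 3, 4) != avg((3, 3, 4))">Fires: the counts are not all equal.</report>
```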
To make it easier for a human reader to find the error, we used the Schematron <value-of> element to report the space or hyphen counts for each tier. It isn’t possible using this strategy to find the word where a mismatch in hyphens occurs because we’re doing our counting on the level of the entire line, and not on a word-by-word basis.
Comparing the sequence of all values to the average of all values, which we do above, works because only if all values are the same will every individual value be equal to the average of all of the values. A more elegant approach, though, might just count the number of distinct values:
<report test="count(distinct-values(($orthSpaces, $translitSpaces, $ilgSpaces))) ne 1">
If the values are all the same, there will be only one distinct value, so the report should fire whenever the count of distinct values is not 1.
Getting a word-level report is tricky because, since the individual words are not independent nodes in the XML tree, it isn’t possible to set the value of the @context attribute to point to them. We need, then, to continue to run our tests on the level of entire sentences, and find some other way to do a word-by-word test.
Happily, it is possible to declare the XSLT namespace in Schematron and use XSLT features, including <xsl:function>, which lets us declare our own function. We use that strategy to let our user-defined function djb:hyphenation() generate a word-by-word result, which Schematron can then use to output a more specific error report. Here is the code:
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" queryBinding="xslt2">
    <ns prefix="djb" uri="http://www.obdurodon.org"/>
    <xsl:function name="djb:hyphenation" as="xs:boolean+">
        <xsl:param name="orthWords" as="xs:string+"/>
        <xsl:param name="transWords" as="xs:string+"/>
        <xsl:param name="ilgWords" as="xs:string+"/>
        <xsl:for-each select="1 to count($orthWords)">
            <xsl:variable name="orthHyphens" as="xs:integer"
                select="string-length($orthWords[current()]) - string-length(translate($orthWords[current()],'-',''))"/>
            <xsl:variable name="transHyphens" as="xs:integer"
                select="string-length($transWords[current()]) - string-length(translate($transWords[current()],'-',''))"/>
            <xsl:variable name="ilgHyphens" as="xs:integer"
                select="string-length($ilgWords[current()]) - string-length(translate($ilgWords[current()],'-',''))"/>
            <xsl:sequence select="$orthHyphens eq $transHyphens and $orthHyphens eq $ilgHyphens"/>
        </xsl:for-each>
    </xsl:function>
    <pattern>
        <rule context="sentence">
            <let name="orthSpaces" value="string-length(orth) - string-length(translate(orth,' ',''))"/>
            <let name="translitSpaces" value="string-length(translit) - string-length(translate(translit,' ',''))"/>
            <let name="ilgSpaces" value="string-length(ilg) - string-length(translate(ilg,' ',''))"/>
            <let name="orthWords" value="tokenize(orth,'\s+')"/>
            <let name="transWords" value="tokenize(translit,'\s+')"/>
            <let name="ilgWords" value="tokenize(ilg,'\s+')"/>
            <let name="results" value="djb:hyphenation($orthWords,$transWords,$ilgWords)"/>
            <report test="($orthSpaces, $translitSpaces, $ilgSpaces) != avg(($orthSpaces, $translitSpaces, $ilgSpaces))"
                >The spaces don’t match: orth (<value-of select="$orthSpaces"/>) ~ translit (<value-of select="$translitSpaces"/>) ~ ilg (<value-of select="$ilgSpaces"/>)</report>
            <report test="$results != true()">Word # <value-of
                select="index-of($results,false())[1]"/> doesn't match: "<value-of
                select="$orthWords[index-of($results,false())[1]]"/>" (orthographic, <value-of
                select="string-length($orthWords[index-of($results,false())[1]]) - string-length(translate($orthWords[index-of($results,false())[1]],'-',''))"/>) ~ "<value-of
                select="$transWords[index-of($results,false())[1]]"/>" (transliterated, <value-of
                select="string-length($transWords[index-of($results,false())[1]]) - string-length(translate($transWords[index-of($results,false())[1]],'-',''))"/>) ~ "<value-of
                select="$ilgWords[index-of($results,false())[1]]"/>" (interlinear gloss, <value-of
                select="string-length($ilgWords[index-of($results,false())[1]]) - string-length(translate($ilgWords[index-of($results,false())[1]],'-',''))"/>)</report>
        </rule>
    </pattern>
</schema>
We use exactly the same strategy for checking spaces as we did in the simple solution, above.
User-defined functions have to be in a user-defined namespace, and we’ve used the URI http://www.obdurodon.org as the namespace for our function and bound it to the prefix djb:. Schematron does not use the general XML namespace declaration syntax for prefixes that appear in its XPath expressions, so we have to use the Schematron-specific namespace declaration syntax instead:
<ns prefix="djb" uri="http://www.obdurodon.org"/>
We also have to declare the XSLT namespace, which we do using the standard XML namespace declaration syntax, writing:
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
inside the <schema> start tag. The reason we use the standard XML namespace declaration syntax here is that this namespace declaration applies at a higher level, at the stage where the processor has to sort out which parts of the document are in the Schematron namespace and which are in the XSLT namespace, that is, while looking at children of the root <schema> element. It deals with the djb namespace at a deeper level, where the Schematron processor is able to manage the namespace resolution.
djb:hyphenation() function
This isn’t a general tutorial on writing your own XSLT functions (see the clear and comprehensive write-up in the Michael Kay book for that), but the general way the function works is that we break the three strings into words using the tokenize() function and pass those three sequences of words into the function, saving the output of the function in a variable we call $results:
<let name="results" value="djb:hyphenation($orthWords,$transWords,$ilgWords)"/>
When the function returns, the $results variable will contain information that we can use to determine the position in the sequence of words in which a discrepancy in hyphen count appears (see below).
The function iterates over the words in an <xsl:for-each> element, calculates the number of hyphens in each word in corresponding positions in the three tiers, and compares those three numbers (we used the two-part test with and this time because it was more legible). The comparison returns (as the value of the <xsl:sequence> element) a Boolean (true or false) value depending on whether the counts of hyphens in the same word position in the three tiers are all equal to one another. The function generates one Boolean value for each word position in the sentence, and after it has examined all of the sets of words, it returns a sequence of Boolean values to the Schematron rule, where, as we noted above, it becomes the value of the $results variable.
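To see what the function returns, it can help to validate a deliberately broken sentence. The input below is a hypothetical variant of the first sample sentence in which the gloss of the fourth word has lost a hyphen (go-PST instead of go-PST-P); everything else is unchanged:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<stuff>
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
        <!-- Deliberate error: the fourth word has one hyphen here, two in the other tiers -->
        <ilg>we with Marko go-PST bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
</stuff>
```

All three tiers still contain six spaces, so the spaces report stays silent. Each tier tokenizes into seven words, and the hyphen counts agree in every position except the fourth, so $results should hold seven Boolean values of which only the fourth is false.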
If any value in the returned sequence is not equal to Boolean true(), the Schematron report finds the position of the first false() value (using the XPath index-of() function) and uses that position to find the specific words, count the hyphens in those words on all three tiers, and output a report that gives, for each tier, the tier identifier, the word, and the number of hyphens. The <report> element is difficult to read because although Schematron allows the use of the <let> element to create variables, it doesn’t permit the creation of variables inside a <report> element. This means that we can’t create convenience variables to hold our counts of string lengths and hyphens, which would make our code easier to read, and we have to do all of the measurement and arithmetic at once instead.
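One possible way to reduce the repetition, offered here as a sketch rather than as part of the solution described in this tutorial, is to declare the convenience variables one level up, as <let> children of the <rule>, where Schematron does allow them. The variable names $bad, $orthBad, $transBad, and $ilgBad are our own, and the fragment assumes that it follows the <let> that defines $results:

```xml
<!-- Position of the first mismatched word; empty if every word matches -->
<let name="bad" value="index-of($results, false())[1]"/>
<!-- The word at that position on each tier (empty if there is no mismatch) -->
<let name="orthBad" value="$orthWords[$bad]"/>
<let name="transBad" value="$transWords[$bad]"/>
<let name="ilgBad" value="$ilgWords[$bad]"/>
<report test="$results != true()">Word # <value-of
    select="$bad"/> doesn't match: "<value-of
    select="$orthBad"/>" (orthographic, <value-of
    select="string-length($orthBad) - string-length(translate($orthBad,'-',''))"/>) ~ "<value-of
    select="$transBad"/>" (transliterated, <value-of
    select="string-length($transBad) - string-length(translate($transBad,'-',''))"/>) ~ "<value-of
    select="$ilgBad"/>" (interlinear gloss, <value-of
    select="string-length($ilgBad) - string-length(translate($ilgBad,'-',''))"/>)</report>
```

The trade-off is that these variables are declared for every sentence, including sentences that have no mismatch, in which case they are simply empty and the report does not fire.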
After we wrote the Schematron above, which meets all of our requirements, we noticed that inside the djb:hyphenation() function we perform the same computation three times, with different input, to count the hyphens in each word. Furthermore, the calculation itself is pretty verbose, and therefore hard to read. To make things more legible, we refactored (revised) that part of our code, breaking the calculation out into a separate djb:countHyphens() function that we could call three times. And just to keep in practice we tried a different strategy for testing whether all three counts were the same. Here’s the revised code; the revisions are the three calls to djb:countHyphens(), the new test in <xsl:sequence>, and the djb:countHyphens() function itself:
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" queryBinding="xslt2">
    <ns prefix="djb" uri="http://www.obdurodon.org"/>
    <xsl:function name="djb:hyphenation" as="xs:boolean+">
        <xsl:param name="orthWords" as="xs:string+"/>
        <xsl:param name="transWords" as="xs:string+"/>
        <xsl:param name="ilgWords" as="xs:string+"/>
        <xsl:for-each select="1 to count($orthWords)">
            <xsl:variable name="orthHyphens" as="xs:integer"
                select="djb:countHyphens($orthWords[current()])"/>
            <xsl:variable name="transHyphens" as="xs:integer"
                select="djb:countHyphens($transWords[current()])"/>
            <xsl:variable name="ilgHyphens" as="xs:integer"
                select="djb:countHyphens($ilgWords[current()])"/>
            <xsl:sequence select="count(distinct-values(($orthHyphens, $transHyphens, $ilgHyphens))) eq 1"/>
        </xsl:for-each>
    </xsl:function>
    <xsl:function name="djb:countHyphens" as="xs:integer">
        <xsl:param name="word"/>
        <xsl:variable name="length" as="xs:integer" select="string-length($word)"/>
        <xsl:variable name="dehyphenatedLength" as="xs:integer"
            select="string-length(translate($word,'-',''))"/>
        <xsl:sequence select="$length - $dehyphenatedLength"/>
    </xsl:function>
    <pattern>
        <rule context="sentence">
            <let name="orthSpaces" value="string-length(orth) - string-length(translate(orth,' ',''))"/>
            <let name="translitSpaces" value="string-length(translit) - string-length(translate(translit,' ',''))"/>
            <let name="ilgSpaces" value="string-length(ilg) - string-length(translate(ilg,' ',''))"/>
            <let name="orthWords" value="tokenize(orth,'\s+')"/>
            <let name="transWords" value="tokenize(translit,'\s+')"/>
            <let name="ilgWords" value="tokenize(ilg,'\s+')"/>
            <let name="results" value="djb:hyphenation($orthWords,$transWords,$ilgWords)"/>
            <report test="($orthSpaces, $translitSpaces, $ilgSpaces) != avg(($orthSpaces, $translitSpaces, $ilgSpaces))"
                >The spaces don’t match: orth (<value-of select="$orthSpaces"/>) ~ translit (<value-of select="$translitSpaces"/>) ~ ilg (<value-of select="$ilgSpaces"/>)</report>
            <report test="$results != true()">Word # <value-of
                select="index-of($results,false())[1]"/> doesn't match: "<value-of
                select="$orthWords[index-of($results,false())[1]]"/>" (orthographic, <value-of
                select="string-length($orthWords[index-of($results,false())[1]]) - string-length(translate($orthWords[index-of($results,false())[1]],'-',''))"/>) ~ "<value-of
                select="$transWords[index-of($results,false())[1]]"/>" (transliterated, <value-of
                select="string-length($transWords[index-of($results,false())[1]]) - string-length(translate($transWords[index-of($results,false())[1]],'-',''))"/>) ~ "<value-of
                select="$ilgWords[index-of($results,false())[1]]"/>" (interlinear gloss, <value-of
                select="string-length($ilgWords[index-of($results,false())[1]]) - string-length(translate($ilgWords[index-of($results,false())[1]],'-',''))"/>)</report>
        </rule>
    </pattern>
</schema>