Working With OU-XML
In this section we will briefly review different ways of working with an OU-XML document, including treating a document as a simple, structured and searchable database, as well as ways of displaying or rendering the XML content.
OU Internal readers might also find the Structured Content Tag Guide a useful, and more comprehensive, guide to the OU-XML document structure.
Treating OU-XML Documents As Databases
In production terms, OU-XML documents are used as gold-master documents for representing academic content in a structured way. The structure adds semantics (which is to say, “meaning”) to content elements, and these semantics are then expressed using visual cues when the document is rendered as a final output text. For example, in the OU VLE, <Activity> elements are typically rendered as a box with a blue background that distinguishes the activity from the rest of the content.
From a scrape of OpenLearn, we can create a database of OU-XML documents that we can use as a basis for structured search over the whole OpenLearn free-learning corpus.
As well as supporting the rendering of a document for presentation purposes in a meaningful way, the structure of an XML document provides a way of discovering content elements with particular semantics, which is to say: we can treat an XML document as a simple database by running queries directly over it and returning results extracted from it. We can also mine XML documents to extract elements of a particular type and place those elements in another database (either a “traditional” database, or another XML document) and use that second database to search over the extracted elements.
In this section, we will briefly review how we can treat an individual XML document as a database that can be both queried and mined.
Start by creating a connection to our database of OU-XML documents scraped from OpenLearn:
from sqlite_utils import Database
# Open database connection
dbname = "all_openlean_xml.db"
db = Database(dbname)
We can grab a single OU-XML document for a particular unit by name:
from lxml import etree
import pandas as pd
# Unit title
title = "An introduction to computers and computer systems"
# Grab an OU-XML file that is known to contain glossary items
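# Pass the title as a query parameter so that titles containing
# quote characters do not break the SQL statement
xml_raw = pd.read_sql("SELECT xml FROM xml WHERE name = ?",
                      con=db.conn, params=(title,)).loc[0, "xml"]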
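As a minimal sketch of the same lookup that runs without the scraped database, the following builds a throwaway in-memory SQLite database using only the standard library. The table layout (a name column and an xml column), the unit titles and the get_unit_xml helper are assumptions made for illustration, not a description of the real database.

```python
import sqlite3

# Hypothetical stand-in for the scraped database: an in-memory SQLite
# database with an assumed two-column "xml" table (name, xml)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE xml (name TEXT, xml TEXT)")
demo_conn.executemany(
    "INSERT INTO xml (name, xml) VALUES (?, ?)",
    [
        ("An introduction to computers", "<Item><Title>Intro</Title></Item>"),
        ("What's in a name?", "<Item><Title>Names</Title></Item>"),
    ],
)


def get_unit_xml(conn, title):
    """Return the raw XML for the unit with the given name, or None."""
    # The ? placeholder lets SQLite handle quoting, so titles that
    # contain apostrophes do not break the query
    row = conn.execute(
        "SELECT xml FROM xml WHERE name = ?", (title,)
    ).fetchone()
    return row[0] if row else None


demo_xml = get_unit_xml(demo_conn, "What's in a name?")
print(demo_xml)
```

Returning None for an unknown title is a design choice in this sketch; it avoids the lookup error that indexing an empty result would otherwise raise.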
We can parse this document into an XML object:
# Parse the XML into an xml object
root = etree.fromstring(xml_raw)
An XML document is a hierarchical, tree-based document format structured using tags.
If you have ever seen raw HTML, it uses a very similar tag-based structure (the XHTML variant of HTML is a particular flavour of XML).
The nested tag elements provide a way of indexing (that is, addressing or referring to) a particular element.
For example, in the following fragment of XML:
<Glossary>
  <GlossaryItem>
    <Term>Computer</Term>
    <Definition>
      A machine that manipulates data following a list of instructions that have been programmed into it.
    </Definition>
  </GlossaryItem>
  <GlossaryItem>
    <Term>Computer program</Term>
    <Definition>
      The list of instructions the computer follows to process input and produce output.
    </Definition>
  </GlossaryItem>
</Glossary>
we can “address” the Computer term as Glossary/GlossaryItem[1]/Term and the definition of the Computer program term as Glossary/GlossaryItem[2]/Definition. These path elements tell us how to “walk the tree”. The numbers in brackets tell you which of several items with the same “tag path” you are interested in. In the XPath notation used for the queries that follow, these positions are one-indexed, which is to say the first one has index value 1, the second one has index value 2, and so on (unlike Python lists, which are zero-indexed).
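To make the path addressing concrete, here is a small sketch that parses an abbreviated copy of the glossary fragment with lxml and addresses elements by position. The definitions are shortened for brevity and the variable names are my own. Note that XPath positional predicates count from 1, whereas the Python list that the query returns is indexed from 0.

```python
from lxml import etree

# Abbreviated copy of the glossary fragment (definitions shortened)
glossary_xml = """
<Glossary>
  <GlossaryItem>
    <Term>Computer</Term>
    <Definition>A machine that manipulates data.</Definition>
  </GlossaryItem>
  <GlossaryItem>
    <Term>Computer program</Term>
    <Definition>The list of instructions the computer follows.</Definition>
  </GlossaryItem>
</Glossary>
"""

glossary = etree.fromstring(glossary_xml)

# XPath positions start at 1: GlossaryItem[1] is the first item
first_term = glossary.xpath("/Glossary/GlossaryItem[1]/Term")[0].text
second_definition = glossary.xpath("/Glossary/GlossaryItem[2]/Definition")[0].text

# Without a position, the query matches every item with that tag path;
# the Python list it returns is zero-indexed as usual
all_terms = [el.text for el in glossary.xpath("/Glossary/GlossaryItem/Term")]

print(first_term)
print(second_definition)
print(all_terms)
```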
We can query our XML document to see if it has a Glossary:
root.xpath("//Glossary")
[<Element Glossary at 0x7fc6aeae2800>]
Yes, it does: there is a single Glossary tag in the document.
We can preview it as follows:
from xml_utils import tidy
# The result of our query was a list with a single item
# We can address that item as the first (zero indexed) item
# in the list, and then "tidy" it to grab the XML it
# refers to, printing the result to get a formatted output
# Only display the first 750 characters of the result
print(tidy(root.xpath("//Glossary")[0])[:750])
<Glossary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<GlossaryItem>
<Term>Computer</Term>
<Definition>A machine that manipulates data following a list of instructions that have been programmed into it.</Definition>
</GlossaryItem>
<GlossaryItem>
<Term>Computer program</Term>
<Definition>The list of instructions the computer follows to process input and produce output.</Definition>
</GlossaryItem>
<GlossaryItem>
<Term>Input device</Term>
<Definition>A component that can function both as an input and as an output device.</Definition>
</GlossaryItem>
<Glo
We can also run a query to get just the glossary terms:
# The search returns a list of items
terms = root.xpath("//Glossary/GlossaryItem/Term")
# Preview the first five only
terms[:5]
[<Element Term at 0x7fc6aeaf8800>,
<Element Term at 0x7fc6aeaf8b40>,
<Element Term at 0x7fc6aeaf8ac0>,
<Element Term at 0x7fc6aeaf8d00>,
<Element Term at 0x7fc6aeaf8cc0>]
Let’s see those terms:
for term in terms[:5]:
    print(term.text)
Computer
Computer program
Input device
Internet
Output device
The syntax gets a little fiddly, but we can also search within element text to find terms containing a particular word:
computer_terms = root.xpath("//Glossary/GlossaryItem/Term[contains(text(), 'Computer')]")
for term in computer_terms:
    print(term.text)
Computer
Computer program
Computer bus
Computer system
From these result elements, we can get the “parent” <GlossaryItem> element and display the related terms and definitions:
items = [el.getparent() for el in computer_terms]
for item in items:
    print(tidy(item))
<GlossaryItem xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Term>Computer</Term>
<Definition>A machine that manipulates data following a list of instructions that have been programmed into it.</Definition>
</GlossaryItem>
<GlossaryItem xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Term>Computer program</Term>
<Definition>The list of instructions the computer follows to process input and produce output.</Definition>
</GlossaryItem>
<GlossaryItem xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Term>Computer bus</Term>
<Definition>The internal data connections across the input and output subsystems and the secondary memory subsystem to the computer’s processor and main memory.</Definition>
</GlossaryItem>
<GlossaryItem xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Term>Computer system</Term>
<Definition>Formally a processor and its associated devices to make a usable ‘system’, but often the complete system, is referred to as a ‘computer’.</Definition>
</GlossaryItem>
We could equally have run a query to find all the <GlossaryItem> elements where the <Term> included our search term. In code, for any particular task, there is generally more than one way to do it…
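As a sketch of that alternative, the query below selects the <GlossaryItem> elements directly by putting the condition on the <Term> child inside a predicate. The three-item glossary and its placeholder definitions are made up for illustration, as are the variable names.

```python
from lxml import etree

# Made-up glossary with placeholder definitions
demo_glossary = etree.fromstring("""
<Glossary>
  <GlossaryItem><Term>Computer</Term><Definition>Placeholder A</Definition></GlossaryItem>
  <GlossaryItem><Term>Internet</Term><Definition>Placeholder B</Definition></GlossaryItem>
  <GlossaryItem><Term>Computer bus</Term><Definition>Placeholder C</Definition></GlossaryItem>
</Glossary>
""")

# Select each GlossaryItem that has a Term child whose text
# contains the search word
matching_items = demo_glossary.xpath(
    "//GlossaryItem[Term[contains(text(), 'Computer')]]"
)

for item in matching_items:
    print(item.findtext("Term"), "->", item.findtext("Definition"))
```

The XPath contains() function is case sensitive, so a term such as "Personal computer" would not be matched by this query.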
What this shows is that we can run queries into different structured parts of the OU-XML document rather than just searching the whole document for a particular term.
We can also use this approach to extract (or mine) particular sorts of element from the document and place those into another collection of just those sorts of thing. For example, by mining the Glossary elements from all our OpenLearn OU-XML documents, we could easily generate an OpenLearn-wide glossary that combines all glossed items in one place.
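As a rough sketch of that mining step, the following uses only the Python standard library (sqlite3 and xml.etree.ElementTree rather than lxml) so that it runs on its own. The two documents, the unit names, the wrapper element and the glossary table schema are all made up for illustration; they are not taken from the OpenLearn scrape.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Two tiny, made-up documents standing in for scraped OU-XML units
demo_docs = {
    "Unit A": "<Item><Glossary><GlossaryItem><Term>Computer</Term>"
              "<Definition>Placeholder definition A</Definition>"
              "</GlossaryItem></Glossary></Item>",
    "Unit B": "<Item><Glossary><GlossaryItem><Term>Internet</Term>"
              "<Definition>Placeholder definition B</Definition>"
              "</GlossaryItem></Glossary></Item>",
}

# A second, "traditional" database to hold the mined glossary items
glossary_db = sqlite3.connect(":memory:")
glossary_db.execute(
    "CREATE TABLE glossary (unit TEXT, term TEXT, definition TEXT)"
)

for unit, xml_text in demo_docs.items():
    unit_root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so items nested at any depth are found
    for item in unit_root.iter("GlossaryItem"):
        glossary_db.execute(
            "INSERT INTO glossary VALUES (?, ?, ?)",
            (unit, item.findtext("Term"), item.findtext("Definition")),
        )

# Query the combined glossary across all the units
combined = glossary_db.execute(
    "SELECT term, unit FROM glossary ORDER BY term"
).fetchall()
print(combined)
```

Keeping the unit name alongside each term means that the combined glossary can still point back to the document each item came from.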
Viewing and Rendering XML Elements Using XSLT
One way of generating rendered views of XML content is to use XSLT, a transformation process in which an XSLT document describes how to transform each node in an XML document, such as an OU-XML document, into another form. For example, I have previously used XSLT to transform an OU-XML document into a set of simple markdown documents that can then be rendered as an interactive HTML textbook using a publishing workflow such as the Quarto or Jupyter Book publishing workflows.
Whilst the XSLT stylesheet I have previously used expects to find a <Session> element as the root element, we can also co-opt the stylesheet to render any collection of elements by defining a dummy root element and then applying the stylesheet's templates within that context:
<xsl:template match="DummyRoot">
<md>
<xsl:apply-templates />
</md>
</xsl:template>
As my stylesheet was designed to generate markdown (.md) structured content (which can also legitimately include HTML structured content), I nominally use the above transformation to dump the text into an XML <md> tag.
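To show how the dummy root rule works in isolation, here is a minimal sketch using a made-up, two-rule stylesheet defined inline. It is not the ouxml2md stylesheet: the Paragraph rule and all the variable names are assumptions for illustration only.

```python
from lxml import etree

# A made-up, minimal stylesheet: one rule for the shim root element
# and one rule for Paragraph elements
demo_xslt = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="DummyRoot">
    <md><xsl:apply-templates /></md>
  </xsl:template>
  <xsl:template match="Paragraph">
    <xsl:value-of select="normalize-space(.)" />
    <xsl:text>&#10;&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
""")
demo_transform = etree.XSLT(demo_xslt)

# Wrap a fragment in the shim so that the DummyRoot rule fires
demo_shim = etree.XML("<DummyRoot/>")
demo_shim.append(etree.XML("<Paragraph>Hello   OU-XML</Paragraph>"))

demo_result = demo_transform(demo_shim)
demo_md = demo_result.getroot().text
print(demo_md)
```

The DummyRoot rule does nothing except emit the <md> wrapper and hand its children on to whatever other rules the stylesheet defines, which is why any fragment can be rendered this way.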
Let’s import an XML processing package and create a simple utility function to convert an XML object to a text format:
from lxml import etree
def unpack(x, as_str=False):
    """Convenience function to look at the structure of an XML object."""
    return etree.tostring(x) if not as_str else etree.tostring(x).decode()
Define a handle for our XSLT-powered transformations:
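# Parse the stylesheet straight from the file path; this avoids leaving
# an open file handle and copes with an XML encoding declaration
xslt_transformer = etree.XSLT(etree.parse("xslt/ouxml2md.xslt"))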
Create some example XML to demonstrate the process:
test_xml = etree.XML("""<Activity><Question xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n\t\t\t\t\t\t\t\t<Paragraph><language xml:lang="FR">Lisez maintenant le poème à haute voix et allez ensuite écouter l’auteur lire son poème sur Internet, </language><?oxy_delete author="js34827" timestamp="20200630T134829+0100" content="<a href="http://routes.open.ac.uk/ixbin/hixclient.exe?_IXDB_=routes&amp;_IXSPFX_=g&amp;submit-button=summary&amp;%24+with+res_id+is+res23034=."><b><language xml:lang="FR">Paul Fort : poème</language></b></a>"?><?oxy_insert_start author="js34827" timestamp="20200630T134832+0100"?><a href="https://wheatoncollege.edu/vive-voix/par-auteur/fort/"><b><language xml:lang="FR">Paul Fort : poème</language></b></a><?oxy_insert_end?><b><language xml:lang="FR">.</language></b></Paragraph>\n\t\t\t\t\t\t\t</Question></Activity>""")
Now we need to wrap the test XML in a “shim” to which we can apply the transformation process using our previously created XSLT stylesheet:
wrapped_xml = etree.XML("<DummyRoot></DummyRoot>")
wrapped_xml.append(test_xml)
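One thing to be aware of when wrapping elements this way is that an lxml element can only have one parent, so appending an element that was found by querying a larger document moves it out of that document. The sketch below, which uses made-up content and variable names, shows the effect and one way round it using a deep copy.

```python
from copy import deepcopy
from lxml import etree

# Appending the element itself moves it out of its original parent
source_doc = etree.XML("<Session><Activity>Try this</Activity></Session>")
moved_shim = etree.XML("<DummyRoot/>")
moved_shim.append(source_doc[0])
remaining_after_move = len(source_doc)

# Appending a deep copy leaves the source document untouched
source_doc2 = etree.XML("<Session><Activity>Try this</Activity></Session>")
copied_shim = etree.XML("<DummyRoot/>")
copied_shim.append(deepcopy(source_doc2[0]))
remaining_after_copy = len(source_doc2)

print(remaining_after_move, remaining_after_copy)
```

For a standalone test fragment the move makes no difference, but it matters if the same parsed document is going to be queried again later.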
Apply the transformation:
# Apply the XSLT stylesheet
transformed_xml = xslt_transformer(wrapped_xml)
# Convert the generated XML object to text
md = unpack(transformed_xml.getroot()).decode()
# Strip the <md> tags from the text string
md = md.replace('<md xmlns:str="http://exslt.org/strings">', '').replace("</md>", "")
# HACK: Sphinx/Jupyterbook gets upset by header in cell output
md = md.replace("###", "HEADER: ")
print(md)
<!-- #region tags=["style-activity"] -->
HEADER: # Question
Lisez maintenant le poème à haute voix et allez ensuite écouter l’auteur lire son poème sur Internet, [__Paul Fort : poème__](https://wheatoncollege.edu/vive-voix/par-auteur/fort/)__.__
<!-- #endregion -->
The <md> wrapper tags have already been stripped from the text, so we can now render the markdown using the IPython display machinery:
from IPython.display import Markdown
Markdown(md)
HEADER: # Question
Lisez maintenant le poème à haute voix et allez ensuite écouter l’auteur lire son poème sur Internet, Paul Fort : poème.
For convenience, we could wrap those steps up in a single function:
def xml_transform(xml, xslt):
    """Transform an XML document via XSLT."""
    # Parse the stylesheet straight from the file path
    xslt_transformer = etree.XSLT(etree.parse(xslt))
    # Apply the XSLT stylesheet
    transformed_xml = xslt_transformer(xml)
    return transformed_xml
def ouxml2md(ouxml, xslt="xslt/ouxml2md.xslt", shim="DummyRoot"):
    """Convert OU-XML fragment to markdown."""
    # Convert bytes to parsed XML doc if required
    if isinstance(ouxml, bytes):
        ouxml = etree.fromstring(ouxml)
    elif isinstance(ouxml, str):
        ouxml = etree.fromstring(ouxml.encode("UTF8"))
    # Create the shim so we can apply the template at fragment level
    wrapped_xml = etree.XML(f"<{shim}></{shim}>")
    wrapped_xml.append(ouxml)
    transformed_xml = xml_transform(wrapped_xml, xslt)
    # Surely there's a better way to get the tag content?
    md = unpack(transformed_xml.getroot()).decode()
    md = md.replace('<md xmlns:str="http://exslt.org/strings">', '').replace("</md>", "")
    return md
Let’s try it:
ouxml2md(test_xml)
'\n<!-- #region tags=["style-activity"] -->\n\n#### Question\n\nLisez maintenant le poème à haute voix et allez ensuite écouter l’auteur lire son poème sur Internet, [__Paul Fort : poème__](https://wheatoncollege.edu/vive-voix/par-auteur/fort/)__.__\n\n<!-- #endregion -->\n'
We can now also convert the markdown to HTML:
from markdown import markdown
# HACK: Sphinx/Jupyterbook gets upset by header in cell output
md = md.replace("###", "HEADER: ")
# Convert the markdown to HTML
html = markdown(md)
print(html)
<!-- #region tags=["style-activity"] -->
<p>HEADER: # Question</p>
<p>Lisez maintenant le poème à haute voix et allez ensuite écouter l’auteur lire son poème sur Internet, <a href="https://wheatoncollege.edu/vive-voix/par-auteur/fort/"><strong>Paul Fort : poème</strong></a><strong>.</strong></p>
<!-- #endregion -->
And preview that, again using the IPython display machinery:
from IPython.display import HTML
# Render the HTML
HTML(html)
HEADER: # Question
Lisez maintenant le poème à haute voix et allez ensuite écouter l’auteur lire son poème sur Internet, Paul Fort : poème.
What this means is that we can search for and extract elements from our OU-XML documents and then preview those elements as HTML, assuming the stylesheet has appropriate rules defined for the corresponding OU-XML elements.
Generating Fully Rendered Output Documents from OU-XML Documents
TO DO