Working With Glossary Items

A large number of units in OpenLearn include explicitly identified glossary items. By mining the OpenLearn OU-XML documents, we can easily create a “meta-glossary” of terms across the whole of OpenLearn.

You can try such a search here. WARNING: 100MB+ download: this database application runs purely in your browser and may take a minute or two to load.

One of the fiddly but necessary tasks associated with course production is the creation of glossary items. Glossary items are explicitly defined in OU-XML materials, which means we can extract them directly and build up meta-glossaries with varying scopes: for example, a meta-glossary of all MXXX units, or a meta-glossary from all beginner level science courses. At an individual student level, we could construct a meta-glossary of all the glossed terms that have appeared in units the student has studied to date (along with a reference to which unit they appeared in). And so on.

In this section, I will demonstrate how we can scrape glossary items from across the OpenLearn unit OU-XML files in order to create a simple full-text search tool that allows us to search over just glossary terms and definitions.

Preparing the Ground

As ever, we need to set up a database connection:

from sqlite_utils import Database

# Open raw XML database connection
xml_dbname = "all_openlean_xml.db"
xml_db = Database(xml_dbname)

# Open assets database
dbname = "openlean_assets.db"
db = Database(dbname)

And get a sample XML file, selecting one that we know contains structurally marked up glossary items:

from lxml import etree
import pandas as pd

# Grab an OU-XML file that is known to contain glossary items
a210_xml_raw = pd.read_sql("SELECT xml FROM xml WHERE name='Approaching plays'",
                           con=xml_db.conn).loc[0, "xml"]

# Parse the XML into an xml object
root = etree.fromstring(a210_xml_raw)

Extracting Glossary Items

Glossary items are defined using a <Glossary> element, although the substantive details are contained in <GlossaryItem> elements [docs]:

<GlossaryItem xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Term>Amphitheatre</Term>
    <Definition> a circular structure with seats rising behind and above each other around a central open space or arena; originating in classical Greece, they are the first known specifically designated theatre spaces
    </Definition>
</GlossaryItem>

It is trivial enough to extract all the terms from a single unit:

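def get_gloss_items(root):
    """Extract (term, definition) pairs from a parsed OU-XML document."""
    glossary = []
    for g in root.xpath('//GlossaryItem'):
        terms = g.xpath("Term")
        definitions = g.xpath("Definition")
        # Skip malformed items that lack a Term or Definition element
        if not terms or not definitions:
            continue

        # Note: .text only returns the text before the first child element
        g_term = terms[0].text
        g_definition = definitions[0].text

        # Skip items with an empty term or definition
        if not g_term or not g_definition:
            continue

        glossary.append((g_term, g_definition))

    return glossary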

Let’s see how that works for our test unit:

# Preview the first few items
get_gloss_items(root)[:5]
[('Amphitheatre',
  ' a circular structure with seats rising behind and above each other around a central open space or arena; originating in classical Greece, they are the first known specifically designated theatre spaces.'),
 ('Apostrophe',
  " a rhetorical convention in which the speaker either addresses a dead or absent person, or an inanimate object or abstraction. An apostrophe can also refer to a speaker's address to a particular member or section of the audience."),
 ('Anagnorisis', ' a scene of recognition or discovery.'),
 ('Aside', ' a short speech spoken '),
 ('Blank verse', ' unrhymed iambic pentameters.')]
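
The definition returned for 'Aside' looks truncated. One possible cause is inline markup inside the <Definition> element, because .text only returns the text that precedes the first child element. The following is a minimal sketch of a variant that gathers all the nested text with itertext(). It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies (lxml elements also provide itertext()). The XML fragment, the <i> tag and the name get_gloss_items_full are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: the definition contains inline markup (<i>)
sample = """<Glossary>
  <GlossaryItem>
    <Term>Aside</Term>
    <Definition> a short speech spoken <i>sotto voce</i> to the audience.</Definition>
  </GlossaryItem>
</Glossary>"""


def get_gloss_items_full(root):
    """Extract (term, definition) pairs, keeping text inside child elements."""
    glossary = []
    for item in root.iter("GlossaryItem"):
        term = item.find("Term")
        definition = item.find("Definition")
        # Skip malformed items that lack a term or a definition element
        if term is None or definition is None:
            continue

        # itertext() walks the element and all of its descendants
        term_text = "".join(term.itertext()).strip()
        def_text = "".join(definition.itertext()).strip()

        if term_text and def_text:
            glossary.append((term_text, def_text))
    return glossary


sample_root = ET.fromstring(sample)
items = get_gloss_items_full(sample_root)
```

Note that this sketch also strips leading and trailing whitespace, so its output would differ slightly from the preview above.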

Adding Glossary Items to the Database

It’s trivial to add the glossary terms and definitions for all our units to the database, along with support for full text search over all the items.

First, create appropriate tables to store the data:

all_gloss_tbl = db["glossary"]
all_gloss_tbl.drop(ignore=True)
all_gloss_tbl.create({
    "code": str,
    "name": str,
    "term": str,
    "definition": str,
    "_id": str
})
# Note that in this case the _id is not unique
# because the same _id is shared by every glossary item in a unit
# The _id is a reference for joining tables only

# Enable full text search
# This creates an extra virtual table (glossary_fts) to support the full text search
db[f"{all_gloss_tbl.name}_fts"].drop(ignore=True)
db[all_gloss_tbl.name].enable_fts(["term", "definition", "_id"], create_triggers=True)
<Table glossary (code, name, term, definition, _id)>

Now we can iterate over all the OU-XML documents, extract any glossary items contained therein, and add them to our database table:

from xml_utils import create_id

for row in xml_db.query("""SELECT * FROM xml;"""):
    root = etree.fromstring(row["xml"])
    gloss_items = get_gloss_items(root)
    # From the list of glossary items,
    # create a list of dict items we can add to the database
    gloss_item_dicts = [{"term": g[0], "definition": g[1],
                         "code": row["code"], "name": row["name"]} for g in gloss_items if g[0] or g[1] ]
    
    # Add a reference id for each record
    create_id(gloss_item_dicts, id_field="_id")
    
    # Add items to the database
    db[all_gloss_tbl.name].insert_all(gloss_item_dicts)
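
The create_id helper is imported from a local xml_utils module that is not shown here. Purely as an illustration, a hypothetical stand-in might add a SHA-1 hex digest of some unit-level fields to each record in place. The choice of fields to hash (code and name) and the function name create_id_sketch are my assumptions, not the actual implementation.

```python
import hashlib


def create_id_sketch(records, id_field="_id", key_fields=("code", "name")):
    """Hypothetical stand-in for create_id: add a SHA-1 hex digest
    of the key fields to each record, modifying the records in place."""
    for record in records:
        key = "|".join(str(record.get(f, "")) for f in key_fields)
        record[id_field] = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return records


# Made-up records from the same (made-up) unit
records = [
    {"code": "X000", "name": "Example unit", "term": "alpha", "definition": "first"},
    {"code": "X000", "name": "Example unit", "term": "beta", "definition": "second"},
]
create_id_sketch(records)
```

Under this assumption, every glossary item from the same unit receives the same _id, which is why the field cannot serve as a unique key.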

Now we can test a query:

pd.read_sql("SELECT * FROM glossary LIMIT 3", con=db.conn)
   code  name                            term                            definition                                          _id
0  L314  Advanced Spanish: Protest song  desalambrar                     Quitar las vallas de alambre que cercan un rec...  3c7d27258e3c209f19f09a459d223ca875523a10
1  L314  Advanced Spanish: Protest song  el encasillamiento              clasificación, generalmente simplista               3c7d27258e3c209f19f09a459d223ca875523a10
2  L314  Advanced Spanish: Protest song  busca amplitud en su propuesta  quiere realizar un trabajo que comprende difer...  3c7d27258e3c209f19f09a459d223ca875523a10
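
The scoped meta-glossaries mentioned at the start of this section (for example, all MXXX units) reduce to a filter on the code column. Here is a minimal sketch using the standard library's sqlite3 module and an in-memory table with the same columns as the glossary table. The rows and the scoped_glossary helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE glossary (code TEXT, name TEXT, term TEXT, definition TEXT, _id TEXT)"
)

# Made-up rows standing in for the real glossary table
rows = [
    ("M000", "Made-up maths unit", "vector", "a quantity with magnitude and direction", "id1"),
    ("M001", "Another made-up maths unit", "matrix", "a rectangular array of numbers", "id2"),
    ("A000", "Made-up arts unit", "aside", "a short speech", "id3"),
]
conn.executemany("INSERT INTO glossary VALUES (?, ?, ?, ?, ?)", rows)


def scoped_glossary(conn, code_prefix):
    """Return (term, definition, code) rows for units whose code
    starts with the given prefix, sorted alphabetically by term."""
    q = """SELECT term, definition, code FROM glossary
           WHERE code LIKE ? ORDER BY term COLLATE NOCASE"""
    return conn.execute(q, (code_prefix + "%",)).fetchall()


maths_terms = scoped_glossary(conn, "M")
```

The same pattern would apply to any other scope that can be expressed as a condition on the unit metadata.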

Or a full text search:

def fts(db, base_tbl, q):
    """Run a simple full-text search query 
       over a table with an FTS virtual table."""
    _q = f"""SELECT * FROM {base_tbl}_fts 
             WHERE {base_tbl}_fts MATCH {db.quote(q)} ;"""
    
    return pd.read_sql(_q, con=db.conn)

fts(db, "glossary", "member audience").to_dict(orient="records")
[{'term': 'Apostrophe',
  'definition': " a rhetorical convention in which the speaker either addresses a dead or absent person, or an inanimate object or abstraction. An apostrophe can also refer to a speaker's address to a particular member or section of the audience.",
  '_id': '6a3ff37cfd1ffed0fe1e98c833db2dc0b0dd53c1'}]
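
The search result only includes the columns that were indexed, so the unit code and name are missing. One way to recover them is to join the FTS table back to the base table on rowid. The sketch below uses the standard library's sqlite3 module and builds its own small FTS5 index, assuming FTS5 is available in your SQLite build and that the index rowids match the base table rowids (which I believe is how sqlite_utils sets things up, though that is an assumption here). The rows and the fts_with_unit helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE glossary (code TEXT, name TEXT, term TEXT, definition TEXT, _id TEXT)"
)
# Placeholder rows: the code, name and _id values are made up
conn.executemany(
    "INSERT INTO glossary VALUES (?, ?, ?, ?, ?)",
    [
        ("X000", "Made-up unit", "Apostrophe",
         "a speaker's address to a particular member or section of the audience", "id1"),
        ("X000", "Made-up unit", "Aside",
         "a short speech spoken to the audience", "id1"),
    ],
)

# Assumption: an FTS5 index that uses the base table as its content table
conn.execute(
    "CREATE VIRTUAL TABLE glossary_fts "
    "USING fts5(term, definition, _id, content='glossary')"
)
conn.execute(
    "INSERT INTO glossary_fts (rowid, term, definition, _id) "
    "SELECT rowid, term, definition, _id FROM glossary"
)


def fts_with_unit(conn, q):
    """Full-text search that also returns the unit code and name."""
    sql = """SELECT g.code, g.name, g.term
             FROM glossary_fts
             JOIN glossary AS g ON g.rowid = glossary_fts.rowid
             WHERE glossary_fts MATCH ?
             ORDER BY glossary_fts.rank"""
    return conn.execute(sql, (q,)).fetchall()


results = fts_with_unit(conn, "member audience")
```

sqlite_utils tables also provide a search() method that queries the base table through its FTS index, which may be a simpler route to the same result.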