OU-XML Health Checks and Quality Audits#

As a structured document, OU-XML contains a wide variety of countable elements, so it is straightforward to generate reports that describe how many elements of a particular kind a document contains. For example, we might count the number of learning outcomes defined in a particular unit, or the number of media assets it includes. These counts can feed into quality checks for the document. For example, we might require that the number of outcomes is non-zero (i.e. at least one learning outcome is described at a particular level within the OU-XML document); or, where media items are included, we might flag the unit as one that may have rights clearance considerations or that may raise accessibility issues.
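As a minimal sketch of this kind of count, the following uses the Python standard library on a toy fragment. The fragment and its element names (`LearningOutcome`, `Figure`, `MediaContent`) are assumptions made for illustration rather than a statement of the OU-XML schema, and `count_elements` is a hypothetical helper.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Toy fragment: the element names here are illustrative assumptions
SAMPLE_OUXML = """
<Item>
  <Unit>
    <LearningOutcomes>
      <LearningOutcome>Describe the water cycle</LearningOutcome>
      <LearningOutcome>Explain evaporation</LearningOutcome>
    </LearningOutcomes>
    <Session>
      <Figure><Image src="cycle.jpg"/></Figure>
      <MediaContent type="audio" src="rain.mp3"/>
    </Session>
  </Unit>
</Item>
"""


def count_elements(xml_text):
    """Return a Counter mapping each element tag to its number of occurrences."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


counts = count_elements(SAMPLE_OUXML)

# Simple health checks derived from the counts
has_outcomes = counts["LearningOutcome"] > 0
needs_rights_review = counts["Figure"] + counts["MediaContent"] > 0
```

Because a `Counter` returns zero for tags that never appear, the same checks can be run over documents that lack outcomes or media without special-casing.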

Furthermore, several structural elements are defined in such a way that they encourage, if not actually require, the inclusion of metadata or supplementary information: for example, an estimated time requirement in an activity, or a transcript or long description for a media item.
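A check for this sort of supplementary information might look something like the sketch below. It assumes, purely for illustration, that an activity is an `Activity` element whose time estimate lives in a `Timing` child. Both names, and the helper `missing_child`, are hypothetical here.

```python
import xml.etree.ElementTree as ET

# Toy fragment: tag names are assumptions for illustration
SAMPLE = """
<Session>
  <Activity id="act1">
    <Heading>Activity 1</Heading>
    <Timing>Allow about 10 minutes</Timing>
  </Activity>
  <Activity id="act2">
    <Heading>Activity 2</Heading>
  </Activity>
</Session>
"""


def missing_child(xml_text, parent_tag, child_tag):
    """Return parent elements that lack a non-empty direct child with child_tag."""
    root = ET.fromstring(xml_text)
    missing = []
    for parent in root.iter(parent_tag):
        child = parent.find(child_tag)
        # Treat an absent child and a whitespace-only child the same way
        if child is None or not "".join(child.itertext()).strip():
            missing.append(parent)
    return missing


untimed = [a.get("id") for a in missing_child(SAMPLE, "Activity", "Timing")]
```

The same helper could be pointed at other parent and child pairs, such as a media item and its transcript, if the relevant tag names are known.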

Finally, the OU-XML document may include references to assets that are themselves “testable”: web links to third-party resources that can be link-checked, for example.
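As a rough sketch of a link check, the code below pulls `href` values out of a toy fragment and tests each one with an HTTP `HEAD` request. The use of `a` elements with an `href` attribute is an assumption for illustration, the URLs are placeholders, and the helper names are hypothetical. No request is made until `check_links` is called.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Toy fragment: link markup and URLs are placeholders
SAMPLE = """
<Paragraph>See <a href="https://example.org/resource">a resource</a>
and <a href="https://example.org/missing">another page</a>.</Paragraph>
"""


def extract_links(xml_text):
    """Return the href of every <a> element that has one, in document order."""
    root = ET.fromstring(xml_text)
    return [a.get("href") for a in root.iter("a") if a.get("href")]


def http_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a broken link
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, timeout, refused connection, etc.


def check_links(urls, status_fn=http_status):
    """Map each URL to its status; status_fn can be swapped out for testing."""
    return {url: status_fn(url) for url in urls}


links = extract_links(SAMPLE)
```

Calling `check_links(links)` would then perform the live checks. Some servers reject `HEAD` requests, so a fuller checker might retry with `GET` before reporting a link as broken.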

In this section, we will review several ways in which we can produce quick summary reports that review the structural content of an OU-XML document and hint at possible health checks and quality audit data we might derive from it.

As well as running reports on individual OU-XML documents, we can also run reports over a set of OU-XML documents, or over assets retrieved from them. This means we could generate reports on the health of the OpenLearn OU-XML corpus as a whole, or audit OU-XML documents from different faculties to see whether units from one faculty have a different asset mix to units from another.

Database Setup#

Set up a database connection to our OpenLearn curriculum asset database:

from sqlite_utils import Database
import pandas as pd

# Open database connection
dbname = "all_openlean_xml.db"
db = Database(dbname)
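Before running any audits, it can be useful to confirm which tables the database contains and how many rows each holds. The sketch below does this with the standard `sqlite3` module against an in-memory stand-in database. The function `table_row_counts` is a hypothetical helper, and the demo tables are cut-down assumptions rather than the real schema.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Demo on an in-memory stand-in (not the real OpenLearn database)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE figures (xpath TEXT, code TEXT, name TEXT)")
demo.execute("CREATE TABLE media (xpath TEXT, code TEXT, name TEXT, type TEXT)")
demo.execute(
    "INSERT INTO figures VALUES ('/Figure', 'H807', 'Accessibility of eLearning')"
)
demo.commit()

summary = table_row_counts(demo)
```

Since the function only needs a `sqlite3` connection, it should also accept the `db.conn` object that is passed to pandas elsewhere in this section.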

OU-XML Media Item Quality Checks#

Where media items are defined in an OU-XML document, they should include information supplementary to the resource reference such as a long description or transcript.

It is trivial to check that the corresponding fields are populated, even if it is not so easy to check the substance of their content.

A Source reference attribute is available that may be useful for rights tracking purposes etc.

Checking Figure Asset Definitions#

When image assets are used, there is often a requirement to provide a written description of each one that a screen reader can use to offer visually impaired readers an alternative way of experiencing the asset.

By scraping individual figure items into a database table, we can trivially generate reports on the extent to which alternative descriptions are provided, and identify units where they may be missing.

For example, we can query all the figure assets referenced across all OpenLearn units to identify figures where there is no caption, description or alternative text given:

pd.read_sql("""SELECT xpath, code, name, type, caption, alt, description
            FROM figures
            WHERE caption IS NULL AND alt IS NULL AND description IS NULL;""",
            con=db.conn)
|  | xpath | code | name | type | caption | alt | description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | /MediaContent/Figure | L101 | A brief history of communication: hieroglyphic... | jpg | None | None | None |
| 1 | /Figure | H807 | Accessibility of eLearning | jpg | None | None | None |
| 2 | /Figure |  | Addysg gynhwysol: deall yr hyn a olygwn (Cymru) | tif | None | None | None |
| 3 | /Figure |  | Addysg gynhwysol: deall yr hyn a olygwn (Cymru) | tif | None | None | None |
| 4 | /Figure |  | Addysg gynhwysol: deall yr hyn a olygwn (Cymru) | tif | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2005 | /Introduction/MediaContent/Figure |  | Розуміння вашої галузі | jpg | None | None | None |
| 2006 | /Introduction/MediaContent/Figure |  | Розуміння вашої галузі | jpg | None | None | None |
| 2007 | /MediaContent/Figure |  | Розуміння вашої галузі | jpg | None | None | None |
| 2008 | /Activity/Question/MediaContent/Figure | K314 | Розуміння проблем психічного здоров'я | jpg | None | None | None |
| 2009 | /Activity/Multipart/Part/Question/MediaContent... | K314 | Розуміння проблем психічного здоров'я | jpg | None | None | None |

2010 rows × 7 columns
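To identify the units where descriptions are most often missing, the row-level report can be rolled up per unit. The following is a sketch on an in-memory stand-in table that reuses the column names from the query above, with invented rows and unit codes. It also treats empty or whitespace-only strings as missing, which is an assumption about how blank fields might be stored rather than something known about the real database.

```python
import sqlite3

# In-memory stand-in for the figures table; all rows are invented
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE figures (xpath TEXT, code TEXT, name TEXT, type TEXT, "
    "caption TEXT, alt TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO figures VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("/Figure", "X001", "Unit A", "jpg", None, None, None),
        ("/Figure", "X001", "Unit A", "jpg", None, "", "  "),
        ("/Figure", "X001", "Unit A", "jpg", "Figure 1", "A map", None),
        ("/MediaContent/Figure", "X002", "Unit B", "png", None, None, None),
    ],
)

# Count figures per unit, and how many have no caption, alt text or description
QUERY = """
SELECT code, name,
       COUNT(*) AS figures,
       SUM(CASE WHEN COALESCE(TRIM(caption), '') = ''
                 AND COALESCE(TRIM(alt), '') = ''
                 AND COALESCE(TRIM(description), '') = ''
            THEN 1 ELSE 0 END) AS undescribed
FROM figures
GROUP BY code, name
ORDER BY undescribed DESC, name
"""

report = conn.execute(QUERY).fetchall()
```

The same query string could be passed to `pd.read_sql()` with the real database connection to get the report as a dataframe.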

Checking Audio Asset Definitions#

Ideally, every audio asset should have a transcript. We can generate summary reports from the database to identify audio items that do not have an associated transcript.

For example:

pd.read_sql("""SELECT xpath, code, name, type, caption, transcript, description
            FROM media WHERE type='audio' AND transcript IS NULL""",
            con=db.conn)
|  | xpath | code | name | type | caption | transcript | description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | /Activity/Question/MediaContent | L313 | Advanced German: Language, culture and history | audio | None | None | None |
| 1 | /Activity/Question/MediaContent | L313 | Advanced German: Language, culture and history | audio | None | None | None |
| 2 | /Activity/Question/MediaContent | L313 | Advanced German: Language, culture and history | audio | None | None | None |
| 3 | /Activity/Question/MediaContent | L313 | Advanced German: Language, culture and history | audio | None | None | None |
| 4 | /Activity/Question/MediaContent | L313 | Advanced German: Language, culture and history | audio | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 112 | /Activity/Question/MediaContent | S110 | Using numbers and handling data | audio | None | None | None |
| 113 | /Activity/Question/MediaContent | S110 | Using numbers and handling data | audio | None | None | None |
| 114 | /Activity/Question/MediaContent | S110 | Using numbers and handling data | audio | None | None | None |
| 115 | /Activity/Question/MediaContent | S110 | Using numbers and handling data | audio | None | None | None |
| 116 | /MediaContent | S111 | What are waves? | audio | Audio 1 The song of a humpback whale. (1:09 min) | None | None |

117 rows × 7 columns

Many of these appear to be audio files related to activities in language units, so perhaps providing a transcript would defeat the purpose of the activity?

The “whale song” item is interesting in its lack of a transcript or description: there is no transcript as such, but the sound could presumably still be described?
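One way to separate audio that is embedded in activities from audio that appears elsewhere is to split the missing-transcript counts by xpath context. The sketch below does this on an in-memory stand-in for the `media` table, with invented rows and unit codes. Matching on the substring `/Activity/` is an assumed heuristic, not an established rule.

```python
import sqlite3

# In-memory stand-in for the media table; all rows are invented
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media (xpath TEXT, code TEXT, name TEXT, type TEXT, "
    "caption TEXT, transcript TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO media VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("/Activity/Question/MediaContent", "X101", "Toy language unit",
         "audio", None, None, None),
        ("/Activity/Question/MediaContent", "X101", "Toy language unit",
         "audio", None, None, None),
        ("/MediaContent", "X102", "Toy science unit",
         "audio", "Audio 1", None, None),
        ("/MediaContent", "X102", "Toy science unit",
         "audio", "Audio 2", "Speaker: hello", None),
    ],
)

# For audio without a transcript, count items inside and outside activities
QUERY = """
SELECT code, name,
       SUM(xpath LIKE '%/Activity/%') AS in_activity,
       SUM(xpath NOT LIKE '%/Activity/%') AS elsewhere
FROM media
WHERE type = 'audio' AND transcript IS NULL
GROUP BY code, name
ORDER BY code
"""

by_context = conn.execute(QUERY).fetchall()
```

Units with a high `elsewhere` count may be better candidates for follow-up than units where the untranscribed audio sits inside an activity.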

Checking Video Asset Definitions#

In the same way that we checked audio asset descriptions for the presence of transcripts, etc., we can also check for the presence of video transcripts.

Once again, reporting the xpath context reference for the asset might provide a clue as to why a transcript is missing in any particular case.

pd.read_sql("""SELECT xpath, code, name, type, caption, transcript, description
            FROM media WHERE type='video' AND transcript IS NULL""",
            con=db.conn)
|  | xpath | code | name | type | caption | transcript | description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | /Activity/Question/MediaContent | L310 | Advanced French: At the science museum in Paris | video | None | None | None |
| 1 | /MediaContent |  | An introduction to geology | video | Video 1.5 Magnetic reversal stripes | None | None |
| 2 | /MediaContent | S182 | Aquatic mammals | video | Video 2 Humpback whale feeding | None | None |
| 3 | /Activity/Question/MediaContent | S182 | Aquatic mammals | video | Video 3 Hunting behaviour of dolphins | None | None |
| 4 | /MediaContent |  | Being an OU student | video | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 97 | /Activity/Question/MediaContent | U214 | Англійська у світі сьогодні | video | None | None | None |
| 98 | /Activity/Question/MediaContent | U214 | Англійська у світі сьогодні | video | None | None | None |
| 99 | /Activity/Question/MediaContent | U214 | Англійська у світі сьогодні | video | None | None | None |
| 100 | /Activity/Question/MediaContent | U214 | Англійська у світі сьогодні | video | None | None | None |
| 101 | /Activity/Question/MediaContent | U214 | Англійська у світі сьогодні | video | None | None | None |

102 rows × 7 columns
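A raw count of missing transcripts is easier to interpret alongside the total number of items of each type. The following sketch computes a simple transcript coverage figure per media type, again on an in-memory stand-in for the `media` table with invented rows. The column names follow the queries above, and everything else is assumed for illustration.

```python
import sqlite3

# In-memory stand-in for the media table; all rows are invented
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media (xpath TEXT, code TEXT, name TEXT, type TEXT, "
    "caption TEXT, transcript TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO media VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("/MediaContent", "X201", "Unit C", "audio", None, None, None),
        ("/MediaContent", "X201", "Unit C", "audio", None, "Text", None),
        ("/MediaContent", "X202", "Unit D", "video", None, "Text", None),
        ("/MediaContent", "X202", "Unit D", "video", None, "Text", None),
        ("/MediaContent", "X202", "Unit D", "video", None, "Text", None),
        ("/Activity/Question/MediaContent", "X202", "Unit D", "video",
         None, None, None),
    ],
)

# Items, missing transcripts and percentage coverage for each media type
QUERY = """
SELECT type,
       COUNT(*) AS items,
       SUM(transcript IS NULL) AS missing,
       ROUND(100.0 * SUM(transcript IS NOT NULL) / COUNT(*), 1)
           AS pct_with_transcript
FROM media
GROUP BY type
ORDER BY type
"""

coverage = conn.execute(QUERY).fetchall()
```

Adding `code` to the `SELECT` and `GROUP BY` clauses would give the same coverage figure per unit rather than per media type.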

Equation Quality Checks#

The equation object definition supports captions, alternative descriptions, etc., and so the completion of these tags is checkable.
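As a sketch of what such a check might look like, the code below reports which supplementary tags are absent or empty for each equation in a toy fragment. The tag names (`Equation`, `TeX`, `Caption`, `Alternative`, `Description`), the `id` attribute and the helper `equation_audit` are all assumptions made for illustration, and should be checked against the actual OU-XML schema.

```python
import xml.etree.ElementTree as ET

# Toy fragment: tag and attribute names are illustrative assumptions
SAMPLE = """
<Session>
  <Equation id="eq1">
    <TeX>E = mc^2</TeX>
    <Alternative>E equals m c squared</Alternative>
  </Equation>
  <Equation id="eq2">
    <TeX>a^2 + b^2 = c^2</TeX>
  </Equation>
</Session>
"""

# Supplementary tags we would like each equation to complete (assumed names)
CHECKED_TAGS = ("Caption", "Alternative", "Description")


def equation_audit(xml_text):
    """Map each equation id to the list of checked tags that are missing or empty."""
    root = ET.fromstring(xml_text)
    report = {}
    for eq in root.iter("Equation"):
        report[eq.get("id")] = [
            tag for tag in CHECKED_TAGS if not (eq.findtext(tag) or "").strip()
        ]
    return report


audit = equation_audit(SAMPLE)
```

An equation with an empty list in the report has all of the checked tags completed.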