Working With Figures and Media Items#

When working with images in course production, there are several approaches we may take:

  • discovering and reusing images and diagrams that already exist in order to illustrate a topic; an example image may be selected by an author either with the intention that rights clearance (if required) is obtained for that image, or the image may be used as a prompt for a picture researcher to find an image with a similar sense;

  • discovering images and diagrams based on a description of an image that is intended to serve a particular effect; recent advances in the automated generation of images from free text descriptions (e.g. StableDiffusion, DALL-E) may offer interesting possibilities here, either by generating images that can be used directly or by generating images that can be used as prompts by picture researchers;

  • using scientific or mathematical charts and diagrams generated by scientific software (example). In these cases, whilst the detail of the diagram may be important (data points are in the correct locations), the style of the diagram may be subject to artistic or design considerations;

  • generating diagrams using strict syntax diagram generation tools: for example, generating box and arrow diagrams or flow charts from simple text descriptions; (see for example Generating Diagrams from Text Using BlockDiag).
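By way of illustration of the last of these approaches, a box and arrow diagram can be generated by the blockdiag tool from a minimal text description along the following lines:

```
blockdiag {
  // A simple box and arrow diagram
  A -> B -> C;
  B -> D;
}
```
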

The OU-XML document structure supports the description of figures (which is to say, images) as well as various media file types, such as audio and video asset types. In this section, we will see how we can support media discovery across the OpenLearn units by means of a simple full-text search over media captions and transcripts. The availability of figure descriptions also provides us with a corpus of text descriptions that it might be interesting to run through image-from-text generation tools to see what sorts of things they come up with, and how text prompts need to be phrased to generate something resembling the intended effect.

For now, the actual media assets (images, videos, audio files, etc.) will not be scraped or added to the database. This means that media assets can be searched for, but not directly rendered.

Preparing the Ground#

As ever, we need to set up a database connection.

from sqlite_utils import Database

# Open database connection
xml_dbname = "all_openlean_xml.db"
xml_db = Database(xml_dbname)

media_dbname = "openlean_assets.db"
db = Database(media_dbname)

And get a sample XML file, selecting one that we know contains structurally marked up figure and media elements:

from lxml import etree
import pandas as pd

# Grab an OU-XML file that is known to contain figure and media items
a111_xml_raw = pd.read_sql("SELECT xml FROM xml WHERE code='A111'",
                           con=xml_db.conn).loc[0, "xml"]

# Parse the XML into an XML object
root = etree.fromstring(a111_xml_raw)

Let’s also import a couple of utility functions:

from xml_utils import flatten, unpack

Extracting Figure Items#

Image references are provided inside a <Figure> element [docs]. This structured element type also bundles a caption and a description:

<Figure>
    <Image src="https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_1_openlearn_piano.tif" x_printonly="y" x_folderhash="5c95a855" x_contenthash="d74c57b3" x_imagesrc="a111_1_openlearn_piano.tif.jpg" x_imagewidth="512" x_imageheight="325"/>
    <Caption><b>Figure</b> 10 Playing the blues</Caption>
    <Description>This is a photograph showing a person's hands playing the piano.</Description>
</Figure>

An <Alternative> text tag may also be included.

To support discovery of figure items in a particular context, it will be useful to save the XML path to the figure elements. We can then use this as a basis for identifying whether a figure is used in an activity context, for example.

We can obtain the path to an element by generating a document tree and then locating a particular element within that structure.

# Generate the tree structure
tree = etree.ElementTree(root)

# Get example figure item
test_figure = root.xpath("//Figure")[0]

# Find the path to a figure item
tree.getpath(test_figure)
'/Item/Unit/Session[1]/Figure'

We can use a regular expression replacement to tidy the path up to remove the index values and give us a raw context path:

import re

def context_path(xpath):
    """Simplify an xpath to give a cleaner context path."""
    xpath = re.sub(r'\[\d+\]', '', xpath)
    xpath = re.sub(r'(/Item|/Unit|/Session|/Section|/SubSection|/InternalSection)', '', xpath)
    return xpath

For example:

context_path('/Item/Unit[7]/Session[3]/Activity[1]/Question/')
'/Activity/Question/'

Let’s now create a function that grabs data from a figure element that we can use to support figure discovery and audit:

def get_figure_items(root):
    """Extract figure items from an OU-XML XML object."""
    figures = root.xpath('//Figure')
    
    # We use the tree to find the path to the figure element
    tree = etree.ElementTree(root)
    
    # List of figure items in a unit
    figure_items = []
    figure_urls = []
    for f in figures:
        # Guard against a missing <Image> element as well as a missing src attribute
        image = f.find("Image")
        f_url = image.get("src") if image is not None else None
        # If there is no image reference, do not save this item
        # TO DO: should we warn if there is other figure data but no url?
        if not f_url:
            continue

        # If the same figure element appears multiple times in a unit
        # we only want to save it once.
        # TO DO: we should probably check multiple instances to ensure
        #        that the other elements (eg descriptions) are consistent.
        if f_url in figure_urls:
            # Optionally report on the duplicate items - useful for debugging
            #print(f"Duplicate item {f_url} — duplicate not checked or stored.")
            # Don't extract the duplicate
            continue
        else:
            figure_urls.append(f_url)

        f_type = f_url.split(".")[-1]
        tmp_caption = f.xpath("Caption")
        f_caption = flatten(tmp_caption[0]).strip() if tmp_caption else None
        tmp_alt = f.xpath("Alternative")
        f_alt = flatten(tmp_alt[0]).strip() if tmp_alt else None
        tmp_desc = f.xpath("Description")
        f_desc = flatten(tmp_desc[0]).strip() if tmp_desc else None
        
        # Get the path to the figure element
        f_xpath = context_path(tree.getpath(f))
        
        # These attributes appear to be access control elements
        x_folderhash = f.find("Image").get("x_folderhash")
        x_contenthash = f.find("Image").get("x_contenthash")
        x_imagesrc = f.find("Image").get("x_imagesrc")
        
        figure_items.append( (f_type, f_url, f_caption, f_alt, f_desc, f_xpath,
                              x_folderhash, x_contenthash, x_imagesrc) )

    return figure_items
get_figure_items(root)[:2]
[('tif',
  'https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_1_f01.tif',
  'Figure 1 B. B. King performing c.1968. Photo Michael Ochs Archives/Getty Images (via Britannica Image Quest)',
  None,
  'This is a photograph of B. B. King playing the guitar on stage.',
  '/Figure',
  '5c95a855',
  'b434d4cc',
  'a111_1_f01.tif.jpg'),
 ('tif',
  'https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_1_openlearn_beatles.tif',
  'Figure 2 The music of the Beatles',
  None,
  'This is an image showing a number of records by the Beatles.',
  '/Figure',
  '5c95a855',
  '4300681e',
  'a111_1_openlearn_beatles.tif.jpg')]

Adding Figure Items to the Database#

To support discovery, we can add figure item captions, descriptions and alternative text to a simple database table and then provide full-text search over them.

For now, we will omit scraping the figure assets into the database. An example of how to add a media item to the database, as well as retrieve and render a media item, is provided in the Scraping OpenLearn section.

Let’s define a simple table to store this information, along with an associated full-text search table:

all_figures_tbl = db["figures"]
all_figures_tbl.drop(ignore=True)
all_figures_tbl.create({
    "type": str,
    "caption": str,
    "alt": str,
    "description": str,
    "xpath": str,
    "x_folderhash": str,
    "x_contenthash": str,
    "x_imagesrc": str,
    "code": int,
    "name": str,
    "url": str,
    "id": str,
    "_id": str #derived from code and name
}, pk=("id"))

# Enable full text search
# This creates an extra virtual table (figures_fts) to support the full text search
db[f"{all_figures_tbl.name}_fts"].drop(ignore=True)
db[all_figures_tbl.name].enable_fts(["caption", "alt", "description",
                                   "id"], create_triggers=True)
<Table figures (type, caption, alt, description, xpath, x_folderhash, x_contenthash, x_imagesrc, code, name, url, id, _id)>

Now we can populate the table with our figure assets:

from xml_utils import create_id

for row in xml_db.query("""SELECT * FROM xml;"""):
    _root = etree.fromstring(row["xml"])
    figure_items = get_figure_items(_root)
    # From the list of media items,
    # create a list of dict items we can add to the database
    figure_item_dicts = [{"type": f[0],  "url": f[1],
                         "caption": f[2], "alt": f[3],
                          "description": f[4], "xpath": f[5],
                          "x_folderhash": f[6], "x_contenthash": f[7], "x_imagesrc": f[8],
                         "code": row["code"], "name": row["name"]} for f in figure_items if f[1] ]

    # Add a unique id for each record
    create_id(figure_item_dicts, fields=["code", "name", "url"])
    # And a cross reference id for each record
    create_id(figure_item_dicts, id_field="_id")
        
    # Add items to the database
    db[all_figures_tbl.name].insert_all(figure_item_dicts)
db.vacuum()

Let’s see what image types are referenced across the OpenLearn units:

pd.read_sql("SELECT DISTINCT(type) FROM figures", con=db.conn)
type
0 jpg
1 tif
2 png
3 eps
4 gif
5 mp4

What sorts of contexts do the figures appear in, in terms of their path locations within the OU-XML documents?

pd.read_sql("SELECT DISTINCT(xpath) FROM figures", con=db.conn)
xpath
0 /MediaContent/Figure
1 /Figure
2 /Activity/Question/Figure
3 /Activity/Discussion/Figure
4 /Activity/Multipart/Part/Question/Figure
5 /Activity/Multipart/Figure
6 /Activity/Multipart/Part/Question/Table/tbody/...
7 /Activity/Question/Reading/Figure
8 /Activity/Answer/Figure
9 /Box/Figure
10 /SAQ/Answer/Figure
11 /Activity/Multipart/Part/Answer/Figure
12 /Activity/Multipart/Part/Question/MediaContent...
13 /Activity/Question/MediaContent/Figure
14 /FrontMatter/Introduction/Figure
15 /Activity/Discussion/Extract/Figure
16 /SAQ/Question/Figure
17 /NumberedList/ListItem/Figure
18 /CaseStudy/Figure
19 /Activity/Multipart/Part/Question/NumberedList...
20 /Introduction/MediaContent/Figure
21 /Activity/Multipart/Part/Discussion/Figure
22 /FrontMatter/Introduction/MediaContent/Figure
23 /StudyNote/Figure
24 /Extract/Figure
25 /Activity/Question/UnNumberedList/ListItem/Figure
26 /Activity/Multipart/Part/Question/Reading/Figure
27 /Activity/Multipart/Part/Question/Reading/Read...
28 /Introduction/Figure
29 /Activity/Question/CaseStudy/Figure
30 /SAQ/Discussion/Figure
31 /Activity/Question/Extract/Figure
32 /Box/MediaContent/Figure
33 /Table/tbody/tr/td/Figure
34 /Example/Figure
35 /Example/Answer/Figure
36 /ITQ/Answer/Figure
37 /Activity/Answer/Box/MediaContent/Figure
38 /Activity/Answer/Box/Figure
39 /Activity/MediaContent/Figure
40 /Activity/Multipart/Part/MediaContent/Figure
41 /Activity/Multipart/Part/Question/Box/Figure
42 /Activity/Question/Box/Figure
43 /Activity/Question/Box/MediaContent/Figure
44 /Activity/Discussion/MediaContent/Figure
45 /Activity/Discussion/NumberedList/ListItem/Figure
46 /ITQ/Question/Figure
47 /BulletedList/ListItem/Figure
48 /SubSubSection/Figure
49 /SubSubSection/Activity/Question/Figure
50 /SubSubSection/Activity/Discussion/Figure
51 /FrontMatter/Introduction/Activity/Question/Me...
52 /Activity/Multipart/Part/Answer/NumberedList/L...
53 /Activity/Question/Quote/Figure
54 /Activity/Multipart/MultiColumnText/MultiColum...
55 /SAQ/Question/MediaContent/Figure
56 /MediaContent/Description/Figure
57 /Box/BulletedList/ListItem/Figure

Let’s also try a full-text search over all the images referenced from OU-XML documents:

from xml_utils import fts

fts(db, "figures", "protest")
caption alt description id
0 Figure 5 Policing and the right to protest None A line of uniformed police officers standing o... 05bce9978659e418b36040e8590996e2f116edaf
1 Figure 1 Climate change protesters near the Br... None A young person holding up a placard during a p... 28c921f9786fd2f6e625eb1369ef5462b7805738
2 Figure 3 Left, a sculpture of Baartman at the ... On the left is a photograph of a sculpture of ... On the left is a photograph of a sculpture of ... 68045d707309f11a19ab4f45286d1c5daaae783b
3 Figure 4 None A trade union protest, with the people holding... 8b9f5e88187ec0cd0e2f40ccc8f936da4e0c500e
4 Figure 12: Emma Gonzalez (centre right) at a M... None A photograph of Emma Gonzalez (centre right) a... 6721a0c5b3566e937bce11c79b3f06958a8bac7c
5 Figure 4 None A trade union protest, with the people holding... 604d1ca213c5850b1e6e683993e3ee766e9da20d

If we grabbed the image assets themselves into the database, or if we could resolve the URLs to the images, we could provide a gallery view over the images, or return the images themselves as part of our search results.
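Assuming the image URLs do resolve to publicly accessible assets, a simple gallery could be built directly from search results; `gallery_html` is a hypothetical helper along these lines, whose output could be rendered in a notebook via `IPython.display.HTML`:

```python
def gallery_html(rows, width=150):
    """Generate HTML for a simple image gallery from figure records
    that provide "url" and "caption" keys."""
    cells = []
    for row in rows:
        caption = row.get("caption") or ""
        cells.append(f'<figure style="display:inline-block;margin:4px">'
                     f'<img src="{row["url"]}" width="{width}"/>'
                     f'<figcaption>{caption}</figcaption></figure>')
    return "\n".join(cells)

# For example, over full-text search results:
# HTML(gallery_html(fts(db, "figures", "protest").to_dict(orient="records")))
```
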

Extracting MediaContent Items#

The MediaContent element [docs] provides a structured way of describing audio and video media assets as well as other asset types.

For example, a minimal description of an audio asset might be provided as follows:

<MediaContent src="https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_aug123_00.00-00.37.mp3" type="audio" x_manifest="a111_2019j_aug123_00.00-00.37_1_server_manifest.xml" x_filefolderhash="ae00f4bb" x_folderhash="ae00f4bb" x_contenthash="fbeac38a">
    <Caption>Audio 1 Robert Petway</Caption>
</MediaContent>

Video assets are also described using the <MediaContent> tag, the type= attribute being used to distinguish the media type. The <MediaContent> element includes a <Transcript> element that can be used to package a transcript alongside the referenced media item.

The following fragment also shows how an additional figure element can be included in the <MediaContent> element; this may be used to display “cover art” associated with the asset, for example.

<MediaContent src="https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_vid004-640x360.mp4" type="video" width="512" x_manifest="a111_2019j_vid004_1_server_manifest.xml" x_filefolderhash="ae00f4bb" x_folderhash="ae00f4bb" x_contenthash="bd2d0a18" x_subtitles="a111_2019j_vid004-640x360.srt">
    <Caption>Video 1</Caption>
    <Transcript>...</Transcript>
    <Figure>
        <Image src="https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_vid004_poster_image.jpg" x_folderhash="ae00f4bb" x_contenthash="02a6127c" x_imagesrc="a111_2019j_vid004_poster_image.jpg" x_imagewidth="512" x_imageheight="288"/>
    </Figure>
</MediaContent>
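Before writing the full extraction function, here is a minimal sketch of how that nested cover art reference might be pulled out of a media element (the element names follow the fragment above; `get_poster_image` is a hypothetical helper):

```python
from lxml import etree

def get_poster_image(media_element):
    """Return the src of a cover art image nested inside a
    MediaContent element, or None if there isn't one."""
    image = media_element.find("Figure/Image")
    return image.get("src") if image is not None else None

# A minimal example fragment
fragment = """<MediaContent src="video.mp4" type="video">
    <Caption>Video 1</Caption>
    <Figure><Image src="poster.jpg"/></Figure>
</MediaContent>"""
get_poster_image(etree.fromstring(fragment))  # 'poster.jpg'
```
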
As with figures, let’s now create a function that grabs data from a media element that we can use to support media discovery and audit:

def get_media_items(root):
    """Extract media content items from an OU-XML XML object."""
    media = root.xpath('//MediaContent')
    
    # We use the tree to find the path to the media element
    tree = etree.ElementTree(root)
    
    # List of media items in a unit
    media_items = []
    media_urls = []
    for m in media:
        m_url = m.get("src")
        
        # If the same media element appears multiple times in a unit
        # we only want to save it once.
        # TO DO: we should probably check multiple instances to ensure
        #        that the other elements (eg transcript) are consistent.
        if m_url in media_urls:
            # Optionally report on the duplicate items - useful for debugging
            #print(f"Duplicate item {m_url} — duplicate not checked or stored.")
            # Don't extract the duplicate
            continue
        else:
            media_urls.append(m_url)

        m_type = m.get("type")
        tmp_caption = m.xpath("Caption")
        m_caption = flatten(tmp_caption[0]).strip() if tmp_caption else None
        tmp_transcript = m.xpath("Transcript")
        m_transcript = flatten(tmp_transcript[0]).strip() if tmp_transcript else None
        tmp_desc = m.xpath("Description")
        m_desc = flatten(tmp_desc[0]).strip() if tmp_desc else None
        
        # Get the path to the media element
        m_xpath = context_path(tree.getpath(m))

        # Skip items that do not declare a media type
        if not m_type:
            continue
        media_items.append( (m_type, m_url, m_caption, m_transcript, m_desc, m_xpath) )

    return media_items
get_media_items(root)[:2]
[('audio',
  'https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_aug123_00.00-00.37.mp3',
  'Audio 1 Robert Petway',
  None,
  None,
  '/Activity/Question/MediaContent'),
 ('audio',
  'https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_aug113_00.00-00.35.mp3',
  'Audio 2 Muddy Waters',
  None,
  None,
  '/Activity/Question/MediaContent')]

Adding Media Item Metadata to the Database#

To support discovery, we can add media item captions and transcripts to a simple database table and then provide full-text search over them.

For now, we will omit scraping the media assets into the database. An example of how to add a media item to the database, as well as retrieve and render a media item, is provided in the Scraping OpenLearn section.

Let’s define a simple table to store this information, along with an associated full-text search table:

all_media_tbl = db["media"]
all_media_tbl.drop(ignore=True)
all_media_tbl.create({
    "type": str,
    "caption": str,
    "transcript": str,
    "description": str,
    "xpath": str,
    "code": int,
    "name": str,
    "url": str,
    "id": str,
    "_id": str,
}, pk=("id"))
# Note: we might also want to include a way of referencing
# any associated cover art, or at least flag the existence
# of an associated figure contained in the media element structure

# Enable full text search
# This creates an extra virtual table (media_fts) to support the full text search
db[f"{all_media_tbl.name}_fts"].drop(ignore=True)
db[all_media_tbl.name].enable_fts(["caption", "transcript", "description",
                                   "type", "id"], create_triggers=True)
<Table media (type, caption, transcript, description, xpath, code, name, url, id, _id)>
db.vacuum()

Now we can iterate over all the OU-XML documents, extract any media items contained therein, and add them to our database table:

for row in xml_db.query("""SELECT * FROM xml;"""):
    _root = etree.fromstring(row["xml"])
    media_items = get_media_items(_root)
    # From the list of media items,
    # create a list of dict items we can add to the database
    media_item_dicts = [{"type": m[0],  "url": m[1],
                         "caption": m[2], "transcript": m[3],
                         "description": m[4], "xpath": m[5],
                         "code": row["code"], "name": row["name"]} for m in media_items if m[0] and m[1] ]
    
    # Add a unique id for each record
    create_id(media_item_dicts, fields=["code", "name", "url"])
    # And a cross reference id for each record
    create_id(media_item_dicts, id_field="_id")
    
    # Add items to the database
    db[all_media_tbl.name].insert_all(media_item_dicts)

Let’s review what types of media assets are referenced across the OpenLearn units:

pd.read_sql("SELECT DISTINCT(type) FROM media", con=db.conn)
type
0 video
1 html5
2 audio
3 file
4 embed
5 moodlequestion
6 openmark
7 flash
8 oembed

How about a full text search over the media asset captions and transcripts?

fts(db, "media", "communication web")
caption transcript description type id
0 None SPEAKER\n Vinton Cerf is vi... None audio e0389de262a92186c872a5e7a2619c4809eb0b7b
1 None REBECCA FIELDING: The key skills and attribut... None video 46ba32530064312d4275661b793801387109fb84
2 None MartinWe’re live. OK guys, thanks for joining ... None embed c40e38ef5dd5afb8213642aa63cae5eb7d0c9315