Working With Figures and Media Items#
When working with images in course production, there are several approaches we may take:
discovering and reusing images and diagrams that already exist in order to illustrate a topic; example images may be selected by an author either with the intention that rights clearance (if required) is obtained for using that image, or that image may be used as a prompt to a picture researcher to find an image with a similar sense;
discovering images and diagrams based on a description of an image that is intended to serve a particular effect; recent advances in the automated generation of images from free text descriptions (e.g. StableDiffusion, DALL-E) may offer interesting possibilities here, either by generating images that can be used directly or by generating images that can be used as prompts by picture researchers;
using scientific or mathematical charts and diagrams generated by scientific software (example). In these cases, whilst the detail of the diagram may be important (data points are in the correct locations), the style of the diagram may be subject to artistic or design considerations;
generating diagrams using strict syntax diagram generation tools: for example, generating box and arrow diagrams or flow charts from simple text descriptions, as in the sketch below (see for example Generating Diagrams from Text Using BlockDiag).
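As a minimal illustrative sketch of the last approach, a box and arrow diagram can be generated from a blockdiag text description along the following lines:

blockdiag {
  // Boxes and arrows are declared as a simple edge list
  A -> B -> C;
  B -> D;
}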
The OU-XML document structure supports the description of figures, which is to say, images, as well as various media file types, such as audio and video asset types. In this section, we will see how we can support media discovery across the OpenLearn units by means of a simple full-text search over media captions and transcripts. The availability of figure descriptions also provides us with a corpus of text descriptions that it might be interesting to run through image-from-text generation tools to see what sorts of things they come up with, and how text prompts need to be phrased to generate something resembling the intended effect.
For now, the actual media assets (images, videos, audio files, etc.) will not be scraped or added to the database. This means that media assets can be searched for, but not directly rendered.
Preparing the Ground#
As ever, we need to set up a database connection.
from sqlite_utils import Database

# Open a connection to the database containing the scraped OU-XML
xml_dbname = "all_openlean_xml.db"
xml_db = Database(xml_dbname)

# Open a connection to the database the media metadata will be added to
media_dbname = "openlean_assets.db"
db = Database(media_dbname)
And get a sample XML file, selecting one that we know contains structurally marked up figure and media items:
from lxml import etree
import pandas as pd

# Grab an OU-XML file that is known to contain figure and media items
a111_xml_raw = pd.read_sql("SELECT xml FROM xml WHERE code='A111'",
                           con=xml_db.conn).loc[0, "xml"]

# Parse the XML into an xml object
root = etree.fromstring(a111_xml_raw)
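As a quick sanity check, we can count the figure and media content elements in the parsed document:

# Count the figure and media content elements in the sample unit
print(len(root.xpath("//Figure")), len(root.xpath("//MediaContent")))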
Let’s also import a couple of utility functions:
from xml_utils import flatten, unpack
Extracting Figure Items#
Image references are provided inside a <Figure> element [docs]. This structured element type also bundles a caption and a description:
<Figure>
<Image src="https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_1_openlearn_piano.tif" x_printonly="y" x_folderhash="5c95a855" x_contenthash="d74c57b3" x_imagesrc="a111_1_openlearn_piano.tif.jpg" x_imagewidth="512" x_imageheight="325"/>
<Caption><b>Figure</b> 10 Playing the blues</Caption>
<Description>This is a photograph showing a person's hands playing the piano.</Description>
</Figure>
An <Alternative> text tag may also be included.
To support discovery of figure items in a particular context, it will be useful to save the XML path to the figure elements. We can then use this as a basis for identifying whether a figure is used in an activity context, for example.
We can obtain the path to an element by generating a document tree and then locating a particular element within that structure.
# Generate the tree structure
tree = etree.ElementTree(root)
# Get example figure item
test_figure = root.xpath("//Figure")[0]
# Find the path to a figure item
tree.getpath(test_figure)
'/Item/Unit/Session[1]/Figure'
We can use a regular expression replacement to tidy the path up to remove the index values and give us a raw context path:
import re
def context_path(xpath):
    """Simplify an xpath to give a cleaner context path."""
    xpath = re.sub(r'\[\d+\]', '', xpath)
    xpath = re.sub(r'(/Item|/Unit|/Session|/Section|/SubSection|/InternalSection)', '', xpath)
    return xpath
For example:
context_path('/Item/Unit[7]/Session[3]/Activity[1]/Question/')
'/Activity/Question/'
Let’s now create a function that grabs data from a figure element that we can use to support figure discovery and audit:
def get_figure_items(root):
    """Extract figure items from an OU-XML XML object."""
    figures = root.xpath('//Figure')
    # We use the tree to find the path to each figure element
    tree = etree.ElementTree(root)
    # List of figure items in a unit
    figure_items = []
    figure_urls = []

    for f in figures:
        # Guard against a Figure element with no Image child
        f_image = f.find("Image")
        f_url = f_image.get("src") if f_image is not None else None
        # If there is no image reference, do not save this item
        # TO DO: should we warn if there is other figure data but no url?
        if not f_url:
            continue
        # If the same figure element appears multiple times in a unit
        # we only want to save it once.
        # TO DO: we should probably check multiple instances to ensure
        # that the other elements (eg descriptions) are consistent.
        if f_url in figure_urls:
            # Optionally report on the duplicate items - useful for debugging
            #print(f"Duplicate item {f_url} — duplicate not checked or stored.")
            # Don't extract the duplicate
            continue
        else:
            figure_urls.append(f_url)

        f_type = f_url.split(".")[-1]

        tmp_caption = f.xpath("Caption")
        f_caption = flatten(tmp_caption[0]).strip() if tmp_caption else None

        tmp_alt = f.xpath("Alternative")
        f_alt = flatten(tmp_alt[0]).strip() if tmp_alt else None

        tmp_desc = f.xpath("Description")
        f_desc = flatten(tmp_desc[0]).strip() if tmp_desc else None

        # Get the path to the figure element
        f_xpath = context_path(tree.getpath(f))

        # These attributes appear to be access control elements
        x_folderhash = f_image.get("x_folderhash")
        x_contenthash = f_image.get("x_contenthash")
        x_imagesrc = f_image.get("x_imagesrc")

        figure_items.append( (f_type, f_url, f_caption, f_alt, f_desc, f_xpath,
                              x_folderhash, x_contenthash, x_imagesrc) )

    return figure_items
get_figure_items(root)[:2]
[('tif',
'https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_1_f01.tif',
'Figure 1 B. B. King performing c.1968. Photo Michael Ochs Archives/Getty Images (via Britannica Image Quest)',
None,
'This is a photograph of B. B. King playing the guitar on stage.',
'/Figure',
'5c95a855',
'b434d4cc',
'a111_1_f01.tif.jpg'),
('tif',
'https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_1_openlearn_beatles.tif',
'Figure 2 The music of the Beatles',
None,
'This is an image showing a number of records by the Beatles.',
'/Figure',
'5c95a855',
'4300681e',
'a111_1_openlearn_beatles.tif.jpg')]
Adding Figure Items to the Database#
To support discovery, we can add figure item captions, descriptions and alternative text to a simple database table and then provide full-text search over them.
For now, we will omit scraping the figure assets into the database. An example of how to add a media item to the database, as well as retrieve and render a media item, is provided in the Scraping OpenLearn section.
Let’s define a simple table to store this information, along with an associated full-text search table:
all_figures_tbl = db["figures"]
all_figures_tbl.drop(ignore=True)

all_figures_tbl.create({
    "type": str,
    "caption": str,
    "alt": str,
    "description": str,
    "xpath": str,
    "x_folderhash": str,
    "x_contenthash": str,
    "x_imagesrc": str,
    "code": int,
    "name": str,
    "url": str,
    "id": str,
    "_id": str  # derived from code and name
}, pk=("id"))

# Enable full text search
# This creates an extra virtual table (figures_fts) to support the full text search
db[f"{all_figures_tbl.name}_fts"].drop(ignore=True)
db[all_figures_tbl.name].enable_fts(["caption", "alt", "description",
                                     "id"], create_triggers=True)
<Table figures (type, caption, alt, description, xpath, x_folderhash, x_contenthash, x_imagesrc, code, name, url, id, _id)>
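The create_triggers=True argument means that sqlite_utils sets up database triggers to keep the full-text search index in step with inserts, updates and deletes on the figures table.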
Now we can populate the table with our figure item metadata:
from xml_utils import create_id
for row in xml_db.query("""SELECT * FROM xml;"""):
    _root = etree.fromstring(row["xml"])
    figure_items = get_figure_items(_root)

    # From the list of figure items,
    # create a list of dict items we can add to the database
    figure_item_dicts = [{"type": f[0], "url": f[1],
                          "caption": f[2], "alt": f[3],
                          "description": f[4], "xpath": f[5],
                          "x_folderhash": f[6], "x_contenthash": f[7], "x_imagesrc": f[8],
                          "code": row["code"], "name": row["name"]} for f in figure_items if f[1]]

    # Add a unique id for each record
    create_id(figure_item_dicts, fields=["code", "name", "url"])
    # And a cross reference id for each record
    create_id(figure_item_dicts, id_field="_id")

    # Add items to the database
    db[all_figures_tbl.name].insert_all(figure_item_dicts)
db.vacuum()
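The create_id() helper from xml_utils is not reproduced here. Judging by the 40 character hex identifiers it produces, a minimal sketch of the sort of thing it might do (an assumption, not the actual implementation) is to hash the values of a selection of fields:

import hashlib

def create_id_sketch(items, fields=None, id_field="id"):
    """Add a deterministic SHA1 hash id to each dict in a list,
    built from the values of selected fields (all fields if none given).
    A hypothetical sketch of the imported create_id() helper."""
    for item in items:
        keys = fields if fields else [k for k in item if k != id_field]
        raw = "".join(str(item[k]) for k in keys)
        item[id_field] = hashlib.sha1(raw.encode("utf-8")).hexdigest()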
Let’s see what image types are referenced across the OpenLearn units:
pd.read_sql("SELECT DISTINCT(type) FROM figures", con=db.conn)
| | type |
|---|---|
| 0 | jpg |
| 1 | tif |
| 2 | png |
| 3 | eps |
| 4 | gif |
| 5 | mp4 |
What sort of contexts do the figures appear in, in terms of their path locations within the OU-XML documents?
pd.read_sql("SELECT DISTINCT(xpath) FROM figures", con=db.conn)
| | xpath |
|---|---|
| 0 | /MediaContent/Figure |
| 1 | /Figure |
| 2 | /Activity/Question/Figure |
| 3 | /Activity/Discussion/Figure |
| 4 | /Activity/Multipart/Part/Question/Figure |
| 5 | /Activity/Multipart/Figure |
| 6 | /Activity/Multipart/Part/Question/Table/tbody/... |
| 7 | /Activity/Question/Reading/Figure |
| 8 | /Activity/Answer/Figure |
| 9 | /Box/Figure |
| 10 | /SAQ/Answer/Figure |
| 11 | /Activity/Multipart/Part/Answer/Figure |
| 12 | /Activity/Multipart/Part/Question/MediaContent... |
| 13 | /Activity/Question/MediaContent/Figure |
| 14 | /FrontMatter/Introduction/Figure |
| 15 | /Activity/Discussion/Extract/Figure |
| 16 | /SAQ/Question/Figure |
| 17 | /NumberedList/ListItem/Figure |
| 18 | /CaseStudy/Figure |
| 19 | /Activity/Multipart/Part/Question/NumberedList... |
| 20 | /Introduction/MediaContent/Figure |
| 21 | /Activity/Multipart/Part/Discussion/Figure |
| 22 | /FrontMatter/Introduction/MediaContent/Figure |
| 23 | /StudyNote/Figure |
| 24 | /Extract/Figure |
| 25 | /Activity/Question/UnNumberedList/ListItem/Figure |
| 26 | /Activity/Multipart/Part/Question/Reading/Figure |
| 27 | /Activity/Multipart/Part/Question/Reading/Read... |
| 28 | /Introduction/Figure |
| 29 | /Activity/Question/CaseStudy/Figure |
| 30 | /SAQ/Discussion/Figure |
| 31 | /Activity/Question/Extract/Figure |
| 32 | /Box/MediaContent/Figure |
| 33 | /Table/tbody/tr/td/Figure |
| 34 | /Example/Figure |
| 35 | /Example/Answer/Figure |
| 36 | /ITQ/Answer/Figure |
| 37 | /Activity/Answer/Box/MediaContent/Figure |
| 38 | /Activity/Answer/Box/Figure |
| 39 | /Activity/MediaContent/Figure |
| 40 | /Activity/Multipart/Part/MediaContent/Figure |
| 41 | /Activity/Multipart/Part/Question/Box/Figure |
| 42 | /Activity/Question/Box/Figure |
| 43 | /Activity/Question/Box/MediaContent/Figure |
| 44 | /Activity/Discussion/MediaContent/Figure |
| 45 | /Activity/Discussion/NumberedList/ListItem/Figure |
| 46 | /ITQ/Question/Figure |
| 47 | /BulletedList/ListItem/Figure |
| 48 | /SubSubSection/Figure |
| 49 | /SubSubSection/Activity/Question/Figure |
| 50 | /SubSubSection/Activity/Discussion/Figure |
| 51 | /FrontMatter/Introduction/Activity/Question/Me... |
| 52 | /Activity/Multipart/Part/Answer/NumberedList/L... |
| 53 | /Activity/Question/Quote/Figure |
| 54 | /Activity/Multipart/MultiColumnText/MultiColum... |
| 55 | /SAQ/Question/MediaContent/Figure |
| 56 | /MediaContent/Description/Figure |
| 57 | /Box/BulletedList/ListItem/Figure |
Let’s also try a full-text search over all the images referenced from OU-XML documents:
from xml_utils import fts
fts(db, "figures", "protest")
| | caption | alt | description | id |
|---|---|---|---|---|
| 0 | Figure 5 Policing and the right to protest | None | A line of uniformed police officers standing o... | 05bce9978659e418b36040e8590996e2f116edaf |
| 1 | Figure 1 Climate change protesters near the Br... | None | A young person holding up a placard during a p... | 28c921f9786fd2f6e625eb1369ef5462b7805738 |
| 2 | Figure 3 Left, a sculpture of Baartman at the ... | On the left is a photograph of a sculpture of ... | On the left is a photograph of a sculpture of ... | 68045d707309f11a19ab4f45286d1c5daaae783b |
| 3 | Figure 4 | None | A trade union protest, with the people holding... | 8b9f5e88187ec0cd0e2f40ccc8f936da4e0c500e |
| 4 | Figure 12: Emma Gonzalez (centre right) at a M... | None | A photograph of Emma Gonzalez (centre right) a... | 6721a0c5b3566e937bce11c79b3f06958a8bac7c |
| 5 | Figure 4 | None | A trade union protest, with the people holding... | 604d1ca213c5850b1e6e683993e3ee766e9da20d |
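The fts() helper imported from xml_utils is not shown here. A minimal sketch of such a helper (an assumption based on the call signature and the result columns above, not the actual implementation) might run a match query against the associated FTS table:

def fts_sketch(db, tbl, query):
    """Run a full-text search over an FTS-enabled table and
    return the results as a pandas DataFrame.
    A hypothetical sketch of the imported fts() helper."""
    return pd.read_sql(f"SELECT * FROM {tbl}_fts WHERE {tbl}_fts MATCH :q",
                       con=db.conn, params={"q": query})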
If we grabbed the image assets themselves into the database, or if we could resolve the URLs to the images, we could provide a gallery view over the images, or return visual images as part of our search results.
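For example, in a notebook context, a minimal gallery sketch over search results might look something like the following; this assumes the stored figure URLs resolve to publicly accessible images, which is not guaranteed for all OpenLearn assets:

from IPython.display import HTML

def figure_gallery(db, query, limit=6):
    """Render figure search results as a simple HTML thumbnail gallery.
    Assumes the stored figure URLs resolve to viewable images."""
    cells = []
    # sqlite_utils provides a .search() method over FTS-enabled tables
    for hit in db["figures"].search(query):
        cells.append(f'<figure style="width:220px"><img src="{hit["url"]}" width="200"/>'
                     f'<figcaption>{hit["caption"]}</figcaption></figure>')
        if len(cells) >= limit:
            break
    return HTML('<div style="display:flex;flex-wrap:wrap">' + "".join(cells) + "</div>")

figure_gallery(db, "protest")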
Extracting MediaContent Items#
The MediaContent element [docs] provides a structured way of describing audio and video media assets, as well as other asset types.
For example, a minimal description of an audio asset might be provided as follows:
<MediaContent src="https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_aug123_00.00-00.37.mp3" type="audio" x_manifest="a111_2019j_aug123_00.00-00.37_1_server_manifest.xml" x_filefolderhash="ae00f4bb" x_folderhash="ae00f4bb" x_contenthash="fbeac38a">
<Caption>Audio 1 Robert Petway</Caption>
</MediaContent>
Video assets are also described using the <MediaContent> tag, the type= attribute being used to distinguish the media type. The <MediaContent> element includes a <Transcript> element that can be used to package a transcript alongside the referenced media item.
The following fragment also shows how an additional figure element can be included in the <MediaContent> element; this may be used to display “cover art” associated with the asset, for example.
<MediaContent src="https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_vid004-640x360.mp4" type="video" width="512" x_manifest="a111_2019j_vid004_1_server_manifest.xml" x_filefolderhash="ae00f4bb" x_folderhash="ae00f4bb" x_contenthash="bd2d0a18" x_subtitles="a111_2019j_vid004-640x360.srt">
<Caption>Video 1</Caption>
<Transcript>...</Transcript>
<Figure>
<Image src="https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_vid004_poster_image.jpg" x_folderhash="ae00f4bb" x_contenthash="02a6127c" x_imagesrc="a111_2019j_vid004_poster_image.jpg" x_imagewidth="512" x_imageheight="288"/>
</Figure>
</MediaContent>
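As with the figure items, let’s create a function that grabs the data we need from each media element: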
def get_media_items(root):
    """Extract media content items from an OU-XML XML object."""
    media = root.xpath('//MediaContent')
    # We use the tree to find the path to each media element
    tree = etree.ElementTree(root)
    # List of media items in a unit
    media_items = []
    media_urls = []

    for m in media:
        m_url = m.get("src")
        # If the same media element appears multiple times in a unit
        # we only want to save it once.
        # TO DO: we should probably check multiple instances to ensure
        # that the other elements (eg transcript) are consistent.
        if m_url in media_urls:
            # Optionally report on the duplicate items - useful for debugging
            #print(f"Duplicate item {m_url} — duplicate not checked or stored.")
            # Don't extract the duplicate
            continue
        else:
            media_urls.append(m_url)

        # If there is no media type, do not save this item
        m_type = m.get("type")
        if not m_type:
            continue

        tmp_caption = m.xpath("Caption")
        m_caption = flatten(tmp_caption[0]).strip() if tmp_caption else None

        tmp_transcript = m.xpath("Transcript")
        m_transcript = flatten(tmp_transcript[0]).strip() if tmp_transcript else None

        tmp_desc = m.xpath("Description")
        m_desc = flatten(tmp_desc[0]).strip() if tmp_desc else None

        # Get the path to the media element
        m_xpath = context_path(tree.getpath(m))

        media_items.append( (m_type, m_url, m_caption, m_transcript, m_desc, m_xpath) )

    return media_items
get_media_items(root)[:2]
[('audio',
'https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_aug123_00.00-00.37.mp3',
'Audio 1 Robert Petway',
None,
None,
'/Activity/Question/MediaContent'),
('audio',
'https://www.open.edu/openlearn/ocw/pluginfile.php/1407405/mod_oucontent/oucontent/72913/a111_2019j_aug113_00.00-00.35.mp3',
'Audio 2 Muddy Waters',
None,
None,
'/Activity/Question/MediaContent')]
Adding Media Item Metadata to the Database#
To support discovery, we can add media item captions and transcripts to a simple database table and then provide full-text search over them.
For now, we will omit scraping the media assets into the database. An example of how to add a media item to the database, as well as retrieve and render a media item, is provided in the Scraping OpenLearn section.
Let’s define a simple table to store this information, along with an associated full-text search table:
all_media_tbl = db["media"]
all_media_tbl.drop(ignore=True)

all_media_tbl.create({
    "type": str,
    "caption": str,
    "transcript": str,
    "description": str,
    "xpath": str,
    "code": int,
    "name": str,
    "url": str,
    "id": str,
    "_id": str,
}, pk=("id"))

# Note: we might also want to include a way of referencing
# any associated cover art, or at least flag the existence
# of an associated figure contained in the media element structure

# Enable full text search
# This creates an extra virtual table (media_fts) to support the full text search
db[f"{all_media_tbl.name}_fts"].drop(ignore=True)
db[all_media_tbl.name].enable_fts(["caption", "transcript", "description",
                                   "type", "id"], create_triggers=True)
<Table media (type, caption, transcript, description, xpath, code, name, url, id, _id)>
db.vacuum()
Now we can iterate over all the OU-XML documents, extract any media items contained therein, and add them to our database table:
for row in xml_db.query("""SELECT * FROM xml;"""):
    _root = etree.fromstring(row["xml"])
    media_items = get_media_items(_root)

    # From the list of media items,
    # create a list of dict items we can add to the database
    media_item_dicts = [{"type": m[0], "url": m[1],
                         "caption": m[2], "transcript": m[3],
                         "description": m[4], "xpath": m[5],
                         "code": row["code"], "name": row["name"]} for m in media_items if m[0] and m[1]]

    # Add a unique id for each record
    create_id(media_item_dicts, fields=["code", "name", "url"])
    # And a cross reference id for each record
    create_id(media_item_dicts, id_field="_id")

    # Add items to the database
    db[all_media_tbl.name].insert_all(media_item_dicts)
Let’s review what types of media asset are referenced across the OpenLearn units:
pd.read_sql("SELECT DISTINCT(type) FROM media", con=db.conn)
| | type |
|---|---|
| 0 | video |
| 1 | html5 |
| 2 | audio |
| 3 | file |
| 4 | embed |
| 5 | moodlequestion |
| 6 | openmark |
| 7 | flash |
| 8 | oembed |
How about a full text search over the media asset captions and transcripts?
fts(db, "media", "communication web")
| | caption | transcript | description | type | id |
|---|---|---|---|---|---|
| 0 | None | SPEAKER Vinton Cerf is vi... | None | audio | e0389de262a92186c872a5e7a2619c4809eb0b7b |
| 1 | None | REBECCA FIELDING: The key skills and attribut... | None | video | 46ba32530064312d4275661b793801387109fb84 |
| 2 | None | MartinWe’re live. OK guys, thanks for joining ... | None | embed | c40e38ef5dd5afb8213642aa63cae5eb7d0c9315 |
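As a footnote, sqlite_utils also exposes a .search() method on FTS-enabled tables, so we could query the media table directly rather than going via the fts() helper (a minimal sketch):

from itertools import islice

# Iterate over the first few full-text search hits directly
for hit in islice(db["media"].search("communication web"), 3):
    print(hit["type"], hit["id"])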