Working With Equations#

One of the asset types represented within OU-XML is the <Equation> type. This element can be used to represent mathematical and chemical equations. Internal OU readers can refer to the corresponding OU-XML docs here.

The Equation element tag accepts the following child tags: <Alternative>, <Caption>, <Description>, <Image>, <Label>, <MathML>, <SourceReference>, and <TeX>. There is also an <InlineEquation> tag that accepts <Image>, <Alternative>, <MathML> and <TeX> tags. The <Image>, <MathML> and <TeX> tags define the “expression” of the equation.
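
As a rough illustration of that structure, the sketch below parses a small, made-up <Equation> fragment and lists its child tags. The fragment is invented for this example rather than taken from a real OU-XML document, and the snippet uses the standard library's xml.etree.ElementTree (not lxml) simply so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical <Equation> fragment, invented for illustration only
fragment = (
    '<Equation>'
    '<MathML><math xmlns="http://www.w3.org/1998/Math/MathML">'
    '<mi>x</mi><mo>=</mo><mn>1</mn></math></MathML>'
    '<Label>(1.1)</Label>'
    '</Equation>'
)

equation = ET.fromstring(fragment)

# The direct children tell us how the equation is expressed and labelled
child_tags = [child.tag for child in equation]
print(child_tags)

# Pick out the tags that define the "expression" of the equation
expression_tags = [t for t in child_tags if t in ("Image", "MathML", "TeX")]
print(expression_tags)
```

Checking which expression tag is present is one simple way of deciding whether an equation item can be handled as MathML or has to be treated as an image.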

Equation items may be described using either a MathML expression or an Image. The MathML elements are rendered in the VLE using MathJax and via LaTeX for PDF print publications. Browsers such as Firefox are also capable of rendering MathML directly.

One of the problems with MathML is that it is not the sort of markup you would write by hand, which means that equations written in it may be difficult to discover via simple search. (A simpler way of writing equations is to use LaTeX, for example.)
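
To see why simple search struggles, consider the following minimal sketch. The MathML string for water is a made-up example, and the standard library's xml.etree.ElementTree is used only to keep the snippet dependency free. A plain substring search for "H2O" fails on the raw markup because each character sits in its own element, but succeeds once the text nodes are joined together.

```python
import xml.etree.ElementTree as ET

# Hypothetical MathML for H2O, invented for illustration
mathml = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    '<msub><mi>H</mi><mn>2</mn></msub><mi>O</mi></math>'
)

# Searching the raw markup fails: the characters are split across tags
found_in_markup = "H2O" in mathml

# Joining the text nodes gives a flattened string we can search
text_only = "".join(ET.fromstring(mathml).itertext())
found_in_text = "H2O" in text_only

print(found_in_markup, text_only, found_in_text)
```

The flattened string loses the subscript structure, so it is only a crude search aid rather than a faithful representation of the equation.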

One way of supporting discovery might be to index the text in the subsection that an equation is contained in. We can find a local context from the path to each equation element, so for now, let’s just grab the path; we can then think about how we can generate a search context around that path in another section.

For now, we only handle equations that are described using MathML; equations that are rendered as images are ignored.

Preparing the Ground#

As ever, we need to set up a database connection:

from sqlite_utils import Database

# Open database connection
xml_dbname = "all_openlean_xml.db"
xml_db = Database(xml_dbname)

eqns_dbname = "openlean_assets.db"
db = Database(eqns_dbname)

And get a sample XML file, selecting one that we know contains structurally marked up equation items:

import pandas as pd

pd.read_sql("SELECT * FROM xml WHERE xml LIKE '%<Equation>%'",
                           con=xml_db.conn)
   | code  | name | xml | id
0  | T212  | An introduction to electronics | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | e70841f12a908401ab9e6a69923bdb684928c888
1  |       | An introduction to geology | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 4c8058285a4de53528f646ee2742dc8394fd4e38
2  | S276  | An introduction to minerals and rocks under th... | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 6bff78840be5165329dda278418bbbd54c909047
3  | T193  | Assessing risk in engineering, work and life | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 11e5486d113eebd6c01126c9c65b91591c211b9b
4  | SK299 | Blood and the respiratory system | b'<?xml version="1.0" encoding="UTF-8"?>\n<?dc... | 904a100e4d41cf1a696b547eec1b2f625fc5bd78
5  |       | Discovering chemistry | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 884164a46f4066c6b26894c812484c74ab2e8531
6  |       | Mathematics for science and technology | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 84fea7b4cf86cdd4e31e3272572372972fb81fe2
7  | s315  | Metals in medicine | b'<?xml version="1.0" encoding="UTF-8"?>\n<Ite... | c2c90459369d82e28e768dfd9072047eab95be4d
8  | SM123 | Particle physics | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 4095122554b7cc3cff824f31c3cf531087e63b2c
9  | S112  | Scales in space and time | b'<?xml version="1.0" encoding="UTF-8"?>\n<?sc... | 75a013ae7e703481e8f0e05bde38d6d71fa732b6
10 |       | Succeed with maths: part 2 | b'<?xml version="1.0" encoding="UTF-8"?>\n<!--... | 0ab4176f55d221bdc8bd68fbcd2e9c575d2a5e4e
11 |       | Taking your first steps into higher education | b'<?xml version="1.0" encoding="UTF-8"?>\n<?sc... | 0c0fb0a9800e8b37c269fd12cb8b600a44c1d189
12 |       | Teaching mathematics | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 7995427cc08b1874750af6c79cb45f7556addf4b
13 | T271  | Toys and engineering materials | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | c86d9d40259a5889e62488029ce5454a8ddb6a00
14 | S215  | What chemical compounds might be present in dr... | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 2adf47ad5e2f97c2983da6e3f08a0e7066e67e8a
15 | S111  | What is a metal? | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | 0961a26cafb1aa6132dfc455251d123bf390bf70
16 | DB123 | You and your money | b'<?xml version="1.0" encoding="utf-8"?>\n<?sc... | b2235d2d9c032379eb891b42b46c41ed96a495c0

from lxml import etree
import pandas as pd

# Grab an OU-XML file that is known to contain equation items
# Maybe also: Teaching mathematics
equation_xml_raw = pd.read_sql("SELECT xml FROM xml WHERE name='Discovering chemistry'",
                           con=xml_db.conn).loc[0, "xml"]

# Parse the XML into an xml object
root = etree.fromstring(equation_xml_raw)

Grabbing the Path#

By walking up the path that leads to an equation element, we can identify its context within an OU-XML document, and from that we should be able to generate a “local search context” we can index and search within in order to support discovery of an equation from terms included in its surrounding text, for example.

# We need to grab a model of the document tree
tree = etree.ElementTree(root)

# And then grab the paths
for e in root.xpath('//Equation'):
    display(tree.getpath(e))
'/Item/Unit[2]/Session[1]/Section[2]/Equation'
'/Item/Unit[3]/Session[3]/Section[3]/ITQ[1]/Answer/Equation'
'/Item/Unit[6]/Session[1]/Section/Equation'
'/Item/Unit[6]/Session[2]/Equation[1]'
'/Item/Unit[6]/Session[2]/Equation[2]'
'/Item/Unit[6]/Session[2]/Section[1]/Equation[1]'
'/Item/Unit[6]/Session[2]/Section[1]/Equation[2]'
'/Item/Unit[6]/Session[2]/Section[1]/Equation[3]'
'/Item/Unit[6]/Session[2]/Section[1]/Equation[4]'
'/Item/Unit[6]/Session[2]/Section[2]/ITQ[1]/Answer/Equation'
'/Item/Unit[6]/Session[2]/Section[2]/ITQ[2]/Question/Equation'
'/Item/Unit[6]/Session[5]/Section/Equation'
'/Item/Unit[6]/Session[6]/Equation'
'/Item/Unit[9]/Session[2]/Equation'
'/Item/Unit[9]/Session[2]/Section[2]/Equation'
'/Item/Unit[9]/Session[3]/ITQ[2]/Question/Equation'
'/Item/Unit[9]/Session[3]/ITQ[2]/Answer/Equation[1]'
'/Item/Unit[9]/Session[3]/ITQ[2]/Answer/Equation[2]'

Looking at the paths, we might then identify a context as a particular block level element further up the tree. For example, we might say the context is the first element reached as we walk back up the tree from the set Section, SubSection, or Session. For an even more tightly defined search context, we might add activity element types to that list (Activity, ITQ, SAQ, or Exercise).

import re

def navigational_context(path, elements=None):
    """Find meaningful exact local context path."""
    elements = ["Section", "SubSection", "Session",
                "Activity", "ITQ", "SAQ", "Exercise"
               ] if elements is None else elements
    # Iterate the path elements in reverse order
    path_elements = path.split("/")
    path_len = len(path_elements)
    for i, subpath in enumerate(path_elements[::-1]):
        # Clean the numeric index from the path element
        if re.sub(r'\[\d+\]', '', subpath) in elements:
            return "/".join(path_elements[:path_len-i])
    return path

We can then find the exact navigational path to the first local context element we meet at a desired level of granularity.

For example:

example_path = '/Item/Unit[9]/Session[3]/ITQ[2]/Question/Equation'

example_context = navigational_context(example_path)
example_context
'/Item/Unit[9]/Session[3]/ITQ[2]'
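
The granularity can be changed by passing a different list of element names. The sketch below repeats the function so that it runs on its own, then compares the default context with a coarser one that only accepts structural elements. The path '/Item/Unit[1]/Equation' is a made-up example used to show the fallback behaviour when no listed element appears in the path.

```python
import re

def navigational_context(path, elements=None):
    """Find meaningful exact local context path."""
    elements = ["Section", "SubSection", "Session",
                "Activity", "ITQ", "SAQ", "Exercise"
               ] if elements is None else elements
    # Iterate the path elements in reverse order
    path_elements = path.split("/")
    path_len = len(path_elements)
    for i, subpath in enumerate(path_elements[::-1]):
        # Clean the numeric index from the path element
        if re.sub(r'\[\d+\]', '', subpath) in elements:
            return "/".join(path_elements[:path_len-i])
    return path

example_path = '/Item/Unit[9]/Session[3]/ITQ[2]/Question/Equation'

# Default list: activity elements such as ITQ count as a context
fine_context = navigational_context(example_path)

# Coarser list: only structural elements count as a context
coarse_context = navigational_context(
    example_path, elements=["Section", "SubSection", "Session"])

# Hypothetical path with no listed element: the path is returned unchanged
no_context = navigational_context('/Item/Unit[1]/Equation', elements=["Section"])

print(fine_context, coarse_context, no_context, sep="\n")
```

A coarser context gives more surrounding text to index, at the cost of associating an equation with terms that may be further away from it.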

We could then index the text of that context block to support discovery of the equation:

from xml_utils import flatten, unpack

# Example text for indexing to support equation discovery
# We are rendering the flattened equation here, so it may not make much sense!
flatten( root.xpath(example_context)[0] ), \
    flatten(root.xpath(example_context)[0].find("*//Equation"))
('The combination of sulfur dioxide with oxygen, and the decomposition of steam into hydrogen and oxygen are both reactions of great potential value. These reactions and their equilibrium constants at 427oC (700K) are as follows.2SO2(g)+O2(g)=2SO3\u2062 (g)K=106\u2062 mol−1\u2062\u2062 litre 2H2O(g)=2H2(g)+O2(g)K=10−33\u2062 mol\u2062 \u2062 litre−1Write expressions for the equilibrium constants of the two reactions.When the two reactions are attempted at 700K, neither seems to occur. Which of the two might be ‘persuaded’ to proceed at this temperature, and what form might your persuasion take?The equilibrium constant of the first reaction, K1, is given by K1=[SO3(g)]2[SO2(g)]2[O2(g)]That of the second,K2=[H2(g)]2[O2(g)][H2O(g)]2The data show that K2 is tiny: at equilibrium, the concentrations of the hydrogen and oxygen in the numerator (the top line of the fraction) are minute in comparison with the concentration of steam in the denominator (the bottom line of the fraction). So in a closed system at 700 K, significant amounts of hydrogen and oxygen will never be formed from steam. By contrast, K1 is large, so the equilibrium position at 700 K lies well over to the right of the equation, and conversion of sulfur dioxide and oxygen to sulfur trioxide is favourable. The fact that the reaction does not occur must be due to a slow rate of reaction. We may therefore be able to obtain sulfur trioxide in this way if we can find a suitable catalyst to speed up the reaction. A suitable catalyst is vanadium pentoxide, V2O5, and at 700 K, this reaction is the key step in the manufacture of sulfuric acid from sulfur, oxygen and water.',
 '2SO2(g)+O2(g)=2SO3\u2062 (g)K=106\u2062 mol−1\u2062\u2062 litre 2H2O(g)=2H2(g)+O2(g)K=10−33\u2062 mol\u2062 \u2062 litre−1')

Extracting Equation Items#

We can trivially extract equation items from a single parsed OU-XML document:

from xml_utils import unpack

def get_equation_items(root, typ='//Equation'):
    """Extract equations from an OU-XML XML object."""
    tree = etree.ElementTree(root)
    
    # Return the mathml and the path
    return [(tree.getpath(eq), unpack(eq)) for eq in root.xpath(typ)]

What do we get?

get_equation_items(root)[:3]
[('/Item/Unit[2]/Session[1]/Section[2]/Equation',
  b'<Equation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><MathML><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mmultiscripts><mrow><mi>X</mi></mrow><mprescripts/><mrow><mi>Z</mi></mrow><mrow><mi>A</mi></mrow></mmultiscripts></mrow></math></MathML></Equation>'),
 ('/Item/Unit[3]/Session[3]/Section[3]/ITQ[1]/Answer/Equation',
  b'<Equation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Image>K<sup>+</sup>, Ca<sup>2+</sup>, Al<sup>3+</sup>, S<sup>2-</sup>, F<sup>-</sup> and Br<sup>-</sup></Image></Equation>'),
 ('/Item/Unit[6]/Session[1]/Section/Equation',
  b'<Equation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><MathML><math xmlns="http://www.w3.org/1998/Math/MathML"><mstyle mathvariant="normal"><mrow><mstyle mathvariant="normal"><mrow><mi>C</mi><mi>u</mi><mo>(</mo><mi>s</mi><mo>)</mo><mo>+</mo><msub><mrow><mn>4</mn><mi>H</mi><mi>N</mi><mi>O</mi></mrow><mrow><mn>3</mn></mrow></msub><mo>(</mo><mi>a</mi><mi>q</mi><mo>)</mo></mrow></mstyle><mo>=</mo><msub><mrow><msub><mrow><mstyle mathvariant="normal"><mrow><mi>C</mi><mi>u</mi></mrow></mstyle><mo>(</mo><mstyle mathvariant="normal"><mrow><mi>N</mi><mi>O</mi></mrow></mstyle></mrow><mrow><mn>3</mn></mrow></msub><mo>)</mo></mrow><mrow><mn>2</mn></mrow></msub><mstyle mathvariant="normal"><mrow><mo>(</mo><mi>a</mi><mi>q</mi><mo>)</mo></mrow></mstyle><mo>+</mo><msub><mrow><mstyle mathvariant="normal"><mrow><mn>2</mn><mi>N</mi><mi>O</mi></mrow></mstyle></mrow><mrow><mn>2</mn></mrow></msub><mstyle mathvariant="normal"><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mstyle><mo>+</mo><msub><mrow><mn>2</mn><mi>H</mi></mrow><mrow><mn>2</mn></mrow></msub><mi>O</mi><mo>(</mo><mi>l</mi><mo>)</mo></mrow></mstyle></math></MathML><Label>(5.1)</Label></Equation>')]

Most of these equations are represented using MathML, although the second item is given as an <Image> element instead.

Let’s just get the <math> part from one of the equations:

import re

def clean_equation(mathml):
    """Get cleaned equation mathml."""
    mathml = mathml.decode() if isinstance(mathml, bytes) else mathml 
    
    # Remove newlines so that multi-line markup can be matched as a single line
    mathml = mathml.replace("\n", "")
    
    # Extract the <math>...</math> component
    eqs = re.findall(r'.*<MathML>(.*)</MathML>.*', mathml)
    
    # We might have an image rather than MathML...
    eq = eqs[0] if eqs else None
    return eq

# Get an example equation element
# We want the MathML (second item / index [1] in the returned 2-tuple)
eq = get_equation_items(root)[2][1].decode()

eq = clean_equation(eq)
eq
'<math xmlns="http://www.w3.org/1998/Math/MathML"><mstyle mathvariant="normal"><mrow><mstyle mathvariant="normal"><mrow><mi>C</mi><mi>u</mi><mo>(</mo><mi>s</mi><mo>)</mo><mo>+</mo><msub><mrow><mn>4</mn><mi>H</mi><mi>N</mi><mi>O</mi></mrow><mrow><mn>3</mn></mrow></msub><mo>(</mo><mi>a</mi><mi>q</mi><mo>)</mo></mrow></mstyle><mo>=</mo><msub><mrow><msub><mrow><mstyle mathvariant="normal"><mrow><mi>C</mi><mi>u</mi></mrow></mstyle><mo>(</mo><mstyle mathvariant="normal"><mrow><mi>N</mi><mi>O</mi></mrow></mstyle></mrow><mrow><mn>3</mn></mrow></msub><mo>)</mo></mrow><mrow><mn>2</mn></mrow></msub><mstyle mathvariant="normal"><mrow><mo>(</mo><mi>a</mi><mi>q</mi><mo>)</mo></mrow></mstyle><mo>+</mo><msub><mrow><mstyle mathvariant="normal"><mrow><mn>2</mn><mi>N</mi><mi>O</mi></mrow></mstyle></mrow><mrow><mn>2</mn></mrow></msub><mstyle mathvariant="normal"><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mstyle><mo>+</mo><msub><mrow><mn>2</mn><mi>H</mi></mrow><mrow><mn>2</mn></mrow></msub><mi>O</mi><mo>(</mo><mi>l</mi><mo>)</mo></mrow></mstyle></math>'
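
As a quick check of the image case, the following sketch repeats clean_equation so that it runs on its own and applies it to two short, made-up equation items: one containing multi-line MathML and one containing only an image. Both byte strings are invented for illustration.

```python
import re

def clean_equation(mathml):
    """Get cleaned equation mathml."""
    mathml = mathml.decode() if isinstance(mathml, bytes) else mathml
    # Remove newlines so multi-line markup matches as a single line
    mathml = mathml.replace("\n", "")
    # Extract the <math>...</math> component
    eqs = re.findall(r'.*<MathML>(.*)</MathML>.*', mathml)
    # We might have an image rather than MathML...
    return eqs[0] if eqs else None

# Hypothetical equation items, invented for illustration
mathml_item = (b'<Equation><MathML>\n<math><mi>x</mi></math>\n</MathML>'
               b'<Label>(1)</Label></Equation>')
image_item = b'<Equation><Image>K<sup>+</sup></Image></Equation>'

mathml_result = clean_equation(mathml_item)
image_result = clean_equation(image_item)

print(mathml_result)
print(image_result)
```

The None returned for the image-only item gives us a simple way to skip such equations later on. Note that the pattern is greedy, so an item containing more than one <MathML> block would be captured as a single span.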

In Firefox at least, we can render the <math> MathML markup text directly:

from IPython.display import HTML

# This works in Firefox at least
HTML(eq)
$$\mathrm{Cu(s) + 4HNO_3(aq) = Cu(NO_3)_2(aq) + 2NO_2(g) + 2H_2O(l)}$$

To explore:

Adding Equations to the Database#

We can create a simple database table to index the equations and either add a “context text column” to that table, or reference it in a separate table. For now, let’s just munge it all together.

all_eqn_tbl = db["equations"]
all_eqn_tbl.drop(ignore=True)

all_eqn_tbl.create({
    #"Alternative": str,
    #"Description": str,
    #"Label": str,
    #"SourceReference": str,
    #"Image": str,
    #"MathML":str,
    #"TeX": str,
    "equation": str, # This is the raw XML for the object
    "xpath": str,
    "typ": str,
    "search_context_path": str,
    "_id": str
}, pk=("_id", "xpath"))
# Note that in this case the _id is not unique
# because the same _id applies to every equation in a document
# The _id is a reference for joining tables only

all_eqn_context = db["equations_context"]
all_eqn_context.drop(ignore=True)
all_eqn_context.create({
    "search_context": str,
    "search_context_path": str,
    "_id": str
}, pk=("_id", "search_context_path"))

# Enable full text search
# This creates an extra virtual table (equations_context_fts) to support the full text search
db[f"{all_eqn_context.name}_fts"].drop(ignore=True)
db[all_eqn_context.name].enable_fts(["search_context", "search_context_path", "_id"], create_triggers=True)
<Table equations_context (search_context, search_context_path, _id)>

We can now add our equations, and their context, to the database.

from xml_utils import create_id

for row in xml_db.query("""SELECT * FROM xml;"""):
    _root = etree.fromstring(row["xml"])
    
    # Get the tree structure
    tree = etree.ElementTree(_root)

    eq_items = [ ("Equation", eq) for eq in get_equation_items(_root, "//Equation")]
    eq_items.extend([ ("InlineEquation", eq) for eq in get_equation_items(_root, "//InlineEquation")])
    
    _id = create_id( (row["code"], row["name"]) )
    
    # From the list of equation items,
    # create a list of dict items we can add to the database
    eq_item_dicts = []
    eq_context_dicts = []
    _unique_contexts = []
    for (typ, eq) in eq_items:
        xpath, raw_xml = eq
        mathml = clean_equation(raw_xml)
        # Skip equations that have no MathML (for example, image-only equations)
        if mathml is None:
            continue

        search_context_path = navigational_context(xpath)
        eq_item_dicts.append({"equation": mathml,
                              "xpath": xpath,
                              "typ": typ,
                              "search_context_path": search_context_path,
                              "_id": _id})

        # Only record each search context once per document
        if search_context_path not in _unique_contexts:
            _unique_contexts.append(search_context_path)
            # It might be better to flatten than unpack;
            # for now, unpack lets us render the XML
            search_context = unpack(_root.xpath(search_context_path)[0]).decode()
            eq_context_dicts.append({"search_context_path": search_context_path,
                                     "_id": _id,
                                     "search_context": search_context})

    # Add items to the database
    db[all_eqn_tbl.name].insert_all(eq_item_dicts)
    db[all_eqn_context.name].insert_all(eq_context_dicts)

We can now search for equations by context:

from xml_utils import fts

# Sample query
q = 'steam hydrogen oxygen'

example_eq_search = fts(db, "equations_context", q)
example_eq_search
search_context search_context_path _id
0 <ITQ xmlns:xsi="http://www.w3.org/2001/XMLSche... /Item/Unit[9]/Session[3]/ITQ[2] 884164a46f4066c6b26894c812484c74ab2e8531
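
The fts helper comes from the local xml_utils module, so its internals are not shown here. As a rough sketch of the kind of query that sits behind a full text search like this, the example below uses the standard library's sqlite3 module with an FTS5 virtual table. The table name context_fts and the two rows are made up for illustration, and the sketch assumes the underlying SQLite build includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A small full text search table (hypothetical name and contents)
conn.execute(
    "CREATE VIRTUAL TABLE context_fts "
    "USING fts5(search_context, search_context_path)"
)
conn.executemany(
    "INSERT INTO context_fts VALUES (?, ?)",
    [
        ("decomposition of steam into hydrogen and oxygen",
         "/Item/Unit[9]/Session[3]/ITQ[2]"),
        ("copper reacts with nitric acid",
         "/Item/Unit[6]/Session[1]/Section"),
    ],
)

# Space separated terms are combined with an implicit AND
q = "steam hydrogen oxygen"
rows = conn.execute(
    "SELECT search_context_path FROM context_fts WHERE context_fts MATCH ?",
    (q,),
).fetchall()

matched_paths = [row[0] for row in rows]
print(matched_paths)
```

Binding the search phrase as a parameter, rather than formatting it into the SQL string, avoids having to quote it by hand.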

We can display that context:

from IPython.display import Markdown
from xml_utils import ouxml2md

def display_result_md(search_context):
    """Render Markdown for result path."""
    _md = ouxml2md( search_context )
    
    # Hack because Sphinx renders errors
    # Replace the longer marker first; otherwise "###" also matches inside "####"
    _md = _md.replace("####", "SUBHEADER: ")
    _md = _md.replace("###", "HEADER: ")
    
    display(Markdown(_md))

example_eq_search.drop_duplicates(subset="search_context_path")["search_context"].apply(display_result_md)

SUBHEADER: Question

The combination of sulfur dioxide with oxygen, and the decomposition of steam into hydrogen and oxygen are both reactions of great potential value. These reactions and their equilibrium constants at 427 °C (700K) are as follows.

$$2\mathrm{SO_2(g)} + \mathrm{O_2(g)} = 2\mathrm{SO_3(g)} \qquad K = 10^{6}\ \mathrm{mol^{-1}\ litre}$$

$$2\mathrm{H_2O(g)} = 2\mathrm{H_2(g)} + \mathrm{O_2(g)} \qquad K = 10^{-33}\ \mathrm{mol\ litre^{-1}}$$

  1. Write expressions for the equilibrium constants of the two reactions.

  2. When the two reactions are attempted at 700K, neither seems to occur. Which of the two might be ‘persuaded’ to proceed at this temperature, and what form might your persuasion take?

SUBHEADER: Answer

The equilibrium constant of the first reaction, $K_1$, is given by

$$K_1 = \frac{[\mathrm{SO_3(g)}]^2}{[\mathrm{SO_2(g)}]^2\,[\mathrm{O_2(g)}]}$$

That of the second,

$$K_2 = \frac{[\mathrm{H_2(g)}]^2\,[\mathrm{O_2(g)}]}{[\mathrm{H_2O(g)}]^2}$$

The data show that $K_2$ is tiny: at equilibrium, the concentrations of the hydrogen and oxygen in the numerator (the top line of the fraction) are minute in comparison with the concentration of steam in the denominator (the bottom line of the fraction). So in a closed system at 700 K, significant amounts of hydrogen and oxygen will never be formed from steam.

By contrast, $K_1$ is large, so the equilibrium position at 700 K lies well over to the right of the equation, and conversion of sulfur dioxide and oxygen to sulfur trioxide is favourable. The fact that the reaction does not occur must be due to a slow rate of reaction. We may therefore be able to obtain sulfur trioxide in this way if we can find a suitable catalyst to speed up the reaction. A suitable catalyst is vanadium pentoxide, $\mathrm{V_2O_5}$, and at 700 K, this reaction is the key step in the manufacture of sulfuric acid from sulfur, oxygen and water.

0    None
Name: search_context, dtype: object

We can also display the equations that are described in that context by using the search context path to join the equation context table with the equations table:

pd.read_sql(f"""SELECT e.* FROM equations e, equations_context_fts
                WHERE equations_context_fts MATCH {db.quote(q)}
                    AND e._id=equations_context_fts._id
                    AND e.search_context_path=equations_context_fts.search_context_path;
              """ , db.conn)["equation"].apply(lambda eq: display(HTML(eq+"<hr/>")));
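$$K_1 = \frac{[\mathrm{SO_3(g)}]^2}{[\mathrm{SO_2(g)}]^2\,[\mathrm{O_2(g)}]}$$

$$K_2 = \frac{[\mathrm{H_2(g)}]^2\,[\mathrm{O_2(g)}]}{[\mathrm{H_2O(g)}]^2}$$

$$2\mathrm{SO_2(g)} + \mathrm{O_2(g)} = 2\mathrm{SO_3(g)} \qquad K = 10^{6}\ \mathrm{mol^{-1}\ litre}$$

$$2\mathrm{H_2O(g)} = 2\mathrm{H_2(g)} + \mathrm{O_2(g)} \qquad K = 10^{-33}\ \mathrm{mol\ litre^{-1}}$$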
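
The join used above can be tried out in isolation. The following sketch rebuilds cut-down versions of the two tables in an in-memory database using the standard library's sqlite3 module, then runs the same kind of query with an explicit JOIN and a bound search parameter. The rows, the shortened <math> strings and the "doc1" identifier are all invented for illustration, and the sketch assumes the SQLite build includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE equations (
    equation TEXT, xpath TEXT, search_context_path TEXT, _id TEXT
);
CREATE VIRTUAL TABLE equations_context_fts
    USING fts5(search_context, search_context_path, _id);
""")

# Hypothetical rows: two equations in different contexts of one document
conn.executemany(
    "INSERT INTO equations VALUES (?, ?, ?, ?)",
    [
        ("<math><mi>K1</mi></math>",
         "/Item/Unit[9]/Session[3]/ITQ[2]/Answer/Equation[1]",
         "/Item/Unit[9]/Session[3]/ITQ[2]", "doc1"),
        ("<math><mi>Cu</mi></math>",
         "/Item/Unit[6]/Session[1]/Section/Equation",
         "/Item/Unit[6]/Session[1]/Section", "doc1"),
    ],
)
conn.execute(
    "INSERT INTO equations_context_fts VALUES (?, ?, ?)",
    ("decomposition of steam into hydrogen and oxygen",
     "/Item/Unit[9]/Session[3]/ITQ[2]", "doc1"),
)

# Join on both the document id and the context path
sql = """
SELECT e.equation
FROM equations AS e
JOIN equations_context_fts
  ON e._id = equations_context_fts._id
 AND e.search_context_path = equations_context_fts.search_context_path
WHERE equations_context_fts MATCH ?
"""
matched_equations = [row[0] for row in
                     conn.execute(sql, ("steam hydrogen oxygen",))]
print(matched_equations)
```

Joining on both _id and search_context_path matters because the same context path, such as /Item/Unit[9]/Session[3]/ITQ[2], could easily occur in more than one document.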