HTML Unescape in Python

04 Oct 2008

Ok, if you read the last post, you knew this one was coming.

Yes

Here's how to unescape HTML text:

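import re
from htmlentitydefs import name2codepoint

def replace_entities(match):
    # Turn one matched entity into its character, or copy it if it can't be decoded.
    try:
        ent = match.group(1)
        if ent[0] == "#":
            # Numeric entity: hex if prefixed by x or X, otherwise decimal.
            if ent[1] == 'x' or ent[1] == 'X':
                return unichr(int(ent[2:], 16))
            else:
                return unichr(int(ent[1:], 10))
        # Named entity such as amp or lt.
        return unichr(name2codepoint[ent])
    except (KeyError, ValueError, OverflowError):
        # Unknown name, malformed number, or out-of-range code point.
        return match.group()

# Compiled once at import time, not on every call.
entity_re = re.compile(r'&(#?[A-Za-z0-9]+?);')
def html_unescape(data):
    return entity_re.sub(replace_entities, data)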

You can fiddle with the regexp to make the trailing semicolon optional or not. Right now, if it can't decode an entity, it just copies it through unchanged. If you want to skip those instead, change the final return match.group() to return ''.
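Here is a rough sketch of both tweaks together. It is not the code above: it is a Python 3 port (htmlentitydefs became html.entities and unichr became chr), and the names lenient_re, make_unescaper and drop_unknown are made up for illustration. Note that with an optional semicolon the quantifier has to be greedy, otherwise the lazy +? would stop after a single character.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs

# Assumption: same pattern as the post's, but with an optional trailing
# semicolon and a greedy quantifier so the whole entity name is captured.
lenient_re = re.compile(r'&(#?[A-Za-z0-9]+);?')

def make_unescaper(drop_unknown=False):
    """Hypothetical helper: build an unescape function with the chosen policy."""
    def replace(match):
        ent = match.group(1)
        try:
            if ent[0] == '#':
                if ent[1] in 'xX':
                    return chr(int(ent[2:], 16))
                return chr(int(ent[1:], 10))
            return chr(name2codepoint[ent])
        except (KeyError, ValueError, OverflowError):
            # Undecodable entity: either drop it or copy it through.
            return '' if drop_unknown else match.group()
    return lambda data: lenient_re.sub(replace, data)

keep = make_unescaper()                    # copies undecodable entities
drop = make_unescaper(drop_unknown=True)   # removes them instead
```

One thing to watch: once the semicolon is optional, ordinary text like AT&T also matches the pattern, so combining that with the "drop" policy will eat part of it. Copying unknown entities through is the safer default.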

The Really Slow Way

As usual, Google pointed me to the slowest way possible:

import htmllib
def html_unescape_parser(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()

This is 5x slower than the regexp version above, and it does not decode hexadecimal entities.

Better, but not quite

A nice post in comp.lang.python provides a better implementation, but it has a few problems:

  • Missing obvious optimizations (precompiling the regexp, etc.).
  • Raises exceptions when it encounters bad numeric entities (e.g. &#12ZZZ;).
  • Doesn't decode hexadecimal numeric entities prefixed by an uppercase "X" (e.g. &#X41;).

But big props to the author for writing unit tests. I'll post mine shortly.
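In the meantime, here is a minimal sketch of what tests for the regexp version might look like. These are not my actual tests: the test names and inputs are illustrative, and the function is ported to Python 3 (html.entities and chr) so the block runs on its own.

```python
import re
import unittest
from html.entities import name2codepoint  # htmlentitydefs in Python 2

entity_re = re.compile(r'&(#?[A-Za-z0-9]+?);')

def replace_entities(match):
    ent = match.group(1)
    try:
        if ent[0] == '#':
            if ent[1] in 'xX':
                return chr(int(ent[2:], 16))
            return chr(int(ent[1:], 10))
        return chr(name2codepoint[ent])
    except (KeyError, ValueError, OverflowError):
        return match.group()

def html_unescape(data):
    return entity_re.sub(replace_entities, data)

class HtmlUnescapeTest(unittest.TestCase):
    def test_named(self):
        self.assertEqual(html_unescape('&lt;p&gt; &amp; &copy;'), '<p> & \u00a9')

    def test_decimal_and_hex(self):
        # Covers both the lowercase "x" and uppercase "X" hex prefixes.
        self.assertEqual(html_unescape('&#65;&#x42;&#X43;'), 'ABC')

    def test_bad_numeric_is_copied(self):
        # A malformed number must not raise; it is passed through as-is.
        self.assertEqual(html_unescape('&#12ZZZ;'), '&#12ZZZ;')

    def test_unknown_name_is_copied(self):
        self.assertEqual(html_unescape('&nosuchentity;'), '&nosuchentity;')

    def test_no_double_decoding(self):
        # Text produced by a replacement is not scanned again.
        self.assertEqual(html_unescape('&amp;lt;'), '&lt;')
```

Saved as a module, this can be run with python -m unittest. The bad-numeric and uppercase-X cases are the two behaviors the list above calls out.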

Because you are a dork like me

The spec for all this is here.