Python default string encoding woes

13 Nov 2008

The most excellent Ian Bicking wrote a few articles in 2005 on Python unicode madness ([1], [2], [3]), with further commentary here.

Ok, that was in 2005, so you might have forgotten the issue. When you convert a unicode string to a 'normal' byte string, Python by default converts using ASCII, and fails hard if the unicode string is not pure ASCII. That makes this function explode if you pass in a non-ASCII character:

def printit(x):
    print(str(x))

That makes the unicode object the only built-in Python type that can break str().
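To see the failure mode without depending on the implicit conversion, here is a minimal sketch that invokes the same ASCII codec explicitly:

```python
# -*- coding: utf-8 -*-
# Sketch of the failure: pushing unicode text through the ASCII codec
# (which is what the implicit str() conversion does) blows up on the
# first non-ASCII character.
s = u'caf\xe9'  # 'café' -- contains a non-ASCII character

try:
    s.encode('ascii')  # the conversion str() performs under the hood
except UnicodeEncodeError as e:
    print('boom: %s' % e)
```

The same UnicodeEncodeError is what bubbles up out of printit() above.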

To make matters worse, there is no nice way to change the default encoding, which is even more bizarre.

Here's a concrete example. Let's say you are writing a Python extension for a C library. Most C libraries pass strings in and out as UTF-8. Why? C has no portable, well-defined data type for Unicode: the POSIX wchar_t might be 2 bytes or might be 4 bytes, and then you have endianness issues on top of that. UTF-8 is built on one-byte char units, so it's portable.
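As a quick illustration of that portability, a non-ASCII character always encodes to the same byte sequence in UTF-8, regardless of platform or endianness (a sketch):

```python
# -*- coding: utf-8 -*-
# UTF-8 represents each code point as a sequence of single bytes,
# so the encoded result is identical on every platform.
e_acute = u'\xe9'                    # U+00E9, 'é'
utf8_bytes = e_acute.encode('utf-8') # always the two bytes 0xC3 0xA9
print(repr(utf8_bytes))
```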

Now you use an extension that wraps this C library. It expects a UTF-8 encoded string. You have a unicode Python string and pass it in. Python converts the unicode to a byte string and explodes the first time it hits a non-ASCII character. To fix this, you either change every call into the library to do "if type(s) is unicode, then s = s.encode('utf8')", or you rewrite the wrapper library. Great.
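One way to avoid sprinkling that check over every call site is a tiny helper at the wrapper boundary. A sketch, using a version-agnostic "not bytes" check in place of the `type(s) is unicode` test (the helper name is made up for illustration):

```python
def ensure_utf8(s):
    # Hypothetical helper: encode text to UTF-8 bytes at the C-library
    # boundary; byte strings pass through untouched.
    if not isinstance(s, bytes):
        s = s.encode('utf-8')
    return s

# Every wrapper call then funnels its string arguments through it, e.g.:
#   c_library_call(ensure_utf8(name))
```

You still have to touch the wrapper, but at least the conversion lives in one place instead of at every call site.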

Or you can use this hack to fix it globally:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

You need to reload the sys module because site.py deletes setdefaultencoding during interpreter startup. It's not even in the pydoc for sys.

yuck.