Sunday, June 14, 2009

python fuzzy date parsing

You want the Labix python-dateutil module. See the parser section and set fuzzy=True. I'm posting this because the current search results for "python fuzzy date parsing" suck. Maybe this will help.
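Here is a minimal sketch of what that looks like, assuming python-dateutil is installed. The sample sentence is just an illustration; fuzzy=True makes the parser skip the words it can't read as part of a date.

```python
from dateutil import parser

# A sentence with a date buried in it (sample text, chosen for illustration).
text = "Today is January 1, 2047 at 8:21:00AM"

# fuzzy=True: ignore tokens that are not recognizably part of a date.
when = parser.parse(text, fuzzy=True)
print(when.isoformat())
```

If you also want to know which bits of text were skipped, the parser takes fuzzy_with_tokens=True and returns the leftovers alongside the datetime.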

Python, Atom Generator, and Google gdata

It's hidden, but Google provides a Python Atom module in the gdata API.

Installing is just the usual "easy_install URL" with the latest tarball from the downloads section. If having all the other gdata APIs makes you uncomfortable, just check out the atom bits with:

svn co http://gdata-python-client.googlecode.com/svn/trunk/src/atom/

(consider using svn:externals)

The pedantic implementation of the first example from the Atom spec is:

from gdata.client import atom
# or "import atom" if you did the svn checkout method

feedauthor = atom.Author(name = atom.Name(text='John Doe'))
feedtitle = atom.Title(text = "Example Feed")
feedlink = atom.Link(href = "http://example.org/")
# anything as long as it is unique
feedid = atom.Id(text="urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6")
# datetime.datetime.now().isoformat()
feedupdated = atom.Updated("2003-12-13T18:30:02Z")


entries = []
e_title   = atom.Title(text="Atom-Powered Robots Run Amok")
e_link    = atom.Link(href= "http://example.org/2003/12/13/atom03")
e_id      = atom.Id(text="urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a")
e_updated = atom.Updated("2003-12-13T18:30:02Z")
e_summary = atom.Summary(text="Some text.")

entries.append( atom.Entry(title=e_title, link=e_link, atom_id=e_id, summary=e_summary))

feed = atom.Feed(entry=entries, title=feedtitle, link=feedlink, atom_id=feedid, updated=feedupdated)

print(str(feed))

As you can see it's kinda verbose, but hey, we are dealing with XML so whattdoya expect. pydoc is your friend.

The output will look something like the following. If you throw it under Apache as "junk.atom" you should be able to read it with your fav RSS reader or view it in your browser.

<?xml version='1.0' encoding='UTF-8'?>
<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom">
<ns0:title>Example Feed</ns0:title>
<ns0:id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</ns0:id>
<ns0:link href="http://example.org/" />
<ns0:updated>2003-12-13T18:30:02Z</ns0:updated>
<ns0:entry>
  <ns0:id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</ns0:id>
  <ns0:link href="http://example.org/2003/12/13/atom03" />
  <ns0:summary>Some text.</ns0:summary>
  <ns0:title>Atom-Powered Robots Run Amok</ns0:title>
</ns0:entry>
</ns0:feed>

I'm sure this library also does parsing, but I haven't tested that.
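If you just need to read a feed like the one above and don't care about gdata, the standard library can do it. This is a sketch using xml.etree.ElementTree, not the gdata parser; the FEED_XML sample is a trimmed copy of the output above and read_entries is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Map a prefix of our choosing to the Atom namespace URI.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed copy of the generated feed (sample data for this sketch).
FEED_XML = """<?xml version='1.0' encoding='UTF-8'?>
<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom">
<ns0:title>Example Feed</ns0:title>
<ns0:link href="http://example.org/" />
<ns0:entry>
  <ns0:id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</ns0:id>
  <ns0:link href="http://example.org/2003/12/13/atom03" />
  <ns0:title>Atom-Powered Robots Run Amok</ns0:title>
</ns0:entry>
</ns0:feed>
"""


def read_entries(xml_text):
    """Return (feed title, [(entry title, entry href), ...])."""
    # Parse from bytes so the encoding declaration is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    feed_title = root.findtext("atom:title", namespaces=ATOM_NS)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        href = link.get("href") if link is not None else None
        entries.append((title, href))
    return feed_title, entries


feed_title, entries = read_entries(FEED_XML)
print(feed_title)
for title, href in entries:
    print(title, href)
```

The "ns0" prefix in the document doesn't matter; lookups go by the namespace URI, so any prefix in ATOM_NS works.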

Friday, June 12, 2009

Fo shizzle my drizzle

I bumped into a most mysterious new database based on MySQL, named Drizzle. See the Home Page and the Drizzle Project Page. The Wiki has the good stuff. Here's my read:

  • Take MySQL 6.0
  • Remove all triggers, views, stored procs and other things that complicate life. That's the application's job.
  • Simplify the types (no more small int vs. medium int)
  • Make it more pluggable and designed for modern multicore architectures
  • Stop trying to make the database the center of the universe: remove LDAP, ACLs, and other things provided by the OS
  • Remove half-assed features: MyISAM fulltext (sucked) and the GIS spatial extensions (incomplete). I especially like the clarity of the goals. They say if you want GIS: use Postgres.

Oh yeah, Windows support is removed. Haha.

And as a tease, I saw something about "Sharding" on the site. If they get auto-sharding working and easy to use: YUM!

The developers seem like heavy hitters and are testing on 16-core machines. It's currently in pre-alpha.

OK, why, you ask? Let me put on my tech marketing hat:

Database         Scaling       Features   Open
BigTable/SDB     High          Low        Closed
Drizzle?         Medium-High   Medium     Open
Mysql/Postgres   Medium        High       Open
Sqlite3          Low           Medium     Open

That's quite coarse, but you get the idea. You could make a nifty chart, but it boils down to: can you get much better scaling and operations by getting rid of a bunch of crap, while still being SQL-based and open source? I think Drizzle says yes.

Thursday, June 11, 2009

Serving Precompressed Static Content with Apache

Using mod_negotiation

It's pretty simple, but has a big gotcha.

First, edit your mime.conf file, and uncomment or add:

AddEncoding x-compress .Z
AddEncoding x-gzip .gz .tgz

and, if you use Indexes, comment out:

#AddType application/x-compress .Z
#AddType application/x-gzip .gz .tgz

then add Options MultiViews to any Directory that has compressible content.

Now compress your content. Hack the following as needed:

for file in *.js; do
    echo "gzipping ${file}"
    # quote the name so files with spaces survive
    gzip -9 -c "${file}" > "${file}.gz"
done
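If you'd rather do the compressing from Python (say, as part of a deploy script), the shell loop above translates roughly to the following. This is a sketch: precompress and its pattern/suffix parameters are names I made up, and it only looks in one directory.

```python
import glob
import gzip
import os
import shutil


def precompress(directory, pattern="*.js", suffix=".gz"):
    """Write a gzip copy of every matching file next to the original."""
    written = []
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        target = path + suffix
        # compresslevel=9 corresponds to "gzip -9"
        with open(path, "rb") as src:
            with gzip.open(target, "wb", compresslevel=9) as dst:
                shutil.copyfileobj(src, dst)
        written.append(target)
    return written


if __name__ == "__main__":
    for name in precompress("."):
        print("gzipped %s" % name)
```

Call it again with pattern="*.css" for stylesheets. The originals are left in place, same as with "gzip -c".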

OK, now you are all set, except for one thing.

You have to get rid of the suffix on your URLs. For example, if you have:

<link href="foo.css">
You have to change it to:
<link href="foo">

Grump. I don't really like that. The good news is that the old links (with the suffix) still work, but aren't compressed.

mod_rewrite

Smart guy Mark Aufflick figured it out another way using mod_rewrite.

It's a slightly different method. You are making new MIME types .jsgz and .cssgz and telling Apache explicitly what they are. The compressor script is slightly different too: the output should be foo.jsgz, not foo.js.gz.

for file in *.js; do
    echo "gzipping ${file}"
    # note: no dot before "gz", so foo.js becomes foo.jsgz
    gzip -9 -c "${file}" > "${file}gz"
done

Add these to your apache2.conf or httpd.conf. This sets up the new MIME types:

AddType "text/javascript;charset=UTF-8" .jsgz
AddEncoding gzip .jsgz
AddType "text/css;charset=UTF-8" .cssgz
AddEncoding gzip .cssgz

Then add the gzip-if-client-can-do-it command:

<Directory "/YOUR/DIRECTORY/WITH/js">
RewriteEngine on
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteRule (.*)\.js$ $1.jsgz [L]
</Directory>

etc.. ta-da! Now do the same thing with your CSS.
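If the rewrite config reads like line noise, here is the decision it makes, written out as a few lines of Python. It's only a model of the rule for reasoning about it, not something Apache runs, and pick_variant is a made-up name.

```python
def pick_variant(path, accept_encoding):
    """Model of the rewrite: serve foo.jsgz / foo.cssgz if the client takes gzip."""
    # "RewriteCond %{HTTP:Accept-Encoding} gzip" is an unanchored regex match,
    # so a plain substring test is the closest equivalent.
    client_takes_gzip = "gzip" in (accept_encoding or "")
    if client_takes_gzip and path.endswith((".js", ".css")):
        return path + "gz"
    # Everything else (and every client without gzip) gets the original file.
    return path


print(pick_variant("/static/foo.js", "gzip, deflate"))
print(pick_variant("/static/foo.css", "identity"))
```

Like the RewriteCond, this ignores q-values, so a client sending "gzip;q=0" would still get the compressed file. That's rare enough that I wouldn't worry about it.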

Notes on Google App Engine, June 2009

For a personal project I started to use Google's App Engine. The application is a bit non-traditional in that it requires a good amount of batch data to be sent to the server regularly, and a good amount of the data needs to be deleted as well.

Constraints

I can't remember where I read it, but each request had better use less than 1M of memory and take less than 2-3 seconds to complete. For a normal webby app that should be fine. Anything that needs more than that will be painful to do. Everything is a "web request", meaning that if you have admin work, it's a URL, and the HTTP request had better complete in a second or two, or else it will be cut off.

Good News

I didn't find using the datastore particularly confusing or restrictive (mostly, see below). One may need to create indexes or keys a bit differently than one might in normal SQL, but overall it works well.

The Python performance is great. Content gets compressed and served from Google's CDN.

Lots of tools, dev-kits, consoles and monitoring.

I suspect that the app runs from a set of geographically distributed nodes (a CDN for the app). That's hot.

The Bad News

App Engine is really bad for bulk uploads. It's fine for first-time uploads, but not so great for regular updates and downloads. They have a nice remote_api, but things time out in under 5 seconds. To work around this, they have a bunch of tools that split uploads into numerous parallel requests. It works "mostly" well, but even modest uploads can consume ~10% of your quota.

There is no batch delete; you have to make a web request to delete a chunk of data and keep going until there is no more data. I finally gave up on App Engine when deleting a whole 500 records (small, with no indexes) couldn't be completed in 5 seconds.
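The delete-a-chunk-and-come-back pattern looks roughly like this. It's a sketch with stand-ins, not App Engine code: fetch_keys and delete_keys are hypothetical callables you would back with real datastore calls, and here they just work on an in-memory set.

```python
import time


def delete_in_chunks(fetch_keys, delete_keys, chunk_size=100, budget_seconds=2.0):
    """Delete chunk by chunk until nothing is left or the time budget runs out.

    Returns (deleted_count, finished) so the caller knows whether it has to
    issue another request to continue.
    """
    deadline = time.time() + budget_seconds
    deleted = 0
    while time.time() < deadline:
        keys = fetch_keys(chunk_size)
        if not keys:
            return deleted, True
        delete_keys(keys)
        deleted += len(keys)
    return deleted, False


# Stand-in for the datastore: a plain set of record ids (hypothetical data).
records = set(range(500))


def fetch_keys(limit):
    return sorted(records)[:limit]


def delete_keys(keys):
    records.difference_update(keys)


deleted, finished = delete_in_chunks(fetch_keys, delete_keys)
print(deleted, finished)
```

When finished comes back False, the client simply hits the same URL again. The budget has to sit comfortably under the request deadline, which is exactly what made this painful for me.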

And then there are the Python issues. Really it's not bad, but if your existing code is in 2.6 you may find it annoying to downgrade to 2.5 (the only version Google supports). Any extension that uses a C library or the local filesystem is off limits too (e.g. no Sqlite3 -- you must use their datastore).

The local development version of App Engine is really slow, and doesn't enforce the same constraints as the live version.

What it's good for

You probably know this already. It's good for user-based apps where most of the data is created organically. It's not good for data processing and things involving bulk uploads. I knew this too, but wanted to see if I could bend my app to use Google's infrastructure. Sadly it's not a fit.

If you have a webby application that matches App Engine's sweet spot, I would highly recommend it.