C[omp]ute

© 2006-2017 Andrew Cooke (site) / post authors (content).

Lessons Learned from AppEngine's Data Store

From: andrew cooke <andrew@...>

Date: Tue, 2 Aug 2011 19:57:31 -0400

This is a brief summary of the things I've learnt while using Google's
AppEngine Data Store - a "NoSQL" database designed for high performance.

1 - Do this!  I was wary of AppEngine because of lock-in, etc, but you can
easily get Django working, which avoids learning a whole new framework,
and Django non-rel has the promise to liberate you completely, if needed
(but see below).

No amount of reading about "NoSQL" taught me what I learnt writing code
- if you're a programmer, the GAE Data Store is a great intro.

2 - Think hard about how your application works up-front.  This is a big
shift from SQL, where you probably had a logical, normalised,
independent, data model and then mapped between that and your
application with SQL.  You can't do that with the Data Store.  Instead,
you need to design the data model around the actions that occur in your
application.

In other words: with SQL you have the luxury of a layer of isolation
between the data model and the application; with the Data Store you
don't.
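
A rough sketch of what "designing around actions" can mean, in plain
Python with a dict standing in for the store (this is not the AppEngine
API, and the kinds, fields and function names are invented for
illustration).  The action is "show a comment with its author's name",
so the name is copied onto the comment when it is written and the read
needs a single lookup rather than a join:

```python
def post_comment(store, comment_id, author_id, text):
    """Write path: copy the author's name onto the comment itself."""
    author = store[("User", author_id)]
    store[("Comment", comment_id)] = {
        "author_id": author_id,
        "author_name": author["name"],  # denormalised copy, may go stale
        "text": text,
    }


def render_comment(store, comment_id):
    """Read path: one lookup by key, no join against User."""
    comment = store[("Comment", comment_id)]
    return "%s: %s" % (comment["author_name"], comment["text"])


# Hypothetical usage with an in-memory dict as the "store".
store = {("User", 1): {"name": "ana"}}
post_comment(store, 10, 1, "hello")
print(render_comment(store, 10))
```

The cost is that the copy does not follow later changes to the user, so
something has to tidy it up afterwards.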

3 - Think hard about where you need transactions, and where not.  Again,
this is reflected directly in the data model.  The one aspect of the
Data Store that has impressed me most is how they have managed to
combine scalability with transactions.

For me, the necessary structure was pretty clear - I have users that
"own" certain objects, so those "owned" objects are children of the
users.  This lets me guarantee consistency where I need it (where users
can see an account balance, for example).  Separate from that, and free
of any transactions or trees, are the main data in my application.
These are not guaranteed to be immediately consistent, but are much more
efficiently handled.  The data model reflects all this.
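
The parent/child structure can be sketched without the real API.  Here
a key is modelled as a path of (kind, name) pairs, the first pair names
the tree it belongs to, and a check says whether a set of keys could be
touched in one transaction.  This assumes the rule, as I understand it,
that a transaction is confined to a single tree; the helper names and
kinds are made up:

```python
def make_key(kind, name, parent=None):
    """Build a key as a tuple of (kind, name) pairs, root first."""
    return (parent or ()) + ((kind, name),)


def entity_group(key):
    """The root of the path identifies the tree the entity lives in."""
    return key[0]


def can_share_transaction(keys):
    """True if every key hangs off the same root."""
    return len(set(entity_group(k) for k in keys)) == 1


# Owned objects are children of the user; the main data stands alone.
user = make_key("User", "ana")
balance = make_key("Balance", "main", parent=user)
purchase = make_key("Purchase", 1, parent=user)
item = make_key("Item", 7)

print(can_share_transaction([user, balance, purchase]))
print(can_share_transaction([balance, item]))
```

Deciding which objects hang off which root is therefore the same
decision as deciding where consistency is guaranteed.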

4 - Think about how caching can fail.  This isn't NoSQL-specific, but it's
important anyway: caching gets easier the less strict you are about
behaviour.  Choose the design so that if you cache too much, or for too
long, it's not a problem - make it generous by default (so, for example,
I have a resource that expires after a certain time, but I don't care
whether caching extends that - what is important is that I never
over-restrict a user).

Related: use negative caching only where it is absolutely critical.
It's so easy to get in a mess here...
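
As an illustration only, here is a small in-memory cache with a time
limit, plus an access check that caches "yes" answers and never caches
"no".  The class, the function and the key format are hypothetical; a
real application would use memcache rather than a dict.  If the cache
holds an entry too long the user keeps access a little longer, which is
the harmless direction:

```python
import time


class TTLCache(object):
    """Tiny in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.time):
        self.ttl = ttl
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at >= self.ttl:
            del self._data[key]
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock())


def has_access(cache, user, expires_at):
    """Cache positive answers only; always re-check after a miss."""
    key = "access:%s" % user
    if cache.get(key):
        return True  # possibly stale, but generous by design
    allowed = cache.clock() < expires_at
    if allowed:
        cache.set(key, True)  # a refusal is never cached
    return allowed
```

Because a refusal is never stored, a user who renews is let back in on
the very next request.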

5 - Carefully choose the keys for your cache.  They should reflect the
entire state you are caching, so that you don't need to worry about
retrieving inconsistent data.
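
One way to do this, sketched with the standard library only: build the
key from every value the cached result depends on, so that a change in
any of them gives a different key and the old entry is simply never
read again.  The function name and the example fields are assumptions,
not anything from the site's code:

```python
import hashlib
import json


def cache_key(prefix, **state):
    """Derive a cache key from all the state the cached value depends on."""
    # Sorting the keys makes the result independent of argument order.
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    # Hashing keeps the key short and free of awkward characters.
    return "%s:%s" % (prefix, digest)


# Hypothetical usage: the page depends on the user, the page number and
# a version counter that is bumped whenever the underlying data changes.
key = cache_key("listing", user="ana", page=1, version=3)
```

With a version counter in the key there is nothing to invalidate
explicitly; stale entries just age out of the cache.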

6 - Clean out your database in a separate thread.  Omit non-critical
write/delete operations from views.  Instead, delegate them to a
background worker task.  This is particularly true when deleting - it's
a slow, painful process to delete large amounts of data from the store.

Inconsistency is your friend.  Much of your code has to work assuming
inconsistent data anyway, so assume very little and then tidy things up
later in a separate, batch task.
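
The shape of this, sketched in plain Python: the view only records that
something should be deleted, and a separate task works through the
backlog one batch at a time.  A deque and a dict stand in for the task
queue and the store, and the function names are invented; on AppEngine
the work would presumably be handed to the Task Queue service:

```python
from collections import deque


def request_delete(pending, key):
    """Called from a view: cheap, just records the intent."""
    pending.append(key)


def run_cleanup(store, pending, batch_size=100):
    """Called from a background task: delete at most one batch.

    Returns the number of keys still waiting, so the caller can decide
    whether to schedule another run.
    """
    for _ in range(min(batch_size, len(pending))):
        key = pending.popleft()
        store.pop(key, None)  # tolerate keys that are already gone
    return len(pending)
```

Until the cleanup runs the data is still there, so the rest of the code
has to cope with entities that are "deleted" but not yet removed.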

7 - Don't rely on Django non-rel until you understand the store without it.
The non-rel package was a great help when I was starting - my initial
code looked like a nice, familiar Django project.  Then I began
wondering just what "eventual consistency" might mean and realised I had
some very nasty bugs, because non-rel doesn't currently support
transactions.

And even when transactions are added to non-rel (they are work in
progress), I would suggest using the basic models Google provides until
you understand the system in detail.  Despite reading much of the
documentation I really didn't grasp how everything worked until I had
used the API.

So I would suggest the following: if it helps, start with non-rel to get
going; switch to Google's own models to get transactions and to
understand the store; go back to non-rel if and when you are confident
it makes sense.

[Beware that the non-rel and related packages bundle Django 1.3, while
Django 1.2 is the latest supported directly by AppEngine - I switched
back to 1.2 when I switched models and it was worth it just for the
reduced deploy time]

In summary:

- Simplify cache use with careful key choice and relaxed behaviour.
- Don't try to keep everything consistent in your views - delegate
  non-critical writes and deletes to a background task.

Andrew

BTW, the site on which the above is based is http://www.parti.cl - it provides
put next to users, you just load the images from there.