arbinger systems

by JA Robson    RSS feed

Wednesday, April 30, 2008

Google App Engine: [A Better] Many-to-many JOIN

GAE(This is a follow-up to my original post GAE: Many-to-many JOIN. It probably wouldn't hurt to read that first, since this post sort of assumes you have.)

After getting some feedback on my original post, a simpler, more SQL analogous way to obtain the many-to-many behavior was pointed out to me.

I've created another sample (download it here), and will go over it below. Afterwards, I'll talk about why you shouldn't model your data this way, and instead should denormalize your data for optimization in the Datastore.

Here are the new data Models. (The full code listing is here.)
class Libraries(db.Model):
notes = db.StringProperty()

class Books(db.Model):
notes = db.StringProperty()

class Library(db.Model):
name = db.StringProperty()
address = db.StringProperty()
city = db.StringProperty()
libscol = db.ReferenceProperty(Libraries,
collection_name='libscol')

def books(self):
return (x.book for x in self.librarybook_set)

class Book(db.Model):
title = db.StringProperty()
author = db.StringProperty()
bookscol = db.ReferenceProperty(Books,
collection_name='bookscol')

def libraries(self):
return (x.library for x in self.librarybook_set)

class LibraryBook(db.Model):
library = db.ReferenceProperty(Library)
book = db.ReferenceProperty(Book)
I still have the Books and Libraries models, as you can see. These are needed to collect the Library and Book entities so I can easily iterate over them and output. The Book model contains a reference to Books, via Book.bookscol, and Library to Libraries, via Library.libscol.

The LibraryBook model just contains references to the Library and Book models. This creates our "join". After we add libraries and books to the Datastore, we will link them to each other using LibraryBook entities.

When the page loads, we first create and store our data entities.
# Library collection
libs = Libraries()
libs.put()

# Book collection
books = Books()
books.put()

# Setup libraries
lib1 = Library(name='lib1', address='street a',
city='city1', libscol=libs)
lib2 = Library(name='lib2', address='street b',
city='city2', libscol=libs)
lib1.put()
lib2.put()

book1 = Book(title='book1', author='author one',
bookscol=books)
book1.put()
book2 = Book(title='book2', author='author one',
bookscol=books)
book2.put()
book3 = Book(title='book1', author='author two',
bookscol=books)
book3.put()
book4 = Book(title='book2', author='author two',
bookscol=books)
book4.put()
book5 = Book(title='book3', author='author two',
bookscol=books)
book5.put()

l1 = LibraryBook(library=lib1, book=book1)
l2 = LibraryBook(library=lib1, book=book2)
l3 = LibraryBook(library=lib1, book=book4)
l4 = LibraryBook(library=lib2, book=book4)
l5 = LibraryBook(library=lib2, book=book5)
l6 = LibraryBook(library=lib2, book=book3)
l7 = LibraryBook(library=lib2, book=book1)
l1.put()
l2.put()
l3.put()
l4.put()
l5.put()
l6.put()
l7.put()
First, we create our Libraries and Books entities, libs and books. These will be passed into each Library and Book entity we create.

After we create our books and libraries, we generate a lot of LibraryBook entities, assigning a library and a book to each one. Each LibraryBook entity now links one library with one book. As you may have noticed, some books are assigned to both libraries, some are not.

Library contains a method called books(). It returns every book in the librarybook_set as an iterable data structure. Because LibraryBook holds a reference to Library, any Library entity (say, lib1), is given a back-reference to the collection of LibraryBook entities. If you do not define a collection_name, GAE automatically creates one by appending "_set" to the model name. This is where librarybook_set came from, in case you were wondering.

Given a library entity like lib1, the books() method allows us to easily return all the books at that library by simply assigning or iterating over lib1.books(). The Book model contains a method called libraries() which does just the opposite: allows you to get all the libraries where a given book resides.

Our data has been created and linked. Now we pass it in to the template.
template_values= {
'lib': lib1.name,
'books_at_lib': lib1.books(),
'forbook': book1.title,
'libs_by_book': book1.libraries(),
'libs_books': libs.libscol.order('name'),
'books_libs': books.bookscol.order('-author').order('title')
}
In this example, we not only display all libraries and all books (via libs_books and books_libs) the way we did in the previous post, but also output all books at a library (books_at_lib), and all libraries that contain a given book (libs_by_book).


Here's the template, if you want to take a look at it.

Denormalize your data

As I stated before, the GAE Datastore is not a relational database. Databases were designed for compactness and efficiency, and normalization is used, in part, as a way to minimize the size of your data on disk.

The Datastore has been built, first and foremost, with scalability in mind. Scalability means, in essence, "add more servers as needed, without re-writing your code". Specifically to the GAE Datastore, it means "disk space is cheap, stop worrying about it, and scale".

Consider modifying our LibraryBook model above to look like
class LibraryBook(db.Model):
library = db.ReferenceProperty(Library)
book = db.ReferenceProperty(Book)
booktitle = db.StringProperty()
libraryname = db.StringProperty()
Now, we are not only storing each book's title in the LibraryBook entity, but we are also storing it in the title property of the referenced Book entity. While this is obviously not space efficient, and certainly not the elegant, normalized way of storing relational data our brains are used to, it scales well and is fast.

It scales because the Datastore runs on who knows how many commodity computers in the background (without the knowledge of our application), and it's fast because we have the most commonly needed fields available immediately. If you need to poke further into the data, like to get the street address of the library, you would use the referenced models, and our JOIN then comes into play.

(Thanks, Ben the Indefatigable for illuminating this.)
posted by JA at    [ 3 Comments ] reddit
Arbinger Systems: LUI - aumpel - HTMLCaptcha - Sylbi - madxlib

Monday, April 28, 2008

Google App Engine: Many-to-many JOIN

GAEUpdate: After reading this, you might want to check out GAE: [A Better] Many-to-many JOIN, which gives an improved way of doing this, plus goes into why you shouldn't normalize your data.

A public library has many books. In SQL-speak, this is a one-to-many relationship. (For the sake of the argument, I'll assume each library has only one copy of a given book). It follows then, that many libraries have many books. This is a many-to-many relationship. On the heels of my recent post GAE: One-to-many JOIN, here is an example showing how to do a many-to-many JOIN using the Google App Engine Datastore.

You can download this entire sample here.

A many-to-many SQL query for our library scenario would look something like
SELECT
*
FROM
library
INNER JOIN
libraries_books
ON
library.KEY=libraries_books.library_KEY
INNER JOIN
books
ON
libraries_books.book_KEY=books.KEY
To duplicate this functionality in the Datastore, we have to model our data as follows. (Full code listing here.)
# These are used for linking/ordering
class Books(db.Model):
notes = db.StringProperty(required=False)

class Libraries(db.Model):
notes = db.StringProperty(required=False)

# Data models
class Library(db.Model):
name = db.StringProperty(required=True)
address = db.StringProperty(required=True)
city = db.StringProperty(required=True)
library_list = db.ReferenceProperty(Libraries,
required=True, collection_name='ref_libs')

class Book(db.Model):
title = db.StringProperty(required=True)
author = db.StringProperty(required=True)
library = db.ReferenceProperty(Library,
required=True, collection_name='books')
book_list = db.ReferenceProperty(Books,
required=True, collection_name='ref_books')
The Library and Book models share a one-to-many relationship. This is setup using the Book.library db.ReferenceProperty. Nothing really new here (if you read my one-to-many post, anyway).

We need some additional references to pull off the many-to-many relationships, however, plus a couple extra Models. (It's important to note that the db.ReferenceProperty in itself only allows for a one-to-many relationship. That's why we need more than one get the many-to-many behavior.) I've created the Libraries and Books models for this. You may notice that they have an optional, largely unnecessary property named notes. This can pretty much be ignored. We really just need these entities to exist in order to point to them from our Library and Book entities.

The Library model contains a reference to Libraries through a property named library_list. Book has a reference to Books via book_list. Having references to both Libraries and Books allows us to manipulate the sorting for each collection, as you will see below.

When the page loads in our browser, the first thing we do is create entities from our models, and give them some data.
# Library collection
libs = Libraries()
libs.put()

# Books collection
books = Books()
books.put()

# Setup libraries
lib1 = Library(name='lib1', address='street a', city='city1',
library_list=libs)
lib2 = Library(name='lib2', address='street b', city='city2',
library_list=libs)
lib1.put()
lib2.put()

# Books:
# Both libraries
book1 = Book(title='book1', author='author one',
library=lib1, book_list=books)
book2 = Book(title='book1', author='author one',
library=lib2, book_list=books)
# Only first library
book3 = Book(title='book2', author='author one',
library=lib1, book_list=books)
# Both libraries
book4 = Book(title='book3', author='author two',
library=lib1, book_list=books)
book5 = Book(title='book3', author='author two',
library=lib2, book_list=books)
book1.put()
book2.put()
book3.put()
book4.put()
book5.put()
We declare our "link" entities, libs and books, first. Next we create two library instances, lib1 and lib2, and assign libs to library_list to create a one-to-many relationship from Library to Libraries.

A Book entity has two relationships to setup. A one-to-many relationship to a given Library entity, and a one-to-many relationship to the Books entity. These are established through the library and book_list properties, respectively.

After we store our data, we use the collections in our Library and Book models to create two objects that we will pass to our template.
libs_books = libs.ref_libs.order('name')
books_libs = books.ref_books.order('author').order('-title')

template_values = {
'libs_books': libs_books,
'books_libs': books_libs
}
Both libs_books and books_libs contain many-to-many relationships between libraries and books. But libs_books references books from libraries, allowing you to sort by library, and books_libs does the opposite, referencing libraries from books, letting you sort by books. This is certainly more clumsy and more work than our SQL counterpart, which just needs an ORDER BY clause to sort either way.

On to the template. To output books by library, we have to iterate over every library lib in libs_books, and then iterate over every book referenced to lib.
{% for lib in libs_books %}
{% for book in lib.books %}
<tr>
<td>{{ lib.name }}</td>
<td>{{ lib.address }}</td>
<td>{{ lib.city }}</td>
<td>{{ book.title }}</td>
<td>{{ book.author }}</td>
</tr>
{% endfor %}
{% endfor %}
Because of the way references are setup in libs_books, we are able to order the output based on the libraries, as you can see in the first table below.

results

The second table above shows the output from books_libs, which we use to order by books. Here's how we generate the data in the template:
{% for book in books_libs %}
<tr>
<td>{{ book.title }}</td>
<td>{{ book.author }}</td>
<td>{{ book.library.name }}</td>
<td>{{ book.library.address }}</td>
<td>{{ book.library.city }}</td>
</tr>
{% endfor %}
We don't have to use nested loops, and we simply use book.library as a normal reference (not a back-reference) to get the library associated to the given book. The reason we don't have to nest is because a Book entity has a many-to-one relationship with a Library entity, so each book is already attached to a Library. Library entities have a one-to-many relationship to Book entities, so every time you get lib, you have to find it's many, which requires the second loop.

There you have it. A first blush example, to be sure, but I think it conveys the core steps required to duplicate the behavior of a relational many-to-many JOIN.
posted by JA at    [ 0 Comments ] reddit
Arbinger Systems: LUI - aumpel - HTMLCaptcha - Sylbi - madxlib

Saturday, April 26, 2008

Google App Engine: One-to-many JOIN

GAE By now, no doubt, most developers have heard about the Google App Engine (GAE). And even if you didn't get one of the 10K free accounts, you might still have downloaded and started messing around with the SDK.

Google touts the platform's ease of development, and stepping through the samples reinforce that it is, in fact, quite easy. However, it doesn't take long to discover what will probably be the biggest hurdle for developers entrenched in the relational database paradigm: The Google Datastore. It's not a relational database, and it's not an OOP wrapper to a relational database. It's a web-specialized data storage mechanism, accessed through classes called Models, and objects called Entities.

I'm willing to bet that most of the developers playing with the SDK will first really "get" this when they move past the simple "one table" queries in the samples, and try to do a basic JOIN query. Although there is a SQLlike syntax called Gql -- as stated in the Docs -- there is no JOIN.

To get this functionality, you have to use db.ReferenceProperty to link one object to another. Here's a short demonstration of how it's done. I figure this is much needed, since there seems to be no good examples for it in the Google documentation. (The best information I could find was in the GAE discussion group.)

Below, I've listed example.py in its entirety (don't worry, it's short), and I'll refer to each pertinent section by the line numbers. (You can download the entire sample here. Put it under the SDK folder, and run it like any of the GAE samples.)
1  import os
2 import cgi
3 import wsgiref.handlers
4
5 from google.appengine.ext import webapp
6 from google.appengine.ext import db
7 from google.appengine.ext.webapp import template
8
9 class MainPage(webapp.RequestHandler):
10 def get(self):
11
12 url = EnteredUrl(url="http://domain.com/page.html")
13 url.put()
14
15 match1 = AffinityUrl(
16 url="http://domain.com/dir/page1.html",
17 affinity = .83,
18 entered_url=url
19 )
20 match1.put()
21
22 match2 = AffinityUrl(
23 url="http://domain.com/dir/page2.html",
24 affinity = .8301,
25 entered_url=url
26 )
27 match2.put()
28
29 matched_urls=url.matched_urls.order('-affinity')
30
31 aff_entries = AffinityUrl.all().order('url')
32
33 template_values = {
34 'url' : url.url,
35 'matched_urls': matched_urls,
36 'aff_entries': aff_entries
37 }
38
39 path = os.path.join(os.path.dirname(__file__), 'index.html')
40 self.response.out.write(template.render(path, template_values))
41
42 class EnteredUrl(db.Model):
43 url = db.StringProperty(required=True)
44
45 class AffinityUrl(db.Model):
46 url = db.StringProperty(required=True)
47 affinity = db.FloatProperty(required=True)
48 entered_url = db.ReferenceProperty(EnteredUrl,
49 required=True, collection_name='matched_urls')
50
51 def main():
52 application = webapp.WSGIApplication(
53 [('/', MainPage)],
54 debug=True)
55 wsgiref.handlers.CGIHandler().run(application)
56
57 if __name__ == "__main__":
58 main()
The above stores a URL someone has entered, and then stores other URLs that match it by some degree (the "affinity"). The affinity is a numeric score. This is a simple one-to-many relationship, and to get at the data using standard SQL, we'd write something like:
SELECT
    entered_url.url,
    affinity_url.url,
    affinity_url.affinity
FROM
    entered_url
JOIN
    affinity_url
ON
    entered_url.KEY=affinity_url.FOREIGN_KEY
Here are the steps using the GAE Datastore.

Lines 42-49.
First, let's define the data Model. EnteredUrl defines a single string property, url, for the obvious reason. AffinityUrl defines a string property for url, as well as a float affinity property, for storing the score.

Lines 48-49.
Also, AffinityUrl defines a db.ReferenceProperty named entered_url, which refers to an EnteredUrl object. This is the link between our two data objects, and how we effectively do a JOIN. The collection_name, matched_urls, is used to refer to the collection of AffinityUrl objects that will be linked.

Lines 12-13.
When the page is loaded in the browser we create an EnteredUrl entity named url, setting its url property to a string value.

Lines 15-27.
We setup two AffinityUrl objects, and assign them both a url and a numeric score. Additionally, we point entered_url to our EnteredUrl object, url. We have just linked one object (url) to many (match1, and match2).

Line 29.
This line queries the data in the one-to-many way, and stores it in an object, matched_urls, which I pass through to the template for iteration and output. This is where the collection name we defined in the db.ReferenceProperty attributes is used. Note that the collection name, matched_urls, is called like a method from url, since url is the object being referenced.

Line 31.
Additionally, for illustration, I query the AffinityUrl object data and save it in aff_entries. Just as in SQL, where you can JOIN tables, or query them individually, the App Engine allows you to do both. (Hopefully, you've realized by now that although they look and are accessed differently, these linked entities are behaving quite a lot like relational database tables.)

In the template, I output the data from matched_urls by getting each AffinityUrl object in the collection, and displaying that URL. Note that because of the .order('-affinity') call, we are displaying the URLs with the closest affinity at the top (descending order).
<table>
{% for affurl in matched_urls %}
<tr><td>{{ affurl.url }}</td></tr>
{% endfor %}
</table>
Load this up in your browser, and refresh a few times, and this is what you get:



You may have noticed from the code that I also pass all the data stored in the AffinityUrl model (line 31) to the template as well. This is output in the second table, above.

Because I've refreshed the page several times, I've generated and stored the match1 and match2 objects multiple times to the Datastore. This highlights something strikingly different about the Datastore and a SQL table. SQL statements like the one I give will display all the entries that match between EnteredUrl and AffinityUrl, even if entries in AffinityUrl are duplicated. As you can see, even though we have duplicate AffinityUrl entities stored, the reference from the EnteredUrl entity is smart enough to realize that they are duplicates, and only displays the ones that are unique. Update: please see the comments for a correction of the previous statements. The Datastore is creating new entities each time with a unique ID...

The Datastore takes a little getting used to, especially for those experienced in the standard relational data models. (Good ol' paradigm shift.) The GAE documentation feels unfinished or at least rushed, which is unfortunate. I personally think they should have concentrated more on giving good examples that demonstrate mapping relational concepts to Datastore concepts, since the majority of developers looking at the GAE will be old hands at the relational stuff.

I'm sure they'll get there eventually. In the meantime, I hope you found this tutorial useful.
posted by JA at    [ 4 Comments ] reddit
Arbinger Systems: LUI - aumpel - HTMLCaptcha - Sylbi - madxlib

Monday, April 14, 2008

I'm Trying To Quit... Commercial Software, Pt. 1

Trying to quit This experiment started out simply enough. It was 2007, and I got a new laptop. I had been running Quickbooks 2004 for our checking accounts, and Office 2003 for our meager office tools needs. I decided this software would stay on my old laptop (now my wife's), and I would try Free Open Source Software (FOSS) alternatives on the new one. I was bored with Office, and fed up with Quickbooks, anyway, so why not?

From there, the experiment broadened, and I decided to see if Linux/FOSS could keep me from ever having to boot into a proprietary system (Windows), or use proprietary software. I decided to keep notes, and now I seem to have enough material to start sharing the experience.

This is where it begins. I replace Quickbooks with GnuCash, and Microsoft Office with OpenOffice on my new laptop, which is running Windows XP.

GnuCash

Since I wasn't sure of anything, I didn't move our checking accounts out of Quickbooks. My wife and I were simply doing the laptop shuffle anyway, so it was just easier to leave everything where it was, and continue to maintain our registers on the old laptop.

However, we wanted to start a monthly budget, and I decided to let GnuCash step up and take a shot. Installing GnuCash was as easy as any other Windows application. Simply download the installer and run through the prompts. No sweat.

After doing a minimal amount of reading, and marginally more button punching and tab poking, I figured out that I would have to first create a register, and then apply a budget estimation to it.

So I setup a register called Monthly Budgeting. We decided on a monthly dollar amount, and I made this the initial deposit. Then, I began entering our receipts.

Here's what the register looks like:


So far, nothing surprising or mind-boggling. GnuCash felt a lot like Quicken. There's only so much variation a register is going to have, after all. This is good, because it means that the learning curve from one product to the next is minimal.

After finishing all my month's entries, I did some more poking around, and finally got the budget estimate working. Hint: Select the Budget tab, click "Options" to set your intervals etc, and then click "Estimate".

Here's our budget after a few months of keeping track. The budget outline for the Monthly Budgeting register is displayed horizontally, each month showing whether you are under budget (positive dollar amount), or over (negative) for that period.



The only problem I've had was with the backup. Quickbooks has an easy backup feature, and the backup is stored in a single file. I've been backing up GnuCash by copying all the files from its directory to a flash card.

This seems to work okay, but at one point GnuCash (or I, or both) got confused, and I had to restore from the backup directory, and in the end I lost about a month's worth of entries. The backup could be a little easier, I think.

GnuCash has worked out well. I've since added my business register to it, and it has all the standard features that you would at least find in Quicken. I'm not an accountant, so I can't really say whether GnuCash could replace Quickbooks for a business. I can say, however, that it seems like a pretty painless way to not pay for software for managing your personal check registers and budgets.

OpenOffice

This will be pretty short. I barely ever use MS Word for anything, but occasionally need Excel. My wife uses Word the most, but not in any way that OpenOffice (or even Wordpad) couldn't handle.

So far, Calc has been sufficient for my spreadsheet needs. There was barely a learning curve, and like I said, I don't make too many heavy demands on a spreadsheet. MS Access is another story, but for me that's more of something that I might use in development (say, of a .NET application, because it was convenient), so I'm not including it here.

Like GnuCash, I think OpenOffice ranks high enough in quality and design to work fine for a very large percentage of home users, and even for a lot of offices. As time progresses, whatever gaps there may be will only get narrower.

So...

As you might have figured out by now, this experiment is not a feature-by-feature scrutiny of competing products. I'm just using software the way I would normally, which is essentially, "I don't care about feature X, until I need feature X". I think most people work this way, unless they have a specific reason to become an expert. I'm not an accountant, doubt I will ever be an accountant, so I don't put a whole lot of time learning every arcane feature available in Quickbooks. I learn enough to do what I want, and won't go further until I need to.

In this experiment, FOSS is effectively graded on whether or not it can substitute all or most of my proprietary software needs, in the way in which I use software. It is highly subjective, and human nature, like laziness and apathy, is very much a part of it, as you will see.

(Next up: My old laptop dies, and we have to get another one. I decide to try Linux along with Vista, and see how little I actually have to use Vista.)
posted by JA at    [ 0 Comments ] reddit
Arbinger Systems: LUI - aumpel - HTMLCaptcha - Sylbi - madxlib

Thursday, April 10, 2008

Insert or Update With a Single SQL Statement

sqlEver come across the situation while developing data-driven web applications when you needed to create a new record if one doesn't exist, but if one does exist, then you need to update it instead?

I certainly have, and I must admit with some shame that in the past I've handled it in the most obvious, and least elegant and efficient way, by

querying SQL for the existence of the record,
checking the result set in my code by looping and assigning a variable,
checking the variable for a value
, and if one doesn't exist, then doing the insert.
Otherwise, doing the update
.

There are a couple problems here. First, it's a lot more code than necessary. Second, it requires two calls to SQL instead of one.

You can eliminate this by making SQL do the conditional logic for you, via IF EXISTS. Here's the sample:
IF EXISTS(
SELECT 1
FROM MY_TABLE
WHERE ITEM='somevalue' AND ENTERDATE='12/31/1999')
--Update Statement
UPDATE MY_TABLE
SET ITEM='anothervalue'
WHERE ITEM='somevalue' AND ENTERDATE='12/31/1999'
ELSE
--Insert Statement
INSERT INTO MY_TABLE
(ITEM, ENTERDATE)
VALUES
('somevalue', '12/31/1999')
EXISTS lets you run a query statement, and if a value is returned, it outputs true. Otherwise, it outputs false. Couple that to IF/ELSE, and you can see how useful this particular SQL clause is.

The query inside EXISTS returns 1 if the parameters in the WHERE clause match, and returns nothing otherwise. What we return really doesn't matter. We're interested mainly in the parameters. If the parameters match something, then we will update them. Otherwise (ELSE), we insert them into the table.

Pretty simple. We just add our code parameters to the above statement (if your language uses parameters, e.g. Perl or C#), and send it on its way. One SQL call, and a lot less logic.

Update: I should have been clearer. This is TSQL, and will not work, in say, MySQL. (Thanks anonymous commenter!)
posted by JA at    [ 4 Comments ] reddit
Arbinger Systems: LUI - aumpel - HTMLCaptcha - Sylbi - madxlib

All content copyright (c) Arbinger Systems