Google App Engine: [A Better] Many-to-many JOIN
6 Comments Published by JA on Wednesday, April 30, 2008 at 9:00 AM.
(This is a follow-up to my original post GAE: Many-to-many JOIN. It probably wouldn't hurt to read that first, since this post sort of assumes you have.)After getting some feedback on my original post, a simpler, more SQL analogous way to obtain the many-to-many behavior was pointed out to me.
I've created another sample (download it here), and will go over it below. Afterwards, I'll talk about why you shouldn't model your data this way, and instead should denormalize your data for optimization in the Datastore.
Here are the new data Models. (The full code listing is here.)
class Libraries(db.Model):I still have the
notes = db.StringProperty()
class Books(db.Model):
notes = db.StringProperty()
class Library(db.Model):
name = db.StringProperty()
address = db.StringProperty()
city = db.StringProperty()
libscol = db.ReferenceProperty(Libraries,
collection_name='libscol')
def books(self):
return (x.book for x in self.librarybook_set)
class Book(db.Model):
title = db.StringProperty()
author = db.StringProperty()
bookscol = db.ReferenceProperty(Books,
collection_name='bookscol')
def libraries(self):
return (x.library for x in self.librarybook_set)
class LibraryBook(db.Model):
library = db.ReferenceProperty(Library)
book = db.ReferenceProperty(Book)
Books and Libraries models, as you can see. These are needed to collect the Library and Book entities so I can easily iterate over them and output. The Book model contains a reference to Books, via Book.bookscol, and Library to Libraries, via Library.libscol.The
LibraryBook model just contains references to the Library and Book models. This creates our "join". After we add libraries and books to the Datastore, we will link them to each other using LibraryBook entities.When the page loads, we first create and store our data entities.
# Library collectionFirst, we create our
libs = Libraries()
libs.put()
# Book collection
books = Books()
books.put()
# Setup libraries
lib1 = Library(name='lib1', address='street a',
city='city1', libscol=libs)
lib2 = Library(name='lib2', address='street b',
city='city2', libscol=libs)
lib1.put()
lib2.put()
book1 = Book(title='book1', author='author one',
bookscol=books)
book1.put()
book2 = Book(title='book2', author='author one',
bookscol=books)
book2.put()
book3 = Book(title='book1', author='author two',
bookscol=books)
book3.put()
book4 = Book(title='book2', author='author two',
bookscol=books)
book4.put()
book5 = Book(title='book3', author='author two',
bookscol=books)
book5.put()
l1 = LibraryBook(library=lib1, book=book1)
l2 = LibraryBook(library=lib1, book=book2)
l3 = LibraryBook(library=lib1, book=book4)
l4 = LibraryBook(library=lib2, book=book4)
l5 = LibraryBook(library=lib2, book=book5)
l6 = LibraryBook(library=lib2, book=book3)
l7 = LibraryBook(library=lib2, book=book1)
l1.put()
l2.put()
l3.put()
l4.put()
l5.put()
l6.put()
l7.put()
Libraries and Books entities, libs and books. These will be passed into each Library and Book entity we create.After we create our books and libraries, we generate a lot of
LibraryBook entities, assigning a library and a book to each one. Each LibraryBook entity now links one library with one book. As you may have noticed, some books are assigned to both libraries, some are not.Library contains a method called books(). It returns every book in the librarybook_set as an iterable data structure. Because LibraryBook holds a reference to Library, any Library entity (say, lib1), is given a back-reference to the collection of LibraryBook entities. If you do not define a collection_name, GAE automatically creates one by appending "_set" to the model name. This is where librarybook_set came from, in case you were wondering.Given a library entity like
lib1, the books() method allows us to easily return all the books at that library by simply assigning or iterating over lib1.books(). The Book model contains a method called libraries() which does just the opposite: allows you to get all the libraries where a given book resides.Our data has been created and linked. Now we pass it in to the template.
template_values= {
'lib': lib1.name,
'books_at_lib': lib1.books(),
'forbook': book1.title,
'libs_by_book': book1.libraries(),
'libs_books': libs.libscol.order('name'),
'books_libs': books.bookscol.order('-author').order('title')
}In this example, we not only display all libraries and all books (via libs_books and books_libs) the way we did in the previous post, but also output all books at a library (books_at_lib), and all libraries that contain a given book (libs_by_book).
Here's the template, if you want to take a look at it.
Denormalize your data
As I stated before, the GAE Datastore is not a relational database. Databases were designed for compactness and efficiency, and normalization is used, in part, as a way to minimize the size of your data on disk.The Datastore has been built, first and foremost, with scalability in mind. Scalability means, in essence, "add more servers as needed, without re-writing your code". Specifically to the GAE Datastore, it means "disk space is cheap, stop worrying about it, and scale".
Consider modifying our
LibraryBook model above to look likeclass LibraryBook(db.Model):Now, we are not only storing each book's title in the
library = db.ReferenceProperty(Library)
book = db.ReferenceProperty(Book)
booktitle = db.StringProperty()
libraryname = db.StringProperty()
LibraryBook entity, but we are also storing it in the title property of the referenced Book entity. While this is obviously not space efficient, and certainly not the elegant, normalized way of storing relational data our brains are used to, it scales well and is fast.It scales because the Datastore runs on who knows how many commodity computers in the background (without the knowledge of our application), and it's fast because we have the most commonly needed fields available immediately. If you need to poke further into the data, like to get the street address of the library, you would use the referenced models, and our JOIN then comes into play.
(Thanks, Ben the Indefatigable for illuminating this.)

Hi Mr Robson,
There is something I don' get with your blog post/example. Why use the Books and Libraries models?
I've hacked your examples removing them and replacing the use of them with the following (snippet):
template_values= {
'lib': lib1.name,
'books_at_lib': lib1.books(),
'forbook': book1.title,
'libs_by_book': book1.libraries(),
'libs_books': Library.all().order('name'),
'books_libs': Book.all().order('-author').order('title'),
}
I can see a use for them if you actually had different groups of them but for place holders like how you use them I just don't get it?
I think your example it would be significantly easier to understand if they where removed too.
Best Regards,
Johan Carlsson
http://www.easypublisher.com
Johan: You're right, they are unnecessary, in retrospect.
It was something I missed. If you care to post a link to your sample, that might be useful for others to view.
Also, now that I understand the datastore better, I wouldn't try as hard to design a many-to-many relationship. I'd flatten and not worry about redundancy. It's more work, and with the Datastore, we don't care particularly about size on disk.
A cleaned up sample by Johan Carlsson:
Sample
Insightful.. the code sample from Carlsson also clear things up.. but it this the 'proper' way to do it? I mean will it scale?
I have a similar scenario, say I have a Post object and Tag object, Post of course can have many Tags, and Tag can be associated to many Posts.
How do I flatten this without using M:N relationship? And second question, should I flatten it?
You should only 'normalize' where there is absolute need, as this will scale more poorly than flattened data.
By flattening data, you are resigning yourself to data redundancy, but this is fine as far as the platform (Google Apps) is concerned. 'Disk space is cheap, stop worrying about it and scale.'
I would also note that the M:N relationship is still there, in the flattened table, i.e. there are many redundant somethings mapping to exactly one something elses within the table. There just not being stored in an optimized way, like a "true" M:N relationship in an RDBMS.
>compactness and efficiency, and normalization is >used, in part, as a way to minimize the size of your >data on disk.
The real reason behind normalization is that a table is a set, a tuple is a set. set can't contain duplicates and non-values such as NULL.
SQL violates the relational model because it allows duplicates and uses NULLs. Introduce a 3-Value-Logic by allowing "true-value" to be True, False or NULL. The SQL NULL is an abberation; NULL is a non-value but NULL == NULL evaluates to true. SUM() ignores null but COUNT does not... empty strings are equivalent to NULL but not in a WHERE statement... I could go on and on. In Python and thus the GAE runtime, None is less of a problem because None == None is false. You can also control the use of None in your validation methods.
To assert what you don't know, you have to normalize and maintain a separate table to keep track of missing information, the same way GAE recommends to maintain your own counters since they don't provide a count operator.
http://www.dbdebunk.citymax.com/page/page/1396241.htm
http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf
Another approach is How To Handle Missing Information Using S-by-C (Darwen/Smout)
http://www.dcs.warwick.ac.uk/~hugh/TTM/HTHMIUS-by-C-review-draft.pdf
What's is really interesting with the App Engine datastore is that a GQL query doesn't return missing values. *Because the Datastore doesn't return missing value**, you can imagine that a Kind is similar to a SQL view and that all the tables underneath are normalized as suggested by Fabian Pascal in the Final Null in the Coffin.
Even though the datastore is not a relational database, you can use a strict relational model on top of this implementation. To a certain extent this is what GAE's db.Model is trying to achieve.
The only things that have been preventing us from building truly relational database in the past was performance problem in the implementations, the shortcomings of the SQL language and ignorance. Google has fixed the performance problem.