Date: 22 July 2009, Revision: 0.28
Massively scalable web applications encounter a fundamental tension in computing between performance and correctness: performance is often addressed by using a large and therefore distributed machine where programs are multi-threaded and interruptible, whereas correctness requires data invariants to be maintained with absolute certainty. A solution to this problem is transactions [Gray-Reuter].
Some distributed systems such as Google App Engine [GAE] provide transaction semantics but only for functions that access one of a set of predefined local regions of the database: a "Local Transaction" (LT) [TransGAE]. To address this problem we give a "Distributed Transaction" (DT) algorithm which provides transaction semantics for functions that operate on any set of objects distributed across the machine. We assume an underlying layer providing "strong" (distributed-) consistent LTs.
The algorithm is "optimistic" [opt-cur] in the sense that it is optimized for low contention: it is assumed that most DTs will succeed but occasionally contention can make a DT abort.
Correctness and performance are two of the essential concerns of software engineering.
Correctness: the output is what you want;
Performance: the output doesn't cost too much,
where cost is expenditure of any resource: time, space, energy, money, people, machines.
Reasoning about correctness is best done by establishing and maintaining
Invariants: sentences the truth of which does not change while all else is changing.
Typical invariants maintain the integrity of
data structures, for example in a doubly-linked list x->next is null or x->next->prev == x [McPeak], or
global conservation laws, for example in a bank the total amount of money remains constant even as transfers are made.
Scalable performance requires distributed computing: the machine as a single-point abstraction is a myth that can no longer be maintained. Distributed computers exhibit some properties that make programming challenging:
Non-Reliable: ongoing random local failures of machines and their processes,
Non-Serial: operation in massive parallel where data is constantly being read and written by multiple threads,
Non-Synchronized: lacking in a single point coordinating behavior or even a global notion of time.
Maintaining correctness on a distributed machine requires
carefully distinguishing the "good" states of the machine, those where invariants hold, and
grouping operations together into "transactions" that take us from state to state.
We need these transactions to have the following properties.
Durable: states persist once achieved,
Atomic and Isolated: there are no in-between states for yourself or others.
Semantic-Consistent: transactions go from one good state to another good state.
Some databases, such that of Google App Engine, provide the ability to
partition the database into disjoint parts, and
run a transaction on the data within a single part.
We call these "Local Transactions" (LTs) as their transactional semantics is only available to local parts of the database.
This is a paper on distributed databases. The word "consistency" already means one thing in the context of distributed algorithms and another thing in the context of databases. Above we defined a "semantic-consistency", which is the databases notion of consistency. Here we discuss "distributed-consistency" which is the distributed algorithms notion of consistency.
The Google App Engine teams calls the property we need (and that GAE provides) "strong [distributed-] consistency"; we will define it more precisely below as "synchronous local sequential distributed-consistency"; see the section "Full proof of Isolation".
In general programs need the ability to do operations across an unrestricted set of objects within a single transaction. In this article we presuppose a layer providing Local Transactions. Upon this layer we construct an algorithm providing transactions
for read and write operations
on any set of objects
that are specified directly by their keys.
In some sense such a layer has "distributed" its semantics across the partition and thus we call it a "Distributed Transaction" (DT) layer.
When using our Distributed Transactions the set of objects operated upon must be specified directly by their keys, as one does with an object store, not by predicates on their properties, as does with a general relational query. Indices in GAE are not updated at the same time as the data therefore the set of query results can differ from those that satisfy the predicate [TransIsolGAE]. Further, queries do not honor Local Transactions. Such properties make it difficult to make queries honor Distributed Transactions and doing so is beyond the scope of this article.
The absence of queries is not such an onerous restriction: one may still implement all forms of relations, though not necessarily using traditional relational idioms. For example, to implement a one-many relation, use a list field from the one to the many rather than a who-is-pointing-at-me query for fields pointing from the many to the one [Barrett].
Google App Engine [GAE], backed by Big Table [BigTable], provides a method "run_in_transaction" that takes a client function and its arguments and runs that function on those arguments in a single transaction [TransGAE, TransIsolGAE]. GAE transactions have some restrictions as follows.
GAE libraries provide a means by which an object can specify a "parent" object at creation time. A connected component of such parent relationships is called an "Entity Group". A GAE transaction can only operate on objects within one "Entity Group".
In a relational database, all specifications of both single and sets of objects are done with a "where" clause predicate, which implies a query to map the predicate to the result set. In contrast, in the GAE/Big Table model there is a distinction between an access to an individual object versus a query. GAE Local Transactions allow the reading and writing the objects as accessed by their id; GAE does not allow queries within Local Transactions.
Our contribution is that we provide transactional semantics without restriction (1): we create a kind of transaction that works across objects not in the same Entity Group. We call these transactions "Distributed Transactions" (DTs) in order to distinguish them from the original GAE "Local (to one Entity Group) Transactions". We do not remove restriction (2): our Distributed Transactions do not work for queries.
It should be noted that the underlying Google App Engine database is "strongly [distributed-] consistent" [Barrett, DatastoreIntro]. We define this below more precisely as "synchronous local sequential distributed-consistency"; see the section "Full proof of Isolation" for more.
Our algorithm applies to any database where the data is partitioned and where data in a single part can be operated on in a single Local Transaction and where these operations are strongly distributed-consistent; however for the purposes of specificity, we will uniformly adopt the terminology of the Python version of Google App Engine. To those who are experienced in this field we think such specificity will not be an undue burden. To those who are new the GAE terminology will give them a specific system with which to play while learning; we think this best as all algorithms are best learned first in the specific.
Please note that GAE uses the terms "get" and "put"; since we provide code fragments we use these terms as well when referring to actual GAE puts and gets. However the more established algorithmic terminology for database interactions is "read" and "write" so we revert to those when not speaking of a literal GAE put and get.
As suggested in "Transaction Processing" by Jim Gray and Andreas Reuter [Gray-Reuter], we want to know that our Distributed Transactions provide the traditional ACID properties: Atomicity, (Semantic-) Consistency, Isolation, and Durability. However many more properties than those are needed for a practical framework, as follows below.
In order to state these properties we need names for the various roles different parts play in the system:
User: the human for whom a service is being provided;
App: the software providing the service to the user;
Client: a function that is run in a transaction by the Distributed Transaction infrastructure.
Objectification: Data is just a blob of bits; we want an abstraction on top of data where we can distinguish separate non-overlapping objects that persist over time.
Uniqueness: Objects should have unique IDs by which they can be one-to-one identified. Further, the identity of an object should remain well-defined: multiple versions of an object should not be simultaneously accessible to the client.
These may seem very simple, but I included them because I have violated them before!
The ACID properties of a transaction are more correctly considered in the following order which starts with the uniquely-specified objects distinguished above and builds successive layers of abstraction.
Durability: Objects live in a place called a database as distinguished from working memory. While when operating on objects we have an ephemeral copy in memory, whereas after transaction completion the object in the persistent database.
Atomicity: Memory objects interact with objects in the database by read and write operations. A transaction is a set of such operations. The operations of a transaction either all occur or none occur.
Isolation: The database state and the result of all operations compute as if (1) the transactions were ordered according to some total order and (2) all of the operations of a transaction occured before all the operations of the preceding transactions and after those of successive transactions.
Semantic-Consistency: Each transaction maintains the application-level invariants. Note that this is not a property of the transaction layer, but a property of the application's use of it.
Local Sequential Distributed-Consistency: In GAE Local Transactions ensure that each entity group exhibits sequential distributed-consistency [seq-con]. We preserve this property. Further, in section "Application idioms" we suggest idioms that provide either best-effort or guaranteed (user-) synchronous local sequential distributed-consistency.
Thread safety: Multiple threads can interact with a DT. The state transitions of the algorithm must only be within the correct state transition graph (see below) no matter how many threads are rolling forward the same DT.
Can always make global progress — no deadlock: the system should not get into a state where it cannot make global progress.
Always making local progress — no starvation: no local part of the system should fail to make progress even if in theory it could. Note that we do not attempt to guarantee any kind of "fairness" to this progress.
All long-term storage resources are semantic — no garbage: a Distributed Transaction should not leave behind garbage objects. There is a kind of garbage that does collect: object keys are never re-used.
Preservation of read-only-ness — no gratuitous writing: In a distributed system there is the constant threat to progress due to contention for user objects. While reads share, writes exclude and are therefore more expensive; therefore the "read-only-ness" of an operation on a user object is a resource not to be wasted. That is, we do not want a DT read on a user object to turn into an LT write on that object due to, say, the implementation of the locking protocol under the hood.
The app would like to run a client function in a Distributed Transaction. The app calls the distributed transaction infrastructure requesting that the client function be run:
distributed_run_in_transaction(client_func, *args, **kwargs).
The client function is synchronously run in a Distributed Transaction as follows.
A Distributed Transaction (DT) object is created to track the state of the process.
At the end of the process the DT object will also hold the result: whether it succeeded or failed and either the return value or the exception that caused the failure (see section "Completion state is visible even in the presence of asynchrony").
If the thread does not time out then the function will return the DT object synchronously but if it does not then another thread such as a thread started to serve a subsequent user request will have to query for DT objects.
Note that this is the basic Optimistic concurrency [opt-cur] strategy: let the client function do whatever it does without locks and only later get locks and attempt to commit the operation to the database.
Note that the client function should not have any interaction with the outside world other than reading and writing from/to the database; any other such sources and effects cannot be managed by the Distributed Transaction infrastructure.
The essence of the transactional nature of this algorithm is that the integrity of the semantic time implied by the dataflow of the client code is preserved even though actual time is being rearranged for performance reasons:
The actions of the client, specifically the reads and writes, are only recorded when they occur and are committed to the database later.
If the database has changed out from under the client in the meantime this inconsistency can be detected by comparing the version numbers of the objects at access time versus commit time.
If such an inconsistency arises at commit time then the transaction is aborted by simply throwing the uncommitted transcript of changes away.
Since the transcript of client operations is complete, it is possible to acquire a set of locks sufficient to temporarily monopolize the part of the database to be modified. Thereby we guarantee that committing the transaction can be performed while ensuring all the desired transactional properties.
Another way to look at it is that the only effect of the client is to map reads plus other input to writes. Atomicity and Isolation are satisfied if, from the point of view of the database, it looks as though there is one instant in time where the client did its entire computation: reading, computing, and writing all at once.
The algorithm has two major phases: (1) Run & Record and (2) Commit.
Run the client program intercepting its database reads and writes.
When the client first reads an object, serve it from the database and also record the object's version number.
If the object read is locked, suspend this DT and roll forward the DT holding the lock.
Maintain an "optimistic" cache of objects read and written/created/deleted.
Serve subsequent reads of the same object from the cache; optimistically ignore the possibility of the object having changed in the database (any coherency failures will be caught later).
Record writes in the cache, preventing them from going to the database.
When the client is done, flush the cache of any written (or created or deleted) objects by writing their state to "shadow" objects. Write each shadow object into the same entity group as its respective original object. We also record the objects read in get_list_obj and their versions in get_list_version and the objects written in put_list_obj and their shadows in put_list_shadow.
Make permanent in the database the map from reads to writes computed by the client.
Lock written objects in order:
Sort the written objects by their keys and group by entity group.
In sorted order for each written object, get a write lock on it.
If the object is already locked, suspend this DT and roll forward the DT holding the lock.
Check read objects: for each read object abort unless
not-locked-by-other: no other DT holds the write lock on the object, and
has-recorded-version: the database object still has the recorded version number.
Complete writes by entity group: for each written object and in one LT per entity group,
copy the state of the shadow object to that of the real object,
delete the shadow object, and
delete the write lock.
Note that for (client-) named DT-flavored objects, we maintain a pure meta-data object whenever the object is not in existence for semantic continuity of meta-data for the name.
If the "Run & Record" phase times out, the DT eventually aborts whenever the next thread finds it.
If an object fails "Check read objects" then the DT immediately aborts.
Atomicity: an abort is not possible after the write locks have been obtained and the read objects checked.
Isolation: There is one moment after all the write locks are obtained where the DT "has all locks": mode 2locked. This moment must serialize with that of any other DT that interacts with the same objects where at least one interaction is a write.
Thread safety: Each step of the algorithm proceeds with atomic steps along a monotonic path in the state space taking only correct transitions; thus despite multiple threads the state of the entire distributed transaction will stay within the correct state transition graph.
No deadlock: Writers in stage 3checked only wait on write locks of other writers in 3checked. Locks are obtained consistent with a global order. Readers only wait for writers when in stage 0init and holding no locks. Therefore no wait-for cycles can arise.
No starvation: If a thread times out once in 1ready it will be rolled forward by other threads: blocked DTs roll forward blocking DTs, the app should roll forward old DTs started by the same user before starting new ones, and a background thread rolls-forward any old DTs it finds.
We require the following new data structures or new fields on existing data structures.
A Distributed Transaction, DT, object has the following members:
created: the datetime when created; created = db.DateTimeProperty(auto_now_add=True)
modified: the datetime when last modified; modified = db.DateTimeProperty(auto_now=True)
user: the User who requested the DT.
mode: one of None, 0init, 1ready, 2locked, 3checked, 4done, 3aborting, 4aborted.
"get_list", a synchronized pair of lists. Conceptually, get_list is just a list of [key, version number] pairs. The fact that this semantics implemented as a pair of synchronized lists instead of a list of pairs is due to the underling Big Table architecture.
get_list_obj: a list of keys of objects gotten; "None" is the key if there was no object.
get_list_version: a list of the version numbers of objects recorded when objects were gotten; "None" is the version if there was no object.
"put_list", a synchronized pair of lists. Similarly to get_list above, put_list is conceptually a list of [key, shadow] pairs, but must be implemented as a synchronized pair of lists:
put_list_obj: a list of keys of objects, each to be replaced by the corresponding object pointed to by put_list_shadow, or None if the object is new.
put_list_shadow: a list of keys of instances of class Shadow to be copied and then put, or an instance of ShadowDelete if the object is to be deleted.
Optimization: We could get rid of the "created" field and just use the "modified" field in its place with some modification of the semantics.
In order to ensure thread safety we constrain mode transitions to be allowed within the correct mode transition directed acyclic graph (DAG) which is the union of the following graphs:
Normal progression: None -> 0init -> 1ready -> 2locked -> 3checked -> 4done.
Abort progression: { None, 0init, and 2locked } -> 3aborting -> 4aborted.
Create progression: (DT Non-existent) -> { None and 0init }.
Delete progression: { None, 4done, and 4aborted } -> (DT Deleted).
Each object in the database is required to have the following additional properties for use by the DTs system:
write_lock: a key of the DT object having a write lock on this object, or None.
version: a "version number", actually the key of the last DT object to write the object.
Note that all user objects also have a "flavor": either LT or DT. However (client controlled) LTs know nothing of this DT meta-data; that is, no object will have "flavor = LT". Any means to infer that an object is participating in the DT infrastructure is enough to know that it has flavor "DT: there is no point in spending space actually recording such a field. See section "Interaction of LTs and DTs".
We also have a class shadow that in Python would inherit from Expando. It stores shadow copies of user objects that the client has put but have not actually been written to the database yet. Each shadow object is put in the same entity group as the user object the state of which it stores. Further we need a special ShadowDelete class which we use when the object is to be deleted rather than having its state updated. In addition to having shadow copies of fields of the object in question to be written, it has the following properties.
created: the datetime when created; created = db.DateTimeProperty(auto_now_add=True)
dist_trans: a key of the DT object that created this object for its shadow use.
Optimization: We could get rid of the created field if GAE queries become reliable; see section "Handling hard timeouts in a background thread".
Note: Due to an artifact of the underlying Big Table implementation, we might need to also save copies of meta-data in the shadow objects. For example we may need to do something special for objects that have an empty list member.
We keep a cache of user objects that have been read or written. If an object is operated on more than once, we use the cached copy instead. We optimistically ignoring the possibility of the object having changed in the database; note that any cache coherency failures will be caught later.
Specifically, the cache consists of three parts as follows.
Object cache: A map from keys to cached objects.
All reads return the object in this cache first if it is there. If it is not there, that is, on the first read if there have been no previous operations, then the cache is populated by a read from the database.
All writes write to this cache, not the database.
Version number cache: A map from keys to version numbers.
When the object is read the first time its version number is recorded here.
If the object is never read, only written, then there will be no version number here.
Write operation cache: A map from keys to the last write operation done on the object:
"get" if the object was never written,
"put" if the last write op was to put,
"delete" if the last write op was to delete.
When the client is done all of the objects flagged as "put" and "delete" in the write operation cache are written to the database as shadows before transitioning to mode 1ready.
In order to prevent an app author from picking a name that collides with that of the DT infrastructure, all names in the shared namespace should be prefixed with "__DT" and the app author informed of this restriction. Note that this does not violate the rules for names of keys as those rules prohibit a name to both start and end with a double underscore [Keys]. The names that should be protected this way are:
The DT class should be named "__DT"
User object meta-data should be
"__DT_write_lock" and
"__DT_version".
The shadow classes should be
"__DT_Shadow" and
"__DT_ShadowDelete".
The fields of the shadow object should be
"__DT_created" and
"__DT_dist_trans".
The names of the caches should be
"__DT_object_cache",
"__DT_version_number_cache", and
"__DT_write_operation_cache".
Make a DT object dt.
Set dt.mode to 0init.
Put it into the database; note that this will automatically set dt.created.
Note that there is no need for a GAE Local Transaction as we are creating the DT object.
Now the client runs, using the API on dt to read and write objects instead of using the standard database API.
The client gets an object X from the database using its key by calling dt.get(key) (instead of the using the usual db.get()). This operates as follows:
Return the cached instance of the object if present.
Get the object from the database by key in the standard way.
If it does not exist and if X_key is a name key (instead of a numeric id [Keys, KeyClass]; see section "Namespace management") get the corresponding pure meta-data object and use it instead; see section "Persistent meta-data for names".
If the object is not DT-flavored, abort the DT.
If the read object is locked, suspend this DT and roll forward the DT holding the lock and then retry.
Cache a copy of the object; if there was no object, record "None". Also cache the version number of the initial read.
Return X to the client; return None if we found no named object but a pure meta-data object instead.
If we read an object that already has a write lock and therefore suspend the current DT to roll forward the other, the current thread could time out. Since the DT is still in stage 0init this will abort the DT: there will be no way to roll it forward later. However there is no better alternative: the locking DT is in stage 3checked and cannot be aborted even if we thought it would be prudent to do so and if the reader proceeds it will simply abort later during "Check read objects" (unless the writer aborts).
On a second read of the same object the cache could hide the presence of a write lock or a new version number on the object. In the case of a write lock we miss the opportunity to roll forward the other DT, possibly resulting in it aborting and allowing this DT to proceed. However, since the basic assumption of this algorithm is optimistic, that most transactions succeed, we assume this is an unlikely win. Mostly if we read an object a second time and it is write locked or has a new version number we are out of luck and are going to abort. Checking for this case is a trade-off as it costs an extra read; given that we have assumed that we are in an optimistic regime, we make the trade-off in favor of this case being rare and just always use the cached copy.
For every object X that the client wants to put to the database the client calls dt.put(X) (or create or delete) instead of the usual, say, db.put(X) or X.put(). We imitate the behavior of the App Engine database though caching the results rather than writing them to the database. This operates as follows.
Write X into the cache.
If X is being deleted, cache a None.
Also cache the write operation:
If X is being put, cache "put".
If X is being deleted, cache "delete".
Note that caching a delete entry for X_key means that if the object is read again, None is returned. That is, this means the fact that the object was deleted is cached, and not that the object is deleted from the cache.
When the client function is done flush the cache as follows.
For each object X in the version number cache, do as follows.
Append X.key() to dt.get_list_obj.
Append X.version to dt.get_list_version. If there was no object, it has version "None".
For each object X flagged "put" or "delete" in the write operation cache, do as follows.
Make a new shadow object in the same entity group as X.
If X is flagged "put" copy the state of X into a new instance of class Shadow.
If X is flagged "delete" write an instance of the special class ShadowDelete which has no state.
Set shadow.dist_trans = dt.key().
Put the shadow object into the database (now the shadow has a key).
Append X.key() to dt.put_list_obj.
Append None instead if X is being created (has no key());
Append shadow.key() to dt.put_list_shadow.
Note that we need two classes for shadows: a class Shadow that holds the state of a put object, and a class ShadowDelete (possibly inheriting from class Shadow) having no state that indicates that the user object is to be deleted.
Note that a Local Transaction is not needed for creating shadow objects.
Note that at least at this layer of the database, copying the state of the user object to a new shadow object (rather than somehow cleverly reusing objects) is unavoidable. First, if the client function reads objects it has already written out of the cache, it expects the object to have the usual class, not class "Shadow". Second, we cannot write shadow objects into the database with the same class as the user object because then queries for the user objects could find the shadows. Those with access to lower layers of GAE can perhaps do an optimization and simply mutate the "Kind" of the object to Shadow and then back again.
Optimization: if the whole DT reads and writes objects all in one entity group, do everything in a single LT. Don't even bother to write shadows or get write locks; just check for the absence of write locks on both objects written and read, check the read version numbers for read objects, and then write straight to the objects. Instead of transitioning the DT in 1ready, go straight to 4done. If write locks are found on objects being written (but not read) then revert to the usual algorithm.
Optimization: When writing out the buffered-up shadows, sort them in the way they are sorted for write locking and write them out grouped by entity group. Ryan Barrett [Barrett]: "as long as they're in the same single put() or delete() call, they'll be grouped by entity group. this is definitely a significant optimization when you need to write to multiple entities in the same entity group."
Optimization: When writing out the shadows, also read in the objects and get write locks on them in the same pass; note that this now requires using an LT. If getting locks blocks, then revert to simply writing out the shadows from then on until the end of the pass. Record the transition point in the DT when we do the write transitioning to 1ready; resume getting locks from that point during "Lock written objects"; if we manage to get all the locks then write None as the resume point.
Best possible scenario: Note that the above optimization can be combined with the optimization below for objects that were both read and written where we do the read checking at the same time as getting the write locks. It is therefore possible in situations with no other conflicts to do all the preparation work in a single pass over the objects before completing writes.
This is a special case of section "Transitioning to a new mode" below; we repeat it here for clarity.
When writing out the shadow objects is done, in a Local Transaction:
Get the dt from the database. It should be in mode 0init.
If not, go into 3aborting: there is an obscure case involving huge clock skew where the garbage collection thread can accidentally have aborted this DT.
Set dt.mode to 1ready.
Save dt to the database.
Once the DT gets to mode 1ready timeouts are no longer relevant: future DT roll-forwards will eventually complete this DT one way or the other.
If we do not get to this point due to a timeout then the timeout handling code will clean up the shadow objects and notify the user that the DT failed when it reaches mode 4aborted.
This is exactly the second time the DT is written into the database; the first being when it was created in mode 0init. No user objects in the database have been written yet.
The deadlock avoidance algorithm requires us to get the locks in an order consistent with some global order (see section "Ongoing progress: No deadlock"). Any global order will do, so we use the object keys.
Sort sort put_list_obj by their keys and permute put_list_shadow to match.
Note that the key sort order should group objects by their entity group [Keys, DatastoreTalk]: (Wilkerson) "When I sort objects by their keys, that also sorts them by their entity group as that is the first part of the key, right?"; Ryan Barrett [Barrett]:"correct." In any case if it did not, change it until it does by sorting by entity group and then by the other parts of the key.
For each object key X_key in put_list_obj and its corresponding shadow key S_key in put_list_shadow, in sorted order, get a write lock on object X as follows.
If there is no shadow object S corresponding S_key, then the write has already been done and this thread is way behind another thread rolling forward the same DT which is already at least in 3checked or 3aborting. Go to "Dispatching on mode".
If X_key is None, the shadow is an object being created with a generated ID; there is nothing to lock.
Otherwise, do the following in a Local Transaction:
Get X from db.get(X_key).
If it does not exist and if X_key is a name key (instead of a numeric id [Keys, KeyClass]; see section "Namespace management") get the corresponding pure meta-data object and use it instead; see section "Persistent meta-data for names".
If it is not DT-flavored, abort the DT.
If X.write_lock is None:
Set it to dt.key().
Put X.
Otherwise, if X.write_lock is dt.key(), we already hold the lock, so do nothing.
Otherwise X.write_lock points at a blocking DT; suspend operation on this DT and roll forward the blocking DT, as follows.
Exit the Local Transaction.
Roll forward that other DT.
Try again to get the lock.
Transition into mode 2locked. Optimization: this transition exists only for the proof of Isolation and writing the DT into the data base can be elided.
Optimization: When getting write locks it is not strictly necessary to group the object interactions by entity group into one LT, but given the underlying implementation of LTs in GAE, the performance will be better if you do. (Wilkerson) "If I group a bunch of object gets together into one get call, that is faster right?" Ryan Barrett [Barrett]: "right!"
Optimization: if an object is read and written, check the version number in same Local Transaction as getting the write lock. (Once we have the write lock, no other DT can have it and therefore no other DT can write it and change the version number.)
We must not combine the two passes of getting write locks and checking reads into one pass, as if a read is checked before a write lock is taken, Isolation can fail; see the example in section "The algorithm is tight / Locking and checking". Therefore grouping getting write locks with checking read locks on two different objects in the same entity group into one LT is not correct. However we can do the reads, say, in reverse sort order, combining the last entity group of the locking pass with the first entity group of the read-checking pass into one LT even if the particular objects checked and read in that LT differ.
For each object key X_key, where
X_key in get_list_obj and
optimization: X_key not also in put_list_obj if checked during the write lock stage,
check it as follows.
Get X from db.get(X_key).
If X.write_lock is non-empty and not dt.key(), then abort.
If X.version does not equal the corresponding member of dt.get_list_version, then abort.
Transition into mode 3checked.
Note that None is a version number for a non-existent object.
If we recorded a version number at read time but at check time the object is gone then it now has that Great Version Number In The Sky which is not the version number you recorded; therefore the version number check fails and this DT must abort.
If we recorded None for the version number because it did not exist at read time but at check time it now does exist then the version number check fails and this DT must abort.
Optimization: When checking read versions it is not strictly necessary to group the object interactions by entity group into one LT, but given the underlying implementation of LTs in GAE, the performance will be better if you do. We already sort the written objects, but to facilitate this optimization it is likely easiest to also sort the read objects.
Note that we only test version numbers for equality, so as long as DT keys are never re-used correct semantics is ensured: that is, version "numbers" cannot roll-over.
One might contemplate the possibility of the following scenario: one DT reads and records that there is no object and then is suspended; another DT then creates it, then another DT deletes it, then the first resumes, does the version number check which again finds no object, so the check passes and the first DT continues.
This can't happen for (DT-flavored) objects with generated IDs because you can't attempt to read an object without its key and you can only have that if it had been created before: that is, you can't create an object with a generated ID twice.
This can't happen for (DT-flavored) named objects because of the pure meta-data object.
See section "Namespace management".
Completing writes should accomplish the following for each object that the DT has written. Recall that complete writes can be run in either mode 3checked or mode 3aborting.
Monotonicity: If there is no shadow object corresponding to shadow_key then there is nothing to do for its corresponding user object; otherwise, ensure the below properties.
If in mode 3checked:
Ensure the user object reflects the shadow object state even if we have to create or destroy it.
If an object remains, ensure its version is set to dt.key().
Ensure this DT does not hold the write lock.
Ensure the deletion the shadow object.
Ensure for named objects in the case of a creation or a deletion that the meta-data handoff is accomplished between the user objects and its corresponding pure meta-data object and that the pure meta-data object is created or deleted as appropriate.
There are many cases so we discuss in detail how to actually accomplish the goals above.
Note that it is critical to the correctness of allowing LT reads to mix with DTs that the "Complete writes" stage be done by grouping all interaction with one entity group into one LT. This requirement to perform operations grouped maximally by entity groups is in contrast to the other places where we suggest grouping operations by entity group for the purposes of optimization. Note further that if the DT aborted in 0init then the put_list_obj may not be sorted. Therefore sort the objects by their keys; see section "Algorithm detail: Commit / Lock written objects in order" for more discussion on this sorting ensuring grouping by entity group.
For each
object key X_key in dt.put_list_obj, and
its corresponding shadow_key in dt.put_list_shadow,
as grouped by entity group,
in one Local Transaction per entity group,
do as below depending on the dt.mode.
When done if in mode 3checked transition to mode 4done, otherwise if in mode 3aborting transition to mode 4aborted.
If dt.mode is 3checked, for each object do the following:
Get the shadow object from db.get(shadow_key).
Monotonicity: If there is no shadow object in the database corresponding to shadow_key, this write has already been done, so continue to the next iteration of the loop.
If X_key is None, this is a creation.
Create the target object X from the state stored in the shadow.
Set X.version to dt.key().
Put X into the database.
Otherwise, if the shadow is an instance of ShadowDelete, this is a deletion.
If X_key is a name key (instead of a numeric id [Keys, KeyClass]; see section "Namespace management") create the corresponding pure meta-data object for X using state as if we were updating X rather than deleting X:
Set write_lock to None.
Set version to dt.key().
Put the pure meta-data object into the database.
Delete the target object, X. This also deletes its write lock.
Otherwise, this is an update.
Get X from db.get(X_key).
If get returns None and X_key is a name key, check for a corresponding pure meta-data object. If there is none, this is a creation; proceed as in the creation case above. Otherwise, if one exists this is a re-creation of a named object, so copy over its meta-data in to X and delete the pure meta-data object (also deleting the write lock). From now on, this proceeds as an update to X as follows.
X.write_lock should be the key of this DT; set it to None.
Copy the state of the shadow object to X.
Set X.version to dt.key().
Put X into the database.
Delete the shadow object.
If dt.mode is 3aborting, for each object do the following.
Get the shadow object from db.get(shadow_key).
Monotonicity: If there is no shadow object in the database corresponding to shadow_key, the write on this object has already been aborted, so continue to the next iteration of the loop.
If X_key is None, this is an aborted creation: there is no user object and no lock on it to delete.
Otherwise, this is an aborted deletion or put.
Get X from db.get(X_key).
X.write_lock must be the key of this DT or None. Set the write lock to None if it points at this DT; do not change the write lock if it points to another DT.
Put X into the database.
Delete the shadow object.
Here we summarize all of the places in the Distributed Transaction algorithm where we use the built-in GAE Local Transactions in the core algorithm. We also give which object is the intended target of the Local Transaction (implicit by the operation, but we include it for clarity). In the algorithm such points are emphasized by the presence of "in a Local Transaction" in bold.
On the DT: At each mode transition.
On each user object: Getting write locks.
On each user object and shadow (per entity group): Doing shadow copying.
When creating and deleting named objects and deleting and creating respectively their corresponding pure meta-data objects.
Note that each of the above has to re-read the object from the database on which it is performing an action in the Local Transaction and then write it back again.
Note there is no need to use Local Transactions when performing "Check read objects".
The algorithm transitions to a new mode when it has completed or aborted a previous one and is about to start a new mode. What is important is that we enforce that we are not skipping a step and not leaving the DAG of legal transitions (see section "Mode Directed Acyclic Graph").
In this situation there are three modes:
"Current mode" is the mode of the in-memory copy of the DT in the current thread.
"New mode" is the mode the current thread wants to put the DT into. New mode > current mode.
dt.mode is the mode that the thread finds the DT in when it reads the new copy of the dt from the database.
Check that the transition from current mode to new mode is part of the legal transition DAG. Then, to transition from the current mode to a new mode, in a Local Transaction do the following.
Get the dt object. Since other threads may be rolling forward this DT, if we read the DT from the database and it no longer exists, there is nothing further to do on this DT.
If dt.mode != current mode then exit the Local Transaction and go to "Dispatching on mode".
Otherwise, if new mode is "Deleted" (that is, the mode transition is to delete the DT object):
Delete the DT.
Otherwise:
Set dt.mode to new mode.
Put dt into the database.
After the local transaction is done, go to "Dispatching on mode".
Note that as a consequence one thread cannot abort a DT that another thread has already transitioned into 3checked: the thread that wants to abort will read the DT, find it is in 3checked, discard the new mode it was planning on going into, namely 3aborting, and continue to roll the DT forward in 3checked.
If dt.mode == 0init then
If dt.created is older than TIMEOUT, go to "Aborting during 0init" above. This DT did not make it to the end of phase 0init; the thread that started it must have timed out and another thread (this one) discovered it.
Otherwise, if dt.created is younger than TIMEOUT, wait and read the DT mode from the database again: the initiating thread is completing and the DT cannot be rolled forward by another thread. This should be a rare situation.
If dt.mode == 1ready then go to "Sort the written objects" followed by "Get write locks".
If dt.mode == 2locked then go to "Check the read version numbers".
If dt.mode == 3checked or 3aborting then go to "Complete writes".
If dt.mode == 4done or 4aborted or the DT does not exist then simply return; we are done with this DT.
There are two places where we attempt to interact with an object which may already have a lock on it, forcing us to block.
Read blocking on a write: during "Run & Record" in mode 0init when the client reads an object that has a write lock on it.
Write blocking on a write: during "Commit" in mode 1ready when getting write locks.
If this happens we must suspend this DT until that lock is released. In order to prevent blocking indefinitely, we use this thread to roll forward the blocking DT. This is ok as the whole process is thread-safe and deadlock-free.
In order to prevent the inefficient contention and duplication of work caused by multiple threads rolling forward the same DT, if the blocking DT has a modification time that is less than TIMEOUT, then we wait instead of rolling the blocking DT forward. See "A note on time" for more about the function of TIMEOUT and choosing a good one.
If a DT does not complete in state 4done then it must abort, going through 3aborting to clean up its state and then ending up in 4aborted.
A DT can fail and need to be aborted while in mode 0init.
The client or a library it calls can throw an exception.
The thread can timeout: all threads in GAE are running under a timer and if it expires, the thread stops.
If either of these happen the exception handler or soft timeout handler does the following.
Write the DT to the database in mode 3aborting. Note that since this thread is the creating thread and we are still in mode 0init, no Local Transaction or further checking is needed here (unlike below). Note also that this writes the put_list_shadow into the database.
Roll that mode forward by going to "Dispatching on mode".
A DT can abort during "Check read objects" if a read object fails a check. A DT cannot abort after reaching 3checked. Aborting is a special case of a mode transition.
Exit any current Local Transaction.
Transition into mode 3aborting using section "Transitioning to a new mode" above. Note that the protocol in that section prevents a DT in mode 3checked from being aborted.
Go to "Dispatching on mode" which should go to section "Completing writes".
Note that when run in mode 3aborting instead of 3checked, section "Complete writes" does not actually copy the state of the shadow objects to that of the user objects, however it does still delete write locks and shadow objects.
When a request times out in App Engine that there is first a soft timeout [ExceptionsGAE] and then a hard timeout [RequestTimer]. Ryan Barrett [Barrett]: "[I] think you get 1s btw the soft and hard timeouts, but i'd budget for a lot less. still, you should have enough time to do a few gets and puts".
The soft timeout exception can be caught and should do the following:
Handle potential changes to the DT mode.
If the DT mode is 0init, put it into 3aborting as detailed in the section "Aborting during 0init". Do not right now attempt to roll forward the DT to finish the abort procedure and delete the DT and its shadows, as we want to get to the redirect step below.
Otherwise if the mode is past 0init, there is nothing to do here.
Throw a soft timeout exception to the app.
The app should redirect the user to a "Still working…" view. It would be best if the client would automatically reload, but if it doesn't then whenever the user interacts again with the app it will query for any operations initiated by the user and roll them forward.
When rolled forward in the case of an abort it will finish the abort procedure and in other cases will continue the roll-forward.
Both the soft timeout handler and the exception handler can experience a hard timeout, leaving a DT in the database. If this DT is in mode 0init, then there may be several shadow objects pointing at it without the put_list_shadow of the DT pointing back at them.
Background processes will soon be possible in Google App Engine [OfflineProcessingTalk]. A background thread cleans up this situation:
Query for DTs having a creation time older than TIMEOUT that made it to mode 1ready: roll them forward.
Query for stillborn DTs: those having a creation time older than TIMEOUT still in 0init. Query for its shadows and put them into put_list_shadow then transition into mode 3aborting per section "Transitioning to a new mode". In particular note that it rechecks that the mode is 0init and if not does nothing as we have been fooled by huge clock skew.
Query for shadows having a creation time older than TIMEOUT and check if their DT is now gone: delete them and their corresponding write lock.
Note that the correctness of this algorithm does not depend on bounded clock skew. While the notion of the age of a DT used above assumes an upper bound on clock skew, a suspicious idea at best in a distributed machine, the worst consequence of such skew is reduced performance, aborting DTs that could otherwise have succeeded. See section "A note on time" for the function of TIMEOUT and choosing a good one and for more discussion of clock skew.
The third point of the algorithm above is in fact necessary, as follows. Due to the fact that GAE queries can fail to return objects that satisfy their predicate [TransIsolGAE] it is possible that the query will miss a shadow object. The thread that was adding these shadow objects should be gone, but if there is huge clock skew, it might still be operating and a shadow could be missed by the query and garbage of these "orphaned" shadows could accumulate. The combination of huge clock skew and a query missing an object is likely to be quite rare, but strictly speaking this is a fundamental problem that cannot be easily removed. The only solution I see is that given in the third point above: put a creation time field onto shadows and have yet another background query that looks for very old shadow objects and checks if their DT is gone and deletes them if they are. (Note that there is no way to run a query for shadows the dist_trans field of which points at a deleted DT.) If queries ever become reliable, this third pass and the creation date field can be removed from shadow objects.
In two places we use timeouts as a heuristic to prevent the inefficiency of multiple threads pushing forward the same DT.
When attempting to get a lock that is held by another DT, we check to see if the mode of the other DT has not been modified for a time exceeding TIMEOUT to decide if our own thread should actively push forward the locking DT or expect that another thread is already doing so.
In a background thread or during dispatch finds a DT it has to decide if the DT has been abandoned and should be rolled forward (if in mode 1ready or later) or aborted (if in mode 0init) we check if it was modified longer ago than TIMEOUT.
In a background thread to locate orphaned shadow objects we first search for those having a creation time older than a TIMEOUT (probably using a TIMEOUT longer than the ones mentioned above).
Ryan Barrett [Barrett]: "[T]here's no upper bound on the clock skew between machines. :/" and "[W]e don't make any guarantees about clock skew. in practice, expect just a few seconds or so. maybe up to 10, but that'd probably be unusual". As you can see, were unable to get an upper-bound on the possible clock-skew in the Google App Engine infrastructure from the Google App Engine team. Without such a bound, there can be no absolutely reliable global notion of time and therefore correctness cannot depend on it. No distributed algorithm should depend on a global notion of time for correctness and this algorithm does not.
Note that TIMEOUTS should be tuned carefully. For example, when rolling forward other DTs that have a lock we need, as an optimization wait until the other DT's modified field is at least a little old before rolling it forward so as to prevent many threads from all doing this roll-forward at once, which is simply wasteful.
TIMEOUT should be just a bit longer than the amount of time another thread would take between updating the DT object as the DT object goes through its life cycle; that way, other threads will not jump in.
However the delay should not be such a long delay that the waiting thread will time out before deciding to roll forward a DT that has in fact been abandoned and then managing to roll it forward at least one step; otherwise forward progress would no longer be ensured.
An object that participates in a Distributed Transaction, a "DT-flavored" object, must no longer be written (created/updated/deleted) by client-controlled Local Transactions as those might not honor the DT protocol.
The namespace of named objects must be partitioned into DT-flavored names and LT-flavored names as the identity of a DT-flavored named object persists across deletion and re-creation and (client controlled) LTs need not honor the mechanism for enforcing that; see section "Namespace management".
Since queries do not honor Local Transactions, and therefore also neither Distributed Transactions, an exception should be thrown if a query ever returns a DT-flavored object.
Reading DT-flavored objects in an LT is ok since DT copy shadows operations are grouped into a single LT per entity group. While LTs deletes do not mix with DTs, deleting a DT in a final mode (4done or 4aborted) or in None mode that has expired (see section "Best-effort temporary read locks") is an LT write that can interact with DTs; this is a special case of "Transitioning to a new mode".
In order to enforce that LT writes do not mix with DT operations in an illegal way, every user object in the system is either LT-flavored or DT-flavored.
All ID-ed objects (not named) are the flavor of the transaction that created it.
All named objects having a name prefixed with "__DT_" are DT-flavored; otherwise they are LT-flavored.
A client-controlled (not part of the DT infrastructure) LT-write (create/update/delete) on a DT-flavored object results in an exception.
Note that if one were to attempt to extend this system to allow an object to change flavors one must be mindful of very subtle corner conditions that arise at the conjunction of flavor change and creation/deletion and the conjunction of flavor change and interaction with multiple transactions of different flavors. Such an attempt is not recommended.
Technically there is no "create" operation in App Engine; there are "get", "put", and "delete". That seems oddly asymmetric because it is. There are basically two good regimes for namespace and object management, as follows.
Service controls the namespace. This is the way C++ manages objects: new/delete for creating and deleting objects or "context", read/write for creating/deleting data or "content".
Client controls the namespace. This is the way Perl manages objects, which they call "auto-vivification": you only have "put" and "get" and the namespace is somehow "eternal" and managed for you. If you put an object, it is there. If you get something, it is there. You never have to create or delete anything.
App Engine does a weird hybrid of the two and can act as either depending on usage [Keys].
If you allow keys to be auto-generated then it is (1) above, except where you must tell if you are doing a create or a write only by looking at the object and checking if it has a key already or not before you put it.
If you use user name keys then you are using (2) except that there is no garbage collection thread running in the background so you have to manually delete objects.
Note that in App Engine you can tell whether an object has a client-chosen name key or a service-chosen numeric ID simply by checking if the first character of the key is alpha or numeric. Keys can be retrieved from their objects [KeyMethodOfModel]. A key is not a string, but an object that can be losslessly interconverted to and from a string; you can tell whether a key is named or not by asking the key form of the object [KeyClass].
The DT algorithm as stated was designed for idiom (1): letting the datastore pick the ids for objects; however in the interest of minimizing our impact on the usage of App Engine, we can to allow idiom (2): letting the app pick the names of objects. This is a tricky problem because once the user can pick the object name it has an existence independent of the object. The situation is as if the entity group were an object and the name a field on that object. As we noticed before, you can't mix LT writes with DT operations on an object, but by allowing the user to create named objects in both LTs and DTs, we are allowing the mixing LT writes and DT writes on the entity group "object".
Consider the following example:
DT1 reads object named "foo", finds that it does not exist.
DT1 writes foo, but then is suspended before it can commit this change.
DT2 creates object "foo" and commits.
DT3 deletes "foo" and commits.
DT1 wakes up and commits "foo" and possibly some other related changes.
If foo had been being read and written rather than being created and deleted, this would clearly violate isolation. However, we do not detect it because we are creating and deleting objects instead of reading and writing them. By having a name "foo" we create the sense that all of the objects called "foo" have some semantic continuity which is being violated here. Recall that this example is not possible using generated IDs as creating an object with a generated ID is guaranteed to pick an ID that has never been used before: we cannot delete and then "re-create" an object with an ID as we did with named object "foo" above.
We suggest that the transactional semantics that will make sense to most people is that named objects always have a value, even if that value is "object does not exist"; just as with objects, this value has a version number, a write lock, and a flavor. In the example above, a version number check by DT1 in the last step would have found that object "foo" had been written since DT1 read it and DT1 would have aborted.
However, we do not want to force the use of the Distributed Transactions system on applications that want to operate on named objects using Local Transactions. As noted above in section "Enforcing flavors" mixing DT and LT flavored interaction on named objects results in situations with very subtle semantics and therefore we simply avoid it.
All named objects having a name prefixed with "__DT_" are DT-flavored
All others are LT-flavored.
Doing so partitions the namespace in a permanent and unambiguous way.
We create persistent meta-data for names as follows.
A non-existent named object has a corresponding "pure meta-data" object with the same key but with the kind suffixed with "__DT_PureMetaData__".
Clearly we do not create such objects for every possible name: for a name that has never been used before if the corresponding meta-data object does not exist then we act as if it did with version None, write_lock None, and flavor DT.
Exactly one of the object and the pure meta-data object exist at the same time and continuity of meta-data between them is ensured.
The pure meta-data object is created to preserve the meta-data of a named object when the named object is deleted; both of these should be done together in a Local Transaction.
Similarly the pure meta-data object is deleted and its meta-data moved to the object when a named object is created; both of these should be done together in a Local Transaction.
When a named object is read and does not exist, do as follows.
Get the pure meta-data object and if it exists, use its version number.
Otherwise use None as the version number.
When a named object is written and does not exist, do as follows.
Get the pure meta-data object, creating it if necessary.
Get a write lock on the pure meta-data object.
We adopt a similar restriction as App Engine does on the Name part of an object key [Keys]: application writers may not use Kinds that start with two underscores. We add the convention for others building infrastructures to co-exist with ours that for names starting with double-underscores the part of the namespace after the first non-leading double-underscore is infrastructure meta-data. Further we claim the subspace that starts with "DT".
Note that if we could use the exact same key as the user object, with the same Kind part as well, then everything would be much simpler. In particular, we might be able to allow LT-create and DT-create to mix. However it wouldn't work because queries would find these pure meta-data objects as if they were the actual objects: they would pollute the query results. Therefore we must create pure meta-data objects as a convention of the DT infrastructure; since LTs are not required to honor this convention, we must prohibit LTs from creating named objects having names in the "__DT_"-prefixed namespace, as we stated above.
Additional idioms beyond the core DT algorithm are required in order for the application to provide transactional semantics all the way out to the user interface. We suggest some idioms and ensure that the DT algorithm admits of them.
In GAE Local Transactions ensure that each entity group exhibits data-sequential distributed-consistency [seq-con]. We preserve this property by completing writes to the database within entity-group granularity Local Transactions.
DTs are synchronous issue but asynchronous complete, allowing for the possibility that DTs will complete in an order other than that in which they were issued. While the user can likely tolerate this to some degree, were it to become severe it would be annoying. Therefore we want to preserve the user-local sense of time as much as we can. That is, we want user-synchronous order: we attempt to complete DTs in the order they were requested.
That is, the distributed-consistency can be made user-synchronous in two different ways: one best-effort and the other guaranteed. The idioms that provide these properties vary in other ways such as in their performance impact and therefore the choice between them is a trade-off.
At the start of processing a request the app should first query for any pending DTs by the requesting user and then attempt to roll them forward to completion them before initiating any further activity on behalf of that user. We allow for this idiom but it is up to the app to do it. Using this idiom the completion order is best-effort because other threads may be rolling forward incomplete DTs in parallel that were initiated by the same user.
A more heavyweight approach can guarantee that the DTs issued by a user complete in the order issued as follows. Maintain a list of pending DTs issued by a user in their User object: when a DT is created, append it to the end list and when a DT is registered as complete (see below) delete it from this list. All threads wishing to roll forward a DT must look up its user object and always roll forward the first DT on the list that is not in state 4done or 4aborted. Always interact with the User object in a Local Transaction. Put all DTs created by a user into the same entity group as its User object; create the DT object and append it to the pending DT list in the user object in the same LT. This list of pending transactions can be presented to the user along with their current state and final result.
A potential problem is that the user may issue several requests which could potentially be completed independently. If the one registered to be completed first is slow, this will cause the others to block, and in general reduce parallelism. This may be more frustrating for the user than asynchronous completion of DTs.
Note that using guaranteed user-synchronous distributed-consistency is equivalent to providing "strong (distributed-) consistency" or (user-) synchronous local (data-) sequential distributed-consistency to the layer above the DT layer.
If the distributed_run_in_transaction function manages to return synchronously within the thread that initiated it then it should return the DT. However, in the presence of timeouts, DT completion can be asynchronous. Nonetheless we need the completion state of DTs to always be visible to the app. Therefore when DTs complete we need to keep them around in the database to give the app a chance to acknowledge the completion and reflect it in the user-visible state.
Once the app has acknowledged the change, it can delete the DT; it should only delete DTs that are in states 4aborted or 4done. Note that deleting the DT is a special case of "Transitioning to a new mode" and therefore should be done in a Local Transaction; see the notes on deleting a DT in that section. This may seem like overkill but deleting the DT and reporting its final state to the user should be one action; doing this in an LT prevents odd non-transactional effects in the UI such as, say, the double-reporting of the results of the DT to the user. Of course transactional semantics of such reporting cannot be guaranteed: the app could delete the DT and timeout before reporting the results to the user (or the reverse if done in the reverse order). In any case, transactional semantics in the database is still guaranteed.
Some applications are likely to be sufficiently complex that it should push to a window somewhere in the UI a list of all DTs, what the function and arguments were, and whether they are in process, completed, or aborted. Thus upon completion in 4aborted or 4done, the DT should contain the following information:
the client function and its arguments that were run in the DT,
the return value of the client, and
the result, either successful completion or the exceptional reason for the abort.
Note that annotating the client function onto the DT object should be done at DT object construction time, annotating the return value of the client should be done in the same LT as the mode transition into mode 1ready, and annotating the reason for the abort should be done in the same LT as the mode transition into mode 3aborting.
The user is part of the computation. We want to ensure the maintenance of invariants not only within the database but between the state of the database and the state of the user. That is, in an important application, a user taking an action based upon state data in their UI could result in disaster. We provide idioms to address this problem. There seem to be only a few good points in the tradeoff space between performance and the maintenance of invariants across the user and database.
Un-isolated: All reads to render output to the user are done in Local Transactions. Local object reads are guaranteed to be Isolated from other updates to that object, but the entire display is not guaranteed to give an isolated snapshot of the database. This mode is cheap and could be useful for a rapidly updating ambient view of the state of the database.
Isolated read with no lock: Use the idiom of section "Transactional interaction with the user" below to render an Isolated whole-screen view of the database. While the view is Isolated if the user takes an action and another DT has written to any object in the view then the action will abort. If this became frequent, many users operating on the same objects could all get into mutual livelock [livelock].
Isolated read with temporary lock: To alleviate the potential livelock problem above we can use the idiom of section "Best-effort temporary read locks" below. This lets the user get a read lock on the displayed objects user and guarantee a window to take an action that very likely will not fail.
Write collision merge tool: As people who design revision control systems have noticed, changes that collide are often semantically orthogonal anyway and therefore neither need be discarded if care is taken. That is, another way of resolving write collisions without a lock is instead of simply aborting the attempted write and requiring the user to start over, we can show the user exactly what collided and give them a tool to perform a merge. See section "Write collision merge tool".
We want to interact with the user, both when reading and writing, in a way that has the ACID properties. That is, we do not want the user seeing a screen of data that is not an consistent and isolated snapshot of the database nor taking actions based on stale data. We therefore suggest the following idiom.
In a DT copy the state of all of the objects needed to render a page to the user to a user-page object.
In an LT read that user-page object and render it to user.
Recall that this works because an LT read interacts with DT writes transactionally, per the section "Interaction of LTs and DTs".
To read from user in a DT:
When we copy objects to user-page, also keep their version numbers.
Check versus those version numbers (rather than the current version numbers) when performing a DT on behalf of the user in response to that page.
Alternatively, it may be more efficient for the client function to
Copy only the object keys and their version numbers, rather than the entire object state, to the new object in the user-page entity group.
Construct the page (say in HTML) to render to the user and return it as the function return value in the DT.
Render that return value to the user when the DT completes.
Note that either buffering the objects to be rendered or buffering a copy of the page (as opposed to pipelining the rendering straight out the socket) is an unavoidable consequence of enforcing transactional semantics. It's just a question of which buffering is more efficient.
The most elegant way to execute this technique in combination with best-effort temporary read locks below is as follows.
Record the objects and version numbers read when rendering the page to the user in a second DT object; return that second DT as part of this DT's return value.
When the user takes an action in response to the displayed page, perform it within this second DT object; keep that second DT around for that purpose.
Use the technique of section "Best-effort temporary read locks" below to temporarily lock out other DTs from modifying these objects before the user can respond.
Note that we no longer need to copy of the read objects and their version numbers to a user-page object.
Our DT algorithm guarantees transactional semantics. However, being optimistic, a DT is vulnerable on its read side: if a second DT writes one of its read objects before the first can get to 3checked it will be forced to abort. To the user this shows up as the phenomenon that under high contention there can be many aborted DTs possibly resulting in livelock [livelock].
There is likely no one-size-fits-all solution to this problem, however providing a mechanism that allows various apps using the same database to at least attempt to cooperate may help. We therefore suggest a simple temporary read locking mechanism as follows.
When desiring a read lock on a set of objects create a DT
Preemptively populate the get_list_obj and get_list_version with the object keys and their versions respectively.
Set the mode to "None" (instead of say 0init).
Add a flag read_lock set to true.
Add a timestamp timeout field set to the current time plus whatever timeout is desired, say one minute.
Put the DT into the database.
This DT can be kept around and used later to issue a distributed_run_in_transaction as long as it has not expired.
We must modify the current algorithm as follows. When attempting to obtain a write lock on an object do the following:
Query for any DTs that
have a read_lock flag,
list the object in their get_list_obj, and
have a timeout later than now.
If the response is non-empty, block until the timeout.
An expired read lock is a DT that has mode "None" and a timeout in the past. Most of these read-lock DTs can be eliminated more efficiently as follows.
When a user takes an action the app should query for any expired read locks belonging to that user and abort them.
To ensure the absence of expired read locks accumulating as garbage a background thread must query for and abort any that remain.
There is a race between transitioning a read-lock DT in mode None into mode 0init versus garbage collecting it by deleting it. Therefore doing either must be done in a Local Transaction; this is a special case of section "Transitioning to a new mode". Note that this is different from simply creating a DT object in mode 0init when it is not transitioning from having been a read-lock DT: that can be done without a Local Transaction.
Note that
the locking mechanism above involves using queries and may fail in rare race conditions [TransIsolGAE],
however the correctness of the basic transactional properties of the algorithm do not depend on read locks working at all, much less failing only rarely.
That is, temporary read locks are just a best-effort layer on top of the mechanism which already guarantees the transactional properties.
Many distributed collaboration systems have found that having a transaction result in an all-or-nothing commit or abort is too annoying for users. Frequently a set of changes to one part of the database is basically independent of that of another set of changes made by another user, but they may technically overlap in some unimportant ways. Thus many collaborate systems provide a merge tool for the user when writes collide; version control systems are one such. Simon suggests that when changes collide, show the user:
the original state,
their attempted new state that failed to get checked-in,
the new state from the other user.
Then let the user do the merge. This however is an application-level concern. A more complete analysis of the semantic-consistency property would state this as a property of the system that we then ensure; however, stating correct merge semantics is beyond the scope of this article. We state it for completeness as another use case of the idiom we use of keeping aborted DTs around for post-mortem analysis by the app discussed in section "Making completion state visible even in the presence of asynchrony".
We attempt to ensure the algorithm above has the properties we wanted.
Critical: We assume that a root key having a generated numeric ID (as opposed to a name key) will never be reused. Ryan Barrett [Barrett]: "[R]oot, id-based entities of the same kind will always have unique ids[.]" [IDsOfDel, DatastoreTalk, Keys]
Critical: If you do gets and puts not within a run_in_transaction then each one of them is implicitly wrapped in its own run_in_transaction under the hood [Barrett, Armbrust]. That is, there are no gets or puts done not within at least a Local Transaction.
Critical: As LTs need not honor the conventions of the DT layer, certain combinations of operations between DTs and LTs are not allowed, such as mixing LT writes with DTs. Enforcing such restrictions cannot be done if the user is allowed to use the LT layer directly. Erick Armbrust [Armbrust] suggests simply reflecting an interface to the LT layer into the DT layer which delegates to the LT layer but implements the required restrictions and then telling the user to not even import the LT layer. See section "Interaction of LTs and DTs".
Performance: We assume that the GAE thread timeout will never exceed our TIMEOUT value, including skew between machines. Correctness does not depend on this assumption, however if it is wrong, some DTs may be aggressively aborted that would not otherwise have been and multiple threads may end up wastefully pushing forward the same DT.
GAE on top of Big Table provides an object abstraction. Further objects are loaded and stored atomically and GAE provides Local Transactions for manipulating local groups of objects in more complex ways.
This property may sound simple, but an earlier version of my design violated it! We want the fact that puts do not happen until the end of the DT to not be able to enter into the dataflow of the application. Here is an example:
Client reads x=1.
Client writes x=2. This is buffered as a shadow.
Client reads again, missing the shadow, x=1!
This cannot happen in our design as all accessed objects are imported into the in-memory cache which prevents the client from seeing more than one version of an object at a time.
We complete a DT after all the changes have been written to the user objects. Durability of DTs follows from the Durability of the underlying Local Transaction layer.
A DT can abort only under the following circumstances:
Timing out in in stage 0init.
Objects read failing the check for for the absence of write locks and the currentness of version numbers.
Recall that during mode 0init when the client is doing writes, all writes are made to shadow objects, not user objects. In either case above since the DT does not make it to mode 3checked all writes are abandoned.
Once in mode 3checked the DT can no longer abort. During copying a Local Transaction can fail but this is only a transient failure; we simply retry until success.
See section "No starvation due to DTs being abandoned" for details on why a DT cannot languish indefinitely without being rolled forward to completion or being aborted.
We copy over all changes to the database from the shadow objects only after all of the objects to be affected have been locked and we only release the locks after the copying is done. Therefore other changes cannot interleave.
However, the changes we make are not against the current version of the objects, they are against the version we read initially. Therefore the last step of acquiring locks before copying includes a step to check that the object version we read is still the current version.
The following example is not a proof, but it captures why this works: DT1 has write locks on a and b. DT1 had written a but not b. DT2 reads a and b. As Simon says, we want to make sure at this point that DT2 is doomed.
If T2 tries to complete first, it will abort when it does the check for write locks on a and b.
If T1 completes first, when T2 tries to complete, it will abort when it does the version number check on b.
See the full proof below in the section "Full proof of Isolation".
Using DTs correctly to jump from only to good states is an application-level concern and is beyond the scope of the transactional layer; however as it occurs as the "C" of ACID we state it for completeness. A machine-checked assurance of this property would likely take the form of a static analysis showing that for each function run in a DT if the invariants hold when it starts and it runs to completion then they hold when it ends.
In GAE Local Transactions ensure that each entity group exhibits sequential distributed-consistency [seq-con]. We preserve this property by completing writes to the database within entity-group granularity Local Transactions.
Further, in section "Application idioms" we suggest idioms that provide either best-effort or guaranteed (user-) synchronous local sequential distributed-consistency.
In mode 0init actions on a DT are single-threaded. The DT can hold no locks, so no other thread will attempt to roll it forward. If a timed-out instance of a DT in mode 0init is found by another thread that thread can only abort the DT.
After mode 0init, multiple threads can be operating on the same DT. Note therefore that the sequence of states of the DT must have the following properties.
Atomicity: a transition succeeds or fails completely.
Monotonicity: progress through the state space can not repeat.
These properties guarantee thread safety; see the full proof below in the section "Full proof of thread safety".
One DT only waits for another in two situations.
A reader waits for a writer only when (1) the reader is in 0init and holds no locks and (2) the writer is in 3checked and only waiting on locks held by other writer in 3checked.
We sort objects on which we obtain write locks consistent with a total order and then obtain the locks in that order [dining-phil]. Therefore a writer in 3checked only waits for locks on objects strictly lower in the order than the object it is attempting to get a lock on.
Therefore the wait-for graph cannot have a directed cycle.
DT objects that (1) time out and then (2) are abandoned by their initiating user could in theory languish indefinitely.
The app should be tracking the state of its DTs.
The app should reveal the state of a given DT to the initiating user.
The app should roll forward old DTs started by the same user before starting new ones.
Blocked DTs roll forward blocking DTs.
If the object is read, the reading DT will roll it forward at client read time.
If the object is written, the DT requesting the lock will roll it forward.
As detailed in section "Handling hard timeouts in a background thread" a background thread rolls-forward any old DTs it finds.
Note that no attempt is made to provide any kind of scheduling fairness.
DTs can conflict when they both access an object: while reads share, writes exclude.
Ryan Barrett [Barrett]: "it's a generalization, but in our experience building and running large webapps, that reads often outnumber writes by 10x to 100x." and "i didn't do any serious analysis to back that up. we may have elsewhere at google, but if we have, i haven't followed it or based this claim on it. it's just a back of the envelope, educated guess". Therefore we made sure that in our system read storms cannot block a write.
As pointed out above one DT cannot indefinitely block on a write lock by another as both readers and writers waiting on the write lock will suspend and roll the locking DT forward.
In general an optimistic transaction algorithm is vulnerable to write storms. In general there is no help for this problem, but algorithms exist for specific situations. For example, read-then-write operations that combine (such as addition) can be performed in parallel using fan-in trees.
Scheduling such trees gets even easier if the operations in question associate and/or commute. However taking advantage of reordering operations requires DTs to be asynchronous-issue: when requested by the user the operations should be added to a queue and run later by a background process.
The only non-user data created in the database are shadow objects, DT objects, and write locks.
See section "No starvation due to DTs being abandoned" for details on why a DT cannot languish indefinitely without being rolled forward to either 4done or 4aborted. A DT in this state is semantic in that it communicates to the app that the completion of this DT has not been yet acknowledged. The suggested idiom is that a DT object is deleted by the app after it reaches 4done or 4aborted and is acknowledged by the app.
All shadow objects are attached to the DT that created them and normally can be found from it through the put_list_shadow field. Whether the DT completes normally or is aborted, except in the presence of a hard timeout during mode 0init, its shadow objects are deleted during "Complete writes" along with their corresponding write locks.
In the presence of a hard timeout during 0init, the situation is more complex; see "Handling hard timeouts in a background thread" for the details.
DT object keys are never re-used. This is the only kind of garbage that accumulates: unusable keys. Keys of DTs must therefore forever grow longer.
By inspection it is clear that our locking protocol only writes user objects when that object is written by the client.
Definition: Let the entire state of a DT be the product of the states of
its members, including the mode,
all of the objects touched by the client,
their shadow objects, and
the variables of the client.
Definition: Let the correct entire state transition graph be a non-deterministic directed acyclic graph on the set of possible states of a DT given by the non-deterministic union of all possible entire state transcripts of a single-threaded lifetime of a DT. By "non-deterministic union" we mean that if two possible transitions can occur, both are included in the graph. Note that the legal mode transition DAG is a subset of this graph as the modes are a subset of the state.
Definition: An algorithm is thread-safe if when executed in the multi-threaded case the database state of the DT only makes transitions that lie within the correct entire state transition graph. That is, the multi-threaded version of the algorithm behaves as if it were a single-threaded version.
The section "Transitioning to a new mode" a transition has the following properties:
Is atomic: a mode transition happens at the granularity of a local transaction. This follows from the fact that the transitions are done in a Local Transaction.
Is monotonic: states to not repeat. This follows from the fact that the mode transition graph is acyclic.
One important consequence of atomicity is that there can be no confusion between two threads operating on the same DT as to whether a DT is aborting or not.
Corollary: No continuing a DT in mode 3aborting or 4aborted. An aborted DT cannot go on to further non-aborting modes, such as 3checked: when the mode transition is attempted, the transitioning thread will find mode 3aborting or 4aborted.
Corollary: No aborting a DT in mode 3checked. In particular once a thread has transitioned a DT into mode 3checked, if a second thread rolling forward the same DT later repeats some of the read-checking and finds a user object is now write-locked or has a new version number, that second thread cannot now abort the DT: when it attempts to transition into 3aborting, it will be stopped by the fact that the DT is now in 3checked.
Mode 0init is thread safe because of the following.
It is single-threaded: only the creating thread operates the DT in mode 0init.
Other threads can interfere only by aborting it and can do so only in a thread-safe way as detailed in Lemma "Once a DT makes a mode transition it cannot be unmade".
States having modes subsequent to 0init must be considered to be multi-threaded as
one DT can block on another and the blocked DTs thread can decide to roll the blocking DT forward, or
a DT can time out and another thread can find it and roll it forward.
In these modes, the client is finished and so its state is no longer relevant. The remaining entire state consists of the state of
the DT object,
shadow objects, and
the client objects including their meta-data.
The "strong" distributed-consistency property of the underlying Local Transactions means that the entire state (that of the DT object and of the client objects) may be coordinated by a thread: the changes made by that thread must show up in the database objects in the same order as the Local Transactions in which the thread issues them. This sentence may be so obvious as to be confusing if you have never contemplated underlying distributed-consistency properties weaker than "strong" distributed-consistency; see subsection "Strong consistency" of section "Full proof of Isolation" for more.
Multiple threads may therefore roll forward a system without any race conditions as long as the following properties hold.
Atomicity: a transition succeeds or fails completely.
Monotonicity: progress through the state space can not repeat.
Note that one way to make a state progression monotonic is to make the changes idempotent, or two-state monotonic: doing the change once is the same as having done it many times — that is, the change simply goes to a particular goal state.
The following list of state transitions is exhaustive and the transitions made are atomic and monotonic.
The process of checking the read objects is not idempotent as the result depends on the state of the read objects: perhaps a thread found a read object to have no write lock and a good version but a subsequent thread found otherwise. However the entire result of the check reads process is the binary result of a transition into mode 3checked or 3aborting and by Lemma "Once a DT makes a mode transition it cannot be unmade" the entire mode transition process is monotonic: if a thread puts the DT into 3checked other threads can no longer alter that.
Getting write locks and completing writes: getting a write lock is clearly idempotent, however when followed by deleting a lock the entire progression of the lock state in isolation is not monotonic. However when this state is combined with deleting the shadow at the same time as deleting the lock, the whole process is a monotonic progression through the following states:
shadow object and no lock,
shadow object and lock,
no shadow object and no lock.
Similarly deleting the DT object preserves monotonicity as it is part of a monotonic progression through the following states:
DT does not exist and its key is free for use,
DT exists and its key may not be used,
DT does not exist and its key may not be used.
Presenting the result of running the client, such as obtaining the return value, displaying any error message, etc. is idempotent as
these values are recorded on the DT object as detailed in "Completion state is visible even in the presence of asynchrony", and
since deleting a DT in a final mode (4done or 4aborted) is an LT write that can interact with DTs (section "Interaction of LTs and DTs"), these values can be read, reported to the user, and then the DT deleted in one LT.
Objects have a well-defined identity over time since each object has a unique key. User object states are only updated through DT writes and a DT write annotates the object written with a version number, namely the id of the last DT to write it. Therefore, every user object state corresponds to a unique (object key, version number) pair. A DT reads a set of user object states as input (and possibly takes other input as well) and writes another set of user object states as output. All persistent application computation, that is, changes to the object states, is performed by a DT in this way.
To help visualize this space of states, arrange the set of such states in a matrix where all the states of a given object key occur in a column. Since all operations writing user object states occur in a Local Transaction on that object and these Local Transactions can be linearly-ordered (by isolation of Local Transactions), for each column, sort the states vertically top-to-bottom in that order.
If you want you could use this same matrix to visualize the deadlock avoidance algorithm by drawing waits-for edges on it and noticing that there can be no cycles. Recall that for deadlock avoidance object keys are sorted in a global order (the order in which locks are obtained). The simplest visualization would be to sort the columns left-to-right in that order; waits-for edges will always go to the right. However doing this is unnecessary for the Isolation proof.
This is a paper on distributed databases. The word "consistency" already means one thing in the context of distributed algorithms and another thing in the context of databases. Above we defined a "semantic-consistency", which is the databases notion of consistency. Here we discuss "distributed-consistency" which is the distributed algorithms notion of consistency.
A Local Transaction is an interaction between one thread and one entity group. (Note that all database interactions are forced to effectively occur in a Local Transaction as an LT is created for operation that does not have one.)
We need that the underlying layer provides "synchronous local" version of "sequential [distributed-] consistency" [seq-con].
Data-Local Sequential (Distributed-) Consistency: Each entity group has a totally ordered local history of operations.
Thread-Local Sequential (Distributed-) Consistency: Each thread has a totally ordered local history of operations. (Everyone implicitly assumes this property of threads.)
Synchrony (Blocking reads and writes): For any thread and any entity group, the restriction of all operations to those between that thread and entity group is consistent with the total orders of both.
To keep the name short, we also call these properties collectively "strong" (distributed-) consistency.
In general, we use "<" to denote "happens before". Recall that in a distributed system time only makes sense in the context of some locality, such as the Local Transaction stream of
a user object or
a Distributed Transaction object.
Data-Local Sequential (Distributed-) Consistency: When we wish to specify a locality we make it explicit by annotating with the locality: "<(X)" meaning in the local time of object X or "<(DT-A)" meaning in the local time of Distributed Transaction DT-A.
Thread-Local Sequential (Distributed-) Consistency: There is also a local time of a particular thread that is pushing forward the state of the DT; since transitions on the state of a DT are done in an LT the thread state can be finer-grained than the state of the DT but is forced to stay consistent with it. (Only using this finer-grained thread-local time can we reason about the transition into mode 2locked.)
Synchrony (Blocking reads and writes): Using "strong" distributed-consistency of the underlying layer defined above we can combine "<" sentences from the local times of different objects and DTs and conclude new sentences by transitivity of "<".
The algorithm enforces several useful inequalities as follows. Note that we say a DT reads an object X when the first read of X occurs, and the DT writes X when the shadow is copied to the user object.
All-locking-brackets-2locked: Recall that in the local time of any DT we have there is a time when it first enters mode 2locked which we will write "DT 2locked". For all objects X written by that DT:
DT locks X <(DT) DT 2locked <(DT) DT unlocks X.
All-reads-and-checks-bracket-2locked: Further all reading is done before entering mode 2locked.
DT reads X <(DT) DT 2locked.
Below DT-A and DT-B are two distinct DTs that do not subsequently abort and X is an object.
Locks-mutually-exclude: Only one DT can hold the write lock of X at a time, so if DT-A writes before DT-B writes then:
DT-A unlocks X <(X) DT-B locks X.
Locks-exclude-reads-and-checks-after: Writing and unlocking occur in the same LT, so if DT-A writes X before DT-B reads X then:
DT-A unlocks X <(X) DT-B reads X <(X) DT-B checks X.
Locks-exclude-reads-and-checks-before: If DT-B reads X before DT-A wrote X then DT-B may not check after DT-A writes or it will abort during read checking from the updated version number. Further DT-B may not check X between DT-A locking X and DT-A writing X or it will abort during read checking from DT-A's write lock. Therefore if DT-B reads X before DT-A writes X then:
DT-B reads X <(X) DT-B checks X <(X) DT-A locks X.
Lemma write-write: DT-A write X <(X) DT-B write X implies DT-A 2locked < DT-B 2locked.
This follows from All-locking-brackets-2locked and Locks-mutually-exclude:
Lemma write-read: DT-A write X <(X) DT-B read X implies DT-A 2locked < DT-B 2locked.
This follows from All-locking-brackets-2locked and Locks-exclude-reads-and-checks-after and All-reads-and-checks-bracket-2locked:
Lemma read-write: DT-A read X <(X) DT-B write X implies DT-A 2locked < DT-B 2locked.
This follows from All-reads-and-checks-bracket-2locked, Locks-exclude-reads-and-checks-before (with A and B reversed), and All-locking-brackets-2locked:
Lemma read-read: DT-A read X <(X) DT-B read X implies DT-A 2locked < DT-B 2locked or DT-A and DT-B read the same version number of X.
If DT-A read X <(X) DT-B read X then there are two cases:
(1) If a write on X by some DT-C came between the two reads, then using Lemma read-write and Lemma write-read we have that
(2) If no write came between the two reads then the two reads were of the same version number of X.
The conjunction of all four lemmas is the following:
For all objects X and for all DTs DT-A and DT-B if DT-A accesses X <(X) DT-B accesses X, then
DT-A 2locked < DT-B 2locked or
both accesses were reads of the same version of X.
Therefore all such objects X "might as well have been accessed in the same order": that is,
must be accessed by DT-A and DT-B in the same order,
except when both are reads of the same version of X.
Therefore there exists a partial order on all DTs that is consistent with all local object accesses.
Any partial order can be extended to a total order.
The mechanism provided by the algorithm is sufficient to provide the properties that we want; however, there is also nothing extra: each mechanism is also necessary to accomplish this.
Any time there is a mode in which the DT can abort there needs to be a mode after that when the danger of aborting has passed. The DT can go into that next mode or 3aborting but never both as mode transitions are conducted in a Local Transaction on the DT object in the data base; this prevents ambiguity between the two threads as to whether it is being rolled forward or aborting. Therefore modes 1ready and 3checked are needed as they follow 0init and 2locked which can abort. Similarly mode 3aborting is needed to definitively initiate an abort.
Modes 4aborted and 4done are needed so the app can examine the DT and know that it is done. Mode 2locked exists for use in the proof of Isolation but there is no need to write it into the database, so we include an optimization where we do not.
We need mode 0init to be distinct from the mode None — the mode of DTs functioning as best-effort read locks waiting to be used — because once in mode 0init the read lock timeout field no longer applies to the DT. The real question is why do we have to put the DT into the database before the shadows are written (when not using best-effort read locks). Consider the following. When the background thread finds an old shadow it must ask if there is a DT pointing to it to know if the shadow is still in use. We don't want to query for DTs that point to it because if that query had a false negative [TransIsolGAE] we would delete a shadow for a possibly active DT. This would be an unrecoverable error: the DT could be in 3checked already copying shadows and not able to be aborted. Therefore we have a dist_trans field on the shadow so we can ask if a particular DT exists, and if not then we know we can delete the shadow. In order to write the dist_trans field onto shadows, we must have a key for its DT object. The only way to do that is to have put the DT into the database before mode 1ready. We might as well call that mode 0init.
The created field of the DT is needed so the app can roll forward DTs in the order that the user issued them. The modified field is needed so that when blocked on obtaining a write lock the thread of the blocked DT can guess if any other thread is likely to be rolling forward the blocking DT or not and decided whether to jump in do that itself (if it is old) or wait and avoid duplicate work (if it is not); see the comments on TIMEOUT above in the section "A note on time".
The get_list_obj and get_list_version fields are needed to find the read objects and their versions. The put_list_obj field is needed to find the put objects; put_list_shadow is needed so that the shadow for an object can be found without resorting to a query on the dist_trans field of the shadow. Recall that queries are not allowed in Local Transactions.
Without the shadow objects being in the same entity group as the user object the copy of the state of the shadow object to that of the user object and the deletion of the shadow object could not be done in a Local Transaction.
Without the write lock and version of an object being in the same entity group as the user object the read checking cannot use a Local Transaction to force serialization with
obtaining the write locks and
with the copying of shadows, which updates the version number.
See the discussion on why we need mode 0init above for why we need the dist_trans field on a shadow object. Without the created field on a shadow object the third query for orphaned shadow objects in the same aforementioned section would not be possible.
Without the cache we could not prevent the fact that writes are buffered as shadow objects from showing up to the client function which would be a failure of Uniqueness.
Without buffering writes in shadow objects, an abort during checking version numbers could leave some writes done without others which would be a failure of Atomicity.
Without sorting the written objects by their keys deadlock could happen when DTs wait on each other for the write locks.
Without write locks writes from different DTs could interleave which would be a failure of Isolation.
Without recording and checking version numbers, other writes could interpose between two reads which would be a failure of Isolation.
Without checking all the reads only after getting all the write locks a second DT could write the read object after the first but read the written object before the first which would be a failure of Isolation. Here is an example. Note that (1) DT-A reads X before DT-B writes X but (2) DT-B reads Y before DT-A writes Y.
Without the read check also checking for the absence of a write lock along with a version number reads can interpose between writes, violating Isolation. Here is an example.
We have omitted an attempt to allow queries within DTs. This is made particularly difficult due to the fact that native GAE queries do not honor even Local Transactions and further the allowed native GAE query predicates are rather limited compared to, say, those of SQL queries. One might imagine providing queries by explicitly providing within the DT infrastructure itself the locking necessary to achieve Isolation. In doing so, we could even provide queries of more complexity than native GAE queries. However, note that the difficulty of providing Isolation for such queries depends very much on the richness of the query language.
We rely on "strong (distributed-) consistency" in the underlying layer; an attempt could be made to design an algorithm that only needs a weaker level of distributed-consistency. Requiring only say causal distributed-consistency [cause-con] would make the algorithm far more general.
We have not investigated what further optimizations might be possible using deeper integration with the lower level APIs of App Engine and Big Table. For example, perhaps a shadow object could simply "replace" a user object rather than having its state be "copied".
We have not performed any performance evaluation.
Thanks to Russell Sears, Karl Chen, Ryan Barrett, Erick Armbrust, Robert Johnson, and Alfred Fuller for discussions leading to this design.
Thanks also to Ryan for not only reading the entire article, but also
for suggesting the use of optimistic rather than pessimistic transactions,
and for his extensive consultation on the details of the App Engine and Big Table infrastructures.
Thanks also to Erick for not only reading the whole article, but also
finding the bug that the version number should be set when we copy the shadow object (in the same Local Transaction) rather than when we obtain the write lock,
thinking of the optimization of explicitly deleting the write locks in the same LT as doing the shadow copy which is a big reduction in database writes,
and for implementing the algorithm.
Thanks also to Rob for
suggesting that read locks were completely unnecessary which is a huge simplification of the design.
Thanks also to Alfred for
suggesting independently of Simon that when the client reads an write-locked object that it should be paused and the thread should roll forward the holder of the write lock.
[Armbrust]: Personal communication from Erick Armbrust of Google.
[Barrett]: Personal communication from Ryan Barrett of the Google App Engine team.
[McPeak]: Personal communication from Scott McPeak.
[Gray-Reuter] Jim Gray and Andreas Reuter. 1993. Transaction Processing: Concepts and Techniques. Morgan Kaufmann.
[O'Neil-O'Neil] Patrick O'Neil and Elizabeth O'Neil. 2001. Database: Principles, Programming and Performance. Morgan Kaufmann.
[BigTable] Big Table: http://labs.google.com/papers/bigtable.html
[GAE] Google App Engine: http://code.google.com/appengine/docs/
[DatastoreIntro]: Introducing the GAE Datastore: http://code.google.com/appengine/docs/python/datastore/overview.html#Introducing_the_Datastore
[DatastoreTalk] Under the covers of the GAE Datastore: http://snarfed.org/space/datastore_talk.html
[ExceptionsGAE] GAE Exceptions: http://code.google.com/appengine/docs/python/datastore/exceptions.html
[IDsOfDel] "are numeric ids of deleted entities reused?": http://groups.google.com/group/google-appengine/browse_thread/thread/dec83c2dbd9542e4
[KeyClass] The Key Class: http://code.google.com/appengine/docs/python/datastore/keyclass.html
[KeyMethodOfModel] The key() method of Model: http://code.google.com/appengine/docs/python/datastore/modelclass.html#Model_key
[Keys] Keys and Entity Groups: http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html
[OfflineProcessingTalk] Offline Processing on App Engine: a Look Ahead: http://code.google.com/events/io/sessions/OfflineProcessingAppEngine.html
[RequestTimer] The Request Timer: http://code.google.com/appengine/docs/python/runtime.html#The_Request_Timer
[TransGAE] GAE Transactions: http://code.google.com/appengine/docs/python/datastore/transactions.html
[TransIsolGAE] Transaction Isolation in App Engine: http://code.google.com/appengine/articles/transaction_isolation.html
[cause-con] Causal consistency: http://en.wikipedia.org/wiki/Causal_consistency
[dining-phil] Dining philosopher's problem and solutions: http://en.wikipedia.org/wiki/Dining_philosophers_problem#Resource_hierarchy_solution
[livelock] Livelock: http://en.wikipedia.org/wiki/Livelock#Livelock
[opt-cur] Optimistic concurrency: http://en.wikipedia.org/wiki/Optimistic_concurrency_control
[seq-con] Sequential consistency: http://en.wikipedia.org/wiki/Sequential_consistency
Copyright 2008-2009 Daniel Shawcross Wilkerson and Simon Fredrick Vicente Goldsmith.