This post introduces a new data structure - the vector map - which solves some issues related to storing collections in MVCC data stores. Further, vector maps have some super nice use cases for "occasionally connected" systems.
The idea warrants a more rigorous discourse, but I need to get it off my chest, so here is a blog entry describing it.
Modern distributed data stores such as CouchDB and Riak use variants of Multi-Version Concurrency Control (MVCC) to detect conflicting database updates and present these as multi-valued responses.
So, if my buddy Ola and I both update the same data record concurrently, the result may be that the data record now has multiple values - both mine and Ola's - and it will be up to the eventual consumer of the data record to resolve the problem. The exact schemes used to manage the MVCC differ from system to system, but the effect is the same: the client is left with the turd to sort out.
This led me to the idea of creating a data structure which is, by its very definition, able to be merged, and then storing such data in these kinds of databases. So, if you are handed two versions, there is a reconciliation function that will take those two records and "merge" them into one sound record, by some definition of "sound".
From what I have seen, the "thing" stored is often itself a collection like a list or a hash map. Say that Ola and I both add new elements to the collection and store the results: the resulting multiple records are - with proper definitions - naturally mergeable, namely into the list or map that contains the original entries plus both my additions and Ola's.
So, this is the presentation of my idea: A vector map, which is a data structure that is designed to be used in this context. It also has other interesting applications as we shall discuss towards the end of this post.
Vector Maps
A vector map is defined as a set of assignment events, Key=Value, each such event being time stamped with a vector clock (hence the name). From a high-level point of view, a vector map can be seen as a hash table, i.e., a collection of key/value pairs.
Two vector maps can be reconciled (VectorMap1 ⊕ VectorMap2), so that for each key, the "most recent assignment event" wins. If assignment events are in conflict (vector-clock wise concurrent), then the resulting value is multivalued.
The reconciliation function for vector maps is defined so that it is commutative, i.e., it can be applied in any order:
A ⊕ B = B ⊕ A,
It is also associative,
A ⊕ (B ⊕ C) = (A ⊕ B) ⊕ C,
which means that any reordering done inside the data store does not affect the result.
In fact, Riak itself (the distributed key/value store) is just a big distributed and redundant version of this; which just goes to prove the versatility of the idea. So in a sense, a vector map is just a recursion over the Riak (Dynamo) concepts applied to a data structure.
In the following we will define things a little more rigorously and provide some examples; and towards the end there is a discussion of how vector maps can be used.
Assignments as Vector Clocked Events
For the purposes of this discourse, we will model an assignment event as
- a Key,
- a Set of values, and
- a vector clock, VC, describing when the assignment happened,
so an assignment event has the following form:
Assignment :: VC: Key = [Value, Value, ...]
We will let the set contain multiple values in case a conflict has been observed, but to start off with our sets will be single-valued.
Our system will use the vector clock to determine "who wins". Let's see what happens ...
To begin with, I do an assignment "X" = 4, and record this event:
Assignment1 = (krab:1) : "X" = [4]
Reading: at krab's time 1, "X" is bound to 4.
Later, after observing my assignment, a colleague, Ola, at his time 2, defines "X" to be 5:
Assignment2 = (ola:2,krab:1) : "X" = [5]
Now, the immediate beauty is that because each assignment is time stamped with a vector clock, we can easily determine that Assignment2 happened after Assignment1 (the vector clock says that Ola did indeed observe Assignment1 before creating his own), and so if we see both we can discard the earlier one without loss.
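The "happened after" check can be sketched in a few lines of Python. This is only an illustration, assuming a vector clock is represented as a dict from agent name to local time; the function name `happened_before_or_equal` is made up for this example.

```python
def happened_before_or_equal(earlier, later):
    """True if every agent time in `earlier` is matched or exceeded in `later`.

    Agents missing from a clock are treated as being at time 0.
    """
    return all(t <= later.get(agent, 0) for agent, t in earlier.items())


# Vector clocks of the two assignments above, as agent -> local time.
vc_assignment1 = {"krab": 1}
vc_assignment2 = {"ola": 2, "krab": 1}

# Assignment2's clock covers Assignment1's, so Assignment1 can be discarded.
ola_saw_krab = happened_before_or_equal(vc_assignment1, vc_assignment2)
krab_saw_ola = happened_before_or_equal(vc_assignment2, vc_assignment1)
print(ola_saw_krab, krab_saw_ola)  # True False
```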
Conflict happens
Now, what happens if Jens comes in and, having observed only my original assignment, reassigns "X" to be 7? We'll describe this with the following event:
Assignment3 = (jens:3,krab:1) : "X" = [7]
To Jens, this is not problematic, but if someone observes both Assignment2 and Assignment3 they'll know that Jens did not observe Ola's time 2.
To make some kind of sense out of this, we define the reconciliation operator ⊕, describing the aggregation of knowledge we have when combining the known events.
Assignment2 ⊕ Assignment3 = (krab:1,ola:2,jens:3) : "X" = [5,7]
I.e., we describe that "X" has conflicting values 5 and 7 at a point in time which is after both Assignment2 and Assignment3.
Deleting values
A further complication is what happens when we want to delete a binding. This is handled quite simply by making a new assignment to a unique tombstone value, which is somehow outside the scope of other possible values, and then reconciling that into the current event set. This does have some interesting properties, because there is an observable difference between a binding that was never there and a binding that was removed.
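As a rough sketch of the tombstone idea, assume a vector map is a Python dict from key to a (vector clock, value set) pair. The names `TOMBSTONE`, `delete` and `status` are hypothetical, and the sentinel string is just one possible way of picking a value "outside the scope" of real values.

```python
TOMBSTONE = "__deleted__"  # hypothetical sentinel; must never be a real value


def delete(vmap, key, agent, local_time):
    """Record a deletion as an ordinary assignment of the tombstone value.

    Assumes `local_time` is later than any time `agent` has used before, so
    the new clock is after the current one and simply replaces the entry.
    """
    clock, _ = vmap.get(key, ({}, set()))
    vmap[key] = ({**clock, agent: local_time}, {TOMBSTONE})


def status(vmap, key):
    """Distinguish a key that was never bound from one that was removed."""
    if key not in vmap:
        return "never bound"
    _, values = vmap[key]
    return "deleted" if values == {TOMBSTONE} else "bound"


vmap = {"X": ({"krab": 1}, {4})}
delete(vmap, "X", "ola", 2)
print(status(vmap, "X"), "/", status(vmap, "Y"))  # deleted / never bound
```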
Defining Reconciliation
In the example above, we can see from the vector clocks that Jens had not observed Ola's assignment, and so we reconciled the two conflicting events by
- creating an artificial vector clock (krab:1,ola:2,jens:3) which is logically after both (ola:2,krab:1) and (jens:3,krab:1), and
- combining the bound values [5] and [7] into a set of values [5,7].
Had it been obvious that one assignment happened before the other, we would simply have thrown one of them away, but because the two vector clocks were in conflict we have to recognize the conflict.
So the actual definition of reconciliation for two assignment events with the same key is as follows:
VC1:Key=Set1 ⊕ VC2:Key=Set2 ≡
VC1 ≤ VC2 → VC2 : Key = Set2;
VC2 ≤ VC1 → VC1 : Key = Set1;
otherwise→ lub(VC1,VC2) : Key = (Set1 ∪ Set2)
The reconciliation operator ⊕ (o-plus) is commutative (just like good old addition), so we can use it to reconcile assignment events in any order and we'll always arrive at the same result in the end. This makes it perfect for making decisions in a distributed system: even if we get to know about events in different orders, the state will eventually reconcile to the same value (eventually meaning when we've seen all the events).
The definition uses the lub (least upper bound) of two vector clocks, which combines them by taking the maximum local time stamp for each agent present in either clock:
lub( VC1, VC2 ) ≡
∀ a ∈ agents(VC1) ∪ agents(VC2)
a : max( time(VC1,a), time(VC2,a) )
where time(VC,a) is 0 (zero) if VC does not list an agent a. For example:
lub( (b:3), (a:1, b:2) ) = (a:1, b:3)
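The definitions of ⊕ and lub above translate fairly directly into code. The following Python sketch assumes a vector clock is a dict from agent to local time and an assignment event for a given key is a (clock, value set) pair; the function names are chosen for this example only.

```python
def leq(vc1, vc2):
    """VC1 <= VC2: every agent time in vc1 is matched or exceeded in vc2."""
    return all(t <= vc2.get(agent, 0) for agent, t in vc1.items())


def lub(vc1, vc2):
    """Least upper bound: per-agent maximum, missing agents count as 0."""
    return {a: max(vc1.get(a, 0), vc2.get(a, 0)) for a in vc1.keys() | vc2.keys()}


def reconcile(event1, event2):
    """Reconcile two assignment events for the same key."""
    (vc1, set1), (vc2, set2) = event1, event2
    if leq(vc1, vc2):   # event2 has seen event1 (equal clocks also land here)
        return (vc2, set2)
    if leq(vc2, vc1):   # event1 has seen event2
        return (vc1, set1)
    return (lub(vc1, vc2), set1 | set2)  # concurrent: keep both values


assignment1 = ({"krab": 1}, {4})
assignment2 = ({"ola": 2, "krab": 1}, {5})
assignment3 = ({"jens": 3, "krab": 1}, {7})

conflict = reconcile(assignment2, assignment3)
print(conflict == ({"krab": 1, "ola": 2, "jens": 3}, {5, 7}))  # True
print(lub({"b": 3}, {"a": 1, "b": 2}) == {"a": 1, "b": 3})     # True
```

Reconciling Assignment1 with Assignment2 simply returns Assignment2, since Ola's clock already covers mine.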
What is this good for?
You may rightly say that this doesn't solve the problem, it just pushes the problem one level down. And that's right; but ... it does solve an interesting range of problems, and with some care you can often structure your usage of keys in vector maps so that you can avoid conflicts altogether.
Further, since your favorite data store already does this for you ... you may say that you can just split your map into individual key/value bindings and store those in the database. But that comes at a price of network round trips and data locality.
Use case: Modeling Relationships in MVCC data stores
Vector maps are really nice for modeling relationships inside MVCC databases.
Assuming you want to store a one-to-many relationship in Riak (say, order to order-item), you run into the problem that the "one" side is likely to be concurrently updated if multiple items are added to the same order.
With vector maps, you can easily model the entire relationship as one order object which is just a vector map, where each item is stored with a distinct key. If that key is e.g. a sufficiently large random number, making it very unlikely that order-item IDs conflict, then you're pretty much home free. Alternatively, you can devise a mechanism so that each client of the system is able to construct globally unique keys (agent + sequence number).
For this use case, you'll see much improved performance also, because of the improved locality of reference. If your data store needs to go and fetch each individual order-item on the disk somewhere, then performance will be seriously hampered.
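To make the order example concrete, here is a small Python sketch. It assumes a vector map is a dict from key to a (vector clock, value set) pair and builds item keys from agent + sequence number; `new_item_key`, `merge` and the item descriptions are invented for the illustration.

```python
def leq(vc1, vc2):
    return all(t <= vc2.get(agent, 0) for agent, t in vc1.items())


def lub(vc1, vc2):
    return {a: max(vc1.get(a, 0), vc2.get(a, 0)) for a in vc1.keys() | vc2.keys()}


def reconcile(event1, event2):
    (vc1, set1), (vc2, set2) = event1, event2
    if leq(vc1, vc2):
        return (vc2, set2)
    if leq(vc2, vc1):
        return (vc1, set1)
    return (lub(vc1, vc2), set1 | set2)


def merge(map1, map2):
    """Reconcile two vector maps key by key."""
    merged = dict(map1)
    for key, event in map2.items():
        merged[key] = reconcile(merged[key], event) if key in merged else event
    return merged


def new_item_key(agent, seq):
    """Globally unique order-item key: agent name plus a per-agent sequence number."""
    return f"{agent}-{seq}"


# Two clients concurrently add an item to the same (initially empty) order.
krab_order = {new_item_key("krab", 1): ({"krab": 1}, {"2 x widget"})}
ola_order = {new_item_key("ola", 1): ({"ola": 1}, {"1 x gadget"})}

order = merge(krab_order, ola_order)
print(sorted(order))  # ['krab-1', 'ola-1']
```

Because the two clients never write to the same key, the merged order holds both items and neither entry is multivalued.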
Use case: off-line data
Vector maps are great for off-line data, because they give a well defined meaning to the concept of synchronization (something I would really like my iCal to do :-). Synchronization is simply defined as the exchange of vector maps, storing the result of the reconciliation on both sides.
Such synchronization can happen in near-real time (peer-to-peer update) or as a delayed synchronization whenever there is contact to a server/peer.
This is perhaps the most interesting use case, because it can be used as a simple foundation for making data available to e.g. mobile clients in an "occasionally connected" system, in a way that makes sense for both online and offline mode.
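Synchronization as "exchange vector maps and store the reconciliation on both sides" might look like the Python sketch below. It assumes the same representation as the rest of this post (a dict from key to a (vector clock, value set) pair); the `sync` function and the phone/server calendar data are hypothetical.

```python
def leq(vc1, vc2):
    return all(t <= vc2.get(agent, 0) for agent, t in vc1.items())


def lub(vc1, vc2):
    return {a: max(vc1.get(a, 0), vc2.get(a, 0)) for a in vc1.keys() | vc2.keys()}


def reconcile(event1, event2):
    (vc1, set1), (vc2, set2) = event1, event2
    if leq(vc1, vc2):
        return (vc2, set2)
    if leq(vc2, vc1):
        return (vc1, set1)
    return (lub(vc1, vc2), set1 | set2)


def merge(map1, map2):
    merged = dict(map1)
    for key, event in map2.items():
        merged[key] = reconcile(merged[key], event) if key in merged else event
    return merged


def sync(replica_a, replica_b):
    """Exchange vector maps and store the reconciled result on both sides."""
    merged = merge(replica_a, replica_b)
    for replica in (replica_a, replica_b):
        replica.clear()
        replica.update(merged)


# Both sides started from title = "Meeting" at server time 1, then diverged
# while the phone was offline.
phone = {
    "title": ({"server": 1, "phone": 1}, {"Lunch"}),
    "location": ({"phone": 2}, {"Cafe"}),
}
server = {"title": ({"server": 2}, {"Standup"})}

sync(phone, server)
print(phone == server)            # True
print(sorted(phone["title"][1]))  # ['Lunch', 'Standup']
```

The concurrent edits to the title surface as a multivalued entry on both sides, while the location added offline is carried over to the server untouched.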
Implementation Issues
Since it is a little complicated to manipulate a vector map, we need implementations in the most common languages out there to get it off the ground. I'm currently hacking on an Erlang and a Java version.
Working on this, I have come to the conclusion that it would be great if vector maps had a well defined binary representation, so that they can be meaningfully manipulated in a number of different contexts, easily stored and transmitted, etc. In other words, a special MIME type that lets multiple parties consume vector maps independently of programming language.
If vector maps had their own MIME type and a well defined data representation, data stores such as Riak or CouchDB could even do the reconciliation automatically before serving the data to a client.
So, right now I am working with a protocol buffers definition that looks like this, to be encoded as Content-Type application/x-protobuf;proto=vectormap, and likely also a JSON representation, application/json;schema=vectormap.
message VectorMap {
  repeated Entry entries = 1;
}
message Entry {
  required string key = 1;
  repeated Clock vclocks = 2;
  repeated Value value = 3;
}
message Value {
  optional string mime_type = 1 [ default = "application/json;charset=utf-8" ];
  optional bytes content = 2;
  optional bool deleted = 3 [ default = false ];
}
message Clock {
  required string node = 1;
  required uint32 counter = 2;
  required uint64 utc_millis = 3;
}
Conclusions
The Dynamo idea of using vector clocks to time stamp data is great, but I think the power of the idea goes quite a bit further if the logic is brought all the way to the client.
CouchDB tries to do this by suggesting that client devices (mobile devices) should have a full fledged CouchDB running there. But I think that exposing data this way makes the mobile client model much more manageable. This entire idea can be wrapped up in a single Java class which is easily deployed in an Android app; and it is sufficiently simple to be implementable in a range of programming languages.
What do you think?