Ralf Th. Pietsch · The Power of Replication in CouchDB

CouchDB is a distributed NoSQL database that makes it easy to distribute data, even in unreliable network environments: mobile networks, low throughput, fast-moving vehicles, …

The concept of CouchDB allows you to connect to one node and read and write data instantly. The data will be replicated from this node automagically to other nodes (by configuration, of course). Therefore the power of CouchDB is availability and partition tolerance. Availability: If you are connected to a node, you can instantly read and change data, without waiting for the data to be synchronized throughout the network. Partition tolerance: If the network was down, the nodes will automagically reconnect and start synchronizing again. The user will not notice anything about this; that’s the point.
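To make "by configuration" a bit more concrete, here is a minimal sketch of how a replication could be requested over CouchDB's HTTP API, using the `/_replicate` endpoint with `source`, `target` and `continuous` fields. The host names and the database name `notes` are made up for illustration, authentication is left out, and `start_replication` needs a running node, so it is defined but not called here.

```python
import json
import urllib.request


def replication_request(source: str, target: str, continuous: bool = True) -> dict:
    """Build the JSON body for a POST to CouchDB's /_replicate endpoint."""
    return {"source": source, "target": target, "continuous": continuous}


def start_replication(server: str, body: dict) -> dict:
    """Send the replication request to a CouchDB node (requires a running server)."""
    req = urllib.request.Request(
        f"{server}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical nodes and database name, for illustration only.
body = replication_request(
    "http://node-a.example:5984/notes",
    "http://node-b.example:5984/notes",
)
print(json.dumps(body, indent=2))
```

With `continuous` set to true the replication keeps running and picks up new changes; leaving it false requests a one-off run.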

Of course there is a catch. If you provide availability and partition tolerance, the whole thing cannot, according to the CAP triangle, also be consistent at all times.

        Consistency
           /\
          /  \
         /    \
        / CAP  \
       /________\
 Availability   Partition
                Tolerance

(See also the CAP theorem illustrated at Scott Logic)

So when two users change the same data object (aka document) on two nodes, this may lead to a conflict during synchronization. CouchDB handles this in a first stage with a simple deterministic algorithm that decides which version of the document wins. That’s okay for some cases, but not for most. That’s why CouchDB provides an API for listing conflicts. It returns the documents that are currently in conflict, including the conflicting versions. That makes it easy to implement a conflict resolution task or even an automatic conflict resolution procedure.
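The two stages can be sketched in a few lines. The following is a simplified model, not CouchDB's actual code: `pick_winner` imitates the documented rule (a non-deleted revision beats a deleted one, then the longer revision history wins, then the higher revision ID in sort order), and `resolution_payload` builds a body for `POST /{db}/_bulk_docs` that saves a merged document on top of the winner and deletes the losing revisions. In a real database the losing revisions are reported in the `_conflicts` field when a document is fetched with `?conflicts=true`. The function names and the sample revisions are made up for illustration.

```python
def rev_sort_key(leaf: dict) -> tuple:
    """Sort key approximating CouchDB's winner rule for one leaf revision."""
    generation, _, digest = leaf["rev"].partition("-")
    # Non-deleted first, then the longer history, then the higher revision ID.
    return (not leaf.get("deleted", False), int(generation), digest)


def pick_winner(leaves: list[dict]) -> str:
    """Return the revision that would win among the conflicting leaves."""
    return max(leaves, key=rev_sort_key)["rev"]


def resolution_payload(doc_id: str, merged: dict, winner_rev: str,
                       losing_revs: list[str]) -> dict:
    """Build a _bulk_docs body: update the winner, delete the losers."""
    docs = [{**merged, "_id": doc_id, "_rev": winner_rev}]
    docs += [{"_id": doc_id, "_rev": rev, "_deleted": True} for rev in losing_revs]
    return {"docs": docs}


# Hypothetical conflicting revisions of one document.
leaves = [{"rev": "2-abc"}, {"rev": "3-aaa", "deleted": True}, {"rev": "2-def"}]
winner = pick_winner(leaves)
losers = [leaf["rev"] for leaf in leaves
          if leaf["rev"] != winner and not leaf.get("deleted", False)]
payload = resolution_payload("day-2024-05-13", {"hours": 8}, winner, losers)
```

How `merged` is computed is the application-specific part; the sketch only shows how the result would be written back so that the conflict disappears.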

Why is CouchDB doing it this way? Because in many (or most?) scenarios conflicts occur very seldom or never. And even if conflicts occur, there is a good chance to handle them automagically.

Replication is boring

So how can you use this replication feature in your business case, you may ask.

I like to explain it with an example. Imagine you are creating a time booking system for your employees. Now let’s consider the following scenario: Each employee gets their own database, with one document per day. In this document they may enter their working time and how long they worked on different projects.

Now the end of the month has come and the invoice for your client has to be created. Consider another database which replicates all the employees’ databases into one and calculates the numbers. Each document may then get an attribute “uneditable”, or even the ID of the invoice. This information is replicated back into the individual employees’ databases, so the employees are no longer allowed (or better: able) to change the document. Even if somebody changes an old document, the change will not be replicated into the “central” database, because the monthly replication runs for that month only, and never again. So we have a consistent state in our central database.
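As a rough sketch, the monthly run could be a one-off replication per employee database that is limited to the documents of one month. The example assumes that each day document carries an ISO-formatted `date` field, and it uses the `selector` option that newer CouchDB versions accept in a replication request; the database names are hypothetical. The resulting body would be sent to `/_replicate` as shown earlier in CouchDB's HTTP API.

```python
from datetime import date


def month_bounds(year: int, month: int) -> tuple[str, str]:
    """Return the first day of the month and of the following month as ISO strings."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


def monthly_replication(employee_db: str, central_db: str,
                        year: int, month: int) -> dict:
    """Build a one-off replication body restricted to one month of documents."""
    start, end = month_bounds(year, month)
    return {
        "source": employee_db,
        "target": central_db,
        # Assumes every day document has a "date" field like "2024-05-13".
        "selector": {"date": {"$gte": start, "$lt": end}},
    }


# Hypothetical database names, for illustration only.
job = monthly_replication("timesheet-alice", "invoicing", 2024, 5)
```

Because no `continuous` flag is set, the job runs once and stops, which matches the idea that a month is pulled into the central database once and never again.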

Easy, isn’t it?