We have a requirement to replicate several TB of data over a shared secure 50Mbit connection. Copying that data over a dedicated link would take a week, but we need it to take a day.
Generally, change data capture is the way to go. Once you are into small changesets, nothing beats CDC. However, not all changesets are small and not all databases support CDC.
Solution for this exact problem is going to become an important part of our OmniLoader.
Advantages of cluster
OmniLoader is designed for incredible transfer speeds migrating data between clouds or between databases on premises and the cloud. It uses massive parallelism to push data through many connections at once, offsetting WAN latency. However, you still need enough bandwidth to push that through.
When bandwidth is very small, we are taking advantage of having agents close to the source and target databases, and leveraging their locality to send the minimal data amount needed to resolve the changes.
In essence, agents do enormous amount of work and only push only what is absolutely necessary over the network.
Getting the most out of the limited bandwidth
We need to compress and encrypt the data. The best compression is available for columnar data, so our over-the-wire format will be quite similar to Parquet. As we are optimizing for bandwidth, I plan to rearrange the data so it's highly compressible, and use LZMA compression. It will be interesting to see how much can we gain over Parquet.
Initial synchronization (which may take a long time) still needs to copy all of these terabytes over the wire, and the wire is shared with other important processes. So, our networking code needs to be throttled to allow other services to run smoothly.
Interesting problem, and its benefits are numerous. Will write more about this.