Mark and sweep

Efficiently collecting data changes for the very low-bandwidth distributed replication scenarios.

Damir Bulic

Sep 8, 2021 • 1 min read

Synchronizing data over low-bandwidth connection relies on multi-staged approach:

1. Hash the data on both ends
2. Compare the hashes to figure out what has changed (mark)
3. Go back to the databases and fetch the marked data (sweep)
4. View the differences, or
5. Replicate changes by executing a minimal set of SQL statements on the target to bring it in sync with the source

Marking

Once hashes arrive from the databases, the comparison is very simple - advance on sorted keys and put the key into one of the three buckets (added, changed, deleted).

Sweeping

Once we have the set of keys for the data we need, we need to look into how to make that as efficient as possible. Usually, we would just do a full copy where we can migrate terabytes per hour, easily. However, that would fully saturate the low-bandwidth connection, so we need to rely on our syncing engine (which supports bandwidth throttling) even for the initial pass. That means, billions of records on that first pass.

As result of our marking phase is a list of keys, we are folding the keys by detecting all non-gapped monotonously increasing PK ranges. Such ranges can be described with BETWEEN expressions of the WHERE clause. Outliers can be fetched via IN expressions.

To avoid loading a billion records with a single connection, we are limiting a range of records to 100k, by default. That allows us to massively parallelize the indexed fetch, something Omni Loader is very good at.

For compound primary keys, all but the right-most columns must be separated, and the right-most can be optimized by range folding as described in the single-column scenario.