In a previous post I listed four types of data change replication strategies. Let us explore polling.
In OLAP databases, data is rarely updated and almost always appended. In that case it makes sense to scan frequently for new data, while a full scan can run once a day.
Determining which data is new is not hard: find MAX(primary key) in the target database, select all rows in the source table whose primary key is larger than that, and push those to the target.
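A minimal sketch of that high-water-mark query in Python, using two in-memory SQLite databases as stand-ins for the source and target; the `events` table and its columns are hypothetical:

```python
import sqlite3

# Hypothetical append-only table on both sides; integer surrogate key assumed.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

source.executemany("INSERT INTO events VALUES (?, ?)",
                   [(1, "a"), (2, "b"), (3, "c")])

def poll_new_rows(source, target):
    # High-water mark in the target; COALESCE covers the empty-table case.
    (last_id,) = target.execute(
        "SELECT COALESCE(MAX(id), 0) FROM events").fetchone()
    # Only rows beyond the mark travel over the wire.
    rows = source.execute(
        "SELECT id, payload FROM events WHERE id > ?", (last_id,)).fetchall()
    target.executemany("INSERT INTO events VALUES (?, ?)", rows)
    target.commit()
    return len(rows)

copied = poll_new_rows(source, target)
```

The first run copies all three rows; a second run finds nothing new and copies zero.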
This is how I designed our system for scanning for all changes (inserts, updates, and deletes). First, both the source and target tables need to be ordered the same way (by the primary or unique key). If the databases are of different types, or use different collations and therefore different sort orders, you will need to fetch everything and sort it in memory. You could use SQLite or something similar to do that for you, but it will be slow.
Read the source and target tables in parallel, tracking the primary key and moving one record at a time. A gap in key values on one side means an inserted or deleted record; track that. When the same PK is present on both sides, compare all values in the row and note the differences. Then generate a change SQL script and execute it on the target.
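The merge-style comparison above can be sketched as a generator over two PK-ordered row streams. This is a simplified illustration, not the actual system; the tuple shape `(pk, row_dict)` is an assumption:

```python
def diff_sorted(source_rows, target_rows):
    """Merge-compare two iterables of (pk, row_dict), both sorted by pk.

    Yields ("insert", pk, row), ("delete", pk, row), or ("update", pk, changes),
    from which a change SQL script could be generated.
    """
    SENTINEL = object()  # marks an exhausted side
    src, tgt = iter(source_rows), iter(target_rows)
    s, t = next(src, SENTINEL), next(tgt, SENTINEL)
    while s is not SENTINEL or t is not SENTINEL:
        if t is SENTINEL or (s is not SENTINEL and s[0] < t[0]):
            yield ("insert", s[0], s[1])       # gap on the target side
            s = next(src, SENTINEL)
        elif s is SENTINEL or t[0] < s[0]:
            yield ("delete", t[0], t[1])       # gap on the source side
            t = next(tgt, SENTINEL)
        else:
            # Same PK on both sides: compare column values.
            changes = {k: v for k, v in s[1].items() if t[1].get(k) != v}
            if changes:
                yield ("update", s[0], changes)
            s, t = next(src, SENTINEL), next(tgt, SENTINEL)
```

Because both inputs are consumed lazily in key order, the comparison runs in one pass with constant memory per row, which is what makes the parallel scan cheap.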
When network latency is very high or throughput very low, you can't push all of that data across the network on every run just to compare it. Instead, run an agent process on the local network of each database, have it read and hash the data, then collect those (compressed) row hashes. Once you locate the differing rows, request only the data for those rows.
Not so transactional
This approach won't give you transactional guarantees, since it is not aware of transactions. It works really well for many use cases, though - and any change missed at one point in time gets picked up on the very next run.