OTN Appreciation Day: Looking for Change
In this case we need to compare the data we are loading with the data we already hold and look for changes. Some developers do this by outer joining source to target and then comparing columns to find changed data, either through direct comparison of columns or computing a MD5 hash of the concatenated columns. Either way need to do something with those pesky nulls in our data as we may get false matches or even miss changed rows.
A more NULL tolerant approach is the use a set-based approach. The classic is something like SELECT a,b,c from StagedData MINUS select x,y,z from TargetData. Using SQL minus like this we get rows in our result set that do not exist in our target + rows that are different in the two sets, stage and target.
Some people worry that the SQL minus is going to be a heavyweight query, but there is a simple thing we can do lighten the load. Instead of a straight
SELECT * from STAGE_TABLE -- a 20 row table
SELECT * from TARGET_TABLE -- a 50 million row table
we pre-filter the Target table to only include rows with the stage table’s business keys; and EXISTS or an inner join would do nicely here
Simple but an effective way of reducing the amount of data being processed in a set based query.