Mind the Gap — Getting the Number on ODI Sequences
At work last week I lifted the covers on the Red Pill Analytics Slowly Changing Dimension ODI IKM for Oracle. If you haven’t heard about this knowledge module, it was specifically written to take a dimension source table and transform it to a slowly changing dimension target table. Unlike other similar IKMs, this module uses a mix of analytic functions and set operations to handle record closure and surrogate key management on SCD2 records efficiently. As an added bonus it works with mixed SCD2/ SCD1 attributes. We also have a similar IKM for Microsoft SQL Server.
Currently, I am enhancing the RPA Oracle IKM to build-in flow control handling. We left that part out earlier as it is, perhaps, a less common use case. If you are following an Oracle Reference Architecture approach to building your BI platform you probably have done the data validity checking at an earlier stage than the “publish to dimension table”.
Whilst implementing the new code and testing it I was reminded about some of the quirks of using sequences in ODI. ODI supports many kinds of sequences: Standard, Specific, and Native. However, coming from a database background, I prefer to use native if the underlying technology supports it. Not all databases support sequences, but where they do they will be the most efficient way to generate unique numeric key values.
Adding a sequence to a mapping is a simple matter of editing the expression on the column properties, either directly on the target table or on any component that exposes the column. Using the RPA Oracle SCD IKM as an example this is often the target column as our dimension load is very simple indeed.
Sequences declared in ODI can be selected directly from the ODI Objects selector of the Expression Editor
By default our sequence becomes :[project name].[sequence name].NEXTVAL (I am assuming you don’t really want to use CURIAL), meaning the sequence is an ODI “bind” object. The ODI manual also gives us a second syntax for a sequence expression. This syntax replaces the leading colon with a #. The two syntaxes give different results at execution time. It is down to how the ODI agent executing the mapping interprets the expression.
:[project name].[sequence name].NEXTVAL equates to ‘for each target row get the next sequence value’
#[project name].[sequence name].NEXTVAL equates to ‘for this execution get the next sequence value then apply it to all target rows’
This difference between ‘:’ and ‘#’ also explains why using the ODI bind (:) is slower when using variables in a mapping.
Rail travellers in the UK and particularly on the London Underground will have heard the expression ‘MIND THE GAP’. Gaps need minding when the distance from the train to the platform in stations is larger than expected and there is a danger of falling.
So what have gaps to do with sequences? In fact having a gap in a sequence is the way of life. In Oracle databases we can get gaps when transactions rollback (there is no mechanism to return the number to the sequence pool); gaps when we start the database as the sequence is often set to ‘cache’ a block of numbers to speed sequence access; gaps on RAC when each node may have its own cache of numbers.
When testing my SCD module I noticed some large gaps between key values when new records were processed and certainly not associated with RAC, DB restarts nor rollback. So why the gaps? Let’s look at this snippet of pseudo code used in the KM for surrogate key creation.
CASE WHEN NEW_KEY_NEEDED = TRUE THEN :DEMO.sq1.NEXTVAL ELSE CURRENT_SK END
What happens here is that the ODI Agent evaluates the sequence expression for each row processed whether the value is used or not. This will leave gaps, and in the case of a dimension where we have few changes that trigger new SCD rows per execution, gaps almost the same size as number of members in the dimension. In a way this does not really matter as the surrogate key is only a unique reference to the row and has no semantic meaning based on its value. However big gaps may mean we hit the maximum value we set on the sequence sooner than we expect.
We can draw on the techniques we used in ODI 10 for using a database sequence and explicitly name the sequence in the column — so if our sequence, SCD_TEST, is in the ODI_DEMO_TRG schema we can set our expression to be ODI_DEMO_TRG.SCD_TEST.nextval, just as we would do in SQL. In reality I would use an ODI Ref Function to resolve the schema name. This gets rid of all the gaps generated by the execution of the KM.