My whole data career has been about finding the truth within data, seeking correlations within and between data sources (and not always relational sources at that). You get to notice patterns in the world and even develop skills to spot outliers on a rapidly scrolling screen. In those early days I used tools like i2 Analyst Notebook (now an IBM offering) to search call data records or (more fun) some of the free text systems I had access to.
In DSS / BI / Data Science / whatever, we just can’t do our job of finding patterns and relationships if data is truly random. Relationships always exist (or are at least statistically significant). For example, the non-fiction books people buy from Amazon often show a bias towards the buyer’s occupation or education; the bookcase above my desk holds a large number of database and mathematics books with the odd volume on management. It’s these biases that make building BI test or training data more complex than just randomly populating columns.
We could use obfuscated data, but that is sometimes problematic: obfuscation needs careful thought to prevent inadvertent exposure of live data through subject-domain knowledge and guessing. Often, though, for test and training we need to come up with 100% synthetic content.
Personally, I prefer to see my synthetic data in my own language. I have a bit of a bee in my bonnet about ‘Zsdefes Aagrtf’ being my top customer when ‘Ricco daCosta’ or ‘Claire Wilson’ would look so much better on generated reports. Likewise, just randomising CITY, STATE and COUNTRY as independent entities may be great for building ETL test sets, but for reporting we end up with a far too flat distribution and many repeating city names attached to completely different locations; just how many Little Rock, Utah really exist in Belgium? It is relatively easy to build a table of real city-state-country data and then map that onto our synthetic geography dimensions (a sketch of this follows below). If we will be analyzing our data by phone number or postal code, I’d extend the table to include realistic area codes and representative postal codes.

As an aside, unless I am doing single-country work I tend to avoid using postal codes, as there is no global standard for where a postal code sits in an address hierarchy: UK and Netherlands postcodes go down to a few houses on a street, French codes are at the grain of one or more whole towns or villages within a Department, and US ZIP codes sit somewhere in between and might represent an area in a large city.
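Coming back to the city-state-country mapping, a minimal sketch of that join might look like this (customers_stg, real_geography and its assumed 5,000 rows are placeholder names of my own, not from any particular system); each synthetic customer is attached to one real city row so that city, state and country always stay consistent with each other:

select c.customer_id,
       g.city, g.state, g.country
from  (select customer_id,
              trunc(dbms_random.value(1, 5000 + 1)) as geo_rn   -- random pick per customer
       from   customers_stg) c
join  (select city, state, country,
              row_number() over (order by country, state, city) as rn
       from   real_geography) g                                 -- assumed to hold 5,000 rows
  on  g.rn = c.geo_rn;

A uniform pick like this still gives a flat spread of cities; weighting the pick by a city-population column (if the lookup has one) gives the more realistic skew mentioned above.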
We can’t just randomly allocate our fake customers to our generated city list or assign a random Standard and Poor’s credit rating to our corporate accounts. These things actually follow some form of distribution: we may well have more customers in San Francisco than in Fresno, and there are far fewer ‘AAA’ rated companies around than ‘A’ rated (but hopefully more ‘AAA’ than ‘D’). Thus we need to map some form of distribution over our synthetic data. If we have a real data source and something like Oracle Enterprise Data Quality (OEDQ) we can easily profile the distributions we need to superimpose on our synthetic data, or we could go home-grown with R or even DBMS_STAT_FUNCS.SUMMARY to find how the data is distributed. Often, though, something really simple like the analytic NTILE function will allow us to readily combine our source data with the synthetic obfuscation we are applying.
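As a sketch of that last option, DBMS_STAT_FUNCS.SUMMARY can profile a numeric column of the real data before we shape the synthetic version (CUSTOMER_D and its CREDIT_LIMIT column are assumptions for the example, not a reference to any real schema):

set serveroutput on
declare
  s dbms_stat_funcs.SummaryType;
begin
  -- profile the real column: owner, table, column, sigma used for extreme-value detection
  dbms_stat_funcs.summary(user, 'CUSTOMER_D', 'CREDIT_LIMIT', 3, s);
  dbms_output.put_line('mean    = ' || s.mean);
  dbms_output.put_line('median  = ' || s.median);
  dbms_output.put_line('stddev  = ' || s.stddev);
  dbms_output.put_line('95th pc = ' || s.quantile_95);
end;
/

The returned record also carries quartiles, mode and extreme values, which is usually enough to decide how to skew the synthetic column.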
My fake data tool chest contains public domain geo-data (preferably including city size so that I can distribute on that), the dbms_random family of Oracle functions, the ntile and row_number analytic functions and the good old case statement to help bucket up small domain data sets such as distribution of S&P ratings. Building good synthetic data is a great exercise and allows you to get quite inventive with SQL.
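For instance, a couple of small seed tables of forenames and surnames plus dbms_random give readable customer names instead of random strings (first_names and surnames here are hypothetical seed tables, assumed to hold at least ten rows each):

with shuffled_first as (
  select first_name, row_number() over (order by dbms_random.value) as rn
  from   first_names
),
shuffled_last as (
  select surname, row_number() over (order by dbms_random.value) as rn
  from   surnames
)
select f.first_name || ' ' || l.surname as customer_name
from   shuffled_first f
join   shuffled_last  l on l.rn = f.rn
where  f.rn <= 10;   -- ten sample names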
As an example, I needed to add a Standard and Poor’s rating attribute to each of the 5000 members of our company dimension. The testing requirement was for a realistic distribution of ratings over the dimension, but there was no need to use real company rating data.
To do this I created a couple of staging tables. The first holds the company key and a bucket number; here I am using 175 buckets.
create table bucket_list as
select company_id,
       ntile(175) over (order by company_id) as bkt
from   (select unique company_id from COMPANY_D);
I am using unique (applied before NTILE) because company_id is the business key and, for this use, I want each company to keep the same rating across any dimension change history; de-duplicating first guarantees exactly one bucket per business key.
Next I get creative and build a simple table of S&P ratings with start and end bucket numbers. Here I shape the distribution of the ratings; if I had the real distribution I could model that, but for this use I just shape a distribution that looks nice!
We could use analytic functions to derive the bucket_end value but for a small data set like this it is often as easy to type the values in.
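A sketch of what that staging table might look like (the rating boundaries below are made up purely to give a plausible-looking shape over the 175 buckets; they are not real S&P proportions):

create table rating_buckets (
  sp_rating    varchar2(4),
  bucket_start number,
  bucket_end   number
);

insert into rating_buckets values ('AAA',   1,   3);
insert into rating_buckets values ('AA',    4,  12);
insert into rating_buckets values ('A',    13,  40);
insert into rating_buckets values ('BBB',  41,  90);
insert into rating_buckets values ('BB',   91, 130);
insert into rating_buckets values ('B',   131, 155);
insert into rating_buckets values ('CCC', 156, 168);
insert into rating_buckets values ('CC',  169, 172);
insert into rating_buckets values ('C',   173, 174);
insert into rating_buckets values ('D',   175, 175);
commit;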
Finally we join these two tables and use them in a merge statement to update the dimension:
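Something along these lines (the SP_RATING column on COMPANY_D and the rating_buckets table name follow the sketch above and are assumptions, not the exact statement used):

merge into COMPANY_D d
using (
  select b.company_id,
         r.sp_rating
  from   bucket_list    b
  join   rating_buckets r
    on   b.bkt between r.bucket_start and r.bucket_end
) s
on (d.company_id = s.company_id)
when matched then
  update set d.sp_rating = s.sp_rating;

Because each company_id maps to exactly one bucket, and each bucket to exactly one rating, the merge gives every history row for a company the same value, which was the point of keying the buckets on the business key.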