datacontroller.io/content/blog/bitemporal-historisation-and-the-sas-dds/index.md

61 lines
16 KiB
Markdown
Raw Permalink Normal View History

2024-06-05 13:53:46 +00:00
---
title: Bitemporal Historisation and the SAS DDS
description: Bitemporal historisation is supported in SAS DDS so long as appropriate transforms are applied, and consistent extraction logic is used.
date: '2020-08-06 18:32:46'
author: 'Allan Bowe'
authorLink: https://www.linkedin.com/in/allanbowe/
previewImg: './Screenshot-from-2020-08-06-22-40-55.png'
tags:
- Banking
- Bitemporal
- Data Warehouse
- DDL
- DDS
- Erwin
- ETL
- Insurance
- SAS
- Telco
---
Does the SAS DDS support Bitemporal historisation? Why yes indeed, with the right transform and extract logic, and updates to the primary keys.
The SAS Detailed Data Store (DDS) suite provides "out of the box" data models for industries such as Banking, Insurance and Telco. They arrive with logical and physical models (eg in Erwin format), and a set of DDL files for your preferred database. These assets help to accelerate the delivery and deployment of a company data warehouse that has the further advantage of standardised integration with SAS Solution offerings. The model is also very large. It is highly likely that much of the model will be unused, especially in industries like insurance - where corporate structures, regulatory environments, and product offerings can be quite diverse.
As a "baseline" model then, another way to utilise it is to take it as just that - a model, that provides guidance on putting together a warehouse that suits you, the customer. There is nothing to stop you picking & choosing the parts you like, that make sense for your particular use case(s).
## The Validity of SCD2
The historisation approach in the DDS is based on the "VALID_FROM_DTTM" and "VALID_TO_DTTM" columns. These provide the open and close datetime pairs representing the 'validity' of the record. This - is where the confusion begins.
What is 'validity'? Perhaps this represents the 'truth', eg the number of widgets we sold last month. So what is this truth? Is it the state of the database (transaction datetimes) between the 1st and the end of the month? Or does it represent the current view of last months widget sales, which were finally loaded on, say, the 15th of the following month? Dealing with this scenario (late arriving records) poses a number of challenges for the SCD2 model:
- Reloading historical data (as records must be physically removed, in order to reload)<
- Loading corrections (again, records must first be removed)<
- Maintaining an audit history, or even _the ability to run the same query twice and get the same result._
An emerging consensus from the datawarehousing domain is the use of bitemporal datetime ranges for managing such requirements. The below article borrows heavily from [this excellent paper](/wp-content/uploads/2020/08/hist-op_1.1.6_en_manual.pdf) by [Arnd Wussing](https://www.linkedin.com/in/arnd-wussing-8660381), which explains the topic in much greater detail.
## Background to Bitemporal
The concept of Bi-Temporal Historisation is not new it was originally associated with a chap called Richard Snodgrass back in 1992. <span style="color: #000000;"><span>There is further info on </span></span><span style="color: #0000ff;"><u><a href="http://en.wikipedia.org/wiki/Temporal_database"><span>Wikipedia</span></a></u></span><span style="color: #000000;"><span> and </span></span><span style="color: #0000ff;"><u><a href="http://informix-myview.blogspot.co.uk/2012/03/bitemporal-data-is-this-next-big-thing.html"><span>this blog</span></a></u></span><span style="color: #000000;"><span>, and a more recent article on <a href="https://medium.com/kamu-data/a-brief-history-of-time-in-data-modelling-olap-systems-9032f63b8b7f">medium</a>.</span></span>
<p class="western"><span style="color: #000000;"><span>Teradata have specifically implemented </span></span><span style="color: #0000ff;"><u><a href="smb://smteam.sas.com/DavWWWRoot/psd/rmi/Implementation%20challenges/TeradataBiTemporal.pdf"><span>temporal features</span></a></u></span><span style="color: #000000;"><span>, which (interestingly) holds the datetime </span></span><span style="color: #000000;"><span><i>pairs</i></span></span><span style="color: #000000;"><span> in a single column (see attachment). Notice the SAS DDS and Teradata nomenclature differences (for SAS DDS: VALID typically means Transaction Datetimes; for Teradata: VALID refers to Business Datetimes).</span></span></p> <a href="/wp-content/uploads/2020/08/hist-op_1.1.6_en_manual.pdf"><img class=" aligncenter" src="/wp-content/uploads/2020/08/bt.png" alt="bitemporal" width="900" height="102" /></a> <p class="western"><span style="color: #000000;"> <span>Furthermore, this SUGI paper ( </span></span><span style="color: #0000ff;"><u><a href="http://www2.sas.com/proceedings/sugi29/110-29.pdf"><span>http://www2.sas.com/proceedings/sugi29/110-29.pdf</span></a></u></span><span style="color: #000000;"><span>) covers the issue. Here is an extract (page 8): </span></span></p> <blockquote> <p class="western"><span style="color: #000000;"><span style="font-family: ArialMT, Arial, sans-serif;"><span style="font-size: small;"><span><b>Versioning history</b></span></span></span></span><span style="color: #000000;"><span style="font-family: ArialMT, Arial, sans-serif;"><span style="font-size: small;"><span> (Type Two style) will always require at least a single </span></span></span></span><span style="color: #000000;"><span style="font-family: Arial-BoldMT, Arial Bold, sans-serif;"><span style="font-size: small;"><span><b>updated date </b></span></span></span></span><span style="color: #000000;"><span style="font-family: ArialMT, Arial, sans-serif;"><span style="font-size: small;"><span>for the record, and two </span></span></span></span><span style="color: #000000;"><span style="font-family: Arial-BoldMT, Arial Bold, sans-serif;"><span style="font-size: small;"><span><b>valid from / valid to dates </b></span></span></span></span><span style="color: #000000;"><span style="font-family: ArialMT, Arial, sans-serif;"><span style="font-size: small;"><span>if using a normalized data model. You will also require two dates in a star schema if past point in-time history queries are to be easily run in a single query.</span></span></span></span></p> <p class="western"><span style="color: #000000;"><span style="font-family: ArialMT, Arial, sans-serif;"><span style="font-size: small;"><span><b>Business history</b></span></span></span></span><span style="color: #000000;"><span style="font-family: ArialMT, Arial, sans-serif;"><span style="font-size: small;"><span> may also dictate a need for </span></span></span></span><span style="color: #000000;"><span style="font-family: Arial-BoldMT, Arial Bold, sans-serif;"><span style="font-size: small;"><span><b>effective from / effective to dates </b></span></span></span></span><span style="color: #000000;"><span style="font-family: ArialMT, Arial, sans-serif;"><span style="font-size: small;"><span>when these may differ from the data warehouse versioning dates. This is especially true in certain sectors, like insurance, where value of business is counted over a period rather than a single time. It is also common when such changes are forward-dated in operational systems.</span></span></span></span></p> </blockquote> <p class="western">So enough of the background what on earth is “Bitemporal Historisation” anyway?</p> <h2 class="western">Bitemporal Historisation - Overview</h2> <p class="western">Once you get it, the approach is conceptually very simple. There are essentially just TWO datetime pairs to consider:</p> <p class="western"><b>1 Transaction datetimes.</b> These from/to datetimes show when the <i>warehouse</i> table is populated. They effectively constitute a version number for the data. If we
<pre class="western" style="padding-left: 30px;">
SELECT coverage
FROM customer_coverage_table AS c
WHERE c.Contact_LName = 'Fudd'
AND (c.BusinessFrom le '2020-04-01:00:00:00'dt lt c.BusinessTo)
AND (c.TransactionFrom le '2020-04-03:00:00:00'dt lt c.TransactionTo);
</pre> <p class="western">Why aren't we using BETWEEN? Because between is <a href="https://sqlblog.org/2011/10/19/what-do-between-and-the-devil-have-in-common#:~:text=See%20the%20full%20index.,range%20%E2%80%93%20not%20everyone%20gets%20that.">evil</a>!</p> <h2 class="western">Bitemporal Prerequisites and Implications</h2> <p class="western">Implementing a bitemporal approach requires a few principles to be adopted.</p> <h3>Records are Never Modified</h3> <p class="western">With the exception of the TransactionTo datetime field (and maybe the PROCESSED_DTTM in the DDS model), once loaded, a record must <strong>never be modified</strong> (or deleted). This would violate the objective of query repeatability.</p> <h3>Matching Close / Open Dates</h3> When a transaction is closed out and re-opened, or if business values are changing over time, the <em>closing</em> datetime must equal the <em>opening</em> datetime. This is to prevent the "temporal gap" that can happen when you close out a record at, say, 23:59:59 and re-open it at 00:00:00. What happens if you query at "23:59:59.5" ? The data has disappeared!! Note - not all ETL tools have this capability. It's common for an SCD2 load to add a second, or a day, when opening new records. <h3>Business / Transaction FROM must be less than the TO value</h3> Leading on from the previous point, FROM and TO dates cannot be equal, and it also follows that queries should be formed as follows:
<pre class="western" style="padding-left: 30px;">
SELECT *
FROM sometable as t
WHERE t.pk = 'some key value'
AND (t.BusinessFrom le &amp;BUSFROM lt t.BusinessTo)
AND (t.TransactionFrom le &amp;TXFROM lt c.TransactionTo);
</pre> The above query would always return either 0 or 1 records. It's imperative that there can only be a single record for a particular key value at a particular point in transaction + business time. <h2 class="western">Simple Bitemporal Examples</h2> <p class="western">Looking at the following (dummy) hierarchy, imagine we first loaded a table on 01JAN2019.</p> <p class="western" lang="en-GB"><img class="aligncenter size-full " src="/wp-content/uploads/2020/08/Screenshot-from-2020-08-06-22-56-04.png" alt="bitemporal" width="791" height="89" /></p> <p class="western">In the first case, consider the ERROR in the country code for XYZ Capital. This was spotted on 8th Feb 2019. The table is updated as follows:</p> <p class="western" lang="en-GB"><img class="aligncenter size-full " src="/wp-content/uploads/2020/08/Screenshot-from-2020-08-06-22-53-22.png" alt="" width="790" height="106" /></p> <p class="western">In the second case, lets consider a business change in NAME from "Crypto Fund" to "Doge GmbH". The need for this change was raised by the actuaries and performed by the IT team on 4th July 2020. However the actual (legal) change in name occurred on 20<sup>th</sup> April 2020. The data is updated as follows:</p> <img class="aligncenter size-full " src="/wp-content/uploads/2020/08/Screenshot-from-2020-08-06-22-50-51.png" alt="" width="794" height="150" /> In this manner, the previous results can always be reproduced (audited), and an "up to date" version of past periods can also be generated. <h2>Complex Bitemporal Example</h2> <p class="western">It can be seen that iterative insertions in the bitemporal model are fairly straightforward, but how will it deal with historical restatements?</p> <p class="western">It is noted that ALL historical restatements deal with the scenario of <i>overlapping ranges</i><i><b>.</b></i><b> </b>The most complex of these is the situation below:</p> <p class="western" lang="en-GB"><a href="/wp-content/uploads/2020/08/hist-op_1.1.6_en_manual.pdf"><img class="aligncenter size-full " src="/wp-content/uploads/2020/08/bitemporal5.png" alt="bitmporal overlapping" width="705" height="128" /></a></p> <p class="western" lang="en-GB">The solution is simply to remove the overlap and create three new records:</p> <p class="western" lang="en-GB"><a href="/wp-content/uploads/2020/08/hist-op_1.1.6_en_manual.pdf"><img class="aligncenter size-full " src="/wp-content/uploads/2020/08/bitemporal6.png" alt="bitemporal ranges" width="707" height="228" /></a></p> <p class="western"><span lang="en-GB">Lets see how this would apply to our data. It is decided by the new CFO on 6th August to temporarily rename </span><span lang="en-GB">"Trust Us Provincial" to "So Very Solvent SA" for the IFRS17 year end results. Who are we to argue!</span></p> <p class="western" lang="en-GB">The table is dutifully updated as follows:<a href="/wp-content/uploads/2020/08/hist-op_1.1.6_en_manual.pdf"><img class="aligncenter size-full " src="/wp-content/uploads/2020/08/Screenshot-from-2020-08-06-22-40-55.png" alt="" width="799" height="213" /></a></p> <p class="western">We simply query the table (for the natural key “3”) where TechnicalTo equals high date. This gave us 1 record. The new record was inserted from 31DEC2019. It applied To 01JAN2020. There are now 3 records of business history for the current transaction version of that natural key entry.</p> <h2 class="western">Summary</h2> <p class="western">Bi-temporal historisation can solve many date stamping woes and allow safe modifications to business history without affecting auditability (reproducability) of results. This is far more efficient than taking snapshots of the database, and far easier to work with.</p> SAS does not ship with a Bitemporal transform, however - Data Controller does. It also provides full Data Lineage (forwards &amp; reverse, table &amp; column level, including business logic applied). DDS features such as retained keys, PROCESSED_DTTM columns, and of course - SCD2 is also supported. A Data Catalog, Data Dictionary, and DDL generator are
[get in touch](/contact/).