Populating dwh staging area

1 minute read

Models prefixed with stg _comprise the so-called staging area. Source tables are green-coloured, staging tables are blue on the picture below (click to expand):

Since I want to use dbt to deploy my projects I would like to build a common ground. In other words that would be staging area pointing to source tables covered with data tests thoroughly.

Let us examine the source code for stg_ym_sessions_facts:

with source as (

select

     {{ surrogate_key(["account_id",
       "clientids_id",
       "dates_id",
       "traffic_id",
       "locations_id",
       "devices_id"])
       }} as id

, f.account_id
, f.clientids_id
, f.dates_id
, f.traffic_id
, f.locations_id
, f.devices_id
, f.sessions
, f.bounces
, f.pageviews
, f.duration
, gd.dt
, gd.ts

from {{ source('metrika', 'sessions_facts') }} as f
left join {{ ref('stg_general_dates') }} as gd
on gd.id = f.dates_id

{{ filter_rows(
account_id=var('account_id_metrika'),
last_number_of_days=true,
ts_field='dt'
) }}

)

select * from source

Staging tables serve the following key purposes:

Listing available columns

So that any subsequent model could use all of them or specific columns.

Simple joins are performed

See the ‘stg_general_dates‘ is joined to enrich the staging table with basic time dimension: session date and timestamp. You might also want to add weekday, quarter, date of the week start, quarter, etc.

Surrogate keys

Are generated where applicable. Macro is used which I will describe in detail later.

Rows are filtered

Whole logic is implemented in filter_rows macro. Key point here is you may have several source accounts streaming data to a single table, each of them having its own unique identifier. I only want to fetch rows for a specific account (customer).

There is also one major improvement for the Continuous Integration pipeline: using only 3 last days of data to test your DWH before you merge a new patch. I will cover it later.

Other things you can do in the staging area:

Rename your columns to align different sources
Cast field types, cast timezones (local -> utc)
Deduplicate rows if your ELT service doesn’t do so

Twitter Facebook LinkedIn

Artemiy Kozyr

Populating dwh staging area

You May Also Enjoy

Performing set operations union, intersect, minus with looker bi tool

Dbt supercharges sql with macros

Releasing core dwh version

Performing custom ‘not empty’ tests