Dimensional modeling in Amazon Redshift

July 20, 2023

1

Amazon Redshift is a completely managed and petabyte-scale cloud knowledge warehouse that’s utilized by tens of 1000’s of consumers to course of exabytes of knowledge each day to energy their analytics workload. You may construction your knowledge, measure enterprise processes, and get precious insights shortly will be accomplished by utilizing a dimensional mannequin. Amazon Redshift supplies built-in options to speed up the method of modeling, orchestrating, and reporting from a dimensional mannequin.

On this publish, we talk about learn how to implement a dimensional mannequin, particularly the Kimball methodology. We talk about implementing dimensions and information inside Amazon Redshift. We present learn how to carry out extract, rework, and cargo (ELT), an integration course of targeted on getting the uncooked knowledge from an information lake right into a staging layer to carry out the modeling. Total, the publish gives you a transparent understanding of learn how to use dimensional modeling in Amazon Redshift.

Answer overview

The next diagram illustrates the answer structure.

Within the following sections, we first talk about and show the important thing facets of the dimensional mannequin. After that, we create an information mart utilizing Amazon Redshift with a dimensional knowledge mannequin together with dimension and reality tables. Knowledge is loaded and staged utilizing the COPY command, the information within the dimensions is loaded utilizing the MERGE assertion, and information will likely be joined to the scale the place insights are derived from. We schedule the loading of the scale and information utilizing the Amazon Redshift Question Editor V2. Lastly, we use Amazon QuickSight to achieve insights on the modeled knowledge within the type of a QuickSight dashboard.

For this answer, we use a pattern dataset (normalized) supplied by Amazon Redshift for occasion ticket gross sales. For this publish, we’ve narrowed down the dataset for simplicity and demonstration functions. The next tables present examples of the information for ticket gross sales and venues.

In keeping with the Kimball dimensional modeling methodology, there are 4 key steps in designing a dimensional mannequin:

Establish the enterprise course of.
Declare the grain of your knowledge.
Establish and implement the scale.
Establish and implement the information.

Moreover, we add a fifth step for demonstration functions, which is to report and analyze enterprise occasions.

Conditions

For this walkthrough, it is best to have the next stipulations:

Establish the enterprise course of

In easy phrases, figuring out the enterprise course of is figuring out a measurable occasion that generates knowledge inside a company. Often, firms have some form of operational supply system that generates their knowledge in its uncooked format. It is a good place to begin to establish varied sources for a enterprise course of.

The enterprise course of is then continued as a knowledge mart within the type of dimensions and information. our pattern dataset talked about earlier, we are able to clearly see the enterprise course of is the gross sales made for a given occasion.

A typical mistake made is utilizing departments of an organization because the enterprise course of. The info (enterprise course of) must be built-in throughout varied departments, on this case, advertising and marketing can entry the gross sales knowledge. Figuring out the right enterprise course of is important—getting this step fallacious can influence all the knowledge mart (it might trigger the grain to be duplicated and incorrect metrics on the ultimate stories).

Declare the grain of your knowledge

Declaring the grain is the act of uniquely figuring out a report in your knowledge supply. The grain is used within the reality desk to precisely measure the information and allow you to roll up additional. In our instance, this could possibly be a line merchandise within the gross sales enterprise course of.

In our use case, a sale will be uniquely recognized by trying on the transaction time when the sale befell; this would be the most atomic stage.

Establish and implement the scale

Your dimension desk describes your reality desk and its attributes. When figuring out the descriptive context of your enterprise course of, you retailer the textual content in a separate desk, retaining the very fact desk grain in thoughts. When becoming a member of the scale desk to the very fact desk, there ought to solely be a single row related to the very fact desk. In our instance, we use the next desk to be separated right into a dimensions desk; these fields describe the information that we are going to measure.

When designing the construction of the dimensional mannequin (the schema), you may both create a star or snowflake schema. The construction ought to carefully align with the enterprise course of; subsequently, a star schema is greatest match for our instance. The next determine reveals our Entity Relationship Diagram (ERD).

Within the following sections, we element the steps to implement the scale.

Stage the supply knowledge

Earlier than we are able to create and cargo the scale desk, we’d like supply knowledge. Subsequently, we stage the supply knowledge right into a staging or non permanent desk. That is sometimes called the staging layer, which is the uncooked copy of the supply knowledge. To do that in Amazon Redshift, we use the COPY command to load the information from the dimensional-modeling-in-amazon-redshift public S3 bucket positioned on the us-east-1 Area. Observe that the COPY command makes use of an AWS Identification and Entry Administration (IAM) function with entry to Amazon S3. The function must be related to the cluster. Full the next steps to stage the supply knowledge:

Create the venue supply desk:

CREATE TABLE public.venue (
    venueid bigint,
    venuename character various(100),
    venuecity character various(30),
    venuestate character(2),
    venueseats bigint
) DISTSTYLE AUTO
        SORTKEY
    (venueid);

Load the venue knowledge:

COPY public.venue
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/venue.csv'
IAM_ROLE '<Your IAM function arn>'
DELIMITER ','
REGION 'us-east-1'
IGNOREHEADER 1

Create the gross sales supply desk:

CREATE TABLE public.gross sales (
    salesid integer,
    venueid character various(256),
    saletime timestamp with out time zone,
    qtysold BIGINT,
    fee numeric(18,2),
    pricepaid numeric(18,2)
) DISTSTYLE AUTO;

Load the gross sales supply knowledge:

COPY public.gross sales
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/gross sales.csv'
IAM_ROLE '<Your IAM function arn>'
DELIMITER ','
REGION 'us-east-1'
IGNOREHEADER 1

Create the calendar desk:

CREATE TABLE public.DimCalendar(
    dateid smallint,
        caldate date,
        day varchar(20),
        week smallint,
        month varchar(20),
        qtr varchar(20),
        yr smallint,
        vacation boolean
) DISTSTYLE AUTO
SORTKEY
    (dateid);

Load the calendar knowledge:

COPY public.DimCalendar
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/date.csv'
IAM_ROLE '<Your IAM function arn>'
DELIMITER ',' 
REGION 'us-east-1'
IGNOREHEADER 1

Create the scale desk

Designing the scale desk can rely on your enterprise requirement—for instance, do you must monitor adjustments to the information over time? There are seven completely different dimension varieties. For our instance, we use kind 1 as a result of we don’t want to trace historic adjustments. For extra about kind 2, seek advice from Simplify knowledge loading into Sort 2 slowly altering dimensions in Amazon Redshift. The size desk will likely be denormalized with a main key, surrogate key, and some added fields to point adjustments to the desk. See the next code:

create schema SalesMart;

CREATE TABLE SalesMart.DimVenue( 
    "VenueSkey" int IDENTITY(1,1) main key
    ,"VenueId" VARCHAR NOT NULL
    ,"VenueName" VARCHAR NULL
    ,"VenueCity" VARCHAR NULL
    ,"VenueState" VARCHAR NULL
    ,"VenueSeats" INT NULL
    ,"InsertedDate" DATETIME NOT NULL
    ,"UpdatedDate" DATETIME NOT NULL
) 
diststyle AUTO;

Just a few notes on creating the scale desk creation:

The sphere names are reworked into business-friendly names
Our main secret is VenueID, which we use to uniquely establish a venue at which the sale befell
Two extra rows will likely be added, indicating when a report was inserted and up to date (to trace adjustments)
We’re utilizing an AUTO distribution type to offer Amazon Redshift the accountability to decide on and modify the distribution type

One other necessary issue to think about in dimensional modelling is the utilization of surrogate keys. Surrogate keys are synthetic keys which might be utilized in dimensional modelling to uniquely establish every report in a dimension desk. They’re sometimes generated as a sequential integer, and so they don’t have any which means within the enterprise area. They provide a number of advantages, comparable to guaranteeing uniqueness and enhancing efficiency in joins, as a result of they’re sometimes smaller than pure keys and as surrogate keys they don’t change over time. This permits us to be constant and be part of information and dimensions extra simply.

In Amazon Redshift, surrogate keys are sometimes created utilizing the IDENTITY key phrase. For instance, the previous CREATE assertion creates a dimension desk with a VenueSkey surrogate key. The VenueSkey column is mechanically populated with distinctive values as new rows are added to the desk. This column can then be used to affix the venue desk to the FactSaleTransactions desk.

Just a few ideas for designing surrogate keys:

Use a small, fixed-width knowledge kind for the surrogate key. This may enhance efficiency and scale back space for storing.
Use the IDENTITY key phrase, or generate the surrogate key utilizing a sequential or GUID worth. This may be certain that the surrogate secret is distinctive and might’t be modified.

Load the dim desk utilizing MERGE

There are quite a few methods to load your dim desk. Sure components must be thought of—for instance, efficiency, knowledge quantity, and maybe SLA loading occasions. With the MERGE assertion, we carry out an upsert while not having to specify a number of insert and replace instructions. You may arrange the MERGE assertion in a saved process to populate the information. You then schedule the saved process to run programmatically through the question editor, which we show later within the publish. The next code creates a saved process known as SalesMart.DimVenueLoad:

CREATE OR REPLACE PROCEDURE SalesMart.DimVenueLoad()
AS $$
BEGIN
MERGE INTO SalesMart.DimVenue USING public.venue as MergeSource
ON SalesMart.DimVenue.VenueId = MergeSource.VenueId
WHEN MATCHED
THEN
UPDATE
SET VenueName = ISNULL(MergeSource.VenueName, 'Unknown')
, VenueCity = ISNULL(MergeSource.VenueCity, 'Unknown')
, VenueState = ISNULL(MergeSource.VenueState, 'Unknown')
, VenueSeats = ISNULL(MergeSource.VenueSeats, -1)
, UpdatedDate = GETDATE()
WHEN NOT MATCHED
THEN
INSERT (
VenueId
, VenueName
, VenueCity
, VenueState
, VenueSeats
, UpdatedDate
, InsertedDate
)
VALUES (
ISNULL(MergeSource.VenueId, -1)
, ISNULL(MergeSource.VenueName, 'Unknown')
, ISNULL(MergeSource.VenueCity, 'Unknown')
, ISNULL(MergeSource.VenueState, 'Unknown')
, ISNULL(MergeSource.VenueSeats, -1)
, ISNULL(GETDATE() , '1900-01-01')
, ISNULL(GETDATE() , '1900-01-01')
);
END;
$$
LANGUAGE plpgsql;

Just a few notes on the dimension loading:

When a report in inserted for the primary time, the inserted date and up to date date will likely be populated. When any values change, the information is up to date and the up to date date displays the date when it was modified. The inserted date stays.
As a result of the information will likely be utilized by enterprise customers, we have to substitute NULL values, if any, with extra business-appropriate values.

Establish and implement the information

Now that we’ve declared our grain to be the occasion of a sale that befell at a particular time, our reality desk will retailer the numeric information for our enterprise course of.

We have now recognized the next numerical information to measure:

Amount of tickets offered per sale
Fee for the sale

Implementing the Truth

There are three forms of reality tables (transaction reality desk, periodic snapshot reality desk, and accumulating snapshot reality desk). Every serves a unique view of the enterprise course of. For our instance, we use a transaction reality desk. Full the next steps:

Create the very fact desk

CREATE TABLE SalesMart.FactSaleTransactions( 
    CalendarDate date NOT NULL
    ,SaleTransactionTime DATETIME NOT NULL
    ,VenueSkey INT NOT NULL
    ,QuantitySold BIGINT NOT NULL
    ,SaleComission NUMERIC NOT NULL
    ,InsertedDate DATETIME DEFAULT GETDATE()
) diststyle AUTO;

An inserted date with a default worth is added, indicating if and when a report was loaded. You need to use this when reloading the very fact desk to take away the already loaded knowledge to keep away from duplicates.

Loading the very fact desk consists of a easy insert assertion becoming a member of your related dimensions. We be part of from the DimVenue desk that was created, which describes our information. It’s greatest follow however elective to have calendar date dimensions, which permit the end-user to navigate the very fact desk. Knowledge can both be loaded when there’s a new sale, or every day; that is the place the inserted date or load date turns out to be useful.

We load the very fact desk utilizing a saved process and use a date parameter.

Create the saved process with the next code. To maintain the identical knowledge integrity that we utilized within the dimension load, we substitute NULL values, if any, with extra enterprise applicable values:

create or substitute process SalesMart.FactSaleTransactionsLoad(loadate datetime)
language plpgsql
as
    $$
start
--------------------------------------------------------------------
/*** Delete information loaded for the day, ought to there be any ***/
--------------------------------------------------------------------
Delete from SalesMart.FactSaleTransactions
the place forged(InsertedDate as date) = CAST(loadate as date);
RAISE INFO 'Deleted rows for load date: %', loadate;
--------------------------------------------------------------------
/*** Insert information ***/
--------------------------------------------------------------------
INSERT INTO SalesMart.FactSaleTransactions (
CalendarDate    
,SaleTransactionTime    
,VenueSkey  
,QuantitySold  
,Salecomission
)
SELECT DISTINCT
    ISNULL(c.caldate, '1900-01-01') as CalendarDate
    ,ISNULL(a.saletime, '1900-01-01') as SaleTransactionTime
    ,ISNULL(b.VenueSkey, -1) as VenueSkey
    ,ISNULL(a.qtysold, 0) as QuantitySold
    ,ISNULL(a.fee, 0) as SaleComission
FROM
    public.gross sales as a
 
LEFT JOIN SalesMart.DimVenue as b
on a.venueid = b.venueid
 
LEFT JOIN public.DimCalendar as c
on to_char(a.saletime,'YYYYMMDD') = to_char(c.caldate,'YYYYMMDD');
--Non-compulsory filter, must you need to load solely the newest knowledge from supply
--where forged(a.saletime as date) = forged(loadate as date);
  
finish;
$$;

Load the information by calling the process with the next command:

name SalesMart.FactSaleTransactionsLoad(getdate())

Schedule the information load

We will now automate the modeling course of by scheduling the saved procedures in Amazon Redshift Question Editor V2. Full the next steps:

We first name the dimension load and after the dimension load runs efficiently, the very fact load begins:

BEGIN;
----Insert Dim Hundreds
name SalesMart.DimVenueLoad();

----Insert Truth Hundreds. They'll solely run if the DimLoad is profitable
name SalesMart.FactSaleTransactionsLoad(getdate());
END;

If the dimension load fails, the very fact load is not going to run. This ensures consistency within the knowledge as a result of we don’t need to load the very fact desk with outdated dimensions.

To schedule the load, select Schedule in Question Editor V2.

We schedule the question to run each day at 5:00 AM.
Optionally, you may add failure notifications by enabling Amazon Easy Notification Service (Amazon SNS) notifications.

Report and evaluation the information in Amazon Quicksight

QuickSight is a enterprise intelligence service that makes it simple to ship insights. As a completely managed service, QuickSight enables you to simply create and publish interactive dashboards that may then be accessed from any system and embedded into your functions, portals, and web sites.

We use our knowledge mart to visually current the information within the type of a dashboard. To get began and arrange QuickSight, seek advice from Making a dataset utilizing a database that’s not autodiscovered.

After you create your knowledge supply in QuickSight, we be part of the modeled knowledge (knowledge mart) collectively primarily based on our surrogate key skey. We use this dataset to visualise the information mart.

Our finish dashboard will include the insights of the information mart and reply important enterprise questions, comparable to complete fee per venue and dates with the best gross sales. The next screenshot reveals the ultimate product of the information mart.

Clear up

To keep away from incurring future costs, delete any sources you created as a part of this publish.

Conclusion

We have now now efficiently applied an information mart utilizing our DimVenue, DimCalendar, and FactSaleTransactions tables. Our warehouse will not be full; as we are able to broaden the information mart with extra information and implement extra marts, and because the enterprise course of and necessities develop over time, so will the information warehouse. On this publish, we gave an end-to-end view on understanding and implementing dimensional modeling in Amazon Redshift.

Get began along with your Amazon Redshift dimensional mannequin at this time.

Concerning the Authors

Bernard Verster is an skilled cloud engineer with years of publicity in creating scalable and environment friendly knowledge fashions, defining knowledge integration methods, and guaranteeing knowledge governance and safety. He’s enthusiastic about utilizing knowledge to drive insights, whereas aligning with enterprise necessities and aims.

Abhishek Pan is a WWSO Specialist SA-Analytics working with AWS India Public sector prospects. He engages with prospects to outline data-driven technique, present deep dive periods on analytics use circumstances, and design scalable and performant analytical functions. He has 12 years of expertise and is enthusiastic about databases, analytics, and AI/ML. He’s an avid traveler and tries to seize the world by his digital camera lens.

Dimensional modeling in Amazon Redshift

Answer overview

Conditions

Establish the enterprise course of

Declare the grain of your knowledge

Establish and implement the scale

Stage the supply knowledge

Create the scale desk

Load the dim desk utilizing MERGE

Establish and implement the information

Implementing the Truth

Schedule the information load

Report and evaluation the information in Amazon Quicksight

Clear up

Conclusion

Concerning the Authors

Related Articles

Pathlight Finds a Path to Actual-World GenAI Productiveness

Pretend WinRAR PoC Exploit Conceals VenomRAT Malware

iPhone 15 gives extra particulars on battery well being

LEAVE A REPLY Cancel reply

Latest Articles

Pathlight Finds a Path to Actual-World GenAI Productiveness

Pretend WinRAR PoC Exploit Conceals VenomRAT Malware

iPhone 15 gives extra particulars on battery well being

Google Advertisements Routinely Created Belongings Obtainable In 8 Languages

Atlas VPN Evaluate: Finest VPN for Torrenting Safely and Anonymously

About Us