The first & the foremost thing in developing a data warehouse is to imagine & implement the schema according to which the ETL jobs will ingest data. The simplest way of schema that can be used for developing data marts is called star schema.
Before jumping to star schema example let me list the main advantages & building blocks of star schema.
Advantages of star schema
- Simpler Queries.
- Simplified Business Reporting Logic.
- Query Performance Gains.
- Faster Aggregations.
- Feeding Cubes.
Building Blocks of Star Schema
There are two building blocks of star schema
- Fact Tables : Hold the measurable quantitative data about a business generally numeric values & foreign Keys to dimension tables. Usually 1 in number.
- Dimension Tables : Hold the descriptive information about the entities present in fact table. Has less number of records in comparison but each record holds a lot of information.
Star Schema by example
Lets consider an online store where you can order different items. So what will be the key components of the store from data perspective, let me list them for you.
Each of the components are simply entities of a store & each entity has different attributes , let's break it down to a list
- Sales Information : SaleID, SalePerson , SaleAmount
- Customer Information : CustomerID,Phone,Address,Name,City,State, Country
- Product Information : ProductID,Name,Category,Description
- Employees Information : EmployeID, Name,Status,ManagerID
By default any online store has a propriety database which stores this information in this format but each database has it's limitations this database can be perfect for OLTP but not for OLAP.
Lets start building
Based on the attributes listed above lets start building our star schema. Its a common practice to place Fact & Dim either preceding or succeeding the table name.
Here's what our fact table will look like
As i mentioned earlier it only contains numeric values (SalesAmount,Quantity) & keys to dimension tables.
Lets take a glance at one of our dimension tables:
You will definitely notice a few things here which can possibly be confusing
- CustomerDimID & CustomerID : why can't we use the customerID as CustomerDimID, here's why most of the proprietary databases just care about the current state of data, there is no way to see historical trends & patterns that's why we assigned a separate primary key, for example our customer shifted to another city his CustomerID will remain the same but he will be assigned a new CustomerDimID which will allow us to see how many versions of a specific customer are present within our system. This concept is called Slowly Changing Dimensions.
Using the same approach we can build our remaining dimension tables which are ProductDim, EmployeDim & DateDim.
Note : Primary keys in DateDim are usually constructed via numeric dates i.e for 2018-09-13 our key will be 20180913.
After making our table connections here's how our schema will look like.
Data now a days is ever growing , star schema is a great way to break down your your data into more understandable & manageable tables. It provides you a flexible way to expand your data warehousing tendencies as you begin to handle more & more data. Stay tuned for my snowflake schema example!!