The Three Key Ideas about Warehouse
The purpose of the data warehouse should be clear. In most cases, it is for the analysis of data, not for data production.1
- Subject-oriented: since data warehouses are for decision makers, arrange them into subjects makes it much easier to access.
- Integraded: many sources are integrated for easy analysis
- Time-variant: observation time should be recorded since the data is also used to analyze time evolution
- Nonvolatile: simply for analysis
OLTP and OLAP
- OLTP: online transaction processing
- OLAP: online analytical processing
|user||customer||data scientist, managers|
|database||entity relation model, application oriented||star/snowflake model, subject oriented|
|history||usually no need to record the history||history is crucial|
|query||short and frequent read and write||read-only and but complicated analysis|
Scope of Data Warehouse
- Enterprise warehouse: targeting the whole organization
- Data mart: for a specific group of people
- Virtual warehouse: views not tables
Fact and Dimension
Fact is the value of something specified by the dimension. For example, we have a table that tells us the revenue of the year 2018 for keyboards is 3 million euros.
Category serve as dimensions and
Revenue serves as the fact.
The dimension tells us what we are talking about and the fact tells us the values. For the previous example, we could set up a coordinate system, with two axes,
Category. On the
Year axis, we have two values,
2018. On the
Category axis, we have two values
Headset. If we are looking for the revenue of keyboards in 2018, then we are looking for
Headset. It is clearly a 2 dimensional data cube.
In princple, we could have N dimensional data cube. For example, we would like to look at the revenue of
- keyboards (
- in 2018 (
- in Cologne (
- sold offline (
Then we have a 4 dimensional dataset. Revenue is the fact and
Channel are the dimensions.
Concept hierarchies are crucial to OLAP operations.
We consider a data cube about revenue with the dimension
Date with values such as 2018-11-26. We would like to know the monthly revenue, the quarterly revenue, and the annual revenue. It’s straightforward to calculate those. The idea is that monthly, quarterly, and annual revenue form a hierarchy: day $\in$ month $\in$ quarter $\in$ year. This is a convention used by everyone on this planet.
In some other cases, concept hierarchies are very much designed instead of being a convention. Suppose we record the prices of houses. It would be nice to categorize the prices into different ranges of prices. Some would say, we divide them into five ranges:
- below 25 percentile
- 25 to 50 percentile
- 50 to 75 percentile
- above 75 percentile
Others would say, we divide them into 100 euros segments
- 0 to 100 euros
- 100 to 200 euros
In fact, this is not always the case. Sometimes, data products can be delivered using some assist of data warehouse if the data product doesn’t rely on the performance of the data source. ↩