Data Warehouse


Data from different operational source/production databases is extracted using ETL tools and loaded into a separate, non-volatile database to provide meaningful business insights. It gives businesses the information they need to make strategic decisions using query and reporting tools.
The idea is to discover meaningful new correlations, patterns, and trends by mining large amounts of data and analysing them through multidimensional views.

Data marts contain a subset of organization-wide data valuable to a specific group of people in an organization.
Metadata describes which tables, attributes, and keys the Data Warehouse contains.




A Data Model describes Entities (tables), Attributes (properties/columns), and the Relationships (FK columns) between two entities.

A Dimensional Model is used to read, summarize, and analyze numeric information using "fact" and "dimension" tables.
Facts are numeric measures such as cost, keyed by product id, location id, and time id; they can be summed. The fact table carries foreign keys to the dimensions (together these usually form its primary key).
Dimensions hold the descriptive Product, Location, and Time details; each dimension table has a primary key that the fact table references.
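As a minimal sketch (the table and column names such as fact_sales, dim_product, dim_location, and dim_time are hypothetical):

-- Dimension tables: descriptive attributes, each with a primary key
CREATE TABLE dim_product  (product_id  NUMBER PRIMARY KEY, product_name VARCHAR2(100), category VARCHAR2(50));
CREATE TABLE dim_location (location_id NUMBER PRIMARY KEY, city VARCHAR2(50), country VARCHAR2(50));
CREATE TABLE dim_time     (time_id     NUMBER PRIMARY KEY, calendar_date DATE, fiscal_quarter VARCHAR2(8));

-- Fact table: numeric measures plus foreign keys pointing at the dimension primary keys
CREATE TABLE fact_sales (
  product_id   NUMBER REFERENCES dim_product(product_id),
  location_id  NUMBER REFERENCES dim_location(location_id),
  time_id      NUMBER REFERENCES dim_time(time_id),
  sales_amount NUMBER,   -- additive measure; can be summed across any dimension
  units_sold   NUMBER,
  CONSTRAINT fact_sales_pk PRIMARY KEY (product_id, location_id, time_id)
);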

A Star schema has the fact table (normalized) at the center and dimension tables (de-normalized) at the outer points of the star.
Snowflake schema : a star schema in which the dimensions are normalized and connected to further dimension tables.
Galaxy schema : contains two or more fact tables that share dimension tables.
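To illustrate the snowflake variant, the hypothetical dim_product above could itself be normalized so that category information moves into its own table:

-- Snowflake: the category attribute is split out of the product dimension
CREATE TABLE dim_category (category_id NUMBER PRIMARY KEY, category_name VARCHAR2(50));

CREATE TABLE dim_product_sf (
  product_id   NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  category_id  NUMBER REFERENCES dim_category(category_id)   -- extra join hop compared with the star version
);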



Data lakes hold data in its original format until the time it is needed. They store large amounts of structured, semi-structured, and unstructured data.



Data Mart :

A Data Mart is a collection of data coming from many sources, focused on one subject area. So all the data relating to sales might sit in one data mart.

From this data mart you would identify the fact data (sales) and the dimension data (shops, sales dates, sales persons, etc.). Then you build a cube that contains the facts and dimensions. The cube actually physically contains the data and becomes what you could call a multidimensional table. You can query it by each of your dimensions to reveal the facts.

A fact table is a set of metrics or measures that relate to a specific business process or event.

A dimension table contains the attributes that place the metrics in the fact table into a particular context.

A star schema is a fact table and a set of dimension tables and the relationships between the two sets.

A snowflake schema data mart is a star schema with dimension tables that have been normalized.

Cube is a multidimensional database that is specifically architected to support all flavors of reporting including analytics, trending and historical reporting. It is organized into measures (facts) and dimensions.
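As a sketch against the hypothetical star schema above, a single query can emulate the multidimensional view by joining the fact table to its dimensions and rolling the measure up over every combination of dimension values with GROUP BY CUBE:

SELECT p.category, l.country, t.fiscal_quarter,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
       JOIN dim_product  p ON p.product_id  = f.product_id
       JOIN dim_location l ON l.location_id = f.location_id
       JOIN dim_time     t ON t.time_id     = f.time_id
GROUP  BY CUBE (p.category, l.country, t.fiscal_quarter);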




Notes :

To process large volumes of data quickly, the server must be able to support parallel, large-table-full-table scans for data warehouse aggregation.

Oracle data warehouses are I/O-bound, while OLTP Oracle systems are mostly CPU-bound.
In a data warehouse, employ ways to remove I/O bottlenecks and make the workload more CPU-intensive. Once the bottleneck moves from I/O to CPU, performance can be scaled by adding processors.


  • Use AWR to find large-table full-table scans (db file scattered read waits), hash joins (which might be replaced with STAR joins), and RAM usage within the pga_aggregate_target region.


  • Use multiple block sizes: place indexes accessed via range scans, and objects accessed through full table/index scans, in a larger block size with its own buffer cache.


  • Use ASM (Automatic Storage Management) and AMM (Automatic Memory Management).


  • Cache frequently used small dimension tables and their indexes in the KEEP pool (see the sketch after this list).
  • Use 64-bit architecture with a large amount of RAM.


  • Parallel full table scans bypass the SGA data buffers (db_cache_size); blocks are read directly into the program global area (PGA). So a very large SGA does not benefit data warehouse and DSS systems.


  • Hash joins do not use indexes and perform full-table scans (often using parallel query). They also use more PGA memory. Hence, hash joins with parallel full-table scans tend to drive up CPU consumption.
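For the KEEP-pool point above, assigning a small dimension table and its index to the KEEP pool is a one-line change per object (the object names are hypothetical, and this assumes db_keep_cache_size has been sized appropriately):

-- Pin small, frequently joined dimension objects in the KEEP buffer pool
ALTER SYSTEM SET db_keep_cache_size = 512M SCOPE=BOTH;
ALTER TABLE dim_product STORAGE (BUFFER_POOL KEEP);
ALTER INDEX dim_product_pk STORAGE (BUFFER_POOL KEEP);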



Hints :

Hash Join :


The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, using the hash table to find the joined rows.
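A sketch of requesting this plan with the USE_HASH hint, reusing the hypothetical tables above (normally the optimizer picks the build side itself):

SELECT /*+ USE_HASH(f p) */ p.product_name, SUM(f.sales_amount) AS total_sales
FROM   dim_product p                                     -- smaller side: hashed in memory on the join key
       JOIN fact_sales f ON f.product_id = p.product_id  -- larger side: scanned and probed against the hash table
GROUP  BY p.product_name;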



Nested Loop Join:

It indicates that an index is available for use during the join. The optimizer first selects a driving table for the join.
For every row in the outer (driving) table / outer loop, Oracle probes the inner table / inner loop, typically via an index, for matching rows. To influence a nested loop join, use the
ORDERED, INDEX, FULL, or USE_NL hints. Nested loops are a very effective join operation for OLTP.
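A sketch using ORDERED and USE_NL to pin the dimension as the driving (outer) table, assuming a hypothetical index fact_sales_time_idx exists on fact_sales(time_id):

SELECT /*+ ORDERED USE_NL(f) INDEX(f fact_sales_time_idx) */
       t.calendar_date, f.sales_amount
FROM   dim_time t               -- driving (outer) table
       JOIN fact_sales f        -- inner table, probed via its index once per outer row
         ON f.time_id = t.time_id
WHERE  t.fiscal_quarter = '2024-Q1';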



Star Transformation :

A bitmap index should be built on each of the foreign key columns of the fact table or tables.
The initialization parameter STAR_TRANSFORMATION_ENABLED should be set to TRUE.

It retrieves the necessary rows from the fact table (row set) and joins this row set to the dimension tables.

Use the STAR_TRANSFORMATION hint.
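Putting the prerequisites together as a sketch against the hypothetical schema above:

-- Bitmap indexes on each fact-table foreign key column
CREATE BITMAP INDEX fact_sales_prod_bix ON fact_sales(product_id);
CREATE BITMAP INDEX fact_sales_loc_bix  ON fact_sales(location_id);
CREATE BITMAP INDEX fact_sales_time_bix ON fact_sales(time_id);

ALTER SESSION SET star_transformation_enabled = TRUE;

-- Ask the optimizer to consider the transformation explicitly
SELECT /*+ STAR_TRANSFORMATION */ p.category, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f, dim_product p, dim_time t
WHERE  f.product_id = p.product_id
AND    f.time_id    = t.time_id
AND    t.fiscal_quarter = '2024-Q1'
GROUP  BY p.category;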








Best practices to achieve good performance in a data warehouse environment :

The majority of operations that execute in parallel bypass the buffer cache. Parallel servers communicate among themselves and with the Query Coordinator by
passing messages via memory buffers. Only the memory needed for parallel_min_servers is pre-allocated from the shared_pool at database startup.

shared_pool_size  : Memory (single instance) = #Users * DOP * (4 + 2 * DOP) * parallel_execution_message_size
                    Memory per instance (RAC) = #Users * (DOP / #Instances) * (2 + 2 * (DOP / #Instances) + 4 * (DOP - DOP / #Instances)) * parallel_execution_message_size


pga_aggregate_target: Greater than  100MB X parallel_max_servers


parallel_execution_message_size:  Ideally increase it to 16 KB. Ensure enough parallel servers are available.
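Plugging illustrative numbers into the single-instance formula above (10 concurrent users, DOP 8, 16 KB messages; the figures are examples only):

-- shared_pool message-buffer estimate:
--   10 users * 8 DOP * (4 + 2*8) * 16384 bytes = 10 * 8 * 20 * 16 KB ~ 25 MB
ALTER SYSTEM SET parallel_execution_message_size = 16384 SCOPE=SPFILE;  -- static parameter: takes effect after restart

-- With parallel_max_servers = 64, the 100 MB-per-server rule above gives > 6.4 GB:
ALTER SYSTEM SET pga_aggregate_target = 7G SCOPE=BOTH;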



Use a Dimensional Star Schema Design
Partition Tables that are likely to grow large.
Use Bitmap indexes on Fact tables for Dimension table joins : create a bitmap index on each of the foreign key columns in the fact table or tables.
Use Materialized views to create generic aggregate tables (see the sketch after this list).

Large Block size  : Use Larger block sizes.
db_file_multiblock_read_count  :  Set it so that db_block_size * db_file_multiblock_read_count equals the maximum I/O size (roughly 1 MB), i.e. 1024 KB / db_block_size.
Large pga_aggregate_target (or sort_area_size).
Gather table statistics.
Use the Keep buffer pool.
Use Hints
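For the materialized-view point in the list above, a minimal sketch of a query-rewrite-enabled aggregate over the hypothetical fact table:

CREATE MATERIALIZED VIEW mv_sales_by_product_qtr
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT f.product_id, t.fiscal_quarter,
       SUM(f.sales_amount) AS total_sales,
       COUNT(*)            AS row_cnt
FROM   fact_sales f JOIN dim_time t ON t.time_id = f.time_id
GROUP  BY f.product_id, t.fiscal_quarter;

-- Queries that aggregate sales by product and quarter can now be rewritten
-- transparently to read the much smaller materialized view.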


Use Parallel : alter session enable parallel query; alter session enable parallel dml;  alter system set parallel_degree_policy = auto;

parallel_min_servers: Number of parallel servers that will be started during database startup

parallel_max_servers: Keep it greater than the maximum # of concurrent queries * the maximum DOP needed by a query (default: 10 * cpu_count * parallel_threads_per_cpu).
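A sketch of enabling parallelism for a session and forcing a specific DOP on one statement (the DOP of 8 is only an example), plus the default parallel_max_servers formula worked through with sample values:

ALTER SESSION ENABLE PARALLEL QUERY;
ALTER SESSION ENABLE PARALLEL DML;

-- Statement-level DOP via hint
SELECT /*+ PARALLEL(f, 8) */ COUNT(*)
FROM   fact_sales f;

-- Default parallel_max_servers with cpu_count = 16 and parallel_threads_per_cpu = 2:
--   10 * 16 * 2 = 320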


When designing a balanced system, allow at least 200 MB/s of I/O throughput per core. If a server is dual core, each node can
drive 400 MB/s of I/O. The HBAs should sustain this throughput: with 2 HBAs per node, each should sustain 200 MB/s.

Ensure the disks and controllers can sustain this throughput (200 MB/s each).
The disks are connected to a switch. If a disk array of 4 disks is connected to the switch through fibre channel links of 200 MB/s each,
the throughput from the disk array into the switch is 4 x 200 MB/s = 800 MB/s. The switch in turn connects to 4 nodes with 2 HBAs each, so that 800 MB/s
is divided across the 4 nodes, i.e. roughly 200 MB/s per node when a full table scan runs on all 4 nodes simultaneously - only half of the 400 MB/s each node can drive, so the disk side would need to be doubled to keep the configuration balanced.