Data Warehousing, Decision Support & OLAP
I. Data warehouses: An overview (Chaudhuri and Dayal 97))
Data Warehouses
- A data warehouse is an organization-wide data repository,
typically used for decision-making.
- An integrated enterprise warehouse (e.g., walmart) collects info about all subjects, e.g. customers, products, sales, assets, personnel.
- A DBMS that runs these decision-making queries efficiently is
sometimes called a "Decision Support System" DSS.
OLAP vs OLTP
Data warehouses support on-line analytic processing (OLAP). They are maintained separately from the operational databases that support on-line transaction processing (OLTP) system, due to differences in functional and performance requirements
OLTP
- Applications:
- banking
- ticketing
- telecommunications: phone location update and lookups
- e-commerce: buying and searches
- Detailed, up-to-date data; current value, most recent snapshot of data, e.g., current salary or price.
- Structured and repetitive tasks: banking operations, order entry
- Mixture of reads and writes
- Consist of short, atomic, isolated transactions
- Consistency and recovery are critical
- Goal: maximizing transaction throughput
OLAP
- Applications:
- manufacturing: order shipment and customer support
- retail: user profiling, inventory management
- financial services: claims analysis, risk analysis, credit card analysis, fraud detection
- telecommunications: call analysis, fraud detection
- utilities: power usage analysis
- healthcare: outcome analysis
- Consolidated data, including all hostical data and sometimes also summaries.
- Orders of magnitude larger than operational databases: consolidated from multiple operational databases
- Mostly complex queries that access lots of data, and do many scans, joins, and aggregations. Tend to look for "the big picture".
- Typical workloads: read intensive, append only.
- Goals: query throughput and response times. Parallelism a must.
In class exercises: OLTP vs OLAP
- Search log analysis. Data: (keyword list, clicks on returned URLs).
- click stream analysis. Data: (URL, click, user info).
- Yahoo user community. Data includes user profiles, products, reviews, ratings of reviews. Updates and searches both exist.
-->
Data model: N-dimentional View
To facilitate analysis and visualization, data is often modeled multidimensionally
- In the multidimensional model, a set of numeric measures are the objects of analysis.
- E.g. sales, budget, revenue, inventory, ROI (return on investment), etc.
- These are things over which you might aggregate
- Each of the numeric measures depends on a set of dimensions, which provide the context for the measure.
- E.g. for a sales warehouse, we have dimensions time_of_sale,
sales_district, salesperson, product…
- A dimention can be described by a set of attributes, often organized hierarchically with different levels of detail.
- E.g. time_of_sale has the day-month-quarter-year hierarchy, and product has the product-category-industry hierarchy.
Operations include:
- Aggregation by one or more dimensions: e.g., computing and ranking the total sales by each county, each year, etc.
- Roll up: Take the current data object and do a further group-by. E.g., time_of_sale may be "rolled up" into day-month-quarter-year, product "rolled up" into product-category-industry
- Drill down: opposite of rollup
- Slice_and_dice: i.e. selection & projection in the dimensions
- Pivot: re-orient the multidimensional view
ROLAP vs. MOLAP
ROLAP (Relational OLAP) uses standard relational tables & engine to do OLAP
- Requires denormalized schema
- Star Schema: Fact table + table per dimension
- Snowflake Schema: off of the dimensions, have rolled-up versions
- Products: MicroStrategy, Metacube (Informix), Information
Advantage.
- Uses standard relational query processing, with lots of indexes
and precomputation
MOLAP (Multidimensional OLAP) actually stores things in multi-d format
- Special index structures are used to support this
- Note that much of the cube is empty! (no sales of Purple Chevy
Trucks in June in Reno)
- Identify the "dense" and "sparse" dimensions. Build an index over
combos of sparse dimensions. Within a combo, find a dense subcube of
the dense dimensions. Zhao, et al. propose a different model.
- Products: Essbase (Arbor), Express (Oracle), Lightship (Pilot)
- Essentially everything is precomputed
More recently, HOLAP (Hybrid OLAP) to combine the best of both
- Microsoft Plato due out soon, will make OLAP commonplace
- Some vendors (e.g. Informix/Essbase) talking about MOLAP ADTs
inside an ORDBMS
- Keep this in mind as we read Zhao/Deshpande/Naughton
Query processing issues
- Indexing (ROLAP): Bitmap indexes, join indexes, index intersection
- Cube computation (ROLAP and MOLAP)
- Precomputation (ROLAP): materialized views
- Query decomposition
- Storage: column stores, with new indexes and query optimizations
III. The Data Cube (Gray et al. 07)
Gray et al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals".
Motivation: Data analysis tasks intensively use dimensionality reduction (aggregation) but GROUP BY in traditional SQL failed to adequately address the requirements. Gray et al. proposed new CUBE and ROLLUP operators and addressed efficient computation of the cube.
How do traditional relational databases fit into the multi-dimensional data analysis picture?
- How can flat files model an N-dimensional problem? A relation with N attributes representing the dimensions and additional attributes representing the measures.
- How do the relational systems support operations over N-dimensional representations? Often data along the other dimensions that are not included in a 2D representation are summarized via aggregation in the form of histogram, cross-tabulation, subtotals, etc. They rely on SQL aggregate functions and GROUP BY.
Problems of GROUP BY:
- To build a N-dimensional rollup, too many GROUP BYs and UNIONS (worst case 2^N).
- Too expensive to analyze for optimization (worse case 2^N scans, sorts or hanshes).
CUBE and ROLLUP operators
Main idea: illustrated in the followig picture:
Operator
CUBE: The full data cube is all the base data in the cube, plus all the subaggregates obtained by projection.
Semantics: It requires the generation of the power set (set of all subsets) of the aggregation columns.
- It first aggregates over all the attributes in the GROUP BY clause as in a standard GROUP BY.
- Then it UNIONs in each super-aggregate of the global cube--substituting ALL for the aggregation columns. If there are N attributes in the select list, there will 2^N-1 super-aggregate values.
ROLLUP: If the fullcube is not required, use ROLLUP to produce just the super-aggregates wanted. For example, roll-ups by year, week, day are common, but a cube on these three atributes would be meaningless.
Semantics:
- Semantic 1: Rollup aggregates data from higher-dimensionality to lower-dimensionality, e.g, 3D to 2D (?)
- Semantic 2 (which makes more sense): When a dimension consists of a hierarchy of attributes, rollup means aggregating from fine-level attribute to course-level attribute.
CUBE as relational operator implies the need to model projection with a
magic ALL value (ick!). Recall the mess with NULLs -- this is worse.
Implemetnation
Categories of aggregate functions (which turned out to be useful for sensor networks)
- Distributive: computation can be totall distributed in subsets,e.g., COUNT, MIN, MAX, SUM.
- Algebraic: results are a function of values computed by distributive functions, e.g., AVG, DEV, etc.
- Holistic: need all source data to compute each aggregate function, e.g., MEDIAN, PERCENTILE, RANK, etc.
We will look at computing the cube soon
(Zhao/Deshpande/Naughton.)
IV. Computing the Cube Efficiently (Zhang et al. 07)
ROLAP: Use hashing or sorting (Agrawal, et al., VLDB ’96). Be sure
to reuse aggs computed for small groups to compute aggs over big groups.
MOLAP: Carefully walk the basic cube to compute projected aggs
with minimal memory requirements (Zhao, et al. SIGMOD ’97)
Zhao, et al. point out interesting interplay (coming up at the end.)
A. Storing cubes
- Chunk the cube into n-dimensional subcubes
- For dense cubes (> 40% full), store them as is (fixed size.)
- For sparse cubes (< 40% full), compress them with "chunk-offset
compression" (<offset-in-Chunk, value> pairs).
- An algorithm is given for converting a relation to a chunked array
representation.
B. A simple array cubing algorithm
- Given: cube of dimensions A,B,C. Compute: aggregates for BC.
- This is like "sweeping a plane" through the A dimension, and
bringing the values into the BC plane. Do this in a chunkwise fashion –
compute the agg on a chunk, then on the chunk in front of it, and so on.
When you’re done with a "front cell", write it to disk. This uses only 1
chunk-size piece of memory at a time.
- Then compute individual aggs (B and C) from the agg for BC.
Question: what do we use to compute subaggs? Think of the following
lattice: The decision corresponds to a choice of a
spanning tree in the lattice. We are given dimension sizes and chunk
sizes, so we can compute the size to store each node in lattice, as well
as how much memory needed to use a node to compute its child.
Minimum Spanning Tree (MST) of Lattice is defined as follows:
For each node n, its parent in the MST is the
parent node n’ with minimum size.
The full naive algorithm:
- Construct the MST
- Compute each node from its parent in the MST as illustrated above.
Problem: we rescan things too often (to compute AB, AC, and BC we scan
ABC 3 times)
C. A Multi-Way Array Algorithm
Goal: compute all the subaggs in a single pass of the underlying cube,
using minimal memory. Dimension Order: an ordering on the
dimensions corresponding to a "row major" traversal of chunks (ABC in the
following picture):
This determines memory requirements:
- The BC values require 1 chunk BCs each to aggregate away the A's
- The AC values require 4 chunks ACs each to aggregate away the B's
- The AB values require 16 chunks ABs each to aggregate away the C's
- More generally, if array element size is u, dimension
X chunk size is Xc, and dimension Y size
is Yd, we need
|Bc||Cc|u +
|Ad||Cc|u +
|Ad||Bd|u memory to do this
simultaneously.
- This generalizes naturally to more dimensions: to compute a
projection on one dimension, you need to look at product of sizes of
dimensions in the ordering prefix before the projected
dimension, times the product of the chunk sizes of the
remaining dimensions.
- I.e. it's harder to aggregate away dimensions that are later in the
dimension order
This leads to the Minimum Memory Spanning Tree (MMST):
- For each node below the root of the lattice, choose the parent that
requires the least memory during traversal (Note: this depends on the
chosen dimension ordering.)
- To project out more than 1 dimension, you use the same logic to go
from k dimensions to k-1 as you did from all dimensions to all-1. E.g.
to compute C from AC, you need to look at |Ad| values
simultaneously. This happens as you write the k-d aggs to disk.
- There’s a simple formula for memory requirements for the MMST.
What is the optimal dimension order?
- Can be computed by optimizing that simple formula.
- Turns out to be simple: order by increasing dimension size!
- Memory requirement is independent of the size of the largest
dimension (which is aggregated away 1st).
D. Multi-pass Multi-Way Array Algorithm
What if we don’t have enough memory for that? There’s a multi-pass algorithm:
- It uses a heuristic to allocate memory to subtrees of the root from the right to left order. For the subtrees not chosen to be fully memory resident, compute partial aggregates and write them out.
- Benefits: less I/O, better memory utilization (think of in-memory chunk update frequency).
Upshot: Not only does their algorithm work well and beat the
best known ROLAP technique, it suggests a better ROLAP technique:
- Scan table and convert to array
- Do MOLAP processing
- Dump resulting cube to a table.
Some of the materials based on the lecture notes for the redbook (Readings in Database Systems).
|