DBMS query optimization

AndrewSL

queryoptimization.pdf

Home >Computer Science homework help >DBMS query optimization

DBMS Architecture

Query Executor

Buffer Manager

Storage Manager

Storage

Transaction Manager

Logging & Recovery

Lock Manager

Buffers Lock Tables

Main Memory

User/Web Forms/Applications/DBA query transaction

Query Optimizer

Query Rewriter

Query Parser

Files & Access Methods

Past lectures

Today’s lecture

Many query plans to execute a SQL query

S UTR SR

• Even more plans: multiple algorithms to execute each operation

Sort-merge

hash join

Table-scan

index-scan

Table-scanindex-scan

• Compute the join of R(A,B) S(B,C) T(C,D) U(D,E)

Explain command in PostgreSQL

SELECT C.STATE, SUM(O.NETAMOUNT), SUM(O.TOTALAMOUNT) FROM CUSTOMERS C

JOIN CUST_HIST CH ON C.CUSTOMERID = CH.CUSTOMERID JOIN ORDERS O ON CH.ORDERID = O.ORDERID

GROUP BY C.STATE

Communication between operators: iterator model

• Each physical operator implements three functions: – Open: initializes the data structures. – GetNext: returns the next tuple in the result. – Close: ends the operation and frees the resources.

• It enables pipelining • Other option: compute the result of the operator in full

and store it in disk or memory: – inefficient.

Query optimization: picking the fastest plan • Optimal approach plan

– enumerate each possible plan – measure its performance by running it – pick the fastest one – What’s wrong?

• Rule-based optimization – Use a set of pre-defined rules to generate a fast plan

• e.g., If there is an index over a table, use it for scan and join.

Review Definitions • Statistics on table R:

– T(R): Number of tuples in R – B(R): Number of blocks in R

• B(R) = T(R ) / block size – V(R,A): Number of distinct values of attribute A in R

Plans to select tuples from R: σA=a(R) • We have an unclustered index on R • Plans:

– (Unclustered) indexed-based scan – Table-scan (sequential access)

• Statistics on R – B(R)=5000, T(R)=200,000 – V(R,A) = 2, one value appears in 95% of tuples.

• Unclustered indexed scan vs. table-scan ?

Presenter

Presentation Notes

Unclustered indexed scan vs. table-scan ? table-scan is the winner!

Query optimization methods • Rule-based optimizer fails

– It uses static rules – The rules do not consider the distribution of the data.

• Cost-based optimization – predict the cost of each plan – search the plan space to find the fastest one – do it efficiently

• Optimization itself should be fast!

Cost-based optimization • Plan space

– which plans to consider? – it is time consuming to explore all alternatives.

• Cost estimator – how to estimate the cost of each plan without executing it? – we would like to have accurate estimation

• Search algorithm – how to search the plan space fast? – we would like to avoid checking inefficient plans

Space of query plans: basic operators • Selection

– algorithms: sequential, index-based – Ordering

• Projection – Ordering

• Join – algorithms: nested loop, sort-merge, hash – ordering

Reducing plan space • Multiple logical query plan for each SQL query

Star(name,BD,bio), StarsIn(movie, sname, year, plot)

SELECT movie,name FROM Stars, StarsIn WHERE Star.name = StarsIn.sname AND year = 1950

Generally Faster StarsIn Star

StarsIn.sname = Star.name

σ year=1950

StarsIn

Star

StarsIn.sname = Star.name

year=1950

movie, name movie, name

Reducing plan space • Push selection down to reduce # of rows • Push projection down to reduce # of columns

SELECT movie, name, BD FROM Stars, StarsIn WHERE Star.name = StarsIn.sname

StarsIn Star

StarsIn.sname = Star.name

movie, name, BD

StarsIn Star

StarsIn.sname = Star.name

movie, name, BD

movie, sname name,BD

Less effective than pushing down selection.

• Queries with multiple joins may have exponentially many possible plans.

• System-R style considers only left-deep joins

Reducing plan space

T USR

• Left-deep trees allow us to generate fully pipelined plans

• System R-style avoids plans with Cartesian products – Cartesian products are generally larger than joins.

• Example: – Join of R(A,B), S(B,C), U(C,D) – (R ⋈ U) ⋈ S has a Cartesian product – Pick (R ⋈ S) ⋈ U instead

• If cannot avoid Cartesian products, delay them.

Reducing plan space

Components of a cost-based optimizer • Plan space

– Which plans to consider? – Time-consuming to explore every possible plan.

• Cost estimator – How to estimate the cost of a plan without executing it? – Accurate estimations.

• Search algorithm – How to search the plan space fast? – Avoid checking inefficient plans.

• Our goal is to maximize relative accuracy – Compare plans, not to predict exact costs.

• For each operator in the plan: – Find input size – Estimate its cost based on the input size

• Example: sort-merge join of R ⋈ S is 3 B(R) + 3 B(S) – Estimate output size or selectivity

• Selectivity: ratio of output to input

Cost estimation

• Why estimate output size? – Output of an operator = Input to the next one in the plan – Used to estimate of the cost of the next operator

• Cost of a plan – Sum of the cost of its operators

Cost estimation

Cost estimation: Selinger Style • Input: stats on each table

– T(R): Number of tuples in R – B(R): Number of blocks in R

• B(R) = T(R ) / block size – V(R,A): Number of distinct values of attribute A in R

• We assume that attributes and predicates are independence.

• When no estimate available, use magic numbers.

Presenter

Presentation Notes

too much information to keep, use histogram

Selectivity factors: selection • Point selection: S = σA=a(R)

– T(S) ranges from 0 to T(R) – V(R,A) + 1 – consider its mean: F = 1 / V (R,A)

• Range selection: S = σA>a(R) – F = (max(A) – a) / (max(A) – min(A)) – If not arithmetic inequality, use magic number

• F = 1 / 3

• Range selection: S = σ b <A<a(R) – F = (a - b) / (max(A) – min(A)) – If not arithmetic, use magic number

• F = 1 / 4

Examples: selection with multiple conditions

• S = σA=1 AND B>10(R) – Given T(R) = 10,000, V(R,A) = 50 – No information on B – Multiply 1/V(R,A) for equality and 1/3 for inequality

• T(S) = 10000 / (50 * 3) = 66

• S = σA=1 OR B>10(R) – Sum of estimates of predicates minus their product

• T(S) = 200 + 3333 – 66 = 3467

• Containment of values assumption – If V(S,A) <= V (R,A), then A-values in S are a subset of A-

values in R

• Let’s assume V(S,A) <= V (R,A) – Each tuple t in S joins x tuple(s) in R – consider its mean: x = T(R) / V (R,A) – T(R ⋈A S) = T (S) * T(R) / V(R,A)

• General formula: T(R ⋈A S) = T(R) * T(S) / max(V(R,A), V(S,A))

Selectivity factors: join predicates

Presenter

Presentation Notes

Typical join: A is a key in R and a foreign key in S

• What if we join over more than one attribute?

T(R S) = T(R) T(S) / max(V(R,A),V(S,A) max(V(R,B),V(S,B))

A,B

Selectivity factors: join predicates

Components of a cost-based optimizer • Plan space

– Which plans to consider? – Time-consuming to explore every possible plan.

• Cost estimator – How to estimate the cost of a plan without executing it? – Accurate estimations.

• Search algorithm – How to search the plan space fast? – Avoid checking inefficient plans.

Search the plan space • Baseline: exhaustive search

– Enumerate all (left-deep) trees and compare their costs

• Enormous space to search => time-consuming!

Plan search: System-R style • A.K.A: Selinger style optimization • Bottom-up

– Start from the ground relation (in FROM) – Work up the tree to form a plan – Compute the cost of larger plans based on its sub-trees.

• Dynamic programming – greedily remove sub-trees that are costly (useless)

• Step 1: For each {Ri}: – Plans({Ri}): algorithms to access Ri

• Table-Scan, Index-Scan – Costs({Ri}): Costs of accessing to Ri

• e.g., B(Ri) for table-scan on Ri – Plan({Ri}): access method with lowest cost – Cost({Ri}): lowest cost in Costs({Ri}) – Output-size ({Ri}): B(Ri)

Dynamic programming algorithm

• Step 2: For each {Ri, Rj}: – Plans({Ri,Rj}): join algorithms to compute Ri ⋈ Rj

• E.g., nested-loop and sort-merge based algorithms – Costs({Ri,Rj}): function of size of Ri and Rj

• #I/O access of the chosen join algorithm • E.g., 5 B(R) + 5 B(S) for sort-merge join

– Plan({Ri,Rj}): the join algorithm with lowest cost – Cost({Ri,Rj}): lowest cost in Costs({Ri,Rj}) – Output-size ({Ri,Rj}): estimate of the size of join

Dynamic programming algorithm

• Step i: For each S ⊆ {R1, …, Rn} of cardinality i: – for every S1 ,S2 s.t. S = S1 ∪ S2 compute

C = cost(S1) + cost(S2) + cost(S1 ⋈ S2) – Cost(S) = the smallest C – Plan(S) = the plan with cost(S) – Output-size(S): estimate the size of S

• Return Plan({R1, …, Rn})

Dynamic programming algorithm

• R ⋈ S ⋈ T ⋈ U on common attribute A • T(R)= 2000, T(S)=5000, T(T)=3000, T(U)=1000 • We assume that each block contains a single tuple.

– B(R)= 2000, B(S)=5000, B(T)=3000, B(U)=1000 • No index on relations

– Table-scan to access each relation • The only join algorithm is sort-merge

– Cost for R ⋈ S: 5 B(R) + 5 B(S) – There is enough memory for sort-merge joins.

Example

• Relations: R, S, T, U • T(R)= 2000, T(S)=5000, T(T)=3000, T(U)=1000 • B(R)= 2000, B(S)=5000, B(T)=3000, B(U)=1000

• To simplify our example, we assume – V(R,A)=V(S,A)=V(T,A)=V(U,A)= 100 – T(R ⋈ S) = 0.01 * T(R) * T(S)

• Same formula for joins of every pair of relations.

Example

Query Output-Size Cost Plan

R⋈ S

R⋈ T

R⋈ U

S⋈ T

S⋈ U

T⋈ U

R⋈ S⋈ T

R⋈S⋈U

R⋈T⋈U

S⋈T⋈U

R⋈S⋈T⋈U

Query Output-Size Cost Plan

R⋈S 100K 35K R⋈S

R⋈T 60K 25K R⋈T

R⋈U 20K 15K R⋈U

S⋈T 150K 40K S⋈T

S⋈U 50K 30K S⋈U

T⋈U 30K 20K T⋈U

R⋈S⋈ T

R⋈S⋈U

R⋈T⋈U

S⋈T⋈U

R⋈S⋈T⋈U

Query Output-Size Cost Plan

R⋈S 100K 35K R⋈S

R⋈T 60K 25K R⋈T

R⋈U 20K 15K R⋈U

S⋈T 150K 40K S⋈T

S⋈U 50K 30K S⋈U

T⋈U 30K 20K T⋈U

R⋈S⋈T 3M 170K S⋈ (R⋈T)

R⋈S⋈U 1M 80K S⋈ (R⋈U)

R⋈T⋈U 0.6M 70K T⋈ (R⋈U)

S⋈T⋈U 1.5M 105K S⋈ (T⋈U)

R⋈S⋈T⋈U

Query Output-Size Cost Plan

R⋈S 100K 35K R⋈S

R⋈T 60K 25K R⋈T

R⋈U 20K 15K R⋈U

S⋈T 150K 40K S⋈T

S⋈U 50K 30K S⋈U

T⋈U 30K 20K T⋈U

R⋈S⋈T 3M 170K S⋈ (R⋈T)

R⋈S⋈U 1M 80K S⋈ (R⋈U)

R⋈T⋈U 0.6M 70K T⋈ (R⋈U)

S⋈T⋈U 1.5M 105K S⋈ (T⋈U)

R⋈S⋈T⋈U 30M 1.35M S⋈ (T⋈ (R⋈U))

Interesting Orders • Useful sorted intermediate results • Order By

– E.g., Order By based on A for the result R⋈AS • Sorted order based on A is an interesting order

• Multi-relation joins – E.g., T⋈A (R⋈AS)

• Sorted output of R⋈AS helps fast join of T⋈A (R⋈AS). • Sorted order based on A is an interesting order

• Grouping

Interesting Orders • Plans with sorted results save sorting time/cost.

– E.g., sort-merge join for R⋈AS with Order By on A. – E.g., sort-merge join for R⋈AS for query T⋈A (R⋈AS).

• The dynamic programming algorithm always keeps – Fastest overall plan – Fastest plan for each interesting order

Nested subqueries • Subqueries are optimized separately • Correlation: order of evaluation

– uncorrelated queries • nested subqueries do not reference outer subqueries • evaluate the most deeply nested subquery first

– correlated queries: nested subqueries reference the outer subqueries

Select name From employee X Where salary > (Select salary

From employee Where employee_num = X.manager)

Nested subqueries • correlated queries: nested subqueries reference the outer

subqueries Select name From employee X Where salary > (Select salary

From employee Where employee_num = X.manager)

• The nested subquery is evaluated once for each tuple in the outer query.

• If there are small number of distinct values in the outer relation, it is worth sorting the tuples. – reduces the #evaluation of the nested query.

Optimization: all operations • Reduce plan space

– Push down selections and projections – Choose good plans, discard bad ones

• Base relations access – find all plans for accessing each base relations

• Join ordering – Bottom-up dynamic programming algorithm – Consider only left-deep plans – Remove/ postpone Cartesian product – Keep the cheapest plan for unordered and each interesting

order 40

Summary: optimization • Ideal goal

– Map a declarative query to the most efficient plan • Plan space

– Many semantically equivalent alternatives • Why important?

– Difference between good/bad plans is order of magnitude • Conventional wisdom:

– At least avoid bad plans

State of the art • Academia: always a core database research topic

– Optimizing for interactive querying – Optimizing for novel parallel frameworks

• Industry: most optimizers use System-R style – They started with rule-based.

• Oracle 7 and its prior versions used rule-based • Oracle 7 – 10: rule-based, and cost based • Oracle 10g (2003): cost-based

CS 540 �Database Management Systems
DBMS Architecture
Many query plans to execute a SQL query
Explain command in PostgreSQL
Communication between operators: � iterator model
Query optimization: picking the fastest plan
Review Definitions
�Plans to select tuples from R: sA=a(R) �
Query optimization methods
Cost-based optimization
Space of query plans: basic operators
Reducing plan space
Reducing plan space
Reducing plan space
Reducing plan space
Components of a cost-based optimizer
Cost estimation
Cost estimation
Cost estimation: Selinger Style
Selectivity factors: selection
Examples: selection with multiple conditions
Selectivity factors: join predicates
Selectivity factors: join predicates
Components of a cost-based optimizer
Search the plan space
Plan search: System-R style
Dynamic programming algorithm
Dynamic programming algorithm
Dynamic programming algorithm
Example
Example
Slide Number 32
Slide Number 33
Slide Number 34
Slide Number 35
Interesting Orders
Interesting Orders
Nested subqueries
Nested subqueries
Optimization: all operations
Summary: optimization
State of the art