DBMS query optimization

profileAndrewSL
queryoptimization.pdf

DBMS Architecture

Query Executor

Buffer Manager

Storage Manager

Storage

Transaction Manager

Logging & Recovery

Lock Manager

Buffers Lock Tables

Main Memory

User/Web Forms/Applications/DBA query transaction

Query Optimizer

Query Rewriter

Query Parser

Files & Access Methods

Past lectures

Today’s lecture

Many query plans to execute a SQL query

3

S UTR SR

T

U

SR

T

U

• Even more plans: multiple algorithms to execute each operation

SR

T

U

Sort-merge

Sort-merge

hash join

Table-scan

index-scan

Table-scanindex-scan

• Compute the join of R(A,B) S(B,C) T(C,D) U(D,E)

Explain command in PostgreSQL

4

SELECT C.STATE, SUM(O.NETAMOUNT), SUM(O.TOTALAMOUNT) FROM CUSTOMERS C

JOIN CUST_HIST CH ON C.CUSTOMERID = CH.CUSTOMERID JOIN ORDERS O ON CH.ORDERID = O.ORDERID

GROUP BY C.STATE

Communication between operators: iterator model

• Each physical operator implements three functions: – Open: initializes the data structures. – GetNext: returns the next tuple in the result. – Close: ends the operation and frees the resources.

• It enables pipelining • Other option: compute the result of the operator in full

and store it in disk or memory: – inefficient.

5

Query optimization: picking the fastest plan • Optimal approach plan

– enumerate each possible plan – measure its performance by running it – pick the fastest one – What’s wrong?

• Rule-based optimization – Use a set of pre-defined rules to generate a fast plan

• e.g., If there is an index over a table, use it for scan and join.

6

Review Definitions • Statistics on table R:

– T(R): Number of tuples in R – B(R): Number of blocks in R

• B(R) = T(R ) / block size – V(R,A): Number of distinct values of attribute A in R

7

Plans to select tuples from R: σA=a(R) • We have an unclustered index on R • Plans:

– (Unclustered) indexed-based scan – Table-scan (sequential access)

• Statistics on R – B(R)=5000, T(R)=200,000 – V(R,A) = 2, one value appears in 95% of tuples.

• Unclustered indexed scan vs. table-scan ?

8

Presenter
Presentation Notes
Unclustered indexed scan vs. table-scan ? table-scan is the winner!

Query optimization methods • Rule-based optimizer fails

– It uses static rules – The rules do not consider the distribution of the data.

• Cost-based optimization – predict the cost of each plan – search the plan space to find the fastest one – do it efficiently

• Optimization itself should be fast!

9

Cost-based optimization • Plan space

– which plans to consider? – it is time consuming to explore all alternatives.

• Cost estimator – how to estimate the cost of each plan without executing it? – we would like to have accurate estimation

• Search algorithm – how to search the plan space fast? – we would like to avoid checking inefficient plans

10

Space of query plans: basic operators • Selection

– algorithms: sequential, index-based – Ordering

• Projection – Ordering

• Join – algorithms: nested loop, sort-merge, hash – ordering

11

Reducing plan space • Multiple logical query plan for each SQL query

Star(name,BD,bio), StarsIn(movie, sname, year, plot)

SELECT movie,name FROM Stars, StarsIn WHERE Star.name = StarsIn.sname AND year = 1950

12

Generally Faster StarsIn Star

StarsIn.sname = Star.name

σ year=1950

StarsIn

Star

StarsIn.sname = Star.name

year=1950

movie, name movie, name

Reducing plan space • Push selection down to reduce # of rows • Push projection down to reduce # of columns

SELECT movie, name, BD FROM Stars, StarsIn WHERE Star.name = StarsIn.sname

13

StarsIn Star

StarsIn.sname = Star.name

movie, name, BD

StarsIn Star

StarsIn.sname = Star.name

movie, name, BD

movie, sname name,BD

Less effective than pushing down selection.

14

• Queries with multiple joins may have exponentially many possible plans.

• System-R style considers only left-deep joins

Reducing plan space

SR

T

U

SR

T

U

T USR

• Left-deep trees allow us to generate fully pipelined plans

15

• System R-style avoids plans with Cartesian products – Cartesian products are generally larger than joins.

• Example: – Join of R(A,B), S(B,C), U(C,D) – (R ⋈ U) ⋈ S has a Cartesian product – Pick (R ⋈ S) ⋈ U instead

• If cannot avoid Cartesian products, delay them.

Reducing plan space

Components of a cost-based optimizer • Plan space

– Which plans to consider? – Time-consuming to explore every possible plan.

• Cost estimator – How to estimate the cost of a plan without executing it? – Accurate estimations.

• Search algorithm – How to search the plan space fast? – Avoid checking inefficient plans.

16

17

• Our goal is to maximize relative accuracy – Compare plans, not to predict exact costs.

• For each operator in the plan: – Find input size – Estimate its cost based on the input size

• Example: sort-merge join of R ⋈ S is 3 B(R) + 3 B(S) – Estimate output size or selectivity

• Selectivity: ratio of output to input

Cost estimation

18

• Why estimate output size? – Output of an operator = Input to the next one in the plan – Used to estimate of the cost of the next operator

• Cost of a plan – Sum of the cost of its operators

Cost estimation

Cost estimation: Selinger Style • Input: stats on each table

– T(R): Number of tuples in R – B(R): Number of blocks in R

• B(R) = T(R ) / block size – V(R,A): Number of distinct values of attribute A in R

• We assume that attributes and predicates are independence.

• When no estimate available, use magic numbers.

19

Presenter
Presentation Notes
too much information to keep, use histogram

20

Selectivity factors: selection • Point selection: S = σA=a(R)

– T(S) ranges from 0 to T(R) – V(R,A) + 1 – consider its mean: F = 1 / V (R,A)

• Range selection: S = σA>a(R) – F = (max(A) – a) / (max(A) – min(A)) – If not arithmetic inequality, use magic number

• F = 1 / 3

• Range selection: S = σ b <A<a(R) – F = (a - b) / (max(A) – min(A)) – If not arithmetic, use magic number

• F = 1 / 4

21

Examples: selection with multiple conditions

• S = σA=1 AND B>10(R) – Given T(R) = 10,000, V(R,A) = 50 – No information on B – Multiply 1/V(R,A) for equality and 1/3 for inequality

• T(S) = 10000 / (50 * 3) = 66

• S = σA=1 OR B>10(R) – Sum of estimates of predicates minus their product

• T(S) = 200 + 3333 – 66 = 3467

22

• Containment of values assumption – If V(S,A) <= V (R,A), then A-values in S are a subset of A-

values in R

• Let’s assume V(S,A) <= V (R,A) – Each tuple t in S joins x tuple(s) in R – consider its mean: x = T(R) / V (R,A) – T(R ⋈A S) = T (S) * T(R) / V(R,A)

• General formula: T(R ⋈A S) = T(R) * T(S) / max(V(R,A), V(S,A))

Selectivity factors: join predicates

Presenter
Presentation Notes
Typical join: A is a key in R and a foreign key in S

23

• What if we join over more than one attribute?

T(R S) = T(R) T(S) / max(V(R,A),V(S,A) max(V(R,B),V(S,B))

A,B

Selectivity factors: join predicates

Components of a cost-based optimizer • Plan space

– Which plans to consider? – Time-consuming to explore every possible plan.

• Cost estimator – How to estimate the cost of a plan without executing it? – Accurate estimations.

• Search algorithm – How to search the plan space fast? – Avoid checking inefficient plans.

24

Search the plan space • Baseline: exhaustive search

– Enumerate all (left-deep) trees and compare their costs

• Enormous space to search => time-consuming!

25

Plan search: System-R style • A.K.A: Selinger style optimization • Bottom-up

– Start from the ground relation (in FROM) – Work up the tree to form a plan – Compute the cost of larger plans based on its sub-trees.

• Dynamic programming – greedily remove sub-trees that are costly (useless)

26

27

• Step 1: For each {Ri}: – Plans({Ri}): algorithms to access Ri

• Table-Scan, Index-Scan – Costs({Ri}): Costs of accessing to Ri

• e.g., B(Ri) for table-scan on Ri – Plan({Ri}): access method with lowest cost – Cost({Ri}): lowest cost in Costs({Ri}) – Output-size ({Ri}): B(Ri)

Dynamic programming algorithm

28

• Step 2: For each {Ri, Rj}: – Plans({Ri,Rj}): join algorithms to compute Ri ⋈ Rj

• E.g., nested-loop and sort-merge based algorithms – Costs({Ri,Rj}): function of size of Ri and Rj

• #I/O access of the chosen join algorithm • E.g., 5 B(R) + 5 B(S) for sort-merge join

– Plan({Ri,Rj}): the join algorithm with lowest cost – Cost({Ri,Rj}): lowest cost in Costs({Ri,Rj}) – Output-size ({Ri,Rj}): estimate of the size of join

Dynamic programming algorithm

29

• Step i: For each S ⊆ {R1, …, Rn} of cardinality i: – for every S1 ,S2 s.t. S = S1 ∪ S2 compute

C = cost(S1) + cost(S2) + cost(S1 ⋈ S2) – Cost(S) = the smallest C – Plan(S) = the plan with cost(S) – Output-size(S): estimate the size of S

• Return Plan({R1, …, Rn})

Dynamic programming algorithm

30

• R ⋈ S ⋈ T ⋈ U on common attribute A • T(R)= 2000, T(S)=5000, T(T)=3000, T(U)=1000 • We assume that each block contains a single tuple.

– B(R)= 2000, B(S)=5000, B(T)=3000, B(U)=1000 • No index on relations

– Table-scan to access each relation • The only join algorithm is sort-merge

– Cost for R ⋈ S: 5 B(R) + 5 B(S) – There is enough memory for sort-merge joins.

Example

31

• Relations: R, S, T, U • T(R)= 2000, T(S)=5000, T(T)=3000, T(U)=1000 • B(R)= 2000, B(S)=5000, B(T)=3000, B(U)=1000

• To simplify our example, we assume – V(R,A)=V(S,A)=V(T,A)=V(U,A)= 100 – T(R ⋈ S) = 0.01 * T(R) * T(S)

• Same formula for joins of every pair of relations.

Example

32

Query Output-Size Cost Plan

R⋈ S

R⋈ T

R⋈ U

S⋈ T

S⋈ U

T⋈ U

R⋈ S⋈ T

R⋈S⋈U

R⋈T⋈U

S⋈T⋈U

R⋈S⋈T⋈U

33

Query Output-Size Cost Plan

R⋈S 100K 35K R⋈S

R⋈T 60K 25K R⋈T

R⋈U 20K 15K R⋈U

S⋈T 150K 40K S⋈T

S⋈U 50K 30K S⋈U

T⋈U 30K 20K T⋈U

R⋈S⋈ T

R⋈S⋈U

R⋈T⋈U

S⋈T⋈U

R⋈S⋈T⋈U

34

Query Output-Size Cost Plan

R⋈S 100K 35K R⋈S

R⋈T 60K 25K R⋈T

R⋈U 20K 15K R⋈U

S⋈T 150K 40K S⋈T

S⋈U 50K 30K S⋈U

T⋈U 30K 20K T⋈U

R⋈S⋈T 3M 170K S⋈ (R⋈T)

R⋈S⋈U 1M 80K S⋈ (R⋈U)

R⋈T⋈U 0.6M 70K T⋈ (R⋈U)

S⋈T⋈U 1.5M 105K S⋈ (T⋈U)

R⋈S⋈T⋈U

35

Query Output-Size Cost Plan

R⋈S 100K 35K R⋈S

R⋈T 60K 25K R⋈T

R⋈U 20K 15K R⋈U

S⋈T 150K 40K S⋈T

S⋈U 50K 30K S⋈U

T⋈U 30K 20K T⋈U

R⋈S⋈T 3M 170K S⋈ (R⋈T)

R⋈S⋈U 1M 80K S⋈ (R⋈U)

R⋈T⋈U 0.6M 70K T⋈ (R⋈U)

S⋈T⋈U 1.5M 105K S⋈ (T⋈U)

R⋈S⋈T⋈U 30M 1.35M S⋈ (T⋈ (R⋈U))

Interesting Orders • Useful sorted intermediate results • Order By

– E.g., Order By based on A for the result R⋈AS • Sorted order based on A is an interesting order

• Multi-relation joins – E.g., T⋈A (R⋈AS)

• Sorted output of R⋈AS helps fast join of T⋈A (R⋈AS). • Sorted order based on A is an interesting order

• Grouping

36

Interesting Orders • Plans with sorted results save sorting time/cost.

– E.g., sort-merge join for R⋈AS with Order By on A. – E.g., sort-merge join for R⋈AS for query T⋈A (R⋈AS).

• The dynamic programming algorithm always keeps – Fastest overall plan – Fastest plan for each interesting order

37

Nested subqueries • Subqueries are optimized separately • Correlation: order of evaluation

– uncorrelated queries • nested subqueries do not reference outer subqueries • evaluate the most deeply nested subquery first

– correlated queries: nested subqueries reference the outer subqueries

Select name From employee X Where salary > (Select salary

From employee Where employee_num = X.manager)

38

Nested subqueries • correlated queries: nested subqueries reference the outer

subqueries Select name From employee X Where salary > (Select salary

From employee Where employee_num = X.manager)

• The nested subquery is evaluated once for each tuple in the outer query.

• If there are small number of distinct values in the outer relation, it is worth sorting the tuples. – reduces the #evaluation of the nested query.

39

Optimization: all operations • Reduce plan space

– Push down selections and projections – Choose good plans, discard bad ones

• Base relations access – find all plans for accessing each base relations

• Join ordering – Bottom-up dynamic programming algorithm – Consider only left-deep plans – Remove/ postpone Cartesian product – Keep the cheapest plan for unordered and each interesting

order 40

Summary: optimization • Ideal goal

– Map a declarative query to the most efficient plan • Plan space

– Many semantically equivalent alternatives • Why important?

– Difference between good/bad plans is order of magnitude • Conventional wisdom:

– At least avoid bad plans

41

State of the art • Academia: always a core database research topic

– Optimizing for interactive querying – Optimizing for novel parallel frameworks

• Industry: most optimizers use System-R style – They started with rule-based.

• Oracle 7 and its prior versions used rule-based • Oracle 7 – 10: rule-based, and cost based • Oracle 10g (2003): cost-based

42

  • CS 540 �Database Management Systems
  • DBMS Architecture
  • Many query plans to execute a SQL query
  • Explain command in PostgreSQL
  • Communication between operators: � iterator model
  • Query optimization: picking the fastest plan
  • Review Definitions
  • �Plans to select tuples from R: sA=a(R) �
  • Query optimization methods
  • Cost-based optimization
  • Space of query plans: basic operators
  • Reducing plan space
  • Reducing plan space
  • Reducing plan space
  • Reducing plan space
  • Components of a cost-based optimizer
  • Cost estimation
  • Cost estimation
  • Cost estimation: Selinger Style
  • Selectivity factors: selection
  • Examples: selection with multiple conditions
  • Selectivity factors: join predicates
  • Selectivity factors: join predicates
  • Components of a cost-based optimizer
  • Search the plan space
  • Plan search: System-R style
  • Dynamic programming algorithm
  • Dynamic programming algorithm
  • Dynamic programming algorithm
  • Example
  • Example
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Interesting Orders
  • Interesting Orders
  • Nested subqueries
  • Nested subqueries
  • Optimization: all operations
  • Summary: optimization
  • State of the art