Principles of Distributed Database Systems

M. Tamer Özsu • Patrick Valduriez

Third Edition

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer, software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Springer New York Dordrecht Heidelberg London

M. Tamer Özsu, David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1 [email protected]

ISBN 978-1-4419-8833-1 e-ISBN 978-1-4419-8834-8 DOI 10.1007/978-1-4419-8834-8

This book was previously published by: Pearson Education, Inc.

Library of Congress Control Number: 2011922491

© Springer Science+Business Media, LLC 2011

Patrick Valduriez, INRIA, LIRMM, 161 rue Ada, 34392 Montpellier Cedex, France [email protected]

To my family and my parents M.T.Ö.

To Esther, my daughters Anna, Juliette and Sarah, and my parents

P.V.

Preface

It has been almost twenty years since the first edition of this book appeared, and ten years since we released the second edition. As one can imagine, in a fast-changing area such as this, there have been significant changes in the intervening period. Distributed data management went from a potentially significant technology to one that is commonplace. The advent of the Internet and the World Wide Web has certainly changed the way we typically look at distribution. The emergence in recent years of different forms of distributed computing, exemplified by data streams and cloud computing, has regenerated interest in distributed data management. Thus, it was time for a major revision of the material.

We started to work on this edition five years ago, and it has taken quite a while to complete the work. The end result, however, is a book that has been heavily revised – while we maintained and updated the core chapters, we have also added new ones. The major changes are the following:

1. Database integration and querying is now treated in much more detail, reflecting the attention these topics have received in the community in the past decade. Chapter 4 focuses on the integration process, while Chapter 9 discusses querying over multidatabase systems.

2. The previous editions had only brief discussion of data replication protocols. This topic is now covered in a separate chapter (Chapter 13) where we provide an in-depth discussion of the protocols and how they can be integrated with transaction management.

3. Peer-to-peer data management is discussed in depth in Chapter 16. These systems have become an important and interesting architectural alternative to classical distributed database systems. Although the early distributed database systems architectures followed the peer-to-peer paradigm, the modern incarnation of these systems has fundamentally different characteristics, so they deserve in-depth discussion in a chapter of their own.

4. Web data management is discussed in Chapter 17. This is a difficult topic to cover since there is no unifying framework. We discuss various aspects of the topic ranging from web models to search engines to distributed XML processing.

5. Earlier editions contained a chapter where we discussed “recent issues” at the time. In this edition, we again have a similar chapter (Chapter 18) where we cover stream data management and cloud computing. These topics are still in flux and are subjects of considerable ongoing research. We highlight the issues and the potential research directions.

The resulting manuscript strikes a balance between our two objectives, namely to address new and emerging issues, and to maintain the main characteristics of the book in addressing the principles of distributed data management.

The organization of the book can be divided into two major parts. The first part covers the fundamental principles of distributed data management and consists of Chapters 1 to 14. Chapter 2 in this part covers the background and can be skipped if the students already have sufficient knowledge of relational database concepts and computer network technology. The only part of this chapter that is essential is Example 2.3, which introduces the running example that we use throughout much of the book. The second part covers more advanced topics and includes Chapters 15 – 18. What one covers in a course depends very much on the duration and the course objectives. If the course aims to discuss the fundamental techniques, then it might cover Chapters 1, 3, 5, 6–8, 10–12. An extended coverage would include, in addition to the above, Chapters 4, 9, and 13. Courses that have time to cover more material can selectively pick one or more of Chapters 15 – 18 from the second part.

Many colleagues have assisted with this edition of the book. S. Keshav (University of Waterloo) read and provided many suggestions to update the sections on computer networks. Renée Miller (University of Toronto) and Erhard Rahm (University of Leipzig) read an early draft of Chapter 4 and provided many comments. Alon Halevy (Google) answered a number of questions about this chapter, provided a draft copy of his upcoming book on this topic, and read and provided feedback on Chapter 9. Avigdor Gal (Technion) also reviewed and critiqued this chapter very thoroughly. Matthias Jarke and Xiang Li (University of Aachen), Gottfried Vossen (University of Muenster), Erhard Rahm and Andreas Thor (University of Leipzig) contributed exercises to this chapter. Hubert Naacke (University of Paris 6) contributed to the section on heterogeneous cost modeling and Fabio Porto (LNCC, Petropolis) to the section on adaptive query processing of Chapter 9. Data replication (Chapter 13) could not have been written without the assistance of Gustavo Alonso (ETH Zürich) and Bettina Kemme (McGill University). Tamer spent four months in Spring 2006 visiting Gustavo, where work on this chapter began and involved many long discussions. Bettina read multiple iterations of this chapter over the following year, criticizing everything and pointing out better ways of explaining the material. Esther Pacitti (University of Montpellier) also contributed to this chapter, both by reviewing it and by providing background material; she also contributed to the section on replication in database clusters in Chapter 14. Ricardo Jimenez-Peris also contributed to that chapter in the section on fault-tolerance in database clusters. Khuzaima Daudjee (University of Waterloo) read and provided comments on this chapter as well. Chapter 15 on Distributed Object Database Management was reviewed by Serge Abiteboul (INRIA), who provided important critique of the material and suggestions for its improvement. Peer-to-peer data management (Chapter 16) owes a lot to discussions with Beng Chin Ooi (National University of Singapore) during the four months Tamer was visiting NUS in the fall of 2006. The section of Chapter 16 on query processing in P2P systems uses material from the PhD work of Reza Akbarinia (INRIA) and Wenceslao Palma (PUC-Valparaiso, Chile) while the section on replication uses material from the PhD work of Vidal Martins (PUCPR, Curitiba). The distributed XML processing section of Chapter 17 uses material from the PhD work of Ning Zhang (Facebook) and Patrick Kling at the University of Waterloo, and Ying Zhang at CWI. All three of them also read the material and provided significant feedback. Victor Muntés i Mulero (Universitat Politècnica de Catalunya) contributed to the exercises in that chapter. Özgür Ulusoy (Bilkent University) provided comments and corrections on Chapters 16 and 17. The data stream management section of Chapter 18 draws from the PhD work of Lukasz Golab (AT&T Labs-Research), and Yingying Tao at the University of Waterloo. Walid Aref (Purdue University) and Avigdor Gal (Technion) used the draft of the book in their courses, which was very helpful in debugging certain parts. We thank them, as well as many colleagues who helped out with the first two editions, for all their assistance. We have not always followed their advice, and, needless to say, the resulting problems and errors are ours.

Students in two courses at the University of Waterloo (Web Data Management in Winter 2005, and Internet-Scale Data Distribution in Fall 2005) wrote surveys as part of their coursework that were very helpful in structuring some chapters. Tamer taught courses at ETH Zürich (PDDBS – Parallel and Distributed Databases in Spring 2006) and at NUS (CS5225 – Parallel and Distributed Database Systems in Fall 2010) using parts of this edition. We thank students in all these courses for their contributions and their patience as they had to deal with chapters that were works-in-progress – the material got cleaned up considerably as a result of these teaching experiences.

You will note that the publisher of the third edition of the book is different from that of the first two editions. Pearson, our previous publisher, decided not to be involved with the third edition. Springer subsequently showed considerable interest in the book. We would like to thank Susan Lagerstrom-Fife and Jennifer Evans of Springer for their lightning-fast decision to publish the book, and Jennifer Mauer for a ton of hand-holding during the conversion process. We would also like to thank Tracy Dunkelberger of Pearson who shepherded the reversion of the copyright to us without delay.

As in earlier editions, we will have presentation slides that can be used to teach from the book as well as solutions to most of the exercises. These will be available from Springer to instructors who adopt the book and there will be a link to them from the book’s site at springer.com.

Finally, we would be very interested to hear your comments and suggestions regarding the material. We welcome any feedback, but we would particularly like to receive feedback on the following aspects:

1. any errors that may have remained despite our best efforts (although we hope there are not many);

2. any topics that should no longer be included and any topics that should be added or expanded; and

3. any exercises that you may have designed that you would like to be included in the book.

M. Tamer Özsu ([email protected]) Patrick Valduriez ([email protected])

November 2010

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Distributed Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 What is a Distributed Database System? . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Data Delivery Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Promises of DDBSs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.4.1 Transparent Management of Distributed and Replicated Data 7 1.4.2 Reliability Through Distributed Transactions . . . . . . . . . . . . . 12 1.4.3 Improved Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.4.4 Easier System Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.5 Complications Introduced by Distribution . . . . . . . . . . . . . . . . . . . . . . 16 1.6 Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.6.1 Distributed Database Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.6.2 Distributed Directory Management . . . . . . . . . . . . . . . . . . . . . 17 1.6.3 Distributed Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.6.4 Distributed Concurrency Control . . . . . . . . . . . . . . . . . . . . . . . 18 1.6.5 Distributed Deadlock Management . . . . . . . . . . . . . . . . . . . . . 18 1.6.6 Reliability of Distributed DBMS . . . . . . . . . . . . . . . . . . . . . . . 18 1.6.7 Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.6.8 Relationship among Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.6.9 Additional Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.7 Distributed DBMS Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 1.7.1 ANSI/SPARC Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 1.7.2 A Generic Centralized DBMS Architecture . . . . . . . . . . . . . . 23 1.7.3 Architectural Models for Distributed DBMSs . . . . . . . . . . . . . 25 1.7.4 Autonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.7.5 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.7.6 Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.7.7 Architectural Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 1.7.8 Client/Server Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 1.7.9 Peer-to-Peer Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 1.7.10 Multidatabase System Architecture . . . . . . . . . . . . . . . . . . . . . 35

xi

xii Contents

1.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.1 Overview of Relational DBMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.1.1 Relational Database Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.1.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.1.3 Relational Data Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.2 Review of Computer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.2.1 Types of Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 2.2.2 Communication Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 2.2.3 Data Communication Concepts . . . . . . . . . . . . . . . . . . . . . . . . 65 2.2.4 Communication Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.3 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

3 Distributed Database Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.1 Top-Down Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.2 Distribution Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

3.2.1 Reasons for Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.2.2 Fragmentation Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.2.3 Degree of Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.2.4 Correctness Rules of Fragmentation . . . . . . . . . . . . . . . . . . . . . 79 3.2.5 Allocation Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.2.6 Information Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3.3 Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.3.1 Horizontal Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.3.2 Vertical Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.3.3 Hybrid Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

3.4 Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 3.4.1 Allocation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 3.4.2 Information Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.4.3 Allocation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 3.4.4 Solution Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

3.5 Data Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 3.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

4 Database Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 4.1 Bottom-Up Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 4.2 Schema Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

4.2.1 Schema Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 4.2.2 Linguistic Matching Approaches . . . . . . . . . . . . . . . . . . . . . . . 141 4.2.3 Constraint-based Matching Approaches . . . . . . . . . . . . . . . . . 143 4.2.4 Learning-based Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 4.2.5 Combined Matching Approaches . . . . . . . . . . . . . . . . . . . . . . . 146

4.3 Schema Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

Contents xiii

4.4 Schema Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 4.4.1 Mapping Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 4.4.2 Mapping Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

4.5 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 4.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

5 Data and Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 5.1 View Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

5.1.1 Views in Centralized DBMSs . . . . . . . . . . . . . . . . . . . . . . . . . . 172 5.1.2 Views in Distributed DBMSs . . . . . . . . . . . . . . . . . . . . . . . . . . 175 5.1.3 Maintenance of Materialized Views . . . . . . . . . . . . . . . . . . . . . 177

5.2 Data Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 5.2.1 Discretionary Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . 181 5.2.2 Multilevel Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 5.2.3 Distributed Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

5.3 Semantic Integrity Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 5.3.1 Centralized Semantic Integrity Control . . . . . . . . . . . . . . . . . . 189 5.3.2 Distributed Semantic Integrity Control . . . . . . . . . . . . . . . . . . 194

5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 5.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

6 Overview of Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 6.1 Query Processing Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 6.2 Objectives of Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 6.3 Complexity of Relational Algebra Operations . . . . . . . . . . . . . . . . . . . 210 6.4 Characterization of Query Processors . . . . . . . . . . . . . . . . . . . . . . . . . . 211

6.4.1 Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 6.4.2 Types of Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 6.4.3 Optimization Timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 6.4.4 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 6.4.5 Decision Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 6.4.6 Exploitation of the Network Topology . . . . . . . . . . . . . . . . . . . 214 6.4.7 Exploitation of Replicated Fragments . . . . . . . . . . . . . . . . . . . 215 6.4.8 Use of Semijoins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

6.5 Layers of Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 6.5.1 Query Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 6.5.2 Data Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 6.5.3 Global Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 6.5.4 Distributed Query Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 6.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

xiv Contents

7 Query Decomposition and Data Localization . . . . . . . . . . . . . . . . . . . . . . 221 7.1 Query Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

7.1.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 7.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 7.1.3 Elimination of Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 7.1.4 Rewriting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

7.2 Localization of Distributed Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 7.2.1 Reduction for Primary Horizontal Fragmentation . . . . . . . . . . 232 7.2.2 Reduction for Vertical Fragmentation . . . . . . . . . . . . . . . . . . . 235 7.2.3 Reduction for Derived Fragmentation . . . . . . . . . . . . . . . . . . . 237 7.2.4 Reduction for Hybrid Fragmentation . . . . . . . . . . . . . . . . . . . . 238

7.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 7.4 Bibliographic NOTES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

8 Optimization of Distributed Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 8.1 Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

8.1.1 Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 8.1.2 Search Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 8.1.3 Distributed Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249

8.2 Centralized Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 8.2.1 Dynamic Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 257 8.2.2 Static Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 8.2.3 Hybrid Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

8.3 Join Ordering in Distributed Queries . . . . . . . . . . . . . . . . . . . . . . . . . . 267 8.3.1 Join Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 8.3.2 Semijoin Based Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 8.3.3 Join versus Semijoin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

8.4 Distributed Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 8.4.1 Dynamic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 8.4.2 Static Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 8.4.3 Semijoin-based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 8.4.4 Hybrid Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286

8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290 8.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

9 Multidatabase Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297 9.1 Issues in Multidatabase Query Processing . . . . . . . . . . . . . . . . . . . . . . 298 9.2 Multidatabase Query Processing Architecture . . . . . . . . . . . . . . . . . . . 299 9.3 Query Rewriting Using Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

9.3.1 Datalog Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 9.3.2 Rewriting in GAV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 9.3.3 Rewriting in LAV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

9.4 Query Optimization and Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 9.4.1 Heterogeneous Cost Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 307 9.4.2 Heterogeneous Query Optimization . . . . . . . . . . . . . . . . . . . . . 314

Contents xv

9.4.3 Adaptive Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320 9.5 Query Translation and Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327 9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330 9.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

10 Introduction to Transaction Management . . . . . . . . . . . . . . . . . . . . . . . . . 335 10.1 Definition of a Transaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

10.1.1 Termination Conditions of Transactions . . . . . . . . . . . . . . . . . 339 10.1.2 Characterization of Transactions . . . . . . . . . . . . . . . . . . . . . . . 340 10.1.3 Formalization of the Transaction Concept . . . . . . . . . . . . . . . . 341

10.2 Properties of Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344 10.2.1 Atomicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344 10.2.2 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 10.2.3 Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346 10.2.4 Durability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349

10.3 Types of Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349 10.3.1 Flat Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 10.3.2 Nested Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 10.3.3 Workflows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353

10.4 Architecture Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356 10.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357 10.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358

11 Distributed Concurrency Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361 11.1 Serializability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 11.2 Taxonomy of Concurrency Control Mechanisms . . . . . . . . . . . . . . . . . 367 11.3 Locking-Based Concurrency Control Algorithms . . . . . . . . . . . . . . . . 369

11.3.1 Centralized 2PL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 11.3.2 Distributed 2PL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374

11.4 Timestamp-Based Concurrency Control Algorithms . . . . . . . . . . . . . . 377 11.4.1 Basic TO Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378 11.4.2 Conservative TO Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 381 11.4.3 Multiversion TO Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . 383

11.5 Optimistic Concurrency Control Algorithms . . . . . . . . . . . . . . . . . . . . 384 11.6 Deadlock Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

11.6.1 Deadlock Prevention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389 11.6.2 Deadlock Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390 11.6.3 Deadlock Detection and Resolution . . . . . . . . . . . . . . . . . . . . . 391

11.7 “Relaxed” Concurrency Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394 11.7.1 Non-Serializable Histories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395 11.7.2 Nested Distributed Transactions . . . . . . . . . . . . . . . . . . . . . . . . 396

11.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 11.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401

xvi Contents

12 Distributed DBMS Reliability
12.1 Reliability Concepts and Measures
12.1.1 System, State, and Failure
12.1.2 Reliability and Availability
12.1.3 Mean Time between Failures/Mean Time to Repair
12.2 Failures in Distributed DBMS
12.2.1 Transaction Failures
12.2.2 Site (System) Failures
12.2.3 Media Failures
12.2.4 Communication Failures
12.3 Local Reliability Protocols
12.3.1 Architectural Considerations
12.3.2 Recovery Information
12.3.3 Execution of LRM Commands
12.3.4 Checkpointing
12.3.5 Handling Media Failures
12.4 Distributed Reliability Protocols
12.4.1 Components of Distributed Reliability Protocols
12.4.2 Two-Phase Commit Protocol
12.4.3 Variations of 2PC
12.5 Dealing with Site Failures
12.5.1 Termination and Recovery Protocols for 2PC
12.5.2 Three-Phase Commit Protocol
12.6 Network Partitioning
12.6.1 Centralized Protocols
12.6.2 Voting-based Protocols
12.7 Architectural Considerations
12.8 Conclusion
12.9 Bibliographic Notes

13 Data Replication
13.1 Consistency of Replicated Databases
13.1.1 Mutual Consistency
13.1.2 Mutual Consistency versus Transaction Consistency
13.2 Update Management Strategies
13.2.1 Eager Update Propagation
13.2.2 Lazy Update Propagation
13.2.3 Centralized Techniques
13.2.4 Distributed Techniques
13.3 Replication Protocols
13.3.1 Eager Centralized Protocols
13.3.2 Eager Distributed Protocols
13.3.3 Lazy Centralized Protocols
13.3.4 Lazy Distributed Protocols
13.4 Group Communication
13.5 Replication and Failures
13.5.1 Failures and Lazy Replication
13.5.2 Failures and Eager Replication
13.6 Replication Mediator Service
13.7 Conclusion
13.8 Bibliographic Notes

14 Parallel Database Systems
14.1 Parallel Database System Architectures
14.1.1 Objectives
14.1.2 Functional Architecture
14.1.3 Parallel DBMS Architectures
14.2 Parallel Data Placement
14.3 Parallel Query Processing
14.3.1 Query Parallelism
14.3.2 Parallel Algorithms for Data Processing
14.3.3 Parallel Query Optimization
14.4 Load Balancing
14.4.1 Parallel Execution Problems
14.4.2 Intra-Operator Load Balancing
14.4.3 Inter-Operator Load Balancing
14.4.4 Intra-Query Load Balancing
14.5 Database Clusters
14.5.1 Database Cluster Architecture
14.5.2 Replication
14.5.3 Load Balancing
14.5.4 Query Processing
14.5.5 Fault-tolerance
14.6 Conclusion
14.7 Bibliographic Notes

15 Distributed Object Database Management
15.1 Fundamental Object Concepts and Object Models
15.1.1 Object
15.1.2 Types and Classes
15.1.3 Composition (Aggregation)
15.1.4 Subclassing and Inheritance
15.2 Object Distribution Design
15.2.1 Horizontal Class Partitioning
15.2.2 Vertical Class Partitioning
15.2.3 Path Partitioning
15.2.4 Class Partitioning Algorithms
15.2.5 Allocation
15.2.6 Replication
15.3 Architectural Issues
15.3.1 Alternative Client/Server Architectures
15.3.2 Cache Consistency
15.4 Object Management
15.4.1 Object Identifier Management
15.4.2 Pointer Swizzling
15.4.3 Object Migration
15.5 Distributed Object Storage
15.6 Object Query Processing
15.6.1 Object Query Processor Architectures
15.6.2 Query Processing Issues
15.6.3 Query Execution
15.7 Transaction Management
15.7.1 Correctness Criteria
15.7.2 Transaction Models and Object Structures
15.7.3 Transactions Management in Object DBMSs
15.7.4 Transactions as Objects
15.8 Conclusion
15.9 Bibliographic Notes

16 Peer-to-Peer Data Management
16.1 Infrastructure
16.1.1 Unstructured P2P Networks
16.1.2 Structured P2P Networks
16.1.3 Super-peer P2P Networks
16.1.4 Comparison of P2P Networks
16.2 Schema Mapping in P2P Systems
16.2.1 Pairwise Schema Mapping
16.2.2 Mapping based on Machine Learning Techniques
16.2.3 Common Agreement Mapping
16.2.4 Schema Mapping using IR Techniques
16.3 Querying Over P2P Systems
16.3.1 Top-k Queries
16.3.2 Join Queries
16.3.3 Range Queries
16.4 Replica Consistency
16.4.1 Basic Support in DHTs
16.4.2 Data Currency in DHTs
16.4.3 Replica Reconciliation
16.5 Conclusion
16.6 Bibliographic Notes

17 Web Data Management
17.1 Web Graph Management
17.1.1 Compressing Web Graphs
17.1.2 Storing Web Graphs as S-Nodes
17.2 Web Search
17.2.1 Web Crawling
17.2.2 Indexing
17.2.3 Ranking and Link Analysis
17.2.4 Evaluation of Keyword Search
17.3 Web Querying
17.3.1 Semistructured Data Approach
17.3.2 Web Query Language Approach
17.3.3 Question Answering
17.3.4 Searching and Querying the Hidden Web
17.4 Distributed XML Processing
17.4.1 Overview of XML
17.4.2 XML Query Processing Techniques
17.4.3 Fragmenting XML Data
17.4.4 Optimizing Distributed XML Processing
17.5 Conclusion
17.6 Bibliographic Notes

18 Current Issues: Streaming Data and Cloud Computing
18.1 Data Stream Management
18.1.1 Stream Data Models
18.1.2 Stream Query Languages
18.1.3 Streaming Operators and their Implementation
18.1.4 Query Processing
18.1.5 DSMS Query Optimization
18.1.6 Load Shedding and Approximation
18.1.7 Multi-Query Optimization
18.1.8 Stream Mining
18.2 Cloud Data Management
18.2.1 Taxonomy of Clouds
18.2.2 Grid Computing
18.2.3 Cloud architectures
18.2.4 Data management in the cloud
18.3 Conclusion
18.4 Bibliographic Notes

References

Index

Chapter 1 Introduction

Distributed database system (DDBS) technology is the union of what appear to be two diametrically opposed approaches to data processing: database system and computer network technologies. Database systems have taken us from a paradigm of data processing in which each application defined and maintained its own data (Figure 1.1) to one in which the data are defined and administered centrally (Figure 1.2). This new orientation results in data independence, whereby the application programs are immune to changes in the logical or physical organization of the data, and vice versa.

One of the major motivations behind the use of database systems is the desire to integrate the operational data of an enterprise and to provide centralized, thus controlled access to that data. The technology of computer networks, on the other hand, promotes a mode of work that goes against all centralization efforts. At first glance it might be difficult to understand how these two contrasting approaches can possibly be synthesized to produce a technology that is more powerful and more promising than either one alone. The key to this understanding is the realization that the most important objective of the database technology is integration, not centralization. It is important to realize that neither of these terms necessarily implies the other. It is possible to achieve integration without centralization, and that is exactly what the distributed database technology attempts to achieve.

Fig. 1.1 Traditional File Processing (figure: PROGRAM 1, PROGRAM 2 and PROGRAM 3 each carry their own data description and access FILE 1, FILE 2 and FILE 3, which hold redundant data)

Fig. 1.2 Database Processing (figure: PROGRAM 1, PROGRAM 2 and PROGRAM 3 share one DATABASE through common data description and data manipulation)

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_1, © Springer Science+Business Media, LLC 2011

In this chapter we define the fundamental concepts and set the framework for discussing distributed databases. We start by examining distributed systems in general in order to clarify the role of database technology within distributed data processing, and then move on to topics that are more directly related to DDBS.

1.1 Distributed Data Processing

The term distributed processing (or distributed computing) is hard to define precisely. Obviously, some degree of distributed processing goes on in any computer system, even on single-processor computers where the central processing unit (CPU) and input/output (I/O) functions are separated and overlapped. This separation and overlap can be considered as one form of distributed processing. The widespread emergence of parallel computers has further complicated the picture, since the distinction between distributed computing systems and some forms of parallel computers is rather vague.

In this book we define distributed processing in such a way that it leads to a definition of a distributed database system. The working definition we use for a distributed computing system states that it is a number of autonomous processing elements (not necessarily homogeneous) that are interconnected by a computer network and that cooperate in performing their assigned tasks. The “processing element” referred to in this definition is a computing device that can execute a program on its own. This definition is similar to those given in distributed systems textbooks (e.g., [Tanenbaum and van Steen, 2002] and [Colouris et al., 2001]).

A fundamental question that needs to be asked is: What is being distributed? One of the things that might be distributed is the processing logic. In fact, the definition of a distributed computing system given above implicitly assumes that the processing logic or processing elements are distributed. Another possible distribution is according to function. Various functions of a computer system could be delegated to various pieces of hardware or software. A third possible mode of distribution is according to data. Data used by a number of applications may be distributed to a number of processing sites. Finally, control can be distributed. The control of the execution of various tasks might be distributed instead of being performed by one computer system. From the viewpoint of distributed database systems, these modes of distribution are all necessary and important. In the following sections we talk about these in more detail.

Another reasonable question to ask at this point is: Why do we distribute at all? The classical answers to this question indicate that distributed processing better corresponds to the organizational structure of today’s widely distributed enterprises, and that such a system is more reliable and more responsive. More importantly, many of the current applications of computer technology are inherently distributed. Web-based applications, electronic commerce over the Internet, multimedia applications such as news-on-demand or medical imaging, and manufacturing control systems are all examples of such applications.

From a more global perspective, however, it can be stated that the fundamental reason behind distributed processing is to be better able to cope with the large-scale data management problems that we face today, by using a variation of the well-known divide-and-conquer rule. If the necessary software support for distributed processing can be developed, it might be possible to solve these complicated problems simply by dividing them into smaller pieces and assigning them to different software groups, which work on different computers and produce a system that runs on multiple processing elements but can work efficiently toward the execution of a common task.

Distributed database systems should also be viewed within this framework and treated as tools that could make distributed processing easier and more efficient. It is reasonable to draw an analogy between what distributed databases might offer to the data processing world and what the database technology has already provided. There is no doubt that the development of general-purpose, adaptable, efficient distributed database systems has aided greatly in the task of developing distributed software.

1.2 What is a Distributed Database System?

We define a distributed database as a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (distributed DBMS) is then defined as the software system that permits the management of the distributed database and makes the distribution transparent to the users. Sometimes “distributed database system” (DDBS) is used to refer jointly to the distributed database and the distributed DBMS. The two important terms in these definitions are “logically interrelated” and “distributed over a computer network.” They help eliminate certain cases that have sometimes been accepted to represent a DDBS.


A DDBS is not a “collection of files” that can be individually stored at each node of a computer network. To form a DDBS, files should not only be logically related, but there should be structure among the files, and access should be via a common interface. We should note that there has been much recent activity in providing DBMS functionality over semi-structured data that are stored in files on the Internet (such as Web pages). In light of this activity, the above requirement may seem unnecessarily strict. Nevertheless, it is important to make a distinction between a DDBS where this requirement is met, and more general distributed data management systems that provide a “DBMS-like” access to data. In various chapters of this book, we will expand our discussion to cover these more general systems.

It has sometimes been assumed that the physical distribution of data is not the most significant issue. The proponents of this view would therefore feel comfortable in labeling as a distributed database a number of (related) databases that reside in the same computer system. However, the physical distribution of data is important. It creates problems that are not encountered when the databases reside in the same computer. These difficulties are discussed in Section 1.5. Note that physical distribution does not necessarily imply that the computer systems be geographically far apart; they could actually be in the same room. It simply implies that the communication between them is done over a network instead of through shared memory or shared disk (as would be the case with multiprocessor systems), with the network as the only shared resource.

This suggests that multiprocessor systems should not be considered as DDBSs. Although shared-nothing multiprocessors, where each processor node has its own primary and secondary memory, and may also have its own peripherals, are quite similar to the distributed environment that we focus on, there are differences. The fundamental difference is the mode of operation. A multiprocessor system design is rather symmetrical, consisting of a number of identical processor and memory components, and controlled by one or more copies of the same operating system that is responsible for a strict control of the task assignment to each processor. This is not true in distributed computing systems, where heterogeneity of the operating system as well as the hardware is quite common. Database systems that run over multiprocessor systems are called parallel database systems and are discussed in Chapter 14.

A DDBS is also not a system where, despite the existence of a network, the database resides at only one node of the network (Figure 1.3). In this case, the problems of database management are no different than the problems encountered in a centralized database environment (shortly, we will discuss client/server DBMSs which relax this requirement to a certain extent). The database is centrally managed by one computer system (site 2 in Figure 1.3) and all the requests are routed to that site. The only additional consideration has to do with transmission delays. It is obvious that the existence of a computer network or a collection of “files” is not sufficient to form a distributed database system. What we are interested in is an environment where data are distributed among a number of sites (Figure 1.4).


Fig. 1.3 Central Database on a Network (figure: Site 1 to Site 5 connected by a communication network, with the database at a single site)

Fig. 1.4 DDBS Environment (figure: Site 1 to Site 5 connected by a communication network, with data distributed among the sites)

1.3 Data Delivery Alternatives

In distributed databases, data are “delivered” from the sites where they are stored to where the query is posed. We characterize the data delivery alternatives along three orthogonal dimensions: delivery modes, frequency and communication methods. The combinations of alternatives along each of these dimensions (that we discuss next) provide a rich design space.

The alternative delivery modes are pull-only, push-only and hybrid. In the pull-only mode of data delivery, the transfer of data from servers to clients is initiated by a client pull. When a client request is received at a server, the server responds by locating the requested information. The main characteristic of pull-based delivery is that the arrival of new data items or updates to existing data items are carried out at a server without notification to clients unless clients explicitly poll the server. Also, in pull-based mode, servers must be interrupted continuously to deal with requests from clients. Furthermore, the information that clients can obtain from a server is limited to when and what clients know to ask for. Conventional DBMSs offer primarily pull-based data delivery.

In the push-only mode of data delivery, the transfer of data from servers to clients is initiated by a server push in the absence of any specific request from clients. The main difficulty of the push-based approach is in deciding which data would be of common interest, and when to send them to clients; the alternatives are periodic, irregular, or conditional. Thus, the usefulness of server push depends heavily upon how accurately a server can predict the needs of clients. In push-based mode, servers disseminate information either to an unbounded set of clients (random broadcast) who can listen to a medium, or to a selective set of clients (multicast) who belong to some categories of recipients that may receive the data.

The hybrid mode of data delivery combines the client-pull and server-push mechanisms. The continuous (or continual) query approach (e.g., [Liu et al., 1996], [Terry et al., 1992], [Chen et al., 2000], [Pandey et al., 2003]) presents one possible way of combining the pull and push modes: namely, the transfer of information from servers to clients is first initiated by a client pull (by posing the query), and the subsequent transfer of updated information to clients is initiated by a server push.
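The three delivery modes can be illustrated with a small in-process sketch. This is only a toy model under stated assumptions: the `Server` class, its method names, and the stock key `"IBM"` are hypothetical and do not come from any real DBMS interface. A client pull is a direct read, while a continuous query returns the current value once (pull) and then receives later changes through a callback (push).

```python
from typing import Callable, Dict, List, Tuple

Callback = Callable[[str, float], None]


class Server:
    """Toy data server (hypothetical) supporting pull and hybrid delivery."""

    def __init__(self) -> None:
        self.data: Dict[str, float] = {}
        self.subscribers: Dict[str, List[Callback]] = {}

    def pull(self, key: str) -> float:
        # Pull-only: the server answers only when a client asks.
        return self.data[key]

    def continuous_query(self, key: str, on_update: Callback) -> float:
        # Hybrid: posing the query is a client pull that returns the
        # current value; later updates are pushed through the callback.
        self.subscribers.setdefault(key, []).append(on_update)
        return self.data[key]

    def update(self, key: str, value: float) -> None:
        # Server push: every registered client is notified of the change.
        self.data[key] = value
        for notify in self.subscribers.get(key, []):
            notify(key, value)


def demo() -> Tuple[float, List[Tuple[str, float]], float]:
    server = Server()
    server.update("IBM", 100.0)           # no subscribers yet, nothing is pushed
    received: List[Tuple[str, float]] = []
    first = server.continuous_query("IBM", lambda k, v: received.append((k, v)))
    server.update("IBM", 101.5)           # pushed to the subscribed client
    latest = server.pull("IBM")           # an ordinary client pull
    return first, received, latest
```

In this sketch a client that only calls `pull` never learns about an update unless it asks again, which is the limitation of pull-based delivery noted earlier.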

There are three typical frequency measurements that can be used to classify the regularity of data delivery. They are periodic, conditional, and ad-hoc or irregular.

In periodic delivery, data are sent from the server to clients at regular intervals. The intervals can be defined by system default or by clients using their profiles. Both pull and push can be performed in periodic fashion. Periodic delivery is carried out on a regular and pre-specified repeating schedule. A client request for IBM’s stock price every week is an example of a periodic pull. An example of periodic push is an application that sends out stock price listings on a regular basis, say every morning. Periodic push is particularly useful for situations in which clients might not be available at all times, or might be unable to react to what has been sent, such as in the mobile setting where clients can become disconnected.

In conditional delivery, data are sent from servers whenever certain conditions installed by clients in their profiles are satisfied. Such conditions can be as simple as a given time span or as complicated as event-condition-action rules. Conditional delivery is mostly used in the hybrid or push-only delivery systems. Using conditional push, data are sent out according to a pre-specified condition, rather than any particular repeating schedule. An application that sends out stock prices only when they change is an example of conditional push. An application that sends out a balance statement only when the total balance is 5% below the pre-defined balance threshold is an example of hybrid conditional push. Conditional push assumes that changes are critical to the clients, and that clients are always listening and need to respond to what is being sent. Hybrid conditional push further assumes that missing some update information is not crucial to the clients.
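The two examples of conditional delivery can be sketched as a filter over a sequence of readings. This is a minimal illustration with several assumptions: the function names and sample values are hypothetical, the first stock price is treated as a change, and "5% below the threshold" is read as "less than 95% of the threshold".

```python
from typing import Callable, Iterable, Iterator, Optional

Condition = Callable[[Optional[float], float], bool]


def conditional_push(readings: Iterable[float], condition: Condition) -> Iterator[float]:
    """Yield only the readings for which the client-installed condition holds."""
    previous: Optional[float] = None
    for value in readings:
        if condition(previous, value):
            yield value
        previous = value


def price_changed(previous: Optional[float], current: float) -> bool:
    # Assumption: the very first price counts as a change and is sent.
    return previous is None or current != previous


def below_threshold(threshold: float) -> Condition:
    # Assumption: "5% below the threshold" means less than 95% of it.
    def condition(previous: Optional[float], current: float) -> bool:
        return current < 0.95 * threshold
    return condition


sent_prices = list(conditional_push([100, 100, 101, 101, 99], price_changed))
# sent_prices == [100, 101, 99]

sent_balances = list(conditional_push([1000, 960, 940, 900], below_threshold(1000)))
# sent_balances == [940, 900]
```

Unlike periodic delivery, nothing is sent while the condition is false, so the number of messages depends on the data rather than on a schedule.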

Ad-hoc delivery is irregular and is performed mostly in a pure pull-based system. Data are pulled from servers to clients in an ad-hoc fashion whenever clients request it. In contrast, periodic pull arises when a client uses polling to obtain data from servers based on a regular period (schedule).

The third component of the design space of information delivery alternatives is the communication method. These methods determine the various ways in which servers and clients communicate for delivering information to clients. The alternatives are unicast and one-to-many. In unicast, the communication from a server to a client is one-to-one: the server sends data to one client using a particular delivery mode with some frequency. In one-to-many, as the name implies, the server sends data to a number of clients. Note that we are not referring here to a specific protocol; one-to-many communication may use a multicast or broadcast protocol.

We should note that this characterization is subject to considerable debate. It is not clear that every point in the design space is meaningful. Furthermore, specification of alternatives such as conditional and periodic (which may make sense) is difficult. However, it serves as a first-order characterization of the complexity of emerging distributed data management systems. For the most part, in this book, we are concerned with pull-only, ad hoc data delivery systems, although examples of other approaches are discussed in some chapters.
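To get a feel for the size of this design space, the three dimensions can simply be enumerated. The sketch below is illustrative only: the string labels are taken from the text, while the variable names are assumptions. It lists every combination and selects the pull-only, ad hoc points that this book mostly deals with.

```python
from itertools import product

# The three orthogonal dimensions described in this section.
DELIVERY_MODES = ("pull-only", "push-only", "hybrid")
FREQUENCIES = ("periodic", "conditional", "ad-hoc")
COMMUNICATION_METHODS = ("unicast", "one-to-many")

# Every point in the design space: 3 * 3 * 2 = 18 combinations.
design_space = list(product(DELIVERY_MODES, FREQUENCIES, COMMUNICATION_METHODS))

# The points with pull-only, ad hoc delivery (the communication method is left open).
book_focus = [
    point for point in design_space
    if point[0] == "pull-only" and point[1] == "ad-hoc"
]

print(len(design_space))  # 18
print(book_focus)
```

As noted above, not every one of the 18 points need be meaningful; the enumeration only shows how quickly the alternatives multiply.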

1.4 Promises of DDBSs

Many advantages of DDBSs have been cited in the literature, ranging from sociological reasons for decentralization [D’Oliviera, 1977] to better economics. All of these can be distilled to four fundamentals which may also be viewed as promises of DDBS technology: transparent management of distributed and replicated data, reliable access to data through distributed transactions, improved performance, and easier system expansion. In this section we discuss these promises and, in the process, introduce many of the concepts that we will study in subsequent chapters.

1.4.1 Transparent Management of Distributed and Replicated Data

Transparency refers to separation of the higher-level semantics of a system from lower-level implementation issues. In other words, a transparent system “hides” the implementation details from users. The advantage of a fully transparent DBMS is the high level of support that it provides for the development of complex applications. It is obvious that we would like to make all DBMSs (centralized or distributed) fully transparent.

Let us start our discussion with an example. Consider an engineering firm that has offices in Boston, Waterloo, Paris and San Francisco. They run projects at each of these sites and would like to maintain a database of their employees, the projects and other related data. Assuming that the database is relational, we can store this information in two relations: EMP(ENO, ENAME, TITLE)1 and PROJ(PNO, PNAME, BUDGET). We also introduce a third relation to store salary information: SAL(TITLE, AMT) and a fourth relation ASG which indicates which employees have been assigned to which projects for what duration with what responsibility: ASG(ENO, PNO, RESP, DUR). If all of this data were stored in a centralized DBMS, and we wanted to find out the names and salaries of employees who worked on a project for more than 12 months, we would specify this using the following SQL query:

SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE ASG.DUR > 12
AND EMP.ENO = ASG.ENO
AND SAL.TITLE = EMP.TITLE
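To make the centralized case concrete, the following is a minimal sketch that runs this query with Python's standard sqlite3 module. The sample tuples, the column types, and the function names are assumptions made for illustration; they do not come from the text. An ORDER BY clause is added only to make the output deterministic.

```python
import sqlite3

def build_sample_db():
    """Create the four relations from the text with a few made-up tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE EMP  (ENO TEXT PRIMARY KEY, ENAME TEXT, TITLE TEXT);
        CREATE TABLE PROJ (PNO TEXT PRIMARY KEY, PNAME TEXT, BUDGET INTEGER);
        CREATE TABLE SAL  (TITLE TEXT PRIMARY KEY, AMT INTEGER);
        CREATE TABLE ASG  (ENO TEXT, PNO TEXT, RESP TEXT, DUR INTEGER);

        -- Hypothetical sample data, for illustration only.
        INSERT INTO EMP  VALUES ('E1', 'Ann', 'Engineer'), ('E2', 'Bob', 'Analyst');
        INSERT INTO PROJ VALUES ('P1', 'Instrumentation', 150000);
        INSERT INTO SAL  VALUES ('Engineer', 40000), ('Analyst', 34000);
        INSERT INTO ASG  VALUES ('E1', 'P1', 'Manager', 24), ('E2', 'P1', 'Analyst', 6);
        """
    )
    return conn

def names_and_salaries(conn):
    """Names and salaries of employees assigned to a project for more than 12 months."""
    return conn.execute(
        """
        SELECT ENAME, AMT
        FROM EMP, ASG, SAL
        WHERE ASG.DUR > 12
          AND EMP.ENO = ASG.ENO
          AND SAL.TITLE = EMP.TITLE
        ORDER BY ENAME  -- added for a deterministic result order
        """
    ).fetchall()
```

With the sample data above, only the employee whose assignment lasts 24 months qualifies; the one with a 6-month assignment is filtered out by the DUR predicate.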

However, given the distributed nature of this firm’s business, it is preferable, under these circumstances, to localize data such that data about the employees in the Waterloo office are stored in Waterloo, those in the Boston office are stored in Boston, and so forth. The same applies to the project and salary information. Thus, what we are engaged in is a process where we partition each of the relations and store each partition at a different site. This is known as fragmentation and we discuss it further below and in detail in Chapter 3.

Furthermore, it may be preferable to duplicate some of this data at other sites for performance and reliability reasons. The result is a distributed database which is fragmented and replicated (Figure 1.5). Fully transparent access means that the users can still pose the query as specified above, without paying any attention to the fragmentation, location, or replication of data, and let the system worry about resolving these issues.

For a system to adequately deal with this type of query over a distributed, fragmented and replicated database, it needs to be able to deal with a number of different types of transparencies. We discuss these in this section.

1.4.1.1 Data Independence

Data independence is a fundamental form of transparency that we look for within a DBMS. It is also the only type that is important within the context of a centralized DBMS. It refers to the immunity of user applications to changes in the definition and organization of data, and vice versa.

As is well known, data definition occurs at two levels. At one level the logical structure of the data is specified, and at the other level its physical structure. The former is commonly known as the schema definition, whereas the latter is referred to as the physical data description. We can therefore talk about two types of data independence: logical data independence and physical data independence. Logical data independence refers to the immunity of user applications to changes in the logical structure (i.e., schema) of the database. Physical data independence, on the other hand, deals with hiding the details of the storage structure from user applications. When a user application is written, it should not be concerned with the details of physical data organization. Therefore, the user application should not need to be modified when data organization changes occur due to performance considerations.

[Figure: four sites (Boston, Waterloo, Paris, San Francisco) connected by a communication network. The groups of data shown at the sites are: (Boston employees, Paris employees, Boston projects); (Waterloo employees, Waterloo projects, Paris projects); (San Francisco employees, San Francisco projects); (Paris employees, Boston employees, Paris projects, Boston projects).]

Fig. 1.5 A Distributed Application

1 We discuss relational systems in Chapter 2 (Section 2.1) where we develop this example further. For the time being, it is sufficient to note that this nomenclature indicates that we have just defined a relation with three attributes: ENO (which is the key, identified by underlining), ENAME and TITLE.

1.4.1.2 Network Transparency

In centralized database systems, the only available resource that needs to be shielded from the user is the data (i.e., the storage system). In a distributed database environment, however, there is a second resource that needs to be managed in much the same manner: the network. Preferably, the user should be protected from the operational details of the network, possibly even from the existence of the network. Then there would be no difference between database applications that would run on a centralized database and those that would run on a distributed database. This type of transparency is referred to as network transparency or distribution transparency.

One can consider network transparency from the viewpoint of either the services provided or the data. From the former perspective, it is desirable to have a uniform means by which services are accessed. From a DBMS perspective, distribution transparency requires that users do not have to specify where data are located.

Sometimes two types of distribution transparency are identified: location transparency and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of both the location of the data and the system on which an operation is carried out. Naming transparency means that a unique name is provided for each object in the database. In the absence of naming transparency, users are required to embed the location name (or an identifier) as part of the object name.
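The difference between the two situations can be sketched in a few lines. In the following illustration, a small catalog maps object names to sites; the catalog contents, the "site.object" naming convention, and the function name are all hypothetical and are not taken from any particular system.

```python
# Hypothetical catalog: global object name -> site that stores the object.
CATALOG = {"EMP": "waterloo", "PROJ": "boston"}

def resolve(name, catalog=CATALOG):
    """Return (site, local_name) for an object name.

    A name of the form 'site.object' is what users must write when naming
    transparency is absent: the location is embedded in the name.
    A bare 'object' name is resolved by the system through the catalog,
    which is the situation with naming transparency.
    """
    if "." in name:
        site, local_name = name.split(".", 1)
        return site, local_name
    if name not in catalog:
        raise KeyError(f"unknown object: {name}")
    return catalog[name], name
```

Here resolve("EMP") lets the system pick the site, whereas resolve("paris.ASG") simply echoes back the location that the user was forced to supply.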

1.4.1.3 Replication Transparency

The issue of replicating data within a distributed database is introduced in Chapter 3 and discussed in detail in Chapter 13. At this point, let us just mention that for performance, reliability, and availability reasons, it is usually desirable to be able to distribute data in a replicated fashion across the machines on a network. Such replication helps performance since diverse and conflicting user requirements can be more easily accommodated. For example, data that are commonly accessed by one user can be placed on that user’s local machine as well as on the machine of another user with the same access requirements. This increases the locality of reference. Furthermore, if one of the machines fails, a copy of the data is still available on another machine on the network. Of course, this is a very simple-minded description of the situation. In fact, the decision as to whether to replicate or not, and how many copies of any database object to have, depends to a considerable degree on user applications. We will discuss these in later chapters.

Assuming that data are replicated, the transparency issue is whether the users should be aware of the existence of copies or whether the system should handle the management of copies and the user should act as if there is a single copy of the data (note that we are not referring to the placement of copies, only their existence). From a user’s perspective the answer is obvious. It is preferable not to be involved with handling copies and having to specify the fact that a certain action can and/or should be taken on multiple copies. From a systems point of view, however, the answer is not that simple. As we will see in Chapter 11, when the responsibility of specifying that an action needs to be executed on multiple copies is delegated to the user, it makes transaction management simpler for distributed DBMSs. On the other hand, doing so inevitably results in the loss of some flexibility. It is not the system that decides whether or not to have copies and how many copies to have, but the user application. Any change in these decisions because of various considerations definitely affects the user application and, therefore, reduces data independence considerably. Given these considerations, it is desirable that replication transparency be provided as a standard feature of DBMSs. Remember that replication transparency refers only to the existence of replicas, not to their actual location. Note also that distributing these replicas across the network in a transparent manner is the domain of network transparency.

1.4.1.4 Fragmentation Transparency

The final form of transparency that needs to be addressed within the context of a distributed database system is that of fragmentation transparency. In Chapter 3 we discuss and justify the fact that it is commonly desirable to divide each database relation into smaller fragments and treat each fragment as a separate database object (i.e., another relation). This is commonly done for reasons of performance, availability, and reliability. Furthermore, fragmentation can reduce the negative effects of replication. Each replica is not the full relation but only a subset of it; thus less space is required and fewer data items need be managed.

There are two general types of fragmentation alternatives. In one case, called horizontal fragmentation, a relation is partitioned into a set of sub-relations each of which has a subset of the tuples (rows) of the original relation. The second alternative is vertical fragmentation where each sub-relation is defined on a subset of the attributes (columns) of the original relation.
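The two alternatives can be illustrated on the EMP relation. The following is a minimal sketch in which a relation is a list of dictionaries; the sample tuples, the choice of TITLE as the fragmentation predicate, and the helper names are assumptions made for this example only.

```python
# Hypothetical EMP(ENO, ENAME, TITLE) tuples, for illustration only.
EMP = [
    {"ENO": "E1", "ENAME": "Ann", "TITLE": "Engineer"},
    {"ENO": "E2", "ENAME": "Bob", "TITLE": "Analyst"},
    {"ENO": "E3", "ENAME": "Cara", "TITLE": "Engineer"},
]

def horizontal_fragments(relation, predicate):
    """Split the tuples (rows) into those that satisfy the predicate and the rest."""
    selected = [t for t in relation if predicate(t)]
    rest = [t for t in relation if not predicate(t)]
    return selected, rest

def vertical_fragment(relation, attributes):
    """Keep only the given attributes (columns); the key must be among them."""
    return [{a: t[a] for a in attributes} for t in relation]

def join_on_key(left, right, key):
    """Rebuild full tuples from two vertical fragments that share the key."""
    right_by_key = {t[key]: t for t in right}
    return [{**t, **right_by_key[t[key]]} for t in left]

# Horizontal: rows are divided, every fragment keeps all columns.
engineers, others = horizontal_fragments(EMP, lambda t: t["TITLE"] == "Engineer")

# Vertical: columns are divided, every fragment repeats the key ENO.
names = vertical_fragment(EMP, ["ENO", "ENAME"])
titles = vertical_fragment(EMP, ["ENO", "TITLE"])
```

The original relation is recovered by a union of the horizontal fragments, or by a join of the vertical fragments on the key, which is why the key is repeated in each vertical fragment in this sketch.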

When database objects are fragmented, we have to deal with the problem of handling user queries that are specified on entire relations but have to be executed on subrelations. In other words, the issue is one of finding a query processing strategy based on the fragments rather than the relations, even though the queries are specified on the latter. Typically, this requires a translation from what is called a global query to several fragment queries. Since the fundamental issue of dealing with fragmentation transparency is one of query processing, we defer the discussion of techniques by which this translation can be performed until Chapter 7.
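As a small illustration of this translation, consider a selection on EMP when EMP is horizontally fragmented over two sites. In the sketch below each site is modeled by its own in-memory SQLite database; the site names, the sample tuples, and the function names are hypothetical. Because a selection can be applied to each horizontal fragment separately, the global query becomes one fragment query per site whose results are combined.

```python
import sqlite3

def make_site(rows):
    """Create one site's local database holding its fragment of EMP."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE EMP (ENO TEXT, ENAME TEXT, TITLE TEXT)")
    conn.executemany("INSERT INTO EMP VALUES (?, ?, ?)", rows)
    return conn

def run_global_selection(sites, title):
    """Answer the global query SELECT ENAME FROM EMP WHERE TITLE = ?.

    The user refers to EMP as a whole; the system runs the same selection
    on every horizontal fragment and returns the union of the results.
    """
    fragment_sql = "SELECT ENAME FROM EMP WHERE TITLE = ?"
    result = []
    for site_name in sorted(sites):
        rows = sites[site_name].execute(fragment_sql, (title,))
        result.extend(row[0] for row in rows)
    return sorted(result)

# Hypothetical fragments of EMP at two sites.
sites = {
    "waterloo": make_site([("E1", "Ann", "Engineer"), ("E2", "Bob", "Analyst")]),
    "boston": make_site([("E3", "Cara", "Engineer")]),
}
```

Real translations also have to deal with joins, vertical fragments, and fragments that cannot contribute to the answer; this sketch covers only the simplest case.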

1.4.1.5 Who Should Provide Transparency?

In previous sections we discussed various possible forms of transparency within a distributed computing environment. Obviously, to provide easy and efficient access by novice users to the services of the DBMS, one would want to have full transparency, involving all the various types that we discussed. Nevertheless, the level of transparency is inevitably a compromise between ease of use and the difficulty and overhead cost of providing high levels of transparency. For example, Gray argues that full transparency makes the management of distributed data very difficult and claims that “applications coded with transparent access to geographically distributed databases have: poor manageability, poor modularity, and poor message performance” [Gray, 1989]. He proposes a remote procedure call mechanism between the requestor users and the server DBMSs whereby the users would direct their queries to a specific DBMS. This is indeed the approach commonly taken by client/server systems that we discuss shortly.

What has not yet been discussed is who is responsible for providing these services. It is possible to identify three distinct layers at which the transparency services can be provided. It is quite common to treat these as mutually exclusive means of providing the service, although it is more appropriate to view them as complementary.

We could leave the responsibility of providing transparent access to data resources to the access layer. The transparency features can be built into the user language, which then translates the requested services into required operations. In other words, the compiler or the interpreter takes over the task and no transparent service is provided to the implementer of the compiler or the interpreter.

The second layer at which transparency can be provided is the operating system level. State-of-the-art operating systems provide some level of transparency to system users. For example, the device drivers within the operating system handle the details of getting each piece of peripheral equipment to do what is requested. The typical computer user, or even an application programmer, does not normally write device drivers to interact with individual peripheral equipment; that operation is transparent to the user.

Providing transparent access to resources at the operating system level can obviously be extended to the distributed environment, where the management of the network resource is taken over by the distributed operating system or the middleware if the distributed DBMS is implemented over one. There are two potential problems with this approach. The first is that not all commercially available distributed operating systems provide a reasonable level of transparency in network management. The second problem is that some applications do not wish to be shielded from the details of distribution and need to access them for specific performance tuning.

The third layer at which transparency can be supported is within the DBMS. The transparency and support for database functions provided to the DBMS designers by an underlying operating system is generally minimal and typically limited to very fundamental operations for performing certain tasks. It is the responsibility of the DBMS to make all the necessary translations from the operating system to the higher-level user interface. This mode of operation is the most common method today. There are, however, various problems associated with leaving the task of providing full transparency to the DBMS. These have to do with the interaction of the operating system with the distributed DBMS and are discussed throughout this book.

A hierarchy of these transparencies is shown in Figure 1.6. It is not always easy to delineate clearly the levels of transparency, but such a figure serves an important instructional purpose even if it is not fully correct. To complete the picture we have added a “language transparency” layer, although it is not discussed in this chapter. With this generic layer, users have high-level access to the data (e.g., fourth-generation languages, graphical user interfaces, natural language access).

1.4.2 Reliability Through Distributed Transactions

Distributed DBMSs are intended to improve reliability since they have replicated components and thereby eliminate single points of failure. The failure of a single site, or the failure of a communication link which makes one or more sites unreachable, is not sufficient to bring down the entire system. In the case of a distributed database, this means that some of the data may be unreachable, but with proper care, users may be permitted to access other parts of the distributed database. The “proper care” comes in the form of support for distributed transactions and application protocols.

[Figure: layers of transparency drawn around the data. The layers, in the order in which they appear in the figure, are: Data Independence, Network Transparency, Replication Transparency, Fragmentation Transparency, Language Transparency.]

Fig. 1.6 Layers of Transparency

We discuss transactions and transaction processing in detail in Chapters 10–12. A transaction is a basic unit of consistent and reliable computing, consisting of a sequence of database operations executed as an atomic action. It transforms a consistent database state to another consistent database state even when a number of such transactions are executed concurrently (sometimes called concurrency transparency), and even when failures occur (also called failure atomicity). Therefore, a DBMS that provides full transaction support guarantees that concurrent execution of user transactions will not violate database consistency in the face of system failures as long as each transaction is correct, i.e., obeys the integrity rules specified on the database.

Let us give an example of a transaction based on the engineering firm example that we introduced earlier. Assume that there is an application that raises the salaries of all the employees by 10%. It is desirable to encapsulate the query (or the program code) that accomplishes this task within transaction boundaries. For example, if a system failure occurs half-way through the execution of this program, we would like the DBMS to be able to determine, upon recovery, where it left off and continue with its operation (or start all over again). This is the topic of failure atomicity. Alternatively, if some other user runs a query calculating the average salaries of the employees in this firm while the original update action is going on, the calculated result will be in error. Therefore we would like the system to be able to synchronize the concurrent execution of these two programs. To encapsulate a query (or a program code) within transactional boundaries, it is sufficient to declare the beginning of the transaction and its end:

Begin transaction SALARY UPDATE
begin
    EXEC SQL UPDATE PAY
             SET SAL = SAL*1.1
end.
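Failure atomicity for this update can be observed on a single-site system. The following is a minimal sketch using Python's standard sqlite3 module, not the embedded SQL notation used above. The PAY(ENO, SAL) layout, the sample values, and the fail_after parameter that simulates a failure half-way through are assumptions made for illustration.

```python
import sqlite3

def make_pay_db():
    """Create a PAY relation with two hypothetical salary tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PAY (ENO TEXT PRIMARY KEY, SAL REAL)")
    conn.executemany("INSERT INTO PAY VALUES (?, ?)", [("E1", 1000.0), ("E2", 2000.0)])
    conn.commit()  # make the initial state durable before any update runs
    return conn

def salaries(conn):
    return [row[0] for row in conn.execute("SELECT SAL FROM PAY ORDER BY ENO")]

def raise_salaries(conn, fail_after=None):
    """Raise every salary by 10% as one transaction.

    If fail_after is given, a failure is simulated after that many tuples
    have been updated, so that a partial update exists when it occurs.
    """
    enos = [row[0] for row in conn.execute("SELECT ENO FROM PAY ORDER BY ENO")]
    with conn:  # commits on normal exit, rolls back if an exception escapes
        for i, eno in enumerate(enos):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated system failure")
            conn.execute("UPDATE PAY SET SAL = SAL * 1.1 WHERE ENO = ?", (eno,))
```

When the simulated failure occurs after the first tuple has been updated, the rollback restores that tuple as well, so the database is left either with all salaries raised or with none raised.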

Distributed transactions execute at a number of sites at which they access the local database. The above transaction, for example, will execute in Boston, Waterloo, Paris and San Francisco since the data are distributed at these sites. With full support for distributed transactions, user applications can access a single logical image of the database and rely on the distributed DBMS to ensure that their requests will be executed correctly no matter what happens in the system. “Correctly” means that user applications do not need to be concerned with coordinating their accesses to individual local databases nor do they need to worry about the possibility of site or communication link failures during the execution of their transactions. This illustrates the link between distributed transactions and transparency, since both involve issues related to distributed naming and directory management, among other things.

Providing transaction support requires the implementation of distributed concurrency control (Chapter 11) and distributed reliability (Chapter 12) protocols, in particular two-phase commit (2PC) and distributed recovery protocols, which are significantly more complicated than their centralized counterparts. Supporting replicas requires the implementation of replica control protocols that enforce a specified semantics of accessing them (Chapter 13).
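To give a first impression of two-phase commit, the following sketch shows only its decision logic: a voting phase followed by a decision phase. The class and function names are hypothetical, and the sketch deliberately omits logging, timeouts, and site or link failures, which are what make the real protocol complicated.

```python
class Participant:
    """One site taking part in a distributed transaction (illustrative only)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "initial"

    def prepare(self):
        """Phase 1: vote on whether the local part of the transaction can commit."""
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes to commit."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (decision): send the same global decision to every participant.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

A single negative vote is enough to abort the transaction at all sites, which is how the protocol keeps the sites from disagreeing about the outcome.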

1.4.3 Improved Performance

The case for the improved performance of distributed DBMSs is typically made based on two points. First, a distributed DBMS fragments the conceptual database, enabling data to be stored in close proximity to its points of use (also called data localization). This has two potential advantages:

1. Since each site handles only a portion of the database, contention for CPU and I/O services is not as severe as for centralized databases.

2. Localization reduces remote access delays that are usually involved in wide area networks (for example, the minimum round-trip message propagation delay in satellite-based systems is about 1 second).

Most distributed DBMSs are structured to gain maximum benefit from data localization. Full benefits of reduced contention and reduced communication overhead can be obtained only by a proper fragmentation and distribution of the database.

This point relates to the overhead of distributed computing if the data have to reside at remote sites and one has to access it by remote communication. The argument is that it is better, in these circumstances, to distribute the data management functionality to where the data are located rather than moving large amounts of data. This has lately become a topic of contention. Some argue that with the widespread use of high-speed, high-capacity networks, distributing data and data management functions no longer makes sense and that it may be much simpler to store data at a central site and access it (by downloading) over high-speed networks. This argument, while appealing, misses the point of distributed databases. First of all, in most of today’s applications, data are distributed; what may be open for debate is how and where we process it. The second, and more important, point is that this argument does not distinguish between bandwidth (the capacity of the computer links) and latency (how long it takes for data to be transmitted). Latency is inherent in the distributed environments and there are physical limits to how fast we can send data over computer networks. As indicated above, for example, satellite links take about half-a-second to transmit data between two ground stations. This is a function of the distance of the satellites from the earth and there is nothing that we can do to improve that performance. For some applications, this might constitute an unacceptable delay.
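The distinction can be made concrete with a simple model in which the time to deliver a message is the latency plus the message size divided by the bandwidth. This model, the function name, and the message size and bandwidth values below are simplifying assumptions made for illustration; only the half-second latency is the figure quoted in the text.

```python
def transfer_time(size_bits, bandwidth_bps, latency_s):
    """One-way delivery time in seconds: latency plus transmission time."""
    return latency_s + size_bits / bandwidth_bps

# A hypothetical 1000-byte message over a link with half a second of latency.
MESSAGE_BITS = 8_000
LATENCY_S = 0.5

slow_link = transfer_time(MESSAGE_BITS, 1_000_000, LATENCY_S)      # 1 Mbit/s
fast_link = transfer_time(MESSAGE_BITS, 1_000_000_000, LATENCY_S)  # 1 Gbit/s
```

Under these assumptions, a thousandfold increase in bandwidth saves less than a hundredth of a second, while the half-second latency remains. For small messages the delay is dominated by latency, which more bandwidth cannot reduce.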

The second point is that the inherent parallelism of distributed systems may be exploited for inter-query and intra-query parallelism. Inter-query parallelism results from the ability to execute multiple queries at the same time while intra-query parallelism is achieved by breaking up a single query into a number of subqueries each of which is executed at a different site, accessing a different part of the distributed database.
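Intra-query parallelism can be sketched as follows: a counting query over ASG is broken into one subquery per fragment, the subqueries run concurrently, and the partial results are combined. Threads stand in for sites here, and the fragment contents and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_long_assignments(fragment, min_dur=12):
    """Subquery run at one site: count local ASG tuples with DUR > min_dur."""
    return sum(1 for (_eno, _pno, _resp, dur) in fragment if dur > min_dur)

def parallel_count(fragments):
    """Run the subquery on every fragment concurrently and add up the counts.

    `fragments` maps a site name to its (non-empty dict of) ASG tuples;
    at least one fragment is assumed.
    """
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partial_counts = list(pool.map(count_long_assignments, fragments.values()))
    return sum(partial_counts)

# Hypothetical ASG(ENO, PNO, RESP, DUR) fragments at two sites.
asg_fragments = {
    "boston": [("E1", "P1", "Manager", 24), ("E2", "P1", "Analyst", 6)],
    "paris": [("E3", "P2", "Engineer", 18), ("E4", "P2", "Engineer", 36)],
}
```

Counting is easy to parallelize because the partial counts simply add up. Other operations, such as joins across fragments, need more elaborate strategies.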

If the user access to the distributed database consisted only of querying (i.e., read-only access), then provision of inter-query and intra-query parallelism would imply that as much of the database as possible should be replicated. However, since most database accesses are not read-only, the mixing of read and update operations requires the implementation of elaborate concurrency control and commit protocols.

1.4.4 Easier System Expansion

In a distributed environment, it is much easier to accommodate increasing database sizes. Major system overhauls are seldom necessary; expansion can usually be handled by adding processing and storage power to the network. Obviously, it may not be possible to obtain a linear increase in “power,” since this also depends on the overhead of distribution. However, significant improvements are still possible.

One aspect of easier system expansion is economics. It normally costs much less to put together a system of “smaller” computers with the equivalent power of a single big machine. In earlier times, it was commonly believed that it would be possible to purchase a computer four times as powerful if one spent twice as much. This was known as Grosch’s law. With the advent of microcomputers and workstations, and their price/performance characteristics, this law is considered invalid.

This should not be interpreted to mean that mainframes are dead; this is not the point that we are making here. Indeed, in recent years, we have observed a resurgence in the world-wide sale of mainframes. The point is that for many applications, it is more economical to put together a distributed computer system (whether composed of mainframes or workstations) with sufficient power than it is to establish a single, centralized system to run these tasks. In fact, the latter may not even be feasible these days.

1.5 Complications Introduced by Distribution

The problems encountered in database systems take on additional complexity in a distributed environment, even though the basic underlying principles are the same. Furthermore, this additional complexity gives rise to new problems influenced mainly by three factors.

First, data may be replicated in a distributed environment. A distributed database can be designed so that the entire database, or portions of it, reside at different sites of a computer network. It is not essential that every site on the network contain the database; it is only essential that there be more than one site where the database resides. The possible duplication of data items is mainly due to reliability and efficiency considerations. Consequently, the distributed database system is responsible for (1) choosing one of the stored copies of the requested data for access in case of retrievals, and (2) making sure that the effect of an update is reflected on each and every copy of that data item.

Second, if some sites fail (e.g., by either hardware or software malfunction), or if some communication links fail (making some of the sites unreachable) while an update is being executed, the system must make sure that the effects will be reflected on the data residing at the failing or unreachable sites as soon as the system can recover from the failure.

The third point is that since each site cannot have instantaneous information on the actions currently being carried out at the other sites, the synchronization of transactions on multiple sites is considerably harder than for a centralized system.

These difficulties point to a number of potential problems with distributed DBMSs: the inherent complexity of building distributed applications; the increased cost of replicating resources and, more importantly, managing distribution; the devolution of control to many centers and the difficulty of reaching agreements; and the exacerbated security concerns (the secure communication channel problem). These are well-known problems in distributed systems in general, and, in this book, we discuss their manifestations within the context of distributed DBMSs and how they can be addressed.

1.6 Design Issues

In Section 1.4, we discussed the promises of distributed DBMS technology, highlighting the challenges that need to be overcome in order to realize them. In this section we build on this discussion by presenting the design issues that arise in building a distributed DBMS. These issues will occupy much of the remainder of this book.

1.6.1 Distributed Database Design

The question that is being addressed is how the database and the applications that run against it should be placed across the sites. There are two basic alternatives to placing data: partitioned (or non-replicated) and replicated. In the partitioned scheme the database is divided into a number of disjoint partitions each of which is placed at a different site. Replicated designs can be either fully replicated (also called fully duplicated) where the entire database is stored at each site, or partially replicated (or partially duplicated) where each partition of the database is stored at more than one site, but not at all the sites. The two fundamental design issues are fragmentation, the separation of the database into partitions called fragments, and distribution, the optimum distribution of fragments.

The research in this area mostly involves mathematical programming in order to minimize the combined cost of storing the database, processing transactions against it, and message communication among sites. The general problem is NP-hard. Therefore, the proposed solutions are based on heuristics. Distributed database design is the topic of Chapter 3.

1.6.2 Distributed Directory Management

A directory contains information (such as descriptions and locations) about data items in the database. Problems related to directory management are similar in nature to the database placement problem discussed in the preceding section. A directory may be global to the entire DDBS or local to each site; it can be centralized at one site or distributed over several sites; there can be a single copy or multiple copies. We briefly discuss these issues in Chapter 3.

1.6.3 Distributed Query Processing

Query processing deals with designing algorithms that analyze queries and convert them into a series of data manipulation operations. The problem is how to decide on a strategy for executing each query over the network in the most cost-effective way, however cost is defined. The factors to be considered are the distribution of data, communication costs, and lack of sufficient locally-available information. The objective is to optimize where the inherent parallelism is used to improve the performance of executing the transaction, subject to the above-mentioned constraints. The problem is NP-hard in nature, and the approaches are usually heuristic. Distributed query processing is discussed in detail in Chapters 6–8.

1.6.4 Distributed Concurrency Control

Concurrency control involves the synchronization of accesses to the distributed database, such that the integrity of the database is maintained. It is, without any doubt, one of the most extensively studied problems in the DDBS field. The concurrency control problem in a distributed context is somewhat different than in a centralized framework. One not only has to worry about the integrity of a single database, but also about the consistency of multiple copies of the database. The condition that requires all the values of multiple copies of every data item to converge to the same value is called mutual consistency.

The alternative solutions are too numerous to discuss here, so we examine them in detail in Chapter 11. Let us only mention that the two general classes are pessimistic, synchronizing the execution of user requests before the execution starts, and optimistic, executing the requests and then checking if the execution has compromised the consistency of the database. Two fundamental primitives that can be used with both approaches are locking, which is based on the mutual exclusion of accesses to data items, and timestamping, where the transaction executions are ordered based on timestamps. There are variations of these schemes as well as hybrid algorithms that attempt to combine the two basic mechanisms.

1.6.5 Distributed Deadlock Management

The deadlock problem in DDBSs is similar in nature to that encountered in operating systems. The competition among users for access to a set of resources (data, in this case) can result in a deadlock if the synchronization mechanism is based on locking. The well-known alternatives of prevention, avoidance, and detection/recovery also apply to DDBSs. Deadlock management is covered in Chapter 11.
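One common way to implement the detection alternative is to maintain a wait-for graph, in which an edge from one transaction to another means that the first is waiting for a lock held by the second; a cycle in this graph indicates a deadlock. The text does not prescribe this technique, so the following is only an illustrative sketch for a graph that is available at a single site. The function name and the graph representation are assumptions.

```python
def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle.

    `wait_for` maps each transaction to the set of transactions it is
    waiting for. Cycles are found with a depth-first search.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(txn):
        state[txn] = IN_PROGRESS
        for other in wait_for.get(txn, ()):
            other_state = state.get(other, UNVISITED)
            if other_state == IN_PROGRESS:
                return True  # reached a transaction already on the current path
            if other_state == UNVISITED and visit(other):
                return True
        state[txn] = DONE
        return False

    return any(state.get(txn, UNVISITED) == UNVISITED and visit(txn) for txn in wait_for)
```

In a distributed DBMS the edges of this graph are spread over several sites, so the difficulty lies in assembling or examining the graph across sites rather than in the cycle test itself.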

1.6.6 Reliability of Distributed DBMS

We mentioned earlier that one of the potential advantages of distributed systems is improved reliability and availability. This, however, is not a feature that comes automatically. It is important that mechanisms be provided to ensure the consistency of the database as well as to detect failures and recover from them. The implication for DDBSs is that when a failure occurs and various sites become either inoperable or inaccessible, the databases at the operational sites remain consistent and up to date. Furthermore, when the computer system or network recovers from the failure, the DDBSs should be able to recover and bring the databases at the failed sites up-to-date. This may be especially difficult in the case of network partitioning, where the sites are divided into two or more groups with no communication among them. Distributed reliability protocols are the topic of Chapter 12.

[Figure: diagram relating the research issues Directory Management, Query Processing, Distributed DB Design, Concurrency Control, Deadlock Management, Reliability, and Replication.]

Fig. 1.7 Relationship Among Research Issues

1.6.7 Replication

If the distributed database is (partially or fully) replicated, it is necessary to implement protocols that ensure the consistency of the replicas, i.e., copies of the same data item have the same value. These protocols can be eager in that they force the updates to be applied to all the replicas before the transaction completes, or they may be lazy so that the transaction updates one copy (called the master) from which updates are propagated to the others after the transaction completes. We discuss replication protocols in Chapter 13.
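The difference between the two kinds of protocol can be sketched for a single replicated data item. In the following illustration the class name, the method names, and the convention that the first replica is the master are assumptions; real protocols must also handle concurrent transactions and failures.

```python
class ReplicatedItem:
    """A data item with several copies; replicas[0] plays the role of the master."""

    def __init__(self, n_replicas, value=0):
        self.replicas = [value] * n_replicas
        self.pending = []  # updates made at the master but not yet propagated

    def write_eager(self, value):
        """Eager: apply the update to every replica before the transaction completes."""
        self.replicas = [value] * len(self.replicas)
        self.pending.clear()  # older lazy updates are superseded by this write

    def write_lazy(self, value):
        """Lazy: update only the master and remember the update for later."""
        self.replicas[0] = value
        self.pending.append(value)

    def propagate(self):
        """Apply the pending updates, in order, to the non-master replicas."""
        for value in self.pending:
            for i in range(1, len(self.replicas)):
                self.replicas[i] = value
        self.pending.clear()

    def consistent(self):
        """True if all copies currently hold the same value."""
        return len(set(self.replicas)) == 1
```

After write_lazy the copies disagree until propagate runs, whereas after write_eager they agree immediately; this window of disagreement is the price paid by lazy protocols for completing the transaction sooner.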

1.6.8 Relationship among Problems

Naturally, these problems are not isolated from one another. Each problem is affected by the solutions found for the others, and in turn affects the set of feasible solutions for them. In this section we discuss how they are related.

The relationship among the components is shown in Figure 1.7. The design of distributed databases affects many areas. It affects directory management, because the definition of fragments and their placement determine the contents of the directory (or directories) as well as the strategies that may be employed to manage them. The same information (i.e., fragment structure and placement) is used by the query processor to determine the query evaluation strategy. On the other hand, the access and usage patterns that are determined by the query processor are used as inputs to the data distribution and fragmentation algorithms. Similarly, directory placement and contents influence the processing of queries.


The replication of fragments when they are distributed affects the concurrency control strategies that might be employed. As we will study in Chapter 11, some concurrency control algorithms cannot be easily used with replicated databases. Similarly, usage and access patterns to the database will influence the concurrency control algorithms. If the environment is update intensive, the necessary precautions are quite different from those in a query-only environment.

There is a strong relationship among the concurrency control problem, the deadlock management problem, and reliability issues. This is to be expected, since together they are usually called the transaction management problem. The concurrency control algorithm that is employed will determine whether or not a separate deadlock management facility is required. If a locking-based algorithm is used, deadlocks can occur, whereas they will not if timestamping is the chosen alternative.
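To see why locking calls for a deadlock management facility, consider two transactions that each hold a lock the other needs. The sketch below assumes a wait-for graph representation (a common detection device that is not described in this section); the function name has_deadlock and the transaction labels are hypothetical.

```python
def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle.

    wait_for maps each transaction to the set of transactions whose
    locks it is waiting for. A cycle means none of them can proceed.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(txn):
        color[txn] = GREY
        for other in wait_for.get(txn, ()):
            state = color.get(other, WHITE)
            if state == GREY:  # reached a transaction already on the path
                return True
            if state == WHITE and visit(other):
                return True
        color[txn] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in wait_for)


# T1 holds x and waits for y; T2 holds y and waits for x.
blocked = has_deadlock({"T1": {"T2"}, "T2": {"T1"}})
# T1 waits for T2, but T2 is not waiting for anyone.
progressing = has_deadlock({"T1": {"T2"}, "T2": set()})
```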

Reliability mechanisms involve both local recovery techniques and distributed reliability protocols. In that sense, they both influence the choice of the concurrency control techniques and are built on top of them. Techniques to provide reliability also make use of data placement information since the existence of duplicate copies of the data serves as a safeguard to maintain reliable operation.

Finally, the need for replication protocols arises if data distribution involves replicas. As indicated above, there is a strong relationship between replication protocols and concurrency control techniques, since both deal with the consistency of data, but from different perspectives. Furthermore, the replication protocols influence distributed reliability techniques such as commit protocols. In fact, it is sometimes suggested (wrongly, in our view) that replication protocols can be used instead of implementing distributed commit protocols.

1.6.9 Additional Issues

The above design issues cover what may be called “traditional” distributed database systems. The environment has changed significantly since these topics started to be investigated, posing additional challenges and opportunities.

One of the important developments has been the move towards “looser” federation among data sources, which may also be heterogeneous. As we discuss in the next section, this has given rise to the development of multidatabase systems (also called federated databases and data integration systems) that require re-investigation of some of the fundamental database techniques. These systems constitute an important part of today’s distributed environment. We discuss database design issues in multidatabase systems (i.e., database integration) in Chapter 4 and the query processing challenges in Chapter 9.

The growth of the Internet as a fundamental networking platform has raised important questions about the assumptions underlying distributed database systems. Two issues are of particular concern to us. One is the re-emergence of peer-to-peer computing, and the other is the development and growth of the World Wide Web (web for short). Both of these aim at improving data sharing, but take different approaches and pose different data management challenges. We discuss peer-to-peer data management in Chapter 16 and web data management in Chapter 17.

We should note that peer-to-peer is not a new concept in distributed databases, as we discuss in the next section. However, its new reincarnation has significant differences from the earlier versions. In Chapter 16, it is these new versions that we focus on.

Finally, as earlier noted, there is a strong relationship between distributed databases and parallel databases. Although the former assumes each site to be a single logical computer, most of these installations are, in fact, parallel clusters. Thus, while most of the book focuses on issues that arise in managing data distributed across these sites, interesting data management issues exist within a single logical site that may be a parallel system. We discuss these issues in Chapter 14.

1.7 Distributed DBMS Architecture

The architecture of a system defines its structure. This means that the components of the system are identified, the function of each component is specified, and the interrelationships and interactions among these components are defined. The specification of the architecture of a system requires identification of the various modules, with their interfaces and interrelationships, in terms of the data and control flow through the system.

In this section we develop three “reference” architectures2 for a distributed DBMS: client/server systems, peer-to-peer distributed DBMS, and multidatabase systems. These are “idealized” views of a DBMS in that many of the commercially available systems may deviate from them; however, the architectures will serve as a reasonable framework within which the issues related to distributed DBMS can be discussed.

We start with a brief presentation of the “ANSI/SPARC architecture”, which is a datalogical approach to defining a DBMS architecture; it focuses on the different user classes and roles and their varying views on data. This architecture is helpful in putting certain concepts we have discussed so far in their proper perspective. We then have a short discussion of a generic architecture of a centralized DBMS, which we subsequently extend to identify the set of alternative architectures for a distributed DBMS. Within this characterization, we focus on the three alternatives that we identified above.

1.7.1 ANSI/SPARC Architecture

In late 1972, the Computer and Information Processing Committee (X3) of the American National Standards Institute (ANSI) established a Study Group on Database Management Systems under the auspices of its Standards Planning and Requirements Committee (SPARC). The mission of the study group was to study the feasibility of setting up standards in this area, as well as determining which aspects should be standardized if it was feasible. The study group issued its interim report in 1975 [ANSI/SPARC, 1975], and its final report in 1977 [Tsichritzis and Klug, 1978]. The architectural framework proposed in these reports came to be known as the “ANSI/SPARC architecture,” its full title being “ANSI/X3/SPARC DBMS Framework.” The study group proposed that the interfaces be standardized, and defined an architectural framework that contained 43 interfaces, 14 of which would deal with the physical storage subsystem of the computer and therefore not be considered essential parts of the DBMS architecture.

2 A reference architecture is commonly created by standards developers to clearly define the interfaces that need to be standardized.

Fig. 1.8 The ANSI/SPARC Architecture (users see external views, defined by the external schema; below these is the conceptual view, defined by the conceptual schema; at the bottom is the internal view, defined by the internal schema)

A simplified version of the ANSI/SPARC architecture is depicted in Figure 1.8. There are three views of data: the external view, which is that of the end user, who might be a programmer; the internal view, that of the system or machine; and the conceptual view, that of the enterprise. For each of these views, an appropriate schema definition is required.

At the lowest level of the architecture is the internal view, which deals with the physical definition and organization of data. The location of data on different storage devices and the access mechanisms used to reach and manipulate data are the issues dealt with at this level. At the other extreme is the external view, which is concerned with how users view the database. An individual user’s view represents the portion of the database that will be accessed by that user as well as the relationships that the user would like to see among the data. A view can be shared among a number of users, with the collection of user views making up the external schema. In between these two ends is the conceptual schema, which is an abstract definition of the database. It is the “real world” view of the enterprise being modeled in the database [Yormark, 1977]. As such, it is supposed to represent the data and the relationships among data without considering the requirements of individual applications or the restrictions of the physical storage media. In reality, however, it is not possible to ignore these requirements completely, due to performance reasons. The transformation between these three levels is accomplished by mappings that specify how a definition at one level can be obtained from a definition at another level.

This perspective is important, because it provides the basis for data independence that we discussed earlier. The separation of the external schemas from the conceptual schema enables logical data independence, while the separation of the conceptual schema from the internal schema allows physical data independence.
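The two kinds of data independence can be illustrated with a small experiment. The sketch below uses Python's built-in sqlite3 module; the relation emp, the view emp_directory, and their attributes are hypothetical names chosen for illustration. The view plays the role of an external schema, the base table that of the conceptual schema, and the index stands for an internal-level storage decision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level: the base relation.
    CREATE TABLE emp (eno INTEGER PRIMARY KEY, ename TEXT, title TEXT, salary REAL);
    INSERT INTO emp VALUES (1, 'J. Doe', 'Elect. Eng.', 40000),
                           (2, 'M. Smith', 'Syst. Anal.', 34000);
    -- External level: what one class of applications is allowed to see.
    CREATE VIEW emp_directory AS SELECT eno, ename, title FROM emp;
""")


def directory():
    """The application's query, written only against the external view."""
    return conn.execute(
        "SELECT ename, title FROM emp_directory ORDER BY eno"
    ).fetchall()


before = directory()

# Internal-level change: a new access path. Physical data independence
# means the application's query and its answer are unaffected.
conn.execute("CREATE INDEX emp_title_idx ON emp (title)")
after_index = directory()

# Conceptual-level change: a new attribute. Logical data independence
# means the external view, and the application using it, are unaffected.
conn.execute("ALTER TABLE emp ADD COLUMN dept TEXT")
after_alter = directory()
```

In all three cases the application receives the same answer, because it is written against the view rather than against the stored table.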

1.7.2 A Generic Centralized DBMS Architecture

A DBMS is a reentrant program shared by multiple processes (transactions) that run database programs. When running on a general purpose computer, a DBMS is interfaced with two other components: the communication subsystem and the operating system. The communication subsystem permits interfacing the DBMS with other subsystems in order to communicate with applications. For example, the terminal monitor needs to communicate with the DBMS to run interactive transactions. The operating system provides the interface between the DBMS and computer resources (processor, memory, disk drives, etc.).

The functions performed by a DBMS can be layered as in Figure 1.9, where the arrows indicate the direction of the data and the control flow. Taking a top-down approach, the layers are the interface, control, compilation, execution, data access, and consistency management.

The interface layer manages the interface to the applications. There can be several interfaces; in the case of relational DBMSs discussed in Chapter 2, examples are SQL embedded in a host language such as C, and QBE (Query-by-Example). Database application programs are executed against external views of the database. For an application, a view is useful in representing its particular perception of the database (shared by many applications). A view in relational DBMSs is a virtual relation derived from base relations by applying relational algebra operations.3 These concepts are defined more precisely in Chapter 2, but they are usually covered in undergraduate database courses, so we expect many readers to be familiar with them. View management consists of translating the user query from external data to conceptual data.

The control layer controls the query by adding semantic integrity predicates and authorization predicates. Semantic integrity constraints and authorizations are usually specified in a declarative language, as discussed in Chapter 5. The output of this layer is an enriched query in the high-level language accepted by the interface.
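The enrichment performed by the control layer can be sketched for the authorization case. This is only an illustration under simplifying assumptions: a query is reduced to a target relation and a list of predicate strings, and the Query class, the AUTHORIZATION_RULES table, and the control_layer function are hypothetical. Semantic integrity predicates could be appended to update requests in the same way.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """A retrieval request: a target relation plus a conjunction of predicates."""
    relation: str
    predicates: list = field(default_factory=list)


# Hypothetical authorization rules: (user, relation) -> predicates limiting access.
AUTHORIZATION_RULES = {
    ("clerk", "EMP"): ["TITLE <> 'Manager'"],
}


def control_layer(query, user):
    """Return an enriched query that also carries the user's authorization predicates.

    The input query is left untouched; the output is still expressed in the
    same high-level form, so it can be handed to the compilation layer.
    """
    extra = AUTHORIZATION_RULES.get((user, query.relation), [])
    return Query(query.relation, query.predicates + extra)


original = Query("EMP", ["ENO < 100"])
enriched = control_layer(original, "clerk")
```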

The query processing (or compilation) layer maps the query into an optimized sequence of lower-level operations. This layer is concerned with performance. It decomposes the query into a tree of algebra operations and tries to find the “optimal” ordering of the operations. The result is stored in an access plan. The output of this layer is a query expressed in lower-level code (algebra operations).

3 Note that this does not mean that the real-world views are, or should be, specified in relational algebra. On the contrary, they are specified by some high-level data language such as SQL. The translation from one of these languages to relational algebra is now well understood, and the effects of the view definition can be specified in terms of relational algebra operations.

Fig. 1.9 Functional Layers of a Centralized DBMS (layers from the applications down to the database: Interface, with User Interfaces and View Management; Control, with Semantic Integrity Control and Authorization Checking; Compilation, with Query Decomposition and Optimization and Access Plan Management; Execution, with Access Plan Execution Control and Algebra Operation Execution; Data Access, with Buffer Management and Access Methods; Consistency, with Concurrency Control and Logging. The flows between successive layers are labeled relational calculus, relational calculus, relational algebra, retrieval/update, and retrieval/update; results flow back to the applications.)

The execution layer directs the execution of the access plans, including transaction management (commit, restart) and synchronization of algebra operations. It interprets the relational operations by calling the data access layer through the retrieval and update requests.

The data access layer manages the data structures that implement the files, indices, etc. It also manages the buffers by caching the most frequently accessed data. Careful use of this layer minimizes the access to disks to get or write data.

Finally, the consistency layer manages concurrency control and logging for update requests. This layer allows transaction, system, and media recovery after failure.


1.7.3 Architectural Models for Distributed DBMSs

We now consider the possible ways in which a distributed DBMS may be architected. We use a classification (Figure 1.10) that organizes the systems as characterized with respect to (1) the autonomy of local systems, (2) their distribution, and (3) their heterogeneity.

Fig. 1.10 DBMS Implementation Alternatives (axes: Distribution, Heterogeneity, Autonomy; labeled points: Client/Server Systems, Peer-to-Peer DDBSs, Multidatabase Systems)

1.7.4 Autonomy

Autonomy, in this context, refers to the distribution of control, not of data. It indicates the degree to which individual DBMSs can operate independently. Autonomy is a function of a number of factors such as whether the component systems (i.e., individual DBMSs) exchange information, whether they can independently execute transactions, and whether one is allowed to modify them. Requirements of an autonomous system have been specified as follows [Gligor and Popescu-Zeletin, 1986]:

1. The local operations of the individual DBMSs are not affected by their participation in the distributed system.

2. The manner in which the individual DBMSs process queries and optimize them should not be affected by the execution of global queries that access multiple databases.

3. System consistency or operation should not be compromised when individual DBMSs join or leave the distributed system.

On the other hand, the dimensions of autonomy can be specified as follows [Du and Elmagarmid, 1989]:

1. Design autonomy: Individual DBMSs are free to use the data models and transaction management techniques that they prefer.

2. Communication autonomy: Each of the individual DBMSs is free to make its own decision as to what type of information it wants to provide to the other DBMSs or to the software that controls their global execution.

3. Execution autonomy: Each DBMS can execute the transactions that are submitted to it in any way that it wants to.

We will use a classification that covers the important aspects of these features. One alternative is tight integration, where a single-image of the entire database is available to any user who wants to share the information, which may reside in multiple databases. From the users’ perspective, the data are logically integrated in one database. In these tightly-integrated systems, the data managers are implemented so that one of them is in control of the processing of each user request even if that request is serviced by more than one data manager. The data managers do not typically operate as independent DBMSs even though they usually have the functionality to do so.

Next we identify semiautonomous systems that consist of DBMSs that can (and usually do) operate independently, but have decided to participate in a federation to make their local data sharable. Each of these DBMSs determines what parts of its own database it will make accessible to users of other DBMSs. They are not fully autonomous systems because they need to be modified to enable them to exchange information with one another.

The last alternative that we consider is total isolation, where the individual systems are stand-alone DBMSs that know neither of the existence of other DBMSs nor how to communicate with them. In such systems, the processing of user transactions that access multiple databases is especially difficult since there is no global control over the execution of individual DBMSs.

It is important to note at this point that the three alternatives that we consider for autonomous systems are not the only possibilities. We simply highlight the three most popular ones.


1.7.5 Distribution

Whereas autonomy refers to the distribution (or decentralization) of control, the distribution dimension of the taxonomy deals with data. Of course, we are considering the physical distribution of data over multiple sites; as we discussed earlier, the user sees the data as one logical pool. There are a number of ways DBMSs have been distributed. We abstract these alternatives into two classes: client/server distribution and peer-to-peer distribution (or full distribution). Together with the non-distributed option, the taxonomy identifies three alternative architectures.

The client/server distribution concentrates data management duties at servers while the clients focus on providing the application environment including the user interface. The communication duties are shared between the client machines and servers. Client/server DBMSs represent a practical compromise to distributing functionality. There are a variety of ways of structuring them, each providing a different level of distribution. With respect to the framework, we abstract these differences and leave that discussion to Section 1.7.8, which we devote to client/server DBMS architectures. What is important at this point is that the sites on a network are distinguished as “clients” and “servers” and their functionality is different.

In peer-to-peer systems, there is no distinction of client machines versus servers. Each machine has full DBMS functionality and can communicate with other machines to execute queries and transactions. Most of the very early work on distributed database systems has assumed a peer-to-peer architecture. Therefore, our main focus in this book is on peer-to-peer systems (also called fully distributed), even though many of the techniques carry over to client/server systems as well.

1.7.6 Heterogeneity

Heterogeneity may occur in various forms in distributed systems, ranging from hardware heterogeneity and differences in networking protocols to variations in data managers. The important ones from the perspective of this book relate to data models, query languages, and transaction management protocols. Representing data with different modeling tools creates heterogeneity because of the inherent expressive powers and limitations of individual data models. Heterogeneity in query languages not only involves the use of completely different data access paradigms in different data models (set-at-a-time access in relational systems versus record-at-a-time access in some object-oriented systems), but also covers differences in languages even when the individual systems use the same data model. Although SQL is now the standard relational query language, there are many different implementations and every vendor’s language has a slightly different flavor (sometimes even different semantics, producing different results).


1.7.7 Architectural Alternatives

The distribution of databases, their possible heterogeneity, and their autonomy are orthogonal issues. Consequently, following the above characterization, there are 18 different possible architectures. Not all of these architectural alternatives that form the design space are meaningful. Furthermore, not all are relevant from the perspective of this book.

In Figure 1.10, we have identified three alternative architectures that are the focus of this book and that we discuss in more detail in the next three subsections: (A0, D1, H0), which corresponds to client/server distributed DBMSs; (A0, D2, H0), which is a peer-to-peer distributed DBMS; and (A2, D2, H1), which represents a (peer-to-peer) distributed, heterogeneous multidatabase system. Note that we discuss the heterogeneity issues within the context of one system architecture, although the issue arises in other models as well.
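The count of 18 follows from the three autonomy alternatives, the three distribution alternatives (including the non-distributed option), and the two heterogeneity alternatives. A short sketch enumerates the design space; the assignment of the indices 0, 1, 2 to the labels is an assumption inferred from the order in which the alternatives are introduced above and from the three triples just named.

```python
from itertools import product

# Assumed index-to-label mapping for each dimension of the taxonomy.
AUTONOMY = {0: "tight integration", 1: "semiautonomous", 2: "total isolation"}
DISTRIBUTION = {0: "non-distributed", 1: "client/server", 2: "peer-to-peer"}
HETEROGENEITY = {0: "homogeneous", 1: "heterogeneous"}

# Every combination of one value per dimension is a point in the design space.
design_space = [
    f"(A{a}, D{d}, H{h})"
    for a, d, h in product(AUTONOMY, DISTRIBUTION, HETEROGENEITY)
]

# The three points singled out in the text.
FOCUS = {
    "(A0, D1, H0)": "client/server distributed DBMS",
    "(A0, D2, H0)": "peer-to-peer distributed DBMS",
    "(A2, D2, H1)": "heterogeneous multidatabase system",
}

for point in design_space:
    print(point, FOCUS.get(point, ""))
```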

1.7.8 Client/Server Systems

Client/server DBMSs entered the computing scene at the beginning of the 1990s and have made a significant impact on both the DBMS technology and the way we do computing. The general idea is very simple and elegant: distinguish the functionality that needs to be provided and divide these functions into two classes: server functions and client functions. This provides a two-level architecture which makes it easier to manage the complexity of modern DBMSs and the complexity of distribution.

As with any highly popular term, client/server has been much abused and has come to mean different things. If one takes a process-centric view, then any process that requests the services of another process is its client and vice versa. However, it is important to note that “client/server computing” and “client/server DBMS,” as it is used in our context, do not refer to processes, but to actual machines. Thus, we focus on what software should run on the client machines and what software should run on the server machine.

Put this way, the issue is clearer and we can begin to study the differences in client and server functionality. The functionality allocation between clients and servers differs in different types of distributed DBMSs (e.g., relational versus object-oriented). In relational systems, the server does most of the data management work. This means that all of query processing and optimization, transaction management, and storage management are done at the server. The client, in addition to the application and the user interface, has a DBMS client module that is responsible for managing the data that is cached to the client and (sometimes) managing the transaction locks that may have been cached as well. It is also possible to place consistency checking of user queries at the client side, but this is not common since it requires the replication of the system catalog at the client machines. Of course, there is operating system and communication software that runs on both the client and the server, but we only focus on the DBMS related functionality. This architecture, depicted in Figure 1.11, is quite common in relational systems where the communication between the clients and the server(s) is at the level of SQL statements. In other words, the client passes SQL queries to the server without trying to understand or optimize them. The server does most of the work and returns the result relation to the client.

Fig. 1.11 Client/Server Reference Architecture (client: User Interface, Application Program, Client DBMS, and Communication Software on top of the Operating System; server: Communication Software, Semantic Data Controller, Query Optimizer, Transaction Manager, Recovery Manager, and Runtime Support Processor on top of the Operating System, with access to the Database; SQL queries flow from the client to the server and result relations flow back)

There are a number of different types of client/server architecture. The simplest is the case where there is only one server which is accessed by multiple clients. We call this multiple client/single server. From a data management perspective, this is not much different from centralized databases since the database is stored on only one machine (the server) that also hosts the software to manage it. However, there are some (important) differences from centralized systems in the way transactions are executed and caches are managed. We do not consider such issues at this point. A more sophisticated client/server architecture is one where there are multiple servers in the system (the so-called multiple client/multiple server approach). In this case, two alternative management strategies are possible: either each client manages its own connection to the appropriate server or each client knows of only its “home server” which then communicates with other servers as required. The former approach simplifies server code, but loads the client machines with additional responsibilities. This leads to what has been called “heavy client” systems. The latter approach, on the other hand, concentrates the data management functionality at the servers. Thus, the transparency of data access is provided at the server interface, leading to “light clients.”
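The two management strategies for the multiple client/multiple server case can be contrasted in a minimal sketch. All names (Server, HeavyClient, LightClient) are hypothetical, the "database" at each server is reduced to a dictionary, and networking, transactions, and caching are left out entirely.

```python
class Server:
    """A server that stores part of the data and knows the other servers."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.peers = []  # other servers this one can contact

    def local_get(self, key):
        """Answer only from the data stored at this server."""
        return self.data.get(key)

    def get(self, key):
        """Home-server behavior: answer locally or ask the other servers."""
        if key in self.data:
            return self.data[key]
        for peer in self.peers:
            if key in peer.data:
                return peer.local_get(key)
        return None


class HeavyClient:
    """Manages its own connections: it must know which server holds each item."""

    def __init__(self, directory):
        self.directory = directory  # key -> Server

    def get(self, key):
        return self.directory[key].local_get(key)


class LightClient:
    """Knows only its home server, which locates the data on its behalf."""

    def __init__(self, home):
        self.home = home

    def get(self, key):
        return self.home.get(key)


s1 = Server("s1", {"a": 1})
s2 = Server("s2", {"b": 2})
s1.peers, s2.peers = [s2], [s1]

heavy = HeavyClient({"a": s1, "b": s2})
light = LightClient(s1)
```

Both clients obtain the same answers; what differs is where the knowledge of data placement lives, at the client in the first case and at the servers in the second.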

From a datalogical perspective, client/server DBMSs provide the same view of data as do peer-to-peer systems that we discuss next. That is, they give the user the appearance of a logically single database, while at the physical level data may be distributed. Thus the primary distinction between client/server systems and peer-to-peer ones is not in the level of transparency that is provided to the users and applications, but in the architectural paradigm that is used to realize this level of transparency.

Client/server can be naturally extended to provide for a more efficient function distribution on different kinds of servers: client servers run the user interface (e.g., web servers), application servers run application programs, and database servers run database management functions. This leads to the present trend in three-tier distributed system architecture, where sites are organized as specialized servers rather than as general-purpose computers.

The original idea, which is to offload the database management functions to a special server, dates back to the early 1970s [Canaday et al., 1974]. At the time, the computer on which the database system was run was called the database machine, database computer, or backend computer, while the computer that ran the applications was called the host computer. More recent terms for these are the database server and application server, respectively. Figure 1.12 illustrates a simple view of the database server approach, with application servers connected to one database server via a communication network.

The database server approach, as an extension of the classical client/server architecture, has several potential advantages. First, the single focus on data management makes possible the development of specific techniques for increasing data reliability and availability, e.g., using parallelism. Second, the overall performance of database management can be significantly enhanced by the tight integration of the database system and a dedicated database operating system. Finally, a database server can also exploit recent hardware architectures, such as multiprocessors or clusters of PC servers, to enhance both performance and data availability.

Although these advantages are significant, they can be offset by the overhead introduced by the additional communication between the application and the data servers. This is an issue, of course, in classical client/server systems as well, but in this case there is an additional layer of communication to worry about. The communication cost can be amortized only if the server interface is sufficiently high level to allow the expression of complex queries involving intensive data processing.

The application server approach (indeed, an n-tier distributed approach) can be extended by the introduction of multiple database servers and multiple application servers (Figure 1.13), as can be done in classical client/server architectures. In this case, each application server is typically dedicated to one or a few applications, while database servers operate in the multiple server fashion discussed above.

Fig. 1.12 Database Server Approach (clients connected through a network to an application server, which is connected through a network to a database server)

Fig. 1.13 Distributed Database Servers (clients connected through a network to multiple application servers, which are connected through a network to multiple database servers)

1.7.9 Peer-to-Peer Systems

If the term “client/server” is loaded with different interpretations, “peer-to-peer” is even worse as its meaning has changed and evolved over the years. As noted earlier, the early works on distributed DBMSs all focused on peer-to-peer architectures where there was no differentiation between the functionality of each site in the system4. After a decade of popularity of client/server computing, peer-to-peer has made a comeback in the last few years (primarily spurred by file sharing applications) and some have even positioned peer-to-peer data management as an alternative to distributed DBMSs. While this may be a stretch, modern peer-to-peer systems have two important differences from their earlier relatives. The first is the massive distribution in current systems. While in the early days we focused on a few (perhaps at most tens of) sites, current systems consider thousands of sites. The second is the inherent heterogeneity of every aspect of the sites and their autonomy. While this has always been a concern of distributed databases, as discussed earlier, coupled with massive distribution, site heterogeneity and autonomy take on an added significance, disallowing some of the approaches from consideration.

Discussing peer-to-peer database systems within this backdrop poses real challenges; the unique issues of database management over the “modern” peer-to-peer architectures are still being investigated. What we choose to do, in this book, is to initially focus on the classical meaning of peer-to-peer (the same functionality of each site), since the principles and fundamental techniques of these systems are very similar to those of client/server systems, and discuss the modern peer-to-peer database issues in a separate chapter (Chapter 16).

Let us start the description of the architecture by looking at the data organizational view. We first note that the physical data organization on each machine may be, and probably is, different. This means that there needs to be an individual internal schema definition at each site, which we call the local internal schema (LIS). The enterprise view of the data is described by the global conceptual schema (GCS), which is global because it describes the logical structure of the data at all the sites.

To handle data fragmentation and replication, the logical organization of data at each site needs to be described. Therefore, there needs to be a third layer in the architecture, the local conceptual schema (LCS). In the architectural model we have chosen, then, the global conceptual schema is the union of the local conceptual schemas. Finally, user applications and user access to the database are supported by external schemas (ESs), defined as being above the global conceptual schema.

This architecture model, depicted in Figure 1.14, provides the levels of transparency discussed earlier. Data independence is supported since the model is an extension of ANSI/SPARC, which provides such independence naturally. Location and replication transparencies are supported by the definition of the local and global conceptual schemas and the mapping in between. Network transparency, on the other hand, is supported by the definition of the global conceptual schema. The user

4 In fact, in the first edition of this book which appeared in early 1990 and whose writing was completed in 1989, there wasn’t a single mention of the term “client/server”.


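Fig. 1.14 Distributed Database Reference Architecture (layers, top to bottom: external schemas ES1, ES2, ..., ESn; global conceptual schema GCS; local conceptual schemas LCS1, LCS2, ..., LCSn; local internal schemas LIS1, LIS2, ..., LISn)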

queries data irrespective of its location or of which local component of the distributed database system will service it. As mentioned before, the distributed DBMS translates global queries into a group of local queries, which are executed by distributed DBMS components at different sites that communicate with one another.

The detailed components of a distributed DBMS are shown in Figure 1.15. One component handles the interaction with users, and another deals with the storage. The first major component, which we call the user processor, consists of four elements:

1. The user interface handler is responsible for interpreting user commands as they come in, and formatting the result data as it is sent to the user.

2. The semantic data controller uses the integrity constraints and authorizations that are defined as part of the global conceptual schema to check if the user query can be processed. This component, which is studied in detail in Chapter 5, is also responsible for authorization and other functions.

3. The global query optimizer and decomposer determines an execution strategy to minimize a cost function, and translates the global queries into local ones using the global and local conceptual schemas as well as the global directory. The global query optimizer is responsible, among other things, for generating the best strategy to execute distributed join operations. These issues are discussed in Chapters 6 through 8.

4. The distributed execution monitor coordinates the distributed execution of the user request. The execution monitor is also called the distributed transaction manager. In executing queries in a distributed fashion, the execution monitors at various sites may, and usually do, communicate with one another.

The second major component of a distributed DBMS is the data processor and consists of three elements:


Fig. 1.15 Components of a Distributed DBMS (figure labels: USER, exchanging user requests and system responses with the system; USER PROCESSOR: User Interface Handler, Semantic Data Controller, Global Query Optimizer, Global Execution Monitor; DATA PROCESSOR: Local Query Processor, Local Recovery Manager, Runtime Support Processor; schemas and logs: External Schema, Global Conceptual Schema, Local Conceptual Schema, System Log, Local Internal Schema)


1. The local query optimizer, which actually acts as the access path selector, is responsible for choosing the best access path (see footnote 5) to access any data item (touched upon briefly in Chapter 8).

2. The local recovery manager is responsible for making sure that the local database remains consistent even when failures occur (Chapter 12).

3. The run-time support processor physically accesses the database according to the physical commands in the schedule generated by the query optimizer. The run-time support processor is the interface to the operating system and contains the database buffer (or cache) manager, which is responsible for maintaining the main memory buffers and managing the data accesses.

It is important to note, at this point, that our use of the terms “user processor” and “data processor” does not imply a functional division similar to client/server systems. These divisions are merely organizational and there is no suggestion that they should be placed on different machines. In peer-to-peer systems, one expects to find both the user processor modules and the data processor modules on each machine. However, there have been suggestions to separate “query-only sites” in a system from full-functionality ones. In this case, the former sites would only need to have the user processor.

In client/server systems where there is a single server, the client has the user interface manager while the server has all of the data processor functionality as well as the semantic data controller; there is no need for the global query optimizer or the global execution monitor. If there are multiple servers and the home server approach described in the previous section is employed, then each server hosts all of the modules except the user interface manager, which resides on the client. If, however, each client is expected to contact individual servers on its own, then, most likely, the clients will host the full user processor functionality while the data processor functionality resides in the servers.

1.7.10 Multidatabase System Architecture

Multidatabase systems (MDBS) represent the case where individual DBMSs (whether distributed or not) are fully autonomous and have no concept of cooperation; they may not even “know” of each other’s existence or how to talk to each other. Our focus is, naturally, on distributed MDBSs, which is what the term will refer to in the remainder. In most current literature, one finds the term data integration system used instead. We avoid using that term since data integration systems consider non-database data sources as well. Our focus is strictly on databases. We discuss these systems and their relationship to database integration in Chapter 4. We note, however, that there is considerable variability of the use of the term “multidatabase” in literature. In this

5 The term access path refers to the data structures and the algorithms that are used to access the data. A typical access path, for example, is an index on one or more attributes of a relation.


book, we use it consistently as defined above, which may deviate from its use in some of the existing literature.

The differences in the level of autonomy between the distributed multi-DBMSs and distributed DBMSs are also reflected in their architectural models. The fundamental difference relates to the definition of the global conceptual schema. In the case of logically integrated distributed DBMSs, the global conceptual schema defines the conceptual view of the entire database, while in the case of distributed multi-DBMSs, it represents only the collection of some of the local databases that each local DBMS wants to share. The individual DBMSs may choose to make some of their data available for access by others (i.e., federated database architectures) by defining an export schema [Heimbigner and McLeod, 1985]. Thus the definition of a global database is different in MDBSs than in distributed DBMSs. In the latter, the global database is equal to the union of local databases, whereas in the former it is only a (possibly proper) subset of the same union. In an MDBS, the GCS (which is also called a mediated schema) is defined by integrating either the external schemas of local autonomous databases or (possibly parts of their) local conceptual schemas.
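The contrast between the two notions of a global database can be sketched with plain sets. This is only an illustration under stated assumptions: the site names, the relations PAYROLL and AUDIT, and the helper functions are hypothetical and do not come from the text.

```python
# Hypothetical local databases: each maps a relation name to its set of tuples.
local_dbs = {
    "site1": {"EMP": {("E1", "J. Doe")}, "PAYROLL": {("E1", 40000)}},
    "site2": {"PROJ": {("P1", "Instrumentation")}, "AUDIT": {("P1", "ok")}},
}

# Hypothetical export schemas: the relations each autonomous DBMS agrees to share.
export_schemas = {"site1": {"EMP"}, "site2": {"PROJ"}}


def union_of_local(dbs):
    """Global database of a logically integrated distributed DBMS."""
    return {
        (rel, t)
        for db in dbs.values()
        for rel, tuples in db.items()
        for t in tuples
    }


def union_of_exports(dbs, exports):
    """Global database of an MDBS: only exported relations are visible."""
    return {
        (rel, t)
        for site, db in dbs.items()
        for rel, tuples in db.items()
        if rel in exports.get(site, set())
        for t in tuples
    }


distributed_global = union_of_local(local_dbs)
mdbs_global = union_of_exports(local_dbs, export_schemas)
```

Here `mdbs_global` is a proper subset of `distributed_global` because each site withholds one relation; if every site exported everything, the two sets would coincide.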

Furthermore, users of a local DBMS define their own views on the local database and do not need to change their applications if they do not want to access data from another database. This is again an issue of autonomy.

Designing the global conceptual schema in multidatabase systems involves the integration of either the local conceptual schemas or the local external schemas (Figure 1.16). A major difference between the design of the GCS in multi-DBMSs and in logically integrated distributed DBMSs is that in the former the mapping is from local conceptual schemas to a global schema. In the latter, however, mapping is in the reverse direction. As we discuss in Chapters 3 and 4, this is because the design in the former is usually a bottom-up process, whereas in the latter it is usually a top-down procedure. Furthermore, if heterogeneity exists in the multidatabase system, a canonical data model has to be found to define the GCS.

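Fig. 1.16 MDBS Architecture with a GCS (layers, top to bottom: global external schemas GES1, GES2, GES3, ...; global conceptual schema GCS; local external schemas LES11, LES12, LES13, ..., LESn1, LESn2, ..., LESnm; local conceptual schemas LCS1, ..., LCSn; local internal schemas LIS1, ..., LISn)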


Once the GCS has been designed, views over the global schema can be defined for users who require global access. It is not necessary for the global external schemas (GES) and the GCS to be defined using the same data model and language; whether they are or not determines whether the system is homogeneous or heterogeneous.

If heterogeneity exists in the system, then two implementation alternatives exist: unilingual and multilingual. A unilingual multi-DBMS requires the users to utilize possibly different data models and languages when both a local database and the global database are accessed. The identifying characteristic of unilingual systems is that any application that accesses data from multiple databases must do so by means of an external view that is defined on the global conceptual schema. This means that the user of the global database is effectively a different user than those who access only a local database, utilizing a different data model and a different data language.

An alternative is multilingual architecture, where the basic philosophy is to permit each user to access the global database (i.e., data from other databases) by means of an external schema, defined using the language of the user’s local DBMS. The GCS definition is quite similar in the multilingual architecture and the unilingual approach, the major difference being the definition of the external schemas, which are described in the language of the external schemas of the local database. Assuming that the definition is purely local, a query issued according to a particular schema is handled exactly as any query in the centralized DBMSs. Queries against the global database are made using the language of the local DBMS, but they generally require some processing to be mapped to the global conceptual schema.

The component-based architectural model of a multi-DBMS is significantly different from that of a distributed DBMS. The fundamental difference is the existence of full-fledged DBMSs, each of which manages a different database. The MDBS provides a layer of software that runs on top of these individual DBMSs and provides users with the facilities of accessing various databases (Figure 1.17). Note that in a distributed MDBS, the multi-DBMS layer may run on multiple sites or there may be a central site where those services are offered. Also note that as far as the individual DBMSs are concerned, the MDBS layer is simply another application that submits requests and receives answers.

A popular implementation architecture for MDBSs is the mediator/wrapper approach (Figure 1.18) [Wiederhold, 1992]. A mediator “is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications.” Thus, each mediator performs a particular function with clearly defined interfaces. Using this architecture to implement an MDBS, each module in the multi-DBMS layer of Figure 1.17 is realized as a mediator. Since mediators can be built on top of other mediators, it is possible to construct a layered implementation. In mapping this architecture to the datalogical view of Figure 1.16, the mediator level implements the GCS. It is this level that handles user queries over the GCS and performs the MDBS functionality.

The mediators typically operate using a common data model and interface language. To deal with potential heterogeneities of the source DBMSs, wrappers are implemented whose task is to provide a mapping between a source DBMS’s view and the mediators’ view. For example, if the source DBMS is a relational one, but the


Fig. 1.17 Components of an MDBS (figure labels: USER, exchanging user requests and system responses with the Multi-DBMS Layer, which runs on top of the individual DBMSs)

mediator implementations are object-oriented, the required mappings are established by the wrappers. The exact role and function of mediators differ from one implementation to another. In some cases, thin mediators have been implemented that do nothing more than translation. In other cases, wrappers take over the execution of some of the query functionality.
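A minimal sketch of the mediator/wrapper idea follows, assuming a common data model in which every record is a dictionary. The class names, the `scan` and `query` methods, and the sample key-value source are all hypothetical; they illustrate the division of labor and are not the interface of any actual system.

```python
class RelationalWrapper:
    """Hypothetical wrapper: exposes the rows of a relational source as dicts."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def scan(self):
        # Map the source's positional tuples to the mediators' common model.
        return [dict(zip(self.columns, row)) for row in self.rows]


class KeyValueWrapper:
    """Hypothetical wrapper over a key-value source."""

    def __init__(self, key_name, store):
        self.key_name = key_name
        self.store = store

    def scan(self):
        # Fold the key into each record so it looks like any other attribute.
        return [{self.key_name: k, **v} for k, v in self.store.items()]


class Mediator:
    """Answers queries over the common model by consulting each source below it."""

    def __init__(self, sources):
        self.sources = sources  # wrappers or other mediators

    def query(self, predicate):
        return [rec for src in self.sources for rec in src.scan() if predicate(rec)]

    def scan(self):
        # Offering scan() lets a mediator serve as a source for another mediator.
        return self.query(lambda rec: True)


mediator = Mediator([
    RelationalWrapper(("ENO", "ENAME"), [("E1", "J. Doe"), ("E2", "M. Smith")]),
    KeyValueWrapper("ENO", {"E9": {"ENAME": "T. Example"}}),  # made-up record
])
result = mediator.query(lambda rec: rec["ENO"] in {"E1", "E9"})
layered = Mediator([mediator])  # a mediator built on top of another mediator
```

The wrappers hide the difference between the two source formats, so the mediator only ever sees dictionaries. Pushing part of the predicate down into a wrapper would correspond to the case where wrappers take over some of the query functionality.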

One can view the collection of mediators as a middleware layer that provides services above the source systems. Middleware is a topic that has been the subject of significant study in the past decade and very sophisticated middleware systems have been developed that provide advanced services for development of distributed applications. The mediators that we discuss only represent a subset of the functionality provided by these systems.

1.8 Bibliographic Notes

There are not many books on distributed DBMSs. Ceri and Pelagatti’s book [Ceri and Pelagatti, 1983] was the first on this topic though it is now dated. The book by Bell and Grimson [Bell and Grimson, 1992] also provides an overview of the topics addressed here. In addition, almost every database book now has a chapter on distributed DBMSs. A brief overview of the technology is provided in [Özsu and Valduriez, 1997]. Our papers [Özsu and Valduriez, 1994, 1991] provide discussions of the state-of-the-art at the time they were written.

Database design is discussed in an introductory manner in [Levin and Morgan, 1975] and more comprehensively in [Ceri et al., 1987]. A survey of the file distribution algorithms is given in [Dowdy and Foster, 1982]. Directory management has not been considered in detail in the research community, but general techniques can be found in [Chu and Nahouraii, 1975] and [Chu, 1976]. A survey of query processing


Fig. 1.18 Mediator/Wrapper Architecture (figure labels: USER, exchanging user requests and system responses with a layer of mediators; mediators stacked on other mediators; wrappers between the mediators and the source DBMSs)

techniques can be found in [Sacco and Yao, 1982]. Concurrency control algorithms are reviewed in [Bernstein and Goodman, 1981] and [Bernstein et al., 1987]. Deadlock management has also been the subject of extensive research; an introductory paper is [Isloor and Marsland, 1980] and a widely quoted paper is [Obermarck, 1982]. For deadlock detection, good surveys are [Knapp, 1987] and [Elmagarmid, 1986]. Reliability is one of the issues discussed in [Gray, 1979], which is one of the landmark papers in the field. Other important papers on this topic are [Verhofstadt, 1978] and [Härder and Reuter, 1983]. [Gray, 1979] is also the first paper discussing the issues of operating system support for distributed databases; the same topic is addressed in [Stonebraker, 1981]. Unfortunately, both papers emphasize centralized database systems.

There have been a number of architectural framework proposals. Some of the interesting ones include Schreiber’s quite detailed extension of the ANSI/SPARC framework which attempts to accommodate heterogeneity of the data models [Schreiber, 1977], and the proposal by Mohan and Yeh [Mohan and Yeh, 1978]. As expected, these date back to the early days of the introduction of distributed DBMS technology. The detailed component-wise system architecture given in Figure 1.15 derives from


[Rahimi, 1987]. An alternative to the classification that we provide in Figure 1.10 can be found in [Sheth and Larson, 1990].

Most of the discussion on architectural models for multi-DBMSs is from [Özsu and Barker, 1990]. Other architectural discussions on multi-DBMSs are given in [Gligor and Luckenbaugh, 1984], [Litwin, 1988], and [Sheth and Larson, 1990]. All of these papers provide overview discussions of various prototype and commercial systems. An excellent overview of heterogeneous and federated database systems is [Sheth and Larson, 1990].

Chapter 2 Background

As indicated in the previous chapter, there are two technological bases for distributed database technology: database management and computer networks. In this chapter, we provide an overview of the concepts in these two fields that are more important from the perspective of distributed database technology.

2.1 Overview of Relational DBMS

The aim of this section is to define the terminology and framework used in subsequent chapters, since most of the distributed database technology has been developed using the relational model. In later chapters, when appropriate, we introduce other models. Our focus here is on the language and operators.

2.1.1 Relational Database Concepts

A database is a structured collection of data related to some real-life phenomena that we are trying to model. A relational database is one where the database structure is in the form of tables. Formally, a relation R defined over n sets D1, D2, ..., Dn (not necessarily distinct) is a set of n-tuples (or simply tuples) 〈d1, d2, ..., dn〉 such that d1 ∈ D1, d2 ∈ D2, ..., dn ∈ Dn.

Example 2.1. As an example we use a database that models an engineering company. The entities to be modeled are the employees (EMP) and projects (PROJ). For each employee, we would like to keep track of the employee number (ENO), name (ENAME), title in the company (TITLE), salary (SAL), identification number of the project(s) the employee is working on (PNO), responsibility within the project (RESP), and duration of the assignment to the project (DUR) in months. Similarly, for each project we would like to store the project number (PNO), the project name (PNAME), and the project budget (BUDGET).

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_2, © Springer Science+Business Media, LLC 2011


EMP
ENO  ENAME  TITLE  SAL  PNO  RESP  DUR

PROJ
PNO  PNAME  BUDGET

Fig. 2.1 Sample Database Scheme

The relation schemas for this database can be defined as follows:

EMP(ENO, ENAME, TITLE, SAL, PNO, RESP, DUR)

PROJ(PNO, PNAME, BUDGET)

In relation scheme EMP, there are seven attributes: ENO, ENAME, TITLE, SAL, PNO, RESP, DUR. The values of ENO come from the domain of all valid employee numbers, say D1, the values of ENAME come from the domain of all valid names, say D2, and so on. Note that each attribute of each relation does not have to come from a distinct domain. Various attributes within a relation or from a number of relations may be defined over the same domain.

The key of a relation scheme is a minimal non-empty subset of its attributes such that the values of the attributes comprising the key uniquely identify each tuple of the relation. The attributes that make up a key are called prime attributes. A superset of a key is usually called a superkey. Thus in our example the key of PROJ is PNO, and that of EMP is the set (ENO, PNO). Each relation has at least one key. Sometimes, there may be more than one possibility for the key. In such cases, each alternative is considered a candidate key, and one of the candidate keys is chosen as the primary key, which we denote by underlining. The number of attributes of a relation defines its degree, whereas the number of tuples of the relation defines its cardinality.
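The definitions of superkey and key can be checked mechanically against a relation instance. The sketch below uses the EMP instance of Figure 2.2; the function names `is_superkey` and `is_key` are hypothetical. Note the assumption it rests on: a key is a property of the relation scheme, so an instance can only refute a proposed key, and attribute sets may appear unique in a small instance by coincidence.

```python
from itertools import combinations

ATTRS = ("ENO", "ENAME", "TITLE", "SAL", "PNO", "RESP", "DUR")

# EMP instance as depicted in Figure 2.2.
EMP = [
    ("E1", "J. Doe", "Elect. Eng.", 40000, "P1", "Manager", 12),
    ("E2", "M. Smith", "Analyst", 34000, "P1", "Analyst", 24),
    ("E2", "M. Smith", "Analyst", 34000, "P2", "Analyst", 6),
    ("E3", "A. Lee", "Mech. Eng.", 27000, "P3", "Consultant", 10),
    ("E3", "A. Lee", "Mech. Eng.", 27000, "P4", "Engineer", 48),
    ("E4", "J. Miller", "Programmer", 24000, "P2", "Programmer", 18),
    ("E5", "B. Casey", "Syst. Anal.", 34000, "P2", "Manager", 24),
    ("E6", "L. Chu", "Elect. Eng.", 40000, "P4", "Manager", 48),
    ("E7", "R. Davis", "Mech. Eng.", 27000, "P3", "Engineer", 36),
    ("E8", "J. Jones", "Syst. Anal.", 34000, "P3", "Manager", 40),
]


def is_superkey(rows, attrs, candidate):
    """True if the candidate attributes take distinct values on every tuple."""
    positions = [attrs.index(a) for a in candidate]
    projected = {tuple(row[p] for p in positions) for row in rows}
    return len(projected) == len(set(rows))


def is_key(rows, attrs, candidate):
    """True if candidate is a superkey and no proper non-empty subset is one."""
    if not candidate or not is_superkey(rows, attrs, candidate):
        return False
    return not any(
        is_superkey(rows, attrs, subset)
        for size in range(1, len(candidate))
        for subset in combinations(candidate, size)
    )
```

On this instance, `("ENO", "PNO")` passes `is_key`, while `("ENO", "PNO", "RESP")` is a superkey but not a key because it is not minimal.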

In tabular form, the example database consists of two tables, as shown in Figure 2.1. The columns of the tables correspond to the attributes of the relations; if there were any information entered as the rows, they would correspond to the tuples. The empty table, showing the structure of the table, corresponds to the relation schema; when the table is filled with rows, it corresponds to a relation instance. Since the information within a table varies over time, many instances can be generated from one relation scheme. Note that from now on, the term relation refers to a relation instance. In Figure 2.2 we depict instances of the two relations that are defined in Figure 2.1.

An attribute value may be undefined. This lack of definition may have various interpretations, the most common being “unknown” or “not applicable”. This special value of the attribute is generally referred to as the null value. The representation of a null value must be different from any other domain value, and special care should be given to differentiate it from zero. For example, value “0” for attribute DUR is


EMP
ENO  ENAME      TITLE        SAL    PNO  RESP        DUR
E1   J. Doe     Elect. Eng.  40000  P1   Manager     12
E2   M. Smith   Analyst      34000  P1   Analyst     24
E2   M. Smith   Analyst      34000  P2   Analyst     6
E3   A. Lee     Mech. Eng.   27000  P3   Consultant  10
E3   A. Lee     Mech. Eng.   27000  P4   Engineer    48
E4   J. Miller  Programmer   24000  P2   Programmer  18
E5   B. Casey   Syst. Anal.  34000  P2   Manager     24
E6   L. Chu     Elect. Eng.  40000  P4   Manager     48
E7   R. Davis   Mech. Eng.   27000  P3   Engineer    36
E8   J. Jones   Syst. Anal.  34000  P3   Manager     40

PROJ
PNO  PNAME              BUDGET
P1   Instrumentation    150000
P2   Database Develop.  135000
P3   CAD/CAM            250000
P4   Maintenance        310000

Fig. 2.2 Sample Database Instance

known information (e.g., in the case of a newly hired employee), while value “null” for DUR means unknown. Supporting null values is an important feature necessary to deal with maybe queries [Codd, 1979].

2.1.2 Normalization

The aim of normalization is to eliminate various anomalies (or undesirable aspects) of a relation in order to obtain “better” relations. The following four problems might exist in a relation scheme:

1. Repetition anomaly. Certain information may be repeated unnecessarily. Consider, for example, the EMP relation in Figure 2.2. The name, title, and salary of an employee are repeated for each project on which this person serves. This is obviously a waste of storage and is contrary to the spirit of databases.


2. Update anomaly. As a consequence of the repetition of data, performing updates may be troublesome. For example, if the salary of an employee changes, multiple tuples have to be updated to reflect this change.

3. Insertion anomaly. It may not be possible to add new information to the database. For example, when a new employee joins the company, we cannot add personal information (name, title, salary) to the EMP relation unless an appointment to a project is made. This is because the key of EMP includes the attribute PNO, and null values cannot be part of the key.

4. Deletion anomaly. This is the converse of the insertion anomaly. If an employee works on only one project, and that project is terminated, it is not possible to delete the project information from the EMP relation. To do so would result in deleting the only tuple about the employee, thereby resulting in the loss of personal information we might want to retain.

Normalization transforms arbitrary relation schemes into ones without these problems. A relation with one or more of the above mentioned anomalies is split into two or more relations of a higher normal form. A relation is said to be in a normal form if it satisfies the conditions associated with that normal form. Codd initially defined the first, second, and third normal forms (1NF, 2NF, and 3NF, respectively). Boyce and Codd [Codd, 1974] later defined a modified version of the third normal form, commonly known as the Boyce-Codd normal form (BCNF). This was followed by the definition of the fourth (4NF) [Fagin, 1977] and fifth normal forms (5NF) [Fagin, 1979].

The normal forms are based on certain dependency structures. BCNF and lower normal forms are based on functional dependencies (FDs), 4NF is based on multivalued dependencies, and 5NF is based on projection-join dependencies. We only introduce functional dependency, since that is the only relevant one for the example we are considering.

Let R be a relation defined over the set of attributes A = {A1, A2, ..., An} and let X ⊂ A, Y ⊂ A. If for each value of X in R, there is only one associated Y value, we say that “X functionally determines Y” or that “Y is functionally dependent on X.” Notationally, this is shown as X → Y. The key of a relation functionally determines the non-key attributes of the same relation.

Example 2.2. For example, in the PROJ relation of Example 2.1 (one can observe these in Figure 2.2 as well), the valid FD is

PNO → (PNAME, BUDGET)

In the EMP relation we have

(ENO, PNO) → (ENAME, TITLE, SAL, RESP, DUR)

This last FD is not the only FD in EMP, however. If each employee is given a unique employee number, we can write

ENO → (ENAME, TITLE, SAL)

(ENO, PNO) → (RESP, DUR)

It may also happen that the salary for a given position is fixed, which gives rise to the FD

TITLE → SAL
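The definition of a functional dependency translates directly into a check over a relation instance: X → Y is violated as soon as two tuples agree on X but differ on Y. The following sketch applies it to the first five EMP tuples of Figure 2.2; the function name `fd_holds` is hypothetical, and, as with keys, an instance can refute an FD but cannot prove that it holds for the scheme.

```python
ATTRS = ("ENO", "ENAME", "TITLE", "SAL", "PNO", "RESP", "DUR")

# First five EMP tuples of Figure 2.2, turned into attribute -> value mappings.
EMP = [
    dict(zip(ATTRS, row))
    for row in [
        ("E1", "J. Doe", "Elect. Eng.", 40000, "P1", "Manager", 12),
        ("E2", "M. Smith", "Analyst", 34000, "P1", "Analyst", 24),
        ("E2", "M. Smith", "Analyst", 34000, "P2", "Analyst", 6),
        ("E3", "A. Lee", "Mech. Eng.", 27000, "P3", "Consultant", 10),
        ("E3", "A. Lee", "Mech. Eng.", 27000, "P4", "Engineer", 48),
    ]
]


def fd_holds(rows, lhs, rhs):
    """Check whether lhs -> rhs is satisfied by the given tuples."""
    seen = {}  # value of lhs -> the single rhs value associated with it so far
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # same X value, different Y value
    return True
```

For instance, `fd_holds(EMP, ["ENO"], ["ENAME", "TITLE", "SAL"])` is satisfied, whereas `fd_holds(EMP, ["ENO"], ["RESP"])` fails because employee E3 has two different responsibilities.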

We do not discuss the normal forms or the normalization algorithms in detail; these can be found in database textbooks. The following example shows the result of normalization on the sample database that we introduced in Example 2.1.

Example 2.3. The following set of relation schemes is normalized into BCNF with respect to the functional dependencies defined over the relations.

EMP(ENO, ENAME, TITLE)

PAY(TITLE, SAL)

PROJ(PNO, PNAME, BUDGET)

ASG(ENO, PNO, RESP, DUR)

The normalized instances of these relations are shown in Figure 2.3.
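One way to see what this decomposition achieves is to split a few tuples of the original seven-attribute EMP relation into EMP, PAY, and ASG and then join them back. The sketch below does this for four tuples taken from Figure 2.2; the variable names are hypothetical, and the example only illustrates that no information is lost for this sample rather than proving the decomposition lossless in general.

```python
# Four tuples of the original EMP relation (ENO, ENAME, TITLE, SAL, PNO, RESP, DUR).
ROWS = [
    ("E1", "J. Doe", "Elect. Eng.", 40000, "P1", "Manager", 12),
    ("E3", "A. Lee", "Mech. Eng.", 27000, "P3", "Consultant", 10),
    ("E3", "A. Lee", "Mech. Eng.", 27000, "P4", "Engineer", 48),
    ("E6", "L. Chu", "Elect. Eng.", 40000, "P4", "Manager", 48),
]

# Decompose by projecting onto the attributes of each normalized scheme.
emp = {(eno, ename, title) for eno, ename, title, *_ in ROWS}
pay = {(title, sal) for _, _, title, sal, *_ in ROWS}
asg = {(eno, pno, resp, dur) for eno, _, _, _, pno, resp, dur in ROWS}

# Join the three relations back on TITLE and ENO.
rejoined = {
    (eno, ename, title, sal, pno, resp, dur)
    for (eno, ename, title) in emp
    for (pay_title, sal) in pay
    if pay_title == title
    for (asg_eno, pno, resp, dur) in asg
    if asg_eno == eno
}
```

After decomposition the name and title of E3 are stored once instead of twice, and each salary is stored once per title, which is exactly the repetition the anomalies above stem from.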

2.1.3 Relational Data Languages

Data manipulation languages developed for the relational model (commonly called query languages) fall into two fundamental groups: relational algebra languages and relational calculus languages. The difference between them is based on how the user query is formulated. The relational algebra is procedural in that the user is expected to specify, using certain high-level operators, how the result is to be obtained. The relational calculus, on the other hand, is non-procedural; the user only specifies the relationships that should hold in the result. Both of these languages were originally proposed by Codd [1970], who also proved that they were equivalent in terms of expressive power [Codd, 1972].

2.1.3.1 Relational Algebra

Relational algebra consists of a set of operators that operate on relations. Each operator takes one or two relations as operands and produces a result relation, which, in turn, may be an operand to another operator. These operations permit the querying and updating of a relational database.


EMP
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E7   R. Davis   Mech. Eng.
E8   J. Jones   Syst. Anal.

ASG
ENO  PNO  RESP        DUR
E1   P1   Manager     12
E2   P1   Analyst     24
E2   P2   Analyst     6
E3   P3   Consultant  10
E3   P4   Engineer    48
E4   P2   Programmer  18
E5   P2   Manager     24
E6   P4   Manager     48
E7   P3   Engineer    36
E8   P3   Manager     40

PROJ
PNO  PNAME              BUDGET
P1   Instrumentation    150000
P2   Database Develop.  135000
P3   CAD/CAM            250000
P4   Maintenance        310000

PAY
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000
Mech. Eng.   27000
Programmer   24000

Fig. 2.3 Normalized Relations

There are five fundamental relational algebra operators and five others that can be defined in terms of these. The fundamental operators are selection, projection, union, set difference, and Cartesian product. The first two of these operators are unary operators, and the last three are binary operators. The additional operators that can be defined in terms of these fundamental operators are intersection, θ-join, natural join, semijoin and division. In practice, relational algebra is extended with operators for grouping or sorting the results, and for performing arithmetic and aggregate functions. Other operators, such as outer join and transitive closure, are sometimes used as well to provide additional functionality. We only discuss the more common operators.

The operands of some of the binary operators should be union compatible. Two relations R and S are union compatible if and only if they are of the same degree and the i-th attribute of each is defined over the same domain. The second part of the definition holds, obviously, only when the attributes of a relation are identified by their relative positions within the relation and not by their names. If relative ordering of attributes is not important, it is necessary to replace the second part of the definition by the phrase “the corresponding attributes of the two relations should be defined over the same domain.” The correspondence is defined rather loosely here.

Many operator definitions refer to “formula”, which also appears in relational calculus expressions we discuss later. Thus, let us define precisely, at this point, what we mean by a formula. We define a formula within the context of first-order predicate


calculus (since we use that formalism later), and follow the notation of Gallaire et al. [1984]. First-order predicate calculus is based on a symbol alphabet that consists of (1) variables, constants, functions, and predicate symbols; (2) parentheses; (3) the logical connectors ∧ (and), ∨ (or), ¬ (not), → (implication), and ↔ (equivalence); and (4) quantifiers ∀ (for all) and ∃ (there exists). A term is either a constant or a variable. Recursively, if f is an n-ary function and t1, ..., tn are terms, f(t1, ..., tn) is also a term. An atomic formula is of the form P(t1, ..., tn), where P is an n-ary predicate symbol and the ti’s are terms. A well-formed formula (wff) can be defined recursively as follows: If wi and wj are wffs, then (wi), ¬(wi), (wi) ∧ (wj), (wi) ∨ (wj), (wi) → (wj), and (wi) ↔ (wj) are all wffs. Variables in a wff may be free or they may be bound by one of the two quantifiers.

Selection.

Selection produces a horizontal subset of a given relation. The subset consists of all the tuples that satisfy a formula (condition). The selection from a relation R is

σF(R)

where R is the relation and F is a formula.

The formula in the selection operation is called a selection predicate and is an atomic formula whose terms are of the form A θ c, where A is an attribute of R, c is a constant, and θ is one of the arithmetic comparison operators <, >, =, ≠, ≤, and ≥. The terms can be connected by the logical connectors ∧, ∨, and ¬. Furthermore, the selection predicate does not contain any quantifiers.

Example 2.4. Consider the relation EMP shown in Figure 2.3. The result of selecting those tuples for electrical engineers is shown in Figure 2.4.

σ TITLE="Elect. Eng."(EMP)

ENO  ENAME   TITLE
E1   J. Doe  Elect. Eng.
E6   L. Chu  Elect. Eng.

Fig. 2.4 Result of Selection
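Selection can be sketched as a filter over a list of tuples, with each term of the selection predicate built from an attribute, a comparison operator, and a constant. The helper names `term` and `select` and the `OPS` table are hypothetical; the data is the EMP relation of Figure 2.3.

```python
import operator

ATTRS = ("ENO", "ENAME", "TITLE")

# EMP relation of Figure 2.3 as attribute -> value mappings.
EMP = [
    dict(zip(ATTRS, row))
    for row in [
        ("E1", "J. Doe", "Elect. Eng."),
        ("E2", "M. Smith", "Syst. Anal."),
        ("E3", "A. Lee", "Mech. Eng."),
        ("E4", "J. Miller", "Programmer"),
        ("E5", "B. Casey", "Syst. Anal."),
        ("E6", "L. Chu", "Elect. Eng."),
        ("E7", "R. Davis", "Mech. Eng."),
        ("E8", "J. Jones", "Syst. Anal."),
    ]
]

# The comparison operators allowed in a term of the selection predicate.
OPS = {
    "<": operator.lt, ">": operator.gt, "=": operator.eq,
    "!=": operator.ne, "<=": operator.le, ">=": operator.ge,
}


def term(attr, theta, const):
    """Build a predicate for one term of the form: attr theta const."""
    compare = OPS[theta]
    return lambda row: compare(row[attr], const)


def select(relation, predicate):
    """Return the horizontal subset of tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


electrical = select(EMP, term("TITLE", "=", "Elect. Eng."))
```

Terms can be combined with ordinary Boolean logic, for example `lambda row: p(row) and not q(row)`, which mirrors connecting terms with ∧, ∨, and ¬.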


Projection.

Projection produces a vertical subset of a relation. The result relation contains only those attributes of the original relation over which projection is performed. Thus the degree of the result is less than or equal to the degree of the original relation.

The projection of relation R over attributes A and B is denoted as

ΠA,B(R)

Note that the result of a projection might contain tuples that are identical. In that case the duplicate tuples may be deleted from the result relation. It is possible to specify projection with or without duplicate elimination.

Example 2.5. The projection of relation PROJ shown in Figure 2.3 over attributes PNO and BUDGET is depicted in Figure 2.5.

Π PNO,BUDGET(PROJ)

PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000

Fig. 2.5 Result of Projection
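Projection, with and without duplicate elimination, can be sketched as follows. The function name `project` and its `distinct` flag are hypothetical; the data comes from the PROJ and EMP relations of Figure 2.3 (only ENO and TITLE are kept for EMP to keep the example short).

```python
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000},
]

EMP = [
    {"ENO": eno, "TITLE": title}
    for eno, title in [
        ("E1", "Elect. Eng."), ("E2", "Syst. Anal."), ("E3", "Mech. Eng."),
        ("E4", "Programmer"), ("E5", "Syst. Anal."), ("E6", "Elect. Eng."),
        ("E7", "Mech. Eng."), ("E8", "Syst. Anal."),
    ]
]


def project(relation, attrs, distinct=True):
    """Return the vertical subset of the relation over the given attributes."""
    rows = [tuple(row[a] for a in attrs) for row in relation]
    if distinct:
        # dict.fromkeys drops duplicates while keeping first-seen order.
        rows = list(dict.fromkeys(rows))
    return rows


budgets = project(PROJ, ["PNO", "BUDGET"])
titles = project(EMP, ["TITLE"])
titles_with_duplicates = project(EMP, ["TITLE"], distinct=False)
```

Projecting PROJ over PNO and BUDGET produces no duplicates because PNO is the key, whereas projecting EMP over TITLE collapses eight tuples into four once duplicates are eliminated.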

Union.

The union of two relations R and S (denoted as R ∪ S) is the set of all tuples that are in R, or in S, or in both. We should note that R and S should be union compatible. As in the case of projection, the duplicate tuples are normally eliminated. Union may be used to insert new tuples into an existing relation, where these tuples form one of the operand relations.

Set Difference.

The set difference of two relations R and S (R − S) is the set of all tuples that are in R but not in S. In this case, not only should R and S be union compatible, but the operation is also asymmetric (i.e., R − S ≠ S − R). This operation allows the

2.1 Overview of Relational DBMS 49

ENO ENAME EMP.TITLE PAY.TITLE SAL

E1 J. Doe Elect. Eng.

E1 J. Doe Elect. Eng.

E1 J. Doe Elect. Eng.

E1 J. Doe Elect. Eng.

Elect. Eng. 40000

Syst. Anal. 34000

Mech. Eng. 27000

Programmer 24000

E2 M. Smith Syst. Anal.

E2 M. Smith Syst. Anal.

E2 M. Smith Syst. Anal.

E2 M. Smith Syst. Anal.

Elect. Eng. 40000

Syst. Anal. 34000

Mech. Eng. 27000

Programmer 24000

Elect. Eng. 40000

Syst. Anal. 34000

Mech. Eng. 27000

Programmer 24000

Elect. Eng. 40000

Syst. Anal. 34000

Mech. Eng. 27000

Programmer 24000

E3 A. Lee Mech. Eng.

E3 A. Lee Mech. Eng.

E3 A. Lee Mech. Eng.

E3 A. Lee Mech. Eng.

E8 J. Jones Syst. Anal.

E8 J. Jones Syst. Anal.

E8 J. Jones Syst. Anal.

E8 J. Jones Syst. Anal.

EMP x PAY

≈≈≈≈≈≈

Fig. 2.6 Partial Result of Cartesian Product

deletion of tuples from a relation. Together with the union operation, we can perform modification of tuples by deletion followed by insertion.

Cartesian Product.

The Cartesian product of two relations R of degree k1 and S of degree k2 is the set of (k1 + k2)-tuples, where each result tuple is a concatenation of one tuple of R with one tuple of S, for all tuples of R and S. The Cartesian product of R and S is denoted as R×S.

It is possible that the two relations might have attributes with the same name. In this case the attribute names are prefixed with the relation name so as to maintain the uniqueness of the attribute names within a relation.

Example 2.6. Consider relations EMP and PAY in Figure 2.3. EMP × PAY is shown in Figure 2.6. Note that the attribute TITLE, which is common to both relations, appears twice, prefixed with the relation name.
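As a rough sketch of Example 2.6, the Cartesian product can be computed with `itertools.product`. The function name `cartesian_product` and the way schemas are passed around are assumptions made for illustration; the sketch also shows the prefixing of attribute names that are common to both relations.

```python
from itertools import product

EMP_SCHEMA = ("ENO", "ENAME", "TITLE")
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."),
    ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."),
    ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."),
    ("E8", "J. Jones", "Syst. Anal."),
]
PAY_SCHEMA = ("TITLE", "SAL")
PAY = [
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
]


def cartesian_product(r_name, r_schema, r, s_name, s_schema, s):
    """Hypothetical helper: R x S with common attribute names prefixed."""
    common = set(r_schema) & set(s_schema)

    def qualify(name, schema):
        return tuple(f"{name}.{a}" if a in common else a for a in schema)

    schema = qualify(r_name, r_schema) + qualify(s_name, s_schema)
    # Each result tuple concatenates one tuple of R with one tuple of S.
    rows = [tr + ts for tr, ts in product(r, s)]
    return schema, rows


schema, rows = cartesian_product("EMP", EMP_SCHEMA, EMP, "PAY", PAY_SCHEMA, PAY)
print(schema)
print(len(rows), "tuples of degree", len(rows[0]))
```

With 8 tuples in EMP and 4 in PAY, the result has 8 × 4 tuples of degree 3 + 2, which is why Figure 2.6 shows only part of it.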

Intersection.

Intersection of two relations R and S (R ∩ S) consists of the set of all tuples that are in both R and S. In terms of the basic operators, it can be specified as follows:

R ∩ S = R − (R − S)
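The identity can be checked quickly with Python sets. The two small relations below are hypothetical values chosen only to exercise the formula; they are not taken from Figure 2.3.

```python
# Two union-compatible relations (hypothetical sample tuples).
R = {("E1", "P1"), ("E2", "P1"), ("E2", "P2")}
S = {("E2", "P1"), ("E3", "P3")}

# Intersection expressed with set difference only: R - (R - S).
via_difference = R - (R - S)

print(via_difference)
print(via_difference == R & S)
```

R − S removes from R everything that is also in S, so subtracting that remainder from R leaves exactly the tuples the two relations share.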

θ-Join.

Join is a derivative of Cartesian product. There are various forms of join; the primary classification is between inner join and outer join. We first discuss inner join and its variants and then describe outer join.

The most general type of inner join is the θ-join. The θ-join of two relations R and S is denoted as

R ⋈_F S

where F is a formula specifying the join predicate. A join predicate is specified similar to a selection predicate, except that the terms are of the form R.A θ S.B, where A and B are attributes of R and S, respectively.

The join of two relations is equivalent to performing a selection, using the join predicate as the selection formula, over the Cartesian product of the two operand relations. Thus

R ⋈_F S = σ_F(R × S)

In the equivalence above, we should note that if F involves attributes of the two relations that are common to both of them, a projection is necessary to make sure that those attributes do not appear twice in the result.

Example 2.7. Let us consider the EMP relation in Figure 2.3 and add two more tuples as depicted in Figure 2.7(a). Then Figure 2.7(b) shows the θ-join of relations EMP and ASG over the join predicate EMP.ENO = ASG.ENO.

The same result could have been obtained as

EMP ⋈_{EMP.ENO = ASG.ENO} ASG = Π_{ENO, ENAME, TITLE, PNO, RESP, DUR}(σ_{EMP.ENO = ASG.ENO}(EMP × ASG))

Notice that the result does not have tuples E9 and E10 since these employees have not yet been assigned to a project. Furthermore, the information about some employees (e.g., E2 and E3) who have been assigned to multiple projects appears more than once in the result.

This example demonstrates a special case of θ-join which is called the equi-join. This is a case where the formula F only contains equality (=) as the arithmetic operator. It should be noted, however, that an equi-join does not have to be specified over a common attribute as the example above might suggest.
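The equivalence between a join and a selection over the Cartesian product can be sketched directly in Python. The helper `theta_join` is an assumption of this sketch, and ASG is reduced to its ENO, PNO, and RESP attributes for brevity.

```python
from itertools import product

# EMP as revised in Example 2.7: (ENO, ENAME, TITLE)
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."),
    ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."),
    ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."),
    ("E8", "J. Jones", "Syst. Anal."),
    ("E9", "A. Hsu", "Programmer"),
    ("E10", "T. Wong", "Syst. Anal."),
]
# ASG restricted to (ENO, PNO, RESP); DUR is left out of this sketch.
ASG = [
    ("E1", "P1", "Manager"), ("E2", "P1", "Analyst"), ("E2", "P2", "Analyst"),
    ("E3", "P3", "Consultant"), ("E3", "P4", "Engineer"),
    ("E4", "P2", "Programmer"), ("E5", "P2", "Manager"),
    ("E6", "P4", "Manager"), ("E7", "P3", "Engineer"), ("E8", "P3", "Manager"),
]


def theta_join(r, s, predicate):
    """Hypothetical helper: selection with the join predicate over R x S."""
    return [tr + ts for tr, ts in product(r, s) if predicate(tr, ts)]


# Equi-join on ENO; positions: 0..2 come from EMP, 3..5 from ASG.
joined = theta_join(EMP, ASG, lambda e, a: e[0] == a[0])
# Projection that keeps the common attribute ENO only once.
result = [(t[0], t[1], t[2], t[4], t[5]) for t in joined]
for t in result:
    print(t)
```

Any Boolean function of the two tuples can be passed as the predicate, so the same helper also covers θ-joins that use <, >, or ≠ instead of equality.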

EMP

ENO   ENAME       TITLE
E1    J. Doe      Elect. Eng.
E2    M. Smith    Syst. Anal.
E3    A. Lee      Mech. Eng.
E4    J. Miller   Programmer
E5    B. Casey    Syst. Anal.
E6    L. Chu      Elect. Eng.
E7    R. Davis    Mech. Eng.
E8    J. Jones    Syst. Anal.
E9    A. Hsu      Programmer
E10   T. Wong     Syst. Anal.

(a)

EMP ⋈_{EMP.ENO = ASG.ENO} ASG

ENO   ENAME       TITLE         PNO   RESP         DUR
E1    J. Doe      Elect. Eng.   P1    Manager      12
E2    M. Smith    Syst. Anal.   P1    Analyst      12
E2    M. Smith    Syst. Anal.   P2    Analyst      12
E3    A. Lee      Mech. Eng.    P3    Consultant   12
E3    A. Lee      Mech. Eng.    P4    Engineer     12
E4    J. Miller   Programmer    P2    Programmer   12
E5    B. Casey    Syst. Anal.   P2    Manager      12
E6    L. Chu      Elect. Eng.   P4    Manager      12
E7    R. Davis    Mech. Eng.    P3    Engineer     12
E8    J. Jones    Syst. Anal.   P3    Manager      12

(b)

Fig. 2.7 The Result of Join

A natural join is an equi-join of two relations over a specified attribute, more specifically, over attributes with the same domain. There is a difference, however, in that usually the attributes over which the natural join is performed appear only once in the result. A natural join is denoted as the join without the formula

R ⋈_A S

where A is the attribute common to both R and S. We should note here that the natural join attribute may have different names in the two relations; what is required is that they come from the same domain. In this case the join is denoted as

R _A⋈_B S

where B is the corresponding join attribute of S.

Example 2.8. The join of EMP and ASG in Example 2.7 is actually a natural join. Here is another example – Figure 2.8 shows the natural join of relations EMP and PAY in Figure 2.3 over the attribute TITLE.

Inner join requires the joined tuples from the two operand relations to satisfy the join predicate. In contrast, outer join does not have this requirement; tuples exist in the result relation regardless. Outer join can be of three types: left outer join (⟕), right outer join (⟖) and full outer join (⟗). In the left outer join, the tuples from the left operand relation are always in the result; in the right outer join, the tuples from the right operand are always in the result; and in the full outer join, tuples from both relations are always in the result. Outer join is useful in those cases where we wish to include information from one or both relations even if they do not satisfy the join predicate.

EMP ⋈_TITLE PAY

ENO   ENAME       TITLE         SAL
E1    J. Doe      Elect. Eng.   40000
E2    M. Smith    Syst. Anal.   34000
E3    A. Lee      Mech. Eng.    27000
E4    J. Miller   Programmer    24000
E5    B. Casey    Syst. Anal.   34000
E6    L. Chu      Elect. Eng.   40000
E7    R. Davis    Mech. Eng.    27000
E8    J. Jones    Syst. Anal.   34000

Fig. 2.8 The Result of Natural Join

Example 2.9. Consider the left outer join of EMP (as revised in Example 2.7) and ASG over attribute ENO (i.e., EMP ⟕_ENO ASG). The result is given in Figure 2.9. Notice that the information about two employees, E9 and E10, is included in the result, with “Null” values for the attributes from the ASG relation, even though they have not yet been assigned to a project.

EMP ⟕_ENO ASG

ENO   ENAME       TITLE         PNO    RESP         DUR
E1    J. Doe      Elect. Eng.   P1     Manager      12
E2    M. Smith    Syst. Anal.   P1     Analyst      12
E2    M. Smith    Syst. Anal.   P2     Analyst      12
E3    A. Lee      Mech. Eng.    P3     Consultant   12
E3    A. Lee      Mech. Eng.    P4     Engineer     12
E4    J. Miller   Programmer    P2     Programmer   12
E5    B. Casey    Syst. Anal.   P2     Manager      12
E6    L. Chu      Elect. Eng.   P4     Manager      12
E7    R. Davis    Mech. Eng.    P3     Engineer     12
E8    J. Jones    Syst. Anal.   P3     Manager      12
E9    A. Hsu      Programmer    Null   Null         Null
E10   T. Wong     Syst. Anal.   Null   Null         Null

Fig. 2.9 The Result of Left Outer Join
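A left outer join can be sketched as follows, using only a subset of the EMP and ASG tuples to keep the listing short. The helper `left_outer_join` and the use of Python's `None` to stand for Null are assumptions of the sketch.

```python
# Subset of EMP (ENO, ENAME, TITLE) and ASG (ENO, PNO, RESP).
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E9", "A. Hsu", "Programmer"),
    ("E10", "T. Wong", "Syst. Anal."),
]
ASG = [("E1", "P1", "Manager"), ("E2", "P1", "Analyst"), ("E2", "P2", "Analyst")]


def left_outer_join(r, s, predicate, s_degree):
    """Hypothetical helper: every tuple of r appears at least once."""
    result = []
    for tr in r:
        matches = [tr + ts for ts in s if predicate(tr, ts)]
        # No partner in s: pad the tuple with Nulls (None) instead.
        result.extend(matches or [tr + (None,) * s_degree])
    return result


rows = left_outer_join(EMP, ASG, lambda e, a: e[0] == a[0], s_degree=3)
for t in rows:
    print(t)
```

A right outer join follows by swapping the roles of the operands, and a full outer join by additionally padding the unmatched tuples of the right operand.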

Semijoin.

The semijoin of relation R, defined over the set of attributes A, by relation S, defined over the set of attributes B, is the subset of the tuples of R that participate in the join of R with S. It is denoted as R ⋉_F S (where F is a predicate as defined before) and can be obtained as follows:

R ⋉_F S = Π_A(R ⋈_F S) = Π_A(R) ⋈_F Π_{A∩B}(S) = R ⋈_F Π_{A∩B}(S)

The advantage of semijoin is that it decreases the number of tuples that need to be handled to form the join. In centralized database systems, this is important because it usually results in a decreased number of secondary storage accesses by making better use of the memory. It is even more important in distributed databases since it usually reduces the amount of data that needs to be transmitted between sites in order to evaluate a query. We talk about this in more detail in Chapters 3 and 8. At this point note that the operation is asymmetric (i.e., R ⋉_F S ≠ S ⋉_F R).

Example 2.10. To demonstrate the difference between join and semijoin, let us consider the semijoin of EMP with PAY over the predicate EMP.TITLE = PAY.TITLE, that is,

EMP ⋉_{EMP.TITLE = PAY.TITLE} PAY

The result of the operation is shown in Figure 2.10. We encourage readers to compare Figures 2.8 and 2.10 to see the difference between the join and the semijoin operations. Note that the resultant relation does not have the SAL attribute of PAY and is therefore smaller.

EMP ⋉_{EMP.TITLE = PAY.TITLE} PAY

ENO   ENAME       TITLE
E1    J. Doe      Elect. Eng.
E2    M. Smith    Syst. Anal.
E3    A. Lee      Mech. Eng.
E4    J. Miller   Programmer
E5    B. Casey    Syst. Anal.
E6    L. Chu      Elect. Eng.
E7    R. Davis    Mech. Eng.
E8    J. Jones    Syst. Anal.

Fig. 2.10 The Result of Semijoin
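The following Python sketch illustrates the semijoin for the equality case only. The helper `semijoin` and the restricted relation `WELL_PAID` (PAY tuples with a salary above 30000) are assumptions introduced to show a case in which the semijoin actually filters tuples.

```python
EMP = [  # (ENO, ENAME, TITLE)
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."),
    ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."),
    ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."),
    ("E8", "J. Jones", "Syst. Anal."),
]
PAY = [  # (TITLE, SAL)
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
]


def semijoin(r, s, r_pos, s_pos):
    """Hypothetical helper: tuples of r with an equal-valued partner in s."""
    # Only the join-attribute values of s are needed, i.e., a projection of s.
    s_values = {ts[s_pos] for ts in s}
    return [tr for tr in r if tr[r_pos] in s_values]


# Example 2.10: every title occurs in PAY, so all EMP tuples qualify.
emp_semi_pay = semijoin(EMP, PAY, r_pos=2, s_pos=0)

# Hypothetical restriction of PAY, used to show the filtering effect.
WELL_PAID = [t for t in PAY if t[1] > 30000]
emp_semi_well_paid = semijoin(EMP, WELL_PAID, r_pos=2, s_pos=0)

# The operation is asymmetric: PAY semijoin EMP yields PAY tuples.
pay_semi_emp = semijoin(PAY, EMP, r_pos=0, s_pos=2)
print(len(emp_semi_pay), len(emp_semi_well_paid), len(pay_semi_emp))
```

Only the set of TITLE values of the right operand is consulted, which is the property that makes semijoins attractive when the two relations are stored at different sites.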

Division.

The division of relation R of degree r with relation S of degree s (where r > s and s ≠ 0) is the set of (r − s)-tuples t such that for all s-tuples u in S, the tuple tu is in R. The division operation is denoted as R ÷ S and can be specified in terms of the fundamental operators as follows:

R ÷ S = Π_Ā(R) − Π_Ā((Π_Ā(R) × S) − R)

where Ā is the set of attributes of R that are not in S [i.e., the (r − s)-tuples].

Example 2.11. Assume that we have a modified version of the ASG relation (call it ASG′) depicted in Figure 2.11a and defined as follows:

ASG′ = Π_{ENO,PNO}(ASG) ⋈_PNO PROJ

If one wants to find the employee numbers of those employees who are assigned to all the projects that have a budget greater than $200,000, it is necessary to divide ASG′ with a restricted version of PROJ, called PROJ′ (see Figure 2.11b). The result of division (ASG′÷ PROJ′) is shown in Figure 2.11c.

The keyword in the query above is “all.” This rules out the possibility of doing a selection on ASG′ to find the necessary tuples, since that would only give those which correspond to employees working on some project with a budget greater than $200,000, not those who work on all projects. Note that the result contains only the tuple 〈E3〉 since the tuples 〈E3, P3, CAD/CAM, 250000〉 and 〈E3, P4, Maintenance, 310000〉 both exist in ASG′. On the other hand, for example, 〈E7〉 is not in the result, since even though the tuple 〈E7, P3, CAD/CAM, 250000〉 is in ASG′, the tuple 〈E7, P4, Maintenance, 310000〉 is not.
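The division of Example 2.11 can be reproduced step by step from the definition in terms of the fundamental operators. In this sketch the helper `divide` is hypothetical, relations are Python sets of tuples, and the attributes of the divisor are assumed to be the trailing components of each dividend tuple.

```python
ASG_PRIME = {  # (ENO, PNO, PNAME, BUDGET)
    ("E1", "P1", "Instrumentation", 150000),
    ("E2", "P1", "Instrumentation", 150000),
    ("E2", "P2", "Database Develop.", 135000),
    ("E3", "P3", "CAD/CAM", 250000),
    ("E3", "P4", "Maintenance", 310000),
    ("E4", "P2", "Database Develop.", 135000),
    ("E5", "P2", "Database Develop.", 135000),
    ("E6", "P4", "Maintenance", 310000),
    ("E7", "P3", "CAD/CAM", 250000),
    ("E8", "P3", "CAD/CAM", 250000),
}
PROJ_PRIME = {  # (PNO, PNAME, BUDGET), budget greater than 200000
    ("P3", "CAD/CAM", 250000),
    ("P4", "Maintenance", 310000),
}


def divide(r, s, quotient_degree):
    """Hypothetical helper: R / S, with S's attributes trailing in R."""
    k = quotient_degree
    candidates = {t[:k] for t in r}                      # projection of R
    required = {c + u for c in candidates for u in s}    # candidates x S
    disqualified = {t[:k] for t in required - r}         # a pairing is missing
    return candidates - disqualified


quotient = divide(ASG_PRIME, PROJ_PRIME, quotient_degree=1)
print(quotient)
```

The intermediate set `required − r` contains, for example, the tuple for E7 and P4, which is why E7 is disqualified even though E7 works on P3.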

Since all operations take relations as input and produce relations as outputs, we can nest operations using a parenthesized notation and represent relational algebra programs. The parentheses indicate the order of execution. The following are a few examples that demonstrate the issue.

Example 2.12. Consider the relations of Figure 2.3. The retrieval query

“Find the names of employees working on the CAD/CAM project”

can be answered by the relational algebra program

Π_ENAME(((σ_{PNAME = “CAD/CAM”} PROJ) ⋈_PNO ASG) ⋈_ENO EMP)

The order of execution is: the selection on PROJ, followed by the join with ASG, followed by the join with EMP, and finally the projection on ENAME.

An equivalent program where the size of the intermediate relations is smaller is

Π_ENAME(EMP ⋉_ENO (Π_ENO(ASG ⋉_PNO (σ_{PNAME = “CAD/CAM”} PROJ))))
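To compare the two programs of Example 2.12, the sketch below evaluates both with plain list and set comprehensions and inspects the degree of the intermediate results. ASG is reduced to its ENO and PNO attributes, and all variable names are assumptions of the sketch.

```python
PROJ = [  # (PNO, PNAME, BUDGET)
    ("P1", "Instrumentation", 150000),
    ("P2", "Database Develop.", 135000),
    ("P3", "CAD/CAM", 250000),
    ("P4", "Maintenance", 310000),
]
ASG = [  # (ENO, PNO) only
    ("E1", "P1"), ("E2", "P1"), ("E2", "P2"), ("E3", "P3"), ("E3", "P4"),
    ("E4", "P2"), ("E5", "P2"), ("E6", "P4"), ("E7", "P3"), ("E8", "P3"),
]
EMP = [  # (ENO, ENAME, TITLE)
    ("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."), ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."), ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."), ("E8", "J. Jones", "Syst. Anal."),
]

# Selection on PROJ, shared by both programs.
cad = [p for p in PROJ if p[1] == "CAD/CAM"]

# Program 1: two joins, then the projection on ENAME.
j1 = [p + a for p in cad for a in ASG if p[0] == a[1]]   # ENO at position 3
j2 = [t + e for t in j1 for e in EMP if t[3] == e[0]]    # ENAME at position 6
names_join = {t[6] for t in j2}

# Program 2: two semijoins with a projection on ENO in between.
cad_pnos = {p[0] for p in cad}
s1 = [a for a in ASG if a[1] in cad_pnos]                # ASG semijoin PROJ
enos = {a[0] for a in s1}                                # projection on ENO
s2 = [e for e in EMP if e[0] in enos]                    # EMP semijoin
names_semi = {e[1] for e in s2}

print(sorted(names_join), len(j2[0]), len(s2[0]))
```

Both programs return the same names, but the widest intermediate tuple has 8 components in the join version and 3 in the semijoin version.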

ASG′

ENO   PNO   PNAME               BUDGET
E1    P1    Instrumentation     150000
E2    P1    Instrumentation     150000
E2    P2    Database Develop.   135000
E3    P3    CAD/CAM             250000
E3    P4    Maintenance         310000
E4    P2    Database Develop.   135000
E5    P2    Database Develop.   135000
E6    P4    Maintenance         310000
E7    P3    CAD/CAM             250000
E8    P3    CAD/CAM             250000

(a)

PROJ′

PNO   PNAME         BUDGET
P3    CAD/CAM       250000
P4    Maintenance   310000

(b)

(ASG′ ÷ PROJ′)

ENO
E3

(c)

Fig. 2.11 The Result of Division

Example 2.13. The update query

“Replace the salary of programmers by $25,000”

can be computed by

(PAY − (σ_{TITLE = “Programmer”} PAY)) ∪ (〈Programmer, 25000〉)
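Read as set operations, the update of Example 2.13 is a deletion followed by an insertion. A minimal Python sketch, with PAY modeled as a set of (TITLE, SAL) tuples and variable names chosen for illustration:

```python
PAY = {
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
}

# Selection: the tuples to be replaced.
old_tuples = {t for t in PAY if t[0] == "Programmer"}

# Set difference deletes them; union inserts the modified tuple.
new_pay = (PAY - old_tuples) | {("Programmer", 25000)}
print(sorted(new_pay))
```

The original relation is left untouched; the expression produces a new relation, as every relational algebra operation does.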

2.1.3.2 Relational Calculus

In relational calculus-based languages, instead of specifying how to obtain the result, one specifies what the result is by stating the relationship that is supposed to hold for the result. Relational calculus languages fall into two groups: tuple relational calculus and domain relational calculus. The difference between the two is in terms of the primitive variable used in specifying the queries. We briefly review these two types of languages.

Relational calculus languages have a solid theoretical foundation since they are based on first-order predicate logic as we discussed before. Semantics is given to formulas by interpreting them as assertions on the database. A relational database can be viewed as a collection of tuples or a collection of domains. Tuple relational calculus interprets a variable in a formula as a tuple of a relation, whereas domain relational calculus interprets a variable as the value of a domain.

Tuple relational calculus.

The primitive variable used in tuple relational calculus is a tuple variable which specifies a tuple of a relation. In other words, it ranges over the tuples of a relation. Tuple calculus is the original relational calculus developed by Codd [1970].

In tuple relational calculus queries are specified as {t|F(t)}, where t is a tuple variable and F is a well-formed formula. The atomic formulas are of two forms:

1. Tuple-variable membership expressions. If t is a tuple variable ranging over the tuples of relation R (predicate symbol), the expression “tuple t belongs to relation R” is an atomic formula, which is usually specified as R.t or R(t).

2. Conditions. These can be defined as follows:

(a) s[A] θ t[B], where s and t are tuple variables and A and B are components of s and t, respectively. θ is one of the arithmetic comparison operators <, >, =, ≠, ≤, and ≥. This condition specifies that component A of s stands in relation θ to the B component of t: for example, s[SAL] > t[SAL].

(b) s[A]θc, where s, A, and θ are as defined above and c is a constant. For example, s[ENAME] = “Smith”.

Note that A is defined as a component of the tuple variable s. Since the range of s is a relation instance, say S, it is obvious that component A of s corresponds to attribute A of relation S. The same thing is obviously true for B.
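Python comprehensions read much like tuple calculus expressions, which makes them a convenient way to experiment with the two kinds of atomic formulas. The sketch below is a loose analogy rather than a formal semantics: the sample tuples are a subset of EMP and PAY, and the second query returns a single component of the tuple variable, which goes slightly beyond the plain {t | F(t)} form.

```python
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal."},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Mech. Eng."},
    {"ENO": "E4", "ENAME": "J. Miller", "TITLE": "Programmer"},
]
PAY = [
    {"TITLE": "Elect. Eng.", "SAL": 40000},
    {"TITLE": "Syst. Anal.", "SAL": 34000},
    {"TITLE": "Mech. Eng.", "SAL": 27000},
    {"TITLE": "Programmer", "SAL": 24000},
]

# {t | EMP(t) AND t[TITLE] = "Programmer"}
# "for t in EMP" plays the role of the membership expression EMP(t);
# the "if" clause is a condition of the form s[A] theta c.
programmers = [t for t in EMP if t["TITLE"] == "Programmer"]

# Names of employees s for which there exists a PAY tuple t with
# s[TITLE] = t[TITLE] (form s[A] theta t[B]) and t[SAL] > 30000.
well_paid = [
    s["ENAME"]
    for s in EMP
    if any(s["TITLE"] == t["TITLE"] and t["SAL"] > 30000 for t in PAY)
]
print(programmers, well_paid)
```

The built-in `any` stands in for the existential quantifier; `all` would play the corresponding role for the universal quantifier.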

There are many languages that are based on relational tuple calculus, the most popular ones being SQL¹ [Date, 1987] and QUEL [Stonebraker et al., 1976]. SQL is now an international standard (actually, the only one) with various versions released: SQL1 was released in 1986, modifications to SQL1 were included in the 1989 version, SQL2 was issued in 1992, and SQL3, with object-oriented language extensions, was released in 1999.

¹ Sometimes SQL is cited as lying somewhere between relational algebra and relational calculus. Its originators called it a “mapping language.” However, it follows the tuple calculus definition quite closely; hence we classify it as such.

SQL provides a uniform approach to data manipulation (retrieval, update), data definition (schema manipulation), and control (authorization, integrity, etc.). We limit ourselves to the expression, in SQL, of the queries in Examples 2.14 and 2.15.

Example 2.14. The query from Example 2.12,

“Find the names of employees working on the CAD/CAM project”

can be expressed as follows:

SELECT EMP.ENAME
FROM   EMP, ASG, PROJ
WHERE  EMP.ENO = ASG.ENO
AND    ASG.PNO = PROJ.PNO
AND    PROJ.PNAME = "CAD/CAM"

Note that a retrieval query generates a new relation similar to the relational algebra operations.

Example 2.15. The update query of Example 2.13,

“Replace the salary of programmers by $25,000”

is expressed as

UPDATE PAY
SET    SAL = 25000
WHERE  PAY.TITLE = "Programmer"
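Both statements can be tried out with Python's built-in `sqlite3` module. The sketch below is one possible setup, not part of the text: the table definitions, the reduced sample data, the added ORDER BY clause (for a deterministic result), and the use of single-quoted string literals (the standard SQL form) are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP  (ENO TEXT, ENAME TEXT, TITLE TEXT);
    CREATE TABLE ASG  (ENO TEXT, PNO TEXT);
    CREATE TABLE PROJ (PNO TEXT, PNAME TEXT, BUDGET INTEGER);
    CREATE TABLE PAY  (TITLE TEXT, SAL INTEGER);
""")
# A subset of the sample tuples used in the running example.
conn.executemany("INSERT INTO EMP VALUES (?, ?, ?)", [
    ("E1", "J. Doe", "Elect. Eng."), ("E3", "A. Lee", "Mech. Eng."),
    ("E4", "J. Miller", "Programmer"), ("E7", "R. Davis", "Mech. Eng."),
    ("E8", "J. Jones", "Syst. Anal."),
])
conn.executemany("INSERT INTO ASG VALUES (?, ?)", [
    ("E1", "P1"), ("E3", "P3"), ("E3", "P4"),
    ("E4", "P2"), ("E7", "P3"), ("E8", "P3"),
])
conn.executemany("INSERT INTO PROJ VALUES (?, ?, ?)", [
    ("P1", "Instrumentation", 150000), ("P2", "Database Develop.", 135000),
    ("P3", "CAD/CAM", 250000), ("P4", "Maintenance", 310000),
])
conn.executemany("INSERT INTO PAY VALUES (?, ?)", [
    ("Elect. Eng.", 40000), ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000), ("Programmer", 24000),
])

# Example 2.14: names of employees working on the CAD/CAM project.
names = [row[0] for row in conn.execute("""
    SELECT EMP.ENAME
    FROM   EMP, ASG, PROJ
    WHERE  EMP.ENO = ASG.ENO
    AND    ASG.PNO = PROJ.PNO
    AND    PROJ.PNAME = 'CAD/CAM'
    ORDER BY EMP.ENO
""")]

# Example 2.15: replace the salary of programmers by 25000.
conn.execute("UPDATE PAY SET SAL = 25000 WHERE PAY.TITLE = 'Programmer'")
salary = conn.execute(
    "SELECT SAL FROM PAY WHERE TITLE = 'Programmer'"
).fetchone()[0]
print(names, salary)
```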

Domain relational calculus.

The domain relational calculus was first proposed by Lacroix and Pirotte [1977]. The fundamental difference between a tuple relational language and a domain relational language is the use of a domain variable in the latter. A domain variable ranges over the values in a domain and specifies a component of a tuple. In other words, the range of a domain variable consists of the domains over which the relation is defined. The wffs are formulated accordingly. The queries are specified in the following form:

x1, x2, ..., xn | F(x1, x2, ..., xn)

where F is a wff in which x1, ..., xn are the free variables.

The success of domain relational calculus languages is due mainly to QBE [Zloof, 1977], which is a visual application of domain calculus. QBE, designed only for interactive use from a visual terminal, is user friendly. The basic concept is an example: the user formulates queries by providing a possible example of the answer. Typing relation names triggers the printing, on screen, of their schemes. Then, by supplying keywords into the columns (domains), the user specifies the query. For instance, the attributes of the project relation are given by P, which stands for “Print.”

EMP    ENO   ENAME   TITLE
       E2    P.

ASG    ENO   PNO   RESP   DUR
       E2    P3

PROJ   PNO   PNAME     BUDGET
       P3    CAD/CAM

Fig. 2.12 Retrieval Query in QBE

By default, all queries are retrieval. An update query requires the specification of U under the name of the updated relation or in the updated column. The retrieval query corresponding to Example 2.12 is given in Figure 2.12 and the update query of Example 2.13 is given in Figure 2.13. To distinguish examples from constants, examples are underlined.

PAY    TITLE        SAL
       Programmer   U.25000

Fig. 2.13 Update Query in QBE

2.2 Review of Computer Networks

In this section we discuss computer networking concepts relevant to distributed database systems. We omit most of the details of the technological and technical issues in favor of discussing the main concepts.

We define a computer network as an interconnected collection of autonomous computers that are capable of exchanging information among themselves (Figure 2.14). The keywords in this definition are interconnected and autonomous. We want the computers to be autonomous so that each computer can execute programs on its own. We also want the computers to be interconnected so that they are capable of exchanging information. Computers on a network are referred to as nodes, hosts, end systems, or sites. Note that sometimes the terms host and end system are used to refer simply to the equipment, whereas site is reserved for the equipment as well as the software that runs on it. Similarly, node is generally used as a generic reference to the computers or to the switches in a network. They form one of the fundamental hardware components of a network. The other fundamental component is special purpose devices and links that form the communication path that interconnects the nodes. As depicted in Figure 2.14, the hosts are connected to the network through switches (represented as circles with an X in them)², which are special-purpose equipment that route messages through the network. Some of the hosts may be connected to the switches directly (using fiber optic, coaxial cable or copper wire) and some via wireless base stations. The switches are connected to each other by communication links that may be fiber optics, coaxial cable, satellite links, microwave connections, etc.

[Figure: hosts attached to a network of interconnected switches]

Fig. 2.14 A Computer Network

The most widely used computer network these days is the Internet. It is hard to define the Internet since the term is used to mean different things, but perhaps the best definition is that it is a network of networks (Figure 2.15). Each of these networks is referred to as an intranet to highlight the fact that they are “internal” to an organization. An intranet, then, consists of a set of links and routers (shown as “R” in Figure 2.15) administered by a single administrative entity or by its delegates. For instance, the routers and links at a university constitute a single administrative domain. Such domains may be located within a single geographical area (such as the university network mentioned above), or, as in the case of large enterprises or Internet Service Provider (ISP) networks, span multiple geographical areas. Each intranet is connected to some others by means of links provisioned from ISPs. These links are typically high-speed, long-distance duplex data transmission media (we will define these terms shortly), such as a fiber-optic cable, or a satellite link. These links make up what is called the Internet backbone. Each intranet has a router interface that connects it to the backbone, as shown in Figure 2.15. Thus, each link connects an intranet router to an ISP’s router. ISP’s routers are connected by similar links to routers of other ISPs. This allows servers and clients within an intranet to communicate with servers and clients in other intranets.

[Figure: clients and servers in intranets, connected through routers (R) to ISP networks]

Fig. 2.15 Internet

² Note that the terms “switch” and “router” are sometimes used interchangeably (even within the same text). However, other times they are used to mean slightly different things: switch refers to the devices inside a network whereas router refers to one that is at the edge of a network connecting it to the backbone. We use them interchangeably as in Figures 2.14 and 2.15.

2.2.1 Types of Networks

There are various criteria by which computer networks can be classified. One criterion is the geographic distribution (also called scale [Tanenbaum, 2003]), a second criterion is the interconnection structure of nodes (also called topology), and the third is the mode of transmission.

2.2.1.1 Scale

In terms of geographic distribution, networks are classified as wide area networks, metropolitan area networks and local area networks. The distinctions among these are somewhat blurred, but in the following, we give some general guidelines that identify each of these networks. The primary distinctions among them are probably in terms of propagation delay, administrative control, and the protocols that are used in managing them.

A wide area network (WAN) is one where the link distance between any two nodes is greater than approximately 20 kilometers (km) and can go as large as thousands of kilometers. Use of switches allows the aggregation of communication over wider areas such as this. Owing to the distances that need to be traveled, long delays are involved in wide area data transmission. For example, via satellite, there is a minimum delay of about half a second for data to be transmitted from the source to the destination and acknowledged. This is because the speed with which signals can be transmitted is limited to the speed of light, and the distances that need to be spanned are great (about 36,000 km from an earth station to a geostationary satellite).
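The half-second figure can be checked with a short calculation. The sketch below assumes a geostationary satellite at roughly 36,000 km and counts four legs for a transmission plus its acknowledgment (up and down for the data, up and down for the acknowledgment); the function and constant names are illustrative, and processing and queuing delays are ignored.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum, km/s
GEO_DISTANCE_KM = 36_000           # assumed earth-station-to-satellite distance


def propagation_delay(distance_km, legs=1):
    """Time in seconds for a signal to cover `legs` hops of `distance_km`."""
    return legs * distance_km / SPEED_OF_LIGHT_KM_S


one_way = propagation_delay(GEO_DISTANCE_KM, legs=2)   # source -> satellite -> destination
with_ack = propagation_delay(GEO_DISTANCE_KM, legs=4)  # data plus acknowledgment
print(f"one way: {one_way:.3f} s, with acknowledgment: {with_ack:.3f} s")
```

Under these assumptions the acknowledged exchange takes a little under half a second of pure propagation time, before any transmission or processing time is added.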

WANs are typically characterized by the heterogeneity of the transmission media, the computers, and the user community involved. Early WANs had a limited capacity of less than a few megabits-per-second (Mbps). However, most of the current ones are broadband WANs that provide capacities of 150 Mbps and above. These individual channels are aggregated into the backbone links; the current backbone links are commonly OC48 at 2.4 Gbps or OC192 at 10 Gbps. These networks can carry multiple data streams with varying characteristics (e.g., data as well as audio/video streams) and offer the possibility of negotiating for a level of quality of service (QoS) and reserving network resources sufficient to fulfill this level of QoS.

Local area networks (LANs) are typically limited in geographic scope (usually less than 2 km). They provide higher capacity communication over inexpensive transmission media. The capacities are typically in the range of 10 to 1000 Mbps per connection. Higher capacity and shorter distances between hosts result in very short delays. Furthermore, the better controlled environments in which the communication links are laid out (within buildings, for example) reduce the noise and interference, the heterogeneity among the computers that are connected is easier to manage, and a common transmission medium is used.

Metropolitan area networks (MANs) are in between LANs and WANs in scale and cover a city or a portion of it. The distance between nodes is typically on the order of 10 km.

2.2.1.2 Topology

As the name indicates, interconnection structure or topology refers to the way nodes on a network are interconnected. The network in Figure 2.14 is what is called an irregular network, where the interconnections between nodes do not follow any pattern. It is possible to find a node that is connected to only one other node, as well as nodes that have connections to a number of nodes. The Internet is a typical irregular network.

[Figure: Hosts #1 to #4, each attached through a bus interface to a shared bus]

Fig. 2.16 Bus Network

Another popular topology is the bus, where all the computers are connected to a common channel (Figure 2.16). This type of network is primarily used in LANs. The link control is typically performed using the carrier sense multiple access with collision detection (CSMA/CD) protocol. The CSMA/CD bus control mechanism can best be described as a “listen before and while you transmit” scheme. The fundamental point is that each host listens continuously to what occurs on the bus. When a message transmission is detected, the host checks if the message is addressed to it, and takes the appropriate action. If it wants to transmit, it waits until it detects no more activity on the bus and then places its message on the network and continues to listen to bus activity. If it detects another transmission while it is transmitting a message itself, then there has been a “collision.” In such a case, and when the collision is detected, the transmitting hosts abort the transmission, each waits a random amount of time, and then each retransmits the message. The basic CSMA/CD scheme is used in the Ethernet local area network³.
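The collision-and-retry behavior described above can be imitated with a small slotted simulation. This is a deliberately simplified toy model and not the Ethernet algorithm: time is divided into slots, a frame occupies exactly one slot, every host has one frame to send at slot 0, and the backoff is drawn uniformly from 1 to `max_wait` slots. The function name and all parameters are assumptions of the sketch.

```python
import random


def simulate_csma_cd(num_hosts, max_wait=8, seed=0, max_slots=10_000):
    """Toy slotted model: returns (delivery order, number of collisions)."""
    rng = random.Random(seed)
    wait = {h: 0 for h in range(num_hosts)}  # slots each host still waits
    delivered, collisions = [], 0
    for _ in range(max_slots):
        if not wait:
            break  # every host has sent its frame
        ready = [h for h, w in wait.items() if w == 0]
        if len(ready) == 1:
            # Only one host transmitted in this slot: success.
            delivered.append(ready[0])
            del wait[ready[0]]
        elif len(ready) > 1:
            # Several hosts transmitted at once: all of them detect the
            # collision, abort, and wait a random number of slots.
            collisions += 1
            for h in ready:
                wait[h] = rng.randint(1, max_wait)
        # Hosts that were already backing off let one slot elapse.
        for h in wait:
            if h not in ready:
                wait[h] -= 1
    return delivered, collisions


order, collisions = simulate_csma_cd(num_hosts=4)
print("delivery order:", order, "collisions:", collisions)
```

Because all hosts start in the same slot, at least one collision always occurs when there is more than one host; the random waits are what eventually separate the retransmissions.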

³ In most current implementations of Ethernet, multiple busses are linked via one or more switches (called switched hubs) for expanded coverage and to better control the load on each bus segment. In these systems, individual computers can directly be connected to the switch as well. These are known as switched Ethernet.

Other common alternatives are star, ring, and mesh networks.

• Star networks connect all the hosts to a central node that coordinates the transmission on the network. Thus if two hosts want to communicate, they have to go through the central node. Since there is a separate link between the central node and each of the others, there is a negotiation between the hosts and the central node when they wish to communicate.

• Ring networks interconnect the hosts in the form of a loop. This type of network was originally proposed for LANs, but their use in these networks has nearly stopped. They are now primarily used in MANs (e.g., SONET rings). In their current incarnation, data transmission around the ring is usually bidirectional (original rings were unidirectional), with each station (actually the interface to which each station is connected) serving as an active repeater that receives a message, checks the address, copies the message if it is addressed to that station, and retransmits it. Communication in ring-type networks is generally controlled by means of a control token. In the simplest type of token ring networks, a token, which has one bit pattern to indicate that the network is free and a different bit pattern to indicate that it is in use, is circulated around the network. Any site wanting to transmit a message waits for the token. When it arrives, the site checks the token’s bit pattern to see if the network is free or in use. If it is free, the site changes the bit pattern to indicate that the network is in use and then places the message on the ring. The message circulates around the ring and returns to the sender, which changes the bit pattern to free and sends the token to the next computer down the line.

• Complete (or mesh) interconnection is one where each node is interconnected to every other node. Such an interconnection structure obviously provides more reliability and the possibility of better performance than that of the structures noted previously. However, it is also the costliest. For example, a complete connection of 10,000 computers would require approximately (10,000)²/2 links.⁴
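The cost of a complete interconnection follows from counting one link per unordered pair of nodes, n(n−1)/2. A short Python check for the 10,000-computer example (the function name is illustrative):

```python
def mesh_links(n):
    """Number of links in a complete interconnection of n nodes."""
    return n * (n - 1) // 2


print(mesh_links(10_000))  # 49995000
```

The count grows quadratically with the number of nodes, which is why complete interconnection is practical only for small networks.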

2.2.2 Communication Schemes

In terms of the physical communication schemes employed, networks can be either point-to-point (also called unicast) networks, or broadcast (sometimes also called multi-point) networks.

In point-to-point networks, there are one or more (direct or indirect) links between each pair of nodes. The communication is always between two nodes and the receiver and sender are identified by their addresses that are included in the message header. Data transmission from the sender to the receiver follows one of the possibly many links between them, some of which may involve visiting other intermediate nodes. An intermediate node checks the destination address in the message header and if it is not addressed to it, passes it along to the next intermediate node. This is the process of switching or routing. The selection of the links via which messages are sent is determined by usually elaborate routing algorithms that are beyond our scope. We discuss the details of switching in Section 2.2.3.

⁴ The general form of the equation is n(n−1)/2, where n is the number of nodes on the network.

The fundamental transmission media for point-to-point networks are twisted pair, coaxial or fiber optic cables. Each of these media has a different capacity: twisted pair 300 bps to 10 Mbps, coaxial up to 200 Mbps, and fiber optic 10 Gbps and even higher.

In broadcast networks, there is a common communication channel that is utilized by all the nodes in the network. Messages are transmitted over this common channel and received by all the nodes. Each node checks the receiver address and if the message is not addressed to it, ignores it.

A special case of broadcasting is multicasting where the message is sent to a subset of the nodes in the network. The receiver address is somehow encoded to indicate which nodes are the recipients.

Broadcast networks are generally radio or satellite-based. In case of satellite transmission, each site beams its transmission to a satellite which then beams it back at a different frequency. Every site on the network listens to the receiving frequency and has to disregard the message if it is not addressed to that site. A network that uses this technique is HughesNet™.

Microwave transmission is another mode of data communication and it can be over satellite or terrestrial. Terrestrial microwave links used to form a major portion of most countries’ telephone networks although many of these have since been converted to fiber optic. In addition to the public carriers, some companies make use of private terrestrial microwave links. In fact, major metropolitan cities face the problem of microwave interference among privately owned and public carrier links. A very early example that is usually identified as having pioneered the use of satellite microwave transmission is ALOHA [Abramson, 1973].

Satellite and microwave networks are examples of wireless networks. These types of wireless networks are commonly referred to as wireless broadband networks. Another type of wireless network is one that is based on cellular networks. A cellular network control station is responsible for a geographic area called a cell and coordinates the communication from mobile hosts in their cell. These control stations may be linked to a “wireline” backbone network and thereby provide access from/to mobile hosts to other mobile hosts or stationary hosts on the wireline network.

A third type of wireless network, with which most of us may be more familiar, is the wireless LAN (commonly referred to as Wi-LAN or WiLan). In this case a number of “base stations” are connected to a wireline network and serve as connection points for mobile hosts (similar to control stations in cellular networks). These networks can provide bandwidth of up to 54 Mbps.

A final word on broadcasting topologies is that they have the advantage that it is easier to check for errors and to send messages to more than one site than to do so in point-to-point topologies. On the other hand, since everybody listens in, broadcast networks are not as secure as point-to-point networks.

2.2.3 Data Communication Concepts

What we refer to as data communication is the set of technologies that enable two hosts to communicate. We are not going to be too detailed in this discussion, since, at the distributed DBMS level, we can assume that the technology exists to move bits between hosts. We, instead, focus on a few important issues that are relevant to understanding delay and routing concepts.

As indicated earlier, hosts are connected by links, each of which can carry one or more channels. A link is a physical entity whereas a channel is a logical one. Communication links can carry signals either in digital form or in analog form. Telephone lines, for example, can carry data in analog form between the home and the central office – the rest of the telephone network is now digital and even the home-to-central office link is becoming digital with voice-over-IP (VoIP) technology. Each communication channel has a capacity, which can be defined as the amount of information that can be transmitted over the channel in a given time unit. This capacity is commonly referred to as the bandwidth of the channel. In analog transmission channels, the bandwidth is defined as the difference (in hertz) between the lowest and highest frequencies that can be transmitted over the channel. In digital links, bandwidth refers (less formally and with abuse of terminology) to the number of bits that can be transmitted per second (bps).

With respect to delays in getting the user’s work done, the bandwidth of a transmission channel is a significant factor, but it is not necessarily the only one. The other factor in the transmission time is the software employed. There are usually overhead costs involved in data transmission due to the redundancies within the message itself, necessary for error detection and correction. Furthermore, the network software adds headers and trailers to any message, for example, to specify the destination or to check for errors in the entire message. All of these activities contribute to delays in transmitting data. The actual rate at which data are transmitted across the network is known as the data transfer rate and this rate is usually less than the actual bandwidth of the transmission channel. The software issues, which are generally referred to as network protocols, are discussed in the next section.
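The gap between bandwidth and data transfer rate can be illustrated with a small calculation. The following is only a sketch: the function name and the numbers are hypothetical, and it assumes that fixed-size headers and trailers on each frame are the only source of overhead, which ignores retransmissions and processing delays.

```python
def data_transfer_rate(bandwidth_bps: float, payload_bytes: int, overhead_bytes: int) -> float:
    """Useful bits delivered per second when every frame carries
    payload_bytes of data plus overhead_bytes of headers and trailers.
    Assumption: per-frame overhead is the only loss considered."""
    frame_bytes = payload_bytes + overhead_bytes
    return bandwidth_bps * payload_bytes / frame_bytes


# Hypothetical numbers: a 10 Mbps channel, 1000-byte payloads, 50 bytes of overhead.
rate = data_transfer_rate(10_000_000, 1000, 50)
print(f"{rate / 1e6:.2f} Mbps of a nominal 10 Mbps")
```

Under these assumptions the useful rate is always below the nominal bandwidth, and the shortfall grows as the payload per frame shrinks.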

In computer-to-computer communication, data are usually transmitted in packets, as we mentioned earlier. Usually, upper limits on frame sizes are established for each network, and each frame contains data as well as some control information, such as the destination and source addresses, block error check codes, and so on (Figure 2.17). If a message that is to be sent from a source node to a destination node cannot fit in one frame, it is split over a number of frames. This is discussed further in Section 2.2.4.

There are various possible forms of switching/routing that can occur in point-to-point networks. It is possible to establish a connection such that a dedicated channel exists between the sender and the receiver. This is called circuit switching and is commonly used in traditional telephone connections. When a subscriber dials the number of another subscriber, a circuit is established between the two phones by means of various switches. The circuit is maintained during the period of conversation and is broken when one side hangs up. A similar setup is possible in computer networks.

Frame layout: Header | Text | Block Error Check
Header fields: source address, destination address, message number, packet number, acknowledgment, control information

Fig. 2.17 Typical Frame Format
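The frame format can be made concrete with a short sketch. The class below is hypothetical: the field types are assumptions, the acknowledgment and control information fields are omitted for brevity, and CRC-32 is used as the block error check only because it is readily available in the Python standard library; actual networks define their own check codes.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Frame:
    # Header fields named after Figure 2.17; the types are assumptions.
    source: str
    destination: str
    message_number: int
    packet_number: int
    text: bytes
    checksum: int = 0  # block error check, here a CRC-32 (an assumption)

    def _covered(self) -> bytes:
        """Bytes protected by the check: the header fields followed by the text."""
        header = f"{self.source}|{self.destination}|{self.message_number}|{self.packet_number}"
        return header.encode() + self.text

    def seal(self) -> "Frame":
        """Compute the check code before the frame is sent."""
        self.checksum = zlib.crc32(self._covered())
        return self

    def is_intact(self) -> bool:
        """Recompute the check code at the receiver and compare."""
        return self.checksum == zlib.crc32(self._covered())


frame = Frame("C", "S", message_number=1, packet_number=0, text=b"SELECT ...").seal()
assert frame.is_intact()
frame.text = b"SELECT ..,"   # simulate a transmission error in the text
print(frame.is_intact())
```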

Another form of switching used in computer communication is packet switching, where a message is broken up into packets and each packet is transmitted individually. In our discussion of the TCP/IP protocol earlier, we referred to messages being transmitted; in fact the TCP protocol (or any other transport layer protocol) takes each application message and breaks it up into fixed-size packets. Therefore, each application message may be sent to the destination as multiple packets.

Packets for the same message may travel independently of each other and may, in fact, take different routes. As a result, they may reach the destination at different times and even out of order. The transport layer protocol at the destination is therefore responsible for collating and ordering the packets and reconstructing the application message properly.
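A minimal sketch of this splitting and reordering is given below. The function names and the use of a simple sequence number per packet are assumptions made for illustration; real transport protocols carry considerably more state.

```python
import random


def packetize(message: bytes, size: int) -> list[tuple[int, bytes]]:
    """Split a message into (sequence number, payload) packets."""
    return [(i, message[off:off + size])
            for i, off in enumerate(range(0, len(message), size))]


def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Restore the original order using the sequence numbers."""
    return b"".join(payload for _, payload in sorted(packets))


message = b"distributed database systems"
packets = packetize(message, size=8)
random.shuffle(packets)          # packets may arrive in any order
assert reassemble(packets) == message
```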

The advantages of packet switching are many. First, packet-switching networks provide higher link utilization since each link is not dedicated to a pair of communicating equipment and can be shared by many. This is especially useful in computer communication due to its bursty nature – there is a burst of transmission and then some break before another burst of transmission starts. The link can be used for other transmission when it is idle. Second, packetizing may permit the parallel transmission of data. There is usually no requirement that various packets belonging to the same message travel the same route through the network. In such a case, they may be sent in parallel via different routes to improve the total data transmission time. As mentioned above, the result of routing packets this way is that their in-order delivery cannot be guaranteed.

On the other hand, circuit switching provides a dedicated channel between the receiver and the sender. If there is a sizable amount of data to be transmitted between the two or if the channel sharing in packet switched networks introduces too much delay or delay variance, or packet loss (which are important in multimedia applications), then the dedicated channel facilitates this significantly. Therefore, schemes similar to circuit switching (i.e., reservation-based schemes) have gained favor in the broadband networks that support applications such as multimedia with very high data transmission loads.

2.2.4 Communication Protocols

Establishing a physical connection between two hosts is not sufficient for them to communicate. Error-free, reliable and efficient communication between hosts requires the implementation of elaborate software systems that are generally called protocols. Network protocols are “layered” in that network functionality is divided into layers, each layer performing a well-defined function relying on the services provided by the layer below it and providing a service to the layer above. A protocol defines the services that are performed at one layer. The resulting layered protocol set is referred to as a protocol stack or protocol suite.

There are different protocol stacks for different types of networks; however, for communication over the Internet, the standard one is what is referred to as TCP/IP, which stands for “Transmission Control Protocol/Internet Protocol”. We focus primarily on TCP/IP in this section as well as some of the common LAN protocols.

Before we get into the specifics of the TCP/IP protocol stack, let us first discuss how a message from a process on host C in Figure 2.15 is transmitted to a process on server S, assuming both hosts implement the TCP/IP protocol. The process is depicted in Figure 2.18.

The appropriate application layer protocol takes the message from the process on host C and creates an application layer message by adding some application layer header information (oblique hatched part in Figure 2.18), details of which are not important for us. The application message is handed over to the TCP protocol, which repeats the process by adding its own header information. The TCP header includes the necessary information to facilitate the provision of TCP services we discuss shortly. The Internet layer takes the TCP message that is generated and forms an Internet message as we also discuss below. This message is now physically transmitted from host C to its router using the protocol of its own network, then through a series of routers to the router of the network that contains server S, where the process is reversed until the original message is recovered and handed over to the appropriate process on S. The TCP protocols at hosts C and S communicate to ensure the end-to-end guarantees that we discussed.
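The layering described above can be sketched as successive wrapping and unwrapping. The header contents below are placeholders invented for illustration; real application, TCP, and IP headers are structured binary fields, not short text tags.

```python
def encapsulate(message: bytes) -> bytes:
    """Sending side: each layer prepends its own header (placeholders here)."""
    for header in (b"APP|", b"TCP|", b"IP|"):
        message = header + message
    return message


def decapsulate(packet: bytes) -> bytes:
    """Receiving side: headers are stripped in the reverse order."""
    for header in (b"IP|", b"TCP|", b"APP|"):
        if not packet.startswith(header):
            raise ValueError(f"expected header {header!r}")
        packet = packet[len(header):]
    return packet


wire = encapsulate(b"hello from C")
print(wire)
assert decapsulate(wire) == b"hello from C"
```

The outermost header belongs to the lowest layer, which is why the receiving side removes the headers in the opposite order from the one in which they were added.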

2.2.4.1 TCP/IP Protocol Stack

What is referred to as TCP/IP is in fact a family of protocols, commonly referred to as the protocol stack. It consists of two sets of protocols, one set at the transport layer and the other at the network (Internet) layer (Figure 2.19).

[Figure: at each of the two hosts, the Message passes through the Application Layer, Transport Layer, Internet Layer, and Local Network.]

Fig. 2.18 Message Transmission using TCP/IP

Application: HTML, HTTP, FTP, Telnet, NFS, SNMP, ...
Transport: TCP, UDP
Network: IP
Individual Networks: Ethernet, Token Ring, ATM, FDDI, WiFi, ...

Fig. 2.19 TCP/IP Protocol

The transport layer defines the types of services that the network provides to applications. The protocols at this layer address issues such as data loss (can the application tolerate losing some of the data during transmission?), bandwidth (some applications have minimum bandwidth requirements while others can be more elastic in their requirements), and timing (what type of delay can the applications tolerate?). For example, a file transfer application cannot tolerate any data loss, can be flexible in its bandwidth use (it will work whether the connection is high capacity or low capacity, although the performance may differ), and it does not have strict timing requirements (although we may not like a file transfer to take a few days, it would still work). In contrast, a real-time audio/video transmission application can tolerate a limited amount of data loss (this may cause some jitter and other problems, but the communication will still be “understandable”), has a minimum bandwidth requirement (5-128 Kbps for audio and 5 Kbps-20 Mbps for video), and is time sensitive (audio and video data need to be synchronized).

To deal with these varying requirements (at least with some of them), two protocols are provided at the transport layer: TCP and UDP. TCP is connection-oriented, meaning that prior setup is required between the sender and the receiver before actual message transmission can start; it provides reliable transmission between the sender and the receiver by ensuring that the messages are received correctly at the receiver (referred to as “end-to-end reliability”); it ensures flow control so that the sender does not overwhelm the receiver if the receiver process is not able to keep up with the incoming messages; and it ensures congestion control so that the sender is throttled when the network is overloaded. Note that TCP does not address the timing and minimum bandwidth guarantees, leaving these to the application layer.

UDP, on the other hand, is a connectionless service that does not provide the reliability, flow control and congestion control guarantees that TCP provides. Nor does it establish a connection between the sender and receiver beforehand. Each message is transmitted in the hope that it will get to the destination, but no end-to-end guarantees are provided. Consequently, UDP has significantly lower overhead than TCP, and is preferred by applications that would rather deal with these requirements themselves than have the network protocol handle them.

The network layer implements the Internet Protocol (IP) that provides the facility to “package” a message in a standard Internet message format for transmission across the network. Each Internet message can be up to 64 KB long and consists of a header that contains, among other things, the IP addresses of the sender and the receiver machines (the numbers such as 129.97.79.58 that you may have seen attached to your own machines), and the message body itself. The message format of each network that makes up the Internet can be different, but each of these messages is encoded into an Internet message by the Internet Protocol before it is transmitted.⁵
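As a small illustration of the addressing just mentioned, the sketch below uses Python's standard `ipaddress` module to show that a dotted address such as 129.97.79.58 is simply a 32-bit number, carried as four bytes. The constant name is our own, and the 64 KB figure is taken here to be the limit imposed by a 16-bit length field in IPv4.

```python
import ipaddress

MAX_IP_MESSAGE_BYTES = 2**16 - 1   # 16-bit length field, roughly 64 KB

addr = ipaddress.IPv4Address("129.97.79.58")
as_int = int(addr)                 # the same address as one 32-bit number
octets = list(addr.packed)         # the four bytes as they appear in a header

# Converting back from the integer yields the same address.
assert ipaddress.IPv4Address(as_int) == addr
print(octets, as_int, MAX_IP_MESSAGE_BYTES)
```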

The importance of TCP/IP is the following. Each of the intranets that are part of the Internet can use its own preferred protocol, so the computers on that network implement that particular protocol (e.g., the token ring mechanism and the CSMA/CD technique described above are examples of these types of protocols). However, if they are to connect to the Internet, they need to be able to communicate using TCP/IP, which is implemented on top of these specific network protocols (Figure 2.19).

⁵ Today, many of the intranets also use TCP/IP, in which case IP encapsulation may not be necessary.

2.2.4.2 Other Protocol Layers

Let us now briefly consider the other two layers depicted in Figure 2.19. Although these are not part of the TCP/IP protocol stack, they are necessary to be able to build distributed applications. These make up the top and the bottom layers of the protocol stack.

The Application Protocol layer provides the specifications that distributed applications have to follow. For example, if one is building a Web application, then the documents that will be posted on the Web have to be written according to the HTML protocol (note that HTML is not a networking protocol, but a document encoding protocol) and the communication between the client browser and the Web server has to follow the HTTP protocol. Similar protocols are defined at this layer for other applications as indicated in the figure.

The bottom layer represents the specific network that may be used. Each of those networks has its own message formats and protocols and provides the mechanisms for data transmission within that network.

The standardization for LANs is spearheaded by the Institute of Electrical and Electronics Engineers (IEEE), specifically their Committee No. 802; hence the standard that has been developed is known as the IEEE 802 Standard. The three layers of the IEEE 802 local area network standard are the physical layer, the medium access control layer, and the logical link control layer.

The physical layer deals with physical data transmission issues such as signaling. The medium access control layer defines protocols that control who can have access to the transmission medium and when. The logical link control layer implements protocols that ensure reliable packet transmission between two adjacent computers (not end-to-end). In most LANs, the TCP and IP layer protocols are implemented on top of these three layers, enabling each computer to communicate directly on the Internet.

To enable it to cover a variety of LAN architectures, the 802 local area network standard is actually a number of standards rather than a single one. Originally, it was specified to support three mechanisms at the medium access control level: the CSMA/CD mechanism, token ring, and token access mechanism for bus networks.

2.3 Bibliographic Notes

This chapter covered the basic issues related to relational database systems and computer networks. These concepts are discussed in much greater detail in a number of excellent textbooks. Related to database technology, we can name [Ramakrishnan and Gehrke, 2003; Elmasri and Navathe, 2011; Silberschatz et al., 2002; Garcia-Molina et al., 2002; Kifer et al., 2006], and [Date, 2004]. For computer networks one can refer to [Tanenbaum, 2003; Kurose and Ross, 2010; Leon-Garcia and Widjaja, 2004; Comer, 2009]. More focused discussion of data communication issues can be found in [Stallings, 2011].

Chapter 3 Distributed Database Design

The design of a distributed computer system involves making decisions on the placement of data and programs across the sites of a computer network, as well as possibly designing the network itself. In the case of distributed DBMSs, the distribution of applications involves two things: the distribution of the distributed DBMS software and the distribution of the application programs that run on it. Different architectural models discussed in Chapter 1 address the issue of application distribution. In this chapter we concentrate on distribution of data.

It has been suggested that the organization of distributed systems can be investigated along three orthogonal dimensions [Levin and Morgan, 1975] (Figure 3.1):

1. Level of sharing
2. Behavior of access patterns
3. Level of knowledge on access pattern behavior

In terms of the level of sharing, there are three possibilities. First, there is no sharing: each application and its data execute at one site, and there is no communication with any other program or access to any data file at other sites. This characterizes the very early days of networking and is probably not very common today. We then find the level of data sharing; all the programs are replicated at all the sites, but data files are not. Accordingly, user requests are handled at the site where they originate and the necessary data files are moved around the network. Finally, in data-plus-program sharing, both data and programs may be shared, meaning that a program at a given site can request a service from another program at a second site, which, in turn, may have to access a data file located at a third site.

Levin and Morgan draw a distinction between data sharing and data-plus-program sharing to illustrate the differences between homogeneous and heterogeneous distributed computer systems. They indicate, correctly, that in a heterogeneous environment it is usually very difficult, and sometimes impossible, to execute a given program on different hardware under a different operating system. It might, however, be possible to move data around relatively easily.

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_3, © Springer Science+Business Media, LLC 2011

[Figure: three axes. Sharing: Data, Data + program. Access pattern: Static, Dynamic. Level of knowledge: Partial information, Complete information.]

Fig. 3.1 Framework of Distribution

Along the second dimension of access pattern behavior, it is possible to identify two alternatives. The access patterns of user requests may be static, so that they do not change over time, or dynamic. It is obviously considerably easier to plan for and manage the static environments than would be the case for dynamic distributed systems. Unfortunately, it is difficult to find many real-life distributed applications that would be classified as static. The significant question, then, is not whether a system is static or dynamic, but how dynamic it is. Incidentally, it is along this dimension that the relationship between the distributed database design and query processing is established (refer to Figure 1.7).

The third dimension of classification is the level of knowledge about the access pattern behavior. One possibility, of course, is that the designers do not have any information about how users will access the database. This is a theoretical possibility, but it is very difficult, if not impossible, to design a distributed DBMS that can effectively cope with this situation. The more practical alternatives are that the designers have complete information, where the access patterns can reasonably be predicted and do not deviate significantly from these predictions, or partial information, where there are deviations from the predictions.

The distributed database design problem should be considered within this general framework. In all the cases discussed, except in the no-sharing alternative, new problems are introduced in the distributed environment which are not relevant in a centralized setting. In this chapter it is our objective to focus on these unique problems.

Two major strategies that have been identified for designing distributed databases are the top-down approach and the bottom-up approach [Ceri et al., 1987]. As the names indicate, they constitute very different approaches to the design process. The top-down approach is more suitable for tightly integrated, homogeneous distributed DBMSs, while bottom-up design is more suited to multidatabases (see the classification in Chapter 1). In this chapter, we focus on top-down design and defer bottom-up to the next chapter.

3.1 Top-Down Design Process

A framework for the top-down design process is shown in Figure 3.2. The activity begins with a requirements analysis that defines the environment of the system and “elicits both the data and processing needs of all potential database users” [Yao et al., 1982a]. The requirements study also specifies where the final system is expected to stand with respect to the objectives of a distributed DBMS as identified in Section 1.4. These objectives are defined with respect to performance, reliability and availability, economics, and expandability (flexibility).

The requirements document is input to two parallel activities: view design and conceptual design. The view design activity deals with defining the interfaces for end users. The conceptual design, on the other hand, is the process by which the enterprise is examined to determine entity types and relationships among these entities. One can possibly divide this process into two related activity groups [Davenport, 1981]: entity analysis and functional analysis. Entity analysis is concerned with determining the entities, their attributes, and the relationships among them. Functional analysis, on the other hand, is concerned with determining the fundamental functions with which the modeled enterprise is involved. The results of these two steps need to be cross-referenced to get a better understanding of which functions deal with which entities.

There is a relationship between the conceptual design and the view design. In one sense, the conceptual design can be interpreted as being an integration of user views. Even though this view integration activity is very important, the conceptual model should support not only the existing applications, but also future applications. View integration should be used to ensure that entity and relationship requirements for all the views are covered in the conceptual schema.

In conceptual design and view design activities the user needs to specify the data entities and must determine the applications that will run on the database as well as statistical information about these applications. Statistical information includes the specification of the frequency of user applications, the volume of various information, and the like. Note that from the conceptual design step comes the definition of global conceptual schema discussed in Section 1.7. We have not yet considered the implications of the distributed environment; in fact, up to this point, the process is identical to that in a centralized database design.

[Figure: flowchart with the stages Requirements Analysis, System Requirements (Objectives), Conceptual Design, View Design, View Integration, Global Conceptual Schema, Access Information, External Schema Definitions, Distribution Design, Local Conceptual Schemas, Physical Design, Physical Schemas, and Observation and Monitoring, with User Input and Feedback arrows.]

Fig. 3.2 Top-Down Design Process

The global conceptual schema (GCS) and access pattern information collected as a result of view design are inputs to the distribution design step. The objective at this stage, which is the focus of this chapter, is to design the local conceptual schemas (LCSs) by distributing the entities over the sites of the distributed system. It is possible, of course, to treat each entity as a unit of distribution. Given that we use the relational model as the basis of discussion in this book, the entities correspond to relations.

Rather than distributing relations, it is quite common to divide them into subrelations, called fragments, which are then distributed. Thus, the distribution design activity consists of two steps: fragmentation and allocation. The reason for separating the distribution design into two steps is to better deal with the complexity of the problem. However, this raises other concerns as we discuss at the end of the chapter.

The last step in the design process is the physical design, which maps the local conceptual schemas to the physical storage devices available at the corresponding sites. The inputs to this process are the local conceptual schema and the access pattern information about the fragments in them.

It is well known that design and development activity of any kind is an ongoing process requiring constant monitoring and periodic adjustment and tuning. We have therefore included observation and monitoring as a major activity in this process. Note that one does not monitor only the behavior of the database implementation but also the suitability of user views. The result is some form of feedback, which may result in backing up to one of the earlier steps in the design.

3.2 Distribution Design Issues

In the preceding section we indicated that the relations in a database schema are usually decomposed into smaller fragments, but we did not offer any justification or details for this process. The objective of this section is to fill in these details.

The following set of interrelated questions covers the entire issue. We will therefore seek to answer them in the remainder of this section.

1. Why fragment at all?
2. How should we fragment?
3. How much should we fragment?
4. Is there any way to test the correctness of decomposition?
5. How should we allocate?
6. What is the necessary information for fragmentation and allocation?

3.2.1 Reasons for Fragmentation

From a data distribution viewpoint, there is really no reason to fragment data. After all, in distributed file systems, the distribution is performed on the basis of entire files. In fact, the very early work dealt specifically with the allocation of files to nodes on a computer network. We consider earlier models in Section 3.4.

With respect to fragmentation, the important issue is the appropriate unit of distribution. A relation is not a suitable unit, for a number of reasons. First, application views are usually subsets of relations. Therefore, the locality of accesses of applications is defined not on entire relations but on their subsets. Hence it is only natural to consider subsets of relations as distribution units.

Second, if the applications that have views defined on a given relation reside at different sites, two alternatives can be followed, with the entire relation being the unit of distribution. Either the relation is not replicated and is stored at only one site, or it is replicated at all or some of the sites where the applications reside. The former results in an unnecessarily high volume of remote data accesses. The latter, on the other hand, has unnecessary replication, which causes problems in executing updates (to be discussed later) and may not be desirable if storage is limited.

Finally, the decomposition of a relation into fragments, each being treated as a unit, permits a number of transactions to execute concurrently. In addition, the fragmentation of relations typically results in the parallel execution of a single query by dividing it into a set of subqueries that operate on fragments. Thus fragmentation typically increases the level of concurrency and therefore the system throughput. This form of concurrency, which we refer to as intraquery concurrency, is dealt with mainly in Chapters 7 and 8, under query processing.
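The idea of dividing one query into subqueries over fragments can be sketched as follows. This is an illustration only: the two fragments are held in memory rather than at separate sites, the data are taken from the chapter's example PROJ relation, the query is a hypothetical one, and a thread pool stands in for sites working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Two horizontal fragments of PROJ, imagined to live at different sites.
PROJ1 = [("P1", "Instrumentation", 150000, "Montreal"),
         ("P2", "Database Develop.", 135000, "New York")]
PROJ2 = [("P3", "CAD/CAM", 250000, "New York"),
         ("P4", "Maintenance", 310000, "Paris")]


def subquery(fragment):
    """SELECT PNAME FROM <fragment> WHERE LOC = 'New York'"""
    return [pname for _, pname, _, loc in fragment if loc == "New York"]


# Each subquery runs on its own fragment; the partial results are then unioned.
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(subquery, [PROJ1, PROJ2]))

result = [name for part in partial_results for name in part]
print(result)
```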

Fragmentation raises difficulties as well. If the applications have conflicting requirements that prevent decomposition of the relation into mutually exclusive fragments, those applications whose views are defined on more than one fragment may suffer performance degradation. It might, for example, be necessary to retrieve data from two fragments and then take their join, which is costly. Minimizing distributed joins is a fundamental fragmentation issue.

The second problem is related to semantic data control, specifically to integrity checking. As a result of fragmentation, attributes participating in a dependency may be decomposed into different fragments that might be allocated to different sites. In this case, even the simpler task of checking for dependencies would result in chasing after data in a number of sites. In Chapter 5 we return to the issue of semantic data control.

3.2.2 Fragmentation Alternatives

Relation instances are essentially tables, so the issue is one of finding alternative ways of dividing a table into smaller ones. There are clearly two alternatives for this: dividing it horizontally or dividing it vertically.

EMP
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E7   R. Davis   Mech. Eng.
E8   J. Jones   Syst. Anal.

ASG
ENO  PNO  RESP        DUR
E1   P1   Manager     12
E2   P1   Analyst     24
E2   P2   Analyst     6
E3   P3   Consultant  10
E3   P4   Engineer    48
E4   P2   Programmer  18
E5   P2   Manager     24
E6   P4   Manager     48
E7   P3   Engineer    36
E8   P3   Manager     40

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris

PAY
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000
Mech. Eng.   27000
Programmer   24000

Fig. 3.3 Modified Example Database

Example 3.1. In this chapter we use a modified version of the relational database scheme developed in Section 2.1. We have added to the PROJ relation a new attribute (LOC) that indicates the place of each project. Figure 3.3 depicts the database instance we will use. Figure 3.4 shows the PROJ relation of Figure 3.3 divided horizontally into two relations. Subrelation PROJ1 contains information about projects whose budgets are less than $200,000, whereas PROJ2 stores information about projects with larger budgets. ♦
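The horizontal division in Example 3.1 amounts to two selections on BUDGET. A minimal sketch, assuming the PROJ instance of Figure 3.3 is represented as a list of Python dictionaries (a representation chosen here purely for illustration):

```python
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation",   "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM",           "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance",       "BUDGET": 310000, "LOC": "Paris"},
]

# Each fragment is a selection on the BUDGET attribute.
PROJ1 = [t for t in PROJ if t["BUDGET"] < 200000]
PROJ2 = [t for t in PROJ if t["BUDGET"] >= 200000]

# The union of the two fragments gives back the original relation.
assert sorted(PROJ1 + PROJ2, key=lambda t: t["PNO"]) == PROJ
print([t["PNO"] for t in PROJ1], [t["PNO"] for t in PROJ2])
```

Because the two selection conditions are complementary, every tuple lands in exactly one fragment.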

Example 3.2. Figure 3.5 shows the PROJ relation of Figure 3.3 partitioned vertically into two subrelations, PROJ1 and PROJ2. PROJ1 contains only the information about project budgets, whereas PROJ2 contains project names and locations. It is important to notice that the primary key to the relation (PNO) is included in both fragments. ♦
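The vertical partitioning in Example 3.2 corresponds to two projections that both retain the key, and the key is what allows the original relation to be rebuilt by a join. A minimal sketch, assuming PROJ is represented as a list of Python tuples in the attribute order (PNO, PNAME, BUDGET, LOC):

```python
PROJ = [
    ("P1", "Instrumentation",   150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM",           250000, "New York"),
    ("P4", "Maintenance",       310000, "Paris"),
]

# Projections; the primary key PNO is kept in both fragments.
PROJ1 = [(pno, budget) for pno, _, budget, _ in PROJ]
PROJ2 = [(pno, pname, loc) for pno, pname, _, loc in PROJ]

# Joining the fragments on PNO reconstructs the original tuples.
budget_of = dict(PROJ1)
rebuilt = [(pno, pname, budget_of[pno], loc) for pno, pname, loc in PROJ2]
assert rebuilt == PROJ
```

Without PNO in both fragments there would be no attribute on which to match a budget with its project name and location.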

The fragmentation may, of course, be nested. If the nestings are of different types, one gets hybrid fragmentation. Even though we do not treat hybrid fragmentation as a primitive fragmentation strategy, many real-life partitionings may be hybrid.

3.2.3 Degree of Fragmentation

The extent to which the database should be fragmented is an important decision that affects the performance of query execution. In fact, the issues in Section 3.2.1 concerning the reasons for fragmentation constitute a subset of the answers to the question we are addressing here. The degree of fragmentation goes from one extreme, that is, not to fragment at all, to the other extreme, to fragment to the level of individual tuples (in the case of horizontal fragmentation) or to the level of individual attributes (in the case of vertical fragmentation).

PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York

PROJ2
PNO  PNAME        BUDGET  LOC
P3   CAD/CAM      250000  New York
P4   Maintenance  310000  Paris

Fig. 3.4 Example of Horizontal Partitioning

PROJ1
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000

PROJ2
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris

Fig. 3.5 Example of Vertical Partitioning

We have already addressed the adverse effects of very large and very small units of fragmentation. What we need, then, is to find a suitable level of fragmentation that is a compromise between the two extremes. Such a level can only be defined with respect to the applications that will run on the database. The issue is, how? In general, the applications need to be characterized with respect to a number of parameters. According to the values of these parameters, individual fragments can be identified. In Section 3.3 we describe how this characterization can be carried out for alternative fragmentations.

3.2.4 Correctness Rules of Fragmentation

We will enforce the following three rules during fragmentation, which, together, ensure that the database does not undergo semantic change during fragmentation.

1. Completeness. If a relation instance R is decomposed into fragments FR = {R1, R2, . . . , Rn}, each data item that can be found in R can also be found in one or more of the Ri’s. This property, which is identical to the lossless decomposition property of normalization (Section 2.1), is also important in fragmentation since it ensures that the data in a global relation are mapped into fragments without any loss [Grant, 1984]. Note that in the case of horizontal fragmentation, the “item” typically refers to a tuple, while in the case of vertical fragmentation, it refers to an attribute.

2. Reconstruction. If a relation R is decomposed into fragments FR = {R1, R2, . . . , Rn}, it should be possible to define a relational operator ∇ such that

R = ∇Ri, ∀Ri ∈ FR

The operator ∇ will be different for different forms of fragmentation; it is important, however, that it can be identified. The reconstructability of the relation from its fragments ensures that constraints defined on the data in the form of dependencies are preserved.

3. Disjointness. If a relation R is horizontally decomposed into fragments FR = {R1, R2, . . . , Rn} and data item di is in Rj, it is not in any other fragment Rk (k ≠ j). This criterion ensures that the horizontal fragments are disjoint. If relation R is vertically decomposed, its primary key attributes are typically repeated in all its fragments (for reconstruction). Therefore, in case of vertical partitioning, disjointness is defined only on the non-primary key attributes of a relation.
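For horizontal fragments, the three rules can be checked mechanically. The sketch below assumes that relations and fragments are Python sets of tuples and that the reconstruction operator is set union; the function names are illustrative only.

```python
from itertools import combinations

# PROJ instance from Figure 3.3, as a set of (PNO, PNAME, BUDGET, LOC) tuples.
PROJ = {
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
}
frags = [
    {t for t in PROJ if t[2] <= 200000},
    {t for t in PROJ if t[2] > 200000},
]

def is_complete(relation, frags):
    """Every tuple of the relation appears in at least one fragment."""
    return all(any(t in f for f in frags) for t in relation)

def reconstructs(relation, frags):
    """The union of the fragments gives back exactly the relation."""
    return set().union(*frags) == relation

def is_disjoint(frags):
    """No tuple appears in two different fragments."""
    return all(a.isdisjoint(b) for a, b in combinations(frags, 2))

print(is_complete(PROJ, frags), reconstructs(PROJ, frags), is_disjoint(frags))
```

For vertical fragments the same idea applies with attributes as the items and a join on the key as the reconstruction operator, with disjointness checked on the non-key attributes only.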

3.2.5 Allocation Alternatives

Assuming that the database is fragmented properly, one has to decide on the allocation of the fragments to various sites on the network. When data are allocated, they may either be replicated or maintained as a single copy. The reasons for replication are reliability and efficiency of read-only queries. If there are multiple copies of a data item, there is a good chance that some copy of the data will be accessible somewhere even when system failures occur. Furthermore, read-only queries that access the same data items can be executed in parallel since copies exist on multiple sites. On the other hand, the execution of update queries causes trouble since the system has to ensure that all the copies of the data are updated properly. Hence the decision regarding replication is a trade-off that depends on the ratio of the read-only queries to the update queries. This decision affects almost all of the distributed DBMS algorithms and control functions.

A non-replicated database (commonly called a partitioned database) contains fragments that are allocated to sites, and there is only one copy of any fragment on the network. In case of replication, either the database exists in its entirety at each site (fully replicated database), or fragments are distributed to the sites in such a way that copies of a fragment may reside in multiple sites (partially replicated database). In the latter case, the number of copies of a fragment may be an input to the allocation algorithm or a decision variable whose value is determined by the algorithm. Figure 3.6 compares these three replication alternatives with respect to various distributed DBMS functions. We will discuss replication at length in Chapter 13.

                      Full replication      Partial replication  Partitioning
QUERY PROCESSING      Easy                  Same difficulty      Same difficulty
DIRECTORY MANAGEMENT  Easy or nonexistent   Same difficulty      Same difficulty
CONCURRENCY CONTROL   Moderate              Difficult            Easy
RELIABILITY           Very high             High                 Low
REALITY               Possible application  Realistic            Possible application

Fig. 3.6 Comparison of Replication Alternatives

3.2.6 Information Requirements

One aspect of distribution design is that too many factors contribute to an optimal design. The logical organization of the database, the location of the applications, the access characteristics of the applications to the database, and the properties of the computer systems at each site all have an influence on distribution decisions. This makes it very complicated to formulate the distribution problem.

The information needed for distribution design can be divided into four categories: database information, application information, communication network information, and computer system information. The latter two categories are completely quantitative in nature and are used in allocation models rather than in fragmentation algorithms. We do not consider them in detail here. Instead, the detailed information requirements of the fragmentation and allocation algorithms are discussed in their respective sections.

3.3 Fragmentation

In this section we present the various fragmentation strategies and algorithms. As mentioned previously, there are two fundamental fragmentation strategies: horizontal and vertical. Furthermore, there is a possibility of nesting fragments in a hybrid fashion.

3.3.1 Horizontal Fragmentation

As we explained earlier, horizontal fragmentation partitions a relation along its tuples. Thus each fragment has a subset of the tuples of the relation. There are two versions of horizontal partitioning: primary and derived. Primary horizontal fragmentation of a relation is performed using predicates that are defined on that relation. Derived horizontal fragmentation, on the other hand, is the partitioning of a relation that results from predicates being defined on another relation.

Later in this section we consider an algorithm for performing both of these fragmentations. However, first we investigate the information needed to carry out horizontal fragmentation activity.

3.3.1.1 Information Requirements of Horizontal Fragmentation

Database Information.

The database information concerns the global conceptual schema. In this context it is important to note how the database relations are connected to one another, especially with joins. In the relational model, these relationships are also depicted as relations. However, in other data models, such as the entity-relationship (E–R) model [Chen, 1976], these relationships between database objects are depicted explicitly. Ceri et al. [1983] also model the relationship explicitly, within the relational framework, for purposes of the distribution design. In the latter notation, directed links are drawn between relations that are related to each other by an equijoin operation.

Example 3.3. Figure 3.7 shows the expression of links among the database relations given in Figure 2.3. Note that the direction of the link shows a one-to-many relationship. For example, for each title there are multiple employees with that title; thus there is a link between the PAY and EMP relations. Along the same lines, the many-to-many relationship between the EMP and PROJ relations is expressed with two links to the ASG relation. ♦

PAY (TITLE, SAL)
EMP (ENO, ENAME, TITLE)
PROJ (PNO, PNAME, BUDGET, LOC)
ASG (ENO, PNO, RESP, DUR)

Links: L1 from PAY to EMP; L2 and L3 into ASG, one from EMP and one from PROJ.

Fig. 3.7 Expression of Relationships Among Relations Using Links

The links between database objects (i.e., relations in our case) should be quite familiar to those who have dealt with network models of data. In the relational model, they are introduced as join graphs, which we discuss in detail in subsequent chapters on query processing. We introduce them here because they help to simplify the presentation of the distribution models we discuss later.

The relation at the tail of a link is called the owner of the link and the relation at the head is called the member [Ceri et al., 1983]. More commonly used terms, within the relational framework, are source relation for owner and target relation for member. Let us define two functions: owner and member, both of which provide mappings from the set of links to the set of relations. Therefore, given a link, they return the owner or member relation of the link, respectively.

Example 3.4. Given link L1 of Figure 3.7, the owner and member functions have the following values:

owner(L1) = PAY
member(L1) = EMP
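The owner and member functions are simple lookups over the set of links. A minimal sketch, assuming links are stored as (owner, member) pairs; only L1 is fixed by the text, and which of L2 and L3 starts at EMP and which at PROJ is an assumption made here for illustration.

```python
# Each link maps to an (owner, member) pair.
LINKS = {
    "L1": ("PAY", "EMP"),    # stated in Example 3.4
    "L2": ("EMP", "ASG"),    # assumed labeling
    "L3": ("PROJ", "ASG"),   # assumed labeling
}

def owner(link):
    """Relation at the tail of the link."""
    return LINKS[link][0]

def member(link):
    """Relation at the head of the link."""
    return LINKS[link][1]

print(owner("L1"), member("L1"))
```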

The quantitative information required about the database is the cardinality of each relation R, denoted card(R).

Application Information.

As indicated previously in relation to Figure 3.2, both qualitative and quantitative information is required about applications. The qualitative information guides the fragmentation activity, whereas the quantitative information is incorporated primarily into the allocation models.

The fundamental qualitative information consists of the predicates used in user queries. If it is not possible to analyze all of the user applications to determine these predicates, one should at least investigate the most “important” ones. It has been suggested that as a rule of thumb, the most active 20% of user queries account for 80% of the total data accesses [Wiederhold, 1982]. This “80/20 rule” may be used as a guideline in carrying out this analysis.

At this point we are interested in determining simple predicates. Given a relation R(A1, A2, . . . , An), where Ai is an attribute defined over domain Di, a simple predicate pj defined on R has the form

pj : Ai θ Value

where θ ∈ {=, <, ≠, ≤, >, ≥} and Value is chosen from the domain of Ai (Value ∈ Di). We use Pri to denote the set of all simple predicates defined on a relation Ri. The members of Pri are denoted by pij.

Example 3.5. Given the relation instance PROJ of Figure 3.3,

PNAME = “Maintenance”

is a simple predicate, as well as

BUDGET ≤ 200000 ♦

Even though simple predicates are quite elegant to deal with, user queries quite often include more complicated predicates, which are Boolean combinations of simple predicates. One combination that we are particularly interested in, called a minterm predicate, is the conjunction of simple predicates. Since it is always possible to transform a Boolean expression into conjunctive normal form, the use of minterm predicates in the design algorithms does not cause any loss of generality.

Given a set Pri = {pi1, pi2, . . . , pim} of simple predicates for relation Ri, the set of minterm predicates Mi = {mi1, mi2, . . . , miz} is defined as

Mi = {mij | mij = ⋀_{pik ∈ Pri} p*ik}, 1 ≤ k ≤ m, 1 ≤ j ≤ z

where p*ik = pik or p*ik = ¬pik. So each simple predicate can occur in a minterm predicate either in its natural form or its negated form.

It is important to note that the negation of a predicate is meaningful for equality predicates of the form Attribute = Value. For inequality predicates, the negation should be treated as the complement. For example, the negation of the simple predicate Attribute ≤ Value is Attribute > Value. Besides theoretical problems of complementation in infinite sets, there is also the practical problem that the complement may be difficult to define. For example, if two simple predicates are defined of the form Lower_bound ≤ Attribute_1, and Attribute_1 ≤ Upper_bound, their complements are ¬(Lower_bound ≤ Attribute_1) and ¬(Attribute_1 ≤ Upper_bound). However, the original two simple predicates can be written as Lower_bound ≤ Attribute_1 ≤ Upper_bound with a complement ¬(Lower_bound ≤ Attribute_1 ≤ Upper_bound) that may not be easy to define. Therefore, the research in this area typically considers only simple equality predicates [Ceri et al., 1982b; Ceri and Pelagatti, 1984].

Example 3.6. Consider relation PAY of Figure 3.3. The following are some of the possible simple predicates that can be defined on PAY.

p1: TITLE = “Elect. Eng.”
p2: TITLE = “Syst. Anal.”
p3: TITLE = “Mech. Eng.”
p4: TITLE = “Programmer”
p5: SAL ≤ 30000

The following are some of the minterm predicates that can be defined based on these simple predicates.

m1: TITLE = “Elect. Eng.” ∧ SAL ≤ 30000
m2: TITLE = “Elect. Eng.” ∧ SAL > 30000
m3: ¬(TITLE = “Elect. Eng.”) ∧ SAL ≤ 30000
m4: ¬(TITLE = “Elect. Eng.”) ∧ SAL > 30000
m5: TITLE = “Programmer” ∧ SAL ≤ 30000
m6: TITLE = “Programmer” ∧ SAL > 30000

There are a few points to mention here. First, these are not all the minterm predicates that can be defined; we are presenting only a representative sample. Second, some of these may be meaningless given the semantics of relation PAY; we are not addressing that issue here. Third, these are simplified versions of the minterms. The minterm definition requires each predicate to be in a minterm in either its natural or its negated form. Thus, m1, for example, should be written as

m1: TITLE = “Elect. Eng.” ∧ TITLE ≠ “Syst. Anal.” ∧ TITLE ≠ “Mech. Eng.” ∧ TITLE ≠ “Programmer” ∧ SAL ≤ 30000

However, clearly this is not necessary, and we use the simplified form. Finally, note that there are logically equivalent expressions to these minterms; for example, m3 can also be rewritten as

m3: TITLE ≠ “Elect. Eng.” ∧ SAL ≤ 30000
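The definition of minterm predicates translates directly into an enumeration over natural and negated forms. The following sketch generates the four minterms over two of the simple predicates of Example 3.6; the representation of a predicate as a (label, function) pair is an assumption made for illustration.

```python
from itertools import product

# Simple predicates over a PAY tuple (TITLE, SAL), as (label, function) pairs.
simple = [
    ('TITLE = "Elect. Eng."', lambda t: t[0] == "Elect. Eng."),
    ("SAL <= 30000", lambda t: t[1] <= 30000),
]

def minterms(simple):
    """Yield (label, predicate) for every natural/negated combination."""
    for signs in product([True, False], repeat=len(simple)):
        label = " AND ".join(
            name if keep else f"NOT({name})"
            for (name, _), keep in zip(simple, signs)
        )
        # The default argument binds this combination to this closure.
        def pred(t, signs=signs):
            return all(fn(t) == keep for (_, fn), keep in zip(simple, signs))
        yield label, pred

all_minterms = list(minterms(simple))
for label, _ in all_minterms:
    print(label)
```

With m simple predicates the enumeration produces 2^m minterms, and every tuple satisfies exactly one of them.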

In terms of quantitative information about user applications, we need to have two sets of data.

1. Minterm selectivity: fraction of the tuples of the relation that would be accessed by a user query specified according to a given minterm predicate. For example, the selectivity of m1 of Example 3.6 is 0 since there are no tuples in PAY that satisfy the minterm predicate. The selectivity of m2, on the other hand, is 0.25 since one of the four tuples in PAY satisfies m2. We denote the selectivity of a minterm mi as sel(mi).

2. Access frequency: frequency with which user applications access data. If Q = {q1, q2, . . . , qq} is a set of user queries, acc(qi) indicates the access frequency of query qi in a given period.

Note that minterm access frequencies can be determined from the query frequencies. We refer to the access frequency of a minterm mi as acc(mi).
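The selectivity values quoted for m1 and m2 can be reproduced on the PAY instance of Figure 3.3. The sketch below takes selectivity to be the fraction of tuples satisfying the minterm, which is what the values 0 and 0.25 imply; the function names are illustrative.

```python
# PAY instance from Figure 3.3, as (TITLE, SAL) tuples.
PAY = [
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
]

def sel(minterm, relation):
    """Fraction of the tuples of the relation that satisfy the minterm."""
    return sum(1 for t in relation if minterm(t)) / len(relation)

def m1(t):
    return t[0] == "Elect. Eng." and t[1] <= 30000

def m2(t):
    return t[0] == "Elect. Eng." and t[1] > 30000

print(sel(m1, PAY), sel(m2, PAY))  # 0.0 0.25
```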

3.3.1.2 Primary Horizontal Fragmentation

Before we present a formal algorithm for horizontal fragmentation, we intuitively discuss the process for primary (and derived) horizontal fragmentation. A primary horizontal fragmentation is defined by a selection operation on the owner relations of a database schema. Therefore, given relation R, its horizontal fragments are given by

Ri = σFi(R), 1 ≤ i ≤ w

where Fi is the selection formula used to obtain fragment Ri (also called the fragmentation predicate). Note that if Fi is in conjunctive normal form, it is a minterm predicate (mi). The algorithm we discuss will, in fact, insist that Fi be a minterm predicate.

Example 3.7. The decomposition of relation PROJ into horizontal fragments PROJ1 and PROJ2 in Example 3.1 is defined as follows:¹

PROJ1 = σ_{BUDGET ≤ 200000}(PROJ)
PROJ2 = σ_{BUDGET > 200000}(PROJ)

Example 3.7 demonstrates one of the problems of horizontal partitioning. If the domains of the attributes participating in the selection formulas are continuous and infinite, as in Example 3.7, it is quite difficult to define the set of formulas F = {F1, F2, . . . , Fn} that would fragment the relation properly. One possible course of action is to define ranges as we have done in Example 3.7. However, there is always the problem of handling the two endpoints. For example, if a new tuple with a BUDGET value of, say, $600,000 were to be inserted into PROJ, one would have to review the fragmentation to decide whether the new tuple is to go into PROJ2 or whether the fragments need to be revised and a new fragment needs to be defined as

PROJ2 = σ_{200000 < BUDGET ≤ 400000}(PROJ)
PROJ3 = σ_{BUDGET > 400000}(PROJ)

¹ We assume that the non-negativity of the BUDGET values is a feature of the relation that is enforced by an integrity constraint. Otherwise, a simple predicate of the form 0 ≤ BUDGET also needs to be included in Pr. We assume this to be true in all our examples and discussions in this chapter.

Example 3.8. Consider relation PROJ of Figure 3.3. We can define the following horizontal fragments based on the project location. The resulting fragments are shown in Figure 3.8.

PROJ1 = σ_{LOC = “Montreal”}(PROJ)
PROJ2 = σ_{LOC = “New York”}(PROJ)
PROJ3 = σ_{LOC = “Paris”}(PROJ)

PROJ1
PNO  PNAME            BUDGET  LOC
P1   Instrumentation  150000  Montreal

PROJ2
PNO  PNAME              BUDGET  LOC
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York

PROJ3
PNO  PNAME        BUDGET  LOC
P4   Maintenance  310000  Paris

Fig. 3.8 Primary Horizontal Fragmentation of Relation PROJ
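The fragments of Figure 3.8 are the results of three selections on PROJ. A minimal sketch, assuming each fragmentation formula is given as a Python function over a tuple; the dictionary layout is illustrative only.

```python
# PROJ instance from Figure 3.3, as (PNO, PNAME, BUDGET, LOC) tuples.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]

# One selection formula per fragment, as in Example 3.8.
formulas = {
    "PROJ1": lambda t: t[3] == "Montreal",
    "PROJ2": lambda t: t[3] == "New York",
    "PROJ3": lambda t: t[3] == "Paris",
}

# Each fragment is the selection of PROJ under its formula.
fragments = {name: [t for t in PROJ if f(t)] for name, f in formulas.items()}

for name, frag in fragments.items():
    print(name, [t[0] for t in frag])
```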

Now we can define a horizontal fragment more carefully. A horizontal fragment Ri of relation R consists of all the tuples of R that satisfy a minterm predicate mi. Hence, given a set of minterm predicates M, there are as many horizontal fragments of relation R as there are minterm predicates. This set of horizontal fragments is also commonly referred to as the set of minterm fragments.

From the foregoing discussion it is obvious that the definition of the horizontal fragments depends on minterm predicates. Therefore, the first step of any fragmentation algorithm is to determine a set of simple predicates that will form the minterm predicates.

An important aspect of simple predicates is their completeness; another is their minimality. A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to any tuple belonging to any minterm fragment that is defined according to Pr.²

Example 3.9. Consider the fragmentation of relation PROJ given in Example 3.8. If the only application that accesses PROJ wants to access the tuples according to the location, the set is complete since each tuple of each fragment PROJi (Example 3.8) has the same probability of being accessed. If, however, there is a second application which accesses only those project tuples where the budget is less than or equal to $200,000, then Pr is not complete. Some of the tuples within each PROJi have a higher probability of being accessed due to this second application. To make the set of predicates complete, we need to add (BUDGET ≤ 200000, BUDGET > 200000) to Pr:

Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”, BUDGET ≤ 200000, BUDGET > 200000}

Completeness is a desirable property because fragments obtained according to a complete set of predicates are logically uniform, since they all satisfy the minterm predicate. They are also statistically homogeneous in the way applications access them. These characteristics ensure that the resulting fragmentation results in a balanced load (with respect to the given workload) across all the fragments. Therefore, we will use a complete set of predicates as the basis of primary horizontal fragmentation.

It is possible to define completeness more formally so that a complete set of predicates can be obtained automatically. However, this would require the designer to specify the access probabilities for each tuple of a relation for each application under consideration. This is considerably more work than appealing to the common sense and experience of the designer to come up with a complete set. Shortly, we will present an algorithmic way of obtaining this set.

The second desirable property of the set of predicates, according to which minterm predicates and, in turn, fragments are to be defined, is minimality, which is very intuitive. It simply states that if a predicate influences how fragmentation is performed (i.e., causes a fragment f to be further fragmented into, say, fi and fj), there should be at least one application that accesses fi and fj differently. In other words, the simple predicate should be relevant in determining a fragmentation. If all the predicates of a set Pr are relevant, Pr is minimal.

A formal definition of relevance can be given as follows [Ceri et al., 1982b]. Let mi and mj be two minterm predicates that are identical in their definition, except that mi contains the simple predicate pi in its natural form while mj contains ¬pi. Also, let fi and fj be two fragments defined according to mi and mj, respectively. Then pi is relevant if and only if

acc(mi) / card(fi) ≠ acc(mj) / card(fj)

² It is clear that the definition of completeness of a set of simple predicates is different from the completeness rule of fragmentation given in Section 3.2.4.

Example 3.10. The set Pr defined in Example 3.9 is complete and minimal. If, however, we were to add the predicate

PNAME = “Instrumentation”

to Pr, the resulting set would not be minimal since the new predicate is not relevant with respect to Pr: there is no application that would access the resulting fragments any differently. ♦
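The relevance condition compares access densities of the two fragments that a predicate separates. The sketch below encodes the inequality acc(mi)/card(fi) ≠ acc(mj)/card(fj); the access counts and cardinalities are hypothetical figures chosen only to illustrate the two outcomes.

```python
def is_relevant(acc_i, card_i, acc_j, card_j):
    """True iff acc(mi)/card(fi) != acc(mj)/card(fj).

    Cross-multiplying avoids a division by zero when a fragment is empty;
    how empty fragments should be treated is an assumption of this sketch.
    """
    return acc_i * card_j != acc_j * card_i

# Hypothetical figures: a split that leaves the access density unchanged
# (10 accesses per tuple on both sides), as for PNAME = "Instrumentation".
print(is_relevant(acc_i=10, card_i=1, acc_j=30, card_j=3))  # False

# Hypothetical figures: a split on BUDGET <= 200000 where one side is
# accessed four times as often per tuple as the other.
print(is_relevant(acc_i=40, card_i=2, acc_j=10, card_j=2))  # True
```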

We can now present an iterative algorithm that would generate a complete and minimal set of predicates Pr′ given a set of simple predicates Pr. This algorithm, called COM_MIN, is given in Algorithm 3.1. To avoid lengthy wording, we have adopted the following notation:

Rule 1: each fragment is accessed differently by at least one application.

fi of Pr′: fragment fi defined according to a minterm predicate defined over the predicates of Pr′.

Algorithm 3.1: COM_MIN Algorithm
Input: R: relation; Pr: set of simple predicates
Output: Pr′: set of simple predicates
Declare: F: set of minterm fragments
begin
    find pi ∈ Pr such that pi partitions R according to Rule 1 ;
    Pr′ ← pi ;
    Pr ← Pr − pi ;
    F ← fi    { fi is the minterm fragment according to pi } ;
    repeat
        find a pj ∈ Pr such that pj partitions some fk of Pr′ according to Rule 1 ;
        Pr′ ← Pr′ ∪ pj ;
        Pr ← Pr − pj ;
        F ← F ∪ fj ;
        if ∃pk ∈ Pr′ which is not relevant then
            Pr′ ← Pr′ − pk ;
            F ← F − fk ;
    until Pr′ is complete ;
end

The algorithm begins by finding a predicate that is relevant and that partitions the input relation. The repeat-until loop iteratively adds predicates to this set, ensuring minimality at each step. Therefore, at the end the set Pr′ is both minimal and complete.

The second step in the primary horizontal design process is to derive the set of minterm predicates that can be defined on the predicates in set Pr′. These minterm predicates determine the fragments that are used as candidates in the allocation step. Determination of individual minterm predicates is trivial; the difficulty is that the set of minterm predicates may be quite large (in fact, exponential in the number of simple predicates). We look at ways of reducing the number of minterm predicates that need to be considered in fragmentation.

This reduction can be achieved by eliminating some of the minterm fragments that may be meaningless. This elimination is performed by identifying those minterms that might be contradictory to a set of implications I. For example, if Pr′ = {p1, p2}, where

p1 : att = value_1
p2 : att = value_2

and the domain of att is {value_1, value_2}, it is obvious that I contains two implications:

i1 : (att = value_1) ⇒ ¬(att = value_2)
i2 : ¬(att = value_1) ⇒ (att = value_2)

The following four minterm predicates are defined according to Pr′:

m1 : (att = value_1) ∧ (att = value_2)
m2 : (att = value_1) ∧ ¬(att = value_2)
m3 : ¬(att = value_1) ∧ (att = value_2)
m4 : ¬(att = value_1) ∧ ¬(att = value_2)

In this case the minterm predicates m1 and m4 are contradictory to the implications I and can therefore be eliminated from M.
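One way to mechanize this elimination, when the attribute domain is small and known, is to keep only the minterms that some domain value can satisfy. This replaces the explicit implication set I with a satisfiability check over the domain, which is a substitution made for this sketch rather than the procedure of the text; it yields the same result for the two-valued domain above.

```python
from itertools import product

domain = ["value_1", "value_2"]      # domain of att, as stated in the text
simple = [
    lambda v: v == "value_1",        # p1
    lambda v: v == "value_2",        # p2
]

def satisfiable(signs):
    """A minterm is kept only if at least one domain value satisfies it."""
    return any(all(p(v) == s for p, s in zip(simple, signs)) for v in domain)

# True means the predicate appears in natural form, False in negated form.
kept = [s for s in product([True, False], repeat=len(simple)) if satisfiable(s)]
print(kept)  # [(True, False), (False, True)], that is, m2 and m3
```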

The algorithm for primary horizontal fragmentation is given in Algorithm 3.2. The input to the algorithm PHORIZONTAL is a relation R that is subject to primary horizontal fragmentation, and Pr, which is the set of simple predicates that have been determined according to applications defined on relation R.

Example 3.11. We now consider the design of the database scheme given in Figure 3.7. The first thing to note is that there are two relations that are the subject of primary horizontal fragmentation: PAY and PROJ.

Algorithm 3.2: PHORIZONTAL Algorithm
Input: R: relation; Pr: set of simple predicates
Output: M: set of minterm fragments
begin
    Pr′ ← COM_MIN(R, Pr) ;
    determine the set M of minterm predicates ;
    determine the set I of implications among pi ∈ Pr′ ;
    foreach mi ∈ M do
        if mi is contradictory according to I then
            M ← M − mi
end

Suppose that there is only one application that accesses PAY, which checks the salary information and determines a raise accordingly. Assume that employee records are managed in two places, one handling the records of those with salaries less than or equal to $30,000, and the other handling the records of those who earn more than $30,000. Therefore, the query is issued at two sites.

The simple predicates that would be used to partition relation PAY are

p1: SAL ≤ 30000
p2: SAL > 30000

thus giving the initial set of simple predicates Pr = {p1, p2}. Applying the COM_MIN algorithm with i = 1 as initial value results in Pr′ = {p1}. This is complete and minimal since p2 would not partition f1 (which is the minterm fragment formed with respect to p1) according to Rule 1. We can form the following minterm predicates as members of M:

m1: (SAL ≤ 30000)
m2: ¬(SAL ≤ 30000) = SAL > 30000

Therefore, we define two fragments FPAY = {PAY1, PAY2} according to M (Figure 3.9).

PAY1
TITLE       SAL
Mech. Eng.  27000
Programmer  24000

PAY2
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000

Fig. 3.9 Horizontal Fragmentation of Relation PAY

Let us next consider relation PROJ. Assume that there are two applications. The first is issued at three sites and finds the names and budgets of projects given their location. In SQL notation, the query is

    SELECT PNAME, BUDGET
    FROM   PROJ
    WHERE  LOC=Value

For this application, the simple predicates that would be used are the following:

p1: LOC = “Montreal”
p2: LOC = “New York”
p3: LOC = “Paris”

The second application is issued at two sites and has to do with the management of the projects. Those projects that have a budget of less than or equal to $200,000 are managed at one site, whereas those with larger budgets are managed at a second site. Thus, the simple predicates that should be used to fragment according to the second application are

p4: BUDGET ≤ 200000
p5: BUDGET > 200000

If the COM_MIN algorithm is followed, the set Pr′ = {p1, p2, p4} is obviously complete and minimal. Actually COM_MIN would add any two of p1, p2, p3 to Pr′; in this example we have chosen to include p1, p2.

Based on Pr′, the following six minterm predicates that form M can be defined:

m1: (LOC = “Montreal”) ∧ (BUDGET ≤ 200000)
m2: (LOC = “Montreal”) ∧ (BUDGET > 200000)
m3: (LOC = “New York”) ∧ (BUDGET ≤ 200000)
m4: (LOC = “New York”) ∧ (BUDGET > 200000)
m5: (LOC = “Paris”) ∧ (BUDGET ≤ 200000)
m6: (LOC = “Paris”) ∧ (BUDGET > 200000)

As noted in Example 3.6, these are not the only minterm predicates that can be generated. It is, for example, possible to specify predicates of the form

p1 ∧ p2 ∧ p3 ∧ p4 ∧ p5

However, the obvious implications

i1: p1 ⇒ ¬p2 ∧ ¬p3
i2: p2 ⇒ ¬p1 ∧ ¬p3
i3: p3 ⇒ ¬p1 ∧ ¬p2
i4: p4 ⇒ ¬p5
i5: p5 ⇒ ¬p4
i6: ¬p4 ⇒ p5
i7: ¬p5 ⇒ p4

eliminate these minterm predicates and we are left with m1 to m6.

Looking at the database instance in Figure 3.3, one may be tempted to claim that the following implications hold:

i8: LOC = “Montreal” ⇒ ¬(BUDGET > 200000)
i9: LOC = “Paris” ⇒ ¬(BUDGET ≤ 200000)
i10: ¬(LOC = “Montreal”) ⇒ BUDGET ≤ 200000
i11: ¬(LOC = “Paris”) ⇒ BUDGET > 200000

However, remember that implications should be defined according to the semantics of the database, not according to the current values. There is nothing in the database semantics that suggests that the implications i8 through i11 hold. Some of the fragments defined according to M = {m1, . . . , m6} may be empty, but they are, nevertheless, fragments.

The result of the primary horizontal fragmentation of PROJ is to form six fragments FPROJ = {PROJ1, PROJ2, PROJ3, PROJ4, PROJ5, PROJ6} of relation PROJ according to the minterm predicates M (Figure 3.10). Since fragments PROJ2 and PROJ5 are empty, they are not depicted in Figure 3.10. ♦

PROJ1
PNO  PNAME            BUDGET  LOC
P1   Instrumentation  150000  Montreal

PROJ3
PNO  PNAME              BUDGET  LOC
P2   Database Develop.  135000  New York

PROJ4
PNO  PNAME    BUDGET  LOC
P3   CAD/CAM  250000  New York

PROJ6
PNO  PNAME        BUDGET  LOC
P4   Maintenance  310000  Paris

Fig. 3.10 Horizontal Partitioning of Relation PROJ
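The six fragments of Example 3.11 can be reproduced with a short sketch of the minterm step. It starts from Pr′ = {p1, p2, p4} as given in the text (COM_MIN itself is not implemented here) and, as an assumption of this sketch, eliminates contradictory minterms by testing satisfiability over an assumed three-valued LOC domain and one sample budget on each side of the threshold, instead of using the implication set I.

```python
from itertools import product

# PROJ instance from Figure 3.3, as (PNO, PNAME, BUDGET, LOC) tuples.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]
LOC_DOMAIN = ["Montreal", "New York", "Paris"]  # assumed domain of LOC
BUDGET_SAMPLES = [200000, 200001]               # one value per side of p4

pr_prime = [
    lambda loc, bud: loc == "Montreal",   # p1
    lambda loc, bud: loc == "New York",   # p2
    lambda loc, bud: bud <= 200000,       # p4
]

def holds(signs, loc, bud):
    """True if (loc, bud) satisfies the minterm encoded by signs."""
    return all(p(loc, bud) == s for p, s in zip(pr_prime, signs))

# Keep the minterms that some (LOC, BUDGET) combination can satisfy.
minterms = [
    signs
    for signs in product([True, False], repeat=len(pr_prime))
    if any(holds(signs, loc, bud)
           for loc, bud in product(LOC_DOMAIN, BUDGET_SAMPLES))
]

fragments = {
    f"PROJ{i}": [t[0] for t in PROJ if holds(signs, t[3], t[2])]
    for i, signs in enumerate(minterms, start=1)
}
print(fragments)
```

The enumeration order happens to match m1 through m6, so PROJ2 and PROJ5 come out empty, in agreement with the example. They remain fragments because the elimination uses the domains, not the current tuples.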

3.3.1.3 Derived Horizontal Fragmentation

A derived horizontal fragmentation is defined on a member relation of a link according to a selection operation specified on its owner. It is important to remember two points. First, the link between the owner and the member relations is defined as an equi-join. Second, an equi-join can be implemented by means of semijoins. This second point is especially important for our purposes, since we want to partition a member relation according to the fragmentation of its owner, but we also want the resulting fragment to be defined only on the attributes of the member relation.

Accordingly, given a link L where owner(L) = S and member(L) = R, the derived horizontal fragments of R are defined as

Ri = R ⋉ Si, 1 ≤ i ≤ w

where w is the maximum number of fragments that will be defined on R, and Si = σFi (S), where Fi is the formula according to which the primary horizontal fragment Si is defined.

Example 3.12. Consider link L1 in Figure 3.7, where owner(L1) = PAY and member(L1) = EMP. Then we can group employees into two groups according to their salary: those making less than or equal to $30,000, and those making more than $30,000. The two fragments EMP1 and EMP2 are defined as follows:

EMP1 = EMP ⋉ PAY1
EMP2 = EMP ⋉ PAY2

where

PAY1 = σ_{SAL ≤ 30000}(PAY)
PAY2 = σ_{SAL > 30000}(PAY)

The result of this fragmentation is depicted in Figure 3.11. ♦

EMP1
ENO  ENAME      TITLE
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E7   R. Davis   Mech. Eng.

EMP2
ENO  ENAME     TITLE
E1   J. Doe    Elect. Eng.
E2   M. Smith  Syst. Anal.
E5   B. Casey  Syst. Anal.
E6   L. Chu    Elect. Eng.
E8   J. Jones  Syst. Anal.

Fig. 3.11 Derived Horizontal Fragmentation of Relation EMP
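The derived fragments of Figure 3.11 follow from two semijoins on TITLE. A minimal sketch, assuming relations are lists of tuples and attributes are addressed by position; the semijoin helper is illustrative, not a library function.

```python
# EMP and PAY instances from Figure 3.3.
EMP = [
    ("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."), ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."), ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."), ("E8", "J. Jones", "Syst. Anal."),
]
PAY = [
    ("Elect. Eng.", 40000), ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000), ("Programmer", 24000),
]

def semijoin(member, owner, member_attr, owner_attr):
    """Tuples of member that join with at least one tuple of owner."""
    keys = {o[owner_attr] for o in owner}
    return [m for m in member if m[member_attr] in keys]

# Primary fragments of the owner relation.
PAY1 = [t for t in PAY if t[1] <= 30000]
PAY2 = [t for t in PAY if t[1] > 30000]

# Derived fragments of the member relation (EMP.TITLE = PAY.TITLE).
EMP1 = semijoin(EMP, PAY1, member_attr=2, owner_attr=0)
EMP2 = semijoin(EMP, PAY2, member_attr=2, owner_attr=0)

print([e[0] for e in EMP1], [e[0] for e in EMP2])
```

Because a semijoin returns only member tuples, EMP1 and EMP2 carry the attributes of EMP alone, as the definition requires.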

To carry out a derived horizontal fragmentation, three inputs are needed: the set of partitions of the owner relation (e.g., PAY1 and PAY2 in Example 3.12), the member relation, and the set of semijoin predicates between the owner and the member (e.g., EMP.TITLE = PAY.TITLE in Example 3.12). The fragmentation algorithm, then, is quite trivial, so we will not present it in any detail.

There is one potential complication that deserves some attention. In a database schema, it is common that there is more than one link into a relation R (e.g., in Figure 3.7, ASG has two incoming links). In this case there is more than one possible derived horizontal fragmentation of R. The choice of candidate fragmentation is based on two criteria:

1. The fragmentation with better join characteristics

2. The fragmentation used in more applications

Let us discuss the second criterion first. This is quite straightforward if we take into consideration the frequency with which applications access some data. If possible, one should try to facilitate the accesses of the “heavy” users so that their total impact on system performance is minimized.

Applying the first criterion, however, is not that straightforward. Consider, for example, the fragmentation we discussed in Example 3.1. The effect (and the objective) of this fragmentation is that the join of the EMP and PAY relations to answer the query is assisted (1) by performing it on smaller relations (i.e., fragments), and (2) by potentially performing joins in parallel.

The first point is obvious. The fragments of EMP are smaller than EMP itself. Therefore, it will be faster to join any fragment of PAY with any fragment of EMP than to work with the relations themselves. The second point, however, is more important and is at the heart of distributed databases. If, besides executing a number of queries at different sites, we can parallelize execution of one join query, the response time or throughput of the system can be expected to improve. In the case of joins, this is possible under certain circumstances. Consider, for example, the join graph (i.e., the links) between the fragments of EMP and PAY derived in Example 3.10 (Figure 3.12). There is only one link coming in or going out of a fragment. Such a join graph is called a simple graph. The advantage of a design where the join relationship between fragments is simple is that the member and owner of a link can be allocated to one site and the joins between different pairs of fragments can proceed independently and in parallel.

[Figure: join graph in which PAY1 (TITLE, SAL) is linked to EMP1 (ENO, ENAME, TITLE) and PAY2 (TITLE, SAL) is linked to EMP2 (ENO, ENAME, TITLE)]

Fig. 3.12 Join Graph Between Fragments

Unfortunately, obtaining simple join graphs may not always be possible. In that case, the next desirable alternative is to have a design that results in a partitioned join graph. A partitioned graph consists of two or more subgraphs with no links between them. Fragments so obtained may not be distributed for parallel execution as easily as those obtained via simple join graphs, but the allocation is still possible.

Example 3.13. Let us continue with the distribution design of the database we started in Example 3.11. We already decided on the fragmentation of relation EMP according to the fragmentation of PAY (Example 3.12). Let us now consider ASG. Assume that there are the following two applications:

1. The first application finds the names of engineers who work at certain places. It runs on all three sites and accesses the information about the engineers who work on local projects with higher probability than those of projects at other locations.

2. At each administrative site where employee records are managed, users would like to access the responsibilities on the projects that these employees work on and learn how long they will work on those projects.

The first application results in a fragmentation of ASG according to the (non-empty) fragments PROJ1, PROJ3, PROJ4 and PROJ6 of PROJ obtained in Example 3.11. Remember that

PROJ1: σ_{LOC=“Montreal” ∧ BUDGET≤200000}(PROJ)
PROJ3: σ_{LOC=“New York” ∧ BUDGET≤200000}(PROJ)
PROJ4: σ_{LOC=“New York” ∧ BUDGET>200000}(PROJ)
PROJ6: σ_{LOC=“Paris” ∧ BUDGET>200000}(PROJ)

Therefore, the derived fragmentation of ASG according to {PROJ1, PROJ3, PROJ4, PROJ6} is defined as follows:

ASG1 = ASG ⋉ PROJ1
ASG2 = ASG ⋉ PROJ3
ASG3 = ASG ⋉ PROJ4
ASG4 = ASG ⋉ PROJ6

These fragment instances are shown in Figure 3.13. The second query can be specified in SQL as

SELECT RESP, DUR
FROM ASG, EMPi
WHERE ASG.ENO = EMPi.ENO

where i = 1 or i = 2, depending on the site where the query is issued. The derived fragmentation of ASG according to the fragmentation of EMP is defined below and depicted in Figure 3.14.

ASG1 = ASG ⋉ EMP1
ASG2 = ASG ⋉ EMP2

ASG1

ENO  PNO  RESP        DUR
E1   P1   Manager     12
E2   P1   Analyst     24

ASG2

ENO  PNO  RESP        DUR
E2   P2   Analyst     6
E4   P2   Programmer  18
E5   P2   Manager     24

ASG3

ENO  PNO  RESP        DUR
E3   P3   Consultant  10
E7   P3   Engineer    36
E8   P3   Manager     40

ASG4

ENO  PNO  RESP        DUR
E3   P4   Engineer    48
E6   P4   Manager     48

Fig. 3.13 Derived Fragmentation of ASG with respect to PROJ

ASG1

ENO  PNO  RESP        DUR
E3   P3   Consultant  10
E3   P4   Engineer    48
E4   P2   Programmer  18
E7   P3   Engineer    36

ASG2

ENO  PNO  RESP        DUR
E1   P1   Manager     12
E2   P1   Analyst     24
E2   P2   Analyst     6
E5   P2   Manager     24
E6   P4   Manager     48
E8   P3   Manager     40

Fig. 3.14 Derived Fragmentation of ASG with respect to EMP

This example demonstrates two things:

1. Derived fragmentation may follow a chain where one relation is fragmented as a result of another one’s design and it, in turn, causes the fragmentation of another relation (e.g., the chain PAY→EMP→ASG).

2. Typically, there will be more than one candidate fragmentation for a relation (e.g., relation ASG). The final choice of the fragmentation scheme may be a decision problem addressed during allocation.

3.3.1.4 Checking for Correctness

We should now check the fragmentation algorithms discussed so far with respect to the three correctness criteria presented in Section 3.2.4.

Completeness.

The completeness of a primary horizontal fragmentation is based on the selection predicates used. As long as the selection predicates are complete, the resulting fragmentation is guaranteed to be complete as well. Since the basis of the fragmentation algorithm is a set of complete and minimal predicates, Pr′, completeness is guaranteed as long as no mistakes are made in defining Pr′.

The completeness of a derived horizontal fragmentation is somewhat more difficult to define. The difficulty is due to the fact that the predicate determining the fragmentation involves two relations. Let us first define the completeness rule formally and then look at an example.

Let R be the member relation of a link whose owner is relation S, where R and S are fragmented as FR = {R1, R2, . . . , Rw} and FS = {S1, S2, . . . , Sw}, respectively. Furthermore, let A be the join attribute between R and S. Then for each tuple t of Ri, there should be a tuple t′ of Si such that t[A] = t′[A].

For example, there should be no ASG tuple which has a project number that is not also contained in PROJ. Similarly, there should be no EMP tuples with TITLE values where the same TITLE value does not appear in PAY as well. This rule is known as referential integrity and ensures that the tuples of any fragment of the member relation are also in the owner relation.

Reconstruction.

Reconstruction of a global relation from its fragments is performed by the union operator in both the primary and the derived horizontal fragmentation. Thus, for a relation R with fragmentation FR = {R1,R2, . . . ,Rw},

$$R = \bigcup_{\forall R_i \in F_R} R_i$$

Disjointness.

It is easier to establish disjointness of fragmentation for primary than for derived horizontal fragmentation. In the former case, disjointness is guaranteed as long as the minterm predicates determining the fragmentation are mutually exclusive.

In derived fragmentation, however, there is a semijoin involved that adds considerable complexity. Disjointness can be guaranteed if the join graph is simple. Otherwise, it is necessary to investigate actual tuple values. In general, we do not want a tuple of a member relation to join with two or more tuples of the owner relation when these tuples are in different fragments of the owner. This may not be very easy to establish, and illustrates why derived fragmentation schemes that generate a simple join graph are always desirable.

Example 3.14. In fragmenting relation PAY (Example 3.11), the minterm predicates M = {m1,m2} were

m1: SAL ≤ 30000
m2: SAL > 30000

Since m1 and m2 are mutually exclusive, the fragmentation of PAY is disjoint. For relation EMP, however, we require that

1. Each engineer has a single title.

2. Each title has a single salary value associated with it.

Since these two rules follow from the semantics of the database, the fragmentation of EMP with respect to PAY is also disjoint.
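The three correctness criteria can also be checked mechanically on a concrete instance. The sketch below is an illustration only: it uses the ASG tuples of Figure 3.14 and the ENO values of EMP1 and EMP2 from Figure 3.11, and the function name `check_derived_fragmentation` is hypothetical.

```python
# ASG tuples (ENO, PNO, RESP, DUR) as listed in Figure 3.14.
ASG = [
    ("E1", "P1", "Manager", 12),
    ("E2", "P1", "Analyst", 24),
    ("E2", "P2", "Analyst", 6),
    ("E3", "P3", "Consultant", 10),
    ("E3", "P4", "Engineer", 48),
    ("E4", "P2", "Programmer", 18),
    ("E5", "P2", "Manager", 24),
    ("E6", "P4", "Manager", 48),
    ("E7", "P3", "Engineer", 36),
    ("E8", "P3", "Manager", 40),
]
# Join attribute values (ENO) present in the owner fragments EMP1 and EMP2.
EMP_FRAGMENT_KEYS = [{"E3", "E4", "E7"}, {"E1", "E2", "E5", "E6", "E8"}]


def check_derived_fragmentation(member, owner_key_sets, key_of):
    """Build the derived fragments and test the three correctness criteria."""
    fragments = [[t for t in member if key_of(t) in keys]
                 for keys in owner_key_sets]
    all_owner_keys = set().union(*owner_key_sets)
    # Completeness (referential integrity): every member tuple has an owner.
    complete = all(key_of(t) in all_owner_keys for t in member)
    # Reconstruction: the union of the fragments gives back the relation.
    union = set().union(*(set(f) for f in fragments))
    reconstructs = union == set(member)
    # Disjointness: no tuple is counted in more than one fragment.
    disjoint = sum(len(f) for f in fragments) == len(union)
    return fragments, complete, reconstructs, disjoint


fragments, complete, reconstructs, disjoint = check_derived_fragmentation(
    ASG, EMP_FRAGMENT_KEYS, key_of=lambda t: t[0])
print([len(f) for f in fragments], complete, reconstructs, disjoint)
# [4, 6] True True True
```

Disjointness holds here because the two sets of ENO values do not overlap; if an owner key appeared in two owner fragments, the same member tuple would land in both derived fragments and the check would fail.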

3.3.2 Vertical Fragmentation

Remember that a vertical fragmentation of a relation R produces fragments R1,R2, . . . ,Rr, each of which contains a subset of R’s attributes as well as the primary key of R. The objective of vertical fragmentation is to partition a relation into a set of smaller relations so that many of the user applications will run on only one fragment. In this context, an “optimal” fragmentation is one that produces a fragmentation scheme which minimizes the execution time of user applications that run on these fragments.

Vertical fragmentation has been investigated within the context of centralized database systems as well as distributed ones. Its motivation within the centralized context is as a design tool, which allows the user queries to deal with smaller relations, thus causing a smaller number of page accesses [Navathe et al., 1984]. It has also been suggested that the most “active” subrelations can be identified and placed in a faster memory subsystem in those cases where memory hierarchies are supported [Eisner and Severance, 1976].

Vertical partitioning is inherently more complicated than horizontal partitioning. This is due to the total number of alternatives that are available. For example, in horizontal partitioning, if the total number of simple predicates in Pr is n, there are 2^n possible minterm predicates that can be defined on it. In addition, we know that some of these will contradict the existing implications, further reducing the candidate fragments that need to be considered. In the case of vertical partitioning, however, if a relation has m non-primary key attributes, the number of possible fragments is equal to B(m), which is the mth Bell number [Niamir, 1978]. For large values of m, B(m) ≈ m^m; for example, for m = 10, B(m) ≈ 115,000, for m = 15, B(m) ≈ 10^9, for m = 30, B(m) ≈ 10^23 [Hammer and Niamir, 1979; Navathe et al., 1984].
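The growth of B(m) is easy to check numerically. The following sketch computes Bell numbers with the Bell triangle recurrence; the function name `bell_numbers` is chosen for this illustration.

```python
def bell_numbers(limit):
    """Return [B(0), B(1), ..., B(limit)] computed with the Bell triangle."""
    row = [1]
    bells = [1]  # B(0) = 1
    for _ in range(limit):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left of it.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


B = bell_numbers(30)
print(B[10])           # 115975
print(B[15])           # 1382958545
print(f"{B[30]:.2e}")  # on the order of 10^23
```

The exact values agree with the orders of magnitude quoted above: about 115,000 partitions for 10 attributes and more than 10^9 for 15.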

These values indicate that it is futile to attempt to obtain optimal solutions to the vertical partitioning problem; one has to resort to heuristics. Two types of heuristic approaches exist for the vertical fragmentation of global relations:

1. Grouping: starts by assigning each attribute to one fragment, and at each step, joins some of the fragments until some criterion is satisfied. Grouping was first suggested for centralized databases [Hammer and Niamir, 1979], and was used later for distributed databases [Sacca and Wiederhold, 1985].

2. Splitting: starts with a relation and decides on beneficial partitionings based on the access behavior of applications to the attributes. The technique was also first discussed for centralized database design [Hoffer and Severance, 1975]. It was then extended to the distributed environment [Navathe et al., 1984].

In what follows we discuss only the splitting technique, for two reasons. First, it fits more naturally within the top-down design methodology, since the “optimal” solution is probably closer to the full relation than to a set of fragments each of which consists of a single attribute [Navathe et al., 1984]. Second, splitting generates non-overlapping fragments whereas grouping typically results in overlapping fragments. We prefer non-overlapping fragments for disjointness. Of course, non-overlapping refers only to non-primary key attributes.

Before we proceed, let us clarify an issue that we only mentioned in Example 3.2, namely, the replication of the global relation’s key in the fragments. This is a characteristic of vertical fragmentation that allows the reconstruction of the global relation. Therefore, splitting is considered only for those attributes that do not participate in the primary key.

There is a strong advantage to replicating the key attributes despite the obvious problems it causes. This advantage has to do with semantic integrity enforcement, to be discussed in Chapter 5. Note that the dependencies briefly discussed in Section 2.1 are, in fact, constraints that have to hold among the attribute values of the respective relations at all times. Remember also that most of these dependencies involve the key attributes of a relation. If we now design the database so that the key attributes are part of one fragment that is allocated to one site, and the implied attributes are part of another fragment that is allocated to a second site, every update request that causes an integrity check will necessitate communication among sites. Replication of the key attributes at each fragment reduces the chances of this occurring but does not eliminate it completely, since such communication may be necessary due to integrity constraints that do not involve the primary key, as well as due to concurrency control.

One alternative to the replication of the key attributes is the use of tuple identifiers (TIDs), which are system-assigned unique values to the tuples of a relation. Since TIDs are maintained by the system, the fragments are disjoint at a logical level.


3.3.2.1 Information Requirements of Vertical Fragmentation

The major information required for vertical fragmentation is related to applications. The following discussion, therefore, is exclusively focused on what needs to be determined about applications that will run against the distributed database. Since vertical partitioning places in one fragment those attributes usually accessed together, there is a need for some measure that would define more precisely the notion of “togetherness.” This measure is the affinity of attributes, which indicates how closely related the attributes are. Unfortunately, it is not realistic to expect the designer or the users to be able to easily specify these values. We now present one way by which they can be obtained from more primitive data.

The major information requirement related to applications is their access frequencies. Let Q = {q1, q2, . . . , qq} be the set of user queries (applications) that access relation R(A1, A2, . . . , An). Then, for each query qi and each attribute Aj, we associate an attribute usage value, denoted as use(qi, Aj), and defined as follows:

$$\mathrm{use}(q_i, A_j) = \begin{cases} 1 & \text{if attribute } A_j \text{ is referenced by query } q_i \\ 0 & \text{otherwise} \end{cases}$$

The use(qi,•) vectors for each application are easy to define if the designer knows the applications that will run on the database. Again, remember that the 80-20 rule discussed earlier should be helpful in this task.

Example 3.15. Consider relation PROJ of Figure 3.3. Assume that the following applications are defined to run on this relation. In each case we also give the SQL specification.

q1: Find the budget of a project, given its identification number.

SELECT BUDGET
FROM PROJ
WHERE PNO=Value

q2: Find the names and budgets of all projects.

SELECT PNAME, BUDGET
FROM PROJ

q3: Find the names of projects located at a given city.

SELECT PNAME
FROM PROJ
WHERE LOC=Value

q4: Find the total project budgets for each city.

SELECT SUM(BUDGET)
FROM PROJ
WHERE LOC=Value

According to these four applications, the attribute usage values can be defined. As a notational convenience, we let A1 = PNO, A2 = PNAME, A3 = BUDGET, and A4 = LOC. The usage values are defined in matrix form (Figure 3.15), where entry (i, j) denotes use(qi, Aj).

      A1   A2   A3   A4
q1    1    0    1    0
q2    0    1    1    0
q3    0    1    0    1
q4    0    0    1    1

Fig. 3.15 Example Attribute Usage Matrix

Attribute usage values are not sufficiently general to form the basis of attribute splitting and fragmentation. This is because these values do not represent the weight of application frequencies. The frequency measure can be included in the definition of the attribute affinity measure aff(Ai, Aj), which measures the bond between two attributes of a relation according to how they are accessed by applications.

The attribute affinity measure between two attributes Ai and Aj of a relation R(A1, A2, . . . , An) with respect to the set of applications Q = {q1, q2, . . . , qq} is defined as

$$\mathrm{aff}(A_i, A_j) = \sum_{k \,\mid\, \mathrm{use}(q_k, A_i) = 1 \,\wedge\, \mathrm{use}(q_k, A_j) = 1} \;\; \sum_{\forall S_l} \mathrm{ref}_l(q_k)\, \mathrm{acc}_l(q_k)$$

where ref_l(qk) is the number of accesses to attributes (Ai, Aj) for each execution of application qk at site Sl and acc_l(qk) is the application access frequency measure previously defined and modified to include frequencies at different sites.

The result of this computation is an n×n matrix, each element of which is one of the measures defined above. We call this matrix the attribute affinity matrix (AA).

Example 3.16. Let us continue with the case that we examined in Example 3.15. For simplicity, let us assume that ref_l(qk) = 1 for all qk and Sl. If the application frequencies are

acc1(q1) = 15   acc2(q1) = 20   acc3(q1) = 10
acc1(q2) = 5    acc2(q2) = 0    acc3(q2) = 0
acc1(q3) = 25   acc2(q3) = 25   acc3(q3) = 25
acc1(q4) = 3    acc2(q4) = 0    acc3(q4) = 0

then the affinity measure between attributes A1 and A3 can be measured as

$$\mathrm{aff}(A_1, A_3) = \sum_{k=1}^{1} \sum_{l=1}^{3} \mathrm{acc}_l(q_k) = \mathrm{acc}_1(q_1) + \mathrm{acc}_2(q_1) + \mathrm{acc}_3(q_1) = 45$$

since the only application that accesses both of the attributes is q1. The complete attribute affinity matrix is shown in Figure 3.16. Note that the diagonal values are not computed since they are meaningless.

      A1   A2   A3   A4
A1    -    0    45   0
A2    0    -    5    75
A3    45   5    -    3
A4    0    75   3    -

Fig. 3.16 Attribute Affinity Matrix
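The computation of the affinity values from the usage matrix and the access frequencies can be sketched as follows. The function name `affinity_matrix` and the list-of-lists layout are assumptions of this illustration; the input data are those of Figure 3.15 and Example 3.16. Unlike Figure 3.16, the sketch also applies the formula to the diagonal, which yields 45, 80, 53 and 78, the values that appear on the diagonal in Figure 3.17.

```python
# USE[k][j] = use(q_{k+1}, A_{j+1}), from Figure 3.15.
USE = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
# ACC[k][l] = acc_{l+1}(q_{k+1}), from Example 3.16 (three sites).
ACC = [
    [15, 20, 10],
    [5, 0, 0],
    [25, 25, 25],
    [3, 0, 0],
]


def affinity_matrix(use, acc, ref=None):
    """Return the n x n attribute affinity matrix AA.

    ref[k][l] defaults to 1 for every query and site, as in Example 3.16.
    """
    if ref is None:
        ref = [[1] * len(row) for row in acc]
    # Total weighted access count of each query over all sites.
    weight = [sum(r * a for r, a in zip(ref_k, acc_k))
              for ref_k, acc_k in zip(ref, acc)]
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for k, row in enumerate(use):
        for i in range(n):
            for j in range(n):
                # Query k contributes only if it references both attributes.
                if row[i] == 1 and row[j] == 1:
                    aa[i][j] += weight[k]
    return aa


AA = affinity_matrix(USE, ACC)
for line in AA:
    print(line)
# [45, 0, 45, 0]
# [0, 80, 5, 75]
# [45, 5, 53, 3]
# [0, 75, 3, 78]
```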

The attribute affinity matrix will be used in the rest of this chapter to guide the fragmentation effort. The process involves first clustering together the attributes with high affinity for each other, and then splitting the relation accordingly.

3.3.2.2 Clustering Algorithm

The fundamental task in designing a vertical fragmentation algorithm is to find some means of grouping the attributes of a relation based on the attribute affinity values in AA. It has been suggested that the bond energy algorithm (BEA) [McCormick et al., 1972] should be used for this purpose ([Hoffer and Severance, 1975] and [Navathe et al., 1984]). It is considered appropriate for the following reasons [Hoffer and Severance, 1975]:

1. It is designed specifically to determine groups of similar items as opposed to, say, a linear ordering of the items (i.e., it clusters the attributes with larger affinity values together, and the ones with smaller values together).

2. The final groupings are insensitive to the order in which items are presented to the algorithm.

3. The computation time of the algorithm is reasonable: O(n^2), where n is the number of attributes.

4. Secondary interrelationships between clustered attribute groups are identifiable.

The bond energy algorithm takes as input the attribute affinity matrix, permutes its rows and columns, and generates a clustered affinity matrix (CA). The permutation is done in such a way as to maximize the following global affinity measure (AM):

$$AM = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{aff}(A_i, A_j)\,[\mathrm{aff}(A_i, A_{j-1}) + \mathrm{aff}(A_i, A_{j+1}) + \mathrm{aff}(A_{i-1}, A_j) + \mathrm{aff}(A_{i+1}, A_j)]$$

where

$$\mathrm{aff}(A_0, A_j) = \mathrm{aff}(A_i, A_0) = \mathrm{aff}(A_{n+1}, A_j) = \mathrm{aff}(A_i, A_{n+1}) = 0$$

The last set of conditions takes care of the cases where an attribute is being placed in CA to the left of the leftmost attribute or to the right of the rightmost attribute during column permutations, and prior to the topmost row and following the last row during row permutations. In these cases, we take 0 to be the aff values between the attribute being considered for placement and its left or right (top or bottom) neighbors, which do not exist in CA.

The maximization function considers the nearest neighbors only, thereby resulting in the grouping of large values with large ones, and small values with small ones. Also, the attribute affinity matrix (AA) is symmetric, which reduces the objective function of the formulation above to

$$AM = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{aff}(A_i, A_j)\,[\mathrm{aff}(A_i, A_{j-1}) + \mathrm{aff}(A_i, A_{j+1})]$$

The details of the bond energy algorithm are given in Algorithm 3.3. Generation of the clustered affinity matrix (CA) is done in three steps:

1. Initialization. Place and fix one of the columns of AA arbitrarily into CA. Column 1 was chosen in the algorithm.

2. Iteration. Pick each of the remaining n− i columns (where i is the number of columns already placed in CA) and try to place them in the remaining i+1 positions in the CA matrix. Choose the placement that makes the greatest contribution to the global affinity measure described above. Continue this step until no more columns remain to be placed.

3. Row ordering. Once the column ordering is determined, the placement of the rows should also be changed so that their relative positions match the relative positions of the columns.³

³ From now on, we may refer to elements of the AA and CA matrices as AA(i, j) and CA(i, j), respectively. This is done for notational convenience only. The mapping to the affinity measures is AA(i, j) = aff(Ai, Aj) and CA(i, j) = aff(attribute placed at column i in CA, attribute placed at column j in CA). Even though AA and CA matrices are identical except for the ordering of attributes, since the algorithm orders all the CA columns before it orders the rows, the affinity measure of CA is specified with respect to columns. Note that the endpoint condition for the calculation of the affinity measure (AM) can be specified, using this notation, as CA(0, j) = CA(i, 0) = CA(n+1, j) = CA(i, n+1) = 0.

Algorithm 3.3: BEA Algorithm
Input: AA: attribute affinity matrix
Output: CA: clustered affinity matrix
begin
    {initialize; remember that AA is an n×n matrix}
    CA(•,1) ← AA(•,1) ;
    CA(•,2) ← AA(•,2) ;
    index ← 3 ;
    while index ≤ n do    {choose the “best” location for attribute AAindex}
        for i from 1 to index−1 by 1 do
            calculate cont(Ai−1, Aindex, Ai) ;
        calculate cont(Aindex−1, Aindex, Aindex+1) ;    {boundary condition}
        loc ← placement given by maximum cont value ;
        for j from index to loc by −1 do    {shuffle the two matrices}
            CA(•, j) ← CA(•, j−1) ;
        CA(•, loc) ← AA(•, index) ;
        index ← index+1
    order the rows according to the relative ordering of columns
end

For the second step of the algorithm to work, we need to define what is meant by the contribution of an attribute to the affinity measure. This contribution can be derived as follows. Recall that the global affinity measure AM was previously defined as

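$$AM = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{aff}(A_i, A_j)\,[\mathrm{aff}(A_i, A_{j-1}) + \mathrm{aff}(A_i, A_{j+1})]$$

which can be rewritten as

$$\begin{aligned} AM &= \sum_{i=1}^{n} \sum_{j=1}^{n} [\mathrm{aff}(A_i, A_j)\,\mathrm{aff}(A_i, A_{j-1}) + \mathrm{aff}(A_i, A_j)\,\mathrm{aff}(A_i, A_{j+1})] \\ &= \sum_{j=1}^{n} \left[ \sum_{i=1}^{n} \mathrm{aff}(A_i, A_j)\,\mathrm{aff}(A_i, A_{j-1}) + \sum_{i=1}^{n} \mathrm{aff}(A_i, A_j)\,\mathrm{aff}(A_i, A_{j+1}) \right] \end{aligned}$$

Let us define the bond between two attributes Ax and Ay as

$$\mathrm{bond}(A_x, A_y) = \sum_{z=1}^{n} \mathrm{aff}(A_z, A_x)\,\mathrm{aff}(A_z, A_y)$$

Then AM can be written as

$$AM = \sum_{j=1}^{n} [\mathrm{bond}(A_j, A_{j-1}) + \mathrm{bond}(A_j, A_{j+1})]$$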

Now consider the following n attributes

$$\underbrace{A_1\; A_2\; \ldots\; A_{i-1}}_{AM'}\;\; A_i\;\; A_j\;\; \underbrace{A_{j+1}\; \ldots\; A_n}_{AM''}$$

The global affinity measure for these attributes can be written as

$$\begin{aligned} AM_{old} &= AM' + AM'' + \mathrm{bond}(A_{i-1}, A_i) + \mathrm{bond}(A_i, A_j) + \mathrm{bond}(A_j, A_i) + \mathrm{bond}(A_j, A_{j+1}) \\ &= \sum_{l=1}^{i} [\mathrm{bond}(A_l, A_{l-1}) + \mathrm{bond}(A_l, A_{l+1})] + \sum_{l=i+2}^{n} [\mathrm{bond}(A_l, A_{l-1}) + \mathrm{bond}(A_l, A_{l+1})] + 2\,\mathrm{bond}(A_i, A_j) \end{aligned}$$

Now consider placing a new attribute Ak between attributes Ai and Aj in the clustered affinity matrix. The new global affinity measure can be similarly written as

$$\begin{aligned} AM_{new} &= AM' + AM'' + \mathrm{bond}(A_i, A_k) + \mathrm{bond}(A_k, A_i) + \mathrm{bond}(A_k, A_j) + \mathrm{bond}(A_j, A_k) \\ &= AM' + AM'' + 2\,\mathrm{bond}(A_i, A_k) + 2\,\mathrm{bond}(A_k, A_j) \end{aligned}$$

Thus, the net contribution⁴ to the global affinity measure of placing attribute Ak between Ai and Aj is

$$\mathrm{cont}(A_i, A_k, A_j) = AM_{new} - AM_{old} = 2\,\mathrm{bond}(A_i, A_k) + 2\,\mathrm{bond}(A_k, A_j) - 2\,\mathrm{bond}(A_i, A_j)$$

Example 3.17. Let us consider the AA matrix given in Figure 3.16 and study the contribution of moving attribute A4 between attributes A1 and A2, given by the formula

cont(A1,A4,A2) = 2bond(A1,A4)+2bond(A4,A2)−2bond(A1,A2)

Computing each term, we get

bond(A1,A4) = 45∗0+0∗75+45∗3+0∗78 = 135
bond(A4,A2) = 11865
bond(A1,A2) = 225

Therefore,

cont(A1,A4,A2) = 2∗135+2∗11865−2∗225 = 23550

⁴ In the literature [Hoffer and Severance, 1975] this measure is specified as bond(Ai,Ak) + bond(Ak,Aj) − 2bond(Ai,Aj). However, this is a pessimistic measure which does not follow from the definition of AM.

Note that the calculation of the bond between two attributes requires the multiplication of the respective elements of the two columns representing these attributes and taking the row-wise sum.
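The bond and contribution formulas translate directly into code. In the sketch below the function names `bond` and `cont` mirror the notation of the text, the use of `None` for a non-existent neighbor is a convention assumed for this illustration, and AA is the affinity matrix of the running example with the diagonal values 45, 80, 53 and 78 that Example 3.17 uses in its arithmetic.

```python
# Attribute affinity matrix of the running example (diagonal included).
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """Sum over all rows of the product of the entries in columns x and y.

    A column index of None stands for a non-existent neighbor, whose
    affinity values are taken to be 0.
    """
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, i, k, j):
    """Net contribution of placing attribute k between attributes i and j."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


A1, A2, A3, A4 = range(4)  # zero-based column indices
print(bond(AA, A1, A4))      # 135
print(bond(AA, A4, A2))      # 11865
print(bond(AA, A1, A2))      # 225
print(cont(AA, A1, A4, A2))  # 23550
```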

The algorithm and our discussion so far have both concentrated on the columns of the attribute affinity matrix. We can also make the same arguments and redesign the algorithm to operate on the rows. Since the AA matrix is symmetric, both of these approaches will generate the same result.

Another point about Algorithm 3.3 is that to improve its efficiency, the second column is also fixed and placed next to the first one during the initialization step. This is acceptable since, according to the algorithm, A2 can be placed either to the left of A1 or to its right. The bond between the two, however, is independent of their positions relative to one another.

Finally, we should indicate the problem of computing cont at the endpoints. If an attribute Ak is being considered for placement to the left of the leftmost attribute, one of the bond equations to be calculated is between a non-existent left element and Ak [i.e., bond(A0,Ak)]. Thus we need to refer to the conditions imposed on the definition of the global affinity measure AM, where CA(0,k) = 0. The other extreme is if Aj is the rightmost attribute that is already placed in the CA matrix and we are checking for the contribution of placing attribute Ak to the right of Aj. In this case the bond(k,k+1) needs to be calculated. However, since no attribute is yet placed in column k+1 of CA, the affinity measure is not defined. Therefore, according to the endpoint conditions, this bond value is also 0.

Example 3.18. We consider the clustering of the PROJ relation attributes and use the attribute affinity matrix AA of Figure 3.16.

According to the initialization step, we copy columns 1 and 2 of the AA matrix to the CA matrix (Figure 3.17a) and start with column 3 (i.e., attribute A3). There are three alternative places where column 3 can be placed: to the left of column 1, resulting in the ordering (3-1-2), in between columns 1 and 2, giving (1-3-2), and to the right of 2, resulting in (1-2-3). Note that to compute the contribution of the last ordering we have to compute cont(A2,A3,A4) rather than cont(A1,A2,A3). Furthermore, in this context A4 refers to the fourth index position in the CA matrix, which is empty (Figure 3.17b), not to the attribute column A4 of the AA matrix. Let us calculate the contribution to the global affinity measure of each alternative.

Ordering (0-3-1):

cont(A0,A3,A1) = 2bond(A0,A3)+2bond(A3,A1)−2bond(A0,A1)

We know that

bond(A0,A1) = bond(A0,A3) = 0
bond(A3,A1) = 45∗45+5∗0+53∗45+3∗0 = 4410

Thus

cont(A0,A3,A1) = 8820

(a)
      A1   A2
A1    45   0
A2    0    80
A3    45   5
A4    0    75

(b)
      A1   A3   A2
A1    45   45   0
A2    0    5    80
A3    45   53   5
A4    0    3    75

(c)
      A1   A3   A2   A4
A1    45   45   0    0
A2    0    5    80   75
A3    45   53   5    3
A4    0    3    75   78

(d)
      A1   A3   A2   A4
A1    45   45   0    0
A3    45   53   5    3
A2    0    5    80   75
A4    0    3    75   78

Fig. 3.17 Calculation of the Clustered Affinity (CA) Matrix

Ordering (1-3-2):

cont(A1,A3,A2) = 2bond(A1,A3)+2bond(A3,A2)−2bond(A1,A2)
bond(A1,A3) = bond(A3,A1) = 4410
bond(A3,A2) = 890
bond(A1,A2) = 225

Thus

cont(A1,A3,A2) = 10150

Ordering (2-3-4):

cont(A2,A3,A4) = 2bond(A2,A3)+2bond(A3,A4)−2bond(A2,A4)
bond(A2,A3) = 890
bond(A3,A4) = 0
bond(A2,A4) = 0

Thus

cont(A2,A3,A4) = 1780

Since the contribution of the ordering (1-3-2) is the largest, we select to place A3 to the right of A1 (Figure 3.17b). Similar calculations for A4 indicate that it should be placed to the right of A2 (Figure 3.17c).

Finally, the rows are organized in the same order as the columns and the result is shown in Figure 3.17d.
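The whole clustering procedure can be sketched compactly. The following is one possible rendering of the steps of the bond energy algorithm, not a transcription of Algorithm 3.3: it keeps a list of attribute indices instead of shuffling matrix columns, it breaks ties in favor of the leftmost position (an assumption), and the function names are chosen for this illustration. It expects a matrix with at least two attributes.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """Row-wise sum of products of columns x and y (0 at the boundaries)."""
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, left, k, right):
    """Net contribution of placing attribute k between left and right."""
    return (2 * bond(aa, left, k) + 2 * bond(aa, k, right)
            - 2 * bond(aa, left, right))


def bea(aa):
    """Return (column order, clustered affinity matrix) for n >= 2."""
    order = [0, 1]  # initialization: fix the first two columns
    for k in range(2, len(aa)):
        best_pos, best_cont = 0, None
        # Try every slot, including both ends of the current ordering.
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            c = cont(aa, left, k, right)
            if best_cont is None or c > best_cont:
                best_pos, best_cont = pos, c
        order.insert(best_pos, k)
    # Row ordering: rows follow the same permutation as the columns.
    ca = [[aa[i][j] for j in order] for i in order]
    return order, ca


order, CA = bea(AA)
print(["A%d" % (i + 1) for i in order])  # ['A1', 'A3', 'A2', 'A4']
for row in CA:
    print(row)
# [45, 45, 0, 0]
# [45, 53, 5, 3]
# [0, 5, 80, 75]
# [0, 3, 75, 78]
```

On the data of Example 3.18 the sketch places A3 between A1 and A2 (contribution 10150) and then A4 to the right of A2, reproducing the matrix of Figure 3.17d.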

In Figure 3.17d we see the creation of two clusters: one is in the upper left corner and contains the smaller affinity values and the other is in the lower right corner and contains the larger affinity values. This clustering indicates how the attributes of relation PROJ should be split. However, in general the border for this split may not be this clear-cut. When the CA matrix is big, usually more than two clusters are formed and there is more than one candidate partitioning. Thus, there is a need to approach this problem more systematically.

3.3.2.3 Partitioning Algorithm

The objective of the splitting activity is to find sets of attributes that are accessed solely, or for the most part, by distinct sets of applications. For example, if it is possible to identify two attributes, A1 and A2, which are accessed only by application q1, and attributes A3 and A4, which are accessed by, say, two applications q2 and q3, it would be quite straightforward to decide on the fragments. The task lies in finding an algorithmic method of identifying these groups.

Consider the clustered attribute matrix of Figure 3.18. If a point along the diagonal is fixed, two sets of attributes are identified. One set {A1,A2, . . . ,Ai} is at the upper left-hand corner and the second set {Ai+1, . . . ,An} is to the right and to the bottom of this point. We call the former set top and the latter set bottom and denote the attribute sets as TA and BA, respectively.

We now turn to the set of applications Q = {q1,q2, . . . ,qq} and define the set of applications that access only TA, only BA, or both. These sets are defined as follows:

AQ(qi) = {Aj | use(qi, Aj) = 1}
TQ = {qi | AQ(qi) ⊆ TA}
BQ = {qi | AQ(qi) ⊆ BA}
OQ = Q − {TQ ∪ BQ}

The first of these equations defines the set of attributes accessed by application qi; TQ and BQ are the sets of applications that only access TA or BA, respectively, and OQ is the set of applications that access both.

[Figure: clustered affinity matrix with rows and columns ordered A1, A2, A3, . . . , Ai, Ai+1, . . . , An; a point on the diagonal separates the upper left block TA from the lower right block BA]

Fig. 3.18 Locating a Splitting Point

There is an optimization problem here. If there are n attributes of a relation, there are n−1 possible positions where the dividing point can be placed along the diagonal of the clustered attribute matrix for that relation. The best position for division is one which produces the sets TQ and BQ such that the total accesses to only one fragment are maximized while the total accesses to both fragments are minimized. We therefore define the following cost equations:

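$$CQ = \sum_{q_i \in Q} \sum_{\forall S_j} \mathrm{ref}_j(q_i)\, \mathrm{acc}_j(q_i)$$

$$CTQ = \sum_{q_i \in TQ} \sum_{\forall S_j} \mathrm{ref}_j(q_i)\, \mathrm{acc}_j(q_i)$$

$$CBQ = \sum_{q_i \in BQ} \sum_{\forall S_j} \mathrm{ref}_j(q_i)\, \mathrm{acc}_j(q_i)$$

$$COQ = \sum_{q_i \in OQ} \sum_{\forall S_j} \mathrm{ref}_j(q_i)\, \mathrm{acc}_j(q_i)$$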

Each of the equations above counts the total number of accesses to attributes by applications in their respective classes. Based on these measures, the optimization problem is defined as finding the point x (1 ≤ x ≤ n) such that the expression

z = CTQ ∗ CBQ − COQ^2

is maximized [Navathe et al., 1984]. The important feature of this expression is that it defines two fragments such that the values of CTQ and CBQ are as nearly equal as possible. This enables the balancing of processing loads when the fragments are distributed to various sites. It is clear that the partitioning algorithm has linear complexity in terms of the number of attributes of the relation, that is, O(n).
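The search for the split point can be illustrated on the PROJ example. The sketch below is a simplified illustration, not Algorithm 3.4: the function name `best_split` is hypothetical, the attribute order is the clustered order of Figure 3.17d, the per-query costs are the access totals of Example 3.16 with ref = 1, and both the SHIFT step and the replication of the key in every fragment are left out.

```python
# Clustered attribute order from Figure 3.17d.
ORDER = ["A1", "A3", "A2", "A4"]
# AQ(q): attributes referenced by each query (Figure 3.15).
USES = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}
# Total accesses of each query over all sites (Example 3.16, ref = 1).
COST = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}


def best_split(order, uses, cost):
    """Try the n-1 split points and return (z, TA, BA) with maximum z."""
    best = None
    cq = sum(cost.values())
    for x in range(1, len(order)):
        ta, ba = set(order[:x]), set(order[x:])
        # Queries confined to the top block and to the bottom block.
        ctq = sum(cost[q] for q, attrs in uses.items() if attrs <= ta)
        cbq = sum(cost[q] for q, attrs in uses.items() if attrs <= ba)
        coq = cq - ctq - cbq  # queries that touch both blocks
        z = ctq * cbq - coq ** 2
        if best is None or z > best[0]:
            best = (z, order[:x], order[x:])
    return best


z, TA, BA = best_split(ORDER, USES, COST)
print(z, TA, BA)  # 3311 ['A1', 'A3'] ['A2', 'A4']
```

For the three candidate positions the values of z are −2025, 3311 and −6084, so the split falls between A3 and A2.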

There are two complications that need to be addressed. The first is with respect to the splitting. The procedure splits the set of attributes two-way. For larger sets of attributes, it is quite likely that m-way partitioning may be necessary.

Designing an m-way partitioning is possible but computationally expensive. Along the diagonal of the CA matrix, it is necessary to try 1, 2, . . . , m−1 split points, and for each of these, it is necessary to check which place maximizes z. Thus, the complexity of such an algorithm is O(2^m). Of course, the definition of z has to be modified for those cases where there are multiple split points. The alternative solution is to recursively apply the binary partitioning algorithm to each of the fragments obtained during the previous iteration. One would compute TQ, BQ, and OQ, as well as the associated access measures for each of the fragments, and partition them further.

The second complication relates to the location of the block of attributes that should form one fragment. Our discussion so far assumed that the split point is unique and single and divides the CA matrix into an upper left-hand partition and a second partition formed by the rest of the attributes. The partition, however, may also be formed in the middle of the matrix. In this case, we need to modify the algorithm slightly. The leftmost column of the CA matrix is shifted to become the rightmost column and the topmost row is shifted to the bottom. The shift operation is followed by checking the n−1 diagonal positions to find the maximum z. The idea behind shifting is to move the block of attributes that should form a cluster to the topmost left corner of the matrix, where it can easily be identified. With the addition of the shift operation, the complexity of the partitioning algorithm increases by a factor of n and becomes O(n^2).
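The effect of shifting can be illustrated with a small Python sketch. It treats the shift as a cyclic rotation of the attribute order (moving the leftmost column to the right and the topmost row to the bottom amounts to this), and it assumes a caller-supplied evaluator `z_of(top, bottom)`; both the function name and the toy evaluator below are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def best_split_with_shift(
    order: Sequence[str],
    z_of: Callable[[List[str], List[str]], float],
) -> Tuple[float, int, List[str], List[str]]:
    """Try every cyclic shift and every diagonal split point.

    Returns (best z, number of shifts applied, top block, bottom block).
    """
    n = len(order)
    current = list(order)
    best = None
    for shift in range(n):                 # n shift positions
        for x in range(1, n):              # n-1 split points per position
            top, bottom = current[:x], current[x:]
            z = z_of(top, bottom)
            if best is None or z > best[0]:
                best = (z, shift, top, bottom)
        current = current[1:] + current[:1]  # SHIFT: first attribute goes last
    return best

# Toy evaluator (assumption): the attributes B and C belong together.
def toy_z(top: List[str], bottom: List[str]) -> float:
    return 10.0 if set(top) == {"B", "C"} else 0.0

print(best_split_with_shift(["A", "B", "C", "D"], toy_z))
```

In this toy run the block {B, C} sits in the middle of the original order, so no split of the unshifted order isolates it; after one shift it becomes the upper left-hand block. The two nested loops make the n(n−1) candidate evaluations explicit.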

Assuming that a shift procedure, called SHIFT, has already been implemented, the partitioning algorithm is given in Algorithm 3.4. The input to PARTITION is the clustered affinity matrix CA, the relation R to be fragmented, and the attribute usage and access frequency matrices. The output is a set of fragments FR = {R1, R2}, where Ri ⊆ {A1, A2, . . . , An} and R1 ∩ R2 = the key attributes of relation R. Note that for n-way partitioning, this routine should either be invoked iteratively, or implemented as a recursive procedure.

Example 3.19. When the PARTITION algorithm is applied to the CA matrix obtained for relation PROJ (Example 3.18), the result is the definition of fragments FPROJ = {PROJ1,PROJ2}, where PROJ1 = {A1,A3} and PROJ2 = {A1,A2,A4}. Thus

PROJ1 = {PNO, BUDGET} PROJ2 = {PNO, PNAME, LOC}

Note that in this exercise we performed the fragmentation over the entire set of attributes rather than only on the non-key ones. The reason for this is the simplicity of the example. For that reason, we included PNO, which is the key of PROJ in PROJ2 as well as in PROJ1.

3.3.2.4 Checking for Correctness

We follow arguments similar to those of horizontal partitioning to prove that the PARTITION algorithm yields a correct vertical fragmentation.

Algorithm 3.4: PARTITION Algorithm
Input: CA: clustered affinity matrix; R: relation; ref: attribute usage matrix; acc: access frequency matrix
Output: F: set of fragments
begin
    {determine the z value for the first column}
    {the subscripts in the cost equations indicate the split point}
    calculate CTQ_{n−1} ;
    calculate CBQ_{n−1} ;
    calculate COQ_{n−1} ;
    best ← CTQ_{n−1} ∗ CBQ_{n−1} − (COQ_{n−1})^2 ;
    repeat
        {determine the best partitioning}
        for i from n−2 to 1 by −1 do
            calculate CTQ_i ;
            calculate CBQ_i ;
            calculate COQ_i ;
            z ← CTQ_i ∗ CBQ_i − (COQ_i)^2 ;
            if z > best then
                best ← z {record the split point within shift}
        call SHIFT(CA)
    until no more SHIFT is possible ;
    reconstruct the matrix according to the shift position ;
    R1 ← Π_TA(R) ∪ K ; {K is the set of primary key attributes of R}
    R2 ← Π_BA(R) ∪ K ;
    F ← {R1, R2}
end

Completeness.

Completeness is guaranteed by the PARTITION algorithm since each attribute of the global relation is assigned to one of the fragments. As long as the set of attributes A over which the relation R is defined consists of

$$A = \bigcup R_i$$

completeness of vertical fragmentation is ensured.

Reconstruction.

We have already mentioned that the reconstruction of the original global relation is made possible by the join operation. Thus, for a relation R with vertical fragmentation FR = {R1, R2, . . . , Rr} and key attribute(s) K,

$$R = \bowtie_K R_i, \quad \forall R_i \in F_R$$

Therefore, as long as each Ri is complete, the join operation will properly reconstruct R. Another important point is that either each Ri should contain the key attribute(s) of R, or it should contain the system assigned tuple IDs (TIDs).
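As a small illustration of reconstruction by join, the Python sketch below projects a relation onto two vertical fragments that both carry the key and then joins them back. The helper names and the sample tuples are assumptions made for the example; relations are modeled simply as lists of dictionaries.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]

def vertical_fragment(rows: List[Row], attrs: Sequence[str],
                      key: Sequence[str]) -> List[Row]:
    """Project rows onto attrs, always keeping the key attributes."""
    cols = list(key) + [a for a in attrs if a not in key]
    return [{c: r[c] for c in cols} for r in rows]

def join_on_key(left: List[Row], right: List[Row],
                key: Sequence[str]) -> List[Row]:
    """Equi-join two fragments on the replicated key attributes."""
    index = {tuple(r[k] for k in key): r for r in right}
    joined = []
    for l in left:
        match = index.get(tuple(l[k] for k in key))
        if match is not None:
            joined.append({**l, **match})
    return joined

# Assumed sample tuples for PROJ(PNO, PNAME, BUDGET, LOC).
proj = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
]
key = ["PNO"]
proj1 = vertical_fragment(proj, ["BUDGET"], key)        # {PNO, BUDGET}
proj2 = vertical_fragment(proj, ["PNAME", "LOC"], key)  # {PNO, PNAME, LOC}

print(join_on_key(proj1, proj2, key) == proj)  # True
```

The join recovers every tuple only because each fragment keeps the key; dropping PNO from either projection would leave nothing to join on.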

Disjointness.

As we indicated before, the disjointness of fragments is not as important in vertical fragmentation as it is in horizontal fragmentation. There are two cases here:

1. TIDs are used, in which case the fragments are disjoint since the TIDs that are replicated in each fragment are system assigned and managed entities, totally invisible to the users.

2. The key attributes are replicated in each fragment, in which case one cannot claim that they are disjoint in the strict sense of the term. However, it is important to realize that this duplication of the key attributes is known and managed by the system and does not have the same implications as tuple duplication in horizontally partitioned fragments. In other words, as long as the fragments are disjoint except for the key attributes, we can be satisfied and call them disjoint.

3.3.3 Hybrid Fragmentation

In most cases a simple horizontal or vertical fragmentation of a database schema will not be sufficient to satisfy the requirements of user applications. In this case a vertical fragmentation may be followed by a horizontal one, or vice versa, producing a tree-structured partitioning (Figure 3.19). Since the two types of partitioning strategies are applied one after the other, this alternative is called hybrid fragmentation. It has also been named mixed fragmentation or nested fragmentation.

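[Figure: partitioning tree in which R is fragmented horizontally (H) into R1 and R2; R1 is then fragmented vertically (V) into R11 and R12, and R2 into R21, R22, and R23.]

Fig. 3.19 Hybrid Fragmentation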

A good example for the necessity of hybrid fragmentation is relation PROJ, which we have been working with. In Example 3.11 we partitioned it into six horizontal fragments based on two applications. In Example 3.19 we partitioned the same relation vertically into two. What we have, therefore, is a set of horizontal fragments, each of which is further partitioned into two vertical fragments.

The number of levels of nesting can be large, but it is certainly finite. In the case of horizontal fragmentation, one has to stop when each fragment consists of only one tuple, whereas the termination point for vertical fragmentation is one attribute per fragment. These limits are quite academic, however, since the levels of nesting in most practical applications do not exceed 2. This is due to the fact that normalized global relations already have small degrees and one cannot perform too many vertical fragmentations before the cost of joins becomes very high.

We will not discuss in detail the correctness rules and conditions for hybrid fragmentation, since they follow naturally from those for vertical and horizontal fragmentations. For example, to reconstruct the original global relation in case of hybrid fragmentation, one starts at the leaves of the partitioning tree and moves upward by performing joins and unions (Figure 3.20). The fragmentation is complete if the intermediate and leaf fragments are complete. Similarly, disjointness is guaranteed if intermediate and leaf fragments are disjoint.

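[Figure: reconstruction tree whose leaves are the fragments R11, R12, R21, R22, and R23, combined upward by joins and unions.]

Fig. 3.20 Reconstruction of Hybrid Fragmentation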
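The bottom-up reconstruction can be sketched as a recursive walk over the partitioning tree. The tuple-based tree encoding, the function names, and the sample data are assumptions made for this illustration; the union step is written as plain concatenation, which is only adequate when the horizontal fragments are disjoint.

```python
from functools import reduce
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]

def join(left: List[Row], right: List[Row], key: Sequence[str]) -> List[Row]:
    """Equi-join on the key attributes shared by vertical fragments."""
    index = {tuple(r[k] for k in key): r for r in right}
    return [{**l, **index[tuple(l[k] for k in key)]}
            for l in left if tuple(l[k] for k in key) in index]

def reconstruct(node: Tuple[str, object], key: Sequence[str]) -> List[Row]:
    """node is ("leaf", rows), ("V", children) or ("H", children)."""
    kind, payload = node
    if kind == "leaf":
        return payload
    parts = [reconstruct(child, key) for child in payload]
    if kind == "V":   # vertical fragments are put back together by joins
        return reduce(lambda a, b: join(a, b, key), parts)
    if kind == "H":   # horizontal fragments by union (disjointness assumed)
        return [row for part in parts for row in part]
    raise ValueError(f"unknown node kind: {kind}")

# Assumed relation R(K, A, B, C) shaped like the tree of Figure 3.19.
r11 = [{"K": 1, "A": "a1"}]
r12 = [{"K": 1, "B": "b1", "C": "c1"}]
r21 = [{"K": 2, "A": "a2"}, {"K": 3, "A": "a3"}]
r22 = [{"K": 2, "B": "b2"}, {"K": 3, "B": "b3"}]
r23 = [{"K": 2, "C": "c2"}, {"K": 3, "C": "c3"}]
tree = ("H", [("V", [("leaf", r11), ("leaf", r12)]),
              ("V", [("leaf", r21), ("leaf", r22), ("leaf", r23)])])

for row in reconstruct(tree, ["K"]):
    print(row)
```

The recursion mirrors the tree: joins are applied at the nodes produced by vertical fragmentation and a union at the node produced by horizontal fragmentation.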

3.4 Allocation

The allocation of resources across the nodes of a computer network is an old problem that has been studied extensively. Most of this work, however, does not address the problem of distributed database design, but rather that of placing individual files on a computer network. We will examine the differences between the two shortly. We first need to define the allocation problem more precisely.

3.4.1 Allocation Problem

Assume that there is a set of fragments F = {F1, F2, . . . , Fn} and a distributed system consisting of sites S = {S1, S2, . . . , Sm} on which a set of applications Q = {q1, q2, . . . , qq} is running. The allocation problem involves finding the “optimal” distribution of F to S.

The optimality can be defined with respect to two measures [Dowdy and Foster, 1982]:

1. Minimal cost. The cost function consists of the cost of storing each Fi at a site Sj, the cost of querying Fi at site Sj, the cost of updating Fi at all sites where it is stored, and the cost of data communication. The allocation problem, then, attempts to find an allocation scheme that minimizes a combined cost function.

2. Performance. The allocation strategy is designed to maintain a performance metric. Two well-known ones are to minimize the response time and to maximize the system throughput at each site.

Most of the models that have been proposed to date make this distinction of optimality. However, if one really examines the problem in depth, it is apparent that the “optimality” measure should include both the performance and the cost factors. In other words, one should be looking for an allocation scheme that, for example, answers user queries in minimal time while keeping the cost of processing minimal. A similar statement can be made for throughput maximization. One can then ask why such models have not been developed. The answer is quite simple: complexity.

Let us consider a very simple formulation of the problem. Let F and S be defined as before. For the time being, we consider only a single fragment, Fk. We make a number of assumptions and definitions that will enable us to model the allocation problem.

1. Assume that Q can be modified so that it is possible to identify the update and the retrieval-only queries, and to define the following for a single fragment Fk:

T = {t1, t2, . . . , tm}

where ti is the read-only traffic generated at site Si for Fk, and

U = {u1,u2, . . . ,um}

where ui is the update traffic generated at site Si for Fk.

2. Assume that the communication cost between any pair of sites Si and Sj is fixed for a unit of transmission. Furthermore, assume that it is different for updates and retrievals in order that the following can be defined:

$$
\begin{aligned}
C(T) &= \{c_{12}, c_{13}, \ldots, c_{1m}, \ldots, c_{m-1,m}\} \\
C'(U) &= \{c'_{12}, c'_{13}, \ldots, c'_{1m}, \ldots, c'_{m-1,m}\}
\end{aligned}
$$

where cij is the unit communication cost for retrieval requests between sites Si and Sj, and c′ij is the unit communication cost for update requests between sites Si and Sj.

3. Let the cost of storing the fragment at site Si be di. Thus we can define D = {d1,d2, . . . ,dm} for the storage cost of fragment Fk at all the sites.

4. Assume that there are no capacity constraints for either the sites or the communication links.

Then the allocation problem can be specified as a cost-minimization problem where we are trying to find the set I ⊆ S that specifies where the copies of the fragment will be stored. In the following, xj denotes the decision variable for the placement such that

$$
x_j = \begin{cases}
1 & \text{if fragment } F_k \text{ is assigned to site } S_j \\
0 & \text{otherwise}
\end{cases}
$$

The precise specification is as follows:

$$
\min \left[ \sum_{i=1}^{m} \left( \sum_{j \mid S_j \in I} x_j u_j c'_{ij} + t_j \min_{j \mid S_j \in I} c_{ij} \right) + \sum_{j \mid S_j \in I} x_j d_j \right]
$$

subject to

$$x_j = 0 \text{ or } 1$$

The second term of the objective function calculates the total cost of storing all the duplicate copies of the fragment. The first term, on the other hand, corresponds to the cost of transmitting the updates to all the sites that hold the replicas of the fragment, and to the cost of executing the retrieval-only requests at the site, which will result in minimal data transmission cost.

This is a very simplistic formulation that is not suitable for distributed database design. But even if it were, there is another problem. This formulation, which comes from Casey [1972], has been proven to be NP-complete [Eswaran, 1974]. Various different formulations of the problem have been proven to be just as hard over the years (e.g., [Sacca and Wiederhold, 1985] and [Lam and Yu, 1980]). The implication is, of course, that for large problems (i.e., large number of fragments and sites), obtaining optimal solutions is probably not computationally feasible. Considerable research has therefore been devoted to finding good heuristics that may provide suboptimal solutions.
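To make the size of the search space concrete, the following Python sketch enumerates every non-empty set of sites for a single fragment and evaluates the cost of each. It is an illustration under stated assumptions rather than Casey's algorithm: the function names and numbers are invented, the cost of reaching one's own site is taken as zero, and the traffic terms are read as those generated by the originating site i (that is, u_i and t_i), which is one interpretation of the objective function.

```python
from itertools import combinations
from typing import List, Set

def allocation_cost(placement: Set[int], t: List[int], u: List[int],
                    c: List[List[int]], c_upd: List[List[int]],
                    d: List[int]) -> int:
    """Cost of keeping copies of one fragment at the sites in placement."""
    storage = sum(d[j] for j in placement)
    traffic = 0
    for i in range(len(t)):
        # Updates from site i are sent to every site holding a copy.
        traffic += sum(u[i] * c_upd[i][j] for j in placement)
        # Reads from site i go to the cheapest site holding a copy.
        traffic += t[i] * min(c[i][j] for j in placement)
    return traffic + storage

def best_placement(t, u, c, c_upd, d) -> Set[int]:
    """Exhaustive search over all 2^m - 1 non-empty subsets of sites."""
    m = len(t)
    candidates = (set(s) for r in range(1, m + 1)
                  for s in combinations(range(m), r))
    return min(candidates,
               key=lambda s: allocation_cost(s, t, u, c, c_upd, d))

# Assumed data for three sites (indices 0, 1, 2).
t = [10, 1, 8]                       # read-only traffic per site
u = [1, 1, 1]                        # update traffic per site
c = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
c_upd = c                            # same unit cost for updates here
d = [2, 2, 2]                        # storage cost per site

best = best_placement(t, u, c, c_upd, d)
print(sorted(best), allocation_cost(best, t, u, c, c_upd, d))  # [0, 2] 9
```

With three sites there are only seven candidate placements; the number doubles with every added site, which is why exhaustive enumeration stops being practical quickly and heuristics are sought instead.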

There are a number of reasons why simplistic formulations such as the one we have discussed are not suitable for distributed database design. These are inherent in all the early file allocation models for computer networks.

1. One cannot treat fragments as individual files that can be allocated one at a time, in isolation. The placement of one fragment usually has an impact on the placement decisions about the other fragments which are accessed together since the access costs to the remaining fragments may change (e.g., due to distributed join). Therefore, the relationship between fragments should be taken into account.

2. The access to data by applications is modeled very simply. A user request is issued at one site and all the data to answer it is transferred to that site. In distributed database systems, access to data is more complicated than this simple “remote file access” model suggests. Therefore, the relationship between the allocation and query processing should be properly modeled.

3. These models do not take into consideration the cost of integrity enforcement, yet locating two fragments involved in the same integrity constraint at two different sites can be costly.

4. Similarly, the cost of enforcing concurrency control mechanisms should be considered [Rothnie and Goodman, 1977].

In summary, let us remember the interrelationship between the distributed database problems as depicted in Figure 1.7. Since the allocation is so central, its relationship with algorithms that are implemented for other problem areas needs to be represented in the allocation model. However, this is exactly what makes it quite difficult to solve these models. To separate the traditional problem of file allocation from the fragment allocation in distributed database design, we refer to the former as the file allocation problem (FAP) and to the latter as the database allocation problem (DAP).

There are no general heuristic models that take as input a set of fragments and produce a near-optimal allocation subject to the types of constraints discussed here. The models developed to date make a number of simplifying assumptions and are applicable to certain specific formulations. Therefore, instead of presenting one or more of these allocation algorithms, we present a relatively general model and then discuss a number of possible heuristics that might be employed to solve it.

3.4.2 Information Requirements

It is at the allocation stage that we need the quantitative data about the database, the applications that run on it, the communication network, the processing capabilities, and storage limitations of each site on the network. We will discuss each of these in detail.

3.4.2.1 Database Information

To perform horizontal fragmentation, we defined the selectivity of minterms. We now need to extend that definition to fragments, and define the selectivity of a fragment Fj with respect to query qi. This is the number of tuples of Fj that need to be accessed in order to process qi. This value will be denoted as seli(Fj).

Another piece of necessary information on the database fragments is their size. The size of a fragment Fj is given by

size(Fj) = card(Fj)∗ length(Fj)

where length(Fj) is the length (in bytes) of a tuple of fragment Fj.

3.4.2.2 Application Information

Most of the application-related information is already compiled during the fragmentation activity, but a few more measures are required by the allocation model. The two important measures are the number of read accesses that a query qi makes to a fragment Fj during its execution (denoted as RRij), and its counterpart for the update accesses (URij). These may, for example, count the number of block accesses required by the query.

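We also need to define two matrices UM and RM, with elements uij and rij, respectively, which are specified as follows:

$$
u_{ij} = \begin{cases}
1 & \text{if query } q_i \text{ updates fragment } F_j \\
0 & \text{otherwise}
\end{cases}
$$

$$
r_{ij} = \begin{cases}
1 & \text{if query } q_i \text{ retrieves from fragment } F_j \\
0 & \text{otherwise}
\end{cases}
$$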

A vector O of values o(i) is also defined, where o(i) specifies the originating site of query qi. Finally, to define the response-time constraint, the maximum allowable response time of each application should be specified.

3.4.2.3 Site Information

For each computer site, we need to know its storage and processing capacity. Obviously, these values can be computed by means of elaborate functions or by simple estimates. The unit cost of storing data at site Sk will be denoted as USCk. There is also a need to specify a cost measure LPCk as the cost of processing one unit of work at site Sk. The work unit should be identical to that of the RR and UR measures.

3.4.2.4 Network Information

In our model we assume the existence of a simple network where the cost of communication is defined in terms of one frame of data. Thus gij denotes the communication cost per frame between sites Si and Sj. To enable the calculation of the number of messages, we use fsize as the size (in bytes) of one frame. There is no question that there are more elaborate network models which take into consideration the channel capacities, distances between sites, protocol overhead, and so on. However, the derivation of those equations is beyond the scope of this chapter.

3.4.3 Allocation Model

We discuss an allocation model that attempts to minimize the total cost of processing and storage while trying to meet certain response time restrictions. The model we use has the following form:

min(Total Cost)

subject to

    response-time constraint
    storage constraint
    processing constraint

In the remainder of this section we expand the components of this model based on the information requirements discussed in Section 3.4.2. The decision variable is xij, which is defined as

$$
x_{ij} = \begin{cases}
1 & \text{if the fragment } F_i \text{ is stored at site } S_j \\
0 & \text{otherwise}
\end{cases}
$$

3.4.3.1 Total Cost

The total cost function has two components: query processing and storage. Thus it can be expressed as

$$TOC = \sum_{\forall q_i \in Q} QPC_i + \sum_{\forall S_k \in S} \sum_{\forall F_j \in F} STC_{jk}$$

where QPCi is the query processing cost of application qi, and STCjk is the cost of storing fragment Fj at site Sk.

Let us consider the storage cost first. It is simply given by

$$STC_{jk} = USC_k * size(F_j) * x_{jk}$$

and the two summations find the total storage costs at all the sites for all the fragments.

The query processing cost is more difficult to specify. Most models of the file allocation problem (FAP) separate it into two components: the retrieval-only processing cost, and the update processing cost. We choose a different approach in our model of the database allocation problem (DAP) and specify it as consisting of the processing cost (PC) and the transmission cost (TC). Thus the query processing cost (QPC) for application qi is

QPCi = PCi +TCi

According to the guidelines presented in Section 3.4.1, the processing component, PC, consists of three cost factors, the access cost (AC), the integrity enforcement cost (IE), and the concurrency control cost (CC):

PCi = ACi + IEi +CCi

The detailed specification of each of these cost factors depends on the algorithms used to accomplish these tasks. However, to demonstrate the point, we specify AC in some detail:

$$AC_i = \sum_{\forall S_k \in S} \sum_{\forall F_j \in F} (u_{ij} * UR_{ij} + r_{ij} * RR_{ij}) * x_{jk} * LPC_k$$

The first two terms in the above formula calculate the number of accesses of user query qi to fragment Fj. Note that (URij + RRij) gives the total number of update and retrieval accesses. We assume that the local costs of processing them are identical. The summation gives the total number of accesses for all the fragments referenced by qi. Multiplication by LPCk gives the cost of this access at site Sk. We again use xjk to select only those cost values for the sites where fragments are stored.
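The access cost formula translates directly into a double summation. The sketch below is a literal transcription under assumed data; the function name and the matrix layout (queries by fragments for UM, RM, UR, RR, and fragments by sites for x) are choices made for the illustration.

```python
from typing import List

def access_cost(i: int, UM: List[List[int]], RM: List[List[int]],
                UR: List[List[int]], RR: List[List[int]],
                x: List[List[int]], LPC: List[int]) -> int:
    """AC_i = sum over sites k and fragments j of
    (u_ij * UR_ij + r_ij * RR_ij) * x_jk * LPC_k."""
    return sum(
        (UM[i][j] * UR[i][j] + RM[i][j] * RR[i][j]) * x[j][k] * LPC[k]
        for k in range(len(LPC))
        for j in range(len(x))
    )

# Assumed data: one query, two fragments, two sites.
UM = [[0, 1]]            # the query updates fragment 1 only
RM = [[1, 1]]            # the query reads both fragments
UR = [[0, 4]]            # update accesses per fragment
RR = [[10, 2]]           # read accesses per fragment
x = [[1, 0],             # fragment 0 is stored at site 0
     [1, 1]]             # fragment 1 is stored at sites 0 and 1
LPC = [2, 3]             # local processing cost per unit of work

print(access_cost(0, UM, RM, UR, RR, x, LPC))  # 50
```

Note that, transcribed literally, the formula charges the accesses to a replicated fragment at every site that holds a copy, reads included; fragment 1 above contributes at both sites.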

A very important issue needs to be pointed out here. The access cost function assumes that processing a query involves decomposing it into a set of subqueries, each of which works on a fragment stored at the site, followed by transmitting the results back to the site where the query has originated. As we discussed earlier, this is a very simplistic view which does not take into consideration the complexities of database processing. For example, the cost function does not take into account the cost of performing joins (if necessary), which may be executed in a number of ways, studied in Chapter 8. In a model that is more realistic than the generic model we are considering, these issues should not be omitted.

The integrity enforcement cost factor can be specified much like the processing component, except that the unit local processing cost would probably change to reflect the true cost of integrity enforcement. Since the integrity checking and concurrency control methods are discussed later in the book, we do not need to study these cost components further here. The reader should refer back to this section after reading Chapters 5 and 11 to be convinced that the cost functions can indeed be derived.

The transmission cost function can be formulated along the lines of the access cost function. However, the data transmission overhead for update and that for retrieval requests are quite different. In update queries it is necessary to inform all the sites where replicas exist, while in retrieval queries, it is sufficient to access only one of the copies. In addition, at the end of an update request, there is no data transmission back to the originating site other than a confirmation message, whereas the retrieval-only queries may result in significant data transmission.

The update component of the transmission function is

$$TCU_i = \sum_{\forall S_k \in S} \sum_{\forall F_j \in F} u_{ij} * x_{jk} * g_{o(i),k} + \sum_{\forall S_k \in S} \sum_{\forall F_j \in F} u_{ij} * x_{jk} * g_{k,o(i)}$$

The first term is for sending the update message from the originating site o(i) of qi to all the fragment replicas that need to be updated. The second term is for the confirmation.
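A direct transcription of the update component might look as follows. The function name, the matrix layout, and the numbers are assumptions for the example; `origin[i]` plays the role of o(i).

```python
from typing import List

def update_transmission_cost(i: int, UM: List[List[int]],
                             x: List[List[int]], g: List[List[int]],
                             origin: List[int]) -> int:
    """TCU_i: one update message to each replica plus one confirmation back."""
    o = origin[i]
    sites = range(len(g))
    frags = range(len(x))
    send = sum(UM[i][j] * x[j][k] * g[o][k] for k in sites for j in frags)
    confirm = sum(UM[i][j] * x[j][k] * g[k][o] for k in sites for j in frags)
    return send + confirm

# Assumed data: one query issued at site 0, two fragments, three sites.
UM = [[0, 1]]                 # the query updates fragment 1
x = [[1, 0, 0],               # fragment 0 at site 0
     [0, 1, 1]]               # fragment 1 replicated at sites 1 and 2
g = [[0, 3, 5],
     [3, 0, 2],
     [5, 2, 0]]               # per-frame communication cost between sites
origin = [0]

print(update_transmission_cost(0, UM, x, g, origin))  # 16
```

Every additional replica of an updated fragment adds one outgoing message and one confirmation, so the update cost grows with the degree of replication.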

The retrieval cost can be specified as

$$TCR_i = \sum_{\forall F_j \in F} \min_{S_k \in S} \left( r_{ij} * x_{jk} * g_{o(i),k} + r_{ij} * x_{jk} * \frac{sel_i(F_j) * length(F_j)}{fsize} * g_{k,o(i)} \right)$$

The first term in TCR represents the cost of transmitting the retrieval request to those sites which have copies of fragments that need to be accessed. The second term accounts for the transmission of the results from these sites to the originating site. The equation states that among all the sites with copies of the same fragment, only the site that yields the minimum total transmission cost should be selected for the execution of the operation.
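The retrieval component can be sketched in the same style. One interpretive choice is made explicit here: the minimum is taken only over the sites that actually hold a copy of the fragment, as the prose describes, since a site without a copy would otherwise contribute a cost of zero. The function name and data are assumptions, and every retrieved fragment is assumed to be stored at one site at least.

```python
from typing import List

def retrieval_transmission_cost(i: int, RM: List[List[int]],
                                x: List[List[int]], g: List[List[int]],
                                origin: List[int], sel: List[List[int]],
                                length: List[int], fsize: int) -> float:
    """TCR_i: for each fragment read by q_i, use the cheapest site holding it."""
    o = origin[i]
    total = 0.0
    for j in range(len(x)):
        if not RM[i][j]:
            continue                                  # q_i does not read F_j
        holders = [k for k in range(len(g)) if x[j][k]]
        frames = sel[i][j] * length[j] / fsize        # result size in frames
        # Request to site k, then the result frames back to the origin.
        total += min(g[o][k] + frames * g[k][o] for k in holders)
    return total

# Assumed data: one query issued at site 0, two fragments, three sites.
RM = [[1, 1]]
x = [[0, 1, 0],               # fragment 0 at site 1
     [0, 1, 1]]               # fragment 1 at sites 1 and 2
g = [[0, 3, 5],
     [3, 0, 2],
     [5, 2, 0]]
origin = [0]
sel = [[100, 20]]             # tuples of each fragment accessed by the query
length = [50, 25]             # tuple length in bytes
fsize = 1000                  # frame size in bytes

print(retrieval_transmission_cost(0, RM, x, g, origin, sel, length, fsize))  # 22.5
```

For fragment 1 the copy at site 1 is preferred over the one at site 2 because both the request and the returned frames are cheaper on that link.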

Now the transmission cost function for query qi can be specified as

TCi = TCUi +TCRi

which fully specifies the total cost function.

3.4.3.2 Constraints

The constraint functions can be specified in similar detail. However, instead of describing these functions in depth, we will simply indicate what they should look like. The response-time constraint should be specified as

execution time of qi ≤ maximum response time of qi,∀qi ∈ Q

Preferably, the cost measure in the objective function should be specified in terms of time, as it makes the specification of the execution-time constraint relatively straightforward.

The storage constraint is

$$\sum_{\forall F_j \in F} STC_{jk} \le \text{storage capacity at site } S_k, \quad \forall S_k \in S$$

whereas the processing constraint is

$$\sum_{\forall q_i \in Q} \text{processing load of } q_i \text{ at site } S_k \le \text{processing capacity of } S_k, \quad \forall S_k \in S$$

This completes our development of the allocation model. Even though we have not developed it entirely, the precision in some of the terms indicates how one goes about formulating such a problem. In addition to this aspect, we have indicated the important issues that need to be addressed in allocation models.
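As one concrete piece of the model, the storage cost and the storage constraint can be checked for a candidate allocation as sketched below. The function names and figures are assumptions, and the capacity of each site is taken to be expressed in the same units as STC, following the way the constraint is written above.

```python
from typing import List

def storage_cost(j: int, k: int, USC: List[int], size: List[int],
                 x: List[List[int]]) -> int:
    """STC_jk = USC_k * size(F_j) * x_jk."""
    return USC[k] * size[j] * x[j][k]

def total_storage_cost(USC: List[int], size: List[int],
                       x: List[List[int]]) -> int:
    """Sum of STC_jk over all sites and all fragments."""
    return sum(storage_cost(j, k, USC, size, x)
               for k in range(len(USC)) for j in range(len(size)))

def storage_feasible(USC: List[int], size: List[int], x: List[List[int]],
                     capacity: List[int]) -> bool:
    """Check the storage constraint at every site."""
    return all(
        sum(storage_cost(j, k, USC, size, x) for j in range(len(size)))
        <= capacity[k]
        for k in range(len(USC))
    )

# Assumed data: two fragments, two sites.
USC = [1, 2]                  # unit storage cost per site
size = [100, 40]              # size(F_j) = card(F_j) * length(F_j)
x = [[1, 0],                  # fragment 0 at site 0
     [1, 1]]                  # fragment 1 at both sites

print(total_storage_cost(USC, size, x))            # 220
print(storage_feasible(USC, size, x, [150, 100]))  # True
print(storage_feasible(USC, size, x, [150, 50]))   # False
```

A solver for the full model would run a check of this kind, together with the response-time and processing constraints, on every allocation it considers.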

3.4.4 Solution Methods

In the preceding section we developed a generic allocation model which is considerably more complex than the FAP model presented in Section 3.4.1. Since the FAP model is NP-complete, one would expect the solution of this formulation of the database allocation problem (DAP) to be NP-complete as well. Even though we will not prove this conjecture, it is indeed true. Thus one has to look for heuristic methods that yield suboptimal solutions. The test of “goodness” in this case is, obviously, how close the results of the heuristic algorithm are to the optimal allocation.

A number of different heuristics have been applied to the solution of FAP and DAP models. It was observed early on that there is a correspondence between FAP and the plant location problem that has been studied in operations research. In fact, the isomorphism of the simple FAP and the single commodity warehouse location problem has been shown [Ramamoorthy and Wah, 1983]. Thus heuristics developed by operations researchers have commonly been adopted to solve the FAP and DAP problems. Examples are the knapsack problem solution [Ceri et al., 1982a], branch-and-bound techniques [Fisher and Hochbaum, 1980], and network flow algorithms [Chang and Liu, 1982].

There have been other attempts to reduce the complexity of the problem. One strategy has been to assume that all the candidate partitionings have been determined together with their associated costs and benefits in terms of query processing. The problem, then, is modeled so as to choose the optimal partitioning and placement for each relation [Ceri et al., 1983]. Another simplification frequently employed is to ignore replication at first and find an optimal non-replicated solution. Replication is handled at the second step by applying a greedy algorithm which starts with the non-replicated solution as the initial feasible solution, and tries to improve upon it ([Ceri et al., 1983] and [Ceri and Pernici, 1985]). For these heuristics, however, there is not enough data to determine how close the results are to the optimal.
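The two-step idea (find a non-replicated allocation first, then add replicas greedily) can be illustrated with a generic sketch. This is not the algorithm of the cited papers: it is a simple first-improvement loop, and the function name, the cost function, and the sample numbers are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Allocation = Dict[str, Set[str]]

def greedy_replicate(allocation: Allocation, sites: List[str],
                     cost: Callable[[Allocation], float]
                     ) -> Tuple[Allocation, float]:
    """Start from a non-replicated allocation and keep adding a replica
    whenever doing so lowers the total cost."""
    current = {f: set(s) for f, s in allocation.items()}
    best = cost(current)
    improved = True
    while improved:
        improved = False
        for f in current:
            for s in sites:
                if s in current[f]:
                    continue
                current[f].add(s)          # tentatively place a replica
                c = cost(current)
                if c < best:
                    best, improved = c, True
                else:
                    current[f].remove(s)   # no gain, so undo the placement
    return current, best

# Hypothetical cost: 3 per stored copy, plus remote-read traffic from
# every site that does not hold the fragment locally.
reads = {"F1": {"S1": 10, "S2": 8, "S3": 1}}

def toy_cost(alloc: Allocation) -> float:
    total = 0.0
    for f, holders in alloc.items():
        total += 3 * len(holders)
        total += sum(r for s, r in reads[f].items() if s not in holders)
    return total

start = {"F1": {"S1"}}
final, final_cost = greedy_replicate(start, ["S1", "S2", "S3"], toy_cost)
print(sorted(final["F1"]), final_cost)  # ['S1', 'S2'] 7.0
```

In this toy run a replica at S2 pays for itself because S2 reads the fragment often, while a replica at S3 would cost more to store than it saves. A greedy pass of this kind carries no guarantee about its distance from the optimal allocation.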

3.5 Data Directory

The distributed database schema needs to be stored and maintained by the system. This information is necessary during distributed query optimization, as we will discuss later. The schema information is stored in a data dictionary/directory, also called a catalog or simply a directory. A directory is a meta-database that stores a variety of information.

Within the context of the centralized ANSI/SPARC architecture discussed in Section 1.7.1, the directory is the system component that permits mapping between different data organizational views. It should at least contain schema and mapping definitions. It may also contain usage statistics, access control information, and the like. The data dictionary/directory thus serves as the central component both in processing different schemas and in providing mappings among them.

In the case of a distributed database, as depicted in Figure 1.14 and discussed earlier in this chapter, schema definition is done at the global level (i.e., the global conceptual schema – GCS) as well as at the local sites (i.e., local conceptual schemas – LCSs). Consequently, there are two types of directories: a global directory/dictionary (GD/D)⁵ that describes the database schema as the end users see it, and that permits the required global mappings between external schemas and the GCS, and the local directory/dictionary (LD/D), that describes the local mappings and describes the schema at each site. Thus, the local database management components are integrated by means of global DBMS functions.

As stated above, the directory is itself a database that contains metadata about the actual data stored in the database. Therefore, the techniques we discussed in this chapter with respect to distributed database design also apply to directory management. Briefly, a directory may be either global to the entire database or local to each site. In other words, there might be a single directory containing information about all the data in the database, or a number of directories, each containing the information stored at one site. In the latter case, we might either build hierarchies of directories to facilitate searches, or implement a distributed search strategy that involves considerable communication among the sites holding the directories.

The second issue has to do with location. In the case of a global directory, it may be maintained centrally at one site, or in a distributed fashion by distributing it over a number of sites. Keeping the directory at one site might increase the load at that site, thereby causing a bottleneck as well as increasing message traffic around that site. Distributing it over a number of sites, on the other hand, increases the complexity of managing directories. In the case of multi-DBMSs, the choice is dependent on whether or not the system is distributed. If it is, the directory is always distributed; otherwise of course, it is maintained centrally.

The final issue is replication. There may be a single copy of the directory or multiple copies. Multiple copies would provide more reliability, since the probability of reaching one copy of the directory would be higher. Furthermore, the delays in accessing the directory would be lower, due to less contention and the relative proximity of the directory copies. On the other hand, keeping the directory up to date would be considerably more difficult, since multiple copies would need to be updated. Therefore, the choice should depend on the environment in which the system operates and should be made by balancing such factors as the response-time requirements, the size of the directory, the machine capacities at the sites, the reliability requirements, and the volatility of the directory (i.e., the amount of change experienced by the database, which would cause a change to the directory).

⁵ In the remainder, we will simply refer to this as the global directory.

3.6 Conclusion

In this chapter, we presented the techniques that can be used for distributed database design with special emphasis on the fragmentation and allocation issues. There are a number of lines of research that have been followed in distributed database design. For example, Chang has independently developed a theory of fragmentation [Chang and Cheng, 1980], and allocation [Chang and Liu, 1982]. However, for its maturity of development, we have chosen to develop this chapter along the track developed by Ceri, Pelagatti, Navathe, and Wiederhold. Our references to the literature by these authors reflect this quite clearly.

There is a considerable body of literature on the allocation problem, focusing mostly on the simpler file allocation issue. We still do not have sufficiently general models that take into consideration all the aspects of data distribution. The model presented in Section 3.4 highlights the types of issues that need to be taken into account. Within this context, it might be worthwhile to take a somewhat different approach to the solution of the distributed allocation problem. One might develop a set of heuristic rules that might accompany the mathematical formulation and reduce the solution space, thus making the solution feasible.

We have discussed, in detail, the algorithms that one can use to fragment a relational schema in various ways. These algorithms have been developed quite independently and there is no underlying design methodology that combines the horizontal and vertical partitioning techniques. If one starts with a global relation, there are algorithms to decompose it horizontally as well as algorithms to decompose it vertically into a set of fragment relations. However, there are no algorithms that fragment a global relation into a set of fragment relations some of which are decomposed horizontally and others vertically. It is commonly pointed out that most real-life fragmentations would be mixed, i.e., would involve both horizontal and vertical partitioning of a relation, but the methodology research to accomplish this is lacking. What is needed is a distribution design methodology which encompasses the horizontal and vertical fragmentation algorithms and uses them as part of a more general strategy. Such a methodology should take a global relation together with a set of design criteria and come up with a set of fragments some of which are obtained via horizontal and others obtained via vertical fragmentation.


The second part of distribution design, namely allocation, is typically treated independently of fragmentation. The process is, therefore, linear when the output of fragmentation is input to allocation. At first sight, the isolation of the fragmentation and the allocation steps appears to simplify the formulation of the problem by reducing the decision space. However, closer examination reveals that isolating the two steps actually contributes to the complexity of the allocation models. Both steps have similar inputs, differing only in that fragmentation works on global relations whereas allocation considers fragment relations. They both require information about the user applications (e.g., how often they access data, what the relationships of individual data objects to one another are, etc.), but ignore how each other makes use of these inputs. The end result is that the fragmentation algorithms decide how to partition a relation based partially on how applications access it, but the allocation models ignore the part that this input plays in fragmentation. Therefore, the allocation models have to include all over again detailed specification of the relationship among the fragment relations and how user applications access them. What would be more promising is to formulate a methodology that more properly reflects the interdependence of the fragmentation and the allocation decisions. This requires extensions to existing distribution design strategies. We recognize that integrated methodologies such as the one we propose here may be considerably complex. However, there may be synergistic effects of combining these two steps enabling the development of quite acceptable heuristic solution methods. There are a few studies that follow such an integrated methodology (e.g., [Muro et al., 1983, 1985; Yoshida et al., 1985]). These methodologies build a simulation model of the distributed DBMS, taking as input a specific database design, and measure its effectiveness. 
Development of tools based on such methodologies, which aid the human designer rather than attempt to replace him, is probably the more appropriate approach to the design problem.

Another aspect of the work described in this chapter is that it assumes a static environment where design is conducted only once and this design can persist. Reality, of course, is quite different. Both physical (e.g., network characteristics, available storage at various sites) and logical (e.g., migration of applications from one site to another, access pattern modifications) changes occur, necessitating redesign of the database. This problem has been studied to some extent. In a dynamic environment, the process becomes one of design-redesign-materialization of the redesign. The design step follows techniques that have been described in this chapter. Redesign can either be limited, in that only parts of the database are affected, or total, requiring a complete redistribution [Wilson and Navathe, 1986]. Materialization refers to the reorganization of the distributed database to reflect the changes required by the redesign step. Limited redesign, in particular the materialization issue, is studied in [Rivera-Vega et al., 1990; Varadarajan et al., 1989]. Complete redesign and materialization issues have been studied in [Karlapalem et al., 1996b; Karlapalem and Navathe, 1994; Kazerouni and Karlapalem, 1997]. In particular, Kazerouni and Karlapalem [1997] describe a stepwise redesign methodology which involves a split phase where fragments are further subdivided based on the changed application requirements until no further subdivision is profitable based on a cost function. At this point, the merging phase starts where fragments that are accessed together by a set of applications are merged into one fragment.

3.7 Bibliographic Notes

Most of the known results about fragmentation have been covered in this chapter. Work on fragmentation in distributed databases initially concentrated on horizontal fragmentation. Most of the literature on this has been cited in the appropriate section. The topic of vertical fragmentation for distribution design has been addressed in several papers ([Navathe et al., 1984] and [Sacca and Wiederhold, 1985]). The original work on vertical fragmentation goes back to Hoffer’s dissertation [Hoffer, 1975; Hoffer and Severance, 1975] and to Hammer and Niamir’s work ([Niamir, 1978] and [Hammer and Niamir, 1979]).

It is not possible to be as exhaustive when discussing allocation as we have been for fragmentation, given there is no limit to the literature on the subject. The investigation of FAP on wide area networks goes back to Chu’s work [Chu, 1969, 1973]. Most of the early work on FAP has been covered in the excellent survey by Dowdy and Foster [1982]. Some theoretical results about FAP are reported by Grapa and Belford [1977] and Kollias and Hatzopoulos [1981].

The DAP work dates back to the mid-1970s, to the works of Eswaran [1974] and others. In their earlier work, Levin and Morgan [1975] concentrated on data allocation, but later they considered program and data allocation together [Morgan and Levin, 1977]. The DAP has been studied in many specialized settings as well. Work has been done to determine the placement of computers and data in a wide area network design [Gavish and Pirkul, 1986]. Channel capacities have been examined along with data placement [Mahmoud and Riordon, 1976] and data allocation on supercomputer systems [Irani and Khabbaz, 1982] as well as on a cluster of processors [Sacca and Wiederhold, 1985]. An interesting work is the one by Apers, where the relations are optimally placed on the nodes of a virtual network, and then the best matching between the virtual network nodes and the physical network is found [Apers, 1981].

Some of the allocation work has also touched upon physical design. The assignment of files to various levels of a memory hierarchy has been studied by Foster and Browne [1976] and by Navathe et al. [1984]. These are outside the scope of this chapter, as are those that deal with general resource and task allocation in distributed systems (e.g., [Bucci and Golinelli, 1977], [Ceri and Pelagatti, 1982], and [Haessig and Jenny, 1980]).

We should finally point out that some effort was spent to develop a general methodology for distributed database design along the lines that we presented (Figure 3.2). Ours is similar to the DATAID-D methodology [Ceri and Navathe, 1983; Ceri et al., 1987]. Other attempts to develop a methodology are due to Fisher et al. [1980], Dawson [1980], Hevner and Schneider [1980], and Mohan [1979].


Exercises

Problem 3.1 (*). Given relation EMP as in Figure 3.3, let p1: TITLE < “Programmer” and p2: TITLE > “Programmer” be two simple predicates. Assume that character strings have an order among them, based on the alphabetical order.

(a) Perform a horizontal fragmentation of relation EMP with respect to {p1, p2}.

(b) Explain why the resulting fragmentation (EMP1, EMP2) does not fulfill the correctness rules of fragmentation.

(c) Modify the predicates p1 and p2 so that they partition EMP obeying the correctness rules of fragmentation. To do this, modify the predicates, compose all minterm predicates and deduce the corresponding implications, and then perform a horizontal fragmentation of EMP based on these minterm predicates. Finally, show that the result has completeness, reconstruction and disjointness properties.

Problem 3.2 (*). Consider relation ASG in Figure 3.3. Suppose there are two applications that access ASG. The first is issued at five sites and attempts to find the duration of assignment of employees given their numbers. Assume that managers, consultants, engineers, and programmers are located at four different sites. The second application is issued at two sites where the employees with an assignment duration of less than 20 months are managed at one site, whereas those with longer duration are managed at a second site. Derive the primary horizontal fragmentation of ASG using the foregoing information.

Problem 3.3. Consider relations EMP and PAY in Figure 3.3. EMP and PAY are horizontally fragmented as follows:

EMP1 = σ_{TITLE=“Elect.Eng.”}(EMP)
EMP2 = σ_{TITLE=“Syst.Anal.”}(EMP)
EMP3 = σ_{TITLE=“Mech.Eng.”}(EMP)
EMP4 = σ_{TITLE=“Programmer”}(EMP)

PAY1 = σ_{SAL≥30000}(PAY)
PAY2 = σ_{SAL<30000}(PAY)

Draw the join graph of EMP ⋈_TITLE PAY. Is the graph simple or partitioned? If it is partitioned, modify the fragmentation of either EMP or PAY so that the join graph of EMP ⋈_TITLE PAY is simple.

Problem 3.4. Give an example of a CA matrix where the split point is not unique and the partition is in the middle of the matrix. Show the number of shift operations required to obtain a single, unique split point.

Problem 3.5 (**). Given relation PAY as in Figure 3.3, let p1: SAL < 30000 and p2: SAL ≥ 30000 be two simple predicates. Perform a horizontal fragmentation of PAY with respect to these predicates to obtain PAY1 and PAY2. Using the fragmentation of PAY, perform further derived horizontal fragmentation for EMP. Show completeness, reconstruction, and disjointness of the fragmentation of EMP.


Problem 3.6 (**). Let Q = {q1, ..., q5} be a set of queries, A = {A1, ..., A5} be a set of attributes, and S = {S1, S2, S3} be a set of sites. The matrix of Figure 3.21a describes the attribute usage values and the matrix of Figure 3.21b gives the application access frequencies. Assume that ref_i(q_k) = 1 for all q_k and S_i and that A1 is the key attribute. Use the bond energy and vertical partitioning algorithms to obtain a vertical fragmentation of the set of attributes in A.

(a)
        A1   A2   A3   A4   A5
   q1    0    1    1    0    1
   q2    1    1    1    0    1
   q3    1    0    0    1    1
   q4    0    0    1    0    0
   q5    1    1    1    0    0

(b)
        S1   S2   S3
   q1   10   20    0
   q2    5    0   10
   q3    0   35    5
   q4    0   10    0
   q5    0   15    0

Fig. 3.21 Attribute Usage Values and Application Access Frequencies in Exercise 3.6

Problem 3.7 (**). Write an algorithm for derived horizontal fragmentation.

Problem 3.8 (**). Assume the following view definition

CREATE VIEW EMPVIEW(ENO, ENAME, PNO, RESP)
AS SELECT EMP.ENO, EMP.ENAME, ASG.PNO, ASG.RESP
   FROM   EMP, ASG
   WHERE  EMP.ENO=ASG.ENO
   AND    DUR=24

is accessed by application q1, located at sites 1 and 2, with frequencies 10 and 20, respectively. Let us further assume that there is another query q2 defined as

SELECT ENO, DUR
FROM   ASG

which is run at sites 2 and 3 with frequencies 20 and 10, respectively. Based on the above information, construct the use(q_i, A_j) matrix for the attributes of both relations EMP and ASG. Also construct the affinity matrix containing all attributes of EMP and ASG. Finally, transform the affinity matrix so that it could be used to split the relation into two vertical fragments using heuristics or BEA.

Problem 3.9 (**). Formally define the three correctness criteria for derived horizontal fragmentation.


Problem 3.10 (*). Given a relation R(K,A,B,C) (where K is the key) and the following query

SELECT *
FROM   R
WHERE  R.A = 10 AND R.B=15

(a) What will be the outcome of running PHF on this query?

(b) Does the COM_MIN algorithm produce in this case a complete and minimal predicate set? Justify your answer.

Problem 3.11 (*). Show that the bond energy algorithm generates the same results using either row or column operation.

Problem 3.12 (**). Modify algorithm PARTITION to allow n-way partitioning, and compute the complexity of the resulting algorithm.

Problem 3.13 (**). Formally define the three correctness criteria for hybrid fragmentation.

Problem 3.14. Discuss how the order in which the two basic fragmentation schemas are applied in hybrid fragmentation affects the final fragmentation.

Problem 3.15 (**). Describe how the following can be properly modeled in the database allocation problem.

(a) Relationships among fragments

(b) Query processing

(c) Integrity enforcement

(d) Concurrency control mechanisms

Problem 3.16 (**). Consider the various heuristic algorithms for the database allocation problem.

(a) What are some of the reasonable criteria for comparing these heuristics? Discuss.

(b) Compare the heuristic algorithms with respect to these criteria.

Problem 3.17 (*). Pick one of the heuristic algorithms used to solve the DAP, and write a program for it.

Problem 3.18 (**). Assume the environment of Exercise 3.8. Also assume that 60% of the accesses of query q1 are updates to PNO and RESP of view EMPVIEW and that ASG.DUR is not updated through EMPVIEW. In addition, assume that the data transfer rate between site 1 and site 2 is half of that between site 2 and site 3. Based on the above information, find a reasonable fragmentation of ASG and EMP and an optimal replication and placement for the fragments, assuming that storage costs do not matter here, but copies are kept consistent.


Hint: Consider horizontal fragmentation for ASG based on the DUR=24 predicate and the corresponding derived horizontal fragmentation for EMP. Also look at the affinity matrix obtained in Exercise 3.8 for EMP and ASG together, and consider whether it would make sense to perform a vertical fragmentation for ASG.

Chapter 4 Database Integration

In the previous chapter, we discussed top-down distributed database design, which is suitable for tightly integrated, homogeneous distributed DBMSs. In this chapter, we focus on bottom-up design that is appropriate in multidatabase systems. In this case, a number of databases already exist, and the design task involves integrating them into one database. The starting point of bottom-up design is the individual local conceptual schemas. The process consists of integrating local databases with their (local) schemas into a global database with its global conceptual schema (GCS) (also called the mediated schema).

Database integration, and the related problem of querying multidatabases (see Chapter 9), is only one part of the more general interoperability problem. In recent years, new distributed applications have started to pose new requirements regarding the data source(s) they access. In parallel, the management of “legacy systems” and reuse of the data they generate have gained importance. The result has been a renewed consideration of the broader question of information system interoperability, including non-database sources and interoperability at the application level in addition to the database level.

Database integration can be either physical or logical [Jhingran et al., 2002]. In the former, the source databases are integrated and the integrated database is materialized. These are known as data warehouses. The integration is aided by extract-transform-load (ETL) tools that enable extraction of data from sources, their transformation to match the GCS, and their loading (i.e., materialization). Enterprise Application Integration (EAI), which allows data exchange between applications, performs similar transformation functions, although data are not entirely materialized. This process is depicted in Figure 4.1. In logical integration, the global conceptual (or mediated) schema is entirely virtual and not materialized. This is also known as Enterprise Information Integration (EII).¹

¹ It has been (rightly) argued that the second “I” should stand for Interoperability rather than Integration (see J. Pollock’s contribution in [Halevy et al., 2005]).

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_4, © Springer Science+Business Media, LLC 2011

[Figure: Database 1, Database 2, ..., Database n feed the ETL tools, which load the materialized global database.]

Fig. 4.1 Data Warehouse Approach

These two approaches are complementary and address differing needs. Data warehousing [Inmon, 1992; Jarke et al., 2003] supports decision support applications, which are commonly termed On-line Analytical Processing (OLAP) [Codd, 1995] to better reflect their different requirements relative to the On-Line Transaction Processing (OLTP) applications. OLTP applications, such as airline reservation or banking systems, are high-throughput transaction-oriented. They need extensive data control and availability, high multiuser throughput and predictable, fast response times. In contrast, OLAP applications, such as trend analysis or forecasting, need to analyze historical, summarized data coming from a number of operational databases. They use complex queries over potentially very large tables. Because of their strategic nature, response time is important. The users are managers or analysts. Performing OLAP queries directly over distributed operational databases raises two problems. First, it hurts the OLTP applications’ performance by competing for local resources. Second, the overall response time of the OLAP queries can be very poor because large quantities of data must be transferred over the network. Furthermore, most OLAP applications do not need the most current versions of the data, and thus do not need direct access to most up-to-date operational data. Consequently, data warehouses gather data from a number of operational databases and materialize them. As updates happen on the operational databases, they are propagated to the data warehouse (also referred to as materialized view maintenance [Gupta and Mumick, 1999b]).
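The extract, transform, and load steps mentioned above can be pictured with a small sketch. It is only an illustration under stated assumptions: the two source layouts, the GCS attribute names (emp_id, name, title), and the function names are all hypothetical and are not taken from any particular ETL tool.

```python
# Hypothetical operational sources with different layouts (assumptions).
SOURCE_1 = [{"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."}]
SOURCE_2 = [{"WNUMBER": "W7", "NAME": "M. Smith", "TITLE": "Programmer"}]

# Assumed renaming rules: source attribute -> GCS attribute.
RENAMES = {
    "source_1": {"ENO": "emp_id", "ENAME": "name", "TITLE": "title"},
    "source_2": {"WNUMBER": "emp_id", "NAME": "name", "TITLE": "title"},
}


def extract(source_name):
    """Extract: read all rows from one operational source."""
    return {"source_1": SOURCE_1, "source_2": SOURCE_2}[source_name]


def transform(source_name, row):
    """Transform: rename attributes so that the row matches the GCS."""
    mapping = RENAMES[source_name]
    out = {mapping[attr]: value for attr, value in row.items()}
    out["origin"] = source_name  # remember where each tuple came from
    return out


def load(warehouse, rows):
    """Load: materialize the transformed rows in the warehouse."""
    warehouse.extend(rows)


def run_etl():
    """Run extract, transform, and load once over every source."""
    warehouse = []
    for source_name in RENAMES:
        rows = [transform(source_name, r) for r in extract(source_name)]
        load(warehouse, rows)
    return warehouse
```

Calling run_etl() returns a list that plays the role of the materialized global database. Propagating later updates from the sources (materialized view maintenance) is deliberately left out of this sketch.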

By contrast, in logical data integration, the integration is only virtual and there is no materialized global database (see Figure 1.18). The data resides in the operational databases and the GCS provides a virtual integration for querying over them similar to the case described in the previous chapter. The difference is that the GCS may not be the union of the local conceptual schemas (LCSs). It is possible for the GCS not to capture all of the information in each of the LCSs. Furthermore, in some cases, the GCS may be defined bottom-up, by “integrating” parts of the LCSs of the local operational databases rather than being defined up-front (more on this shortly). User queries are posed over this global schema and are then decomposed and shipped to the local operational databases for processing, as is done in tightly-integrated systems. The main differences are the autonomy and potential heterogeneity of the local systems. These have important effects on query processing that we discuss in Chapter 9. Although there is ample work on transaction management in these systems, supporting global updates is quite difficult given the autonomy of the underlying operational DBMSs. Therefore, they are primarily read-only.

Logical data integration, and the resulting systems, are known by a variety of names; data integration and information integration are perhaps the most common terms used in the literature. The generality of these terms points to the fact that the underlying data sources do not have to be databases. In this chapter we focus our attention on the integration of autonomous and (possibly) heterogeneous databases; thus we will use the term database integration (which also helps to distinguish these systems from data warehouses).

4.1 Bottom-Up Design Methodology

Bottom-up design involves the process by which information from participating databases can be (physically or logically) integrated to form a single cohesive multidatabase. There are two alternative approaches. In some cases, the global conceptual (or mediated) schema is defined first, in which case the bottom-up design involves mapping LCSs to this schema. This is the case in data warehouses, but the practice is not restricted to these and other data integration methodologies may follow the same strategy. In other cases, the GCS is defined as an integration of parts of LCSs. In this case, the bottom-up design involves both the generation of the GCS and the mapping of individual LCSs to this GCS.

If the GCS is defined up-front, the relationship between the GCS and the local conceptual schemas (LCS) can be of two fundamental types [Lenzerini, 2002]: local-as-view, and global-as-view. In local-as-view (LAV) systems, the GCS definition exists, and each LCS is treated as a view definition over it. In global-as-view systems (GAV), on the other hand, the GCS is defined as a set of views over the LCSs. These views indicate how the elements of the GCS can be derived, when needed, from the elements of LCSs. One way to think of the difference between the two is in terms of the results that can be obtained from each system [Koch, 2001]. In GAV, the query results are constrained to the set of objects that are defined in the GCS, although the local DBMSs may be considerably richer (Figure 4.2a). In LAV, on the other hand, the results are constrained by the objects in the local DBMSs, while the GCS definition may be richer (Figure 4.2b). Thus, in LAV systems, it may be necessary to deal with incomplete answers. A combination of these two approaches has also been proposed as global-local-as-view (GLAV) [Friedman et al., 1999] where the relationship between GCS and LCSs is specified using both LAV and GAV.
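The contrast between GAV and LAV can be made concrete with a deliberately simplified sketch. All source contents, relation layouts, and function names below are assumptions made for illustration; in particular, the LAV part only uses each source description to select relevant sources and does not perform the view-based query rewriting that real LAV systems require.

```python
# Two hypothetical local sources (contents are assumptions).
SRC1_PROJ = [("P1", "Instrumentation", "Montreal")]  # (PNO, PNAME, LOC)
SRC2_PROJECT = [("P2", "Database Develop."), ("P3", "CAD/CAM")]  # (PNUMBER, PNAME)


# GAV: a global relation is defined as a view (here, a function)
# over the local sources.
def gav_global_project():
    """Global PROJECT(pno, pname) as a union of projections on the sources."""
    from_src1 = [(pno, pname) for (pno, pname, _loc) in SRC1_PROJ]
    return from_src1 + list(SRC2_PROJECT)


def gav_query_names():
    """A query over the GCS is answered by unfolding the view definition."""
    return sorted(pname for (_pno, pname) in gav_global_project())


# LAV: the global relation PROJECT(pno, pname, loc) is defined independently,
# and each source is described as a view over it. Here every description is
# a selection on loc, which is an assumption of this sketch.
LAV_SOURCES = {
    "src1": {"covers_loc": "Montreal",
             "tuples": [("P1", "Instrumentation", "Montreal")]},
    "src3": {"covers_loc": "Paris",
             "tuples": [("P4", "Maintenance", "Paris")]},
}


def lav_query_by_loc(loc):
    """Answer 'projects located at loc' using only sources whose description
    says they can hold such tuples; the answer may be incomplete."""
    answer = []
    for source in LAV_SOURCES.values():
        if source["covers_loc"] == loc:
            answer.extend(source["tuples"])
    return answer
```

In this sketch, lav_query_by_loc("New York") is a perfectly valid query over the GCS, but it returns an empty list because no source is described as covering that location. This is one way in which incomplete answers arise in LAV. In the GAV part, the location attribute of the first source is simply not visible through the global relation, even though the source stores it.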

[Figure: two panels, (a) GAV and (b) LAV. Labels: objects accessible through GCS; objects expressible as queries over the source DBMSs; objects expressible as queries over the GCS; Source DBMS 1, ..., Source DBMS n.]

Fig. 4.2 GAV and LAV Mappings (Based on [Koch, 2001])

Bottom-up design occurs in two general steps (Figure 4.3): schema translation (or simply translation) and schema generation. In the first step, the component database schemas are translated to a common intermediate canonical representation (InS1, InS2, ..., InSn). The use of a canonical representation facilitates the translation process by reducing the number of translators that need to be written. The choice of the canonical model is important. As a principle, it should be one that is sufficiently expressive to incorporate the concepts available in all the databases that will later be integrated. Alternatives that have been used include the entity-relationship model [Palopoli et al., 1998, 2003b; He and Ling, 2006], object-oriented model [Castano and Antonellis, 1999; Bergamaschi et al., 2001], or a graph [Palopoli et al., 1999; Milo and Zohar, 1998; Melnik et al., 2002; Do and Rahm, 2002] that may be simplified to a tree [Madhavan et al., 2001]. The graph (tree) models have become more popular as XML data sources have proliferated, since it is fairly straightforward to map XML to graphs, although there are efforts to target XML directly [Yang et al., 2003]. In this chapter, we will simply use the relational model as our canonical data model, because we have been using it throughout the book, and the graph models used in literature are quite diverse with no common graph representation. The choice of the relational model as the canonical data representation does not affect in any fundamental way the discussion of the major issues of data integration. In any case, we will not discuss the specifics of translating various data models to relational; this can be found in many database textbooks.

Clearly, the translation step is necessary only if the component databases are heterogeneous and local schemas are defined using different data models. There has been some work on the development of system federation, in which systems with similar data models are integrated together (e.g., relational systems are integrated into one conceptual schema and, perhaps, object databases are integrated to another schema) and these integrated schemas are “combined” at a later stage (e.g., AURORA project [Yan, 1997; Yan et al., 1997]). In this case, the translation step is delayed, providing increased flexibility for applications to access underlying data sources in a manner that is suitable for their needs.

In the second step of bottom-up design, the intermediate schemas are used to generate a GCS. In some methodologies, local external (or export) schemas are considered for integration rather than full database schemas, to reflect the fact that local systems may only be willing to contribute some of their data to the multidatabase [Sheth and Larson, 1990].

[Figure: the schemas of Database 1, Database 2, ..., Database n pass through Translator 1, Translator 2, ..., Translator n to produce InS1, InS2, ..., InSn, which are input to the Schema Generator (schema matching, schema integration, schema mapping) that produces the GCS.]

Fig. 4.3 Database Integration Process

The schema generation process consists of the following steps:

1. Schema matching to determine the syntactic and semantic correspondences among the translated LCS elements or between individual LCS elements and the pre-defined GCS elements (Section 4.2).

2. Integration of the common schema elements into a global conceptual (mediated) schema if one has not yet been defined (Section 4.3).

3. Schema mapping that determines how to map the elements of each LCS to the other elements of the GCS (Section 4.4).

It is also possible that the schema mapping step may be divided into two phases [Bernstein and Melnik, 2007]: mapping constraint generation and transformation generation. In the first phase, given correspondences between two schemas, a transformation function such as a query or view definition over the source schema is generated that would “populate” the target schema. In the second phase, executable code is generated corresponding to this transformation function that would actually generate a target database consistent with these constraints. In some cases, the constraints are implicitly included in the correspondences, eliminating the need for the first phase.

Example 4.1. To facilitate our discussion of global schema design in multidatabase systems, we will use an example that is an extension of the engineering database we have been using throughout the book. To demonstrate both phases of the database integration process, we introduce some data model heterogeneity into our example.

Consider two organizations, each with its own database definition. One is the (relational) database example that we have developed in Chapter 2. We repeat that definition in Figure 4.4 for completeness. The underscored attributes are the keys of the associated relations. We have made one modification in the PROJ relation by including attributes LOC and CNAME. LOC is the location of the project, whereas CNAME is the name of the client for whom the project is carried out. The second database also defines similar data, but is specified according to the entity-relationship (E-R) data model [Chen, 1976] as depicted in Figure 4.5.

EMP(ENO, ENAME, TITLE)

PROJ(PNO, PNAME, BUDGET, LOC, CNAME)

ASG(ENO, PNO, RESP, DUR)

PAY(TITLE, SAL)

Fig. 4.4 Relational Engineering Database Representation

We assume that the reader is familiar with the entity-relationship data model. Therefore, we will not describe the formalism, except to make the following points regarding the semantics of Figure 4.5. This database is similar to the relational engineering database definition of Figure 4.4, with one significant difference: it also maintains data about the clients for whom the projects are conducted. The rectangular boxes in Figure 4.5 represent the entities modeled in the database, and the diamonds indicate a relationship between the entities to which they are connected. The type of relationship is indicated around the diamonds. For example, the CONTRACTED_BY relationship is many-to-one from the PROJECT entity to the CLIENT entity (i.e., each project has a single client, but each client can have many projects). Similarly, the WORKS_IN relationship indicates a many-to-many relationship between the two connected entities. The attributes of entities and relationships are shown as ellipses.

Example 4.2. The mapping of the E-R model to the relational model is given in Figure 4.6. Note that we have renamed some of the attributes in order to ensure name uniqueness.

[Figure: E-R diagram with entities WORKER, PROJECT, and CLIENT, relationships WORKS_IN and CONTRACTED_BY, and attributes Number, Name, Title, Salary, Project Name, Budget, Location, Client name, Address, Contract number, Responsibility, and Duration.]

Fig. 4.5 Entity-Relationship Database

WORKER(WNUMBER, NAME, TITLE, SALARY)

PROJECT(PNUMBER, PNAME, BUDGET)

CLIENT(CNAME, ADDRESS)

WORKS_IN(WNUMBER, PNUMBER, RESPONSIBILITY, DURATION)

CONTRACTED_BY(PNUMBER, CNAME, CONTRACTNO)

Fig. 4.6 Relational Mapping of E-R Schema

4.2 Schema Matching

Schema matching determines which concepts of one schema match those of another. As discussed earlier, if the GCS has already been defined, then one of these schemas is typically the GCS, and the task is to match each LCS to the GCS. Otherwise, matching is done on two LCSs. The matches that are determined in this phase are then used in schema mapping to produce a set of directed mappings, which, when applied to the source schema, would map its concepts to the target schema.

The matches that are defined or discovered during schema matching are specified as a set of rules where each rule (r) identifies a correspondence (c) between two elements, a predicate (p) that indicates when the correspondence may hold, and a similarity value (s) between the two elements identified in the correspondence. A correspondence (c) may simply identify that two concepts are similar (which we will denote by ≈) or it may be a function that specifies that one concept may be derived by a computation over the other one (for example, if the BUDGET value of one project is specified in US dollars while the other one is specified in Euros, the correspondence may specify that one is obtained by multiplying the other one with the appropriate exchange rate). The predicate (p) is a condition that qualifies the correspondence by specifying when it might hold. For example, in the budget example specified above, p may specify that the rule holds only if the location of one project is in US while the other one is in the Euro zone. The similarity value (s) for each rule can be specified or calculated. Similarity values are real values in the range [0,1]. Thus, a set of matches can be defined as M = {r} where r = 〈c, p, s〉.
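One possible way to represent a rule r = 〈c, p, s〉 is sketched below, using the budget example. The class name, the field names, the context keys, and the exchange-rate constant are all assumptions introduced for illustration; the text does not prescribe any particular representation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MatchRule:
    """One rule r = <c, p, s> between a source and a target element."""
    source: str
    target: str
    # c: None stands for plain similarity (the "≈" case); otherwise a
    # function that derives the target value from the source value.
    correspondence: Optional[Callable[[float], float]]
    # p: condition under which the correspondence may hold.
    predicate: Callable[[dict], bool]
    # s: similarity value, a real number in [0, 1].
    similarity: float

    def __post_init__(self):
        if not 0.0 <= self.similarity <= 1.0:
            raise ValueError("similarity must lie in [0, 1]")


ASSUMED_USD_PER_EUR = 1.10  # illustrative constant, not a real exchange rate

budget_rule = MatchRule(
    source="PROJ.BUDGET",
    target="PROJECT.BUDGET",
    correspondence=lambda euros: euros * ASSUMED_USD_PER_EUR,
    predicate=lambda ctx: (ctx["source_loc"] == "Euro zone"
                           and ctx["target_loc"] == "US"),
    similarity=0.9,  # assumed value; it could also be computed
)


def apply_rule(rule, value, ctx):
    """Return the derived value if the predicate holds, otherwise None."""
    if not rule.predicate(ctx):
        return None
    if rule.correspondence is None:
        return value
    return rule.correspondence(value)


# The set of matches M is then simply a collection of such rules.
M = [budget_rule]
```

Which of the two BUDGET attributes is expressed in Euros is an arbitrary choice made here. The point of the sketch is only that c, p, and s are kept together in one rule and that c is applied only when p holds.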

As indicated above, correspondences may either be discovered or specified. As much as it is desirable to automate this process, as we discuss below, there are many complicating factors. The most important is schema heterogeneity, which refers to the differences in the way real-world phenomena are captured in different schemas. This is a critically important issue, and we devote a separate section to it (Section 4.2.1). Aside from schema heterogeneity, other issues that complicate the matching process are the following:

• Insufficient schema and instance information: Matching algorithms depend on the information that can be extracted from the schema and the existing data instances. In some cases there is some ambiguity of the terms due to the insufficient information provided about these items. For example, using short names or ambiguous abbreviations for concepts, as we have done in our examples, can lead to incorrect matching.

• Unavailability of schema documentation: In most cases, the database schemas are not well documented or not documented at all. Quite often, the schema designer is no longer available to guide the process. The lack of these vital information sources adds to the difficulty of matching.

• Subjectivity of matching: Finally, we need to note (and admit) that matching schema elements can be highly subjective; two designers may not agree on a single “correct” mapping. This makes the evaluation of a given algorithm’s accuracy significantly difficult.

Despite these difficulties, serious progress has been made in recent years in developing algorithmic approaches to the matching problem. In this section, we discuss a number of these algorithms and the various approaches.

A number of issues affect the particular matching algorithm [Rahm and Bernstein, 2001]. The more important ones are the following:

• Schema versus instance matching. So far in this chapter, we have been focusing on schema integration; thus, our attention has naturally been on matching concepts of one schema to those of another. A large number of algorithms have been developed that work on “schema objects.” There are others, however, that have focused instead on the data instances or a combination of schema information and data instances. The argument is that considering data instances can help alleviate some of the semantic issues discussed above. For example, if an attribute name is ambiguous, as in “contact-info”, then fetching its data may help identify its meaning; if its data instances have the phone number format, then obviously it is the phone number of the contact agent, while long strings may indicate that it is the contact agent name. Furthermore, there are a large number of attributes, such as postal codes, country names, email addresses, that can be defined easily through their data instances. Matching that relies solely on schema data may be more efficient, because it does not require a search over data instances to match the attributes. Furthermore, this approach is the only feasible one when few data instances are available in the matched databases, in which case learning may not be reliable. However, in peer-to-peer systems (see Chapter 16), there may not be a schema, in which case instance-based matching is the only appropriate approach.

• Element-level vs. structure-level. Some matching algorithms operate on individual schema elements while others also consider the structural relationships between these elements. The basic concept of the element-level approach is that most of the schema semantics are captured by the elements’ names. However, this may fail to find complex mappings that span multiple attributes. Match algorithms that also consider structure are based on the belief that, normally, the structures of matchable schemas tend to be similar.

• Matching cardinality. Matching algorithms exhibit various capabilities in terms of cardinality of mappings. The simplest approaches use 1:1 mapping, which means that each element in one schema is matched with exactly one element in the other schema. The majority of proposed algorithms belong to this category, because problems are greatly simplified in this case. Of course there are many cases where this assumption is not valid. For example, an attribute named “Total price” could be mapped to the sum of two attributes in another schema named “Subtotal” and “Taxes”. Such mappings require more complex matching algorithms that consider 1:M and N:M mappings.
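The instance-based argument in the first item above (an ambiguous name such as “contact-info” can be disambiguated by looking at its values) can be sketched as follows. The regular expressions and the three categories are rough assumptions chosen for illustration; a real matcher would need far more careful patterns.

```python
import re

# Deliberately crude patterns (assumptions, not production-quality validators).
PHONE = re.compile(r"^\+?[\d\s().-]{7,}$")
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def guess_kind(values):
    """Return the most frequent kind ('phone', 'email' or 'text') among values."""
    counts = {"phone": 0, "email": 0, "text": 0}
    for value in values:
        value = value.strip()
        if EMAIL.match(value):
            counts["email"] += 1
        elif PHONE.match(value):
            counts["phone"] += 1
        else:
            counts["text"] += 1
    # On a tie, the first key in the dictionary order above wins.
    return max(counts, key=counts.get)
```

Calling guess_kind on a sample of the values stored under “contact-info” then suggests whether the attribute holds telephone numbers or names.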

These criteria, and others, can be used to come up with a taxonomy of matching approaches [Rahm and Bernstein, 2001]. According to this taxonomy (which we will follow in this chapter with some modifications), the first level of separation is between schema-based matchers versus instance-based matchers (Figure 4.7). Schema-based matchers can be further classified as element-level and structure-level, while for instance-based approaches, only element-level techniques are meaningful. At the lowest level, the techniques are characterized as either linguistic or constraint-based. It is at this level that fundamental differences between matching algorithms are exhibited and we focus on these algorithms in the remainder, discussing linguistic approaches in Section 4.2.2, constraint-based approaches in Section 4.2.3, and learning-based techniques in Section 4.2.4. Rahm and Bernstein [2001] refer to all of these as individual matcher approaches, and their combinations are possible by developing either hybrid matchers or composite matchers (Section 4.2.5).


Individual Matchers
  Schema-based
    Element-level: Linguistic, Constraint-based
    Structure-level: Constraint-based
  Instance-based
    Element-level: Linguistic, Constraint-based, Learning-based

Fig. 4.7 Taxonomy of Schema Matching Techniques

4.2.1 Schema Heterogeneity

Schema matching algorithms deal with both structural heterogeneity and semantic heterogeneity among the matched schemas. We discuss these in this section before presenting the different match algorithms.

Structural conflicts occur in four possible ways: as type conflicts, dependency conflicts, key conflicts, or behavioral conflicts [Batini et al., 1986]. Type conflicts occur when the same object is represented by an attribute in one schema and by an entity (relation) in another. Dependency conflicts occur when different relationship modes (e.g., one-to-one versus many-to-many) are used to represent the same thing in different schemas. Key conflicts occur when different candidate keys are available and different primary keys are selected in different schemas. Behavioral conflicts are implied by the modeling mechanism. For example, deleting the last item from one database may cause the deletion of the containing entity (i.e., deletion of the last employee causes the dissolution of the department).

Example 4.3. We have two structural conflicts in the example we are considering. The first is a type conflict involving clients of projects. In the schema of Figure 4.5, the client of a project is modeled as an entity. In the schema of Figure 4.4, however, the client is included as an attribute of the PROJ entity.

The second structural conflict is a dependency conflict involving the WORKS IN relationship in Figure 4.5 and the ASG relation in Figure 4.4. In the former, the relationship is many-to-one from the WORKER to the PROJECT, whereas in the latter, the relationship is many-to-many.

Structural differences among schemas are important, but their identification and resolution is not sufficient. Schema matching has to take into account the (possibly different) semantics of the schema concepts. This is referred to as semantic heterogeneity, which is a fairly loaded term without a clear definition. It basically refers to the differences among the databases that relate to the meaning, interpretation, and intended use of data [Vermeer, 1997]. There are attempts to formalize semantic heterogeneity and to establish its link to structural heterogeneity [Kashyap and Sheth, 1996; Sheth and Kashyap, 1992]; we will take a more informal approach and discuss some of the semantic heterogeneity issues intuitively. The following are some of these problems that the match algorithms need to deal with.

• Synonyms, homonyms, hypernyms. Synonyms are multiple terms that all refer to the same concept. In our database example, PROJ and PROJECT refer to the same concept. Homonyms, on the other hand, occur when the same term is used to mean different things in different contexts. Again, in our example, BUDGET may refer to the gross budget in one database and it may refer to the net budget (after some overhead deduction) in another, making their simple comparison difficult. A hypernym is a term that is more generic than a similar word. Although there is no direct example of it in the databases we are considering, the concept of a Vehicle in one database is a hypernym for the concept of a Car in another (incidentally, in this case, Car is a hyponym of Vehicle). These problems can be addressed by the use of domain ontologies that define the organization of concepts and terms in a particular domain.

• Different ontology: Even if domain ontologies are used to deal with issues in one domain, it is quite often the case that schemas from different domains may need to be matched. In this case, one has to be careful of the meaning of terms across ontologies, as they can be highly dependent on the domain they are used in. For example, an attribute called “load” may imply a measure of resistance in an electrical ontology, but in a mechanical ontology, it may represent a measure of weight.

• Imprecise wording: Schemas may contain ambiguous names. For example, the LOCATION and LOC attributes in our example database may refer to the full address or just the city name. Similarly, an attribute named “contact-info” may imply that the attribute contains the name of the contact agent or his/her telephone number. These types of ambiguities are common.

4.2.2 Linguistic Matching Approaches

Linguistic matching approaches, as the name implies, use element names and other textual information (such as textual descriptions/annotations in schema definitions) to perform matches among elements. In many cases, they may use external sources, such as thesauri, to assist in the process.

Linguistic techniques can be applied in both schema-based approaches and instance-based ones. In the former case, similarities are established among schema elements whereas in the latter, they are specified among elements of individual data instances. To focus our discussion, we will mostly consider schema-based linguistic matching approaches, briefly mentioning instance-based techniques. Consequently, we will use the notation 〈SC1.element-1 ≈ SC2.element-2, p, s〉 to represent that element-1 in schema SC1 corresponds to element-2 in schema SC2 if predicate p holds, with a similarity value of s. Matchers use these rules and similarity values to determine the similarity value of schema elements.

Linguistic matchers that operate at the schema element-level typically deal with the names of the schema elements and handle cases such as synonyms, homonyms, and hypernyms. In some cases, the schema definitions can have annotations (natural language comments) that may be exploited by the linguistic matchers. In the case of instance-based approaches, linguistic matchers focus on information retrieval techniques such as word frequencies, key terms, etc. In these cases, the matchers “deduce” similarities based on these information retrieval measures.

Schema linguistic matchers use a set of linguistic (also called terminological) rules that can be hand-crafted or may be “discovered” using auxiliary data sources such as thesauri, e.g., WordNet [Miller, 1995] (http://wordnet.princeton.edu/). In the case of hand-crafted rules, the designer needs to specify the predicate p and the similarity value s as well. For discovered rules, these may either be specified by an expert following the discovery, or they may be computed using one of the techniques we will discuss shortly.

The hand-crafted linguistic rules may deal with capitalization, abbreviations, concept relationships, etc. In some systems, the hand-crafted rules are specified for each schema individually (intraschema rules) by the designer, and interschema rules are then “discovered” by the matching algorithm [Palopoli et al., 1999]. However, in most cases, the rule base contains both intra and interschema rules.

Example 4.4. In the relational database of Example 4.2, the set of rules may have been defined (quite intuitively) as follows, where RelDB refers to the relational schema and ERDB refers to the translated E-R schema:

〈uppercase names ≈ lower case names, true, 1.0〉
〈uppercase names ≈ capitalized names, true, 1.0〉
〈capitalized names ≈ lower case names, true, 1.0〉
〈RelDB.ASG ≈ ERDB.WORKS IN, true, 0.8〉
. . .

The first three rules are generic ones specifying how to deal with capitalizations, while the fourth one specifies a similarity between the ASG element of RelDB and the WORKS IN element of ERDB. Since these correspondences always hold, p = true.
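A small sketch of how such a rule base might be consulted is given below. It assumes that the capitalization rules can be collapsed into a case-insensitive comparison of element names and that interschema rules are stored in a lookup table; the names INTERSCHEMA_RULES, local_name, and rule_similarity are hypothetical. Since p = true for every rule in this example, the predicate is omitted.

```python
# Interschema rules, keyed by an unordered pair of qualified element names.
INTERSCHEMA_RULES = {
    frozenset({"RelDB.ASG", "ERDB.WORKS IN"}): 0.8,
}


def local_name(element):
    """Strip the schema prefix, e.g. 'RelDB.ASG' -> 'ASG'."""
    return element.split(".", 1)[-1]


def rule_similarity(a, b):
    """Return the similarity given by the rule base, or None if no rule applies."""
    # Capitalization rules: names differing only in case match with s = 1.0.
    if local_name(a).casefold() == local_name(b).casefold():
        return 1.0
    # Otherwise fall back to the explicitly specified interschema rules.
    return INTERSCHEMA_RULES.get(frozenset({a, b}))
```

Using an unordered pair as the key reflects that ≈ is a symmetric correspondence.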

As indicated above, there are ways of determining the element name similarities automatically. For example, COMA [Do and Rahm, 2002] uses the following techniques to determine similarity of two element names:

• The affixes, which are the common prefixes and suffixes between the two element name strings, are determined.

• The n-grams of the two element name strings are compared. An n-gram is a substring of length n and the similarity is higher if the two strings have more n-grams in common.

• The edit distance between two element name strings is computed. The edit distance (also called the Levenshtein metric) determines the number of character modifications (insertions, deletions, substitutions) that one has to perform on one string to convert it to the second string.

• The soundex code of the element names is computed. This gives the phonetic similarity between names based on their soundex codes. The soundex code of an English word is obtained by hashing the word to a letter followed by three digits. This hash value (roughly) corresponds to how the word would sound. The important aspect of this code in our context is that two words that sound similar will have close soundex codes.
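Two of the techniques listed above, affix detection and soundex coding, can be sketched as follows. This is our own illustration and is not taken from COMA; the soundex function follows the commonly described American Soundex rules (a letter followed by three digits), and the function names are assumptions.

```python
import os

# Letter-to-digit table of the standard Soundex scheme; vowels, H, W and Y
# have no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(word):
    """Return the four-character soundex code of word ('' for no letters)."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    code = word[0]
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def common_prefix(a, b):
    """Longest common prefix of two names."""
    return os.path.commonprefix([a, b])


def common_suffix(a, b):
    """Longest common suffix, computed as the prefix of the reversed names."""
    return os.path.commonprefix([a[::-1], b[::-1]])[::-1]
```

For instance, “Robert” and “Rupert” sound alike and receive the same code under this scheme.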

Example 4.5. Consider matching the RESP and the RESPONSIBILITY attributes in the two example schemas we are considering. The rules defined in Example 4.4 take care of the capitalization differences, so we are left with matching RESP with RESPONSIBILITY. Let us consider how the similarity between these two strings can be computed using the edit distance and the n-gram approaches.

The number of editing changes that one needs to do to convert one of these strings to the other is 10 (either we add the characters ‘O’, ‘N’, ‘S’, ‘I’, ‘B’, ‘I’, ‘L’, ‘I’, ‘T’, ‘Y’, to RESP or delete the same characters from RESPONSIBILITY). Thus the ratio of the required changes is 10/14, which defines the edit distance between these two strings; 1− (10/14) = 4/14 = 0.29 is then their similarity.

For n-gram computation, we need to first fix the value of n. For this example, let n = 3, so we are looking for 3-grams. The 3-grams of RESP are ‘RES’ and ‘ESP’. Similarly, there are twelve 3-grams of RESPONSIBILITY: ‘RES’, ‘ESP’, ‘SPO’, ‘PON’, ‘ONS’, ‘NSI’, ‘SIB’, ‘IBI’, ‘BIL’, ‘ILI’, ‘LIT’, and ‘ITY’. There are two matching 3-grams out of twelve, giving a 3-gram similarity of 2/12 = 0.17.
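The two computations of Example 4.5 can be reproduced with a short sketch. The normalizations are assumptions chosen to agree with the example: the edit distance is divided by the length of the longer string, and the number of shared n-grams is divided by the size of the larger n-gram set. Other normalizations are equally plausible.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1 minus the edit distance normalized by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


def ngrams(s, n=3):
    """All substrings of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity(a, b, n=3):
    """Shared distinct n-grams divided by the size of the larger n-gram set."""
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))
```

With these definitions, edit_similarity("RESP", "RESPONSIBILITY") rounds to 0.29 and ngram_similarity("RESP", "RESPONSIBILITY") rounds to 0.17.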

The examples we have covered in this section all fall into the category of 1:1 matches – we matched one element of a particular schema to an element of another schema. As discussed earlier, it is possible to have 1:N (e.g., Street address, City, and Country element values in one database can be extracted from a single Address element in another), N:1 (e.g., Total price can be calculated from Subtotal and Taxes elements), or N:M (e.g., Book title, Rating information can be extracted via a join of two tables one of which holds book information and the other maintains reader reviews and ratings). Rahm and Bernstein [2001] suggest that 1:1, 1:N, and N:1 matchers are typically used in element-level matching while schema-level matching can also use N:M matching, since, in the latter case, the necessary schema information is available.

4.2.3 Constraint-based Matching Approaches

Schema definitions almost always contain semantic information that constrains the values in the database. These are typically data type information, allowable ranges for data values, key constraints, etc. In the case of instance-based techniques, the existing ranges of the values can be extracted as well as some patterns that exist in the instance data. These can be used by matchers.


Consider data types that capture a large amount of semantic information. This information can be used to disambiguate concepts and also focus the match. For example, RESP and RESPONSIBILITY have relatively low similarity values according to computations in Example 4.5. However, if they have the same data type definition, this may be used to increase their similarity value. Similarly, the data type comparison may differentiate between elements that have high lexical similarity. For example, ENO in Figure 4.4 has the same edit distance and n-gram similarity values to the two NUMBER attributes in Figure 4.5 (of course, we are referring to the names of these attributes). In this case, the data types may be of assistance – if the data type of both ENO and worker number (WORKER.NUMBER) are integer while the data type of project number (PROJECT.NUMBER) is a string, the likelihood of ENO matching WORKER.NUMBER is significantly higher.
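One simple way to let data types adjust a name-based score is a weighted combination, sketched below. The weight of 0.3, the exact-equality test on type names, and the name similarity of 0.33 used in the usage lines are all assumptions made for illustration; the text does not prescribe a formula.

```python
def combined_similarity(name_sim, type_a, type_b, type_weight=0.3):
    """Blend a name similarity with a data type agreement score.

    type_weight is a hypothetical tuning parameter in [0, 1].
    """
    type_sim = 1.0 if type_a == type_b else 0.0
    return (1 - type_weight) * name_sim + type_weight * type_sim


# ENO is assumed to have the same name similarity to both NUMBER attributes;
# only the data types differ.
eno_vs_worker = combined_similarity(0.33, "integer", "integer")
eno_vs_project = combined_similarity(0.33, "integer", "string")
```

Under these assumptions the ENO and WORKER.NUMBER pair ends up with the higher score, which is the behavior the paragraph describes.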

In structure-based approaches, the structural similarities in the two schemas can be exploited in determining the similarity of the schema elements. If two schema elements are structurally similar, this enhances our confidence that they indeed represent the same concept. For example, if two elements have very different names and we have not been able to establish their similarity through element matchers, but they have the same properties (e.g., same attributes) that have the same data types, then we can be more confident that these two elements may be representing the same concept.

The determination of structural similarity involves checking the similarity of the “neighborhoods” of the two concepts under consideration. Definition of the neighborhood is typically done using a graph representation of the schemas [Madhavan et al., 2001; Do and Rahm, 2002] where each concept (relation, entity, attribute) is a node and there is a directed edge between two nodes if and only if the two concepts are related (e.g., there is an edge from a relation node to each of its attributes, or there is an edge from a foreign key attribute node to the primary key attribute node it is referencing). In this case, the neighborhood can be defined in terms of the nodes that can be reached within a certain path length of each concept, and the problem reduces to checking the similarity of the subgraphs in this neighborhood.

The traversing of the graph can be done in a number of ways; for example, CUPID [Madhavan et al., 2001] converts the graphs to trees and then looks at similarities of subtrees rooted at the two nodes in consideration, while COMA [Do and Rahm, 2002] considers the paths from the root to these element nodes. The fundamental point of these algorithms is that if the subgraphs are similar, this increases the similarity of the roots of these subtrees. The similarity of the subgraphs is determined in a bottom-up process, starting at the leaves, whose similarity is determined using element matching (e.g., name similarity to the level of synonyms, or data type compatibility). The similarity of the two subtrees is recursively determined based on the similarity of the nodes in the subtree. A number of formulae may be used for this recursive computation. CUPID, for example, looks at the similarity of two leaf nodes and if it is higher than a threshold value, then those two leaf nodes are said to be strongly linked. The similarity of two subgraphs is then defined as the fraction of leaves in the two subtrees that are strongly linked. This is based on the assumption that leaves carry more information and that the structural similarity of two non-leaf schema elements is determined by the similarity of the leaf nodes in their respective subtrees, even if their immediate children are not similar. These are heuristic rules and it is possible to define others.
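The "fraction of strongly linked leaves" heuristic can be sketched as follows. This is a simplified reading of the description above and not CUPID's actual implementation; the threshold of 0.5, the leaf names, and the leaf similarity scores are hypothetical values.

```python
def structural_similarity(leaves_a, leaves_b, leaf_sim, threshold=0.5):
    """Fraction of leaves (over both subtrees) that have a strong link.

    A leaf is strongly linked if its similarity to at least one leaf of the
    other subtree exceeds the threshold.
    """
    total = len(leaves_a) + len(leaves_b)
    if total == 0:
        return 0.0
    linked_a = sum(
        1 for x in leaves_a if any(leaf_sim(x, y) > threshold for y in leaves_b)
    )
    linked_b = sum(
        1 for y in leaves_b if any(leaf_sim(x, y) > threshold for x in leaves_a)
    )
    return (linked_a + linked_b) / total


# Hypothetical element-level scores, as an element matcher might produce them.
LEAF_SCORES = {
    ("ENO", "NUMBER"): 0.8,
    ("ENAME", "NAME"): 0.9,
    ("TITLE", "TITLE"): 1.0,
}


def leaf_sim(x, y):
    return LEAF_SCORES.get((x, y), 0.0)


score = structural_similarity(
    ["ENO", "ENAME", "TITLE"],
    ["NUMBER", "NAME", "TITLE", "SALARY"],
    leaf_sim,
)
```

Here six of the seven leaves are strongly linked, so the two subtree roots receive a structural similarity of 6/7 even though their own names were never compared.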

Another interesting approach to considering neighborhood in directed graphs while computing similarity of nodes is similarity flooding [Melnik et al., 2002]. It starts from an initial graph where the node similarities are already determined by means of an element matcher, and propagates, iteratively, to determine the similarity of each node to its neighbors. Hence, whenever any two elements in two schemas are found to be similar, the similarity of their adjacent nodes increases. The iterative process stops when the node similarities stabilize. At each iteration, to reduce the amount of work, a subset of the nodes are selected as the “most plausible” matches, which are then considered in the subsequent iteration.

Both of these approaches are agnostic to the edge semantics. In some graph representations, there is additional semantics attached to these edges. For example, containment edges from a relation or entity node to its attributes may be distinguished from referential edges from a foreign key attribute node to the corresponding primary key attribute node. Some systems exploit these edge semantics (e.g., DIKE [Palopoli et al., 1998, 2003a]).

4.2.4 Learning-based Matching

A third alternative approach that has been proposed is to use machine learning techniques to determine schema matches. Learning-based approaches formulate the problem as one of classification where concepts from various schemas are classified into classes according to their similarity. The similarity is determined by checking the features of the data instances of the databases that correspond to these schemas. How to classify concepts according to their features is learned by studying the data instances in a training data set.

The process is as follows (Figure 4.8). A training set (τ) is prepared that consists of instances of example correspondences between the concepts of two databases Di and Dj. This training set can be generated after manual identification of the schema correspondences between two databases followed by extraction of example training data instances [Doan et al., 2003a], or by the specification of a query expression that converts data from one database to another [Berlin and Motro, 2001]. The learner uses this training data to acquire probabilistic information about the features of the data sets. The classifier, when given two other database instances (Dk and Dl), then uses this knowledge to go through the data instances in Dk and Dl and make predictions about classifying the elements of Dk and Dl.
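The learner and classifier roles can be illustrated with a toy sketch. A Naive Bayes learner with add-one smoothing is used here only because it is short; the feature extractor, the class labels, and the training values are all hypothetical and do not reproduce any of the systems cited in this section.

```python
import math
from collections import Counter, defaultdict


def features(value):
    """Three crude, hand-picked features of a data instance (an assumption)."""
    return [
        "has_digit" if any(c.isdigit() for c in value) else "no_digit",
        "has_at" if "@" in value else "no_at",
        "has_space" if " " in value else "no_space",
    ]


class NaiveBayesMatcher:
    def __init__(self):
        self.label_counts = Counter()            # training instances per class
        self.feat_counts = defaultdict(Counter)  # feature counts per class
        self.vocab = set()                       # all features seen in training

    def learn(self, training):
        """Learner: training is an iterable of (value, class label) pairs."""
        for value, label in training:
            self.label_counts[label] += 1
            for f in features(value):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)

    def classify(self, value):
        """Classifier: return the class with the highest log-probability."""
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in features(value):
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Hypothetical training set standing in for the example correspondences.
matcher = NaiveBayesMatcher()
matcher.learn([
    ("555-1234", "phone"), ("519 888 4567", "phone"),
    ("alice@x.org", "email"), ("bob@y.com", "email"),
    ("Alice Johnson", "name"), ("Bob Stone", "name"),
])
```

Elements of Dk and Dl whose instances are mostly assigned to the same class would then be proposed as matches.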

This general approach applies to all of the proposed learning-based schema matching approaches. Where they differ is the type of learner that they use and how they adjust this learner’s behavior for schema matching. Some have used neural networks (e.g., SEMINT [Li and Clifton, 2000; Li et al., 2000]), others have used a Naïve Bayesian learner/classifier (Autoplex [Berlin and Motro, 2001], LSD [Doan et al., 2001, 2003a] and [Naumann et al., 2002]), and decision trees [Embley et al., 2001, 2002]. Discussing the details of these learning techniques is beyond our scope.

Fig. 4.8 Learning-based Matching Approach [figure components: the training set τ = {Di.em ≈ Dj.en} is input to the Learner, which supplies probabilistic knowledge to the Classifier; the Classifier takes Dk, Dl as input and outputs classification predictions]

4.2.5 Combined Matching Approaches

The individual matching techniques that we have considered so far have their strong points and their weaknesses. Each may be more suitable for matching certain cases. Therefore, a “complete” matching algorithm or methodology usually needs to make use of more than one individual matcher.

There are two possible ways in which matchers can be combined [Rahm and Bernstein, 2001]: hybrid and composite. Hybrid algorithms combine multiple matchers within one algorithm. In other words, elements from two schemas can be compared using a number of element matchers (e.g., string matching as well as data type matching) and/or structural matchers within one algorithm to determine their overall similarity. Careful readers will have noted that in discussing the constraint-based matching algorithms that focused on structural matching, we followed a hybrid approach since they were based on an initial similarity determination of, for example, the leaf nodes using an element matcher, and these similarity values were then used in structural matching. Composite algorithms, on the other hand, apply each matcher to the elements of the two schemas (or two instances) individually, obtaining individual similarity scores, and then they apply a method for combining these similarity scores. More precisely, if $s_i(C_j^k, C_l^m)$ is the similarity score using matcher $i$ ($i = 1, \ldots, q$) over two concepts $C_j$ from schema $k$ and $C_l$ from schema $m$, then the composite similarity of the two concepts is given by $s(C_j^k, C_l^m) = f(s_1, \ldots, s_q)$, where $f$ is the function that is used to combine the similarity scores. This function can be as simple as average, max, or min, or it can be an adaptation of more complicated ranking aggregation functions [Fagin, 2002] that we will discuss further in Chapter 9. The composite approach has been proposed in the LSD [Doan et al., 2001, 2003a] and iMAP [Dhamankar et al., 2004] systems for handling 1:1 and N:M matches, respectively.
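The simple choices of the combining function f can be sketched directly. The function name composite_similarity is hypothetical, and the third score in the test values below (1.0 for a data type matcher) is an assumed value; the first two are the edit distance and 3-gram similarities from Example 4.5.

```python
from statistics import mean


def composite_similarity(scores, combine=mean):
    """Combine the scores s1..sq of q individual matchers with f = combine."""
    if not scores:
        raise ValueError("at least one matcher score is required")
    return combine(scores)


# Scores of three matchers for one pair of concepts.
scores = [0.29, 0.17, 1.0]

average_score = composite_similarity(scores)             # f = average
optimistic_score = composite_similarity(scores, max)     # f = max
pessimistic_score = composite_similarity(scores, min)    # f = min
```

Passing f as a parameter keeps the individual matchers independent of the aggregation policy, which is the defining property of the composite approach.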

4.3 Schema Integration

Once schema matching is done, the correspondences between the various LCSs have been identified. The next step is to create the GCS, and this is referred to as schema integration. As indicated earlier, this step is only necessary if a GCS has not already been defined and matching was performed on individual LCSs. If the GCS was defined up-front, then the matching step would determine correspondences between it and each of the LCSs and there would be no need for the integration step. If the GCS is created as a result of the integration of LCSs based on correspondences identified during schema matching, then, as part of integration, it is important to identify the correspondences between the GCS and the LCSs. Although tools (e.g., [Sheth et al., 1988a]) have been developed to aid in the integration process, human involvement is clearly essential.

Example 4.6. There are a number of possible integrations of the two example LCSs we have been discussing. Figure 4.9 shows one possible GCS that can be generated as a result of schema integration.

Employee(ENUMBER, ENAME, TITLE)

Pay(TITLE, SALARY)

Project(PNUMBER, PNAME, BUDGET, LOCATION)

Client(CNAME, ADDRESS, CONTRACTNO, PNUMBER)

Works(ENUMBER, PNUMBER, RESP, DURATION)

Fig. 4.9 Example Integrated GCS

Integration methodologies can be classified as binary or nary mechanisms [Batini et al., 1986] based on the manner in which the local schemas are handled in the first phase (Figure 4.10). Binary integration methodologies involve the manipulation of two schemas at a time. These can occur in a stepwise (ladder) fashion (Figure 4.11a) where intermediate schemas are created for integration with subsequent schemas [Pu, 1988], or in a purely binary fashion (Figure 4.11b), where each schema is integrated with one other, creating an intermediate schema for integration with other intermediate schemas ([Batini and Lenzerini, 1984] and [Dayal and Hwang, 1984]). Other binary integration approaches do not make this distinction [Melnik et al., 2002].

Integration Process
  Binary: ladder, balanced
  n-ary: one-shot, iterative

Fig. 4.10 Taxonomy of Integration Methodologies

Nary integration mechanisms integrate more than two schemas at each iteration. One-pass integration (Figure 4.12a) occurs when all schemas are integrated at once, producing the global conceptual schema after one iteration. Benefits of this approach include the availability of complete information about all databases at integration time. There is no implied priority for the integration order of schemas, and the trade-offs, such as the best representation for data items or the most understandable structure, can be made between all schemas rather than between a few. Difficulties with this approach include increased complexity and difficulty of automation.

Fig. 4.11 Binary Integration Methods: (a) Stepwise, (b) Pure binary

Fig. 4.12 Nary Integration Methods: (a) One-pass, (b) Iterative

Iterative nary integration (Figure 4.12b) offers more flexibility (typically, more information is available) and is more general (the number of schemas can be varied depending on the integrator’s preferences). Binary approaches are a special case of iterative nary. They decrease the potential integration complexity and lead toward automation techniques, since the number of schemas to be considered at each step is more manageable. Integration by an nary process enables the integrator to perform the operations on more than two schemas. For practical reasons, the majority of systems utilize binary methodology, but a number of researchers prefer the nary approach because complete information is available ([Elmasri et al., 1987; Yao et al., 1982b; He et al., 2004]).
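The difference between the integration orders can be illustrated with a sketch in which a schema is reduced to a set of element names and integrating two schemas is simply their union. That placeholder is a strong simplification introduced here for illustration; real integration must resolve conflicts, so the order generally does matter. All function names are hypothetical, and a non-empty list of schemas is assumed.

```python
from functools import reduce


def integrate(a, b):
    """Placeholder binary integration step: union of element names."""
    return a | b


def ladder(schemas):
    """Stepwise binary: fold each schema into the running intermediate schema."""
    return reduce(integrate, schemas)


def balanced(schemas):
    """Pure binary: integrate schemas pairwise, then the intermediate results."""
    level = list(schemas)
    while len(level) > 1:
        merged = [
            integrate(level[i], level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:  # an unpaired schema moves up unchanged
            merged.append(level[-1])
        level = merged
    return level[0]


def one_pass(schemas):
    """One-pass nary: consider all schemas at once."""
    return frozenset().union(*schemas)
```

With the union placeholder all three strategies return the same result; the sketch only shows how the intermediate schemas are formed in each case.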

4.4 Schema Mapping

Once a GCS (or mediated schema) is defined, it is necessary to identify how the data from each of the local databases (source) can be mapped to GCS (target) while preserving semantic consistency (as defined by both the source and the target). Although schema matching has identified the correspondences between the LCSs and the GCS, it may not have identified explicitly how to obtain the global database from the local ones. This is what schema mapping is about.

In the case of data warehouses, schema mappings are used to explicitly extract data from the sources, and translate them to the data warehouse schema for populating it. In the case of data integration systems, these mappings are used in the query processing phase by both the query processor and the wrappers (see Chapter 9).

There are two issues related to schema mapping that we will be studying: mapping creation, and mapping maintenance. Mapping creation is the process of creating explicit queries that map data from a local database to the global data. Mapping maintenance is the detection and correction of mapping inconsistencies resulting from schema evolution. Source schemas may undergo structural or semantic changes that invalidate mappings. Mapping maintenance is concerned with the detection of broken mappings and the (automatic) rewriting of mappings such that semantic consistency with the new schema and semantic equivalence with the current mapping are achieved.


4.4.1 Mapping Creation

Mapping creation starts with a source LCS, the target GCS, and a set of schema matches M and produces a set of queries that, when executed, will create GCS data instances from the source data. In data warehouses, these queries are actually executed to create the data warehouse (global database) while in data integration systems, these queries are used in the reverse direction during query processing (Chapter 9).

Let us make this more concrete by referring to the canonical relational representation that we have adopted. The source LCS under consideration consists of a set of relations S = {S1, . . . , Sm}, the GCS consists of a set of global (or target) relations T = {T1, . . . , Tn}, and M consists of a set of schema match rules as defined in Section 4.2. We are looking for a way to generate, for each Tk, a query Qk that is defined on a (possibly proper) subset of the relations in S and that, when executed, will generate data for Tk from the source relations.

An algorithm due to Miller et al. [2000] accomplishes this iteratively by considering each $T_k$ in turn. It starts with $M_k \subseteq M$ ($M_k$ is the set of rules that only apply to the attributes of $T_k$) and divides it into subsets $\{M_k^1, \ldots, M_k^s\}$ such that each $M_k^j$ specifies one possible way that values of $T_k$ can be computed. Each $M_k^j$ can be mapped to a query $q_k^j$ that, when executed, would generate some of $T_k$’s data. The union of all of these queries gives $Q_k$ ($= \cup_j q_k^j$) that we are looking for.

The algorithm proceeds in four steps that we discuss below. It does not consider the similarity values in the rules. It can be argued that the similarity values would be used in the final stages of the matching process to finalize correspondences, so that their use during mapping is unnecessary. Furthermore, by the time this phase of the integration process is reached, the concern is how to map source relation (LCS) data to target relation (GCS) data. Consequently, correspondences are not symmetric equivalences (≈), but mappings (↦): attribute(s) from (possibly multiple) source relations are mapped to an attribute of a target relation (i.e., (Si.attributek, Sj.attributel) ↦ Tw.attributez).

Example 4.7. To demonstrate the algorithm, we will use a different example database than the one we have been working with, because the latter does not incorporate all the complexities that we wish to demonstrate. Instead, we will use the following abstract example.

Source relations (LCS):

S1(A1, A2)
S2(B1, B2, B3)
S3(C1, C2, C3)
S4(D1, D2)

Target relation (GCS):

T(W1, W2, W3, W4)


We consider only one relation in GCS, since the algorithm iterates over target relations one-at-a-time, so this is sufficient to demonstrate the operation of the algorithm.

The foreign key relationships between the attributes are as follows:

Foreign key    Refers to
A1             B1
A2             B1
C1             B1

The following matches have been discovered for attributes of relation T (these make up $M_T$). In the subsequent examples, we will not be concerned with the predicates, so they are not explicitly specified.

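$r_1 = \langle A_1 \mapsto W_1, p \rangle$
$r_2 = \langle A_2 \mapsto W_2, p \rangle$
$r_3 = \langle B_2 \mapsto W_4, p \rangle$
$r_4 = \langle B_3 \mapsto W_3, p \rangle$
$r_5 = \langle C_1 \mapsto W_1, p \rangle$
$r_6 = \langle C_2 \mapsto W_2, p \rangle$
$r_7 = \langle D_1 \mapsto W_4, p \rangle$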

In the first step, $M_k$ (corresponding to $T_k$) is partitioned into its subsets $\{M_k^1, \ldots, M_k^n\}$ such that each $M_k^j$ contains at most one match for each attribute of $T_k$. These are called potential candidate sets, some of which may be complete in that they include a match for every attribute of $T_k$, but others may not be. The reasons for considering incomplete sets are twofold. First, it may be the case that no match is found for one or more attributes of the target relation (i.e., none of the match sets are complete). Second, for large and complex database schemas, it may make sense to build the mapping iteratively so that the designer specifies the mappings incrementally.

Example 4.8. $M_T$ is partitioned into the following fifty-three subsets (i.e., potential candidate sets). The first eight of these are complete, while the rest are not. To make it easier to read, the rules in the complete sets are listed in the order of the target attributes to which they map (e.g., the third rule in $M_T^1$ is $r_4$, because this rule maps to attribute $W_3$):

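$M_T^1 = \{r_1, r_2, r_4, r_3\}$
$M_T^2 = \{r_1, r_2, r_4, r_7\}$
$M_T^3 = \{r_1, r_6, r_4, r_3\}$
$M_T^4 = \{r_1, r_6, r_4, r_7\}$
$M_T^5 = \{r_5, r_2, r_4, r_3\}$
$M_T^6 = \{r_5, r_2, r_4, r_7\}$
$M_T^7 = \{r_5, r_6, r_4, r_3\}$
$M_T^8 = \{r_5, r_6, r_4, r_7\}$
$M_T^9 = \{r_1, r_2, r_3\}$
$M_T^{10} = \{r_1, r_2, r_4\}$
$M_T^{11} = \{r_1, r_3, r_4\}$
$M_T^{12} = \{r_2, r_3, r_4\}$
$M_T^{13} = \{r_1, r_3, r_6\}$
$M_T^{14} = \{r_3, r_4, r_6\}$
. . .
$M_T^{47} = \{r_1\}$
$M_T^{48} = \{r_2\}$
$M_T^{49} = \{r_3\}$
$M_T^{50} = \{r_4\}$
$M_T^{51} = \{r_5\}$
$M_T^{52} = \{r_6\}$
$M_T^{53} = \{r_7\}$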
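The counts in Example 4.8 can be checked mechanically. The following is a minimal sketch, not the algorithm's published implementation: it assumes each rule is reduced to a (source attribute, target attribute) pair, and the names `RULES`, `potential_candidate_sets`, and `is_complete` are introduced here purely for illustration. For every target attribute it picks either one matching rule or none, and discards the empty choice.

```python
from itertools import product

# Match rules of Example 4.7: rule name -> (source attribute, target attribute).
RULES = {
    "r1": ("A1", "W1"), "r2": ("A2", "W2"), "r3": ("B2", "W4"),
    "r4": ("B3", "W3"), "r5": ("C1", "W1"), "r6": ("C2", "W2"),
    "r7": ("D1", "W4"),
}
TARGET_ATTRS = ["W1", "W2", "W3", "W4"]


def potential_candidate_sets(rules, target_attrs):
    """Return every non-empty set with at most one rule per target attribute."""
    # For each target attribute the choices are its matching rules, or None.
    choices = [
        [name for name, (_, tgt) in rules.items() if tgt == attr] + [None]
        for attr in target_attrs
    ]
    sets = []
    for combo in product(*choices):
        chosen = frozenset(name for name in combo if name is not None)
        if chosen:  # skip the combination that selects no rule at all
            sets.append(chosen)
    return sets


def is_complete(candidate, rules, target_attrs):
    """A set is complete when it has a match for every target attribute."""
    return {rules[name][1] for name in candidate} == set(target_attrs)


candidates = potential_candidate_sets(RULES, TARGET_ATTRS)
complete = [c for c in candidates if is_complete(c, RULES, TARGET_ATTRS)]
print(len(candidates), len(complete))  # prints "53 8"
```

The product has 3 x 3 x 2 x 3 = 54 combinations (W3 has only one matching rule), and removing the empty one leaves the fifty-three sets of the example.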

In the second step, the algorithm analyzes each potential candidate set $M_k^j$ to see if a “good” query can be produced for it. If all the matches in $M_k^j$ map values from a single source relation to $T_k$, then it is easy to generate a query corresponding to $M_k^j$.

Of particular concern are matches that require access to multiple source relations. In this case the algorithm checks to see if there is a referential connection between these relations through foreign keys (i.e., whether there is a join path through the source relations). If there isn’t, then the potential candidate set is eliminated from further consideration. In case there are multiple join paths through foreign key relationships, the algorithm looks for those paths that will produce the largest number of tuples (i.e., the estimated difference in size of the outer and inner joins is the smallest). If there are multiple such paths, then the database designer needs to be involved in selecting one (tools such as Clio [Miller et al., 2001], OntoBuilder [Roitman and Gal, 2006] and others facilitate this process and provide mechanisms for designers to view and specify correspondences [Yan et al., 2001]). The result of this step is a set $\overline{M}_k \subseteq M_k$ of candidate sets.

Example 4.9. In this example, there is no $M_k^j$ where the values of all of T’s attributes are mapped from a single source relation. Among those that involve multiple source relations, rules that involve S1, S2 and S3 can be mapped to “good” queries since there are foreign key relationships between them. However, the rules that involve S4 (i.e., those that include rule $r_7$) cannot be mapped to a “good” query since there is no join path from S4 to the other relations (i.e., any query would involve a cross product, which is expensive). Thus, the sets containing these rules are eliminated from the potential candidate sets. Considering only the complete sets, $M_k^2$, $M_k^4$, $M_k^6$, and $M_k^8$ are pruned. In the end, the candidate set ($\overline{M}_k$) contains thirty-five candidate sets (the readers are encouraged to verify this to better understand the algorithm).
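The pruning in Example 4.9 can be sketched as a connectivity test over the foreign-key graph. This is an illustrative reading under stated assumptions: foreign keys are treated as undirected edges between relations, a candidate set is kept when all of its source relations lie in one connected component, and the names `FOREIGN_KEYS`, `connected_relations`, and `has_join_path` are hypothetical. Under this literal test the singleton {r7} survives, because it touches only S4 and needs no join; the example drops every set that contains r7, which yields the thirty-five it reports.

```python
from itertools import product

# Rule name -> (source relation, target attribute), from Example 4.7.
RULES = {
    "r1": ("S1", "W1"), "r2": ("S1", "W2"), "r3": ("S2", "W4"),
    "r4": ("S2", "W3"), "r5": ("S3", "W1"), "r6": ("S3", "W2"),
    "r7": ("S4", "W4"),
}
# Foreign keys A1->B1, A2->B1 (S1 to S2) and C1->B1 (S3 to S2) as relation pairs.
FOREIGN_KEYS = [("S1", "S2"), ("S3", "S2")]


def potential_candidate_sets(rules):
    """All non-empty sets with at most one rule per target attribute."""
    by_target = {}
    for name, (_, target) in rules.items():
        by_target.setdefault(target, []).append(name)
    choices = [names + [None] for names in by_target.values()]
    return [
        frozenset(name for name in combo if name is not None)
        for combo in product(*choices)
        if any(name is not None for name in combo)
    ]


def connected_relations(start, edges):
    """Relations reachable from start over undirected foreign-key edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            if node in (a, b):
                other = b if node == a else a
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen


def has_join_path(candidate, rules, edges):
    """True when every source relation of the candidate can be joined."""
    relations = {rules[name][0] for name in candidate}
    start = next(iter(relations))
    return relations <= connected_relations(start, edges)


kept = [c for c in potential_candidate_sets(RULES)
        if has_join_path(c, RULES, FOREIGN_KEYS)]
without_r7 = [c for c in kept if "r7" not in c]
print(len(kept), len(without_r7))  # prints "36 35"
```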

In the third step, the algorithm looks for a cover of the candidate sets $\overline{M}_k$. The cover $C_k \subseteq \overline{M}_k$ is a set of candidate sets such that each match in $\overline{M}_k$ appears in $C_k$ at least once. The point of determining a cover is that it accounts for all of the matches and is, therefore, sufficient to generate the target relation $T_k$. If there are multiple covers (a match can participate in multiple covers), then they are ranked in increasing order of the number of candidate sets in the cover. The fewer the candidate sets in the cover, the fewer the queries that will be generated in the next step; this improves the efficiency of the mappings that are generated. If there are multiple covers with the same ranking, then they are further ranked in decreasing order of the total number of unique target attributes that are used in the candidate sets constituting the cover. The point of this ranking is that covers with a higher number of attributes generate fewer null values in the result. At this stage, the designer may need to be consulted to choose from among the ranked covers.

Example 4.10. First note that we have six rules that define matches in $\overline{M}_k$ that we need to consider, since the $M_k^j$ that include rule $r_7$ have been eliminated. There are a large number of possible covers; let us start with those that involve $M_k^1$ to demonstrate the algorithm:

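$C_T^1 = \{\underbrace{\{r_1, r_2, r_4, r_3\}}_{M_T^1}, \underbrace{\{r_1, r_6, r_4, r_3\}}_{M_T^3}, \underbrace{\{r_5\}}_{M_T^{51}}\}$

$C_T^2 = \{\underbrace{\{r_1, r_2, r_4, r_3\}}_{M_T^1}, \underbrace{\{r_5, r_2, r_4, r_3\}}_{M_T^5}, \underbrace{\{r_6\}}_{M_T^{52}}\}$

$C_T^3 = \{\underbrace{\{r_1, r_2, r_4, r_3\}}_{M_T^1}, \underbrace{\{r_5, r_6, r_4, r_3\}}_{M_T^7}\}$

$C_T^4 = \{\underbrace{\{r_1, r_2, r_4, r_3\}}_{M_T^1}, \underbrace{\{r_5, r_6, r_4\}}_{M_T^{12}}\}$

$C_T^5 = \{\underbrace{\{r_1, r_2, r_4, r_3\}}_{M_T^1}, \underbrace{\{r_5, r_6, r_3\}}_{M_T^{19}}\}$

$C_T^6 = \{\underbrace{\{r_1, r_2, r_4, r_3\}}_{M_T^1}, \underbrace{\{r_5, r_6\}}_{M_T^{32}}\}$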

At this point we observe that the covers consist of either two or three candidate sets. Since the algorithm prefers those with fewer candidate sets, we only need to focus on those involving two sets. Furthermore, among these covers, we note that the number of target attributes in the candidate sets differs. Since the algorithm prefers covers with the largest number of target attributes in each candidate set, $C_T^3$ is the preferred cover in this case.

Note that due to the two heuristics employed by the algorithm, the only covers we need to consider are those that involve $M_T^1$, $M_T^3$, $M_T^5$, and $M_T^7$. Similar covers can be defined involving $M_T^3$, $M_T^5$, and $M_T^7$; we leave that as an exercise. In the remainder, we will assume that the designer has chosen to use $C_T^3$ as the preferred cover.
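The two ranking heuristics can be tried out on the thirty-five candidate sets that remain once r7 is gone. The sketch below is one possible reading, not the published procedure: it searches for covers by brute force in increasing size, and it scores a cover by summing the number of target attributes over its candidate sets, which is how Example 4.10 applies the second heuristic. The function names are hypothetical.

```python
from itertools import combinations, product

# Rules that survive Step 2: rule name -> target attribute.
RULES = {"r1": "W1", "r2": "W2", "r3": "W4", "r4": "W3", "r5": "W1", "r6": "W2"}


def candidate_sets(rules):
    """All non-empty sets with at most one rule per target attribute."""
    by_target = {}
    for name, target in rules.items():
        by_target.setdefault(target, []).append(name)
    choices = [names + [None] for names in by_target.values()]
    return [
        frozenset(n for n in combo if n is not None)
        for combo in product(*choices)
        if any(n is not None for n in combo)
    ]


def smallest_covers(candidates, all_rules):
    """Covers with the fewest candidate sets (first heuristic)."""
    for size in range(1, len(candidates) + 1):
        covers = [
            frozenset(combo)
            for combo in combinations(candidates, size)
            if frozenset().union(*combo) == all_rules
        ]
        if covers:
            return covers
    return []


def attribute_count(cover):
    """Target attributes summed over the sets of a cover (second heuristic)."""
    # Each rule in a candidate set maps to a distinct target attribute.
    return sum(len(candidate) for candidate in cover)


candidates = candidate_sets(RULES)
covers = smallest_covers(candidates, frozenset(RULES))
best_score = max(attribute_count(c) for c in covers)
best = [c for c in covers if attribute_count(c) == best_score]
print(len(covers), best_score, len(best))  # prints "18 8 2"
```

Two covers tie at the top under this scoring: the pair built from {r1, r2, r4, r3} and {r5, r6, r4, r3}, and the pair built from {r1, r6, r4, r3} and {r5, r2, r4, r3}. A tie of this kind is where the designer's choice comes in.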

The final step of the algorithm builds a query $q_k^j$ for each of the candidate sets in the cover selected in the previous step. The union of all of these queries (UNION ALL) results in the final mapping for relation $T_k$ in the GCS.

Query $q_k^j$ is built as follows:

• SELECT clause includes all correspondences (c) in each of the rules ($r_k^i$) in $M_k^j$.

• FROM clause includes all source relations mentioned in $r_k^i$ and in the join paths determined in Step 2 of the algorithm.

• WHERE clause includes the conjunction of all predicates (p) in $r_k^i$ and all join predicates determined in Step 2 of the algorithm.

• If $r_k^i$ contains an aggregate function either in c or in p:

  – GROUP BY is used over attributes (or functions on attributes) in the SELECT clause that are not within the aggregate;

  – If the aggregate is in the correspondence c, it is added to SELECT; otherwise (i.e., the aggregate is in the predicate p), a HAVING clause is created with the aggregate.
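The construction rules above can be sketched as a small query builder. This is a simplified illustration under several assumptions: rules are given as (relation, source attribute, target attribute) triples, the join predicates from Step 2 are supplied in a lookup table, rule predicates are optional because Example 4.7 leaves them unspecified, and aggregates are not handled. `build_query`, `RULES`, and `JOINS` are hypothetical names.

```python
# Rule name -> (source relation, source attribute, target attribute).
RULES = {
    "r1": ("S1", "A1", "W1"), "r2": ("S1", "A2", "W2"),
    "r3": ("S2", "B2", "W4"), "r4": ("S2", "B3", "W3"),
    "r5": ("S3", "C1", "W1"), "r6": ("S3", "C2", "W2"),
}
# Join predicates assumed to come out of Step 2, keyed by the relation pair.
JOINS = {
    ("S1", "S2"): ["S1.A1 = S2.B1", "S1.A2 = S2.B1"],
    ("S2", "S3"): ["S3.C1 = S2.B1"],
}


def build_query(candidate, rules, joins, predicates=None):
    """Assemble SELECT/FROM/WHERE text for one candidate set (no aggregates)."""
    predicates = predicates or {}
    # Order the correspondences by target attribute so that the branches of
    # the final UNION ALL line up column by column.
    chosen = sorted((rules[name] for name in candidate), key=lambda r: r[2])
    select = ", ".join(f"{rel}.{src} AS {tgt}" for rel, src, tgt in chosen)
    relations = sorted({rel for rel, _, _ in chosen})
    conditions = [predicates[n] for n in sorted(candidate) if n in predicates]
    for pair, join_predicates in joins.items():
        if set(pair) <= set(relations):
            conditions.extend(join_predicates)
    sql = f"SELECT {select} FROM {', '.join(relations)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


cover = [{"r1", "r2", "r4", "r3"}, {"r5", "r6", "r4", "r3"}]
final_query = " UNION ALL ".join(build_query(c, RULES, JOINS) for c in cover)
print(final_query)
```

A fuller version would also add any intermediate relations on the join path to the FROM clause; here every relation on the path is already mentioned by some rule.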

Example 4.11. Since in Example 4.10 we have decided to use cover $C_T^3$ for the final mapping, we need to generate two queries: $q_T^1$ and $q_T^7$ corresponding to $M_T^1$ and $M_T^7$, respectively. For ease of presentation, we list the rules here again:

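$r_1 = \langle A_1 \mapsto W_1, p \rangle$
$r_2 = \langle A_2 \mapsto W_2, p \rangle$
$r_3 = \langle B_2 \mapsto W_4, p \rangle$
$r_4 = \langle B_3 \mapsto W_3, p \rangle$
$r_5 = \langle C_1 \mapsto W_1, p \rangle$
$r_6 = \langle C_2 \mapsto W_2, p \rangle$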

The two queries are as follows:

$q_k^1$: SELECT A1 AS W1, A2 AS W2, B3 AS W3, B2 AS W4
         FROM S1, S2
         WHERE p1 AND p2 AND p3 AND p4
           AND S1.A1 = S2.B1 AND S1.A2 = S2.B1

$q_k^7$: SELECT C1 AS W1, C2 AS W2, B3 AS W3, B2 AS W4
         FROM S2, S3
         WHERE p3 AND p4 AND p5 AND p6
           AND S3.C1 = S2.B1

Thus, the final query $Q_k$ for target relation T becomes $q_k^1$ UNION ALL $q_k^7$.

The output of this algorithm, after it is iteratively applied to each target relation $T_k$, is a set of queries $Q = \{Q_k\}$ that, when executed, produce data for the GCS relations. Thus, the algorithm produces GAV mappings between relational schemas; recall that GAV defines a GCS as a view over the LCSs, and that is exactly what the set of mapping queries does. The algorithm takes into account the semantics of the source schema since it considers foreign key relationships in determining which queries to generate. However, it does not consider the semantics of the target, so that the tuples that are generated by the execution of the mapping queries are not guaranteed to satisfy target semantics. This is not a major issue in the case when the GCS is integrated from the LCSs; however, if the GCS is defined independent of the LCSs, then this is problematic.
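To see the GAV reading concretely, the mapping for T can be declared as a view over the source relations and queried like any other relation. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the column types and sample rows are invented for illustration, the rule predicates are omitted, and the view body follows the two queries of Example 4.11 with the columns ordered by target attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source relations S1, S2, S3 with the foreign keys of Example 4.7.
    CREATE TABLE S2 (B1 INTEGER PRIMARY KEY, B2 TEXT, B3 TEXT);
    CREATE TABLE S1 (A1 INTEGER REFERENCES S2(B1),
                     A2 INTEGER REFERENCES S2(B1));
    CREATE TABLE S3 (C1 INTEGER REFERENCES S2(B1), C2 INTEGER, C3 TEXT);

    -- Hypothetical sample data.
    INSERT INTO S2 VALUES (1, 'b2-one', 'b3-one'), (2, 'b2-two', 'b3-two');
    INSERT INTO S1 VALUES (1, 1), (1, 2);
    INSERT INTO S3 VALUES (2, 20, 'c3-two');

    -- GAV: the target relation T is defined as a view over the sources.
    CREATE VIEW T AS
        SELECT S1.A1 AS W1, S1.A2 AS W2, S2.B3 AS W3, S2.B2 AS W4
        FROM S1, S2
        WHERE S1.A1 = S2.B1 AND S1.A2 = S2.B1
        UNION ALL
        SELECT S3.C1, S3.C2, S2.B3, S2.B2
        FROM S2, S3
        WHERE S3.C1 = S2.B1;
""")

rows = conn.execute("SELECT W1, W2, W3, W4 FROM T ORDER BY W1").fetchall()
for row in rows:
    print(row)
# (1, 1, 'b3-one', 'b2-one')
# (2, 20, 'b3-two', 'b2-two')
conn.close()
```

The S1 row (1, 2) produces no tuple because both join predicates must hold against the same S2 row. Nothing in the view checks constraints declared on T itself, which is the limitation regarding target semantics noted in the text.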

It is possible to extend the algorithm to deal with target semantics as well as source semantics. This requires that inter-schema tuple-generating dependencies be considered. In other words, it is necessary to produce GLAV mappings. A GLAV mapping, by definition, is not simply a query over the source relations; it is a relationship between a query over the source (i.e., LCS) relations and a query over the target (i.e., GCS) relations. Let us be more precise. Consider a schema match v that specifies a correspondence between attribute A of a source LCS relation S and attribute B of a target GCS relation T (in the notation we used in this section we have v = 〈S.A≈ T.B, p,s〉). Then the source query specifies how to retrieve S.A and the target query specifies how to obtain T.B. The GLAV mapping, then, is a relationship between these two queries.

An algorithm to accomplish this [Popa et al., 2002] also starts, as above, with a source schema, a target schema, and M, and “discovers” mappings that satisfy both the source and the target schema semantics. The algorithm is also more powerful than the one we discussed in this section in that it can handle nested structures that are common in XML, object databases, and nested relational systems.

The first step in discovering all of the mappings based on schema match correspondences is semantic translation, which seeks to interpret schema matches in M in a way that is consistent with the semantics of both the source and target schemas as captured by the schema structure and the referential (foreign key) constraints. The result is a set of logical mappings each of which captures the design choices (semantics) made in both source and target schemas. Each logical mapping corresponds to one target schema relation. The second step is data translation that implements each logical mapping as a rule that can be translated into a query that would create an instance of the target element when executed.

Semantic translation takes as inputs the source S and target schemas T, and M and performs the following two steps:

• It examines intra-schema semantics within S and T separately and produces for each a set of logical relations that are semantically consistent.

• It then interprets inter-schema correspondences M in the context of logical relations generated in Step 1 and produces a set of queries into Q that are semantically consistent with T.

4.4.2 Mapping Maintenance

In dynamic environments where schemas evolve over time, schema mappings can be made invalid as the result of structural or constraint changes made to the schemas. Thus, the detection of invalid/inconsistent schema mappings and the adaptation of such schema mappings to new schema structures/constraints becomes important.

In general, automatic detection of invalid/inconsistent schema mappings is desirable as the complexity of the schemas, and the number of schema mappings used in database applications, increase. Likewise, (semi-)automatic adaptation of mappings to schema changes is also a goal. It should be noted that automatic adaptation of schema mappings is not the same as automatic schema matching. Schema adaptation aims to resolve semantic correspondences using known changes in intra-schema semantics, semantics in existing mappings, and detected semantic inconsistencies (resulting from schema changes). Schema matching must take a much more “from scratch” approach at generating schema mappings and does not have the ability (or luxury) of incorporating such contextual knowledge.

4.4.2.1 Detecting invalid mappings

In general, detection of invalid mappings resulting from schema change can happen either proactively or reactively. In proactive detection environments, schema mappings are tested for inconsistencies as soon as schema changes are made by a user. The assumption (or requirement) is that the mapping maintenance system is completely aware of any and all schema changes, as soon as they are made. The ToMAS system [Velegrakis et al., 2004], for example, expects users to make schema changes through its own schema editors, making the system immediately aware of any schema changes. Once schema changes have been detected, invalid mappings can be detected by doing a semantic translation of the existing mappings using the logical relations of the updated schema.

In reactive detection environments, the mapping maintenance system is unaware of when and what schema changes are made. To detect invalid schema mappings in this setting, mappings are tested at regular intervals by performing queries against the data sources and translating the resulting data using the existing mappings. Invalid mappings are then determined based on the results of these mapping tests.

An alternative method that has been proposed is to use machine learning techniques to detect invalid mappings (as in the Maveric system [McCann et al., 2005]). What has been proposed is to build an ensemble of trained sensors (similar to multiple learners in schema matching) to detect invalid mappings. Examples of such sensors include value sensors for monitoring distribution characteristics of target instance values, trend sensors for monitoring the average rate of data modification, and layout and constraint sensors that monitor translated data against expected target schema syntax and semantics. A weighted combination of the findings of the individual sensors is then calculated where the weights are also learned. If the combined result indicates changes and follow-up tests suggest that this may indeed be the case, an alert is generated.
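The weighted combination of sensor findings can be illustrated with a few lines of code. This is only a sketch of the general idea and not Maveric's actual procedure: the sensor names, scores, weights, and threshold are all invented, the weights are fixed here rather than learned, and the follow-up tests that precede an alert are left out.

```python
def combined_score(sensor_scores, weights):
    """Weighted average of per-sensor scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in sensor_scores)
    weighted_sum = sum(weights[name] * score
                       for name, score in sensor_scores.items())
    return weighted_sum / total_weight


def suspect_mapping(sensor_scores, weights, threshold=0.5):
    """Flag the mapping for follow-up tests when the combined score is high."""
    return combined_score(sensor_scores, weights) >= threshold


# Hypothetical findings: higher means stronger evidence that the mapping broke.
scores = {"value": 0.9, "trend": 0.1, "layout": 0.8}
weights = {"value": 0.5, "trend": 0.2, "layout": 0.3}

print(round(combined_score(scores, weights), 2))
print(suspect_mapping(scores, weights))
```

With these numbers the value and layout sensors outweigh the quiet trend sensor, so the mapping would be passed on for follow-up testing.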


4.4.2.2 Adapting invalid mappings

Once invalid schema mappings are detected, they must be adapted to schema changes and made valid once again. Various high-level mapping adaptation approaches have been proposed [Velegrakis et al., 2004]. These can be broadly described as fixed rule approaches that define a re-mapping rule for every type of expected schema change, map bridging approaches that compare the original schema S and the updated schema S′ and generate a new mapping from S to S′ in addition to existing mappings, and semantic rewriting approaches, which exploit semantic information encoded in existing mappings, schemas, and semantic changes made to schemas to propose map rewritings that produce semantically consistent target data. In most cases, multiple such rewritings are possible, requiring a ranking of the candidates for presentation to users who make the final decision (based on scenario- or business-level semantics not encoded in schemas or mappings).
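As a rough illustration of the fixed rule idea, the sketch below keeps one re-mapping rule per kind of schema change. It is a toy model under explicit assumptions: a mapping is reduced to a dictionary from target attribute to source attribute, only renames and drops are anticipated, and the function name and change encoding are hypothetical rather than taken from any of the cited systems.

```python
def adapt_mapping(mapping, changes):
    """Apply one fixed re-mapping rule per schema change.

    mapping: target attribute -> source attribute.
    changes: ("rename", old, new) or ("drop", attribute) tuples.
    Returns the adapted mapping and the target attributes left unmapped.
    """
    adapted = dict(mapping)
    needs_review = []
    for change in changes:
        kind = change[0]
        if kind == "rename":
            # Rule for renames: follow the attribute to its new name.
            _, old, new = change
            for target, source in list(adapted.items()):
                if source == old:
                    adapted[target] = new
        elif kind == "drop":
            # Rule for drops: remove the match and ask the designer.
            _, gone = change
            for target, source in list(adapted.items()):
                if source == gone:
                    del adapted[target]
                    needs_review.append(target)
        else:
            raise ValueError(f"no re-mapping rule for change type {kind!r}")
    return adapted, needs_review


mapping = {"W1": "S1.A1", "W2": "S1.A2", "W4": "S2.B2"}
changes = [("rename", "S1.A2", "S1.A2_new"), ("drop", "S2.B2")]
adapted, needs_review = adapt_mapping(mapping, changes)
print(adapted)        # {'W1': 'S1.A1', 'W2': 'S1.A2_new'}
print(needs_review)   # ['W4']
```

The `ValueError` branch shows the main weakness of this style: a change that no rule anticipates cannot be handled automatically.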

Arguably, a complete remapping of schemas (i.e., from scratch, using schema matching techniques) is another alternative to mapping adaptation. However, in most cases, map rewriting is cheaper than map regeneration as rewriting can exploit knowledge encoded in existing mappings to avoid computation of mappings that would be rejected by the user anyway (and to avoid redundant mappings).

4.5 Data Cleaning

Errors in source databases can always occur, requiring cleaning in order to correctly answer user queries. Data cleaning is a problem that arises in both data warehouses and data integration systems, but in different contexts. In data warehouses where data are actually extracted from local operational databases and materialized as a global database, cleaning is performed as the global database is created. In the case of data integration systems, data cleaning is a process that needs to be performed during query processing when data are returned from the source databases.

The errors that are subject to data cleaning can generally be broken down into either schema-level or instance-level concerns [Rahm and Do, 2000]. Schema-level problems can arise in each individual LCS due to violations of explicit and implicit constraints. For example, values of attributes may be outside the range of their domains (e.g., a 14th month or a negative salary value), attribute values may violate implicit dependencies (e.g., the age attribute value may not correspond to the value that is computed as the difference between the current date and the birth date), uniqueness of attribute values may not hold, and referential integrity constraints may be violated. Furthermore, in the environment that we are considering in this chapter, the schema-level heterogeneities (both structural and semantic) among the LCSs that we discussed earlier can all be considered problems that need to be resolved. At the schema level, it is clear that the problems need to be identified at the schema match stage and fixed during schema integration.
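The four kinds of schema-level violation listed above can each be phrased as a simple check. The following sketch assumes a hypothetical employee relation held as a list of dictionaries; the attribute names, sample rows, and the reference date are invented for illustration.

```python
from datetime import date


def age_on(birth, today):
    """Completed years between birth and today."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def schema_level_violations(rows, departments, today):
    """Return (row index, kind of violation) pairs for four checks."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        if row["salary"] < 0:                           # domain range
            problems.append((index, "domain"))
        if row["age"] != age_on(row["birth"], today):   # implicit dependency
            problems.append((index, "dependency"))
        if row["id"] in seen_ids:                       # uniqueness
            problems.append((index, "uniqueness"))
        seen_ids.add(row["id"])
        if row["dept"] not in departments:              # referential integrity
            problems.append((index, "referential"))
    return problems


rows = [
    {"id": 1, "salary": 52000, "birth": date(1980, 5, 17), "age": 44, "dept": "D1"},
    {"id": 2, "salary": -100, "birth": date(1990, 1, 1), "age": 35, "dept": "D1"},
    {"id": 2, "salary": 61000, "birth": date(1975, 12, 31), "age": 40, "dept": "D9"},
]
print(schema_level_violations(rows, {"D1", "D2"}, today=date(2025, 3, 1)))
# [(1, 'domain'), (2, 'dependency'), (2, 'uniqueness'), (2, 'referential')]
```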


Instance-level errors are those that exist at the data level. For example, the values of some attributes may be missing although they were required, there could be misspellings and word transpositions (e.g., “M.D. Mary Smith” versus “Mary Smith, M.D.”) or differences in abbreviations (e.g., “J. Doe” in one source database while “J.N. Doe” in another), embedded values (e.g., an aggregate address attribute that includes street name, value, province name, and postal code), values that were erroneously placed in other fields, duplicate values, and contradicting values (the salary value appearing as one value in one database and another value in another database). For instance-level cleaning, the issue is clearly one of generating the mappings such that the data are cleaned through the execution of the mapping functions (queries).
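Word transpositions of the kind shown above can be neutralized by comparing names on an order-insensitive key. The sketch below is one simple way to do this, assuming that commas and letter case carry no meaning; the function names are hypothetical. It deliberately does not resolve abbreviation differences such as “J. Doe” versus “J.N. Doe”, which need approximate matching instead.

```python
from collections import defaultdict


def name_key(name):
    """Order-insensitive key: lowercase tokens, commas dropped, sorted."""
    tokens = name.replace(",", " ").lower().split()
    return tuple(sorted(tokens))


def transposed_duplicates(names):
    """Group the names that differ only in word order or comma placement."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["M.D. Mary Smith", "Mary Smith, M.D.", "J. Doe", "J.N. Doe"]
print(transposed_duplicates(names))
# [['M.D. Mary Smith', 'Mary Smith, M.D.']]
```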

The popular approach to data cleaning has been to define a number of operators that operate either on schemas or on individual data. The operators can be composed into a data cleaning plan. Example schema operators add or drop columns from a table, restructure a table by combining columns or splitting a column into two [Raman and Hellerstein, 2001], or define more complicated schema transformations through a generic “map” operator [Galhardas et al., 2001] that takes a single relation and produces one or more relations. Example data level operators include those that apply a function to every value of one attribute, a merge operator that combines the values of two attributes into the value of a single attribute and its converse split operator [Raman and Hellerstein, 2001], a matching operator that computes an approximate join between tuples of two relations, a clustering operator that groups tuples of a relation into clusters, and a tuple merge operator that partitions the tuples of a relation into groups and collapses the tuples in each group into a single tuple through some aggregation over them [Galhardas et al., 2001], as well as basic operators to find duplicates and eliminate them (this has long been known as the purge/merge problem [Hernández and Stolfo, 1998]). Many of the data level operators compare individual tuples of two relations (from the same or different schemas) and decide whether or not they represent the same fact. This is similar to what is done in schema matching, except that it is done at the individual data level and what is considered are not individual attribute values, but entire tuples. However, the same techniques we studied under schema matching (e.g., use of edit distance or soundex value) can be used in this context. There have been proposals for special techniques for handling this efficiently within the context of data cleaning (e.g., [Chaudhuri et al., 2003]).
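A matching operator based on edit distance can be sketched as a nested-loop approximate join. This is a naive illustration rather than any of the cited operators: it compares a single attribute, uses the standard Levenshtein distance, and checks every pair of tuples, which is exactly the cost that the specialized techniques mentioned above try to avoid. The function names, the sample tuples, and the distance threshold are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete from a
                current[j - 1] + 1,                   # insert into a
                previous[j - 1] + (char_a != char_b)  # substitute or keep
            ))
        previous = current
    return previous[-1]


def approximate_match(left, right, key, max_distance):
    """Pairs of tuples whose key values are within max_distance edits."""
    return [
        (l, r)
        for l in left
        for r in right
        if edit_distance(l[key].lower(), r[key].lower()) <= max_distance
    ]


left = [{"name": "Jon Smith"}, {"name": "Mary Jones"}]
right = [{"name": "John Smith"}, {"name": "Marie Jonas"}]
pairs = approximate_match(left, right, key="name", max_distance=2)
print(pairs)  # [({'name': 'Jon Smith'}, {'name': 'John Smith'})]
```

The threshold trades recall for precision: raising `max_distance` to 3 would also pair “Mary Jones” with “Marie Jonas”.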

Given the large amount of data that needs to be handled, data level cleaning is expensive and efficiency is a significant issue. The physical implementation of each of the operators we discussed above is a considerable concern. Although cleaning can be done off-line as a batch process in the case of data warehouses, for data integration systems, cleaning needs to be done online as data are retrieved from the sources. The performance of data cleaning is, of course, more critical in the latter case. In fact, the performance and scalability concerns in the latter systems have resulted in proposals where data cleaning is forfeited in favor of querying that is tolerant to conflicts [Yan and Özsu, 1999].


4.6 Conclusion

In this chapter we discussed the bottom-up database design process, which we called database integration. This is the process of creating a GCS (or a mediated schema) and determining how each LCS maps to it. A fundamental separation is between data warehouses where the GCS is instantiated and materialized, and data integration systems where the GCS is merely a virtual view.

Although the topic of database integration has been studied extensively for a long time, almost all of the work has been fragmented. Individual projects focus on schema matching, or data cleaning, or schema mapping. There is a serious lack of research that considers an end-to-end methodology for database integration. The lack of a methodology is made more serious by the fact that each of these research activities works on different assumptions related to data models, types of heterogeneities and so on. A notable exception is the work of Bernstein and Melnik [2007], which provides the beginnings of a comprehensive “end-to-end” methodology. This is probably the most important topic that requires attention.

A related concept that has received considerable discussion in the literature is data exchange. This is defined as “the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible” [Fagin et al., 2005]. This is very similar to physical (i.e., materialized) data integration, such as data warehouses, that we discussed in this chapter. A difference between data warehouses and the materialization approaches as addressed in data exchange environments is that data warehouse data typically belong to one organization and can be structured according to a well-defined schema, while in data exchange environments data may come from different sources and contain heterogeneity [Doan et al., 2010]. However, for most of the discussions of this chapter, this is not a major concern.

Our focus in this chapter has been on integrating databases. Increasingly, however, the data that are used in distributed applications involve those that are not in a database. An interesting new topic of discussion among researchers is the integration of structured data that is stored in databases and unstructured data that is maintained in other systems (Web servers, multimedia systems, digital libraries, etc.) [Halevy et al., 2003; Somani et al., 2002]. In next generation systems, the ability to handle both types of data will be increasingly important.

Another issue that we ignored in this chapter is interoperability when a GCS does not exist or cannot be specified. As we discussed in Chapter 1, there have been early objections to interoperable access to multiple data sources through a GCS, arguing instead that the languages should provide facilities to access multiple heterogeneous sources without requiring a GCS. The issue becomes critical in the modern peer-to-peer systems where the scale and the variety of data sources make it quite difficult (if not impossible) to design a GCS. We will discuss data integration in peer-to-peer systems in Chapter 16.


4.7 Bibliographic Notes

A large volume of literature exists on the topic of this chapter. The work goes back to the early 1980s and is nicely surveyed by Batini et al. [1986]. Subsequent work is nicely covered by Elmagarmid et al. [1999] and Sheth and Larson [1990].

There is an upcoming book on this topic that provides the broadest coverage of the subject [Doan et al., 2010]. There are a number of recent overview papers on the topic. Bernstein and Melnik [2007] provides a very nice discussion of the integration methodology. It goes further by comparing the model management work with some of the data integration research. Halevy et al. [2006] reviews the data integration work in the 1990s, focusing on the Information Manifold system [Levy et al., 1996c], which uses a LAV approach. The paper provides a large bibliography and discusses the research areas that have been opened in the intervening years. Haas [2007] takes a comprehensive approach to the entire integration process and divides it into four phases: understanding, which involves discovering relevant information (keys, constraints, data types, etc.), analyzing it to assess quality, and determining statistical properties; standardization, whereby the best way to represent the integrated information is determined; specification, which involves the configuration of the integration process; and execution, which is the actual integration. The specification phase includes the techniques defined in this paper. Doan and Halevy [2005] is another very good overview of the various schema matching techniques. They propose a different, and simpler, classification of the techniques as rule-based, learning-based, and combined.

A large number of systems have been developed that have tested the LAV versus GAV approaches. Many of these focus on querying over integrated systems, so we will discuss them in Chapter 9. Examples of LAV approaches are described in the papers [Duschka and Genesereth, 1997; Levy et al., 1996a; Manolescu et al., 2001] while examples of GAV are presented in papers [Adali et al., 1996a; Garcia-Molina et al., 1997; Haas et al., 1997b].

Topics of structural and semantic heterogeneity have occupied researchers for quite some time. While the literature on this topic is quite extensive, some of the interesting publications that discuss structural heterogeneity are [Dayal and Hwang, 1984; Kim and Seo, 1991; Breitbart et al., 1986; Krishnamurthy et al., 1991] and those that focus on semantic heterogeneity are [Hull, 1997; Ouksel and Sheth, 1999; Kashyap and Sheth, 1996; Bright et al., 1994; Ceri and Widom, 1993]. We should note that this list is seriously incomplete.

More recent works in schema matching are surveyed by Rahm and Bernstein [2001] and Doan and Halevy [2005]. In particular, Rahm and Bernstein [2001] gives a very nice comparison of various proposals.

A number of systems have been developed demonstrating the feasibility of various schema matching approaches. Among rule-based techniques, one can cite DIKE [Palopoli et al., 1998, 2003b,a], DIPE, which is an earlier version of this system [Palopoli et al., 1999], TranSCM [Milo and Zohar, 1998], ARTEMIS [Bergamaschi et al., 2001], similarity flooding [Melnik et al., 2002], CUPID [Madhavan et al., 2001], and COMA [Do and Rahm, 2002].


Exercises

Problem 4.1. Distributed database systems and distributed multidatabase systems represent two different approaches to systems design. Find three real-life applications for which each of these approaches would be more appropriate. Discuss the features of these applications that make them more favorable for one approach or the other.

Problem 4.2. Some architectural models favor the definition of a global conceptual schema, whereas others do not. What do you think? Justify your selection with detailed technical arguments.

Problem 4.3 (*). Give an algorithm to convert a relational schema to an entity-relationship one.

Problem 4.4 (**). Consider the two databases given in Figures 4.13 and 4.14 and described below. Design a global conceptual schema as a union of the two databases by first translating them into the E-R model.

DIRECTOR(NAME, PHONE NO, ADDRESS)
LICENSES(LIC NO, CITY, DATE, ISSUES, COST, DEPT, CONTACT)
RACER(NAME, ADDRESS, MEM NUM)
SPONSOR(SP NAME, CONTACT)
RACE(R NO, LIC NO, DIR, MAL WIN, FRM WIN, SP NAME)

Fig. 4.13 Road Race Database

Figure 4.13 describes a relational race database used by organizers of road races and Figure 4.14 describes an entity-relationship database used by a shoe manufacturer. The semantics of each of these database schemas is discussed below.

Figure 4.13 describes a relational road race database with the following semantics:

DIRECTOR is a relation that defines race directors who organize races; we assume that each race director has a unique name (to be used as the key), a phone number, and an address.

LICENSES is required because all races require a governmental license, which is issued by a CONTACT in a department who is the ISSUER, possibly contained within another government department DEPT; each license has a unique LIC NO (the key), which is issued for use in a specific CITY on a specific DATE with a certain COST.

RACER is a relation that describes people who participate in a race. Each person is identified by NAME, which is not sufficient to identify them uniquely, so a compound key formed with the ADDRESS is required. Finally, each racer may have a MEM NUM to identify him or her as a member of the racing fraternity, but not all competitors have membership numbers.

SPONSOR indicates which sponsor is funding a given race. Typically, one sponsor funds a number of races through a specific person (CONTACT), and a number of races may have different sponsors.


[Figure 4.14: entity-relationship diagram. Entities: MANUFACTURER (Name, Address), DISTRIBUTOR (Name, Address, SIN), SHOES (Model, Size), SALESPERSON (Name, SIN, Commission). Relationships: Contract (Cost), Makes (Prod_cost), Sells (Cost), Employs (Base_sal).]

Fig. 4.14 Sponsor Database

RACE uniquely identifies a single race which has a license number (LIC NO) and race number (R NO) (to be used as a key, since a race may be planned without acquiring a license yet); each race has a winner in the male and female groups (MAL WIN and FEM WIN) and a race director (DIR). Figure 4.14 illustrates an entity-relationship schema used by the sponsor’s database

system with the following semantics: SHOES are produced by sponsors of a certain MODEL and SIZE, which forms the

key to the entity. MANUFACTURER is identified uniquely by NAME and resides at a certain AD-

DRESS. DISTRIBUTOR is a person that has a NAME and ADDRESS (which are necessary

to form the key) and a SIN number for tax purposes. SALESPERSON is a person (entity) who has a NAME, earns a COMMISSION,

and is uniquely identified by his or her SIN number (the key). Makes is a relationship that has a certain fixed production cost (PROD COST). It

indicates that a number of different shoes are made by a manufacturer, and that different manufacturers produce the same shoe.

Sells is a relationship that indicates the wholesale COST to a distributor of shoes. It indicates that each distributor sells more than one type of shoe, and that each type of shoe is sold by more than one distributor.


Contract is a relationship whereby a distributor purchases, for a COST, exclusive rights to represent a manufacturer. Note that this does not preclude the distributor from selling different manufacturers’ shoes.

Employs indicates that each distributor hires a number of salespeople to sell the shoes; each earns a BASE SALARY.

Problem 4.5 (*). Consider three sources:

• Database 1 has one relation Area(Id, Field) providing areas of specialization of employees; the Id field identifies an employee.

• Database 2 has two relations, Teach(Professor, Course) and In(Course, Field); Teach indicates the courses that each professor teaches, and In specifies the possible fields that a course can belong to.

• Database 3 has two relations, Grant(Researcher, GrantNo) for grants given to researchers, and For(GrantNo, Field) indicating which fields the grants are for.

The objective is to build a GCS with two relations: Works(Id, Project) stating that an employee works for a particular project, and Area(Project, Field) associating projects with one or more fields.

(a) Provide a LAV mapping between Database 1 and the GCS.

(b) Provide a GLAV mapping between the GCS and the local schemas.

(c) Suppose one extra relation, Funds(GrantNo, Project), is added to Database 3. Provide a GAV mapping in this case.

Problem 4.6. Consider a GCS with the following relation: Person(Name, Age, Gender). This relation is defined as a view over three LCSs as follows:

CREATE VIEW Person AS
SELECT Name, Age, "male" AS Gender
FROM SoccerPlayer
UNION
SELECT Name, NULL AS Age, Gender
FROM Actor
UNION
SELECT Name, Age, Gender
FROM Politician
WHERE Age > 30

For each of the following queries, discuss which of the three local schemas (SoccerPlayer, Actor, and Politician) contribute to the global query result.

(a) SELECT Name FROM Person

(b) SELECT Name FROM Person WHERE Gender = "female"

(c) SELECT Name FROM Person WHERE Age > 25

(d) SELECT Name FROM Person WHERE Age < 25

(e) SELECT Name FROM Person WHERE Gender = "male" AND Age = 40


Problem 4.7. A GCS with the relation Country(Name, Continent, Population, HasCoast) describes countries of the world. The attribute HasCoast indicates if the country has direct access to the sea. Three LCSs are connected to the global schema using the LAV approach as follows:

CREATE VIEW EuropeanCountry AS
SELECT Name, Continent, Population, HasCoast
FROM Country
WHERE Continent = "Europe"

CREATE VIEW BigCountry AS
SELECT Name, Continent, Population, HasCoast
FROM Country
WHERE Population >= 30000000

CREATE VIEW MidsizeOceanCountry AS
SELECT Name, Continent, Population, HasCoast
FROM Country
WHERE HasCoast = true AND Population > 10000000

(a) For each of the following queries, discuss the results with respect to their completeness, i.e., verify if the (combination of the) local sources cover all relevant results.

1. SELECT Name FROM Country

2. SELECT Name FROM Country WHERE Population > 40

3. SELECT Name FROM Country WHERE Population > 20

(b) For each of the following queries, discuss which of the three LCSs are necessary for the global query result.

1. SELECT Name FROM Country

2. SELECT Name FROM Country WHERE Population > 30 AND Continent = "Europe"

3. SELECT Name FROM Country WHERE Population < 30

4. SELECT Name FROM Country WHERE Population > 30 AND HasCoast = true

Problem 4.8. Consider the following two relations PRODUCT and ARTICLE that are specified in a simplified SQL notation. The perfect schema matching correspondences are denoted by arrows.

PRODUCT                        →  ARTICLE
Id: int PRIMARY KEY            →  Key: varchar(255) PRIMARY KEY
Name: varchar(255)             →  Title: varchar(255)
DeliveryPrice: float           →  Price: real
Description: varchar(8000)     →  Information: varchar(5000)

(a) For each of the five correspondences, indicate which of the following match approaches will probably identify the correspondence:

1. Syntactic comparison of element names, e.g., using edit distance string similarity

2. Comparison of element names using a synonym lookup table

3. Comparison of data types

4. Analysis of instance data values

(b) Is it possible for the listed matching approaches to determine false correspondences for these match tasks? If so, give an example.

Problem 4.9. Consider two relations S(a, b, c) and T(d, e, f). A match approach determines the following similarities between the elements of S and T:

        T.d    T.e    T.f
S.a     0.8    0.3    0.1
S.b     0.5    0.2    0.9
S.c     0.4    0.7    0.8

Based on the given matcher’s result, derive an overall schema match result with the following characteristics:

• Each element participates in exactly one correspondence.

• There is no correspondence where both elements match an element of the opposite schema with a higher similarity than its corresponding counterpart.

Problem 4.10 (*). Figure 4.15 illustrates the schema of three different data sources:

• MyGroup contains publications authored by members of a working group;

• MyConference contains publications of a conference series and associated workshops;

• MyPublisher contains articles that are published in journals.

MyGroup

RELATION Publication
    Pub_ID: INT PRIMARY KEY
    VenueName: VARCHAR
    VenueType: VARCHAR
    Year: INT
    Title: VARCHAR

RELATION AuthorOf
    Pub_ID_FK: INT PRIMARY KEY
    Member_ID_FK: INT PRIMARY KEY

RELATION GroupMember
    Member_ID: INT PRIMARY KEY
    Name: VARCHAR
    Email: VARCHAR

MyPublisher

RELATION Journal
    Journ_ID: INT PRIMARY KEY
    Name: VARCHAR
    Volume: INT
    Issue: INT
    Year: INT

RELATION Article
    Art_ID: INT PRIMARY KEY
    Title: VARCHAR
    Journ_ID_FK: INT

RELATION Person
    Pers_ID: INT PRIMARY KEY
    LastName: VARCHAR
    FirstName: VARCHAR
    Affiliation: VARCHAR

RELATION Author
    Art_ID_FK: INT PRIMARY KEY
    Pers_ID_FK: INT PRIMARY KEY
    Position: INT

RELATION Editor
    Journ_ID_FK: INT PRIMARY KEY
    Pers_IK_FK: INT PRIMARY KEY

MyConference

RELATION ConfWorkshop
    CW_ID: INT PRIMARY KEY
    Year: INT
    Location: VARCHAR
    Organizer: VARCHAR
    AssociatedConf_ID_FK: INT

RELATION Paper
    Pap_ID: INT PRIMARY KEY
    Title: VARCHAR
    Authors: ARRAY[20] OF VARCHAR
    CW_ID_FK: INT

Fig. 4.15 Figure for Exercise 10

The arrows show the foreign key-to-primary key relationships. The sources are defined as follows:

MyGroup

• Publication
    • Pub_ID: unique publication ID
    • VenueName: name of the journal, conference or workshop
    • VenueType: “journal”, “conference”, or “workshop”
    • Year: year of publication
    • Title: publication’s title

• AuthorOf
    • many-to-many relationship representing “group member is author of publication”

• GroupMember
    • Member_ID: unique member ID
    • Name: name of the group member
    • Email: email address of the group member

MyConference

• ConfWorkshop
    • CW_ID: unique ID for the conference/workshop
    • Name: name of the conference or workshop
    • Year: year when the event takes place
    • Location: event’s location
    • Organizer: name of the organizing person
    • AssociatedConf_ID_FK: value is NULL if it is a conference, ID of the associated conference if the event is a workshop (this is assuming that workshops are organized in conjunction with a conference)

• Paper
    • Pap_ID: unique paper ID
    • Title: paper’s title
    • Author: array of author names
    • CW_ID_FK: conference/workshop where the paper is published

MyPublisher

• Journal
    • Journ_ID: unique journal ID
    • Name: journal’s name
    • Year: year when the event takes place
    • Volume: journal volume
    • Issue: journal issue

• Article
    • Art_ID: unique article ID
    • Title: title of the article
    • Journ_ID_FK: journal where the article is published

• Person
    • Pers_ID: unique person ID
    • LastName: last name of the person
    • FirstName: first name of the person
    • Affiliation: person’s affiliation (e.g., the name of a university)

• Author
    • represents the many-to-many relationship for “person is author of article”
    • Position: author’s position in the author list (e.g., first author has Position 1)

• Editor
    • represents the many-to-many relationship for “person is editor of journal issue”

(a) Identify all schema matching correspondences between the schema elements of the sources. Use the names and data types of the schema elements as well as the given description.

(b) Classify your correspondences along the following dimensions:

1. Type of schema elements (e.g., attribute-attribute or attribute-relation)

2. Cardinality (e.g., 1:1 or 1:N)

(c) Give a consolidated global schema that covers all information of the source schemas.

Problem 4.11 (*). Figure 4.16 illustrates (using a simplified SQL syntax) two sources S1 and S2. S1 has two relations, Course and Tutor, and S2 has only one relation, Lecture. The solid arrows denote schema matching correspondences. The dashed arrow represents a foreign key relationship between the two relations in S1.

S1

RELATION Course
    id: INT PRIMARY KEY
    name: VARCHAR(255)
    tutor_id_fk: INT FOREIGN KEY REFERENCES(Tutor)

RELATION Tutor
    id: INT PRIMARY KEY
    lastname: VARCHAR(255)
    firstname: VARCHAR(255)

S2

RELATION Lecture
    id: INT PRIMARY KEY
    title: VARCHAR(255)
    lecturer: VARCHAR(255)

Fig. 4.16 Figure for Exercise 11

The following are four schema mappings (represented as SQL queries) to transform S1’s data into S2.

1. SELECT C.id, C.name AS Title, CONCAT(T.lastname, T.firstname) AS Lecturer
   FROM Course AS C JOIN Tutor AS T ON (C.tutor_id_fk = T.id)

2. SELECT C.id, C.name AS Title, NULL AS Lecturer
   FROM Course AS C
   UNION
   SELECT T.id AS ID, NULL AS Title, T.lastname AS Lecturer
   FROM Course AS C FULL OUTER JOIN Tutor AS T ON (C.tutor_id_fk = T.id)

3. SELECT C.id, C.name AS Title, CONCAT(T.lastname, T.firstname) AS Lecturer
   FROM Course AS C FULL OUTER JOIN Tutor AS T ON (C.tutor_id_fk = T.id)

Discuss each of these schema mappings with respect to the following questions:

(a) Is the mapping meaningful?

(b) Is the mapping complete (i.e., are all data instances of S1 transformed)?

(c) Does the mapping potentially violate key constraints?

Problem 4.12 (*). Consider three data sources:

• Database 1 has one relation AREA(ID, FIELD) providing areas of specialization of employees where ID identifies an employee.

• Database 2 has two relations: TEACH(PROFESSOR, COURSE) and IN(COURSE, FIELD) specifying possible fields a course can belong to.

• Database 3 has two relations: GRANT(RESEARCHER, GRANT#) for grants given to researchers, and FOR(GRANT#, FIELD) indicating the fields that the grants are in.

Design a global schema with two relations: WORKS(ID, PROJECT) that records which projects employees work in, and AREA(PROJECT, FIELD) that associates projects with one or more fields for the following cases:

(a) There should be a LAV mapping between Database 1 and the global schema.

(b) There should be a GLAV mapping between the global schema and the local schemas.

(c) There should be a GAV mapping when one extra relation FUNDS(GRANT#, PROJECT) is added to Database 3.

Problem 4.13 (**). Logic (first-order logic, to be precise) has been suggested as a uniform formalism for schema translation and integration. Discuss how logic can be useful for this purpose.

Chapter 5 Data and Access Control

An important requirement of a centralized or a distributed DBMS is the ability to support semantic data control, i.e., data and access control using high-level semantics. Semantic data control typically includes view management, security control, and semantic integrity control. Informally, these functions must ensure that authorized users perform correct operations on the database, contributing to the maintenance of database integrity. The functions necessary for maintaining the physical integrity of the database in the presence of concurrent accesses and failures are studied separately in Chapters 10 through 12 in the context of transaction management. In the relational framework, semantic data control can be achieved in a uniform fashion. Views, security constraints, and semantic integrity constraints can be defined as rules that the system automatically enforces. The violation of some rules by a user program (a set of database operations) generally implies the rejection of the effects of that program (e.g., undoing its updates) or propagating some effects (e.g., updating related data) to preserve the database integrity.

The definition of the rules for controlling data manipulation is part of the administration of the database, a function generally performed by a database administrator (DBA). This person is also in charge of applying the organizational policies. Well-known solutions for semantic data control have been proposed for centralized DBMSs. In this chapter we briefly review the centralized solution to semantic data control, and present the special problems encountered in a distributed environment and solutions to these problems. The cost of enforcing semantic data control, which is high in terms of resource utilization in a centralized DBMS, can be prohibitive in a distributed environment.

Since the rules for semantic data control must be stored in a catalog, the management of a distributed directory (also called a catalog) is also relevant in this chapter. We discussed directories in Section 3.5. Remember that the directory of a distributed DBMS is itself a distributed database. There are several ways to store semantic data control definitions, according to the way the directory is managed. Directory information can be stored differently according to its type; in other words, some information might be fully replicated whereas other information might be distributed. For example, information that is useful at compile time, such as security control information, could be replicated. In this chapter we emphasize the impact of directory management on the performance of semantic data control mechanisms.

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_5, © Springer Science+Business Media, LLC 2011

This chapter is organized as follows. View management is the subject of Section 5.1. Security control is presented in Section 5.2. Finally, semantic integrity control is treated in Section 5.3. For each section we first outline the solution in a centralized DBMS and then give the distributed solution, which is often an extension of the centralized one, although more difficult.

5.1 View Management

One of the main advantages of the relational model is that it provides full logical data independence. As introduced in Chapter 1, external schemas enable user groups to have their particular view of the database. In a relational system, a view is a virtual relation, defined as the result of a query on base relations (or real relations), but not materialized like a base relation, which is stored in the database. A view is a dynamic window in the sense that it reflects all updates to the database. An external schema can be defined as a set of views and/or base relations. Besides their use in external schemas, views are useful for ensuring data security in a simple way. By selecting a subset of the database, views hide some data. If users may only access the database through views, they cannot see or manipulate the hidden data, which are therefore secure.

In the remainder of this section we look at view management in centralized and distributed systems as well as the problems of updating views. Note that in a distributed DBMS, a view can be derived from distributed relations, and the access to a view requires the execution of the distributed query corresponding to the view definition. An important issue in a distributed DBMS is to make view materialization efficient. We will see how the concept of materialized views helps in solving this problem, among others, but requires efficient techniques for materialized view maintenance.

5.1.1 Views in Centralized DBMSs

Most relational DBMSs use a view mechanism where a view is a relation derived from base relations as the result of a relational query (this was first proposed within the INGRES [Stonebraker, 1975] and System R [Chamberlin et al., 1975] projects). It is defined by associating the name of the view with the retrieval query that specifies it.

Example 5.1. The view of system analysts (SYSAN) derived from relation EMP(ENO, ENAME, TITLE) can be defined by the following SQL query:

CREATE VIEW SYSAN(ENO, ENAME)
AS SELECT ENO, ENAME
   FROM EMP
   WHERE TITLE = "Syst. Anal."

The single effect of this statement is the storage of the view definition in the catalog. No other information needs to be recorded. Therefore, the result of the query defining the view (i.e., a relation having the attributes ENO and ENAME for the system analysts as shown in Figure 5.1) is not produced. However, the view SYSAN can be manipulated as a base relation.

Fig. 5.1 Relation Corresponding to the View SYSAN

Example 5.2. The query

“Find the names of all the system analysts with their project number and responsibility(ies)”

involving the view SYSAN and relation ASG(ENO, PNO, RESP, DUR) can be expressed as

SELECT ENAME, PNO, RESP
FROM SYSAN, ASG
WHERE SYSAN.ENO = ASG.ENO

Mapping a query expressed on views into a query expressed on base relations can be done by query modification [Stonebraker, 1975]. With this technique the variables are changed to range on base relations and the query qualification is merged (ANDed) with the view qualification.

Example 5.3. The preceding query can be modified to

SELECT ENAME, PNO, RESP
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND TITLE = "Syst. Anal."

The result of this query is illustrated in Figure 5.2. ♦
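Query modification can be checked on a small instance. The following Python sketch uses the standard sqlite3 module; the sample rows are hypothetical (they are not the data of Example 3.1), and single-quoted string literals are used because SQLite expects them. It runs the query on the view and the modified query on the base relations and compares the results.

```python
import sqlite3

# In-memory database with hypothetical sample rows (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMP(ENO TEXT PRIMARY KEY, ENAME TEXT, TITLE TEXT);
CREATE TABLE ASG(ENO TEXT, PNO TEXT, RESP TEXT, DUR INTEGER);
INSERT INTO EMP VALUES
    ('E1', 'J. Doe',   'Elect. Eng.'),
    ('E2', 'M. Smith', 'Syst. Anal.'),
    ('E5', 'B. Casey', 'Syst. Anal.');
INSERT INTO ASG VALUES
    ('E1', 'P1', 'Manager', 12),
    ('E2', 'P1', 'Analyst', 24),
    ('E2', 'P2', 'Analyst', 6),
    ('E5', 'P3', 'Manager', 10);
CREATE VIEW SYSAN AS
    SELECT ENO, ENAME FROM EMP WHERE TITLE = 'Syst. Anal.';
""")

# Query expressed on the view (Example 5.2).
query_on_view = """
SELECT ENAME, PNO, RESP FROM SYSAN, ASG WHERE SYSAN.ENO = ASG.ENO
"""

# Same query after query modification (Example 5.3): the variable now
# ranges over EMP and the view qualification is ANDed with the query's.
query_on_base = """
SELECT ENAME, PNO, RESP FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO AND TITLE = 'Syst. Anal.'
"""

view_rows = sorted(conn.execute(query_on_view).fetchall())
base_rows = sorted(conn.execute(query_on_base).fetchall())
print(view_rows == base_rows)
```

Both queries return the same tuples on this instance, which is the property that lets a system perform the rewriting at compile time.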


The modified query is expressed on base relations and can therefore be processed by the query processor. It is important to note that view processing can be done at compile time. The view mechanism can also be used for refining the access controls to include subsets of objects. To specify any user from whom one wants to hide data, the keyword USER generally refers to the logged-on user identifier.

ENAME      PNO   RESP
M.Smith    P1    Analyst
M.Smith    P2    Analyst
B.Casey    P3    Manager
J.Jones    P4    Manager

Fig. 5.2 Result of Query involving View SYSAN

Example 5.4. The view ESAME restricts the access by any user to those employees having the same title:

CREATE VIEW ESAME
AS SELECT *
   FROM EMP E1, EMP E2
   WHERE E1.TITLE = E2.TITLE
   AND E1.ENO = USER

In the view definition above, * stands for “all attributes” and the two tuple variables (E1 and E2) ranging over relation EMP are required to express the join of one tuple of EMP (the one corresponding to the logged-on user) with all tuples of EMP based on the same title. For example, the following query issued by the user J. Doe,

SELECT * FROM ESAME

returns the relation of Figure 5.3. Note that the user J. Doe also appears in the result. If the user who creates ESAME is an electrical engineer, as in this case, the view represents the set of all electrical engineers. ♦

ENO   ENAME    TITLE
E1    J. Doe   Elect. Eng
E2    L. Chu   Elect. Eng

Fig. 5.3 Result of Query on View ESAME


Views can be defined using arbitrarily complex relational queries involving selection, projection, join, aggregate functions, and so on. All views can be interrogated as base relations, but not all views can be manipulated as such. Updates through views can be handled automatically only if they can be propagated correctly to the base relations. We can classify views as being updatable and not updatable. A view is updatable only if the updates to the view can be propagated to the base relations without ambiguity. The view SYSAN above is updatable; the insertion, for example, of a new system analyst 〈201, Smith〉 will be mapped into the insertion of a new employee 〈201, Smith, Syst. Anal.〉. If attributes other than TITLE were hidden by the view, they would be assigned null values.
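The propagation rule for SYSAN can be made explicit with a short Python sketch over sqlite3. SQLite does not propagate updates through views by itself, so the mapping is written by hand as an INSTEAD OF trigger; the trigger name SYSAN_INSERT is an assumption made for this illustration, and other systems may handle such a view automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMP(ENO TEXT PRIMARY KEY, ENAME TEXT, TITLE TEXT);
CREATE VIEW SYSAN AS
    SELECT ENO, ENAME FROM EMP WHERE TITLE = 'Syst. Anal.';

-- Hand-written propagation rule (hypothetical trigger name): an insertion
-- into the view becomes an insertion into EMP, with the hidden attribute
-- TITLE filled in from the view qualification.
CREATE TRIGGER SYSAN_INSERT INSTEAD OF INSERT ON SYSAN
BEGIN
    INSERT INTO EMP(ENO, ENAME, TITLE)
    VALUES (NEW.ENO, NEW.ENAME, 'Syst. Anal.');
END;
""")

# Insert the new system analyst <201, Smith> through the view.
conn.execute("INSERT INTO SYSAN(ENO, ENAME) VALUES ('201', 'Smith')")

emp_rows = conn.execute("SELECT ENO, ENAME, TITLE FROM EMP").fetchall()
sysan_rows = conn.execute("SELECT ENO, ENAME FROM SYSAN").fetchall()
print(emp_rows, sysan_rows)
```

The rule is unambiguous here because each view tuple corresponds to exactly one EMP tuple and the only hidden attribute is fixed by the view qualification.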

Example 5.5. The following view, however, is not updatable:

CREATE VIEW EG(ENAME, RESP)
AS SELECT DISTINCT ENAME, RESP
   FROM EMP, ASG
   WHERE EMP.ENO = ASG.ENO

The deletion, for example, of the tuple 〈Smith, Analyst〉 cannot be propagated, since it is ambiguous. Deletions of Smith in relation EMP or analyst in relation ASG are both meaningful, but the system does not know which is correct. ♦

Current systems are very restrictive about supporting updates through views. Views can be updated only if they are derived from a single relation by selection and projection. This precludes views defined by joins, aggregates, and so on. However, it is theoretically possible to automatically support updates of a larger class of views [Bancilhon and Spyratos, 1981; Dayal and Bernstein, 1978; Keller, 1982]. It is interesting to note that views derived by join are updatable if they include the keys of the base relations.

5.1.2 Views in Distributed DBMSs

The definition of a view is similar in a distributed DBMS and in centralized systems. However, a view in a distributed system may be derived from fragmented relations stored at different sites. When a view is defined, its name and its retrieval query are stored in the catalog.

Since views may be used as base relations by application programs, their definition should be stored in the directory in the same way as the base relation descriptions. Depending on the degree of site autonomy offered by the system [Williams et al., 1982], view definitions can be centralized at one site, partially duplicated, or fully duplicated. In any case, the information associating a view name to its definition site should be duplicated. If the view definition is not present at the site where the query is issued, remote access to the view definition site is necessary.

The mapping of a query expressed on views into a query expressed on base relations (which can potentially be fragmented) can also be done in the same way as in centralized systems, that is, through query modification. With this technique, the qualification defining the view is found in the distributed database catalog and then merged with the query to provide a query on base relations. Such a modified query is a distributed query, which can be processed by the distributed query processor (see Chapter 6). The query processor maps the distributed query into a query on physical fragments.

In Chapter 3 we presented alternative ways of fragmenting base relations. The definition of fragmentation is, in fact, very similar to the definition of particular views. It is possible to manage views and fragments using a unified mechanism [Adiba, 1981]. This is based on the observation that views in a distributed DBMS can be defined with rules similar to fragment definition rules. Furthermore, replicated data can be handled in the same way. The value of such a unified mechanism is to facilitate distributed database administration. The objects manipulated by the database administrator can be seen as a hierarchy where the leaves are the fragments from which relations and views can be derived. Therefore, the DBA may increase locality of reference by making views in one-to-one correspondence with fragments. For example, it is possible to implement the view SYSAN illustrated in Example 5.1 by a fragment at a given site, provided that most users accessing the view SYSAN are at the same site.

Evaluating views derived from distributed relations may be costly. In a given organization it is likely that many users access the same view which must be recomputed for each user. We saw in Section 5.1.1 that view derivation is done by merging the view qualification with the query qualification. An alternative solution is to avoid view derivation by maintaining actual versions of the views, called materialized views. A materialized view stores the tuples of a view in a database relation, like the other database tuples, possibly with indices. Thus, access to a materialized view is much faster than deriving the view, in particular, in a distributed DBMS where base relations can be remote. Introduced in the early 1980s [Adiba and Lindsay, 1980], materialized views have since gained much interest in the context of data warehousing to speed up On Line Analytical Processing (OLAP) applications [Gupta and Mumick, 1999c]. Materialized views in data warehouses typically involve aggregate (such as SUM and COUNT) and grouping (GROUP BY) operators because they provide compact database summaries. Today, all major database products support materialized views.

Example 5.6. The following view over relation PROJ(PNO,PNAME,BUDGET,LOC) gives, for each location, the number of projects and the total budget.

CREATE VIEW PL(LOC, NBPROJ, TBUDGET)
AS SELECT LOC, COUNT(*), SUM(BUDGET)
   FROM PROJ
   GROUP BY LOC
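The difference between the virtual view PL and a materialized copy of it can be illustrated as follows. This is a sketch under stated assumptions: SQLite has no materialized views, so an ordinary table named PL_MAT (a hypothetical name) stands in for one, and the PROJ rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PROJ(PNO TEXT PRIMARY KEY, PNAME TEXT, BUDGET INTEGER, LOC TEXT);
INSERT INTO PROJ VALUES
    ('P1', 'Alpha', 100, 'Paris'),
    ('P2', 'Beta',  200, 'Paris'),
    ('P3', 'Gamma',  50, 'Rome');

-- Virtual view: only the definition is stored.
CREATE VIEW PL AS
    SELECT LOC, COUNT(*) AS NBPROJ, SUM(BUDGET) AS TBUDGET
    FROM PROJ GROUP BY LOC;

-- Stand-in for a materialized view: the tuples of PL are stored in a table.
CREATE TABLE PL_MAT AS SELECT * FROM PL;
""")

# A later update to the base relation.
conn.execute("INSERT INTO PROJ VALUES ('P4', 'Delta', 25, 'Rome')")

virtual = sorted(conn.execute("SELECT * FROM PL").fetchall())
stored = sorted(conn.execute("SELECT * FROM PL_MAT").fetchall())
print(virtual)  # recomputed from PROJ, reflects P4
print(stored)   # stored copy, stale until it is refreshed
```

Reading PL_MAT avoids recomputing the aggregate, but the stored copy no longer agrees with PROJ after the insertion, so it has to be refreshed.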


5.1.3 Maintenance of Materialized Views

A materialized view is a copy of some base data and thus must be kept consistent with that base data which may be updated. View maintenance is the process of updating (or refreshing) a materialized view to reflect the changes made to the base data. The issues related to view materialization are somewhat similar to those of database replication which we will address in Chapter 13. However, a major difference is that materialized view expressions, in particular, for data warehousing, are typically more complex than replica definitions and may include join, group by and aggregate operators. Another major difference is that database replication is concerned with more general replication configurations, e.g., with multiple copies of the same base data at multiple sites.

A view maintenance policy allows a DBA to specify when and how a view should be refreshed. The first question (when to refresh) is related to consistency (between the view and the base data) and efficiency. A view can be refreshed in two modes: immediate or deferred. With the immediate mode, a view is refreshed immediately as part of the transaction that updates base data used by the view. If the view and the base data are managed by different DBMSs, possibly at different sites, this requires the use of a distributed transaction, for instance, using the two-phase commit (2PC) protocol (see Chapter 12). The main advantages of immediate refreshment are that the view is always consistent with the base data and that read-only queries can be fast. However, this is at the expense of increased transaction time to update both the base data and the views within the same transactions. Furthermore, using distributed transactions may be difficult.

In practice, the deferred mode is preferred because the view is refreshed in separate (refresh) transactions, thus without performance penalty on the transactions that update the base data. The refresh transactions can be triggered at different times: lazily, just before a query is evaluated on the view; periodically, at predefined times, e.g., every day; or forcedly, after a predefined number of updates to the base data. Lazy refreshment enables queries to see the latest consistent state of the base data but at the expense of increased query time to include the refreshment of the view. Periodic and forced refreshment allow queries to see views whose state is not consistent with the latest state of the base data. The views managed with these strategies are also called snapshots [Adiba, 1981; Blakeley et al., 1986].
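The deferred triggering strategies can be sketched in a few lines of Python. The class DeferredView, its method names, and the update threshold are assumptions made for this illustration; periodic refreshment would simply call refresh() from a timer and is omitted to keep the sketch deterministic.

```python
class DeferredView:
    """Toy materialized view with a deferred refresh policy (hypothetical helper)."""

    def __init__(self, compute, policy, threshold=3):
        self.compute = compute      # function that recomputes the view from base data
        self.policy = policy        # "lazy" or "forced"
        self.threshold = threshold  # number of base updates that forces a refresh
        self.pending = 0            # base updates not yet reflected in the view
        self.refreshes = 0          # number of refresh transactions run so far
        self.state = compute()      # initial materialization

    def refresh(self):
        """Refresh transaction: recompute the view and clear the backlog."""
        self.state = self.compute()
        self.pending = 0
        self.refreshes += 1

    def on_base_update(self):
        """Called after each update to the base data."""
        self.pending += 1
        if self.policy == "forced" and self.pending >= self.threshold:
            self.refresh()

    def query(self):
        """Lazy views refresh just before the query; others may return stale data."""
        if self.policy == "lazy" and self.pending > 0:
            self.refresh()
        return self.state


base = [1, 2, 3]
lazy = DeferredView(lambda: sum(base), "lazy")
forced = DeferredView(lambda: sum(base), "forced", threshold=2)

base.append(4)
lazy.on_base_update()
forced.on_base_update()
stale_forced = forced.query()   # one update pending, below the threshold
fresh_lazy = lazy.query()       # refreshed just before the query

base.append(5)
lazy.on_base_update()
forced.on_base_update()         # second update reaches the threshold
after_forced = forced.query()
```

The lazy view pays the refresh cost at query time, whereas the forced view answers immediately but may return a state that lags behind the base data.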

The second question (how to refresh a view) is an important efficiency issue. The simplest way to refresh a view is to recompute it from scratch using the base data. In some cases, this may be the most efficient strategy, e.g., if a large subset of the base data has been changed. However, there are many cases where only a small subset of the view needs to be changed. In these cases, a better strategy is to compute the view incrementally, by computing only the changes to the view. Incremental view maintenance relies on the concept of differential relation. Let u be an update of relation R. R⁺ and R⁻ are differential relations of R by u, where R⁺ contains the tuples inserted by u into R, and R⁻ contains the tuples of R deleted by u. If u is an insertion, R⁻ is empty. If u is a deletion, R⁺ is empty. Finally, if u is a modification, relation R can be obtained by computing (R − R⁻) ∪ R⁺. Similarly, a materialized view V can be refreshed by computing (V − V⁻) ∪ V⁺. Computing the changes to the view, i.e., V⁺ and V⁻, may require using the base relations in addition to differential relations.
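The set expression above translates directly into code when relations are modeled as sets of tuples. In the following Python sketch the function name apply_update and the sample tuples are illustrative assumptions.

```python
def apply_update(relation, inserted, deleted):
    """Return (relation - deleted) | inserted, i.e., (R − R⁻) ∪ R⁺."""
    return (relation - deleted) | inserted


# Hypothetical state of a relation R(ENO, ENAME, TITLE).
R = {
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
}

# A modification is modeled as a deletion of the old tuple
# together with an insertion of the new one.
R_minus = {("E2", "M. Smith", "Syst. Anal.")}
R_plus = {("E2", "M. Smith", "Programmer")}
R_new = apply_update(R, R_plus, R_minus)

# A pure insertion has an empty R⁻.
R_inserted = apply_update(R, {("E9", "B. Martin", "Programmer")}, set())
```

The same function refreshes a materialized view when it is applied to V, V⁺ and V⁻ instead of R, R⁺ and R⁻.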

Example 5.7. Consider the view EG of Example 5.5 which uses relations EMP and ASG as base data and assume its state is derived from that of Example 3.1, so that EG has 9 tuples (see Figure 5.4). Let EMP⁺ consist of one tuple 〈E9, B. Martin, Programmer〉 to be inserted in EMP, and ASG⁺ consist of two tuples 〈E4, P3, Programmer, 12〉 and 〈E9, P3, Programmer, 12〉 to be inserted in ASG. The changes to the view EG can be computed as:

EG⁺ = (SELECT ENAME, RESP
       FROM EMP, ASG⁺
       WHERE EMP.ENO = ASG⁺.ENO)
      UNION
      (SELECT ENAME, RESP
       FROM EMP⁺, ASG
       WHERE EMP⁺.ENO = ASG.ENO)
      UNION
      (SELECT ENAME, RESP
       FROM EMP⁺, ASG⁺
       WHERE EMP⁺.ENO = ASG⁺.ENO)

which yields tuples 〈B. Martin, Programmer〉 and 〈J. Miller, Programmer〉. Note that integrity constraints would be useful here to avoid useless work (see Section 5.3.2). Assuming that relations EMP and ASG are related by a referential constraint that says that ENO in ASG must exist in EMP, the second SELECT statement is useless as it produces an empty relation. ♦
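The three-way union of Example 5.7 can be reproduced with plain Python sets. This is a sketch over a hypothetical two-tuple fragment of EMP and ASG rather than the full state of Example 3.1; only the differential tuples and the fact that E4 is J. Miller come from the example, and the helper name join_project is an assumption.

```python
def join_project(emp, asg):
    """SELECT ENAME, RESP FROM emp, asg WHERE emp.ENO = asg.ENO (as a set)."""
    return {
        (ename, resp)
        for (eno, ename, title) in emp
        for (a_eno, pno, resp, dur) in asg
        if eno == a_eno
    }


# Hypothetical fragment of the base relations before the update.
EMP = {("E4", "J. Miller", "Programmer"), ("E2", "M. Smith", "Syst. Anal.")}
ASG = {("E4", "P2", "Programmer", 18), ("E2", "P1", "Analyst", 24)}

# Differential relations given in Example 5.7.
EMP_plus = {("E9", "B. Martin", "Programmer")}
ASG_plus = {("E4", "P3", "Programmer", 12), ("E9", "P3", "Programmer", 12)}

# Changes to the view: union of the three joins that involve a differential relation.
EG_plus = (
    join_project(EMP, ASG_plus)
    | join_project(EMP_plus, ASG)
    | join_project(EMP_plus, ASG_plus)
)

# Cross-check against recomputing the view from scratch.
EG_old = join_project(EMP, ASG)
EG_recomputed = join_project(EMP | EMP_plus, ASG | ASG_plus)
print(EG_plus, EG_old | EG_plus == EG_recomputed)
```

On this fragment the second join is empty, as the referential constraint predicts, and the incremental result agrees with full recomputation.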

EG

ENAME       RESP
J. Doe      Manager
M. Smith    Analyst
A. Lee      Consultant
A. Lee      Engineer
J. Miller   Programmer
B. Casey    Manager
L. Chu      Manager
R. Davis    Engineer
J.Jones     Manager

Fig. 5.4 State of View EG

Efficient techniques have been devised to perform incremental view maintenance using both the materialized views and the base relations. The techniques essentially differ in their views’ expressiveness, their use of integrity constraints, and the way they handle insertion and deletion. Gupta and Mumick [1999a] classify these techniques along the view expressiveness dimension as non-recursive views, views involving outerjoins, and recursive views. For non-recursive views, i.e., select-project-join (SPJ) views that may have duplicate elimination, union and aggregation, an elegant solution is the counting algorithm [Gupta et al., 1993]. One problem stems from the fact that individual tuples in the view may be derived from several tuples in the base relations, thus making deletion in the view difficult. The basic idea of the counting algorithm is to maintain a count of the number of derivations for each tuple in the view, and to increment (resp. decrement) tuple counts based on insertions (resp. deletions); a tuple in the view whose count is zero can then be deleted.

Example 5.8. Consider the view EG in Figure 5.4. Each tuple in EG has one derivation (i.e., a count of 1) except tuple 〈M. Smith, Analyst〉 which has two (i.e., a count of 2). Assume now that tuples 〈E2, P1, Analyst, 24〉 and 〈E3, P3, Consultant, 10〉 are deleted from ASG. Then only tuple 〈A. Lee, Consultant〉 needs to be deleted from EG. ♦

We now present the basic counting algorithm for refreshing a view V defined over two relations R and S as a query q(R, S). Assuming that each tuple in V has an associated derivation count, the algorithm has three main steps (see Algorithm 5.1). First, it applies the view differentiation technique to formulate the differential views V⁺ and V⁻ as queries over the view, the base relations, and the differential relations. Second, it computes V⁺ and V⁻ and their tuple counts. Third, it applies the changes V⁺ and V⁻ in V by adding positive counts and subtracting negative counts, and deleting tuples with a count of zero.

Algorithm 5.1: COUNTING Algorithm
Input: V: view defined as q(R,S); R, S: relations; R+, R−: changes to R
begin
    V+ = q+(V, R+, R, S);
    V− = q−(V, R−, R, S);
    compute V+ with positive counts for inserted tuples;
    compute V− with negative counts for deleted tuples;
    compute (V − V−) ∪ V+ by adding positive counts and subtracting negative counts, deleting each tuple in V with count = 0;
end
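The three steps of the counting algorithm can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the view is a join of EMP(ENO, ENAME) and ASG(ENO, PNO, RESP) projected on (ENAME, RESP), only ASG changes, and the differential views are obtained by running the view query on the differential relations. The data and the names `q` and `refresh` are hypothetical and are not taken from Figure 5.4.

```python
from collections import Counter

def q(emp, asg):
    """View query: join EMP(ENO, ENAME) with ASG(ENO, PNO, RESP) on ENO and
    project (ENAME, RESP). Each view tuple is mapped to its derivation count."""
    names = dict(emp)
    return Counter((names[eno], resp) for eno, _pno, resp in asg if eno in names)

def refresh(view, emp, asg_plus, asg_minus):
    """Refresh the counted view from changes to ASG only (EMP is assumed unchanged)."""
    v_plus = q(emp, asg_plus)    # differential view for insertions
    v_minus = q(emp, asg_minus)  # differential view for deletions
    counts = Counter(view)
    counts.update(v_plus)        # add positive counts
    counts.subtract(v_minus)     # subtract counts of deleted derivations
    # A tuple whose count drops to zero leaves the view.
    return Counter({t: c for t, c in counts.items() if c > 0})

# Hypothetical data: M. Smith is an analyst in two projects (count of 2).
emp = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee")]
asg = [("E1", "P1", "Manager"), ("E2", "P1", "Analyst"),
       ("E2", "P2", "Analyst"), ("E3", "P3", "Consultant")]
view = q(emp, asg)
deleted = [("E2", "P1", "Analyst"), ("E3", "P3", "Consultant")]
refreshed = refresh(view, emp, asg_plus=[], asg_minus=deleted)
```

With these deletions, the count of 〈M. Smith, Analyst〉 only drops from 2 to 1, so the tuple stays in the view, whereas 〈A. Lee, Consultant〉 reaches a count of zero and is removed.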

The counting algorithm is optimal since it computes exactly the view tuples that are inserted or deleted. However, it requires access to the base relations. This implies that the base relations be maintained (possibly as replicas) at the sites of the materialized view. To avoid accessing the base relations so the view can be stored at a different site, the view should be maintainable using only the view and the differential relations. Such views are called self-maintainable [Gupta et al., 1996].


Example 5.9. Consider the view SYSAN in Example 5.1. Let us write the view definition as SYSAN=q(EMP) meaning that the view is defined by a query q on EMP. We can compute the differential views using only the differential relations, i.e., SYSAN+ = q(EMP+) and SYSAN− = q(EMP−). Thus, the view SYSAN is self-maintainable.
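A small Python sketch can make the self-maintenance of SYSAN concrete. We assume here that q selects ENO and ENAME of the employees whose title is "Syst. Anal."; this definition, the sample tuples, and the function names are assumptions made for illustration. The refresh function reads only the view and the differential relations, never EMP itself.

```python
def q(emp):
    """Assumed view query for SYSAN: ENO and ENAME of system analysts."""
    return {(eno, ename) for eno, ename, title in emp if title == "Syst. Anal."}

def refresh_sysan(sysan, emp_plus, emp_minus):
    """Refresh SYSAN from the differential relations only: (V - q(EMP-)) U q(EMP+)."""
    return (sysan - q(emp_minus)) | q(emp_plus)

# Hypothetical instance of EMP(ENO, ENAME, TITLE) and its changes.
emp = {("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal.")}
emp_plus = {("E5", "B. Casey", "Syst. Anal."), ("E6", "L. Chu", "Elect. Eng.")}
emp_minus = {("E2", "M. Smith", "Syst. Anal.")}

sysan = refresh_sysan(q(emp), emp_plus, emp_minus)

# Recomputing the view from the updated base relation gives the same result.
emp_new = (emp - emp_minus) | emp_plus
```

Set semantics are sufficient in this sketch because the key ENO is part of the view, so each view tuple has exactly one derivation.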

Self-maintainability depends on the views’ expressiveness and can be defined with respect to the kind of updates (insertion, deletion or modification) [Gupta et al., 1996]. Most SPJ views are not self-maintainable with respect to insertion but are often self-maintainable with respect to deletion and modification. For instance, an SPJ view is self-maintainable with respect to deletion of relation R if the key attributes of R are included in the view.

Example 5.10. Consider the view EG of Example 5.5. Let us add attribute ENO (which is the key of EMP) to the view definition. This view is not self-maintainable with respect to insertion. For instance, after an insertion of an ASG tuple, we need to perform the join with EMP to get the corresponding ENAME to insert in the view. However, this view is self-maintainable with respect to deletion on EMP. For instance, if one EMP tuple is deleted, the view tuples having the same ENO can be deleted.

5.2 Data Security

Data security is an important function of a database system that protects data against unauthorized access. Data security includes two aspects: data protection and access control.

Data protection is required to prevent unauthorized users from understanding the physical content of data. This function is typically provided by file systems in the context of centralized and distributed operating systems. The main data protection approach is data encryption [Fernandez et al., 1981], which is useful both for information stored on disk and for information exchanged on a network. Encrypted (encoded) data can be decrypted (decoded) only by authorized users who “know” the code. The two main schemes are the Data Encryption Standard [NBS, 1977] and the public-key encryption schemes ([Diffie and Hellman, 1976] and [Rivest et al., 1978]). In this section we concentrate on the second aspect of data security, which is more specific to database systems. A complete presentation of database security techniques can be found in [Castano et al., 1995].

Access control must guarantee that only authorized users perform operations they are allowed to perform on the database. Many different users may have access to a large collection of data under the control of a single centralized or distributed system. The centralized or distributed DBMS must thus be able to restrict access to a subset of the database to a subset of the users. Access control has long been provided by operating systems, and more recently, by distributed operating systems [Tanenbaum, 1995] as services of the file system. In this context, a centralized control is offered. Indeed, the central controller creates objects, and this person may allow particular users to perform particular operations (read, write, execute) on these objects. Also, objects are identified by their external names.

Access control in database systems differs in several aspects from that in traditional file systems. Authorizations must be refined so that different users have different rights on the same database objects. This requirement implies the ability to specify subsets of objects more precisely than by name and to distinguish between groups of users. In addition, the decentralized control of authorizations is of particular importance in a distributed context. In relational systems, authorizations can be uniformly controlled by database administrators using high-level constructs. For example, controlled objects can be specified by predicates in the same way as is a query qualification.

There are two main approaches to database access control [Lunt and Fernández, 1990]. The first approach is called discretionary and has long been provided by DBMSs. Discretionary access control (or authorization control) defines access rights based on the users, the type of access (e.g., SELECT, UPDATE) and the objects to be accessed. The second approach, called mandatory or multilevel [Lunt and Fernández, 1990; Jajodia and Sandhu, 1991], further increases security by restricting access to classified data to cleared users. Support of multilevel access control by major DBMSs is more recent and stems from increased security threats coming from the Internet.

From solutions to access control in centralized systems, we derive those for distributed DBMSs. However, there is the additional complexity which stems from the fact that objects and users can be distributed. In what follows we first present discretionary and multilevel access control in centralized systems and then the additional problems and their solutions in distributed systems.

5.2.1 Discretionary Access Control

Three main actors are involved in discretionary access control: the subjects (e.g., users, groups of users) who trigger the execution of application programs; the operations, which are embedded in application programs; and the database objects, on which the operations are performed [Hoffman, 1977]. Authorization control consists of checking whether a given triple (subject, operation, object) can be allowed to proceed (i.e., the user can execute the operation on the object). An authorization can be viewed as a triple (subject, operation type, object definition) which specifies that the subject has the right to perform an operation of operation type on an object. To control authorizations properly, the DBMS requires the definition of subjects, objects, and access rights.

The introduction of a subject in the system is typically done by a pair (user name, password). The user name uniquely identifies the users of that name in the system, while the password, known only to the users of that name, authenticates the users. Both user name and password must be supplied in order to log into the system. This prevents people who do not know the password from entering the system with only the user name.


The objects to protect are subsets of the database. Relational systems provide finer and more general protection granularity than do earlier systems. In a file system, the protection granule is the file, while in an object-oriented DBMS, it is the object type. In a relational system, objects can be defined by their type (view, relation, tuple, attribute) as well as by their content using selection predicates. Furthermore, the view mechanism as introduced in Section 5.1 permits the protection of objects simply by hiding subsets of relations (attributes or tuples) from unauthorized users.

A right expresses a relationship between a subject and an object for a particular set of operations. In an SQL-based relational DBMS, an operation is a high-level statement such as SELECT, INSERT, UPDATE, or DELETE, and rights are defined (granted or revoked) using the following statements:

GRANT 〈operation type(s)〉 ON 〈object〉 TO 〈subject(s)〉
REVOKE 〈operation type(s)〉 ON 〈object〉 FROM 〈subject(s)〉

The keyword public can be used to mean all users. Authorization control can be characterized based on who (the grantors) can grant the rights. In its simplest form, the control is centralized: a single user or user class, the database administrators, has all privileges on the database objects and is the only one allowed to use the GRANT and REVOKE statements.

A more flexible but complex form of control is decentralized [Griffiths and Wade, 1976]: the creator of an object becomes its owner and is granted all privileges on it. In particular, there is the additional operation type GRANT, which transfers all the rights of the grantor performing the statement to the specified subjects. Therefore, the person receiving the right (the grantee) may subsequently grant privileges on that object. The main difficulty with this approach is that the revoking process must be recursive. For example, if A, who granted B who granted C the GRANT privilege on object O, wants to revoke all the privileges of B on O, all the privileges of C on O must also be revoked. To perform revocation, the system must maintain a hierarchy of grants per object where the creator of the object is the root.
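The recursive nature of revocation can be illustrated with a minimal Python sketch. It keeps, for one object, the grants of the GRANT privilege as a graph rooted at the owner, and after a revocation it drops the grants issued by subjects that can no longer be reached from the owner. The class `GrantGraph` and its methods are hypothetical, and the reachability rule is a simplification of what an actual system would implement.

```python
class GrantGraph:
    """Grants of the GRANT privilege on a single object; the owner is the root."""

    def __init__(self, owner):
        self.owner = owner
        self.grants = {}  # grantor -> set of grantees

    def holders(self):
        """Subjects that hold the privilege, i.e., those reachable from the owner."""
        seen, stack = {self.owner}, [self.owner]
        while stack:
            for grantee in self.grants.get(stack.pop(), ()):
                if grantee not in seen:
                    seen.add(grantee)
                    stack.append(grantee)
        return seen

    def grant(self, grantor, grantee):
        if grantor not in self.holders():
            raise PermissionError(f"{grantor} does not hold the GRANT privilege")
        self.grants.setdefault(grantor, set()).add(grantee)

    def revoke(self, grantor, grantee):
        self.grants.get(grantor, set()).discard(grantee)
        # Recursive effect: grants made by subjects that lost the privilege vanish.
        alive = self.holders()
        for subject in list(self.grants):
            if subject not in alive:
                del self.grants[subject]

g = GrantGraph(owner="A")
g.grant("A", "B")
g.grant("B", "C")
g.revoke("A", "B")   # C loses the privilege as well
remaining = g.holders()
```

In this sketch a subject that received the privilege along two independent paths keeps it when only one path is revoked, which is one possible design choice among several.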

The privileges of the subjects over objects are recorded in the catalog (directory) as authorization rules. There are several ways to store the authorizations. The most convenient approach is to consider all the privileges as an authorization matrix, in which a row defines a subject, a column an object, and a matrix entry (for a pair 〈subject, object〉), the authorized operations. The authorized operations are specified by their operation type (e.g., SELECT, UPDATE). It is also customary to associate with the operation type a predicate that further restricts the access to the object. The latter option is provided when the objects must be base relations and cannot be views. For example, one authorized operation for the pair 〈Jones, relation EMP〉 could be

SELECT WHERE TITLE = "Syst.Anal."

which authorizes Jones to access only the employee tuples for system analysts. Figure 5.5 gives an example of an authorization matrix where objects are either relations (EMP and ASG) or attributes (ENAME).


        EMP      ENAME    ASG
Casey   UPDATE   UPDATE   UPDATE
Jones   SELECT   SELECT   SELECT WHERE RESP ≠ "Manager"
Smith   NONE     SELECT   NONE

Fig. 5.5 Example of Authorization Matrix

The authorization matrix can be stored in three ways: by row, by column, or by element. When the matrix is stored by row, each subject is associated with the list of objects that may be accessed together with the related access rights. This approach makes the enforcement of authorizations efficient, since all the rights of the logged-on user are together (in the user profile). However, the manipulation of access rights per object (e.g., making an object public) is not efficient since all subject profiles must be accessed. When the matrix is stored by column, each object is associated with the list of subjects who may access it with the corresponding access rights. The advantages and disadvantages of this approach are the reverse of the previous approach.

The respective advantages of the two approaches can be combined in the third approach, in which the matrix is stored by element, that is, by relation (subject, object, right). This relation can have indices on both subject and object, thereby providing fast-access right manipulation per subject and per object.
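Storage by element with indices on both subject and object can be sketched as follows. This is an illustrative in-memory structure rather than a catalog design: the class `AuthorizationCatalog` and its methods are hypothetical, and the predicates that may qualify an operation (such as the WHERE clause in Figure 5.5) are left out.

```python
from collections import defaultdict

class AuthorizationCatalog:
    """Authorization triples (subject, object, right), indexed both ways."""

    def __init__(self):
        self.by_subject = defaultdict(dict)  # subject -> {object: set of operations}
        self.by_object = defaultdict(dict)   # object -> {subject: set of operations}

    def grant(self, subject, operation, obj):
        ops = self.by_subject[subject].setdefault(obj, set())
        ops.add(operation)
        self.by_object[obj][subject] = ops   # both indices share the same set

    def is_authorized(self, subject, operation, obj):
        """Enforcement: one lookup through the subject index."""
        return operation in self.by_subject.get(subject, {}).get(obj, ())

    def subjects_of(self, obj):
        """Per-object manipulation: one lookup through the object index."""
        return sorted(self.by_object.get(obj, {}))

catalog = AuthorizationCatalog()
for obj in ("EMP", "ENAME", "ASG"):
    catalog.grant("Casey", "UPDATE", obj)
catalog.grant("Jones", "SELECT", "EMP")
catalog.grant("Jones", "SELECT", "ENAME")
catalog.grant("Smith", "SELECT", "ENAME")
```

Checking a request such as (Jones, SELECT, EMP) touches only the profile of Jones, while listing who may access ENAME touches only the entry of ENAME.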

5.2.2 Multilevel Access Control

Discretionary access control has some limitations. One problem is that a malicious user can access unauthorized data through an authorized user. For instance, consider user A who has authorized access to relations R and S and user B who has authorized access to relation S only. If B somehow manages to modify an application program used by A so it writes R data into S, then B can read unauthorized data without violating authorization rules.

Multilevel access control answers this problem and further improves security by defining different security levels for both subjects and data objects. Multilevel access control in databases is based on the well-known Bell and LaPadula model designed for operating system security [Bell and Lapuda, 1976]. In this model, subjects are processes acting on a user’s behalf; a process has a security level, also called clearance, derived from that of the user. In its simplest form, the security levels are Top Secret (TS), Secret (S), Confidential (C) and Unclassified (U), ordered as TS > S > C > U, where “>” means “more secure”. Access in read and write modes by subjects is restricted by two simple rules:

1. A subject S is allowed to read an object of security level l only if level(S) ≥ l.

2. A subject S is allowed to write an object of security level l only if level(S) ≤ l.

Rule 1 (called “no read up”) protects data from unauthorized disclosure, i.e., a subject at a given security level can only read objects at the same or lower security levels. For instance, a subject with secret clearance cannot read top-secret data. Rule 2 (called “no write down”) protects data from unauthorized change, i.e., a subject at a given security level can only write objects at the same or higher security levels. For instance, a subject with top-secret clearance can only write top-secret data but cannot write secret data (which could then contain top-secret data).
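The two rules translate directly into comparisons on an ordered set of levels. The following Python sketch is only an illustration; the names `Level`, `can_read` and `can_write` are ours.

```python
from enum import IntEnum

class Level(IntEnum):
    """Security levels ordered as TS > S > C > U."""
    U = 0
    C = 1
    S = 2
    TS = 3

def can_read(subject_level, object_level):
    """Rule 1 ("no read up"): read only objects at the same or a lower level."""
    return subject_level >= object_level

def can_write(subject_level, object_level):
    """Rule 2 ("no write down"): write only objects at the same or a higher level."""
    return subject_level <= object_level
```

For instance, `can_read(Level.S, Level.TS)` is false, since a secret subject cannot read top-secret data, and `can_write(Level.TS, Level.S)` is false, since a top-secret subject cannot write secret data.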

In the relational model, data objects can be relations, tuples or attributes. Thus, a relation can be classified at different levels: relation (i.e., all tuples in the relation have the same security level), tuple (i.e., every tuple has a security level), or attribute (i.e., every distinct attribute value has a security level). A classified relation is thus called a multilevel relation to reflect that it will appear differently (with different data) to subjects with different clearances. For instance, a multilevel relation classified at the tuple level can be represented by adding a security level attribute to each tuple. Similarly, a multilevel relation classified at the attribute level can be represented by adding a corresponding security level to each attribute. Figure 5.6 illustrates a multilevel relation PROJ* based on relation PROJ which is classified at the attribute level. Note that the additional security level attributes may significantly increase the size of the relation.

PROJ*
PNO  SL1  PNAME              SL2  BUDGET  SL3  LOC       SL4
P1   C    Instrumentation    C    150000  C    Montreal  C
P2   C    Database Develop.  C    135000  S    New York  S
P3   S    CAD/CAM            S    250000  S    New York  S

Fig. 5.6 Multilevel relation PROJ* classified at the attribute level

The entire relation also has a security level which is the lowest security level of any data it contains. For instance, relation PROJ* has security level C. A relation can then be accessed by any subject having a security level which is the same or higher. However, a subject can only access data for which it has clearance. Thus, attributes for which a subject has no clearance will appear to the subject as null values with an associated security level which is the same as the subject. Figure 5.7 shows an instance of relation PROJ* as accessed by a subject at a confidential security level.
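The way a multilevel relation appears to a subject can be sketched in Python using the data of Figure 5.6. Each attribute value is stored with its security level. The function name `view_for` is hypothetical, and so is the rule that a tuple whose key the subject cannot read is hidden entirely; we adopt it because it yields the two tuples of Figure 5.7.

```python
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}

# PROJ* of Figure 5.6: each attribute holds a (value, security level) pair.
PROJ_STAR = [
    {"PNO": ("P1", "C"), "PNAME": ("Instrumentation", "C"),
     "BUDGET": (150000, "C"), "LOC": ("Montreal", "C")},
    {"PNO": ("P2", "C"), "PNAME": ("Database Develop.", "C"),
     "BUDGET": (135000, "S"), "LOC": ("New York", "S")},
    {"PNO": ("P3", "S"), "PNAME": ("CAD/CAM", "S"),
     "BUDGET": (250000, "S"), "LOC": ("New York", "S")},
]

def view_for(relation, clearance, key="PNO"):
    """Return the relation as seen by a subject with the given clearance."""
    visible = []
    for tup in relation:
        if LEVELS[tup[key][1]] > LEVELS[clearance]:
            continue  # assumption: the tuple is hidden when its key is not readable
        visible.append({
            attr: (value, level) if LEVELS[level] <= LEVELS[clearance]
            else (None, clearance)  # null value at the subject's own level
            for attr, (value, level) in tup.items()
        })
    return visible

proj_c = view_for(PROJ_STAR, "C")
```

Here `proj_c` contains the tuples for P1 and P2 only, and the BUDGET and LOC values of P2 are replaced by null values at level C.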

PROJ*C
PNO  SL1  PNAME              SL2  BUDGET  SL3  LOC       SL4
P1   C    Instrumentation    C    150000  C    Montreal  C
P2   C    Database Develop.  C    Null    C    Null      C

Fig. 5.7 Confidential relation PROJ*C

Multilevel access control has a strong impact on the data model because users do not see the same data and have to deal with unexpected side-effects. One major side-effect is called polyinstantiation [Lunt et al., 1990], which allows the same object to have different attribute values depending on the users’ security level. Figure 5.8 illustrates a multilevel relation with polyinstantiated tuples. The tuple with primary key P3 has two instantiations, each one with a different security level. This may result from a subject S with security level C inserting a tuple with key=“P3” in relation PROJ* in Figure 5.6. Because S (with confidential clearance level) must not learn of the existence of the tuple with key=“P3” (classified as secret), the only practical solution is to add a second tuple with the same key and a different classification. However, a user with secret clearance would see both tuples with key=“P3” and would have to interpret this unexpected effect.

PROJ**
PNO  SL1  PNAME              SL2  BUDGET  SL3  LOC       SL4
P1   C    Instrumentation    C    150000  C    Montreal  C
P2   C    Database Develop.  C    135000  S    New York  S
P3   S    CAD/CAM            S    250000  S    New York  S
P3   C    Web Develop.       C    200000  C    Paris     C

Fig. 5.8 Multilevel relation with polyinstantiation

5.2.3 Distributed Access Control

The additional problems of access control in a distributed environment stem from the fact that objects and subjects are distributed and that messages with sensitive data can be read by unauthorized users. These problems are: remote user authentication, management of discretionary access rules, handling of views and of user groups, and enforcing multilevel access control.

Remote user authentication is necessary since any site of a distributed DBMS may accept programs initiated, and authorized, at remote sites. To prevent remote access by unauthorized users or applications (e.g., from a site that is not part of the distributed DBMS), users must also be identified and authenticated at the accessed site. Furthermore, instead of using passwords that could be obtained from sniffing messages, encrypted certificates could be used.

Three solutions are possible for managing authentication:

1. Authentication information is maintained at a central site for global users, who can then be authenticated only once and access the database from multiple sites.

2. The information for authenticating users (user name and password) is replicated at all sites in the catalog. Local programs, initiated at a remote site, must also indicate the user name and password.

3. All sites of the distributed DBMS identify and authenticate themselves similarly to the way users do. Intersite communication is thus protected by the use of the site password. Once the initiating site has been authenticated, there is no need for authenticating its remote users.

The first solution simplifies password administration significantly and enables single authentication (also called single sign on). However, the central authentication site can be a single point of failure and a bottleneck. The second solution is more costly in terms of directory management given that the introduction of a new user is a distributed operation. However, users can access the distributed database from any site. The third solution is necessary if user information is not replicated. Nevertheless, it can also be used if there is replication of the user information. In this case it makes remote authentication more efficient. If user names and passwords are not replicated, they should be stored at the sites where the users access the system (i.e., the home site). The latter solution is based on the realistic assumption that users are more static, or at least they always access the distributed database from the same site.

Distributed authorization rules are expressed in the same way as centralized ones. Like view definitions, they must be stored in the catalog. They can be either fully replicated at each site or stored at the sites of the referenced objects. In the latter case the rules are duplicated only at the sites where the referenced objects are distributed. The main advantage of the fully replicated approach is that authorization can be processed by query modification [Stonebraker, 1975] at compile time. However, directory management is more costly because of data duplication. The second solution is better if locality of reference is very high. However, distributed authorization cannot be controlled at compile time.

Views may be considered to be objects by the authorization mechanism. Views are composite objects, that is, composed of other underlying objects. Therefore, granting access to a view translates into granting access to underlying objects. If view definition and authorization rules for all objects are fully replicated (as in many systems), this translation is rather simple and can be done locally. The translation is harder when the view definition and its underlying objects are all stored separately [Wilms and Lindsay, 1981], as is the case under the site autonomy assumption. In this situation, the translation is a totally distributed operation. The authorizations granted on views depend on the access rights of the view creator on the underlying objects. A solution is to record the association information at the site of each underlying object.

Handling user groups for the purpose of authorization simplifies distributed database administration. In a centralized DBMS, “all users” can be referred to as public. In a distributed DBMS, the same notion is useful, the public denoting all the users of the system. However, an intermediate level is often introduced to specify the public at a particular site, denoted by public@site_s [Wilms and Lindsay, 1981]. The public is a particular user group. More precise groups can be defined by the command

DEFINE GROUP 〈group id〉 AS 〈list of subject ids〉

The management of groups in a distributed environment poses some problems since the subjects of a group can be located at various sites and access to an object may be granted to several groups, which are themselves distributed. If group information as well as access rules are fully replicated at all sites, the enforcement of access rights is similar to that of a centralized system. However, maintaining this replication may be expensive. The problem is more difficult if site autonomy (with decentralized control) must be maintained. Several solutions to this problem have been identified [Wilms and Lindsay, 1981]. One solution enforces access rights by performing a remote query to the nodes holding the group definition. Another solution replicates a group definition at each node containing an object that may be accessed by subjects of that group. These solutions tend to decrease the degree of site autonomy.

Enforcing multilevel access control in a distributed environment is made difficult by the possibility of indirect means, called covert channels, to access unauthorized data [Rjaibi, 2004]. For instance, consider a simple distributed DBMS architecture with two sites, each managing its database at a single security level, e.g., one site is confidential while the other is secret. According to the “no write down” rule, an update operation from a subject with secret clearance could only be sent to the secret site. However, according to the “no read up” rule, a read query from the same secret subject could be sent to both the secret and the confidential sites. Since the query sent to the confidential site may contain secret information (e.g., in a select predicate), it is potentially a covert channel. To avoid such covert channels, a solution is to replicate part of the database [Thuraisingham, 2001] so that a site at security level l contains all data that a subject at level l can access. For instance, the secret site would replicate confidential data so that it can entirely process secret queries. One problem with this architecture is the overhead of maintaining the consistency of replicas (see Chapter 13 on replication). Furthermore, although there are no covert channels for queries, there may still be covert channels for update operations because the delays involved in synchronizing transactions may be exploited [Jajodia et al., 2001]. The complete support for multilevel access control in distributed database systems, therefore, requires significant extensions to transaction management techniques [Ray et al., 2000] and to distributed query processing techniques [Agrawal et al., 2003].

5.3 Semantic Integrity Control

Another important and difficult problem for a database system is how to guarantee database consistency. A database state is said to be consistent if the database satisfies a set of constraints, called semantic integrity constraints. Maintaining a consistent database requires various mechanisms such as concurrency control, reliability, protection, and semantic integrity control, which are provided as part of transaction management. Semantic integrity control ensures database consistency by rejecting update transactions that lead to inconsistent database states, or by activating specific actions on the database state, which compensate for the effects of the update transactions. Note that the updated database must satisfy the set of integrity constraints.

In general, semantic integrity constraints are rules that represent the knowledge about the properties of an application. They define static or dynamic application properties that cannot be directly captured by the object and operation concepts of a data model. Thus the concept of an integrity rule is strongly connected with that of a data model in the sense that more semantic information about the application can be captured by means of these rules.

Two main types of integrity constraints can be distinguished: structural constraints and behavioral constraints. Structural constraints express basic semantic properties inherent to a model. Examples of such constraints are unique key constraints in the relational model, or one-to-many associations between objects in the object-oriented model. Behavioral constraints, on the other hand, regulate the application behavior. Thus they are essential in the database design process. They can express associations between objects, such as inclusion dependency in the relational model, or describe object properties and structures. The increasing variety of database applications and the development of database design aid tools call for powerful integrity constraints that can enrich the data model.

Integrity control appeared with data processing and evolved from procedural methods (in which the controls were embedded in application programs) to declarative methods. Declarative methods have emerged with the relational model to alleviate the problems of program/data dependency, code redundancy, and poor performance of the procedural methods. The idea is to express integrity constraints using assertions of predicate calculus [Florentin, 1974]. Thus a set of semantic integrity assertions defines database consistency. This approach allows one to easily declare and modify complex integrity constraints.

The main problem in supporting automatic semantic integrity control is that the cost of checking for constraint violation can be prohibitive. Enforcing integrity constraints is costly because it generally requires access to a large amount of data that are not directly involved in the database updates. The problem is more difficult when constraints are defined over a distributed database.

Various solutions have been investigated to design an integrity manager by combining optimization strategies. Their purpose is to (1) limit the number of constraints that need to be enforced, (2) decrease the number of data accesses to enforce a given constraint in the presence of an update transaction, (3) define a preventive strategy that detects inconsistencies in a way that avoids undoing updates, (4) perform as much integrity control as possible at compile time. A few of these solutions have been implemented, but they suffer from a lack of generality. Either they are restricted to a small set of assertions (more general constraints would have a prohibitive checking cost) or they only support restricted programs (e.g., single-tuple updates).

In this section we present the solutions for semantic integrity control first in centralized systems and then in distributed systems. Since our context is the relational model, we consider only declarative methods.


5.3.1 Centralized Semantic Integrity Control

A semantic integrity manager has two main components: a language for expressing and manipulating integrity assertions, and an enforcement mechanism that performs specific actions to enforce database integrity upon update transactions.

5.3.1.1 Specification of Integrity Constraints

Integrity constraints should be manipulated by the database administrator using a high-level language. In this section we illustrate a declarative language for specifying integrity constraints [Simon and Valduriez, 1987]. This language is much in the spirit of the standard SQL language, but with more generality. It allows one to specify, read, or drop integrity constraints. These constraints can be defined either at relation creation time, or at any time, even if the relation already contains tuples. In both cases, however, the syntax is almost the same. For simplicity and without loss of generality, we assume that the effect of integrity constraint violation is to abort the violating transactions. However, the SQL standard provides means to express the propagation of update actions to correct inconsistencies, with the CASCADING clause within the constraint declaration. More generally, triggers (event-condition-action rules) [Ramakrishnan and Gehrke, 2003] can be used to automatically propagate updates, and thus to maintain semantic integrity. However, triggers are quite powerful and thus more difficult to support efficiently than specific integrity constraints.

In relational database systems, integrity constraints are defined as assertions. An assertion is a particular expression of tuple relational calculus (see Chapter 2), in which each variable is either universally (∀) or existentially (∃) quantified. Thus an assertion can be seen as a query qualification that is either true or false for each tuple in the Cartesian product of the relations determined by the tuple variables. We can distinguish between three types of integrity constraints: predefined, precondition, or general constraints.

Examples of integrity constraints will be given on the following database:

EMP(ENO, ENAME, TITLE)

PROJ(PNO, PNAME, BUDGET)

ASG(ENO, PNO, RESP, DUR)

Predefined constraints are based on simple keywords. Through them, it is possible to express concisely the more common constraints of the relational model, such as non-null attribute, unique key, foreign key, or functional dependency [Fagin and Vardi, 1984]. Examples 5.11 through 5.14 demonstrate predefined constraints.

Example 5.11. Employee number in relation EMP cannot be null.

ENO NOT NULL IN EMP


Example 5.12. The pair (ENO, PNO) is the unique key in relation ASG.

(ENO, PNO) UNIQUE IN ASG

Example 5.13. The project number PNO in relation ASG is a foreign key matching the primary key PNO of relation PROJ. In other words, a project referred to in relation ASG must exist in relation PROJ.

PNO IN ASG REFERENCES PNO IN PROJ

Example 5.14. The employee number functionally determines the employee name.

ENO IN EMP DETERMINES ENAME
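To make the meaning of these four predefined constraints concrete, the following Python sketch checks them over small in-memory relations represented as lists of dictionaries. The helper names mirror the keywords but are hypothetical, as is the sample data.

```python
def not_null(rel, attr):
    """attr NOT NULL IN rel"""
    return all(t[attr] is not None for t in rel)

def unique(rel, attrs):
    """(attrs) UNIQUE IN rel"""
    keys = [tuple(t[a] for a in attrs) for t in rel]
    return len(keys) == len(set(keys))

def references(child, attr, parent, parent_attr):
    """attr IN child REFERENCES parent_attr IN parent"""
    parent_keys = {t[parent_attr] for t in parent}
    return all(t[attr] in parent_keys for t in child)

def determines(rel, lhs, rhs):
    """lhs IN rel DETERMINES rhs (functional dependency)"""
    seen = {}
    return all(seen.setdefault(t[lhs], t[rhs]) == t[rhs] for t in rel)

# Hypothetical instances of EMP, PROJ and ASG.
EMP = [{"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
       {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal."}]
PROJ = [{"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000}]
ASG = [{"ENO": "E1", "PNO": "P1", "RESP": "Manager", "DUR": 12},
       {"ENO": "E2", "PNO": "P1", "RESP": "Analyst", "DUR": 24}]

all_hold = (not_null(EMP, "ENO") and unique(ASG, ["ENO", "PNO"])
            and references(ASG, "PNO", PROJ, "PNO")
            and determines(EMP, "ENO", "ENAME"))
```

Each helper scans the whole relation, which is exactly the cost that an integrity manager tries to avoid by checking only the tuples affected by an update.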

Precondition constraints express conditions that must be satisfied by all tuples in a relation for a given update type. The update type, which might be INSERT, DELETE, or MODIFY, permits restricting the integrity control. To identify in the constraint definition the tuples that are subject to update, two variables, NEW and OLD, are implicitly defined. They range over new tuples (to be inserted) and old tuples (to be deleted), respectively [Astrahan et al., 1976]. Precondition constraints can be expressed with the SQL CHECK statement enriched with the ability to specify the update type. The syntax of the CHECK statement is

CHECK ON 〈relation name〉 WHEN 〈update type〉 (〈qualification over relation name〉)

Examples of precondition constraints are the following:

Example 5.15. The budget of a project is between 500K and 1000K.

CHECK ON PROJ (BUDGET >= 500000 AND BUDGET <= 1000000)

Example 5.16. Only the tuples whose budget is 0 may be deleted.

CHECK ON PROJ WHEN DELETE (BUDGET = 0)

Example 5.17. The budget of a project can only increase.

CHECK ON PROJ (NEW.BUDGET > OLD.BUDGET AND NEW.PNO = OLD.PNO)
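The role of the update type and of the NEW and OLD variables can be sketched as follows. This is a minimal illustration under stated assumptions: each precondition is stored as a (relation, update type, predicate) entry, a constraint without a WHEN clause is checked on the new tuple whenever one exists, and Example 5.17 is registered under MODIFY because it refers to both NEW and OLD. The names PRECONDITIONS and allowed are hypothetical.

```python
# Each entry: (relation, update type or None for "any", predicate over OLD and NEW).
# OLD is None for an insert and NEW is None for a delete (an assumption of this sketch).
PRECONDITIONS = [
    # Example 5.15: budget between 500K and 1000K, checked on the new tuple if any.
    ("PROJ", None,
     lambda old, new: new is None or 500000 <= new["BUDGET"] <= 1000000),
    # Example 5.16: only tuples whose budget is 0 may be deleted.
    ("PROJ", "DELETE", lambda old, new: old["BUDGET"] == 0),
    # Example 5.17: the budget of a project can only increase.
    ("PROJ", "MODIFY",
     lambda old, new: new["BUDGET"] > old["BUDGET"] and new["PNO"] == old["PNO"]),
]


def allowed(relation, update_type, old=None, new=None):
    """True if every precondition relevant to this update type holds."""
    return all(
        pred(old, new)
        for rel, when, pred in PRECONDITIONS
        if rel == relation and when in (None, update_type)
    )
```

Only the entries whose update type matches are evaluated, which is how the WHEN clause restricts the integrity control.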

General constraints are formulas of tuple relational calculus where all variables are quantified. The database system must ensure that those formulas are always true. General constraints are more concise than precondition constraints since the former may involve more than one relation. For instance, at least three precondition constraints are necessary to express a general constraint on three relations. A general constraint may be expressed with the following syntax:


CHECK ON list of 〈variable name〉:〈relation name〉, (〈qualification〉)

Examples of general constraints are given below.

Example 5.18. The functional dependency of Example 5.14 may also be expressed as

CHECK ON e1:EMP, e2:EMP (e1.ENAME = e2.ENAME IF e1.ENO = e2.ENO)

Example 5.19. The total duration for all employees in the CAD/CAM project is less than 100.

CHECK ON g:ASG, j:PROJ (SUM(g.DUR WHERE g.PNO=j.PNO)<100 IF j.PNAME="CAD/CAM")

5.3.1.2 Integrity Enforcement

We now focus on enforcing semantic integrity, which consists of rejecting update transactions that violate some integrity constraints. A constraint is violated when it becomes false in the new database state produced by the update transaction. A major difficulty in designing an integrity manager is finding efficient enforcement algorithms. Two basic methods permit the rejection of inconsistent update transactions. The first one is based on the detection of inconsistencies. The update transaction u is executed, causing a change of the database state D to Du. The enforcement algorithm verifies, by applying tests derived from these constraints, that all relevant constraints hold in state Du. If state Du is inconsistent, the DBMS can try either to reach another consistent state, D′u, by modifying Du with compensation actions, or to restore state D by undoing u. Since these tests are applied after having changed the database state, they are generally called posttests. This approach may be inefficient if a large amount of work (the update of D) must be undone in the case of an integrity failure.

The second method is based on the prevention of inconsistencies. An update is executed only if it changes the database state to a consistent state. The tuples subject to the update transaction are either directly available (in the case of insert) or must be retrieved from the database (in the case of deletion or modification). The enforcement algorithm verifies that all relevant constraints will hold after updating those tuples. This is generally done by applying to those tuples tests that are derived from the integrity constraints. Given that these tests are applied before the database state is changed, they are generally called pretests. The preventive approach is more efficient than the detection approach since updates never need to be undone because of integrity violation.
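The difference between the two methods can be sketched on a single-relation update. The code below is an illustration only: it assumes the database is a dictionary holding PROJ as a list of dictionaries, uses the budget range of Example 5.15 as the sample constraint, and the function names are hypothetical.

```python
import copy


def budget_ok(budget):
    # Sample constraint (Example 5.15): 500K <= BUDGET <= 1000K.
    return 500000 <= budget <= 1000000


def update_with_posttest(db, pno, new_budget):
    """Detection: execute u, test state Du, undo u if Du is inconsistent."""
    saved = copy.deepcopy(db["PROJ"])          # kept so that state D can be restored
    for p in db["PROJ"]:
        if p["PNO"] == pno:
            p["BUDGET"] = new_budget
    if not all(budget_ok(p["BUDGET"]) for p in db["PROJ"]):
        db["PROJ"] = saved                     # undo the whole update
        return False
    return True


def update_with_pretest(db, pno, new_budget):
    """Prevention: test the values subject to the update before changing D."""
    if not budget_ok(new_budget):
        return False                           # state D was never modified
    for p in db["PROJ"]:
        if p["PNO"] == pno:
            p["BUDGET"] = new_budget
    return True
```

The posttest variant must save and possibly restore state, and it rechecks the whole relation; the pretest variant inspects only the new value and never undoes work.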

The query modification algorithm [Stonebraker, 1975] is an example of a preventive method that is particularly efficient at enforcing domain constraints. It adds the assertion qualification to the query qualification by an AND operator so that the modified query can enforce integrity.


Example 5.20. The query for increasing the budget of the CAD/CAM project by 10%, which would be specified as

UPDATE PROJ SET BUDGET = BUDGET*1.1 WHERE PNAME= "CAD/CAM"

will be transformed into the following query in order to enforce the domain constraint discussed in Example 5.15.

UPDATE PROJ SET BUDGET = BUDGET * 1.1 WHERE PNAME= "CAD/CAM" AND NEW.BUDGET ≥ 500000 AND NEW.BUDGET ≤ 1000000

The query modification algorithm, which is well known for its elegance, produces pretests at run time by ANDing the assertion predicates with the update predicates of each instruction of the transaction. However, the algorithm only applies to tuple calculus formulas and can be specified as follows. Consider the assertion (∀x ∈ R)F(x), where F is a tuple calculus expression in which x is the only free variable. An update of R can be written as (∀x ∈ R)(Q(x)⇒ update(x)), where Q is a tuple calculus expression whose only free variable is x. Roughly speaking, the query modification consists in generating the update (∀x ∈ R)((Q(x) and F(x))⇒update(x)). Thus x needs to be universally quantified.
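The effect of ANDing the assertion with the update qualification can be sketched as follows. This is a minimal illustration, not the INGRES implementation: relations are lists of dictionaries, and, following Example 5.20, the assertion F is evaluated on the new value of the tuple. The function name query_modification and the sample tuples are hypothetical.

```python
def query_modification(relation, Q, F, update):
    """Apply update(x) only to tuples x for which Q(x) and F(new value of x) hold."""
    result = []
    for x in relation:
        if Q(x):
            new_x = update(x)
            if F(new_x):                 # assertion ANDed with the qualification
                result.append(new_x)
                continue
        result.append(x)                 # tuple left unchanged
    return result


# Sample state (values made up for illustration).
PROJ = [
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 600000},
    {"PNO": "P4", "PNAME": "CAD/CAM", "BUDGET": 950000},
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 700000},
]

after = query_modification(
    PROJ,
    Q=lambda x: x["PNAME"] == "CAD/CAM",
    F=lambda new: 500000 <= new["BUDGET"] <= 1000000,
    update=lambda x: {**x, "BUDGET": round(x["BUDGET"] * 1.1)},
)
```

In this sketch the second CAD/CAM tuple keeps its budget, because a 10% increase would exceed 1000000. Tuples that fail the combined qualification are simply not updated; no error is raised.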

Example 5.21. The foreign key constraint of Example 5.13 that can be rewritten as

∀g ∈ ASG, ∃ j ∈ PROJ : g.PNO = j.PNO

could not be processed by query modification because the variable j is not universally quantified.

To handle more general constraints, pretests can be generated at constraint definition time, and enforced at run time when updates occur [Bernstein et al., 1980a; Bernstein and Blaustein, 1982; Blaustein, 1981; Nicolas, 1982]. The method described by Nicolas [1982] is restricted to updates that insert or delete a single tuple of a single relation. The algorithm proposed by Bernstein et al. [1980a] and Blaustein [1981] is an improvement, although updates are still restricted to a single tuple. The algorithm builds a pretest at constraint definition time for each constraint and each update type (insert, delete). These pretests are enforced at run time. This method accepts multirelation, monovariable assertions, possibly with aggregates. The principle is the substitution of the tuple variables in the assertion by constants from an updated tuple. Despite its important contribution to research, the method is hardly usable in a real environment because of the restriction on updates.

In the rest of this section, we present the method proposed by Simon and Valduriez [1986, 1987], which combines the generality of updates supported by Stonebraker [1975] with at least the generality of assertions for which pretests can be produced by Blaustein [1981]. This method is based on the production, at assertion definition time, of pretests that are used subsequently to prevent the introduction of inconsistencies in the database. This is a general preventive method that handles the entire set of constraints introduced in the preceding section. It significantly reduces the proportion of the database that must be checked when enforcing assertions in the presence of updates. This is a major advantage when applied to a distributed environment.

The definition of pretest uses differential relations, as defined in Section 5.1.3. A pretest is a triple (R,U,C) in which R is a relation, U is an update type, and C is an assertion ranging over the differential relation(s) involved in an update of type U. When an integrity constraint I is defined, a set of pretests may be produced for the relations used by I. Whenever a relation involved in I is updated by a transaction u, the pretests that must be checked to enforce I are only those defined on I for the update type of u. The performance advantage of this approach is twofold. First, the number of assertions to enforce is minimized since only the pretests defined for the update type of u need be checked. Second, the cost of enforcing a pretest is less than that of enforcing I since differential relations are, in general, much smaller than the base relations.

Pretests may be obtained by applying transformation rules to the original assertion. These rules are based on a syntactic analysis of the assertion and quantifier permutations. They permit the substitution of differential relations for base relations. Since the pretests are simpler than the original assertions, the process that generates them is called simplification.

Example 5.22. Consider the modified expression of the foreign key constraint in Example 5.21. The pretests associated with this constraint are

(ASG, INSERT, C1), (PROJ, DELETE, C2) and (PROJ, MODIFY, C3)

where C1 is

∀ NEW ∈ ASG+, ∃ j ∈ PROJ: NEW.PNO = j.PNO

C2 is

∀g ∈ ASG, ∀ OLD ∈ PROJ− : g.PNO ≠ OLD.PNO

and C3 is

∀g ∈ ASG, ∀OLD ∈ PROJ−, ∃ NEW ∈ PROJ+ : g.PNO ≠ OLD.PNO OR OLD.PNO = NEW.PNO

The advantage provided by such pretests is obvious. For instance, a deletion on relation ASG does not incur any assertion checking.
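The three pretests can be transcribed almost literally over differential relations. The sketch below assumes that relations and differential relations are lists of dictionaries; the function names and the PRETESTS lookup table are hypothetical, chosen only to show that the pretests are indexed by relation and update type.

```python
def c1(asg_plus, proj):
    # ∀ NEW ∈ ASG+, ∃ j ∈ PROJ : NEW.PNO = j.PNO
    return all(any(new["PNO"] == j["PNO"] for j in proj) for new in asg_plus)


def c2(asg, proj_minus):
    # ∀ g ∈ ASG, ∀ OLD ∈ PROJ− : g.PNO ≠ OLD.PNO
    return all(g["PNO"] != old["PNO"] for g in asg for old in proj_minus)


def c3(asg, proj_minus, proj_plus):
    # ∀ g ∈ ASG, ∀ OLD ∈ PROJ−, ∃ NEW ∈ PROJ+ :
    #     g.PNO ≠ OLD.PNO OR OLD.PNO = NEW.PNO
    return all(
        any(g["PNO"] != old["PNO"] or old["PNO"] == new["PNO"] for new in proj_plus)
        for g in asg
        for old in proj_minus
    )


# Pretests indexed by (relation, update type); there is no entry for a
# deletion on ASG, so such an update triggers no checking at all.
PRETESTS = {
    ("ASG", "INSERT"): c1,
    ("PROJ", "DELETE"): c2,
    ("PROJ", "MODIFY"): c3,
}
```

Only the differential relations and the relation on the other side of the constraint are read, never the full updated relation.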

The enforcement algorithm [Simon and Valduriez, 1984] makes use of pretests and is specialized according to the class of the assertions. Three classes of constraints are distinguished: single-relation constraints, multirelation constraints, and constraints involving aggregate functions.


Let us now summarize the enforcement algorithm. Recall that an update transaction updates all tuples of relation R that satisfy some qualification. The algorithm acts in two steps. The first step generates the differential relations R+ and R− from R. The second step simply consists of retrieving the tuples of R+ and R− that do not satisfy the pretests. If no tuples are retrieved, the constraint is valid. Otherwise, it is violated.

Example 5.23. Suppose there is a deletion on PROJ. Enforcing (PROJ, DELETE, C2) consists in generating the following statement:

result← retrieve all tuples of PROJ− where ¬(C2)

Then, if the result is empty, the assertion is verified by the update and consistency is preserved.
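The two steps of the enforcement algorithm can be sketched for the deletion of Example 5.23. This is an illustration under simple assumptions: relations are lists of dictionaries, the delete is given as a qualification function, and the helper names are hypothetical.

```python
def differential_for_delete(rel, qualification):
    """Step 1: for a delete, R+ is empty and R− holds the qualifying tuples."""
    r_minus = [t for t in rel if qualification(t)]
    return [], r_minus


def violators_of_c2(proj_minus, asg):
    """Step 2: tuples of PROJ− where ¬(C2), i.e. still referenced from ASG."""
    referenced = {g["PNO"] for g in asg}
    return [old for old in proj_minus if old["PNO"] in referenced]


def enforce_delete_on_proj(proj, asg, qualification):
    _, proj_minus = differential_for_delete(proj, qualification)
    result = violators_of_c2(proj_minus, asg)
    return len(result) == 0      # empty result: consistency is preserved
```

For instance, deleting a project to which no ASG tuple refers yields an empty result, whereas deleting a referenced project returns the offending PROJ− tuple and the update must be rejected.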

5.3.2 Distributed Semantic Integrity Control

In this section we present algorithms for ensuring the semantic integrity of distributed databases. They are extensions of the simplification method discussed previously. In what follows, we assume global transaction management capabilities, as provided for homogeneous systems or multidatabase systems. Thus, the two main problems of designing an integrity manager for such a distributed DBMS are the definition and storage of assertions, and the enforcement of these constraints. We will also discuss the issues involved in integrity constraint checking when there is no global transaction support.

5.3.2.1 Definition of Distributed Integrity Constraints

An integrity constraint is supposed to be expressed in tuple relational calculus. Each assertion is seen as a query qualification that is either true or false for each tuple in the Cartesian product of the relations determined by the tuple variables. Since assertions can involve data stored at different sites, the storage of the constraints must be decided so as to minimize the cost of integrity checking. There is a strategy based on a taxonomy of integrity constraints that distinguishes three classes:

1. Individual constraints: single-relation single-variable constraints. They refer only to tuples to be updated independently of the rest of the database. For instance, the domain constraint of Example 5.15 is an individual assertion.

2. Set-oriented constraints: include single-relation multivariable constraints such as functional dependency (Example 5.14) and multirelation multivariable constraints such as foreign key constraints (Example 5.13).


3. Constraints involving aggregates: require special processing because of the cost of evaluating the aggregates. The assertion in Example 5.19 is representative of a constraint of this class.

The definition of a new integrity constraint can be started at one of the sites that store the relations involved in the assertion. Remember that the relations can be fragmented. A fragmentation predicate is a particular case of assertion of class 1. Different fragments of the same relation can be located at different sites. Thus, defining an integrity assertion becomes a distributed operation, which is done in two steps. The first step is to transform the high-level assertions into pretests, using the techniques discussed in the preceding section. The next step is to store pretests according to the class of constraints. Constraints of class 3 are treated like those of class 1 or 2, depending on whether they are individual or set-oriented.

Individual constraints.

The constraint definition is sent to all other sites that contain fragments of the relation involved in the constraint. The constraint must be compatible with the relation data at each site. Compatibility can be checked at two levels: predicate and data. First, predicate compatibility is verified by comparing the constraint predicate with the fragment predicate. A constraint C is not compatible with a fragment predicate p if “C is true” implies that “p is false,” and is compatible with p otherwise. If non-compatibility is found at one of the sites, the constraint definition is globally rejected because tuples of that fragment do not satisfy the integrity constraints. Second, if predicate compatibility has been found, the constraint is tested against the instance of the fragment. If it is not satisfied by that instance, the constraint is also globally rejected. If compatibility is found, the constraint is stored at each site. Note that the compatibility checks are performed only for pretests whose update type is “insert” (the tuples in the fragments are considered “inserted”).

Example 5.24. Consider relation EMP, horizontally fragmented across three sites using the predicates

p1 : 0 ≤ ENO < “E3”

p2 : “E3” ≤ ENO ≤ “E6”

p3 : ENO > “E6”

and the domain constraint C: ENO < “E4”. Constraint C is compatible with p1 (if C is true, p1 is not necessarily false) and p2 (if C is true, p2 is not necessarily false), but not with p3 (if C is true, then p3 is false). Therefore, constraint C should be globally rejected because the tuples at site 3 cannot satisfy C, and thus relation EMP does not satisfy C.
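Predicate compatibility for this example can be checked mechanically if the domain of ENO is small enough to enumerate. The sketch below makes that simplifying assumption (ENO values “E1” through “E9”, compared as strings, and the lower bound of p1 omitted); a real system would reason on the predicates symbolically. The names DOMAIN and compatible are hypothetical.

```python
# Assumed finite domain of ENO values; single-digit suffixes compare correctly as strings.
DOMAIN = [f"E{i}" for i in range(1, 10)]


def compatible(C, p, domain=DOMAIN):
    """C is incompatible with p when "C is true" implies "p is false".

    Over a finite domain, this means no value satisfies both C and p.
    """
    return any(C(v) and p(v) for v in domain)


C = lambda eno: eno < "E4"
fragment_predicates = {
    "p1": lambda eno: eno < "E3",
    "p2": lambda eno: "E3" <= eno <= "E6",
    "p3": lambda eno: eno > "E6",
}

result = {name: compatible(C, p) for name, p in fragment_predicates.items()}
globally_accepted = all(result.values())
```

Here result marks p1 and p2 as compatible and p3 as incompatible, so the constraint definition is globally rejected.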


Set-oriented constraints.

Set-oriented constraints are multivariable; that is, they involve join predicates. Although the assertion predicate may be multirelation, a pretest is associated with a single relation. Therefore, the constraint definition can be sent to all the sites that store a fragment referenced by these variables. Compatibility checking also involves fragments of the relation used in the join predicate. Predicate compatibility is useless here, because it is impossible to infer that a fragment predicate p is false if the constraint C (based on a join predicate) is true. Therefore C must be checked for compatibility against the data. This compatibility check basically requires joining each fragment of the relation, say R, with all fragments of the other relation, say S, involved in the constraint predicate. This operation may be expensive and, as any join, should be optimized by the distributed query processor. Three cases, given in increasing cost of checking, can occur:

1. The fragmentation of R is derived (see Chapter 3) from that of S based on a semijoin on the attribute used in the assertion join predicate.

2. S is fragmented on join attribute.

3. S is not fragmented on join attribute.

In the first case, compatibility checking is cheap since the tuple of S matching a tuple of R is at the same site. In the second case, each tuple of R must be compared with at most one fragment of S, because the join attribute value of the tuple of R can be used to find the site of the corresponding fragment of S. In the third case, each tuple of R must be compared with all fragments of S. If compatibility is found for all tuples of R, the constraint can be stored at each site.

Example 5.25. Consider the set-oriented pretest (ASG, INSERT, C1) defined in Example 5.22, where C1 is

∀ NEW ∈ ASG+, ∃ j ∈ PROJ : NEW.PNO = j.PNO

Let us consider the following three cases:

1. ASG is fragmented using the predicate

ASG ⋉_PNO PROJi

where PROJi is a fragment of relation PROJ. In this case each tuple NEW of ASG has been placed at the same site as tuple j such that NEW.PNO = j.PNO. Since the fragmentation predicate is identical to that of C1, compatibility checking does not incur communication.

2. PROJ is horizontally fragmented based on the two predicates

p1 : PNO < “P3”

p2 : PNO ≥ “P3”


In this case each tuple NEW of ASG is compared with either fragment PROJ1, if NEW.PNO < “P3”, or fragment PROJ2 if NEW.PNO ≥ “P3”.

3. PROJ is horizontally fragmented based on the two predicates

p1 : PNAME = “CAD/CAM”

p2 : PNAME ≠ “CAD/CAM”

In this case each tuple of ASG must be compared with both fragments PROJ1 and PROJ2.
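The three cases differ only in how many PROJ fragments must be consulted for a given tuple of ASG. A minimal sketch of that routing decision follows; the site names, the case labels, and the function sites_to_contact are hypothetical and serve only to make the cost difference explicit.

```python
PROJ_SITES = ["site1", "site2"]   # hypothetical sites holding PROJ1 and PROJ2


def sites_to_contact(new_pno, case, local_site):
    """Return the sites whose PROJ fragment must be compared with a tuple of ASG."""
    if case == "derived":
        # Case 1: ASG is fragmented by semijoin with PROJ on PNO,
        # so the matching PROJ tuple is stored at the same site.
        return [local_site]
    if case == "on_join_attribute":
        # Case 2: PROJ is fragmented on PNO (p1: PNO < "P3", p2: PNO >= "P3"),
        # so the PNO value identifies exactly one fragment.
        return ["site1"] if new_pno < "P3" else ["site2"]
    if case == "other_attribute":
        # Case 3: PROJ is fragmented on PNAME; PNO gives no location information.
        return list(PROJ_SITES)
    raise ValueError(f"unknown case: {case}")
```

The number of sites returned grows from the local site only, to one possibly remote site, to all sites storing PROJ, which mirrors the increasing cost of checking.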

5.3.2.2 Enforcement of Distributed Integrity Assertions

Enforcing distributed integrity assertions is more complex than in centralized DBMSs, even with global transaction management support. The main problem is to decide where (at which site) to enforce the integrity constraints. The choice depends on the class of the constraint, the type of update, and the nature of the site where the update is issued (called the query master site). This site may, or may not, store the updated relation or some of the relations involved in the integrity constraints. The critical parameter we consider is the cost of transferring data, including messages, from one site to another. We now discuss the different types of strategies according to these criteria.

Individual constraints.

Two cases are considered. If the update transaction is an insert statement, all the tuples to be inserted are explicitly provided by the user. In this case, all individual constraints can be enforced at the site where the update is submitted. If the update is a qualified update (delete or modify statements), it is sent to the sites storing the relation that will be updated. The query processor executes the update qualification for each fragment. The resulting tuples at each site are combined into one temporary relation in the case of a delete statement, or two, in the case of a modify statement (i.e., R+ and R−). Each site involved in the distributed update enforces the assertions relevant at that site (e.g., domain constraints when it is a delete).

Set-oriented constraints.

We first study single-relation constraints by means of an example. Consider the functional dependency of Example 5.14. The pretest associated with update type INSERT is

(EMP, INSERT, C)


where C is

(∀e ∈ EMP)(∀NEW1 ∈ EMP)(∀NEW2 ∈ EMP) (1)

(NEW1.ENO = e.ENO ⇒ NEW1.ENAME = e.ENAME) ∧ (2)

(NEW1.ENO = NEW2.ENO ⇒ NEW1.ENAME = NEW2.ENAME) (3)

The second line in the definition of C checks the constraint between the inserted tuples (NEW1) and the existing ones (e), while the third checks it between the inserted tuples themselves. That is why two variables (NEW1 and NEW2) are declared in the first line.

Consider now an update of EMP. First, the update qualification is executed by the query processor and returns one or two temporary relations, as in the case of individual constraints. These temporary relations are then sent to all sites storing EMP. Assume that the update is an INSERT statement. Then each site storing a fragment of EMP will enforce constraint C described above. Because e in C is universally quantified, C must be satisfied by the local data of each site. This is due to the fact that ∀x ∈ {a1, . . . ,an} f (x) is equivalent to [ f (a1)∧ f (a2)∧ ·· · ∧ f (an)]. Thus the site where the update is submitted must receive from each site a message indicating that this constraint is satisfied, since satisfaction at every site is required. If the constraint is not true for one site, this site sends an error message indicating that the constraint has been violated. The update is then invalid, and it is the responsibility of the integrity manager to decide if the entire transaction must be rejected using the global transaction manager.
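The site-by-site evaluation of C for an insertion can be sketched as follows. The sketch assumes that each fragment of EMP is a list of dictionaries that already satisfies the functional dependency, and it models the messages returned by the sites as booleans; all function names are hypothetical.

```python
def local_fd_ok(fragment, emp_plus):
    """Line (2) of C, evaluated at one site against its fragment of EMP."""
    # The stored fragment is assumed consistent: one ENAME per ENO.
    names = {e["ENO"]: e["ENAME"] for e in fragment}
    return all(names.get(n["ENO"], n["ENAME"]) == n["ENAME"] for n in emp_plus)


def new_tuples_fd_ok(emp_plus):
    """Line (3) of C, checked among the inserted tuples themselves."""
    seen = {}
    return all(seen.setdefault(n["ENO"], n["ENAME"]) == n["ENAME"] for n in emp_plus)


def enforce_insert(fragments, emp_plus):
    # One reply per site; because e is universally quantified,
    # the constraint holds globally only if it holds at every site.
    replies = [local_fd_ok(frag, emp_plus) for frag in fragments.values()]
    return new_tuples_fd_ok(emp_plus) and all(replies)
```

The final conjunction over the replies is the programmatic counterpart of the equivalence between ∀x ∈ {a1, ..., an} f(x) and f(a1) ∧ ... ∧ f(an).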

Let us now consider multirelation constraints. For the sake of clarity, we assume that the integrity constraints do not have more than one tuple variable ranging over the same relation. Note that this is likely to be the most frequent case. As with single-relation constraints, the update is computed at the site where it was submitted. The enforcement is done at the query master site, using the ENFORCE algorithm given in Algorithm 5.2.

Example 5.26. We illustrate this algorithm through an example based on the foreign key constraint of Example 5.13. Let u be an insertion of a new tuple into ASG. The previous algorithm uses the pretest (ASG, INSERT, C), where C is

∀ NEW ∈ ASG+, ∃ j ∈ PROJ : NEW.PNO = j.PNO

For this constraint, the retrieval statement is to retrieve all new tuples in ASG+ where C is not true. This statement can be expressed in SQL as

SELECT NEW.* FROM ASG+ NEW, PROJ WHERE COUNT(PROJ.PNO WHERE NEW.PNO = PROJ.PNO)=0

Note that NEW.* denotes all the attributes of ASG+.

Algorithm 5.2: ENFORCE Algorithm

Input: U: update type; R: relation

begin
    retrieve all compiled assertions (R, U, Ci) ;
    inconsistent ← false ;
    for each compiled assertion do
        result ← all new (respectively old) tuples of R where ¬(Ci) ;
        if card(result) ≠ 0 then
            inconsistent ← true
    if ¬inconsistent then
        send the tuples to update to all the sites storing fragments of R
    else
        reject the update
end

Thus the strategy is to send new tuples to sites storing relation PROJ in order to perform the joins, and then to centralize all results at the query master site. For each site storing a fragment of PROJ, the site joins the fragment with ASG+ and sends the result to the query master site, which performs the union of all results. If the union is empty, the database is consistent. Otherwise, the update leads to an inconsistent state and should be rejected, using the global transaction manager. More sophisticated strategies that notify or compensate inconsistencies can also be devised.
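One possible realization of this strategy for the pretest of Example 5.26 is sketched below. It is a variant chosen for the illustration, not necessarily the exact message pattern intended above: each site storing a fragment of PROJ returns the PNO values of ASG+ that find a match locally, and the query master site derives the violating tuples as those matched at no site. Function names and data layout are hypothetical.

```python
def site_semijoin(proj_fragment, asg_plus):
    """Runs at a PROJ site: PNO values of ASG+ that match a local PROJ tuple."""
    local = {j["PNO"] for j in proj_fragment}
    return {n["PNO"] for n in asg_plus if n["PNO"] in local}


def enforce_insert_asg(asg_plus, proj_fragments):
    """Runs at the query master site for the pretest (ASG, INSERT, C)."""
    matched = set()
    for fragment in proj_fragments:          # one exchange per site storing PROJ
        matched |= site_semijoin(fragment, asg_plus)
    # Tuples of ASG+ where C is not true: no PROJ fragment holds their PNO.
    violators = [n for n in asg_plus if n["PNO"] not in matched]
    return violators                          # empty list: the update is accepted
```

Shipping only ASG+ and the matched key values keeps the transferred data proportional to the size of the update, not to the size of PROJ.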

Constraints involving aggregates.

These constraints are among the most costly to test because they require the calculation of the aggregate functions. The aggregate functions generally manipulated are MIN, MAX, SUM, and COUNT. Each aggregate function contains a projection part and a selection part. To enforce these constraints efficiently, it is possible to produce pretests that isolate redundant data which can be stored at each site storing the associated relation [Bernstein and Blaustein, 1982]. This data is what we called materialized views in Section 5.1.2.
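The benefit of storing redundant aggregate data can be sketched for the constraint of Example 5.19. The class below is a hypothetical helper, not the mechanism of Bernstein and Blaustein: it keeps SUM(DUR) per project next to ASG so that the pretest for an insertion reads only ASG+ and one stored total.

```python
from collections import defaultdict


class DurationTotals:
    """Redundant per-project SUM(DUR), maintained alongside ASG."""

    def __init__(self, asg):
        self.total = defaultdict(int)
        for g in asg:
            self.total[g["PNO"]] += g["DUR"]

    def pretest_insert(self, asg_plus, pno, limit):
        """Would SUM(DUR) for project pno stay below limit after inserting ASG+?"""
        added = sum(n["DUR"] for n in asg_plus if n["PNO"] == pno)
        return self.total[pno] + added < limit

    def apply_insert(self, asg_plus):
        """Keep the stored totals in step with an accepted insertion."""
        for n in asg_plus:
            self.total[n["PNO"]] += n["DUR"]
```

The pretest never scans ASG itself; the price is the extra work in apply_insert (and in the corresponding delete and modify paths, omitted here) to keep the totals current.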

5.3.2.3 Summary of Distributed Integrity Control

The main problem of distributed integrity control is that the communication and processing costs of enforcing distributed constraints can be prohibitive. The two main issues in designing a distributed integrity manager are the definition of the distributed assertions and of the enforcement algorithms, which minimize the cost of distributed integrity checking. We have shown in this chapter that distributed integrity control can be completely achieved by extending a preventive method based on the compilation of semantic integrity constraints into pretests. The method is general since all types of constraints expressed in first-order predicate logic can be handled. It is compatible with fragment definition and minimizes intersite communication. A better performance of distributed integrity enforcement can be obtained if fragments are defined carefully. Therefore, the specification of distributed integrity constraints is an important aspect of the distributed database design process.

The method described above assumes global transaction support. Without global transaction support, as in some loosely-coupled multidatabase systems, the problem is more difficult [Grefen and Widom, 1997]. First, the interface between the constraint manager and the component DBMS is different since constraint checking can no longer be part of the global transaction validation. Instead, the component DBMSs should notify the integrity manager to perform constraint checking after some events, e.g., as a result of local transactions’ commitments. This can be done using triggers whose events are updates to relations involved in global constraints. Second, if a global constraint violation is detected, since there is no way to specify global aborts, specific correcting transactions should be provided to produce global database states that are consistent. A family of protocols for global integrity checking has been proposed [Grefen and Widom, 1997]. The root of the family is a simple strategy, based on the computation of differential relations (as in the previous method), which is shown to be safe (correctly identifies constraint violations) but inaccurate (may raise an error even though there is no constraint violation). Inaccuracy is due to the fact that producing differential relations at different times at different sites may yield phantom states for the global database, i.e., states that never existed. Extensions of the basic protocol with either timestamping or using local transaction commands are proposed to solve that problem.

5.4 Conclusion

Semantic data and access control includes view management, security control, and semantic integrity control. In the relational framework, these functions can be uniformly achieved by enforcing rules that specify data manipulation control. Solutions initially designed for handling these functions in centralized systems have been significantly extended and enriched for distributed systems, in particular, support for materialized views and group-based discretionary access control. Semantic integrity control has received less attention and is generally not supported by distributed DBMS products.

Full semantic data control is more complex and costly in terms of performance in distributed systems. The two main issues for efficiently performing data control are the definition and storage of the rules (site selection) and the design of enforcement algorithms which minimize communication costs. The problem is difficult since increased functionality (and generality) tends to increase site communication. The problem is simplified if control rules are fully replicated at all sites and harder if site autonomy is to be preserved. In addition, specific optimizations can be done to minimize the cost of data control but with extra overhead such as managing materialized views or redundant data. Thus the specification of distributed data control must be included in the distributed database design so that the cost of control for update programs is also considered.

5.5 Bibliographic Notes

Semantic data control is well understood in centralized systems [Ramakrishnan and Gehrke, 2003] and all major DBMSs provide extensive support for it. Research on semantic data control in distributed systems started in the early 1980s with the R* project at IBM Research and has increased much since then to address new important applications such as data warehousing or data integration.

Most of the work on view management has concerned updates through views and support for materialized views. The two basic papers on centralized view management are [Chamberlin et al., 1975] and [Stonebraker, 1975]. The first reference presents an integrated solution for view and authorization management in System R. The second reference describes INGRES’s query modification technique for uniformly handling views, authorizations, and semantic integrity control. This method was presented in Section 5.1.

Theoretical solutions to the problem of view updates are given in [Bancilhon and Spyratos, 1981; Dayal and Bernstein, 1978], and [Keller, 1982]. The first of these is the seminal paper on view update semantics [Bancilhon and Spyratos, 1981] where the authors formalize the view invariance property after updating, and show how a large class of views including joins can be updated. Semantic information about the base relations is particularly useful for finding unique propagation of updates. However, the current commercial systems are very restrictive in supporting updates through views.

Materialized views have received much attention. The notion of snapshot for optimizing view derivation in distributed database systems is due to [Adiba and Lindsay, 1980]. Adiba [1981] generalizes the notion of snapshot by that of derived relation in a distributed context. He also proposes a unified mechanism for managing views and snapshots, as well as fragmented and replicated data. Gupta and Mumick [1999c] have edited a thorough collection of papers on materialized view management. In [Gupta and Mumick, 1999a], they describe the main techniques to perform incremental maintenance of materialized views. The counting algorithm which we presented in Section 5.1.3 has been proposed in [Gupta et al., 1993].

Security in computer systems in general is presented in [Hoffman, 1977]. Security in centralized database systems is presented in [Lunt and Fernández, 1990; Castano et al., 1995]. Discretionary access control in distributed systems first received much attention in the context of the R* project. The access control mechanism of System R [Griffiths and Wade, 1976] is extended in [Wilms and Lindsay, 1981] to handle groups of users and to run in a distributed environment. Multilevel access control for distributed DBMS has recently gained much interest. The seminal work on multilevel access control is the Bell and LaPadula model, originally designed for operating system security [Bell and Lapuda, 1976]. Multilevel access control for databases is described in [Lunt and Fernández, 1990; Jajodia and Sandhu, 1991]. A good introduction to multilevel security in relational DBMS can be found in [Rjaibi, 2004]. Transaction management in multilevel secure DBMS is addressed in [Ray et al., 2000; Jajodia et al., 2001]. Extensions of multilevel access control for distributed DBMS are proposed in [Thuraisingham, 2001].

The content of Section 5.3 comes largely from the work on semantic integrity control described in [Simon and Valduriez, 1984, 1986] and [Simon and Valduriez, 1987]. In particular, [Simon and Valduriez, 1986] extends a preventive strategy for centralized integrity control based on pretests to run in a distributed environment, assuming global transaction support. The initial idea of declarative methods, that is, to use assertions of predicate logic to specify integrity constraints, is due to [Florentin, 1974]. The most important declarative methods are in [Bernstein et al., 1980a; Blaustein, 1981; Nicolas, 1982; Simon and Valduriez, 1984], and [Stonebraker, 1975]. The notion of concrete views for storing redundant data is described in [Bernstein and Blaustein, 1982]. Note that concrete views are useful in optimizing the enforcement of constraints involving aggregates. [Civelek et al., 1988; Sheth et al., 1988b] and Sheth et al. [1988a] describe systems and tools for semantic data control, particularly view management. Semantic integrity checking in loosely-coupled multidatabase systems without global transaction support is addressed in [Grefen and Widom, 1997].

Exercises

Problem 5.1. Define in SQL-like syntax a view of the engineering database V(ENO, ENAME, PNO, RESP), where the duration is 24. Is view V updatable? Assume that relations EMP and ASG are horizontally fragmented based on access frequencies as follows:

Site 1 Site 2 Site 3 EMP1 EMP2

ASG1 ASG2

where

EMP1 = σTITLE≠“Engineer”(EMP)

EMP2 = σTITLE=“Engineer”(EMP)

ASG1 = σ0<DUR<36(ASG)

ASG2 = σDUR≥36(ASG)

At which site(s) should the definition of V be stored without being fully replicated, to increase locality of reference?

Problem 5.2. Express the following query: names of employees in view V who work on the CAD project.

Problem 5.3 (*). Assume that relation PROJ is horizontally fragmented as


PROJ1 = σPNAME=“CAD”(PROJ)

PROJ2 = σPNAME≠“CAD”(PROJ)

Modify the query obtained in Exercise 5.2 to a query expressed on the fragments.

Problem 5.4 (**). Propose a distributed algorithm to efficiently refresh a snapshot at one site derived by projection from a relation horizontally fragmented at two other sites. Give an example query on the view and base relations which produces an inconsistent result.

Problem 5.5 (*). Consider the view EG of Example 5.5 which uses relations EMP and ASG as base data and assume its state is derived from that of Example 3.1, so that EG has 9 tuples (see Figure 5.4). Assume that tuple 〈E3, P3, Consultant, 10〉 from ASG is updated to 〈E3, P3, Engineer, 10〉. Apply the basic counting algorithm for refreshing the view EG. What projected attributes should be added to view EG to make it self-maintainable?

Problem 5.6. Propose a relation schema for storing the access rights associated with user groups in a distributed database catalog, and give a fragmentation scheme for that relation, assuming that all members of a group are at the same site.

Problem 5.7 (**). Give an algorithm for executing the REVOKE statement in a distributed DBMS, assuming that the GRANT privilege can be granted only to a group of users where all its members are at the same site.

Problem 5.8 (**). Consider the multilevel relation PROJ** in Figure 5.8. Assuming that there are only two classification levels for attributes (S and C), propose an allocation of PROJ** on two sites using fragmentation and replication that avoids covert channels on read queries. Discuss the constraints on updates for this allocation to work.

Problem 5.9. Using the integrity constraint specification language of this chapter, express an integrity constraint which states that the duration spent in a project cannot exceed 48 months.

Problem 5.10 (*). Define the pretests associated with integrity constraints covered in Examples 5.11 to 5.14.

Problem 5.11. Assume the following vertical fragmentation of relations EMP, ASG and PROJ:

Site 1 Site 2 Site 3 Site 4 EMP1 EMP2

PROJ1 PROJ2 ASG1 ASG2

where

EMP1 = ΠENO,ENAME(EMP)
EMP2 = ΠENO,TITLE(EMP)
PROJ1 = ΠPNO,PNAME(PROJ)
PROJ2 = ΠPNO,BUDGET(PROJ)
ASG1 = ΠENO,PNO,RESP(ASG)
ASG2 = ΠENO,PNO,DUR(ASG)

Where should the pretests obtained in Exercise 5.9 be stored?

Problem 5.12 (**). Consider the following set-oriented constraint:

CHECK ON e:EMP, a:ASG (e.ENO = a.ENO and (e.TITLE = "Programmer") IF a.RESP = "Programmer")

What does it mean? Assuming that EMP and ASG are allocated as in the previous exercise, define the corresponding pretests and their storage. Apply algorithm ENFORCE for an update of type INSERT in ASG.

Problem 5.13 (**). Assume a distributed multidatabase system with no global transaction support. Assume also that there are two sites, each with a (different) EMP relation and an integrity manager that communicates with the component DBMS. Suppose that we want to have a global unique key constraint on EMP. Propose a simple strategy using differential relations to check this constraint. Discuss the possible actions when a constraint is violated.

Chapter 6 Overview of Query Processing

The success of relational database technology in data processing is due, in part, to the availability of non-procedural languages (i.e., SQL), which can significantly improve application development and end-user productivity. By hiding the low-level details about the physical organization of the data, relational database languages allow the expression of complex queries in a concise and simple fashion. In particular, to construct the answer to the query, the user does not precisely specify the procedure to follow. This procedure is actually devised by a DBMS module, usually called a query processor. This relieves the user from query optimization, a time-consuming task that is best handled by the query processor, since it can exploit a large amount of useful information about the data.

Because it is a critical performance issue, query processing has received (and continues to receive) considerable attention in the context of both centralized and distributed DBMSs. However, the query processing problem is much more difficult in distributed environments than in centralized ones, because a larger number of parameters affect the performance of distributed queries. In particular, the relations involved in a distributed query may be fragmented and/or replicated, thereby inducing communication overhead costs. Furthermore, with many sites to access, query response time may become very high.

In this chapter we give an overview of query processing in distributed DBMSs, leaving the details of the important aspects of distributed query processing to the next two chapters. The context chosen is that of relational calculus and relational algebra, because of their generality and wide use in distributed DBMSs. As we saw in Chapter 3, distributed relations are implemented by fragments. Distributed database design is of major importance for query processing since the definition of fragments is based on the objective of increasing reference locality, and sometimes parallel execution for the most important queries. The role of a distributed query processor is to map a high-level query (assumed to be expressed in relational calculus) on a distributed database (i.e., a set of global relations) into a sequence of database operators (of relational algebra) on relation fragments. Several important functions characterize this mapping. First, the calculus query must be decomposed into a sequence of relational operators called an algebraic query. Second, the data accessed by the query must be localized so that the operators on relations are translated to bear on local data (fragments). Finally, the algebraic query on fragments must be extended with communication operators and optimized with respect to a cost function to be minimized. This cost function typically refers to computing resources such as disk I/Os, CPUs, and communication networks.

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_6, © Springer Science+Business Media, LLC 2011

The chapter is organized as follows. In Section 6.1 we illustrate the query processing problem. In Section 6.2 we define precisely the objectives of query processing algorithms. The complexity of relational algebra operators, which affects mainly the performance of query processing, is given in Section 6.3. In Section 6.4 we provide a characterization of query processors based on their implementation choices. Finally, in Section 6.5 we introduce the different layers of query processing starting from a distributed query down to the execution of operators on local sites and communication between sites. The layers introduced in Section 6.5 are described in detail in the next two chapters.

6.1 Query Processing Problem

The main function of a relational query processor is to transform a high-level query (typically, in relational calculus) into an equivalent lower-level query (typically, in some variation of relational algebra). The low-level query actually implements the execution strategy for the query. The transformation must achieve both correctness and efficiency. It is correct if the low-level query has the same semantics as the original query, that is, if both queries produce the same result. The well-defined mapping from relational calculus to relational algebra (see Chapter 2) makes the correctness issue easy. But producing an efficient execution strategy is more involved. A relational calculus query may have many equivalent and correct transformations into relational algebra. Since each equivalent execution strategy can lead to very different consumptions of computer resources, the main difficulty is to select the execution strategy that minimizes resource consumption.

Example 6.1. We consider the following subset of the engineering database schema given in Figure 2.3:

EMP(ENO, ENAME, TITLE) ASG(ENO, PNO, RESP, DUR)

and the following simple user query:

“Find the names of employees who are managing a project”

The expression of the query in relational calculus using the SQL syntax is

SELECT ENAME
FROM   EMP, ASG
WHERE  EMP.ENO = ASG.ENO
AND    RESP = "Manager"

Two equivalent relational algebra queries that are correct transformations of the query above are

ΠENAME(σRESP=“Manager”∧EMP.ENO=ASG.ENO (EMP × ASG))

and

ΠENAME(EMP ⋈ENO (σRESP=“Manager” (ASG)))

It is intuitively obvious that the second query, which avoids the Cartesian product of EMP and ASG, consumes far fewer computing resources than the first, and thus should be retained.
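The difference between the two algebraic queries of Example 6.1 can be made concrete with a small sketch. The tuples below are invented for illustration, and counting "tuples examined" is only a rough stand-in for resource consumption, not the cost model used later in the chapter.

```python
# Toy instance of EMP(ENO, ENAME, TITLE) and ASG(ENO, PNO, RESP, DUR).
# All tuple values are hypothetical.
EMP = [("E1", "J. Doe", "Elect. Eng."),
       ("E2", "M. Smith", "Syst. Anal."),
       ("E3", "A. Lee", "Mech. Eng.")]
ASG = [("E1", "P1", "Manager", 12),
       ("E2", "P1", "Analyst", 24),
       ("E2", "P2", "Analyst", 6),
       ("E3", "P3", "Consultant", 10)]


def strategy_product(emp, asg):
    """First query: select and project over the Cartesian product EMP x ASG."""
    examined = 0
    names = set()
    for e in emp:
        for a in asg:
            examined += 1  # every pair of the product is inspected
            if a[2] == "Manager" and e[0] == a[0]:
                names.add(e[1])
    return names, examined


def strategy_select_then_join(emp, asg):
    """Second query: select the managers in ASG first, then join with EMP."""
    examined = 0
    managers = []
    for a in asg:
        examined += 1  # one pass over ASG for the selection
        if a[2] == "Manager":
            managers.append(a)
    names = set()
    for e in emp:
        for a in managers:
            examined += 1  # join only against the reduced relation
            if e[0] == a[0]:
                names.add(e[1])
    return names, examined
```

On this instance both functions return the same set of names, while the first inspects 12 pairs and the second inspects 7 tuples or pairs; the gap widens quickly as the relations grow.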

In a centralized context, query execution strategies can be well expressed in an extension of relational algebra. The main role of a centralized query processor is to choose, for a given query, the best relational algebra query among all equivalent ones. Since the problem is computationally intractable with a large number of relations [Ibaraki and Kameda, 1984], it is generally reduced to choosing a solution close to the optimum.

In a distributed system, relational algebra is not enough to express execution strategies. It must be supplemented with operators for exchanging data between sites. Besides the choice of ordering relational algebra operators, the distributed query processor must also select the best sites to process data, and possibly the way data should be transformed. This increases the solution space from which to choose the distributed execution strategy, making distributed query processing significantly more difficult.

Example 6.2. This example illustrates the importance of site selection and communication for a chosen relational algebra query against a fragmented database. We consider the following query of Example 6.1:

ΠENAME (EMP ⋈ENO (σRESP=“Manager” (ASG)))

We assume that relations EMP and ASG are horizontally fragmented as follows:

EMP1 = σENO≤“E3”(EMP)
EMP2 = σENO>“E3”(EMP)
ASG1 = σENO≤“E3”(ASG)
ASG2 = σENO>“E3”(ASG)

Fragments ASG1, ASG2, EMP1, and EMP2 are stored at sites 1, 2, 3, and 4, respectively, and the result is expected at site 5.

For the sake of pedagogical simplicity, we ignore the project operator in the following. Two equivalent distributed execution strategies for the above query are shown in Figure 6.1. An arrow from site i to site j labeled with R indicates that relation R is transferred from site i to site j. Strategy A exploits the fact that relations EMP and ASG are fragmented the same way in order to perform the select and join operators in parallel. Strategy B centralizes all the operand data at the result site before processing the query.

[Figure 6.1 has two panels.
(a) Strategy A: site 1 computes ASG′1 = σRESP=“Manager”(ASG1) and sends it to site 3; site 2 computes ASG′2 = σRESP=“Manager”(ASG2) and sends it to site 4; site 3 computes EMP′1 = EMP1 ⋈ENO ASG′1 and site 4 computes EMP′2 = EMP2 ⋈ENO ASG′2, each sending its result to site 5; site 5 computes result = EMP′1 ∪ EMP′2.
(b) Strategy B: sites 1, 2, 3, and 4 send ASG1, ASG2, EMP1, and EMP2 to site 5, which computes result = (EMP1 ∪ EMP2) ⋈ENO σRESP=“Manager”(ASG1 ∪ ASG2).]

Fig. 6.1 Equivalent Distributed Execution Strategies

To evaluate the resource consumption of these two strategies, we use a simple cost model. We assume that a tuple access, denoted by tupacc, is 1 unit (which we leave unspecified) and a tuple transfer, denoted tuptrans, is 10 units. We assume that relations EMP and ASG have 400 and 1000 tuples, respectively, and that there are 20 managers in relation ASG. We also assume that data is uniformly distributed among sites. Finally, we assume that relations ASG and EMP are locally clustered on attributes RESP and ENO, respectively. Therefore, there is direct access to tuples of ASG (respectively, EMP) based on the value of attribute RESP (respectively, ENO).

The total cost of strategy A can be derived as follows:

1. Produce ASG′ by selecting ASG requires (10+10) ∗ tupacc = 20
2. Transfer ASG′ to the sites of EMP requires (10+10) ∗ tuptrans = 200
3. Produce EMP′ by joining ASG′ and EMP requires (10+10) ∗ tupacc ∗ 2 = 40
4. Transfer EMP′ to result site requires (10+10) ∗ tuptrans = 200

The total cost is 460

The cost of strategy B can be derived as follows:

1. Transfer EMP to site 5 requires 400 ∗ tuptrans = 4,000
2. Transfer ASG to site 5 requires 1000 ∗ tuptrans = 10,000
3. Produce ASG′ by selecting ASG requires 1000 ∗ tupacc = 1,000
4. Join EMP and ASG′ requires 400 ∗ 20 ∗ tupacc = 8,000

The total cost is 23,000

In strategy A, the join of ASG′ and EMP (step 3) can exploit the clustered index on ENO of EMP. Thus, EMP is accessed only once for each tuple of ASG′. In strategy B, we assume that the access methods to relations EMP and ASG based on attributes RESP and ENO are lost because of data transfer. This is a reasonable assumption in practice. We assume that the join of EMP and ASG′ in step 4 is done by the default nested loop algorithm (that simply performs the Cartesian product of the two input relations). Strategy A is better by a factor of 50 (23,000/460), which is quite significant. Furthermore, it provides better distribution of work among sites. The difference would be even higher if we assumed slower communication and/or higher degree of fragmentation.
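The arithmetic of Example 6.2 can be replayed with a short sketch. The constants come from the example; the function and variable names are illustrative only.

```python
# Cost model parameters stated in Example 6.2.
TUPACC = 1        # cost of one tuple access
TUPTRANS = 10     # cost of one tuple transfer
EMP_CARD = 400    # tuples in EMP
ASG_CARD = 1000   # tuples in ASG
MANAGERS = 20     # ASG tuples with RESP = "Manager"
FRAGMENTS = 2     # fragments per relation, data uniformly distributed


def cost_strategy_a():
    per_site = MANAGERS // FRAGMENTS                  # 10 managers per ASG fragment
    select = (per_site + per_site) * TUPACC           # clustered access on RESP
    ship_asg = (per_site + per_site) * TUPTRANS       # ASG' to the sites of EMP
    join = (per_site + per_site) * TUPACC * 2         # one EMP access per ASG' tuple
    ship_emp = (per_site + per_site) * TUPTRANS       # EMP' to the result site
    return select + ship_asg + join + ship_emp


def cost_strategy_b():
    ship_emp = EMP_CARD * TUPTRANS                    # all of EMP to site 5
    ship_asg = ASG_CARD * TUPTRANS                    # all of ASG to site 5
    select = ASG_CARD * TUPACC                        # scan: access method lost
    join = EMP_CARD * MANAGERS * TUPACC               # nested loop over EMP x ASG'
    return ship_emp + ship_asg + select + join
```

Calling `cost_strategy_a()` gives 460 and `cost_strategy_b()` gives 23,000, which reproduces the factor of 50. Changing `TUPTRANS` shows how sensitive the comparison is to the speed of the network.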

6.2 Objectives of Query Processing

As stated before, the objective of query processing in a distributed context is to transform a high-level query on a distributed database, which is seen as a single database by the users, into an efficient execution strategy expressed in a low-level language on local databases. We assume that the high-level language is relational calculus, while the low-level language is an extension of relational algebra with communication operators. The different layers involved in the query transformation are detailed in Section 6.5. An important aspect of query processing is query optimization. Because many execution strategies are correct transformations of the same high-level query, the one that optimizes (minimizes) resource consumption should be retained.

A good measure of resource consumption is the total cost that will be incurred in processing the query [Sacco and Yao, 1982]. Total cost is the sum of all times incurred in processing the operators of the query at various sites and in intersite communication. Another good measure is the response time of the query [Epstein et al., 1978], which is the time elapsed for executing the query. Since operators can be executed in parallel at different sites, the response time of a query may be significantly less than its total cost.
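A minimal sketch of the distinction between the two measures, applied to Strategy A of Example 6.2. It assumes, beyond what the text states, that the four steps run one after the other while the two branches within each step run fully in parallel, and that cost units can be read as time units.

```python
# Each inner list holds the costs of the branches that run in parallel
# during one step of Strategy A (select, ship ASG', join, ship EMP').
STRATEGY_A = [[10, 10], [100, 100], [20, 20], [100, 100]]


def total_cost(stages):
    """Sum of the work done at all sites and on all links."""
    return sum(sum(stage) for stage in stages)


def response_time(stages):
    """Elapsed time: each stage lasts as long as its slowest branch."""
    return sum(max(stage) for stage in stages)
```

Under these assumptions the total cost is 460 while the response time is 230, since each step finishes as soon as its two parallel branches do.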

In a distributed database system, the total cost to be minimized includes CPU, I/O, and communication costs. The CPU cost is incurred when performing operators on data in main memory. The I/O cost is the time necessary for disk accesses. This cost can be minimized by reducing the number of disk accesses through fast access methods to the data and efficient use of main memory (buffer management). The communication cost is the time needed for exchanging data between sites participating in the execution of the query. This cost is incurred in processing the messages (formatting/deformatting), and in transmitting the data on the communication network.

The first two cost components (I/O and CPU cost) are the only factors considered by centralized DBMSs. The communication cost component is an equally important factor in distributed databases. Most of the early proposals for distributed query optimization assume that the communication cost largely dominates local processing cost (I/O and CPU cost), and thus ignore the latter. This assumption is based on very slow communication networks (e.g., wide area networks that used to have a bandwidth of a few kilobytes per second) rather than on networks with bandwidths that are comparable to disk connection bandwidth. Under this assumption, the aim of distributed query optimization reduces to the problem of minimizing communication costs, generally at the expense of local processing. The advantage is that local optimization can be done independently using the known methods for centralized systems. However, modern distributed processing environments have much faster communication networks, as discussed in Chapter 2, whose bandwidth is comparable to that of disks. Therefore, more recent research efforts consider a weighted combination of these three cost components since they all contribute significantly to the total cost of evaluating a query¹ [Page and Popek, 1985]. Nevertheless, in distributed environments with high bandwidths, the overhead cost incurred for communication between sites (e.g., software protocols) makes communication cost still an important factor.

6.3 Complexity of Relational Algebra Operations

In this chapter we consider relational algebra as a basis to express the output of query processing. Therefore, the complexity of relational algebra operators, which directly affects their execution time, dictates some principles useful to a query processor. These principles can help in choosing the final execution strategy.

The simplest way of defining complexity is in terms of relation cardinalities independent of physical implementation details such as fragmentation and storage structures. Figure 6.2 shows the complexity of unary and binary operators in the order of increasing complexity, and thus of increasing execution time. Complexity is O(n) for unary operators, where n denotes the relation cardinality, if the resulting tuples may be obtained independently of each other. Complexity is O(n ∗ log n) for binary operators if each tuple of one relation must be compared with each tuple of the other on the basis of the equality of selected attributes. This complexity assumes that tuples of each relation must be sorted on the comparison attributes. However, using hashing and enough memory to hold one hashed relation can reduce the complexity of binary operators to O(n) [Bratbergsengen, 1984]. Projects with duplicate elimination and grouping operators require that each tuple of the relation be compared with each other tuple, and thus also have O(n ∗ log n) complexity. Finally, complexity is O(n²) for the Cartesian product of two relations because each tuple of one relation must be combined with each tuple of the other.

¹ There are some studies that investigate the feasibility of retrieving data from a neighboring node's main memory cache rather than accessing them from a local disk [Franklin et al., 1992; Dahlin et al., 1994; Freeley et al., 1995]. These approaches would have a significant impact on query optimization.

Operation                                  Complexity
Select                                     O(n)
Project (without duplicate elimination)    O(n)
Project (with duplicate elimination)       O(n ∗ log n)
Group by                                   O(n ∗ log n)
Join                                       O(n ∗ log n)
Semijoin                                   O(n ∗ log n)
Division                                   O(n ∗ log n)
Set Operators                              O(n ∗ log n)
Cartesian Product                          O(n²)

Fig. 6.2 Complexity of Relational Algebra Operations

This simple look at operator complexity suggests two principles. First, because complexity is relative to relation cardinalities, the most selective operators that reduce cardinalities (e.g., selection) should be performed first. Second, operators should be ordered by increasing complexity so that Cartesian products can be avoided or delayed.
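The gap between a product-like evaluation and a hash-based one can be illustrated with a sketch that counts the work done by each. This is a generic illustration, assuming one relation fits in an in-memory hash table; it is not the algorithm of any particular system, and the function names are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(r, s, key_r, key_s):
    """Compare every tuple of r with every tuple of s (product-like, O(n^2))."""
    comparisons = 0
    out = []
    for t in r:
        for u in s:
            comparisons += 1
            if t[key_r] == u[key_s]:
                out.append(t + u)
    return out, comparisons


def hash_join(r, s, key_r, key_s):
    """Build a hash table on r, then probe it once per tuple of s."""
    buckets = defaultdict(list)
    for t in r:
        buckets[t[key_r]].append(t)
    probes = 0
    out = []
    for u in s:
        probes += 1
        for t in buckets.get(u[key_s], []):
            out.append(t + u)
    return out, probes
```

With 100 tuples in `r` and 200 in `s`, the nested loop performs 20,000 comparisons while the hash join performs 200 probes, and both return the same set of joined tuples (in a different order).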

6.4 Characterization of Query Processors

It is quite difficult to evaluate and compare query processors in the context of both centralized systems [Jarke and Koch, 1984] and distributed systems [Sacco and Yao, 1982; Apers et al., 1983; Kossmann, 2000] because they may differ in many aspects. In what follows, we list important characteristics of query processors that can be used as a basis for comparison. The first four characteristics hold for both centralized and distributed query processors while the next four characteristics are particular to distributed query processors in tightly-integrated distributed DBMSs. This characterization is used in Chapter 8 to compare various algorithms.

6.4.1 Languages

Initially, most work on query processing was done in the context of relational DBMSs because their high-level languages give the system many opportunities for optimization. The input language to the query processor is thus based on relational calculus. With object DBMSs, the language is based on object calculus which is merely an extension of relational calculus. Thus, decomposition to object algebra is also needed (see Chapter 15). XML, another data model that we consider in this book, has its own languages, primarily XQuery and XPath. Their execution requires special care that we discuss in Chapter 17.

When the input language is relational calculus, an additional phase is required to decompose the query into relational algebra. In a distributed context, the output language is generally some internal form of relational algebra augmented with communication primitives. The operators of the output language are implemented directly in the system. Query processing must perform efficient mapping from the input language to the output language.

6.4.2 Types of Optimization

Conceptually, query optimization aims at choosing the “best” point in the solution space of all possible execution strategies. An immediate method for query optimization is to search the solution space, exhaustively predict the cost of each strategy, and select the strategy with minimum cost. Although this method is effective in selecting the best strategy, it may incur a significant processing cost for the optimization itself. The problem is that the solution space can be large; that is, there may be many equivalent strategies, even with a small number of relations. The problem becomes worse as the number of relations or fragments increases (e.g., becomes greater than 5 or 6). Having high optimization cost is not necessarily bad, particularly if query optimization is done once for many subsequent executions of the query. Therefore, an “exhaustive” search approach is often used whereby (almost) all possible execution strategies are considered [Selinger et al., 1979].
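A minimal sketch of exhaustive search over join orders follows. The cost model (sum of intermediate result sizes of a left-deep plan), the cardinality of PROJ, and the join selectivity factors are assumptions made for illustration; they are not taken from the text.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical statistics: relation cardinalities and join selectivity factors.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SELECTIVITY = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}


def chain_cost(order, card, selectivity):
    """Toy cost: sum of the sizes of all intermediate results of a left-deep plan."""
    joined = [order[0]]
    size = card[order[0]]
    cost = 0
    for rel in order[1:]:
        size *= card[rel]
        for prev in joined:
            # Pairs without a join predicate default to 1 (Cartesian product).
            size *= selectivity.get(frozenset((prev, rel)), 1)
        joined.append(rel)
        cost += size
    return cost


def exhaustive_search(card, selectivity):
    """Evaluate every permutation of the relations and keep the cheapest."""
    best = min(permutations(card), key=lambda o: chain_cost(o, card, selectivity))
    return best, chain_cost(best, card, selectivity)
```

For three relations only 6 orders are examined, but the count grows as n!, which is why the approach becomes costly beyond a handful of relations. On these statistics, any order that joins EMP with PROJ first (a Cartesian product) costs 41,000, against 2,000 for the best orders.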

To avoid the high cost of exhaustive search, randomized strategies, such as iterative improvement [Swami, 1989] and simulated annealing [Ioannidis and Wong, 1987] have been proposed. They try to find a very good solution, not necessarily the best one, but avoid the high cost of optimization, in terms of memory and time consumption.
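The general shape of iterative improvement can be sketched as repeated local descent from several starting plans. The version below is a simplified illustration, not the algorithm of the cited papers: the neighborhood (swapping two relations in a left-deep order), the toy cost function, and the cardinalities are all assumptions.

```python
import random

CARD = {"R1": 1000, "R2": 10, "R3": 200, "R4": 50}  # hypothetical cardinalities


def plan_cost(order):
    """Toy cost: total size of the intermediate results when relations are
    combined in the given order (join predicates are not modeled)."""
    size, cost = CARD[order[0]], 0
    for rel in order[1:]:
        size *= CARD[rel]
        cost += size
    return cost


def iterative_improvement(start, cost, rng, restarts=5, steps=200):
    """Run several local descents and keep the best local minimum found."""
    best, best_cost = None, None
    for attempt in range(restarts):
        current = list(start)
        if attempt > 0:
            rng.shuffle(current)          # later runs begin from random plans
        current_cost = cost(current)
        for _ in range(steps):
            i, j = rng.sample(range(len(current)), 2)
            neighbor = current[:]
            neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:   # accept only downhill moves
                current, current_cost = neighbor, neighbor_cost
        if best is None or current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost
```

A call such as `iterative_improvement(["R1", "R2", "R3", "R4"], plan_cost, random.Random(0))` never returns a plan costlier than the starting one, but it offers no guarantee of reaching the global optimum; simulated annealing differs mainly in occasionally accepting uphill moves.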

Another popular way of reducing the cost of exhaustive search is the use of heuristics, whose effect is to restrict the solution space so that only a few strategies are considered. In both centralized and distributed systems, a common heuristic is to minimize the size of intermediate relations. This can be done by performing unary operators first, and ordering the binary operators by the increasing sizes of their intermediate relations. An important heuristic in distributed systems is to replace join operators by combinations of semijoins to minimize data communication.

6.4.3 Optimization Timing

A query may be optimized at different times relative to the actual time of query execution. Optimization can be done statically before executing the query or dynamically as the query is executed. Static query optimization is done at query compilation time. Thus the cost of optimization may be amortized over multiple query executions. Therefore, this timing is appropriate for use with the exhaustive search method. Since the sizes of the intermediate relations of a strategy are not known until run time, they must be estimated using database statistics. Errors in these estimates can lead to the choice of suboptimal strategies.

Dynamic query optimization proceeds at query execution time. At any point of execution, the choice of the best next operator can be based on accurate knowledge of the results of the operators executed previously. Therefore, database statistics are not needed to estimate the size of intermediate results. However, they may still be useful in choosing the first operators. The main advantage over static query optimization is that the actual sizes of intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice. The main shortcoming is that query optimization, an expensive task, must be repeated for each execution of the query. Therefore, this approach is best for ad-hoc queries.

Hybrid query optimization attempts to provide the advantages of static query optimization while avoiding the issues generated by inaccurate estimates. The approach is basically static, but dynamic query optimization may take place at run time when a high difference between predicted sizes and actual size of intermediate relations is detected.
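The run-time check at the heart of the hybrid approach can be sketched as a comparison between the predicted and the observed size of an intermediate relation. The relative-error criterion and the default tolerance below are hypothetical choices, since the text does not say how "a high difference" is measured.

```python
def needs_reoptimization(predicted_size, actual_size, tolerance=0.5):
    """Return True when the relative estimation error exceeds the tolerance.

    The tolerance of 0.5 (50 percent) is an arbitrary illustrative value.
    """
    if predicted_size <= 0:
        # No usable estimate: any non-empty result counts as a surprise.
        return actual_size > 0
    return abs(actual_size - predicted_size) / predicted_size > tolerance
```

For instance, if the static plan assumed 20 tuples in an intermediate relation, observing 25 would leave the plan untouched, whereas observing 400 would trigger dynamic optimization of the remaining operators.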

6.4.4 Statistics

The effectiveness of query optimization relies on statistics on the database. Dynamic query optimization requires statistics in order to choose which operators should be done first. Static query optimization is even more demanding since the size of intermediate relations must also be estimated based on statistical information. In a distributed database, statistics for query optimization typically bear on fragments, and include fragment cardinality and size as well as the size and number of distinct values of each attribute. To minimize the probability of error, more detailed statistics such as histograms of attribute values are sometimes used at the expense of higher management cost. The accuracy of statistics is achieved by periodic updating. With static optimization, significant changes in statistics used to optimize a query might result in query reoptimization.
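As a sketch of how such statistics feed size estimation, the functions below use two common textbook estimates that assume uniformly distributed and independent attribute values. They are not taken from this section, and the figure of 50 distinct RESP values is invented so that the numbers line up with Example 6.2.

```python
def estimate_selection(card, distinct_values):
    """Estimated size of an equality selection: card / number of distinct values."""
    return card / distinct_values


def estimate_join(card_r, card_s, distinct_r, distinct_s):
    """Estimated size of an equijoin on attribute A:
    card(R) * card(S) / max(distinct A-values in R, distinct A-values in S)."""
    return card_r * card_s / max(distinct_r, distinct_s)
```

With 1000 tuples in ASG and an assumed 50 distinct values of RESP, `estimate_selection(1000, 50)` yields 20 managers. Joining those 20 tuples with the 400 tuples of EMP on the key ENO, `estimate_join(400, 20, 400, 20)` again yields 20. When the uniformity assumption fails, histograms give better estimates at a higher maintenance cost.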

6.4.5 Decision Sites

When static optimization is used, either a single site or several sites may participate in the selection of the strategy to be applied for answering the query. Most systems use the centralized decision approach, in which a single site generates the strategy. However, the decision process could be distributed among various sites participating in the elaboration of the best strategy. The centralized approach is simpler but requires knowledge of the entire distributed database, while the distributed approach requires only local information. Hybrid approaches where one site makes the major decisions and other sites can make local decisions are also frequent. For example, System R* [Williams et al., 1982] uses a hybrid approach.

6.4.6 Exploitation of the Network Topology

The network topology is generally exploited by the distributed query processor. With wide area networks, the cost function to be minimized can be restricted to the data communication cost, which is considered to be the dominant factor. This assumption greatly simplifies distributed query optimization, which can be divided into two separate problems: selection of the global execution strategy, based on intersite communication, and selection of each local execution strategy, based on a centralized query processing algorithm.

With local area networks, communication costs are comparable to I/O costs. Therefore, it is reasonable for the distributed query processor to increase parallel execution at the expense of communication cost. The broadcasting capability of some local area networks can be exploited successfully to optimize the processing of join operators [Özsoyoglu and Zhou, 1987; Wah and Lien, 1985]. Other algorithms specialized to take advantage of the network topology are discussed by Kerschberg et al. [1982] for star networks and by LaChimia [1984] for satellite networks.

In a client-server environment, the power of the client workstation can be exploited to perform database operators using data shipping [Franklin et al., 1996]. The optimization problem then becomes deciding which part of the query should be performed on the client and which part on the server using query shipping.


6.4.7 Exploitation of Replicated Fragments

A distributed relation is usually divided into relation fragments as described in Chapter 3. Distributed queries expressed on global relations are mapped into queries on physical fragments of relations by translating relations into fragments. We call this process localization because its main function is to localize the data involved in the query. For higher reliability and better read performance, it is useful to have fragments replicated at different sites. Most optimization algorithms consider the localization process independently of optimization. However, some algorithms exploit the existence of replicated fragments at run time in order to minimize communication times. The optimization algorithm is then more complex because there are a larger number of possible strategies.

6.4.8 Use of Semijoins

The semijoin operator has the important property of reducing the size of the operand relation. When the main cost component considered by the query processor is communication, a semijoin is particularly useful for improving the processing of distributed join operators as it reduces the size of data exchanged between sites. However, using semijoins may result in an increase in the number of messages and in the local processing time. The early distributed DBMSs, such as SDD-1 [Bernstein et al., 1981], which were designed for slow wide area networks, make extensive use of semijoins. Some later systems, such as R* [Williams et al., 1982], assume faster networks and do not employ semijoins. Rather, they perform joins directly since using joins leads to lower local processing costs. Nevertheless, semijoins are still beneficial in the context of fast networks when they induce a strong reduction of the join operand. Therefore, some query processing algorithms aim at selecting an optimal combination of joins and semijoins [Özsoyoglu and Zhou, 1987; Wah and Lien, 1985].
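The trade-off can be sketched by counting the attribute values that cross the network under each approach. The setting is an assumption made for illustration: relation R resides at the site where the join result is needed, relation S resides at another site, and the unit of transfer is a single attribute value.

```python
def ship_whole_relation(r, s, key_r, key_s):
    """Join at R's site after shipping all of S there.

    Returns (join result, number of attribute values transferred).
    """
    transferred = sum(len(u) for u in s)
    result = [t + u for t in r for u in s if t[key_r] == u[key_s]]
    return result, transferred


def semijoin_program(r, s, key_r, key_s):
    """Ship the join column of R to S's site, reduce S there, ship it back."""
    join_values = {t[key_r] for t in r}                      # projection of R
    reduced_s = [u for u in s if u[key_s] in join_values]    # S semijoin R
    transferred = len(join_values) + sum(len(u) for u in reduced_s)
    result = [t + u for t in r for u in reduced_s if t[key_r] == u[key_s]]
    return result, transferred
```

With two tuples in R matching two of six three-attribute tuples in S, the semijoin program moves 8 values instead of 18. If every tuple of S has a match in R, the semijoin reduces nothing and the extra message with the join column makes it the more expensive option (24 values instead of 18).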

6.5 Layers of Query Processing

In Chapter 1 we have seen where query processing fits within the distributed DBMS architecture. The problem of query processing can itself be decomposed into several subproblems, corresponding to various layers. In Figure 6.3 a generic layering scheme for query processing is shown where each layer solves a well-defined subproblem. To simplify the discussion, let us assume a static and semicentralized query processor that does not exploit replicated fragments. The input is a query on global data expressed in relational calculus. This query is posed on global (distributed) relations, meaning that data distribution is hidden. Four main layers are involved in distributed query processing. The first three layers map the input query into an optimized distributed query execution plan. They perform the functions of query decomposition, data localization, and global query optimization. Query decomposition and data localization correspond to query rewriting. The first three layers are performed by a central control site and use schema information stored in the global directory. The fourth layer performs distributed query execution by executing the plan and returns the answer to the query. It is done by the local sites and the control site. The first two layers are treated extensively in Chapter 7, while the two last layers are detailed in Chapter 8. In the remainder of this chapter we present an overview of these four layers.

[Figure 6.3: at the control site, a calculus query on global relations passes through query decomposition (using the global schema), yielding an algebraic query on global relations; then through data localization (using the fragment schema), yielding an algebraic query on fragments; then through global optimization (using the allocation schema), yielding a distributed query execution plan. Distributed execution of that plan takes place at the local sites.]

Fig. 6.3 Generic Layering Scheme for Distributed Query Processing

6.5.1 Query Decomposition

The first layer decomposes the calculus query into an algebraic query on global relations. The information needed for this transformation is found in the global conceptual schema describing the global relations. However, the information about data distribution is not used here but in the next layer. Thus the techniques used by this layer are those of a centralized DBMS.

Query decomposition can be viewed as four successive steps. First, the calculus query is rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of a query generally involves the manipulation of the query quantifiers and of the query qualification by applying logical operator priority.

Second, the normalized query is analyzed semantically so that incorrect queries are detected and rejected as early as possible. Techniques to detect incorrect queries exist only for a subset of relational calculus. Typically, they use some sort of graph that captures the semantics of the query.

Third, the correct query (still expressed in relational calculus) is simplified. One way to simplify a query is to eliminate redundant predicates. Note that redundant queries are likely to arise when a query is the result of system transformations applied to the user query. As seen in Chapter 5, such transformations are used for performing semantic data control (views, protection, and semantic integrity control).

Fourth, the calculus query is restructured as an algebraic query. Recall from Section 6.1 that several algebraic queries can be derived from the same calculus query, and that some algebraic queries are “better” than others. The quality of an algebraic query is defined in terms of expected performance. The traditional way to do this transformation toward a “better” algebraic specification is to start with an initial algebraic query and transform it in order to find a “good” one. The initial algebraic query is derived immediately from the calculus query by translating the predicates and the target statement into relational operators as they appear in the query. This directly translated algebra query is then restructured through transformation rules. The algebraic query generated by this layer is good in the sense that the worst executions are typically avoided. For instance, a relation will be accessed only once, even if there are several select predicates. However, this query is generally far from providing an optimal execution, since information about data distribution and fragment allocation is not used at this layer.

6.5.2 Data Localization

The input to the second layer is an algebraic query on global relations. The main role of the second layer is to localize the query’s data using data distribution information in the fragment schema. In Chapter 3 we saw that relations are fragmented and stored in disjoint subsets, called fragments, each being stored at a different site. This layer determines which fragments are involved in the query and transforms the distributed query into a query on fragments. Fragmentation is defined by fragmentation predicates that can be expressed through relational operators. A global relation can be reconstructed by applying the fragmentation rules, and then deriving a program, called a localization program, of relational algebra operators, which then act on fragments. Generating a query on fragments is done in two steps. First, the query


is mapped into a fragment query by substituting each relation by its reconstruction program (also called materialization program), discussed in Chapter 3. Second, the fragment query is simplified and restructured to produce another “good” query. Simplification and restructuring may be done according to the same rules used in the decomposition layer. As in the decomposition layer, the final fragment query is generally far from optimal because information regarding fragments is not utilized.

6.5.3 Global Query Optimization

The input to the third layer is an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query which is close to optimal. Remember that finding the optimal solution is computationally intractable. An execution strategy for a distributed query can be described with relational algebra operators and communication primitives (send/receive operators) for transferring data between sites. The previous layers have already optimized the query, for example, by eliminating redundant expressions. However, this optimization is independent of fragment characteristics such as fragment allocation and cardinalities. In addition, communication operators are not yet specified. By permuting the ordering of operators within one query on fragments, many equivalent queries may be found.

Query optimization consists of finding the “best” ordering of operators in the query, including communication operators, that minimizes a cost function. The cost function, often defined in terms of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost, and so on. Generally, it is a weighted combination of I/O, CPU, and communication costs. Nevertheless, a typical simplification made by the early distributed DBMSs, as we mentioned before, was to consider communication cost as the most significant factor. This used to be valid for wide area networks, where the limited bandwidth made communication much more costly than local processing. This no longer holds today, and communication cost can be lower than I/O cost. To select the ordering of operators it is necessary to predict execution costs of alternative candidate orderings. Determining execution costs before query execution (i.e., static optimization) is based on fragment statistics and the formulas for estimating the cardinalities of results of relational operators. Thus the optimization decisions depend on the allocation of fragments and on the available statistics on fragments, which are recorded in the allocation schema.
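The weighted cost function described above can be illustrated with a minimal sketch. The class `PlanEstimate`, the unit weights, and the two candidate plans are hypothetical values chosen for illustration; they are not taken from any particular system.

```python
from dataclasses import dataclass

@dataclass
class PlanEstimate:
    io_pages: float    # estimated number of disk page accesses
    cpu_instr: float   # estimated number of CPU instructions
    bytes_sent: float  # estimated bytes transferred between sites

# Hypothetical unit costs (seconds per unit); assumptions, not from the text.
W_IO, W_CPU, W_COMM = 1e-3, 1e-8, 1e-6

def total_cost(est: PlanEstimate) -> float:
    """Weighted combination of I/O, CPU, and communication costs."""
    return W_IO * est.io_pages + W_CPU * est.cpu_instr + W_COMM * est.bytes_sent

def cheapest(plans: dict) -> str:
    """Return the name of the candidate plan with the lowest estimated cost."""
    return min(plans, key=lambda name: total_cost(plans[name]))

# Two made-up candidate orderings of the same query.
plans = {
    "ship_whole_relation": PlanEstimate(io_pages=1000, cpu_instr=1e6, bytes_sent=5e6),
    "local_filter_first": PlanEstimate(io_pages=3000, cpu_instr=2e6, bytes_sent=1e5),
}
print(cheapest(plans))
```

With these weights the plan that does more local I/O but ships less data is preferred; raising `W_IO` relative to `W_COMM` can reverse the choice, which is the point made above about fast networks.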

An important aspect of query optimization is join ordering, since permutations of the joins within the query may lead to improvements of orders of magnitude. One basic technique for optimizing a sequence of distributed join operators is through the semijoin operator. The main value of the semijoin in a distributed system is to reduce the size of the join operands and hence the communication cost. However, techniques which consider local processing costs as well as communication costs may not use semijoins because they might increase local processing costs. The output of the query optimization layer is an optimized algebraic query with communication operators


included on fragments. It is typically represented and saved (for future executions) as a distributed query execution plan.
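The following sketch shows how a semijoin can reduce a join operand before it is shipped. Relations are modeled as lists of dictionaries held in one process, so the "shipping" steps are only indicated in comments; the function name `semijoin_join` and the sample tuples are assumptions made for illustration.

```python
def semijoin_join(r_rows, s_rows, key):
    """Join R and S on `key`, reducing R with a semijoin first."""
    # Step 1: the site holding S projects its join-attribute values
    # and would ship only this (small) set to the site holding R.
    s_keys = {s[key] for s in s_rows}
    # Step 2: the site holding R keeps only tuples that will join (R semijoin S).
    r_reduced = [r for r in r_rows if r[key] in s_keys]
    # Step 3: the reduced R would be shipped to S's site and joined there.
    joined = [{**r, **s} for r in r_reduced for s in s_rows if r[key] == s[key]]
    return r_reduced, joined

# Made-up sample tuples.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
]
ASG = [
    {"ENO": "E1", "PNO": "P1"},
    {"ENO": "E1", "PNO": "P2"},
    {"ENO": "E3", "PNO": "P1"},
]

emp_reduced, result = semijoin_join(EMP, ASG, "ENO")
print(len(EMP), "->", len(emp_reduced), "EMP tuples shipped;", len(result), "result tuples")
```

The extra projection and filtering in steps 1 and 2 are the local processing costs mentioned above; whether they pay off depends on how many tuples the semijoin eliminates.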

6.5.4 Distributed Query Execution

The last layer is performed by all the sites having fragments involved in the query. Each subquery executing at one site, called a local query, is then optimized using the local schema of the site and executed. At this time, the algorithms to perform the relational operators may be chosen. Local optimization uses the algorithms of centralized systems (see Chapter 8).

6.6 Conclusion

In this chapter we provided an overview of query processing in distributed DBMSs. We first introduced the function and objectives of query processing. The main assumption is that the input query is expressed in relational calculus since that is the case with most current distributed DBMSs. The complexity of the problem is proportional to the expressive power and the abstraction capability of the query language. For instance, the problem is even harder with important extensions such as the transitive closure operator [Valduriez and Boral, 1986].

The goal of distributed query processing may be summarized as follows: given a calculus query on a distributed database, find a corresponding execution strategy that minimizes a system cost function that includes I/O, CPU, and communication costs. An execution strategy is specified in terms of relational algebra operators and communication primitives (send/receive) applied to the local databases (i.e., the relation fragments). Therefore, the complexity of relational operators that affect the performance of query execution is of major importance in the design of a query processor.

We gave a characterization of query processors based on their implementation choices. Query processors may differ in various aspects such as type of algorithm, optimization granularity, optimization timing, use of statistics, choice of decision site(s), exploitation of the network topology, exploitation of replicated fragments, and use of semijoins. This characterization is useful for comparing alternative query processor designs and to understand the trade-offs between efficiency and complexity.

The query processing problem is very difficult to understand in distributed environments because many elements are involved. However, the problem may be divided into several subproblems which are easier to solve individually. Therefore, we have proposed a generic layering scheme for describing distributed query processing. Four main functions have been isolated: query decomposition, data localization, global query optimization, and distributed query execution. These functions successively refine the query by adding more details about the processing environment. Query


decomposition and data localization are treated in detail in Chapter 7. Distributed query optimization and execution is the topic of Chapter 8.

6.7 Bibliographic Notes

Kim et al. [1985] provide a comprehensive set of papers presenting the results of research and development in query processing within the context of the relational model. After a survey of the state of the art in query processing, the book treats most of the important topics in the area. In particular, there are three papers on distributed query processing.

Ibaraki and Kameda [1984] have formally shown that finding the optimal execution strategy for a query is computationally intractable. Assuming a simplified cost function including the number of page accesses, it is proven that the minimization of this cost function for a multiple-join query is NP-complete.

Ceri and Pelagatti [1984] deal extensively with distributed query processing by treating the problem of localization and optimization separately in two chapters. The main assumption is that the query is expressed in relational algebra, so the decomposition phase that maps a calculus query into an algebraic query is ignored.

There are several survey papers on query processing and query optimization in the context of the relational model. A detailed survey is by Graefe [1993]. An earlier survey is [Jarke and Koch, 1984]. Both of these mainly deal with centralized query processing. The initial solutions to distributed query processing are extensively compiled in [Sacco and Yao, 1982; Yu and Chang, 1984]. Many query processing techniques are compiled in the book [Freytag et al., 1994].

The most complete survey on distributed query processing is by Kossmann [2000] and deals with both distributed DBMSs and multidatabase systems. The paper presents the traditional phases of query processing in centralized and distributed systems, and describes the various techniques for distributed query processing. It also discusses different distributed architectures such as client-server, multi-tier, and multidatabases.

Chapter 7 Query Decomposition and Data Localization

In Chapter 6 we discussed a generic layering scheme for distributed query processing in which the first two layers are responsible for query decomposition and data localization. These two functions are applied successively to transform a calculus query specified on distributed relations (i.e., global relations) into an algebraic query defined on relation fragments. In this chapter we present the techniques for query decomposition and data localization.

Query decomposition maps a distributed calculus query into an algebraic query on global relations. The techniques used at this layer are those of the centralized DBMS since relation distribution is not yet considered at this point. The resultant algebraic query is “good” in the sense that even if the subsequent layers apply a straightforward algorithm, the worst executions will be avoided. However, the subsequent layers usually perform important optimizations, as they add to the query increasing detail about the processing environment.

Data localization takes as input the decomposed query on global relations and applies data distribution information to the query in order to localize its data. In Chapter 3 we have seen that to increase the locality of reference and/or parallel execution, relations are fragmented and then stored in disjoint subsets, called fragments, each being placed at a different site. Data localization determines which fragments are involved in the query and thereby transforms the distributed query into a fragment query. Similar to the decomposition layer, the final fragment query is generally far from optimal because quantitative information regarding fragments is not exploited at this point. Quantitative information is used by the query optimization layer that will be presented in Chapter 8.

This chapter is organized as follows. In Section 7.1 we present the four successive phases of query decomposition: normalization, semantic analysis, simplification, and restructuring of the query. In Section 7.2 we describe data localization, with emphasis on reduction and simplification techniques for the four following types of fragmentation: horizontal, vertical, derived, and hybrid.

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_7, © Springer Science+Business Media, LLC 2011


7.1 Query Decomposition

Query decomposition (see Figure 6.3) is the first phase of query processing that transforms a relational calculus query into a relational algebra query. Both input and output queries refer to global relations, without knowledge of the distribution of data. Therefore, query decomposition is the same for centralized and distributed systems. In this section the input query is assumed to be syntactically correct. When this phase is completed successfully the output query is semantically correct and good in the sense that redundant work is avoided. The successive steps of query decomposition are (1) normalization, (2) analysis, (3) elimination of redundancy, and (4) rewriting. Steps 1, 3, and 4 rely on the fact that various transformations are equivalent for a given query, and some can have better performance than others. We present the first three steps in the context of tuple relational calculus (e.g., SQL). Only the last step rewrites the query into relational algebra.

7.1.1 Normalization

The input query may be arbitrarily complex, depending on the facilities provided by the language. It is the goal of normalization to transform the query to a normalized form to facilitate further processing. With relational languages such as SQL, the most important transformation is that of the query qualification (the WHERE clause), which may be an arbitrarily complex, quantifier-free predicate, preceded by all necessary quantifiers (∀ or ∃). There are two possible normal forms for the predicate, one giving precedence to the AND (∧) and the other to the OR (∨). The conjunctive normal form is a conjunction (∧ predicate) of disjunctions (∨ predicates) as follows:

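$(p_{11} \lor p_{12} \lor \cdots \lor p_{1n}) \land \cdots \land (p_{m1} \lor p_{m2} \lor \cdots \lor p_{mn})$

where $p_{ij}$ is a simple predicate. A qualification in disjunctive normal form, on the other hand, is as follows:

$(p_{11} \land p_{12} \land \cdots \land p_{1n}) \lor \cdots \lor (p_{m1} \land p_{m2} \land \cdots \land p_{mn})$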

The transformation of the quantifier-free predicate is straightforward using the well-known equivalence rules for logical operations (∧, ∨, and ¬):

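1. $p_1 \land p_2 \Leftrightarrow p_2 \land p_1$
2. $p_1 \lor p_2 \Leftrightarrow p_2 \lor p_1$
3. $p_1 \land (p_2 \land p_3) \Leftrightarrow (p_1 \land p_2) \land p_3$
4. $p_1 \lor (p_2 \lor p_3) \Leftrightarrow (p_1 \lor p_2) \lor p_3$
5. $p_1 \land (p_2 \lor p_3) \Leftrightarrow (p_1 \land p_2) \lor (p_1 \land p_3)$
6. $p_1 \lor (p_2 \land p_3) \Leftrightarrow (p_1 \lor p_2) \land (p_1 \lor p_3)$
7. $\neg(p_1 \land p_2) \Leftrightarrow \neg p_1 \lor \neg p_2$
8. $\neg(p_1 \lor p_2) \Leftrightarrow \neg p_1 \land \neg p_2$
9. $\neg(\neg p) \Leftrightarrow p$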

In the disjunctive normal form, the query can be processed as independent conjunctive subqueries linked by unions (corresponding to the disjunctions). However, this form may lead to replicated join and select predicates, as shown in the following example. The reason is that predicates are very often linked with the other predicates by AND. The use of rule 5 mentioned above, with p1 as a join or select predicate, would result in replicating p1. The conjunctive normal form is more practical since query qualifications typically include more AND than OR predicates. However, it leads to predicate replication for queries involving many disjunctions and few conjunctions, a rare case.

Example 7.1. Let us consider the following query on the engineering database that we have been referring to:

“Find the names of employees who have been working on project P1 for 12 or 24 months”

The query expressed in SQL is

SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = "P1"
AND (DUR = 12 OR DUR = 24)

The qualification in conjunctive normal form is

EMP.ENO = ASG.ENO ∧ ASG.PNO = “P1” ∧ (DUR = 12 ∨ DUR = 24)

while the qualification in disjunctive normal form is

(EMP.ENO = ASG.ENO ∧ ASG.PNO = “P1” ∧ DUR = 12) ∨ (EMP.ENO = ASG.ENO ∧ ASG.PNO = “P1” ∧ DUR = 24)

In the latter form, treating the two conjunctions independently may lead to redundant work if common subexpressions are not eliminated.
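The replication of predicates in the disjunctive normal form can be reproduced with a small sketch. It handles only AND and OR over atomic predicates (no negation or quantifiers), and the nested-tuple representation and the function name `to_dnf` are assumptions made for this illustration.

```python
from itertools import product

def to_dnf(expr):
    """Return a list of conjunctions; each conjunction is a list of atoms.

    `expr` is either a string (an atomic predicate) or a tuple
    ("and", e1, e2, ...) / ("or", e1, e2, ...).
    """
    if isinstance(expr, str):
        return [[expr]]
    op, *args = expr
    parts = [to_dnf(a) for a in args]
    if op == "or":
        # A disjunction of DNFs is the concatenation of their conjunctions.
        return [conj for part in parts for conj in part]
    if op == "and":
        # Rule 5: distribute AND over OR by combining one conjunction per operand.
        return [sum(combo, []) for combo in product(*parts)]
    raise ValueError(f"unsupported operator: {op}")

# Qualification of Example 7.1 in conjunctive normal form.
qualification = (
    "and",
    "EMP.ENO = ASG.ENO",
    'ASG.PNO = "P1"',
    ("or", "DUR = 12", "DUR = 24"),
)

for conjunction in to_dnf(qualification):
    print(" AND ".join(conjunction))
```

Each printed conjunction repeats the join predicate and the selection on PNO, which is the redundancy discussed in the example.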

7.1.2 Analysis

Query analysis enables rejection of normalized queries for which further processing is either impossible or unnecessary. The main reasons for rejection are that the query


is type incorrect or semantically incorrect. When one of these cases is detected, the query is simply returned to the user with an explanation. Otherwise, query processing is continued. Below we present techniques to detect these incorrect queries.

A query is type incorrect if any of its attribute or relation names are not defined in the global schema, or if operations are being applied to attributes of the wrong type. The technique used to detect type incorrect queries is similar to type checking for programming languages. However, the type declarations are part of the global schema rather than of the query, since a relational query does not produce new types.

Example 7.2. The following SQL query on the engineering database is type incorrect for two reasons. First, attribute E# is not declared in the schema. Second, the operation “>200” is incompatible with the type string of ENAME.

SELECT E#
FROM EMP
WHERE ENAME > 200
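A type check of this kind can be sketched as a lookup against schema declarations. The dictionary `SCHEMA`, the simplified query representation (a relation name, a list of target attributes, and a list of comparisons), and the function `type_errors` are hypothetical; the attribute types are assumed to be strings.

```python
# Assumed declaration of EMP(ENO, ENAME, TITLE) with string-typed attributes.
SCHEMA = {"EMP": {"ENO": str, "ENAME": str, "TITLE": str}}

def type_errors(relation, target_attrs, comparisons):
    """Return a list of type errors for a single-relation query.

    `comparisons` is a list of (attribute, operator, constant) triples.
    """
    attrs = SCHEMA.get(relation)
    if attrs is None:
        return [f"unknown relation {relation}"]
    errors = []
    for attr in target_attrs:
        if attr not in attrs:
            errors.append(f"unknown attribute {attr}")
    for attr, op, const in comparisons:
        if attr not in attrs:
            errors.append(f"unknown attribute {attr}")
        elif not isinstance(const, attrs[attr]):
            errors.append(f"{attr} {op} {const!r}: constant does not match attribute type")
    return errors

# The query of Example 7.2: SELECT E# FROM EMP WHERE ENAME > 200
print(type_errors("EMP", ["E#"], [("ENAME", ">", 200)]))
```

The sketch reports both problems named in the example: the undeclared attribute and the comparison of a string attribute with a numeric constant.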

A query is semantically incorrect if its components do not contribute in any way to the generation of the result. In the context of relational calculus, it is not possible to determine the semantic correctness of general queries. However, it is possible to do so for a large class of relational queries, those which do not contain disjunction and negation [Rosenkrantz and Hunt, 1980]. This is based on the representation of the query as a graph, called a query graph or connection graph [Ullman, 1982]. We define this graph for the most useful kinds of queries involving select, project, and join operators. In a query graph, one node indicates the result relation, and any other node indicates an operand relation. An edge between two nodes one of which does not correspond to the result represents a join, whereas an edge whose destination node is the result represents a project. Furthermore, a non-result node may be labeled by a select or a self-join (join of the relation with itself) predicate. An important subgraph of the query graph is the join graph, in which only the joins are considered. The join graph is particularly useful in the query optimization phase.

Example 7.3. Let us consider the following query:

“Find the names and responsibilities of programmers who have been working on the CAD/CAM project for more than 3 years.”

The query expressed in SQL is

SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PNAME = "CAD/CAM"
AND DUR >= 36
AND TITLE = "Programmer"

The query graph for the query above is shown in Figure 7.1a. Figure 7.1b shows the join graph for the graph in Figure 7.1a.


Fig. 7.1 Relation Graphs

The query graph is useful to determine the semantic correctness of a conjunctive multivariable query without negation. Such a query is semantically incorrect if its query graph is not connected. In this case one or more subgraphs (corresponding to subqueries) are disconnected from the graph that contains the result relation. The query could be considered correct (which some systems do) by considering the missing connection as a Cartesian product. But, in general, the problem is that join predicates are missing and the query should be rejected.

Example 7.4. Let us consider the following SQL query:

SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND PNAME = "CAD/CAM"
AND DUR >= 36
AND TITLE = "Programmer"

Its query graph, shown in Figure 7.2, is disconnected, which tells us that the query is semantically incorrect. There are basically three solutions to the problem: (1) reject the query, (2) assume that there is an implicit Cartesian product between relations ASG and PROJ, or (3) infer (using the schema) the missing join predicate ASG.PNO = PROJ.PNO which transforms the query into that of Example 7.3.
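The connectivity test on the query graph can be sketched as a graph traversal starting from the result node. The edge lists below are written by hand from the two queries, assuming (as in the engineering database) that ENAME belongs to EMP and RESP to ASG; the function name `is_connected` is illustrative.

```python
def is_connected(nodes, edges, start="RESULT"):
    """Return True if every node is reachable from `start` via undirected edges."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return seen == set(nodes)

NODES = ["EMP", "ASG", "PROJ", "RESULT"]

# Example 7.3: two join edges and two project edges.
edges_7_3 = [("EMP", "ASG"), ("ASG", "PROJ"), ("EMP", "RESULT"), ("ASG", "RESULT")]
# Example 7.4: the join predicate ASG.PNO = PROJ.PNO is missing.
edges_7_4 = [("EMP", "ASG"), ("EMP", "RESULT"), ("ASG", "RESULT")]

print(is_connected(NODES, edges_7_3))
print(is_connected(NODES, edges_7_4))
```

Select predicates label nodes rather than edges, so they do not affect connectivity and are left out of the sketch.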

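[Figure 7.1 panels. (a) Query graph: nodes EMP, ASG, PROJ, and RESULT; join edges EMP.ENO = ASG.ENO and ASG.PNO = PROJ.PNO; project edges ENAME and RESP into RESULT; select labels TITLE = "Programmer", DUR≥36, and PNAME = "CAD/CAM". (b) Corresponding join graph: EMP, ASG, and PROJ linked by EMP.ENO = ASG.ENO and ASG.PNO = PROJ.PNO.]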


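[Figure: query graph with nodes EMP, ASG, PROJ, and RESULT; join edge EMP.ENO = ASG.ENO; project edges ENAME and RESP into RESULT; select labels TITLE = "Programmer", DUR≥36, and PNAME = "CAD/CAM". No edge connects PROJ to the other nodes.]

Fig. 7.2 Disconnected Query Graph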

7.1.3 Elimination of Redundancy

As we saw in Chapter 5, relational languages can be used uniformly for semantic data control. In particular, a user query typically expressed on a view may be enriched with several predicates to achieve view-relation correspondence, and ensure semantic integrity and security. The enriched query qualification may then contain redundant predicates. A naive evaluation of a qualification with redundancy can well lead to duplicated work. Such redundancy and thus redundant work may be eliminated by simplifying the qualification with the following well-known idempotency rules:

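1. $p \land p \Leftrightarrow p$
2. $p \lor p \Leftrightarrow p$
3. $p \land \mathit{true} \Leftrightarrow p$
4. $p \lor \mathit{false} \Leftrightarrow p$
5. $p \land \mathit{false} \Leftrightarrow \mathit{false}$
6. $p \lor \mathit{true} \Leftrightarrow \mathit{true}$
7. $p \land \neg p \Leftrightarrow \mathit{false}$
8. $p \lor \neg p \Leftrightarrow \mathit{true}$
9. $p_1 \land (p_1 \lor p_2) \Leftrightarrow p_1$
10. $p_1 \lor (p_1 \land p_2) \Leftrightarrow p_1$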

Example 7.5. The SQL query


SELECT TITLE
FROM EMP
WHERE (NOT (TITLE = "Programmer")
AND (TITLE = "Programmer"
OR TITLE = "Elect. Eng.")
AND NOT (TITLE = "Elect. Eng."))
OR ENAME = "J. Doe"

can be simplified using the previous rules to become

SELECT TITLE
FROM EMP
WHERE ENAME = "J. Doe"

The simplification proceeds as follows. Let $p_1$ be TITLE = “Programmer”, $p_2$ be TITLE = “Elect. Eng.”, and $p_3$ be ENAME = “J. Doe”. The query qualification is

$(\neg p_1 \land (p_1 \lor p_2) \land \neg p_2) \lor p_3$

The disjunctive normal form for this qualification is obtained by applying rule 5 defined in Section 7.1.1, which yields

$(\neg p_1 \land ((p_1 \land \neg p_2) \lor (p_2 \land \neg p_2))) \lor p_3$

and then rule 5 once more together with rule 3 defined in Section 7.1.1, which yields

$(\neg p_1 \land p_1 \land \neg p_2) \lor (\neg p_1 \land p_2 \land \neg p_2) \lor p_3$

By applying rule 7 defined above, we obtain

$(\mathit{false} \land \neg p_2) \lor (\neg p_1 \land \mathit{false}) \lor p_3$

By applying rule 5 defined above, we get

$\mathit{false} \lor \mathit{false} \lor p_3$

which is equivalent to $p_3$ by rule 4.
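The result of this simplification can be cross-checked by brute force. The sketch below treats p1, p2, and p3 as independent Boolean variables and compares the original qualification with p3 on all eight truth assignments; the function name `qualification` is illustrative.

```python
from itertools import product

def qualification(p1: bool, p2: bool, p3: bool) -> bool:
    """Original qualification of Example 7.5: (not p1 and (p1 or p2) and not p2) or p3."""
    return ((not p1) and (p1 or p2) and (not p2)) or p3

# Compare against the simplified qualification (just p3) on every assignment.
equivalent = all(
    qualification(p1, p2, p3) == p3
    for p1, p2, p3 in product([False, True], repeat=3)
)
print(equivalent)
```

A truth table grows exponentially with the number of predicates, so this is a way to validate a rewrite on a small case rather than a substitute for the rule-based simplification.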

7.1.4 Rewriting

The last step of query decomposition rewrites the query in relational algebra. For the sake of clarity it is customary to represent the relational algebra query graphically by an operator tree. An operator tree is a tree in which a leaf node is a relation stored in the database, and a non-leaf node is an intermediate relation produced by a relational algebra operator. The sequence of operations is directed from the leaves to the root, which represents the answer to the query.


The transformation of a tuple relational calculus query into an operator tree can easily be achieved as follows. First, a different leaf is created for each different tuple variable (corresponding to a relation). In SQL, the leaves are immediately available in the FROM clause. Second, the root node is created as a project operation involving the result attributes. These are found in the SELECT clause in SQL. Third, the qualification (SQL WHERE clause) is translated into the appropriate sequence of relational operations (select, join, union, etc.) going from the leaves to the root. The sequence can be given directly by the order of appearance of the predicates and operators.

Example 7.6. The query

“Find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years” whose SQL expression is

SELECT ENAME
FROM PROJ, ASG, EMP
WHERE ASG.ENO = EMP.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME != "J. Doe"
AND PROJ.PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)

can be mapped in a straightforward way into the tree in Figure 7.3. The predicates have been transformed in order of appearance as join and then select operations.
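The three construction steps (leaves from FROM, root from SELECT, operators from WHERE in order of appearance) can be sketched with a small tree type. The `Node` class, the helper functions, and the hard-coded translation of the query in Example 7.6 are assumptions for illustration; no SQL parsing is attempted.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                      # "rel", "join", "select", or "project"
    arg: str = ""                # relation name, join attribute, predicate, or attribute list
    children: list = field(default_factory=list)

def build_tree() -> Node:
    """Build the operator tree for the query of Example 7.6 by hand."""
    # Step 1: one leaf per relation in the FROM clause.
    proj, asg, emp = Node("rel", "PROJ"), Node("rel", "ASG"), Node("rel", "EMP")
    # Step 3: predicates in order of appearance, joins first, then selects.
    tree = Node("join", "ENO", [asg, emp])
    tree = Node("join", "PNO", [tree, proj])
    for predicate in ['ENAME != "J. Doe"', 'PNAME = "CAD/CAM"', "DUR = 12 OR DUR = 24"]:
        tree = Node("select", predicate, [tree])
    # Step 2: the root projects on the attributes of the SELECT clause.
    return Node("project", "ENAME", [tree])

def leaves(node: Node) -> list:
    """Relation names at the leaves, left to right."""
    if node.op == "rel":
        return [node.arg]
    return [name for child in node.children for name in leaves(child)]

def show(node: Node, depth: int = 0) -> None:
    """Print the tree from the root down, one operator per line."""
    print("  " * depth + f"{node.op} {node.arg}")
    for child in node.children:
        show(child, depth + 1)

root = build_tree()
show(root)
```

Keeping the tree as plain data makes it straightforward to apply equivalence rules as functions from one tree to another.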

By applying transformation rules, many different trees may be found equivalent to the one produced by the method described above [Smith and Chang, 1975]. We now present the six most useful equivalence rules, which concern the basic relational algebra operators. The correctness of these rules has been proven [Ullman, 1982].

In the remainder of this section, R, S, and T are relations where R is defined over attributes $A = \{A_1, A_2, \ldots, A_n\}$ and S is defined over $B = \{B_1, B_2, \ldots, B_n\}$.

1. Commutativity of binary operators. The Cartesian product of two relations R and S is commutative:

$R \times S \Leftrightarrow S \times R$

Similarly, the join of two relations is commutative:

$R \bowtie S \Leftrightarrow S \bowtie R$

This rule also applies to union but not to set difference or semijoin.

2. Associativity of binary operators. The Cartesian product and the join are associative operators:

$(R \times S) \times T \Leftrightarrow R \times (S \times T)$

$(R \bowtie S) \bowtie T \Leftrightarrow R \bowtie (S \bowtie T)$


[Figure: operator tree listed from root to leaves: project Π ENAME; selects σ DUR=12 ∨ DUR=24, σ PNAME="CAD/CAM", and σ ENAME≠"J. Doe"; joins on PNO and ENO; leaves PROJ, ASG, and EMP.]

Fig. 7.3 Example of Operator Tree

3. Idempotence of unary operators. Several subsequent projections on the same relation may be grouped. Conversely, a single projection on several attributes may be separated into several subsequent projections. If R is defined over the attribute set A, and $A' \subseteq A$, $A'' \subseteq A$, and $A' \subseteq A''$, then

$\Pi_{A'}(\Pi_{A''}(R)) \Leftrightarrow \Pi_{A'}(R)$

Several subsequent selections $\sigma_{p_i(A_i)}$ on the same relation, where $p_i$ is a predicate applied to attribute $A_i$, may be grouped as follows:

$\sigma_{p_1(A_1)}(\sigma_{p_2(A_2)}(R)) = \sigma_{p_1(A_1) \land p_2(A_2)}(R)$

Conversely, a single selection with a conjunction of predicates may be separated into several subsequent selections.

4. Commuting selection with projection. Selection and projection on the same relation can be commuted as follows:

$\Pi_{A_1,\ldots,A_n}(\sigma_{p(A_p)}(R)) \Leftrightarrow \Pi_{A_1,\ldots,A_n}(\sigma_{p(A_p)}(\Pi_{A_1,\ldots,A_n,A_p}(R)))$

Note that if $A_p$ is already a member of $\{A_1, \ldots, A_n\}$, the last projection on $[A_1, \ldots, A_n]$ on the right-hand side of the equality is useless.

5. Commuting selection with binary operators. Selection and Cartesian product can be commuted using the following rule (remember that attribute $A_i$ belongs to relation R):

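$\sigma_{p(A_i)}(R \times S) \Leftrightarrow (\sigma_{p(A_i)}(R)) \times S$

Selection and join can be commuted:

$\sigma_{p(A_i)}(R \bowtie_{p(A_j,B_k)} S) \Leftrightarrow \sigma_{p(A_i)}(R) \bowtie_{p(A_j,B_k)} S$

Selection and union can be commuted if R and T are union compatible (have the same schema):

$\sigma_{p(A_i)}(R \cup T) \Leftrightarrow \sigma_{p(A_i)}(R) \cup \sigma_{p(A_i)}(T)$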

Selection and difference can be commuted in a similar fashion.

6. Commuting projection with binary operators. Projection and Cartesian product can be commuted. If C = A′∪B′, where A′ ⊆ A, B′ ⊆ B, and A and B are the sets of attributes over which relations R and S, respectively, are defined, we have

$\Pi_C(R \times S) \Leftrightarrow \Pi_{A'}(R) \times \Pi_{B'}(S)$

Projection and join can also be commuted.

$\Pi_C(R \bowtie_{p(A_i,B_j)} S) \Leftrightarrow \Pi_{A'}(R) \bowtie_{p(A_i,B_j)} \Pi_{B'}(S)$

For the join on the right-hand side of the equivalence to hold we need to have $A_i \in A'$ and $B_j \in B'$. Since $C = A' \cup B'$, $A_i$ and $B_j$ are in C and therefore we don’t need a projection over C once the projections over $A'$ and $B'$ are performed. Projection and union can be commuted as follows:

$\Pi_C(R \cup S) \Leftrightarrow \Pi_C(R) \cup \Pi_C(S)$

Projection and difference can be commuted similarly.

The application of these six rules enables the generation of many equivalent trees. For instance, the tree in Figure 7.4 is equivalent to the one in Figure 7.3. However, the one in Figure 7.4 requires a Cartesian product of relations EMP and PROJ, and may lead to a higher execution cost than the original tree. In the optimization phase, one can imagine comparing all possible trees based on their predicted cost. However, the excessively large number of possible trees makes this approach unrealistic. The rules presented above can be used to restructure the tree in a systematic way so that the “bad” operator trees are eliminated. These rules can be used in four different ways. First, they allow the separation of the unary operations, simplifying the query expression. Second, unary operations on the same relation may be grouped so that access to a relation for performing unary operations can be done only once. Third, unary operations can be commuted with binary operations so that some operations (e.g., selection) may be done first. Fourth, the binary operations can be ordered. This


last rule is used extensively in query optimization. A simple restructuring algorithm uses a single heuristic that consists of applying unary operations (select/project) as soon as possible to reduce the size of intermediate relations [Ullman, 1982].

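[Figure: operator tree with Π ENAME at the root, over σ PNAME="CAD/CAM" ∧ (DUR=12 ∨ DUR=24) ∧ ENAME ≠ "J. Doe", over a join on PNO, ENO between ASG and the Cartesian product (x) of EMP and PROJ.]

Fig. 7.4 Equivalent Operator Tree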

Example 7.7. The restructuring of the tree in Figure 7.3 leads to the tree in Figure 7.5. The resulting tree is good in the sense that repeated access to the same relation (as in Figure 7.3) is avoided and that the most selective operations are done first. However, this tree is far from optimal. For example, the select operation on EMP is not very useful before the join because it does not greatly reduce the size of the operand relation.
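The heuristic of applying selections as early as possible relies on rule 5 (commuting selection with join). The sketch below checks that rule on made-up tuples for ASG and PROJ and shows how the join operand shrinks; the helper names `select` and `join` and the sample data are assumptions.

```python
def select(rows, predicate):
    """Relational selection over a list of dict tuples."""
    return [row for row in rows if predicate(row)]

def join(r_rows, s_rows, attr):
    """Equijoin of two relations on a common attribute."""
    return [{**r, **s} for r in r_rows for s in s_rows if r[attr] == s[attr]]

# Made-up sample tuples.
ASG = [
    {"ENO": "E1", "PNO": "P1", "DUR": 12},
    {"ENO": "E2", "PNO": "P2", "DUR": 24},
    {"ENO": "E3", "PNO": "P3", "DUR": 12},
]
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation"},
    {"PNO": "P2", "PNAME": "Database Develop."},
    {"PNO": "P3", "PNAME": "CAD/CAM"},
]

def is_cad_cam(row):
    return row["PNAME"] == "CAD/CAM"

# Selection applied after the join (as in the directly translated tree).
late = select(join(ASG, PROJ, "PNO"), is_cad_cam)
# Selection pushed below the join (rule 5), so the join sees a smaller operand.
reduced_proj = select(PROJ, is_cad_cam)
early = join(ASG, reduced_proj, "PNO")

print(late == early, len(PROJ), "->", len(reduced_proj))
```

Both orderings return the same tuples, but the second one feeds a single PROJ tuple into the join instead of three.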

7.2 Localization of Distributed Data

In Section 7.1 we presented general techniques for decomposing and restructuring queries expressed in relational calculus. These global techniques apply to both centralized and distributed DBMSs and do not take into account the distribution of data. This is the role of the localization layer. As shown in the generic layering scheme of query processing described in Chapter 6, the localization layer translates an algebraic query on global relations into an algebraic query expressed on physical fragments. Localization uses information stored in the fragment schema.

Fragmentation is defined through fragmentation rules, which can be expressed as relational queries. As we discussed in Chapter 3, a global relation can be reconstructed by applying the reconstruction (or reverse fragmentation) rules and deriving a relational algebra program whose operands are the fragments. We call this a localization program. To simplify this section, we do not consider the fact that data


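[Figure: operator tree over leaves EMP, ASG, and PROJ with Π ENAME at the root, joins on PNO and ENO, intermediate projections Π PNO,ENAME, Π ENO,ENAME, Π PNO,ENO, and Π PNO, and selections σ PNAME="CAD/CAM", σ ENAME≠"J. Doe", and σ DUR=12 ∨ DUR=24 applied near the leaves.]

Fig. 7.5 Rewritten Operator Tree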

fragments may be replicated, although this can improve performance. Replication is considered in Chapter 8.

A naive way to localize a distributed query is to generate a query where each global relation is substituted by its localization program. This can be viewed as replacing the leaves of the operator tree of the distributed query with subtrees corresponding to the localization programs. We call the query obtained this way the localized query. In general, this approach is inefficient because important restructurings and simplifications of the localized query can still be made [Ceri and Pelagatti, 1983; Ceri et al., 1986]. In the remainder of this section, for each type of fragmentation we present reduction techniques that generate simpler and optimized queries. We use the transformation rules and the heuristics, such as pushing unary operations down the tree, that were introduced in Section 7.1.4.

7.2.1 Reduction for Primary Horizontal Fragmentation

The horizontal fragmentation function distributes a relation based on selection predicates. The following example is used in subsequent discussions.

Example 7.8. Relation EMP(ENO, ENAME, TITLE) of Figure 2.3 can be split into three horizontal fragments EMP1, EMP2, and EMP3, defined as follows:

EMP1 = σ_{ENO ≤ "E3"}(EMP)
EMP2 = σ_{"E3" < ENO ≤ "E6"}(EMP)
EMP3 = σ_{ENO > "E6"}(EMP)

Note that this fragmentation of the EMP relation is different from the one discussed in Example 3.12.

The localization program for a horizontally fragmented relation is the union of the fragments. In our example we have

EMP = EMP1 ∪ EMP2 ∪ EMP3

Thus the localized form of any query specified on EMP is obtained by replacing it by (EMP1 ∪ EMP2 ∪ EMP3).

The reduction of queries on horizontally fragmented relations consists primarily of determining, after restructuring the subtrees, those that will produce empty relations, and removing them. Horizontal fragmentation can be exploited to simplify both selection and join operations.

7.2.1.1 Reduction with Selection

Selections on fragments that have a qualification contradicting the qualification of the fragmentation rule generate empty relations. Given a relation R that has been horizontally fragmented as R_1, R_2, ..., R_w, where R_j = σ_{p_j}(R), the rule can be stated formally as follows:

Rule 1: σ_{p_i}(R_j) = φ if ∀x in R: ¬(p_i(x) ∧ p_j(x))

where p_i and p_j are selection predicates, x denotes a tuple, and p(x) denotes “predicate p holds for x.”

For example, the selection predicate ENO=“E1” conflicts with the predicates of fragments EMP2 and EMP3 of Example 7.8 (i.e., no tuple in EMP2 and EMP3 can satisfy this predicate). Determining the contradicting predicates requires theorem-proving techniques if the predicates are quite general [Hunt and Rosenkrantz, 1979]. However, DBMSs generally simplify predicate comparison by supporting only simple predicates for defining fragmentation rules (by the database administrator).

Example 7.9. We now illustrate reduction by horizontal fragmentation using the following example query:

SELECT *
FROM EMP
WHERE ENO = "E5"

Applying the naive approach to localize EMP from EMP1, EMP2, and EMP3 gives the localized query of Figure 7.6a. By commuting the selection with the union operation, it is easy to detect that the selection predicate contradicts the predicates of EMP1 and EMP3, thereby producing empty relations. The reduced query is simply applied to EMP2, as shown in Figure 7.6b.

Fig. 7.6 Reduction for Horizontal Fragmentation (with Selection): (a) Localized query; (b) Reduced query
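When fragmentation predicates are simple ranges on a single attribute, the contradiction test of Rule 1 reduces to a range check. The following Python sketch illustrates this for the fragments of Example 7.8 and the selection of Example 7.9. It assumes that each fragment is described by an interval (low, high] on ENO, that ENO values compare as plain strings, and that the query is an equality selection; the names Fragment, may_contain, and reduce_selection are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional


class Fragment(NamedTuple):
    name: str
    low: Optional[str]   # exclusive lower bound on ENO, None = unbounded
    high: Optional[str]  # inclusive upper bound on ENO, None = unbounded


# Fragmentation of Example 7.8, written as (low, high] intervals on ENO.
EMP_FRAGMENTS = [
    Fragment("EMP1", None, "E3"),
    Fragment("EMP2", "E3", "E6"),
    Fragment("EMP3", "E6", None),
]


def may_contain(fragment: Fragment, eno: str) -> bool:
    """False when the selection ENO = eno contradicts the fragment predicate."""
    above_low = fragment.low is None or eno > fragment.low
    below_high = fragment.high is None or eno <= fragment.high
    return above_low and below_high


def reduce_selection(fragments, eno):
    """Keep only the fragments on which the selection can return tuples."""
    return [f.name for f in fragments if may_contain(f, eno)]


print(reduce_selection(EMP_FRAGMENTS, "E5"))  # ['EMP2']
```

With general predicates this simple interval test is not sufficient, which is why the text refers to theorem-proving techniques.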

7.2.1.2 Reduction with Join

Joins on horizontally fragmented relations can be simplified when the joined relations are fragmented according to the join attribute. The simplification consists of distributing joins over unions and eliminating useless joins. The distribution of join over union can be stated as:

(R_1 ∪ R_2) ⋈ S = (R_1 ⋈ S) ∪ (R_2 ⋈ S)

where R_i are fragments of R and S is a relation. With this transformation, unions can be moved up in the operator tree so that all possible joins of fragments are exhibited. Useless joins of fragments can be determined when the qualifications of the joined fragments are contradicting, thus yielding an empty result. Assuming that fragments R_i and R_j are defined, respectively, according to predicates p_i and p_j on the same attribute, the simplification rule can be stated as follows:

Rule 2: R_i ⋈ R_j = φ if ∀x in R_i, ∀y in R_j: ¬(p_i(x) ∧ p_j(y))

The determination of useless joins and their elimination using rule 2 can thus be performed by looking only at the fragment predicates. The application of this rule permits the join of two relations to be implemented as parallel partial joins of fragments [Ceri et al., 1986]. It is not always the case that the reduced query is better (i.e., simpler) than the localized query. The localized query is better when there are a large number of partial joins in the reduced query. This case arises when there are few contradicting fragmentation predicates. The worst case occurs when each fragment of one relation must be joined with each fragment of the other relation. This is tantamount to the Cartesian product of the two sets of fragments, with each set corresponding to one relation. The reduced query is better when the number of partial joins is small. For example, if both relations are fragmented using the same predicates, the number of partial joins is equal to the number of fragments of each relation. One advantage of the reduced query is that the partial joins can be done in parallel, and thus reduce response time.

Example 7.10. Assume that relation EMP is fragmented between EMP1, EMP2, and EMP3, as above, and that relation ASG is fragmented as

ASG1 = σ_{ENO ≤ "E3"}(ASG)
ASG2 = σ_{ENO > "E3"}(ASG)

EMP1 and ASG1 are defined by the same predicate. Furthermore, the predicate defining ASG2 is the disjunction of the predicates defining EMP2 and EMP3. Now consider the join query

SELECT *
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO

The equivalent localized query is given in Figure 7.7a. The query reduced by distributing joins over unions and applying rule 2 can be implemented as a union of three partial joins that can be done in parallel (Figure 7.7b).
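The reduction of Example 7.10 can be reproduced by enumerating all pairs of fragments and discarding those whose predicates cannot both hold for the same join attribute value. The Python sketch below is a minimal illustration under the assumption that every fragment predicate is an interval (low, high] on ENO and that ENO values compare as strings; the helper names overlaps and partial_joins are hypothetical. The test is conservative: a pair is dropped only when the two intervals are provably disjoint.

```python
from itertools import product

# Fragment predicates as (low, high] intervals on ENO; None = unbounded.
EMP = {"EMP1": (None, "E3"), "EMP2": ("E3", "E6"), "EMP3": ("E6", None)}
ASG = {"ASG1": (None, "E3"), "ASG2": ("E3", None)}


def overlaps(a, b):
    """False only when the two intervals cannot share any ENO value."""
    (a_low, a_high), (b_low, b_high) = a, b
    a_before_b = a_high is not None and b_low is not None and a_high <= b_low
    b_before_a = b_high is not None and a_low is not None and b_high <= a_low
    return not (a_before_b or b_before_a)


def partial_joins(r_frags, s_frags):
    """Distribute the join over the unions and apply Rule 2 to each pair."""
    return [
        (r, s)
        for (r, r_iv), (s, s_iv) in product(r_frags.items(), s_frags.items())
        if overlaps(r_iv, s_iv)
    ]


print(partial_joins(EMP, ASG))
# [('EMP1', 'ASG1'), ('EMP2', 'ASG2'), ('EMP3', 'ASG2')]
```

Of the six candidate partial joins, three remain, which matches the union of three partial joins described in the example.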

7.2.2 Reduction for Vertical Fragmentation

The vertical fragmentation function distributes a relation based on projection attributes. Since the reconstruction operator for vertical fragmentation is the join, the localization program for a vertically fragmented relation consists of the join of the fragments on the common attribute. For vertical fragmentation, we use the following example.

Example 7.11. Relation EMP can be divided into two vertical fragments where the key attribute ENO is duplicated:

EMP1 = Π_{ENO,ENAME}(EMP)
EMP2 = Π_{ENO,TITLE}(EMP)

The localization program is

EMP = EMP1 ⋈_{ENO} EMP2

Fig. 7.7 Reduction by Horizontal Fragmentation (with Join): (a) Localized query; (b) Reduced query

Similar to horizontal fragmentation, queries on vertical fragments can be reduced by determining the useless intermediate relations and removing the subtrees that produce them. Projections on a vertical fragment that has no attributes in common with the projection attributes (except the key of the relation) produce useless, though not empty relations. Given a relation R, defined over attributes A = {A_1, ..., A_n}, which is vertically fragmented as R_i = Π_{A′}(R), where A′ ⊆ A, the rule can be formally stated as follows:

Rule 3: Π_{D,K}(R_i) is useless if the set of projection attributes D is not in A′.

Example 7.12. Let us illustrate the application of this rule using the following example query in SQL:

SELECT ENAME
FROM EMP

The equivalent localized query on EMP1 and EMP2 (as obtained in Example 7.11) is given in Figure 7.8a. By commuting the projection with the join (i.e., projecting on ENO, ENAME), we can see that the projection on EMP2 is useless because ENAME is not in EMP2. Therefore, the projection needs to apply only to EMP1, as shown in Figure 7.8b.
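Rule 3 amounts to a set intersection between the projected attributes and the attributes stored in each fragment. The following Python sketch applies it to the fragments of Example 7.11. It is only an illustration: the dictionary FRAGMENTS, the set KEY, and the function useful_fragments are hypothetical names, and the handling of a projection on key attributes alone (any one fragment suffices, since the key is duplicated in every fragment) is our own assumption rather than something stated in the rule.

```python
# Vertical fragments of Example 7.11: fragment name -> attributes it stores.
FRAGMENTS = {"EMP1": {"ENO", "ENAME"}, "EMP2": {"ENO", "TITLE"}}
KEY = {"ENO"}  # the key is duplicated in every fragment


def useful_fragments(fragments, key, projected):
    """Return the fragments that contribute a projected non-key attribute."""
    wanted = set(projected) - key
    if not wanted:
        # Only key attributes are requested, so any single fragment is enough.
        return [next(iter(fragments))]
    return [name for name, attrs in fragments.items() if (attrs - key) & wanted]


print(useful_fragments(FRAGMENTS, KEY, ["ENAME"]))           # ['EMP1']
print(useful_fragments(FRAGMENTS, KEY, ["ENAME", "TITLE"]))  # ['EMP1', 'EMP2']
```

For the query of Example 7.12 only EMP1 is retained, so the join with EMP2 disappears from the reduced query.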

Fig. 7.8 Reduction for Vertical Fragmentation: (a) Localized query; (b) Reduced query

7.2.3 Reduction for Derived Fragmentation

As we saw in previous sections, the join operation, which is probably the most important operation because it is both frequent and expensive, can be optimized by using primary horizontal fragmentation when the joined relations are fragmented according to the join attributes. In this case the join of two relations is implemented as a union of partial joins. However, this method precludes one of the relations from being fragmented on a different attribute used for selection. Derived horizontal fragmentation is another way of distributing two relations so that the joint processing of select and join is improved. Typically, if relation R is subject to derived horizontal fragmentation due to relation S, the fragments of R and S that have the same join attribute values are located at the same site. In addition, S can be fragmented according to a selection predicate.

Since tuples of R are placed according to the tuples of S, derived fragmentation should be used only for one-to-many (hierarchical) relationships of the form S→ R, where a tuple of S can match with n tuples of R, but a tuple of R matches with exactly one tuple of S. Note that derived fragmentation could be used for many-to-many relationships provided that tuples of S (that match with n tuples of R) are replicated. Such replication is difficult to maintain consistently. For simplicity, we assume and advise that derived fragmentation be used only for hierarchical relationships.

Example 7.13. Given a one-to-many relationship from EMP to ASG, relation ASG(ENO, PNO, RESP, DUR) can be indirectly fragmented according to the following rules:

ASG1 = ASG ⋉_{ENO} EMP1
ASG2 = ASG ⋉_{ENO} EMP2

Recall from Chapter 3 that the fragments of EMP are defined by predicates on TITLE:

EMP1 = σ_{TITLE = "Programmer"}(EMP)
EMP2 = σ_{TITLE ≠ "Programmer"}(EMP)

The localization program for a horizontally fragmented relation is the union of the fragments. In our example, we have

ASG = ASG1 ∪ ASG2

Queries on derived fragments can also be reduced. Since this type of fragmentation is useful for optimizing join queries, a useful transformation is to distribute joins over unions (used in the localization programs) and to apply rule 2 introduced earlier. Because the fragmentation rules indicate what the matching tuples are, certain joins will produce empty relations if the fragmentation predicates conflict. For example, the predicates of ASG1 and EMP2 conflict; thus we have

ASG1 ⋈ EMP2 = φ

Contrary to the reduction with join discussed previously, the reduced query is always preferable to the localized query because the number of partial joins usually equals the number of fragments of R.

Example 7.14. The reduction by derived fragmentation is illustrated by applying it to the following SQL query, which retrieves all attributes of tuples from EMP and ASG that have the same value of ENO and the title “Mech. Eng.”:

SELECT *
FROM EMP, ASG
WHERE ASG.ENO = EMP.ENO
AND TITLE = "Mech. Eng."

The localized query on fragments EMP1, EMP2, ASG1, and ASG2, defined previously, is given in Figure 7.9a. By pushing selection down to fragments EMP1 and EMP2, the query reduces to that of Figure 7.9b. This is because the selection predicate conflicts with that of EMP1, and thus EMP1 can be removed. In order to discover conflicting join predicates, we distribute joins over unions. This produces the tree of Figure 7.9c. The left subtree joins two fragments, ASG1 and EMP2, whose qualifications conflict because of predicates TITLE = “Programmer” in ASG1, and TITLE ≠ “Programmer” in EMP2. Therefore the left subtree, which produces an empty relation, can be removed, and the reduced query of Figure 7.9d is obtained. This example illustrates the value of fragmentation in improving the execution performance of distributed queries.
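The reduction of Example 7.14 can be checked on a small data set. The Python sketch below builds the derived fragments with a semijoin and verifies that ASG1 ⋈ EMP2 is empty and that the reduced query returns the same tuples as the query on the global relations. The tuples are hypothetical sample data, not the database instance used in the book, and the helper functions semijoin, join, and select_title are illustrative names that assume ENO is the first attribute of both relations.

```python
# Hypothetical sample tuples, not the book's database instance.
EMP = [  # (ENO, ENAME, TITLE)
    ("E1", "A", "Programmer"),
    ("E2", "B", "Mech. Eng."),
    ("E3", "C", "Elect. Eng."),
    ("E4", "D", "Programmer"),
]
ASG = [  # (ENO, PNO, RESP, DUR)
    ("E1", "P1", "Manager", 12),
    ("E2", "P1", "Analyst", 24),
    ("E2", "P2", "Analyst", 6),
    ("E3", "P3", "Consultant", 10),
    ("E4", "P2", "Engineer", 18),
]


def semijoin(r, s):
    """Tuples of r whose ENO (first attribute) appears in s."""
    keys = {t[0] for t in s}
    return [t for t in r if t[0] in keys]


def join(r, s):
    """Equijoin on ENO, the first attribute of both relations."""
    return [a + e for a in r for e in s if a[0] == e[0]]


def select_title(emp, title):
    """Selection TITLE = title on an EMP relation or fragment."""
    return [e for e in emp if e[2] == title]


# Primary fragments of EMP and derived fragments of ASG (Example 7.13).
EMP1 = [e for e in EMP if e[2] == "Programmer"]
EMP2 = [e for e in EMP if e[2] != "Programmer"]
ASG1 = semijoin(ASG, EMP1)
ASG2 = semijoin(ASG, EMP2)

full = join(ASG, select_title(EMP, "Mech. Eng."))        # query on global relations
reduced = join(ASG2, select_title(EMP2, "Mech. Eng."))   # reduced query (Figure 7.9d)

print(join(ASG1, EMP2))  # []
print(reduced == full)   # True
```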

7.2.4 Reduction for Hybrid Fragmentation

Hybrid fragmentation is obtained by combining the fragmentation functions discussed above. The goal of hybrid fragmentation is to support, efficiently, queries involving projection, selection, and join. Note that the optimization of an operation or of a combination of operations is always done at the expense of other operations. For example, hybrid fragmentation based on selection-projection will make selection only, or projection only, less efficient than with horizontal fragmentation (or vertical fragmentation). The localization program for a hybrid fragmented relation uses unions and joins of fragments.

Fig. 7.9 Reduction for Indirect Fragmentation: (a) Localized query; (b) Query after pushing selection down; (c) Query after moving unions up; (d) Reduced query after eliminating the left subtree

Example 7.15. Here is an example of hybrid fragmentation of relation EMP:

EMP1 = σ_{ENO ≤ "E4"}(Π_{ENO,ENAME}(EMP))
EMP2 = σ_{ENO > "E4"}(Π_{ENO,ENAME}(EMP))
EMP3 = Π_{ENO,TITLE}(EMP)

In our example, the localization program is

EMP = (EMP1 ∪ EMP2) ⋈_{ENO} EMP3

Queries on hybrid fragments can be reduced by combining the rules used, respectively, in primary horizontal, vertical, and derived horizontal fragmentation. These rules can be summarized as follows:

1. Remove empty relations generated by contradicting selections on horizontal fragments.

2. Remove useless relations generated by projections on vertical fragments.

3. Distribute joins over unions in order to isolate and remove useless joins.

Example 7.16. The following example query in SQL illustrates the application of rules (1) and (2) to the horizontal-vertical fragmentation of relation EMP into EMP1, EMP2 and EMP3 given above:

SELECT ENAME
FROM EMP
WHERE ENO="E5"

The localized query of Figure 7.10a can be reduced by first pushing selection down, eliminating fragment EMP1, and then pushing projection down, eliminating fragment EMP3. The reduced query is given in Figure 7.10b.

Fig. 7.10 Reduction for Hybrid Fragmentation: (a) Localized query; (b) Reduced query

7.3 Conclusion

In this chapter we focused on the techniques for the query decomposition and data localization layers of the generic query processing scheme that was introduced in Chapter 6. Query decomposition and data localization are the two successive functions that map a calculus query, expressed on distributed relations, into an algebraic query (query decomposition), expressed on relation fragments (data localization).

These two layers can produce a localized query corresponding to the input query in a naive way. Query decomposition can generate an algebraic query simply by translating into relational operations the predicates and the target statement as they appear. Data localization can, in turn, express this algebraic query on relation fragments, by substituting for each distributed relation an algebraic query corresponding to its fragmentation rules.

Many algebraic queries may be equivalent to the same input query. The queries produced with the naive approach are inefficient in general, since important simplifications and optimizations have been missed. Therefore, a localized query expression is restructured using a few transformation rules and heuristics. The rules enable separation of unary operations, grouping of unary operations on the same relation, commuting of unary operations with binary operations, and permutation of the binary operations. Examples of heuristics are to push selections down the tree and do projection as early as possible. In addition to the transformation rules, data localization uses reduction rules to simplify the query further, and therefore optimize it. Two main types of rules may be used. The first one avoids the production of empty relations which are generated by contradicting predicates on the same relation(s). The second type of rule determines which fragments yield useless attributes.

The query produced by the query decomposition and data localization layers is good in the sense that the worst executions are avoided. However, the subsequent layers usually perform important optimizations, as they add to the query increasing detail about the processing environment. In particular, quantitative information regarding fragments has not yet been exploited. This information will be used by the query optimization layer for selecting an “optimal” strategy to execute the query. Query optimization is the subject of Chapter 8.

7.4 Bibliographic Notes

Traditional techniques for query decomposition are surveyed in [Jarke and Koch, 1984]. Techniques for semantic analysis and simplification of queries have their origins in [Rosenkrantz and Hunt, 1980]. The notion of query graph or connection graph is introduced in [Ullman, 1982]. The notion of query tree, which we called operator tree in this chapter, and the transformation rules to manipulate algebraic expressions have been introduced by Smith and Chang [1975] and developed in [Ullman, 1982]. Proofs of completeness and correctness of the rules are given in the latter reference.


Data localization is treated in detail in [Ceri and Pelagatti, 1983] for horizontally partitioned relations, which are referred to as multirelations. In particular, an algebra of qualified relations is defined as an extension of relational algebra, where a qualified relation is a relation name and the qualification of the fragment. Proofs of correctness and completeness of equivalence transformations between expressions of the algebra of qualified relations are also given. The formal properties of horizontal and vertical fragmentation are used in [Ceri et al., 1986] to characterize distributed joins over fragmented relations.

Exercises

Problem 7.1. Simplify the following query, expressed in SQL, on our example database using idempotency rules:

SELECT ENO
FROM ASG
WHERE RESP = "Analyst"
AND NOT(PNO="P2" OR DUR=12)
AND PNO != "P2"
AND DUR=12

Problem 7.2. Give the query graph of the following query, in SQL, on our example database:

SELECT ENAME, PNAME
FROM EMP, ASG, PROJ
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PROJ.PNO = ASG.PNO

and map it into an operator tree.

Problem 7.3 (*). Simplify the following query:

SELECT ENAME, PNAME
FROM EMP, ASG, PROJ
WHERE (DUR > 12 OR RESP = "Analyst")
AND EMP.ENO = ASG.ENO
AND (TITLE = "Elect. Eng." OR ASG.PNO < "P3")
AND (DUR > 12 OR RESP NOT= "Analyst")
AND ASG.PNO = PROJ.PNO

and transform it into an optimized operator tree using the restructuring algorithm (Section 7.1.4) where select and project operations are applied as soon as possible to reduce the size of intermediate relations.

Problem 7.4 (*). Transform the operator tree of Figure 7.5 back to the tree of Figure 7.3 using the restructuring algorithm. Describe each intermediate tree and show which rule the transformation is based on.


Problem 7.5 (**). Consider the following query on our Engineering database:

SELECT ENAME,SAL
FROM EMP,PROJ,ASG,PAY
WHERE EMP.ENO = ASG.ENO
AND EMP.TITLE = PAY.TITLE
AND (BUDGET>200000 OR DUR>24)
AND ASG.PNO = PROJ.PNO
AND (DUR>24 OR PNAME = "CAD/CAM")

Compose the selection predicate corresponding to the WHERE clause and transform it, using the idempotency rules, into the simplest equivalent form. Furthermore, compose an operator tree corresponding to the query and transform it, using relational algebra transformation rules, to three equivalent forms.

Problem 7.6. Assume that relation PROJ of the sample database is horizontally fragmented as follows:

PROJ1 = σ_{PNO ≤ "P2"}(PROJ)
PROJ2 = σ_{PNO > "P2"}(PROJ)

Transform the following query into a reduced query on fragments:

SELECT ENO, PNAME
FROM PROJ,ASG
WHERE PROJ.PNO = ASG.PNO
AND PNO = "P4"

Problem 7.7 (*). Assume that relation PROJ is horizontally fragmented as in Problem 7.6, and that relation ASG is horizontally fragmented as

ASG1 = σ_{PNO ≤ "P2"}(ASG)
ASG2 = σ_{"P2" < PNO ≤ "P3"}(ASG)
ASG3 = σ_{PNO > "P3"}(ASG)

Transform the following query into a reduced query on fragments, and determine whether it is better than the localized query:

SELECT RESP, BUDGET
FROM ASG, PROJ
WHERE ASG.PNO = PROJ.PNO
AND PNAME = "CAD/CAM"

Problem 7.8 (**). Assume that relation PROJ is fragmented as in Problem 7.6. Furthermore, relation ASG is indirectly fragmented as

ASG1 = ASG ⋉_{PNO} PROJ1
ASG2 = ASG ⋉_{PNO} PROJ2

and relation EMP is vertically fragmented as

EMP1 = Π_{ENO,ENAME}(EMP)
EMP2 = Π_{ENO,TITLE}(EMP)

Transform the following query into a reduced query on fragments:

SELECT ENAME
FROM EMP,ASG,PROJ
WHERE PROJ.PNO = ASG.PNO
AND PNAME = "Instrumentation"
AND EMP.ENO = ASG.ENO

Chapter 8 Optimization of Distributed Queries

Chapter 7 shows how a calculus query expressed on global relations can be mapped into a query on relation fragments by decomposition and data localization. This mapping uses the global and fragment schemas. During this process, the application of transformation rules permits the simplification of the query by eliminating common subexpressions and useless expressions. This type of optimization is independent of fragment characteristics such as cardinalities. The query resulting from decomposition and localization can be executed in that form simply by adding communication primitives in a systematic way. However, the permutation of the ordering of operations within the query can provide many equivalent strategies to execute it. Finding an “optimal” ordering of operations for a given query is the main role of the query optimization layer, or optimizer for short.

Selecting the optimal execution strategy for a query is NP-hard in the number of relations [Ibaraki and Kameda, 1984]. For complex queries with many relations, this can incur a prohibitive optimization cost. Therefore, the actual objective of the optimizer is to find a strategy close to optimal and, perhaps more important, to avoid bad strategies. In this chapter we refer to the strategy (or operation ordering) produced by the optimizer as the optimal strategy (or optimal ordering). The output of the optimizer is an optimized query execution plan consisting of the algebraic query specified on fragments and the communication operations to support the execution of the query over the fragment sites.

The selection of the optimal strategy generally requires the prediction of execution costs of the alternative candidate orderings prior to actually executing the query. The execution cost is expressed as a weighted combination of I/O, CPU, and communication costs. A typical simplification of the earlier distributed query optimizers was to ignore local processing cost (I/O and CPU costs) by assuming that the communication cost is dominant. Important inputs to the optimizer for estimating execution costs are fragment statistics and formulas for estimating the cardinalities of results of relational operations. In this chapter we focus mostly on the ordering of join operations for two reasons: it is a well-understood problem, and queries involving joins, selections, and projections are usually considered to be the most frequent type. Furthermore, it is easier to generalize the basic algorithm for other binary operations, such as union, intersection and difference. We also discuss how the semijoin operation can help to process join queries efficiently.

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_8, © Springer Science+Business Media, LLC 2011

This chapter is organized as follows. In Section 8.1 we introduce the main components of query optimization, including the search space, the search strategy and the cost model. Query optimization in centralized systems is described in Section 8.2 as a prerequisite to understand distributed query optimization, which is more complex. In Section 8.3 we discuss the major optimization issue, which deals with the join ordering in distributed queries. We also examine alternative join strategies based on semijoin. In Section 8.4 we illustrate the use of the techniques and concepts in four basic distributed query optimization algorithms.

8.1 Query Optimization

This section introduces query optimization in general, i.e., independent of whether the environment is centralized or distributed. The input query is supposed to be expressed in relational algebra on database relations (which can obviously be fragments) after query rewriting from a calculus expression.

Query optimization refers to the process of producing a query execution plan (QEP) which represents an execution strategy for the query. This QEP minimizes an objective cost function. A query optimizer, the software module that performs query optimization, is usually seen as consisting of three components: a search space, a cost model, and a search strategy (see Figure 8.1). The search space is the set of alternative execution plans that represent the input query. These plans are equivalent, in the sense that they yield the same result, but they differ in the execution order of operations and the way these operations are implemented, and therefore in their performance. The search space is obtained by applying transformation rules, such as those for relational algebra described in Section 7.1.4. The cost model predicts the cost of a given execution plan. To be accurate, the cost model must have good knowledge about the distributed execution environment. The search strategy explores the search space and selects the best plan, using the cost model. It defines which plans are examined and in which order. The details of the environment (centralized versus distributed) are captured by the search space and the cost model.

8.1.1 Search Space

Query execution plans are typically abstracted by means of operator trees (see Section 7.1.4), which define the order in which the operations are executed. They are enriched with additional information, such as the best algorithm chosen for each operation. For a given query, the search space can thus be defined as the set of equivalent operator trees that can be produced using transformation rules. To characterize query optimizers, it is useful to concentrate on join trees, which are operator trees whose operators are join or Cartesian product. This is because permutations of the join order have the most important effect on performance of relational queries.

Fig. 8.1 Query Optimization Process (input query; search space generation, using transformation rules; equivalent QEPs; search strategy, using the cost model; best QEP)

Fig. 8.2 Equivalent Join Trees: panels (a), (b), and (c)

Example 8.1. Consider the following query:

SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO
AND ASG.PNO=PROJ.PNO

Figure 8.2 illustrates three equivalent join trees for that query, which are obtained by exploiting the associativity of binary operators. Each of these join trees can be assigned a cost based on the estimated cost of each operator. Join tree (c), which starts with a Cartesian product, may have a much higher cost than the other join trees.

Fig. 8.3 The Two Major Shapes of Join Trees

For a complex query (involving many relations and many operators), the number of equivalent operator trees can be very high. For instance, the number of alternative join trees that can be produced by applying the commutativity and associativity rules is O(N!) for N relations. Investigating a large search space may make optimization time prohibitive, sometimes much more expensive than the actual execution time. Therefore, query optimizers typically restrict the size of the search space they consider. The first restriction is to use heuristics. The most common heuristic is to perform selection and projection when accessing base relations. Another common heuristic is to avoid Cartesian products that are not required by the query. For instance, in Figure 8.2, operator tree (c) would not be part of the search space considered by the optimizer.
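The effect of the heuristic that avoids Cartesian products can be seen by enumerating join orders for the query of Example 8.1. The Python sketch below considers only linear join orders, represented as permutations of the relation names, and discards every order in which some relation has no join predicate with the relations joined before it. The set JOIN_EDGES and the function needs_cartesian_product are hypothetical names introduced for illustration.

```python
from itertools import permutations

# Join graph of Example 8.1: an edge means the query has a join predicate
# between the two relations (EMP.ENO = ASG.ENO and ASG.PNO = PROJ.PNO).
JOIN_EDGES = {frozenset({"EMP", "ASG"}), frozenset({"ASG", "PROJ"})}


def needs_cartesian_product(order):
    """True if some relation has no join predicate with the ones before it."""
    seen = {order[0]}
    for rel in order[1:]:
        if not any(frozenset({rel, s}) in JOIN_EDGES for s in seen):
            return True
        seen.add(rel)
    return False


all_orders = list(permutations(["EMP", "ASG", "PROJ"]))
kept = [o for o in all_orders if not needs_cartesian_product(o)]

print(len(all_orders), len(kept))  # 6 4
```

The two discarded orders are those that start by combining EMP and PROJ, which share no join predicate, as in join tree (c) of Figure 8.2.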

Another important restriction is with respect to the shape of the join tree. Two kinds of join trees are usually distinguished: linear versus bushy trees (see Figure 8.3). A linear tree is a tree such that at least one operand of each operator node is a base relation. A bushy tree is more general and may have operators with no base relations as operands (i.e., both operands are intermediate relations). By considering only linear trees, the size of the search space is reduced to O(2^N). However, in a distributed environment, bushy trees are useful in exhibiting parallelism. For example, in join tree (b) of Figure 8.3, operations R1 ⋈ R2 and R3 ⋈ R4 can be done in parallel.

8.1.2 Search Strategy

The most popular search strategy used by query optimizers is dynamic programming, which is deterministic. Deterministic strategies proceed by building plans, starting from base relations, joining one more relation at each step until complete plans are obtained, as in Figure 8.4. Dynamic programming builds all possible plans, breadth-first, before it chooses the “best” plan. To reduce the optimization cost, partial plans that are not likely to lead to the optimal plan are pruned (i.e., discarded) as soon as possible. By contrast, another deterministic strategy, the greedy algorithm, builds only one plan, depth-first.

Fig. 8.3 panels: (a) linear join tree; (b) bushy join tree

Fig. 8.4 Optimizer Actions in a Deterministic Strategy (Step 1, Step 2, Step 3)

Fig. 8.5 Optimizer Action in a Randomized Strategy

Dynamic programming is almost exhaustive and assures that the “best” of all plans is found. It incurs an acceptable optimization cost (in terms of time and space) when the number of relations in the query is small. However, this approach becomes too expensive when the number of relations is greater than 5 or 6. For more complex queries, randomized strategies have been proposed, which reduce the optimization complexity but do not guarantee the best of all plans. Unlike deterministic strategies, randomized strategies allow the optimizer to trade optimization time for execution time [Lanzelotte et al., 1993].
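The following Python sketch gives a rough idea of how dynamic programming builds linear plans level by level and prunes all but the cheapest partial plan for each set of relations. It is a simplified illustration, not the algorithm of any particular optimizer: the cardinalities in CARD, the join selectivities in SEL, and the cost measure (the sum of the sizes of the intermediate and final results) are assumptions made for the example, and the names result_size and best_linear_plan are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics, chosen only to make the example concrete.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SEL = {
    frozenset({"EMP", "ASG"}): Fraction(1, 400),
    frozenset({"ASG", "PROJ"}): Fraction(1, 200),
}


def result_size(rels):
    """Estimated size of joining rels; a missing predicate means selectivity 1."""
    size = Fraction(1)
    for r in rels:
        size *= CARD[r]
    for a, b in combinations(sorted(rels), 2):
        size *= SEL.get(frozenset({a, b}), Fraction(1))
    return size


def best_linear_plan(relations):
    """Cheapest linear join order, built bottom-up over sets of relations."""
    # best[S] = (cost, order) of the cheapest plan joining exactly the set S;
    # every other plan for S is pruned.
    best = {frozenset({r}): (Fraction(0), (r,)) for r in relations}
    for k in range(2, len(relations) + 1):          # one level per plan size
        for subset in map(frozenset, combinations(relations, k)):
            size = result_size(subset)
            candidates = []
            for last in subset:                     # relation joined last
                cost, order = best[subset - {last}]
                candidates.append((cost + size, order + (last,)))
            best[subset] = min(candidates)
    return best[frozenset(relations)]


cost, order = best_linear_plan(["EMP", "ASG", "PROJ"])
print(order, cost)  # ('ASG', 'PROJ', 'EMP') 1000
```

The table best holds one entry per subset of relations, which is why the approach becomes expensive as the number of relations grows.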

Randomized strategies, such as Simulated Annealing [Ioannidis and Wong, 1987] and Iterative Improvement [Swami, 1989], concentrate on searching for the optimal solution around some particular points. They do not guarantee that the best solution is obtained, but avoid the high cost of optimization, in terms of memory and time consumption. First, one or more start plans are built by a greedy strategy. Then, the algorithm tries to improve the start plan by visiting its neighbors. A neighbor is obtained by applying a random transformation to a plan. An example of a typical transformation consists of exchanging two randomly chosen operand relations of the plan, as in Figure 8.5. It has been shown experimentally that randomized strategies provide better performance than deterministic strategies as soon as the query involves more than several relations [Lanzelotte et al., 1993].
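A very small version of this neighbor search can be sketched in Python as follows. The sketch is a plain hill-climbing loop in the spirit of Iterative Improvement, not a faithful implementation of the published algorithms: the start plan is simply given rather than built greedily, only one start plan is used, and the statistics in CARD and SEL as well as the cost measure (sum of intermediate result sizes of a linear join order) are hypothetical.

```python
import random

# Hypothetical statistics for a chain query R1 - R2 - R3 - R4 - R5.
CARD = {"R1": 1000, "R2": 200, "R3": 5000, "R4": 50, "R5": 800}
SEL = {("R1", "R2"): 0.01, ("R2", "R3"): 0.001,
       ("R3", "R4"): 0.02, ("R4", "R5"): 0.005}


def plan_cost(order):
    """Sum of intermediate result sizes for a linear join order."""
    joined, size, cost = [order[0]], float(CARD[order[0]]), 0.0
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            # A missing join predicate means a Cartesian product (selectivity 1).
            size *= SEL.get((prev, rel), SEL.get((rel, prev), 1.0))
        joined.append(rel)
        cost += size
    return cost


def iterative_improvement(start, steps=200, seed=0):
    """Repeatedly move to a cheaper neighbor obtained by a random swap."""
    rng = random.Random(seed)
    best, best_cost = list(start), plan_cost(start)
    for _ in range(steps):
        i, j = rng.sample(range(len(best)), 2)   # two random operand relations
        neighbor = list(best)
        neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
        neighbor_cost = plan_cost(neighbor)
        if neighbor_cost < best_cost:            # accept only improvements
            best, best_cost = neighbor, neighbor_cost
    return best, best_cost


start = ["R1", "R3", "R5", "R2", "R4"]
plan, cost = iterative_improvement(start)
print(plan, cost, "start cost:", plan_cost(start))
```

The result depends on the random seed and the number of steps, which reflects the trade-off between optimization time and plan quality mentioned above.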

8.1.3 Distributed Cost Model

An optimizer’s cost model includes cost functions to predict the cost of operators, statistics and base data, and formulas to evaluate the sizes of intermediate results. The cost is in terms of execution time, so a cost function represents the execution time of a query.

8.1.3.1 Cost Functions

The cost of a distributed execution strategy can be expressed with respect to either the total time or the response time. The total time is the sum of all time (also referred to as cost) components, while the response time is the elapsed time from the initiation to the completion of the query. A general formula for determining the total time can be specified as follows [Lohman et al., 1985]:

Total time = TCPU ∗#insts+TI/O ∗#I/Os+TMSG ∗#msgs+TT R ∗#bytes

The two first components measure the local processing time, where TCPU is the time of a CPU instruction and TI/O is the time of a disk I/O. The communication time is depicted by the two last components. TMSG is the fixed time of initiating and receiving a message, while TT R is the time it takes to transmit a data unit from one site to another. The data unit is given here in terms of bytes (#bytes is the sum of the sizes of all messages), but could be in different units (e.g., packets). A typical assumption is that TT R is constant. This might not be true for wide area networks, where some sites are farther away than others. However, this assumption greatly simplifies query optimization. Thus the communication time of transferring #bytes of data from one site to another is assumed to be a linear function of #bytes:

CT(#bytes) = TMSG + TTR ∗ #bytes

Costs are generally expressed in terms of time units, which in turn, can be translated into other units (e.g., dollars).
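As a minimal sketch, the two formulas above translate directly into code. The class and function names are hypothetical, and the coefficient values used to exercise them would have to be measured for a particular environment.

```python
from dataclasses import dataclass

@dataclass
class CostCoefficients:
    t_cpu: float  # TCPU: time of one CPU instruction
    t_io: float   # TI/O: time of one disk I/O
    t_msg: float  # TMSG: fixed time of initiating and receiving a message
    t_tr: float   # TTR: time to transmit one data unit (here, one byte)

def total_time(c, insts, ios, msgs, nbytes):
    """Total time = TCPU*#insts + TI/O*#I/Os + TMSG*#msgs + TTR*#bytes."""
    return c.t_cpu * insts + c.t_io * ios + c.t_msg * msgs + c.t_tr * nbytes

def communication_time(c, nbytes):
    """CT(#bytes) = TMSG + TTR*#bytes, for a single message."""
    return c.t_msg + c.t_tr * nbytes
```

Keeping the four coefficients in one record makes it easy to model different environments (for example, a wide area network with a large `t_msg` and `t_tr`) without touching the formulas.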

The relative values of the cost coefficients characterize the distributed database environment. The topology of the network greatly influences the ratio between these components. In a wide area network such as the Internet, the communication time is generally the dominant factor. In local area networks, however, there is more of a balance among the components. Earlier studies cite ratios of communication time to I/O time for one page to be on the order of 20:1 for wide area networks [Selinger and Adiba, 1980] while it is 1:1.6 for a typical early generation Ethernet (10Mbps) [Page and Popek, 1985]. Thus, most early distributed DBMSs designed for wide area networks have ignored the local processing cost and concentrated on minimizing the communication cost. Distributed DBMSs designed for local area networks, on the other hand, consider all three cost components. The new faster networks, both at the wide area network and at the local area network levels, have improved the above ratios in favor of communication cost when all things are equal. However, communication is still the dominant time factor in wide area networks such as the Internet because of the longer distances that data are retrieved from (or shipped to).

When the response time of the query is the objective function of the optimizer, parallel local processing and parallel communications must also be considered [Khoshafian and Valduriez, 1987]. A general formula for response time is

Response time = TCPU ∗ seq #insts + TI/O ∗ seq #I/Os + TMSG ∗ seq #msgs + TTR ∗ seq #bytes

where seq #x, in which x can be instructions (insts), I/O, messages (msgs) or bytes, is the maximum number of x which must be done sequentially for the execution of the query. Thus any processing and communication done in parallel is ignored.

Example 8.2. Let us illustrate the difference between total cost and response time using the example of Figure 8.6, which computes the answer to a query at site 3 with data from sites 1 and 2. For simplicity, we assume that only communication cost is considered.

Fig. 8.6 Example of Data Transfers for a Query (site 1 sends x units to site 3; site 2 sends y units to site 3)

Assume that TMSG and TTR are expressed in time units. The total time of transferring x data units from site 1 to site 3 and y data units from site 2 to site 3 is

Total time = 2 ∗ TMSG + TTR ∗ (x + y)

The response time of the same query can be approximated as

Response time = max{TMSG + TTR ∗ x, TMSG + TTR ∗ y}

since the transfers can be done in parallel.
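The difference between the two measures in Example 8.2 can be checked numerically. The sketch below assumes, as the example does, that only communication cost counts; the function names and any values passed to them are illustrative.

```python
def transfer_time(t_msg, t_tr, units):
    """Time of one transfer: TMSG + TTR * units."""
    return t_msg + t_tr * units

def total_and_response(t_msg, t_tr, x, y):
    """Return (total time, response time) for two transfers to the same site."""
    from_site1 = transfer_time(t_msg, t_tr, x)
    from_site2 = transfer_time(t_msg, t_tr, y)
    # Total time adds both transfers; response time keeps only the longer one,
    # because the two transfers proceed in parallel.
    return from_site1 + from_site2, max(from_site1, from_site2)
```

With hypothetical values TMSG = 10, TTR = 1, x = 100 and y = 40, the two transfers take 110 and 50 time units, so the total time is 160 while the response time is 110.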

Minimizing response time is achieved by increasing the degree of parallel execution. This does not, however, imply that the total time is also minimized. On the contrary, it can increase the total time, for example, by having more parallel local processing and transmissions. Minimizing the total time implies that the utilization of the resources improves, thus increasing the system throughput. In practice, a compromise between the two is desired. In Section 8.4 we present algorithms that can optimize a combination of total time and response time, with more weight on one of them.


8.1.3.2 Database Statistics

The main factor affecting the performance of an execution strategy is the size of the intermediate relations that are produced during the execution. When a subsequent operation is located at a different site, the intermediate relation must be transmitted over the network. Therefore, it is of prime interest to estimate the size of the intermediate results of relational algebra operations in order to minimize the size of data transfers. This estimation is based on statistical information about the base relations and formulas to predict the cardinalities of the results of the relational operations. There is a direct trade-off between the precision of the statistics and the cost of managing them, the more precise statistics being the more costly [Piatetsky-Shapiro and Connell, 1984]. For a relation R defined over the attributes A = {A1, A2, . . . , An} and fragmented as R1, R2, . . . , Rr, the statistical data typically are the following:

1. For each attribute Ai, its length (in number of bytes), denoted by length(Ai), and for each attribute Ai of each fragment Rj, the number of distinct values of Ai, that is, the cardinality of the projection of fragment Rj on Ai, denoted by card(ΠAi(Rj)).

2. For the domain of each attribute Ai, which is defined on a set of values that can be ordered (e.g., integers or reals), the minimum and maximum possible values, denoted by min(Ai) and max(Ai).

3. For the domain of each attribute Ai, the cardinality of the domain of Ai, denoted by card(dom[Ai]). This value gives the number of unique values in dom[Ai].

4. The number of tuples in each fragment Rj, denoted by card(Rj).

In addition, for each attribute Ai, there may be a histogram that approximates the frequency distribution of the attribute within a number of buckets, each corresponding to a range of values.

Sometimes, the statistical data also include the join selectivity factor for some pairs of relations, that is the proportion of tuples participating in the join. The join selectivity factor, denoted SFJ , of relations R and S is a real value between 0 and 1:

SFJ(R, S) = card(R ⋈ S) / (card(R) ∗ card(S))

For example, a join selectivity factor of 0.5 corresponds to a very large joined relation, while 0.001 corresponds to a small one. We say that the join has bad (or low) selectivity in the former case and good (or high) selectivity in the latter case.

These statistics are useful to predict the size of intermediate relations. Remember that in Chapter 3 we defined the size of an intermediate relation R as follows:

size(R) = card(R)∗ length(R)


where length(R) is the length (in bytes) of a tuple of R, computed from the lengths of its attributes. The estimation of card(R), the number of tuples in R, requires the use of the formulas given in the following section.

8.1.3.3 Cardinalities of Intermediate Results

Database statistics are useful in evaluating the cardinalities of the intermediate results of queries. Two simplifying assumptions are commonly made about the database. The distribution of attribute values in a relation is supposed to be uniform, and all attributes are independent, meaning that the value of an attribute does not affect the value of any other attribute. These two assumptions are often wrong in practice, but they make the problem tractable. In what follows we give the formulas for estimating the cardinalities of the results of the basic relational algebra operations (selection, projection, Cartesian product, join, semijoin, union, and difference). The operand relations are denoted by R and S. The selectivity factor of an operation, that is, the proportion of tuples of an operand relation that participate in the result of that operation, is denoted SFOP, where OP denotes the operation.

Selection.

The cardinality of selection is

card(σF(R)) = SFS(F)∗ card(R)

where SFS(F) is dependent on the selection formula and can be computed as follows [Selinger et al., 1979], where p(Ai) and p(Aj) indicate predicates over attributes Ai and Aj, respectively:

SFS(A = value) = 1 / card(ΠA(R))

SFS(A > value) = (max(A) − value) / (max(A) − min(A))

SFS(A < value) = (value − min(A)) / (max(A) − min(A))

SFS(p(Ai) ∧ p(Aj)) = SFS(p(Ai)) ∗ SFS(p(Aj))

SFS(p(Ai) ∨ p(Aj)) = SFS(p(Ai)) + SFS(p(Aj)) − (SFS(p(Ai)) ∗ SFS(p(Aj)))

SFS(A ∈ {values}) = SFS(A = value) ∗ card({values})
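The selection formulas above can be written as small functions, which also makes it easy to compose them for conjunctions and disjunctions. The function names are hypothetical, and the sketch inherits the uniformity and independence assumptions stated earlier.

```python
def sf_eq(distinct_values):
    """SFS(A = value) = 1 / card(PI_A(R))."""
    return 1.0 / distinct_values

def sf_gt(value, min_a, max_a):
    """SFS(A > value) = (max(A) - value) / (max(A) - min(A))."""
    return (max_a - value) / (max_a - min_a)

def sf_lt(value, min_a, max_a):
    """SFS(A < value) = (value - min(A)) / (max(A) - min(A))."""
    return (value - min_a) / (max_a - min_a)

def sf_and(sf1, sf2):
    """Conjunction of two predicates, assuming independent attributes."""
    return sf1 * sf2

def sf_or(sf1, sf2):
    """Disjunction of two predicates, assuming independent attributes."""
    return sf1 + sf2 - sf1 * sf2

def sf_in(distinct_values, n_values):
    """SFS(A in {values}) = SFS(A = value) * card({values})."""
    return sf_eq(distinct_values) * n_values

def card_selection(sf, card_r):
    """card(sigma_F(R)) = SFS(F) * card(R)."""
    return sf * card_r
```

For example, for an attribute ranging over [0, 30], `sf_gt(18, 0, 30)` is 12/30, and combining it with an equality predicate on another attribute is a single call to `sf_and`.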


Projection.

As indicated in Section 2.1, projection can be with or without duplicate elimination. We consider projection with duplicate elimination. An arbitrary projection is difficult to evaluate precisely because the correlations between projected attributes are usually unknown [Gelenbe and Gardy, 1982]. However, there are two particularly useful cases where it is trivial. If the projection of relation R is based on a single attribute A, the cardinality is simply the number of tuples when the projection is performed. If one of the projected attributes is a key of R, then

card(ΠA(R)) = card(R)

Cartesian product.

The cardinality of the Cartesian product of R and S is simply

card(R×S) = card(R)∗ card(S)

Join.

There is no general way to estimate the cardinality of a join without additional information. The upper bound of the join cardinality is the cardinality of the Cartesian product. It has been used in the earlier distributed DBMS (e.g. [Epstein et al., 1978]), but it is a quite pessimistic estimate. A more realistic solution is to divide this upper bound by a constant to reflect the fact that the join result is smaller than that of the Cartesian product [Selinger and Adiba, 1980]. However, there is a case, which occurs frequently, where the estimation is simple. If relation R is equijoined with S over attribute A from R, and B from S, where A is a key of relation R, and B is a foreign key of relation S, the cardinality of the result can be approximated as

card(R ⋈A=B S) = card(S)

because each tuple of S matches with at most one tuple of R. Obviously, the same thing is true if B is a key of S and A is a foreign key of R. However, this estimation is an upper bound since it assumes that each tuple of R participates in the join. For other important joins, it is worthwhile to maintain their join selectivity factor SFJ as part of statistical information. In that case the result cardinality is simply

card(R ⋈ S) = SFJ ∗ card(R) ∗ card(S)
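The three join estimates discussed in this subsection, together with the size formula size(R) = card(R) ∗ length(R), can be summarized as follows. The function names are hypothetical and the sketch simply restates the formulas.

```python
def card_join_upper_bound(card_r, card_s):
    """Pessimistic estimate: the cardinality of the Cartesian product."""
    return card_r * card_s

def card_join_key_fk(card_fk_relation):
    """Equijoin of a key with a foreign key: at most the foreign-key side's cardinality."""
    return card_fk_relation

def card_join(sf_j, card_r, card_s):
    """card(R join S) = SFJ * card(R) * card(S)."""
    return sf_j * card_r * card_s

def size(card, tuple_length):
    """size(R) = card(R) * length(R), in bytes."""
    return card * tuple_length
```

With hypothetical values card(R) = 400, card(S) = 1000 and SFJ = 0.001, the Cartesian product bound is 400,000 tuples, whereas the selectivity-based estimate is 400 tuples.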


Semijoin.

The selectivity factor of the semijoin of R by S gives the fraction (percentage) of tuples of R that join with tuples of S. An approximation for the semijoin selectivity factor is given by Hevner and Yao [1979] as

SFSJ(R ⋉A S) = card(ΠA(S)) / card(dom[A])

This formula depends only on attribute A of S. Thus it is often called the selectivity factor of attribute A of S, denoted SFSJ(S.A), and is the selectivity factor of S.A on any other joinable attribute. Therefore, the cardinality of the semijoin is given by

card(R ⋉A S) = SFSJ(S.A) ∗ card(R)

This approximation can be verified on a very frequent case, that of R.A being a foreign key of S (S.A is a primary key). In this case, the semijoin selectivity factor is 1 since card(ΠA(S)) = card(dom[A]), yielding that the cardinality of the semijoin is card(R).
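A short sketch of the semijoin estimate, with hypothetical function names, shows both the general case and the foreign key case just mentioned.

```python
def sf_semijoin(distinct_a_in_s, card_dom_a):
    """SFSJ(S.A) = card(PI_A(S)) / card(dom[A])."""
    return distinct_a_in_s / card_dom_a

def card_semijoin(distinct_a_in_s, card_dom_a, card_r):
    """card(R semijoin_A S) = SFSJ(S.A) * card(R)."""
    return sf_semijoin(distinct_a_in_s, card_dom_a) * card_r
```

If S.A is a primary key that covers the whole domain, the two arguments of `sf_semijoin` are equal, the factor is 1, and the estimate reduces to card(R).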

Union.

It is quite difficult to estimate the cardinality of the union of R and S because the duplicates between R and S are removed by the union. We give only the simple formulas for the upper and lower bounds, which are, respectively,

card(R)+ card(S)

max{card(R),card(S)}

Note that these formulas assume that R and S do not contain duplicate tuples.

Difference.

Like the union, we give only the upper and lower bounds. The upper bound of card(R−S) is card(R), whereas the lower bound is 0.

More complex predicates with conjunction and disjunction can also be handled by using the formulas given above.

8.1.3.4 Using Histograms for Selectivity Estimation

The formulae above for estimating the cardinalities of intermediate results of queries rely on the strong assumption that the distribution of attribute values in a relation is uniform. The advantage of this assumption is that the cost of managing the statistics is minimal since only the number of distinct attribute values is needed. However, this assumption is not practical. In case of skewed data distributions, it can result in fairly inaccurate estimations and QEPs which are far from the optimal.

An effective solution to accurately capture data distributions is to use histograms. Today, most commercial DBMS optimizers support histograms as part of their cost model. Various kinds of histograms have been proposed for estimating the selectivity of query predicates with different trade-offs between accuracy and maintenance cost [Poosala et al., 1996]. To illustrate the use of histograms, we use the basic definition by Bruno and Chaudhuri [2002]. A histogram on attribute A from R is a set of buckets. Each bucket bi describes a range of values of A, denoted by rangei, with its associated frequency fi and number of distinct values di. fi gives the number of tuples of R where R.A ∈ rangei. di gives the number of distinct values of A where R.A ∈ rangei. This representation of a relation’s attribute can capture non-uniform distributions of values, with the buckets adapted to the different ranges. However, within a bucket, the distribution of attribute values is assumed to be uniform.

Histograms can be used to accurately estimate the selectivity of selection operations. They can also be used for more complex queries including selection, projection and join. However, the precise estimation of join selectivity remains difficult and depends on the type of the histogram [Poosala et al., 1996]. We now illustrate the use of histograms with two important selection predicates: equality and range predicate.

Equality predicate.

With value ∈ rangei, we simply have: SFS(A = value) = 1/di.

Range predicate.

Computing the selectivity of range predicates such as A ≤ value, A < value and A > value requires identifying the relevant buckets and summing up their frequencies. Let us consider the range predicate R.A ≤ value with value ∈ rangei. To estimate the number of tuples of R that satisfy this predicate, we must sum up the frequencies of all buckets which precede bucket i and the estimated number of tuples that satisfy the predicate in bucket bi. Assuming uniform distribution of attribute values in bi, we have:

card(σA≤value(R)) = Σ(j=1..i−1) fj + ((value − min(rangei)) / (max(rangei) − min(rangei))) ∗ fi

The cardinality of other range predicates can be computed in a similar way.

Example 8.3. Figure 8.7 shows a possible 4-bucket histogram for attribute DUR of a relation ASG with 300 tuples. Let us consider the equality predicate ASG.DUR=18. Since the value “18” fits in bucket b3, the selectivity factor is 1/12. Since the frequency of b3 is 50, the cardinality of the selection is 50/12 ≈ 4.2 tuples (rounded up to 5). Let us now consider the range predicate ASG.DUR ≤ 18. We have min(range3) = 12 and max(range3) = 24. The cardinality of the selection is: 100 + 75 + (((18 − 12)/(24 − 12)) ∗ 50) = 200 tuples.

Fig. 8.7 Histogram of Attribute ASG.DUR (frequency per bucket for four buckets b1 to b4 with boundaries 0, 6, 12, 24 and 30; d3 = 12; card(ASG) = 300)
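The histogram-based estimates of Example 8.3 can be reproduced with a small sketch. The bucket frequencies 100, 75 and 50 and d3 = 12 come from the example; the frequency of b4 (75) is inferred from card(ASG) = 300, and the distinct-value counts of b1, b2 and b4 are placeholders chosen only so the structure is complete. The class and function names, and the rule that a boundary value belongs to the lower bucket, are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    lo: float      # min(range_i)
    hi: float      # max(range_i)
    freq: int      # f_i: number of tuples whose value falls in the range
    distinct: int  # d_i: number of distinct values in the range

def card_equality(hist, value):
    """Estimated tuples with A = value: f_i * (1 / d_i) for the bucket holding value."""
    for b in hist:
        if b.lo <= value <= b.hi:  # boundary values go to the lower bucket (assumption)
            return b.freq / b.distinct
    return 0.0

def card_le(hist, value):
    """Estimated tuples with A <= value: full preceding buckets plus a uniform share."""
    total = 0.0
    for b in hist:
        if value >= b.hi:
            total += b.freq  # bucket lies entirely below value
        elif value > b.lo:
            total += (value - b.lo) / (b.hi - b.lo) * b.freq
            break
        else:
            break
    return total

# Histogram of ASG.DUR; distinct counts other than d3 = 12 are placeholders.
asg_dur = [Bucket(0, 6, 100, 6), Bucket(6, 12, 75, 6),
           Bucket(12, 24, 50, 12), Bucket(24, 30, 75, 6)]
```

Calling `card_le(asg_dur, 18)` adds the frequencies of b1 and b2 to half of b3, and `card_equality(asg_dur, 18)` divides the frequency of b3 by d3, following the two computations in the example.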

8.2 Centralized Query Optimization

In this section we present the main query optimization techniques for centralized systems. This presentation is a prerequisite to understanding distributed query optimization for three reasons. First, a distributed query is translated into local queries, each of which is processed in a centralized way. Second, distributed query optimization techniques are often extensions of the techniques for centralized systems. Finally, centralized query optimization is a simpler problem; the minimization of communication costs makes distributed query optimization more complex.

As discussed in Chapter 6, the optimization timing, which can be dynamic, static or hybrid, is a good basis for classifying query optimization techniques. Therefore, we present a representative technique of each class.

8.2.1 Dynamic Query Optimization

Dynamic query optimization combines the two phases of query decomposition and optimization with execution. The QEP is dynamically constructed by the query optimizer which makes calls to the DBMS execution engine for executing the query’s operations. Thus, there is no need for a cost model.


The most popular dynamic query optimization algorithm is that of INGRES [Stonebraker et al., 1976], one of the first relational DBMSs. In this section, we present this algorithm based on the detailed description by Wong and Youssefi [1976]. The algorithm recursively breaks up a query expressed in relational calculus (i.e., SQL) into smaller pieces which are executed along the way. The query is first decomposed into a sequence of queries having a unique relation in common. Then each monorelation query is processed by selecting, based on the predicate, the best access method to that relation (e.g., index, sequential scan). For example, if the predicate is of the form A = value, an index available on attribute A would be used if it exists. However, if the predicate is of the form A ≠ value, an index on A would not help, and sequential scan should be used.

The algorithm first executes the unary (monorelation) operations and tries to minimize the sizes of intermediate results in ordering binary (multirelation) operations. Let us denote by qi−1 → qi a query q decomposed into two subqueries, qi−1 and qi, where qi−1 is executed first and its result is consumed by qi. Given an n-relation query q, the optimizer decomposes q into n subqueries q1 → q2 → ··· → qn. This decomposition uses two basic techniques: detachment and substitution. These techniques are presented and illustrated in the rest of this section.

Detachment is the first technique employed by the query processor. It breaks a query q into q′ → q′′, based on a common relation that is the result of q′. Consider a query q expressed in SQL of the form

    SELECT R2.A2, R3.A3, . . . , Rn.An
    FROM R1, R2, . . . , Rn
    WHERE P1(R1.A′1)
    AND P2(R1.A1, R2.A2, . . . , Rn.An)

where Ai and A′i are lists of attributes of relation Ri, P1 is a predicate involving attributes from relation R1, and P2 is a multirelation predicate involving attributes of relations R1, R2, . . . , Rn. Such a query may be decomposed into two subqueries, q′ followed by q′′, by detachment of the common relation R1:

    q′: SELECT R1.A1 INTO R′1
        FROM R1
        WHERE P1(R1.A′1)

where R′1 is a temporary relation containing the information necessary for the continuation of the query:

    q′′: SELECT R2.A2, . . . , Rn.An
         FROM R′1, R2, . . . , Rn
         WHERE P2(R′1.A1, . . . , Rn.An)

This step has the effect of reducing the size of the relation on which the query q′′ is defined. Furthermore, the created relation R′1 may be stored in a particular structure to speed up the following subqueries. For example, the storage of R′1 in a hashed file on the join attributes of q′′ will make processing the join more efficient. Detachment extracts the select operations, which are usually the most selective ones. Therefore, detachment is systematically done whenever possible. Note that this can have adverse effects on performance if the selection has bad selectivity.

Example 8.4. To illustrate the detachment technique, we apply it to the following query:

“Names of employees working on the CAD/CAM project”

This query can be expressed in SQL by the following query q1 on the engineering database of Chapter 2:

    q1: SELECT EMP.ENAME
        FROM EMP, ASG, PROJ
        WHERE EMP.ENO=ASG.ENO
        AND ASG.PNO=PROJ.PNO
        AND PNAME="CAD/CAM"

After detachment of the selections, query q1 is replaced by q11 followed by q′, where JVAR is an intermediate relation.

    q11: SELECT PROJ.PNO INTO JVAR
         FROM PROJ
         WHERE PNAME="CAD/CAM"

    q′: SELECT EMP.ENAME
        FROM EMP, ASG, JVAR
        WHERE EMP.ENO=ASG.ENO
        AND ASG.PNO=JVAR.PNO

The successive detachments of q′ may generate

    q12: SELECT ASG.ENO INTO GVAR
         FROM ASG, JVAR
         WHERE ASG.PNO=JVAR.PNO

    q13: SELECT EMP.ENAME
         FROM EMP, GVAR
         WHERE EMP.ENO=GVAR.ENO

Note that other subqueries are also possible. Thus query q1 has been reduced to the subsequent queries q11 → q12 → q13. Query q11 is monorelation and can be executed. However, q12 and q13 are not monorelation and cannot be reduced by detachment.
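The equivalence between q1 and the sequence q11 → q12 → q13 can be checked on a toy database. The sketch below uses Python's sqlite3 module; since SQLite has no SELECT ... INTO, the intermediate relations JVAR and GVAR are created with CREATE TEMP TABLE ... AS SELECT, which is a substitution for the notation of the example. The sample rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE EMP (ENO TEXT, ENAME TEXT);
CREATE TABLE ASG (ENO TEXT, PNO TEXT);
CREATE TABLE PROJ (PNO TEXT, PNAME TEXT);
-- Hypothetical sample rows, not the book's engineering database.
INSERT INTO EMP VALUES ('E1', 'J. Doe'), ('E2', 'M. Smith'), ('E3', 'A. Lee');
INSERT INTO ASG VALUES ('E1', 'P1'), ('E2', 'P3'), ('E3', 'P3');
INSERT INTO PROJ VALUES ('P1', 'Instrumentation'), ('P3', 'CAD/CAM');
""")

# q1: the original three-relation query.
q1 = con.execute("""
    SELECT EMP.ENAME FROM EMP, ASG, PROJ
    WHERE EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO AND PNAME = 'CAD/CAM'
""").fetchall()

# q11: detach the selection on PROJ into the intermediate relation JVAR.
con.execute("""CREATE TEMP TABLE JVAR AS
               SELECT PROJ.PNO AS PNO FROM PROJ WHERE PNAME = 'CAD/CAM'""")
# q12: detach the join of ASG with JVAR into GVAR.
con.execute("""CREATE TEMP TABLE GVAR AS
               SELECT ASG.ENO AS ENO FROM ASG, JVAR WHERE ASG.PNO = JVAR.PNO""")
# q13: the remaining two-relation query.
q13 = con.execute("""
    SELECT EMP.ENAME FROM EMP, GVAR WHERE EMP.ENO = GVAR.ENO
""").fetchall()
```

Both `q1` and `q13` contain the names of the employees assigned to the CAD/CAM project, while each intermediate relation is smaller than the base relation it was derived from.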

Multirelation queries, which cannot be further detached (e.g., q12 and q13), are irreducible. A query is irreducible if and only if its query graph is a chain with two nodes or a cycle with k nodes where k > 2. Irreducible queries are converted into monorelation queries by tuple substitution. Given an n-relation query q, the tuples of one relation are substituted by their values, thereby producing a set of (n−1)-relation queries. Tuple substitution proceeds as follows. First, one relation in q is chosen for tuple substitution. Let R1 be that relation. Then for each tuple t1i in R1, the attributes of R1 referred to in q are replaced by their actual values in t1i, thereby generating a query q′ with n−1 relations. Therefore, the total number of queries q′ produced by tuple substitution is card(R1). Tuple substitution can be summarized as follows:

q(R1,R2, . . . ,Rn) is replaced by {q′(t1i,R2,R3, . . . ,Rn), t1i ∈ R1}

For each tuple thus obtained, the subquery is recursively processed by substitution if it is not yet irreducible.

Example 8.5. Let us consider the query q13:

    SELECT EMP.ENAME
    FROM EMP, GVAR
    WHERE EMP.ENO=GVAR.ENO

The relation GVAR is over a single attribute (ENO). Assume that it contains only two tuples: 〈E1〉 and 〈E2〉. The substitution of GVAR generates two one-relation subqueries:

    q131: SELECT EMP.ENAME
          FROM EMP
          WHERE EMP.ENO="E1"

    q132: SELECT EMP.ENAME
          FROM EMP
          WHERE EMP.ENO="E2"

These queries may then be executed.

This dynamic query optimization algorithm (called Dynamic-QOA) is depicted in Algorithm 8.1. The algorithm works recursively until there remain no more monorelation queries to be processed. It consists of applying the selections and projections as soon as possible by detachment. The results of the monorelation queries are stored in data structures that are capable of optimizing the later queries (such as joins). The irreducible queries that remain after detachment must be processed by tuple substitution. For the irreducible query, denoted by MRQ′, the smallest relation whose cardinality is known from the result of the preceding query is chosen for substitution. This simple method enables one to generate the smallest number of subqueries. Monorelation queries generated by the reduction algorithm are executed after choosing the best existing access path to the relation, according to the query qualification.


Algorithm 8.1: Dynamic-QOA
Input: MRQ: multirelation query with n relations
Output: output: result of execution
begin
    output ← φ ;
    if n = 1 then
        output ← run(MRQ) {execute the one relation query}
    else
        {detach MRQ into m one-relation queries (ORQ) and one multirelation query}
        ORQ1, . . . , ORQm, MRQ′ ← MRQ ;
        for i from 1 to m do
            output′ ← run(ORQi) ; {execute ORQi}
            output ← output ∪ output′ {merge all results}
        R ← CHOOSE_RELATION(MRQ′) ; {R chosen for tuple substitution}
        for each tuple t ∈ R do
            MRQ′′ ← substitute values for t in MRQ′ ;
            output′ ← Dynamic-QOA(MRQ′′) ; {recursive call}
            output ← output ∪ output′ {merge all results}
end
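To make the interplay of detachment and tuple substitution concrete, here is a small in-memory sketch. It is a simplification rather than a transcription of Algorithm 8.1: only equality predicates are supported, detachment is reduced to filtering each relation by its one-relation predicates, every monorelation query is run by sequential scan, and the query returns the values of a single target attribute. All names and the data representation (rows as dictionaries keyed by qualified attribute names, constants wrapped in 1-tuples) are assumptions.

```python
def value(term, row):
    """Evaluate a term: a 1-tuple is a constant, a string is an attribute name."""
    return term[0] if isinstance(term, tuple) else row[term]

def relations_of(pred):
    """Names of the relations referenced by an equality predicate (a, b)."""
    return {t.split(".")[0] for t in pred if isinstance(t, str)}

def holds(preds, row):
    return all(value(a, row) == value(b, row) for a, b in preds)

def substitute(term, row):
    """Replace an attribute of the substituted relation by its value in row."""
    return (row[term],) if isinstance(term, str) and term in row else term

def dynamic_qoa(db, rels, preds, target):
    if len(rels) == 1:
        # Monorelation query: execute it with a sequential scan.
        return {value(target, row) for row in db[rels[0]] if holds(preds, row)}
    # Detachment (simplified): apply the one-relation predicates first.
    reduced = dict(db)
    for r in rels:
        local = [p for p in preds if relations_of(p) == {r}]
        reduced[r] = [row for row in db[r] if holds(local, row)]
    multi = [p for p in preds if len(relations_of(p)) > 1]
    # Tuple substitution on the smallest (reduced) relation.
    chosen = min(rels, key=lambda r: len(reduced[r]))
    rest = [r for r in rels if r != chosen]
    output = set()
    for row in reduced[chosen]:
        bound = [(substitute(a, row), substitute(b, row)) for a, b in multi]
        output |= dynamic_qoa(reduced, rest, bound, substitute(target, row))
    return output
```

On query q1, the selection on PROJ is applied first, the resulting one-tuple relation is substituted into the join with ASG, and the process repeats until only monorelation queries on EMP remain, mirroring the sequence q11 → q12 → q13 followed by tuple substitution.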

8.2.2 Static Query Optimization

With static query optimization, there is a clear separation between the generation of the QEP at compile-time and its execution by the DBMS execution engine. Thus, an accurate cost model is key to predict the costs of candidate QEPs.

The most popular static query optimization algorithm is that of System R [Astrahan et al., 1976], also one of the first relational DBMSs. In this section, we present this algorithm based on the description by Selinger et al. [1979]. Most commercial relational DBMSs have implemented variants of this algorithm due to its efficiency and compatibility with query compilation.

The input to the optimizer is a relational algebra tree resulting from the decomposition of an SQL query. The output is a QEP that implements the “optimal” relational algebra tree.

The optimizer assigns a cost (in terms of time) to every candidate tree and retains the one with the smallest cost. The candidate trees are obtained by a permutation of the join orders of the n relations of the query using the commutativity and associativity rules. To limit the overhead of optimization, the number of alternative trees is reduced using dynamic programming. The set of alternative strategies is constructed dynamically so that, when two joins are equivalent by commutativity, only the cheapest one is kept. Furthermore, the strategies that include Cartesian products are eliminated whenever possible.

The cost of a candidate strategy is a weighted combination of I/O and CPU costs (times). The estimation of such costs (at compile time) is based on a cost model that provides a cost formula for each low-level operation (e.g., select using a B-tree index with a range predicate). For most operations (except exact match select), these cost formulas are based on the cardinalities of the operands. The cardinality information for the relations stored in the database is found in the database statistics. The cardinality of the intermediate results is estimated based on the operation selectivity factors discussed in Section 8.1.3.

The optimization algorithm consists of two major steps. First, the best access method to each individual relation based on a select predicate is predicted (this is the one with the least cost). Second, for each relation R, the best join ordering is estimated, where R is first accessed using its best single-relation access method. The cheapest ordering becomes the basis for the best execution plan.

In considering the joins, there are two basic algorithms available, with one of them being optimal in a given context. For the join of two relations, the relation whose tuples are read first is called the external, while the other, whose tuples are found according to the values obtained from the external relation, is called the internal relation. An important decision with either join method is to determine the cheapest access path to the internal relation.

The first method, called nested-loop, performs two loops over the relations. For each tuple of the external relation, the tuples of the internal relation that satisfy the join predicate are retrieved one by one to form the resulting relation. An index or a hashed table on the join attribute is a very efficient access path for the internal relation. In the absence of an index, for relations of n1 and n2 tuples, respectively, this algorithm has a cost proportional to n1 ∗ n2, which may be prohibitive if n1 and n2 are high. Thus, an efficient variant is to build a hashed table on the join attribute for the internal relation (chosen as the smaller relation) before applying nested-loop. If the internal relation is itself the result of a previous operation, then the cost of building the hashed table can be shared with that of producing the previous result.

The second method, called merge-join, consists of merging two sorted relations on the join attribute. Indices on the join attribute may be used as access paths. If the join criterion is equality, the cost of joining two relations of n1 and n2 tuples, respectively, is proportional to n1 + n2. Therefore, this method is always chosen when there is an equijoin, and when the relations are previously sorted. If only one or neither of the relations is sorted, the cost of the nested-loop algorithm is to be compared with the combined cost of the merge join and of the sorting. The cost of sorting n pages is proportional to n log n. In general, it is useful to sort and apply the merge join algorithm when large relations are considered.
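The two join methods, plus the hashed variant of nested-loop, can be sketched over in-memory relations represented as lists of dictionaries. This representation and the function names are assumptions for illustration; a real DBMS operates on pages and access paths rather than Python lists.

```python
from collections import defaultdict

def nested_loop_join(external, internal, key_ext, key_int):
    """Plain nested-loop equijoin: cost proportional to n1 * n2 comparisons."""
    result = []
    for e in external:
        for i in internal:
            if e[key_ext] == i[key_int]:
                result.append({**e, **i})
    return result

def hashed_nested_loop_join(external, internal, key_ext, key_int):
    """Nested-loop variant: build a hashed table on the internal relation first."""
    table = defaultdict(list)
    for i in internal:
        table[i[key_int]].append(i)
    return [{**e, **i} for e in external for i in table.get(e[key_ext], [])]

def merge_join(left, right, key_left, key_right):
    """Merge equijoin; both inputs must already be sorted on the join attribute."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lv, rv = left[i][key_left], right[j][key_right]
        if lv < rv:
            i += 1
        elif lv > rv:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left tuple that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key_right] == lv:
                j_end += 1
            while i < len(left) and left[i][key_left] == lv:
                result.extend({**left[i], **right[k]} for k in range(j, j_end))
                i += 1
            j = j_end
    return result
```

All three functions return the same set of joined tuples; they differ only in how much work is needed to find the matches, which is exactly what the optimizer's cost formulas try to capture.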

The simplified version of the static optimization algorithm, for a select-project-join query, is shown in Algorithm 8.2. It consists of two loops, the first of which selects the best single-relation access method to each relation in the query, while the second examines all possible permutations of join orders (there are n! permutations with n relations) and selects the best access strategy for the query. The permutations are produced by the dynamic construction of a tree of alternative strategies. First, the join of each relation with every other relation is considered, followed by joins of three relations. This continues until joins of n relations are optimized. Actually, the algorithm does not generate all possible permutations since some of them are useless. As we discussed earlier, permutations involving Cartesian products are eliminated, as are the commutatively equivalent strategies with the highest cost. With these two heuristics, the number of strategies examined has an upper bound of 2^n rather than n!.

Algorithm 8.2: Static-QOA
Input: QT: query tree with n relations
Output: output: best QEP
begin
    for each relation Ri ∈ QT do
        for each access path APij to Ri do
            compute cost(APij)
        best_APi ← APij with minimum cost ;
    for each order (Ri1, Ri2, · · · , Rin) with i = 1, · · · , n! do
        build QEP (. . . ((best_APi1 ⋈ Ri2) ⋈ Ri3) ⋈ . . . ⋈ Rin) ;
        compute cost(QEP)
    output ← QEP with minimum cost
end
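The dynamic construction of alternative strategies can be sketched as dynamic programming over subsets of relations, which is where the 2^n bound comes from. The sketch below is a strong simplification and not System R's implementation: the cost of a plan is taken to be the sum of its intermediate result cardinalities, only left-deep plans are built, Cartesian products are pruned, and the join graph is assumed to be connected. The function name, the input format and the statistics used to exercise it are assumptions.

```python
from itertools import combinations

def static_qoa(card, edges):
    """card: relation -> cardinality after its best single-relation access path.
    edges: frozenset({r, s}) -> join selectivity factor for a join predicate.
    Returns (cost, cardinality, plan) for the cheapest left-deep join order."""
    rels = sorted(card)
    # Level 1: single relations, accessed through their best access path.
    best = {frozenset([r]): (0.0, card[r], r) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in combinations(rels, k):
            s = frozenset(subset)
            for r in subset:
                rest = s - {r}
                if rest not in best:
                    continue  # the rest could only be built with a Cartesian product
                factors = [edges[frozenset((r, p))] for p in rest
                           if frozenset((r, p)) in edges]
                if not factors:
                    continue  # joining r would be a Cartesian product: pruned
                cost_rest, card_rest, plan_rest = best[rest]
                out_card = card_rest * card[r]
                for sf in factors:
                    out_card *= sf
                cost = cost_rest + out_card
                # Keep only the cheapest plan per set of relations.
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, out_card, f"({plan_rest} JOIN {r})")
    return best[frozenset(rels)]
```

With hypothetical statistics in which the selection on PROJ leaves a single tuple, for instance `card = {"EMP": 400, "ASG": 1000, "PROJ": 1}` with selectivity factors 1/400 for EMP and ASG and 1/100 for ASG and PROJ, the cheapest plan starts from PROJ, joins ASG, and then joins EMP.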

Example 8.6. Let us illustrate this algorithm with the query q1 (see Example 8.4) on the engineering database. The join graph of q1 is given in Figure 8.8. For short, the label ENO on edge EMP–ASG stands for the predicate EMP.ENO=ASG.ENO and the label PNO on edge ASG–PROJ stands for the predicate ASG.PNO=PROJ.PNO. We assume the following indices:

    EMP has an index on ENO
    ASG has an index on PNO
    PROJ has an index on PNO and an index on PNAME

Fig. 8.8 Join Graph of Query q1 (edge EMP–ASG labeled ENO; edge ASG–PROJ labeled PNO)

We assume that the first loop of the algorithm selects the following best single-relation access paths:

    EMP: sequential scan (because there is no selection on EMP)
    ASG: sequential scan (because there is no selection on ASG)
    PROJ: index on PNAME (because there is a selection on PROJ based on PNAME)

The dynamic construction of the tree of alternative strategies is illustrated in Figure 8.9. Note that the maximum number of join orders is 3!; dynamic search considers fewer alternatives, as depicted in Figure 8.9. The operations marked “pruned” are dynamically eliminated. The first level of the tree indicates the best single-relation access method. The second level indicates, for each of these, the best join method with any other relation. Strategies (EMP × PROJ) and (PROJ × EMP) are pruned because they are Cartesian products that can be avoided (by other strategies). We assume that (EMP ⋈ ASG) and (ASG ⋈ PROJ) have a cost higher than (ASG ⋈ EMP) and (PROJ ⋈ ASG), respectively. Thus they can be pruned because there are better join orders equivalent by commutativity. The two remaining possibilities are given at the third level of the tree. The best total join order is the least costly of ((ASG ⋈ EMP) ⋈ PROJ) and ((PROJ ⋈ ASG) ⋈ EMP). The latter is the only one that has a useful index on the select attribute and direct access to the joining tuples of ASG and EMP. Therefore, it is chosen with the following access methods:

Fig. 8.9 Alternative Join Orders (first level: EMP, ASG, PROJ; second level: EMP × PROJ, EMP ⋈ ASG, ASG ⋈ PROJ and PROJ × EMP pruned, ASG ⋈ EMP and PROJ ⋈ ASG kept; third level: (ASG ⋈ EMP) ⋈ PROJ and (PROJ ⋈ ASG) ⋈ EMP)

    Select PROJ using index on PNAME
    Then join with ASG using index on PNO
    Then join with EMP using index on ENO

The performance measurements substantiate the important contribution of the CPU time to the total time of the query [Mackert and Lohman, 1986]. The accuracy of the optimizer’s estimations is generally good when the relations can be contained in the main memory buffers, but degrades as the relations increase in size and are written to disk. An important performance parameter that should also be considered for better predictions is buffer utilization.

8.2.3 Hybrid Query Optimization

Dynamic and static query optimization both have advantages and drawbacks. Dynamic query optimization mixes optimization and execution and thus can make accurate optimization choices at run-time. However, query optimization is repeated for each execution of the query. Therefore, this approach is best for ad-hoc queries. Static query optimization, done at compilation time, amortizes the cost of optimization over multiple query executions. The accuracy of the cost model is thus critical to predict the costs of candidate QEPs. This approach is best for queries embedded in stored procedures, and has been adopted by all commercial DBMSs.

However, even with a sophisticated cost model, there is an important problem that prevents accurate cost estimation and comparison of QEPs at compile-time. The problem is that the actual bindings of parameter values in embedded queries are not known until run-time. Consider for instance the selection predicate “WHERE R.A = $a” where “$a” is a parameter value. To estimate the cardinality of this selection, the optimizer must rely on the assumption of uniform distribution of A values in R and cannot make use of histograms. Since there is a runtime binding of the parameter $a, the accurate selectivity of σ_{A=$a}(R) cannot be estimated until runtime. Thus, the optimizer can make major estimation errors that can lead to the choice of suboptimal QEPs.

Hybrid query optimization attempts to provide the advantages of static query optimization while avoiding the issues generated by inaccurate estimates. The approach is basically static, but further optimization decisions may take place at run time. This approach was pioneered in System R by adding a conditional runtime reoptimization phase for execution plans statically optimized [Chamberlin et al., 1981]. Thus, plans that have become infeasible (e.g., because indices have been dropped) or suboptimal (e.g., because of changes in relation sizes) are reoptimized. However, detecting suboptimal plans is hard and this approach tends to perform much more reoptimization than necessary. A more general solution is to produce dynamic QEPs which include carefully selected optimization decisions to be made at runtime using “choose-plan” operators [Cole and Graefe, 1994]. The choose-plan operator links two or more equivalent subplans of a QEP that are incomparable at compile-time because important runtime information (e.g., parameter bindings) is missing to estimate costs. The execution of a choose-plan operator yields the comparison of the subplans based on actual costs and the selection of the best one. Choose-plan nodes can be inserted anywhere in a QEP.

Example 8.7. Consider the following query expressed in relational algebra:

$$\sigma_{A \le \$a}(R_1) \bowtie R_2 \bowtie R_3$$

Figure 8.10 shows a dynamic execution plan for this query. We assume that each join is performed by nested-loop, with the left operand relation as external and the right operand relation as internal. The bottom choose-plan operator compares the cost of two alternative subplans for joining R1 and R2, the left subplan being better than the right one if the selection predicate has high selectivity. As stated above, since there is a runtime binding of the parameter $a, the accurate selectivity of σ_{A≤$a}(R1) cannot be estimated until runtime. The top choose-plan operator compares the cost of two alternative subplans for joining the result of the bottom choose-plan operation with R3. Depending on the estimated size of the join of R1 and R2, which indirectly depends on the selectivity of the selection on R1, it may be better to use R3 as external or internal relation.

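Fig. 8.10 A Dynamic Execution Plan (two choose-plan operators: the bottom one selects between two subplans joining σ(R1) and R2, the top one selects between two subplans joining that result with R3)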

Dynamic QEPs are produced at compile-time using any static algorithm such as the one presented in Section 8.2.2. However, instead of producing a total order of operations, the optimizer must produce a partial order by introducing choose-plan operators anywhere in the QEP. The main modification necessary to a static query optimizer to handle dynamic QEPs is that the cost model supports incomparable costs of plans in addition to the standard values “greater than”, “less than” and “equal to”. Costs may be incomparable because the costs of some subplans are unknown at compile-time. Another reason for cost incomparability is when cost is modeled as an interval of possible cost values rather than a single value [Cole and Graefe, 1994]. Therefore, if two plan costs have overlapping intervals, it is not possible to decide which one is better and they should be considered as incomparable.
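The four-valued cost comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names CostInterval and compare are hypothetical, and a cost is modeled as a closed interval [low, high] of possible values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostInterval:
    """Hypothetical cost model: a plan cost known only as a range."""
    low: float
    high: float


def compare(a: CostInterval, b: CostInterval) -> str:
    """Return 'less', 'greater', 'equal', or 'incomparable'."""
    # Two exact (degenerate) costs with the same value are equal.
    if a.low == a.high == b.low == b.high:
        return "equal"
    # Disjoint intervals can be ordered safely.
    if a.high < b.low:
        return "less"
    if b.high < a.low:
        return "greater"
    # Overlapping intervals: no safe decision at compile-time.
    return "incomparable"
```

In an optimizer built along these lines, an "incomparable" outcome would be the point at which both subplans are kept and linked by a choose-plan operator instead of one being pruned.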

Given a dynamic QEP, produced by a static query optimizer, the choose-plan decisions must be made at query startup time. The most effective solution is to simply evaluate the costs of the participating subplans and compare them. In Algorithm 8.3, we describe the startup procedure (called Hybrid-QOA) which makes the optimization decisions to produce the final QEP and run it. The algorithm executes the choose-plan operators in bottom-up order and propagates cost information upward in the QEP.

Algorithm 8.3: Hybrid-QOA
Input: QEP: dynamic QEP; B: query parameter bindings
Output: output: result of execution
begin
    best_QEP ← QEP
    for each choose-plan operator CP in bottom-up order do
        for each alternative subplan SP of CP do
            compute cost(SP) using B
        best_QEP ← best_QEP without CP and SP of highest cost
    output ← execute best_QEP
end
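The startup procedure can be sketched in Python under simplifying assumptions. The classes Op and ChoosePlan, the function resolve, and the additive cost model (an operator's cost is its local cost plus the cost of its children) are all hypothetical choices made for this sketch; it resolves the choose-plan operators bottom-up by recursion and does not execute the resulting plan.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple, Union

Bindings = Dict[str, float]


@dataclass
class Op:
    """Ordinary plan operator with a binding-dependent local cost."""
    name: str
    local_cost: Callable[[Bindings], float]
    children: List["Node"] = field(default_factory=list)


@dataclass
class ChoosePlan:
    """Links equivalent subplans that are incomparable at compile-time."""
    alternatives: List["Node"]


Node = Union[Op, ChoosePlan]


def resolve(node: Node, b: Bindings) -> Tuple[Op, float]:
    """Return (plan without choose-plan operators, its cost under b)."""
    if isinstance(node, ChoosePlan):
        # Cost every alternative with the actual bindings, keep the cheapest.
        resolved = [resolve(alt, b) for alt in node.alternatives]
        return min(resolved, key=lambda pc: pc[1])
    # Children are resolved first, so cost information flows upward.
    kids = [resolve(child, b) for child in node.children]
    plan = Op(node.name, node.local_cost, [k[0] for k in kids])
    cost = node.local_cost(b) + sum(k[1] for k in kids)
    return plan, cost
```

For instance, a choose-plan operator over an index scan whose cost grows with the selectivity bound at startup and a full scan of fixed cost would pick the index scan for a selective parameter value and the full scan otherwise.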

Experimentation with the Volcano query optimizer [Graefe, 1994] has shown that this hybrid query optimization outperforms both dynamic and static query optimization. In particular, the overhead of dynamic QEP evaluation at startup time is significantly less than that of dynamic optimization, and the reduced execution time of dynamic QEPs relative to static QEPs more than offsets the startup time overhead.

8.3 Join Ordering in Distributed Queries

As we have seen in Section 8.2, ordering joins is an important aspect of centralized query optimization. Join ordering in a distributed context is even more important since joins between fragments may increase the communication time. Two basic approaches exist to order joins in distributed queries. One tries to optimize the ordering of joins directly, whereas the other replaces joins by combinations of semijoins in order to minimize communication costs.

8.3.1 Join Ordering

Some algorithms optimize the ordering of joins directly without using semijoins. The purpose of this section is to stress the difficulty that join ordering presents and to motivate the subsequent section, which deals with the use of semijoins to optimize join queries.

A number of assumptions are necessary to concentrate on the main issues. Since the query is localized and expressed on fragments, we do not need to distinguish between fragments of the same relation and fragments of different relations. To simplify notation, we use the term relation to designate a fragment stored at a particular site. Also, to concentrate on join ordering, we ignore local processing time, assuming that reducers (selection, projection) are executed locally either before or during the join (remember that doing selection first is not always efficient). Therefore, we consider only join queries whose operand relations are stored at different sites. We assume that relation transfers are done in a set-at-a-time mode rather than in a tuple-at-a-time mode. Finally, we ignore the transfer time for producing the data at a result site.

Let us first concentrate on the simpler problem of operand transfer in a single join. The query is R ⋈ S, where R and S are relations stored at different sites. The obvious choice of the relation to transfer is to send the smaller relation to the site of the larger one, which gives rise to two possibilities, as shown in Figure 8.11. To make this choice we need to evaluate the sizes of R and S. We now consider the case where there are more than two relations to join. As in the case of a single join, the objective of the join-ordering algorithm is to transmit smaller operands. The difficulty stems from the fact that the join operations may reduce or increase the size of the intermediate results. Thus, estimating the size of join results is mandatory, but also difficult. A solution is to estimate the communication costs of all alternative strategies and to choose the best one. However, as discussed earlier, the number of strategies grows rapidly with the number of relations. This approach makes optimization costly, although this overhead is amortized rapidly if the query is executed frequently.

Fig. 8.11 Transfer of Operands in Binary Operation (R is sent to the site of S if size(R) < size(S); S is sent to the site of R if size(R) > size(S))

Example 8.8. Consider the following query expressed in relational algebra:

PROJ ⋈_PNO ASG ⋈_ENO EMP

whose join graph is given in Figure 8.12. Note that we have made certain assumptions about the locations of the three relations. This query can be executed in at least five different ways. We describe these strategies by the following programs, where (R→ site j) stands for “relation R is transferred to site j.”

1. EMP → site 2; Site 2 computes EMP′ = EMP ⋈ ASG; EMP′ → site 3; Site 3 computes EMP′ ⋈ PROJ.

2. ASG → site 1; Site 1 computes EMP′ = EMP ⋈ ASG; EMP′ → site 3; Site 3 computes EMP′ ⋈ PROJ.

3. ASG → site 3; Site 3 computes ASG′ = ASG ⋈ PROJ; ASG′ → site 1; Site 1 computes ASG′ ⋈ EMP.

4. PROJ → site 2; Site 2 computes PROJ′ = PROJ ⋈ ASG; PROJ′ → site 1; Site 1 computes PROJ′ ⋈ EMP.

5. EMP → site 2; PROJ → site 2; Site 2 computes EMP ⋈ PROJ ⋈ ASG.

Fig. 8.12 Join Graph of Distributed Query (EMP at site 1, ASG at site 2, PROJ at site 3; EMP and ASG are joined on ENO, ASG and PROJ on PNO)

To select one of these programs, the following sizes must be known or predicted: size(EMP), size(ASG), size(PROJ), size(EMP ⋈ ASG), and size(ASG ⋈ PROJ). Furthermore, if it is the response time that is being considered, the optimization must take into account the fact that transfers can be done in parallel with strategy 5. An alternative to enumerating all the solutions is to use heuristics that consider only the sizes of the operand relations by assuming, for example, that the cardinality of the resulting join is the product of operand cardinalities. In this case, relations are ordered by increasing sizes and the order of execution is given by this ordering and the join graph. For instance, the order (EMP, ASG, PROJ) could use strategy 1, while the order (PROJ, ASG, EMP) could use strategy 4.
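To make the comparison concrete, the following Python sketch computes the amount of data each of the five programs transfers, given the five sizes listed above. The function names and the dictionary keys ("EMP_ASG" for EMP ⋈ ASG, "ASG_PROJ" for ASG ⋈ PROJ) are hypothetical, and only total transferred data is counted, so the parallelism available to strategy 5 is not modeled.

```python
from typing import Dict


def transfer_volumes(size: Dict[str, int]) -> Dict[int, int]:
    """Data shipped between sites by each program of Example 8.8."""
    return {
        1: size["EMP"] + size["EMP_ASG"],    # EMP -> site 2, EMP' -> site 3
        2: size["ASG"] + size["EMP_ASG"],    # ASG -> site 1, EMP' -> site 3
        3: size["ASG"] + size["ASG_PROJ"],   # ASG -> site 3, ASG' -> site 1
        4: size["PROJ"] + size["ASG_PROJ"],  # PROJ -> site 2, PROJ' -> site 1
        5: size["EMP"] + size["PROJ"],       # EMP and PROJ -> site 2
    }


def cheapest_strategy(size: Dict[str, int]) -> int:
    """Number of the program with the smallest total transfer."""
    volumes = transfer_volumes(size)
    return min(volumes, key=volumes.get)
```

The answer depends entirely on the estimated sizes of the intermediate joins, which is why those estimates matter so much here.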

8.3.2 Semijoin Based Algorithms

In this section we show how the semijoin operation can be used to decrease the total time of join queries. The theory of semijoins was defined by Bernstein and Chiu [1981]. We are making the same assumptions as in Section 8.3.1. The main shortcoming of the join approach described in the preceding section is that entire operand relations must be transferred between sites. The semijoin acts as a size reducer for a relation much as a selection does.

The join of two relations R and S over attribute A, stored at sites 1 and 2, respectively, can be computed by replacing one or both operand relations by a semijoin with the other relation, using the following rules:

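$$\begin{aligned} R \bowtie_A S &\Leftrightarrow (R \ltimes_A S) \bowtie_A S \\ &\Leftrightarrow R \bowtie_A (S \ltimes_A R) \\ &\Leftrightarrow (R \ltimes_A S) \bowtie_A (S \ltimes_A R) \end{aligned}$$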

The choice between one of the three semijoin strategies requires estimating their respective costs.

The use of the semijoin is beneficial if the cost to produce and send it to the other site is less than the cost of sending the whole operand relation and of doing the actual join. To illustrate the potential benefit of the semijoin, let us compare the costs of the two alternatives: R ⋈_A S versus (R ⋉_A S) ⋈_A S, assuming that size(R) < size(S).

The following program, using the notation of Section 8.3.1, uses the semijoin operation:

1. Π_A(S) → site 1
2. Site 1 computes R′ = R ⋉_A S
3. R′ → site 2
4. Site 2 computes R′ ⋈_A S

For the sake of simplicity, let us ignore the constant T_MSG in the communication time assuming that the term T_TR ∗ size(R) is much larger. We can then compare the two alternatives in terms of the amount of transmitted data. The cost of the join-based algorithm is that of transferring relation R to site 2. The cost of the semijoin-based algorithm is the cost of steps 1 and 3 above. Therefore, the semijoin approach is better if

$$size(\Pi_A(S)) + size(R \ltimes_A S) < size(R)$$

The semijoin approach is better if the semijoin acts as a sufficient reducer, that is, if a few tuples of R participate in the join. The join approach is better if almost all tuples of R participate in the join, because the semijoin approach requires an additional transfer of a projection on the join attribute. The cost of the projection step can be minimized by encoding the result of the projection in bit arrays [Valduriez, 1982], thereby reducing the cost of transferring the joined attribute values. It is important to note that neither approach is systematically the best; they should be considered as complementary.
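The inequality can be checked on small in-memory relations, as in the Python sketch below. It rests on assumptions made only for illustration: relations are lists of tuples, the size of a relation is its number of attribute values, and the function names and the positional parameters a_r and a_s (position of the join attribute in R and S) are hypothetical.

```python
from typing import List, Tuple

Relation = List[Tuple]


def size(rel: Relation) -> int:
    """Assumed size measure: total number of attribute values."""
    return sum(len(t) for t in rel)


def semijoin_pays_off(r: Relation, s: Relation, a_r: int = 0, a_s: int = 0) -> bool:
    """True if size(proj_A(S)) + size(R semijoin S) < size(R)."""
    # Step 1: the distinct join attribute values of S are shipped to R's site.
    proj_a_s = {t[a_s] for t in s}
    # Step 2: R is reduced to the tuples that participate in the join.
    r_reduced = [t for t in r if t[a_r] in proj_a_s]
    # Steps 1 and 3 are the only transfers; compare them with shipping R whole.
    return len(proj_a_s) + size(r_reduced) < size(r)
```

With a ten-tuple R of which only two tuples match, the semijoin wins; when every tuple of R matches, the extra projection transfer makes the plain join cheaper.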

More generally, the semijoin can be useful in reducing the size of the operand relations involved in multiple join queries. However, query optimization becomes more complex in these cases. Consider again the join graph of relations EMP, ASG, and PROJ given in Figure 8.12. We can apply the previous join algorithm using semijoins to each individual join. Thus an example of a program to compute EMP ⋈ ASG ⋈ PROJ is EMP′ ⋈ ASG′ ⋈ PROJ, where EMP′ = EMP ⋉ ASG and ASG′ = ASG ⋉ PROJ.

However, we may further reduce the size of an operand relation by using more than one semijoin. For example, EMP′ can be replaced in the preceding program by EMP′′ derived as

$$EMP'' = EMP \ltimes (ASG \ltimes PROJ)$$

since if size(ASG ⋉ PROJ) ≤ size(ASG), we have size(EMP′′) ≤ size(EMP′). In this way, EMP can be reduced by the sequence of semijoins: EMP ⋉ (ASG ⋉ PROJ). Such a sequence of semijoins is called a semijoin program for EMP. Similarly, semijoin programs can be found for any relation in a query. For example, PROJ could be reduced by the semijoin program PROJ ⋉ (ASG ⋉ EMP). However, not all of the relations involved in a query need to be reduced; in particular, we can ignore those relations that are not involved in the final joins.

For a given relation, there exist several potential semijoin programs. The number of possibilities is in fact exponential in the number of relations. But there is one optimal semijoin program, called the full reducer, which for each relation R reduces R more than the others [Chiu and Ho, 1980]. The problem is to find the full reducer. A simple method is to evaluate the size reduction of all possible semijoin programs and to select the best one. The problems with the enumerative method are twofold:

1. There is a class of queries, called cyclic queries, that have cycles in their join graph and for which full reducers cannot be found.

2. For other queries, called tree queries, full reducers exist, but the number of candidate semijoin programs is exponential in the number of relations, which makes the enumerative approach NP-hard.

In what follows we discuss solutions to these problems.

Example 8.9. Consider the following relations, where attribute CITY has been added to relations EMP (renamed ET), PROJ (renamed PT) and ASG (renamed AT) of the engineering database. Attribute CITY of AT corresponds to the city where the employee identified by ENO lives.

ET(ENO, ENAME, TITLE, CITY)
AT(ENO, PNO, RESP, DUR, CITY)
PT(PNO, PNAME, BUDGET, CITY)

The following SQL query retrieves the names of all employees living in the city in which their project is located together with the project name.

SELECT ENAME, PNAME
FROM   ET, AT, PT
WHERE  ET.ENO = AT.ENO
AND    AT.PNO = PT.PNO
AND    ET.CITY = PT.CITY

As illustrated in Figure 8.13a, this query is cyclic.

No full reducer exists for the query in Example 8.9. In fact, it is possible to derive semijoin programs for reducing it, but the number of operations is multiplied by the number of tuples in each relation, making the approach inefficient. One solution consists of transforming the cyclic graph into a tree by removing one arc of the graph and by adding appropriate predicates to the other arcs such that the removed predicate is preserved by transitivity [Kambayashi et al., 1982]. In the example of Figure 8.13b, where the arc (ET, PT) is removed, the additional predicates ET.CITY = AT.CITY and AT.CITY = PT.CITY imply ET.CITY = PT.CITY by transitivity. Thus the acyclic query is equivalent to the cyclic query.

Fig. 8.13 Transformation of Cyclic Query. (a) Cyclic query: arcs ET.ENO = AT.ENO, AT.PNO = PT.PNO, and ET.CITY = PT.CITY. (b) Equivalent acyclic query: arc ET.ENO = AT.ENO and ET.CITY = AT.CITY between ET and AT; arc AT.PNO = PT.PNO and AT.CITY = PT.CITY between AT and PT.

Although full reducers for tree queries exist, the problem of finding them is NP-hard. However, there is an important class of queries, called chained queries, for which a polynomial algorithm exists [Chiu and Ho, 1980; Ullman, 1982]. A chained query has a join graph where relations can be ordered, and each relation joins only with the next relation in the order. Furthermore, the result of the query is at the end of the chain. For instance, the query in Figure 8.12 is a chained query. Because of the difficulty of implementing an algorithm with full reducers, most systems use single semijoins to reduce the relation size.

8.3.3 Join versus Semijoin

Compared with the join, the semijoin induces more operations but possibly on smaller operands. Figure 8.14 illustrates these differences with an equivalent pair of join and semijoin strategies for the query whose join graph is given in Figure 8.12. The join of two relations, EMP ⋈ ASG in Figure 8.12, is done by sending one relation, ASG, to the site of the other one, EMP, to complete the join locally. When a semijoin is used, however, the transfer of relation ASG is avoided. Instead, it is replaced by the transfer of the join attribute values of relation EMP to the site of relation ASG, followed by the transfer of the matching tuples of relation ASG to the site of relation EMP, where the join is completed. If the join attribute length is smaller than the length of an entire tuple and the semijoin has good selectivity, then the semijoin approach can result in significant savings in communication time. Using semijoins may well increase the local processing time, since one of the two joined relations must be accessed twice. For example, relations EMP and PROJ are accessed twice in Figure 8.14. Furthermore, the join of two intermediate relations produced by semijoins cannot exploit the indices that were available on the base relations. Therefore, using semijoins might not be a good idea if the communication time is not the dominant factor, as is the case with local area networks [Lu and Carey, 1985].

Fig. 8.14 Join versus Semijoin Approaches. (a) Join approach. (b) Semijoin approach (uses the projections Π_ENO and Π_PNO).

Semijoins can still be beneficial with fast networks if they have very good selectivity and are implemented with bit arrays [Valduriez, 1982]. A bit array BA[1 : n] is useful in encoding the join attribute values present in one relation. Let us consider the semijoin R ⋉ S. Then BA[i] is set to 1 if there exists a join attribute value A = val in relation S such that h(val) = i, where h is a hash function. Otherwise, BA[i] is set to 0. Such a bit array is much smaller than a list of join attribute values. Therefore, transferring the bit array instead of the join attribute values to the site of relation R saves communication time. The semijoin can be completed as follows. Each tuple of relation R, whose join attribute value is val, belongs to the semijoin if BA[h(val)] = 1.
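A minimal Python sketch of the bit-array technique follows. The function names are hypothetical, the array is indexed from 0 rather than 1, and the hash function is passed in as a parameter. Because distinct values can hash to the same position, the filter may keep some tuples of R that have no match in S; in this sketch those are simply left for the final join to discard.

```python
from typing import Callable, Hashable, Iterable, List, Tuple


def build_bit_array(s_values: Iterable[Hashable], n: int,
                    h: Callable[[Hashable], int]) -> List[int]:
    """At the site of S: set BA[h(val)] = 1 for each join value of S."""
    ba = [0] * n
    for val in s_values:
        ba[h(val)] = 1
    return ba


def filter_with_bit_array(r: List[Tuple], key: Callable[[Tuple], Hashable],
                          ba: List[int],
                          h: Callable[[Hashable], int]) -> List[Tuple]:
    """At the site of R: keep the tuples whose join value hits a 1 bit."""
    return [t for t in r if ba[h(key(t))] == 1]


if __name__ == "__main__":
    n = 8
    h = lambda v: v % n                       # toy hash for integer keys
    ba = build_bit_array([3, 12], n, h)       # sets bits 3 and 4
    r = [(3, "a"), (4, "b"), (11, "c"), (5, "d")]
    print(filter_with_bit_array(r, lambda t: t[0], ba, h))
```

Only the n bits travel to the site of R, regardless of how many join attribute values S contains.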

8.4 Distributed Query Optimization

In this section we illustrate the use of the techniques presented in earlier sections within the context of four basic query optimization algorithms. First, we present the dynamic and static approaches which extend the centralized algorithms presented in Section 8.2. Then, we describe a popular semijoin-based optimization algorithm. Finally, we present a hybrid approach.


8.4.1 Dynamic Approach

We illustrate the dynamic approach with the algorithm of Distributed INGRES [Epstein et al., 1978] that is derived from the algorithm described in Section 8.2.1. The objective function of the algorithm is to minimize a combination of both the communication time and the response time. However, these two objectives may be conflicting. For instance, increasing communication time (by means of parallelism) may well decrease response time. Thus, the function can give a greater weight to one or the other. Note that this query optimization algorithm ignores the cost of transmitting the data to the result site. The algorithm also takes advantage of fragmentation, but only horizontal fragmentation is handled for simplicity.

Since both general and broadcast networks are considered, the optimizer takes into account the network topology. In broadcast networks, the same data unit can be transmitted from one site to all the other sites in a single transfer, and the algorithm explicitly takes advantage of this capability. For example, broadcasting is used to replicate fragments and then to maximize the degree of parallelism.

The input to the algorithm is a query expressed in tuple relational calculus (in conjunctive normal form) and schema information (the network type, as well as the location and size of each fragment). This algorithm is executed by the site, called the master site, where the query is initiated. The algorithm, which we call Dynamic*-QOA, is given in Algorithm 8.4.

Algorithm 8.4: Dynamic*-QOA
Input: MRQ: multirelation query
Output: result of the last multirelation query
begin
    for each detachable ORQ_i in MRQ do    {ORQ is monorelation query}
        run(ORQ_i)    (1)
    MRQ′_list ← REDUCE(MRQ)    {MRQ repl. by n irreducible queries} (2)
    while n ≠ 0 do    {n is the number of irreducible queries} (3)
        {choose next irreducible query involving the smallest fragments}
        MRQ′ ← SELECT_QUERY(MRQ′_list)    (3.1)
        {determine fragments to transfer and processing site for MRQ′}
        Fragment-site-list ← SELECT_STRATEGY(MRQ′)    (3.2)
        {move the selected fragments to the selected sites}
        for each pair (F,S) in Fragment-site-list do
            move fragment F to site S    (3.3)
        execute MRQ′    (3.4)
        n ← n − 1
    {output is the result of the last MRQ′}
end

All monorelation queries (e.g., selection and projection) that can be detached are first processed locally [Step (1)]. Then the reduction algorithm [Wong and Youssefi, 1976] is applied to the original query [Step (2)]. Reduction is a technique that isolates all irreducible subqueries and monorelation subqueries by detachment (see Section 8.2.1). Monorelation subqueries are ignored because they have already been processed in step (1). Thus the REDUCE procedure produces a sequence of irreducible subqueries q1→ q2→ ··· → qn, with at most one relation in common between two consecutive subqueries. Wong and Youssefi [1976] have shown that such a sequence is unique. Example 8.4 (in Section 8.2.1), which illustrated the detachment technique, also illustrates what the REDUCE procedure would produce.

Based on the list of irreducible queries isolated in step (2) and the size of each fragment, the next subquery, MRQ′, which has at least two variables, is chosen at step (3.1) and steps (3.2), (3.3), and (3.4) are applied to it. Steps (3.1) and (3.2) are discussed below. Step (3.2) selects the best strategy to process the query MRQ′. This strategy is described by a list of pairs (F,S), in which F is a fragment to transfer to the processing site S. Step (3.3) transfers all the fragments to their processing sites. Finally, step (3.4) executes the query MRQ′. If there are remaining subqueries, the algorithm goes back to step (3) and performs the next iteration. Otherwise, it terminates.

Optimization occurs in steps (3.1) and (3.2). The algorithm has produced subqueries with several components and their dependency order (similar to the one given by a relational algebra tree). At step (3.1) a simple choice for the next subquery is to take the next one having no predecessor and involving the smaller fragments. This minimizes the size of the intermediate results. For example, if a query q has the subqueries q1, q2, and q3, with dependencies q1 → q3, q2 → q3, and if the fragments referred to by q1 are smaller than those referred to by q2, then q1 is selected. Depending on the network, this choice can also be affected by the number of sites having relevant fragments.
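The simple choice made at step (3.1) can be sketched in Python as below. The function name select_query, the representation of dependencies as a map from a subquery to the set of its predecessors, and the single number used as the size of the fragments a subquery refers to are all assumptions of this sketch. The number of sites holding relevant fragments is not taken into account.

```python
from typing import Dict, List, Set


def select_query(pending: List[str],
                 predecessors: Dict[str, Set[str]],
                 fragment_size: Dict[str, int]) -> str:
    """Pick the pending subquery with no pending predecessor
    that refers to the smallest fragments."""
    still_pending = set(pending)
    ready = [q for q in pending
             if not (predecessors.get(q, set()) & still_pending)]
    return min(ready, key=lambda q: fragment_size[q])
```

With the dependencies q1 → q3 and q2 → q3 and smaller fragments for q1 than for q2, the sketch returns q1 first, then q2, and q3 only once both have been processed.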

The subquery selected must then be executed. Since the relations involved in a subquery may be stored at different sites and even fragmented, the subquery may nevertheless be further subdivided.

Example 8.10. Assume that relations EMP, ASG, and PROJ of the query of Example 8.4 are stored as follows, where relation EMP is fragmented.

Site 1    Site 2
EMP1      EMP2
ASG       PROJ

There are several possible strategies, including the following:

1. Execute the entire query (EMP ⋈ ASG ⋈ PROJ) by moving EMP1 and ASG to site 2.

2. Execute (EMP ⋈ ASG) ⋈ PROJ by moving (EMP1 ⋈ ASG) and ASG to site 2, and so on.

The choice between the possible strategies requires an estimate of the size of the intermediate results. For example, if size(EMP1 ⋈ ASG) > size(EMP1), strategy 1 is preferred to strategy 2. Therefore, an estimate of the size of joins is required.

At step (3.2), the next optimization problem is to determine how to execute the subquery by selecting the fragments that will be moved and the sites where the processing will take place. For an n-relation subquery, fragments from n−1 relations must be moved to the site(s) of fragments of the remaining relation, say Rp, and then replicated there. Also, the remaining relation may be further partitioned into k “equalized” fragments in order to increase parallelism. This method is called fragment-and-replicate and performs a substitution of fragments rather than of tuples. The selection of the remaining relation and of the number of processing sites k on which it should be partitioned is based on the objective function and the topology of the network. Remember that replication is cheaper in broadcast networks than in point-to-point networks. Furthermore, the choice of the number of processing sites involves a trade-off between response time and total time. A larger number of sites decreases response time (by parallel processing) but increases total time, in particular increasing communication costs.

Epstein et al. [1978] give formulas to minimize either communication time or processing time. These formulas use as input the location of fragments, their size, and the network type. They can minimize both costs but with a priority to one. To illustrate these formulas, we give the rules for minimizing communication time. The rule for minimizing response time is even more complex. We use the following assumptions. There are n relations R_1, R_2, ..., R_n involved in the query. R_i^j denotes the fragment of R_i stored at site j. There are m sites in the network. Finally, CT_k(#bytes) denotes the communication time of transferring #bytes to k sites, with 1 ≤ k ≤ m.

The rule for minimizing communication time considers the types of networks separately. Let us first concentrate on a broadcast network. In this case we have

$$CT_k(\#bytes) = CT_1(\#bytes)$$

The rule can be stated as

if $\max_{j=1,\ldots,m}\left(\sum_{i=1}^{n} size(R_i^j)\right) > \max_{i=1,\ldots,n}(size(R_i))$
    then the processing site is the j that has the largest amount of data
    else R_p is the largest relation and the site of R_p is the processing site

If the inequality predicate is satisfied, one site contains an amount of data useful to the query larger than the size of the largest relation. Therefore, this site should be the processing site. If the predicate is not satisfied, one relation is larger than the maximum useful amount of data at one site. Therefore, this relation should be the Rp, and the processing sites are those which have its fragments.
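The broadcast rule can be sketched in Python as follows. The function name and the nested-dictionary layout frag[relation][site] = fragment size are assumptions of this sketch. When the predicate fails, it returns every site that stores a fragment of the largest relation, following the explanation above.

```python
from typing import Dict, List

Frag = Dict[str, Dict[int, int]]  # frag[relation][site] = fragment size


def broadcast_processing_sites(frag: Frag) -> List[int]:
    """Apply the broadcast-network rule for minimizing communication time."""
    sites = sorted({s for per_site in frag.values() for s in per_site})
    # Amount of data useful to the query stored at each site.
    data_at_site = {s: sum(per_site.get(s, 0) for per_site in frag.values())
                    for s in sites}
    # Total size of each relation over all of its fragments.
    rel_size = {r: sum(per_site.values()) for r, per_site in frag.items()}

    best_site = max(sites, key=lambda s: data_at_site[s])
    largest_rel = max(rel_size, key=lambda r: rel_size[r])

    if data_at_site[best_site] > rel_size[largest_rel]:
        return [best_site]            # one site already holds the most data
    return sorted(frag[largest_rel])  # sites of the fragments of R_p
```

For a hypothetical allocation in which relation R has fragments of 500 and 100 at sites 1 and 2 and relation S has 400 at site 1, site 1 holds 900 units, more than the 600 of the largest relation, so site 1 alone is selected.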

Let us now consider the case of the point-to-point networks. In this case we have

$$CT_k(\#bytes) = k \cdot CT_1(\#bytes)$$

The choice of R_p that minimizes communication is obviously the largest relation. Assuming that the sites are arranged by decreasing order of amounts of useful data for the query, that is,

$$\sum_{i=1}^{n} size(R_i^j) > \sum_{i=1}^{n} size(R_i^{j+1})$$

the choice of k, the number of sites at which processing needs to be done, is given as

if $\sum_{i \neq p}(size(R_i) - size(R_i^1)) > size(R_p^1)$ then
    k = 1
else
    k is the largest j such that $\sum_{i \neq p}(size(R_i) - size(R_i^j)) \le size(R_p^j)$

This rule chooses a site as the processing site only if the amount of data it must receive is smaller than the additional amount of data it would have to send if it were not a processing site. Obviously, the then-part of the rule assumes that site 1 stores a fragment of Rp.
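A Python sketch of the point-to-point rule is given below. It is an illustration under assumptions: the function name and the layout frag[relation][site] = fragment size are hypothetical, the relation R_p is supplied by the caller, and sites holding equal amounts of useful data are kept in increasing site-number order. The sketch returns the first k sites of the ordering.

```python
from typing import Dict, List

Frag = Dict[str, Dict[int, int]]  # frag[relation][site] = fragment size


def point_to_point_sites(frag: Frag, p: str) -> List[int]:
    """Return the k processing sites chosen by the point-to-point rule."""
    rel_size = {r: sum(per_site.values()) for r, per_site in frag.items()}
    sites = sorted({s for per_site in frag.values() for s in per_site})
    useful = {s: sum(per_site.get(s, 0) for per_site in frag.values())
              for s in sites}
    # Sites by decreasing amount of useful data (stable for ties).
    ordered = sorted(sites, key=lambda s: -useful[s])

    def must_receive(site: int) -> int:
        # Data of the other relations that the site does not already store.
        return sum(rel_size[r] - frag[r].get(site, 0) for r in frag if r != p)

    k = 1  # then-part of the rule: a single processing site
    if must_receive(ordered[0]) <= frag[p].get(ordered[0], 0):
        for j, site in enumerate(ordered, start=1):
            if must_receive(site) <= frag[p].get(site, 0):
                k = j  # largest j satisfying the inequality
    return ordered[:k]
```

With the allocation of Example 8.11 and PROJ as R_p, only site 3 satisfies the inequality, so k = 1 and site 3 is the single processing site.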

Example 8.11. Let us consider the query PROJ 1 ASG, where PROJ and ASG are fragmented. Assume that the allocation of fragments and their sizes are as follows (in kilobytes):

        Site 1   Site 2   Site 3   Site 4
PROJ    1000     1000     1000     1000
ASG                       2000

With a point-to-point network, the best strategy is to send each PROJ_i to site 3, which requires a transfer of 3000 kbytes, versus 6000 kbytes if ASG is sent to sites 1, 2, and 4. However, with a broadcast network, the best strategy is to send ASG (in a single transfer) to sites 1, 2, and 4, which incurs a transfer of 2000 kbytes. The latter strategy is faster and minimizes response time because the joins can be done in parallel.
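The arithmetic of this example can be reproduced with a few lines of Python. The sketch assumes that communication time is proportional to the number of kilobytes sent, so that CT_1(#bytes) is taken to be #bytes itself; the function names are hypothetical.

```python
def ct(kbytes: int, k: int, broadcast: bool) -> int:
    """CT_k: cost of sending the same kbytes to k sites."""
    return kbytes if broadcast else k * kbytes


def move_proj_to_site3(broadcast: bool) -> int:
    # PROJ1, PROJ2, and PROJ4 are different data, each sent to one site.
    return sum(ct(1000, 1, broadcast) for _site in (1, 2, 4))


def replicate_asg(broadcast: bool) -> int:
    # The same ASG data is sent to sites 1, 2, and 4.
    return ct(2000, 3, broadcast)
```

Moving the PROJ fragments costs 3000 kbytes in both network types because the three fragments are distinct, whereas replicating ASG drops from 6000 to 2000 kbytes once a single broadcast reaches all three sites.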

This dynamic query optimization algorithm is characterized by a limited search of the solution space, where an optimization decision is taken for each step without concerning itself with the consequences of that decision on global optimization. However, the algorithm is able to correct a local decision that proves to be incorrect.

8.4.2 Static Approach

We illustrate the static approach with the algorithm of R* [Selinger and Adiba, 1980; Lohman et al., 1985], which is a substantial extension of the techniques we described in Section 8.2.2. This algorithm performs an exhaustive search of all alternative strategies in order to choose the one with the least cost. Although predicting and enumerating these strategies may be costly, the overhead of exhaustive search is rapidly amortized if the query is executed frequently. Query compilation is a distributed task, coordinated by a master site, where the query is initiated. The optimizer of the master site makes all intersite decisions, such as the selection of the execution sites and the fragments as well as the method for transferring data. The apprentice sites, which are the other sites that have relations involved in the query, make the remaining local decisions (such as the ordering of joins at a site) and generate local access plans for the query. The objective function of the optimizer is the general total time function, including local processing and communications costs (see Section 8.1.1).

We now summarize this query optimization algorithm. The input to the algorithm is a localized query expressed as a relational algebra tree (the query tree), the location of relations, and their statistics. The algorithm is described by the procedure Static*-QOA in Algorithm 8.5.

Algorithm 8.5: Static*-QOA
Input: QT: query tree
Output: strat: minimum cost strategy
begin
    for each relation R_i ∈ QT do
        for each access path AP_ij to R_i do
            compute cost(AP_ij)
        best_AP_i ← AP_ij with minimum cost
    for each order (R_i1, R_i2, ..., R_in) with i = 1, ..., n! do
        build strategy (...((best_AP_i1 ⋈ R_i2) ⋈ R_i3) ⋈ ... ⋈ R_in)
        compute the cost of strategy
    strat ← strategy with minimum cost
    for each site k storing a relation involved in QT do
        LS_k ← local strategy (strategy, k)
        send (LS_k, site k)    {each local strategy is optimized at site k}
end
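The exhaustive part of the procedure can be sketched in Python under a deliberately simple, hypothetical cost model: the cost of a strategy is the cost of the best access path to its first relation plus the sum of the estimated sizes of its intermediate results, and a pair of relations with no join predicate is treated as a Cartesian product. The function names are assumptions, and the last loop of the algorithm (deriving and sending local strategies to the sites) is not covered.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Tuple

Order = Tuple[str, ...]


def make_join_cost(card: Dict[str, float],
                   sel: Dict[FrozenSet[str], float]) -> Callable[[Order], float]:
    """Toy cost: sum of the estimated sizes of the intermediate results."""
    def join_cost(order: Order) -> float:
        size, total, joined = card[order[0]], 0.0, [order[0]]
        for r in order[1:]:
            factor = 1.0
            for q in joined:  # selectivity 1.0 means no predicate between q and r
                factor *= sel.get(frozenset((q, r)), 1.0)
            size = size * card[r] * factor
            total += size
            joined.append(r)
        return total
    return join_cost


def static_qoa(access_paths: Dict[str, Dict[str, float]],
               join_cost: Callable[[Order], float]):
    """Enumerate all n! orders and return (best access paths, (order, cost))."""
    best_ap = {r: min(paths, key=paths.get) for r, paths in access_paths.items()}
    best = None
    for order in permutations(access_paths):
        first = order[0]
        cost = access_paths[first][best_ap[first]] + join_cost(order)
        if best is None or cost < best[1]:
            best = (order, cost)
    return best_ap, best
```

Even in this toy setting, orders that begin with a Cartesian product (EMP followed by PROJ, for example) receive a much higher cost than orders that follow the join graph, and the enumeration cost grows as n!.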

As in the centralized case, the optimizer must select the join ordering, the join algorithm (nested-loop or merge-join), and the access path for each fragment (e.g., clustered index, sequential scan, etc.). These decisions are based on statistics and formulas used to estimate the size of intermediate results and access path information. In addition, the optimizer must select the sites of join results and the method of transferring data between sites. To join two relations, there are three candidate sites: the site of the first relation, the site of the second relation, or a third site (e.g., the site of a third relation to be joined with). Two methods are supported for intersite data transfers.


1. Ship-whole. The entire relation is shipped to the join site and stored in a temporary relation before being joined. If the join algorithm is merge join, the relation does not need to be stored, and the join site can process incoming tuples in a pipeline mode, as they arrive.

2. Fetch-as-needed. The external relation is sequentially scanned, and for each tuple the join value is sent to the site of the internal relation, which selects the internal tuples matching the value and sends the selected tuples to the site of the external relation. This method is equivalent to the semijoin of the internal relation with each external tuple.

The trade-off between these two methods is obvious. Ship-whole generates a larger data transfer but fewer messages than fetch-as-needed. It is intuitively better to ship whole relations when they are small. On the contrary, if the relation is large and the join has good selectivity (only a few matching tuples), the relevant tuples should be fetched as needed. The optimizer does not consider all possible combinations of join methods with transfer methods since some of them are not worthwhile. For example, it would be useless to transfer the external relation using fetch-as-needed in the nested-loop join algorithm, because all the outer tuples must be processed anyway and therefore should be transferred as a whole.

Given the join of an external relation R with an internal relation S on attribute A, there are four join strategies. In what follows we describe each strategy in detail and provide a simplified cost formula for each, where LT denotes local processing time (I/O + CPU time) and CT denotes communication time. For simplicity, we ignore the cost of producing the result. For convenience, we denote by s the average number of tuples of S that match one tuple of R:

$$s = \frac{card(S \ltimes_A R)}{card(R)}$$

Strategy 1.

Ship the entire external relation to the site of the internal relation. In this case the external tuples can be joined with S as they arrive. Thus we have

$$\begin{aligned}
\text{Total cost} ={} & LT(\text{retrieve } card(R) \text{ tuples from } R) + CT(size(R)) \\
& + LT(\text{retrieve } s \text{ tuples from } S) * card(R)
\end{aligned}$$

Strategy 2.

Ship the entire internal relation to the site of the external relation. In this case, the internal tuples cannot be joined as they arrive, and they need to be stored in a temporary relation T . Thus we have


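$$\begin{aligned}
\text{Total cost} ={} & LT(\text{retrieve } card(S) \text{ tuples from } S) + CT(size(S)) \\
& + LT(\text{store } card(S) \text{ tuples in } T) \\
& + LT(\text{retrieve } card(R) \text{ tuples from } R) \\
& + LT(\text{retrieve } s \text{ tuples from } T) * card(R)
\end{aligned}$$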

Strategy 3.

Fetch tuples of the internal relation as needed for each tuple of the external relation. In this case, for each tuple in R, the join attribute value is sent to the site of S. Then the s tuples of S which match that value are retrieved and sent to the site of R to be joined as they arrive. Thus we have

$$\begin{aligned}
\text{Total cost} ={} & LT(\text{retrieve } card(R) \text{ tuples from } R) \\
& + CT(length(A)) * card(R) \\
& + LT(\text{retrieve } s \text{ tuples from } S) * card(R) \\
& + CT(s * length(S)) * card(R)
\end{aligned}$$

Strategy 4.

Move both relations to a third site and compute the join there. In this case the internal relation is first moved to a third site and stored in a temporary relation T . Then the external relation is moved to the third site and its tuples are joined with T as they arrive. Thus we have

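$$\begin{aligned}
\text{Total cost} ={} & LT(\text{retrieve } card(S) \text{ tuples from } S) + CT(size(S)) \\
& + LT(\text{store } card(S) \text{ tuples in } T) \\
& + LT(\text{retrieve } card(R) \text{ tuples from } R) + CT(size(R)) \\
& + LT(\text{retrieve } s \text{ tuples from } T) * card(R)
\end{aligned}$$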
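To compare the four strategies numerically, the cost formulas can be evaluated with a small Python sketch. The functions `lt` and `ct` and all constants below are assumptions made for illustration: local time is taken to be linear in the number of tuples, and communication time is a fixed per-message cost plus a per-unit transfer cost. The statistics in `stats` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JoinStats:
    card_r: int    # card(R): tuples in the external relation R
    card_s: int    # card(S): tuples in the internal relation S
    len_r: float   # length of one R tuple
    len_s: float   # length of one S tuple
    len_a: float   # length(A): length of one join attribute value
    s: float       # average number of S tuples matching one R tuple


# Assumed cost constants (hypothetical, for illustration only).
LT_PER_TUPLE = 1.0  # local time to retrieve or store one tuple
T_MSG = 5.0         # fixed time to initiate one message
T_TR = 0.1          # time to transfer one unit of data


def lt(tuples):
    """Assumed linear local processing time."""
    return LT_PER_TUPLE * tuples


def ct(size):
    """Assumed communication time of one message carrying `size` units."""
    return T_MSG + T_TR * size


def strategy_costs(j):
    """Evaluate the simplified total cost of strategies 1 to 4."""
    size_r = j.card_r * j.len_r
    size_s = j.card_s * j.len_s
    probe = lt(j.s) * j.card_r  # retrieve s matching tuples per R tuple
    return {
        1: lt(j.card_r) + ct(size_r) + probe,
        2: lt(j.card_s) + ct(size_s) + lt(j.card_s) + lt(j.card_r) + probe,
        3: lt(j.card_r) + ct(j.len_a) * j.card_r + probe
           + ct(j.s * j.len_s) * j.card_r,
        4: lt(j.card_s) + ct(size_s) + lt(j.card_s)
           + lt(j.card_r) + ct(size_r) + probe,
    }


# Hypothetical case: R is much larger than S, so shipping S should win.
stats = JoinStats(card_r=1000, card_s=50, len_r=100, len_s=20, len_a=4, s=0.05)
costs = strategy_costs(stats)
cheapest = min(costs, key=costs.get)
print(costs, cheapest)
```

Under this cost model strategy 4 can never beat strategy 1 or strategy 2, since it pays for both transfers.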

Example 8.12. Let us consider a query that consists of the join of relations PROJ, the external relation, and ASG, the internal relation, on attribute PNO. We assume that PROJ and ASG are stored at two different sites and that there is an index on attribute PNO for relation ASG. The possible execution strategies for the query are as follows:

1. Ship whole PROJ to site of ASG.
2. Ship whole ASG to site of PROJ.
3. Fetch ASG tuples as needed for each tuple of PROJ.
4. Move ASG and PROJ to a third site.

The optimization algorithm predicts the total time of each strategy and selects the cheapest. Given that there is no operation following the join PROJ ⋈ ASG, strategy 4 obviously incurs the highest cost since both relations must be transferred. If size(PROJ) is much larger than size(ASG), strategy 2 minimizes the communication time and is likely to be the best if local processing time is not too high compared to strategies 1 and 3. Note that the local processing time of strategies 1 and 3 is probably much better than that of strategy 2 since they exploit the index on the join attribute.

If strategy 2 is not the best, the choice is between strategies 1 and 3. Local processing costs in both of these alternatives are identical. If PROJ is large and only a few tuples of ASG match, strategy 3 probably incurs the least communication time and is the best. Otherwise, that is, if PROJ is small or many tuples of ASG match, strategy 1 should be the best. ♦

Conceptually, the algorithm can be viewed as an exhaustive search among all alternatives that are defined by the permutation of the relation join order, join methods (including the selection of the join algorithm), result site, access path to the internal relation, and intersite transfer mode. Such an algorithm has a combinatorial complexity in the number of relations involved. Actually, the algorithm significantly reduces the number of alternatives by using dynamic programming and heuristics, as does System R's optimizer (see Section 8.2.2). With dynamic programming, the tree of alternatives is dynamically constructed and pruned by eliminating the inefficient choices.

Performance evaluation of the algorithm in the context of both high-speed networks (similar to local networks) and medium-speed wide area networks confirms the significant contribution of local processing costs, even for wide area networks [Lohman and Mackert, 1986; Mackert and Lohman, 1986]. It is shown in particular that for the distributed join, transferring the entire internal relation outperforms the fetch-as-needed method.

8.4.3 Semijoin-based Approach

We illustrate the semijoin-based approach with the algorithm of SDD-1 [Bernstein et al., 1981] which takes full advantage of the semijoin to minimize communication cost. The query optimization algorithm is derived from an earlier method called the “hill-climbing” algorithm [Wong, 1977], which has the distinction of being the first distributed query processing algorithm. In the hill-climbing algorithm, refinements of an initial feasible solution are recursively computed until no more cost improvements can be made. The algorithm does not use semijoins, nor does it assume data replication and fragmentation. It is devised for wide area point-to-point networks. The cost of transferring the result to the final site is ignored. This algorithm is quite general in that it can minimize an arbitrary objective function, including the total time and response time.


The hill-climbing algorithm proceeds as follows. The input to the algorithm includes the query graph, location of relations, and relation statistics. Following the completion of initial local processing, an initial feasible solution is selected which is a global execution schedule that includes all intersite communication. It is obtained by computing the cost of all the execution strategies that transfer all the required relations to a single candidate result site, and then choosing the least costly strategy. Let us denote this initial strategy as ES0. Then the optimizer splits ES0 into two strategies, ES1 followed by ES2, where ES1 consists of sending one of the relations involved in the join to the site of the other relation. The two relations are joined locally and the resulting relation is transmitted to the chosen result site (specified as schedule ES2). If the cost of executing strategies ES1 and ES2, plus the cost of local join processing, is less than that of ES0, then ES0 is replaced in the schedule by ES1 and ES2. The process is then applied recursively to ES1 and ES2 until no more benefit can be gained. Notice that if n-way joins are involved, ES0 will be divided into n subschedules instead of just two.

The hill-climbing algorithm is in the class of greedy algorithms, which start with an initial feasible solution and iteratively improve it. The main problem is that strategies with higher initial cost, which could nevertheless produce better overall benefits, are ignored. Furthermore, the algorithm may get stuck at a local minimum cost solution and fail to reach the global minimum.

Example 8.13. Let us illustrate the hill-climbing algorithm using the following query involving relations EMP, PAY, PROJ, and ASG of the engineering database:

“Find the salaries of engineers who work on the CAD/CAM project”

The query in relational algebra is

$$\Pi_{SAL}(PAY \bowtie_{TITLE} (EMP \bowtie_{ENO} (ASG \bowtie_{PNO} (\sigma_{PNAME = \text{“CAD/CAM”}}(PROJ)))))$$

We assume that TMSG = 0 and TTR = 1. Furthermore, we ignore the local processing, following which the database is

Relation   Size   Site
EMP        8      1
PAY        4      2
PROJ       1      3
ASG        10     4

To simplify this example, we assume that the length of a tuple (of every relation) is 1, which means that the size of a relation is equal to its cardinality. Furthermore, the placement of the relation is arbitrary. Based on join selectivities, we know that size(EMP ⋈ PAY) = size(EMP), size(PROJ ⋈ ASG) = 2∗size(PROJ), and size(ASG ⋈ EMP) = size(ASG).

Considering only data transfers, the initial feasible solution is to choose site 4 as the result site, producing the schedule


ES0: EMP → site 4
     PAY → site 4
     PROJ → site 4
Total cost(ES0) = 4 + 8 + 1 = 13

This is true because the cost of any other solution is greater than the foregoing alternative. For example, if one chooses site 2 as the result site and transmits all the relations to that site, the total cost will be

Total cost = cost(EMP → site 2) + cost(ASG → site 2) + cost(PROJ → site 2)
           = 19

Similarly, the total cost of choosing either site 1 or site 3 as the result site is 15 and 22, respectively.

One way of splitting this schedule (call it ES′) is the following:

ES1: EMP → site 2
ES2: (EMP ⋈ PAY) → site 4
ES3: PROJ → site 4
Total cost(ES′) = 8 + 8 + 1 = 17

A second splitting alternative (ES′′) is as follows:

ES1: PAY → site 1
ES2: (PAY ⋈ EMP) → site 4
ES3: PROJ → site 4
Total cost(ES′′) = 4 + 8 + 1 = 13

Since the cost of either of the alternatives is greater than or equal to the cost of ES0, ES0 is kept as the final solution. A better solution (ignored by the algorithm) is

B: PROJ → site 4
   ASG′ = (PROJ ⋈ ASG) → site 1
   (ASG′ ⋈ EMP) → site 2
Total cost(B) = 1 + 2 + 2 = 5
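The choice of the initial feasible solution in Example 8.13 can be checked with a few lines of Python. The sketch below uses the sizes and sites from the example and the stated assumptions TMSG = 0 and TTR = 1; the function and variable names are ours, and the later splitting step of the hill-climbing algorithm is not modeled.

```python
# Relation sizes and storage sites taken from Example 8.13.
SIZE = {"EMP": 8, "PAY": 4, "PROJ": 1, "ASG": 10}
SITE = {"EMP": 1, "PAY": 2, "PROJ": 3, "ASG": 4}
T_MSG, T_TR = 0, 1  # cost assumptions stated in the example


def transfer_cost(size):
    """Cost of shipping `size` units of data in one message."""
    return T_MSG + T_TR * size


def cost_all_to(result_site):
    """Cost of shipping every relation not already at result_site."""
    return sum(transfer_cost(SIZE[r]) for r in SIZE if SITE[r] != result_site)


# Evaluate every candidate result site and keep the cheapest.
costs = {site: cost_all_to(site) for site in sorted(set(SITE.values()))}
result_site = min(costs, key=costs.get)
print(costs)        # {1: 15, 2: 19, 3: 22, 4: 13}
print(result_site)  # 4
```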

The semijoin-based algorithm extends the hill-climbing algorithm in a number of ways [Bernstein et al., 1981]. In addition to the extensive use of semijoins, the objective function is expressed in terms of total communication time (local time and response time are not considered). Furthermore, the algorithm uses statistics on the database, called database profiles, where a profile is associated with a relation. The algorithm also selects an initial feasible solution that is iteratively refined. Finally, a postoptimization step is added to improve the total time of the solution selected. The main step of the algorithm consists of determining and ordering beneficial semijoins, that is, semijoins whose cost is less than their benefit.

The cost of a semijoin is that of transferring the semijoin attributes A,


$$Cost(R \ltimes_A S) = T_{MSG} + T_{TR} * size(\Pi_A(S))$$

while its benefit is the cost of transferring irrelevant tuples of R (which is avoided by the semijoin):

$$Benefit(R \ltimes_A S) = (1 - SF_{SJ}(S.A)) * size(R) * T_{TR}$$
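The two formulas translate directly into code. The following Python sketch is a minimal illustration; the function names, default parameter values, and the sample statistics are hypothetical.

```python
def semijoin_cost(size_proj_a_s, t_msg=0.0, t_tr=1.0):
    """Cost of R semijoin S on A: ship the projection of S on A to R's site."""
    return t_msg + t_tr * size_proj_a_s


def semijoin_benefit(sf_sj_s_a, size_r, t_tr=1.0):
    """Transfer cost avoided for the tuples of R eliminated by the semijoin."""
    return (1 - sf_sj_s_a) * size_r * t_tr


def is_beneficial(sf_sj_s_a, size_r, size_proj_a_s, t_msg=0.0, t_tr=1.0):
    """A semijoin is beneficial when its cost is less than its benefit."""
    cost = semijoin_cost(size_proj_a_s, t_msg, t_tr)
    return cost < semijoin_benefit(sf_sj_s_a, size_r, t_tr)


# Hypothetical statistics: size(R) = 1000, SF_SJ(S.A) = 0.25,
# size of the projection of S on A = 50.
print(semijoin_benefit(0.25, 1000))   # 750.0
print(semijoin_cost(50))              # 50.0
print(is_beneficial(0.25, 1000, 50))  # True
```

A selectivity factor of 1.0 yields a benefit of 0, so such a semijoin is never beneficial regardless of its cost.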

The semijoin-based algorithm proceeds in four phases: initialization, selection of beneficial semijoins, assembly site selection, and postoptimization. The output of the algorithm is a global strategy for executing the query (Algorithm 8.6).

Algorithm 8.6: Semijoin-based-QOA
Input: QG: query graph with n relations; statistics for each relation
Output: ES: execution strategy
begin
  ES ← local-operations(QG);
  modify statistics to reflect the effect of local processing;
  BS ← ∅;   {set of beneficial semijoins}
  for each semijoin SJ in QG do
    if cost(SJ) < benefit(SJ) then
      BS ← BS ∪ SJ
  while BS ≠ ∅ do   {selection of beneficial semijoins}
    SJ ← most_beneficial(BS);   {SJ: semijoin with max(benefit − cost)}
    BS ← BS − SJ;   {remove SJ from BS}
    ES ← ES + SJ;   {append SJ to execution strategy}
    modify statistics to reflect the effect of incorporating SJ;
    BS ← BS − non-beneficial semijoins;
    BS ← BS ∪ new beneficial semijoins
  {assembly site selection}
  AS(ES) ← select site i such that i stores the largest amount of data after all local operations;
  ES ← ES ∪ transfers of intermediate relations to AS(ES);
  {postoptimization}
  for each relation Ri at AS(ES) do
    for each semijoin SJ of Ri by Rj do
      if cost(ES) > cost(ES − SJ) then
        ES ← ES − SJ
end

The initialization phase generates a set of beneficial semijoins, BS = {SJ1,SJ2, . . . , SJk}, and an execution strategy ES that includes only local processing. The next phase selects the beneficial semijoins from BS by iteratively choosing the most


beneficial semijoin, SJi, and modifying the database statistics and BS accordingly. The modification affects the statistics of relation R involved in SJi and the remaining semijoins in BS that use relation R. The iterative phase terminates when all semijoins in BS have been appended to the execution strategy. The order in which semijoins are appended to ES will be the execution order of the semijoins.

The next phase selects the assembly site by evaluating, for each candidate site, the cost of transferring to it all the required data and taking the one with the least cost. Finally, a postoptimization phase permits the removal from the execution strategy of those semijoins that affect only relations stored at the assembly site. This phase is necessary because the assembly site is chosen after all the semijoins have been ordered. The SDD-1 optimizer is based on the assumption that relations can be transmitted to another site. This is true for all relations except those stored at the assembly site, which is selected after beneficial semijoins are considered. Therefore, some semijoins may incorrectly be considered beneficial. It is the role of postoptimization to remove them from the execution strategy.

Example 8.14. Let us consider the following query:

SELECT R3.C
FROM   R1, R2, R3
WHERE  R1.A = R2.A
AND    R2.B = R3.B

Figure 8.15 gives the join graph of the query and the relation statistics. We assume that TMSG = 0 and TTR = 1. The initial set of beneficial semijoins will contain the following two:

SJ1: R2 ⋉ R1, whose benefit is 2100 = (1−0.3)∗3000 and cost is 36
SJ2: R2 ⋉ R3, whose benefit is 1800 = (1−0.4)∗3000 and cost is 80

Furthermore there are two non-beneficial semijoins:

SJ3: R1 ⋉ R2, whose benefit is 300 = (1−0.8)∗1500 and cost is 320
SJ4: R3 ⋉ R2, whose benefit is 0 and cost is 400.

At the first iteration of the selection of beneficial semijoins, SJ1 is appended to the execution strategy ES. One effect on the statistics is to change the size of R2 to 900 = 3000∗0.3. Furthermore, the semijoin selectivity factor of attribute R2.A is reduced because card(ΠA(R2)) is reduced. We approximate SFSJ(R2.A) by 0.8∗0.3 = 0.24. Finally, the size of ΠA(R2) is also reduced to 96 = 320∗0.3. Similarly, the semijoin selectivity factor of attribute R2.B and the size of ΠB(R2) should also be reduced (but they are not needed in the rest of the example).

At the second iteration, there are two beneficial semijoins:

SJ2: R′2 ⋉ R3, whose benefit is 540 = 900∗(1−0.4) and cost is 80 (here R′2 = R2 ⋉ R1, which is obtained by SJ1)

SJ3: R1 ⋉ R′2, whose benefit is 1140 = (1−0.24)∗1500 and cost is 96


[Figure: join graph in which R1 and R2 are joined on A and R2 and R3 are joined on B; R1 is stored at site 1, R2 at site 2, and R3 at site 3]

relation   card   tuple size   relation size
R1         30     50           1500
R2         100    30           3000
R3         50     40           2000

attribute   SFSJ   size(Πattribute)
R1.A        0.3    36
R2.A        0.8    320
R2.B        1.0    400
R3.B        0.4    80

Fig. 8.15 Example Query and Statistics

The most beneficial semijoin is SJ3 and is appended to ES. One effect on the statistics of relation R1 is to change the size of R1 to 360 (= 1500∗0.24). Another effect is to change the selectivity of R1 and the size of ΠA(R1).

At the third iteration, the only remaining beneficial semijoin, SJ2, is appended to ES. Its effect is to reduce the size of relation R2 to 360(= 900∗0.4). Again, the statistics of relation R2 may also change.

After reduction, the amount of data stored is 360 at site 1, 360 at site 2, and 2000 at site 3. Site 3 is therefore chosen as the assembly site. The postoptimization does not remove any semijoin since they all remain beneficial. The strategy selected is to send (R2 ⋉ R1) ⋉ R3 and R1 ⋉ R2 to site 3, where the final result is computed. ♦
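The assembly site choice at the end of Example 8.14 can be reproduced with a short Python sketch. It uses the reduced data volumes from the example and assumes a uniform transfer cost per unit of data (TTR = 1, TMSG = 0); under that assumption, picking the site that stores the most data is the same as picking the site to which shipping everything else is cheapest. The names below are ours.

```python
# Amount of data stored at each site after the semijoin reductions (Example 8.14).
data_at_site = {1: 360, 2: 360, 3: 2000}
T_TR = 1  # assumed uniform cost per unit of data transferred


def transfer_cost_to(site, data):
    """Cost of shipping the data of all other sites to `site`."""
    return T_TR * sum(size for s, size in data.items() if s != site)


# Criterion used in Algorithm 8.6: the site storing the largest amount of data.
by_largest = max(data_at_site, key=data_at_site.get)

# Equivalent criterion under uniform costs: the cheapest site to ship to.
by_cost = min(data_at_site, key=lambda s: transfer_cost_to(s, data_at_site))

print(by_largest, by_cost)                # 3 3
print(transfer_cost_to(3, data_at_site))  # 720
```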

Like its predecessor, the hill-climbing algorithm, the semijoin-based algorithm selects locally optimal strategies. Therefore, it ignores the higher-cost semijoins which would result in increasing the benefits and decreasing the costs of other semijoins. Thus this algorithm may not be able to select the global minimum cost solution.

8.4.4 Hybrid Approach

The static and dynamic distributed optimization approaches have the same advantages and disadvantages as in centralized systems (see Section 8.2.3). However, the problems of accurate cost estimation and comparison of QEPs at compile-time are much more severe in distributed systems. In addition to unknown bindings of parameter values in embedded queries, sites may become unavailable or overloaded at


runtime. In addition, relations (or relation fragments) may be replicated at several sites. Thus, site and copy selection should be done at runtime to increase availability and load balancing of the system.

The hybrid query optimization technique using dynamic QEPs (see Section 8.2.3) is general enough to incorporate site and copy selection decisions. However, the search space of alternative subplans linked by choose-plan operators becomes much larger and may result in heavy static plans and much higher startup time. Therefore, several hybrid techniques have been proposed to optimize queries in distributed systems [Carey and Lu, 1986; Du et al., 1995; Evrendilek et al., 1997]. They essentially rely on the following two-step approach:

1. At compile time, generate a static plan that specifies the ordering of operations and the access methods, without considering where relations are stored.

2. At startup time, generate an execution plan by carrying out site and copy selection and allocating the operations to the sites.

Example 8.15. Consider the following query expressed in relational algebra:

σ(R1) ⋈ R2 ⋈ R3

Figure 8.16 shows a 2-step plan for this query. The static plan shows the relational operation ordering as produced by a centralized query optimizer. The run-time plan extends the static plan with site and copy selection and communication between sites. For instance, the first selection is allocated at site s1 on copy R11 of relation R1 and sends its result to site s3 to be joined with R23 and so on. ♦

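[Figure: (a) Static plan: σ is applied to R1, and the result is joined with R2 and then with R3. (b) Run-time plan: σ is applied to copy R11 at site s1; the result is sent to site s3 and joined with R23; that result is sent to site s2 and joined with R32.]

Fig. 8.16 A 2-Step Plan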

The first step can be done by a centralized query optimizer. It may also include choose-plan operators so that runtime bindings can be used at startup time to make accurate cost estimations. The second step carries out site and copy selection, possibly in addition to choose-plan operator execution. Furthermore, it can optimize the load


balancing of the system. In the rest of this section, we illustrate this second step based on the seminal paper by Carey and Lu [1986] on two-step query optimization.

We consider a distributed database system with a set of sites S = {s1, ..,sn}. A query Q is represented as an ordered sequence of subqueries Q = {q1, ..,qm}. Each subquery qi is the maximum processing unit that accesses a single base relation and communicates with its neighboring subqueries. For instance, in Figure 8.16, there are three subqueries, one for R1, one for R2, and one for R3. Each site si has a load, denoted by load(si), which reflects the number of queries currently submitted. The load can be expressed in different ways, e.g. as the number of I/O bound and CPU bound queries at the site [Carey and Lu, 1986]. The average load of the system is defined as:

$$Avg\_load(S) = \frac{\sum_{i=1}^{n} load(s_i)}{n}$$

The balance of the system for a given allocation of subqueries to sites can be measured as the variance of the site loads using the following unbalance factor [Carey and Lu, 1986]:

$$UF(S) = \frac{1}{n} \sum_{i=1}^{n} (load(s_i) - Avg\_load(S))^2$$

As the system gets balanced, its unbalance factor approaches 0 (perfect balance). For example, with load(s1)=10 and load(s2)=30, the unbalance factor of s1,s2 is 100 while with load(s1)=20 and load(s2)=20, it is 0.
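The average load and the unbalance factor are straightforward to compute. The Python sketch below (function names are ours) implements the two definitions and checks them against the two-site figures given in the text.

```python
def avg_load(loads):
    """Average load over all sites."""
    return sum(loads) / len(loads)


def unbalance_factor(loads):
    """Variance of the site loads around the average load."""
    avg = avg_load(loads)
    return sum((load - avg) ** 2 for load in loads) / len(loads)


print(unbalance_factor([10, 30]))  # 100.0
print(unbalance_factor([20, 20]))  # 0.0
```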

The problem addressed by the second step of two-step query optimization can be formalized as the following subquery allocation problem. Given

1. a set of sites S = {s1, ..,sn} with the load of each site;
2. a query Q = {q1, ..,qm}; and
3. for each subquery qi in Q, a feasible allocation set of sites Sq = {s1, ...,sk} where each site stores a copy of the relation involved in qi;

the objective is to find an optimal allocation of Q to S such that

1. UF(S) is minimized, and
2. the total communication cost is minimized.

Carey and Lu [1986] propose an algorithm that finds near-optimal solutions in a reasonable amount of time. The algorithm, which we describe in Algorithm 8.7 for linear join trees, uses several heuristics. The first heuristic (step 1) is to start by allocating subqueries with least allocation flexibility, i.e. with the smallest feasible allocation sets of sites. Thus, subqueries with a few candidate sites are allocated earlier. Another heuristic (step 2) is to consider the sites with least load and best benefit. The benefit of a site is defined as the number of subqueries already allocated to the site and measures the communication cost savings from allocating the subquery


to the site. Finally, in step 3 of the algorithm, the load information of any unallocated subquery that has a selected site in its feasible allocation set is recomputed.

Algorithm 8.7: SQAllocation
Input: Q: q1, ..., qm;
       Feasible allocation sets: Sq1, ..., Sqm;
       Loads: load(S1), ..., load(Sm)
Output: an allocation of Q to S
begin
  for each q in Q do
    compute(load(Sq))
  while Q not empty do
    a ← q ∈ Q with least allocation flexibility;   {select subquery a for allocation} (1)
    b ← s ∈ Sa with least load and best benefit;   {select best site b for a} (2)
    Q ← Q − a;
    {recompute loads of remaining feasible allocation sets if necessary} (3)
    for each q ∈ Q where b ∈ Sq do
      compute(load(Sq))
end

Example 8.16. Consider the following query Q expressed in relational algebra:

σ(R1) ⋈ R2 ⋈ R3 ⋈ R4

Figure 8.17 shows the placement of the copies of the 4 relations at the 4 sites, and the site loads. We assume that Q is decomposed as Q = {q1,q2,q3,q4} where q1 is associated with R1, q2 with R2 joined with the result of q1, q3 with R3 joined with the result of q2, and q4 with R4 joined with the result of q3. The SQAllocation algorithm performs 4 iterations. At the first one, it selects q4 which has the least allocation flexibility, allocates it to s1 and updates the load of s1 to 2. At the second iteration, the next set of subqueries to be selected are either q2 or q3 since they have the same allocation flexibility. Let us choose q2 and assume it gets allocated to s2 (it could be allocated to s4 which has the same load as s2). The load of s2 is increased to 3. At the third iteration, the next subquery selected is q3 and it is allocated to s1 which has the same load as s3 but a benefit of 1 (versus 0 for s3) as a result of the allocation of q4. The load of s1 is increased to 3. Finally, at the last iteration, q1 gets allocated to either s3 or s4 which have the least loads. If in the second iteration q2 were allocated to s4 instead of to s2, then the fourth iteration would have allocated q1 to s4 because of a benefit of 1. This would have produced a better execution plan with less communication. This illustrates that two-step optimization can still miss optimal plans. ♦


sites   load   R1    R2    R3    R4
s1      1      R11         R31   R41
s2      2            R22
s3      2      R13         R33
s4      2      R14   R24

Fig. 8.17 Example Data Placement and Load
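The allocation traced in Example 8.16 can be reproduced with the following Python sketch. It is a simplified reading of Algorithm 8.7, not Carey and Lu's implementation, and rests on several assumptions: allocation flexibility is the size of the feasible set, "least load and best benefit" is interpreted as least load with ties broken by highest benefit, each allocated subquery adds 1 to the load of its site, and remaining ties are broken by the order in which subqueries and sites are listed.

```python
def sq_allocation(feasible, load):
    """Greedy allocation of subqueries to sites; returns (allocation, loads)."""
    load = dict(load)           # work on a copy of the site loads
    allocation = {}             # subquery -> chosen site
    remaining = list(feasible)  # subqueries still to allocate

    while remaining:
        # (1) subquery with least allocation flexibility (smallest feasible set)
        a = min(remaining, key=lambda q: len(feasible[q]))

        # (2) site with least load; ties broken by highest benefit, i.e. the
        #     number of subqueries already allocated to that site
        def rank(site):
            benefit = sum(1 for s in allocation.values() if s == site)
            return (load[site], -benefit)

        b = min(feasible[a], key=rank)
        allocation[a] = b
        remaining.remove(a)

        # (3) assumed load model: one more subquery running at site b
        load[b] += 1

    return allocation, load


# Data placement and loads from Figure 8.17.
feasible = {
    "q1": ["s1", "s3", "s4"],
    "q2": ["s2", "s4"],
    "q3": ["s1", "s3"],
    "q4": ["s1"],
}
loads = {"s1": 1, "s2": 2, "s3": 2, "s4": 2}

allocation, final_loads = sq_allocation(feasible, loads)
print(allocation)   # {'q4': 's1', 'q2': 's2', 'q3': 's1', 'q1': 's3'}
print(final_loads)  # {'s1': 3, 's2': 3, 's3': 3, 's4': 2}
```

With a different tie-breaking order at the second iteration (q2 to s4), q1 would follow it to s4, which is the less communication-intensive plan mentioned in the example.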

This algorithm has reasonable complexity. It considers each subquery in turn, considering each potential site, selects a current one for allocation, and sorts the list of remaining subqueries. Thus, its complexity can be expressed as O(max(m ∗ n, m² ∗ log₂ m)).

Finally, the algorithm includes a refining phase to further optimize join processing and decide whether or not to use semijoins. Although it minimizes communication given a static plan, two-step query optimization may generate runtime plans that have higher communication cost than the optimal plan. This is because the first step is carried out ignoring data location and its impact on communication cost. For instance, consider the runtime plan in Figure 8.16 and assume that the third subquery on R3 is allocated to site s1 (instead of site s2). In this case, the plan that does the join (or Cartesian product) of the result of the selection of R1 with R3 first at site s1 may be better since it minimizes communication. A solution to this problem is to perform plan reorganization using operation tree transformations at startup time [Du et al., 1995].

8.5 Conclusion

In this chapter we have presented the basic concepts and techniques for distributed query optimization. We first introduced the main components of query optimization, including the search space, the cost model and the search strategy. The details of the environment (centralized versus distributed) are captured by the search space and the cost model. The search space describes the equivalent execution plans for the input query. These plans differ on the execution order of operations and their implementation, and therefore on performance. The search space is obtained by applying transformation rules, such as those described in Section 7.1.4.

The cost model is key to estimating the cost of a given execution plan. To be accurate, the cost model must have good knowledge about the distributed execution environment. Important inputs are the database statistics and the formulas used to estimate the size of intermediate results. For simplicity, earlier cost models relied on the strong assumption that the distribution of attribute values in a relation is uniform. However, in case of skewed data distributions, this can result in fairly inaccurate estimations and execution plans which are far from the optimal. An


effective solution to accurately capture data distributions is to use histograms. Today, most commercial DBMS optimizers support histograms as part of their cost model. A remaining difficulty is estimating the selectivity of the join operation when it is not on a foreign key. In this case, maintaining join selectivity factors is of great benefit [Mackert and Lohman, 1986]. Earlier distributed DBMSs considered transmission costs only. With the availability of faster communication networks, it is important to consider local processing costs as well.

The search strategy explores the search space and selects the best plan, using the cost model. It defines which plans are examined and in which order. The most popular search strategy is dynamic programming, which enumerates all equivalent execution plans with some pruning. However, it may incur a high optimization cost for queries involving a large number of relations. Thus, it is best suited when optimization is static (done at compile time) and amortized over multiple executions. Randomized strategies, such as Iterative Improvement and Simulated Annealing, have received much attention. They do not guarantee that the best solution is obtained, but avoid the high cost of optimization. Thus, they are appropriate for ad-hoc queries which are not repetitive.

As a prerequisite to understanding distributed query optimization, we have introduced centralized query optimization with the three basic techniques: dynamic, static and hybrid. Dynamic and static query optimization both have advantages and drawbacks. Dynamic query optimization can make accurate optimization choices at run-time, but optimization is repeated for each query execution. Therefore, this approach is best for ad-hoc queries. Static query optimization, done at compilation time, is best for queries embedded in stored procedures, and has been adopted by all commercial DBMSs. However, it can make major estimation errors, in particular, in the case of parameter values not known until runtime, which can lead to the choice of suboptimal execution plans. Hybrid query optimization attempts to provide the advantages of static query optimization while avoiding the issues generated by inaccurate estimates. The approach is basically static, but further optimization decisions may take place at run time.

Next, we have seen two approaches to solve distributed join queries, which are the most important type of queries. The first one considers join ordering. The second one computes joins with semijoins. Semijoins are beneficial only when a join has good selectivity, in which case the semijoins act as powerful size reducers. The first systems that made extensive use of semijoins assumed a slow network and therefore concentrated on minimizing only the communication time at the expense of local processing time. However, with faster networks, the local processing time is as important as the communication time and sometimes even more important. Therefore, semijoins should be employed carefully since they tend to increase the local processing time. Join and semijoin techniques should be considered complementary, not alternative [Valduriez and Gardarin, 1984], because each technique may be better under certain database-dependent parameters. For instance, if a relation has very large tuples, as is the case with multimedia data, semijoin is useful to minimize data transfers. Finally, semijoins implemented by hashed bit arrays [Valduriez, 1982] can be made very efficient [Mackert and Lohman, 1986].


We illustrated the use of the join and semijoin techniques in four basic distributed query optimization algorithms: dynamic, static, semijoin-based and hybrid. The static and dynamic distributed optimization approaches have the same advantages and disadvantages as in centralized systems. The semijoin-based approach is best for slow networks. The hybrid approach is best in today’s dynamic environments as it delays important decisions such as copy selection and allocation of subqueries to sites until query startup time. Thus, it can better increase availability and load balancing of the system. We illustrated the hybrid approach with two-step query optimization which first generates a static plan that specifies the operations ordering as in a centralized system and then generates an execution plan at startup time, by carrying out site and copy selection and allocating the operations to the sites.

In this chapter we focused mostly on join queries for two reasons: join queries are the most frequent queries in the relational framework and they have been studied extensively. Furthermore, the number of joins involved in queries expressed in languages of higher expressive power than relational calculus (e.g., Horn clause logic) can be extremely large, making the join ordering more crucial [Krishnamurthy et al., 1986]. However, the optimization of general queries containing joins, unions, and aggregate functions is a harder problem [Selinger and Adiba, 1980]. Distributing unions over joins is a simple and good approach since the query can be reduced to a union of join subqueries, which are optimized individually. Note also that unions are more frequent in distributed DBMSs because they permit the localization of horizontally fragmented relations.

8.6 Bibliographic Notes

Good surveys of query optimization are provided in [Graefe, 1993], [Ioannidis, 1996] and [Chaudhuri, 1998]. Distributed query optimization is surveyed in [Kossmann, 2000].

The three basic algorithms for query optimization in centralized systems are: the dynamic algorithm of INGRES [Wong and Youssefi, 1976] which performs query reduction, the static algorithm of System R [Selinger et al., 1979] which uses dynamic programming and a cost model and the hybrid algorithm of Volcano [Cole and Graefe, 1994] which uses choose-plan operators.

The theory of semijoins and their value for distributed query processing has been covered in [Bernstein and Chiu, 1981], [Chiu and Ho, 1980], and [Kambayashi et al., 1982]. Algorithms for improving the processing of semijoins in distributed systems are proposed in [Valduriez, 1982]. The value of semijoins for multiprocessor database machines having fast communication networks is also shown in [Valduriez and Gardarin, 1984]. Parallel execution strategies for horizontally fragmented databases are treated in [Ceri and Pelagatti, 1983] and [Khoshafian and Valduriez, 1987]. The solutions in [Shasha and Wang, 1991] are also applicable to parallel systems.

The dynamic approach to distributed query optimization was first proposed for Distributed INGRES in [Epstein et al., 1978]. It extends the dynamic algorithm of INGRES with a heuristic approach. The algorithm takes advantage of the network topology (general or broadcast networks). Improvements on this method based on the enumeration of all possible solutions are given and analyzed in [Epstein and Stonebraker, 1980].

The static approach to distributed query optimization was first proposed for R* in [Selinger and Adiba, 1980] as an extension of the static algorithm of System R. It is one of the first papers to recognize the significance of local processing on the performance of distributed queries. Experimental validation in [Lohman and Mackert, 1986] has confirmed this important statement.

The semijoin-based approach to distributed query optimization was proposed in [Bernstein et al., 1981] for SDD-1 [Wong, 1977]. It is one of the most complete algorithms which make full use of semijoins.

Several hybrid approaches based on two-step query optimization have been proposed for distributed systems [Carey and Lu, 1986; Du et al., 1995; Evrendilek et al., 1997]. The content of Section 8.4.4 is based on [Carey and Lu, 1986], which is the first paper on two-step query optimization. In [Du et al., 1995], efficient operations to transform linear join trees (produced by the first step) into bushy trees which exhibit more parallelism are proposed. In [Evrendilek et al., 1997], a solution to maximize intersite join parallelism in the second step is proposed.

Exercises

Problem 8.1 (*). Apply the dynamic query optimization algorithm in Section 8.2.1 to the query of Exercise 7.3, and illustrate the successive detachments and substitutions by giving the monorelation subqueries generated.

Problem 8.2. Consider the join graph of Figure 8.12 and the following information: size(EMP) = 100, size(ASG) = 200, size(PROJ) = 300, size(EMP ⋈ ASG) = 300, and size(ASG ⋈ PROJ) = 200. Describe an optimal join program based on the objective function of total transmission time.

Problem 8.3. Consider the join graph of Figure 8.12 and make the same assumptions as in Problem 8.2. Describe an optimal join program that minimizes response time (consider only communication).

Problem 8.4. Consider the join graph of Figure 8.12, and give a program (possibly not optimal) that reduces each relation fully by semijoins.

Problem 8.5 (*). Consider the join graph of Figure 8.12 and the fragmentation depicted in Figure 8.18. Also assume that size(EMP ⋈ ASG) = 2000 and size(ASG ⋈ PROJ) = 1000. Apply the dynamic distributed query optimization algorithm in Section 8.4.1 in two cases, general network and broadcast network, so that communication time is minimized.


Rel. Site 1 Site 2 Site 3

EMP 1000 1000 1000

ASG 2000

PROJ 1000

Fig. 8.18 Fragmentation

Problem 8.6. Consider the join graph of Figure 8.19 and the statistics given in Figure 8.20. Apply the semijoin-based distributed query optimization algorithm in Section 8.4.3 with T_MSG = 20 and T_TR = 1.

[Figure 8.19: join graph over relations R1, R2, R3 and R4, with edge labels A, B, B, B]

Fig. 8.19 Join Graph

0.5

0.1

0.9

0.4

100

200

300

150

R 1 .A

R 2 .A

R 3 .B

R 4 .B

relation size

1000

1000

2000

R 1

R 2

R 3

R 3

1000

attribute size SFSJ

0.2100R 2 .A

(a) (b)

Fig. 8.20 Relation Statistics

Problem 8.7 (**). Consider the query in Problem 7.5. Assume that relations EMP, ASG, PROJ and PAY have been stored at sites 1, 2, and 3 according to the table in Figure 8.21. Assume also that the transfer rate between any two sites is equal and that data transfer is 100 times slower than data processing performed by any site. Finally, assume that size(R ⋈ S) = max(size(R), size(S)) for any two relations R and S, and the selectivity factor of the disjunctive selection of the query in Exercise 7.5 is 0.5. Compose a distributed program which computes the answer to the query and minimizes total time.

Rel. Site 1 Site 2 Site 3

EMP 2000

500

ASG 3000

PROJ 1000

PAY

Fig. 8.21 Fragmentation Statistics

Problem 8.8 (**). In Section 8.4.4, we described Algorithm 8.7 for linear join trees. Extend this algorithm to support bushy join trees. Apply it to the bushy join tree in Figure 8.3 using the data placement and site loads shown in Figure 8.17.

Chapter 9 Multidatabase Query Processing

In the previous three chapters, we have considered query processing in tightly-coupled homogeneous distributed database systems. As we discussed in Chapter 1, these systems are logically integrated and provide a single image of the database, even though they are physically distributed. In this chapter, we concentrate on query processing in multidatabase systems that provide interoperability among a set of DBMSs. This is only one part of the more general interoperability problem. Distributed applications pose major requirements regarding the databases they access, in particular, the ability to access legacy data as well as newly developed databases. Thus, providing integrated access to multiple, distributed databases and other heterogeneous data sources has become a topic of increasing interest and focus.

Many of the distributed query processing and optimization techniques carry over to multidatabase systems, but there are important differences. Recall from Chapter 6 that we characterized distributed query processing in four steps: query decomposition, data localization, global optimization, and local optimization. The nature of multidatabase systems requires slightly different steps and different techniques. The component DBMSs may be autonomous and have different database languages and query processing capabilities. Thus, a multi-DBMS layer (see Figure 1.17) is necessary to communicate with component DBMSs in an effective way, and this requires additional query processing steps (Figure 9.1). Furthermore, there may be many component DBMSs, each of which may exhibit different behavior, thereby posing new requirements for more adaptive query processing techniques.

This chapter is organized as follows. In Section 9.1 we introduce in more detail the main issues in multidatabase query processing. Assuming the mediator/wrapper architecture, we describe the multidatabase query processing architecture in Section 9.2. Section 9.3 describes the techniques for rewriting queries using multidatabase views. Section 9.4 describes multidatabase query optimization and execution, in particular, heterogeneous cost modeling, heterogeneous query optimization, and adaptive query processing. Section 9.5 describes query translation and execution at the wrappers, in particular, the techniques for translating queries for execution by the component DBMSs and for generating and managing wrappers.

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_9, © Springer Science+Business Media, LLC 2011


9.1 Issues in Multidatabase Query Processing

Query processing in a multidatabase system is more complex than in a distributed DBMS for the following reasons [Sheth and Larson, 1990]:

1. The computing capabilities of the component DBMSs may be different, which prevents uniform treatment of queries across multiple DBMSs. For example, some DBMSs may be able to support complex SQL queries with join and aggregation while some others cannot. Thus the multidatabase query processor should consider the various DBMS capabilities.

2. Similarly, the cost of processing queries may be different on different DBMSs, and the local optimization capability of each DBMS may be quite different. This increases the complexity of the cost functions that need to be evaluated.

3. The data models and languages of the component DBMSs may be quite different, for instance, relational, object-oriented, XML, etc. This creates difficulties in translating multidatabase queries into queries for the component DBMSs and in integrating heterogeneous results.

4. Since a multidatabase system enables access to very different DBMSs that may have different performance and behavior, distributed query processing techniques need to adapt to these variations.

The autonomy of the component DBMSs poses problems. DBMS autonomy can be defined along three main dimensions: communication, design and execution [Lu et al., 1993]. Communication autonomy means that a component DBMS communicates with others at its own discretion, and, in particular, it may terminate its services at any time. This requires query processing techniques that are tolerant to system unavailability. The question is how the system answers queries when a component system is either unavailable from the beginning or shuts down in the middle of query execution. Design autonomy may restrict the availability and accuracy of cost information that is needed for query optimization. The difficulty of determining local cost functions is an important issue. The execution autonomy of multidatabase systems makes it difficult to apply some of the query optimization strategies we discussed in previous chapters. For example, semijoin-based optimization of distributed joins may be difficult if the source and target relations reside in different component DBMSs, since, in this case, the semijoin execution of a join translates into three queries: one to retrieve the join attribute values of the target relation and to ship them to the source relation’s DBMS, the second to perform the join at the source relation, and the third to perform the join at the target relation’s DBMS. The problem arises because communication with component DBMSs occurs at a high level of the DBMS API.
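The three queries of the semijoin example can be sketched with two independent in-memory SQLite databases standing in for two autonomous component DBMSs. The table contents and the temporary table EMP_REDUCED are assumptions of this sketch; a real multidatabase system would go through wrappers rather than direct connections.

```python
import sqlite3

# Two separate databases play the role of two autonomous component DBMSs.
source_db = sqlite3.connect(":memory:")   # holds EMP, the source relation
target_db = sqlite3.connect(":memory:")   # holds ASG, the target relation

source_db.execute("CREATE TABLE EMP (ENO INTEGER, ENAME TEXT)")
source_db.executemany("INSERT INTO EMP VALUES (?, ?)",
                      [(1, "Smith"), (2, "Lee"), (3, "Chen")])
target_db.execute("CREATE TABLE ASG (ENO INTEGER, PNO TEXT)")
target_db.executemany("INSERT INTO ASG VALUES (?, ?)",
                      [(1, "P1"), (3, "P2")])

# Query 1: retrieve the join attribute values of the target relation.
enos = [row[0] for row in target_db.execute("SELECT DISTINCT ENO FROM ASG")]

# Query 2: at the source DBMS, keep only the tuples that match the shipped values.
placeholders = ",".join("?" * len(enos))
reduced_emp = source_db.execute(
    "SELECT ENO, ENAME FROM EMP WHERE ENO IN (%s) ORDER BY ENO" % placeholders,
    enos,
).fetchall()

# Query 3: ship the reduced relation to the target DBMS and perform the join there.
target_db.execute("CREATE TEMP TABLE EMP_REDUCED (ENO INTEGER, ENAME TEXT)")
target_db.executemany("INSERT INTO EMP_REDUCED VALUES (?, ?)", reduced_emp)
result = target_db.execute(
    "SELECT e.ENO, e.ENAME, a.PNO "
    "FROM EMP_REDUCED e JOIN ASG a ON e.ENO = a.ENO ORDER BY e.ENO"
).fetchall()

print(result)
```

Every step is a complete query submitted through the public interface of a component DBMS, which is the overhead that the paragraph above points to.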

In addition to these difficulties, the architecture of a distributed multidatabase system poses certain challenges. The architecture depicted in Figure 1.17 points to an additional complexity. In distributed DBMSs, query processors have to deal only with data distribution across multiple sites. In a distributed multidatabase environment, on the other hand, data are distributed not only across sites but also across multiple databases, each managed by an autonomous DBMS. Thus, while there are two parties that cooperate in the processing of queries in a distributed DBMS (the control site and local sites), the number of parties increases to three in the case of a distributed multi-DBMS: the multi-DBMS layer at the control site (i.e., the mediator) receives the global query, the multi-DBMS layers at the sites (i.e., the wrappers) participate in processing the query, and the component DBMSs ultimately optimize and execute the query.

9.2 Multidatabase Query Processing Architecture

Most of the work on multidatabase query processing has been done in the context of the mediator/wrapper architecture (see Figure 1.18). In this architecture, each component database has an associated wrapper that exports information about the source schema, data and query processing capabilities. A mediator centralizes the information provided by the wrappers in a unified view of all available data (stored in a global data dictionary) and performs query processing using the wrappers to access the component DBMSs. The data model used by the mediator can be relational, object-oriented or even semi-structured (based on XML). In this chapter, for consistency with the previous chapters on distributed query processing, we continue to use the relational model, which is quite sufficient to explain the multidatabase query processing techniques.

The mediator/wrapper architecture has several advantages. First, the specialized components of the architecture allow the various concerns of different kinds of users to be handled separately. Second, mediators typically specialize in a related set of component databases with “similar” data, and thus export schemas and semantics related to a particular domain. The specialization of the components leads to a flexible and extensible distributed system. In particular, it allows seamless integration of different data stored in very different components, ranging from full-fledged relational DBMSs to simple files.

Assuming the mediator/wrapper architecture, we can now discuss the various layers involved in query processing in distributed multidatabase systems as shown in Figure 9.1. As before, we assume the input is a query on global relations expressed in relational calculus. This query is posed on global (distributed) relations, meaning that data distribution and heterogeneity are hidden. Three main layers are involved in multidatabase query processing. This layering is similar to that of query processing in homogeneous distributed DBMSs (see Figure 6.3). However, since there is no fragmentation, there is no need for the data localization layer.

The first two layers map the input query into an optimized distributed query execution plan (QEP). They perform the functions of query rewriting, query optimization and some query execution. The first two layers are performed by the mediator and use meta-information stored in the global directory (global schema, allocation and capability schema). Query rewriting transforms the input query into a query on local relations, using the global schema. Recall from Chapter 4 that there are two main approaches for database integration: global-as-view (GAV) and local-as-view (LAV). Thus, the global schema provides the view definitions (i.e., mappings between the global relations and the local relations stored in the component databases) and the query is rewritten using the views.

[Figure 9.1: at the mediator site, REWRITING (using the GLOBAL SCHEMA) maps the QUERY ON GLOBAL RELATIONS to a QUERY ON LOCAL RELATIONS, and OPTIMIZATION & EXECUTION (using the ALLOC. & CAP. SCHEMA) produces the DISTRIBUTED QUERY EXECUTION PLAN; at the wrapper sites, TRANSLATION & EXECUTION (using the WRAPPER SCHEMA) returns Results]

Fig. 9.1 Generic Layering Scheme for Multidatabase Query Processing

Rewriting can be done at the relational calculus or algebra levels. In this chapter, we will use a generalized form of relational calculus called Datalog [Ullman, 1988] which is well suited for such rewriting. Thus, there is an additional step of calculus to algebra translation that is similar to the decomposition step in homogeneous distributed DBMSs.

The second layer performs query optimization and (some) execution by considering the allocation of the local relations and the different query processing capabilities of the component DBMSs exported by the wrappers. The allocation and capability schema used by this layer may also contain heterogeneous cost information. The distributed QEP produced by this layer groups within subqueries the operations that can be performed by the component DBMSs and wrappers. Similar to distributed DBMSs, query optimization can be static or dynamic. However, the lack of homogeneity in multidatabase systems (e.g., some component DBMSs may have unexpectedly long delays in answering) makes dynamic query optimization more critical. In the case of dynamic optimization, there may be subsequent calls to this layer after execution by the next layer. This is illustrated by the arrow showing results coming from the next layer. Finally, this layer integrates the results coming from the different wrappers to provide a unified answer to the user’s query. This requires the capability of executing some operations on data coming from the wrappers. Since the wrappers may provide very limited execution capabilities, e.g., in the case of very simple component DBMSs, the mediator must provide the full execution capabilities to support the mediator interface.

The third layer performs query translation and execution using the wrappers. Then it returns the results to the mediator that can perform result integration from different wrappers and subsequent execution. Each wrapper maintains a wrapper schema that includes the local export schema (see Chapter 4) and mapping information to facilitate the translation of the input subquery (a subset of the QEP) expressed in a common language into the language of the component DBMS. After the subquery is translated, it is executed by the component DBMS and the local result is translated back to the common format.

The wrapper schema contains information describing how mappings from/to participating local schemas and global schema can be performed. It enables conversions between components of the database in different ways. For example, if the global schema represents temperatures in Fahrenheit degrees, but a participating database uses Celsius degrees, the wrapper schema must contain a conversion formula to provide the proper presentation to the global user and the local databases. If the conversion is across types and simple formulas cannot perform the translation, complete mapping tables could be used in the wrapper schema.
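As an illustration of the two kinds of conversion, the sketch below stores a formula for temperatures and a complete mapping table for a cross-type conversion. The attribute names TEMP and STATUS, the dictionary layout, and the status codes are hypothetical; an actual wrapper schema would be considerably richer.

```python
def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0


def fahrenheit_to_celsius(f):
    return (f - 32.0) * 5.0 / 9.0


# Hypothetical wrapper schema: one conversion pair per attribute that needs one.
WRAPPER_SCHEMA = {
    # Same type on both sides: a formula is enough.
    "TEMP": {"to_global": celsius_to_fahrenheit,
             "to_local": fahrenheit_to_celsius},
    # Conversion across types (integer code <-> string): a complete mapping table.
    "STATUS": {"to_global": {1: "active", 0: "inactive"}.get,
               "to_local": {"active": 1, "inactive": 0}.get},
}


def convert(row, direction):
    """Convert one tuple (a dict) between the local and the global representation."""
    return {
        attr: WRAPPER_SCHEMA[attr][direction](value) if attr in WRAPPER_SCHEMA else value
        for attr, value in row.items()
    }


local_row = {"CITY": "Paris", "TEMP": 100.0, "STATUS": 1}
global_row = convert(local_row, "to_global")
print(global_row)
```

The same table serves both directions: to_local is applied to constants in a subquery before it is sent to the component DBMS, and to_global is applied to the tuples it returns.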

9.3 Query Rewriting Using Views

Query rewriting reformulates the input query expressed on global relations into a query on local relations. It uses the global schema, which describes in terms of views the correspondences between the global relations and the local relations. Thus, the query must be rewritten using views. The techniques for query rewriting differ in major ways depending on the database integration approach that is used, i.e., global-as-view (GAV) or local-as-view (LAV). In particular, the techniques for LAV (and its extension GLAV) are much more involved [Halevy, 2001]. Most of the work on query rewriting using views has been done using Datalog [Ullman, 1988], which is a logic-based database language. Datalog is more concise than relational calculus and thus more convenient for describing complex query rewriting algorithms. In this section, we first introduce Datalog terminology. Then, we describe the main techniques and algorithms for query rewriting in the GAV and LAV approaches.

9.3.1 Datalog Terminology

Datalog can be viewed as an in-line version of domain relational calculus. Let us first define conjunctive queries, i.e., select-project-join queries, which are the basis for more complex queries. A conjunctive query in Datalog is expressed as a rule of the form:

Q(T) :− R1(T1), ..., Rn(Tn)

The atom Q(T) is the head of the query and denotes the result relation. The atoms R1(T1), ..., Rn(Tn) are the subgoals in the body of the query and denote database relations. Q and R1, ..., Rn are predicate names and correspond to relation names. T, T1, ..., Tn refer to the relation tuples and contain variables or constants. The variables are similar to domain variables in domain relational calculus. Thus, the use of the same variable name in multiple predicates expresses equijoin predicates. Constants correspond to equality predicates. More complex comparison predicates (e.g., using comparators such as ≠, ≤ and <) must be expressed as other subgoals. We consider queries which are safe, i.e., those where each variable in the head also appears in the body. Disjunctive queries can also be expressed in Datalog using unions, by having several conjunctive queries with the same head predicate.
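The safety condition is easy to check mechanically. In the sketch below, an atom is represented as a pair (predicate, terms) and constants are written with surrounding double quotes; this encoding and the predicate names are assumptions made for illustration only.

```python
def is_var(term):
    """Assumed convention: constants carry double quotes, everything else is a variable."""
    return not term.startswith('"')


def is_safe(head, body):
    """A rule is safe if every variable of the head also occurs in some body atom."""
    body_vars = {t for _, terms in body for t in terms if is_var(t)}
    return all(t in body_vars for t in head[1] if is_var(t))


body = [
    ("EMP", ("ENO", "ENAME", '"Programmer"', "CITY")),
    ("ASG", ("ENO", "PNO", "DUR")),
]

print(is_safe(("Q", ("ENO", "PNO")), body))   # both head variables occur in the body
print(is_safe(("Q", ("ENO", "SAL")), body))   # SAL occurs nowhere in the body
```

An unsafe rule has no finite answer in general, because a head variable that is unconstrained by the body could take any value.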

Example 9.1. Let us consider relations EMP(ENO, ENAME, TITLE, CITY) and ASG(ENO, PNO, DUR) assuming that ENO is the primary key of EMP and (ENO, PNO) is the primary key of ASG. Consider the following SQL query:

SELECT EMP.ENO, TITLE, PNO
FROM   EMP, ASG
WHERE  EMP.ENO = ASG.ENO
AND    (TITLE = "Programmer" OR DUR = 24)

The corresponding query in Datalog can be expressed as:

Q(ENO,“Programmer”,PNO) :− EMP(ENO,ENAME,“Programmer”,CITY), ASG(ENO,PNO,DUR)

Q(ENO,TITLE,PNO) :− EMP(ENO,ENAME,TITLE,CITY), ASG(ENO,PNO,24)
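To see how the two rules combine, the following sketch evaluates them over a small, made-up instance of EMP and ASG. Each rule becomes one set comprehension, the shared variable ENO becomes an equality test, and the union of the two results gives Q. The tuples are hypothetical.

```python
EMP = [  # (ENO, ENAME, TITLE, CITY)
    ("E1", "Doe", "Programmer", "Paris"),
    ("E2", "Roe", "Analyst", "Lyon"),
]
ASG = [  # (ENO, PNO, DUR)
    ("E1", "P1", 12),
    ("E2", "P2", 24),
    ("E2", "P3", 6),
]

# First rule: the constant "Programmer" is an equality predicate on TITLE.
rule1 = {
    (eno, title, pno)
    for (eno, _ename, title, _city) in EMP
    for (eno2, pno, _dur) in ASG
    if eno == eno2 and title == "Programmer"
}

# Second rule: the constant 24 is an equality predicate on DUR.
rule2 = {
    (eno, title, pno)
    for (eno, _ename, title, _city) in EMP
    for (eno2, pno, dur) in ASG
    if eno == eno2 and dur == 24
}

# Rules with the same head predicate are combined by union (the disjunction).
Q = rule1 | rule2
print(sorted(Q))
```

The join condition appears in both rules, which is why the disjunction in the SQL query has to be read as applying only to the two selection predicates.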

9.3.2 Rewriting in GAV

In the GAV approach, the global schema is expressed in terms of the data sources and each global relation is defined as a view over the local relations. This is similar to the global schema definition in tightly-integrated distributed DBMSs. In particular, the local relations (i.e., relations in a component DBMS) can correspond to fragments. However, since the local databases pre-exist and are autonomous, it may happen that tuples in a global relation do not exist in local relations or that a tuple in a global relation appears in different local relations. Thus, the properties of completeness and disjointness of fragmentation cannot be guaranteed. The lack of completeness may yield incomplete answers to queries. The lack of disjointness may yield duplicate results that may still be useful information and may not need to be eliminated. Similar to queries, view definitions can use Datalog notation.

Example 9.2. Let us consider the local relations EMP1(ENO, ENAME, TITLE, CITY), EMP2(ENO, ENAME, TITLE, CITY) and ASG1(ENO, PNO, DUR). The global relations EMP(ENO, ENAME, CITY) and ASG(ENO, PNO, TITLE, DUR) can be simply defined with the following Datalog rules:

EMP(ENO,ENAME,CITY) :− EMP1(ENO,ENAME,TITLE,CITY) (r1)

EMP(ENO,ENAME,CITY) :− EMP2(ENO,ENAME,TITLE,CITY) (r2)

ASG(ENO,PNO,TITLE,DUR) :− EMP1(ENO,ENAME,TITLE,CITY), ASG1(ENO,PNO,DUR) (r3)

ASG(ENO,PNO,TITLE,DUR) :− EMP2(ENO,ENAME,TITLE,CITY), ASG1(ENO,PNO,DUR) (r4)

Rewriting a query expressed on the global schema into an equivalent query on the local relations is relatively simple and similar to data localization in tightly-integrated distributed DBMS (see Section 7.2). The rewriting technique using views is called unfolding [Ullman, 1997], and it replaces each global relation invoked in the query with its corresponding view. This is done by applying the view definition rules to the query and producing a union of conjunctive queries, one for each rule application. Since a global relation may be defined by several rules (see Example 9.2), unfolding can generate redundant queries that need to be eliminated.

Example 9.3. Let us consider the global schema in Example 9.2 and the following query Q that asks for assignment information about the employees living in “Paris”:

Q(e, p) :−EMP(e,ENAME,“Paris”),ASG(e, p,TITLE,DUR).

Unfolding Q produces Q′ as follows:

Q′(e, p) :− EMP1(e,ENAME,TITLE,“Paris”), ASG1(e, p,DUR). (q1)

Q′(e, p) :− EMP2(e,ENAME,TITLE,“Paris”), ASG1(e, p,DUR). (q2)

Q′ is the union of two conjunctive queries labeled as q1 and q2. q1 is obtained by applying rule r3 or both rules r1 and r3. In the latter case, the query obtained is redundant with respect to that obtained with r3 only. Similarly, q2 is obtained by applying rule r4 or both rules r2 and r4. ♦
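The mechanics of unfolding can be sketched in a few lines. The code below only performs the substitution step: it enumerates one conjunctive query per combination of rules (four here) and does not eliminate redundant ones. The encoding of atoms as (predicate, terms) pairs, the quoting of constants, and the renaming suffixes are assumptions of this sketch, and view heads are assumed to contain variables only.

```python
from itertools import count, product


def is_var(term):
    """Assumed convention: constants carry double quotes, other terms are variables."""
    return not term.startswith('"')


def rename(rule, suffix):
    """Rename the variables of a view rule apart from those of the query."""
    (head_pred, head_args), body = rule

    def fresh(term):
        return term + suffix if is_var(term) else term

    new_head = (head_pred, tuple(fresh(t) for t in head_args))
    new_body = [(p, tuple(fresh(t) for t in args)) for p, args in body]
    return new_head, new_body


def apply_view(atom, rule):
    """Replace a global atom by a view body, binding head variables to the atom's terms."""
    (_, head_args), body = rule
    theta = dict(zip(head_args, atom[1]))
    return [(p, tuple(theta.get(t, t) for t in args)) for p, args in body]


def unfold(query_body, views):
    """Return one conjunctive query (list of local atoms) per combination of rules."""
    counter = count(1)
    choices = []
    for atom in query_body:
        rules = [r for r in views if r[0][0] == atom[0]]
        choices.append(
            [apply_view(atom, rename(r, "_%d" % next(counter))) for r in rules]
        )
    return [sum(combo, []) for combo in product(*choices)]


E4 = ("ENO", "ENAME", "TITLE", "CITY")
A3 = ("ENO", "PNO", "DUR")
VIEWS = [
    (("EMP", ("ENO", "ENAME", "CITY")), [("EMP1", E4)]),                      # r1
    (("EMP", ("ENO", "ENAME", "CITY")), [("EMP2", E4)]),                      # r2
    (("ASG", ("ENO", "PNO", "TITLE", "DUR")), [("EMP1", E4), ("ASG1", A3)]),  # r3
    (("ASG", ("ENO", "PNO", "TITLE", "DUR")), [("EMP2", E4), ("ASG1", A3)]),  # r4
]
QUERY_BODY = [("EMP", ("e", "ENAME", '"Paris"')), ("ASG", ("e", "p", "TITLE", "DUR"))]

unfolded = unfold(QUERY_BODY, VIEWS)
for cq in unfolded:
    print(cq)
```

The combination of r1 and r3 contains two EMP1 atoms that share the variable e. If ENO is assumed to be a key of EMP1, both atoms denote the same tuple and the query collapses to q1, which is the kind of redundancy that a subsequent minimization step would remove.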

Although the basic technique is simple, rewriting in GAV becomes difficult when local databases have limited access patterns [Calı̀ and Calvanese, 2002]. This is the case for databases accessed over the web where relations can be only accessed using certain binding patterns for their attributes. In this case, simply substituting the global relations with their views is not sufficient, and query rewriting requires the use of recursive Datalog queries.

9.3.3 Rewriting in LAV

In the LAV approach, the global schema is expressed independently of the local databases and each local relation is defined as a view over the global relations. This enables considerable flexibility for defining local relations.

Example 9.4. To facilitate comparison with GAV, we develop an example that is symmetric to Example 9.2 with EMP(ENO, ENAME, CITY) and ASG(ENO, PNO, TITLE, DUR) as global relations. In the LAV approach, the local relations EMP1(ENO, ENAME, TITLE, CITY), EMP2(ENO, ENAME, TITLE, CITY) and ASG1(ENO, PNO, DUR) can be defined with the following Datalog rules:

EMP1(ENO,ENAME,TITLE,CITY) :− EMP(ENO,ENAME,CITY), ASG(ENO,PNO,TITLE,DUR) (r1)

EMP2(ENO,ENAME,TITLE,CITY) :− EMP(ENO,ENAME,CITY), ASG(ENO,PNO,TITLE,DUR) (r2)

ASG1(ENO,PNO,DUR) :− ASG(ENO,PNO,TITLE,DUR) (r3)

Rewriting a query expressed on the global schema into an equivalent query on the views describing the local relations is difficult for three reasons. First, unlike in the GAV approach, there is no direct correspondence between the terms used in the global schema (e.g., EMP, ENAME) and those used in the views (e.g., EMP1, EMP2, ENAME). Finding the correspondences requires comparison with each view. Second, there may be many more views than global relations, thus making view comparison time consuming. Third, view definitions may contain complex predicates to reflect the specific contents of the local relations, e.g., view EMP3 containing only programmers. Thus, it is not always possible to find an equivalent rewriting of the query. In this case, the best that can be done is to find a maximally-contained query, i.e., a query that produces the maximum subset of the answer [Halevy, 2001]. For instance, EMP3 could only return a subset of all employees, those who are programmers.

Rewriting queries using views has received much attention because of its relevance to both logical and physical data integration problems. In the context of physical integration (i.e., data warehousing), using materialized views may be much more efficient than accessing base relations. However, the problem of finding a rewriting using views is NP-complete in the number of views and the number of subgoals in the query [Levy et al., 1995]. Thus, algorithms for rewriting a query using views essentially try to reduce the number of rewritings that need to be considered. Three main algorithms have been proposed for this purpose: the bucket algorithm [Levy et al., 1996b], the inverse rule algorithm [Duschka and Genesereth, 1997], and the MinCon algorithm [Pottinger and Levy, 2000]. The bucket algorithm and the inverse rule algorithm have similar limitations that are addressed by the MinCon algorithm.

The bucket algorithm considers each predicate of the query independently to select only the views that are relevant to that predicate. Given a query Q, the algorithm proceeds in two steps. In the first step, it builds a bucket b for each subgoal q of Q that is not a comparison predicate and inserts in b the heads of the views that are relevant to answer q. A view V is placed in b if there is a mapping that unifies q with one subgoal v in V.

For instance, consider query Q in Example 9.3 and the views in Example 9.4. The following mapping unifies the subgoal EMP(e, ENAME, “Paris”) of Q with the subgoal EMP(ENO, ENAME, CITY) in view EMP1:

e → ENO, “Paris” → CITY

In the second step, for each combination of views in the Cartesian product of the non-empty buckets (i.e., some subset of the buckets), the algorithm produces a conjunctive query and checks whether it is contained in Q. If it is, the conjunctive query is kept as it represents one way to answer part of Q from the views. Thus, the rewritten query is a union of conjunctive queries.

Example 9.5. Let us consider query Q in Example 9.3 and the views in Example 9.4. In the first step, the bucket algorithm creates two buckets, one for each subgoal of Q. Let us denote by b1 the bucket for the subgoal EMP(e, ENAME, “Paris”) and by b2 the bucket for the subgoal ASG(e, p, TITLE, DUR). Since the algorithm inserts only the view heads in a bucket, there may be variables in a view head that are not in the unifying mapping. Such variables are simply primed. We obtain the following buckets:

b1 = {EMP1(ENO,ENAME,TITLE′,CITY), EMP2(ENO,ENAME,TITLE′,CITY)}

b2 = {ASG1(ENO,PNO,DUR′)}

In the second step, the algorithm combines the elements from the buckets, which produces a union of two conjunctive queries:

Q′(e, p) :− EMP1(e,ENAME,TITLE,“Paris”), ASG1(e, p,DUR) (q1)

Q′(e, p) :− EMP2(e,ENAME,TITLE,“Paris”), ASG1(e, p,DUR) (q2)
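The first step of the bucket algorithm, and the enumeration that opens the second step, can be sketched as follows for the query and views of this example. Two points are assumptions of the sketch rather than statements of the text: atoms are encoded as (predicate, terms) pairs with quoted constants, and a view is accepted only if every head variable of the query that occurs in the subgoal is mapped to a head variable of the view. That extra check follows the usual formulation of the algorithm; dropping it makes the first step more permissive. The containment test of the second step is not implemented.

```python
from itertools import product


def is_const(term):
    """Assumed convention: constants carry double quotes."""
    return term.startswith('"')


def unifies(q_atom, v_atom, q_head_vars, v_head_vars):
    """Can subgoal q_atom of the query be mapped onto subgoal v_atom of a view?"""
    (q_pred, q_args), (v_pred, v_args) = q_atom, v_atom
    if q_pred != v_pred or len(q_args) != len(v_args):
        return False
    for tq, tv in zip(q_args, v_args):
        if is_const(tq) and is_const(tv) and tq != tv:
            return False   # conflicting constants
        if tq in q_head_vars and not is_const(tv) and tv not in v_head_vars:
            return False   # assumed check: the view does not export this variable
    return True


def build_buckets(query, views):
    """One bucket (list of view names) per subgoal of the query."""
    (_, q_head), q_body = query
    buckets = []
    for q_atom in q_body:
        buckets.append([
            name
            for name, ((_, v_head), v_body) in views.items()
            if any(unifies(q_atom, v, set(q_head), set(v_head)) for v in v_body)
        ])
    return buckets


EMP = ("EMP", ("ENO", "ENAME", "CITY"))
ASG = ("ASG", ("ENO", "PNO", "TITLE", "DUR"))
VIEWS = {
    "EMP1": (("EMP1", ("ENO", "ENAME", "TITLE", "CITY")), [EMP, ASG]),
    "EMP2": (("EMP2", ("ENO", "ENAME", "TITLE", "CITY")), [EMP, ASG]),
    "ASG1": (("ASG1", ("ENO", "PNO", "DUR")), [ASG]),
}
QUERY = (("Q", ("e", "p")),
         [("EMP", ("e", "ENAME", '"Paris"')), ("ASG", ("e", "p", "TITLE", "DUR"))])

buckets = build_buckets(QUERY, VIEWS)
# Second step (enumeration only): one candidate per element of the Cartesian product.
candidates = list(product(*[b for b in buckets if b]))
print(buckets)
print(candidates)
```

With these views, EMP1 and EMP2 are rejected for the ASG subgoal because PNO is not in their heads, so the two candidates correspond to q1 and q2.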

The main advantage of the bucket algorithm is that, by considering the predicates in the query, it can significantly reduce the number of rewritings that need to be considered. However, considering the predicates in the query in isolation may yield the addition of a view in a bucket that is irrelevant when considering the join with other views. Furthermore, the second step of the algorithm may still generate a large number of rewritings as a result of the Cartesian product of the buckets.

Example 9.6. Let us consider query Q in Example 9.3 and the views in Example 9.4 with the addition of the following view that gives the projects for which there are employees who live in Paris.

PROJ1(PNO) :− EMP(ENO,ENAME,“Paris”), ASG(ENO,PNO,TITLE,DUR) (r4)

Now, the following mapping unifies the subgoal ASG(e, p, TITLE, DUR) of Q with the subgoal ASG(ENO, PNO, TITLE, DUR) in view PROJ1:

p → PNO

Thus, in the first step of the bucket algorithm, PROJ1 is added to bucket b2. However, PROJ1 cannot be useful in a rewriting of Q since the variable ENO is not in the head of PROJ1, which makes it impossible to join PROJ1 on the variable e of Q. This can be discovered only in the second step when building the conjunctive queries. ♦

The MinCon algorithm addresses the limitations of the bucket algorithm (and the inverse rule algorithm) by considering the query globally and considering how each predicate in the query interacts with the views. It proceeds in two steps like the bucket algorithm. The first step starts similar to that of the bucket algorithm, selecting the views that contain subgoals corresponding to subgoals of query Q. However, upon finding a mapping that unifies a subgoal q of Q with a subgoal v in view V , it considers the join predicates in Q and finds the minimum set of additional subgoals of Q that must be mapped to subgoals in V . This set of subgoals of Q is captured by a MinCon description (MCD) associated with V . The second step of the algorithm produces a rewritten query by combining the different MCDs. In this second step, unlike in the bucket algorithm, it is not necessary to check that the proposed rewritings are contained in the query because the way the MCDs are created guarantees that the resulting rewritings will be contained in the original query.

Applied to Example 9.6, the algorithm would create 3 MCDs: two for the views EMP1 and EMP2 containing the subgoal EMP of Q and one for ASG1 containing the subgoal ASG. However, the algorithm cannot create an MCD for PROJ1 because it cannot apply the join predicate in Q. Thus, the algorithm would produce the rewritten query Q′ of Example 9.5. Compared with the bucket algorithm, the second step of the MinCon algorithm is much more efficient since it performs fewer combinations of MCDs than buckets.


9.4 Query Optimization and Execution

The three main problems of query optimization in multidatabase systems are heterogeneous cost modeling, heterogeneous query optimization (to deal with different capabilities of component DBMSs), and adaptive query processing (to deal with strong variations in the environment – failures, unpredictable delays, etc.). In this section, we describe the techniques for these three problems. We note that the result is a distributed execution plan to be executed by the wrappers and the mediator.

9.4.1 Heterogeneous Cost Modeling

Global cost function definition, and the associated problem of obtaining cost-related information from component DBMSs, is perhaps the most-studied of the three problems. A number of possible solutions have emerged, which we discuss below.

The first thing to note is that we are primarily interested in determining the cost of the lower levels of a query execution tree that correspond to the parts of the query executed at component DBMSs. If we assume that all local processing is “pushed down” in the tree, then we can modify the query plan such that the leaves of the tree correspond to subqueries that will be executed at individual component DBMSs. In this case, we are talking about the determination of the costs of these subqueries that are input to the first level (from the bottom) operators. Cost for higher levels of the query execution tree may be calculated recursively, based on the leaf node costs.

Three alternative approaches exist for determining the cost of executing queries at component DBMSs [Zhu and Larson, 1998]:

1. Black Box Approach. This approach treats each component DBMS as a black box, running some test queries on it, and from these determines the necessary cost information [Du et al., 1992; Zhu and Larson, 1994].

2. Customized Approach. This approach uses previous knowledge about the component DBMSs, as well as their external characteristics, to subjectively determine the cost information [Zhu and Larson, 1996a; Roth et al., 1999; Naacke et al., 1999].

3. Dynamic Approach. This approach monitors the run-time behavior of component DBMSs, and dynamically collects the cost information [Lu et al., 1992; Zhu et al., 2000, 2003; Rahal et al., 2004].

We discuss each approach, focusing on the proposals that have attracted the most attention.

9.4.1.1 Black box approach

In the black box approach, which is used in the Pegasus project [Du et al., 1992], the cost functions are expressed logically (e.g., aggregate CPU and I/O costs, selectivity factors), rather than on the basis of physical characteristics (e.g., relation cardinalities, number of pages, number of distinct values for each column). Thus, the cost functions for component DBMSs are expressed as

Cost = initialization cost + cost to find qualifying tuples + cost to process selected tuples

The individual terms of this formula will differ for different operators. However, these differences are not difficult to specify a priori. The fundamental difficulty is the determination of the term coefficients in these formulae, which change with different component DBMSs. The approach taken in the Pegasus project is to construct a synthetic database (called a calibrating database), run queries against it in isolation, and measure the elapsed time to deduce the coefficients.

A problem with this approach is that the calibration database is synthetic, and the results obtained by using it may not apply well to real DBMSs [Zhu and Larson, 1994]. An alternative is proposed in the CORDS project [Zhu and Larson, 1996b], that is based on running probing queries on component DBMSs to determine cost information. Probing queries can, in fact, be used to gather a number of cost infor- mation factors. For example, probing queries can be issued to retrieve data from component DBMSs to construct and update the multidatabase catalog. Statistical probing queries can be issued that, for example, count the number of tuples of a relation. Finally, performance measuring probing queries can be issued to measure the elapsed time for determining cost function coefficients.
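As a rough illustration of the kinds of probing queries just described, the sketch below issues a statistical probe (a tuple count) and a performance-measuring probe (a timed selection). It is only a sketch under stated assumptions: an in-memory SQLite database stands in for a component DBMS, and the helper run_probe, the sample data, and the chosen queries are hypothetical and not part of the CORDS design.

```python
import sqlite3
import time

def run_probe(conn, sql):
    """Run one probing query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start

# Assumption: an in-memory SQLite database plays the component DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ASG (ENO INTEGER, PNO INTEGER, DUR INTEGER)")
conn.executemany(
    "INSERT INTO ASG VALUES (?, ?, ?)",
    [(i % 100, i % 7, i % 12) for i in range(1000)],
)

# Statistical probing query: relation cardinality for the catalog.
count_rows, _ = run_probe(conn, "SELECT COUNT(*) FROM ASG")
cardinality = count_rows[0][0]

# Performance-measuring probing query: elapsed time of a selection.
rows, elapsed = run_probe(conn, "SELECT * FROM ASG WHERE DUR > 6")
selectivity = len(rows) / cardinality
```

The measured elapsed time and the observed selectivity are the kind of raw inputs from which cost function coefficients could then be derived.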

A special case of probing queries is sample queries [Zhu and Larson, 1998]. In this case, queries are classified according to a number of criteria, and sample queries from each class are issued and measured to derive component cost information. Query classification can be performed according to query characteristics (e.g., unary operation queries, two-way join queries), characteristics of the operand relations (e.g., cardinality, number of attributes, information on indexed attributes), and char- acteristics of the underlying component DBMSs (e.g., the access methods that are supported and the policies for choosing access methods).

Classification rules are defined to identify queries that execute similarly, and thus could share the same cost formula. For example, one may consider that two queries that have similar algebraic expressions (i.e., the same algebraic tree shape), but different operand relations, attributes, or constants, are executed the same way if their attributes have the same physical properties. Another example is to assume that join order of a query has no effect on execution since the underlying query optimizer applies reordering techniques to choose an efficient join ordering. Thus, two queries that join the same set of relations belong to the same class, whatever ordering is expressed by the user. Classification rules are combined to define query classes. The classification is performed either top-down by dividing a class into more specific ones, or bottom-up by merging two classes into a larger one. In practice, an efficient classification is obtained by mixing both approaches.

The global cost function is similar to the Pegasus cost function in that it consists of three components: initialization cost, cost of retrieving a tuple, and cost of processing a tuple. The difference is in the way the parameters of this function are determined. Instead of using a calibrating database, sample queries are executed and costs are measured. The global cost equation is treated as a regression equation, and the regression coefficients are calculated using the measured costs of sample queries [Zhu and Larson, 1996a]. The regression coefficients are the cost function parameters. Finally, the cost model quality is controlled through statistical tests (e.g., F-test): if the tests fail, the query classification is refined until quality is sufficient. This approach has been validated over various DBMSs and has been shown to yield good results [Zhu and Larson, 2000].
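To make the regression step concrete, here is a minimal sketch that fits the three coefficients of one query class by ordinary least squares. The linear form cost = c0 + c1 * retrieved + c2 * processed, the function names, and the sample measurements are all assumptions chosen for illustration; the statistical quality tests (e.g., F-test) mentioned above are not shown.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_cost_coefficients(samples):
    """Least-squares fit of cost = c0 + c1*retrieved + c2*processed."""
    rows = [(1.0, float(r), float(p)) for r, p, _ in samples]
    costs = [float(t) for _, _, t in samples]
    # Normal equations: (X^T X) c = X^T y
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(rows, costs)) for i in range(3)]
    return solve_linear(xtx, xty)

# Hypothetical sample queries of one class:
# (tuples retrieved, tuples processed, measured cost).
samples = [
    (10, 1, 5.12),
    (20, 5, 5.30),
    (50, 2, 5.54),
    (100, 20, 6.40),
    (5, 5, 5.15),
]
coefficients = fit_cost_coefficients(samples)  # [c0, c1, c2]
```

The sample costs were generated from c0 = 5, c1 = 0.01 and c2 = 0.02, so the fit should recover values close to these. With real measurements the fit is only approximate, which is where the statistical tests come in.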

The above approaches require a preliminary step to instantiate the cost model (either by calibration or sampling). This may not be appropriate in MDBMSs because it would slow down the system each time a new component DBMS is added. One way to address this problem, as proposed in the Hermes project, is to progressively learn the cost model from queries [Adali et al., 1996b]. The cost model designed in the Hermes mediator assumes that the underlying component DBMSs are invoked by a function call. The cost of a call is composed of three values: the response time to access the first tuple, the whole result response time, and the result cardinality. This allows the query optimizer to minimize either the time to receive the first tuple or the time to process the whole query, depending on end-user requirements. Initially, the query processor does not know any statistics about component DBMSs. It then monitors ongoing queries: it collects the processing time of every call and stores it for future estimation. To manage the large amount of collected statistics, the cost manager summarizes them, either without loss of precision or with less precision in exchange for lower space use and faster cost estimation. Summarization consists of aggregating statistics: the average response time is computed over all the calls that match the same pattern, i.e., those with identical function name and zero or more identical argument values. The cost estimator module is implemented in a declarative language. This allows adding new cost formulae describing the behavior of a particular component DBMS. However, the burden of extending the mediator cost model remains with the mediator developer.
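The summarization by call pattern can be sketched as follows. This is not the Hermes implementation: the class CallStatistics, its methods, and the function name emp_by_city are hypothetical, and only one response time per call is recorded, whereas Hermes keeps three values per call.

```python
class CallStatistics:
    """Records response times of calls and averages them by pattern."""

    def __init__(self):
        self._observations = []  # (function name, argument tuple, seconds)

    def record(self, function, args, seconds):
        self._observations.append((function, tuple(args), seconds))

    def estimate(self, function, pattern):
        """Average time of the recorded calls that match the pattern.

        `pattern` has one entry per argument; None matches any value.
        Returns None when no recorded call matches.
        """
        times = [
            seconds
            for name, args, seconds in self._observations
            if name == function
            and len(args) == len(pattern)
            and all(p is None or p == a for p, a in zip(pattern, args))
        ]
        return sum(times) / len(times) if times else None

stats = CallStatistics()
stats.record("emp_by_city", ("Paris",), 2.0)
stats.record("emp_by_city", ("Paris",), 4.0)
stats.record("emp_by_city", ("Rome",), 6.0)

paris_estimate = stats.estimate("emp_by_city", ("Paris",))  # same argument
any_city_estimate = stats.estimate("emp_by_city", (None,))  # name only
```

A pattern with fewer fixed arguments pools more calls, which is less precise but remains usable for argument values that have not been observed yet.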

The major drawback of the black box approach is that the cost model, although adjusted by calibration, is common for all component DBMSs and may not capture their individual specifics. Thus it might fail to estimate accurately the cost of a query executed at a component DBMS that exposes unforeseen behavior.

9.4.1.2 Customized Approach

The basis of this approach is that the query processors of the component DBMSs are too different to be represented by a unique cost model as used in the black box approach. It also assumes that the ability to accurately estimate the cost of local subqueries will improve global query optimization. The approach provides a framework to integrate the component DBMSs’ cost model into the mediator query optimizer. The solution is to extend the wrapper interface such that the mediator gets some specific cost information from each wrapper. The wrapper developer is free to provide a cost model, partially or entirely. Then, the challenge is to integrate this (potentially partial) cost description into the mediator query optimizer. There are two main solutions.

A first solution is to provide the logic within the wrapper to compute three cost estimates: the time to initiate the query process and receive the first result item (called reset cost), the time to get the next item (called advance cost), and the result cardinality. Thus, the total query cost is:

Total access cost = reset cost + (cardinality − 1) ∗ advance cost

This solution can be extended to estimate the cost of database procedure calls. In that case, the wrapper provides a cost formula that is a linear equation depending on the procedure parameters. This solution has been successfully implemented to model a wide range of heterogeneous component DBMSs, ranging from a relational DBMS to an image server [Roth et al., 1999]. It shows that little effort is needed to implement a rather simple cost model, and that doing so significantly improves distributed query processing over heterogeneous sources.
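The total access cost formula is simple enough to state directly in code. The sketch below is illustrative only: the function name, the time unit, and the treatment of an empty result (charged the reset cost alone) are assumptions.

```python
def total_access_cost(reset_cost, advance_cost, cardinality):
    """reset cost + (cardinality - 1) * advance cost.

    Assumption: an empty result is charged the reset cost only.
    """
    return reset_cost + max(cardinality - 1, 0) * advance_cost

# Hypothetical wrapper estimates, in milliseconds.
estimate = total_access_cost(reset_cost=50.0, advance_cost=2.0, cardinality=101)
```

Here the first item costs 50 ms and each of the remaining 100 items costs 2 ms, for a total of 250 ms.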

A second solution is to use a hierarchical generic cost model. As shown in Figure 9.2, each node represents a cost rule that associates a query pattern with a cost function for various cost parameters.

The node hierarchy is divided into five levels depending on the genericity of the cost rules (in Figure 9.2, the increasing width of the boxes shows the increased focus of the rules). At the top level, cost rules apply by default to any DBMS. At the underlying levels, the cost rules are increasingly focused on a specific DBMS, relation, predicate or query. At the time of wrapper registration, the mediator receives wrapper metadata including cost information, and completes its built-in cost model by adding new nodes at the appropriate level of the hierarchy. This framework is sufficiently general to capture and integrate both general cost knowledge declared as rules given by wrapper developers and specific information derived from recorded past queries that were previously executed. Thus, through an inheritance hierarchy, the mediator cost-based optimizer can support a wide variety of data sources. The mediator benefits from specialized cost information about each component DBMS, to accurately estimate the cost of queries and choose a more efficient QEP [Naacke et al., 1999].
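One way to picture the rule hierarchy is as a lookup that tries the most specific level first and falls back to more generic ones. The sketch below is an assumption-laden illustration, not the mechanism of Naacke et al.: the class HierarchicalCostModel, the level names, and the dictionary used to describe a query are all hypothetical.

```python
# Scope levels ordered from most generic to most specific.
LEVELS = ["default", "wrapper", "collection", "predicate", "query"]

class HierarchicalCostModel:
    def __init__(self):
        self.rules = {level: [] for level in LEVELS}

    def add_rule(self, level, matches, cost):
        """Register a rule: a query pattern test plus a cost function."""
        self.rules[level].append((matches, cost))

    def estimate(self, query):
        """Apply the first matching rule, most specific level first."""
        for level in reversed(LEVELS):
            for matches, cost in self.rules[level]:
                if matches(query):
                    return level, cost(query)
        raise LookupError("no cost rule matches the query")

model = HierarchicalCostModel()
# Built-in default rule: a selection costs a sequential scan.
model.add_rule("default", lambda q: True, lambda q: q["cardinality"])

query = {"source": "db2", "collection": "ASG", "attribute": "ENO",
         "cardinality": 10_000, "distinct_values": 1_000}
before = model.estimate(query)

# Rule added at wrapper registration: an equality selection on ASG.ENO
# costs as many accesses as there are matching tuples.
model.add_rule(
    "predicate",
    lambda q: q["collection"] == "ASG" and q["attribute"] == "ENO",
    lambda q: q["cardinality"] // q["distinct_values"],
)
after = model.estimate(query)
```

Before registration the default rule applies and the estimate is 10,000. Afterwards the predicate-scope rule takes precedence and yields 10, without changing any other rule.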

Example 9.7. Consider the following relations:

EMP(ENO, ENAME, TITLE)
ASG(ENO, PNO, RESP, DUR)

[Figure: a tree of cost rules organized in levels, from the most generic to the most specific: default-scope rules, wrapper-scope rules (Source 1, Source 2), collection-scope rules, predicate-scope rules, and query-specific rules. Nodes pair a query pattern such as select(Collection, Predicate), select(EMP, Predicate), select(PROJ, Predicate), select(EMP, TITLE = value) or select(EMP, ENAME = Value) with formulas for cost parameters such as CountObject, TotalSize and TotalTime.]

Fig. 9.2 Hierarchical Cost Formula Tree

EMP is stored at component DBMS db1 and contains 1,000 tuples. ASG is stored at component DBMS db2 and contains 10,000 tuples. We assume uniform distribution of attribute values. Half of the ASG tuples have a duration greater than 6. We detail below some parts of the mediator generic cost model (we use superscripts to indicate the access method):

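\[
\begin{aligned}
\mathit{cost}(R) &= |R| \\
\mathit{cost}(\sigma_{\text{predicate}}(R)) &= \mathit{cost}(R) && \text{(access to } R \text{ by sequential scan, by default)} \\
\mathit{cost}(R \bowtie^{\text{ind}}_{A} S) &= \mathit{cost}(R) + |R| * \mathit{cost}(\sigma_{A=v}(S)) && \text{(using an index-based (ind) join with the index on } S.A\text{)} \\
\mathit{cost}(R \bowtie^{\text{nl}}_{A} S) &= \mathit{cost}(R) + |R| * \mathit{cost}(S) && \text{(using a nested-loop (nl) join)}
\end{aligned}
\]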

Consider the following global query Q:

SELECT *
FROM   EMP, ASG
WHERE  EMP.ENO = ASG.ENO
AND    ASG.DUR > 6

The cost-based query optimizer generates the following plans to process Q:

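\[
\begin{aligned}
P_1 &= \sigma_{\text{DUR}>6}(\text{EMP} \bowtie^{\text{ind}}_{\text{ENO}} \text{ASG}) \\
P_2 &= \text{EMP} \bowtie^{\text{nl}}_{\text{ENO}} \sigma_{\text{DUR}>6}(\text{ASG}) \\
P_3 &= \sigma_{\text{DUR}>6}(\text{ASG}) \bowtie^{\text{ind}}_{\text{ENO}} \text{EMP} \\
P_4 &= \sigma_{\text{DUR}>6}(\text{ASG}) \bowtie^{\text{nl}}_{\text{ENO}} \text{EMP}
\end{aligned}
\]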

Based on the generic cost model, we compute their cost as:

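\[
\begin{aligned}
\mathit{cost}(P_1) &= \mathit{cost}(\sigma_{\text{DUR}>6}(\text{EMP} \bowtie^{\text{ind}}_{\text{ENO}} \text{ASG})) \\
&= \mathit{cost}(\text{EMP} \bowtie^{\text{ind}}_{\text{ENO}} \text{ASG}) \\
&= \mathit{cost}(\text{EMP}) + |\text{EMP}| * \mathit{cost}(\sigma_{\text{ENO}=v}(\text{ASG})) \\
&= |\text{EMP}| + |\text{EMP}| * |\text{ASG}| = 10{,}001{,}000 \\
\mathit{cost}(P_2) &= \mathit{cost}(\text{EMP}) + |\text{EMP}| * \mathit{cost}(\sigma_{\text{DUR}>6}(\text{ASG})) \\
&= \mathit{cost}(\text{EMP}) + |\text{EMP}| * \mathit{cost}(\text{ASG}) \\
&= |\text{EMP}| + |\text{EMP}| * |\text{ASG}| = 10{,}001{,}000 \\
\mathit{cost}(P_3) &= \mathit{cost}(P_4) = |\text{ASG}| + \frac{|\text{ASG}|}{2} * |\text{EMP}| = 5{,}010{,}000
\end{aligned}
\]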

Thus, the optimizer discards plans P1 and P2 to keep either P3 or P4 for processing Q. Let us assume now that the mediator imports specific cost information about component DBMSs. db1 exports the cost of accessing EMP tuples as:

\[
\mathit{cost}(\sigma_{A=v}(R)) = |\sigma_{A=v}(R)|
\]

db2 exports the specific cost of selecting ASG tuples that have a given ENO as:

\[
\mathit{cost}(\sigma_{\text{ENO}=v}(\text{ASG})) = |\sigma_{\text{ENO}=v}(\text{ASG})|
\]

The mediator integrates these cost functions in its hierarchical cost model, and can now estimate more accurately the cost of the QEPs:

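\[
\begin{aligned}
\mathit{cost}(P_1) &= |\text{EMP}| + |\text{EMP}| * |\sigma_{\text{ENO}=v}(\text{ASG})| \\
&= 1{,}000 + 1{,}000 * 10 \\
&= 11{,}000 \\
\mathit{cost}(P_2) &= |\text{EMP}| + |\text{EMP}| * |\sigma_{\text{DUR}>6}(\text{ASG})| \\
&= |\text{EMP}| + |\text{EMP}| * \frac{|\text{ASG}|}{2} \\
&= 5{,}001{,}000 \\
\mathit{cost}(P_3) &= |\text{ASG}| + \frac{|\text{ASG}|}{2} * |\sigma_{\text{ENO}=v}(\text{EMP})| \\
&= 10{,}000 + 5{,}000 * 1 \\
&= 15{,}000 \\
\mathit{cost}(P_4) &= |\text{ASG}| + \frac{|\text{ASG}|}{2} * |\text{EMP}| \\
&= 10{,}000 + 5{,}000 * 1{,}000 \\
&= 5{,}010{,}000
\end{aligned}
\]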

The best QEP is now P1, which was previously discarded because of lack of cost information about component DBMSs. In many situations P1 is actually the best alternative to process Q.
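The arithmetic of Example 9.7 can be replayed in a few lines. The sketch below only restates the figures given in the example; the variable names are ours, and the value of one EMP tuple per ENO assumes that ENO is a key of EMP, which matches the factor of 1 used in the example.

```python
EMP, ASG = 1_000, 10_000      # relation cardinalities from Example 9.7
ASG_DUR = ASG // 2            # half of the ASG tuples satisfy DUR > 6
ASG_PER_ENO = ASG // EMP      # uniform distribution: 10 ASG tuples per ENO
EMP_PER_ENO = 1               # assumption: ENO is a key of EMP

# Generic cost model: every selection costs a full scan of its operand.
generic = {
    "P1": EMP + EMP * ASG,
    "P2": EMP + EMP * ASG,
    "P3": ASG + ASG_DUR * EMP,
    "P4": ASG + ASG_DUR * EMP,
}

# Cost model refined with the information exported by db1 and db2.
refined = {
    "P1": EMP + EMP * ASG_PER_ENO,
    "P2": EMP + EMP * ASG_DUR,
    "P3": ASG + ASG_DUR * EMP_PER_ENO,
    "P4": ASG + ASG_DUR * EMP,
}

best_generic = min(generic.values())
best_refined = min(refined, key=refined.get)
```

Under the generic model the cheapest plans are P3 and P4 at 5,010,000. With the exported cost functions P1 drops to 11,000 and becomes the cheapest.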

The two solutions just presented are well suited to the mediator/wrapper architecture and offer a good tradeoff between the overhead of providing specific cost information for diverse component DBMSs and the benefit of faster heterogeneous query processing.

9.4.1.3 Dynamic Approach

The above approaches assume that the execution environment is stable over time. However, in most cases, the execution environment factors are frequently changing. Three classes of environmental factors can be identified based on their dynamicity [Rahal et al., 2004]. The first class for frequently changing factors (every second to every minute) includes CPU load, I/O throughput, and available memory. The second class for slowly changing factors (every hour to every day) includes DBMS configuration parameters, physical data organization on disks, and database schema. The third class for almost stable factors (every month to every year) includes DBMS type, database location, and CPU speed. We focus on solutions that deal with the first two classes.

One way to deal with dynamic environments where network contention, data storage or available memory change over time is to extend the sampling method [Zhu, 1995] and consider user queries as new samples. Query response time is measured to adjust the cost model parameters at run time for subsequent queries. This avoids the overhead of processing sample queries periodically, but still requires heavy computation to solve the cost model equations and does not guarantee that cost model precision improves over time. A better solution, called qualitative [Zhu et al., 2000], defines the system contention level as the combined effect of frequently changing factors on query cost. The system contention level is divided into several discrete categories: high, medium, low, or no system contention. This allows for defining a multi-category cost model that provides accurate cost estimates while dynamic factors are varying. The cost model is initially calibrated using probing queries. The current system contention level is computed over time, based on the most significant system parameters. This approach assumes that query executions are short, so the environment factors remain rather constant during query execution. However, this solution does not apply to long running queries, since the environment factors may change rapidly during query execution.
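A multi-category cost model can be sketched as one set of coefficients per contention category. Everything concrete below is an assumption made for illustration: the use of CPU load as the only system parameter, the thresholds, the coefficient values, and the three-term cost formula are hypothetical and are not taken from Zhu et al.

```python
import bisect

LEVELS = ["none", "low", "medium", "high"]
THRESHOLDS = [0.25, 0.50, 0.75]   # hypothetical CPU-load boundaries

# Hypothetical coefficients per category, as obtained by calibration:
# (initialization, per retrieved tuple, per processed tuple).
COEFFICIENTS = {
    "none":   (4.0, 0.5, 0.25),
    "low":    (6.0, 0.75, 0.5),
    "medium": (8.0, 1.0, 0.75),
    "high":   (16.0, 2.0, 1.5),
}

def contention_level(cpu_load):
    """Map a load measurement in [0, 1] to a discrete category."""
    return LEVELS[bisect.bisect_right(THRESHOLDS, cpu_load)]

def estimate_cost(cpu_load, retrieved, processed):
    """Apply the cost formula of the current contention category."""
    c0, c1, c2 = COEFFICIENTS[contention_level(cpu_load)]
    return c0 + c1 * retrieved + c2 * processed

idle_cost = estimate_cost(0.10, retrieved=100, processed=10)
busy_cost = estimate_cost(0.90, retrieved=100, processed=10)
```

The same query is estimated at 56.5 on an idle system and 231.0 under high contention. The category is chosen once per estimate, which reflects the assumption that conditions stay stable during a short query.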

To manage the case where the environment factor variation is predictable (e.g., the daily DBMS load variation is the same every day), the query cost is computed for successive date ranges [Zhu et al., 2003]. Then, the total cost is the sum of the costs for each range. Furthermore, it may be possible to learn the pattern of the available network bandwidth between the MDBMS query processor and the component DBMS [Vidal et al., 1998]. This allows adjusting the query cost depending on the actual date.

9.4.2 Heterogeneous Query Optimization

In addition to heterogeneous cost modeling, multidatabase query optimization must deal with the issue of the heterogeneous computing capabilities of component DBMSs. For instance, one component DBMS may support only simple select operations while another may support complex queries involving join and aggregate. Thus, depending on how the wrappers export such capabilities, query processing at the mediator level can be more or less complex. There are two main approaches to deal with this issue depending on the kind of interface between mediator and wrapper: query-based and operator-based.

1. Query-based. In this approach, the wrappers support the same query capability, e.g., a subset of SQL, which is translated to the capability of the component DBMS. This approach typically relies on a standard DBMS interface such as Open Database Connectivity (ODBC) and its extensions for the wrappers or SQL Management of External Data (SQL/MED) [Melton et al., 2001]. Thus, since the component DBMSs appear homogeneous to the mediator, query processing techniques designed for homogeneous distributed DBMS can be reused. However, if the component DBMSs have limited capabilities, the additional capabilities must be implemented in the wrappers, e.g., join queries may need to be handled at the wrapper, if the component DBMS does not support join.

2. Operator-based. In this approach, the wrappers export the capabilities of the component DBMSs through compositions of relational operators. Thus, there is more flexibility in defining the level of functionality between the mediator and the wrapper. In particular, the different capabilities of the component DBMSs can be made available to the mediator. This makes wrapper construction easier at the expense of more complex query processing in the mediator. In particular, any functionality that may not be supported by component DBMSs (e.g., join) will need to be implemented at the mediator.

In the rest of this section, we present, in more detail, the approaches to query optimization.

9.4.2.1 Query-based Approach

Since the component DBMSs appear homogeneous to the mediator, one approach is to use a distributed cost-based query optimization algorithm (see Chapter 8) with a heterogeneous cost model (see Section 9.4.1). However, extensions are needed to convert the distributed execution plan into subqueries to be executed by the component DBMSs and into subqueries to be executed by the mediator. The hybrid two-step optimization technique is useful in this case (see Section 8.4.4): in the first step, a static plan is produced by a centralized cost-based query optimizer; in the second step, at startup time, an execution plan is produced by carrying out site selection and allocating the subqueries to the sites. However, centralized optimizers restrict their search space by eliminating bushy join trees from consideration. Almost all the systems use left linear join orders where the right subtree of a join node is always a leaf node corresponding to a base relation (Figure 9.3a). Consideration of only left linear join trees gives good results in centralized DBMSs for two reasons: it reduces the need to estimate statistics for at least one operand, and indexes can still be exploited for one of the operands. However, in multidatabase systems, these types of join execution plans are not necessarily the preferred ones as they do not allow any parallelism in join execution. As we discussed in earlier chapters, this is also a problem in homogeneous distributed DBMSs, but the issue is more serious in the case of multidatabase systems, because we wish to push as much processing as possible to the component DBMSs.

A way to resolve this problem is to somehow generate bushy join trees and consider them at the expense of left linear ones. One way to achieve this is to apply a cost-based query optimizer to first generate a left linear join tree, and then convert it to a bushy tree [Du et al., 1995]. In this case, the left linear join execution plan can be optimal with respect to total time, and the transformation improves the query response time without severely impacting the total time. A hybrid algorithm that concurrently performs a bottom-up and top-down sweep of the left linear join execution tree, transforming it, step-by-step, to a bushy one has been proposed [Du et al., 1995]. The algorithm maintains two pointers, called upper anchor nodes (UAN), on the tree. At the beginning, one of these, called the bottom UAN (UANB), is set to the grandparent of the leftmost leaf node (join with R3 in Figure 9.3a), while the second one, called the top UAN (UANT), is set to the root (join with R5). For each UAN the algorithm selects a lower anchor node (LAN). This is the node closest to the UAN and whose right child subtree’s response time is within a designer-specified range, relative to that of the UAN’s right child subtree. Intuitively, the LAN is chosen such that its right child subtree’s response time is close to the corresponding UAN’s right child subtree’s response time. As we will see shortly, this helps in keeping the transformed bushy tree balanced, which reduces the response time.

[Figure: two join trees over the relations R1 to R5. (a) Left Linear Join Tree. (b) Bushy Join Tree.]

Fig. 9.3 Left Linear versus Bushy Join Tree

At each step, the algorithm picks one of the UAN/LAN pairs (strictly speaking, it picks the UAN and selects the appropriate LAN, as discussed above), and performs the following transformation for the segment between that LAN and UAN pair:

1. The left child of UAN becomes the new UAN of the transformed segment.

2. The LAN remains unchanged, but its right child node is replaced with a new join node of two subtrees, which were the right child subtrees of the input UAN and LAN.

The UAN node that will be considered in that particular iteration is chosen according to the following heuristic: choose UANB if the response time of its left child subtree is smaller than that of UANT’s subtree; otherwise choose UANT. If the response times are the same, choose the one with the more unbalanced child subtree.

At the end of each transformation step, the UANB and UANT are adjusted. The algorithm terminates when UANB = UANT, since this indicates that no further transformations are possible. The resulting join execution tree will be almost balanced, producing an execution plan whose response time is reduced due to parallel execution of the joins.
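The effect of balancing on response time can be illustrated without implementing the transformation itself. The sketch below compares a left linear tree with one possible bushy tree over the same five relations, under a deliberately simplified cost model that is our assumption: every scan and every join costs one time unit, and the two operands of a join are produced in parallel.

```python
def total_time(tree, scan_cost, join_cost):
    """Sum of all operator costs, regardless of parallelism."""
    if isinstance(tree, str):
        return scan_cost[tree]
    left, right = tree
    return (total_time(left, scan_cost, join_cost)
            + total_time(right, scan_cost, join_cost)
            + join_cost)

def response_time(tree, scan_cost, join_cost):
    """Elapsed time when both operands of a join are produced in parallel."""
    if isinstance(tree, str):
        return scan_cost[tree]
    left, right = tree
    return max(response_time(left, scan_cost, join_cost),
               response_time(right, scan_cost, join_cost)) + join_cost

# A leaf is a relation name; a join is a (left, right) pair.
scan_cost = {f"R{i}": 1 for i in range(1, 6)}
left_linear = (((("R1", "R2"), "R3"), "R4"), "R5")
bushy = ((("R1", "R2"), ("R3", "R4")), "R5")

linear_times = (total_time(left_linear, scan_cost, 1),
                response_time(left_linear, scan_cost, 1))
bushy_times = (total_time(bushy, scan_cost, 1),
               response_time(bushy, scan_cost, 1))
```

Both trees perform the same amount of work (total time 9), but the response time falls from 5 to 4 for the bushy tree because the joins of R1 with R2 and of R3 with R4 can proceed at the same time.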

The algorithm described above starts with a left linear join execution tree that is generated by a commercial DBMS optimizer. While this is a good starting point, it can be argued that the original linear execution plan may not fully account for the peculiarities of the distributed multidatabase characteristics, such as data replication. A special global query optimization algorithm [Evrendilek et al., 1997] can take these into consideration. Starting from an initial join graph, the algorithm checks for different parenthesizations of this linear join execution order and produces a parenthesized order, which is optimal with respect to response time. The result is an (almost) balanced join execution tree. Performance evaluations indicate that this approach produces better quality plans at the expense of longer optimization time.

9.4.2.2 Operator-based Approach

Expressing the capabilities of the component DBMSs through relational operators allows tight integration of query processing between mediator and wrappers. In particular, the mediator/wrapper communication can be in terms of subplans. We illustrate the operator-based approach with planning functions proposed in the Garlic project [Haas et al., 1997a]. In this approach, the capabilities of the component DBMSs are expressed by the wrappers as planning functions that can be directly called by a centralized query optimizer. It extends the rule-based optimizer proposed by Lohman [1988] with operators to create temporary relations and retrieve locally-stored data. It also creates the PushDown operator that pushes a portion of the work to the component DBMSs where it will be executed. The execution plans are represented, as usual, as operator trees, but the operator nodes are annotated with additional information that specifies the source(s) of the operand(s), whether the results are materialized, and so on. The Garlic operator trees are then translated into operators that can be directly executed by the execution engine.

Planning functions are considered by the optimizer as enumeration rules. They are called by the optimizer to construct subplans using two main functions: accessPlan to access a relation, and joinPlan to join two relations using the access plans. These functions precisely reflect the capabilities of the component DBMSs with a common formalism.

Example 9.8. We consider three component databases, each at a different site. Component database db1 stores relation EMP(ENO, ENAME, CITY). Component database db2 stores relation ASG(ENO, PNAME, DUR). Component database db3 stores only employee information with a single relation of schema EMPASG(ENAME, CITY, PNAME, DUR), whose primary key is (ENAME, PNAME). Component databases db1 and db2 have the same wrapper w1 whereas db3 has a different wrapper w2.

Wrapper w1 provides two planning functions typical of a relational DBMS. The accessPlan rule

accessPlan(R: relation, A: attribute list, P: select predicate) = scan(R,A,P,db(R))

produces a scan operator that accesses tuples of R from its component database db(R) (here we can have db(R) = db1 or db(R) = db2), applies select predicate P, and projects on the attribute list A. The joinPlan rule

joinPlan(R1, R2: relations, A: attribute list, P: join predicate) = join(R1, R2, A, P)

condition: db(R1) ≠ db(R2)

produces a join operator that accesses tuples of relations R1 and R2 and applies join predicate P and projects on attribute list A. The condition expresses that R1 and R2 are stored in different component databases (i.e., db1 and db2). Thus, the join operator is implemented by the wrapper.

Wrapper w2 also provides two planning functions. The accessPlan rule

accessPlan(R: relation, A: attribute list, P: select predicate) = fetch(CITY=“c”)

condition: (CITY=“c”) ⊆ P

produces a fetch operator that directly accesses (entire) employee tuples in component database db3 whose CITY value is “c”. The accessPlan rule

accessPlan(R: relation, A: attribute list, P: select predicate) = scan(R,A,P)

produces a scan operator that accesses tuples of relation R in the wrapper and applies select predicate P and attribute project list A. Thus, the scan operator is implemented by the wrapper, not the component DBMS.

Consider the following SQL query submitted to mediator m:

SELECT ENAME, PNAME, DUR
FROM   EMPASG
WHERE  CITY = "Paris" AND DUR > 24

Assuming the GAV approach, the global view EMPASG(ENAME, CITY, PNAME, DUR) can be defined as follows (for simplicity, we prefix each relation by its component database name):

EMPASG = (db1.EMP ⋈ db2.ASG) ∪ db3.EMPASG

After query rewriting in GAV and query optimization, the operator-based approach could produce the QEP shown in Figure 9.4. This plan shows that the operators that are not supported by the component DBMS are to be implemented by the wrappers or the mediator.
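The planning functions of Example 9.8 can be mimicked with ordinary functions that return plan descriptions. This is only a sketch and not Garlic's interface: the function names, the tuple encoding of operators, and the representation of a predicate as a list of (attribute, operator, value) conjuncts are assumptions made for illustration.

```python
# Relation -> component database, as in Example 9.8.
DB = {"EMP": "db1", "ASG": "db2", "EMPASG": "db3"}

def w1_access_plan(relation, attributes, predicate):
    """w1: scan executed at the component database storing the relation."""
    return ("scan", relation, attributes, predicate, DB[relation])

def w1_join_plan(r1, r2, attributes, predicate):
    """w1: join offered only for operands in different databases."""
    if DB[r1] == DB[r2]:
        return None  # condition db(R1) != db(R2) is not satisfied
    return ("join", r1, r2, attributes, predicate)

def w2_access_plans(relation, attributes, predicate):
    """w2: one fetch per CITY equality in the predicate, plus a scan
    that the wrapper itself executes."""
    plans = [("fetch", cond) for cond in predicate
             if cond[0] == "CITY" and cond[1] == "="]
    plans.append(("scan", relation, attributes, predicate))
    return plans

# Predicate of the example query, as a list of conjuncts.
query_predicate = [("CITY", "=", "Paris"), ("DUR", ">", 24)]
w2_plans = w2_access_plans("EMPASG", ["ENAME", "PNAME", "DUR"],
                           query_predicate)
w1_join = w1_join_plan("EMP", "ASG", ["ENAME", "PNAME", "DUR"],
                       [("EMP.ENO", "=", "ASG.ENO")])
```

For the example query, w2 offers a fetch on CITY = "Paris" as well as a wrapper-side scan, and w1 offers a join because EMP and ASG are stored in different component databases.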

Using planning functions for heterogeneous query optimization has several advantages in multi-DBMSs. First, planning functions provide a flexible way to express precisely the capabilities of component data sources. In particular, they can be used to model non-relational data sources such as web sites. Second, since these rules are declarative, they make wrapper development easier. The only important development for wrappers is the implementation of specific operators, e.g., the scan operator of db3 in Example 9.8. Finally, this approach can be easily incorporated in an existing, centralized query optimizer.

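[Figure: operator tree of the plan. At mediator m, a Union combines the results of wrappers w1 and w2. In w1, a Join combines Scan (CITY="Paris") on EMP at db1 with Scan (DUR>24) on ASG at db2. In w2, Scan (DUR>24) is applied to the result of Fetch (CITY="Paris") on EMPASG at db3.]

Fig. 9.4 Heterogeneous Query Execution Plan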

The operator-based approach has also been successfully used in DISCO, a multi-DBMS designed to access multiple databases over the web [Tomasic et al., 1996, 1997, 1998]. DISCO uses the GAV approach and supports an object data model to represent both mediator and component database schemas and data types. This allows easy introduction of new component databases, easily handling potential type mismatches. The component DBMS capabilities are defined as a subset of an algebraic machine (with the usual operators such as scan, join and union) that can be partially or entirely supported by the wrappers or the mediator. This gives much flexibility for the wrapper implementors in deciding where to support component DBMS capabilities (in the wrapper or in the mediator). Furthermore, compositions of operators, including specific data sets, can be specified to reflect component DBMS limitations. However, query processing is more complicated because of the use of an algebraic machine and compositions of operators. After query rewriting on the component schemas, there are three main steps [Kapitskaia et al., 1997].

1. Search space generation. The query is decomposed into a number of QEPs, which constitutes the search space for query optimization. The search space is generated using a traditional search strategy such as dynamic programming.

2. QEP decomposition. Each QEP is decomposed into a forest of n wrapper QEPs and a composition QEP. Each wrapper QEP is the largest part of the initial QEP that can be entirely executed by the wrapper. Operators that cannot be performed by a wrapper are moved up to the composition QEP. The composition QEP combines the results of the wrapper QEPs in the final answer, typically through unions and joins of the intermediate results produced by the wrappers.

3. Cost evaluation. The cost of each QEP is evaluated using a hierarchical cost model discussed in Section 9.4.1.
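The QEP decomposition step can be sketched as a top-down traversal that hands each maximal wrapper-executable subtree to its wrapper and keeps the remaining operators in the composition QEP. This is an illustrative sketch, not DISCO's algorithm: the Op class, the capability sets, and the rule that a subtree is delegated only when all its leaves belong to one wrapper supporting all its operators are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str              # "scan", "select", "join", "union", or "result"
    children: tuple = ()   # sub-plans (empty for leaves)
    wrapper: str = ""      # wrapper in charge, set on scan leaves only
    arg: str = ""          # relation name, predicate text, or result index

def wrapper_of(plan, capabilities):
    """Return the single wrapper able to run the whole subtree, else None."""
    if plan.name == "scan":
        ok = "scan" in capabilities.get(plan.wrapper, set())
        return plan.wrapper if ok else None
    owners = {wrapper_of(child, capabilities) for child in plan.children}
    if len(owners) == 1:
        (wrapper,) = owners
        if wrapper is not None and plan.name in capabilities[wrapper]:
            return wrapper
    return None

def decompose(plan, capabilities, wrapper_plans):
    """Move maximal wrapper-executable subtrees into wrapper_plans.

    Returns the composition plan, whose "result" leaves refer to the
    entries of wrapper_plans by index.
    """
    wrapper = wrapper_of(plan, capabilities)
    if wrapper is not None:
        wrapper_plans.append((wrapper, plan))
        return Op("result", arg=str(len(wrapper_plans) - 1))
    if not plan.children:
        raise ValueError(f"no wrapper can execute leaf {plan}")
    children = tuple(decompose(child, capabilities, wrapper_plans)
                     for child in plan.children)
    return Op(plan.name, children, arg=plan.arg)

# Hypothetical capabilities: w1 supports joins, w2 does not.
capabilities = {"w1": {"scan", "select", "join"}, "w2": {"scan", "select"}}
emp = Op("select", (Op("scan", wrapper="w1", arg="EMP"),), arg='CITY = "Paris"')
asg = Op("select", (Op("scan", wrapper="w1", arg="ASG"),), arg="DUR > 24")
empasg = Op("select", (Op("scan", wrapper="w2", arg="EMPASG"),), arg="DUR > 24")
plan = Op("union", (Op("join", (emp, asg), arg="ENO"), empasg))

wrapper_plans = []
composition = decompose(plan, capabilities, wrapper_plans)
```

In this run the join subtree goes entirely to w1, the selection on EMPASG goes to w2, and the composition QEP reduces to a union of the two wrapper results.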

9.4.3 Adaptive Query Processing

Multidatabase query processing, as discussed so far, follows essentially the principles of traditional query processing whereby an optimal QEP is produced for a query based on a cost model, which is then executed. The underlying assumption is that the multidatabase query optimizer has sufficient knowledge about query runtime conditions in order to produce an efficient QEP and the runtime conditions remain stable during execution. This is a fair assumption for multidatabase queries with few data sources running in a controlled environment. However, this assumption is inappropriate for changing environments with large numbers of data sources and unpredictable runtime conditions.

Example 9.9. Consider the QEP in Figure 9.5 with relations EMP, ASG, PROJ and PAY at sites s1, s2, s3, s4, respectively. The crossed arrow indicates that, for some reason (e.g., failure), site s2 (where ASG is stored) is not available at the beginning of execution. Let us assume, for simplicity, that the query is to be executed according to the iterator execution model [Graefe and McKenna, 1993], such that tuples flow from the leftmost relation.

Fig. 9.5 Query Execution Plan with Blocked Data Source (plan over relations EMP, ASG, PROJ and PAY)

Because of the unavailability of s2, the entire pipeline is blocked, waiting for ASG tuples to be produced. However, with some reorganization of the plan, some other operators could be evaluated while waiting for s2, for instance, the join of EMP and PAY.

This simple example illustrates that a typical static plan cannot cope with unpredictable data source unavailability [Amsaleg et al., 1996a]. More complex examples involve continuous queries [Madden et al., 2002b], expensive predicates [Porto et al., 2003] and data skew [Shah et al., 2003]. The main solution is to have some adaptive behavior during query processing, i.e., adaptive query processing. Adaptive query processing is a form of dynamic query processing, with a feedback loop between the execution environment and the query optimizer in order to react to unforeseen variations of runtime conditions. A query processing system is defined as adaptive if it receives information from the execution environment and determines its behavior according to that information in an iterative manner [Hellerstein et al., 2000; Gounaris et al., 2002b]. In the context of multidatabase systems, the execution environment includes the mediator, wrappers and component DBMSs. In particular, wrappers should be able to collect information regarding execution within the component DBMSs. Obviously, this is harder to do with legacy DBMSs.

In this section, we first provide a general presentation of the adaptive query processing process. Then, we present, in more detail, the Eddy approach [Avnur and Hellerstein, 2000] that provides a powerful framework for adaptive query processing techniques. Finally, we discuss major extensions to Eddy.

9.4.3.1 Adaptive Query Processing Process

Adaptive query processing adds to the traditional query processing process the following activities: monitoring, assessing and reacting. These activities are logically implemented in the query processing system by sensors, assessment components, and reaction components, respectively. These components may be embedded into control operators of the QEP, e.g., the Exchange operator [Graefe and McKenna, 1993]. Monitoring involves measuring some environment parameters within a time window, and reporting them to the assessment component. The latter analyzes the reports and considers thresholds to arrive at an adaptive reaction plan. Finally, the reaction plan is communicated to the reaction component that applies the reactions to query execution.

Typically, an adaptive process specifies the frequency with which each component will be executed. There is a tradeoff between reactiveness, in which higher values lead to eager reactions, and the overhead caused by the adaptive process. A generic representation of the adaptive process is given by the function $f_{adapt}(E, T) \rightarrow Ad$, where E is a set of monitored environment parameters, T is a set of threshold values and Ad is a possibly empty set of adaptive reactions. The elements of E, T and Ad, called adaptive elements, obviously may vary in a number of ways depending on the application. The most important elements are the monitoring parameters and the adaptive reactions. We now describe them, following the presentation in [Gounaris et al., 2002b].

Monitoring parameters.

Monitoring query runtime parameters involves placing sensors at key places of the QEP and defining observation windows, during which sensors collect information. It also requires the specification of a communication mechanism to pass collected information to the assessment component. Examples of candidates for monitoring are:

• Memory size. Monitoring available memory size allows, for instance, operators to react to memory shortage or memory increase [Shah et al., 2003].

• Data arrival rates. Monitoring the variations in data arrival rates may enable the query processor to do useful work while waiting for a blocked data source.

• Actual statistics. Database statistics in a multidatabase environment tend to be inaccurate, if at all available. Monitoring the actual size of relations and intermediate results may lead to important modifications in the QEP. Furthermore, the usual data assumptions, in which the selectivities of predicates over attributes in a relation are considered to be mutually independent, can be abandoned and real selectivity values can be computed.

• Operator execution cost. Monitoring the actual cost of operator execution, including production rates, is useful for better operator scheduling. Furthermore, monitoring the size of the queues placed before operators may avoid overload situations [Tian and DeWitt, 2003b].

• Network throughput. In multidatabase query evaluation with remote data sources, monitoring network throughput may be helpful to define the data retrieval block size. In a lower throughput network, the system may react with larger block sizes to reduce network penalty.

Adaptive reactions.

Adaptive reactions modify query execution behavior according to the decisions taken by the assessment component. Important adaptive reactions are the following:

• Change schedule: modifies the order in which operators in the QEP get scheduled. Query Scrambling [Amsaleg et al., 1996a; Urhan et al., 1998a] reacts by a change schedule of the plan, e.g., to reorganize the QEP in Example 9.9, to avoid stalling on a blocked data source during query evaluation. Eddy adopts a finer-grained reaction in which operator scheduling can be decided on a per-tuple basis.

• Operator replacement: replaces a physical operator by an equivalent one. For example, depending on the available memory, the system may choose between a nested loop join or a hash join. Operator replacement may also change the plan by introducing a new operator to join the intermediate results produced by a previous adaptive reaction. Query Scrambling, for instance, may introduce new operators to evaluate joins between the results of change schedule reactions.

• Operator behavior: modifies the physical behavior of an operator. For example, the symmetric hash join [Wilschut and Apers, 1991] or ripple join algorithms [Haas and Hellerstein, 1999b] constantly alternate the inner/outer relation roles between their input tuples.

• Data repartitioning: considers the dynamic repartitioning of a relation across multiple nodes using intra-operator parallelism [Shah et al., 2003]. Static partitioning of a relation tends to produce load imbalance between nodes. For example, information partitioned according to its associated geographical region (i.e., continent) may exhibit different access rates during the day because of the time differences in users’ locations.

• Plan reformulation: computes a new QEP to replace an inefficient one. The optimizer considers actual statistics and state information, collected on the fly, to produce a new plan.
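The assessment function $f_{adapt}(E, T) \rightarrow Ad$ can be made concrete with a small Python sketch. The parameter names, the threshold names and the encoding of reactions as pairs are all assumptions made for illustration; they do not come from any particular system.

```python
def f_adapt(E, T):
    """Return the set Ad of adaptive reactions for monitored values E and thresholds T."""
    Ad = set()
    # Data arrival rates: a source below the minimum rate is treated as blocked.
    for source, rate in E["arrival_rate"].items():
        if rate < T["min_arrival_rate"]:
            Ad.add(("change_schedule", source))
    # Memory size: too little memory for a hash table triggers operator replacement.
    if E["free_memory_mb"] < T["min_hash_join_memory_mb"]:
        Ad.add(("operator_replacement", "nested_loop_join"))
    # Actual statistics: a large estimation error triggers plan reformulation.
    error = E["observed_rows"] / max(E["estimated_rows"], 1)
    if error > T["max_cardinality_error"]:
        Ad.add(("plan_reformulation", "reoptimize"))
    return Ad


T = {"min_arrival_rate": 1.0, "min_hash_join_memory_mb": 64, "max_cardinality_error": 10.0}
stable = {"arrival_rate": {"s1": 50.0, "s2": 40.0}, "free_memory_mb": 512,
          "observed_rows": 1200, "estimated_rows": 1000}
blocked = dict(stable, arrival_rate={"s1": 50.0, "s2": 0.0})
```

With these values, `f_adapt(stable, T)` is the empty set, while `f_adapt(blocked, T)` contains a single change schedule reaction for site s2, which corresponds to the situation of Example 9.9.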

9.4.3.2 Eddy Approach

Eddy is a general framework for adaptive query processing. It was developed in the context of the Telegraph project with the goal of running queries on large volumes of online data with unpredictable input rates and fluctuations in the running environment.

For simplicity, we only consider select-project-join (SPJ) queries. Select operators can include expensive predicates [Hellerstein and Stonebraker, 1993]. The process of generating a QEP from an input SPJ query begins by producing a spanning tree of the query graph G modeling the input query. The choice among join algorithms and relation access methods favors adaptiveness. A QEP can be modeled as a tuple Q = 〈D,P,C〉, where D is a set of data sources, P is a set of query predicates with associated algorithms, and C is a set of ordering constraints that must be followed during execution. Observe that multiple valid spanning trees can be derived from G that obey the constraints in C, by exploring the search space composed of equivalent plans with different predicate orders. There is no need to find an optimal QEP during query compilation. Instead, operator ordering is done on the fly on a tuple-per-tuple basis (i.e., tuple routing). The process of QEP compilation is completed by adding the Eddy operator which is an n-ary physical operator placed between data sources in D and query predicates in P.

Example 9.10. Consider a three-relation query $Q = \sigma_p(R) \bowtie S \bowtie T$, where joins are equi-joins. Assume that the only access method to relation T is through an index on join attribute T.A, i.e., the second join can only be an index join over T.A. Assume also that $\sigma_p$ is an expensive predicate (e.g., a predicate over the results of running a program over values of R.B). Under these assumptions, the QEP is defined as $D = \{R, S, T\}$, $P = \{\sigma_p(R), R \bowtie_1 S, S \bowtie_2 T\}$ and $C = \{S \prec T\}$. The constraint $\prec$ imposes S tuples to probe T tuples, based on the index on T.A.

Figure 9.6 shows a QEP produced by the compilation of query Q with Eddy. An ellipse corresponds to a physical operator (i.e., either the Eddy operator or an algorithm implementing a predicate p ∈ P). As usual, the bottom of the plan presents the data sources. In the absence of a scan access method, relation T access is wrapped by the index join implementing the second join, and, thus, does not appear as a data source. The arrows specify pipeline dataflow following a producer-consumer relationship. Finally, an arrow departing from the Eddy models the production of output tuples.

Eddy provides fine-grain adaptiveness by deciding on the fly how to route tuples through predicates according to a scheduling policy. During query execution, tuples in data sources are retrieved and staged into an input buffer managed by the Eddy operator. Eddy responds to data source unavailability by simply reading from another data source and staging tuples in the buffer pool.

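Fig. 9.6 A Query Execution Plan with Eddy. (The Eddy operator sits between the data sources R and S and the predicate operators $R \bowtie_1 S$, $\sigma_p(R)$ and $S \bowtie_2 T$.)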

The flexibility of choosing the currently available data source is obtained by relaxing the fixed order of predicates in a QEP. In Eddy, there is no fixed QEP and each tuple follows its own path through predicates according to the constraints in the plan and its own history of predicate evaluation.

The tuple-based routing strategy produces a new QEP topology. The Eddy operator together with its managed predicates form a circular dataflow in which tuples leave the Eddy operator to be evaluated by the predicates, which in turn bounce back output tuples to the Eddy operator. A tuple leaves the circular dataflow either when it is eliminated by a predicate evaluation or the Eddy operator realizes that the tuple has passed through all the predicates in its list. The lack of a fixed QEP requires each tuple to register the set of predicates it is eligible for. For example, in Figure 9.6, S tuples are eligible for the two join predicates but are not eligible for predicate σp(R).

Let us now present, in more detail, how Eddy adaptively performs join ordering and scheduling.

Adaptive join ordering.

A fixed QEP (produced at compile time) dictates the join ordering and specifies which relations can be pipelined through the join operators. This makes query execution simple. When, as in Eddy, there is no fixed QEP, the challenge is to dynamically order pipelined join operators at run time, while tuples from different relations are flowing in. Ideally, when a tuple of a relation participating in a join arrives, it should be sent to a join operator (chosen by the scheduling policy) to be processed on the fly. However, most join algorithms cannot process some incoming tuples on the fly because they are asymmetric with respect to the way inner and outer tuples are processed. Consider the basic hash-based join algorithm, for instance: the inner relation is fully read during the build phase to construct a hash table, whereas tuples in the outer relation are pipelined during the probe phase. Thus, an incoming inner tuple cannot be processed on the fly as it must be stored in the hash table and the processing will be possible when the entire hash table has been built. Similarly, the nested loop join algorithm is asymmetric as only the inner relation must be read entirely for each tuple of the outer relation. Join algorithms with some kind of asymmetry offer few opportunities for alternating input relations between inner and outer roles. Thus, to relax the order in which join inputs are consumed, symmetric join algorithms are needed where the role played by the relations in a join may change without producing incorrect results.

The earliest example of a symmetric join algorithm is the symmetric hash join [Wilschut and Apers, 1991], which uses two hash tables, one for each input relation. The traditional build and probe phases of the basic hash join algorithm are simply interleaved. When a tuple arrives, it is used to probe the hash table corresponding to the other relation and find matching tuples. Then, it is inserted in its corresponding hash table so that tuples of the other relation arriving later can be joined. Thus, each arriving tuple can be processed on the fly. Another popular symmetric join algorithm is the ripple join [Haas and Hellerstein, 1999b], which can be viewed as a generalization of the nested loop join algorithm where the roles of inner and outer relation continually alternate during query execution. The main idea is to keep the probing state of each input relation, with a pointer that indicates the last tuple used to probe the other relation. At each toggling point, a change of roles between inner and outer relations occurs. At this point, the new outer relation starts to probe the inner input from its pointer position onwards, to a specified number of tuples. The inner relation, in turn, is scanned from its first tuple to its pointer position minus 1. The number of tuples processed at each stage in the outer relation gives the toggling rate and can be adaptively monitored.
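The interleaving of build and probe phases in the symmetric hash join can be sketched in a few lines of Python. This is a minimal in-memory illustration: the encoding of the input as `(side, tuple)` pairs in arrival order and the key functions are assumptions of the sketch, and memory overflow is not handled.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Join tuples arriving in any order; stream yields ("R", tuple) or ("S", tuple)."""
    tables = {"R": defaultdict(list), "S": defaultdict(list)}   # one hash table per input
    keys = {"R": key_r, "S": key_s}
    for side, tup in stream:
        other = "S" if side == "R" else "R"
        k = keys[side](tup)
        # Probe the hash table of the other relation with the arriving tuple ...
        for match in tables[other].get(k, ()):
            yield (tup, match) if side == "R" else (match, tup)
        # ... then insert it, so that later arrivals from the other side can join with it.
        tables[side][k].append(tup)


arrivals = [("R", (1, "a")), ("S", (1, "x")), ("S", (2, "y")),
            ("R", (2, "b")), ("R", (1, "c"))]
first = lambda t: t[0]
results = list(symmetric_hash_join(arrivals, first, first))
```

Each result is produced as soon as its second component arrives, whichever relation it comes from, which is what allows an arriving tuple to be processed on the fly.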

Using symmetric join algorithms, Eddy can achieve flexible join ordering by controlling the history and constraints regarding predicate evaluation on a tuple basis. This control is implemented using two sets of progress bits carried by each tuple, which indicate, respectively, the predicates by which the tuple is ready to be evaluated (i.e., the “ready bits”) and the set of predicates already evaluated (i.e., the “done bits”). When a tuple t is read into an Eddy operator, all done bits are zeroed and the predicates without ordering constraints, and for which t is eligible, have their corresponding ready bits set. After each predicate evaluation, the corresponding done bit is set and the ready bits are updated accordingly. When a join concatenates a pair of tuples, their done bits are ORed and a new set of ready bits is turned on. Combining progress bits and symmetric join algorithms allows Eddy to schedule predicates in an adaptive way.
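The progress bits can be illustrated on the query of Example 9.10 with the following Python sketch. It is a simplification made for illustration: the done bits are stored as a bitmask, whereas the ready bits are derived on demand from the done bits and from the relations a tuple covers instead of being stored. The names `RoutedTuple`, `ELIGIBLE_IF_HAS`, `after_predicate` and `after_join` are hypothetical.

```python
from dataclasses import dataclass

SIGMA_P, JOIN_RS, JOIN_ST = 0b001, 0b010, 0b100
ALL_PREDICATES = SIGMA_P | JOIN_RS | JOIN_ST

# Relations a tuple must already cover to be sent to each predicate.
# JOIN_ST lists only S: the constraint S < T means S-side tuples probe the index on T.
ELIGIBLE_IF_HAS = {SIGMA_P: {"R"}, JOIN_RS: {"R", "S"}, JOIN_ST: {"S"}}


@dataclass
class RoutedTuple:
    relations: frozenset   # base relations this (possibly concatenated) tuple covers
    done: int = 0          # done bits: predicates already evaluated

    @property
    def ready(self):
        """Ready bits: eligible predicates that have not been evaluated yet."""
        bits = 0
        for pred, needs in ELIGIBLE_IF_HAS.items():
            if self.relations & needs and not self.done & pred:
                bits |= pred
        return bits

    def finished(self):
        """True when the tuple has passed all predicates and can be output."""
        return self.done == ALL_PREDICATES


def after_predicate(t, pred):
    """A selection was evaluated and passed: set its done bit."""
    return RoutedTuple(t.relations, t.done | pred)


def after_join(left, right, pred):
    """A join concatenated two tuples: OR their done bits and mark the join as done."""
    return RoutedTuple(left.relations | right.relations, left.done | right.done | pred)


r = RoutedTuple(frozenset({"R"}))
s = RoutedTuple(frozenset({"S"}))
rs = after_join(after_predicate(r, SIGMA_P), s, JOIN_RS)
rst = after_join(rs, RoutedTuple(frozenset({"T"})), JOIN_ST)
```

Here the R tuple is first evaluated by the selection and then joined with S, but the same bits equally allow the join to be evaluated first, in which case the concatenated tuple remains ready for both the selection and the second join.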

Adaptive scheduling.

Given a set of candidate predicates, Eddy must adaptively select the one to which each tuple will be sent. Two main principles drive the choice of a predicate in Eddy: cost and selectivity. Predicate costs are measured as a function of the consumption rate of each predicate. Remember that the Eddy operator holds tuples in its internal buffer, which is shared by all predicates. Low cost (i.e., fast) predicates finish their work quicker and request new tuples from the Eddy. As a result, low cost predicates get allocated more tuples than high cost predicates. This strategy, however, is agnostic with respect to predicate selectivity. Eddy’s tuple routing strategy is complemented by a simple lottery scheduling mechanism that learns about predicate selectivity [Arpaci-Dusseau et al., 1999]. The strategy credits a ticket to a predicate whenever the latter gets scheduled a tuple. Once a tuple has been processed and is bounced back to the Eddy, the corresponding predicate gets its ticket amount decremented. Thus, a predicate that eliminates many tuples (i.e., a selective one) accumulates tickets. Combining cost and selectivity criteria becomes easy. Eddy continuously runs a lottery among predicates currently requesting tuples. A predicate holding more tickets is more likely to win the lottery and get scheduled.

Another interesting issue is the choice of the running tuple from the input buffer. In order to end query processing, all tuples in the input buffer must be evaluated. Thus, a difference in tuple scheduling may reflect user preferences with respect to tuple output. For example, Eddy may favor tuples with a higher number of done bits set, so that the user receives the first results earlier.
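The ticket accounting behind the lottery can be sketched as follows in Python. This is a minimal illustration and not Telegraph's code: the class and method names are hypothetical, and adding one to every weight, so that a predicate without tickets can still win, is an assumption of the sketch.

```python
import random


class LotteryScheduler:
    """Ticket-based choice among the predicates that currently request a tuple."""

    def __init__(self, predicates, rng=None):
        self.tickets = {p: 0 for p in predicates}
        self.rng = rng or random.Random()

    def choose(self, requesting):
        """Run one lottery; the chance of winning grows with the ticket count."""
        # Assumption: +1 keeps predicates with zero tickets in the draw.
        weights = [self.tickets[p] + 1 for p in requesting]
        winner = self.rng.choices(requesting, weights=weights, k=1)[0]
        self.tickets[winner] += 1          # credit: the predicate is scheduled a tuple
        return winner

    def returned(self, predicate):
        """Debit one ticket when the predicate bounces a tuple back to the Eddy."""
        self.tickets[predicate] = max(0, self.tickets[predicate] - 1)


scheduler = LotteryScheduler(["sigma_p", "join_1"])
winner = scheduler.choose(["sigma_p", "join_1"])
```

A predicate that returns every tuple it consumes is debited as often as it is credited, whereas a selective predicate keeps the tickets of the tuples it eliminates and is therefore favored in later lotteries.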

9.4.3.3 Extensions to Eddy

The Eddy approach has been extended in various directions. In the cherry picking approach [Porto et al., 2003], context is used instead of simple ticket-based scheduling. The relationships among expensive predicate input attribute values are discovered at runtime and used as the basis for adaptive tuple scheduling. Given a query Q with $D = \{R[A,B,C]\}$, $P = \{\sigma^1_p(R.A), \sigma^2_p(R.B), \sigma^3_p(R.C)\}$ and $C = \emptyset$, the main idea is to model the input attribute values of the expensive predicates in P as a hypergraph $G = (V, E)$, where V is a set of n node partitions, with n being the number of expensive predicates. Each partition corresponds to a single attribute of the input relation R that is input to a predicate in P and each node corresponds to a distinct value of that attribute. A hyperedge $e = \{a_i, b_j, c_k\}$ corresponds to a tuple of relation R. The degree of a node $v_i$ corresponds to the number of hyperedges in which $v_i$ takes part. With this modeling, efficiently evaluating query Q corresponds to eliminating as quickly as possible the hyperedges in G. A hyperedge is eliminated whenever a value associated with one of its nodes is evaluated by a predicate in P and returns false. Furthermore, node degrees model hidden attribute dependencies, so that when the result of a predicate evaluation over a value $v_i$ returns false, all hyperedges (i.e., tuples) that $v_i$ takes part in are also eliminated. An adaptive content-sensitive strategy to evaluate a query Q is proposed for this model. It schedules values to be evaluated by a predicate according to the Fanout of its corresponding node, computed as the product of the node degree in the hypergraph G with the ratio between the corresponding predicate selectivity and predicate unitary evaluation cost.
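The Fanout computation can be sketched in Python as follows. The representation of tuples as dictionaries, the per-attribute `selectivity` and `cost` tables, and the choice of evaluating the value with the highest Fanout first are assumptions made for illustration; the sketch follows the definition given above and is not the implementation of [Porto et al., 2003].

```python
from collections import Counter


def fanouts(tuples, selectivity, cost):
    """Map each node (attribute, value) to degree * selectivity / unitary cost."""
    # The degree of a node is the number of hyperedges (tuples) it takes part in.
    degree = Counter((attr, value) for t in tuples for attr, value in t.items())
    return {node: d * selectivity[node[0]] / cost[node[0]] for node, d in degree.items()}


def eliminate(tuples, attr, value):
    """Drop every hyperedge containing a value for which the predicate returned false."""
    return [t for t in tuples if t[attr] != value]


R = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}, {"A": 2, "B": "x"}]
selectivity = {"A": 0.5, "B": 0.2}   # hypothetical per-predicate values
cost = {"A": 2.0, "B": 1.0}          # hypothetical unitary evaluation costs

scores = fanouts(R, selectivity, cost)
best = max(scores, key=scores.get)   # assumption: highest Fanout is evaluated first
remaining = eliminate(R, "A", 1)     # result if the predicate on A = 1 returns false
```

Evaluating the value A = 1 once and finding it false eliminates two tuples at the cost of a single predicate evaluation, which is the benefit that the node degree captures.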

Another interesting extension is distributed Eddies [Tian and DeWitt, 2003b] to deal with distributed input data streams. Since a centralized Eddy operator may quickly become a bottleneck, a distributed approach is proposed for tuple routing. Each operator decides on the next operator to route a tuple to based on the tuple's history of operator evaluation (i.e., done bits) and statistics collected from the remaining operators. In a distributed setting, each operator may run at a different node in the network with a queue holding input tuples. The query optimization problem is specified by considering two new metrics for measuring stream query performance: average response time and maximum data rate. The former corresponds to the average time tuples take to traverse the operators in a plan, whereas the latter measures the maximum throughput the system can withstand without overloading. Routing strategies use the following parameters: operator’s cost, selectivity, length of operator’s input queue and probability of an operator being routed a tuple. The combination of these parameters yields efficient query evaluation. Using operator’s cost and selectivity guarantees that low-cost and highly selective operators are given higher routing priority. Queue length provides information on the average time tuples are staged in queues. Managing operator’s queue length allows the routing decision to avoid overloaded operators. Thus, by supporting routing policies, each operator is able to individually make routing decisions, thereby avoiding the bottleneck of a centralized router.
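A local routing decision of this kind can be sketched in Python. The policy below is hypothetical and is not the one of [Tian and DeWitt, 2003b]: it skips operators that the tuple has already visited or whose queue is too long, and ranks the others with the classical metric (selectivity − 1)/cost used for ordering expensive predicates, where selectivity is taken as the fraction of tuples that pass the operator.

```python
def next_operator(done, operators, max_queue):
    """Pick the operator to route a tuple to, or None if no candidate remains."""
    candidates = [op for op in operators
                  if op["name"] not in done and op["queue_len"] < max_queue]
    if not candidates:
        return None
    # Lowest rank first: cheap operators that drop many tuples get priority.
    rank = lambda op: (op["pass_fraction"] - 1.0) / op["cost"]
    return min(candidates, key=rank)["name"]


operators = [
    {"name": "a", "cost": 1.0, "pass_fraction": 0.5, "queue_len": 0},
    {"name": "b", "cost": 1.0, "pass_fraction": 0.1, "queue_len": 0},
    {"name": "c", "cost": 0.5, "pass_fraction": 0.1, "queue_len": 10},
]
choice = next_operator(set(), operators, max_queue=5)
```

Operator c has the best rank but is skipped because its queue exceeds the limit, so the tuple is routed to b; this shows how queue length lets the decision avoid an overloaded operator.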

9.5 Query Translation and Execution

Query translation and execution is performed by the wrappers using the component DBMSs. A wrapper encapsulates the details of one or more component databases, each supported by the same DBMS (or file system). It also exports to the mediator the component DBMS capabilities and cost functions in a common interface. One of the major practical uses of wrappers has been to allow an SQL-based DBMS to access non-SQL databases [Roth and Schwartz, 1997].

The main function of a wrapper is conversion between the common interface and the DBMS-dependent interface. Figure 9.7 shows these different levels of interfaces between the mediator, the wrapper and the component DBMSs. Note that, depending on the level of autonomy of the component DBMSs, these three components can be located differently. For instance, in the case of strong autonomy, the wrapper should be at the mediator site, possibly on the same server. Thus, communication between a wrapper and its component DBMS incurs network cost. However, in the case of a cooperative component database (e.g., within the same organization), the wrapper could be installed at the component DBMS site, much like an ODBC driver. Thus, communication between the wrapper and the component DBMS is much more efficient.

The information necessary to perform conversion is stored in the wrapper schema that includes the local schema exported to the mediator in the common interface (e.g., relational) and the schema mappings to transform data between the local schema and the component database schema and vice-versa. We discussed schema mappings in Chapter 4. Two kinds of conversion are needed. First, the wrapper must translate the input QEP generated by the mediator and expressed in a common interface into calls to the component DBMS using its DBMS-dependent interface. These calls yield query execution by the component DBMS, which returns results expressed in the DBMS-dependent interface. Second, the wrapper must translate the results to the common interface format so that they can be returned to the mediator for integration. In addition, the wrapper can execute operations that are not supported by the component DBMS (e.g., the scan operation by wrapper w2 in Figure 9.4).

Fig. 9.7 Wrapper interfaces (the mediator reaches the wrapper through the common interface, and the wrapper reaches the component DBMS through the DBMS-dependent interface)

As discussed in Section 9.4.2, the common interface to the wrappers can be query-based or operator-based. The problem of translation is similar in both approaches. To illustrate query translation in the following example, we use the query-based approach with the SQL/MED standard that allows a relational DBMS to access external data represented as foreign relations in the wrapper’s local schema. This example, borrowed from [Melton et al., 2001], illustrates how a very simple data source can be wrapped to be accessed through SQL.

Example 9.11. We consider relation EMP(ENO, ENAME, CITY) stored in a very simple component database, in server ComponentDB, built with Unix text files. Each EMP tuple can then be stored as a line in a file, e.g., with the attributes separated by “:”. In SQL/MED, the definition of the local schema for this relation together with the mapping to a Unix file can be declared as a foreign relation with the following statement:

CREATE FOREIGN TABLE EMP (
    ENO INTEGER,
    ENAME VARCHAR(30),
    CITY VARCHAR(20)
)
SERVER ComponentDB
OPTIONS (Filename '/usr/EngDB/emp.txt', Delimiter ':')

Then, the mediator can send SQL statements to the wrapper supporting access to this relation. For instance, the query:

SELECT ENAME FROM EMP

can be translated by the wrapper using the following Unix shell command to extract the relevant attribute:

cut -d: -f2 /usr/EngDB/emp.txt

Additional processing, e.g., for type conversion, can then be done using programming code.
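The translation of Example 9.11, including the type conversion step, can also be sketched directly in Python. The functions `project` and `select_column` and the `EMP_COLUMNS` table are hypothetical names introduced for illustration; a real wrapper would derive the column positions and types from the foreign table definition.

```python
def project(path, field_index, delimiter=":"):
    """Counterpart of `cut -d: -f2` for field_index 1: yield one field of each line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split(delimiter)[field_index]


# Local schema of EMP: column name -> (position in the line, type conversion).
EMP_COLUMNS = {"ENO": (0, int), "ENAME": (1, str), "CITY": (2, str)}


def select_column(path, column):
    """Evaluate SELECT <column> FROM EMP over the text file at path."""
    index, convert = EMP_COLUMNS[column]
    return [convert(value) for value in project(path, index)]
```

For instance, `select_column("/usr/EngDB/emp.txt", "ENAME")` plays the role of the `cut` command above, whereas `select_column("/usr/EngDB/emp.txt", "ENO")` additionally converts each value to an integer.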

Wrappers are mostly used for read-only queries, which makes query translation and wrapper construction relatively easy. Wrapper construction typically relies on CASE tools with reusable components to generate most of the wrapper code [Tomasic et al., 1997]. Furthermore, DBMS vendors provide wrappers for transparently accessing their DBMS using standard interfaces. However, wrapper construction is much more difficult if updates to component databases are to be supported through wrappers (as opposed to directly updating the component databases through their DBMS). The main problem is due to the heterogeneity of integrity constraints between the common interface and the DBMS-dependent interface. As discussed in Chapter 5, integrity constraints are used to reject updates that violate database consistency. In modern DBMSs, integrity constraints are explicit and specified as rules as part of the database schema. However, in older DBMSs or simpler data sources (e.g., files), integrity constraints are implicit and implemented by specific code in the applications. For instance, in Example 9.11, there could be applications with some embedded code that rejects insertions of new lines with an existing ENO in the EMP text file. This code corresponds to a unique key constraint on ENO in relation EMP but is not readily available to the wrapper. Thus, the main problem of updating through a wrapper is to guarantee component database consistency by rejecting all updates that violate integrity constraints, whether they are explicit or implicit. A software engineering solution to this problem uses a CASE tool with reverse engineering techniques to identify within application code the implicit integrity constraints which are then translated into validation code in the wrappers [Thiran et al., 2006].
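The kind of validation code that a wrapper would need for the implicit unique key constraint on ENO can be sketched in Python. The function `insert_emp` is hypothetical and written by hand for illustration; it is not the output of the reverse engineering tools mentioned above, and it ignores concurrent writers.

```python
def insert_emp(path, eno, ename, city, delimiter=":"):
    """Append an EMP line, rejecting the update if ENO already exists in the file."""
    try:
        with open(path, encoding="utf-8") as f:
            existing = {line.split(delimiter, 1)[0] for line in f if line.strip()}
    except FileNotFoundError:
        existing = set()               # no file yet: no key can be duplicated
    if str(eno) in existing:
        raise ValueError(f"ENO {eno} already exists in EMP")
    with open(path, "a", encoding="utf-8") as f:
        f.write(delimiter.join([str(eno), ename, city]) + "\n")
```

An update that would duplicate an employee number is thus rejected by the wrapper itself, even though the text file has no means of enforcing the constraint.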

Another major problem is wrapper maintenance. Query translation relies heavily on the mappings between the component database schema and the local schema. If the component database schema is changed to reflect the evolution of the component database, then the mappings can become invalid. For instance, in Example 9.11, the administrator may switch the order of the fields in the EMP file. Using invalid mappings may prevent the wrapper from producing correct results. Since the component databases are autonomous, detecting and correcting invalid mappings is important. The techniques to do so are those for mapping maintenance that we presented in Chapter 4.

9.6 Conclusion

Query processing in multidatabase systems is significantly more complex than in tightly-integrated and homogeneous distributed DBMSs. In addition to being distributed, component databases may be autonomous, have different database languages and query processing capabilities, and exhibit varying behavior. In particular, component databases may range from full-fledged SQL databases to very simple data sources (e.g., text files).

In this chapter, we addressed these issues by extending and modifying the distributed query processing architecture presented in Chapter 6. Assuming the popular mediator/wrapper architecture, we isolated the three main layers by which a query is successively rewritten (to bear on local relations) and optimized by the mediator, and then translated and executed by the wrappers and component DBMSs. We also discussed how to support OLAP queries in a multidatabase, an important requirement of decision-support applications. This requires an additional layer of translation from OLAP multidimensional queries to relational queries. This layered architecture for multidatabase query processing is general enough to capture very different variations. This has been useful to describe various query processing techniques, typically designed with different objectives and assumptions.

The main techniques for multidatabase query processing are query rewriting using multidatabase views, multidatabase query optimization and execution, and query translation and execution. The techniques for query rewriting using multidatabase views differ in major ways depending on whether the GAV or LAV integration approach is used. Query rewriting in GAV is similar to data localization in homogeneous distributed database systems. But the techniques for LAV (and its extension GLAV) are much more involved and it is often not possible to find an equivalent rewriting for a query, in which case a query that produces a maximum subset of the answer is necessary. The techniques for multidatabase query optimization include cost modeling and query optimization for component databases with different computing capabilities. These techniques extend traditional distributed query processing by focusing on heterogeneity. Besides heterogeneity, an important problem is to deal with the dynamic behavior of the component DBMSs. Adaptive query processing addresses this problem with a dynamic approach whereby the query optimizer communicates at run time with the execution environment in order to react to unforeseen variations of runtime conditions. Finally, we discussed the techniques for translating queries for execution by the component DBMSs and for generating and managing wrappers.

The data model used by the mediator can be relational, object-oriented or even semi-structured (based on XML). In this chapter, for simplicity, we assumed a mediator with a relational model that is sufficient to explain the multidatabase query processing techniques. However, when dealing with data sources on the Web, a richer mediator model such as object-oriented or semi-structured (e.g., XML-based) may be preferred. This requires significant extensions to query processing techniques.

9.7 Bibliographic Notes

Work on multidatabase query processing started in the early 1980’s with the first multidatabase systems (e.g., [Brill et al., 1984; Dayal and Hwang, 1984] and [Landers and Rosenberg, 1982]). The objective then was to access different databases within an organization. In the 1990’s, the increasing use of the Web for accessing all kinds of data sources triggered renewed interest and much more work in multidatabase query processing, following the popular mediator/wrapper architecture [Wiederhold, 1992]. A brief overview of multidatabase query optimization issues can be found in [Meng et al., 1993]. Good discussions of multidatabase query processing can be found in [Lu et al., 1992, 1993], in Chapter 4 of [Yu and Meng, 1998] and in [Kossmann, 2000].

Query rewriting using views is surveyed in [Halevy, 2001]. In [Levy et al., 1995], the general problem of finding a rewriting using views is shown to be NP-complete in the number of views and the number of subgoals in the query. The unfolding technique for rewriting a query expressed in Datalog in GAV was proposed in [Ullman, 1997]. The main techniques for query rewriting using views in LAV are the bucket algorithm [Levy et al., 1996b], the inverse rule algorithm [Duschka and Genesereth, 1997], and the MiniCon algorithm [Pottinger and Levy, 2000].

The three main approaches for heterogeneous cost modeling are discussed in [Zhu and Larson, 1998]. The black-box approach is used in [Du et al., 1992; Zhu and Larson, 1994]. The customized approach is developed in [Zhu and Larson, 1996a; Roth et al., 1999; Naacke et al., 1999]. The dynamic approach is used in [Zhu et al., 2000], [Zhu et al., 2003] and [Rahal et al., 2004].

The algorithm we described to illustrate the query-based approach to heterogeneous query optimization was proposed in [Du et al., 1995]. To illustrate the operator-based approach, we described the popular solution with planning functions proposed in the Garlic project [Haas et al., 1997a]. The operator-based approach has also been used in DISCO, a multidatabase system to access component databases over the web [Tomasic et al., 1996, 1998].

Adaptive query processing is surveyed in [Hellerstein et al., 2000; Gounaris et al., 2002b]. The seminal paper on the Eddy approach which we used to illustrate adaptive query processing is [Avnur and Hellerstein, 2000]. Other important techniques for adaptive query processing are query scrambling [Amsaleg et al., 1996a; Urhan et al., 1998a], Ripple joins [Haas and Hellerstein, 1999b], adaptive partitioning [Shah et al., 2003] and Cherry picking [Porto et al., 2003]. Major extensions to Eddy are state modules [Raman et al., 2003] and distributed Eddies [Tian and DeWitt, 2003b].

A software engineering solution to the problem of wrapper creation and maintenance, considering integrity control, is proposed in [Thiran et al., 2006].


Exercises

Problem 9.1 (**). Can any type of global optimization be performed on global queries in a multidatabase system? Discuss and formally specify the conditions under which such optimization would be possible.

Problem 9.2 (*). Consider a marketing application with a ROLAP server at site s1 which needs to integrate information from two customer databases, each at site s2 within the corporate network. Assume also that the application needs to combine customer information with information extracted from Web data sources about cities in 10 different countries. For security reasons, a web server at site s3 is dedicated to Web access outside the corporate network. Propose a multidatabase system architecture with mediator and wrappers to support this application. Discuss and justify design choices.

Problem 9.3 (**). Consider the global relations EMP(ENAME, TITLE, CITY) and ASG(ENAME, PNAME, CITY, DUR). CITY in ASG is the location of the project of name PNAME (i.e., PNAME functionally determines CITY). Consider the local relations EMP1(ENAME, TITLE, CITY), EMP2(ENAME, TITLE, CITY), PROJ1(PNAME, CITY), PROJ2(PNAME, CITY) and ASG1(ENAME, PNAME, DUR). Consider query Q which selects the names of the employees assigned to a project in Rio de Janeiro for more than 6 months and the duration of their assignment.

(a) Assuming the GAV approach, perform query rewriting.
(b) Assuming the LAV approach, perform query rewriting using the bucket algorithm.
(c) Same as (b) using the MinCon algorithm.

Problem 9.4 (*). Consider relations EMP and ASG of Example 9.7. We denote by |R| the number of pages to store R on disk. Consider the following statistics about the data:

|EMP| = 1 000
|EMP| = 100
|ASG| = 10 000
|ASG| = 2 000
selectivity(ASG.DUR > 36) = 1%

The mediator generic cost model is:

cost(σ_{A=v}(R)) = |R|
cost(σ(X)) = cost(X), where X contains at least one operator.
cost(R ⋈^{ind}_A S) = cost(R) + |R| ∗ cost(σ_{A=v}(S)) using an indexed join algorithm.
cost(R ⋈^{nl}_A S) = cost(R) + |R| ∗ cost(S) using a nested loop join algorithm.

Consider the MDBMS input query Q:


SELECT *
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND ASG.DUR > 36

Consider four plans to process Q:

P1 = EMP ⋈^{ind}_{ENO} σ_{DUR>36}(ASG)
P2 = EMP ⋈^{nl}_{ENO} σ_{DUR>36}(ASG)
P3 = σ_{DUR>36}(ASG) ⋈^{ind}_{ENO} EMP
P4 = σ_{DUR>36}(ASG) ⋈^{nl}_{ENO} EMP

(a) What is the cost of plans P1 to P4?
(b) Which plan has the minimal cost?

Problem 9.5 (*). Consider relations EMP and ASG of the previous exercise. Suppose now that the mediator cost model is completed with the following cost information provided by the component DBMSs.

The cost of accessing EMP tuples at db1 is:

cost(σ_{A=v}(R)) = |σ_{A=v}(R)|

The specific cost of selecting ASG tuples that have a given ENO at db2 is:

cost(σ_{ENO=v}(ASG)) = |σ_{ENO=v}(ASG)|

(a) What is the cost of plans P1 to P4?
(b) Which plan has the minimal cost?

Problem 9.6 (**). What are the respective advantages and limitations of the query-based and operator-based approaches to heterogeneous query optimization from the points of view of query expressiveness, query performance, development cost of wrappers, system (mediator and wrappers) maintenance and evolution?

Problem 9.7 (**). Consider Example 9.8 by adding, at a new site, component database db4 which stores relations EMP(ENO, ENAME, CITY) and ASG(ENO, PNAME, DUR). db4 exports through its wrapper w3 join and scan capabilities. Let us assume that there can be employees in db1 with corresponding assignments in db4 and employees in db4 with corresponding assignments in db2.

(a) Define the planning functions of wrapper w3.
(b) Give the new definition of global view EMPASG(ENAME, CITY, PNAME, DUR).
(c) Give a QEP for the same query as in Example 9.8.

Problem 9.8 (**). Consider three relations R(A,B), S(B,C) and T(C,D) and query Q (σ^1_p(R) ⋈_1 S ⋈_2 T), where ⋈_1 and ⋈_2 are natural joins. Assume that S has an index on attribute B and T has an index on attribute C. Furthermore, σ^1_p is an expensive predicate (i.e., a predicate over the results of running a program over values of R.A). Using the Eddy approach for adaptive query processing, answer the following questions:

(a) Propose the set C of constraints on Q to produce an Eddy-based QEP.
(b) Give a query graph G for Q.
(c) Using C and G, propose an Eddy-based QEP.
(d) Propose a second QEP that uses State Modules. Discuss the advantages obtained by using state modules in this QEP.

Problem 9.9 (**). Propose a data structure to store tuples in the Eddy buffer pool that helps to choose quickly the next tuple to be evaluated according to a user-specified preference, for instance, producing the first results earlier.

Problem 9.10 (**). Propose a predicate scheduling algorithm based on the Cherry picking approach introduced in Section 9.4.3.3.

Chapter 10 Introduction to Transaction Management

Up to this point the basic access primitive that we have considered has been a query. Our focus has been on retrieve-only (or read-only) queries that read data from a distributed database. We have not yet considered what happens if, for example, two queries attempt to update the same data item, or if a system failure occurs during execution of a query. For retrieve-only queries, neither of these conditions is a problem. One can have two queries reading the value of the same data item concurrently. Similarly, a read-only query can simply be restarted after a system failure is handled. On the other hand, it is not difficult to see that for update queries, these conditions can have disastrous effects on the database. We cannot, for example, simply restart the execution of an update query following a system failure since certain data item values may already have been updated prior to the failure and should not be updated again when the query is restarted. Otherwise, the database would contain incorrect data.

The fundamental point here is that there is no notion of “consistent execution” or “reliable computation” associated with the concept of a query. The concept of a transaction is used in database systems as a basic unit of consistent and reliable computing. Thus queries are executed as transactions once their execution strategies are determined and they are translated into primitive database operations.

In the discussion above, we used the terms consistent and reliable quite informally. Due to their importance in our discussion, we need to define them more precisely. We differentiate between database consistency and transaction consistency.

A database is in a consistent state if it obeys all of the consistency (integrity) constraints defined over it (see Chapter 5). State changes occur due to modifications, insertions, and deletions (together called updates). Of course, we want to ensure that the database never enters an inconsistent state. Note that the database can be (and usually is) temporarily inconsistent during the execution of a transaction. The important point is that the database should be consistent when the transaction terminates (Figure 10.1).

Fig. 10.1 A Transaction Model (Begin Transaction T: database in a consistent state; Execution of Transaction T: database may be temporarily in an inconsistent state during execution; End Transaction T: database in a consistent state)

Transaction consistency, on the other hand, refers to the actions of concurrent transactions. We would like the database to remain in a consistent state even if there are a number of user requests that are concurrently accessing (reading or updating) the database. A complication arises when replicated databases are considered. A replicated database is in a mutually consistent state if all the copies of every data item in it have identical values. This is referred to as one-copy equivalence since all replica copies are forced to assume the same state at the end of a transaction’s execution. There are more relaxed notions of replica consistency that allow replica values to diverge. These will be discussed later in Chapter 13.
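To make the definition of mutual consistency concrete, here is a minimal Python sketch of such a check. The replica layout (one dictionary per site, mapping data item names to values) and the function name `mutually_consistent` are assumptions made purely for illustration.

```python
def mutually_consistent(replicas: list[dict[str, int]]) -> bool:
    """Return True if every copy of every data item has the same value.

    Each dictionary stands for one site and maps a data item name to the
    value of the copy stored there; a site need not hold every item.
    """
    seen: dict[str, int] = {}
    for site in replicas:
        for item, value in site.items():
            # Compare against the first copy of this item that was seen.
            if seen.setdefault(item, value) != value:
                return False
    return True


# Hypothetical sites holding copies of data items x and y.
site1 = {"x": 50, "y": 7}
site2 = {"x": 50}
site3 = {"x": 50, "y": 7}
consistent_before = mutually_consistent([site1, site2, site3])

site2["x"] = 60  # one copy diverges, e.g. an update that was not propagated
consistent_after = mutually_consistent([site1, site2, site3])
```

The check only compares values at one instant; it says nothing about how the copies are brought back into agreement.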

Reliability refers to both the resiliency of a system to various types of failures and its capability to recover from them. A resilient system is tolerant of system failures and can continue to provide services even when failures occur. A recoverable DBMS is one that can get to a consistent state (by moving back to a previous consistent state or forward to a new consistent state) following various types of failures.

Transaction management deals with the problems of always keeping the database in a consistent state even when concurrent accesses and failures occur. In the upcoming two chapters, we investigate the issues related to managing transactions. A third chapter will address issues related to keeping replicated databases consistent. The purpose of the current chapter is to define the fundamental terms and to provide the framework within which these issues can be discussed. It also serves as a concise introduction to the problem and the related issues. We will therefore discuss the concepts at a high level of abstraction and will not present any management techniques.

The organization of this chapter is as follows. In the next section we formally and intuitively define the concept of a transaction. In Section 10.2 we discuss the properties of transactions and what the implications of each of these properties are in terms of transaction management. In Section 10.3 we present various types of transactions. In Section 10.4 we revisit the architectural model defined in Chapter 1 and indicate the modifications that are necessary to support transaction management.


10.1 Definition of a Transaction

Gray [1981] indicates that the transaction concept has its roots in contract law. He states, “In making a contract, two or more parties negotiate for a while and then make a deal. The deal is made binding by the joint signature of a document or by some other act (as simple as a handshake or a nod). If the parties are rather suspicious of one another or just want to be safe, they appoint an intermediary (usually called an escrow officer) to coordinate the commitment of the transaction.” The nice aspect of this historical perspective is that it does indeed encompass some of the fundamental properties of a transaction (atomicity and durability) as the term is used in database systems. It also serves to indicate the differences between a transaction and a query.

As indicated before, a transaction is a unit of consistent and reliable computation. Thus, intuitively, a transaction takes a database, performs an action on it, and generates a new version of the database, causing a state transition. This is similar to what a query does, except that if the database was consistent before the execution of the transaction, we can now guarantee that it will be consistent at the end of its execution regardless of the fact that (1) the transaction may have been executed concurrently with others, and (2) failures may have occurred during its execution.

In general, a transaction is considered to be made up of a sequence of read and write operations on the database, together with computation steps. In that sense, a transaction may be thought of as a program with embedded database access queries [Papadimitriou, 1986]. Another definition of a transaction is that it is a single execution of a program [Ullman, 1988]. A single query can also be thought of as a program that can be posed as a transaction.

Example 10.1. Consider the following SQL query for increasing by 10% the budget of the CAD/CAM project that we discussed (in Example 5.20):

UPDATE PROJ
SET BUDGET = BUDGET*1.1
WHERE PNAME = "CAD/CAM"

This query can be specified, using the embedded SQL notation, as a transaction by giving it a name (e.g., BUDGET_UPDATE) and declaring it as follows:

Begin transaction BUDGET_UPDATE
begin
    EXEC SQL UPDATE PROJ
        SET BUDGET = BUDGET*1.1
        WHERE PNAME = "CAD/CAM"
end.

The Begin transaction and end statements delimit a transaction. Note that the use of delimiters is not enforced in every DBMS. If delimiters are not specified, a DBMS may simply treat as a transaction the entire program that performs a database access.
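As a rough illustration of how a delimited update of this kind might look in a host language, the following Python sketch runs the same UPDATE inside an explicit transaction against an in-memory SQLite database. The table contents, the PNO column and the helper name `budget_update` are assumptions made for this example.

```python
import sqlite3


def budget_update(conn: sqlite3.Connection) -> None:
    """Raise the CAD/CAM budget by 10% as a single transaction."""
    # Using the connection as a context manager commits on success
    # and rolls back if the block raises an exception.
    with conn:
        conn.execute(
            "UPDATE PROJ SET BUDGET = BUDGET * 1.1 WHERE PNAME = ?",
            ("CAD/CAM",),
        )


# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PROJ (PNO TEXT PRIMARY KEY, PNAME TEXT, BUDGET REAL)")
conn.execute("INSERT INTO PROJ VALUES ('P3', 'CAD/CAM', 250000)")
conn.commit()

budget_update(conn)
budget = conn.execute(
    "SELECT BUDGET FROM PROJ WHERE PNAME = 'CAD/CAM'"
).fetchone()[0]
```

Here the `with conn:` block plays the role of the transaction delimiters: everything executed inside it is committed or rolled back as one unit.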


Example 10.2. In our discussion of transaction management concepts, we will use an airline reservation system example instead of the one used in the first nine chapters. The real-life implementation of this application almost always makes use of the transaction concept. Let us assume that there is a FLIGHT relation that records the data about each flight, a CUST relation for the customers who book flights, and an FC relation indicating which customers are on what flights. Let us also assume that the relation definitions are as follows (where the underlined attributes constitute the keys):

FLIGHT(FNO, DATE, SRC, DEST, STSOLD, CAP)
CUST(CNAME, ADDR, BAL)
FC(FNO, DATE, CNAME, SPECIAL)

The definitions of the attributes in this database schema are as follows: FNO is the flight number, DATE denotes the flight date, SRC and DEST indicate the source and destination for the flight, STSOLD indicates the number of seats that have been sold on that flight, CAP denotes the passenger capacity on the flight, CNAME indicates the customer name whose address is stored in ADDR and whose account balance is in BAL, and SPECIAL corresponds to any special requests that the customer may have for a booking.

Let us consider a simplified version of a typical reservation application, where a travel agent enters the flight number, the date, and a customer name, and asks for a reservation. The transaction to perform this function can be implemented as follows, where database accesses are specified in embedded SQL notation:

Begin transaction Reservation
begin
    input(flight_no, date, customer_name);                 (1)
    EXEC SQL UPDATE FLIGHT                                  (2)
        SET STSOLD = STSOLD + 1
        WHERE FNO = flight_no AND DATE = date;
    EXEC SQL INSERT                                         (3)
        INTO FC(FNO,DATE,CNAME,SPECIAL)
        VALUES (flight_no, date, customer_name, null);
    output("reservation completed")                         (4)
end.

Let us explain this example. First a point about notation. Even though we use embedded SQL, we do not follow its syntax very strictly. The lowercase terms are the program variables; the uppercase terms denote database relations and attributes as well as the SQL statements. Numeric constants are used as they are, whereas character constants are enclosed in quotes. Keywords of the host language are written in boldface, and null is a keyword for the null string.


The first thing that the transaction does [line (1)] is to input the flight number, the date, and the customer name. Line (2) updates the number of sold seats on the requested flight by one. Line (3) inserts a tuple into the FC relation. Here we assume that the customer is an old one, so it is not necessary to have an insertion into the CUST relation, creating a record for the client. The keyword null in line (3) indicates that the customer has no special requests on this flight. Finally, line (4) reports the result of the transaction to the agent’s terminal.

10.1.1 Termination Conditions of Transactions

The reservation transaction of Example 10.2 has an implicit assumption about its termination. It assumes that there will always be a free seat and does not take into consideration the fact that the transaction may fail due to lack of seats. This is an unrealistic assumption that brings up the issue of termination possibilities of transactions.

A transaction always terminates, even when there are failures as we will see in Chapter 12. If the transaction can complete its task successfully, we say that the transaction commits. If, on the other hand, a transaction stops without completing its task, we say that it aborts. Transactions may abort for a number of reasons, which are discussed in the upcoming chapters. In our example, a transaction aborts itself because of a condition that would prevent it from completing its task successfully. Additionally, the DBMS may abort a transaction due to, for example, deadlocks or other conditions. When a transaction is aborted, its execution is stopped and all of its already executed actions are undone by returning the database to the state before their execution. This is also known as rollback.
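The effect of an abort can be observed with a small Python sketch that uses an in-memory SQLite database. The table contents and the helper `sold` are assumptions made for illustration; the point is only that the update disappears once the transaction is rolled back.

```python
import sqlite3

# Hypothetical FLIGHT table with a reduced set of attributes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FLIGHT (FNO TEXT, DATE TEXT, STSOLD INTEGER, CAP INTEGER)")
conn.execute("INSERT INTO FLIGHT VALUES ('F100', '2011-05-01', 10, 100)")
conn.commit()


def sold(conn: sqlite3.Connection) -> int:
    """Number of seats sold on the sample flight."""
    return conn.execute("SELECT STSOLD FROM FLIGHT WHERE FNO = 'F100'").fetchone()[0]


before = sold(conn)

# The update is executed inside an open (uncommitted) transaction.
conn.execute("UPDATE FLIGHT SET STSOLD = STSOLD + 1 WHERE FNO = 'F100'")
during = sold(conn)   # the transaction sees its own uncommitted write

conn.rollback()       # abort: undo everything done since the last commit
after = sold(conn)
```

After the rollback the database is back in the state it had before the update was executed.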

The importance of commit is twofold. First, the commit command signals to the DBMS that the effects of that transaction should now be reflected in the database, thereby making them visible to other transactions that may access the same data items. Second, the point at which a transaction is committed is a “point of no return.” The results of the committed transaction are now permanently stored in the database and cannot be undone. The implementation of the commit command is discussed in Chapter 12.

Example 10.3. Let us return to our reservation system example. One thing we did not consider is that there may not be any free seats available on the desired flight. To cover this possibility, the reservation transaction needs to be revised as follows:

Begin transaction Reservation
begin
    input(flight_no, date, customer_name);
    EXEC SQL SELECT STSOLD,CAP
        INTO temp1,temp2
        FROM FLIGHT
        WHERE FNO = flight_no AND DATE = date;
    if temp1 = temp2 then
    begin
        output("no free seats");
        Abort
    end
    else begin
        EXEC SQL UPDATE FLIGHT
            SET STSOLD = STSOLD + 1
            WHERE FNO = flight_no AND DATE = date;
        EXEC SQL INSERT
            INTO FC(FNO,DATE,CNAME,SPECIAL)
            VALUES (flight_no, date, customer_name, null);
        Commit;
        output("reservation completed")
    end
    end-if
end.

In this version the first SQL statement gets the STSOLD and CAP into the two variables temp1 and temp2. These two values are then compared to determine if any seats are available. The transaction either aborts if there are no free seats, or updates the STSOLD value and inserts a new tuple into the FC relation to represent the seat that was sold.

Several things are important in this example. One is, obviously, the fact that if no free seats are available, the transaction is aborted¹. The second is the ordering of the output to the user with respect to the abort and commit commands. Transactions can be aborted either due to application logic, as is the case here, or due to deadlocks or system failures. If the transaction is aborted, the user can be notified before the DBMS is instructed to abort it. However, in case of commit, the user notification has to follow the successful servicing (by the DBMS) of the commit command, for reliability reasons. These are discussed further in Section 10.2.4 and in Chapter 12.
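The revised reservation transaction can be approximated in Python with SQLite, as in the sketch below. The function name `reserve`, the reduced FLIGHT schema (SRC and DEST are omitted) and the sample rows are assumptions made for this example, and the flight is assumed to exist. Note that the success message is printed only after the commit has returned, whereas the failure message may precede the abort.

```python
import sqlite3


def reserve(conn, flight_no, date, customer_name):
    """Book one seat; return True if the transaction commits, False if it aborts."""
    temp1, temp2 = conn.execute(
        "SELECT STSOLD, CAP FROM FLIGHT WHERE FNO = ? AND DATE = ?",
        (flight_no, date),
    ).fetchone()
    if temp1 == temp2:
        print("no free seats")      # notification may precede the abort
        conn.rollback()             # Abort
        return False
    conn.execute(
        "UPDATE FLIGHT SET STSOLD = STSOLD + 1 WHERE FNO = ? AND DATE = ?",
        (flight_no, date),
    )
    conn.execute(
        "INSERT INTO FC(FNO, DATE, CNAME, SPECIAL) VALUES (?, ?, ?, NULL)",
        (flight_no, date, customer_name),
    )
    conn.commit()                   # Commit
    print("reservation completed")  # only after the commit has succeeded
    return True


# Hypothetical data: a flight with exactly one seat left.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FLIGHT (FNO TEXT, DATE TEXT, STSOLD INTEGER, CAP INTEGER)")
conn.execute("CREATE TABLE FC (FNO TEXT, DATE TEXT, CNAME TEXT, SPECIAL TEXT)")
conn.execute("INSERT INTO FLIGHT VALUES ('F100', '2011-05-01', 99, 100)")
conn.commit()

first = reserve(conn, "F100", "2011-05-01", "Smith")   # seat available: commits
second = reserve(conn, "F100", "2011-05-01", "Jones")  # flight now full: aborts
bookings = conn.execute("SELECT COUNT(*) FROM FC").fetchone()[0]
```

This sketch runs on a single connection, so it does not address what happens when two reservations for the last seat execute concurrently.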

10.1.2 Characterization of Transactions

Observe in the preceding examples that transactions read and write some data. This has been used as the basis for characterizing a transaction. The data items that a transaction reads are said to constitute its read set (RS). Similarly, the data items that a transaction writes are said to constitute its write set (WS). The read set and write set of a transaction need not be mutually exclusive. The union of the read set and write set of a transaction constitutes its base set (BS = RS ∪ WS).

¹ We will be kind to the airlines and assume that they never overbook. Thus our reservation transaction does not need to check for that condition.

Example 10.4. Considering the reservation transaction as specified in Example 10.3 and the insert to be a number of write operations, the above-mentioned sets are defined as follows:

RS[Reservation] = {FLIGHT.STSOLD, FLIGHT.CAP}
WS[Reservation] = {FLIGHT.STSOLD, FC.FNO, FC.DATE, FC.CNAME, FC.SPECIAL}
BS[Reservation] = {FLIGHT.STSOLD, FLIGHT.CAP, FC.FNO, FC.DATE, FC.CNAME, FC.SPECIAL}

Note that it may be appropriate to include FLIGHT.FNO and FLIGHT.DATE in the read set of Reservation since they are accessed during execution of the SQL query. We omit them to simplify the example.
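The sets of Example 10.4 map directly onto set operations. The short Python sketch below (variable names are our own) computes the base set as the union of the read and write sets and shows that the two sets overlap.

```python
# Read and write sets of the Reservation transaction (Example 10.4).
read_set = {"FLIGHT.STSOLD", "FLIGHT.CAP"}
write_set = {"FLIGHT.STSOLD", "FC.FNO", "FC.DATE", "FC.CNAME", "FC.SPECIAL"}

base_set = read_set | write_set   # BS = RS ∪ WS
overlap = read_set & write_set    # items that are both read and written
```

FLIGHT.STSOLD appears in both sets, which is why the read set and write set need not be mutually exclusive.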

We have characterized transactions only on the basis of their read and write operations, without considering the insertion and deletion operations. We therefore base our discussion of transaction management concepts on static databases that do not grow or shrink. This restriction is made in the interest of simplicity. Dynamic databases have to deal with the problem of phantoms, which can be explained using the following example. Consider that transaction T1, during its execution, searches the FC table for the names of customers who have ordered a special meal. It gets a set of CNAME for customers who satisfy the search criteria. While T1 is executing, transaction T2 inserts new tuples into FC with the special meal request, and commits. If T1 were to re-issue the same search query later in its execution, it would get back a set of CNAME that is different from the original set it had retrieved. Thus, “phantom” tuples have appeared in the database. We do not discuss phantoms any further in this book; the topic is discussed at length by Eswaran et al. [1976] and Bernstein et al. [1987].
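The phantom scenario can be replayed with a few lines of Python. This is only a hand-ordered simulation: the FC relation is modeled as a list of dictionaries, the customer names are invented, and the interleaving of T1 and T2 is hard-coded rather than produced by a real scheduler.

```python
# Hypothetical contents of the FC relation.
fc = [
    {"FNO": "F100", "DATE": "2011-05-01", "CNAME": "Ann", "SPECIAL": "special meal"},
    {"FNO": "F100", "DATE": "2011-05-01", "CNAME": "Bob", "SPECIAL": None},
]


def customers_with_special_request(table):
    """The search issued by T1: names of customers with a special request."""
    return {row["CNAME"] for row in table if row["SPECIAL"] is not None}


first_read = customers_with_special_request(fc)    # T1 searches FC

# T2 inserts a new tuple with a special meal request and commits.
fc.append({"FNO": "F100", "DATE": "2011-05-01", "CNAME": "Carl", "SPECIAL": "special meal"})

second_read = customers_with_special_request(fc)   # T1 re-issues the same search
phantoms = second_read - first_read                # tuples that "appeared"
```

The difference between the two reads is exactly the set of tuples inserted by T2 in the meantime.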

We should also point out that the read and write operations to which we refer are abstract operations that do not have one-to-one correspondence to physical I/O primitives. One read in our characterization may translate into a number of primitive read operations to access the index structures and the physical data pages. The reader should treat each read and write as a language primitive rather than as an operating system primitive.

10.1.3 Formalization of the Transaction Concept

By now, the meaning of a transaction should be intuitively clear. To reason about transactions and about the correctness of the management algorithms, it is necessary to define the concept formally. We denote by O_ij(x) some operation O_j of transaction T_i that operates on a database entity x. Following the conventions adopted in the preceding section, O_ij ∈ {read, write}. Operations are assumed to be atomic (i.e., each is executed as an indivisible unit). We let OS_i denote the set of all operations in T_i (i.e., OS_i = ⋃_j O_ij). We denote by N_i the termination condition for T_i, where N_i ∈ {abort, commit}².

With this terminology we can define a transaction T_i as a partial ordering over its operations and the termination condition. A partial order P = {Σ, ≺} defines an ordering among the elements of Σ (called the domain) according to an irreflexive and transitive binary relation ≺ defined over Σ. In our case Σ consists of the operations and termination condition of a transaction, whereas ≺ indicates the execution order of these operations (which we will read as “precedes in execution order”). Formally, then, a transaction T_i is a partial order T_i = {Σ_i, ≺_i}, where

1. Σ_i = OS_i ∪ {N_i}.
2. For any two operations O_ij, O_ik ∈ OS_i, if O_ij = {R(x) or W(x)} and O_ik = W(x) for any data item x, then either O_ij ≺_i O_ik or O_ik ≺_i O_ij.
3. ∀O_ij ∈ OS_i, O_ij ≺_i N_i.

The first condition formally defines the domain as the set of read and write operations that make up the transaction, plus the termination condition, which may be either commit or abort. The second condition specifies the ordering relation between the conflicting read and write operations of the transaction, while the final condition indicates that the termination condition always follows all other operations.

There are two important points about this definition. First, the ordering relation ≺ is given and the definition does not attempt to construct it. The ordering relation is actually application dependent. Second, condition two indicates that the ordering between conflicting operations has to exist within ≺. Two operations, O_i(x) and O_j(x), are said to be in conflict if O_i = Write or O_j = Write (i.e., at least one of them is a Write and they access the same data item).
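The conflict rule is easy to state as a predicate. In the Python sketch below, the `Op` tuple and the function name `in_conflict` are illustrative choices, with operation kinds abbreviated to "R" and "W".

```python
from typing import NamedTuple


class Op(NamedTuple):
    kind: str   # "R" for read, "W" for write
    item: str   # name of the data item accessed


def in_conflict(a: Op, b: Op) -> bool:
    """Two operations conflict if they access the same data item
    and at least one of them is a write."""
    return a.item == b.item and "W" in (a.kind, b.kind)


read_write = in_conflict(Op("R", "x"), Op("W", "x"))    # same item, one write
read_read = in_conflict(Op("R", "x"), Op("R", "x"))     # two reads never conflict
write_other = in_conflict(Op("W", "x"), Op("W", "y"))   # different items
```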

Example 10.5. Consider a simple transaction T that consists of the following steps:

Read(x)
Read(y)
x ← x + y
Write(x)
Commit

The specification of this transaction according to the formal notation that we have introduced is as follows:

Σ = {R(x), R(y), W(x), C}
≺ = {(R(x),W(x)), (R(y),W(x)), (W(x),C), (R(x),C), (R(y),C)}

where (O_i, O_j) as an element of the ≺ relation indicates that O_i ≺ O_j.

² From now on, we use the abbreviations R, W, A and C for the Read, Write, Abort, and Commit operations, respectively.

Notice that the ordering relation specifies the relative ordering of all operations with respect to the termination condition. This is due to the third condition of transaction definition. Also note that we do not specify the ordering between every pair of operations. That is why it is a partial order.
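As a sanity check, the specification of Example 10.5 can be tested against the three conditions of the formal definition, together with irreflexivity and transitivity of ≺. The Python sketch below is one possible encoding; operations are represented by labels such as "R(x)", and the helper names are our own.

```python
from itertools import combinations


def item(op: str) -> str:
    """Data item of an operation label, e.g. 'W(x)' -> 'x'."""
    return op[2:-1]


def is_valid_transaction(sigma, prec, termination):
    """Check that (sigma, prec) is a transaction in the sense of the definition."""
    operations = sigma - {termination}                    # OS, given Σ = OS ∪ {N}
    irreflexive = all(a != b for a, b in prec)
    transitive = all(
        (a, d) in prec
        for a, b in prec
        for c, d in prec
        if b == c
    )
    # Condition 2: every pair of conflicting operations is ordered.
    conflicts_ordered = all(
        (a, b) in prec or (b, a) in prec
        for a, b in combinations(sorted(operations), 2)
        if item(a) == item(b) and "W" in (a[0], b[0])
    )
    # Condition 3: every operation precedes the termination condition.
    termination_last = all((op, termination) in prec for op in operations)
    return (termination in sigma and irreflexive and transitive
            and conflicts_ordered and termination_last)


sigma = {"R(x)", "R(y)", "W(x)", "C"}
prec = {("R(x)", "W(x)"), ("R(y)", "W(x)"), ("W(x)", "C"),
        ("R(x)", "C"), ("R(y)", "C")}

valid = is_valid_transaction(sigma, prec, "C")
# Dropping the order between the conflicting R(x) and W(x) violates condition 2.
broken = is_valid_transaction(sigma, prec - {("R(x)", "W(x)")}, "C")
```

R(x) and R(y) are left unordered with respect to each other, which the check accepts because two reads do not conflict.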

Example 10.6. The reservation transaction developed in Example 10.3 is more complex. Notice that there are two possible termination conditions, depending on the availability of seats. It might first seem that this is a contradiction of the definition of a transaction, which indicates that there can be only one termination condition. However, remember that a transaction is the execution of a program. It is clear that in any execution, only one of the two termination conditions can occur. Therefore, what exists is one transaction that aborts and another one that commits. Using this formal notation, the former can be specified as follows:

Σ = {R(STSOLD), R(CAP), A}
≺ = {(O_1,A), (O_2,A)}

and the latter can be specified as

Σ = {R(STSOLD), R(CAP), W(STSOLD), W(FNO), W(DATE), W(CNAME), W(SPECIAL), C}
≺ = {(O_1,O_3), (O_2,O_3), (O_1,O_4), (O_1,O_5), (O_1,O_6), (O_1,O_7), (O_2,O_4), (O_2,O_5), (O_2,O_6), (O_2,O_7), (O_1,C), (O_2,C), (O_3,C), (O_4,C), (O_5,C), (O_6,C), (O_7,C)}

where O_1 = R(STSOLD), O_2 = R(CAP), O_3 = W(STSOLD), O_4 = W(FNO), O_5 = W(DATE), O_6 = W(CNAME), and O_7 = W(SPECIAL).

One advantage of defining a transaction as a partial order is its correspondence to a directed acyclic graph (DAG). Thus a transaction can be specified as a DAG whose vertices are the operations of a transaction and whose arcs indicate the ordering relationship between a given pair of operations. This will be useful in discussing the concurrent execution of a number of transactions (Chapter 11) and in arguing about their correctness by means of graph-theoretic tools.

Example 10.7. The transaction discussed in Example 10.5 can be represented as a DAG as depicted in Figure 10.2. Note that we do not draw the arcs that are implied by transitivity even though we indicate them as elements of ≺.
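The arcs that are actually drawn can be derived mechanically by removing from ≺ every pair that is implied by transitivity. The Python sketch below does this for the transaction of Example 10.5 and then uses the standard `graphlib` module to produce one execution order compatible with the DAG. The function name `drawn_arcs` is our own, and the function assumes that its input relation is already transitively closed.

```python
from graphlib import TopologicalSorter

prec = {("R(x)", "W(x)"), ("R(y)", "W(x)"), ("W(x)", "C"),
        ("R(x)", "C"), ("R(y)", "C")}


def drawn_arcs(prec):
    """Keep only the pairs (a, b) with no operation m such that a ≺ m ≺ b."""
    nodes = {n for pair in prec for n in pair}
    return {
        (a, b)
        for (a, b) in prec
        if not any((a, m) in prec and (m, b) in prec for m in nodes)
    }


arcs = drawn_arcs(prec)

# graphlib expects a mapping from each node to the set of its predecessors.
predecessors = {}
for a, b in arcs:
    predecessors.setdefault(b, set()).add(a)

order = list(TopologicalSorter(predecessors).static_order())
```

Any topological order of the DAG is a total order that respects ≺; the two reads may appear in either order.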

In most cases we do not need to refer to the domain of the partial order separately from the ordering relation. Therefore, it is common to drop Σ from the transaction definition and use the name of the partial order to refer to both the domain and the name of the partial order. This is convenient since it allows us to specify the ordering of the operations of a transaction in a more straightforward manner by making use of their relative ordering in the transaction definition. For example, we can define the transaction of Example 10.5 as follows:

T = {R(x), R(y), W(x), C}

instead of the longer specification given before. We will therefore use the modified definition in this and subsequent chapters.

Fig. 10.2 DAG Representation of a Transaction (arcs: R(x) → W(x), R(y) → W(x), W(x) → C)

10.2 Properties of Transactions

The previous discussion clarifies the concept of a transaction. However, we have not yet provided any justification of our earlier claim that it is a unit of consistent and reliable computation. We do that in this section. The consistency and reliability aspects of transactions are due to four properties: (1) atomicity, (2) consistency, (3) isolation, and (4) durability. Together, these are commonly referred to as the ACID properties of transactions. They are not entirely independent of each other; usually there are dependencies among them as we will indicate below. We discuss each of these properties in the following sections.

10.2.1 Atomicity

Atomicity refers to the fact that a transaction is treated as a unit of operation. Therefore, either all the transaction’s actions are completed, or none of them are. This is also known as the “all-or-nothing property.” Notice that we have just extended the concept of atomicity from individual operations to the entire transaction. Atomicity requires that if the execution of a transaction is interrupted by any sort of failure, the DBMS will be responsible for determining what to do with the transaction upon recovery from the failure. There are, of course, two possible courses of action: it can either be terminated by completing the remaining actions, or it can be terminated by undoing all the actions that have already been executed.

One can generally talk about two types of failures. A transaction itself may fail due to input data errors, deadlocks, or other factors. In these cases either the transaction aborts itself, as we have seen in Example 10.3, or the DBMS may abort it while handling deadlocks, for example. Maintaining transaction atomicity in the presence of this type of failure is commonly called transaction recovery. The second type of failure is caused by system crashes, such as media failures, processor failures, communication link breakages, power outages, and so on. Ensuring transaction atomicity in the presence of system crashes is called crash recovery. An important difference between the two types of failures is that during some types of system crashes, the information in volatile storage may be lost or inaccessible. Both types of recovery are parts of the reliability issue, which we discuss in considerable detail in Chapter 12.
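The all-or-nothing behavior for the first type of failure can be illustrated with a toy in-memory store that remembers the old value of every item written by the active transaction. This is a deliberately simplified sketch: the class `TinyStore` and the helper `run` are invented for this example, the store is static (items are never created or deleted), and nothing here survives a system crash, so it says nothing about crash recovery.

```python
class TinyStore:
    """Toy in-memory store with all-or-nothing updates (illustration only)."""

    def __init__(self, data):
        self.data = dict(data)
        self._undo = []          # (item, old value) for each write of the active transaction

    def write(self, item, value):
        self._undo.append((item, self.data[item]))   # remember the old value
        self.data[item] = value

    def commit(self):
        self._undo.clear()       # the writes are kept; nothing left to undo

    def abort(self):
        while self._undo:        # restore old values in reverse order
            item, old = self._undo.pop()
            self.data[item] = old


def run(store, fail):
    """Sell one seat and charge the customer, optionally failing halfway."""
    try:
        store.write("STSOLD", store.data["STSOLD"] + 1)
        if fail:
            raise RuntimeError("simulated failure between the two writes")
        store.write("BAL", store.data["BAL"] - 100)
        store.commit()
    except RuntimeError:
        store.abort()


failed = TinyStore({"STSOLD": 10, "BAL": 500})
run(failed, fail=True)       # none of the writes survive

succeeded = TinyStore({"STSOLD": 10, "BAL": 500})
run(succeeded, fail=False)   # all of the writes survive
```

Completing the remaining actions instead of undoing the executed ones is the other course of action mentioned above; it is not modeled here.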

10.2.2 Consistency

The consistency of a transaction is simply its correctness. In other words, a transaction is a correct program that maps one consistent database state to another. Verifying that transactions are consistent is the concern of integrity enforcement, covered in Chapter 5. Ensuring transaction consistency as defined at the beginning of this chapter, on the other hand, is the objective of concurrency control mechanisms, which we discuss in Chapter 11.

There is an interesting classification of consistency that parallels our discussion above and is equally important. This classification groups databases into four levels of consistency [Gray et al., 1976]. In the following definition (which is taken verbatim from the original paper), dirty data refers to data values that have been updated by a transaction prior to its commitment. Then, based on the concept of dirty data, the four levels are defined as follows:

“Degree 3: Transaction T sees degree 3 consistency if:

1. T does not overwrite dirty data of other transactions.
2. T does not commit any writes until it completes all its writes [i.e., until the end of transaction (EOT)].
3. T does not read dirty data from other transactions.
4. Other transactions do not dirty any data read by T before T completes.

Degree 2: Transaction T sees degree 2 consistency if:

1. T does not overwrite dirty data of other transactions.
2. T does not commit any writes before EOT.
3. T does not read dirty data from other transactions.

Degree 1: Transaction T sees degree 1 consistency if:

1. T does not overwrite dirty data of other transactions.
2. T does not commit any writes before EOT.

Degree 0: Transaction T sees degree 0 consistency if:

1. T does not overwrite dirty data of other transactions.”

Of course, it is true that a higher degree of consistency encompasses all the lower degrees. The point in defining multiple levels of consistency is to provide application programmers the flexibility to define transactions that operate at different levels. Consequently, while some transactions operate at Degree 3 consistency level, others may operate at lower levels and may see, for example, dirty data.
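The nesting of the four degrees can be checked mechanically. The following sketch encodes each degree as the set of conditions it imposes; the constant names and the helper `degree_of` are assumptions made for this illustration and do not come from [Gray et al., 1976].

```python
# Short labels for the four conditions in the quoted definition.
NO_DIRTY_OVERWRITE = "T does not overwrite dirty data of other transactions"
NO_COMMIT_BEFORE_EOT = "T does not commit any writes before EOT"
NO_DIRTY_READ = "T does not read dirty data from other transactions"
NO_DIRTYING_OF_READS = "others do not dirty data read by T before T completes"

DEGREES = {
    0: {NO_DIRTY_OVERWRITE},
    1: {NO_DIRTY_OVERWRITE, NO_COMMIT_BEFORE_EOT},
    2: {NO_DIRTY_OVERWRITE, NO_COMMIT_BEFORE_EOT, NO_DIRTY_READ},
    3: {NO_DIRTY_OVERWRITE, NO_COMMIT_BEFORE_EOT, NO_DIRTY_READ,
        NO_DIRTYING_OF_READS},
}


def degree_of(satisfied):
    """Return the highest degree whose conditions all hold, or None."""
    best = None
    for degree in sorted(DEGREES):
        if DEGREES[degree] <= satisfied:  # subset test
            best = degree
    return best


# Each degree's condition set strictly contains that of the degree below it.
nested = all(DEGREES[d] < DEGREES[d + 1] for d in range(3))
```

Here `nested` evaluates to `True`, which is the sense in which a higher degree encompasses all the lower ones.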

10.2.3 Isolation

Isolation is the property of transactions that requires each transaction to see a consistent database at all times. In other words, an executing transaction cannot reveal its results to other concurrent transactions before its commitment.

There are a number of reasons for insisting on isolation. One has to do with maintaining the interconsistency of transactions. If two concurrent transactions access a data item that is being updated by one of them, it is not possible to guarantee that the second will read the correct value.

Example 10.8. Consider the following two concurrent transactions (T1 and T2), both of which access data item x. Assume that the value of x before they start executing is 50.

T1: Read(x)          T2: Read(x)
    x ← x + 1            x ← x + 1
    Write(x)             Write(x)
    Commit               Commit

The following is one possible sequence of execution of the actions of these transactions:

T1: Read(x)
T1: x ← x + 1
T1: Write(x)
T1: Commit
T2: Read(x)
T2: x ← x + 1
T2: Write(x)
T2: Commit

In this case, there are no problems; transactions T1 and T2 are executed one after the other and transaction T2 reads 51 as the value of x. Note that if, instead, T2 executes before T1, T1 reads 51 as the value of x. So, if T1 and T2 are executed one after the other (regardless of the order), the second transaction will read 51 as the value of x and x will have 52 as its value at the end of execution of these two transactions. However, since transactions are executing concurrently, the following execution sequence is also possible:

T1: Read(x)
T1: x ← x + 1
T2: Read(x)
T1: Write(x)
T2: x ← x + 1
T2: Write(x)
T1: Commit
T2: Commit

In this case, transaction T2 reads 50 as the value of x. This is incorrect since T2 reads x while its value is being changed from 50 to 51. Furthermore, the value of x is 51 at the end of execution of T1 and T2 since T2’s Write will overwrite T1’s Write.
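The two execution sequences of Example 10.8 can be replayed with a small simulation. This is a sketch under simplifying assumptions: there is a single data item x, each transaction keeps a private copy of the value it read, and Commit has no effect of its own. The function name `run_schedule` and the action labels are hypothetical.

```python
def run_schedule(schedule, x=50):
    """Execute (transaction, action) steps against one shared item x.

    Returns the final value of x and the value each transaction read.
    """
    local = {}  # each transaction's private working copy of x
    reads = {}  # the value each transaction obtained from Read(x)
    for txn, action in schedule:
        if action == "read":
            local[txn] = x
            reads[txn] = x
        elif action == "inc":
            local[txn] += 1
        elif action == "write":
            x = local[txn]
        # "commit" changes nothing in this simplified model
    return x, reads


SERIAL = [
    ("T1", "read"), ("T1", "inc"), ("T1", "write"), ("T1", "commit"),
    ("T2", "read"), ("T2", "inc"), ("T2", "write"), ("T2", "commit"),
]

INTERLEAVED = [
    ("T1", "read"), ("T1", "inc"), ("T2", "read"), ("T1", "write"),
    ("T2", "inc"), ("T2", "write"), ("T1", "commit"), ("T2", "commit"),
]

serial_result = run_schedule(SERIAL)
interleaved_result = run_schedule(INTERLEAVED)
```

In the serial schedule T2 reads 51 and x ends at 52. In the interleaved schedule T2 reads 50 and x ends at 51, so the increment made by T1 is lost.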

Ensuring isolation by not permitting incomplete results to be seen by other transactions, as the previous example shows, solves the lost updates problem. This type of isolation has been called cursor stability. In the example above, the second execution sequence resulted in the effects of T1 being lost (see footnote 3). A second reason for isolation is cascading aborts. If a transaction permits others to see its incomplete results before committing and then decides to abort, any transaction that has read its incomplete values will have to abort as well. This chain can easily grow and impose considerable overhead on the DBMS.

It is possible to treat consistency levels discussed in the preceding section from the perspective of the isolation property (thus demonstrating the dependence between isolation and consistency). As we move up the hierarchy of consistency levels, there is more isolation among transactions. Degree 0 provides very little isolation other than preventing lost updates. However, since transactions commit write operations before the entire transaction is completed (and committed), if an abort occurs after some writes are committed to disk, the updates to data items that have been committed will need to be undone. Since at this level other transactions are allowed to read the dirty data, it may be necessary to abort them as well. Degree 2 consistency avoids cascading aborts. Degree 3 provides full isolation which forces one of the conflicting transactions to wait until the other one terminates. Such execution sequences are called strict and will be discussed further in the next chapter. It is obvious that the issue of isolation is directly related to database consistency and is therefore the topic of concurrency control.

3 A more dramatic example may be to consider x to be your bank account and T1 a transaction that executes as a result of your depositing money into your account. Assume that T2 is a transaction that is executing as a result of your spouse withdrawing money from the account at another branch. If the same problem as described in Example 10.8 occurs and the results of T1 are lost, you will be terribly unhappy. If, on the other hand, the results of T2 are lost, the bank will be furious. A similar argument can be made for the reservation transaction example we have been considering.


ANSI, as part of the SQL2 (also known as SQL-92) standard specification, has defined a set of isolation levels [ANSI, 1992]. SQL isolation levels are defined on the basis of what ANSI calls phenomena, which are situations that can occur if proper isolation is not maintained. Three phenomena are specified:

Dirty Read: As defined earlier, dirty data refer to data items whose values have been modified by a transaction that has not yet committed. Consider the case where transaction T1 modifies a data item value, which is then read by another transaction T2 before T1 performs a Commit or Abort. In case T1 aborts, T2 has read a value that never existed in the database. A precise specification of this phenomenon, where subscripts indicate the transaction identifiers, is as follows (see footnote 4):

..., W1(x), ..., R2(x), ..., C1 (or A1), ..., C2 (or A2)

or

..., W1(x), ..., R2(x), ..., C2 (or A2), ..., C1 (or A1)

Non-repeatable or Fuzzy read: Transaction T1 reads a data item value. Another transaction T2 then modifies or deletes that data item and commits. If T1 then attempts to reread the data item, it either reads a different value or it can’t find the data item at all; thus two reads within the same transaction T1 return different results. A precise specification of this phenomenon is as follows:

..., R1(x), ..., W2(x), ..., C1 (or A1), ..., C2 (or A2)

or

..., R1(x), ..., W2(x), ..., C2 (or A2), ..., C1 (or A1)

Phantom: The phantom condition that was defined earlier occurs when T1 does a search with a predicate and T2 inserts new tuples that satisfy the predicate. Again, the precise specification of this phenomenon is (where P is the search predicate)

..., R1(P), ..., W2(y in P), ..., C1 (or A1), ..., C2 (or A2)

or

..., R1(P), ..., W2(y in P), ..., C2 (or A2), ..., C1 (or A1)

4 The precise specifications of these phenomena are due to Berenson et al. [1995] and correspond to their loose interpretations which they indicate are the more appropriate interpretations.
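The history patterns for dirty reads and fuzzy reads can be turned into a simple checker. The sketch below assumes a history is a list of `(operation, transaction, item)` tuples with operations `"R"`, `"W"`, `"C"`, and `"A"`; this encoding and the function names are hypothetical. Phantoms are left out because they require predicates rather than individual data items.

```python
def find_phenomenon(history, first, second):
    """Find pairs where transaction i performs `first` on an item and a
    different transaction j later performs `second` on the same item
    while i is still active (it has neither committed nor aborted)."""
    hits = []
    for pos, (op, i, item) in enumerate(history):
        if op != first:
            continue
        for op2, j, item2 in history[pos + 1:]:
            if j == i and op2 in ("C", "A"):
                break  # i has terminated; later operations do not match
            if op2 == second and j != i and item2 == item:
                hits.append((i, j, item))
    return hits


def dirty_reads(history):
    # ..., Wi(x), ..., Rj(x), ... before Ti commits or aborts
    return find_phenomenon(history, "W", "R")


def fuzzy_reads(history):
    # ..., Ri(x), ..., Wj(x), ... before Ti commits or aborts
    return find_phenomenon(history, "R", "W")


H_DIRTY = [("W", 1, "x"), ("R", 2, "x"), ("C", 1, None), ("C", 2, None)]
H_CLEAN = [("W", 1, "x"), ("C", 1, None), ("R", 2, "x"), ("C", 2, None)]
H_FUZZY = [("R", 1, "x"), ("W", 2, "x"), ("C", 2, None), ("C", 1, None)]
```

For example, `dirty_reads(H_DIRTY)` reports the pair (T1, T2) on x, while `dirty_reads(H_CLEAN)` reports nothing because T1 commits before T2 reads.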


Based on these phenomena, the isolation levels are defined as follows. The objective of defining multiple isolation levels is the same as defining multiple consistency levels.

Read uncommitted: For transactions operating at this level all three phenomena are possible.

Read committed: Fuzzy reads and phantoms are possible, but dirty reads are not.

Repeatable read: Only phantoms are possible.

Anomaly serializable: None of the phenomena are possible.

The ANSI SQL standard uses the term “serializable” rather than “anomaly serializable.” However, a serializable isolation level, as precisely defined in the next chapter, cannot be defined solely in terms of the three phenomena identified above; thus this isolation level is called “anomaly serializable” [Berenson et al., 1995]. The relationship between SQL isolation levels and the four levels of consistency defined in the previous section is also discussed in [Berenson et al., 1995].
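The four isolation levels amount to a small lookup table from level to the phenomena it still permits. A sketch of that table follows; the dictionary `POSSIBLE` and the helper `weakest_level_preventing` are names introduced only for this illustration.

```python
# Phenomena that remain possible at each level, weakest level first.
POSSIBLE = {
    "read uncommitted": {"dirty read", "fuzzy read", "phantom"},
    "read committed": {"fuzzy read", "phantom"},
    "repeatable read": {"phantom"},
    "anomaly serializable": set(),
}


def weakest_level_preventing(unwanted):
    """Return the weakest level at which none of `unwanted` can occur."""
    unwanted = set(unwanted)
    for level, possible in POSSIBLE.items():  # dicts keep insertion order
        if not (possible & unwanted):
            return level
    return None  # unreachable: the last level permits no phenomenon
```

An application that only needs to avoid dirty reads would get `"read committed"` from this helper, while one that must avoid phantoms would get `"anomaly serializable"`.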

One non-serializable isolation level that is commonly implemented in commercial products is snapshot isolation [Berenson et al., 1995]. Snapshot isolation provides repeatable reads, but not serializable isolation. Each transaction “sees” a snapshot of the database when it starts and its reads and writes are performed on this snapshot – thus the writes are not visible to other transactions and it does not see the writes of other transactions.

10.2.4 Durability

Durability refers to that property of transactions which ensures that once a transaction commits, its results are permanent and cannot be erased from the database. Therefore, the DBMS ensures that the results of a transaction will survive subsequent system failures. This is exactly why in Example 10.2 we insisted that the transaction commit before it informs the user of its successful completion. The durability property brings forth the issue of database recovery, that is, how to recover the database to a consistent state where all the committed actions are reflected. This issue is discussed further in Chapter 12.

10.3 Types of Transactions

A number of transaction models have been proposed in literature, each being appropriate for a class of applications. The fundamental problem of providing “ACID”ity usually remains, but the algorithms and techniques that are used to address it may be considerably different. In some cases, various aspects of ACID requirements are relaxed, removing some problems and adding new ones. In this section we provide an overview of some of the transaction models that have been proposed and then identify our focus in Chapters 11 and 12.

Transactions have been classified according to a number of criteria. One criterion is the duration of transactions. Accordingly, transactions may be classified as online or batch [Gray, 1987]. These two classes are also called short-life and long-life transactions, respectively. Online transactions are characterized by very short execution/response times (typically, on the order of a couple of seconds) and by access to a relatively small portion of the database. This class of transactions probably covers a large majority of current transaction applications. Examples include banking transactions and airline reservation transactions.

Batch transactions, on the other hand, take longer to execute (response time being measured in minutes, hours, or even days) and access a larger portion of the database. Typical applications that might require batch transactions are design databases, statistical applications, report generation, complex queries, and image processing. Along this dimension, one can also define a conversational transaction, which is executed by interacting with the user issuing it.

Another classification that has been proposed is with respect to the organization of the read and write actions. The examples that we have considered so far intermix their read and write actions without any specific ordering. We call this type of transactions general. If the transactions are restricted so that all the read actions are performed before any write action, the transaction is called a two-step transaction [Papadimitriou, 1979]. Similarly, if the transaction is restricted so that a data item has to be read before it can be updated (written), the corresponding class is called restricted (or read-before-write) [Stearns et al., 1976]. If a transaction is both two-step and restricted, it is called a restricted two-step transaction. Finally, there is the action model of transactions [Kung and Papadimitriou, 1979], which consists of the restricted class with the further restriction that each 〈read, write〉 pair be executed atomically. This classification is shown in Figure 10.3, where the generality increases upward.

Fig. 10.3 Various Transaction Models. The figure places the general model at the top, the two-step and restricted models below it, and the restricted two-step and action models at the bottom. (From: C.H. Papadimitriou and P.C. Kanellakis, ON CONCURRENCY CONTROL BY MULTIPLE VERSIONS. ACM Trans. Database Sys.; December 1984; 9(1): 89–99.)

Example 10.9. The following are some examples of the above-mentioned models. We omit the declaration and commit commands.

General:

T1: {R(x), R(y), W(y), R(z), W(x), W(z), W(w), C}

Two-step:

T2: {R(x), R(y), R(z), W(x), W(z), W(y), W(w), C}

Restricted:

T3: {R(x), R(y), W(y), R(z), W(x), W(z), R(w), W(w), C}

Note that T3 has to read w before writing.

Two-step restricted:

T4: {R(x), R(y), R(z), R(w), W(x), W(z), W(y), W(w), C}

Action:

T5: {[R(x), W(x)], [R(y), W(y)], [R(z), W(z)], [R(w), W(w)], C}

Note that each pair of actions within square brackets is executed atomically.
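The two-step and restricted conditions depend only on the order of reads and writes, so the transactions T1 to T4 of Example 10.9 can be classified mechanically. The sketch below assumes a transaction is written as a space-separated string such as `"R(x) W(x)"`; the helper names are hypothetical. The action model cannot be recognized this way, because the atomicity of each read/write pair is not visible in the ordering alone.

```python
def parse(text):
    """Turn 'R(x) W(y)' into [('R', 'x'), ('W', 'y')]."""
    return [(token[0], token[2:-1]) for token in text.split()]


def is_two_step(ops):
    """True if all reads are performed before any write."""
    seen_write = False
    for kind, _item in ops:
        if kind == "W":
            seen_write = True
        elif kind == "R" and seen_write:
            return False
    return True


def is_restricted(ops):
    """True if every data item is read before it is written."""
    read_items = set()
    for kind, item in ops:
        if kind == "R":
            read_items.add(item)
        elif kind == "W" and item not in read_items:
            return False
    return True


def classify(ops):
    two_step, restricted = is_two_step(ops), is_restricted(ops)
    if two_step and restricted:
        return "restricted two-step"
    if two_step:
        return "two-step"
    if restricted:
        return "restricted"
    return "general"


T1 = parse("R(x) R(y) W(y) R(z) W(x) W(z) W(w)")
T2 = parse("R(x) R(y) R(z) W(x) W(z) W(y) W(w)")
T3 = parse("R(x) R(y) W(y) R(z) W(x) W(z) R(w) W(w)")
T4 = parse("R(x) R(y) R(z) R(w) W(x) W(z) W(y) W(w)")
```

Applying `classify` to T1 through T4 yields general, two-step, restricted, and restricted two-step, respectively, matching the labels in the example.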

Transactions can also be classified according to their structure. We distinguish four broad categories in increasing complexity: flat transactions, closed nested transactions as in [Moss, 1985], open nested transactions such as sagas [Garcia-Molina and Salem, 1987], and workflow models which, in some cases, are combinations of various nested forms. This classification is arguably the most dominant one and we will discuss it at some length.

10.3.1 Flat Transactions

Flat transactions have a single start point (Begin transaction) and a single termination point (End transaction). All our examples in this section are of this type. Most of the transaction management work in databases has concentrated on flat transactions. This model will also be our main focus in this book, even though we discuss management techniques for other transaction types, where appropriate.


10.3.2 Nested Transactions

An alternative transaction model is to permit a transaction to include other transactions with their own begin and commit points. Such transactions are called nested transactions. The transactions that are embedded in another one are usually called subtransactions.

Example 10.10. Let us extend the reservation transaction of Example 10.2. Most travel agents will make reservations for hotels and car rentals in addition to the flights. If one chooses to specify all of this as one transaction, the reservation transaction would have the following structure:

Begin transaction Reservation
begin
    Begin transaction Airline
        . . .
    end. {Airline}
    Begin transaction Hotel
        . . .
    end. {Hotel}
    Begin transaction Car
        . . .
    end. {Car}
end.

Nested transactions have received considerable interest as a more generalized transaction concept. The level of nesting is generally open, allowing subtransactions themselves to have nested transactions. This generality is necessary to support application areas where transactions are more complex than in traditional data processing.

In this taxonomy, we differentiate between closed and open nesting because of their termination characteristics. Closed nested transactions [Moss, 1985] commit in a bottom-up fashion through the root. Thus, a nested subtransaction begins after its parent and finishes before it, and the commitment of the subtransactions is conditional upon the commitment of the parent. The semantics of these transactions enforce atomicity at the top-most level. Open nesting relaxes the top-level atomicity restriction of closed nested transactions. Therefore, an open nested transaction allows its partial results to be observed outside the transaction. Sagas [Garcia-Molina and Salem, 1987; Garcia-Molina et al., 1990] and split transactions [Pu, 1988] are examples of open nesting.
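The bottom-up commitment of closed nested transactions can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of [Moss, 1985]: the database is an in-memory dictionary, each transaction buffers its writes, a committing subtransaction hands its writes to its parent, and only the root makes them visible. The class name `NestedTransaction` and its methods are hypothetical.

```python
class NestedTransaction:
    """Closed nesting: updates reach the database only through the root."""

    def __init__(self, db, parent=None):
        self.db = db
        self.parent = parent
        self.writes = {}  # updates not yet passed upward
        self.state = "active"

    def begin_child(self):
        return NestedTransaction(self.db, parent=self)

    def read(self, key):
        # Look in this transaction first, then its ancestors, then the database.
        txn = self
        while txn is not None:
            if key in txn.writes:
                return txn.writes[key]
            txn = txn.parent
        return self.db[key]

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        if self.parent is None:
            # Root commit: results become visible outside the transaction.
            self.db.update(self.writes)
        else:
            # Subtransaction commit is conditional on the parent's commit.
            self.parent.writes.update(self.writes)
        self.state = "committed"

    def abort(self):
        # Discards own updates and those inherited from committed children.
        self.writes.clear()
        self.state = "aborted"
```

With the reservation transaction of Example 10.10 as the root, a committed Airline subtransaction changes nothing in the database until the root commits, and an aborted Hotel subtransaction can be discarded without affecting the Airline updates. An open nested variant would instead apply a subtransaction's writes to the database at its own commit.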

A saga is a “sequence of transactions that can be interleaved with other transactions” [Garcia-Molina and Salem, 1987]. The DBMS guarantees that either all the transactions in a saga are successfully completed or compensating transactions [Garcia-Molina, 1983; Korth et al., 1990] are run to recover from a partial execution. A compensating transaction effectively does the inverse of the transaction that it is associated with. For example, if the transaction adds $100 to a bank account, its compensating transaction deducts $100 from the same bank account. If a transaction is viewed as a function that maps the old database state to a new database state, its compensating transaction is the inverse of that function.

Two properties of sagas are: (1) only two levels of nesting are allowed, and (2) at the outer level, the system does not support full atomicity. Therefore, a saga differs from a closed nested transaction in that its level structure is more restricted (only 2) and that it is open (the partial results of component transactions or sub-sagas are visible to the outside). Furthermore, the transactions that make up a saga have to be executed sequentially.
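A saga with compensating transactions can be sketched in a few lines. The sketch assumes that each component transaction is itself atomic (a failing one leaves no partial effects) and that compensations are run in reverse order of commitment; the reverse order is an assumption of this illustration, and the names `run_saga`, `deposit`, and `undo_deposit` are hypothetical.

```python
def run_saga(steps, db):
    """Run (transaction, compensation) pairs sequentially.

    Every completed transaction's effects are immediately visible in db.
    If a transaction raises, the compensations of the transactions that
    already completed are executed, newest first.
    """
    completed = []
    for transaction, compensation in steps:
        try:
            transaction(db)
        except Exception:
            for compensate in reversed(completed):
                compensate(db)
            return False
        completed.append(compensation)
    return True


def deposit(db):
    db["balance"] += 100


def undo_deposit(db):
    # Compensating transaction: the inverse of deposit.
    db["balance"] -= 100


def failing_step(db):
    raise RuntimeError("simulated failure of a component transaction")


def no_compensation(db):
    pass
```

If `deposit` succeeds and `failing_step` then fails, `undo_deposit` runs and the balance returns to its original value. Between the deposit and its compensation, however, other transactions could have observed the increased balance, which is exactly the relaxation of atomicity at the outer level described above.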

The saga concept is extended and placed within a more general model that deals with long-lived transactions and with activities that consist of multiple steps [Garcia-Molina et al., 1990]. The fundamental concept of the model is that of a module that captures code segments, each of which accomplishes a given task and accesses a database in the process. The modules are modeled (at some level) as sub-sagas that communicate with each other via messages over ports. The transactions that make up a saga can be executed in parallel. The model is multi-layer where each subsequent layer adds a level of abstraction.

The advantages of nested transactions are the following. First, they provide a higher level of concurrency among transactions. Since a transaction consists of a number of other transactions, more concurrency is possible within a single transaction. For example, if the reservation transaction of Example 10.10 is implemented as a flat transaction, it may not be possible to access records about a specific flight concurrently. In other words, if one travel agent issues the reservation transaction for a given flight, any concurrent transaction that wishes to access the same flight data will have to wait until the termination of the first, which includes the hotel and car reservation activities in addition to flight reservation. However, a nested implementation will permit the second transaction to access the flight data as soon as the Airline subtransaction of the first reservation transaction is completed. In other words, it may be possible to perform a finer level of synchronization among concurrent transactions.

A second argument in favor of nested transactions is related to recovery. It is possible to recover independently from failures of each subtransaction. This limits the damage to a smaller part of the transaction, making it less costly to recover. In a flat transaction, if any operation fails, the entire transaction has to be aborted and restarted, whereas in a nested transaction, if an operation fails, only the subtransaction containing that operation needs to be aborted and restarted.

Finally, it is possible to create new transactions from existing ones simply by inserting the old one inside the new one as a subtransaction.

10.3.3 Workflows

Flat transactions model relatively simple and short activities very well. However, they are less appropriate for modeling longer and more elaborate activities. That is the reason for the development of the various nested transaction models discussed above. It has been argued that these extensions are not sufficiently powerful to model business activities: “after several decades of data processing, we have learned that we have not won the battle of modeling and automating complex enterprises” [Medina-Mora et al., 1993]. To meet these needs, more complex transaction models which are combinations of open and nested transactions have been proposed. There are well-justified arguments for not calling these transactions, since they hardly follow any of the ACID properties; a more appropriate name that has been proposed is a workflow [Dogac et al., 1998b; Georgakopoulos et al., 1995].

The term “workflow,” unfortunately, does not have a clear and uniformly accepted meaning. A working definition is that a workflow is “a collection of tasks organized to accomplish some business process” [Georgakopoulos et al., 1995]. This definition, however, leaves a lot undefined. This is perhaps unavoidable given the very different contexts where this term is used. Three types of workflows are identified [Georgakopoulos et al., 1995]:

1. Human-oriented workflows, which involve humans in performing the tasks. The system support is provided to facilitate collaboration and coordination among humans, but it is the humans themselves who are ultimately responsible for the consistency of the actions.

2. System-oriented workflows are those that consist of computation-intensive and specialized tasks that can be executed by a computer. The system support in this case is substantial and involves concurrency control and recovery, automatic task execution, notification, etc.

3. Transactional workflows range in between human-oriented and system-oriented workflows and borrow characteristics from both. They involve “coordinated execution of multiple tasks that (a) may involve humans, (b) require access to HAD [heterogeneous, autonomous, and/or distributed] systems, and (c) support selective use of transactional properties [i.e., ACID properties] for individual tasks or entire workflows” [Georgakopoulos et al., 1995]. Among the features of transactional workflows, the selective use of transactional properties is particularly important as it characterizes possible relaxations of ACID properties.

In this book, our primary interest is with transactional workflows. There have been many transactional workflow proposals [Elmagarmid et al., 1990; Nodine and Zdonik, 1990; Buchmann et al., 1982; Dayal et al., 1991; Hsu, 1993], and they differ in a number of ways. The common point among them is that a workflow is defined as an activity consisting of a set of tasks with well-defined precedence relationships among them.

Example 10.11. Let us further extend the reservation transaction of Example 10.3. The entire reservation activity consists of the following tasks and involves the following data:

• Customer request is obtained (task T1) and Customer Database is accessed to obtain customer information, preferences, etc.;

• Airline reservation is performed (T2) by accessing the Flight Database;

• Hotel reservation is performed (T3), which may involve sending a message to the hotel involved;

• Auto reservation is performed (T4), which may also involve communication with the car rental company;

• Bill is generated (T5) and the billing info is recorded in the billing database.

Figure 10.4 depicts this workflow where there is a serial dependency of T2 on T1, and T3, T4 on T2; however, T3 and T4 (hotel and car reservations) are performed in parallel and T5 waits until their completion.

Fig. 10.4 Example Workflow [Figure: tasks T1 through T5, drawn together with database symbols labeled Customer Database.]
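The precedence relationships of Example 10.11 form a directed acyclic graph, so a valid execution order can be derived with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the dictionary `DEPENDS_ON` and the function `execution_stages` are names chosen for this illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must wait for (Example 10.11).
DEPENDS_ON = {
    "T1": set(),
    "T2": {"T1"},
    "T3": {"T2"},
    "T4": {"T2"},
    "T5": {"T3", "T4"},
}


def execution_stages(depends_on):
    """Group tasks into stages; tasks within one stage may run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose predecessors are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(DEPENDS_ON)
```

The result groups T3 and T4 into the same stage, reflecting that the hotel and car reservations can proceed in parallel once the airline reservation is done, with T5 in a final stage of its own.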

A number of workflow models go beyond this basic model by both defining more precisely what tasks can be and by allocating different relationships among the tasks. In the following, we define one model that is similar to the models of Buchmann et al. [1982] and Dayal et al. [1991].

A workflow is modeled as an activity with open nesting semantics in that it permits partial results to be visible outside the activity boundaries. Thus, tasks that make up the activity are allowed to commit individually. Tasks may be other activities (with the same open transaction semantics) or closed nested transactions that make their results visible to the entire system when they commit. Even though an activity can have both other activities and closed nested transactions as its components, a closed nested transaction task can only be composed of other closed nested transactions (i.e., once closed nesting semantics begins, it is maintained for all components).

An activity commits when its components are ready to commit. However, the components commit individually, without waiting for the root activity to commit. This raises problems in dealing with aborts since when an activity aborts, all of its components should be aborted. The problem is dealing with the components that have already committed. Therefore, compensating transactions are defined for the components of an activity. Thus, if a component has already committed when an activity aborts, the corresponding compensating transaction is executed to “undo” its effects.

Some components of an activity may be marked as vital. When a vital component aborts, its parent must also abort. If a non-vital component of a workflow aborts, the workflow may continue executing; it always aborts, on the other hand, when one of its vital components aborts. For example, in the reservation workflow of Example 10.11, T2 (airline reservation) and T3 (hotel reservation) may be declared as vital so that if an airline reservation or a hotel reservation cannot be made, the workflow aborts and the entire trip is canceled. However, if a car reservation cannot be committed, the workflow can still successfully terminate.

It is possible to define contingency tasks that are invoked if their counterparts fail. For example, in the Reservation example presented earlier, one can specify that the contingency to making a reservation at Hilton is to make a reservation at Sheraton. Thus, if the hotel reservation component for Hilton fails, the Sheraton alternative is tried rather than aborting the task and the entire workflow.
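The interplay of vital components, compensating transactions, and contingency tasks can be sketched as a small executor. The sketch makes several simplifying assumptions: tasks run sequentially (ignoring the parallelism of T3 and T4), each task is a callable returning `True` on commit and `False` on abort, and compensations run in reverse order. The task dictionary layout and the name `run_workflow` are hypothetical.

```python
def run_workflow(tasks):
    """Run tasks in order and return (status, outcome per task).

    Each task is a dict with keys 'name', 'run', 'compensate', 'vital',
    and optionally 'contingency', which is tried when 'run' fails.
    """
    committed = []
    outcome = {}
    for task in tasks:
        succeeded = task["run"]()
        if not succeeded and task.get("contingency") is not None:
            succeeded = task["contingency"]()  # e.g. Sheraton instead of Hilton
        outcome[task["name"]] = succeeded
        if succeeded:
            committed.append(task)
        elif task["vital"]:
            # A vital component aborted: undo the components that committed.
            for done in reversed(committed):
                done["compensate"]()
            return "aborted", outcome
        # A failed non-vital component is simply skipped.
    return "committed", outcome
```

With the airline and hotel tasks marked vital and the car task non-vital, a failed car reservation leaves the workflow committed, while a failed hotel reservation without a working contingency triggers the compensation of the airline reservation.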

10.4 Architecture Revisited

With the introduction of the transaction concept, we need to revisit the architectural model introduced in Chapter 1. We do not need to revise the model but simply need to expand the role of the distributed execution monitor.

The distributed execution monitor consists of two modules: a transaction manager (TM) and a scheduler (SC). The transaction manager is responsible for coordinating the execution of the database operations on behalf of an application. The scheduler, on the other hand, is responsible for the implementation of a specific concurrency control algorithm for synchronizing access to the database.

A third component that participates in the management of distributed transactions is the local recovery manager (LRM) that exists at each site. Its function is to implement the local procedures by which the local database can be recovered to a consistent state following a failure.

Each transaction originates at one site, which we will call its originating site. The execution of the database operations of a transaction is coordinated by the TM at that transaction’s originating site.

The transaction managers implement an interface for the application programs which consists of five commands: begin transaction, read, write, commit, and abort. The processing of each of these commands in a non-replicated distributed DBMS is discussed below at an abstract level. For simplicity, we ignore the scheduling of concurrent transactions as well as the details of how data is physically retrieved by the data processor. These assumptions permit us to concentrate on the interface to the TM. The details are presented in Chapters 11 and 12, while the execution of these commands in a replicated distributed database is discussed in Chapter 13.

1. Begin transaction. This is an indicator to the TM that a new transaction is starting. The TM does some bookkeeping, such as recording the transaction’s name, the originating application, and so on, in coordination with the data processor.

2. Read. If the data item to be read is stored locally, its value is read and returned to the transaction. Otherwise, the TM finds where the data item is stored and requests its value to be returned (after appropriate concurrency control measures are taken).

3. Write. If the data item is stored locally, its value is updated (in coordination with the data processor). Otherwise, the TM finds where the data item is located and requests the update to be carried out at that site (after appropriate concurrency control measures are taken).

4. Commit. The TM coordinates the sites involved in updating data items on behalf of this transaction so that the updates are made permanent at every site.

5. Abort. The TM makes sure that no effects of the transaction are reflected in any of the databases at the sites where it updated data items.

In providing these services, a TM can communicate with SCs and data processors at the same or at different sites. This arrangement is depicted in Figure 10.5.
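The five-command interface can be sketched as a single class. This is an illustration under strong assumptions and not the architecture of any particular system: sites are in-memory dictionaries in one process, there is no scheduler and hence no concurrency control, commit involves no commit protocol, and abort is implemented by restoring before-images. The names `Site`, `TransactionManager`, and `_locate` are hypothetical.

```python
class Site:
    """Stands in for a data processor together with its local database."""

    def __init__(self, data):
        self.data = dict(data)


class TransactionManager:
    def __init__(self, local_name, sites):
        self.local_name = local_name  # name of the originating site
        self.sites = sites            # site name -> Site
        self.active = {}              # transaction id -> list of before-images

    def _locate(self, item):
        # Prefer the local site; otherwise find where the item is stored.
        if item in self.sites[self.local_name].data:
            return self.sites[self.local_name]
        for site in self.sites.values():
            if item in site.data:
                return site
        raise KeyError(item)

    def begin_transaction(self, txn):
        self.active[txn] = []  # bookkeeping for the new transaction

    def read(self, txn, item):
        return self._locate(item).data[item]

    def write(self, txn, item, value):
        site = self._locate(item)
        self.active[txn].append((site, item, site.data[item]))  # before-image
        site.data[item] = value

    def commit(self, txn):
        # Updates are already in place; forgetting the before-images
        # makes them permanent in this simplified model.
        del self.active[txn]

    def abort(self, txn):
        # Restore before-images, newest first, at every site that was updated.
        for site, item, old_value in reversed(self.active.pop(txn)):
            site.data[item] = old_value
```

A transaction that originates at site A and writes an item stored at site B has its update applied at B, and an abort restores the previous value there. A realistic TM would route each operation through a scheduler and coordinate commit across sites, as discussed in Chapters 11 and 12.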

As we indicated in Chapter 1, the architectural model that we have described is only an abstraction that serves a pedagogical purpose. It enables the separation of many of the transaction management issues and their independent and isolated discussion. In Chapter 11 we focus on the interface between a TM and an SC and between an SC and a data processor, in addition to the scheduling algorithms. In Chapter 12 we consider the execution strategies for the commit and abort commands in a distributed environment, in addition to the recovery algorithms that need to be implemented for the recovery manager. In Chapter 13, we extend this discussion to the case of replicated databases. We should point out that the computational model that we described here is not unique. Other models have been proposed such as, for example, using a private workspace for each transaction.

10.5 Conclusion

In this chapter we introduced the concept of a transaction as a unit of consistent and reliable access to the database. The properties of transactions indicate that they are larger atomic units of execution which transform one consistent database to another consistent database. The properties of transactions also indicate what the requirements for managing them are, which is the topic of the next two chapters. Consistency requires a definition of integrity enforcement (which we did in Chapter 5), as well as concurrency control algorithms (which is the topic of Chapter 11). Concurrency control also deals with the issue of isolation. Durability and atomicity properties of transactions require a discussion of reliability, which we cover in Chapter 12. Specifically, durability is supported by various commit protocols and commit management, whereas atomicity requires the development of appropriate recovery protocols.

Fig. 10.5 Detailed Model of the Distributed Execution Monitor [Figure: the distributed execution monitor contains the Transaction Manager (TM) and the Scheduler (SC). Labels: Begin_transaction, Read, Write, Commit, Abort; Results; Scheduling/Descheduling Requests; With other TMs; With other SCs; With other data processors; To data processors.]

10.6 Bibliographic Notes

Transaction management has been the topic of considerable study since DBMSs have become a significant research area. There are two excellent books on the subject: [Gray and Reuter, 1993] and [Weikum and Vossen, 2001]. An excellent companion to these is [Bernstein and Newcomer, 1997] which provides an in-depth discussion of transaction processing principles. It also gives a view of transaction processing and transaction monitors which is more general than the database-centric view that we provide in this book. A good collection of papers that focus on the concurrency control and reliability aspects of distributed systems is [Bhargava, 1987]. Two books focus on the performance of concurrency control mechanisms with a focus on centralized systems [Kumar, 1996; Thomasian, 1996]. Distributed concurrency control is the topic of [Cellary et al., 1988].


Advanced transaction models are discussed and various examples are given in [Elmagarmid, 1992]. Nested transactions are also covered in [Lynch et al., 1993]. A good introduction to workflow systems is [Georgakopoulos et al., 1995]. The same topic is covered in detail in [Dogac et al., 1998b].

A very important work is a set of notes on database operating systems by Gray [1979]. These notes contain valuable information on transaction management, among other things.

The discussion concerning transaction classification in Section 10.3 comes from a number of sources. Part of it is from [Farrag, 1986]. The structure discussion is from [Özsu, 1994] and [Buchmann et al., 1982], where the authors combine transaction structure with the structure of the objects that these transactions operate upon to develop a more complete classification.

There are numerous papers dealing with various transaction management issues. The ones referred to in this chapter are those that deal with the concept of a transaction. More detailed references on their management are left to Chapters 11 and 12.

Chapter 11 Distributed Concurrency Control

As we discussed in Chapter 10, concurrency control deals with the isolation and consistency properties of transactions. The distributed concurrency control mechanism of a distributed DBMS ensures that the consistency of the database, as defined in Section 10.2.2, is maintained in a multiuser distributed environment. If transactions are internally consistent (i.e., do not violate any consistency constraints), the simplest way of achieving this objective is to execute each transaction alone, one after another. It is obvious that such an alternative is only of theoretical interest and would not be implemented in any practical system, since it minimizes the system throughput. The level of concurrency (i.e., the number of concurrent transactions) is probably the most important parameter in distributed systems [Balter et al., 1982]. Therefore, the concurrency control mechanism attempts to find a suitable trade-off between maintaining the consistency of the database and maintaining a high level of concurrency.

In this chapter, we make two major assumptions: the distributed system is fully reliable and does not experience any failures (of hardware or software), and the database is not replicated. Even though these are unrealistic assumptions, they permit us to delineate the issues related to the management of concurrency from those related to the operation of a reliable distributed system and those related to maintaining replicas. In Chapter 12, we discuss how the algorithms that are presented in this chapter need to be enhanced to operate in an unreliable environment. In Chapter 13 we address the issues related to replica management.

We start our discussion of concurrency control with a presentation of serializability theory in Section 11.1. Serializability is the most widely accepted correctness criterion for concurrency control algorithms. In Section 11.2 we present a taxonomy of algorithms that will form the basis for most of the discussion in the remainder of the chapter. Sections 11.3 and 11.4 cover the two major classes of algorithms: locking-based and timestamp ordering-based. Both locking and timestamp ordering classes cover what is called pessimistic algorithms; optimistic concurrency control is discussed in Section 11.5. Any locking-based algorithm may result in deadlocks, requiring special management methods. Various deadlock management techniques are therefore the topic of Section 11.6. In Section 11.7, we discuss “relaxed” concurrency control approaches. These are mechanisms which use weaker correctness criteria than serializability, or relax the isolation property of transactions.

M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems: Third Edition, DOI 10.1007/978-1-4419-8834-8_11, © Springer Science+Business Media, LLC 2011

11.1 Serializability Theory

In Section 10.1.3 we discussed the issue of isolating transactions from one another in terms of their effects on the database. We also pointed out that if the concurrent execution of transactions leaves the database in a state that can be achieved by their serial execution in some order, problems such as lost updates will be resolved. This is exactly the point of the serializability argument. The remainder of this section addresses serializability issues more formally.

A history H (also called a schedule) is defined over a set of transactions T = {T1, T2, . . . , Tn} and specifies an interleaved order of execution of these transactions’ operations. Based on the definition of a transaction introduced in Section 10.1, the history can be specified as a partial order over T. We need a few preliminaries, though, before we present the formal definition.

Recall the definition of conflicting operations that we gave in Chapter 10. Two operations $O_{ij}(x)$ and $O_{kl}(x)$ (where i and k denote transactions and are not necessarily distinct) accessing the same database entity x are said to be in conflict if at least one of them is a write operation. Note two things in this definition. First, read operations do not conflict with each other. We can, therefore, talk about two types of conflicts: read-write (or write-read), and write-write. Second, the two operations can belong to the same transaction or to two different transactions. In the latter case, the two transactions are said to be conflicting. Intuitively, the existence of a conflict between two operations indicates that their order of execution is important. The ordering of two read operations is insignificant.
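The conflict test is small enough to state as code. The following is a minimal sketch, assuming a hypothetical `Op` record whose field names (`kind`, `txn`, `item`) are illustrative and do not come from the text:

```python
from typing import NamedTuple

class Op(NamedTuple):
    """One database operation; the field names are illustrative assumptions."""
    kind: str   # "R" for read, "W" for write
    txn: int    # index of the issuing transaction
    item: str   # database entity accessed

def in_conflict(a: Op, b: Op) -> bool:
    """Two distinct operations conflict if they access the same entity
    and at least one of them is a write."""
    return a.item == b.item and "W" in (a.kind, b.kind)

r1x, r2x = Op("R", 1, "x"), Op("R", 2, "x")
w2x, w1y = Op("W", 2, "x"), Op("W", 1, "y")
print(in_conflict(r1x, r2x))  # False: two reads never conflict
print(in_conflict(r1x, w2x))  # True: read-write conflict on x
print(in_conflict(w2x, w1y))  # False: different entities
```

The test deliberately ignores the transaction index, since the definition allows the two operations to belong to the same transaction.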

We first define a complete history, which defines the execution order of all operations in its domain. We will then define a history as a prefix of a complete history. Formally, a complete history $H^c_T$ defined over a set of transactions $T = \{T_1, T_2, \ldots, T_n\}$ is a partial order $H^c_T = \{\Sigma_T, \prec_H\}$ where

1. $\Sigma_T = \bigcup_{i=1}^{n} \Sigma_i$.

2. $\prec_H \supseteq \bigcup_{i=1}^{n} \prec_{T_i}$.

3. For any two conflicting operations $O_{ij}, O_{kl} \in \Sigma_T$, either $O_{ij} \prec_H O_{kl}$, or $O_{kl} \prec_H O_{ij}$.

The first condition simply states that the domain of the history is the union of the domains of individual transactions. The second condition defines the ordering relation of the history as a superset of the ordering relations of individual transactions. This maintains the ordering of operations within each transaction. The final condition simply defines the execution order among conflicting operations in H.


Example 11.1. Consider the two transactions from Example 10.8, which were as follows:

T1: Read(x)        T2: Read(x)
    x ← x+1            x ← x+1
    Write(x)           Write(x)
    Commit             Commit

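A possible complete history $H^c_T$ over $T = \{T_1, T_2\}$ is the partial order $H^c_T = \{\Sigma_T, \prec_H\}$ where

$$\Sigma_1 = \{R_1(x), W_1(x), C_1\}$$
$$\Sigma_2 = \{R_2(x), W_2(x), C_2\}$$

Thus

$$\Sigma_T = \Sigma_1 \cup \Sigma_2 = \{R_1(x), W_1(x), C_1, R_2(x), W_2(x), C_2\}$$

and

$$\prec_H = \{(R_1,R_2), (R_1,W_1), (R_1,C_1), (R_1,W_2), (R_1,C_2), (R_2,W_1), (R_2,C_1), (R_2,W_2), (R_2,C_2), (W_1,C_1), (W_1,W_2), (W_1,C_2), (C_1,W_2), (C_1,C_2), (W_2,C_2)\}$$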

which can be specified as a DAG as depicted in Figure 11.1. Note that consistent with our earlier adopted convention (see Example 10.7), we do not draw the arcs that are implied by transitivity [e.g., (R1,C1)].

Fig. 11.1 DAG Representation of a Complete History (nodes: R1(x), R2(x), W1(x), W2(x), C1, C2)

It is quite common to specify a history as a listing of the operations in $\Sigma_T$, where their execution order is relative to their order in this list. Thus $H^c_T$ can be specified as

$$H^c_T = \{R_1(x), R_2(x), W_1(x), C_1, W_2(x), C_2\}$$
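As a quick check that the listing and the ordering relation of Example 11.1 describe the same history, the sketch below derives every "appears before" pair from the list form. The short operation labels (`"R1"`, `"W1"`, and so on) and the function name `order_pairs` are illustrative choices, not notation from the text:

```python
from itertools import combinations

# The complete history of Example 11.1 in list form (execution order = list order).
listing = ["R1", "R2", "W1", "C1", "W2", "C2"]

def order_pairs(history):
    """Return every pair (a, b) such that a appears before b in the listing."""
    # combinations() emits pairs in input order, so a always precedes b.
    return set(combinations(history, 2))

pairs = order_pairs(listing)
print(len(pairs))             # 15
print(("R1", "C1") in pairs)  # True: implied by transitivity, so not drawn in the DAG
print(("W2", "W1") in pairs)  # False
```

Because the listing is a total order over six operations, it yields 15 pairs, which matches the 15 pairs enumerated in the ordering relation of the example.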


A history is defined as a prefix of a complete history. A prefix of a partial order can be defined as follows. Given a partial order $P = \{\Sigma, \prec\}$, $P' = \{\Sigma', \prec'\}$ is a prefix of P if

1. $\Sigma' \subseteq \Sigma$;
2. $\forall e_1, e_2 \in \Sigma'$, $e_1 \prec' e_2$ if and only if $e_1 \prec e_2$; and
3. $\forall e_i \in \Sigma'$, if $\exists e_j \in \Sigma$ and $e_j \prec e_i$, then $e_j \in \Sigma'$.

The first two conditions define P′ as a restriction of P on domain Σ′, whereby the ordering relations in P are maintained in P′. The last condition indicates that for any element of Σ′, all its predecessors in Σ have to be included in Σ′ as well.
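The three conditions translate directly into set operations. The sketch below is one possible encoding, assuming a partial order is given as a set of elements plus a set of (earlier, later) pairs that already includes the transitively implied pairs; the function name `is_prefix` is hypothetical:

```python
def is_prefix(sigma_p, order_p, sigma, order):
    """Check the three prefix conditions for partial orders given as
    (set of elements, set of (earlier, later) pairs)."""
    # 1. The domain of P' is a subset of the domain of P.
    if not sigma_p <= sigma:
        return False
    # 2. On the elements of P', both orders agree.
    restricted = {(a, b) for (a, b) in order if a in sigma_p and b in sigma_p}
    if order_p != restricted:
        return False
    # 3. Every predecessor of an element of P' is also in P'.
    return all(a in sigma_p for (a, b) in order if b in sigma_p)

# A three-operation chain: R1(x) before W1(x) before C1.
sigma = {"R1(x)", "W1(x)", "C1"}
order = {("R1(x)", "W1(x)"), ("W1(x)", "C1"), ("R1(x)", "C1")}

print(is_prefix({"R1(x)"}, set(), sigma, order))                       # True
print(is_prefix({"W1(x)", "C1"}, {("W1(x)", "C1")}, sigma, order))     # False
```

The second call fails on the third condition: W1(x) is included but its predecessor R1(x) is not.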

What does this definition of a history as a prefix of a partial order provide for us? The answer is simply that we can now deal with incomplete histories. This is useful for a number of reasons. From the perspective of the serializability theory, we deal only with conflicting operations of transactions rather than with all operations. Furthermore, and perhaps more important, when we introduce failures, we need to be able to deal with incomplete histories, which is what a prefix enables us to do.

The history discussed in Example 11.1 is special in that it is complete. It needs to be complete in order to talk about the execution order of these two transactions’ operations. The following example demonstrates a history that is not complete.

Example 11.2. Consider the following three transactions:

T1: Read(x)        T2: Write(x)       T3: Read(x)
    Write(x)           Write(y)           Read(y)
    Commit             Read(z)            Read(z)
                       Commit             Commit

A complete history $H^c$ for these transactions is given in Figure 11.2, and a history H (as a prefix of $H^c$) is depicted in Figure 11.3.

Fig. 11.2 A Complete History (DAG over the operations R1(x), W1(x), C1, W2(x), W2(y), R2(z), C2, R3(x), R3(y), R3(z), C3)

Fig. 11.3 Prefix of Complete History in Figure 11.2 (DAG over the operations R1(x), W2(x), W2(y), R2(z), R3(x), R3(y), R3(z))

If in a complete history H, the operations of various transactions are not interleaved (i.e., the operations of each transaction occur consecutively), the history is said to be serial. As we indicated before, the serial execution of a set of transactions maintains the consistency of the database. This follows naturally from the consistency property of transactions: each transaction, when executed alone on a consistent database, will produce a consistent database.

Example 11.3. Consider the three transactions of Example 11.2. The following history is serial since all the operations of T2 are executed before all the operations of T1 and all operations of T1 are executed before all operations of T3.¹

$$H = \{\underbrace{W_2(x), W_2(y), R_2(z)}_{T_2}, \underbrace{R_1(x), W_1(x)}_{T_1}, \underbrace{R_3(x), R_3(y), R_3(z)}_{T_3}\}$$

One common way to denote this precedence relationship between transaction executions is $T_2 \rightarrow T_1 \rightarrow T_3$ rather than the more formal $T_2 \prec_H T_1 \prec_H T_3$.

Based on the precedence relationship introduced by the partial order, it is possible to discuss the equivalence of histories with respect to their effects on the database. Intuitively, two histories $H_1$ and $H_2$, defined over the same set of transactions T, are equivalent if they have the same effect on the database. More formally, two histories, $H_1$ and $H_2$, defined over the same set of transactions T, are said to be equivalent if for each pair of conflicting operations $O_{ij}$ and $O_{kl}$ ($i \neq k$), whenever $O_{ij} \prec_{H_1} O_{kl}$, then $O_{ij} \prec_{H_2} O_{kl}$. This is called conflict equivalence since it defines equivalence of two histories in terms of the relative order of execution of the conflicting operations in those histories. Here, for the sake of simplicity, we assume that T does not include any aborted transaction. Otherwise, the definition needs to be modified to specify only those conflicting operations that belong to unaborted transactions.

Example 11.4. Again consider the three transactions given in Example 11.2. The following history $H'$ defined over them is conflict equivalent to H given in Example 11.3:

$$H' = \{W_2(x), R_1(x), W_1(x), R_3(x), W_2(y), R_3(y), R_2(z), R_3(z)\}$$

¹ From now on we will generally omit the Commit operation from histories.
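The claim of Example 11.4 can be checked mechanically. The following sketch compares the order of conflicting operation pairs in the histories of Examples 11.3 and 11.4; the helper names (`parse`, `conflict_order`, `conflict_equivalent`) are hypothetical, and the parser assumes single-digit transaction indices and histories given as total listings:

```python
def parse(text):
    """Turn 'W2(x)' into ('W', 2, 'x'); assumes single-digit transaction ids."""
    return (text[0], int(text[1]), text[3:-1])

def conflict_order(history):
    """Ordered pairs of conflicting operations that belong to different transactions."""
    ops = [parse(t) for t in history]
    pairs = set()
    for i, a in enumerate(ops):
        for b in ops[i + 1:]:
            same_item = a[2] == b[2]
            different_txn = a[1] != b[1]
            one_writes = "W" in (a[0], b[0])
            if same_item and different_txn and one_writes:
                pairs.add((a, b))
    return pairs

def conflict_equivalent(h1, h2):
    """Histories over the same operations are conflict equivalent when they
    order every conflicting pair the same way."""
    return sorted(h1) == sorted(h2) and conflict_order(h1) == conflict_order(h2)

h = ["W2(x)", "W2(y)", "R2(z)", "R1(x)", "W1(x)", "R3(x)", "R3(y)", "R3(z)"]
h_prime = ["W2(x)", "R1(x)", "W1(x)", "R3(x)", "W2(y)", "R3(y)", "R2(z)", "R3(z)"]
# Swapping W1(x) and R3(x) reverses one conflicting pair.
h_other = ["W2(x)", "R1(x)", "R3(x)", "W1(x)", "W2(y)", "R3(y)", "R2(z)", "R3(z)"]

print(conflict_equivalent(h, h_prime))  # True
print(conflict_equivalent(h, h_other))  # False
```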


We are now ready to define serializability more precisely. A history H is said to be serializable if and only if it is conflict equivalent to a serial history. Note that serializability roughly corresponds to degree 3 consistency, which we defined in Section 10.2.2. Serializability so defined is also known as conflict-based serializability since it is defined according to conflict equivalence.

Example 11.5. History $H'$ in Example 11.4 is serializable since it is equivalent to the serial history H of Example 11.3. Also note that the problem with the uncontrolled execution of transactions T1 and T2 in Example 10.8 was that they could generate an unserializable history.
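One standard way to test conflict serializability, not described in this section, is to build a precedence graph over transactions (an edge from Ti to Tj whenever an operation of Ti conflicts with and precedes an operation of Tj) and check it for cycles. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the function name and the tuple encoding of operations are assumptions made for illustration:

```python
from graphlib import TopologicalSorter, CycleError

def serialization_order(history):
    """Return a serial order of transactions that is conflict equivalent to
    `history`, or None if the precedence graph has a cycle.
    Operations are (kind, txn, item) tuples."""
    preds = {txn: set() for (_, txn, _) in history}
    for i, (k1, t1, x1) in enumerate(history):
        for (k2, t2, x2) in history[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (k1, k2):
                preds[t2].add(t1)   # t1 must be serialized before t2
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None

# History H' of Example 11.4.
h_prime = [("W", 2, "x"), ("R", 1, "x"), ("W", 1, "x"), ("R", 3, "x"),
           ("W", 2, "y"), ("R", 3, "y"), ("R", 2, "z"), ("R", 3, "z")]
# The interleaving of Example 11.1 with Commit operations omitted.
lost_update = [("R", 1, "x"), ("R", 2, "x"), ("W", 1, "x"), ("W", 2, "x")]

print(serialization_order(h_prime))      # [2, 1, 3]
print(serialization_order(lost_update))  # None
```

The first result agrees with the serial order T2 → T1 → T3 of Example 11.3; the second history produces edges in both directions between T1 and T2, so no serial order exists.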

Now that we have formally defined serializability, we can indicate that the primary function of a concurrency controller is to generate a serializable history for the execution of pending transactions. The issue, then, is to devise algorithms that are guaranteed to generate only serializable histories.

Serializability theory extends in a straightforward manner to the non-replicated (or partitioned) distributed databases. The history of transaction execution at each site is called a local history. If the database is not replicated and each local history is serializable, their union (called the global history) is also serializable as long as local serialization orders are identical.

Example 11.6. We will give a very simple example to demonstrate the point. Consider two bank accounts, x (stored at Site 1) and y (stored at Site 2), and the following two transactions where T1 transfers $100 from x to y, while T2 simply reads the balances of x and y:

T1: Read(x)         T2: Read(x)
    x ← x−100           Read(y)
    Write(x)            Commit
    Read(y)
    y ← y+100
    Write(y)
    Commit

Obviously, both of these transactions need to run at both sites. Consider the following two histories that may be generated locally at the two sites (Hi is the history at Site i):

$$H_1 = \{R_1(x), W_1(x), R_2(x)\}$$
$$H_2 = \{R_1(y), W_1(y), R_2(y)\}$$

Both of these histories are serializable; indeed, they are serial. Therefore, each represents a correct execution order. Furthermore, the serialization order is the same for both: T1 → T2. Therefore, the global history that is obtained is also serializable with the serialization order T1 → T2.


However, if the histories generated at the two sites are as follows, there is a problem:

$$H'_1 = \{R_1(x), W_1(x), R_2(x)\}$$
$$H'_2 = \{R_2(y), R_1(y), W_1(y)\}$$

Although each local history is still serializable, the serialization orders are different: $H'_1$ serializes T1 before T2 while $H'_2$ serializes T2 before T1. Therefore, there can be no global history that is serializable.
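The two cases of Example 11.6 can be reproduced by merging the precedence constraints that each local history imposes and checking whether they can all be satisfied at once. This is a sketch under the chapter's assumptions (no replication, no failures); the function names and the tuple encoding are hypothetical, and cycle detection relies on Python's standard `graphlib` module:

```python
from graphlib import TopologicalSorter, CycleError

def precedence_edges(history):
    """Edges (ti, tj): an operation of ti conflicts with and precedes one of tj."""
    edges = set()
    for i, (k1, t1, x1) in enumerate(history):
        for (k2, t2, x2) in history[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (k1, k2):
                edges.add((t1, t2))
    return edges

def global_order(local_histories):
    """Merge the precedence edges of all local histories; return a global
    serialization order, or None if the local orders contradict each other."""
    preds = {}
    for history in local_histories:
        for (_, txn, _) in history:
            preds.setdefault(txn, set())
        for (before, after) in precedence_edges(history):
            preds[after].add(before)
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None

h1 = [("R", 1, "x"), ("W", 1, "x"), ("R", 2, "x")]        # Site 1
h2 = [("R", 1, "y"), ("W", 1, "y"), ("R", 2, "y")]        # Site 2, first case
h2_prime = [("R", 2, "y"), ("R", 1, "y"), ("W", 1, "y")]  # Site 2, second case

print(global_order([h1, h2]))        # [1, 2]
print(global_order([h1, h2_prime]))  # None
```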

A weaker version of serializability that has gained importance in recent years is snapshot isolation [Berenson et al., 1995], which is now provided as a standard consistency criterion in a number of commercial systems. Snapshot isolation allows read transactions (queries) to read stale data by allowing them to read a snapshot of the database that reflects the committed data at the time the read transaction starts. Consequently, the reads are never blocked by writes, even though they may read old data that may be dirtied by other transactions that were still running when the snapshot was taken. Hence, the resulting histories are not serializable, but this is accepted as a reasonable tradeoff between a lower level of isolation and better performance.

11.2 Taxonomy of Concurrency Control Mechanisms

There are a number of ways that the concurrency control approaches can be classified. One obvious classification criterion is the mode of database distribution. Some algorithms that have been proposed require a fully replicated database, while others can operate on partially replicated or partitioned databases. The concurrency control algorithms may also be classified according to network topology, such as those requiring a communication subnet with broadcasting capability or those working in a star-type network or a circularly connected network.

The most common classification criterion, however, is the synchronization primitive. The corresponding breakdown of the concurrency control algorithms results in two classes [Bernstein and Goodman, 1981]: those algorithms that are based on mutually exclusive access to shared data (locking), and those that attempt to order the execution of the transactions according to a set of rules (protocols). However, these primitives may be used in algorithms with two different viewpoints: the pessimistic view that many transactions will conflict with each other, or the optimistic view that not too many transactions will conflict with one another.

We will thus group the concurrency control mechanisms into two broad classes: pessimistic concurrency control methods and optimistic concurrency control methods. Pessimistic algorithms synchronize the concurrent execution of transactions early in their execution life cycle, whereas optimistic algorithms delay the synchronization of transactions until their termination. The pessimistic group consists of locking-based algorithms, ordering (or transaction ordering) based algorithms, and hybrid algorithms. The optimistic group can, similarly, be classified as locking-based or timestamp ordering-based. This classification is depicted in Figure 11.4.

Fig. 11.4 Classification of Concurrency Control Algorithms (Pessimistic: Locking [Centralized, Primary Copy, Distributed], Timestamp Ordering [Basic, Multiversion, Conservative], Hybrid; Optimistic: Locking, Timestamp Ordering)

In the locking-based approach, the synchronization of transactions is achieved by employing physical or logical locks on some portion or granule of the database. The size of these portions (usually called locking granularity) is an important issue. However, for the time being, we will ignore it and refer to the chosen granule as a lock unit. This class is subdivided further according to where the lock management activities are performed: centralized and decentralized (or distributed) locking.

The timestamp ordering (TO) class involves organizing the execution order of transactions so that they maintain transaction consistency. This ordering is maintained by assigning timestamps to both the transactions and the data items that are stored in the database. These algorithms can be basic TO, multiversion TO, or conservative TO.

We should indicate that in some locking-based algorithms, timestamps are also used. This is done primarily to improve efficiency and the level of concurrency. We call these hybrid algorithms. We will not discuss these algorithms in this chapter since they have not been implemented in any commercial or research prototype distributed DBMS. The rules for integrating locking and timestamp ordering protocols are discussed by Bernstein and Goodman [1981].

11.3 Locking-Based Concurrency Control Algorithms

The main idea of locking-based concurrency control is to ensure that a data item that is shared by conflicting operations is accessed by one operation at a time. This is accomplished by associating a “lock” with each lock unit. This lock is set by a transaction before the lock unit is accessed and is reset at the end of its use. Obviously a lock unit cannot be accessed by an operation if it is already locked by another. Thus a lock request by a transaction is granted only if the associated lock is not being held by any other transaction.

Since we are concerned with synchronizing the conflicting operations of conflicting transactions, there are two types of locks (commonly called lock modes) associated with each lock unit: read lock (rl) and write lock (wl). A transaction Ti that wants to read a data item contained in lock unit x obtains a read lock on x [denoted rli(x)]. The same happens for write operations. Two lock modes are compatible if two transactions that access the same data item can obtain these locks on that data item at the same time. As Figure 11.5 shows, read locks are compatible, whereas read-write or write-write locks are not. Therefore, it is possible, for example, for two transactions to read the same data item concurrently.

            rli(x)            wli(x)
rlj(x)      compatible        not compatible
wlj(x)      not compatible    not compatible

Fig. 11.5 Compatibility Matrix of Lock Modes
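The compatibility matrix can be encoded as a lookup table. The following is a minimal sketch; the names `COMPATIBLE` and `can_grant` are illustrative, and the list of held modes is assumed to contain only locks held by other transactions on the same lock unit:

```python
# Lock modes: "rl" = read lock, "wl" = write lock.
# Keys are (mode already held, mode requested).
COMPATIBLE = {
    ("rl", "rl"): True,
    ("rl", "wl"): False,
    ("wl", "rl"): False,
    ("wl", "wl"): False,
}

def can_grant(requested, held_modes):
    """A request is grantable if it is compatible with every lock
    currently held on the lock unit by other transactions."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("rl", ["rl", "rl"]))  # True: readers share the lock unit
print(can_grant("wl", ["rl"]))        # False: a writer must wait for the reader
print(can_grant("wl", []))            # True: nothing is held
```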

The distributed DBMS not only manages locks but also handles the lock management responsibilities on behalf of the transactions. In other words, users do not need to specify when a data item needs to be locked; the distributed DBMS takes care of that every time the transaction issues a read or write operation.

In locking-based systems, the scheduler (see Figure 10.5) is a lock manager (LM). The transaction manager passes to the lock manager the database operation (read or write) and associated information (such as the item that is accessed and the identifier of the transaction that issues the database operation). The lock manager then checks if the lock unit that contains the data item is already locked. If so, and if the existing lock mode is incompatible with that of the current transaction, the current operation is delayed. Otherwise, the lock is set in the desired mode and the database operation is passed on to the data processor for actual database access. The transaction manager is then informed of the results of the operation. The termination of a transaction results in the release of its locks and the initiation of another transaction that might be waiting for access to the same data item.

The locking algorithm as described above will not, unfortunately, properly synchronize transaction executions. This is because to generate serializable histories, the locking and releasing operations of transactions also need to be coordinated. We demonstrate this by an example.

Example 11.7. Consider the following two transactions:

T1: Read(x)        T2: Read(x)
    x ← x+1            x ← x∗2
    Write(x)           Write(x)
    Read(y)            Read(y)
    y ← y−1            y ← y∗2
    Write(y)           Write(y)
    Commit             Commit

The following is a valid history that a lock manager employing the locking algorithm may generate:

H = {wl1(x), R1(x), W1(x), lr1(x), wl2(x), R2(x), W2(x), lr2(x), wl2(y), R2(y), W2(y), lr2(y), wl1(y), R1(y), W1(y), lr1(y)}

where lri(z) indicates the release of the lock on z that transaction Ti holds.

Note that H is not a serializable history. For example, if prior to the execution of these transactions, the values of x and y are 50 and 20, respectively, one would expect their values following execution to be, respectively, either 102 and 38 if T1 executes before T2, or 101 and 39 if T2 executes before T1. However, the result of executing H would give x and y the values 102 and 39. Obviously, H is not serializable.
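The arithmetic of Example 11.7 can be replayed directly. In the sketch below each step stands for one locked read-modify-write of a single item by one transaction, which is how H interleaves the work; the names `STEPS` and `run` are illustrative:

```python
# One entry per (transaction, item): the update that transaction applies.
STEPS = {
    (1, "x"): lambda v: v + 1,   # T1: x <- x+1
    (1, "y"): lambda v: v - 1,   # T1: y <- y-1
    (2, "x"): lambda v: v * 2,   # T2: x <- x*2
    (2, "y"): lambda v: v * 2,   # T2: y <- y*2
}

def run(order, db):
    """Apply each (transaction, item) step in the given order to a copy of db."""
    db = dict(db)
    for step in order:
        _, item = step
        db[item] = STEPS[step](db[item])
    return db

initial = {"x": 50, "y": 20}
serial_12 = run([(1, "x"), (1, "y"), (2, "x"), (2, "y")], initial)
serial_21 = run([(2, "x"), (2, "y"), (1, "x"), (1, "y")], initial)
history_h = run([(1, "x"), (2, "x"), (2, "y"), (1, "y")], initial)

print(serial_12)  # {'x': 102, 'y': 38}
print(serial_21)  # {'x': 101, 'y': 39}
print(history_h)  # {'x': 102, 'y': 39}
```

The result of H matches neither serial execution: x reflects the order T1 then T2, while y reflects T2 then T1.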

The problem with history H in Example 11.7 is the following. The locking algorithm releases the locks that are held by a transaction (say, Ti) as soon as the associated database command (read or write) is executed, and that lock unit (say x) no longer needs to be accessed. However, the transaction itself is locking other items (say, y), after it releases its lock on x. Even though this may seem to be advantageous from the viewpoint of increased concurrency, it permits transactions to interfere with one another, resulting in the loss of isolation and atomicity. Hence the argument for two-phase locking (2PL).

The two-phase locking rule simply states that no transaction should request a lock after it releases one of its locks. Alternatively, a transaction should not release a lock until it is certain that it will not request another lock. 2PL algorithms execute transactions in two phases. Each transaction has a growing phase, where it obtains locks and accesses data items, and a shrinking phase, during which it releases locks (Figure 11.6). The lock point is the moment when the transaction has achieved all its locks but has not yet started to release any of them. Thus the lock point determines the end of the growing phase and the beginning of the shrinking phase of a transaction. It has been proven that any history generated by a concurrency control algorithm that obeys the 2PL rule is serializable [Eswaran et al., 1976].
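The two-phase rule is easy to check on the sequence of lock actions issued by a single transaction. The following sketch assumes each action is recorded as a pair of an action name and an item; the function name `is_two_phase` and the event encoding are hypothetical:

```python
def is_two_phase(events):
    """events: one transaction's lock actions in order, each a pair
    ("lock" | "release", item). Returns True if no lock is requested
    after the first release."""
    released = False
    for action, _item in events:
        if action == "release":
            released = True          # the shrinking phase has begun
        elif action == "lock" and released:
            return False             # growing after shrinking violates 2PL
    return True

# T1 in the history H of Example 11.7: it releases x before locking y.
t1_in_h = [("lock", "x"), ("release", "x"), ("lock", "y"), ("release", "y")]
# A two-phase variant: all locks are obtained before any is released.
t1_2pl = [("lock", "x"), ("lock", "y"), ("release", "x"), ("release", "y")]

print(is_two_phase(t1_in_h))  # False
print(is_two_phase(t1_2pl))   # True
```

In the second sequence the lock point falls right after the lock on y is obtained.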

Fig. 11.6 2PL Lock Graph (number of locks plotted against transaction duration: locks are obtained from BEGIN up to the lock point and released from the lock point to END)

Fig. 11.7 Strict 2PL Lock Graph (number of locks plotted against transaction duration: locks are obtained from BEGIN during the period of data item use and released together at END)

Figure 11.6 indicates that the lock manager releases locks as soon as access to that data item has been completed. This permits other transactions awaiting access to go ahead and lock it, thereby increasing the degree of concurrency. However, this is difficult to implement since the lock manager has to know that the transaction has obtained all its locks and will not need to lock another data item. The lock manager also needs to know that the transaction no longer needs to access the data item in question, so that the lock can be released. Finally, if the transaction aborts after it releases a lock, it may cause other transactions that may have accessed the unlocked data item to abort as well. This is known as cascading aborts. These problems may be overcome by strict two-phase locking, which releases all the locks together when the transaction terminates (commits or aborts). Thus the lock graph is as shown in Figure 11.7.
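A strict 2PL lock table can be sketched in a few lines. This is a toy single-site version with no wait queues or deadlock handling; the class and method names (`StrictLockManager`, `request`, `terminate`) are illustrative and do not correspond to the algorithms given later in the chapter:

```python
from collections import defaultdict

class StrictLockManager:
    """Toy lock table in which locks are only released at termination."""

    def __init__(self):
        self.holders = defaultdict(dict)   # item -> {txn: "rl" | "wl"}

    def request(self, txn, item, mode):
        """Grant the lock and return True, or return False if txn must wait."""
        others = [m for t, m in self.holders[item].items() if t != txn]
        if mode == "rl":
            ok = all(m == "rl" for m in others)   # readers may share
        else:
            ok = not others                       # writers need exclusivity
        if ok:
            # Keep the stronger mode if txn already holds a write lock.
            current = self.holders[item].get(txn)
            self.holders[item][txn] = "wl" if "wl" in (mode, current) else "rl"
        return ok

    def terminate(self, txn):
        """Commit or abort: release all of txn's locks together."""
        for locks in self.holders.values():
            locks.pop(txn, None)

lm = StrictLockManager()
print(lm.request(1, "x", "wl"))  # True
print(lm.request(2, "x", "rl"))  # False: T2 waits until T1 terminates
print(lm.request(1, "y", "wl"))  # True: T1 is still in its growing phase
lm.terminate(1)
print(lm.request(2, "x", "rl"))  # True
```

Because there is no per-item release operation, a transaction can never request a lock after releasing one, and no other transaction can read an item written by a transaction that has not yet terminated.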

We should note that even though a 2PL algorithm enforces conflict serializability, it does not allow all histories that are conflict serializable. Consider the following history discussed by Agrawal and El-Abbadi [1990]:


H = {W1(x),R2(x),W3(y),W1(y)}

H is not allowed by a 2PL algorithm since T1 would need to obtain a write lock on y after it releases its write lock on x. However, this history is serializable in the order T3 → T1 → T2. The order of locking can be exploited to design locking algorithms that allow histories such as these [Agrawal and El-Abbadi, 1990].

The main idea is to observe that in serializability theory, the order of serialization of conflicting operations is as important as detecting the conflict in the first place and this can be exploited in defining locking modes. Consequently, in addition to read (shared) and write (exclusive) locks, a third lock mode is defined: ordered shared. Ordered shared locking of an object x by transactions Ti and Tj has the following meaning: Given a history H that allows ordered shared locks between operations o ∈ Ti and p ∈ Tj, if Ti acquires o-lock before Tj acquires p-lock, then o is executed before p. Consider the compatibility table between read and write locks given in Figure 11.5. If the ordered shared mode is added, there are eight variants of this table. Figure 11.5 depicts one of them and two more are shown in Figure 11.8. In Figure 11.8(b), for example, there is an ordered shared relationship between rlj(x) and wli(x) indicating that Ti can acquire a write lock on x while Tj holds a read lock on x as long as the ordered shared relationship from rlj(x) to wli(x) is observed. The eight compatibility tables can be compared with respect to their permissiveness (i.e., with respect to the histories that can be produced using them) to generate a lattice of tables such that the one in Figure 11.5 is the most restrictive and the one in Figure 11.8(b) is the most liberal.

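Fig. 11.8 Commutativity Table with Ordered Shared Lock Mode (two tables over the lock modes rli(x), wli(x), rlj(x), and wlj(x). In table (a), read-read is compatible, write-write is not compatible, and of the two read-write combinations one is ordered shared and the other is not compatible. In table (b), read-read is compatible and the remaining three entries are ordered shared.)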

The locking protocol that enforces a compatibility matrix involving ordered shared lock modes is identical to 2PL, except that a transaction may not release any locks as long as any of its locks are on hold. Otherwise circular serialization orders can exist.

Locking-based algorithms may cause deadlocks since they allow exclusive access to resources. It is possible that two transactions that access the same data items may lock them in reverse order, so that each waits for the other to release its locks, resulting in a deadlock. We discuss deadlock management in Section 11.6.


11.3.1 Centralized 2PL

The 2PL algorithm discussed in the preceding section can easily be extended to the distributed DBMS environment. One way of doing this is to delegate lock management responsibility to a single site only. This means that only one of the sites has a lock manager; the transaction managers at the other sites communicate with it rather than with their own lock managers. This approach is also known as the primary site 2PL algorithm [Alsberg and Day, 1976].

The communication between the cooperating sites in executing a transaction according to a centralized 2PL (C2PL) algorithm is depicted in Figure 11.9. This communication is between the transaction manager at the site where the transaction is initiated (called the coordinating TM), the lock manager at the central site, and the data processors (DP) at the other participating sites. The participating sites are those that store the data item and at which the operation is to be carried out. The order of messages is denoted in the figure.

Fig. 11.9 Communication Structure of Centralized 2PL (messages exchanged among the coordinating TM, the lock manager at the central site, and the data processors at participating sites, in order: 1 Lock Request, 2 Lock Granted, 3 Operation, 4 End of Operation, 5 Release Locks)

The centralized 2PL transaction management algorithm (C2PL-TM) that incorporates these changes is given at a very high level in Algorithm 11.1, while the centralized 2PL lock management algorithm (C2PL-LM) is shown in Algorithm 11.2. A highly simplified data processor algorithm (DP) is given in Algorithm 11.3; this is the algorithm that will see major changes when we discuss reliability issues in Chapter 12. For the time being, this is sufficient for our purposes.


There is one important data structure that is used in these algorithms and that is the operation that is defined as a 5-tuple: Op : 〈Type = {BT,R,W,A,C},arg : Data item,val : Value, tid : Transaction identifier,res : Result〉. The meaning of the components is as follows: for an operation o : Op, o.Type∈{BT,R,W,A,C} specifies its type where BT = Begin transaction, R = Read, W = Write, A = Abort, and C = Commit, arg is the data item that the operation accesses (reads or writes; for other operations this field is null), val is also used in case of Read and Write operations to specify the value that has been read or the value to be written for data item arg (otherwise it is null), tid is the transaction that this operation belongs to (strictly speaking, this is the transaction identifier), and res indicates the completion code of operations requested of DP. In the high level descriptions of the algorithms in this chapter, res may seem unnecessary, but we will see in Chapter 12 that these return codes will be important.
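The operation record can be written down as a small data class. This is a sketch only: the Python names (`OpType`, `Op`) and the concrete Python types are assumptions, and `tid` is moved ahead of the optional fields because data classes require fields without defaults to come first:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional

class OpType(Enum):
    BT = "Begin transaction"
    R = "Read"
    W = "Write"
    A = "Abort"
    C = "Commit"

@dataclass
class Op:
    type: OpType                  # kind of operation
    tid: int                      # transaction identifier
    arg: Optional[str] = None     # data item; None unless Read or Write
    val: Any = None               # value read or to be written; None otherwise
    res: Optional[str] = None     # completion code returned by the DP

write_op = Op(type=OpType.W, tid=1, arg="x", val=51)
commit_op = Op(type=OpType.C, tid=1)

print(write_op.type.value, write_op.arg, write_op.val)  # Write x 51
print(commit_op.arg is None)                            # True
```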

The transaction manager (C2PL-TM) algorithm is written as a process that runs forever and waits until a message arrives from either an application (with a transaction operation) or from a lock manager, or from a data processor. The lock manager (C2PL-LM) and data processor (DP) algorithms are written as procedures that are called when needed. Since the algorithms are given at a high level of abstraction, this is not a major concern, but actual implementations may, naturally, be quite different.

One common criticism of C2PL algorithms is that a bottleneck may quickly form around the central site. Furthermore, the system may be less reliable since the failure or inaccessibility of the central site would cause major system failures. There are studies that indicate that the bottleneck will indeed form as the transaction rate increases.

11.3.2 Distributed 2PL

Distributed 2PL (D2PL) requires the availability of lock managers at each site. The communication between cooperating sites that execute a transaction according to the distributed 2PL protocol is depicted in Figure 11.10.

The distributed 2PL transaction management algorithm is similar to the C2PL-TM, with two major modifications. The messages that are sent to the central site lock manager in C2PL-TM are sent to the lock managers at all participating sites in D2PL-TM. The second difference is that the operations are not passed to the data processors by the coordinating transaction manager, but by the participating lock managers. This means that the coordinating transaction manager does not wait for a “lock request granted” message. Another point about Figure 11.10 is the following. The participating data processors send the “end of operation” messages to the coordinating TM. The alternative is for each DP to send it to its own lock manager who can then release the locks and inform the coordinating TM. We have chosen to describe the former since it uses an LM algorithm identical to the strict 2PL lock manager that we have already discussed and it makes the discussion of the commit protocols simpler (see Chapter 12). Owing to these similarities, we do not


Algorithm 11.1: Centralized 2PL Transaction Manager (C2PL-TM) Algorithm
Input: msg : a message
begin
    repeat
        wait for a msg ;
        switch msg do
            case transaction operation
                let op be the operation ;
                if op.Type = BT then DP(op) {call DP with operation}
                else C2PL-LM(op) {call LM with operation}

case Lock Ma