Database systems

ExtensionofAIBigdata.PDF

Home >Computer Science homework help >Database systems

Database Concepts 9th Edition

David M. Kroenke • David J. Auer • Scott L. Vandenberg • Robert C. Yoder

Online Extension C

Advanced Business Intelligence and Big Data

Z06_KROE8149_09_SE_EXTC.indd 1 04/07/19 9:05 PM

Vice President of Courseware Portfolio Management: Andrew Gilfillan Executive Portfolio Manager: Samantha Lewis Team Lead, Content Production: Laura Burgess Content Producer: Faraz Sharique Ali Portfolio Management Assistant: Bridget Daly Director of Product Marketing: Brad Parkins Director of Field Marketing: Jonathan Cottrell Product Marketing Manager: Heather Taylor Field Marketing Manager: Bob Nisbet Product Marketing Assistant: Liz Bennett Field Marketing Assistant: Derrica Moser Senior Operations Specialist: Diane Peirano

Senior Art Director: Mary Seiner Interior and Cover Design: Pearson CSC Cover Art: Artwork by Donna Auer Senior Product Model Manager: Eric Hakanson Manager, Digital Studio: Heather Darby Course Producer, MyLab MIS: Jaimie Noy Digital Studio Producer: Tanika Henderson Full-Service Project Manager: Gowthaman Sadhanandham Full Service Vendor: Integra Software Service Pvt. Ltd. Manufacturing Buyer: LSC Communications, Maura Zaldivar-Garcia Text Printer/Bindery: LSC Communications Cover Printer: Phoenix Color

Credits and acknowledgments borrowed from other sources and reproduced, with permission, in this textbook appear on the appropriate page within text.

Microsoft and/or its respective suppliers make no representations about the suitability of the information contained in the documents and related graphics published as part of the services for any purpose. All such documents and related graphics are provided “as is” without warranty of any kind. Microsoft and/or its respective suppliers hereby disclaim all warranties and conditions with regard to this information, including all warranties and conditions of merchantability, whether express, implied or statutory, fitness for a particular purpose, title and non-infringement. In no event shall Microsoft and/or its respective suppliers be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of information available from the services.

The documents and related graphics contained herein could include technical inaccuracies or typographical errors. Changes are periodically added to the information herein. Microsoft and/or its respective suppliers may make improvements and/or changes in the product(s) and/or the program(s) described herein at any time. Partial screen shots may be viewed in full within the software version specified.

Microsoft® Windows®, and Microsoft Office® are registered trademarks of the Microsoft Corporation in the U.S.A. and other countries. This book is not sponsored or endorsed by or affiliated with the Microsoft Corporation.

MySQL®, the MySQL Command Line Client®, the MySQL Workbench®, and the MySQL Connector/ODBC® are registered trademarks of Sun Microsystems, Inc./Oracle Corporation. Screenshots and icons reprinted with permission of Oracle Corporation. This book is not sponsored or endorsed by or affiliated with Oracle Corporation.

Oracle Database XE 2016 by Oracle Corporation. Reprinted with permission.

PHP is copyright The PHP Group 1999–2012, and is used under the terms of the PHP Public License v3.01 available at http://www.php.net/ license/3_01.txt. This book is not sponsored or endorsed by or affiliated with The PHP Group.

ArangoDB is a copyright of ArangoDB GmbH.

Copyright © 2020, 2017, 2015 by Pearson Education, Inc. 221 River Street, Hoboken, NJ 07030. All rights reserved. Manufactured in the United States of America. This publication is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise.

For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions. Ackn-owledgments of third-party content appear on the appropriate page within the text, which constitutes an extension of this copyright page. Unless otherwise indicated herein, any third-party trademarks that may appear in this work are the property of their respective owners and any references to third-party trademarks, logos or other trade dress are for demonstrative or de- scriptive purposes only. Such references are not intended to imply any sponsorship, endorsement, authorization, or promotion of Pearson’s products by the owners of such marks, or any relationship between the owner and Pearson Education, Inc. or its affiliates, authors, licensees or distributors.

Library of Congress Cataloging-in-Publication Data

Names: Kroenke, David M., author. | Auer, David J., author. | Vandenberg, Scott L., author. | Yoder, Robert C., author.

Title: Database concepts/David M. Kroenke, David J. Auer, Western Washington University, Scott L. Vandenberg, Siena College, Robert C. Yoder, Siena College.

Description: Ninth edition. | New York, NY : Pearson, [2020] | Includes index. Identifiers: LCCN 2018052988 | ISBN 9780135188149 | ISBN 0135188148 Subjects: LCSH: Database management. | Relational databases. Classification: LCC QA76.9.D3 K736 2020 | DDC 005.74–dc23 LC record available at https://lccn.loc.gov/2018052988

1 19

ISBN 10: 0-13-518814-8 ISBN 13: 978-0-13-518814-9

Z06_KROE8149_09_SE_EXTC.indd 2 04/07/19 9:05 PM

C-1

EX TE

N S IO

N Advanced Business Intelligence and Big Data

EXTENSION OBJECTIVES

■ Learn the basic concepts of reporting systems

■ Learn the basic concepts of RFM reporting

■ Learn the basic concepts of data mining

■ Learn the basic concepts of market basket analysis

■ Learn the basic concepts of decision trees

■ Learn the basic concepts of Big Data

■ Learn the basic concepts of nonrelational database management systems

■ Learn about replicated, partitioned data stores for supporting clusters

■ Understand the importance of XML

■ Understand the role of XSLT in materializing XML documents

■ Learn the elements of XML documents

■ Understand how to describe and validate XML document structure using XML Schema

■ Understand how XML can be supported in a DBMS

■ See examples of XPath and XQuery for queries of XML documents

■ Learn the basic concepts of key-value NoSQL databases

■ Learn the basic concepts of column family NoSQL databases

■ Learn the basic concepts of graph NoSQL databases

■ Learn some advanced concepts of document NoSQL databases

■ Learn the basic concepts of the Microsoft Azure cloud environment for creating and using relational databases

T his extension covers technologies that have largely arisen after relational databases became the most widespread form of data storage. These technologies take advantage of improvements in stor- age, processing speed, networking, and software advances in order to get information from and to manage large amounts of data, often in a networked environment. Two of these technology areas are Business Intelligence (BI) and Big Data.

BI includes a number of important areas; here we will expand on reporting systems and data mining—including coverage of RFM analysis, market basket analysis, and decision trees. Big Data can occur anywhere and with any data model, but it frequently involves cloud-based hardware/ software and/or nonrelational database management systems. Thus, in this book, we take Big Data to mean not just large amounts of data, but also the software and hardware systems that have evolved to make Big Data more useful. We will explore some of the basic concepts of Big Data, in addition to the CAP theorem discussed in Chapter 7. We will present a more detailed discussion of various nonrelational data models used in the realm of Big Data. Specifically, we will introduce XML as well as the major varieties of so- called NoSQL database management systems: key-value, document, column family, and graph.

Z06_KROE8149_09_SE_EXTC.indd 1 04/07/19 9:05 PM

C-2 Online Extension C Advanced Business Intelligence and Big Data

REPORTING SYSTEMS

To make some of these concepts more concrete, this extension includes coverage of using the Microsoft Azure cloud platform to create and manage a database. In addition, the “Working with Microsoft Access” section of this extension describes two ways of using Microsoft Access that can be useful in a Big Data or BI context. Using Access as a frontend to an enterprise DBMS (in this case, MySQL) allows Access to manipulate more data efficiently than it is typically capable of; this is clearly helpful in a Big Data scenario. Access Switchboard forms allow a DBA (or sophisticated end users) to create a single form that can be used to navigate to all the forms, reports, queries, and so on in a database. This can make an Access database much easier for BI users by placing the relevant reports and queries all together on a startup screen.

The purpose of a reporting system is to create meaningful information from disparate data sources and to deliver that information to the proper users on a timely basis. Unlike data mining, which uses sophisticated statistical techniques, reporting systems create informa- tion by using the basic operations of sorting, filtering, grouping, and making simple calculations.

Reporting systems summarize the current status of business activities and compare that status with past or predicted future activities. Report delivery is crucial. Reports must be delivered to the proper users on a timely basis in the appropriate format. For example, re- ports may be delivered on paper, via a Web browser, or in some other format.

It is easier to understand reporting systems if you are familiar with a typical report, so let us take a look at a typical reporting problem: RFM analysis.

RFM Analysis RFM analysis analyzes and ranks customers according to their purchasing patterns. It is a simple customer classification technique that considers how recently (R) a customer has ordered, how frequently (F) a customer orders, and how much money (M) the customer spends per order.

To produce an RFM score, we need only two things: customer data and sales data for each purchase (including the date of the sale and the total amount of the sale) made by each customer. If you look at the SALES_FOR_RFM table and its associated CUSTOMER and TIMELINE dimension tables in Figure C-1, you see that we have exactly those data: The SALES_FOR_RFM table is the starting point for RFM analysis in the HSD-DW BI system. Although we will not do it here, RFM analysis can be done using SQL statements and a table such as SALES_FOR_RFM.1

To calculate an R score, you first sort the customer purchase records by the date of the most recent (R) purchase—note that only the most recent purchase for each customer is used in this calculation. In a common form of this analysis, the customers are then divided into five groups, and a score of 1 to 5 is given to customers in each group. The 20 percent of the customers having the most recent orders are given an R score of 1, the 20 percent of the customers having the next most recent orders are given an R score of 2, and so forth, down to the last 20 percent, who are given an R score of 5.

1For a full discussion of RFM analysis using SQL statements, see David M. Kroenke, David J. Auer, Scott L. Vandenberg, and Robert C. Yoder, Database Processing: Fundamentals, Design, and Implementation, 15th edition (Upper Saddle River, NJ: Pearson, 2019).

Z06_KROE8149_09_SE_EXTC.indd 2 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-3

FIGURE C-1

The HSD-DW Dimensional Database with RFM Data

SALES_FOR_RFM Fact Table

TIMELINE Dimension Table

CUSTOMER Dimension Table

To calculate an F score, you re-sort the customers on the basis of how frequently they order. As before, the customers are again divided into five groups. The 20 percent of the customers who order most frequently are given an F score of 1, the next 20 percent most frequently ordering customers are given a score of 2, and so forth, down to the least fre- quently ordering customers, who are given an F score of 5.

To calculate an M score, you re-sort the customers according to the average amounts spent on their orders. The 20 percent who have placed the largest orders on average are given an M score of 1, the next 20 percent are given an M score of 2, and so forth, down to the 20 percent who spend the least, who are given an M score of 5.

Figure C-2 shows sample RFM data for Heather Sweeney Designs. (Note that these data have not been calculated from the sample data and are for illustrative purposes only.) The first customer, Ralph Able, has a score of {1 1 2}, which means that he has ordered recently and orders frequently. His M score of 2 indicates, however, that he does not order the most expensive goods. From these scores, the salespeople can surmise that Ralph is a good cus- tomer who may be open to purchasing more expensive goods or higher quantities of goods.

FIGURE C-2

The RFM Score Report

Each customer is ranked for R (recent), F (frequent), and M (money) characteristics—1 is highest (best) and 5 is lowest (worst) score

SQL Server 2017, SQL Server Management Studio, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 3 04/07/19 9:05 PM

C-4 Online Extension C Advanced Business Intelligence and Big Data

Susan Baker is above average in terms of how recently she has shopped and how fre- quently she shops, but her purchases are average in value. Sally George is truly in the mid- dle. Based on Jenny Tyler’s scores, she has not ordered in some time, but in the past, when she did order, she ordered frequently, and her orders were of the highest monetary value. These data suggest that Jenny may be going to another vendor. Someone from the sales team should contact her immediately. However, no one on the sales team should be talking to Chantel Jacobs. She has not ordered for some time, she does not order frequently, and when she does order, she only buys inexpensive items and not many of them.

Reporting System Components Figure C-3 shows the major components of a reporting system. Data from disparate data sources are read and processed. As shown, reporting systems can obtain data from opera- tional databases, data warehouses, and data marts.

A reporting system maintains a database of reporting metadata. The metadata describe reports, users, groups, roles, events, and other entities involved in the reporting activity. The reporting system uses the metadata to prepare and deliver appropriate reports to the proper users in the correct format on a timely basis. As shown in Figure C-3, reports can be prepared in a variety of media or formats. Figure C-4 lists report characteristics, which we describe in more detail next.

Report Types Some reports are static reports. They are prepared once from the un- derlying data, and they do not change. A report of the past year’s sales, for example, is a static report. Other reports are dynamic reports—at the time of their creation, the reporting system reads the latest, most current data and generates the report using those fresh data. Reports on today’s sales and on current stock prices are dynamic reports.

Query reports are prepared in response to information entered by users. Google web search is an example of a reporting system that uses query reports: You enter the keywords you want to search for, and Google’s reporting system searches its database and generates a response that is custom-built to your query and possibly your location or other factors. Within a specific organization, such as Heather Sweeney Designs, a query report could be generated to show current inventory levels. The user would enter item numbers, and the reporting system would respond with inventory levels of those items.

Report Media As illustrated in Figure C-3 and summarized in Figure C-4, reports are delivered via many different channels. Some reports are printed on paper or its elec- tronic equivalents, such as PDF format. Other reports are delivered via Web portals.

FIGURE C-3

Components of a Reporting System

Reporting System

Metadata

Paper Report

Web Portal

Digital Dashboard

E-mail Alert

Users

XML Web Service

Operational Database

Reporting System

Author Manage Deliver

and/or

Data Warehouse Database

Data Mart Database

Z06_KROE8149_09_SE_EXTC.indd 4 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-5

An organization might place a sales report on the sales department’s Web portal and a re- port on customers serviced on the customer service department’s Web portal.

A digital dashboard is an electronic display that is customized for a particular user. Companies such as Google, MSN, and Yahoo! offer digital dashboard services that you might have seen or used. Users can define the content they want to see—say, a local weather forecast, a list of stock prices, and a list of news sources—and the vendor con- structs a customized display for each user. Such pages are called, for example, myhomemsn. com and My Yahoo!. Other dashboards are designed specifically for organizations. Executives at a manufacturing organization, for example, might have a dashboard that shows up-to-the-minute production and sales activities. These dashboards may be pro- vided by companies such as IBM or SAP.

Reports can also be delivered via alerts. Users can indicate that they want to be noti- fied of news and events by email or cell phone. “Smart cell phones” such as the iPhone and those using the Android operating system are capable of displaying Web pages and can use digital dashboards.

Finally, reports can be delivered to other information systems, sometimes using XML or other Web technologies. This style of reporting is particularly useful for interorganiza- tional information systems, such as supply chain management.

Report Modes The final report characteristic summarized in Figure C-4 is the report mode. A push report is sent to users based on a predetermined schedule. Users receive the report without any action on their part. In contrast, users must request a pull report. To obtain a pull report, a user goes to a Web portal or digital dashboard and clicks a link or button to cause the reporting system to produce and deliver the report.

Report System Functions As shown in Figure C-3, report systems serve three functions: report management, report authoring, and report delivery. Report management consists of defining who receives what reports, when, and by what means. Most report management systems enable the report system administrator to define user accounts and user groups and to assign users to one or more groups. For example, all the salespeople would be assigned to the Sales group, all upper-level management would be assigned to the Executive group, and so forth. All these objects and assignments are stored in the reporting system metadata shown in Figure C-3.

Report authoring involves connecting to the required data sources, creating the report structure, and formatting the report. Reports created using a report authoring system are then assigned to groups and users. Assigning reports to groups saves the administrator work; when a report is created, changed, or removed, the administrator need only change the report assignments of the group, and all the users in the group will inherit the changes. The report assignment metadata not only includes the user or group and the reports as- signed but also indicates the format of the report that should be sent to the user, the chan- nel by which the report will be delivered, and whether the report is to be pushed or pulled.

FIGURE C-4

Report Characteristics

Type Media Mode Static Paper Push

Dynamic Web portal Pull

Query Digital dashboard

Online analytical processing (OLAP)

Email/alert

XML Web service and application specific

Z06_KROE8149_09_SE_EXTC.indd 5 04/07/19 9:05 PM

C-6 Online Extension C Advanced Business Intelligence and Big Data

If it is to be pushed, the administrator declares whether the report is to be generated on a regular schedule or as an alert based on a specific event in a database.

The report delivery function of a reporting system pushes reports or allows them to be pulled based on the report management metadata. Reports can be delivered by hand or via an email server, a Web portal, XML Web Services, or other program-specific means. The report delivery system uses the operating system and other program security components to ensure that only authorized users receive authorized reports, and it also ensures that push reports are produced at appropriate times.

For query reports, the report delivery system serves as an intermediary between the user and the report generator. It receives a user query request, such as the item numbers in an inventory query, passes the query request to the report generator, receives the resulting report, and delivers the report to the user.

DATA MINING Instead of the basic calculations, filtering, sorting, and grouping used in reporting applica- tions, data mining involves the application of sophisticated mathematical and statistical techniques to find patterns and relationships that can be used to classify data and predict future outcomes. As shown in Figure C-5, data mining represents the convergence of sev- eral methodologies. Data mining techniques have emerged from the statistical and mathe- matics disciplines and from the artificial intelligence and machine-learning communities. In fact, data mining terminology embraces an odd combination of terms used by these differ- ent disciplines.

Data mining techniques take advantage of developments for processing enormous da- tabases that have emerged in the past two decades. Of course, all these data would not have been generated were it not for fast and inexpensive computers, and without such comput- ers, results from the new techniques would be impossible to produce in a reasonable timeframe.

Many data mining techniques are sophisticated and difficult to use. However, such techniques are valuable to organizations; and some business professionals, especially those in finance and marketing, have developed expertise in their use. Almost all data mining techniques require specialized software. Popular data mining products are Enterprise Miner from SAS Corporation, SPSS Modeler from IBM, and HP Vertica Analytics Platform from Hewlett-Packard. However, there is a movement to make data mining avail- able to more users. For example, Oracle offers data mining functionality via the

FIGURE C-5

Convergence of Disciplines for Data Mining

Data Mining Data

Management Technology

Artificial Intelligence Machine Learning

Statistics/ Mathematics

Huge Databases

Cheap Computer Processing and

Storage

Sophisticated Marketing, Finance, and Other Business

Professionals

Z06_KROE8149_09_SE_EXTC.indd 6 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-7

“Oracle Advanced Analytics” option, with a GUI interface as part of SQL Developer. Microsoft offers SQL Server Analysis Services as an add-on to SQL Server; it provides data mining and other functionality. Data mining techniques fall into two broad categories: un- supervised and supervised.

Unsupervised Data Mining When using unsupervised data mining techniques, analysts do not create a model or hy- pothesis prior to beginning the analysis. Instead, the data mining technique is applied to the data and results are observed. After the analysis, explanations and hypotheses are cre- ated to explain the patterns found.

One commonly used unsupervised technique is cluster analysis. With cluster analysis, statistical techniques are used to identify groups of entities that have similar characteristics. A common use for cluster analysis is to find customer groups in (1) order data and (2) cus- tomer demographic data. For example, Heather Sweeney Designs could use cluster analy- sis to determine which groups of customers are associated with the purchase of specific products. For example, a cluster analysis could be created using the HSD-DW data table to create the OLAP reports. In this case, the cluster analysis tool might indicate that there are different sales patterns for the Dallas area and the non-Dallas area. For example, sales of specific videos may differ markedly between the two clusters. Market basket analysis is another form of unsupervised data mining. As this is one of the most common and impor- tant forms of data mining, we will describe it in more detail next.

Market Basket Analysis Data mining techniques are usually complex. However, mar- ket basket analysis is a data mining technique that can be readily implemented with pure SQL (although we will not pursue that implementation in this book). All the major data mining products have features and functions to perform market basket analysis. Market basket analysis is also known as association rules.

Suppose that you run a diving shop, and one day you realize that one of your sales- people is much better than others at up-selling your customers. Any of your sales associates can fill a customer’s order, but this particular salesperson is especially able to sell customers items in addition to those for which they ask. One day you ask him how he does it.

“It’s simple,” he says. “I just ask myself, ‘What is the next product they’ll want to buy?’ If someone buys a dive computer, I don’t try to sell her fins. If she’s buying a dive com- puter, she’s already a diver, and she already has fins. But, look, these dive computer displays are hard to read. A better mask makes it easier to read the display and get the full benefits from the dive computer.” Thus, the market basket analysis might include an association rule that says, “If a customer buys a dive computer, then that customer will also buy a mask.” Clearly not all customers buying a dive computer will also buy a mask, of course, so the market basket analysis will need to determine the likelihood that this will occur.

Market basket analysis is a data mining technique for determining such patterns and rules. A market basket analysis shows the products that customers tend to purchase at the same time. Several different statistical techniques can be used to generate a market basket analysis. Here we discuss a technique that involves, as implied earlier, conditional probabilities.

Figure C-6 shows hypothetical data from 1,000 transactions at a dive shop. The first row of numbers (shaded blue) under each column is the total number of transactions that include the product in that column. For example, the 270 near the top of the Mask column means that 270 of the 1,000 transactions include the purchase of a mask. The 120 under Dive Computer means that 120 of the 1,000 purchase transactions included a dive computer.

Note that in this example, every transaction involves one or two items among those listed in Figure C-6; those transactions with two different items will be counted in two col- umns of the blue row. Also note that some of the 1,000 transactions do not contain any of the five products listed in the table (e.g., somebody purchases a wet suit and nothing else).

Z06_KROE8149_09_SE_EXTC.indd 7 04/07/19 9:05 PM

C-8 Online Extension C Advanced Business Intelligence and Big Data

You can use the numbers in the blue row to estimate the probability that a customer will purchase an item. Because 270 out of 1,000 transactions included a mask, you can esti- mate the likelihood that a customer will buy a mask to be 270/1,000, or .27. Similarly, the likelihood of a tank purchase is 200/1,000, or .2, and the likelihood of a fins purchase is 280/1,000, or 0.28.

The next five rows in this table show the occurrences of transactions that involve two items. For example, the last column indicates that 50 transactions included both a dive computer and a mask, 30 transactions included a dive computer and a tank, 20 included a dive computer and fins, 10 included a dive computer and weights, 5 included a dive com- puter and another dive computer (meaning the customer bought two dive computers), and 5 transactions had a dive computer and no other product. Note the symmetry in the table: The numbers for (mask, fins) and (fins, mask) are the same, and so on.

These data are interesting, but we can refine the analysis by computing additional fac- tors. Marketing professionals define support as the probability that two items will be pur- chased together. From these data, the support for fins and mask is 150 out of 1,000, or .15. Similarly, the support for dive computer and mask (the combination cited earlier by a sales- person) is 50/1,000, or .05.

Confidence is defined as the probability of a customer buying one product, given that he or she is buying another product. The confidence of fins, given that the customer is pur- chasing a mask, is the number of purchases of fins and masks out of the number of pur- chases of masks. Thus, in this example, the confidence is 150 out of 270, or .55556. The confidence that a customer purchases a tank, given that the customer is purchasing fins, is 40 out of 280, or .14286. As another example, the confidence in the rule alluded to by our star salesperson (“If a customer buys a dive computer, then that customer will also buy a mask”) is 50>120 = .41667, or roughly 42 percent.

Lift is defined as the ratio of confidence divided by the base probability of an item purchase. The lift for fins, given a mask, is the probability that a customer buys fins (given that the customer is purchasing a mask) divided by the overall probability that the cus- tomer buys fins. If the lift is greater than 1, then the probability of buying fins goes up when a customer buys a mask; if the lift is less than 1, the probability of buying fins goes down when a customer buys a mask.

For the data in Figure C-6, the lift for fins, given a mask purchase, is .55556/.28 or 1.98. This means that when someone purchases a mask, the likelihood he or she will also pur- chase fins almost doubles. The lift for fins, given a dive computer purchase, is 20/120 (the confidence of fins, given a dive computer) divided by .28, the probability that someone buys fins (280 of the 1,000 transactions involved fins). Therefore, 20/120 is .16667, and

FIGURE C-6

A Market Basket Analysis Example

1,000 Transactions

Mask

Tank

Fins

Weights

Dive Computer

270 200 280 130 120

20 20 150 20 50

20 80 40 30 30

150 40 10 60 20

20 30 60 10 10

50 30 20 10 5

10 – – – 5

Dive Computer

No Additional Product

Example: P (Fins & Mask) = 150 / 1000 = .15 Example: P (Fins | Mask) = 150 / 270 = .55556 Example: P (Fins | Mask) / P (Fins) = .55556 / .28 = 1.98 P (Mask | Fins) / P (Mask) = 150 / 280 / .27 = 1.98

Support = P (A & B) Confidence = P (A | B) Lift = P (A | B)/ P (A)

Note:

Z06_KROE8149_09_SE_EXTC.indd 8 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-9

.16667/.28 is .59525. So the lift for fins, given purchase of a dive computer, is just under .6, meaning that when a customer buys a dive computer the likelihood that he or she will buy fins decreases. Finally, returning to our salesperson’s example, the lift for a mask, given the purchase of a dive computer, is .41667 (the confidence in our “rule”) divided by the overall likelihood of buying a mask, which is .27. This gives a lift of .41667>.27 = 1.5432, mean- ing that our salesperson’s intuition was correct: Purchasing a dive computer increases the odds of a mask being purchased at the same time. Note that, as shown in the last line of Figure C-6, lift is symmetrical. If the lift of fins, given purchase of a mask, is 1.98, then the lift of a mask, given purchase of fins, is also 1.98.

Supervised Data Mining When using supervised data mining techniques, data miners develop a model prior to the analysis and then apply statistical techniques to the data to estimate parameters of the model. For example, suppose that marketing experts at a communications company be- lieve that the use of cell phone weekend minutes is determined by the age of the customer and the number of months the customer has had the cell phone account. A data mining analyst could then run a statistical analysis technique known as regression analysis to de- termine the coefficients of the equation of that model. A possible result is:

CellPhoneWeekendMinutes = 12 + (17.5 * CustomerAge) + (23.7 * NumberMonthsOfAccount)

Considerable skill is required to interpret and adjust the quality of such a model: The software will create an equation, but whether the equation becomes a good predictor of future cell phone usage depends on a variety of factors beyond the scope of this book.

Three other popular supervised data mining techniques are decision tree analysis, lo- gistic regression, and neural networks. Decision tree analysis classifies customers or other entities of interest into two or more groups, according to history. Logistic regression pro- duces equations that offer probabilities that particular events will occur. Common applica- tions of logistic regression are using donor characteristics to predict the likelihood of a do- nation in a given period and using customer characteristics to predict the likelihood that customers will switch to another vendor. Neural networks are complex statistical predic- tion techniques. The name is actually a misnomer—although there is some similarity be- tween the structure of a neural network and a network of biological neurons, the similarity is only superficial. In data mining, neural networks are just a technique for creating very complex mathematical functions for making predictions.

Decision Trees The potential for straightforward graphical representation makes de- cision trees possible to interpret without a lot of background and training. Constructing decision trees, on the other hand, requires sophisticated algorithms and enough experience to properly parameterize those algorithms. In this section, we will focus on the intuition behind decision tree structure and usage by examining an example that is further explored in the exercises.

A decision tree intuitively represents a set of rules that can be easily expressed in English or in SQL. As a simple example, consider the situation of a child deciding whether to read a particular book. Based on his or her experience, the child has determined some characteristics of books that are likely to make them either good or bad choices for the child to read. For example, the child may not like books that are too long and may like short books only if they have plenty of pictures. Without realizing it, that child is using a decision tree similar to that shown in Figure C-7.

Z06_KROE8149_09_SE_EXTC.indd 9 04/07/19 9:05 PM

C-10 Online Extension C Advanced Business Intelligence and Big Data

The structure in Figure C-7 is referred to as a tree (in computing, we draw trees upside-down, with the root of the tree at the top of the diagram and the leaves at the bot- tom). Any tree, including decision trees, will have just one root and will have just one way to get from the root to any specific leaf. In this example, as with all decision trees, we begin with a single question asked at the root of the tree. We will represent questions by rectan- gles and decisions using ovals.

The question first asked by the child is whether the book is less than 100 pages long or at least 100 pages long. The root question of the tree contains the name of the attribute be- ing examined, in this case the length of the book. If the book is at least 100 pages long, then the questioning follows the arrow coming out of the right side of the root. At the end of that arrow is a leaf with the word NO, indicating that the child is not interested in reading this book. Note that the arrow is labeled with the value(s) of the Length attribute that will lead the questioning that way.

If the answer to the question at the root is that the book is less than 100 pages long, then another question is required in order to make the final decision. In this case, we follow the ar- row to the left of the root (labeled “< 100”) to arrive at the second question. This question is based on how many pictures the book has. If the book has at least one picture per page, then the child wants to read the book. This corresponds to following the arrow coming from the right side of the PicturesPerPage rectangle. This arrow is labeled, as elsewhere, with the condi- tions under which it will be followed (PicturesPerPage > = 1). On the other hand, if the book has less than one picture per page, on average, then the left arrow will be followed from the PicturesPerPage rectangle and the decision will be to not read the book.

Where did this decision tree come from? It is based on the child’s experience reading and examining other books. In general, a decision tree is created in two main phases. In the training phase, the decision tree is constructed based on data found, for example, in a data warehouse. These data will include the proper classification (answer) for each record. In the child’s books example, the data would include such things as “War and Peace is over 100 pages long, it has no pictures, and I did not like it” or “Green Eggs and Ham is less than 100 pages long, it has many pictures, and I did like it.” These are the training data for the decision tree.

The exact process used to build a decision tree can be very complicated. There are many algorithms for creating decision trees, and they need to determine which attribute should be used for each question and which values of each attribute should be used to guide the ques- tioning to the next level down in the tree. We will not discuss these algorithms further.

The second phase of decision tree creation is called the testing phase. In this phase, we give the tree some “new” data points for which we already know the answer and we judge its results. This may cause us to alter the decision tree in various ways. After the testing phase is complete, the tree is ready to be deployed and used to make future decisions. It can, of course, always be refined more in the future as we learn how well it continues to classify the data. For example, as the child grows up, he or she may learn to like longer books a little better and could then change the “100” values in the decision tree to “200.”

A decision tree represents a set of rules that is used to make a decision about a record that represents a sample or an event. Ideally, this tree will represent a short series of simple questions. The rectangles and ovals in our trees are called nodes. Each rectangular node in

FIGURE C-7

A Decision Tree Example

Length

PicturesPerPage NO

NO YES

>= 100< 100

< 1 >= 1

Z06_KROE8149_09_SE_EXTC.indd 10 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-11

the decision tree is a question, and each oval-shaped node is a final answer (and a leaf). In our first example in Figure C-7, the questions all have only two possible outcomes, but in general each question can have any number of possible results (at least two, of course). Starting at the root, the answer to the question determines which node to visit next.

This process is continued until a leaf is reached. The full series of questions from the root to the leaf represents a classification rule. For example, in the case of Green Eggs and Ham mentioned earlier, the classification rule used is:

Length < 100 and PicturesPerPage >= 1

To illustrate the difficulties of constructing an accurate, efficient decision tree, con- sider the tree in Figure C-8, which solves the same problem as that in Figure C-7. The deci- sion tree in Figure C-8 will come to the same decisions about books as in Figure C-7, but for some books this decision will now take longer to arrive at. In particular, long books with many pictures will now require two questions to rule out rather than one. Another disadvantage of this decision tree will be explored in the exercises.

Now that we have seen a basic example and introduced the terminology, we will con- sider a more complete example and illustrate how it might appear in Oracle Data Mining, a subset of Oracle Advanced Analytics. In addition, we have not yet considered the accuracy of our decision trees: How well do they truly make decisions for both the training and test- ing data, and thus how well do we expect them to perform on future data? Consider the problem of deciding whether to go outside and ice skate on a nearby lake. From past expe- rience, we have the following data and decisions to guide us (the training data), based on observations of weather conditions (sunny or cloudy), temperature, and the number of days that ice fishing has been observed so far this season. Figure C-9 shows this training data.

Our decision (the “class” attribute) is either to skate or not skate, depending on vari- ous combinations of weather, temperature, and ice fishing duration. We want to build a

FIGURE C-8

A Second Decision Tree Example

PicturesPerPage

Length

NONO NO

>= 1< 1

< 100 >= 100

Length

YES

< 100 >= 100

FIGURE C-9

Training Data for a Decision Tree

Oracle Database 12c, SQL Developer 18.4, Oracle Corporation

Z06_KROE8149_09_SE_EXTC.indd 11 04/07/19 9:05 PM

C-12 Online Extension C Advanced Business Intelligence and Big Data

decision tree to help us make the decision in the future. Figure C-10 shows two small deci- sion trees created by Oracle Data Mining to classify the data as “skate” or “no skate.” Both trees consist of just one question and two leaves (note that Oracle Data Mining uses rect- angles for both question nodes and leaf, or decision, nodes).

The left decision tree in Figure C-10 bases its first question on the ICEFISHDAYS at- tribute, with the left leaf corresponding to ICEFISHDAYS 6 = 7 and the right leaf (labeled Node 2) corresponding to ICEFISHDAYS 7 7. Note that each leaf node contains a sup- port and confidence for the rule that led to that node. Node 1 represents all three records with ICEFISHDAYS 6= 7, and all of those (confidence 100%) are labeled “no skate.” Node 2, however, is not very good at predicting the outcome: It predicts that all 12 of the other training records correspond to “skate,” when in fact only 7 of the 12 do (giving us 58.33% confidence in this rule). For the decision tree to the right in Figure C-10, Oracle Data Mining has chosen a different first question, this time based on TEMP, with Node 1 corresponding to TEMP 6= 2. This decision tree does a better job at accurately determin- ing the decision, but both these trees are based on a very small amount of data. A more ac- curate decision tree for the skating data set is shown in Figure C-11, using the same notation as in Figures C-7 and C-8.

A more realistic decision tree generated by Oracle Data Mining, based on a larger data set,2 is shown in Figure C-12. In this figure, we see part of a decision tree that does a very good job deciding whether a mushroom is poisonous (p) or edible (e) based on some of its charac- teristics. The rule for Node 8 (a leaf node) is displayed at the bottom of the figure, and it displays the series of questions/answers (based on the mushroom’s odor, spore color, and bruising) that led to that leaf node, which represents 44 poisonous mushrooms and 0 edible mushrooms. Thus, if you examine a mushroom that has those characteristics, you can assume (with confidence 1.0 = 100,) it is poisonous.

2Poisonous mushroom identification data set obtained from the UCI Machine Learning Repository: Lichman, M. (2013), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, Irvine, School of Information and Computer Science. Accessed December 2018.

FIGURE C-10

Oracle Data Mining Decision Trees for the Ice Skating Example

Oracle Database 12c, SQL Developer 18.4, Oracle Data Miner, Oracle Corporation

Z06_KROE8149_09_SE_EXTC.indd 12 04/07/19 9:05 PM

FIGURE C-11

A Decision Tree for the Skating Data Set

ICEFISHDAYS

WEATHER YES

YES

>= 10< 10

< 0

TEMP

>= 1 and < 12

>= 12

sunnycloudy

C-13

FIGURE C-12

Oracle Data Mining Partial Decision Tree and Rule for Poisonous Mushroom Data

Oracle Database 12c, SQL Developer 18.4, Oracle Data Miner, Oracle Corporation

Z06_KROE8149_09_SE_EXTC.indd 13 04/07/19 9:05 PM

C-14 Online Extension C Advanced Business Intelligence and Big Data

Big Data is a hot topic. Big Data is a “buzz word.” According to the Merriam-Webster dic- tionary, Big Data (aka big data) is “an accumulation of data that is too large and complex for processing by traditional database management tools.”3 And yet, as danah boyd (who prefers all lowercase letters in her name) and Kate Crawford4 point out, United States cen- sus data sets that have been gathered over many decades are larger than some data sets now considered Big Data. They note the social phenomena aspects of Big Data and write that:

We define Big Data as a cultural, technological, and scholarly phenomenon that rests on the interplay of:

(1) Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets.

(2) Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims.

(3) Mythology: the widespread belief that large data sets offer a higher form of intelli- gence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.5

Most users of Big Data would probably agree with the first two parts of this definition, while we should definitely be thinking about the third part and its implications.

The Three Vs In early 2001, Doug Laney, working as a member of the META Group (which is now part of Gartner), needed a way to explain Big Data to his clients. He created and popularized the 3V framework for discussing Big Data.6

The 3V framework discussed Big Data in terms of:

• Volume: the extremely large amount of data that needs to be stored. • Velocity: the speed and continuous nature of data acquisition. • Variety: the many different forms of data being acquired and stored.

Volume In their 2012 book Understanding Big Data7 (an ebook available from IBM and sponsored by IBM), Paul Zikopoulos and his colleagues note that in 2012 Twitter needed to store more than 7 terabytes of data per day and that Facebook was generating 10 terabytes daily. They projected that 35 zettabytes of data would need to be stored by 2020. IBM, in its illustration The FOUR V’s of Big Data,8 projects data growth of 40 zettabytes in 2020 and claims that 2.3 trillion gigabytes of data are currently created each day! Figure 7-1 in the text- book shows what these terms mean relative to each other, and remember that as this is being

3Merriam-Webster online dictionary at http://www.merriam-webster.com/dictionary/big%20data. Accessed December 2018.

4danah boyd and Kate Crawford, “Critical Questions for Big Data,” in Information, Communication, and Society, 15(5), 2012, 662-679, DOI: 10.1080/1369118X.2012.678878.

5Ibid., p. 663.

6Doug Laney, 3-D Data Management: Controlling Data Volume, Velocity, and Variety (Stamford, CT: META Group Inc., 2001). See Doug Laney’s January 12, 2014, Garner Blog Network article Deja VVVu: Others Claiming Gartner’s Construct for Big Data (http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming- gartners-volume-velocity-variety-construct-for-big-data/ (accessed December 2018)—there is a PDF ver- sion of the original paper available there). 7Paul Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, and George Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data (New York: McGraw-Hill Bulk Sales, 2012). Available online at http://www.ibmbigdatahub.com/whitepaper/understanding-big-data-ebook (ac- cessed December 2018). 8Available at http://www.ibmbigdatahub.com/infographic/four-vs-big-data (accessed December 2018).

WHAT IS BIG DATA?

Z06_KROE8149_09_SE_EXTC.indd 14 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-15

written in late 2018, 2-terabyte hard drives are not unusual for personal computers. IBM added veracity, having to do with the uncertainty of the quality and accuracy of data, to the list of Vs.

Velocity Saying that Twitter adds 7 terabytes (that is a one with 12 zeros!) per day not only is a measure of volume (the amount of data needing to be stored), it is also a measure of the speed at which the data arrives: 7 terabytes > day = over 290 gigabytes > hour = over 4.8 gigabytes > minute = over 81 megabytes > second. The data arrives, and it must be stored and (hopefully) processed. And it doesn’t stop—it keeps coming!

Variety Twitter stores tweets (IBM’s illustration claims 400 million tweets per day), which are short text messages of up to 280 characters (and stored in JSON format). YouTube stores and streams video content. Facebook stores text, pictures, “likes,” and friend con- nection data. Instagram stores pictures, “likes,” and comments (more text!). Pandora stores and streams music. The term variety reflects all the different types of content data that need to be stored.

Big Data and NoSQL Systems We have used the relational database model and SQL throughout this book up until Chapter 7. However, there is another school of thought that has led to what was originally known as the NoSQL movement but now is usually referred to as the Not only SQL movement.9 It has been noted that most, but not all, DBMSs associated with the NoSQL movement are nonrelational DBMSs. A NoSQL DBMS is often a distributed, replicated (i.e., cloud) database, as described in Chapter 7, and is used where this type of DBMS is needed to support large data sets. There have been several classification systems proposed for grouping NoSQL databases.

For our purposes, we will adopt and use a set of four categories of NoSQL databases:10

• Key-Value—Examples are DynamoDB and MemcacheDB • Document—Examples are ArangoDB, Couchbase Server, MongoDB, and

Microsoft Azure Cosmos DB • Column Family—Examples are Apache Cassandra and HBase • Graph—Examples are Neo4j and AllegroGraph

NoSQL databases are used by widely recognized Web applications—for example, both Facebook and Twitter use the Apache Software Foundation’s Cassandra database. In Chapter 7, we discussed document DBMSs. In this extension, we will discuss all four cate- gories. Many organizations today use both relational and NoSQL databases to fit specific needs—this is known as polyglot persistence.

A major motivating factor for the development of NoSQL database systems was the need to rapidly process large data sets. One must either “scale up” or “scale out” to gain storage and processing capacity. Maintaining a DBMS on a single computer system makes it easier to maintain database consistency, but there is a limit to the number of processors and storage that can be attached to a single computer system. Scaling up might initially work but will usu- ally become untenable as data storage and processing requirements grow rapidly. Many NoSQL database systems have built-in support for running on clusters of computers—scaling out—that allows capacity to be added fairly easily and cheaply. As mentioned in Chapter 7, this capacity growth is accomplished by partitioning (sharding) and replication.

9For a good overview, see the Wikipedia article NoSQL. 10This set of categories corresponds to the four categories used in the Wikipedia article NoSQL as Wikipedia’s taxonomy of NoSQL databases and is also used in Ian Robinson, Jim Webber, and Emil Eifrem, Graph Databases (Sebastopol, CA: O’Reilly Media, 2013).

Z06_KROE8149_09_SE_EXTC.indd 15 04/07/19 9:05 PM

C-16 Online Extension C Advanced Business Intelligence and Big Data

Sharding is a set of approaches for taking portions of a database and moving them to a different server in a cluster. In the relational world, we can move entire tables to other servers— for example, move customer orders as a separate shard—or we can extract sets of rows and move them to corresponding tables on another server—for example, move customers start- ing with letters A–H, I–P, and Q–Z to separate CUSTOMER tables. Of course, there has to be DBMS or application support for updating the rows on the correct shard.

Replication is accomplished by making one or more copies of a set of data and main- taining it on a different server. Each copy is called a replica set. Replication increases the processing capacity of our system by allowing several servers to access the data simultane- ously. It also can increase availability of our system since processing can continue in the event that a server fails. However, this can lead to consistency problems. If a data item is updated on one server, it will take some time before the update can propagate to the other replica sets on other servers. This is known as eventual consistency. Some applications may end up reading stale data. This may or may not be a problem, depending on the application. What if two different updates are made to the same data item at the same time on different servers? Again, the severity of the problem depends on the application, but there are ways to address this problem using timestamps and quorums. Quorums are briefly described next.

To increase the consistency of our database, we can request that database reads or writes be propagated to several of the replica sets before indicating to the application that the request has completed successfully. A typical scenario is to ensure that the read or write operation completes on a majority (quorum) of the servers. Increasing the number of replica sets that must respond increases consistency but makes processing slower. The quorum values for read operations can be different than for write operations. This availability versus consistency trade- off is one of many that must be taken into consideration when using shards or replicas. The CAP theorem, described in Chapter 7, delineates some limitations on the available trade-offs.

Aggregates The first three types of NoSQL systems (Key-value, Document, and Column Family), as well as some NoSQL Graph databases, can store aggregates. By this we mean more complex data structures than a simple row in a relational table, as aggregates can represent composite data (e.g., an address consists of street, city, and country data items) as well as lists (e.g., a set of phone numbers for a customer) that would normally be represented in separate relational ta- bles to store 1:N or 1:1 relationships. Thus, we have a collection of data items that are treated as a unit. Aggregate data can be described using the data modeling languages XML and JSON. XML will be covered in the next section; JSON was introduced in Chapter 7 and ad- ditional examples appear in this extension. Since relational databases are designed for (among other things) consistency, there is support for “atomic” transactions that require all updates to complete successfully before the data is committed to the database (see Chapter 6). For aggregate NoSQL databases, we have to think of “transactions” in a different way. Although operations on a single aggregate are atomic, there is usually not built-in support for operations across more than one aggregate data object. Thus, the design of the aggregates and the application itself become important considerations when consistency is a goal.

EXTENSIBLE MARKUP LANGUAGE (XML) Many NoSQL database systems store, process, and communicate data in the form of documents. XML and JSON are two common document formats used for this. This sec- tion describes XML. Some DBMSs support XML directly, for example BaseX and eXist, which support the XQuery standard as a query language. Other DBMSs, for example Microsoft’s SQL Server and IBM’s DB2, are XML-enabled to import and export table data as XML. Both XML and JSON can be supported as an attribute datatype, thus allowing a mixture of relational and document capabilities. Database systems that support XML can be considered a type of NoSQL document database.

Database processing and document processing need each other. This is particularly true in nonrelational databases. Database processing needs document processing for

Z06_KROE8149_09_SE_EXTC.indd 16 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-17

transmitting database views; document processing needs database processing for storing and manipulating data. However, even though these technologies need each other, it took the popularity of the Internet to make that need obvious.

As Web sites evolved, organizations wanted to use Internet technology to display and update data from their organizational databases. Web developers began to take a serious in- terest in SQL, database performance, database security, and other aspects of database pro- cessing. As Web developers created critical e-commerce applications that connected to data- bases, database practitioners became part of project teams that utilized Web technologies, and began to learn about HTML, the language used to mark up documents for display by Web browsers. At first, the database community scoffed at HTML because of its limitations, but it soon learned that HTML was the output of a more robust document markup language called SGML (Standard Generalized Markup Language). SGML was clearly as important to document processing as the relational model was to database processing. Obviously, this powerful language had some role to play in the display of database data, but what role?

In the early 1990s, the two communities began to meet, and the result of their work is a series of standards that describes a language called XML (Extensible Markup Language). XML is a subset of SGML, but additional standards and capabilities have been added to XML, and today XML technology is a hybrid of document processing and database pro- cessing. In fact, as XML standards evolved, it became clear that the communities had been working on different aspects of the same problem for many years. They even used the same terms, but with different meanings. You will see later in this extension how the term schema is used in XML for a concept that is different from the use of schema in the database world.

XML provides a standardized yet customizable way to describe the content of docu- ments. As such, it can be used to describe any database view, but in a standardized way. In addition, when used with the XML Schema standard, XML documents can automatically be generated from database data. Further, database data can automatically be extracted from XML documents. In addition, there are standardized ways of defining how document components are mapped to database schema components, and vice versa.

Today, XML is used for many purposes. One of the most important is its use as a stan- dardized means to define, communicate, and validate documents for processing over the Internet. XML plays a key role in Microsoft’s .NET initiative, and in 2001, Bill Gates called XML the “lingua franca of the Internet age.”

We will begin the discussion of XML by briefly describing its use for displaying Web pages. As you will learn, however, XML uses go far beyond Web page display. In fact, Web page display is one of the least important applications of XML. We begin with page display only because it is an easy way to introduce XML documents. After that, we will explain the XML Schema standard and discuss its use for database processing.

As you read this extension, keep in mind that XML standards and product capabilities are still changing. You can keep abreast of these changes by checking the Web sites www. w3c.org and www.xml.org. Learning as much as you can about XML and its relation to database processing is a good way to prepare yourself for a successful career using Web and database technologies.

XML as a Markup Language As a markup language, XML is significantly better than HTML in several ways. For one, XML provides a clean separation between document structure, content, and materialization—the way the document is displayed on a particular device. XML has facilities for dealing with each, and they cannot be confounded (mixed together), as they are with HTML.

Additionally, XML is standardized, but as its name implies, the standards allow for ex- tension by developers. With XML, you are not limited to a fixed set of elements such as <p>, <H1>, and <H2>; you can create your own. Third, XML eliminates the inconsistent tag use that is possible (and popular) with HTML. For example, consider the following HTML:

<H2>Hello World</H2>

Z06_KROE8149_09_SE_EXTC.indd 17 04/07/19 9:05 PM

C-18 Online Extension C Advanced Business Intelligence and Big Data

Although the <H2> tag can be used to mark a level-two heading in an outline, it can be used for other purposes, too, such as causing “Hello World” to be displayed in a particular font size, weight, and color. Because a tag has potentially many uses, we cannot rely on tags to discern the structure of an HTML page. Tag use is too arbitrary; it may mean a heading, or it may mean nothing at all. HTML is also lax about the use of closing tags, leading to in- consistent display of Web pages on different browsers. Some browsers support vendor- specific extensions, making it more difficult to develop applications that operate well across all browsers. XML and related languages go a long way to correct that problem.

As you will see, the structure of an XML document can be formally defined. Tags are de- fined in relationship to one another. In XML, if we find the tag <street>, we know exactly what data we have, where those data belong in the document, and how that tag relates to other tags.

Materializing XML Documents with XSLT The XML document shown in Figure C-13 shows both the document’s structure and con- tent. The second line in the document refers to a stylesheet that indicates how it is to be materialized for display. The stylesheet may contain CSS (cascading style sheets) statements to control the specific font size or colors used to display elements, as shown after the <style type="text/css"> line in Figure C-14.

The designers of XML created a clean separation among structure, content, and for- mat. The most popular way to materialize XML documents is to use XSLT (Extensible Style Language: Transformations). XSLT is a powerful and robust transformation lan- guage. It can be used to materialize XML documents into HTML, and it can be used for other purposes as well.

One common application of XSLT is to transform an XML document in one format into a second XML document in another format. A company can, for example, use XSLT to transform an XML order document in its own format into an equivalent XML order docu- ment in its customer’s format. We will only briefly discuss XSLT here. See www.w3.org for more information.

XSLT is a declarative transformation language. It is declarative because you create a set of rules that govern how the document is to appear after materialization instead of

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="DBC-e09- SeminarListStyleSheet.xsl"?>

<Location>San Antonio Convention Center</Location>

<SeminarTitle>Kitchen on a Budget</SeminarTitle>

</Seminar>

<Location>Dallas Convention Center</Location>

<SeminarTitle>Kitchen on a Big D Budget</SeminarTitle>

</Seminar>

</SeminarList>

FIGURE C-13

Seminar XML Document with Reference to an XSLT Stylesheet

Z06_KROE8149_09_SE_EXTC.indd 18 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-19

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet

version="1.0"

xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

xmlns="http://www.w3.org/1999/xhtml">

<xsl:output method="xml" encoding="UTF-8"/>

<xsl:template match="/">

<html>

<head> <title>Seminar Data</title> </head>

h1 {text-align:center; color:blue}

body {font-family:ariel,sans-serif;font-size:12pt; background-color:#FFFFFF}

div.seminardata {font-weight:bold;background-color:#3399 FF;color:#FFFFFF;padding:4px}

</style>

<body>

<p>

<h1>HSD Seminar Data</h1>

</p><hr />

<xsl:for-each select="SeminarList/Seminar">

<br/>

<xsl:value-of select="Location"/> <br/>

<xsl:value-of select="SeminarDate"/> <br/>

<xsl:value-of select="SeminarTime"/>

</xsl:for-each>

</div>

</body>

</html>

</xsl:template>

</xsl:stylesheet>

FIGURE C-14

The Heather Sweeney Designs Seminar Stylesheet

specifying a procedure for materializing document elements. It is transformational because it transforms the input document into another (output) document.

The stylesheet to display the data in Figure C-13 is shown in Figure C-14. A stylesheet is used by XSLT to indicate how to transform the elements of the XML document into an- other format, in this case an HTML document that will be acceptable to a browser.

The XSLT processor copies the elements of the stylesheet until it finds a command in the format {item, action}. When it finds such a command, it searches for an instance of the indicated item; when it finds one, it takes the indicated action. For example, when the XSLT processor encounters

<xsl:for-each select = "SeminarList/Seminar">

it starts a search in the document for an element named SeminarList. When it finds such an element, it looks further within the SeminarList element for an element named Seminar. If a

Z06_KROE8149_09_SE_EXTC.indd 19 04/07/19 9:05 PM

C-20 Online Extension C Advanced Business Intelligence and Big Data

match is found, it takes the actions indicated in the loop that ends with the last occurrence of </xsl:for-each> in the file. Within the loop, styles are set for each element in the Seminar document.

Here we are creating an XML document that can be viewed using a Web browser to display the list of Seminars at Heather Sweeney Designs, as shown in Figure C-15, which is the result of applying the stylesheet in Figure C-14 to the document in Figure C-13. XSLT processors are context oriented; each statement is evaluated in the context of matches that have already been made. Thus, the following statement:

<xsl:value-of select = "SeminarTitle"/>

operates in the context of the SeminarList/Seminar match that was made earlier. Browsers have built-in XSLT processors. You need only supply the document to the

browser; it will locate the stylesheet and apply it to the document for you. In this example, the document in C-13 has a link to the stylesheet file in the same directory.

XML Document Type Declarations and XML Schema DTDs (Document Type Declarations) were the XML community’s first attempt at devel- oping a document structure specification language. It is recognizable by the keyword DOCTYPE in the document. DTDs work, but they have some limitations, and, embarrass- ingly, DTD documents are not XML documents. To correct these problems, the W3C Committee defined another specification language called XML Schema. Today, XML Schema is the preferred method for defining and validating document structure, so we will skip DTDs and cover XML Schemas instead.

XML Schemas are XML documents. This means that you use the same language to define an XML Schema as you would use to define any other XML document. It also means that you can validate an XML Schema document against its schema (a “meta- schema”) just as you would any other XML document.

If you are following this discussion, then you realize that there is a chicken-and-the-egg problem here. If XML Schema documents are themselves XML documents, what docu- ment is used to validate them? What is the schema of all of the schemas? There is such a document; the mother of all schemas is located at www.w3.org. All XML Schema docu- ments are validated against this document.

XML Schema is a broad and complex topic. Dozens of sizable books have been writ- ten just on XML Schema alone. Clearly, we will not be able to discuss even the major topics of XML Schema in this extension. Instead, we will focus on a few basic terms and concepts and show how those terms and concepts are used with database processing. Given this in- troduction, you will then be able to learn more on your own.

FIGURE C-15

HTML Result from Application of Seminar Stylesheet

Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 20 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-21

XMLSpy 2018, Altova Corporation

XML Schema Validation Figure C-16(a) shows an XML Schema document that can be used to represent the rules for a valid XML SeminarList file at Heather Sweeney Designs. An XML document is schema-valid if it conforms to the XML Schema described in a separate XML Schema

FIGURE C-16

Example XML Schema Document

(a) XML Schema

(b) Graphical Representation of XML Schema

<?xml version="1.0" encoding="UTF-8"?>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<xsd:element name="SeminarList">

<xsd:annotation>

<xsd:documentation>

This is an XML Schema describing a Seminar List at Heather Sweeney Designs.

</xsd:documentation>

</xsd:annotation>

<xsd:complexType>

<xsd:sequence>

<xsd:element name="Seminar" maxOccurs="unbounded">

<xsd:complexType>

<xsd:sequence>

<xsd:element name="SeminarID" type="xsd:integer"/>

<xsd:element name="SeminarDate" type="xsd:string"/>

<xsd:element name="SeminarTime" type="xsd:string"/>

<xsd:element name="Location" type="xsd:string"/>

<xsd:element name="SeminarTitle" type="xsd:string"/>

</xsd:sequence>

</xsd:complexType>

</xsd:element>

</xsd:sequence>

</xsd:complexType>

</xsd:element>

</xsd:schema>

Z06_KROE8149_09_SE_EXTC.indd 21 04/07/19 9:05 PM

C-22 Online Extension C Advanced Business Intelligence and Big Data

document. Because the schema is itself an XML Schema document, it is to be validated against the mother of all schemas, the one at www.w3.org, as indicated on the second line of the schema document that starts with xsd:schema. This same reference will be used in all XML Schemas in every organization worldwide. (By the way, this reference address is used only for identification purposes. Because this schema is so widely used, most schema vali- dation programs have their own built-in copy of it.)

The xsd:schema statement not only specifies the document that is to be used for vali- dation, it also establishes a labeled namespace. Namespaces are a complicated topic in their own right, and we will only give a brief overview here. In this first statement, the label xsd is defined by the expression xmlns:xsd. The first part of that expression stands for xml namespace, and the second part defines the label xsd. Notice that all of the other lines in the document use the label xsd. The expression xsd:complexType simply tells the validat- ing program to look into the namespace called xsd (here, the one specified as www. w3.org/2001/XMLSchema) to find the definition of the term complexType.

The name of the label is up to the designer of the document. You could change xmlns:xsd to xmlns:xse or to xmlns:mylabel, and you would set xse or mylabel to point to the w3 document. XML namespaces are used to combine different vocabularies into the same XML Schema. They can be used to define and support domains and to disambiguate terms. The need for the latter occurs when a document contains synonyms.

For example, consider a document that has two different uses for the term Instrument. Suppose one usage of this term refers to musical instruments and has the subelements {Manufacturer, Model, Material}, as in {Horner, Bflat Clarinet, Wood}, and a second use of this term refers to electronic instruments and has the subelements {Manufacturer, Model, Voltage}, as in {RadioShack, Ohm-meter, 12-volt}. The author of the XML Schema for such a document can define two different namespaces where each contains one of these defini- tions. Then the complexType definition for each of these definitions of Instrument can be prefixed by the label of the namespace, as was done in our schema documents when we used the label xsd. There is more to XML namespaces, and you will undoubtedly learn more as you work with XML.

Elements (and a Little About Attributes) Back to our HSD Seminar example: as shown in Figure C-16(a), XML Schemas consist of elements. Elements are either simple or complex. A simple element has a single data item. In Figure C-16(a), the elements SeminarID, SeminarDate, SeminarTime, Location, and SeminarTitle are all simple elements. A complex element can contain other elements that can be either simple or complex. In Figure C-16(a), Seminar is a complex element and is followed by (after the annotations section) a section with <xsd:complexType> </xsd:complexType> tags. It contains a sequence of the five simple elements SeminarID, SeminarDate, SeminarTime, Location, and SeminarTitle. It is possible to have complex types that contain other complex types.

Data types can be specified in simple elements. In Figure C-16(a), the element SeminarID is typed as xsd:integer and the element SeminarTitle is typed as xsd:string. By default, the cardinality of both simple and complex elements is 1.1, meaning that a single value is required, and no more than a single value can be specified. This is similar to a NOT NULL constraint in SQL schema definitions. If an element did not have a required value, we would add minOccurs = "0" to the element definition. This is similar to the NULL statement in SQL schema definitions. There can be multiple occurrences of the complex type Seminar due to the MaxOccurs = "unbounded" option on the Seminar element.

Schemas also contain attributes, which are usually used to provide additional informa- tion about elements. For example, we could choose to store the SeminarID as an attribute of Seminar by coding:

Z06_KROE8149_09_SE_EXTC.indd 22 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-23

A good rule of thumb is that elements represent data and attributes represent metadata, though this is not part of any XML standard.

The schema definition in Figure C-16(a) was used to validate the XML document in Figure C-13. There are several programs available that can validate schemas. A free pro- gram is Notepad++ (see https://notepad-plus-plus.org) with the XML plugin enabled. Another excellent product is Altova’s XMLSpy XML editing tool (see http://www.altova. com). Figure C-16(b) shows the XML Schema in a diagram format as drawn by XMLSpy. Being able to see an XML Schema as a diagram often makes it easier to interpret exactly what the XML Schema is specifying.

Creating XML Documents from Database Data Several relational database systems, such as SQL Server, Oracle Database, and MySQL have facilities for generating XML documents from database data using the FOR XML clause in an SQL query. Microsoft SQL Server uses the SQL SELECT . . . FOR XML state- ment to work with XML. Consider the following SQL statement:

/* *** SQL-Query-ExtC-01 *** */

SELECT *

FROM SEMINAR

FOR XML RAW;

The expression FOR XML RAW places the values of the columns as attributes in the resulting XML document. Figure C-17 shows an example of the results of a FOR XML RAW query to extract Seminar data. As expected, each column is placed as an attribute of the element named row.

It is also possible to cause SQL Server to place the values of the columns into elements rather than attributes by using FOR XML AUTO, ELEMENTS. For example, we can dis- play the data in the SEMINAR table using the SQL query:

/* *** SQL-Query-ExtC-02 *** */

SELECT *

FROM SEMINAR

FOR XML AUTO, ELEMENTS;

<row SeminarID="1" SeminarDate="2018-10-12" SeminarTime="11:00:00"

Location="San Antonio Convention Center" SeminarTitle="Kitchen on

a Budget" />

<row SeminarID="2" SeminarDate="2018-10-26" SeminarTime="16:00:00"

Location="Dallas Convention Center" SeminarTitle="Kitchen on a Big

D Budget" />

<row SeminarID="3" SeminarDate="2018-11-02" SeminarTime="08:30:00"

Location="Austin Convention Center" SeminarTitle="Kitchen on a

Budget" />

<row SeminarID="4" SeminarDate="2019-03-22" SeminarTime="11:00:00"

Location="Dallas Convention Center" SeminarTitle="Kitchen on a Big

D Budget" />

<row SeminarID="5" SeminarDate="2019-03-23" SeminarTime="11:00:00"

Location="Dallas Convention Center" SeminarTitle="Kitchen on a Big

D Budget" />

<row SeminarID="6" SeminarDate="2019-04-05" SeminarTime="08:30:00"

Location="Austin Convention Center" SeminarTitle="Kitchen on a

Budget" />

FIGURE C-17

FOR XML RAW Results in XML Document

Z06_KROE8149_09_SE_EXTC.indd 23 04/07/19 9:05 PM

C-24 Online Extension C Advanced Business Intelligence and Big Data

<Location>San Antonio Convention Center</Location>

<SeminarTitle>Kitchen on a Budget</SeminarTitle>

</SEMINAR>

<Location>Dallas Convention Center</Location>

<SeminarTitle>Kitchen on a Big D Budget</SeminarTitle>

</SEMINAR>

<Location>Austin Convention Center</Location>

<SeminarTitle>Kitchen on a Budget</SeminarTitle>

</SEMINAR>

<Location>Dallas Convention Center</Location>

<SeminarTitle>Kitchen on a Big D Budget</SeminarTitle>

</SEMINAR>

<Location>Dallas Convention Center</Location>

<SeminarTitle>Kitchen on a Big D Budget</SeminarTitle>

</SEMINAR>

<Location>Austin Convention Center</Location>

<SeminarTitle>Kitchen on a Budget</SeminarTitle>

</SEMINAR>

FIGURE C-18

FOR XML AUTO, ELEMENTS Results in XML Document

Figure C-18 shows the results of the query from Microsoft SQL Server. Here each at- tribute value appears as a separate element. Using another option, FOR XML EXPLICIT, you can cause SQL Server to place some columns into elements and other columns into at- tributes. However, the details of how this is done are beyond the scope of this discussion.

Z06_KROE8149_09_SE_EXTC.indd 24 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-25

Using XPath and XQuery to Query XML Documents There are two basic approaches to performing queries on XML documents. The XPath standard renders the document as a hierarchical structure, much like files are stored in a directory structure, to navigate and select elements in an XML file. Here is an example XPath query:

/* *** XML-Query-ExtC-01 *** */

.//SeminarList/Seminar/SeminarTitle[ends-with(., 'Budget')]

The results are:

<SeminarTitle>Kitchen on a Budget</SeminarTitle>

<SeminarTitle>Kitchen on a Big D Budget</SeminarTitle>

Note that the query finds all occurrences of a SeminarTitle that end in “Budget” in Seminar elements under the SeminarList collection in the XML document.

Another approach uses the more powerful XQuery standard, which uses a programming-language style FOR loop to iterate through the document:

/* *** XML-Query-ExtC-02 *** */

for $seminar in SeminarList/Seminar

let $location := $seminar/Location

let $date := $seminar/SeminarDate

where matches($location, "Dallas*")

order by ($date)[1]

return $seminar

This query structure is called FLWOR (For/Let/Where/Order by/Return). In this case we find each Seminar element in the SeminarList. We assign two variables named $lo- cation and $date to save the Location and SeminarDate for each Seminar. We then match each Location starting with the letters “Dallas” and return the Seminar data sorted by SeminarDate (although there was only one match). The results are shown here:

<Location>Dallas Convention Center</Location>

<SeminarTitle>Kitchen on a Big D Budget</SeminarTitle>

</Seminar>

Note that in general, the “return” clause can be used to create an arbitrarily complex XML document.

Why Is XML Important? At this point, you should have some idea of the nature of XML and the XML standards. You know that XML makes a clear separation between structure, content, and materialization. Structure is defined by either a DTD or an XML Schema document. Content is expressed in an XML document, and the materializations of a document are expressed in an XSL document.

Z06_KROE8149_09_SE_EXTC.indd 25 04/07/19 9:05 PM

C-26 Online Extension C Advanced Business Intelligence and Big Data

You may be asking, “These are interesting ideas, but why do they matter? What’s so important about all of this?” The answer to these questions is that XML processing pro- vides a standardized facility to describe, validate, and materialize any database view.

Suppose, for example, that the real estate industry agrees on an XML Schema docu- ment for property listings. Every real estate company that can produce data in the format of the schema can then exchange listings with every other real estate company. Given the schema, each company can ensure that it is transmitting valid documents and it can ensure that it is receiving valid documents. Further, each company can develop its own set of XSL documents to materialize property listings in whatever way it wants. Once the XSL docu- ments have been prepared, any listing from any participating agent can be displayed in the local agency’s materializations. There are XML standards in dozens of industries, ranging from Accounting, to Health to Human Resources to Insurance.

For another example, consider business-to-business e-commerce. Suppose that Wal- Mart wants to send orders to its vendors in a particular standardized format and that it wants to receive shipment responses to those orders in another particular standardized format. To do this, Wal-Mart can develop an XML Schema for Order documents and an- other for Shipment documents. It can then publish those XML Schemas on a Web site ac- cessible to its vendors. In this way, all vendors can determine how they will receive orders from Wal-Mart and how they should send their Shipment notifications back.

The schemas can be used by Wal-Mart and all of its vendors to ensure that they are send- ing and receiving only valid XML documents. Further, Wal-Mart can develop XSL docu- ments to cause the Order and Shipment documents to be transformed into the specific for- mats needed by its accounting, operations, marketing, and general management departments.

These XSL documents work for any Order or Shipment from any of its vendors. In all of these cases, once the XML Schema documents have been prepared and the XSL docu- ments have been written, all validation and materialization are done via automated pro- cesses. Thus, there is no need for any human to touch the Order document between its origination at Wal-Mart and the picking of the inventory at the supplier.

So, the only challenge that remains is to populate the XML documents with database data in accordance with the relevant XML Schema. SQL, as we have seen, can be used to populate simpler schemas, and newer technologies like ADO.NET can handle more sophisticated transformations of database data into XML documents. Both SQL and ADO.NET can also be used for the other direction, the transformation of XML documents into database data.

Additional XML Standards As you know, XML was developed as a series of standards. So far, we have mentioned XML, XSL, XSLT, XPath, XQuery, and XML Schema. You will probably encounter a number of other XML standards, and you can find the standards, their documentation, and some tutorials on the www.w3.org and www.xml.org Web sites.

The XML Standards Committee continues its important work, and more standards will be developed as the needs arise. At present, work is under way for developing security standards. XML will continue to be an important data interchange format for both rela- tional and nonrelational (specifically NoSQL) databases.

NOSQL DATABASE MANAGEMENT SYSTEMS XML is closely related to many nonrelational NoSQL database management systems. These systems can be divided into four major categories (key-value, column family, graph, and document), which are described in the following subsections. Most NoSQL DBMSs, no matter which of the four categories they belong to, have one thing in common: There is often no schema for the database, as there is in a relational database. There are also fewer restrictions on the way data can be structured. Such systems are often based on XML (de- scribed earlier) or JSON (described in Chapter 7, with additional examples later), but they may use other (or no) data structuring formats as well. These systems are also often called semistructured databases or unstructured databases (depending on the level of structuring

Z06_KROE8149_09_SE_EXTC.indd 26 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-27

available) as opposed to the highly structured relational data model. For example, a STUDENT relation in a relational DBMS is highly structured: All rows have the same number of columns with the same column names and column data types, and every column value is a single value (first normal form).

Some column family DBMSs are highly structured, like a traditional relational data- base; others more resemble document databases, which are semistructured (there is struc- ture to the data, but the structuring requirements are much more flexible than those of re- lations). At the far end of the spectrum from relations is unstructured data, which is found in key-value databases.

Key-Value Databases Key-value databases use a simple key and value pairing. Each key (similar to a relational DBMS primary key) appears only once in each database. An example is DynamoDB, which was developed by Amazon.com. Another key-value database is MemcacheDB. Apache Cassandra is typically classified as column family database, though it is built upon Amazon's DynamoDB and Google's BigTable, so it has some features of key-value databases as well.11

A key-value database is extremely simple and allows for easy distribution of the key- value pairs in a networked cluster of servers. Note that the value can be anything: a string, a large binary object, a list of things, a JSON object, and so on. Key-value databases are ideal for large amounts of data that need fast storage and retrieval of simple objects but do not need the full complexity of SQL queries, since the DBMS is unaware of the structures within the value object. Thus, queries cannot be made using internal fields. The basic op- erations available to query a key-value database are:

• get (key): retrieves the value associated with a key • set (key, value): creates or updates a key-value pair • delete (key): removes a key-value pair

These commands are typically issued from within a program in some programming lan- guage or from a command-line interface in some systems. Here are some sample data that might appear in a typical key-value database that stores information about email accounts and the last computer IP address from which those accounts were accessed:

("joe@somewhere.com",172.13.233.1)

("mary@nowhere.com",177.10.254.1)

The email addresses are the keys and the IP addresses are the values. Remember that the values can be much more complex or larger, but in a key-value database, any structure within the values must be managed by the application after the key-value database retrieves the value.

Column Family Databases The basis for much of the development of column family databases was a structured stor- age mechanism developed by Google named Bigtable, and column family databases are now widely available, with a good example being the Apache Software Foundation’s Cassandra project. Facebook did the original development work on Cassandra and then turned it over to the open source development community in 2008.

A generalized column family database storage example is shown in Figure C-19. The column family database storage equivalent of a relational DBMS (RDBMS) table has a very different construction. Although similar terms are used, they do not mean the same thing that they mean in a relational DBMS.

11See the Wikipedia article Apache_Cassandra and the Apache Cassandra Web site, http://cassandra. apache.org.

Z06_KROE8149_09_SE_EXTC.indd 27 04/07/19 9:05 PM

C-28 Online Extension C Advanced Business Intelligence and Big Data

FIGURE C-19

A Generalized Column Family Database Storage System

Name: LastName

Value: Able

Timestamp: 40324081235

(a) A Column

Name: LastName

Value: Able

Timestamp: 40324081235

Name: FirstName

Value: Ralph

Timestamp: 40324081235

CustomerNameSuper Column Name:

Super Column Values:

(b) A Super Column

Name: LastName

Value: Able

Timestamp: 40324081235

Name: FirstName

Value: Ralph

Timestamp: 40324081235

Customer Column Family Name:

RowKey001

Name: LastName

Value: Jacobs

Timestamp: 40335091055

Name: Phone

Value: 817-871-8123

Timestamp: 40335091055

Name: City

Value: Fort Worth

Timestamp: 40335091055

Name: FirstName

Value: Nancy

Timestamp: 40335091055

RowKey002

Name: EmailAddress

Value: Susan.Baker@elswhere.com

Timestamp: 40340103518

Name: LastName

Value: Baker

Timestamp: 40340103518

RowKey003

Name: FirstName

Customer

CustomerPhone

Customer Name

Name: FirstName

Customer Name

Super Column Family Name:

Rowkey001

Rowkey002

Rowkey003

Value: Ralph

Value: Nancy

Value: Susan

Timestamp: 40324081235

Timestamp: 40335091055

Timestamp: 40340103518

Name: LastName Name: AreaCode

Name: AreaCode

Name: AreaCode Name: PhoneNumber

Name: PhoneNumber

Timestamp: 40324081235 Timestamp: 40335091055 Timestamp: 40335091055

Timestamp: 40335091055 Timestamp: 40335091055

Value: Able Value: 210

Value: 817

Value: 210 Value: 281–7876

Value: 871–8123

Value: 281–7987

Timestamp: 40335091055

Timestamp: 40340103518 Timestamp: 40340103518 Timestamp: 40340103518

Name: LastName

Value: Jacobs

Value: Baker

(d) A Super Column Family

Z06_KROE8149_09_SE_EXTC.indd 28 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-29

The smallest unit of storage is called a column, but it is really the equivalent of an RDBMS table cell (the intersection of an RDBMS row and column). A column consists of three elements: the column name, the column value or datum, and a timestamp to record when the value was stored in the column. This is shown in Figure C-19(a) by the LastName column, which stores the LastName value Able.

Columns can be grouped into sets referred to as super columns. This is shown in Figure C-19(b) by the CustomerName super column, which consists of a FirstName col- umn and a LastName column and which stores the CustomerName value Ralph Able.

Columns and super columns are grouped to create column families, which are the col- umn family database storage equivalent of RDBMS tables, as they are typically stored to- gether. In a column family we have rows of grouped columns, and each row has a RowKey, which is similar to the primary key used in an RDBMS table. However, unlike an RDBMS table, a row in a column family does not have to have the same number of columns as an- other row in the same column family (i.e., in some column family DBMSs the data may be semistructured). This is illustrated in Figure C-19(c) by the Customer column family, which consists of three rows of data on customers.

Figure C-19(c) clearly illustrates the difference between structured storage column fami- lies and RDBMS tables: Column families can have variable columns and data stored in each row in a way that is impossible in an RDBMS table. This storage column structure is definitely not in 1NF as defined in Chapter 2, let alone BCNF! For example, note that the first row has no Phone or City columns, while the third row not only has no FirstName, Phone, or City columns but also contains an EmailAddress column that does not exist in the other rows.

All the column families are contained in a keyspace, which provides the set of RowKey values that can be used in the data store. RowKey values from the keyspace are shown be- ing used in Figure C-19(c) to identify each row in a column family. While this structure may seem odd at first, in practice it allows for great flexibility because columns to contain new data may be introduced at any time without modifying an existing table structure. It is up to the application, however, to handle the possibly different row structures.

As shown in Figure C-19(d), a super column family is similar to a column family but uses super columns (or a combination of columns and super columns) instead of columns. A super column is a named collection of related columns. The Cassandra DBMS supports super columns. Of course, there is more to column family database storage than discussed here, but now you should have an understanding of the basic principles of column family databases.

Graph Databases Based on mathematical graph theory, graph databases are composed of three elements:

1. Nodes 2. Properties 3. Edges

An example graph database management system is Neo4j. ArangoDB, used as an ex- ample of a document database in Chapter 7 and later in this extension, is also a graph DBMS (it is multimodel).

Nodes are equivalent to entities in E-R data modeling and tables (or relations) in data- base design. They represent the things that we want to keep track of or about which we want to store data. Properties are equivalent to attributes in E-R data modeling and columns (or fields) in database design. They represent the data items that we want to store for each node. Edges are similar to, but not identical to, the relationships in E-R data models and database designs. They are similar because they connect nodes as relationships connect entities, but they are different because they also store data. Edges can also have a “direction.”

Figure C-20 shows an abstract representation of a graph database based on part of the HSD database. Note that the some of the data in the HSD SEMINAR, CUSTOMER, and SEMINAR_CUSTOMER tables are replicated in this database. CUSTOMER data are

Z06_KROE8149_09_SE_EXTC.indd 29 04/07/19 9:05 PM

C-30 Online Extension C Advanced Business Intelligence and Big Data

here, SEMINAR data are here, and the data about what customer attended which seminar are also here. However, the graph database adds some new data. First, the edges labeled Attendees for a group named Attendees: While this grouping could be obtained by an SQL query in a relational database, it is a built-in component of the graph database. Similarly, the edge with ID 3001 and labeled “knows” adds entirely new data to the data set— additional data about the relationships between customers that does not exist in the origi- nal HSD database design. Clearly, the graph database can easily extend our original data model and provide additional pathways between nodes.

As a more concrete example, consider adding more data about which HSD customers know which other HSD customers. Figure C-20 has one example of this, but now we will add more, show its representation in ArangoDB (using that system’s graph database fea- tures), and demonstrate a simple graph-based query on the data. Figure C-21 shows some of the HSD customers and some data about who knows whom, in an ArangoDB screen- shot. Note that for this example we are assuming the “know” relationship is not necessarily symmetric: I can know who someone is without them knowing who I am. In Figure C-21, we see that Ralph Able knows who Susan Baker and Nancy Jacobs are; Nancy Jacobs knows who Chantel Jacobs is; and so on. The edges are labelled with the date the customer began to know the other customer.

FIGURE C-20

A Partial HSD Graph Database

ID: 1 FirstName: Nancy LastName: Jacobs

City: Fort Worth

ID: 2 FirstName: Chantel LastName: Jacobs

City: Fort Worth

ID: 3 FirstName: Ralph

LastName: Able City: San Antonio

SeminarID: 1 SeminarDate: 12-OCT-2018

Seminar Location: San Antonio Convention Center

ID: 1001 Label: attended

Date: 12-OCT-2018

ID: 1002 Label: attended

Date: 12-OCT-2018 ID: 1003 Label: attended

Date: 12-OCT-2018

ID: 2001 Label: Attendees

ID: 2003 Label: Attendees

ID: 3001 Label: knows

Date: 9-SEP-2016

ID: 2002 Label: Attendees

Z06_KROE8149_09_SE_EXTC.indd 30 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-31

Of course, we can run queries on graph databases. As an example, the following AQL query (AQL, the ArangoDB Query Language, is introduced in the context of document databases in Chapter 7) will give us the emails of all people who are known to someone who is known to Ralph Able or known to someone who is known to him:

/* *** AQL-Query-ExtC-01 *** */

FOR Cust IN 1..2 OUTBOUND

"CUSTOMERS/RalphAble" GRAPH "KnowsGraph"

RETURN DISTINCT Cust.EmailAddress

This query can be interpreted as follows: Starting with customer Ralph Able, and following the “Knows” edges either 1 or 2 steps, return the email addresses of every customer en- countered. The DISTINCT ensures that we do not return Chantel Jacobs twice, as she is known (indirectly) to Ralph Able in two different ways, as can be seen in Figure C-21. The result of this query in JSON format is:

[

"Nancy.Jacobs@somewhere.com",

"Chantel.Jacobs@somewhere.com",

"Susan.Baker@elsewhere.com"

]

Document Databases We introduced JSON and document databases in Chapter 7. In this section we will briefly review some of the basic concepts then extend our work with document modeling and que- rying to include a few more examples, giving a more complete picture of the possibilities within a document DBMS. Readers interested in more details of modeling and querying in document databases in general, and ArangoDB in particular, are encouraged to read online Appendix L of Database Processing,12 which describes some of these issues in the context of ArangoDB and document databases and includes enough detail to install and use ArangoDB effectively for a document database (see www.pearsonhighered.com/kroenke).

12David M. Kroenke, David J. Auer, Scott L. Vandenberg, and Robert C. Yoder, Database Processing: Fundamentals, Design, and Implementation, 15th ed. (Upper Saddle River, NJ: Pearson, 2019).

SamEagleton

14-JUL-2012

02-AUG-2017

17-Feb-2015

09-S E

P -2016

2 5 -A

U G

-2 0 0 9

ChantelJacobs

NancyJacobs

RalphAble

SusanBaker

FIGURE C-21

Some HSD Graph Data in ArangoDB

Courtesy of ArangoDB GmbH

Z06_KROE8149_09_SE_EXTC.indd 31 04/07/19 9:05 PM

C-32 Online Extension C Advanced Business Intelligence and Big Data

Document databases store data according to a document-oriented format, the two most popular of which are XML (Extensible Markup Language), described earlier, and JSON (Java Script Object Notation). An example DBMS is Couchbase Server, which uses JSON storage. Another example is Microsoft’s multimodel Azure Cosmos DB (available only in the cloud). Amazon's DynamoDB supports both key-value and document data storage. MongoDB is an- other popular document database; it uses BSON (Binary JSON) storage, which is a binary- encoded version of JSON documents, with a few additional data types available. In a document database, documents (e.g., JSON objects) are usually stored within a set or a “collection.”

JSON, which is used in several document databases, has some similarities to XML, but as we learned in Chapter 7, there is no notion of any schema external to the data. Also re- call the following basic concepts of JSON from Chapter 7: A document (“object” in JSON terminology) consists of a set of (field, value) pairs. Values can be simple values (strings or numbers) or other objects. Values can also be arrays of values or objects. The number, or- der, types, and (in some systems) names of columns are less restricted in a semistructured database; in addition, fields can be multi-valued. These features, some of which were dis- cussed earlier in the XML section, are common in NoSQL DBMSs. A document database can be a good choice when your data have some structure, but that structure may be com- plex or inconsistent among the various documents. Unlike key-value databases, document databases support queries based on the internal structure of the document.

Figure C-22 shows the entire Art Course database as an array of documents called Courses. Each document corresponds to one course. Each course contains the basic attributes of that course (CourseNumber, Course, CourseDate, and Fee). In addition, ArangoDB has automatically included three fields in each document: “_key” is an internally generated value

[

{

"_key": "558",

"_id": "Courses/558",

"_rev": "_XU2oaVC--F",

"CourseNumber": 4,

"Course": "Beg Oils",

"CourseDate": "10/15/2019",

"Fee": 350,

"Enrollments": [

{

"CustomerNumber": 7,

"CustomerLastName": "Myers",

"CustomerFirstName": "Lynda",

"Phone": "360-789-5678",

"AmountPaid": 0

}

]

{

"_key": "560",

"_id": "Courses/560",

"_rev": "_XU2oaVC--H",

"CourseNumber": 5,

"Course": "Adv Pastels",

FIGURE C-22

The Art Course Database in JSON

Z06_KROE8149_09_SE_EXTC.indd 32 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-33

"CourseDate": "11/15/2019",

"Fee": 500,

"Enrollments": [

{

"CustomerNumber": 6,

"CustomerLastName": "Kyle",

"CustomerFirstName": "Leah",

"Phone": "425-678-7654",

"AmountPaid": 250

}

]

{

"_key": "550",

"_id": "Courses/550",

"_rev": "_XU2oaVC--_",

"CourseNumber": 1,

"Course": "Adv Pastels",

"CourseDate": "10/1/2019",

"Fee": 500,

"Enrollments": [

{

"CustomerNumber": 1,

"CustomerLastName": "Johnson",

"CustomerFirstName": "Ariel",

"Phone": "206-567-1234",

"AmountPaid": 250

{

"CustomerNumber": 3,

"CustomerLastName": "Jackson",

"CustomerFirstName": "Charles",

"Phone": "360-789-3456",

"AmountPaid": 500

{

"CustomerNumber": 4,

"CustomerLastName": "Pearson",

"CustomerFirstName": "Jeffery",

"Phone": "206-567-2345",

"AmountPaid": 500

}

]

Z06_KROE8149_09_SE_EXTC.indd 33 04/07/19 9:05 PM

C-34 Online Extension C Advanced Business Intelligence and Big Data

{

"_key": "554",

"_id": "Courses/554",

"_rev": "_XU2oaVC--B",

"CourseNumber": 2,

"Course": "Beg Oils",

"CourseDate": "9/15/2019",

"Fee": 350,

"Enrollments": [

{

"CustomerNumber": 2,

"CustomerLastName": "Green",

"CustomerFirstName": "Robin",

"Phone": "425-678-8765",

"AmountPaid": 350

{

"CustomerNumber": 5,

"CustomerLastName": "Sears",

"CustomerFirstName": "Miguel",

"Phone": "360-789-4567",

"AmountPaid": 350

}

]

{

"_key": "556",

"_id": "Courses/556",

"_rev": "_XU2oaVC--D",

"CourseNumber": 3,

"Course": "Int Pastels",

"CourseDate": "3/15/2019",

"Fee": 350,

"Enrollments": [

{

"CustomerNumber": 1,

"CustomerLastName": "Johnson",

"CustomerFirstName": "Ariel",

"Phone": "206-567-1234",

"AmountPaid": 350

}

]

}

]

Z06_KROE8149_09_SE_EXTC.indd 34 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-35

unique within the Courses collection of documents; “_id” is a value unique within the entire database; and “_rev” indicates revision information for the document, which is useful for re- solving replicated data updates. We could have chosen to use our own values (e.g., CourseNumber) as the “_key” values, but in this case we let ArangoDB assign them.

The Enrollments field of each course is a list of customer documents. Note that if a customer is enrolled in more than one course, that customer’s data are nested inside every course that customer takes. Obviously, this is redundant, and we could conceivably replace these documents with something similar to a foreign key. That would improve performance of some queries and updates, but at the expense of decreased performance on others. Essentially, this structure has precomputed within it the result of the join of the CUSTOMER, ENROLLMENT, and COURSE relational tables. Recall query AQL- QUERY-CH07-01 from Chapter 7:

/* *** AQL-QUERY-CH07-01 *** */

FOR C in Courses

FILTER C.Fee >400

RETURN {CourseName: C.Course,

CourseDate: C.CourseDate,

StudentLastNames: C. Enrollments[*]. CustomerLastName}

Suppose we want to extend this query to include first names of customers as well as last names. This can be done by nesting another query within the RETURN clause, as follows:

/* *** AQL-QUERY-ExtC-02 *** */

FOR C in Courses

FILTER C.Fee >400

RETURN {CourseName: C.Course,

CourseDate: C.CourseDate,

StudentNames: (FOR S in C.Enrollments

RETURN {Fname: S.CustomerFirstName,

Lname: S.CustomerLastName})}

The last line of AQL-QUERY-CH07-01 has been replaced by a nested query that creates a new document for each customer (student) enrolled in the course. This demonstrates some of the flexibility available in AQL (and similar languages) when creating query results. The results of AQL-QUERY-ExtC-02 are shown in Figure C-23.

Our final example of AQL illustrates a way to do the equivalent of the SQL GROUP BY clause. The AQL equivalent of GROUP BY is called COLLECT. The AQL COLLECT . . . INTO syntax allows us to group the documents of a collection based on an expression (which we give a name to) and to give a name to the groups created by the grouping process. We can then use these group and expression names later in the query. Here is an example:

/* *** AQL-QUERY-ExtC-03 *** */

FOR C in Courses

COLLECT CourseFee = C.Fee INTO FeeGroup

RETURN {FeeCharged: CourseFee,

TotalRevenue:

SUM(FLATTEN(FeeGroup[*].C.Enrollments[*]. AmountPaid))}

Z06_KROE8149_09_SE_EXTC.indd 35 04/07/19 9:05 PM

C-36 Online Extension C Advanced Business Intelligence and Big Data

[

{

"CourseName": "Adv Pastels",

"CourseDate": "11/15/2019",

"StudentNames": [

{

"Fname": "Leah",

"Lname": "Kyle"

}

]

{

"CourseName": "Adv Pastels",

"CourseDate": "10/1/2019",

"StudentNames": [

{

"Fname": "Ariel",

"Lname": "Johnson"

{

"Fname": "Charles",

"Lname": "Jackson"

{

"Fname": "Jeffery",

"Lname": "Pearson"

}

]

}

]

FIGURE C-23

Results of AQL-QUERY- ExtC-02 in JSON Format

This query collects courses into groups (called “FeeGroup”, which is like an SQL alias or range variable) based on the fee charged for the course. The query returns one docu- ment for each FeeGroup. Each such document contains the fee and the total revenue collected so far for all courses charging that fee. The aggregate function SUM works just as in SQL. In this case, however, as the AmountPaid data is inside enrollments, which are inside courses, the AQL FLATTEN operator was necessary to convert an array of arrays of AmountPaid values into an array of AmountPaid values. For example, the FLATTEN operation on this data would transform the array [[0], [350, 350], [350]] into the array [0, 350, 350, 350] to help compute the total value 1050 (for all courses charging $350). The results of AQL-QUERY-ExtC-03, in JSON format, are shown in Figure C-24.

Z06_KROE8149_09_SE_EXTC.indd 36 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-37

We should note at this point that there is no standard document query language like there is SQL for relational queries. AQL is one of many, and it shares many features with many of the others. Mastering one such language will make it much easier to master the others.

ArangoDB and the CAP Theorem During a network partition, ArangoDB will choose to preserve consistency over availabil- ity. Thus, ArangoDB has chosen C and P over A in the terminology of the CAP theorem (recall that at most two of Consistency, Availability, and Partition tolerance can be sup- ported in a DBMS). ArangoDB provides a full ACID transaction implementation to sup- port consistency, but queries involving multiple documents or multiple collections in a clustered environment are not guaranteed the ACID properties. This support is planned for future versions of ArangoDB. Note, however, that transactions are user-defined: if a user chooses not to place database interactions inside a transaction, then of course consis- tency is not guaranteed.

[

{

"FeeCharged": 350,

"TotalRevenue": 1050

{

"FeeCharged": 500,

"TotalRevenue": 1500

}

]

FIGURE C-24

Results of AQL-QUERY- ExtC-03 in JSON Format

USING THE DATABASE FEATURES OF MICROSOFT AZURE Now it is time to review and extend some cloud concepts by presenting a short case study of how Microsoft SQL Server integrates with Microsoft Azure—Microsoft’s public cloud service that we will sometimes refer to simply as Azure. This section is not a detailed how- to guide but rather a conceptual overview (a mini tour) of how all the components on a PC interact with the Azure cloud portal and services on Azure to create and query a database.

The author installed on his personal computer both SQL Server and the SSMS (Microsoft SQL Server Management Studio) client software. SSMS functions as the graphical frontend to SQL Server AND to the SQL Database functionality on Azure. Microsoft offers a free account on Azure. We will demonstrate how we migrated an exist- ing HSD (Heather Sweeney Designs) database from a PC to the Azure cloud and ran a query on it.

The SQL Database server on Azure is very similar to, but not exactly the same as, SQL Server. Some of the advanced features of SQL Server are not supported on Azure’s SQL Database, but for our purposes they are the same. As you read through this section, keep in mind how all of the components that are involved interact to use a cloud-based service: the Web browser, client software on the PC, the cloud portal Web site, and other cloud ser- vices provided by servers in the cloud datacenter.

Z06_KROE8149_09_SE_EXTC.indd 37 04/07/19 9:05 PM

C-38 Online Extension C Advanced Business Intelligence and Big Data

Connecting to the Microsoft Azure Home Page In this section we briefly review how we obtained an account on Azure to get an idea of what information is typically required by cloud providers to set up an account. Using a Web browser, we connected to the Azure home page at azure.microsoft.com/en-us/free. As shown in Figure C-25, we selected the Start Free button. In the future, after we create our account, we will use the MY ACCOUNT button on the top right to log in to manage our Azure services or to check our billing status. The sign-up process is outlined here, but it may change at Microsoft’s discretion.

1. A Sign in dialog box requesting your account information is displayed. Select “No ac- count? Create one!” In the Create account dialog box, enter your email account and make up a separate Azure password, then click the Next button to begin the account setup process.

2. Microsoft will then send a verification code to your email address to confirm that you own the email account. Enter the numeric code that was sent to you, then click Next.

3. To ensure that a real person is creating the account (and not a program “bot”), enter the characters displayed in the image on the Create account dialog box and click Next.

4. The About you screen where you enter some information, such as your name and phone number, appears. Click Next. You may receive a text message with a verification number that you then enter in an identity verification by phone dialog box.

5. The identity verification by card dialog box asks for your credit card information as an- other way to confirm your identity. Then click the Next button. Finish the sign-up process by checking the Agreement box at the bottom of the screen. Your account will be set up and you will get the Welcome to Microsoft Azure screen, where you can start an infor- mational tour or get started. After your trial period is up, you may elect to continue your Azure account under a “pay as you go” plan.

FIGURE C-25

Create Your Free Azure Account

The START FREE button

The MY ACCOUNT FREE button

Azure, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 38 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-39

Creating a New SQL Database We used the Start managing my service button to be directed to the Azure Dashboard screen as shown in Figure C-26. This screen is also called the portal in other cloud systems. As you can see, Azure has a rich selection of other services, including virtual machines, data storage, SQL databases, and NoSQL database servers (Azure Cosmos DB). We can now create and configure a new relational database in Azure by clicking the SQL databases but- ton in the left pane, then selecting the Create SQL Databases button in the next screen (not shown), to bring us to the SQL Database screen shown in Figure C-27.

We will need to create a new database server, set up a resource group, and create our administrator login and password. We first click Configure required settings under the Server category to bring us to Figure C-28. We entered our new Server name, administra- tor login information, and the server location in the New server pane. It is a good idea to record your login and password somewhere! The author uses a free product called PWSAFE to store the account and password of dozens of logins. When using cloud ser- vices, it is also a good idea to pick the geographic location of the cloud server nearest you to minimize network transmission time.

To complete the process, we clicked the Select button in the bottom right to configure the new server and to bring us back to the SQL database screen as shown in Figure C-29. We will initially create an empty “test” database to establish our resource group (comput- ing environment), and then use SQL Server Management Studio (SSMS) to migrate HSD to the new server.

We entered the Database name “test” and created a new Resource group named bob-rg, then used the Create button in the bottom left of the screen to create the blank test database. One must be patient, it may take a minute or two to set up a database on Azure: the free accounts probably do not get the highest priority! The “deployment succeeded” message briefly popped up, and we used the Refresh button to see our new database in a SQL databases screen (not shown). The next step is to set up a “Firewall rule” that will al- low your computer’s Internet IP address to connect to the remote database server on Azure.

FIGURE C-26

Azure Dashboard to Manage Your Services

The SQL DATABASES button

Azure, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 39 04/07/19 9:05 PM

C-40 Online Extension C Advanced Business Intelligence and Big Data

FIGURE C-27

SQL Database Screen

The CONFIGURE REQUIRED SETTINGS command

Azure, Microsoft Corporation

FIGURE C-28

The New Server Screen

Filled-in server information

Use SELECT to continue

Azure, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 40 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-41

Setting up Our Database Server Firewall Rules After we establish our adminis- trator account and a blank database, we need to configure our Azure firewall to allow only certain computers (IP addresses) to access our databases. Security is very important in cloud computing. We select the Resource groups button in the left pane, then our resource group name (in this example, it is bob-rg). Our server instance and test database will be listed as members of our resource group, as shown in Figure C-30. We selected the SQL server in- stance (in this example, bob-server) to see the SQL database server details screen in Figure C-31, then selected the Show firewall settings command near the top right of the pane.

In the Firewalls and virtual networks screen as shown in Figure C-32, we clicked Add client IP to allow our IP address to access the server. Our IP address will appear in a new rule (the author’s own IP address is hidden in the figure for security reasons). To finish this step, we clicked the Save command near the top of the Firewall screen. Note that IP addresses may change if using a wireless Internet connection in public or organizational networks.

Our Azure account is set up. From now on, we will mostly be interacting with Azure SQL server through our local SQL Server Management Studio (SSMS) to migrate the ex- isting Heather Sweeney Designs (HSD) database from our local PC to Azure and to run a query against it.

Migrating Our Existing Local HSD Database to Azure We had previously converted our HSD database from Access to Microsoft SQL Server. We use SSMS as our graphical interface to connect to our local SQL Server database AND to our remote SQL databases in the Azure cloud. We started SSMS and connected to our local SQL Server installation first using local windows authentication as shown in Figure C-33, then con- nected to our database server on Azure using Connect | Database Engine command from the Object Explorer window to get the dialog box shown in Figure C-34. We entered the name of our Microsoft Azure server and its administrator login and password, then clicked on the Connect button. The local HSD tables are shown in the Object Explorer pane on the left.

Note that SSMS is now connected to two servers: our local DESKTOP server listed at the top of the Object Explorer pane in Figure C-34 and the remote bob-server.database. server.net server in the Azure datacenter as shown in Figure C-35.

FIGURE C-29

Continue with Configuring SQL Database

Create an empty “test” database for now

Create a NEW resource group that contains our server and security information

Continue with the CREATE button

Azure, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 41 04/07/19 9:05 PM

C-42 Online Extension C Advanced Business Intelligence and Big Data

FIGURE C-30

The New Resource Group in Azure

Select the new server name to continue with security configuration

FIGURE C-31

The SQL Server Details Screen

Select SHOW FIREWALL SETTINGS to configure network security

Azure, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 42 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-43

FIGURE C-33

Connect to Local SQL Database Server Using SSMS

The CONNECT command in the Object Explorer pane

Select local DESKTOP login, then select CONNECT button

FIGURE C-32

The Server Firewalls and Virtual Networks Screen

Select ADD CLIENT IP to allow your PC to access the server

Azure, Microsoft Corporation

SQL Server 2017, SQL Server Management Studio, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 43 04/07/19 9:05 PM

C-44 Online Extension C Advanced Business Intelligence and Big Data

FIGURE C-34

Connect to Remote Azure Database Server Using SSMS

The established local DESKTOP connection showing HSD database tables

Select the remote server name in windows.net network

The remote server connection showing empty test database

FIGURE C-35

Remote Azure Database Server in the Object Explorer Pane of SSMS

SQL Server 2017, SQL Server Management Studio, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 44 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-45

Now, in the Object Explorer pane of SSMS, we right-clicked the local HSD database to show the submenu, then selected Tasks | Deploy Database to Microsoft Azure SQL Database command, as shown in Figure C-36. The Deploy Database Introduction screen appears, as shown in Figure C-37, and we clicked the Next button to start the process.

In the Deployment Settings page as shown in Figure C-38, the Connect... button brings up the SQL Server Connect to Server dialog box again to enter the remote connec- tion credentials to Azure.

FIGURE C-36

Preparing to Deploy Database to Microsoft Azure SQL Database

Select local HSD database, then right-click to open

option task menu with DEPLOY

FIGURE C-37

Deploy Database to Microsoft Azure SQL Database

Select NEXT to start the deployment process

SQL Server 2017, SQL Server Management Studio, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 45 04/07/19 9:05 PM

C-46 Online Extension C Advanced Business Intelligence and Big Data

Once the login and password are provided, the deployment settings for the server con- nection are filled in, as shown in Figure C-39, and we used the Next button to set up the actual migration process. The local SSMS program cooperates with the local SQL Server DBMS and the remote Azure DBMS server to transfer the table definitions and data to the remote end. The Summary page will appear after several seconds, as shown in Figure C-40.

FIGURE C-38

The Deployment Settings Screen

Enter the new remote database name

Enter remote server credentials then select CONNECT

FIGURE C-39

The Filled-in Deployment Settings Screen

Information is filled in, select NEXT to continue

SQL Server 2017, SQL Server Management Studio, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 46 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-47

FIGURE C-40

The Deployment Summary Screen

Verify detailed settings, then select FINISH

The Finish button begins the migration process. The Progress screen appears but we must be patient, as it may take a minute or so to import the database to the remote Microsoft Azure server. Eventually the Results: Operation Complete screen appears, as shown in Figure C-41. The Close button allows us to see the new database HSD appear in SSMS under the remote server name, as shown in Figure C-42.

FIGURE C-41

Results: Deployment Operation Complete

All steps completed, select CLOSE to continue

SQL Server 2017, SQL Server Management Studio, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 47 04/07/19 9:05 PM

C-48 Online Extension C Advanced Business Intelligence and Big Data

FIGURE C-42

SSMS Object Explorer Showing New Azure HSD Database

To test the HSD database on Azure, we selected the HSD icon, then the New Query button. A new SQL query blank window opens. We entered the simple SQL query:

SELECT *

FROM SEMINAR;

then executed it, as shown in Figure C-43.

Cloud Reflections We have just touched the surface of all the capabilities of the cloud in our little tour. We could have gone on to show how to run scripts to build a new database in the cloud, manage our databases, and connect applications to use relational and nonrelational databases on Azure.

Think of the cloud as a set of datacenters containing servers, software, disk storage, and network equipment that one can rent and configure as needed to meet one’s comput- ing needs. Understanding how the Web, cloud portals, and local applications interoperate will become increasingly important to information professionals.

The economic case for cloud computing is compelling. Computing equipment has a high capital cost, yet it depreciates rapidly. The support infrastructure costs for a datacenter are prohibitive for small companies and include the computer room, power, air condition- ing, and electric power backup systems. Hardware support, software licensing costs, along with salaries for systems administrators, security, and operations personnel are driving the popularity of large datacenters where economies of scale can be taken advantage of.

The remote server connection in SSMS

The exported HSD tables in Azure

SQL Server 2017, SQL Server Management Studio, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 48 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-49

FIGURE C-43

Query Results of HSD Database Migrated to Azure

The query in SSMS to test successful export to Azure

Section C: Creating Microsoft Access Switchboards The “Working with Microsoft Access” section of Chapter 7 covered how to do some basic Business Intelligence (BI) work (specifically, OLAP queries and reports) using Microsoft Access and Microsoft Excel. In this section of “Working with Microsoft Access” we will present a further feature of Microsoft Access that can be useful in BI and/or Big Data sce- narios. Microsoft Access switchboard forms allow creation of a form that can contain links to other forms, reports, and so on. This form can be made to appear by default when the database is opened, resulting in an easily used, self-contained application for BI (and other) users of the database. In this section, you will learn how to create switchboard forms in Microsoft Access 2019.

At the end of Chapter 4’s section of “Working with Microsoft Access,” we stated that:

Most users would find working with Microsoft Access 2019 database applications at the level of detail that we have been using to be intimidating. Users want a simple way to access forms (so that they can input data) and reports (so that they can view and print them). They really don’t want all the complexity of tables, views, and relationships. This is particularly true when prototyping applications—users want to see what the application can do, not how it does it!

In Microsoft Access 2019, we can build a switchboard that will provide this functionality. A switchboard is simply a specialized Microsoft Access form that provides a way for the user to easily navigate the application with a button-based menu system. An example for our WMCRM database is shown in Figure WA-C-1—this figure duplicates the switchboard shown in Figure WA-4-48. This section of “Working with Microsoft Access” provides an introductory discussion of Microsoft Access switchboards and describes how to create them.

WORKING WITH MICROSOFT ACCESS

SQL Server 2017, SQL Server Management Studio, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 49 04/07/19 9:05 PM

C-50 Online Extension C Advanced Business Intelligence and Big Data

FIGURE WA-C-1

The Microsoft Access 2019 Switchboard for the WMCRM Database

As its title implies, this section is designed to get you started using Microsoft Access 2019 switchboards so that you can create switchboards for Microsoft Access database ap- plications. The material in this section does not go beyond what is necessary to get you started using Microsoft Access 2019 switchboards. There are many additional topics about Microsoft Access 2019 switchboards that we do not cover in Database Concepts (9th edition) and this section. Consult the Microsoft Access 2019 Help systems and documenta- tion for more information about Microsoft Access 2019 switchboards.

While we present one way of creating forms that link to other forms, reports, and so on in an Access database, note that there are at least two other ways to do similar things. First, the functionality we describe here can also be achieved using the Microsoft Access 2019 Switchboard Manager. This is a tool included with Microsoft Access, but it no longer ap- pears in the interface unless the DBA manually places it in the Quick Access Toolbar. It eliminates the need for a few of the steps we describe here, but because (a) it may not be around forever and (b) it limits some things one can do with the form, we do not describe it here. Second, a newer feature of Access, the navigation form, is in some ways intended to supplant traditional switchboard forms. However, these forms (a) are mainly designed for use in a web-based setting (though they can be used elsewhere) and (b) have many more features and complexities than necessary for simply packaging a straightforward Microsoft Access application database, as we need to do in this book.

Finally, we note that other features of Microsoft Access 2019 can be useful for BI and Big Data. In particular, it is possible to use Microsoft Access as a frontend to an enterprise- class DBMS. This allows the DBA to use features of Access (easy forms, report, etc.) while accessing large amounts of data. Instructions for how to do this with MySQL are in Extension A.

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 50 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-51

Working with Microsoft Access Switchboards

We will start by making another copy of our basic WMCRM database, and then create a switchboard to function as a main menu for the CRM application. Since this section is in- tended to be used any time after you have completed working on Chapter 4’s section of “Working with Microsoft Access,” we are starting with the WMCRM.accdb as it exists at the end of that section. Here are the steps to create a database named WMCRM-WA-C.accdb.

Copying the WMCRM.accdb Database:

1. Select Start | File Explorer | Documents to open the My Documents library (or whichever folder contains your WMCRM Access database).

2. Right-click the WMCRM.accdb file object to display the shortcut menu, and then click Copy. 3. Right-click anywhere in the empty area of the Documents library (or other folder) window

to display the shortcut menu and then click Paste. A file object named WMCRM - Copy. accdb appears in the folder window.

4. Right-click the WMCRM - Copy.accdb file object to display the shortcut menu, and then click Rename.

5. Edit the file name to read WMCRM-WA-C.accdb, and then press the Enter key.

Now we will build a switchboard for use with this database file. The goal is a switch- board that will ultimately look like the one shown in Figure WA-C-1. To do this, we start by creating a blank form. Because of the large number of callouts that we will need to use in the figures in this extension, we will use a screen display setting of 1024 * 768 pixels for some of our screenshots. We will switch back to the higher resolution 1280 * 768 format for some screenshots and at the end of the extension. You do not need to do this—leave your screen resolution as it is. You will still find every control you need as we go along.

Creating a Blank Form:

1. Start Microsoft Access 2019. 2. If the Backstage view is not open, click the File command tab to display the Backstage

view, and then click the Open Other Files button on the left, near the bottom. In the Open page, click the Browse button, browse to the WMCRM-WA-C.accdb file, click the file name to highlight it, and then click the Open button.

3. The Security Warning bar appears with the database. Click the Security Warning bar Enable Content button. If you get a Security Warning box asking to make this file a Trusted Document, click the Yes button.

4. Click the Create command tab. 5. Click the Blank Form button in the Forms group. A blank form labeled Form1 is displayed,

as shown in Figure WA-C-2. Note the blank Field List pane; the Form Layout Tools contextual command tab, which contains the Design command tab (which is active and dis- played); the Arrange command tab; and the Format command tab that are displayed.

Now we need to modify some of the property settings for the form. For example, note that in Figure WA-C-2 the form has a set of record navigation buttons near the bottom. These are not needed (indeed, they are useless) in a switchboard, so we need to disable the display of these buttons for the form.

Modifying Property Sheet Settings for a Form:

1. Click the Property Sheet button shown in Figure WA-C-2. The Property Sheet for the form is displayed, as shown in Figure WA-C-3, with the All tab settings displayed.

2. Most of the property settings are correct, but three need to be changed. In order to change these settings, however, the form must be in Design view. Click the Views drop-down list button, and then select Design View. Click the Design command tab to display the Design command groups. The form is now displayed in Design View, as shown in Figure WA-C-4.

Z06_KROE8149_09_SE_EXTC.indd 51 04/07/19 9:05 PM

C-52 Online Extension C Advanced Business Intelligence and Big Data

3. Click the square button on the upper left of the form, as shown in Figure WA-C-4, to select the entire form.

■ NOTE: The entire form needs to be selected. This is the default when you select Design view, but if you somehow select just a part of the form, the Property Sheet set- tings display will change. To select or re-select the entire form, click the square button on the upper-left corner of the form, as shown in Figure WA-C-4.

4. As shown in Figure WA-C-4, change the Allow Layout View property setting from Yes to No using the property drop-down list.

5. Similarly, change the Record Selectors property setting from Yes to No using the property drop-down list.

6. Finally, change the Navigation Buttons property setting from Yes to No using the property drop-down list.

7. Click the Views drop-down list button, and then select Form View. The form in Form view is displayed, as shown in Figure WA-C-5.

8. Note that the three revised settings are in effect, and the record navigation buttons are no longer visible.

Now that we have the correct property settings, we can build the switchboard itself. We will start by adding a label.

Adding a Label to a Switchboard Form:

1. Click the Views drop-down list button, and then select Design View. 2. Minimize the Navigation Pane (use the << button on the top-right corner of the pane) to

allow more of the form itself to be displayed. The form in Design view is now displayed as shown in Figure WA-C-6. Note the drag-and-drop controls available via the buttons in the Controls command group.

3. Drag and drop the right margin of the form so that it is 6.75 inches wide, as shown in Figure WA-C-6. Note that you can actually choose whatever width is appropriate.

FIGURE WA-C-2

The Blank Form

The Form Layout Tools contextual command tab

The blank Field List pane

The Design command tab

The Arrange command tab

The Format command tab

The Property Sheet button

The blank tabbed form

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 52 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-53

FIGURE WA-C-3

The Blank Form with Property Sheet

FIGURE WA-C-4

The Form in Design View with Original Property Settings

If needed, click here to select the entire form

Select the correct setting from the property drop-down list

The Navigation Buttons property settings

The Record Selectors property settings

The Allow Layout View property settings

The All tab and format settings

The Views drop- down list button

The All tab and format settings

The Property Sheet button

The Property Sheet

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 53 04/07/19 9:05 PM

C-54 Online Extension C Advanced Business Intelligence and Big Data

FIGURE WA-C-5

The Form with Revised Settings

The Views drop- down list button

The form with the revised settings— note that the record selection buttons have been removed from the bottom of the form

FIGURE WA-C-6

The Expanded Form in Design View

Click here to restore normal mouse pointer functions

The Command Button button

Drag the right border of the form to here

The Label button

The Text Box button

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 54 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-55

4. Click the Label button. Click and drag in the form and create the label area shown in Figure WA-C-7.

5. In the label area, type the text Wallingford Motors Customer Relationship Management System, as shown in Figure WA-C-8.

6. Click the small square button that selects the entire form, and then click the label area to select and then highlight the label text.

7. Click the Format command tab. Highlight the label text. The label text can now be formatted, as shown in Figure WA-C-9.

8. Using the formatting controls of the Format command tab, format the label text as a font size of 14 pts., bold, and centered.

FIGURE WA-C-7

Adding a Label to the Form

Click here to restore normal mouse pointer functions

Click and drag on the form to create the text label area

The Label button

FIGURE WA-C-8

The Label Text

Click here to restore normal mouse pointer functions

The Label text

If needed, click here to select the entire form

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 55 04/07/19 9:05 PM

C-56 Online Extension C Advanced Business Intelligence and Big Data

FIGURE WA-C-9

Formatting the Label Text

The Format command tab is selected and the formatting commands are available

The highlighted label text

Select label name here

9. Click the Save button on the Quick Access Toolbar to display the Save As dialog box. Name and save the form as WMCRM Main Menu.

10. Click the Home command tab and select Form View. The form is now displayed (in 1280 * 786-pixel format) as shown in Figure WA-C-10.

Next, we will add a second label to the WMCRM Main Menu to indicate a subsection of the switchboard for forms.

Additional Formatting for Text Labels on a Switchboard Form:

1. Click the View drop-down list button, and then select Design View. 2. If needed, minimize the Navigation Pane to allow more of the form itself to be displayed.

FIGURE WA-C-10

The WMCRM Main Menu with Label Text

The form is now named as WMCRM Main Menu

The WMCRM Main Menu label text

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 56 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-57

3. Click the Label button. Click and drag in the form and create the text box area shown in Figure WA-C-11.

4. In the label text area, type the text WMCRM Forms. 5. Click the Format command tab. Select the correct label name in the object name drop-

down list and highlight the label text. The label text can now be formatted. 6. Using the formatting controls, format the label text as a font size of 16 pts., bold, and cen-

tered. Select the shade of olive green as the background color for the text box, as shown in Figure WA-C-12.

7. Click the Save button on the Quick Access Toolbar to save the new additions to the form. 8. Click the Home command tab and select Form View. The form is now displayed (in 1280

* 768-pixel format) as shown in Figure WA-C-13.

To complete our survey of basic form controls, we will add a command button to the WMCRM Main Menu. We will set this button to open the WMCRM Customer Data Form.

Adding a Command Button to a Switchboard Form:

1. Click the Views drop-down list button, and then select Design View. 2. If needed, minimize the Navigation Pane to allow more of the form itself to be displayed. 3. Click the Command Button button (note that the tool tip for this button refers to this as

just the Button button). Click and drag in the form to create the command button area shown in Figure WA-C-14. If you get a security warning, click the Open button.

4. The placement of a command button object in the form automatically starts the Command Button Wizard, as shown in Figure WA-C-15, which is used to simplify setting the prop- erty settings for the command button object.

5. The first page of the Command Button Wizard asks, “What action do you want to hap- pen when the button is pressed?” We want clicking the button to open a form, so in the Categories list click Form Operations.

6. Clicking Form Operations displays a specific set of form operation actions in the Actions list, as shown in Figure WA-C-16.

7. Click Open Form in the Actions list, and then click the Next button.

FIGURE WA-C-11

Adding a Second Label

The Label button

Click here to restore normal mouse pointer functions

Click and drag on the form to create the text label area

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 57 04/07/19 9:05 PM

C-58 Online Extension C Advanced Business Intelligence and Big Data

FIGURE WA-C-12

Formatting the Second Label

Select this shade of olive green

Use the Background Color drop-down list button to display the background color gallery

The Format command tab is selected and the formatting commands are available

FIGURE WA-C-13

The WMCRM Main Menu with the Additional Label

The WMCRM Forms label text

The WMCRM Main Menu label text

The form is now named as WMCRM Main Menu

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 58 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-59

FIGURE WA-C-14

The Command Button on the Form

The Command Button button

Click here to restore normal mouse pointer functions

Click and drag on the form to create the Command Button area

FIGURE WA-C-15

The Command Button Wizard

Click Form Operations in the Categories list to select form operations

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 59 04/07/19 9:05 PM

C-60 Online Extension C Advanced Business Intelligence and Big Data

FIGURE WA-C-16

Selecting Form Options

FIGURE WA-C-17

Selecting the WMCRM Customer Data Form

Click WMCRM Customer Data Form in the list of available forms

The Next button

Click Open Form in the Actions list

Form Operations is selected

The Next button

8. The second page of the Command Button Wizard asks, “What form would you like the command button to open?” As shown in Figure WA-C-17, click the WMCRM Customer Data Form to select it.

9. Click the Next button. 10. As shown in Figure WA-C-18, the third page of the Command Button Wizard asks, “Do

you want the button to find specific information to display in the form?” 11. The default setting of “Open the form and show all the records” is correct for our applica-

tion, so click the Next button. 12. As shown in Figure WA-C-19, the fourth page of the Command Button Wizard asks, “Do

you want text or a picture on the button?”

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 60 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-61

13. Note that we can choose either a text label or a picture to be placed on the command button itself to label the button in the form. As shown in Figure WA-C-19, we will use text and label the command button as Open WMCRM Customer Data Form. Click the Text radio button to select it, and then type the text Open WMCRM Customer Data Form into the text box.

14. Click the Next button. 15. As shown in Figure WA-C-20, the final page of the Command Button Wizard asks, “What

do you want to name the button?” Naming the button allows us to reference it by a known name if we need to use or modify it later while developing our WMCRM application.

FIGURE WA-C-18

Selecting the Information to Be Displayed

This setting is correct

The Next button

FIGURE WA-C-19

Creating the Command Button Text Label

Type in the text Open WMCRM Customer Data Form—this is the text that will appear on the Command Button

The Next button

Click the Text radio button

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 61 04/07/19 9:05 PM

C-62 Online Extension C Advanced Business Intelligence and Big Data

16. As shown in Figure WA-C-20, we will use the name CommandButtonWMCRM01 as our button name. Click the text box to select it, and then type the text CommandButtonWMCRM01 into the text box.

17. Click the Finish button. 18. As shown in Figure WA-C-21, the command button is now in the form, but the size of the

button has been changed to the width of the typed text. Click and drag the right edge of the command button object so that it is the size and shape we originally set it, as shown in Figure WA-C-21.

19. The final command button object is shown in Figure WA-C-22. Note that the command button name we created is displayed in the command button property settings.

20. Click the Save button on the Quick Access Toolbar. 21. Click the Home command tab and select Form View. The form is now displayed (in 1280

* 768-pixel mode) as shown in Figure WA-C-23. 22. Click the Open WMCRM Customer Data Form command button. As shown in Figure

WA-C-24, the WMCRM Customer Data Form is displayed.

We now know the tools and techniques needed to complete the WMCRM Main Menu switchboard shown in Figure WA-C-1. At this point, you should add the other la- bels and command buttons needed to complete the switchboard. Be sure to save and test your work as you go. After completing the switchboard form itself, you may want to make it open automatically when the Access database is opened. To do this, complete the fol- lowing steps:

Making a Form Open by Default when Access Opens the Database:

1. Click the File tab to get the Backstage view, and then select Options. 2. Select Current Database to get to the screen shown in Figure WA-C-25. 3. As shown in Figure WA-C-25, in the Display Form dropdown box, select WMCRM Main

Menu. Click the OK button to ensure that when Access next opens this database, the WMCRM Main Menu form will open automatically.

4. You will get a message box stating that you must close and reopen the database for this to take effect; click OK.

With the addition of a main menu or switchboard to our prototyped application, the users will be able to easily test the application and make suggestions for changes and improvements

There are other options and features for Microsoft Access switchboards that we have not covered here. Consult the Microsoft Access 2019 documentation for more information and take some time to experiment with the switchboard capabilities.

By The Way

Z06_KROE8149_09_SE_EXTC.indd 62 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-63

FIGURE WA-C-20

Creating the Command Button Object Name

Type in the text CommandButton WMCRM01— this will be the name of the Command Button object

The Finish button

FIGURE WA-C-21

Resizing the Command Button Object

The Command Button object

Click and drag to resize the Command Button object to the correct size

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 63 04/07/19 9:05 PM

C-64 Online Extension C Advanced Business Intelligence and Big Data

FIGURE WA-C-22

The Final Command Button Object

The Command Button object name

The Command Button object is now the correct size

FIGURE WA-C-23

The WMCRM Main Menu with the Command Button

The WMCRM Forms label text

The WMCRM Main Menu label text

The Open WMCRM Customer Data Form command button

The form is now named as WMCRM Main Menu

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 64 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-65

FIGURE WA-C-25

Setting the Switchboard to Open when the Database is Opened

The Display Form drop-down box

The Current Database options page

The WMCRM Main Menu form selection

The OK button

FIGURE WA-C-24

The Displayed WMCRM Customer Data Form

The WMCRM Customer Data Form

The WMCRM Customer Data Form tab

The WMCRM Main Menu tab

Access 2019, Windows 10, Microsoft Corporation

Z06_KROE8149_09_SE_EXTC.indd 65 04/07/19 9:05 PM

C-66 Online Extension C Advanced Business Intelligence and Big Data

Closing the Database and Exiting Microsoft Access

This completes the work we will do in this extension’s “Working with Microsoft Access.” As usual, we finish by closing the database and Microsoft Access.

Closing the WMCRM-WA-C Database:

1. To close the WMCRM-WA-C database and exit Microsoft Access, click the Close button in the upper-right corner of the Microsoft Access window.

SUMMARY

Business intelligence (BI) systems assist managers and other professionals in the analysis of current and past activities and in the prediction of future events. BI applications are of two major types: reporting applications and data mining applications. Reporting applications make elementary calculations on data; data mining applications use sophisticated mathe- matical and statistical techniques. BI applications obtain data from three sources: opera- tional databases, extracts of operational databases, and purchased data. A BI system some- times has its own DBMS, which may or may not be the operational DBMS. Because of potential problems with operational data, many organizations have chosen to create and staff data warehouses (and subsets of them, called data marts). Extract, transform, and load (ETL) systems are used to extract data from operational systems; transform the data and load them into data warehouses; and maintain metadata that describe the source, format, assumptions, and constraints about the data.

The purpose of a reporting system is to create meaningful information from disparate data sources and to deliver that information to the proper users on a timely basis. RFM analysis is a typical reporting application. An RFM report can be produced using SQL statements. Reports must be prepared and delivered to the appropriate users. Reporting systems maintain metadata that support the three basic report functions: authoring, manag- ing, and delivering reports. Reports vary by type, media, and mode. Online analytical pro- cessing (OLAP) reporting applications, discussed in detail in Chapter 7, enable users to dynamically restructure reports.

Data mining is the application of mathematical and statistical techniques to find pat- terns and relationships and to classify records and predict outcomes based on the data. Data mining has arisen in recent years because of the confluence of factors shown in Figure C-5.

With unsupervised data mining, analysts do not create models or hypotheses prior to the analysis. Rather, results are explained after the analysis is performed. With supervised techniques, hypotheses are formed and tested before the analysis. Six popular data mining techniques are cluster analysis, market basket analysis, regression analysis, decision tree analysis, logistic regression, and neural networks. The first two of those are unsupervised techniques.

Market basket analysis, or association rules, can be used to determine which sets of products are likely to be sold at the same time. According to market basket analysis termi- nology, the support for two products is the frequency with which they appear together in transactions. Confidence is the conditional probability that one item will be purchased, given that another item is being purchased. Lift is confidence divided by the base probabil- ity that an item will be purchased.

Decision trees can be used to predict future decisions based on the results of past deci- sions. By asking a proper series of questions about a new data point or record, it is possible to reach a decision about how to classify that data point. These questions can be organized

Z06_KROE8149_09_SE_EXTC.indd 66 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-67

as a tree, with every new data point being asked the same question initially and subsequent questions based on the responses to the previous questions. Decision trees can be used to determine levels of risk in the insurance industry or in medical diagnosis.

The NoSQL movement (now often read as “not only SQL”) is built upon the need to meet the Big Data storage needs of companies such as Amazon.com, Google, and Facebook. The tools used to do this are nonrelational DBMSs known as structured storage or NoSQL databases. An early example was Bigtable, and a more recent popular example is Cassandra.

Other features typical of NoSQL and other DBMSs designed to manipulate Big Data include replication and sharding: multiple copies of data can be stored on various comput- ers in a network, and data can be split among several computers in a network (for example, by putting half the documents in a collection on one computer and the other half on another).

Many of these systems use data models based on document format standards such as XML or JSON. The confluence of database processing and document processing is one of the most important developments in information systems technology today.

XML is a series of standards that were developed jointly by the database processing and document processing communities. XML provides a standardized yet customizable way to describe the contents of documents. XML documents can automatically be gener- ated from database data, and database data can be automatically extracted from XML documents. XML documents can be queried using XPath and XQuery.

Although XML can be used to materialize Web pages, this is one of its least important uses. More important is its use for describing, representing, and materializing database views. XML is a better markup language than HTML, primarily because XML provides a clear separation between document structure, content, and materialization. XML docu- ments are transformed when an XSLT processor applies an XSL document to the XML document. The content of XML documents can be described by the older Document Type Declarations (DTDs) and by XML Schema. XML Schema is the newer standard for de- scribing the content of an XML document. XML Schema can be used to define custom vocabularies.

Schemas consist of elements and attributes. There are two types of elements: simple and complex. Simple elements have one data value. ComplexType elements can have mul- tiple elements nested within them. ComplexTypes may also have attributes. The elements contained in a ComplexType may be simple or other ComplexTypes. ComplexTypes may also define element sequences.

SQL Server, Oracle Database, and MySQL can produce XML documents from data- base data. XML is important also because it facilitates the sharing of XML documents (and hence database data) among organizations. Nonrelational databases can be grouped into Key-value, Document, Column Family, and Graph databases. These products use a non- normalized semistructured data structure. Key-value databases use key-value pairs to store data. Document databases typically use data documents represented in XML or JSON. Column family databases use columns, super columns, column families, and super column families to store data. Graph databases use nodes, properties, and edges (which can also have properties).

Big Data is stored in the cloud and in NoSQL databases, often using either XML or JSON for their structuring capabilities. Microsoft Azure is a cloud-based platform that al- lows us to deploy relational databases using the SQL Server interface. These databases can be created by importing them from existing local databases or by creating them directly in the cloud on Microsoft Azure. The Azure platform also allows creation and querying of Document databases in JSON format.

Z06_KROE8149_09_SE_EXTC.indd 67 04/07/19 9:05 PM

C-68 Online Extension C Advanced Business Intelligence and Big Data

3V framework aggregate alert Apache Cassandra AQL COLLECT...INTO syntax ArangoDB association rules attributes Azure Cosmos DB Big Data Bigtable BSON (Binary JSON) business intelligence (BI) cluster analysis column column family complex element confidence Couchbase Server data mining decision tree analysis digital dashboard document DTD (document type declaration) dynamic report DynamoDB edge element F score FLWOR graph

JSON (JavaScript Object Notation) keyspace key-value leaf lift logistic regression M score market basket analysis MemcacheDB MongoDB namespace Neo4j neural network node NoSQL Not only SQL movement polyglot persistence property pull report push report query report quorum R score regression analysis replica set replication report authoring report delivery report management reporting system RFM analysis

root RowKey schema-valid SGML (Standard Generalized

Markup Language) sharding simple element SSMS (Microsoft SQL Server

Management Studio) static report stylesheet super column super column family supervised data mining support testing phase training phase unsupervised data mining variety velocity volume Web portal XSLT (Extensible Style

Language: Transformations) XML (Extensible Markup

Language) XML Schema XPath XQuery

KEY TERMS

REVIEW QUESTIONS

C.1 State the purpose of a reporting system. C.2 In RFM analysis, what do the letters RFM stand for? C.3 Describe, in general terms, how to perform an RFM analysis. C.4 Explain the characteristics of customers that have the following RFM scores:

{1 1 5}, {1 5 1}, {5 5 5}, {2 5 5}, {5 1 2}, {1 1 3}

C.5 Name and describe the purpose of the major components of a reporting system. C.6 What are the major functions of a reporting system? C.7 Summarize the types of reports described in this extension. C.8 Describe the various media used to deliver reports. C.9 Summarize the modes of reports described in this extension. C.10 Describe the major tasks in report management. C.11 Explain the role of report metadata in report management. C.12 Name three tasks of report authoring.

Z06_KROE8149_09_SE_EXTC.indd 68 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-69

C.13 Describe the major tasks in report delivery. C.14 Define data mining. C.15 Explain the difference between unsupervised and supervised data mining. C.16 Name five popular data mining techniques. C.17 Explain how a decision tree classifies a new record. C.18 What is Big Data? C.19 What are the original three Vs? Define each term. C.20 What are the four categories of NoSQL databases used in this book? C.21 Why do database processing and document processing need each other? C.22 How are HTML, SGML, and XML related? C.23 Explain the phrase “standardized but customizable.” C.24 What are the problems in interpreting a tag such as <H2> in HTML? C.25 What is an aggregate? What are some of the NoSQL databases that support

aggregates?

C.26 What is the relationship between aggregates and XML or JSON? C.27 Why is it too limiting to say that XML is just the next version of HTML? C.28 How are XML, XSL, and XSLT related? C.29 Explain the use of the pattern {item, action} in the processing of an XSL document. C.30 What is the purpose of XML Schema? C.31 What is a schema-valid document? C.32 Explain the chicken-and-egg problem concerning the validation of XML Schema

documents.

C.33 Explain the difference between simple and complex elements. C.34 Explain the difference between elements and attributes. C.35 Explain the purpose of XPath and how it relates to XML queries. C.36 Explain the FLWOR structure of XQuery processing. C.37 Explain, in your own words, why XML is important to database processing. C.38 Why is XML Schema important for interorganizational document sharing? C.39 What is a key-value database? Under what circumstances is one most useful?

Where does processing relating to the structure of data values take place?

C.40 As illustrated in Figure C-19, what is column family database storage and how are such systems organized? How do column family database storage systems compare to RDBMS systems?

C.41 What is a graph database? What are nodes, properties, and edges? C.42 What are the main features of data in a document database? What are the basic

operations and utilities provided by a document DBMS?

Some of the Exercises and some of the Case and Project questions involve XML or AQL. These questions can be answered with or without XML software and ArangoDB. But if possible, use an XML schema validator such as Notepad++ (https://notepad-plus-plus. org/, free download, get the XML plugin) or XMLSpy (free trial download at https:// www.altova.com/xmlspy-xml-editor) for the XML questions and use ArangoDB (free download at www.arangodb.com) for the JSON and AQL questions. A pointer to Aran- goDB installation and use instructions is found earlier in this extension.

Z06_KROE8149_09_SE_EXTC.indd 69 04/07/19 9:05 PM

C-70 Online Extension C Advanced Business Intelligence and Big Data

C.43 Describe the process we used to migrate an SQL database from a PC to Azure. What components (SSMS, Azure Portal, local and remote database servers) were involved in each step of the process?

C.44 Develop an XML Schema for a document that uses the data in the HSD CUSTOMER table. Create an XML document for your schema that represents the first two customers. If you need more information about XML Schema, go to W3C and read tutorials on XML and XML Schema.

C.45 Develop a graph database based on a portion of the WP EMPLOYEE and PROJECT tables that will look similar to Figure C-20. Use WP data to model only Project 1300 and the employees assigned to it. The Project and the Employees will be nodes; the Assignment relationship is bidirectional (AssignedTo and ProjectWorkers) and becomes the edges and labels; properties will be from the data fields.

C.46 Develop a JSON file describing the first customer from the HSD CUSTOMER table. Be sure to represent CustomerID (value will be 1). Use the expressive power of JSON to extend the data to allow multiple phone numbers, and also to express Address as a composite containing StreetAddress, City, State, and ZIP. Make up a second phone number for Nancy Jacobs and enter it.

Use the data in Figure C-6 to answer questions C.47 through C.53. C.47 What is the probability that someone will buy a tank? C.48 What is the support for buying a tank and fins? What is the support for buying two

tanks?

C.49 What is the confidence for fins, given that a tank has been purchased? C.50 What is the confidence for a second tank, given that a tank has been purchased? C.51 What is the lift for fins, given that a tank has been purchased? C.52 What is the lift for a second tank, given that a tank has been purchased? C.53 How many transactions are there (among the 1,000) that involve none of the five

products mentioned in the table (mask, fins, tanks, dive computer, and weights)?

C.54 How could you improve the decision tree in Figure C-8 to be more efficient? Use the decision tree in Figure C-11 to answer questions C.55 through C.57.

C.55 Would the new data point (record) (cloudy, –3, 16) be classified as “skate” or “no skate”? Which nodes (questions) in the tree would be asked of this new record?

C.56 Would the new data point (record) (sunny, 5, 22) be classified as “skate” or “no skate”? Which nodes (questions) in the tree would be asked of this new record?

C.57 Draw a different decision tree, based on the same data, by basing the second ques- tion on a different attribute. Does your tree ask more or fewer questions, on aver- age, to categorize a new point when compared to the tree presented in the text? Does your tree have higher or lower accuracies for its decisions?

C.58 Based on the document structure you created in Exercise C.46, write the follow- ing queries in AQL, assuming the customer documents are stored in a collection named Customers:

A. Retrieve the last name and city of all customers whose first name is “Nancy”. B. Group the customers by state and, for each state, count how many customers

are from that state. Return the state and the count for each state.

EXERCISES

Z06_KROE8149_09_SE_EXTC.indd 70 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-71

WORKING WITH MICROSOFT ACCESS

Key Terms command button label

switchboard switchboard form

Exercises WA.C.1 In the “Working with Microsoft Access” section in Chapter 4, you created some

prototype forms and reports for the Wedgewood Pacific database (WP.accdb). Create a switchboard for the Wedgewood Pacific database (WP.accdb).

MARCIA’S DRY CLEANING CASE QUESTIONS

Marcia Wilson has been successfully using the MDC database described in Chapter 7. Her business has grown so much, however, that she wants to see what she might learn from min- ing some of her data. In addition, she wants to explore the possibility of using a document database for some of her data in the future.

A. Describe how an RFM analysis could be useful in Marcia’s business.

B. Using the data in Figures 7-27 through 7-30, construct a table similar to that in Figure C-6, based on the data in the MDC INVOICE_ITEM and SERVICE tables. Note that in a database this small, you can do this by hand. In a larger database, SQL can be used to create the table.

C. Based on your answer to the previous question, identify at least two pairs of services that occur together frequently. For each such pair (A, B) compute the support, the con- fidence of A given B, the confidence of B given A, and the lift.

D. Suppose that Marcia wants to be able to predict which invoices are likely to be paid on time. As part of this effort, she will add a DatePaid column to the INVOICE table (payments are due on the DateOut date). Describe, in general terms, how a decision tree, in conjunction with this new data, might help Marcia to predict which invoices will be paid on time.

E. Marcia has heard good things about XML as a way of structuring data; in addition, she may want to be able to easily share data with other local businesses using a standard- ized XML format. As an experiment, she would like to create XML data to model her customer data. Create an XML Schema file to model her customer data and create a matching XML document file containing data for at least three of her customers.

F. Marcia has also heard wonderful things about JSON-based document databases, and would like to experiment with such a system, with the possibility of deploying it in the future. Create a JSON-based document database for Marcia containing three collec- tions: one for customers, one for services, and one for invoices (with the invoice items stored within each invoice).

G. Based on the database you created in the previous question, write the following queries in AQL:

Z06_KROE8149_09_SE_EXTC.indd 71 04/07/19 9:05 PM

C-72 Online Extension C Advanced Business Intelligence and Big Data

GARDEN GLORY PROJECT QUESTIONS

These questions are based on the Garden Glory database described in Chapter 3, which Garden Glory has been successfully using. Their business has grown so much, however, that they want to see what they might learn from mining some of their data. In addition, they want to explore the possibility of using a document database for some of their data in the future.

A. Describe how an RFM analysis could be useful to Garden Glory.

B. Using the Garden Glory data in the Garden Glory Project Questions from Chapter 3, construct a table similar to that in Figure C-6, based on the data in the PROPERTY_ SERVICE and GG_SERVICE tables. Note that in a database this small, you can do this by hand. In a larger database, SQL can be used to create the table.

C. Based on your answer to the previous question, identify a pair of services that occurs together frequently. For this pair (A, B) compute the support, the confidence of A given B, the confidence of B given A, and the lift.

D. Suppose that Garden Glory wants to be able to predict which properties are likely to be “repeat customers.” As part of this effort, they will add a MultipleVisits column to the OWNED_PROPERTY table (this will be a “yes/no” column). Describe, in general terms, how a decision tree, in conjunction with this new data, might help Garden Glory to predict which properties will be visited more than once.

E. Garden Glory has heard good things about XML as a way of structuring data; in ad- dition, they may want to be able to easily share data with other local businesses using a standardized XML format. As an experiment, they would like to create XML data to model their property data. Create an XML Schema file to model the data in the OWNED_PROPERTY table and create a matching XML document file containing data for at least three of their properties.

F. Garden Glory has also heard wonderful things about JSON-based document databas- es, and would like to experiment with such a system, with the possibility of deploying it in the future. Create a JSON-based document database for a portion of the Garden Glory business (we will omit owners and employees). This database should contain one collection, which will contain one document for each property. Each property should have nested within it an array of all of its PROPERTY_SERVICE records, and each of these documents should have the service description and hourly cost (from the GG_ SERVICE table) stored within it.

G. Based on the database you created in the previous question, write the following queries in AQL:

1. Retrieve the ID and name of all office properties.

2. For each property, retrieve the name, city, and the service date and hours worked for all of its property services.

3. Group the properties by type and return, for each type, the type and the number of hours spent working on properties of that type.

1. Retrieve the descriptions of all services costing over $5. 2. For each invoice totaling over $100 retrieve the invoice number, date in, and the

service ID and quantity of all items in the invoice.

3. Group the invoices by customer and return, for each customer, the customer num- ber and the total amount spent by that customer on all their invoices combined.

Z06_KROE8149_09_SE_EXTC.indd 72 04/07/19 9:05 PM

Online Extension C Advanced Business Intelligence and Big Data C-73

These questions are based on the James River Jewelry JRJ database described in Chapter 3, which James River Jewelry has been successfully using. Their business has grown so much, however, that they want to see what they might learn from mining some of their data. In ad- dition, they want to explore the possibility of using a document database for some of their data in the future.

A. Describe how an RFM analysis could be useful to James River Jewelry.

B. Using the data in the James River Jewelry Project Questions from Chapter 3, construct a table similar to that in Figure C-6, based on the data in the PURCHASE_ITEM and ITEM tables. James River Jewelry is interested in finding pairs of artists whose work is purchased together. In order to make this and the next question more meaningful, assume that a new purchase transaction, InvoiceNumber 1010, has been added to the database, and that it includes purchases for items 15 and 17. Note that in a database this small, you can do this by hand. In a larger database, SQL can be used to create the table.

C. Based on your answer to the previous question, identify a pair of artists whose work is purchased together frequently. For this pair (A, B) compute the support, the confi- dence of A given B, the confidence of B given A, and the lift.

D. Suppose that James River Jewelry wants to be able to predict which items are likely to sell. As part of this effort, they will add a Sold column to the ITEM table (this will be a “yes/no” column). Describe, in general terms, how a decision tree, in conjunction with this new data, might help James River to predict which items will be sold.

E. James River Jewelry has heard good things about XML as a way of structuring data; in addition, they may want to be able to easily share data with other local businesses using a standardized XML format. As an experiment, they would like to create XML data to model their item data. Create an XML Schema file to model the data in the ITEM table and create a matching XML document file containing data for at least four of their items.

F. James River Jewelry has also heard wonderful things about JSON-based document databases, and would like to experiment with such a system, with the possibility of deploying it in the future. Create a JSON-based document database for a portion of the James River Jewelry business, to include only the artists and items they have made. This database should contain one collection, which will contain one document for each artist. Each artist, in addition to first name and last name, should have nested within it an array of all of their items.

G. Based on the database you created in the previous question, write the following queries in AQL:

1. Retrieve the first and last names of all artists.

2. For each artist, retrieve the first name, last name, and the description of all of their items costing over $100.

JAMES RIVER JEWELRY PROJECT QUESTIONS

THE QUEEN ANNE CURIOSITY SHOP PROJECT QUESTIONS

These questions are based on the Queen Anne Curiosity Shop database described in Chapter 3, which they have been successfully using. Their business has grown so much, however, that they want to see what they might learn from mining some of their data. In ad- dition, they want to explore the possibility of using a document database for some of their data in the future.

Z06_KROE8149_09_SE_EXTC.indd 73 04/07/19 9:05 PM

C-74 Online Extension C Advanced Business Intelligence and Big Data

A. Describe how an RFM analysis could be useful to The Queen Anne Curiosity Shop.

B. Using the data in The Queen Anne Curiosity Shop Project Questions from Chapter 3, construct a table similar to that in Figure C-6, based on the data in the ITEM and SALE_ITEM tables. Base the table on ItemDescription, not on ItemID. Note that in a database this small, you can do this by hand. In a larger database, SQL can be used to create the table.

C. Based on your answer to the previous question, identify a pair of items (again, based on descriptions) that is purchased together more than once. For this pair (A, B) compute the support, the confidence of A given B, the confidence of B given A, and the lift.

D. Suppose that The Queen Anne Curiosity Shop wants to be able to predict which items are likely to sell. As part of this effort, they will add a Sold column to the ITEM table (this will be a “yes/no” column). Describe, in general terms, how a decision tree, in conjunction with this new data, might help The Queen Anne Curiosity Shop to predict which items will sell.

E. The Queen Anne Curiosity Shop has heard good things about XML as a way of struc- turing data; in addition, they may want to be able to easily share data with other local businesses using a standardized XML format. As an experiment, they would like to create XML data to model their vendor data. Create an XML Schema file to model the data in the VENDOR table, and create a matching XML document file containing data for at least two individual vendors and at least two corporate vendors.

F. The Queen Anne Curiosity Shop has also heard wonderful things about JSON-based document databases, and would like to experiment with such a system, with the possi- bility of deploying it in the future. Create a JSON-based document database for a por- tion of The Queen Anne Curiosity Shop business that includes only vendors and items. This database should contain one collection, which will contain one document for each vendor. Each vendor should have nested within it an array of all of its ITEM records.

G. Based on the database you created in the previous question, write the following queries in AQL:

1. Retrieve the contact last name and city of all vendors in Washington state (WA).

2. For each vendor, retrieve the contact last name, address, and the item description and purchase date for all of its items.

3. Group the vendors by state and return, for each state, the state and the total cost of items purchased from vendors in that state.

Z06_KROE8149_09_SE_EXTC.indd 74 04/07/19 9:05 PM