Assignment - Zero plagiarism - Network Management


Deadline: Day 29/10/2020 @ 23:59

Data mining and data warehousing

IT446


College of Computing and Informatics

Question One

1.5 Marks

Learning Outcome(s):1

Explain different data mining tasks, problems and the algorithms most appropriate for addressing them

There are several typical cube computation methods such as Multi-Way, BUC, and Star-cubing. Briefly describe each one of these methods outlining the key points.

Answer:

For Multiway Array Aggregation,

This is a method for data cube computation, which is a major task in data warehouse (DW) implementation. Its aim is to minimize response time and maximize performance. It uses a bottom-up approach.

In this method, a full data cube is computed. It relies on array addressing, where dimension values are accessed through the index of their corresponding array locations. It has two main steps:

1. Divide the array into chunks, which are subcubes small enough to fit in memory.

2. Compute aggregates in “multiway” fashion by visiting the cube cells in the order that minimizes the number of times each cell is visited.

By following these steps, we can aggregate the data from 3-D to 2-D and then to 1-D.
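The two steps above can be sketched in Python. This is a minimal illustration, not the textbook algorithm in full: it assumes a small dense 3-D array stored as nested lists, uses a hypothetical `chunk` size parameter, and visits chunks in plain index order rather than in the memory-minimizing order the method calls for.

```python
from itertools import product

def multiway_aggregate(cube, chunk):
    """Sum a dense 3-D array (dims A x B x C) chunk by chunk.

    Each base cell is read once and added to all three 2-D cuboids
    (AB, AC, BC) at the same time; the 1-D cuboids and the apex are
    then derived from the smaller 2-D results.
    """
    na, nb, nc = len(cube), len(cube[0]), len(cube[0][0])
    ab = [[0] * nb for _ in range(na)]
    ac = [[0] * nc for _ in range(na)]
    bc = [[0] * nc for _ in range(nb)]

    # Step 1: walk the array one chunk (subcube) at a time.
    starts = product(range(0, na, chunk), range(0, nb, chunk), range(0, nc, chunk))
    for a0, b0, c0 in starts:
        # Step 2: every cell in the chunk feeds all 2-D cuboids at once.
        for a in range(a0, min(a0 + chunk, na)):
            for b in range(b0, min(b0 + chunk, nb)):
                for c in range(c0, min(c0 + chunk, nc)):
                    v = cube[a][b][c]
                    ab[a][b] += v
                    ac[a][c] += v
                    bc[b][c] += v

    # 2-D -> 1-D -> apex, reusing the already aggregated cuboids.
    a_tot = [sum(row) for row in ab]
    b_tot = [sum(col) for col in zip(*ab)]
    c_tot = [sum(col) for col in zip(*ac)]
    return {"AB": ab, "AC": ac, "BC": bc,
            "A": a_tot, "B": b_tot, "C": c_tot, "all": sum(a_tot)}

cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
result = multiway_aggregate(cube, chunk=1)
print(result["AB"], result["A"], result["all"])
```

The point of the sketch is the simultaneous update of `ab`, `ac`, and `bc` inside one pass over the base cells, which is what distinguishes multiway aggregation from computing each cuboid in a separate scan.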

For BUC,

· Bottom-up cube computation

· Divides dimensions into partitions and facilitates iceberg pruning

· If a partition does not satisfy min_sup, its descendants can be pruned

· If min_sup = 1, the full cube is computed

· No simultaneous aggregation
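The partition-and-prune idea in the points above can be sketched as a short recursive function. This is a simplified illustration that assumes count is the only measure and that rows are plain tuples; the function name `buc` and its parameters are chosen here for illustration.

```python
def buc(rows, dims, min_sup, prefix=(), start=0, out=None):
    """Recursive BUC sketch over tuples with `dims` dimensions.

    Returns {cell: count} for every cell whose count >= min_sup, where a
    cell is a tuple of (dimension_index, value) pairs and () is the apex.
    """
    if out is None:
        out = {}
    out[prefix] = len(rows)  # aggregate for the current group-by cell
    for d in range(start, dims):
        # Partition the current rows on dimension d.
        partitions = {}
        for row in rows:
            partitions.setdefault(row[d], []).append(row)
        for value, part in partitions.items():
            # Iceberg pruning: a partition below min_sup cannot have a
            # descendant that reaches min_sup, so it is skipped entirely.
            if len(part) >= min_sup:
                buc(part, dims, min_sup, prefix + ((d, value),), d + 1, out)
    return out

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"),
        ("a1", "b2", "c1"), ("a2", "b1", "c1")]
cells = buc(rows, dims=3, min_sup=2)
for cell, count in cells.items():
    print(cell, count)
```

Each recursive call works on one partition only, so aggregates are produced one cell at a time; that is the sense in which BUC has no simultaneous aggregation.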

Star-cubing

· This method combines the two previous ones: it relies on both top-down and bottom-up cube computation, so it uses both multidimensional aggregation and pruning.

· It explores shared dimensions across cuboids, which are organized as a cuboid tree.

References

Lecture 5, slides 9, 10, 14

Chapter 5 of the book, pages 195, 200, 204, 210

Question Two

1.5 Marks

Learning Outcome(s):2

Apply and evaluate data mining algorithms with respect to problems they are specifically designed for

Consider the following table showing multiple transactions. Find all frequent itemsets using Apriori, then list all the strong association rules knowing that min_sup count = 2, and min_conf = 60%.

TID     Items
T1      A, B, D, E
T2      A, B, C
T3      C, E
T4      B, C
T5      A
T6      A, B, C

Answer: min_sup count = 2

1-itemsets (4 frequent itemsets)

Item    Support    Frequent or not
A       4          Yes
B       4          Yes
C       4          Yes
D       1          No
E       2          Yes

2-itemsets (3 frequent itemsets)

Item    Support    Frequent or not
A, B    3          Yes
A, C    2          Yes
A, E    1          No
B, C    3          Yes
B, E    1          No
C, E    1          No

3-itemsets (1 frequent itemset)

Item       Support    Frequent or not
A, B, C    2          Yes
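The level-by-level search above can be reproduced with a short Apriori sketch. It is a minimal version written for this small table, with illustrative function and variable names; it counts support by rescanning the transactions for every candidate, which is fine at this size.

```python
from itertools import combinations

transactions = [
    {"A", "B", "D", "E"},  # T1
    {"A", "B", "C"},       # T2
    {"C", "E"},            # T3
    {"B", "C"},            # T4
    {"A"},                 # T5
    {"A", "B", "C"},       # T6
]

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    level = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        counted = {c: support(c) for c in level}
        current = {c: n for c, n in counted.items() if n >= min_sup}
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent

frequent = apriori(transactions, min_sup=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```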

Strong association rules (min_conf = 60%)

Rule      Confidence     Strong or not
A=>B      3/4 = 75%      Yes
B=>A      3/4 = 75%      Yes
A=>C      2/4 = 50%      No
C=>A      2/4 = 50%      No
B=>C      3/4 = 75%      Yes
C=>B      3/4 = 75%      Yes
A=>B,C    2/4 = 50%      No
B=>A,C    2/4 = 50%      No
C=>A,B    2/4 = 50%      No
A,B=>C    2/3 = 66.7%    Yes
A,C=>B    2/2 = 100%     Yes
B,C=>A    2/3 = 66.7%    Yes
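Rule confidences can be checked mechanically. The sketch below assumes the support counts of the frequent itemsets listed earlier and computes confidence(X => Y) = support(X ∪ Y) / support(X) for every non-empty proper subset X of each frequent itemset, including two-item antecedents such as A,B => C. The function name is illustrative.

```python
from itertools import combinations

# Support counts of the frequent itemsets (min_sup count = 2).
support = {
    frozenset("A"): 4, frozenset("B"): 4, frozenset("C"): 4, frozenset("E"): 2,
    frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 3,
    frozenset("ABC"): 2,
}

def strong_rules(support, min_conf):
    """Return {(antecedent, consequent): confidence} for rules meeting min_conf."""
    rules = {}
    for itemset, count in support.items():
        # Every non-empty proper subset of the itemset is a possible antecedent.
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / support[lhs]
                if conf >= min_conf:
                    rules[(lhs, itemset - lhs)] = conf
    return rules

rules = strong_rules(support, min_conf=0.6)
lines = [f"{','.join(sorted(lhs))} => {','.join(sorted(rhs))}: {conf:.1%}"
         for (lhs, rhs), conf in rules.items()]
print("\n".join(sorted(lines)))
```

Because every subset of a frequent itemset is itself frequent, the lookup `support[lhs]` always succeeds.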

References

Chapter 6, slides 5, 6, 14.

Chapter 6 of the book, page 246.

Question Three

1.5 Marks

Learning Outcome(s):2

Apply and evaluate data mining algorithms with respect to problems they are specifically designed for

Views are virtual in a database, but materialized views are persistent. Discuss the need for materialized views instead of views, and under what conditions no materialization is preferred.

Answer:

Precomputation of a data cube’s cuboids is known as materialization. Storing the result of this precomputation so that it can be queried directly is called view materialization.

Full materialization refers to the computation of all of the cuboids in the lattice defining a data cube. It typically requires an excessive amount of storage space, particularly as the number of dimensions and the size of the associated concept hierarchies grow. This problem is known as the curse of dimensionality. Alternatively, partial materialization is the selective computation of a subset of the cuboids or subcubes in the lattice. For example, an iceberg cube is a data cube that stores only those cube cells that have an aggregate value (e.g., count) above some minimum support threshold.

No materialization: do not precompute any of the “nonbase” cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be very slow.

Suppose that you want to create a data cube for an electronics company with the dimensions city, item, and year, and the measure sales. The corresponding lattice of cuboids contains 2^n = 2^3 = 8 cuboids, since there are n = 3 dimensions.

If no hierarchies are associated with the dimensions, the total number of cuboids for an n-dimensional data cube is 2^n, where n is the number of dimensions. In this case, no materialization suits best, because there is no need to precompute any of the “non-base” cuboids.

If hierarchies are associated with the dimensions (for example, time is usually explored through the hierarchy “day < month < quarter < year” rather than year alone), the number of cuboids grows much larger. In this case, no materialization performs the worst, because it leads to computing expensive multidimensional aggregates on the fly, which is very slow. The solution is partial materialization.
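The growth in the number of cuboids can be made concrete with the standard count: the product over all dimensions of (L_i + 1), where L_i is the number of hierarchy levels of dimension i and the extra 1 stands for the generalized level "all". The sketch below uses an illustrative function name and assumes, for the example, that only the time dimension has a multi-level hierarchy.

```python
from math import prod

def total_cuboids(levels):
    """levels[i] = number of hierarchy levels of dimension i (excluding 'all')."""
    return prod(l + 1 for l in levels)

# city, item, year with one level each: the plain 2**n case.
no_hierarchy = total_cuboids([1, 1, 1])

# Same cube, but time explored as day < month < quarter < year (4 levels).
with_time_hierarchy = total_cuboids([1, 1, 4])

print(no_hierarchy, with_time_hierarchy)
```

With one level per dimension the formula reduces to 2^n, which matches the count used for the three-dimensional example.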

Reference

Chapter 4 of the book, pages 159, 160, 179

Question Four

1.5 Marks

Learning Outcome(s):2

Apply and evaluate data mining algorithms with respect to problems they are specifically designed for

Calculate the 90% confidence interval for the following data sample:

a. Sample size of 66

b. Mean of their height is 22.4

c. Standard deviation of the data is 2.8

Given formula for the confidence interval at a chosen confidence level:

$\bar{x} \pm T_c \cdot \sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = s / \sqrt{n}$ is the estimated standard error of the mean

Confidence Interval    Tc
80%                    1.282
85%                    1.440
90%                    1.645
95%                    1.960
99%                    2.576
99.5%                  2.807
99.9%                  3.291

Answer:

Step 1: start with

· the sample size is 66 (n)

· the mean is 22.4

· and the standard deviation 2.8 (s)

Step 2: decide which confidence level we want: 90%. Then find the "Tc" value for that level in the table: 1.645

Step 3: use that Tc value in the formula for the confidence interval
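Step 3 can be carried out numerically. The sketch below assumes the interval has the form mean ± Tc × s/√n, with s/√n as the estimated standard error of the mean; the variable names are illustrative.

```python
from math import sqrt

n = 66        # sample size
mean = 22.4   # sample mean
s = 2.8       # sample standard deviation
tc = 1.645    # table value for the 90% level

standard_error = s / sqrt(n)   # estimated standard error of the mean
margin = tc * standard_error   # half-width of the interval
lower, upper = mean - margin, mean + margin

print(f"90% CI: {mean} ± {margin:.3f} = ({lower:.3f}, {upper:.3f})")
```

With these inputs the margin is about 0.567, so the 90% confidence interval is roughly (21.83, 22.97).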

References

Chapter 5, slide 27