Assignment - Zero plagiarism - Network Management
Deadline: Day 29/10/2020 @ 23:59
Data mining and data warehousing
IT446
College of Computing and Informatics
Question One
1.5 Marks
Learning Outcome(s):1
Explain different data mining tasks, problems and the algorithms most appropriate for addressing them
There are several typical cube computation methods such as Multi-Way, BUC, and Star-cubing. Briefly describe each one of these methods outlining the key points.
Answer:
For Multiway Array Aggregation,
it is a method for data cube computation, a major task in data warehouse implementation that aims to minimize response time and maximize performance. It uses a bottom-up approach.
In this method, a full data cube is computed. It relies on array addressing, where dimension values are accessed by the index of their corresponding array locations. It has two main steps:
1. Divide the array into chunks, which are subcubes small enough to fit in memory.
2. Compute aggregates in "multiway" fashion by visiting the cube cells in an order that minimizes the number of times each cell is visited.
With these steps, the data is aggregated from 3-D to 2-D to 1-D cuboids.
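The simultaneous aggregation in step 2 can be illustrated with a minimal Python sketch. It assumes a tiny dense 3-D array held as nested lists and deliberately leaves out chunking (step 1); the function name `multiway_aggregate` and the sample values are hypothetical and only serve the illustration.

```python
from itertools import product


def multiway_aggregate(base):
    """Aggregate a dense 3-D array (nested lists indexed [a][b][c]).

    Every base cell is visited once and added to all three 2-D cuboids
    at the same time; the 1-D cuboids and the apex are then derived from
    the 2-D results instead of rescanning the base data.
    """
    na, nb, nc = len(base), len(base[0]), len(base[0][0])
    ab = [[0] * nb for _ in range(na)]
    ac = [[0] * nc for _ in range(na)]
    bc = [[0] * nc for _ in range(nb)]

    # Single pass over the base cuboid: one visit per cell, three updates.
    for a, b, c in product(range(na), range(nb), range(nc)):
        value = base[a][b][c]
        ab[a][b] += value
        ac[a][c] += value
        bc[b][c] += value

    # 3-D -> 2-D -> 1-D: reuse the 2-D cuboids for the 1-D ones.
    a_totals = [sum(row) for row in ab]
    b_totals = [sum(row) for row in bc]
    c_totals = [sum(ac[a][c] for a in range(na)) for c in range(nc)]
    apex = sum(a_totals)
    return {"AB": ab, "AC": ac, "BC": bc,
            "A": a_totals, "B": b_totals, "C": c_totals, "ALL": apex}


# Hypothetical 2 x 2 x 2 base cuboid.
base = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
cuboids = multiway_aggregate(base)
print(cuboids["AB"], cuboids["A"], cuboids["ALL"])
```

In a real implementation the array would be split into memory-sized chunks and scanned in an order chosen to keep the largest 2-D planes out of memory for as short a time as possible; the sketch only shows why one visit per cell is enough.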
For BUC,
· Bottom-up cube computation
· Divides dimensions into partitions and facilitates iceberg pruning
· If a partition does not satisfy min_sup, its descendants can be pruned
· If min_sup = 1, the full cube is computed
· No simultaneous aggregation
For Star-Cubing,
· This method combines the two previous ones: it relies on both top-down and bottom-up cube computation, so it uses both multidimensional (simultaneous) aggregation and pruning.
· It explores shared dimensions: the cuboids are organized in a cuboid tree whose subtrees share dimensions.
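As a rough illustration of the BUC points above (partitioning and iceberg pruning), the following Python sketch counts tuples per group-by cell and stops expanding any partition that falls below min_sup. The function name `buc`, the `"*"` placeholder for an aggregated dimension, and the sample rows are assumptions made for this example, not part of the lecture material.

```python
from collections import defaultdict


def buc(rows, n_dims, min_sup, start=0, cell=None, out=None):
    """Return {cell: count} for every group-by cell with count >= min_sup.

    A cell is a tuple with one entry per dimension; "*" means the
    dimension is aggregated away.
    """
    if cell is None:
        cell = ("*",) * n_dims
    if out is None:
        out = {}
    if len(rows) < min_sup:
        return out  # iceberg pruning: nothing below this cell can qualify
    out[cell] = len(rows)
    for d in range(start, n_dims):
        # Partition the current rows on dimension d.
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[d]].append(row)
        for value, part in partitions.items():
            if len(part) >= min_sup:
                child = cell[:d] + (value,) + cell[d + 1:]
                buc(part, n_dims, min_sup, d + 1, child, out)
    return out


# Hypothetical two-dimensional input.
rows = [("a1", "b1"), ("a1", "b2"), ("a1", "b1"), ("a2", "b1")]
iceberg = buc(rows, n_dims=2, min_sup=2)
full = buc(rows, n_dims=2, min_sup=1)
print(iceberg)
print(len(full))
```

With min_sup = 1 nothing is pruned, so the second call enumerates the full cube, which matches the remark in the list above.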
References
Lecture 5, slides 9,10,14
Chapter 5 of the book, pages 195+200+204+210
Question Two
1.5 Marks
Learning Outcome(s):2
Apply and evaluate data mining algorithms with respect to problems they are specifically designed for
Consider the following table showing multiple transactions. Find all frequent itemsets using Apriori, then list all the strong association rules knowing that min_sup count = 2, and min_conf = 60%.
TID | Items
T1 | A, B, D, E
T2 | A, B, C
T3 | C, E
T4 | B, C
T5 | A
T6 | A, B, C
Answer: min_sup count = 2
1-itemsets (4 frequent itemsets)
Item | Support | Frequent or not
A | 4 | Yes
B | 4 | Yes
C | 4 | Yes
D | 1 | No
E | 2 | Yes
2-itemsets (3 frequent itemsets)
Item | Support | Frequent or not
A, B | 3 | Yes
A, C | 2 | Yes
A, E | 1 | No
B, C | 3 | Yes
B, E | 1 | No
C, E | 1 | No
3-itemsets (1 frequent itemset)
Item | Support | Frequent or not
A, B, C | 2 | Yes
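The support counts in the tables above can be cross-checked with a small Apriori sketch in Python. It is a minimal illustration over the six transactions from the question, not an optimized implementation; the helper names `support_count` and `apriori` are chosen for this example.

```python
from itertools import combinations

# Transactions T1..T6 from the question.
TRANSACTIONS = [
    {"A", "B", "D", "E"},
    {"A", "B", "C"},
    {"C", "E"},
    {"B", "C"},
    {"A"},
    {"A", "B", "C"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        counted = {c: support_count(c, transactions) for c in current}
        level = {c: n for c, n in counted.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))]
    return frequent


frequent = apriori(TRANSACTIONS, min_sup=2)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```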
Strong association rules (min_conf = 60%)
Rule | Confidence | Strong or not
A=>B | 3/4 = 75% | Yes
B=>A | 3/4 = 75% | Yes
A=>C | 2/4 = 50% | No
C=>A | 2/4 = 50% | No
B=>C | 3/4 = 75% | Yes
C=>B | 3/4 = 75% | Yes
A=>B,C | 2/4 = 50% | No
B=>A,C | 2/4 = 50% | No
C=>A,B | 2/4 = 50% | No
A,B=>C | 2/3 ≈ 67% | Yes
A,C=>B | 2/2 = 100% | Yes
B,C=>A | 2/3 ≈ 67% | Yes
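The confidence of each rule is support_count(whole itemset) / support_count(left-hand side). A short Python sketch, assuming the frequent itemsets {A, B}, {A, C}, {B, C} and {A, B, C} found earlier, recomputes every candidate rule; the function names are illustrative only.

```python
from itertools import combinations

# Transactions T1..T6 from the question.
TRANSACTIONS = [
    {"A", "B", "D", "E"},
    {"A", "B", "C"},
    {"C", "E"},
    {"B", "C"},
    {"A"},
    {"A", "B", "C"},
]

# Frequent itemsets with at least two items, taken from the tables above.
FREQUENT = [frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("ABC")]


def support_count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def rules_from(itemset, min_conf):
    """List (lhs, rhs, confidence, is_strong) for every split of itemset."""
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            rhs = itemset - lhs
            confidence = support_count(itemset) / support_count(lhs)
            rules.append((lhs, rhs, confidence, confidence >= min_conf))
    return rules


all_rules = [rule for itemset in FREQUENT for rule in rules_from(itemset, 0.6)]
strong = [rule for rule in all_rules if rule[3]]
for lhs, rhs, confidence, is_strong in all_rules:
    print(sorted(lhs), "=>", sorted(rhs), f"{confidence:.0%}",
          "strong" if is_strong else "not strong")
```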
References
Chapter 6, slides 5+6+14.
Chapter 6 of the book, page 246.
Question Three
1.5 Marks
Learning Outcome(s):2
Apply and evaluate data mining algorithms with respect to problems they are specifically designed for
Views are virtual in a database, but materialized views are persistent. Discuss the need for materialized views instead of views, and under what conditions no materialization is preferred.
Answer:
Precomputation of a data cube's cuboids is known as materialization. Storing the results of this process as views is called view materialization. There are three options:
Full materialization refers to the computation of all of the cuboids in the lattice defining a data cube. It typically requires an excessive amount of storage space, particularly as the number of dimensions and the size of the associated concept hierarchies grow. This problem is known as the curse of dimensionality.
Partial materialization is the selective computation of a subset of the cuboids or subcubes in the lattice. For example, an iceberg cube is a data cube that stores only those cube cells that have an aggregate value (e.g., count) above some minimum support threshold.
No materialization: do not precompute any of the "nonbase" cuboids. This leads to computing expensive multidimensional aggregates at query time.
Suppose that you want to create a data cube for an electronics company with the dimensions city, item, and year, and the measure sales. The corresponding lattice of cuboids contains one cuboid for every subset of the three dimensions.
Here, the number of generated cuboids is 2^n = 2^3 = 8 (n = 3).
If there are no hierarchies associated with the dimensions, the total number of cuboids for an n-dimensional data cube is 2^n, where n is the number of dimensions. In this case, no materialization suits best, because there is no need to precompute any of the "nonbase" cuboids.
If there are hierarchies associated with the dimensions (for example, time is usually explored in the hierarchy "day < month < quarter < year" rather than year alone), the number of cuboids grows much larger. In this case, no materialization performs the worst because it leads to computing expensive multidimensional aggregates, which is very slow. The solution is partial materialization.
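To make the effect of hierarchies concrete: the total number of cuboids is the product of (L_i + 1) over all dimensions, where L_i is the number of levels of dimension i and the extra 1 stands for the virtual top level "all". With one level per dimension this reduces to 2^n. The Python sketch below assumes, purely for illustration, that city and item have a single level while time has the four levels day, month, quarter, year; `total_cuboids` is a hypothetical helper name.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Product of (L_i + 1) over all dimensions; +1 is the level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


# city, item, year with no hierarchies: one level each.
no_hierarchy = total_cuboids([1, 1, 1])

# Assumed example: time explored as day < month < quarter < year.
with_time_hierarchy = total_cuboids([1, 1, 4])

print(no_hierarchy, with_time_hierarchy)
```

Even one four-level hierarchy raises the count from 8 to 20 cuboids in this small example, which is why computing everything on the fly becomes expensive as hierarchies are added.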
Reference
Chapter 4 of the book, page 159-160-179
Question Four
1.5 Marks
Learning Outcome(s):2
Apply and evaluate data mining algorithms with respect to problems they are specifically designed for
Calculate the 90% confidence interval for the following data sample:
a. Sample size of 66
b. Mean of their height is 22.4
c. Standard deviation of the data is 2.8
Given formula: the confidence interval is $\bar{x} \pm T_c \cdot \sigma_{\bar{x}}$,
where $\sigma_{\bar{x}} = s / \sqrt{n}$ is the estimated standard error of the mean.
Confidence Interval | Tc
80% | 1.282
85% | 1.440
90% | 1.645
95% | 1.960
99% | 2.576
99.5% | 2.807
99.9% | 3.291
Answer:
Step 1: start with
· the sample size is 66 (n)
· the mean is 22.4
· and the standard deviation is 2.8 (s)
Step 2: decide which confidence interval we want: 90%. Then find the "Tc" value for that confidence interval in the table; here it is 1.645.
Step 3: use that Tc value in the formula for the confidence interval: 22.4 ± 1.645 × 2.8/√66 ≈ 22.4 ± 0.567, which gives an interval of approximately 21.83 to 22.97.
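The three steps can be checked numerically with a few lines of Python. This is a minimal sketch that assumes the interval has the form mean ± Tc × s/√n, with s/√n as the estimated standard error of the mean; the function name `confidence_interval` is chosen for this example.

```python
from math import sqrt


def confidence_interval(mean, std_dev, n, t_c):
    """Return (lower, upper) for mean +/- t_c * std_dev / sqrt(n)."""
    standard_error = std_dev / sqrt(n)
    margin = t_c * standard_error
    return mean - margin, mean + margin


# Values from the question; 1.645 is the table entry for 90%.
lower, upper = confidence_interval(mean=22.4, std_dev=2.8, n=66, t_c=1.645)
print(f"90% confidence interval: {lower:.2f} to {upper:.2f}")
```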
References
Chapter 5, slide 27