DATA MINING

jefnyeiln4509

  

Data Mining Steps

> Problem Definition

Market Analysis

Customer Profiling, Identifying Customer Requirements, Cross-Market Analysis, Target Marketing, Determining Customer Purchasing Patterns

Corporate Analysis and Risk Management

Finance Planning and Asset Evaluation, Resource Planning, Competition 

Fraud Detection

Customer Retention

Production Control

Science Exploration

> Data Preparation 

Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling. It is good practice to start with an initial dataset to get familiar with the data, discover first insights, and understand any possible data quality issues. The datasets provided in these projects were obtained from kaggle.com.

Variable selection and description

Numerical – Ratio, Interval

Categorical – Ordinal, Nominal

Simplifying variables: From continuous to discrete

Formatting the data 

Basic data integrity checks: missing data, outliers
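
The last three items above can be sketched as follows. This is only an illustration under stated assumptions: the `ages` column is hypothetical, missing values are assumed to be stored as `None`, and the 1.5 x IQR rule for outliers and the age-group cut points are choices made here, not something the notes prescribe.

```python
import statistics

# Hypothetical raw column: ages with one missing entry and one suspicious value.
ages = [23, 35, None, 41, 29, 120, 52, 38]

# Integrity check 1: missing data.
missing = sum(1 for a in ages if a is None)
present = [a for a in ages if a is not None]

# Integrity check 2: outliers, using the 1.5 * IQR rule (an assumption;
# other rules exist).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in present if a < low or a > high]

# Simplifying a variable: continuous age -> discrete age group.
# The cut points 30 and 50 are arbitrary illustrative choices.
def age_group(age):
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"

groups = [age_group(a) for a in present if a not in outliers]

print("missing:", missing)
print("outliers:", outliers)
print("groups:", groups)
```

Whether an outlier should be dropped, capped, or kept depends on the problem; the sketch only flags it and leaves it out of the grouping.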

> Data Exploration 

Data Exploration is about describing the data by means of statistical and visualization techniques.

· Data Visualization: 

o Univariate analysis explores variables (attributes) one by one. Each variable can be either categorical or numerical.

   

Univariate Analysis - Categorical

Statistics | Visualization | Description
Count      | Bar Chart     | The number of values of the specified variable.
Count%     | Pie Chart     | The percentage of values of the specified variable.
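
As a rough sketch, Count and Count% for a categorical variable can be computed with the standard library alone. The `colors` data is hypothetical, and the text bar stands in for a real bar chart.

```python
from collections import Counter

# Hypothetical categorical variable (not from the original notes).
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)                 # Count: observations per category
total = sum(counts.values())
percents = {k: 100 * v / total for k, v in counts.items()}  # Count%

for category, n in counts.most_common():
    # Crude text bar chart: one '#' per observation.
    print(f"{category:<6} {'#' * n:<4} {n}  {percents[category]:.1f}%")
```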

   

Univariate Analysis - Numerical

Statistics               | Visualization | Equation | Description
Count                    | Histogram     | N | The number of values (observations) of the variable.
Minimum                  | Box Plot      | Min | The smallest value of the variable.
Maximum                  | Box Plot      | Max | The largest value of the variable.
Mean                     | Box Plot      | (image: http://www.saedsayad.com/images/Mean.png) | The sum of the values divided by the count.
Median                   | Box Plot      | (image: http://www.saedsayad.com/images/Median.png) | The middle value. An equal number of values lies below and above the median.
Mode                     | Histogram     | (none given) | The most frequent value. There can be more than one mode.
Quantile                 | Box Plot      | (image: http://www.saedsayad.com/images/Quantiles.png) | A set of 'cut points' that divide a set of data into groups containing equal numbers of values (Quartile, Quintile, Percentile, ...).
Range                    | Box Plot      | Max-Min | The difference between maximum and minimum.
Variance                 | Histogram     | (image: http://www.saedsayad.com/images/Variance.png) | A measure of data dispersion.
Standard Deviation       | Histogram     | (image: http://www.saedsayad.com/images/StDev.png) | The square root of variance.
Coefficient of Variation | Histogram     | (image: http://www.saedsayad.com/images/CV.png) | The standard deviation divided by the mean (a relative measure of dispersion).
Skewness                 | Histogram     | (image: http://www.saedsayad.com/images/Skewness.png) | A measure of symmetry or asymmetry in the distribution of data.
Kurtosis                 | Histogram     | (image: http://www.saedsayad.com/images/Kurtosis.png) | A measure of whether the data are peaked or flat relative to a normal distribution.

Note: There are two types of numerical variables, interval and ratio. An interval variable has values whose differences are interpretable, but it does not have a true zero. A good example is temperature in degrees Celsius. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. For example, we cannot say that one day is twice as hot as another day. In contrast, a ratio variable has values with a true zero and can be added, subtracted, multiplied or divided (e.g., weight).
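
The numerical statistics listed above can be computed roughly as follows. This is a sketch, not a reference implementation: the data `x` is hypothetical, the population forms (dividing by N) are assumed for variance and standard deviation, and skewness and kurtosis use the simple moment-based definitions, with kurtosis reported as excess over a normal distribution. The original equation images may use different conventions.

```python
import math
import statistics

# Hypothetical numerical variable (not from the original notes).
x = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(x)                                  # Count
minimum, maximum = min(x), max(x)
value_range = maximum - minimum             # Range = Max - Min
mean = statistics.fmean(x)
median = statistics.median(x)
modes = statistics.multimode(x)             # a list: there can be several modes
quartiles = statistics.quantiles(x, n=4)    # three cut points -> four groups

# Population forms (divide by N) are assumed; sample forms divide by N - 1.
variance = statistics.pvariance(x)
std_dev = math.sqrt(variance)
cv = std_dev / mean                         # coefficient of variation

# Moment-based skewness and excess kurtosis (both 0 for a normal distribution).
m3 = sum((v - mean) ** 3 for v in x) / n
m4 = sum((v - mean) ** 4 for v in x) / n
skewness = m3 / std_dev ** 3
kurtosis = m4 / std_dev ** 4 - 3

print(n, minimum, maximum, value_range, mean, median, modes, quartiles)
print(variance, std_dev, cv, skewness, kurtosis)
```

A positive skewness indicates a longer right tail; a negative excess kurtosis indicates a distribution flatter than the normal.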

o Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the relationship between the two variables: whether an association exists and how strong it is.

There are three types of bivariate analysis. 

1. Numerical & Numerical

Scatter Plot, Linear Correlation …

2. Categorical & Categorical

Stacked Column Chart, Combination Chart, Chi-square Test

3. Numerical & Categorical

Line Chart with Error Bars, Combination Chart, Z-test and t-test
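
Two of the measures named above, linear correlation (numerical & numerical) and the chi-square statistic (categorical & categorical), can be sketched in plain Python. The function names and sample data are hypothetical, the correlation is assumed to be Pearson's, and the chi-square function returns only the test statistic, not a p-value. The numerical & categorical tests are not covered here.

```python
import math
from collections import Counter

def pearson_r(xs, ys):
    """Linear (Pearson) correlation between two numerical variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def chi_square(a, b):
    """Chi-square statistic for two categorical variables (no p-value)."""
    n = len(a)
    observed = Counter(zip(a, b))      # contingency table as {(row, col): count}
    row, col = Counter(a), Counter(b)  # marginal totals
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

# Hypothetical data (not from the original notes).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
gender = ["m", "m", "f", "f", "m", "f", "m", "f"]
bought = ["y", "y", "n", "n", "y", "n", "y", "n"]

r = pearson_r(hours, score)
chi2 = chi_square(gender, bought)
print(round(r, 3), chi2)
```

A value of r close to +1 or -1 suggests a strong linear relationship; a larger chi-square statistic suggests the two categorical variables are less likely to be independent.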

> Modeling 

· Predictive modeling is the process by which a model is created to predict an outcome.

o If the outcome is categorical, it is called classification; if the outcome is numerical, it is called regression.

· Descriptive modeling or clustering is the assignment of observations into clusters so that observations in the same cluster are similar. 

· Finally, association rules can find interesting associations amongst observations. 

  

Classification algorithms:

  

  1. Frequency Table

  2. Covariance Matrix

  3. Similarity Functions

  4. Others


Regression algorithms:

  1. Frequency Table

  2. Covariance Matrix

  3. Similarity Functions

  4. Others
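
To make the classification/regression distinction concrete, here is a small sketch of one similarity-function approach. The notes do not name a specific algorithm; k-nearest neighbours with Euclidean distance is assumed here purely for illustration, and all function names and data are hypothetical.

```python
import math
from collections import Counter

def nearest(train, query, k):
    """Return the k training rows closest to `query` (Euclidean distance)."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]

def knn_classify(train, query, k=3):
    """Categorical outcome: majority vote among the k nearest rows."""
    votes = Counter(label for _, label in nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Numerical outcome: mean of the k nearest outcomes."""
    values = [y for _, y in nearest(train, query, k)]
    return sum(values) / len(values)

# Hypothetical training rows: (features, outcome).
labelled = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
            ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
priced = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

predicted_class = knn_classify(labelled, (2, 2))
predicted_value = knn_regress(priced, (2,), k=3)
print(predicted_class, predicted_value)
```

The same similarity function serves both tasks; only the way the neighbours' outcomes are combined changes (vote versus average).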


 

Clustering algorithms:

  1. Hierarchical

  2. Partitive


> Evaluation 

· Evaluation helps to find the best model that represents our data and to estimate how well the chosen model will work in the future. Two common methods are Hold-Out and Cross-Validation.
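
A minimal sketch of the two evaluation methods, assuming the dataset is a simple list of rows. The function names, the 25% test fraction, and k=5 are illustrative choices, not values taken from the notes.

```python
import random

def hold_out(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy once, then split it into a training set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def k_fold(rows, k=5):
    """Yield (train, test) pairs; each row is used for testing exactly once."""
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, test

data = list(range(20))  # hypothetical dataset: 20 row identifiers

train, test = hold_out(data)
print(len(train), len(test))

folds = list(k_fold(data, k=5))
for fold_train, fold_test in folds:
    # A model would be fitted on fold_train and scored on fold_test here.
    print(len(fold_train), len(fold_test))
```

With cross-validation the k scores are usually averaged, which gives a steadier estimate than a single hold-out split at the cost of fitting the model k times.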

> Deployment

The concept of deployment in predictive data mining refers to the application of a model for prediction to new data.

