A REPORT ON DATA MINING AND WEB USAGE


BY

MUHAMMAD ABRAR UDDIN

DATA MINING TECHNIQUES USING

Web usage mining

Web Mining is the application of data mining techniques to discover and retrieve useful information and patterns from World Wide Web documents and services.

Web Mining: Definition

• Web usage mining

– can be broadly defined as the discovery and analysis of useful information from the WWW.

– is the automatic discovery of patterns in clickstreams and associated data, collected or generated as a result of user interactions with one or more Web sites.

• Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.

• The process may involve pre-processing the original data, a step known as data preparation.

• Data preparation is especially important in Web usage mining because of the characteristics of clickstream data.

• It is critical to the successful extraction of useful patterns from the data.

Web Usage Mining – Preprocessing


• Data cleaning

– remove irrelevant references and fields in server logs

– remove references due to spider/robot navigation

– add missing references due to caching (done after sessionization)

• Data fusion/integration

– synchronize data from multiple server logs

– integrate e-commerce and application server data

– integrate meta-data (e.g., content labels)

• Data transformation

– user identification

– sessionization

– pageview identification (a pageview is a set of page files and associated objects that contribute to a single display in a Web browser)

• Data reduction

– sampling and dimensionality reduction (ignoring certain pageviews/items)

• Identifying user transactions

– i.e., sets or sequences of pageviews, possibly with associated weights
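
The data cleaning step listed above could be sketched roughly as follows. This is only an illustration: the entry fields (`ip`, `url`, `agent`), the list of file suffixes, and the robot markers are assumptions made for the example, not part of the original slides.

```python
# Assumed file suffixes treated as embedded objects rather than pageviews.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
# Assumed substrings that mark a crawler in the User-Agent string.
ROBOT_MARKERS = ("bot", "spider", "crawler")


def is_relevant(entry):
    """Return True if a parsed log entry should survive data cleaning."""
    url = entry["url"].lower().split("?")[0]
    agent = entry["agent"].lower()
    if url.endswith(IRRELEVANT_SUFFIXES):
        return False  # image, style sheet, or script request
    if url == "/robots.txt" or any(m in agent for m in ROBOT_MARKERS):
        return False  # spider/robot navigation
    return True


def clean(entries):
    """Keep only the entries that look like human pageview requests."""
    return [e for e in entries if is_relevant(e)]


# Hypothetical, already-parsed log entries.
log = [
    {"ip": "10.0.0.1", "url": "/index.html", "agent": "Mozilla/5.0"},
    {"ip": "10.0.0.1", "url": "/logo.gif", "agent": "Mozilla/5.0"},
    {"ip": "10.0.0.2", "url": "/index.html", "agent": "ExampleBot/1.0"},
]
cleaned = clean(log)
```

Real robot detection usually needs more than a substring check on the user agent, so the markers here should be read as placeholders.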

Sessionization (identifying sessions)

– It is the process of segmenting the user activity record of each user into sessions, each representing a single visit to the site.

– The goal of a sessionization heuristic is to reconstruct, from the clickstream data, the actual sequence of actions performed by one user during one visit to the site.

Reliable usage data is difficult to obtain because of:

– proxy servers

– dynamic IP addresses

– the inability of servers to distinguish among different visits
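
One common sessionization heuristic is time-based: a new session starts whenever a user is inactive for longer than a threshold. The sketch below assumes a 30-minute timeout and requests given as `(user_id, timestamp_seconds, url)` tuples; both are illustrative assumptions, and the slides do not prescribe a particular heuristic.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # assumed inactivity threshold, in seconds


def sessionize(requests, timeout=TIMEOUT):
    """Split each user's requests into sessions at gaps longer than `timeout`.

    `requests` is an iterable of (user_id, timestamp_seconds, url) tuples.
    Returns a dict mapping user_id to a list of sessions (lists of URLs).
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = {}
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        user_sessions = []
        last_ts = None
        for ts, url in hits:
            if last_ts is None or ts - last_ts > timeout:
                user_sessions.append([])  # long gap: start a new visit
            user_sessions[-1].append(url)
            last_ts = ts
        sessions[user] = user_sessions
    return sessions


# Hypothetical requests: u1 returns after more than an hour.
requests = [
    ("u1", 0, "/home"),
    ("u1", 120, "/products"),
    ("u1", 4000, "/home"),
    ("u2", 50, "/home"),
]
result = sessionize(requests)
```

Here `result["u1"]` contains two sessions, because the gap between the second and third request exceeds the assumed timeout.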

Pageview identification

Depends on the intra-page structure of sites

Identify the collection of Web files representing a specific “user event” corresponding to a clickthrough (e.g. viewing a product page, adding a product to a shopping cart)

Another example is the purchase of a product on an online e-commerce site.

User Identification

The analysis of Web usage does not require knowledge about a user’s identity. However, it is necessary to distinguish among different users.

Since a user may visit a site more than once, the server logs record multiple sessions for each user.
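
When no login or cookie is available, one simple way to distinguish users is to treat each distinct combination of IP address and user agent as one user. The sketch below illustrates that heuristic; the field names and the `userN` labels are assumptions made for the example, and the heuristic can fail behind proxies or with dynamic IP addresses.

```python
def identify_users(entries):
    """Label each log entry with a user id derived from its (ip, agent) pair."""
    ids = {}
    labelled = []
    for e in entries:
        key = (e["ip"], e["agent"])
        if key not in ids:
            ids[key] = f"user{len(ids) + 1}"  # first time this pair is seen
        labelled.append({**e, "user": ids[key]})
    return labelled


# Hypothetical entries: two browsers share one IP address.
entries = [
    {"ip": "10.0.0.1", "agent": "Firefox", "url": "/home"},
    {"ip": "10.0.0.1", "agent": "Chrome", "url": "/home"},
    {"ip": "10.0.0.1", "agent": "Firefox", "url": "/products"},
]
labelled = identify_users(entries)
```

The first and third entries receive the same label, while the second is treated as a different user even though it shares the IP address.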

Path completion

– Client- or proxy-side caching can often result in missing access references to those pages or objects that have been cached.

– For instance, if a user goes back to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side, so no request is made to the server.

– As a result, the second reference to A is not recorded in the server logs.
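
A minimal sketch of one path completion heuristic follows. It assumes each logged request carries a referrer, and that when the referrer is not the most recent page, the user pressed "back" through cached pages until reaching it. The `(page, referrer)` input format is an assumption for illustration.

```python
def complete_path(session):
    """Insert inferred back-navigation steps into a session.

    `session` is a list of (page, referrer) pairs in request order;
    the referrer may be None. Returns the completed list of pages.
    """
    path = []   # completed sequence of pageviews
    stack = []  # pages the browser could return to with "back"
    for page, referrer in session:
        if stack and referrer in stack and referrer != stack[-1]:
            # The referrer is an earlier page: assume cached "back" steps.
            while stack[-1] != referrer:
                stack.pop()
                path.append(stack[-1])
        stack.append(page)
        path.append(page)
    return path


# Logged requests: D was reached from A, although C was the last logged page.
session = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
completed = complete_path(session)
```

For this input the completed path is A, B, C, B, A, D: the returns to B and A were served from the cache and never reached the server log.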

• The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common interests.

• Decision Trees

– a flow chart of questions leading to a decision

– Ex: car-buying decision tree

• Path Analysis

– uses a graph model

– provides insights into navigational problems

– Examples of information discovered by path analysis:

78%: “company” -> “what’s new” -> “sample” -> “order”

60% left the site after 4 or fewer page references

=> the most important information must be within the first 4 pages of site entry points.
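
The kind of figures quoted above can be derived from sessionized data. The sketch below counts transitions between pages (the edges of the graph model), the share of sessions that follow the most frequent path, and the share of sessions with at most four page references. The session data and page names are made up for illustration.

```python
from collections import Counter


def transition_graph(sessions):
    """Count how often each directed edge (page -> next page) is traversed."""
    edges = Counter()
    for s in sessions:
        edges.update(zip(s, s[1:]))
    return edges


def path_stats(sessions, max_len=4):
    """Return the most frequent path, its share, and the share of short sessions."""
    paths = Counter(tuple(s) for s in sessions)
    top_path, top_count = paths.most_common(1)[0]
    short = sum(1 for s in sessions if len(s) <= max_len)
    return top_path, top_count / len(sessions), short / len(sessions)


# Hypothetical sessions, each a list of page labels in visit order.
sessions = [
    ["company", "whats-new", "sample", "order"],
    ["company", "whats-new", "sample", "order"],
    ["company", "whats-new"],
    ["company", "products", "sample", "order", "help", "company"],
]
edges = transition_graph(sessions)
top_path, top_share, short_share = path_stats(sessions)
```

In this toy data, half of the sessions follow the company, whats-new, sample, order path, and three quarters end within four page references.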

• Grouping

– groups similar information to help draw higher-level conclusions

– Ex: all URLs containing the word “Yahoo”…

• Filtering

– allows answering specific questions such as:

– how many visitors came to the site this week?
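
A question like the one above reduces to a filter plus a distinct count. The sketch below assumes entries that already carry a `user` label and a `day` date, and uses ISO week numbers to define "this week"; all of these are assumptions for the example.

```python
from datetime import date


def visitors_in_week(entries, year, week):
    """Count distinct users with at least one request in the given ISO week."""
    return len({
        e["user"]
        for e in entries
        if tuple(e["day"].isocalendar())[:2] == (year, week)
    })


# Hypothetical entries: three requests fall in ISO week 1 of 2024.
entries = [
    {"user": "u1", "day": date(2024, 1, 1)},
    {"user": "u1", "day": date(2024, 1, 3)},
    {"user": "u2", "day": date(2024, 1, 2)},
    {"user": "u3", "day": date(2024, 1, 8)},
]
count = visitors_in_week(entries, 2024, 1)
```

Repeat requests by the same user are counted once, so `count` reflects visitors rather than hits.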

• Cookies

– an ID randomly assigned by the Web server to the browser

– cookies are beneficial to both Web site developers and visitors

– the cookie field entry in the log file can be used by Web traffic analysis software to track repeat visitors → loyal customers

• Association Rules

– help find spending patterns on related products

– 30% of those who accessed /company/products/bread.html also accessed /company/products/milk.htm.
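
A percentage of this kind can be read as the confidence of a rule "bread.html => milk.htm". The sketch below computes support and confidence for a single rule over sessions represented as sets of URLs; the session data is invented for illustration, and a full rule miner (for example Apriori) would search over many candidate rules rather than evaluate one.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    `sessions` is a list of sets of URLs.
    """
    with_a = [s for s in sessions if antecedent in s]
    with_both = [s for s in with_a if consequent in s]
    support = len(with_both) / len(sessions) if sessions else 0.0
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence


BREAD = "/company/products/bread.html"
MILK = "/company/products/milk.htm"

# Hypothetical sessions as sets of accessed URLs.
sessions = [
    {BREAD, MILK},
    {BREAD},
    {MILK},
    {"/company/index.html"},
]
support, confidence = rule_metrics(sessions, BREAD, MILK)
```

Here one of four sessions contains both pages (support 0.25), and one of the two sessions containing bread.html also contains milk.htm (confidence 0.5).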

• Sequential Patterns

– help find inter-transaction patterns

– 50% of those who bought items in /pcworld/computers/ also bought in /pcworld/accessories/ within 15 days.
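
A statistic of this form can be computed by checking, for each user who bought in the first category, whether a purchase in the second category follows within the time window. The purchase records below are invented, and the function checks one given pattern rather than discovering patterns automatically.

```python
from datetime import date, timedelta


def followed_within(purchases, first, second, window_days=15):
    """Share of users who bought in `first` and then in `second` within the window.

    `purchases` maps a user id to a list of (date, category) pairs.
    """
    window = timedelta(days=window_days)
    bought_first = 0
    bought_both = 0
    for events in purchases.values():
        first_dates = [d for d, c in events if c == first]
        if not first_dates:
            continue
        bought_first += 1
        second_dates = [d for d, c in events if c == second]
        if any(timedelta(0) <= d2 - d1 <= window
               for d1 in first_dates for d2 in second_dates):
            bought_both += 1
    return bought_both / bought_first if bought_first else 0.0


COMPUTERS = "/pcworld/computers/"
ACCESSORIES = "/pcworld/accessories/"

# Hypothetical purchase histories.
purchases = {
    "u1": [(date(2024, 3, 1), COMPUTERS), (date(2024, 3, 11), ACCESSORIES)],
    "u2": [(date(2024, 3, 5), COMPUTERS), (date(2024, 4, 20), ACCESSORIES)],
    "u3": [(date(2024, 3, 7), ACCESSORIES)],
}
share = followed_within(purchases, COMPUTERS, ACCESSORIES)
```

Only u1 buys accessories within 15 days of a computer purchase, so the share is one of the two computer buyers.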

• Clustering

– identifies visitors with common characteristics based on visitors’ profiles

– one straightforward approach to creating an aggregate view of each cluster is to compute the centroid of each cluster

– 50% of those who applied for the Discover Platinum card in /discovercard/customerService/newcard were in the 25-35 age group, with annual income between $40,000 and $50,000.
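
The centroid computation mentioned above can be sketched as follows, assuming each session in a cluster is a binary vector over a fixed list of pages. The page list, the vectors, and the 0.5 threshold used to derive an aggregate profile are assumptions made for the example.

```python
def centroid(vectors):
    """Return the component-wise mean of equally long numeric vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


PAGES = ["home", "products", "cart"]

# Hypothetical cluster: one row per session, 1 if the page was visited.
cluster = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]
center = centroid(cluster)

# Aggregate profile: pages visited in at least half of the cluster's sessions.
profile = [page for page, weight in zip(PAGES, center) if weight >= 0.5]
```

Each centroid component is the fraction of sessions in the cluster that include the corresponding page, which is why it can serve as a weight in an aggregate profile.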

Business Intelligence

Information on how customers use a Web site is critical for marketers of e-commerce businesses.

WUM can support business process optimization and marketing decisions.

Business intelligence includes personalization for C2B systems.

Usage Mining on Semantic Web

Helps to build the Semantic Web

With the Semantic Web, WUM can be improved

Multimedia Web Data Mining

Representing, problem solving with, and learning from multimedia data are all challenging

Soft Computing Technology for Web Mining

Fuzzy logic: deals with imprecise and conceptual data. Used in clustering Web log data and mining association rules (ARs).

Neural networks:

– adaptive to new data and information

– suitable for parallel processing

– robust to missing, confusing, ill-defined data

– capable of modeling non-linear decision boundaries

– effective for learning user profiles

Future Research Directions

Analysis of Discovered Patterns

Research on efficient, flexible and powerful analysis tools

More Applications

Temporal evolutions of usage behavior

Improving Web services

Detect credit card fraud

Privacy issues

Future Research Directions (Cont.)

• Web mining supports ongoing, continuous improvement for e-businesses

• Using Web usage data and data mining to find patterns is a growing area, driven by the growth of Web-based applications

• Web usage data can be applied to better understand how the Web is used, and that knowledge can be used to better serve users

• Web usage patterns and data mining can be the basis for a great deal of future research

Thank you…..