A REPORT ON DATA MINING AND WEB USAGE
BY
MUHAMMAD ABRAR UDDIN
DATA MINING TECHNIQUES USING
Web usage mining
Web Mining is the application of data mining techniques to discover and retrieve useful information and patterns from World Wide Web documents and services.
Web Mining: Definition
Web usage mining
– can be broadly defined as the discovery and analysis of useful information from the WWW.
– automatic discovery of patterns in clickstreams and associated data, collected or generated as a result of user interactions with one or more Web sites.
Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
The process may involve pre-processing the original data, a step known as data preparation.
Data preparation is especially important in Web usage mining because of the characteristics of clickstream data.
This step is critical to the successful extraction of useful patterns from the data.
Web Usage Mining – Preprocessing
Data cleaning
remove irrelevant references and fields in server logs
remove references due to spider/robot navigation
add missing references due to caching (done after sessionization)
Data fusion/integration
synchronize data from multiple server logs
integrate e-commerce and application server data
integrate meta-data (e.g., content labels)
Data transformation
user identification
sessionization
pageview identification
a pageview is a set of page files and associated objects that contribute to a single display in a Web Browser
Data Reduction
sampling and dimensionality reduction (ignoring certain pageviews / items)
Identifying User Transactions
i.e., sets or sequences of pageviews possibly with associated weights
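The data cleaning step listed above can be illustrated with a minimal sketch. The record layout (IP, URL, user agent), the list of file suffixes, and the robot pattern are all assumptions made for illustration; real server logs and robot detection are more involved.

```python
import re

# Hypothetical, simplified log records: (ip, url, user_agent).
# The suffix list and robot pattern are illustrative assumptions.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
ROBOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def clean_log(records):
    """Drop requests for embedded objects and requests made by robots."""
    cleaned = []
    for ip, url, user_agent in records:
        if url.lower().endswith(IRRELEVANT_SUFFIXES):
            continue  # image/style/script file, not a page request
        if url == "/robots.txt" or ROBOT_PATTERN.search(user_agent):
            continue  # reference due to spider/robot navigation
        cleaned.append((ip, url, user_agent))
    return cleaned


records = [
    ("10.0.0.1", "/index.html", "Mozilla/5.0"),
    ("10.0.0.1", "/logo.gif", "Mozilla/5.0"),
    ("10.0.0.2", "/robots.txt", "ExampleBot/1.0"),
    ("10.0.0.2", "/products.html", "ExampleBot/1.0"),
    ("10.0.0.3", "/products.html", "Mozilla/5.0"),
]
cleaned = clean_log(records)
print(cleaned)
```

Only the two page requests made by ordinary browsers remain after cleaning.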
Sessionization (identify sessions)
- It is the process of segmenting the user activity record of each user into sessions, each representing a single visit to the site.
- The goal of a sessionization heuristic is to reconstruct, from the clickstream data, the actual sequence of actions performed by one user during one visit to the site.
Reliable usage data is difficult to obtain because of
proxy servers,
dynamic IP addresses,
the inability of servers to distinguish among different visits.
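One common sessionization heuristic is time-oriented: a new session starts when a user has been inactive for longer than a threshold. The sketch below assumes a 30-minute timeout and a hypothetical record layout of (user id, timestamp in seconds, URL); neither comes from this report.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed threshold


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split each user's time-ordered requests into sessions.

    requests: iterable of (user_id, timestamp_seconds, url).
    A new session starts when the gap since the user's previous
    request exceeds `timeout`.
    """
    by_user = defaultdict(list)
    for user_id, ts, url in requests:
        by_user[user_id].append((ts, url))

    sessions = []
    for user_id, hits in by_user.items():
        hits.sort()  # order each user's requests by timestamp
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user_id, current))
                current = []
            current.append(url)
        sessions.append((user_id, current))
    return sessions


requests = [
    ("u1", 0, "/A"),
    ("u1", 120, "/B"),
    ("u1", 4000, "/A"),  # more than 30 minutes after the previous request
    ("u2", 60, "/C"),
]
sessions = sessionize(requests)
print(sessions)
```

With this toy data, user u1 ends up with two sessions and user u2 with one.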
Pageview identification
Depends on the intra-page structure of sites
Identify the collection of Web files representing a specific “user event” corresponding to a clickthrough (e.g., viewing a product page, adding a product to a shopping cart, or purchasing a product on an e-commerce site)
User Identification
The analysis of Web usage does not require knowledge about a user’s identity. However, it is necessary to distinguish among different users.
Since a user may visit a site more than once, the server logs record multiple sessions for each user.
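When no cookie or login is available, a frequently used approximation is to treat each distinct combination of IP address and user agent as one user. The sketch below assumes that heuristic and a hypothetical record layout of (IP, user agent, URL); it is not reliable behind proxies, where several people can share one IP address.

```python
def identify_users(records):
    """Label each request with a user id keyed on (IP address, user agent).

    records: iterable of (ip, user_agent, url).
    Returns a list of (user_id, url) in the original request order.
    """
    user_ids = {}
    labeled = []
    for ip, agent, url in records:
        key = (ip, agent)
        if key not in user_ids:
            # First time this IP/agent pair is seen: assign a new id.
            user_ids[key] = f"user{len(user_ids) + 1}"
        labeled.append((user_ids[key], url))
    return labeled


records = [
    ("10.0.0.1", "Mozilla/5.0", "/A"),
    ("10.0.0.1", "Opera/9.8", "/B"),    # same IP, different browser
    ("10.0.0.1", "Mozilla/5.0", "/C"),  # same pair as the first request
]
labeled = identify_users(records)
print(labeled)
```

The first and third requests are attributed to the same user, while the second request is treated as a different user despite sharing the IP address.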
Path completion
- Client- or proxy-side caching can often result in missing access references to those pages or objects that have been cached.
For instance, if a user goes back to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side, and therefore no request is made to the server.
This results in the second reference to A not being recorded on the server logs.
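Missing references of this kind can sometimes be inferred from the referrer field. The sketch below assumes each logged request carries its referrer and that the user only moved backward with the browser's back button; both are simplifying assumptions, and the function name is hypothetical.

```python
def complete_path(session):
    """Insert back-navigation steps that caching hid from the server log.

    session: list of (url, referrer) pairs in request order.
    Returns the list of URLs with the inferred cached page views added.
    """
    completed = []
    stack = []  # pages reachable through the back button, newest last
    for url, referrer in session:
        if stack and referrer in stack and stack[-1] != referrer:
            # The request did not come from the last page seen, so assume
            # the user pressed "back" until reaching the referrer.
            while stack[-1] != referrer:
                stack.pop()
                completed.append(stack[-1])  # inferred cached view
        stack.append(url)
        completed.append(url)
    return completed


# Logged requests: A, then B (from A), then C (from B), then D (from A).
session = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
completed = complete_path(session)
print(completed)
```

Here the request for D names A as its referrer, so the sketch infers that the user went back from C to B and then to A before following the link to D.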
The discovered patterns: usually represented as
– collections of pages, objects, or resources that are frequently accessed by groups of users with common interests.
Decision Trees
a flow chart of questions leading to a decision
Ex: car buying decision tree
Path Analysis
Uses Graph Model
Provide insights to navigational problems
Example of info. discovered by path analysis:
78% followed the path “company” -> “what’s new” -> “sample” -> “order”
60% of visitors left the site after 4 or fewer page references
=> the most important info must be within the first 4 pages from the site entry points.
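A minimal sketch of path analysis over sessions follows. It builds a graph model as counts of page-to-page transitions, finds the most frequent full path, and measures the share of sessions that end within four page references. The sessions are toy data invented for illustration, and the function name is hypothetical.

```python
from collections import Counter


def path_stats(sessions, max_len=4):
    """Return (edge counts, most common path with its count, share of short sessions)."""
    edges = Counter()  # graph model: (from_page, to_page) -> traversal count
    for s in sessions:
        edges.update(zip(s, s[1:]))
    paths = Counter(tuple(s) for s in sessions)
    short = sum(1 for s in sessions if len(s) <= max_len)
    return edges, paths.most_common(1)[0], short / len(sessions)


order_path = ["company", "what's new", "sample", "order"]
sessions = [
    order_path,
    order_path,
    order_path,
    ["company", "products"],
    ["company", "what's new", "sample", "order", "checkout"],
]
edges, top_path, short_share = path_stats(sessions)
print(top_path, short_share, edges[("company", "what's new")])
```

In this toy data the order path occurs three times, and four of the five sessions end within four page references.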
Grouping
Groups similar info. to help draw higher-level conclusions
Ex: all URLs containing the word “Yahoo”…
Filtering
Allows answering specific questions such as:
how many visitors came to the site this week?
Cookies
An ID randomly assigned by the Web server to the browser
Cookies are beneficial to both Web site developers and visitors
The cookie field entry in the log file can be used by Web traffic analysis software to track repeat visitors (loyal customers).
Association Rules
help find spending patterns on related products
30% of those who accessed /company/products/bread.html also accessed /company/products/milk.htm.
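A percentage such as the one above is the confidence of an association rule: among the sessions that contain the first page, the share that also contain the second. The sketch below computes support and confidence over toy sessions invented to reproduce the 30% figure; the function name is hypothetical.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    sessions: list of sets of accessed URLs.
    """
    with_antecedent = [s for s in sessions if antecedent in s]
    both = sum(1 for s in with_antecedent if consequent in s)
    support = both / len(sessions) if sessions else 0.0
    confidence = both / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


BREAD = "/company/products/bread.html"
MILK = "/company/products/milk.htm"

# Toy data: 10 sessions access bread, 3 of them also access milk,
# and 2 further sessions access only milk.
sessions = [{BREAD, MILK}] * 3 + [{BREAD}] * 7 + [{MILK}] * 2
support, confidence = rule_metrics(sessions, BREAD, MILK)
print(support, confidence)
```

Confidence is measured only over sessions that contain the antecedent, whereas support is measured over all sessions.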
Sequential Patterns
help find inter-transaction patterns
50% of those who bought items in /pcworld/computers/ also bought in /pcworld/accessories/ within 15 days
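A sequential pattern like this one adds an ordering and a time window to the rule. The sketch below counts, among customers who bought in a first category, those who later bought in a second category within a given number of days. The purchase records are toy data, and treating only strictly later purchases as a match is an assumption.

```python
def sequential_confidence(purchases, first, then, window_days=15):
    """Share of customers who bought `first` and then `then` within the window.

    purchases: dict mapping customer id -> list of (day_number, category).
    """
    bought_first = 0
    followed = 0
    for events in purchases.values():
        first_days = [d for d, c in events if c == first]
        if not first_days:
            continue  # customer never bought in the first category
        bought_first += 1
        then_days = [d for d, c in events if c == then]
        # Assumption: the second purchase must be strictly later.
        if any(0 < t - f <= window_days for f in first_days for t in then_days):
            followed += 1
    return followed / bought_first if bought_first else 0.0


COMPUTERS = "/pcworld/computers/"
ACCESSORIES = "/pcworld/accessories/"
purchases = {
    "c1": [(1, COMPUTERS), (10, ACCESSORIES)],  # 9 days later: counts
    "c2": [(3, COMPUTERS), (30, ACCESSORIES)],  # 27 days later: too late
    "c3": [(5, COMPUTERS)],                     # no follow-up purchase
    "c4": [(2, COMPUTERS), (17, ACCESSORIES)],  # exactly 15 days: counts
    "c5": [(4, ACCESSORIES)],                   # never bought a computer
}
ratio = sequential_confidence(purchases, COMPUTERS, ACCESSORIES)
print(ratio)
```

Two of the four customers who bought in the first category followed up within the window.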
Clustering
Identifies visitors with common characteristics based on visitors’ profiles
One straightforward approach in creating an aggregate view of each cluster is to compute the centroid of each cluster.
50% of those who applied for the Discover Platinum card in /discovercard/customerService/newcard were in the 25-35 age group, with an annual income between $40,000 and $50,000.
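The centroid mentioned above is the per-dimension mean of the vectors in a cluster. The sketch below assumes each visitor session is represented as a vector of pageview weights (here 0 or 1 for three hypothetical pages); that representation is an illustrative assumption.

```python
def centroid(vectors):
    """Return the mean vector of a non-empty cluster of equal-length vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dims)]


# Toy cluster: each row is one session, each column one pageview
# (1 = visited, 0 = not visited).
cluster = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]
profile = centroid(cluster)
print(profile)
```

Each component of the centroid can be read as the share of sessions in the cluster that visited the corresponding page, which gives an aggregate profile of the cluster.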
Information on how customers use a Web site is critical for marketers of e-commerce businesses.
WUM can support business process optimization and marketing decisions.
Business intelligence includes personalization for C2B systems
Business Intelligence
Usage Mining on Semantic Web
Helps to build the Semantic Web
With the Semantic Web, WUM can be improved
Multimedia Web Data Mining
Representation, problem solving, and learning from multimedia data remain a challenge
Soft Computing Technology for Web Mining
Fuzzy logic: deals with imprecision and conceptual data. Used in clustering Web log data and mining association rules (ARs).
Neural network:
Adaptive to new data and information
Suitable for parallel processing
Robust to missing, confusing, ill-defined data
Capable of modeling non-linear decision boundaries
Effective for learning user profiles
Future Research Directions
Analysis of Discovered Patterns
Research on efficient, flexible and powerful analysis tools
More Applications
Temporal evolution of usage behavior
Improving Web services
Detecting credit card fraud
Privacy issues
Web mining supports ongoing, continuous improvement for e-businesses
Mining Web usage data for patterns is a growing area, driven by the growth of Web-based applications
Web usage data can be used to better understand how the Web is used and to apply this knowledge to serve users better
Web usage patterns and data mining can be the basis for a great deal of future research
Thank you.