Data Analysis
NOTES on the RESEARCH – Sequential Activity Pattern Analysis of Tourists
This research was conducted over a period of one year, from October 2016 to October 2017.
The data was gathered by using the Twitter Developer API and extracting all the tweets of tourists visiting various parts of Singapore.
This RAW DATA SET, containing thousands of files, was then filtered by running it through a custom program. The resulting files contained only the tweets associated with Foursquare check-ins of users/tourists.
The FILTERED DATA SET contains Foursquare check-in tweets of more than 1000 tourists, collected over a period of 7-9 months. The data files were then run through another program, whose purpose was to extract the Foursquare category information associated with each check-in. The category information was essential for finding out which activity the tourists were engaged in at that particular time.
(Note: The complete set of Foursquare categories can be accessed from https://developer.foursquare.com/docs/resources/categories )
The PROCESSED DATA SET containing category information for each Check-in was then run through different Sequential Pattern Mining Algorithms to find the most frequently and regularly occurring “Activity Patterns” and “Trends” amongst tourists.
ANALYTICS
Note that each user's check-ins in the data file are listed latest first. To get the correct sequence, you need to reorder them from earliest to latest (check the date and time of the check-ins).
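A minimal sketch of this reordering step in Python. It assumes each check-in is a (user, timestamp, category) record and that timestamps look like "2017-03-02 18:30"; the record layout, the timestamp format and the function name are illustrative assumptions, not the actual file format.

```python
from datetime import datetime

# Hypothetical check-in records (user, timestamp, category), stored latest first.
checkins = [
    ("u1", "2017-03-02 18:30", "Food"),
    ("u1", "2017-03-02 09:15", "Travel & Transport"),
    ("u1", "2017-03-01 14:00", "Shop & Service"),
]


def sort_earliest_first(records, fmt="%Y-%m-%d %H:%M"):
    """Return records grouped by user and ordered by parsed time, earliest first."""
    return sorted(records, key=lambda r: (r[0], datetime.strptime(r[1], fmt)))


ordered = sort_earliest_first(checkins)
for user, timestamp, category in ordered:
    print(user, timestamp, category)
```

Parsing the timestamp before sorting avoids relying on string order, which only works for some date formats.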
The next step is to think about the meaning and the actual application of the results generated.
For example:
· If tourism managers want to design a travel itinerary for the entire trip, you can treat all check-ins of one user, spanning multiple days, as one sequence and apply a sequential pattern mining algorithm. The count is then the number of sequences, which is also the number of users.
· However, if tourism managers want to design the best itinerary for one day, they need to treat all check-ins made on the same day as one sequence. There can be more sequences than users, as one user can travel for multiple days. The count in this case is the number of sequences, which is not necessarily the same as the number of users.
· Another thing to consider is that the activities in the data cover all categories. When tourism managers want to design an itinerary specifically for attractions, dining, or shopping (for example, which attraction to visit first, which to visit second, etc.), we need to extract only the check-ins for those categories to generate sequences.
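The three cases above can be sketched with one small helper. This is only an illustration under assumed names: the (user, timestamp, category) layout, the `build_sequences` function and its `per_day` and `keep` parameters are all hypothetical, and the sample records are made up.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (user, timestamp, top-level category).
checkins = [
    ("u1", "2017-03-01 10:00", "Arts & Entertainment"),
    ("u1", "2017-03-01 13:00", "Food"),
    ("u1", "2017-03-02 11:00", "Shop & Service"),
    ("u2", "2017-04-05 09:00", "Food"),
    ("u2", "2017-04-05 15:00", "Arts & Entertainment"),
]


def build_sequences(records, per_day=False, keep=None):
    """Group check-ins into time-ordered category sequences.

    per_day=False: one sequence per user (whole stay).
    per_day=True:  one sequence per (user, calendar day).
    keep:          optional set of categories to retain; others are dropped.
    """
    parsed = [(u, datetime.strptime(t, "%Y-%m-%d %H:%M"), c) for u, t, c in records]
    parsed.sort(key=lambda r: (r[0], r[1]))  # earliest first within each user
    sequences = defaultdict(list)
    for user, ts, category in parsed:
        if keep is not None and category not in keep:
            continue
        key = (user, ts.date()) if per_day else user
        sequences[key].append(category)
    return dict(sequences)


per_user = build_sequences(checkins)                 # count = number of users
per_day = build_sequences(checkins, per_day=True)    # count = number of user-days
food_only = build_sequences(checkins, keep={"Food"}) # one category only
print(len(per_user), len(per_day), food_only)
```

In this toy data there are 2 users but 3 day-level sequences, which is why the count differs between the two cases.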
In general, there are many ways to slice the data into sub data sets with a specific purpose, to answer cases such as those listed above. There are many other possible cases/scenarios that you can think of, and you can generate data for all of them. We would like the paper to have several results covering many possible application cases, rather than just a single result.
Another thing to consider is the use of data mining techniques. You need to be clear about the differences between sequential rule mining and association rule mining. You can apply both algorithms to each of those sub data sets, and the results you obtain can be meaningful to report as well.
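To make the difference concrete, here is a small hedged sketch. It does not implement a full mining algorithm; it only counts the support of one hand-picked pair, once as an ordered pattern (the sequential view) and once as an unordered itemset (the association view). The sequences and helper names are made up for illustration.

```python
def contains_in_order(sequence, pattern):
    """True if pattern occurs in sequence as a subsequence (order kept, gaps allowed)."""
    remaining = iter(sequence)
    # "item in remaining" consumes the iterator, so each match must come after the last.
    return all(item in remaining for item in pattern)


def contains_as_set(sequence, items):
    """True if every item appears somewhere in sequence, in any order."""
    return set(items) <= set(sequence)


# Hypothetical activity sequences, each already ordered earliest to latest.
sequences = [
    ["Food", "Shop & Service", "Arts & Entertainment"],
    ["Shop & Service", "Food"],
    ["Food", "Arts & Entertainment", "Shop & Service"],
]

pair = ["Food", "Shop & Service"]
ordered_support = sum(contains_in_order(s, pair) for s in sequences)
unordered_support = sum(contains_as_set(s, pair) for s in sequences)
print(ordered_support, unordered_support)
```

With this toy data the ordered pattern Food -> Shop & Service is supported by 2 sequences, while the unordered itemset is supported by all 3, because the second sequence contains both activities but in the reverse order.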
The above suggestions and tips are just something for you to consider. You can think further, as you are the lead person who decides what the best practice is in this case. When to start slicing the data for detailed analysis is also up to you, as you will be the person who understands the data best. You will also be the one to decide how to slice the data; there are many more ways to slice it than the ones I mentioned. Be creative, think about all possible scenarios, and analyse the data to extract as much knowledge as possible and verify your ideas.
(Usually in data analysis, I start from general descriptive statistics to understand the data, then slice the data one level at a time to find deeper and deeper insights. The sequences can be built from categories, for general insights, or from venues, for detailed insights about exact places. There are also several techniques for analysing the data, and you need to master all of them: sequential pattern mining, association rule mining, etc. Keep results and analysis along the way; we will select what to report at the end, after all possible results are generated.)
Please state, for each result table, what the result is for (which scenario or application case).
You also need to examine your data to see which specific patterns will be interesting to report. There are many of them, but we need to pick around 20-30 of the most interesting patterns for each result and put them into a table for each application scenario. So it is important that we can see which patterns might be interesting. You can just highlight them in the Excel file to begin with.
RESULTS AND FINDINGS
· The pattern sequence must be in order A -> B -> C
· Techniques used should be Association Rule Mining as well as Sequential Pattern Mining
· The data set needs to be organized before processing (date and time, earliest to latest)
· Define cases/scenarios: Daily Pattern / Monthly / Hourly / Different Nationalities / Male Tourists / Female Tourists
· For each case, filter the tourists in that category and then run the algorithm to check what patterns you find
OUTLINE for the RESULTS SECTION
4. Case Study
4.1 Data Collection
Describe the collected data and basic statistics, such as the number of check-ins and users, the number of users of each gender, and check-ins by day of week and hour of day. List the countries that tourists come from and the number of tourists from each country, etc.
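As a rough sketch of the day-of-week and hour-of-day counts, assuming the check-in times have already been parsed into `datetime` objects (the sample times and variable names are made up for illustration):

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical parsed check-in times.
times = [
    datetime(2017, 3, 1, 10, 0),
    datetime(2017, 3, 1, 13, 30),
    datetime(2017, 3, 4, 10, 15),
]

# weekday() returns 0 for Monday through 6 for Sunday.
by_day = Counter(DAY_NAMES[t.weekday()] for t in times)
by_hour = Counter(t.hour for t in times)

print(dict(by_day))
print(dict(by_hour))
```

The same `Counter` approach works for counts per gender or per country of origin, given the corresponding field for each user.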
4.2 Result Analysis
Outline what kind of result you want to show for the sequential rules and association rules.
a) sequential pattern for whole trip
b) sequential pattern for one day
c) etc.
Then for each section, pick the results you want to show and put them there.
When you design this result section, think about the key points of your paper: what are you trying to prove, and what message do you want to deliver to the reader about the contribution of your work?
I would pick the most prominent patterns and study them to find their connections (categories, subcategories), and then give recommendations such as:
· Popular patterns
· To promote tourist attractions
· To design travel itineraries
· To customise tours based on preferences
· etc
Also give the exact count of tourists involved in each activity sequence.
(The information for these categories is in https://developer.foursquare.com/docs/resources/categories )
IMPORTANT CORRECTIONS on RESEARCH PROJECT
The aim of our work is to find sequential activity patterns. This means that we want to find rules or sequential patterns regarding activities, such as Dining --> Shopping --> Attractions --> Travel. This is the general activity category.
If we find rules such as Gardens by the Bay --> Flower Dome --> Airport, this is NOT activity. These are just the names of the venues/locations, and this pattern only tells us where tourists have been, NOT their activity.
Analysis to identify where people go has been done many times before; what makes our work new is the mapping of locations to Activity Categories.
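A minimal sketch of this mapping step. The lookup table below is hypothetical: the category assigned to each venue is an illustrative guess, and in practice it should come from the Foursquare category information extracted for each check-in.

```python
# Hypothetical venue -> activity category lookup (assignments are illustrative only).
VENUE_TO_ACTIVITY = {
    "Gardens by the Bay": "Outdoors & Recreation",
    "Flower Dome": "Outdoors & Recreation",
    "Airport": "Travel & Transport",
}


def to_activity_sequence(venues, lookup, unknown="Unknown"):
    """Replace each venue name with its activity category, keeping the order."""
    return [lookup.get(venue, unknown) for venue in venues]


venue_sequence = ["Gardens by the Bay", "Flower Dome", "Airport"]
activity_sequence = to_activity_sequence(venue_sequence, VENUE_TO_ACTIVITY)
print(" --> ".join(activity_sequence))
```

Venues missing from the lookup are marked "Unknown" rather than dropped, so they can be reviewed before mining. Whether to merge consecutive check-ins of the same activity is a separate decision to make and report.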
Because we are writing a research paper, our results must tell a story, not just present whatever we get from the data. Each result is shown for a purpose, not just to show something that we got. I outline the structure here again:
1) General Data Description: How many check-ins? How many users in total?
Show a table of users by country of origin: the number of users from each country and the percentage of users for each country.
2) Travel Pattern for Trip: Identify the check-ins belonging to each trip made by a user. One user could have made multiple trips to the same destination, so we need to separate the trips. The count is now the number of trips, not the number of users. Usually, if two check-ins are made more than 30 days apart, they are likely to belong to different trips.
So we generate a data set organized by trip, where each trip is a record.
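The 30-day rule can be sketched as follows for one user's check-in times. The function name and the `max_gap_days` parameter are illustrative assumptions, and the dates are made up.

```python
from datetime import datetime, timedelta


def split_into_trips(timestamps, max_gap_days=30):
    """Split one user's check-in times into trips.

    A new trip starts whenever a check-in is more than max_gap_days
    after the previous check-in.
    """
    trips = []
    for ts in sorted(timestamps):
        if trips and ts - trips[-1][-1] <= timedelta(days=max_gap_days):
            trips[-1].append(ts)   # still within the current trip
        else:
            trips.append([ts])     # gap too large (or first check-in): new trip
    return trips


# Hypothetical check-in times for a single user.
visits = [
    datetime(2017, 6, 2),
    datetime(2017, 1, 5),
    datetime(2017, 1, 7),
    datetime(2017, 6, 1),
]
trips = split_into_trips(visits)
print(len(trips), [len(t) for t in trips])
```

Here the user contributes two trip records (January and June), so the record count is trips rather than users.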
You also need to be familiar with the Foursquare categories: https://developer.foursquare.com/docs/resources/categories
There are 10 big categories, each with subcategories.
Besides Activities, we can also analyse tourists' preferences for Food, so we generate patterns for the subcategories of the Food category only, to identify this specific preference. NO OTHER CATEGORIES SHOULD BE INCLUDED HERE, FOOD ONLY.
3) Daily Travel Pattern: Identify the check-ins for each user on each day and construct a data set organized by day. Each record is one day. The count is now based on days, not the number of users. In this analysis, we are interested in sequential patterns in general. We can use Sequential Pattern Mining and Sequential Rule Mining.
This analysis can include subcategories from all 10 categories. The aim is to understand people's daily itineraries.
Still within the Daily Pattern, we do Sequential Pattern Analysis for venues: here we go deeper into the venues, not subcategories. We can identify some venues that are major attractions in Singapore and that many people visited. Then we can simply generate some graphs or patterns to indicate where tourists usually go before and after visiting those popular venues.
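One simple way to get the before/after picture for a popular venue is to count its immediate neighbours in each daily venue sequence. This is only a sketch: the function name is hypothetical, and apart from Gardens by the Bay the venue names are made up.

```python
from collections import Counter


def neighbours(sequences, venue):
    """Count which venues come directly before and directly after `venue`."""
    before, after = Counter(), Counter()
    for seq in sequences:
        for i, name in enumerate(seq):
            if name != venue:
                continue
            if i > 0:
                before[seq[i - 1]] += 1
            if i + 1 < len(seq):
                after[seq[i + 1]] += 1
    return before, after


# Hypothetical daily venue sequences, each ordered earliest to latest.
daily_venue_sequences = [
    ["Hotel A", "Gardens by the Bay", "Mall B"],
    ["Cafe C", "Gardens by the Bay", "Mall B"],
    ["Gardens by the Bay", "Hotel A"],
]

before, after = neighbours(daily_venue_sequences, "Gardens by the Bay")
print("before:", dict(before))
print("after:", dict(after))
```

The two counters can be plotted directly as bar charts, or used as edge weights in a small graph centred on the venue.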