python modeling

nieyanan
2020fallK513finalproject.pdf

2020 Fall K513 Final Project

Will a Product Sell Well in Wish? Use Predictive Analytics to Guide the Decision

Due at 11:59 10/10/2020

Introduction

Your business wants to open a storefront in Wish. You would like to know what factors affect product sales: for example, what types of products sell well in Wish, how to list your product, how important the different rating numbers are, whether you should use ads to boost sales, how important it is to get badges, etc. You scraped data from the Wish platform website for its summer 2020 items. You want to have a general understanding of the items on sale there. In addition, you want to create some machine learning models to derive rules to predict whether you can reach your sales goal, or even predict the number of items that can be sold.

Part 1 – Exploratory Data Analysis (EDA) and Prepare Data

The data scraped from the Internet is real and recent. However, it might contain missing data, duplicate data, and text data that needs to be encoded. There might be variables that won’t contribute to your model. Some columns might not lend themselves to being used directly as features in your models but could be very useful after some processing. So, you need to go through the process of understanding and preparing the data first. Use the tools you learned in this class to explore the data. The following steps are recommended:

1. Understand columns by reading the column description and data. You can load the data into a dataframe to examine it.

2. Identify columns that you know won’t contribute to your EDA and model building. Remove them.

3. Check whether there are duplicate rows to be removed. Remember that you can focus on key columns to identify duplicate rows.

4. Check missing values in columns and decide how to handle them. You may drop columns if they have too many missing values or if they won’t contribute to your prediction objective. You may impute missing values if you want to keep the columns. We discussed how to impute missing values. This article might provide additional ways to impute values.

5. Check unique values of some columns, especially text columns. The number of unique values might give you ideas on whether to use these columns, how to use them, or how to derive new variables from the original ones. This also helps you decide how to encode text columns if necessary.

6. Encode text columns as numerical values. If a text column has too many unique values, it is impractical to use it directly. For example, the tags column is a proxy for categories. Since an item can be tagged with anything a seller likes, there will be a lot of unique values. The column might be important since it is directly relevant to the search result, but it cannot be used directly. Can you extract some important key tags by analyzing the values in this column and use a few tags as variables? Note that only numerical values can be used to create models.

7. Create visualizations to help better understand the variables and their relationships. Examples could be a correlation heatmap, scatterplot matrix, boxplot, bar plot, etc. Any charts and graphs that help with understanding and preparing the data are welcome.
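Steps 3 to 5 above could be sketched roughly as follows. This is only an illustration on a few made-up rows, not the scraped data; the column names come from the column list in this document, while the fill rules (a missing banner flag read as 0, a missing count filled with the median) are assumptions that you should replace with your own decisions.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the scraped CSV (values are made up for illustration).
df = pd.DataFrame({
    "product_id": ["a1", "a1", "b2", "c3"],
    "price": [8.0, 8.0, 5.5, 12.0],
    "rating_five_count": [10.0, 10.0, np.nan, 3.0],
    "has_urgency_banner": [1.0, 1.0, np.nan, np.nan],
    "product_color": ["black", "black", "White", None],
})

# Step 3: drop duplicate rows, keyed on the identifying column only.
df = df.drop_duplicates(subset=["product_id"]).reset_index(drop=True)

# Step 4: count missing values per column before deciding how to handle them.
missing = df.isna().sum()

# One possible handling (an assumption, not the required approach):
# a missing banner flag is read as "no banner", and a missing rating
# count is filled with the column median.
df["has_urgency_banner"] = df["has_urgency_banner"].fillna(0).astype(int)
df["rating_five_count"] = df["rating_five_count"].fillna(
    df["rating_five_count"].median()
)

# Step 5: normalize a text column, then count its distinct values.
df["product_color"] = df["product_color"].str.lower()
n_colors = df["product_color"].nunique()

print(missing)
print(df)
print("distinct colors:", n_colors)
```

Keying the duplicate check on product_id follows the hint in step 3 about focusing on key columns; whether that is the right key for the real data is something to verify during EDA.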

You should be able to describe and summarize your understanding of the data in your final PPT. Feel free to use Excel or Tableau to explore the data in addition to Python. This is not a very big dataset, so it is OK to use Excel. However, the final transformations must be done in the Jupyter notebook.
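As one possible notebook transformation for the tags column described in step 6, the sketch below counts tag frequencies and turns the most frequent tags into 0/1 columns. It assumes the tags are stored as a comma-separated string, which should be checked against the actual data; the sample rows, the cutoff top_k, and the tag_ column prefix are all hypothetical.

```python
from collections import Counter

import pandas as pd

# Made-up tag strings; the comma-separated layout is an assumption.
df = pd.DataFrame({
    "tags": [
        "Summer,Dress,Women",
        "Summer, Shorts,Men",
        "Summer,Dress,Casual",
    ]
})

# Split each string into a list of lowercased, whitespace-trimmed tags.
tag_lists = (
    df["tags"]
    .str.lower()
    .str.split(",")
    .apply(lambda tags: [t.strip() for t in tags])
)

# Count how often each tag appears across all items.
counts = Counter(tag for tags in tag_lists for tag in tags)

# Keep only the k most frequent tags (k is a hypothetical choice).
top_k = 2
top_tags = [tag for tag, _ in counts.most_common(top_k)]

# One 0/1 indicator column per retained tag.
for tag in top_tags:
    df[f"tag_{tag}"] = tag_lists.apply(lambda tags, t=tag: int(t in tags))

print(top_tags)
print(df)
```

Frequency is only one way to pick key tags; tags that separate high-selling from low-selling items may be more useful than tags that are simply common.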

Part 2 – Build Predictive Models

Your goal is to build supervised models to predict the sales of any product. There are two ways to do it.

1. Create regression models with the variable “units_sold” as the target. Feel free to try out all the regression models we learned in this class. Once you have the data ready, running different models does not take much additional time, as we saw in class. Compare their performance and decide which model performs best.

2. After you delve into the details of the columns and the Wish website, you will realize that units_sold is not the actual number but a range of sales. A regression model might not be completely appropriate. You can also create a classification model to predict the approximate range of units_sold. Explore the values in units_sold to decide what new variable to derive from it. It could be a binary or a multi-class variable. Depending on which, you will perform binary or multi-class classification. Once you have the data ready, running different models does not take much additional time, as we saw in class. Compare their performance and decide which model performs best.
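Deriving a classification target from units_sold, as described in option 2, might look like the sketch below. The sample values, the threshold SALES_GOAL, the bin edges, and the new column names high_sales and sales_band are all hypothetical; the real cut points should come from exploring the distribution of units_sold.

```python
import pandas as pd

# Made-up values imitating the "lower bound of a range" pattern.
df = pd.DataFrame({"units_sold": [10, 100, 100, 1000, 5000, 20000]})

# Inspect how many items fall at each distinct value.
value_counts = df["units_sold"].value_counts().sort_index()

# Binary target: did the item reach a (hypothetical) sales goal?
SALES_GOAL = 1000
df["high_sales"] = (df["units_sold"] >= SALES_GOAL).astype(int)

# Multi-class target: left-closed bins [0, 100), [100, 1000), [1000, inf).
bins = [0, 100, 1000, float("inf")]
labels = ["low", "medium", "high"]
df["sales_band"] = pd.cut(
    df["units_sold"], bins=bins, labels=labels, right=False
)

print(value_counts)
print(df)
```

Checking the class counts after binning is worthwhile, since a very unbalanced target makes accuracy a misleading performance measure.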

Pay attention to the following when building your models:

1. Make sure to only use the variables that are available at the time of prediction as features. You cannot use future values as your features. Don’t use the original columns if new columns have been derived. For example, if you created a dummy variable out of units_sold, you should not use units_sold as a feature to predict your dummy variable.

2. Be sure to split your data into training and test sets. The test set could use 25% to 30% of the data. The performance evaluation must be based on the test set, so you want to control overfitting to help the model generalize.
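The two points above can be illustrated with the following sketch, which assumes scikit-learn is the modeling library. The data is randomly generated, so the scores carry no meaning; the target name high_sales, the chosen features, and the two models are placeholders for whatever you derive and compare.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared dataset (random, for illustration only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "price": rng.uniform(1, 50, n),
    "rating": rng.uniform(1, 5, n),
    "uses_ad_boosts": rng.integers(0, 2, n),
    "units_sold": rng.choice([10, 100, 1000, 5000], size=n),
})
df["high_sales"] = (df["units_sold"] >= 1000).astype(int)

# Point 1: exclude the target AND the column it was derived from.
X = df.drop(columns=["units_sold", "high_sales"])
y = df["high_sales"]

# Point 2: hold out 25% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit each candidate on the training set; score only on the test set.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

print(test_accuracy)
```

Limiting max_depth on the tree is one simple way to control overfitting; comparing training and test scores for each model shows whether it is working.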

Files available and Submission Requirement

1. The zip file downloaded from Canvas should contain this instruction file, the raw data in CSV format, and a PNG file showing the Wish interface.

2. What needs to be submitted:

a. A slide deck that summarizes your findings on business understanding, data understanding and preparation, models, and results interpretation for the given dataset. The main body of the slides should not exceed 20 pages, but you can have additional slides in the Appendix, which will be taken into account for grading purposes.

b. The main body of the slides should not include any Python code, but it can include screenshots of the results. The presentation should have decision makers as the target audience. Assume that they have high-level knowledge of predictive analytics. They are not interested in technical details. Provide a high-level overview and business insights. Use technical jargon only when necessary and try to provide brief explanations when doing so.

c. A working Jupyter Notebook that includes both your code and outputs. Use comments and markdown cells to organize your code and describe outputs. Please upload both the ‘HTML’ and the ‘ipynb’ versions of your notebook.

Part 1 of the project coincides with the first half of the course. It is recommended that you start working on the project after the first 3 weeks of the course are over. If you submit the completed Part 1 with the PPT and Jupyter notebook (code and HTML pages), I will provide feedback to you, including a tentative letter-grade range for the quality level of the submitted files. The early submission is not required but highly recommended. However, the final submission deadline with both parts completed must be followed, since the registrar's office has a deadline for course grade submission.

Details about the columns of the data

See the picture variables from wish.PNG to see where the columns come from and to get a further understanding of the variables. Detailed explanations of the variables are listed below.

1. Title: the title of the item displayed on the website
2. Price: price you would pay to get the product
3. Retail price: reference price of similar articles on the market, or in other stores/places
4. Currency: currency for the price
5. Units_sold: number of units sold (lower bounds of the web listing)
6. Uses_ad_boosts: Used Ads to boost sales? 0 – no, 1 – yes
7. Rating: average rating by customers
8. Rating_count: number of ratings
9. Rating_five_count: number of ratings 5
10. Rating_four_count: number of ratings 4
11. Rating_three_count: number of ratings 3
12. Rating_two_count: number of ratings 2
13. rating_one_count: number of ratings 1
14. badges_count: Number of badges the product or the seller has
15. badge_local_product: A badge that denotes the product is a local product. 1 - Yes, has the badge
16. badge_product_quality: Badge awarded when many buyers consistently gave good evaluations. 1 means Yes, has the badge
17. badge_fast_shipping: Badge awarded when this product's order is consistently shipped rapidly
18. tags: tags set by the seller. Wish does not have categories; tags are used instead
19. product_color: Product’s main color
20. product_variation_size_id: One of the available size variations for this product
21. product_variation_inventory: Inventory the seller has. Max allowed quantity is 50
22. shipping_option_name
23. shipping_option_price
24. shipping_is_express: whether the shipping is express or not. 1 - yes
25. countries_shipped_to: number of countries the item can be shipped to
26. inventory_total: Total inventory for all the product's variations (size/color variations for instance)
27. has_urgency_banner: 1 - yes
28. urgency_text: A text banner that appears over some products in the search results
29. origin_country: which country the product is from
30. merchant_title: Merchant's displayed name (shown in the UI as the seller's shop name)
31. merchant_name: Merchant's canonical name. A name not shown publicly. Used by the website under the hood as a canonical name. Easier to process since it is all lowercase without white space
32. merchant_info_subtitle: an overview of the seller's stats provided by the website
33. merchant_rating_count: Number of ratings of this seller
34. merchant_rating: merchant's rating
35. merchant_id
36. merchant_has_profile_picture: whether there is a profile picture of the merchant. 1 - yes
37. product_url
38. product_picture
39. product_id: unique id of products
40. theme: the search term used to get all the data
