Exam performance analysis


DS540: Advanced Python for Data Science

Project proposal 1: Exam performance analysis

You are requested to analyse a dataset of students’ exam performance using the Python programming language. The dataset contains students’ scores in three subjects (math, reading, and writing). Write a Python program to perform the following tasks:

Task 1: Load the dataset in a suitable format and show its properties, e.g. number of records, number of features and their types (hint: use the Pandas library to read the data). Inspect the dataset and perform data cleaning (e.g. removing duplicate records and fixing missing data).

Task 2: Provide descriptive statistics of the dataset and perform an exploratory data analysis (EDA) to answer the following analysis questions:

• Compare students’ exam scores in the different subjects (math, reading, writing). What trend did you find?

• Who performed better in the different subjects, male or female students?

• Show any attributes (features) that are correlated with exam scores (e.g. Does parental level of education affect their children’s exam scores? Does test preparation influence students’ performance?) (hint: use the corr() method in Pandas).

(You are encouraged to pose other analysis questions based on any trend you notice in the dataset.)

Task 3: Show a visual representation of your analysis (hint: use data visualization packages such as Matplotlib and Seaborn).

Task 4: Build a machine learning model to predict a student’s exam performance in each subject given the following attributes: gender, race/ethnicity, parental level of education, lunch, and test preparation course.

Download the dataset from the following link: Students exam performance data
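As a starting point for Tasks 1 and 2, the following is a minimal sketch only, not a model answer. It uses a small hand-made DataFrame in place of the real file, and the column names ("gender", "test preparation course", "math score", and so on) are assumptions about the dataset; with the real data you would load the file with pd.read_csv instead. Filling missing scores with the column median is one possible cleaning choice among several.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the column names are assumptions.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male", "male"],
    "test preparation course": ["completed", "none", "none", "completed",
                                "completed", "none", "none"],
    "math score": [72, 69, 90, 47, 76, None, 69],
    "reading score": [72, 90, 95, 57, 78, 60, 90],
    "writing score": [74, 88, 93, 44, 75, 62, 88],
})
score_cols = ["math score", "reading score", "writing score"]

# Task 1: basic properties of the dataset.
n_records, n_features = df.shape
print(n_records, n_features)
print(df.dtypes)

# Task 1: cleaning. Drop exact duplicate rows, then fill missing scores
# with the column median (an illustrative choice, not the only option).
clean = df.drop_duplicates().copy()
clean[score_cols] = clean[score_cols].fillna(clean[score_cols].median())

# Task 2: descriptive statistics and simple comparisons.
summary = clean[score_cols].describe()
subject_means = clean[score_cols].mean()
by_gender = clean.groupby("gender")[score_cols].mean()

# corr() works on numeric columns, so categorical attributes are first
# encoded as 0/1 indicator columns.
encoded = pd.get_dummies(
    clean[["gender", "test preparation course"]], drop_first=True
).astype(int)
correlations = pd.concat([encoded, clean[score_cols]], axis=1).corr()
print(correlations[score_cols])
```

The indicator encoding is needed because corr() cannot be applied directly to text columns; attributes with a natural order, such as parental level of education, could instead be mapped to ordered integers.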

Project proposal 2: Tweets sentiment analysis

You are requested to perform natural language processing on users’ tweets using the Python programming language. The dataset contains textual data obtained from Twitter users. Write a Python program to perform the following tasks:

Task 1: Load the dataset in a suitable format and show its properties, e.g. number of records, number of features and their types (hint: use the Pandas library to read the data). Inspect the dataset and perform data cleaning (e.g. removing duplicate records and fixing missing data).

Task 2: Pre-process the textual data and extract features using NLP techniques as follows:

• Pre-processing steps:

1. Convert to lowercase.

2. Remove stop words.

3. Normalise the text (punctuation removal, spelling correction, stemming).

4. Tokenisation.

• Extract the following features:

1. Word count per tweet.

2. Average word length per tweet.

3. Special character count per tweet.

4. Tweet sentiments (hint: use the TextBlob library to obtain tweets’ sentiments).

5. N-grams.

6. TF-IDF.
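Part of Task 2 can be sketched with the standard library alone. The example below is illustrative only: it covers lowercasing, punctuation removal, whitespace tokenisation, stop-word removal, the three numeric features, and n-grams. The stop-word list is a tiny made-up sample, the function names are hypothetical, and spelling correction, stemming, sentiment, and TF-IDF are left out because they normally rely on dedicated NLP libraries.

```python
import string

# Tiny illustrative stop-word list (an assumption); a real project would
# use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "this", "i", "so"}


def preprocess(tweet):
    """Lowercase, strip punctuation, tokenise on whitespace, drop stop words."""
    lowered = tweet.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


def basic_features(tweet):
    """Compute the three numeric features from the raw tweet text."""
    words = tweet.split()
    word_count = len(words)
    avg_word_length = sum(len(w) for w in words) / word_count if words else 0.0
    # A "special character" here means anything that is neither
    # alphanumeric nor whitespace (one possible definition).
    special_chars = sum(1 for ch in tweet if not ch.isalnum() and not ch.isspace())
    return {
        "word_count": word_count,
        "avg_word_length": avg_word_length,
        "special_char_count": special_chars,
    }


def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tweet = "Loving the new phone!!! #happy @store"
tokens = preprocess(tweet)
features = basic_features(tweet)
bigrams = ngrams(tokens, 2)
print(tokens)
print(features)
print(bigrams)
```

The numeric features are computed on the raw text on purpose: once punctuation has been stripped, the special character count would always be zero.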

Task 3: Using visual representations, show the following: the most commonly used words in tweets using a word cloud (WordCloud), the number of positive, negative, and neutral tweets, and the word count distribution among the different sentiments (hint: use data visualization packages such as Matplotlib and Seaborn).
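Before plotting, the three quantities in Task 3 have to be aggregated. The sketch below shows one possible way to do that with the standard library; the labelled tweets are invented sample data, and the variable names are hypothetical. The plotting itself is left out here.

```python
from collections import Counter, defaultdict

# Hypothetical pre-processed tweets with sentiment labels (sample data only).
labelled_tweets = [
    (["great", "phone", "love"], "positive"),
    (["battery", "died", "again"], "negative"),
    (["phone", "arrived", "today"], "neutral"),
    (["love", "great", "camera", "phone"], "positive"),
]

# Word frequencies across all tweets: the input a word cloud is drawn from.
word_freq = Counter(tok for tokens, _ in labelled_tweets for tok in tokens)

# Number of tweets per sentiment: the input for a bar chart.
sentiment_counts = Counter(label for _, label in labelled_tweets)

# Word counts grouped by sentiment: the input for a histogram or box plot.
word_counts_by_sentiment = defaultdict(list)
for tokens, label in labelled_tweets:
    word_counts_by_sentiment[label].append(len(tokens))

top_words = word_freq.most_common(3)
print(top_words)
print(dict(sentiment_counts))
print(dict(word_counts_by_sentiment))
```

A frequency mapping like word_freq can be passed to the generate_from_frequencies method of the wordcloud package's WordCloud class, while the other two results suit a Matplotlib bar chart and a Seaborn box plot respectively.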

Task 4: Perform sentiment analysis using machine learning techniques to classify tweets into positive, negative, or neutral sentiments given the following features: word count per tweet, average word length per tweet, and special character count per tweet.
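One possible shape for the Task 4 classifier is sketched below using scikit-learn. The feature rows and labels are made-up placeholders, and logistic regression is only an example choice; the brief does not prescribe a particular model. With the real dataset, the labels would come from the sentiment feature extracted in Task 2.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical feature rows: [word_count, avg_word_length, special_char_count].
X = [
    [6, 5.3, 5], [4, 4.0, 0], [9, 4.8, 2], [12, 5.1, 1],
    [3, 6.0, 4], [7, 3.9, 0], [10, 4.2, 3], [5, 5.5, 6],
    [8, 4.4, 1], [11, 4.9, 2], [2, 7.0, 3], [6, 4.1, 0],
]
# Hypothetical sentiment labels, four per class.
y = [
    "positive", "neutral", "negative", "negative",
    "positive", "neutral", "negative", "positive",
    "neutral", "negative", "positive", "neutral",
]

# Hold out a stratified test set so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple multi-class classifier on the three numeric features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out tweets.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(predictions, accuracy)
```

With only three shallow features, accuracy may be modest; reporting it alongside a confusion matrix would help show which sentiments the model confuses.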

Download the dataset from the following link: Sentiment analysis Dataset


Project guidelines

The report should provide the following information:

• A written description of the data with relevant spreadsheets.

• An explanation of how you analysed your data (hint: what Python packages/functions did you use?).

• An explanation of what data you analysed, followed by relevant visualization.

• The results of your analysis, followed by relevant visualization, with important results highlighted.

• Details of your machine learning model development.

Notes:

1. Follow the attached report template.

2. Your report must not exceed 10 pages, including any references.

3. You must work in groups of 1-2 students.

4. The submission deadline is Saturday of Week 13 (28/11/2020).

5. You must submit your Jupyter Notebook along with the report.