Google Colab - Pandas

Midterm2022-10-21.ipynb

{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"collapsed_sections":[],"authorship_tag":"ABX9TyP4O6zZRywYRTA4/J7bwd5v"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["##1. Lab03B KNN\n","\n","We want to analyze the Heart.csv data to predict heart attacks with the KNN method. Heart.csv contains numerical and categorical data on factors that may lead to a heart attack. All other columns in Heart.csv are symptoms or indicators of a heart attack, while AHD records whether or not the patient had a heart attack. \n","\n","1a) Read Heart.csv into a dataframe. As Heart.csv has a few NaN values, you need to remove them when the data is read. (Hint: For dropna(), refer to Lab05 Cross Validation in the class material.) "],"metadata":{"id":"kjGPx9GTfCCX"}},{"cell_type":"code","source":[],"metadata":{"id":"ABD95wyqk3qk"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"JdhbAkQtk3t3"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"B_sxxVlik3w_"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"ljxaiQpyk30A"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"pID3_4DEk338"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["1b) From the dataframe, set aside the 'AHD' column as the output y, and remove the columns 'ChestPain', 'Thal', and 'AHD' from the dataframe. 
Then, assign the first 200 rows of the dataframe to X_train and y_train, and the rest to X_test and y_test."],"metadata":{"id":"YBwMvpI1TYQB"}},{"cell_type":"code","source":[],"metadata":{"id":"hMAjxnCDk7vy"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"MgYqo-ook7zI"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"dJ2nRumgk72a"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"UxTRedwSk75y"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["1c) Using the train and test data sets in 1b), run the KNN method and display confusion matrices and classification reports in a loop from k=1 to k=7. "],"metadata":{"id":"XYLocZdzO1Wz"}},{"cell_type":"code","source":[],"metadata":{"id":"5GuWFjsDlA6a"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"vh3TrsjElBCC"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["1d) We want to run KNN with the scaled (normalized) train and test data to see whether the KNN performance differs. Scale X_train and X_test using the preprocessing.scale() function. (Hint: Lab03B KNN) Then, run KNN from k=1 to k=7 with the scaled data in a loop."],"metadata":{"id":"A7d6bCBdVxj-"}},{"cell_type":"code","source":[],"metadata":{"id":"1_IYkSFYlFKn"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"XhIwh6iElFOB"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["1e) From the results in 1c) and 1d), can we tell the difference in KNN performance between unscaled and scaled data? Why is one better than the other? "],"metadata":{"id":"qLiEiAijQBJh"}},{"cell_type":"markdown","source":["**Answer** - "],"metadata":{"id":"lhpqJo3z7b_-"}},{"cell_type":"markdown","source":["##2. 
Lab04A Logistic Regression:\n","\n","We want to run Logistic Regression on the dataframe from 1a) to check its performance.\n","\n","2a) Run Logistic Regression that predicts \"AHD\" classes (No, Yes) from all numerical indicators in the dataframe from 1a). (Remark: To run this model, we don't need to split the dataframe into train and test data. We will use the whole dataframe to fit a model and predict.) Display confusion_matrix and classification_report between true AHD classes and the predicted AHD classes. "],"metadata":{"id":"96XcFnF2gcsB"}},{"cell_type":"code","source":[],"metadata":{"id":"ZFPDcH1slO-0"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"04__4H6blPCC"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"VM5BVs83lKsI"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"EiNWWzkqlKvn"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"nl6bcJBslK05"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"qEEEfBdYlK5Q"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["##3. Lab04B LDA and QDA:\n","\n","We want to run LDA-QDA methods for the scaled train and test data sets in 1d) (X_train_scaled, X_test_scaled), to check their performances.\n","\n","3a) Fit an LDA model with X_train_scaled and X_test_scaled data in 1d). Display confusion_matrix and classification_report between true AHD classes and the predicted AHD classes. 
(Hint: If you can't scale the data in 1d), try to use X_train and X_test data in 1b) instead.)"],"metadata":{"id":"lJV70gIyq40C"}},{"cell_type":"code","source":[],"metadata":{"id":"UYlJNIznlRvB"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"Ui3pAIdFlRya"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"eMdNvuBnlR6N"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["\n","3b) For the same scaled data in 1d), run QDA that predicts \"AHD\" from the symptoms data of the Heart.csv. Display confusion_matrix and classification_report between true AHD classes and the predicted AHD classes."],"metadata":{"id":"pxwAIcUXs5cF"}},{"cell_type":"code","source":[],"metadata":{"id":"uiReExfllXwe"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"DELDM3jqlXzz"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["3c) From the results of Logistic Regression, LDA, and QDA for the same Heart.csv data, which model performs the best, and on what basis can you tell? "],"metadata":{"id":"N8TrmJLNuqYS"}},{"cell_type":"markdown","source":["**Answer** - "],"metadata":{"id":"56cJRCHRu6wG"}},{"cell_type":"code","source":[],"metadata":{"id":"SQPbzoYAvYGq"},"execution_count":null,"outputs":[]}]}
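
The workflow described in 1b) through 1d) (split the first rows into a training set, run KNN for k=1 to k=7, then repeat on scaled data) can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the notebook's expected solution: the rows are made up rather than read from Heart.csv, the helper names `scale_columns`, `knn_predict`, and `confusion_counts` are hypothetical, and the standard library stands in for the scikit-learn calls the labs refer to.

```python
import math
import statistics
from collections import Counter


def scale_columns(rows):
    """Standardize each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(X_train, y_train, x, k):
    """Return the majority label among the k training rows nearest to x."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def confusion_counts(y_true, y_pred, labels=("No", "Yes")):
    """Build a count table: rows are true labels, columns are predicted labels."""
    return [
        [sum(t == a and p == b for t, p in zip(y_true, y_pred)) for b in labels]
        for a in labels
    ]


# Made-up rows (NOT from Heart.csv): a wide-range feature, a narrow-range
# feature, and an AHD-style label.
data = [
    (210.0, 0.2, "No"), (215.0, 2.1, "Yes"), (290.0, 0.4, "No"),
    (285.0, 2.6, "Yes"), (250.0, 0.1, "No"), (245.0, 1.9, "Yes"),
    (230.0, 0.5, "No"), (300.0, 3.0, "Yes"), (305.0, 0.3, "No"),
    (205.0, 2.4, "Yes"), (198.0, 0.6, "No"), (260.0, 2.2, "Yes"),
    (275.0, 0.2, "No"), (225.0, 2.8, "Yes"),
]
X = [[a, b] for a, b, _ in data]
y = [label for _, _, label in data]

# First 10 rows for training, the rest for testing (the notebook uses 200).
X_train, y_train = X[:10], y[:10]
X_test, y_test = X[10:], y[10:]

# Scale train and test separately, mirroring the instruction in 1d).
X_train_scaled = scale_columns(X_train)
X_test_scaled = scale_columns(X_test)

for k in range(1, 8):
    raw = [knn_predict(X_train, y_train, x, k) for x in X_test]
    scaled = [knn_predict(X_train_scaled, y_train, x, k) for x in X_test_scaled]
    print(f"k={k} unscaled={confusion_counts(y_test, raw)} "
          f"scaled={confusion_counts(y_test, scaled)}")
```

In the notebook itself, `KNeighborsClassifier` from `sklearn.neighbors`, `preprocessing.scale` from `sklearn`, and `confusion_matrix` / `classification_report` from `sklearn.metrics` would take the place of these hand-written helpers. The sketch only shows the shape of the loop and why distance-based methods see differently ranged features differently before and after scaling.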