Data Mining and Machine Learning

#This HW uses the same datafile as previous HWs.

# Please refer to the CSV file “titanic_data”, which contains data about each

# passenger aboard the RMS Titanic when it sank in 1912.

# The file has several columns given as:

# survival: Survival 0 = No, 1 = Yes

# pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd

# sex: Sex

# Age: Age in years

# sibsp: # of siblings / spouses aboard the Titanic

# parch: # of parents / children aboard the Titanic

# ticket: Ticket number

# fare: Passenger fare

# cabin: Cabin number

# embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

#Treat Survived as your y variable, and the other variables as your x variables.

#The goal is to build a decision tree and a random forest model to predict whether a person survives or not.
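
# A minimal sketch of the two baseline models, under stated assumptions: the data
# below is a synthetic stand-in (not the real passengers), the capitalized column
# names (Survived, Pclass, ...) are a guess at the file's headers, and the helper
# names make_stand_in_data and prepare are hypothetical. With the real file you
# would load the CSV with pd.read_csv instead of calling make_stand_in_data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def make_stand_in_data(n=400, seed=0):
    """Synthetic stand-in for titanic_data (assumed column names, fake rows)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Pclass": rng.integers(1, 4, n),
        "Sex": rng.choice(["male", "female"], n),
        "Age": rng.uniform(1, 70, n).round(),
        "SibSp": rng.integers(0, 4, n),
        "Parch": rng.integers(0, 3, n),
        "Fare": rng.uniform(5, 100, n).round(2),
        "Embarked": rng.choice(["C", "Q", "S"], n),
    })
    # Made-up rule plus noise so the models have some signal to learn.
    score = (df["Sex"] == "female") * 2.0 - 0.5 * df["Pclass"] + rng.normal(0, 1, n)
    df["Survived"] = (score > 0).astype(int)
    # Blank out 10% of ages to mimic missing values.
    df.loc[rng.choice(n, n // 10, replace=False), "Age"] = np.nan
    return df


def prepare(df):
    """Fill missing ages with the median and one-hot encode the text columns."""
    X = df.drop(columns=["Survived"]).copy()
    X["Age"] = X["Age"].fillna(X["Age"].median())
    X = pd.get_dummies(X, columns=["Sex", "Embarked"], drop_first=True, dtype=int)
    return X, df["Survived"]


df = make_stand_in_data()
X, y = prepare(df)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Dict form is convenient for comparing models; text form is for reading.
    reports[name] = classification_report(y_test, y_pred, output_dict=True)
    print(name)
    print(classification_report(y_test, y_pred))
```

# Columns such as ticket and cabin are left out of the stand-in; on the real data
# they would need to be dropped or encoded before fitting.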

#Please include the following in your work:

#1. Classification report showing precision, recall, F-score etc.

#2. Which model works better? Decision tree or random forest?

#3. Tune the hyper-parameters of the decision tree and random forest models. How does the performance of the tuned models compare with that of the un-tuned models?

#4. How do your models compare to the logistic regression / SVM models from the previous HW?

#5. What is overfitting? How can the overfitting problem be resolved in decision trees?

#6. Create an ensemble using the classification models learned so far. See if the ensemble works better than the individual models.
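
# One possible way to approach items 3 and 6, sketched under assumptions: the data
# is a synthetic stand-in from make_classification (not titanic_data), the
# parameter grids are small illustrative choices rather than recommended values,
# and hard (majority) voting is just one of several ways to combine the models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the prepared Titanic features and labels.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Item 3: cross-validated grid search over a few hyper-parameters.
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]},
    scoring="f1",
    cv=5,
)
forest_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8, None]},
    scoring="f1",
    cv=5,
)
tree_search.fit(X_train, y_train)
forest_search.fit(X_train, y_train)

# Item 6: majority-vote ensemble of the tuned trees plus the earlier model types.
# VotingClassifier clones and refits each estimator on the training data.
ensemble = VotingClassifier(
    estimators=[
        ("tree", tree_search.best_estimator_),
        ("forest", forest_search.best_estimator_),
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("svm", make_pipeline(StandardScaler(), SVC())),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

scores = {
    "tuned tree": f1_score(y_test, tree_search.predict(X_test)),
    "tuned forest": f1_score(y_test, forest_search.predict(X_test)),
    "ensemble": f1_score(y_test, ensemble.predict(X_test)),
}
print(tree_search.best_params_, forest_search.best_params_)
print(scores)
```

# Limiting max_depth or raising min_samples_leaf, as in the tree grid above, is a
# form of pre-pruning and is one common way to reduce overfitting in decision trees.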
