Project completed in week 2 (05.10.-09.10.20) of the Data Science Bootcamp at Spiced Academy in Berlin.
In the second week of the bootcamp, we started with Machine Learning (ML). If you think about it, ML surrounds us in everyday life: Netflix recommending you movies you might like, your smartphone camera detecting faces, self-driving cars recognizing pedestrians on the street, banks detecting credit card fraud -- these are all applications of ML. The methods behind them can be split into three main ML categories:
- Classification: Logistic Regression, Decision Trees, Random Forest
- Regression: Linear Regression, Regression Trees, SVR, Forecasting
- Unsupervised: PCA, Clustering, t-SNE, Matrix factorization
This week we focused only on classification and applied logistic regression, decision tree, and random forest models on the Titanic dataset to predict passenger survival.
- Logistic Regression separates the data points into two classes with a curve. It's a fast and performant model, but the data might need extensive feature engineering beforehand (e.g. encoding categorical values numerically). Logistic Regression is rooted in statistics and dates back to 1958.
- Decision Trees classify the data by "asking questions" about different features in the dataset, just like in a guessing game. You can specify the depth of the tree (how many levels of questions it may ask), depending on the desired complexity.
- Random Forest is basically an ensemble of decision trees. This model doesn't require feature engineering, but it does require tweaking the hyperparameters. The downside of Decision Trees, and therefore of Random Forest, is that they are prone to overfitting. Decision Trees and Random Forest are rooted rather in computer science and date back to 1986 and 1995, respectively. (A minimal sketch of all three models follows below.)
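To make that more concrete, here is a minimal scikit-learn sketch (not my exact notebook code) that fits all three classifiers; `X_train` and `y_train` are assumed to already hold the numerically encoded Titanic features and the survival labels:

```python
# Minimal sketch: fitting the three classifiers with scikit-learn.
# X_train / y_train are assumed to be the already encoded features
# and the survival labels (0 = died, 1 = survived).
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4),       # limit how many "questions" the tree may ask
    "Random Forest": RandomForestClassifier(n_estimators=100),  # an ensemble of 100 decision trees
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_train, y_train), 3))  # accuracy on the train set
}
```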
> 'Logistic regression' is a terrible term. It should be called 'binary classification with a sigmoid function'.
>
> (one of our teachers)
Now, a few words about why Feature Engineering is so important (and annoying). An ML model like Logistic Regression doesn't understand words (e.g. female, good, orange), so we need to transform them into numbers. But depending on the data type, we use different Encoders:
- for ordinal data (if the values have an inherent order, e.g.: good-great-excellent, child-teenager-adult-senior): KBinsDiscretizer, LabelEncoder
- for nominal data (if the values are equal, i.e. have no inherent order, e.g.: male-female, Berlin-Vienna-Bucharest): OneHotEncoder
In the Titanic dataset, there were several columns that needed to be transformed in different ways, while leaving the others untouched. The best way to do this is with a ColumnTransformer, which applies all individual transformations in one step.
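To illustrate, here's a hedged sketch of such a ColumnTransformer; the column names are the standard Titanic ones, but the bin count and the assumption that missing ages were already imputed are mine, not necessarily what I used in the project:

```python
# Hedged sketch of the preprocessing step, assuming a DataFrame `df` with the
# usual Titanic columns and with missing ages already imputed; the exact
# columns and bin count are illustrative.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer

preprocessor = ColumnTransformer(
    transformers=[
        # nominal columns -> one binary column per category
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
        # numeric age -> a few ordered bins (roughly child / teenager / adult / senior)
        ("age_bins", KBinsDiscretizer(n_bins=4, encode="ordinal"), ["Age"]),
    ],
    remainder="passthrough",  # leave all other columns (e.g. Pclass, Fare) untouched
)

X = preprocessor.fit_transform(df[["Sex", "Embarked", "Age", "Pclass", "Fare"]])
```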
Evaluating Classifiers
After hours of wrangling the data and encoding it properly (e.g. extracting the honorifics from the passenger names), I applied the three models to the Titanic dataset and achieved the following accuracy scores:
| | Logistic Regression | Decision Tree | Random Forest |
| --- | --- | --- | --- |
| train set | 79.1 % | 80.3 % | 94.7 % |
| test set | 81.6 % | 76.5 % | 81.0 % |
It's interesting to note that Logistic Regression and Random Forest achieved similar scores on the test set (~81 %), but Logistic Regression scored lower on the train set, whereas Random Forest scored much higher there. This means that Random Forest overfit (it learned too much detail from the train set and couldn't generalize to the test set), whereas Logistic Regression learned less, but could apply that knowledge better to the test set.
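Diagnosing this is straightforward: compare the accuracy on the train set with the accuracy on a held-out test set. A small sketch (with illustrative parameters, assuming `X` and `y` are the encoded features and labels):

```python
# Sketch of the train/test comparison that exposes overfitting; X and y are
# assumed to be the encoded features and survival labels, and the split
# ratio / random_state are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # close to 1.0 -> memorizing the train set
print("test accuracy:", model.score(X_test, y_test))     # noticeably lower -> overfitting
```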
Another important point is that for Logistic Regression, feature engineering is decisive, whereas Random Forest is more of a black box.
I'd say my scores are not that bad, but of course there's room for improvement, for example by adding, removing or re-encoding features, or by tuning the depth of the tree.
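One way to tune the tree depth systematically would be a small grid search; the parameter grid below is just illustrative, not what I actually ran for the project:

```python
# Illustrative grid search over tree depth and number of trees; the parameter
# grid and cv=5 are assumptions, not the project's actual settings.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [3, 5, 8, None], "n_estimators": [50, 100, 200]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # best hyperparameter combination found
print(search.score(X_test, y_test))  # accuracy of the tuned model on the test set
```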
Friday Lightning Talk
Like last Friday, we had to give a talk on the weekly project or a related topic, and I presented my Decision Tree and Random Forest classifiers.