Project completed in week 9 (23.11.-27.11.20) of the Data Science Bootcamp at Spiced Academy in Berlin.

This week we dived into Deep Learning and learned about different types of neural networks (NN) and their applications in various domains. The main goal of this project was to learn and understand what each hyperparameter in a NN model does and how to tune it, so this week was more theoretical and math-heavy than usual.

Building a Neural Network

For my first deep learning project, I used the famous Fashion MNIST dataset created by Zalando. The dataset contains 60K images of 10 clothing categories (e.g., tops, sandals, trousers). In order to classify the images into the correct category, I tried two types of NNs:

  • Artificial Neural Network (ANN): consists of multiple perceptrons/neurons at each layer. An ANN is also called a Feed-Forward Neural Network, because the inputs are processed only in the forward direction. It consists of three kinds of layers: input, hidden, and output (a minimal sketch follows this list).
  • Convolutional Neural Network (CNN): uses filters to extract features and capture the spatial information from images. CNNs are the go-to model for image classification.
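
Since I'll only discuss the CNN below, here is a minimal, hypothetical sketch of what a plain feed-forward (Dense-only) classifier for the same 28×28 grayscale images could look like in Keras; it illustrates the ANN idea and is not the exact model I trained:

from tensorflow import keras
from tensorflow.keras.layers import Flatten, Dense

ann = keras.models.Sequential([
    Flatten(input_shape=(28, 28)),     # input layer: 784 pixel values per image
    Dense(100, activation='relu'),     # hidden layer
    Dense(10, activation='softmax'),   # output layer: one unit per clothing category
])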

In this post, I will present only the CNN model, since it’s the one that performed best in my project. Here’s an overview of my model:

from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = keras.models.Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(10, activation='softmax'))
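
Before walking through each layer: a quick way to sanity-check the architecture is model.summary(), which for this configuration should report roughly the following output shapes and parameter counts:

model.summary()
# Conv2D        -> (26, 26, 32)    320 parameters
# MaxPooling2D  -> (13, 13, 32)
# Flatten       -> (5408,)
# Dense         -> (100,)          540,900 parameters
# Dense         -> (10,)           1,010 parameters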

First I instantiated a Sequential model. Then I added several layers with different hyperparameters:

  • Conv2D is a 2D convolution layer, which creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. I used 32 filters, as it’s recommended to use powers of 2 as values.
  • kernel_size determines the height and width of the kernel, passed as a tuple.
  • activation specifies the activation function, which transforms the summed weighted input of a node into that node’s output. relu (Rectified Linear Unit) outputs the input directly if it is positive and 0 otherwise (see the small example after this list).
  • kernel_initializer refers to the function used to initialize the weights; he_uniform draws them from a uniform distribution whose range is scaled to the number of input units (He initialization).
  • input_shape represents the dimensions of the images (28×28 px) and the number of color channels (1 for grayscale). This hyperparameter needs to be specified only in the first layer.
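
As a toy illustration (not part of the model code) of what relu does to a batch of values, assuming NumPy is available:

import numpy as np

def relu(x):
    # ReLU: pass positive values through unchanged, clamp negative values to 0
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))   # [0.  0.  0.  1.5 3. ]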

Next I added a MaxPooling2D layer, which downsamples its input by taking the maximum value over each window defined by pool_size (here 2×2), halving the height and width of the feature maps.
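
To make the downsampling concrete, here is a toy example (plain NumPy, not part of the model) of 2×2 max pooling applied to a small 4×4 feature map:

import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 3, 2],
                        [2, 6, 1, 4]])

# Take the maximum of each non-overlapping 2x2 window -> a 2x2 output
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[4 5]
                #  [6 4]]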

Then I added a Flatten layer, which reshapes the 2D feature maps coming out of the convolution and pooling layers into a one-dimensional vector (here 13×13×32 = 5,408 values), so that they can be fed into the fully connected layers that follow.

Lastly I added two Dense layers, which are fully connected layers, where the first parameter declares the number of desired units. The first one has 100 neurons with ReLU activation. The last layer has 10 neurons (one per clothing category) and softmax activation, which is used for multi-class classification because it turns the 10 outputs into class probabilities that sum to 1.
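
Here is a toy example of what softmax computes (with 3 classes instead of 10 for brevity; illustration only):

import numpy as np

def softmax(z):
    # Exponentiate and normalise so the outputs sum to 1 (class probabilities)
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66 0.24 0.10]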

Tuning the hyperparameters

Finally, I compiled the model:

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
  • optimizer defines which variant of stochastic gradient descent is used. I’ve tried both sgd (Stochastic Gradient Descent) and adam (Adaptive Moment Estimation), and stuck with the latter because it is more advanced and generally performs better.
  • loss defines the cost function; categorical_crossentropy is the standard choice for multi-class classification with one-hot-encoded labels (see the small example after this list).
  • metrics is a list of all the evaluation scores I want to compute. In this case, accuracy is enough.
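
As a small illustration of what categorical_crossentropy computes for a single image (toy numbers, 3 classes instead of 10):

import numpy as np

y_true = np.array([0, 1, 0])            # one-hot label: the true class is class 1
y_pred = np.array([0.1, 0.7, 0.2])      # softmax output of the network

# Cross-entropy: the negative log-probability assigned to the true class
loss = -np.sum(y_true * np.log(y_pred))
print(loss)   # ~0.357
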
Then I fitted the model on the training data:

from tensorflow.keras.utils import to_categorical

# xtrain/ytrain and xtest/ytest hold the Fashion MNIST images and labels
model.fit(
    xtrain,
    to_categorical(ytrain),
    epochs=15,
    batch_size=32,
    validation_data=(xtest, to_categorical(ytest))
)
  • epochs represents the number of complete passes over the training data.
  • batch_size is the number of images fed to the model in one go; it normally ranges from 16 to 512, but in any case it’s smaller than the total number of samples.
  • validation_data represents the part of the dataset kept for testing the model.
  • to_categorical one-hot-encodes the labels (clothing items).
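
To make the one-hot encoding concrete, label 5 (out of 10 classes) becomes a vector with a single 1 at index 5:

from tensorflow.keras.utils import to_categorical

print(to_categorical(5, num_classes=10))
# [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]

As for the batch size: with 60,000 training images and a batch_size of 32, one epoch corresponds to 60,000 / 32 = 1,875 weight updates.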

Evaluating the model performance

The CNN had an accuracy of 99.43% on the train set and 90.69% on the test set. This is a really good score! I think it could’ve been even better if I had let the model train longer (i.e. more epochs).
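
Here is a sketch of how such scores can be computed with Keras (assuming the same xtrain/ytrain and xtest/ytest variables as above; not necessarily the exact code I used):

# evaluate() returns the loss and the metrics defined in compile(), here accuracy
train_loss, train_acc = model.evaluate(xtrain, to_categorical(ytrain), verbose=0)
test_loss, test_acc = model.evaluate(xtest, to_categorical(ytest), verbose=0)
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")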


Friday Lightning Talk

This Friday talk was a bit different from the previous ones. Instead of presenting our projects, we had to read and present a paper about a Deep Learning application, like generative art, object recognition, or text generation. Of course, I chose the last topic and tried an LSTM to generate poems in the style of E.A. Poe. For the talk itself, however, I spoke about GPT-3, a state-of-the-art deep learning model that can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. I focused not so much on the technical details as on the ethics and implications of this technology. This opened an important discussion in our group, one that I think should be part of the curriculum of every tech degree.