Feedback on the nc-fish Kaggle competition

Presentation of the challenge


The Nature Conservancy Fisheries Monitoring was a Kaggle image classification challenge.

The goal was to classify images taken on fishing boats and extract the class of the fish in each image. Unlike datasets such as ImageNet, the object to classify was often in unusual places and could be highly distorted.

The challenge was run in two stages. The first stage required the classification of 1,000 test images, which represented only 8% of the stage 2 dataset. The public leaderboard was computed on the data from the first stage, and participation in the second stage was only open to people who had submitted a model during stage 1. I'll end the suspense right away: I was ranked ~150 / 2293 before the public leaderboard was wiped. I am now 91/391 on the public leaderboard and 321/391 on the private leaderboard (ranked on the whole stage 2 dataset).

Where to start?


This was my first image competition (and the first time I did machine learning on images). I should mention the fast.ai course by Jeremy Howard and the GitHub repository from rdcolema, which helped me get started.

Tuning VGG16


Transfer learning relies on the idea that the first layers of your neural net can stay the same regardless of your dataset. This is because, most of the time, the first layers detect lines, edges and other small patterns that are not specific to your dataset. However, the way these patterns are combined to make a cat, a dog or a tuna is only relevant if you want to identify cats, dogs and fish.
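To make this concrete, here is a minimal Keras sketch of the idea (not my exact model): the convolutional base of VGG16 is frozen and only a small new head is trained on the fish images. The head size, dropout rate and optimizer are placeholder choices.

```python
from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense, Dropout

# Load VGG16 pre-trained on ImageNet, without its fully connected head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the convolutional layers: they already detect edges and textures
for layer in base.layers:
    layer.trainable = False

# New head: only these weights are learned on the fish images
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
out = Dense(8, activation='softmax')(x)  # the competition had 8 classes

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```

A common refinement is then to unfreeze the last convolutional block and fine-tune it with a small learning rate.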

The results of transfer learning 'out of the box' are usually good, but not fantastic. The issue in this competition was that there were only a limited number of boats in the training set, and the neural net tended to learn to recognize the boat rather than the fish.

What went very very wrong


Using a convolutional VGG, I overfitted badly. I had pretty good results on the validation and stage 1 datasets because the boats were identical. The stage 2 dataset included new boats, leading to a drop in my ranking.

What else I tried


I tried many different things in this competition. First, I trained a cascade detector, and also experimented with histograms of oriented gradients. The results were extremely poor, and I quickly let this go. It was around the same time that I discovered Jeremy Howard's courses, which saved me quite some time.

I also tried a three-step architecture (sketched just after the list) with:

  1. Regional proposal
  2. Detection (Fish / No fish inside the cropped image)
  3. Classification if a fish is supposed to be there
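Conceptually, the pipeline looked like the sketch below; the model names and the 0.5 threshold are placeholders for illustration, not my actual code. Each stage only sees the output of the previous one.

```python
import numpy as np

# Placeholder models, assumed to be trained separately:
#   proposer.predict(image)  -> list of candidate crops
#   detector.predict(crop)   -> probability that the crop contains a fish
#   classifier.predict(crop) -> probability vector over the fish classes

def predict_image(image, proposer, detector, classifier, n_classes=8):
    crops = proposer.predict(image)                       # 1. regional proposal
    best = max(crops, key=lambda c: detector.predict(c))  # 2. keep the most "fishy" crop
    if detector.predict(best) < 0.5:
        # No fish found: fall back to a flat distribution
        return np.full(n_classes, 1.0 / n_classes)
    return classifier.predict(best)                       # 3. classify the species
```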

The issue with that kind of structure is that one element cannot learn from the mistakes of the others, so, in the end, you multiply errors without any possibility of correction. The result was a log loss 1.5 to 2 times higher than that of the convolutional model. The recent way to solve this problem is to use a Single Shot MultiBox Detector (SSD) or a similar network such as YOLO. That's not me saying it, but the winners.
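A quick back-of-the-envelope calculation shows how fast this degrades (the stage accuracies below are made up, purely for illustration):

```python
# Made-up stage accuracies, just to illustrate the compounding effect
p_proposal, p_detection, p_classification = 0.9, 0.9, 0.9

# If the stages fail independently, the chain is only right when every stage is right
p_pipeline = p_proposal * p_detection * p_classification
print(round(p_pipeline, 2))  # 0.73, and log loss punishes the confident mistakes
```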

Anything else?


One of the other things I learnt was that splitting the train and validation sets can be highly challenging, especially when the dataset contains so many (almost) identical images. I should not have focused on the leaderboard so much: sometimes taking a break from what you are doing and thinking about the possible issues with your model can lead to more improvement than testing a new parameter.
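One thing I would do differently is to split by boat rather than by image, so that the validation set contains boats the model has never seen. Here is a small scikit-learn sketch, assuming boat identifiers are available (for example from clustering the image backgrounds):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: in practice these come from the image files and boat ids
filenames = ['img_%03d.jpg' % i for i in range(10)]
labels    = ['ALB', 'YFT'] * 5
boat_ids  = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]

# Keep every image of a given boat on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, val_idx = next(splitter.split(filenames, labels, groups=boat_ids))

# Validation now only contains unseen boats, which is much closer to stage 2
```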

I also wanted to highlight that the Kaggle community is truly helpful. People shared their annotations and scripts, even though the competition awarded prize money. If you ever thought that people are purely money-driven, I would recommend visiting that site at least once.

Of course, this was only my side of the competition, and you are more than welcome to share your experience with this competition or with image classification in general.

Written on April 19, 2017