Using Kaggle datasets

Even if you are a beginner in machine learning, you’ve probably heard about Kaggle. Kaggle is an online community of data scientists and machine learning practitioners. It lets users find and publish datasets, explore and build models, collaborate with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. In short, Kaggle is the right place to learn and practice machine learning.

In this blog post, you will learn how to use one of the many datasets provided by Kaggle. You can clone this project here: https://github.com/waslleysouza/keras.

Let’s create some folders inside the project folder.
Create a new folder named ‘datasets’.
Inside the ‘datasets’ folder, create a new folder named ‘dogs-vs-cats’.
Finally, inside the ‘dogs-vs-cats’ folder, create a new folder named ‘original’.

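If you prefer to create these folders from Python instead of by hand, here is a minimal sketch (assuming Python 3, run from the project folder):

import os

# Create datasets/dogs-vs-cats/original inside the project folder
os.makedirs(os.path.join('datasets', 'dogs-vs-cats', 'original'), exist_ok=True)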

Go to Kaggle (https://www.kaggle.com/c/dogs-vs-cats/data) and download the dogs-vs-cats dataset.

Copy the zip file to the original folder and unzip it.
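If you prefer to unzip it from Python, a minimal sketch (the archive name 'train.zip' is an assumption; adjust it to match the file you downloaded from Kaggle):

import zipfile

# Hypothetical archive name -- adjust to the file you actually downloaded
archive_path = 'datasets/dogs-vs-cats/original/train.zip'

# Extract the images into the original folder
with zipfile.ZipFile(archive_path) as archive:
    archive.extractall('datasets/dogs-vs-cats/original')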

Let’s start coding!
Import all required libraries.

from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from keras.preprocessing.image import ImageDataGenerator
import os, shutil

All the pictures of dogs and cats are mixed together inside the original/train folder, so we need to organize them.
The following code defines the paths of the folders we are going to use.

base_dir = 'datasets/dogs-vs-cats'

original_dir = os.path.join(base_dir, 'original')
original_train_dir = os.path.join(original_dir, 'train')

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

cats_train_dir = os.path.join(train_dir, 'cats')
cats_validation_dir = os.path.join(validation_dir, 'cats')

dogs_train_dir = os.path.join(train_dir, 'dogs')
dogs_validation_dir = os.path.join(validation_dir, 'dogs')

The following code creates the six new folders.

os.mkdir(train_dir)
os.mkdir(validation_dir)
os.mkdir(cats_train_dir)
os.mkdir(cats_validation_dir)
os.mkdir(dogs_train_dir)
os.mkdir(dogs_validation_dir)

Now that we’ve created all the necessary folders, we can organize the images.
From the original/train folder, 10,000 cat images are copied to the train/cats folder and 2,500 cat images are copied to the validation/cats folder.
The same numbers of dog images are copied to the train/dogs and validation/dogs folders.

def copy_images_to_folder(filename_pattern, start_range, stop_range, src_dir, dst_dir):
    filenames = [filename_pattern.format(i) for i in range(start_range, stop_range)]
    for filename in filenames:
        src = os.path.join(src_dir, filename)
        dst = os.path.join(dst_dir, filename)
        shutil.copyfile(src, dst)

copy_images_to_folder('cat.{}.jpg', 0, 10000, original_train_dir, cats_train_dir)
copy_images_to_folder('cat.{}.jpg', 10000, 12500, original_train_dir, cats_validation_dir)

copy_images_to_folder('dog.{}.jpg', 0, 10000, original_train_dir, dogs_train_dir)
copy_images_to_folder('dog.{}.jpg', 10000, 12500, original_train_dir, dogs_validation_dir)

Use ImageDataGenerator to read the train and validation folders and create batches of images and labels.
These batches will be used to train and validate the model.

batch_size = 20

train_datagen = ImageDataGenerator(rescale=1./255)
validation_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(train_dir, 
                                                    target_size=(50,50), 
                                                    batch_size=batch_size, 
                                                    class_mode='binary')

validation_generator = validation_datagen.flow_from_directory(validation_dir, 
                                                              target_size=(50,50), 
                                                              batch_size=batch_size, 
                                                              class_mode='binary')

Create a new Convolutional Neural Network.

model = Sequential()
model.add(Conv2D(32, kernel_size=(3,3), activation='relu', input_shape=(50,50,3)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(128, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(128, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

Compile and train the network using the fit_generator method.

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Use integer division so steps_per_epoch and validation_steps are whole numbers
history = model.fit_generator(train_generator, 
                              steps_per_epoch=20000 // batch_size, 
                              epochs=10, 
                              validation_data=validation_generator, 
                              validation_steps=5000 // batch_size)
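If you want to see how the training went, the history object returned by fit_generator holds the per-epoch metrics. A minimal plotting sketch (assuming matplotlib is installed; note that the accuracy key is named 'acc' in older Keras versions and 'accuracy' in newer ones):

import matplotlib.pyplot as plt

# The accuracy key name differs between Keras versions
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
epochs_range = range(1, len(history.history['loss']) + 1)

plt.plot(epochs_range, history.history[acc_key], label='Training accuracy')
plt.plot(epochs_range, history.history['val_' + acc_key], label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()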

From now on, you can use other datasets, not just the ones that come bundled with Keras.
Have a good time!

Author: Waslley Souza

Oracle consultant focused on Oracle Fusion Middleware and SOA technologies. Certified in Oracle WebCenter Portal, Oracle ADF, and Java.
