
Acta Univ. Sapientiae, Informatica, 10, 1 (2018) 26–42

Fruit recognition from images using deep learning

Horea Mureșan
Faculty of Mathematics and Computer Science
Babeș-Bolyai University
Mihail Kogălniceanu, 1
Romania
email: horea94@gmail.com

Mihai Oltean
Faculty of Exact Sciences and Engineering
"1 Decembrie 1918" University of Alba Iulia
Unirii, 15-17
Romania
email: mihai.oltean@gmail.com

Abstract. In this paper we introduce a new, high-quality dataset of images containing fruits. We also present the results of some numerical experiments for training a neural network to detect fruits. We discuss the reasons why we chose to use fruits in this project by proposing a few applications that could use such a classifier.

Keywords: Deep learning, Object recognition, Computer vision, fruits dataset, image processing

1 Introduction

The aim of this paper is to propose a new dataset of images containing popular fruits. The dataset was named Fruits-360 and can be downloaded from the addresses pointed by references [20] and [21]. Currently (as of 2019.09.21) the set contains 82213 images of 120 fruits and vegetables, and it is constantly updated with images of new fruits and vegetables as soon as the authors have access to them. The reader is encouraged to access the latest version of the dataset from the addresses indicated above.

Computing Classification System 1998: I.2.6
Mathematics Subject Classification 2010: 68T45
Key words and phrases: Deep learning, Object recognition, Computer vision


5 TensorFlow library

For the purpose of implementing, training and testing the network described in this paper we used the TensorFlow library [32]. This is an open source framework for machine learning created by Google for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays called tensors.

The main components in a TensorFlow system are the client, which uses the Session interface to communicate with the master, and one or more worker processes, with each worker process responsible for arbitrating access to one or more computational devices (such as CPU cores or GPU cards) and for executing graph nodes on those devices as instructed by the master.
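As a hedged illustration of this model (not code from the paper), a minimal TensorFlow 1.x graph can be built and executed through a Session as follows; the constants are made up:

import tensorflow as tf

# nodes (operations) are added to the data flow graph; edges carry tensors
a = tf.constant(2.0)
b = tf.constant(3.0)
c = a * b  # a multiplication node

# the client talks to the master through the Session interface;
# the master schedules the node on an available device
with tf.Session() as sess:
    print(sess.run(c))  # prints 6.0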

TensorFlow offers several powerful features: unlike most similar frameworks, it can map computation to multiple machines; it has built-in support for automatic gradient computation; it can partially execute subgraphs of the entire graph; and it can add constraints to devices, such as placing nodes on devices of a certain type or ensuring that two or more objects are placed in the same space.

TensorFlow is used in several projects, such as the Inception Image Classification Model [31]. This project introduced a state-of-the-art network for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014. In this project the usage of the computing resources is improved by adjusting the network width and depth while keeping the computational budget constant [31].

Another project that employs the TensorFlow framework is DeepSpeech, developed by Mozilla. It is an open source Speech-To-Text engine based on Baidu's Deep Speech architecture [9]. The architecture is a state-of-the-art recognition system developed using end-to-end deep learning. It is simpler than other architectures and does not need hand-designed components for background noise, reverberation or speaker variation.

Below we present the most important TensorFlow methods and data types that we used, together with a short description of each.


A convolutional layer is defined like this:

conv2d(
    input,
    filter,
    strides,
    padding,
    use_cudnn_on_gpu=True,
    data_format='NHWC',
    dilations=[1, 1, 1, 1],
    name=None
)

Computes a 2-D convolution given 4-D input and filter tensors. Given an input tensor of shape [batch, in_height, in_width, in_channels] and a kernel tensor of shape [filter_height, filter_width, in_channels, out_channels], this op performs the following:

- Flattens the filter to a 2-D matrix with shape [filter_height * filter_width * in_channels, output_channels].
- Extracts image patches from the input tensor to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
- For each patch, right-multiplies the filter matrix and the image patch vector.
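As a small sketch (not from the paper), a conv2d call on a dummy batch with the shapes described above might look like this; the zero tensors are placeholders for real data:

import tensorflow as tf

images = tf.zeros([1, 100, 100, 3])  # [batch, in_height, in_width, in_channels]
kernel = tf.zeros([5, 5, 3, 16])     # [filter_height, filter_width, in_channels, out_channels]

# with stride 1 and SAME padding, the spatial size is preserved
feature_maps = tf.nn.conv2d(images, kernel,
                            strides=[1, 1, 1, 1],
                            padding='SAME')
# feature_maps has shape [1, 100, 100, 16]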

tf.nn.max_pool(
    value,
    ksize,
    strides,
    padding,
    data_format='NHWC',
    name=None
)

Performs the max pooling operation on the input. The ksize and strides parameters can be lists or tuples of 4 elements. ksize represents the size of the window for each dimension of the input tensor and strides represents the stride of the sliding window for each dimension of the input tensor. The padding parameter can be "VALID" or "SAME".
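A minimal sketch of a non-overlapping 2 x 2, stride-2 pooling call (the input tensor is a made-up placeholder):

import tensorflow as tf

feature_maps = tf.zeros([1, 100, 100, 16])
pooled = tf.nn.max_pool(feature_maps,
                        ksize=[1, 2, 2, 1],    # 2 x 2 window over height and width
                        strides=[1, 2, 2, 1],  # stride 2, so pooled regions do not overlap
                        padding='SAME')
# pooled has shape [1, 50, 50, 16]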


tf.nn.relu(
    features,
    name=None
)

Computes the rectified linear operation, max(features, 0). The features parameter is a tensor.
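For example (a trivial sketch, values made up):

import tensorflow as tf

features = tf.constant([-3.0, -1.0, 0.0, 2.0])
rectified = tf.nn.relu(features)  # clamps negative entries to 0

with tf.Session() as sess:
    print(sess.run(rectified))  # [0. 0. 0. 2.]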

tf.nn.dropout(
    x,
    keep_prob,
    noise_shape=None,
    seed=None,
    name=None
)

Applies dropout on input x with probability keep_prob. This means that for each value in x the method outputs the value scaled by 1 / keep_prob with probability keep_prob, or 0 otherwise. The scaling is done in order to preserve the expected sum of the elements. The noise_shape parameter defines which groups of values are kept or dropped together. For example, a value of [k, 1, 1, n] for the noise_shape, with x having the shape [k, l, m, n], means that each row and column will be kept or dropped together, while the batch and channel components will be kept or dropped separately.
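A hedged sketch of the noise_shape behavior described above (TensorFlow 1.x API; the tensor and shapes are made up for illustration):

import tensorflow as tf

x = tf.ones([2, 3, 3, 4])  # shape [k, l, m, n]

# with noise_shape [k, 1, 1, n], all rows and columns of a given
# (batch, channel) pair are kept or dropped together;
# kept values are scaled by 1 / keep_prob = 1.25
dropped = tf.nn.dropout(x, keep_prob=0.8, noise_shape=[2, 1, 1, 4])

with tf.Session() as sess:
    print(sess.run(dropped))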

6 The structure of the neural network used in experiments

For this project we used a convolutional neural network. As previously described, this type of network makes use of convolutional layers, pooling layers, ReLU layers, fully connected layers and loss layers. In a typical CNN architecture, each convolutional layer is followed by a Rectified Linear Unit (ReLU) layer, then a pooling layer, then one or more convolutional layers, and finally one or more fully connected layers.

Note again that a characteristic that sets the CNN apart from a regular neural network is that it takes into account the structure of the images while processing them. A regular neural network converts the input into a one-dimensional array, which makes the trained classifier less sensitive to positional changes.


The input that we used consists of standard RGB images of size 100 x 100 pixels.

The neural network that we used in this project has the structure given in Table 2.

Table 2: The structure of the neural network used in this paper.

Layer type        Dimensions         Output
Convolutional     5 x 5 x 4          16
Max pooling       2 x 2, stride 2    -
Convolutional     5 x 5 x 16         32
Max pooling       2 x 2, stride 2    -
Convolutional     5 x 5 x 32         64
Max pooling       2 x 2, stride 2    -
Convolutional     5 x 5 x 64         128
Max pooling       2 x 2, stride 2    -
Fully connected   5 x 5 x 128        1024
Fully connected   1024               256
Softmax           256                60

A visual representation of the neural network used is given in Figure 2.

The first layer (Convolution #1) is a convolutional layer which applies 16 5 x 5 filters. On this layer we apply max pooling with a filter of shape 2 x 2 with stride 2, which specifies that the pooled regions do not overlap (Max-Pool #1). This also reduces the width and height to 50 pixels each.

The second convolutional layer (Convolution #2) applies 32 5 x 5 filters, which outputs 32 activation maps. We apply on this layer the same kind of max pooling (Max-Pool #2) as on the first layer, shape 2 x 2 and stride 2.

The third convolutional layer (Convolution #3) applies 64 5 x 5 filters. Following is another max pool layer (Max-Pool #3) of shape 2 x 2 and stride 2.

The fourth convolutional layer (Convolution #4) applies 128 5 x 5 filters, after which we apply a final max pool layer (Max-Pool #4).


Figure 2: Graphical representation of the convolutional neural network used in experiments.


Because of the four max pooling layers, the dimensions of the representation have each been reduced by a factor of 16, therefore the fifth layer, which is a fully connected layer (Fully Connected #1), has 7 x 7 x 16 inputs.
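As a quick sketch of this arithmetic (assuming each pooling layer rounds the spatial size up, as with SAME padding; the pooling in the appendix code uses VALID padding, which rounds down):

import math

size = 100
for i in range(1, 5):
    size = math.ceil(size / 2)  # each 2 x 2, stride-2 max-pool halves the spatial size
    print("after Max-Pool #%d: %d x %d" % (i, size, size))
# after Max-Pool #1: 50 x 50
# after Max-Pool #2: 25 x 25
# after Max-Pool #3: 13 x 13
# after Max-Pool #4: 7 x 7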

This layer feeds into another fully connected layer (Fully Connected #2) with 1024 inputs and 256 outputs.

The last layer is a softmax loss layer (Softmax) with 256 inputs. The number of outputs is equal to the number of classes.

We present a short scheme containing the flow of the training process:

iterations = 75000

read_images(images)
apply_random_hue_saturation_changes(images)
apply_random_vertical_horizontal_flips(images)
convert_to_hsv(images)
add_grayscale_layer(images)

define_network_structure(images, network, training_operation)

for i in range(1, iterations):
    sess.run(training_operation)

7 Numerical experiments

For the experiments we used 82110 images, split into two parts: the training set, which consists of 61488 images of fruits, and the testing set, which is made of 20622 images. The remaining 103 images with multiple fruits were not used in the training and testing of the network.

The data was bundled into a tfrecord file (specific to TensorFlow). This is a binary file that contains protocol buffers with a feature map. In this map it is possible to store information such as the image height, width, depth and even the raw image. Using these files we can create queues in order to feed the data to the neural network.
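As a hedged sketch of what goes into such a file (the file names and label value below are hypothetical; the feature keys mirror those read back by parse_single_example in the appendix):

import tensorflow as tf

image_bytes = open('apple.jpg', 'rb').read()  # hypothetical image file
label_value = 3                               # hypothetical class index

example = tf.train.Example(features=tf.train.Features(feature={
    'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label_value])),
    'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[100])),
    'width': tf.train.Feature(int64_list=tf.train.Int64List(value=[100])),
}))

# protocol buffers are serialized into the binary tfrecord file
with tf.python_io.TFRecordWriter('train-00000-of-00001.tfrecord') as writer:
    writer.write(example.SerializeToString())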

By calling the method shuffle_batch we provide randomized input to the network. The way we used this method was providing it example tensors for images and labels, and it returned tensors of shape batch_size x image dimensions and batch_size x labels. This greatly lowers the chance of using the same batch multiple times for training, which in turn improves the quality of the network.
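A minimal sketch of such a call (TensorFlow 1.x queue API; the example tensors and queue capacities are illustrative, not the values from the paper):

import tensorflow as tf

image = tf.zeros([100, 100, 4])   # a single-example image tensor (HSV + grayscale)
label = tf.constant(0, tf.int32)  # a single-example label tensor

images_batch, labels_batch = tf.train.shuffle_batch(
    [image, label],
    batch_size=60,
    capacity=2000,           # maximum number of elements in the queue
    min_after_dequeue=1000)  # minimum left after a dequeue, to ensure good mixing
# images_batch has shape [60, 100, 100, 4]; labels_batch has shape [60]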

We ran multiple scenarios in which the neural network was trained using different levels of data augmentation and preprocessing:

- convert the input RGB images to grayscale
- keep the input images in the RGB colorspace
- convert the input RGB images to the HSV colorspace
- convert the input RGB images to the HSV colorspace and to grayscale and merge them
- apply random hue and saturation changes on the input RGB images, randomly flip them horizontally and vertically, then convert them to the HSV colorspace and to grayscale and merge them

For each scenario we used the previously described neural network which was trained over 75000 iterations with batches of 60 images selected at random from the training set. Every 50 steps we calculated the accuracy using cross-validation. For testing we ran the trained network on the test set. The results for each case are presented in Table 3.

Table 3: Results of training the neural network on the fruits-360 dataset.

Scenario                                           Accuracy on     Accuracy on
                                                   training set    test set
Grayscale                                          99.82%          92.65%
RGB                                                99.82%          94.43%
HSV                                                99.80%          94.40%
HSV + Grayscale                                    99.78%          94.74%
HSV + Grayscale + hue/saturation change + flips    99.58%          95.23%

As reflected in Table 3, the best results were obtained by applying data augmentation and converting the RGB images to the HSV colorspace, to which the grayscale representation was concatenated. This is intuitive, since in this scenario we attach the most information to the input, so the network can learn multiple features in order to classify the images.

It is also important to notice that training on grayscale images yielded the best results on the train set but very weak results on the test set. We investigated this problem and discovered that a lot of images containing apples are incorrectly classified on the test set. In order to further investigate the issue, we ran a round of training and testing on just the apple classes of images. The results were similar, with high accuracy on the train data but low accuracy on the test data. We attribute this to overfitting: because the grayscale images lose too many features, the network does not properly learn how to classify the images.

In order to determine the best network configuration for classifying the images in our dataset, we took multiple configurations, used the train set to train them and then calculated their accuracy on the test and training sets.

In Table 4 we present the results.

Table 4: Results of training different network configurations on the fruits-360 dataset.

Nr.  Configuration                Accuracy on     Accuracy on
                                  training set    test set
1    Convolutional 5 x 5 16       99.58%          95.23%
     Convolutional 5 x 5 32
     Convolutional 5 x 5 64
     Convolutional 5 x 5 128
     Fully connected - 1024
     Fully connected - 256
2    Convolutional 5 x 5 8        99.68%          95.02%
     Convolutional 5 x 5 32
     Convolutional 5 x 5 64
     Convolutional 5 x 5 128
     Fully connected - 1024
     Fully connected - 256
3    Convolutional 5 x 5 32       99.24%          94.06%
     Convolutional 5 x 5 32
     Convolutional 5 x 5 64
     Convolutional 5 x 5 128
     Fully connected - 1024
     Fully connected - 256
4    Convolutional 5 x 5 16       99.31%          93.59%
     Convolutional 5 x 5 16
     Convolutional 5 x 5 64
     Convolutional 5 x 5 128
     Fully connected - 1024
     Fully connected - 256
5    Convolutional 5 x 5 16       99.39%          94.82%
     Convolutional 5 x 5 64
     Convolutional 5 x 5 64
     Convolutional 5 x 5 128
     Fully connected - 1024
     Fully connected - 256
6    Convolutional 5 x 5 16       99.34%          94.57%
     Convolutional 5 x 5 32
     Convolutional 5 x 5 32
     Convolutional 5 x 5 128
     Fully connected - 1024
     Fully connected - 256
7    Convolutional 5 x 5 16       99.55%          95.09%
     Convolutional 5 x 5 32
     Convolutional 5 x 5 128
     Convolutional 5 x 5 128
     Fully connected - 1024
     Fully connected - 256
8    Convolutional 5 x 5 16       99.17%          93.83%
     Convolutional 5 x 5 32
     Convolutional 5 x 5 64
     Convolutional 5 x 5 64
     Fully connected - 1024
     Fully connected - 256
9    Convolutional 5 x 5 16       99.22%          93.96%
     Convolutional 5 x 5 32
     Convolutional 5 x 5 64
     Convolutional 5 x 5 128
     Fully connected - 512
     Fully connected - 256
10   Convolutional 5 x 5 16       99.20%          93.79%
     Convolutional 5 x 5 32
     Convolutional 5 x 5 64
     Convolutional 5 x 5 128
     Fully connected - 1024
     Fully connected - 512

From Table 4 we can see that the best performance on the test set was obtained by configuration nr. 1. The same configuration obtained an accuracy merely 0.1% lower than the best accuracy on the training set. The general trend indicates that a configuration which obtains high accuracy on the train set will also perform well on the test set. However, one outlier can be seen in configuration nr. 4. This configuration obtained 99.31% accuracy (an average performance) on the train set, but only 93.59% accuracy on the test set, the worst result. This is a result of the model overfitting to the training data and not properly generalizing to other images.


The evolution of accuracy during training is given in Figure 3. It can be seen that the accuracy improves rapidly in the first 1000 iterations (becoming greater than 90%) and then improves very slowly over the remaining 74000 iterations.

Figure 3: Accuracy evolution over 75000 training iterations

Some of the incorrectly classified images are given in Table 5.

Table 5: Some of the images that were classified incorrectly, showing the correct class of the fruit, the class that was assigned by the network, and its associated probability. [Images omitted.]

Apple Golden 2 → Apple Golden 3 (96.54%)
Apple Golden 3 → Granny Smith (Apple) (95.22%)
Braeburn (Apple) → Apple Red 2 (97.71%)
Peach → Apple Red Yellow (97.85%)
Pomegranate → Nectarine (94.64%)
Peach → Apple Red 1 (97.87%)
Pear → Apple Golden 2 (98.73%)
Pomegranate → Braeburn (Apple) (97.21%)

8 Conclusions and further work

We described a new and complex database of images with fruits. We also made some numerical experiments using the TensorFlow library in order to classify the images according to their content.

From our point of view, one of the main objectives for the future is to improve the accuracy of the neural network. This involves further experimenting with the structure of the network. Various tweaks and changes to any layers, as well as the introduction of new layers, can provide completely different results. Another option is to replace all layers with convolutional layers. This has been shown to provide some improvement over networks that have fully connected layers in their structure; a consequence of replacing all layers with convolutional ones is an increase in the number of parameters of the network [29]. Another possibility is to replace the rectified linear units with exponential linear units. According to [8], this reduces computational complexity and gives significantly better generalization performance than rectified linear units on networks with more than 5 layers. We would like to try out these practices and also to try to find new configurations that provide interesting results.
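Since the conv helper in the appendix (network_structure/utils.py) takes an activation_fn parameter, such an experiment could be sketched by passing tf.nn.elu instead of the default tf.nn.relu; the layer name and shapes below are illustrative only:

import tensorflow as tf
from network_structure import utils

input_layer = tf.zeros([1, 100, 100, 4])  # dummy input with the network's depth
conv1 = utils.conv(input_layer, 'conv1_elu', kernel_width=5, kernel_height=5,
                   num_out_activation_maps=16, activation_fn=tf.nn.elu)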

In the near future we plan to create a mobile application which takes pictures of fruits and labels them accordingly.

Another objective is to expand the dataset to include more fruits. This is a more time-consuming process, since we want to include items that were not used in most other related papers.

Acknowledgments

A preliminary version of this dataset with 25 fruits was presented during the Students Communication Session from Babeș-Bolyai University, June 2017.


Warning

The project was developed using TensorFlow 1.8.0. If you use a newer version, you may receive deprecation warnings and some scripts may not work properly. In particular, the utils/freeze_graph.py script may produce errors, since the format of the checkpoint files may differ in newer TensorFlow versions. This script is available in every TensorFlow version, so using the script provided in your TensorFlow distribution is recommended.

The latest version can be found here: freeze_graph.py. An implementation of the same network, adapted to the latest TensorFlow version, can be found on Kaggle in a Python notebook: Fruit Network.

Appendix

In this section we present the source code and project structure used in the numerical experiment described in this paper. The source code can be downloaded from GitHub [20].

The source code is organized (on GitHub [20]) as follows:

root directory
    fruit_detection
        detect_fruits.py
    network
        fruit_test_net.py
        fruit_train_net.py
    network_structure
        fruit_network.py
        utils.py
    utils
        build_image_data.py
        constants.py
        freeze_graph.py
        labels

In order to run the project from the command line, first make sure the PYTHONPATH system variable contains the path to the root directory.

Ensure that utils/constants.py contains the proper paths.

Run utils/build_image_data.py to generate the tfrecord files with training and test data. This script is provided in the TensorFlow library. The file contains several default values for the flags. They can be changed in the code directly, or different values can be provided from the command line:

python utils/build_image_data.py [flags]

where the flags can be:

--train_directory: path to the folder containing the train images
--validation_directory: path to the folder containing the validation images
--output_directory: path to where to output the tfrecord files
--labels_file: path to the labels file
--train_shards, --test_shards: determine the number of tfrecord files for train data and test data
--num_threads: the number of threads to use when creating the tfrecord files

After the train and test data has been serialized, the train and test scripts can be run:

python network/fruit_train_net.py
python network/fruit_test_net.py

After the training has completed, the detection script can be run on a single image:

python fruit_detection/detect_fruits.py --image_path="path to a jpeg file"

Finally, the utils/freeze_graph.py script, which is also provided as a utility script in TensorFlow, creates a single file with the trained model data:

python utils/freeze_graph.py [flags]

These flags are mandatory:

--input_graph: path to the pbtxt file
--input_checkpoint: path to the ckpt file
--output_graph: name of the output file
--output_node_names: name of the last layer of the network (found in the network_structure/fruit_network.py file, in the conv_net method; in this case the name of the last layer is "out/out")


In the following, we will provide explanations for the code. We will begin with the definition of the general parameters and configurations of the project.

The following are defined in the utils/constants.py file:

- root_dir: the top level folder of the project
- data_dir: the folder where the .tfrecords are persisted
- fruit_models_dir: the folder where the network structure and parameters are saved
- labels_file: the path to the file that contains all the labels used
- training_images_dir, test_images_dir: paths to the folders containing the training and test images
- num_classes: the number of different classes used; it is determined by counting the number of elements in the labels file, which is also used in the utils/freeze_graph.py script
- number_train_images, number_test_images: the number of training and test images; used in the test method to calculate accuracy

All these configurations can be changed to suit the setup of anyone using the code.

# utils/constants.py

import os

# needs to be changed according to the location of the project
root_dir = 'C:\\root_directory\\'
data_dir = root_dir + '\\data\\'
fruit_models_dir = root_dir + '\\fruit_models\\'
labels_file = root_dir + '\\utils\\labels'

# change this to the path of the folders that hold the images
training_images_dir = '\\Fruit-Images-Dataset\\Training'
test_images_dir = '\\Fruit-Images-Dataset\\Test'

# number of classes: number of fruit classes + 1, resulting from the
# build_image_data.py script that leaves the first class as a background class;
# using the labels file that is also used in build_image_data.py
with open(labels_file) as f:
    labels = f.readlines()
num_classes = len(labels) + 1
number_train_images = trainingImageCount
number_test_images = testImageCount

In the network_structure/utils.py file we have helper methods used across the project:

- conv and fully_connected combine the TensorFlow methods of defining a convolutional layer and a fully connected layer, respectively, adding the bias to the layer and applying a linear rectifier:
  - a convolutional layer consists of groups of neurons that make up kernels
  - the kernels have a small size, but they always have the same depth as the input
  - the neurons from a kernel are connected to a small region of the input, called the receptive field, because it is highly inefficient to link all neurons to all previous outputs in the case of inputs of high dimensions such as images
- max_pool, loss, _int64_feature and _bytes_feature simplify the calls to the corresponding TensorFlow methods
- parse_single_example converts a serialized input into an image and label, as they were saved using utils/build_image_data.py

Here we also define methods to perform data augmentation on the input images. Data augmentation is a good way to reduce overfitting on models. Flipping the image horizontally and vertically helps prevent the network from using the orientation of the fruit as a feature when training. This should result in fruits being correctly classified regardless of their position in an image.


- augment_image applies the following operations on the train images:
  1. Alters the hue of the image
  2. Alters the saturation of the image
  3. Flips the image horizontally
  4. Flips the image vertically
  5. Calls build_hsv_grayscale_image on the result
- build_hsv_grayscale_image converts the image to the HSV color space, creates a grayscale version of the image and adds it as a fourth channel to the HSV image

By altering the hue and saturation, we simulate having a larger variety of fruits in the images. The values with which we alter these properties are small, since in nature there is a small color variance between different fruits of the same species.

# network_structure/utils.py

import tensorflow as tf


# perform data augmentation on images:
# add random hue and saturation,
# randomly flip the image vertically and horizontally,
# convert the image from RGB to HSV and
# add a 4th channel to the HSV ones that contains the image in gray scale
def augment_image(image):
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.random_hue(image, 0.02)
    image = tf.image.random_saturation(image, 0.9, 1.2)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    return build_hsv_grayscale_image(image)


# convert the image to HSV and add the gray scale channel
def build_hsv_grayscale_image(image):
    image = tf.image.convert_image_dtype(image, tf.float32)
    gray_image = tf.image.rgb_to_grayscale(image)
    image = tf.image.rgb_to_hsv(image)
    rez = tf.concat([image, gray_image], 2)
    return rez


def parse_single_example(serialized_example):
    features = tf.parse_single_example(
        serialized_example,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
            'height': tf.FixedLenFeature([], tf.int64),
            'width': tf.FixedLenFeature([], tf.int64)
        }
    )
    image = tf.image.decode_jpeg(features['image_raw'], channels=3)
    image = tf.reshape(image, [100, 100, 3])
    label = tf.cast(features['label'], tf.int32)
    return image, label


def conv(input_tensor, name, kernel_width, kernel_height, num_out_activation_maps,
         stride_horizontal=1, stride_vertical=1, activation_fn=tf.nn.relu):
    prev_layer_output = input_tensor.get_shape()[-1].value
    with tf.variable_scope(name):
        weights = tf.get_variable('weights',
                                  [kernel_height, kernel_width, prev_layer_output, num_out_activation_maps],
                                  tf.float32,
                                  tf.truncated_normal_initializer(stddev=5e-2, dtype=tf.float32))
        biases = tf.get_variable("bias", [num_out_activation_maps], tf.float32,
                                 tf.constant_initializer(0.0))
        conv_layer = tf.nn.conv2d(input_tensor, weights,
                                  (1, stride_horizontal, stride_vertical, 1), padding='SAME')
        activation = activation_fn(tf.nn.bias_add(conv_layer, biases), name=name)
        return activation


def fully_connected(input_tensor, name, output_neurons, activation_fn=tf.nn.relu):
    n_in = input_tensor.get_shape()[-1].value
    with tf.variable_scope(name):
        weights = tf.get_variable('weights', [n_in, output_neurons], tf.float32,
                                  initializer=tf.truncated_normal_initializer(stddev=5e-2, dtype=tf.float32))
        biases = tf.get_variable("bias", [output_neurons], tf.float32,
                                 tf.constant_initializer(0.0))
        logits = tf.nn.bias_add(tf.matmul(input_tensor, weights), biases, name=name)
        if activation_fn is None:
            return logits
        return activation_fn(logits)


def max_pool(input_tensor, name, kernel_height, kernel_width, stride_horizontal, stride_vertical):
    return tf.nn.max_pool(input_tensor,
                          ksize=[1, kernel_height, kernel_width, 1],
                          strides=[1, stride_horizontal, stride_vertical, 1],
                          padding='VALID',
                          name=name)


def loss(logits, onehot_labels):
    xentropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=onehot_labels,
                                                       name='xentropy')
    loss = tf.reduce_mean(xentropy, name='loss')
    return loss


def _int64_feature(value):
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))


def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

Following, in the network_structure/fruit_network.py file we have the network parameters and the method that defines the network structure:

- HEIGHT, WIDTH, CHANNELS: the image height, width and depth, respectively
- NETWORK_DEPTH: the depth of the input for the network (3 from the HSV image + 1 from the grayscale image)
- batch_size: the number of images selected in each training/testing step
- dropout: the probability to keep a node in each training step
  - during training, at each iteration, some nodes are ignored with probability 1 - dropout
  - this results in a reduced network, which is then used for a forward or backward pass
  - dropout prevents neurons from developing co-dependency and, in turn, overfitting
  - outside of training, the dropout is ignored and the entire network is used for classifying
- update_learning_rate: dynamically adjusts the learning rate as training progresses
- the weights and biases are defined so as to implement the structure described in Section 6: four convolutional layers applying 16, 32, 64 and 128 5 x 5 filters, each followed by a 2 x 2 max pooling layer with stride 2, then a fully connected layer with 1024 outputs, a fully connected layer with 256 outputs and, finally, a softmax layer whose number of outputs is equal to the number of classes
- build_model: defines the operations for the training process and for loss and accuracy evaluation

# network_structure/fruit_network.py

import tensorflow as tf
import numpy as np
from . import utils
from utils import constants

HEIGHT = 100
WIDTH = 100
# number of channels for an image - jpeg image has RGB channels
CHANNELS = 3
# number of channels for the input layer of the network: HSV + gray scale
NETWORK_DEPTH = 4

batch_size = 60
input_size = HEIGHT * WIDTH * NETWORK_DEPTH
# probability to keep the values after a training iteration
dropout = 0.8

# placeholder for input layer
X = tf.placeholder(tf.float32, [None, input_size], name="X")
# placeholder for actual labels
Y = tf.placeholder(tf.int64, [None], name="Y")

initial_learning_rate = 0.001
final_learning_rate = 0.00001
learning_rate = initial_learning_rate


def conv_net(input_layer):
    # number of activation maps for each convolutional layer
    number_of_act_maps_conv1 = 16
    number_of_act_maps_conv2 = 32
    number_of_act_maps_conv3 = 64
    number_of_act_maps_conv4 = 128

    # number of outputs for each fully connected layer
    number_of_fcl_outputs1 = 1024
    number_of_fcl_outputs2 = 256

    input_layer = tf.reshape(input_layer, shape=[-1, HEIGHT, WIDTH, NETWORK_DEPTH])

    conv1 = utils.conv(input_layer, 'conv1', kernel_width=5, kernel_height=5,
                       num_out_activation_maps=number_of_act_maps_conv1)
    conv1 = utils.max_pool(conv1, 'max_pool1', kernel_height=2, kernel_width=2,
                           stride_horizontal=2, stride_vertical=2)

    conv2 = utils.conv(conv1, 'conv2', kernel_width=5, kernel_height=5,
                       num_out_activation_maps=number_of_act_maps_conv2)
    conv2 = utils.max_pool(conv2, 'max_pool2', kernel_height=2, kernel_width=2,
                           stride_horizontal=2, stride_vertical=2)

    conv3 = utils.conv(conv2, 'conv3', kernel_width=5, kernel_height=5,
                       num_out_activation_maps=number_of_act_maps_conv3)
    conv3 = utils.max_pool(conv3, 'max_pool3', kernel_height=2, kernel_width=2,
                           stride_horizontal=2, stride_vertical=2)

    conv4 = utils.conv(conv3, 'conv4', kernel_width=5, kernel_height=5,
                       num_out_activation_maps=number_of_act_maps_conv4)
    conv4 = utils.max_pool(conv4, 'max_pool4', kernel_height=2, kernel_width=2,
                           stride_horizontal=2, stride_vertical=2)

    flattened_shape = np.prod([s.value for s in conv4.get_shape()[1:]])
    net = tf.reshape(conv4, [-1, flattened_shape], name="flatten")

    fcl1 = utils.fully_connected(net, 'fcl1', number_of_fcl_outputs1)
    fcl1 = tf.nn.dropout(fcl1, dropout)

    fcl2 = utils.fully_connected(fcl1, 'fcl2', number_of_fcl_outputs2)
    fcl2 = tf.nn.dropout(fcl2, dropout)

    out = utils.fully_connected(fcl2, 'out', constants.num_classes, activation_fn=None)

    return out


def update_learning_rate(acc, learn_rate):
    return max(learn_rate - acc * learn_rate * 0.9, final_learning_rate)


def build_model():
    # build the network
    logits = conv_net(input_layer=X)
    # apply softmax on the final layer
    prediction = tf.nn.softmax(logits)

    # calculate the loss using the predicted labels vs the expected labels
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=Y))
    # use adaptive moment estimation optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(loss=loss)

    # calculate the accuracy for this training step
    correct_prediction = tf.equal(tf.argmax(prediction, 1), Y)

    return train_op, loss, correct_prediction
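A hedged usage sketch (not from the paper) of how build_model ties into a session, with a dummy batch standing in for the input pipeline described next:

import numpy as np
import tensorflow as tf
from network_structure import fruit_network as network

train_op, loss, correct_prediction = network.build_model()

# dummy batch, only to illustrate the feed_dict shapes
batch_x = np.zeros([network.batch_size, network.input_size], dtype=np.float32)
batch_y = np.zeros([network.batch_size], dtype=np.int64)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, loss_value = sess.run([train_op, loss],
                             feed_dict={network.X: batch_x, network.Y: batch_y})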

The following two files, network/fruit_test_net.py and network/fruit_train_net.py, contain the logic for training and testing the network. Firstly, in network/fruit_train_net.py we have:

- iterations: the number of steps for which the training will be done
- acc_display_interval: the number of iterations to train for before displaying the loss and accuracy of the network
- save_interval: default number of iterations after which we save the model
- step_display_interval: number of iterations after which we display the total number of steps done and the time spent training the past step_display_interval iterations
- useCkpt: if true, load a previously trained model and continue training; else, train a new model from scratch
- build_datasets: reads the tfrecord files and prepares two datasets to be used during the training process
- calculate_intermediate_accuracy_and_loss: calculates the loss and accuracy on the training dataset; used during training to monitor the performance of the network
- train_model: runs the training process


# network/fruit_train_net.py

import tensorflow as tf
import numpy as np
import time
import os
import re

from network_structure import fruit_network as network
from network_structure import utils

from utils import constants

# default number of iterations to run the training
iterations = 75000
# default number of iterations after we display the loss and accuracy
acc_display_interval = 1000
# default number of iterations after we save the model
save_interval = 1000
# default number of iterations after we display the total number of steps
# done and the time spent training the past step_display_interval iterations
step_display_interval = 100
# use the saved model and continue training; defaults to false
useCkpt = False


# create two datasets from the previously created training tfrecord files;
# the first dataset will apply data augmentation and shuffle its elements
# and will continuously queue new items - used for training;
# the second dataset will iterate once over the training images - used for
# evaluating the loss and accuracy during the training
def build_datasets(filenames, batch_size):
    train_dataset = tf.data.TFRecordDataset(filenames).repeat()
    train_dataset = train_dataset.map(utils.parse_single_example).map(
        lambda image, label: (utils.augment_image(image), label))
    train_dataset = train_dataset.shuffle(buffer_size=10000,
                                          reshuffle_each_iteration=True)
    train_dataset = train_dataset.batch(batch_size)
    test_dataset = tf.data.TFRecordDataset(filenames)
    test_dataset = test_dataset.map(utils.parse_single_example).map(
        lambda image, label: (utils.build_hsv_grayscale_image(image), label))
    test_dataset = test_dataset.batch(batch_size)
    return train_dataset, test_dataset


def train_model(session, train_operation, loss_operation, correct_prediction, iterator_map):
    time1 = time.time()
    train_iterator = iterator_map["train_iterator"]
    test_iterator = iterator_map["test_iterator"]
    test_init_op = iterator_map["test_init_op"]
    train_images_with_labels = train_iterator.get_next()
    test_images_with_labels = test_iterator.get_next()
    for i in range(1, iterations + 1):
        batch_x, batch_y = session.run(train_images_with_labels)
        batch_x = np.reshape(batch_x, [network.batch_size, network.input_size])
        session.run(train_operation, feed_dict={network.X: batch_x, network.Y: batch_y})

        if i % step_display_interval == 0:
            time2 = time.time()
            print("time: %.4f step: %d" % (time2 - time1, i))
            time1 = time.time()

        if i % acc_display_interval == 0:
            acc_value, loss = calculate_intermediate_accuracy_and_loss(
                session, correct_prediction, loss_operation,
                test_images_with_labels, test_init_op,
                constants.number_train_images)
            network.learning_rate = network.update_learning_rate(
                acc_value, learn_rate=network.learning_rate)
            print("step: %d loss: %.4f accuracy: %.4f" % (i, loss, acc_value))
        if i % save_interval == 0:
            # save the weights and the meta data for the graph
            saver.save(session, constants.fruit_models_dir + 'model.ckpt')
            tf.train.write_graph(session.graph_def, constants.fruit_models_dir, 'graph.pbtxt')


def calculate_intermediate_accuracy_and_loss(session, correct_prediction, loss_operation,
                                             test_images_with_labels, test_init_op,
                                             total_image_count):
    session.run(test_init_op)
    loss = 0
    predicted = 0
    count = 0
    while True:
        try:
            test_batch_x, test_batch_y = session.run(test_images_with_labels)
            test_batch_x = np.reshape(test_batch_x, [-1, network.input_size])
            l, p = session.run([loss_operation, correct_prediction],
                               feed_dict={network.X: test_batch_x, network.Y: test_batch_y})
