Image captioning on a mobile terminal using a machine learning model
- Development and implementation of a neural network that takes an image as input and generates a sentence summarising the contents of the image.
- A web server that runs the above model.
- An Android application that acts as a client to the server.
Block diagram
The decoder is connected to the penultimate layer of VGG16 through a dense layer that reduces the feature size from 4096 to 512, which is also the number of internal states of our GRU architecture. The output layer has 10000 units, the size of our vocabulary.
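The dimensions above can be sketched in plain numpy. This is an illustration of the layer sizes only, with random weights and a single hand-written GRU step; it is not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

VGG_DIM, HIDDEN, VOCAB = 4096, 512, 10000

# Illustrative random weights (the real model learns these).
W_reduce = rng.standard_normal((VGG_DIM, HIDDEN)) * 0.01  # 4096 -> 512
W_z = rng.standard_normal((HIDDEN * 2, HIDDEN)) * 0.01    # GRU update gate
W_r = rng.standard_normal((HIDDEN * 2, HIDDEN)) * 0.01    # GRU reset gate
W_h = rng.standard_normal((HIDDEN * 2, HIDDEN)) * 0.01    # GRU candidate
W_out = rng.standard_normal((HIDDEN, VOCAB)) * 0.01       # 512 -> vocabulary

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x):
    """One GRU step with 512 internal states."""
    hx = np.concatenate([h, x])
    z = sigmoid(hx @ W_z)                                  # update gate
    r = sigmoid(hx @ W_r)                                  # reset gate
    h_tilde = np.tanh(np.concatenate([r * h, x]) @ W_h)    # candidate state
    return (1 - z) * h + z * h_tilde

# Image features from VGG16's penultimate layer initialise the decoder.
features = rng.standard_normal(VGG_DIM)
h = np.tanh(features @ W_reduce)              # 4096 -> 512
word_embedding = rng.standard_normal(HIDDEN)  # stand-in for a token embedding
h = gru_step(h, word_embedding)
logits = h @ W_out                            # 512 -> 10000 vocabulary scores
print(logits.shape)  # (10000,)
```

At each decoding step the word with the highest score (or a beam-search candidate) is fed back in until an end-of-sentence token is produced.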
The recurrent model was trained with the following hyperparameters:
- Images were represented as 4096-dimensional feature vectors from the penultimate layer of VGG16.
- Optimizer: RMSprop
- 20 epochs, batch size of 3000 images.
One epoch took 5 hours on an NVIDIA GTX 1060.
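For reference, a single RMSprop update can be sketched per parameter; the learning rate and decay here are common defaults, not necessarily the values used in training:

```python
import math

def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update: scale each gradient by a running average of
    its recent squared magnitudes, so steps stay stable across parameters.
    (lr and decay are common defaults, assumed for illustration.)"""
    new_param, new_cache = [], []
    for p, g, c in zip(param, grad, cache):
        c = decay * c + (1 - decay) * g ** 2
        new_cache.append(c)
        new_param.append(p - lr * g / (math.sqrt(c) + eps))
    return new_param, new_cache

w, cache = [1.0, -2.0], [0.0, 0.0]
grad = [0.5, -0.5]
w, cache = rmsprop_step(w, grad, cache)
# Each weight moves against the sign of its gradient.
```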
Accuracy
BLEU (2002) = 34.08%
METEOR (2005) = 34.08%
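BLEU scores caption quality by n-gram overlap with reference sentences. A toy BLEU-2 (unigram and bigram precision with a brevity penalty) can be sketched as follows; real evaluations use multiple references and smoothing, e.g. via nltk:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu2(candidate, reference):
    """Toy BLEU-2: geometric mean of clipped 1- and 2-gram precision,
    multiplied by a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in (1, 2):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c[g], r[g]) for g in c)  # clipped matches
        precisions.append(overlap / max(sum(c.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)

print(bleu2("a dog runs in the park", "a dog runs in the park"))  # 1.0
```

METEOR additionally rewards stem and synonym matches and word order, which is why the two metrics can diverge in practice.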
Cost function evolution on validation set
Results
On the validation set
Firebase authentication
Technologies used in server implementation
Flask (REST API)
HTML/CSS (UI)
Docker (Scalability)
Firebase (Authentication, Scalability)
Google Cloud (Hosting)
Scalability and load balancing
Docker instances
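A minimal Dockerfile for one such instance might look like this; the file names, port, and gunicorn entry point are assumptions for illustration, not taken from the project:

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
# Serve the Flask app (assumed to live in app.py as `app`).
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
```

Running several such containers behind a load balancer lets the service scale horizontally.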
Path to REST API for image description generation
POST http://127.0.0.1:5000/api/predict with the image as the payload
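A client call can be sketched with Python's standard library. Sending raw bytes with an octet-stream content type is an assumption for illustration; the actual server may expect multipart/form-data, and the Android client builds an equivalent HTTP request:

```python
import urllib.request

def build_predict_request(image_bytes, host="http://127.0.0.1:5000"):
    """Build the POST request that sends an image to the caption API.
    (Raw bytes as payload is a simplification of the real protocol.)"""
    return urllib.request.Request(
        f"{host}/api/predict",
        data=image_bytes,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )

req = build_predict_request(b"\xff\xd8\xff")  # placeholder JPEG bytes
print(req.get_method(), req.full_url)  # POST http://127.0.0.1:5000/api/predict
# urllib.request.urlopen(req) would return the generated caption
# once the Flask server is running.
```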
Structure
Introduction page
Login/Register Pages
Image selection page
Description generator