An implementation of a VQA model closely following the approach described in the paper "VQA with Cascade of Self- and Co-Attention Blocks."
Mishra, Aakansha; Anand, Ashish; Guha, Prithwijit
Full text available at: https://arxiv.org/pdf/2302.14777
Building a multi-modal model is usually a difficult challenge that researchers want to explore and conquer. With the rapid development of technology and algorithms, most current approaches have become easier than they used to be. This project was created with a strong desire to produce valuable results by combining several large pretrained language models (LLMs) together with closely related knowledge.
A VQA model is a deep neural network that answers questions about images: the user provides a question together with a related image, and the model predicts an answer based on both inputs, as illustrated in the images below:
(It is recommended to install the dependencies in a Conda environment.)
- python 3.7, 3.8, 3.9
- tqdm
- pytorch==0.4.0
- torchvision
- cython
- matplotlib
- numpy
- scipy
- pyyaml
- packaging
- pycocotools
- tensorboardx
- h5py
- opencv-python
- streamlit
- pillow # for PIL
Training and inference require a hardware-accelerated environment with a GPU, such as the Colab T4 or an A100.
Split | #Images | #Questions | #Answers |
---|---|---|---|
Train | 5000 | 248350 | 248350 |
Validation | 2500 | 121513 | 121513 |
Test | 2500 | 107395 | 107395 |
All original files are in JSON format; after data processing, they are converted to CSV files.
The CSV files contain the following fields:
[
{
"image_id": int,
"question": str,
"question_id": int,
"answer" : str
}
]
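As a reference, here is a minimal sketch of how the original JSON annotations could be flattened into this CSV layout. The file names (`train_annotations.json`, `train.csv`) are hypothetical placeholders, not paths from this repository:

```python
import csv
import json

# Hypothetical paths; point these at the actual annotation and output files.
JSON_PATH = "train_annotations.json"
CSV_PATH = "train.csv"

with open(JSON_PATH, "r", encoding="utf-8") as f:
    records = json.load(f)  # expected: a list of dicts with the fields shown above

with open(CSV_PATH, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["image_id", "question", "question_id", "answer"])
    writer.writeheader()
    for record in records:
        writer.writerow({
            "image_id": record["image_id"],
            "question": record["question"],
            "question_id": record["question_id"],
            "answer": record["answer"],
        })
```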
Google - BERT ( textual_feature_extractor ): BERT is a transformer model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on raw texts only, with no human labeling (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. Available on this link
Google - ViT ( visual_feature_extractor ): The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Available on this link
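A minimal sketch of how these two pretrained extractors could be loaded with the Hugging Face `transformers` library to produce textual and visual features. The checkpoint names (`bert-base-uncased`, `google/vit-base-patch16-224-in21k`) and the example image path are assumptions for illustration, not necessarily what this repo uses:

```python
import torch
from PIL import Image
from transformers import BertModel, BertTokenizer, ViTImageProcessor, ViTModel

# Assumed checkpoints; swap in the checkpoints configured in this repo if different.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
textual_feature_extractor = BertModel.from_pretrained("bert-base-uncased")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
visual_feature_extractor = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

question = "What color is the cat?"
image = Image.open("example.jpg").convert("RGB")  # hypothetical image path

with torch.no_grad():
    # Textual features: one 768-d vector per token.
    text_inputs = tokenizer(question, return_tensors="pt")
    text_features = textual_feature_extractor(**text_inputs).last_hidden_state  # (1, seq_len, 768)

    # Visual features: one 768-d vector per 16x16 patch plus the [CLS] token.
    image_inputs = image_processor(images=image, return_tensors="pt")
    visual_features = visual_feature_extractor(**image_inputs).last_hidden_state  # (1, 197, 768)
```

The two feature sequences are then fused by the model's attention blocks to predict an answer.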
Split | WUPS | MRR | Loss |
---|---|---|---|
Train | 0.23114 | 0.32561 | 4.52219 |
Validation | 0.23152 | 0.32828 | 5.18071 |
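For context on the two reported metrics, here is a simplified, hedged sketch of how WUPS and MRR are commonly computed (WUPS via WordNet Wu-Palmer similarity, MRR as the mean reciprocal rank of the ground-truth answer among ranked candidates). This illustrates the metric definitions and is not the evaluation code used in this repo:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def wup_score(pred, truth, threshold=0.0):
    """Simplified WUPS for single-word answers: best Wu-Palmer similarity over synset pairs."""
    if pred == truth:
        return 1.0
    pred_syns, truth_syns = wn.synsets(pred), wn.synsets(truth)
    if not pred_syns or not truth_syns:
        return 0.0
    best = max((s1.wup_similarity(s2) or 0.0) for s1 in pred_syns for s2 in truth_syns)
    # The standard WUPS@t variant down-weights similarities below the threshold t.
    return best if best >= threshold else 0.1 * best

def mean_reciprocal_rank(ranked_candidates, truths):
    """MRR: average of 1/rank of the ground truth (0 if it is absent from the candidates)."""
    total = 0.0
    for candidates, truth in zip(ranked_candidates, truths):
        if truth in candidates:
            total += 1.0 / (candidates.index(truth) + 1)
    return total / len(truths)

# Toy usage
print(wup_score("cat", "kitten", threshold=0.9))
print(mean_reciprocal_rank([["dog", "cat"], ["red", "blue"]], ["cat", "red"]))  # (1/2 + 1/1) / 2 = 0.75
```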
I used the Streamlit framework to deploy this model, but at the moment it can only be run locally. To run it on Hugging Face's servers, some additional steps are needed to register the model (because this model is built from scratch): more information
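A minimal sketch of what the local Streamlit app could look like. The `load_model` and `answer_question` helpers (and the `model` module they come from) are hypothetical placeholders standing in for this repo's actual loading and inference code:

```python
import streamlit as st
from PIL import Image

# Hypothetical helpers; replace with this repo's real model loading / inference code.
from model import load_model, answer_question

st.title("Visual Question Answering demo")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
question = st.text_input("Ask a question about the image")

if uploaded is not None and question:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Input image")

    model = load_model()                              # placeholder: load the trained VQA model
    answer = answer_question(model, image, question)  # placeholder: run inference
    st.write(f"**Answer:** {answer}")
```

Saved as e.g. `app.py`, this runs locally with `streamlit run app.py`.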
- Logo is generated by @tranphihung
- Fine-tune on a larger dataset
- Evaluation on downstream tasks
- Experiment with different model sizes
- Experiment with different serving frameworks: vLLM, TGI, Triton Inference Server, etc.
- Experiment with expanding the tokenizer and preparing for pre-training
Stay tuned for future releases as we are continuously working on improving the model, expanding the dataset, and adding new features.
Thank you for your interest in my project. I hope you find it useful. If you have any questions, please don't hesitate to contact me at tranphihung8383@gmail.com
- PAPER: OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. Available on this link. [Online; accessed June 1, 2024].
- PAPER: Stanislaw Antol and others. "VQA: Visual Question Answering". In International Conference on Computer Vision (ICCV), 2015. Available on this link. [Online; accessed June 1, 2024].
- PAPER: Yash Goyal and others. "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering". 2017. arXiv: 1612.00837 [cs.CV]. Available on this link. [Online; accessed June 1, 2024].
If you find this repo useful in your research, please consider citing the following papers:
@article{mishra2023vqa,
title={VQA with Cascade of Self- and Co-Attention Blocks},
author={Mishra, Aakansha and Anand, Ashish and Guha, Prithwijit},
journal={arXiv preprint arXiv:2302.14777},
year={2023}
}