A large language model (LLM) for visual question answering (VQA), built on a custom model using the Transformer architecture.

VISUAL QUESTION ANSWERING - LLM

BERT + Google ViT: answering questions about a related image via multimodal training

An implementation of a VQA model closely following the approach described in the paper: "VQA with Cascade of Self- and Co-Attention Blocks."

Mishra, Aakansha; Anand, Ashish; Guha, Prithwijit

Full text available at: https://arxiv.org/pdf/2302.14777

Contents

Model Overview

Introduction

Building a multimodal model is a difficult challenge that researchers continue to explore. With the rapid development of hardware and algorithms, many current approaches are far more approachable than they used to be. This project was created with the aim of producing valuable results by combining large pretrained language and vision models and the knowledge they encode.

A VQA model is a deep neural network that takes a question and a related image as input and produces an answer based on both sources of information, as illustrated below:

Architecture
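
The exact architecture follows the cascade of self- and co-attention blocks described in the paper above. As a rough, simplified illustration only (not this repository's actual code), the sketch below fuses BERT question features and ViT image features with a single cross-attention ("co-attention") step and an answer classifier; all class and variable names here are hypothetical.

import torch
import torch.nn as nn

class SimpleVQAFusion(nn.Module):
    """Heavily simplified sketch: one co-attention (cross-attention) step
    between question tokens and image patches, followed by an answer
    classifier. The actual model uses a cascade of self- and co-attention
    blocks as described in the referenced paper."""

    def __init__(self, hidden_dim=768, num_answers=1000, num_heads=8):
        super().__init__()
        # Question tokens attend over image patch embeddings.
        self.co_attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens,  hidden_dim) from BERT
        # image_feats: (batch, num_patches, hidden_dim) from ViT
        attended, _ = self.co_attention(text_feats, image_feats, image_feats)
        pooled = attended.mean(dim=1)   # simple mean pooling over tokens
        return self.classifier(pooled)  # logits over the answer vocabulary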

Getting Started

Install Required Packages

(It is recommended to install the dependencies in a Conda environment.)

  • python 3.7, 3.8, 3.9
  • tqdm
  • pytorch==0.4.0
  • torchvision
  • cython
  • matplotlib
  • numpy
  • scipy
  • pyyaml
  • packaging
  • pycocotools
  • tensorboardx
  • h5py
  • opencv-python
  • streamlit
  • pillow # for PIL

Training and inference require a hardware accelerator (GPU), such as a Colab T4 or an A100.
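
As a quick sanity check (assuming a working PyTorch installation), the snippet below simply verifies that a GPU is visible before training:

import torch

# Report the detected device; training on CPU will be very slow.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)
if device == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))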

Prepare the Training Data

Split        #Images   #Questions   #Answers
Train        5000      248350       248350
Validation   2500      121513       121513
Test         2500      107395       107395

All of the original files are in JSON format; after data processing, they are converted to CSV files.

Each record in the CSV files has the following attributes:

[
    {
        "image_id": int,
        "question": str,
        "question_id": int,
        "answer" : str
    }
]
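
For example, a processed split can be loaded with pandas and checked against this schema; the path data/train.csv below is hypothetical and should be replaced with the file produced by your preprocessing step.

import pandas as pd

# Hypothetical path -- replace with the actual processed CSV for each split.
df = pd.read_csv("data/train.csv")

# Columns expected after preprocessing (see the schema above).
assert {"image_id", "question", "question_id", "answer"} <= set(df.columns)
print(df[["image_id", "question", "answer"]].head())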

Download Base Models

Google - Bert ( textual_feature_extractor ): BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. Available on this link

Google - ViT ( visual_feature_extractor ): The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Available on this link
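
Both backbones can be loaded through the Hugging Face transformers library. The sketch below assumes the standard public checkpoints bert-base-uncased and google/vit-base-patch16-224-in21k and a hypothetical image path; the repository's own feature-extractor wrappers may differ.

from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer, BertModel, ViTModel

# Textual feature extractor (BERT) and visual feature extractor (ViT).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

question = "What color is the cat?"
image = Image.open("example.jpg").convert("RGB")  # hypothetical image path

text_feats = text_encoder(**tokenizer(question, return_tensors="pt")).last_hidden_state
image_feats = image_encoder(**image_processor(images=image, return_tensors="pt")).last_hidden_state
# These feature tensors are what a fusion model (see the sketch under "Architecture") consumes.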

Inference And Demo

Results

Split        WUPS      MRR       Loss
Train        0.23114   0.32561   4.52219
Validation   0.23152   0.32828   5.18071

Deployment on HuggingFace

I used the Streamlit framework to deploy this model, but currently it can only be run locally. To run it on Hugging Face's servers, a few additional steps are required to register the model (because this model is built from scratch): more information
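
A minimal local Streamlit sketch is shown below and started with streamlit run app.py; predict_answer is a hypothetical placeholder for this repository's actual inference entry point.

import streamlit as st
from PIL import Image

st.title("Visual Question Answering Demo")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
question = st.text_input("Ask a question about the image")

if uploaded is not None and question:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Input image")
    # predict_answer(...) is a hypothetical placeholder for the repo's real
    # inference function (load model, extract features, decode the answer).
    # answer = predict_answer(image, question)
    answer = "(model output goes here)"
    st.write(f"Answer: {answer}")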

Acknowledgements

Future Plans

  • Fine-tune on a larger dataset
  • Evaluation on downstream tasks
  • Experiment with different model sizes
  • Experiment with different serving frameworks: vLLM, TGI, Triton Inference Server, etc.
  • Experiment with expanding the tokenizer and prepare for pre-training

Stay tuned for future releases as we are continuously working on improving the model, expanding the dataset, and adding new features.

Thank you for your interest in my project. I hope you find it useful. If you have any questions, please don't hesitate to contact me at tranphihung8383@gmail.com

References

  • PAPER: OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. Available on this link. [Online; accessed June 1, 2024].

  • PAPER: Stanislaw Antol and others. "VQA: Visual Question Answering". In International Conference on Computer Vision (ICCV), 2015. Available on this link. [Online; accessed June 1, 2024].

  • PAPER: Yash Goyal and others. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. 2017. arXiv: 1612.00837 [cs.CV]. Available on this link. [Online; accessed June 1, 2024].

Citation

If you find this repo useful in your research, please consider citing the following papers:

@article{mishra2023vqa,
  title={VQA with Cascade of Self- and Co-Attention Blocks},
  author={Mishra, Aakansha and Anand, Ashish and Guha, Prithwijit},
  journal={arXiv preprint arXiv:2302.14777},
  year={2023}
}
