[Project Page] [Paper] [Dataset] [Eval Server]
2025.02.11
We are very proud to launch EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text! Our paper has been released on arXiv.
EgoTextVQA is a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. It contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. The benchmark consists of two parts: 1) EgoTextVQA-Outdoor focuses on outdoor scenarios, with 694 videos and 4,848 QA pairs that may arise when driving; 2) EgoTextVQA-Indoor emphasizes indoor scenarios, with 813 videos and 2,216 QA pairs that users may encounter in house-keeping activities. EgoTextVQA has several unique features:
- It stands out as the first VideoQA testbed towards egocentric scene-text aware QA assistance in the wild, with 7K QAs that reflect diverse user intentions under 1.5K different egocentric visual situations.
- The QAs emphasize scene-text comprehension, but only about half of them directly involve the exact scene text.
- The situations cover both indoor and outdoor activities.
- Detailed timestamps and categories of the questions are provided to facilitate real-time QA and model analysis.
- In the real-time QA setting, the answer must be derived from the video content captured before the question is asked, rather than from the whole video; the answer can therefore change with the timestamp of the question (see the sketch below).
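A minimal sketch of this real-time constraint, assuming OpenCV for frame reading (the function name and sampling scheme are illustrative, not the released evaluation code):

```python
# Minimal sketch of the real-time QA constraint (illustrative only, not the
# released evaluation code): a model answering a question asked at time t
# may only use frames captured before t, never the rest of the video.
import cv2

def sample_frames_before(video_path, question_time_s, num_frames=8):
    """Uniformly sample `num_frames` frames from [0, question_time_s)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    last_frame = max(int(question_time_s * fps), 1)  # frames at/after this stay unseen
    indices = [int(i * last_frame / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Example: a question asked 42.5 s into an ego-view video is answered
# using only frames from the first 42.5 seconds.
# frames = sample_frames_before("example_video.mp4", question_time_s=42.5)
```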
- Release paper on arXiv.
- Release dataset.
- Release model QA and evaluation code.
- Release model evaluation server.
Video Process:
MLLM QA Prompt:
Evaluation Prompt:
- Evaluation results of MLLMs on EgoTextVQA-Outdoor.
- Evaluation results of MLLMs on EgoTextVQA-Indoor.
If you have any questions or suggestions about the dataset, please contact: hzgn97@gmail.com. We are happy to communicate.
If this work is helpful to you, consider giving this repository a star and citing our paper as follows:
@article{zhou2025egotextvqa,
title={EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering},
author={Sheng Zhou and Junbin Xiao and Qingyun Li and Yicong Li and Xun Yang and Dan Guo and Meng Wang and Tat-Seng Chua and Angela Yao},
journal={arXiv preprint arXiv:2502.07411},
year={2025}
}
We would like to thank the following repos for their great work:
Our dataset is built upon: RoadTextVQA and EgoSchema.
Our evaluation is built upon: VideoChatGPT.