VQA

An approach to Video Question Answering that enhances Vision-Language Models by drawing on multiple external vision models.

Project Overview

This research project introduces an approach to Video Question Answering (VideoQA) that addresses the limitations of traditional latent video representations. Rather than relying on latent features alone, the method uses multiple external vision models to transform video content into detailed natural-language descriptions, which are then supplied to a Vision-Language Model to improve its answers.
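
To make the pipeline concrete, here is a minimal sketch of the idea. All function names below are hypothetical placeholders, not code from this repository: each "expert" stands in for a real external vision model (e.g. a frame captioner, an object detector, an action recognizer), and the prompt format is illustrative only.

```python
from typing import Callable, List

# Placeholder vision "experts" -- in practice each would wrap a real model
# such as an image captioner, object detector, or action recognizer.
def caption_frames(frames: List[str]) -> str:
    return "A person slices vegetables on a cutting board."

def detect_objects(frames: List[str]) -> str:
    return "Detected objects: knife, cutting board, carrot, hands."

def recognize_actions(frames: List[str]) -> str:
    return "Recognized action: chopping."

def describe_video(frames: List[str],
                   experts: List[Callable[[List[str]], str]]) -> str:
    """Run every external vision model and merge their outputs into one
    natural-language description of the video."""
    return "\n".join(expert(frames) for expert in experts)

def build_prompt(frames: List[str], question: str) -> str:
    """Pose the question over the textual description of the video,
    rather than over a latent video embedding."""
    description = describe_video(
        frames, [caption_frames, detect_objects, recognize_actions]
    )
    return (f"Video description:\n{description}\n\n"
            f"Question: {question}\nAnswer:")

if __name__ == "__main__":
    frames = ["frame_000.jpg", "frame_001.jpg"]
    print(build_prompt(frames, "What is the person cutting?"))
```

The resulting prompt can be passed to any text-capable Vision-Language Model; because the video has been converted to language, the model's reasoning is grounded in an interpretable intermediate representation.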