VQA

An approach to Video Question Answering that enhances Vision-Language Models by drawing on multiple external vision models.

Project Overview

This research project introduces an approach to Video Question Answering (VideoQA) that addresses the limitations of traditional latent video representations. Rather than relying on latent features alone, the method uses multiple external vision models to transform video content into detailed natural-language descriptions, which are then supplied to a Vision-Language Model to improve its answers.
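
To make the pipeline concrete, here is a minimal sketch of the idea. All function names below are hypothetical placeholders, not code from this repository: each "expert" stands in for a real external vision model (e.g. a frame captioner, an object detector, an action recognizer), and the prompt format is illustrative only.

```python
from typing import Callable, List

# Placeholder vision "experts" -- in practice each would wrap a real model
# such as an image captioner, object detector, or action recognizer.
def caption_frames(frames: List[str]) -> str:
    return "A person slices vegetables on a cutting board."

def detect_objects(frames: List[str]) -> str:
    return "Detected objects: knife, cutting board, carrot, hands."

def recognize_actions(frames: List[str]) -> str:
    return "Recognized action: chopping."

def describe_video(frames: List[str],
                   experts: List[Callable[[List[str]], str]]) -> str:
    """Run every external vision model and merge their outputs into one
    natural-language description of the video."""
    return "\n".join(expert(frames) for expert in experts)

def build_prompt(frames: List[str], question: str) -> str:
    """Pose the question over the textual description of the video,
    rather than over a latent video embedding."""
    description = describe_video(
        frames, [caption_frames, detect_objects, recognize_actions]
    )
    return (f"Video description:\n{description}\n\n"
            f"Question: {question}\nAnswer:")

if __name__ == "__main__":
    frames = ["frame_000.jpg", "frame_001.jpg"]
    print(build_prompt(frames, "What is the person cutting?"))
```

The resulting prompt can be passed to any text-capable Vision-Language Model; because the video has been converted to language, the model's reasoning is grounded in an interpretable intermediate representation.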