Visual Information Processing and Learning Group, Institute of Computing Technology, Chinese Academy of Sciences



Resource Sharing

Env-QA

1. Overview

Visual understanding goes well beyond the study of images or videos on the web. To accomplish complex tasks in volatile situations, the human vision system can deeply understand the environment, quickly perceive events happening around it, and continuously track objects' state changes, all of which remain challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a novel video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, each composed of a series of events about exploring and interacting with the environment. It also provides 85K questions that evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in the videos. Moreover, we propose a novel video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces an event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges that Env-QA poses in long-term state tracking, multi-event temporal reasoning, and event counting, among others.
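To make the idea of attending over event-level representations more concrete, here is a minimal sketch of question-guided attention pooling over per-event feature vectors. It is not the TSEA implementation: the function name `event_attention`, the scaled dot-product scoring, and the toy two-dimensional features are all assumptions chosen for illustration, and the overview above does not specify how TSEA computes its attention.

```python
import math


def event_attention(event_feats, question_vec):
    """Pool per-event feature vectors into one vector, weighted by the question.

    Hypothetical helper for illustration only; not part of Env-QA or TSEA.
    Every event feature is assumed to have the same length as question_vec.
    """
    if not event_feats:
        raise ValueError("need at least one event")
    dim = len(question_vec)
    # Scaled dot-product score between the question and each event.
    scores = [
        sum(e * q for e, q in zip(feat, question_vec)) / math.sqrt(dim)
        for feat in event_feats
    ]
    # Numerically stable softmax over events.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Attention-weighted sum of the event features.
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, event_feats))
        for i in range(dim)
    ]
    return weights, pooled


# Three made-up events; the question vector aligns best with the second one.
events = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [0.0, 2.0]
weights, pooled = event_attention(events, question)
```

In this toy input the second event receives the largest weight, so the pooled vector leans toward its features. A real model would learn the event and question embeddings rather than take them as fixed lists.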

 

 


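Visual Information Processing and Learning Group
  • Address: No. 6 Kexueyuan South Road, Zhongguancun, Haidian District, Beijing
  • Postal code: 100190
  • Phone: 010-62600514
  • Email: yi.cheng@vipl.ict.ac.cn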

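Copyright @ Visual Information Processing and Learning Group, Institute of Computing Technology, Chinese Academy of Sciences. ICP registration: 京ICP备05002829号-1; public security registration: 京公网安备1101080060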