MUCH: MUtual Coupling enHancement of scene recognition and dense captioning

Xinhang Song, Bohan Wang, Gongwei Chen, Shuqiang Jiang
(ACMMM 2019)
[PDF]

Abstract

Due to the abstraction of scenes, comprehensive scene understanding requires semantic modeling in both global and local aspects. Scene recognition is usually researched from a global point of view, while dense captioning is typically studied for local regions. Prior works model scene recognition and dense captioning separately. In contrast, we propose a joint learning framework that benefits from the mutual coupling of scene recognition and dense captioning models. Generally, the two tasks are coupled in two steps: 1) fusing the supervision by considering the contexts between scene labels and local captions, and 2) jointly optimizing semantically symmetric LSTM models. In particular, to balance the bias between dense captioning and scene recognition, a scene-adaptive non-maximum suppression (NMS) method is proposed to emphasize scene-related regions in the region proposal procedure, and a region-wise and category-wise weighted pooling method is proposed to avoid over-attention to particular regions in the local-to-global pooling procedure. For model training and evaluation, scene labels are manually annotated for the Visual Genome database. The experimental results on Visual Genome show the effectiveness of the proposed method. Moreover, the proposed method can also improve previous CNN-based works on public scene databases, such as MIT67 and SUN397.
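The abstract does not give implementation details, but the scene-adaptive NMS idea (re-scoring region proposals by their relevance to the scene before suppression, so scene-related regions are more likely to survive) can be sketched roughly as follows. The box format, the multiplicative `scene_relevance` weighting, and the IoU threshold are illustrative assumptions, not the authors' exact formulation.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2) corner coordinates.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def scene_adaptive_nms(boxes, scores, scene_relevance, iou_thresh=0.5):
    """Greedy NMS where each proposal's detection score is boosted by a
    scene-relevance weight (assumed to come from the scene model, in [0, 1]),
    so scene-related regions are ranked higher and kept preferentially."""
    adj = [s * (1.0 + r) for s, r in zip(scores, scene_relevance)]
    order = sorted(range(len(boxes)), key=lambda i: adj[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.5]
# With zero relevance this reduces to standard NMS; a high relevance on the
# second box lets it win the suppression against its higher-scored overlap.
print(scene_adaptive_nms(boxes, scores, [0.0, 0.0, 0.0]))  # [0, 2]
print(scene_adaptive_nms(boxes, scores, [0.0, 1.0, 0.2]))  # [1, 2]
```

With uniform relevance the procedure reduces to ordinary greedy NMS, so the scene term only shifts which of two overlapping proposals is kept, not how many regions survive overall.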


  • Xinhang Song, Bohan Wang, Gongwei Chen and Shuqiang Jiang. MUCH: MUtual Coupling enHancement of scene recognition and dense captioning. In ACM Multimedia 2019, 21-25 October 2019, Nice, France.