Shuqiang Jiang's homepage
Shuqiang Jiang
Ph.D
Tel:
010-62600505
Email:
sqjiang@ict.ac.cn
Address:
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, No.6 Kexueyuan South Road, Zhongguancun, Haidian District, Beijing 100190, China
Publication

Selected publications (more in DBLP)

Journal
  • Xinda Liu, Weiqing Min, Shuhuan Mei, Lili Wang, Shuqiang Jiang, Plant Disease Recognition: A Large-Scale Benchmark Dataset and a Visual Region and Loss Reweighting Approach.

    Each year, up to 40% of the world's food crops are lost to diseases and pests, causing more than 220 billion US dollars in annual agricultural trade losses and leaving millions of people hungry. Recent advances in image processing, and in deep learning in particular, offer a new and practical route to plant disease recognition, but the lack of systematic analysis and of sufficiently large datasets has kept research in this area from reaching the scale it deserves. This paper systematically analyzes the challenges of plant disease recognition from a computer vision perspective, collects a dataset of 271 disease categories with more than 220,000 images, and proposes a method that reweights visual regions and losses to emphasize diseased parts. The method strengthens the influence of diseased locations at both the image and the feature level, and balances global and local information through patch splitting and recombination. Extensive evaluations on the proposed dataset and on another public dataset demonstrate the advantage of the method. We hope this work will further advance plant disease recognition in the image processing community. (An illustrative code sketch of the reweighting idea follows this entry.)

    Abstract

    Plant disease diagnosis is very critical for agriculture due to its importance for increasing crop production. Recent advances in image processing offer us a new way to solve this issue via visual plant disease analysis. However, there are few works in this area, let alone systematic research. In this paper, we systematically investigate the problem of visual plant disease recognition for plant disease diagnosis. Compared with other types of images, plant disease images generally exhibit randomly distributed lesions, diverse symptoms and complex backgrounds, which makes it hard to capture discriminative information. To facilitate plant disease recognition research, we construct a new large-scale plant disease dataset with 271 plant disease categories and 220,592 images. Based on this dataset, we tackle plant disease recognition via reweighting both visual regions and loss to emphasize diseased parts. We first compute the weights of all the divided patches from each image based on the cluster distribution of these patches to indicate the discriminative level of each patch. Then we allocate the weight to each loss for each patch-label pair during weakly-supervised training to enable discriminative disease part learning. We finally extract patch features from the network trained with loss reweighting, and utilize the LSTM network to encode the weighted patch feature sequence into a comprehensive feature representation. Extensive evaluations on this dataset and another public dataset demonstrate the advantage of the proposed method. We expect this research will further the agenda of plant disease recognition in the community of image processing.


    • Xinda Liu, Weiqing Min, Shuhuan Mei, Lili Wang, Shuqiang Jiang. “Plant Disease Recognition: A Large-Scale Benchmark Dataset and a Visual Region and Loss Reweighting Approach”, IEEE Transactions on Image Processing (TIP), 2021.

    IEEE Trans. Image Processing (TIP), 2021
    [PDF]
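    A minimal, illustrative PyTorch sketch of the region-and-loss reweighting idea summarized above. The module names, the toy backbone, and the tensor shapes are placeholders rather than the authors' implementation, and the patch weights (computed in the paper from the cluster distribution of patch features) are assumed to be supplied from outside the module.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PatchReweightNet(nn.Module):
            def __init__(self, feat_dim=512, num_classes=271):
                super().__init__()
                # Stand-in patch feature extractor; any CNN backbone could be used here.
                self.backbone = nn.Sequential(
                    nn.Conv2d(3, feat_dim, kernel_size=7, stride=4), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
                self.patch_head = nn.Linear(feat_dim, num_classes)  # per-patch classifier
                self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
                self.img_head = nn.Linear(feat_dim, num_classes)

            def forward(self, patches, labels=None, patch_weights=None):
                # patches: (B, P, 3, H, W); patch_weights: (B, P), e.g. derived from
                # clustering the patch features, higher for more discriminative patches.
                B, P = patches.shape[:2]
                feats = self.backbone(patches.flatten(0, 1)).view(B, P, -1)  # (B, P, D)
                patch_logits = self.patch_head(feats)                        # (B, P, C)
                if patch_weights is None:
                    patch_weights = torch.full((B, P), 1.0 / P, device=patches.device)
                # The LSTM encodes the weighted patch-feature sequence into one vector.
                _, (h, _) = self.lstm(feats * patch_weights.unsqueeze(-1))
                img_logits = self.img_head(h[-1])
                if labels is None:
                    return img_logits
                # Loss reweighting: every patch shares the image label, but each
                # patch-label pair contributes proportionally to its weight.
                patch_loss = F.cross_entropy(
                    patch_logits.flatten(0, 1), labels.repeat_interleave(P),
                    reduction="none").view(B, P)
                loss = (patch_weights * patch_loss).sum(1).mean() \
                    + F.cross_entropy(img_logits, labels)
                return img_logits, loss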
  • Haitao Zeng, Xinhang Song, Gongwei Chen, Shuqiang Jiang, Amorphous Region Context Modeling for Scene Recognition.

    Scene images are usually composed of foreground and background regional content. Some existing methods extract regional content with dense grids; such grids can split an object into several discrete parts, so the semantic meaning of each region patch becomes ambiguous. Objectness-based methods, in turn, may attend only to the foreground content of a scene image, leaving the background content and spatial structure incomplete. In contrast to these methods, this paper resolves the semantic ambiguity by detecting the boundaries of the regional content itself, using semantic segmentation to precisely localize the amorphous contours of each region. In addition, when building the scene representation we incorporate the complete foreground and background information of the image. A graph neural network models these regions and explores the contextual relations between them, yielding discriminative scene representations for scene recognition. Experimental results on MIT67 and SUN397 demonstrate the effectiveness and generality of the proposed method. (An illustrative code sketch of amorphous region pooling follows this entry.)

    Abstract

    Scene images are usually composed of foreground and background regional contents. Some existing methods propose to extract regional contents with dense grids or objectness region proposals. However, dense grids may split the object into several discrete parts, leading to semantic ambiguity for the patches. The objectness methods may focus on particular objects but only pay attention to the foreground contents and do not exploit the background that is key to scene recognition. In contrast, we propose a novel scene recognition framework with amorphous region detection and context modeling. In the proposed framework, discriminative regions are first detected with amorphous contours that can tightly surround the targets through semantic segmentation techniques. In addition, both foreground and background regions are jointly embedded to obtain the scene representations with the graph model. Based on the graph modeling module, we explore the contextual relations between the regions in geometric and morphology aspects, and generate discriminative representations for scene recognition. Experimental results on MIT67 and SUN397 demonstrate the effectiveness and generality of the proposed method.


    • Haitao Zeng, Xinhang Song, Gongwei Chen, Shuqiang Jiang. “Amorphous Region Context Modeling for Scene Recognition”, IEEE Transactions on Multimedia (TMM), 2020 (accepted December 7, 2020).


    IEEE Trans. Multimedia (2020, Accepted)
    [PDF]
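    A minimal sketch (PyTorch) of how amorphous region features could be pooled from a semantic segmentation map before graph modeling. The function name and shapes are placeholders for illustration only; the paper's full pipeline (region detection, graph construction and classification) is not reproduced here.

        import torch

        def pool_region_features(feature_map, seg_map):
            """feature_map: (C, H, W) CNN activations; seg_map: (H, W) integer region
            labels from a semantic segmentation model resized to the same spatial size.
            Returns one averaged feature per region, so each region keeps its amorphous
            contour instead of being cut by a rectangular grid cell."""
            regions = []
            for rid in seg_map.unique():
                mask = (seg_map == rid).float()                # (H, W), 1 inside region
                area = mask.sum().clamp(min=1.0)
                regions.append((feature_map * mask).sum(dim=(1, 2)) / area)
            return torch.stack(regions)                        # (num_regions, C)

        # Toy usage with random data; the region features would then become the nodes
        # of the graph model described in the abstract.
        feat = torch.randn(256, 14, 14)
        seg = torch.randint(0, 5, (14, 14))
        print(pool_region_features(feat, seg).shape)           # (n_regions, 256)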
  • Yanchao Zhang, Weiqing Min, Liqiang Nie, Shuqiang Jiang, Hybrid-Attention Enhanced Two-Stream Fusion Network for Video Venue Prediction.

    With the growth of social media platforms such as Facebook and Vine, more and more users enjoy sharing their daily lives on them, and the spread of mobile devices has produced massive amounts of multimedia data. For privacy reasons, users usually share such posts without geographic annotation, which limits the development of venue recognition and recommendation systems. With the continuing growth of multimedia data and the progress of artificial intelligence, especially deep learning, the task of video venue prediction has emerged: given a video as input, it predicts the venue where the video was taken, with broad application prospects in personalized restaurant recommendation, user privacy detection, and related areas. Building on the group's earlier work on video venue prediction ([Jiang2018-IEEE TMM]), this paper proposes a new network model, HA-TSFN. The model considers both global and local information and uses a global-local attention mechanism to capture scene and object information in videos, thereby enhancing the visual representation. Experiments and analysis are carried out on the large-scale video venue dataset Vine. (An illustrative code sketch of the global-local attention idea follows this entry.)

    • [Jiang2018-IEEE TMM] Shuqiang Jiang, Weiqing Min, Shuhuan Mei, “Hierarchy-dependent cross-platform multi-view feature learning for venue category prediction,” IEEE Transactions on Multimedia, vol. 21, no. 6, pp. 1609–1619, 2018

    Abstract

    Video venue category prediction has been drawing more attention in the multimedia community for various applications such as personalized location recommendation and video verification. Most existing works resort to the information from either multiple modalities or other platforms for strengthening video representations. However, noisy acoustic information, sparse textual descriptions and incompatible cross-platform data could limit the performance gain and reduce the universality of the model. Therefore, we focus on discriminative visual feature extraction from videos by introducing a hybrid-attention structure. Particularly, we propose a novel Global-Local Attention Module (GLAM), which can be inserted into neural networks to generate enhanced visual features from video content. In GLAM, the Global Attention (GA) is used to catch contextual scene-oriented information via assigning channels with various weights while the Local Attention (LA) is employed to learn salient object-oriented features via allocating different weights to spatial regions. Moreover, GLAM can be extended to one with multiple GAs and LAs for further visual enhancement. These two types of features respectively captured by GAs and LAs are integrated via convolution layers, and then delivered into a convolutional Long Short-Term Memory (convLSTM) to generate spatial-temporal representations, constituting the content stream. In addition, video motions are explored to learn long-term movement variations, which also contributes to video venue prediction. The content and motion streams constitute our proposed Hybrid-Attention Enhanced Two-Stream Fusion Network (HA-TSFN). HA-TSFN finally merges the features from the two streams for comprehensive representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the large-scale dataset Vine. The visualization also shows that the proposed GLAM can capture complementary scene-oriented and object-oriented visual features from videos.


    • Yanchao Zhang, Weiqing Min, Liqiang Nie, Shuqiang Jiang. “Hybrid-Attention Enhanced Two-Stream Fusion Network for Video Venue Prediction”, IEEE Transactions on Multimedia (TMM), 2020.


    IEEE Trans. Multimedia (2020, Accepted)
    [PDF]
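    A minimal sketch (PyTorch) of a global-local attention pair in the spirit of GLAM: a channel-reweighting branch for scene-oriented cues and a spatial-reweighting branch for object-oriented cues, fused by a 1x1 convolution. This is a simplified stand-in rather than the paper's exact module; the convLSTM and the motion stream are omitted.

        import torch
        import torch.nn as nn

        class GlobalAttention(nn.Module):       # channel reweighting: scene-oriented cues
            def __init__(self, channels, reduction=16):
                super().__init__()
                self.fc = nn.Sequential(
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(channels, channels // reduction), nn.ReLU(),
                    nn.Linear(channels // reduction, channels), nn.Sigmoid())

            def forward(self, x):                # x: (B, C, H, W)
                return x * self.fc(x).unsqueeze(-1).unsqueeze(-1)

        class LocalAttention(nn.Module):         # spatial reweighting: object-oriented cues
            def __init__(self, channels):
                super().__init__()
                self.conv = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

            def forward(self, x):
                return x * self.conv(x)          # (B, 1, H, W) map broadcast over channels

        class GlobalLocalBlock(nn.Module):
            def __init__(self, channels):
                super().__init__()
                self.ga, self.la = GlobalAttention(channels), LocalAttention(channels)
                self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

            def forward(self, x):                # fuse the two enhanced views by convolution
                return self.fuse(torch.cat([self.ga(x), self.la(x)], dim=1))

        x = torch.randn(2, 256, 14, 14)
        print(GlobalLocalBlock(256)(x).shape)    # torch.Size([2, 256, 14, 14])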
  • Yaohui Zhu, Weiqing Min, Shuqiang Jiang, Attribute-Guided Feature Learning for Few-Shot Image Recognition.

    We propose an attribute-guided two-layer learning framework that learns general feature representations. Attribute learning serves as an additional objective for few-shot image recognition within a multi-task learning framework: few-shot recognition is trained at the task level, attribute learning is performed at the image level, and the two share the same network. Moreover, guided by attribute learning, features from different layers correspond to attribute representations at different levels and support few-shot recognition from multiple aspects. We therefore establish an attribute-guided two-layer learning mechanism to capture more discriminative representations; compared with a single-layer mechanism, the two-layer mechanism yields complementary representations. The proposed framework is agnostic to the specific model: both metric-based few-shot methods and meta-learning methods can be plugged into it. (An illustrative code sketch of the shared-backbone multi-task setup follows this entry.)

    Abstract

    Few-shot image recognition has become an essential problem in the field of machine learning and image recognition, and has attracted more and more research attention. Typically, most few-shot image recognition methods are trained across tasks. However, these methods are apt to learn an embedding network for discriminative representations of training categories, and thus cannot distinguish novel categories well. To establish connections between training and novel categories, we use attribute-related representations for few-shot image recognition and propose an attribute-guided two-layer learning framework, which is capable of learning general feature representations. Specifically, few-shot image recognition trained over tasks and attribute learning trained over images share the same network in a multi-task learning framework. In this way, few-shot image recognition learns feature representations guided by attributes, and is thus less sensitive to novel categories compared with feature representations only using category supervision. Meanwhile, the multi-layer features associated with attributes are aligned with category learning on multiple levels respectively. Therefore we establish a two-layer learning mechanism guided by attributes to capture more discriminative representations, which are complementary compared with a single-layer learning mechanism. Experimental results on the CUB-200, AWA and Mini-ImageNet datasets demonstrate that our method effectively improves the performance.


    • Yaohui Zhu, Weiqing Min, Shuqiang Jiang. “Attribute-Guided Feature Learning for Few-Shot Image Recognition”, IEEE Transactions on Multimedia (TMM), 2020.


    IEEE Trans. Multimedia (2020, Accepted)
    [PDF]
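    A minimal sketch (PyTorch) of sharing one backbone between a task-level few-shot objective and an image-level attribute objective. A prototype-style branch stands in for whichever metric-based or meta-learning method is plugged into the framework; the attribute count and all shapes are hypothetical.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        attr_head = nn.Linear(64, 20)      # 20 binary attributes (hypothetical number)

        def episode_loss(support, support_y, query, query_y, n_way):
            """Prototype-style few-shot loss: classify queries by distance to class means."""
            s, q = backbone(support), backbone(query)
            protos = torch.stack([s[support_y == c].mean(0) for c in range(n_way)])
            logits = -torch.cdist(q, protos)             # closer prototype -> higher score
            return F.cross_entropy(logits, query_y)

        def attribute_loss(images, attrs):
            """Image-level multi-label attribute prediction on the same backbone."""
            return F.binary_cross_entropy_with_logits(attr_head(backbone(images)), attrs)

        # One joint step: both objectives update the shared representation.
        support, query = torch.randn(10, 3, 32, 32), torch.randn(15, 3, 32, 32)
        support_y, query_y = torch.arange(5).repeat_interleave(2), torch.randint(0, 5, (15,))
        images, attrs = torch.randn(8, 3, 32, 32), torch.randint(0, 2, (8, 20)).float()
        loss = episode_loss(support, support_y, query, query_y, 5) + attribute_loss(images, attrs)
        loss.backward()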
  • Shuqiang Jiang, Weiqing Min, Yongqiang Lyu, Linhu Liu, Few-Shot Food Recognition via Multi-View Representation.

    Food categories are highly diverse, and food datasets collected from the real world follow a typical long-tailed distribution, in which many uncommon food categories provide only a few samples. Compared with few-shot recognition of general images, few-shot food recognition is therefore of greater practical significance. Building on the group's earlier work on food computing (Food Computing: [Min2019-ACM CSUR]) and food recognition (Food Recognition: [Jiang2020-IEEE TIP][Min2019-ACMMM][Xu2015-IEEE TMM]), this paper studies few-shot food image recognition, proposes a multi-view representation that fuses ingredient and category information, and validates it experimentally on several datasets. (An illustrative code sketch of the two-view feature fusion follows this entry.)

    • [Min2019-ACM CSUR] Weiqing Min, Shuqiang Jiang, Linhu Liu, Yong Rui, Ramesh Jain, A Survey on Food Computing. ACM Computing Surveys, 52(5), 92:1-92:36, 2019

    • [Jiang2020-IEEE TIP] Shuqiang Jiang, Weiqing Min, Linhu Liu, Zhengdong Luo, Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition. IEEE Trans. Image Processing, vol.29, pp.265-276, 2020

    • [Min2019-ACMMM] Weiqing Min, Linhu Liu, Zhengdong Luo, Shuqiang Jiang, Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition. (ACM Multimedia 2019), 21-25 October 2019, Nice, France

    • [Xu2015-IEEE TMM] Ruihan Xu, Luis Herranz, Shuqiang Jiang, Shuang Wang, Xinhang Song, Ramesh Jain, Geolocalized Modeling for Dish Recognition. IEEE Trans. Multimedia 17(8): 1187-1199, 2015

    Abstract

    This paper considers the problem of few-shot learning for food recognition. Automatic food recognition can support various applications, e.g., dietary assessment and food journaling. Most existing works focus on food recognition with large numbers of labelled samples, and fail to recognize food categories with few samples. To address this problem, we propose a Multi-View Few-Shot Learning (MVFSL) framework to explore additional ingredient information for few-shot food recognition. Besides category-oriented deep visual features, we introduce an ingredient-supervised deep network to extract ingredient-oriented features. As general and intermediate attributes of food, ingredient-oriented features are informative and complementary to category-oriented features, and thus play an important role in improving food recognition. Particularly in few-shot food recognition, ingredient information can bridge the gap between disjoint training categories and test categories. In order to take advantage of ingredient information, we fuse these two kinds of features by first combining their feature maps from their respective deep networks, and then convolving the combined feature maps. Such convolution is further incorporated into a multi-view relation network, which is capable of comparing pairwise images to enable fine-grained feature learning. MVFSL is trained in an end-to-end fashion for joint optimization over the two feature learning subnetworks and the relation subnetwork. Extensive experiments on different food datasets have consistently demonstrated the advantage of MVFSL in multi-view feature fusion. Furthermore, we extend another two types of networks, namely the Siamese Network and the Matching Network, by introducing ingredient information for few-shot food recognition. Experimental results have also shown that introducing ingredient information into these two networks can improve the performance of few-shot food recognition.


    • Shuqiang Jiang, Weiqing Min, Yongqiang Lyu, Linhu Liu. Few-Shot Food Recognition via Multi-View Representation. ACM Transactions on Multimedia Computing, Communications and Applications (2020, Accepted)


    ACM Transactions on Multimedia Computing, Communications and Applications (2020, Accepted)
    [PDF]
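    A minimal sketch (PyTorch) of the two-view fusion idea: feature maps from a class-supervised branch and an ingredient-supervised branch are concatenated and convolved, and a small relation module scores an image pair. Module names and dimensions are illustrative assumptions, not the released MVFSL code.

        import torch
        import torch.nn as nn

        class TwoViewFusion(nn.Module):
            def __init__(self, c_cat=512, c_ing=512, c_out=256):
                super().__init__()
                # Concatenate the two feature maps and convolve them into a fused vector.
                self.fuse = nn.Sequential(nn.Conv2d(c_cat + c_ing, c_out, 3, padding=1),
                                          nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
                # Relation module: compare a query with a support example in fused space.
                self.relation = nn.Sequential(nn.Linear(2 * c_out, 128), nn.ReLU(),
                                              nn.Linear(128, 1), nn.Sigmoid())

            def forward(self, cat_a, ing_a, cat_b, ing_b):
                fa = self.fuse(torch.cat([cat_a, ing_a], dim=1))
                fb = self.fuse(torch.cat([cat_b, ing_b], dim=1))
                return self.relation(torch.cat([fa, fb], dim=1))   # similarity in [0, 1]

        # The maps would come from a class-supervised CNN and an ingredient-supervised
        # CNN respectively; random tensors stand in here.
        a_cat, a_ing = torch.randn(4, 512, 7, 7), torch.randn(4, 512, 7, 7)
        b_cat, b_ing = torch.randn(4, 512, 7, 7), torch.randn(4, 512, 7, 7)
        print(TwoViewFusion()(a_cat, a_ing, b_cat, b_ing).shape)   # torch.Size([4, 1])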
  • Gongwei Chen, Xinhang Song, Haitao Zeng, Shuqiang Jiang, Scene Recognition with Prototype-agnostic Scene Layout.

    This paper explores building an adaptive, prototype-agnostic scene layout for each scene image and uses graph modeling to fuse the spatial structure information in the layout, improving scene classification performance. (An illustrative code sketch of graph modeling over region nodes follows this entry.)

    Abstract

    Exploiting the spatial structure in scene images is a key research direction for scene recognition. Due to the large intra-class structural diversity, building and modeling a flexible structural layout to adapt to various image characteristics is a challenge. Existing structural modeling methods in scene recognition either focus on predefined grids or rely on learned prototypes, which all have limited representative ability. In this paper, we propose the Prototype-agnostic Scene Layout (PaSL) construction method to build the spatial structure for each image without conforming to any prototype. Our PaSL can flexibly capture the diverse spatial characteristics of scene images and has considerable generalization capability. Given a PaSL, we build a Layout Graph Network (LGN) where regions in PaSL are defined as nodes and two kinds of independent relations between regions are encoded as edges. The LGN aims to incorporate two topological structures (formed in spatial and semantic similarity dimensions) into image representations through graph convolution. Extensive experiments show that our approach achieves state-of-the-art results on the widely recognized MIT67 and SUN397 datasets without multi-model or multi-scale fusion. Moreover, we also conduct experiments on one of the largest scale datasets, Places365. The results demonstrate that the proposed method can be well generalized and obtains competitive performance.

    • G. Chen, X. Song, H. Zeng and S. Jiang, "Scene Recognition With Prototype-Agnostic Scene Layout," in IEEE Transactions on Image Processing, vol. 29, pp. 5877-5888, 2020, doi: 10.1109/TIP.2020.2986599.

    IEEE Trans. Image Processing, vol.29, pp.5877-5888, 2020
    [PDF]
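    A minimal sketch (PyTorch) of one graph-convolution step over region nodes, with separate adjacencies for spatial proximity and semantic similarity, in the spirit of the Layout Graph Network. Thresholds, dimensions and the pooling step are illustrative assumptions.

        import torch
        import torch.nn.functional as F

        def graph_conv(node_feats, adj, weight):
            """One propagation step: row-normalize the adjacency, mix neighbor
            features, then apply a shared linear transform. node_feats: (N, D)."""
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
            return F.relu((adj / deg) @ node_feats @ weight)

        # Region nodes with features and box centers (e.g., from the PaSL construction).
        n_regions, dim = 6, 128
        feats = torch.randn(n_regions, dim)
        centers = torch.rand(n_regions, 2)

        # Two independent relations become two adjacency matrices: spatial proximity
        # and pairwise feature (semantic) similarity.
        spatial_adj = (torch.cdist(centers, centers) < 0.5).float()
        semantic_adj = torch.softmax(
            F.normalize(feats, dim=1) @ F.normalize(feats, dim=1).T, dim=1)

        w1, w2 = torch.randn(dim, dim) * 0.01, torch.randn(dim, dim) * 0.01
        out = graph_conv(feats, spatial_adj, w1) + graph_conv(feats, semantic_adj, w2)
        image_repr = out.mean(dim=0)        # pooled into one scene representation
        print(image_repr.shape)             # torch.Size([128])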
  • Weiqing Min, Shuhuan Mei, Zhuo Li, Shuqiang Jiang, A Two-Stage Triplet Network Training Framework for Image Retrieval.

    Compared with traditional object retrieval, instance-level image retrieval faces a number of difficulties: large variation within the same category (e.g., illumination, rotation, occlusion, cropping), small differences between categories (e.g., a Coca-Cola bottle versus a Sprite bottle), a large amount of distracting information within images (such as background), and large numbers of unannotated distractor images. Recent progress shows that convolutional neural networks (CNNs) provide image representations superior to traditional methods. However, the features a CNN extracts from the whole image contain a great deal of distracting information, which keeps retrieval performance below expectations. To address this, building on the instance-level image retrieval benchmark constructed earlier by the group (INSTRE: [Wang2015-ACM TOMM]), this paper proposes a two-stage instance-level image retrieval framework. Experiments on INSTRE and several other instance-level retrieval datasets demonstrate the effectiveness of the proposed framework. (An illustrative code sketch of generalized-mean regional pooling follows this entry.)

    • [Wang2015-ACM TOMM] Shuang Wang, Shuqiang Jiang. INSTRE: A New Benchmark for Instance-Level Object Retrieval and Recognition. TOMM 11(3): 37:1-37:21 (2015)

    Abstract

    In this paper, we propose a novel framework for instance-level image retrieval. Recent methods focus on fine-tuning the Convolutional Neural Network (CNN) via a Siamese architecture to improve off-the-shelf CNN features. They generally use the ranking loss to train such networks, and do not take full advantage of supervised information for better network training, especially with more complex neural architectures. To solve this, we propose a two-stage triplet network training framework, which mainly consists of two stages. First, we propose a Double-Loss Regularized Triplet Network (DLRTN), which extends the basic triplet network by attaching a classification sub-network, and is trained via simultaneously optimizing two different types of loss functions. The double-loss functions of DLRTN aim at the specific retrieval task and can jointly boost the discriminative capability of DLRTN from different aspects via supervised learning. Second, considering feature maps of the last convolution layer extracted from DLRTN and regions detected from the region proposal network as the input, we then introduce the Regional Generalized-Mean Pooling (RGMP) layer for the triplet network, and re-train this network to learn pooling parameters. Through RGMP, we pool feature maps for each region and aggregate features of different regions from each image into Regional Generalized Activations of Convolutions (R-GAC) as the final image representation. R-GAC is capable of generalizing existing Regional Maximum Activations of Convolutions (R-MAC) and is thus more robust to scale and translation. We conduct experiments on six image retrieval datasets including standard benchmarks and the recently introduced INSTRE dataset. Extensive experimental results demonstrate the effectiveness of the proposed framework.

    • Weiqing Min, Shuhuan Mei, Zhuo Li, Shuqiang Jiang. A Two-Stage Triplet Network Training Framework for Image Retrieval. IEEE Transactions on Multimedia (2020, Accepted)


    IEEE Trans. Multimedia (2020, Accepted)
    [PDF]
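    A minimal sketch (PyTorch) of generalized-mean (GeM) pooling applied per region and aggregated into one descriptor, loosely following the R-MAC/R-GAC recipe mentioned in the abstract. The aggregation and normalization details are simplified assumptions.

        import torch

        def gem_pool(feature_map, p=3.0, eps=1e-6):
            """feature_map: (C, H, W). GeM interpolates between average pooling (p=1)
            and max pooling (p -> infinity) with an exponent p that can be learned."""
            return feature_map.clamp(min=eps).pow(p).mean(dim=(1, 2)).pow(1.0 / p)

        def regional_gem(feature_map, boxes, p=3.0):
            """boxes: list of (x1, y1, x2, y2) in feature-map coordinates, e.g. from a
            region proposal network. Pools each region with GeM, then sums and
            L2-normalizes, loosely following the R-MAC / R-GAC aggregation."""
            pooled = [gem_pool(feature_map[:, y1:y2, x1:x2], p) for x1, y1, x2, y2 in boxes]
            agg = torch.stack(pooled).sum(dim=0)
            return agg / agg.norm().clamp(min=1e-6)

        fmap = torch.randn(512, 32, 32).abs()
        boxes = [(0, 0, 16, 16), (8, 8, 32, 32), (0, 0, 32, 32)]
        print(regional_gem(fmap, boxes).shape)    # torch.Size([512])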
  • Weiqing Min, Shuqiang Jiang, and Ramesh Jain, Food Recommendation: Framework, Existing Solutions and Challenges.

    Abstract

    A growing proportion of the global population is becoming overweight or obese, leading to various diseases (e.g., diabetes, ischemic heart disease and even cancer) due to unhealthy eating patterns, such as increased intake of food with high energy and high fat. Food recommendation is of paramount importance to alleviate this problem. Unfortunately, modern multimedia research has enhanced the performance and experience of multimedia recommendation in many fields such as movies and POI, yet largely lags in the food domain. This article proposes a unified framework for food recommendation, and identifies main issues affecting food recommendation including incorporating various context and domain knowledge, building the personal model, and analyzing unique food characteristics. We then review existing solutions for these issues, and finally elaborate research challenges and future directions in this field. To our knowledge, this is the first survey that targets the study of food recommendation in the multimedia field and offers a collection of research studies and technologies to benefit researchers in this field.

    Weiqing Min, Shuqiang Jiang, and Ramesh Jain. Food Recommendation: Framework, Existing Solutions and Challenges. IEEE Transactions on Multimedia 2019 (Accepted)


    IEEE Trans. Multimedia (2020, Accepted)
    [PDF]
  • Haitao Zeng, Xinhang Song, Gongwei Chen, and Shuqiang Jiang, Learning Scene Attribute for Scene Recognition.

    Abstract

    Scene recognition has been a challenging task in the field of computer vision and multimedia for a long time. Current scene recognition works often extract object features and scene features through CNNs, and combine these two types of features to obtain complementary and discriminative scene representations. However, when the scene categories are visually similar, the object features might lack discrimination. Therefore, it may be debatable to consider only object features. In contrast to existing works, in this paper we discuss the discrimination of scene attributes in local regions and utilize scene attributes as features complementary to object and scene features. We extract these visual features from two individual CNN branches, one extracting the global features of the image while the other extracts the features of local regions. Through a contextual modeling framework, we aggregate these features and generate more discriminative scene representations, which achieve better performance than the feature aggregation of object and scene. Moreover, we achieve new state-of-the-art performance on both standard scene recognition benchmarks by aggregating more complementary visual features: MIT67 (88.06%) and SUN397 (74.12%).

    • Haitao Zeng, Xinhang Song, Gongwei Chen, Shuqiang Jiang. “Learning Scene Attribute for Scene Recognition”, IEEE Transactions on Multimedia, vol.22, no.6, pp.1519-1530, June 2020


    IEEE Trans. Multimedia, vol.22, no.6, pp.1519-1530, June 2020
    [PDF]
  • Weiqing Min, Shuhuan Mei, Linhu Liu, Yi Wang, and Shuqiang Jiang, Multi-Task Deep Relative Attribute Learning for Visual Urban Perception.

    Abstract

    Visual urban perception aims to quantify perceptual attributes (e.g., safe and depressing attributes) of physical urban environment from crowd-sourced street-view images and their pairwise comparisons. It has been receiving more and more attention in computer vision for various applications, such as perceptive attribute learning and urban scene understanding. Most existing methods adopt either (i) a regression model trained using image features and ranked scores converted from pairwise comparisons for perceptual attribute prediction or (ii) a pairwise ranking algorithm to independently learn each perceptual attribute. However, the former fails to directly exploit pairwise comparisons while the latter ignores the relationship among different attributes. To address them, we propose a Multi-Task Deep Relative Attribute Learning Network (MTDRALN) to learn all the relative attributes simultaneously via multi-task Siamese networks, where each Siamese network will predict one relative attribute. Combined with deep relative attribute learning, we utilize the structured sparsity to exploit the prior from natural attribute grouping, where all the attributes are divided into different groups based on semantic relatedness in advance. As a result, MTDRALN is capable of learning all the perceptual attributes simultaneously via multi-task learning. Besides the ranking sub-network, MTDRALN further introduces the classification sub-network, and these two types of losses from two sub-networks jointly constrain parameters of the deep network to make the network learn more discriminative visual features for relative attribute learning. In addition, our network can be trained in an end-to-end way to make deep feature learning and multi-task relative attribute learning reinforce each other. Extensive experiments on the large-scale Place Pulse 2.0 dataset validate the advantage of our proposed network. Our qualitative results along with visualization of saliency maps also show that the proposed network is able to learn effective features for perceptual attributes.


    Weiqing Min, Shuhuan Mei, Linhu Liu, Yi Wang, and Shuqiang Jiang. Multi-Task Deep Relative Attribute Learning for Visual Urban Perception. IEEE Transactions on Image Processing, vol.29, pp.657-669, 2020


    IEEE Trans. Image Processing, vol.29, pp.657-669, 2020
    [PDF]
  • Xinhang Song, Shuqiang Jiang, Bohan Wang, Chengpeng Chen, Gongwei Chen, Image Representations with Spatial Object-to-Object Relations for RGB-D Scene Recognition.

    Abstract

    Scene recognition is challenging due to the intra-class diversity and inter-class similarity. Previous works recognize scenes either with global representations or with the intermediate representations of objects. In contrast, we investigate more discriminative image representations of object-to-object relations for scene recognition, which are based on the triplets of <object, relation, object> obtained with detection techniques. Particularly, two types of representations, including the co-occurring frequency of object-to-object relations (denoted as COOR) and the sequential representation of object-to-object relations (denoted as SOOR), are proposed to describe objects and their relative relations in different forms. COOR is represented as the intermediate representation of the co-occurring frequency of objects and their relations, with a third-order tensor that can be fed to the scene classifier without further embedding. SOOR is represented in a more explicit and freer form that sequentially describes image contents with local captions, and a sequence encoding model (e.g., a recurrent neural network (RNN)) is implemented to encode SOOR into features for feeding the classifiers. In order to better capture the spatial information, the proposed COOR and SOOR are adapted to RGB-D data, where an RGB-D proposal fusion method is proposed for RGB-D object detection. With the proposed approaches COOR and SOOR, we obtain state-of-the-art results of RGB-D scene recognition on the SUN RGB-D and NYUD2 datasets.


    • Xinhang Song, Shuqiang Jiang, Bohan Wang, Chengpeng Chen, Gongwei Chen. Image Representations with Spatial Object-to-Object Relations for RGB-D Scene Recognition. IEEE Transactions on Image Processing, vol.29, pp.525-537, 2020


    IEEE Trans. Image Processing, vol.29, pp.525-537, 2020
    [PDF]
  • Shuqiang Jiang, Weiqing Min, Linhu Liu, Zhengdong Luo, Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition.

    Abstract

    Recently, food recognition has received more and more attention in image processing and computer vision for its great potential applications in human health. Most existing methods directly extract deep visual features via Convolutional Neural Networks (CNNs) for food recognition. Such methods ignore the characteristics of food images and thus can hardly achieve optimal recognition performance. In contrast to general object recognition, food images typically do not exhibit distinctive spatial arrangement and common semantic patterns. In this paper, we propose a Multi-Scale Multi-View Feature Aggregation (MSMVFA) scheme for food recognition. MSMVFA can aggregate high-level semantic features, mid-level attribute features and deep visual features into a unified representation. These three types of features describe the food image at different granularities. Therefore, the aggregated features can capture the semantics of food images with the greatest probability. For that solution, we utilize additional ingredient knowledge to obtain mid-level attribute representations via ingredient-supervised CNNs. High-level semantic features and deep visual features are extracted from class-supervised CNNs. Considering food images do not exhibit a distinctive spatial layout in many cases, MSMVFA fuses multi-scale CNN activations for each type of feature to make the aggregated features more discriminative and invariant to geometric deformation. Finally, the aggregated features are more robust, comprehensive and discriminative via the two-level fusion, namely multi-scale fusion for each type of feature and multi-view aggregation across different types of features. In addition, MSMVFA is general and different deep networks can be easily applied in this scheme. Extensive experiments and evaluations demonstrate that our method achieves state-of-the-art recognition performance on three popular large-scale food benchmark datasets in Top-1 recognition accuracy. Furthermore, we expect this work will further the agenda of food recognition in the community of image processing and computer vision.


    • Shuqiang Jiang, Weiqing Min, Linhu Liu, Zhengdong Luo. Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition. IEEE Transactions on Image Processing, vol.29, pp.265-276, 2020

    IEEE Trans. Image Processing, vol.29, pp.265-276, 2020
    [PDF]
  • Weiqing Min, Shuqiang Jiang, Linhu Liu, Yong Rui, Ramesh Jain, A Survey on Food Computing.

    Abstract

    Food is essential to human life and fundamental to the human experience. Food-related study may support multifarious applications and services, such as guiding human behavior, improving human health and understanding the culinary culture. With the rapid development of social networks, mobile networks, and the Internet of Things (IoT), people commonly upload, share, and record food images, recipes, cooking videos, and food diaries, leading to large-scale food data. Large-scale food data offers rich knowledge about food and can help tackle many central issues of human society. Therefore, it is time to group several disparate issues related to food computing. Food computing acquires and analyzes heterogeneous food data from different sources for perception, recognition, retrieval, recommendation, and monitoring of food. In food computing, computational approaches are applied to address food-related issues in medicine, biology, gastronomy and agronomy. Both large-scale food data and recent breakthroughs in computer science are transforming the way we analyze food data. Therefore, a series of works have been conducted in the food area, targeting different food-oriented tasks and applications. However, there are very few systematic reviews that shape this area well and provide a comprehensive and in-depth summary of current efforts or detail open problems in this area. In this paper, we formalize food computing and present such a comprehensive overview of various emerging concepts, methods, and tasks. We summarize key challenges and future directions ahead for food computing. This is the first comprehensive survey that targets the study of computing technology for the food area and also offers a collection of research studies and technologies to benefit researchers and practitioners working in different food-related fields.

    Weiqing Min, Shuqiang Jiang, Linhu Liu, Yong Rui and Ramesh Jain, "A Survey on Food Computing," ACM Comput. Surv. 52, 5, Article 92 (September 2019), 36 pages


    ACM Comput. Surv. 52, 5, Article 92 (September 2019), 36 pages
    [PDF]
  • Xiangyang Li, Luis Herranz, Shuqiang Jiang, Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition.

    Abstract

    In recent years, convolutional neural networks (CNNs) have achieved impressive performance in various visual recognition scenarios. CNNs trained on large labeled datasets can not only obtain significant performance on most challenging benchmarks but also provide powerful representations, which can be applied to a wide range of other tasks. However, the requirement of massive amounts of data to train deep neural networks is a major drawback of these models, as the available data is usually limited or imbalanced. Fine-tuning (FT) is an effective way to transfer knowledge learned in a source dataset to a target task. In this paper, we introduce and systematically investigate several factors that influence the performance of fine-tuning for visual recognition. These factors include parameters of the retraining procedure (e.g., the initial learning rate of fine-tuning), the distribution of the source and target data (e.g., the number of categories in the source dataset, the distance between the source and target datasets) and so on. We quantitatively and qualitatively analyze these factors, evaluate their influence, and present many empirical observations. The results reveal insights into how fine-tuning changes CNN parameters and provide useful, evidence-backed intuitions about how to implement fine-tuning for computer vision tasks. (A minimal fine-tuning sketch follows this entry.)

    Xiangyang Li, Luis Herranz, Shuqiang Jiang. “Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition”, ACM Transactions on Data Science, 2020, 1(1): 1-22.


    ACM Transactions on Data Science, 2020, 1(1): 1-22.
    [PDF]
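    A minimal sketch (PyTorch/torchvision, assuming a recent torchvision) of the fine-tuning setup the paper analyzes: a source-pretrained CNN with a replaced classifier head and an explicitly chosen initial learning rate, here smaller for the pretrained layers than for the new head. The concrete numbers are illustrative, not the paper's settings.

        import torch
        import torch.nn as nn
        from torchvision import models

        num_target_classes = 67               # hypothetical target task size
        model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, num_target_classes)  # new task head

        # Per-group initial learning rates: one of the factors the paper studies.
        optimizer = torch.optim.SGD([
            {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
             "lr": 1e-3},                     # pretrained backbone: smaller initial LR
            {"params": model.fc.parameters(), "lr": 1e-2},              # new head: larger LR
        ], momentum=0.9, weight_decay=1e-4)

        criterion = nn.CrossEntropyLoss()
        images = torch.randn(8, 3, 224, 224)  # stand-ins for a target-domain batch
        labels = torch.randint(0, num_target_classes, (8,))
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()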
  • Xiangyang Li, Shuqiang Jiang, Know More Say Less: Image Captioning Based on Scene Graphs.

    Abstract

    Automatically describing the content of an image has been attracting considerable research attention in the multimedia field. To represent the content of an image, many approaches directly utilize convolutional neural networks (CNNs) to extract visual representations, which are fed into recurrent neural networks (RNNs) to generate natural language. Recently, some approaches have detected semantic concepts from images and then encoded them into high-level representations. Although substantial progress has been achieved, most of the previous methods treat entities in images individually, thus lacking structured information that provides important cues for image captioning. In this work, we propose a framework based on scene graphs for image captioning. Scene graphs contain abundant structured information because they not only depict object entities in images but also present pairwise relationships. To leverage both visual features and semantic knowledge in structured scene graphs, we extract CNN features from the bounding box offsets of object entities for visual representations, and extract semantic relationship features from triples (e.g., man riding bike) for semantic representations. After obtaining these features, we introduce a hierarchical-attention-based module to learn discriminative features for word generation at each time step. The experimental results on benchmark datasets demonstrate the superiority of our method compared with several state-of-the-art methods.

    • Xiangyang Li, Shuqiang Jiang. “Know More Say Less: Image Captioning Based on Scene Graphs”, IEEE Transactions on Multimedia, vol.21, no.8, pp.2117-2130, Aug.2019


    IEEE Transactions on Multimedia, vol.21, no.8, pp.2117-2130, Aug.2019
    [PDF]
  • Shuqiang Jiang, Sisi Liang, Chengpeng Chen, Yaohui Zhu, Xiangyang Li, Class Agnostic Image Common Object Detection.

    Abstract

    Learning the similarity of two images is an important problem in computer vision and has many potential applications. Most previous works focus on generating image similarities in three aspects: global feature distance computing, local feature matching and image concept comparison. However, the task of directly detecting class-agnostic common objects from two images has not been studied before, which goes one step further to capture image similarities at the region level. In this paper, we propose an end-to-end Image Common Object Detection Network (CODN) to detect class-agnostic common objects from two images. The proposed method consists of two main modules: a locating module and a matching module. The locating module generates candidate proposals of each of the two images. The matching module learns the similarities of the candidate proposal pairs from the two images, and refines the bounding boxes of the candidate proposals. The learning procedure of CODN is implemented in an integrated way and a multi-task loss is designed to guarantee both region localization and common object matching. Experiments are conducted on the PASCAL VOC 2007 and COCO 2014 datasets. Experimental results validate the effectiveness of the proposed method.

    • Shuqiang Jiang, Sisi Liang, Chengpeng Chen, Yaohui Zhu, Xiangyang Li, Class Agnostic Image Common Object Detection. IEEE Trans. Image Processing 28(6):2836-2846(2019)


    IEEE Trans. Image Processing 28(6):2836-2846(2019)
    [PDF]
  • Shuqiang Jiang, Weiqing Min, Shuhuan Mei, Hierarchy-Dependent Cross-Platform Multi-View Feature Learning for Venue Category Prediction.

    Abstract

    In this work, we focus on visual venue category prediction, which can facilitate various applications for location-based services and personalization. Considering the complementarity of different media platforms, it is reasonable to leverage venue-relevant media data from different platforms to boost the prediction performance. Intuitively, recognizing one venue category involves multiple semantic cues, especially objects and scenes, and thus they should contribute together to venue category prediction. In addition, these venues can be organized in a natural hierarchical structure, which provides prior knowledge to guide venue category estimation. Taking these aspects into account, we propose a Hierarchy-dependent Cross-platform Multi-view Feature Learning (HCM-FL) framework for venue category prediction from videos by leveraging images from other platforms. HCM-FL includes two major components, namely Cross-Platform Transfer Deep Learning (CPTDL) and Multi-View Feature Learning with the Hierarchical Venue Structure (MVFL-HVS). CPTDL is capable of reinforcing the deep network learned from videos using images from other platforms. Specifically, CPTDL first trains a deep network using videos. Images from other platforms are filtered by the learned network, and the selected images are then fed back into this network to enhance it. Two kinds of pre-trained networks, on the ImageNet and Places datasets, are employed. Therefore, we can harness both object-oriented and scene-oriented deep features through these enhanced deep networks. MVFL-HVS is then developed to enable multi-view feature fusion. It is capable of embedding the hierarchical structure ontology to support more discriminative joint feature learning. We conduct experiments on videos from Vine and images from Foursquare. The experimental results demonstrate the advantage of our proposed framework in jointly utilizing multi-platform data, multi-view deep features and hierarchical venue structure knowledge.


    • Shuqiang Jiang, Weiqing Min, and Shuhuan Mei. Hierarchy-Dependent Cross-Platform Multi-View Feature Learning for Venue Category Prediction. IEEE Trans. Multimedia 21(6):1609-1619(2019)


    IEEE Trans. Multimedia 21(6):1609-1619(2019)
    [PDF]
  • Xinhang Song, Shuqiang Jiang, Luis Herranz, Chengpeng Chen, Learning Effective RGB-D Representations for Scene Recognition.

    Abstract

    Deep convolutional networks (CNN) can achieve impressive results on RGB scene recognition thanks to large datasets such as Places. In contrast, RGB-D scene recognition is still underdeveloped in comparison, due to two limitations of RGB-D data we address in this paper. The first limitation is the lack of depth data for training deep learning models. Rather than fine tuning or transferring RGB-specific features, we address this limitation by proposing an architecture and a two-step training approach that directly learns effective depth-specific features using weak supervision via patches. The resulting RGB-D model also benefits from more complementary multimodal features. Another limitation is the short range of depth sensors (typically 0.5m to 5.5m), resulting in depth images not capturing distant objects in the scenes that RGB images can. We show that this limitation can be addressed by using RGB-D videos, where more comprehensive depth information is accumulated as the camera travels across the scenes. Focusing on this scenario, we introduce the ISIA RGB-D video dataset to evaluate RGB-D scene recognition with videos. Our video recognition architecture combines convolutional and recurrent neural networks (RNNs) that are trained in three steps with increasingly complex data to learn effective features (i.e. patches, frames and sequences). Our approach obtains state-of-the-art performances on RGB-D image (NYUD2 and SUN RGB-D) and video (ISIA RGB-D) scene recognition.

    Xinhang Song, Shuqiang Jiang, Luis Herranz, Chengpeng Chen: Learning Effective RGB-D Representations for Scene Recognition. IEEE Trans. Image Processing, Vol.28, No.1, 2019, pp. 980-993


    IEEE Trans. Image Processing, Vol.28, No.1, 2019, pp. 980-993
    [PDF]
  • Shuqiang Jiang, Gongwei Chen, Xinhang Song, Linhu Liu, Deep Patch Representations with Shared Codebook for Scene Classification.

    Abstract

    Scene classification is a challenging problem. Compared with object images, scene images are more abstract, as they are composed of objects. Object and scene images have different characteristics with different scales and composition structures. How to effectively integrate local mid-level semantic representations including both object and scene concepts needs to be investigated, which is an important aspect of scene classification. In this paper, the idea of a shared codebook is introduced by organically integrating deep learning, concept features and local feature encoding techniques. More specifically, the shared local feature codebook is generated from the combined ImageNet1K and Places365 concepts (Mixed1365), using convolutional neural networks. As the Mixed1365 features cover all the semantic information including both object and scene concepts, we can extract a shared codebook from the Mixed1365 features which only contains a subset of the whole 1365 concepts with the same codebook size. The shared codebook can not only provide complementary representations without additional codebook training, but can also be adaptively extracted for different scene classification tasks. A method of fusing the encoded features with both the original codebook and the shared codebook is proposed for scene classification. In this way, more comprehensive and representative image features can be generated for classification. Extensive experiments conducted on two public datasets validate the effectiveness of the proposed method. Besides, some useful observations are also revealed to show the advantage of the shared codebook.

    • Shuqiang Jiang, Gongwei Chen, Xinhang Song, and Linhu Liu. Deep Patch Representations with Shared Codebook for Scene Classification. ACM Trans. Multimedia Comput. Commun. Appl. 15(1s): 5:1-5:17 (2019)


    TOMCCAP 15(1s): 5:1-5:17 (2019)
    [PDF]
  • Xiangyang Li, Shuqiang Jiang, Bundled Object Context for Referring Expressions.

    Abstract

    Referring expressions are natural language descriptions of objects within a given scene. Context is of crucial importance for a referring expression, as the description not only depicts the properties of the object but also involves the relationships of the referred object with other ones. Most previous work uses either the whole image or one particular contextual object as the context. However, the context of these approaches is holistic and insufficient, as a referring expression often describes relationships of multiple objects in an image. To leverage rich context information from all objects in an image, in this work we propose a novel scheme which is composed of a visual context Long Short-Term Memory (LSTM) module and a sentence LSTM module to model bundled object context for referring expressions. All contextual objects are arranged by their spatial locations and progressively fed into the visual context LSTM module to acquire and aggregate the context features. The concatenation of the learned context features and the features of the referred object is then put into the sentence LSTM module to learn the probability of a referring expression. The feedback connections and internal gating mechanism of the LSTM cells enable our model to selectively propagate relevant contextual information through the whole network. Experiments on three benchmark datasets show our method can achieve promising results compared to state-of-the-art methods. Moreover, visualization of the internal states of the visual context LSTM cells also shows that our method can automatically select the pertinent context objects. (An illustrative code sketch of the bundled-context aggregation follows this entry.)

    Xiangyang Li, Shuqiang Jiang, "Bundled Object Context for Referring Expressions", IEEE Transactions on Multimedia, vol.20, no.10, pp.2749-2760, 2018.


    IEEE Transactions on Multimedia, vol.20, no.10, pp.2749-2760, 2018.
    [PDF]
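    A minimal sketch (PyTorch) of the bundled-context idea: contextual object features, ordered by spatial location, are aggregated by an LSTM and concatenated with the referred object's feature. The scoring head is a stand-in for the paper's sentence LSTM module; all shapes are hypothetical.

        import torch
        import torch.nn as nn

        class BundledContext(nn.Module):
            def __init__(self, feat_dim=512, hidden=512):
                super().__init__()
                self.context_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
                self.score = nn.Linear(feat_dim + hidden, 1)  # stand-in scoring head

            def forward(self, referred_feat, context_feats):
                # context_feats: (B, N, D) features of the other objects, ordered by
                # their spatial locations; referred_feat: (B, D) referred object feature.
                _, (h, _) = self.context_lstm(context_feats)  # aggregate context sequentially
                joint = torch.cat([referred_feat, h[-1]], dim=1)
                return self.score(joint)      # the paper feeds this to a sentence LSTM

        ref = torch.randn(2, 512)
        ctx = torch.randn(2, 7, 512)          # 7 contextual objects
        print(BundledContext()(ref, ctx).shape)   # torch.Size([2, 1])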
  • Weiqing Min, Shuqiang Jiang, Shuhui Wang, Ruihan Xu, Yushan Cao, Luis Herranz, Zhiqiang He, A Survey on Context-Aware Mobile Visual Recognition.


    Abstract

    Human beings have developed a diverse food culture. Many factors like ingredients, visual appearance, courses (e.g., breakfast and lunch), flavor and geographical regions affect our food perception and choice. In this work, we focus on multi-dimensional food analysis based on these food factors to benefit various applications like summary and recommendation. For that solution, we propose a delicious recipe analysis framework to incorporate various types of continuous and discrete attribute features and multi-modal information from recipes. First, we develop a Multi-Attribute Theme Modeling (MATM) method, which can incorporate arbitrary types of attribute features to jointly model them and the textual content. We then utilize a multi-modal embedding method to build the correlation between the learned textual theme features from MATM and visual features from the deep learning network. By learning attribute-theme relations and multi-modal correlation, we are able to fulfill different applications, including (1) flavor analysis and comparison for better understanding the flavor patterns from different dimensions, such as the region and course, (2) region-oriented multi-dimensional food summary with both multi-modal and multi-attribute information and (3) multi-attribute oriented recipe recommendation. Furthermore, our proposed framework is flexible and enables easy incorporation of arbitrary types of attributes and modalities. Qualitative and quantitative evaluation results have validated the effectiveness of the proposed method and framework on the collected Yummly dataset.


    • Weiqing Min, Shuqiang Jiang, Shuhui Wang, Ruihan Xu, Yushan Cao, Luis Herranz, Zhiqiang He, A Survey on Context-Aware Mobile Visual Recognition. Multimedia Syst. 23(6): 647-665 (2017)


    Multimedia Syst. 23(6): 647-665 (2017)
    [PDF]
  • Weiqing Min, Bingkun Bao, Shuhuan Mei, Yaohui Zhu, Yong Rui, Shuqiang Jiang, You Are What You Eat: Exploring Rich Recipe Information for Cross-Region Food Analysis.

    Abstract

    Cuisine is a style of cooking and is usually associated with a specific geographic region. Recipes from different cuisines shared on the web are an indicator of culinary cultures in different countries. Therefore, analysis of these recipes can lead to a deep understanding of food from the cultural perspective. In this paper, we perform the first cross-region recipe analysis by jointly using the recipe ingredients, food images and attributes such as the cuisine and course (e.g., main dish and dessert). For that solution, we propose a culinary culture analysis framework to discover the topics of ingredient bases and visualize them to enable various applications. We first propose a probabilistic topic model to discover cuisine-course specific topics. The manifold ranking method is then utilized to incorporate deep visual features to retrieve food images for topic visualization. Finally, we apply the topic modeling and visualization method to three applications: (1) multi-modal cuisine summarization with both recipe ingredients and images, (2) cuisine-course pattern analysis including topic-specific cuisine distribution and cuisine-specific course distribution of topics, and (3) cuisine recommendation for both cuisine-oriented and ingredient-oriented queries. Through these three applications, we can analyze the culinary cultures at both macro and micro levels. We conduct the experiment on a recipe database, Yummly-66K, with 66,615 recipes from 10 cuisines in Yummly. Qualitative and quantitative evaluation results have validated the effectiveness of topic modeling and visualization, and demonstrated the advantage of the framework in utilizing rich recipe information to analyze and interpret the culinary cultures of different regions.

    • Weiqing Min, Bing-Kun Bao, Shuhuan Mei, Yaohui Zhu, Yong Rui, Shuqiang Jiang: You Are What You Eat: Exploring Rich Recipe Information for Cross-Region Food Analysis. IEEE Trans. Multimedia, Vol.20, No.4, 2018, pp.950-964


    IEEE Trans. Multimedia, Vol.20, No.4, 2018, pp.950-964
    [PDF]
  • Xiong Lv, Xinda Liu, Xiangyang Li, Xue Li, Shuqiang Jiang, Zhiqiang He, Modality-specific and hierarchical feature learning for RGB-D hand-held object recognition.
    Multimedia Tools Appl. 76(3): 4273-4290 (2017)
    [PDF]
  • Xinhang Song, Shuqiang Jiang, Luis Herranz, Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold.

    Abstract

    Before the big data era, scene recognition was often approached with two-step inference using localized intermediate representations (objects, topics, etc). One such approach is the semantic manifold (SM), in which patches and images are modeled as points in a semantic probability simplex. Patch models are learned resorting to weak supervision via image labels, which leads to the problem of scene categories co-occurring in this semantic space. Fortunately, each category has its own co-occurrence patterns that are consistent across the images in that category. Thus, discovering and modeling these patterns is critical to improve the recognition performance in this representation. Since the emergence of large datasets, such as ImageNet and Places, these approaches have been relegated in favor of the much more powerful convolutional neural networks (CNNs), which can automatically learn multi-layered representations from the data. In this paper we address many limitations of the original SM approach and related works. We propose discriminative patch representations using neural networks and further propose a hybrid architecture in which the semantic manifold is built on top of multiscale CNNs. Both representations can be computed significantly faster than the Gaussian mixture models of the original SM. To combine multiple scales, spatial relations and multiple features, we formulate rich context models using Markov random fields. To solve the optimization problem we analyze global and local approaches, where a top-down hierarchical algorithm has the best performance. Experimental results show that exploiting different types of contextual relations jointly consistently improves the recognition accuracy.



    • Xinhang Song, Shuqiang Jiang, Luis Herranz. “Multi-scale multi-feature context modeling for scene recognition in the semantic manifold.” IEEE Transactions on Image Processing (TIP), 2017, CCF A


    IEEE Trans. Image Processing 26(6): 2721-2735 (2017)
    [PDF]
  • Luis Herranz, Shuqiang Jiang, Ruihan Xu, Modeling Restaurant Context for Food Recognition.

    Abstract

    Food photos are widely used in food logs for diet monitoring and in social networks to share social and gastronomic experiences. A large number of these images are taken in restaurants. Dish recognition in general is very challenging, due to different cuisines, cooking styles, and the intrinsic difficulty of modeling food from its visual appearance. However, contextual knowledge can be crucial to improve recognition in such a scenario. In particular, geocontext has been widely exploited for outdoor landmark recognition. Similarly, we exploit knowledge about menus and the location of restaurants and test images. We first adapt a framework based on discarding unlikely categories located far from the test image. Then, we reformulate the problem using a probabilistic model connecting dishes, restaurants, and locations. We apply that model in three different tasks: dish recognition, restaurant recognition, and location refinement. Experiments on six datasets show that by integrating multiple evidences (visual, location, and external knowledge) our system can boost the performance in all tasks. (An illustrative code sketch of context-based re-scoring follows this entry.)


    • L. Herranz, S. Jiang, R. Xu, “Modeling Restaurant Context for Food Recognition”, IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 430-440, Feb. 2017.

    • L. Herranz, R. Xu, S. Jiang, “A probabilistic framework for food recognition in restaurants”, Proc. International Conference on Multimedia and Expo 2015 (ICME15), pp. 1-6, Torino, Italy, June 2015 (earlier conference version)


    IEEE Trans. Multimedia 19(2): 430-440 (2017)
    [PDF]
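    A minimal sketch (plain Python) of context-based re-scoring in the spirit of the paper: visual dish probabilities are combined with a location-weighted menu prior so that dishes not served near the photo location are suppressed. The Gaussian distance weighting and the toy data are illustrative assumptions, not the paper's exact probabilistic model.

        import math

        def rescore(visual_probs, restaurants, photo_loc, sigma_km=0.5):
            """visual_probs: {dish: p} from the visual classifier; restaurants: list of
            {"loc": (lat, lon), "menu": set_of_dishes}. A dish is only supported by
            restaurants whose menu contains it, and each restaurant is weighted by its
            distance to the photo location."""
            def dist_km(a, b):   # rough planar approximation, adequate at city scale
                return 111.0 * math.hypot(a[0] - b[0],
                                          (a[1] - b[1]) * math.cos(math.radians(a[0])))
            scores = {}
            for dish, p_vis in visual_probs.items():
                prior = sum(math.exp(-(dist_km(photo_loc, r["loc"]) / sigma_km) ** 2)
                            for r in restaurants if dish in r["menu"])
                scores[dish] = p_vis * prior
            z = sum(scores.values()) or 1.0
            return {d: s / z for d, s in scores.items()}

        probs = {"ramen": 0.5, "paella": 0.3, "sushi": 0.2}
        places = [{"loc": (41.40, 2.17), "menu": {"paella", "sushi"}},
                  {"loc": (41.50, 2.30), "menu": {"ramen"}}]
        print(rescore(probs, places, photo_loc=(41.40, 2.17)))  # paella/sushi now dominate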
  • Weiqing Min, Shuqiang Jiang, Jitao Sang, Huayang Wang, Xinda Liu, Luis Herranz, Being a Supercook: Joint Food Attributes and Multimodal Content Modeling for Recipe Retrieval and Exploration.

    Abstract

    This paper considers the problem of recipe-oriented image-ingredient correlation learning with multiple attributes for recipe retrieval and exploration. Existing methods mainly focus on food visual information for recognition, while we model visual information, textual content (e.g., ingredients) and attributes (e.g., cuisine and course) together to solve extended recipe-oriented problems, such as multi-modal cuisine classification and attribute-enhanced food image retrieval. As a solution, we propose a Multi-Modal Multi-Task Deep Belief Network (M3TDBN) to learn a joint image-ingredient representation regularized by different attributes. By grouping ingredients into visible ingredients (which are visible in the food image, e.g., "chicken" and "mushroom") and non-visible ingredients (e.g., "salt" and "oil"), M3TDBN is capable of learning both the mid-level visual representation between images and visible ingredients and the non-visual representation. Furthermore, in order to utilize different attributes to improve the inter-modality correlation, M3TDBN incorporates multi-task learning to make different attributes collaborate with each other. Based on the proposed M3TDBN, we exploit the derived deep features and the discovered correlations for three novel extended applications: (1) multi-modal cuisine classification, (2) attribute-augmented cross-modal recipe image retrieval and (3) ingredient and attribute inference from food images. The proposed approach is evaluated on the constructed Yummly dataset and the evaluation results have validated its effectiveness.


    • Weiqing Min, Shuqiang Jiang, Jitao Sang, Huayang Wang, Xinda Liu, Luis Herranz, Being a Supercook: Joint Food Attributes and Multimodal Content Modeling for Recipe Retrieval and Exploration. IEEE Trans. Multimedia 19(5): 1100-1113 (2017)


    IEEE Trans. Multimedia 19(5): 1100-1113 (2017)
    [PDF]
  • Guorong Li, Shuqiang Jiang, Weigang Zhang, Junbiao Pang, Qingming Huang, Online web video topic detection and tracking with semi-supervised learning.
    Multimedia Syst. 22(1): 115-125 (2016)
    [PDF]
  • Luis Herranz, Shuqiang Jiang, Scalable storyboards in handheld devices: applications and evaluation metrics.
    Multimedia Tools Appl. 75(20): 12597-12625 (2016)
    [PDF]
  • Xinhang Song, Shuqiang Jiang, Luis Herranz, Yan Kong, Kai Zheng, Category co-occurrence modeling for large scale scene recognition.
    Pattern Recognition 59: 98-111 (2016)
    [PDF]
  • Shuang Wang, Shuqiang Jiang, INSTRE: A New Benchmark for Instance-Level Object Retrieval and Recognition.
    TOMCCAP 11(3): 37:1-37:21 (2015)
    [PDF]
  • Guorong Li, Qingming Huang, Shuqiang Jiang, Yingkun Xu, Weigang Zhang, Online learning affinity measure with CovBoost for multi-target tracking.
    Neurocomputing 168: 327-335 (2015)
    [PDF]
  • Shuhui Wang, Fuzhen Zhuang, Shuqiang Jiang, Qingming Huang, Qi Tian, Cluster-sensitive Structured Correlation Analysis for Web cross-modal retrieval.
    Neurocomputing 168: 747-760 (2015)
    [PDF]
  • Xiong Lv, Shuqiang Jiang, Luis Herranz, Shuang Wang, RGB-D Hand-Held Object Recognition Based on Heterogeneous Feature Fusion.
    J. Comput. Sci. Technol. 30(2): 340-352 (2015)
    [PDF]
  • Liang Li, Chenggang Clarence Yan, Wen Ji, Bo-Wei Chen, Shuqiang Jiang, Qingming Huang, LSH-based semantic dictionary learning for large scale image understanding.
    J. Visual Communication and Image Representation 31: 231-236 (2015)
    [PDF]
  • Xinhang Song, Shuqiang Jiang, Shuhui Wang, Liang Li, Qingming Huang, Polysemious visual representation based on feature aggregation for large scale image applications.
    Multimedia Tools Appl. 74(2): 595-611 (2015)
    [PDF]
  • Ruihan Xu, Luis Herranz, Shuqiang Jiang, Shuang Wang, Xinhang Song, Ramesh Jain, Geolocalized Modeling for Dish Recognition.
    IEEE Trans. Multimedia 17(8): 1187-1199 (2015)
    [PDF]
  • Chenggang Clarence Yan, Liang Li, Zhan Wang, Jian Yin, Hailong Shi, Shuqiang Jiang, Qingming Huang, Fusing multi-cues description for partial-duplicate image retrieval.
    J. Visual Communication and Image Representation 25(7): 1726-1731 (2014)
    [PDF]
  • Shuqiang Jiang, Xinhang Song, Qingming Huang, Relative image similarity learning with contextual information for Internet cross-media retrieval.
    Multimedia Syst. 20(6): 645-657 (2014)
    [PDF]
  • Shuqiang Jiang, Changsheng Xu, Yong Rui, Alberto Del Bimbo, Hongxun Yao, Preface: Internet multimedia computing and service.
    Multimedia Tools Appl. 70(2): 599-603 (2014)
    [PDF]
  • Lingyang Chu, Shuqiang Jiang, Shuhui Wang, Yanyan Zhang, Qingming Huang, Robust Spatial Consistency Graph Model for Partial Duplicate Image Retrieval.
    IEEE Trans. Multimedia 15(8): 1982-1996 (2013)
    [PDF]
  • Liang Li, Shuqiang Jiang, Zhengjun Zha, Zhipeng Wu, Qingming Huang, Partial-Duplicate Image Retrieval via Saliency-guided Visually Matching.
    IEEE MultiMedia 20(3): 13-23 (2013)
    [PDF]
  • Yi Xie, Shuqiang Jiang, Qingming Huang, Weighted Visual Vocabulary to Balance the Descriptive Ability on General Dataset.
    Neurocomputing 119: 478-488 (2013)
    [PDF]
  • Guorong Li, Qingming Huang, Lei Qin, Shuqiang Jiang, SSOCBT: A Robust Semi-Supervised Online CovBoost Tracker by Using Samples Differently.
    IEEE Transactions on Circuits and Systems for Video Technology, vol.23, no.4, pp.695-709, April 2013
    [PDF]
  • Liang Li, Shuqiang Jiang, and Qingming Huang, Learning Hierarchical Semantic Description via Mixed-norm Regularization for Image Understanding.
    IEEE Transactions on Multimedia, vol.14, no.5, pp.1401-1413, Oct. 2012
    [PDF]
  • Shuhui Wang, Qingming Huang, Shuqiang Jiang and Qi Tian, S3MKL: Scalable Semi-Supervised Multiple Kernel Learning for Real World Image Applications.
    IEEE Transactions on Multimedia, vol.14, no.4, pp.1259-1274, Aug. 2012
    [PDF]
  • Guorong Li, Qingming Huang, Junbiao Pang, Shuqiang Jiang and Lei Qin, Online Selection of the Best k-Feature Subset for Object Tracking. Journal of Visual Communication and Image Representation.
    Volume 23, Issue 2, pp 254-263, 2012
    [PDF]
  • Huiying Liu, Qingming Huang, Changsheng Xu and Shuqiang Jiang, @ICT: Attention Based Virtual Content Insertion.
    Multimedia Systems, Volume 18, Issue 3, pp. 201-214, 2012.
    [PDF]
  • Shuhui Wang, Qingming Huang, Shuqiang Jiang, Qi Tian, Nearest-Neighbor Method Using Multiple Neighborhood Similarities for Social Media Data Mining,
    Neurocomputing, vol.95, pp. 105-116, Oct. 2012
    [PDF]
  • Junbiao Pang, Qingming Huang, Shuicheng Yan, Shuqiang Jiang, Lei Qin, Transferring Boosted Detectors Towards Viewpoint and Scene Adaptiveness.
    IEEE Transactions on Image Processing,vol.20, no.5, pp.1388-1400, May 2011
    [PDF]
  • Shiliang Zhang, Qingming Huang, Shuqiang Jiang, Wen Gao, and Qi Tian, Affective Visualization and Retrieval for Music Video,
    IEEE Transactions on Multimedia,vol.12, no.6, pages 510-522, Oct. 2010
    [PDF]
  • Guangyu Zhu, Changsheng Xu, Qingming Huang, Yong Rui, Shuqiang Jiang, Wen Gao; Hongxun Yao, Event Tactic Analysis Based on Broadcast Sports Video,
    IEEE Transaction on Multimedia, Vol.11, no.1, pp.49-67, Jan. 2009
    [PDF]
  • Lei Qin, Qingfang Zheng, Shuqiang Jiang, Qingming Huang and Wen Gao, Unsupervised texture classification: Automatically discover and classify texture patterns, Image and Vision Computing,
    vol.26, no.5, Pages 647-656, May 2008
    [PDF]
  • Shuqiang Jiang, Qingming Huang, Qixiang Ye, Wen Gao, An Effective Method to Detect and Categorize Digitized Traditional Chinese Paintings,
    Pattern Recognition Letters, Volume 27, Issue 7, Pages 734-746, May 2006
    [PDF]
  • Shuqiang Jiang, Qingming Huang, Tiejun Huang, Wen Gao, Visual Ontology Construction for Digitized Art Image Retrieval,
    Journal of Computer Science and Technology, Vol.20, No.6, pp. 855-860, Nov. 2005
    [PDF]
Conference
  • Zhuo Li, Weiqing Min, Jiajun Song, Yaohui Zhu, Liping Kang, Xiaoming Wei, Xiaolin Wei, Shuqiang Jiang, Rethinking the Optimization of Average Precision: Only Penalizing Negative Instances before Positive Ones is Enough

    Optimizing approximations of Average Precision (AP) has been widely studied for image retrieval. Constrained by the definition of AP, such methods must consider both the negative and the positive instances ranked before each positive instance. We argue, however, that penalizing only the negatives ranked before each positive is sufficient, because the loss comes solely from those samples. To this end we propose a new loss, PNP, which directly minimizes the number of negative instances before each positive one. In addition, AP-based methods adopt a fixed and sub-optimal gradient assignment strategy. We therefore systematically study different gradient assignment schemes by constructing the derivative function of the loss, obtaining PNP-I with increasing derivative functions and PNP-D with decreasing ones. PNP-I focuses on hard positives by assigning them larger gradients and tries to pull all relevant instances closer, whereas PNP-D pays less attention to such instances and corrects them slowly. For most real-world data a class usually contains several local clusters; PNP-I blindly merges these clusters, while PNP-D preserves the original data distribution and is therefore preferable. Evaluations on three standard retrieval datasets confirm this analysis, and PNP-D achieves state-of-the-art performance. (A minimal loss sketch follows this entry.)

    Abstract

    Optimising the approximation of Average Precision (AP) has been widely studied for image retrieval. Limited by the definition of AP, such methods consider both negative and positive instances ranked before each positive instance. However, we claim that only penalizing negative instances before positive ones is enough, because the loss only comes from these negative instances. To this end, we propose a novel loss, namely Penalizing Negative instances before Positive ones (PNP), which can directly minimize the number of negative instances before each positive one. In addition, AP-based methods adopt a fixed and sub-optimal gradient assignment strategy. Therefore, we systematically investigate different gradient assignment solutions via constructing derivative functions of the loss, resulting in PNP-I with increasing derivative functions and PNP-D with decreasing ones. PNP-I focuses more on the hard positive instances by assigning larger gradients to them and tries to make all relevant instances closer. In contrast, PNP-D pays less attention to such instances and slowly corrects them. For most real-world data, one class usually contains several local clusters. PNP-I blindly gathers these clusters while PNP-D keeps them as they were. Therefore, PNP-D is superior. Experiments on three standard retrieval datasets show consistent results with the above analysis. Extensive evaluations demonstrate that PNP-D achieves state-of-the-art performance.

    • Zhuo Li, Weiqing Min, Jiajun Song, Yaohui Zhu, Liping Kang, Xiaoming Wei, Xiaolin Wei, Shuqiang Jiang, "Rethinking the Optimization of Average Precision: Only Penalizing Negative Instances before Positive Ones is Enough", 36th AAAI Conference on Artificial Intelligence (AAAI 2022), Vancouver, BC, Canada, Feb.22 - Mar.1, 2022.


    (AAAI 2022), February 22 - March 1, 2022, Vancouver, BC, Canada
    [PDF]
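    A minimal sketch of a PNP-style loss for the entry above, written in PyTorch. The sigmoid temperature tau and the log1p surrogate (one possible decreasing-derivative, PNP-D-like choice) are assumptions for illustration, not the paper's exact formulation.

        import torch

        def pnp_d_loss(sim, labels, tau=0.01):
            """sim: (N,) similarity scores of gallery items w.r.t. one query.
            labels: (N,) with 1 for positives and 0 for negatives (at least one of each).
            For each positive, softly count the negatives ranked above it and apply
            log(1 + n), whose gradient decreases as negatives accumulate."""
            pos = sim[labels == 1]                               # (P,)
            neg = sim[labels == 0]                               # (M,)
            diff = neg.unsqueeze(0) - pos.unsqueeze(1)           # (P, M)
            n_before = torch.sigmoid(diff / tau).sum(dim=1)      # soft count per positive
            return torch.log1p(n_before).mean()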
  • Gongwei Chen, Xinhang Song, Bohan Wang, Shuqiang Jiang, See More for Scene: Pairwise Consistency Learning for Scene Classification

    Scene classification is a valuable computer vision task whose distinctive characteristics still call for further study. Scene characteristics are essentially distributed over the whole image, so a classification model needs to "see" more comprehensive and informative regions. Previous work has mainly focused on discovering and aggregating regions in scene images, while rarely considering the inherent properties of convolutional networks and their potential to satisfy the requirements of scene classification. In this paper we propose to understand scene images and scene classification networks in terms of the focus area. From this new perspective, we find that scene classification models tend to activate larger focus areas once they have learned scene characteristics. Analyzing existing training strategies helps us understand how the focus area affects model performance and raises the question of the optimal training method for scene classification. To make better use of scene characteristics, we propose a new learning scheme with a tailored loss that activates larger focus areas on scene images. Since supervision for the regions that should be enlarged is lacking, our alternative strategy is to erase already-activated regions so that the model is encouraged to activate more regions during training. The proposed scheme is implemented by keeping the outputs of the erased image and the original image pairwise-consistent; in particular, the tailored loss uses category-relevance information to maintain this consistency. Experiments on Places365 show significant improvements of our method across various network architectures. Our method yields inferior results on the object dataset ImageNet, which experimentally indicates that it captures characteristics unique to scenes. (An illustrative training-step sketch follows this entry.)

    Abstract

    Scene classification is a valuable classification subtask with its own characteristics that still need more in-depth study. Basically, scene characteristics are distributed over the whole image, which creates the need to "see" comprehensive and informative regions. Previous works mainly focus on region discovery and aggregation, while rarely involving the inherent properties of CNNs and their potential to satisfy the requirements of scene classification. In this paper, we propose to understand scene images and scene classification CNN models in terms of the focus area. From this new perspective, we find that a large focus area is preferred in scene classification CNN models as a consequence of learning scene characteristics. Meanwhile, the analysis of existing training schemes helps us to understand the effects of the focus area, and also raises the question of the optimal training method for scene classification. Pursuing a better usage of scene characteristics, we propose a new learning scheme with a tailored loss whose goal is to activate a larger focus area on scene images. Since supervision of the target regions to be enlarged is usually lacking, our alternative learning scheme is to erase already activated areas and allow the CNN models to activate more area during training. The proposed scheme is implemented by keeping the pairwise consistency between the output of the erased image and that of the original one. In particular, a tailored loss is proposed to keep such pairwise consistency by leveraging category-relevance information. Experiments on Places365 show the significant improvements of our method with various CNNs. Our method shows an inferior result on the object dataset, ImageNet, which experimentally indicates that it captures the unique characteristics of scenes.

    • Gongwei Chen, Xinhang Song, Bohan Wang, and Shuqiang Jiang. "See More for Scene: Pairwise Consistency Learning for Scene Classification." 35th Advances in Neural Information Processing Systems (NeurIPS 2021), Dec. 6-14, 2021.


    (NeurIPS 2021), December 6-14, 2021
    [PDF]
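    A hedged sketch of one training step in the spirit of the erase-and-match scheme above. The helper cam_fn (returning class activation maps aligned with the input resolution), the threshold thr and the weight alpha are assumptions; the paper's tailored, category-relevance loss is approximated here by a plain KL consistency term.

        import torch
        import torch.nn.functional as F

        def pairwise_consistency_step(model, images, labels, cam_fn, thr=0.6, alpha=1.0):
            logits = model(images)
            cls_loss = F.cross_entropy(logits, labels)

            with torch.no_grad():
                cams = cam_fn(model, images, labels)        # (B, 1, H, W), values in [0, 1]
                erased = images * (cams < thr).float()      # blank out already-activated area

            erased_logits = model(erased)
            # keep the erased view consistent with the original prediction
            consistency = F.kl_div(F.log_softmax(erased_logits, dim=1),
                                   F.softmax(logits.detach(), dim=1),
                                   reduction="batchmean")
            return cls_loss + alpha * consistency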
  • Sixian Zhang, Xinhang Song, Yubing Bai, Weijie Li, Yakui Chu, Shuqiang Jiang, Hierarchical object-to-zone graph for object navigation

    The object navigation task requires an agent to find a specified target object in an unseen environment. Previous work typically trains deep models with reinforcement learning to predict actions in real time; however, when the target object is not in the agent's view, the agent often cannot act efficiently due to the lack of guidance. This paper proposes a hierarchical object-to-zone (HOZ) graph that provides the agent with coarse-to-fine prior guidance, and the HOZ graph can be continually updated in new environments according to online observations. Specifically, the HOZ graph consists of scene nodes, zone nodes and object nodes. With the HOZ graph, the agent can plan a path from its current zone to the zone where the target object is likely to appear, based on the target object and the current observation. The method is evaluated in the AI2-Thor simulator. Besides the commonly used Success Rate (SR) and Success weighted by Path Length (SPL), we also propose Success weighted by Action Efficiency (SAE), a metric for evaluating the effectiveness of actions during navigation. Experimental results demonstrate the effectiveness of our method. (A simplified zone-path-planning sketch follows this entry.)


    Abstract

    The goal of object navigation is to reach the expected objects according to visual information in the unseen environments. Previous works usually implement deep models to train an agent to predict actions in real-time. However, in the unseen environment, when the target object is not in egocentric view, the agent may not be able to make wise decisions due to the lack of guidance. In this paper, we propose a hierarchical object-to-zone (HOZ) graph to guide the agent in a coarse-to-fine manner, and an online-learning mechanism is also proposed to update HOZ according to the real-time observation in new environments. In particular, the HOZ graph is composed of scene nodes, zone nodes and object nodes. With the pre-learned HOZ graph, the real-time observation and target goal, the agent can constantly plan an optimal path from zone to zone. In the estimated path, the next potential zone is regarded as sub-goal, which is also fed into the deep reinforcement learning model for action prediction. Our methods are evaluated on the AI2-Thor simulator. In addition to widely used evaluation metrics Success Rate (SR) and Success weighted by Path Length (SPL), we also propose a new evaluation of Success weighted by Action Efficiency (SAE) that focuses on the effective action rate. Experimental results demonstrate the effectiveness and efficiency of our proposed method.

    • Sixian Zhang, Xinhang Song, Yubing Bai, Weijie Li, Yakui Chu, and Shuqiang Jiang. Hierarchical object-to-zone graph for object navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15130–15140, October 2021.


    (ICCV 2021), October, 2021
    [PDF]
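    A simplified, pure-Python sketch of the coarse-to-fine idea in the entry above: pick the zone most likely to contain the target object and plan a zone-to-zone path (here with Dijkstra). The dictionary format of zone_graph, the edge costs and the connectivity assumption are all illustrative; the paper additionally updates the graph online and feeds sub-goals to a reinforcement-learning policy.

        import heapq

        def plan_zone_path(zone_graph, start_zone, target_obj):
            """zone_graph: {zone: {"objects": {obj: prob}, "edges": {next_zone: cost}}};
            assumed connected. Returns a zone list from start_zone to the most
            promising zone for target_obj."""
            goal = max(zone_graph,
                       key=lambda z: zone_graph[z]["objects"].get(target_obj, 0.0))
            dist, prev, heap = {start_zone: 0.0}, {}, [(0.0, start_zone)]
            while heap:
                d, z = heapq.heappop(heap)
                if z == goal:
                    break
                for nxt, cost in zone_graph[z]["edges"].items():
                    nd = d + cost
                    if nd < dist.get(nxt, float("inf")):
                        dist[nxt], prev[nxt] = nd, z
                        heapq.heappush(heap, (nd, nxt))
            path, z = [goal], goal
            while z != start_zone:
                z = prev[z]
                path.append(z)
            return path[::-1]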
  • Weijie Li, Xinhang Song, Yubing Bai, Sixian Zhang, Shuqiang Jiang, ION: Instance-level Object Navigation

    Visual object navigation is a fundamental and important research topic in Embodied AI, in which an agent navigates to a specified object according to an instruction. Existing work is mostly category-level: navigating to any object of the target category counts as success. Practical applications, however, often require finer-grained navigation to a specific target instance; for example, when the need is "drink water", we expect the agent to find "my own cup" rather than an arbitrary cup. This paper therefore proposes the instance-level object navigation task (ION) together with a navigation framework and evaluation criteria. Based on the AI2-THOR simulator, we design an object instantiation and automatic annotation system that simulates real-world scenes with many object instances and automatically generates instance descriptions of the form <object category, color, material, spatial relation>; 27,735 object-instance descriptions are collected automatically to form the ION dataset. For the proposed task we present a cascade framework in which the nodes of an Instance-Relation Graph (IRG) represent the color and material of object instances and the edges represent their spatial relations. During navigation, detected instances activate the corresponding IRG nodes through instance selection, and, combined with the instance mask and instance grounding, the agent finally locates the target instance. Experiments verify the challenge of instance-level navigation and show that the proposed cascade framework outperforms the baselines on instance-level metrics. (A minimal grounding sketch follows this entry.)

    Abstract

    Visual object navigation is a fundamental task in Embodied AI. Previous works focus on the category-wise navigation, in which navigating to any possible instance of target object category is considered a success. Those methods may be effective to find the general objects. However, it may be more practical to navigate to the specific instance in our real life, since our particular requirements are usually satisfied with specific instances rather than all instances of one category. How to navigate to the specific instance has been rarely researched before and is typically challenging to current works. In this paper, we introduce a new task of Instance Object Navigation (ION), where instance-level descriptions of targets are provided and instance-level navigation is required. In particular, multiple types of attributes such as colors, materials and object references are involved in the instance-level descriptions of the targets. In order to allow the agent to maintain the ability of instance navigation, we propose a cascade framework with Instance-Relation Graph (IRG) based navigator and instance grounding module. To specify the different instances of the same object categories, we construct instance-level graph instead of category-level one, where instances are regarded as nodes, encoded with the representation of colors, materials and locations (bounding boxes). During navigation, the detected instances can activate corresponding nodes in IRG, which are updated with graph convolutional neural network (GCNN). The final instance prediction is obtained with the grounding module by selecting the candidates (instances) with maximum probability (a joint probability of category, color and material, obtained by corresponding regressors with softmax). For the task evaluation, we build a benchmark for instance-level object navigation on AI2-Thor simulator, where over 27,735 object instance descriptions and navigation groundtruth are automatically obtained through the interaction with the simulator. The proposed model outperforms the baseline in instance-level metrics, showing that our proposed graph model can guide instance object navigation, as well as leaving promising room for further improvement. The project is available at https://github.com/LWJ312/ION.

    • Weijie Li, Xinhang Song, Yubing Bai, Sixian Zhang, Shuqiang Jiang. “ION: Instance-level Object Navigation”, 29th ACM International Conference on Multimedia (ACM Multimedia 2021), Chengdu, China, October 20-24, 2021.


    (ACM Multimedia 2021), October 20–24, 2021, Chengdu, China
    [PDF]
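    A minimal sketch of the grounding step mentioned in the abstract above: the grounded candidate is the one maximizing the joint softmax probability of the described category, color and material. Tensor shapes and the target dictionary are assumptions; the full model also uses the instance-relation graph and spatial relations.

        import torch
        import torch.nn.functional as F

        def ground_instance(cat_logits, color_logits, mat_logits, target):
            """Each *_logits tensor has shape (num_candidates, num_values); target holds
            the described indices, e.g. {"category": 3, "color": 1, "material": 0}."""
            p = (F.softmax(cat_logits, dim=1)[:, target["category"]]
                 * F.softmax(color_logits, dim=1)[:, target["color"]]
                 * F.softmax(mat_logits, dim=1)[:, target["material"]])
            return torch.argmax(p).item()    # index of the grounded candidate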
  • Qiang Hou, Weiqing Min, Jing Wang, Sujuan Hou, Yuanjie Zheng, Shuqiang Jiang, FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network

    Food logo detection, an important task in logo detection, can be applied to healthy diet recommendation, food trademark infringement disputes, food advertising and self-checkout systems in supermarkets. However, no dataset has been available for food logo detection. This paper therefore presents a new large-scale, publicly available food logo dataset with 1,500 categories, about 100,000 images and about 150,000 manually annotated food logos. We further propose a multi-scale feature-decoupling food logo detection network that decouples the network into classification and regression branches, uses a feature offset module in the classification branch to effectively obtain the most representative classification features, and introduces a balanced feature pyramid that attends to the global information of multi-scale features to improve detection performance. Extensive experiments on the proposed dataset and two other public logo datasets demonstrate the effectiveness of the method. Dataset and code: https://github.com/hq03/FoodLogoDet-1500-Dataset. (A sketch of the decoupled detection head follows this entry.)

    Abstract

    Food logo detection plays an important role in the multimedia community due to its wide real-world applications, such as food recommendation in self-service shops and infringement detection on e-commerce platforms. A large-scale food logo dataset is urgently needed for developing advanced food logo detection algorithms. However, there are no available food logo datasets with food brand information. To support efforts towards food logo detection, we introduce the dataset FoodLogoDet-1500, a new large-scale publicly available food logo dataset, which has 1,500 categories, about 100,000 images and about 150,000 manually annotated food logo objects. We describe the collection and annotation process of FoodLogoDet-1500, analyze its scale and diversity, and compare it with other logo datasets. To the best of our knowledge, FoodLogoDet-1500 is the largest publicly available high-quality dataset for food logo detection. The challenge of food logo detection lies in the large-scale categories and similarities between food logo categories. For that, we propose a novel food logo detection method, Multi-scale Feature Decoupling Network (MFDNet), which decouples classification and regression into two branches and focuses on the classification branch to solve the problem of distinguishing multiple food logo categories. Specifically, we introduce the feature offset module, which utilizes deformation learning for optimal classification offset and can effectively obtain the most representative classification features in detection. In addition, we adopt a balanced feature pyramid in MFDNet, which pays attention to global information, balances the multi-scale feature maps, and enhances feature extraction capability. Comprehensive experiments on FoodLogoDet-1500 and two other popular benchmark logo datasets demonstrate the effectiveness of the proposed method. The code and FoodLogoDet-1500 can be found at https://github.com/hq03/FoodLogoDet-1500-Dataset.

    • Qiang Hou, Weiqing Min, Jing Wang, Sujuan Hou, Yuanjie Zheng, and Shuqiang Jiang. “FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network”, 29th ACM International Conference on Multimedia (ACM Multimedia 2021), Chengdu, China, October 20-24, 2021.


    (ACM Multimedia 2021), October 20–24, 2021, Chengdu, China
    [PDF]
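    A hedged sketch of the classification/regression decoupling described above; the feature offset module and the balanced feature pyramid are omitted, and channel counts are illustrative.

        import torch.nn as nn

        class DecoupledHead(nn.Module):
            """Two parallel branches over a shared feature map: one for per-anchor
            class scores, one for per-anchor box offsets."""
            def __init__(self, in_ch, num_classes, num_anchors=1):
                super().__init__()
                self.cls_branch = nn.Sequential(
                    nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(in_ch, num_anchors * num_classes, 3, padding=1))
                self.reg_branch = nn.Sequential(
                    nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(in_ch, num_anchors * 4, 3, padding=1))

            def forward(self, feat):
                return self.cls_branch(feat), self.reg_branch(feat)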
  • Tianyu Zhang, Weiqing Min, Jiahao Yang, Tao Liu, Shuqiang Jiang, Yong Rui, What If We Could Not See? Counterfactual Analysis for Egocentric Action Anticipation

    Among existing work on egocentric action anticipation, some methods use only the visual features of the video and ignore the semantic correlations among action labels, which limits anticipation performance; others add the semantic information carried by action labels on top of visual features, but because action labels follow a long-tailed distribution in the datasets, the predictions become biased toward high-frequency labels. Both factors hinder anticipation accuracy. Based on causal analysis, we propose a counterfactual-analysis scheme: for the prediction, the observed visual information, which corresponds to the concrete evidence of each case, is the main cause, while the action labels correspond to abstract semantic information that only reflects dataset-level statistics rather than case-specific evidence, and are thus a secondary cause. We therefore need to mitigate the side effect that semantic correlations among action labels have on the prediction, highlighting the primary role of visual information while retaining multimodal information. Counterfactual egocentric action anticipation proceeds in three stages: a biased factual stage, a bias-capturing counterfactual stage, and a final debiasing stage. In the factual stage, the future action is predicted from past visual features and action categories. In the counterfactual stage, we imagine a scenario in which nothing is seen and the future action is predicted only from past action categories, so that the prediction fully captures the bias introduced by semantic information. In the final debiasing stage, the counterfactual prediction is subtracted from the factual prediction to obtain the final result. (A one-line debiasing sketch follows this entry.)

    Abstract

    Egocentric action anticipation aims at predicting the near future based on past observation in first-person vision. While future actions may be wrongly predicted due to the dataset bias, we present a counterfactual analysis framework for egocentric action anticipation (CA-EAA) to enhance the capacity. In the factual case, we can predict the upcoming action based on visual features and semantic labels from past observation. Imagining one counterfactual situation where no visual representation had been observed, we would obtain a counterfactual predicted action only using past semantic labels. In this way, we can reduce the side-effect caused by semantic labels via a comparison between factual and counterfactual outcomes, which moves a step towards unbiased prediction for egocentric action anticipation. We conduct experiments on two large-scale egocentric video datasets. Qualitative and quantitative results validate the effectiveness of our proposed CA-EAA.

    • Tianyu Zhang, Weiqing Min, Jiahao Yang, Tao Liu, Shuqiang Jiang, Yong Rui, “What If We Could Not See? Counterfactual Analysis for Egocentric Action Anticipation”, International Joint Conference on Artificial Intelligence (IJCAI 2021): 1316-1322, Canada, August 19-26, 2021.


    (IJCAI 2021), August 19-26, 2021, Canada
    [PDF]
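    The debiasing step of the entry above reduces to subtracting the counterfactual (labels-only) prediction from the factual one; a one-function sketch is below, where the trade-off weight alpha is an assumption.

        import torch

        def counterfactual_debias(factual_logits, counterfactual_logits, alpha=1.0):
            """factual_logits: prediction from visual features plus past action labels;
            counterfactual_logits: prediction from past action labels alone (nothing seen).
            Subtracting the latter suppresses the label-prior bias."""
            return factual_logits - alpha * counterfactual_logits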
  • Weiqing Min, Linhu Liu, Zhiling Wang, Zhengdong Luo, Xiaoming Wei, Xiaolin Wei, Shuqiang Jiang, ISIA Food-500: A dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network.

    Food is closely related to human behavior, health and culture. Food big data produced by ubiquitous networks such as social networks, mobile networks and the Internet of Things, together with the rapid development of artificial intelligence and especially deep learning, has given rise to the new interdisciplinary field of food computing [Min2019-ACM CSUR]. As one of the core tasks of food computing, food image recognition is also an important branch of fine-grained visual recognition in computer vision; it is of theoretical importance and has broad application prospects in smart health, intelligent food equipment, smart restaurants, intelligent retail and smart homes. Building on our earlier work on food recognition ([Jiang2020-IEEE TIP], [Min2019-ACMMM]), this paper presents a new food dataset, ISIA Food-500, which contains 500 categories and about 400,000 images, exceeding existing benchmark datasets in both category coverage and data volume. On this basis we propose a new network, SGLANet, that jointly learns global and local visual features of food images for recognition, and we carry out experimental analysis and validation on ISIA Food-500 and existing benchmark datasets.

    • [Min2019-ACM CSUR] Weiqing Min, Shuqiang Jiang, Linhu Liu, Yong Rui, Ramesh Jain, A Survey on Food Computing. ACM Computing Surveys, 52(5), 92:1-92:36, 2019

    • [Jiang2020-IEEE TIP] Shuqiang Jiang, Weiqing Min, Linhu Liu, Zhengdong Luo, Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition. IEEE Trans. Image Processing, vol.29, pp.265-276, 2020

    • [Min2019-ACMMM] Weiqing Min, Linhu Liu, Zhengdong Luo, Shuqiang Jiang, Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition. (ACM Multimedia 2019), 21-25 October 2019, Nice, France

    Abstract

    Food recognition has various applications in the multimedia community. To encourage further progress in food recognition, we introduce a new food dataset called ISIA Food-500. The dataset contains 500 categories and about 400,000 images, and it is a more comprehensive food dataset that surpasses existing benchmark datasets in category coverage and data volume. We further propose a new network architecture (SGLANet) to jointly learn food-oriented global and local visual features for food recognition. SGLANet consists of two sub-networks, namely the Global Feature Learning Subnetwork (GloFLS) and the Local Feature Learning Subnetwork (LocFLS). GloFLS first utilizes hybrid spatial-channel attention to obtain more discriminative features for each layer, and then aggregates these features from different layers into global-level features. LocFLS generates attentional regions from different regions via cascaded Spatial Transformers (STs), and further aggregates these multi-scale regional features from different layers into a local-level representation. These two types of features are finally fused as a comprehensive representation for food recognition. Extensive experiments on ISIA Food-500 and two other popular benchmark datasets demonstrate the effectiveness of our proposed method.

    • Weiqing Min, Linhu Liu, Zhiling Wang, Zhengdong Luo, Xiaoming Wei, Xiaolin Wei, Shuqiang Jiang. 2020. ISIA Food-500: A dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network. In Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3414031

    (ACM Multimedia 2020), October 12–16, 2020, Seattle, WA, USA
    [PDF]
  • Xinhang Song, Haitao Zeng, Sixian Zhang, Luis Herranz, Shuqiang Jiang, Generalized Zero-shot Learning with Multi-source Semantic Embeddings for Scene Recognition.


    This work targets the more complex setting of scene data and proposes a feature-generation framework for zero-shot learning with two main contributions: 1) zero-shot learning that fuses multiple sources of semantic descriptions, and 2) scene description enhancement based on local region descriptions. To generate visual features for unseen classes, we propose a two-step generation framework: local semantic descriptions are first sampled to generate virtual examples, from which local visual features are generated and aggregated into global features. Finally, the generated features of unseen classes are merged with the extracted features of seen classes to train a joint classifier. To evaluate the method, we introduce a new dataset with multiple kinds of semantic descriptions; experimental results show that the proposed framework achieves the best results on both the SUN Attribute dataset and the proposed dataset. (A minimal generator sketch follows this entry.)

    Abstract

    Recognizing visual categories from semantic descriptions is a promising way to extend the capability of a visual classifier beyond the concepts represented in the training data (i.e. seen categories). This problem is addressed by (generalized) zero-shot learning methods (GZSL), which leverage semantic descriptions that connect unseen categories to seen ones (e.g. label embedding, attributes). Conventional GZSL methods are designed mostly for object recognition. In this paper we focus on zero-shot scene recognition, a more challenging setting with hundreds of categories whose differences can be subtle and often localized in certain objects or regions. Conventional GZSL representations are not rich enough to capture these local discriminative differences. Addressing these limitations, we propose a feature generation framework with two novel components: 1) multiple sources of semantic information (i.e. attributes, word embeddings and descriptions), 2) region descriptions that can enhance scene discrimination. To generate synthetic visual features we propose a two-step generative approach, where local descriptions are sampled and used as conditions to generate visual features. The generated features are then aggregated and used together with real features to train a joint classifier. In order to evaluate the proposed method, we introduce a new dataset for zero-shot scene recognition with multi-semantic annotations. Experimental results on the proposed dataset and the SUN Attribute dataset illustrate the effectiveness of the proposed method.

    • Xinhang Song, Haitao Zeng, Sixian Zhang, Luis Herranz, Shuqiang Jiang. 2020. Generalized Zero-shot Learning with Multi-source Semantic Embeddings for Scene Recognition. In 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA.


    (ACM Multimedia 2020), October 12–16, 2020, Seattle, WA, USA
    [PDF]
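    A hedged sketch of the two-step generation described above: sampled local semantic descriptions (plus noise) are mapped to local visual features and aggregated into one synthetic global feature for training the joint classifier. Dimensions, the mean aggregation and the plain MLP generator are illustrative assumptions (the paper uses a conditional generative model).

        import torch
        import torch.nn as nn

        class LocalFeatureGenerator(nn.Module):
            def __init__(self, sem_dim=300, noise_dim=64, feat_dim=2048):
                super().__init__()
                self.noise_dim = noise_dim
                self.net = nn.Sequential(
                    nn.Linear(sem_dim + noise_dim, 1024), nn.ReLU(inplace=True),
                    nn.Linear(1024, feat_dim))

            def forward(self, local_sems):                    # (R, sem_dim) region descriptions
                z = torch.randn(local_sems.size(0), self.noise_dim)
                local_feats = self.net(torch.cat([local_sems, z], dim=1))
                return local_feats.mean(dim=0)                # aggregate to a global feature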
  • Xiaoqian Guo, Xiangyang Li, Shuqiang Jiang, Expressional Region Retrieval.

    Image retrieval is an important research topic in the multimedia field with wide applications. Regions in an image carry rich information, but previous retrieval methods are limited to single objects in an image or focus only on the overall visual scene. This paper introduces a new retrieval task, expressional region retrieval, which focuses on image regions together with their language descriptions. We explore retrieval based on expressional regions in images, exploiting both visual and language information to improve retrieval performance. (A sketch of the language-gated fusion follows this entry.)

    Abstract

    Image retrieval is a long-standing topic in the multimedia community due to its various applications, e.g., product search and artwork retrieval in museums. The regions in images contain a wealth of information. Users may be interested in the objects presented in the image regions or the relationships between them. However, previous retrieval methods are either limited to single objects in images or attend to the entire visual scene. In this paper, we introduce a new task called expressional region retrieval, in which the query is formulated as a region of an image with an associated description. The goal is to find images containing content similar to the query and to localize the corresponding regions within them. As far as we know, this task has not been explored yet. We propose a framework to address this issue. Region proposals are first generated with region detectors and language features are extracted. Then the Gated Residual Network (GRN) takes language information as a gate to control the transformation of visual features. In this way, the combined visual and language representation is more specific and discriminative for expressional region retrieval. We evaluate our method on a newly established benchmark constructed from the Visual Genome dataset. Experimental results demonstrate that our model effectively utilizes both visual and language information, outperforming the baseline methods.

    • Xiaoqian Guo, Xiangyang Li, Shuqiang Jiang. 2020. Expressional Region Retrieval. In 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA. https://doi.org/10.1145/3394171.3413567

    (ACM Multimedia 2020), October 12–16, 2020, Seattle, WA, USA
    [PDF]
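    A minimal sketch of using language features as a gate over the visual transformation, in the spirit of the Gated Residual Network described above; the dimensions and the exact residual form are assumptions.

        import torch
        import torch.nn as nn

        class GatedResidualFusion(nn.Module):
            def __init__(self, vis_dim=2048, lang_dim=512):
                super().__init__()
                self.gate = nn.Linear(lang_dim, vis_dim)
                self.transform = nn.Linear(vis_dim, vis_dim)

            def forward(self, vis_feat, lang_feat):
                g = torch.sigmoid(self.gate(lang_feat))          # language-driven gate in (0, 1)
                return vis_feat + g * self.transform(vis_feat)   # gated residual update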
  • Tianyu Zhang, Weiqing Min, Ying Zhu, Yong Rui, Shuqiang Jiang, An Egocentric Action Anticipation Framework via Fusing Intuition and Analysis.

    Egocentric action anticipation requires a machine to predict, from a person's viewpoint, the actions likely to happen next, and has wide applications in human-computer interaction and rehabilitation assistance. This paper proposes an egocentric action anticipation model that integrates intuition and analysis, consisting of three parts: an intuition-based prediction network, an analysis-based prediction network, and an adaptive fusion network. The intuition-based network is designed as a black-box-like encoder-decoder structure, the analysis-based network is designed as three steps of recognition, transition and combination, and an attention mechanism in the adaptive fusion network organically fuses the intuition and analysis results to produce the final prediction. (A minimal fusion sketch follows this entry.)

    Abstract

    In this paper, we focus on egocentric action anticipation from videos, which enables various applications, such as helping intelligent wearable assistants understand users' needs and enhance their capabilities in the interaction process. It requires intelligent systems to observe from the perspective of the first person and predict an action before it occurs. Most existing methods rely only on visual information, which is insufficient especially when there exists salient visual difference between past and future. In order to alleviate this problem, which we call visual gap in this paper, we propose one novel Intuition-Analysis Integrated (IAI) framework inspired by psychological research, which mainly consists of three parts: Intuition-based Prediction Network (IPN), Analysis-based Prediction Network (APN) and Adaptive Fusion Network (AFN). To imitate the implicit intuitive thinking process, we model IPN as an encoder-decoder structure and introduce one procedural instruction learning strategy implemented by textual pre-training. On the other hand, we allow APN to process information under designed rules to imitate the explicit analytical thinking, which is divided into three steps: recognition, transitions and combination. Both the procedural instruction learning strategy in IPN and the transition step of APN are crucial to improving the anticipation performance via mitigating the visual gap problem. Considering the complementarity of intuition and analysis, AFN adopts attention fusion to adaptively integrate predictions from IPN and APN to produce the final anticipation results. We conduct extensive experiments on the largest egocentric video dataset. Qualitative and quantitative evaluation results validate the effectiveness of our IAI framework, and demonstrate the advantage of bridging visual gap by utilizing multi-modal information, including both visual features of observed segments and sequential instructions of actions.

    • Tianyu Zhang, Weiqing Min, Ying Zhu, Yong Rui, Shuqiang Jiang. 2020. An Egocentric Action Anticipation Framework via Fusing Intuition and Analysis. In Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3413964


    (ACM Multimedia 2020), October 12–16, 2020, Seattle, WA, USA
    [PDF]
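    A minimal sketch of the adaptive fusion step described above: attention weights decide how much to trust the intuition-based and analysis-based predictions per sample. The single-layer scorer is an assumption.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AdaptiveFusion(nn.Module):
            def __init__(self, num_classes):
                super().__init__()
                self.scorer = nn.Linear(2 * num_classes, 2)

            def forward(self, intuition_logits, analysis_logits):
                joint = torch.cat([intuition_logits, analysis_logits], dim=1)
                w = F.softmax(self.scorer(joint), dim=1)             # (B, 2) fusion weights
                return w[:, :1] * intuition_logits + w[:, 1:] * analysis_logits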
  • Yaohui Zhu, Chenlong Liu, Shuqiang Jiang, Multi-attention Meta Learning for Few-shot Fine-grained Image Recognition.

    This paper proposes a multi-attention meta-learning (MattML) method that uses the attention mechanisms of both the base learner and the task learner to capture discriminative object parts. Specifically, the base learner consists of a feature embedding network, two convolutional block attention modules (CBAM) and a classifier. By fusing the channel and spatial information of convolutional features, the two CBAMs can focus on diverse and informative object parts. The task learner learns a task representation with a recurrent encoder and a recurrent decoder under an auto-encoding framework, and a weight generator uses this task representation to attend to the initialization of the classifier weights in the base learner, giving the classifier a task-related, sensitive initialization. A gradient-based meta-learning procedure then adapts the parameters of the two CBAMs and the classifier, so that the updated base learner can adaptively focus on the discriminative object parts of the current few-shot task. (A CBAM-style attention sketch follows this entry.)

    Abstract

    The goal of few-shot image recognition is to distinguish different categories with only one or a few training samples. Previous works on few-shot learning mainly work on general object images, and current solutions usually learn a global image representation from training tasks to adapt to novel tasks. However, fine-grained categories are distinguished by subtle and local parts, which cannot be captured effectively by global representations. This may hinder existing few-shot learning approaches from dealing with fine-grained categories well. In this work, we propose a multi-attention meta-learning (MattML) method for few-shot fine-grained image recognition (FSFGIR). Instead of using only the base learner for general feature learning, the proposed meta-learning method uses attention mechanisms of the base learner and task learner to capture discriminative parts of images. The base learner is equipped with two convolutional block attention modules (CBAM) and a classifier. The two CBAMs can learn diverse and informative parts. And the initial weights of the classifier are attended by the task learner, which gives the classifier a task-related sensitive initialization. For adaptation, the gradient-based meta-learning approach is employed by updating the parameters of the two CBAMs and the attended classifier, which facilitates the updated base learner to adaptively focus on discriminative parts. We experimentally analyze the different components of our method, and experimental results on four benchmark datasets demonstrate the effectiveness and superiority of our method.

    • Yaohui Zhu, Chenlong Liu, Shuqiang Jiang. “Multi-attention Meta Learning for Few-shot Fine-grained Image Recognition”. International Joint Conference on Artificial Intelligence --Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI), 2020.


    International Joint Conference on Artificial Intelligence --Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI), 2020.
    [PDF]
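    A hedged sketch of a CBAM-style channel-then-spatial attention block, the kind of module the base learner above is equipped with; the reduction ratio and 7x7 spatial kernel follow common CBAM practice and are assumptions here.

        import torch
        import torch.nn as nn

        class ChannelSpatialAttention(nn.Module):
            def __init__(self, channels, reduction=16):
                super().__init__()
                self.mlp = nn.Sequential(
                    nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                    nn.Linear(channels // reduction, channels))
                self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

            def forward(self, x):                                  # x: (B, C, H, W)
                b, c, _, _ = x.shape
                chan = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
                x = x * chan.view(b, c, 1, 1)                      # channel attention
                sp = torch.cat([x.mean(dim=1, keepdim=True),
                                x.amax(dim=1, keepdim=True)], dim=1)
                return x * torch.sigmoid(self.spatial(sp))         # spatial attention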
  • Jing Wang, Weiqing Min, Sujuan Hou, Shengnan Ma, Yuanjie Zheng, Haishuai Wang, Shuqiang Jiang, Logo-2K+: A Large-Scale Logo Dataset for Scalable Logo Classification.

    Abstract

    Logo classification has gained increasing attention for its various applications, such as copyright infringement detection, product recommendation and contextual advertising. Unfortunately, existing datasets do not include a wide range of logo images and lack diversity and coverage in logo categories, so they are not sufficient to support complex statistical models. Therefore, this article proposes Logo-2K+, a new large-scale publicly available real-world logo dataset with 2,341 categories and 167,140 images. Moreover, the article proposes a unified framework for logo classification, which is capable of discovering more informative logo regions and augmenting these image regions. We identify the main issues affecting logo classification, including the larger variety in logo appearance and the more complex backgrounds of real-world logo images, and analyze the unique characteristics of logos. We then review existing solutions for these issues, and finally elaborate research challenges and future directions in this field. To our knowledge, this is the largest logo dataset and is expected to further the development of scalable logo image recognition to benefit researchers in this field.

    • Jing Wang, Weiqing Min, Sujuan Hou, Shengnan Ma, Yuanjie Zheng, Haishuai Wang, Shuqiang Jiang. Logo-2K+: A Large-Scale Logo Dataset for Scalable Logo Classification. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI2020), February 7-12, 2020, New York, USA


    (AAAI 2020), February 7-12 2020, New York, USA
    [PDF]
  • Weiqing Min, Linhu Liu, Zhengdong Luo, Shuqiang Jiang, Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition.

    Abstract

    Recently, food recognition is gaining more attention in the multimedia community due to its various applications, e.g., multimodal food logs and personalized healthcare. Most existing methods directly extract visual features of the whole image using popular deep networks for food recognition without considering its own characteristics. Compared with other types of object images, food images generally do not exhibit distinctive spatial arrangement and common semantic patterns, and thus it is very hard to capture discriminative information. In this work, we achieve food recognition by developing an Ingredient-Guided Cascaded Multi-Attention Network (IG-CMAN), which is capable of sequentially localizing multiple informative image regions at multiple scales, from category-level to ingredient-level guidance, in a coarse-to-fine manner. At the first level, IG-CMAN generates the initial attentional region from the category-supervised network with a Spatial Transformer (ST). Taking this localized attentional region as the reference, IG-CMAN combines ST with LSTM to sequentially discover diverse attentional regions with fine-grained scales from the ingredient-guided sub-network in the following levels. Furthermore, we introduce a new dataset WikiFood-200 with 200 food categories from the list in the Wikipedia, about 200,000 food images and 319 ingredients. We conduct extensive experiments on two popular food datasets and the newly proposed WikiFood-200, demonstrating that our method achieves the state-of-the-art performance in Top-1 accuracy. Qualitative results along with visualization further show that IG-CMAN can introduce explainability for localized regions, and is able to learn relevant regions for ingredients.

    • Weiqing Min, Linhu Liu, Zhengdong Luo, Shuqiang Jiang. Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition. (ACM Multimedia 2019), 21-25 October 2019, Nice, France.


    (ACM Multimedia 2019), 21-25 October 2019, Nice, France
    [PDF]
  • Xinhang Song, Sixian Zhang, Yuyun Hua and Shuqiang Jiang, Aberrance-aware gradient-sensitive attentions for scene recognition with RGB-D videos.

    Abstract

    With the development of deep learning, previous approaches have achieved success in scene recognition with massive RGB data obtained from ideal environments. However, scene recognition in the real world may face various types of aberrant conditions caused by different unavoidable factors, such as the lighting variance of the environments and the limitations of cameras, which may damage the performance of previous models. Beyond ideal conditions, our motivation is to investigate robust scene recognition models for unconstrained environments. In this paper, we propose an aberrance-aware framework for RGB-D scene recognition, where several types of attentions, such as temporal, spatial and modal attentions, are integrated into spatio-temporal RGB-D CNN models to avoid the interference of RGB frame blurring, depth missing, and light variance. All the attentions are homogeneously obtained by projecting the gradient-sensitive maps of visual data into corresponding spaces. Particularly, the gradient maps are captured with convolutional operations with typically designed kernels, which can be seamlessly integrated into end-to-end CNN training. The experiments under different challenging conditions demonstrate the effectiveness of the proposed method. (A small gradient-attention sketch follows this entry.)

    • Xinhang Song, Sixian Zhang, Yuyun Hua and Shuqiang Jiang. Aberrance-aware gradient-sensitive attentions for scene recognition with RGB-D videos. (ACM Multimedia 2019), 21-25 October 2019, Nice, France.


    (ACM Multimedia 2019), 21-25 October 2019, Nice, France
    [PDF]
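    A small sketch of the gradient-sensitive idea above: fixed Sobel-style kernels estimate per-frame gradient magnitude, which is normalized into an attention map so that blurred frames or missing depth receive low weight. Single-channel input and max-normalization are assumptions.

        import torch
        import torch.nn.functional as F

        def gradient_attention(frames):
            """frames: (B, 1, H, W). Returns per-frame attention maps in [0, 1]."""
            kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
            ky = kx.transpose(2, 3)
            gx = F.conv2d(frames, kx, padding=1)
            gy = F.conv2d(frames, ky, padding=1)
            mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
            return mag / (mag.amax(dim=(2, 3), keepdim=True) + 1e-8)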
  • Xinhang Song, Bohan Wang, Gongwei Chen and Shuqiang Jiang, MUCH: MUtual Coupling enHancement of scene recognition and dense captioning.

    Abstract

    Due to the abstraction of scenes, comprehensive scene understanding requires semantic modeling in both global and local aspects. Scene recognition is usually researched from a global point of view, while dense captioning is typically studied for local regions. Previous works research the modeling of scene recognition and dense captioning separately. In contrast, we propose a joint learning framework that benefits from the mutual coupling of scene recognition and dense captioning models. Generally, these two tasks are coupled through two steps, 1) fusing the supervision by considering the contexts between scene labels and local captions, and 2) jointly optimizing semantically symmetric LSTM models. Particularly, in order to balance the bias between dense captioning and scene recognition, a scene-adaptive non-maximum suppression (NMS) method is proposed to emphasize the scene-related regions in the region proposal procedure, and a region-wise and category-wise weighted pooling method is proposed to avoid over-attention on particular regions in the local-to-global pooling procedure. For model training and evaluation, scene labels are manually annotated for the Visual Genome database. The experimental results on Visual Genome show the effectiveness of the proposed method. Moreover, the proposed method can also improve previous CNN based works on public scene databases, such as MIT67 and SUN397.

    • Xinhang Song, Bohan Wang, Gongwei Chen and Shuqiang Jiang. MUCH: MUtual Coupling enHancement of scene recognition and dense captioning. (ACM Multimedia 2019), 21-25 October 2019, Nice, France.


    (ACM Multimedia 2019), 21-25 October 2019, Nice, France
    [PDF]
  • Yongqing Zhu, Shuqiang Jiang, Attention-based Densely Connected LSTM for Video Captioning.

    Abstract

    Recurrent Neural Networks (RNNs), especially the Long Short-Term Memory (LSTM), have been widely used for video captioning, since they can cope with the temporal dependencies within both video frames and the corresponding descriptions. However, as the sequence gets longer, it becomes much harder to handle the temporal dependencies within the sequence. And in a traditional LSTM, previously generated hidden states other than the last one do not contribute directly to predicting the current word. This may lead to predicted words that are highly related to the last few states rather than to the overall context. To better capture long-range dependencies and directly leverage early generated hidden states, in this work we propose a novel model named Attention-based Densely Connected Long Short-Term Memory (DenseLSTM). In DenseLSTM, to ensure maximum information flow, all previous cells are connected to the current cell, which makes the updating of the current state directly related to all its previous states. Furthermore, an attention mechanism is designed to model the impacts of different hidden states. Because each cell is directly connected with all its successive cells, each cell has direct access to the gradients from later ones. In this way, long-range dependencies are captured more effectively. We perform experiments on two public video captioning datasets: the Microsoft Video Description Corpus (MSVD) and MSR-VTT, and the experimental results illustrate the effectiveness of DenseLSTM. (A dense-connection sketch follows this entry.)

    • Yongqing Zhu, Shuqiang Jiang. Attention-based Densely Connected LSTM for Video Captioning. (ACM Multimedia 2019), 21-25 October 2019, Nice, France.


    (ACM Multimedia 2019), 21-25 October 2019, Nice, France
    [PDF]
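    A hedged sketch of the dense-connection idea above: at every step the LSTM input is augmented with an attention-weighted summary of all previous hidden states, so early states act directly on the current prediction. The single-linear attention scorer and the concatenation scheme are assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class DenselyAttendedLSTM(nn.Module):
            def __init__(self, in_dim, hid_dim):
                super().__init__()
                self.cell = nn.LSTMCell(in_dim + hid_dim, hid_dim)
                self.att = nn.Linear(hid_dim, 1)

            def forward(self, x):                                  # x: (B, T, in_dim)
                B, T, _ = x.shape
                h = x.new_zeros(B, self.cell.hidden_size)
                c = x.new_zeros(B, self.cell.hidden_size)
                history, outputs = [h], []
                for t in range(T):
                    past = torch.stack(history, dim=1)             # (B, t+1, H)
                    w = F.softmax(self.att(past), dim=1)           # attention over past states
                    ctx = (w * past).sum(dim=1)
                    h, c = self.cell(torch.cat([x[:, t], ctx], dim=1), (h, c))
                    history.append(h)
                    outputs.append(h)
                return torch.stack(outputs, dim=1)                 # (B, T, H)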
  • Xiangyang Li, Shuqiang Jiang, Jungong Han, Learning Object Context for Dense Captioning

    Abstract

    Dense captioning is a challenging task which not only detects visual elements in images but also generates natural language sentences to describe them. Previous approaches do not leverage object information in images for this task. However, objects provide valuable cues to help predict the locations of caption regions as caption regions often highly overlap with objects (i.e. caption regions are usually parts of objects or combinations of them). Meanwhile, objects also provide important information for describing a target caption region as the corresponding description not only depicts its properties, but also involves its interactions with objects in the image. In this work, we propose a novel scheme with an object context encoding Long Short-Term Memory (LSTM) network to automatically learn complementary object context for each caption region, transferring knowledge from objects to caption regions. All contextual objects are arranged as a sequence and progressively fed into the context encoding module to obtain context features. Then both the learned object context features and region features are used to predict the bounding box offsets and generate the descriptions. The context learning procedure is in conjunction with the optimization of both location prediction and caption generation, thus enabling the object context encoding LSTM to capture and aggregate useful object context. Experiments on benchmark datasets demonstrate the superiority of our proposed approach over the state-of-the-art methods.

    • Xiangyang Li, Shuqiang Jiang, Jungong Han. “Learning Object Context for Dense Captioning” Thirty-Third AAAI Conference on Artificial Intelligence (AAAI2019), January 27 – February 1, 2019, Honolulu, Hawaii, USA


    (AAAI 2019), January 27 – February 1, 2019, Honolulu, Hawaii, USA
    [PDF]
  • Liang Li, Shuhui Wang, Shuqiang Jiang, Qingming Huang, Attentive Recurrent Neural Network for Weak-supervised Multi-label Image Classification

    Abstract

    Multi-label image classification is a fundamental and challenging task in computer vision, which has recently achieved significant progress by exploiting semantic relations among labels. However, the spatial positions of labels in multi-label images are usually not provided in real scenarios, which brings an insuperable barrier to conventional models. In this paper, we propose an end-to-end attentive recurrent neural network for multi-label image classification under only image-level supervision, which learns discriminative feature representations and models the label relations simultaneously. First, inspired by the attention mechanism, we propose a recurrent highlight network (RHN) which focuses on the most related regions in the image to learn discriminative feature representations for different objects in an iterative manner. Second, we develop a gated recurrent relation extractor (GRRE) to model the label relations using multiplicative gates in a recurrent fashion, which learns to decide how multiple labels of the image influence the relation extraction. Extensive experiments on three benchmark datasets show that our model outperforms the state of the art, and performs better on small-object categories and in scenarios with a large number of labels.

    • Liang Li, Shuhui Wang, Shuqiang Jiang, Qingming Huang, Attentive Recurrent Neural Network for Weak-supervised Multi-label Image Classification (ACM Multimedia 2018), October 22–26, 2018, Seoul, Korea


    (ACM Multimedia 2018), October 22–26, 2018, Seoul, Korea
    [PDF]
  • Yaohui Zhu, Shuqiang Jiang, Deep Structured Learning for Visual Relationship Detection.

    Abstract

    In the research area of computer vision and artificial intelligence, learning the relationships of objects is an important way to deeply understand images. Most recent works detect visual relationships by learning objects and predicates respectively at the feature level, but the dependencies between objects and predicates have not been fully considered. In this paper, we introduce deep structured learning for visual relationship detection. Specifically, we propose a deep structured model, which learns relationships by using feature-level prediction and label-level prediction to improve the learning ability of using feature-level prediction alone. The feature-level prediction learns relationships from discriminative features, and the label-level prediction learns relationships by capturing dependencies between objects and predicates based on the learnt relationship of the feature level. Additionally, we use a structured SVM (SSVM) loss function as our optimization goal, and decompose this goal into the subject, predicate, and object optimizations, which become simpler and more independent. Our experiments on the Visual Relationship Detection (VRD) dataset and the large-scale Visual Genome (VG) dataset validate the effectiveness of our method, which outperforms state-of-the-art methods.

    • Yaohui Zhu, Shuqiang Jiang, Deep Structured Learning for Visual Relationship Detection. Thirty-Second AAAI Conference on Artificial Intelligence (AAAI2018), February 2-7, 2018, New Orleans, Louisiana, USA


    Thirty-Second AAAI Conference on Artificial Intelligence (AAAI2018), February 2-7, 2018, New Orleans, Louisiana, USA
    [PDF]
  • Xinhang Song, Chengpeng Chen, Shuqiang Jiang, RGB-D Scene Recognition with Object-to-Object Relation,

    Abstract

    A scene is usually an abstract concept that consists of several less abstract entities such as objects or themes. It is very difficult to infer scenes from visual features due to the semantic gap between the abstract scenes and low-level visual features. Some alternative works recognize scenes with a two-step framework by representing images with intermediate representations of objects or themes. However, the object co-occurrences between scenes may lead to ambiguity for scene recognition. In this paper, we propose a framework to represent images with intermediate (object) representations that include spatial layout, i.e., an object-to-object relation (OOR) representation. In order to better capture the spatial information, the proposed OOR is adapted to RGB-D data. In the proposed framework, we first apply object detection techniques on RGB and depth images separately. Then the detected results of both modalities are combined with an RGB-D proposal fusion process. Based on the detected results, we extract the semantic OOR feature and regional convolutional neural network (CNN) features located by bounding boxes. Finally, the different features are concatenated and fed to the classifier for scene recognition. The experimental results on the SUN RGB-D and NYUD2 datasets illustrate the efficiency of the proposed method.

    • Xinhang Song, Chengpeng Chen, Shuqiang Jiang. “RGB-D Scene Recognition with Object-to-Object Relation” The 25th ACM Multimedia Conference (ACM MM) 2017 (long paper), CCF A


    ACM Multimedia 2017
    [PDF]
  • Weiqing Min, Shuqiang Jiang, Shuhui Wang, Jitao Sang, Shuhuan Mei, A Delicious Recipe Analysis Framework for Exploring Multi-Modal Recipes with Various Attributes,

    Abstract

    Human beings have developed a diverse food culture. Many factors like ingredients, visual appearance, courses (e.g., breakfast and lunch), flavor and geographical regions affect our food perception and choice. In this work, we focus on multi-dimensional food analysis based on these food factors to benefit various applications like summary and recommendation. As a solution, we propose a delicious recipe analysis framework to incorporate various types of continuous and discrete attribute features and multi-modal information from recipes. First, we develop a Multi-Attribute Theme Modeling (MATM) method, which can incorporate arbitrary types of attribute features to jointly model them and the textual content. We then utilize a multi-modal embedding method to build the correlation between the learned textual theme features from MATM and visual features from the deep learning network. By learning attribute-theme relations and multi-modal correlation, we are able to fulfill different applications, including (1) flavor analysis and comparison for better understanding the flavor patterns from different dimensions, such as the region and course, (2) region-oriented multi-dimensional food summary with both multi-modal and multi-attribute information and (3) multi-attribute oriented recipe recommendation. Furthermore, our proposed framework is flexible and enables easy incorporation of arbitrary types of attributes and modalities. Qualitative and quantitative evaluation results have validated the effectiveness of the proposed method and framework on the collected Yummly dataset.

    • Weiqing Min, Shuqiang Jiang, Shuhui Wang, Jitao Sang, Shuhuan Mei,A Delicious Recipe Analysis Framework for Exploring Multi-Modal Recipes with Various Attributes(ACM Multimedia 2017), October 23–27, 2017, Mountain View, CA, USA


    ACM Multimedia 2017
    [PDF]
  • Xinhang Song, Luis Herranz, Shuqiang Jiang, Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better than Transferring from RGB-CNNs.

    Abstract

    Scene recognition with RGB images has been extensively studied and has reached very remarkable recognition levels, thanks to convolutional neural networks (CNN) and large scene datasets. In contrast, current RGB-D scene data is much more limited, so approaches often leverage large RGB datasets by transferring pretrained RGB CNN models and fine-tuning with the target RGB-D dataset. However, we show that this approach has the limitation of hardly reaching the bottom layers, which is key to learning modality-specific features. In contrast, we focus on the bottom layers, and propose an alternative strategy to learn depth features combining local weakly supervised training from patches followed by global fine-tuning with images. This strategy is capable of learning very discriminative depth-specific features with limited depth images, without resorting to Places-CNN. In addition, we propose a modified CNN architecture to further match the complexity of the model and the amount of data available. For RGB-D scene recognition, depth and RGB features are combined by projecting them into a common space and further learning a multilayer classifier, which is jointly optimized in an end-to-end network. Our framework achieves state-of-the-art accuracy on NYU2 and SUN RGB-D in both depth-only and combined RGB-D data.

    • Xinhang Song, Luis Herranz, Shuqiang Jiang. “Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs” Thirty-First AAAI Conference on Artificial Intelligence (AAAI) 2017, CCF A


    AAAI 2017: 4271-4277, February 4-9, 2017, San Francisco, California, USA
    [PDF]
  • Sisi Liang, Xiangyang Li, Yongqing Zhu, Xue Li, Shuqiang Jiang, ISIA at the ImageCLEF 2017 Image Caption Task.
    CLEF (Working Notes) 2017, Dublin, Ireland, September 11-14, 2017
    [PDF]
  • Yaohui Zhu, Shuqiang Jiang, Xiangyang Li, Visual relationship detection with object spatial distribution.
    ICME 2017: 379-384, Hong Kong, China, July 10-14, 2017
    [PDF]
  • Xiaodan Zhang, Shengfeng He, Xinhang Song, Pengxu Wei, Shuqiang Jiang, Qixiang Ye, Jianbin Jiao, Rynson W. H. Lau, Keyword-driven image captioning via Context-dependent Bilateral LSTM.

    Abstract

    Image captioning has recently received much attention. Existing approaches, however, are limited to describing images with simple contextual information, which typically generate one sentence to describe each image with only a single contextual emphasis. In this paper, we address this limitation from a user perspective with a novel approach. Given some keywords as additional inputs, the proposed method would generate various descriptions according to the provided guidance. Hence, descriptions with different focuses can be generated for the same image. Our method is based on a new Context-dependent Bilateral Long Short-Term Memory (CDB-LSTM) model to predict a keyword-driven sentence by considering the word dependence. The word dependence is explored externally with a bilateral pipeline, and internally with a unified and joint training process. Experiments on the MS COCO dataset demonstrate that the proposed approach not only significantly outperforms the baseline method but also shows good adaptation and consistency with various keywords.

    • Xiaodan Zhang, Shengfeng He, Xinhang Song, Pengxu Wei, Shuqiang Jiang, Qixiang Ye, Jianbin Jiao, Rynson W.H. Lau. Keyword-driven Image Captioning via Context-dependent Bilateral LSTM. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2017), July 10-14, 2017, Hong Kong.


    ICME 2017: 781-786, Hong Kong, China, July 10-14, 2017
    [PDF]
  • Shuqiang Jiang, Weiqing Min, Xue Li, Huayang Wang, Jian Sun, Jiaqi Zhou, Dual Track Multimodal Automatic Learning through Human-Robot Interaction.

    Abstract

    Human beings are constantly improving their cognitive ability via automatic learning from the interaction with the environment. Two important aspects of automatic learning are visual perception and knowledge acquisition. The fusion of these two aspects is vital for improving the intelligence and interaction performance of robots. Many automatic knowledge extraction and recognition methods have been widely studied. However, little work focuses on integrating automatic knowledge extraction and recognition into a unified framework to enable joint visual perception and knowledge acquisition. To solve this problem, we propose a Dual Track Multimodal Automatic Learning (DTMAL) system, which consists of two components: Hybrid Incremental Learning (HIL) from the vision track and Multimodal Knowledge Extraction (MKE) from the knowledge track. HIL can incrementally improve the recognition ability of the system by learning new object samples and new object concepts. MKE is capable of constructing and updating the multimodal knowledge items based on the recognized new objects from HIL and other knowledge by exploring the multimodal signals. The fusion of the two tracks is a mutual promotion process and both jointly devote to the dual track learning. We have conducted the experiments through human-machine interaction and the experimental results validated the effectiveness of our proposed system.

    • Shuqiang Jiang, Weiqing Min, Xue Li, Huayang Wang, Jian Sun, Jiaqi Zhou, Dual Track Multimodal Automatic Learning through Human-Robot Interaction. IJCAI 2017: 4485-4491, Melbourne, Australia, August 19-25, 2017


    IJCAI 2017: 4485-4491, Melbourne, Australia, August 19-25, 2017
    [PDF]
  • Xinhang Song, Shuqiang Jiang, Luis Herranz, Combining Models from Multiple Sources for RGB-D Scene Recognition.

    Abstract

    Depth can complement RGB with useful cues about object volumes and scene layout. However, RGB-D image datasets are still too small for directly training deep convolutional neural networks (CNNs), in contrast to the massive monomodal RGB datasets. Previous works in RGB-D recognition typically combine two separate networks for RGB and depth data, pretrained with a large RGB dataset and then fine tuned to the respective target RGB and depth datasets. These approaches have several limitations: 1) only use low-level filters learned from RGB data, thus not being able to exploit properly depth-specific patterns, and 2) RGB and depth features are only combined at high-levels but rarely at lower-levels. In this paper, we propose a framework that leverages both knowledge acquired from large RGB datasets together with depth-specific cues learned from the limited depth data, obtaining more effective multi-source and multi-modal representations. We propose a multi-modal combination method that selects discriminative combinations of layers from the different source models and target modalities, capturing both high-level properties of the task and intrinsic low-level properties of both modalities.

    • Xinhang Song, Shuqiang Jiang, Luis Herranz. “Combining Models from Multiple Sources for RGB-D Scene Recognition”, in IJCAI 2017 (CCF A).


    IJCAI 2017: 4523-4529, Melbourne, Australia, August 19-25, 2017
    [PDF]
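    The abstract above motivates combining features drawn from several layers of separately trained RGB and depth networks rather than fusing only at the top. Below is a hedged PyTorch-style sketch of that general idea: pooled activations from a lower and a higher layer of each modality's backbone are concatenated and fed to a linear classifier. The toy backbones, layer choices, and dimensions are illustrative assumptions and do not reproduce the layer-selection procedure proposed in the paper.

```python
# Hedged sketch: fuse mid- and high-level features from an RGB network and a depth
# network, then classify scenes from the concatenated vector.  Backbones, layer
# choices and dimensions are illustrative, not the paper's selection scheme.
import torch
import torch.nn as nn

class SmallBackbone(nn.Module):
    """Toy CNN standing in for a pretrained source model."""
    def __init__(self, in_channels):
        super().__init__()
        self.low = nn.Sequential(nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
                                 nn.MaxPool2d(2))
        self.high = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))

    def forward(self, x):
        low = self.low(x)                  # lower-level, more modality-specific features
        high = self.high(low)              # higher-level, more task-specific features
        low_vec = low.mean(dim=(2, 3))     # global-average-pool the selected lower layer
        high_vec = high.flatten(1)
        return low_vec, high_vec

class MultiSourceFusion(nn.Module):
    """Concatenate selected layer features from RGB and depth models, then classify."""
    def __init__(self, num_classes):
        super().__init__()
        self.rgb = SmallBackbone(3)
        self.depth = SmallBackbone(1)      # raw depth here; HHA encodings would use 3 channels
        self.classifier = nn.Linear((32 + 64) * 2, num_classes)

    def forward(self, rgb_img, depth_img):
        feats = [*self.rgb(rgb_img), *self.depth(depth_img)]
        return self.classifier(torch.cat(feats, dim=1))

# Usage with dummy tensors (batch of 2, 64x64 inputs):
model = MultiSourceFusion(num_classes=19)  # number of scene categories in the target dataset
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
```

    The design point being illustrated is that fusion happens over a chosen set of layers from both modalities, so lower-level depth-specific cues are not discarded before combination.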
  • Yongqing Zhu, Xiangyang Li, Xue Li, Jian Sun, Xinhang Song, Shuqiang Jiang, Joint Learning of CNN and LSTM for Image Captioning.
    CLEF (Working Notes) 2016: 421-427, Évora, Portugal, September 5-8, 2016
    [PDF]
  • Luis Herranz, Shuqiang Jiang, Xiangyang Li, Scene Recognition with CNNs: Objects, Scales and Dataset Bias.
    CVPR 2016: 571-579, Las Vegas, NV, USA, June 27-30, 2016
    [PDF]
  • Xinda Liu, Xueming Wang, Shuqiang Jiang, RGB-D scene classification via heterogeneous model fusion.
    ICIP 2016: 499-503, Phoenix, AZ, USA, September 25-28, 2016
    [PDF]
  • Xiangyang Li, Xinhang Song, Luis Herranz, Yaohui Zhu, Shuqiang Jiang, Image Captioning with both Object and Scene Information.
    ACM Multimedia 2016: 1107-1110, Amsterdam, The Netherlands, October 15-19, 2016
    [PDF]
  • Xinhang Song, Shuqiang Jiang, Luis Herranz, Joint multi-feature spatial context for scene recognition in the semantic manifold.
    CVPR 2015: 1312-1320, Boston, MA, USA, June 7-12, 2015
    [PDF]
  • Luis Herranz, Ruihan Xu, Shuqiang Jiang, A probabilistic model for food image recognition in restaurants.
    ICME 2015: 1-6, Turin, Italy, June 29 - July 3, 2015
    [PDF]
  • Xiong Lv, Shuqiang Jiang, Luis Herranz, Shuang Wang, Hand-Object Sense: A Hand-held Object Recognition System Based on RGB-D Information.
    ACM Multimedia 2015: 765-766, Brisbane, Australia, October 26 - 30, 2015
    [PDF]
  • Xiaodan Zhang, Xinhang Song, Xiong Lv, Shuqiang Jiang, Qixiang Ye, Jianbin Jiao, Rich Image Description Based on Regions.
    ACM Multimedia 2015: 1315-1318, Brisbane, Australia, October 26 - 30, 2015
    [PDF]
  • Luis Herranz, Shuqiang Jiang, Accuracy and Specificity Trade-off in k-nearest Neighbors Classification.
    ACCV (2) 2014: 133-146, Singapore, Singapore, November 1-5, 2014
    [PDF]
  • Liang Li, Chenggang Yan, Xing Chen, Shuqiang Jiang, Seungmin Rho, Jian Yin, Baochen Jiang, Qingming Huang, Large scale image understanding with non-convex multi-task learning.
    GAMENETS 2014: 1-6, Beijing, China, November 25-27, 2014
    [PDF]
  • Lingyang Chu, Shuhui Wang, Yanyan Zhang, Shuqiang Jiang, Qingming Huang, Graph-Density-based visual word vocabulary for image retrieval.
    ICME 2014: 1-6, Chengdu, China, July 14-18, 2014
    [PDF]
  • Shuhui Wang, Zhenjun Wang, Shuqiang Jiang, Qingming Huang, Cross media topic analytics based on synergetic content and user behavior modeling.
    ICME 2014: 1-6, Chengdu, China, July 14-18, 2014
    [PDF]
  • Xiangyang Li, Shuqiang Jiang, Xinhang Song, Luis Herranz, Zhiping Shi, Multipath Convolutional-Recursive Neural Networks for Object Recognition.
    Intelligent Information Processing 2014: 269-277, Hangzhou, China, October 17-20, 2014
    [PDF]
  • Li Shen, Shuhui Wang, Gang Sun, Shuqiang Jiang, Qingming Huang, Multi-Level Discriminative Dictionary Learning towards Hierarchical Visual Categorization,
    IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2013), Portland, Oregon, USA, June 23-26, 2013
    [PDF]
  • Shuang Wang, Yunfeng Xue, Lingyang Chu, Yuhao Jiang, Shuqiang Jiang, ObjectSense: A Scalable Multi-Objects Recognition System Based on Partial-Duplicate Image Retrieval,
    ACM International Conference on Multimedia Retrieval (ICMR2013), April 16-19, 2013, Dallas, Texas, USA (Best Demo Award)
    [PDF]
  • Shuai Zheng, Luis Herranz, Shuqiang Jiang, Flexible Navigation in Smartphones and Tablets using Scalable Storyboards,
    ACM International Conference on Multimedia Retrieval (ICMR2013), April 16-19, 2013, Dallas, Texas, USA
    [PDF]
  • Shuhui Wang, Shuqiang Jiang, Qingming Huang, Qi Tian, Multi-feature Metric Learning with Knowledge Transfer among Semantics and Social Tagging,
    IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2012), Providence, Rhode Island, June 16-21, 2012
    [PDF]
  • Guorong Li, Lei Qin, Qingming Huang, Junbiao Pang, Shuqiang Jiang, Treat samples differently: Object tracking with semi-supervised online CovBoost,
    13th International Conference on Computer Vision (ICCV2011), November 6-13, 2011, Barcelona, Spain
    [PDF]
  • Shuhui Wang, Qingming Huang, Shuqiang Jiang, Qi Tian, Efficient Lp-norm Multiple Feature Metric Learning for Image Data Mining,
    20th ACM Conference on Information and Knowledge Management (CIKM 2011), October 24-28, 2011, Glasgow, UK
    [PDF]
  • Tianlong Chen, Shuqiang Jiang, Lingyang Chu, Qingming Huang, Detection and Location of Near-Duplicate Video Sub-Clips by Finding Dense Subgraphs,
    ACM Multimedia, Scottsdale, Arizona, USA, Nov. 28 - Dec. 1, 2011
    [PDF]
  • Liang Li, Shuqiang Jiang, Qingming Huang, Learning Image Vicept Description via Mixed-Norm Regularization for Large Scale Semantic Image Search,
    IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2011), Colorado Springs, USA. June 20-25, 2011
    [PDF]
  • Chunxi Liu, Qingming Huang, Shuqiang Jiang, Changsheng Xu, The Third Eye: Mining the Visual Cognition across Multilanguage Communities,
    ACM Multimedia (Full Paper), Florence, Italy, October 25-29, 2010
    [PDF]
  • Shuhui Wang, Shuqiang Jiang, Qingming Huang, and Qi Tian, S3MKL: Scalable Semi-supervised Multiple Kernel Learning for Image Data Mining,
    ACM Multimedia (Full Paper), Florence, Italy, October 25-29, 2010
    [PDF]
  • Shiliang Zhang, Qingming Huang, Gang Hua, Shuqiang Jiang, Wen Gao, and Qi Tian. Building Contextual Visual Vocabulary for Large-scale Image Applications,
    ACM Multimedia (Full Paper), Florence, Italy, October 25-29, 2010
    [PDF]
  • Dawei Liang, Qingming Huang, Hongxun Yao, Shuqiang Jiang, Rongrong Ji, and Wen Gao, Novel Observation Model for Probabilistic Object Tracking.
    IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2010), San Francisco, California, June 13-18, 2010
    [PDF]
  • Zhipeng Wu, Shuqiang Jiang, Qingming Huang, Near-Duplicate Video Matching with Transformation Recognition,
    ACM International Conference on Multimedia, Beijing, China, Oct. 19-24, 2009, pp. 549-552
    [PDF]
  • Zhipeng Wu, Shuqiang Jiang, Qingming Huang, Friend Recommendation According to Appearances on Photos,
    ACM Multimedia, Beijing, China, pp. 987-988, Oct. 2009
    [PDF]
  • Chunxi Liu, Shuqiang Jiang, Qingming Huang, Naming Faces in Broadcast News Video By Image Google,
    ACM International Conference on Multimedia, Vancouver, BC, Canada, pp. 717-720, Oct. 27-31, 2008
    [PDF]
  • Huiying Liu, Shuqiang Jiang, Qingming Huang, Changsheng Xu, A Generic Virtual Content Insertion System Based on Visual Attention Model,
    ACM Multimedia, Vancouver, BC, Canada, pp. 379-388, Oct. 27-31, 2008
    [PDF]
  • Huiying Liu, Shuqiang Jiang, Qingming Huang, Changsheng Xu, Region-Based Visual Attention Analysis with Its Application in Image Browsing on Small Displays,
    ACM International Conference on Multimedia, Augsburg, Germany, Sept. 24-29, 2007
    [PDF]
  • Shuqiang Jiang, Qixiang Ye, Wen Gao, Tiejun Huang, A new method to segment playfield and its applications in match analysis in sports video.
    ACM Multimedia, New York, NY, USA, Oct. 10-16, 2004
    [PDF]