ISIA Food-500: A dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network

Weiqing Min, Linhu Liu, Zhiling Wang, Zhengdong Luo, Xiaoming Wei, Xiaolin Wei, Shuqiang Jiang
(ACMMM 2020)
[PDF]

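Food is closely tied to human behavior, health, and culture. Food big data generated by ubiquitous networks such as social networks, mobile networks, and the Internet of Things, together with the rapid development of artificial intelligence, especially deep learning, has given rise to a new interdisciplinary research field: food computing [Min2019-ACM CSUR]. As one of the core tasks of food computing, food image recognition is also an important branch of fine-grained visual recognition in computer vision. It therefore has significant theoretical research value and broad application prospects in smart health, intelligent food equipment, smart catering, smart retail, and smart home. Building on our group's earlier work on food recognition [Jiang2020-IEEE TIP][Min2019-ACMMM], this paper proposes a new food dataset, ISIA Food-500. The dataset contains 500 categories and about 400,000 images, exceeding existing benchmark datasets in both the number of categories and the number of images. On this basis, we propose a new network, SGLANet, which jointly learns global and local visual features of food images for food recognition, and we conduct experimental analysis and validation on ISIA Food-500 and existing benchmark datasets.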

  • [Min2019-ACM CSUR] Weiqing Min, Shuqiang Jiang, Linhu Liu, Yong Rui, Ramesh Jain. A Survey on Food Computing. ACM Computing Surveys, 52(5), 92:1-92:36, 2019
  • [Jiang2020-IEEE TIP] Shuqiang Jiang, Weiqing Min, Linhu Liu, Zhengdong Luo. Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition. IEEE Trans. Image Processing, vol. 29, pp. 265-276, 2020
  • [Min2019-ACMMM] Weiqing Min, Linhu Liu, Zhengdong Luo, Shuqiang Jiang. Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition. ACM Multimedia 2019, 21-25 October 2019, Nice, France
  • Abstract

    Food recognition has a variety of applications in the multimedia community. To encourage further progress in food recognition, we introduce a new food dataset called ISIA Food-500. The dataset contains 500 categories and about 400,000 images, making it a more comprehensive food dataset that surpasses existing benchmark datasets in both category coverage and data volume. We further propose a new network architecture (SGLANet) to jointly learn food-oriented global and local visual features for food recognition. SGLANet consists of two sub-networks, namely the Global Feature Learning Subnetwork (GloFLS) and the Local Feature Learning Subnetwork (LocFLS). GloFLS first utilizes hybrid spatial-channel attention to obtain more discriminative features for each layer, and then aggregates these features from different layers into global-level features. LocFLS generates attentional regions from different regions via cascaded Spatial Transformers (STs), and further aggregates these multi-scale regional features from different layers into a local-level representation. These two types of features are finally fused into a comprehensive representation for food recognition. Extensive experiments on ISIA Food-500 and two other popular benchmark datasets demonstrate the effectiveness of our proposed method.
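
    The global/local pipeline described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the attention formulas, so the sketch assumes a simple sigmoid channel gate followed by a sigmoid spatial gate, global average pooling for layer aggregation, and concatenation for fusion. A fixed center crop stands in for the cascaded Spatial Transformers, which in the paper are learned. All function names, weight shapes, and feature-map sizes below are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w):
    """Reweight channels. feat: (C, H, W); w: (C, C) hypothetical weights."""
    squeezed = feat.mean(axis=(1, 2))      # one descriptor per channel, shape (C,)
    gate = sigmoid(w @ squeezed)           # per-channel gate in (0, 1)
    return feat * gate[:, None, None]

def spatial_attention(feat, v):
    """Reweight locations. v: (C,) hypothetical 1x1-convolution weights."""
    score = np.einsum("c,chw->hw", v, feat)  # one score per location, shape (H, W)
    gate = sigmoid(score)                    # per-location gate in (0, 1)
    return feat * gate[None, :, :]

def hybrid_attention(feat, w, v):
    """Assumed ordering: channel gate first, then spatial gate."""
    return spatial_attention(channel_attention(feat, w), v)

def aggregate_layers(feats):
    """Pool each layer's map to a vector and concatenate across layers."""
    return np.concatenate([f.mean(axis=(1, 2)) for f in feats])

def center_crop(feat):
    """Fixed crop used only as a stand-in for a learned Spatial Transformer."""
    _, h, w = feat.shape
    return feat[:, h // 4 : h - h // 4, w // 4 : w - w // 4]

rng = np.random.default_rng(0)
# Two hypothetical backbone layers with different channel counts and resolutions.
layers = [rng.normal(size=(8, 14, 14)), rng.normal(size=(16, 7, 7))]

# Global branch: attend within each layer, then aggregate across layers.
attended = [
    hybrid_attention(f, rng.normal(size=(f.shape[0], f.shape[0])), rng.normal(size=f.shape[0]))
    for f in layers
]
global_vec = aggregate_layers(attended)

# Local branch: regional features from each layer, aggregated the same way.
local_vec = aggregate_layers([center_crop(f) for f in layers])

# Fusion: concatenate global-level and local-level representations.
fused = np.concatenate([global_vec, local_vec])
print(fused.shape)
```

    In a trained model the gates and region locations would be learned end to end and the fused vector would feed a classifier over the 500 categories; the sketch only shows how the two feature streams are shaped and combined.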


    • Weiqing Min, Linhu Liu, Zhiling Wang, Zhengdong Luo, Xiaoming Wei, Xiaolin Wei, Shuqiang Jiang. 2020. ISIA Food-500: A dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network. In Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3414031