Being a Supercook: Joint Food Attributes and Multimodal Content Modeling for Recipe Retrieval and Exploration

Weiqing Min, Shuqiang Jiang, Jitao Sang, Huayang Wang, Xinda Liu, and Luis Herranz


This paper considers the problem of recipe-oriented image-ingredient correlation learning with multiple attributes for recipe retrieval and exploration. Existing methods mainly focus on food visual information for recognition, whereas we model visual information, textual content (e.g., ingredients), and attributes (e.g., cuisine and course) jointly to solve extended recipe-oriented problems, such as multi-modal cuisine classification and attribute-enhanced food image retrieval. As a solution, we propose a Multi-Modal Multi-Task Deep Belief Network (M3TDBN) to learn a joint image-ingredient representation regularized by different attributes. By grouping ingredients into visible ingredients (those visible in the food image, e.g., "chicken" and "mushroom") and non-visible ingredients (e.g., "salt" and "oil"), M3TDBN is capable of learning both the mid-level visual representation shared between images and visible ingredients and the non-visual representation. Furthermore, in order to exploit different attributes to improve the inter-modality correlation, M3TDBN incorporates multi-task learning to make the different attributes collaborate with each other. Based on the proposed M3TDBN, we exploit the derived deep features and the discovered correlations for three novel applications: (1) multi-modal cuisine classification, (2) attribute-augmented cross-modal recipe image retrieval, and (3) ingredient and attribute inference from food images. The proposed approach is evaluated on a dataset constructed from Yummly, and the evaluation results validate its effectiveness.
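The core modeling idea, a shared joint representation over image and ingredient features that is regularized by several attribute prediction tasks trained together, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the single fused layer stands in for the DBN's top joint layer, and all dimensions, the toy data, and the two attribute heads ("cuisine", "course") are illustrative assumptions.

```python
# Minimal sketch of the multi-modal, multi-task fusion idea behind M3TDBN.
# NOT the paper's implementation: a single shared layer fuses image and
# ingredient features, and multiple attribute heads share that representation.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy modality inputs: image descriptors and ingredient indicator vectors.
n, d_img, d_ing, d_joint = 8, 16, 10, 12
x_img = rng.normal(size=(n, d_img))
x_ing = rng.random(size=(n, d_ing))

# Shared joint layer fusing both modalities (stand-in for the DBN top layer).
W_img = rng.normal(scale=0.1, size=(d_img, d_joint))
W_ing = rng.normal(scale=0.1, size=(d_ing, d_joint))
h = relu(x_img @ W_img + x_ing @ W_ing)  # joint image-ingredient representation

# Two hypothetical attribute tasks (e.g., cuisine: 4 classes, course: 3),
# trained jointly so each attribute regularizes the shared representation.
heads = {"cuisine": rng.normal(scale=0.1, size=(d_joint, 4)),
         "course": rng.normal(scale=0.1, size=(d_joint, 3))}
labels = {"cuisine": rng.integers(0, 4, size=n),
          "course": rng.integers(0, 3, size=n)}

def multitask_loss(h, heads, labels):
    """Sum of per-task cross-entropy losses over all attribute heads."""
    total = 0.0
    for task, W in heads.items():
        p = softmax(h @ W)
        total += -np.mean(np.log(p[np.arange(n), labels[task]] + 1e-12))
    return total

loss = multitask_loss(h, heads, labels)
print(f"joint multi-task loss: {loss:.4f}")
```

Because the attribute heads share `h`, gradients from every task would flow into the same fusion weights during training, which is the multi-task regularization effect the abstract describes.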

  • Weiqing Min, Shuqiang Jiang, Jitao Sang, Huayang Wang, Xinda Liu, Luis Herranz: Being a Supercook: Joint Food Attributes and Multimodal Content Modeling for Recipe Retrieval and Exploration. IEEE Trans. Multimedia 19(5): 1100-1113 (2017)