Date of publication: 2020-07-29
2. Expressional Region Retrieval (Xiaoqian Guo, Xiangyang Li, Shuqiang Jiang)
Image retrieval is a long-standing topic in the multimedia community due to its various applications, e.g., product search and artwork retrieval in museums. The regions in images contain a wealth of information: users may be interested in the objects presented in image regions or in the relationships between them. However, previous retrieval methods are either limited to a single object in an image or attend to the entire visual scene. In this paper, we introduce a new task called expressional region retrieval, in which the query is formulated as an image region together with an associated description. The goal is to find images containing content similar to the query and to localize the corresponding regions within them. To the best of our knowledge, this task has not been explored before. We propose a framework to address it: region proposals are first generated by region detectors and language features are extracted; a Gated Residual Network (GRN) then takes the language information as a gate to control the transformation of the visual features. In this way, the combined visual and language representation is more specific and discriminative for expressional region retrieval. We evaluate our method on a newly established benchmark constructed from the Visual Genome dataset. Experimental results demonstrate that our model effectively utilizes both visual and language information, outperforming the baseline methods.
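A minimal sketch of how such a language-gated residual transform of region features might look is given below (in PyTorch). The module name GatedResidualBlock, the feature dimensions, and the exact fusion scheme are illustrative assumptions, not the authors' released design.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Sketch of a language-gated residual transform of visual region features.

    The language embedding is projected to a sigmoid gate that modulates a
    transformed copy of the visual features; a residual connection preserves
    the original visual information. Dimensions and layer choices are
    assumptions for illustration, not the paper's exact design.
    """

    def __init__(self, vis_dim=2048, lang_dim=300, hidden_dim=1024):
        super().__init__()
        self.vis_proj = nn.Sequential(
            nn.Linear(vis_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vis_dim),
        )
        # Language features act as a gate on the visual transformation.
        self.gate = nn.Sequential(
            nn.Linear(lang_dim, vis_dim),
            nn.Sigmoid(),
        )

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, vis_dim) region features; lang_feat: (B, lang_dim)
        g = self.gate(lang_feat)               # gate values in (0, 1)
        transformed = self.vis_proj(vis_feat)  # candidate update of visual features
        return vis_feat + g * transformed      # gated residual combination
```

Under this reading, the fused embedding of the query (region plus description) and the embeddings of candidate regions could then be ranked by cosine similarity to perform the retrieval and localization.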
Recognizing visual categories from semantic descriptions is a promising way to extend the capability of a visual classifier beyond the concepts represented in the training data (i.e., seen categories). This problem is addressed by (generalized) zero-shot learning (GZSL) methods, which leverage semantic descriptions (e.g., label embeddings, attributes) to connect unseen categories to seen ones. Conventional GZSL methods are designed mostly for object recognition. In this paper we focus on zero-shot scene recognition, a more challenging setting with hundreds of categories whose differences can be subtle and often localized in certain objects or regions. Conventional GZSL representations are not rich enough to capture these local discriminative differences. To address these limitations, we propose a feature generation framework with two novel components: 1) multiple sources of semantic information (i.e., attributes, word embeddings, and descriptions), and 2) region descriptions that enhance scene discrimination. To generate synthetic visual features we propose a two-step generative approach, where local descriptions are sampled and used as conditions to generate visual features; the generated features are then aggregated and used together with real features to train a joint classifier. To evaluate the proposed method, we introduce a new dataset for zero-shot scene recognition with multiple semantic annotations. Experimental results on the proposed dataset and the SUN Attribute dataset illustrate the effectiveness of the proposed method.
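The two-step generation described above (sample local descriptions, then condition a feature generator on them) could look roughly like the following sketch. The generator architecture, the description_bank data structure, and the mean-pooling aggregation are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ConditionalFeatureGenerator(nn.Module):
    """Illustrative conditional generator: noise + semantic condition -> visual feature."""

    def __init__(self, noise_dim=128, cond_dim=300, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim),
            nn.ReLU(),
        )

    def forward(self, noise, cond):
        return self.net(torch.cat([noise, cond], dim=1))


def synthesize_scene_features(generator, description_bank, n_per_class=10, noise_dim=128):
    """Two-step synthesis sketch: (1) sample local region descriptions of a class,
    (2) use them as conditions to generate visual features, then aggregate per class.

    description_bank: assumed dict mapping class name -> tensor (N, cond_dim)
    of embedded region descriptions for that class.
    """
    synthetic = {}
    for cls, descs in description_bank.items():
        idx = torch.randint(0, descs.size(0), (n_per_class,))  # step 1: sample descriptions
        cond = descs[idx]
        noise = torch.randn(n_per_class, noise_dim)
        feats = generator(noise, cond)                          # step 2: generate features
        synthetic[cls] = feats.mean(dim=0)                      # aggregation (mean pooling assumed)
    return synthetic
```

The synthetic per-class features would then be mixed with real features from seen categories to train the joint classifier mentioned in the abstract.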
Viewpoint variation is a major challenge for vehicle re-identification (ReID) tasks. Vehicle images taken from different viewpoints differ in visual appearance, which causes severe feature misalignment and deformation. Traditional methods use the original image as the input; when comparing two images under different perspectives, they cannot model these differences, which reduces ReID accuracy. In this paper, we propose a new module called Part Perspective Transform (PPT) to handle the perspective variation problem. We first locate the different parts by keypoint detection, and then apply a perspective transform to each part individually to map it to a uniform perspective, which alleviates the feature misalignment and deformation problem. Because the visible regions differ across vehicles, we design a dynamic selective batch-hard triplet loss that selects the hardest visible regions in a batch to generate triplets and dynamically filters invalid ones. This loss guides the network to focus on the commonly visible regions. Our method achieves the best results on three different vehicle ReID datasets, demonstrating its effectiveness.
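The abstract does not spell out the selection mechanics, but a visibility-aware batch-hard triplet loss in the spirit described might be sketched as follows; the masking and filtering strategy shown here is an assumption, not the authors' exact formulation.

```python
import torch

def dynamic_selective_batch_hard_triplet(part_feats, visibility, labels, margin=0.3):
    """Sketch of a visibility-aware batch-hard triplet loss.

    part_feats: (B, P, D) per-part embeddings after perspective transform.
    visibility: (B, P) binary mask, 1 if the part is visible in that image.
    labels:     (B,) vehicle identities.
    Distances are computed only over parts visible in both images of a pair;
    the hardest positive/negative per anchor is mined, and anchors without
    any valid pair are filtered out (an assumed reading of the abstract).
    """
    B, P, D = part_feats.shape
    # Pairwise squared distance per part: (B, B, P)
    diff = part_feats.unsqueeze(1) - part_feats.unsqueeze(0)
    part_dist = (diff ** 2).sum(-1)
    # Parts visible in both images of each pair: (B, B, P)
    common = visibility.unsqueeze(1) * visibility.unsqueeze(0)
    valid_pair = common.sum(-1) > 0
    # Mean distance over commonly visible parts only.
    dist = (part_dist * common).sum(-1) / common.sum(-1).clamp(min=1)

    same_id = labels.unsqueeze(1) == labels.unsqueeze(0)
    eye = torch.eye(B, dtype=torch.bool, device=part_feats.device)
    pos_mask = same_id & ~eye & valid_pair
    neg_mask = ~same_id & valid_pair

    losses = []
    for a in range(B):
        if not pos_mask[a].any() or not neg_mask[a].any():
            continue  # dynamically filter anchors without valid triplets
        hardest_pos = dist[a][pos_mask[a]].max()  # farthest same-identity sample
        hardest_neg = dist[a][neg_mask[a]].min()  # closest different-identity sample
        losses.append(torch.relu(hardest_pos - hardest_neg + margin))
    if not losses:
        return part_feats.new_tensor(0.0)
    return torch.stack(losses).mean()
```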