Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces
This item is licensed under: Creative Commons Attribution 4.0 International
Title: | Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces |
Authors: | Yanagi, Rintaro | Togo, Ren | Ogawa, Takahiro | Haseyama, Miki |
Keywords: | Visualization | Feature extraction | Training | Semantics | Generative adversarial networks | Computational modeling | Multimedia information retrieval | cross-modal retrieval | vision and language | text-to-image model | image-to-text model |
Issue Date: | 20-May-2020 |
Publisher: | IEEE (Institute of Electrical and Electronics Engineers) |
Journal Title: | IEEE Access |
Volume: | 8 |
Start Page: | 96777 |
End Page: | 96786 |
Publisher DOI: | 10.1109/ACCESS.2020.2995815 |
Abstract: | A new approach that drastically improves cross-modal retrieval performance in vision and language (hereinafter referred to as "vision and language retrieval") is proposed in this paper. Vision and language retrieval takes data of one modality as a query to retrieve relevant data of another modality, and it enables flexible retrieval across different modalities. Most of the existing methods learn optimal embeddings of visual and lingual information to a single common representation space. However, we argue that the forced embedding optimization results in loss of key information for sentences and images. In this paper, we propose an effective utilization of representation spaces in a simple but robust vision and language retrieval method. The proposed method makes use of multiple individual representation spaces through text-to-image and image-to-text models. Experimental results showed that the proposed approach enhances the performance of existing methods that embed visual and lingual information to a single common representation space. |
Rights: | © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | https://creativecommons.org/licenses/by/4.0/ |
Type: | article |
URI: | http://hdl.handle.net/2115/78959 |
Appears in Collections: | 情報科学院・情報科学研究院 (Graduate School of Information Science and Technology / Faculty of Information Science and Technology) > 雑誌発表論文等 (Peer-reviewed Journal Articles, etc) |