Hokkaido University Collection of Scholarly and Academic Papers (HUSCAP)

Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces

This item is licensed under: Creative Commons Attribution 4.0 International

Files in This Item:

The file(s) associated with this item can be obtained from the following URL: https://doi.org/10.1109/ACCESS.2020.2995815


Title: Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces
Authors: Yanagi, Rintaro
Togo, Ren
Ogawa, Takahiro
Haseyama, Miki
Keywords: Visualization
Feature extraction
Training
Semantics
Generative adversarial networks
Computational modeling
Multimedia information retrieval
cross-modal retrieval
vision and language
text-to-image model
image-to-text model
Issue Date: 20-May-2020
Publisher: IEEE (Institute of Electrical and Electronics Engineers)
Journal Title: IEEE Access
Volume: 8
Start Page: 96777
End Page: 96786
Publisher DOI: 10.1109/ACCESS.2020.2995815
Abstract: A new approach that drastically improves cross-modal retrieval performance in vision and language (hereinafter referred to as “vision and language retrieval”) is proposed in this paper. Vision and language retrieval takes data of one modality as a query to retrieve relevant data of another modality, and it enables flexible retrieval across different modalities. Most existing methods learn optimal embeddings of visual and lingual information into a single common representation space. However, we argue that this forced embedding optimization results in the loss of key information from sentences and images. In this paper, we propose an effective utilization of representation spaces in a simple but robust vision and language retrieval method. The proposed method makes use of multiple individual representation spaces through text-to-image and image-to-text models. Experimental results showed that the proposed approach enhances the performance of existing methods that embed visual and lingual information into a single common representation space.
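
The abstract describes retrieval through modality-specific representation spaces rather than a single common embedding. As a rough illustration of that idea only (not the authors' implementation), the following minimal Python sketch assumes hypothetical callables text_to_image, image_to_text, image_features, and text_features standing in for the paper's generative and feature-extraction models; it ranks gallery items by comparing a generated counterpart of the query within the target modality's own feature space.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve_images(query_text, gallery_images, text_to_image, image_features, k=5):
    # Text-to-image retrieval in the image-specific space: generate an
    # image from the query sentence (e.g., with a conditional GAN), then
    # rank gallery images by similarity to the generated image.
    query_vec = image_features(text_to_image(query_text))
    scores = [cosine(query_vec, image_features(img)) for img in gallery_images]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

def retrieve_texts(query_image, gallery_texts, image_to_text, text_features, k=5):
    # Image-to-text retrieval in the text-specific space, symmetric to the
    # function above: caption the query image, then rank gallery sentences.
    query_vec = text_features(image_to_text(query_image))
    scores = [cosine(query_vec, text_features(txt)) for txt in gallery_texts]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

Because each comparison happens in a space trained for a single modality, no information is forced into a shared embedding; under these assumptions, the generative models carry the cross-modal translation instead.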
Rights: © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
https://creativecommons.org/licenses/by/4.0/
Type: article
URI: http://hdl.handle.net/2115/78959
Appears in Collections:情報科学院・情報科学研究院 (Graduate School of Information Science and Technology / Faculty of Information Science and Technology) > 雑誌発表論文等 (Peer-reviewed Journal Articles, etc)
