Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces
This item is licensed under: Creative Commons Attribution 4.0 International
Title: | Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces |
Authors: | Yanagi, Rintaro | Togo, Ren | Ogawa, Takahiro | Haseyama, Miki |
Keywords: | Visualization | Feature extraction | Training | Semantics | Generative adversarial networks | Computational modeling | Multimedia information retrieval | cross-modal retrieval | vision and language | text-to-image model | image-to-text model |
Issue Date: | 20-May-2020 |
Publisher: | IEEE (Institute of Electrical and Electronics Engineers) |
Journal Title: | IEEE Access |
Volume: | 8 |
Start Page: | 96777 |
End Page: | 96786 |
Publisher DOI: | 10.1109/ACCESS.2020.2995815 |
Abstract: | A new approach that drastically improves cross-modal retrieval performance in vision and language (hereinafter referred to as "vision and language retrieval") is proposed in this paper. Vision and language retrieval takes data of one modality as a query to retrieve relevant data of another modality, and it enables flexible retrieval across different modalities. Most of the existing methods learn optimal embeddings of visual and lingual information to a single common representation space. However, we argue that the forced embedding optimization results in loss of key information for sentences and images. In this paper, we propose an effective utilization of representation spaces in a simple but robust vision and language retrieval method. The proposed method makes use of multiple individual representation spaces through text-to-image and image-to-text models. Experimental results showed that the proposed approach enhances the performance of existing methods that embed visual and lingual information to a single common representation space. |
Rights: | © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | https://creativecommons.org/licenses/by/4.0/ |
Type: | article |
URI: | http://hdl.handle.net/2115/78959 |
Appears in Collections: | 情報科学院・情報科学研究院 (Graduate School of Information Science and Technology / Faculty of Information Science and Technology) > 雑誌発表論文等 (Peer-reviewed Journal Articles, etc) |