Learning shared embedding representation of motion and text using contrastive learning

Horie, Junpei; Noguchi, Wataru; Iizuka, Hiroyuki; Yamamoto, Masahito

doi:10.1007/s10015-022-00840-0


Hokkaido University Collection of Scholarly and Academic Papers (HUSCAP)


Files in This Item:	horie_arob_v03.pdf (PDF, 5.18 MB)

Please use this identifier to cite or link to this item: http://hdl.handle.net/2115/91020

Title:	Learning shared embedding representation of motion and text using contrastive learning
Authors:	Horie, Junpei
	Noguchi, Wataru
	Iizuka, Hiroyuki
	Yamamoto, Masahito
Keywords:	Multi-modal learning
	Contrastive learning
	Skeleton-based action recognition
	Motion retrieval
Issue Date:	27-Dec-2022
Publisher:	Springer
Journal Title:	Artificial life and robotics
Volume:	28
Issue:	1
Start Page:	148
End Page:	157
Publisher DOI:	10.1007/s10015-022-00840-0
Abstract:	Multimodal learning of motion and text seeks correspondences between skeletal time-series data acquired by motion capture and the text that describes the motion. Good associations in this field enable both motion-to-text and text-to-motion applications. However, previous methods failed to associate motion with text in a way that accounts for details of the descriptions, for example, whether the left or the right arm moves. In this paper, we propose a motion-text contrastive learning method that establishes correspondences between motion and text in a shared embedding space. We show that our model outperforms previous studies on the task of action recognition. We also show qualitatively that, by using a pre-trained text encoder, our model can perform motion retrieval with detailed correspondences between motion and text.
Rights:	This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1007/s10015-022-00840-0.
Type:	article (author version)
URI:	http://hdl.handle.net/2115/91020
Appears in Collections:	Graduate School of Information Science and Technology / Faculty of Information Science and Technology > Peer-reviewed Journal Articles, etc
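
Illustrative sketch: the abstract describes contrastive learning of motion and text in a shared embedding space, but this record does not state the loss function. The following is a minimal sketch, assuming a CLIP-style symmetric cross-entropy (InfoNCE) objective over cosine similarities, which may differ from the authors' formulation. All function names, the temperature value 0.07, and the toy 2-D embeddings are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(motion_emb, text_emb, temperature=0.07):
    """Cosine similarity between every motion/text pair, divided by temperature.

    Row i, column j compares motion i with text j. The temperature value is an
    assumption chosen for illustration.
    """
    motions = [l2_normalize(v) for v in motion_emb]
    texts = [l2_normalize(v) for v in text_emb]
    return [
        [sum(a * b for a, b in zip(m, t)) / temperature for t in texts]
        for m in motions
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(motion_emb, text_emb, temperature=0.07):
    """Average of the motion-to-text and text-to-motion cross-entropy terms."""
    logits = similarity_logits(motion_emb, text_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def retrieve_motion(text_vec, motion_emb):
    """Return the index of the motion embedding closest to a text embedding."""
    query = l2_normalize(text_vec)
    scores = [
        sum(a * b for a, b in zip(query, l2_normalize(m))) for m in motion_emb
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (hypothetical): motion i is paired with text i.
motion = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[2.0, 0.1], [0.1, 3.0]]
text_swapped = [[0.1, 3.0], [2.0, 0.1]]

loss_aligned = symmetric_contrastive_loss(motion, text_aligned)
loss_swapped = symmetric_contrastive_loss(motion, text_swapped)
```

In this toy setting the loss is lower when paired motion and text embeddings point in the same direction than when the pairs are swapped, and `retrieve_motion` picks the motion whose embedding is nearest to the query text embedding. In a real system the embeddings would come from trained motion and text encoders rather than hand-written vectors.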

Submitter: Yamamoto, Masahito (山本雅人)
