Average-case linear-time similar substring searching by the q-gram distance

Hanada, Hiroyuki; Kudo, Mineichi; Nakamura, Atsuyoshi

doi:10.1016/j.tcs.2014.02.022


Please use this identifier to cite or link to this item: http://hdl.handle.net/2115/58429

Title:	Average-case linear-time similar substring searching by the q-gram distance
Authors:	Hanada, Hiroyuki
	Kudo, Mineichi
	Nakamura, Atsuyoshi
Keywords:	String searching
	Approximate string matching
	q-Gram distance
Issue Date:	17-Apr-2014
Publisher:	Elsevier
Journal Title:	Theoretical Computer Science
Volume:	530
Start Page:	23
End Page:	41
Publisher DOI:	10.1016/j.tcs.2014.02.022
Abstract:	In this paper we consider the problem of similar substring searching in the q-gram distance. The q-gram distance d_q(x, y) is a similarity measure between two strings x and y defined by the number of different q-grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x| + |y|) vs. O(|x||y|), and its good approximation of the edit distance. However, if this distance is applied to the problem of finding all strings in a long text t that are similar to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When "similar" means k or less in d_q, their time complexities are O(|t|k + |p|) and O(|t| log k + |p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t| + |p|), although their worst-case complexities are still O(|t|k + |p|) and O(|t| log k + |p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(log k) time. Experimental results support their theoretical average-case complexities. © 2014 Elsevier B.V. All rights reserved.
Type:	article (author version)
URI:	http://hdl.handle.net/2115/58429
Appears in Collections:	Graduate School of Information Science and Technology / Faculty of Information Science and Technology > Peer-reviewed Journal Articles, etc
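
The q-gram distance described in the abstract can be illustrated with a minimal Python sketch. It assumes the usual formulation attributed to Ukkonen, in which d_q(x, y) is the sum, over all q-grams, of the absolute difference between their occurrence counts in x and in y; the abstract itself only says "the number of different q-grams". The function names qgram_profile, qgram_distance and naive_search are hypothetical, chosen for illustration. The search function is a deliberately naive baseline that compares only windows of length |p| and recomputes the distance for every window; it is not either of Ukkonen's algorithms nor the algorithms proposed in the paper, and is shown only to make the problem statement concrete.

```python
from collections import Counter


def qgram_profile(s: str, q: int) -> Counter:
    """Count every length-q substring (q-gram) of s; empty if len(s) < q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int) -> int:
    """Sum of absolute count differences over all q-grams of x and y.

    Assumed definition (Ukkonen-style); building the two profiles takes
    time proportional to |x| + |y| for a fixed q.
    """
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    # A Counter returns 0 for a q-gram that is missing from one string.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def naive_search(t: str, p: str, q: int, k: int) -> list[int]:
    """Start positions i where the window t[i:i+|p|] has distance <= k to p.

    Simplifying assumption: only windows of length |p| are examined.
    Each window is recomputed from scratch, so the total cost grows
    with |t| * |p|, which is the cost the faster algorithms avoid.
    """
    m = len(p)
    return [i for i in range(len(t) - m + 1)
            if qgram_distance(t[i:i + m], p, q) <= k]
```

For example, naive_search("xxabcabxx", "abcab", 2, 2) reports a run of adjacent start positions around the exact occurrence, which matches the abstract's observation that similar substrings tend to begin at nearby positions.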

Submitter: Hanada, Hiroyuki
