

Average-case linear-time similar substring searching by the q-gram distance

Files in This Item:
hanada-tcs2014-authorfinal.pdf (403.81 kB, PDF)
Please use this identifier to cite or link to this item: http://hdl.handle.net/2115/58429

Title: Average-case linear-time similar substring searching by the q-gram distance
Authors: Hanada, Hiroyuki
Kudo, Mineichi
Nakamura, Atsuyoshi
Keywords: String searching
Approximate string matching
q-Gram distance
Issue Date: 17-Apr-2014
Publisher: Elsevier
Journal Title: Theoretical Computer Science
Volume: 530
Start Page: 23
End Page: 41
Publisher DOI: 10.1016/j.tcs.2014.02.022
Abstract: In this paper we consider the problem of similar substring searching in the q-gram distance. The q-gram distance d_q(x, y) is a similarity measure between two strings x and y defined by the number of different q-grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x| + |y|) vs. O(|x||y|), and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When "similar" means k or less in d_q, their time complexities are O(|t|k + |p|) and O(|t| log k + |p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t| + |p|), although their worst-case complexities are still O(|t|k + |p|) and O(|t| log k + |p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(log k) time. Experimental results support their theoretical average-case complexities. © 2014 Elsevier B.V. All rights reserved.
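Illustrative sketch (not part of the deposited record): the q-gram distance described in the abstract can be computed by counting the q-grams of each string and summing the absolute differences of the counts, which is the usual definition attributed to Ukkonen. The Python below is a minimal sketch under that assumption. The function names qgram_profile, qgram_distance and naive_search are hypothetical, and naive_search is only a brute-force baseline that compares windows of length |p|, which is a simplifying assumption. It is not either of the algorithms proposed in the paper, nor Ukkonen's.

```python
from collections import Counter


def qgram_profile(s: str, q: int) -> Counter:
    """Count every length-q substring (q-gram) of s; empty if len(s) < q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int) -> int:
    """Sum, over all q-grams, of the absolute difference in occurrence counts."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    # Counter returns 0 for q-grams that are absent from one of the strings.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def naive_search(t: str, p: str, q: int, k: int) -> list[int]:
    """Brute-force baseline (assumption: windows have length len(p)).

    Returns the start positions i with d_q(p, t[i:i+len(p)]) <= k.
    Each window is recomputed from scratch, so this costs O(|t||p|).
    """
    m = len(p)
    return [i for i in range(len(t) - m + 1)
            if qgram_distance(p, t[i:i + m], q) <= k]
```

For example, with q = 2 the strings "abcd" and "abce" share the 2-grams "ab" and "bc" and differ in "cd" versus "ce", so qgram_distance("abcd", "abce", 2) evaluates to 2. Recomputing every window independently is what makes the baseline slow; the bounds quoted in the abstract come from avoiding that recomputation.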
Type: article (author version)
URI: http://hdl.handle.net/2115/58429
Appears in Collections: Graduate School of Information Science and Technology / Faculty of Information Science and Technology > Peer-reviewed Journal Articles, etc

Submitter: Hanada, Hiroyuki (花田 博幸)
