Hokkaido University Collection of Scholarly and Academic Papers
Average-case linear-time similar substring searching by the q-gram distance
Title: | Average-case linear-time similar substring searching by the q-gram distance |
Authors: | Hanada, Hiroyuki | Kudo, Mineichi | Nakamura, Atsuyoshi |
Keywords: | String searching | Approximate string matching | q-Gram distance |
Issue Date: | 17-Apr-2014 |
Publisher: | Elsevier |
Journal Title: | Theoretical Computer Science |
Volume: | 530 |
Start Page: | 23 |
End Page: | 41 |
Publisher DOI: | 10.1016/j.tcs.2014.02.022 |
Abstract: | In this paper we consider the problem of similar substring searching in the q-gram distance. The q-gram distance d_q(x, y) is a similarity measure between two strings x and y defined by the number of different q-grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x| + |y|) vs. O(|x||y|), and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When "similar" means k or less in d_q, their time complexities are O(|t|k + |p|) and O(|t| log k + |p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t| + |p|), although their worst-case complexities are still O(|t|k + |p|) and O(|t| log k + |p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(log k) time. Experimental results support their theoretical average-case complexities. © 2014 Elsevier B.V. All rights reserved. |
Type: | article (author version) |
URI: | http://hdl.handle.net/2115/58429 |
Appears in Collections: | Graduate School of Information Science and Technology / Faculty of Information Science and Technology > Peer-reviewed Journal Articles, etc |
Submitter: Hanada, Hiroyuki (花田 博幸)
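Illustrative note on the abstract: the q-gram distance and the search problem it describes can be sketched in a few lines. The sketch below assumes the usual definition of d_q as the sum, over all q-grams, of the absolute difference in occurrence counts between the two strings. The function names (`qgram_profile`, `qgram_distance`, `brute_force_search`) are hypothetical and do not come from the paper. The search routine is a brute-force baseline that checks every substring of t; it is neither of the paper's proposed algorithms nor either of Ukkonen's, and is only meant to make the problem statement concrete.

```python
from collections import Counter


def qgram_profile(s, q):
    """Count every length-q substring (q-gram) of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x, y, q):
    """Assumed definition: sum over all q-grams of |count in x - count in y|.

    Building both profiles takes time proportional to |x| + |y| q-gram
    extractions, in contrast to the |x||y| table of the edit distance.
    """
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def brute_force_search(t, p, q, k):
    """Return (start, end) pairs with d_q(t[start:end], p) <= k.

    Brute-force baseline only: it recomputes the distance for every
    non-empty substring of t, which is far slower than the algorithms
    discussed in the abstract.
    """
    hits = []
    for start in range(len(t)):
        for end in range(start + 1, len(t) + 1):
            if qgram_distance(t[start:end], p, q) <= k:
                hits.append((start, end))
    return hits
```

For example, `brute_force_search("xxabcdxx", "abcd", 2, 0)` reports the single exact occurrence as the half-open index pair `(2, 6)`. Raising k admits substrings whose q-gram counts differ slightly from those of the pattern.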