Studies on Efficient Index Construction for Multiple and Repetitive Texts

Takagi, Takuya (髙木, 拓也)


Files in This Item:

Takuya_Takagi.pdf (PDF, 739.74 kB)

Please use this identifier to cite or link to this item: https://doi.org/10.14943/doctoral.k13077

Related Items in HUSCAP:

Dissertation abstract and summary of the dissertation review (論文内容及び審査の要旨)
Studies on Efficient Index Construction for Multiple and Repetitive Texts [an abstract of dissertation and a summary of dissertation review]

Title:	Studies on Efficient Index Construction for Multiple and Repetitive Texts
Other Titles:	複数テキストと繰り返しテキストに対する効率の良い索引構築の研究
Authors:	Takagi, Takuya (髙木, 拓也)
Issue Date:	22-Mar-2018
Publisher:	Hokkaido University
Abstract:	The text indexing problem is one of the fundamental problems in computer science; its aim is to construct an efficient data structure that answers queries such as text pattern matching. Over the last few decades, there has been an increasing amount of multiple texts, such as data generated from multiple sensors, and of repetitive texts, such as genome sequence collections. For example, the GeoLife Project collects trajectories from GPS loggers that have a variety of sampling rates. These trajectories were recorded every 1 to 5 seconds or every 5 to 10 meters per point. As another example, the 1000 Genomes Project collects human genomes from various groups. Since the genomes are highly similar to one another, the same substructures appear repeatedly in this genome database. These projects aim at data analysis, information retrieval, and data mining for text information. For pattern matching, which is the most fundamental query on texts, we can answer queries by using basic text pattern matching algorithms such as the Knuth-Morris-Pratt (KMP) algorithm and the Boyer-Moore (BM) algorithm. Since these algorithms scan the texts for each query, each query requires time at least linear in the database size. In order to process these data quickly, preprocessing and indexing are important. For example, the suffix tree, one of the basic text indexes, can support pattern matching in time linear in the pattern length. Therefore, building an efficient index structure is the key to processing these large amounts of text information. In this thesis, we show efficient index construction algorithms for text data. For multiple texts and repetitive texts, there are several problems with indexing. Since data grow constantly for multiple sensor data such as GPS trajectories, it is necessary for the index to support online construction for multiple texts.
For repetitive texts, that is, collections of similar texts such as genome sequences, we should be able to build an index of a more compressed size. In order to solve these problems, we propose several new index structures and construction algorithms. In particular, this thesis deals with speeding up the construction and operations of indexes, online construction of indexes for multiple texts, and construction of compressed indexes for texts that include long repetitions. In Chapter 3, we propose a faster version of labeled trees (compact tries), called packed compact tries, by using a bit-parallel method. With this structure, we show faster construction of text indexes such as suffix trees and faster operations such as prefix search, insertion, and deletion. Since the compact trie is a widely used data structure, some algorithms can be sped up by using packed compact tries. In particular, we show that LZ-double factorization, which is one kind of text compression algorithm, can be sped up. In Chapter 4, we first define the fully-online construction problem, a setting in which a new input symbol can be appended to an arbitrary string in the set of input strings. To solve this problem, we first show a fully-online construction algorithm for a DAG index called the directed acyclic word graph (DAWG). We also propose a fully-online construction algorithm for the suffix tree that uses the similarity between DAWGs and suffix trees. In Chapter 5, we propose a self-indexing method that combines an index called the compact directed acyclic word graph (CDAWG) with grammar compression, which is one family of compression methods. When the input text is compressible, the index can be stored in a size smaller than the original text. In Chapter 6, we give conclusions and future work. Overall, this thesis studies efficient algorithms for text index construction.
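Conferring University:	Hokkaido University (北海道大学)
Degree Report Number:	Kō No. 13077 (甲第13077号)
Degree Level:	Doctoral (博士)
Degree Discipline:	Information Science (情報科学)
Examination Committee Members:	(Chief examiner) Professor Hiroki Arimura (有村博紀), Professor Shin-ichi Minato (湊真一), Professor Thomas Zeugmann, Associate Professor Takuya Kida (喜田拓也)
Degree Affiliation:	Graduate School of Information Science and Technology, Division of Computer Science and Information Technology (情報科学研究科 情報理工学専攻)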
Type:	theses (doctoral)
URI:	http://hdl.handle.net/2115/70687
Appears in Collections:	Theses (学位論文) > Doctoral, Information Science (博士（情報科学）); Doctorate by way of Advanced Course (課程博士) > Graduate School of Information Science and Technology (情報科学院)
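
The abstract contrasts scanning the text for every query with querying a prebuilt index, and names the directed acyclic word graph (DAWG) as one such index. As a rough illustration only, the sketch below builds the DAWG (suffix automaton) of a single text one symbol at a time, using the classic textbook online construction. It is not the fully-online multi-text algorithm of the thesis, and the class and method names (`SuffixAutomaton`, `extend`, `contains`) are assumptions chosen for this example.

```python
class SuffixAutomaton:
    """DAWG (suffix automaton) of one text, built online, symbol by symbol.

    Illustrative sketch of the classic single-text construction; it does not
    implement the fully-online multi-text setting described in the abstract.
    """

    def __init__(self):
        self.next = [{}]    # next[s][ch] -> target state of the transition
        self.link = [-1]    # suffix link of each state (-1 for the root)
        self.length = [0]   # length of the longest string reaching each state
        self.last = 0       # state that represents the whole text read so far

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.length) - 1

    def extend(self, ch):
        """Append one symbol to the indexed text and update the automaton."""
        cur = self._new_state(self.length[self.last] + 1)
        p = self.last
        # Add the new transition along the suffix-link chain where it is missing.
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps the shorter strings that reached q.
                clone = self._new_state(self.length[p] + 1)
                self.next[clone] = dict(self.next[q])
                self.link[clone] = self.link[q]
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, pattern):
        """Return True if pattern is a substring of the text indexed so far.

        Follows one transition per pattern symbol, independent of text length.
        """
        state = 0
        for ch in pattern:
            if ch not in self.next[state]:
                return False
            state = self.next[state][ch]
        return True


def build(text):
    """Helper for this example: index a whole text by feeding it symbol by symbol."""
    automaton = SuffixAutomaton()
    for ch in text:
        automaton.extend(ch)
    return automaton
```

Because `extend` can be called at any time, queries can be interleaved with growth of the text, which is the single-text version of the online behaviour the abstract asks of an index. A query with `contains` visits one state per pattern symbol, in contrast to a KMP or Boyer-Moore scan over the whole text.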
