Repetition-Aware Lossless Compression

Files in This Item:
Isamu_Furuya.pdf (458.62 kB, PDF)
Please use this identifier to cite or link to this item: http://doi.org/10.14943/doctoral.k14281

Title: Repetition-Aware Lossless Compression
Other Titles: 反復構造のための可逆圧縮 (Lossless Compression for Repetitive Structures)
Authors: 古谷, 勇 (Furuya, Isamu)
Issue Date: 25-Sep-2020
Abstract: This thesis studies lossless compression techniques for repetitive data. Lossless compression is a type of data compression that allows the original information to be restored completely from the compressed data. The ever-growing information technology industries generate enormous amounts of data, so efficient methods for managing large data are desired. In many cases these large data are highly repetitive; that is, most of their fragments can be obtained, with a few modifications, from fragments occurring at other positions in the data. Managing large repetitive data efficiently is attracting attention in many fields, and demand for good compression methods for such data is increasing. A repetition-aware compression technique allows these large data to be managed more efficiently, and this study contributes to that technique. Here, repetition-aware means highly effective on repetitive input. Our approach to repetition-aware compression is grammar compression, a scheme that constructs a formal grammar generating a language consisting only of the input data. Grammar compression has been preferred over other lossless compression techniques because of several useful properties, including high practical compression performance on repetitive data. The heart of this study is the development of grammar compression methods that aim to construct a small formal grammar from the input data. We discuss three grammar compression frameworks, which differ in the formal grammar used to describe the compressed data: a context-free grammar (CFG), a run-length context-free grammar (RLCFG), and a functional program described by a λ-term, treated in Chapters 3, 4, and 5, respectively.
In Chapter 3, we approach the problem of repetition-aware compression through CFG-based grammar compression. We analyze a well-known algorithm, RePair, and on the basis of that analysis design a novel variant called MR-RePair. We implement MR-RePair and experimentally confirm its effectiveness, especially for highly repetitive texts. In Chapter 4, we pursue further improvement in compression performance within the framework of RLCFG-based grammar compression. We design a compression algorithm using RLCFG, called RL-MR-RePair, and we propose an encoding scheme for MR-RePair and RL-MR-RePair. The experimental results demonstrate the high compression performance of RL-MR-RePair and the proposed encoding scheme. In Chapter 5, we study the framework of higher-order compression, which is grammar compression using a λ-term as the formal grammar. We present a method for obtaining a compact λ-term that represents a natural number. A compact representation of natural numbers can improve the compression effectiveness for repetition, the most fundamental repetitive structure. For a given natural number n, we prove that the size of the obtained λ-term is $O(\mathrm{slog}_2 n)$ in the best case and $O((\mathrm{slog}_2 n)^{\log n / \log \log n})$ in the worst case.
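Illustrative note: to make the CFG-based grammar compression mentioned in the abstract concrete, the following is a minimal sketch of the classic RePair idea: repeatedly replace the most frequent pair of adjacent symbols with a fresh nonterminal until no pair occurs twice. It is a naive, quadratic-time version written only for illustration. It is not MR-RePair, RL-MR-RePair, or the thesis implementation, and the names `count_pairs`, `repair`, `expand`, and `decompress` are hypothetical.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_pos = {}  # pair -> index of its most recently counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # Only a pair of two equal symbols can overlap itself (e.g. "aaa").
        if pair[0] == pair[1] and last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def repair(text):
    """Naive RePair sketch: returns (start sequence, rules).

    Terminals are single characters; nonterminals are integers.
    Each rule maps a nonterminal to the pair of symbols it replaces.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so a new rule would not help
        nonterminal = len(rules)
        rules[nonterminal] = pair
        # Replace occurrences of the pair from left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand one symbol back into the string it derives."""
    if symbol not in rules:
        return symbol  # terminal character
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def decompress(seq, rules):
    """Recover the original text from the start sequence and rules."""
    return "".join(expand(s, rules) for s in seq)
```

For example, `repair("abracadabra" * 4)` yields a short start sequence plus a handful of rules, and `decompress` applied to that result returns the original string, which is the lossless property the abstract describes.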
Conferring University: Hokkaido University
Degree Report Number: 甲第14281号
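Degree Level: Doctoral
Degree Discipline: Information Science
Examination Committee Members: (Chief examiner) Professor 有村 博紀, Professor 吉岡 真治, Professor 堀山 貴史
Degree Affiliation: Graduate School of Information Science and Technology (Information Science major)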
Type: theses (doctoral)
URI: http://hdl.handle.net/2115/79532
Appears in Collections: Doctorate by way of Advanced Course (課程博士) > Graduate School of Information Science and Technology (情報科学院)
Theses (学位論文) > Doctoral (Information Science)
