Abstract of Doctoral Dissertation
Degree conferred: Doctor (Information Science)    Name: 孫露
Title of Dissertation
On Improving Multi-Label Classification via Dimension Reduction
（次元縮約によるマルチラベル識別の改善）
In recent years we have witnessed an explosive increase in web-related applications in our daily
lives, and the scale of data and information has grown dramatically. To deal with such a huge
amount of data, machine learning has become a crucial means of relieving people of large-scale
tasks such as classification, pattern recognition, and prediction. As one of the fundamental
tasks, classification has attracted considerable attention from researchers and has been developed
in various settings, such as binary classification and multi-class classification, to meet the distinct
requirements of real-world applications.
In this thesis, we concentrate on Multi-Label Classification (MLC). Unlike
traditional single-label classification, where an instance is associated with one class label, MLC
addresses multi-label problems, where an instance may belong to multiple labels simultaneously. This
generalization greatly increases the difficulty of achieving desirable classification accuracy at a tractable
time cost. As an appealing and challenging supervised learning problem, MLC has a wide range of
real-world applications, such as text categorization, semantic image annotation, bioinformatics
analysis, and music emotion detection.
In general, there are two main concerns in MLC problems. First, label correlations are strong and
ubiquitous in multi-label datasets. For example, in semantic image annotation, the labels "lake"
and "reflection" are likely to co-occur and thus share a strong correlation. It is therefore crucial to
capture such label correlations in order to achieve desirable classification performance. Second, with
the rapid increase of web-related applications, more and more high-dimensional datasets are emerging,
whose numbers of instances, features, and labels far exceed the conventional scale. For example, there are
millions of videos on the video-sharing website YouTube, and each one can be tagged with some of
millions of candidate categories. Such high dimensionality of multi-label data significantly increases
the time and space complexity of learning and degrades classification performance due to the curse
of dimensionality. To address these two concerns, various MLC methods have been proposed in recent
years and have achieved considerable success in a number of applications. However, further improvement
in both time complexity and classification accuracy is still needed.
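The first concern, label correlation, can be made concrete with a small sketch. The toy annotations and the helper names `cooccurrence_counts` and `conditional_frequency` below are hypothetical and serve only to illustrate how co-occurrence between labels such as "lake" and "reflection" might be measured; they are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy annotations: each image is tagged with a set of labels.
annotations = [
    {"lake", "reflection", "sky"},
    {"lake", "reflection"},
    {"lake", "mountain"},
    {"city", "sky"},
    {"reflection", "lake", "mountain"},
]


def cooccurrence_counts(label_sets):
    """Count how often each unordered pair of labels appears together."""
    counts = Counter()
    for labels in label_sets:
        # Sorting gives every pair a canonical (alphabetical) key.
        for a, b in combinations(sorted(labels), 2):
            counts[(a, b)] += 1
    return counts


def conditional_frequency(label_sets, target, given):
    """Empirical estimate of P(target | given) over the label sets."""
    with_given = [s for s in label_sets if given in s]
    if not with_given:
        return 0.0
    return sum(target in s for s in with_given) / len(with_given)


pair_counts = cooccurrence_counts(annotations)
p_reflection_given_lake = conditional_frequency(annotations, "reflection", "lake")
```

A conditional frequency well above the marginal frequency of a label suggests a dependency that a classifier treating labels independently would ignore.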
The research objective of this thesis is to improve the performance of MLC by capturing label
correlations and reducing dimensionality. Accordingly, the thesis is divided into two
major parts: Part I, Multi-Label Classification, and Part II, Multi-Label Dimension Reduction.
In Part I, we focus on solving MLC problems through label correlation modeling and multi-label
feature selection. Motivated by the Classifier Chains (CC) method, we propose Polytree-Augmented
Classifier Chains (PACC), which encodes label correlations in a single probabilistic graphical model, the
polytree. Thanks to the polytree's flexible structure, PACC avoids the problems of error propagation and
poor chain ordering that arise in CC. To further improve its performance, we develop a two-stage feature
selection approach that removes irrelevant and redundant features for each label. In addition,
we reconsider both label correlation modeling and feature selection within a unified framework based on
conditional likelihood maximization. We show that existing CC-based methods
and several feature selection approaches are special cases of this generic framework.
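For readers unfamiliar with Classifier Chains, the following is a minimal sketch of the plain CC idea that motivates PACC. It is not the PACC method itself. The 1-nearest-neighbor base learner, the class name `ClassifierChain`, and the toy data are assumptions chosen only to keep the example self-contained. Each label's model sees the original features plus the labels that precede it in the chain, which is why a poor ordering or an early mistake can propagate down the chain.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Return the target of the training point closest to `query`."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]


class ClassifierChain:
    """Plain classifier chain with a 1-NN base learner (illustrative only)."""

    def __init__(self, order):
        self.order = list(order)  # label indices in chain order
        self.models = []          # one (augmented inputs, targets) pair per label

    def fit(self, X, Y):
        self.models = []
        for pos, label in enumerate(self.order):
            previous = self.order[:pos]
            # Training uses the TRUE values of the preceding labels.
            augmented = [list(x) + [y[p] for p in previous] for x, y in zip(X, Y)]
            targets = [y[label] for y in Y]
            self.models.append((augmented, targets))
        return self

    def predict(self, x):
        predicted = {}
        for pos, label in enumerate(self.order):
            augmented, targets = self.models[pos]
            # Prediction uses the PREDICTED values of the preceding labels,
            # so an early error is passed on to later links.
            query = list(x) + [predicted[p] for p in self.order[:pos]]
            predicted[label] = nearest_neighbor_predict(augmented, targets, query)
        return [predicted[j] for j in sorted(predicted)]


# Hypothetical toy data: label 1 can only be active when label 0 is active.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[0, 0], [0, 0], [1, 0], [1, 1]]

chain = ClassifierChain(order=[0, 1]).fit(X, Y)
prediction = chain.predict([0.9, 0.8])
```

As described above, PACC replaces this single linear chain with a polytree over the labels, so a label may depend on several parent labels rather than on everything that precedes it in a fixed order.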
In Part II, we aim to improve classification performance by decreasing the problem size of MLC.
To reduce the dimensionality of features, we conduct Feature Space Dimension Reduction (FS-DR) by
proposing two Multi-Label Dimension Reduction (ML-DR) methods, MLC with Meta-Label-Specific Features (MLSF)
and Robust sEmi-supervised multi-lAbel DimEnsion Reduction (READER) via empirical risk minimization. Based on
ℓ2,1-norm loss and regularization, READER performs feature selection in a robust manner through
label embedding (label correlation modeling) and manifold learning (semi-supervised learning). To
cope with imperfect label information, we conduct Label Space Dimension Reduction (LS-DR)
by extending READER to apply nonlinear Label Embedding (READER-LE) with a linear
approximation. Furthermore, in order to exploit parallel computing, we introduce for the first time a
novel category of ML-DR, Instance Space Decomposition (ISD), and propose the Clustering-based
Local MLC (CLMLC) method to evaluate its efficiency. Unlike existing ISD methods, CLMLC
conducts feature-guided ISD in a feature subspace rather than in the original feature space, and builds
cluster-specific local models.
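The role of the ℓ2,1-norm in feature selection can be illustrated with a short sketch. It assumes the common definition of the ℓ2,1-norm of a matrix as the sum of the Euclidean norms of its rows, and a hypothetical weight matrix `W` with one row per feature and one column per label. It does not reproduce the READER objective or its optimization. Penalizing this norm tends to drive entire rows toward zero, so features can be ranked by their row norms.

```python
import math


def row_norms(W):
    """Euclidean norm of each row of W (one row per feature)."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def l21_norm(W):
    """l2,1-norm: the sum of the Euclidean norms of the rows of W."""
    return sum(row_norms(W))


def rank_features(W):
    """Feature indices ordered by decreasing row norm."""
    norms = row_norms(W)
    return sorted(range(len(W)), key=lambda i: norms[i], reverse=True)


# Hypothetical weights: 3 features (rows) by 2 labels (columns).
W = [
    [3.0, 4.0],  # row norm 5: feature used by both labels
    [0.0, 0.0],  # row norm 0: feature discarded for every label
    [0.6, 0.8],  # row norm 1: weakly used feature
]

total = l21_norm(W)
ranking = rank_features(W)
```

Because the norm is taken row by row, a feature is kept or discarded jointly for all labels, which is one way that information shared across labels can influence the selected feature subset.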
Extensive empirical evidence in this thesis demonstrates that the proposed MLC methods
successfully address the two concerns of MLC and improve classification performance
compared with state-of-the-art methods. We therefore hope that researchers in the field of MLC can
build their MLC systems and develop novel MLC methods on the basis of the research work in this
thesis.