Multi-label classification by polytree-augmented classifier chains with label-dependent features

Multi-label classification faces several critical challenges, including modeling label correlations, mitigating label imbalance, removing irrelevant and redundant features, and reducing the complexity for large-scale problems. To address these issues, in this paper, we propose a novel method—polytree-augmented classifier chains with label-dependent features—that models label correlations through flexible polytree structures based on low-dimensional label-dependent feature spaces learned by a two-stage feature selection approach. First, a feature weighting approach is applied to efficiently remove irrelevant features for each label and mitigate the effect of label imbalance. Second, a polytree structure is built in the label space using estimated conditional mutual information. Third, an appropriate label-dependent feature subset is found by taking account of label correlations in the polytree. Extensive empirical studies on six synthetic datasets and 12 real-world datasets demonstrate the superior performance of the proposed method. In addition, by incorporating the proposed two-stage feature selection approach, the multi-label classifiers with label-dependent features achieve on average 9.4% performance improvement in Exact-Match compared with the original classifiers.


Introduction
In recent years we have witnessed an increasing demand for multi-label classification (MLC) in a wide range of applications, such as text categorization, semantic image annotation, bioinformatics analysis and audio emotion detection, for which numerous machine learning techniques have been specifically designed and successfully applied. Unlike traditional multi-class single-label classification, where each instance is associated with only a single label, the task of MLC is to assign a label subset to an unseen instance. The existing MLC methods fall into two broad categories: problem transformation and algorithm adaptation [35]. The problem transformation strategy typically transforms an MLC problem into a set of single-label classification problems, and learns a family of classifiers for modeling the single-label memberships. The algorithm adaptation strategy adapts conventional machine learning algorithms to the multi-label setting. A number of MLC methods adopting one of these two strategies have been developed and have succeeded in dealing with various multi-label problems.
Previous efforts on MLC focus mainly on two aspects: label correlation modeling and dimensionality reduction. Many studies [28,11] have shown that capturing label correlations is crucial for an MLC method to achieve competitive classification performance. On the other hand, a variety of dimension reduction approaches [22,48,44] have been proposed for multi-label problems in order to reduce resource consumption and improve performance. However, most of these methods build their models on the basis of an identical feature space for all labels. Such a universal hypothesis possibly introduces irrelevant and redundant features, resulting in two problems: decreasing the model's generalization ability and increasing its computational complexity for both learning and prediction. Rather, it is natural to think that each label holds its own specific set of features to distinguish it from other labels. For example, in image annotation, an object typically relates to only a few regions in the high-dimensional feature space, and in text categorization, one specific topic is probably relevant to only a fraction of words from the massive vocabulary.

Hence, in this study we presume that modeling label correlations and mining label-dependent features would benefit the generalization ability and prediction accuracy of an MLC method. In our previous work [32], the basic idea of Polytree-Augmented Classifier Chains (PACC) was already proposed, which is more flexible in modeling label correlations than conventional MLC methods. In this paper, we improve PACC by selecting Label-Dependent Features to produce the PACC-LDF method. We employ a hybrid two-stage feature selection algorithm for the polytree structure. Specifically, an information gain-based feature weighting algorithm is employed in the first stage to efficiently remove irrelevant features for each label and alleviate the label imbalance problem. After the construction of a polytree, in the second stage, a correlation-based feature subset selection algorithm is carried out to select label-dependent feature subsets by incorporating the label correlations modeled by the polytree. In this way, the label-dependent features chosen for the polytree structure are used to learn the classifier chain and to make predictions on test instances. The proposed two-stage feature selection algorithm is also applicable to other MLC methods, such as Classifier Chains (CC) based methods [28,7,43]. Extensive experiments conducted on both synthetic and real-world datasets demonstrate the superiority of the proposed PACC-LDF method compared with several state-of-the-art MLC methods in terms of classification performance and time complexity.
The contributions of this work are threefold.
- The polytree structure is introduced to model label dependency, based on which we propose PACC for MLC and present more technical details than in our previous work [32].
- A two-stage feature selection framework is specifically developed for PACC to select Label-Dependent Features (LDF), which mitigates the label imbalance problem and preserves the label correlations modeled in the built polytree structure.
- Empirical studies show that MLC methods can be improved by 9.4% on average in Exact-Match by incorporating LDF. In addition, extensive experimental results demonstrate the efficiency of the proposed PACC-LDF compared with popular MLC methods.
The remainder of this paper is organized as follows. Section 2 gives the mathematical definition of MLC. Section 3 discusses related work. Section 4 states the challenges confronting MLC methods. Section 5 gives an overview of the system framework and presents technical details in two parts: Polytree-Augmented Classifier Chains (PACC) and Label-Dependent Feature (LDF) selection. Section 6 presents the statistical properties of the benchmark multi-label datasets and defines four metrics for evaluating multi-label classifiers. The experimental results are reported and discussed in Section 7. Finally, Section 8 concludes this paper and discusses future work.

Multi-label classification
In the scenario of MLC, given a finite set of labels L = {λ_1, ..., λ_L}, an instance is typically represented by a pair (x, y), which contains a feature vector x = (x_1, ..., x_M) as a realization of the random vector X = (X_1, ..., X_M) drawn from the input feature space X = R^M, and the corresponding label vector y = (y_1, ..., y_L) drawn from the output label space Y = {0, 1}^L. In other words, y = (y_1, ..., y_L) can be viewed as a realization of the corresponding random vector Y = (Y_1, ..., Y_L), Y ∈ Y, where y_j = 1 if label λ_j is associated with the corresponding instance x, and y_j = 0 otherwise.
Suppose that we are given a dataset D = {(x^(i), y^(i))}_{i=1}^N, where y^(i) is the label assignment of the ith instance. The task of MLC is to find an optimal classifier h : X → Y which assigns an appropriate label vector y to each instance x such that h minimizes a loss function. Given a loss function loss(Y, h(X)), the optimal h* is

h* = arg min_h E_{P(x,y)} [ loss(y, h(x)) ],   (1)

where P(x, y) is the joint probability distribution over the feature vector x and label vector y. The optimal classifier (1) can be rewritten in a pointwise way,

ŷ = h*(x) = arg min_ŷ Σ_y P(y | x) loss(y, ŷ).   (2)

For the subset 0-1 loss loss_S(y, ŷ) = 1_[y ≠ ŷ], the risk minimizer (2) reduces to the mode of the joint conditional distribution,

ŷ = arg max_{y ∈ Y} P(y | x).   (3)

Similarly, for the Hamming loss loss_H(y, ŷ) = (1/L) Σ_{j=1}^L 1_[y_j ≠ ŷ_j], the risk minimizer is given by the marginal modes,

ŷ_j = arg max_{y_j ∈ {0,1}} P(y_j | x), j = 1, ..., L.   (4)

As proved in [8], (4) coincides with (3) in the case of conditional independence of labels. In Section 3, we will show that the binary relevance method [3] and several classifier chain based methods [7,28,43,29] are actually indirect approximations of (3).
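As an illustrative sketch (not part of the original formulation), the following toy example with a hypothetical conditional distribution P(y | x) over L = 2 labels shows that the subset 0-1 risk minimizer (3) and the Hamming risk minimizer (4) can differ when labels are dependent:

```python
# Toy conditional distribution P(y | x) over L = 2 labels (hypothetical values).
P = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}

# Subset 0-1 loss minimizer: mode of the joint distribution, as in (3).
y_subset = max(P, key=P.get)  # -> (0, 0)

# Hamming loss minimizer: mode of each marginal, as in (4).
# P(y_1 = 1) = 0.6 and P(y_2 = 1) = 0.3, so the marginal modes give (1, 0).
y_hamming = tuple(
    int(sum(p for y, p in P.items() if y[j] == 1) > 0.5) for j in range(2)
)

print(y_subset, y_hamming)  # the two minimizers differ under label dependence
```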

Related works
In recent years, many efforts in MLC have been devoted to two aspects: label correlation modeling and dimensionality reduction. It has been shown in a number of studies [28,11] that modeling label correlations is crucial to performing accurate classification. On the other hand, various dimension reduction algorithms, including feature selection [22,13] and feature extraction [48,44], have been employed in MLC in order to simplify the learning phase and overcome the curse of dimensionality.
In terms of label correlation modeling, classifier chains (CC) based methods have been proposed at a tractable time complexity, originating from the simple Binary Relevance (BR) method. In the BR context, a classifier h is comprised of L binary classifiers h_1, ..., h_L, where each member classifier h_j predicts ŷ_j ∈ {0, 1}, forming a vector ŷ ∈ {0, 1}^L. In the prediction phase, BR collects the results of the member classifiers, i.e., ŷ_j ← h_j(x), which is identical to (4). In this sense, BR can be seen as a Hamming loss risk minimizer [8]. In CC [28], the label correlation is expressed in an ordered chain. In the learning phase, according to a predefined chain order, e.g., y_1 ≺ y_2 ≺ ... ≺ y_L, CC learns L classifiers h_1, ..., h_L such that each classifier predicts the correct value of y_j by referring to the correct values of pa(y_j) = {y_1, y_2, ..., y_{j-1}} in addition to x. In the prediction phase, it predicts in turn the value of y_j using the previously estimated values ŷ_1, ..., ŷ_{j-1} with x according to:

ŷ_j = arg max_{y_j ∈ {0,1}} P(y_j | pa(y_j), x), j = 1, ..., L.   (5)
Finally we have ŷ = (ŷ_1, ŷ_2, ..., ŷ_L). Note that CC predicts the presence/absence of a label depending on the previously predicted labels, and its prediction is made along only one path. Probabilistic Classifier Chains (PCC) [7] provides better estimates than CC at the expense of a higher time complexity in the prediction phase. Although PCC shares the learning model (8) with CC, it chooses the best prediction by examining all 2^L paths in an exhaustive manner according to:

ŷ = arg max_{y ∈ Y} P(y | x) = arg max_{y ∈ Y} Π_{j=1}^L P(y_j | pa(y_j), x).   (6)

The exponential cost of PCC in prediction limits its application. To make the prediction tractable for PCC, several methods [21,9,30] have been proposed to find the approximate maximum a posteriori (MAP) assignment of labels to a test instance. Bayesian Classifier Chains (BCC) [43] introduces a directed tree as the probabilistic structure over labels. The directed tree is established by randomly choosing a label as its root and assigning directions to the remaining edges. It shares the same models (8) and (5) with CC, but |pa(Y_j)| ≤ 1 limits its ability to express label correlations. Fig. 1 shows an example of the graphical models of BR, CC, PCC and BCC with four labels. In terms of time complexity, all these methods hold linear complexity O(LMN) for training if a linear baseline classifier is utilized.
In the prediction phase, BR, CC and BCC have linear complexity O(LM) for testing a single instance, while PCC needs a time complexity of O(2^L LM). On the other hand, a variety of MLC methods have been proposed to reduce the dimensionality of multi-label problems. The Multi-Label Naive Bayes (MLNB) method [47] incorporates a feature selection mechanism into a newly designed naive Bayes classifier. Principal component analysis is employed to remove unnecessary features, and then a wrapper approach with a genetic algorithm is performed. However, MLNB is applicable only to regular-scale datasets with continuous features due to its feature selection mechanism. Label specIfic FeaTures (LIFT) [44] extracts label-specific features by conducting k-means clustering analysis on the positive and negative instances of each specific label. It obtained competitive results on a broad range of benchmark multi-label datasets, but spends more prediction time than MLC methods with linear time complexity. By learning a Hierarchy Of Multi-label classifiERs based on a balanced k-means clustering approach, HOMER [36] partitions the whole label set into a series of smaller and more balanced sets following the layers of the label hierarchy. Label Partition for Sublinear Ranking (LPSR) [39] consists of two stages: feature space partition and label assignment. It reduces the prediction complexity by learning a hierarchy over base classifiers, but its training cost is higher than linear in L. To cope with large-scale problems, a tree-based multi-label method, FastXML, was proposed in [27]. Based on a ranking loss function, nDCG, it developed an efficient alternating minimization algorithm to optimize the objective function. In this way, it achieves competitive classification accuracy compared with other scalable MLC methods, and can scale to large-scale datasets even with millions of labels.
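The contrast between CC's greedy, single-path inference and PCC's exhaustive search over all 2^L paths can be sketched as follows; this is a toy illustration (not the authors' implementation) in which each hypothetical classifier f(x, pa) returns P(y_j = 1 | pa(y_j), x):

```python
from itertools import product

def cc_predict(classifiers, x):
    """Greedy chain inference as in CC: a single path through the chain."""
    y = []
    for f in classifiers:
        p1 = f(x, tuple(y))               # P(y_j = 1 | previously predicted labels, x)
        y.append(1 if p1 >= 0.5 else 0)
    return tuple(y)

def pcc_predict(classifiers, x):
    """Exhaustive MAP inference as in PCC: score all 2^L label paths."""
    L = len(classifiers)
    best, best_p = None, -1.0
    for y in product([0, 1], repeat=L):
        p = 1.0
        for j, f in enumerate(classifiers):
            p1 = f(x, y[:j])
            p *= p1 if y[j] == 1 else 1.0 - p1
        if p > best_p:
            best, best_p = y, p
    return best
```

With suitably chosen classifiers, the greedy path of CC can miss the joint mode that PCC finds, which is exactly the suboptimality PCC pays O(2^L) to avoid.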
For efficiency, in this paper we incorporate a two-stage label-dependent feature selection mechanism into the learning phase of the novel polytree-augmented classifier chains method [32] in order to improve its performance on the basis of label-dependent features.

Label correlations
There are two types of label correlations in MLC: marginal and conditional dependency. According to the structure of a Bayesian network for P(X, Y), we have the marginal distribution

P(Y) = Π_{j=1}^L P(Y_j | pa(Y_j)),   (7)

and the conditional distribution given X,

P(Y | X) = Π_{j=1}^L P(Y_j | pa(Y_j), X),   (8)

where pa(Y_j) denotes the parent label set of Y_j. Then the definition of label dependence can be induced:

Definition 1 The label random vector Y is called marginally or conditionally independent if ∀Y_j : pa(Y_j) = ∅ in (7) or (8), respectively.
From the values of the mutual information I(Y_j; Y_k) for pairs of Y_j and Y_k, we can see that label-pair correlation is prevalent in many datasets. Fig. 2(a) shows the label correlations in the Enron dataset, measured by mutual information.
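A minimal sketch of how such pairwise label correlations can be measured empirically, computing I(Y_j; Y_k) from two binary label columns (an illustration under the standard plug-in estimate, not the authors' code):

```python
import math
from collections import Counter

def label_mutual_info(Yj, Yk):
    """Empirical mutual information I(Y_j; Y_k) between two binary label columns,
    using plug-in (frequency) estimates of the marginal and joint distributions."""
    n = len(Yj)
    pj, pk = Counter(Yj), Counter(Yk)
    pjk = Counter(zip(Yj, Yk))
    mi = 0.0
    for (a, b), c in pjk.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((pj[a] / n) * (pk[b] / n)))
    return mi

# Perfectly correlated labels give I = H(Y_j); independent labels give I = 0.
```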

Label-dependent irrelevant and redundant features
The existence of irrelevant and redundant features increases the computational complexity of both learning and prediction, and moreover often reduces the generalization ability of classifiers designed from instances due to the curse of dimensionality. Irrelevant and redundant features are two distinct concepts: an irrelevant feature has no discriminative information, while a redundant feature shares the same discriminative information with other features. Removal of these features, therefore, does not lose discriminative information. Rather, the elimination of irrelevant and redundant features simplifies the learning phase and prevents overfitting.
An MLC example with irrelevant and redundant features is shown in Fig. 3. For label Y_3, features X_1 and X_2 are redundant, as Y_3 can be readily classified by either X_1 or X_2. If we have prior information on the presence of Y_3, X_2 is irrelevant since Y_1 can be discriminated from Y_2 based only on X_1. This example shows how irrelevant and redundant features can exist in MLC, and how label-dependent features can be exploited to facilitate the process of classification. Moreover, it shows that a predicted label value can provide useful information for the as-yet unpredicted labels, which can further compact label-dependent feature subsets and promote classification. For example, given the absence of Y_3, only X_2 is a discriminative feature for labels Y_1 and Y_2.

Label imbalance
Multi-label datasets are typically imbalanced, i.e., the number of instances associated with each label is often unequal. In other words, the ratio of positive instances against negative ones may be quite low for some labels. The imbalance problem usually harms the performance of the learned classifier from two points of view. On the one hand, if we aim to minimize Hamming loss or ranking loss, we tend to ignore minority labels. On the other hand, classifiers for minority labels are difficult to design. The label imbalance problem is more serious in multi-label datasets than in single-label datasets, since a label comprising more classes typically has fewer samples. To depict the imbalance level of a dataset D, the mean Imbalance Ratio (IR) and the Coefficient of Variation of IR (CVIR) [5] are introduced as follows:

IR = (1/L) Σ_{j=1}^L IR(λ_j),  IR(λ_j) = max_{λ_k ∈ L} Σ_{i=1}^N 1_[y_k^(i) = 1] / Σ_{i=1}^N 1_[y_j^(i) = 1],  CVIR = σ_IR / IR,   (9)

where σ_IR is the standard deviation of IR(λ_j) over the labels. From the values of these two measures (Table 2 of Section 6.1), we can see that multi-label datasets are highly imbalanced. Fig. 2(b) shows the label imbalance problem in the Enron dataset (L = 53), where only the 15 most frequent labels are reported. A simple way to mitigate the imbalance level in correlation measurement is to use the normalized mutual information (defined in (25) of Sec. 7.1) instead of the original mutual information I(Y_j; Y_k). Another way is to perform undersampling for majority labels and oversampling for minority labels. For more information on the imbalance problems of MLC, see [5].
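A short sketch of how IR and CVIR can be computed from a binary label matrix, assuming the per-label imbalance ratio of [5] (the ratio of the most frequent label's count to each label's count):

```python
import statistics

def imbalance_stats(Y):
    """Mean IR and CVIR of a binary label matrix Y (rows: instances, cols: labels).
    IR(j) = max_k count(k) / count(j), following the measures of [5]."""
    counts = [sum(col) for col in zip(*Y)]   # positive count per label
    cmax = max(counts)
    irs = [cmax / c for c in counts]
    mean_ir = sum(irs) / len(irs)
    cvir = statistics.stdev(irs) / mean_ir   # coefficient of variation of IR
    return mean_ir, cvir
```

For instance, three labels with positive counts 4, 2 and 1 give per-label IRs of 1, 2 and 4, hence a mean IR of about 2.33, which by the rule of thumb in [5] already marks the dataset as imbalanced.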

High complexity for large-scale data
Real applications of MLC often confront large-scale problems, where any of the number of labels L, the number of features M and the number of instances N might be very large. In such a case, time complexity becomes an important aspect in evaluating an MLC algorithm, sometimes more important than classification accuracy for real-world applications.
Up to now, one of the simplest MLC methods is to transform the MLC problem into a series of single-label classification problems, namely Binary Relevance (BR), which has a linear time complexity O(LMN) in terms of the complexity of the baseline classifier. However, even such a linear complexity can be intractable for large-scale MLC problems. To overcome this limitation, as mentioned in Sec. 3, several dimension reduction approaches have been applied in MLC methods so as to attain a sublinear complexity. For instance, embedding based methods [38,41,2] project the label vectors onto a low-dimensional linear or nonlinear label subspace, leading to a time complexity of O(L̃MN) with L̃ ≪ L. In this paper, we focus on reducing the dimensionality of the feature space, to attain a complexity of O(LM̃N) with M̃ ≪ M. Table 1 summarizes several MLC methods evaluated from the four viewpoints discussed in this section. It shows that most popular MLC methods can cope with only some of the four issues confronting MLC. In this paper, we try to deal with all four aspects with the proposed method.

Polytree-augmented classifier chains
We propose the novel polytree-augmented classifier chains (PACC) as a compromise between expression ability and efficiency. A polytree (Fig. 4) is a directed acyclic graph whose underlying undirected graph is a tree but in which a node can have multiple parents [31]; that is, it is more flexible than a tree. A causal basin, as shown in Fig. 4(b), is a subgraph which starts with a multi-parent node and continues following the causal flow to include all the descendants and their direct parents.

Structure learning
In PACC, the conditional label dependence is obtained by approximating the true distribution P(Y|X) by another distribution. According to Chow-Liu's proof [6] and our previous work [32], we have its feature-conditioned version.
Theorem 1 To approximate a conditional distribution P(Y|X), the optimal Bayesian network B* in terms of K-L divergence is obtained if the sum of the conditional mutual information between each variable of Y and its parent variables given the observation X is maximized.
Proof Here we use the Kullback-Leibler (KL) divergence [20], D_KL(P || P_B), a quasi-distance between two distributions, to evaluate how close an alternative distribution P_B(Y|X) is to P(Y|X), where B denotes the Bayesian network over the labels:

D_KL(P || P_B) = E_{P(x)} [ Σ_y P(y | x) log ( P(y | x) / P_B(y | x) ) ].   (10)

According to the parent-child relationships in B,

D_KL(P || P_B) = -Σ_{j=1}^L I_P(Y_j; pa(Y_j) | X) + Σ_{j=1}^L H_P(Y_j | X) - H_P(Y | X),   (11)

where I_P(Y_j; pa(Y_j) | X) represents the conditional mutual information between Y_j and its parents pa(Y_j) given X in B. Since the entropy terms do not depend on B, the optimal B* is obtained by maximizing

Σ_{j=1}^L I_P(Y_j; pa(Y_j) | X).   (12)

Theorem 1 thus shows that minimizing D_KL(P || P_B) amounts to constructing B so as to maximize the mutual information between each child and its parents. However, in practice we do not know the true P(Y|X); therefore we use the empirical distribution P̂(Y|X) instead. Unfortunately, learning the optimal B* is NP-hard in general, so we limit our hypothesis B to those satisfying |pa(Y_j)| ≤ 1, i.e., pa(Y_j) = Y_k for some k ∈ {1, 2, ..., L} or null, indicating that a tree skeleton is to be built. In practice, we carry out Chow-Liu's algorithm [6] to obtain the maximum-cost spanning tree (Fig. 4(a)), maximizing the weight sum with edge weights I_P̂(Y_j; Y_k | X).
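One way to obtain the maximum-cost spanning tree over the labels is Kruskal's algorithm with union-find; a minimal sketch, assuming the pairwise conditional mutual information matrix has already been estimated:

```python
def max_spanning_tree(weights):
    """Chow-Liu skeleton: maximum-cost spanning tree over L label nodes,
    where weights[j][k] holds the estimated I(Y_j; Y_k | X).
    Kruskal's algorithm with union-find (path halving)."""
    L = len(weights)
    parent = list(range(L))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Consider edges in decreasing weight order; add an edge iff it joins
    # two different components (i.e., it creates no cycle).
    edges = sorted(((weights[j][k], j, k)
                    for j in range(L) for k in range(j + 1, L)), reverse=True)
    tree = []
    for w, j, k in edges:
        rj, rk = find(j), find(k)
        if rj != rk:
            parent[rj] = rk
            tree.append((j, k))
    return tree
```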

Mutual information estimation
It is quite difficult to estimate the conditional probability P(Y|X) when X is continuous. Recently, several methods [8,43,45] have been proposed to solve this problem. In BCC [43], as an approximation of the conditional probability, the marginal probability of the labels Y is obtained by simply counting the frequency of occurrence. Similar to [8], LEAD [45] directly obtains the conditional dependence by estimating the degree of dependency among the errors of multivariate regression models.
In [32], we used a more general approach to estimate the conditional probability. The dataset D is split into two sets: a training set D_t and a hold-out set D_h. Probabilistic classifiers, outputting the probability of each label, are learned from D_t to represent the conditional probabilities of the labels, and the probabilities are calculated based on the output of the learned classifiers over D_h. First, three probabilistic classifiers f_j, f_k and f_j|k are learned on D_t to approximate the conditional probabilities P(y_j = 1|x), P(y_k = 1|x) and P(y_j = 1|y_k, x), respectively. Then the corresponding probabilities are computed by conducting f_j, f_k and f_j|k on D_h. Last, I_P̂(Y_j; Y_k | X) is estimated by

I_P̂(Y_j; Y_k | X) = (1/|D_h|) Σ_{x ∈ D_h} Σ_{y_j} Σ_{y_k} P̂(y_j | y_k, x) P̂(y_k | x) log ( P̂(y_j | y_k, x) / P̂(y_j | x) ).   (13)
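A sketch of this hold-out estimator, assuming the three classifiers are supplied as callables returning P(y = 1 | ...) (the names `f_j`, `f_k`, `f_jk` and the interface are illustrative, not the authors' implementation):

```python
import math

def estimate_cmi(f_j, f_k, f_jk, holdout_X):
    """Hold-out estimate of I(Y_j; Y_k | X) from three probabilistic classifiers:
    f_j(x) ~ P(y_j=1 | x), f_k(x) ~ P(y_k=1 | x), f_jk(x, y_k) ~ P(y_j=1 | y_k, x).
    Averages the pointwise conditional mutual information over the hold-out set."""
    total = 0.0
    for x in holdout_X:
        pj1, pk1 = f_j(x), f_k(x)
        for yk, pk in ((1, pk1), (0, 1.0 - pk1)):
            pj1_given = f_jk(x, yk)
            for p_cond, p_marg in ((pj1_given, pj1),
                                   (1.0 - pj1_given, 1.0 - pj1)):
                p_joint = p_cond * pk            # P(y_j, y_k | x)
                if p_joint > 0 and p_marg > 0:
                    total += p_joint * math.log(p_cond / p_marg)
    return total / len(holdout_X)
```

If the chain classifier ignores y_k (conditional independence), the estimate is zero; dependence between the labels given x yields a positive value.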

Construction of PACC
After obtaining the skeleton of the polytree, our next task is to assign directions to its edges, that is, to find an ordering of the nodes to complete the polytree. First we assign some or all directions to the skeleton by finding causal basins. This is implemented by finding multi-parent nodes and the corresponding directionality. The detailed procedure is as follows. Fig. 5 shows three possible graphical models over a triplet A, B and C. Here Types 1 and 2 are indistinguishable because they share the same joint distribution, while Type 3 differs from Types 1 and 2. In Type 3, A and C are marginally independent, so that we have I(A; C) = 0, whereas I(A; C) > 0 in Types 1 and 2. Hence, if a pair of Y_j's direct neighbors has (close to) zero mutual information, both neighbors are identified as parents of Y_j (Zero-MI testing). The other non-parent neighbors are treated as Y_j's child nodes. By performing the Zero-MI testing for every pair of Y_j's direct neighbors, pa(Y_j) and the causal flow out of Y_j are determined, by which a causal basin is found. In PACC, pa(Y_j) can contain more than one node, so the model is more flexible than that of BCC using a tree.
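A minimal sketch of the Zero-MI testing step for one node; the tolerance `eps` is a hypothetical threshold (in practice the marginal mutual information is estimated, so it is only approximately zero):

```python
def find_parents(neighbors, marginal_mi, eps=1e-3):
    """Zero-MI testing sketch for a node j: any pair (a, b) of its direct
    neighbors with I(Y_a; Y_b) ~ 0 indicates a head-to-head structure
    (Type 3 in Fig. 5), so both a and b are oriented as parents of j.
    `marginal_mi[(a, b)]` holds the estimated marginal MI for a < b."""
    parents = set()
    for i, a in enumerate(neighbors):
        for b in neighbors[i + 1:]:
            if marginal_mi[tuple(sorted((a, b)))] <= eps:
                parents.update((a, b))
    return parents
```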
In order to build a classifier chain from the learned directions, we rank the labels to form a chain and then train a classifier for every label following the chain. The ranking strategy is simple: parents should be ranked higher than their descendants, and parents sharing the same child should be ranked at the same level. Hence, the learning of a label is not performed until the labels with higher ranks, including its parents, have been learned; that is, a kind of lazy decision is made. In PACC, we choose logistic regression with ℓ2 regularization as the baseline classifier. Therefore, a set of L logistic regression classifiers f = {f_j}_{j=1}^L is learned, each of which is trained by treating the union of x and pa(y_j) as a new augmented attribute vector x̃_j = (x, pa(y_j))^T, as follows:

f_j(x̃_j) = P(y_j = 1 | x̃_j; θ_j) = 1 / (1 + exp(-θ_j^T x̃_j)),   (15)

where θ_j is the model parameter vector for Y_j, which can be learned by maximizing the regularized log-likelihood given the training set:

θ_j* = arg max_{θ_j} Σ_{i=1}^N log P(y_j^(i) | x̃_j^(i); θ_j) - λ ||θ_j||^2,   (16)

where λ is a trade-off coefficient that avoids overfitting by penalizing large parameters θ_j. Traditional convex optimization techniques, such as the quasi-Newton method with BFGS iteration [23], can then be used to learn the parameters.

Classification
Exact inference in the prediction phase, as shown in (3), is NP-hard in general directed acyclic graphs. However, in polytrees, using the max-sum algorithm [26], we can make exact inference in a reasonable time by bounding the indegree of the nodes. Two phases are performed in order. In the first phase, we begin at the root(s) and propagate downward to the leaves; the conditional probability table for each node is calculated on the basis of its local graphical structure. In the second phase, message propagation starts upward from the leaves to the root(s); at each node Y_j, we collect all the incoming messages and find the local maximum with its value ŷ_j. In this way, we have the Maximum a Posteriori (MAP) estimate ŷ = (ŷ_1, ..., ŷ_L) such that

ŷ = arg max_{y ∈ Y} Π_{j=1}^L P(y_j | pa(y_j), x),   (17)

where the maximization is carried out by passing messages from the leaves Y_l to the roots Y_r. An example of learning and prediction in PACC is shown in Fig. 6. The algorithm of PACC is depicted in Algorithm 1.

Label-dependent feature selection
A two-stage feature selection approach consisting of a classifier-independent filter and a classifier-dependent wrapper has been recommended in [19] to gain a good trade-off between classification performance and computation time. Motivated by this study, we develop a two-stage feature selection approach for CC-based methods based on simple filter algorithms, in order to find label-dependent (equivalently, class-dependent) features [1] and preserve label correlations during feature selection. In this way, we expect the proposed approach to improve classification performance and reduce the computational complexity of both the learning and prediction phases.
According to whether features are evaluated individually or not, the existing filter algorithms can be categorized into two groups: feature weighting algorithms and subset search algorithms [42]. Feature weighting algorithms evaluate the weights of features individually and rank them by their relevance to the target class. They are quite efficient at removing irrelevant features, but totally ignore the correlations among features. On the other hand, redundant features that are strongly correlated with others also harm the performance of the learning algorithm [17]. Subset search algorithms aim to overcome this limitation while still maintaining a reasonable time complexity compared with wrapper algorithms. They search through candidate feature subsets guided by an evaluation measure which captures the goodness of each subset [24]. In this study, we propose a two-stage approach using both feature weighting and subset search in order to select label-dependent features.

Label-dependent feature weighting
In the first stage, we develop a Multi-Label Information Gain (MLIG) algorithm based on feature weighting to efficiently remove irrelevant features for each label. IG has been frequently used as an evaluation criterion for feature weighting in various machine learning tasks [40]. Given a label variable Y_j and a feature variable X_k, IG measures the amount by which the entropy of Y_j is reduced by knowing X_k:

IG(Y_j; X_k) = -Σ_{y_j} P(y_j) log P(y_j) + Σ_{x_k ∈ V_k} P(x_k) Σ_{y_j} P(y_j | x_k) log P(y_j | x_k),   (18)

where V_k denotes the value space of the feature variable X_k. In practice, numeric features should be discretized beforehand for computational efficiency. For multi-label datasets, a straightforward way to apply IG is to rank all the features for each label according to (18), and then select the top-ranked features to feed the post-processing. However, it is non-trivial to choose an appropriate threshold for filtering out irrelevant features. In addition, in the MLC setting, it is unreasonable to set the same threshold for all labels because of the label imbalance problem stated in Sec. 4.3. For the labels with a higher imbalance ratio, the number of positive instances may be insufficient for building an accurate classifier, in which case a smaller number of features should be chosen. To overcome this problem, in MLIG we set the percentage α_j of selected features for the label variable Y_j according to:

α_j = r (1 + 2 / IR(λ_j)),   (19)

where r is a factor controlling the range of α_j so that α_j ∈ [r, 3r]. According to (19), the value of α_j is close to 3r for the majority labels in well-balanced datasets, and α_j approaches r for the minority labels in highly imbalanced datasets. As a result, a smaller number of features is selected for each minority label in an imbalanced dataset, and vice versa.
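A minimal sketch of the MLIG stage: the IG computation of (18) over discretized features, followed by a top-m_j cut for one label (the budget α_j is passed in as a parameter here; the function names are illustrative):

```python
import math
from collections import Counter

def info_gain(feature_col, label_col):
    """IG(Y_j; X_k) = H(Y_j) - H(Y_j | X_k) for a discretized feature column."""
    n = len(label_col)

    def entropy(col):
        return -sum((c / n) * math.log(c / n) for c in Counter(col).values())

    h_y = entropy(label_col)
    h_y_given_x = 0.0
    for v, c in Counter(feature_col).items():
        sub = [y for x, y in zip(feature_col, label_col) if x == v]
        h_y_given_x += (c / n) * (
            -sum((s / c) * math.log(s / c) for s in Counter(sub).values()))
    return h_y - h_y_given_x

def mlig_select(X_cols, y_j, alpha_j):
    """Keep the top ceil(alpha_j * M) features for label j, ranked by IG."""
    M = len(X_cols)
    ranked = sorted(range(M), key=lambda k: info_gain(X_cols[k], y_j),
                    reverse=True)
    return ranked[:max(1, math.ceil(alpha_j * M))]
```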
In this way, MLIG first calculates a feature-label information gain matrix according to (18), then ranks the features for each label and selects the most relevant label-dependent features up to m_j = α_j M, j = 1, ..., L. Finally, we transform the original data D_j = {(x^(i), y_j^(i))}_{i=1}^N into Z_j = {(z_j^(i), y_j^(i))}_{i=1}^N, z_j ∈ R^{m_j}, by eliminating the irrelevant features.

Label-dependent feature subset selection
Although the MLIG approach works for feature selection to some extent, it is unable to eliminate redundant features. Thus we develop a feature subset selection algorithm in order to find a more compact feature subset by incorporating the label dependency modeled by the polytree structure.
In this stage, we extend Correlation-based Feature Selection (CFS) [12], one of the subset search algorithms, to remove redundant features. CFS is conducted once the polytree B = {pa(Y_j)}_{j=1}^L has been constructed. In the proposed Multi-Label CFS (MLCFS) approach, we apply CFS on the label-specific feature subspace, taking the label correlations modeled by B into account. More specifically, given a label variable Y_j with its dataset Z_j^+ = {(z̃_j^(i), y_j^(i))}_{i=1}^N, where z̃_j = z_j ∪ pa(y_j), the merit of a feature subset S_j of ñ_j (ñ_j = n_j + |pa(Y_j)|) features is evaluated by

Merit(S_j) = ñ_j ρ̄_{Y_j Z} / sqrt( ñ_j + ñ_j (ñ_j - 1) ρ̄_{ZZ} ),   (20)

where the mean feature-label correlation ρ̄_{Y_j Z} and the mean feature-feature correlation ρ̄_{ZZ} are calculated as

ρ̄_{Y_j Z} = (1/ñ_j) Σ_{Z ∈ S_j} ρ(Y_j, Z),  ρ̄_{ZZ} = 2 / (ñ_j (ñ_j - 1)) Σ_{Z, Z' ∈ S_j, Z ≠ Z'} ρ(Z, Z').

MLCFS first calculates the feature-feature and feature-label correlation matrices, and then employs a heuristic search algorithm, such as Best First [17], with the start set pa(Y_j), to search the feature subspace of Y_j by maximizing (20). In this way, the dimensionality of the feature space is reduced from m_j to n_j, typically n_j ≪ m_j. We transform the data Z_j^+ into V_j^+ = {(ṽ_j^(i), y_j^(i))}_{i=1}^N, where ṽ_j = v_j ∪ pa(y_j), v_j ∈ R^{n_j}. Finally, V_j^+ is used to learn the probabilistic classifier f_j.

Algorithm 2 gives the algorithm of PACC with Label-Dependent Features, named PACC-LDF. In the training phase, PACC-LDF first performs problem transformation in Step 1, applies MLIG to remove irrelevant features and transforms the training set {D_j} into {Z_j} in Steps 2 to 4. Then a polytree B is built on {Z_j} in Steps 5 to 6, and {Z_j} is transformed into {Z_j^+} based on B in Step 7. After that, MLCFS is performed on {Z_j^+}, which is further transformed into {V_j^+} in Steps 8 to 10. Finally, based on the dataset {V_j^+} with label-dependent features, the multi-label probabilistic classifier {f_j} is learned in Step 11. In the testing phase, a test dataset T is first projected into the lower-dimensional feature subspaces and then fed to the learned classifier for prediction in Steps 12 to 15. Fig. 7 shows the framework of PACC-LDF in terms of the training and testing phases.
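The CFS merit of (20) can be sketched as follows (an illustration assuming the pairwise correlations have already been computed; it also shows why a fully redundant feature does not raise the merit):

```python
import math

def cfs_merit(label_corrs, feature_corrs):
    """CFS merit of a feature subset as in (20):
    n * (mean feature-label correlation) / sqrt(n + n(n-1) * mean feature-feature corr).
    `label_corrs`: correlation of each subset feature with the label Y_j;
    `feature_corrs`: pairwise correlations among the subset features."""
    n = len(label_corrs)
    rho_yz = sum(label_corrs) / n
    rho_zz = (sum(feature_corrs) / len(feature_corrs)) if feature_corrs else 0.0
    return n * rho_yz / math.sqrt(n + n * (n - 1) * rho_zz)
```

For example, one feature with label correlation 0.8 scores 0.8; adding a second feature that is perfectly correlated with the first (redundant) leaves the merit at 0.8, while adding an uncorrelated one with the same label correlation raises it, which is exactly the behavior MLCFS exploits to discard redundancy.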

Discussion
PACC-LDF can be considered a general version of PACC, since PACC-LDF selects label-dependent features during the model building of PACC in order to improve its performance and reduce its time complexity. By applying only one stage of the proposed feature selection approach, we can obtain two simplified variants, each using only the feature weighting stage or only the subset selection stage.

Datasets
For each dataset D, we report the label cardinality (1/N) Σ_{i=1}^N |y^(i)| and the number of distinct label sets |{y | (x, y) ∈ D}| in order to depict its statistical properties. In addition, we report the imbalance level of D by IR and CVIR defined in (9). As a rule of thumb [5], a dataset D is considered imbalanced if IR is higher than 1.5 and CVIR exceeds 0.2. In this sense, all the datasets except the Scene and Emotions datasets are imbalanced, indicating the necessity of alleviating this problem in MLC methods. Table 2 reports the statistics of the twelve benchmark multi-label datasets from a variety of domains used in the experiments. According to the sizes of N, M and L, we treat the first eight datasets as regular-scale datasets and the last four as large-scale datasets.

Evaluation metrics
The existing multi-label evaluation metrics can be separated into two groups: instance-based metrics and label-based metrics [35]. To evaluate the performance of an MLC method on a test dataset T = {(x^(i), y^(i))}_{i=1}^{N_t}, we use two instance-based metrics, Exact-Match and Accuracy, and two label-based metrics, Macro-F1 and Micro-F1. Among these metrics, Exact-Match is the most stringent, especially for MLC problems with a large number of labels, since it gives no credit for a partial match of a label set. Despite that, by definition it is a good measure of how well label correlations are modeled. Accuracy is useful for measuring the performance of a classifier in terms of both positive and negative prediction ability. Unlike Exact-Match, both Macro-F1 and Micro-F1 take the partial match of labels into account. In addition, as stated in [33], Macro-F1 is more sensitive to the performance on rare categories (the minority labels), while Micro-F1 is affected more by the major categories (the majority labels). Hence, the joint use of Macro-F1 and Micro-F1 is a good complement to the instance-based evaluation metrics for evaluating the performance of MLC methods.

Table 2: Statistics of twelve benchmark multi-label datasets. Here N, M and L are the data size in instances, features and labels, respectively. Cardinality, Density and Distinct denote the label cardinality, the label density and the number of distinct label combinations, respectively. IR and CVIR together depict the degree of label imbalance, as defined in (9).
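As a sketch of the four metrics described above (the paper's exact formulas are omitted here), assuming binary indicator matrices Y and Yhat of shape (instances × labels), and with Accuracy taken as the instance-wise Jaccard similarity, its common definition in MLC:

```python
import numpy as np

def exact_match(Y, Yhat):
    # fraction of instances whose entire label vector is predicted correctly
    return np.mean(np.all(Y == Yhat, axis=1))

def accuracy(Y, Yhat):
    # instance-wise Jaccard similarity |y ∩ ŷ| / |y ∪ ŷ|, averaged over instances
    inter = np.logical_and(Y, Yhat).sum(axis=1)
    union = np.logical_or(Y, Yhat).sum(axis=1)
    return np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1)))

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def macro_micro_f1(Y, Yhat):
    # per-label counts of true positives, false positives, false negatives
    tp = np.logical_and(Y == 1, Yhat == 1).sum(axis=0)
    fp = np.logical_and(Y == 0, Yhat == 1).sum(axis=0)
    fn = np.logical_and(Y == 1, Yhat == 0).sum(axis=0)
    macro = np.mean([f1(t, p, n) for t, p, n in zip(tp, fp, fn)])  # average of per-label F1
    micro = f1(tp.sum(), fp.sum(), fn.sum())                       # F1 of pooled counts
    return macro, micro
```

The macro average weights every label equally, which is why it is sensitive to rare labels, while the micro average pools counts and is dominated by frequent labels, matching the discussion above.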

Implementation issues
In both feature weighting (18) and feature subset selection (20), the calculation of mutual information is performed extensively. For a discrete or categorical feature variable X (the label variable Y is binary by construction), the calculation of mutual information is simple and straightforward. Given a sample of n i.i.d. observations {(x^(i), y^(i))}_{i=1}^n, by the law of large numbers we have the approximation I(X; Y) ≈ Σ_{x,y} P̂(x, y) log [P̂(x, y) / (P̂(x)P̂(y))], where P̂ denotes the empirical probability distribution. When the feature variable X is continuous, it becomes quite difficult to compute I(X; Y), since P is typically unobtainable. One solution is kernel density estimation [14], but it is computationally expensive and selecting a good bandwidth is typically difficult. To circumvent this difficulty, in practice we compute I(X; Y) for continuous X by applying data discretization as preprocessing. In this study, a continuous feature X is discretized based on its mean μ_X and standard deviation σ_X. For example, we can apply a discretization similar to that used in [4], which maps a numeric value of X into one of three categories {−1, 0, 1} according to the thresholds μ_X ± σ_X. The experimental results demonstrate the effectiveness of this simple discretization for approximating I(X; Y) in feature selection.
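A minimal sketch of the plug-in estimate and the mean±std discretization described above; the function names are ours, and the thresholds follow the μ_X ± σ_X rule from the text:

```python
import numpy as np
from collections import Counter

def discretize(x):
    # map a numeric feature into {-1, 0, 1} by the thresholds mean ± std
    mu, sigma = x.mean(), x.std()
    return np.where(x < mu - sigma, -1, np.where(x > mu + sigma, 1, 0))

def mutual_information(x, y):
    # plug-in estimate: I(X;Y) ≈ Σ P̂(x,y) log(P̂(x,y) / (P̂(x)P̂(y)))
    n = len(x)
    pxy = Counter(zip(x, y))          # empirical joint counts
    px, py = Counter(x), Counter(y)   # empirical marginal counts
    return sum((c / n) * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
```

For a continuous feature, one would call `mutual_information(discretize(x), y)`, which is the preprocessing pipeline the paragraph describes.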
In addition, the calculation of conditional mutual information in (13) for building the polytree is computationally expensive for large-scale datasets. To reduce the training cost and make the proposed PACC and PACC-LDF tractable for large problems, normalized marginal mutual information, rather than the conditional mutual information (13), is used to model label correlations in the PACC-related methods for large-scale datasets. The normalized mutual information is defined as NI(X; Y) = I(X; Y) / sqrt(H(X)H(Y)), where H(·) denotes entropy. Compared with I(X; Y), the advantage of NI(X; Y) is that it alleviates the negative effect of the label imbalance problem, as discussed in Sec. 4.3.

Fig. 8: The performance of PACC-LDF in terms of four evaluation metrics on six synthetic datasets, varying the value of r from 0.05 to 0.3 in steps of 0.05.
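Assuming the common normalization NI(X; Y) = I(X; Y)/sqrt(H(X)H(Y)) (the paper's exact definition is not reproduced in this extraction), a self-contained sketch:

```python
import numpy as np
from collections import Counter

def entropy(x):
    # empirical Shannon entropy (natural log)
    n = len(x)
    return -sum((c / n) * np.log(c / n) for c in Counter(x).values())

def normalized_mi(x, y):
    # NI(X;Y) = I(X;Y) / sqrt(H(X) H(Y)); defined as 0 when either variable is constant
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = sum((c / n) * np.log(c * n / (px[a] * py[b]))
             for (a, b), c in pxy.items())
    hx, hy = entropy(x), entropy(y)
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0
```

Because NI lies in [0, 1] regardless of how skewed the label marginals are, comparing label pairs by NI rather than raw MI dampens the bias toward frequent labels, which is the imbalance effect mentioned above.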

Experimental setting
The methods used in the experiments were implemented based on Mulan and Meka, and evaluated on six synthetic datasets and twelve benchmark datasets. To assess classification performance, 5-fold and 3-fold cross-validation were used for the eight regular-scale and the four large-scale datasets, respectively.
In the experiments we chose logistic regression with ℓ2 regularization as the baseline classifier, and set the trade-off parameter λ in (16) to 0.1 for all MLC methods. To reduce the training cost, normalized mutual information, instead of the conditional mutual information (13), was calculated for the large-scale datasets. The experiments were conducted on a computer with an Intel Quad-Core i7-4770 CPU at 3.4 GHz and 4 GB of RAM.

Experiments on synthetic datasets
In this section, we conduct experiments on six synthetic multi-label datasets to evaluate the performance of PACC and its three variants, PACC-MLIG, PACC-MLCFS and PACC-LDF. The six synthetic datasets, four regular-scale and two large-scale, were generated according to the method in [34]. In each dataset, instances were produced by randomly sampling from R hypercubes (labels) in the M-dimensional feature space, so a dataset is denoted DataM-R. The M-dimensional features consisted of three parts: relevant features, irrelevant features and redundant features. The irrelevant features were randomly generated, and the redundant features were copies of existing relevant features. In addition, in order to simulate real-world multi-label data, classification noise was added to these synthetic datasets by flipping the value of each label of an instance at random with probability 0.02. The statistics of the synthetic datasets are reported in Table 3.

First, we performed experiments on PACC-LDF by changing the value of the factor r in (19), which controls the lower and upper bounds of α_j by r ≤ α_j ≤ 3r according to (19). Experimental results in four evaluation metrics are shown in Fig. 8, from which we can draw two conclusions: (1) on the regular-scale datasets, PACC-LDF works worse for a small value of r (r < 0.1), but becomes better and stable once r exceeds 0.15; (2) on the large-scale datasets, PACC-LDF performs better when r is small, and slightly worse once r exceeds 0.15. Therefore, in the rest of the paper, r is set to 0.15 and 0.05 for the regular-scale and large-scale datasets, respectively.
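A rough sketch of such a generator is given below; the hypercube radius, feature ranges and default counts are illustrative assumptions, since the exact procedure is specified in [34]:

```python
import numpy as np

def make_synthetic(n=1000, n_rel=10, n_irr=5, n_red=5, R=4, noise=0.02, seed=0):
    """Sketch of a DataM-R style multi-label generator (details are assumptions)."""
    rng = np.random.default_rng(seed)
    X_rel = rng.uniform(-1, 1, size=(n, n_rel))       # relevant features
    centers = rng.uniform(-1, 1, size=(R, n_rel))     # one hypercube per label
    # label j is on iff the instance lies in an L-infinity ball around center j
    Y = (np.abs(X_rel[:, None, :] - centers[None]).max(axis=2) < 0.8).astype(int)
    X_irr = rng.uniform(-1, 1, size=(n, n_irr))       # irrelevant: pure noise
    X_red = X_rel[:, rng.integers(0, n_rel, n_red)]   # redundant: copies of relevant
    X = np.hstack([X_rel, X_irr, X_red])
    flip = rng.random(Y.shape) < noise                # classification noise: flip labels
    Y = np.where(flip, 1 - Y, Y)
    return X, Y
```

The three feature groups (relevant, irrelevant, redundant) and the 0.02 flip noise mirror the construction described in the paragraph above.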
In Fig. 9, the performances of PACC, PACC-MLIG, PACC-MLCFS and PACC-LDF in Accuracy and learning time are reported. We do not show the performances in the other metrics here, since similar results and patterns were observed. The proposed LDF and its single-stage variants significantly improve the performance of PACC. Specifically, PACC-LDF works best among the four methods and achieves at least a 10% performance improvement over the original PACC, indicating the effectiveness of the proposed two-stage feature selection approach. In terms of learning time, PACC-LDF consumed the least time on the last five datasets. In terms of testing time, all the methods took similar time on the regular-scale datasets, but PACC-LDF cost the least on the two large-scale datasets. Therefore, the two-stage feature selection approach, LDF, rather than MLIG or MLCFS alone, is employed in the following experiments.

Experiments on real-world datasets
Next we evaluate the performances of popular MLC methods on the twelve real-world benchmark multi-label datasets in Table 2. This experiment consists of three major parts. In the first part, we compare PACC with three CC-based methods, BR, CC and BCC, to demonstrate the effectiveness of the polytree structure in capturing label correlations. In the second part, PACC-LDF is compared with three state-of-the-art MLC methods in terms of classification accuracy and execution time. In the last part, CC-based methods are compared pairwise with their LDF variants to evaluate the performance of the two-stage feature selection approach. In addition, comparative results of LDF against traditional feature selection algorithms are presented. The MLC methods used in this section are summarized as follows:
- CC-based methods have been introduced in Section 3. In CC, the chain is established in a randomly determined order. In BCC, normalized mutual information is used for marginal dependency estimation on each label pair, since this slightly improves performance without extra processing time.
- Multi-Label k-Nearest Neighbors (MLkNN) [46] originates from the traditional k-nearest neighbors algorithm. For each test instance, a prediction is made on the basis of the MAP principle, according to the label assignments of its k nearest neighbors in the training set. In the experiments, we set k = 10, following the suggestion in [46].
- RAndom k-labELsets (RAkEL) [37] is an ensemble variant of the Label Combination (LC) method. RAkEL transforms an MLC problem into a set of smaller MLC problems by training m LC models on random k-subsets of the original label set. To make it executable within a limited time budget (24 h), RAkEL employed the C4.5 decision tree as its baseline single-label classifier for the large-scale datasets. We set k = 3 and m = 2L as recommended in [37].
- HOMER [36] builds a Hierarchy Of Multi-label classifiERs on the basis of balanced k-means clustering, which reduces the complexity of prediction and addresses the label imbalance problem. Following the experimental results in [36], the number k of clusters for building the hierarchical structure was set to 4. Binary Relevance with ℓ2-regularized logistic regression was used as its baseline multi-label classifier.

Comparison of PACC with CC-based methods
The results of the CC-based methods are summarized in Table 4. In terms of the instance-based metrics, Exact-Match and Accuracy, PACC was the best or competitive with the best methods, except on the Yeast dataset. This is understandable because PACC is a subset 0-1 risk minimizer, benefiting from the polytree structure as well as exact inference. In these metrics, CC is the second best, while BCC is the third in most cases, probably because BCC models only pairwise label dependencies.

According to [10], the performance of two methods is regarded as significantly different if their average ranks differ by at least the critical difference (CD). Figure 10 shows the CD diagrams for four evaluation metrics at the 0.05 significance level. In each subfigure, the CD is given above the axis, on which the average ranks are marked. In Figure 10, PACC-LDF is significantly better than MLkNN and HOMER in Exact-Match and Macro-F1. However, there was no significant difference between PACC-LDF and the other methods in Accuracy and Micro-F1.

Table 6 summarizes the learning and prediction time of the eight comparing methods. Over all the methods, MLkNN needed the least training time due to its lazy strategy, while HOMER cost the least time in the prediction phase as it has sublinear time complexity with respect to the number of labels. RAkEL consumed the most training time on all datasets except Medical, despite employing a simple decision tree as its baseline classifier. The high complexity of RAkEL probably arises from its ensemble strategy and the LC models used for modeling label correlations. For the CC-based methods, a significant reduction of both learning and prediction time can be observed when employing LDF. Indeed, on average 60% of the features were removed in the two balanced datasets, Scene and Emotions, while at least 80% of the features were eliminated in the other datasets, leading to a remarkable reduction in time complexity. However, PACC-LDF consumed more time on Corel16k1 and Corel5k than the other CC-based methods, probably because feature selection dominates the time complexity on these two datasets. In total, PACC-LDF is a good choice for MLC when exact matching is expected and low execution time is demanded.
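The critical difference used in Fig. 10 can be computed from the standard Nemenyi formula CD = q_α · sqrt(k(k+1)/(6N)) (Demšar, 2006), where k is the number of methods and N the number of datasets; the q_α value below is the tabulated constant for k = 4 and α = 0.05:

```python
import math

def nemenyi_cd(k, n, q_alpha=2.569):
    """Critical difference for the Nemenyi post-hoc test (Demšar, 2006).

    k: number of comparing methods, n: number of datasets,
    q_alpha: Studentized-range-based critical value (2.569 for k = 4, alpha = 0.05).
    """
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
```

For the setting above (k = 4 methods, N = 12 datasets), the CD is about 1.35 average-rank units; two methods whose average ranks differ by less than this are connected by a thick line in the diagrams.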

Results of feature selection
From Fig. 11, we can confirm the effectiveness of the proposed Label-Dependent Feature (LDF) selection approach. In terms of Exact-Match, the performance of the CC-based methods is significantly improved on most of the datasets, especially the large-scale ones. For example, on the Corel5k dataset, PACC-LDF works more than 40% better than PACC, and even four times better than BR, demonstrating the superiority of selecting label-dependent features for such a large-scale dataset. According to Fig. 11, the CC-based methods with LDF achieve a 9.4% performance improvement on average in Exact-Match over the original methods. The effectiveness of LDF is also confirmed by Table 7, which shows the results of the Wilcoxon signed-ranks test [10]. The test was conducted sixteen times, each time comparing one CC-based method with its LDF counterpart. According to the results, all the LDF variants outperform the original methods in Exact-Match, and obtain comparable results in the other evaluation metrics.
In addition, to demonstrate the effectiveness of the feature selection algorithm underlying LDF, we compared LDF with three feature selection approaches, Gain Ratio (GR) [15], ReliefF [16] and Wrapper [18], on the Emotions and Medical datasets. PACC with ℓ2-regularized logistic regression was chosen as the classifier. In these feature selection algorithms, backward greedy stepwise search is applied to find the relevant features for each label individually. To reduce the time cost of Wrapper, the top 50% (Emotions) and 10% (Medical) most relevant features were first selected by a filter algorithm [40] before applying the wrapper [18]. The fraction of selected features is increased from 0.05 to 0.5 in steps of 0.05. Fig. 12 shows the experimental results on the two datasets in terms of four evaluation metrics. As shown in Fig. 12, LDF consistently works better than the other algorithms. Wrapper is the second best, and is competitive with LDF in Macro-F1. ReliefF performs better than GR in most cases, and is even comparable with LDF and Wrapper in some cases, but it is sensitive to the number of selected features. In terms of time complexity, ReliefF, GR and LDF have similar time costs, while Wrapper requires hundreds of times more execution time than the other algorithms.

Conclusion and future work
In this paper we have proposed polytree-augmented classifier chains with label-dependent features, in order to achieve better classification accuracy at lower computational cost than other popular MLC methods. As verified by the experimental results, the proposed PACC method outperformed the other CC-based methods in Exact-Match. In addition, the two-stage label-dependent feature selection approach, LDF, contributed to improved performance and reduced execution time for PACC and the other CC-based methods. In future work, we plan to conduct dimensionality reduction in the label space, in order to further decrease the computational complexity of MLC methods and improve their scalability on large-scale datasets.

Fig. 1 :
Fig. 1: Probabilistic graphical models of CC-based methods for an MLC problem with four label variables.

Fig. 2 :
Fig. 2: Label correlation and label imbalance in the Enron dataset (L = 53).(a) Visualization of label correlations; (b) The label imbalance problem, where 15 most frequent labels are reported.

Fig. 3 :
Fig. 3: Irrelevant and redundant features in a multi-label setting. (a) shows the label distribution over the 2D feature space; (b) shows the classification partition strategy for (a); (c) gives the probabilistic graphical model for (a).

I(A; C) = Σ_{a,c} P(a, c) log [ P(a, c) / (P(a)P(c)) ] = 0.  (14)

In this case, B is a multi-parent node. More generally, we can perform Zero-Mutual Information (Zero-MI) testing on a triplet, Y_j with its two neighbors Y_a and Y_b: if I(Y_a; Y_b) = 0, then Y_a and Y_b are parents of Y_j, and Y_j becomes a multi-parent node.
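A sketch of the triplet test described above; the threshold eps is an assumption of ours, since an empirical MI estimate is never exactly zero:

```python
import numpy as np

def zero_mi_orient(mi, j, a, b, eps=1e-3):
    """Zero-MI test for a triplet: candidate node Y_j with neighbors Y_a and Y_b.

    If the marginal MI between the two neighbors is (near) zero, they are
    marginally independent, so both edges are oriented into Y_j, making it a
    multi-parent node. Returns the parent set assigned to Y_j by this test;
    j itself is unused in the check, since only the neighbors' MI matters.
    """
    return {a, b} if mi[a, b] < eps else set()
```

Applying this test to every internal node of the spanning-tree skeleton yields the edge directions of the polytree.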
Polytree by Zero-MI testing

Fig. 6 :
Fig. 6: Learning (b-e) and prediction (f) phases of PACC.The true but hidden graphical model (a) is learned from data.(b) Construct a complete graph G with edges weighted by the mutual information I. (c) Construct a spanning tree in G.(d) Make directions by Zero-MI testing.(e) Train six probabilistic classifiers f 1 -f 6 .(f) Prediction is made in the order of circled numbers.

Algorithm 1: Algorithm of PACC
Input: D: training set, T: test set, f = {f_j}_{j=1}^L: multi-label probabilistic classifier
Output: ŷ: prediction on a test instance x, x ∈ T
Training:
1: Transform D into {D_j}_{j=1}^L, where D_j = {(x^(i), y_j^(i))};
2: Calculate the mutual information matrix I = {I_jk}_{L×L} according to (13);
3: Construct a polytree B = {pa(Y_j)}_{j=1}^L on I, and form the chain;
4: Transform {D_j}_{j=1}^L into {D_j^+}_{j=1}^L based on B;
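Steps 2-3 of Algorithm 1 amount to building a maximum-weight spanning tree on the mutual information matrix (Chow-Liu style) before orienting edges by Zero-MI testing; a sketch using Prim's algorithm follows (the paper's exact construction may differ):

```python
import numpy as np

def max_spanning_tree(I):
    """Maximum-weight spanning tree over a symmetric MI matrix via Prim's algorithm.

    Returns a list of (i, j) edges forming the skeleton of the polytree.
    """
    L = I.shape[0]
    in_tree = {0}          # grow the tree from an arbitrary start node
    edges = []
    while len(in_tree) < L:
        # pick the heaviest edge crossing the cut between tree and non-tree nodes
        best = max(((i, j, I[i, j]) for i in in_tree for j in range(L)
                    if j not in in_tree), key=lambda e: e[2])
        edges.append((best[0], best[1]))
        in_tree.add(best[1])
    return edges
```

The resulting undirected skeleton is then directed by the Zero-MI test on each node's neighbors, producing the polytree B used to define pa(Y_j).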

Fig. 7 :
Fig. 7: Flow chart of the proposed PACC-LDF method.The learning phase (a) consists of seven steps: problem transformation, MLIG, MI estimation, polytree construction, feature augmentation, MLCFS and model learning.The prediction phase (b) consists of three steps: instance transform, testing and exact inference.

Fig. 9 :
Fig. 9: The performance of PACC with its three variants in Accuracy and Learning time (in seconds) on six synthetic datasets.

Fig. 11 :
Fig. 11: Comparison of CC-based methods with their LDF variants in Exact-Match. For each dataset, the Exact-Match values have been normalized by dividing by the lowest value in that dataset.
The source of datasets: http://mulan.sourceforge.net/datasets-mlc.html
‡ Type of features. a: numeric, b: nominal, c: both numeric and nominal

Table 3 :
Statistics of six synthetic multi-label datasets.

Table 5 :
Table 5: Experimental results (mean±std) of PACC-LDF and three state-of-the-art methods on twelve multi-label datasets in terms of four evaluation metrics.

Consistent with our theoretical analysis, BR obtains the worst result in Exact-Match due to ignoring label correlations. It is also worth noting that BR works better than the CC-related methods only on the Birds dataset, indicating weak label correlations in that dataset. In Macro/Micro-F1, BR and BCC obtained results competitive with CC and PACC. This is probably because the label-based evaluation metrics emphasize the performance on individual labels; indeed, BR is the Hamming-loss risk minimizer, and BCC models only the most important pairwise label dependency.

7.4.2 Results of PACC-LDF against the state of the art

Next, PACC-LDF was compared with three popular MLC methods. The experimental results are shown in Table 5, from which we can see that PACC-LDF is the best in Exact-Match, and competitive with RAkEL and MLkNN in Accuracy and Macro-F1. In Accuracy and Micro-F1, RAkEL works best, with MLkNN and PACC-LDF following. To compare the performance of multiple methods on multiple datasets, we conducted the Friedman test [10], aiming to reject the null hypothesis of equal performance among the comparing methods. Since the null hypothesis is rejected by the Friedman test for all the metrics (the statistic F_F for Exact-Match, Accuracy, Macro-F1 and Micro-F1 is 8.8995, 3.2960, 2.9237 and 8.5074, respectively, each higher than the critical value 2.8805 at significance level α = 0.05), the Nemenyi test [25] was conducted for pairwise comparison of classification performance.
Fig. 10: CD diagrams (0.05 significance level) of four comparing methods in four evaluation metrics. The performance of two methods is regarded as significantly different if their average ranks differ by at least the critical difference; algorithms that are not significantly different are connected by a thick line. The test shows that PACC-LDF is significantly better than both MLkNN and HOMER in Exact-Match.

Table 6 :
Learning and prediction time (in seconds) of eight comparing methods on twelve datasets.

Table 7 :
Wilcoxon signed-ranks test with significance level α = 0.05 for CC-based methods against their LDF variants in terms of four evaluation metrics (p-values are shown in brackets). A "win" denotes the existence of a significant difference.