IEICE TRANS. INF. & SYST., VOL.E101–D, NO.5 MAY 2018
PAPER
Graph-Based Video Search Reranking with Local and Global
Consistency Analysis
Soh YOSHIDA†a), Takahiro OGAWA††b), Miki HASEYAMA††c), Members,
and Mitsuji MUNEYASU†d), Senior Member
SUMMARY Video reranking is an effective way for improving the re-
trieval performance of text-based video search engines. This paper pro-
poses a graph-based Web video search reranking method with local and
global consistency analysis. Generally, the graph-based reranking ap-
proach constructs a graph whose nodes and edges respectively correspond
to videos and their pairwise similarities. A lot of reranking methods are
built based on a scheme which regularizes the smoothness of pairwise rel-
evance scores between adjacent nodes with regard to a user’s query. How-
ever, since the overall consistency is measured by aggregating only the local
consistency over each pair, errors in score estimation increase when noisy
samples are included within query-relevant videos’ neighbors. To deal with
the noisy samples, the proposed method leverages the global consistency
of the graph structure, which is different from the conventional methods.
Specifically, to detect this consistency, the proposed method introduces
a spectral clustering algorithm that detects video groups, in
which videos have strong semantic correlation, on the graph. Furthermore,
a new regularization term, which smooths ranking scores within the same
group, is introduced into the reranking framework. Since the score
regularization is performed from both local and global aspects simultaneously,
accurate score estimation becomes feasible. Experimental results obtained
by applying the proposed method to a real-world video collection show its
effectiveness.
key words: video search reranking, graph learning, graph consistency
analysis, spectral clustering
1. Introduction
With the explosive growth of social media, a great number
of videos are being generated and shared on the Internet. For
example, YouTube has over a billion users and people watch
hundreds of millions of hours every day∗. Thus, many tech-
niques have been developed for multimedia searches. Ow-
ing to the success of information retrieval businesses, such
as Google, Bing, and Yahoo!, most search engines employ
text-based techniques by using nonvisual information such
as surrounding text and user-provided tags, associated with
visual content. However, since textual information is some-
times noisy or unavailable, the inconsistency between tex-
tual features and visual contents can cause poor image/video
Manuscript received September 1, 2017.
Manuscript revised December 20, 2017.
Manuscript publicized January 30, 2018.
†The authors are with Kansai University, Suita-shi, 564–8680
Japan.
††The authors are with Hokkaido University, Sapporo-shi, 060–
0814 Japan.
a) E-mail: sohy@kansai-u.ac.jp
b) E-mail: ogawa@lmd.ist.hokudai.ac.jp
c) E-mail: miki@ist.hokudai.ac.jp
d) E-mail: muneyasu@kansai-u.ac.jp
DOI: 10.1587/transinf.2017EDP7277
search results [1], [2].
To improve the text-based search performance and
overcome the semantic gap between text information and
video contents, visual search reranking has been the focus
of attention in recent years [3]–[9]. This technique adjusts
the initial ranking orders by mining visual content or lever-
aging some auxiliary knowledge. Most reranking methods
have been developed on the basis of the following three as-
sumptions: (1) visual contents with dominant patterns are
expected to be ranked higher than others, (2) visual contents
with similar visual appearance are to be ranked closely, and
(3) top-ranked contents in initial search results are expected
to be ranked relatively higher than the others. Under these
assumptions, visual information is introduced to refine the
initial search result.
Many reranking methods are formulated as finding
the optimal ranked list from the perspective of Bayesian
theory [10], [11] and manifold discovery [12], [13]. These
reranking approaches assume that relevant multimedia documents
such as images and videos lie on a manifold in visual
feature space. Then the reranking is accomplished by
graph-based learning methods. Therefore, we call it graph-
based reranking. Generally, the approach constructs a graph,
where the nodes are multimedia documents and the edges
reflect their pairwise similarities. The initial relevance of
each document can be viewed as the stationary probabil-
ity of each node and can be transitioned to other similar
nodes until some convergence conditions are satisfied. This
graph representation of search results can be integrated into
a regularization framework by considering the following
two terms: a graph regularizer that keeps the ranking positions
of visually similar documents close and a loss term
ensuring that the reranked results do not change too much
from the initial ranking list.
Although many different methods have been proposed,
the visual consistency between similar video contents is not
always guaranteed due to the complexity of real-world video
contents. Consequently, in some cases, search performance may
even be degraded after reranking. This is because
most graph-based methods measure visual consistency in a pairwise
manner: the overall consistency is measured by aggregating
the local consistency over each pair. Thus, errors in
score estimation increase when noisy samples are included
in the pairs. To solve this problem, we introduce the idea
∗https://www.youtube.com/yt/press/statistics.html
Copyright c© 2018 The Institute of Electronics, Information and Communication Engineers
YOSHIDA et al.: GRAPH-BASED VIDEO SEARCH RERANKING WITH LOCAL AND GLOBAL CONSISTENCY ANALYSIS
of social network analysis. Specifically, community detection
methods have attracted great research interest in the
past years [14]. A community consists of a group of nodes
that are densely connected to each other but sparsely connected
to other dense groups. Since a community structure
in a network usually reveals a common topic or interest,
consistency over the area within a community indicates
a video group whose videos are strongly correlated
with their neighbors. We call this global consistency. However,
this kind of consistency analysis has not been considered for improving
the performance of video search reranking. Therefore, it is
desirable to develop a novel algorithm that regularizes graph
consistency based on both local and global aspects simultaneously.
In this paper, we propose a novel graph-based reranking
method with local and global consistency analysis. We adopt the
following two procedures: (A) detection of the global consistency
over the graph and (B) modeling of the graph-based
reranking considering both local and global consistency.
First, in (A), we detect the global consistency by applying
a spectral clustering algorithm [15] to the constructed
graph. Given a similarity graph, a spectral clustering algorithm
finds a partition of the set of its nodes into clusters.
The partition is sought so that the following properties hold: nodes in
different clusters are dissimilar to each other, i.e., the
between-cluster similarities are minimized; and nodes in the
same cluster are similar to each other, i.e., the
within-cluster similarities are maximized. From the clustering
result, we extract center nodes corresponding to representative
nodes of each cluster. Then we define a new affinity
matrix representing the similarity between center nodes and
the similarity between nodes in the same video group.
In (B), we model reranking using the graph and the
affinity matrix which reflects global consistency over the
graph. Our reranking model is built based on a Bayesian
formulation [10] and its multimodal expansion [9]. In this
paper, we introduce a new graph regularizer that smooths
the ranking scores among the same video group obtained
by the procedure in (A). For a video, instead of calculating
the consistency with each of its neighbors individually, the
proposed regularizer considers the consistency with all
videos in the same group simultaneously. By using this
term with the previous regularization framework, the proposed
method can suppress the influence of noisy videos.
Furthermore, it is difficult to assign appropriate parameters
to the two types of affinity matrices. In order to integrate
the two aspects, we introduce a graph-learning approach and
tune these parameters automatically. Finally, by minimizing
the objective function including three terms, i.e., graph lo-
cal and global regularizer terms and a loss term, the desired
consistency over the graph is guaranteed. Therefore, perfor-
mance improvement by the graph-based reranking becomes
feasible.
The contribution of this work is summarized as fol-
lows:
1) We propose a graph global consistency detection ap-
proach for video search reranking. This enables in-
tegration of global consistency analysis into a graph-
based regularization framework.
2) The proposed method simultaneously regularizes the
smoothness of the ranking scores between not only
adjacent nodes but also nodes among the same video
group. This approach enables suppression of the influ-
ence of noisy videos’ score propagation.
This paper is an extended version of [16]. In this pa-
per, the following three aspects are enhanced. 1) In order
to improve the robustness of the algorithm for obtaining the
affinity matrix of each aspect, we introduce the graph-based
learning approach in our method. By using this approach,
tuning parameters for determining the scale of the affinity
matrix are automatically learned. 2) We add a discussion
of the parameters that must be set manually. 3) We collect a Web
video search dataset using 15 queries and study the effec-
tiveness of the proposed method by comparing it with vari-
ous conventional graph-based reranking methods.
The remainder of this paper is organized as follows.
In Sect. 2, we review the related work on the visual search
reranking for image and video retrieval. Section 3 presents
the proposed method, which retrieves videos using a graph-
based reranking framework with local and global consis-
tency analysis. Section 4 provides experimental results that
verify the performance of the proposed method. Finally,
Sect. 5 presents concluding remarks.
2. Related Work
2.1 Visual Search Reranking
Visual search reranking has been widely investigated for im-
proving the search performance of images, videos and other
multimedia documents. The existing visual search reranking
efforts can be mainly classified into two categories accord-
ing to whether there are query examples available, which are
called example-based reranking and self-reranking.
For the first category, these methods need several ex-
amples in addition to a text-query. Yan et al. [3] regard
the query examples as relevant samples and several bottom-
ranked results in a ranking list as irrelevant ones. A Sup-
port Vector Machine (SVM) model is then learned based on
these samples to rerank the search results. Natsev et al. [4]
improve the robustness of this example-based approach by
a bagging strategy. They collect multiple irrelevant sam-
ple sets and then generate different ranking lists accord-
ingly. These ranking lists are aggregated to generate the final
reranked result. Liu et al. [5] use the query examples to dis-
cover the relevant and irrelevant concepts for a given query
and identify an optimal set of document pairs by using an
information theory. A ranking list is then directly recovered
from this pair set. These methods can improve search per-
formance if good visual examples are provided. However,
these methods cannot be used in the cases when there is no
visual example available.
For the second category, the self-reranking approach
does not rely on query examples. It aims to improve text-
based search by mining the visual information of images or
videos. In many cases, we can assume that the top-ranked
documents are the few “relevant” (called pseudo relevant)
documents that can be viewed as “positive”. This is in
contrast to relevance feedback where users explicitly pro-
vide feedback by labeling the results as positive or nega-
tive. Kennedy et al. [6] regard top and bottom-ranked re-
sults in a ranking list as pseudo relevant and irrelevant sam-
ples respectively to discover the related concepts. The de-
tection results of the related concepts are then used as high-
level features in SVM to build classifiers for reranking. Hsu
et al. [17] formulate the reranking process as a random walk
over a context graph, where videos are nodes and the edges
between them are weighted by multimodal similarities. Jing
et al. [8] apply the PageRank [18] to product image search
and design the VisualRank algorithm for reranking. After
a similarity-based image link graph is generated, an itera-
tive computation similar to PageRank is utilized to rerank
the images. Yang et al. [19] extract multiple features from
each image and collect a training set that contains several
queries and labeled search results. Reranking is then re-
garded as a supervised learning task. Tian et al. [10] model
the textual and visual information from the probabilistic per-
spective and formulate visual reranking as an optimization
problem in the Bayesian framework, named Bayesian visual
reranking. This method encodes the assumptions that the
reranked results do not change much from the initial rank-
ing list and the ranking positions of visually similar images
are close.
However, a fundamental deficiency of self-reranking lies in noise,
i.e., it is not guaranteed that irrelevant instances are always
absent from the top returns, and in many cases they push
true positives down after reranking. In this work, to
perform robust visual reranking in this kind of situation, we
investigate video search reranking with local and global consistency
analysis based on a community detection approach.
By learning the adaptive similarity weights of each aspect,
we will show that our approach can effectively integrate two
aspects to boost ranking performance.
2.2 Graph-Based Learning
Graph-based learning has been introduced into visual
reranking in recent years. One major advantage of graph-based
learning is that it encodes the data structure into the data
similarity measurement to refine inference and modeling. In
these methods, a graph is constructed based on the given
data, where nodes and edges respectively correspond to
samples and their pairwise similarities. They are usually
formulated in a regularization scheme with two terms. One
term is used to enforce the function to be smooth on the
graph, and the other term is used to keep the function consis-
tent with prior information such as the labeling information
of several samples. The algorithms can be accomplished by
a random walk process.
He et al. [12] adopt a graph-based method named
manifold-ranking in image retrieval. Wang et al. [9] devel-
oped a multi-graph learning approach to fuse multiple fea-
ture channels based on semi-supervised learning. In [20],
multiple graphs from different retrieval methods are fused
by summing up the edge weights, and then a graph align-
ment is conducted to build an overall similarity graph. In
[8], [10], [21], the initial ranking list is refined on the graph
by propagating the ranking scores through the edges.
Unfortunately, the regularization term used in these
methods measures the graph consistency in a pairwise manner. Specifically,
the overall consistency is measured by aggregating
the local consistency over each pair. However, the consistency on the
graph is multiple-wise rather than pairwise, since it is
defined over the whole set of neighboring samples. Therefore, the
consistency approximated through pairwise regularizers is
not satisfactory. Our method is inspired by [9], [14].
Our approach first detects the global consistency of the over-
all graph. By using the multimodal graph learning method,
we then fuse the two types of graphs and then estimate an
optimal relevance score with regard to the user’s query.
3. Graph-Based Video Search Reranking with Consis-
tency Analysis
In this section, we describe our proposed reranking ap-
proach. We first introduce the existing graph-based rerank-
ing methods with a general regularization scheme. We then
present our approach including consistency analysis and
new graph regularization. For clarity, the notations and def-
initions throughout this paper are summarized in Table 1.
3.1 Graph-Based Reranking with Local Regularizer
We first follow [10] to define several terms in reranking. Let
$\bar{r} = [\bar{r}_1, \bar{r}_2, \ldots, \bar{r}_N]^T$ and $r = [r_1, r_2, \ldots, r_N]^T$ denote the vectors
of the initial ranking scores and the relevance scores, respectively,
which correspond to the video set $X = \{x_1, x_2, \ldots, x_N\}$. Here, $\bar{r}_i$
is the initial ranking score of $x_i$, which is calculated
from its ranking position in the keyword search, and $r_i$ is its
relevance score with regard to the user's query. We also use
Table 1 Notation table.
Notation                                  Definition
$X$, $x_i$                                The video set and the $i$th video in a ranking list.
$\bar{r}$, $\bar{r}_i$                    The vector of the initial ranking scores and the score of $x_i$.
$r$, $r_i$                                The vector of the relevance scores and the score of $x_i$.
$L$, $G$                                  Indicators for local and global aspects.
$W^{\bullet}$                             The affinity matrix of videos.
$A_{\bullet}$                             The transformation matrix including the affinity matrix.
$L_{\bullet}$, $\tilde{L}_{\bullet}$      The graph Laplacian and the normalized graph Laplacian derived from $W^{\bullet}$.
$D_{\bullet}$                             The degree matrix derived from $W^{\bullet}$.
$O$                                       The centroids of spectral clustering.
$C$                                       The node set which corresponds to each centroid.
$\alpha_{\bullet}$, $\rho$                Tuning parameters.
$N$                                       The number of videos.
$K$                                       The number of clusters for spectral clustering.
$T$, $T_1$                                The numbers of iterations in the alternating optimization.
xi to denote its feature vector. In this paper, three kinds of
visual features and one kind of audio feature are adopted
(described in 4.1).
Generally, graph-based reranking can be formulated as
a regularization framework. The objective function is then
defined as:
$$
\arg\min_{r} Q(r) = R(r, W) + \rho L(r, \bar{r}), \tag{1}
$$
where the first part is a regularization term that makes the
ranking scores of visually similar videos close, the second
part is a loss term that estimates the difference between r
and $\bar{r}$, and $\rho$ is a trade-off parameter. To define the term $R(r, W)$, a
graph $G$ is constructed whose nodes are the videos and whose edges
link similar videos. The graph Laplacian [22]
and the normalized graph Laplacian [23] are widely used for this regularizer.
When constructing the graph $G$, each video is connected
with its k-nearest neighbors [10]. $W$ is an affinity matrix
in which $W_{ij}$ indicates the visual similarity between $x_i$ and
$x_j$. In this paper, we use $W^L$ and $W^G$ as the affinity matrices
for the local and global aspects, respectively. For the local
aspect, if two videos $x_i$ and $x_j$ are connected by an edge,
the similarity $W^L_{ij}$ is calculated based on the Gaussian kernel
with the scaling parameter $\sigma_L$. Otherwise, the two videos
are not connected and $W^L_{ij} = 0$. We define the affinity matrix
$W^L \in \mathbb{R}^{N \times N}$ by taking $W^L_{ij}$ as its $(i, j)$th element. By
minimizing the objective function $Q(r)$, the optimum ranking
score list $r^*$ is derived as $r^* = \arg\min_r Q(r)$ using
the local regularizer $R(r, W^L)$.
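As a minimal sketch of the local graph construction described above, the following Python code builds a symmetric k-nearest-neighbor affinity matrix with Gaussian weights. The function name `knn_gaussian_affinity`, the symmetrization rule (an edge exists if either video is among the other's neighbors), and the median-distance default for the scale are assumptions made for illustration; the text does not specify them at this point.

```python
import numpy as np

def knn_gaussian_affinity(X, k=5, sigma=None):
    """Return W with W[i, j] = exp(-||x_i - x_j||^2 / sigma^2) on k-NN edges, 0 elsewhere.

    X is an (N, d) array of video-level features (one row per video).
    """
    N = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq_norm = np.sum(X**2, axis=1)
    D2 = np.maximum(sq_norm[:, None] + sq_norm[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(D2, 0.0)
    if sigma is None:
        # Assumption: use the median pairwise distance as the Gaussian scale.
        iu = np.triu_indices(N, k=1)
        sigma = np.median(np.sqrt(D2[iu]))
    # Indices of the k nearest neighbors of each node, excluding the node itself.
    D2_no_self = D2.copy()
    np.fill_diagonal(D2_no_self, np.inf)
    nn = np.argsort(D2_no_self, axis=1)[:, :k]
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), k), nn.ravel()] = True
    mask |= mask.T  # Assumption: keep an edge if either endpoint selects the other.
    return np.where(mask, np.exp(-D2 / sigma**2), 0.0)
```

Unconnected pairs receive a weight of exactly zero, so the resulting matrix is sparse in structure even though it is stored densely here.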
3.2 Global Consistency Detection
This subsection shows how to detect global consistency by
using a spectral clustering algorithm [15]. In this paper,
global consistency means that videos in the same video
group structure, typically referred to as a cluster, are likely
to have a high similarity. Since such a structure in the graph
usually reveals a common topic or interest, consistency
over a local area within the same graph means that
each sample is strongly correlated with its neighbors. Thus,
if a sample's score can be deduced precisely from its neighbors,
the sample is regarded as locally consistent.
Spectral clustering unveils the video group structure by
exploiting the eigen-structure of the graph Laplacian matrix
$L_L$, where $L_L = D_L - W^L$ and $D_L$ is a diagonal matrix
whose $(i, i)$th element is the sum of the $i$th row of $W^L$. Let $U$
consist of the unit-length eigenvectors associated with
the $K$ smallest eigenvalues of $L_L$, namely $U = \{u_1, \ldots, u_K\}$,
which is a $K$-dimensional embedding of the graph. The information
of each node is therefore captured by a point in
$\mathbb{R}^K$. In order to discover the video group structure, k-means
clustering is applied to the rows of $U$ and returns the video
group labels $z = \{z_1, \ldots, z_N\}$, $z_i \in \{1, \ldots, K\}$, and $K$ centroids
$O = \{\mu_1, \ldots, \mu_K\}$. Then we detect the nodes $C = \{c_1, \ldots, c_K\}$
that correspond to the centroids in $O$; these are called center
nodes. A spectral clustering algorithm is provided in Algorithm 1
with the input being the affinity matrix $W^L$ and
Algorithm 1 Global consistency detection using a spectral clustering algorithm
Input: The affinity matrix $W^L$ of the video graph $G$ and $K$
Output: Label set $z$ and center nodes $C$
1: procedure GlobalConsistencyDetection($W^L$, $K$)
2:     $d^L_{ii} \leftarrow \sum_{j=1}^{N} W^L_{ij}$
3:     $D_L \leftarrow \mathrm{diag}\{d^L_{11}, \ldots, d^L_{NN}\}$
4:     $L_L \leftarrow D_L - W^L$
5:     $\{u_1, \ldots, u_K\} \leftarrow$ unit-length eigenvectors of $L_L$ associated with the $K$ smallest eigenvalues of $L_L$
6:     $U \leftarrow \{u_1, \ldots, u_K\}$
7:     $(z, O) \leftarrow$ cluster labels for all nodes and centroids of the $K$ groups, obtained by k-means clustering on the rows of $U$ with $K$ centres
8:     $\{c_1, \ldots, c_K\} \leftarrow$ nodes corresponding to each centroid in $O = \{\mu_1, \ldots, \mu_K\}$
9:     $C \leftarrow \{c_1, \ldots, c_K\}$
10:    return $(z, C)$
11: end procedure
Fig. 1 Center node detection and similarity definition based on shortest
path problem.
the pre-specified number of groups K. Its outputs are the
estimated labels z and the center nodes C.
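The steps of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the small `kmeans` helper (Lloyd's algorithm with a deterministic farthest-point initialization) is a stand-in for any k-means routine, and the center node of a cluster is taken to be the member whose embedded point is closest to the centroid, which is one plausible reading of "nodes corresponding to each centroid".

```python
import numpy as np

def kmeans(points, K, n_iter=100):
    """Plain Lloyd's k-means with farthest-point initialization (hypothetical helper)."""
    centroids = [points[0]]
    for _ in range(1, K):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(d)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([points[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

def global_consistency_detection(W, K):
    """Return cluster labels z (values 0..K-1) and center-node indices C."""
    d = W.sum(axis=1)                      # degrees d_ii
    L = np.diag(d) - W                     # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    U = eigvecs[:, :K]                     # rows give the K-dimensional embedding
    z, O = kmeans(U, K)
    C = np.empty(K, dtype=int)
    for k in range(K):
        members = np.flatnonzero(z == k)
        if members.size == 0:              # fall back to all nodes for an empty cluster
            members = np.arange(W.shape[0])
        # Assumption: the center node is the member closest to the centroid.
        C[k] = members[np.argmin(np.sum((U[members] - O[k]) ** 2, axis=1))]
    return z, C
```

On a graph made of two dense groups joined by a weak edge, the labels separate the two groups and one center node is returned per group.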
The goal of our reranking is to regularize the smoothness
of the ranking scores simultaneously between not only adjacent nodes but
also nodes in the same video group. Therefore,
we define a new weight $W^G_{ij}$, which represents the similarity
between each node and the center node of its
video group. As shown in Fig. 1, if two videos $x_i$ and
$x_j$ have the same label in $z$ and $x_j \in C$, we connect them by
an edge and calculate its weight $W^G_{ij}$. We define the affinity
matrix $W^G \in \mathbb{R}^{N \times N}$ by taking $W^G_{ij}$ as its $(i, j)$th element.
By using the affinity matrix $W^G$, we formulate the reranking
problem.
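One possible reading of this construction is sketched below: each video is linked to the center node of its own group, and no other edges are created. The function name `global_affinity`, the Gaussian weight with a fixed scale `sigma`, and the symmetrization of the matrix are assumptions for illustration only; the exact weight used for $W^G_{ij}$ (for example, the shortest-path-based similarity mentioned in the caption of Fig. 1) is not reproduced here.

```python
import numpy as np

def global_affinity(X, z, C, sigma):
    """Connect every node to the center node of its cluster (hypothetical sketch).

    X: (N, d) features, z: (N,) cluster labels, C: indices of the center nodes.
    """
    N = X.shape[0]
    W = np.zeros((N, N))
    for c in C:
        members = np.flatnonzero(z == z[c])
        members = members[members != c]          # no self-loop on the center node
        d2 = np.sum((X[members] - X[c]) ** 2, axis=1)
        w = np.exp(-d2 / sigma**2)               # assumed Gaussian weight
        W[members, c] = w
        W[c, members] = w                        # assumption: symmetric affinity
    return W
```

With this layout every non-center node has exactly one global edge, so a noisy neighbor in the local graph cannot contribute to the global term unless it is itself a center node.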
3.3 Proposed Graph-Based Reranking Algorithm
We develop our approach based on the normalized graph
Laplacian and the ranking distance. Typically, the similarity of
the $k$th aspect ($k \in \{L, G\}$) between the $i$th and $j$th videos is first
defined as $W^k_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma_k^2)$, where $\sigma_k$ is the scaling
parameter of the Gaussian function that converts distance to
similarity. However, the Euclidean distance may not be
the most suitable distance metric [24]. Therefore,
we replace the Euclidean distance metric with the following
Mahalanobis distance metric, which can be learned within an
optimization framework:
$$
W^k_{ij} = \exp\left( -(x_i - x_j)^T M_k (x_i - x_j) \right), \tag{2}
$$
where $M_k$ is a symmetric positive semidefinite real matrix.
We decompose $M_k$ as $M_k = A_k^T A_k$, where $A_k \in \mathbb{R}^{d \times d}$, and
substitute it into Eq. (2) as
$$
W^k_{ij} = \exp\left( -\|A_k (x_i - x_j)\|^2 \right). \tag{3}
$$
This is equivalent to transforming each video $x_i$ to $A_k x_i$. For the
initialization, we set $A_k$ to the diagonal matrix $I/\sigma_k$, where $\sigma_k$
is the median value of the pairwise Euclidean distances of the
videos in the $k$th aspect.
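To make Eq. (3) and its initialization concrete, the sketch below evaluates the affinity for an arbitrary transformation matrix and shows that the choice $A_k = I/\sigma_k$ reduces to the ordinary Gaussian kernel. The function names are hypothetical, and the code computes a dense matrix over all pairs; restricting it to the edges of the graph is assumed to be done separately.

```python
import numpy as np

def median_pairwise_distance(X):
    """Median Euclidean distance over all distinct pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    iu = np.triu_indices(X.shape[0], k=1)
    return np.median(dist[iu])

def mahalanobis_affinity(X, A):
    """W[i, j] = exp(-||A (x_i - x_j)||^2) for every pair of rows of X, as in Eq. (3)."""
    Z = X @ A.T                                  # row i holds the transformed video A x_i
    diff = Z[:, None, :] - Z[None, :, :]
    return np.exp(-(diff ** 2).sum(axis=2))

# Usage: initialization A = I / sigma recovers the Gaussian kernel with scale sigma.
X = np.random.default_rng(0).normal(size=(10, 3))
sigma = median_pairwise_distance(X)
W0 = mahalanobis_affinity(X, np.eye(3) / sigma)
```

Because $\|A_k(x_i - x_j)\|^2 = (x_i - x_j)^T A_k^T A_k (x_i - x_j)$, learning $A_k$ is the same as learning the metric $M_k$ in Eq. (2) while keeping it positive semidefinite by construction.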
The proposed method considers local and global aspects
in the graph. Here, we linearly combine the normalized
graph Laplacian regularizers. Mathematically, in order
to smooth the reranking scores based on both global and local
consistencies, we model the regularizer term so as to combine
the local and global terms as follows:
$$
R(r, A_L, A_G) = \sum_{k \in \{L, G\}} \sum_{i,j} \alpha_k W^k_{ij} \left( \frac{r_i}{\sqrt{d^k_{ii}}} - \frac{r_j}{\sqrt{d^k_{jj}}} \right)^2 = \sum_{k \in \{L, G\}} \alpha_k r^T \tilde{L}_k r, \tag{4}
$$
where $\alpha_k$ is the weight for the local and global regularizers. The
weights satisfy $0 \le \alpha_k \le 1$ and $\alpha_L + \alpha_G = 1$. $d^k_{ii}$ is the sum of
the $i$th row of $W^k$, $\tilde{L}_k = I - D_k^{-1/2} W^k D_k^{-1/2}$ is the normalized
graph Laplacian, and $D_k$ is the diagonal matrix whose $(i, i)$th
element is $d^k_{ii}$.
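The relation between the pairwise form and the quadratic form in Eq. (4) can be checked numerically with the short sketch below. The function names are hypothetical, and every node is assumed to have a positive degree. One detail worth noting: when the double sum runs over all ordered pairs $(i, j)$ of a symmetric affinity matrix, the pairwise form equals twice the quadratic form, so the code compares against $2\, r^T \tilde{L} r$; a constant factor of this kind can be absorbed into the trade-off parameter $\rho$.

```python
import numpy as np

def normalized_laplacian(W):
    """L~ = I - D^{-1/2} W D^{-1/2}; assumes every row of W has a positive sum."""
    s = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - s[:, None] * W * s[None, :]

def pairwise_regularizer(r, W):
    """Sum over all ordered pairs of W_ij (r_i / sqrt(d_ii) - r_j / sqrt(d_jj))^2."""
    s = r / np.sqrt(W.sum(axis=1))
    return np.sum(W * (s[:, None] - s[None, :]) ** 2)

def combined_regularizer(r, W_L, W_G, alpha_L):
    """Quadratic-form side of Eq. (4) with alpha_G = 1 - alpha_L."""
    alpha_G = 1.0 - alpha_L
    return (alpha_L * r @ normalized_laplacian(W_L) @ r
            + alpha_G * r @ normalized_laplacian(W_G) @ r)
```

Setting `alpha_L = 1` recovers a purely local regularizer, which corresponds to the setting $\alpha_G = 0$.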
Accordingly, our algorithm can be formulated as the
following optimization problem:
$$
\min_{r, A_L, A_G} Q(r, A_L, A_G) = \sum_{k \in \{L, G\}} \sum_{i,j} \alpha_k W^k_{ij} \left( \frac{r_i}{\sqrt{d^k_{ii}}} - \frac{r_j}{\sqrt{d^k_{jj}}} \right)^2 + \rho \sum_{(i,j) \in S_{\bar{r}}} \left( 1 - \frac{r_i - r_j}{\bar{r}_i - \bar{r}_j} \right)^2, \tag{5}
$$
where the loss term is the preference strength ranking
distance [10] and $S_{\bar{r}}$ is the set of pairs $(i, j)$ for which the initial
ranking scores of the sample pair $(x_i, x_j)$ satisfy $\bar{r}_i > \bar{r}_j$. Note
that an appropriate scale of $A_k$ for estimating $W^k$ is also
determined automatically. The scaling parameter is usually
very sensitive in graph-based learning, and it needs to
be carefully tuned. Eliminating this parameter by
automatically determining the scale of $A_k$ is also an important
element of our approach.
3.4 Alternating Optimization
The formulation shown in Eq. (5) is a minimization problem
involving two variables to optimize. Since this objective is
not convex, it is difficult to simultaneously recover both un-
knowns. However, if we hold one unknown constant and
solve the objective for the other, we have two convex prob-
lems that can be optimally solved. In the rest of this section,
we introduce an alternating optimization for our reranking
framework, which iterates between the updates of r and Ak.
3.4.1 Update for r
By using the form of the normalized graph Laplacian, we can
rewrite Eq. (5) as follows:
$$
Q(r, A_L, A_G) = \sum_{k \in \{L, G\}} \alpha_k r^T \tilde{L}_k r + \rho \sum_{(i,j) \in S_{\bar{r}}} \left( 1 - \frac{r_i - r_j}{\bar{r}_i - \bar{r}_j} \right)^2. \tag{6}
$$
If the transformation matrices $A_L$ and $A_G$ are held constant,
then, denoting $\beta_{ij} = 1/(\bar{r}_i - \bar{r}_j)$, the relevance score list $r$ can
be updated by solving the following optimization problem:
$$
\begin{aligned}
\min_{r} Q(r)
&= \min_{r} \sum_{k \in \{L, G\}} \alpha_k r^T \tilde{L}_k r + \rho \sum_{(i,j) \in S_{\bar{r}}} \left\{ 1 - \beta_{ij} (r_i - r_j) \right\}^2 \\
&= \min_{r} \sum_{k \in \{L, G\}} \alpha_k r^T \tilde{L}_k r + \rho \left( r^T L(B) - 2Be \right) r,
\end{aligned}
\tag{7}
$$
where $L(B)$ is a graph Laplacian matrix defined over the
graph $G_B$, which has the same structure as $G$ but takes $|\beta_{ij}|$ as the
weight between nodes $x_i$ and $x_j$. $B = [\beta_{ij}]_{N \times N}$ is an
anti-symmetric matrix, and $e$ is a vector with all elements
equal to 1.
Finally, the relevance score list $r$ is derived by differentiating
Eq. (7) with respect to $r$ and equating the result to zero as follows:
$$
r = \left( \sum_{k} \alpha_k \tilde{L}_k + \rho L(B) \right)^{-1} \tilde{\rho}, \tag{8}
$$
where $\tilde{\rho} = 2\rho(Be)$. It can be seen that, different from
learning based on a single normalized graph Laplacian, the two types
of normalized graph Laplacian matrices are linearly
combined with the weights $\alpha_k$.
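The update of $r$ amounts to solving one linear system. The sketch below derives that system directly from the objective in Eq. (6) by expanding the squared loss, rather than copying the constants of Eq. (8): under this expansion the Laplacian of the pair graph carries the weights $\beta_{ij}^2$ and the right-hand side is $\rho\, b$ with $b = \sum_{(i,j) \in S_{\bar{r}}} \beta_{ij}(e_i - e_j)$, where $e_i$ is the $i$th standard basis vector, so the constants may differ from those stated in the text depending on how pairs are counted. All function names are hypothetical, and `lstsq` is used instead of an explicit inverse as a precaution against a singular system.

```python
import numpy as np

def normalized_laplacian(W):
    """L~ = I - D^{-1/2} W D^{-1/2}; assumes every node has a positive degree."""
    s = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - s[:, None] * W * s[None, :]

def loss_terms(r_bar):
    """Quadratic (L_B) and linear (b) parts of the loss over pairs with r_bar_i > r_bar_j."""
    diff = r_bar[:, None] - r_bar[None, :]
    beta = np.zeros_like(diff)
    beta[diff > 0] = 1.0 / diff[diff > 0]        # beta_ij on the pair set S
    w = beta ** 2 + (beta ** 2).T                # symmetric edge weights beta_ij^2
    L_B = np.diag(w.sum(axis=1)) - w             # Laplacian of the pair graph
    b = beta.sum(axis=1) - beta.sum(axis=0)      # sum over S of beta_ij (e_i - e_j)
    return L_B, b

def objective(r, laplacians, alphas, r_bar, rho):
    """Value of Eq. (6) for fixed Laplacians."""
    reg = sum(a * (r @ L @ r) for a, L in zip(alphas, laplacians))
    diff_bar = r_bar[:, None] - r_bar[None, :]
    S = diff_bar > 0
    ratio = (r[:, None] - r[None, :])[S] / diff_bar[S]
    return reg + rho * np.sum((1.0 - ratio) ** 2)

def update_scores(laplacians, alphas, r_bar, rho):
    """Minimize Eq. (6) over r by solving (sum_k alpha_k L~_k + rho L_B) r = rho b."""
    L_B, b = loss_terms(r_bar)
    M = sum(a * L for a, L in zip(alphas, laplacians)) + rho * L_B
    r, *_ = np.linalg.lstsq(M, rho * b, rcond=None)
    return r
```

Since Eq. (6) is a convex quadratic in $r$ for fixed Laplacians, the returned vector is a global minimizer of that objective, which can be checked by comparing `objective` at the solution with its value at perturbed score vectors.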
3.4.2 Update for Ak
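Now, we consider the optimization of $A_k$ ($k = L, G$). Since
the optimization of $A_L$ and that of $A_G$ follow the same process, we
describe that of $A_L$ as an example. With $r$ and $A_G$
fixed, the derivative of $Q$ with respect to
$A_L$ is derived as follows:
$$
\begin{aligned}
\frac{\partial}{\partial A_L} Q(A_L, A_G)
&= \alpha_L \frac{\partial}{\partial A_L} \sum_{i,j} W^L_{ij} \left( \frac{r_i}{\sqrt{d^L_{ii}}} - \frac{r_j}{\sqrt{d^L_{jj}}} \right)^2 \\
&= \alpha_L \sum_{i,j} \left[ (h^L_{ij})^2 \frac{\partial W^L_{ij}}{\partial A_L} - W^L_{ij} h^L_{ij} \left( \frac{r_i}{\sqrt{(d^L_{ii})^3}} \frac{\partial d^L_{ii}}{\partial A_L} - \frac{r_j}{\sqrt{(d^L_{jj})^3}} \frac{\partial d^L_{jj}}{\partial A_L} \right) \right],
\end{aligned}
\tag{9}
$$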
Algorithm 2 Gradient descent process for solving $A_k$.
Input: Step-size parameter $\eta_t = 1$.
Output: The transformation matrix $A_k$.
1: Set $A_k^{(0)}$ to a diagonal matrix $I/\sigma_k$, where $\sigma_k$ is the median value of the pairwise Euclidean distances of the videos in the $k$th aspect.
2: for $t = 1$ to $T_1$ do
3:     Let $A_k^{(t+1)} = A_k^{(t)} - \eta_t \left. \frac{\partial Q}{\partial A_k} \right|_{A_k = A_k^{(t)}}$.
4:     if $Q(A_k^{(t+1)}) < Q(A_k^{(t)})$ then
5:         $\eta_{t+1} = 2\eta_t$;
6:     else
7:         $A_k^{(t+1)} = A_k^{(t)}$, $\eta_{t+1} = \eta_t / 2$.
8:     end if
9: end for
Algorithm 3 Optimization process of the reranking algorithm
Input: Tuning parameters $\alpha_k$ and trade-off parameter $\rho$. The affinity matrices $W_L^{(0)}$, $W_G^{(0)}$ for initialization.
Output: The relevance score list $r$.
1: Set $A_L^{(0)}$, $A_G^{(0)}$ to diagonal matrices $I/\sigma_L$, $I/\sigma_G$, respectively, where $\sigma_{\bullet}$ is the median value of the pairwise Euclidean distances of the videos in each aspect.
2: for $t = 1$ to $T$ do
3:     Compute the $t$th optimal relevance score list $r^{(t)}$ according to Eq. (8).
4:     Update the $t$th transformation matrices $A_L^{(t+1)}$ and $A_G^{(t+1)}$ sequentially according to Algorithm 2.
5:     Update the similarity matrices $W_k^{(t+1)}$ as Eq. (3).
6: end for
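where
$$
h^L_{ij} = \frac{r_i}{\sqrt{d^L_{ii}}} - \frac{r_j}{\sqrt{d^L_{jj}}}, \qquad
\frac{\partial W^L_{ij}}{\partial A_L} = -2 W^L_{ij} A_L (x_i - x_j)^T (x_i - x_j), \qquad
\frac{\partial d^L_{ii}}{\partial A_L} = \sum_{j=1}^{N} \frac{\partial W^L_{ij}}{\partial A_L}.
\tag{10}
$$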
In order to solve the optimization of $A_L$ using Eq. (9),
we adopt a gradient descent process. In this
process, we dynamically adapt the step-size in order to accelerate
the process while guaranteeing its convergence. Denote
$A_L^{(t)}$ as the value of $A_L$ in the $t$th iteration of the process.
If $Q(A_L^{(t+1)}, A_G) < Q(A_L^{(t)}, A_G)$, i.e., the cost function is
reduced by the gradient descent step, we double the
step-size. Otherwise, we halve the step-size and do not
update $A_L$. The process is shown in Algorithm 2, where
$Q(A_L)$ denotes the value of the objective function
for a given $A_L$. After the iterations for $A_L$, $r$ and $A_L$
are fixed, and $A_G$ is calculated in the same way as $A_L$.
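The step-size rule described above (double on success, halve and reject on failure) can be sketched independently of the specific objective. In the code below, `objective` and `gradient` are placeholder callables standing in for $Q(A_k)$ and Eq. (9); the function name and the toy quadratic used in the usage lines are assumptions for illustration, not part of the method.

```python
import numpy as np

def adaptive_gradient_descent(objective, gradient, A0, n_iter=10, eta=1.0):
    """Gradient descent that doubles the step after an improving step and halves it otherwise.

    A rejected step leaves the current matrix unchanged, as in Algorithm 2.
    """
    A = A0.copy()
    q = objective(A)
    history = [q]
    for _ in range(n_iter):
        candidate = A - eta * gradient(A)
        q_new = objective(candidate)
        if q_new < q:
            A, q = candidate, q_new   # accept the step
            eta *= 2.0                # and try a larger step next time
        else:
            eta /= 2.0                # reject the step and shrink the step-size
        history.append(q)
    return A, history

# Usage with a toy objective ||A - A_target||_F^2 (a stand-in for Q).
A_target = np.array([[1.0, 2.0], [3.0, 4.0]])
A_opt, hist = adaptive_gradient_descent(
    lambda A: float(np.sum((A - A_target) ** 2)),
    lambda A: 2.0 * (A - A_target),
    np.zeros((2, 2)),
)
```

Because a step is accepted only when it lowers the objective, the recorded objective values never increase, which is the property the convergence argument in the text relies on.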
The whole alternating optimization process is illus-
trated in Algorithm 3. After the alternating optimization,
the proposed method returns videos in accordance with the
optimal relevance score r as the video searching result.
Table 2 15 Event Queries.
Queries
UEFA EURO 2016 highlights
Sochi 2014 Winter Olympics opening ceremony
World Figure Skating Championships 2016
Rio 2016 Summer Olympics games
NBA Finals 2014 highlights
CCTV new years gala 2016
Speech at Apec China 2014 speech
2014 Hong Kong protests
2014 Israel Gaza Conflict air strikes
Malaysia Airlines Flight 17 crash moment
New York Fashion Week 2014 runway
November 2015 Paris attacks
Flood in Indonesia 2014
Calbuco Volcano Eruption in Chile
Italy earthquake 2016
4. Experimental Results
In this section, we verify the effectiveness of our pro-
posed method. We first describe the datasets collected from
YouTube† and the measurements in the experiments. We
then analyze the performance of our method of video search
reranking.
4.1 Datasets and Features
Datasets: While the research on video search has recently
received intensive attention, the public datasets do not re-
flect current social event topics. To substantially evaluate
our approach, we collected a new dataset with rank infor-
mation from YouTube for video search reranking. Specifi-
cally, the used videos were crawled from YouTube by using
15 event queries as shown in Table 2. The MSRA-MM
(Microsoft Research Asia Multimedia) dataset [25] is
a well-known dataset for video search. Since 9
categories of videos are searched in that task, we use 15
queries in our experiments. These queries cover current
news topics about events and were selected with reference
to the categories “Categories:2014, 2015, and 2016”
on Wikipedia††. For each query, we obtained at most the top 500
videos and analyzed the related videos of each video by using
the YouTube API†††. Furthermore, the associated contextual
information such as tags, titles and descriptions was also
crawled together with the videos. This dataset is a real-world
Web video dataset containing the original ranking informa-
tion. By using these videos, we construct the video graph G.
When constructing the graph G, each sample is connected
with its k-nearest neighbors. The neighborhood size is set
to 5. For the iteration times T and T1, we set them to 5
and 10, respectively. Our method assigns the initial score
r̄i = 1− r̂i/N, where r̂i is the rank of video xi returned by the
search engine.
†http://www.youtube.com
††http://en.wikipedia.org/wiki/Category:2014, 2015, and 2016
†††https://developers.google.com/youtube/v3/
Fig. 2 Visual results of the video reranking from different approaches of the specific query (Rio 2016
Summer Olympics Games): (a) Ours, (b) Ours (without Ak optimization), (c) Ours (αG = 0), (d) So-
cialRank, (e) MGL, (f) Bayesian, (g) VisualRank, (h) RandomWalk, (i) BM25, (j) Initial (No method).
Note that the corresponding YouTube IDs are shown below the images.
Features: For query videos, we extract the following
sequential features from the whole videos and frame-level
visual and audio features from keyframes. Note that we use
the I-frames of the MPEG-4 video as the keyframes.
C3D: We apply the C3D model [26] pre-trained on the
Sports 1M dataset to compute representations with 512
dimensions.
Inception-v3: We apply the Inception-V3 model [27] pre-
trained on the ImageNet 1K classification task to com-
pute representations with 2048 dimensions.
HSV color histogram: We use the HSV color histogram
to exploit the color information. To retain spatial
information, keyframes are divided into 25 blocks of
the same size. A normalized HSV color histogram with
4 bins in each color channel (4 × 4 × 4 = 64 bins) is
extracted from each block, which gives a
1600-dimensional feature.
MFCC: Mel-frequency cepstral coefficients (MFCC),
which describe the short-time spectral shape of au-
dio frames, are extracted to capture audio information.
MFCC are widely used not only for speech recogni-
tion but also for generic audio classification. ΔMFCC,
ΔΔMFCC, log-power, Δlog-power and ΔΔlog-power
are extracted in addition to the MFCC. The dimension
of the audio feature is 39 including 12-dimensional
MFCC.
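As a check on the dimensionality of the HSV feature, the block-wise histogram can be sketched as below. The 5 × 5 layout of the 25 blocks, the per-block normalization, and the scaling of all three channels to [0, 1] are assumptions of this sketch; the function name is hypothetical.

```python
import numpy as np


def hsv_block_histogram(hsv, grid=5, bins=4):
    """Block-wise joint HSV histogram of one keyframe.

    hsv: H x W x 3 array with every channel scaled to [0, 1].
    Returns a vector of length grid * grid * bins**3 (1600 by default).
    """
    height, width, _ = hsv.shape
    row_edges = np.linspace(0, height, grid + 1).astype(int)
    col_edges = np.linspace(0, width, grid + 1).astype(int)
    parts = []
    for r in range(grid):
        for c in range(grid):
            block = hsv[row_edges[r]:row_edges[r + 1],
                        col_edges[c]:col_edges[c + 1]]
            pixels = block.reshape(-1, 3)
            # Joint histogram over (H, S, V): bins**3 cells per block.
            hist, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                                     range=[(0.0, 1.0)] * 3)
            hist = hist.ravel()
            parts.append(hist / max(hist.sum(), 1.0))  # per-block normalization
    return np.concatenate(parts)
```

The output length is 25 × 4³ = 1600, which matches the dimension stated above.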
These sequential and frame-level visual and audio features
are combined by early fusion followed by PCA to reduce the
dimension to 256. The video-level feature xi of the ith video
is mean-pooled from frame-level features.
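One possible reading of this pipeline is sketched below: the features are concatenated per keyframe, projected by PCA, and then averaged per video. The order of PCA and pooling, the frame-by-frame alignment of all modalities (including the sequence-level C3D feature), and the function name are assumptions of this sketch rather than details given in the text.

```python
import numpy as np


def fuse_reduce_pool(modalities, video_ids, out_dim=256):
    """Early fusion, PCA, and per-video mean pooling.

    modalities: list of (num_frames, dim_m) arrays aligned frame by frame.
    video_ids:  (num_frames,) integer array; video_ids[f] is the video of
                frame f, and every id in 0..max is assumed to occur.
    """
    fused = np.concatenate(modalities, axis=1)       # early fusion
    centered = fused - fused.mean(axis=0)
    # PCA via SVD: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:out_dim].T              # (num_frames, <= out_dim)
    num_videos = int(video_ids.max()) + 1
    pooled = np.zeros((num_videos, reduced.shape[1]))
    np.add.at(pooled, video_ids, reduced)            # sum the frames of each video
    counts = np.bincount(video_ids, minlength=num_videos)
    return pooled / counts[:, None]                  # mean pooling
```

With the four features above, the fused frame-level vector has 512 + 2048 + 1600 + 39 = 4199 dimensions before PCA.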
4.2 Evaluation Metrics
The performance of our method is evaluated by eight
volunteers, who were invited to assign relevance scores to
the top N videos of each query. The averaged relevance score
is used to measure the retrieval results.
The performance is measured by the widely used average
precision (AP), which averages the precision values obtained
at the rank of each relevant video. We average the APs over
all 15 queries to obtain the mean AP (MAP) as an overall
performance measure. In addition, to measure the video
search performance, we adopt the normalized discounted
cumulative gain (NDCG) [28], which is a commonly used
measure in information retrieval when there are more than
two relevance levels. For a given query, the NDCG score at
position d in the ranking list is calculated as follows:
$$\mathrm{NDCG}@d = Z_d \sum_{j=1}^{d} \frac{2^{t_j} - 1}{\log(1 + j)}, \tag{11}$$
where $t_j$ is the relevance degree of the jth video in the
ranking list and $Z_d$ is a normalization constant chosen to
guarantee that NDCG@d is 1 for a perfect ranking. In the
experiments, the relevance degree $t_j$ of each video was
judged manually on a four-level scale: “0:Irrelevant”,
“1:Fair”, “2:Relevant”, and “3:Very Relevant”. To evaluate
the overall performance, we average the NDCGs over all
queries to obtain the mean NDCG (MNDCG).
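The two measures can be sketched as follows. The function names are hypothetical, and two points are assumptions of this sketch: the mapping from graded judgments to binary relevance for AP is left to the caller, and the normalization constant of Eq. (11) is taken as the reciprocal of the DCG of the same judged list sorted by decreasing degree.

```python
import math


def average_precision(relevant_flags):
    """AP of one ranked list; relevant_flags[i] is truthy if item i is relevant."""
    hits, precisions = 0, []
    for position, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)   # precision at this relevant item
    return sum(precisions) / len(precisions) if precisions else 0.0


def dcg_at(degrees, d):
    """Unnormalized sum of Eq. (11) over the first d positions."""
    return sum((2 ** t - 1) / math.log(1 + j)
               for j, t in enumerate(degrees[:d], start=1))


def ndcg_at(degrees, d):
    """NDCG@d; the ideal ordering sorts the judged degrees in decreasing order."""
    ideal = dcg_at(sorted(degrees, reverse=True), d)
    return dcg_at(degrees, d) / ideal if ideal > 0 else 0.0
```

Because the same logarithm appears in the numerator and in the normalization constant, the base of the logarithm does not change the NDCG value.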
Table 3 MAP comparison of video reranking performance.
Methods MAP
Ours 0.735
Ours (without Ak optimization) 0.729
Ours (αG = 0) 0.684
SocialRank 0.731
MGL 0.727
Bayesian 0.717
VisualRank 0.635
RandomWalk 0.529
BM25 0.583
Initial (No method) 0.597
Table 4 MNDCG@d comparison of the video reranking performance.
Methods @5 @10 @20 @30 @40 @50 @60 @70 @80 @90 @100
Ours 0.901 0.895 0.837 0.825 0.811 0.792 0.762 0.751 0.749 0.751 0.746
Ours (without Ak optimization) 0.893 0.881 0.832 0.824 0.804 0.791 0.751 0.735 0.733 0.731 0.733
Ours (αG = 0) 0.806 0.781 0.773 0.754 0.744 0.733 0.721 0.701 0.699 0.685 0.679
SocialRank 0.899 0.894 0.831 0.823 0.813 0.789 0.742 0.743 0.741 0.739 0.729
MGL 0.871 0.850 0.845 0.796 0.785 0.778 0.761 0.749 0.738 0.751 0.733
Bayesian 0.850 0.811 0.786 0.759 0.753 0.742 0.738 0.735 0.730 0.723 0.722
VisualRank 0.671 0.645 0.647 0.661 0.659 0.651 0.656 0.652 0.649 0.644 0.651
RandomWalk 0.581 0.578 0.575 0.581 0.582 0.573 0.571 0.586 0.587 0.579 0.590
BM25 0.682 0.679 0.659 0.641 0.638 0.642 0.633 0.628 0.625 0.632 0.631
Initial (No method) 0.677 0.663 0.668 0.668 0.661 0.665 0.665 0.654 0.654 0.652 0.653
4.3 Reranking Results
To evaluate the performance of the proposed reranking al-
gorithm, we first compare the proposed method with the fol-
lowing eight reranking methods:
1) No method, i.e., the initial search results without
reranking. This method is denoted as “Initial”.
2) The text-based search results based on the Okapi BM-
25 formula [29] using the associated contextual infor-
mation of each video. The method is denoted as
“BM25”.
3) The random walk method proposed in [17]. The
method is denoted as “RandomWalk”.
4) Graph-based reranking proposed in [8]. The method is
denoted as “VisualRank”.
5) Bayesian reranking proposed in [10]. The method is
denoted as “Bayesian”.
6) Multimodal graph-based reranking proposed in [9],
which is the state-of-the-art for graph-based reranking.
The method is denoted as “MGL”.
7) Social ranking proposed in [30]. User information is
utilized to boost the retrieval performance. A regu-
larization framework which fuses the visual and views
information is introduced. The method is denoted as
“SocialRank”.
8) The proposed reranking method without the global
regularizer; that is, we fix αG = 0. The method is
denoted as “Ours (αG = 0)”.
9) The proposed reranking method with equivalent scaling
parameters assigned to the two aspects; that is, we
fix Ak = 1/σk. The method is the same as [16] and is
denoted as “Ours (without Ak optimization)”.
Table 5 p values of the significance test comparison. The performance
measure is MAP.
Methods p values
versus Ours (without Ak optimization) 3.16 ×10^-3
versus Ours (αG = 0) 5.75 ×10^-4
versus SocialRank 2.65 ×10^-2
versus MGL 1.05 ×10^-3
versus Bayesian 6.25 ×10^-4
versus VisualRank 1.94 ×10^-7
versus RandomWalk 1.78 ×10^-7
versus BM25 1.15 ×10^-5
versus Initial (No method) 1.45 ×10^-6
Table 6 MNDCG@d comparison of the video reranking performance when the initial search results
obtained by each query equally contain 80% of noisy samples.
Methods @5 @10 @20 @30 @40 @50 @60 @70 @80 @90 @100
Ours 0.766 0.741 0.715 0.664 0.605 0.607 0.587 0.609 0.635 0.638 0.656
Ours (without Ak optimization) 0.741 0.729 0.690 0.663 0.602 0.577 0.575 0.586 0.623 0.627 0.644
Ours (αG = 0) 0.702 0.664 0.698 0.629 0.600 0.592 0.556 0.529 0.524 0.534 0.522
SocialRank 0.748 0.714 0.696 0.641 0.589 0.586 0.584 0.595 0.632 0.635 0.653
MGL 0.767 0.731 0.698 0.677 0.621 0.587 0.585 0.596 0.633 0.636 0.654
Bayesian 0.739 0.689 0.698 0.643 0.591 0.577 0.575 0.586 0.623 0.627 0.644
VisualRank 0.448 0.538 0.536 0.563 0.596 0.552 0.572 0.591 0.586 0.585 0.579
RandomWalk 0.258 0.457 0.451 0.482 0.518 0.476 0.460 0.480 0.476 0.475 0.465
BM25 0.364 0.317 0.395 0.417 0.377 0.384 0.360 0.375 0.393 0.402 0.417
Initial 0.652 0.634 0.611 0.629 0.579 0.592 0.556 0.529 0.525 0.534 0.523
For a fair comparison, methods 3)–7) were implemented by
using the same video-level features as described in
Sect. 4.1.
Figure 2 shows the top results of the proposed method and
the other methods for the example query “Rio 2016 Summer
Olympics games”. Our approach compares favorably with all
compared methods owing to its capability to rank the
relevant videos by using multiple types of objects and
multiple types of relationships. The results of the MAP
comparison are shown in Table 3. The proposed reranking
algorithm achieves the highest MAP (0.735), followed by
SocialRank (0.731) and the variant without Ak optimization
(0.729). This demonstrates the robustness of our algorithm.
Next, we show the video retrieval results obtained by
using the proposed method and the other retrieval methods.
Table 4 shows the MNDCG@d of the different methods for
d = 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. Overall,
our proposed graph-based reranking outperforms the other
methods, and the improvements are consistent at most depths
of NDCG (MGL is higher at d = 20 and SocialRank at d = 40).
In particular, the MNDCG@100 of the proposed method is
0.017 and 0.013 higher than those of SocialRank and MGL,
respectively, which are the state-of-the-art reranking
methods.
To verify whether the improvement of the proposed
method is statistically significant, we further perform a
statistical significance test. Here, we conduct a paired
t-test at the 5% significance level between our method and
each of the other methods. The t-test is conducted over the
15 queries, and the p values are shown in Table 5. All p
values are below 0.05, so the improvement of the proposed
method is statistically significant.
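For reference, the test statistic of a paired t-test over per-query AP values can be sketched as follows. The function name is hypothetical, and whether the reported p values are one-sided or two-sided is not stated in the text; the critical value below is the standard two-sided 5% value for 14 degrees of freedom (15 queries).

```python
import math

# Two-sided 5% critical value of Student's t with 14 degrees of freedom.
T_CRIT_DF14 = 2.145


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-query differences scores_a[i] - scores_b[i]."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((x - mean) ** 2 for x in diffs) / (n - 1)  # sample variance
    if variance == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(variance / n)
```

With 15 paired AP values, the difference is significant at the 5% level (two-sided) when the absolute value of the statistic exceeds `T_CRIT_DF14`.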
Table 6 shows the simulation results for verifying the
robustness to noisy videos. The average noise ratio, that
is, the proportion of irrelevant videos in the initial
ranking lists, is originally 72% in our dataset. Thus, in
this experiment, we randomly insert noisy videos from other
queries’ ranking lists into the target initial ranking list
so that the ratio of noisy videos is 80%. Table 6 shows the
MNDCG@d of the different methods for d = 5, 10, 20, 30, 40,
50, 60, 70, 80, 90, and 100. As shown in Table 6, our
proposed graph-based reranking outperforms the other
methods, and the improvements are consistent and stable at
most depths of NDCG. Thus, it can be seen that our method,
which includes the global consistency analysis, improves
the robustness to noisy videos.
Fig. 3 Performance comparisons between different center nodes detec-
tion approaches for different parameter K in terms of MNDCG@100: (a)
Ours, (b) PageRank, (c) HITS.
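The noise-injection protocol used for Table 6 can be sketched as follows. The function names, the uniformly random insertion positions, and the fixed seed are assumptions of this sketch; the text only states that noisy videos are inserted randomly until the noise ratio reaches 80%.

```python
import math
import random


def num_noise_to_insert(list_size, num_irrelevant, target_ratio):
    """Smallest x with (num_irrelevant + x) / (list_size + x) >= target_ratio."""
    x = (target_ratio * list_size - num_irrelevant) / (1.0 - target_ratio)
    return max(0, math.ceil(x - 1e-9))   # small tolerance for round-off


def inject_noise(ranking, noise_pool, count, seed=0):
    """Insert `count` videos drawn from `noise_pool` at random positions."""
    rng = random.Random(seed)
    noisy = list(ranking)
    for video in rng.sample(noise_pool, count):
        noisy.insert(rng.randrange(len(noisy) + 1), video)
    return noisy
```

For example, a list of 500 videos with 360 irrelevant ones (72%) needs 200 inserted noisy videos, since (360 + 200) / (500 + 200) = 0.8.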
Next, in order to confirm the effectiveness of the pro-
posed center node detection using a spectral clustering algo-
rithm, we compare the proposed method with two popular
representative node detection schemes including:
1) The PageRank algorithm [18], which was used in
Google and was designed as a method for link analysis.
The method is denoted as “PageRank”.
2) The HITS (Hypertext Induced Topic Selection)
algorithm [31]. HITS makes the distinction between hubs
and authorities and computes them in a mutually
reinforcing way. The method is denoted as “HITS”.
Note that for the implementation of PageRank and HITS, we
used the same video graph and affinity matrix as those used
in the proposed method. To further analyze the results, we
compare the results for different values of the parameter
K, which is the number of center nodes. Figure 3 depicts
the performance of the three methods (the proposed method,
PageRank, and HITS) with K ranging from 5 to 30 in terms of
MNDCG@100. From the results, we can see that the proposed
method always gives better performance, and that the best
value of K is 10.
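The two baseline center-node detectors can be sketched on an affinity matrix as follows. The damping factor 0.85, the fixed number of iterations, and the selection of the K highest-scoring nodes as center nodes are assumptions of this sketch and are not specified in the text.

```python
import numpy as np


def pagerank_scores(affinity, damping=0.85, iters=100):
    """Power iteration on the column-normalized affinity matrix."""
    n = affinity.shape[0]
    col_sums = affinity.sum(axis=0)
    col_sums[col_sums == 0] = 1.0            # avoid division by zero
    transition = affinity / col_sums         # column-stochastic matrix
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1.0 - damping) / n + damping * (transition @ scores)
    return scores


def hits_authority_scores(affinity, iters=100):
    """Mutually reinforcing hub/authority updates; returns authority scores."""
    hubs = np.ones(affinity.shape[0])
    for _ in range(iters):
        auth = affinity.T @ hubs
        auth /= np.linalg.norm(auth)
        hubs = affinity @ auth
        hubs /= np.linalg.norm(hubs)
    return auth


def top_k_nodes(scores, k):
    """Indices of the k highest-scoring nodes, best first."""
    return np.argsort(-scores)[:k]
```

For a symmetric affinity matrix, the hub and authority updates of HITS use the same matrix, so the distinction between hubs and authorities plays a smaller role than on a directed link graph.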
Finally, we also test the sensitivity of the two
parameters ρ and αL, which are used in the proposed method.
We first set αL = 0.5 and vary ρ from 0.001 to 1. Figure 4
shows the performance curve with respect to the variation
of ρ. We then set ρ = 0.1 and vary αL (αG = 1 − αL) from
0.1 to 0.9. Figure 5 shows the performance curve with
respect to the variation of αL. Here, we also illustrate
the performance of the variants of the proposed method.
From the results, we can see that the performance of our
approach is not significantly degraded when the two
parameters vary over a fairly wide range, and that it keeps
outperforming the other methods.
Fig. 4 Illustration of the effects of the parameter ρ in terms of
MNDCG@100: (a) Ours, (b) Ours (without Ak optimization), (c) Ours
(αG = 0).
Fig. 5 Illustration of the effects of the parameter αL in terms of
MNDCG@100: (a) Ours, (b) Ours (without Ak optimization).
From the above experimental results, we can verify
the effectiveness of the proposed method using the local
and global consistency analysis. Therefore, the proposed
method improves the performance of graph-based reranking
in video searches.
4.4 Complexity Analysis
From the above solution process, we can see that the
computational cost mainly consists of three parts: detecting
the global consistency, updating r, and updating
$A_{\{L,G\}}$. First, the computational cost of the global
consistency detection is $O(K^3 + KNt)$, where K is the
number of clusters, N is the number of videos, and t is the
number of k-means iterations. In the graph-based reranking
method, we sparsify $W_{\{L,G\}}$ by keeping only the l
largest components in each row, where l is the number of
neighbors for each video. From Eq. (8), we can see that the
cost for updating r is $O(Nl)$. For updating $A_L$ and
$A_G$, from the process in Algorithm 2, we can see that the
cost is $O(T_1 N l d^2)$. Overall, the total time complexity
for reranking is $O(K^3 + KNt + T(Nl + T_1 N l d^2))$, where
d is the dimensionality of the video feature vectors, and T
and $T_1$ are the numbers of optimization iterations.
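To give a feel for the relative size of the three terms, the sketch below evaluates them for the experimental settings. It assumes that l equals the neighborhood size 5, that K = 10, and that d = 256, T = 5, and T1 = 10 as in Sect. 4.1; the number of k-means iterations t = 20 is a purely hypothetical value, and constant factors hidden by the O-notation are ignored.

```python
def reranking_cost_terms(N, l, d, T, T1, K, t):
    """Operation-count terms of O(K^3 + KNt + T(Nl + T1*N*l*d^2))."""
    return {
        "global_consistency": K ** 3 + K * N * t,
        "update_r": T * N * l,
        "update_A": T * T1 * N * l * d ** 2,
    }


terms = reranking_cost_terms(N=500, l=5, d=256, T=5, T1=10, K=10, t=20)
for name, value in terms.items():
    print(f"{name}: {value:.3e}")
```

Under these assumptions, the update of the scaling parameters dominates (about 8.2 × 10^9 operations), so the running time is driven mainly by the feature dimensionality d.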
Besides the theoretical analysis, we also measure the time
cost of the proposed method experimentally. The method is
implemented in Python and run in a single thread on a
workstation with an Intel Xeon E5-2620 v3 (2.4 GHz) and
32 GB of memory. Averaged over all queries, our method
ranks the videos within 10 s when N = 500. From the
theoretical analysis and the experimental test discussed
above, we can see that the efficiency of the proposed
method is acceptable for real applications.
5. Conclusions
This paper has presented a method to improve the
performance of graph-based Web video search reranking. We
first construct the video graph and detect global
consistency over the graph by using a spectral clustering
algorithm. From the clustering result, we extract center
nodes, which are representative nodes of each cluster, and
then define the new affinity matrix and the global
regularizer representing the similarity between the center
node and each node within the same video group. Second, by
considering both local and global graph consistency, video
search reranking is formulated as an optimization problem.
The effectiveness of integrating local and global
regularizers has been demonstrated.
We have also compared our method with several existing
reranking methods, and the results demonstrate the superi-
ority of our method.
Acknowledgments
This research was financially supported by JSPS KAKENHI
Grant Number 17K12687.
References
[1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain,
“Content-based image retrieval at the end of the early years,” IEEE
Trans. Pattern Anal. Mach. Intell., vol.22, no.12, pp.1349–1380,
2000.
[2] A. Hauptmann, R. Yan, W.H. Lin, M. Christel, and H. Wactlar, “Can
high-level concepts fill the semantic gap in video retrieval? a case
study with broadcast news,” IEEE Trans. Multimedia, vol.9, no.5,
pp.958–966, 2007.
[3] R. Yan, A. Hauptmann, and R. Jin, “Multimedia search with
pseudo-relevance feedback,” Proceedings of International Conference on
Content-Based Image and Video Retrieval, vol.2728, pp.238–247,
2003.
[4] A.P. Natsev, M.R. Naphade, and J. Tešić, “Learning the semantics of
multimedia queries and concepts from a small number of examples,”
Proceedings of the ACM International Conference on Multimedia,
pp.598–607, 2005.
[5] Y. Liu and T. Mei, “Optimizing visual search reranking via pairwise
learning,” IEEE Trans. Multimedia, vol.13, no.2, pp.280–291, 2011.
[6] L.S. Kennedy and S.-F. Chang, “A reranking approach for
context-based concept fusion in video indexing and retrieval,” Proceedings
of the ACM International Conference on Image and Video Retrieval,
pp.333–340, 2007.
[7] W.H. Hsu, L.S. Kennedy, and S.-F. Chang, “Video search rerank-
ing via information bottleneck principle,” Proceedings of the ACM
International Conference on Multimedia, pp.35–44, 2006.
[8] Y. Jing and S. Baluja, “VisualRank: Applying pagerank to
large-scale image search,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol.30, no.11, pp.1877–1890, 2008.
[9] M. Wang, H. Li, D. Tao, K. Lu, and X. Wu, “Multimodal
graph-based reranking for web image search,” IEEE Trans. Image
Process., vol.21, no.11, pp.4649–4661, 2012.
[10] X. Tian, Y. Yang, J. Wang, X. Wu, and X.-S. Hua, “Bayesian vi-
sual reranking,” IEEE Trans. Multimedia, vol.13, no.4, pp.639–652,
2011.
[11] T. Mei, Y. Rui, S. Li, and Q. Tian, “Multimedia search rerank-
ing: A literature survey,” ACM Computing Surveys, vol.46, no.3,
pp.38:1–38:38, 2014.
[12] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang,
“Manifold-ranking based image retrieval,” Proceedings of the ACM
International Conference on Multimedia, pp.9–16, 2004.
[13] S.T. Roweis and L.K. Saul, “Nonlinear dimensionality reduction
by locally linear embedding,” Science, vol.290, no.5500, pp.2323–
2326, 2000.
[14] M.A. Porter, J.P. Onnela, and P.J. Mucha, “Communities in net-
works,” Notices of the AMS, vol.56, no.9, pp.1082–1097, 2009.
[15] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and
Computing, vol.17, no.4, pp.395–416, 2007.
[16] S. Yoshida, T. Ogawa, and M. Haseyama, “Graph-based Web video
search reranking through consistency analysis using spectral clus-
tering,” Proceedings of the IEEE International Conference on Mul-
timedia and Expo, pp.1–6, 2016.
[17] W.H. Hsu, L.S. Kennedy, and S.-F. Chang, “Video search rerank-
ing through random walk over document-level context graph,” Pro-
ceedings of the ACM International Conference on Multimedia,
pp.971–980, 2007.
[18] S. Brin and L. Page, “The anatomy of a large-scale hypertextual
web search engine,” Computer Networks and ISDN Systems, vol.30,
no.1-7, pp.107–117, 1998.
[19] L. Yang and A. Hanjalic, “Supervised reranking for Web image
search,” Proceedings of the ACM International Conference on Mul-
timedia, pp.183–192, 2010.
[20] S. Zhang, M. Yang, T. Cour, K. Yu, and D.N. Metaxas, “Query spe-
cific rank fusion for image retrieval,” vol.7573, pp.660–673, 2012.
[21] W. Liu, Y.-G. Jiang, J. Luo, and S.-F. Chang, “Noise resistant
graph ranking for improved web image search,” Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
pp.849–856, 2011.
[22] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning
using gaussian fields and harmonic functions,” Proceedings of the
International Conference on Machine Learning, pp.912–919, 2003.
[23] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Schölkopf,
“Learning with local and global consistency,” Proceedings of the In-
ternational Conference on Neural Information Processing Systems,
pp.321–328, 2004.
[24] B. Geng, D. Tao, and C. Xu, “DAML: Domain adaptation
metric learning,” IEEE Trans. Image Process., vol.20, no.10,
pp.2980–2989, 2011.
[25] H. Li, M. Wang, and X.-S. Hua, “MSRA-MM 2.0: A large-scale web
multimedia dataset,” Proceedings of the IEEE International Confer-
ence on Data Mining Workshops, pp.164–169, 2009.
[26] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learn-
ing spatiotemporal features with 3D convolutional networks,” Pro-
ceedings of the IEEE International Conference on Computer Vision,
pp.4489–4497, 2015.
[27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Re-
thinking the inception architecture for computer vision,” Proceed-
ings of The IEEE Conference on Computer Vision and Pattern
Recognition, 2016.
[28] C.D. Manning, P. Raghavan, and H. Schütze, Introduction to Infor-
mation Retrieval, Cambridge University Press, New York, NY, USA,
2008.
[29] S. Robertson and H. Zaragoza, “The probabilistic relevance
framework: BM25 and beyond,” Foundations and Trends in Information
Retrieval, vol.3, no.4, pp.333–389, 2009.
[30] D. Lu, X. Liu, and X. Qian, “Tag-based image search by social
re-ranking,” IEEE Trans. Multimedia, vol.18, no.8, pp.1628–1639,
2016.
[31] J.M. Kleinberg, “Authoritative sources in a hyperlinked environ-
ment,” Journal of the ACM, vol.46, no.5, pp.604–632, 1999.
Soh Yoshida received the B.S., M.S., and
Ph.D. degrees in electronics and information en-
gineering from Hokkaido University, Japan, in
2012, 2014, and 2016, respectively. He joined
the Faculty of Engineering, Kansai University,
in 2016, where he is currently an Assistant Pro-
fessor. His research interests are Image/Video
Semantic Analysis and Information Retrieval.
He is a member of the ACM, the IEEE, the
IEICE, and the ITE.
Takahiro Ogawa received the B.S., M.S.,
and Ph.D. degrees in electronics and infor-
mation engineering from Hokkaido University,
Japan, in 2003, 2005, and 2007, respectively. He
joined the Graduate School of Information Sci-
ence and Technology, Hokkaido University, in
2008, where he is currently an Associate Profes-
sor. His research interests are multimedia signal
processing and its applications. He has been an
Associate Editor of the ITE Transactions on Me-
dia Technology and Applications. He is a mem-
ber of the ACM, the EURASIP, the IEICE, and the ITE.
Miki Haseyama received the B.S., M.S.,
and Ph.D. degrees in electronics from Hokkaido
University, Japan, in 1986, 1988, and 1993, re-
spectively. She joined the Graduate School of
Information Science and Technology, Hokkaido
University, as an Associate Professor in 1994.
She was a Visiting Associate Professor with
Washington University, USA, from 1995 to
1996. She is currently a Professor with the
Graduate School of Information Science and
Technology, Hokkaido University. Her current
research interests include image and video processing and its development
into semantic analysis. She has been the Vice President of the Institute
of Image Information and Television Engineers (ITE), Japan, an Editor-
in-Chief of the ITE Transactions on Media Technology and Applications,
and the Director of the International Coordination and Publicity, Institute
of Electronics, Information, and Communication Engineers (IEICE). She
is a member of the IEICE, the ITE, and the ASJ.
Mitsuji Muneyasu received the B.E. and
M.E. degrees in system engineering from Kobe
University in 1982 and 1984, respectively, and
Doctor of Engineering degree from Hiroshima
University, Japan, in 1993. In 1984, he joined
Oki Electric Industry Co., Ltd., in Tokyo, Japan.
From 1990 to 1991, he was a Research Assistant
at the Faculty of Engineering, Tottori University,
Tottori, Japan. From 1991 to 2001, he was a Re-
search Assistant and Associate Professor at the
Faculty of Engineering, Hiroshima University,
Higashi-Hiroshima, Japan. In 2001 he joined the Faculty of Engineering,
Kansai University, Osaka, Japan, where he is currently a Professor. His
research interests include image processing theory and nonlinear digital
signal processing. He is a member of IEICE, IEEE, and IPSJ.