Video Captioning with Multi-Faceted Attention

Video captioning has attracted an increasing amount of interest, due in part to its potential for improved accessibility and information retrieval. While existing methods rely on different kinds of visual features and model architectures, they do not make full use of pertinent semantic cues. We present a unified and extensible framework to jointly leverage multiple sorts of visual features and semantic attributes. Our novel architecture builds on LSTMs with two multi-faceted attention layers. These first learn to automatically select the most salient visual features or semantic attributes, and then yield overall representations for the input and output of the sentence generation component via custom feature scaling operations. Experimental results on the challenging MSVD and MSR-VTT datasets show that our framework outperforms previous work and performs robustly even in the presence of added noise to the features and attributes.


Introduction
The task of automatically generating captions for videos has been receiving an increasing amount of attention. On YouTube, for example, hundreds of hours of video content are uploaded every single minute. No person could watch such an overwhelming amount of video, so new techniques to search and quickly understand videos are highly sought. Generating captions, i.e., short natural language descriptions, for videos is an important technique to address this challenge, while also greatly improving accessibility for blind and visually impaired users.
Video captioning has been studied for a long time and remains challenging, given the difficulties of video interpretation, natural language generation, and the interplay between them. Understanding a video hinges on our ability to make sense of individual video frames and of the relationships between consecutive frames.
The output needs to be a grammatically correct sequence of words, and different parts of the output caption may pertain to different parts of the video. In previous work, 3D ConvNets (Du et al., 2015) have been proposed to capture motion information in short videos, while LSTMs (Hochreiter and Schmidhuber, 1997) can be used to generate natural language, and a variety of visual attention models (Yao et al., 2015; Pan et al., 2015; Yu et al., 2015) have been deployed to capture the relationship between caption words and the video content.
These methods, however, only make use of visual information from the video, often with unsatisfactory results. In many real-world settings, we can easily obtain additional information related to the video. Apart from sound, there may also be a title, user-supplied tags, categories, and other metadata. Both visual video features and attributes such as tags can be imperfect and incomplete. However, by jointly considering all available signals, we may obtain complementary information that aids in generating better captions. Humans, too, often benefit from additional context when trying to understand what a video is portraying. Incorporating these additional signals is not just a matter of adding extra features: while generating the sequence of words in the caption, we need to be able to flexibly attend to the relevant frames over time, the relevant parts within a given frame, and the relevant additional signals, to the extent that they pertain to a particular output word.
Based on these considerations, we propose a novel multi-faceted attention architecture that jointly considers multiple heterogeneous forms of input. The model flexibly attends to temporal information, motion features, and semantic attributes in every channel. An example is given in Figure 1. Each part of the attention model is an independent branch, and it is straightforward to incorporate additional branches for further kinds of features, making our model highly extensible. We present a series of experiments that highlight the contribution of attributes in yielding state-of-the-art results on standard datasets.

Related Work
Machine Translation. Some of the first widely noted successes of deep sequence-to-sequence learning models were for the task of machine translation (Cho et al., 2014b; Cho et al., 2014a; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013; Li et al., 2015; Lin et al., 2015). In several respects, this is a similar task to video caption generation, just with a rather different input modality. What the two share is that both require bridging different representations, and that an encoder-decoder paradigm with a Recurrent Neural Network (RNN) decoder is often used to generate sentences in the target language. Many techniques for video captioning are inspired by neural machine translation, including soft attention mechanisms that focus on different parts of the input when generating the target sentence word by word (Bahdanau et al., 2015).
Image Captioning. Image captioning can be regarded as a greatly simplified case of video captioning, with videos consisting of just a single frame. Recurrent architectures are often used here as well (Karpathy et al., 2014; Kiros et al., 2014; Chen and Zitnick, 2015; Mao et al., 2015; Vinyals et al., 2015). Spatial attention mechanisms allow for focusing on different areas of an image (Xu et al., 2015b). Recently, image captioning approaches incorporating semantic concepts have achieved inspiring results. A semantic attention approach has been proposed (You et al., 2016) to selectively attend to semantic concept proposals and fuse them into the hidden states and outputs of RNNs, but that model is difficult to extend to multiple channels. Overall, none of these methods for image captioning need to account for temporal and motion aspects.
Video captioning. For video captioning, many works utilize a recurrent neural architecture to generate video descriptions, conditioned on either an average-pooling (Venugopalan et al., 2015b) or recurrent encoding (Xu et al., 2015a; Donahue et al., 2015; Venugopalan et al., 2015a; Venugopalan et al., 2016) of frame-level features, or on a dynamic linear combination of context vectors obtained via temporal attention (Yao et al., 2015). Recently, hierarchical recurrent neural encoders (HRNE) with an attention mechanism have been proposed to encode video (Pan et al., 2015). A recent paper (Yu et al., 2015) additionally exploits several kinds of visual attention and relies on a multimodal layer to combine them. In our work, we present a novel attention model with more effective multimodal layers that jointly models multiple heterogeneous signals, including semantic attributes, and we experimentally show the benefits of this approach over previous work.

The Proposed Approach
In this section, we describe our approach for combining multiple forms of attention for video captioning. Figure 2 illustrates the architecture of our model. The core of our model is a sentence generator based on Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997). Instead of a traditional sentence generator, which directly receives the previous word and selects the next word, our model relies on several attention layers to selectively focus on important parts of the temporal, motion, and semantic features. The output words are generated via a softmax reading from a multimodal layer (Mao et al., 2015), which integrates information from the different attention layers. An additional multimodal layer integrates information before the input reaches the sentence generator, enabling better hidden representations in the LSTM. We first briefly review the basic LSTM, and then describe our model in detail, including our novel multi-faceted attention mechanism covering temporal, motion, and semantic attribute perspectives.

Long Short Term Memory Networks
A Recurrent Neural Network (RNN) (Elman, 1990) is a neural network that adds feedback connections to a feed-forward network so as to be able to work with sequences. The network is updated not only based on the input but also based on the previous hidden state. RNNs compute the hidden states (h_1, h_2, ..., h_m) given an input sequence (x_1, x_2, ..., x_m) based on a recurrence of the following form:

h_t = φ(W x_t + U h_{t-1} + b)

where the weight matrices W, U and bias b are parameters to be learned and φ(·) is an element-wise activation function.
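As a minimal numpy sketch, a single step of this recurrence can be written as follows (parameter names are illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b, phi=np.tanh):
    """One step of the basic recurrence h_t = phi(W x_t + U h_{t-1} + b)."""
    return phi(W @ x_t + U @ h_prev + b)
```

Unrolling this step over an input sequence produces the hidden state sequence described above.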
Standard RNNs trained via unfolding have proven poor at capturing long-term temporal information. LSTM units were introduced to avoid these challenges. LSTMs not only compute hidden states but also maintain a cell state to account for relevant signals that have been observed. They have the ability to remove or add information to the cell state, modulated by gates.
Given an input sequence (x_1, x_2, ..., x_m), an LSTM unit computes the hidden states (h_1, h_2, ..., h_m) and cell states (c_1, c_2, ..., c_m) via repeated application of the following equations:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where σ(·) is the sigmoid function and ⊙ denotes the element-wise multiplication of two vectors. For convenience, we denote the computations of the LSTM at each time step t as h_t, c_t = LSTM(x_t, h_{t-1}, c_{t-1}).
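A minimal numpy sketch of one LSTM step, following the standard gate equations (parameter packing is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with input (i), forget (f), output (o) gates and candidate g."""
    Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo, Wg, Ug, bg = params
    i = sigmoid(Wi @ x_t + Ui @ h_prev + bi)
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)
    o = sigmoid(Wo @ x_t + Uo @ h_prev + bo)
    g = np.tanh(Wg @ x_t + Ug @ h_prev + bg)
    c = f * c_prev + i * g          # gates modulate what enters the cell state
    h = o * np.tanh(c)              # hidden state reads from the cell state
    return h, c
```

The gating makes it possible to carry information in c over long spans without repeated squashing.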

Input Representations
When training a video captioning model, as a first step, we need to extract feature vectors that serve as inputs to the LSTM. For visual features, we can extract one feature vector per frame, leading to a series of what we call temporal features. We can also extract another form of feature vector from several consecutive frames, which we call motion features.
Additionally, we could extract other forms of visual features, such as features from an area of a frame, from the same area across consecutive frames, etc. In this paper, we only consider temporal features, denoted by {v_i}, and motion features, denoted by {f_i}, which are commonly used in video captioning.
For semantic features, we need to extract a set of related attributes, denoted by {a_i}. These can be based on the title, tags, etc., if available. Alternatively, we can rely on techniques to extract or predict attributes that are not directly given. In particular, because we have captions for the videos in the training set, we can train different models to predict caption-related semantic attributes for videos in the validation and test sets. As the choice of semantic features is not our core contribution, we describe our specific experimental setups in Section 4.
After determining a set of attributes for each video, each attribute a_i of any video in the dataset corresponds to an entry in the vocabulary, as does each word w_i in any caption of the training set.
An embedding matrix E is used to represent both words and semantic attributes, and we denote by E[w] the embedding vector of a given w. Thus, we obtain attribute embedding vectors {s_i} and input word embedding vectors as:

s_i = E[a_i],    x_t = E[w_t].

Multi-Faceted Attention
We do not directly feed x_t to the LSTM. Instead, we first apply our multi-faceted attention model to x_t.
Assuming that we have a series of multimodal feature vectors for a given video, we generate a caption word by word. At each step, we need to select relevant information from these feature vectors, which we from now on refer to as context vectors {c_1, c_2, ..., c_n}. Due to the variable length of videos, it is challenging to directly input all these vectors into the model at every time step. A simple strategy is to compute the average of the context vectors and feed this average vector into each time step of the model.
However, this strategy collapses all available information into a single vector, neglecting the inherent structure, which captures the temporal progression, among other things. Thus, this sort of folding leads to a significant loss of information. Instead, we wish to focus on the most salient parts of the features at every time step. Rather than naively averaging the context vectors {c_1, c_2, ..., c_n}, a soft attention model calculates weights α^t_i for each c_i, conditioned on the input vector x_t at each time step t. For this, we first compute basic attention scores e^t_i and then feed these through a sequential softmax layer to obtain a set of attention weights {α^t_1, α^t_2, ..., α^t_n} that quantify the relevance of {c_1, c_2, ..., c_n} for x_t:

e^t_i = w^T tanh(W_a c_i + U_a x_t + b_a),
α^t_i = exp(e^t_i) / Σ_j exp(e^t_j).

We obtain the corresponding output vectors y_t as weighted averages:

y_t = Σ_i α^t_i c_i.
This soft attention model, strictly speaking, converts an entire input sequence (x_1, x_2, ..., x_m) into an entire output sequence (y_1, y_2, ..., y_m) based on all context vectors {c_1, c_2, ..., c_n}. For convenience, we denote the attention model output at a given time step t as y_t = Attention(x_t, {c_i}).
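As an illustrative numpy sketch, one attention step can be written as follows; the additive (Bahdanau-style) score function is an assumption, not necessarily the paper's exact parameterization:

```python
import numpy as np

def attention(x_t, C, Wa, Ua, w, ba):
    """Soft attention: y_t = sum_i alpha_i * c_i, with alpha = softmax over scores."""
    # Additive score e_i = w^T tanh(Wa c_i + Ua x_t + ba) for each context vector
    e = np.array([w @ np.tanh(Wa @ c + Ua @ x_t + ba) for c in C])
    e = e - e.max()                        # subtract max for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()    # attention weights sum to 1
    y = sum(a * c for a, c in zip(alpha, C))
    return y, alpha
```

With untrained (zero) parameters the weights are uniform, recovering the naive averaging baseline; training moves the weights toward the salient context vectors.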
In particular, the attention model is applied to the temporal features {v_i}, motion features {f_i}, and semantic features {s_i}:

y^v_t = Attention(x_t, {v_i}), y^f_t = Attention(x_t, {f_i}), y^s_t = Attention(x_t, {s_i}).

We then obtain the input to the LSTM, m^x_t, via a multimodal layer:

m^x_t = φ(W_m [x_t; w^x_v ⊙ y^v_t; w^x_f ⊙ y^f_t; y^s_t] + b_m).

Here, the scaling vectors w^x_v and w^x_f facilitate capturing the relative importance of each dimension of the temporal and motion feature space (You et al., 2016). We apply dropout (Srivastava et al., 2014) to this multimodal layer to reduce overfitting.
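A sketch of this input-side fusion, assuming concatenation followed by a linear map (the exact fusion form is an assumption); w_v and w_f are the per-dimension scaling vectors:

```python
import numpy as np

def multimodal_input(x_t, y_v, y_f, y_s, Wm, bm, w_v, w_f, phi=lambda z: z):
    """Fuse word embedding and attended features; w_v, w_f rescale each feature dimension.

    phi defaults to a linear activation, matching the experimental settings."""
    fused = np.concatenate([x_t, w_v * y_v, w_f * y_f, y_s])
    return phi(Wm @ fused + bm)
```

Dropout would be applied to the output of this layer during training.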
Subsequently, we can obtain h_t via the LSTM. At the first time step, the mean values of the features are used to initialize the LSTM states to yield a general overview of the video:

h_0 = tanh(W_h0 [Mean({v_i}); Mean({f_i}); Mean({s_i})] + b_h0),
c_0 = tanh(W_c0 [Mean({v_i}); Mean({f_i}); Mean({s_i})] + b_c0),

where Mean(·) denotes mean pooling of the given feature set.
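The initialization step can be sketched as follows; the concrete parameterization (concatenating the mean-pooled feature sets before a learned projection) is an assumption:

```python
import numpy as np

def init_states(V, F, S, Wh0, bh0, Wc0, bc0):
    """Initialize h_0, c_0 from mean-pooled temporal (V), motion (F), semantic (S) features."""
    m = np.concatenate([np.mean(V, axis=0), np.mean(F, axis=0), np.mean(S, axis=0)])
    h0 = np.tanh(Wh0 @ m + bh0)
    c0 = np.tanh(Wc0 @ m + bc0)
    return h0, c0
```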
We also apply the attention model to the hidden states h_t, and use a second multimodal layer to combine the outputs of the attention model and map them into a feature space with exactly the same dimensionality as the word embeddings:

m^h_t = φ(W_p [h_t; w^h_v ⊙ Attention(h_t, {v_i}); w^h_f ⊙ Attention(h_t, {f_i}); Attention(h_t, {s_i})] + b_p).

This multimodal layer is followed by a softmax layer with a dimensionality equal to the size of the vocabulary. The projection matrix from the multimodal layer to the softmax layer is set to be the transpose of the word embedding matrix:

p_t = Softmax(E^T m^h_t),

where Softmax(·) denotes a sequential softmax.
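A sketch of the tied output projection, assuming the embedding matrix stores one row per vocabulary word (so reusing it as the output projection is the weight-tying described above):

```python
import numpy as np

def output_distribution(m_h, E):
    """Project the multimodal output through the (tied) embedding matrix, then softmax.

    E has shape (vocab_size, embed_dim); its rows are the word embeddings."""
    logits = E @ m_h
    logits = logits - logits.max()         # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return p
```

Tying the output projection to the embeddings reduces the parameter count and couples the input and output word representations.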
By using two multimodal layers, we combine six attention layers with the core LSTM. This model is highly extensible, since we can easily add extra branches for additional features.

Training and Generation
We can interpret the output of the softmax layer p_t as a probability distribution over words:

p_t = p(w_t | w_1, ..., w_{t-1}, V, S; Θ),

where V denotes the corresponding video, S denotes the semantic attributes, and Θ denotes the model parameters. The overall loss function is defined as the negative log-likelihood, and our goal is to learn all parameters Θ in our model by minimizing this loss over the entire training set:

L(Θ) = − Σ_{i=1}^{N} Σ_{t=1}^{T_i} log p(w^{(i)}_t | w^{(i)}_1, ..., w^{(i)}_{t-1}, V^{(i)}, S^{(i)}; Θ),

where N is the total number of captions in the training set and T_i is the number of words in caption i. During the training phase, we add a begin-of-sentence tag BOS to the start of each sentence and an end-of-sentence tag EOS to its end.
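The negative log-likelihood objective can be sketched with a toy helper (this illustrates the loss computation only, not the full training loop):

```python
import numpy as np

def caption_nll(prob_seqs):
    """Negative log-likelihood summed over all captions and time steps.

    prob_seqs: list of lists, where prob_seqs[i][t] is the model's probability
    of the t-th ground-truth word of caption i."""
    return -sum(np.log(p) for seq in prob_seqs for p in seq)
```

Minimizing this quantity over the training set maximizes the likelihood of the ground-truth captions.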
We use Stochastic Gradient Descent to find the optimum, with the gradient computed via Backpropagation Through Time (BPTT) (Werbos, 1990). Training continues until the METEOR score on the validation set stops increasing, and we optimize the hyperparameters using random search to maximize METEOR on the validation set, following previous studies that found METEOR to be more consistent with human judgments than BLEU or ROUGE (Vedantam et al., 2015).
After the parameters are learned, during the testing phase, we again have temporal and motion features extracted from the video, as well as semantic attributes, which are either given or predicted using a model trained on the training set. Given the previous word, we can calculate the probability distribution of the next word p_t using the model described above. Thus, we can generate captions starting from the special symbol BOS with Beam Search (Yu et al., 2015).

MSVD:
We first evaluate on the MSVD dataset (Chen and Dolan, 2011). MSVD consists of 1,970 video clips, typically depicting a single activity, downloaded from YouTube. Each video clip is annotated with multiple human-generated descriptions in several languages. We only use the English descriptions, about 41 per video. In total, the dataset consists of 80,839 video/description pairs, and each description contains about 8 words on average. We use 1,200 videos for training, 100 for validation, and 670 for testing, as provided by previous work (Guadarrama et al., 2013).

MSR-VTT:
We also evaluate on the MSR Video-to-Text (MSR-VTT) dataset (Xu et al., 2016), a recent large-scale benchmark for video captioning. MSR-VTT provides 10,000 web video clips, each annotated with about 20 natural-language sentences, yielding 200,000 video-caption pairs in total.
Our video captioning models are trained and hyperparameters selected using the official training and validation sets, which consist of 6,513 and 497 video clips, respectively. Models are then evaluated on the test set of 2,990 video clips.

Preprocessing
Visual Features: We extract two kinds of visual features: temporal features and motion features. We use a pretrained ResNet-152 model (He et al., 2015) to extract temporal features, obtaining one fixed-length feature vector for every frame. We use a pretrained C3D network (Du et al., 2015) to extract motion features; the C3D net reads in a video and emits a fixed-length feature vector every 16 frames.
Semantic Attributes: While MSVD and MSR-VTT are standard video captioning datasets, they do not come with tags, titles, or other semantic information about the videos. Nevertheless, we can reproduce a setting with semantic attributes by extracting attributes from the captions. First, we invoke the Stanford Parser (Klein and Manning, 2003) to parse the captions and choose the nsubj edges to find the subject-verb pair of each caption. We then select the most frequent subject and verb across the captions of each video as high-quality semantic attributes. These attributes are used to evaluate our models under a high-quality attribute condition. Next, we use the high-quality semantic attributes of the training set to train a model that predicts semantic attributes for the test set. Such attributes are used to evaluate our model under low-quality attribute conditions.
For our experiments, we consider three models to predict semantic attributes. The first (NN) performs a nearest-neighbor search over the frames of the training set to retrieve similar frames for every frame of each test video, based on ResNet-152 features, and selects the most frequent attributes. The second (SVM) trains SVMs for the top 100 most frequent attributes in the training set and predicts semantic attributes for test videos based on mean-pooled ResNet-152 features. The third (HRNE) trains two hierarchical recurrent neural encoders (Pan et al., 2015) to predict the subject and verb separately based on temporal ResNet-152 features.
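As an illustration, the NN predictor could be sketched as follows; the cosine similarity measure and the frame-level voting scheme are assumptions for the sketch, not necessarily the paper's exact procedure:

```python
import numpy as np
from collections import Counter

def nn_attributes(test_frames, train_frames, train_attrs, top_k=2):
    """Predict attributes for a test video by nearest-neighbor voting over its frames.

    test_frames / train_frames: lists of feature vectors (e.g., ResNet-152 features).
    train_attrs[j]: list of attribute strings for training frame j."""
    votes = Counter()
    for q in test_frames:
        sims = [q @ t / (np.linalg.norm(q) * np.linalg.norm(t) + 1e-8)
                for t in train_frames]
        best = int(np.argmax(sims))        # nearest training frame by cosine similarity
        votes.update(train_attrs[best])    # its attributes vote for this test video
    return [a for a, _ in votes.most_common(top_k)]
```

The SVM and HRNE predictors would replace this voting step with learned classifiers over pooled or sequential features.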

Evaluation Metrics
We rely on four standard metrics, BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015), and ROUGE-L (Lin, 2004), to evaluate our methods. These are commonly used in image and video captioning tasks and allow us to compare our results against previous work. We use the Microsoft COCO evaluation server (Chen et al., 2015), which is widely used in previous work, to compute the metric scores. Across all four metrics, higher scores indicate that the generated captions are assessed as being closer to captions created by humans.

Experimental Settings
The number of hidden units in the input multimodal layer and in the LSTM is 512 each. The activation function of the LSTM is tanh, and the activation functions of both multimodal layers are linear. The dropout rates of both the input and output multimodal layers are set to 0.5. We use pretrained 300-dimensional GloVe (Pennington et al., 2014) vectors as our word embedding matrix. We rely on the RMSProp algorithm (Tieleman and Hinton, 2012) to update parameters for better convergence, with a learning rate of 10^-4. The beam size during sentence generation is set to 5. Our system is implemented using the Theano framework (Bastien et al., 2012; Bergstra et al., 2010).

Results
Visual only: First, for comparison, we show the results of using only visual attention. Specifically, we only use the temporal and motion features (TM), removing the semantic branch while leaving the other components of our model unchanged. To evaluate the effectiveness of different sorts of visual cues, we also report the results of using only temporal features (T) and only motion features (M). We compare our methods with six state-of-the-art methods: LSTM-YT (Venugopalan et al., 2015b), S2VT (Venugopalan et al., 2015a), TA (Yao et al., 2015), LSTM-E (Pan et al., 2016), HRNE-A (Pan et al., 2015), and h-RNN (Yu et al., 2015). Table 1 provides a comparison of these systems on the MSVD dataset. Since some of the previous work uses different features, we also run experiments for those methods whose source code is provided by the authors, or re-implement the models described in their papers, and then evaluate them using our features. The corresponding extra results are marked by '*'.
We observe that even with temporal features alone, we obtain fairly good results, which implies that the attention model in our approach is useful. Combining temporal and motion features, our method outperforms previous work, confirming that our attention model with multimodal layers can effectively extract useful information from temporal and motion features. In fact, the TA and LSTM-E studies also employ both temporal and motion features, but do not have a separate motion attention mechanism, and the h-RNN study only considers attention after the sentence generator. In contrast, our attention mechanism operates both before and after the sentence generator, enabling it to attend to different aspects during the analysis and synthesis processes for a single sentence. The results on the MSR-VTT dataset are shown in Table 3. They are consistent in that they also show that the combined attention for temporal and motion features obtains better results.

Multi-Faceted Attention:
To show the influence of our multi-faceted attention with additional semantic cues, we first consider the low-quality semantic attributes. Tables 2 and 3 provide results using low-quality attributes obtained via the NN (TM-P-NN), SVM (TM-P-SVM), and HRNE (TM-P-HRNE) methods described above. We find that the results for NN and SVM are sometimes slightly worse than using only visual attention, indicating that attributes of too low quality do not help improve caption quality. It appears that these methods are rather unreliable and introduce significant noise. HRNE fares slightly better than using only visual attention, as it combines top-down and bottom-up approaches to obtain more stable and reliable results.
Then, we consider the high-quality semantic attributes, subject and verb (TM-HQ-SV), derived from the ground-truth captions. We also report the performance of using only the subject (TM-HQ-S) or only the verb (TM-HQ-V). These results, too, are included in Tables 2 and 3. We find that our method is able to exploit high-quality subject and verb attributes to outperform other methods by very large margins. Even using just a single semantic attribute yields very strong results. Here, verb information proves slightly more informative than the subject, indicating that identifying actions in videos remains more challenging than identifying important objects.
Overall, we observe that our method, with just two high-quality attributes, approaches human-level scores in terms of the METEOR and CIDEr metrics. To estimate human-level performance, we randomly selected one caption for each video in the test set and evaluated it after removing it from the ground truth. Although not perfect, these results (Human) can be viewed as an estimate of human-level performance. The BLEU scores of our method are in fact even higher than the human-level ones, since humans often prefer generating longer captions, which tend to obtain lower BLEU scores. Several studies, including on caption generation, have concluded that BLEU is not a sufficiently reliable metric in terms of replicating human judgment scores (Kulkarni et al., 2013; Vedantam et al., 2015). Figure 4 shows several example captions generated by our approach for MSVD videos.
To further investigate the influence of noise, we randomly select genuine high-quality subject and verb attributes and replace them with random incorrect ones. Figure 3 provides the results on MSVD. These results show that even with 50% noise, the results are better than just using regular visual attention. With extremely strong noise levels, the results are worse than using only visual attention, but still remain at a reasonable level. This suggests that we are likely to benefit from further semantic attributes such as tags, titles, and comments, which are often available for online videos, even if they are noisy.

Conclusion
We have proposed a novel method for video captioning based on an extensible multi-faceted attention mechanism, outperforming previous work by large margins.
Even without semantic attributes, our method outperforms state-of-the-art approaches using visual features.With just two high-quality semantic attributes, the results become competitive with human results across a range of metrics.This opens up important new avenues for future work on exploring the large space of potential additional forms of semantic cues and attributes.

Figure 1: Example video with extracted visual features, semantic attributes, and the generated caption as output.

Figure 2: Model architecture. Temporal, motion, and semantic features are weighted via an attention mechanism and aggregated, together with an input word, by a multimodal layer. Its output updates the LSTM's hidden state. A similar attention mechanism then determines the next output word via a softmax layer.

Figure 4: Examples of generated captions on MSVD. GT1 and GT2 are ground-truth captions.

Table 1: Results of models using visual features only, on MSVD, where (-) indicates unknown scores.

Table 2: Results of models combining visual and semantic attention on MSVD.

Figure 3: Results of adding noise to high-quality semantic attributes of MSVD. The blue solid lines show results with added noise; the red dashed lines show the corresponding results of the TM model.