Some jottings on the baseline paper for my MAI project

Image credit: SCAN

It has been over a semester since my MAI programme started here at KU Leuven! What an exciting new start after flying 30 hours from coastal Auckland in the Southern Hemisphere to the historic university town of Leuven! As my final semester commenced today, I thought it was time to write down some of my thoughts on the baseline paper for my MAI project.

After doing my Honours research in unsupervised machine learning, specifically data stream mining, I decided to try my hand at some other trendy fields of AI, which led to the project I am working on now: image-text alignment for artworks. This project sits at the intersection of NLP and CV and requires knowledge of both areas.

Why do we need to dig into the alignment between text and images? Andrej Karpathy and Li Fei-Fei put it as follows in their CVPR 2015 paper:

  • A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene. However, this remarkable ability has proven to be an elusive task for our visual recognition models. The majority of previous work in visual recognition has focused on labelling images with a fixed set of visual categories and great progress has been achieved in these endeavours. However, while closed vocabularies of visual concepts constitute a convenient modeling assumption, they are vastly restrictive when compared to the enormous amount of rich descriptions that a human can compose. [1]

Since Karpathy and Li proposed visual-semantic alignments, the topic has been revisited many times, and a lot of great work has come out that significantly improved performance on the visual-semantic alignment task.

For my thesis project, the image dataset we chose focuses mainly on artworks. Unlike most images of real-life scenes, artworks may contain more subtle features and objects, which requires our model to achieve alignment at a very fine-grained level. The baseline model I adopted is Stacked Cross Attention (SCAN) [2].

Lee et al. point out some drawbacks of existing mainstream image-text alignment models:

  • Some do not use an attention mechanism at all: the entire image and the entire sentence are each aggregated into a single representation, e.g. VSE++ [3].

  • Some are a bit more refined: they calculate the similarity of every region-word pair, take the maxima, and add them together as the similarity of the entire pair, e.g. deep-vs.

  • Some do use attention, but with a predetermined number of steps that captures only limited semantic alignment, and because the entire feature is fed in, they lack interpretability, e.g. DAN.

The paper aims to combine the aspects above (an attention mechanism and region-word pairs): it first extracts features for the image and the sentence, then applies attention between every region and every word, and finally calculates the similarity, so that attention is used for finer-grained alignment.

The paper also uses some established training techniques, such as hard negatives and a triplet ranking loss.

The proposed model has the following parts:

  1. Image-Text Matching

In simple terms:

First, use bottom-up attention [4] to extract multiple region proposals from the image as features and map them to the same dimension as the sentence features; use a bidirectional GRU to extract features for the words of the sentence.

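  • Stage 1: for each region $i$, compute the attention weights $\alpha_{i j}$ over all words $j$, then take the weighted sum of the word features $e_{j}$ to obtain the attended sentence vector $a_{i}^{t}$. Here $\bar{s}_{i j}$ is the normalised similarity between region $i$ and word $j$, and $\lambda_{1}$ is the inverse temperature of the softmax. The formulas are as follows: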

$$ a_{i}^{t}=\sum_{j=1}^{n} \alpha_{i j} e_{j} $$

$$ \alpha_{i j}=\frac{\exp \left(\lambda_{1} \bar{s}_{i j}\right)}{\sum_{j=1}^{n} \exp \left(\lambda_{1} \bar{s}_{i j}\right)} $$

  • Stage 2: calculate the cosine similarity between the $i$-th region $v_{i}$ and the attended sentence vector $a_{i}^{t}$ obtained in Stage 1.

$$R\left(v_{i}, a_{i}^{t}\right)=\frac{v_{i}^{T} a_{i}^{t}}{\left\|v_{i}\right\|\left\|a_{i}^{t}\right\|}$$
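
To make the two stages concrete, here is a minimal NumPy sketch of how they could be computed for one image-sentence pair. It is my own illustration, not the authors' code. The function names, the value of `lambda1` and the `eps` constant are hypothetical, and I am assuming that $\bar{s}_{i j}$ is the cosine similarity clipped at zero and then L2-normalised across regions, which the formulas above do not spell out.

```python
import numpy as np

def attend_words(regions, words, lambda1=4.0, eps=1e-8):
    """Stage 1: attend to words with respect to each region.

    regions: (k, h) region features v_i
    words:   (n, h) word features e_j
    Returns the attended sentence vectors a_i^t, shape (k, h),
    and the attention weights alpha_ij, shape (k, n).
    """
    v = regions / (np.linalg.norm(regions, axis=1, keepdims=True) + eps)
    e = words / (np.linalg.norm(words, axis=1, keepdims=True) + eps)
    s = v @ e.T                               # cosine similarity s_ij, (k, n)

    # Assumed normalisation: clip at zero, then L2-normalise over regions.
    s = np.maximum(s, 0.0)
    s_bar = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)

    # Softmax over the words j with inverse temperature lambda1.
    logits = lambda1 * s_bar
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)

    attended = alpha @ words                  # a_i^t = sum_j alpha_ij e_j
    return attended, alpha

def region_relevance(regions, attended, eps=1e-8):
    """Stage 2: cosine similarity R(v_i, a_i^t) for every region, shape (k,)."""
    num = np.sum(regions * attended, axis=1)
    den = np.linalg.norm(regions, axis=1) * np.linalg.norm(attended, axis=1)
    return num / (den + eps)

# Toy usage with random features: 3 regions, 5 words, h = 4.
rng = np.random.default_rng(0)
v_demo = rng.standard_normal((3, 4))
e_demo = rng.standard_normal((5, 4))
a_demo, alpha_demo = attend_words(v_demo, e_demo)
r_demo = region_relevance(v_demo, a_demo)
```

Each row of `alpha_demo` sums to one, so every region gets its own soft selection of words, and `r_demo` holds one relevance score per region.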

Finally, the similarities of all $k$ regions are pooled to get the similarity between image and text, using either LogSumExp (LSE) or average (AVG) pooling.

$$ S_{L S E}(I, T)=\log \left(\sum_{i=1}^{k} \exp \left(\lambda_{2} R\left(v_{i}, a_{i}^{t}\right)\right)\right)^{\left(1 / \lambda_{2}\right)} $$

$$ S_{A V G}(I, T)=\frac{\sum_{i=1}^{k} R\left(v_{i}, a_{i}^{t}\right)}{k} $$
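
The two pooling choices can be sketched as below. This is only an illustration under my own reading of the LSE formula: I take the exponent $1 / \lambda_{2}$ to apply inside the logarithm, so the score is $\frac{1}{\lambda_{2}} \log \sum_{i} \exp \left(\lambda_{2} R_{i}\right)$. The function names and the default `lambda2` are hypothetical.

```python
import numpy as np

def pool_lse(relevance, lambda2=6.0):
    """LogSumExp pooling: (1 / lambda2) * log(sum_i exp(lambda2 * R_i))."""
    z = lambda2 * np.asarray(relevance, dtype=float)
    m = z.max()                                # shift for numerical stability
    return float((m + np.log(np.exp(z - m).sum())) / lambda2)

def pool_avg(relevance):
    """Average pooling: mean of the region-level similarities."""
    return float(np.mean(relevance))

# Toy usage with three hand-picked region relevance scores.
r_demo = [0.2, 0.5, 0.9]
s_lse = pool_lse(r_demo)
s_avg = pool_avg(r_demo)
```

With this reading, LSE behaves like a soft maximum: it is never below the largest $R_{i}$ and moves closer to it as $\lambda_{2}$ grows, whereas AVG weights every region equally.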

  1. Text-Image Matching

The steps mirror those above exactly, except that each word attends to the image regions and the similarity is calculated between the word and its attended image vector, so they are not repeated here.

  1. Target alignment

Target alignment is essentially the choice of loss function. The authors use a triplet ranking loss with the hardest negatives. That is, for a matching pair ($I$, $T$), the hard negative image is the image other than $I$ that has the highest similarity to $T$, and the hard negative text is the sentence other than $T$ that has the highest similarity to $I$. The formula is as follows:

$$ \hat{I}_{h}=\operatorname{argmax}_{m \neq I} S(m, T) \text { and } \hat{T}_{h}=\operatorname{argmax}_{d \neq T} S(I, d) $$

The final loss, where $\alpha$ is the margin and $[x]_{+}=\max (x, 0)$:

$$ l_{h a r d}(I, T)=\left[\alpha-S(I, T)+S\left(I, \hat{T}_{h}\right)\right]_{+}+\left[\alpha-S(I, T)+S\left(\hat{I}_{h}, T\right)\right]_{+} $$
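
As a rough sketch of how this loss could be evaluated, assume we already have a square matrix of similarity scores for a mini-batch, with matching pairs on the diagonal. Searching for hard negatives only within the batch, summing over the batch, the function name and the margin value are all my assumptions for illustration.

```python
import numpy as np

def hardest_negative_triplet_loss(scores, margin=0.2):
    """Triplet ranking loss with the hardest negatives in a batch.

    scores[i, j] = S(image_i, text_j); the diagonal holds matching pairs.
    """
    scores = np.asarray(scores, dtype=float)
    positive = np.diag(scores)                 # S(I, T) for each pair

    masked = scores.copy()
    np.fill_diagonal(masked, -np.inf)          # exclude the positives

    hardest_text = masked.max(axis=1)          # max over d != T of S(I, d)
    hardest_image = masked.max(axis=0)         # max over m != I of S(m, T)

    loss_text = np.maximum(0.0, margin - positive + hardest_text)
    loss_image = np.maximum(0.0, margin - positive + hardest_image)
    return float(np.sum(loss_text + loss_image))

# Toy usage: two image-text pairs.
demo_scores = [[0.9, 0.1],
               [0.8, 0.7]]
demo_loss = hardest_negative_triplet_loss(demo_scores)
```

In the toy matrix, the second pair is penalised because a wrong sentence scores 0.8 against its image while the matching sentence only scores 0.7, and the first pair is penalised because the second image scores 0.8 against its sentence, which is within the margin of the matching score 0.9.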

  1. Image and text feature representation
  • Image: bottom-up attention, which is built on the Faster R-CNN object detector. Attention here means focusing more on the objects and less on the background. Faster R-CNN was proposed for the object detection problem; some changes have been made here, and the detection threshold has been adjusted to select salient regions.

    Flow: Faster R-CNN with ResNet-101 → 2048-dimensional features → fully-connected layer to $h$ dimensions → feature set $v$

  • Sentence: RNN

    Flow: word → one-hot vector → embedding to 300 dimensions → bidirectional GRU to $h$ dimensions
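
The two flows are easiest to follow as a shape walk-through. The sketch below uses random numbers in place of real detector outputs and trained weights, and all sizes (`k`, `h`, `vocab_size`, `n`) are hypothetical. It covers only the projection steps; the bidirectional GRU is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
k, h, vocab_size, n = 36, 1024, 1000, 7    # hypothetical sizes

# Image branch: k region features (2048-d) -> fully-connected layer -> h-d.
region_feats = rng.standard_normal((k, 2048))   # stand-in for detector output
W_v = rng.standard_normal((2048, h)) * 0.01     # untrained placeholder weights
b_v = np.zeros(h)
v = region_feats @ W_v + b_v                    # feature set v, shape (k, h)

# Sentence branch, embedding step only: word -> one-hot -> 300-d.
word_ids = rng.integers(0, vocab_size, size=n)
W_e = rng.standard_normal((vocab_size, 300)) * 0.01
one_hot = np.eye(vocab_size)[word_ids]          # (n, vocab_size)
embedded = one_hot @ W_e                        # (n, 300)

# Multiplying a one-hot vector by W_e is the same as a row lookup.
lookup_matches = np.allclose(embedded, W_e[word_ids])
```

The 300-dimensional word vectors would then go through the bidirectional GRU to reach $h$ dimensions, so that regions and words live in the same space before attention is applied.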

  1. Summary

The most prominent contribution of the paper is applying attention to alignment at the word and region level, which brings a considerable gain in interpretability. Attention between words and regions runs in both directions (cross attention), and the similarity calculation is stacked on top of the attended vectors as a second stage, which is the reason for the name Stacked Cross Attention.


[1] Karpathy, Andrej and Fei-Fei Li. “Deep visual-semantic alignments for generating image descriptions.” CVPR (2015).

[2] Lee, Kuang-Huei & Chen, Xi & Hua, Gang & Hu, Houdong & He, Xiaodong. “Stacked Cross Attention for Image-Text Matching.” ECCV (2018).

[3] Faghri, Fartash & Fleet, David & Kiros, Jamie & Fidler, Sanja. “VSE++: Improved Visual-Semantic Embeddings.” (2017).

[4] Anderson, Peter & He, Xiaodong & Buehler, Chris & Teney, Damien & Johnson, Mark & Gould, Stephen & Zhang, Lei. “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.” CVPR (2018), 6077–6086. doi:10.1109/CVPR.2018.00636.

Feiyang Tang
Ph.D. Candidate in Machine Learning

Data Enthusiast, ENFJ-T. Travelling, hiking and crime series lover. Multilingual.