
Image Captioning in Vietnamese

Undergraduate Thesis. Built a Vietnamese dataset from MS-COCO and trained CNN+LSTM models.

Project Details

  • Role: Student Researcher
  • Team Size: 1
  • Duration: Feb 2018 - Oct 2018

Overview

Fun Fact

This was my undergraduate thesis and my first experience with writing a research paper.

The Challenge

Image captioning, the task of automatically generating natural language descriptions for images, has made significant progress in English. For low-resource languages like Vietnamese, however, the lack of labeled datasets remains a major hurdle. This project aimed to bridge that gap by constructing a Vietnamese dataset and exploring several strategies for adapting English-centric models to Vietnamese.

Dataset Construction

To create a benchmark for Vietnamese image captioning, I employed a semi-automated approach to translate a subset of the MS-COCO dataset; a sketch of the pipeline follows the list below.

  • Source: 4,000 images from the MS-COCO training set.
  • Volume: 20,000 image-caption pairs (5 captions per image).
  • Process:
    • Machine Translation: Used Google Translate for initial candidates.
    • Human Translation: Manually refined translations to ensure cultural nuances and grammatical correctness were preserved, correcting errors where the machine translation conflicted with the visual content.
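
The pipeline can be sketched roughly as follows. This is a minimal sketch, not the thesis scripts: the annotation file name follows the public MS-COCO release, and `translate_to_vietnamese()` is a hypothetical stand-in for the Google Translate call, with the exported drafts then corrected by hand.

```python
# Sketch of the semi-automated construction step: sample MS-COCO images,
# machine-translate their English captions, and export drafts for review.
import json
import random
from collections import defaultdict

def translate_to_vietnamese(text: str) -> str:
    """Hypothetical stand-in for the Google Translate step."""
    return text  # identity placeholder; the real pipeline called a translator

with open("annotations/captions_train2014.json") as f:
    coco = json.load(f)

captions_by_image = defaultdict(list)
for ann in coco["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

# 4,000 images x 5 captions each = 20,000 image-caption pairs.
sampled_ids = random.sample(sorted(captions_by_image), 4_000)
rows = [
    {"image_id": img_id, "en": cap, "vi_draft": translate_to_vietnamese(cap)}
    for img_id in sampled_ids
    for cap in captions_by_image[img_id][:5]
]

# The vi_draft field is then manually corrected before training.
with open("vi_captions_for_review.json", "w") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)
```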

Methodology

The core architecture was the Show and Tell model (CNN + LSTM); a minimal model sketch follows the list. I experimented with four distinct strategies to determine the most effective approach for Vietnamese:

  1. Translate-First: Generating English captions and translating them to Vietnamese (CIDEr: 0.827).
  2. Machine-Translated Training: Training directly on Google-translated captions (CIDEr: 0.935).
  3. Human-Translated Training: Training on manually quality-controlled captions (CIDEr: 1.095).
  4. Tokenized Training: Applying Vietnamese word segmentation (using RDRsegmenter) to the human-translated captions before training, to handle the space-delimiter ambiguity in Vietnamese (CIDEr: 1.148).
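
For reference, here is a minimal Show-and-Tell-style captioner in Keras. It is a sketch, not the thesis code: the sizes are placeholder hyperparameters, image features are assumed to be pre-extracted 2048-d CNN activations, and captions are assumed to be integer-encoded.

```python
# Minimal Show-and-Tell decoder: the image embedding is fed to the LSTM
# as the "zeroth word", followed by the caption tokens.
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10_000  # assumption: Vietnamese caption vocabulary size
EMBED_DIM = 512      # assumption: embedding and LSTM width
MAX_LEN = 20         # assumption: maximum caption length in tokens

# Image pathway: project CNN features into the word-embedding space.
cnn_features = layers.Input(shape=(2048,), name="cnn_features")
image_token = layers.Reshape((1, EMBED_DIM))(layers.Dense(EMBED_DIM)(cnn_features))

# Text pathway: embed the caption tokens (a Word2Vec matrix, listed among
# the project's technologies, could plausibly initialise this layer).
caption_tokens = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
word_embeddings = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_tokens)

# Decoder: one LSTM over [image, w1, ..., wT], predicting the next word
# at every timestep.
sequence = layers.Concatenate(axis=1)([image_token, word_embeddings])
hidden = layers.LSTM(EMBED_DIM, return_sequences=True)(sequence)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model([cnn_features, caption_tokens], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

At inference time, decoding starts from the image embedding plus a start-of-sentence token and feeds each predicted word back into the LSTM, either greedily or with beam search.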

Results

The experiments demonstrated that data quality and language-specific preprocessing are crucial.

  • Impact of Human Translation: Training on high-quality, human-translated data significantly outperformed machine-translated baselines.
  • Impact of Tokenization: Vietnamese words often consist of multiple space-separated syllables. Treating them as single tokens (e.g., "trưởng thành" as "trưởng_thành" instead of the two tokens "trưởng", "thành") improved the CIDEr score to 1.148, the best result in the study; a toy segmentation sketch follows the table.
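
To make the segmentation effect concrete, here is a toy greedy longest-match segmenter over a tiny hand-made lexicon. The thesis used RDRsegmenter, which is far more sophisticated; this sketch only demonstrates the underscore-joining convention.

```python
# Toy Vietnamese word segmentation: join multi-syllable words with "_"
# so the captioner treats them as single tokens.
LEXICON = {("con", "mèo"), ("trưởng", "thành")}  # hypothetical lexicon entries

def segment(sentence: str) -> list[str]:
    syllables = sentence.split()
    tokens, i = [], 0
    while i < len(syllables):
        # Greedy longest match: try 3-syllable, then 2-syllable words,
        # and fall back to the single syllable.
        for n in (3, 2, 1):
            candidate = tuple(syllables[i:i + n])
            if len(candidate) == n and (n == 1 or candidate in LEXICON):
                tokens.append("_".join(candidate))
                i += n
                break
    return tokens

print(segment("con mèo đã trưởng thành"))
# ['con_mèo', 'đã', 'trưởng_thành']
```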

| Method                     | BLEU-1 | BLEU-4 | CIDEr |
|----------------------------|--------|--------|-------|
| English Baseline           | 0.677  | 0.249  | 0.949 |
| English-to-VN (Translate)  | 0.630  | 0.191  | 0.827 |
| Google-Translated Train    | 0.639  | 0.251  | 0.935 |
| Human-Translated Train     | 0.683  | 0.284  | 1.095 |
| Tokenized Human-Translated | 0.686  | 0.281  | 1.148 |
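
The BLEU and CIDEr numbers above come from the standard coco-caption metrics. A minimal scoring sketch using the pycocoevalcap package follows; the image ids and captions are made-up placeholders, and the real evaluation runs over the full test split.

```python
# Minimal metric computation with pycocoevalcap (the coco-caption code).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

references = {   # each image id maps to its human reference captions
    "img_1": ["một con mèo ngồi trên ghế", "con mèo nằm trên chiếc ghế"],
    "img_2": ["một người đàn ông đang lái xe đạp"],
}
hypotheses = {   # exactly one generated caption per image id
    "img_1": ["con mèo ngồi trên ghế"],
    "img_2": ["người đàn ông đi xe đạp"],
}

bleu, _ = Bleu(4).compute_score(references, hypotheses)
cider, _ = Cider().compute_score(references, hypotheses)
print("BLEU-1..BLEU-4:", [round(s, 3) for s in bleu])
print("CIDEr:", round(cider, 3))
```

For the tokenized runs, both references and hypotheses are segmented first, so the metrics operate on whole words rather than syllables.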

Conclusion

This thesis established a baseline for Vietnamese image captioning, showing that while transfer learning from English is viable, adapting to the specific linguistic characteristics of Vietnamese (such as word segmentation) yields the best performance.

Key Achievements

  • Built a Vietnamese image captioning dataset of 4,000 MS-COCO images and 20,000 translated captions
  • Investigated the impact of word segmentation on Vietnamese captioning
  • Achieved 1.148 CIDEr score with Tokenized Human-Translated method

Technologies Used

TensorFlow, Keras, CNN, LSTM, Word2Vec

Skills Applied

Computer Vision, Natural Language Processing, Machine Translation