Sanskrit to Vietnamese Translation
Research

Sanskrit to Vietnamese Translation

Benchmarking Modern LLMs on Ancient Buddhist Scripture Translation (Sanskrit/Pali to Vietnamese).

Project Details

  • Role

    Researcher

  • Team Size

    Individual

  • Duration

    a weekend project in Jan 2026

Overview

1. Research Motivation

This project aims to evaluate the capability of modern Large Language Models (LLMs) in translating ancient Buddhist scriptures (Sanskrit and Pali) into Vietnamese. The core research questions address:

  • Can LLMs match the translation quality of renowned human scholars (e.g., Thich Nhat Hanh, Thich Minh Chau)?
  • Which LLM currently performs best for this specific low-resource, domain-specific task?
  • Is Sanskrit or Pali a better source language for translation quality into Vietnamese?

2. Technical Stack

The benchmark is built with a modular Python architecture using modern tooling:

  • Language & Manager: Python 3.12+, managed by uv.
  • LLM Interface: litellm for unified API access to various providers (Groq, Gemini, etc.).
  • Configuration: hydra-core for hierarchical configuration management.
  • Data Handling: pandas for CSV manipulation, pydantic for data validation.
  • Caching: diskcache for persistent disk-based caching to handle API rate limits and resume interrupted runs.
  • Scoring: sacrebleu (BLEU) and bert_score (Semantic similarity with torch).
  • Observability: langfuse with opentelemetry (API, SDK, OTLP exporter) for tracing and experiment tracking.
  • Reliability: tenacity for retry logic with exponential backoff.
  • Templating: jinja2 for prompt template rendering.
  • Data Crawling: Modular crawler design (src/crawlers/) using cloudscraper and BeautifulSoup for robust data collection from Buddhist websites.
  • Prompt Versioning: Structured management (src/system_prompts/) ensuring reproducibility and traceability of LLM behaviors.
  • Local Inference: vllm for serving local models (e.g. Qwen2.5) with OpenAI-compatible API.
  • Infrastructure Monitoring: prometheus and grafana for tracking vLLM metrics (GPU usage, request latency).

3. Data Sources

The benchmark utilizes parallel datasets comparing ancient texts with human translations:

  • Sanskrit → Vietnamese: based on the Heart Sutra (Bát Nhã Tâm Kinh), sanskrit_vi_heart_sutra.csv.
  • Pali → Vietnamese: based on the Dhammapada (Kinh Pháp Cú), pali_vi_dhammapada.csv.
  • Parallel Comparison: A dataset aligning Dhammapada (Pali) and Udanavarga (Sanskrit) verses, dhammapada_udanavarga_parallel.csv.

Langfuse Datasets (Data Management)

Langfuse Dataset Screenshot

Langfuse Dataset Screenshot

The benchmark supports auto dataset uploading into Langfuse:

  • Auto-Sync: When running with a local file, it automatically checks if the corresponding Langfuse dataset exists. If the local file has more items, it upserts them.
  • Schema Enforcement: All datasets utilize Langfuse's Native Schema Enforcement to ensure data validity:
    • input: String (Source text)
    • expected_output: Dictionary (Reference translations)
  • Evaluation Scores Dashboard: scores are visualized in Langfuse dashboard grouped by model and dataset.

4. Evaluation Metrics

The project employs a multi-faceted evaluation strategy:

  1. BLEU Score: Measures lexical overlap with human references (sacrebleu).
  2. BERTScore: Measures semantic similarity using contextual embeddings (bert-base-multilingual-cased implied for 'vi').
  3. LLM-as-a-Judge: A qualitative assessment using Gemini-3-Flash-Preview to score translations on a 1-5 scale for:

Accuracy*: Faithfulness to the meaning.

Fluency*: Naturalness of the Vietnamese text.

5. Models Benchmarked

The following models are currently evaluated:

  • Llama-3.3-70b-versatile (via Groq)
  • Llama-4-Maverick-17B (via Groq)
  • GPT-OSS-120b (via Groq)
  • Kimi-k2-instruct-0905 (MoonshotAI via Groq)
  • Qwen3-32b (via Groq)
  • Gemini-3-Flash-Preview (Google)
  • GPT-5.2 (OpenAI)
  • Grok-4-0709 & Grok-4.1-Fast-Reasoning (xAI)
  • DeepSeek-V3.2-Chat & DeepSeek-V3.2-Reasoner (DeepSeek)

vLLM Monitoring (Prometheus + Grafana)

Support local LLM inference with vLLM and monitoring of vLLM performance, see the vLLM Monitoring Guide.

vLLM Dashboard on Grafana

vLLM Dashboard on Grafana

6. Current Results Highlights

Sanskrit → Vietnamese (Heart Sutra | Bát Nhã Tâm Kinh)

Date: 2026-01-24

LLM Judge Model: gemini/gemini-3-flash-preview

Dataset: data/sanskritviheart_sutra.csv

DatasetModelBLEU ↑BERTScore ↑LLM Judge Accuracy (1-5) ↑LLM Judge Fluency (1-5) ↑Time (s) ↓
sanskrit-vi-heart-sutraGPT-OSS-20B4.290.622.722.114.36
sanskrit-vi-heart-sutraGPT-OSS-120b8.520.713.002.619.32
sanskrit-vi-heart-sutraLlama-3.1-8b6.020.682.392.221.49
sanskrit-vi-heart-sutraLlama-3.3-70B15.820.724.003.612.89
sanskrit-vi-heart-sutraLlama-4-Maverick-17B21.720.754.674.224.54
sanskrit-vi-heart-sutraKimi-K210.800.674.614.004.03
sanskrit-vi-heart-sutraQwen3-32B11.500.753.833.334.76
sanskrit-vi-heart-sutraGPT-5.220.790.754.894.3917.89
sanskrit-vi-heart-sutraGrok-4-070926.170.764.834.0031.69
sanskrit-vi-heart-sutraGrok-4.1-Fast-Reasoning15.270.774.784.4435.26
sanskrit-vi-heart-sutraDeepSeek-V3.2-Chat26.090.764.674.1129.81
sanskrit-vi-heart-sutraDeepSeek-V3.2-Reasoner23.270.764.503.83103.64
sanskrit-vi-heart-sutraGemini-3-Flash33.730.764.944.5611.11

Pali → Vietnamese (Dhammapada | Kinh Pháp Cú)

Date: 2026-01-24

LLM Judge Model: gemini/gemini-3-flash-preview

Dataset: data/palividhammapada.csv

DatasetModelBLEU ↑BERTScore ↑LLM Judge Accuracy (1-5) ↑LLM Judge Fluency (1-5) ↑Time (s) ↓
pali-vi-dhammapadaGPT-OSS-20B1.680.671.401.553.57
pali-vi-dhammapadaGPT-OSS-120b5.960.702.451.9510.84
pali-vi-dhammapadaLlama-3.1-8b0.840.661.451.651.99
pali-vi-dhammapadaLlama-3.3-70B9.270.773.052.755.13
pali-vi-dhammapadaLlama-4-Maverick-17B15.750.784.203.903.06
pali-vi-dhammapadaKimi-K215.990.794.304.006.53
pali-vi-dhammapadaQwen3-32B13.580.773.253.054.71
pali-vi-dhammapadaGPT-5.221.900.804.704.4522.24
pali-vi-dhammapadaGrok-4-070921.390.804.454.2034.38
pali-vi-dhammapadaGrok-4.1-Fast-Reasoning14.840.783.853.6520.84
pali-vi-dhammapadaDeepSeek-V3.2-Chat22.090.814.654.2047.69
pali-vi-dhammapadaDeepSeek-V3.2-Reasoner19.840.804.454.05125.94
pali-vi-dhammapadaGemini-3-Flash32.670.834.654.458.47

Pali vs Sanskrit Comparison (Dhammapada - Udanavarga | Kinh Pháp Cú)

To analyze the comparative translation quality between Middle Indo-Aryan (Pali) and Old Indo-Aryan (Sanskrit) into Vietnamese, we constructed a multi-source parallel corpus based on the "Gāthā" (verse) literature of early Buddhism.

  • Pali Source: The Dhammapada (Theravāda tradition), widely regarded as the most representative anthology of early Buddhist ethics.
  • Sanskrit Source: The Udanavarga (Sarvāstivāda tradition), the Sanskrit textual cousin to the Dhammapada.
  • Vietnamese References:
    • Reference A (Liturgical): Translations by Thich Minh Chau (strictly adhering to the Pali Prime).
    • Reference B (Modern/Natural): Contemporary prose translations focusing on intelligibility.

Date: 2026-01-24

Judge Model: gemini/gemini-3-flash-preview

Dataset: dhammapadaudanavargaparallel.csv (20 samples)

DatasetModelBLEU ↑BERTScore ↑LLM Judge Accuracy (1-5) ↑LLM Judge Fluency (1-5) ↑Time (s) ↓
pali-vi-dhammapada-18versesGPT-OSS-20B1.410.671.722.176.97
pali-vi-dhammapada-18versesGPT-OSS-120b7.600.702.502.2810.20
pali-vi-dhammapada-18versesLlama-3.1-8b1.590.712.002.331.89
pali-vi-dhammapada-18versesLlama-3.3-70B9.000.763.723.224.39
pali-vi-dhammapada-18versesLlama-4-Maverick-17B16.600.804.223.722.95
pali-vi-dhammapada-18versesKimi-K211.610.722.882.945.42
pali-vi-dhammapada-18versesQwen3-32B12.020.783.173.064.44
pali-vi-dhammapada-18versesGPT-5.219.490.804.674.4421.71
pali-vi-dhammapada-18versesGrok-4-070920.520.794.333.7833.42
pali-vi-dhammapada-18versesGrok-4.1-Fast-Reasoning13.350.784.063.5611.16
pali-vi-dhammapada-18versesDeepSeek-V3.2-Chat27.390.814.564.5042.27
pali-vi-dhammapada-18versesDeepSeek-V3.2-Reasoner18.160.804.003.72112.90
pali-vi-dhammapada-18versesGemini-3-Flash37.580.844.674.566.33
sanskrit-vi-udanavarga-18versesGPT-OSS-20B1.750.671.611.895.49
sanskrit-vi-udanavarga-18versesGPT-OSS-120b5.290.702.001.898.32
sanskrit-vi-udanavarga-18versesLlama-3.1-8b4.180.721.722.561.77
sanskrit-vi-udanavarga-18versesLlama-3.3-70B8.340.763.062.723.84
sanskrit-vi-udanavarga-18versesLlama-4-Maverick-17B13.240.783.613.002.55
sanskrit-vi-udanavarga-18versesKimi-K28.050.764.063.284.94
sanskrit-vi-udanavarga-18versesQwen3-32B13.760.773.002.505.39
sanskrit-vi-udanavarga-18versesGPT-5.217.610.804.614.2822.07
sanskrit-vi-udanavarga-18versesGrok-4-070915.520.794.333.8343.94
sanskrit-vi-udanavarga-18versesGrok-4.1-Fast-Reasoning10.890.763.783.5617.88
sanskrit-vi-udanavarga-18versesDeepSeek-V3.2-Chat18.580.804.674.1742.37
sanskrit-vi-udanavarga-18versesDeepSeek-V3.2-Reasoner13.170.803.893.61114.35
sanskrit-vi-udanavarga-18versesGemini-3-Flash35.260.834.504.339.65

Key Achievements

  • Benchmarked over 12 state-of-the-art LLMs (including GPT-5.2, DeepSeek-V3, Grok-4, Gemini-3) on specialized Buddhist translation tasks.
  • Demonstrated Gemini-3-Flash's superior performance, consistently outperforming other models across all metrics.
  • Created custom parallel validation datasets aligning Pali (Dhammapada) and Sanskrit (Udanavarga) verses with Vietnamese translations.

Technologies Used

LiteLLMvLLMGroqSacreBLEU & BERTScorePandasGrafana & PrometheusBeautifulSoup & CloudscraperGoogle Gemini APILangfuse & OpenTelemetryHydra

Skills Applied

NLP Evaluation & BenchmarkingLow-Resource Neural Machine TranslationData Curation & CrawlingLLM-as-a-Judge EvaluationAncient Languages (Sanskrit/Pali)