
Persian Poetry Semantic Similarity Benchmark

This repository benchmarks LLMs and embedding models on their ability to capture semantic similarity (قرابت معنایی) in classical Persian poetry.

Experiments Overview

  1. Experiment #1: LLM quiz with zero-shot (42 questions) and few-shot (39 questions) prompting to identify poetry outliers
  2. Experiment #2: Benchmarking embedding models using centroid-based cosine similarity for outlier detection
  3. Experiment #3: Evaluating OpenRouter API embedding models on the same outlier detection task
  4. Experiment #4: Generating interpretations of poetry couplets using 13 different LLMs via OpenRouter
  5. Experiment #5: Testing 117 combinations of embedding models (9) and LLM explanations (13) in a 2D matrix
  6. Experiment #6: Large-scale benchmark of 30 LLMs on 591 questions from the Gherabat book dataset

Experiment #1: LLM Quiz – Understanding Persian Poetry

In this experiment, we evaluated several state-of-the-art large language models on their ability to interpret the conceptual meaning of Persian poetry couplets, known as beits. Each beit consists of two lines (mesra) and conveys a distinct message or idea.

We tested 42 multiple-choice questions with zero-shot prompting and 39 with few-shot prompting. Each question follows the same format but contains different poetry options. The task was to identify the one beit whose conceptual meaning differs from the others.

Zero-Shot Prompting

System Prompt

The models received the following system prompt:

You are an AI assistant analyzing Persian poetry couplets. Identify the outlier based on concept/message. Respond *only* with the single digit number (1, 2, 3, or 4) of the outlier option. Output nothing else.

Sample Question

Analyze the conceptual meaning of the following options:

Options:
1. چو در وقت بهار آیی پدیدار - حقیقت، پرده برداری ز رخسار
2. فروغ رویت اندازی سوی خاک - عجایب نقشها سازی سوی خاک
3. از جلوه وجود تو ظلمت سرای خاک - روشن تر از جمال بتان طراز باد
4. جلوه‌ای کرد رُخت دید ملک عشق نداشت - عین آتش شد از آن غیرت و بر آدم زد

Instruction: Identify the *single* option (by its number) that has a different concept and message from the others. Respond with *only* the number (1, 2, 3, or 4). Do not provide any explanation or other text.
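
A minimal sketch of how each question can be posed to a model through the OpenRouter chat completions endpoint (which is OpenAI-compatible); the model slug is a placeholder and ask_outlier is a hypothetical helper for illustration, not the repository's actual script:

import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

SYSTEM_PROMPT = (
    "You are an AI assistant analyzing Persian poetry couplets. "
    "Identify the outlier based on concept/message. Respond *only* with the "
    "single digit number (1, 2, 3, or 4) of the outlier option. Output nothing else."
)

def ask_outlier(question_text, model="openai/gpt-4o"):
    """Send one outlier question and return the model's raw answer string."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "temperature": 0,  # we want a deterministic single-digit answer
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question_text},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"].strip()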

Results

Here’s how the models performed, ranked by accuracy (percentage of correct outlier identifications out of 42 questions):

[Bar chart: zero-shot accuracy by model]

Few-Shot Prompting

The models received the following system prompt:

You are an expert literary critic with a deep understanding of Persian poetry, its cultural nuances, and its stylistic features. Your task is to analyze a set of poetic options—each option presenting two parts of a couplet—and identify the one option that deviates in conceptual meaning or thematic message from the others. Focus exclusively on the underlying concepts, disregarding stylistic or linguistic differences.

For example:

---
Options:
1. طریق عشق پرآشوب و فتنه است ای دل - بیفتد آن که در این راه با شتاب رود
2. گر نور عشق حق به دل و جانت اوفتد - بالله از آفتاب فلک خوبتر شوی
3. شکوه عشق نگه کن که موی مجنون را - فلک به شعشعه آفتاب، شانه کند
4. فرزانه درآید به پری خانه مقصود - هر کس که در این بادیه دیوانه عشق است

Correct answer: 1

(Option 1 warns against hastily pursuing the turbulent path of love, whereas the other options present love as an uplifting force)

---
Options:
1. شمشیر نیک از آهن بد چون کند کسی؟ - ناکس تربیت نشود ای حکیم کس
2. سگ به دریای هفت گانه بشوی - که چو تر شد پلیدتر باشد
3. ز وحشی نیاید که مردم شود - به سعی اندر او تربیت گم شود
4. سگ اصحاب کهف روزی چند - پی نیکان گرفت و مردم شد

Correct answer: 4

(Option 4 emphasizes the significant impact of upbringing, unlike the other options which imply that upbringing makes little difference)

---
Options:
1. هر چند خوشگوار بود باده غرور - زین می فزون از سنگ نگه دار شیشه را
2. از ساده دلی هر که دهد پند به مغرور - بیدار به افسانه کند خواب گران را
3. کبر مفروش به مردم که به میزان نظر - زود گردد سبک آن کس که بود سنگین تر
4. خاک بر فرقش اگر از کبر سر بالا کند - هر که داند بازگشت او به غیر از خاک نیست

Correct answer: 2

(The meaning of option 2 is the ineffectiveness of giving advice to the arrogant, while the common meaning of the other options is the recommendation to avoid arrogance)

Sample Question

Analyze the conceptual meaning of the following options:

Options:
1. چو در وقت بهار آیی پدیدار - حقیقت، پرده برداری ز رخسار
2. فروغ رویت اندازی سوی خاک - عجایب نقشها سازی سوی خاک
3. از جلوه وجود تو ظلمت سرای خاک - روشن تر از جمال بتان طراز باد
4. جلوه‌ای کرد رُخت دید ملک عشق نداشت - عین آتش شد از آن غیرت و بر آدم زد

Instruction: Identify the *single* option (by its number) that has a different concept and message from the others. Respond with *only* the number (1, 2, 3, or 4). Do not provide any explanation or other text.

Results

Here’s how the models performed, ranked by accuracy (percentage of correct outlier identifications out of 39 questions):

[Bar chart: few-shot accuracy by model]


Experiment #2: Benchmarking Embedding Models

In this experiment, we evaluated the performance of several multilingual embedding models by testing their ability to identify the outlier in a set of 42 multiple-choice questions, each containing 4 options.

For each question, we generated embeddings for the four choices and calculated the cosine similarity between each option and the centroid of the remaining three. The option with the lowest similarity to the rest was predicted as the outlier.

Here's the core logic used for outlier detection:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_outlier_index(embeddings):
    """
    Predicts the index of the outlier embedding as the one least similar to the others.

    Args:
        embeddings: A list or numpy array containing 4 embedding vectors.

    Returns:
        Index (0-3) of the predicted outlier, or -1 if input is invalid.
    """
    if not isinstance(embeddings, np.ndarray):
        embeddings = np.array(embeddings)

    if len(embeddings) != 4 or embeddings.ndim != 2:
        print(f"Warning: Invalid input shape {embeddings.shape}.")
        return -1

    similarity_scores = []

    for i in range(4):
        # Similarity of option i to the centroid of the other three options.
        others = np.delete(embeddings, i, axis=0)
        centroid = np.mean(others, axis=0, keepdims=True)
        current = embeddings[i].reshape(1, -1)
        similarity = cosine_similarity(current, centroid)[0][0]
        similarity_scores.append(similarity)

    # The least similar option is the predicted outlier.
    return int(np.argmin(similarity_scores))

We then evaluated the models based on their accuracy in selecting the correct outlier. For reference, random guessing yields a baseline accuracy of 25%.
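
As an illustration, overall accuracy can then be computed as follows (a sketch; questions is a hypothetical list pairing each question's four embeddings with the zero-based index of the correct outlier):

def evaluate_accuracy(questions):
    """Fraction of questions where the predicted outlier matches the answer key."""
    correct = sum(
        find_outlier_index(embeddings) == answer
        for embeddings, answer in questions
    )
    return correct / len(questions)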

[Bar chart: embedding model accuracy]


Experiment #3: OpenRouter Embedding Models Benchmark

This experiment evaluates embedding models available through the OpenRouter API on the same outlier detection task. Unlike Experiment 2, which used locally hosted models, this experiment tests cloud-based embeddings via API calls.

The outlier detection algorithm is identical to Experiment 2 (centroid-based cosine similarity). Each model generates embeddings for the 4 poetry options, and the option with the lowest similarity to the centroid of the other three is predicted as the outlier.
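
A sketch of fetching the four option embeddings, assuming OpenRouter exposes an OpenAI-compatible embeddings endpoint (the URL, model slug, and embed_options helper are assumptions for illustration, not the repository's actual code):

import os
import numpy as np
import requests

def embed_options(options, model="openai/text-embedding-3-large"):
    """Embed the four poetry options; returns a (4, dim) numpy array."""
    response = requests.post(
        "https://openrouter.ai/api/v1/embeddings",  # assumed OpenAI-compatible endpoint
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "input": options},
        timeout=60,
    )
    response.raise_for_status()
    return np.array([item["embedding"] for item in response.json()["data"]])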

Results (ranked by accuracy):

| Model | Accuracy (%) | Correct | Total |
|---|---|---|---|
| google/gemini-embedding-001 | 39.02 | 16 | 41 |
| openai/text-embedding-3-large | 36.59 | 15 | 41 |
| mistralai/codestral-embed-2505 | 32.50 | 13 | 40 |
| qwen/qwen3-embedding-0.6b | 29.27 | 12 | 41 |
| qwen/qwen3-embedding-8b | 29.27 | 12 | 41 |
| openai/text-embedding-3-small | 26.83 | 11 | 41 |
| openai/text-embedding-ada-002 | 24.39 | 10 | 41 |
| qwen/qwen3-embedding-4b | 21.95 | 9 | 41 |
| mistralai/mistral-embed-2312 | 14.63 | 6 | 41 |

Experiment #4: LLM-Generated Explanations

This experiment generates interpretations of Persian poetry couplets using multiple large language models via OpenRouter. For each poetry option in the dataset, 13 different LLMs provide explanations of the couplet's meaning.

The output is an enriched dataset containing both the original poetry and machine-generated explanations from each model, enabling analysis of how different LLMs interpret Persian literature.

Example

For each poetry couplet, we ask 13 LLMs to interpret its meaning:

Input Poetry:

طریق عشق پرآشوب و فتنه است ای دل - بیفتد آن که در این راه با شتاب رود

Sample LLM Explanations:

  • GPT: "This couplet warns that the path of love is turbulent and fraught with trials..."
  • Claude: "راه عشق پر از آشوب و فتنه است و کسی که با شتاب در این راه حرکت کند، دچار مشکل می‌شود..." (in Persian: "The path of love is full of turmoil and strife, and whoever rushes along this path runs into trouble...")
  • DeepSeek: (includes reasoning) "The verse cautions against rushing into love..."

The script processes all 42 questions × 4 options = 168 couplets, generating explanations from 13 models concurrently.
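
One plausible way to fan this work out concurrently is a thread pool over (model, couplet) pairs; in this sketch, ask_model is a hypothetical chat completion helper (like ask_outlier above, but with an explanation prompt), and the models/couplets inputs are placeholders:

from concurrent.futures import ThreadPoolExecutor, as_completed

def explain_all(couplets, models, max_workers=8):
    """Collect one explanation per (model, couplet) pair, concurrently."""
    prompt = "Explain the meaning of this Persian couplet: {}"
    results = {}  # (model, couplet) -> explanation text
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(ask_model, prompt.format(c), model=m): (m, c)
            for m in models
            for c in couplets
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results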


Experiment #5: Comprehensive Embedding Benchmark on Explanations

This experiment creates a 2D matrix to evaluate:

  • 9 embedding models (from exp-3)
  • 13 LLM explanation sources (from exp-4)

This produces 117 combinations to determine which embedding models work best with which LLM-generated explanations.

Research Questions

  1. Which embedding models perform best overall on explanations?
  2. Which LLM's explanations are most effective for semantic similarity tasks?
  3. Are there specific embedding+LLM combinations that excel?
  4. Do explanations improve performance compared to raw poetry?

For each combination, the benchmark (see the sketch after this list):

  1. Loads explanations from a specific LLM
  2. Embeds those explanations using a specific embedding model
  3. Applies outlier detection (same algorithm as Experiment 2)
  4. Calculates accuracy
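
A sketch of that loop, assuming hypothetical helpers load_explanations(llm) (reading the exp-4 output) and embed_texts(texts, model) (an embeddings call like the one in Experiment #3):

def run_matrix(embedding_models, llm_sources, answer_key):
    """Accuracy for every (embedding model, explanation source) combination."""
    results = {}
    for llm in llm_sources:
        # Four explanation texts per question, all generated by this LLM.
        questions = load_explanations(llm)
        for emb_model in embedding_models:
            correct = sum(
                find_outlier_index(embed_texts(texts, model=emb_model)) == answer
                for texts, answer in zip(questions, answer_key)
            )
            results[(emb_model, llm)] = correct / len(answer_key)
    return results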

Results are aggregated by embedding model and by explanation source, revealing which combinations work best for Persian poetry semantic similarity.

Comparison with Experiment 3


Experiment #6: Large-Scale LLM Benchmark on Gherabat Dataset

This experiment scales up Experiment #1, using a much larger dataset and an improved methodology to evaluate 30 LLMs on Persian poetry semantic similarity tasks.

Dataset and Scale

  • Questions: 591 poetry outlier detection questions (vs 42 in exp-1)
  • Source: Gherabat book dataset with comprehensive answer keys
  • Models: 30 LLMs tested via OpenRouter API

Results

The experiment tested models across multiple families including GPT, Gemini, Claude, DeepSeek, Qwen, LLaMA, and others. Google Gemini 2.5 Pro achieved the highest accuracy at 70%.

Model Accuracy Comparison

The analysis includes:

  • Model performance comparison across 30 LLMs
  • Question difficulty analysis
  • Model agreement and disagreement patterns
  • Semantic error pattern analysis
  • Statistical significance testing
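
For instance, question difficulty and pairwise model agreement can both be read off a per-model answer matrix (a sketch; answers is a hypothetical dict mapping each model name to its list of predicted option numbers, and key is the answer key):

from itertools import combinations

import numpy as np

def difficulty_and_agreement(answers, key):
    """Per-question error rate across models, plus pairwise agreement rates."""
    models = list(answers)
    preds = np.array([answers[m] for m in models])      # (n_models, n_questions)
    difficulty = (preds != np.array(key)).mean(axis=0)  # fraction of models wrong
    agreement = {
        (a, b): float((preds[i] == preds[j]).mean())
        for (i, a), (j, b) in combinations(enumerate(models), 2)
    }
    return difficulty, agreement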

Citation

If you use this benchmark or dataset in your research, please cite this repository:

@misc{jafarnezhad_persian_poetry_2025,
  author       = {Arman Jafarnezhad},
  title        = {Persian Poetry Semantic Similarity Benchmark},
  year         = {2025},
  version      = {1.0.0},
  url          = {https://github.com/ArmanJR/Persian-Poetry-Semantic-Similarity-Benchmark},
  note         = {Dataset and experiments repository},
}

Alternatively, you can click the "Cite this repository" button on the right sidebar of the GitHub page to get citation formats in BibTeX or APA.
