
Persian Poetry Semantic Similarity Benchmark

This repository benchmarks LLMs and embedding models on their ability to capture semantic similarity (قرابت معنایی) in classical Persian poetry.

Experiments Overview

  1. Experiment #1: LLM quiz with zero-shot (42 questions) and few-shot (39 questions) prompting to identify poetry outliers
  2. Experiment #2: Benchmarking embedding models using centroid-based cosine similarity for outlier detection
  3. Experiment #3: Evaluating OpenRouter API embedding models on the same outlier detection task
  4. Experiment #4: Generating interpretations of poetry couplets using 13 different LLMs via OpenRouter
  5. Experiment #5: Testing 117 combinations of embedding models (9) and LLM explanations (13) in a 2D matrix
  6. Experiment #6: Large-scale benchmark of 30 LLMs on 591 questions from the Gherabat book dataset

Experiment #1: LLM Quiz – Understanding Persian Poetry

In this experiment, we evaluated several state-of-the-art large language models on their ability to interpret the conceptual meaning of Persian poetry couplets, known as beits. Each beit consists of two lines (mesra) and conveys a distinct message or idea.

We tested 42 multiple-choice questions with zero-shot prompting and 39 with few-shot prompting. Each question follows the same format but contains different poetry options. The task was to identify the one beit whose conceptual meaning differs from the others.

Zero-Shot Prompting

System Prompt

The models received the following system prompt:

You are an AI assistant analyzing Persian poetry couplets. Identify the outlier based on concept/message. Respond *only* with the single digit number (1, 2, 3, or 4) of the outlier option. Output nothing else.

Sample Question

Analyze the conceptual meaning of the following options:

Options:
1. چو در وقت بهار آیی پدیدار - حقیقت، پرده برداری ز رخسار
2. فروغ رویت اندازی سوی خاک - عجایب نقشها سازی سوی خاک
3. از جلوه وجود تو ظلمت سرای خاک - روشن تر از جمال بتان طراز باد
4. جلوه‌ای کرد رُخت دید ملک عشق نداشت - عین آتش شد از آن غیرت و بر آدم زد

Instruction: Identify the *single* option (by its number) that has a different concept and message from the others. Respond with *only* the number (1, 2, 3, or 4). Do not provide any explanation or other text.
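
A minimal sketch of how each question can be posed to a model through the OpenRouter chat completions endpoint (which is OpenAI-compatible); the model slug is a placeholder and ask_outlier is a hypothetical helper for illustration, not the repository's actual script:

import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

SYSTEM_PROMPT = (
    "You are an AI assistant analyzing Persian poetry couplets. "
    "Identify the outlier based on concept/message. Respond *only* with the "
    "single digit number (1, 2, 3, or 4) of the outlier option. Output nothing else."
)

def ask_outlier(question_text, model="openai/gpt-4o"):
    """Send one outlier question and return the model's raw answer string."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "temperature": 0,  # we want a deterministic single-digit answer
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question_text},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"].strip()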

Results

Here’s how the models performed, ranked by accuracy (percentage of correct outlier identifications out of 42 questions):

[Bar chart: zero-shot accuracy by model]

Few-Shot Prompting

The models received the following system prompt:

You are an expert literary critic with a deep understanding of Persian poetry, its cultural nuances, and its stylistic features. Your task is to analyze a set of poetic options—each option presenting two parts of a couplet—and identify the one option that deviates in conceptual meaning or thematic message from the others. Focus exclusively on the underlying concepts, disregarding stylistic or linguistic differences.

For example:

---
Options:
1. طریق عشق پرآشوب و فتنه است ای دل - بیفتد آن که در این راه با شتاب رود
2. گر نور عشق حق به دل و جانت اوفتد - بالله از آفتاب فلک خوبتر شوی
3. شکوه عشق نگه کن که موی مجنون را - فلک به شعشعه آفتاب، شانه کند
4. فرزانه درآید به پری خانه مقصود - هر کس که در این بادیه دیوانه عشق است

Correct answer: 1

(Option 1 warns against hastily pursuing the turbulent path of love, whereas the other options present love as an uplifting force)

---
Options:
1. شمشیر نیک از آهن بد چون کند کسی؟ - ناکس تربیت نشود ای حکیم کس
2. سگ به دریای هفت گانه بشوی - که چو تر شد پلیدتر باشد
3. ز وحشی نیاید که مردم شود - به سعی اندر او تربیت گم شود
4. سگ اصحاب کهف روزی چند - پی نیکان گرفت و مردم شد

Correct answer: 4

(Option 4 emphasizes the significant impact of upbringing, unlike the other options which imply that upbringing makes little difference)

---
Options:
1. هر چند خوشگوار بود باده غرور - زین می فزون از سنگ نگه دار شیشه را
2. از ساده دلی هر که دهد پند به مغرور - بیدار به افسانه کند خواب گران را
3. کبر مفروش به مردم که به میزان نظر - زود گردد سبک آن کس که بود سنگین تر
4. خاک بر فرقش اگر از کبر سر بالا کند - هر که داند بازگشت او به غیر از خاک نیست

Correct answer: 2

(The meaning of option 2 is the ineffectiveness of giving advice to the arrogant, while the common meaning of the other options is the recommendation to avoid arrogance)

Sample Question

Analyze the conceptual meaning of the following options:

Options:
1. چو در وقت بهار آیی پدیدار - حقیقت، پرده برداری ز رخسار
2. فروغ رویت اندازی سوی خاک - عجایب نقشها سازی سوی خاک
3. از جلوه وجود تو ظلمت سرای خاک - روشن تر از جمال بتان طراز باد
4. جلوه‌ای کرد رُخت دید ملک عشق نداشت - عین آتش شد از آن غیرت و بر آدم زد

Instruction: Identify the *single* option (by its number) that has a different concept and message from the others. Respond with *only* the number (1, 2, 3, or 4). Do not provide any explanation or other text.

Results

Here’s how the models performed, ranked by accuracy (percentage of correct outlier identifications out of 39 questions):

[Bar chart: few-shot accuracy by model]


Experiment #2: Benchmarking Embedding Models

In this experiment, we evaluated the performance of several multilingual embedding models by testing their ability to identify the outlier in a set of 42 multiple-choice questions, each containing 4 options.

For each question, we generated embeddings for the four choices and calculated the cosine similarity between each option and the centroid of the remaining three. The option with the lowest similarity to the rest was predicted as the outlier.

Here's the core logic used for outlier detection:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_outlier_index(embeddings):
    """
    Predicts the index of the outlier embedding as the one least similar to the others.

    Args:
        embeddings: A list or numpy array containing 4 embedding vectors.

    Returns:
        Index (0-3) of the predicted outlier, or -1 if input is invalid.
    """
    if not isinstance(embeddings, np.ndarray):
        embeddings = np.array(embeddings)

    if len(embeddings) != 4 or embeddings.ndim != 2:
        print(f"Warning: Invalid input shape {embeddings.shape}.")
        return -1

    similarity_scores = []

    for i in range(4):
        # Similarity of option i to the centroid of the other three options.
        others = np.delete(embeddings, i, axis=0)
        centroid = np.mean(others, axis=0, keepdims=True)
        current = embeddings[i].reshape(1, -1)
        similarity = cosine_similarity(current, centroid)[0][0]
        similarity_scores.append(similarity)

    # The least similar option is the predicted outlier.
    return int(np.argmin(similarity_scores))

We then evaluated the models based on their accuracy in selecting the correct outlier. For reference, random guessing yields a baseline accuracy of 25%.
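
As an illustration, overall accuracy can then be computed as follows (a sketch; questions is a hypothetical list pairing each question's four embeddings with the zero-based index of the correct outlier):

def evaluate_accuracy(questions):
    """Fraction of questions where the predicted outlier matches the answer key."""
    correct = sum(
        find_outlier_index(embeddings) == answer
        for embeddings, answer in questions
    )
    return correct / len(questions)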

[Bar chart: embedding model accuracy]


Experiment #3: OpenRouter Embedding Models Benchmark

This experiment evaluates embedding models available through the OpenRouter API on the same outlier detection task. Unlike Experiment 2, which used locally hosted models, this experiment tests cloud-based embeddings via API calls.

The outlier detection algorithm is identical to Experiment 2 (centroid-based cosine similarity). Each model generates embeddings for the 4 poetry options, and the option with the lowest similarity to the centroid of the other three is predicted as the outlier.
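
A sketch of fetching the four option embeddings, assuming OpenRouter exposes an OpenAI-compatible embeddings endpoint (the URL, model slug, and embed_options helper are assumptions for illustration, not the repository's actual code):

import os
import numpy as np
import requests

def embed_options(options, model="openai/text-embedding-3-large"):
    """Embed the four poetry options; returns a (4, dim) numpy array."""
    response = requests.post(
        "https://openrouter.ai/api/v1/embeddings",  # assumed OpenAI-compatible endpoint
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "input": options},
        timeout=60,
    )
    response.raise_for_status()
    return np.array([item["embedding"] for item in response.json()["data"]])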

Results (ranked by accuracy):

| Model | Accuracy (%) | Correct | Total |
|---|---|---|---|
| google/gemini-embedding-001 | 39.02 | 16 | 41 |
| openai/text-embedding-3-large | 36.59 | 15 | 41 |
| mistralai/codestral-embed-2505 | 32.50 | 13 | 40 |
| qwen/qwen3-embedding-0.6b | 29.27 | 12 | 41 |
| qwen/qwen3-embedding-8b | 29.27 | 12 | 41 |
| openai/text-embedding-3-small | 26.83 | 11 | 41 |
| openai/text-embedding-ada-002 | 24.39 | 10 | 41 |
| qwen/qwen3-embedding-4b | 21.95 | 9 | 41 |
| mistralai/mistral-embed-2312 | 14.63 | 6 | 41 |

Experiment #4: LLM-Generated Explanations

This experiment generates interpretations of Persian poetry couplets using multiple large language models via OpenRouter. For each poetry option in the dataset, 13 different LLMs provide explanations of the couplet's meaning.

The output is an enriched dataset containing both the original poetry and machine-generated explanations from each model, enabling analysis of how different LLMs interpret Persian literature.

Example

For each poetry couplet, we ask 13 LLMs to interpret its meaning:

Input Poetry:

طریق عشق پرآشوب و فتنه است ای دل - بیفتد آن که در این راه با شتاب رود

Sample LLM Explanations:

  • GPT: "This couplet warns that the path of love is turbulent and fraught with trials..."
  • Claude: "راه عشق پر از آشوب و فتنه است و کسی که با شتاب در این راه حرکت کند، دچار مشکل می‌شود..." (in Persian: "The path of love is full of turmoil and strife, and whoever rushes along this path runs into trouble...")
  • DeepSeek: (includes reasoning) "The verse cautions against rushing into love..."

The script processes all 42 questions × 4 options = 168 couplets, generating explanations from 13 models concurrently.
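
One plausible way to fan this work out concurrently is a thread pool over (model, couplet) pairs; in this sketch, ask_model is a hypothetical chat completion helper (like ask_outlier above, but with an explanation prompt), and the models/couplets inputs are placeholders:

from concurrent.futures import ThreadPoolExecutor, as_completed

def explain_all(couplets, models, max_workers=8):
    """Collect one explanation per (model, couplet) pair, concurrently."""
    prompt = "Explain the meaning of this Persian couplet: {}"
    results = {}  # (model, couplet) -> explanation text
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(ask_model, prompt.format(c), model=m): (m, c)
            for m in models
            for c in couplets
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results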


Experiment #5: Comprehensive Embedding Benchmark on Explanations

This experiment creates a 2D matrix to evaluate:

  • 9 embedding models (from exp-3)
  • 13 LLM explanation sources (from exp-4)

This produces 117 combinations to determine which embedding models work best with which LLM-generated explanations.

Research Questions

  1. Which embedding models perform best overall on explanations?
  2. Which LLM's explanations are most effective for semantic similarity tasks?
  3. Are there specific embedding+LLM combinations that excel?
  4. Do explanations improve performance compared to raw poetry?

For each combination, the benchmark (see the sketch after this list):

  1. Loads explanations from a specific LLM
  2. Embeds those explanations using a specific embedding model
  3. Applies outlier detection (same algorithm as Experiment 2)
  4. Calculates accuracy
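
A sketch of that loop, assuming hypothetical helpers load_explanations(llm) (reading the exp-4 output) and embed_texts(texts, model) (an embeddings call like the one in Experiment #3):

def run_matrix(embedding_models, llm_sources, answer_key):
    """Accuracy for every (embedding model, explanation source) combination."""
    results = {}
    for llm in llm_sources:
        # Four explanation texts per question, all generated by this LLM.
        questions = load_explanations(llm)
        for emb_model in embedding_models:
            correct = sum(
                find_outlier_index(embed_texts(texts, model=emb_model)) == answer
                for texts, answer in zip(questions, answer_key)
            )
            results[(emb_model, llm)] = correct / len(answer_key)
    return results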

Results are aggregated by embedding model and by explanation source, revealing which combinations work best for Persian poetry semantic similarity.

Comparison with Experiment 3


Experiment #6: Large-Scale LLM Benchmark on Gherabat Dataset

This experiment scales up Experiment #1, using a much larger dataset and an improved methodology to evaluate 30 LLMs on Persian poetry semantic similarity tasks.

Dataset and Scale

  • Questions: 591 poetry outlier detection questions (vs 42 in exp-1)
  • Source: Gherabat book dataset with comprehensive answer keys
  • Models: 30 LLMs tested via OpenRouter API

Results

The experiment tested models across multiple families including GPT, Gemini, Claude, DeepSeek, Qwen, LLaMA, and others. Google Gemini 2.5 Pro achieved the highest accuracy at 70%.

Model Accuracy Comparison

The analysis includes:

  • Model performance comparison across 30 LLMs
  • Question difficulty analysis
  • Model agreement and disagreement patterns
  • Semantic error pattern analysis
  • Statistical significance testing
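
For instance, question difficulty and pairwise model agreement can both be read off a per-model answer matrix (a sketch; answers is a hypothetical dict mapping each model name to its list of predicted option numbers, and key is the answer key):

from itertools import combinations

import numpy as np

def difficulty_and_agreement(answers, key):
    """Per-question error rate across models, plus pairwise agreement rates."""
    models = list(answers)
    preds = np.array([answers[m] for m in models])      # (n_models, n_questions)
    difficulty = (preds != np.array(key)).mean(axis=0)  # fraction of models wrong
    agreement = {
        (a, b): float((preds[i] == preds[j]).mean())
        for (i, a), (j, b) in combinations(enumerate(models), 2)
    }
    return difficulty, agreement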

Citation

If you use this benchmark or dataset in your research, please cite this repository:

@misc{jafarnezhad_persian_poetry_2025,
  author       = {Arman Jafarnezhad},
  title        = {Persian Poetry Semantic Similarity Benchmark},
  year         = {2025},
  version      = {1.0.0},
  url          = {https://github.com/ArmanJR/Persian-Poetry-Semantic-Similarity-Benchmark},
  note         = {Dataset and experiments repository},
}

Alternatively, you can click the "Cite this repository" button on the right sidebar of the GitHub page to get citation formats in BibTeX or APA.
