EquiME is a large-scale synthetic dataset for micro-expression (ME) analysis, generated with an image-to-video diffusion model. By leveraging a structured causal modeling approach, we employ Facial Action Units (AUs) as intermediate representations that drive the generation of realistic ME sequences. This repository provides a streamlined pipeline for generating synthetic micro-expression datasets, designed to be accessible to users without a computer science background.
- Dataset Overview
- Technical Specifications
- Dataset Organization
- Methodology
- Repository Structure
- System Requirements
- Installation
- Usage Guidelines
- Evaluation Protocols
- Ethical Considerations
- Licensing
- Access Policy
- Citation
- Acknowledgments
- Contact Information
The ME-VLM dataset addresses critical limitations in micro-expression research by providing a large-scale, controlled synthetic dataset. Micro-expressions, characterized by brief and subtle facial movements lasting 1/25 to 1/5 of a second, are notoriously difficult to capture and annotate in real-world scenarios. Our synthetic approach enables:
- Scalability: Generation of large-scale datasets with consistent quality
- Controllability: Precise manipulation of emotional expressions and demographic attributes
- Reproducibility: Standardized evaluation protocols for fair model comparison
- Ethical Compliance: Elimination of privacy concerns associated with real human subjects
- Scale: 75,000 high-resolution video sequences
- Diversity: Comprehensive coverage of five primary emotional categories
- Quality: 512×512 pixel resolution with 24 FPS temporal consistency
- Annotation: Rich metadata including demographic attributes and Action Unit annotations
- Multimodality: Synchronized video-text pairs for vision-language model training
| Parameter | Specification |
|---|---|
| Total Samples | 75,000 video sequences |
| Video Format | MP4 (H.264 codec) |
| Resolution | 512 × 512 pixels |
| Frame Rate | 24 FPS |
| Duration | 6.0 seconds per clip |
| Total Frames | 144 frames per sequence |
| Emotion Classes | 5 categories (happiness, sadness, surprise, fear, anger) |
| Annotation Format | JSON metadata files |
| Source Images | CelebA-HQ dataset |
| Generation Model | LTX-Video diffusion model |
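These specifications can be sanity-checked on any downloaded clip with a quick OpenCV probe. A minimal sketch, assuming OpenCV is installed (it is not a stated dependency of this repository) and using an illustrative file name:

```python
# Sketch: verify that a clip matches the specifications in the table above.
import cv2

cap = cv2.VideoCapture("happiness/f_00123_001.mp4")  # hypothetical file name
fps = cap.get(cv2.CAP_PROP_FPS)                      # expected: 24.0
frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))      # expected: 144
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))       # expected: 512
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))     # expected: 512
cap.release()

print(f"{fps} FPS, {frames} frames, {width}x{height}, {frames / fps:.1f} s")
```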
The dataset follows a hierarchical structure optimized for research applications:
```
ME-VLM/
├── emotion_categories/
│   ├── happiness/
│   │   ├── [gender]_[subject_id]_[sequence].mp4
│   │   └── metadata/
│   │       └── [video_name]_metadata.json
│   ├── sadness/
│   ├── surprise/
│   ├── fear/
│   └── anger/
```
Where:
- `gender`: Demographic identifier (m/f)
- `subject_id`: Unique subject identifier from the source dataset
- `sequence`: Sequential identifier for multiple generations per subject
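Under these conventions, the tree can be walked and file names parsed programmatically. A minimal sketch, with an illustrative root path:

```python
# Sketch: iterate over the dataset layout and parse the naming convention.
from pathlib import Path

root = Path("ME-VLM/emotion_categories")  # adjust to your extraction path
for video in sorted(root.glob("*/*.mp4")):
    emotion = video.parent.name                        # e.g. "happiness"
    gender, subject_id, sequence = video.stem.split("_")
    metadata = video.parent / "metadata" / f"{video.stem}_metadata.json"
    print(emotion, gender, subject_id, sequence, metadata.exists())
```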
Each video is accompanied by comprehensive metadata including:
- Demographic Attributes: Gender, estimated age, ethnicity (via DeepFace analysis)
- Technical Metadata: File size, bitrate, duration, frame count
- Generation Parameters: Model configuration, prompt templates, random seeds
- Quality Metrics: PSNR, SSIM, perceptual quality scores
- Generation Prompt: The text prompt used for synthesis, encoding facial muscle movement indicators
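For illustration, one metadata record might look like the following. This is a minimal sketch: every field name and value here is hypothetical, so consult an actual `*_metadata.json` file for the authoritative schema.

```python
# Hypothetical shape of one metadata record (all keys illustrative).
import json

example = {
    "demographics": {"gender": "f", "estimated_age": 29, "ethnicity": "asian"},
    "technical": {"file_size_bytes": 1843200, "bitrate_kbps": 2400,
                  "duration_s": 6.0, "frame_count": 144},
    "generation": {"model": "LTX-Video", "prompt_template": "happiness",
                   "seed": 42},
    "quality": {"psnr": 34.2, "ssim": 0.91},
    "prompt": "A professional static headshot showing subtle happiness ...",
}
print(json.dumps(example, indent=2))
```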
- Source Data Curation: High-quality facial images selected from CelebA-HQ dataset based on quality metrics and demographic diversity
- Prompt Engineering: Emotion-specific prompts designed to elicit target micro-expressions
- Video Synthesis: LTX-Video model generates temporally consistent sequences
- Quality Assessment: Automated filtering based on perceptual quality metrics
- Attribute Extraction: DeepFace analysis for demographic and emotional attributes
- Validation: Human expert review of subset for quality assurance
Emotion-specific prompts were carefully crafted to ensure consistent and realistic micro-expression generation:
- Happiness: "A professional static headshot showing subtle happiness micro-expression transitioning to neutral..."
- Sadness: "A professional static headshot showing subtle sadness micro-expression with slight lip corners turning down..."
- Surprise: "A professional static headshot showing brief surprise micro-expression with raised eyebrows..."
- Fear: "A professional static headshot showing subtle fear micro-expression with widened eyes..."
- Anger: "A professional static headshot showing controlled anger micro-expression with tightened facial muscles..."
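In code, these templates map naturally onto a lookup table keyed by emotion category. A sketch, with wording abridged to the excerpts above (the full templates live in the generation scripts):

```python
# Sketch: emotion-specific prompt templates keyed by category.
PROMPT_TEMPLATES = {
    "happiness": "A professional static headshot showing subtle happiness "
                 "micro-expression transitioning to neutral...",
    "sadness": "A professional static headshot showing subtle sadness "
               "micro-expression with slight lip corners turning down...",
    "surprise": "A professional static headshot showing brief surprise "
                "micro-expression with raised eyebrows...",
    "fear": "A professional static headshot showing subtle fear "
            "micro-expression with widened eyes...",
    "anger": "A professional static headshot showing controlled anger "
             "micro-expression with tightened facial muscles...",
}

prompt = PROMPT_TEMPLATES["happiness"]
```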
Purpose: Complete implementation of the synthetic data generation workflow
Components:
- `inference_cropped.py`: Core inference script implementing the LTX-Video generation pipeline
  - Handles image preprocessing, model inference, and post-processing
  - Supports batch processing for large-scale dataset creation
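For orientation, the generation step might look roughly like the following via the Hugging Face diffusers port of LTX-Video. This is a hedged sketch, not the exact code in `inference_cropped.py`: the class and parameter choices are assumptions based on the public diffusers API, and LTX-Video variants typically require `num_frames ≡ 1 (mod 8)`.

```python
# Sketch: image-to-video generation with the diffusers LTX-Video pipeline.
# Assumes diffusers >= 0.32 with LTXImageToVideoPipeline available.
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("source_face.png")  # illustrative CelebA-HQ crop
prompt = ("A professional static headshot showing subtle happiness "
          "micro-expression transitioning to neutral...")

frames = pipe(
    image=image,
    prompt=prompt,
    width=512,
    height=512,
    num_frames=145,          # 18*8 + 1; trimmed to 144 frames below
    num_inference_steps=50,
).frames[0]
export_to_video(frames[:144], "happiness_sample.mp4", fps=24)
```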
Purpose: Comprehensive evaluation protocols using established computer vision metrics
Components:
- `video_quality_metrics.py`: Implementation of standard video quality assessment metrics
- Evaluation datasets and benchmark results:
  - `summary_metrics_generated5k_mmewAU.csv`: Results on a synthetic dataset subset
  - `summary_metrics_miex.csv`: MIEX dataset evaluation
  - `summary_metrics_samm.csv`: SAMM dataset evaluation
  - `summary_metrics_smic.csv`: SMIC dataset evaluation
  - `video_metrics_hq_emotion.csv`: High-quality emotion-specific metrics
Supported Metrics:
- Peak Signal-to-Noise Ratio (PSNR)
- Structural Similarity Index (SSIM)
- Total Variation (TV)
- Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE)
- CLIP-based Image Quality Assessment (CLIP-IQA)
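These metrics can all be computed with the PIQ toolkit acknowledged below. A minimal sketch on dummy tensors (values in `[0, 1]`; shapes are illustrative):

```python
# Sketch: full-reference and no-reference quality metrics via piq.
import torch
import piq

x = torch.rand(1, 3, 512, 512)  # generated frame
y = torch.rand(1, 3, 512, 512)  # reference frame

print("PSNR:    ", piq.psnr(x, y, data_range=1.0).item())
print("SSIM:    ", piq.ssim(x, y, data_range=1.0).item())
print("TV:      ", piq.total_variation(x).item())
print("BRISQUE: ", piq.brisque(x, data_range=1.0).item())
print("CLIP-IQA:", piq.CLIPIQA()(x).item())
```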
Purpose: Reference implementations and benchmark models for comparative evaluation
Training Scripts:
- `train_mevlm.py`: ME-VLM model training with 3-class emotion classification
- `train_miex.py`: MIEX dataset baseline implementation
- `train_real.py`: Real-data baseline for comparison studies
Configuration Management:
- `config/config_3class_mobilenet.json`: Lightweight MobileNet architecture configuration
- `config/config_original.json`: Original baseline model parameters
Model Architectures:
- 3D Convolutional Neural Networks for spatiotemporal feature extraction
- ResNet-based architectures with residual connections
- MobileNet variants for efficient inference
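As a concrete reference point, a minimal 3D-CNN classifier of this kind is sketched below; the layer sizes are illustrative, not the repository's exact architecture:

```python
# Sketch: a minimal 3D CNN for spatiotemporal emotion classification.
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, num_classes: int = 3):  # 3 classes, as in train_mevlm.py
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling over time and space
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.classifier(self.features(x).flatten(1))

logits = Simple3DCNN()(torch.rand(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 3])
```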
Purpose: Demographic and facial attribute analysis tools
Components:
- `deepface_only_analysis.py`: Comprehensive facial analysis using the DeepFace framework
- Supports multi-attribute extraction: age, gender, ethnicity, emotional state
- Batch processing capabilities for large-scale analysis
- Configurable backends for optimal performance
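A minimal sketch of the underlying DeepFace call (the file name is illustrative, and note that DeepFace exposes ethnicity under the `race` action):

```python
# Sketch: per-image attribute extraction with DeepFace.
from deepface import DeepFace

results = DeepFace.analyze(
    img_path="frame_000.png",
    actions=["age", "gender", "race", "emotion"],
    enforce_detection=False,  # frames are already face-cropped
)
for face in results:  # one dict per detected face
    print(face["age"], face["dominant_gender"],
          face["dominant_race"], face["dominant_emotion"])
```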
Purpose: Web-based documentation and visualization resources
Structure:
- `css/`: Stylesheet definitions for the web interface
- `images/`: Visual documentation assets
- `js/`: Interactive components and data visualization
- `pdfs/`: Comprehensive documentation in portable format
- `videos/`: Representative sample videos and demonstrations
- GPU: NVIDIA GPU with ≥8GB VRAM (A100 or equivalent)
- RAM: ≥32GB system memory
- Storage: ≥500GB available space for full dataset
- CPU: Multi-core processor (≥8 cores recommended)
- Python: 3.8 or higher
- CUDA: 11.0 or compatible version
- Operating System: Linux (Ubuntu 20.04+), Windows 10+, or macOS 11+
- Clone the repository:

```bash
git clone https://github.com/[kirito-blade]/me-vlm.git
cd me-vlm
```

- Create a virtual environment:

```bash
python -m venv me_vlm_env
source me_vlm_env/bin/activate  # Linux/macOS
# or
me_vlm_env\Scripts\activate     # Windows
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Verify the CUDA installation:

```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```

The dataset is available for academic research purposes following access approval. Researchers must:
- Submit formal access request to corresponding author
- Provide institutional verification
- Agree to ethical usage terms
- Cite the work in resulting publications
- All source images are from publicly available, ethically-collected datasets
- No personally identifiable information is retained
- Synthetic generation ensures no direct correspondence to real individuals
Researchers utilizing this dataset must:
- Adhere to institutional ethics guidelines
- Consider potential societal impacts of research applications
- Avoid applications that could cause harm or perpetuate discrimination
This dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
- ✅ Share and redistribute in any medium or format
- ✅ Adapt, remix, transform, and build upon the material
- ✅ Use for academic research and educational purposes
- ❌ Commercial use prohibited
- ❌ No warranty or liability provided
- ❌ Additional restrictions cannot be imposed
- 📝 Appropriate attribution must be provided
- 🔗 License terms must be indicated
- 📋 Changes to the material must be documented
Access is restricted to:
- Accredited academic institutions
- Recognized research organizations
- Graduate students under faculty supervision
- Industry researchers for non-commercial purposes
- Submit Request: Email corresponding author with detailed research proposal
- Institutional Verification: Provide official institutional affiliation
- Research Statement: Describe intended use and expected outcomes
- Agreement Acknowledgment: Confirm adherence to usage terms
- Access Approval: Receive download credentials upon approval
- Institutional affiliation and verification
- Research project description
When using ME-VLM in your research, please cite our work:
```bibtex
@article{tan2025me-vlm,
  title={ME-VLM: A Visual-Language Model for Micro-expression Synthesis and Classification},
  author={Tan, Pei-Sze and Tan, Yee-Fan and Rajanala, Sailaja and Phan, Raphael C.-W. and Ong, Huey-Fang},
  journal={arXiv preprint arXiv:2025.xxxxx},
  year={2025},
  doi={10.48550/arXiv.2025.xxxxx},
  url={https://arxiv.org/abs/2025.xxxxx}
}
```

We gratefully acknowledge:
- The Chinese University of Hong Kong for the CelebA-HQ dataset
- Lightricks Ltd. for the LTX-Video generative model
- The PIQ Development Team for quality assessment tools
Pei-Sze Tan
School of Information Technology
Monash University Malaysia
Email: tan.peisze@monash.edu
For technical issues or dataset access problems:
- Create an issue on this repo
- Include system specifications and error logs
- Follow the issue template for faster resolution
Last Updated: May 2025
Version: 1.0.0
DOI: [To be assigned upon publication]