This repository contains accompanying code for the CFA Institute's Research and Policy Center 'Synthetic Data in Investment Management' report.

CFA-Institute-RPC/Synthetic-Data-For-Finance

Synthetic-Data-For-Finance

This repository complements the CFA Institute Research and Policy Center report Synthetic Data in Investment Management. It aims to serve as a central hub for generative AI (genAI) approaches to synthetic data generation and their applications in finance. To aid practitioners, it provides a regularly updated, curated list of libraries, papers, and case studies on synthetic data generation.

📘 Contents

  • Overview
  • Libraries
  • Case Studies
  • Papers
  • Contribute

🧠 Overview

Synthetic data is artificially generated data designed to resemble real data. It can be used to address data-related challenges such as:

  • Lack of historical data
  • Privacy and compliance concerns around data-sharing
  • Overfitting in backtesting and model training
  • Imbalanced datasets

This repository focuses on the following genAI approaches to synthetic data generation:

  • Variational Autoencoders (VAEs)
  • Generative Adversarial Networks (GANs)
  • Diffusion models
  • Large Language Models (LLMs)

These methods are more flexible than traditional statistical methodologies and can model a range of data types, from text to time series and tabular data. As a result, synthetic data has a wide range of use cases within the industry, from enhanced risk modelling and portfolio optimization to forecasting and sentiment analysis.
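
For contrast with the genAI methods listed above, here is a minimal sketch of a traditional statistical approach: fit a normal distribution to a return series and sample synthetic returns from it. The return values, function names, and sample size are illustrative assumptions, not material from the report or this repository.

```python
import random
import statistics


def fit_gaussian(returns):
    """Estimate the mean and sample standard deviation of a return series."""
    return statistics.mean(returns), statistics.stdev(returns)


def sample_synthetic(mu, sigma, n, seed=0):
    """Draw n synthetic returns from the fitted normal distribution."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical daily returns, for illustration only (not real market data).
real_returns = [0.012, -0.008, 0.003, -0.021, 0.015, 0.001, -0.004, 0.009]

mu, sigma = fit_gaussian(real_returns)
synthetic_returns = sample_synthetic(mu, sigma, n=1000)

# Compare the first two moments of the real and synthetic series.
print(f"real:      mean={mu:.4f}, stdev={sigma:.4f}")
print(f"synthetic: mean={statistics.mean(synthetic_returns):.4f}, "
      f"stdev={statistics.stdev(synthetic_returns):.4f}")
```

A baseline like this matches only the mean and variance of the input. It does not capture fat tails or dependence over time, and it does not extend to text or mixed tabular data, which is the kind of gap the more flexible generative models are meant to address.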


🛠️ Libraries


📁 Case Studies

See /LLM for an example using synthetic data to improve the performance of a fine-tuned small LLM (Qwen3-0.6B) for financial sentiment classification.
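
As a rough illustration of the general idea, and not the code in /LLM, the sketch below tops up under-represented sentiment classes in a small training set with synthetic examples. All sentences, labels, and the `augment_to_balance` helper are hypothetical; in practice the synthetic pool would typically come from a generative model rather than being hard-coded.

```python
import random
from collections import Counter


def augment_to_balance(real, synthetic, seed=0):
    """Add synthetic examples to under-represented classes until each class
    matches the largest real class, or the synthetic pool runs out."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in real)
    target = max(counts.values())
    augmented = list(real)
    for label, count in counts.items():
        pool = [ex for ex in synthetic if ex[1] == label]
        rng.shuffle(pool)
        augmented.extend(pool[: target - count])  # empty slice if already balanced
    return augmented


# Hypothetical labelled sentences, for illustration only.
real = [
    ("Revenue grew 12% year over year.", "positive"),
    ("The company raised its full-year guidance.", "positive"),
    ("Margins expanded on lower input costs.", "positive"),
    ("Shares fell after the profit warning.", "negative"),
    ("The board met on Tuesday.", "neutral"),
]
synthetic = [
    ("The firm cut its dividend amid weak demand.", "negative"),
    ("Losses widened in the third quarter.", "negative"),
    ("The issuer was downgraded to junk.", "negative"),
    ("The annual report was filed on schedule.", "neutral"),
    ("The CEO will speak at an industry conference.", "neutral"),
]

balanced = augment_to_balance(real, synthetic)
print(Counter(label for _, label in balanced))
```

With these inputs, each of the three classes ends up with three examples. The balanced set could then be passed to whatever fine-tuning pipeline is in use.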

📚 Papers

Variational Autoencoders

| Paper | Release Date | Type of Data Modeled | Codebase |
|---|---|---|---|
| An Overview of Variational Autoencoders for Source Separation, Finance, and Bio-Signal Applications | 2021 | N/A | No official repo |
| TimeVAE: A Variational Auto-Encoder for Multivariate Time Series Generation | 2021 | Time Series | GitHub |
| Variational Autoencoders: A Hands-Off Approach to Volatility | 2021 | Implied Volatility | N/A |

Generative Adversarial Networks

| Paper | Release Date | Type of Data Modeled | Codebase |
|---|---|---|---|
| SeriesGAN: Time Series Generation via Adversarial and Autoregressive Learning | 2024 | Time Series | GitHub |
| Time-series Generative Adversarial Networks | 2019 | Time Series | GitHub |
| Simulating Asset Prices using Conditional Time-Series GAN | 2024 | Time Series | GitHub |
| CorrGAN: Sampling Realistic Financial Correlation Matrices Using Generative Adversarial Networks | 2019 | Financial Correlation Matrices | No official repo |
| cCorrGAN: Conditional Correlation GAN for Learning Empirical Conditional Distributions in the Elliptope | 2021 | Financial Correlation Matrices | No official repo |
| Conditional Sig-Wasserstein GANs for Time Series Generation | 2020 | Time Series | GitHub |
| Deep Hedging: Learning to Simulate Equity Option Markets | 2019 | Equity Options | No official repo |
| GANs and synthetic financial data: calculating VaR | 2024 | Time Series | No official repo |
| A Modified CTGAN-Plus-Features Based Method for Optimal Asset Allocation | 2023 | Time Series | No official repo |
| Autoencoding Conditional GAN for Portfolio Allocation Diversification | 2022 | Time Series | No official repo |
| Data Synthesis based on Generative Adversarial Networks | 2018 | Tabular | GitHub |
| Financial Thought Experiment: A GAN-based Approach to Vast Robust Portfolio Selection | 2021 | Time Series | No official repo |
| Improved Data Generation for Enhanced Asset Allocation: A Synthetic Dataset Approach for the Fixed Income Universe | 2023 | Financial Correlation Matrices | No official repo |
| MTSS-GAN: Multivariate Time Series Simulation Generative Adversarial Networks | 2020 | Time Series | GitHub |
| PAGAN: Portfolio Analysis with Generative Adversarial Networks | 2019 | Time Series | No official repo |
| Quant GANs: Deep Generation of Financial Time Series | 2019 | Time Series | No official repo |
| Tail-GAN: Learning to Simulate Tail Risk Scenarios | 2022 | Time Series | GitHub |
| Time Series Simulation by Conditional Generative Adversarial Net | 2019 | Time Series | No official repo |

Diffusion models

| Paper | Release Date | Type of Data Modeled | Codebase |
|---|---|---|---|
| Denoising Diffusion Probabilistic Model for Realistic Financial Correlation Matrices | 2024 | Financial Correlation Matrices | GitHub |
| FinDiff: Diffusion Models for Financial Tabular Data Generation | 2023 | Tabular | GitHub |
| High-Resolution Image Synthesis with Latent Diffusion Models | 2021 | Image | GitHub |

Large Language Models

| Paper | Release Date | Type of Data Modeled | Codebase |
|---|---|---|---|
| AugGPT: Leveraging ChatGPT for Text Data Augmentation | 2023 | Text | GitHub |
| Data Augmentation using LLMs: Data Perspectives, Learning Paradigms and Challenges | 2024 | N/A | No official repo |
| FinLLMs: A Framework for Financial Reasoning Dataset Generation with Large Language Models | 2024 | Text | No official repo |
| Simulating Financial Market via Large Language Model based Agents | 2024 | Time Series | No official repo |

📣 Contribute

Feel free to contribute if you’d like to add a new paper, case study or tool.
