
Cython-accelerated MDL histogram density estimation. Implements Kontkanen & Myllymaki's dynamic programming algorithm for optimal variable-width bins, parametric complexity with Ramanujan approximations, and automatic bin count selection. Based on "MDL Histogram Density Estimation" (JMLR 2007).

MrTarantoga/MDL-Density-Histogram

MDL Optimal Histogram Density Estimation

This package provides a Cython-accelerated implementation of the Minimum Description Length (MDL) optimal histogram density estimation algorithm from Kontkanen & Myllymäki (2007). It applies information-theoretic model selection to choose both the number of bins and their variable-width boundaries for density estimation.

[Figure: Freedman-Diaconis vs. MDL-Optimization]

Features

  • MDL Principle: Uses stochastic complexity for model selection
  • Dynamic Programming: Efficient O(E²·K_max) optimization, where E is the number of candidate cut points; parametric complexity values are cached to speed up the computation
  • Score for each K: The score for each candidate bin count K is returned, so different bin counts can be compared on the same dataset
  • Variable-Width Bins: Adapts to data density variations
  • Automatic Bin Count: No manual parameter tuning required beyond the maximum bin count to consider, $K_{max}$, and the data resolution $\epsilon$
  • Cython Acceleration: Critical operations compiled to C

Installation

You can install the package using pip:

pip install MDL-Density-Histogram

Alternatively, you can install it from source by cloning the repository and running:

# From project root directory
pip install .

Requires:

  • Python 3.11+
  • NumPy
  • Cython
  • C compiler (GCC/Clang/MSVC)

Usage Example

import numpy as np
from mdl_density_hist import mdl_optimal_histogram

# Generate sample data
data = np.random.normal(0, 1, 1000)

# Compute optimal histogram
cut_points, K_scores = mdl_optimal_histogram(data, epsilon=0.1)

# Print the score for each candidate bin count K
print(f"K_scores: {K_scores}")

# Visualize result
import matplotlib.pyplot as plt
plt.hist(data, bins=cut_points, density=True)
plt.title('MDL Optimal Histogram')
plt.show()
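
To make the `density=True` step above concrete, the following is a minimal NumPy-only sketch of how bar heights are obtained for variable-width bins, next to a Freedman-Diaconis baseline from `np.histogram_bin_edges`. The helper `variable_width_density` and the hand-picked `example_edges` are hypothetical and only stand in for the `cut_points` returned by the package; the sketch assumes the edges include the outer boundaries, as the `plt.hist` call above suggests.

```python
import numpy as np

def variable_width_density(data, edges):
    """Bar heights for a histogram with arbitrary (variable-width) bin edges.

    Each height is count / (total_count * bin_width), so the bars integrate to 1.
    """
    data = np.asarray(data, dtype=float)
    edges = np.asarray(edges, dtype=float)
    counts, _ = np.histogram(data, bins=edges)
    widths = np.diff(edges)
    return counts / (counts.sum() * widths)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 1000)

# Equal-width baseline: Freedman-Diaconis rule as implemented in NumPy.
fd_edges = np.histogram_bin_edges(data, bins="fd")
fd_heights = variable_width_density(data, fd_edges)

# Hypothetical variable-width edges, standing in for the package's cut_points.
example_edges = np.array([data.min(), -1.5, -0.5, 0.5, 1.5, data.max()])
example_heights = variable_width_density(data, example_edges)

print(f"Freedman-Diaconis bins: {len(fd_edges) - 1}")
print(f"Variable-width bins:    {len(example_edges) - 1}")
```

Because the heights are normalized by bin width, wide bins in sparse regions and narrow bins in dense regions remain directly comparable as density values.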

Parameters

  • data: Input array (1D numpy array)
  • epsilon: Quantization precision (default: 0.1)
  • K_max: Maximum number of bins (default: 10)
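
As an illustration of what `epsilon` controls, here is a minimal sketch that snaps data to a grid of width ε and builds a candidate cut point set half a grid step on either side of each observed value, dropping the two outermost points (which would serve as the histogram's outer boundaries). This follows the convention described in the paper as an assumption; the function `candidate_cut_points` is hypothetical and the package's internal candidate set may differ.

```python
import numpy as np

def candidate_cut_points(data, epsilon):
    """Candidate cut points at x - eps/2 and x + eps/2 for each quantized value x.

    Work is done on integer grid indices (in units of eps/2) so that
    neighbouring candidates deduplicate exactly instead of differing by
    floating-point noise.
    """
    idx = np.unique(np.round(np.asarray(data, dtype=float) / epsilon).astype(int))
    half_steps = np.union1d(2 * idx - 1, 2 * idx + 1)  # units of eps/2
    inner = half_steps[1:-1]  # drop the outer boundaries
    return inner * (epsilon / 2.0)

sample = np.array([0.12, 0.18, 0.31, 0.97])
candidates = candidate_cut_points(sample, epsilon=0.1)
print(candidates)
```

A larger `epsilon` merges nearby values onto the same grid point, which shrinks the candidate set and therefore the search space of the optimization.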

Algorithm Highlights

  • Uses Ramanujan's factorial approximation for efficient parametric complexity
  • Caches parametric complexity values to speed up computation
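
The idea of caching parametric complexity can be sketched as follows. This is not the package's code: instead of a Ramanujan-based approximation, the sketch computes the two-bin term exactly by direct summation and then applies the known recurrence for the multinomial normalizing term, C(n, K) = C(n, K-1) + n/(K-2) · C(n, K-2), with `functools.lru_cache` playing the role of the cache. The function names are hypothetical.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def binomial_complexity(n):
    """Exact C(n, 2) = sum_h binom(n, h) (h/n)^h ((n-h)/n)^(n-h), with 0^0 = 1."""
    if n == 0:
        return 1.0
    total = 0.0
    for h in range(n + 1):
        # Work in log space to avoid overflow in the binomial coefficient.
        log_term = math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
        if h > 0:
            log_term += h * math.log(h / n)
        if h < n:
            log_term += (n - h) * math.log((n - h) / n)
        total += math.exp(log_term)
    return total

@lru_cache(maxsize=None)
def parametric_complexity(n, K):
    """Multinomial normalizing term C(n, K) for n samples and K >= 1 categories."""
    if K == 1:
        return 1.0
    if K == 2:
        return binomial_complexity(n)
    return parametric_complexity(n, K - 1) + n / (K - 2) * parametric_complexity(n, K - 2)

for K in range(1, 6):
    print(K, math.log(parametric_complexity(1000, K)))
```

Since the dynamic program queries the same (n, K) pairs many times, memoizing these values avoids repeating the O(n) summation.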

Paper Citation

Kontkanen, P., & Myllymäki, P. (2007).
MDL Histogram Density Estimation
Journal of Machine Learning Research 8 (2007)

License

Apache 2.0 License - See LICENSE file

Project Structure

src/
├── mdl_density_hist/
│   ├── __init__.py
│   └── mdl_hist.pyx  # Core Cython implementation
└── pyproject.toml

Performance Notes

  • Precomputed parametric complexity using dynamic programming
  • Memory-optimized array operations via NumPy
  • Candidate cut point pruning for reduced search space
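
The shape of the O(E²·K_max) dynamic program can be sketched in plain Python. This is a deliberately simplified version, not the package's implementation: it finds, for each fixed K, the cut points that maximize the histogram likelihood, and it omits the parametric-complexity penalty that the MDL criterion adds to choose K. The function `ml_segmentation` is hypothetical, and `edges` is assumed to be sorted, to include the outer boundaries, and to cover all data points.

```python
import math
import numpy as np

def ml_segmentation(data, edges, k_max):
    """Return {K: (negative log-likelihood, chosen edges)} for K = 1..k_max."""
    data = np.sort(np.asarray(data, dtype=float))
    edges = np.asarray(edges, dtype=float)
    n, m = data.size, edges.size - 1
    # cum[i] = number of points at or below edges[i]; nothing lies below edges[0].
    cum = np.searchsorted(data, edges, side="right")
    cum[0] = 0

    def cost(i, j):
        # Negative log-likelihood contribution of one bin (edges[i], edges[j]].
        c = int(cum[j] - cum[i])
        if c == 0:
            return 0.0
        return -c * math.log(c / (n * (edges[j] - edges[i])))

    k_max = min(k_max, m)
    best = np.full((k_max + 1, m + 1), np.inf)
    back = np.zeros((k_max + 1, m + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, k_max + 1):          # number of bins used so far
        for j in range(k, m + 1):          # right edge of the last bin
            for i in range(k - 1, j):      # left edge of the last bin
                val = best[k - 1, i] + cost(i, j)
                if val < best[k, j]:
                    best[k, j], back[k, j] = val, i

    result = {}
    for k in range(1, k_max + 1):
        j, picked = m, [m]
        for kk in range(k, 0, -1):         # walk the back-pointers
            j = back[kk, j]
            picked.append(j)
        result[k] = (float(best[k, m]), edges[picked[::-1]])
    return result

solutions = ml_segmentation([0.5, 0.6, 0.7, 3.5], [0, 1, 2, 3, 4], k_max=3)
for K, (nll, chosen) in solutions.items():
    print(K, round(nll, 3), chosen)
```

The three nested loops give the E²·K_max cost; pruning candidate cut points reduces E and therefore has a quadratic effect on runtime.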

For implementation details, see the paper and inline code comments.
