This package provides a Cython-accelerated implementation of the Minimum Description Length (MDL) optimal histogram density estimation algorithm from Kontkanen & Myllymaki (2007). It uses information-theoretic principles to automatically determine optimal variable-width bins for density estimation.
- MDL Principle: Uses stochastic complexity for model selection
- Dynamic Programming: Efficient O(E²·K_max) optimization (cache parametric complexity computation, speed up)
- Score of each Kth bin: The score of each bin is returned to understand the performance of different properties of the same dataset.
- Variable-Width Bins: Adapts to data density variations
-
Automatic Bin Count: No manual parameter tuning required (except maximum bin count to consider
$K_{max}$ and data resolution$\epsilon$ ) - Cython Acceleration: Critical operations compiled to C
You can install the package using pip:
pip install MDL-Density-HistogramAlternatively, you can install it from source by cloning the repository and running:
# From project root directory
pip install .Requires:
- Python 3.11+
- NumPy
- Cython
- C compiler (GCC/Clang/MSVC)
import numpy as np
from mdl_density_hist import mdl_optimal_histogram
# Generate sample data
data = np.random.normal(0, 1, 1000)
# Compute optimal histogram
cut_points, K_scores = mdl_optimal_histogram(data, epsilon=0.1)
# Print score of each bin
print(f"K_scores: {K_scores}")
# Visualize result
import matplotlib.pyplot as plt
plt.hist(data, bins=cut_points, density=True)
plt.title('MDL Optimal Histogram')
plt.show()data: Input array (1D numpy array)epsilon: Quantization precision (default: 0.1)K_max: Maximum number of bins (default: 10)
- Uses Ramanujan's factorial approximation for efficient parametric complexity
- Cache parameteric complexity to speed up computation
Kontkanen, P., & Myllymäki, P. (2007).
MDL Histogram Density Estimation
Journal of Machine Learning Research 8 (2007)
PDF
Apache 2.0 License - See LICENSE file
src/
├── mdl_density_hist/
│ ├── __init__.py
│ └── mdl_hist.pyx # Core Cython implementation
└── pyproject.toml
- Precomputed parametric complexity using dynamic programming
- Memory-optimized array operations via NumPy
- Candidate cut point pruning for reduced search space
For implementation details, see the paper and inline code comments.
