Description
Clarify and document the behavior of the Khiops co-clustering algorithm on a simple toy example, in order to explain why the result is visually counter‑intuitive and to check that I am using the algorithm correctly.
Questions / Ideas
Context and issue (first example only)
I’m experimenting with Khiops co-clustering on the examples from the Clustering section of the scikit-learn documentation (version 1.8.0), which use two continuous variables:
https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

On the first example, the data structure looks visually simple and intuitive, but the partition produced by Khiops co-clustering is quite different from what I would expect. The result is hard to interpret given the shape of the point cloud, whereas I thought this would be one of the most “pedagogical” cases.
As I understand it:
- the two continuous variables are automatically discretized;
- the algorithm optimizes a criterion based on the frequencies observed in the rectangles defined by this 2D discretization (see the sketch just after this list);
- geometric distance between individual points does not directly enter the criterion, which may explain the gap between visual intuition and the optimal solution for the objective.
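
To make that point concrete, here is a minimal numpy-only sketch of the kind of frequency table I assume the criterion works on. The 5×5 equal-width grid is arbitrary and only illustrative; the actual intervals are chosen by Khiops itself as part of the optimization, and this is not the Khiops implementation:

```python
import numpy as np
from sklearn import datasets

# Same toy data as the generator snippet at the end of this issue:
# three groups of 2D points.
X, _ = datasets.make_blobs(
    n_samples=1000, centers=[[10, 10], [-10, 10], [0, 0]], random_state=42
)

# Arbitrary 5 x 5 equal-width grid, only for illustration; Khiops chooses its
# own intervals on each axis during the co-clustering optimization.
counts, x_edges, y_edges = np.histogram2d(X[:, 0], X[:, 1], bins=5)

# My understanding: the criterion only sees a frequency table like `counts`;
# where the points fall inside a given rectangle no longer matters.
print(counts.astype(int))
```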
Specific questions / points to clarify on this first example
- Can you confirm that, in this 2D continuous case, the co-clustering criterion is based only on frequencies in the rectangles produced by automatic discretization, with no explicit notion of distance between individual points?
- For this first scikit-learn dataset, is the partition returned by Khiops consistent with what the algorithm is supposed to optimize, even if it looks unnatural when you just look at the point cloud?
- Do you see any potential misuse in my setup on this example (data preparation, parameter choices, etc.; my preparation is sketched just after this list) that could explain why the result is so counter‑intuitive?
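
Regarding the last question, this is roughly the data preparation I use (pandas only; the column names `X1`, `X2`, `PointId` are mine, and the Khiops co-clustering call itself is omitted here since it is not where my doubt lies):

```python
import pandas as pd
from sklearn import datasets

X, _ = datasets.make_blobs(
    n_samples=1000, centers=[[10, 10], [-10, 10], [0, 0]], random_state=42
)

# "V x V" (density) setup: one row per point, the two continuous variables as columns.
points = pd.DataFrame(X, columns=["X1", "X2"])
points.insert(0, "PointId", [f"p{i}" for i in range(len(points))])

# "V x I" setup: the same table in long format, one row per (point, variable) pair.
long_format = points.melt(id_vars="PointId", var_name="Variable", value_name="Value")

print(points.head())
print(long_format.head())
```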
Ideas / possible improvements
- Use this first example as a detailed case study in the documentation:
  - make the chosen discretization on both axes explicit,
  - show the difference between clustering V×I and clustering (density) V×V (and provide recommendations on how to choose between them),
  - explain why the solution selected by Khiops is preferable to other partitions that look more “natural” visually.
- Add a Q&A entry or a short tutorial based on this scikit-learn example to illustrate the difference between a geometric intuition of clusters and the logic of the Khiops co-clustering criterion.
The most illustrative example is generated by:

```python
from sklearn import datasets

# Three groups of 2D points (make_blobs rather than make_circles, since
# `centers` is a make_blobs parameter).
X, y = datasets.make_blobs(
    n_samples=1000, centers=[[10, 10], [-10, 10], [0, 0]], random_state=42
)
```

and produces a point cloud with three well-separated groups of points.

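For reference, the cloud can be regenerated and plotted quickly (matplotlib; the axis labels are just placeholders of mine):

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Regenerate the same toy data and plot it.
X, _ = datasets.make_blobs(
    n_samples=1000, centers=[[10, 10], [-10, 10], [0, 0]], random_state=42
)
plt.scatter(X[:, 0], X[:, 1], s=5)
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()
```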