Add Q&A explaining counter‑intuitive Khiops co-clustering results on a simple example. #555

@ElouenGinat

Description

Clarify and document the behavior of the Khiops co-clustering algorithm on a simple toy example, in order to explain why the result looks visually counter‑intuitive and to check that I am using the algorithm correctly.

Questions / Ideas

  • Context and issue (first example only)
    I’m experimenting with Khiops co-clustering on the examples from the Clustering section of the scikit-learn documentation (version 1.8.0), which use two continuous variables:
    https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

    On the first example, the data structure looks visually simple and intuitive, but the partition produced by Khiops co-clustering is quite different from what I would expect. The result is hard to interpret given the shape of the point cloud, whereas I thought this would be one of the most “pedagogical” cases.

    As I understand it:

    • the two continuous variables are automatically discretized;
    • the algorithm optimizes a criterion based on the frequencies observed in the rectangles defined by this 2D discretization;
    • geometric distance between individual points does not directly enter the criterion, which may explain the gap between visual intuition and the optimal solution for the objective.
  • Specific questions / points to clarify on this first example

    • Can you confirm that, in this 2D continuous case, the co-clustering criterion is based only on frequencies in the rectangles produced by automatic discretization, with no explicit notion of distance between individual points?
    • For this first scikit-learn dataset, is the partition returned by Khiops consistent with what the algorithm is supposed to optimize, even if it looks unnatural when you just look at the point cloud?
    • Do you see any potential misuse in my setup on this example (data preparation, parameter choices, etc.) that could explain why the result is so counter‑intuitive?
  • Ideas / possible improvements

    • Use this first example as a detailed case study in the documentation:
      • make the chosen discretization on both axes explicit,
      • show the difference between clustering V×I and clustering (density) V×V (and provide recommendations on how to choose between them),
      • explain why the solution selected by Khiops is preferable to other partitions that look more “natural” visually.
    • Add a Q&A entry or a short tutorial based on this scikit-learn example to illustrate the difference between a geometric intuition of clusters and the logic of the Khiops co-clustering criterion.
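To make the frequency-grid reading above concrete, here is a minimal numpy sketch of the kind of 2D contingency table such a criterion operates on. This is not Khiops itself: it uses fixed equal-frequency quartile bins on each axis purely for illustration, whereas Khiops chooses the discretization jointly with the co-clusters by optimizing its own criterion. The data generation mirrors the blob example given further below.

```python
import numpy as np
from sklearn import datasets

# Three Gaussian blobs, as in the toy example discussed in this issue.
X, _ = datasets.make_blobs(
    n_samples=1000, centers=[[10, 10], [-10, 10], [0, 0]], random_state=42
)

# Equal-frequency (quartile) bin edges on each axis -- an arbitrary fixed
# grid, NOT the discretization Khiops would select.
edges_x = np.quantile(X[:, 0], [0, 0.25, 0.5, 0.75, 1.0])
edges_y = np.quantile(X[:, 1], [0, 0.25, 0.5, 0.75, 1.0])

# 4x4 contingency table: point counts in the rectangles of the 2D grid.
# A frequency-based criterion only ever sees these counts, never the
# geometric distances between individual points.
counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[edges_x, edges_y])
print(counts.astype(int))
```

Inspecting `counts` shows why visually "natural" clusters need not coincide with the grid cells the criterion reasons over.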

The most illustrative example is generated by:

from sklearn import datasets

# make_blobs (not make_circles) is the generator that accepts explicit
# cluster centers; make_circles would raise a TypeError on `centers`.
X, y = datasets.make_blobs(
    n_samples=1000, centers=[[10, 10], [-10, 10], [0, 0]], random_state=42
)

And produces a point cloud similar to:

(Image: scatter plot of the generated point cloud)
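As a complement, here is a hedged sketch of one way the generated data could be reshaped before handing it to a co-clustering tool. The long `(id, variable, value)` layout is an assumption chosen to match the V×I reading discussed above; it is not a confirmed Khiops input format.

```python
import pandas as pd
from sklearn import datasets

# Same toy data as above: three Gaussian blobs in two continuous variables.
X, _ = datasets.make_blobs(
    n_samples=1000, centers=[[10, 10], [-10, 10], [0, 0]], random_state=42
)

# Wide table: one row per individual, one column per variable.
df = pd.DataFrame(X, columns=["x1", "x2"]).reset_index().rename(columns={"index": "id"})

# Hypothetical long layout for a V x I co-clustering: one row per
# (individual, variable, value) triple.
long_df = df.melt(id_vars="id", var_name="variable", value_name="value")
print(long_df.shape)  # (2000, 3)
```

Stating the layout explicitly in the documentation (wide V×I table vs. long triples vs. a V×V density view) would remove one common source of setup mistakes on examples like this one.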
