Clustering

Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class, which implements the fit method to learn the clusters on training data, and a function, which, given training data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.
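
To make the two variants concrete, here is a minimal sketch using K-Means. The toy array X and the parameter values are assumptions picked for illustration only. Note that the function form, k_means, returns the centroids and the inertia along with the labels.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Toy data (an assumption for illustration): two obvious groups along the x axis.
X = np.array(
    [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [10.0, 2.0], [10.0, 4.0], [10.0, 0.0]]
)

# Class variant: fit the estimator, then read the labels_ attribute.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
class_labels = model.labels_

# Function variant: returns (centroids, labels, inertia) in one call.
centroids, func_labels, inertia = k_means(X, n_clusters=2, n_init=10, random_state=0)

print(class_labels)
print(func_labels)
```

The integer assigned to each cluster is arbitrary, so compare groupings rather than the raw label values.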

In summary, the methods can be grouped as follows:
  • based on parameters:

      number of clusters (n_clusters): K-Means, Spectral, ward hierarchical, agglomerative
      minimum cluster membership: OPTICS
      neighborhood size: DBSCAN
      many params: Gaussian Mixtures
      distance threshold: ward hierarchical, agglomerative, Birch
      bandwidth: Mean-shift
      sample preferences/damping: affinity propagation
  • based on scalability:

      large n_samples, medium n_clusters: K-Means, DBSCAN
      medium n_samples, small n_clusters: Spectral
      large n_samples, large n_clusters: ward hierarchical, agglomerative, OPTICS, Birch*
      not scalable with n_samples: affinity propagation, Mean-shift, Gaussian Mixtures

    * Note that Birch is used on large datasets, for outlier removal and data reduction.

  • based on n_clusters:

      many clusters: affinity propagation, Mean-shift, ward hierarchical, agglomerative
      medium clusters: K-Means, Spectral
  • based on geometry:

      flat: K-Means, Gaussian Mixture
      non-flat: affinity propagation, Mean-shift, spectral, DBSCAN, OPTICS
  • based on metric used:

      distance between points: K-Means, Mean-shift, ward hierarchical, DBSCAN, OPTICS, Birch, Gaussian Mixtures
      graph distance: affinity propagation, Spectral
      any pairwise dist: agglomerative
  • if distance threshold or neighborhood size methods are considered:

      ward hierarchical, agglomerative, K-Means, Mean-shift, DBSCAN, OPTICS, Birch, Gaussian Mixtures
  • if graph distance methods are considered:

      affinity propagation, Spectral
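
As a rough illustration of the "based on parameters" grouping, the sketch below builds a few of these estimators with the parameter each one mainly depends on. The toy blobs and every numeric value (eps, bandwidth, distance_threshold, damping) are assumptions tuned to this toy data, not recommended defaults.

```python
from sklearn.cluster import (
    AffinityPropagation,
    AgglomerativeClustering,
    DBSCAN,
    KMeans,
    MeanShift,
)
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): three tight, well-separated blobs.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

models = {
    # Number of clusters given up front.
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    # Neighborhood size (eps) instead of a cluster count.
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5),
    # Kernel bandwidth.
    "Mean-shift": MeanShift(bandwidth=3.0),
    # Distance threshold; n_clusters must be None in this mode (ward linkage is the default).
    "Ward": AgglomerativeClustering(n_clusters=None, distance_threshold=20.0),
    # Damping, with the sample preference left at its default.
    "Affinity propagation": AffinityPropagation(damping=0.9, random_state=0),
}

n_found = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN marks noise points with -1, which is not a cluster.
    n_found[name] = len(set(labels) - {-1})
    print(f"{name}: {n_found[name]} clusters")
```

Only K-Means is told the number of clusters here; the others derive it from their own parameter, which is why those values have to be matched to the scale of the data.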

As another view:

If flat geometry is considered (all not scalable with n_samples):

    1- if many n_clusters are considered:

  • if distance threshold method: use Mean-shift
  • if graph distance method: use affinity propagation

    2- regardless of how many n_clusters: use Gaussian Mixtures (distance threshold method)

If non-flat geometry is considered:

    if large n_samples, choose by the number of clusters (all using distance threshold method):

  • if many: use ward hierarchical, agglomerative, OPTICS, Birch
  • if medium/few: use K-Means, DBSCAN

    if medium n_samples: use Spectral (using graph distance method)
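
The decision tree above can be written down as a small lookup function. This is only a sketch: suggest_methods and its string arguments are hypothetical names introduced here, and the function simply transcribes the branches listed above rather than anything provided by scikit-learn.

```python
def suggest_methods(geometry, n_samples="large", n_clusters="any"):
    """Hypothetical helper that transcribes the decision tree above.

    geometry:   "flat" or "non-flat"
    n_samples:  "large" or "medium" (only used for non-flat geometry)
    n_clusters: "many", "medium", "few" or "any"
    """
    if geometry == "flat":
        # Gaussian Mixtures is listed as usable regardless of n_clusters,
        # so it is included in both branches.
        if n_clusters == "many":
            return ["Mean-shift", "affinity propagation", "Gaussian Mixtures"]
        return ["Gaussian Mixtures"]

    if geometry == "non-flat":
        if n_samples == "medium":
            return ["Spectral"]
        if n_clusters == "many":
            return ["ward hierarchical", "agglomerative", "OPTICS", "Birch"]
        return ["K-Means", "DBSCAN"]

    raise ValueError(f"unknown geometry: {geometry!r}")


print(suggest_methods("non-flat", n_samples="large", n_clusters="many"))
print(suggest_methods("flat", n_clusters="many"))
```

Treat the result as a shortlist to try, not a final answer.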

Check this page for detailed information and comparison between methods.

Have Fun!
