Clustering
Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class that implements the fit method to learn the clusters on train data, and a function that, given train data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.
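A minimal sketch of the two variants with K-Means on toy data (the label numbering itself is arbitrary and may differ between runs):

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])

# Class variant: fit learns the clusters; labels_ holds the training labels.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)

# Function variant: returns (centroids, labels, inertia) directly.
centroids, labels, inertia = k_means(X, n_clusters=2, n_init=10, random_state=0)
print(labels)
```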
In summary, the methods can be grouped as follows:
- based on parameters:
- n_clusters as parameter: K-Means, Spectral, ward hierarchical, agglomerative, OPTICS (minimum cluster membership)
- neighborhood size: DBSCAN
- many params: Gaussian Mixtures
- distance threshold: ward hierarchical, agglomerative, birch
- bandwidth: Mean-shift
- sample preferences/damping: affinity propagation
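A sketch of how these knobs appear in the scikit-learn API (the values here are arbitrary placeholders, not recommendations):

```python
from sklearn.cluster import KMeans, DBSCAN, MeanShift, AgglomerativeClustering

kmeans = KMeans(n_clusters=3, n_init=10)         # n_clusters
dbscan = DBSCAN(eps=0.5, min_samples=5)          # neighborhood size (eps)
meanshift = MeanShift(bandwidth=2.0)             # bandwidth
agglo = AgglomerativeClustering(                 # distance threshold instead
    n_clusters=None, distance_threshold=1.5)     # of a fixed n_clusters
```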
- based on scalability
- large n_samples, medium n_clusters: K-Means, DBSCAN
- medium n_samples, small n_clusters: spectral
- large n_samples, large n_clusters: ward hierarchical, agglomerative, OPTICS, Birch*
- not scalable with n_samples: affinity propagation, Mean-shift, Gaussian Mixture
* note that Birch is used on large datasets for outlier removal and data reduction.
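The data-reduction use of Birch can be sketched as follows: with n_clusters=None, Birch keeps its raw CF-subcluster centroids instead of running a final global clustering, and a slower global method can then cluster those few representatives (the threshold value here is an arbitrary placeholder):

```python
from sklearn.cluster import Birch, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)

# Step 1: Birch as a data-reduction step. n_clusters=None skips the final
# global clustering and exposes the subcluster centroids directly.
reducer = Birch(threshold=0.8, n_clusters=None).fit(X)
reduced = reducer.subcluster_centers_
print(reduced.shape)  # far fewer rows than the 5000 original samples

# Step 2: cluster the reduced representatives with a global method.
final = AgglomerativeClustering(n_clusters=4).fit(reduced)
```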
- based on n_clusters
- many clusters: affinity propagation, Mean-shift, ward hierarchical, agglomerative
- medium clusters: K-Means, spectral
- based on geometry
- flat: K-Means, Gaussian Mixture
- non-flat: affinity propagation, Mean-shift, spectral, DBSCAN, OPTICS
- based on metric used
- distance between points: K-Means, Mean-shift, ward hierarchical, DBSCAN, OPTICS, Birch, Gaussian Mixtures
- graph distance: affinity propagation, Spectral
- any pairwise dist: agglomerative
- if distance threshold or neighborhood size methods are considered:
- ward hierarchical, agglomerative, K-Means, Mean-shift, DBSCAN, OPTICS, Birch, Gaussian Mixtures
- if graph distance methods are considered:
- affinity propagation, Spectral
As another view, grouped by scalability:
if not scalable with n_samples: (Mean-shift, affinity propagation, Gaussian Mixtures)
- if many n_clusters: use Mean-shift (bandwidth) or affinity propagation (graph distance)
- if not too many n_clusters: use Gaussian Mixtures (many params)
if large n_samples:
- if large n_clusters: use ward hierarchical, agglomerative, OPTICS, Birch
- if medium n_clusters: use K-Means, DBSCAN
if medium n_samples and small n_clusters: use Spectral (graph distance method)
Check this page for detailed information and comparison between methods.
>Have Fun!