Clustering

Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class that implements the fit method to learn the clusters on training data, and a function that, given training data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.
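As a minimal sketch of the two variants, the snippet below uses K-Means in both its class form (KMeans) and its function form (k_means). The toy data from make_blobs and all parameter values are assumptions chosen only for illustration. Note that this particular function returns the labels together with the centroids and the inertia, so exact return values can differ from one algorithm to another.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means
from sklearn.datasets import make_blobs

# Toy data: 300 points around 3 centers (sizes are arbitrary, for illustration only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Class variant: fit() learns the clusters; the labels are stored in labels_.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
class_labels = model.labels_

# Function variant: k_means returns (centroids, labels, inertia) in a single call.
centroids, func_labels, inertia = k_means(X, n_clusters=3, n_init=10, random_state=0)

print(class_labels.shape, func_labels.shape)
print(np.unique(class_labels), np.unique(func_labels))
```

The class variant is usually the more convenient one when the fitted model is needed later, for example to call predict on new samples.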

In summary, the methods can be compared:

- based on parameters
- based on scalability
* note that Birch is meant for large datasets, outlier removal and data reduction
- based on n_clusters
- based on geometry
- based on metric used
- by whether a distance threshold or neighborhood size method is considered
- by whether a graph distance method is considered
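The note on Birch and data reduction can be sketched as follows. This is only an illustration under assumed settings: the toy data, the threshold value and the choice n_clusters=None (which skips the final global clustering step) are not from the original post.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Toy data: 1000 points around 4 centers (arbitrary sizes, for illustration only).
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=1.0, random_state=0)

# n_clusters=None skips the final global clustering step, so the model
# keeps the raw subclusters built while reading the data.
birch = Birch(threshold=0.5, n_clusters=None).fit(X)

# The subcluster centers act as a compressed summary of the dataset.
reduced = birch.subcluster_centers_
print(X.shape[0], "samples summarized by", reduced.shape[0], "subcluster centers")
```

The reduced set of centers can then be passed to another, less scalable clustering method. A larger threshold gives fewer and coarser subclusters.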

As another view:

if the method does not need to scale with n_samples (none of these do):

1- if many n_clusters are expected (both also handle non-flat geometry):

- if distance threshold method: use Mean-shift
- if graph distance method: use Affinity Propagation

2- if the number of clusters does not matter: use Gaussian Mixtures (flat geometry; distances to cluster centers)

if the method must scale with n_samples:

1- if large n_samples (all using distance threshold method):

- if many n_clusters: use Ward hierarchical, Agglomerative, OPTICS, Birch
- if medium/few n_clusters: use K-Means (flat geometry) or DBSCAN (non-flat geometry)

2- if medium n_samples and few n_clusters: use Spectral (non-flat geometry; graph distance method)
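To see why geometry matters when picking among these methods, here is a rough sketch that runs K-Means and DBSCAN on two interleaved half-moons, a typical non-flat shape. The dataset, eps, min_samples and the use of the adjusted Rand index as a score are all assumptions made for this illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clusters that are not compact, convex blobs.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means assigns each point to the nearest centroid.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_

# DBSCAN grows clusters through dense neighborhoods of radius eps.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit(X).labels_

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On data like this, the density-based method should follow the curved shapes much more closely, while on compact, evenly sized blobs K-Means is usually the simpler and faster choice.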

Check this page for detailed information and comparison between methods.

Have Fun!
