API Reference

VAT module

pyclustertend.visual_assessment_of_tendency.compute_ivat_ordered_dissimilarity_matrix(X)[source]

The ordered dissimilarity matrix is used by iVAT. It is simply a reordering of the dissimilarity matrix.

Parameters:X (matrix) – numpy array
Returns:D_prim – the ordered dissimilarity matrix.
Return type:matrix
pyclustertend.visual_assessment_of_tendency.compute_ordered_dissimilarity_matrix(X)[source]

The ordered dissimilarity matrix is used by visual assessment of tendency. It is simply a reordering of the dissimilarity matrix.

Parameters:X (matrix) – numpy array
Returns:ODM – the ordered dissimilarity matrix.
Return type:matrix
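For intuition, the reordering behind compute_ordered_dissimilarity_matrix can be sketched as a Prim-like nearest-neighbour walk over the pairwise distance matrix, so that mutually close points end up adjacent. This is an illustrative reimplementation under that assumption, not the library's source; the name vat_ordering is hypothetical:

```python
import numpy as np

def vat_ordering(X):
    # Pairwise Euclidean distance matrix (illustrative, not the library code).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    n = len(D)
    # Start from one endpoint of the largest dissimilarity.
    order = [int(np.argmax(D.max(axis=1)))]
    remaining = set(range(n)) - set(order)
    while remaining:
        # Pick the unvisited point closest to the visited set (Prim-like step).
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]
        nxt = rem[int(np.argmin(sub.min(axis=0)))]
        order.append(nxt)
        remaining.remove(nxt)
    # Reorder rows and columns of D with the visiting order.
    return D[np.ix_(order, order)]
```

On clustered data this ordering makes dark blocks appear along the diagonal of the reordered matrix, which is what the VAT image visualises.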
pyclustertend.visual_assessment_of_tendency.ivat(data, return_odm=False, figure_size=(10, 10))[source]

iVAT returns a visualisation based on VAT, but one that is more reliable and easier to interpret.

Parameters:
  • data (matrix) – numpy array
  • return_odm (boolean, default False) – whether to return the ordered dissimilarity matrix
  • figure_size (tuple, default (10, 10)) – size of the VAT figure
Returns:

D_prim – the iVAT ordered dissimilarity matrix.

Return type:

matrix
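The extra step iVAT performs on top of the VAT ordering can be sketched as a path-based (min-max) distance recursion: each entry is replaced by the largest edge on the best path between the two points, which sharpens the diagonal blocks. A minimal sketch, assuming the input matrix is already in VAT order; ivat_transform is a hypothetical name, not the library API:

```python
import numpy as np

def ivat_transform(R):
    """Path-based (min-max) distances over a VAT-ordered dissimilarity matrix R."""
    n = len(R)
    Rp = np.zeros_like(R, dtype=float)
    for r in range(1, n):
        j = int(np.argmin(R[r, :r]))   # closest already-processed point
        for c in range(r):
            # best path from r to c goes through j; keep its largest edge
            Rp[r, c] = max(R[r, j], Rp[j, c])
        Rp[:r, r] = Rp[r, :r]          # keep the matrix symmetric
    return Rp
```

On points lying on a line, this turns each pairwise distance into the largest gap along the chain, so a single big gap (a cluster boundary) dominates the whole off-diagonal block.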

pyclustertend.visual_assessment_of_tendency.vat(data, return_odm=False, figure_size=(10, 10))[source]

VAT means Visual Assessment of Tendency. Basically, it allows assessing cluster tendency through a map based on the dissimilarity matrix.

Parameters:
  • data (matrix) – numpy array
  • return_odm (boolean, default False) – whether to return the ordered dissimilarity matrix
  • figure_size (tuple, default (10, 10)) – size of the VAT figure
Returns:

ODM – the ordered dissimilarity matrix that was plotted.

Return type:

matrix

hopkins module

pyclustertend.hopkins.hopkins(data_frame, sampling_size)[source]

Assess the clusterability of a dataset. The result is a score between 0 and 1: a score around 0.5 expresses no clusterability, while a score tending to 0 expresses a high cluster tendency.

Parameters:
  • data_frame (numpy array) – The input dataset
  • sampling_size (int) – The number of rows sampled from the dataset to compute the score.
Returns:

score – The hopkins score of the dataset (between 0 and 1)

Return type:

float

Examples

>>> from sklearn import datasets
>>> from pyclustertend import hopkins
>>> X = datasets.load_iris().data
>>> hopkins(X,150)
0.16
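The score can be reproduced in spirit with a short NumPy sketch of the Hopkins statistic, following the library's convention that low scores mean clusterable. The name hopkins_sketch and details such as drawing uniform points from the data's bounding box are illustrative assumptions, not the library's implementation:

```python
import numpy as np

def hopkins_sketch(X, m, seed=0):
    # Illustrative Hopkins statistic, pyclustertend convention:
    # ~0.5 -> no cluster structure, tending to 0 -> strong cluster tendency.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    idx = rng.choice(n, size=m, replace=False)
    sample = X[idx]
    # m uniform points drawn from the data's bounding box (an assumption)
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    def nearest(points, exclude=None):
        # squared distances from each query point to every point of X
        d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        if exclude is not None:
            d2[np.arange(m), exclude] = np.inf   # drop distance to itself
        return np.sqrt(d2.min(axis=1))

    w = nearest(sample, exclude=idx)   # real point -> nearest other real point
    u = nearest(uniform)               # uniform point -> nearest real point
    return w.sum() / (u.sum() + w.sum())
```

On tightly clustered data the real-to-real distances w collapse while the uniform-to-real distances u stay large, pushing the ratio toward 0; on uniform data the two are comparable and the score hovers near 0.5.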

metric module

pyclustertend.metric.assess_tendency_by_mean_metric_score(dataset, n_cluster=10, random_state=None)[source]

Assess the clusterability of a dataset using the KMeans algorithm together with the silhouette, Calinski-Harabasz and Davies-Bouldin scores; the best cluster number is the mean of the results of the three methods.

Parameters:
  • dataset (numpy array, DataFrame) – The input dataset
  • n_cluster (int) – The maximum number of clusters to consider
  • random_state (int (default to None)) –
Returns:

n_clusters

Return type:

n_clusters is the mean of the best number of clusters found by each of the three metrics (with the KMeans algorithm)

Examples

>>> from sklearn import datasets
>>> from pyclustertend import assess_tendency_by_mean_metric_score
>>> from sklearn.preprocessing import scale
>>> X = scale(datasets.load_boston().data)
>>> assess_tendency_by_mean_metric_score(X,10)
2.6666666666666665
pyclustertend.metric.assess_tendency_by_metric(dataset, metric='silhouette', n_cluster=10, random_state=None)[source]

Assess the clusterability of a dataset using the KMeans algorithm and a metric score; the best cluster number is the number that scored best with the chosen metric.

Parameters:
  • dataset (numpy array, DataFrame) – The input dataset
  • metric (string) – The method to assess cluster quality (‘silhouette’, ‘calinski_harabasz’, ‘davies_bouldin’), default to ‘silhouette’
  • n_cluster (int) – The maximum number of clusters to consider
  • random_state (int (default to None)) –
Returns:

  • (n_clusters, value) – n_clusters is the number of clusters that scored best with the chosen metric on KMeans.
  • value is the array of metric scores obtained for each candidate number of clusters.

Examples

>>> from sklearn import datasets
>>> from pyclustertend import assess_tendency_by_metric
>>> from sklearn.preprocessing import scale
>>> X = scale(datasets.load_boston().data)
>>> assess_tendency_by_metric(X, n_cluster=10)
(2, array([0.36011769, 0.25740335, 0.28098046, 0.28781574, 0.26746932,
    0.26975514, 0.27155699, 0.28883395, 0.29028124]))
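The selection logic behind assess_tendency_by_metric can be sketched with scikit-learn alone: fit KMeans for every candidate k and keep the k whose silhouette score is highest. This is a minimal sketch of that logic, not the library's source; best_k_by_silhouette is a hypothetical helper name:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_max=10, random_state=None):
    # Score KMeans for k = 2..k_max and keep each fit's silhouette score.
    scores = []
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    scores = np.array(scores)
    # np.argmax indexes from 0, and the grid starts at k = 2
    return int(np.argmax(scores)) + 2, scores

# Usage on three well-separated blobs: the silhouette peaks at k = 3.
X, _ = make_blobs(centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)
best_k, scores = best_k_by_silhouette(X, k_max=6, random_state=0)
```

The other supported metrics plug in the same way (calinski_harabasz_score is also maximised, while davies_bouldin_score is minimised).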