API Reference¶
VAT module¶
- pyclustertend.visual_assessment_of_tendency.compute_ivat_ordered_dissimilarity_matrix(X)[source]¶
The ordered dissimilarity matrix used by ivat. It is just a reordering of the dissimilarity matrix.
Parameters: X (matrix) – numpy array
Returns: D_prim – the ordered dissimilarity matrix
Return type: matrix
- pyclustertend.visual_assessment_of_tendency.compute_ordered_dissimilarity_matrix(X)[source]¶
The ordered dissimilarity matrix used by visual assessment of tendency (VAT). It is just a reordering of the dissimilarity matrix.
Parameters: X (matrix) – numpy array
Returns: ODM – the ordered dissimilarity matrix
Return type: matrix
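The reordering itself is not spelled out above. As a rough sketch (not the library's implementation), a Prim-like nearest-neighbour sweep over the pairwise distance matrix produces a VAT-style ordering; the function name `vat_ordering` is hypothetical:

```python
import numpy as np

def vat_ordering(X):
    """Reorder the pairwise dissimilarity matrix of X with a Prim-like
    nearest-neighbour sweep, in the spirit of the VAT algorithm."""
    # Pairwise Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    n = len(X)
    # Start from one endpoint of the largest dissimilarity.
    order = [int(np.argmax(D.max(axis=1)))]
    remaining = set(range(n)) - set(order)
    while remaining:
        # Next point: the remaining point closest to the ordered set.
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]
        nxt = rem[int(np.argmin(sub.min(axis=0)))]
        order.append(nxt)
        remaining.remove(nxt)
    order = np.array(order)
    return D[np.ix_(order, order)]
```

On data with two well-separated blobs, the reordered matrix shows two dark diagonal blocks, which is exactly the pattern the VAT heatmap makes visible.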
- pyclustertend.visual_assessment_of_tendency.ivat(data, return_odm=False, figure_size=(10, 10))[source]¶
iVAT returns a visualisation based on VAT that is more reliable and easier to interpret.
Parameters:
- data (matrix) – numpy array
- return_odm (boolean) – return the ordered dissimilarity matrix (default to False)
- figure_size (tuple) – size of the VAT plot (default to (10, 10))
Returns: D_prim – the iVAT ordered dissimilarity matrix
Return type: matrix
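What makes iVAT easier to interpret is that it replaces each entry of the VAT-ordered matrix with a minimax path distance. A minimal sketch of that recurrence (Havens & Bezdek, 2012); the name `ivat_transform` is hypothetical and not part of pyclustertend:

```python
import numpy as np

def ivat_transform(odm):
    """Apply the iVAT path-based recurrence to a VAT-ordered
    dissimilarity matrix `odm`, returning minimax path distances."""
    n = len(odm)
    D = np.zeros_like(odm, dtype=float)
    for r in range(1, n):
        # Closest already-ordered point to point r.
        j = int(np.argmin(odm[r, :r]))
        D[r, j] = odm[r, j]
        for c in range(r):
            if c != j:
                # Bottleneck (largest edge) along the best path from r to c.
                D[r, c] = max(odm[r, j], D[j, c])
        D[:r, r] = D[r, :r]  # keep the matrix symmetric
    return D
```

For a chain of points at positions 0, 1, 3 on a line, the direct dissimilarity from the first to the last point is 3, but the minimax path through the middle point has bottleneck 2, so iVAT reports 2.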
- pyclustertend.visual_assessment_of_tendency.vat(data, return_odm=False, figure_size=(10, 10))[source]¶
VAT stands for Visual Assessment of cluster Tendency. It assesses cluster tendency through a heatmap of the ordered dissimilarity matrix.
Parameters:
- data (matrix) – numpy array
- return_odm (boolean) – return the ordered dissimilarity matrix (default to False)
- figure_size (tuple) – size of the VAT plot (default to (10, 10))
Returns: ODM – the ordered dissimilarity matrix that is plotted
Return type: matrix
hopkins module¶
- pyclustertend.hopkins.hopkins(data_frame, sampling_size)[source]¶
Assess the clusterability of a dataset. The returned score lies between 0 and 1: a score around 0.5 indicates no clusterability, while a score tending to 0 indicates a high cluster tendency.
Parameters:
- data_frame (numpy array) – the input dataset
- sampling_size (int) – the number of points sampled from the dataset to compute the score
Returns: score – the Hopkins score of the dataset (between 0 and 1)
Return type: float
Examples
>>> from sklearn import datasets
>>> from pyclustertend import hopkins
>>> X = datasets.load_iris().data
>>> hopkins(X, 150)
0.16
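For intuition, the Hopkins statistic compares nearest-neighbour distances of uniform random points (in the data's bounding box) against those of sampled data points. A minimal numpy sketch, using the same convention as above (near 0.5 = no structure, near 0 = clustered); `hopkins_sketch` is a hypothetical name, not the library function:

```python
import numpy as np

def hopkins_sketch(X, m, seed=0):
    """Sketch of the Hopkins statistic, with the convention that values
    near 0.5 mean no structure and values near 0 mean strong clustering."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    # m uniform points in the bounding box of the data.
    U = rng.uniform(lo, hi, size=(m, d))
    # m points sampled without replacement from the data.
    S = X[rng.choice(n, size=m, replace=False)]

    def nn_dist(P, Q, exclude_self=False):
        d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
        if exclude_self:
            d2[d2 == 0] = np.inf  # crude: a sampled point must not match itself
        return np.sqrt(d2.min(axis=1))

    u = nn_dist(U, X)                      # uniform point -> nearest data point
    w = nn_dist(S, X, exclude_self=True)   # data point -> nearest other data point
    return w.sum() / (u.sum() + w.sum())
```

On tightly clustered data the within-cluster distances w are tiny relative to u, driving the score toward 0; on uniform data u and w are comparable and the score hovers around 0.5.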
metric module¶
- pyclustertend.metric.assess_tendency_by_mean_metric_score(dataset, n_cluster=10, random_state=None)[source]¶
Assess the clusterability of a dataset using the KMeans algorithm and the silhouette, Calinski-Harabasz and Davies-Bouldin scores; the best cluster number is the mean of the results of the three methods.
Parameters:
- dataset (numpy array, DataFrame) – the input dataset
- n_cluster (int) – the maximum number of clusters to consider
- random_state (int, default to None) –
Returns: n_clusters – the mean of the best number of clusters found by each metric (with the KMeans algorithm)
Examples
>>> from sklearn import datasets
>>> from pyclustertend import assess_tendency_by_mean_metric_score
>>> from sklearn.preprocessing import scale
>>> X = scale(datasets.load_boston().data)
>>> assess_tendency_by_mean_metric_score(X, 10)
2.6666666666666665
- pyclustertend.metric.assess_tendency_by_metric(dataset, metric='silhouette', n_cluster=10, random_state=None)[source]¶
Assess the clusterability of a dataset using the KMeans algorithm and a metric score; the best cluster number is the number of clusters that scored best on the chosen metric.
Parameters:
- dataset (numpy array, DataFrame) – the input dataset
- metric (string) – the method used to assess cluster quality ('silhouette', 'calinski_harabasz', 'davies_bouldin'), default to 'silhouette'
- n_cluster (int) – the maximum number of clusters to consider
- random_state (int, default to None) –
Returns: (n_clusters, value) – n_clusters is the number of clusters that scored best with KMeans; value is the metric score for each number of clusters tried.
Examples
>>> from sklearn import datasets
>>> from pyclustertend import assess_tendency_by_metric
>>> from sklearn.preprocessing import scale
>>> X = scale(datasets.load_boston().data)
>>> assess_tendency_by_metric(X, n_cluster=10)
(2, array([0.36011769, 0.25740335, 0.28098046, 0.28781574, 0.26746932,
       0.26975514, 0.27155699, 0.28883395, 0.29028124]))
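Under the hood, this kind of assessment is just a loop over candidate cluster counts. A minimal scikit-learn sketch for the silhouette case (the helper name `best_k_by_silhouette` is hypothetical; it mirrors the loop described above but is not the library's code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, max_k=10, random_state=0):
    """Fit KMeans for k = 2..max_k, score each fit with the silhouette
    metric, and return the best k together with all the scores."""
    scores = []
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, random_state=random_state,
                        n_init=10).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    scores = np.array(scores)
    # Index 0 of `scores` corresponds to k = 2.
    return int(np.argmax(scores)) + 2, scores
```

Note the loop starts at k = 2, since the silhouette score is undefined for a single cluster; that is also why the scores array in the example above has n_cluster - 1 entries.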