pylupnt.plot.KMeans¶

class pylupnt.plot.KMeans(n_clusters=8, *, init='k-means++', n_init='auto', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')¶

K-Means clustering.

See also

MiniBatchKMeans: Alternative online implementation that does incremental updates of the centers positions using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than the default batch implementation.

Notes

The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.

The average complexity is given by O(k n T), where n is the number of samples and T is the number of iteration.

The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features. Refer to :doi:`"How slow is the k-means method?" D. Arthur and S. Vassilvitskii - SoCG2006.<10.1145/1137856.1137880>` for more details.

In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it falls in local minima. That’s why it can be useful to restart it several times.

If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.

Examples

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])

For a more detailed example of K-Means using the iris dataset see sphx_glr_auto_examples_cluster_plot_cluster_iris.py.

For examples of common problems with K-Means and how to address them see sphx_glr_auto_examples_cluster_plot_kmeans_assumptions.py.

For an example of how to use K-Means to perform color quantization see sphx_glr_auto_examples_cluster_plot_color_quantization.py.

For a demonstration of how K-Means can be used to cluster text documents see sphx_glr_auto_examples_text_plot_document_clustering.py.

For a comparison between K-Means and MiniBatchKMeans refer to example sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py.

__init__(n_clusters=8, *, init='k-means++', n_init='auto', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')¶

fit(X, y=None, sample_weight=None)¶

Compute k-means clustering.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) –
The weights for each observation in X. If None, all observations are assigned equal weight. sample_weight is not used during initialization if init is a callable or a user provided array.

Added in version 0.20.

Returns:

self – Fitted estimator.

Return type:

object

fit_predict(X, y=None, sample_weight=None)¶

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

labels – Index of the cluster each sample belongs to.

Return type:

ndarray of shape (n_samples,)

fit_transform(X, y=None, sample_weight=None)¶

Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

X_new – X transformed in the new space.

Return type:

ndarray of shape (n_samples, n_clusters)

get_feature_names_out(input_features=None)¶

Get output feature names for transformation.

The feature names out will prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the feature names out are: [“class_name0”, “class_name1”, “class_name2”].

Parameters:: input_features (array-like of str or None, default=None) – Only used to validate feature names with the names seen in fit.
Returns:: feature_names_out – Transformed feature names.
Return type:: ndarray of str objects

get_metadata_routing()¶

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_params(deep=True)¶

Get parameters for this estimator.

Parameters:: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:: params – Parameter names mapped to their values.
Return type:: dict

predict(X)¶

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to predict.
Returns:: labels – Index of the cluster each sample belongs to.
Return type:: ndarray of shape (n_samples,)

score(X, y=None, sample_weight=None)¶

Opposite of the value of X on the K-means objective.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) – The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

score – Opposite of the value of X on the K-means objective.

Return type:

float

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KMeans¶

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
Returns:: self – The updated object.
Return type:: object

set_output(*, transform=None)¶

Set output container.

See sphx_glr_auto_examples_miscellaneous_plot_set_output.py for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

”default”: Default output format of a transformer
”pandas”: DataFrame output
”polars”: Polars output
None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:: **params (dict) – Estimator parameters.
Returns:: self – Estimator instance.
Return type:: estimator instance

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KMeans¶

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
Returns:: self – The updated object.
Return type:: object

transform(X)¶

Transform X to a cluster-distance space.

In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
Returns:: X_new – X transformed in the new space.
Return type:: ndarray of shape (n_samples, n_clusters)