Implement a k-means algorithm and a Gaussian mixture model, and use the EM algorithm to estimate the parameters in the model.
The k-means clustering algorithm is an iterative cluster-analysis algorithm. Its steps are: pre-divide the data into K groups by randomly selecting K objects as the initial cluster centers, then compute the distance between each object and each cluster center, and assign each object to the nearest center. A cluster center together with the objects assigned to it represents a cluster. Once all objects have been assigned, the center of each cluster is recalculated from the objects currently in it. This process repeats until a termination condition is met, which can be any of the following:
- no objects (or fewer than some minimum number) are reassigned to a different cluster;
- no cluster centers (or fewer than some minimum number) change;
- the sum of squared errors reaches a local minimum.
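A minimal NumPy sketch of this procedure (the function and variable names are illustrative, not taken from the original code):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal k-means on an (N, D) array X; returns labels and centers."""
    rng = np.random.default_rng(seed)
    # Initialization: randomly pick K distinct samples as the initial centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample goes to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its samples
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        # Termination condition: no cluster center changes any more.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Updating each center to the mean of its assigned samples is what makes the sum of squared errors non-increasing from iteration to iteration.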
GMM, the Gaussian Mixture Model, is also abbreviated MoG (mixture of Gaussians). The Gaussian model uses the Gaussian probability density function (the normal distribution curve) to quantify the data precisely, decomposing one distribution into several component models, each based on a Gaussian probability density function.
Compared with K-means, GMM is an application of the more general EM algorithm. Unlike K-means, the GMM algorithm does not use the "closest distance" rule to hard-assign a class to each sample in the E step; instead it introduces a latent variable γ. γ is an (N, K) matrix, where γ[n, k] is the probability that the n-th sample belongs to the k-th class, so γ is normalized: the elements of each row sum to 1.
The GMM algorithm uses a Gaussian mixture model to describe the distribution of the samples: in a multi-class situation, a single Gaussian distribution cannot describe the data distribution, and a simple superposition of several Gaussian distributions also fails to do so; only a mixture of Gaussian distributions can adequately describe a set of samples generated by multiple Gaussian models. For any data point in the sample, each Gaussian component generates that point with a different probability, i.e. each component contributes a different amount to generating the point, but these contributions are certain to sum to 1.
Gaussian distribution parameters:
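In the standard notation, each component k of a K-component mixture is parameterized by a mixing weight $\pi_k$, a mean $\mu_k$, and a covariance $\Sigma_k$, and the mixture density is:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1,$$

where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the multivariate normal density.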
Step E:
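In the E step, the responsibility matrix γ described above is computed from the current parameter estimates; the standard update is:

$$\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)},$$

which is normalized so that each row of γ sums to 1, as noted above.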
Step M:
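In the M step, the parameters are re-estimated from the responsibilities. Writing $N_k = \sum_{n=1}^{N} \gamma_{nk}$ for the effective number of samples in class k, the standard updates are:

$$\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,x_n, \qquad \Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,(x_n-\mu_k)(x_n-\mu_k)^{\top}, \qquad \pi_k = \frac{N_k}{N}.$$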
Maximum Likelihood Estimation:
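Each E/M iteration does not decrease the log-likelihood of the observed data, which is the quantity being maximized:

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\!\left(\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right).$$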
Self-generated data, K-means:
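A minimal sketch of such an experiment, assuming the data are drawn from a few Gaussian blobs and using scikit-learn for convenience (all parameters below are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Self-generated data: samples drawn from 3 Gaussian blobs (illustrative parameters).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.2, random_state=0)

# Fit k-means and read off the hard assignments and final centers.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=10)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c='red', marker='x', s=120)
plt.title('k-means on self-generated data')
plt.show()
```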
Self-generated data, GMM:
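Correspondingly, a sketch of the GMM experiment using scikit-learn's EM-based GaussianMixture (again with illustrative parameters; the confidence ellipses in the figures would be drawn from each component's mean and covariance):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Same kind of self-generated data as in the k-means experiment.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.2, random_state=0)

# GaussianMixture runs EM internally: the E step computes responsibilities,
# the M step re-estimates weights, means, and covariances.
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)

gamma = gmm.predict_proba(X)   # (N, K) responsibility matrix; rows sum to 1
labels = gamma.argmax(axis=1)  # hard assignment, e.g. for plotting
print(gmm.weights_)            # mixing coefficients pi_k
print(gmm.means_)              # component means mu_k
print(gmm.covariances_)        # component covariances Sigma_k
```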
The figures show the sample classification and the confidence ellipse of each component at each stage.
The transparent, shaded confidence ellipses shown at the end are the true ellipses of the distribution that generated the data.