手撕机器学习算法 | 聚类算法 | Kmeans

基于Python（Numpy）和C++（Eigen）手撕常见的机器学习算法中的核心逻辑

谭贤芳

499人浏览 · 2024-07-16 18:31:10

谭贤芳 · 2024-07-16 18:31:10 发布

❗写在前面 ❗：

… … … … … … … … . … … …
1️⃣ 提供两种语言实现：本文提供两种语言实现，分别为Python和C++，分别依赖的核心库为Numpy和Eigen.

2️⃣ 提供关键步的公式推导：写出详尽的公式推导太花时间，所以只提供了关键步骤的公式推导。更详尽的推导，可移步其他博文或相关书籍.

3️⃣ 重在逻辑实现：文中的代码主要以实现算法的逻辑为主，在细节优化上面，并没有做更多的考虑，如果发现有不完善的地方，欢迎指出.

🌞 欢迎关注专栏，相应的内容会持续更新，祝大家变得更强！！
… … … … … … … … . … … …

1、原理推导

聚类分析（cluster analysis）是一类经典的无监督学习算法。在给定样本的情况下，聚类分析通过度量特征相似度或者距离，将样本自动划分为若干类别。

Kmeans是基于距离聚类中具有代表性的聚类算法。它的原理非常简单：

Kmeans聚类过程
（1）初始化簇中心。在刚开始迭代的时候，随机选择k个样本作为初始化的簇中心。

（2）按照样本与每个簇中心的距离进行聚类。将样本划分到与之距离最近的簇中。

（3）根据划分的结果，计算新的簇中心。对每簇中的样本取均值，当作下一次迭代的簇中心。

（4）如果迭代收敛或者满足迭代停止的条件，保存最终的簇中心结果。

在进行预测的时候只需要判断当前样本与保存的簇中心距离，取最近的距离为其所属类别即可。

这里补充一些距离以及相似度的计算方法（仅Python实现）：

import numpy as np

class SimilarityMeasure:

    # 欧氏距离
    def euclidean_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    # 曼哈顿距离
    def manhattan_distance(self, x1, x2):
        return np.sum(np.abs(x1 - x2))

    # 余弦相似
    def angle_cosine(self, x1, x2):
        dot_product = np.dot(x1, x2)
        norm_x1 = np.linalg.norm(x1)
        norm_x2 = np.linalg.norm(x2)
        return dot_product / (norm_x1 * norm_x2)

    # 相关性系数
    def correlation_coefficient(self, x1, x2):
        return np.corrcoef(x1, x2)[0, 1]

    # 马哈拉诺比斯距离
    def mahalanobis_distance(self, x1, x2):
        try:
            cov_inv = np.linalg.inv(np.cov(x1, x2))
            diff = x1 - x2
            return np.sqrt(np.dot(np.dot(diff, cov_inv), diff.T))
        except np.linalg.LinAlgError:
            return None

    # 切比雪夫距离
    def chebyshev_distance(self, x1, x2):
        return np.max(np.abs(x1 - x2))

2、算法实现

🌈Python实现

import numpy as np

class Kmeans:

    def __init__(self, k=2, max_iter=1000, distance_func=SimilarityMeasure().euclidean_distance):
        """
        :param k: 聚类的簇数
        :param max_iter: 最大迭代次数
        :param distance_func: 距离的计算方式，默认为欧氏距离
        """
        self.k = k
        self.max_iter = max_iter
        self.cluster_centers = None   # 簇中心
        self.calc_distance = distance_func  

    def fit(self, X):

        m, n = X.shape

        # 初始化随机的簇中心
        init_cluster_indices = np.random.choice(m, self.k, replace=False)
        init_cluster_centers = X[init_cluster_indices]

        for _ in range(self.max_iter):

            # 分配样本到最近的簇
            clusters = [[] for _ in range(self.k)]

            for x1 in X:
                min_distance = np.inf
                cluster_idx = None
                for i, x2 in enumerate(init_cluster_centers):
                    distance = self.calc_distance(x1, x2)

                    if distance < min_distance:
                        min_distance = distance
                        cluster_idx = i
                clusters[cluster_idx].append(x1)

            # 保存当前簇中心
            cur_cluster_centers = np.array(init_cluster_centers)

            # 计算新的簇中心
            init_cluster_centers = [np.mean(cluster, axis=0) for cluster in clusters]

            # 检查收敛性
            if np.linalg.norm(np.array(init_cluster_centers) - cur_cluster_centers) == 0:
                self.cluster_centers = np.array(init_cluster_centers)
                break

        self.cluster_centers = np.array(init_cluster_centers)

    def predict(self, X):

        y_pred = []
        for x1 in X:
            min_distance = np.inf
            cluster_idx = None
            for i, x2 in enumerate(self.cluster_centers):
                distance = self.calc_distance(x1, x2)

                if distance < min_distance:
                    min_distance = distance
                    cluster_idx = i
            y_pred.append(cluster_idx)
        return y_pred

# 演示
np.random.seed(42)
X = np.random.rand(20, 2)
kmeans = Kmeans(k=3, max_iter=100, distance_func=SimilarityMeasure().euclidean_distance)
kmeans.fit(X)
y_pred = kmeans.predict(X)

print("簇中心:\n", kmeans.cluster_centers)
print("预测的簇:\n", y_pred)

结果

簇中心:
 [[0.44564467 0.18482996]
 [0.12956495 0.93392146]
 [0.56328    0.62563539]]
预测的簇:
 [1, 2, 0, 1, 2, 1, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 1, 2, 0, 2]

⚡C++实现

太简单了…略…有人需要再加…

3、总结

k均值聚类算法是众多聚类算法中最常用的代表性算法。作为一种动态迭代的算法，k均值聚类算法的主要步骤包括质心初始化、根据距离度量进行初步聚类、根据聚类结果计算新的质心、不断迭代聚类直至满足停止条件。这种不断动态迭代优化的算法，从思想上看有点像EM算法

技术共进，成长同行——讯飞AI开发者社区

更多推荐

论文笔记：AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models（AlphaEdit）

论文发表于人工智能顶会ICLR（基于定位和修改的模型编辑方法（针对和等）会破坏LLM中最初保存的知识，特别是在顺序编辑场景。为此，本文提出AlphaEdit：1、在将保留知识应用于参数之前，将扰动投影到保留知识的零空间上。2、从理论上证明，这种预测确保了在查询保留的知识时，编辑后的LLM的输出保持不变，从而减轻中断问题。3、对各种LLM（包括LLaMA3、GPT2XL和GPT-J）的广泛实验表明，