AI / Machine Learning Fundamentals — Clustering (Performance Measures & Distance Calculation)
Clustering
Performance Measures
Target: the clustering result should have high intra-cluster similarity and low inter-cluster similarity.
-
External Index
Compares the clustering result against a reference model (e.g. ground-truth labels).
$$
\begin{array}{ll}
a=|SS|, & SS=\{(\boldsymbol{x}_i, \boldsymbol{x}_j) \mid \lambda_i=\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\} \\
b=|SD|, & SD=\{(\boldsymbol{x}_i, \boldsymbol{x}_j) \mid \lambda_i=\lambda_j,\ \lambda_i^*\neq\lambda_j^*,\ i<j\} \\
c=|DS|, & DS=\{(\boldsymbol{x}_i, \boldsymbol{x}_j) \mid \lambda_i\neq\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\} \\
d=|DD|, & DD=\{(\boldsymbol{x}_i, \boldsymbol{x}_j) \mid \lambda_i\neq\lambda_j,\ \lambda_i^*\neq\lambda_j^*,\ i<j\}
\end{array}
$$
$\lambda^*$ denotes cluster labels under the reference model. Each sample pair falls into exactly one of the four sets, so $a+b+c+d = m(m-1)/2$. For the external indices below, larger values are better.
-
Jaccard Coefficient (JC)
$$JC = \frac{a}{a+b+c}$$
-
Fowlkes and Mallows Index (FMI)
$$FMI = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}$$
-
Rand Index (RI)
$$RI = \frac{2(a+d)}{m(m-1)}$$
-
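The pair-counting definitions above translate directly into code. Below is a minimal sketch (the names `pair_counts` and `external_indices` are my own), assuming two flat label arrays and a straightforward O(m²) loop over sample pairs:

```python
import numpy as np

def pair_counts(labels, ref_labels):
    """Count a=|SS|, b=|SD|, c=|DS|, d=|DD| over all sample pairs i < j,
    comparing a clustering (labels) against a reference model (ref_labels)."""
    labels, ref = np.asarray(labels), np.asarray(ref_labels)
    m = len(labels)
    a = b = c = d = 0
    for i in range(m):
        for j in range(i + 1, m):
            same = labels[i] == labels[j]   # same cluster in the result?
            same_ref = ref[i] == ref[j]     # same cluster in the reference?
            if same and same_ref:
                a += 1
            elif same:
                b += 1
            elif same_ref:
                c += 1
            else:
                d += 1
    return a, b, c, d

def external_indices(labels, ref_labels):
    """Return (JC, FMI, RI); each lies in [0, 1], larger is better."""
    a, b, c, d = pair_counts(labels, ref_labels)
    m = len(labels)
    jc = a / (a + b + c)
    fmi = float(np.sqrt((a / (a + b)) * (a / (a + c))))
    ri = 2 * (a + d) / (m * (m - 1))
    return jc, fmi, ri
```

A perfect match against the reference gives JC = FMI = RI = 1. Note JC and FMI are undefined when their denominators are zero (no pair ever shares a cluster).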
Internal Index
Evaluates the clustering result directly, without any reference model.
$$
\begin{aligned}
\operatorname{avg}(C) &= \frac{2}{|C|(|C|-1)} \sum_{1 \leqslant i<j \leqslant |C|} \operatorname{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) \\
\operatorname{diam}(C) &= \max_{1 \leqslant i<j \leqslant |C|} \operatorname{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) \\
d_{\min}(C_i, C_j) &= \min_{\boldsymbol{x}_i \in C_i,\, \boldsymbol{x}_j \in C_j} \operatorname{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) \\
d_{\mathrm{cen}}(C_i, C_j) &= \operatorname{dist}(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j)
\end{aligned}
$$
$\operatorname{avg}(C)$ is the average distance between samples in cluster $C$, $\operatorname{diam}(C)$ is the largest distance between samples in $C$, $d_{\min}(C_i, C_j)$ is the distance between the closest samples of the two clusters, and $d_{\mathrm{cen}}(C_i, C_j)$ is the distance between the two cluster centroids.
-
Davies-Bouldin Index (DBI)
$$\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\operatorname{avg}(C_i) + \operatorname{avg}(C_j)}{d_{\mathrm{cen}}(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j)} \right)$$
Smaller is better.
-
Dunn Index (DI)
$$\mathrm{DI} = \min_{1 \leqslant i \leqslant k} \left\{ \min_{j \neq i} \left( \frac{d_{\min}(C_i, C_j)}{\max_{1 \leqslant l \leqslant k} \operatorname{diam}(C_l)} \right) \right\}$$
Larger is better.
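Under the distance definitions above, DBI and DI can be sketched as follows. This is a minimal sketch assuming Euclidean $\operatorname{dist}(\cdot,\cdot)$, with each cluster passed as a NumPy array of row vectors containing at least two samples; all function names are my own:

```python
import numpy as np
from itertools import combinations

def dist(x, y):
    # Euclidean distance as the underlying dist(., .)
    return float(np.linalg.norm(x - y))

def avg(C):
    # average pairwise distance within cluster C (|C| >= 2 assumed)
    n = len(C)
    s = sum(dist(C[i], C[j]) for i, j in combinations(range(n), 2))
    return 2 * s / (n * (n - 1))

def diam(C):
    # largest pairwise distance within cluster C
    return max(dist(C[i], C[j]) for i, j in combinations(range(len(C)), 2))

def d_min(Ci, Cj):
    # distance between the closest samples of the two clusters
    return min(dist(x, y) for x in Ci for y in Cj)

def d_cen(Ci, Cj):
    # distance between the two cluster centroids
    return dist(np.mean(Ci, axis=0), np.mean(Cj, axis=0))

def dbi(clusters):
    """Davies-Bouldin Index: smaller is better."""
    k = len(clusters)
    return sum(
        max((avg(clusters[i]) + avg(clusters[j])) / d_cen(clusters[i], clusters[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

def di(clusters):
    """Dunn Index: larger is better."""
    k = len(clusters)
    max_diam = max(diam(C) for C in clusters)
    return min(
        min(d_min(clusters[i], clusters[j]) for j in range(k) if j != i)
        for i in range(k)
    ) / max_diam
```

For two tight clusters far apart, DBI is small and DI is large, matching the "smaller/larger is better" directions above.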
-
Distance Calculation
dist(·,·)
Basic axioms of a distance measure: non-negativity, identity of indiscernibles, symmetry, and the triangle inequality.
-
Minkowski Distance
Applicable to ordinal attributes.
$$dist_{mk}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left( \sum_{u=1}^{n} |x_{iu} - x_{ju}|^p \right)^{\frac{1}{p}}, \quad p \geqslant 1$$
That is, the $L_p$ norm of $\boldsymbol{x}_i - \boldsymbol{x}_j$: $p=2$ gives the Euclidean distance and $p=1$ the Manhattan distance.
-
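The $L_p$ distance can be sketched in a few lines, assuming plain numeric vectors (for vectors this is equivalent to `np.linalg.norm(x - y, ord=p)`):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance: the L_p norm of x - y (requires p >= 1).
    p = 2 is the Euclidean distance, p = 1 the Manhattan distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))
```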
VDM (Value Difference Metric)
Applicable to non-ordinal attributes.
$$VDM_p(a, b) = \sum_{i=1}^{k} \left| \frac{m_{u,a,i}}{m_{u,a}} - \frac{m_{u,b,i}}{m_{u,b}} \right|^p$$
where $m_{u,a}$ is the number of samples taking value $a$ on attribute $u$, $m_{u,a,i}$ is the number of samples in the $i$-th cluster taking value $a$ on attribute $u$, and $k$ is the number of clusters.
-
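A sketch of $VDM_p$ for a single non-ordinal attribute, given that attribute's column of values and each sample's cluster assignment (the function name and argument layout are my own):

```python
import numpy as np

def vdm(column, clusters, a, b, p=2):
    """VDM_p(a, b) for one non-ordinal attribute u.
    column:   value of attribute u for every sample
    clusters: cluster label of every sample"""
    column = np.asarray(column)
    clusters = np.asarray(clusters)
    m_a = np.sum(column == a)    # m_{u,a}: samples with value a
    m_b = np.sum(column == b)    # m_{u,b}: samples with value b
    total = 0.0
    for c in np.unique(clusters):                       # sum over the k clusters
        m_ai = np.sum((column == a) & (clusters == c))  # m_{u,a,i}
        m_bi = np.sum((column == b) & (clusters == c))  # m_{u,b,i}
        total += abs(m_ai / m_a - m_bi / m_b) ** p
    return total
```

Intuitively, two values are close under VDM when they are distributed across the clusters in similar proportions.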
The two distance measures above can be combined to handle mixed attributes. Assuming $n_c$ ordinal attributes and $n - n_c$ non-ordinal attributes, with the ordinal attributes placed first:
$$\operatorname{MinkovDM}_p(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left( \sum_{u=1}^{n_c} |x_{iu} - x_{ju}|^p + \sum_{u=n_c+1}^{n} VDM_p(x_{iu}, x_{ju}) \right)^{\frac{1}{p}}$$
-
When different attributes in the sample space differ in importance, a "weighted distance" can be used, e.g.
$$dist_{wmk}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left( w_1 \cdot |x_{i1} - x_{j1}|^p + \cdots + w_n \cdot |x_{in} - x_{jn}|^p \right)^{\frac{1}{p}}$$
where the weights $w_i \geqslant 0$ reflect the importance of each attribute.
-
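The weighted variant only scales each attribute's term before summing; a sketch under the same assumptions as the plain Minkowski distance above:

```python
import numpy as np

def weighted_minkowski(x, y, w, p=2):
    """Weighted Minkowski distance; w[i] >= 0 weights attribute i."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    return float(np.sum(w * np.abs(x - y) ** p) ** (1.0 / p))
```

With all weights equal to 1 this reduces to the unweighted Minkowski distance; a zero weight removes an attribute from the distance entirely.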
Non-metric Distance
Distance measures that do not satisfy all the axioms above (often the triangle inequality).
- Can be obtained via Distance Metric Learning.