[Natural Language Processing] TF-IDF in sklearn: TfidfVectorizer


Legolas~

667 views · Published 2020-08-17 23:56:49


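The definitions of tf and idf are not repeated here, since plenty of blog posts already explain them. This post only covers how TfidfVectorizer actually computes them, which differs slightly from the usual textbook formulas.
Let's start with an example: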

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

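# norm=None switches off normalization so the raw count * idf values are visible
vectorizer = TfidfVectorizer(norm=None)
x = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(x.toarray())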

For reference, the constructor signature of TfidfVectorizer:

def __init__(self, input='content', encoding='utf-8',
             decode_error='strict', strip_accents=None, lowercase=True,
             preprocessor=None, tokenizer=None, analyzer='word',
             stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
             ngram_range=(1, 1), max_df=1.0, min_df=1,
             max_features=None, vocabulary=None, binary=False,
             dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,
             sublinear_tf=False):

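The most commonly used parameters of TfidfVectorizer are norm='l2', use_idf=True, smooth_idf=True, and sublinear_tf=False. The values shown are the defaults, and changing them changes how the result is computed.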
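smooth_idf=False:
$$\mathrm{idf}(t) = \ln\frac{n}{\mathrm{df}(t)} + 1$$
where n is the total number of documents and df(t) is the number of documents that contain term t.
The official documentation writes log, and at first I did not realize that it means ln (the natural logarithm). Many of the blog posts I read did not mention this either; I only found out by repeatedly comparing the program's output against the formula.
smooth_idf=True:
$$\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$$

Note also that the final tf-idf is the raw count of a term in its document multiplied by idf, not the relative frequency multiplied by idf.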
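The count-times-idf rule can be checked by hand. Below is a minimal sketch using only the standard library. The `tokenize` helper is a simplified, hypothetical stand-in for the default analyzer (lowercase, tokens of two or more characters), and the names `docs`, `df`, and `idf` are chosen here for illustration; none of this is scikit-learn's own code.

```python
import math
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Simplified stand-in for the default analyzer: lowercase, treat
    # punctuation as a separator, keep tokens of two or more characters.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [tok for tok in cleaned.split() if len(tok) >= 2]

docs = [Counter(tokenize(d)) for d in corpus]   # raw term counts per document
n = len(docs)                                   # total number of documents
vocab = sorted({t for d in docs for t in d})
df = {t: sum(1 for d in docs if t in d) for t in vocab}  # document frequency

def idf(term, smooth=True):
    # Natural logarithm in both variants, as described above.
    if smooth:
        return math.log((1 + n) / (1 + df[term])) + 1
    return math.log(n / df[term]) + 1

# tf-idf of the first document: raw count * idf, no normalization
first_row = [docs[0][t] * idf(t) for t in vocab]
print(vocab)
print([round(v, 8) for v in first_row])
```

For this corpus the printed row should agree with what TfidfVectorizer(norm=None) returns for the first document, which suggests that the raw count, not the relative frequency, is what gets multiplied by idf.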
norm='l2': L2 normalization. For the example at the start of this post (run with norm=None), the program produces the following tf-idf values:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

[[0.         1.22314355 1.51082562 1.         0.         0.
  1.         0.         1.        ]
 [0.         2.4462871  0.         1.         0.         1.91629073
  1.         0.         1.        ]
 [1.91629073 0.         0.         1.         1.91629073 0.
  1.         1.91629073 1.        ]
 [0.         1.22314355 1.51082562 1.         0.         0.
  1.         0.         1.        ]]

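Take the first row as an example. L2 normalization divides each element by the Euclidean norm of the row:
$$\|v\|_2 = \sqrt{1.22314355^2 + 1.51082562^2 + 1^2 + 1^2 + 1^2} \approx 2.6036$$
For instance, $1.22314355 / 2.6036 \approx 0.4698$.
The final result is: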

[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
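The normalization step can be reproduced in a few lines. This is a minimal sketch using only the standard library; the `row` values are copied from the unnormalized output shown earlier, and the variable names are chosen here for illustration.

```python
import math

# First row of the unnormalized tf-idf matrix (norm=None), copied from above
row = [0.0, 1.22314355, 1.51082562, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]

# Euclidean (L2) norm of the row
l2_norm = math.sqrt(sum(v * v for v in row))

# Divide every element by the norm so the row has unit length
normalized = [v / l2_norm for v in row]
print([round(v, 8) for v in normalized])
```

After this step the squared entries of the row sum to 1, which is what norm='l2' guarantees for every document vector.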
