Text classification and prediction using the Bag Of Words approach

by gk_
There are a number of approaches to text classification. In other articles I’ve covered Multinomial Naive Bayes and Neural Networks.
One of the simplest and most common approaches is called “Bag of Words.” It has been used by commercial analytics products including Clarabridge, Radian6, and others.
The approach is relatively simple: given a set of topics and a set of terms associated with each topic, determine which topic(s) exist within a document (for example, a sentence).
While other, more exotic algorithms also organize words into “bags,” in this technique we don’t create a model or apply mathematics to the way in which this “bag” intersects with a classified document. A document’s classification will be polymorphic, as it can be associated with multiple topics.
Does this seem too simple to be useful? Try it before you jump to conclusions. In NLP, a simple approach can often go a long way.
We will need three things:
- A topics/words definition file
- A classifier function
- A notebook to test our classifier
And then we will venture a bit further and build and test a predictive model using our classification data.
Topics and Words
Our definition file is in JSON format. We will use it to classify messages between patients and a nurse assigned to their care.
topics.json
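The original gist isn't embedded here, so below is a minimal illustrative definition. The "bruis" stem and the "*dpm" pattern come from this article; the topic names and the remaining terms are hypothetical stand-ins:

```json
{
  "medical": ["bruis", "infection", "medicat", "pain"],
  "appointments": ["appointment", "reschedul", "*dpm"],
  "money": ["bill", "payment", "insurance"],
  "thanks": ["thank", "appreciat", "grateful"]
}
```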
There are two items of note in this definition.
First, let's look at some terms. For example, "bruis" is a stem. It will cover supersets such as "bruise," "bruising," and so on. Second, terms containing * are actually patterns; for example, *dpm is a pattern for a numeric digit followed by "pm."
To keep things simple, we are only handling numeric pattern matching, but this could be expanded to a broader scope.
This ability to find patterns within a term is very useful when classifying documents containing dates, times, monetary values, and so on.
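As a tiny sketch of that numeric matching, assuming *dpm expands to "one or more digits followed by pm" (the exact expansion rule is an assumption):

```python
import re

# assumed expansion of the "*dpm" pattern term: digits followed by "pm"
pattern = re.compile(r"\d+pm")

print(bool(pattern.search("Can we meet at 3pm?")))   # True
print(bool(pattern.search("Can we meet at noon?")))  # False
```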
Let’s try out some classification.
The classifier returns a JSON result set containing the sentence(s) associated with each topic found in the message. A message can contain multiple sentences, and a sentence can be associated with none, one, or multiple topics.
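For example, classifying the message "My arm is bruising. See you at 3pm." against the illustrative definition above might yield a result shaped like this (the exact shape is an assumption, since the original output isn't shown):

```json
{
  "medical": ["My arm is bruising"],
  "appointments": ["See you at 3pm"]
}
```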
Let’s take a look at our classifier. The code is here.
msgClassify.py
The code is relatively straightforward, and includes a convenience function to split a document into sentences.
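Since the gist isn't embedded here, below is a minimal sketch of such a classifier. It treats plain terms as word-initial stems and expands *-prefixed pattern terms by mapping "d" to one or more digits; those details are assumptions about the original msgClassify.py:

```python
import json
import re

def split_sentences(document):
    # convenience function: naive split on sentence-ending punctuation
    return [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]

def term_to_regex(term):
    if term.startswith("*"):
        # pattern term: assume "d" stands for one or more digits,
        # so "*dpm" matches "3pm", "11pm", and so on
        return re.compile(term[1:].replace("d", r"\d+"), re.IGNORECASE)
    # plain term: a stem anchored at the start of a word,
    # so "bruis" matches "bruise", "bruising", etc.
    return re.compile(r"\b" + re.escape(term), re.IGNORECASE)

def classify(message, topics):
    # returns {topic: [sentences]} for every topic found in the message
    result = {}
    for sentence in split_sentences(message):
        for topic, terms in topics.items():
            if any(term_to_regex(t).search(sentence) for t in terms):
                result.setdefault(topic, []).append(sentence)
    return result

if __name__ == "__main__":
    with open("topics.json") as f:
        topics = json.load(f)
    result = classify("My arm is bruising. See you at 3pm.", topics)
    print(json.dumps(result, indent=2))
```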
Predictive Modeling
The aggregate classification for a set of documents associated with an outcome can be used to build a predictive model.
In this use-case, we wanted to see if we could predict hospitalizations based on the messages between patient and nurse prior to the incident. We compared messages for patients who did and did not incur hospitalizations.
You could use a similar technique for other types of messaging associated with some binary outcome.
This process takes a number of steps:
- A set of messages is classified, and each topic receives a count for the set. The result is a fixed list of topics with a % allocation from the messages (a sketch of this step follows the list).
- The topic allocation is then assigned a binary value: in our case, 0 if there was no hospitalization and 1 if there was.
- A Logistic Regression algorithm is used to build a predictive model.
- The model is used to predict the outcome from new input.
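A minimal sketch of the first step, assuming the classify() function from the sketch above and taking each topic's allocation as the fraction of a patient's messages in which the topic appears:

```python
def topic_allocation(messages, topics):
    # classify every message, then compute, per topic, the fraction
    # of messages in which that topic was found
    results = [classify(m, topics) for m in messages]
    n = len(results) or 1
    return {t: sum(1 for r in results if t in r) / n for t in topics}
```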
Let’s look at our input data. Your data should have a similar structure. We’re using a pandas DataFrame.
“incident” is the binary outcome, and it needs to be the first column in the input data.
Each subsequent column is a topic, holding the % of the patient's messages classified under that topic.
In row 0, we see that roughly a quarter of the messages for this patient are about the thanks topic, and none are about medical terms or money. Thus each row is a binary outcome and a messaging classification profile across topics.
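As an illustration (the topic columns and values here are hypothetical), such a DataFrame could be built like this:

```python
import pandas as pd

# "incident" is the binary outcome and must come first; the remaining
# columns are each patient's % allocation across the topics
df = pd.DataFrame([
    {"incident": 0, "thanks": 0.26, "medical": 0.00, "money": 0.00, "appointments": 0.12},
    {"incident": 1, "thanks": 0.05, "medical": 0.31, "money": 0.08, "appointments": 0.22},
])
print(df.head())
```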
Your input data will have different topics, different column labels, and a different binary condition, but otherwise will be a similar structure.
Let’s use scikit-learn to build a Logistic Regression and test our model.
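The original notebook isn't shown here; a minimal sketch, assuming the DataFrame layout above and the 50/50 shuffled split described below, might look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# first column is the binary outcome, the rest are topic allocations
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# shuffle, then hold out half of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=True, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```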
Here’s our output:
```
             precision    recall  f1-score   support

          0       0.66      0.69      0.67       191
          1       0.69      0.67      0.68       202

avg / total       0.68      0.68      0.68       393
```
The precision and recall of this model against the test data are in the high 60s: slightly better than a guess, and unfortunately not accurate enough to be of much value.
In this example, the amount of data was relatively small (a thousand patients, ~30 messages sampled per patient). Remember that only half of the data can be used for training, while the other half (after shuffling) is used to test.
By including structured data such as age, gender, condition, past incidents, and so on, we could strengthen our model and produce a stronger signal. Having more data would also be helpful as the number of training data columns is fairly large.
Try this with your structured/unstructured data and see if you can get a highly predictive model. You may not get the kind of precision that leads to automated actions, but a “risk” probability could be used as a filter or sorting function or as an early warning sign for human experts.
The "Bag of Words" approach is suited to certain kinds of text classification work, particularly where the language is not nuanced.
Enjoy.