Text classification and prediction using the Bag Of Words approach

by gk_
There are a number of approaches to text classification. In other articles I’ve covered Multinomial Naive Bayes and Neural Networks.
One of the simplest and most common approaches is called “Bag of Words.” It has been used by commercial analytics products including Clarabridge, Radian6, and others.
The approach is relatively simple: given a set of topics and a set of terms associated with each topic, determine which topic(s) exist within a document (for example, a sentence).
While other, more exotic algorithms also organize words into “bags,” in this technique we don’t create a model or apply mathematics to the way in which this “bag” intersects with a classified document. A document’s classification will be polymorphic, as it can be associated with multiple topics.
Does this seem too simple to be useful? Try it before you jump to conclusions. In NLP, a simple approach can often go a long way.
We will need three things:
- A topics/words definition file
- A classifier function
- A notebook to test our classifier
And then we will venture a bit further and build and test a predictive model using our classification data.
Topics and Words
Our definition file is in JSON format. We will use it to classify messages between patients and a nurse assigned to their care.
topics.json
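The original gist isn't embedded here, so below is a minimal illustrative definition. The "bruis" stem and the "*dpm" pattern come from this article; the topic names and the remaining terms are hypothetical stand-ins:

```json
{
  "medical": ["bruis", "infection", "medicat", "pain"],
  "appointments": ["appointment", "reschedul", "*dpm"],
  "money": ["bill", "payment", "insurance"],
  "thanks": ["thank", "appreciat", "grateful"]
}
```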
There are two items of note in this definition.
First, let's look at some terms. For example, "bruis" is a stem. It will cover supersets such as "bruise," "bruising," and so on. Second, terms containing * are actually patterns; for example, *dpm is a pattern for a numeric digit followed by "pm."
To keep things simple, we are only handling numeric pattern matching, but this could be expanded to a broader scope.
This ability to find patterns within a term is very useful when classifying documents containing dates, times, monetary values, and so on.
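As a tiny sketch of that numeric matching, assuming *dpm expands to "one or more digits followed by pm" (the exact expansion rule is an assumption):

```python
import re

# assumed expansion of the "*dpm" pattern term: digits followed by "pm"
pattern = re.compile(r"\d+pm")

print(bool(pattern.search("Can we meet at 3pm?")))   # True
print(bool(pattern.search("Can we meet at noon?")))  # False
```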
Let’s try out some classification.
The classifier returns a JSON result set containing the sentence(s) associated with each topic found in the message. A message can contain multiple sentences, and a sentence can be associated with none, one, or multiple topics.
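For example, classifying the message "My arm is bruising. See you at 3pm." against the illustrative definition above might yield a result shaped like this (the exact shape is an assumption, since the original output isn't shown):

```json
{
  "medical": ["My arm is bruising"],
  "appointments": ["See you at 3pm"]
}
```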
Let’s take a look at our classifier. The code is here.
msgClassify.py
The code is relatively straightforward, and includes a convenience function to split a document into sentences.
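Since the gist isn't embedded here, below is a minimal sketch of such a classifier. It treats plain terms as word-initial stems and expands *-prefixed pattern terms by mapping "d" to one or more digits; those details are assumptions about the original msgClassify.py:

```python
import json
import re

def split_sentences(document):
    # convenience function: naive split on sentence-ending punctuation
    return [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]

def term_to_regex(term):
    if term.startswith("*"):
        # pattern term: assume "d" stands for one or more digits,
        # so "*dpm" matches "3pm", "11pm", and so on
        return re.compile(term[1:].replace("d", r"\d+"), re.IGNORECASE)
    # plain term: a stem anchored at the start of a word,
    # so "bruis" matches "bruise", "bruising", etc.
    return re.compile(r"\b" + re.escape(term), re.IGNORECASE)

def classify(message, topics):
    # returns {topic: [sentences]} for every topic found in the message
    result = {}
    for sentence in split_sentences(message):
        for topic, terms in topics.items():
            if any(term_to_regex(t).search(sentence) for t in terms):
                result.setdefault(topic, []).append(sentence)
    return result

if __name__ == "__main__":
    with open("topics.json") as f:
        topics = json.load(f)
    result = classify("My arm is bruising. See you at 3pm.", topics)
    print(json.dumps(result, indent=2))
```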
Predictive Modeling
The aggregate classification for a set of documents associated with an outcome can be used to build a predictive model.
In this use-case, we wanted to see if we could predict hospitalizations based on the messages between patient and nurse prior to the incident. We compared messages for patients who did and did not incur hospitalizations.
You could use a similar technique for other types of messaging associated with some binary outcome.
This process takes a number of steps:
- A set of messages is classified, and each topic receives a count for the set. The result is a fixed list of topics with a % allocation from the messages (a sketch of this step follows the list).
- The topic allocation is then assigned a binary value: in our case, 0 if there was no hospitalization and 1 if there was.
- A Logistic Regression algorithm is used to build a predictive model.
- The model is used to predict the outcome from new input.
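A minimal sketch of the first step, assuming the classify() function from the sketch above and taking each topic's allocation as the fraction of a patient's messages in which the topic appears:

```python
def topic_allocation(messages, topics):
    # classify every message, then compute, per topic, the fraction
    # of messages in which that topic was found
    results = [classify(m, topics) for m in messages]
    n = len(results) or 1
    return {t: sum(1 for r in results if t in r) / n for t in topics}
```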
Let’s look at our input data. Your data should have a similar structure. We’re using a pandas DataFrame.
“incident” is the binary outcome, and it needs to be the first column in the input data.
Each subsequent column is a topic, holding the % of the patient's messages classified under that topic.
In row 0, we see that roughly a quarter of the messages for this patient are about the thanks topic, and none are about medical terms or money. Thus each row is a binary outcome and a messaging classification profile across topics.
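As an illustration (the topic columns and values here are hypothetical), such a DataFrame could be built like this:

```python
import pandas as pd

# "incident" is the binary outcome and must come first; the remaining
# columns are each patient's % allocation across the topics
df = pd.DataFrame([
    {"incident": 0, "thanks": 0.26, "medical": 0.00, "money": 0.00, "appointments": 0.12},
    {"incident": 1, "thanks": 0.05, "medical": 0.31, "money": 0.08, "appointments": 0.22},
])
print(df.head())
```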
Your input data will have different topics, different column labels, and a different binary condition, but otherwise will be a similar structure.
Let’s use scikit-learn to build a Logistic Regression and test our model.
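The original notebook isn't shown here; a minimal sketch, assuming the DataFrame layout above and the 50/50 shuffled split described below, might look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# first column is the binary outcome, the rest are topic allocations
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# shuffle, then hold out half of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=True, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```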
Here’s our output:
```
             precision    recall  f1-score   support

          0       0.66      0.69      0.67       191
          1       0.69      0.67      0.68       202

avg / total       0.68      0.68      0.68       393
```
The precision and recall of this model against the test data are in the high 60s: slightly better than a guess, and unfortunately not accurate enough to be of much value.
In this example, the amount of data was relatively small (a thousand patients, ~30 messages sampled per patient). Remember that only half of the data can be used for training, while the other half (after shuffling) is used to test.
By including structured data such as age, gender, condition, past incidents, and so on, we could strengthen our model and produce a stronger signal. Having more data would also be helpful as the number of training data columns is fairly large.
Try this with your structured/unstructured data and see if you can get a highly predictive model. You may not get the kind of precision that leads to automated actions, but a “risk” probability could be used as a filter or sorting function or as an early warning sign for human experts.
The "Bag of Words" approach is suited to certain kinds of text classification work, particularly where the language is not nuanced.
Enjoy.