使用PaddleNLP语义预训练模型ERNIE完成快递单信息抽取
注意

本项目代码需要使用GPU环境来运行:

 

命名实体识别是NLP中一项非常基础的任务,是信息提取、问答系统、句法分析、机器翻译等众多NLP任务的重要基础工具。命名实体识别的准确度,决定了下游任务的效果,是NLP中的一个基础问题。在NER任务提供了两种解决方案,一类LSTM/GRU + CRF,RNN类的模型来抽取底层文本的信息,而CRF(条件随机场)模型来学习底层Token之间的联系;另外一类是通过预训练模型,例如ERNIE,BERT模型,直接来预测Token的标签信息。

本项目将演示,如何使用PaddleNLP语义预训练模型ERNIE完成从快递单中抽取姓名、电话、省、市、区、详细地址等内容,形成结构化信息。辅助物流行业从业者进行有效信息的提取,从而降低客户填单的成本。

在2017年之前,工业界和学术界对NLP文本处理依赖于序列模型Recurrent Neural Network (RNN).1:RNN示意图

基于BiGRU+CRF的快递单信息抽取项目介绍了如何使用序列模型完成快递单信息抽取任务。 

近年来随着深度学习的发展,模型参数的数量飞速增长。为了训练这些参数,需要更大的数据集来避免过拟合。然而,对于大部分NLP任务来说,构建大规模的标注数据集非常困难(成本过高),特别是对于句法和语义相关的任务。相比之下,大规模的未标注语料库的构建则相对容易。为了利用这些数据,我们可以先从其中学习到一个好的表示,再将这些表示应用到其他任务中。最近的研究表明,基于大规模未标注语料库的预训练模型(Pretrained Models, PTM) 在NLP任务上取得了很好的表现。

近年来,大量的研究表明基于大型语料库的预训练模型(Pretrained Models, PTM)可以学习通用的语言表示,有利于下游NLP任务,同时能够避免从零开始训练模型。随着计算能力的发展,深度模型的出现(即 Transformer)和训练技巧的增强使得 PTM 不断发展,由浅变深。

 

图2:预训练模型一览,图片来源于:https://github.com/thunlp/PLMpapers

本示例展示了以ERNIE(Enhanced Representation through Knowledge Integration)代表的预训练模型如何Finetune完成序列标注任务。

记得给PaddleNLP点个小小的Star⭐

开源不易,希望大家多多支持~

GitHub地址:https://github.com/PaddlePaddle/PaddleNLP 

AI Studio平台后续会默认安装PaddleNLP,在此之前可使用如下命令安装。

In [1]
!pip install --upgrade paddlenlp\>=2.0.0rc18 -i https://pypi.org/simple
Requirement already up-to-date: paddlenlp>=2.0.0rc18 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.0.0rc18)
Requirement already satisfied, skipping upgrade: visualdl in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp>=2.0.0rc18) (2.1.1)
Requirement already satisfied, skipping upgrade: colorama in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp>=2.0.0rc18) (0.4.4)
Requirement already satisfied, skipping upgrade: colorlog in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp>=2.0.0rc18) (4.1.0)
Requirement already satisfied, skipping upgrade: h5py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp>=2.0.0rc18) (2.9.0)
Requirement already satisfied, skipping upgrade: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp>=2.0.0rc18) (0.42.1)
Requirement already satisfied, skipping upgrade: seqeval in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp>=2.0.0rc18) (1.2.2)
Requirement already satisfied, skipping upgrade: Pillow>=7.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (7.1.2)
Requirement already satisfied, skipping upgrade: requests in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (2.22.0)
Requirement already satisfied, skipping upgrade: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (1.16.4)
Requirement already satisfied, skipping upgrade: six>=1.14.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (1.15.0)
Requirement already satisfied, skipping upgrade: Flask-Babel>=1.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (1.0.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.11.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (3.14.0)
Requirement already satisfied, skipping upgrade: pre-commit in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (1.21.0)
Requirement already satisfied, skipping upgrade: bce-python-sdk in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (0.8.53)
Requirement already satisfied, skipping upgrade: flake8>=3.7.9 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (3.8.2)
Requirement already satisfied, skipping upgrade: flask>=1.1.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (1.1.1)
Requirement already satisfied, skipping upgrade: shellcheck-py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp>=2.0.0rc18) (0.7.1.1)
Requirement already satisfied, skipping upgrade: scikit-learn>=0.21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp>=2.0.0rc18) (0.22.1)
Requirement already satisfied, skipping upgrade: idna<2.9,>=2.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests->visualdl->paddlenlp>=2.0.0rc18) (2.8)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests->visualdl->paddlenlp>=2.0.0rc18) (2019.9.11)
Requirement already satisfied, skipping upgrade: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests->visualdl->paddlenlp>=2.0.0rc18) (3.0.4)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests->visualdl->paddlenlp>=2.0.0rc18) (1.25.6)
Requirement already satisfied, skipping upgrade: Jinja2>=2.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Flask-Babel>=1.0.0->visualdl->paddlenlp>=2.0.0rc18) (2.10.3)
Requirement already satisfied, skipping upgrade: pytz in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Flask-Babel>=1.0.0->visualdl->paddlenlp>=2.0.0rc18) (2019.3)
Requirement already satisfied, skipping upgrade: Babel>=2.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Flask-Babel>=1.0.0->visualdl->paddlenlp>=2.0.0rc18) (2.8.0)
Requirement already satisfied, skipping upgrade: nodeenv>=0.11.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp>=2.0.0rc18) (1.3.4)
Requirement already satisfied, skipping upgrade: aspy.yaml in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp>=2.0.0rc18) (1.3.0)
Requirement already satisfied, skipping upgrade: pyyaml in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp>=2.0.0rc18) (5.1.2)
Requirement already satisfied, skipping upgrade: toml in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp>=2.0.0rc18) (0.10.0)
Requirement already satisfied, skipping upgrade: virtualenv>=15.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp>=2.0.0rc18) (16.7.9)
Requirement already satisfied, skipping upgrade: identify>=1.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp>=2.0.0rc18) (1.4.10)
Requirement already satisfied, skipping upgrade: cfgv>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp>=2.0.0rc18) (2.0.1)
Requirement already satisfied, skipping upgrade: importlib-metadata; python_version < "3.8" in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp>=2.0.0rc18) (0.23)
Requirement already satisfied, skipping upgrade: future>=0.6.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from bce-python-sdk->visualdl->paddlenlp>=2.0.0rc18) (0.18.0)
Requirement already satisfied, skipping upgrade: pycryptodome>=3.8.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from bce-python-sdk->visualdl->paddlenlp>=2.0.0rc18) (3.9.9)
Requirement already satisfied, skipping upgrade: pycodestyle<2.7.0,>=2.6.0a1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flake8>=3.7.9->visualdl->paddlenlp>=2.0.0rc18) (2.6.0)
Requirement already satisfied, skipping upgrade: mccabe<0.7.0,>=0.6.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flake8>=3.7.9->visualdl->paddlenlp>=2.0.0rc18) (0.6.1)
Requirement already satisfied, skipping upgrade: pyflakes<2.3.0,>=2.2.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flake8>=3.7.9->visualdl->paddlenlp>=2.0.0rc18) (2.2.0)
Requirement already satisfied, skipping upgrade: Werkzeug>=0.15 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flask>=1.1.1->visualdl->paddlenlp>=2.0.0rc18) (0.16.0)
Requirement already satisfied, skipping upgrade: itsdangerous>=0.24 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flask>=1.1.1->visualdl->paddlenlp>=2.0.0rc18) (1.1.0)
Requirement already satisfied, skipping upgrade: click>=5.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flask>=1.1.1->visualdl->paddlenlp>=2.0.0rc18) (7.0)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp>=2.0.0rc18) (0.14.1)
Requirement already satisfied, skipping upgrade: scipy>=0.17.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp>=2.0.0rc18) (1.3.0)
Requirement already satisfied, skipping upgrade: MarkupSafe>=0.23 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Jinja2>=2.5->Flask-Babel>=1.0.0->visualdl->paddlenlp>=2.0.0rc18) (1.1.1)
Requirement already satisfied, skipping upgrade: zipp>=0.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from importlib-metadata; python_version < "3.8"->pre-commit->visualdl->paddlenlp>=2.0.0rc18) (0.6.0)
Requirement already satisfied, skipping upgrade: more-itertools in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < "3.8"->pre-commit->visualdl->paddlenlp>=2.0.0rc18) (7.2.0)
In [2]
# 下载并解压数据集
from paddle.utils.download import get_path_from_url
URL = "https://paddlenlp.bj.bcebos.com/paddlenlp/datasets/waybill.tar.gz"
get_path_from_url(URL, "./")

# 查看预测的数据
!head -n 5 data/test.txt
2021-04-16 17:49:30,049 - INFO - unique_endpoints {''}
2021-04-16 17:49:30,050 - INFO - Found ./waybill.tar.gz
2021-04-16 17:49:30,052 - INFO - Decompressing ./waybill.tar.gz...
text_a	label
黑龙江省双鸭山市尖山区八马路与东平行路交叉口北40米韦业涛18600009172	A1-BA1-IA1-IA1-IA2-BA2-IA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IP-BP-IP-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-I
广西壮族自治区桂林市雁山区雁山镇西龙村老年活动中心17610348888羊卓卫	A1-BA1-IA1-IA1-IA1-IA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-IP-I
15652864561河南省开封市顺河回族区顺河区公园路32号赵本山	T-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IA1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IP-BP-IP-I
河北省唐山市玉田县无终大街15918614253058尚汉生	A1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-IP-I
In [3]
from functools import partial

import paddle
from paddlenlp.datasets import MapDataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import ErnieTokenizer, ErnieForTokenClassification
from paddlenlp.metrics import ChunkEvaluator
from utils import convert_example, evaluate, predict, load_dict
加载自定义数据集
推荐使用MapDataset()自定义数据集。

In [4]
def load_dataset(datafiles):
    def read(data_path):
        with open(data_path, 'r', encoding='utf-8') as fp:
            next(fp)  # Skip header
            for line in fp.readlines():
                words, labels = line.strip('\n').split('\t')
                words = words.split('\002')
                labels = labels.split('\002')
                yield words, labels

    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

# Create dataset, tokenizer and dataloader.
train_ds, dev_ds, test_ds = load_dataset(datafiles=(
        './data/train.txt', './data/dev.txt', './data/test.txt'))
In [ ]
for i in range(5):
    print(train_ds[i])
(['1', '6', '6', '2', '0', '2', '0', '0', '0', '7', '7', '宣', '荣', '嗣', '甘', '肃', '省', '白', '银', '市', '会', '宁', '县', '河', '畔', '镇', '十', '字', '街', '金', '海', '超', '市', '西', '行', '5', '0', '米'], ['T-B', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'P-B', 'P-I', 'P-I', 'A1-B', 'A1-I', 'A1-I', 'A2-B', 'A2-I', 'A2-I', 'A3-B', 'A3-I', 'A3-I', 'A4-B', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I'])
(['1', '3', '5', '5', '2', '6', '6', '4', '3', '0', '7', '姜', '骏', '炜', '云', '南', '省', '德', '宏', '傣', '族', '景', '颇', '族', '自', '治', '州', '盈', '江', '县', '平', '原', '镇', '蜜', '回', '路', '下', '段'], ['T-B', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'P-B', 'P-I', 'P-I', 'A1-B', 'A1-I', 'A1-I', 'A2-B', 'A2-I', 'A2-I', 'A2-I', 'A2-I', 'A2-I', 'A2-I', 'A2-I', 'A2-I', 'A2-I', 'A3-B', 'A3-I', 'A3-I', 'A4-B', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I'])
(['内', '蒙', '古', '自', '治', '区', '赤', '峰', '市', '阿', '鲁', '科', '尔', '沁', '旗', '汉', '林', '西', '街', '路', '南', '1', '3', '7', '0', '1', '0', '8', '5', '3', '9', '0', '那', '峥'], ['A1-B', 'A1-I', 'A1-I', 'A1-I', 'A1-I', 'A1-I', 'A2-B', 'A2-I', 'A2-I', 'A3-B', 'A3-I', 'A3-I', 'A3-I', 'A3-I', 'A3-I', 'A4-B', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'T-B', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'P-B', 'P-I'])
(['广', '东', '省', '梅', '州', '市', '大', '埔', '县', '茶', '阳', '镇', '胜', '利', '路', '1', '3', '6', '0', '1', '3', '2', '8', '1', '7', '3', '张', '铱'], ['A1-B', 'A1-I', 'A1-I', 'A2-B', 'A2-I', 'A2-I', 'A3-B', 'A3-I', 'A3-I', 'A4-B', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'T-B', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'P-B', 'P-I'])
(['新', '疆', '维', '吾', '尔', '自', '治', '区', '阿', '克', '苏', '地', '区', '阿', '克', '苏', '市', '步', '行', '街', '1', '0', '号', '1', '5', '8', '1', '0', '7', '8', '9', '3', '7', '8', '慕', '东', '霖'], ['A1-B', 'A1-I', 'A1-I', 'A1-I', 'A1-I', 'A1-I', 'A1-I', 'A1-I', 'A2-B', 'A2-I', 'A2-I', 'A2-I', 'A2-I', 'A3-B', 'A3-I', 'A3-I', 'A3-I', 'A4-B', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'A4-I', 'T-B', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'T-I', 'P-B', 'P-I', 'P-I'])
每条数据包含一句文本和这个文本中每个汉字以及数字对应的label标签。

之后,还需要对输入句子进行数据处理,如切词,映射词表id等。

数据处理
预训练模型ERNIE对中文数据的处理是以字为单位。PaddleNLP对于各种预训练模型已经内置了相应的tokenizer。指定想要使用的模型名字即可加载对应的tokenizer。

tokenizer作用为将原始输入文本转化成模型model可以接受的输入数据形式。




图3:ERNIE模型示意图

In [ ]
label_vocab = load_dict('./data/tag.dic')
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

trans_func = partial(convert_example, tokenizer=tokenizer, label_vocab=label_vocab)

train_ds.map(trans_func)
dev_ds.map(trans_func)
test_ds.map(trans_func)
print (train_ds[0])
[2021-04-16 17:49:30,684] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt
100%|██████████| 89/89 [00:00<00:00, 26435.31it/s]
([1, 208, 515, 515, 249, 540, 249, 540, 540, 540, 589, 589, 803, 838, 2914, 1222, 1734, 244, 368, 797, 99, 32, 863, 308, 457, 2778, 484, 167, 436, 930, 192, 233, 634, 99, 213, 40, 317, 540, 256, 2], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 40, [12, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 1, 1, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12])
数据读入
使用paddle.io.DataLoader接口多线程异步加载数据。

In [ ]
ignore_label = -1
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(),  # seq_len
    Pad(axis=0, pad_val=ignore_label)  # labels
): fn(samples)

train_loader = paddle.io.DataLoader(
    dataset=train_ds,
    batch_size=36,
    return_list=True,
    collate_fn=batchify_fn)
dev_loader = paddle.io.DataLoader(
    dataset=dev_ds,
    batch_size=36,
    return_list=True,
    collate_fn=batchify_fn)
test_loader = paddle.io.DataLoader(
    dataset=test_ds,
    batch_size=36,
    return_list=True,
    collate_fn=batchify_fn)
PaddleNLP一键加载预训练模型
快递单信息抽取本质是一个序列标注任务,PaddleNLP对于各种预训练模型已经内置了对于下游任务文本分类Fine-tune网络。以下教程以ERNIE为预训练模型完成序列标注任务。

paddlenlp.transformers.ErnieForTokenClassification()一行代码即可加载预训练模型ERNIE用于序列标注任务的fine-tune网络。其在ERNIE模型后拼接上一个全连接网络进行分类。

paddlenlp.transformers.ErnieForTokenClassification.from_pretrained()方法只需指定想要使用的模型名称和文本分类的类别数即可完成定义模型网络。

In [ ]
# Define the model netword and its loss
model = ErnieForTokenClassification.from_pretrained("ernie-1.0", num_classes=len(label_vocab))
[2021-04-16 17:49:30,768] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-04-16 17:49:30,771] [    INFO] - Downloading ernie_v1_chn_base.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams
100%|██████████| 390123/390123 [00:07<00:00, 48846.56it/s]
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
PaddleNLP不仅支持ERNIE预训练模型,还支持BERT、RoBERTa、Electra等预训练模型。 下表汇总了目前PaddleNLP支持的各类预训练模型。您可以使用PaddleNLP提供的模型,完成文本分类、序列标注、问答等任务。同时我们提供了众多预训练模型的参数权重供用户使用,其中包含了二十多种中文语言模型的预训练权重。中文的预训练模型有bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, gpt2-base-cn, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, chinese-xlnet-base, chinese-xlnet-mid, chinese-xlnet-large, unified_transformer-12L-cn, unified_transformer-12L-cn-luge等。

更多预训练模型参考:PaddleNLP Transformer API。

更多预训练模型fine-tune下游任务使用方法,请参考:examples。

设置Fine-Tune优化策略,模型配置
适用于ERNIE/BERT这类Transformer模型的迁移优化学习率策略为warmup的动态学习率。

 

图4:动态学习率示意图

In [ ]
metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True)
loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label)
optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters())
模型训练与评估
模型训练的过程通常有以下步骤:

从dataloader中取出一个batch data
将batch data喂给model,做前向计算
将前向计算结果传给损失函数,计算loss。将前向计算结果传给评价方法,计算评价指标。
loss反向回传,更新梯度。重复以上步骤。
每训练一个epoch时,程序将会评估一次,评估当前模型训练的效果。

In [ ]
step = 0
for epoch in range(10):
    for idx, (input_ids, token_type_ids, length, labels) in enumerate(train_loader):
        logits = model(input_ids, token_type_ids)
        loss = paddle.mean(loss_fn(logits, labels))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        step += 1
        print("epoch:%d - step:%d - loss: %f" % (epoch, step, loss))
    evaluate(model, metric, dev_loader)

    paddle.save(model.state_dict(),
                './ernie_result/model_%d.pdparams' % step)
# model.save_pretrained('./checkpoint')
# tokenizer.save_pretrained('./checkpoint')
epoch:0 - step:1 - loss: 2.440917
epoch:0 - step:2 - loss: 2.241402
epoch:0 - step:3 - loss: 2.064299
epoch:0 - step:4 - loss: 1.941091
epoch:0 - step:5 - loss: 1.810125
epoch:0 - step:6 - loss: 1.718215
epoch:0 - step:7 - loss: 1.646599
epoch:0 - step:8 - loss: 1.527680
epoch:0 - step:9 - loss: 1.492934
epoch:0 - step:10 - loss: 1.401691
epoch:0 - step:11 - loss: 1.351265
epoch:0 - step:12 - loss: 1.304195
epoch:0 - step:13 - loss: 1.238849
epoch:0 - step:14 - loss: 1.174362
epoch:0 - step:15 - loss: 1.156204
epoch:0 - step:16 - loss: 1.094712
epoch:0 - step:17 - loss: 1.082015
epoch:0 - step:18 - loss: 1.019320
epoch:0 - step:19 - loss: 0.993033
epoch:0 - step:20 - loss: 0.948091
epoch:0 - step:21 - loss: 0.903512
epoch:0 - step:22 - loss: 0.861731
epoch:0 - step:23 - loss: 0.834337
epoch:0 - step:24 - loss: 0.785762
epoch:0 - step:25 - loss: 0.782404
epoch:0 - step:26 - loss: 0.730888
epoch:0 - step:27 - loss: 0.717108
epoch:0 - step:28 - loss: 0.706697
epoch:0 - step:29 - loss: 0.621379
epoch:0 - step:30 - loss: 0.591193
epoch:0 - step:31 - loss: 0.574058
epoch:0 - step:32 - loss: 0.560003
epoch:0 - step:33 - loss: 0.517378
epoch:0 - step:34 - loss: 0.492897
epoch:0 - step:35 - loss: 0.453224
epoch:0 - step:36 - loss: 0.453232
epoch:0 - step:37 - loss: 0.432989
epoch:0 - step:38 - loss: 0.394299
epoch:0 - step:39 - loss: 0.386831
epoch:0 - step:40 - loss: 0.356459
epoch:0 - step:41 - loss: 0.332505
epoch:0 - step:42 - loss: 0.325347
epoch:0 - step:43 - loss: 0.264217
epoch:0 - step:44 - loss: 0.273122
epoch:0 - step:45 - loss: 0.222397
[2021-04-16 17:59:07,274] [ WARNING] - Compatibility Warning: The params of ChunkEvaluator.compute has been modified. The old version is `inputs`, `lengths`, `predictions`, `labels` while the current version is `lengths`, `predictions`, `labels`.  Please update the usage.
eval precision: 0.946414 - recall: 0.965517 - f1: 0.955870
epoch:1 - step:46 - loss: 0.234128
epoch:1 - step:47 - loss: 0.217608
epoch:1 - step:48 - loss: 0.194733
epoch:1 - step:49 - loss: 0.190501
epoch:1 - step:50 - loss: 0.196022
epoch:1 - step:51 - loss: 0.187103
epoch:1 - step:52 - loss: 0.173465
epoch:1 - step:53 - loss: 0.135932
epoch:1 - step:54 - loss: 0.131949
epoch:1 - step:55 - loss: 0.133976
epoch:1 - step:56 - loss: 0.114723
epoch:1 - step:57 - loss: 0.125495
epoch:1 - step:58 - loss: 0.084843
epoch:1 - step:59 - loss: 0.080494
epoch:1 - step:60 - loss: 0.120894
epoch:1 - step:61 - loss: 0.111187
模型预测
训练保存好的训练,即可用于预测。如以下示例代码自定义预测数据,调用predict()函数即可一键预测。

In [ ]
preds = predict(model, test_loader, test_ds, label_vocab)
file_path = "ernie_results.txt"
with open(file_path, "w", encoding="utf8") as fout:
    fout.write("\n".join(preds))
# Print some examples
print(
    "The results have been saved in the file: %s, some examples are shown below: "
    % file_path)
print("\n".join(preds[:10]))
进一步使用CRF
PaddleNLP提供了CRF Layer,它能够学习label之间的关系,能够帮助模型更好地学习、预测序列标注任务。

我们在PaddleNLP仓库中提供了示例,您可以参照示例代码使用Ernie-CRF结构完成快递单信息抽取任务。

PaddleNLP更多项目
使用BiGRU-CRF模型完成快递单信息抽取
使用seq2vec模块进行句子情感分类
如何通过预训练模型Fine-tune下游任务
使用Seq2Seq模型完成自动对联
使用预训练模型ERNIE-GEN实现智能写诗
使用TCN网络完成新冠疫情病例数预测
使用预训练模型完成阅读理解
自定义数据集实现文本多分类任务
Logo

技术共进,成长同行——讯飞AI开发者社区

更多推荐