Research on a Neural-Network-Based Chinese Word Segmentation System (Main Text)

高德中 · 2024-11-19 · 2.39 MB · 71 pages

Abstract

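In Chinese, there are no delimiters between words, and words themselves lack clear morphological markers. A problem specific to Chinese information processing is therefore how to split a string of Chinese characters into a reasonable sequence of words, that is, Chinese word segmentation [1].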

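Since the 1970s, many segmentation methods have been proposed, and a good number of them have been applied in practical Chinese word segmentation systems with good results. From an algorithmic point of view, existing techniques fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.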

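Neural-network-based Chinese word segmentation is an emerging research area. Some researchers in China have carried out preliminary studies and proposed several algorithms, which demonstrate the advantages of neural networks for this task: massively parallel processing, nonlinearity, self-learning, adaptivity, and concise knowledge representation. However, applied research on neural network segmentation over real corpora has not yet achieved a breakthrough, and so far there is no fully mature neural-network-based Chinese word segmentation system.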

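The main cause is the complexity and vagueness of the rules of the Chinese language. To apply a neural network in practical segmentation, the network must be able to perform pattern training and feature extraction on large sample sets drawn from real corpora while maintaining high training quality. No highly effective solution to this difficulty has been found so far.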

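Following the idea of comprehensive, hierarchical processing of Chinese word segmentation, this thesis proposes a framework for a neural-network-based Chinese word segmentation system that combines coarse segmentation with fine segmentation. To address problems left unsolved by existing neural network segmentation models, it also proposes class-pattern-based training and a dynamic multi-neural-network model. Together, these effectively solve the problem of training neural networks on large sample sets from real corpora. While preserving the advantages of neural network segmentation, they significantly improve the generalization ability of neural networks in Chinese word segmentation, alleviate the overtraining caused by excessively large sample sets, and ultimately improve the overall segmentation accuracy of the system. Experiments show that the proposed class-pattern training and dynamic multi-neural-network training model markedly reduce the training error rate and give the network high accuracy in resolving ambiguities.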

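This thesis also uses the neural-network-based Chinese word segmentation system as an auxiliary segmentation module of an adaptive full-text retrieval system based on user query patterns and neural network technology. The retrieval system is built on the full-text search engine Lucene and can adjust itself dynamically. Adding the neural-network-based segmentation module improves the overall performance of the system; when retrieving Chinese sentences that contain ambiguous fields, it gives better results than the JE Chinese word segmentation package for Lucene [2].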

Keywords: Chinese word segmentation; neural network; disambiguation

ABSTRACT

In Chinese, there are no separators between words, and words themselves lack obvious morphological markers. The particular problem of Chinese information processing is therefore how to split a sequence of Chinese characters into a reasonable sequence of words, which is called Chinese word segmentation [1].

Since the 1970s, a variety of segmentation methods have been proposed, and many of them have been applied in practical Chinese word segmentation systems and have achieved good performance. From the algorithmic point of view, existing Chinese word segmentation techniques are of three kinds: segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics.
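As a rough illustration of the first category, string-matching segmentation is commonly implemented as dictionary-based maximum matching. The sketch below is not taken from the thesis: the function names, the toy lexicon `LEXICON`, and the `max_len` parameter are all assumptions chosen for illustration. It runs forward and backward maximum matching over the same string; where the two results differ, the string contains a segmentation ambiguity of the kind a disambiguation step has to resolve.

```python
# Toy lexicon, assumed for illustration only (not the thesis dictionary).
LEXICON = {"研究", "研究生", "生命", "起源"}


def forward_max_match(text, lexicon, max_len=3):
    """Scan left to right, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=3):
    """Scan right to left, always taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


if __name__ == "__main__":
    sentence = "研究生命起源"
    fwd = forward_max_match(sentence, LEXICON)
    bwd = backward_max_match(sentence, LEXICON)
    print("forward: ", "/".join(fwd))
    print("backward:", "/".join(bwd))
    print("ambiguous:", fwd != bwd)
```

With this toy lexicon the forward pass yields 研究生/命/起源 and the backward pass yields 研究/生命/起源, so the two scans disagree and the string is flagged as ambiguous. Pure dictionary matching cannot tell which reading is correct.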

Chinese word segmentation based on neural networks is a new research field. Some domestic experts and scholars have carried out preliminary studies and proposed several algorithms, which show the advantages of neural networks in Chinese word segmentation, such as large-scale parallel processing, nonlinearity, self-learning, adaptivity, and simple knowledge representation. However, neural network segmentation has not yet made a breakthrough in real corpus environments, so there is still no full-fledged Chinese word segmentation system based on neural networks.

The main cause of this problem is the complexity and vagueness of the rules of the Chinese language. To apply neural networks in actual segmentation, the network must be trained on patterns and must extract features from huge sample sets taken from real corpora.

This thesis follows the idea of handling Chinese word segmentation through comprehensive, graded processing and presents a framework for a neural-network-based Chinese word segmentation system that combines a rough segmentation model with a fine segmentation model. It also presents class-pattern-based training and a dynamic model of multiple neural networks to solve problems that existing neural network segmentation models have not solved. These two techniques are effective solutions to the problem of training a neural network on huge sample sets from real corpora. While maintaining the advantages of neural networks in Chinese word segmentation, they clearly improve the generalization ability of the network and relieve the overtraining caused by too many samples. As a result, they enhance the overall accuracy of Chinese word segmentation. Experiments show that class-pattern-based training and the dynamic model of multiple neural networks clearly reduce the error rate during network training and yield high disambiguation accuracy.

The neural-network-based Chinese word segmentation system also serves as an auxiliary module of “an adaptive full-text retrieval system based on user queries and neural network technology”. The full-text retrieval system is built on the Lucene search engine and has dynamic adjustment characteristics. The auxiliary segmentation module improves overall system performance and, when searching Chinese sentences that contain ambiguous fields, performs better than the JE Chinese word segmentation package for Lucene [2].

Key Words: Chinese Word Segmentation, Neural Network, Disambiguation

Contents
Abstract
ABSTRACT
Chapter 1 Introduction
§1.1 Research Background
§1.2 Research Content and Significance
§1.3 Organization of the Thesis
Chapter 2 Research on Chinese Word Segmentation
§2.1 Current State of Automatic Chinese Word Segmentation Systems
§2.2 Chinese Word Segmentation Techniques
§2.2.1 Mechanical Segmentation Methods
§2.2.2 Understanding-Based Segmentation Methods
§2.2.3 Statistics-Based Segmentation Methods
§2.3 Difficulties in Chinese Word Segmentation
§2.3.1 Ambiguity Identification
§2.3.2 Out-of-Vocabulary Word Identification
Chapter 3 Model Design of the Neural-Network-Based Chinese Word Segmentation System
§3.1 System Design Philosophy
§3.2 Neural Network Technology and Its Advantages for Chinese Word Segmentation
§3.2.1 Overview of Neural Networks
§3.2.2 BP Neural Networks
§3.2.3 Advantages of Combining BP Neural Networks with Chinese Word Segmentation
§3.3 Single-Neural-Network Training and Disambiguation Models
§3.3.1 Building the Input Model
§3.3.2 Building the Learning Model
§3.3.3 Building the Output Interpretation Model
§3.3.4 Single-Neural-Network Training Model
§3.3.5 Single-Neural-Network Disambiguation Model
§3.4 Class-Pattern-Based Training
§3.4.1 Problems with Existing Training Methods
§3.4.2 Causes and Nature of Ambiguity, and Segmentation Rules
§3.4.3 Class-Pattern-Based Training
§3.5 Dynamic Multi-Neural-Network Training and Disambiguation Models
§3.5.1 Bottlenecks of the Single-Neural-Network Disambiguation Model
§3.5.2 Design of the Multi-Neural-Network Model
§3.5.3 Model Optimization: the B+ Tree Manager
§3.5.4 Dynamic Multi-Neural-Network Training Model
§3.5.5 Dynamic Multi-Neural-Network Disambiguation Model
§3.6 Overall Model of the Neural Network Segmentation System
Chapter 4 Implementation of the Neural-Network-Based Chinese Word Segmentation System
§4.1 Overall System Framework Diagram
§4.2 Resources Used by the System
§4.2.1 Dictionary
§4.2.2 Corpus
§4.3 Neural Network Parameter Tuning
§4.3.1 Encoding Scheme
§4.3.2 Network Structure Design
§4.3.3 Serial and Batch Training Modes
§4.3.4 Activation Function
§4.3.5 Initial Weights and Learning Rate
§4.4 Detailed Module Design
§4.4.1 B+ Tree Manager
§4.4.2 Training System
§4.4.3 Disambiguation System
§4.4.4 Final Segmentation Generation Module
Chapter 5 Experiments
§5.1 Introduction to the Experiments
§5.2 Experimental System
§5.3 Experiments in Detail
§5.3.1 Training System Performance Tests
§5.3.2 Disambiguation System Performance Tests
Chapter 6 Application in Full-Text Retrieval
§6.1 Full-Text Retrieval Technology
§6.2 Problems with Existing Retrieval Systems
§6.3 Adaptive Full-Text Retrieval System Based on User Query Patterns and Neural Network Technology
§6.4 Application of the Neural-Network-Based Chinese Word Segmentation Module
§6.4.1 The Lucene Toolkit
§6.4.2 Application of Neural Network Chinese Word Segmentation
Chapter 7 Conclusions and Outlook
§7.1 Summary of the Work
§7.2 Directions for Further Research
References
