Research on a Chinese Word Segmentation System Based on Neural Networks (Main Text)

Author: 高德中. Date: 2024-11-19. Length: 71 pages, 2.39 MB.
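In Chinese there are no delimiters between words, and words themselves lack obvious morphological markers. The problem peculiar to Chinese information processing is therefore how to split a string of Chinese characters into a reasonable sequence of words, that is, Chinese word segmentation [1].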
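Since the 1970s, many segmentation methods have been proposed, and a good number of them have been applied in practical Chinese word segmentation systems with good results. From an algorithmic point of view, existing Chinese word segmentation techniques fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.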
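As a concrete illustration of the first category, string matching, here is a minimal sketch of forward maximum matching. It is a standard textbook algorithm and not necessarily the variant used in the thesis; the function name `forward_max_match`, the toy lexicon, and the example sentence are assumptions made for illustration.

```python
from typing import List, Optional, Set


def forward_max_match(text: str, lexicon: Set[str],
                      max_len: Optional[int] = None) -> List[str]:
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    if max_len is None:
        # Longest word in the lexicon; never smaller than one character.
        max_len = max([1] + [len(w) for w in lexicon])
    window = max(1, max_len)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a lexicon word is found.
        for size in range(min(window, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", LEXICON))
# prints ['研究生', '命', '起源']
```

The greedy match takes 研究生 (graduate student) and strands 命, whereas the intended reading is 研究 / 生命 / 起源 (study / life / origin). Errors of this kind are the ambiguities that a disambiguation stage has to resolve.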
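Chinese word segmentation based on neural networks is an emerging research area. Some researchers in China have carried out preliminary studies and proposed several algorithms, which demonstrate the advantages of neural networks for this task, such as massively parallel processing, nonlinearity, self-learning, adaptivity, and compact knowledge representation. However, applied research on neural network segmentation in real-corpus settings has not yet achieved a breakthrough, and as a result there is still no fully mature Chinese word segmentation system based on neural networks.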
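The main cause of this problem is the complexity and generality of the rules of the Chinese language. To use a neural network in a practical segmentation process, the network must be able to carry out pattern training and feature extraction on a large sample set of real corpus text while maintaining high training quality. To date there is no very effective solution to this difficulty.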
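Following the idea of handling Chinese word segmentation through comprehensive, staged processing, this thesis proposes a framework for a neural-network-based Chinese word segmentation system that combines coarse segmentation with fine segmentation. To address the problems left unsolved by existing neural network segmentation models, it also proposes class-pattern-based training and a dynamic multi-neural-network model. These two techniques effectively solve the problem of training neural networks on large sample sets of real corpus text. While retaining the advantages of neural network segmentation, they markedly improve the generalization ability of neural networks in Chinese word segmentation, alleviate the overtraining caused by an excessively large sample set, and ultimately raise the segmentation accuracy of the system as a whole. Experiments show that the proposed class-pattern training method and dynamic multi-neural-network training model clearly reduce the training error rate of the neural network and give it high accuracy in resolving ambiguity.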
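The split into coarse and fine segmentation can be sketched as follows. The abstract does not say how the coarse stage locates ambiguous fields, so this sketch assumes one common approach: run maximum matching in both directions and treat any disagreement as ambiguity. The `choose` callback stands in for the neural network disambiguator; the heuristic `fewest_single_chars` is only a placeholder for it, and all names and the toy lexicon are hypothetical.

```python
from typing import Callable, List, Set

Tokens = List[str]


def max_match(text: str, lexicon: Set[str], max_len: int,
              backward: bool = False) -> Tokens:
    """Greedy maximum matching, scanning from the left or from the right."""
    tokens: Tokens = []
    lo, hi = 0, len(text)
    while lo < hi:
        # Try the longest window first; a single character always matches.
        for size in range(min(max_len, hi - lo), 0, -1):
            cand = text[hi - size:hi] if backward else text[lo:lo + size]
            if size == 1 or cand in lexicon:
                tokens.append(cand)
                if backward:
                    hi -= size
                else:
                    lo += size
                break
    # Backward matching collects tokens right to left, so restore text order.
    return tokens[::-1] if backward else tokens


def segment(text: str, lexicon: Set[str],
            choose: Callable[[List[Tokens]], Tokens]) -> Tokens:
    """Coarse pass: two greedy scans. Fine pass: `choose` settles conflicts."""
    max_len = max([1] + [len(w) for w in lexicon])
    forward = max_match(text, lexicon, max_len)
    backward = max_match(text, lexicon, max_len, backward=True)
    if forward == backward:
        return forward                    # no ambiguity detected
    return choose([forward, backward])    # hand the conflict to the disambiguator


def fewest_single_chars(candidates: List[Tokens]) -> Tokens:
    """Placeholder for the neural network: prefer fewer one-character tokens."""
    return min(candidates, key=lambda toks: sum(len(t) == 1 for t in toks))


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}
print(segment("研究生命起源", LEXICON, fewest_single_chars))
# prints ['研究', '生命', '起源']
```

Keeping the disambiguator behind a callback means the cheap dictionary pass handles unambiguous text, and only the conflicting cases reach the more expensive model.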
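This thesis also uses the neural-network-based segmentation system as the auxiliary segmentation module of an adaptive full-text retrieval system based on user query patterns and neural network technology. The retrieval system is built on the full-text search engine Lucene and can adjust itself dynamically. Adding the neural-network-based auxiliary segmentation module improves overall system performance; when searching Chinese sentences that contain ambiguous fields, it performs better than the JE Chinese word segmentation package for Lucene [2].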
Keywords: Chinese word segmentation; neural network; disambiguation
ABSTRACT
In Chinese, there are no separators between words, and words themselves lack
obvious morphological markers. The problem specific to Chinese information
processing is how to split a sequence of Chinese characters into a reasonable
sequence of words, which is called Chinese word segmentation [1].
Since the 1970s, a variety of segmentation methods have been proposed, and many
of them have been applied in practical Chinese word segmentation systems with
good results. From an algorithmic point of view, existing Chinese word
segmentation techniques fall into three kinds: segmentation based on string
matching, segmentation based on understanding, and segmentation based on
statistics.
Chinese word segmentation based on neural networks is a new research field.
Some researchers in China have carried out preliminary studies and proposed
several algorithms, which show the advantages of neural networks for Chinese
word segmentation, such as large-scale parallel processing, nonlinearity,
self-learning, adaptivity, and simple knowledge representation. However,
neural-network-based segmentation has not yet achieved a breakthrough on real
corpora, so there is still no full-fledged Chinese word segmentation system
based on neural networks.
The main cause of this problem is the complexity and generality of the rules of
the Chinese language. To apply neural-network-based segmentation in practice,
the neural network must be able to perform pattern training and feature
extraction on a huge set of samples from a real corpus.
Following the idea of handling Chinese word segmentation through comprehensive,
staged processing, this paper presents a framework for a neural-network-based
Chinese word segmentation system that combines a coarse segmentation model with
a fine segmentation model. The paper also presents class-pattern-based training
and a dynamic multiple-neural-network model to solve problems that existing
neural network segmentation models have left unsolved. Together they are an
effective solution to the problem of training neural networks on huge sample
sets drawn from a real corpus. While maintaining the advantages of neural
networks for Chinese word segmentation, they clearly improve the generalization
ability of the network and relieve the overtraining caused by too many samples,
and in the end they improve the overall segmentation accuracy. Experiments show
that class-pattern-based training and the dynamic multiple-neural-network model
clearly reduce the error rate during network training and yield high
disambiguation accuracy.
The neural-network-based Chinese word segmentation system has also served as
the auxiliary segmentation module of an adaptive full-text retrieval system
based on user queries and neural network technology. This retrieval system is
built on the Lucene search engine and adjusts itself dynamically. The auxiliary
neural network segmentation module improves overall system performance and,
when searching Chinese sentences that contain ambiguous fields, performs better
than the JE Chinese word segmentation package for Lucene [2].
Key Words: Chinese Word Segmentation, Neural Network, Disambiguation
Contents
ABSTRACT
Chapter 1 Introduction
§1.1 Research background
§1.2 Research content and significance of this thesis
§1.3 Organization of the thesis
Chapter 2 Research on Chinese word segmentation
§2.1 Current state of research on automatic Chinese word segmentation systems
§2.2 Chinese word segmentation techniques
§2.2.1 Mechanical segmentation methods
§2.2.2 Understanding-based segmentation methods
§2.2.3 Statistics-based segmentation methods
§2.3 Difficulties in Chinese word segmentation
§2.3.1 Ambiguity identification
§2.3.2 Unknown word identification
Chapter 3 Model of the neural-network-based Chinese word segmentation system
§3.1 System design philosophy
§3.2 Advantages of neural network techniques for Chinese word segmentation
§3.2.1 Overview of neural networks
§3.2.2 BP neural networks
§3.2.3 Advantages of combining BP neural networks with Chinese word segmentation
§3.3 Neural network training and disambiguation models
§3.3.1 Building the input model
§3.3.2 Building the learning model
§3.3.3 Building the output model
§3.3.4 Neural network training model
§3.3.5 Neural network disambiguation model
§3.4 Class-pattern-based training
§3.4.1 Problems with existing training methods
§3.4.2 Causes and properties of ambiguity and rules for segmentation
§3.4.3 Class-pattern-based training
§3.5 Dynamic multi-neural-network training and disambiguation models
§3.5.1 Bottlenecks encountered by the neural network disambiguation model
§3.5.2 Multi-neural-network model
§3.5.3 Model optimization: the B+ tree manager
§3.5.4 Dynamic multi-neural-network training model
§3.5.5 Dynamic multi-neural-network disambiguation model
§3.6 Overall model of the neural network segmentation system
Chapter 4 Implementation of the neural-network-based Chinese word segmentation system
§4.1 Overall framework diagram of the system
§4.2 Resources used by the system
§4.2.1 Dictionary
§4.2.2 Corpus
§4.3 Tuning the neural network parameters
§4.3.1 Encoding scheme
§4.3.2 Network structure
§4.3.3 Sequential and batch training modes
§4.3.4 Activation function
§4.3.5 Initial weights and learning rate
§4.4 Modules
§4.4.1 B+ tree manager
§4.4.2 Training system
§4.4.3 Disambiguation system
§4.4.4 Final segmentation generation module
Chapter 5 Experiments
§5.1 Introduction to the experiments
§5.2 Experimental system
§5.3 Experiments in detail
§5.3.1 Training system performance tests
§5.3.2 Disambiguation system performance tests
Chapter 6 Application in full-text retrieval
§6.1 Full-text retrieval technology
§6.2 Problems with existing retrieval systems
§6.3 An adaptive full-text retrieval system based on user query patterns and neural network technology
§6.4 Application of the neural-network-based Chinese word segmentation module
§6.4.1 The Lucene toolkit
§6.4.2 Application of neural network Chinese word segmentation
Chapter 7 Summary and outlook
§7.1 Summary
§7.2 Directions for further research
References
