机器翻译中的词义排歧

VIP免费
3.0 牛悦 2024-11-19 4 4 989.36KB 106 页 15积分
侵权投诉
机器翻译中的词义排歧
Word Sense Disambiguation
In Machine Translation
摘 要
机器翻译是当代科学技术的十大难题之一,而词义排歧是机器翻译中最困难
的问题。如果词义排歧不能解决,机器翻译的译文质量就不可能有质的提高。
本文首先评价了目前常用的几种词义排歧方法,接下来介绍作者设计的词义
排歧实验。在实验中作者从大学英语四六级新大纲规定的词汇中选取了 5260 个常
用词编入了机读词典, 选取了五个有代表性的多义词(“bank”, “old”, “draw”,
“sweet” “since”作为词义排歧的实验对象,从英国国家语料库中提取了三万
左右含有这五个词的句子,并通过编程对这些语料进行分析研究。最后作者用 C++
语言设计了“词义排歧系统”和“词义标注系统”对本文提出的假设进行编程测
试。
实验证明:每个多义词词义变化的规律都不同,无法归纳出普遍规律;虽
有些语义变化有一些规律,但是大多数多义词的词义变化都没有普遍规律可循,
各种排歧方法不能解决词义排歧的原因就是忽略了多义词词义变化是没有什么普
遍规律可言的事实。因此,要首先研究出每个多义词词义变化的特殊规律,再根
据这些规律选择合适的排歧方法来提高排歧效率。因为没有通用的排歧方法,各
种排歧方法又都有不足之处,所以排歧系统要能够根据可用的语境信息灵活地选
择合适的排歧方法。 词义排歧中最因难的任务不是创造出一个全新的排歧方法,
而是研究出每个多义词词义变化的规律。无论排歧方法多么新颖,没有用于排歧
的数据,什么方法都无法发挥作用。排歧效率取决于用于排歧的数据的质量,而
不研究出每个多义词词义变化的特殊规律就不可能得到高质量的可用于排歧的数
据。
虽然人们认为普遍性的规律有实用价值,但是在词义排歧中多义词词义变
的特殊规律比一般性的语义规律更重要。因为多义词词义变化规律复杂多变无法
归纳,我们只有一个一个地研究出每个多义词的词义变化规律,再逐个解决多义
词的词义排歧问题,直到大多数的常用多义词的词义排歧问题都得到解决,机器
翻译的质量令人满意。这就是说设计一个排歧系统是一项艰巨的任务,事实上对
一个多义词的一个词义的排歧都可能会是费时而又艰辛的工作,更何况要对数以
万计的词义进行排歧。所以,排歧是非常不易的,而试图找到一个普遍通用的排
歧方法又是不可能的。但这并不是说词义排歧不可能解决,只是词义排歧需要长
期不懈的努力。当然投入的艰辛的劳动也会得到丰厚的回报,因为成功的机器翻
译系统必定会带来巨大的商业利润。
关键词:词义排歧 机器翻译 同现词 词义排歧系统 词义标注系统
ABSTRACT
Machine translation (MT for short) is considered one of the ten most difficult
problems to solve in science and technology (Feng 2), and word sense disambiguation
(WSD for short) is the most difficult issue in MT. If it could not be solved, MT could
hardly achieve any substantial development.
This paper first comments on the current WSD methods, then introduces the pilot
study carried out by the author. In this study, 5260 frequently used words chosen from
CET 4 (College English Test) and CET 6 vocabularies are compiled into a machine
readable dictionary. About thirty thousand sentences containing the polysemous words
“bank”, “old”, “draw”, “sweet” and “since”are downloaded from the British
National Corpus and analyzed by the programs compiled in C++ language. Then the
author designed two WSD systems “WSD Machine” and “WSD Machine Sense
Marker” to test the hypothesis proposed in this paper.
This pilot study proves that the specific semantic rules vary from sense to sense,
and from word to word, and thus cannot be generalized. Most of the usually used
methods cannot solve the WSD issue in MT, because they neglect the fact that while
some semantic rules can be generalized, many others are too specific to be generalized,
especially the specific semantic rules of each polysemous word. Therefore, the specific
semantic rules of each polysemous word must be worked out first before utilizing
appropriate methods to improve the efficiency of WSD. Since all the WSD methods
have disadvantages and none is universally applicable to all kinds of ambiguities, a
well designed WSD system should be able to choose appropriate WSD methods
according to the available WSD information found in the context. The most difficult
issue in WSD is not the task to innovate a totally new WSD method, but the task to
work out the specific semantic rules of the senses of each polysemous word. No matter
how innovative a method is, without WSD data it cannot work at all. The WSD
efficiency hinges on the quality of the data, and high-quality data cannot be collected
before working out the specific semantic rules of the senses of each polysemous word.
Although it is widely accepted that rules should be generalized for practical
purposes, in WSD the specific semantic rules of the polysemous words turn out to be
bank”, “old”, “draw”, “sweet” and “since” cover most of the word classes of polysemous words, i.e. noun, verb,
adjective, preposition and conjunction. The WSD methods of these five words are very typical, which will be
illustrated later.
much more important than the general rules. Since the semantic rules of polysemous
words are too multifarious to be generalized, we should work out the specific semantic
rules of the senses of each polysemous word, then solve the WSD issue word by word
till most of the frequently used polysemous words have been well disambiguated and
the quality of MT is satisfactory. This means painstaking effort must be made in
designing a WSD system. The fact is, even the disambiguation of one sense of a
polysemous word might be an arduous and time-consuming work, let alone thousands
of polysemous words with thousands upon thousands of senses. Therefore, WSD is no
easy task and to attempt to find a universally applicable approach is to beg for
frustration. Nonetheless, WSD is not a matter of impossibility, but a matter of time and
efforts. And the great efforts made in WSD will be well rewarded because a successful
MT system must be very profitable.
Key Words: word sense disambiguation, machine translation,
co-occurrence word, WSD Machine, WSD Machine Sense Marker
Contents
摘要
ABSTRACT
Chapter One: Introduction................................................................................................ 1
Chapter Two: Current Word Sense Disambiguation Methods .......................................... 3
Chapter Three: A Pilot Study of Word Sense Disambiguation ......................................... 8
§3.1 Adaptation of the word sense disambiguation method based on co-occurrence
features.......................................................................................................................9
§3.1.1 The co-occurrence words of “bank”...................................................... 10
§3.1.2 Limitation of the method based on co-occurrence words ......................14
§3.1.3 Other co-occurrence features................................................................. 16
§3.1.4 Priority of monosemantic expressions in machine translation .............. 17
§3.1.5 Compilation of a machine readable dictionary based on co-occurrence
features..............................................................................................................17
§3.1.6 “WSD Machine Sense Marker” and the word sense disambiguation
result..................................................................................................................22
§3.2 The key role of grammatical structures in disambiguating the senses of
“since”......................................................................................................................24
§3.2.1 Incompetence of the current word sense disambiguation methods in
disambiguating the senses of “since” ................................................................24
§3.2.2 The different grammatical structures in which the two main senses are
used................................................................................................................... 26
§3.2.3 Word sense disambiguation result of “since”.........................................27
§3.3 Word sense disambiguation of “old”, “draw” and “sweet”..............................27
§3.3.1 Difference between the word sense disambiguation of “old” and “bank”27
§3.3.2 Multifarious word sense disambiguation methods required in
disambiguating the forty five senses of “draw”................................................29
§3.3.3 The ability of understanding required in disambiguating polysemous
words in complicated situations........................................................................38
§3.3.4 Significance of probability in word sense disambiguation.................... 40
§3.3.5 Application of componential analysis theory in compiling a machine
readable dictionary............................................................................................44
§3.3.6 Utilization of WordNet in the compilation of a machine readable
dictionary.......................................................................................................... 45
§3.3.7 Pragmatics and word sense disambiguation.......................................... 46
§3.3.8 Example-based and statistics-based approaches in machine translation46
Chapter Four: Conclusion............................................................................................... 47
Appendix I: WSD Machine.cpp......................................................................................49
Appendix II: Co-occurrence Frequencies Counter.cpp...................................................69
Appendix III: WSD Machine Sense Marker.cpp............................................................ 74
Appendix IV: WSD Result of “bank” (100 examples)................................................... 93
Appendix V: WSD Machine Installation Package .......................................................... 98
Bibliography....................................................................................................................99
在读期间公开发表的论文和承担科研项目及取得成果 ...........................................101
Acknowledgements.......................................................................................................102
摘要:

机器翻译中的词义排歧WordSenseDisambiguationInMachineTranslation摘要机器翻译是当代科学技术的十大难题之一,而词义排歧是机器翻译中最困难的问题。如果词义排歧不能解决,机器翻译的译文质量就不可能有质的提高。本文首先评价了目前常用的几种词义排歧方法,接下来介绍作者设计的词义排歧实验。在实验中作者从大学英语四六级新大纲规定的词汇中选取了5260个常用词编入了机读词典,选取了五个有代表性的多义词(“bank”,“old”,“draw”,“sweet”和“since”)作为词义排歧的实验对象,从英国国家语料库中提取了三万左右含有这五个词的句子,并通过编程对这些语料...

展开>> 收起<<
机器翻译中的词义排歧.pdf

共106页,预览10页

还剩页未读, 继续阅读

作者:牛悦 分类:高等教育资料 价格:15积分 属性:106 页 大小:989.36KB 格式:PDF 时间:2024-11-19

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 106
客服
关注