机器翻译中的词义排歧

VIP免费

3.0 牛悦 2024-11-19 4 4 989.36KB 106 页 15积分

侵权投诉

机器翻译中的词义排歧

Word Sense Disambiguation

In Machine Translation

摘要

机器翻译是当代科学技术的十大难题之一，而词义排歧是机器翻译中最困难

的问题。如果词义排歧不能解决，机器翻译的译文质量就不可能有质的提高。

本文首先评价了目前常用的几种词义排歧方法，接下来介绍作者设计的词义

排歧实验。在实验中作者从大学英语四六级新大纲规定的词汇中选取了 5260 个常

用词编入了机读词典, 选取了五个有代表性的多义词（“bank”, “old”, “draw”,

“sweet” 和“since”）作为词义排歧的实验对象，从英国国家语料库中提取了三万

左右含有这五个词的句子，并通过编程对这些语料进行分析研究。最后作者用 C++

语言设计了“词义排歧系统”和“词义标注系统”对本文提出的假设进行编程测

试。

实验证明：每个多义词词义变化的规律都不同，无法归纳出普遍规律；虽然

有些语义变化有一些规律，但是大多数多义词的词义变化都没有普遍规律可循，

各种排歧方法不能解决词义排歧的原因就是忽略了多义词词义变化是没有什么普

遍规律可言的事实。因此，要首先研究出每个多义词词义变化的特殊规律，再根

据这些规律选择合适的排歧方法来提高排歧效率。因为没有通用的排歧方法，各

种排歧方法又都有不足之处，所以排歧系统要能够根据可用的语境信息灵活地选

择合适的排歧方法。词义排歧中最因难的任务不是创造出一个全新的排歧方法，

而是研究出每个多义词词义变化的规律。无论排歧方法多么新颖，没有用于排歧

的数据，什么方法都无法发挥作用。排歧效率取决于用于排歧的数据的质量，而

不研究出每个多义词词义变化的特殊规律就不可能得到高质量的可用于排歧的数

据。

虽然人们认为普遍性的规律有实用价值，但是在词义排歧中多义词词义变化

的特殊规律比一般性的语义规律更重要。因为多义词词义变化规律复杂多变无法

归纳，我们只有一个一个地研究出每个多义词的词义变化规律，再逐个解决多义

词的词义排歧问题，直到大多数的常用多义词的词义排歧问题都得到解决，机器

翻译的质量令人满意。这就是说设计一个排歧系统是一项艰巨的任务，事实上对

一个多义词的一个词义的排歧都可能会是费时而又艰辛的工作，更何况要对数以

万计的词义进行排歧。所以，排歧是非常不易的，而试图找到一个普遍通用的排

歧方法又是不可能的。但这并不是说词义排歧不可能解决，只是词义排歧需要长

期不懈的努力。当然投入的艰辛的劳动也会得到丰厚的回报，因为成功的机器翻

译系统必定会带来巨大的商业利润。

关键词：词义排歧机器翻译同现词词义排歧系统词义标注系统

ABSTRACT

Machine translation (MT for short) is considered one of the ten most difficult

problems to solve in science and technology (Feng 2), and word sense disambiguation

(WSD for short) is the most difficult issue in MT. If it could not be solved, MT could

hardly achieve any substantial development.

This paper first comments on the current WSD methods, then introduces the pilot

study carried out by the author. In this study, 5260 frequently used words chosen from

CET 4 (College English Test) and CET 6 vocabularies are compiled into a machine

readable dictionary. About thirty thousand sentences containing the polysemous words

“bank”, “old”, “draw”, “sweet” and “since”①are downloaded from the British

National Corpus and analyzed by the programs compiled in C++ language. Then the

author designed two WSD systems “WSD Machine” and “WSD Machine Sense

Marker” to test the hypothesis proposed in this paper.

This pilot study proves that the specific semantic rules vary from sense to sense,

and from word to word, and thus cannot be generalized. Most of the usually used

methods cannot solve the WSD issue in MT, because they neglect the fact that while

some semantic rules can be generalized, many others are too specific to be generalized,

especially the specific semantic rules of each polysemous word. Therefore, the specific

semantic rules of each polysemous word must be worked out first before utilizing

appropriate methods to improve the efficiency of WSD. Since all the WSD methods

have disadvantages and none is universally applicable to all kinds of ambiguities, a

well designed WSD system should be able to choose appropriate WSD methods

according to the available WSD information found in the context. The most difficult

issue in WSD is not the task to innovate a totally new WSD method, but the task to

work out the specific semantic rules of the senses of each polysemous word. No matter

how innovative a method is, without WSD data it cannot work at all. The WSD

efficiency hinges on the quality of the data, and high-quality data cannot be collected

before working out the specific semantic rules of the senses of each polysemous word.

Although it is widely accepted that rules should be generalized for practical

purposes, in WSD the specific semantic rules of the polysemous words turn out to be

①“bank”, “old”, “draw”, “sweet” and “since” cover most of the word classes of polysemous words, i.e. noun, verb,

adjective, preposition and conjunction. The WSD methods of these five words are very typical, which will be

illustrated later.

much more important than the general rules. Since the semantic rules of polysemous

words are too multifarious to be generalized, we should work out the specific semantic

rules of the senses of each polysemous word, then solve the WSD issue word by word

till most of the frequently used polysemous words have been well disambiguated and

the quality of MT is satisfactory. This means painstaking effort must be made in

designing a WSD system. The fact is, even the disambiguation of one sense of a

polysemous word might be an arduous and time-consuming work, let alone thousands

of polysemous words with thousands upon thousands of senses. Therefore, WSD is no

easy task and to attempt to find a universally applicable approach is to beg for

frustration. Nonetheless, WSD is not a matter of impossibility, but a matter of time and

efforts. And the great efforts made in WSD will be well rewarded because a successful

MT system must be very profitable.

Key Words: word sense disambiguation, machine translation,

co-occurrence word, WSD Machine, WSD Machine Sense Marker

Contents

摘要

ABSTRACT

Chapter One: Introduction................................................................................................ 1

Chapter Two: Current Word Sense Disambiguation Methods .......................................... 3

Chapter Three: A Pilot Study of Word Sense Disambiguation ......................................... 8

§3.1 Adaptation of the word sense disambiguation method based on co-occurrence

features.......................................................................................................................9

§3.1.1 The co-occurrence words of “bank”...................................................... 10

§3.1.2 Limitation of the method based on co-occurrence words ......................14

§3.1.3 Other co-occurrence features................................................................. 16

§3.1.4 Priority of monosemantic expressions in machine translation .............. 17

§3.1.5 Compilation of a machine readable dictionary based on co-occurrence

features..............................................................................................................17

§3.1.6 “WSD Machine Sense Marker” and the word sense disambiguation

result..................................................................................................................22

§3.2 The key role of grammatical structures in disambiguating the senses of

“since”......................................................................................................................24

§3.2.1 Incompetence of the current word sense disambiguation methods in

disambiguating the senses of “since” ................................................................24

§3.2.2 The different grammatical structures in which the two main senses are

used................................................................................................................... 26

§3.2.3 Word sense disambiguation result of “since”.........................................27

§3.3 Word sense disambiguation of “old”, “draw” and “sweet”..............................27

§3.3.1 Difference between the word sense disambiguation of “old” and “bank”27

§3.3.2 Multifarious word sense disambiguation methods required in

disambiguating the forty five senses of “draw”................................................29

§3.3.3 The ability of understanding required in disambiguating polysemous

words in complicated situations........................................................................38

§3.3.4 Significance of probability in word sense disambiguation.................... 40

§3.3.5 Application of componential analysis theory in compiling a machine

readable dictionary............................................................................................44

§3.3.6 Utilization of WordNet in the compilation of a machine readable

dictionary.......................................................................................................... 45

§3.3.7 Pragmatics and word sense disambiguation.......................................... 46

§3.3.8 Example-based and statistics-based approaches in machine translation46

Chapter Four: Conclusion............................................................................................... 47

Appendix I: WSD Machine.cpp......................................................................................49

Appendix II: Co-occurrence Frequencies Counter.cpp...................................................69

Appendix III: WSD Machine Sense Marker.cpp............................................................ 74

Appendix IV: WSD Result of “bank” (100 examples)................................................... 93

Appendix V: WSD Machine Installation Package .......................................................... 98

Bibliography....................................................................................................................99

在读期间公开发表的论文和承担科研项目及取得成果 ...........................................101

Acknowledgements.......................................................................................................102

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

15 积分 4人已下载

立即下载 VIP免费下载

摘要：

机器翻译中的词义排歧WordSenseDisambiguationInMachineTranslation摘要机器翻译是当代科学技术的十大难题之一，而词义排歧是机器翻译中最困难的问题。如果词义排歧不能解决，机器翻译的译文质量就不可能有质的提高。本文首先评价了目前常用的几种词义排歧方法，接下来介绍作者设计的词义排歧实验。在实验中作者从大学英语四六级新大纲规定的词汇中选取了5260个常用词编入了机读词典,选取了五个有代表性的多义词（“bank”,“old”,“draw”,“sweet”和“since”）作为词义排歧的实验对象，从英国国家语料库中提取了三万左右含有这五个词的句子，并通过编程对这些语料...

展开>> 收起<<

机器翻译中的词义排歧.pdf

共106页,预览10页

还剩页未读，继续阅读

机器翻译中的词义排歧

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

推荐作者

热门标签

举报选择: