
Author: 牛悦 (Niu Yue); 106 pages
Word Sense Disambiguation in Machine Translation
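Abstract
[...] disambiguation experiment. In the experiment, the author selected 5260
frequently used words from the vocabulary prescribed by the new syllabus for the
College English Test Band 4 and Band 6 and compiled them into a machine-readable
dictionary, chose five representative polysemous words (“bank”, “old”, “draw”,
“sweet” and “since”) as the objects of the WSD experiment, extracted about thirty
thousand sentences containing these five words from the British National Corpus,
and analyzed this corpus data by programming. Finally, the author used C++ [...]
choose appropriate disambiguation methods. The most difficult task in WSD is not
to create a totally new disambiguation method, [...]
Key Words: word sense disambiguation, machine translation, co-occurrence word,
word sense disambiguation system, word sense tagging system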
Machine translation (MT) is considered one of the ten most difficult problems in
science and technology (Feng 2), and word sense disambiguation (WSD) is the most
difficult issue in MT. Unless it is solved, MT can hardly make any substantial
progress.
This paper first reviews the current WSD methods and then introduces the pilot
study carried out by the author. In this study, 5260 frequently used words chosen from
the CET 4 and CET 6 (College English Test) vocabularies were compiled into a
machine-readable dictionary. About thirty thousand sentences containing the
polysemous words “bank”, “old”, “draw”, “sweet” and “since” were downloaded from
the British National Corpus and analyzed with programs written in C++. The author
then designed two WSD systems, “WSD Machine” and “WSD Machine Sense Marker”,
to test the hypothesis proposed in this paper.
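The corpus analysis step described above can be illustrated with a minimal sketch of a co-occurrence counter. This is not the author's program (the thesis lists its own "Co-occurrence Frequencies Counter.cpp" in Appendix II, which is not reproduced here); the function names `normalize`, `tokenize` and `countCooccurrences`, the whitespace tokenization and the fixed word window are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: lowercase a token and strip leading/trailing
// characters that are not letters (e.g. quotes, commas, full stops).
std::string normalize(const std::string& token) {
    std::size_t begin = 0, end = token.size();
    while (begin < end && !std::isalpha(static_cast<unsigned char>(token[begin]))) ++begin;
    while (end > begin && !std::isalpha(static_cast<unsigned char>(token[end - 1]))) --end;
    std::string word = token.substr(begin, end - begin);
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

// Hypothetical helper: split a sentence on whitespace and normalize each token.
std::vector<std::string> tokenize(const std::string& sentence) {
    std::istringstream in(sentence);
    std::vector<std::string> words;
    std::string token;
    while (in >> token) {
        std::string word = normalize(token);
        if (!word.empty()) words.push_back(word);
    }
    return words;
}

// Count every word found within `window` positions of each occurrence of
// `target`. The window size is an assumption of this sketch; the occurrence
// itself is skipped, but a second occurrence inside the window is counted.
std::map<std::string, int> countCooccurrences(const std::vector<std::string>& sentences,
                                              const std::string& target,
                                              std::size_t window) {
    std::map<std::string, int> counts;
    for (const auto& sentence : sentences) {
        const std::vector<std::string> words = tokenize(sentence);
        for (std::size_t i = 0; i < words.size(); ++i) {
            if (words[i] != target) continue;
            const std::size_t lo = i >= window ? i - window : 0;
            const std::size_t hi = std::min(words.size(), i + window + 1);
            for (std::size_t j = lo; j < hi; ++j) {
                if (j != i) ++counts[words[j]];
            }
        }
    }
    return counts;
}
```

Calling `countCooccurrences(sentences, "bank", 3)` on a list of corpus sentences returns a map from each neighbouring word to its frequency. Such a table is one plausible starting point for deciding which co-occurrence words point to which sense.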
This pilot study shows that the specific semantic rules vary from sense to sense
and from word to word, and thus cannot be generalized. Most commonly used
methods cannot solve the WSD issue in MT because they neglect the fact that, while
some semantic rules can be generalized, many others are too specific to be generalized,
especially the specific semantic rules of each polysemous word. Therefore, the specific
semantic rules of each polysemous word must be worked out first, before appropriate
methods are applied to improve the efficiency of WSD. Since all WSD methods
have disadvantages and none is universally applicable to all kinds of ambiguity, a
well-designed WSD system should be able to choose appropriate WSD methods
according to the WSD information available in the context. The most difficult
issue in WSD is not inventing a totally new WSD method, but working out the
specific semantic rules of the senses of each polysemous word. No matter
how innovative a method is, it cannot work at all without WSD data. WSD
efficiency hinges on the quality of the data, and high-quality data cannot be collected
before the specific semantic rules of the senses of each polysemous word are worked out.
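The idea that a WSD system should pick a method according to the information available in the context can be sketched as a chain of methods tried in order. This is only an illustration under stated assumptions, not the design of “WSD Machine”: the types `Context`, `Method` and `SenseCues`, the functions `byCooccurrence`, `byDefault` and `disambiguate`, and the fallback to a fixed default sense are all hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// The (lowercased) words surrounding the ambiguous word.
using Context = std::set<std::string>;

// A method returns a sense label, or an empty string when the context
// offers it no usable information.
using Method = std::function<std::string(const Context&)>;

// Hypothetical word-specific rule: cue words that point to one sense.
struct SenseCues {
    std::string sense;
    std::set<std::string> cues;
};

// Method based on co-occurrence words: pick the sense whose cue words
// appear most often in the context. Ties go to the sense listed first.
Method byCooccurrence(std::vector<SenseCues> table) {
    return [table](const Context& context) -> std::string {
        std::string best;
        std::size_t bestHits = 0;
        for (const auto& entry : table) {
            std::size_t hits = 0;
            for (const auto& cue : entry.cues) hits += context.count(cue);
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.sense;
            }
        }
        return best;  // empty when no cue word occurs in the context
    };
}

// Fallback method (an assumption of this sketch): always answer with a
// fixed default sense, e.g. the most frequent one.
Method byDefault(std::string sense) {
    return [sense](const Context&) -> std::string { return sense; };
}

// Try each method in order and return the first non-empty answer.
std::string disambiguate(const std::vector<Method>& methods, const Context& context) {
    for (const auto& method : methods) {
        const std::string sense = method(context);
        if (!sense.empty()) return sense;
    }
    return "";
}
```

With a cue table for “bank” that lists, say, “loan” and “interest” under one sense and “river” and “water” under another, `disambiguate({byCooccurrence(table), byDefault("financial")}, context)` returns the sense backed by the context, and the default only when no cue word is present. The cue words and sense labels here are invented for illustration. In the paper's terms, the hard part is compiling such word-specific tables, not the dispatch logic.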
Although it is widely accepted that rules should be generalized for practical
purposes, in WSD the specific semantic rules of the polysemous words turn out to be
much more important than the general rules. Since the semantic rules of polysemous
words are too multifarious to be generalized, we should work out the specific semantic
rules of the senses of each polysemous word, then solve the WSD issue word by word
until most of the frequently used polysemous words have been well disambiguated and
the quality of MT is satisfactory. This means painstaking effort must be made in
designing a WSD system. In fact, even the disambiguation of one sense of a
polysemous word can be an arduous and time-consuming task, let alone thousands
of polysemous words with thousands upon thousands of senses. Therefore, WSD is no
easy task, and to attempt to find a universally applicable approach is to beg for
frustration. Nonetheless, WSD is not a matter of impossibility, but a matter of time and
effort. The great effort invested in WSD will be well rewarded, because a successful
MT system is bound to be very profitable.
Note: “bank”, “old”, “draw”, “sweet” and “since” cover most of the word classes of
polysemous words, i.e. noun, verb, adjective, preposition and conjunction. The WSD
methods for these five words are very typical, as will be illustrated later.
Key Words: word sense disambiguation, machine translation,
co-occurrence word, WSD Machine, WSD Machine Sense Marker
Chapter One: Introduction
Chapter Two: Current Word Sense Disambiguation Methods
Chapter Three: A Pilot Study of Word Sense Disambiguation
§3.1 Adaptation of the word sense disambiguation method based on co-occurrence [...]
§3.1.1 The co-occurrence words of “bank”
§3.1.2 Limitation of the method based on co-occurrence words
§3.1.3 Other co-occurrence features
§3.1.4 Priority of monosemantic expressions in machine translation
§3.1.5 Compilation of a machine readable dictionary based on co-occurrence [...]
§3.1.6 “WSD Machine Sense Marker” and the word sense disambiguation [...]
§3.2 The key role of grammatical structures in disambiguating the senses of [...]
§3.2.1 Incompetence of the current word sense disambiguation methods in disambiguating the senses of “since”
§3.2.2 The different grammatical structures in which the two main senses are used
§3.2.3 Word sense disambiguation result of “since”
§3.3 Word sense disambiguation of “old”, “draw” and “sweet”
§3.3.1 Difference between the word sense disambiguation of “old” and “bank”
§3.3.2 Multifarious word sense disambiguation methods required in disambiguating the forty five senses of “draw”
§3.3.3 The ability of understanding required in disambiguating polysemous words in complicated situations
§3.3.4 Significance of probability in word sense disambiguation
§3.3.5 Application of componential analysis theory in compiling a machine readable dictionary
§3.3.6 Utilization of WordNet in the compilation of a machine readable dictionary
§3.3.7 Pragmatics and word sense disambiguation
§3.3.8 Example-based and statistics-based approaches in machine translation
Chapter Four: Conclusion
Appendix I: WSD Machine.cpp
Appendix II: Co-occurrence Frequencies Counter.cpp
Appendix III: WSD Machine Sense Marker.cpp
Appendix IV: WSD Result of “bank” (100 examples)
Appendix V: WSD Machine Installation Package
Papers Published, Research Projects Undertaken and Achievements During the Period of Study




