A Hierarchical Disambiguation Model in Machine Translation

Author: 牛悦. Date: 2024-11-19. 140 pages.
Chapter 1: Introduction
Ambiguity is very common in English, where most words are polysemous. For example,
the word palm means either a kind of tree or a part of the hand. Such polysemous
words and expressions can give rise to lexical ambiguity. The task of assigning the
appropriate meaning to an ambiguous word according to its context is called Word
Sense Disambiguation (WSD). WSD is vital to machine translation.
Humans can readily recognize the appropriate meaning of a word or expression. For a
machine, however, the task is far more difficult. Resolving lexical ambiguity
requires many knowledge resources, such as linguistic, contextual (pragmatic) and
cultural information. Owing to the limitations of current computer technology, most
contextual and cultural knowledge cannot yet be represented adequately, and WSD in
machine translation remains an unsolved problem.
With the development of related disciplines such as linguistics and computer
science, many WSD methods have been proposed to resolve lexical ambiguity.
Research on several commonly used methods shows that none of them can resolve
every kind of ambiguity; each has its own disambiguation reference and application
range. For example, the WordNet-based method mainly deals with nouns and takes
context nouns as its disambiguation reference. The method based on co-occurrence
features disambiguates polysemous words by means of their co-occurring words, and
the method based on selectional restriction can only be used when a verb or modifier
strongly restricts the meaning of a noun. Although these methods achieve high
accuracy on the ambiguities they target, each treats only a limited range of
ambiguity and has its own shortcomings. If any one of them is used to disambiguate
all the polysemous words in a text, the overall accuracy is very low. To obtain the
best disambiguation performance for a text, this thesis proposes a hierarchical
disambiguation model that combines six methods with different disambiguation
references and application ranges.
The hierarchical disambiguation model proposed in this thesis comprises six
steps: looking up idiom and collocation dictionaries, tagging parts of speech,
disambiguating nouns with context nouns, disambiguating verbs and nouns with
co-occurrence words, resolving ambiguous words whose meaning is strongly
constrained by selectional restrictions, and assigning the most frequently used
sense to words without any disambiguation reference.
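The cascade described above can be pictured as a chain of steps in which each step either returns a sense or passes the word on. The following is only a minimal sketch of that control flow, not the thesis's implementation: the step functions, the cue words and the sense labels are hypothetical toy data, and only two of the six steps are stubbed in.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A step takes the target word and its context words and returns a sense label,
# or None when it has no disambiguation reference for that word.
Step = Callable[[str, Sequence[str]], Optional[str]]


def context_noun_step(word: str, context: Sequence[str]) -> Optional[str]:
    # Toy rule (assumption): botanical context nouns select the "plant life" sense.
    if word == "plant" and {"leaves", "flower"} & set(context):
        return "plant#2"
    return None


def most_frequent_step(word: str, context: Sequence[str]) -> Optional[str]:
    # Toy fallback table (assumption): the sense listed first is taken as most frequent.
    return {"plant": "plant#1"}.get(word)


# Steps are tried in order; the fallback comes last.
STEPS: List[Tuple[str, Step]] = [
    ("context nouns", context_noun_step),
    ("most frequent sense", most_frequent_step),
]


def disambiguate(word: str, context: Sequence[str],
                 steps: Sequence[Tuple[str, Step]] = STEPS):
    """Return (sense, name of the step that decided), or (None, None)."""
    for name, step in steps:
        sense = step(word, context)
        if sense is not None:
            return sense, name
    return None, None
```

Ordering the steps from the most specific reference to the least specific one means the fallback is only reached when no earlier step applies.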
Evaluation results show that the WSD efficiency of the hierarchical
disambiguation model is much higher than that of any method working alone.
Chapter 2 An Overview of Current Word Sense Disambiguation Methods
Many disambiguation methods have been proposed since the 1950s. Among them, idiom
dictionary and compound words dictionary look-up, part-of-speech tagging, the
WordNet-based method, the method based on co-occurrence features, the method based
on selectional restriction and the most-frequent-sense method are popular and are
briefly introduced below.
Because they draw on different knowledge sources, each method has its own
application range and disambiguation reference. The application range is the kind of
ambiguity to which the method is applicable. The disambiguation reference is the set
of surrounding words used as semantic evidence to determine the appropriate meaning
of the target word. Appropriate methods for a target text can be chosen according to
the type of the text and the application range and disambiguation reference of each
method.
§2.1 Idiom Dictionary and Compound Words Dictionary Look-up
The meaning of a polysemous word in a fixed collocation can be determined
directly. For example, the word hand in the expression on the one hand means hand#7
(one of two sides of an issue), and the compound word black tea is translated as
红茶 in Chinese. The idiom dictionary and the compound words dictionary are needed
to deal with fixed expressions in natural languages. If the meanings of the words in
a fixed collocation can be found in these dictionaries, no further analysis is needed.
For expressions that can be used either as idioms or as ordinary phrases,
dictionary look-up may assign the wrong meaning. Take the phrase black sheep as an
example. In the sentence ‘We all thought my youngest brother was the black sheep
in our family (我们都将我的小弟弟视为我们家的败家子)’, the phrase black sheep
acts as an idiom. In the song ‘Baa, baa, black sheep, Have you any wool? Yes, sir, yes,
sir, Three bags full; One for the master, And one for the dame, And one for the little
boy Who lives down the lane (咩咩,黑羊,你有羊毛吗?是的先生,是的先生,
满三袋:一袋给主人,一袋给夫人,还有一袋给住在巷尾的小男孩)’, black sheep
is just an ordinary noun phrase. Deciding between the two readings of an expression
like black sheep requires other reference information and other WSD methods.
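Dictionary look-up for fixed expressions can be sketched as a greedy longest-match scan over the tokens of a sentence. The snippet below is an illustration under stated assumptions: the dictionary holds only the two entries mentioned in this section, and the names FIXED_EXPRESSIONS and lookup_fixed_expressions are hypothetical.

```python
# Toy dictionary of fixed expressions (assumption: real dictionaries are far larger).
FIXED_EXPRESSIONS = {
    ("on", "the", "one", "hand"): "hand#7 (one of two sides of an issue)",
    ("black", "tea"): "红茶",
}
MAX_LEN = max(len(key) for key in FIXED_EXPRESSIONS)


def lookup_fixed_expressions(tokens):
    """Return (start index, length, meaning) for each fixed expression found."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer expressions win over shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in FIXED_EXPRESSIONS:
                matches.append((i, n, FIXED_EXPRESSIONS[key]))
                i += n
                break
        else:
            # No expression starts at this token; move on.
            i += 1
    return matches
```

A look-up of this kind would also match black sheep wherever it occurs, which is exactly why the idiom and the literal reading still have to be separated by a later step.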
§2.2 Part-of-speech Tagging
Part-of-speech (PoS) tagging is the process by which a tagging system labels each
word in a text with its part of speech. Some words have different meanings depending
on their part of speech. Such words, called concurrences, have two or more
grammatical functions, and correspondingly different meanings, in different contexts.
If the part of speech of such a word can be determined, its appropriate meaning can
be determined as well. In other words, PoS tagging can resolve some of the lexical
ambiguities triggered by concurrences. For example, the word will means an intention
or wish as a noun, while it indicates futurity as an auxiliary verb:
a) He will meet his mother at the airport tomorrow.
b) His will makes him successful.
However, if a word is still polysemous within a given part of speech, PoS tagging
cannot disambiguate it. For example, the word plant has two commonly used meanings
as a noun: plant life and industrial works. In the sentence ‘He has worked in the
plant for 20 years’, even if the tagger labels plant as a noun, its meaning still
cannot be assigned, because both noun senses remain possible.
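The reach and the limit of PoS tagging can be shown with a sense inventory keyed by word and part of speech. This is a minimal sketch: the inventory contains only the two words discussed above, and the tag names NOUN and AUX are assumptions rather than the tag set used in the thesis.

```python
# Toy sense inventory keyed by (word, part of speech); entries are illustrative only.
SENSES = {
    ("will", "NOUN"): ["an intention or wish"],
    ("will", "AUX"): ["marker of futurity"],
    ("plant", "NOUN"): ["plant life", "industrial works"],
}


def senses_after_tagging(word, pos):
    """Return the senses that remain once the part of speech is known."""
    return SENSES.get((word, pos), [])


def resolved_by_pos(word, pos):
    """True when the PoS tag alone leaves exactly one sense."""
    return len(senses_after_tagging(word, pos)) == 1
```

Here will is resolved by its tag in both example sentences, whereas plant tagged as a noun still has two candidate senses and must be passed to another method.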
§2.3 WordNet-based Method
The WordNet-based method is a knowledge-based method that relies on
machine-readable dictionaries, which provide word senses and the relations between them.
Lesk (1986) first used a knowledge base of this kind to count the content words
that overlap between the sense definitions of the ambiguous word and those of the
context words occurring nearby, and selected the sense of the target word whose
signature contained the greatest number of overlaps. Lesk illustrated the method with
the word cone in the phrase pine cone. The appropriate sense of cone in the phrase
is chosen from its three senses by comparing the definitions of pine and cone.
pine: 1 kinds of evergreen tree with needle-shaped leaves
      2 waste away through sorrow or illness
cone: 1 solid body which narrows to a point
      2 something of this shape whether solid or hollow
      3 fruit of certain evergreen trees
In this example, sense 1 of pine shares two content words with sense 3 of cone:
evergreen and tree. Hence, if the two words occur in the same context, the sense of
cone can be determined as fruit of certain evergreen trees.
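The overlap count in the pine cone example can be reproduced with a few lines of code. This is a simplified sketch of definition overlap, not Lesk's original program: the stop-word list and the crude plural stripping (so that tree and trees match) are assumptions made for the illustration.

```python
import re

# Function words ignored when comparing definitions (assumed list).
STOPWORDS = {"a", "of", "or", "the", "this", "to", "which", "with", "whether", "through"}

PINE = {
    1: "kinds of evergreen tree with needle-shaped leaves",
    2: "waste away through sorrow or illness",
}
CONE = {
    1: "solid body which narrows to a point",
    2: "something of this shape whether solid or hollow",
    3: "fruit of certain evergreen trees",
}


def content_words(definition):
    """Lowercase, drop stop words, and strip a final plural 's' (crude normalization)."""
    words = re.findall(r"[a-z]+", definition.lower())
    normalized = set()
    for w in words:
        if w in STOPWORDS:
            continue
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        normalized.add(w)
    return normalized


def best_sense(target_senses, context_senses):
    """Return (sense number, overlap) for the target sense with the largest overlap."""
    best = (None, 0)
    for t_id, t_def in target_senses.items():
        for c_def in context_senses.values():
            overlap = len(content_words(t_def) & content_words(c_def))
            if overlap > best[1]:
                best = (t_id, overlap)
    return best
```

Calling best_sense(CONE, PINE) selects sense 3 of cone, with evergreen and tree as the two shared content words.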
Since Lesk (1986), many researchers have used machine-readable dictionaries as a
knowledge source for WSD. Agirre (1996) presented a disambiguation method that uses
the noun taxonomy provided by WordNet to resolve the lexical ambiguity of nouns.
WordNet, developed at the Princeton Cognitive Science Laboratory, is a large,
freely available lexical database that takes a hybrid approach to identifying,
defining and organizing word senses. Word senses in WordNet are defined as synsets.
Words are represented by their definitions, synonyms, antonyms, hypernyms
(superordinates), hyponyms (subordinates), coordinate terms and meronyms (parts).
WordNet has more than 118,000 word forms and about 90,000 synsets.
WordNet provides several kinds of information, such as synonyms, hypernyms and
hyponyms, meronyms, derivationally related terms, domain terms and familiarity.
These can be used to resolve ambiguity separately or in sequence, depending on the
type of text and the requirements of the WSD model. In this way, a program can draw
on a large body of disambiguating information from a freely available database
rather than from a sense-tagged corpus, which is time-consuming and labor-intensive
to build.
WordNet is freely available at http://wordnet.princeton.edu/online/.
Take the noun plant again as an example.
a) Definitions of plant:
plant 1, works, industrial plant -- (buildings for carrying
on industrial labor; "they built a large plant to manufacture
automobiles")
plant 2, flora, plant life -- (a living organism lacking the
power of locomotion)
plant 3 -- (something planted secretly for discovery by
another; "the police used a plant to trick the thieves"; "he
claimed that the evidence against him was a plant")
plant 4 -- (an actor situated in the audience whose acting is
rehearsed but seems spontaneous to the audience)
b) Hypernym synsets of plant
plant#1: building complex => structure => artifact => object => entity
plant#2: life form => entity
plant#3: contrivance => scheme => plan of action => plan => idea => content =>
         cognition => psychological feature
plant#4: actor => performer => entertainer => person => life form => entity
c) Synonyms of plant
4 senses of plant
Sense 1
plant, works, industrial plant -- (buildings for carrying on
industrial labor; "they built a large plant to manufacture
automobiles")
=> building complex, complex -- (a whole
structure (as a building) made up of interconnected or related
structures)
Sense 2
plant, flora, plant life -- (a living organism lacking the
power of locomotion)
=> organism, being -- (a living thing that has (or
can develop) the ability to act or function independently)
Sense 3
plant -- (something planted secretly for discovery by
another; "the police used a plant to trick the thieves"; "he
claimed that the evidence against him was a plant")
=> contrivance, stratagem, dodge -- (an
elaborate or deceitful scheme contrived to deceive or evade; "his
testimony was just a contrivance to throw us off the track")
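The hypernym chains listed under b) can be held in a small table and compared programmatically, for instance to find the most specific hypernym that two senses share. The sketch below only illustrates how a noun taxonomy can be queried; it is not the method of Agirre (1996), and the function name is hypothetical. The chains are copied from the listing above, ordered from the most specific hypernym to the most general.

```python
# Hypernym chains of the four noun senses of plant, most specific hypernym first.
HYPERNYMS = {
    "plant#1": ["building complex", "structure", "artifact", "object", "entity"],
    "plant#2": ["life form", "entity"],
    "plant#3": ["contrivance", "scheme", "plan of action", "plan", "idea",
                "content", "cognition", "psychological feature"],
    "plant#4": ["actor", "performer", "entertainer", "person", "life form", "entity"],
}


def lowest_common_hypernym(sense_a, sense_b):
    """Return the most specific hypernym shared by two senses, or None."""
    chain_b = set(HYPERNYMS[sense_b])
    # Walk upward from sense_a; the first shared node is the most specific one.
    for node in HYPERNYMS[sense_a]:
        if node in chain_b:
            return node
    return None
```

In this table plant#2 and plant#4 meet at life form, plant#1 and plant#2 meet only at entity, and plant#3 shares no hypernym with the other three senses.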