Extracting Data of Interest to Users from Web Pages
Contents
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
1.1 Background and significance of the topic
1.2 Web information extraction
1.3 Applications of Web information extraction
1.4 Research content of this thesis
1.5 Organization of this thesis
Chapter 2 Related Standards and Web Information Extraction Techniques
2.1 Introduction
2.2 Related standards
2.2.1 XML
2.2.2 XHTML
2.2.3 DOM
2.2.4 XPath
2.2.5 XSLT
2.3 Overview of Web information extraction techniques
2.3.1 Classification of Web information extraction techniques
2.3.2 Open problems in Web information extraction
2.3.3 Key techniques of Web information extraction
2.3.4 Evaluation metrics for information extraction systems
2.4 Literature review
2.5 Chapter summary
Chapter 3 An XML-Based Web Information Extraction Platform
3.1 Overview
3.1.1 Goals of the platform
3.1.2 Basic design ideas
3.1.3 The roles of XML and XSLT in the platform
3.1.4 Data-oriented pages
3.2 Overall framework of the platform
3.3 Knowledge bases and databases in the platform
3.3.1 Building the domain knowledge base
3.3.2 Extraction rule base
3.3.3 Extraction result database and Web page database
3.4 Page optimization module
3.4.1 Cleaning (TIDY) page documents
3.4.2 Page parsing (PARSER)
3.5 Information extraction module
3.5.1 Basis of rule learning
3.5.2 Steps of rule learning
3.5.3 Description of the information extraction process
3.6 Literature review
3.7 Chapter summary
Chapter 4 Optimization of Extraction Rules
4.1 Optimization methods for information location
4.1.1 Tree-path-based location
4.1.2 Location combining path and content
4.1.3 Purely text-based location
4.1.4 Attribute-based location
4.2 Summary of the location methods
4.3 Chapter summary
Chapter 5 Conclusion
5.1 Experimental example
5.2 Research work of this thesis
5.3 Further work
Papers published during the degree program
Acknowledgements
References
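Abstract (translated from the Chinese)
With the rapid growth of the Web, obtaining the information one wants from it has become a pressing problem, which makes information extraction necessary. A wrapper is a program that extracts data from Web pages. The problem that information extraction research must solve is to construct wrappers that are as accurate, robust, and general as possible, so that they are unaffected by differences in site structure and by changes in page structure, while keeping human involvement to a minimum.
Many methods for generating wrappers have been proposed, but each has its own limitations and struggles to meet high requirements for accuracy, robustness, and generality. This thesis studies and analyzes these methods.
This thesis applies standard XML technologies to the information extraction problem and proposes an XML-based Web information extraction platform. An inductive learning algorithm is used to find and identify the data of interest. The strengths of XSLT and XPath in locating and transforming data are used to solve the key problem in information extraction: writing extraction rules.
Finally, this thesis studies the optimization of extraction rules and compares several information location methods, with the aim of writing simpler, more robust, and more general extraction rules on that basis.
Keywords: information extraction; XML; XSLT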
Abstract
With the explosive growth of the Web, how users can get the piece of
information they want from it has become a serious problem, so
information extraction from Web pages is necessary. A wrapper is a
program that performs the extraction task. The key problem is how to
construct accurate, robust, and adaptable wrappers without much human
intervention; a wrapper should be independent of particular Web sites
and unaffected by changes to Web pages.
Many approaches have been proposed to generate wrappers, but each
has limitations that make it hard to be accurate, robust, and general
at the same time. This thesis studies and analyzes those approaches.
This thesis applies standard XML technologies to the Web extraction
problem and develops an XML-based Web information extraction platform.
An inductive learning algorithm locates and identifies the information
blocks of interest. Standard XSLT and XPath are used, exploiting their
strengths in data location and conversion, to solve the key problem:
writing extraction rules.
Finally, this thesis studies the optimization of extraction rules
and compares several information location methods. The aim is to
generate simple, robust, and general extraction rules.
Key Words: Information Extraction; XML; XSLT
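The idea of an extraction rule written as path expressions can be illustrated with a minimal sketch. It is not the platform described in the thesis: the page fragment, the `RULE` dictionary, and the `extract` and `to_xml` helpers are all hypothetical, and Python's standard `xml.etree.ElementTree` supports only a subset of XPath and no XSLT, so the conversion step is imitated in plain Python instead of a stylesheet. The sketch assumes the page has already been cleaned into well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, assumed to be already cleaned into well-formed XHTML.
PAGE = """
<html>
  <body>
    <table id="books">
      <tr class="item"><td class="title">XML Basics</td><td class="price">25.00</td></tr>
      <tr class="item"><td class="title">XSLT in Practice</td><td class="price">32.50</td></tr>
    </table>
  </body>
</html>
"""

# Hypothetical extraction rule: one path that locates each repeating record,
# plus one relative path per field inside a record.
RULE = {
    "record": ".//table[@id='books']/tr[@class='item']",
    "fields": {"title": "td[@class='title']", "price": "td[@class='price']"},
}


def extract(page_source, rule):
    """Apply the rule to the page and return one dict per located record."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(rule["record"]):
        record = {}
        for name, path in rule["fields"].items():
            # findtext returns the text of the first match, or the default if none.
            record[name] = node.findtext(path, default="").strip()
        records.append(record)
    return records


def to_xml(records):
    """Convert the extracted records into a small XML result document."""
    out = ET.Element("results")
    for record in records:
        item = ET.SubElement(out, "book")
        for name, value in record.items():
            ET.SubElement(item, name).text = value
    return ET.tostring(out, encoding="unicode")


records = extract(PAGE, RULE)
print(records)
print(to_xml(records))
```

Here the record path locates data by attribute values (`@id`, `@class`) rather than by absolute position in the tree, so the rule keeps working if unrelated elements are added around the table. This is one reason a rule of this kind can be more robust than a purely positional path.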
Author: 朱铭铭
Pages: 59
Size: 2.44 MB
Format: DOC
Date: 2024-09-20