Extracting Data of Interest to Users from Web Pages

朱铭铭 · 2024-09-20 · 2.44 MB · 59 pages

Contents
Abstract (Chinese)
Abstract
Chapter 1 Introduction
1 Background and significance of the topic
2 Web information extraction
3 Applications of Web information extraction
4 Research content of this thesis
5 Organization of this thesis
Chapter 2 Related Standards and Web Information Extraction Techniques
1 Introduction
2 Related standards
2.1 XML
2.2 XHTML
2.3 DOM
2.4 XPath
2.5 XSLT
3 Overview of Web information extraction techniques
3.1 Classification of Web information extraction techniques
3.2 Open problems in Web information extraction
3.3 Key techniques of Web information extraction
3.4 Evaluation metrics for information extraction systems
4 Literature review
5 Chapter summary
Chapter 3 An XML-Based Web Information Extraction Platform
1 Overview
1.1 Goals of the platform
1.2 Basic design ideas
1.3 The roles of XML and XSLT in the platform
1.4 Data-oriented pages
2 Overall framework of the platform
3 Knowledge bases and databases in the platform
3.1 Building the domain knowledge base
3.2 Extraction rule base
3.3 Extraction result database and Web page database
4 Page optimization module
4.1 Cleaning (TIDY) page documents
4.2 Page parsing (PARSER)
5 Information extraction module
5.1 Basis for rule learning
5.2 Steps of rule learning
5.3 Description of the information extraction process
6 Literature review
7 Chapter summary
Chapter 4 Optimizing Extraction Rules
1 Optimization methods for information location
1.1 Tree-path-based location
1.2 Location combining path and content
1.3 Purely text-based location
1.4 Attribute-based location
2 Summary of the location methods
3 Chapter summary
Chapter 5 Conclusion
1 Experimental example
2 Research work of this thesis
3 Further work
Papers published during the degree program
Acknowledgements
References

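Abstract (Chinese)

With the rapid growth of the Web, how to obtain the desired information from it has become a pressing problem, which makes information extraction necessary. A wrapper is a program that extracts data from web pages. The problem that information extraction research must solve is how to build wrappers that are as accurate, robust, and general as possible, so that they are unaffected by differences between site structures and by changes in page structure, while keeping human involvement to a minimum.

Many methods for generating wrappers have been proposed, but each has its own limitations and struggles to meet high standards of accuracy, robustness, and generality. This thesis studies and analyzes these methods.

This thesis applies standard XML technologies to the information extraction problem and proposes a Web information extraction platform based on XML. It uses an inductive learning algorithm to find and identify the data of interest, and it exploits the strengths of XSLT and XPath in locating and transforming data to solve the key problem in information extraction: writing extraction rules.

Finally, this thesis studies the optimization of extraction rules and compares several information location methods, with the aim of writing simpler, more robust, and more general extraction rules on that basis.

Keywords: information extraction, XML, XSLT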

Abstract

With the explosive growth of the Web, how users can obtain the information they want from it has become a serious problem, so extracting information from web pages is necessary. A wrapper is a program that performs the extraction task. The key problem is how to construct accurate, robust, and adaptable wrappers without much human intervention; a wrapper should be independent of particular web sites and unaffected by changes to web pages.

Many approaches have been proposed to generate wrappers, but each has limitations that make it hard to achieve accuracy, robustness, and generality. This thesis studies and analyzes those approaches.

This thesis applies standard XML technologies to the web extraction problem and develops a web information extraction platform based on XML. An inductive learning algorithm locates and identifies the information blocks of interest. The thesis uses standard XSLT and XPath, exploiting their strengths in data location and transformation, to solve the key problem: writing extraction rules.

Finally, this thesis studies the optimization of extraction rules and compares several information location methods. The aim is to generate simple, robust, and general extraction rules.

Keywords: information extraction, XML, XSLT
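
To make the idea of an XPath-based extraction rule concrete, the following is a minimal sketch and not the platform described in the thesis. It assumes the page has already been cleaned into well-formed XHTML, and it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` module instead of a full XSLT processor. The sample page, the `RULE` dictionary layout, and the `extract` function are all hypothetical names introduced for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-cleaned (well-formed) page; not taken from the thesis.
PAGE = """
<html>
  <body>
    <table id="books">
      <tr class="item"><td class="title">XML Basics</td><td class="price">30</td></tr>
      <tr class="item"><td class="title">XSLT in Practice</td><td class="price">45</td></tr>
    </table>
  </body>
</html>
"""

# Hypothetical extraction rule: one path that locates each record,
# plus one relative path per field inside the record.
RULE = {
    "record": ".//table[@id='books']/tr[@class='item']",
    "fields": {
        "title": "td[@class='title']",
        "price": "td[@class='price']",
    },
}


def extract(page_xml, rule):
    """Apply the rule to a well-formed page and return a list of records."""
    root = ET.fromstring(page_xml)
    records = []
    for node in root.findall(rule["record"]):
        record = {
            name: node.findtext(path, default="").strip()
            for name, path in rule["fields"].items()
        }
        records.append(record)
    return records


records = extract(PAGE, RULE)
for record in records:
    print(record)
```

The rule here locates data by attribute values (`id`, `class`) instead of by absolute position in the tree, which tends to survive small layout changes better than a fixed path such as `body/table/tr/td`; how the thesis itself weighs these location methods is covered in its Chapter 4, which is not part of this preview.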
