Internet Aviation Data Scraping System

Author: Li Linlin  Date: 2024-10-12  Length: 78 pages  Size: 1.1MB
Northeastern University Master's Thesis: Chinese Abstract
Research and Design of a Web Data Scraping System
Abstract
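With the rapid development of the Internet, we have entered an era of information explosion. How
to browse and use this information has become a matter of concern, and the concept of Web data
scraping arose in response. Data scraping makes it possible to find the information we care about
accurately, which greatly improves the efficiency of using the network. This thesis applies
software engineering principles to the problem of scraping airline ticket data. Using an
object-oriented approach combined with encapsulation, it develops a dedicated data scraping
algorithm. This addresses the inconvenience that air travelers face when searching for tickets and
improves the accuracy and efficiency of ticket searches.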
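The thesis first introduces information extraction technology, its background, and its history,
and analyzes the architecture and key technologies of information extraction systems. It then
discusses the approaches to Web information extraction, the main learning algorithms, and the
evaluation criteria. Because the thesis collects data from HTML, a semi-structured Web language,
it studies existing XML technology and on that basis proposes a fairly general tree-structured
extraction rule that suits the structure of XML. The rule extracts data from the Web and
integrates it into XML documents of a specified schema. Extracted Web information is of no value
if users cannot use it freely. To make the extracted data easy to reuse, the thesis also proposes
an XML-based Web query model.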
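In terms of system design, the system is divided into three main modules: ticket query, flight
timetable query, and connecting itineraries. To improve the query accuracy of existing airline
ticketing websites, the relevant information from those websites must be processed in several
ways. This processing combines the relevant elements of each sub-module for overall planning,
execution, and control, so as to reach an optimal solution. The connecting-flight model is an
important part of the design. It aggregates the one-way flight information of the major airlines,
extracts the corresponding fares, fuel surcharges, airport construction fees, and other flight
information, and then adds actual time constraints to assemble a reasonable connecting itinerary.
Each itinerary shows the customer the connecting fare, the departure and arrival times and
airports of the two flights, and the names of the airlines flown. The system has a clear structure
and is simple to operate, making it an essential tool for air travelers.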
Keywords: Web data scraping; XML; HTML structure tree; JSP technology
Northeastern University Master's Thesis: English Abstract
Study and Design of an Internet Data Gathering System
Abstract
With the rapid development of the Internet, we have entered an era of information explosion. How
to browse and use this information has become a pressing concern, and the concept of Web data
scraping emerged in response. Data scraping makes it possible to locate the information we care
about accurately and greatly improves the efficiency of using the network. This thesis applies
software engineering principles to the problem of scraping airline ticket data. Using an
object-oriented approach combined with encapsulation, it develops a dedicated data scraping
algorithm. This addresses the inconvenience that air travelers face when searching for tickets and
improves the accuracy and efficiency of ticket searches. The system's functions are designed in a
modular way, which makes the system more precise and easier to operate. SQL Server 2000 is used
as the database, and complex business logic is placed in database stored procedures, so that
system functions are separated from the data and stable operation is ensured.
This thesis first introduces the background and history of information extraction, and analyzes
the system architecture, the taxonomy of information extraction, its key technologies, and its
evaluation measures. Because the thesis gathers data from HTML, a semi-structured Web language,
it studies information extraction techniques and XML technology. On this basis it proposes a
fairly general tree-structured extraction rule that fits the structure of XML well. The rule can
extract Web data into an XML document of a specified schema. If users cannot use the information
as they please after it has been extracted, the extraction is worthless, so the author also
presents a Web query model based on XML. In summary, Web information extraction is combined with
XML storage and access techniques to maximize the reuse of Web data.
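The idea of a tree-structured extraction rule that writes Web data into an XML document can be illustrated with a small Python sketch. This is not the thesis's algorithm, which is not given in this front matter. It assumes well-nested HTML without void elements, a rule expressed as "every row tag, with its cell tags mapped to field names in order", and invented sample data; the names `TreeBuilder`, `find_all`, `extract_to_xml`, and `SAMPLE_HTML` are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class Node:
    """One element of the HTML tree: tag name, parent, children, direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-nested HTML (assumption: no void elements)."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Well-nested input is assumed, so closing a tag returns to the parent.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def find_all(node, tag):
    """Yield every descendant of `node` with the given tag, in document order."""
    for child in node.children:
        if child.tag == tag:
            yield child
        yield from find_all(child, tag)


def extract_to_xml(html, row_tag, cell_tag, fields, root_name, item_name):
    """Apply the rule to the HTML tree and return the result as an XML string."""
    builder = TreeBuilder()
    builder.feed(html)
    root = ET.Element(root_name)
    for row in find_all(builder.root, row_tag):
        cells = [cell.text for cell in find_all(row, cell_tag)]
        if len(cells) != len(fields):
            continue  # skip header rows and rows that do not match the rule
        item = ET.SubElement(root, item_name)
        for name, value in zip(fields, cells):
            ET.SubElement(item, name).text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample page; flight numbers and fares are placeholders.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Flight</th><th>From</th><th>Fare</th></tr>"
    "<tr><td>XX1001</td><td>SHE</td><td>600</td></tr>"
    "<tr><td>YY2002</td><td>PEK</td><td>1030</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    print(extract_to_xml(SAMPLE_HTML, "tr", "td",
                         ["number", "origin", "fare"], "flights", "flight"))
```

Keeping the rule as data (tag names plus a field list) rather than code means the same extractor could be pointed at a differently structured page by changing only the rule, which is one plausible reading of "fairly general".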
The system consists of three modules: ticket query, flight timetable query, and connecting-flight
itineraries. To improve the query accuracy of existing airline ticketing websites, the relevant
information from those websites has to be processed in several ways. This processing brings
together the relevant elements of each sub-module for overall planning, execution, and control, in
order to reach an optimal solution. The connecting-flight model is an important part of the
design. It aggregates the one-way flight information of the major airlines, extracts the
corresponding fares, fuel surcharges, airport construction fees, and other flight information, and
then applies actual time constraints to assemble a reasonable connecting itinerary. For each
itinerary, the customer is shown the combined fare, the departure and arrival times and airports
of both flights, the names of the operating airlines, and so on. The system has a clear structure
and is easy to operate, which makes it a useful tool for air travelers.
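The connecting-flight model described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the time constraint is modeled as a layover window whose default bounds (1 to 6 hours) are invented, the `Leg` record and `connecting_itineraries` function are hypothetical names, and all sample flights and prices are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Leg:
    """One one-way flight, as it might be scraped from an airline site."""
    airline: str
    origin: str
    destination: str
    departs: datetime
    arrives: datetime
    fare: int
    fuel_surcharge: int
    airport_fee: int  # airport construction fee

    @property
    def total(self):
        return self.fare + self.fuel_surcharge + self.airport_fee


def connecting_itineraries(legs, origin, destination,
                           min_layover=timedelta(hours=1),
                           max_layover=timedelta(hours=6)):
    """Pair legs origin -> hub -> destination whose layover fits the window.

    Returns (total_price, first_leg, second_leg) tuples, cheapest first.
    The layover bounds are assumed defaults, not values from the thesis.
    """
    firsts = [leg for leg in legs
              if leg.origin == origin and leg.destination != destination]
    seconds = [leg for leg in legs
               if leg.destination == destination and leg.origin != origin]
    plans = []
    for first in firsts:
        for second in seconds:
            if first.destination != second.origin:
                continue  # the two legs must meet at the same airport
            layover = second.departs - first.arrives
            if min_layover <= layover <= max_layover:
                plans.append((first.total + second.total, first, second))
    plans.sort(key=lambda plan: plan[0])
    return plans


def _t(hour, minute=0):
    """Helper for the invented sample data: a time on an arbitrary day."""
    return datetime(2024, 1, 1, hour, minute)


# Invented one-way legs; airline codes "XX" and "YY" are placeholders.
SAMPLE_LEGS = [
    Leg("XX", "SHE", "PEK", _t(8), _t(9, 30), 500, 50, 50),
    Leg("YY", "PEK", "CAN", _t(11), _t(14), 900, 80, 50),
    Leg("YY", "PEK", "CAN", _t(9, 50), _t(12, 50), 700, 80, 50),  # layover too short
    Leg("XX", "SHE", "PVG", _t(7), _t(9), 600, 50, 50),
    Leg("YY", "PVG", "CAN", _t(12), _t(14, 30), 750, 80, 50),
]

if __name__ == "__main__":
    for price, first, second in connecting_itineraries(SAMPLE_LEGS, "SHE", "CAN"):
        print(price, first.origin, first.destination, second.destination)
```

Sorting on the total price alone (rather than on the whole tuple) avoids comparing `Leg` objects when two itineraries cost the same.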
Keywords: Web data gathering; XML; HTML structure tree; JSP technology
Northeastern University Master's Thesis: Table of Contents
Table of Contents
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
1.1 Research Background
1.2 Research Status at Home and Abroad
1.3 Significance of This Research
1.4 Main Research Content
Chapter 2 Web Information Extraction Technology
2.1 Architecture of Information Extraction Systems
2.2 Key Technologies in Information Extraction
2.2.1 Named Entity Recognition
2.2.2 Syntactic Analysis
2.2.3 Discourse Analysis and Inference
2.2.4 Knowledge Acquisition
2.3 Classification of Web Information Extraction
2.3.1 Web Content Extraction
2.3.2 Web Structure Extraction
2.3.3 Web Usage Record Extraction
2.4 Approaches to Web Information Extraction
2.5 Evaluation Criteria for Web Information Extraction
2.6 Chapter Summary
Chapter 3 System Requirements Analysis and Implementation Technologies
3.1 Business Requirements Analysis
3.2 Functional Requirements
3.2.1 Fast Query of One-Way Information
3.2.2 Connecting-Itinerary Feature
3.3 Business ...
3.2 Tree-Structured Web Data Extraction Rules
3.2.1 Representing Web Documents as Tree Structures
3.2.2 Overall Approach of the Algorithm
3.2.3 XML Output of Extracted Data
3.3 DOM-Based XML Data Access Mechanism
3.3.1 XML Data Islands
3.3.2 Accessing XML Documents with DOM
3.4 Chapter Summary
Chapter 4 Analysis of Methods for Integrating XML and Relational Data
4.1 Relational Storage of XML Data
4.1.1 How to Build the Relational Mapping
4.1.2 Models for Relational Mapping
4.2 XML-Based Web Query Processing
4.2.1 Existing XML Query Languages
4.2.2 XML-Based Web Query Model
4.3 Chapter Summary
Chapter 5 System and Key Technology Implementation
5.1 System Platform
5.1.1 Java Technology
5.1.2 J2EE System Platform
5.1.3 SQL Server Database
5.2 Overall System Module Design
5.2.1 Ticket Query
5.2.2 Timetable Query
5.2.3 Connecting Itineraries
5.3 Data Design
5.4 Core System Algorithm
5.5 Key Technologies Used
5.6 HTTP Protocol