Website Crawler and Retrieval System Based on Lucene
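Abstract
With the growth of the Internet and the increasing wealth of Web resources, using Web full-text
information retrieval systems to obtain information has become an important part of daily life, and
users increasingly care about how to find information more accurately and efficiently.
This thesis introduces the theory and techniques behind Web information retrieval and system
implementation, and puts information retrieval into practice in Web full-text retrieval. Chapter 2
introduces the relevant theory: the types of search engine, Chinese word segmentation methods, the
inverted index, and the theory and usage of Lucene, which this thesis relies on. Chapter 3, based on
the characteristics of web pages, proposes two algorithms for analyzing web page templates. The first
is based on the longest common subsequence model and uses dynamic programming to find an optimal
solution; it optimizes and extends the algorithm from the original literature so as to obtain both the
page template string and the strings inserted into it. The second draws on statistical theory and
abstracts the page template as a mathematical model: it extracts the start and end positions of common
page markers, computes the variance, which differs with the length of the body text, and so determines
where the body text lies in the page and extracts it. This saves space and reduces the time needed to
build and search the index. The chapter closes by comparing the strengths and weaknesses of the two
algorithms. Chapter 4 presents a web spider written in Java, including the processing of heterogeneous
data, such as extracting content from Word, PDF, and RTF files, together with an HTML parsing method
and the use of multithreading. Chapter 5 implements a web page crawling system that automatically
downloads the page information specified by the user, including the content and the next page. To
improve retrieval efficiency, the Lucene package is used to build an index, which speeds up searching,
improves the accuracy and timeliness of results, and saves considerable storage space.
This thesis proceeds from both theory and practice, covering algorithm design and analysis as well as
concrete implementation. It uses software and languages including Oracle, Tomcat, JSP, Java, Eclipse,
and Lucene, and devises an original HTML parsing method that saves users time and improves their
efficiency.
Keywords: Web; search engine; Java; Lucene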
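The first template algorithm summarized above rests on the longest common subsequence (LCS) of two pages that share a template. The following is a minimal sketch of the textbook dynamic-programming LCS, not the optimized and extended variant the thesis develops. It assumes each page has already been split into a list of tokens (tags and text fragments); the class name `TemplateLcs` and the methods `lcs` and `inserted` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TemplateLcs {

    /** Returns one longest common subsequence of two token lists (the candidate template). */
    static List<String> lcs(List<String> a, List<String> b) {
        int n = a.size();
        int m = b.size();
        // len[i][j] = LCS length of the first i tokens of a and the first j tokens of b.
        int[][] len = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i - 1][j], len[i][j - 1]);
                }
            }
        }
        // Walk back from the bottom-right corner to recover the tokens themselves.
        List<String> out = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 && j > 0) {
            if (a.get(i - 1).equals(b.get(j - 1))) {
                out.add(a.get(i - 1));
                i--;
                j--;
            } else if (len[i - 1][j] >= len[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        Collections.reverse(out);
        return out;
    }

    /**
     * Returns the tokens of a page that are not matched against the template,
     * using a greedy left-to-right alignment (an assumption of this sketch).
     */
    static List<String> inserted(List<String> page, List<String> template) {
        List<String> out = new ArrayList<>();
        int t = 0;
        for (String token : page) {
            if (t < template.size() && token.equals(template.get(t))) {
                t++; // token belongs to the template
            } else {
                out.add(token); // token was inserted into the template
            }
        }
        return out;
    }
}
```

Calling `lcs` on two tokenized pages from the same site gives the shared template tokens, and `inserted` then gives the page-specific tokens for each page. The table costs O(n·m) time and memory, which is one reason a production version would need the kind of optimization the abstract mentions.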
Website Crawler and Retrieval System Based on Lucene
Abstract
With the development of networks and the expansion of Web resources, obtaining information through
Web full-text information retrieval systems has become an important part of everyday life, and users
are increasingly concerned with how to find information more accurately and efficiently.
This thesis introduces Web information retrieval and the related theories and techniques, and reports
practical work on retrieving information with a Web full-text retrieval system. Chapter 2 introduces
the relevant theory: the types of search engine, Chinese word segmentation methods, the inverted
index, and Lucene. Chapter 3, based on the characteristics of web pages, proposes two page template
analysis algorithms. The first is based on the longest common subsequence model and uses dynamic
programming to obtain an optimal solution; the original algorithm is optimized and extended so that it
yields both the page template string and the strings inserted into it. The second uses statistical
theory to build a mathematical model: it extracts the start and end positions of the common sequence
and, because content of different lengths gives different variances, locates and extracts the useful
content of the page. This saves space and reduces indexing and search time. The chapter closes by
comparing the advantages and shortcomings of the two algorithms. Chapter 4 introduces a web spider
written in Java, including the processing of heterogeneous data, such as extracting content from Word,
PDF, and RTF files, followed by a new HTML parsing method and the use of multithreading. Chapter 5
introduces a web page crawling system that automatically downloads information from the Internet,
including content, pictures, and the next page; Lucene is added to index this information and so
improve the efficiency and speed of retrieval.
In this thesis I design and analyse two algorithms, using software and programming languages including
Oracle, Tomcat, JSP, Java, Eclipse, and Lucene. More importantly, I created an HTML parsing method and
integrated Lucene into the system to reduce retrieval time and improve search efficiency.
Keywords: Web; search engine; Lucene; Java
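The inverted index mentioned in the abstract maps each term to the set of documents that contain it, so a query only touches the postings of its own terms instead of scanning every page. The sketch below illustrates that idea in plain Java and deliberately does not use the Lucene API. The class name `TinyInvertedIndex`, its methods, and the simple letter-and-digit tokenizer are assumptions made for illustration; Chinese text would need a proper word segmenter, since this tokenizer treats an unbroken run of Chinese characters as a single term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class TinyInvertedIndex {

    // term -> sorted set of ids of the documents containing that term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Splits text into lowercase terms on any run of non-letter, non-digit characters. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Adds one document to the index. */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the ids of documents containing every query term (AND semantics), ascending. */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return new ArrayList<>(); // a term that appears nowhere means no match
            }
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs); // intersect the posting lists
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

A real index such as the one Lucene builds also stores term positions and frequencies and keeps the postings on disk; this sketch keeps only document ids in memory to show the lookup structure.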
Contents
Declaration of Originality
Abstract (Chinese)
Abstract (English)
Contents
Chapter 1 Introduction
1.1 Research Background
1.2 Content and Significance of the Topic
1.3 Work Done by the Author
1.4 Thesis Structure
Chapter 2 Related Theory
2.1 Search Engine Theory
2.1.1 Definition of a Search Engine
2.1.2 Characteristics of Search Engines
2.1.3 Classification of Search Engines
2.2 Chinese Word Segmentation
2.2.1 Single-Character Segmentation
2.2.2 Two-Character Segmentation
2.2.3 Dictionary-Based Segmentation
2.3 Text Search
2.3.1 Overview of Text Search
2.3.2 Processing English Text
2.3.3 Inverted Index
2.4 Introduction to Lucene and Source Code Analysis
2.4.1 Overview of Lucene
2.4.2 Building an Index with Lucene
2.4.3 Lucene Analyzers (Analyzer)
Chapter 3 Web Page Template Extraction
3.1 Web Page Templates
3.2 The Role of Web Page Templates
3.3 Statement of the Problem
3.4 Problem Models and Solutions
3.4.1 Preliminary Model
3.4.2 Longest Common Subsequence Model
3.4.3 Statistical Model
Author: 朱铭铭
Length: 49 pages
Size: 1.72 MB
Format: DOC
Date: 2024-09-20