§5.1 Web Page Classification Techniques ........................................................ 33
§5.1.1 Text Classification Techniques .......................................................... 33
§5.1.2 Characteristics of Web Page Classification .............................................. 34
§5.1.3 Related Work on Web Page Classification ................................................. 34
§5.2 Design of the Web Page Classification Module ............................................... 35
§5.2.1 Module Framework ......................................................................... 36
§5.2.2 Web Page Preprocessing ................................................................... 38
§5.2.2.1 HTML Parsing ........................................................................... 38
§5.2.2.2 English Grammatical Analysis and Chinese Word Segmentation ............................. 38
§5.2.2.3 Stop Word Removal ...................................................................... 38
§5.2.2.4 Term Frequency Calculation and Inverted Index Construction ............................. 38
§5.2.3 Extraction of the Category Feature Lexicon ............................................... 39
§5.2.3.1 Category Document Modeling ............................................................. 39
§5.2.3.2 Trivial Word Filtering ................................................................. 40
§5.2.3.3 Classification Confidence Calculation .................................................. 40
§5.2.3.4 Extracting Category Feature Words ...................................................... 41
§5.2.4 Classifier ............................................................................... 41
§5.3 Web Page Classification Experiments ........................................................ 42
§5.3.1 Training Web Page Set .................................................................... 43
§5.3.2 Building the Category Feature Lexicon .................................................... 43
§5.3.3 Determining Category Thresholds .......................................................... 44
§5.3.3.1 Category Confidence Calculation for Web Pages in the Four-Category Training Set ........ 45
§5.3.3.2 Category Confidence Calculation for Web Pages in the "Other" Category Training Set ..... 45
§5.3.3.3 Threshold Calculation .................................................................. 47
§5.4 Chapter Summary ............................................................................ 48
Chapter 6 Search Engine Personalization Techniques and Their Implementation ..................... 49
§6.1 Techniques Related to Personalized Web Information Retrieval ............................... 50
§6.1.1 Personalized Web Page Weighting .......................................................... 50
§6.1.2 Query Refinement ......................................................................... 51
§6.2 A Page Ranking Algorithm with an Added IP Influence Factor .................................. 53
§6.2.1 Lucene's Page Ranking Algorithm .......................................................... 53
§6.2.2 Algorithm Improvement .................................................................... 54
§6.3 Implementation of Personalized Search ...................................................... 55
§6.4 Chapter Summary ............................................................................ 56
Chapter 7 System Integration .................................................................... 57
§7.1 Java Back-End Application .................................................................. 57
§7.1.1 The Nutch09Dev Main Project .............................................................. 58
§7.1.2 The WordSegmentation and MyLucene Projects ............................................... 59
§7.2 Web Application ............................................................................ 60
§7.3 Chapter Summary ............................................................................ 62
Chapter 8 Conclusions and Outlook ............................................................... 63
§8.1 Summary of This Work ....................................................................... 63
§8.2 Future Outlook for Intranet Search ......................................................... 63