Research on Web Text Discovery and Its Application in Online Advertisement Placement

Author: 陈辉; Date: 2024-11-19; Length: 85 pages; Size: 879.44 KB
Abstract (Chinese)
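The Internet has become the world's largest repository of information, and this
vast body of Web data contains knowledge of great potential value. Information on
the Internet is stored as Web pages, and page content is mostly expressed as text;
but pages are structurally complex and stylistically diverse, and together they
form an enormous, heterogeneous, open, distributed database. How to discover
potentially valuable data and knowledge in this mass of information quickly and
effectively has become an urgent problem.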
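Contextual online advertising, also known as narrowcast advertising (窄告), is a
new model of targeted online advertising. Using Internet technologies and an
associated publishing system, it automatically matches an advertiser's ad content
against the article content of online media, viewer preferences, viewer geographic
location, and similar information, and publishes the ad alongside the matching
articles, thereby achieving precise ad placement. Narrowcast advertising is the
direction in which online advertising is developing. Building on existing theories
and methods of Web text discovery, this thesis analyzes Chinese word segmentation,
text summarization, text classification, text clustering, and Web user interest
mining in turn, proposes improved algorithms or methods, and finally applies Web
text discovery theory and methods to matching in contextual online advertising.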
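The main contents and results are as follows. The current state of Chinese word
segmentation research is analyzed and the mainstream segmentation theories are
compared from several angles; on this basis an improved segmentation method is
proposed that significantly improves recall and precision. The main current
methods of text classification are analyzed and the principal summarization
methods are compared, yielding a summarization method suited to Web text
discovery. Classical text classification methods are compared in search of the
best-performing one. Clustering quality evaluation, key techniques, and common
text clustering methods are compared and analyzed, leading to a clustering method
aimed at Web text. The feature analysis and extraction of personalized user
information, and the representation and computation of the degree of interest,
are discussed in depth. Finally, the strengths and weaknesses of current online
advertising are analyzed, and Web information extraction based on semantic
matching and Web user interest prediction are studied.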
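Keywords: data mining, text discovery, Chinese word segmentation, text
summarization, text clustering, user interest, contextual online advertising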
ABSTRACT
The Internet has become the world's largest store of accumulated information, and
its vast volume of Web data holds knowledge of enormous potential value. Internet
information is stored in the form of Web pages whose content is mostly text;
however, their complex structure and diverse styles make them an unusually large,
heterogeneous, open, distributed database. How to tap the potentially valuable
data and knowledge in this mass of information quickly and effectively has become
a pressing problem.
Context-based online advertising, that is, contextual advertising, is a new type
of targeted online advertising model. Through Internet technologies and a related
distribution system, advertiser content is automatically matched with the article
content of online media, viewer preferences, the geographical location of
visitors, and other information, and is published around the articles that match
it, so as to achieve accurate advertising. Contextual advertising is the direction
in which online advertising is developing. In this paper, Chinese word
segmentation, text summarization, text categorization, text clustering, Web user
interest mining, and related processes are analyzed and studied on the basis of
current theories and methods of text mining, and improved algorithms or methods
are proposed for them. Finally, the theory and methods of Web text mining are
applied to the matching of contextual advertising.
The main content and results are as follows. The current state of research on
Chinese word segmentation is analyzed and the mainstream segmentation theories
are compared from multiple perspectives; on this basis an improved segmentation
method is proposed, with significantly improved recall and precision. The main
current methods of text classification are analyzed, and the main summarization
methods are compared and analyzed to arrive at a summarization method suitable
for Web text mining. Classic text classification methods are compared to find the
best method of text categorization. Based on text clustering quality evaluation,
key technologies, and commonly used text clustering methods, a clustering method
for Web text is proposed. The analysis and extraction of personalized user
feature information and the representation and calculation of the degree of
interest are discussed. The advantages and disadvantages of present online
advertising are analyzed, and Web information extraction and Web user interest
prediction based on semantic matching are studied.
Key Words: data mining, text mining, Chinese word segmentation, text
summarization, text clustering, user interest, contextual advertising
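The abstract describes matching advertiser content against article content but does not spell out the matching procedure. As a rough illustration only, the sketch below assumes a plain bag-of-words cosine similarity, which is a generic stand-in and not the semantic matching method of the thesis. The function names (`term_counts`, `cosine`, `best_ad`) and the sample ads are hypothetical, and the regex tokenizer is a placeholder for real Chinese word segmentation.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens.

    Assumption: a regex tokenizer stands in for proper word segmentation.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_a * norm_b)


def best_ad(article, ads):
    """Return the key of the ad whose text is most similar to the article."""
    article_vec = term_counts(article)
    return max(ads, key=lambda name: cosine(article_vec, term_counts(ads[name])))


# Hypothetical sample data, for illustration only.
ads = {
    "camera_shop": "digital camera lens photo discount",
    "travel_agency": "cheap flights hotel travel booking",
}
article = "Tips for choosing a camera lens for travel photo work"
print(best_ad(article, ads))
```

A fuller system along the lines the abstract describes would also weigh viewer preferences and geographic location; this sketch covers only the article-content signal.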
Table of Contents
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
§1.1 Research Background and Significance
§1.1.1 Background of Web Mining
§1.1.2 Background and Significance of Web Text Discovery
§1.1.3 Contextual Online Advertising
§1.2 State of Research in China and Abroad
§1.2.1 State of Research on Web Text Discovery
§1.2.2 State of Narrowcast Advertising Theory and Technology
§1.3 Main Research Content and Organization
§1.3.1 Main Research Content
§1.3.2 Organization of the Thesis
Chapter 2 A Chinese Word Segmentation Method Based on a Noisy Word List
§2.1 Web Text Preprocessing
§2.1.1 Web Page Information Preprocessing
§2.1.2 Web Text Preprocessing
§2.2 Overview of Chinese Word Segmentation
§2.2.1 Overview of Chinese Word Segmentation
§2.2.2 Chinese Word Segmentation and Its Algorithms
§2.3 Related Concepts
§2.4 State of Research on Chinese Word Segmentation
§2.5 Chinese Word Segmentation Based on a Noisy Word List
§2.5.1 Constructing the Noisy Word List
§2.5.2 Overall Algorithm Procedure
§2.5.3 Algorithm Description
§2.5.4 Algorithm Flowchart
§2.5.5 Analysis of the Noisy Word List Algorithm
§2.5.6 Experimental Results and Analysis
§2.5.7 Algorithm Summary
Chapter 3 Text Summarization Methods Based on Statistics and User Interest
§3.1 Text Features and Structure
§3.1.1 Text Feature Representation
§3.1.2 Text Feature Extraction
§3.1.3 Text Structure Analysis
§3.2 Overview of Text Summarization
§3.3 Classification of Text Summarization Algorithms
§3.4 Principles of Text Summarization
§3.5 Main Text Summarization Methods
§3.5.1 Statistics-Based Automatic Summarization
§3.5.2 Understanding-Based Text Summarization
§3.5.3 Information-Extraction-Based Summarization
§3.5.4 Structure-Based Automatic Summarization
§3.6 Text Summarization Based on User Interest
Chapter 4 An Improved K-Nearest-Neighbor Text Classification Algorithm
§4.1 Overview of Text Classification
§4.1.1 Web Text Preprocessing
§4.1.2 Text Representation
§4.1.3 Text Feature Extraction
§4.2 Text Classification Methods
§4.2.1 The Rocchio Method: Similarity Computation
§4.2.2 Naive Bayes Classification
§4.2.3 The K-Nearest-Neighbor Algorithm (KNN)
§4.2.4 The Improved K-Nearest-Neighbor Text Classification Algorithm
§4.2.5 Support Vector Machines
§4.2.6 Decision Trees
§4.3 Evaluation Metrics for Classification Algorithms
§4.4 Discussion and Analysis of Web Text Classification
Chapter 5 Text Clustering and Its Key Techniques
§5.1 Overview of Text Clustering
§5.2 Concepts of Text Clustering
§5.2.1 Clustering
§5.2.2 Text Clustering
§5.3 Evaluating Text Clustering Quality
§5.4 Analysis of the Main Text Clustering Methods
§5.4.1 Partition-Based Methods
§5.4.2 Hierarchy-Based Methods
§5.4.3 Model-Based Clustering Algorithms
§5.4.4 Concept-Based Clustering
§5.4.5 Clustering Based on Dialogue Text
§5.4.6 Density-Based Clustering Algorithms
§5.4.7 Experimental Comparison of Four Classes of Text Clustering Algorithms
§5.5 Key Techniques in Text Clustering
§5.5.1 Text Representation
§5.5.2 Dimensionality Reduction and Feature Word Selection
§5.5.3 Semantic Computation
§5.6 A Web Clustering Algorithm (WTCA) Based on the Self-Organizing Feature Map (SOM) Neural Network
Chapter 6 Web User Interest Mining
§6.1 Overview of Web User Interest Mining
§6.1.1 Characteristics and Classification of Personalized User Information
§6.1.2 Acquiring Personalized User Information
§6.2 Explicit Web Mining
§6.3 Web Content Mining
§6.3.1 Analysis of User Browsing Behavior
§6.3.2 Collecting User Behavior Data
§6.3.3 Computing the Degree of Interest
§6.3.4 Interest Representation
§6.4 Web Log Mining
§6.4.1 Web Log Data
§6.4.2 Steps of Web Log Mining
§6.4.3 Web Log Data Preprocessing Techniques
§6.4.4 Challenges of Web Log Mining
§6.4.5 User Identification Algorithms and Their Improvement
§6.5 Comparison of the Mining Methods
Chapter 7 Applying Web Text Discovery to Contextual Online Advertising
§7.1 Overview of Online Advertising
§7.1.1 The Importance of Online Advertising
§7.1.2 Characteristics and Advantages of Online Advertising
§7.1.3 Shortcomings and Defects of Internet Advertising
§7.2 Contextual Online Advertising: Narrowcast Advertising
§7.2.1 The Meaning of Narrowcast Advertising
§7.2.2 The Significance of the Emergence of Narrowcast Advertising
§7.2.3 Outlook for Narrowcast Advertising
§7.3 Narrowcast Advertising Based on Web Text Discovery
§7.4 Web Information Extraction Based on Semantic Matching
§7.4.1 A Web Information Extraction Model
§7.4.2 Semantics-Based Information Matching Methods
§7.4.3 Semantic Information Matching
§7.5 Web User Interest Prediction
§7.5.1 Classification Based on Association Rules
§7.5.2 Predicting Web User Interest
§7.6 Summary and Outlook
References
Papers Published, Research Projects Undertaken, and Results Achieved During the Period of Study
Acknowledgments