Research and Implementation of a Microblog Hot Topic Extraction Method


侯斌 · 2024-11-19 · 2.45 MB · 55 pages


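Chinese Abstract

With the accelerating globalization of the Internet and the spread of mobile technology, microblogs have become a major source of online public opinion and an important medium for its dissemination. Hot topics on microblogs reveal trends in online public opinion and make it possible to monitor, and give early warning of, sudden social and natural events. Research on extracting and analyzing microblog hot topics therefore matters greatly for business decision-making, industry research, and government monitoring of public opinion. Microblog data, however, are short, highly sparse, strongly grassroots in style, large in volume, and high-dimensional. When applied to massive microblog data, traditional topic extraction methods suffer from weak noise and dimensionality reduction and from loss of semantic information. To address these problems, this thesis proposes a hot topic extraction method suited to microblogs. The main work is as follows:

(1) Building on a study of topic extraction techniques, and because the sparsity of short microblog texts prevents the direct use of feature-weighting methods designed for ordinary text, a short-text expansion method based on microblog comments is proposed. The method makes full use of the conversational nature and propagation model of microblogs, introduces no additional "noise", and lessens the impact of short-text sparsity.

(2) Common text representation models are studied. Because the traditional vector space model yields high-dimensional vectors and loses semantic information when applied to microblog text, this thesis models microblog text with latent semantic analysis (LSA). LSA captures the latent semantic structure among words and uses that structure to represent words and texts, which weakens the correlation between words and reduces the dimensionality of the text vectors.

(3) To process massive microblog data quickly and accurately, and after examining the strengths and weaknesses of classical clustering algorithms, an improved hybrid clustering algorithm combining hierarchical clustering with K-means is proposed, with time information incorporated into the text similarity computation. The algorithm first applies hierarchical clustering to the modeled microblog data set to obtain the initial cluster centers and the number of clusters required by the subsequent K-means step, so that each algorithm offsets the other's weaknesses and topic extraction becomes both more efficient and more accurate.

(4) Based on a proposed method for computing topic heat, and combining the short-text expansion method, the LSA model, and the improved hybrid clustering algorithm, a microblog hot topic extraction method based on hybrid clustering and heat ranking is proposed and validated. Experimental results show that the method reduces the dimensionality and noise of the feature space matrix while retaining the latent semantic information of the text, thereby substantially lowering the miss rate of topic extraction, improving extraction performance, and yielding more precise hot topics.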
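The idea in item (3), using a hierarchical pass to supply K-means with its cluster count and initial centers, can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details the abstract does not give: the centroid-linkage merge rule, the `threshold` stopping distance, the Euclidean metric, and all function names are assumptions chosen for brevity. In a real pipeline the points would be LSA document vectors rather than 2-D tuples.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def agglomerate(points, threshold):
    """Naive centroid-linkage agglomeration (assumed merge rule).

    Repeatedly merges the two clusters whose centroids are closest and
    stops once the closest pair is farther apart than `threshold`.
    Returns the centroids of the surviving clusters.
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [centroid(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


def hybrid_cluster(points, threshold):
    """Hierarchical pass picks k and the seeds; K-means refines them."""
    seeds = agglomerate(points, threshold)
    return kmeans(points, seeds)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = hybrid_cluster(points, threshold=3.0)
print(len(centers), labels)
```

The agglomeration step here is cubic in the number of points, so it only suits small inputs; it is written this way to keep the seeding logic visible.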

Keywords: microblog; topic extraction; text clustering; semantic space; latent semantic analysis

ABSTRACT

With the accelerating globalization of networks and the popularization of mobile technology, the microblog has become a major source and an important medium of online public opinion. Microblog hot topics can be used to grasp the dynamics of public opinion in online media, as well as to monitor and warn of sudden social and natural events. Extracting and analyzing microblog hot topics is therefore of vital significance for business decisions, industry research, government public opinion monitoring, and so on. However, microblog text is short, sparse, and strongly grassroots in character, and microblog data are large in volume and high in dimension. Traditional topic extraction methods reduce noise and dimensionality poorly and lose semantic information when dealing with massive amounts of short microblog text. In view of these problems, a hot topic extraction method suited to microblogs is proposed. The main research work of this dissertation is as follows:

(1) By studying topic extraction technology, and to address the problem that the data sparseness of short microblog text prevents feature-weight calculation methods for ordinary text from being used directly, a short-text expansion method based on microblog comments is put forward. This method makes full use of the dialogue properties and propagation model of microblogs without introducing additional "noise", which reduces the impact of short-text sparsity.

(2) Common text representation models are studied, along with the problems of high vector-space dimensionality and loss of semantic information that arise when microblog text is modeled with the traditional vector space model. This dissertation adopts latent semantic analysis (LSA) for modeling: it extracts the underlying semantic structure between words and uses this latent structure to represent words and texts, with the aim of reducing the correlation between words and simplifying the text vectors through dimensionality reduction.

(3) To process massive microblog data quickly and accurately, and after analyzing the advantages and disadvantages of classic clustering algorithms, an improved hybrid clustering algorithm that combines hierarchical clustering with K-means clustering is proposed, with time information incorporated into the calculation of text similarity. The algorithm applies hierarchical clustering to the modeled microblog data set to obtain the initial cluster centers and the number of clusters needed by the subsequent K-means clustering.

(4) Based on the proposed method for calculating topic heat, a microblog hot topic extraction method built on hybrid clustering and heat ranking is proposed, combining the short-text expansion method, the latent semantic analysis model, and the improved hybrid clustering algorithm. Experimental results show that this method reduces the dimension and noise of the feature space matrix and retains the latent semantic information of the text, which greatly reduces the miss rate of topic extraction and improves its performance, so that the extracted microblog hot topics are more accurate.
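Item (3) says that time information is incorporated into text similarity, but the abstract does not state the formula. One plausible form, shown here purely as an assumption, scales cosine similarity by an exponential decay in the gap between posting times, so that posts far apart in time are less likely to be grouped into the same topic. The decay form, the `tau_hours` parameter, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def time_aware_similarity(vec_a, t_a, vec_b, t_b, tau_hours=24.0):
    """Cosine similarity damped by the posting-time gap (assumed form).

    t_a and t_b are posting times in hours; tau_hours controls how fast
    similarity decays as the gap grows.
    """
    decay = math.exp(-abs(t_a - t_b) / tau_hours)
    return cosine(vec_a, vec_b) * decay


v = (1.0, 2.0, 0.0)
print(time_aware_similarity(v, 0.0, v, 1.0))   # posted one hour apart
print(time_aware_similarity(v, 0.0, v, 48.0))  # posted two days apart
```

A multiplicative decay keeps the score in the same range as plain cosine similarity, which lets it drop into an existing clustering routine without rescaling thresholds.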

Key words: microblog, topic extraction, text clustering, semantic space, latent semantic analysis

Contents

Chinese Abstract
ABSTRACT
Chapter 1 Introduction
1.1 Research Background and Significance
1.2 Research Status in China and Abroad
1.3 Main Research Work and Thesis Organization
Chapter 2 Microblog Topic Extraction and Related Technologies
2.1 Microblogs
2.1.1 The Concept of Microblogs
2.1.2 Characteristics of Microblogs
2.1.3 Principles of Microblog Topic Extraction
2.2 Technologies Related to Topic Extraction
2.2.1 Data Extraction and Preprocessing
2.2.2 Text Representation Models
2.2.3 Text Similarity Computation
2.2.4 Text Clustering Algorithms
2.3 Difficulties in Microblog Topic Extraction
2.4 Solutions to the Difficulties in Microblog Topic Extraction
2.5 Chapter Summary
Chapter 3 Microblog Text Modeling Based on Short-Text Expansion and LSA
3.1 Microblog Short-Text Expansion
3.1.1 Topic-Word-Based Processing of Microblog Comments
3.1.2 Expanding Microblog Text by Integrating the Discussion Tree
3.2 LSA-Based Microblog Text Modeling
3.2.1 Introduction to LSA
3.2.2 LSA-Based Semantic Modeling
3.3 Chapter Summary
Chapter 4 Microblog Hot Topic Extraction Based on Hybrid Clustering and Heat Ranking
4.1 Microblog Text Clustering
4.1.1 Selection and Comparison of Clustering Algorithms
4.1.2 The Improved Hybrid Clustering Algorithm
4.1.3 Text Similarity Computation Incorporating Time Information
4.1.4 Experimental Validation of the Clustering Algorithm
4.2 Study of Microblog Topic Heat
4.2.1 Analysis of Factors Affecting Topic Heat
4.2.2 Computing the Topic Heat Value
4.3 Experimental Results and Analysis
4.3.1 Experimental Environment and Data
4.3.2 Evaluation Metrics
4.3.3 Analysis of Results
4.4 Chapter Summary
Chapter 5 Microblog Hot Topic Extraction System
5.1 Overall System Architecture
5.2 Data Collection Module
5.3 Microblog Hot Topic Extraction Module
5.3.1 Microblog Text Preprocessing Module
5.3.2 LSA-Based Semantic Analysis Module
5.3.3 Clustering and Topic Extraction Module
5.3.4 System Demonstration
5.4 Chapter Summary
Chapter 6 Conclusions and Outlook
6.1 Summary
6.2 Outlook
References

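Chapter 1 Introduction

1.1 Background and Significance of the Research

As technology [1] continues to evolve and mobile technology spreads, Internet media have influenced and changed the habits and channels through which people obtain and publish information. As a new kind of social medium and information exchange platform, the microblog has grown rapidly and been widely adopted in recent years [1], thanks to its openness, strong interactivity, rapid propagation, and real-time publishing, and its number of registered users has surged. Microblogs are now a main channel through which users share, follow, and obtain the latest news, and they have also become a major source of online public opinion and an important medium for its dissemination.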

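Through microblogs, users can conveniently obtain important current news [2]; enterprises can keep up with the latest developments in their fields and with diverse user needs, and so further improve product quality and core competitiveness; and government agencies can more easily gauge public sentiment and learn promptly which way public opinion on major social events is heading and how it is developing. In practice, however, users are promptly informed of posts published by the bloggers and groups they follow, but cannot obtain or track the hot topics across the platform as a whole [1]. Faced with massive and complex microblog data, how to find the information one needs and the hot topics one cares about has therefore become a question of general concern in the Internet age.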

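Current information extraction techniques and topic detection and tracking techniques all aim to organize and analyze large volumes of redundant network data effectively, extract the hot topics people care about, and track how those topics develop. However, most current research on discovering and tracking online hot topics targets long web texts such as traditional news, BBS forums, and blogs; research on extracting hot topics from the emerging, fast-growing "self-media" typified by microblogs has only just begun. Microblog text is grassroots text: it is very short, and any registered user can publicly post about any topic. Different users often differ widely in expression and writing style. Compared with traditional Internet media, microblogs differ greatly in both content and the way users participate. Earlier topic extraction methods therefore cannot be applied directly to microblog hot topic extraction [3], and topic extraction techniques tailored to microblogs remain a direction open to further research.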

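In summary, research on microblogs has both significant academic value and broad application prospects [3]. Extracting microblog hot topics is of great practical significance for personal life, industry research, business decision-making, and government monitoring of public opinion.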

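1.2 Research Status in China and Abroad

Research on online hot topics typically involves data collection, preprocessing, cluster analysis, and classification optimization [4]. Significant progress has been made in studying how hot topics propagate and how they can be discovered and extracted [5]. Given the importance of online hot topic extraction, related key techniques such as similarity computation and cluster analysis have also developed extensively through in-depth study by researchers in China and abroad.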

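As research on topics has deepened, some scholars abroad have begun to explore natural language processing (NLP), in which topic analysis plays a vital role. [6] et al., addressing shortcomings found in two-dimensional topic detection, proposed a topic segmentation algorithm based on semantic domains. The algorithm can group together related texts that narrate the same event, so that within text passages ...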
