ABSTRACT
With the acceleration of network globalization and the spread of mobile
technology, microblogs have become a major source and an important medium of online
public opinion. Microblog hot topics can be used to track the dynamics of public opinion
in online media, and to monitor and give early warning of sudden social and natural events.
Extracting and analyzing microblog hot topics is therefore of vital significance for
business decisions, industry research, government monitoring of public opinion, and so
on. However, microblog texts are short, sparse, strongly grassroots in character, large
in volume, and high-dimensional. When dealing with massive amounts of short microblog
text, traditional topic extraction methods suffer from a lack of noise reduction and
dimensionality reduction, loss of semantic information, and other issues. In view of
these problems, this dissertation proposes a hot topic extraction method suited to
microblogs. The main research work of this dissertation is as
follows:
(1) Topic extraction techniques are studied. Because microblog short texts are
sparse, feature weighting methods designed for ordinary text cannot be applied to them
directly. To address this problem, a short text extension method based on microblog
comments is put forward. This method makes full use of the conversational nature and
communication model of microblogs without introducing additional "noise", and thereby
reduces the impact of short text sparsity.
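The idea of comment-based extension in (1) can be illustrated with a minimal sketch. It assumes that posts and comments have already been segmented into tokens; the function name extend_with_comments and the toy data are hypothetical, and the sketch simply appends every comment, whereas the selection and weighting of comments actually used in the dissertation are not specified in this abstract.

```python
from collections import Counter

def extend_with_comments(post_tokens, comment_token_lists):
    """Append the tokens of every comment to the tokens of the original post."""
    extended = list(post_tokens)
    for tokens in comment_token_lists:
        extended.extend(tokens)
    return extended

# Toy data (assumed for illustration): one short post and two comments on it.
post = ["earthquake", "sichuan"]
comments = [["earthquake", "rescue", "team"], ["rescue", "donation"]]

extended = extend_with_comments(post, comments)
# Term frequencies over the extended text are less sparse than over the post alone.
term_freq = Counter(extended)
```

With the comments attached, terms such as "rescue" that never occur in the post itself become available to a frequency-based feature weighting scheme.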
(2) Common text representation models are studied, together with the problems that
arise when microblog text is modeled with the traditional vector space model, such as
the high dimensionality of the vector space and the loss of semantic information. This
dissertation adopts a modeling method based on latent semantic analysis, which extracts
the latent semantic structure among words and uses this structure to represent words
and texts. This removes the correlation between words and simplifies the text vectors,
thereby achieving dimensionality reduction.
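The dimensionality reduction in (2) can be sketched as latent semantic analysis by truncated singular value decomposition. This is a minimal illustration, not the dissertation's implementation: the toy corpus, the raw term counts used as weights, the function names, and the choice k = 2 are all assumptions made for the example.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            X[index[w], j] += 1.0  # raw term count (assumed weighting)
    return vocab, X

def lsa_embed(X, k):
    """Project each document into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the k largest singular values; one row per document in the result.
    return (s[:k, None] * Vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy corpus (assumed): two posts about an earthquake, two about a phone.
docs = [
    "earthquake rescue team",
    "earthquake relief rescue",
    "phone launch price",
    "phone price discount",
]
vocab, X = build_term_document_matrix(docs)
doc_vecs = lsa_embed(X, k=2)  # 8-dimensional term space reduced to 2 dimensions
```

In this toy case the two earthquake posts end up with nearly identical latent vectors, while an earthquake post and a phone post are nearly orthogonal, even though each text is represented by only two numbers.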
(3) In order to process massive microblog data quickly and accurately, the
advantages and disadvantages of classic clustering algorithms are analyzed, and an
improved hybrid clustering algorithm that combines hierarchical clustering with K-means
clustering is proposed. The algorithm incorporates time information into the
calculation of text similarity. After the microblog data set has been modeled, the
algorithm uses hierarchical clustering to obtain the number of clusters and the initial
cluster centers required by the subsequent K-means clustering.
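The two-stage structure in (3) can be sketched as follows. This is only an illustration under stated assumptions: centroid linkage, a fixed Euclidean distance threshold as the stopping rule, and the toy two-dimensional data are all chosen for the example, and the time-weighted similarity mentioned above is omitted because its form is not given in this abstract.

```python
import numpy as np

def agglomerative_seeds(X, threshold):
    """Merge the closest pair of clusters (by centroid distance) until the
    closest pair is farther apart than `threshold`. Returns the centroids;
    their count serves as K for the K-means stage."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        centroids = np.array([X[c].mean(axis=0) for c in clusters])
        # Pairwise Euclidean distances between current cluster centroids.
        d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        if d[i, j] > threshold:
            break
        i, j = min(i, j), max(i, j)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return np.array([X[c].mean(axis=0) for c in clusters])

def kmeans(X, centers, n_iter=20):
    """Standard K-means iterations starting from the given initial centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (assumed): two well-separated groups of document vectors.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
seeds = agglomerative_seeds(X, threshold=1.0)  # stage 1: K and initial centers
labels, centers = kmeans(X, seeds)             # stage 2: K-means refinement
```

Seeding K-means this way avoids fixing K in advance and avoids random initial centers, which are the two weaknesses of plain K-means that the hybrid scheme is meant to offset.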
(4) Based on the proposed method for calculating topic heat, a microblog hot topic
extraction method that combines hybrid clustering with heat ranking is proposed. It
integrates the short text extension method, the latent semantic analysis model, and the
improved hybrid clustering algorithm. Experimental results show that this method
reduces the dimensionality and noise of the feature space matrix and retains the latent
semantic information of the text, which greatly reduces the miss rate of topic extraction
and improves the performance of microblog topic extraction, so that the extraction of microblog hot