Research on Chinese Text Classification Based on the PLS-Logistic Model


牛悦 · 2024-11-19 · 62 pages · 532.49 KB


Chinese Abstract

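Information media such as the Internet are now widely deployed, and the volume of data people face keeps growing. Acquiring and managing information efficiently has become a pressing problem in information science. Text is the main carrier of information, and automatic text classification, as an effective way to acquire and manage textual information, has become an active research area of both theoretical and practical importance. Text classification is the process of using a computer to assign a user-submitted text automatically to one or more categories according to its content or topic. In theory, it is an application of pattern recognition and machine learning. Internet penetration in China keeps rising and the number of users keeps growing, so websites hold massive amounts of Chinese information in text form. Because Chinese differs greatly from Western languages, research results on text classification obtained abroad cannot be applied directly to Chinese text, which makes research on Chinese text classification important.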

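This thesis studies text classification based on statistical learning and its related techniques, and proposes a partial least squares (PLS) Logistic text classification model. Built on PLS theory, the model extracts latent semantics well and, to some extent, alleviates the sparsity and high dimensionality of text classification. The main research contents are as follows.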

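First, partial least squares regression is used for feature extraction. Feature selection and feature extraction are the two dimensionality reduction methods commonly used in text classification. Features chosen by feature selection have a clear semantic interpretation, but the resulting classification performance is mediocre; feature extraction handles collinearity well but offers no semantic interpretation. To address both problems, this thesis extracts features with PLS, introduces the variable importance in projection (VIP) index to score each feature, and selects features according to their VIP values. Because latent semantic information is extracted when PLS is applied, the method is a feature selection method that incorporates feature extraction. Scoring features by VIP value reflects the structural information of each category well, and the selected features themselves reflect the domain knowledge of each category.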

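Second, the PLS-Logistic text classification model is studied. The Logistic model is widely used in text classification and handles a discrete dependent variable well, but it cannot cope with the high collinearity of text data, which leads to insignificant regression coefficients and poor model fit. PLS regression is a powerful data analysis tool, and the latent semantic classification model built on it extracts the latent semantics of documents and the features of categories well. In binary text classification, however, the dependent variable is discrete, whereas PLS regression is designed for a continuous dependent variable. The PLS-Logistic regression model handles these problems by combining the strengths of PLS and Logistic regression.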

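Experimental results show that the model classifies text significantly better than the ordinary Logistic model, and that it performs well compared with four other classic text classification models: SVM, SMO, C4.5, and kNN.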

Keywords: text classification; partial least squares; Logistic regression model

ABSTRACT

With the continuing spread and development of the Internet, the amount of information of all kinds has increased dramatically, and how to acquire and manage it has become an urgent problem. Text is the main carrier of this information. As an effective way to obtain useful information, automatic text classification has become an important topic in text information processing. Text classification automatically assigns a user-submitted text to one or more classes. In theory, it is an application of machine learning. Because of the nature of the objects it handles, it also draws on other disciplines, such as linguistics, cognitive science, information theory, artificial intelligence, statistics, and computer science. Text classification can be applied in many areas, including web page classification, classification of scientific literature, electronic libraries, patent classification, trademark classification, and e-mail filtering. Research on text classification therefore has important theoretical value. With Internet penetration increasing in China, more and more people are going online, and a mass of Chinese-language information is held on all kinds of websites. Because of the differences between Chinese and English, text classification results obtained abroad cannot be applied directly to Chinese text, which gives research on Chinese text classification great practical significance.

In this thesis, the PLS-Logistic model is developed on the basis of research into statistical-learning text categorization and related techniques. The model extracts latent semantics well and, to a certain extent, alleviates the sparsity and high dimensionality problems. The main contents are as follows.

First, the Partial Least Squares (PLS) method is used to extract features. There are two common dimensionality reduction methods: feature selection and feature extraction. Feature selection yields features with a good semantic interpretation, but the resulting classification performance is mediocre. Feature extraction handles collinearity better, but it offers no semantic interpretation. To solve these problems, this thesis uses PLS to extract features and introduces the variable importance in projection (VIP) value to score each feature; features are then selected according to their VIP values. Because latent semantic information is extracted when PLS is applied, the method is a feature selection method that incorporates feature extraction. It reflects the structural information and domain knowledge of every category well.
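
As a rough illustration of VIP-based selection, the sketch below fits a single-response PLS model (PLS1) with the NIPALS iteration and scores each term by the usual VIP formula, VIP_j = sqrt(p · Σ_a SS_a · w_ja² / Σ_a SS_a), where SS_a is the variation of the class indicator explained by component a and w_ja is the unit-norm weight of term j on that component. The function name `pls1_vip`, the toy corpus, and the VIP > 1 cut-off (a common rule of thumb) are assumptions made for illustration; the thesis may use a different algorithm variant or threshold.

```python
import math


def pls1_vip(X, y, n_components):
    """VIP score of every column of X from a PLS1 (single-response) NIPALS fit."""
    n, p = len(X), len(X[0])
    # Centre each term column and the class indicator.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    E = [[row[j] - means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]

    weights, explained = [], []
    for _ in range(n_components):
        # Weight vector: unit-length direction of maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # residual y is uncorrelated with residual X
        w = [v / norm for v in w]
        # Latent component (score vector) and its squared length.
        t = [sum(e * wj for e, wj in zip(row, w)) for row in E]
        tt = sum(v * v for v in t)
        # X loadings and the regression coefficient of y on t.
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(fi * ti for fi, ti in zip(f, t)) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        weights.append(w)
        explained.append(q * q * tt)  # variation of y explained by this component

    if not explained:
        raise ValueError("y is constant or uncorrelated with X")
    total = sum(explained)
    return [
        math.sqrt(p * sum(s * w[j] ** 2 for s, w in zip(explained, weights)) / total)
        for j in range(p)
    ]


# Hypothetical toy corpus: rows are documents, columns are counts of the
# terms "match", "market", "the"; label 1 = sports, 0 = finance.
terms = ["match", "market", "the"]
docs = [[3, 0, 2], [2, 0, 1], [4, 1, 3], [0, 3, 2], [1, 2, 1], [0, 4, 2]]
labels = [1, 1, 1, 0, 0, 0]

vip = pls1_vip(docs, labels, n_components=2)
# Keep terms whose VIP exceeds 1 (rule-of-thumb threshold, an assumption here).
selected = [term for term, score in zip(terms, vip) if score > 1.0]
print(dict(zip(terms, vip)), selected)
```

Because every weight vector has unit length, the squared VIP scores sum to the number of terms, so a score above 1 marks a term that contributes more than average to explaining the class label. In this toy corpus the two topical terms pass the cut-off and the function word does not.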

Second, the PLS-Logistic model is studied. The Logistic model is widely used in text classification and can handle a discrete dependent variable, but it cannot cope with high collinearity, which produces insignificant regression coefficients, poor model fit, and other issues. The PLS model is a powerful tool for data analysis, and the latent semantic classification model based on it can extract the latent semantics of documents and the features of categories. In binary text classification the dependent variable is discrete, whereas PLS is designed for continuous dependent variables. The PLS-Logistic model handles these problems well by combining the advantages of the PLS model and the Logistic model.
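
One simple way to combine the two models is a two-stage pipeline: extract a few mutually orthogonal PLS components from the term matrix, then fit a Logistic model on those component scores instead of on the collinear raw terms. The sketch below does this in plain Python. It is a simplified variant under stated assumptions, not necessarily the component-extraction procedure used in the thesis; the helper names (`fit_pls`, `pls_scores`, `fit_logistic`, `predict_proba`), the toy corpus, and the learning-rate, epoch, and penalty settings are all hypothetical.

```python
import math


def fit_pls(X, y, n_components):
    """PLS1 by NIPALS; returns column means, weight vectors and loading vectors."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    E = [[row[j] - means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]
    W, P = [], []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # no covariance left between residual X and residual y
        w = [v / norm for v in w]
        t = [sum(e * wj for e, wj in zip(row, w)) for row in E]
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(fi * ti for fi, ti in zip(f, t)) / tt
        # Deflate X and y so the next component is orthogonal to this one.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        W.append(w)
        P.append(load)
    return means, W, P


def pls_scores(x, means, W, P):
    """Project one document onto the PLS components, deflating as in training."""
    e = [v - m for v, m in zip(x, means)]
    scores = []
    for w, load in zip(W, P):
        t = sum(a * b for a, b in zip(e, w))
        e = [a - t * b for a, b in zip(e, load)]
        scores.append(t)
    return scores


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(T, y, lr=0.1, epochs=2000, l2=0.01):
    """Batch gradient descent on the log-loss with a small L2 penalty on the slopes."""
    n, k = len(T), len(T[0])
    bias, beta = 0.0, [0.0] * k
    for _ in range(epochs):
        g_bias, g_beta = 0.0, [0.0] * k
        for t, label in zip(T, y):
            err = sigmoid(bias + sum(b * v for b, v in zip(beta, t))) - label
            g_bias += err
            for j in range(k):
                g_beta[j] += err * t[j]
        bias -= lr * g_bias / n
        beta = [b - lr * (g / n + l2 * b) for b, g in zip(beta, g_beta)]
    return bias, beta


def predict_proba(x, pls, logit):
    """Probability that document x belongs to class 1."""
    bias, beta = logit
    scores = pls_scores(x, *pls)
    return sigmoid(bias + sum(b * v for b, v in zip(beta, scores)))


# Hypothetical toy corpus: term counts per document; label 1 = sports, 0 = finance.
docs = [[3, 0, 2], [2, 0, 1], [4, 1, 3], [0, 3, 2], [1, 2, 1], [0, 4, 2]]
labels = [1, 1, 1, 0, 0, 0]

pls = fit_pls(docs, labels, n_components=2)
train_scores = [pls_scores(x, *pls) for x in docs]
logit = fit_logistic(train_scores, labels)
predicted = [int(predict_proba(x, pls, logit) >= 0.5) for x in docs]
print(predicted)
```

The Logistic stage sees only a handful of orthogonal scores, which is what sidesteps the collinearity problem, while the sigmoid link keeps the output a class probability rather than an unbounded regression value.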

The experimental results show that, compared with the ordinary Logistic model, the new model significantly improves text classification performance, and that, compared with other classic models, including SVM, SMO, C4.5, and kNN, its classification performance is good.

Key words: Text Classification, Partial Least Squares, Logistic Model

Chinese Abstract
ABSTRACT

Chapter 1 Introduction
§1.1 Research Background and Significance
§1.2 Development History and Current State of Research in China and Abroad
§1.3 Work of This Thesis

Chapter 2 Overview of Text Classification
§2.1 Introduction to Text Classification
§2.1.1 Mathematical Definition of Text Classification
§2.1.2 Characteristics of the Text Classification Task
§2.1.3 Components of a Text Classification System
§2.2.4 Dimensionality Reduction
§2.2.5 Learning and Training
§2.2.6 Testing and Evaluation
§2.2 Text Classification Methods
§2.2.1 Document Features
§2.2.2 Document Representation Models
§2.3 Common Text Classification Models

Chapter 3 Partial Least Squares Regression
§3.1 History of Partial Least Squares Regression
§3.2 Basic Principles of Partial Least Squares Regression
§3.3 The Partial Least Squares Regression Algorithm
§3.4 Determining the Number of Components

Chapter 4 Feature Extraction Based on the Partial Least Squares Regression Model
§4.1 Dimensionality Reduction Techniques
§4.2 Common Feature Selection Methods
§4.2.1 χ² Statistic
§4.2.2 Information Gain
§4.2.3 Mutual Information
§4.2.4 Document Frequency
§4.2.5 Expected Cross Entropy
§4.3 Common Feature Extraction Methods
§4.3.1 Principal Component Analysis
§4.3.2 Latent Semantic Indexing
§4.3.3 Fisher Linear Discriminant Analysis
§4.4 Feature Extraction Based on Partial Least Squares
§4.4.1 Basic Principles
§4.4.2 Mathematical Derivation
§4.4.3 Feature Selection Based on the Variable Importance in Projection Index
§4.4.4 Experimental Results and Analysis

Chapter 5 A Text Classifier Based on the PLS-Logistic Model
§5.1 The Logistic Regression Model
§5.2 The Partial Least Squares Logistic Regression Model
§5.2.1 Extracting Partial Least Squares Components
§5.2.2 Building the Logistic Model
§5.2.3 Criteria for Extracting Components
§5.3 The Partial Least Squares Logistic Text Classification Model
§5.3.1 Extracting Partial Least Squares Components
§5.3.2 Constructing the Partial Least Squares Logistic Model
§5.4 Experimental Results and Analysis
§5.4.1 Experimental Design
§5.4.2 Performance Comparison of the Partial Least Squares Logistic Model and the Logistic Model

Chapter 6 Conclusions and Outlook
§6.1 Summary of the Research
§6.2 Future Research

Appendix
References
Papers Published, Research Projects Undertaken, and Results Achieved During the Period of Study
Acknowledgements

Chapter 1 Introduction

§1.1 Research Background and Significance

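In the information age, the volume of information is growing explosively, and people can easily obtain large amounts of it. Although more and more information is available, much of it is irrelevant, and how to acquire, manage, and use information from this mass quickly, accurately, and conveniently has become an urgent problem. In the past, when online information resources were limited, people usually classified, organized, and sorted online information by hand. Manual classification, however, consumes a great deal of manpower, material resources, and effort, and its results are not very consistent. As one of the important tools for solving these problems, automatic text classification has developed at an unprecedented pace.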

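Automatic text classification [1][2] is, first, an effective way of organizing information: it processes information effectively while saving a great deal of manpower and material resources. It is also an effective means of information filtering: it can automatically filter out spam or harmful information on the Internet and so limit its adverse effects on people's lives. Finally, it offers an effective route to information retrieval, supplying relevant information according to the user's needs.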

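Chinese text classification [3] is a branch of automatic text classification that studies how to classify Chinese text by computer. With the rapid development and wide application of information technology in China, Chinese text classification has become closely tied to people's daily lives, so research on automatic Chinese text classification is both necessary and meaningful.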

§1.2 Development History and Current State of Research in China and Abroad

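A text classification system is trained on a collection of pre-classified texts to build a decision rule and a classifier, which are then used to categorize texts of unknown class automatically. Its learning requires no expert intervention and can be applied to any domain, and it is now the mainstream approach to text classification. Dimensionality reduction of the text and construction of the classifier are the main research topics in such systems.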

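High dimensionality and sparsity are the main characteristics of text classification. They cause two problems: first, training and classification are expensive in time; second, the curse of dimensionality. Dimensionality reduction is therefore a prerequisite for text classification. Its task is to select, without loss of classification performance, an optimal subset of features from a set of feature data, the subset being smaller than the original set. Feature selection and feature extraction are the two commonly used methods.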
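
To make the feature selection side concrete, the sketch below ranks terms by the χ² statistic computed from a 2×2 term/class contingency table, χ² = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)), and keeps the top k. It assumes the documents have already been segmented into tokens (a necessary preprocessing step for Chinese text); the function names and the four-document corpus are hypothetical.

```python
def chi_square(docs, labels, term, positive=1):
    """Chi-square statistic between presence of `term` and membership in class `positive`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == positive
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document not in class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A term that occurs in every document (or in none) carries no class information.
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score."""
    vocab = sorted({t for tokens in docs for t in tokens})
    ranked = sorted(vocab, key=lambda t: chi_square(docs, labels, t), reverse=True)
    return ranked[:k]


# Hypothetical pre-segmented corpus: label 1 = sports, 0 = finance.
docs = [
    {"match", "goal", "the"},
    {"match", "team", "the"},
    {"market", "stock", "the"},
    {"market", "price", "the"},
]
labels = [1, 1, 0, 0]

top_terms = select_features(docs, labels, k=2)
print(top_terms)
```

Selection of this kind keeps readable terms as features, which is the semantic-interpretation advantage the text describes, but each term is scored independently, so correlations between terms are ignored.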

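Feature selection methods commonly used in text classification include document frequency [4], the χ² statistic [5][20], mutual information [6], information gain [7], and odds ratio [8]. The features these methods select have a good semantic interpretation and are close to a description of the text's meaning. Commonly used feature extraction methods include principal component analysis [9][25], latent semantic indexing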
