Research and Implementation of Key Technologies for Data Cleaning

赵德峰, 2024-11-19, 83 pages, 1.62 MB

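Abstract

With the rapid development of information technology, managers rely more and more heavily on data when analyzing and making decisions. The data warehouse, a data environment built on top of databases to meet the needs of decision analysis, emerged as a result. During the construction of a data warehouse, however, the data imported from heterogeneous sources carry a variety of quality problems, which cause the decision support systems at the front end of the warehouse to produce erroneous analyses and degrade the quality of information services. The data must therefore be cleaned to improve their quality. Data cleaning is becoming an important topic in data warehousing and data mining, and even in web data processing.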

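This thesis first gives a comprehensive and detailed account of data cleaning, introducing its concept, its significance, and the state of research and application in China and abroad. It analyzes and summarizes the principles, methods, evaluation criteria, and basic workflow of data cleaning, and identifies the three most complex and most pressing open problems in the field: cleaning erroneous data, cleaning approximately duplicate records, and designing a data cleaning framework. These lead to the first research focus of the thesis, the cleaning of erroneous data. After a detailed comparison of the strengths and weaknesses of existing methods for cleaning erroneous data, a new clustering-based method for detecting outliers is proposed and its effectiveness is verified experimentally. The second research focus is the cleaning of approximately duplicate records. The thesis reviews in detail the existing methods for detecting such records, summarizes their shortcomings, and proposes a detection method based on density shrinking in high-dimensional space; experiments on a simulated data set verify its feasibility and effectiveness. Third, the thesis proposes the idea of an extensible data cleaning framework built on these key data cleaning techniques, with an open rule base and an open algorithm library. When cleaning data, the user can predefine cleaning rules for the business at hand and select suitable algorithms to remove the various errors in the data source. The final implementation stage completes the basic functions of the framework's data cleaning module, and experiments show that the framework runs efficiently and cleans effectively. The thesis closes with a summary of the work and an outlook on future research directions in data cleaning.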

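Through theoretical analysis and practical study, the results of this thesis offer a useful line of thought for the further development and improvement of data cleaning algorithms and of system-level data cleaning software platforms.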

Keywords: data cleaning, clustering, outlier, approximately duplicate records, extensible data cleaning framework

ABSTRACT

With the rapid development of information technology, organizational managers depend more and more on data when making decisions. Built on the foundation of the database, the data warehouse emerged to support decision analysis. During the construction of a data warehouse, however, data imported from different sources may carry many data quality problems, which lead to faulty decision analysis and degrade the quality of information services. There is a strong need for a data cleaning process to improve data quality. Data cleaning is becoming an important topic in data warehousing and data mining, as well as in web data processing.

In this thesis, we first describe data cleaning in detail. We introduce the concept and significance of data cleaning and the current state of research and application at home and abroad. We summarize and describe the theories, methods, evaluation standards, and basic workflow of data cleaning. Based on this knowledge, we conclude that the most difficult problems to be resolved as soon as possible are the following: outlier detection, duplicate record detection, and data cleaning tools. Outlier detection is therefore the first focus of this thesis. After comparing many existing outlier detection methods, we introduce a new clustering-based method of outlier detection and verify its effectiveness by experiment. Duplicate record detection is the second focus. We first review many existing methods of duplicate record detection and, after analyzing their defects, propose a new approach to duplicate record detection in high-dimensional space that uses adjustable density; we then demonstrate its feasibility and efficiency on simulated input records. Third, we propose a novel idea, the extensible data cleaning framework (EDCF), based on the critical technologies of data cleaning, which lets users clean data flexibly and effortlessly without any coding. The extensible framework has an open algorithm library, an open function library, and a fuzzy inference system based on fuzzy rules, which make it universal and adaptive. Finally, we implement the framework's basic functional modules, and the experimental results demonstrate the framework's effectiveness. At the end of the thesis, we summarize the research work and give an outlook on future research directions in data cleaning technology.

Through theoretical analysis and practical study, the results of this thesis provide a line of thought and a useful trial for the further development of data cleaning algorithms and system-level data cleaning software platforms.

Keywords: Data Cleaning, Clustering, Outlier, Duplicate Record, EDCF
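The abstract describes mapping records to points in a high-dimensional space and using clustering to find outliers. The following is a minimal sketch of that general idea using a textbook DBSCAN-style density clustering. It is not the thesis's BHCOD or IDS algorithm, whose details are not given here. The sample records, the `eps` radius, and the `min_pts` threshold are hypothetical values chosen for illustration, and the fields are assumed to be already scaled so that Euclidean distance is meaningful.

```python
from math import dist  # Euclidean distance, Python 3.8+


def density_cluster(points, eps, min_pts):
    """Textbook DBSCAN-style clustering (illustrative, brute force O(n^2)).

    Returns one label per point: a cluster id >= 0, or -1 for noise.
    """
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster_id
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_pts:  # only core points expand the cluster
                frontier.extend(neighbours[j])
        cluster_id += 1
    return labels


# Hypothetical records, each already mapped to a numeric point.
records = [(25, 3.0), (26, 3.1), (25, 3.2),
           (40, 8.0), (41, 8.1), (40, 8.2),
           (99, 0.5)]
labels = density_cluster(records, eps=1.5, min_pts=3)
outliers = [r for r, label in zip(records, labels) if label == -1]
print(labels)    # [0, 0, 0, 1, 1, 1, -1]
print(outliers)  # [(99, 0.5)]
```

Points that belong to no dense region are reported as outlier candidates. In a similar spirit, records that fall into the same tight cluster could be reviewed as possible approximate duplicates, although the thesis's own detection criteria may differ.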

Chapter 1 Introduction
§1.1 Research background
§1.2 State of research in China and abroad, and its shortcomings
§1.2.1 State of research abroad
§1.2.2 State of research in China
§1.3 Open problems
§1.4 Content and organization of the thesis
Chapter 2 Background on data quality and data cleaning
§2.1 Basic concepts of data quality
§2.1.1 How "dirty data" arises
§2.1.2 Definition of data quality
§2.1.3 Classification of data quality
§2.2 Concepts related to data cleaning
§2.2.1 Definition of data cleaning
§2.2.2 Classification of dirty data and the corresponding cleaning methods
§2.3 Basic workflow of data cleaning
§2.4 Evaluation criteria for data cleaning
§2.5 ETL and data cleaning
§2.5.1 Introduction to ETL
§2.5.2 Application model of data cleaning in ETL
§2.6 Chapter summary
Chapter 3 Research on and improvement of erroneous data cleaning
§3.1 Multi-source data integration and data standardization
§3.1.1 Data model for multi-source integration
§3.1.2 Importance of data standardization
§3.1.3 Definition of data standardization
§3.1.4 Methods of data standardization
§3.2 Cleaning dependency-type erroneous data
§3.3 Survey of outlier detection methods
§3.3.1 Statistics-based methods
§3.3.2 Distance-based methods
§3.3.3 Density-based methods
§3.3.4 Depth-based methods
§3.3.5 Deviation-based methods
§3.3.6 Business-rule-based methods
§3.4 Research and implementation of a new outlier detection method based on cluster analysis
§3.4.1 Mapping the records of a data table to points in high-dimensional space
§3.4.2 Density clustering algorithm for point sets in high-dimensional space
§3.4.3 Outlier detection algorithm based on clustering in high-dimensional space (BHCOD)
§3.4.4 Algorithm analysis and comparison
§3.4.5 Experiments and results
§3.5 Handling of outlier data
§3.6 Chapter summary
Chapter 4 Research on and improvement of approximately duplicate record cleaning
§4.1 Sort-and-compare methods for detecting approximately duplicate records
§4.1.1 Field matching algorithms
§4.1.2 Duplicate record detection
§4.2 Cluster-analysis methods for detecting approximately duplicate records
§4.2.1 The concept of clustering
§4.2.2 Classification of the main clustering methods
§4.2.3 Data preparation before DBSCAN clustering
§4.2.4 The DBSCAN clustering algorithm
§4.3 An improved DBSCAN clustering algorithm: the IDS algorithm
§4.4 Implementation of the IDS algorithm
§4.4.1 Data structures of the IDS algorithm
§4.4.2 Low-level database connection and data exchange
§4.4.3 Finding core points
§4.4.4 Clustering
§4.5 Criteria for approximately duplicate record detection and experimental verification
§4.6 Conflict handling for approximately duplicate records
§4.7 Chapter summary
Chapter 5 Research and implementation of the extensible data cleaning framework EDCF
§5.1 Principles of the extensible data cleaning framework
§5.1.1 Functional modules and cleaning methods of EDCF
§5.1.2 The cleaning process of EDCF
§5.1.3 Rule base and algorithm library of EDCF
§5.1.4 Characteristics of EDCF
§5.2 Implementation of EDCF
§5.2.1 Development approach of EDCF
§5.2.2 Main functional interfaces of EDCF
§5.3 Evaluation of the data cleaning framework
§5.4 Chapter summary
Chapter 6 Summary and outlook
§6.1 Summary of the work in this thesis
§6.2 Outlook on future research directions
References
Papers published, research projects undertaken, and results achieved during the period of study
Acknowledgements
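The abstract describes a framework with an open algorithm library in which the user predefines cleaning rules and selects suitable algorithms. One way such a design might look is sketched below: a registry of named cleaning functions plus a rule list that selects them in order. This is an illustrative assumption, not EDCF's actual architecture. The names `ALGORITHMS`, `register`, `clean`, the two sample algorithms, and the sample record are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Cleaner = Callable[[Record], Record]

# Hypothetical open algorithm library: name -> cleaning function.
ALGORITHMS: Dict[str, Cleaner] = {}


def register(name: str) -> Callable[[Cleaner], Cleaner]:
    """Decorator that adds a cleaning function to the library under a name."""
    def wrap(fn: Cleaner) -> Cleaner:
        ALGORITHMS[name] = fn
        return fn
    return wrap


@register("strip_whitespace")
def strip_whitespace(record: Record) -> Record:
    # Remove leading and trailing whitespace from every field.
    return {key: value.strip() for key, value in record.items()}


@register("upper_country")
def upper_country(record: Record) -> Record:
    # Normalize the (hypothetical) country field to upper case.
    out = dict(record)
    if "country" in out:
        out["country"] = out["country"].upper()
    return out


def clean(records: List[Record], rules: List[str]) -> List[Record]:
    """Apply the algorithms named in `rules`, in order, to each record."""
    cleaned = []
    for record in records:
        for name in rules:
            record = ALGORITHMS[name](record)
        cleaned.append(record)
    return cleaned


# The rule list plays the role of a predefined cleaning rule set.
result = clean(
    [{"name": "  Li Lei ", "country": "cn "}],
    rules=["strip_whitespace", "upper_country"],
)
print(result)  # [{'name': 'Li Lei', 'country': 'CN'}]
```

Keeping the library as a plain name-to-function mapping means new algorithms can be added without touching `clean`, which is one plausible reading of "open" in the abstract.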
