基于异常序列剔除的多变量时间序列结构化预测

毛文涛; 蒋梦雪; 李源; 张仕光

doi:10.16383/j.aas.2017.c160707

基于异常序列剔除的多变量时间序列结构化预测

doi: 10.16383/j.aas.2017.c160707

毛文涛^1,2,3, ,,
蒋梦雪^1,,
李源^1,2,,
张仕光^1,2,

1.
河南师范大学计算机与信息工程学院新乡 453007
2.
河南省高校计算智能与数据挖掘工程技术研究中心新乡 453007
3.
西北工业大学力学与土木建筑学院西安 710129

基金项目:

河南省高校青年骨干教师资助计划 2014GGJS-046

河南师范大学优秀青年科学基金 14YQ007

国家自然科学基金 U1204609

河南省高等学校重点科研项目 16A520015

中国博士后科学基金 2016T90944

河南省高校科技创新人才资助计划 15HASTIT022

详细信息

作者简介:
蒋梦雪  河南师范大学计算机与信息工程学院硕士研究生.主要研究方向为机器学习, 时间序列预测.E-mail:jmxhtu@126.com

李源  河南师范大学计算与信息工程学院副教授.主要研究方向为故障诊断, 可靠性预测.E-mail:liyuan2015097@163.com

张仕光  河南师范大学计算与信息工程学院副教授.主要研究方向为机器学习, 大数据处理.E-mail:121114@htu.edu.cn

通讯作者:
毛文涛河南师范大学计算与信息工程学院副教授.主要研究方向为机器学习, 时间序列预测.本文通信作者.E-mail:maowt@htu.edu.cn

计量
- 文章访问数: 3673
- HTML全文浏览量: 688
- PDF下载量: 990
- 被引次数: 0
出版历程
- 收稿日期: 2016-10-10
- 录用日期: 2017-02-07
- 刊出日期: 2018-04-20

Structural Prediction of Multivariate Time Series Through Outlier Elimination

MAO Wen-Tao^{1,2,3
, ,},
JIANG Meng-Xue^1
,,
LI Yuan^{1,2
,},
ZHANG Shi-Guang^{1,2
,}

1.
College of Computer and Information Engineering, Henan Normal University, Xinxiang 453007
2.
Computational Intelligence and Data Mining Engineering Technology Research Center of Colleges and Universities of Henan Province, Xinxiang 453007
3.
School of Mechanics and Civil & Architecture, Northwestern Polytechnical University, Xi'an 710129

Funds:

the Funding Scheme of University Young Core Instructor of Henan Province 2014GGJS-046

the Foundation of Henan Normal University for Excellent Young Teachers 14YQ007

National Natural Science Foundation of China U1204609

the Key Research Project of High School of Henan Province 16A520015

China Postdoctoral Science Foundation 2016T90944

the Funding Scheme of University Science and Technology Innovation of Henan Province 15HASTIT022

More Information

Author Bio:
Master student at the College of Computer and Information Engineering, Henan Normal University. Her research interest covers machine learning and prediction of time series

Associate professor at the College of Computer and Information Engineering, Henan Normal University. Her research interest covers fault diagnosis and reliability prediction

Associate professor at the College of Computer and Information Engineering, Henan Normal University. His research interest covers machine learning and big data processing

Corresponding author: MAO Wen-Tao Associate professor at the College of Computer and Information Engineering, Henan Normal University. His research interest covers machine learning and prediction of time series. Corresponding author of this paper

摘要

摘要: 针对传统多变量时间序列预测方法未考虑变量间依赖关系从而影响预测效果的问题，提出了一种基于异常序列剔除的多变量时间序列预测算法.该算法旨在利用多维支持向量回归机（Multi-dimensional support vector regression，M-SVR）内在的结构化输出特性，对选取到具有相似性的多个变量序列进行联合预测.首先，对已知序列进行基于模糊熵的层次聚类，实现对相似序列的初步划分；其次，求出类中所有序列的主曲线，根据序列到主曲线的距离计算各个序列的异常因子，从而进一步剔除聚类结果中的异常序列；最后，将选取到的相似变量序列作为输入，利用M-SVR进行预测.通过理论分析，证明本文算法在理论上存在信息损失上界与可靠度下界，从而说明本文算法的合理性与可行性.采用混沌时间序列数据与多个实际数据集进行对比实验，结果表明，与现有多个代表性方法相比，本文算法可有效挖掘多变量时间序列的内在结构信息，预测精度更高，数值稳定性更好.
- 时间序列聚类 /
- 主曲线 /
- 异常序列 /
- 多维支持向量回归机
Abstract: To solve the problem that the traditional multivariate time series prediction generally ignores the dependency among all variables, a new multivariate time series structural prediction method through outlier elimination is proposed. This algorithm predicts on the selected multivariate time series by using the structural output characteristic. Firstly, to recognize the relatedness among the sequences, the variable sequences are initially divided by hierarchical clustering according to fuzzy entropy. Secondly, to further evaluate the similarity of the sequences in the obtained cluster, the principal curve is introduced to calculate the abnormality degree of each sequence, and then the outlier sequence can be eliminated in terms of the value of abnormality degree. As a result, similar sequences can be distinguished. Finally, for the similar series, multi-dimensional support vector regression (M-SVR) is used to construct the prediction model, and then the structural prediction for multivariate time series is conducted. Moreover, a theoretical proof is provided to show the proposed method has an upper bound of the loss of information and a lower bound of reliability and that the proposed method is reasonable and feasible from the perspective of information entropy. Experiments are conducted on three chaotic time series datasets and five real-life datasets. The results show that the proposed method can effectively recognize the inner group structure among multivariable sequences, so as to obtain a better forecasting accuracy and numerical stability than those widely used methods in terms of two different error measurements.
- Time series clustering /
- principal curve /
- outlier sequence /
- multi-dimensional support vector regression (M-SVR)
注释:

1) 本文责任编委张敏灵

HTML全文

图 1 算法流程图

Fig. 1 The flowchart of the algorithm

下载: 全尺寸图片幻灯片

图 2 Lorenz序列数据

Fig. 2 The Lorenz sequences

下载: 全尺寸图片幻灯片

图 3 Mackey-Glass序列数据

Fig. 3 The Mackey-Glass sequences

下载: 全尺寸图片幻灯片

图 4 Henon序列数据

Fig. 4 The Henon sequences

下载: 全尺寸图片幻灯片

图 5 初始序列聚类结果

Fig. 5 The result of clustering on original sequences

下载: 全尺寸图片幻灯片

图 6 $B$类序列的主曲线

Fig. 6 The principal curve of $B$ class

下载: 全尺寸图片幻灯片

图 7 $B$类各序列的异常因子

Fig. 7 Abnormal factor of every sequence in $B$ class

下载: 全尺寸图片幻灯片

图 8 OE-MSVR在三个阶段的预测效果

Fig. 8 The prediction of three stages with OE-MSVR algorithms

下载: 全尺寸图片幻灯片

图 9 序列3的OE-MSVR预测效果图

Fig. 9 The prediction of the third sequence with OE-MSVR algorithm

下载: 全尺寸图片幻灯片

图 10 序列3的SVR预测效果图

Fig. 10 The prediction of the third sequence with SVR algorithm

下载: 全尺寸图片幻灯片

图 11 序列3的ELM预测效果图

Fig. 11 The prediction of the third sequence with ELM algorithm

下载: 全尺寸图片幻灯片

图 12 三种方法下预测误差对比图

Fig. 12 The errors of three algorithms

下载: 全尺寸图片幻灯片

图 13 初始序列聚类结果

Fig. 13 The result of clustering on original sequences

下载: 全尺寸图片幻灯片

图 14 异常序列检测过程图

Fig. 14 The detection of abnormal sequences

下载: 全尺寸图片幻灯片

图 15 五种方法下预测误差对比图

Fig. 15 The errors of five algorithms

下载: 全尺寸图片幻灯片

图 16 四个数据集的聚类结果图

Fig. 16 The results of clustering on four datasets

下载: 全尺寸图片幻灯片

图 17 四个数据集对应类的主曲线结果图

Fig. 17 The principal curve of classes on four datasets

下载: 全尺寸图片幻灯片

图 18 四个数据集的异常因子结果图

Fig. 18 The abnormal factors of four datasets

下载: 全尺寸图片幻灯片

表 1 各个阶段预测结果性能指标对比

Table 1 Prediction performance parameters of capillary of three stages

序列	RMSE			MAE
序列	初始序列	聚类后	异常序列剔除后	初始序列	聚类后	异常序列剔除后
序列3	0.0589	0.0511	0.0319	0.0405	0.0357	0.0248
序列4	0.0843	0.0814	0.0364	0.0638	0.0603	0.0270
序列5	0.0559	0.0508	0.0379	0.0435	0.0387	0.0292
序列6	0.0675	0.0585	0.0350	0.0494	0.0435	0.0269

下载: 导出CSV

表 2 实际数据集信息

Table 2 Real datasets

数据集	样本数目		属性数目
数据集	训练样本	测试样本	属性数目
澳门气象数据	1 276	547	11
Monitor system	1 260	540	19
Istanbul stock exchange	375	161	9
Air quality	1 680	720	13
Gas sensor array drift	310	133	30

下载: 导出CSV

表 3 A monitorsystem数据集预测结果性能指标对比

Table 3 Prediction performance parameters of capillary of A monitor system dataset

序列	RMSE					MAE
序列	OE-MSVR	Fol-BP	SVR	ELM	AR	OE-MSVR	Fol-BP	SVR	ELM	AR
序列1	0.1318	0.4561	0.8583	0.2388	0.1628	0.0949	0.3288	0.4115	0.1115	0.1628
序列2	0.1006	0.5843	0.4862	0.1926	0.1514	0.0835	0.4236	0.3271	0.1249	0.1514
序列5	4.7365	9.1443	14.5330	6.8052	4.0215	3.0607	6.4887	14.2472	2.4689	4.0215
序列6	0.5226	1.6900	0.7693	0.4723	0.4642	0.3459	1.1693	0.6508	0.3407	0.4642
序列7	0.3767	1.2893	0.5828	0.3817	0.2548	0.2730	0.9145	0.4679	0.2784	0.2548
序列17	0.2112	0.9896	1.6971	0.4789	0.2820	0.1642	0.7497	0.7099	0.2239	0.2820
序列18	0.7122	2.2701	0.9597	0.8947	0.6812	0.5146	1.7620	0.7590	0.6643	0.6812
序列19	0.1244	0.7442	0.2993	0.0434	0.0994	0.0613	0.4913	0.2992	0.0038	0.0994

下载: 导出CSV

表 4 Italian airquality数据集预测结果性能指标对比

Table 4 Prediction performance parameters of capillary of Italian air quality

序列	RMSE					MAE
序列	OE-MSVR	Fol-BP	SVR	ELM	AR	OE-MSVR	Fol-BP	SVR	ELM	AR
序列2	121.9587	122.2608	123.6130	122.9571	123.4309	75.7143	70.2002	75.4696	76.6154	69.6428
序列5	146.4420	148.3487	146.5258	146.3931	168.6011	89.7674	89.3968	88.6896	92.8062	100.1906
序列7	113.6631	113.6741	117.6938	122.5415	123.6603	76.7395	77.8434	81.9074	85.8995	85.7884
序列9	165.8359	167.1297	167.6215	174.2383	184.5600	100.3676	100.1063	101.4094	106.4807	108.7986
序列10	178.1549	180.5234	186.9799	190.4919	200.0233	120.5245	118.7809	125.6156	129.6792	131.0641

下载: 导出CSV

表 5 Istanbul stockexchange数据集预测结果性能指标对比

Table 5 Prediction performance parameters of capillary of Istanbul stock exchange

序列	RMSE					MAE
序列	OE-MSVR	Fol-BP	SVR	ELM	AR	OE-MSVR	Fol-BP	SVR	ELM	AR
序列1	0.0119	0.0192	0.0121	0.0133	0.0119	0.0090	0.0140	0.0092	0.0101	0.0092
序列2	0.0151	0.0217	0.0153	0.0167	0.0151	0.0116	0.0167	0.0117	0.0128	0.0117
序列5	0.0097	0.0170	0.0095	0.0110	0.0098	0.0072	0.0123	0.0073	0.0083	0.0073

下载: 导出CSV

表 6 Gas sensor arraydrift数据集预测结果性能指标对比

Table 6 Prediction performance parameters of capillary of Gas sensor array drift dataset

序列	RMSE					MAE
序列	OE-MSVR	Fol-BP	SVR	ELM	AR	OE-MSVR	Fol-BP	SVR	ELM	AR
序列2	9.08E + 04	2.58E + 05	9.17E + 04	1.17E + 05	9.27E + 04	5.20E + 04	1.58E + 05	5.73E + 04	7.48E + 04	5.30E + 04
序列4	22.7062	74.2977	27.8097	26.3881	25.4101	10.9343	39.8237	17.5229	17.1492	15.2092
序列5	31.5850	111.9757	34.3149	39.0925	34.9793	16.1337	73.2789	20.6630	25.5340	21.5229
序列6	45.5174	221.2905	52.0124	59.9806	52.5336	22.6310	118.2681	27.9837	36.8887	30.6452
序列12	18.0458	80.8880	24.1219	28.6046	20.4600	9.0439	45.7426	16.7021	16.8084	12.8685
序列13	26.9213	79.6147	26.4391	34.9546	30.0822	13.2553	49.2278	16.5452	24.2313	19.3481
序列14	36.5061	241.8456	38.7079	49.0467	43.4256	15.8653	138.9438	20.7230	27.4320	24.8211
序列22	3.2263	5.0164	3.4150	2.7369	2.3271	2.7023	2.9809	2.9791	1.8935	1.4541
序列30	3.1137	6.3610	3.4056	2.7941	2.2838	2.5935	4.2371	3.0138	1.8682	1.3841

下载: 导出CSV

参考文献(27)

[1]	Schölkopf B B, Smola A J. Learning with Kernels. Cambridge, Britain:MIT Press, 2002, 3:2165-2176 https://www.cs.utah.edu/~piyush/teaching/learning-with-kernels.pdf
[2]	张勇, 关伟.基于最大Lyapunov指数的多变量混沌时间序列预测.物理学报, 2009, 58(2):756-763 doi: 10.7498/aps.58.756 Zhang Yong, Guan Wei. Predication of multivariable chaotic time series based on maximal Lyapunov exponent. Acta Physica Sinica, 2009, 58(2):756-763 doi: 10.7498/aps.58.756
[3]	Sun B Q, Guo H F, Karimi H R, Ge Y J, Xiong S. Prediction of stock index futures prices based on fuzzy sets and multivariate fuzzy time series. Neurocomputing, 2015, 151:1528-1536 doi: 10.1016/j.neucom.2014.09.018
[4]	韩敏, 许美玲, 任伟杰.多元混沌时间序列的相关状态机预测模型研究.自动化学报, 2014, 40(5):822-829 http://www.aas.net.cn/CN/abstract/abstract18350.shtml Han Min, Xu Mei-Ling, Ren Wei-Jie. Research on multivariate chaotic time series prediction using mRSM model. Acta Automatica Sinica, 2014, 40(5):822-829 http://www.aas.net.cn/CN/abstract/abstract18350.shtml
[5]	Wang X Y, Han M. Improved extreme learning machine for multivariate time series online sequential prediction. Engineering Applications of Artificial Intelligence, 2015, 40:28-36 doi: 10.1016/j.engappai.2014.12.013
[6]	Han M, Xu M L, Liu X X, Wang X Y. Online multivariate time series prediction using SCKF-γ ESN model. Neurocomputing, 2015, 147:315-323 doi: 10.1016/j.neucom.2014.06.057
[7]	Chen T T, Lee S J. A weighted LS-SVM based learning system for time series forecasting. Information Sciences, 2015, 299:99-116 doi: 10.1016/j.ins.2014.12.031
[8]	Sanchez-Fernandez M, de-Prado-Cumplido M, Arenas-Garcia J, Perez-Cruz F. SVM multiregression for nonlinear channel estimation in multiple-input multiple-output systems. IEEE Transactions on Signal Processing, 2005, 52(8):2298-2307 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=1315948
[9]	Bao Y K, Xiong T, Hu Z Y. Multi-step-ahead time series prediction using multiple-output support vector regression. Neurocomputing, 2014, 129:482-493 doi: 10.1016/j.neucom.2013.09.010
[10]	Han J W, Kamber M, Pei J[著], 范明, 孟小峰[译]. 数据挖掘概念与技术(第3版) (计算机科学丛书). 北京: 机械工业出版社, 2012. 297-301 Han J W, Kamber M, Pei J[Author], Fan Ming, Meng Xiao-Feng[Translator]. Data Mining Concepts and Techniques (3rd edition) (Computer Science Series). Beijing: China Machine Press, 2012. 297-301
[11]	韩忠明, 陈妮, 乐嘉锦, 段大高, 孙践知.面向热点话题时间序列的有效聚类算法研究.计算机学报, 2012, 35(11):2337-2347 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsjx201211012&dbname=CJFD&dbcode=CJFQ Han Zhong-Ming, Chen Ni, Le Jia-Jin, Duan Da-Gao, Sun Jian-Zhi. An efficient and effective clustering algorithm for time series of hot topics. Chinese Journal of Computers, 2012, 35(11):2337-2347 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsjx201211012&dbname=CJFD&dbcode=CJFQ
[12]	Lee H. The Euclidean distance degree of Fermat hypersurfaces. Journal of Symbolic Computation, 2017, 80:502-510 doi: 10.1016/j.jsc.2016.07.006
[13]	Hautamaki V, Nykanen P, Franti P. Time-series clustering by approximate prototypes. In: Proceedings of the 19th International Conference on Pattern Recognition. Tampa, FL, USA: IEEE, 2008. 1-4
[14]	杨一鸣, 潘嵘, 潘嘉林, 杨强, 李磊.时间序列分类问题的算法比较.计算机学报, 2007, 30(8):1259-1266 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsjx200708010&dbname=CJFD&dbcode=CJFQ Yang Yi-Ming, Pan Rong, Pan Jia-Lin, Yang Qiang, Li Lei. A comparative study on time series classification. Chinese Journal of Computers, 2007, 30(8):1259-1266 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsjx200708010&dbname=CJFD&dbcode=CJFQ
[15]	张军平, 王珏.主曲线研究综述.计算机学报, 2003, 26(2):129-146 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsjx200302000&dbname=CJFD&dbcode=CJFQ Zhang Jun-Ping, Wang Yu. An overview of principal curves. Chinese Journal of Computers, 2003, 26(2):129-146 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsjx200302000&dbname=CJFD&dbcode=CJFQ
[16]	毛文涛, 王金婉, 何玲, 袁培燕.面向贯序不均衡数据的混合采样极限学习机.计算机应用, 2015, 35(8):2221-2226 doi: 10.11772/j.issn.1001-9081.2015.08.2221 Mao Wen-Tao, Wang Jin-Wan, He Ling, Yuan Pei-Yan. Hybrid sampling extreme learning machine for sequential imbalanced data. Journal of Computer Application, 2015, 35(8):2221-2226 doi: 10.11772/j.issn.1001-9081.2015.08.2221
[17]	孙克辉, 贺少波, 尹林子, 阿地力·多力坤.模糊熵算法在混沌序列复杂度分析中的应用.物理学报, 2012, 61(13):130507 doi: 10.7498/aps.61.130507 Sun Ke-Hui, He Shao-Bo, Yin Lin-Zi, A Di-Li·Duo Li-Kun. Application of fuzzyen algorithm to the analysis of complexity of chaotic sequence. Acta Physica Sinica, 2012, 61(13):130507 doi: 10.7498/aps.61.130507
[18]	毛文涛, 赵胜杰, 张俊娜.基于主曲线的多输入多输出支持向量机算法.计算机应用, 2013, 33(5):1281-1284, 1293 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsjy201305022&dbname=CJFD&dbcode=CJFQ Mao Wen-Tao, Zhao Sheng-Jie, Zhang Jun-Na. Multi-input-multi-output support vector machine based on principal curve. Journal of Computer Application, 2013, 33(5):1281 -1284, 1293 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsjy201305022&dbname=CJFD&dbcode=CJFQ
[19]	Yuan P Y, Ma H D, Fu H Y. Hotspot-entropy based data forwarding in opportunistic social networks. Pervasive and Mobile Computing, 2015, 16:136-154 doi: 10.1016/j.pmcj.2014.06.003
[20]	汤礼东, 宋保维, 李正, 郑珂.基于信息熵理论的小子样模糊可靠性评定方法.弹箭与制导学报, 2005, 25(S1):214-216 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=djzd2005s1040&dbname=CJFD&dbcode=CJFQ Tang Li-Dong, Song Bao-Wei, Li Zheng, Zheng Ke. A fuzzy reliability evaluation method for sub-sample products based on information entropy theory. Journal of Projectiles, Rockets, Missiles and Guidance, 2005, 25(S1):214-216 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=djzd2005s1040&dbname=CJFD&dbcode=CJFQ
[21]	Mao W T, Zhao S J, Mu X X, Wang H C. Multi-dimensional extreme learning machine. Neurocomputing, 2015, 149:160 -170 doi: 10.1016/j.neucom.2014.02.073
[22]	Chang C C, Lin C J. LIBSVM: a library for support vector machines[Online], available: http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html. February 1, 2016
[23]	许美玲, 韩敏.多元混沌时间序列的因子回声状态网络预测模型.自动化学报, 2015, 41(5):1042-1046 http://www.aas.net.cn/CN/abstract/abstract18678.shtml Xu Mei-Ling, Han Min. Factor echo state network for multivariate chaotic time series prediction. Acta Automatica Sinica, 2015, 41(5):1042-1046 http://www.aas.net.cn/CN/abstract/abstract18678.shtml
[24]	李军, 李大超.基于优化核极限学习机的风电功率时间序列预测.物理学报, 2016, 65(13):33-42 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=wlxb201613005&dbname=CJFD&dbcode=CJFQ Li Jun, Li Da-Chao. Wind power time series prediction using optimized kernel extreme learning machine method. Acta Physica Sinica, 2016, 65(13):33-42 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=wlxb201613005&dbname=CJFD&dbcode=CJFQ
[25]	马千里, 郑启伦, 彭宏, 覃姜维.基于模糊边界模块化神经网络的混沌时间序列预测.物理学报, 2009, 58(3):1410-1419 doi: 10.7498/aps.58.1410 Ma Qian-Li, Zheng Qi-Lun, Peng Hong, Qin Jiang-Wei. Chaotic time series prediction based on fuzzy boundary modular neural networks. Acta Physica Sinica, 2009, 58(3):1410-1419 doi: 10.7498/aps.58.1410
[26]	侯公羽, 梁荣, 孙磊, 刘琳, 龚砚芬.基于多变量混沌时间序列的煤矿斜井TBM施工动态风险预测.物理学报, 2014, 63(9):90505 doi: 10.7498/aps.63.090505 Hou Gong-Yu, Liang Rong, Sun Lei, Liu Lin, Gong Yan-Fen. Risk analysis on long inclined-shaft construction in coalmine by TBM techniques based on multiple variables chaotic time series. Acta Physica Sinica, 2014, 63(9):90505 doi: 10.7498/aps.63.090505
[27]	Liu C H, Shang Y L, Duan L, Chen S P, Liu C C, Chen J. Optimizing workload category for adaptive workload prediction in service clouds. Service-Oriented Computing. Lecture Notes in Computer Science. Berlin, Heidelberg, Germany: Springer, 2015. 87-104