Chinese References
[1]凌士雄 (2004), A Performance Comparison of Solution Strategies for Asymmetric Classification Analysis, Master's thesis, Department of Information Management, National Sun Yat-sen University, Kaohsiung.
[2]張琦, 吳斌, and 王柏 (2005), “A survey of training methods for imbalanced data,” Computer Science, Vol. 32, No. 10, pp. 181-186.
[3]郭琇靜 (2007), Applying Support Vector Machines and Statistical Process Features to On-line Detection of Process Abnormalities, Master's thesis, Institute of Industrial Engineering and Management, National Formosa University, Yunlin.
English References
[1]A. An, and Y. Wang, (2001), “Comparisons of classification methods for screening potential compounds,” Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 11-18.
[2]A. Orriols-Puig, and E. Bernadó-Mansilla, (2009), “Evolutionary rule-based systems for imbalanced datasets,” Soft Computing, vol. 13, pp. 213-225.
[3]A. Fernández, S. García, M.J. del Jesus, and F. Herrera, (2008), “A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets,” Fuzzy Sets and Systems, Vol. 159, pp. 2378-2398.
[4]A. Fernández, M.J. del Jesus, and F. Herrera, (2009), “On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets,” Expert Systems with Applications, Vol. 36, pp. 9805-9812.
[5]M.A. Mazurowski, P.A. Habas, J.M. Zurada, J.Y. Lo, J.A. Baker, and G.D. Tourassi, (2008), “Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance,” Neural Networks, Vol. 21, No. 2-3, pp. 427-436.
[6]B. Raskutti, and A. Kowalczyk, (2004), “Extreme rebalancing for SVMs: a case study,” SIGKDD Explorations, Vol. 6, No. 1, pp. 60-69.
[7]B. Zadrozny, and C. Elkan, (2001), “Learning and making decisions when costs and probabilities are both unknown,” Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pp. 204-213.
[8]C. Drummond, and R.C. Holte, (2003), “C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling,” Workshop on Learning from Imbalanced Datasets, NRC 47381.
[9]C.-I. Lee, C.-J. Tsai, T.-Q. Wu, and W.-P. Yang, (2008), “An approach to mining the multi-relational imbalanced database,” Expert Systems with Applications, Vol. 34, No. 4, pp. 3021-3032.
[10]C.-T. Su, C.-H. Yang, K.-H. Hsu, and W.-K. Chiu, (2006), “Data mining for the diagnosis of type II diabetes from three-dimensional body surface anthropometrical scanning data,” Computers & Mathematics with Applications, Vol. 51, pp. 1075-1092.
[11]C.-T. Su, and Y.-H. Hsiao, (2007), “An evaluation of the robustness of MTS for imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 10, pp. 1321-1332.
[12]C.-T. Su, L.-S. Chen, and T.-L. Chiang, (2006), “A neural network based information granulation approach to shorten the cellular phone test process,” Computers In Industry, Vol. 57, No. 5, pp. 412-423.
[13]C.-T. Su, L.-S. Chen, and Y. Yih, (2006), “Knowledge acquisition through information granulation for imbalanced data,” Expert Systems with Applications, Vol. 31, No. 3, pp. 531-541.
[14]D. Lewis, and J. Catlett, (1994), “Heterogeneous uncertainty sampling for supervised learning,” Proceedings of the 11th International Conference on Machine Learning, pp. 144-156.
[15]D.R. Wilson, and T.R. Martinez, (2000), “Reduction techniques for instance-based learning algorithms,” Machine Learning, Vol. 38, No. 3, pp. 257-286.
[16]E. Andrews, Q. Morris, and A. Bonner, (2008), “Neural networks approaches for discovering the learnable correlation between gene function and gene expression in mouse,” Neurocomputing, Vol. 71, No. 16-18, pp. 3168-3175.
[17]G. Batista, R. Prati, and M. Monard, (2004), “A study of the behaviour of several methods for balancing machine learning training data,” SIGKDD Explorations, Vol. 6, No. 1, pp. 20-29.
[18]G. Batista, R.C. Prati, and M.C. Monard, (2004), “A study of the behavior of several methods for balancing machine learning training data,” SIGKDD Explorations, Vol. 6, No. 1, pp. 20-29.
[19]G.H. Nguyen, A. Bouzerdoum, and S.L. Phung, (2008), “A supervised learning approach for imbalanced data sets,” IEEE Xplore, pp. 1-4.
[20]G.M. Weiss, (2004), “Mining with rarity: a unifying framework,” SIGKDD Explorations, Vol. 6, No. 1, pp. 7-19.
[21]G.M. Weiss, and F. Provost, (2001), “The effect of class distribution on classifier learning,” Technical Report ML-TR-43, Department of Computer Science, Rutgers University.
[22]G.V. Kass, (1980), “An exploratory technique for investigating large quantities of categorical data,” Applied Statistics, Vol. 29, No. 2, pp. 119-127.
[23]G. Weiss, and F. Provost, (2003), “Learning when training data are costly: the effect of class distribution on tree induction,” Journal of Artificial Intelligence Research, No. 19, pp. 315-354.
[24]G. Wu, and E.Y. Chang, (2005), “KBA: kernel boundary alignment considering imbalanced data distribution,” IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, pp. 786-795.
[25]H. Altincay, and C. Ergun, (2004), “Clustering based undersampling for improving speaker verification decisions using AdaBoost,” Lecture Notes in Computer Science, Vol. 3138, pp. 698-706.
[26]H. Guo, and H.L. Viktor, (2004), “Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach,” SIGKDD Explorations, Vol. 6, No. 1, pp. 30-39.
[27]H.T. Lin, and C.J. Lin, (2003), “A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods,” Technical report, Department of Computer Science & Information Engineering, National Taiwan University.
[28]C.-W. Hsu, C.-C. Chang, and C.-J. Lin, (2006), “A practical guide to support vector classification,” available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html.
[29]I.H. Witten, and E. Frank, (2002), “Data mining: practical machine learning tools and techniques with Java implementations,” Morgan Kaufmann, San Francisco.
[30]J.G. Xie, and Z.D. Qiu, (2007), “The effect of imbalanced data sets on LDA: a theoretical and empirical analysis,” Pattern Recognition, Vol. 40, No. 2, pp. 557-562.
[31]J.R. Quinlan, (1986), “Induction of decision trees,” Machine Learning, Vol. 1, No. 1, pp. 81-106.
[32]J.R. Quinlan, (1993), “C4.5: programs for machine learning,” Morgan Kaufmann, San Mateo, CA.
[33]J. Yu, and L. Xi, (2008), “A hybrid learning-based model for on-line monitoring and diagnosis of out-of-control signals in multivariate manufacturing processes,” International Journal of Production Research, DOI: 10.1080/00207540801942208.
[34]L.M. Taft, R.S. Evans, C.R. Shyu, M.J. Egger, N. Chawla, J.A. Mitchell, S.N. Thornton, B. Bray, and M. Varner, (2009), “Countering imbalanced datasets to improve adverse drug event predictive models in labor and delivery,” Journal of Biomedical Informatics, Vol. 42, pp. 356-364.
[35]L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, (1984), “Classification and regression trees,” Wadsworth, Belmont, CA.
[36]L. Xu, and M.Y. Chow, (2006), “A classification approach for power distribution systems fault cause identification,” IEEE Transactions on Power Systems, Vol. 21, No. 1, pp. 53-60.
[37]L. Zhuang, H. Dai, and X. Hang, (2005), “A novel field learning algorithm for dual imbalance text classification,” International Conference on Fuzzy Systems and Knowledge Discovery, Lecture Notes in Artificial Intelligence, Vol. 3614, pp. 39-48.
[38]J. Liu, Q. Hu, and D. Yu, (2008), “A comparative study on rough set based class imbalance learning,” Knowledge-Based Systems, Vol. 21, No. 8, pp. 753-763.
[39]K. Coussement, and D. Van den Poel, (2008), “Churn prediction in subscription services: An application of support vector machines while comparing two parameter selection techniques,” Expert Systems with Applications, Vol. 34, No. 1, pp. 313-327.
[40]M.A. Maloof, (2003), “Learning when data sets are imbalanced and when costs are unequal and unknown,” ICML-2003 Workshop on Learning from Imbalanced Data Sets.
[41]M.-C. Chen, L.-S. Chen, C.-C. Hsu, and W.-R. Zeng, (2008), “An information granulation based data mining approach for classifying imbalanced data,” Information Sciences, Vol. 178, No. 16, pp. 3214-3227.
[42]M.J.A. Berry, and G. Linoff, (1997), “Data mining techniques: for marketing, sales, and customer support,” John Wiley & Sons, Inc.
[43]M. Kubat, and S. Matwin, (1997), “Addressing the curse of imbalanced training sets: one-sided selection,” Proceedings of the 14th International Conference on Machine Learning, pp. 179-186.
[44]M. Kubat, R. Holte, and S. Matwin, (1997), “Learning when negative examples abound,” Proceedings of the European Conference on Machine Learning, pp. 146-153.
[45]M. Rosell, V. Kann, and J. Litton, (2004), “Comparing comparisons: document clustering evaluation using two manual classifications,” Proceedings of the International Conference on Natural Language Processing, pp. 207-216.
[46]N. Chawla, A. Lazarevic, L. Hall, and K. Bowyer, (2003), “SMOTEBoost: improving prediction of the minority class in boosting,” 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, pp. 107-119.
[47]N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer, (2002), “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357.
[48]N. Japkowicz, and S. Stephen, (2002), “The class imbalance problem: a systematic study,” Intelligent Data Analysis, Vol. 6, No. 5, pp. 429-449.
[49]N.V. Chawla, N. Japkowicz, and A. Kolcz, (2004), “Editorial: special issue on learning from imbalanced data sets,” SIGKDD Explorations, Vol. 6, No. 1, pp. 1-6.
[50]P. Campadelli, E. Casiraghi, and G. Valentini, (2005), “Support vector machines for candidate nodules classification,” Neurocomputing, Vol. 68, pp. 281-288.
[51]P. Hart, (1968), “The condensed nearest neighbor rule,” IEEE Transactions on Information Theory, Vol. 14, No. 3, pp. 515-516.
[52]R. Barandela, J.S. Sanchez, V. Garcia, and E. Rangel, (2003), “Strategies for learning in class imbalance problems,” Pattern Recognition, Vol. 36, No. 3, pp. 849-851.
[53]S. Lessmann, and S. Voß, (2009), “A reference model for customer-centric data mining with support vector machines,” European Journal of Operational Research, pp. 520-530.
[54]S.S. Keerthi, and C.-J. Lin, (2003), “Asymptotic behaviors of support vector machines with Gaussian kernel,” Neural Computation, Vol. 15, No. 7, pp. 1667-1689.
[55]S.J. Press, and S. Wilson, (1978), “Choosing between logistic regression and discriminant analysis,” Journal of the American Statistical Association, Vol. 73, pp. 699-705.
[56]S.R. Gunn, (1998), “Support vector machines for classification and regression,” Technical Report, University of Southampton.
[57]S. Li, L. Shue, and S. Lee, (2008), “Business intelligence approach to supporting strategy-making of ISP service management,” Expert Systems with Applications, Vol. 35, No. 3, pp. 739-754.
[58]S.-J. Yen, and Y.-S. Lee, (2009), “Cluster-based under-sampling approaches for imbalanced data distributions,” Expert Systems with Applications, Vol. 36, pp. 5718-5727.
[59]A. Estabrooks, T. Jo, and N. Japkowicz, (2004), “A multiple resampling method for learning from imbalanced data sets,” Computational Intelligence, Vol. 20, No. 1, pp. 18-36.
[60]T. Fawcett, and F.J. Provost, (1997), “Adaptive fraud detection,” Data Mining and Knowledge Discovery, Vol. 1, No. 3, pp. 291-316.
[61]T. Kohonen, (1990), “The self-organizing map,” Proceedings of the IEEE, Vol. 78, No. 9, pp. 1464-1480.
[62]T.W. Liao, (2008), “Classification of weld flaws with imbalanced class data,” Expert Systems with Applications, Vol. 35, No. 3, pp. 1041-1052.
[63]I. Tomek, (1976), “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, Vol. 6, No. 11, pp. 769-772.
[64]U. Seiffert, and L. Jain, (2002), “Self-organizing neural networks: recent advances and applications,” Studies in Fuzziness and Soft Computing, Vol. 78, Springer, Berlin.
[65]V.N. Vapnik, (1995), “The nature of statistical learning theory,” Springer-Verlag, New York, NY, USA.
[66]V.S. Desai, J.N. Crook, and G.A. Overstreet, (1996), “A comparison of neural networks and linear scoring models in the credit union environment,” European Journal of Operational Research, Vol. 95, pp. 24-37.
[67]Y.M. Chae, S.H. Ho, K.W. Cho, D.H. Lee, and S.H. Ji, (2001), “Data mining approach to policy analysis in a health insurance domain,” International Journal of Medical Informatics, Vol. 62, No. 2-3, pp. 103-111.
[68]Y. Xie, X. Le, E.W.T. Ngai, and W. Ying, (2009), “Customer churn prediction using improved balanced random forests,” Expert Systems with Applications, Vol. 36, pp. 5445-5449.
[69]Z.H. Zhou, and X.Y. Liu, (2006), “Training cost-sensitive neural networks with methods addressing the class imbalance problem,” IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 1, pp. 63-77.