Data Pre-Processing for Classification and Clustering
DOI:
https://doi.org/10.51983/ajcst-2012.1.1.1698

Keywords:
data mining, classification algorithm, redundancy, conflicting data

Abstract
Real-world datasets contain a great deal of redundant and conflicting data. The performance of a classification algorithm in data mining is strongly affected by such noisy information: redundant and conflicting records not only increase the cost of the mining process but also degrade the detection performance of the classifiers, so they must be removed to improve classifier efficiency and accuracy. This process is called tuning the dataset. A redundancy check is first performed on the original dataset and the reduced result is preserved. The resulting dataset is then checked for conflicting records, which are corrected and written back to the dataset. The updated dataset is classified using a variety of classifiers, namely Multilayer Perceptron, SVM, Decision Stump, KStar, LWL, REPTree, Decision Table, ID3, J48, and Naïve Bayes, and the performance of each classifier on the tuned data is measured. The results show a significant improvement in classification accuracy once redundancy and conflicts are removed: after the corrected conflicts are written back to the original dataset, the evaluated performance of the classifiers improves markedly.
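For concreteness, below is a minimal sketch of the "tuning" step described above, written in Python with pandas. The column name "class", the toy data, and the correction rule for conflicts (majority vote within each conflict group) are illustrative assumptions, since the abstract does not specify how conflicts are corrected; the paper's classification experiments themselves use Weka classifiers.

import pandas as pd

def tune_dataset(df: pd.DataFrame, label: str = "class") -> pd.DataFrame:
    """Remove redundant records, then correct conflicting ones."""
    features = [c for c in df.columns if c != label]

    # Redundancy check: drop exact duplicate rows (same features, same label).
    df = df.drop_duplicates().reset_index(drop=True)

    # Conflict check: identical feature vectors carrying different labels.
    # Assumed correction: replace every label in a conflict group with the
    # group's majority label.
    df[label] = df.groupby(features)[label].transform(lambda s: s.mode().iloc[0])

    # After correction, former conflicts are exact duplicates; drop them too.
    return df.drop_duplicates().reset_index(drop=True)

if __name__ == "__main__":
    raw = pd.DataFrame({
        "f1":    [0, 0, 1, 1, 1, 2],
        "f2":    [5, 5, 3, 3, 3, 7],
        "class": ["a", "a", "b", "b", "c", "a"],  # rows 0-1 redundant; rows 2-4 conflict
    })
    print(tune_dataset(raw))  # three rows remain: (0,5,a), (1,3,b), (2,7,a)

The tuned dataset would then be passed to the ten classifiers listed in the abstract for the before/after accuracy comparison.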
License
Copyright (c) 2012 The Research Publication
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.