Abstract This thesis studies the concept of Big data and attempts to address it by understanding, analyzing, and limiting its challenges. The growing volume of data and the diversity of its forms make Big data difficult to handle, so two frameworks are proposed to improve classification performance. In the data preprocessing stage, data is transformed so that all features are in numerical format, and a Genetic Rough set is used to reduce and select features. In the data processing stage, mislabeled instances are removed by learning algorithms, because such instances cause misclassification and lower classification accuracy. The first framework eliminates mislabeled instances to improve classification performance: Fuzzy-Rough Nearest Neighbor is used to remove the mislabeled instances, and classification techniques are then applied. The second framework uses MapReduce and Fuzzy Rough set for feature selection. It has three main stages: data preprocessing, map, and reduce. The data preprocessing stage addresses two main problems. The first is the variety of data: a one-to-one transformation is proposed to handle heterogeneous data. The second is incomplete data, which is solved by k-nearest neighbor imputation. In the map stage, rough set concepts are applied for feature selection. In the reduce stage, clustering is applied to identify similar features so that they are assigned the same key. The aim of this framework is to reduce the features of big data sets. According to the results, using Fuzzy Rough set for feature selection saves time when building the classification model. The best related work reports classification accuracy of around 70-88%. The proposed decision tree achieves 87.2% accuracy and 90.3% precision, and KNN reaches 90.9% accuracy.
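The preprocessing and instance-filtering steps named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. The function names (`encode_categories`, `knn_impute`, `filter_mislabeled`) and the toy data are hypothetical. The sketch assumes that the one-to-one transformation maps each distinct category to one integer code, and it swaps in a plain k-nearest-neighbor majority vote where the thesis uses Fuzzy-Rough Nearest Neighbor.

```python
import math
from collections import Counter


def encode_categories(values):
    """Assumed one-to-one transformation: one integer code per distinct category."""
    codes = {}
    for v in values:
        if v is not None and v not in codes:
            codes[v] = len(codes)  # codes follow order of first appearance
    return [codes[v] if v is not None else None for v in values], codes


def distance(a, b):
    """Euclidean distance over the features that are present in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that feature over the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have feature j, nearest first.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def filter_mislabeled(rows, labels, k=3):
    """Keep indices whose label agrees with the majority label of their k neighbors.

    Plain k-NN voting, used here as a simple stand-in for Fuzzy-Rough Nearest Neighbor.
    """
    keep = []
    for i, row in enumerate(rows):
        neighbours = sorted(
            (distance(row, other), m) for m, other in enumerate(rows) if m != i
        )[:k]
        votes = Counter(labels[m] for _, m in neighbours)
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return keep


# Toy data (hypothetical): two clusters, one missing value, one mislabeled row.
rows = [
    [1.0, 1.0], [1.2, None], [0.9, 1.1], [1.1, 0.9],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [5.2, 5.1],
]
labels = ["a", "a", "a", "a", "b", "b", "a", "b"]  # index 6 is mislabeled

filled = knn_impute(rows, k=2)
kept = filter_mislabeled(filled, labels, k=3)
encoded, mapping = encode_categories(["red", "blue", "red", None])
```

On this toy data the row at index 6 sits in the second cluster but carries the first cluster's label, so the vote of its neighbors drops it before any classifier is trained.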