Mining of Datasets using Big Data Technique: Hadoop Platform

Authors: Rohit Kumar; Daulat Sihag; Khushboo Sukhija
DIN: IJOER-JAN-2017-7
Abstract

Big data is the future of the IT industry. This paper presents the methodology, namely the ETL process, used to analyse big data with the Hadoop ecosystem. Analysing big data extracts business value from raw data and helps organisations gain a competitive advantage. Data in web applications and social networks is growing drastically, and such data is referred to as Big Data. Retrieving these datasets takes a large amount of time, and analysis performance suffers. To overcome this problem, Hive queries integrated with Hadoop are used to generate report analyses for thousands of datasets. The objective is to store the data persistently, together with the past history of the dataset, and to perform report analysis on that dataset. The main aim of the system is to improve performance by parallelizing operations such as loading the data, building indexes, and evaluating queries. The performance analysis is therefore carried out with parallelization. HDFS is used to store the data after the MapReduce operations are performed, and the execution time decreases as the number of nodes increases. Performance is analysed in terms of two parameters: execution time and number of nodes.

Keywords
Big Data, Hadoop, HDFS.
Introduction

Generating information requires a massive collection of data. The data can range from simple numerical figures and text documents to more complex information such as spatial data, multimedia data, and hypertext documents. To take full advantage of data, retrieval alone is not enough; it requires tools for automatic summarization of data, extraction of the essence of the stored information, and discovery of patterns in raw data. With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important to develop powerful tools for analysing and interpreting such data and for extracting interesting knowledge that could help in decision-making. The answer to all of the above is ‘Data Mining’.

Data mining is the extraction of hidden predictive information from large databases; it is a powerful technology with great potential to help organizations focus on the most important information in their data warehouses [1][2][3][4]. Data mining tools predict future trends and behaviours, helping organizations to make proactive, knowledge-driven decisions [2]. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer questions that traditionally were too time-consuming to resolve. They search databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Data mining, popularly known as Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases [3][5]. Although data mining and KDD are frequently treated as synonyms, data mining is actually one part of the knowledge discovery process.

Conclusion

In the present work, data mining is performed using the Hadoop ecosystem. The work mines large loan datasets using the Hive component of the Hadoop ecosystem. Hive queries are run to mine useful data, for example: “we are pulling those Bank Customers whose Loans were processed successfully starting from year 2007 till 2013. They were proven as the best customers for banks as their payment schedule and other incomes were verified and made on timely basis. They were provided Loan at ROI = 7% and for 3 years duration.” This information can be mined in less time because of the parallelization provided by the Hadoop ecosystem.
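The selection described above can be sketched in plain Python as a rough illustration. This is not the Hive query used in the work, which the text does not reproduce. The record fields (customer_id, issue_year, status, interest_rate, term_years, income_verified) and the status value "processed" are hypothetical, since the dataset schema is not given. The thread pool only imitates, on a single machine, the idea of scanning partitions independently and merging the results; it does not reproduce Hadoop's distribution across nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Loan:
    # All field names are hypothetical; the source does not give the schema.
    customer_id: str
    issue_year: int
    status: str            # assumed values such as "processed" or "rejected"
    interest_rate: float   # percent per year (the "ROI" in the text)
    term_years: int
    income_verified: bool


def is_best_customer(loan: Loan) -> bool:
    """Predicate mirroring the criteria quoted in the text."""
    return (
        loan.status == "processed"
        and 2007 <= loan.issue_year <= 2013
        and loan.interest_rate == 7.0
        and loan.term_years == 3
        and loan.income_verified
    )


def filter_partition(partition: Iterable[Loan]) -> List[str]:
    """Map-like step: scan one partition and keep matching customer ids."""
    return [loan.customer_id for loan in partition if is_best_customer(loan)]


def best_customers(partitions: List[List[Loan]], workers: int = 4) -> List[str]:
    """Scan partitions concurrently, then merge into a sorted, de-duplicated list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = list(pool.map(filter_partition, partitions))
    # Reduce-like step: merge the per-partition results.
    return sorted({cid for ids in per_partition for cid in ids})
```

Each partition is filtered without reference to the others, which is the property that lets a Hadoop cluster spread the same scan over more nodes.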

From our data analytics, we conclude that these customers are the best future market for banks and can be given priority over other customers. Companies can create a separate operational data store (ODS) to keep an inventory of these customers for faster search and loan processing.
