Confluence Journal of Pure and Applied Sciences (CJPAS) Vol. 1, No. 1, November 2017 Faculty of Science, Federal University Lokoja, Kogi State, Nigeria ISSN: 2616-1303 | Web: www.cjpas.fulokoja.edu.ng

TRENDS AND TECHNOLOGIES IN BIG DATA ANALYTICS: A REVIEW

Taiwo Kolajo, Emeka Ogbuju, Sunday Eric Adewumi

Department of Computer Science, Federal University Lokoja, Nigeria

Email: {taiwo.kolajo, emeka.ogbuju, sunday.adewumi}@fulokoja.edu.ng

ABSTRACT

Big data has become a vital part of computing technology in recent times. However, much of the literature on big data has focused mainly on identifying its defining features, without concrete applications to the several challenges it presents. Although a number of tools and technologies exist to solve problems in the big data domain, many problems remain unsolved because the applied features of these tools are poorly understood. This paper defines five research questions in this new area and examines them through a review of state-of-the-art approaches in big data analytics. The review covers the major fields in which big data analytics readily finds application, the tools and technologies employed for big data solutions, trends in big data analytics, and application areas. The paper concludes by highlighting current and emerging research issues in the big data domain and identifying directions for future work.

Keywords: Big data, Big data analytics, NoSQL, Hadoop, MapReduce, Hadoop Ecosystem, Apache Spark

1.0 INTRODUCTION

We are living in an era where everybody and everything is an agent of data. That is, both man and machine can generate data at their own pace. This has given rise to a deluge of data that demands analytics and meaningful insights for business growth. We are not just being flooded with data from sources ranging from connected social networks to sensors in devices; we also face the challenge of storing and analyzing these data with a myriad of tools and techniques. Understanding these tools and techniques as they emerge is paramount to successful data analytics that would bring about lasting solutions to big data related problems.

It is difficult to gauge the aggregate volume of information stored electronically. However, an International Data Corporation (IDC) estimate put the size of the "digital universe" at 4.4 zettabytes in 2013, with tenfold growth expected by 2020 (White, 2015). The entire world and everything in it is becoming digitized. The digital universe is growing at a rate of 40% a year into the next decade, expanding to include not only the increasing number of people and enterprises doing everything on the web but also all 'things', that is, smart devices connected to the web, unleashing a new wave of opportunities for businesses and individuals around the globe (Zwolenski and Weatherill, 2014).

Many authors have tried to define big data. Based on the various definitions, there are three different schools of thought: some authors emphasize the technologies involved in big data, some compare traditional data with big data, and some define big data with respect to its features. As a result, the various definitions found in the literature can be categorized into three: architectural, comparative and attributive. From the architectural point of view, big data can be defined as “The cluster of methods and technologies in which new forms are integrated to unfold hidden values in diverse, complex and high volume data sets” (Hashem et al., 2015). From the comparative point of view, big data refers to “Datasets that stretches the limits of traditional data processing and storage systems” (Duggal et al., 2015). The attributive category defines big data as “A collection of data sets which is very large in size as well as complex in structure” (Lovalekar, 2014). Reflecting on the three schools of thought, big data can be defined as the storage and analysis of large and/or complex data sets generated at high velocity using a series of


more advanced techniques on supporting platforms than required in traditional data analytics. Big data can be further categorized into big data science and big data frameworks. Big data science refers to the study of techniques and algorithms covering the acquisition, conditioning and evaluation of big data, while big data frameworks are platforms or technologies that enable distributed processing and analysis of big data over clusters of computers (Team, 2011); (Grobelnik, 2012).

Big data in a vacuum is worthless. The only way to unlock the potential of big data is by leveraging it to drive decision-making (IBM, 2013). Big data analytics refers to the systematic discovery of previously unknown correlations and meaningful patterns in large multivariate datasets generated at great velocity, in order to support decision-making (Reena et al., 2015). It uses algorithms running on powerful platforms to uncover the potential concealed in big data, such as previously unknown correlations and hidden patterns (Hu et al., 2014). Rather than relying on structured data systems, big data analysis concentrates on methods such as distributed processing, divide and conquer, and pattern matching to solve real-life problems. Using cutting-edge techniques such as text analytics, machine learning, predictive analytics, statistics and natural language processing, organizations can analyze big data to understand the current state of their services. Examples of the supporting technologies include NoSQL databases, Hadoop and MapReduce, and Apache Spark (Russon et al., 2011). In this paper, the following research questions are discussed:

1. What are the various categories of big data application?

2. What state-of-art tools and technologies are used in big data analytics?

3. What is the trend in big data analytics?

4. In what areas has big data analytics been applied?

5. What are the research issues in big data analytics?

The structure of this paper is as follows: Section 2 deals with the big data analytics (BDA) application,

enumerating the six (6) major categories of structured and unstructured datasets and the methods employed in handling them (research question 1). Section 3 discusses BDA technologies, expounding the three (3) prominent technologies currently in use in most big data solutions (research question 2). In Section 4, the trends of BDA from MapReduce to Spark programming, as well as the visualization techniques and benchmark options available, are presented (research question 3). Section 5 deals with the application areas of BDA using the identified technologies (research question 4); and finally, Section 6 presents the research issues in the era of big data (research question 5). Where relevant, we distinguish big data analytics from traditional analytics. The paper concludes by highlighting the major contributions of the review.

2.0 BIG DATA ANALYTICS APPLICATION

Big data analytics applications can be categorized into six based on data type: structured data analytics, text analytics, multimedia analytics, mobile analytics, web analytics and network analytics.

2.1 Structured Data Analytics

The extensive amount of structured data produced in the scientific and business fields has been well handled by mature Relational Database Management Systems (RDBMS), data warehousing, Online Analytical Processing (OLAP) and Business Process Management (BPM), utilizing data mining and statistical analysis (Chaudhuri et al., 2011). Much research has been done in data mining and statistical analysis. Recently, deep learning, privacy-preserving data mining and process mining have become active research fields. Deep learning, as opposed to most current machine learning algorithms (which utilize human-designed representations and input features), incorporates representation learning and learns multiple levels of representation of increasing complexity or abstraction. Due to security concerns in healthcare, e-government and e-commerce, privacy-preserving data mining is becoming an interesting field for researchers. Process mining, which concentrates on process analysis with the use of event data, is another area of


study for researchers (Verykios et al., 2004); (Alast, 2012).

2.2 Text Analytics

Text includes e-mail communication, web pages, corporate documents and social media content. Text analytics, or text mining, is the process of extracting useful information and knowledge from unstructured text. Most text mining techniques are based on natural language processing and text representation. Information extraction, summarization, clustering, topic modeling, opinion mining and question answering are among the technologies that have been developed for text mining.

2.3 Multimedia Analytics

Multimedia analytics refers to extracting interesting knowledge and understanding the semantics captured in multimedia data. Research in multimedia analytics covers a wide range of subjects, including multimedia annotation, summarization, indexing and retrieval, recommendation and event detection. Automatic multimedia annotation has attracted substantial research interest. The main challenge lies in the semantic gap between low-level features and annotations. Most current research on event detection is limited to news or sports events. Ma et al. (2012) proposed an algorithm for ad hoc multimedia event detection that copes with a limited number of positive training examples.

2.4 Mobile Analytics

Mobile computing is witnessing rapid growth, which results in more mobile terminals (such as mobile phones, sensors and Radio Frequency Identification (RFID) devices) and applications being deployed globally (Zhang et al., 2013). Global mobile data traffic grew from 2.1 exabytes per month at the end of 2014 to 3.7 exabytes per month at the end of 2015 (Zhang et al., 2016); (Cisco Visual Networking Index, 2016). The challenges in mobile analytics result from the inherent characteristics of mobile data, which include activity sensitivity, mobile awareness, noisiness and redundancy richness.
Recent advances in wireless sensors, mobile technologies and stream processing have brought about the deployment of real-time body-monitoring sensors in the healthcare industry. Radio Frequency Identification (RFID) is used for unique product identification. It is used to identify, locate, track and monitor physical objects cost-effectively. RFID is currently widely embraced in logistics and inventory management, but not without challenges: RFID data are inherently noisy, redundant, temporal, streaming and high-volume, and must be processed on the fly.

2.5 Web Analytics

Web analytics refers to retrieving, extracting and evaluating information from web documents and services for the purpose of knowledge discovery. It is subdivided into three categories depending on the part of the web that is mined: web structure mining, web content mining and web usage mining.

Web structure mining discovers the model underlying the link structures on the web. The structure is represented by the graph of links within a website or between websites. The design of the model is based on the hyperlink topology, with or without link descriptions. The model can be used to compare and contrast different websites, and hence for website classification. PageRank, CLEVER and Focused Crawling adopt this model to find web pages.

Web content mining is the extraction of useful information from website content. Website content usually involves several types of data such as image, text, video, audio, symbolic data, hyperlinks and metadata. Web content is unstructured in most cases; as a result, most research effort is directed at text and hypertext content. Text mining has been well researched, as described earlier. Hypertext mining refers to mining HTML pages that contain hyperlinks. Web content mining utilizes either information retrieval or database approaches. The information retrieval approach aims to assist in finding or filtering information for users based on either requested or derived user profiles. The database approach models the data on the web and integrates them so that queries more advanced than keyword-based searches can be performed.
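To make the link-graph model behind web structure mining more concrete, the following is a minimal Python sketch of a PageRank-style power iteration over a small hyperlink graph. It is an illustration only and not the production algorithm used by any search engine: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the toy graph are all assumptions introduced here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages in a link graph given as {page: [pages it links to]}.

    Hypothetical helper for illustration; parameters are assumed defaults.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: pages a, b and d all link to c, so c should rank highest.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
```

In this toy graph, the page with the most incoming links ends up with the largest score, which is the intuition behind using link structure for finding and classifying web pages.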
Web usage mining manages secondary data generated by web sessions or behaviours. Web usage data encompasses browser logs, web server access logs, proxy server logs, user profiles, registration data, user sessions or transactions, user queries, cookies, bookmark data, scrolls and mouse clicks, and other user-generated data by the


interaction of users and the web. Web usage mining finds application in e-commerce, personalization, web privacy or security and so on (Xu et al., 2012).

2.6 Network Analytics

As a result of revolutionary web development, user-generated content is exploding. This includes blogs, photo and video sharing, social networking sites, social bookmarking, social news and social wikis. Social media content contains multimedia, text, comments and locations. Social media analytics faces challenges due to the tremendous and ever-growing volume of social media data, the noise inherent in social data and its dynamic nature. Research on social media analytics is still in its infancy. In social networks, multimedia datasets incorporate rich information such as social interaction, semantic ontology, geographical maps, community media and multimedia comments. The link structure of social networks is logical and plays a vital role in multimedia information networks. Logical link structures are categorized into four: community media, semantic ontologies, personal photograph albums and geographical locations (Aggarwal, 2011). The results of retrieval systems, collaborative tagging and recommendation systems can be improved by exploiting the logical link structures embedded in multimedia information networks (Rabbath et al., 2011); (Mamu and Cautis, 2012); (Shridhar, 2012).

3.0 BIG DATA ANALYTICS TECHNOLOGIES

The key difference between the big data analytics conceptual framework and that of conventional data analytics lies in how processing is executed (Kudyba, 2014). In conventional data analytics, analyses are performed with a business intelligence tool on a stand-alone system, such as a desktop or laptop. In big data analytics, processing is broken down into chunks, distributed and executed across multiple nodes. Hadoop/MapReduce has enabled this form of analytics (Raghupathi and Raghupathi, 2014). While traditional analytics and big data analytics have similar models and algorithms, their user interfaces are extremely different. Traditional analytics tools are user-friendly and transparent, whereas big data analytics tools are complex and require strong programming skills as well as the application of a variety of other skills. The complexity begins with the data itself, as depicted in Figure 1. In the first component, data from various sources are pooled. In the second component, the data are in their raw state and need to be transformed. There are several ways to do this; one possibility is to use a service-oriented architecture combined with web services (middleware). The next component is the framework, where several decisions are made regarding the data input approach, distributed design, tool selection and analytic models. The fourth component represents the typical applications of big data analytics.

Figure 1: A conceptual Architecture of Big Data Analytics (Zikopoulos et al., 2013)


Some of the prominent technologies for big data analytics are described below.

3.1 NoSQL

NoSQL is a paradigm shift in the storage of and access to information, especially in this information age in which a large portion of the data being generated is semi-structured or unstructured. Existing relational database systems such as MySQL, Oracle, SQL Server and PostgreSQL cannot deal well with such data. NoSQL is a release from the constraints imposed on database management systems by the relational model (Brooks et al., 2014). A NoSQL database manages information by means other than the tabular relations of conventional databases. The relational model takes information and separates it into numerous interrelated tables that contain rows and columns, whereas a document-oriented NoSQL database, for instance, stores the data in documents using the JavaScript Object Notation (JSON) format. Another major distinction is that relational technologies have a rigid schema while NoSQL models are “schemaless”. Such databases have been in existence since the late 1960s but did not gain popularity under the name “NoSQL” until the early twenty-first century, triggered by the needs of Web 2.0 companies such as Google, Facebook and Amazon.com. NoSQL databases use four different kinds of data structures:

i) Key Value Stores: With key value stores, each item is stored as an attribute name (key) along with its value. The store is based on a hash table, which uses a unique key and a pointer to reference a particular item of data. Examples are Dynamo, Voldemort, Rhino DHT, Redis, etc.

ii) Document Database: In a document-oriented database, each key is paired with a complex data structure referred to as a document. Semi-structured documents can be XML or JSON formatted. In addition to the key, documents can be retrieved with queries. Examples are CouchDB, MongoDB, Lotus Notes, etc.

iii) Column Family/Wide Column Stores: Rather than storing data by rows, they store it by columns. They store and process large amounts of data distributed over many machines, and they are optimized for queries over very large datasets. Examples are BigTable, Cassandra, HBase, etc.

iv) Graph Stores: A graph database is based on graph theory. It is built from nodes, relationships between nodes (edges) and properties of nodes. Examples are Neo4J, FlockDB, GraphBase, InfoGrid, etc.

One of the motivations for NoSQL is simplicity of design; it accommodates horizontal scaling to clusters of computers, which is an issue for relational databases. Another motivation is finer control over availability, as well as flexibility of data structure compared with relational databases. With many NoSQL databases, there is a trade-off between consistency, availability, partition tolerance and speed, i.e. consistency is compromised in favour of speed, partition tolerance and availability. Barriers to greater adoption of NoSQL databases include the use of low-level query languages (e.g. the inability to perform ad hoc joins across tables), huge previous investments in existing relational databases and the lack of standardized interfaces. Other challenges are maturity, support for data analysis, support from users and maintenance.

3.2 Hadoop Platform

Hadoop is the predominant big data platform. It has three primary components: the Hadoop Distributed File System (HDFS), the MapReduce programming platform and the Hadoop ecosystem (Sitto and Presser, 2015) (see Figure 2). Hadoop came about as a result of efforts to fix the scalability problem associated with Nutch (a crawler and search engine that makes use of MapReduce and BigTable, developed by Google). Hadoop uses a master-slave architecture based on data partitioning across multiple nodes and parallel computation over large datasets. It manages scalability through the addition of computing nodes (Shvachko et al., 2010); (Olson, 2010).
Hadoop can process extremely large amount of data through partitioning with the help of numerous nodes (servers), each of the nodes solves small and different parts of the larger problem and then finally integrates them (Zikopoulos et al., 2013). Hadoop plays a double role; data organizer and analytics tool. Enterprises can now harness data


(unstructured, semi-structured and structured) that is usually difficult to manage and analyze.

Figure 2: Hadoop architecture (Coppa, 2014)

Hadoop works well for data-intensive processing by utilizing a move-code-to-data philosophy (Tsai et al., 2015). The client sends the usually small MapReduce programs to be executed. Each node in the cluster receives a small chunk of data to be processed, and the computation usually takes place on the same machine where that chunk of data resides.

Figure 3. The architecture of Hadoop Cluster (Mohammed et al., 2014)

3.3 MapReduce Programming Framework

One of the most commonly used programming frameworks implemented on top of the Hadoop Distributed File System is the MapReduce framework. It is a programming model that applies the functional programming paradigm, in which the programmer defines Map and Reduce tasks. Large-scale data computations can be performed efficiently and in a way that is tolerant of hardware failures. Although MapReduce does not perform well with online transactions, its key strengths lie in its high degree of parallelization, the simplicity of its programming framework and its ability to handle a large variety of applications (Tsai et al., 2015); (Press, 2013). MapReduce functions are usually written in Java, but they can also be coded in other languages such as Perl, C++, Ruby, Python, R, etc.

Consider the Word Count problem: given a large set of documents, the goal is to count the occurrences of each word found in the documents. Data chunks are distributed to the mappers for analysis. The map function processes the input pairs (key1, value1), returning intermediate pairs (key2, value2). The intermediate pairs are then grouped together according to their key. The reduce function then outputs new key-value pairs of the form (key3, value3). An example is illustrated in Figure 4.

Figure 4: MapReduce Algorithm for WordCount (Dean and Ghemawat, 2008)
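The map, group and reduce steps of the Word Count example can be sketched in plain Python as follows. This is a single-process illustration of the programming model only, not Hadoop code: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample documents are assumptions made for this sketch, and a real cluster would run the mappers and reducers in parallel on different nodes.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map: (key1, value1) = (document id, text) -> pairs (word, 1)."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the intermediate (key2, value2) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (key3, value3) = (word, total)."""
    return word, sum(counts)


def word_count(documents):
    """Run the three phases sequentially over {doc_id: text}."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_words(doc_id, text))
    grouped = shuffle(intermediate)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


# Hypothetical input: two small documents.
sample_docs = {"d1": "big data big", "d2": "data analytics"}
counts = word_count(sample_docs)
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1}
```

Because each call to `map_words` depends only on its own document and each call to `reduce_counts` depends only on one key, both phases can be spread across many machines, which is where the parallelism of the model comes from.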


3.4 Hadoop Ecosystem

The Hadoop ecosystem refers to technologies or frameworks that sit between HDFS and MapReduce in order to complement the Hadoop MapReduce paradigm. Some of these technologies are briefly discussed below.

i) Hive: The goal of Hive is to allow SQL access to data in the HDFS. Queries written in HQL are converted into MapReduce code by Hive and executed by Hadoop, though HQL is not full ANSI-standard SQL; some features are missing. For example, Hive does not support non-equality join conditions or update and delete statements. You may not need these features, but code generated by third-party solutions may not be Hive-compliant. Hive also allows user-defined functions.

ii) Zookeeper: Zookeeper is relevant when there is a need to distribute small amounts of data across many machines. It is an effective mechanism for running dependent jobs. Apache Zookeeper provides distributed configuration services within the Hadoop ecosystem.

iii) HBase: HBase is a scalable, column-oriented (rather than row-oriented) database management system that sits on top of HDFS. HBase is suitable for real-time random read/write access to very large datasets. It is not relational and uses a non-SQL approach. HBase finds application in webtables (Chaudhuri et al., 2011).

iv) Jaql: Jaql processes large datasets through a functional, declarative query language. It facilitates parallel processing by converting high-level queries into low-level MapReduce tasks.

v) Mahout: Mahout is a library of scalable machine learning algorithms that run on Hadoop. While much analysis can be done in MapReduce or Pig, some machine learning algorithms are distributed as part of the Mahout project. Examples are classification, recommendation and clustering.

vi) Oozie: Oozie is Hadoop’s workflow scheduler for running workflows of dependent jobs. It has a workflow engine and a coordinator engine (Frampton, 2014). Oozie can process and manage thousands of workflows in a Hadoop cluster in a timely and efficient manner. Oozie employs a Directed Acyclic Graph (DAG) for executing workflow tasks (Chaudhuri et al., 2011).

vii) Blur: Blur is a tool for indexing and searching text with Hadoop. Because it has Lucene (a very popular text-indexing framework) at its core, it has many useful features, including fuzzy matching, wildcard searches and paged results. It allows you to search through unstructured data in a way that would otherwise be very difficult.

viii) BigTop: BigTop is used for packaging and testing the Hadoop ecosystem. Although it is not a cluster manager, it simplifies installation, integration and smoke testing of the Hadoop tool kit, providing an integrated tool stack. This results in a well-tested, high-quality, stack-based Hadoop product set.

3.5 Apache Spark

Apache Spark is an open-source cluster-computing framework that uses in-memory computation to provide better and faster performance than Hadoop’s disk-based MapReduce paradigm. The differences between Apache Spark and Hadoop are summarized in Table I. Over time, MapReduce has been the central paradigm for batch jobs. Apache Spark, however, is suitable for batch, iterative and streaming jobs (Apache Spark, 2014); (Dawar, 2015). Spark is relevant for machine learning because it allows repeated queries to be applied to data loaded into the cluster’s memory. The Spark framework requires a cluster manager and a distributed storage system. Apache Spark is a competitive solution with additional benefits compared to MapReduce in addressing real-time data analysis (Sharma, 2016). Apache Spark has a user-friendly programming interface, and as a result coding effort is reduced. It not only provides an alternative to MapReduce but also offers SQL support (Shark) and a machine learning library called MLlib (Apache Spark Documentation, 2014).
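The benefit of keeping data in memory for iterative jobs can be illustrated with a toy, pure-Python sketch. It does not use the Spark API: `DiskDataset`, `iterate_from_disk` and `iterate_in_memory` are hypothetical names introduced here, and the read counter merely stands in for the cost of re-reading input from disk on every pass, as a chain of disk-based MapReduce jobs would do.

```python
class DiskDataset:
    """Hypothetical stand-in for a dataset on disk; counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        # Each call simulates one complete scan of the stored data.
        self.reads += 1
        return list(self._records)


def iterate_from_disk(dataset, iterations):
    """Disk-based style: every iteration re-reads the input from storage."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.load()
        estimate = 0.5 * estimate + 0.5 * (sum(data) / len(data))
    return estimate


def iterate_in_memory(dataset, iterations):
    """In-memory style: read once, then reuse the cached copy."""
    cached = dataset.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = 0.5 * estimate + 0.5 * (sum(cached) / len(cached))
    return estimate


on_disk = DiskDataset([1, 2, 3, 4])
in_memory = DiskDataset([1, 2, 3, 4])
result_disk = iterate_from_disk(on_disk, iterations=5)
result_memory = iterate_in_memory(in_memory, iterations=5)
```

Both functions compute the same estimate, but the first performs one full read per iteration while the second performs a single read, which is the intuition behind caching a dataset in cluster memory for iterative machine learning workloads.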


Table I: Comparison between Spark and Hadoop (Nair and Shetty, 2015)

Spark | Hadoop
Second generation big data analytics engine with additional features | First generation big data analytics engine with much expertise available
Offers functions beyond MapReduce; programs can be written in Python, Java or Scala through a user-friendly interface, which makes programming easier | Relies on Map and Reduce functions, which makes programming difficult
Up to 100 times faster when running programs in memory and 10 times faster on disk | Slower, as intermediate data or results are stored on hard disk
Supports batch processing and also includes a Machine Learning Library for machine learning, Spark Streaming for streaming data, Spark SQL for querying and GraphX for graph processing, providing an all-in-one solution | Supports mainly batch processing; requires other compatible platforms for streaming, querying and machine learning
Compatible with HDFS | Compatible with HDFS
Higher memory requirement; performance degrades if the data cannot be accommodated in memory | Lesser memory requirement

Spark uses a master/worker architecture. A driver talks to a single coordinator, called the master, that manages the workers in which executors run. The driver and executors run in their own Java processes. They can all run on the same machine (horizontal cluster), on separate machines (vertical cluster) or in a mixed machine configuration (Laskowski, 2016) (see Figure 5). A Spark driver is the process that creates and owns an instance of SparkContext. A master is a running Spark instance that connects to a cluster manager for resources. The master acquires cluster nodes to run executors. Workers are running Spark instances where executors live to execute tasks. Executors are distributed agents responsible for executing tasks. They provide in-memory storage for Resilient Distributed Datasets (RDDs) that are cached in Spark applications.

Figure 5: Apache Spark Architecture (Maniyam, 2015)

4.0 BIG DATA ANALYTICS TRENDS

Big data technology also refers to Big Table, Dynamo, Cassandra, Hadoop, Google File System, HBase, MapReduce, Mashup and stream processing (Maniyam, 2015). Big data started with a small shift from traditional analytics to include batch-processing computations. It gradually moved from this stage, with the MapReduce paradigm, to a higher level involving stream processing with the Apache Spark platform. The trend continues towards near real-time processing and is currently progressing to real-time analytics. There are three noteworthy aspects to consider here:

4.1 Data Analysis Model Architecture

As a result of difficulties such as scalability with structured data techniques, the MapReduce concept was introduced in 2004 to cope with unstructured and semi-structured data. However, as noted earlier, MapReduce is not suitable for online transactions; nor is it suitable for iterative and graph computation. Other technologies that sit between the HDFS and MapReduce, commonly referred to as the Hadoop ecosystem, serve purposes for which MapReduce may not be suitable. Examples of such technologies are HBase, Jaql, Mahout, etc. More recently, Apache Spark emerged as a successor to the MapReduce paradigm in Hadoop; it covers computations beyond batch processing, such as iterative, streaming and online transaction analysis.


4.2 Visualization

Visualization is considered the area with the strongest growth potential among the different options for big data analytics (Sharma, 2016). One of the best ways to convey a message is to use visualization: it draws the attention of the audience to important messages, and visual presentation can uncover surprising patterns and observations that are not apparent from statistics alone. Visualization is not only for presenting results but can be used at all stages of data analytics. What suffices for big data analytics is referred to as Advanced Data Visualization (ADV). As opposed to standard visualizations such as pie, line and bar charts, advanced data visualization technologies can scale to represent thousands or millions of data points. They can accommodate diverse data types and complex structures that cannot easily be represented on the computer screen (e.g. hierarchies and neural nets). The majority of advanced data visualization tools can interact with data sources directly, which enables analysts to choose the right data set in real time. There are four major techniques in parallel visualization: task parallelism, data streaming, pipeline parallelism and data parallelism. Advanced data visualization tools are numerous; examples are Chart.js, Tableau, Raw, Dygraphs, Google Charts, Crossfilter, Tangle, Polymaps, OpenLayers, Kartograph, CartoDB and Gephi, to mention a few.

4.3 Testing Benchmark

The diversity of big data poses a challenge when it comes to developing big data benchmarks that are suitable for all workloads. One cannot rely on a single big data benchmark, because it has been observed that using it on different data sets does not give the same result. This implies that benchmark testing should be application-specific. Consequently, in evaluating a big data system, identifying the workload for an application domain is a prerequisite (Apache Spark Documentation, 2014). Most existing big data benchmarks are designed to evaluate specific types of systems or architectures. For instance, HiBench, GridMix and PigMix target MapReduce/Hadoop systems; BigBench targets Teradata Aster DBMS, MapReduce systems, Redshift database, Hive, Spark and Impala; and LinkBench targets MySQL databases. Presently, BigDataBench seems to be the only big data benchmark that can evaluate a hybrid of different big data systems.

5.0 APPLICATION AREAS OF BIG DATA ANALYTICS

Our world is changing for the better as a result of leveraging the power of big data analytics. The last decade recorded much advancement in the amount of data generated as well as in the technology employed in analyzing and understanding it. For individual firms to gain competitive advantage, they must leverage data-driven strategies to compete, innovate and capture value from real-time information (IDC, 2013). Big data analytics is relevant in every field of human endeavor, and for anyone who wants to gain a competitive advantage, big data is key. Almost all sectors, such as computer and electronic products, government and insurance, will benefit from big data (Taft, 2013). Below are some of the application areas:

i) Smarter Healthcare: Apart from yielding better profits and cutting overhead lost to waste, big data is relevant in predicting epidemics, curing diseases, improving quality of life and preventing avoidable deaths. With wireless sensor networks, patients’ healthcare records can be tracked and uploaded. A stream of wearable devices (Fitbit, Samsung Gear Fit, Jawbone, etc.) has also emerged, through which people can monitor their heart rate, the number of calories burnt through exercise, quality of sleep, etc. In addition, pharmacovigilance, which involves the detection, examination, understanding and prevention of adverse effects of drugs, is now possible by using big data analytics to harness and analyze information from various sources (social media feeds, published literature, real-world data and health agency databases) that report cases of adverse drug effects.

ii) Homeland Security: This refers to the concerted effort to ensure that a homeland is safe, secure and able to resist every form of hazard and terrorism, for instance a cyber-attack on one of the major lifelines such as water, power, financial services and communication.
Big data analytics can be used to recognize potential threats by monitoring communication (email, mobile and social media), financial transactions, locations and travel itineraries of users with suspicious activity patterns.

iii) Traffic Control: Information from satellites linked with sensors positioned along roads can be combined with traffic data from both moving vehicles and pedestrians to analyse road conditions, plan routes more accurately and estimate the time to reach a destination, over a wider range than is possible with the Global Positioning System.

iv) Manufacturing: With the help of big data, manufacturers are able to monitor product quality and delivery accuracy in order to improve customer service. They achieve this with telemetry data, which consists of data generated by machines during the manufacturing process. Once the product is in the hands of customers, its performance is monitored, and the results can be used to determine whether a customer will have a problem before it actually happens.

v) Education: Big data offers opportunities in the field of education: more detailed information about schools can be generated, progress can be tracked and analyzed, and collaborative libraries are enabled. These lead to better education and better-informed stakeholders and, thereafter, a better-informed outside world.

vi) Multi-channel Sales: This refers to the use of multiple mediums for buying, marketing and selling. For multi-channel markets to be successful, big data analytics must be employed. Information collected from customers provides a 360-degree view of the customers, which leads to overall improvement of the multi-channel strategy. In addition, customers do not need to go to a specific website to find a product; they can buy and make transactions from any website where they find a product of interest. Companies can take their products to any environment where there are potential buyers.

vii) Telecommunication Industries: Telecommunication industries are sitting on a gold mine because they have plenty of data about their customers.
Efficient and effective analysis of structured, semi-structured and unstructured data is required to get deeper insight into their customers’ behaviour, preferences, usage patterns, locations and travel patterns in real time. This is where big data comes in. With big data analytics, marketing can be targeted at customers almost on an individual basis, which significantly reduces churn rate and improves the customers’ perception of the company. Moreover, companies have a complete view of the number of targeted customers, how many were reached and how many responded to the company’s adverts. They can also calculate Return on Investment very quickly.

viii) Search Quality: Finding products easily on websites drives a great user experience, customer loyalty and increased profit. With big data analytics, search results, catalogues and recommendations can be tailored to customers’ needs, paving the way to a better user experience and eventually better conversion.

ix) Trading Analytics: With big data analytics, structure and trends in the market can be analyzed. Trades can be executed accurately, at the best possible prices and at high speed. For instance, United Parcel Service can track an average of 39.5 million requests per day.

x) Finance: The continual adoption of big data in finance will eventually transform the landscape of financial services from being faster to being smarter, leading to better investments with consistent returns.

6.0 RESEARCH ISSUES IN BIG DATA ANALYTICS

Big data analytics is a rapidly expanding research area and has become ubiquitous in solving complex problems in virtually all fields of human endeavor, such as applied mathematics, engineering, computational biology, medicine, healthcare, business, finance, social networks, education, telecommunication, transportation, etc. The Internet and mobile computing have led to the generation of huge volumes of data in enterprises, companies and governments. This information overload has greatly complicated efficient and effective decision-making. In addition, the heterogeneous nature of the sources and the size of the data pose a difficult challenge for analysis. As a result, many research issues in big data analytics have arisen, which include but are not limited to the following (Al-Jarrah et al., 2015; Pääkkönen and Pakkala, 2015; Kolomvatsos et al., 2015; Bhat et al., 2015; Colombo and Ferrari, 2015; Assunção et al., 2015):

i) Algorithms for Big Data: Efficient algorithms are paramount in big data analytics. The challenge is obtaining scalable, innovative algorithmic solutions for ever-growing data sizes. There is a need for models for sketching and streaming, external memory and cache-obliviousness, dimensionality reduction, etc.

ii) Visualization Analytics for Big Data: Standard charts and graphs are no longer sufficient for today’s data visualization tools. There is a need to design data visualization tools that have interactive capabilities (giving users the opportunity to manipulate or drill into data sources for querying and analysis) and indicators that alert users when changes have been made to data.

iii) Link and Graph Mining: Links generally refer to relationships among data instances. They are used to exhibit the importance, rank or category of objects. A key challenge is mining richly structured datasets, which is usually difficult with traditional statistical models. Recently, there has been a surge of interest in mining security and law enforcement data, epidemiological records, social networks and bibliographic citations.

iv) Data Acquisition, Integration, Cleaning and Scalability: Managing ever-growing data in terms of storage, integration and transformation is a challenge. Scalability with respect to the volume of data, coupled with the velocity at which it is generated, is another challenge.

v) Data and Information Quality for Big Data: According to recent studies, poor-quality data is prevalent on the web and in large databases.
Since the outcome of data analysis depends on the data being used, veracity, one of the features of big data, is increasingly being recognized. At present, there are no research quality standards or quality assessment methods for big data.
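As a rough illustration of what a quality assessment for item (v) might measure, the sketch below computes two commonly used data-quality indicators, completeness and duplicate rate, over a small list of records. It is a minimal sketch and not a proposed standard: the function names `completeness` and `duplicate_rate`, the dictionary-based record layout and the choice of indicators are assumptions made here for illustration only.

```python
from collections import Counter
from typing import Any, Dict, List, Optional

# Hypothetical record layout: one dictionary per record, with None or ""
# standing for a missing value.
Record = Dict[str, Optional[Any]]


def completeness(records: List[Record], fields: List[str]) -> float:
    """Fraction of expected field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0  # nothing was expected, so nothing is missing
    filled = sum(
        1 for r in records for f in fields if r.get(f) not in (None, "")
    )
    return filled / total


def duplicate_rate(records: List[Record], key_fields: List[str]) -> float:
    """Fraction of records whose key repeats the key of another record."""
    if not records:
        return 0.0
    # Key values are assumed to be hashable (strings, numbers, None).
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(records)
```

In a big data setting such indicators would normally be computed in a distributed fashion over partitions of the data; the single-machine version here only shows the arithmetic behind each indicator.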

vi) High Performance/Parallel Computing Platforms for Big Data: The challenge here is designing high-performance and parallel computing platforms whose system configuration supports both batch and stream processing.

vii) Autonomic Computing Design and Deployment: Autonomic computing refers to the self-managing characteristics of distributed computing resources (without human intervention) and the ability to adapt to unpredictable changes while hiding intrinsic complexities from users. Autonomic computing includes self-configuration, self-healing, self-optimization and self-protection.

viii) New Programming Models for Big Data beyond Hadoop: Hadoop no longer suffices for enterprises that need better and faster ways to extract business value from large datasets. While many organizations are still embarking on Hadoop, Google, whose MapReduce and Google File System designs inspired Hadoop, has moved to newer technologies such as Cloud Dataflow, etc.

ix) High Performance Cryptography: Malicious hacking and high-profile data breaches continuously put organizations at risk of significant business disruption. The challenge is how to harness the value of large amounts of data while mitigating the risk of exposure and compromise. Existing solutions such as fragmented cryptographic key management are not sufficient. To ensure end-to-end security of private, sensitive data, data has to be encrypted based on access control policies. More efficient and scalable Attribute-Based Encryption (ABE) has to be implemented.

x) Trust Management in IoT: The Internet of Things (IoT) refers to the seamless integration of physical objects into information networks to provide human beings with advanced and intelligent services. Trust management in IoT ensures qualified services, reliable data fusion and mining, and enhanced user privacy and information security. Research issues include the design of distributed and scalable trust management protocols for the Internet of Things that make use of trust properties (such as cooperativeness, honesty and community interest) and consider social relationships when evaluating trust.
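To make the idea in item (x) more concrete, the sketch below shows one possible way a node could keep per-property trust scores for a peer and weight a recommendation by social closeness. It is only an illustrative sketch under stated assumptions and is not a protocol from the cited literature: the class `TrustRecord`, the function `combine`, the neutral starting score of 0.5, the smoothing factor `alpha` and the 50% cap on recommendations are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Dict

# Trust properties named in the text; the identifiers are illustrative.
PROPERTIES = ("cooperativeness", "honesty", "community_interest")


@dataclass
class TrustRecord:
    """Trust one node holds about a peer: one score in [0, 1] per property."""

    scores: Dict[str, float] = field(
        default_factory=lambda: {p: 0.5 for p in PROPERTIES}  # assumed neutral prior
    )

    def observe(self, prop: str, outcome: float, alpha: float = 0.2) -> None:
        """Blend a direct observation into the score (exponential moving average)."""
        if prop not in self.scores:
            raise KeyError(f"unknown trust property: {prop}")
        if not 0.0 <= outcome <= 1.0:
            raise ValueError("outcome must lie in [0, 1]")
        self.scores[prop] = (1 - alpha) * self.scores[prop] + alpha * outcome

    def overall(self, weights: Dict[str, float]) -> float:
        """Weighted average of the property scores."""
        total = sum(weights[p] for p in PROPERTIES)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return sum(weights[p] * self.scores[p] for p in PROPERTIES) / total


def combine(direct: float, recommended: float, social_closeness: float) -> float:
    """Mix direct trust with a recommendation, weighted by social closeness in [0, 1]."""
    # Assumed policy: a recommendation never counts for more than half the result.
    w = 0.5 * max(0.0, min(1.0, social_closeness))
    return (1 - w) * direct + w * recommended
```

Each node stores only its own records and updates them locally, which is one simple way to keep such a scheme distributed; how recommendations are exchanged between nodes is left open here.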

xi) Privacy Preserving Big Data Analytics: Big data can become a troubling manifestation of Big Brother by enabling invasive marketing, invasions of privacy and decreased civil freedoms. User data are constantly mined by inside analysts and potential outside business partners and, as a result, can be abused by them. There is a need for robust and scalable privacy-preserving algorithms that will increase user safety.

xii) Stream Computing for Big Data: Time is of the essence for time-sensitive processes such as mitigating security threats, thwarting fraud or responding to natural disasters. There is a need for scalable architectures or platforms that enable continuous processing of data streams, which can be used to maximize the timeliness of data.

7.0 CONCLUSION, CONTRIBUTIONS AND FUTURE WORK

This review has given a concise exposure to the inherent trends and technologies in big data analytics. A careful look at the foregoing reveals the potential of data and its technologies for insight generation. The review gave a general overview of big data as it relates to the trends and technologies employed, covering NoSQL, Hadoop, MapReduce, the Hadoop ecosystem, Apache Spark and the analytics algorithm issues in big data research. As a concise expert tutorial for both newcomers to the big data world and existing practitioners, this work has provided a structured understanding of the technology concepts of big data and their applications in both research and industry. It has shown the next directions in computational analytics and provided the background knowledge from which the trend moves. In the future, we hope to cover other emerging technologies for big data analytics such as Flink, Samza and Storm and compare them with existing technologies. In practice, we will explore programming tools such as Python and R/RStudio, among others, for implementing solutions in the big data era. In addition, it is clear that data volume will continue to grow at an alarming velocity due to the presence of the internet. The technology currently in place can accommodate terabyte- to petabyte-scale data; however, there has been no revolutionary innovation to accommodate exabyte-scale datasets (Tsai et al., 2015), and the literature predicts that the digital universe will reach 44 zettabytes by 2020 (Chaudhuri et al., 2011). Therefore, research efforts should be geared towards developing frameworks and algorithms that address scalability and parallelization issues to cope with the ever-increasing size of data.

REFERENCES

Aalst, W. V. (2012). Process Mining: Overview and Opportunities. ACM Trans. Manag. Infor. Syst. 3(2): 1-17.

Aggarwal, C.C. (2011). An Introduction to Social Network Data Analytics. In Aggarwal, C.C. (ed.) Social Network Analytics. United States: Springer.

Al-Jarrah, O. Y., Yoo, P. D., Muhaidat, S., Karagiannidis, G. K., & Taha, K. (2015). Efficient machine learning for big data: A review. Big Data Research, 2(3): 87-93.

Apache Spark documentation. (2014). Available: https://spark.apache.org/documentation.html

Apache Spark. (2014). Apache Spark - Lightning-Fast Cluster Computing. Available: http://spark.apache.org

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79: 3-15.

Bhat, T. P., Karthik, C., & Chandrasekaran, K. (2015). A Privacy Preserved Data Mining Approach Based on k-Partite Graph Theory. Procedia Computer Science, 54: 422-430.

Brooks, C. (2014). Enterprise NoSQL for Dummies. New Jersey: John Wiley & Sons, Inc.

Chaudhuri, S., Dayal, U., & Narasayya, V. (2011). An Overview of Business Intelligence Technology. Communications of the ACM, 54(8): 88-98.

Cisco Visual Networking Index. (2016). Global Mobile Data Traffic Forecast Update, 2015-2020. Available: www.cisci.com/c/en/us/

Colombo, P., & Ferrari, E. (2015). Privacy aware access control for big data: a research roadmap. Big Data Research, 2(4): 145-154.

Coppa, E. (2014). Hadoop Architecture Overview. Available: http://ercoppa.github.io/HadoopInternals/HadoopArchitectureOverview.html

Dawar, A. (2015). Apache Spark vs. MapReduce – Whiteboard Walkthrough.

Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1): 107-113.

Duggal, R., Khatri, S. K., & Shukla, B. (2015, September). Improving patient matching: single patient view for Clinical Decision Support using Big Data analytics. 2015 4th IEEE International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), pp. 1-6.

Duggal, R., Shukla, B., & Khatri, S. K. (2015). Big Data Analytics in Indian healthcare system— opportunities and challenges. In National Conference on Computing, Communication and Information Processing, pp.92-104.

Frampton, M. (2014). Big Data made easy: A working guide to the complete Hadoop toolset. Apress.

Grobelnik, M. (2012). Big Data Tutorial. Available: http://videolectures.net/eswc2012_grobelnik_big_data/

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47: 98-115.

Hu, H., Wen, Y., Chua, T. S., & Li, X. (2014). Toward scalable systems for big data analytics: a technology tutorial. IEEE access, 2: 652-687.

IBM. (2013). Big Data at the Speed of Business. Big Data & Analytics Hub. Available: http://www-01.ibm.com/software/data/bigdata/

International Data Corporation (2013). Big Data and Analytics – An IDC Four Pillar Research Area. IDC Tech. Rep. Available: http://www.idc.com/prodsery/FourPillars/bigData/index.jsp

Kolomvatsos, K., Anagnostopoulos, C., & Hadjiefthymiades, S. (2015). An efficient time optimized scheme for progressive analytics in big data. Big Data Research, 2(4): 155-165.

Kudyba, S. (2014). Big Data, Mining, and Analytics: Components of Strategic Decision Making. Boca Raton: CRC Press, Taylor & Francis Group, 2014.

Laskowski, J. (2016). Mastering Apache Spark. GitBook. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details

Li, H., and Lu, X. (2014). Challenges and Trends of Big Data Analytics. 2014 Ninth IEEE International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 566-567.

Lovalekar, S. (2014). Big data: an emerging trend in future. International Journal of Computer Science and Information Technologies (IJCSIT), 5(1): 538-54.

Ma, Z., Yang, Y., Cai, Y., Sebe, N., & Hauptmann, A. G. (2012). Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In Proceedings of the 20th ACM international conference on Multimedia, pp. 469-478.

Maniu, S., & Cautis, B. (2012, May). Taagle: Efficient, personalized search in collaborative tagging networks. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 661-664.

Maniyam, S. (2015). Apache Spark: fast and easy data processing. Elephant Scale LLC, SNIA Analytics and Big Data Summit. Available: http://elephantscale.com

Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends. BioData Mining, 7(1): 22.

Nair, L. R., & Shetty, S. D. (2015). Streaming twitter data analysis using spark for effective job search. Journal of Theoretical and Applied Information Technology, 80(2): 349-353.

Olson, M. (2010). Hadoop: scalable, flexible data storage and analysis. IQT Quart, 1(3): 14-18.

Pääkkönen, P., & Pakkala, D. (2015). Reference architecture and classification of technologies, products and services for big data systems. Big Data Research, 2(4): 166-186.

Press, G. (2013). $16.1 billion big data market: 2014 predictions from IDC and IIA. Forbes.com, 12. Available: http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia

Rabbath, M., Sandhaus, P., & Boll, S. (2011, April). Multimedia retrieval in social networks for photo book creation. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, p. 72.

Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health information science and systems, 2(1):3. Available: http://www.hissjournal.com/content/2/1/3

Russom, P. (2011). Big data analytics. TDWI Best Practices Report, Fourth Quarter. Available: http://tdwi.org/portals/big-data-analytics.aspx

Sharma, S. (2016). Expanded Cloud Plumes Hiding Big Data Ecosystem. Future Generation Computer System, 59: 63-92.

Shridhar, S., Lakhanpuria, M., Charak, A., and Gupta, A. (2012). A Framework for Personalized Recommendations Based on Social Network Analysis. In Pro. 5th Int. Workshop Location-Based Soc. Netw., pp.55-61, 2012.

Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). The Hadoop Distributed File System. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp.1-10.

Sitto, K., & Presser, M. (2015). Field Guide to Hadoop: An Introduction to Hadoop, Its Ecosystem, and Aligned Technologies. O’Reilly Media, Inc.

Taft, D. K. (2013). Big data market to reach $46.34 billion by 2018. EWEEK, Tech. Rep. Available: http://www.ewwek.com/database/big-data-market-to-each-46.34-billion-by-2018.html

Team, O. R. (2011). Big Data Now: Current Perspectives, Sebastopol, CA, USA: O'Reilly Media.

Tsai, C. W., Lai, C. F., Chao, H. C., & Vasilakos, A. V. (2015). Big data analytics: a survey. Journal of Big Data, 2(1): 21.

Verykios, V. S., Bertino, E., Fovino, I. N., Provenza, L.P., Saygin, Y. & Theodoridis, Y. (2004). State-of-the-art in Privacy Preserving Data Mining. ACM Sigmod Record, 33(1): 50-57.

White, T. (2015). Hadoop: The Definitive Guide. (4th ed.) Sebastopol, CA, USA: O’Reilly Media.

Xu, B., Bu, J., Chen, C., & Cai, D. (2012, April). An exploration of improving collaborative recommender systems via user-item subgroups. In Proceedings of the 21st ACM International Conference on World Wide Web, pp. 21-30.

Zhang, H., Dai, H., Zhang, Z., & Huang, Y. (2016). Mobile conductance in sparse networks and mobility-connectivity tradeoff. IEEE Transactions on Wireless Communications, 15(4): 2954-2965.

Zhang, H., Zhang, Z., & Dai, H. (2013). Gossip-based information spreading in mobile networks. IEEE Transactions on Wireless Communications, 12(11): 5918-5928.

Zikopoulos, P., Parasuraman, K., Deutsch, T., Giles, J., & Corrigan, D. (2013). Harness the power of big data. The IBM big data platform. McGraw Hill Professional.

Zwolenski, M., & Weatherill, L. (2014). The digital universe: rich data and the increasing value of the internet of things. Australian Journal of Telecommunications and the Digital Economy, 2(3): 47.
