Big Data Technology and Applications

Student’s Name

Course Name

Institution Name

Date

Contents
1. INTRODUCTION
Background
Objectives
2. LITERATURE REVIEW
2.1 Big Data
2.2 Apache Hadoop
2.2.1 Evolution of Hadoop
2.2.2 Core Components of Hadoop
2.2.2.1 HDFS Architecture
2.2.2.2 MapReduce Framework
2.2.2.3 Daemon Process in Hadoop
2.2.3 Hadoop-1 vs Hadoop-2.x
Hadoop Eco-system
3. BIG DATA TRENDS
3.1 Big Data Eco-system
3.2 Hadoop Distributions
3.3 Hadoop in the Cloud and Virtualized Environment
3.4 Hadoop as a Big Data Operating System
3.5 Big Data Security and Privacy Issues
4. BIG DATA APPLICATIONS
4.1 Politics
4.2 National Security
4.3 Health Care and Medicine
4.4 Science and Research
4.5 Social Media Analysis
5. Conclusion
6. Recommendations
6.1 Examine the types of Business Challenges
6.2 Develop a Strategic Plan
6.3 Treat Big Data as a Unique Subject
6.4 Think Long-term
7. References

Running head: BIG DATA TECHNOLOGIES AND APPLICATIONS

Big Data Technology and Applications

1. INTRODUCTION

In contemporary information technology, Big Data is one of the most hyped phrases, and researchers and academics use the term with increasing frequency. Big Data can be described as very large amounts of imprecise data, in numerous formats, generated rapidly from numerous sources (Buhl et al., 2013).

This paper aims to give a clear and comprehensive understanding of Big Data. It introduces the Hadoop project and its technology, and then discusses and analyzes current trends and applications in Big Data.

Background

The field of Big Data extends beyond the data itself. It includes an emerging stream of tools, associated technologies, and real-world applications. Big Data cannot be stored and processed by a single machine. Big Data and its real-world applications deserve study because Big Data is critical to organizational decision making (EMC Education Services, 2015). It also has a strong influence on marketing trends and consumer behavior.

Humans generate data in the form of emails, documents, and images, for example. Machines generate data in the form of sensor data and log data, such as email logs and web logs. Machine-generated data is larger in size than human-generated data, hence the term Big Data. The description of Big Data is not limited to the size of the data; it also covers the speed at which the data is generated, its variety, and its veracity.

Objectives

1. To provide a simple yet comprehensive introduction to Big Data

2. To provide an overview of Hadoop and its sub-projects

3. To discuss recent trends and prominent applications in Big Data

2. LITERATURE REVIEW

2.1 Big Data

Big data describes vast volumes of unstructured, semi-structured, and structured data that can be mined for information and then used in machine learning projects or other advanced analytics applications. It is commonly characterized by the 3Vs: large data volume, a wide variety of data, and the velocity at which the data is processed. Other Vs are sometimes added to the description, including variability, value, and integrity (Chen & Zhang, 2014). The data often comes from various sources such as databases, transaction systems, medical records, and others. It can be left in its raw form or processed with data mining tools or data preparation software.

2.2 Apache Hadoop

Apache Hadoop is an open-source framework for the distributed storage and processing of big data sets across computer clusters. Its components include HDFS (Hadoop Distributed File System), which serves as the storage layer: it divides files into blocks and distributes those blocks across the cluster nodes. MapReduce is used for parallel processing (Adluru, Datla, & Zhang, 2015). YARN schedules jobs and manages cluster resources. Also included are the common libraries used by the other Hadoop modules. The framework is often used together with NoSQL databases and Apache Spark to provide data management and storage for Spark-powered data pipelines.
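The way HDFS divides a file and spreads the pieces across nodes can be illustrated with a minimal Python sketch. It is not HDFS code: the function names, node names, the tiny block size, and the round-robin placement are all assumptions made for illustration, and real HDFS placement also takes rack topology into account.

```python
from itertools import cycle, islice

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin stands in for the real, rack-aware placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        # Take `replication` consecutive nodes, wrapping around the node list.
        placement[block_id] = list(islice(cycle(nodes), start, start + replication))
    return placement

# Hypothetical 10-byte "file", 4-byte blocks, four nodes, three copies per block.
file_bytes = b"x" * 10
blocks = split_into_blocks(file_bytes, block_size=4)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3)
for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, losing one node does not lose any part of the file, which is the property the later sections on fault tolerance rely on.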

2.2.1 Evolution of Hadoop

Hadoop was introduced over a decade ago. It grew out of the Nutch project and Google's MapReduce white paper. Since then it has gradually changed from a Silicon Valley tool into an essential data storage tool. Its first massive production-scale deployment was at Yahoo. Today the framework touches most aspects of big data and the analytics environment (Adluru, Datla, & Zhang, 2015). It suits a multitude of use cases, ranging from large-scale ETL to massive analytics. Hadoop 2 has since been succeeded by Hadoop 3.0, and many systems have adopted the new release for its improved security and performance.

2.2.2 Core Components of Hadoop

Hadoop contains a MapReduce engine that processes the data stored in the Hadoop Distributed File System (HDFS). These two parts make up the core Hadoop components. HDFS is a scalable file system that provides high-throughput access to application data and runs on commodity hardware. Its data blocks are stored on top of each node's native file system (Adluru, Datla, & Zhang, 2015). The HDFS master is called the NameNode and its slave is the DataNode. The NameNode runs on the master node of the Hadoop cluster and holds the metadata of the files stored in the file system. A DataNode runs on each slave node of the cluster and periodically sends the NameNode a report of the blocks it holds.

2.2.2.1 HDFS Architecture

The HDFS architecture combines a master process, the NameNode, with slave processes, the DataNodes. The NameNode maintains two kinds of data: the EditLog records the changes made to the file system metadata, while the FsImage stores the entire namespace, the file system properties, and the mapping of files to blocks (Chen & Zhang, 2014). Together they serve as the central data structures of the system. A DataNode serves client read and write requests. It also creates, replicates, and deletes blocks as instructed by the NameNode. The properties of the system include robustness, data integrity, and cluster rebalancing, among others.
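As a rough illustration of how reports from the DataNodes let the NameNode decide when to order replication, the following Python sketch finds blocks that have too few copies. The function name, the report format, and the block and node names are hypothetical, and a block that no DataNode reports at all would go unnoticed in this simplified version.

```python
from collections import defaultdict

def under_replicated(block_reports, target_replication):
    """Return {block_id: missing_copies} for blocks with too few live replicas.

    block_reports maps a DataNode name to the set of block ids it reported.
    """
    holders = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            holders[block_id].add(datanode)
    return {
        block_id: target_replication - len(nodes)
        for block_id, nodes in sorted(holders.items())
        if len(nodes) < target_replication
    }

# Hypothetical reports from three DataNodes.
reports = {
    "datanode-1": {"blk_1", "blk_2"},
    "datanode-2": {"blk_1", "blk_3"},
    "datanode-3": {"blk_1", "blk_2"},
}
todo = under_replicated(reports, target_replication=3)
print(todo)
```

In this example blk_1 already has three copies, so only blk_2 and blk_3 would be scheduled for additional replicas.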

2.2.2.2 MapReduce Framework

MapReduce allows one to write applications that process large data sets on large clusters of commodity hardware in a reliable, fault-tolerant manner. The framework is responsible for scheduling tasks, monitoring them, and re-executing failed ones. The compute nodes and the HDFS storage nodes are typically the same machines, so tasks can be scheduled on the nodes where the data already resides, which improves performance (Chen & Zhang, 2014). In most cases, storing data in HDFS costs less, tolerates faults, and scales easily. The framework consists of a single master JobTracker daemon per cluster and one slave TaskTracker daemon per cluster node.
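The programming model itself can be illustrated with a small word-count sketch in plain Python. It runs in a single process and does not use the Hadoop API; the function names and the sample input are assumptions chosen for illustration, and the scheduling, monitoring, and re-execution that the framework provides are left out.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def run_job(splits):
    # On a cluster each split would be mapped on the node that stores it;
    # here the splits are processed one after another in a single process.
    emitted = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

splits = ["big data big clusters", "data nodes store data"]
counts = run_job(splits)
print(counts)
```

The map and reduce functions hold no shared state, which is what allows a framework to run many copies of them in parallel and to re-run any one of them after a failure.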

2.2.2.3 Daemon Process in Hadoop

A daemon is a process that runs in the background. Hadoop has five such processes: NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker. Each daemon runs separately in its own JVM. As discussed above, the NameNode is the master node responsible for storing the metadata of all directories and files (Adluru, Datla, & Zhang, 2015). The DataNode is the slave node that holds the actual data, and it periodically reports the blocks it holds. The Secondary NameNode periodically merges the edit log into the NameNode's file system image so that the size of the log stays bounded. The processes these daemons take part in include writing files to the cluster, reading them back, the fault tolerance strategy, and the replication plan.
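The periodic merge performed by the Secondary NameNode can be sketched as follows, using the EditLog and FsImage introduced in Section 2.2.2.1. The class, its methods, and the file paths are hypothetical and exist only to show the idea of folding a change log into a snapshot; the real on-disk formats and the transfer between the two daemons are not modelled.

```python
class ToyNameNode:
    """Holds a namespace image plus a log of changes made since the image was written."""

    def __init__(self):
        self.fs_image = {}   # path -> list of block ids (last checkpointed state)
        self.edit_log = []   # ordered (operation, path, block ids) records

    def create(self, path, block_ids):
        self.edit_log.append(("create", path, list(block_ids)))

    def delete(self, path):
        self.edit_log.append(("delete", path, None))

    def current_namespace(self):
        """Replay the edit log on top of a copy of the image."""
        namespace = dict(self.fs_image)
        for op, path, block_ids in self.edit_log:
            if op == "create":
                namespace[path] = block_ids
            elif op == "delete":
                namespace.pop(path, None)
        return namespace

def checkpoint(namenode):
    """Fold the edit log into the image and empty the log."""
    namenode.fs_image = namenode.current_namespace()
    namenode.edit_log = []

nn = ToyNameNode()
nn.create("/logs/web.log", [1, 2])
nn.create("/tmp/scratch", [3])
nn.delete("/tmp/scratch")
before = nn.current_namespace()
checkpoint(nn)
print(nn.fs_image, len(nn.edit_log))
```

After the checkpoint the namespace is unchanged but the log is empty, so a restart has nothing to replay.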

2.2.3 Hadoop-1 vs Hadoop-2.x

Hadoop 1 is made up of two major components: HDFS (HDFS V1) and MapReduce (MR V1). The two components are termed the pillars of Hadoop. Hadoop 2.x, on the other hand, is made up of three components: HDFS V2, YARN, and MapReduce. These are likewise termed the Hadoop pillars. Hadoop 1 has many limitations compared with Hadoop 2.x. First, it only suits applications built on MapReduce. It is also unsuitable for real-time and streaming data processing (Chen & Zhang, 2014). Additionally, it has a single JobTracker that must perform many operations, such as resource management, task scheduling, task monitoring, and re-scheduling. Hadoop 2.x addresses these limitations by introducing a new component, YARN.

Hadoop Eco-system

The Hadoop ecosystem is a framework or platform for solving big data problems. It encompasses services such as ingestion, storage, analysis, and maintenance. To perform these functions, the ecosystem is made up of different components, each of which completes particular tasks. One example is MapReduce, which can work on unstructured data. Others, such as Hive, Lucene, and Mahout, handle structured data, text search, and machine learning algorithms, respectively. In addition, Flume and Ambari carry out data collection and aggregation and the administration of clusters, respectively (O’Driscoll, Daugelaite, & Sleator, 2013).

3. BIG DATA TRENDS

3.1 Big Data Eco-system

The big data ecosystem consists of layers of abstraction and the components within them. In most cases the components integrate with HDFS, which makes up the largest portion of the ecosystem. The layers of abstraction include analytic applications, modeling, fast-loading analytic databases, security and management, higher-level languages, job and task trackers, location-aware file systems, and raw data and processing (Adluru, Datla, & Zhang, 2015). The storage layer here is HDFS, which holds the data and metadata needed to complete computations. The logic layer, also called the computation layer, consists of MapReduce, Pig, Hive, and Cascading. These frameworks carry out operations according to the instructions they are given.

The Apache Software Foundation supports other Hadoop-related projects (Adluru, Datla, & Zhang, 2015). Each of these projects deals with an individual aspect of Big Data and offers complementary support to Hadoop. Together, the Hadoop-related projects form the Hadoop ecosystem. Two of these projects are HBase and Cassandra.

HBase was inspired by Google's Bigtable. It is a scalable, non-relational database for Hadoop (O’Driscoll, Daugelaite, & Sleator, 2013). It supports the storage of large tables of structured data and uses HDFS as its underlying storage mechanism. It is used when random or real-time access to Big Data is needed.

Cassandra is also a scalable database system; it offers high availability and multi-master support to avoid single points of failure. Data can be retrieved from Cassandra using MapReduce. Its design draws on the data model of Google's Bigtable and the distributed design of Amazon's Dynamo.

3.2 Hadoop Distributions

A distribution packages numerous components so that they install easily and work in unison. A Hadoop distribution is tested and patched with improvements. Because Hadoop is an open-source Apache project, several enterprises have, much as with Linux distributions, launched their own Hadoop distributions with tools for managing clusters and with premium services. The oldest distribution of Hadoop is one example (Hashem et al., 2015). The Hortonworks distribution is closely related in functionality to Apache Hadoop. Intel offers a distribution that includes encryption support.

3.3 Hadoop in the Cloud and Virtualized Environment

Originally, Hadoop was designed to run on clusters of physical machines; its use has since expanded to cloud and virtual machines. Hadoop clusters can now be set up in both private and public clouds. Amazon, for instance, offers Hadoop clusters on demand, and Google offers Hadoop on Google Compute Engine. Deploying Hadoop on virtual clusters has numerous advantages. It saves operating costs, for example when a single image is cloned. Physical infrastructure can be reused, and clusters can be set up on demand.

3.4 Hadoop as a Big Data Operating System

Hadoop is slowly turning into a general-purpose operating system for data. YARN now works as a distributed resource manager on which analytic frameworks can run. It provides daemons and APIs for developing generic distributed applications. Data analytics frameworks, for instance graph analytics, can be integrated with Hadoop and use it as their storage and computational framework.

3.5 Big Data Security and Privacy Issues

Security and privacy issues are magnified by the features of Big Data, namely velocity, volume, and variety. When data is hosted on large-scale cloud infrastructure, security and privacy become significant considerations. Large-scale data hosted in the cloud brings the challenges of diverse data formats, high-volume inter-cloud migration, and streaming data. Handling Big Data on cloud infrastructure also gives attackers numerous attractive avenues for attack, because large volumes of information are spread across numerous software platforms and large computer networks.

4. BIG DATA APPLICATIONS

Big Data technologies have a long list of applications. For instance, they may be used for search engines, recommender systems, log processing and data warehousing, banking and finance, video and image analysis, web and social media, social life, science and research, and retail and manufacturing.

4.1 Politics

In politics, Big Data has wide application. A recent and perhaps the most notable application of Big Data technologies in politics is Barack Obama's 2012 presidential campaign. The campaign was built on a 100-strong team of data analysts tasked with working through data on the scale of dozens of terabytes. Using HP Vertica and predictive models built with R and Stata, they were able to process and analyze such a huge database.

4.2 National Security

Big Data technologies could be used to promote national security and to help in crime detection and prevention. Akhgar et al. present strategic approaches to the use of Big Data for terrorism and crime prevention.

4.3 Health Care and Medicine

The storage and processing of medical records is a natural domain for Big Data technologies. Sensors and other medical equipment attached to patients relay data that can be captured, stored in HDFS, and quickly analyzed (Murdoch & Detsky, 2013). Human genome mapping is another medical application of Big Data. It makes it easier to search for and locate the genetic determinants of diseases, and so promotes the development of personalized medicine for those diseases.

4.4 Science and Research

Technology is currently the main driver of research in science. In Europe, for instance, the European Organization for Nuclear Research operates the largest particle accelerator, the Large Hadron Collider (LHC) (Chen & Zhang, 2014). The data generated by its experiments is massive. The center has about 65,000 processors that analyze about 30 petabytes of data. About 150 data centers around the world contribute the computing power of thousands of computers.

4.5 Social Media Analysis

Businesses use powerful social media analytics and SaaS solutions to gain insight into their customers' web activities. Companies such as IBM provide such services. Businesses employ this kind of data analytics to better understand their competition, their customers, and their market (Wu et al., 2013). Customer behavior can be studied and analyzed with such technologies: data from customers' social media activity is captured and used to predict their behavior and to create customized campaigns.

5. Conclusion

Big Data is not limited to the volume of the data under consideration. It also covers the velocity, veracity, and variety of that data. Big Data has introduced a new attitude into the data processing world. New opportunities are emerging to provide solutions for world challenges that were once considered infeasible. Big Data technologies such as Apache Hadoop provide extensive opportunities and possibilities in data analysis. Hadoop was initially developed with twin core components, MapReduce and HDFS. YARN, the next-generation MapReduce, turns Hadoop into a general-purpose data operating system. On top of Hadoop, Apache offers sub-projects that provide extra services. Hadoop can also be set up and hosted on virtual infrastructure and in the cloud. Virtualization offers a Hadoop system that is set up on demand, while the cloud enables Hadoop services without dedicated physical clusters. Hadoop is widely adopted, with areas such as engineering, social life, and science at the forefront of its application and use. Ways of thinking and problem solving have improved immensely with the introduction of Big Data technologies such as Hadoop.

Big Data provides an opportunity to gain critical insights into emerging data types and content. Once such insight has been gained, it enables the creation of more agile, customized businesses. Big Data is admired by many for its ability to enable better business results, and it has positively affected the operations of businesses globally.

6. Recommendations

This paper has not explored every dimension and aspect of Big Data; however, the essential aspects have been discussed. The benefits of Big Data to organizations and individuals have been analyzed and evaluated. There are, however, some aspects that future research in this area might usefully consider.

6.1 Examine the types of Business Challenges

The variety of data sources, data quality, and data visualization are examples of the challenges of integrating Big Data (Chen & Zhang, 2014). It is critical to examine closely the types of business problems or world challenges that are to be solved; doing so is a good way of evaluating how suitable a Big Data technology solution is. It is also important to consider fundamental operational characteristics such as the scalability of data volume, both in demand and in the growth of the environment, and the purpose of the analysis. When discussing Big Data, it is important to keep in mind its key variables, including data variety, data velocity, data volume, and parallelization.

6.2 Develop a Strategic Plan

A strategic plan for the analysis is the next important phase of Big Data handling. The numerous Big Data alternatives on offer should also be explored. Suppliers of Big Data analysis tools should be selected using performance criteria. The best estimate of value can be obtained by clarifying the success criteria (Chen & Zhang, 2014). For the best Big Data analytics, the strategic plan should be aligned with the Big Data technologies in the existing data analytics infrastructure. It is important to treat Big Data as unique and to hold it in the highest value.

6.3 Treat Big Data as a Unique Subject

The deployment of a Big Data application is unique and differs from the handling of other data systems. In Big Data, off-the-shelf solutions should be avoided; components such as database management systems and data cleaning systems should each be provided separately. There are no simple shortcuts from conception to production for a firm's data. Developers and business staff should work closely together to develop and improve design requirements.

6.4 Think Long-term

Thinking long-term rather than short-term is critical in the application of Big Data. It is important to understand the potential pay-off of any project before investing in it. Big Data, however, does not come with such assurances, and firms investing in Big Data often lose a lot of money at first. It is therefore important to shift management's focus away from short-term thinking and the search for a quick return on investment, and toward long-term thinking. In Big Data, the risks are enormous and so are the rewards.

7. References

Adluru, P., Datla, S. S., & Zhang, X. (2015, May). Hadoop eco system for big data security and privacy. In 2015 Long Island Systems, Applications and Technology (pp. 1-6). IEEE.

Akhtar, Z., Drioli, C., Farinosi, M., Ferrin, G., Foresti, G. L., Martinel, N., ... & Vernier, M. Sensor network reconfiguration and big multimedia data fusion for situational awareness in smart environments.

Buhl, H. U., Röglinger, M., Moser, F., & Heidemann, J. (2013). Big data.

Chen, C. P., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

EMC Education Services (Ed.). (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Murdoch, T. B., & Detsky, A. S. (2013). The inevitable application of big data to health care. JAMA, 309(13), 1351-1352.

O’Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). ‘Big data’, Hadoop and cloud computing in genomics. Journal of Biomedical Informatics, 46(5), 774-781.

Wu, X., Zhu, X., Wu, G. Q., & Ding, W. (2013). Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1), 97-107.