Int. J. Big Data Intelligence, Vol. 4, No. 3, 2017 171

NoSQL databases for big data

Ahmed Oussous and Fatima-Zahra Benjelloun LGS, ENSA, Ibn Tofail University, Kenitra, Morocco Email: ahmed.oussous@outlook.com Email: fatimazahra.benjelloun@outlook.com

Ayoub Ait Lahcen* LGS, ENSA, Ibn Tofail University, Kenitra, Morocco and LRIT, Unité associée au CNRST URAC 29, Mohammed V University in Rabat, Morocco Email: ayoub.aitlahcen@univ-ibntofail.ac.ma *Corresponding author

Samir Belfkih LGS, ENSA, Ibn Tofail University, Kenitra, Morocco Email: samir.belfkih@univ-ibntofail.ac.ma

Abstract: NoSQL solutions have been created to respond to many issues encountered when dealing with some specific applications, e.g., storage of very large datasets. Traditional RDBMS ensure data integrity and transaction consistency, but at the cost of a rigid storage schema and complex management. Data integrity and consistency are certainly required in many cases, such as financial applications, but they are not always needed. The goal of this paper is to establish a precise picture of NoSQL’s evolution and mechanisms as well as the advantages and disadvantages of the main NoSQL data models and frameworks. For this purpose, first, a deep comparison between SQL and NoSQL databases is presented. Many criteria are examined, such as scalability, performance, consistency, security, analytical capabilities and fault-tolerance mechanisms. Second, the four major types of NoSQL databases are defined and compared: key-value stores, document databases, column-oriented databases and graph databases. Third, we compare for each NoSQL data model the main available technical solutions.

Keywords: NoSQL; key-value databases; document databases; column-oriented databases; graph databases; big data.

Reference to this paper should be made as follows: Oussous, A., Benjelloun, F-Z., Ait Lahcen, A. and Belfkih, S. (2017) ‘NoSQL databases for big data’, Int. J. Big Data Intelligence, Vol. 4, No. 3, pp.171–185.

Biographical notes: Ahmed Oussous is a PhD student at the Ibn Tofail University in Kenitra-Morocco. He is a member of Systems Engineering Laboratory at the National School of Applied Sciences of Kenitra, which is a Moroccan Engineering School. Prior to that, he received his DEUG, an undergraduate diploma, in 2010 in mathematics and computer science at the Ibn Zohr University in Agadir. Afterward, he received his Engineering degree in Computer Science in 2013 from the National School of Applied Sciences of Agadir. He is a Big Data Engineer. His research interests include big data exploration and analysis, big data classification and NoSQL databases.

Fatima-Zahra Benjelloun is a PhD candidate in Computer Science at the ENSA Kenitra, Morocco. She received her MBA in the Management of Information Technology and another MBA in E-Business from the Laval University, Quebec, Canada. Prior to that, she received her Engineer’s degree in Computer Science from the Al Akhawayn University, Ifrane, Morocco. She is a certified Project Management Professional (PMP) and an ISO 27001 Provisional Auditor of Information Systems. She worked as a consultant in the management of information security field in several ministries of Quebec City from 2006 to 2011. Her research interests include big data, security and privacy.

Ayoub Ait Lahcen is an Assistant Professor of Computer Engineering at the ENSA Kenitra, a Moroccan Engineering School, and a researcher at both LGS Laboratory (ENSA Kenitra) and LRIT Laboratory (Mohammed V University, Morocco). Prior to that, he received a Swiss Government scholarship to work during an academic year, as a Postdoctoral Researcher, with the Software Engineering Group of the University of Fribourg, Switzerland. He received his PhD in Computer Science from both Nice Sophia Antipolis University, France (prepared at INRIA Sophia Antipolis Research Center, in the Zenith team) and Mohammed V University (prepared at LRIT Laboratory). He received a Best Paper Award at MOPAS 2010. He was awarded a Moroccan Research Excellence Scholarship for PhD candidates and a Merit Scholarship for his Master’s in Computer Science and Telecommunications.

Samir Belfkih is currently a Professor of Computer Engineering at the ENSA, a Moroccan Engineering School at Kenitra. He is also the Director of several research teams at LGS Laboratory (ENSA Kenitra), including the information security team and the data analysis and information processing team. He received his PhD from France. He was a Lecturer and researcher at several universities including Montpellier II, Lille II and Sidi Mohammed Ben Abdellah in Fes. He has participated in several international conferences (the USA, Australia, …) and published various research papers in international journals. He has also assisted multiple public institutions in Morocco and abroad in implementing decision support tools and information system urbanisation.

This paper is a revised and expanded version of a paper entitled ‘Comparison and classification of NoSQL databases for big data’ presented at International Conference on Big Data, Cloud and Applications, Tetuan, Morocco, 25–26 May 2015.

1 Introduction

Many trends have been driving the spread of NoSQL databases and supporting their popularity. Examples include the proliferation of internet uses, the emergence of new web technologies and cheap storage, as well as the trends towards Web 2.0, Web 3.0 and big data. Such trends bring new requirements and challenges in terms of data storage, processing, analysis and display (Li and Manoharan, 2013).

Indeed, big data sets include a variety of structured, semi-structured and unstructured data that are difficult to handle with relational database management systems (RDBMS) and SQL. In fact, these traditional types of data storage and query language require structuring data in a specific, predefined model, which is not adequate in a big data context (where huge volumes of data are generated rapidly in heterogeneous formats) (Zerhari et al., 2015).

To face this challenge, NoSQL databases (a term that covers all non-relational databases) have been developed to handle big data issues, including the storage of huge volumes of data in varied formats, the need for flexible schemas and the need for scalable, rapid and distributed databases.

In fact, they provide various categories that fulfil the requirements of different use cases, such as key-value, document, graph, columnar and geospatial databases. Thus, NoSQL databases are database management systems suitable for distributed systems and non-relational data storage such as HDFS. It is worth mentioning that NoSQL systems complement, and do not replace, RDBMSs and their common query language, SQL.

It is also essential to note that NoSQL solutions were not created for the same purposes as SQL-based solutions. While relational databases are mainly dedicated to structured data and transaction handling, NoSQL solutions were created to resolve the storage problems of massive unstructured datasets. Both solutions have advantages and disadvantages.

Ordonez et al. (2010) point out that NoSQL databases were developed long before their wide adoption. Strauch et al. (2011) argue that the term NoSQL was first used in 1998 for a relational database that did not use SQL, and was used again in 2009 for a conference about non-relational databases in San Francisco. Indeed, the proliferation of clouds, especially databases as a service, and the urgent need for rapid, scalable and cheaper databases to handle big data encouraged the spread of NoSQL databases.

Strauch et al. (2011) add other motives that encouraged the proliferation of NoSQL, namely the need to store data in a simpler structure that supports object-oriented principles while avoiding expensive object-relational mapping. NoSQL was thus an answer to the requirements of simple applications.

Another trend is the proliferation of web technologies and cloud computing, which need low administration overhead and high scalability. There was also a movement in programming languages and development frameworks that tried to hide the complexity of SQL and relational databases in order to offer more flexible and convenient technologies (e.g., Ruby, Java Persistence API, ADO.NET, etc.).

The goal of this article is to establish a precise picture of NoSQL’s evolution and mechanisms as well as the advantages and disadvantages of the main NoSQL data models and frameworks. For this purpose, firstly, a deep comparison between SQL and NoSQL databases is presented. Secondly, the four major types of NoSQL databases are defined and compared. Finally, we present and compare for each NoSQL data model the main available technical solutions.

2 Comparing SQL and NOSQL databases features

This section provides a short summary of the most important features of different NoSQL systems in order to examine their performance, advantages and disadvantages compared to traditional relational database systems.

RDBMS and SQL databases were the dominant model for many years. However, they handle only one type of predefined schema and support only structured datasets. They guarantee the atomicity, consistency, isolation and durability (ACID) properties that are important for many applications. In fact, ACID properties are among the key features that protect the reliability of online transactions.

On the other hand, NoSQL databases are more suitable for non-relational, unstructured datasets like big data. They support many data schemas. However, they are slow in handling large queries, whereas SQL databases are faster and more appropriate for complex and intensive queries. NoSQL offers a cheaper way to handle big data management using clusters and commodity servers. Table 1 presents a comparison between SQL and NoSQL databases.

RDBMS encounter three main issues when dealing with big data and some web applications:

1 scale out data

2 performance of single servers

3 rigid schema design.

The following elements present a precise comparison between relational and NoSQL databases in terms of multiple criteria: scalability, performance, flexibility, cloud usage, data model, queries and analytics, security, data replication, standardisation and maturity.

2.1 Scalability

Big data storage, retrieval and processing are complex issues in relational databases. In fact, SQL databases are vertically scalable (Mohamed et al., 2014). To handle increasing load, users have to increase the capacity and the performance of a single server. This is achieved by increasing, for example, the capacity of the CPU, the RAM or the SSD of the dedicated database server. However, in such an architecture, sharding multiple tables across large clusters or grids is expensive and complex.

In contrast, NoSQL databases are horizontally scalable. So, to handle large data volumes, users have just to add servers to the NoSQL database infrastructure. Operations load is thus distributed over many servers. System scalability is thus easier and cheaper to achieve using NoSQL.

Table 1 SQL and NoSQL features

Feature | SQL database | NoSQL database
Types | One type (table-based databases) | Many different types: key-value, document, wide-column, and graph databases
Schemas | Predefined schema | Dynamic schema for unstructured data
Properties | Atomicity, consistency, isolation and durability | Consistency, availability and partition tolerance
Scaling | Vertically scalable | Horizontally scalable
Security | Limited security mechanisms, vulnerable to SQL injection | Authorisation and authentication weaknesses, no encryption, multiple interfaces increase attack surface
Complex query | Good for complex query intensive environment | Not suitable for complex queries
Query language | SQL standardised language | Specific query languages for each database
Development mode | Mix of open-source and commercial products | Mainly open-source
Support | Excellent support | Limited support based on communities
Economics | RDBMS tend to rely on expensive proprietary servers and storage systems | Cheapest database: NoSQL databases typically use clusters of cheap commodity servers to manage huge data volumes

2.2 Performance

Relational databases require a predefined data model and structured data. They offer advanced functionalities to manage, update and query data using SQL. They have various benefits, such as preserving the integrity, the consistency and the reliability of data and transactions. Relational databases ensure more reliability than NoSQL databases because they preserve the integrity of data and transactions by respecting ACID properties.

However, ACID properties are hard to ensure in the case of huge, growing datasets. That is why NoSQL databases rely instead on the basically available, soft state, eventually consistent (BASE) principles. Thus, they offer a flexible architecture to handle not only structured data but also unstructured and semi-structured data. Users can easily perform frequent code pushes and quick iterations.

It is worth mentioning that the trade-off between ACID and BASE properties is commonly discussed in light of the consistency, availability, partition tolerance (CAP) theorem. BASE principles are more flexible than ACID principles, whereas ACID properties ensure more consistency and transaction reliability. However, consistency and reliability are achieved at the cost of performance and significant investment. Thus, users have to analyse their needs in terms of flexibility and performance, depending on the use case and business requirements. They can choose either a relational database to guarantee consistency through ACID properties, or a NoSQL database when flexibility and performance are favoured in order to handle large datasets and to manage multiple servers in a cluster, even if flexibility means less integrity (Zikopoulos et al., 2011).
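To make the "eventually consistent" part of BASE concrete, the following is a minimal sketch and not the mechanism of any particular NoSQL product. The `Replica` class, its per-key version counter and the `sync_from` step are assumptions made for illustration; real systems typically rely on timestamps or vector clocks and on background synchronisation.

```python
class Replica:
    """One copy of the data; each value carries a version number (illustrative)."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value):
        # Bump the local version so that newer writes can be recognised later.
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))[1]

    def sync_from(self, other):
        """Adopt every entry of `other` that has a higher version than ours."""
        for key, (version, value) in other.data.items():
            if version > self.data.get(key, (0, None))[0]:
                self.data[key] = (version, value)


a, b = Replica(), Replica()
a.write("status", "online")   # the write is accepted by replica a only
stale = b.read("status")      # replica b has not seen the write yet
b.sync_from(a)                # replicas converge after synchronisation
fresh = b.read("status")
```

Between the write and the synchronisation, a client reading from replica `b` sees an older state. This is the window of inconsistency that BASE accepts in exchange for availability.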

In addition, relational databases lack efficiency when dealing with big data. In fact, the performance of relational databases tends to decrease as data volume increases, especially when dealing with semi-structured data in large warehouses. They also require significant investment to increase scalability (e.g., adding servers to store and process large datasets requires purchasing additional licenses).

Furthermore, the increasing need for real-time analysis of large, evolving and heterogeneous data volumes adds another level of difficulty (Ordonez et al., 2010). In addition, the row storage model of RDBMS is slower than column stores (e.g., statistical processing is slow in RDBMS). Some researchers propose upgrades to enable RDBMS to deal with this issue. Ordonez (2013) argues that RDBMS should incorporate array storage and be extended to include matrices and mathematical libraries. Other experts support the promise of NoSQL databases and schema-free databases (e.g., graph or object-oriented databases).

Unlike RDBMS, NoSQL databases were adapted and enhanced to provide the scalability, performance and flexibility needed for big data use cases. For instance, Strauch et al. (2011) reported that a billion data items can be written per day into the column store Hypertable at Zvents, while Google is able to process 20 petabytes of data stored in BigTable via MapReduce. In addition, NoSQL databases are based on more affordable hardware and technologies than relational databases. NoSQL databases are even preferred for some simple applications where data storage and processing require neither the advanced features of RDBMS nor the data integrity needed for banking transactions. Thus, NoSQL avoids the unnecessary complexity of relational databases (Strauch et al., 2011). For example, social media sites and big web applications do not necessarily need reliable transactions and ACID properties (e.g., updating a Facebook status or comments on Tweets). Zero data loss and zero service interruption are not crucial in those cases. Furthermore, implementing the ACID properties of RDBMS can be expensive compared to the utility of social media (Strauch et al., 2011).

2.3 Cloud usage

Relational databases are less suitable for cloud environments. In fact, RDBMS have limited scalability and rely on ACID properties. Thus, they cannot easily support very large semi-structured and unstructured datasets.

In contrast, NoSQL databases are well suited to cloud applications because they provide better availability, scalability, performance and flexibility. They can handle all types of data (structured, semi-structured and unstructured data).

2.4 Data models

On one hand, change management is complex to handle in relational databases. Users have to define the database schema before data injection. In addition, any change in database schema or tables should be studied carefully. Otherwise, such changes can cause service failure, reduce performance or may require maintenance and additional investment to adapt application modules.

On the other hand, NoSQL databases make change management easy. In fact, there is no need to specify a rigid database schema in advance. This provides the flexibility to store data without a predefined schema. Furthermore, it is possible to change the data model at any time without affecting the performance of the system or the application. Thus, users can choose the appropriate data model and database depending on their use case.
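The contrast between the two approaches to change management can be sketched with Python's standard `sqlite3` module and a plain list of dictionaries standing in for a schema-free collection. The table, field names and sample values are hypothetical, and the list is only a stand-in, not the API of a real document store.

```python
import sqlite3

# Relational side: the schema must exist before any data is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Amina",))

# Storing a new attribute fails until the schema is changed.
try:
    conn.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Omar", "Rabat"))
    schema_error = None
except sqlite3.OperationalError as exc:
    schema_error = str(exc)

# The schema change is an explicit, separate step.
conn.execute("ALTER TABLE users ADD COLUMN city TEXT")
conn.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Omar", "Rabat"))
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Schema-free side: records with different shapes coexist in one collection.
documents = []
documents.append({"name": "Amina"})
documents.append({"name": "Omar", "city": "Rabat", "tags": ["admin"]})
```

In the schema-free case, the burden of knowing which fields a record may contain moves from the database to the application code.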

2.5 Queries and analytics

A query language constitutes a computer language that enables developers to manipulate data inside a database.

Users of relational databases launch queries using the common structured query language (SQL) standard. However, there is no standard to query NoSQL databases. Indeed, each NoSQL database has its unique way to manage, extract and query data. Consequently, data scientists face the challenge to understand the query language of each NoSQL database.

In contrast, SQL databases handle complex queries well through a standardised interface, whereas NoSQL databases lack performance when dealing with complex queries. Joins are difficult to achieve in NoSQL databases. Instead, NoSQL databases are better suited to parallel computations and mathematical operations on large, distributed and evolving datasets (Oussous et al., 2015).

Compared to RDBMS, NoSQL solutions are less suitable for business intelligence use cases. This is because NoSQL databases are usually complex to use for advanced analytics, complex queries and joins, as mentioned before.
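The difference between a declarative join and a join performed in application code can be illustrated as follows. This is a sketch under stated assumptions: `sqlite3` plays the role of the relational database, two Python dictionaries play the role of key-value collections, and the `authors` and `posts` data are invented for the example.

```python
import sqlite3

# Relational side: the join is expressed declaratively in SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Amina'), (2, 'Omar');
    INSERT INTO posts VALUES (10, 1, 'Intro'), (11, 2, 'Sharding'), (12, 1, 'CAP');
""")
sql_rows = conn.execute(
    "SELECT p.title, a.name FROM posts p "
    "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
).fetchall()

# Non-relational side: two independent collections addressed by key.
authors = {1: {"name": "Amina"}, 2: {"name": "Omar"}}
posts = {
    10: {"author_id": 1, "title": "Intro"},
    11: {"author_id": 2, "title": "Sharding"},
    12: {"author_id": 1, "title": "CAP"},
}

# The application performs the join itself: one extra lookup per post.
code_rows = [
    (post["title"], authors[post["author_id"]]["name"])
    for _, post in sorted(posts.items())
]
```

Both approaches return the same rows here. With distributed collections, however, each lookup in the second variant may become a network round trip, which is one reason joins are costly in such stores.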

2.6 Security

In general, relational databases incorporate security mechanisms. However, they still face many security risks such as SQL injection, cross-site scripting, rootkits and weak communication protocols (Mohamed et al., 2014; Benjelloun and Ait Lahcen, 2015).
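As an illustration of the SQL injection risk mentioned above, the sketch below uses Python's standard `sqlite3` module with an invented `accounts` table. It contrasts a query built by string concatenation with a parameterised query, which is the usual mitigation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 250)]
)

# Input crafted by an attacker to change the meaning of the query.
user_input = "alice' OR '1'='1"

# Unsafe: the input is pasted into the SQL text and becomes part of the query.
unsafe_sql = "SELECT owner FROM accounts WHERE owner = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is passed as a bound parameter and is treated only as data.
safe_rows = conn.execute(
    "SELECT owner FROM accounts WHERE owner = ?", (user_input,)
).fetchall()
```

The concatenated query returns every account, whereas the parameterised query returns nothing because no owner has that literal name.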

Certainly, NoSQL databases offer better scalability and flexibility. But, most NoSQL databases do not include built-in security mechanisms. So users have to deal with different security issues. Some security tools and modules have been added on the top of NoSQL databases. They constitute a very thin security layer in comparison to relational databases.

In fact, most NoSQL databases do not secure client-server communications and provide neither authentication nor auditing mechanisms.

CouchDB is an example of a database that offers auditing, but it stores user names and passwords in log files, which compromises data security. To ensure authentication, users are usually required to add external components to the NoSQL infrastructure.

Furthermore, while encryption of structured data is easier in relational databases, the encryption of very large unstructured data sources is difficult to achieve. Thus, most of such sources are stored in clear format in NoSQL.

Therefore, we can conclude that NoSQL databases raise many security issues because they are not mature. However, for some use cases, security is important to protect valuable, confidential or sensitive large sources (e.g., health, government, system security and so on) (Benjelloun et al., 2015).

2.7 Sharding

Sharding refers to the common practice of using multiple servers. It describes the mechanism that splits large volumes of data stored in the same database across multiple servers and virtual data nodes. Sharding enhances performance since each server handles different data partitions.

However, it is recommended to use replication instead of sharding. This is because replication provides not only performance but also reliability (Hadjigeorgiou et al., 2013).

NoSQL databases embrace sharding to balance the load and to ensure parallel storage and processing. They offer the valuable option of adding or removing servers from the data layer without affecting application performance.

In contrast, RDBMS were not originally created for this purpose. Instead, the sharding feature was added to RDBMS later.

Tables are partitioned over multiple servers. Sharding is based on the mapping between shards (data partitions) and data nodes that contain those shards. The mapping can either be dynamic or static. One downside of sharding is that it does not allow joins between shards.
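The mapping between keys, shards and data nodes described above can be sketched as follows. The node names, the number of shards and the use of SHA-256 are assumptions made for illustration only, since each database uses its own partitioning scheme.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical data nodes
NUM_SHARDS = 8                          # hypothetical number of data partitions


def shard_of(key: str) -> int:
    """Map a key to a shard with a stable hash (Python's hash() varies per run)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Static mapping: each shard is assigned to a node once, in round-robin order.
SHARD_TO_NODE = {shard: NODES[shard % len(NODES)] for shard in range(NUM_SHARDS)}


def node_for(key: str) -> str:
    """Return the node that stores the shard containing `key`."""
    return SHARD_TO_NODE[shard_of(key)]


placement = {key: node_for(key) for key in ["user:1", "user:2", "order:77"]}
```

A dynamic mapping would update `SHARD_TO_NODE` at runtime, for example to move a shard to a newly added node. Since records that belong together may land on different nodes, a join between them cannot be answered by a single server.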

2.8 Data replication

Data replication is the concept of distributing data over a system. Such concept is better accomplished through a non-interactive and reliable process.

Replication is difficult to achieve in the case of relational databases because they were not created to deal with horizontal scaling. In relational databases, replication and backup are carried out via a semi-manual process.

However, in the case of big data, it is usually required to ensure an automatic live recovery of large and geo-distributed datasets.

Traditional means of data redundancy focus on data mirroring. They replicate data over target arrays at the data centre or at a distant site. This method consumes a lot of storage space, especially for large datasets that exceed petabytes. In fact, storing large streams of data (data in motion) as well as big data archives using traditional means is a costly overhead for organisations.

Because NoSQL databases are horizontally scalable, they enable easier management of data replication to prevent data loss. In fact, most NoSQL databases provide automatic data replication for fault tolerance. They replicate data across cluster servers and even across data centres. They enable users and administrators to easily configure and tune replication settings by specifying where and how data should be replicated across distributed systems.
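Automatic replication with a configurable number of copies can be sketched as below. The `Cluster` class, its `replication_factor` parameter and the simple placement rule (consecutive nodes starting from a position derived from the key) are assumptions for illustration and do not reproduce the behaviour of a specific NoSQL database.

```python
class Cluster:
    """Toy cluster that stores each key on `replication_factor` distinct nodes."""

    def __init__(self, node_names, replication_factor):
        self.names = list(node_names)
        self.nodes = {name: {} for name in self.names}
        self.rf = replication_factor

    def replicas_for(self, key):
        # Pick a deterministic start node, then take the following nodes in order.
        start = sum(key.encode("utf-8")) % len(self.names)
        return [self.names[(start + i) % len(self.names)] for i in range(self.rf)]

    def put(self, key, value):
        # The write is copied to every replica of the key.
        for name in self.replicas_for(key):
            self.nodes[name][key] = value

    def get(self, key, down=()):
        # Read from the first replica that is not marked as failed.
        for name in self.replicas_for(key):
            if name not in down and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)


cluster = Cluster(["n1", "n2", "n3", "n4"], replication_factor=3)
cluster.put("sensor:42", 21.5)
first_replica = cluster.replicas_for("sensor:42")[0]
value_after_failure = cluster.get("sensor:42", down={first_replica})
```

With three copies, the value remains readable after the loss of one node, at the price of storing it three times.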

Thus, by using big data technologies and NoSQL databases, developers do not have to worry about the complexity of the heterogeneous storage environment nor the mechanisms of parallel processing.

IBM InfoSphere data replication provides a good example of a real-time data replication. IBM White Paper (2014) confirms that real-time data replication allows ensuring continuous high data availability in both heterogeneous and homogenous environments. Real-time data replication is crucial to perform reporting, interactive analysis and to ensure synchronised transactions. It helps to extract accurate insight, to support rapid decision making and to optimise resources.

Aguilera et al. (2005) outline the utility of another option based on an error-correcting technique called erasure coding, paired with object-based storage. Such a solution is an alternative to data replication in a distributed environment. For instance, a data object (e.g., a document with its metadata) is split into segments. Each segment is encoded and cut into slices that are stored on different servers. Thus, if some slices are no longer accessible due to a disk failure, the organisation can still reconstruct the original data. This solution reduces cost, consumes less storage and provides fault-tolerant repositories. However, it is not yet mature.
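The principle of erasure coding can be illustrated with the simplest possible code: a single XOR parity slice. This is not the scheme studied by Aguilera et al. (2005), and production systems generally use stronger codes that tolerate several lost slices. The function names and the choice of four data slices are assumptions for the example.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized slices."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int):
    """Split data into k equal slices (zero-padded) plus one XOR parity slice."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    slices = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = slices[0]
    for piece in slices[1:]:
        parity = xor_bytes(parity, piece)
    return slices + [parity]


def reconstruct(slices, original_length):
    """Rebuild the data when at most one slice (marked None) is missing."""
    missing = [i for i, piece in enumerate(slices) if piece is None]
    if len(missing) > 1:
        raise ValueError("a single parity slice can repair only one lost slice")
    slices = list(slices)
    if missing:
        present = [piece for piece in slices if piece is not None]
        repaired = present[0]
        for piece in present[1:]:
            repaired = xor_bytes(repaired, piece)
        slices[missing[0]] = repaired
    # Drop the parity slice and the padding.
    return b"".join(slices[:-1])[:original_length]


document = b"NoSQL databases for big data"
stored = encode(document, k=4)   # five slices, each kept on a different server
stored[2] = None                 # simulate the loss of one server
recovered = reconstruct(stored, len(document))
```

Here the storage overhead is one extra slice for four data slices, compared with one or more full copies under replication.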

2.9 Standardisation and open source

Most NoSQL databases are open source solutions. This may accelerate their spread and popularity among big data users. However, it can also be a disadvantage as it does not promote standardisation practices. In fact, each NoSQL database differs from the others, and so do the queries they support. Indeed, as far as we know, there is not yet a reliable standard for NoSQL databases. Each one has its own query language. Developers and administrators thus face the challenge of learning and being trained for each available NoSQL solution. In contrast, relational databases are standardised and share the common SQL language.

2.10 Maturity

Due to their popularity, relational databases are commonly used worldwide inside enterprises. They have been used for a long time. Thus, they provide a common query language and rich features, and enjoy greater acceptance. There are also many professionals and consultants who can help enterprises to exploit, manage and administer their traditional databases. This supports their proliferation and increases their maturity, as many experts can report weaknesses and enhance the use of relational databases.

In contrast, even though NoSQL stores emerged more than ten years ago and are developer-friendly, they still lag behind in terms of widespread acceptance. In fact, NoSQL databases are still relatively immature in comparison to relational databases. Additionally, there is a lack of developers and administrators who master those types of databases. This slows down their maturation.

3 Types and main characteristics of NOSQL databases

Experts classify NoSQL databases according to different criteria especially the data model. Such types of databases are appropriate to handle the complexity of big data and its 3Vs (volume, velocity and variety). However, each type of NoSQL Databases offers a certain level of flexibility and a different data model to respond to the different big data cases.

In fact, users have the choice among different NoSQL databases according to their data structure as well as their storage and retrieval needs: document, key-value, column family and graph databases.

The NoSQL ecosystem provides various databases. Some of the most known ones are: HBase, Cassandra, DynamoDB, MongoDB, Riak, Redis, Accumulo, and Couchbase.

Table 2 NoSQL databases’ features

Features | Key-value store | Document store | Column-oriented store | Graph store
Characteristics | A simple hash table indexed by key. | Multiple key/value pairs form a document. Documents are generally stored in JSON format. | Store data in columnar format. Each key is associated with multiple attributes. | Focused on modelling the structure of the data.
Pros | Very fast and scalable. Simple model. | Schema free: unstructured data can be stored easily. Simple, powerful data model. Horizontal scalability. | Better for complex read queries. Fast querying of data. Storage of very large quantities of data. Better analytic performance. Improved data compression. | Powerful data model. Locally connected data. Indexed data. Easy to query. Handling complex relational information.
Cons | Stored data have no schema. Poor for complex data. All joins must be done in code. No foreign key constraints. No triggers. | Query model limited to keys and indexes. No standard query syntax. MapReduce for larger queries. Poor for interconnected data. | Very low-level API. Undefined data usage pattern. Increased disk seek time. Increased cost of inserts. Poor for interconnected data. | Must traverse the entire graph to give correct results. Sharding.
Suitable for | Storing session information, user profiles, preferences, shopping cart data. | Content management systems, blogging platforms, web analytics, real-time analytics, e-commerce applications. | Content management systems, blogging platforms, maintaining counters, expiring usage, heavy write volume such as log aggregation. | Space problems and connected data, such as social networks, spatial data, routing information for goods and money, recommendation engines.
Examples | Riak, Redis, MemcacheDB, Dynamo, Voldemort | MongoDB, CouchDB, ArangoDB, MarkLogic, RethinkDB | BigTable, HBase, Cassandra, Accumulo, Hypertable | Neo4j, OrientDB, Allegro, Virtuoso, InfiniteGraph

In the following sections, we present the four major types of NoSQL databases. For each type, we compare in Table 2 some essential database characteristics and we provide some examples. The purpose is to help users choose the appropriate database for their specific big data projects.

3.1 Key-value databases

Key-value stores are completely schema free. This model provides storage flexibility and a simple structure based on a hash table. In such a table, each entry consists of a unique key and a pointer to a particular item of data, which together form a key-value pair. Thus, key-value stores are suitable for simple operations based on key attributes. Indeed, hash tables are useful for looking up simple or complex values in extremely large datasets. Users who need a certain structure for their data may rely on collections of key-value pairs.

Concerning flexibility, key-value stores make it possible to add any type of new value at runtime while preserving system availability. This is possible without compromising the data already stored, which may have a different structure.

Key-value stores can handle a very large number of records. They can support high volumes of state changes per second with millions of simultaneous users through distributed processing and storage. Most key-value databases hold their datasets in memory. That is why they are suitable for caching the results of intensive SQL queries. Furthermore, those stores make it possible to speed up the display of web pages by calculating parts of a webpage in advance. The result can then be retrieved and displayed quickly upon a request by user ID (Atikoglu et al., 2012).

They are very useful for both storing the results of analytical algorithms (such as phrase counts among massive numbers of documents) and for producing those results via reports.

However, key-value databases inherit one drawback of NoSQL databases. They do not provide any kind of traditional database capabilities. Thus, to ensure transactions atomicity or the consistency of multiple parallel transactions, users should instead rely on the application itself (Loshin, 2013).

Another drawback is that users cannot access data by value. Indeed, it is impossible to query a key value data store in order to extract all records that contain a particular set of values. As confirmed by Manoochehri (2013), the only way to query a key-value database is through specifying a request either by key or by a range of keys.
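The access pattern described above, lookups by key or by key range but not by value, can be sketched with a toy in-memory store. The `KeyValueStore` class and its method names are hypothetical and do not correspond to the API of Redis, Riak or Voldemort.

```python
class KeyValueStore:
    """Toy key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value

    def get(self, key):
        return self._table[key]

    def delete(self, key):
        del self._table[key]

    def range(self, start, end):
        """Return pairs whose key lies in [start, end), in key order.
        A real store would keep keys ordered instead of sorting on each call."""
        return [(k, self._table[k]) for k in sorted(self._table) if start <= k < end]

    def scan(self):
        """Iterate over every pair; the only way to search by value."""
        return iter(self._table.items())


store = KeyValueStore()
store.put("cart:1001", {"items": ["book"]})
store.put("cart:1002", {"items": ["pen", "ink"]})
store.put("session:9", {"user": "amina"})

one_cart = store.get("cart:1001")          # direct access by key
carts = store.range("cart:", "cart;")      # ';' is the character after ':' in ASCII

# No query by value exists: the application must scan and filter everything.
with_pen = [k for k, v in store.scan() if "pen" in v.get("items", [])]
```

The full scan in the last line examines every record, which is why searching by value is impractical on very large key-value datasets.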

3.1.1 Key-value databases examples and comparison

We compare hereafter three examples of key-value databases: Redis, Riak and Voldemort. All of those three solutions provide scalability, fault tolerance and a near-linear increase in performance.

Riak is based on a simple and symmetric architecture and is designed for highly distributed environments such as the cloud. Like Voldemort, Riak relies on consistent hashing. It incorporates the map-reduce programming model, which splits the work into multiple tasks over several cluster nodes (Cattell, 2011).
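Consistent hashing, on which Riak and Voldemort rely according to the text above, can be sketched as a ring of hashed positions. The class below is a generic illustration under stated assumptions (SHA-256 positions, 64 virtual nodes per server, invented node names) and does not reproduce the actual implementation of either database.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Place a label on the ring using a stable hash."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hashing ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each server owns several positions so that load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def node_for(self, key):
        # Walk clockwise to the first position at or after the key's position.
        idx = bisect.bisect_left(self._ring, (ring_position(key),))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")  # scale out by one server
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The useful property is that adding a server only moves keys onto that new server. All other keys keep their previous owner, unlike a plain `hash(key) % number_of_nodes` scheme, where most keys would be reassigned.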

In Riak, any node can respond to a client request. In order to track system status, Riak do not rely on a unique node. Instead, it relies on a gossip protocol between the nodes to track nodes status (nodes that are alive, nodes that hold data). Riak provides high fault-tolerance but with less performance than Redis.

Indeed, Redis is more suitable for time-critical applications because it relies on in-memory dataset for fast responses. Like other key-value stores, Redis enables simple operations such as to insert, delete and lookup. Similar to Voldemort, Redis allows users to associate lists and sets not only with a blob (large data objects) or a string but also with a key. It also enables list and set operations.

Redis is appropriate for handling rapidly changing data such as real-time data collection from sensors and real-time communications. On the other hand, Voldemort is suitable for very large datasets such as geological data and map metadata. Indeed, it can support the storage of huge volumes without a great impact on performance (Feng, 2012).

Experiments showed that Redis scales as dataset volumes increase but does not scale with an increasing number of nodes. On the contrary, Voldemort scales when the number of nodes increases but does not scale with the increasing size of datasets. Redis ensures better data availability than Voldemort. However, both of them show reduced availability when dealing with very large datasets. It has also been shown that adding nodes to a Voldemort system helps to enhance its availability (Feng, 2012). Table 3 summarises the important features of Redis, Riak and Voldemort.

3.2 Document databases

Document databases (Moniruzzaman and Hossain, 2013) were designed to handle the storage and the management of large-scale documents. This type of database assigns a key to each document. Documents may contain multiple key-value pairs, key-array pairs, or even nested documents. Documents are encoded in a standard data exchange format such as XML, JavaScript Object Notation (JSON) or binary JSON (BSON). Document databases are recognised as a powerful, flexible and agile tool to store big data. In fact, because data is stored in an interpretable format such as JSON, such stores support various data types and are convenient for developers.

In contrast to key-value stores, document stores offer a mechanism to query collections based on multiple attribute value constraints. While key-value stores allow searching for data only by key, document databases allow users to search for data based on the content of documents. They can query by keys, by values or by example. Because the encoded documents contain metadata objects, it is possible to query data by example (Loshin, 2013).
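Query by example can be illustrated with a short sketch. The functions `matches` and `find_by_example` and the sample documents are hypothetical; real document stores expose their own query APIs. The example document lists the fields and values a result must contain, and nested fields are matched recursively.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]


def matches(doc: Document, example: Document) -> bool:
    """Return True if every field of `example` appears in `doc` with the same value.

    Nested dictionaries are matched recursively, so an example can
    constrain fields of embedded documents.
    """
    for field, expected in example.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(expected, dict) and isinstance(actual, dict):
            if not matches(actual, expected):
                return False
        elif actual != expected:
            return False
    return True


def find_by_example(collection: List[Document], example: Document) -> List[Document]:
    return [doc for doc in collection if matches(doc, example)]


articles = [
    {"_id": 1, "title": "NoSQL", "author": {"name": "Ali", "city": "Rabat"}, "tags": ["db"]},
    {"_id": 2, "title": "SQL", "author": {"name": "Sara", "city": "Rabat"}},
    {"_id": 3, "title": "Graphs", "author": {"name": "Ali", "city": "Kenitra"}},
]

by_city = find_by_example(articles, {"author": {"city": "Rabat"}})
by_author_and_title = find_by_example(articles, {"title": "Graphs", "author": {"name": "Ali"}})
```

Unlike a key-value lookup, both queries filter on document content, including a field of an embedded document.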


Table 3 Overview of Riak, Redis and Voldemort features

| Properties | Riak | Redis | Voldemort |
|---|---|---|---|
| Language | Erlang | C, C++ | Java |
| Fault tolerance | Replication | Replication | Data partition. Replication. RAID repair. |
| Data model | Buckets, keys-values | Data structures | Structured blob/text |
| Community | Apache | BSD | Apache |
| Protocol | HTTP/REST or custom binary | Telnet-like, binary safe | HTTP |
| Data storage | Bitcask. LevelDB. Volatile memory. File system. | Volatile memory. File system. | TSconfig. LevelDB. HDFS. GridGain. |
| Query language | Bhttp. JavaScript. REST. Erlang. | API calls | API calls |
| Map Reduce | YES | NO | NO |
| Replication mode | Multi-master replication | Master-slave replication | Symmetric replication |
| Best for | High availability. Partition tolerance. Persistence. | For rapidly changing data. Frequently written, rarely read. Statistical data. | Application with large requirement of data capacity. |

In document databases, complex data structures like nested objects can be handled more easily. Therefore, document stores can be more expressive than the data model of column families. They also offer the possibility to use secondary indexes, to query nested documents and to use operations like ‘and’, ‘or’ and ‘between’. To launch queries, users can rely either on rich programming APIs or on a query language. Those possibilities give document databases the flexibility required by multiple use cases. They are typically used for real-time analytics, logging and the storage layer of small and flexible websites like blogs, because they are easy to maintain.

In contrast to simple key-value stores, the value column in document databases contains semi-structured data, specifically attribute name/value pairs. Furthermore, document databases support a flexible schema. Indeed, they do not impose any schema restrictions and they allow storing documents with hundreds of attributes in a single column of a document scheme. Rows can therefore hold varying numbers and types of attributes. It is also possible to add attributes at runtime (Hecht and Jablonski, 2011).

3.2.1 Document databases examples and comparisons

In this section, we compare two popular document databases: CouchDB and MongoDB. Both are open source document databases and are designed to scale across multiple nodes easily. For both, data is stored in documents with self-contained records and no intrinsic relationships. However, MongoDB provides consistency, so each client always has the same view of data. Instead, CouchDB ensures availability. Thus, all clients can always read and write.

To query data, users of MongoDB can rely on a simple and intuitive query language in which queries are represented as JSON-like structures. On the other hand, CouchDB is based on a ‘map-reduce’ approach and its view concept. Indeed, queries are done through CouchDB’s ‘views’, which are defined with JavaScript to specify the field constraints.
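The map-reduce style of view can be sketched in a few lines. CouchDB views are written in JavaScript; the Python below is only an analogy, and the names `map_tags`, `reduce_count` and `build_view` are hypothetical. A map function emits (key, value) pairs for each document, and a reduce function combines the values emitted under the same key.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Document = Dict[str, Any]


def map_tags(doc: Document) -> Iterable[Tuple[str, int]]:
    # Map step: emit one (key, value) pair per tag of the document.
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_count(values: List[int]) -> int:
    # Reduce step: combine all values emitted under the same key.
    return sum(values)


def build_view(
    docs: List[Document],
    map_fn: Callable[[Document], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[List[int]], int],
) -> Dict[str, int]:
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Keys are kept in sorted order, as in an ordered view index.
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


posts = [
    {"_id": 1, "tags": ["nosql", "graph"]},
    {"_id": 2, "tags": ["nosql"]},
    {"_id": 3, "tags": ["sql"]},
    {"_id": 4},
]

tag_counts = build_view(posts, map_tags, reduce_count)
```

Because the view is defined ahead of time by its map and reduce functions, this style suits pre-defined queries better than ad hoc ones.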

CouchDB stores data on disk in append-only files while MongoDB stores data in a memory-mapped storage engine, using memory-mapped files for all disk I/O. As an interchange format, CouchDB offers an HTTP API for both data access and administration. MongoDB provides instead a socket-based wire protocol with BSON.

For fault-tolerance, CouchDB supports both master/master and master/slave replication. Replication can be finely tuned via replication filters. On the contrary, MongoDB manages replication using a form of asynchronous master/slave replication called replica sets.

To conclude, CouchDB and MongoDB have many common features, such as replication for fault tolerance and the use of volatile memory and the file system for data storage. Both rely on the MapReduce paradigm for data processing and have good community support.

However, CouchDB is not suited to rapidly changing data. In fact, while CouchDB requires setting up pre-defined queries, MongoDB is suitable for dynamic queries and ensures better performance on big databases.

Table 4 summarises the important features of CouchDB and MongoDB.

3.3 Wide-column databases

Wide-column databases are also called wide columnar stores, column-oriented stores and extensible record stores. They represent an extension of the key-value architecture with columns. Wide-column stores are designed to process distributed data over a pool of infrastructure. Their flexible architecture is suitable for handling a very large number of columns and for dealing with frequent changes in schema.


Table 4 MongoDB and CouchDB characteristics

| Properties | MongoDB | CouchDB |
|---|---|---|
| Language | C++ | Erlang |
| Fault tolerance | Replication | Replication |
| Data model | Document oriented (BSON) | Document oriented (JSON) |
| Community | AGPL and others | Apache |
| Protocol | TCP/IP | HTTP/REST |
| Data storage | Volatile memory, file system | Volatile memory, file system |
| Query language | JSON-like structure | Queries are done via CouchDB ‘views’. They are defined with JavaScript to specify field constraints. |
| Map Reduce | YES | YES |
| Replication mode | Master-slave replication | Multi-master replication |
| Best for | Dynamic queries. Defining indexes. Good performance on a big DB. | Accumulating. Occasionally changing data. Pre-defined queries to be run. Web use cases and mobile applications. |

Wide-column databases (Moniruzzaman and Hossain, 2013) are based on hybrid approaches that combine the declarative characteristics of relational databases with the schemas of various key-value stores. Column family stores have a graphical representation that is similar to that of relational databases. However, while relational databases store a null value in each column for which a dataset has no value, wide-column databases store a key-value pair in a row only if the dataset needs it. This is an optimisation for dealing with null values and sparse data (data with varying numbers of attributes).
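The difference in how sparse data is stored can be sketched as follows. The column names and records are invented for the example, and counting "cells" is only a rough proxy for storage cost. A fixed relational-style row reserves a slot for every column, whereas a sparse row keeps only the key-value pairs that exist.

```python
from typing import Dict, List, Optional, Tuple

COLUMNS = ["name", "email", "phone", "fax"]

# Sparse data: each record carries only the attributes it actually has.
records: List[Dict[str, str]] = [
    {"name": "Ali", "email": "ali@example.org"},
    {"name": "Sara", "fax": "0537"},
]


def to_fixed_rows(
    recs: List[Dict[str, str]], columns: List[str]
) -> List[Tuple[Optional[str], ...]]:
    # Relational-style layout: one slot per column, None standing for NULL.
    return [tuple(rec.get(col) for col in columns) for rec in recs]


fixed_rows = to_fixed_rows(records, COLUMNS)
fixed_cells = sum(len(row) for row in fixed_rows)  # slots stored, NULLs included
sparse_cells = sum(len(rec) for rec in records)    # key-value pairs stored
```

In this toy dataset the fixed layout stores twice as many cells as the sparse one, and the gap widens as the number of rarely used columns grows.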

In fact, because their data model can be efficiently partitioned, wide-column databases are appropriate for applications that need to store huge volumes of data on very large clusters.

Column-oriented databases offer a better performance compared to relational databases. In fact, they are more efficient for read-only queries (ad hoc and dynamic queries) and more rapid in executing some operations such as aggregations. This is due to their storage type (column-oriented physical design) as well as superior CPU and cache performance. Indeed, in relational databases rows are stored contiguously. On the contrary, in column-oriented databases each column is stored contiguously on a separate location on a disk (Abadi et al., 2009).
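The two physical designs can be contrasted with a small sketch. The data and function names are hypothetical, and Python lists only stand in for contiguous storage on disk. An aggregation over one attribute touches a single column in the column-oriented layout, whereas rebuilding a full record requires one lookup per column.

```python
from typing import Dict, List, Tuple

# Row-oriented layout: each record is stored contiguously.
rows: List[Tuple[int, str, float]] = [
    (1, "Rabat", 120.0),
    (2, "Kenitra", 80.5),
    (3, "Rabat", 42.0),
]

# Column-oriented layout: each attribute is stored contiguously.
columns: Dict[str, list] = {
    "id": [1, 2, 3],
    "city": ["Rabat", "Kenitra", "Rabat"],
    "amount": [120.0, 80.5, 42.0],
}


def total_from_rows(data: List[Tuple[int, str, float]]) -> float:
    # Every full record is visited even though one attribute is needed.
    return sum(record[2] for record in data)


def total_from_columns(data: Dict[str, list]) -> float:
    # Only the "amount" column is read.
    return sum(data["amount"])


def reconstruct_row(data: Dict[str, list], position: int) -> Tuple[int, str, float]:
    # Tuple reconstruction: one lookup per column at the same position.
    return tuple(data[name][position] for name in ("id", "city", "amount"))
```

The same sketch also hints at the cost side: `reconstruct_row` has to gather values from every column, which is the tuple-reconstruction overhead of column stores.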

In addition, the values stored in the columns are densely packed and compressed for read efficiency. Column stores enable direct and optimised operations on compressed data (Kaur and Rani, 2013).
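As an illustration of operating directly on compressed data, the sketch below uses run-length encoding. This scheme is an assumption made for the example, since the text does not say which compression technique a given store applies. Consecutive repeated values are stored once with a count, and a count query is answered without decoding the column.

```python
from itertools import groupby
from typing import List, Tuple

Run = Tuple[str, int]  # (value, number of consecutive repetitions)


def rle_encode(column: List[str]) -> List[Run]:
    # Collapse each run of identical consecutive values into one pair.
    return [(value, len(list(group))) for value, group in groupby(column)]


def rle_decode(runs: List[Run]) -> List[str]:
    return [value for value, count in runs for _ in range(count)]


def count_equal(runs: List[Run], wanted: str) -> int:
    # Aggregate directly on the compressed form, without decoding.
    return sum(count for value, count in runs if value == wanted)


city_column = ["Kenitra", "Kenitra", "Rabat", "Rabat", "Rabat", "Tangier"]
encoded = rle_encode(city_column)
```

Run-length encoding works best when a column is sorted or has few distinct values, which is often the case for a single attribute stored contiguously.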

Hereafter, we summarise some of wide-column databases advantages and disadvantages.

Compared to row-oriented databases, wide columnar stores show the following advantages:

• Better data compression: wide-column databases store repeated column values as a single column value and store columns in the most used format.

• Better use of bandwidth and cache: unlike relational databases, wide-column databases read only the required data from the disk, so no extra data or columns are read. In addition, only the required data is placed in the cache.

• Better code pipelining: wide-column databases spend CPU cycles only on the required data attributes.

However, wide columnar stores show the following weaknesses when compared to row-oriented databases:

• A longer disk seek time: this is because a large number of columns are read in parallel.

• An increased time for small inserts: in fact, wide-column databases need to update multiple values in multiple columns.

• An expanded time for tuple reconstructions: wide-column databases translate value position information into disk locations to reconstruct tuples. However, because they were not designed to reconstruct rows from multiple columns, this may compromise their advantages.

3.3.1 Wide-column databases examples and comparison

HBase, Cassandra and Accumulo are some examples of column family databases.

HBase is an Apache open source project that is suitable for handling various large datasets. It is designed to scale out horizontally in distributed clusters. HBase is based on a column-oriented key/value data model. In fact, it provides flexible structured hosting for very large tables in a BigTable-like format. This column store is written in Java and uses the Hadoop distributed file system (HDFS) (Carstoiu et al., 2010).


Table 5 Overview of HBase, Cassandra and Accumulo features

| Properties | HBase | Cassandra | Accumulo |
|---|---|---|---|
| Language | Java | Java | Java |
| Fault tolerance | Replication. Partitioning | Replication. Partitioning | Replication |
| Data model | BigTable | BigTable and Dynamo | BigTable |
| Community | Apache | Facebook | Apache |
| Protocol | Custom API. Thrift. REST | Thrift | Thrift |
| Data storage | HDFS | Inspired by Amazon's Dynamo for storing data | HDFS |
| Query language | API calls, REST XML, Thrift API | API calls, Thrift API | Java API, Thrift API, REST calls |
| Map Reduce | YES | YES | YES |
| Replication mode | Master-slave replication | Master-slave replication | Multi-master replication |
| Best for | Real-time access, bulk operation (indexing, …) | When you write more than you read (logging). When you must use Java | Access on the cell level |

HBase is designed to support high table-update rates. In fact, it puts updates into memory and periodically writes them out to files on the disk. HBase provides many features such as real-time queries, natural language search, consistent access to big data sources, linear and modular scalability, and automatic and configurable sharding of tables. It is a popular non-relational database which is included in many big data solutions and data-driven websites.

Cassandra is also a popular Apache project written in Java. It is based on the same data model as other extensible record stores and offers similar basic functionality. Indeed, Cassandra is a key-value database that uses column-oriented storage and column groups, sharding by key ranges and redundant storage. Cassandra supports partitioning and replication (Lakshman and Malik, 2009).

It offers many advantages such as scalability, read/write performances, as well as resiliency against ‘hot’ nodes and node failures. In fact, Cassandra provides automatic failure detection and recovery. Cassandra also allows configuring settings to adjust tradeoffs preferences between consistency and availability. Updates are cached in memory and then flushed to disk. Cassandra periodically compacts the disk representation.
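The write path described for HBase and Cassandra (updates buffered in memory, flushed to disk, then compacted) can be sketched as below. This is a deliberately simplified model: the class `BufferedStore`, the flush threshold and the use of dictionaries as stand-ins for on-disk files are assumptions for illustration, not the actual storage engines.

```python
from typing import Dict, List, Optional


class BufferedStore:
    """Toy write path: memory buffer, immutable flushed files, compaction."""

    def __init__(self, flush_threshold: int = 2) -> None:
        self.flush_threshold = flush_threshold
        self.memtable: Dict[str, str] = {}
        self.files: List[Dict[str, str]] = []  # oldest first

    def put(self, key: str, value: str) -> None:
        # Writes only touch memory until the buffer is full.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Write the buffer out as a sorted, immutable "file" (a dict here).
        if self.memtable:
            self.files.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        # Newest data wins: memory first, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for file in reversed(self.files):
            if key in file:
                return file[key]
        return None

    def compact(self) -> None:
        # Merge all files into one, keeping only the latest value per key.
        merged: Dict[str, str] = {}
        for file in self.files:  # oldest first, so later files overwrite
            merged.update(file)
        self.files = [dict(sorted(merged.items()))] if merged else []


store = BufferedStore(flush_threshold=2)
store.put("a", "1")
store.put("b", "2")  # second write triggers a flush
store.put("a", "3")
store.put("c", "4")  # second flush; "a" now exists in two files
files_before = len(store.files)
store.compact()
```

Compaction keeps reads cheap by bounding the number of files to search and by discarding overwritten values.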

Nevertheless, the Cassandra model does not provide a locking mechanism, and the replicas are updated asynchronously.

Apache Accumulo (Halldorsson, 2013) is a distributed column store solution that provides scalability and high performance. Apache Accumulo is based on Google’s BigTable design and is built on top of Hadoop, Zookeeper and Thrift. It allows access control at cell level on the BigTable. It also allows modifying key/value pairs at various points in the data management process. This is ensured through a server-side programming mechanism.

Accumulo has the advantage of maintaining consistency even in case of thousands of nodes and petabytes of data. Furthermore, it can read and write data in near real-time. Another advantage of Accumulo is its built-in cell-level security functionality.
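Cell-level access control can be illustrated with the following sketch. It uses a simplified rule (a reader must hold every label attached to a cell) and hypothetical names (`Cell`, `visible_cells`); it is not Accumulo's API, whose visibility expressions are richer than a plain set of labels.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: FrozenSet[str]  # labels a reader must hold (simplified rule)


def visible_cells(cells: List[Cell], authorizations: Set[str]) -> List[Cell]:
    # A cell is returned only if the reader holds every label it requires.
    return [c for c in cells if c.required_labels <= authorizations]


table = [
    Cell("patient-1", "name", "A. Example", frozenset()),
    Cell("patient-1", "diagnosis", "flu", frozenset({"medical"})),
    Cell("patient-1", "invoice", "120", frozenset({"billing"})),
]

nurse_view = visible_cells(table, {"medical"})
public_view = visible_cells(table, set())
```

Two readers scanning the same row thus receive different subsets of its cells, without the data being split into separate tables.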

Table 5 summarises the important features of HBase, Cassandra and Accumulo.

3.4 Graph databases

Both relational databases and some NoSQL databases like the already introduced key-value stores are not efficient when dealing with highly connected data. They lack the performance and the flexibility needed to process and query multiple relationships inside large datasets (Robinson et al., 2013).

Even though the MapReduce paradigm paired with the Hadoop framework provides scalability, fault tolerance and easy-to-program tools for large datasets, it has been shown that some NoSQL databases like key-value stores are not always suitable for connected data and very large graphs (Malewicz et al., 2010).

On the contrary, graph databases are suitable to store not only information about objects but also all relationships that exist among them. Indeed, graph databases are designed to better manage heavily linked data. Thus, they are suitable for applications with many relationships among their data.

They rely on a schema-free graph model that is based on a graph abstraction. This allows users to easily model and represent connectivity. More precisely, such a graph model includes a collection of vertices (e.g., objects or items represented by nodes) and edges that represent links, connections or relationships between data. Graph databases are therefore suitable for capturing relations between entities.

To illustrate this, a graph can refer to a professional network like that in Viadeo. In this case, the vertices represent professionals while the directed edges represent links and relationships between those professionals. Each vertex is also initialised with a value. It is worth mentioning that even if graph databases save relationships, they have nothing to do with relational databases.

They are useful to store, access and analyse the strength and the nature of relationships between two or more items (e.g., how close is the relationship between two people? How far away is a taxi driver from another one or from a tourist site?). Answering such questions makes it possible to formulate valuable recommendations in many industries.


For many use cases, graph databases offer enhanced performance (lower latency in comparison to batch processing of aggregates), a flexible data model (an easy way to express relationships and to enrich the graph as data and business requirements become more precise) and agility (the ability to evolve applications in a controlled manner aligned with agile and test-driven software development practices) (Robinson et al., 2013).

Graph databases are efficient at managing connected data since they make it possible to replace costly operations like recursive joins with efficient traversals.
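A traversal over directly linked vertices can be sketched as follows. The graph of professionals and the function `shortest_path` are invented for the example, and a breadth-first search is only one of several traversal strategies. Each hop follows a stored neighbour reference rather than computing a join.

```python
from collections import deque
from typing import Dict, List, Optional

# Adjacency list: each vertex maps directly to its neighbours.
graph: Dict[str, List[str]] = {
    "Amina": ["Karim", "Lina"],
    "Karim": ["Omar"],
    "Lina": ["Omar", "Sara"],
    "Omar": ["Youssef"],
    "Sara": [],
    "Youssef": [],
}


def shortest_path(g: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first traversal returning one shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        vertex = path[-1]
        if vertex == goal:
            return path
        for neighbour in g.get(vertex, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None
```

For instance, `shortest_path(graph, "Amina", "Youssef")` answers a "how are these two people connected" question in a single traversal, where a relational schema would need one self-join per hop.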

Both relational and graph databases show good performance in the environments for which they were created. However, in contrast to relational databases that use SQL as a common language, graph databases require product-specific query languages and have their own APIs. Thus, transitioning between implementations is more difficult with graph databases (Vicknair et al., 2010).

Unlike most classes of NoSQL data store, graph databases are not the best solutions for updating sets of data or for very large volumes of data (Robinson et al., 2013).

3.4.1 Examples and comparisons related to graph databases

Neo4J, ArangoDB and OrientDB are three examples of graph databases.

Neo4J is an open-source database. It is entirely written in Java, though there are available bindings with other languages like Ruby, Scala and Python.

In contrast to relational databases, which are based on a rigid and upfront schema, Neo4j has a schema-less data model and is based on a bottom-up approach. Consequently, Neo4j supports agility and can absorb ad hoc and dynamic data (Sharma, 2015).

Indeed, Neo4j (Angles, 2012) is based on a network-oriented model where relations are first-class objects. It provides an object-oriented API and a native disk-based storage manager for graphs. Neo4j also constitutes a framework for graph traversals.

Neo4J stores highly connected data in a graph format rather than in tables (tables are more suitable for aggregated data). In fact, data is stored in nodes connected to each other by defined relationships. Both nodes and relationships have their own properties. It is an embedded, disk-based, fully transactional Java persistence engine packaged in a few small jars. It offers high scalability (up to several billion data items), high availability even if data is distributed across many machines, fast queries and rapid path identification through its traversal framework. In addition, data analysts can use its human-readable query language adapted for graph models. Neo4j also offers convenient and simple access (via a REST interface or an object-oriented Java API). Similarly to relational databases, it ensures full ACID properties for reliable transactions (Kaur and Rani, 2013).

Neo4j Spatial is a library of utilities for Neo4j that makes it possible to add spatial indexes to already located data. It is designed to facilitate spatial operations on data (to search for specific data by region or within a defined perimeter). Its classes make it possible to use GeoTools through applications such as GeoServer and uDig.

OrientDB (Abubakar et al., 2014) is an open source NoSQL database management system that is released under the Apache 2 license. It is written in Java, so it can run on Linux, Windows and any system that supports Java. It provides the flexibility of documents as well as good performance in handling distributed graphs. In fact, OrientDB is a document-based database that consists of ODocuments (with the possibility to dynamically add and remove properties). At the same time, it allows relationships to be managed as in graph databases, with direct connections between records. Therefore, it emulates the property of index-free adjacency of documents. It supports multiple modes including schema-less, schema-full and schema-mixed. Furthermore, it supports transactions and ACID properties. It can store up to 150,000 records per second.

ArangoDB (Dohmen et al., 2012) is an open source distributed and multi-purpose NoSQL database. It supports multiple data models including documents, graphs, and key-values. It is suitable for applications that need space efficiency, high performance with convenient querying tools. Indeed, it allows using an SQL-like query language as well as JavaScript and Ruby extensions.

As mentioned before, OrientDB supports SQL as its query language. On the contrary, ArangoDB provides its own query language called AQL. This one enables aggregation, graph queries, grouping, joins, list iteration, results filtering, results projection, sorting and variables.

In fact, the ArangoDB query language (AQL) is designed to support complex queries, especially on ArangoDB data models. The storage and retrieval of data is based on collections. AQL is a declarative language that focuses on the results rather than on how the results should be produced. Furthermore, AQL is independent of the programming language of the clients. Thus, all the clients use the same language and syntax. In addition, ArangoDB offers a REST option for querying documents and permits querying by example. It enables vertical and horizontal sharding (i.e., adding more computation power and sharding data across many servers). It runs on different platforms such as Linux, Windows, OS X and even Raspberry Pi. It is available under the Apache 2 license.

3.4.2 Comparisons

From Table 6, we can see that OrientDB and Neo4j share many common features (such as support for the Java language, replication for fault tolerance and the HTTP REST protocol).

Both ArangoDB and OrientDB are multi-model and support sharding, ACID properties and reliable transactions that are executed on the server side. Neo4j, by contrast, is dedicated only to graphs and does not provide a way to partition data. Both ArangoDB and OrientDB offer REST APIs and can be used as an ‘API Server’ via JavaScript request handlers. However, ArangoDB is more practical in this case.


Table 6 Principal features of Neo4J, ArangoDB and OrientDB

| Properties | Neo4J | ArangoDB | OrientDB |
|---|---|---|---|
| Licence | Open source/commercial version | Open source | Open source (Apache) |
| Language | Java, Scala | C, C++, Ruby | Java (any JVM scripting language) |
| Fault tolerance | Replication | Replication | Replication |
| Data model | Graph database – connected data | Multi-model (document, graph, key-value store) | Multi-model (document, graph, key-value store) |
| Community | GPLv3 Commercial | Apache2 license | Apache2 license |
| Protocol | HTTP REST | HTTP | Binary, HTTP REST/JSON |
| Data storage | File system. Volatile memory | File system. Memory card. Volatile memory SSD | Volatile memory. Memory-mapped file. Remote storage |
| Query language | API calls. REST. SparQL. Cypher. Tinkerpop. Gremlin | AQL. REST. JavaScript/Ruby. Tinkerpop. Gremlin. API calls | SQL. Tinkerpop. Gremlin. SparQL. API calls. REST |
| Map Reduce | No | No | No |
| Replication mode | Master-slave replication | Master-slave replication | Multi-master replication |
| Partitioning | No partitioning | Sharding | Sharding |
| Best used | For graph-style. Rich or complex data relationships and queries. | ArangoDB has a query language which supports group-by like queries that allows implementing a faceted search. | For graph-style. Rich or complex interconnected data |

Both Neo4J and ArangoDB support fault tolerance through master-slave replication. They handle data storage using volatile memory and a file system.

OrientDB and ArangoDB are both free databases licensed under Apache 2. Neo4j, by contrast, offers an enterprise edition under the AGPL in addition to a free version.

However, there are many key differences between Neo4j, OrientDB and ArangoDB. The following points compare them in terms of several criteria and features.

3.4.2.1 Storage and performance

While OrientDB and ArangoDB are multi-model (supporting documents, graphs and simple key/values), Neo4j is dedicated only for graphs.

Both Neo4J and OrientDB are designed to support the storage of large datasets based on the graph model. This ensures scalability and performance when inserting or querying data. The main difference between those two NoSQL databases lies in the core storage. Indeed, OrientDB is based essentially on documents as the main storage (in addition to a graph layer that supports graphs), whereas Neo4J is based on graphs as its core storage (Barmpis and Kolovos, 2012).

According to some experiments (Beis et al., 2015), both Neo4J and OrientDB demonstrate comparable performance when dealing with small graphs. However, Neo4J appears to be more efficient than OrientDB when handling the storage and queries on big graphs.

Like OrientDB, ArangoDB is essentially a document store that allows connecting documents, so it is possible to query them as graphs. Each document is identified by a key, which enables key/value storage.

3.4.2.2 Complex types

Neo4j has some weaknesses. In contrast to OrientDB, which supports a large set of types, Neo4j supports only primitive types. With Neo4j, precision is lost when storing decimal numbers (amounts, currencies, salaries, etc.). While OrientDB supports DATE and DATETIME types to handle dates easily, Neo4j does not support DATE types. Thus, the user has to manage temporal data.

OrientDB supports other complex types such as binary type to store binary large objects (BLOB), embedded type to store embedded objects recursively, collections and maps.

ArangoDB supports JSON data types including numbers, UTF-8 strings, boolean values, arrays/lists and documents.

3.4.2.3 Query languages

Neo4j has its own query language called Cypher. Consequently, users have to be trained to use it. ArangoDB also uses its own custom language called AQL. It supports aggregation, graph queries, grouping, joins, list iteration, results filtering, results projection, sorting and variables.

On the contrary, OrientDB supports SQL as a query language (most developers are familiar with it). It also offers the possibility to manage graphs of connected documents and makes it possible to handle relationships without SQL joins.

3.4.2.4 Scalability and replication

Neo4j supports replication but only in the enterprise version. In both Neo4j and ArangoDB databases, the replication mechanism is based on master/slave architecture.


It means that only one server can be the master. Therefore, Neo4j and ArangoDB are not able to scale on writes because write throughput is limited to the capacity of the single master server. Such type of replication provides read-scalability and supports backups.

Unlike Neo4j and ArangoDB, which support master-slave replication, OrientDB supports multi-master replication and a sharded architecture. This feature reinforces data reliability. It means that all the servers in a cluster are masters and are able to read from and write to the database. Indeed, in OrientDB, the throughput is not limited by a single server: the global throughput is the sum of the throughput of all the servers. This ensures linear scalability.
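The scaling argument can be written down as a small idealised model. The per-server capacities below are hypothetical numbers, and the model ignores replication and coordination overhead, so it illustrates the reasoning rather than measured behaviour.

```python
from typing import List


def master_slave_write_capacity(writes_per_second: List[int], master_index: int = 0) -> int:
    # All writes go through the single master, whatever the cluster size.
    return writes_per_second[master_index]


def multi_master_write_capacity(writes_per_second: List[int]) -> int:
    # Idealised model from the text: every server accepts writes.
    return sum(writes_per_second)


cluster = [5000, 5000, 5000]  # hypothetical per-server write capacities
```

Adding a fourth server leaves the master-slave figure unchanged while the multi-master figure grows by the capacity of the new server, which is the linear scalability referred to above.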

OrientDB can host several databases per instance. Neo4j, by contrast, allows one database per server. With ArangoDB, it is possible to connect many slave databases to the master database. ArangoDB provides asynchronous replication. It also permits vertical and horizontal sharding (i.e., adding more computation power and sharding data across many servers).

3.4.2.5 Space management

An operational database should manage space without requiring a restart or down-time for maintenance. When records are deleted, OrientDB has the advantage of automatically reusing the freed space. This is done transparently while the server is online. On the contrary, Neo4j cannot automatically reclaim the space of the deleted records. In fact, to use the freed space, Neo4j requires a complete restart of the server.

3.4.2.6 Complex domains

Neo4j does not support the creation of schemas for vertices and edges, but only the label concept to group vertices and edges of the same type. Neo4j does not offer ways to support inheritance, polymorphism or complex constraints. In fact, it only supports the uniqueness of values using indexes.

In contrast, OrientDB supports the creation of schemas around graphs. It is also possible to create subclasses of Vertex and Edges through inheritance and polymorphism.

Concerning the management of relationships, ArangoDB is based on edges and does not support ‘links’. However, OrientDB offers this useful feature. It thus makes it possible to avoid the overhead of edges by allowing unidirectional relationships (like a hyperlink on the web).

With OrientDB, it is possible to decide whether to embed documents or link to them directly. To search a document, OrientDB automatically resolves all the links. This is one key difference from other document databases like MongoDB.

3.4.2.7 Indexing

Indexing enables efficient data retrieval from a database. This comes at the cost of additional storage space and slower write operations. Thus, it is important to decide which properties or attributes to index.
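The trade-off can be made visible with a minimal sketch of a hash index on one attribute. The class `IndexedCollection` and its methods are hypothetical and do not correspond to the indexing engines of the databases compared here. The index occupies extra memory and adds work to every insert, but it replaces a full scan with a single lookup.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


class IndexedCollection:
    """Toy document collection with a hash index on one attribute."""

    def __init__(self, indexed_attribute: str) -> None:
        self.indexed_attribute = indexed_attribute
        self.docs: Dict[int, Dict[str, Any]] = {}
        self.index: Dict[Any, Set[int]] = defaultdict(set)  # extra storage

    def insert(self, doc_id: int, doc: Dict[str, Any]) -> None:
        self.docs[doc_id] = doc
        # Extra work on every write: the index must be kept up to date.
        if self.indexed_attribute in doc:
            self.index[doc[self.indexed_attribute]].add(doc_id)

    def find_indexed(self, value: Any) -> List[int]:
        # One hash lookup instead of a scan.
        return sorted(self.index.get(value, set()))

    def find_scan(self, attribute: str, value: Any) -> List[int]:
        # Fallback for non-indexed attributes: inspect every document.
        return sorted(i for i, d in self.docs.items() if d.get(attribute) == value)


people = IndexedCollection(indexed_attribute="city")
people.insert(1, {"name": "Ali", "city": "Rabat"})
people.insert(2, {"name": "Sara", "city": "Kenitra"})
people.insert(3, {"name": "Omar", "city": "Rabat"})
```

Queries on `city` use the index, while queries on `name` still scan the whole collection, which is why the choice of indexed attributes should follow the expected query patterns.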

Neo4j allows the creation of custom indexes on elements’ properties for all nodes that have a defined label. Neo4j uses Apache Lucene as the default indexing engine. In fact, this engine makes it possible to search for data independently of the graph structure. It also ensures a fast search for textual data, especially in long texts.

For fast insertions and queries, OrientDB uses its own data structure to index properties of elements. It uses its own indexing algorithm called MVRB-Tree, derived from the red-black tree and from the B+ tree. This algorithm ensures both fast insertions and fast lookups. OrientDB supports different types of indexes: SB-Tree algorithm, hash index algorithm for very fast lookup as well as Lucene that is only used for full text and spatial index.

ArangoDB automatically indexes some system attributes. Moreover, it allows users to create indexes on non-system attributes of documents by specifying the names of attributes. A user-defined index is created at the collection level. Many index types are supported, such as the primary index to select documents, the edges index for quick access to documents, the hash index to find documents, the skip-list index to find and sort documents, the fulltext index to find words or their prefixes in documents and the geo index to find places on the surface of the earth. It is worth mentioning that some types of indexes allow indexing only one attribute (fulltext index) whereas other index types allow indexing multiple attributes at the same time (geo and skip-list indexes).

3.4.2.8 Security

OrientDB supports basic security management based on the users and roles model. It has also supported secure SSL connections since v1.7. In addition, OrientDB offers the possibility to encrypt records on disk using either the AES or the DES algorithm (starting from v2.2). Encryption can be configured at either the cluster level or the database level.

Moreover, OrientDB supports token-based authentication and LDAP integration. Since OrientDB supports multi-master replication, it is possible to replicate database servers in a cluster. This constitutes another security measure as it reduces the risks of data loss and downtime while increasing data accessibility.

Neo4j, however, provides few means to secure the database. Unlike OrientDB, Neo4j does not support data encryption, authorisation or auditing; it only secures communications between client and server through the SSL protocol (Grolinger et al., 2013). ArangoDB, in turn, supports only HTTP Basic authentication.

To sum up, Neo4j is suitable for the graph data model and complex interconnected data, whereas ArangoDB is suitable for fast search queries and group-by like queries. ArangoDB also allows quickly building and running small applications. OrientDB is more suitable for large applications that need extensive storage or have thousands of concurrent users. It is also appropriate for cases that need fine-grained security controls and for structured linked data such as RDF.

It is worth noting that the comparison made in this article reflects the current state of those NoSQL databases. Since updates and enhancements are made continuously to enrich the features of all the studied databases, the reader should be careful when choosing among them and refer to the latest updates.

3.5 Other NoSQL categories

The databases discussed above are considered to be the most popular NoSQL databases. However, there exist other categories of NoSQL databases that can be used for different applications, such as object databases (db4o, VelocityDB), multi-model databases, grid and cloud database solutions (GigaSpaces, GemFire), XML databases (BaseX, Berkeley DB XML) and multidimensional databases (SciDB, MiniM DB) (Mapanga and Kadebu, 2013).

4 Conclusions

A large number of NoSQL database solutions are available to choose from. Every NoSQL database is dedicated to a certain category of use cases. We hope that the comparisons, advantages and disadvantages presented in this paper will enable users to choose the right tools for their jobs.

NoSQL offers many advantages for dealing with big data storage, processing and querying. Scalability, performance and flexibility are the main advantages of NoSQL databases. Such databases can handle semi-structured and unstructured data through a flexible data model. For big data projects, users have to assess their data and evaluate the complexity of the planned queries to pick the right data model. This is essential to avoid unnecessary mapping tasks and complex transformations.

On the one hand, relational databases are better for structured data, complex queries, reliable transactions and high integrity. In fact, NoSQL databases generally do not support joins and complex queries. On the other hand, NoSQL databases are suitable for dynamic queries and for searching for simple or complex values in large, rapidly growing datasets. They are also good at supporting high volumes of state changes per second with millions of simultaneous distributed users. NoSQL databases maintain good and relatively linear performance even if volumes increase rapidly. In addition, unlike traditional stores, NoSQL databases ensure better sharding and real-time data replication at a lower cost while optimising system resources.

Because NoSQL solutions are not yet mature and are progressing at different speeds, administrators have to choose carefully between NoSQL and relational databases. Indeed, even though NoSQL solutions have been upgraded to offer transactional features, they cannot replace RDBMS.

To summarise, firstly, NoSQL key-value stores are suitable for fast and simple applications. Secondly, document stores show more flexibility to handle various query types. Thirdly, wide column stores are appropriate to scale easily for very large data sources. Finally, graph stores are better for applications that need to deal with relationships among a large number of entities.

References

Abadi, D.J., Boncz, P.A. and Harizopoulos, S. (2009) ‘Column-oriented database systems’, Proceedings of the VLDB Endowment, Vol. 2, No. 2, pp.1664–1665.

Abubakar, Y., Adeyi, T.S. and Auta, I.G. (2014) ‘Performance evaluation of NoSQL systems using YCSB in a resource austere environment’, Performance Evaluation, Vol. 7, No. 8, pp.23–27.

Aguilera, M.K., Janakiraman, R. and Xu, L. (2005) ‘Using erasure codes efficiently for storage in a distributed system’, in International Conference on Dependable Systems and Networks, IEEE, pp.336–345.

Angles, R. (2012) ‘A comparison of current graph database models’, in 28th International Conference on Data Engineering Workshops (ICDEW), IEEE, pp.171–177.

Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S. and Paleczny, M. (2012) ‘Workload analysis of a large-scale key-value store’, in ACM SIGMETRICS Performance Evaluation Review, ACM, Vol. 40, No. 1, pp.53–64.

Barmpis, K. and Kolovos, D.S. (2012) ‘Comparative analysis of data persistence technologies for large-scale models’, in Proceedings of the 2012 Extreme Modeling Workshop, ACM, pp.33–38.

Beis, S., Papadopoulos, S. and Kompatsiaris, Y. (2015) ‘Benchmarking graph databases on the problem of community detection’, in New Trends in Database and Information Systems II, Springer, pp.3–14.

Benjelloun, F. and Ait Lahcen, A. (2015) ‘Big data security: challenges, recommendations and solutions’, Handbook of Research on Security Considerations in Cloud Computing, pp.301–313, IGI Global.

Benjelloun, F., Ait Lahcen, A. and Belfkih, S. (2015) ‘An overview of big data opportunities, applications and tools’, The First International Conference on Intelligent Systems and Computer Vision, Fez, Morocco.

Carstoiu, D., Lepadatu, E. and Gaspar, M. (2010) ‘HBase – non SQL database, performances evaluation’, Int. J. Adv. Comp. Techn., Vol. 2, No. 5, pp.42–52.

Cattell, R. (2011) ‘Scalable SQL and NoSQL data stores’, ACM SIGMOD Record, Vol. 39, No. 4, pp.12–27.

Dohmen, L., Klamma, P.D.R. and Celler, F. (2012) Algorithms for Large Networks in the NoSQL Database ArangoDB, PhD thesis.

Feng, H. (2012) Benchmarking the Suitability of Key-Value Stores for Distributed Scientific Data, Dissertation, The University of Edinburgh.

Grolinger, K., Higashino, W.A., Tiwari, A. and Capretz, M.A. (2013) ‘Data management in cloud environments: NoSQL and NewSQL data stores’, Journal of Cloud Computing: Advances, Systems and Applications, Vol. 2, No. 1, p.22.

Hadjigeorgiou, C. et al. (2013) RDBMS vs. NoSQL: Performance and Scaling Comparison, The University of Edinburgh, Edinburgh, Scotland, UK.

Halldorsson, G.J. (2013) Apache Accumulo for Developers, Packt, Birmingham, UK.

Hecht, R. and Jablonski, S. (2011) NoSQL Evaluation: A Use Case Oriented Survey, IEEE, Washington, DC, USA.

IBM White Paper (2014) Derive Actionable Real-Time Insight from your Big Data with Data Replication, April, IBM Software Thought Leadership White Paper.

Kaur, K. and Rani, R. (2013) ‘Modeling and querying data in NoSQL databases’, in IEEE International Conference on Big Data, IEEE, pp.1–7.

Lakshman, A. and Malik, P. (2009) ‘Cassandra: structured storage system on a P2P network’, in Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, ACM, p.5.

Li, Y. and Manoharan, S. (2013) ‘A performance comparison of SQL and NoSQL databases’, in IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pp.15–19.

Loshin, D. (2013) Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph, Elsevier, San Francisco, CA, USA.

Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N. and Czajkowski, G. (2010) ‘Pregel: a system for large-scale graph processing’, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, pp.135–146.

Manoochehri, M. (2013) Data Just Right: Introduction to Large-scale Data & Analytics, Addison-Wesley, Crawfordsville, IN, USA.

Mapanga, I. and Kadebu, P. (2013) ‘Database management systems: a NoSQL analysis’, International Journal of Modern Communication Technologies & Research (IJMCTR), Vol. 1, No. 7, pp.12–18.

Mohamed, M.A., Altrafi, O.G. and Ismail, M.O. (2014) ‘Relational vs. NoSQL databases: a survey’, International Journal of Computer and Information Technology, Vol. 3, No. 3, pp.598–601.

Moniruzzaman, A.B.M. and Hossain, S.A. (2013) ‘NoSQL database: new era of databases for big data analytics – classification, characteristics and comparison’, International Journal of Database Theory and Application, Vol. 6, No. 4, pp.1–14.

Ordonez, C. (2013) ‘Can we analyze big data inside a DBMS?’, in Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, ACM, pp.85–92.

Ordonez, C., Song, I-Y. and Garcia-Alvarado, C. (2010) ‘Relational versus non-relational database systems for data warehousing’, in Proceedings of the 13th International Workshop on Data Warehousing and OLAP, ACM, pp.67–68.

Oussous, A., Benjelloun, F., Ait Lahcen, A. and Belfkih, S. (2015) ‘Comparison and classification of NoSQL databases for big data’, International Conference on Big Data, Cloud and Applications, Tetouan, Morocco.

Robinson, I., Webber, J. and Eifrem, E. (2013) Graph Databases, O’Reilly Media, Inc., Sebastopol, CA, USA.

Sharma, S. (2015) An Extended Classification and Comparison of NoSQL Big Data Models, arXiv preprint arXiv:1509.08035.

Strauch, C., Sites, U-L.S. and Kriha, W. (2011) NoSQL Databases, Lecture Notes, Stuttgart Media University, Stuttgart, Germany.

Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y. and Wilkins, D. (2010) ‘A comparison of a graph database and a relational database: a data provenance perspective’, in Proceedings of the 48th Annual Southeast Regional Conference, ACM, p.42.

Zerhari, B., Ait Lahcen, A. and Mouline, S. (2015) ‘Big data clustering: algorithms and challenges’, International Conference on Big Data, Cloud and Applications, Tetouan, Morocco.

Zikopoulos, P., Eaton, C. et al. (2011) Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Osborne Media, New York, NY, USA.