Research Paper
20 BUSINESS INTELLIGENCE JOURNAL • VOL. 19, NO. 2
BIG DATA SECURITY
Oren Hamami is the director of security strategy for SunGard Availability Services.
Big Data Security: Understanding the Risks
Oren Hamami
Abstract
Big data. It’s this year’s cloud computing: a transformative technology that is exploding into the mainstream of enterprise IT. Enterprises are wading into the big data pool without fully understanding the associated dangers. Big data introduces new tools, computing models, and classes of information assets to protect, as well as a diverse group of new technical and nontechnical users. As a consequence, traditional approaches to data security and resiliency simply no longer apply.
The potential business value of big data can’t be ignored. This article explores big data security issues, including protecting a new kind of (big) information asset, understanding the risks, protecting big data, and where to begin.
Introduction
Big data is a paradigm-shifting, potentially transformative technology. Long used on the fringes of the technology world, it is now exploding into the mainstream of enterprise IT. As with cloud computing, it has become easy for enterprises to start big data projects without a full understanding of the risks.
Big data will require new tools, computing models, and classes of information assets that an enterprise must protect. It also adds a diverse group of users. Traditional approaches to data security and resiliency don’t apply to managing big data. A McKinsey study proclaimed that although big data would likely deliver productivity and profit gains of five to six percent, leaders in every sector (not just a few data-oriented IT managers) “will have to grapple with the implications of big data” (Manyika, et al, 2011). Prudent enterprises must understand the risks and tackle them directly if they wish to protect the value of their big data investments.
What (and Where) Is My Data?
As is so often the case with technology trends, the term big data became widely used long before there was a consensus about its meaning. Confusion over what constitutes big data is so great that researchers at the University of St. Andrews in Scotland recently published a paper surveying the most widely used definitions in an effort to synthesize a single, cohesive meaning. This article relies on their description of big data as “the storage and analysis of large and or [sic] complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning” (Ward and Barker, 2013).
Even with this specific definition, identifying the use of big data across a company is not a trivial matter. Low storage and computing costs, coupled with easily available, open source tools, allow teams throughout the enterprise to accumulate and analyze significant amounts of data outside the view of IT. Adoption is further accelerated by the availability of cloud-based services for every aspect of big data, from storage to reporting. The democratization of data provides business users with unprecedented access to information, but the lack of visibility can cause significant risk exposure to go unnoticed until problems arise. When assessing big data risk, companies must identify all their big data assets, rather than limit their analysis to those controlled by IT.
Valuing a New Kind of (Big) Information Asset
Once big data assets have been identified, an enterprise typically begins to assess the associated risk by determining the assets’ value. Here we encounter our next challenge: how do you value big data? Mature models exist for valuing many traditional types of data. Theft of intellectual property can be linked to associated losses in revenue and competitive advantage, and extensive research has been conducted on the cost of a breach of personally identifiable information (Ponemon Institute, 2013). It is far more difficult to assign a value, even a qualitative one, to a data set whose value, if it has any at all, must be extracted through data mining.
One approach to valuing such an asset is to look to other extractive endeavors, such as oil and gas exploration. Oil and gas companies must be able to value a well without knowing exactly how much oil or gas it will produce during its lifetime. To do this, a well’s reserves are classified as proved, probable, or possible based on the likelihood of successful extraction, with further subclassification based on the difficulty and cost of extraction (SPE Board, 2007).
This approach can be applied to any resource that must be mined, including data. For example, a company that has been mining a data set for years in support of specific business processes with measurable outputs might classify this data as proved and could assign it a fairly accurate value based on the business process it supports. Conversely, a new, unknown data set that is purely exploratory might be classified as probable or possible based on the likelihood it will produce information of value. These are not hard and fast rules, but rather tools to help businesses assign proportional weights to more accurately reflect the expected value of a big data asset over time. Once big data assets are identified and valued, the enterprise can then examine its big data security risks.
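The proportional weighting described above can be sketched in a few lines of Python. This is only an illustration: the weights assigned to each class, the asset names, and the dollar figures are assumptions made for the example, not values taken from the SPE standard or from any real valuation.

```python
from dataclasses import dataclass

# Hypothetical confidence weights per reserve-style class (assumed for illustration).
CLASS_WEIGHTS = {"proved": 0.9, "probable": 0.5, "possible": 0.1}


@dataclass
class DataAsset:
    name: str
    classification: str      # "proved", "probable", or "possible"
    potential_value: float   # estimated value if extraction fully succeeds
    extraction_cost: float   # estimated cost of mining the data set


def expected_value(asset: DataAsset) -> float:
    """Weight the potential value by the class likelihood, then subtract cost."""
    weight = CLASS_WEIGHTS[asset.classification]
    return weight * asset.potential_value - asset.extraction_cost


# Three made-up assets with identical potential value and cost.
assets = [
    DataAsset("billing history", "proved", 1_000_000, 100_000),
    DataAsset("clickstream archive", "probable", 1_000_000, 100_000),
    DataAsset("sensor dump", "possible", 1_000_000, 100_000),
]

for asset in sorted(assets, key=expected_value, reverse=True):
    print(f"{asset.name}: {expected_value(asset):,.0f}")
```

With identical potential value and cost, the three assets differ only by classification, which is the point of the approach: the weighting, not the raw size of the data set, drives how much protection the asset appears to justify.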
New Tools, Old Risks
Much attention has been given to the security deficiencies in the tools commonly used with big data, particularly NoSQL databases. The Cloud Security Alliance (CSA) lists the security of non-relational database tools among its top 10 big data security and privacy challenges (Cloud Security Alliance, 2013), and NoSQL security has been called out by security researchers in both industry (SpiderLabs Radio, 2013)
and academia (Okman, et al, 2011). This attention is not entirely unwarranted, as many of these tools’ security capabilities are described in the CSA’s report as “still evolving.” However, placing the blame for poor security on the tools themselves is unfair. To understand why, it is important to consider their origins.
Most of the tools commonly associated with big data were developed by large Internet companies to serve a specific purpose and analyze a specific type of data. Depending on the type of data analyzed, high-level security might not have been required. In other cases, the tool may have simply been a small cog in a much larger application, with security handled elsewhere. Once the tools moved to open source, they became general-purpose software. However, because of their original use, they often lacked the security functionality found in other tools designed for general use. Although the tools were different, most shared a common set of security gaps, including weak access control (authentication, authorization, auditing), insecure communications (both inter- and intra-cluster), weak client or API security, no encryption functionality, and vulnerability to injection attacks (Okman, et al, 2011).
Until recently, companies looking to secure their big data tools were largely on their own. Sophisticated access control often required writing custom authentication and authorization modules or API proxies; secure communications required the use of private networks or operating system-based tunneling; and data encryption had to be performed entirely by external mechanisms.
Today’s picture, although not perfect, is not quite as bleak. Many big data tools have significantly improved the security functionality in their core projects or in commercially supported distributions. In addition, new vendors such as Sqrrl have entered the market with security-focused offerings. Sqrrl offers an enterprise-focused version of Accumulo, a highly secure NoSQL database originally designed by the National Security Agency (Sqrrl, 2013). Note that some security gaps attributed to NoSQL tools, such as vulnerability to injection attacks, are in fact flaws in applications, not in the database. After all, no one advocates abandoning relational databases despite the fact that applications built on them remain susceptible to SQL injection.
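To illustrate why injection is an application flaw rather than a database flaw, the following minimal sketch uses Python's standard sqlite3 module with an in-memory database. The table, the rows, and the two function names are hypothetical. The same database engine serves both functions; only the way the application builds the query differs.

```python
import sqlite3

# In-memory database with a hypothetical table, used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def lookup_unsafe(name: str) -> list:
    # Vulnerable: user input is spliced directly into the query text.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def lookup_safe(name: str) -> list:
    # Safer: the driver binds the value, so it is never parsed as SQL.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()


malicious = "nobody' OR '1'='1"
print(lookup_unsafe(malicious))  # the injected OR clause matches every row
print(lookup_safe(malicious))    # no user has this literal name, so no rows
```

The equivalent precaution for a NoSQL tool depends on its client library, so it is not shown here, but the principle is the same: keep user input out of the query structure.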
Old Tools, New Risks
In addition to the risks associated with big data tools, companies should also consider risks that arise from the ways they use big data. Foremost among these is the close association between big data and cloud computing. Big data does not require the use of the cloud, but there is a strong affinity between the two. Big data depends on large quantities of cheap storage and computing resources, which the cloud provides. Like big data, the cloud is not inherently insecure, but companies need to take precautions to ensure that they are using the cloud securely. A complete analysis of cloud security is beyond the scope of this article, but if you want to leverage the cloud as part of your big data solutions, pay particular attention to data residency, data obfuscation and encryption, data retention and destruction, and regulatory compliance.
Big data’s compliance challenges extend beyond the use of the cloud, particularly in the area of privacy. To avoid running afoul of the complex web of local, national, and international laws and regulations governing the use and protection of personally identifiable information, data analysts have long employed de-identification techniques to anonymize data sets. In the era of big data, this task becomes more difficult.
In 2006, AOL released an anonymized set of 20 million search queries for use by researchers but quickly came under fire when the New York Times was able to determine users’ identities based on other publicly available
information (Barbaro, Zeller Jr., and Hansell, 2006). With today’s big data tools and widely available large public data sets, it is becoming increasingly difficult to truly anonymize large data sets.
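The mechanics of this kind of re-identification can be sketched as a simple join between a released data set and a public one. Everything in the example is invented for illustration: the records, the field names, and the choice of ZIP code, birth year, and sex as the linking fields are assumptions, and the sketch does not reproduce how any real data set was analyzed.

```python
# Hypothetical "anonymized" release: names removed, other attributes kept.
released = [
    {"zip": "30301", "birth_year": 1944, "sex": "F", "query": "garden tools"},
    {"zip": "30301", "birth_year": 1980, "sex": "M", "query": "used cars"},
    {"zip": "10001", "birth_year": 1980, "sex": "M", "query": "tax forms"},
]

# Hypothetical public data set that still carries names.
public = [
    {"name": "Resident A", "zip": "30301", "birth_year": 1944, "sex": "F"},
    {"name": "Resident B", "zip": "30301", "birth_year": 1980, "sex": "M"},
    {"name": "Resident C", "zip": "10001", "birth_year": 1980, "sex": "M"},
    {"name": "Resident D", "zip": "10001", "birth_year": 1980, "sex": "M"},
]

# Fields assumed to appear in both data sets.
LINK_FIELDS = ("zip", "birth_year", "sex")


def key(record: dict) -> tuple:
    return tuple(record[field] for field in LINK_FIELDS)


def reidentify(released_rows: list, public_rows: list) -> dict:
    """Map each released row index to a name when exactly one public row matches."""
    candidates = {}
    for row in public_rows:
        candidates.setdefault(key(row), []).append(row["name"])
    matches = {}
    for index, row in enumerate(released_rows):
        names = candidates.get(key(row), [])
        if len(names) == 1:  # a unique match re-identifies the person
            matches[index] = names[0]
    return matches


print(reidentify(released, public))
```

Only the third released row stays anonymous, because two public records share its attributes. The more public data sets that become available to join against, the fewer rows retain that protection.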
Another example is the challenge of de-identifying personal health information in the age of big data. The HIPAA Privacy Rule includes a standard for de-identification that requires either an expert determination that data cannot be re-identified using common statistical tools and practices or the removal of any data fields that would make it possible to re-identify the individual (U.S. Department of Health and Human Services, 2013). Enterprises that want to de-identify personal health information will be chasing a moving target as analysis tools and available data sets continue to proliferate.
This leads to another challenge: protecting large-scale data sets containing personally identifiable or otherwise sensitive information. This case combines the valuation typically associated with more transactional data with the massive scale of big data. According to the Symantec/Ponemon Institute “2013 Cost of Data Breach Study,” the average data breach costs $136 per compromised record (Ponemon Institute, 2013). Any data breach is costly, but if a breach involves data sets with tens or hundreds of millions of records, the costs become astronomical. In such cases, companies will have to balance performance, scalability, and security, which may increase the cost of extracting valuable information from sensitive data sets.
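A quick calculation shows the scale involved. The sketch below simply multiplies the cited $136 average by the record count; treating the cost as linear in the number of records is an assumption made for illustration, and the per-record average may not hold for breaches of this size.

```python
COST_PER_RECORD_USD = 136  # average cited in the text (Ponemon Institute, 2013)


def naive_breach_cost(records: int, cost_per_record: int = COST_PER_RECORD_USD) -> int:
    """Linear extrapolation: record count multiplied by the average per-record cost."""
    return records * cost_per_record


for records in (10_000_000, 100_000_000):
    print(f"{records:,} records -> ${naive_breach_cost(records):,}")
```

Under that assumption, 10 million records correspond to $1.36 billion and 100 million records to $13.6 billion.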
One final area to consider is resiliency. With the increasing reliance on big data to support critical business processes, tools need to be highly available and rapidly recoverable. Many big data tools were designed with resiliency as their primary requirement and can tolerate the failure of multiple nodes without disruption. However, this ability is not always inherent; in most cases, NoSQL databases and the applications using them must be designed to take advantage of the resiliency features. For example, adding a second node to a cluster allows it to tolerate the failure of a node, but if the two servers are in the same rack or data center, they are both vulnerable to a single rack or data center failure. Another consideration is that many tools leverage a highly distributed architecture to gain resiliency at the expense of consistency. If these tools support a business process that requires consistent information at all times, some sacrifices in either resiliency or scalability may be required.
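The node placement example can be made concrete with a small check of which failures a given set of replicas would survive. The cluster layout, node names, and the `survives` helper below are hypothetical; real tools expose placement through their own configuration, which is not modeled here.

```python
# Hypothetical cluster layout: node name -> (data center, rack).
NODES = {
    "node-1": ("dc-east", "rack-1"),
    "node-2": ("dc-east", "rack-1"),
    "node-3": ("dc-east", "rack-2"),
    "node-4": ("dc-west", "rack-1"),
}


def survives(replicas: list, level: str) -> bool:
    """Return True if no single failure at `level` takes out every replica."""
    if level == "node":
        domains = set(replicas)
    elif level == "rack":
        domains = {NODES[name] for name in replicas}  # (data center, rack) pairs
    elif level == "data_center":
        domains = {NODES[name][0] for name in replicas}
    else:
        raise ValueError(f"unknown failure level: {level}")
    # The data survives if the replicas span more than one failure domain.
    return len(domains) > 1


for replicas in (["node-1", "node-2"], ["node-1", "node-3"], ["node-1", "node-4"]):
    report = {lvl: survives(replicas, lvl) for lvl in ("node", "rack", "data_center")}
    print(replicas, report)
```

Two replicas in the same rack survive only a node failure; spreading them across racks or data centers extends the protection one level at a time, usually at some cost in latency or consistency.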
There is more to resiliency than uptime. Even the best-designed system will eventually suffer a loss of data through technical failure, user error, or malicious action, so enterprises should also consider recoverability. Due to their size and (often) rapid growth, big data assets often cannot be backed up and restored using traditional backup techniques and time frames, meaning that companies must be more thoughtful about their approach to recoverability.
As with valuation, decisions about backup and recovery should take into account the business impact of losing some or all of a big data asset. In some cases, such as where analysis is performed in real time, the loss of a previous day’s data might not be a problem. Conversely, a data set containing regulated information might be subject to specific backup and retention requirements. The best approach is to include recoverability as a dimension within the broader risk assessment, so that risks to an asset’s availability can be evaluated using the same framework used to assess risks to its confidentiality and integrity.
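One way to make such decisions repeatable is to encode them as a simple rule set. The sketch below is an assumption-laden illustration: the `Asset` fields, the 1-5 impact scale, the threshold of 4, and the order of the rules are all hypothetical and would need to reflect an organization's own risk assessment.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    regulated: bool       # subject to backup or retention requirements
    real_time_only: bool  # analyzed as it arrives; older data is rarely reused
    loss_impact: int      # hypothetical 1-5 rating of business impact if lost


def backup_policy(asset: Asset) -> str:
    """Assumed rule order: regulation first, then usage pattern, then impact."""
    if asset.regulated:
        return "back up and retain per regulatory requirements"
    if asset.real_time_only:
        return "no backup; accept loss of historical data"
    if asset.loss_impact >= 4:
        return "back up; high business impact"
    return "back up selectively; review with the asset owner"


# Made-up assets covering each branch of the rule set.
examples = [
    Asset("patient records", regulated=True, real_time_only=False, loss_impact=5),
    Asset("live traffic feed", regulated=False, real_time_only=True, loss_impact=2),
    Asset("pricing model inputs", regulated=False, real_time_only=False, loss_impact=4),
    Asset("exploratory logs", regulated=False, real_time_only=False, loss_impact=1),
]

for asset in examples:
    print(f"{asset.name}: {backup_policy(asset)}")
```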
Where to Begin
It would be easy to say that enterprises should establish strong policies and governance structures prior to starting any big data initiatives. Easy, but not terribly helpful. As with cloud computing, increasingly low
barriers to entry mean that most enterprises are already using big data in some form, even in the absence of a formal big data initiative.
To get a handle on big data risk, business and IT leaders would be wise to heed the early lessons of cloud security: approach the problem as partners, not adversaries; understand the business value and technical challenges; and be prepared to throw out old assumptions about data security. Here are five key steps for enterprises looking to manage their big data risk.
1. Get the lay of the land
The first step in partnering with your big data users is to find out who they are. This can be no small feat. If you are fortunate enough to work in an organization with strong data governance processes, much of this work may have already been done for you. Likewise, if your company tends to be a late adopter of technology, you may not have any current big data users to find. For most organizations, however, some detective work will be required.
A good place to start is with power users of traditional business intelligence tools, who are likely to be early adopters of new analytics tools. Another approach is to look at the groups that were early adopters of cloud services. There is a significant overlap between big data and the cloud, and these groups have already demonstrated a willingness to obtain third-party technology services, with or without authorization. A final place to look is within your own IT shop, both in terms of analytical tools deployed for business users and those increasingly used to mine system and security event and log data.
2. Establish ground rules
Once you have a better picture of how big data is being used, you can start to evaluate risks and establish basic governance. Absent a strong senior executive mandate, this task may be better achieved with a flanking maneuver than a frontal assault. The best place to start is with the legal department: big data raises numerous privacy and compliance challenges, and a legal mandate can be a powerful tool for implementing basic policies. Policies can also be enforced at key gates that may help identify a new big data initiative before it begins. Examples include procurement of big data tools or services; program management organizations or other centralized project governance bodies; and IT itself, which is often the source of much of the raw data used in big data analytics.
Another, less obvious ally in this effort is an organization’s public relations team. The consumer response to big data is not always positive, and this may also be an effective driver of governance (Strong, 2013).
3. Build in-house capabilities
Heed the lesson of cloud computing: if your users can’t get the services they want from corporate IT, they will get them somewhere else. Thanks to open source tools and cloud-based services, it has never been easier for them to do so.
The best way to avoid the risks associated with such “rogue” deployments is to be part of the solution. In some cases, this might mean building an in-house big data analytics platform, but this is not right for every organization, nor is it the only way to retain a measure of control. Third-party and cloud services provide a viable alternative for many organizations and are not inherently insecure when deployed and used properly. IT departments that have developed staff expertise with big data tools and techniques are well positioned to serve as trusted advisors to business users and thus become the partner of choice for big data, irrespective of delivery platform.
4. Don’t ignore resiliency
At the most abstract level, big data entails taking one data set and applying analytical tools and techniques to produce another data set. With the heavy focus on analytical tools and techniques, it is easy to forget that
these data sets, like any data, are subject to loss, which renders any tools moot.
Any big data program should account for backup and recovery, beginning with the decision of what to back up. The volume of raw source data makes it impractical to back up everything, and the real-time nature of some data sets means they may not require backup at all. Consideration should also be given to the output of big data analytics, including whether analysis results can be recreated if they, or the source data, are lost. In many cases, resiliency must extend beyond mere recoverability to include high availability, particularly when performing real-time analytics, when even a brief outage can seriously impact the business. Both recoverability and high availability are considerably easier to address before a disaster strikes, and a little advance planning can go a long way toward building a resilient big data program.
5. Be prepared for the worst
Even the most secure organizations can suffer data breaches, and every company should take the time to develop a breach response plan that includes provisions for handling a big data breach. Specifically, the plan should include breach notification requirements, which may not be as obvious as those for a breach of transactional data. Breach plans should also account for the logistics of containing, investigating, and reporting a breach that is massive in scale and may be distributed across multiple locations and vendors.
Finally, be prepared for post-breach damage control. As with any breach, companies can expect to suffer financial and reputation losses and increased scrutiny from regulators, all of which are amplified by the volume of data involved. Reputation damage can also extend beyond the direct effect of disclosure of confidential data records. As the National Security Agency recently learned, a breach may result in a previously secret big data program becoming known to the public. Even if your use of big data is comparatively innocuous, be prepared to explain it if it involves personal or other private data (Gray, 2013).
References
Barbaro, Michael; Tom Zeller, Jr.; and Saul Hansell [2006]. “A Face Is Exposed for AOL Searcher No. 4417749,” The New York Times, August 9. http://select.nytimes.com/gst/abstract.html?res=F10612FC345B0C7A8CDDA10894DE404482
Cloud Security Alliance [2013]. “Expanded Top Ten Big Data Security and Privacy Challenges.” https://cloudsecurityalliance.org/download/expanded-top-ten-big-data-security-and-privacy-challenges/
Gray, Patrick [2013]. “The NSA and Big Data,” TechRepublic, July 11. http://www.techrepublic.com/blog/big-data-analytics/the-nsa-and-big-data/
Manyika, James, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers [2011]. “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, May. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
Okman, Lior, Nurit Gal-Oz, Yaron Gonen, Ehud Gudes, and Jenny Abramov [2011]. “Security Issues in NoSQL Databases,” Ben-Gurion University, Beer-Sheva, Israel; 2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11.
Ponemon Institute [2013]. “2013 Cost of Data Breach Study: Global Analysis,” Ponemon Institute LLC, May. https://www4.symantec.com/mktginfo/whitepaper/053013_GL_NA_WP_Ponemon-2013-Cost-of-a-Data-Breach-Report_daiNA_cta72382.pdf
SPE Board [2007]. “Standards Pertaining to the Estimating and Auditing of Oil and Gas Reserves Information.” http://www.spe.org/industry/docs/Reserves_Audit_Standards_2007.pdf
SpiderLabs Radio [2013]. “Mongodb - Security weaknesses in a typical NoSQL database,” March 15. http://blog.spiderlabs.com/2013/03/mongodb-security-weaknesses-in-a-typical-nosql-database.html
Sqrrl [2013]. “Securely Explore Your Data.” http://sqrrl.com/product/accumulo/
Strong, Colin [2013]. “The big data arms race part two: consumer perceptions,” The Guardian, October 4. http://www.theguardian.com/media-network/media-network-blog/2013/oct/04/consumer-marketing-big-data-perceptions
U.S. Department of Health and Human Services [2013]. “Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.” http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
Ward, Jonathan S., and Adam Barker [2013]. “Undefined By Data: A Survey of Big Data Definitions,” School of Computer Science, University of St. Andrews, September 20. http://arxiv.org/pdf/1309.5821v1.pdf