Data Growth Solutions Using Hadoop


We now understand that the world is drowning in data. An estimated 15 petabytes of new information are created every day, eight times more than the information held in all the libraries in the United States. This year, the amount of digital information generated is expected to reach 988 exabytes, equivalent to a stack of books reaching from the Sun to Pluto and back.1

Gartner agrees that data growth is now the leading data center infrastructure challenge: in a recent survey, 47 percent of respondents ranked data growth as their number one challenge.2

Data growth can strip entire data centers of cooling and power capacity. System availability suffers as batch processes can no longer meet scheduled completion times. The “outage windows” needed to convert data during ERP upgrade cycles may stretch from hours to days, and other critical processes such as replication and disaster recovery are affected because ever-larger data sets are ever harder to move.

Additionally, unchecked data growth creates governance, risk and compliance challenges. HIPAA, PCI DSS, FISMA and SAS 70 mandates all require that organizations establish frameworks for data security and compliance, and Information Lifecycle Management (ILM) programs are needed to meet compliance objectives throughout the data lifecycle.

Advances in semiconductor technology have enabled impressive new solutions for data growth by using “commodity” hardware to process and store extraordinary amounts of data at lower unit costs. Through virtualization, this new low-cost infrastructure may now be utilized with extraordinary efficiency.

Apache Hadoop is designed to leverage this powerful, low-cost infrastructure to deliver massive scalability. Using the MapReduce programming model to process large data sets in parallel across distributed compute nodes, Hadoop provides a highly efficient and cost-effective bulk data storage solution. Such capabilities enable compelling new big data applications, such as enterprise archiving and the data lake, and establish a new enterprise blueprint for data management at petabyte scale.
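The MapReduce model described above can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the canonical word-count job, not Hadoop itself: the map phase emits key/value pairs, the framework shuffles them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values observed for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "big clusters"]))
# → {'big': 2, 'data': 1, 'clusters': 1}
```

In a real Hadoop job the map and reduce functions run on different nodes, with HDFS providing the storage layer and the framework handling the shuffle across the network; the programming model, however, is exactly this simple.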

Experts agree that as much as 80 percent of production data in ERP, CRM, file servers and other mission-critical applications may not be in active use, and both structured and unstructured data become less active as they age. Large amounts of inactive data kept online too long reduce the performance of production applications, increase costs and create compliance challenges.

Enterprise archiving and data lake applications built on big data offer low-cost bulk storage alternatives to keeping inactive enterprise data online. By moving inactive data to nearline storage, application performance improves and costs drop because data sets are smaller and workloads more manageable. Data access is maintained through analytics applications, structured query and reporting, or simple text search.

Big data is driving a new enterprise blueprint that enables organizations to gain more value from their data. Enterprise data warehouse (EDW) and analytics applications leverage big data for richer views of critical information. As a low-cost repository for copies of enterprise data, big data is an ideal platform for staging critical enterprise data for later use by EDW and analytics applications.



No Excuse for Non-Production Database Breaches

It is reported that over 500 million credit card records have been breached since 2005.  Say what?  500 million?

Last week it was reported that 400,000 clear-text passwords were breached from Yahoo in a single day. Yahoo spokeswoman Dana Lengkeek said "an older file" had been stolen.  Such a breach suggests Yahoo did not mask its non-production data.

Who cares?  Well, Trusted ID reports an identity is stolen every 4 seconds and there are over 10 million identity theft victims in the US.  Furthermore, the average cost to restore a stolen identity is $8,000, and victims spend an average of 600 hours recovering from this crime.

The fact is that too many organizations still have not taken adequate steps to protect non-production data and comply with the Payment Card Industry Data Security Standard (PCI DSS).  Non-production databases often store sensitive data cloned from production, and this data is often left unprotected on development servers, laptops and test instances.  PCI DSS Requirement 3.4 mandates that stored cardholder data be protected “anywhere it is stored,” yet non-production databases are often overlooked in security plans.

Data must be protected where it lives: in the database. It is not surprising that so many attacks target non-production databases.  Non-production data is a soft target because there is far more of it and fewer controls are in place.  Furthermore, high-profile thefts reveal that insiders often do the most damage.  Sometimes the culprit is a disgruntled employee, but more often sensitive test data inadvertently ends up on a stolen laptop, is lost through outsourcing, or is simply misplaced because of weak or nonexistent controls.

Data masking has emerged as a best practice to protect non-production data because, unlike encryption, masking supports the entire application development lifecycle.  Data masking removes personally identifiable information, such as a person’s name and account, credit card, or Social Security number, and transforms it into contextually accurate, albeit fictionalized, data.  By obfuscating the information, data masking de-identifies it.  And because masked data is no longer confidential, it is acceptable for use in non-production environments such as application development.
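As a purely illustrative sketch (not Solix's actual algorithm), a masking routine for credit card numbers might replace all but the last four digits with random digits while preserving the length and separators, so the masked value remains "contextually accurate" enough for application testing:

```python
import random

def mask_card_number(pan, keep_last=4):
    """Replace all but the last `keep_last` digits with random digits,
    preserving length and separators so the value stays format-valid."""
    digits = [c for c in pan if c.isdigit()]
    masked = [str(random.randint(0, 9)) for _ in digits[:-keep_last]] + digits[-keep_last:]
    it = iter(masked)
    # Rebuild the original layout, substituting digits and keeping separators.
    return "".join(next(it) if c.isdigit() else c for c in pan)

print(mask_card_number("4111-1111-1111-1111"))  # e.g. '7305-2918-4460-1111'
```

Because the output still looks like a card number, downstream validation logic in test environments keeps working, yet the original cardholder data cannot be recovered from it.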

The Data Masking Process

Yet despite these best practices, many companies still do not mask non-production data.  One problem is that “small and mid-sized companies don’t have dedicated security staff to manage complicated security systems. For products to help small companies they’ve got to be dead simple to use, automate basic security functions and save them time. The product has to make their jobs easier, not be their job,” according to Adrian Lane, CTO at Securosis.

In the absence of an existing control process or tools to mask sensitive data, database administrators must create and maintain scripts.  But it makes little sense to build and maintain a masking tool set in house when so many other priorities compete for scarce DBA resources.  Furthermore, enterprise IT organizations must demonstrate the ability to mask data consistently across all application environments; compliance objectives require a clear and consistent process.
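The "consistently across all application environments" requirement is what home-grown scripts tend to miss. One common way to achieve it, sketched below under the assumption of a shared secret key (the key name and token length here are illustrative), is deterministic pseudonymization with a keyed hash: the same input always produces the same masked token, so joins and foreign keys remain intact across cloned dev, test and QA instances.

```python
import hashlib
import hmac

# Illustrative only: in practice the key would come from a secrets vault,
# never be hard-coded, and be shared by every environment that must agree.
MASKING_KEY = b"shared-secret-for-all-envs"

def pseudonymize(value, length=10):
    """Deterministically map a sensitive value to a fixed-length token.
    The same input yields the same token wherever the key is the same."""
    digest = hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).hexdigest()
    return digest[:length]

# The same SSN masks identically in every cloned environment:
assert pseudonymize("123-45-6789") == pseudonymize("123-45-6789")
```

The trade-off is that deterministic tokens are linkable across records by design; that is exactly what preserves referential integrity, but it means the key must be protected as carefully as the data it masks.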

To the relief of many, free database security software tools are finally making an overdue market entrance.  MySQL audit plugins, free vulnerability scanners and now free data masking solutions are widely available.  Even more important, these free downloads are designed from the ground up to deploy fast and be easy to use.

Certain free tools may impose vendor restrictions on deployment and usage, but free database security software still represents a better, faster, cheaper way to start protecting non-production data.  So there really is no excuse for a non-production database breach!

Editorial Note: This week Solix released Solix EDMS Data Masking Standard Edition, a free download enabling sensitive data to be masked across non-production instances of enterprise applications.  The software may be downloaded and fully deployed in minutes through an easy-to-use four-step deployment wizard.


Pescatore, John. “High-Profile Thefts Show Insiders Do the Most Damage.” Gartner Group, November 2002.

© Solix Technologies, Inc.