Enterprise application data tiering with Hadoop

With digital transformation under way in enterprises across the world, every CIO wants to know whether their infrastructure can handle the resulting data growth. In fact, Gartner research found that 47% of respondents ranked data growth as the #1 infrastructure challenge for their data centers.

Data Growth Crisis

When data sets become too large, application performance slows and infrastructure struggles to keep up. Data growth drives up cost and complexity across the board, affecting data center capacity, performance, availability, maintenance, and compliance.

System availability suffers as batch processes can no longer meet scheduled completion times. The outage windows needed to convert data during ERP upgrade cycles may stretch from hours to days. Other critical processes, such as replication and disaster recovery, are affected as well, because larger data volumes take far longer to move and copy.

Left unchecked, data growth also creates governance, risk, and compliance challenges. GDPR, CCPA, HIPAA, PCI DSS, FISMA, and SAS 70 mandates all require organizations to establish frameworks for data security and compliance. With enormous volumes of data generated every day and shared across the enterprise, staying compliant with these regulations becomes increasingly difficult.

Gartner has also stated that as much as 80% of the data in a typical production portfolio may be inactive, needlessly hindering application performance, increasing costs, and causing outages and compliance concerns. So how can data be managed so that inactive data doesn't clog the infrastructure and impact critical processing?

Data tiering stats

One correlation we can draw is that the value of data is inversely proportional to its age: historic or inactive data is accessed and processed less, and so carries less value than newer data. So why should such inactive data continue to clog production environments?
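
To make that correlation concrete, here is a minimal sketch of how the inactive share of a data set might be estimated by age. It assumes the data carries a last_accessed timestamp; the column name, the three-year cutoff, and the sample values are purely illustrative.

```python
import pandas as pd

def inactive_share(df, cutoff_years=3):
    """Fraction of rows whose last access is older than the cutoff."""
    cutoff = pd.Timestamp.now() - pd.DateOffset(years=cutoff_years)
    return (df["last_accessed"] < cutoff).mean()

# Synthetic example data
orders = pd.DataFrame({
    "order_id": range(6),
    "last_accessed": pd.to_datetime([
        "2016-01-10", "2017-06-01", "2018-03-15",
        "2023-11-02", "2024-02-20", "2024-05-05",
    ]),
})

# Roughly half of this sample is inactive, depending on today's date
print(f"Inactive share: {inactive_share(orders):.0%}")
```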

Implementing an effective ILM strategy will help

Information Lifecycle Management (ILM) is a data management best practice for managing data throughout its lifecycle, from creation to eventual deletion and disposal.

The goals of ILM are:

  • Optimize application performance
  • Manage data security, risk, and compliance
  • Reduce infrastructure costs
  • Reduce maintenance time and costs
  • Manage data compliance (e.g., GDPR)
  • Analyze data and generate analytics reports
  • Manage streamed data (real-time feeds such as Twitter)
  • Data preparation
  • Extract, Transform and Load

ILM achieves these goals by moving data to the most appropriate infrastructure tier, based on retention policies such as the age of the data. Because older data is accessed less frequently, it is less valuable and less deserving of limited tier-one performance and capacity. The data moved spans structured records as well as documents, files, images, and emails drawn from sources such as databases, SharePoint, NFS, and CIFS shares.

Tier-one infrastructure is high cost and may include multi-processor servers with large flash memory arrays and high-speed storage area networks. Data positioned on tier-one infrastructure should ideally be three years old or less. Older, less active data should be assigned to lower-cost infrastructure tiers to reduce overall costs while still providing proper access to the data, albeit not at tier-one performance levels.
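
As a minimal sketch of such an age-based tiering rule: the three-year tier-one guideline comes from the text above, while the tier names and the seven-year archive threshold are illustrative assumptions.

```python
from datetime import datetime, timezone

def assign_tier(last_accessed, now=None):
    """Map a record's last-access time to a storage tier by age."""
    now = now or datetime.now(timezone.utc)
    age_years = (now - last_accessed).days / 365.25
    if age_years <= 3:
        return "tier-1"   # flash arrays / high-speed SAN
    if age_years <= 7:
        return "tier-2"   # Hadoop / HDFS bulk storage (assumed threshold)
    return "tier-3"       # object store archive (assumed)

# A record last touched six years ago lands on the second tier
print(assign_tier(datetime(2019, 1, 1, tzinfo=timezone.utc),
                  now=datetime(2025, 1, 1, tzinfo=timezone.utc)))  # -> tier-2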

The new age storage alternatives for inactive data

Apache Hadoop is a free, open-source computing framework designed to run on low-cost, lower-tier infrastructure while still delivering massive scalability and performance. It combines highly scalable workload processing with very low-cost bulk data storage. Hadoop leverages commodity hardware and a distributed compute model to process large data sets in parallel on the Hadoop Distributed File System (HDFS). The net result is that Hadoop offers dramatic cost savings over traditional tier-one infrastructure.
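
As a rough illustration of landing an archived extract in HDFS, the sketch below uses Hadoop's WebHDFS REST API. The NameNode host, port, and target path are assumptions, and authentication (e.g., Kerberos) is omitted.

```python
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed WebHDFS endpoint
HDFS_PATH = "/archive/erp/orders_2016.parquet"  # assumed target path

def put_to_hdfs(local_file, hdfs_path):
    # Step 1: ask the NameNode where to write; it replies with a 307
    # redirect pointing at a DataNode.
    url = f"{NAMENODE}/webhdfs/v1{hdfs_path}?op=CREATE&overwrite=true"
    r = requests.put(url, allow_redirects=False, timeout=30)
    r.raise_for_status()
    datanode_url = r.headers["Location"]

    # Step 2: stream the file contents to the DataNode location.
    with open(local_file, "rb") as f:
        r = requests.put(datanode_url, data=f, timeout=300)
    r.raise_for_status()

put_to_hdfs("orders_2016.parquet", HDFS_PATH)
```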

Object stores, on the other hand, scale into the petabyte range and overcome the limitations of traditional file system storage architectures at a fraction of the cost. An object store lets an organization keep large volumes of data in the cloud or on-premise, and improves resilience by ensuring high availability of the data. Objects held in the store can be retrieved, visualized, and searched by context or text using big data analytics tools.
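
A comparable sketch for an object store, using the AWS boto3 SDK: the bucket name, object key, and choice of storage class are assumptions for illustration, not a prescribed setup.

```python
import boto3

s3 = boto3.client("s3")

# Push the same archive extract to a lower-cost, infrequent-access tier
with open("orders_2016.parquet", "rb") as f:
    s3.put_object(
        Bucket="example-enterprise-archive",  # assumed bucket
        Key="erp/orders_2016.parquet",        # assumed key
        Body=f,
        StorageClass="STANDARD_IA",           # one of several archival-oriented classes
    )
```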

Consider the following comparison:

According to Monash Research, the cost of tier-one database infrastructure is over $60,000 per TB. By comparison, 1 TB of S3 bucket storage at Amazon Web Services (US West, Northern California) costs $26 per month according to their recent price list. This means Hadoop is essentially 64X cheaper than tier-one infrastructure.


Data tiering explained

Enterprise applications such as ERP, CRM, and HCM represent an excellent opportunity to improve performance and reduce costs through application data tiering.

Enterprise Archiving follows an ILM approach to improve performance and reduce costs by supporting 3 processing tiers:

[Figure: Data tiering explained]

The benefits of enterprise application data tiering are significant: improved infrastructure performance, reduced costs, and higher availability. By positioning data based on its business value, infrastructure utilization becomes more efficient while appropriate access is preserved.

Solix Common Data Platform – next-generation data tiering and management platform for the modern data-driven organization

The Solix Common Data Platform is a uniform data collection, retention management, data tiering, and bulk data storage solution for structured and unstructured data. The Solix CDP features enterprise archiving, data lake, data governance, and advanced analytics applications to help organizations achieve data-driven business results.

The Solix CDP implements the Information Lifecycle Management (ILM) framework, archiving or migrating data to the appropriate tier based on business rules, the age of the data, and its value. It also provides data governance to meet risk and compliance objectives and to ensure that best practices for data retention and classification are deployed. ILM policies and business rules may be pre-configured to meet industry-standard compliance objectives or custom-designed to meet more specific requirements. To ensure data security, the Solix Common Data Platform (CDP) discovers and classifies sensitive data, masking or encrypting it based on business rules. Role-based access is also supported for data access at the record level.
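
As a generic illustration of rule-based masking (not the Solix CDP API), the sketch below hashes fields flagged as sensitive before records move to a lower tier. The field list and the hashing scheme are assumptions made for the example.

```python
import hashlib

SENSITIVE_FIELDS = {"ssn", "email"}   # assumed output of a classification rule

def mask_record(record):
    """Replace sensitive values with a one-way hash; keep other fields as-is."""
    masked = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            masked[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[field] = value
    return masked

print(mask_record({"order_id": 42, "email": "user@example.com", "amount": 99.5}))
```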