We have seen great leaps in digital technology in the past five years. Smartphones, cloud computing, and multi-touch tablets are all innovations that have revolutionized the way we live and work. Yet, believe it or not, we are just getting started: technology will soon get even better. In the future, we could live much like people do in science fiction movies. Tremendous advancements are occurring today in areas that will shape real-life products and change the way we experience life.
One thing is common across all these innovations and the many others still evolving: data. They will either produce immense amounts of data or need limitless access to it.
With these technological advancements and the gigabytes of data produced every second, there is a clear need to support this rapid growth of data both efficiently and economically. Enterprises and researchers alike are working on innovative solutions to tackle the problem. The conventional method of storing data on independent storage arrays, backed by additional arrays for availability and disaster recovery (DR), cannot sustain the requirements of the future. The fundamental issues with traditional storage architectures are as follows:
- Rehydration of deduplicated data: There has recently been much discussion around implementing deduplication on primary storage. Vendors such as EMC, with their Data Domain product, have set expectations of 20:1, 32:1, even 50:1 deduplication ratios. That said, can you really achieve this on a primary storage system built on a traditional architecture without taking a hit on performance? The practical limitation is rehydration: the extra IOPS needed to reassemble deduplicated blocks when the data is read back.
- Performance-related issues: Most storage alternatives seem to address capacity, but isn't performance also an issue? Performance translates to two things: the IOPS, or I/O capability, of the back end, and the throughput of the front end. The problems really manifest in the back-end IOPS. Traditionally we extract more performance from an array by adding more drives, because the more spindles you have, the more IOPS you get. But this leads to gross inefficiency in capacity: even many large-scale organizations see capacity utilization of only around 20% to 40%, which means their cost per gigabyte is two to three times what it really should be.
- Data resiliency: The way data is protected and stored on today's systems is also being re-examined. A threefold approach is most commonly adopted to achieve data resiliency:
- One: protection at the disk level,
- Two: protection at the array/controller level,
- Three: protection at the datacenter level.
Protection at the disk level is often provided by RAID (redundant array of independent disks). For array-level or datacenter-level protection you need data mirroring (in NetApp terminology) or data replication (in EMC and HDS terminology) between two or more arrays. This scheme introduces two problems. One is the parity overhead associated with it; the other is that more and more capacity is added to what I would term a nonproductive environment, where the only thing a parity drive does is sit there waiting for something to fail!
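The deduplication ratios quoted above translate into physical capacity with simple division. A minimal sketch (the 100 TB logical data set is a hypothetical figure, not from the text):

```python
# A minimal sketch of deduplication arithmetic; the 100 TB logical data
# set is a hypothetical figure used only for illustration.

def physical_tb(logical_tb: float, dedupe_ratio: float) -> float:
    """Physical capacity consumed for a logical data set at a given ratio."""
    return logical_tb / dedupe_ratio

for ratio in (20, 32, 50):
    print(f"{ratio}:1 stores 100 TB logical in {physical_tb(100, ratio):.1f} TB physical")
```

The catch, as noted, is that every read of that compacted data pays the rehydration cost in IOPS.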
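The spindle-count and utilization arithmetic from the performance bullet can be sketched as follows; the 180 IOPS-per-spindle figure is an illustrative assumption (roughly a 10K RPM SAS drive), not a number from the text:

```python
# Back-of-the-envelope sketch of spindle IOPS and utilization cost.
# The 180 IOPS-per-spindle default is an illustrative assumption.

def backend_iops(spindles: int, iops_per_spindle: int = 180) -> int:
    """More spindles -> more aggregate back-end IOPS."""
    return spindles * iops_per_spindle

def effective_cost_per_gb(raw_cost_per_gb: float, utilization: float) -> float:
    """At low utilization, every usable GB carries the cost of idle raw GBs."""
    return raw_cost_per_gb / utilization
```

Doubling spindles doubles IOPS, but at 40% utilization each usable gigabyte costs 2.5 times its raw price, and at 33% roughly 3 times, which is exactly the two-to-three-fold inflation described above.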
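The parity and mirroring overhead just described can also be put in numbers; the 12-drive shelf of 4 TB disks below is a hypothetical configuration:

```python
# Sketch of parity plus mirroring overhead; the 12-drive shelf of 4 TB
# disks is a hypothetical configuration, not from the text.

def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """RAID-6 dedicates two drives' worth of capacity to parity."""
    return (drives - 2) * drive_tb

def mirrored_raw_tb(drives: int, drive_tb: float) -> float:
    """Replicating to a second identical array doubles the raw capacity bought."""
    return 2 * drives * drive_tb

usable = raid6_usable_tb(12, 4)   # 40 TB usable per array
raw = mirrored_raw_tb(12, 4)      # 96 TB of raw capacity purchased
# raw / usable == 2.4: well over half the disks bought are "waiting to fail".
```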
So how do we get a storage architecture that is fast, cost-effective, and available? The answer may well lie in the architectures and technologies adopted by Facebook, Google, and other such cloud providers. A fascinating example is Facebook's datacenter, which delivers almost 2 PB of storage per rack using just 2 kW of power. So how do they do it? Here is the explanation:
Tiered Storage, With a Twist: A tiered storage solution, an approach that has existed for the past three to four years, could meet today's storage requirements. Tiered storage is a strategy that organizes stored data into categories based on priority – typically hot, warm, and cold storage – and then assigns each category to a different type of storage media to reduce cost. Storage vendors use their own algorithms, built into their storage system's firmware or OS, to perform this task. Rarely used data is typically shifted to cheaper hardware or tape archives, a move that saves money but comes with a tradeoff: these archives may not be available instantaneously. As an example, Amazon's new Glacier cold storage is cheap, but it takes 3 to 5 hours to retrieve files. So what about fast access to the data? And we are talking here about near real-time speed. Organizations are using custom software that categorizes the data and shifts it between different storage tiers. By doing this, processing of data is handed over to a dedicated computing device, and the storage acts like a JBOD (Just a Bunch Of Disks).
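A tiering policy of this kind can be sketched as a simple classification by last-access age. The 7-day and 90-day thresholds below are illustrative assumptions, not any vendor's actual algorithm:

```python
# Minimal sketch of a hot/warm/cold tiering policy driven by last-access
# age; the 7-day and 90-day thresholds are illustrative assumptions.
from datetime import datetime, timedelta

def assign_tier(last_access: datetime, now: datetime,
                hot_days: int = 7, warm_days: int = 90) -> str:
    """Return the storage tier an object belongs to, by access recency."""
    age = now - last_access
    if age <= timedelta(days=hot_days):
        return "hot"    # e.g. flash, near real-time access
    if age <= timedelta(days=warm_days):
        return "warm"   # e.g. capacity disk
    return "cold"       # e.g. archive tier such as Amazon Glacier
```

Custom tiering software of the kind described above amounts to running a policy like this continuously and migrating objects between media as their classification changes.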