Organizations often keep collecting data without a clear vision of how, if at all, they will utilize all the information they gather. Especially with the advent of social media and the Internet of Things (IoT), an enormous amount of data is being generated, and organizations are hoarding as much of it as possible in the hope that someday it will yield handsome returns.
This practice of collecting and storing information from disparate sources has given rise to the concept of Data Lakes. Just as a lake is a depression in the land where water collects from various sources without necessarily flowing out, a Data Lake can be thought of as a place where data gets dumped from various sources. It is kept largely in its native form, to be analyzed and furnished as and when needed by the business.
While the idea of a huge pool of unprocessed information at your organization's disposal, ready to be processed and analyzed whenever needed, initially sounds appealing, organizations are fast realizing that keeping data stagnant in a Data Lake is not the most efficient way of storing it. One reason: with Data Lakes, businesses are likely to end up spending a lot of time preparing data for analysis at precisely the moment they need ready-to-analyze data to make mission-critical decisions. All this translates to additional cost and ultimately makes for a poor data management strategy.
So instead of a stagnant Data Lake, organizations should consider moving to a “Data Reservoir”.
Derived from the original “lake” metaphor, a Data Reservoir is a pool of data that allows for ad hoc discovery, organization, and enrichment of data before it progresses to more advanced analytics tools.
Big data reservoirs, for instance, are large catchment areas where all kinds of data can be stored and analyzed, making them a natural foundation for the information systems of tomorrow. Because the data volumes are large, the velocity high, and the variety huge, data reservoirs need to be cost-effective, fast, and flexible to be viable. Hadoop then becomes a natural choice for building such big data reservoirs.
There are some obvious business benefits to having a Data Reservoir compared to a Data Lake. As mentioned earlier, Data Reservoirs, with their ready-for-analysis data, save organizations a lot of time and money. Data Reservoirs are also inherently better suited for predictive analytics, and are better profiled, more efficient, compliant, and cost-effective. As big data technologies mature, we will discover new applications of big data, and over time the technology barriers will be lowered, making it even easier and more cost-effective to support complex big data requirements quickly. An important step in this direction is careful management of the vast variety of data, keeping it in a ready-for-analysis form for extended periods of time, a goal best achieved through Data Reservoirs.
There are many ways to build Data Reservoirs, and organizations must choose between creating fresh Data Reservoirs and converting their existing Data Lakes into Data Reservoirs. Though both approaches have their own pros and cons, it is advisable to follow a framework-driven approach in either case.
The benefits of using a good framework include time savings, flexibility in adapting to various data formats, and the ability to scale up and down quickly in response to varying data volumes. Some essential properties of a good framework include, but are not limited to:
- Built-in Reporting and Analytics: The framework should have built-in capabilities such as reporting and analytics, so it can perform entry-level processing and analysis before the data is sent on to more sophisticated applications for deeper analysis.
- Built-in Data Governance capabilities: As Data Reservoirs deal with a vast variety of data types, they need strong data governance capabilities to ensure that data quality and integrity are maintained across the variety, velocity, and veracity of this data.
- Robustness: The ability to combine multiple data sources and assimilate them into a single reservoir of data, as well as to move the data in the reservoir through various stages, performing necessary operations such as ETL to make it ready for use, and finally routing the data where required.
- Self-service: Given the fast-paced world of today’s business, agility is the key to staying on top. Hence, an indispensable property of a Data Reservoir framework is the capability to respond quickly to changing requirements. A framework that allows customers to modify its structure as per business requirements with minimal dependency on the service provider can have far-reaching business benefits.
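To make the "Robustness" property above concrete, the flow it describes — assimilating records from disparate sources, moving them through staged transformations (a toy ETL with a basic governance check), and routing them to downstream destinations — can be sketched in plain Python. This is a minimal illustration under assumed, made-up source names and routing rules, not the API of any particular reservoir product:

```python
# Sketch of a reservoir-style pipeline: extract -> transform -> route.
# All source names, fields, and routing rules are illustrative only.

def extract(sources):
    """Assimilate records from disparate sources into one staging list."""
    staged = []
    for name, records in sources.items():
        for record in records:
            staged.append({"source": name, **record})
    return staged

def transform(staged):
    """Normalize types and drop records failing a basic quality check."""
    cleaned = []
    for record in staged:
        value = record.get("value")
        if value is None:           # governance: reject incomplete records
            continue
        cleaned.append({
            "source": record["source"],
            "value": float(value),  # normalize type for downstream analytics
        })
    return cleaned

def route(cleaned, rules):
    """Send each record to the destination chosen by its routing rule."""
    destinations = {dest: [] for dest in rules.values()}
    for record in cleaned:
        destinations[rules[record["source"]]].append(record)
    return destinations

# Hypothetical sources: raw web events and IoT sensor readings.
sources = {
    "web_clicks":  [{"value": "3"}, {"value": None}],
    "iot_sensors": [{"value": "21.5"}],
}
rules = {"web_clicks": "reporting", "iot_sensors": "analytics"}
result = route(transform(extract(sources)), rules)
print(result)
```

In a real deployment each stage would be backed by distributed storage and processing (e.g. Hadoop, as noted above), but the staged shape of the flow is the same: data enters in native form, is cleaned and enriched once, and leaves ready for analysis.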
Is your organization still working with Data Lakes? Are you planning to build a Data Reservoir? We would love to hear from you about your experiences with both.