DATA VIRTUALIZATION – WHAT’S REAL?
Meenakshinathan Padmanabhan & Arvind Handuu
Unless you’ve been living under a rock for the past few years, or have
chosen not to look at a printed word, you’d have been told, often enough
to leave you physically fatigued, that data is what you produce and that
data is the new oil: the real cure for all that ails us. That once we
organize it and give it a more well-rounded experience and learning, the
world would be a better place. My cat is now afraid of Data; it’s data’s
world and not the cat’s, and all of us, the cat included, just rent it.
To be fair, though, the impact of data (its availability, regeneration,
volume, age, lineage, utility, diversity, value, and reach), while at times
overstated, is significant enough that organizations are well served by
continually scanning the environment for opportunities to create value.
In a 2018 Forbes survey, Data Virtualization ranked among the top three
highest-growth areas for 2018/2019.
This post is an attempt to simplify the concepts behind Data
Virtualization. So let’s take it from the top.
So, what is Data Virtualization? What is a good use case for it? And
what are the corner cases where the approach fails?
Data Management, especially in applications that require after-the-fact
analysis of institutional data assets, is an ever-moving target. Just as a
seemingly effective governance model is implemented and the initial set of
questions is being answered, new questions arise, often requiring new
information and data sources to be incorporated into the answer base. That
means reworking the data warehouse, introducing the new data source, and
applying the same rigor to ensure data hygiene. And so an expensive build,
add, analyze cycle repeats.
Data Virtualization offers a near-term reprieve from this cycle by making
it easy to introduce new data sources quickly. It has the potential to be
THE solution for less complex data environments, and at least a resilient
intermediate solution for applications with higher data complexity.
Data virtualization is the process of offering data consumers a data
access interface that hides the technical aspects of stored data, such
as location, storage structure, API, access language, and storage
technology.
Data virtualization creates integrated views of data drawn from disparate
sources, locations, and formats, without replicating the data, and delivers
these views, in real time, to multiple applications and users. It is any
approach to data management that allows an application to retrieve and
manipulate data without requiring technical details about the data, such as
how it is formatted at the source or where it is physically located, and it
can provide a single customer view (or single view of any other entity) of
the overall data.
Data virtualization can draw from a wide variety of structured,
semi-structured, and unstructured sources, and can deliver to a wide
variety of consumers. Because no replication is involved, the data
virtualization layer contains no source data; it contains only the metadata
required to access each of the applicable sources, as well as any global
instructions the organization may want to implement, such as security or
governance controls.
The concept, and the software that implements it, is a subset of data
integration and is commonly used within business intelligence,
service-oriented architecture data services, cloud computing, enterprise
search, and master data management.
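To make the metadata-only idea concrete, here is a minimal sketch of such a
layer in Python. It is illustrative only, not any vendor’s API; the source
registry, fetch callables, and the masking rule are hypothetical stand-ins
for real connectors and governance controls.

    # Minimal, illustrative data virtualization layer. The layer stores only
    # metadata (how to reach each source), never the data itself; rows are
    # fetched from the live sources at query time.
    from dataclasses import dataclass
    from typing import Callable, Dict, Iterable, List

    @dataclass
    class SourceMeta:
        """Connection metadata only -- no rows are stored in the layer."""
        location: str                            # e.g. a JDBC URL or REST endpoint
        fetch: Callable[[str], Iterable[dict]]   # pulls live rows for an entity

    class VirtualizationLayer:
        def __init__(self) -> None:
            self.sources: Dict[str, SourceMeta] = {}   # metadata, never data

        def register(self, name: str, meta: SourceMeta) -> None:
            self.sources[name] = meta

        def query(self, entity: str) -> List[dict]:
            """Deliver an integrated, real-time view across all sources."""
            rows: List[dict] = []
            for meta in self.sources.values():
                rows.extend(meta.fetch(entity))        # fetched live, not replicated
            return [self._govern(r) for r in rows]

        @staticmethod
        def _govern(row: dict) -> dict:
            """One global control, e.g. masking a sensitive field."""
            if "ssn" in row:
                row = {**row, "ssn": "***-**-" + str(row["ssn"])[-4:]}
            return row

    # Usage: register two hypothetical sources, then query the combined view.
    layer = VirtualizationLayer()
    layer.register("crm", SourceMeta("jdbc:crm", lambda e: [{"id": 1, "ssn": "123456789"}]))
    layer.register("erp", SourceMeta("https://erp/api", lambda e: [{"id": 1, "balance": 250.0}]))
    print(layer.query("customer"))

The design choice the sketch mirrors is the one described above: the layer
registers how to reach each source and which global controls to apply,
while every row is pulled live at query time.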
The concept was initially incorporated in various business intelligence
tools, Qlik, Spotfire, and Tableau to name a few. The obvious limitation
was the close coupling between the virtual data store and the choice of
analytical (at the time, mainly data visualization) tools. That meant the
limitations of the analytical tool defined the extent to which the data
could be utilized. The graphic below represents the data virtualization
approach of one of the leading solution vendors in this technology, Denodo.
Image Courtesy: Denodo
Our teams have taken the position that for very small database volumes and
relatively clean data sources, data virtualization is an effective
solution, enabling a federated data structure and quick analytics delivery.
As data complexity increases, however, organizations need more disciplined
data governance practices, effected in a data-warehouse-led analytics
platform. In such cases a virtualized database solution is best utilized as
a rapid Proof of Concept to test the various source systems.
We find data virtualization highly effective in the following use cases:
‣ Generally structured data sources with easy-to-define relationships.
Referring to the promise stated earlier in this article, Data
Virtualization really does deliver on the data integration front. Whether
one needs data from a mobile application or from hundreds of domains and
other web technologies, Data Virtualization consolidates all of it into a
single solution.
‣ Data virtualization supports the integration of structured and
semi-structured data, and works seamlessly with the likes of Hadoop and
MapReduce (a minimal sketch of such a federated view follows this list).
‣ Rapid analytics delivery OR short-term proof-of-concept solutions. Unlike
some massive Data Management solutions, Data Virtualization can be
implemented at an unnervingly rapid rate: it can be layered onto existing
infrastructure in a matter of weeks to months. Some Data Virtualization
adopters have reported an ROI turnaround of less than six months.
‣ Direct exposure to the source applications. A key reason for adopting
data virtualization is the ability to incorporate operational data in real
time.
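As promised in the list above, here is a minimal, hypothetical sketch of a
federated view: a structured source (an in-memory SQL table standing in for
a relational system) and a semi-structured source (a JSON document standing
in for an API feed) are related at query time by an easy-to-define key,
with nothing copied into a warehouse first. The table, fields, and records
are invented for illustration.

    # Illustrative federated join across a structured and a semi-structured
    # source, composed at query time rather than loaded into a warehouse.
    import json
    import sqlite3

    # Structured source: customer master data in a relational table.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(1, "Acme Corp"), (2, "Globex")])

    # Semi-structured source: clickstream events arriving as JSON.
    events = json.loads('[{"customer_id": 1, "page": "/pricing"},'
                        ' {"customer_id": 2, "page": "/docs"}]')

    # The "virtual view": relate the two sources by the easy-to-define
    # relationship customer_id -> id; no data is replicated anywhere.
    names = {cid: name for cid, name in db.execute("SELECT id, name FROM customers")}
    view = [{"customer": names[e["customer_id"]], "page": e["page"]} for e in events]
    print(view)  # [{'customer': 'Acme Corp', 'page': '/pricing'}, ...]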
While the above might appear compelling, data virtualization falls short in
the following key application areas:
‣ Historical and lineage-tracking applications, e.g. Slowly Changing
Dimension Type I / Type II problem areas. Because the virtualization layer
stores no data of its own, it can only reflect the current state of each
source; organizations that need to analyze data as it was days, weeks, or
even months ago are better served by a data warehouse (see the sketch after
this list).
‣ Data Virtualization can impose a great deal of stress on an
organization’s operations, often with significant overhead: every query is
served from the live sources, and any change to the virtual layer must be
integrated and distributed to every user and application across your entire
infrastructure. This can be a huge financial and logistical strain on your
environment.
‣ Overall effectiveness. Data virtualization solutions can be deceptively
difficult, and their effectiveness in managing real-time data delivery can
be underwhelming. The expectation gap usually occurs when an organization
assumes that, because it is using a powerful Data Virtualization solution,
it no longer has to manage its own data.
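The first shortfall is easiest to see with a small, hypothetical example. A
virtual layer reads the live source, which holds only the current value of
each attribute; a Slowly Changing Dimension Type II table in a warehouse
persists every version with effective dates, which a metadata-only layer
cannot reconstruct after the fact. The records below are invented for
illustration.

    # Why history tracking needs a warehouse: the live source, as a virtual
    # layer sees it today, has already lost the prior value.
    source_now = {"customer_id": 7, "region": "EMEA"}

    # SCD Type II rows as a warehouse would have captured them over time.
    scd2_dim = [
        {"customer_id": 7, "region": "APAC", "valid_from": "2016-01-01",
         "valid_to": "2018-06-30", "is_current": False},
        {"customer_id": 7, "region": "EMEA", "valid_from": "2018-07-01",
         "valid_to": "9999-12-31", "is_current": True},
    ]

    def region_as_of(rows, customer_id, as_of):
        """Answer 'what was true then?' -- only possible with stored history."""
        for r in rows:
            if (r["customer_id"] == customer_id
                    and r["valid_from"] <= as_of <= r["valid_to"]):
                return r["region"]

    print(region_as_of(scd2_dim, 7, "2017-03-15"))  # APAC: the warehouse knows
    # The virtual layer can only ever answer with source_now["region"]: EMEA.

In short, the warehouse can answer “what was true then?”; the virtual layer
can only answer “what is true now.”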
In the Data Management space there are very few, if any, magic bullets.
Data Virtualization is an effective Swiss Army knife in a data architect’s
or solution strategist’s toolkit. While data virtualization is far from
perfect today, the overall market is evolving at a rapid rate to provide
access to real-time, easily managed data. It should not be the sole mode of
capturing, interpreting, and managing BI data, but the virtualized data
warehouse is an effective strategy to create business value and introduce
additional data sources into the analytics framework.
Meenakshinathan (Nathan) Padmanabhan is a Sr. Data Solutions
Architect at Visvero, Inc. He has been supporting various F2000
clients in deploying effective data management, Business
Intelligence, and Analytics solutions for over 20 years. Nathan is
based out of Visvero, Pittsburgh.
Arvind Handuu is a Practice Manager for Business Intelligence &
Analytics at Visvero. Arvind is an analytics value purist. He
believes that a BI & Analytics platform should be a
self-contained and sovereign solution. “Its value drops to zero the
instant you are using a different data source to inform your
decision.” Arvind is based out of Visvero, Pittsburgh.