Minimizing the Complexities of Machine Learning with Data Virtualization

Posted on: Sep 21, 2018

Data lakes have become the principal data management architecture for data science. A data lake's primary role is to store raw structured and unstructured data in one central location, making it easy for data scientists and other investigative and exploratory users to analyze data.

The data lake can store vast amounts of data affordably. It can potentially store all data of interest to data scientists in a single physical repository, making discovery easier. The data lake can reduce the time data scientists spend on data selection and data integration by storing data in its original form, avoiding transformations designed for specific tasks. The data lake also provides massive computing power so data can be efficiently transformed and combined to meet the needs of each process.
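As a rough illustration of that last point, the sketch below uses PySpark to transform and combine raw data stored in a lake into a feature set for a specific ML process. The paths, column names, and schemas are hypothetical, and PySpark simply stands in for whatever compute engine runs on the lake.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; the lake's compute engine does the heavy lifting.
spark = SparkSession.builder.appName("feature_prep").getOrCreate()

# Hypothetical raw data stored in its original form in the lake.
clicks = spark.read.json("s3://lake/raw/clickstream/")      # semi-structured
customers = spark.read.parquet("s3://lake/raw/customers/")  # structured

# Task-specific transformation: aggregate clicks per customer,
# then join with customer attributes to build ML features.
features = (
    clicks.groupBy("customer_id")
          .agg(F.count("*").alias("click_count"),
               F.countDistinct("page_id").alias("pages_visited"))
          .join(customers, "customer_id")
)

# Persist the prepared feature set for this one ML process.
features.write.mode("overwrite").parquet("s3://lake/features/churn/")
```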

However, when it comes to applying machine learning (ML) in the enterprise, most data scientists still struggle with the complexities of data discovery and integration. In fact, a recent study revealed that data scientists spend as much as 80 percent of their time on these tasks.

Why Challenges Remain

In the same way that it is not easy to find a specific person in a crowded stadium, having all your data in the same physical place does not necessarily make discovery easy. In addition, only a small subset of the relevant data tends to be stored in the lake, because replicating data from the source systems is slow and costly. Further complicating matters, many companies have hundreds of data repositories distributed across multiple on-premises data centers and cloud providers.

When it comes to data integration, storing data in its original form does not remove the need to adapt it to the needs of each machine learning process. Rather, it simply shifts that burden to the data scientists. And although the required processing capacity may be available in the lake, data scientists usually lack the skills needed to integrate data.

Some data preparation tools have emerged in the past few years to make simple integration tasks accessible to data scientists. However, more complex tasks still require advanced skills. IT often needs to come to the rescue by creating new data sets in the data lake for specific ML processes, drastically slowing progress.

Data Virtualization Benefits

To address these challenges, organizations have started to adopt new approaches such as data virtualization (DV). DV provides a single access point to any data -- no matter where it is located and no matter its native format -- without first replicating it in a central repository.

The DV layer can also provide different logical views of the same physical data without creating additional replicas. This provides a fast and inexpensive way of offering different views of the data to meet the unique needs of each type of user and application. These logical views can be created by applying complex data transformation and combination functions on top of the physical data, using sophisticated optimization techniques to achieve the best performance.
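To make the idea of a logical view concrete, here is a minimal Python sketch of the pattern a DV layer implements: data from two physically separate sources is combined at query time, without either source being copied into a central repository. The source locations, table names, and columns are hypothetical, and a real DV platform would add caching and query optimization that this sketch omits.

```python
import sqlite3

import pandas as pd


def customer_360_view() -> pd.DataFrame:
    """A logical view: combines two physically separate sources
    on demand, returning a single virtual table. No replica of
    either source is persisted. (Illustrative only; sources and
    schemas are hypothetical.)"""
    # Source 1: an on-premises relational database.
    with sqlite3.connect("onprem_crm.db") as conn:
        accounts = pd.read_sql(
            "SELECT customer_id, segment FROM accounts", conn
        )

    # Source 2: Parquet files in a cloud object store
    # (reading s3:// paths assumes the s3fs package is installed).
    orders = pd.read_parquet("s3://cloud-bucket/orders/")

    # Transformation and combination happen at query time, so each
    # user or application can be given its own tailored view.
    totals = orders.groupby("customer_id", as_index=False)["amount"].sum()
    return accounts.merge(totals, on="customer_id", how="left")


# Consumers query the view as if it were a single table.
df = customer_360_view()
```

The design point the sketch captures is that the view is a definition, not a dataset: nothing is materialized until a consumer asks for it, which is what lets a DV layer expose many tailored views of the same physical data at little extra cost.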