
Data Requirements for Machine Learning

Posted on Sep 14, 2018


Machine learning algorithms consume and process large volumes of data to learn complex patterns about people, business processes, transactions, events, and so on. This intelligence is then incorporated into a predictive model. Comparisons to the model can reveal whether an entity is operating within acceptable parameters or is exhibiting an anomaly.
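As a minimal sketch of that comparison step, the Python below fits a deliberately simple model (a mean and a standard deviation) to historical values and flags new values that fall far outside it. The function names, the sample transaction amounts, and the threshold of three standard deviations are illustrative assumptions; real predictive models are usually far richer.

```python
from statistics import mean, stdev


def fit_model(training_values):
    """Learn a simple 'normal behavior' model: mean and sample standard deviation."""
    return {"mean": mean(training_values), "stdev": stdev(training_values)}


def is_anomaly(model, value, threshold=3.0):
    """Return True when the value lies more than `threshold` standard deviations from the mean."""
    if model["stdev"] == 0:
        # All training values were identical, so any different value is unusual.
        return value != model["mean"]
    z_score = abs(value - model["mean"]) / model["stdev"]
    return z_score > threshold


# Hypothetical daily transaction amounts used as training data (assumed values).
history = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 103.1, 100.0]
model = fit_model(history)

typical = is_anomaly(model, 100.9)   # close to the learned mean
unusual = is_anomaly(model, 250.0)   # far outside the learned range
```

Here `typical` comes out false and `unusual` comes out true, which is the "within acceptable parameters" versus "anomaly" distinction described above.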

Today, machine learning is used to solve well-bounded tasks such as classification and clustering. A machine learning algorithm learns from training data during development; in many deployments it also continues to learn from real-world data so that it can improve its model with experience.
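One way to picture learning during both development and deployment is a model that can be updated one observation at a time. The sketch below keeps a running mean and variance using Welford's online algorithm; the class name `RunningModel` and the sample values are assumptions made for illustration, and the technique stands in for whatever incremental learner a real system would use.

```python
class RunningModel:
    """Running mean and sample variance, updated one observation at a time (Welford's algorithm)."""

    def __init__(self):
        self.n = 0          # number of observations seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the model without storing past data."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until at least two observations exist."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


running = RunningModel()

# "Development": learn from an initial batch of training data (assumed values).
for value in [2.0, 4.0, 4.0, 4.0]:
    running.update(value)

# "Deployment": keep refining the same model as real-world data arrives.
for value in [5.0, 5.0, 7.0, 9.0]:
    running.update(value)
```

Because each update needs only the current statistics, the same object can serve both phases, which is the property that makes continuous learning demanding on data pipelines rather than on storage of past observations.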

Machine learning has a voracious appetite for data during both development and production, making unique demands of an organization's infrastructure for data management.

Data Requirements for Successful Machine Learning

#1: Large, diverse data sets

The development of a machine learning algorithm depends on large volumes of data, from which the learning process draws many entities, relationships, and clusters. To broaden and enrich the correlations made by the algorithm, machine learning needs data from diverse sources, in diverse formats, about diverse business processes.

For the most comprehensive learning experience, you should provide diverse training data -- integrated from multiple sources and concerning various business entities, collected across multiple time frames -- to make algorithmic assessments more real-world, accurate, and successful in production. Once in production, a machine learning algorithm continues to read large, diverse data sets to keep its model up-to-date and growing.
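To make the idea of integrating training data from multiple sources concrete, the following sketch joins records from two hypothetical systems (a customer list and a transaction log) into one training row per customer. All field names, the sample records, and the `build_training_rows` helper are assumptions for illustration only.

```python
from collections import defaultdict


def build_training_rows(customers, transactions):
    """Combine customer attributes with per-customer transaction aggregates."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for txn in transactions:
        agg = totals[txn["customer_id"]]
        agg["count"] += 1
        agg["amount"] += txn["amount"]

    rows = []
    for customer in customers:
        # Customers with no transactions still get a row, with zeroed aggregates.
        agg = totals.get(customer["customer_id"], {"count": 0, "amount": 0.0})
        rows.append({
            "customer_id": customer["customer_id"],
            "segment": customer["segment"],
            "txn_count": agg["count"],
            "txn_total": agg["amount"],
        })
    return rows


# Source 1: a hypothetical customer system.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "wholesale"},
]

# Source 2: a hypothetical transaction log.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
]

training_rows = build_training_rows(customers, transactions)
```

Each resulting row carries attributes from both sources, so a learning algorithm can correlate customer segment with transaction behavior instead of seeing either in isolation.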

Savvy organizations are deploying tools for multiple types of analytics (not just machine learning), because each type tells them something unique and valuable. Each of these analytics approaches needs data that is prepared and presented in a certain way that is optimal for the analytics tool or the user practice involved. Machine learning algorithms are almost always optimized for raw, detailed source data. Thus, the data environment must provision large quantities of raw data for discovery-oriented analytics practices such as data exploration, data mining, statistics, and machine learning.

#2: Large, diverse infrastructure for data management

Infrastructure for training data for machine learning typically involves multiple data platforms, tools, and processing engines, ranging from traditional (relational and columnar databases) to modern (Hadoop, Spark, and cloud storage). Multiple technologies are required to cope with training data's extreme size, multiple data structures, and (in some cases) multiple latencies. Tools for machine learning are obviously important, but data management infrastructure is just as important.

There are many ways to provision training and production data for machine learning. This data can come from multiple platforms in the extended data infrastructure, but the trend is toward consolidating as much data as possible into a data lake designed for machine learning and other forms of advanced analytics. In a related trend, data lakes are moving toward elastic clouds for reasons of automation, optimization, and economics.

Data management infrastructure can be vast. It can include platforms and tools for data warehousing, data lakes, data integration, data preparation, multiple forms of analytics, and big data. New data platforms are emerging as well, dominated by clouds, open source engines, open source libraries and languages, and self-service tools. That is a long list of platforms, technologies, and processing engines, but all of it is required for modern organizations that want to operate and compete on analytics and intelligence.

Finally, when organizations already have big data infrastructure in place, adding machine learning extends the life cycle and business value of the infrastructure.
