The Growing Significance Of DevOps For Data Science

Posted on: Nov 06, 2018

Data science and machine learning are often associated with mathematics, statistics, algorithms and data wrangling. While these skills are core to implementing machine learning successfully in an organization, one function is gaining importance: DevOps for data science.

DevOps involves infrastructure provisioning, configuration management, continuous integration and deployment, testing and monitoring. DevOps teams have worked closely with development teams to manage the lifecycle of applications effectively.

Data science brings additional responsibilities to DevOps. Data engineering, a niche domain that deals with the complex pipelines that transform data, demands close collaboration between data science teams and DevOps. Operators are expected to provision highly available clusters of Apache Hadoop, Apache Kafka, Apache Spark and Apache Airflow that tackle data extraction and transformation. Data engineers acquire data from a variety of sources before using Big Data clusters and complex pipelines to transform it.

Data scientists explore transformed data to find insights and correlations. They use a different set of tools including Jupyter Notebooks, Pandas, Tableau and Power BI to visualize data. DevOps teams are expected to support data scientists by creating environments for data exploration and visualization.

Building machine learning models is fundamentally different from traditional application development. The development is not only iterative but also heterogeneous. Data scientists and developers use a variety of languages, libraries, toolkits and development environments to evolve machine learning models. Popular languages for machine learning development such as Python, R and Julia are used within development environments based on Jupyter Notebooks, PyCharm, Visual Studio Code, RStudio and Juno. These environments must be available to data scientists and developers solving ML problems.

Machine learning and deep learning demand massive compute infrastructure running on powerful CPUs and GPUs. Frameworks such as TensorFlow, Caffe, Apache MXNet and Microsoft CNTK exploit GPUs to perform the complex computations involved in training ML models. Provisioning, configuring, scaling and managing these clusters is a typical DevOps function. DevOps teams may have to create scripts to automate the provisioning and configuration of the infrastructure for a variety of environments. They will also need to automate the termination of instances when the training job is done.
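
The provision-then-terminate automation described above can be sketched in a few lines of Python. This is only a minimal illustration, not any cloud provider's API: `FakeCluster`, `training_cluster` and `run_training_job` are hypothetical names invented for this example, and a real script would replace the two methods on `FakeCluster` with calls to whatever provisioning tool the team actually uses. What the sketch shows is the pattern of tying termination to the end of the training job so that it runs even when the job fails.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class FakeCluster:
    """Hypothetical stand-in for a handle to a provisioned GPU cluster."""
    name: str
    gpu_nodes: int
    running: bool = False
    events: List[str] = field(default_factory=list)

    def provision(self) -> None:
        # A real script would call the provider's provisioning tooling here.
        self.running = True
        self.events.append(f"provisioned {self.gpu_nodes} GPU nodes")

    def terminate(self) -> None:
        # A real script would release the instances here.
        self.running = False
        self.events.append("terminated")


@contextmanager
def training_cluster(name: str, gpu_nodes: int) -> Iterator[FakeCluster]:
    """Provision a cluster on entry and always terminate it on exit."""
    cluster = FakeCluster(name, gpu_nodes)
    cluster.provision()
    try:
        yield cluster
    finally:
        # Runs whether the job succeeds or raises, so no instances are left idle.
        cluster.terminate()


def run_training_job(train: Callable[[FakeCluster], str],
                     gpu_nodes: int = 4) -> FakeCluster:
    """Run one training job on a short-lived cluster and return its record."""
    with training_cluster("ml-train", gpu_nodes) as cluster:
        result = train(cluster)
        cluster.events.append(f"job finished: {result}")
    return cluster
```

Calling `run_training_job(lambda cluster: "ok", gpu_nodes=2)` returns a cluster record whose `running` flag is `False` and whose last event is the termination. Using a context manager keeps the cleanup next to the provisioning step, which is one common way to avoid paying for idle GPU instances after a job crashes.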