Speaker "Claudiu Barbura" Details

Topic

Overkill Analytics on High Dimensional Feature Spaces

Abstract

In our quest for data science automation, we have learned many lessons, which I will share in this session.

Expect fewer slides and more demos featuring real-world use cases, such as predicting destination ports for oil ships and the Outbrain Kaggle competition, all performed from DSL Workbench, the notebook we built for exploratory data analysis. DSL is the fluent, expressive API we created to expose data and services from our data science platform.

I will compare multiple approaches to feature engineering, feature reduction and full-feature-space training using OKA (OverKill Analytics) techniques. Where spark.ml/spark.mllib could not perform on high-dimensional sparse feature spaces, we used Spark to distribute scikit-learn, VW, TensorFlow and R packages, producing ensemble models and prediction tables that still yield highly accurate predictions.

I will cover and show concrete examples of geo-spatial, composite and progressive modeling, deep learning, and high-dimensional sparse feature engineering, as well as the primitives we built for handling sparse data beyond what Spark or scipy support.
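To make "high dimensional and sparse feature engineering" concrete, here is a minimal sketch of one common technique, the hashing trick. It is not taken from the session or from the DSL platform. The function names, the bucket count and the sample tokens are all hypothetical. Standard-library Python only.

```python
import hashlib
from collections import defaultdict


def hash_features(tokens, n_buckets=2 ** 20):
    """Map raw categorical tokens to a sparse {index: count} vector.

    Illustrative helper (hypothetical name): only non-zero entries are
    stored, so memory grows with the number of tokens, not with n_buckets.
    """
    vec = defaultdict(float)
    for tok in tokens:
        # md5 is used only as a stable, process-independent hash function.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % n_buckets
        vec[idx] += 1.0
    return dict(vec)


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b.get(i, 0.0) for i, v in a.items())


# Hypothetical tokens in the style of a click-prediction data set.
row = hash_features(["ad_id=123", "country=US", "platform=mobile"])
weights = {i: 0.5 for i in row}  # toy linear model over the same buckets
score = sparse_dot(row, weights)
```

A fixed bucket count caps the width of the feature space regardless of how many distinct categories appear, at the cost of occasional collisions between unrelated tokens.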

While I’ll focus on data science at scale, I will also touch on infrastructure, with tips and tricks we learned from the underlying technology stack: Scala, Python, Spark, HDFS, Cassandra, ElasticSearch, Zookeeper, VW, TensorFlow, etc.

Profile

Claudiu is Director of Engineering at Blueprint Technologies, where he oversees Product Engineering and builds large-scale advanced analytics pipelines, IoT and data science applications for customers in the oil & gas, energy and retail industries. He was formerly VP of Engineering at Ubix.io, automating data science at scale, and Sr. Director of Engineering, xPatterns Platform Services at Atigeo, where he built several advanced analytics platforms and applications for the healthcare and financial industries. Claudiu is a hands-on architect, dev manager and executive with 20+ years of experience in Open Source, Big Data Science and Microsoft technology stacks, and a frequent speaker at data conferences.