Speaker: Alex Perrier

Topic

Large data in Python with Scikit-learn and Dask

Abstract

Although scikit-learn is optimized for small data, its out-of-core features enable the data scientist to work with large data, i.e. data that does not fit in the computer's memory. I'll present the scikit-learn algorithms compatible with this batch training approach and their respective performance on large datasets. However, data munging remains a time-consuming problem when dealing with large data. This is where Dask, a Python library, comes in. By breaking operations into sequences that can be parallelized, Dask addresses the pre-processing part of the large data problem.
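
As a rough illustration of the batch training approach mentioned above, here is a minimal sketch using scikit-learn's partial_fit on an SGDClassifier. It is not taken from the talk: the iter_batches generator is a hypothetical stand-in for reading chunks of a file too large for memory, and the synthetic data, batch sizes, and choice of estimator are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_batches(n_batches=20, batch_size=500, seed=0):
    """Hypothetical data source: yields one (X, y) mini-batch at a time.

    In a real out-of-core setting this would read successive chunks
    from disk instead of generating synthetic points.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 2))
        # Synthetic, linearly separable labels (assumption for the demo).
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)

# partial_fit needs the full set of class labels up front, because a
# single batch is not guaranteed to contain every class.
classes = np.array([0, 1])

# Only one batch is held in memory at any time.
for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a separate batch drawn with a different seed.
X_test, y_test = next(iter_batches(n_batches=1, batch_size=1000, seed=1))
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

The model is updated incrementally, so memory use depends on the batch size rather than on the size of the whole dataset. Only estimators that implement partial_fit can be trained this way.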

Profile

Data Scientist at Berklee Online, contributor @ODSC, PhD in signal processing