Speaker "Thanh Tran" Details

Topic

How to Rebuild Data and ML Platform using Kinesis, S3, Spark, MLlib, Airflow and Upwork

Abstract

Upwork has the biggest closed-loop online dataset of jobs and job seekers in labor history. Besides its sheer quantity, our data is also contextually very rich: we have client and contractor data for the entire job funnel, from finding jobs to getting the job done. For various machine learning applications, including search and recommendations and labor marketplace optimization (rate, supply and demand), we relied heavily on a Greenplum-based data warehouse for data processing and on ad-hoc ML pipelines (Weka, scikit-learn, R) for offline model development and online model scoring. In this talk, we present our modernization efforts in moving towards 1) a holistic data processing infrastructure for batch and stream data processing using S3, Kinesis, Spark and Spark Structured Streaming, 2) model development using Spark MLlib and other ML libraries for Spark, 3) model serving using Databricks Model Scoring, Scoring over Structured Streams and microservices, and 4) orchestration and streamlining of all these processes using Apache Airflow and a CI/CD workflow customized to our Data Science product engineering needs. The focus of this talk is on how we were able to reduce DevOps overhead and costs, complete the entire modernization with moderate effort, and adopt a collaborative notebook-based solution that lets all our data scientists develop models, reuse features and share results. We will share the core lessons learned and the pitfalls we encountered during this journey.

Profile

Thanh is Director of Data Science for Upwork, where he leads all aspects of search & recommendations, knowledge graph and data science infrastructure. Reporting to the SVP of Engineering, he pioneered the adoption of a data lake, a streaming-based live data pipeline and microservice-based model serving. Towards improving welfare through optimized labor marketplace exchange & distributed work, he and his team are working on state-of-the-art research & engineering solutions in multi-sided matching, learning-to-rank, recommender systems, knowledge graph construction, semantic search and scaling live data processing for context-rich real-time model scoring. Previously, Thanh served as CTO for Lyfeline and as a consultant for various Bay Area startups, where he helped to pioneer machine-learning-based mobile and chatbot applications. As a professor at the Karlsruhe Institute of Technology and Stanford University, he led a top research group worldwide in semantic search and helped successfully build and shape a community around this topic by organizing conferences and workshops. He is passionate about the opportunity to improve welfare for job providers and job seekers through Data Science research and applications for a world of distributed work.