Speaker "Doohee You" Details

 

Topic

Silhouettes; A cheap computational graphical approach to the validation of big data classification.

Abstract

Demand for computationally cheap validation of unsupervised classification is increasing, driven by the fast-growing need for data-driven business decisions. The most commonly accepted graphical validation method is the silhouette coefficient, which is expensive to compute because it relies on pairwise distances between observations. Hence, this paper suggests a computationally cheap and innovative application of a cluster validation method. Method: The mean and SD of each cluster (N=3,777,481, k=5) produced by an unsupervised classification method are plotted on a one-dimensional graph, variable by variable, as layered histograms to identify overlap between clusters, which indicates poor classification. Histogram overlap is equivalent to a negative silhouette coefficient. Results: Graphical validation of cluster classification performance took less than a quarter (µ=18.03%, SD=0.02) of the time required when using a silhouette coefficient plot. The method provides a graphical aid for understanding the distribution of the mean and SD and shows the overlap of each cluster, so that classification validity can be assessed using a fraction of the computational resources. Conclusions: A cluster value distribution graph driven by mean and SD allows significantly faster classification validation than the silhouette coefficient method. Impact: Reducing computation time and power will allow faster validation of unsupervised classification, supporting prompt decision-making in fast-moving, data-driven business.
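The idea in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the function names, the toy data, and the rule of flagging clusters whose mean ± 1 SD intervals intersect are all assumptions made for illustration, standing in for the visual inspection of layered histograms. The sketch summarizes each cluster by its mean and SD for a single variable, which takes one pass over the data, and then compares only the k cluster summaries rather than every pair of observations.

```python
from itertools import combinations
from statistics import mean, pstdev


def cluster_summary(values, labels):
    """Return {cluster_label: (mean, sd)} for a single variable."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: (mean(vs), pstdev(vs)) for label, vs in groups.items()}


def overlapping_pairs(summary, width=1.0):
    """List cluster pairs whose mean +/- width*SD intervals intersect.

    Assumption: this interval rule is only an illustrative proxy for
    spotting overlap between layered histograms by eye.
    """
    pairs = []
    for a, b in combinations(sorted(summary), 2):
        (mean_a, sd_a), (mean_b, sd_b) = summary[a], summary[b]
        low = max(mean_a - width * sd_a, mean_b - width * sd_b)
        high = min(mean_a + width * sd_a, mean_b + width * sd_b)
        if low <= high:
            pairs.append((a, b))
    return pairs


# Hypothetical toy data: one variable, three clusters.
values = [1.0, 1.2, 0.8, 1.1, 1.3, 0.9, 5.0, 5.2, 4.8]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

summary = cluster_summary(values, labels)
flagged = overlapping_pairs(summary)
print(flagged)  # clusters 0 and 1 overlap; cluster 2 is separated
```

In this toy example, clusters 0 and 1 are flagged as overlapping on the variable, which under the abstract's reading would suggest poor separation between them, while cluster 2 stands apart. Repeating the summary for each variable gives one such check per dimension.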

Profile

Received a Ph.D. from UC Berkeley in 2013 and worked at the World Health Organization on various forecasting analyses, applying big data and data mining techniques to data collected from all over the world. Developed an innovative data visualization solution at WHO and was named a top-3 finalist in a 2015 UN data visualization competition. Since 2017, has worked at Hulu in advanced strategic analytics, pursuing precise, computationally inexpensive, and resource-saving data science methods.