Speaker: Benoy Antony



Detect Sensitive Data in Hadoop Clusters


Organizations store massive amounts of data in Hadoop clusters, and some of that data may contain sensitive information that is not sufficiently protected. The sensitive information could be Personally Identifiable Information (PII), such as Social Security numbers, or financial information, such as credit card numbers. Organizations need to continuously monitor for the presence of sensitive information in Hadoop clusters to meet security and compliance requirements. Detecting sensitive information in a Hadoop cluster is challenging because of the sheer volume of data and the variety of storage formats. In this presentation, we will cover methods for detecting sensitive data in a Hadoop cluster. We will see how to use YARN applications to scan large amounts of data for sensitive information, and we will identify best practices for scanning data stored in different file formats. Once sensitive data is identified, it has to be protected, so we will also review the options available in Hadoop for protecting sensitive information.
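
As a rough illustration of what detecting Social Security numbers and credit card numbers can look like at the record level, here is a minimal sketch in Python. It is not the method presented in the talk: the patterns, the function names (`luhn_valid`, `find_sensitive`) and the choice of a Luhn checksum to filter card-number candidates are all assumptions made for this example. In a distributed scan, a function like this would typically run inside each task over the records it reads.

```python
import re

# Hypothetical patterns for illustration only; a production scanner would need
# stricter rules and more formats to keep false positives under control.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(line: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for candidate sensitive values in a line."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(line)]
    for m in CARD_RE.finditer(line):
        digits = re.sub(r"[ -]", "", m.group())
        # Keep only candidates of plausible length that pass the checksum.
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            hits.append(("card", m.group()))
    return hits
```

The checksum step is a design choice to cut down on false positives: many long digit runs in log or transaction data look like card numbers but fail the Luhn check.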


Benoy Antony is an Apache Hadoop Committer and has contributed features related to security and HDFS. He is the founder of DataApps, a company that specializes in creating applications for Big Data. He maintains a Hadoop Security wiki. Benoy is a Hadoop Architect at eBay, where he focuses on enhancing security and availability on eBay's Hadoop clusters without limiting user productivity. He regularly speaks at conferences such as Hadoop Summit.