Pig vs. Hive: Is There a Fight? Posted on : Oct 11 - 2016
Pig and Hive came into existence out of enterprises' need to work with huge amounts of data without writing complex MapReduce code. Though both were born out of that necessity, they have come a long way and now run on top of other Big Data processing engines such as Spark. Both components of the Hadoop ecosystem provide a layer of abstraction over these core execution engines. Hive was invented to give people something that looked like SQL and would ease the transition from an RDBMS. Pig takes a more procedural approach; it was created so people didn't have to write MapReduce in order to manipulate data.
When to Harvest Benefits from Hive
Apache Hive is a terrific Big Data component when it comes to data summarization and extraction. It is an ideal tool for data that already has a schema associated with it. In addition, the Hive metastore supports partitioning data on user-specified conditions, which makes data retrieval faster. However, one should be careful about using an excessive number of partitions in a single query, because that can lead to either of the following issues:
- An increase in the number of partitions in a query means that the number of paths associated with them also increases. Suppose a use case has to run a query over a table with 10,000 top-level partitions, each of which contains further nested partitions. While translating the query into a MapReduce job, Hive sets the paths of all the partitions in the job configuration, so the number of partitions directly affects the size of the job. Since the default jobconf size is 5 MB, exceeding that limit causes a runtime failure such as: "java.io.IOException: Exceeded max jobconf size: 14762369 limit: 5242880".
- Bulk registration of partitions (for example, 10,000 × 100,000 partitions) via "MSCK REPAIR TABLE tablename" also has its restrictions, owing to the Hadoop heap size and the GC overhead limit.
- Extensively complex multi-level operations such as joins over numerous partitions have their limits as well. Big queries might fail when the Hive compiler performs semantic validation against the metastore. Because the Hive metastore is essentially a SQL schema store, large queries can fail with an exception like "com.mysql.jdbc.PacketTooBigException: Packet for query is too large".
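One way around the bulk-registration limit above is to add partitions in smaller batches with explicit ALTER TABLE statements rather than a single MSCK REPAIR TABLE over the whole directory tree. A minimal HiveQL sketch, assuming an illustrative table `sales` partitioned by `dt` (the table, partition values, and paths are not from the original article):

```sql
-- Illustrative only: register a batch of partitions explicitly instead of
-- having MSCK REPAIR TABLE scan and register everything in one call.
ALTER TABLE sales ADD IF NOT EXISTS
  PARTITION (dt='2016-10-01') LOCATION '/data/sales/dt=2016-10-01'
  PARTITION (dt='2016-10-02') LOCATION '/data/sales/dt=2016-10-02';
```

Batching keeps each metastore operation well under the heap and GC limits mentioned above, at the cost of scripting the partition list yourself.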
The above properties, such as the jobconf size, Hadoop heap size, and packet size, are all configurable. Still, to avoid these issues, put the emphasis on a better schema and query design rather than on frequently raising the limits.
The optimum benefit of Hive comes from a systematic schema design over the data residing in HDFS. This may mean an acceptable number of partitions, each holding a large chunk of data, rather than an excessive number of partitions with little data in each. After all, partitioning exists to make queries over specific data faster by eliminating the need to scan the entire dataset. Reducing the number of partitions minimizes the load on the metastore and maximizes resource utilization across the cluster.
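As a sketch of such a design, the following HiveQL partitions a table at a coarse granularity (year and month) so a typical query touches only a handful of partition paths. The table, columns, and storage format are assumptions for illustration, not taken from the article:

```sql
-- Illustrative schema: partition by month (coarse) rather than by day or
-- customer id (fine), keeping the total partition count manageable.
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (yr INT, mon INT)
STORED AS ORC;

-- Partition pruning: only the paths under yr=2016/mon=10 are placed in the
-- job configuration, so jobconf size stays small.
SELECT customer_id, SUM(amount)
FROM sales
WHERE yr = 2016 AND mon = 10
GROUP BY customer_id;
```

With one partition per month, even a decade of data yields only 120 partitions, each holding a large chunk of data, in line with the design advice above.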
When to Make the Pig Grunt
Apache Pig has a huge appetite: it can consume all sorts of data, whether structured, semi-structured, or unstructured. Unlike Hive, it has no metastore of its own, but it can leverage Hive's HCatalog. Pig was created to run complex, extensible operations on large datasets, and it can optimize itself on the fly: even though a Pig script reads as a sequence of steps, multiple operations are combined at execution time, which reduces the number of data scans.
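A minimal Pig Latin sketch of reading a Hive table through HCatalog follows (run with `pig -useHCatalog`). The table and field names are illustrative, and the HCatLoader class path shown is the one used by recent Hive releases:

```pig
-- Illustrative Pig script reusing a Hive table's schema via HCatalog.
sales   = LOAD 'default.sales'
          USING org.apache.hive.hcatalog.pig.HCatLoader();
oct     = FILTER sales BY yr == 2016 AND mon == 10;
by_cust = GROUP oct BY customer_id;
totals  = FOREACH by_cust GENERATE group AS customer_id, SUM(oct.amount);
STORE totals INTO '/tmp/sales_totals';
```

Because the schema comes from HCatalog, the script needs no AS clause on the LOAD, and the FILTER on partition columns can still be pushed down by the loader.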
Let's revisit the situation with the 10,000 partitions from the Hive example, this time using Pig on the same dataset. Since no metastore is involved, the concept of partitioning does not apply to Pig on its own.
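Without HCatalog, Pig simply loads the raw files. A hedged sketch, assuming comma-delimited files under an illustrative `/data/sales` directory layout, where a path glob and a FILTER take the place of Hive's partition pruning:

```pig
-- Illustrative only: Pig reads the HDFS files directly; without a metastore,
-- "partitioning" is just a path glob plus filtering on loaded fields.
raw    = LOAD '/data/sales/dt=2016-10-*' USING PigStorage(',')
         AS (order_id:long, customer_id:long, amount:double);
recent = FILTER raw BY amount > 0.0;
DUMP recent;
```

The glob restricts which directories are scanned, so even over a heavily partitioned layout Pig only reads the files the script actually names.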