Facebook’s Expanding Machine Learning Infrastructure
Posted on: Jan 09, 2018

Here at The Next Platform, we tend to keep a close eye on how the major hyperscalers evolve their infrastructure to support massive scale and ever more complex workloads.

Not so long ago the core services were relatively standard transactions and operations. But with the addition of training and inference against complex deep learning models, something that requires a two-handed approach to hardware, the hyperscale hardware stack has had to quicken its step to keep pace with the new performance and efficiency demands of machine learning at scale.

While not innovating on the custom hardware side in quite the same way as Google, Facebook has shared some notable progress in fine-tuning its own datacenters. From its unique split network backbone and neural network-based visualization system to large-scale upgrades to its server farms and its work honing GPU use, there is plenty to focus on infrastructure-wise. For us, one of the more prescient developments from Facebook is its own server designs, which now serve over 2 billion accounts as of the end of 2017, specifically its latest GPU-packed Open Compute-based approach.

The company’s “Big Basin” system, unveiled at the OCP Summit last year, is a successor to the first-generation “Big Sur” machine that the social media giant unveiled at the Neural Information Processing Systems conference in December 2015. As we noted at the release in a deep dive into the architecture, the Big Sur machine crammed eight of Nvidia’s Tesla M40 accelerators, which slide into PCI-Express 3.0 x16 slots and each have 12 GB of GDDR5 frame buffer memory for CUDA applications to play in, along with two “Haswell” Xeon E5 processors, into a fairly tall chassis. Since then, the design has been extended to support the latest Nvidia Volta V100 GPUs.

Facebook also claims that, compared with Big Sur, the newer V100-based Big Basin platform delivers much better performance per watt, benefiting from single-precision floating-point arithmetic per GPU “increasing from 7 teraflops to 15.7 teraflops, and high-bandwidth memory (HBM2) providing 900 GB/s bandwidth (3.1x of Big Sur).” The engineering team notes that half-precision throughput was also doubled with the new architecture to further improve throughput.
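To put those figures side by side, here is a quick back-of-the-envelope check in Python, a minimal sketch rather than anything from Facebook: the 288 GB/s number is the Tesla M40’s published GDDR5 bandwidth, which the article does not state, while the other values come from the quotes above.

    # Per-GPU spec ratios behind the Big Sur vs. Big Basin comparison.
    # Note: the M40's 288 GB/s memory bandwidth is an outside figure, not from the article.
    m40  = {"fp32_tflops": 7.0,  "mem_bw_gbps": 288.0}   # Big Sur (Tesla M40)
    v100 = {"fp32_tflops": 15.7, "mem_bw_gbps": 900.0}   # Big Basin (Tesla V100)

    fp32_gain = v100["fp32_tflops"] / m40["fp32_tflops"]  # ~2.2x arithmetic throughput
    bw_gain   = v100["mem_bw_gbps"] / m40["mem_bw_gbps"]  # ~3.1x, matching the quoted figure

    print(f"FP32 throughput gain: {fp32_gain:.2f}x")
    print(f"Memory bandwidth gain: {bw_gain:.2f}x")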

“Big Basin can train models that are 30 percent larger because of the availability of greater arithmetic throughput and a memory increase from 12 GB to 16 GB. Distributed training is also enhanced with the high-bandwidth NVLink inter-GPU communication,” the team adds.
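The memory side of that claim is easy to sanity-check. Assuming model size is bounded mainly by per-GPU memory, an assumption for illustration rather than something the article spells out, the jump from 12 GB to 16 GB gives roughly a third more headroom, in line with the quoted 30 percent:

    # Per-GPU memory headroom from Big Sur (12 GB, M40) to Big Basin (16 GB, V100).
    big_sur_mem_gb, big_basin_mem_gb = 12, 16
    print(f"Memory headroom: {big_basin_mem_gb / big_sur_mem_gb - 1:.0%}")  # ~33%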

Facebook says the shift to “Big Basin” has led to a 300 percent improvement in throughput over Big Sur on ResNet-50, as one example, and that while the team is pleased with these results, it is still evaluating new hardware designs and technologies.