How big is Facebook's AI operation?

2018-06-07 11:55

Compiled by Xiaoyi

(This article was automatically translated by Xiaoyi.)

Title: How Facebook scales AI


Most of Facebook's two billion users have little idea how much the service leans on artificial intelligence to operate at such a vast scale. Facebook products such as the News Feed, Search and Ads use machine learning, and behind the scenes it powers services such as facial recognition and tagging, language translation, speech recognition, content understanding and anomaly detection to spot fake accounts and objectionable content.


The numbers are staggering. In all, Facebook's machine learning systems handle more than 200 trillion predictions and five billion translations per day. Facebook's algorithms automatically remove millions of fake accounts every day.
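
To put those figures in perspective, here is a rough back-of-the-envelope conversion to per-second rates (assuming, purely for illustration, that traffic is spread evenly across the day, which it certainly is not):

```python
# Rough per-second rates implied by the reported daily volumes.
# Uniform traffic across the day is assumed purely for illustration.
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

predictions_per_day = 200e12            # 200 trillion predictions
translations_per_day = 5e9              # 5 billion translations

print(f"predictions/s:  {predictions_per_day / SECONDS_PER_DAY:,.0f}")   # ~2.3 billion
print(f"translations/s: {translations_per_day / SECONDS_PER_DAY:,.0f}")  # ~57,870
```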


In a keynote at this year's International Symposium on Computer Architecture (ISCA), Dr. Kim Hazelwood, the head of Facebook's AI Infrastructure group, explained how the service designs hardware and software to handle machine learning at this scale. And she urged hardware and software architects to look beyond the hype and develop "full-stack solutions" for machine learning. "It is really important that we are solving the right problems and not just doing what everyone else is doing," Hazelwood said.


Facebook's AI infrastructure needs to handle a diverse range of workloads. Some models can take minutes to train, while others can take days or even weeks. The News Feed and Ads, for example, use up to 100 times more compute resources than other algorithms. As a result, Facebook uses "traditional, old-school machine learning" whenever possible, and only resorts to deep learning--Multi-Layer Perceptrons (MLP), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN/LSTM)--when absolutely necessary.
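
As a rough illustration of the trade-off between those model families (not Facebook's production code; the layer and feature sizes below are arbitrary), the sketch contrasts a "traditional" linear model with a small multi-layer perceptron in PyTorch. The MLP costs far more compute per prediction:

```python
import torch
import torch.nn as nn

# "Traditional" model: a single linear layer followed by a sigmoid
# (logistic regression) -- cheap to train and to serve.
linear_model = nn.Sequential(nn.Linear(100, 1), nn.Sigmoid())

# Deep model: a small multi-layer perceptron (MLP) -- markedly more
# compute per example, reserved for cases where it clearly pays off.
mlp_model = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),
)

x = torch.randn(32, 100)          # a batch of 32 examples, 100 features each
print(linear_model(x).shape)      # torch.Size([32, 1])
print(mlp_model(x).shape)         # torch.Size([32, 1])
```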


The company's AI ecosystem includes three major components: the infrastructure, workflow management software running on top, and the core machine learning frameworks such as PyTorch.


Facebook has been designing its own datacenters and servers since 2010. Today it operates 13 massive datacenters--10 in the U.S. and three overseas. Not all of these are the same, since they were built over time, and they do not house the same data because "the worst thing you can do is replicate all data in every data center." Despite this, every quarter the company "unplugs an entire Facebook datacenter," Hazelwood said, to ensure continuity. The datacenters are designed to handle peak loads, which leaves about 50% of the fleet idle at certain times of the day as "free compute" that can be harnessed for machine learning.


Rather than using a single server, Facebook took hundreds of workloads in production, put them in buckets, and designed custom servers for each type. The data is stored in Bryce Canyon and Lightning storage servers, training takes place on Big Basin servers with Nvidia Tesla GPUs, and the models are run on Twin Lakes single-socket and Tioga Pass dual-socket Xeon servers. Facebook continues to evaluate specialized hardware such as Google's TPU and Microsoft's BrainWave FPGAs, but Hazelwood suggested that too much investment is focused on compute, and not enough on storage and especially networking, which, in keeping with Amdahl's Law, can become a bottleneck for many workloads. She added that AI chip startups weren't putting enough focus on the software stack, leaving a big opportunity in machine learning tools and compilers.
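
Amdahl's Law makes the point concrete: if only a fraction of a job is compute, with the rest spent on storage and network I/O, speeding up the compute alone quickly hits a ceiling. A minimal illustration with made-up numbers:

```python
# Amdahl's Law: overall speedup when only a fraction p of the work
# is accelerated by a factor s.
def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# Suppose 60% of a training job is compute and 40% is storage/network I/O.
for s in (2, 10, 100, float("inf")):
    print(f"compute {s}x faster -> overall {amdahl_speedup(0.6, s):.2f}x")
# compute 2x faster -> overall 1.43x
# compute 10x faster -> overall 2.17x
# compute 100x faster -> overall 2.46x
# compute infx faster -> overall 2.50x  (capped by the 40% spent on I/O)
```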


Facebook's own software stack includes FBLearner, a set of three management and deployment tools that focus on different parts of the machine learning pipeline. FBLearner Store is for data manipulation and feature extraction, FBLearner Flow is for managing the steps involved in training, and FBLearner Prediction is for deploying models in production. The goal is to free up Facebook engineers to be more productive and focus on algorithm design.
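
FBLearner itself is internal to Facebook, so the snippet below is only a hypothetical sketch of the three-stage shape such a pipeline takes (feature extraction, training, deployment); none of the function names are real FBLearner APIs, and scikit-learn stands in for the actual tooling:

```python
# Hypothetical three-stage ML pipeline mirroring the Store / Flow / Prediction
# split described above. All names here are illustrative, not FBLearner APIs.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def extract_features():
    # Stage 1 ("Store"): pull raw data and turn it into model features.
    X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
    return train_test_split(X, y, test_size=0.2, random_state=0)

def train(X_train, y_train):
    # Stage 2 ("Flow"): run the training workflow and produce a model artifact.
    model = LogisticRegression(max_iter=1_000)
    return model.fit(X_train, y_train)

def deploy(model, X_test, y_test):
    # Stage 3 ("Prediction"): serve the trained model; here we just score it.
    print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")

X_train, X_test, y_train, y_test = extract_features()
deploy(train(X_train, y_train), X_test, y_test)
```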


Facebook has historically used two machine learning frameworks: PyTorch for research and Caffe2 for production. The Python-based PyTorch is easier to work with, but Caffe2 delivers better performance. The problem is that moving models from PyTorch to Caffe2 for production is a time-consuming and buggy process. Last month, at its F8 developer conference, Facebook announced that it had "merged them internally so you get the look and feel of PyTorch and the performance of Caffe2" with PyTorch 1.0, Hazelwood said.


This was a logical first step for ONNX (Open Neural Network Exchange), an effort by Facebook, Amazon and Microsoft to create an open format for optimizing deep learning models built in different frameworks to run on a variety of hardware. The challenge is that there are lots of frameworks--Google's TensorFlow, Microsoft's Cognitive Toolkit, and Apache MXNet (favored by Amazon)--and the models need to run on a variety of different platforms such as Apple ML, Nvidia, Intel/Nervana and Qualcomm's Snapdragon Neural Engine.
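
For example, a PyTorch model can be serialized to the ONNX format with the standard torch.onnx.export call; the tiny model and input shapes below are placeholders:

```python
import torch
import torch.nn as nn

# A placeholder model; any torch.nn.Module that traces cleanly would do.
model = nn.Sequential(nn.Linear(100, 10), nn.ReLU(), nn.Linear(10, 1))
model.eval()

# ONNX export traces the model with a dummy input of the expected shape.
dummy_input = torch.randn(1, 100)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["features"], output_names=["score"])
# The resulting model.onnx can then be loaded by ONNX-compatible runtimes
# and hardware backends.
```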


There are a lot of good reasons for running models on edge devices, but phones are especially challenging. Many parts of the world still have little or no connectivity and more than half of the world is using phones dating from 2012 or earlier, and they use a variety of hardware and software. Hazelwood said there is about a 10X performance difference between today's flagship phone and the median handset. "You can't assume that everyone you are designing your mobile neural net for is using an iPhone X," she said. "We are very anomalous here in the U.S." Facebook's Caffe2 Go framework is designed to compress models to address some of these issues.
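
Caffe2 Go itself is not public, but the general idea of shrinking a model so it fits older phones can be illustrated with PyTorch's post-training dynamic quantization, which stores the weights of a placeholder model in int8 instead of float32:

```python
import torch
import torch.nn as nn

# A placeholder float32 model.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))
model.eval()

# Post-training dynamic quantization: Linear weights are converted to int8,
# roughly a 4x reduction in weight storage, at some cost in accuracy.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)
```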


The deep learning era has arrived, and Hazelwood said there are lots of hardware and software problems to solve. The industry is spending lots of time and money building faster silicon but, she said, we need equal investment in software, citing Proebsting's Law, which holds that compiler advances only double compute performance every 18 years. "Please keep that in mind so we don't end up with another Itanium situation," Hazelwood joked, referring to Intel's now-defunct IA-64 architecture. The real opportunity, Hazelwood said, is in solving problems that no one else is working on: building end-to-end solutions with balanced hardware and better software, tools and compilers.
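
The gap Hazelwood is pointing at is easy to quantify: hardware that doubles roughly every 18 months and compilers that double performance every 18 years (per Proebsting's Law) compound very differently over a decade:

```python
# Compounded improvement over 10 years under two doubling periods.
years = 10
hardware_gain = 2 ** (years / 1.5)    # Moore's Law-style: doubling every 18 months
compiler_gain = 2 ** (years / 18.0)   # Proebsting's Law: doubling every 18 years

print(f"hardware: ~{hardware_gain:.0f}x")    # ~102x
print(f"compilers: ~{compiler_gain:.1f}x")   # ~1.5x
```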

