Today, ImageNet is not only a competition for AI algorithms but also a benchmark of AI computing power for a substantial number of AI vendors. Indeed, the time required to complete ImageNet training has become the industry's gold standard for AI computing power, with public institutions and private companies alike competing to set new records.
In September 2017, UC Berkeley completed ImageNet training in 24 minutes, setting a new world record. That record fell a mere three months later, when UC Berkeley’s deep neural network (DNN) training completed the challenge in 11 minutes. Tencent was the next to break the record, completing the training in four minutes in August 2018.
Every minute slashed and every record broken is exciting news for the AI industry. Requiring approximately ten billion floating-point computations, an ImageNet training task is challenging even for the world’s most powerful supercomputers. Remarkably, Huawei’s Atlas 900 AI training cluster cut the completion time to under one minute, earning it the Tech of the Future Award.
Why is improving the performance of AI training clusters so difficult?
The performance of AI processors is the basis for the overall performance of a training cluster, so one way to improve computing power is simply to use higher-performance processors. In recent years, the performance of AI processors has grown at an explosive rate. However, a cluster usually involves thousands of AI processors in the computing process, and making these processors collaborate effectively remains the greatest challenge for the industry.
Processors are key to the performance of a single AI server.
The Atlas 900 AI training cluster uses Ascend 910 AI processors, which offer the largest computing power in the industry: each processor integrates 32 Da Vinci AI cores and delivers 256 teraFLOPS (TFLOPS) at FP16, twice the industry average. One server can be configured with eight Ascend 910 AI processors, giving it a peak floating-point computing power of about 2 petaFLOPS (8 × 256 TFLOPS) at FP16.
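The per-server figure follows directly from the two numbers quoted above. The short Python sketch below only reproduces that arithmetic; the function and constant names are illustrative and do not come from any Huawei tooling.

```python
# Figures quoted in the text above.
TFLOPS_PER_PROCESSOR = 256   # Ascend 910, FP16
PROCESSORS_PER_SERVER = 8    # processors in one AI server


def server_peak_tflops(processors=PROCESSORS_PER_SERVER,
                       tflops_each=TFLOPS_PER_PROCESSOR):
    """Theoretical peak of one server: processors times per-processor peak."""
    return processors * tflops_each


peak_tflops = server_peak_tflops()
peak_pflops = peak_tflops / 1000  # 1 petaFLOPS = 1000 teraFLOPS
print(f"{peak_tflops} TFLOPS = {peak_pflops:.3f} PFLOPS")
```

This is a theoretical peak only; sustained performance during real training depends on the workload and on how well the processors cooperate.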
Powerful AI chips alone are still not enough to complete the ten billion floating-point computations required by AI training tasks such as ImageNet. Multiple AI servers must form a cluster and complete the computations collaboratively. Many have argued that the larger the AI training cluster, the greater its computing power. Scale alone is not sufficient, however: improvements in other areas are also needed to boost the overall performance of the AI training cluster.
Packet loss limits the performance of AI training clusters.
Theoretically, the overall performance of an AI cluster made up of two servers is twice that of a single server. In practice, however, it falls short of that because of collaboration overhead. According to industry experience, an AI cluster made up of 32 nodes reaches at most half of its theoretical performance. Indeed, once an AI training cluster reaches its performance ceiling, adding more server nodes may even reduce the overall performance of the cluster.
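The gap between theoretical and actual performance is often expressed as scaling efficiency. The following Python sketch shows one simple way to compute it, using the 32-node, 50% example from the paragraph above. The function names and the throughput values are hypothetical and chosen only for illustration.

```python
def scaling_efficiency(cluster_throughput, single_node_throughput, nodes):
    """Ratio of measured cluster throughput to the ideal (linear) value."""
    ideal = single_node_throughput * nodes
    return cluster_throughput / ideal


def effective_nodes(nodes, efficiency):
    """Number of ideal, overhead-free servers the cluster is equivalent to."""
    return nodes * efficiency


# Hypothetical numbers: one server processes 1000 images per second,
# and a 32-node cluster is measured at 16000 images per second.
eff = scaling_efficiency(16000, 1000, 32)
print(f"scaling efficiency: {eff:.0%}")
print(f"equivalent ideal servers: {effective_nodes(32, eff):.0f}")
```

At 50% efficiency, a 32-node cluster delivers the work of only 16 ideal servers, which is why adding nodes without reducing collaboration overhead yields diminishing returns.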
The theoretical value is not reached because a large number of parameters are frequently synchronized between servers while the AI training cluster completes a training task. Network congestion worsens as the number of servers increases, resulting in greater packet loss. According to test data, a packet loss rate of just 0.1% cuts network throughput in half. Because packet loss increases with the number of server nodes, and the network breaks down when the packet loss rate reaches 2%, packet loss is the key factor limiting improvements in AI cluster performance.
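To get a feel for how steep this degradation is, the sketch below interpolates between the two data points quoted above: 0.1% loss halves throughput, and 2% loss brings the network down. The straight-line interpolation between those points is purely an assumption made for illustration, not a measured curve, and the names are hypothetical.

```python
# Anchor points: (packet loss rate, fraction of throughput retained).
# The 0.1% and 2% points come from the text; the 0% point is the
# loss-free baseline. The linear shape in between is an ASSUMPTION.
ANCHORS = [(0.0, 1.0), (0.001, 0.5), (0.02, 0.0)]


def throughput_fraction(loss_rate):
    """Estimate the retained throughput fraction for a given loss rate."""
    if loss_rate < 0:
        raise ValueError("loss_rate must be non-negative")
    if loss_rate >= ANCHORS[-1][0]:
        return 0.0  # at 2% loss or more, the network is treated as down
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= loss_rate <= x1:
            t = (loss_rate - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)


for rate in (0.0, 0.0005, 0.001, 0.01, 0.02):
    print(f"loss {rate:.2%}: throughput x{throughput_fraction(rate):.2f}")
```

Even under this simplified model, most of the throughput is lost within the first tenth of a percent of packet loss, which is consistent with the article's point that packet loss, rather than raw processor performance, caps cluster scaling.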