Large language models (LLMs) require massive computing power, which generally comes from specialized GPUs. These are expensive and consume a lot of electricity, which makes them a significant cost factor for cloud AI providers.
Researchers from the Microsoft Azure team took on the problem and came up with a solution. A new technique called Splitwise aims to make inference calculations for LLMs significantly more efficient and sustainable. Processing is divided into two phases, prompt processing and token generation, which are distributed across different GPU clusters and machines. Splitwise takes advantage of the fact that prompt processing requires a large amount of GPU compute capacity, while token generation relies on high memory bandwidth.
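The split into two phases can be pictured with a small toy model. The following Python sketch is only an illustration under stated assumptions, not Microsoft's implementation: the names `Request`, `PromptMachine`, `TokenMachine`, and `serve` are hypothetical, the "model" merely copies and appends placeholder tokens, and the hand-off of intermediate state (called `kv_cache` here) between machines is an assumption that the article itself does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """A single inference request (hypothetical structure for illustration)."""
    request_id: int
    prompt_tokens: list[str]
    max_new_tokens: int
    kv_cache: list[str] = field(default_factory=list)  # assumed intermediate state
    output: list[str] = field(default_factory=list)


class PromptMachine:
    """Stands in for a machine chosen for high GPU compute capacity."""

    def run(self, req: Request) -> Request:
        # Prompt phase: all prompt tokens are processed in one pass.
        # Here we only record them as the intermediate state.
        req.kv_cache = list(req.prompt_tokens)
        return req


class TokenMachine:
    """Stands in for a machine chosen for high memory bandwidth."""

    def run(self, req: Request) -> Request:
        # Token phase: output is produced one token at a time, and each
        # step reads the state accumulated so far.
        for i in range(req.max_new_tokens):
            token = f"tok{i}"  # placeholder instead of a real model output
            req.kv_cache.append(token)
            req.output.append(token)
        return req


def serve(req: Request, prompt_machine: PromptMachine,
          token_machine: TokenMachine) -> Request:
    """Run the two phases on two different machines, one after the other."""
    req = prompt_machine.run(req)   # phase 1: prompt processing
    return token_machine.run(req)   # phase 2: token generation


if __name__ == "__main__":
    done = serve(Request(1, ["Hello", ",", "world"], 2),
                 PromptMachine(), TokenMachine())
    print(done.output)
```

In a real deployment the two classes would correspond to separate pools of machines, so each pool could be sized and equipped for the phase it handles.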
Details about Splitwise are described in a detailed paper. With Splitwise, Microsoft wants to achieve 1.4 times the throughput at 20 percent lower cost than previous system designs, or 2.35 times the throughput for the same cost and power budget. (Yupi)