AI Requires a New Approach to High-Performance Computing

John Monroe

5 years ago

High-performance computing (HPC) needs to evolve. The traditional HPC architecture, now decades old, worked well for previous generations of HPC applications. But today’s applications, driven by AI, require a new approach.

The problem? The old systems were too static. Agencies have always bought high-performance computers the way people buy cars. They choose a model, select their features (in this case, compute capacity, storage capacity, etc.), and then live with that choice until they buy a new model.

That means if they know they will sometimes need the HPC equivalent of an SUV, that’s what they buy, even if a sedan would serve them most of the time.

Why AI Is Different

That wasn’t a problem when applications had static performance requirements. But AI is different. When developing an AI system, the workload changes from one stage of the process to another, said Matt Demas, Public Sector Chief Technology Officer (CTO) at Liqid, a company that provides a comprehensive composable infrastructure platform.

For example, during the data ingest phase, the system requires high-performance network interface capacity, while the training phase (when algorithms are built based on historical data) and the inference phase (when the algorithms are run with live data) require large numbers of graphics processing units (GPUs).

The requirements are so varied that some organizations use different servers for each phase of the process – the equivalent of having a different car for each season of the year. That might be effective, but it is not efficient.
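To make the mismatch concrete, here is a minimal Python sketch of what happens when a single fixed server is sized to the peak demand of every stage. The stage names follow the ingest, training and inference phases described above; the device counts and the function names (`static_config`, `idle_per_stage`) are invented for illustration and do not come from Liqid or any real deployment.

```python
# Illustrative per-stage needs. The counts are made up for this sketch.
STAGE_NEEDS = {
    "ingest":    {"nic": 4, "gpu": 0},   # network-heavy
    "training":  {"nic": 1, "gpu": 8},   # GPU-heavy
    "inference": {"nic": 1, "gpu": 4},   # GPU-heavy, assumed lighter here
}


def static_config(stage_needs):
    """Size one fixed server to the peak demand for each resource type."""
    peak = {}
    for needs in stage_needs.values():
        for resource, count in needs.items():
            peak[resource] = max(peak.get(resource, 0), count)
    return peak


def idle_per_stage(stage_needs):
    """Return the devices a peak-sized server leaves unused in each stage."""
    fixed = static_config(stage_needs)
    return {
        stage: {r: fixed[r] - needs.get(r, 0) for r in fixed}
        for stage, needs in stage_needs.items()
    }


if __name__ == "__main__":
    print("fixed server:", static_config(STAGE_NEEDS))
    for stage, idle in idle_per_stage(STAGE_NEEDS).items():
        print(stage, "idle:", idle)
```

Under these assumed numbers, the fixed server carries eight GPUs that sit unused during ingest and three network interfaces that sit unused during training and inference.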

Capabilities on Demand

Liqid has developed an architecture called the Composable Disaggregated Infrastructure (CDI). With CDI, the server is stripped down to its smallest building blocks: the CPUs and dynamic random-access memory (DRAM). All other components – GPUs, field-programmable gate arrays (FPGAs), storage, etc. – are drawn from a shared pool of resources on demand as requirements shift.

In the case of AI, CDI makes it possible to choose the right mix of resources for each stage of the workflow. It’s as if you could reassemble your car each time you went on the road, depending on the conditions of the day. “I don’t have to under- or over-provision a system – I can right-size it stage by stage,” Demas said.

That means an agency no longer needs to pay for resources that are just sitting idle. It can also start small and add more resources as an initiative grows.
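The idea of composing a system stage by stage can be sketched in a few lines of Python. This is a toy model, not Liqid's software or API: the `ResourcePool` class, its `compose` and `release` methods, and the device counts are all hypothetical, chosen only to show devices moving between a shared pool and a bare CPU/DRAM node.

```python
class ResourcePool:
    """Hypothetical shared pool of devices for a bare CPU/DRAM node."""

    def __init__(self, inventory):
        # Devices currently not attached to any node.
        self.free = dict(inventory)

    def compose(self, needs):
        """Attach the requested devices, or raise if the pool cannot cover them."""
        short = {
            r: n - self.free.get(r, 0)
            for r, n in needs.items()
            if n > self.free.get(r, 0)
        }
        if short:
            raise RuntimeError(f"pool is short by {short}")
        for r, n in needs.items():
            self.free[r] = self.free.get(r, 0) - n
        return dict(needs)

    def release(self, attached):
        """Return previously attached devices to the pool."""
        for r, n in attached.items():
            self.free[r] = self.free.get(r, 0) + n


if __name__ == "__main__":
    # Illustrative inventory and per-stage needs; the counts are made up.
    pool = ResourcePool({"nic": 4, "gpu": 8})
    stages = {
        "ingest":    {"nic": 4},
        "training":  {"nic": 1, "gpu": 8},
        "inference": {"nic": 1, "gpu": 4},
    }
    for stage, needs in stages.items():
        attached = pool.compose(needs)
        print(stage, "attached:", attached, "left in pool:", pool.free)
        pool.release(attached)
```

In this sketch, whatever a stage does not attach stays in the pool, where another node could use it. Growing the system means adding devices to the inventory instead of replacing the server.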

“With CDI, we are providing the kind of flexibility that people have become accustomed to in AWS or other public clouds, where they can build a server on the fly by clicking on the resources they need,” said Eric Oberhofer, Director of Sales for Public Sector at Liqid. “We’re allowing that on premises.”

Through Carahsoft, Liqid has worked with industry partners such as Intel and NVIDIA to deliver the first-ever composable supercomputing deployments for DoD. The three installations are collectively worth $52 million.

This article is an excerpt from GovLoop’s recent guide, “The State of AI in Government: Policies, Challenges & Practical Use Cases.”