Sunday, December 22, 2024

What AI Means For Networking Infrastructure In 2024

Of the many trends taking place in cloud and communications infrastructure in 2024, none looms as large as AI. In the networking markets specifically, AI will shape how infrastructure is built to support AI-enabled applications.

AI has interesting characteristics that make it different from previous cloud infrastructure. In general, training large language models (LLMs) and other applications requires extremely low latency and very high bandwidth.

Generative AI (GenAI), which creates text, images, sounds, and other output from natural language queries, is driving new computing trends toward highly distributed and accelerated platforms. These new environments require a complex and powerful underlying infrastructure, one that addresses the full stack of functionality, from chips to specialized networking cards to distributed high performance computing systems.

This has raised the profile of networking as a key element of the “AI stack.” Networking leaders such as Cisco have seized on this in marketing materials and investor conference calls. It was even one of the featured topics of conversation in HPE’s recently announced $14 billion deal to acquire Juniper Networks. HPE executives said the deal emphasizes the growing importance of networking in the AI cloud world.

The AI networking companies that have drawn the most investor interest so far are Nvidia, which has a full stack of networking elements including its BlueField networking platform, and Arista Networks, which has attracted extraordinary attention for its role as a key supplier to AI providers such as Microsoft. There are also numerous interesting private companies in this market, which we’ll detail in a bit.

The impact of AI runs two ways in networking. In addition to “Networking for AI,” there is “AI for Networking.” You must build infrastructure that is optimized for AI. You must also build AI into your infrastructure, to automate and optimize it.

In short, AI is being used in nearly every aspect of cloud infrastructure, while it is also deployed as the foundation of a new era of compute and networking.

Network Infrastructure for AI

Building infrastructure for AI services is not a trivial game, especially in networking. It requires large investments and exquisite engineering to minimize latency and maximize connectivity. AI infrastructure makes traditional enterprise and cloud infrastructure look like child’s play.

“What our customers are telling us is they are starting to think about how to bring multiple [AI] clusters together and connect them and extend them to inference nodes and edges,” Shekar Ayyar, CEO of cloud-native networking company Arrcus, told me in a recent interview.

There has been a surge in companies contributing to the fundamental infrastructure of AI applications — the full-stack transformation required to run LLMs for GenAI. The giant in the space, of course, is Nvidia, which has the most complete infrastructure stack for AI, including software, chips, data processing units (DPUs), SmartNICs, and networking.

One of the ongoing discussions is the role of InfiniBand, a specialized high-bandwidth technology frequently used with AI systems, versus the expanded use of Ethernet. Nvidia is perceived to be the leader in InfiniBand, but it has also hedged by building Ethernet-based solutions.

Ethernet’s advantage will be economics, but it will require software tweaks and coupling with SmartNICs and DPUs. This market is targeted by the Ultra Ethernet Consortium, a Linux Foundation group whose membership includes industry-leading companies such as Arista, Broadcom, Cisco, HPE, Microsoft, and Intel, among others. Private companies including Arrcus and Enfabrica have also joined.

Key Startups Targeting AI Networking

There will be plenty of room for young companies as Ethernet-based networking solutions emerge as an alternative to InfiniBand. At the same time, specialized AI service providers are appearing to build AI-optimized clouds.

Here are some of the private companies we are tracking:

Arrcus offers Arrcus Connected Edge for AI (ACE-AI), which uses Ethernet to support AI/ML workloads, including GPUs within the datacenter clusters tasked with processing LLMs. The vendor aims the solution at communications service providers, enterprises, and hyperscalers looking for a way to flexibly network compute resources for AI infrastructure in a software-based approach that avoids the costs and limitations of switching hardware. Arrcus recently joined the Ultra Ethernet Consortium, a band of companies targeting high-performance Ethernet-based solutions for AI.

DriveNets offers a Network Cloud-AI solution that uses a Distributed Disaggregated Chassis (DDC) approach to interconnect any brand of GPU in AI clusters via Ethernet. This massively scalable platform is meant to be an InfiniBand alternative. Implemented on white boxes based on Broadcom Jericho 2C+ and Jericho 3-AI components, the product can link up to 32,000 GPUs at up to 800 Gb/s. DriveNets recently pointed out that in an independent test, its solution improved job completion time (JCT) by 10% to 30% in a simulation of an AI training cluster with 2,000 GPUs.
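To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. The 100-hour baseline job is purely hypothetical, and the aggregate bandwidth figure assumes every one of the 32,000 GPUs is attached at the full 800 Gb/s, which the vendor's "up to" wording does not guarantee. The function names are illustrative and do not come from any DriveNets tool.

```python
def improved_jct(baseline_hours: float, improvement: float) -> float:
    """Job completion time after a fractional improvement (0.10 means 10%)."""
    if not 0.0 <= improvement < 1.0:
        raise ValueError("improvement must be in the range [0, 1)")
    return baseline_hours * (1.0 - improvement)


def aggregate_bandwidth_tbps(num_gpus: int, port_gbps: float) -> float:
    """Total attached bandwidth in Tb/s, assuming every GPU runs at port_gbps."""
    return num_gpus * port_gbps / 1000.0


if __name__ == "__main__":
    baseline = 100.0  # hypothetical training job length in hours (assumption)
    for pct in (0.10, 0.30):
        hours = improved_jct(baseline, pct)
        print(f"{pct:.0%} better JCT: {baseline:.0f} h becomes {hours:.0f} h")

    # Upper bound only: 32,000 GPUs, each at the maximum 800 Gb/s.
    total = aggregate_bandwidth_tbps(32_000, 800)
    print(f"Aggregate attached bandwidth: {total:,.0f} Tb/s")
```

On a job that long, the quoted range would free up between 10 and 30 hours of cluster time per run, which is where the economic argument for the fabric comes from.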

Enfabrica, a startup founded in 2020 that emerged from stealth early in 2023, has created an accelerated compute fabric switch (ACF-S) that replaces the SmartNICs and PCIe switches that connect Ethernet-linked servers with the GPUs and CPUs within the systems that process AI models. The switch chip offers faster connections from the network to the AI system and reduces latency associated with traffic flows between NICs and GPUs. All of this streamlines AI processing and lowers the total cost of ownership (TCO) for AI systems.

Enfabrica hasn’t released its ACF-S switch yet, but it is taking orders for shipment early this year, and the startup has been displaying a prototype at conferences and trade shows in recent months. While it can’t list customers yet, Enfabrica’s investor list is impressive, including Atreides Management, Sutter Hill Ventures, IAG Capital, Liberty Global, Nvidia, Valor Equity Partners, Infinitum, and Alumni Ventures.

Software for Open Networking in the Cloud (SONiC) is an open networking platform built for the cloud, and many enterprises see it as an economical solution for running AI networks, especially at the edge in private clouds. Aviz Networks has built the Open Networking Enterprise Suite, a multivendor networking stack for the open-source network operating system, SONiC, enabling datacenters and edge networks to deploy and operate SONiC regardless of the underlying ASIC, switching, or the type of SONiC. It also incorporates Nvidia Cumulus Linux, Arista EOS, or Cisco NX-OS into its SONiC network.

Hedgehog is another cloud-native software company using SONiC to help cloud-native application operators manage workloads and networking with the ease of use of the public cloud. This includes managing applications across edge compute, on-premises infrastructure, or in distributed cloud infrastructure. CEO Marc Austin recently told us the technology is in early testing for some projects that need the scale and efficiency of cloud-native networking to implement AI at the edge.

AI-Enabled Observability and Automation

AI is also having an impact on how infrastructure tools are used, including how it can drive automation. This is the “AI for Infrastructure” part of the equation.

One key area that is using AI to drive automation of infrastructure is observability, which is a somewhat dull industry term for the process of gathering and analyzing information about IT systems.

Several companies are at the forefront here. Kentik’s Network Intelligence Platform, delivered as a service, uses AI and machine learning to monitor traffic from multiple sources throughout the IT infrastructure and correlate that data with additional information from telemetry, traffic monitoring, performance testing, and other sources. The results are used for capacity planning, cloud cost management, and troubleshooting. Selector uses AI and ML to identify anomalies in the performance of applications, networks, and clouds by correlating data from metrics, logs, and alerts. A natural language query interface is integrated with messaging platforms such as Slack and Microsoft Teams.
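For readers unfamiliar with what "identifying anomalies" in a metric means in practice, the simplest form is a statistical outlier check. The Python sketch below uses a rolling z-score, a generic textbook technique chosen here only for illustration. It is not a description of how Kentik or Selector work, and the function name, window size, threshold, and sample data are all assumptions.

```python
from statistics import mean, stdev


def zscore_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that deviate sharply from the preceding window.

    Each sample is compared with the mean and standard deviation of the
    `window` samples before it; it is flagged when it lies more than
    `threshold` standard deviations from that mean.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is unusual.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical link latency readings in milliseconds, with one spike.
    latency_ms = [10, 11, 10, 12, 11, 10, 95, 11, 10, 12]
    print(zscore_anomalies(latency_ms))
```

Commercial observability platforms go well beyond this, correlating many such signals across metrics, logs, and alerts, but the underlying idea of comparing new readings against recent behavior is the same.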

Another trend to watch is how WebAssembly (Wasm) helps AI infrastructure. Wasm is an abstraction layer that can help developers deploy applications to the cloud more efficiently. AI might be the perfect application for Wasm.

Fermyon, which has created Spin, an open-source tool for software engineers, is a company to watch in the Wasm space. It also built Fermyon Cloud, a premium cloud service aimed at larger enterprises. Both products deploy the W3C Wasm standard to efficiently compile many different types of code down to the machine level, giving Web apps much faster startup times. The software also runs cloud apps securely in a Web sandbox separated at the code level from the rest of the infrastructure.

AI for Multicloud

AI will also fuel the growing need for multicloud networking. In theory, a lot more data will be shuttled between clouds so that it can be collected, organized, and analyzed. One trend to watch is that this will also mean the collection of more data at the edge.

Networking companies targeting data and apps at the edge should benefit from the need for secure connectivity. Aviatrix will be part of the game, as its distributed multicloud networking platform can drive more integrated connectivity with public cloud platforms, while providing operators with better distributed security and observability features. Aviatrix CEO Doug Merritt recently told industry video outlet theCUBE that AI will have a huge impact on networking.

“LLMs will have a play in networking when you’re trying to think about understanding attack surfaces,” Merritt told theCUBE. “How do you optimize policies across these very complex networks and be more proactive and resilient? Just getting secure and resilient transport between clouds on a seamless basis is [something] that most companies really, really wrestle with.”

Itential is an intriguing company out of Atlanta that is building automation tools to facilitate the integration of multidomain, hybrid, and multicloud environments using infrastructure as code and platform engineering. The company helps organizations orchestrate infrastructure using APIs and pre-built automations. This type of automation will be key in implementation of AI infrastructure as organizations seek more flexible connectivity to data sources.

Prosimo’s multicloud infrastructure stack delivers cloud networking, performance, security, observability, and cost management. AI and machine learning models provide data insights and monitor the network for opportunities to improve performance or reduce cloud egress costs. Graphiant’s Network Edge tags remote devices with packet instructions to improve performance and agility at the edge compared to MPLS or even SD-WAN. A Graphiant Portal enables policy setup and connectivity to major public clouds.

Providers of AI in IT and cloud environments should also benefit. These include ClearBlade, whose Internet of Things (IoT) software facilitates stream processing from multiple edge devices to a variety of internal and external data stores. ClearBlade Intelligent Assets uses AI to create digital twins of a variety of IoT environments that can be linked to real-time monitoring and operational functions.

Overall, AI’s impact on networking and infrastructure will be one of the key themes for the remainder of 2024, as vendors line up to build the right technology for this enormous trend. While AI hype is expected to be tempered at some point during the year, the capital spending (capex) plans for AI infrastructure deployments are being plotted many years into the future, and it’s likely that AI will have an outsized impact on the future of networking and infrastructure deployments.
