The Internet is arguably the most successful engineering project ever; it is behind the success of eight of the world’s largest companies by market capitalization. It enabled the rise of cloud computing, scalable AI, data aggregation, and mobile computing. And yet, it might not have turned out that way. It came at a time when other networks from telcos and cable operators seemed like better bets for voice, video, and possibly data.

Larry Peterson and Bruce Davie were both involved in the early stages of the Internet, holding various roles over the last several decades. They document the history of systems engineering and collaborative decision-making in their new book, What We Talk About When We Talk About Systems. The two have been collaborating on systems engineering textbooks since the 1990s. Peterson started in academia, helped spin out a startup, and chaired various open source projects. Davie started at Bellcore, the research arm of the regional Bell operating companies, and later held senior engineering roles at Cisco, Nicira, and VMware.

The new book is a collection of essays on topics such as system and network architecture, open source software, judgment in system design, 5G, and software-defined networking. It examines systems engineering at a high level, which is a fairly abstract exercise. What makes the book special is that it maps this framing onto specific, concrete examples of the debates that shaped the Internet. Peterson and Davie are also candid enough to explain how some of their ideas changed in response to new experiences and data. In this sense, it’s not just a technical explainer, but an exploration of the thought process and collaboration behind a systems approach.

I differed with the authors on a few points, particularly when they discussed 5G and AI near the end. However, they both make it clear that respectfully holding space for diverging opinions plays a crucial role in the systems engineering approach. The bulk of the book focuses on the networking infrastructure underpinning the Internet, but the authors make clear that the principles apply equally to other domains. Reading it, I found myself wondering what kind of clues it might hold for the future of AI, spatial computing, and sustainability.

Not a sure thing

Today, it’s hard to imagine that the Internet’s success was not a sure thing. It came at a time when telcos were charging us exorbitant connection and long-distance fees, and they had ambitious plans for monetizing the future of text-only “data” services like Minitel. The cable operators were trying to figure out how to better monetize hundreds of additional TV channels over their new broadband networks. 

The telcos and cable operators were too busy building their own networks to care about the Internet. The telcos were focused on OSI, while cable operators focused on DOCSIS. The authors observe that those networking stack diagrams are not magical insights written in stone. Once established vendors are involved, these demarcation lines tend to arise to protect profitable business models or established engineering processes rather than from first-principles technical insight.
Fortunately for the rest of us, the established network operators did not take a great interest in this ambitious academic research project that was starting to take shape. This gave academics the luxury of designing a new network from first principles, focusing on factors such as availability, scalability, and performance. Ideas like security came a bit later as an add-on. 

One essential idea is that network architectures encompass the aspects that everyone must agree on, while allowing flexibility for variations in the designs of specific networks. The telcos settled on standardizing the physical links at the bottom of the stack. The cable operators focused on the TV interface at the top. The academic community decided to standardize the meaning of IP addresses in the middle. This later came to be known as the hourglass, because it made it easy to adopt newer link technologies such as Wi-Fi at the bottom, and web, mobile, cloud, and AI applications at the top.

The importance of this approach was first popularized in a 1994 report, “Realizing the Information Future” (RTIF). Davie elaborates:

What Realizing the Information Future did brilliantly was to highlight the key differences between these competing visions of the future of networking. While the Internet could be modeled as an hourglass, ATM looked more like a funnel: the entire bottom part of the stack was pinned down to a small set of technology choices, with ATM requiring SONET and a specific set of link technologies. And the cable-TV version, with the view that the only application that mattered was video, was the inverse of this, narrowing to a single application class at the top.

RTIF didn’t claim that the Internet architecture was the one true choice for the future, but it did point out the drawbacks of narrowing either the top or the bottom part of the hourglass to a small set of options. If you narrow the bottom, you rule out a whole lot of current and future technology choices for the link layer, such as Ethernet and WiFi. By contrast, IP just keeps working over every new link layer technology that gets invented, embracing all sorts of link layer innovations such as various generations of cellular data and a proliferation of broadband access technologies. It also provided a clear deployment path leveraging existing networks that ATM lacked. And if you narrow the top, you rule out the diversity of applications that have flourished since the Internet was invented.

Scalability challenges

In the early 1990s, the Internet was beginning to show signs of strain as it transitioned from a moderate to a global scale. The engineering community began to question some of the cherished principles behind the initial network. Davie writes:

There are many theories about why TCP/IP was more successful than its contemporaries, and they are not readily testable. Most likely, there were many factors that played into the success of the Internet protocols. But I rate congestion control as one of the key factors that enabled the Internet to progress from moderate to global scale. It is also an interesting study in how the particular architectural choices made in the 1970s proved themselves over the subsequent decades.

The problem lies in trying to get millions of end systems, with no direct visibility into one another’s traffic, to cooperatively share bandwidth fairly. Initially, packet loss was the only control signal. But relying on loss alone meant senders kept pushing until queues in the network filled, and those deep buffers introduced delays and jitter. In some congestion collapse episodes, throughput fell by three orders of magnitude, and Ethernet inventor Bob Metcalfe famously predicted the Internet would collapse as consumer access to the web drove massive growth. This drove research into better congestion signals, such as measuring delay.
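To make the loss-as-signal idea concrete, here is a minimal sketch of the additive-increase/multiplicative-decrease behavior that loss-based congestion control settles into; the link capacity, loss probability, and constants below are illustrative assumptions, not figures from the book.

```python
# Minimal AIMD sketch: grow the congestion window by one segment per RTT,
# halve it when a loss is detected. Capacity and loss model are assumed.
import random


def simulate_aimd(rtts: int = 50, capacity: int = 40, seed: int = 1) -> list[int]:
    random.seed(seed)
    cwnd = 1          # congestion window, in segments
    history = []
    for _ in range(rtts):
        history.append(cwnd)
        # Loss is the only feedback: once the sender exceeds the bottleneck
        # capacity (i.e., the queue is full), a drop becomes likely.
        lost = cwnd > capacity and random.random() < 0.8
        if lost:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease
        else:
            cwnd += 1                  # additive increase
    return history


if __name__ == "__main__":
    print(simulate_aimd())  # sawtooth oscillating around the bottleneck capacity
```

The sawtooth shows why loss-only feedback tends to keep queues full: the sender only backs off after the bottleneck has already overflowed.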

It also led to the adoption of new paradigms. For example, the Internet originally had just two transport-layer protocols: UDP for low-overhead, quick communication and TCP for reliable delivery, though TCP’s connection setup and recovery mechanisms cost extra round trips. Both protocols came with limitations for the kinds of remote procedure calls common in web, mobile, and cloud applications. Ultimately, systems engineers crafted QUIC as a more efficient alternative for these use cases. This required spanning multiple layers of the protocol stack, which was a questionable idea at the time.
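As a rough illustration of why those round trips mattered, the back-of-the-envelope sketch below counts handshake round trips before application data can flow; the RTT value and the per-stack round-trip counts are simplifying assumptions (ignoring TLS session resumption, retransmissions, and so on), not numbers from the book.

```python
# Back-of-the-envelope: connection-setup latency before the first byte of
# application data, under assumed handshake round-trip counts.
RTT_MS = 50  # assumed round-trip time

SETUP_RTTS = {
    "TCP + TLS 1.2": 1 + 2,        # TCP handshake, then TLS 1.2 handshake
    "TCP + TLS 1.3": 1 + 1,        # TCP handshake, then TLS 1.3 handshake
    "QUIC (first connection)": 1,  # transport and crypto handshakes combined
    "QUIC (0-RTT resumption)": 0,  # resumed session sends data immediately
}

for stack, rtts in SETUP_RTTS.items():
    print(f"{stack:>26}: {rtts} RTTs ≈ {rtts * RTT_MS} ms before data flows")
```

Collapsing the transport and cryptographic handshakes into one exchange is exactly the kind of cross-layer move the paragraph above describes.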

Another major paradigm shift was the debate between fully decentralized systems and centralized control. Initially, it was thought that every router and switch on the Internet should be fully decentralized to minimize the risk of a single point of failure. However, this can result in numerous routing decisions that are optimal at a local level but reduce overall performance. A fully decentralized system is also harder to observe and manage as a whole. The development of software-defined networking required finding a new balance between centralized and decentralized management and control.
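The sketch below illustrates the centralized half of that balance, using an assumed topology and switch names: a logically centralized controller computes paths over a global view of the network and pushes per-switch forwarding entries down, rather than each switch running its own distributed routing protocol.

```python
# Minimal SDN-style sketch: a controller with a global view computes a path
# (Dijkstra) and derives the forwarding entries it would install per switch.
# The topology, link costs, and names are illustrative assumptions.
import heapq

TOPOLOGY = {
    "s1": {"s2": 1, "s3": 4},
    "s2": {"s1": 1, "s3": 1, "s4": 5},
    "s3": {"s1": 4, "s2": 1, "s4": 1},
    "s4": {"s2": 5, "s3": 1},
}


def shortest_path(src: str, dst: str) -> list[str]:
    """Dijkstra over the controller's global view of the topology."""
    queue = [(0, src, [src])]
    seen: set[str] = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nbr, weight in TOPOLOGY[node].items():
            if nbr not in seen:
                heapq.heappush(queue, (cost + weight, nbr, path + [nbr]))
    raise ValueError("no path")


def install_flow(path: list[str]) -> dict[str, str]:
    """Per-switch next-hop entries the controller would push down."""
    return {hop: nxt for hop, nxt in zip(path, path[1:])}


print(install_flow(shortest_path("s1", "s4")))
# {'s1': 's2', 's2': 's3', 's3': 's4'}
```

The controller remains a logical single point of decision (and must itself be replicated for availability), while the switches keep doing simple, fast forwarding: that is the new balance the paragraph refers to.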

My take

What We Talk About When We Talk About Systems is a rich book that can be read at many different levels. At a high level, it provides an overview of how systems engineering is a collaborative and iterative process that requires rethinking assumptions along the way. At a lower level, it offers an in-depth analysis of the essential principles and assumptions behind the rise of cloud and data services today. Some of the lessons seem relevant to ongoing discussions about improving healthcare delivery, sustainable electrification, AI, and other large-scale challenges and opportunities.

I am still trying to wrap my head around software-defined networking (SDN) and 5G. Although SDN does not receive the same attention as other IT topics, the authors argue that it plays a seminal role in cloud, microservices, and zero-trust security.

Meanwhile, 5G is one of those things I keep scratching my head about. They argue that telco architecture choices necessitate re-architecting the entire system every time a new radio technology emerges, as seen with the transitions from 3G to 4G, 5G, and now 6G. They are also excited about the potential for Aether to open up 5G networks, supporting more competition. However, the mechanics of provisioning and managing private 5G sound considerably more complicated than Wi-Fi. Even at the public network level, the major use case today for 5G is a kind of glorified on-ramp to the more open TCP/IP network.

Another significant takeaway is the level of tension that goes into multi-stakeholder collaborations. If we had just left the networks to the telcos, we would still be paying exorbitant per-message fees for multimedia texts, or paying cable TV firms for more channels than we would have time to watch. The openness baked into the Internet “academic project” freed researchers to focus on first principles alone, creating trillions of dollars in value that has propped up some of the telcos and cable operators that might have otherwise squeezed the life out of the new network.

One last thing I am bookmarking is that many experts predicted the TCP/IP protocol underpinning the Internet would never scale to support real-time voice and video. The technical limitations of early routing protocols and equipment meant choppy video and unusable voice calls. Yet somehow, engineers across vendor, academic, and government institutions came together to create something that has essentially replaced and superseded the gated communities the telcos and cable operators had in mind for the future.