When building and running a system deployed as a monolith, we are able to make a certain set of assumptions about the behaviour of the overall application. One of the biggest assumptions we make is that the memory space for the application all resides on the same machine. With this model, function and library calls can assume that their view of the data structures for the application are accurate, and that you can retrieve or mutate that data immediately and deterministically (leaving the thornier issues of multi-threaded applications aside for a minute). These assumptions allow teams of programmers to work effectively and efficiently across multiple packages, libraries, classes and functions.
Once you make the decision to split your monolith into multiple services, you enter an environment where the memory space for your application is distributed across multiple machines. In such an environment the services that make up our overall application have to communicate to one another by sending and receiving messages and data using the network. Leveraging a network as the anchoring point for communication between services means that the assumptions we can make when data is available in memory (or local disk) are no long true.
If we continue to develop microservices using the same set of assumptions we used for a monolith, we are operating with a now false set of assumptions that can prevent us from being successful. Even in a small distributed system with just two microservices we need to deal with networked communication that can turn our usual mental model of application development on its head. A common set of rules that can help us update our mental model to be more accurate in a distributed environment is the Eight Fallacies of Distributed Computing commonly attributed to Peter Deutsch, an engineer at Sun microsystems who worked on early versions of Ghostscript, as well as interpreters for Smalltalk and Lisp.
The eight fallacies as we know them today actually started off as just four fallacies. The first four fallacies were created in the early 90’s (sometime after 1991) by two Sun Microsystems engineers, Bill Joy and Dave Lyon.
James Gosling, who was a fellow at Sun Microsystems at the time — and would later go on to create the Java programming language — classified these first four fallacies as “The Fallacies of Networked Computing”. A bit later that decade, Peter Deutsch, another Fellow at Sun, added fallacies five, six, and seven. In 1997, Gosling added the eighth and final fallacy. So much of the work done to codify this list was inspired by work that was actually happening at Sun Microsystems at the time. And while Deutsch is the one who is often credited with having “created” the eight fallacies, clearly it was a combined, group effort (as is so often the case in computing!).
The word fallacy means using invalid or faulty reasoning to reach a conclusion. Used here, the fallacies of distributed computing characterize common traps that a developer of a microservice system will fall into if they take a naive approach to development that matches the mental model that they used to successfully create and operate a monolith. We can think of the eight fallacies as eight common misconceptions that developers of distributed systems often fall prey to, which we, as savvy practitioners, want to avoid.
These fallacies are:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Let’s unpack each of these fallacies, describing their importance and applicability, and some examples of the impact of those fallacies in real world applications.
Fallacy One: The network is reliable
Fallacy one is “the network is reliable”. This is likely the most common misconception that developers transitioning from a monolithic to a microservice architecture fall prey to. In a monolithic environment, function calls always succeed or return a programmatic error (in this context, throwing an exception is still considered a success — it is within the bounds of what the programmer expects to happen given the constraints they apply to the program). As a contrary example, we don’t typically expect a functional call to succeed 75% percent of the time and randomly fail the other 25%.
Once you split a monolith into multiple microservices, you begin introducing a network between function or API calls and responses. A misconception derived from a monolithic approach to software development is that these function calls are as reliable as in a monolithic environment where application state resides in memory; they are not! We are moving closer to the contrary example where a percentage of function calls will randomly fail due to network conditions outside of your control, introducing potentially unforeseen error conditions that you now have to deal with. It becomes your responsibility as a programmer to keep this fallacy in mind and code for the potential that the network fails.
Networks are complex and dynamic systems. In any complex system, failure is to be expected. Switches fail, power goes out, cables get cut. In March 2019, AWS experienced an outage that affected customers connected to their Ashburn, Virginia data center, affecting more than 240 critical services relying on AWS infrastructure. The root cause of the issue was a power outage that translated into elevated latency, packet loss, and downtime for Internet users relying on the AWS Direct Connect service. These large scale outages provide us with readily available examples of the unreliability of networks.
Fallacy Two: Latency is zero
In a general sense, latency is the time delay between the cause and the effect of some change in the system being observed. From an application perspective, we typically represent latency as the time that passes between a user action and the resulting response. Network latency refers specifically to the delay between a request and a response that is attributed to transmitting data over the network.
In a local environment like a monolith, reading or changing in-memory data can be measured on the order of nanoseconds. In comparison, reading or changing data in another computer within the same data center is measured in milliseconds. This difference may not seem like much, but if we normalize it to every day activities we can see the magnitude in difference between local access and network access. For example, if accessing data from memory takes as long as you take to brush your teeth (about 100 seconds), then accessing data in another machine within the same data center would take about five and a half hours! Naturally, the greater the distance between machines, the greater the latency. If we expand out our example to include data travelling between a machine in California and a machine in the Netherlands, it would take close to five years.
This fallacy is easy to fall prey to. When we are programming and debugging in a local environment or development environment, we experience a much lower latency and the time delay between sending or receiving data is negligible. However, that low local latency doesn’t accurately reflect the delay of what other parts of the system might feel — especially if they travel over different networks. And it doesn’t accurately reflect the experience of customers.
What can be even more insidious about latency being non-zero is that it is also not uniform. You can expect the majority of requests and responses to fall within a defined minimum and maximum latency, but some requests take vastly longer, making up the “long-tail” of the latency distribution. Common sources of such variability in latency are applications that share resources on a single machine, background processes or daemons that consume resources, contention over shared network infrastructure like switches, garbage collection events, or even simple routine maintenance.
In “The Tail at Scale”, Jeff Dean and Luiz André Barroso from Google show that these temporary high-latency episodes that can be unimportant in moderate-size systems may come to dominate overall service performance at large scale. In this research paper, they list real measurements from Google services showing that the latency for a single random request can be as low as 10ms while the 99th-percentile latency for all requests can be as much as 140ms and that those long-tail requests have an outsized impact on overall system performance. Programmers working in a microservice environment need to ensure that their programs operate correctly for the 99th-percentile case equally as well as the average general case.
Fallacy Three: Bandwidth is infinite
Bandwidth is another common measure for describing network performance referring to how much data can be sent over a network at one point in time. Bandwidth can sometimes be confused with latency. A way to visually picture the difference between latency and bandwidth is to compare it to a physical pipe like the water main in your home. With this example in mind, latency is the distance the water travels through the pipe, whereas bandwidth is the pipe’s diameter. A larger diameter pipe means that more water can travel at any point in time (bandwidth), but the overall time it takes for the water to move from point A to point B remains the same (latency).
The third fallacy of distributed computing is to treat bandwidth as an infinite resource. Unfortunately, the capacity of a network is not infinite, and the capacity can change drastically as data travels between networks (for example, bandwidth within a data center is typically high whereas bandwidth to a mobile phone may be significantly lower).
When we design and implement a distributed microservice architecture, it is up to developers to understand the effect of bandwidth constraints on user experience and overall program behaviour. Some users of our application may be on a network with limited bandwidth. For those users, sending high definition video may significantly degrade usability, while users on a fibre Internet connection would be happy with the increased quality of video.
With the advent of high resolution screens on computers and phones, many web sites began including high resolution images to take advantage of this increased resolution. High resolution devices have more pixels per inch, and high-resolution images fill in those extra pixels with extra details that make photos look phenomenal. If you’re following best practice for high resolution images on the web, you scale your images to the right size for high resolution devices by doubling the pixel dimensions for every image you upload. If you have a large hero image that is 1600px wide and 400px tall, you need to produce an image that is 3200px wide and 800px tall. If the width of your blog is 800px, then the images for your blog posts will have to have a width of 1600px and so forth.
These images will undoubtedly look better, but they come an increase in bandwidth requirements for a responsive user experience. For one network, using retina images may slow the loading of a webpage to a crawl, making the high resolution images a nuisance. For another network, the increased image resolution might be welcome and have no noticeable impact on latency.
Fallacy Four: The network is secure
Time and time again, we’ve been shown that the network is definitely not secure. There are just simply too many ways that a network can be attacked or compromised: vulnerabilities in libraries, operating systems, or other dependencies that attackers can use to penetrate the network, social engineering and phishing attacks designed to extract passwords used to access network infrastructure, or encryption mistakes or oversights allowing access to data transmitted over the network. A common saying in the security world is that “the only truly secure computer is one that is disconnected from any and all networks, turned off, buried in the ground, and encased in concrete.” But that computer isn’t terribly useful.
A system is only as secure as the weakest link. Once you split your system into multiple services connected by a network, you considerably increase the number of links you need to worry about. For example, if you are using HTTPs to communicate between services, each service will need an implementation of the HTTPs transport stack. If you are using third-party libraries to do this each of those libraries might be at risk. Even if you’re protecting against all of that, there still is the human factor. A malicious DBA can “misplace” a database backup. The attackers of today have a lot of computing power in their hands and a lot of patience. So the question is not if they’re going to attack your system, but when.
In a monolithic system, the total surface area that is open for attack is reduced. The monolith can use one shared HTTP library, perform authorization and authentication in a single location, and limit the use of multiple versions of libraries. You still need to practise defence in depth, using a layered approach to secure your application at the network, infrastructure, and application level while keeping security in mind when designing your system. However, by lowering the total amount of surface area exposed to attackers, you increase the overall security posture for your organization.
Fallacy Five: Topology doesn’t change
A network’s topology is the way that the elements in the network are laid out or arranged. This arrangement defines the way that the nodes in our distributed system relate or connect to one another, and how they communicate with one another. As we develop microservices, if we start to depend on the services in a network always “looking” a certain way, we can fall into assumptions about application behaviour that may not always be true.
Services in a microservice architecture need to communicate via requests which are sent through a network. Although it is tempting to view the network as a fixed entity, it is actually a dynamically changing system. For example, each additional microservice added to the network changes the network itself by adding a new node to the topology. Similarly, if we deploy a scalable microservice, scaling events that add or remove instances of the service in response it increases in load naturally change how the network is arranged.
An example of the potential side effects of the network topology changing in a microservice environment is DNS resolution. Most HTTP libraries cache IP addresses obtained through DNS requests to avoid the overhead of making full DNS queries for each request. In a dynamic microservice environment like Kubernetes, caching IP addresseses from DNS requests causes problems because the IP addresses of individual instances of the service change frequently.
By treating your servers as cattle, not pets you are ensuring that no server is irreplaceable. This bit of wisdom helps you get into the right mindset: any server can fail (thus changing the topology), so you should be prepared for a dynamic network topology when working with microservices.
Fallacy Six: There is one administrator
In some more traditional software development organizations, the role of system administrator is unique position responsible for maintaining the underlying machine that software runs on. At the time of writing of the eight fallacies of distributed computing, this role was more prevalent than it is today. With the rise of cloud computing and the growing importance of Developer Operations (DevOps) as an operating model for software organizations, this role is beginning to diminish somewhat in favour of engineers and developers taking on more administrative and software support tasks.
Even though system administration tasks are being automated by cloud providers or completed by engineers rather than dedicated administrators, it does not mean that the tasks somehow disappear. Because of the number and variety of tasks required to maintain and support production software systems, there is usually more than one person who becomes the “administrator” of a system by taking on the ongoing maintenance tasks. This can become a problem if these tasks are partitioned within a team so that only certain people on the team are able to perform some of the tasks. It can also cause problems in coordination to make sure that the correct set of administrative tasks are performed on the correct set of services.
In June 2019, Google suffered a major outage that frustrated users for almost 24 hours. Benjamin Treynor Sloss, Google’s VP of engineering, explained in a blogpost that the root cause of last Sunday’s outage was a configuration change for a small group of servers in one region being wrongly applied to a larger number of servers across several neighbouring regions. This configuration change caused the servers and network to stop using more than half of their available networking capacity. As an interesting side effect, the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage.
As our system grows from a monolith into multiple microservices that communicate over a network, the tasks required for maintenance multiply over the total number of microservices. For example, one routine maintenance tasks is operating system upgrades. In a monolithic system, an operating system upgrade affects a single system. In a microservice environment, on the other hand, operating system upgrades can become a cascading roll-out affecting tens or even hundreds of systems.
Fallacy Seven: Transport cost is zero
Transporting any data over the network has a price, in both time and resources. We already learnt that the amount of time for data to travel is not zero, and that we measure this time as latency. In addition to the time it takes to transport data, we need to consider the resource consumption and associated costs for transporting data. Theses cost can be further broken down by infrastructure cost and compute costs.
Because we have grown accustomed to the Internet and its impact in our day-to-day lives, we can almost become unaware of the vast infrastructure that powers it because of its ubiquity. However, the components powering the Internet at scale all have a cost: switches, load balancers, firewalls, proxies, and servers themselves all have a cost. The broader your network footprint reaches, the more resource cost you will need to pay to support it.
In addition to the cost of raw infrastructure, there is a processing cost for transmitting data. Sending data from one place to another takes resources from the CPU on the server to process and serialize the data into packets that can be transmittable over the Internet and on the client to deserialize that same data into an object representation that can be manipulated by our application. The costs of this processing are non-zero and increase as we increase the number of microservices deployed to our environment and the communication between them.
Throughout 2020, the Zoom video conferencing platform drastically increased their reach throughout the globe due to work from home orders put in place by governments to combat the spread of COVID-19. In response, Zoom quickly scaled their network to accommodate the influx of users and traffic. In May 2020, they announced had struck a deal with Oracle to move some of their services to Oracle’s cloud. Peeking into the press release and reporting, Reuters reports that the Zoom platform has 217,000 terabytes of data flowing through it per month. If we calculate the cost of that data transfer using the AWS cost calculator, the total for moving that much data every month comes to an incredible 11 million dollars. Definitely not zero!
Fallacy Eight: The network is homogeneous
The term homogeneous refers to a group of things that are the same or similar in nature. When speaking of a network, it is sometimes convenient to imagine that the nodes and infrastructure are homogeneous and that the variation between components is minimal. The rise of automation and virtual networking helps to keep the variation within individual networks to a minimum, at least at the data center level. However, as you move beyond your data center to the public Internet you will quickly have to deal with any and every type of network between your application and an end user.
For example, if your application is running in Kubernetes with an Amazon Virtual Private Cloud you can likely treat your service to service network traffic as homogeneous. After all, that is the entire point of cloud computing turning compute and networking into fungible services. But consider your end user. They may be accessing your application using their mobile phone, and the actual journey between their phone and your application can cross multiple networks with unique configurations and capabilities at each one — the LTE network of the phone provider will have vastly different characteristics than the network supporting service to service traffic.
As a developer, we need to pay attention to variation within the network and how it affects application development and the end user experience
The eight fallacies of distributed systems were developed in the 1990’s, and it can be tempting to dismiss them as out of touch with modern software development. The rise of cloud computing, automation, and developers doing operations and administration (DevOps) helps minimize the impact of these fallacies, but it does not remove them completely — it is impossible to completely avoid the eight fallacies when running a distributed system or microservice architecture and the steps that have been taken to improve reliability and efficiency have been done to ameliorate, but not solve these fallacies.
In reality, we live in a networked society and it is rare that a useful system or service does not rely on the network in some capacity. It is therefore almost impossible to completely avoid the eight fallacies of distributed computing. However, we can try to take steps to minimize their impact on the systems we build and deploy each day.