# Distributed System Models in the Real World

Practical distributed applications are deployed into varied environments and execute on a variety of different machines linked together over a variety of communication infrastructure. The physical machines themselves can differ in the number and speed of processors, the availability of random access and stable storage, and more. The communication infrastructure can differ in the available levels of latency, throughput, and reliability. Because of these differences, it is more practical to look at distributed algorithms from a higher-level perspective so that they are applicable to a wide range of environments. Such algorithms do not depend on the particular details of the hardware or software on which they are run, and they are not limited to a highly specialized implementation.

In order to work with distributed algorithms without getting swamped by implementation details, researchers model the fundamental characteristics of distributed systems. Such models capture the essential faults and failure characteristics of any process that performs a computation by exchanging messages. The set of assumptions that we make about the environment is called a system model. The system model is an abstraction of the distributed environment, and a particular distributed algorithm is guaranteed to work within the system model it was designed for.

There are three primary components that make up a complete system model: processes that execute operations, communication links that facilitate the passing of messages between processes, and timing assumptions that model the reliability and time bounds on operations and message passing within the system.

Depending on the behaviour of the different components of the system model, a distributed algorithm may need to be altered or adjusted to compensate. For example, if we assume that once processes in our environment crash, they are removed and never return (the crash stop model) we can design an algorithm differently than if we assume that processes can fail randomly and arbitrarily (the Byzantine or arbitrary model). Because of this nuance, distributed algorithms are always developed in reference to an explicit system model.

A benefit of working with a system model is that you can model the implementation of algorithms without requiring access to multiple distinct computers. A process abstraction may be implemented as an entire node, a single processor within a node, or even a specific thread of execution within a processor. Similarly, links provide an abstraction over the physical networking infrastructure and can be modelled using local message passing and simulated network conditions. This makes defining and implementing distributed algorithms considerably easier.

It is possible to represent many different operating environments by varying the different properties of the processes, links, and timing assumptions, typically by varying when and how different components may fail. To design a distributed algorithm, we need to provide some base assumptions we can work with. To do this, we specify the types of faults that can occur in our system, we can divide these assumptions of system behaviour into process models, link models, and timing assumptions.

## Process Models

Failure occurs whenever a process behaves in a way that isn’t specified by the distributed algorithm. The models we use for processes differ according to the nature of the faults that can cause processes to fail. These faults range from simply crashing and stopping all work, to random or adversarial behaviour. A robust distributed algorithm guards against common faults so that each process can proceed to contribute work, and that the algorithm can continue to meet its liveness and safety guarantees.

### Crash stop processes

The simplest way that a process can fail is when it stops executing. The process executes according to its algorithm up until some point in time t, after which it simply stops executing. It does not send any additional messages to other processes, and it does not perform any additional computation. When a process fails execution, we say that the process has crashed at some point in time t, and that it never recovers. In distributed algorithms, this is called a crash stop process model.

In a crash stop process model, we assume that when all processes fail, it is due to a crash stop. Furthermore, we assume that once a process has crash stopped, it is forever stopped and does not return. With this model, a process is said to be faulty if it crashes at some time during its execution. Conversely, it is said to be correct if it never crashes and continues to execute steps of the algorithm.

When using the crash stop process model, we assume that a process executes correctly, but that it may crash at some time. After that point, it never recovers, meaning that it never performs another step in the algorithm. In practice, of course compute nodes that crash can be restarted and may recover, but for the purposes of the algorithm these nodes are no longer part of the distributed system or algorithm. This doesn’t necessarily mean that recovery should be prevented or inhibited in any way. Nothing prevents the recovered process from joining in to subsequent iterations of the distributed algorithm. It simply means that if we define an algorithm under the assumptions of the crash stop process model we should not rely on some of the processes to recover for the algorithm to work correctly.

### Crash-recovery processes

Sometimes it is difficult, if not impossible, to assume that a sufficient number of processes will never crash throughout the execution of a distributed algorithm. In these cases, the crash-recovery model makes more sense than crash stop. Here, we assume that a process may crash, but that it will recover after some period of time and resume completing steps of the algorithm.

In the crash-recovery model, a process is faulty under one of two conditions. First, the process may crash and never recovers. Second, it may crash and recover an infinite number of times (crash-loop) so that it is rendered useless to the rest of the system. Otherwise, the process is said to be correct. Effectively, a correct process in a crash-recovery model will eventually be up and running. This allows for a process to crash and recover a finite number of times and still participate in the distributed algorithm.

A crash-recovery model can lead to the presence of omission faults. Omissions occur when a process does not send a message that it is supposed to send or receive a message that it is supposed to receive. Omissions cause the process to deviate from the algorithm by dropping some messages that should have been exchanged with other processes.

Another effect of crash-recovery is that any internal state present on the process at the time of the crash is lost. Because of this, we assume that processes in a crash-recovery model have some form of stable storage that is preserved across process crashes. Any in-memory state at the time of the crash is lost. This significantly complicates the design because when the process recovers, it may begin sending messages that contradict messages that were present in-memory at the time of the crash.

To recoup the state of the process after a crash, a process in the crash-recovery model is assumed to be aware that it has crashed and is recovering (for example, through a startup procedure). During this recovery, the process can retrieve any relevant state from stable storage before resuming participation in the distributed algorithm. For performance reasons, some distributed algorithms aim to minimize access to stable storage.

### Byzantine processes

The last process model is arbitrary (also known as Byzantine) faults. In this mode, processes can do any random thing. This includes sending and receiving random data, or even trying to deceive other processes. When using the arbitrary-fault model, the algorithm can make no assumptions on the behaviour of faulty processes.

Naturally, the arbitrary-fault model is the most general, and the most difficult to design algorithms for. Oftentimes, arbitrary faults are the most expensive to tolerate in terms of performance or storage cost. However, this model is the only reasonable option when unknown or unpredictable faults can occur. This includes any time when a system is vulnerable to outside attackers: some processes may become controlled by malicious users that deliberately try and prevent correct behaviour.

An arbitrary fault does not need to be malicious or intentional. It can be cause by a bug in the implementation, a framework, or an operating system where the process runs. Faults that occur due to bugs can sometimes be caught through redundancy or checksum mechanisms, but any faults that are due to a determined adversary cannot be caught with such techniques. A process under the control of an adversary may inconsistently appear to both failed and functioning to any existing failure-detection systems, and it may present different symptoms to different observers. It is difficult for the other components to declare it failed and shut it out of the network, because they need to first reach a consensus regarding which process has failed in the first place.

## Real World Process Models

Processes in a system model map directly to running processes on a machine, or running services in a distributed system. Processes can be a server, or node. Such a process could be a bare-metal compute instance, a virtualized server, or a Docker container running on a compute instance. In this model, we can encounter behaviour that maps directly to a system model.

### Crash stop processes in the real world

The first process model we covered was crash stop. In a crash stop model, once a process crashes, it never recovers. No further computation is done, no messages are received, and no messages are sent. Real world systems built with this model are typically stateless services with no persistent storage. For example, we can consider a simple layered web application with a frontend API server, and a separate database. Each of these is on different compute nodes.

If we receive increased traffic at our site, one way to scale is by adding additional frontend API servers behind a load balancer. With this simple change, we now have multiple API frontends that can receive requests from users, while the single database server is used for persistence of and long term data storage.

To simplify the design of such a system, one principle that many development teams follow is to make the frontend services stateless. In this context, stateless means that each individual frontend does not track any information about a user outside of the context of a single request. Such a stateless service can be stopped at any time, and does not try to recover any state.

The stateless design can be used to design fairly complex services. At a very basic level, as the name suggests, the term “stateless” means that no past data nor state is stored or needs to be persistent when a new instance of a process is created. Kubernetes is another example of an environment that encourages the use of stateless services that can run in a crash stop model.

In Kubernetes, services are made up of multiple pods that are scheduled to run on compute infrastructure called Nodes. Each pods is scheduled to run only once in their lifetime. Once a Pod is scheduled (assigned) to a Node, the Pod runs on that Node until it stops or is terminated. Pods are considered to be relatively ephemeral (rather than durable) entities. Pods are created, assigned a unique ID, and scheduled to nodes where they remain until termination or deletion. If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period.

Pods do not, by themselves, self-heal. If a Pod encounters a failure, it can be configured to restart, though it will lose any data in-memory at the time of failure. The recovery process does not attempt to restore any state from before the crash.

#### Crash recovery processes in the real world

A process that operates using the crash recovery model attempts to restore any state that was present in the process before the time of the crash. Stateful applications typically involve some database, such as Cassandra, MongoDB, or MySQL and processes that read and/or write to it. Stateful processes will respond to requests based on the information provided where the prior request history impacts the current state; hence, the server must access and hold onto state information generated during the processing of the earlier request, which is why it is called stateful.

A database itself can be considered a stateful service operating as a crash recovery process model. Such applications store data and keep track of the changes to it. Upon a process failure, the database loads previous state from the write-ahead log written to stable storage to recover the data present at the time before the crash occurred.

Returning to Kubernetes, it is possible to manage stateful applications using a StatefulSet abstraction. This abstraction manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods. A StatefulSet maintains a sticky identity for each of their Pods that it maintains across any rescheduling. Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing data volumes or storage to the new Pods that replace any that have failed, allowing a Pod to recover any state it had accumulated during operation.

#### Byzantine processes in the real world

Byzantine processes are allowed to do pretty much anything. This includes trying to trick or deceive other processes. This model allows a process to “lie” by sending incorrect messages or performing incorrect computations. Any distributed algorithm becomes much harder if there is a risk that a node may lie.

In the world of the Internet-based systems, Byzantine process are rare, but possible. Consider as an example the case of a security threat. A financial services company may become subject to an attack that compromises one of the processes in the system, with the goal of the attack to subvert a distributed algorithm for profit.

Byzantine faults come in weaker forms as well. For example, memory can become corrupted and calculations can fail. Such a failure may cause a process to respond to messages in entirely unpredictable ways. Or consider a decentralized network like Bitcoin. Some participants in the Bitcoin blockchain attempt to cheat or defraud others for financial gain. It’s not safe to trust another processes’ messages or data, and a strong distributed algorithm, like proof of work, allows untrusting actors to agree on the state of the system.

Links model the network communication between processes in a distributed algorithm. Every pair of processes can be assumed to be connected by bi-directional links that support the passing of messages and data between processes. Communication links can be implemented using different network topologies like a fully connected mesh, or a ring. In practice, this typically means a set of links connected over bridges and routers like the Internet.

In a distributed system, we can assume that some messages may be lost during transmission. In our system model this is specified in terms of the probability of any one message being lost during transmission, and in how we deal with the presence of lost messages. The probability of a message being lost is usually non-zero because it is unlikely that all messages between two processes are lost unless there is a severe network condition like a partition that will fundamentally alter the validity of the distributed algorithm.

Fair-loss links capture the idea that messages will be lost, and that some may be repeated. Fair-loss links are characterized by three core properties. First, some messages always make it through — a fair-loss link does not systematically drop every single message. Therefore, the probability of a message making it through to delivery is greater than zero. Second, even though messages will eventually be delivered, it is possible that messages may be duplicated (up to a finite number of times) or reordered. Third, fair-loss links cannot arbitrarily create or corrupt messages that have not been sent - no message is created out of thin air.

Stubborn links provide a reliability improvement on fair-loss links (at the expense of performance). Stubborn links make sure that messages are eventually delivered by resending every message a (potentially infinite) number of times — thus ensuring that the message is delivered.

Implementing stubborn links can be done using a “Retransmit Forever” algorithm that builds upon an underlying fair-loss link. We know that the fair-loss link will drop a certain percentage of messages, but that some messages do reach their destination. If we keep retransmitting all messages sent, in the limit we overcome any possible dropped messages and can guarantee that every message is delivered.

Clearly, the performance of a stubborn links implementation is not efficient — it does not make sense to infinitely resend every single message. To make the algorithm more practical, we can first limit the concept of “forever” or “infinite” retransmissions. We can add acknowledgements into the algorithm so that if a message is delivered, the receiver can send an acknowledgement that signals to the sender that they may stop sending new messages. We can also note that if a round of the algorithm is complete and we move to the next execution step, we can stop retransmitting messages from previous iterations.

Perfect links (also called reliable links) add mechanisms for detecting and suppressing duplicate messages, in addition to adding message retransmission like in stubborn links. Perfect links guarantee reliable delivery of every message sent — if a process sends a message, it eventually gets correctly delivered. Perfect links also guarantee that no message is delivered more than once. Together, reliable delivery and no duplicate messages means that every message sent by a correct process is delivered exactly once. In practice, although a 100% guarantee of exactly-once delivery is very difficult (if not impossible) to achieve, with careful engineering we can get fairly close to that ideal.

The last type of link is Byzantine (or arbitrary). If we deliver messages over a network where we cannot trust the actors, we need to assume that links will exhibit arbitrary or adversarial behaviour. On an unsecure network, for example, attackers can potentially interfere and manipulate network packets in unknown ways and alter the payload of delivered messages.

The way to deal with Byzantine link failures is to secure network communication using cryptography primitives like transport layer security (TLS). The TLS protocol prevents an active adversary spoofing our traffic by allowing the sender and receiver of messages to verify their integrity and correctness. By using a cryptographic layer, we can add Byzantine failure guarantees to any of the link abstractions covered so far. This allows us to guarantees against attackers to fair links, stubborn links, and perfect links through the addition of cryptographic verification.

Fundamentally, communication over the Internet is unreliable. To further understand the communication guarantees available over such an unreliable network, let’s examine the fundamental problems of communication over an unreliable network, and two widely deployed communication protocols that make use of an unreliable network, yet still get work done.

UDP is a simple, stateless protocol. RFC 1122 refers to UDP as the “almost a null protocol”; while that is something of a harsh assessment, UDP is indeed fairly basic. An old bit of Internet humor about UDP’s unreliability has it that if I send you a UDP joke, you might not get it. The two features it adds beyond the IP layer are port numbers and a checksum. The port numbers are what makes UDP into a real transport protocol: with them, an application can now connect to an individual server process (that is, the process “owning” the port number in question), rather than simply to a host.

With UDP communication is achieved by transmitting information in one direction from source to destination without verifying the readiness or state of the receiver. While this simplicity is great for minimal applications, it presents problems of reliability. Many real-world applications require some form of congestion control to keep the network stable. Typically, this means that UDP-based protocols still must implement some form of congestion control, retry, and backoff that is found with more reliable link abstractions.

With the ability to detect and correct for duplicate and out-of-order packets, TCP provides a provides at-least-once delivery and exactly-one processing, with regards to the following definitions:

• At-least-once delivery: a TCP message will be delivered at least once to the destination. More specifically, it will keep re-transmitting with specific timeouts if no ACK(knowledgement) is received, so that it will eventually be delivered. However, if some of these re-transmissions were not lost (but just delayed), then more than one copy of the message will be delivered.
• Exactly-once processing: each TCP message will be processed by the destination node exactly once. More specifically, the destination will watch out for duplicate messages (checking the IDs of each received message). So, even if a message is delivered twice, the destination node will only process it (pass it to the application level) once and ignore the duplicates received later.

The Two General’s Problem applied to TCP shows why the two computers in the connection can’t simultaneously have common knowledge about the state of the connection.

The way every distributed agreement protocol deals with this issue is to always promise safety (nothing bad will happen), but not to guarantee liveness (that progress will eventually be made). When the network is operating well, you can try to do your best and hope to make progress, but occasionally, progress will halt for some period of time.

In TCP it means that an endpoint can make an assumption (such as “connection established”) without definitely knowing the other’s state. However, it is not an unsafe assumption to make; at worst, it is a benign misunderstanding. After a timeout, it will change its opinion. It is no different from being on one end of a long-distance telephone and continuing to talk thinking the connection is still on; after a while, you may have to ask “hello, you still there?”, and time out. Real world protocols must always have timeouts (unlike asynchronous formal models) because somewhere up the stack they serve some human function, and human patience is limited. In practice, there are sufficiently good stretches of time that progress can be made, so we just have to pick appropriate timeouts that don’t time out too early either.

Both TCP and UDP can implement the fair-loss links model. Fair-loss links drop a certain percentage of messages by the network. Fair-loss links also assume a finite amount of message duplication. If using TCP, dropped messages are retransmitted by the protocol, and duplicate messages are removed before being delivered to the application. If using UDP, it is up to the application developer to either accept message loss, or to resend messages if desired.

If we begin with UDP, stubborn links can be implemented by resending messages repeatedly.

With stubborn links, it is up to the message receiver to check whether a messages has already been delivered or not. Adding mechanisms for detecting and suppressing message duplicates, in addition to retransmission of failed messages, creates a higher-level form of perfect link.

Perfect links require reliable delivery of messages, with no duplication. These are the same guarantees that TCP provides. However, there are some exceptions: TCP can only provide these guarantees for as long as the incoming packet queue has space to store packets, or for as long as the process survives without crashing. The TCP algorithm for preventing duplicates stores internal state in volatile memory. If the process crashes, this state is lost. Upon recovery the process no longer remembers which messages have already been delivered and it might deliver the same message twice. Since the sender cannot reliably detect that the process has crashed, it has no other choice than to simply resend the message, which may result in duplicates.

In practice, this means that to achieve perfect links in the real-world, we need the retry and duplicate detection of TCP, coupled with a way to store message delivery state that survives between process crashes. This is not implemented by TCP and is the responsibility of the application developer.

In the Byzantine model, the communication links themselves can behave in arbitrary ways. The closest analog to this behaviour in the real world is a network that has been compromised by an attacker. In this case, cryptographic authentication can be used to turn any of our implementations into an authenticated version. This authenticated version of the link can verify the integrity of messages sent through the network.

## Timing models

An important characteristic of a distributed system is how processes and links react to the passage of time. The timing bounds and communication delays we can assume for our system dictate a lot about the final algorithm we can use to solve a problem. Because of this central importance of time in distributed algorithms, we dedicate an entire chapter to the issue later in this book.

### Asynchronous timing

An asynchronous model assumes nothing about how processes and links handle time. This model assumes that processes do not have access to any sort of physical clock (even locally), and that we cannot assume any upper bound on processing delays or communication delays. Without access to a physical clock, processes cannot even use timeouts to know when the sending of a message has failed.

This model is the most restrictive. Even then, it is possible to design algorithms that operate without any timing assumptions. Although the asynchronous model does not assume any access to physical clocks, we can still signal the passage of time through the sending and receiving of messages. This means that time is defined by the communication between processes. This is often defined as logical time or by using a logical clock to distinguish it from physical clocks.

### Synchronous timing

The synchronous timing model assumes there is an upper bound on both processing delay, message transmission delay, and on the size of clock errors.

The synchronous model assumes that the time taken by a process to execute any step of the algorithm is always less than some upper bound. Similarly, we assume that the time period between when a message is sent from one process and delivered at its destination is always less than some upper bound. Lastly, although the synchronous model assumes that each process has access to a local physical clock, we assume that it can only deviate from a known correct clock by an upper bound.

The restrictions of the synchronous model create several useful mechanisms for designing algorithms:

• Failure detection: Every crash of a process can be detected within a bounded amount of time. This can be achieved by using a heartbeat mechanism to detect crashed processes.
• Time-based locks: We can coordinate between processes using time-based locks and leases. Each process can assume the right to execute some action on a resource for a defined fixed period of time, after which the lock or lease expires and a separate process can execute an action on the resource.
• Synchronized clocks: A synchronous system makes it possible to have fully synchronized clocks so that the local clock on each process does not differ from more than some known upper bound. Synchronized clocks make it possible to timestamp events using the value of a local clock and can be used to order events and coordinate actions across systems.

Generally, most distributed systems are synchronous, most of the time. There are, however, periods where the timing assumptions of the synchronous model do not hold. For example, if the network is overloaded, or some process experiences a garbage collection pause, we will exceed our fixed upper bound on communication or processing delay.

Unfortunately, this means it is not possible to build a fully synchronous algorithm over over a large-scale network such as the Internet. We simply cannot that meets the requirements of the model in the real world and it is not feasible to build a system where the upper bound timing assumptions hold with high enough probability to build distributed algorithms on top of. For example, on the Internet, it is possible there are periods where messages can take a very long time to arrive at their destination — exceeding the fixed upper bound on transmission delay of the synchronous model.

### Partially synchronous timing

Although most systems are synchronous most of the time, there are periods when we cannot respect the upper bound on communication delay or processing delay and the system temporarily exhibits the properties of the asynchronous model. For example, although most of the time messages are delivered to recipients within a fixed time period, there is a possibility that a message buffer becomes overloaded and the delivery mechanism begins dropping messages. If we assume that messages are retransmitted to ensure reliable delivery, the retransmission introduces unpredictable delays that fail be delivered within a fixed transmission upper bound.

In this sense, practical systems built in the real world operate using a partially synchronous model. This means that the system behaves like a synchronous system most of the time, but that it sometimes exceeds the upper bounds for any network delay, processing delay, or clock drift.

To formally leverage the idea of partial synchrony for designing distributed algorithms, we can assume that there are long periods of time during which the system is synchronous, and that these periods are long enough for the algorithm to complete some real work and potentially terminate its execution. In a way, we assume that the system is asynchronous, but that it is eventually synchronous. This makes sense because most of the time both networks and processes operate correctly — otherwise nothing would ever work. But we do have to be aware that some of the time any assumptions we make on the time delay within our network will fail and that the delays may become arbitrarily large.

## Real world timing models

The natural real world analogy to links is the Internet. The Internet behaves nicely most of the time, making it akin to a synchronous model where packets are generally received. Yet the Internet behaves erratically at times-losing packets, dropping connections, and delaying delivery. When these erratic behaviours occur, the Internet is more like an asynchronous model. With an asynchronous model, one process can send a message to another process, but the network gives no guarantees about when it arrives. In fact, it may not arrive at all. There may be a cut in a network cable, a queue may be full and your message delayed, a server may have failed, or a file system may have become full.

On the Internet, expect to violate any timing assumptions made designing a distributed algorithm. Network delays may become arbitrarily large for reasons outside of your control. Because of this, the Internet is often described as a partially synchronous model that respects timing assumptions most of the time, and violates them some of the time.

### Asynchronous timing in the real world

Assuming an asynchronous timing model means that you make no assumptions about process or network delays—anything that can happen, happens. If you can design an algorithm under this assumption, it provides the strongest guarantees of accuracy because it survives any timing conditions it encounters in the real world.

In an asynchronous environment, you assume that processes to not have access to any sort of physical clock, and that the time process and network delays are arbitrarily large. Since physical clocks can be inaccurate, and time and network delays can become large, algorithm’s designed for an asynchronous model can guarantee correctness in real-world situations.

Even in the absence of physical clocks, it’s still possible to define useful algorithms that rely on the sending and receiving of messages to signal the passage of time. Some algorithms do need some physical timing assumptions.

### Synchronous timing in the real world

Assuming a synchronous model assumes that there is a known upper bound on the length of processing or network delays. This means that the time taken by a process to execute a step of an algorithm is less than some value epsilon. It also means that there is on upper bound on message transmission delays.

The main difficulty in assuming a synchronous system model is providing enough coverage. It’s difficult to build a system where the synchronous timing assumptions hold with high enough probability to be confident in the algorithm.

### Partially synchronous timing in the real world

Generally, distributed systems appear to be synchronous. For most systems, most of the time, there are physical time bounds on process and network delays, and periods of time where process and network delays become unbounded. When the network overloads, or a process has a shortage of memory, or a buffer storing incoming or outgoing overflows, messages may get lost and retransmitted, violating time bound you set for message delivery.

Partial synchrony is a realistic way to model the majority of real world systems.

## A practical distributed system model

Having covered the different process, link, and timing models available. The last task is to assemble the different models into ones that capture the requirements of a practical Internet-based software system.

The process model covers crash-stop, crash-recovery, and byzantine processes. In the crash-stop model, a process can fail in one way—crashing. After crashing, the process stops responding to or sending messages. The process never recovers. With crash-recovery, it’s possible for a process to recover after some unknown amount of time and resume operation using some form of stable storage that retains data across crashes. Lastly, Byzantine processes operate in any way possible, including trying to trick or deceive other processes. Of these the three process models, the most practical for real world algorithms is the crash-recovery model. In most circumstances, processes have access to stable storage in the form of a hard disk that you can rely on to recover state between crashes.