How to Deal With Complexity When Designing Software Systems

by Aleksei, February 5th, 2024

Too Long; Didn't Read

Complexity is the enemy! Let's learn how to deal with that!

What is it all about?

Every day, every moment during our engineering career, we encounter many different problems of various complexity and situations where we need to make a decision or postpone it due to lack of data. Whenever we build new services, construct infrastructure, or even form development processes, we touch a huge world of various challenges.

It is challenging, and perhaps even impossible, to list all the problems. You will encounter some of these issues only if you work in a specific niche. On the other hand, there are many that we all must understand how to solve, as they are crucial for building IT systems. With a high probability, you will encounter them in all projects.

In this article, I will share my experiences with some of the problems I have encountered while creating software programs.

What is a Cross-Cutting Concern?

If we look into Wikipedia, we will find the following definition:

In aspect-oriented software development, cross-cutting concerns are aspects of a program that affect several modules, without the possibility of being encapsulated in any of them. These concerns often cannot be cleanly decomposed from the rest of the system in both the design and implementation, and can result in either scattering (code duplication), tangling (significant dependencies between systems), or both.

It describes the concept well, but I want to extend and simplify it a little bit:

A cross-cutting concern is a concept or component of the system/organisation that affects (or 'cuts across') many other parts.

Good examples of such concerns are system architecture, logging, security, transaction management, telemetry, and database design, among many others. We are going to elaborate on many of them later in this article.

On the code level, cross-cutting concerns are often implemented using techniques like Aspect-Oriented Programming (AOP), where these concerns are modularized into separate components that can be applied throughout the application. This keeps the business logic isolated from these concerns, making the code more readable and maintainable.
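As a minimal sketch of this idea, the logging concern below is kept out of the business logic with a plain decorator, which is a lightweight alternative to a full AOP framework. The names IPaymentService, PaymentService, and LoggingPaymentService are hypothetical and exist only for illustration.

```csharp
using System;

// Hypothetical business interface; not part of any real library.
public interface IPaymentService
{
    decimal Charge(string accountId, decimal amount);
}

// Business logic only: there is no logging code in here.
public sealed class PaymentService : IPaymentService
{
    public decimal Charge(string accountId, decimal amount)
    {
        if (amount <= 0)
            throw new ArgumentOutOfRangeException(nameof(amount), "Amount must be positive.");

        return amount; // pretend the charge succeeded and return the charged amount
    }
}

// The cross-cutting concern (logging) lives in one place and wraps any IPaymentService.
public sealed class LoggingPaymentService : IPaymentService
{
    private readonly IPaymentService _inner;
    private readonly Action<string> _log;

    public LoggingPaymentService(IPaymentService inner, Action<string> log)
    {
        _inner = inner;
        _log = log;
    }

    public decimal Charge(string accountId, decimal amount)
    {
        _log($"Charging {amount} to {accountId}");
        try
        {
            var charged = _inner.Charge(accountId, amount);
            _log($"Charged {charged} to {accountId}");
            return charged;
        }
        catch (Exception ex)
        {
            _log($"Charge failed for {accountId}: {ex.Message}");
            throw; // rethrow so callers still see the original failure
        }
    }
}
```

Wrapping the service as new LoggingPaymentService(new PaymentService(), Console.WriteLine) adds logging to every call without touching PaymentService itself.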

Aspects Classification

There are many possible ways to classify aspects, segmenting them by properties such as scope, size, functionality, importance, target, and others. In this article, I am going to use a simple classification by scope, that is, by where a specific aspect is directed: the whole organisation, a particular system, or a specific element of that system.

So, I am going to split aspects into Macro and Micro.

By Macro aspects, I mean mainly the considerations we follow for the whole system, such as the chosen system architecture and its design (monolithic, microservices, service-oriented architecture), technology stack, organisation structure, etc. Macro aspects are related mainly to strategic and high-level decisions.

In contrast, Micro aspects are much closer to the code level and development: for instance, which framework is used for interacting with the database, the project structure of folders and classes, or even specific object design patterns.

While this classification is not ideal, it helps to structure an understanding of possible problems and the importance and impact of solutions we apply to them.

In this article, my primary focus will be on the macro aspects.

Macro Aspects

Ubiquitous Language, Domain Model, Bounded Context

Domain-Driven Design (DDD) is a software development methodology that focuses on building a domain model with a deep comprehension of the domain's processes and rules, based on input from domain experts. This approach is detailed in Eric Evans's 2003 book, which introduces DDD through a collection of patterns.

DDD defines plenty of tactical and strategic design patterns that describe ways of modelling systems. However, the most important ones, from my personal view, are Ubiquitous Language, Domain Model, and Bounded Context.

Ubiquitous Language refers to the common language shared among users, developers, and domain experts. It is then utilized within the Domain Model to accurately describe the domain. Certainly, a domain model encompasses much more than just language: it includes rules, policies, entities, algorithms, and more. However, the primary challenge is that defining a single unified domain model is highly complex and likely impossible. This is where the concept of Bounded Context comes into play.

So, Ubiquitous Language doesn't make much sense without a Bounded Context, since this language must be unambiguous among everyone involved in the development process. The same term could be used to refer to different things in various parts of the system. For instance, the term 'Customer' in the Bank Account Context and 'Customer' in the Users Context might be named the same but have entirely different meanings. So, whenever people talk about a 'Customer,' they could be thinking about different entities.

So, the Bounded Context is about language and boundaries where this language is unambiguous.
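A minimal sketch of how the two meanings of 'Customer' could look in code, assuming each bounded context gets its own namespace. The properties Iban and CreditLimit are hypothetical and only illustrate that the two models do not have to share a shape.

```csharp
using System;

namespace BankAccountContext
{
    // Here a customer is the owner of accounts and is described by banking attributes.
    public sealed record Customer(Guid CustomerId, string Iban, decimal CreditLimit);
}

namespace UsersContext
{
    // Here a customer is a person who can sign in and be contacted.
    public sealed record Customer(Guid CustomerId, string Email, string Name);
}
```

Both types are called Customer, but inside each namespace the word has exactly one meaning, which is what keeps the language unambiguous.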

The primary application of Domain-Driven Design (DDD) lies in managing complexity within intricate domains and systems. For simple projects, there's little practical benefit to employing DDD beyond educational purposes. If a system only encompasses a couple of contexts, as in the given example, applying DDD practices might not be necessary. However, consider a scenario where your system is vast, containing hundreds of distinct areas, with over a hundred people contributing. In such cases, DDD becomes significantly more advantageous.

The main question I had at the beginning, and still encounter from time to time, is how to properly identify the ubiquitous language, domain model, and bounded contexts. The answer, in one form or another, is event storming, although the details depend heavily on the specific situation. Since this topic is very broad, I'll skip it for now, but this technique is very useful for such tasks.

But before we move forward, let us take a look at how we can apply Ubiquitous Language in a project structured with the Clean Architecture approach.

Clean Architecture

Clean Architecture is a software design philosophy emphasising the separation of concerns, aiming to make systems more understandable, flexible, and maintainable. It was popularized by Robert C. Martin (Uncle Bob) and is characterised by its emphasis on software structure that promotes independence from frameworks, UI, databases, and any external agency. The ultimate goal is to create decoupled systems where business rules and policies are isolated from external influences, making the system easier to develop, test, and maintain over time.

The original article describes its concepts in detail: https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html

The Dependency Rule is a core concept of Clean Architecture that dictates that source code dependencies can only point inwards. As you move inward, the level of abstraction increases. The outer layers implement interfaces defined in the inner layers. This ensures that the inner layers (such as entities and use cases) are not dependent on the outer layers (like UI and databases) and remain isolated from external changes and frameworks.

Another core concept is placing the Domain and Application layers at the centre of the architecture. Doing so aligns closely with Domain-Driven Design (DDD) principles, emphasising the business's core needs and logic. This approach simplifies adapting and evolving the software in tune with business changes, embodying the essence of DDD.

Although there are many differences between Clean Architecture, Onion Architecture, and Ports and Adapters, they all follow the same Dependency Rule.

This is the link to a series of foundational articles about Onion architecture https://jeffreypalermo.com/2008/07/the-onion-architecture-part-1/

The main question I had was: how many layers are permitted in these types of architecture? The answer is simple: as many as you need, as long as you follow the Dependency Rule. Here is what Robert C. Martin writes about it:

Only Four Circles?

No, the circles are schematic. You may find that you need more than just these four. There’s no rule that says you must always have just these four. However, The Dependency Rule always applies. Source code dependencies always point inwards. As you move inwards the level of abstraction increases. The outermost circle is low level concrete detail. As you move inwards the software grows more abstract, and encapsulates higher level policies. The inner most circle is the most general.

Okay, let us apply these principles to an example service project structure. In the following picture, we can see three layers.

First layer. The core contains the Application Core (use cases), Domain, and Unit Tests. This is the heart of our service and includes the business logic.
Second layer. In the Infrastructure and Adapters layer, we include implementation details such as Persistence, Messaging, Security, and Interactions with external systems, among other aspects that do not dictate the domain.
Third layer. The outer layer contains entry points to our application, which could include Web APIs, Message Handlers, Integration Tests, CLIs, and many other things.

Simple? Indeed it is :)

Finally, let us take a closer look at how to apply Ubiquitous Language here.

The domain model is very primitive: it contains an Account entity and a Customer object.

public class Account : IEntity
{
    public Guid Id { get; }
    public Guid CustomerId { get; }

    // omitted for clarity
}

public class Customer
{
    public Guid CustomerId { get; }
    public string Email { get; }
    public string Name { get; }
}

As you can see, Customer is not represented as an entity, because the single source of truth for the Customer entity is another part of the system. However, in this context, we still have the word 'Customer' in the domain language.

In order to retrieve customer info to execute operations with Account, we expose ICustomerAdapter in the Domain language.

public interface ICustomerAdapter
{
    Task<Customer> GetCustomerByIdAsync(Guid id, CancellationToken cancellationToken);
}

Also, we need to be able to store new Accounts created by Customers.

So, we are simply exposing IAccountRepository from ApplicationCore.

public interface IAccountRepository
{
    Task SaveAccountAsync(Account account, CancellationToken cancellationToken);
    Task<Account> GetAccountByIdAsync(Guid id, CancellationToken cancellationToken);
}
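To illustrate the Dependency Rule on this interface, here is a minimal sketch of how the Infrastructure layer could implement it. The in-memory store and the public Account constructor are assumptions made for illustration; a real implementation would most likely talk to a database.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for the domain entity (assumption: a public constructor).
public sealed class Account
{
    public Account(Guid id, Guid customerId)
    {
        Id = id;
        CustomerId = customerId;
    }

    public Guid Id { get; }
    public Guid CustomerId { get; }
}

// Defined in ApplicationCore (inner layer); repeated here to keep the sketch self-contained.
public interface IAccountRepository
{
    Task SaveAccountAsync(Account account, CancellationToken cancellationToken);
    Task<Account> GetAccountByIdAsync(Guid id, CancellationToken cancellationToken);
}

// Lives in the Infrastructure layer (outer) and depends on the inner layer, never the reverse.
public sealed class InMemoryAccountRepository : IAccountRepository
{
    private readonly ConcurrentDictionary<Guid, Account> _accounts = new();

    public Task SaveAccountAsync(Account account, CancellationToken cancellationToken)
    {
        cancellationToken.ThrowIfCancellationRequested();
        _accounts[account.Id] = account; // insert or overwrite by account id
        return Task.CompletedTask;
    }

    public Task<Account> GetAccountByIdAsync(Guid id, CancellationToken cancellationToken)
    {
        cancellationToken.ThrowIfCancellationRequested();
        if (_accounts.TryGetValue(id, out var account))
            return Task.FromResult(account);

        // Missing accounts surface as a faulted task rather than a null result.
        return Task.FromException<Account>(new KeyNotFoundException($"Account {id} was not found."));
    }
}
```

The core only knows IAccountRepository, so this class can later be swapped for a database-backed one without changing the domain or the use cases.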

And we should not be afraid to use the Domain language in our outer layer either.

public sealed class AccountsController : ControllerBase
{
    private readonly IAccountRepository _accountRepository;

    public AccountsController(IAccountRepository accountRepository)
    {
        _accountRepository = accountRepository;
    }

    [HttpGet]
    public async Task<IActionResult> Get(Guid accountId, CancellationToken cancellationToken)
    {
        var account = await _accountRepository.GetAccountByIdAsync(accountId, cancellationToken);

        //convert to Model

        //...
    }
}

So, the structure now looks like this

Thus, we are utilising Domain language across layers, simplifying our structure and making it much easier to understand and maintain.

There are really two complicated problems in software engineering: Cache invalidation and Naming :)

Naming is especially painful when you need to define different types of the same thing in the project, like Dto, Entity, Model, etc. To simplify this, I learned a really useful and simple convention from a colleague:

Account - Domain language
AccountEntity - How we store it (usually it is combined with Domain language for simplicity)
AccountDto - DTOs for interacting with external systems, usually in Adapters and Gateways
AccountModel or Account - for exposing in an API like REST, GraphQL, gRPC, etc.
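A minimal sketch of how this naming convention could look in code. The Balance property, the string-based shape of AccountDto, and the AccountMapper class are hypothetical; AccountEntity is left out because, as noted above, it is usually combined with the domain type.

```csharp
using System;
using System.Globalization;

// Domain language: the concept the business talks about.
public sealed record Account(Guid Id, Guid CustomerId, decimal Balance);

// How an external system represents the account (used in Adapters and Gateways).
public sealed record AccountDto(string AccountId, string CustomerId, string Balance);

// What our own API exposes (REST, GraphQL, gRPC, etc.).
public sealed record AccountModel(Guid Id, decimal Balance);

public static class AccountMapper
{
    // Translate the external representation into the domain language at the boundary.
    public static Account ToDomain(AccountDto dto) =>
        new(Guid.Parse(dto.AccountId),
            Guid.Parse(dto.CustomerId),
            decimal.Parse(dto.Balance, CultureInfo.InvariantCulture));

    // Expose only what API consumers need; CustomerId stays internal here.
    public static AccountModel ToModel(Account account) =>
        new(account.Id, account.Balance);
}
```

With the suffixes in place, a reader can tell from the type name alone which layer a value belongs to.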

Okay, now let us move on to the intricate topic of organisation structure.

Organisation structure

When I had just started to learn about software architecture, I read many interesting articles about Conway’s law and its impact on organisational structure. Especially this one. So, this law states that:

Any organisation that designs a system (defined broadly) will produce a design whose structure is a copy of the organisation’s communication structure.

I have always believed that this concept is indeed very universal and represents the Golden Rule.

Then I started to learn Eric Evans’s Domain-Driven Design (DDD) approach for modeling systems. Eric Evans emphasises the importance of Bounded Context identification. This concept involves dividing a complex domain model into smaller, more manageable sections, each with its own limited set of knowledge. This approach aids in effective team communication, as it reduces the need for extensive knowledge of the entire domain and minimises context switching, thus making conversations more efficient. Context switching is the worst and most resource-consuming thing ever. Even computers struggle with it. Although a complete absence of context switching is unlikely to be achievable, I reckon that is what we should strive for.

Returning to Conway’s Law, I have found several issues with it.

The first issue I've encountered with Conway's Law, which suggests that system design mirrors organisational structure, is the potential for forming complex and comprehensive Bounded Contexts. This complexity arises when the organisational structure is not aligned with domain boundaries, leading to Bounded Contexts that are heavily interdependent and loaded with information. It leads to frequent context-switching for the development team.

Another issue is that organisational terminology leaks to the code level. Changes in the organisational structure then necessitate codebase modifications, consuming valuable resources.

Thus, following the Inverse Conway Maneuver, that is, shaping teams and their communication around the architecture we want rather than the other way around, helps to build a system and an organisation that encourage the desired software architecture. However, it is worth noting that this approach does not work very well with already-formed architectures and structures, since changes at that stage take a long time. It performs exceptionally well in startups, since they are quick to introduce changes.

Big Ball of Mud

This pattern, or rather “anti-pattern”, describes a system built without any architecture. There are no rules, no boundaries, and no strategy for controlling the inevitably growing complexity. Complexity is the most formidable enemy in the journey of building software systems.

To avoid constructing this type of system, we need to follow specific rules and constraints.

Systems Theory and Cybernetics

Have you heard anything about Systems theory or Cybernetics?

Based on the definition in Wikipedia,

Systems theory is the transdisciplinary study of systems, i.e. cohesive groups of interrelated, interdependent components that can be natural or artificial.

And for Cybernetics

Cybernetics is a field of systems theory that studies circular causal systems whose outputs are also inputs, such as feedback systems. It is concerned with the general principles of circular causal processes, including in ecological, technological, biological, cognitive and social systems and also in the context of practical activities such as designing, learning, and managing.

And here is a description of how it relates to software engineering:

In the context of software engineering, cybernetics can be defined as the study and application of feedback loops, control systems, and communication processes within software development and operational environments. It focuses on how systems (software and hardware, processes, and human interactions) can be designed and managed to achieve desired goals through self-regulation, adaptation, and learning. Cybernetics in software engineering emphasises creating systems that can adjust to changes, learn from interactions, and improve over time, ensuring reliability, efficiency, and resilience.

So, the study of Systems Theory and Cybernetics can be applied to Systems Engineering.

Systems engineering is an interdisciplinary field of engineering and engineering management that focuses on how to design, integrate, and manage complex systems over their life cycles.

That sounds exactly like what we are doing: designing and managing complex systems, and also managing the complexity of such systems.

However, let's take a closer look at the key Cybernetics concepts.

Concept	Application to Software Engineering
Systems Thinking	Cybernetics encourages viewing software architecture not just as a collection of independent components but as a cohesive system where the components interact with each other in complex ways. Example: When Service B handles an event published by Service A, the outcome does not affect Service A directly. However, the overall result of the operation is significant to the system as a whole.
Feedback Loops	A core concept in cybernetics is the use of feedback loops to control and stabilize systems. Example: In software architecture, feedback loops can be implemented in various forms, such as monitoring system performance, user feedback mechanisms, or continuous integration/continuous deployment (CI/CD) pipelines.
Adaptability and Learning	Cybernetics promotes the idea that systems should be capable of adapting to changes in their environment. For software architecture, this means designing flexible systems that can evolve over time. Example: This could involve using microservices that can be updated independently, employing feature toggles for managing new features, or incorporating machine learning algorithms that improve with more data.
Goal-Oriented Design	Cybernetic systems are often defined by their goals. In the context of software architecture, this means that the system should be designed with clear objectives in mind. Example: This involves understanding the user needs, business goals, and technical requirements, and ensuring that the architecture is aligned with these goals.
Interdisciplinary Approach	Just as cybernetics itself draws from multiple disciplines (e.g., engineering, biology, psychology), applying its principles to software architecture encourages a multidisciplinary approach. Example: This could involve incorporating insights from data science, user experience design, business strategy, and more to create a holistic and effective architecture.
Redundancy and Resilience	Cybernetics recognises the importance of redundancy in maintaining the stability of systems. In software architecture, this principle can be applied by designing systems that are resilient to failures. Example: This might include strategies like replicating critical components, implementing failover mechanisms, and designing for disaster recovery.
Communication and Information Flow	Effective communication and information flow are key concepts in cybernetics. For software architecture, this emphasizes the importance of designing systems where data can flow seamlessly between components, and where communication protocols are efficient and reliable. Example: A distributed system detects a failure in one of its components and automatically reroutes traffic to healthy instances, minimizing downtime and maintaining service availability.

So, in other words:

Systems Thinking - view the system as a set of components that interact with each other
Feedback Loops - receiving feedback leads to higher system quality
Adaptability and Learning - build evolutionary architectures
Goal-Oriented Design - designing software for specific goals
Interdisciplinary Approach - gain insights from data, UX, and business strategy to make decisions
Redundancy and Resilience - don’t forget about backups and failover, nothing is 100% reliable
Communication and Information Flow - utilise data streams between components to foster system automation
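As a small illustration of two of these points together, the sketch below combines Redundancy and Resilience (try the next replica when one fails) with a Feedback Loop (report every failure so that monitoring can react). FailoverClient and its parameters are hypothetical names, not part of any real library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical client that tries replicas in order until one succeeds.
public sealed class FailoverClient<TResult>
{
    private readonly IReadOnlyList<Func<TResult>> _replicas;
    private readonly Action<string> _report;

    public FailoverClient(IReadOnlyList<Func<TResult>> replicas, Action<string> report)
    {
        if (replicas.Count == 0)
            throw new ArgumentException("At least one replica is required.", nameof(replicas));

        _replicas = replicas;
        _report = report;
    }

    public TResult Execute()
    {
        var failures = new List<Exception>();
        for (var i = 0; i < _replicas.Count; i++)
        {
            try
            {
                return _replicas[i](); // first healthy replica wins
            }
            catch (Exception ex)
            {
                // Feedback loop: every failure is reported so that monitoring can react.
                _report($"Replica {i} failed: {ex.Message}");
                failures.Add(ex);
            }
        }

        // Nothing is 100% reliable: surface all collected failures to the caller.
        throw new AggregateException("All replicas failed.", failures);
    }
}
```

A production setup would usually add timeouts, health checks, and backoff on top of this, but the core idea of rerouting to a healthy instance stays the same.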

From one perspective, this seems quite obvious, right? From another, it provides room for analysis and offers a fresh viewpoint on ordinary things.

Let us quickly take a look at Systems Theory’s main concepts before we move forward.

Concept	Description
Holism	This concept focuses on the system as a whole rather than its individual parts. In software engineering, this means considering how all parts of a software system (e.g., modules, functions, infrastructure) work together to achieve the desired outcomes. Design decisions are made with an understanding of their impact on the entire system. Very similar to Systems Thinking.
Interconnectivity and Interdependence	Systems are composed of interconnected and interdependent components. In software systems, changes in one module can affect others.
Hierarchy	Systems are organized in hierarchies of subsystems. Software systems often have a hierarchical structure, with high-level modules depending on lower-level modules for functionality. This hierarchical decomposition helps manage complexity by breaking down the system into more manageable parts.

Actually, there are many more of them; I just selected the ones that are easiest to understand with respect to Software Engineering.

Loose coupling and High Cohesion

Both low coupling and high cohesion enhance the design and functionality of systems in ways that are synergistic with Systems Theory principles. By ensuring that system components have minimal dependencies on each other (low coupling), while each component is highly specialized and effective in its role (high cohesion), the overall system becomes more than the sum of its parts. This is a core tenet of Systems Theory, which sees the interactions and relationships within a system as key to its behaviour and performance.

This illustration from Wikipedia demonstrates these traits well:

The second result usually happens when the system is poorly designed.

Okay, any system represents, in one way or another, a set of interconnected components that is organised in some hierarchy. The analysis and design of new components should be done with a holistic approach.

So, the concepts of both Systems Theory and Cybernetics lead to the conclusion that we must have a specific structure and rules to manage the evolution and complexity of complex systems.

Now, let us take a look at the System architecture.

System architecture

There are myriad definitions of Software Architecture. I like many of them since they cover different aspects of it. However, to be able to reason about architecture, we naturally need to form a definition of our own. It is worth noting that this definition may evolve. So, at least for now, I have the following description for myself.

Software Architecture is about decisions and choices you make every day that impact the built system.

To make decisions, you need to have principles and patterns in your “bag” for solving the problems that arise. It is also essential to state that understanding the requirements is key to building what a business needs. However, sometimes requirements are not transparent or not even defined. In this case, it is better to wait for more clarification or to rely on your experience and trust your intuition. Either way, you cannot make decisions properly if you do not have principles and patterns to rely on. That brings me to the definition of Software Architecture Style.

Software Architecture Style is a set of principles and patterns that designate how to build software.

There are a lot of different architectural styles focused on various sides of the planned architecture, and applying multiple of them at once is a normal situation.

For instance:

Monolithic architecture
Domain-driven design
Component-based
Microservices
Pipe and filters
Event-driven
Microkernel
Service-oriented
Orchestration
Choreography

and so on…

Of course, they have their advantages and disadvantages, but the most important thing I have learned is that architecture evolves gradually, depending on actual problems. Starting with a monolithic architecture is a great choice for reducing operational complexity, and very likely this architecture will fit your needs even after reaching the Product-Market Fit (PMF) stage of building the product. At scale, you may consider moving towards an event-driven approach and microservices to achieve independent deployment, a heterogeneous tech stack environment, and a less coupled architecture (which is also less transparent, due to the nature of event-driven and pub-sub approaches, if these are adopted). Simplicity and efficiency are closely related and have a great impact on each other. Usually, complicated architectures slow down the development of new features, make supporting and maintaining existing ones harder, and hinder the system’s natural evolution.

However, complex systems often require complex and comprehensive architecture, which is inevitable.

Service types

Whenever a system evolves, more and more components and services appear, and at scale, it might be very complicated to keep track of everything and to solve everything case by case while avoiding a Big Ball of Mud. It is much easier to have a predefined set of service types, where each type defines rules such as whether its API is accessible from outside, who can call it internally, who owns it, and even whether it should be a separate service at all. So, in order to manage complexity efficiently, we need to have some hierarchy of services and components in the architecture.

What is a layer?

A layer refers to a distinct level within a system where specific types of operations or responsibilities are executed. Let us take a look at a typical architecture at a scale where it has multiple products, platform components, and different Web Applications.

As we can see here, there is a specific hierarchy including multiple layers of services and specific types of services.

Service Type	Purpose
Web Application and Public API	This refers to the components of a system designed to interact with end-users and external systems. A web application provides a user interface for human interaction, while a Public API (Application Programming Interface) offers programmable interfaces for other systems to interact with your service.
BFF	A pattern where the server-side component is designed specifically to support a particular frontend application (such as a mobile app or web client). The BFF acts as an intermediary, tailoring data and interactions to meet the unique needs and characteristics of the frontend, optimizing user experience and efficiency.
Product Workflows	These are sequences of steps or processes designed to achieve a specific outcome within a product, often involving multiple system components. Product workflows encapsulate business logic and user interactions that drive the core functionalities of a product.
Domain Macro API	This concept involves creating APIs that provide more abstract, high-level operations within a specific domain, allowing for more complex processes or business logic to be encapsulated as single API calls.
Domain Focus Service	A Domain Service is a standard service that focuses on addressing a specific problem within a domain, encapsulating business logic and operations pertinent to that domain issue.
Gateways / Adapters	These components act as intermediaries, translating between different formats, protocols, or interfaces to allow disparate systems to communicate. In software architecture, gateways often handle external communications (such as with third-party APIs), while adapters typically enable connectivity between internal components or layers, ensuring that data and operations can flow seamlessly across the system despite differences in underlying technologies or designs.

Also, you can see that the control flow is mainly represented by an orchestration style, which boosts transparency and system simplicity.

Thus, having a structure, even a basic one like this, reduces complexity when making decisions about service types and the appropriate communication flow.

What is the pragmatic start in a new project?

To be fair, this is a very broad topic, and there are many great ideas about how to structure and build systems for natural evolution. Based on my experience, I have worked out the following approach:

Almost always begin with the monolithic architecture style, since it eliminates most of the problems that arise due to the nature of distributed systems. It also makes sense to follow the modular monolith approach to focus on building components with clear boundaries. With a component-based approach, components can communicate with each other by using events, but having direct calls (aka RPC) simplifies things in the beginning. However, it is important to track dependencies between components: if component A knows a lot about component B, perhaps it makes sense to merge them into one.
When you come closer to the point where you need to scale your development and system, you could consider following the Strangler pattern to gradually extract components that need to be deployed independently or even scaled with specific requirements.
Now, if you have a clear vision of the future, which takes a bit of incredible luck, you could decide on the desired architecture. At this point, you could move towards a microservices architecture, applying Orchestration and Choreography approaches and incorporating the CQRS pattern to scale write and read operations independently, or even decide to stick with the monolithic architecture if it fits your needs.
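As a rough sketch of the first step, here is how two components of a modular monolith could communicate through in-process events instead of knowing about each other directly. InProcessEventBus, AccountOpened, and WelcomeNotifier are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-process event bus for a modular monolith (not thread-safe, kept minimal).
public sealed class InProcessEventBus
{
    private readonly Dictionary<Type, List<Action<object>>> _handlers = new();

    public void Subscribe<TEvent>(Action<TEvent> handler)
    {
        if (!_handlers.TryGetValue(typeof(TEvent), out var list))
        {
            list = new List<Action<object>>();
            _handlers[typeof(TEvent)] = list;
        }

        list.Add(e => handler((TEvent)e));
    }

    public void Publish<TEvent>(TEvent @event) where TEvent : notnull
    {
        // Handlers are looked up by the compile-time event type and called synchronously.
        if (_handlers.TryGetValue(typeof(TEvent), out var list))
        {
            foreach (var handle in list)
                handle(@event);
        }
    }
}

// Event owned by the (hypothetical) Accounts component.
public sealed record AccountOpened(Guid AccountId, Guid CustomerId);

// The Notifications component reacts to the event without Accounts knowing about it.
public sealed class WelcomeNotifier
{
    public List<Guid> Notified { get; } = new();

    public WelcomeNotifier(InProcessEventBus bus) =>
        bus.Subscribe<AccountOpened>(e => Notified.Add(e.CustomerId));
}
```

If a component later has to be extracted into its own service, the in-process bus can be replaced with a message broker while the event contracts stay the same, which is one reason to keep these boundaries explicit from the start.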

It is also vital to understand the numbers and metrics like DAU (Daily Active Users), MAU (Monthly Active Users), RPS (Requests Per Second), and TPS (Transactions Per Second), since they help you make choices: architectures for 100 active users and for 100 million active users are different.

As a final note, I would say that architecture has a significant impact on the product’s success. A poorly designed architecture for a product that needs to scale very likely leads to failure: customers will not wait while you scale the system, they will choose a competitor, so we need to stay ahead of potential scaling. Although I admit that this is not always a lean approach, the idea is to have a scalable system, not an already scaled one. On the other hand, a very complicated and already scaled system with no customers, or no plans to get many of them, will cost your business money for nothing.
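To make the difference tangible, here is a back-of-the-envelope sketch that turns daily active users into requests per second. LoadEstimator is a hypothetical helper, and both the number of requests per user per day and the peak factor are assumptions you would replace with your own measurements.

```csharp
using System;

public static class LoadEstimator
{
    private const double SecondsPerDay = 86_400;

    // Average requests per second for a given number of daily active users.
    public static double AverageRps(long dailyActiveUsers, double requestsPerUserPerDay) =>
        dailyActiveUsers * requestsPerUserPerDay / SecondsPerDay;

    // Peak estimate: the average multiplied by an assumed peak-to-average ratio.
    public static double PeakRps(long dailyActiveUsers, double requestsPerUserPerDay, double peakFactor) =>
        AverageRps(dailyActiveUsers, requestsPerUserPerDay) * peakFactor;
}
```

Assuming 50 requests per user per day, 100 daily active users produce about 0.06 requests per second on average, while 100 million produce about 57,870. The first number fits on almost any single machine; the second one does not.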

Technology stack selection

Selecting a technology stack is also a macro-level decision since it influences hiring, system natural evolution perspectives, scalability, and system performance.

This is the list of basic considerations for choosing a technology stack:

Project requirements and complexity. For instance, a simple web application can be built with the Blazor framework if your developers have experience with it, but due to the relative immaturity of WebAssembly, choosing React and TypeScript could be a better decision for long-term success.
Scalability and Performance Needs. If you anticipate receiving a large amount of traffic, opting for ASP.NET Core over Django could be a wise choice due to its superior performance in handling concurrent requests. However, this decision depends on the scale of traffic you expect. If you need to manage potentially billions of requests with low latency, the presence of Garbage Collection could be a challenge.
Hiring, Development Time, and Cost. In most cases, these are the factors we need to care about. Time to Market, Maintenance cost, and Hiring stability drive your business needs without obstacles.
Team Expertise and Resources. The skill set of your development team is a critical factor. It is generally more effective to use technologies that your team is already familiar with unless there is a strong reason to invest in learning a new stack.
Maturity. A strong community and a rich ecosystem of libraries and tools can greatly ease the development process. Popular technologies often have better community support, which can be invaluable for solving problems and finding resources. Thus, you can save resources and focus mainly on the product.
Long-Term Maintenance and Support. Consider the long-term viability of the technology. Technologies that are widely adopted and supported are less likely to become obsolete and generally receive regular updates and improvements.

How could having multiple technology stacks affect business growth?

On the one hand, introducing one more stack could widen your hiring pool; on the other hand, it brings extra maintenance costs since you need to support both stacks. So, as I said previously, in my view only a genuine need should be an argument for incorporating more technology stacks.

But what about the principle of selecting the best tool for a specific problem?

Sometimes you have no other choice but to bring in new tools to solve a specific problem. In such cases, it makes sense to select the best solution based on the same considerations listed above.

Creating systems without high coupling to a specific technology can be a challenge. Still, it is helpful to strive for a state where the system is not tightly coupled to a technology and will not die if, tomorrow, a specific framework or tool turns out to be vulnerable or is deprecated.
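One common way to keep that coupling low is to let business logic depend on a small interface it owns, and to wrap each concrete technology in an adapter. The sketch below is only an illustration of this idea: `MessagePublisher`, `InMemoryPublisher`, `OrderService`, and the topic name are hypothetical names, not part of any particular framework.

```python
from typing import Protocol


class MessagePublisher(Protocol):
    """Port: the only thing the business logic knows about messaging."""

    def publish(self, topic: str, payload: dict) -> None: ...


class InMemoryPublisher:
    """Adapter for tests and local runs.

    An adapter backed by a real broker would expose the same method,
    so swapping technologies would not touch OrderService.
    """

    def __init__(self) -> None:
        self.sent: list[tuple[str, dict]] = []

    def publish(self, topic: str, payload: dict) -> None:
        self.sent.append((topic, payload))


class OrderService:
    """Business logic depends on the port, not on a vendor SDK."""

    def __init__(self, publisher: MessagePublisher) -> None:
        self._publisher = publisher

    def place_order(self, order_id: str, amount: float) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._publisher.publish(
            "orders.placed", {"order_id": order_id, "amount": amount}
        )


if __name__ == "__main__":
    publisher = InMemoryPublisher()
    OrderService(publisher).place_order("o-1", 10.0)
    print(publisher.sent)
```

If the messaging tool is deprecated tomorrow, only the adapter needs to be rewritten; the service and its tests stay as they are.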

Another important consideration is related to open-source and proprietary software dependencies. Proprietary software gives you less flexibility and fewer possibilities for customisation. Still, the most dangerous factor is vendor lock-in, where you become dependent on a vendor's products, prices, terms, and roadmap. This can be risky if the vendor changes direction, increases prices, or discontinues the product. Open-source software reduces this risk, as a single entity does not control it. Eliminating a single point of failure on all levels is a key to building reliable systems for growth.

Single Point of Failure (SPOF)

A single point of failure (SPOF) refers to any part of a system that, if it fails, will cause the entire system to stop functioning. Eliminating SPOFs at all levels is crucial for any system requiring high availability. Everything, including knowledge, personnel, system components, cloud providers, and internet cables, can fail.
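A rough calculation shows why this matters. The sketch below uses the standard availability formulas for components in series and for redundant replicas, under the simplifying assumption that failures are independent (which real systems only approximate). The function names and the 99% and 99.9% figures are illustrative assumptions.

```python
import math


def serial_availability(availabilities: list[float]) -> float:
    """All components must be up, so availabilities multiply."""
    return math.prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """The group is up if at least one of N independent replicas is up."""
    return 1 - (1 - availability) ** replicas


if __name__ == "__main__":
    # A chain of three components, each assumed to be 99.9% available.
    print(f"chain of three: {serial_availability([0.999] * 3):.6f}")
    # One 99% component versus two redundant 99% replicas.
    print(f"single replica: {redundant_availability(0.99, 1):.6f}")
    print(f"two replicas:   {redundant_availability(0.99, 2):.6f}")
```

Every extra component in a chain lowers overall availability, while every extra independent replica raises it. That is the reasoning behind the techniques that follow.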

There are several basic techniques we could apply to eliminate single points of failure:

Redundancy. Implement redundancy for critical components. This means having backup components that can take over if the primary component fails. Redundancy can be applied across different layers of the system, including hardware (servers, disks), networking (links, switches), and software (databases, application servers). If you host everything with one cloud provider and even keep your backups there, consider making a regular additional backup with another provider to reduce your losses in case of disaster.
Data Centers. Distribute your system across multiple physical locations, such as data centres or cloud regions. This approach protects your system against location-specific failures like power outages or natural disasters.
Failover. Apply a failover approach for all your components (DNS, CDN, load balancers, Kubernetes, API gateways, and databases). Since issues can arise unexpectedly, it's crucial to have a backup plan so that any component can be swiftly replaced with its clone when needed.
High availability services. Ensure your services are built to be horizontally scalable and highly available from the start by adhering to the following principles:

Practice service statelessness and avoid storing user sessions in in-memory caches. Instead, use a distributed cache system, such as Redis.
Avoid reliance on the chronological order of message consumption when developing logic.
Minimise breaking changes to prevent disrupting API consumers. Where possible, opt for backwards-compatible changes. Also, weigh the cost, since sometimes implementing a breaking change may be more cost-effective.
Incorporate migration execution into the deployment pipeline.
Establish a strategy for handling concurrent requests.
Implement service discovery, monitoring, and logging to enhance reliability and observability.
Develop business logic to be idempotent, acknowledging that network failures are inevitable.

Dependency review. Regularly review and minimise external dependencies. Each external dependency can introduce potential SPOFs, so it's essential to understand and mitigate these risks.
Regular knowledge sharing. Never forget the importance of spreading knowledge within your organisation. People can be unpredictable, and relying on a single individual is risky. Encourage team members to digitise their knowledge through documentation. However, be mindful of over-documenting. Utilise various AI tools to simplify this process.
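Of the high-availability principles listed above, idempotency is perhaps the easiest to illustrate in code. Below is a minimal sketch of the idempotency-key pattern, assuming the client sends the same key with every retry of one logical request. `PaymentService` and its fields are hypothetical names, and the in-memory dictionary stands in for what would normally be a shared store with an atomic check-and-write (for example, a database table with a unique constraint).

```python
class PaymentService:
    """Processes each idempotency key at most once."""

    def __init__(self) -> None:
        # Idempotency key -> stored result. In-memory only for illustration;
        # a real service would need a shared, durable store.
        self._results: dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, account: str, amount: float) -> dict:
        # A retry with a known key returns the stored result
        # instead of charging the account again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        self.charges_made += 1  # stands in for the real side effect
        result = {"account": account, "amount": amount, "status": "charged"}
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    service = PaymentService()
    first = service.charge("key-123", "acc-1", 25.0)
    # The network failed before the client saw the response, so it retries.
    retry = service.charge("key-123", "acc-1", 25.0)
    print(first == retry, service.charges_made)
```

The retry is safe because the second call does not repeat the side effect; a request with a new key is treated as a new operation.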