The System Design Cheat Sheet: Relational Databases (Part 1)

This is a continuation of a series of articles in which I briefly cover the main points of a specific topic in system architecture design. The previous article can be read , and the complete guide you can find on my . here github This article will examine the basic concepts, approaches and patterns of schema design and scaling relational databases. In the second part, we will dwell in more detail on data queries and their optimization. Introduction In system design, relational databases are a traditional choice for data storage, offering structured schema and powerful query capabilities, contrasted with NoSQL databases' flexible schemes and scalability. The following concepts are distinguished: A is a type of that stores and organizes data in tables with rows and columns. It is based on the relational model, which uses a structure that permits the identification and access of data with another piece of data in the database. relational database database An , or , is a software system that manages relational databases and facilitates the creation, updating, and administration of a relational database using SQL. It supports data integrity, transaction processing, and concurrent access to ensure efficient and secure data management stored in relational database structures. RDBMS Relational Database Management System Popular RDBMS - A comprehensive RDBMS solution for enterprise-level data management, known for its robust feature set, scalability, and security. Oracle Database - An open-source RDBMS widely used for web applications and as a component of the LAMP stack (Linux, Apache, MySQL, PHP/Python/Perl). MySQL : Microsoft developed a relational database management system to support various data types and applications, from small websites to large enterprise applications. Microsoft SQL Server : An open-source, object-relational database system known for its standards compliance, extensibility, and support for advanced data types. PostgreSQL : A C-library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. SQLite Terminology : Describes how data is stored in the database, including files, indexes, and storage mechanisms. It deals with the optimization of storage resources and data access methods. Physical Schema : Represents the abstract design of the data structure, including the tables and relationships, without detailing how data is stored physically. It focuses on defining entities, relationships, and constraints. Logical Schema : A collection of related data entries organized in rows and columns, representing a specific entity such as customers or orders. Table : A single, horizontal dataset in a table representing a specific entity instance, each row containing unique data for the defined columns. Row : A vertical dataset in a table representing an entity's attribute, where each column holds data of a particular type. Column : Rules enforced on database tables and columns to ensure data integrity and accuracy and enforce business logic, preventing invalid data entry and maintaining consistency across the database. Constraints : Ensures each row in a table has a unique identifier, disallowing NULL values. Primary key : Establishes a relationship between tables, ensuring that one table's value matches another table's primary key. Foreign key : Guarantees that all values in a column or a set of columns are distinct, allowing for one NULL value if not part of a PRIMARY KEY. Unique : Specifies a condition that each value in a column must meet, ensuring data adheres to specific rules. Check : Ensures that a column cannot hold NULL values, guaranteeing that data is always entered for specific columns. Not NULL : Assigns a default value to a column when no other value is specified during data insertion. Default : A data structure that improves the speed of data retrieval operations on a table, making finding rows based on the column values faster. Index : Predefined SQL commands stored in the database that can be executed to perform various operations, including data manipulation and transaction management. Stored Procedures : Procedures automatically executed in response to specific events on a particular table or view, such as insertions, updates, or deletions. Triggers : Virtual tables created based on the result set of an SQL statement, which can simplify complex queries, aggregate data, or restrict access to specific data within the database. Views : A physical copy of the result set of a specific SQL query stored in the database. Unlike virtual views, which query the underlying tables every time they are accessed, materialized views are updated periodically. Materialized View Database Design Strategies Top-Down It begins with an analysis of the general requirements of the system at a conceptual level before designing the detailed structure of the database. Pros Ensures a clear understanding of the organization’s information needs. Facilitates a comprehensive and strategic view of the database structure. Cons It can be time-consuming due to the initial extensive requirement gathering and analysis phase. May overlook specific lower-level data needs or details in the early stages. Bottom-Up It starts with designing the detailed data structures, such as specific tables and relationships, before aggregating them into a complete system. Pros Allows for quick starts, focusing on specific, well-understood areas of the system. Can be more adaptable to changes in specific areas of the system. Cons This may result in a lack of coherence in the overall database structure. This can lead to difficulties in integrating different parts of the database later on. ER (Entity Relationship) Diagram An Entity-Relationship (ER) diagram is a graphical representation of entities and their relationships to each other. An ER diagram helps structure and organize data requirements before developing a database. Components : Represent real-world objects or concepts in the database. Entities A type of entity that has a key Attribute Strong Entity: : An entity that does not have a primary key attribute. Weak Entity : Characteristics or properties of entities. Attributes : Used to represent the main characteristics of an entity. It represents a primary key. Key Attribute : An attribute that is composed of many other attributes Composite Attribute : An attribute that can have multiple values. Multivalued Attribute : An attribute that can be derived from another attribute Derived Attribute : Depict how entities are related to each other. Relationships One-to-One One-to-Many Many-to-One Many-to-Many Types : High-level, abstract design without specific details about technologies or data types. Conceptual : More detailed, including specific entities, relationships, and attributes but not entirely tied to a particular database management system (DBMS). Logical : Includes all constraints, indexes, and specific technologies that will be implemented in the database. Physical There are several ERD : Chen, Crow's Foot(Information Engineering), Barker's, IEC, Martin, etc. Each notation has its advantages and is chosen based on the preferences of the database designers, the complexity of the system being modeled, and the project's specific requirements. Crow's Foot and Chen's notations remain among the most popular. notations Chen notation Crow's Foot / Information Engineering Design Patterns Design patterns for relational databases are reusable solutions to common problems encountered when designing database schemas, querying data, and implementing database-related functionalities. Here are some key design patterns for relational databases: Normalization Normalization involves applying a set of rules or "normal forms" to ensure the database structure is clear, efficient, and able to handle updates and queries reliably. It aims to reduce redundancy and improve data integrity by ensuring that each piece of data is stored only once. Pros : Normalization ensures accurate and consistent data, enhancing the overall integrity of the database. Improved Data Integrity : It minimizes duplicate data storage, saving space and simplifying data management. Reduced Data Redundancy : Normalized structures simplify updates, inserts, and deletions, making maintenance tasks more straightforward. Easier Database Maintenance Cons : Normalization can lead to complex queries due to the need to join multiple tables to retrieve related data. Complex Queries : The additional joins required by normalized databases can slow query performance, especially in large or high-transaction systems. Performance Issues : Achieving optimal normalization requires careful, often complex planning and design, posing challenges for those who need more experience. Design Complexity Normal Forms Normal forms are a series of guidelines or rules used in normalization. There are several normal forms, each building upon the principles of the previous one. While achieving high levels of normalization offers significant benefits, it comes with tradeoffs regarding performance and ease of use. Balancing these factors requires careful consideration based on the specific demands of the application. First Normal Form (1NF) Each column in a table contains only a single value. Each column has a unique name. The order in which data is stored should not matter. Second Normal Form (2NF) All non-key attributes are fully functional and dependent on the primary key (PK). If the table has a PK, then each non-key attribute must be fully dependent on the entire PK and not on a subset of the PK. composite Third Normal Form (3NF) No transition dependency exists, ensuring that non-key attributes are dependent on other non-key attributes. not Boyce-Codd Normal Form (BCNF) It is a special case of 3NF. Table that should have multiple overlapping candidate keys. not 3NF wouldn't prohibit an attribute within one candidate key that depends on the part of another candidate key. Fourth Normal Form (4NF) The table should not have two or more independent and multivalued data describing the relevant entity. Fifth Normal Form (5NF) The table can be decomposed into any number of smaller tables without loss of data. Denormalization Denormalization is adding redundant data to a normalized database to improve query performance, simplify the database structure, or address issues with read-heavy database applications requiring fast data access. Pros : Denormalization can significantly speed up read operations by reducing the number of joins needed. Improved Query Performance : It simplifies query writing and understanding by reducing the complexity of database schemas. Simplified Queries : Optimizes performance for specific read-heavy operations, making it beneficial for certain applications. Enhanced Performance for Specific Tasks Cons : This introduces duplicate data, which can lead to inconsistencies and increased storage requirements. Increased Data Redundancy : The redundancy requires more effort to ensure data integrity during updates, inserts, and deletes. Higher Maintenance Overhead : There's a greater risk of inserting, updating, and deleting anomalies, potentially leading to data integrity issues. Risk of Data Anomalies Entity-Attribute-Value (EAV) EAV is a database design pattern used to represent entities. A unique identifier represents each entity, each attribute is represented as a record, and the attribute's value for the entity is stored in a separate table. This model allows for the flexible and dynamic addition of attributes without altering the database schema. Pros : Allows for the dynamic addition of attributes to entities without modifying the database schema. Flexibility : Efficiently stores entities with many attributes primarily null in traditional relational database designs. Scalability for Sparse Data : Facilitates modeling complex hierarchies and relationships that are difficult to represent in traditional relational databases. Complex Hierarchies and Relationships Cons : Retrieving data can require complex queries that join multiple tables, which can be challenging to write and maintain. Query Complexity : This can lead to performance bottlenecks due to the intensive joins required to reconstruct entity data. Performance Issues : Maintaining data integrity is more complicated due to the scattered nature of the data across multiple tables. Data Integrity Challenges Master-Detail The Master-Detail pattern involves two interconnected tables: a master table that holds primary information and one or more detail tables that contain related data linked back to the master table through foreign keys. This pattern is commonly used to manage related data entities where the master entity controls various operations of the detail entities, such as cascading updates or deletions. Pros : Establishes a clear and logical data hierarchy, making the database structure more understandable. Clear Hierarchy : Ensures data integrity through referential integrity constraints, maintaining consistent relationships between master and detail records. Data Integrity : Facilitates data organization in a way that reflects real-world relationships, making data retrieval and analysis more straightforward. Efficient Data Organization Cons : Retrieving data that spans across the master and detail tables can require complex SQL joins, potentially impacting query performance. Complex Joins for Queries : Cascading updates or deletions can unintentionally affect multiple detail records, posing a risk to data integrity if not carefully managed. Cascade Operations Risk : As the volume of detail records grows, the performance of operations that involve cascading changes or extensive joins may degrade, impacting scalability. Scalability Issues Table Inheritance Table Inheritance involves creating a table structure that mimics the inheritance of entities or objects in an application, typically using one of three approaches, each with varying strategies for how data is stored across parent and child tables: Single Table Inheritance Class Table Inheritance Concrete Table Inheritance Pros : It closely aligns the database schema with the application's object-oriented model, facilitating a more intuitive design and implementation process. Reflects Object-Oriented Models : Supports polymorphism, allowing queries to easily retrieve or manipulate instances of base classes or subclasses stored in the database. Polymorphism Support : Organizes data to improve the clarity and logical structure of the database, making it easier to understand the relationships between different entities. Efficient Data Organization Cons : Queries can become more complex, especially when dealing with polymorphic associations or needing to join multiple tables to reconstruct a single object. Complexity in Querying : The need to join multiple tables for single object reconstruction can introduce performance overhead in read operations. Potential Performance Overhead : Modifying the inheritance hierarchy can lead to challenging schema migrations, requiring careful management of parent and child tables to maintain data integrity. Schema Evolution Difficulties Star The Star Schema organizes data into a central fact table that contains quantitative metrics (facts) and foreign keys to related dimension tables, which store descriptive attributes related to the facts. This pattern resembles a star, with the fact table at the center and dimension tables radiating outwards. Pros : The structure simplifies query writing, making performing complex data analytics and reporting easier. Simplified Queries : Optimized for query performance in analytical processing, facilitating fast data retrieval. Improved Performance : The clear separation between dimensions and facts makes the schema easy to understand and navigate, even for non-technical users. Intuitive Design Cons : Dimension tables can introduce redundancy, increasing storage requirements. Data Redundancy : Extract, Transform, and Load (ETL) processes can become complex and time-consuming due to the need to populate multiple dimension tables. Complex ETL Processes : The design is optimized for querying and reporting, making it less suitable for transactional processing where data is frequently updated. Not Suitable for Transactional Processing Snowflake The Snowflake Schema organizes data into a central fact table surrounded by normalized dimension tables. Unlike the Star Schema, where dimension tables are denormalized, in the Snowflake Schema, dimension data is broken down into additional tables to eliminate redundancy and enforce data integrity. Pros : Normalization of dimension tables reduces data redundancy, leading to more efficient storage use. Reduced Data Redundancy : The normalization process enhances data integrity and consistency across the database. Improved Data Integrity : This allows a more detailed representation of data hierarchies and relationships within dimensions. Detailed Data Hierarchy Cons : The additional joins required to navigate the normalized dimensions can make queries more complex and challenging to write. Increased Query Complexity : The normalization and additional table joins may lead to slower query performance than the Star Schema. Potential Performance Overhead : Populating the normalized dimension tables can complicate ETL (Extract, Transform, Load) operations, requiring more sophisticated logic and processing time. Complex ETL Processes Audit logging Audit Logging pattern in databases is a design approach focused on systematically recording changes to data or actions performed within an application or system. It involves creating detailed logs for all create, read, update, and delete (CRUD) operations on data, providing a transparent and immutable history of all transactions and changes within the database for security, compliance, and debugging purposes. Pros : Ensures compliance with legal and regulatory requirements by maintaining a detailed record of all data accesses and changes. Enhanced Security and Compliance : Establishes clear accountability by tracking user actions, making identifying the source of changes or errors easier. Improved Accountability : Offers valuable insights into application usage patterns, data access, and changes over time, facilitating better decision-making and system improvements. Valuable Insights Cons : Generates significant log data, increasing storage costs and requirements. Increased Storage Requirements : This can introduce performance overhead, as every data operation requires additional logging actions. Performance Overhead : The volume and complexity of audit logs can make management and analysis challenging, requiring specialized tools or processes. Complexity in Management and Analysis Versioning Versioning involves adding metadata to each record in a database to indicate its version, with mechanisms to create new versions upon updates while preserving the old versions. This can be implemented through various means, such as additional versioned tables, timestamping, or a separate history table to store changes. Pros : Enables tracking and preservation of historical data, allowing for analysis of changes over time. Historical Data Preservation : Provides a comprehensive audit trail for changes, supporting compliance and security analyses. Audit Trail : Facilitates data recovery and rollback capabilities, allowing previous data states to be restored if necessary. Data Recovery Cons : Requires additional storage space to maintain historical data versions. Increased Storage Space : This can complicate query design, as accessing the current or a specific historical data version may require more complex queries. Complex Queries : Maintaining and accessing multiple versions of data can lead to performance degradation, especially with large datasets or frequent updates. Performance Impact Scaling and Fault Tolerance Scaling and fault tolerance are critical aspects of managing relational databases, especially in environments that require high availability, performance, and consistency. These concepts ensure that a database can handle growing amounts of work and recover from hardware or software failures without data loss. refers to the database's ability to accommodate growth in data volume and transaction throughput without compromising performance. It can be achieved in two primary ways: Scaling : Add more resources (e.g., CPU, RAM) to the existing database server to handle increased load. Vertical Scaling (Scaling Up) : Entails distributing the database load across multiple servers or instances to enhance capacity and performance. Horizontal Scaling (Scaling Out) refers to the ability of the database system to continue operating without interruption in the event of a hardware or software failure. Key strategies include: Fault tolerance : Replicates data across multiple nodes or locations to prevent data loss and ensure availability during failures. Replication : Utilizes a group of servers working together to provide high availability and distribute the load, ensuring continuous operation even if one server fails. Clustering : Regular backups and efficient recovery procedures are essential for restoring data after a failure. Backup and Recovery There are several techniques for scaling and improving reliability. Master-Slave (Primary-Secondary) replication Master-slave replication is a widely used architecture for data replication and distribution. It involves a primary database server (the master) and one or more secondary database servers (the slaves). The master server handles all the write and/or read operations and logs changes, while the slave servers replicate these changes from the master, allowing them to handle read queries. Benefits : Distributes read queries among multiple slave servers, significantly improving the system's ability to handle large volumes of read operations. Read Scalability : Increases database availability through replicas; if the master server fails, a slave can be promoted to a master, minimizing downtime. High Availability : Allows for effective distribution of database load, with the master handling writes and slaves handling reads. Load Distribution : Facilitates backups and maintenance tasks on slave servers without impacting the master server's performance or availability. Backup and Maintenance Tradeoffs : This does not improve write scalability, as all write operations must go through the master server. Write Scalability : There can be a delay (replication lag) between when data is written to the master and when it is available on the slave servers, potentially leading to stale data. Data Latency : Introduces complexity in setup, configuration, and maintenance, requiring careful management of replication processes and failover mechanisms. Complexity : In an asynchronous replication setup, there is a risk of temporary inconsistency between the master and slave databases. Consistency Concerns Use Cases : Ideal for applications with a high read-to-write ratio, such as content delivery networks, where distributing read operations can significantly improve performance. Read-Heavy Applications : Useful for running complex queries and reports on slave databases without impacting the performance of the master database or the main application. Reporting and Analytics : Enables real-time backups on slave servers without interrupting the master server, providing an effective strategy for data recovery and business continuity. Backup and Recovery : Allows for testing new features, system upgrades, or migrations on slave servers without risking the integrity of the master server or disrupting the live environment. System Upgrades and Testing Multi-Master (Master-Master, Primary-Primary ) replication Multi-master replication is an architecture that allows data to be replicated across multiple servers, each capable of handling read and write operations. This setup creates a distributed database system where changes made on one server are automatically replicated to all other servers in the replication group, ensuring each server has the same data set. Benefits : Enhances write Scalability by distributing write operations across multiple servers, improving the system's overall throughput. Write Scalability : Provides high availability and fault tolerance, as the system can continue to operate even if one or more servers fail. High Availability : Enables data to be located closer to its users, reducing access latency and improving application response times, especially in geographically distributed applications. Improved Data Locality : Often includes mechanisms for conflict resolution, ensuring data consistency across the replicated servers despite concurrent writes. Conflict Resolution Tradeoffs : Requires sophisticated conflict resolution strategies to handle concurrent writes to the same data, which can complicate application logic and database management. Conflict Management : The replication and synchronization processes introduce additional network traffic and processing overhead, impacting system performance. Increased Overhead : The configuration, monitoring, and maintenance of a multi-master replication system are more complex than single-master setups, requiring advanced expertise. Complexity in Setup and Maintenance : Ensuring immediate consistency across all nodes can be difficult, potentially leading to temporary inconsistencies until replication is completed. Data Consistency Challenges Use Cases : Ideal for applications requiring data to be available and up-to-date across multiple locations, such as global e-commerce platforms or distributed content management systems. Distributed Applications : Suitable for systems where high availability and fault tolerance are critical, ensuring that the application remains operational even in the event of server failures. Highly Available Systems : Useful in scenarios where multiple users must update shared data simultaneously, such as collaborative editing platforms or distributed ledgers. Collaborative Applications : This can balance the load across servers in high-traffic scenarios, ensuring no single server becomes a bottleneck. Load Balancing Sharding Sharding is a technique that scales databases horizontally by partitioning data across multiple servers or instances. Each shard contains a subset of the total data; the shards comprise the entire database. Benefits : Significantly improves the scalability of a database by allowing it to distribute data across multiple servers, thereby handling more transactions and storing more data than a single server could. Scalability : By distributing the data, sharding can reduce the load on any single server, leading to performance improvements in data retrieval and storage. Performance Improvement : If properly implemented, sharding can increase data availability by replicating shards across multiple servers, thus ensuring that the failure of a single server does not lead to a total system outage. High Availability Tradeoffs : Sharding introduces complexity regarding database design, development, and maintenance. Managing multiple shards requires sophisticated coordination and configuration. Complexity : Ensuring an even distribution of data across shards to prevent imbalances (hotspots) that can lead to performance bottlenecks is challenging. Data Distribution Challenges : Handling transactions spanning multiple shards can be complicated and require additional logic to maintain data consistency and integrity. Cross-Shard Transactions Use Cases : Ideal for web applications that require high throughput and storage capacity beyond the limits of a single database server. Large-Scale Web Applications : Supports real-time analytics applications that need to process and analyze large volumes of data quickly by distributing the workload across multiple servers. Real-Time Analytics : Suitable for applications requiring global data distribution to serve users from the nearest geographical location, improving access speed and user experience. Global Distributions Federation Federation is a strategy for managing and accessing data distributed across multiple databases or different systems. It involves creating a virtual database that abstracts several physical databases, allowing users to interact with it as if it were a single database. This approach enables the integration of diverse data sources, providing unified access and manipulation of data without consolidating it physically in one location. Benefits : Simplifies access to data distributed across different databases or systems, making it appear that the data comes from a single source. Access to Distributed Data : By allowing data to remain in its original location, federation reduces the need for data replication and consolidation, minimizing redundancy. Reduced Data Redundancy : Offers the ability to scale out systems and add new data sources without significant reconfiguration, providing flexibility in handling growing or changing data needs. Scalability and Flexibility Tradeoffs : Queries that span multiple federated databases may experience higher latencies due to the overhead of integrating results from different sources. Query Performance : Managing and maintaining federated database systems can be significant, especially when integrating heterogeneous databases with varying schemas. Complexity in Management : Ensuring data consistency and integrity across federated databases requires robust synchronization and conflict resolution mechanisms. Data Consistency Challenges Use Cases : Ideal for scenarios where data needs to be integrated and accessed from multiple databases, such as in enterprise environments with different departmental databases. Data Integration from Multiple Sources : The federation can facilitate business intelligence and analytics applications by providing unified access to diverse data sets spread across multiple databases. Business Intelligence and Analytics : Suitable for decentralized organizations that require access to data across different regions or departments without centralizing all data, maintaining autonomy while ensuring data accessibility. Decentralized Organizations