Designing a Social Media News Feed System

The terms newsfeed and timeline are used interchangeably. Some services that offer a feed similar to a social media newsfeed are the following:

Facebook newsfeed
Twitter Timeline
Instagram feed
Google podcast feed
Google news timeline
Etsy feed
Feedly
Reddit feed
Medium feed
Quora feed

Requirements

The user newsfeed must be generated in near real-time based on the feed activity from the people that a user follows
The feed items contain text and media files (images, videos)

Data storage

Database schema

The primary entities of the database are the Users table, the FeedItems table, and the Follows table
The relationship between the Users and the FeedItems tables is 1-to-many
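The relationship between users (follower-followee) is many-to-many because a user can follow many users and be followed by many users
The Follows table is a join table that represents this follower-followee relationship, with each row referencing two rows of the Users table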
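
The schema above can be sketched with Python's built-in sqlite3 module. This is only an illustration: SQLite stands in for the SQL database, and every column name (author_id, body, media_url, created_at, follower_id, followee_id) as well as the helper functions are assumptions, since the article names only the three tables.

```python
import sqlite3

# Hypothetical columns; the article specifies only the table names.
SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE feed_items (
    id         INTEGER PRIMARY KEY,
    author_id  INTEGER NOT NULL REFERENCES users(id),  -- 1-to-many: one user, many items
    body       TEXT NOT NULL,
    media_url  TEXT,                                   -- pointer into the object store
    created_at INTEGER NOT NULL
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)             -- join table, one row per follow
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(1, "alice"), (2, "bob"), (3, "carol")],
    )
    # alice follows bob and carol; bob follows carol
    conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3), (2, 3)])
    conn.executemany(
        "INSERT INTO feed_items (id, author_id, body, created_at) VALUES (?, ?, ?, ?)",
        [(10, 2, "bob's post", 100), (11, 3, "carol's post", 200)],
    )
    conn.commit()
    return conn


def followers_of(conn: sqlite3.Connection, user_id: int) -> list[int]:
    """Return the IDs of the users who follow user_id."""
    rows = conn.execute(
        "SELECT follower_id FROM follows WHERE followee_id = ? ORDER BY follower_id",
        (user_id,),
    )
    return [row[0] for row in rows]


def pull_feed(conn: sqlite3.Connection, user_id: int, limit: int = 10) -> list[int]:
    """Return feed item IDs from the user's followees, newest first."""
    rows = conn.execute(
        """
        SELECT f.id
        FROM feed_items AS f
        JOIN follows AS w ON w.followee_id = f.author_id
        WHERE w.follower_id = ?
        ORDER BY f.created_at DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return [row[0] for row in rows]
```

The join in pull_feed is roughly what a pull model (fanout-on-load) has to compute at read time, which is why the design below precomputes the feed for active users.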

Type of data store

The media files (images, videos) are stored in a managed object storage such as AWS S3
A SQL database such as Postgres stores the metadata of the user (followers, personal data)
A NoSQL data store such as Cassandra stores the user timeline
A cache server such as Redis stores the pre-generated timeline of a user

High-level design

The server stores the feed items in cache servers and the NoSQL store
The newsfeed generated is stored on the cache server
No feed is published for inactive users; their newsfeed is built with a pull model (fanout-on-load) when they request it
The feed publishing for active non-celebrity users is based on a push model (fanout-on-write)
The feed publishing for celebrity users is based on a hybrid push-pull model
The client fetches the newsfeed from the cache servers
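
The choice between the three publishing models can be sketched as a small decision function. The thresholds and the names (delivery_strategy, CELEBRITY_FOLLOWER_THRESHOLD, INACTIVE_AFTER_DAYS) are assumptions made for illustration; the article does not state how a user is classified as celebrity or inactive.

```python
from enum import Enum


class Strategy(Enum):
    PUSH = "fanout-on-write"
    PULL = "fanout-on-load"
    HYBRID = "hybrid push-pull"


# Assumed cut-offs; real values would be tuned per product.
CELEBRITY_FOLLOWER_THRESHOLD = 100_000
INACTIVE_AFTER_DAYS = 30


def delivery_strategy(author_follower_count: int,
                      follower_days_since_login: int) -> Strategy:
    """Pick how an author's feed items reach one particular follower."""
    if follower_days_since_login > INACTIVE_AFTER_DAYS:
        # Inactive follower: nothing is precomputed, the feed is built on load.
        return Strategy.PULL
    if author_follower_count >= CELEBRITY_FOLLOWER_THRESHOLD:
        # Celebrity author: items are merged into the timeline on demand.
        return Strategy.HYBRID
    # Active follower of a regular author: push the item at write time.
    return Strategy.PUSH
```

For example, delivery_strategy(500, 2) returns Strategy.PUSH, while the same follower reading a celebrity author with 5,000,000 followers gets Strategy.HYBRID.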

Write path

The client creates an HTTP connection to the load balancer to create a feed item
The load balancer delegates the client connection to a web server with free capacity
The write requests to create feed items are rate limited
The feed item is stored on the message queue for asynchronous processing and the client receives an immediate response
The fanout service distributes the feed item to multiple services to generate the newsfeed for followers of the client
The object store persists the video or image files embedded in the feed item
The NoSQL store persists the timeline of users (feed items in reverse chronological order)
The SQL database stores the metadata of the users (user relationships) and the feed items
Only a limited number of recent feed items from users whose follower count exceeds a certain threshold are stored on the cache server
The IDs of feed items are stored on the user timeline cache server for deduplication
The feed generation service subscribes to the fanout service for any updates
The feed generation service queries the in-memory user info service to identify the followers of a user and the category of the user (active non-celebrity, inactive, or celebrity)
The feed generation service creates the home timeline for active non-celebrity users using a push model (fanout-on-write) in linear time O(n), where n is the number of followers
The feed items are ranked, sorted, and merged to generate the home timeline for a user
The home timeline for active users is stored on the cache server for quick lookups
No feed is published for inactive users; their home timeline is built with a pull model (fanout-on-load) when they request it
The feed publishing for celebrity users is based on a hybrid push-pull model (merge celebrity feed items to the home timeline of a user on demand)
As an alternative, the feed publishing for celebrity users can use a push model only for the online followers in batches (not an optimal solution)
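
The write path above can be sketched in memory as follows. This is a minimal sketch under stated assumptions: a deque stands in for the message queue, dictionaries stand in for the cache and the NoSQL store, items are ordered by arrival instead of being ranked, and the class name, method names, and both constants are hypothetical.

```python
from collections import defaultdict, deque

TIMELINE_CACHE_SIZE = 3   # assumed cap on cached home timeline entries (tiny for the demo)
CELEBRITY_THRESHOLD = 3   # assumed follower count at which an author counts as a celebrity


class FanoutService:
    def __init__(self, followers: dict, active_users: set):
        self.followers = followers        # author ID -> list of follower IDs
        self.active_users = active_users  # users who get a precomputed feed
        self.queue = deque()              # stand-in for the message queue
        # Home timeline cache: newest first, oldest entries fall off the end.
        self.home_timelines = defaultdict(lambda: deque(maxlen=TIMELINE_CACHE_SIZE))
        # User timeline: each author's own items, newest first.
        self.user_timelines = defaultdict(list)

    def publish(self, author_id, item_id) -> str:
        """Enqueue the feed item and respond to the client immediately."""
        self.queue.append((author_id, item_id))
        return "accepted"

    def process_queue(self) -> None:
        """Asynchronous worker: persist each item and fan it out."""
        while self.queue:
            author_id, item_id = self.queue.popleft()
            if item_id in self.user_timelines[author_id]:
                continue  # duplicate delivery, skip (deduplication by item ID)
            self.user_timelines[author_id].insert(0, item_id)
            followers = self.followers.get(author_id, [])
            if len(followers) >= CELEBRITY_THRESHOLD:
                continue  # celebrity: followers merge these items at read time
            for follower_id in followers:  # O(n) in the number of followers
                if follower_id in self.active_users:
                    self.home_timelines[follower_id].appendleft(item_id)
```

A short usage example: with followers = {"bob": ["alice", "dave"]} and active_users = {"alice"}, calling publish("bob", 1) followed by process_queue() places item 1 on alice's home timeline only, because dave is inactive and is served by the pull model instead.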

Read path

The client executes a DNS query to resolve the domain name
The client checks whether the feed items for the home timeline are cached on the CDN
The client creates an HTTP connection to the load balancer
The load balancer delegates the client connection to a web server with free capacity
The read requests to fetch the newsfeed are rate limited
The web server queries the timeline service to fetch the newsfeed
The timeline service queries the user info service to get the list of followees and identify the category of the user (active, inactive, following celebrities)
The home timeline cache is queried to fetch the list of feed item IDs
The feed items are fetched from the feed items cache server by executing an MGET operation on Redis
When the client executes a request to fetch the timeline of another user, the timeline service queries the user timeline cache server
The SQL database follower (replica) is queried on a cache miss
The media files embedded in feed items are fetched from the object store
The NoSQL data store is queried to fetch the user timeline on a cache miss
The inactive users fetch the home timeline using a pull model (fanout-on-load)
The active users following celebrity users use a hybrid model to fetch the home timeline (the feed items from celebrities are merged on demand)
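
The hybrid read described above can be sketched as a merge of the precomputed home timeline with the user timelines of followed celebrities, followed by a batched lookup of the items. This is an illustrative sketch: timelines are assumed to be newest-first lists of (timestamp, item_id) pairs, plain dictionaries stand in for the feed items cache and the persistent store, and the function names are hypothetical.

```python
import heapq


def merged_item_ids(precomputed, celebrity_timelines, limit):
    """Merge newest-first timelines of (timestamp, item_id) pairs.

    Returns up to `limit` unique item IDs, newest first.
    """
    # heapq.merge with reverse=True expects every input sorted newest first.
    merged = heapq.merge(precomputed, *celebrity_timelines, reverse=True)
    seen = set()
    result = []
    for _timestamp, item_id in merged:
        if item_id in seen:
            continue  # the same item can appear in more than one timeline
        seen.add(item_id)
        result.append(item_id)
        if len(result) == limit:
            break
    return result


def hydrate(item_ids, item_cache, item_store):
    """Resolve item IDs to items, similar in spirit to an MGET on the cache.

    On a cache miss the item is read from the store and written back.
    """
    items = []
    for item_id in item_ids:
        item = item_cache.get(item_id)
        if item is None:
            item = item_store[item_id]    # cache miss: fall back to the store
            item_cache[item_id] = item    # populate the cache for later reads
        items.append(item)
    return items
```

For a user whose precomputed timeline is [(50, "b2"), (20, "b1")] and who follows one celebrity with the timeline [(60, "s2"), (30, "s1")], merged_item_ids with limit=3 yields ["s2", "b2", "s1"].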

References

Raffi Krikorian, Timelines at Scale, infoq.com
How feed works, facebook.com
