Disclaimer: I no longer work at Facebook. These opinions are mine alone.

Last month, I took some time to reflect on what I learned while building real-time APIs at Facebook. One useful technique stands out: build the null API first.

Consider a null endpoint for a new HTTP API that returns Status Code 200 with an empty body. With the null 200 API, the client can:

- Check network connectivity
- Measure latency
- Validate an access token
- Test the SSL certificate
- Test intermediate caching layers

The client can also detect and handle various failure cases:

- Network unavailable (airplane mode)
- DNS lookup failed
- Request timeout (retry the request?)
- Flood response (503)

Notice that we haven’t mentioned application data. That’s the point. Engineering teams often focus on functional requirements first, and non-functional requirements second. By omitting the application data, the null API technique forces us to work backwards and confront non-functional requirements first.

As for the streaming null API, we need to establish a foundation by discussing the differences between streaming and request/response. In a streaming API, the client opens a persistent connection to the server, which can be an actual persistent connection (e.g. WebSocket) or a simulated connection (e.g. HTTP long polling). Over the lifetime of this connection, the server chooses when to push data to the client.

While request/response APIs can be stateful or stateless, a streaming API is always stateful. For example, in a chat app, the server needs to store at least three pieces of state:

- Which chat rooms has the user joined?
- What was the last message the client received? (if we want coherent conversations)
- Which persistent connection will carry data to the correct user?

Stateless application-layer protocols like HTTP have extreme amnesia. They remember only what’s necessary for a single request/response. By contrast, real-time APIs are also stateful at the application protocol layer.
That is, they must remember what happened in the past. For example, real-time protocols frequently rely on the publisher-subscriber pattern (e.g. MQTT and Redis), where an unsubscribe request might only be valid if it follows a subscribe request for the same channel name.

Handling this kind of state at scale is challenging. First: the server must make trade-offs between consistency, availability, partition-tolerance (CAP theorem), durability, and latency. Second: some kinds of state, such as chat room membership, must be synchronized across the network. If the client and server disagree about the set of chat rooms for the current user, the result is either a broken user experience or leaked resources. Unfortunately, state synchronization is hard. Third: client-side caches are often ill-equipped to deal with real-time data streams.

How many persistent connections does a client need? “One” seems like the obvious answer. But if this single persistent connection drops, all data streams sharing the connection would drop simultaneously. Then again, maybe multiple persistent connections would fare no better, such as cases where your cell phone enters a dead zone.

The point is, real-time APIs are much more complicated than request/response APIs, partly because they are stateful at the application protocol layer as well as the transport layer.

With those concepts sorted out, we’re ready to model the null streaming API: an API where the client opens a persistent connection to the server, but the server pushes no application data.
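Here’s a minimal sketch of what a server that holds a persistent connection open and pushes no application data could look like, along with a client that watches it. This is only an illustration under some loud assumptions: `socket.socketpair()` stands in for a real network connection (so there is no DNS, TLS, or access token involved), and the names `null_stream_server`, `observe_stream`, and `run_demo` are hypothetical, not part of any real library or of anything built at Facebook.

```python
import socket
import threading
import time
from typing import List


def null_stream_server(conn: socket.socket, hold_seconds: float) -> None:
    """Hold the connection open, push no application data, then end the stream."""
    with conn:
        time.sleep(hold_seconds)
    # Leaving the `with` block closes the socket: a server-initiated end-of-stream.


def observe_stream(conn: socket.socket, poll_timeout: float, max_polls: int) -> List[str]:
    """Record what the client observes on a stream that carries no data."""
    events = ["connected"]
    conn.settimeout(poll_timeout)
    for _ in range(max_polls):
        try:
            chunk = conn.recv(1024)
        except socket.timeout:
            # The connection is still open, but nothing was pushed.
            events.append("idle")
            continue
        except OSError:
            # Transport-level failure (e.g. connection reset).
            events.append("interrupted")
            break
        if chunk == b"":
            # An empty read means the peer closed the connection.
            events.append("end-of-stream")
            break
        # A null streaming API should never get here.
        events.append("unexpected-data")
    return events


def run_demo() -> List[str]:
    """Wire a client to the null server over an in-process socket pair."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=null_stream_server, args=(server, 0.3))
    worker.start()
    with client:
        events = observe_stream(client, poll_timeout=0.1, max_polls=50)
    worker.join()
    return events


if __name__ == "__main__":
    print(run_demo())
```

Even with no payload at all, the client already has three distinct situations to handle: the connection is up but idle, the server ended the stream, or the transport failed. A real client would add the connection-setup failures (DNS, certificate, access token) on top of these.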
The client can use a null streaming API to anticipate a bountiful assortment of success and failure cases:

Success Cases:

- Reach the server (network available, DNS resolves, server cert is legit)
- Create the persistent connection
- Issue a request across the persistent connection
- Detect server-initiated stream termination (end-of-stream)
- Close the stream
- Close the persistent connection
- Pause the stream

Failure Cases:

- Failed to create a persistent connection (transport level): Network unavailable, DNS lookup failure, Request timeout, Bad cert, Bad access token
- Failed to create a data stream (protocol level): Unauthenticated, Unauthorized, Invalid request, Flood response
- Connection interrupted: Server load shedding, Server gateway node failure
- Stream interrupted: Connection interrupted (above)

The stream/connection interrupted cases are particularly interesting. An interrupted stream cannot possibly yield data, but the interruption can occur at either the transport level or the protocol level. If that stream was feeding essential data to the client (for example, player health in a multiplayer game), it’s important to let the user know what’s happening: “hey, you’re lagging, hang on!”.

This list of non-functional cases is obviously not exhaustive, but it already shows how much smarter the client needs to be when handling edge cases. By building such an API first, the team can discover and design the best user experience when building apps that run over unreliable networks, such as mobile networks.

If you’ve had success using some variation of the null API technique or if you know a better way, I’d love to hear from you in the comments!