Mar 21

What I’ve Learned About Operating GraphQL APIs at Scale

Blog post's hero image

GraphQL APIs are pretty new, and existing tools aren’t quite adapted to observing, documenting, and securing GraphQL. In this post we outline some practices that we follow at Stellate and practices we want to enable with Stellate.

Observability

We want to see what is happening in our system at a high level by means of metrics. We also want to be able to drill down into individual requests and see what is happening as they play out, since this allows us to inspect the chain of resolvers a request went through and how it eventually ended up in an erroneous state. Last but not least, we want to be alerted of performance degradations and errors happening in our system.

Logging

GraphQL can make it a bit more complicated to get the full picture of what is happening during a request, because a single GraphQL request can result in multiple database queries or multiple calls to REST endpoints.

A good logging setup starts with giving each request a unique identifier and then using this identifier in every log. This allows us to correlate all logs for a given request.

An example with GraphQL Yoga:

import { createYoga } from 'graphql-yoga'
import { v4 as uuidv4 } from 'uuid'

const yoga = createYoga({
  schema, // your executable schema, e.g. built with createSchema
  async context() {
    // Give every incoming request a unique identifier.
    return { traceId: uuidv4() }
  },
})

Once we have this unique identifier we can use a structured logger like pino, but we have to make sure that we either create the pino logger with the traceId bound to it or add it to each log message in our resolvers by including ctx.traceId in the body of the structured log.
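
A minimal sketch of the first approach, assuming pino: child() returns a logger that stamps the traceId onto every message written through it, so resolvers can log via ctx.logger without repeating the id.

import pino from 'pino'
import { v4 as uuidv4 } from 'uuid'

const rootLogger = pino()

// Per-request context: the child logger carries the traceId automatically,
// so resolvers can simply call ctx.logger.info(...).
async function context() {
  const traceId = uuidv4()
  return { traceId, logger: rootLogger.child({ traceId }) }
}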

Given that every log message is now associated with a request, we can filter and correlate them in our favorite logging solution like Datadog, Splunk, etc. Every request can be linked to the resolvers that were called, the downstream requests that resulted, and the eventual warning or error you are tracing.

Metrics

The most important metrics for a GraphQL API are the error rate and the latency. I have found that granular latency, where each field is carefully recorded, is less useful than the overall latency of a request and how that latency evolves over time.

There are a lot of solutions like GraphQL Hive, Inigo, Stellate, and others that can get you both of the above metrics for your GraphQL API. These solutions then allow us to monitor the metrics and set alerts when they exceed a certain threshold.

One feature that we have found to be missing in a lot of these products is the ability to monitor a given operation. Let's say we have the following operation:

query CategoryProducts($id: ID!) {
  category(id: $id) {
    id
    name
    products {
      nodes {
        id
        name
        price
      }
    }
  }
}

We now add a new field called inventory and the performance of this operation drops significantly. Rather than getting a general alert that performance dropped, it would be useful to get an alert that CategoryProducts is slow and that the executed document evolved to include an inventory field. This alert would allow us to quickly identify the problem and either accept it, because the service managing inventory is slow, or optimize it with a dataloader (see the sketch below). The same could be done for an added argument, or for an interface or union resolving to a certain type.
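
A sketch of the dataloader optimization, assuming a hypothetical inventoryService client that can fetch stock levels for many products in one round trip:

import DataLoader from 'dataloader'

// Hypothetical client for the slow inventory system.
declare const inventoryService: {
  getMany(ids: string[]): Promise<Array<{ productId: string; stock: number }>>
}

// Batch function: one call for a whole set of product ids instead of one
// call per product in the list.
const inventoryLoader = new DataLoader<string, number>(async (productIds) => {
  const rows = await inventoryService.getMany([...productIds])
  const byId = new Map(rows.map((row) => [row.productId, row.stock]))
  // DataLoader expects results in the same order as the requested keys.
  return productIds.map((id) => byId.get(id) ?? 0)
})

const resolvers = {
  Product: {
    // All inventory fields resolved in the same tick share one batch call.
    inventory: (product: { id: string }) => inventoryLoader.load(product.id),
  },
}

In a real server you would create the loader per request, for example in the context factory, so that its cache never leaks between users.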

Identifying errors

Error rates are hard to get right, as we need to know whether an error was the fault of an external system or service (the error code or message could signal that), whether it's isolated to a given operation, or whether someone is just trying to do something that is not allowed.

In an ideal world we can filter out the error when someone is just trying to request our API as an unauthorized user, ... but the risk is that we then miss an error where we incorrectly mark a user as unauthorized. This means a lot of application-specific metadata becomes relevant to the error, and we need to be able to deliver it as metrics when an alert arises: things like the number of distinct variable combinations contributing to an alert, the number of distinct users, and the number of operations. This helps us make an informed decision on whether this is a P0, P1, P2, ... and whether we need to fix it now or whether it can wait.

A common solution to this problem is to introduce an error-code pattern, where we use the extensions property on the GraphQLError object to define the category of our error.

An example of this could be:

{
  "errors": [
    {
      "message": "Payment failed",
      "path": ["checkout"],
      "extensions": { "code": "EXTERNAL_PROVIDER_STRIPE" }
    }
  ]
}

Seeing the above error in our logs gives us a quick hint as to where to look: we know we failed in the checkout resolver, and our error code points at the external provider erroring, so we can find where we touch base with the provider and trace from there. We can now also expand our metrics by grouping on the error path or error code.
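
Producing such an error on the server is straightforward with graphql-js v16; a minimal sketch of the checkout resolver, where processPayment stands in for a hypothetical provider call:

import { GraphQLError } from 'graphql'

// Hypothetical call to the external payment provider.
declare function processPayment(): Promise<{ id: string }>

const resolvers = {
  Mutation: {
    async checkout() {
      try {
        return await processPayment()
      } catch (error) {
        // Wrap the failure so logs and metrics can group on the code.
        throw new GraphQLError('Payment failed', {
          extensions: { code: 'EXTERNAL_PROVIDER_STRIPE' },
        })
      }
    },
  },
}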

Clients

Another important aspect of metrics is identifying what kind of client is requesting certain fields, seeing certain errors, and so on. With GraphQL we'll often have a lot of consumers of a given endpoint: think of your mobile app, which could have a lot of versions live on the app store, your web app, and maybe even a CLI. To know which fields and types we can deprecate we need to be aware of who's using them; to track this we can add a header or something similar to our requests and include it in the logs.
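
A sketch of this with GraphQL Yoga, assuming the clients send hypothetical x-graphql-client-name and x-graphql-client-version headers:

import { createYoga } from 'graphql-yoga'
import { v4 as uuidv4 } from 'uuid'
import pino from 'pino'

const rootLogger = pino()

const yoga = createYoga({
  schema, // the executable schema from earlier
  async context({ request }) {
    // The header names are an assumption: each client sets its own values,
    // e.g. 'ios-app' / '3.2.1' for the mobile app.
    const clientName = request.headers.get('x-graphql-client-name') ?? 'unknown'
    const clientVersion = request.headers.get('x-graphql-client-version') ?? 'unknown'
    const traceId = uuidv4()
    return {
      traceId,
      logger: rootLogger.child({ traceId, clientName, clientVersion }),
    }
  },
})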

Security

There are a few concerns with GraphQL when it comes to security: the first is that by default we expose everything in our schema, and the second is that one request can access multiple resources.

Everything being exposed by default can be addressed by using a plugin like block-field-suggestions and disabling introspection when you are in production. This is however a band-aid, as people can still visit your application, look at the requests in the network tab of their browser, and see what is possible. This is where Persisted Operations come into play: we extract all the documents from our client application, create hashes for them, and share those with the server so that it will only accept those documents. In doing so we made the selection of documents obscure, and only a limited set of documents will be accepted. There are still a few pitfalls to exploit with persisted operations, like a query with a variable named limit, which allows an attacker to DoS our server by setting the limit to a very high number.
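
On the server side this could look like the following sketch with GraphQL Yoga's persisted operations plugin, assuming a hash-to-document store extracted from the client bundle at build time (the hash key shown is illustrative):

import { createYoga } from 'graphql-yoga'
import { usePersistedOperations } from '@graphql-yoga/plugin-persisted-operations'

// Assumed store: sha256 hash of the document -> the document itself,
// generated by the client build and shipped to the server.
const store: Record<string, string> = {
  'sha256-hash-of-category-products':
    'query CategoryProducts($id: ID!) { category(id: $id) { id name } }',
}

const yoga = createYoga({
  schema,
  plugins: [
    usePersistedOperations({
      getPersistedOperation(sha256Hash: string) {
        // Returning null rejects anything that is not a known document.
        return store[sha256Hash] ?? null
      },
    }),
  ],
})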

The second concern can be addressed by erroring out early: we can limit the depth, breadth, ... of a GraphQL document and error out before we even start executing the request; GraphQL Armor is a great tool here! Another thing we can do is check as a preflight whether authenticated fields are included in the request and error out if the user isn't authenticated. The last solution here would be rate limiting: request-based rate limiting can be a good first line of defense when you are using persisted operations, but for normal GraphQL execution, where large requests can be forged, we need a different approach like complexity analysis, where a given document gets assigned a complexity score that we can limit across requests based on e.g. IP or userId.
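
A sketch of the early-erroring approach, assuming GraphQL Armor's standalone per-rule plugins (the limits are placeholders to tune for your schema):

import { createYoga } from 'graphql-yoga'
import { maxDepthPlugin } from '@escape.tech/graphql-armor-max-depth'
import { costLimitPlugin } from '@escape.tech/graphql-armor-cost-limit'

const yoga = createYoga({
  schema,
  plugins: [
    // Reject documents nested deeper than 6 levels before execution starts.
    maxDepthPlugin({ n: 6 }),
    // Reject documents whose estimated complexity exceeds the budget.
    costLimitPlugin({ maxCost: 5000 }),
  ],
})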

Documentation

Documentation is important so that people can understand what's possible and onboard to your product. "GraphQL is self-documenting" is a phrase that gets used a lot, and it is true that we get a lot of information by default on our graph, like our entry points, the types they return, and the fields on those types. However, adding descriptions to our fields is still entirely on the developer; I'm not saying this is a bad thing, just emphasizing that nothing is entirely magical. When using a well-described schema in combination with e.g. GraphiQL you can explore and play with the schema as a developer and onboard quite quickly.
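
A small sketch of what a described schema could look like, using GraphQL Yoga's createSchema (the type and field names echo the earlier CategoryProducts example):

import { createSchema } from 'graphql-yoga'

// Descriptions live in the SDL and show up automatically in GraphiQL's
// documentation explorer.
const schema = createSchema({
  typeDefs: /* GraphQL */ `
    """
    A category groups products for navigation and filtering.
    """
    type Category {
      id: ID!
      "Human-readable name, shown in navigation menus."
      name: String!
    }

    type Query {
      "Look up a single category by its id."
      category(id: ID!): Category
    }
  `,
})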

When there is a basic understanding of the schema, one can explore the documents used in client applications and understand the different use cases the schema serves. Heck, maybe one day we'll just give an AI application our executed documents and schema and it will tell us all the use cases a client has for this schema.