Nov 30, 2023

GraphQL Performance: Key Challenges and Solutions

Despite the benefits of GraphQL APIs, many companies are hesitant to adopt GraphQL as there's a misconception that GraphQL performance is inferior to REST performance.

This is largely because GraphQL queries are typically sent as POST requests, which aren't cacheable by default.

As caching is a critical aspect of improving website performance, many believe that GraphQL can't deliver the same level of performance as REST.

While it's true that a GraphQL API without any caching would result in inferior performance to a REST API with caching, simple solutions now allow you to cache GraphQL requests.

This means you can enjoy all of the benefits of a GraphQL API without the drawbacks of poor performance.

In this post, we'll explore the most common GraphQL performance challenges and explain the simple solutions available to solve them.

Challenges with GraphQL Performance: The N+1 Problem

There’s a common misconception that GraphQL performance is suboptimal because of the N+1 problem.

What is the N+1 problem?

The N+1 problem arises when gathering data from multiple data sources (e.g., different databases or different API endpoints).

While this problem is also present in REST APIs, it may be easier to spot as all of the data requirements for a REST endpoint are in a single place.

However, GraphQL resolves a query field by field through a nested structure, so any field you select may trigger network requests behind the scenes. This becomes more apparent when lists are involved.

For example, if you’re fetching blog posts with comments, but the comments are stored in an external service, you might have a snippet like this:

```
query getPostsWithComments {
  posts(first: 5) {
    id
    author
    body
    # the resolver for this field does an API call
    comments {
      id
      author
      content
    }
  }
}
```

In this case, you would make one API call to get all of the blog posts, and then one extra API call per post to get its comments. This is why it's called the N+1 problem: you make one initial request plus N subsequent requests, one for each data item.
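To make the request count concrete, here is a minimal Python sketch that simulates the resolvers for the query above. The `fetch_posts` and `fetch_comments` helpers and the in-memory data are hypothetical stand-ins for the posts API and the external comments service, not part of any real library.

```python
# Hypothetical in-memory stand-ins for the posts API and the comments service.
POSTS = [{"id": i, "author": f"author-{i}", "body": "..."} for i in range(1, 6)]
COMMENTS = {
    post["id"]: [{"id": post["id"] * 10, "author": "reader", "content": "Nice post"}]
    for post in POSTS
}

api_calls = []  # records every simulated network request


def fetch_posts(first):
    api_calls.append("GET /posts")
    return POSTS[:first]


def fetch_comments(post_id):
    api_calls.append(f"GET /comments?post={post_id}")
    return COMMENTS[post_id]


def resolve_posts_with_comments(first):
    posts = fetch_posts(first)  # 1 request for the list of posts
    # The comments resolver runs once per post: N additional requests.
    return [{**post, "comments": fetch_comments(post["id"])} for post in posts]


result = resolve_posts_with_comments(first=5)
print(len(api_calls))  # 6: one request for the posts plus five for comments
```

With `first: 5`, the simulated resolvers make six requests in total, which is the "N+1" in the name.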

The issue here is that if one of the initial requests in the waterfall is slow, the entire network call is delayed.

This means the user won’t be able to see any of the information within that network call until the slow request is resolved. As a result, the N+1 problem negatively impacts GraphQL latency and creates a poor user experience.

The good news is that there’s a simple solution to the N+1 problem, which will make your GraphQL API perform just as well as a REST API. We’ll discuss this solution in detail below.

How to Improve GraphQL Performance

The primary challenge with the N+1 problem and resulting waterfall issue is that it impacts data retrieval times, which negatively impacts GraphQL performance. The most obvious solution is limiting how often each network call travels back to your server.

You can do this by caching your data.

Caching solves the N+1 problem because the response is stored close to the user (in the browser or at the edge) and returned without re-running the resolvers, eliminating the need to retrieve data from the backend server.

In practice, the first time a new query is received, retrieving the data from each endpoint may take a few moments—but then the data is cached.

So, any subsequent calls to those endpoints are served from the cache, and the user can quickly retrieve the desired information.
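As a rough illustration of that flow, the sketch below wraps a query executor in a small in-memory response cache with a time-to-live. The `ResponseCache` class and the `slow_backend` function are hypothetical names for this example, and a production cache would also need size limits and invalidation.

```python
import json
import time


class ResponseCache:
    """Caches whole query responses, keyed by query text and variables."""

    def __init__(self, execute, ttl_seconds=60.0, clock=time.monotonic):
        self._execute = execute  # function that actually runs the query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, response)
        self.hits = 0
        self.misses = 0

    def query(self, query, variables=None):
        variables = variables or {}
        key = json.dumps([query, variables], sort_keys=True)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]  # served from the cache, backend untouched
        self.misses += 1
        response = self._execute(query, variables)
        self._entries[key] = (now + self._ttl, response)
        return response


backend_calls = []  # one entry per request that reaches the backend


def slow_backend(query, variables):
    backend_calls.append(query)
    return {"data": {"posts": [{"id": 1}]}}


cache = ResponseCache(slow_backend, ttl_seconds=60)
first = cache.query("{ posts(first: 5) { id } }")
second = cache.query("{ posts(first: 5) { id } }")
```

Only the first call reaches `slow_backend`; the second is answered from the cache.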

To ensure the user is never the first to make a query before the data is cached, you can put your website online in a temporary environment and hit the server for all the pages you want cached, such as product or category pages.

This process is referred to as cache warming.

The first (oftentimes slower) retrievals occur internally while the site is still in a temporary environment, and then once the data is cached, it's rolled out to users.
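A cache warming script can be as simple as requesting every page you want cached once. The sketch below assumes a hypothetical `fetch` callable that stands in for an HTTP client pointed at the temporary environment; the paths are made up for illustration.

```python
def warm_cache(paths, fetch):
    """Request every path once so later visitors are served from the cache.

    Returns the paths that could not be warmed.
    """
    failed = []
    for path in paths:
        try:
            fetch(path)
        except Exception:
            # Keep going: one failing page should not stop the warm-up run.
            failed.append(path)
    return failed


# Hypothetical stand-in for an HTTP client with a cache in front of it.
warmed = {}


def fake_fetch(path):
    if path == "/broken":
        raise RuntimeError("upstream error")
    warmed[path] = f"rendered {path}"
    return warmed[path]


failed = warm_cache(["/products/1", "/categories/shoes", "/broken"], fake_fetch)
```

Returning the failed paths lets you retry or investigate them before the site is rolled out to users.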

However, there are two major issues when it comes to caching data.

First, you need to know what data can be cached, and secondly, you need a solution that allows you to cache GraphQL requests, as GraphQL requests aren't cacheable by default.

Let's address these two problems.

How do you know what data to cache?

Platforms like Datadog and Apollo Studio provide basic performance metrics, like the type of errors that are occurring and how long they take to resolve, but they won't tell you if the data is actually cacheable.

Without this information, the team might waste time optimizing a cache hit rate that's already well-optimized.

You may also overlook other major opportunities to improve your cache hit rate.

How do you implement GraphQL caching?

GraphQL queries are typically sent as POST requests, which are designed for actions that can have side effects on the server, like storing some information or uploading a file, meaning they aren't cacheable by default. To cache these requests, most developers need to build their own proprietary caching solution.
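One building block of such a solution is a cache key derived from the POST body, since the URL alone is the same for every query. The following is a minimal sketch, assuming the body is the usual JSON payload with `query` and `variables` fields. The `cache_key` function is a hypothetical name, and the prefix check for mutations is deliberately naive; a real implementation would parse the GraphQL document.

```python
import hashlib
import json


def cache_key(body):
    """Return a stable cache key for a GraphQL POST body.

    Returns None when the operation should not be cached.
    """
    query = " ".join(body["query"].split())  # collapse whitespace differences
    # Naive check: operations with side effects are never cached.
    if query.startswith(("mutation", "subscription")):
        return None
    payload = json.dumps(
        {"query": query, "variables": body.get("variables") or {}},
        sort_keys=True,  # variable order must not change the key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = cache_key({"query": "query { posts { id } }", "variables": {"a": 1, "b": 2}})
key_b = cache_key({"query": "query  {\n posts { id }\n}", "variables": {"b": 2, "a": 1}})
key_c = cache_key({"query": "mutation { addPost(body: \"hi\") { id } }"})
```

Here `key_a` and `key_b` are identical because the two bodies differ only in whitespace and variable order, while `key_c` is `None` because mutations are skipped.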

Unfortunately, manually building your own GraphQL caching solution is very expensive. It requires a dedicated team of engineers with specific knowledge to build a caching solution.

Even after building a proprietary caching solution, you still need additional functionalities to maintain it.

Here are just a few issues that many developers run into after building their own caching solution:

Identifying latency issues. Even after caching data, you'll still lack full observability into your website's performance. For example, you may notice latency issues occurring, but you won't see which geographical areas they're impacting or where the issues are occurring. As a result, it's difficult to prioritize problems based on revenue impact.
Managing cache purging. If you build your own caching infrastructure, you'll probably find that there isn't an out-of-the-box method to automatically purge outdated data. Therefore, most developers purge all data at once, though this isn't the best solution either, as it will cause your cache hit rate to drop to zero, which can overload your servers and cause the website to crash.
Monitoring cache leaks. If you build your own caching infrastructure, you'll find that there isn't an intuitive method to identify cache leaks. This is a major problem as it could leak sensitive data, like customer credit card information, to other website visitors.
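To illustrate the cache leak risk, here is a minimal sketch of one common safeguard: scoping the cache key of private responses to the requesting user. The `scoped_cache_key` function and its parameters are hypothetical, and deciding which queries count as private is left out as an assumption.

```python
import hashlib


def scoped_cache_key(query, auth_token=None, is_private=False):
    """Build a cache key; private responses get a per-user scope."""
    if not is_private:
        return "public:" + query  # safe to share between all visitors
    if not auth_token:
        return None  # private data without a user: do not cache at all
    # Hash the token so the raw credential never appears in the key.
    scope = hashlib.sha256(auth_token.encode("utf-8")).hexdigest()[:16]
    return "user-" + scope + ":" + query


alice_key = scoped_cache_key("{ me { email } }", "alice-token", is_private=True)
bob_key = scoped_cache_key("{ me { email } }", "bob-token", is_private=True)
```

Because `alice_key` and `bob_key` differ, one user's cached response can never be served to the other, even though both sent the same query.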

Fortunately, you can solve these challenges with a modern GraphQL CDN.

How To Use A GraphQL CDN To Improve Performance

A GraphQL CDN solves the N+1 issue by automatically caching data for you. Caching the data automatically improves website performance and reduces latency because requests no longer have to travel back to the origin server to access the data.

The only problem was that, until recently, a GraphQL CDN didn't exist, which is why most engineers built their own proprietary caching solution.

We wanted a more efficient method to cache GraphQL requests, so we built Stellate – the first GraphQL CDN designed specifically to improve GraphQL performance.

Stellate sits directly in front of your existing infrastructure and automates GraphQL caching so that you receive all the benefits of GraphQL caching without the cost and hassle of building your own infrastructure.

Stellate also provides many out-of-the-box features, like automatic and manual cache purging and granular observability metrics regarding latency issues and other performance issues.

Below we'll discuss how you can better leverage a GraphQL CDN to improve performance.

Quickly Identify And Solve High Impact Latency Issues

If your website takes too long to load, users will leave and go to a competitor's website instead.

By caching your GraphQL data, Stellate automatically solves most latency issues, though it also offers granular out-of-the-box metrics that allow you to identify where latency issues are occurring and who experiences them.

This makes it easy to prioritize latency issues costing the most in lost revenue.

For instance, if you know of a latency issue on the checkout page, you can prioritize it over other latency issues, as it's likely negatively impacting revenue. Similarly, if your best customers are in Canada and you notice a latency issue on a product page that negatively impacts Canadian customers, you can prioritize solving that issue.

As a result, you can prioritize issues based on revenue impact and solve them faster, as Stellate shows you exactly where they’re occurring.

Optimize The Cache Hit Rate

Optimizing the cache hit rate improves GraphQL performance by reducing the number of requests to the origin server, thereby minimizing latency, reducing server load, and ultimately enhancing the application's performance and user experience.
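The relationship between cache hit rate and origin load is simple arithmetic, sketched below. The function names are hypothetical and the numbers are made up purely for illustration.

```python
def cache_hit_rate(hits, misses):
    """Fraction of requests answered from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


def origin_requests(total_requests, hit_rate):
    """Requests that still reach the origin server at a given hit rate."""
    return round(total_requests * (1 - hit_rate))


rate = cache_hit_rate(hits=900, misses=100)  # illustrative numbers only
at_origin = origin_requests(10_000, rate)
```

In this made-up example, a hit rate of 0.9 means that only about one in ten requests still reaches the origin server.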

The problem is that platforms like Datadog and Apollo Studio don't offer caching-related metrics out of the box.

Stellate solves this problem by offering detailed information on which types of data are cacheable and the maximum cache hit rate (CHR). With this information, engineers won't waste time optimizing GraphQL queries that are already well optimized.

It will also reveal opportunities where engineers can significantly improve cache hit rates.

Incorporate Automated Cache Purging

Cached data must be purged when a new version becomes available so that website visitors always see up-to-date information.

Most developers using an in-house GraphQL caching infrastructure don't have the option to automatically purge specific data and instead have to purge all data at once.

Unfortunately, this drops the cache hit rate to zero, which can overload the server.

When your website goes down, users will just visit a competitor to access the information they need.

Stellate prevents this situation by allowing you to automatically or manually purge data so that customers always see up-to-date information and your servers are never overloaded with a single site-wide purge.

You can set Stellate to automatically purge data as new information becomes available, or you can manually purge data by a specific parameter, like a query, field, or specific name of an operation. This ensures your users always view up-to-date information, which improves the customer experience, and you'll never have to worry about the cache hit rate dropping to zero and overloading the servers.
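To show the general idea behind targeted purging, here is a minimal sketch of a tag-based cache in which each cached response is labeled with the data it contains. This is an illustration of the technique under our own assumptions, not Stellate's actual API; the `TaggedCache` class and the tag format are hypothetical.

```python
class TaggedCache:
    """Stores responses with tags so stale entries can be purged selectively."""

    def __init__(self):
        self._responses = {}  # key -> cached response
        self._tags = {}  # key -> set of tags describing the response

    def put(self, key, response, tags):
        self._responses[key] = response
        self._tags[key] = set(tags)

    def get(self, key):
        return self._responses.get(key)

    def purge(self, tag):
        """Remove only the entries carrying `tag`; return how many were removed."""
        stale = [key for key, tags in self._tags.items() if tag in tags]
        for key in stale:
            del self._responses[key]
            del self._tags[key]
        return len(stale)


cache = TaggedCache()
cache.put("getPost:1", {"id": 1, "body": "old"}, tags=["Post:1"])
cache.put("getPost:2", {"id": 2, "body": "fine"}, tags=["Post:2"])
removed = cache.purge("Post:1")  # post 1 changed, post 2 stays cached
```

Only the entry for the changed post is evicted, so the rest of the cache keeps serving hits and the hit rate does not drop to zero.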

Proactively Identify High Impact Performance Issues With Alerts

Most companies have a reactive approach to fixing performance issues, mainly because they don't even know when issues are occurring. Instead, developers often wait until customer success relays a customer complaint to identify and solve an issue.

This reactive approach to solving GraphQL performance issues leads to a poor user experience and can cause lost revenue, as many other users likely experienced that same issue and simply left your website rather than complaining.

Stellate helps you take a more proactive approach to solving performance issues by setting performance alerts.

For instance, you can set an alert within Stellate to notify the team of critical latency issues.
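As a rough sketch of what a latency alert rule checks, the example below fires when the 95th percentile of recent response times exceeds a threshold. The function names, the nearest-rank percentile method, and the sample values are assumptions for illustration and do not describe how Stellate evaluates alerts.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list, for an integer pct (1-100)."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def should_alert(latencies_ms, threshold_ms, pct=95):
    """True when the chosen percentile of latencies exceeds the threshold."""
    if not latencies_ms:
        return False
    return percentile(latencies_ms, pct) > threshold_ms


healthy = [100] * 19 + [2000]  # a single slow outlier
degraded = [100] * 18 + [2000] * 2  # slow requests are now 10% of traffic
```

With a 1000 ms threshold, `should_alert(healthy, 1000)` stays quiet while `should_alert(degraded, 1000)` fires, so a single outlier does not page the team.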

With proactive alerting, you can reduce customer support complaints and plug holes causing lost revenue.

It also makes it easy to prioritize issues based on impact, as not all problems are equally important.

Example of Improving GraphQL Performance

Traffic spikes during Black Friday are a major problem for ecommerce websites as they often lead to server crashes.

Italic, a fast-growing ecommerce marketplace that connects high-end manufacturers directly to customers, experienced this nightmarish scenario in 2020 as the website crashed every few hours during Black Friday.

They knew that caching the data would help improve performance, but as they were using GraphQL, they couldn't find a simple caching solution – until they found Stellate.

Stellate automatically cached Italic's data, which allowed Italic to not only survive Black Friday without any downtime, but also significantly improve performance and the end-user experience by reducing average page load time by over a second.

In fact, Italic's overall cache hit rate reached 86.8%, and some queries reached a cache hit rate of over 99%.

Italic also significantly reduced cloud costs as the website's overall server load dropped by 61%.

By simply implementing Stellate, Italic delivered a much better customer experience that resulted in more revenue, as customers could quickly and easily find what they were looking for and complete the checkout process without a hitch.

In addition, Italic's team was able to focus on other high-impact tasks and enter the holiday season with confidence.

Start Improving Your GraphQL Performance Today

There are plenty of minor tasks you can implement to improve GraphQL performance, but caching is the most impactful and will yield far better results than spending time making minor tweaks and optimizations.

Building your own caching solution for GraphQL is expensive and time-consuming, and we wanted a better solution.

That's why we built Stellate, the first GraphQL CDN that leverages edge caching.

With Stellate, you can automatically cache your data, access detailed observability metrics, and shield your API from traffic spikes. As a result, you can significantly improve GraphQL performance and recover revenue lost due to website performance issues without the headache of building your own caching solution.

Learn more about Stellate by signing up for a free trial today.