May 22

How Stellate pushed the boundaries of the Edge


In the rapidly evolving world of digital infrastructure, Stellate acts as a pioneering force, continuously pushing the boundaries of what’s possible. As a company that builds on the global "edge" infrastructure landscape, Stellate often needs solutions that haven’t been fully worked out yet. Whenever this happens, we try to be a helpful design partner and push our partners toward new innovations. Here's a closer look at how we achieve this.

Stellate in the Bigger Infrastructure Picture

Stellate leverages a globally distributed edge infrastructure to bring content closer to users, reducing both latency and infrastructure costs. This setup ensures that regardless of where end-users are located, they experience quick and reliable connections to our users’ services.

Unlike traditional REST reverse proxies, Stellate incorporates heavy business logic into its infrastructure. This enables us to handle complex operations and provide advanced functionality directly at the edge, enhancing efficiency and responsiveness. A great example of this is our new Partial Query Caching feature.

"Make it Work" – The Era of Prototyping

When Stellate was in its early stages, our primary goal was to make everything work seamlessly. This period was characterized by clever workarounds to integrate desired features into our infrastructure.

To get our caching system working as intended, we initially deployed two Fastly services. This was purely an implementation detail and never visible to the end user. At a high level, we converted GraphQL POST requests to HTTP GET requests so we could internally leverage traditional HTTP caching to store otherwise uncacheable GraphQL requests.
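
To make that transform concrete, here is a minimal sketch in Rust. The hashing scheme and URL shape are illustrative assumptions on our part; the actual transform was an internal implementation detail:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a deterministic GET URL from a GraphQL POST body so a plain
/// HTTP cache can key on it. Hash and URL shape are illustrative only.
fn post_to_cacheable_get(cache_host: &str, graphql_body: &str) -> String {
    // Hash the full body (query + variables) into a stable cache key.
    // A production system would use a collision-resistant hash;
    // DefaultHasher keeps this sketch dependency-free.
    let mut hasher = DefaultHasher::new();
    graphql_body.hash(&mut hasher);
    format!("https://{cache_host}/graphql?op={:x}", hasher.finish())
}

fn main() {
    let body = r#"{"query":"{ user(id: 1) { name } }"}"#;
    println!("{}", post_to_cacheable_get("cache.internal", body));
}
```

Because the same query body always hashes to the same URL, the downstream HTTP cache can treat each distinct GraphQL operation as an ordinary cacheable resource.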

Additionally, we employed JavaScript Cloudflare Workers (CFW) to handle the specifics of GraphQL, such as injecting __typename selections, using the JavaScript reference implementation graphql-js. This allowed us to support advanced GraphQL features at the edge while the Rust-based GraphQL ecosystem, and our internal Rust knowledge, was still immature.
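
As a rough illustration of that __typename transform: the real implementation operated on graphql-js ASTs in the Worker, so this Rust sketch uses a toy selection-set type purely to show the shape of the recursion:

```rust
/// A toy selection-set node standing in for a real GraphQL AST; the
/// production code did this on graphql-js documents in the Worker.
#[derive(Debug)]
struct Field {
    name: String,
    selections: Vec<Field>, // empty for leaf (scalar) fields
}

/// Recursively ensure every non-empty selection set also selects
/// `__typename`, so the edge can identify the types in each response.
fn inject_typename(selections: &mut Vec<Field>) {
    if !selections.iter().any(|f| f.name == "__typename") {
        selections.push(Field {
            name: "__typename".into(),
            selections: Vec::new(),
        });
    }
    for field in selections.iter_mut() {
        if !field.selections.is_empty() {
            inject_typename(&mut field.selections);
        }
    }
}

fn main() {
    // { user { name } }  becomes  { user { name __typename } __typename }
    let mut query = vec![Field {
        name: "user".into(),
        selections: vec![Field { name: "name".into(), selections: Vec::new() }],
    }];
    inject_typename(&mut query);
    println!("{query:#?}");
}
```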

Throughout this prototyping phase, our team gained a deep understanding of our own technology stack. This tribal knowledge was crucial in identifying the right places to integrate new features, ensuring optimal performance and reliability. At the same time, it made new feature development unnecessarily complex, with work often spanning multiple of these three layers (the two Fastly services and the Cloudflare Worker).

Striving for Simplicity: The Reconciliation Phase

As we progressed, the focus shifted towards simplifying our infrastructure to reduce maintenance burdens, improve performance, and minimize errors.

The complexity of our initial setup led to a significant maintenance burden and a slower proxy than users expected.

Last but not least, the intricate setup was eating into our margins. In the short term, the extra cost of running three separate services was acceptable to prove that the product could be built and that there was demand from customers. In the long run, though, a more streamlined approach would be needed.

Notable areas of work during the transition

"How Many TypeScript Developers Does It Take to Write a Rust Proxy?"

This humorous yet profound question highlights our transition towards using Rust for our proxy.

To consolidate on a single infrastructure provider, we had to choose between Cloudflare and Fastly. Neither was immediately ready to do everything we wanted. After assessing both vendors, we settled on Fastly.

One of the frustrating limitations of both vendors was language support. At this point, our business logic was written in both TypeScript and Rust. While both Cloudflare and Fastly in theory support both languages, each had a clear favorite and compromised support for the other. In the end, we settled on Fastly, whose primary SDK is built for Rust. This meant porting all of our TypeScript code to Rust.

From Dictionaries to KV Stores

Fastly has an entity called “Edge Dictionaries”. These are essentially key-value pairs that are available to the edge runtime and can be used to store data that changes without requiring a new code deploy.

The downside: they’re heavily constrained in both size and update frequency.

With a medium-sized set of cache rules, our service config could easily hit the size limits. Once reached, further updates to the config would be rejected, limiting the number of Stellate features a single service could use.

They also weren’t designed for frequent updates and required manual orchestration on our side to account for, e.g., two users making changes to two separate services at the same time. This is because Dictionaries are effectively “versioned”: if updates aren’t handled carefully, two parallel updates can cause one version to override the other, preserving only one of the two changes.
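
Here is a minimal model of that lost-update hazard and the compare-and-set style orchestration it forces. The types and names are illustrative, not Fastly's API:

```rust
use std::collections::HashMap;

/// A toy "versioned" dictionary: every write produces a new version of
/// the whole dictionary, so unguarded parallel writes can clobber each other.
struct VersionedDict {
    version: u64,
    entries: HashMap<String, String>,
}

#[derive(Debug)]
struct UpdateRejected;

impl VersionedDict {
    /// Accept a write only if the caller based it on the latest version,
    /// forcing concurrent writers to re-read and merge instead of
    /// silently overwriting each other's changes.
    fn try_update(
        &mut self,
        based_on_version: u64,
        key: &str,
        value: &str,
    ) -> Result<u64, UpdateRejected> {
        if based_on_version != self.version {
            return Err(UpdateRejected); // caller must re-read and retry
        }
        self.entries.insert(key.to_string(), value.to_string());
        self.version += 1;
        Ok(self.version)
    }
}

fn main() {
    let mut dict = VersionedDict { version: 1, entries: HashMap::new() };
    // Two users read version 1, then both try to write:
    assert!(dict.try_update(1, "service-a", "rules-a").is_ok()); // now v2
    assert!(dict.try_update(1, "service-b", "rules-b").is_err()); // rejected, retry
}
```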

The proposed alternative is called “KV Stores”: a more flexible replacement for Dictionaries. At its core, it’s still a key-value lookup, but designed for larger value payloads and more frequent updates. These benefits come at the cost of read speed.

While KV Stores were still in beta, Stellate started making heavy use of them where necessary. In the process, we found multiple issues along the way and worked with Fastly to make KV Stores ready for general availability. We no longer use Dictionaries for user configurations and instead rely on the now-robust KV Store.

We do still use Dictionaries where appropriate for other configuration, e.g. feature flags, which are small enough to fit within their size constraints.

From Varnish to CacheD

As touched on earlier, we applied a clever transform internally to enable a traditional HTTP cache to store data for GraphQL requests. The cache used by Fastly was [Varnish](https://varnish-cache.org/).

While Varnish itself is a very robust and well-maintained project, it requires an incoming request in order to cache data. We solved that by putting two Fastly services behind each other: the first performed the transform, and the second sat behind Varnish and applied the actual business logic.

This setup allowed us to cache data in a rudimentary way, but we had plans for a more sophisticated caching system: Partial Query Caching. Luckily, Fastly was internally working on CacheD to replace their Varnish-based system. CacheD would give us fine-grained SDK access to caching primitives, which would become critical for Partial Query Caching. We worked with Fastly to incrementally migrate all of our users to CacheD. As one of its first adopters, we identified multiple issues and worked with Fastly to fix them.
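
To give a feel for why SDK-level cache primitives matter for Partial Query Caching, here is a generic sketch (the names are illustrative stand-ins, not the actual CacheD API): with direct lookup/insert access, each fragment of a query can be cached under its own key and TTL, instead of caching whole responses under a single URL.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A generic stand-in for low-level cache primitives exposed to the
/// edge runtime; illustrative only, not the CacheD SDK surface.
struct EdgeCache {
    entries: HashMap<String, (Instant, String)>,
}

impl EdgeCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Store one query fragment's result under its own key and TTL,
    /// something URL-keyed HTTP caching can't express.
    fn insert(&mut self, key: &str, value: String, ttl: Duration) {
        self.entries.insert(key.to_string(), (Instant::now() + ttl, value));
    }

    /// Return a fragment if present and not expired; evict it otherwise.
    fn lookup(&mut self, key: &str) -> Option<String> {
        let expired = match self.entries.get(key) {
            Some((expires_at, _)) => Instant::now() >= *expires_at,
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(_, value)| value.clone())
    }
}

fn main() {
    let mut cache = EdgeCache::new();
    // Different parts of one GraphQL response, cached independently:
    cache.insert("user:1", r#"{"name":"Ada"}"#.into(), Duration::from_secs(60));
    cache.insert("feed:1", r#"{"posts":[]}"#.into(), Duration::from_secs(5));
    assert!(cache.lookup("user:1").is_some());
}
```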

Embracing Dynamic Backends

In our clever “make it work” workarounds, we used the Cloudflare Worker infrastructure to make requests to the user’s origin server. Cloudflare exposed a simple WHATWG fetch interface and didn’t impose limitations on the endpoints you could reach out to. This was exactly what we needed.

Fastly, on the other hand, wasn’t designed to make arbitrary HTTP requests from its edge runtime, a decision undoubtedly motivated by security and a philosophy of locking down services. After all, Fastly is mainly a cache in front of your service, so you usually know the handful of backends you need to connect to.

Well, Stellate is far from a usual Fastly customer. The origins we proxy change all the time: a new service is created, an existing service is updated or deleted. The cardinality of backends is too large to maintain as a static list.

Enter “Dynamic Backends”: a new Fastly feature that allows reaching out to arbitrary HTTP endpoints. While testing the feature, we were able to give meaningful feedback and helped shape a production-ready solution for our use case.
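
Conceptually, the change looks like this (hypothetical types and names, not the Fastly SDK): instead of checking each origin against a deploy-time allow-list, the proxy resolves a service's origin at request time and registers it as a backend on the fly.

```rust
use std::collections::HashMap;

/// Hypothetical per-service configuration; in production this lives in
/// the KV Store and changes whenever a user creates or updates a service.
struct ServiceConfig {
    origin: String, // e.g. "https://api.example.com/graphql"
}

/// Resolve the origin for a request at runtime. With static backends,
/// `origin` would have to appear in a deploy-time backend list; dynamic
/// backends let the proxy register it just-in-time and forward.
fn resolve_origin<'a>(
    configs: &'a HashMap<String, ServiceConfig>,
    service_id: &str,
) -> Option<&'a str> {
    configs.get(service_id).map(|cfg| cfg.origin.as_str())
}

fn main() {
    let mut configs = HashMap::new();
    configs.insert(
        "my-service".to_string(),
        ServiceConfig { origin: "https://api.example.com/graphql".into() },
    );
    // This is where the real proxy would register a dynamic backend for
    // the resolved origin and send the request onward.
    assert_eq!(
        resolve_origin(&configs, "my-service"),
        Some("https://api.example.com/graphql")
    );
}
```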

Today, all requests from Stellate to origin servers come from Fastly, which allowed us to eliminate the Cloudflare Worker from our hot code path. This significantly reduced latencies for cache misses and passes and lowered the risk of outages.

Conclusion: The Right Feedback Loop

Throughout our journey, maintaining a robust feedback loop with all involved parties has been invaluable. Continuous feedback and iteration have enabled us to refine our infrastructure, innovate alongside our partners, and meet the evolving needs of our customers.

At Stellate, pushing the boundaries of infrastructure is not just about technology — it’s about understanding our customers’ needs, anticipating future challenges, and continuously improving our solutions to stay ahead so we can offer the best GraphQL caching solution in the world.