Switch to New Infrastructure

High-level overview

  • Biggest architecture change to Stellate since launch.
  • You are in control of moving to the new system.
  • Below is a list of all behavior changes in the system.
  • We ask that you migrate to the new system in the next 6 weeks.

New Infrastructure

We at Stellate have finished the biggest architectural improvement to our platform since our first launch in 2021!

When we first started, Edge Computing was a new concept, and the primitives to build Stellate didn’t exist, so we had to make some tough compromises to get the product to work. Edge Computing has matured significantly in the last couple of years, and we now see new opportunities to build an even better product for you.

  • More reliable platform with fewer dependencies.
  • Better performance with even lower latencies.
  • Higher cache hit rates through new features.

Stay tuned for upcoming announcements as we start building on this new foundation to improve the platform for you. But first, we needed to improve the foundation itself. We’ve rewritten our caching stack in Rust and removed our usage of Cloudflare services that introduced latency, complexity, and stability risks. This is the most significant change we expect to make for years to come.

Testing

We have an incredible responsibility to keep your services running, and we thank you for your trust. To honor that trust, we have tested extensively in production over the last couple of months. We re-ran millions of real production requests through the new system to discover potential issues without the risk of impacting end users. This process surfaced a few issues, which we have since fixed.

Rollout

We had hoped this migration would be entirely transparent. Along the way, however, we uncovered issues in our old system that we deemed worth improving or fixing. Below is an explanation of the issues identified and the resulting behavior changes in the new system.

While these changes are all for the better, we wanted to give you control over when those changes are introduced to your production traffic.

As of today, you can go to the settings for any of your existing services and toggle the new system on or off.

Services created from the 1st of August onwards automatically use the new system and can’t switch back to the legacy infrastructure.

Screenshot of the toggle in settings that can be used to migrate.

Our support team is here for any questions you have before or after the migration.

Your transition period begins today and gives your team six weeks to migrate proactively. After the deadline, our team will migrate the remaining services on behalf of customers. If you cannot complete the migration by September 15th and have concerns about your service being migrated, please let us know so we can work out a suitable migration path together.

Rollout checklist

  • Finish reading the migration guide
  • Check if your service limits the number of aliases in a query
    • If yes, remove or increase the limit
  • Migrate staging service using the toggle in service settings
  • Test staging service
  • Migrate production service using the toggle in service settings

Major change: __typename insertions replaced with __stellate__internal__typename: __typename

When Stellate sends GraphQL queries to your origin server, we inject __typename fields into the query. This type information is used for our caching, purging, and metrics features.

During our testing, we found cases where some of these extra __typename fields were not removed from the data object before the response was sent to clients.

To fix this, we've made it more explicit when Stellate requests type information. If a selection set already includes the __typename selection, we will use that information. If a selection set doesn't ask for __typename, or has that field selection aliased to a different name, we will add a new field selection __stellate__internal__typename: __typename. On the boundaries of fragment spreads and inline fragments, the enclosing selection set is responsible for selecting the __typename field. As a result, the top-most selection sets of (inline) fragments will never have an aliased typename selection added by Stellate.

Example client request:

query SampleRequest {
  authors(limit: 10) {
    posts(limit: 10) {
      __typename
      ... on BlogPost {
        title
      }
      ... on Tweet {
        message
      }
    }
  }
}

Before, your origin would receive:

query SampleRequest {
  authors(limit: 10) {
    __typename # added by Stellate
    posts(limit: 10) {
      __typename # from the original client request
      ... on BlogPost {
        __typename # added by Stellate
        title
      }
      ... on Tweet {
        __typename # added by Stellate
        message
      }
    }
  }
}

After:

query SampleRequest {
  authors(limit: 10) {
    __stellate__internal__typename: __typename # Added by Stellate (using the new alias)
    posts(limit: 10) {
      __typename # Not touched since this was already included in the original query
      ... on BlogPost {
        title
      }
      ... on Tweet {
        message
      }
    }
  }
}
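
The injection rules described in this section can be sketched in a few lines. This is a minimal illustration over a simplified, hypothetical query model (SketchNode, rewriteSelections, and the addHere flag are names invented for this example), not Stellate's actual implementation. It also assumes, based on the example above, that the root selection set of the operation is left untouched.

```typescript
// Simplified, hypothetical query model; a real implementation would walk a full GraphQL AST.
type SketchNode =
  | { kind: "field"; fieldName: string; alias?: string; children?: SketchNode[] }
  | { kind: "inlineFragment"; onType: string; children: SketchNode[] };

const INTERNAL_TYPENAME_ALIAS = "__stellate__internal__typename";

// True when the selection set already selects __typename without an alias.
function selectsPlainTypename(nodes: SketchNode[]): boolean {
  return nodes.some(
    (n) => n.kind === "field" && n.fieldName === "__typename" && n.alias === undefined,
  );
}

// Returns a copy of the selection set with the aliased __typename added where needed.
// addHere is false for the operation root (assumption) and for the top-most
// selection set of an inline fragment, which is never modified.
function rewriteSelections(nodes: SketchNode[], addHere: boolean): SketchNode[] {
  const rewritten = nodes.map((n): SketchNode => {
    if (n.kind === "inlineFragment") {
      return { ...n, children: rewriteSelections(n.children, false) };
    }
    return n.children ? { ...n, children: rewriteSelections(n.children, true) } : n;
  });
  if (!addHere || selectsPlainTypename(nodes)) {
    return rewritten;
  }
  return [
    { kind: "field", fieldName: "__typename", alias: INTERNAL_TYPENAME_ALIAS },
    ...rewritten,
  ];
}

// The SampleRequest query from above, expressed in the simplified model.
const sampleSelections: SketchNode[] = [
  {
    kind: "field",
    fieldName: "authors",
    children: [
      {
        kind: "field",
        fieldName: "posts",
        children: [
          { kind: "field", fieldName: "__typename" },
          { kind: "inlineFragment", onType: "BlogPost", children: [{ kind: "field", fieldName: "title" }] },
          { kind: "inlineFragment", onType: "Tweet", children: [{ kind: "field", fieldName: "message" }] },
        ],
      },
    ],
  },
];

const rewrittenSample = rewriteSelections(sampleSelections, false);
```

In this sketch only the authors selection set receives the aliased field. The posts selection set already selects __typename, and the two inline fragments rely on their enclosing selection set.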

While this change is for the better, as it lets us handle type-name additions more precisely, it may cause issues for your service. Some servers limit how many aliases are allowed in a query, and since we now use aliases instead of plain __typename fields, you are more likely to hit this limit.

The intent of setting such a limit is to reduce the risk of bad actors abusing aliases to multiply the impact of an expensive query. Unfortunately, such limits do not take query complexity into account, which makes them prone to false positives. __typename aliases have almost no performance overhead and should not count against the configured alias limit. We will try to work with the common packages to ignore __typename for these limits, but for now we recommend simply increasing the limit if you have one in place.
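
To estimate how close a query comes to an alias limit, it can help to count aliases with and without the __typename aliases. The following is a rough sketch over a simplified, hypothetical field model (AliasSketchField and countAliases are invented for this example); it does not reflect how any particular alias-limiting package counts.

```typescript
// Hypothetical, simplified field model used only for this sketch.
interface AliasSketchField {
  fieldName: string;
  alias?: string;
  children?: AliasSketchField[];
}

// Counts aliased fields; optionally skips aliases of the __typename meta field.
function countAliases(fields: AliasSketchField[], ignoreTypenameAliases: boolean): number {
  let total = 0;
  for (const f of fields) {
    const isAliased = f.alias !== undefined && f.alias !== f.fieldName;
    const isTypename = f.fieldName === "__typename";
    if (isAliased && !(ignoreTypenameAliases && isTypename)) {
      total += 1;
    }
    if (f.children) {
      total += countAliases(f.children, ignoreTypenameAliases);
    }
  }
  return total;
}

// A query shape with one alias added for type information and one written by the client.
const aliasSample: AliasSketchField[] = [
  {
    fieldName: "authors",
    children: [
      { fieldName: "__typename", alias: "__stellate__internal__typename" },
      { fieldName: "name", alias: "displayName" },
    ],
  },
];

const strictAliasCount = countAliases(aliasSample, false); // counts every alias
const lenientAliasCount = countAliases(aliasSample, true); // ignores __typename aliases
```

The difference between the two counts is the number of aliases that exist only to carry type information.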

Patches / Minor differences

These changes are unlikely to affect any of our customers negatively. We would normally release them without providing a migration window, but have included them in this one out of an abundance of caution.

Improved precision for >Number.MAX_SAFE_INTEGER

Previously, Stellate's CDN was written primarily in TypeScript. TypeScript compiles to JavaScript, whose number type does not guarantee precision for integers larger than 2^53 - 1 (Number.MAX_SAFE_INTEGER). You can read more about how TypeScript handles integers here. In our testing, we found a few cases where responses included integers larger than 9007199254740991, which our old implementation handled imprecisely. This is now fixed.
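
The precision loss is easy to reproduce in any JavaScript runtime. The sketch below parses a JSON payload containing an integer just above the safe range; the payload and variable names are made up for illustration.

```typescript
// 2^53 - 1 is the largest integer a JavaScript number can represent safely.
const maxSafe: number = Math.pow(2, 53) - 1;

// 9007199254740993 (2^53 + 1) has no exact double representation,
// so parsing it as a number silently changes the value.
const parsedBigId = JSON.parse('{"id": 9007199254740993}') as { id: number };
const roundTripped: string = String(parsedBigId.id);

console.log(maxSafe === Number.MAX_SAFE_INTEGER); // true
console.log(roundTripped); // "9007199254740992", no longer the value that was sent
console.log(Number.isSafeInteger(parsedBigId.id)); // false
```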

APQs will need to be refetched

When you switch to the new infrastructure all Automatic Persisted Queries (APQs) will be refetched. This will cause a minor increase in load for your origin service as the first request for each APQ will be a cache miss rather than a cache hit.

The underlying storage we use for the map from APQ id to query has changed, so the map needs to be rebuilt.
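
The effect can be pictured with a small in-memory stand-in for that map. ApqStoreSketch, the id "abc123", and the query text are all hypothetical; this only illustrates why the first lookup after the switch misses and later ones hit.

```typescript
// Hypothetical in-memory stand-in for the map from APQ id to query text.
class ApqStoreSketch {
  private queries = new Map<string, string>();

  // Returns the stored query, or undefined when the id is not registered yet.
  lookup(apqId: string): string | undefined {
    return this.queries.get(apqId);
  }

  // Registers the full query text once it has been provided again.
  register(apqId: string, queryText: string): void {
    this.queries.set(apqId, queryText);
  }
}

// A fresh store is empty, like the new map right after a service is switched over.
const freshApqStore = new ApqStoreSketch();
const firstApqLookup = freshApqStore.lookup("abc123"); // miss: undefined
freshApqStore.register("abc123", "query { authors { name } }");
const secondApqLookup = freshApqStore.lookup("abc123"); // hit: the stored query text
```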

No longer serializing surrogate keys that are objects/arrays

To create cache keys, we expect ids to be scalar. We discovered during testing that one of our customers had an object value for an id field. This caused an overly generic surrogate key to be generated, so a purge would remove all responses containing this id type. We have opted not to generate surrogate keys for this edge case until we find a scalable and safe solution.

This only impacts a single service, the owner of which has already been notified.
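
As an illustration of the new behavior, a key generator can simply refuse non-scalar ids. The key format "<TypeName>:<id>" and the function name below are assumptions made for this sketch, and the naive stringification at the end is only one plausible way an object id could collapse into a generic key, not a description of the old system.

```typescript
// Hypothetical key format "<TypeName>:<id>"; the real format is not described in this guide.
// Returns null when no surrogate key should be generated.
function surrogateKeySketch(typeName: string, id: unknown): string | null {
  const isScalar = typeof id === "string" || typeof id === "number";
  return isScalar ? `${typeName}:${String(id)}` : null;
}

const scalarIdKey = surrogateKeySketch("Author", 42); // "Author:42"
const objectIdKey = surrogateKeySketch("Author", { region: "eu", n: 1 }); // null

// Naive stringification maps every object to the same text,
// which is how distinct ids can end up sharing one key.
const naiveKeyA = `Author:${String({ region: "eu", n: 1 })}`;
const naiveKeyB = `Author:${String({ region: "us", n: 2 })}`;
console.log(naiveKeyA === naiveKeyB); // true
```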

Improved bot protection

When a Stellate endpoint receives a non-GraphQL request with a user agent that matches our list of known bots, we return a 204 on your behalf to protect your origin from irrelevant bot traffic.

We have updated/extended our list of known bots.
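
The check described above amounts to a short-circuit before the request reaches your origin. In the sketch below the bot patterns, the IncomingSketch shape, and the hasGraphQLQuery flag are all hypothetical; the actual list and detection logic are not published in this guide.

```typescript
// Hypothetical sample patterns standing in for the real list of known bots.
const knownBotPatternsSketch: RegExp[] = [/googlebot/i, /bingbot/i];

// Hypothetical, simplified view of an incoming request.
interface IncomingSketch {
  userAgent: string;
  hasGraphQLQuery: boolean;
}

// Returns 204 when the request should be answered without contacting the origin,
// or null when it should be processed normally.
function botShortCircuitStatus(req: IncomingSketch): 204 | null {
  if (req.hasGraphQLQuery) {
    return null; // GraphQL requests are never short-circuited in this sketch
  }
  const isKnownBot = knownBotPatternsSketch.some((p) => p.test(req.userAgent));
  return isKnownBot ? 204 : null;
}

const crawlerStatus = botShortCircuitStatus({ userAgent: "Mozilla/5.0 (compatible; Googlebot/2.1)", hasGraphQLQuery: false });
const clientStatus = botShortCircuitStatus({ userAgent: "my-graphql-client/1.0", hasGraphQLQuery: true });
```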

More consistent handling of new lines in error cases

Stellate has always stripped redundant new lines from responses and produced compact JSON objects for successful requests. We uncovered that in some rare error cases, this step was skipped. We now do this formatting step in all situations.

Before:

{
  "data": null,


  "error": ["<STACK_TRACE>"]
}

After:

{
  "data": null,
  "error": ["<STACK_TRACE>"]
}
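
One common way to get this kind of normalization is to parse the body and serialize it again. The sketch below assumes that approach for illustration and produces fully compact output; the exact whitespace Stellate emits may differ.

```typescript
// The "Before" body from above, including the redundant blank lines.
const rawErrorBody = '{\n  "data": null,\n\n\n  "error": ["<STACK_TRACE>"]\n}';

// Parsing and re-serializing drops all insignificant whitespace.
const compactErrorBody: string = JSON.stringify(JSON.parse(rawErrorBody));

console.log(compactErrorBody); // {"data":null,"error":["<STACK_TRACE>"]}
```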

Ordering of "data", "errors" and "extensions" response fields

Our move from TypeScript to Rust, together with significant changes to the logic, can in some cases change the order of the top-level key/value pairs in a response. The previous version did not enforce or imply any order either, so we expect no impact from this change.

The GraphQL Spec specifically states the keys inside of data should be ordered as they were in the request. We naturally continue to maintain this ordering.

{
  "data": {},
  "errors": [],
  "extensions": {}
}

Might become:

{
  "errors": [],
  "data": {},
  "extensions": {}
}
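
If you have tests or tooling that compare raw response bodies as strings, comparing the parsed content instead avoids depending on top-level key order. The helper below is a hypothetical sketch; it ignores the order of top-level keys only, and still treats the order of keys inside each value as significant.

```typescript
// Compares two response bodies by top-level key, ignoring the order of those keys.
function sameTopLevelContent(bodyA: string, bodyB: string): boolean {
  const a = JSON.parse(bodyA) as Record<string, unknown>;
  const b = JSON.parse(bodyB) as Record<string, unknown>;
  const keysA = Object.keys(a).sort();
  const keysB = Object.keys(b).sort();
  if (keysA.length !== keysB.length) {
    return false;
  }
  return keysA.every(
    (key, i) => key === keysB[i] && JSON.stringify(a[key]) === JSON.stringify(b[key]),
  );
}

const legacyOrderBody = '{"data":{},"errors":[],"extensions":{}}';
const newOrderBody = '{"errors":[],"data":{},"extensions":{}}';

console.log(legacyOrderBody === newOrderBody); // false: raw strings differ
console.log(sameTopLevelContent(legacyOrderBody, newOrderBody)); // true: same content
```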