Production reduced service levels
Incident Report for Treasury Prime
Postmortem

An update on recent service disruptions

Over the last several months, we have experienced four distinct service disruptions that prevented reliable access to Treasury Prime’s platform. I know how impactful this is to our customers, and I’m sorry for the pain these outages have caused. We recognize that our service is mission critical for the end users that rely on consistent money movement to live their lives and run their businesses. I’d like to share a bit more context about what happened and how we’re responding to avoid these issues moving forward.

Timeline

May 5, 2023 (30 minutes)

A spike in book transfer requests caused contention within our database-backed queue system; once the blocking queries were canceled, a full platform restart remediated the service disruption.

June 21, 2023 (43 minutes)

A second spike in book transfer requests caused contention within our database-backed queue system; once the blocking queries were canceled, a full platform restart remediated the service disruption.

June 26, 2023 (58 minutes)

Due to an unusual increase in pending ACH debit payment instructions, we experienced a spike in book transfer requests that caused contention within our database-backed queue system; in this instance, the blocking queries could not be canceled and a database failover was required to clear them. A full platform restart remediated the service disruption.
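
For readers interested in the mechanics: the sketch below shows one common way to surface and cancel blocking queries on a Postgres cluster, using the built-in pg_stat_activity view together with pg_blocking_pids() and pg_cancel_backend(). It is an illustrative JDBC snippet with placeholder connection details, not the exact tooling we used during these incidents.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Illustrative only: list backends that other sessions are waiting on,
    // then ask Postgres to cancel their current queries.
    public class CancelBlockers {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://db.example.internal:5432/platform"; // placeholder
            try (Connection conn = DriverManager.getConnection(url, "ops", "secret");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT DISTINCT unnest(pg_blocking_pids(pid)) AS blocker " +
                     "FROM pg_stat_activity " +
                     "WHERE cardinality(pg_blocking_pids(pid)) > 0")) {
                while (rs.next()) {
                    int blockerPid = rs.getInt("blocker");
                    // pg_cancel_backend aborts the backend's current query;
                    // pg_terminate_backend would drop the whole connection instead.
                    try (Statement cancel = conn.createStatement()) {
                        cancel.execute("SELECT pg_cancel_backend(" + blockerPid + ")");
                    }
                }
            }
        }
    }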

July 12, 2023 (1 hour 39 minutes)

A performance regression triggered by an edge case in platform usage caused a large amount of data to be returned to our application servers, producing an out-of-memory (OOM) condition in the JVM. Restarting an application instance only moved the issue to another application instance, delaying recovery. Once the problematic query behavior was identified, a code change was deployed to prevent the OOM scenario, remediating the service disruption.
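
As an illustration of the class of fix involved (not the exact change we shipped), the snippet below bounds how much data a single JDBC query can pull into the JVM by streaming rows in small batches and capping the total row count. Table and column names are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Illustrative only: keep a query's memory footprint on the JVM bounded.
    public class BoundedQuery {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://db.example.internal:5432/platform"; // placeholder
            try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
                // The Postgres JDBC driver only streams with a cursor when
                // autocommit is off; otherwise it buffers the entire result set.
                conn.setAutoCommit(false);
                try (PreparedStatement ps = conn.prepareStatement(
                         "SELECT id, status FROM transfers WHERE status = ?")) {
                    ps.setString(1, "pending");
                    ps.setFetchSize(500);    // fetch rows in small batches
                    ps.setMaxRows(100_000);  // hard ceiling as a safety net
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            handle(rs.getLong("id"), rs.getString("status"));
                        }
                    }
                }
                conn.commit();
            }
        }

        private static void handle(long id, String status) {
            // application-specific processing
        }
    }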

Root cause analysis

While these four incidents were triggered by distinct changes in activity, the common theme is our production database. Treasury Prime’s architecture is a traditional monolith code base connecting to an AWS Aurora Postgres database cluster. In each of the four incidents, unexpected usage patterns led our application servers to create database contention that our architecture failed to handle gracefully, and the result was cascading failures that prevented the application from operating normally.
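
While we won’t go into the internals of our queue implementation here, the general shape of the problem is easier to see with an example: a common pattern for reducing contention in a Postgres-backed job queue is to have workers claim rows with FOR UPDATE SKIP LOCKED, so concurrent workers never wait on one another’s row locks. The sketch below uses a hypothetical jobs table and is illustrative only.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Illustrative only: a worker claims one queued job without blocking on
    // rows locked by other workers. The "jobs" table is hypothetical.
    public class QueueWorker {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://db.example.internal:5432/platform"; // placeholder
            try (Connection conn = DriverManager.getConnection(url, "worker", "secret")) {
                conn.setAutoCommit(false);
                try (PreparedStatement claim = conn.prepareStatement(
                         "SELECT id, payload FROM jobs WHERE status = 'queued' " +
                         "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED");
                     ResultSet rs = claim.executeQuery()) {
                    if (rs.next()) {
                        long id = rs.getLong("id");
                        runJob(rs.getString("payload"));
                        try (PreparedStatement done = conn.prepareStatement(
                                 "UPDATE jobs SET status = 'done' WHERE id = ?")) {
                            done.setLong(1, id);
                            done.executeUpdate();
                        }
                    }
                }
                conn.commit(); // releases the row lock; other workers skip the row until then
            }
        }

        private static void runJob(String payload) {
            // execute the queued work, e.g. a book transfer
        }
    }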

It's worth noting that our connections to bank cores were not a constraining factor in these incidents. Our adaptable, real-time bank driver hub continued to operate without issue.

Next Steps

Together with our engineering team, I’ve identified specific next steps to address the short-term takeaways from these service disruptions.

  • Implement performance improvements to mitigate database contention and increase our capacity to queue and execute book transfers. These changes went into production shortly after the June 26 and July 12 incidents, and our data indicates they have improved throughput.
  • Implement configuration changes to “fail fast” and improve mean time to recovery, specifically related to load balancer health checks, database failover, and database connection pooling; a sketch of the connection-pool piece follows this list.
  • Modify operational processes with a focus on reducing the duration of a production outage.
  • Collaborate with our legal team to incorporate explicit service-level agreements and corresponding service credits into our API Services Agreements going forward.
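
To make the connection-pooling piece of the “fail fast” item more concrete, the sketch below shows what an aggressive-timeout pool configuration can look like using HikariCP, a common JVM connection pool. The specific pool, values, and connection details are illustrative assumptions, not our production configuration.

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    // Illustrative only: short timeouts surface a stuck database quickly
    // instead of letting application threads queue up behind it.
    public class FailFastPool {
        public static HikariDataSource build() {
            HikariConfig cfg = new HikariConfig();
            cfg.setJdbcUrl("jdbc:postgresql://db.example.internal:5432/platform"); // placeholder
            cfg.setUsername("app");
            cfg.setPassword("secret");
            cfg.setMaximumPoolSize(20);
            cfg.setConnectionTimeout(2_000);  // give up on acquiring a connection after 2s
            cfg.setValidationTimeout(1_000);  // connection health check must answer within 1s
            cfg.setMaxLifetime(300_000);      // recycle connections so a failover is picked up
            // Postgres driver-level socket timeout (seconds): abandon queries on a dead node.
            cfg.addDataSourceProperty("socketTimeout", "30");
            return new HikariDataSource(cfg);
        }
    }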

Treasury Prime has experienced incredible growth over the past 12 months, and significant increases in platform usage have accompanied that growth. We’ll be taking the following steps to improve our ability to scale in the future.

  • Form a stability tiger team, composed of our senior engineering staff, to identify and implement the major architectural changes necessary to support our long-term scaling needs.
  • Establish network-level and application-level rate limits. The Treasury Prime platform was built with an “async-first” design from the beginning; we believed a heavy reliance on asynchronous job execution would allow us to scale our web tier independently from our worker tier. In practice, we have determined that rate limits cannot be avoided when operating a multi-tenant platform. We will share more soon about the specifics of this change; a sketch of what an application-level limiter can look like follows this list.
  • Add tooling to identify performance “hot spots” and optimize response times for critical-path features, especially book transfers and asynchronous job execution.
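
As a sketch of what an application-level limit can look like (the mechanism we ultimately ship may differ), the snippet below implements a small per-tenant token-bucket limiter; the capacity and refill rate shown are arbitrary examples.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: per-tenant token bucket. Real deployments typically
    // enforce limits at the network edge as well as in the application.
    public class RateLimiter {
        private static final double CAPACITY = 50.0;        // burst size
        private static final double REFILL_PER_SEC = 10.0;  // steady-state rate

        private static final class Bucket {
            double tokens = CAPACITY;
            long lastRefillNanos = System.nanoTime();
        }

        private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

        /** Returns true if the tenant's request should be allowed. */
        public boolean tryAcquire(String tenantId) {
            Bucket b = buckets.computeIfAbsent(tenantId, id -> new Bucket());
            synchronized (b) {
                long now = System.nanoTime();
                double elapsedSec = (now - b.lastRefillNanos) / 1_000_000_000.0;
                b.tokens = Math.min(CAPACITY, b.tokens + elapsedSec * REFILL_PER_SEC);
                b.lastRefillNanos = now;
                if (b.tokens >= 1.0) {
                    b.tokens -= 1.0;
                    return true;
                }
                return false; // caller should reject the request, e.g. with HTTP 429
            }
        }
    }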

Summary

Once again, I’m sorry for the pain these recent disruptions have caused both our direct customers and their end users. The service we operate is mission critical, and I am committed to making the changes required to scale while minimizing unplanned downtime going forward. Let us know if you have additional questions, and thank you for your patience.

Mike Clarke

VP of Engineering, Treasury Prime

Posted Aug 01, 2023 - 23:35 UTC

Resolved
This incident has been resolved.
Posted Jul 12, 2023 - 21:22 UTC
Monitoring
A fix has been put in place and we are monitoring the results.
Posted Jul 12, 2023 - 21:00 UTC
Investigating
We are currently investigating this issue.
Posted Jul 12, 2023 - 19:58 UTC
This incident affected: API and Bank Console.