Over the last several months, we have experienced four distinct service disruptions that prevented reliable access to Treasury Prime’s platform. I know how impactful this is to our customers, and I’m sorry for the pain these outages have caused. We recognize that our service is mission critical for the end users that rely on consistent money movement to live their lives and run their businesses. I’d like to share a bit more context about what happened and how we’re responding to avoid these issues moving forward.
May 5, 2023 (30 minutes)
A spike in book transfer requests caused contention within our database-backed queue system; once the blocking queries were canceled, a full platform restart remediated the service disruption.
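For readers who want the mechanics: "canceling the blocking queries" in Postgres terms means finding the sessions that hold the contended locks and asking the database to cancel them. The sketch below is purely illustrative, not our internal tooling, and uses placeholder connection details; it relies on the standard pg_blocking_pids and pg_cancel_backend functions over JDBC.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative only: find backends that other sessions are waiting on,
// then ask Postgres to cancel them. Connection details are placeholders.
public class CancelBlockingQueries {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://db.example.internal:5432/platform"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "ops_user", "secret");
             Statement stmt = conn.createStatement()) {

            // Any backend that appears in another backend's pg_blocking_pids()
            // list is holding a lock that someone else is waiting on.
            ResultSet rs = stmt.executeQuery(
                "SELECT DISTINCT blocker.pid, blocker.query " +
                "FROM pg_stat_activity waiting " +
                "JOIN LATERAL unnest(pg_blocking_pids(waiting.pid)) AS b(pid) ON true " +
                "JOIN pg_stat_activity blocker ON blocker.pid = b.pid");

            while (rs.next()) {
                int pid = rs.getInt("pid");
                System.out.println("Cancelling blocking pid " + pid + ": " + rs.getString("query"));
                // pg_cancel_backend sends a cancel request to the backend;
                // it is the gentler option before pg_terminate_backend.
                try (Statement cancel = conn.createStatement()) {
                    cancel.execute("SELECT pg_cancel_backend(" + pid + ")");
                }
            }
        }
    }
}
```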
June 21, 2023 (43 minutes)
A second spike in book transfer requests caused the same contention within our database-backed queue system; once the blocking queries were canceled, a full platform restart remediated the service disruption.
June 26, 2023 (58 minutes)
An unusual increase in pending ACH debit payment instructions produced a spike in book transfer requests that caused contention within our database-backed queue system; in this instance, the blocking queries could not be canceled, and a database failover was required to clear them. A full platform restart remediated the service disruption.
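In the June 26 case the gentler cancel path was not enough. The sketch below, again illustrative with placeholder identifiers, shows what that escalation looks like: terminate the blocking backends outright, and if the writer instance still cannot recover, fail the Aurora cluster over to a reader.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative escalation path (placeholder connection details). If
// pg_cancel_backend() has no effect, pg_terminate_backend() kills the
// backend process; if the writer is wedged beyond that, an Aurora
// failover promotes a reader, e.g.:
//   aws rds failover-db-cluster --db-cluster-identifier <cluster-name>
public class TerminateBlockingBackends {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://db.example.internal:5432/platform"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "ops_user", "secret");
             Statement stmt = conn.createStatement()) {
            // Terminate every backend that is blocking another session.
            stmt.execute(
                "SELECT pg_terminate_backend(b.pid) " +
                "FROM pg_stat_activity waiting, unnest(pg_blocking_pids(waiting.pid)) AS b(pid)");
        }
    }
}
```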
July 12, 2023 (1 hour 39 minutes)
A performance regression triggered by an edge case in platform usage caused a large amount of data to be returned to our application servers, producing an out-of-memory (OOM) condition in the JVM. Restarting an application instance only shifted the problem to another instance, delaying recovery. Once the problematic query behavior was identified, a code change was deployed to prevent the OOM scenario, remediating the service disruption.
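The code change itself is specific to our platform, but the general shape of the fix will be familiar to anyone running large queries on the JVM: bound the result set and stream rows in chunks rather than materializing everything in heap. A minimal sketch, with hypothetical table and column names:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical table/column names; illustrates streaming a large result set
// in fixed-size chunks instead of loading every row into the JVM heap at once.
public class StreamLargeQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://db.example.internal:5432/platform"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "app_user", "secret")) {
            // The Postgres JDBC driver only honors fetch size when autocommit is off.
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, amount FROM book_transfers WHERE status = ?")) {
                ps.setString(1, "pending");
                ps.setFetchSize(1_000); // pull 1,000 rows per round trip, not the whole set
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        process(rs.getLong("id"), rs.getLong("amount"));
                    }
                }
            }
        }
    }

    static void process(long id, long amountCents) {
        // application logic here
    }
}
```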
While these four incidents were triggered by distinct changes in activity, the common theme is our production database. Treasury Prime’s architecture is a traditional monolithic code base connected to an AWS Aurora Postgres database cluster. In each of the four incidents, unexpected usage patterns created database contention that our architecture failed to handle gracefully, and the resulting cascading failures prevented the application from operating normally.
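Handling contention gracefully largely means making sure a runaway workload fails fast instead of piling up until the whole application falls over. As one illustration of the kind of guardrail involved, and not a description of our actual configuration, a connection pool (HikariCP is assumed here) can be hard-capped and a server-side statement_timeout applied so that pathological queries are cut off automatically:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Illustrative guardrails (values are placeholders, not our production settings):
// cap the connection pool so a surge cannot exhaust the database, and set a
// server-side statement_timeout so runaway queries are cancelled automatically.
public class PoolGuardrails {
    public static HikariDataSource buildDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.example.internal:5432/platform"); // placeholder
        config.setUsername("app_user");
        config.setPassword("secret");
        config.setMaximumPoolSize(32);       // hard cap on concurrent DB connections
        config.setConnectionTimeout(5_000);  // fail fast (ms) if the pool is saturated
        // Postgres cancels any statement running longer than 30s on these connections.
        config.setConnectionInitSql("SET statement_timeout = '30s'");
        return new HikariDataSource(config);
    }
}
```

The design intent of limits like these is that a surge produces quick, visible errors in one corner of the system rather than lock queues and memory pressure that take the whole platform down.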
It’s worth acknowledging that the connection to our banks’ cores was not a limiting factor in any of these incidents; our flexible, real-time bank driver hub operated without issue throughout.
Together with our engineering team, I’ve identified specific next steps to address the short-term takeaways from these service disruptions.
Treasury Prime has experienced incredible growth over the past 12 months, and significant increases in platform usage have accompanied that growth. We’ll be taking the following steps to improve our ability to scale in the future.
Once again, I’m sorry for the pain these recent disruptions have caused both our direct customers and their end users. The service we operate is mission critical, and I am committed to making the changes required to scale while minimizing unplanned downtime going forward. Let us know if you have additional questions, and thank you for your patience.
Mike Clarke
VP of Engineering, Treasury Prime