LUSID is a mission-critical service for our clients, demanding 24x7 availability. Our goal is zero downtime, which we aim to achieve through:
Resilience: LUSID is architected to be robust to individual failures.
Contingency: If the system does suffer a problem, we have established plans for recovery.
Testing: We regularly validate that our service continuity strategy works and rehearse putting it into practice.
Improvement: We learn from our experiences and evaluate new technologies and best practices.
Approach
LUSID is architected to have no single point of failure and to provide comprehensive redundancy.
We host LUSID in AWS or Azure and distribute our estate across all “availability zones” (AZs) in the regions in which we operate. All data is replicated, and LUSID is tested to be tolerant of multiple failure modes, including individual container failures, loss of server instances, network degradation, entire AZ outages, and external DDoS attacks.
We routinely inject failures in our environments - a practice known as “chaos engineering” - and verify there is no client impact by examining our monitoring and telemetry sources.
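As an illustration, a failure-injection experiment of this kind could be sketched as follows; the instance tags, health endpoint and timings are illustrative assumptions rather than details of our actual tooling:

```python
import random
import time

import boto3
import requests

HEALTH_URL = "https://example.lusid.com/api/health"  # hypothetical endpoint

ec2 = boto3.client("ec2", region_name="eu-west-2")

# Pick a random running instance from a (hypothetical) chaos-test environment.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["chaos-test"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Terminated {victim}; probing service health...")

    # Verify there is no client impact while the lost capacity is replaced.
    for _ in range(30):
        response = requests.get(HEALTH_URL, timeout=5)
        assert response.status_code == 200, "service degraded during experiment"
        time.sleep(10)
```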
In addition, we regularly test our automated recovery mechanisms, including database failover and AZ annexation.
Finally, we test our disaster recovery processes, in which we restore the entire service from backups in a new region, as would be required in the event of a catastrophic geographical outage.
Monitoring
We have extensive monitoring and telemetry that provides us with a real-time view of the health of LUSID. This data is analysed continuously and feeds into our internal alerting tools, ensuring we are immediately aware of any service degradation.
In addition, we track the availability of all LUSID APIs using externally hosted tools that regularly and systematically exercise the various features of the API. The results of these tests are used to determine our uptime statistics, which we make available to all clients.
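For illustration, an external availability probe of this kind could look like the sketch below; the endpoint, authentication and probe frequency are assumptions made for the example only:

```python
import time

import requests

API_URL = "https://example.lusid.com/api/portfolios"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}         # hypothetical auth

def probe():
    """Call the endpoint once, returning (success, latency in seconds)."""
    started = time.monotonic()
    try:
        response = requests.get(API_URL, headers=HEADERS, timeout=10)
        ok = response.status_code == 200
    except requests.RequestException:
        ok = False
    return ok, time.monotonic() - started

# One probe per minute over an hour, then derive an uptime percentage.
results = []
for _ in range(60):
    ok, latency = probe()
    results.append(ok)
    time.sleep(60)

uptime = 100.0 * sum(results) / len(results)
print(f"Uptime over the window: {uptime:.2f}%")
```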
Data durability and backup
LUSID maintains real-time replicas of all data to ensure that all data changes are durable and resilient to failures in our data storage systems. These replicas exist in all availability zones in the primary region, and an asynchronous replica is also maintained in the disaster recovery region.
In addition to these replicas, we also take backups of the data via two mechanisms:
Real-time backups which facilitate point-in-time recovery to any time in the past 30 days.
Daily snapshots which are moved to a different cloud provider region, to facilitate disaster recovery.
These provisions ensure that once data is successfully written to LUSID, we have very high confidence in its ongoing availability.
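As a purely illustrative sketch of point-in-time recovery, the example below assumes an AWS RDS-style managed database restored with boto3; the identifiers are hypothetical and this does not describe LUSID's actual storage layer:

```python
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds", region_name="eu-west-2")

# Restore a copy of the database as it existed at a chosen instant within
# the 30-day retention window described above.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="example-source-db",    # hypothetical identifier
    TargetDBInstanceIdentifier="example-restored-db",  # hypothetical identifier
    RestoreTime=datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
)
```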
Performance and capacity management
We target the following SLAs for LUSID API performance:
99% of all requests to return in under 2 seconds.
99.9% of all requests to return in under 5 seconds.
To achieve these targets, LUSID dynamically adjusts the number of available servers to respond to changes in client workload. If necessary, LUSID throttles incoming requests to temper surges in load, or to constrain ‘bad actors’, in order to ensure the stability of the service. This manifests as “429 - Too Many Requests” response codes being returned by our APIs.
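Client applications should treat these responses as transient. A minimal client-side sketch, assuming a generic HTTP client and honouring a Retry-After header only if one is present:

```python
import time

import requests

def get_with_backoff(url, headers=None, max_retries=5):
    """GET a URL, retrying with exponential backoff on HTTP 429."""
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 429:
            return response
        # Prefer the server's hint when a Retry-After header is present.
        retry_after = response.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    raise RuntimeError("request still throttled after retries")
```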
We validate the scalability of the platform using stress-test simulations run as part of our release cycle.
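As a sketch of how such stress-test results could be evaluated against the SLA targets above, assuming latency samples collected in seconds (the helper for loading results is hypothetical):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latencies in seconds."""
    xs = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(xs)))
    return xs[rank - 1]

def meets_sla(latencies_seconds):
    return (percentile(latencies_seconds, 99.0) < 2.0
            and percentile(latencies_seconds, 99.9) < 5.0)

# latencies = load_latencies_from_results()  # hypothetical helper reading test output
# print(meets_sla(latencies))
```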
Service levels
The table below outlines our Recovery Time and Recovery Point Objectives for various failure modes.
Recovery Time Objective (RTO) describes the time it takes to bring the service back to full operational health.
Recovery Point Objective (RPO) describes the maximum amount of data loss incurred in the case of that failure (measured in time).
| Failure mode | Impact | Recovery action | RTO | RPO |
|---|---|---|---|---|
| Container crash | Failure of any in-flight API requests. | Restart container. | Zero | Zero |
| EC2 server failure | Failure of multiple containers. | Reprovision server. | Zero | Zero |
| Network degradation | Slow and/or disrupted API requests. | Annex affected infrastructure. | Zero | Zero |
| AZ issue | Temporary loss of capacity. Potential disruption to requests. | Annex AZ. Increase capacity in other AZs. | Zero | Zero |
| AZ outage | Failure of any in-flight API requests. Temporary loss of capacity. | Increase capacity in other AZs. | Zero | Zero |
| Database failure | Service disrupted. | Failover to replica. | < 60 seconds | Zero |
| Cloud provider regional outage | Service disrupted. | Invoke DR procedure. Restore service to a different region. | < 4 hours | Seconds |