Ambra Incident

Incident Report for InteleShare

Postmortem

Root Cause:

Database load increased because of an abnormally high number of database locks, which began generating errors on some queries related to study uploads. The locks stemmed from database optimization work that hit backend limits and held locks on the system for longer than expected.

Remediation:

The following actions were performed:
• Increased the existing database cluster resources.
• Created a secondary database cluster to receive new study uploads (ingestion).
• This scaling helped immediately with newly sent studies. However, pending studies that had failed could not be processed until we removed a newly introduced configuration that was causing the increased locks. That configuration was part of the database optimization work, and removing it does not reduce the end benefit of that work.
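
The split between newly sent studies and studies already in progress can be illustrated with a minimal sketch. It is an illustration only, not Ambra's actual implementation: the class `IngestionRouter`, the cluster labels, and the study IDs are all hypothetical. It assumes that a study stays pinned to the cluster that received its first upload, which would be consistent with the updates noting that partially uploaded studies may still be attempting to connect with the old storage node.

```python
from dataclasses import dataclass, field

PRIMARY = "primary"      # hypothetical label for the existing cluster
SECONDARY = "secondary"  # hypothetical label for the new ingestion cluster


@dataclass
class IngestionRouter:
    """Hypothetical router: new studies go to the current ingestion
    cluster, studies already in progress stay where they started."""
    default_cluster: str = PRIMARY
    assignments: dict = field(default_factory=dict)  # study_id -> cluster

    def route(self, study_id: str) -> str:
        # setdefault pins a study to the cluster that saw its first upload
        return self.assignments.setdefault(study_id, self.default_cluster)

    def switch_ingestion(self, cluster: str) -> None:
        # Only changes the target for studies not seen before
        self.default_cluster = cluster


router = IngestionRouter()
router.route("study-A")             # upload begins before the switch
router.switch_ingestion(SECONDARY)  # new cluster starts receiving uploads
print(router.route("study-A"))      # primary: in-flight study stays put
print(router.route("study-B"))      # secondary: new study uses new cluster
```

Under this assumption, adding capacity for new uploads helps new studies right away, while studies that were already pending depend on whatever is still blocking the original cluster.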

Posted Sep 18, 2023 - 12:25 EDT

Resolved

The incident has been fully resolved and service is back to normal levels. Our team will conduct a root cause analysis and share it as soon as possible. We will continue to monitor the situation to ensure there are no further issues.

Posted Sep 13, 2023 - 01:00 EDT

Update

The study backlog continues to decline. Hourly updates will continue until we are back to normal processing.

Posted Sep 12, 2023 - 23:54 EDT

Update

The study backlog continues to decline, though not as quickly as previously anticipated. Hourly updates will continue until we are back to normal processing.

Posted Sep 12, 2023 - 22:52 EDT

Update

The study backlog continues to decline. Currently, we estimate approximately 1 hour to get back to normal processing.

Posted Sep 12, 2023 - 21:48 EDT

Update

Processing of the study backlog continues. Currently, we estimate 1-2 hours to get back to normal processing.

Posted Sep 12, 2023 - 20:48 EDT

Update

The study backlog is reducing quickly. We are now estimating less than 2 hours to finish processing the backlog.

Posted Sep 12, 2023 - 19:44 EDT

Update

All storage clusters are healthy, and study ingestion continues. Currently, we estimate less than 4 hours to finish processing the study backlog. Progress updates will be posted hourly.

Posted Sep 12, 2023 - 18:47 EDT

Update

All storage clusters are healthy, and study ingestion continues. We estimate 4-6 hours to process the study backlog. Progress updates will be posted hourly.

Posted Sep 12, 2023 - 17:54 EDT

Update

The new storage cluster (storelpa.dicomgrid.com) continues to perform well. During the day, the gateways developed a backlog. We do not yet have an ETA for when the gateway backlogs will be resolved.

Posted Sep 12, 2023 - 17:17 EDT

Update

The new storage cluster (storelpa.dicomgrid.com) is performing well. However, studies which were partially uploaded during these issues may still be attempting to connect with the old storage node. Our engineering team is actively monitoring the situation and exploring further potential solutions.

Posted Sep 12, 2023 - 16:40 EDT

Update

Currently, we are observing the ingestion of images on the new storage node. However, studies which were partially uploaded during these issues may still be attempting to connect with the old storage node. Our engineering team is actively monitoring the situation and exploring further potential solutions.

Posted Sep 12, 2023 - 16:11 EDT

Monitoring

The engineering team has successfully implemented the new storage node, and we are actively monitoring the status of image processing.

Posted Sep 12, 2023 - 15:51 EDT

Update

The team is making the final preparations to bring the new storage node online.

Posted Sep 12, 2023 - 15:29 EDT

Update

The team is still progressing through the necessary steps for deploying the new storage cluster at this time. The storage cluster should be running in approximately 30 minutes.

Posted Sep 12, 2023 - 14:52 EDT

Update

The team is still engaged in the deployment of the new storage cluster at this time and will provide an ETA as soon as possible.

Posted Sep 12, 2023 - 14:25 EDT

Update

Our team is actively engaged in the deployment of a new storage cluster to optimize database load distribution. Currently, we do not have a specific estimated time for completion, but we have allocated multiple dedicated resources to expedite the restoration of service.

Posted Sep 12, 2023 - 14:03 EDT

Update

At this time, the engineering team has completed an initial fix attempt, which unfortunately did not yield the desired results. They are now proceeding to implement the next potential solution and we will continue to provide updates on the progress.

Posted Sep 12, 2023 - 13:50 EDT

Update

Engineering teams are implementing and testing possible solutions to our database issue. A definitive resolution has not yet been reached, but the teams are continuing their efforts. We appreciate your patience.

Posted Sep 12, 2023 - 13:31 EDT

Update

Our engineering teams are diligently investigating potential solutions to address the image uploading issues. Once we have successfully identified a solution, we will swiftly implement it and keep you informed with regular updates.

Posted Sep 12, 2023 - 13:12 EDT

Update

At this time there are no new developments. Engineering teams are still working towards a solution.

Posted Sep 12, 2023 - 12:53 EDT

Update

Our engineering teams continue to work on a resolution to the image uploading issues we are experiencing, and we will provide further updates as they become available.

Posted Sep 12, 2023 - 12:40 EDT

Update

Our Engineering teams are continuing to work towards a resolution. We understand the urgency and we appreciate your patience as we work to address the issue.

Posted Sep 12, 2023 - 12:23 EDT

Identified

Our engineering teams have identified performance issues with our database that are causing problems with uploading via the web and gateway, and are working on remediating them. Further updates will be provided here as they become available.

Posted Sep 12, 2023 - 12:07 EDT

Investigating

We have received reports of issues on the Ambra platform. Engineering teams are currently investigating. Additional information will be provided as soon as it is available.

Posted Sep 12, 2023 - 11:52 EDT

This incident affected: Web Services, Image Processing, and Image Viewing.