Incident management – based on a true story…

This is how a developer looks when someone says
‘There is an incident’

Hello everyone, I hope you are having a good time. Today I want to share my experience from a recent incident: how it started, how it was detected, how I got involved, how we reproduced the issue on a local machine, our fight with Hibernate and tons of boilerplate code, and finally how we resolved the problem and wrote the post-incident report (post-mortem).

Incident – a problem, issue, outage, or discrepancy in your product, project, or infrastructure.

Before we begin, I would like to mention a few things. Usually, incidents are caused by complex sequences of events; if any single event had been prevented or had gone differently, the incident might never have happened. An incident report should not assign blame. Instead, it should contain a detailed timeline of events, analysis, lessons learned, and action items intended to prevent similar incidents in the future. Incidents often expose problems and gaps across the entire company.

Let’s review the stages of an incident:

  1. The incident has started but nobody has noticed it yet – this can happen if there is no proper monitoring system and clients have not noticed errors or corrupted data. It is very dangerous if this stage lasts a long time.
  2. Incident detected – someone observed undesired behavior, errors, or outages. If people in the company know what the expected behavior of the product is, and everyone feels free to speak up, then whoever spots the issue reports it immediately to the responsible team or a supervisor.
  3. Incident reported – at this point a document should be created that describes the observed issue in detail, tags the responsible and involved people, and assigns a severity to the incident. I would suggest creating a bug ticket in Jira where the issue is described and assigned to the responsible person, plus a report page in Confluence that collects all the details and every update made during the investigation (Confluence has a nice template for post-mortem reports).
  4. Incident investigation – during this phase the severity can change, and the impact caused by the incident should be roughly estimated. The incident may be escalated or reassigned to another team. Customer support should look for quick fixes or workarounds that customers can use to keep working, or should tell customers to pause using the system if there is a danger of data corruption. Unless it is a security incident, I suggest creating an open Slack channel where all discussion happens, so everyone can ask questions and get answers in the same place. This channel can later be analyzed while preparing the report and helps with reconstructing the timeline. The main goal of this phase is to identify the root cause of the incident and produce a plan for fixing it.
  5. Incident resolution – at this stage it is important to be able to reproduce the problem (if possible) in a local or dev environment. The fix must come with test coverage, and all relevant logs and metrics should be added (you should be sure that everything works as expected, and if something goes wrong, an alert should fire). I also recommend involving other people to validate the resolution plan.
  6. Incident fix delivery – deliver an image with the fix, test that everything works, ask the customers to run test data, and confirm on your end that everything works correctly.
  7. Post-incident report (post-mortem) – one of the most important parts of the incident lifecycle, because it must help prevent similar incidents in the future. When you write a post-mortem, keep a few things in mind: transparency – do not hide any detail, however small; root cause – collect links and references to all the code, documents, and meetings that directly or indirectly contributed to the incident.

What was this incident about? On one of our projects we receive and persist data in JSON and CSV formats. Around two weeks ago, some of our clients started reporting 504 Gateway Timeout errors when attempting to upload a batch of records.

The team responsible for the microservice that received those batches of data had no experienced backend developers at that moment, and even worse, many people were on summer holidays. That is basically how I got involved.

We started by checking the logs around the time the customer reported the issue but did not find any errors in that microservice. This was very strange, because the customer had sent us a response that included HTML with the error message “Gateway timeout”. If the microservice did not respond with an error (we found none in its logs), then something between the microservice and the client application, or even something on the client’s side, must have produced it.

With the help of DevOps we found a misconfiguration on the load balancer, which returned a 504 error whenever a request took longer than 30 seconds. We quickly raised the timeout to 120 seconds and tested again, only to get the same error: the request was still running even after 120 seconds.
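
The exact change depends on the proxy in front of the service; assuming an nginx-style load balancer (this is a sketch, not our actual configuration, and the upstream name is hypothetical), it amounts to raising the upstream read timeout:

```nginx
# Hypothetical reverse-proxy config. nginx answers 504 Gateway Time-out when
# the upstream does not respond within proxy_read_timeout (default 60s).
location /api/upload {
    proxy_pass http://upload-service;   # hypothetical upstream name
    proxy_read_timeout 120s;            # raised from 30s during the incident
    proxy_send_timeout 120s;
}
```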

Then we created a Spring Boot integration test that performs the same bulk insert against a PostgreSQL instance running in a test container. On a local machine (Mac M1, 32 GB) inserting 500 records took around 10 minutes. From this point on we had a reproducible issue in our hands, and within the next few hours the root cause was identified. Actually, there were several:
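
Such a reproduction test might look roughly like the sketch below. All names (`BatchUploadService`, `UploadRecord`, the fixture helper) are hypothetical, and it assumes JUnit 5, Testcontainers, and Spring Boot test support on the classpath:

```java
// Sketch of the reproduction test; service and fixture names are hypothetical.
@SpringBootTest
@Testcontainers
class BulkInsertReproductionTest {

    @Container
    @ServiceConnection // Spring Boot 3.1+; otherwise wire the DataSource properties manually
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15");

    @Autowired
    BatchUploadService uploadService; // hypothetical service under test

    @Test
    void bulkInsertOf500RecordsFinishesBeforeTheLoadBalancerTimeout() {
        List<UploadRecord> batch = TestData.records(500); // hypothetical fixture

        long start = System.nanoTime();
        uploadService.saveAll(batch);
        long elapsedSeconds = Duration.ofNanos(System.nanoTime() - start).toSeconds();

        // On the broken code this took ~10 minutes; the fix must stay under the LB timeout
        assertThat(elapsedSeconds).isLessThan(30);
        assertThat(uploadService.count()).isEqualTo(500); // rows actually landed in the DB
    }
}
```

Once a test like this exists, every performance experiment can be measured against the same reproducible baseline.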

  • There was a materialized view whose refresh was triggered by every insert/update on two big tables. Materialized views are always risky, so think twice before introducing one into your system.
  • The uploaded data, after parsing, had a complex structure: parent objects with child collections, nested three layers deep from the parent down to the deepest child collection.
  • The part that prepared data for persistence used Hibernate very inefficiently. Modern courses and tutorials played a bad game with us: they create the false impression that Hibernate suits any kind of application and that all complexity will be resolved by Hibernate magic.
  • Wrong transaction management – only now did I realize how many developers do not really understand what an isolation level is, what propagation is, and how dangerous it is to mess with them.
  • Poor test coverage – there actually was a test that checked a bulk insert of 50k records, but ALL OF THE ENTITIES HAD THE SAME ID, so Hibernate squashed them into a single record. Moreover, that test did not contain a single line checking how many records ended up in the DB; it only asserted that the insertion took less than 30 seconds.
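
The fixture bug is easy to demonstrate even without a database. A minimal, self-contained illustration (entity and helper names are mine, not from the project):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// 50k entities sharing one ID collapse to a single row, just as the persistence
// context keys entities by identifier. Names here are hypothetical.
class SameIdBug {

    record Entity(long id, String payload) {}

    static List<Entity> brokenFixture(int n) {
        // BUG from the original test: every entity gets ID 1
        return IntStream.range(0, n).mapToObj(i -> new Entity(1L, "row-" + i)).toList();
    }

    static List<Entity> fixedFixture(int n) {
        // Fixed fixture: each entity gets a distinct ID
        return IntStream.range(0, n).mapToObj(i -> new Entity(i + 1, "row-" + i)).toList();
    }

    // Keying by ID mimics what the database ends up storing
    static int distinctRows(List<Entity> entities) {
        return entities.stream()
                .collect(Collectors.toMap(Entity::id, Function.identity(), (a, b) -> b))
                .size();
    }
}
```

The lesson: the fixed test must assert the actual row count in the DB, not only the elapsed time.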

For two days we tried to fix the performance by tuning Hibernate, playing with different parameters, adjusting transactions, and so on. It did not bring any good results, so I decided to rewrite the bulk inserts manually, without Hibernate’s involvement.

After we got rid of Hibernate in the bulk insert, the time required to insert 1,000 records (with 15 children each) dropped from 500 seconds to 16, of which 12 seconds were spent on refreshing the materialized view. Not a bad result so far, but the next step was to improve bulk updates as well.
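
The hand-rolled version boils down to chunking the batch and issuing multi-row INSERT statements over plain JDBC. A simplified sketch with hypothetical table and column names:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a hand-rolled bulk insert: one multi-row INSERT per chunk
// instead of one Hibernate persist per entity. Table/column names are hypothetical.
class BulkInsertSql {

    // Build "INSERT INTO t (a, b) VALUES (?, ?), (?, ?), ..." for `rows` rows
    static String multiRowInsert(String table, List<String> columns, int rows) {
        String placeholders = "(" + "?, ".repeat(columns.size() - 1) + "?)";
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table)
                .append(" (").append(String.join(", ", columns)).append(") VALUES ");
        for (int i = 0; i < rows; i++) {
            sql.append(i == 0 ? "" : ", ").append(placeholders);
        }
        return sql.toString();
    }

    // Split a batch into fixed-size chunks so each statement stays a manageable size
    static <T> List<List<T>> chunks(List<T> items, int size) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            result.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return result;
    }
}
```

In the real code, each chunk’s values are bound to a `PreparedStatement` and executed inside a single transaction; with the PostgreSQL JDBC driver, the `reWriteBatchedInserts=true` connection option achieves a similar effect for classic `addBatch()`/`executeBatch()` calls.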

With updates, the situation was even more interesting, because we had to record an audit history for every changed field. During analysis and debugging we found that the code made extra requests to the DB to fetch all the children. This time I decided to keep Hibernate and manage the entity states. The steps that let us improve the speed of updates were:

  • Use @EntityGraph to reduce the number of queries and fetch all the data at once
  • Use the previously fetched data as the baseline for the audit-log comparison
  • Enable batched updates in the Hibernate configuration
  • Use TransactionTemplate for explicit transaction management
  • Disable the materialized view refresh before the update and re-enable it only after the transaction has committed

Those steps brought bulk update timing close to that of bulk inserts. Even though the solution is far from perfect and fragile under concurrent inserts and updates (due to the nature of the materialized view), it resolved the incident, and our clients stopped hitting timeout errors. We now have about a month on our hands to review the current architecture and, probably, implement a proper solution.
