Wrap-up regarding yesterday's outage of the ioki DRT platform
The stability and reliability of our DRT platform are extremely important to us, and we have shown a good track record over the last five years. But even with the best technology and processes in place, you are not safe from external events affecting you.
At ioki, we value transparency, and we want to keep you in the loop about what happened yesterday and how we reacted to it.
Timeline
Yesterday evening there was a problem in our data center, where multiple critical network nodes started to fail.
This also caused the ioki DRT platform backnet to become unstable.
We were not down yet, but recognized network issues around 17:50. We informed our datacenter about these problems and started to prepare for more problems that might arise.
Around 18:30 the network mostly broke down, and connections to external services like HERE or Firebase were no longer possible. We investigated the problem and worked closely with the datacenter to resolve the issue and restore the production environment with the highest priority.
At 18:50 we posted a statuspage message and informed the passenger apps as well as the statuspage subscribers.
For about 50 minutes (18:30-19:20), most requests resulted in errors and matching was not possible.
At 19:27, after verifying that the system was operating normally again, we resolved the incident on statuspage as well.
Final Thoughts
We rely on the infrastructure of our datacenter. So if an incident like this happens, there is only so much we can do to prevent it at short notice.
However, just like in this case, a system is often neither 100% up nor 100% down, but somewhere in between, and this can be caused by many different factors.
So in order to react quickly and to identify and mitigate problems as efficiently as possible, you need good monitoring and logging systems to be able to judge the situation properly.
This worked very well, and we were able to identify and react to the problem very fast.
Thanks again to our SRE team for the fast reaction and for pulling out all the stops to get us back online as soon as possible, and to the backend team for the fast support in identifying the problems and verifying the mitigations!
Thank you for your patience and your understanding!
Let me know if you have any questions!
Regards,
Christian Bäuerlein
CTO @ ioki