Sunday, December 22

“Sorry, We Messed Up,” Says Myntra

India’s leading fashion website Myntra apologised in an official company blog post on Friday for a major technical glitch. On Thursday, Myntra unintentionally sent a large number of notifications to its users. The fault was caused by an update to its mobile notification system.

Myntra realised the mistake, published a blog post explaining the glitch and apologised for the mess-up.

On Thursday, May 19, at around 2:00 pm, Myntra’s notifications team updated its fleet of notification servers with a code change. It took about 3 minutes for the deployment systems to roll the change out to all notification servers.

“Within minutes of the code-push, our users, including several employees, started reporting that they were getting bombarded with notifications unrelated to their interactions with Myntra. The team immediately stopped the notification systems and started to troubleshoot. However, within that short period, a lot of notifications had already been sent. We did cancel a lot of the notifications that were en route, but unfortunately by then a lot of our customers had already received them,” Myntra CTO Shamik Sharma explained in the blog post.

 

 

How It Happened

Notification systems require a set of “transformations” to a message before it is sent, for example adding the recipient’s name inside the message or adding the list of users to whom the message should be sent. Each of these transformations is done by a set of processes which do their part and then put the message back in a queue for the next set of processes to take up. The new code had a “schema change”: the list of recipients was now expected to be in a new field called “userId” rather than “recipients”.
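The queue-based flow described above can be pictured with a minimal sketch. It assumes a single in-memory queue and two hypothetical stages (`add_recipients` and `personalize`); none of these names come from Myntra’s post, and the real system presumably runs each stage as separate processes on separate servers.

```python
from collections import deque

def add_recipients(message):
    # Hypothetical stage: attach the recipient list under the old field name.
    return {**message, "recipients": ["user-1", "user-2"]}

def personalize(message):
    # Hypothetical stage: fill the recipient's name into the message text.
    return {**message, "text": message["template"].format(name=message["name"])}

# Order in which the stages pick the message up from the queue.
STAGES = [add_recipients, personalize]

def run_pipeline(message):
    # Each queue entry is (index of the next stage to run, message so far).
    queue = deque([(0, message)])
    while queue:
        stage_index, current = queue.popleft()
        if stage_index == len(STAGES):
            return current  # every stage has run; the message is ready to send
        transformed = STAGES[stage_index](current)
        # Put the message back on the queue for the next stage to take up.
        queue.append((stage_index + 1, transformed))

result = run_pipeline({"template": "Hi {name}, your order has shipped", "name": "Asha"})
```

Because every stage reads fields written by an earlier stage, all stages have to agree on the field names, which is why renaming a field while messages are still in the queue is risky.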

When we deployed our new code, there was a short period (2 min 37 sec) when the new code was active while notifications created by the older code were still being processed. This led to a “race condition” – the old code had already added the recipient in the old field (“recipients”) while the new code was expecting it in the new field (“userId”) and upon not finding a userId, left it blank. Defensive code should have been written for this case, but was missed.
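As an illustration of the kind of defensive code the post says was missed, here is a minimal sketch. It assumes messages are dictionaries and that both fields hold a list of user IDs; the helper name `resolve_recipients` is hypothetical and is not taken from Myntra’s code.

```python
def resolve_recipients(message):
    """Return the recipient list from either schema, or raise if none is present."""
    # Prefer the new field, but fall back to the old one so that messages
    # created by the older code are still handled during a rollout.
    recipients = message.get("userId") or message.get("recipients")
    if not recipients:
        # Refuse to continue instead of passing a blank recipient list downstream.
        raise ValueError("message has no recipients; refusing to send")
    return list(recipients)

old_style = resolve_recipients({"recipients": ["user-1"]})  # written by the old code
new_style = resolve_recipients({"userId": ["user-2"]})      # written by the new code
```

The design choice here is to fail loudly: a message with neither field raises an error and can be set aside for inspection, rather than moving on to the next stage with an empty recipient field.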