A Close Call with Real-Time: How Rethinking Pub-Sub Saved the Day
Sometimes the easy way is not the most obvious way. And this was one of those times.
All of this started while I was working on our new feature – Automations 🤖. In a nutshell, Automations allow customers to set up actions triggered under specific conditions. Some of the currently implemented actions are sending Slack messages, creating and updating tasks, and posting comments on objects.
I wanted the actions that modify tasks and comments to send a message over our real-time system so that our frontend clients (browser, mobile, desktop app) could pick it up and show those changes as they occur.
Currently, our app’s real-time updates are tied to POST/PATCH/DELETE requests. We have a controller extension, Extensions::Broadcastable, that hooks into the save_form and destroy_resource methods and sends a real-time event if the action was successful. However, this approach wasn’t suitable for my automation actions, as they don’t go through controllers.
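To make the mechanism concrete, here is a minimal sketch of how such a controller extension could work in plain Ruby. Only the names Extensions::Broadcastable, save_form and destroy_resource come from the text above; FakeBroadcaster, BaseController, TasksController and the method bodies are assumptions for illustration, not the actual implementation.

```ruby
# Hypothetical stand-in for whatever class actually pushes messages to clients.
class FakeBroadcaster
  def self.sent
    @sent ||= []
  end

  def self.broadcast(event, resource)
    sent << [event, resource]
  end
end

module Extensions
  # Sketch only: wraps save_form and destroy_resource and broadcasts
  # when the wrapped call reports success.
  module Broadcastable
    def save_form(form)
      super.tap { |ok| FakeBroadcaster.broadcast(:saved, form) if ok }
    end

    def destroy_resource(resource)
      super.tap { |ok| FakeBroadcaster.broadcast(:destroyed, resource) if ok }
    end
  end
end

# Hypothetical controller base; the real methods would persist data.
class BaseController
  def save_form(form)
    form.fetch(:valid, false)
  end

  def destroy_resource(_resource)
    true
  end
end

class TasksController < BaseController
  prepend Extensions::Broadcastable
end

controller = TasksController.new
controller.save_form({ name: "Write post", valid: true }) # broadcasts
controller.save_form({ name: "", valid: false })          # does not broadcast
```

Anything that changes data without calling these controller methods never reaches the broadcaster, which is exactly the gap the automation actions fall into.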
As I was digging into this topic, I somehow changed the scope of my task from making the automation actions “live” to revamping our whole broadcasting architecture. I wanted to move that logic out of controllers to a place where I could catch all the changes – which would also catch the automation actions.
Exploring Different Approaches
1. Callbacks
Yeah… although we all know callbacks are a hassle, I couldn’t NOT think about them, just for a moment…
2. Forms instead of Controllers
Both the controller actions and my automation actions use the same forms to handle our data. So, why wouldn’t I hook on forms and send my real-time messages from there?
This approach was bugging me a bit as we don’t use Form objects in all the places we are actually changing our data. So this wouldn’t make the whole app feel “live” but I would cover more places than what we have currently. That sounded promising to me.
I wanted to pitch my thoughts to the rest of the Core team, and that discussion turned on a lightbulb above my head.
3. Embracing Pub/Sub
That was exactly what I was looking for.
Events are published for every change, except in the places where we use methods that skip callbacks (update_column, update_all, …), since publishing changes through Pub/Sub is essentially hooked to callbacks – but that’s a story for another day.
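As a rough illustration of why callback-skipping writes stay invisible, here is a tiny in-process pub/sub in plain Ruby. EventBus, the toy Task class and the "task.updated" topic are hypothetical names; the real Pub/Sub setup is not shown in this post.

```ruby
# Minimal in-process pub/sub; all names here are hypothetical.
module EventBus
  @subscribers = Hash.new { |hash, key| hash[key] = [] }

  def self.subscribe(topic, &handler)
    @subscribers[topic] << handler
  end

  def self.publish(topic, payload)
    @subscribers[topic].each { |handler| handler.call(payload) }
  end
end

# Toy model: `update` runs the after-save hook, `update_column` bypasses it,
# mirroring how callback-skipping writes never reach the bus.
class Task
  attr_reader :attributes

  def initialize(attributes)
    @attributes = attributes
  end

  def update(changes)
    @attributes = @attributes.merge(changes)
    after_save(changes)
    true
  end

  def update_column(name, value)
    @attributes = @attributes.merge(name => value)
    true
  end

  private

  def after_save(changes)
    EventBus.publish("task.updated", { task: self, changes: changes })
  end
end

RECEIVED_CHANGES = []
EventBus.subscribe("task.updated") { |payload| RECEIVED_CHANGES << payload[:changes] }

task = Task.new({ title: "Draft" })
task.update({ title: "Final" })      # published to subscribers
task.update_column(:title, "Silent") # changes data, publishes nothing
```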
Making a PoC
As things go with every big change in our codebase, and generally in our product, I was putting this code behind a Feature Flag (FF).
Put simply, with the pubSubBroadcasting FF enabled, I would skip sending the real-time events from the controller actions and handle the published events instead. Without that FF, nothing would change.
I made a few Subscribers that would listen for all the task and comment related changes and simply handle them by sending a real-time event.
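The switch between the two paths can be sketched roughly like this. Only the flag name pubSubBroadcasting comes from the text; FeatureFlags, Realtime, the organization ids and both helper methods are hypothetical, and the sketch assumes the flag is checked per organization.

```ruby
# Hypothetical feature-flag lookup: organization ids with the flag enabled.
module FeatureFlags
  ENABLED = { "pubSubBroadcasting" => [42] }.freeze

  def self.enabled?(flag, organization_id)
    ENABLED.fetch(flag, []).include?(organization_id)
  end
end

# Hypothetical sink that records which code path sent each real-time event.
module Realtime
  def self.log
    @log ||= []
  end

  def self.send_event(source, organization_id, event)
    log << [source, organization_id, event]
  end
end

# Controller path: broadcasts only while the flag is off.
def broadcast_from_controller(organization_id, event)
  return if FeatureFlags.enabled?("pubSubBroadcasting", organization_id)

  Realtime.send_event(:controller, organization_id, event)
end

# Subscriber path: broadcasts only while the flag is on.
def handle_published_event(organization_id, event)
  return unless FeatureFlags.enabled?("pubSubBroadcasting", organization_id)

  Realtime.send_event(:subscriber, organization_id, event)
end

[7, 42].each do |organization_id|
  broadcast_from_controller(organization_id, "task.updated")
  handle_published_event(organization_id, "task.updated")
end
```

Because the two guards are mirror images, each change produces one real-time event per organization regardless of the flag state, never two.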
And of course, I added RSpec tests and set up some widgets on New Relic to track the difference in the number of events we send, as we knew the new approach would send more of them.
Basically that was it. The next step was to slowly propagate this FF over all organizations, check our New Relic metrics and see if anything breaks. Once we release the change to the entirety of our user base, we should cover the remaining objects and make subscribers for their Pub/Sub events too.
Showing it off
As I was pretty proud of this solution, I kinda talked about it a lot and naturally it popped up in a 1-on-1 meeting with my Engineering Manager. He wanted to discuss it a bit more so I told him the same thing I wrote here: “Isn’t that great? Our APP will be LIVE, all the data would be in sync.”
His response was “But do we really want that?”.
That wasn’t the response that I was looking for…
What I didn’t know was that our frontend client, once it gets socket messages, has to make additional API calls, depending on the screen the user is on, so that all the required data can be fetched again. So, a great deal of my real-time events would actually end up generating additional requests to our server and, in a way, we would just be generating a lot more traffic (self DDoSing?). As I didn’t know about this, I wasn’t even paying attention to those metrics along the way.
One step forward, two steps back
We decided to get better data so that we could make a better call for this situation.
I took a time frame of one week from our logs and compared the number of POST/PATCH/DELETE requests with the number of dispatched publish events. In the end, these would roughly match the number of events we send over the socket in the current way and in the new way, respectively.
I wasn’t really aware of how badly this could end. I was making a PoC out of this for tasks and comments and you can see here that there wasn’t such a big difference. 25% more events was okay, I knew it would be more.
But look at our deals endpoint for example – 55 times more events would be sent. That would add up in the traffic we send over sockets, in our infrastructure bill for those services, and I don’t want to imagine the number of API calls generated as a result of this – by ourselves…
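The comparison itself is simple arithmetic: published events divided by write requests, per endpoint. The sketch below uses made-up counts, chosen only so that the resulting multipliers match the 25% and 55 times figures quoted above (reading “55 times more” as a multiplier of 55); they are not the numbers from our logs, and event_multiplier is a hypothetical helper.

```ruby
# Illustrative counts only; NOT the logged figures. They are picked so the
# multipliers reproduce the ratios quoted in the text.
WEEKLY_COUNTS = {
  "tasks_and_comments" => { requests: 1_000, published: 1_250 },
  "deals"              => { requests: 1_000, published: 55_000 }
}.freeze

# How many socket events the Pub/Sub approach would send for every event
# sent by the controller-based approach.
def event_multiplier(requests:, published:)
  published.to_f / requests
end

WEEKLY_COUNTS.each do |endpoint, counts|
  puts format("%-20s %.2fx", endpoint, event_multiplier(**counts))
end
```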
“Deals have that much traffic because a lot of other objects update financial data on deals so that was understandable too, it immediately came to us…”
Back to the drawing board
1. Pub/Sub shouldn’t be a bad call for this
As this is the part of our code that sees all the changes in our data, it should be a good fit when we want to make a frontend client as live as it gets. The solution would be to not send all the changes over sockets but to filter them into relevant and not relevant ones. This way we would surely see a drop in those insane company and deal numbers.
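One possible shape for such a filter, sketched under the assumption that “relevant” can be decided from the list of changed attributes. The topics, the attribute names and relevant_change? are all hypothetical; what actually counts as relevant is still an open question.

```ruby
# Hypothetical allow-list of attributes worth a real-time message, per topic.
RELEVANT_ATTRIBUTES = {
  "deal.updated" => %w[name status owner_id],
  "task.updated" => %w[title done assignee_id]
}.freeze

# True when at least one changed attribute is on the topic's allow-list.
# Unknown topics are treated as not relevant.
def relevant_change?(topic, changed_attributes)
  allowed = RELEVANT_ATTRIBUTES.fetch(topic, [])
  (changed_attributes & allowed).any?
end

relevant_change?("deal.updated", %w[status])         # => true
relevant_change?("deal.updated", %w[cached_revenue]) # => false
```

A subscriber would call this before sending anything over the socket, so derived updates (like other objects touching financial data on a deal) could be dropped early.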
2. Send all the needed data to front?
As mentioned before, each real-time message already contains the object that was changed. The issue here is that there are a lot of screens in our client and we can’t know all the contexts our users are in and what additional data should be sent – which is exactly why our client makes API calls when receiving some socket messages.
3. Why didn’t I just resolve the problem I had?
To make those actions “live” in our frontend clients (browser, mobile, desktop app), I wanted to plug them into our real-time system. So, why didn’t I just add a few explicit calls to the code of my automation actions, where I would just call the class that sends the real-time message? But no, I wanted to act smart and fix a problem that wasn’t really there in the first place – to make everything more live while no one was asking for it.
The Aftermath
So yeah, the Pub/Sub usage in our real-time part of the app is on hold until we leverage things up.
I went on with the 3rd solution and added 5 lines of code – a call to the Broadcaster class is one line and I have 5 events over 4 classes to send…
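For a sense of what “one line per event” looks like, here is a minimal sketch. The text only says that a Broadcaster class exists; its interface, the UpdateTaskAction class and the "task.updated" event name below are assumptions for illustration.

```ruby
# Hypothetical Broadcaster; the real class and its signature are not shown
# in this post. This one just records what it was asked to send.
class Broadcaster
  def self.sent
    @sent ||= []
  end

  def self.call(event, record)
    sent << { event: event, record: record }
  end
end

# Hypothetical automation action that updates a task (a plain Hash here).
class UpdateTaskAction
  def initialize(task)
    @task = task
  end

  def call(changes)
    @task.merge!(changes)
    Broadcaster.call("task.updated", @task) # the one explicit line
    @task
  end
end

UpdateTaskAction.new({ id: 1, title: "Draft" }).call({ title: "Final" })
```

The trade-off is that every new action has to remember its own call, but there is no hidden traffic: an event goes out only where someone deliberately asked for it.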
This was a nice learning opportunity for me and I would say that:
- When having concrete problems, stick to handling them and resolve them first
- I had the data the whole time in New Relic, I should’ve prepared better
- It’s not bad to be explicit in code, not everything should be an abstraction, generalization, metaprogramming, … Hope to write on this point a bit more soon
The good thing is that we didn’t actually do any damage with this and we didn’t lose a lot of time during this “learning opportunity”.
Anyone faced similar problems? If yes, how did you deal with them? Feel free to reach out to me!