Monitoring any system effectively is difficult. Ideally, an engineer should be able to quickly see how well the system is functioning and whether things look normal, and be notified when they do not. A graph can convey much of this information at a glance. However, graphs can be difficult to create.
In the early days, we leveraged parts of the billing infrastructure to aggregate API call counts, error counts, and some timing metrics. We also recorded server metrics (CPU, memory, etc.) in our main application database. From there we used Google Charts plugins to visualize the data.
Things were… okay. As the system grew and we wanted more detail about system performance, we found that every additional metric took a good deal of plumbing and database setup just to add a small chart. We had to ensure that the data was truncated or rotated to minimize strain on the database. A chart might be created with a 3-hour window, but if you wanted to zoom in or see what happened last night, our chart pages loaded incredibly slowly. Additionally, sending a notification or alert was part of the application code, which made it time-consuming to change a threshold or pause alerting.
So we switched to Grafana, a free, open-source visualization and monitoring tool with plugins for various data sources. You can save customized dashboards and export them to different environments (e.g. production and test), change time windows and aggregation functions, and configure alerts sent to Slack, PagerDuty, email, and more. We use Kubernetes for infrastructure, so it was simple to set up Grafana. We went with Prometheus for the time-series collection and slowly started instrumenting everything. This greatly enhanced our ability to watch system performance and to add metrics and alerts for key components of the system.
Most importantly, it immediately helped us answer a big question we had received from customers: “how much overhead is there on starting an algorithm?” We could replace the aggregation system we had built with a few lines of code that emit metrics, and we had several options depending on whether we wanted to view percentiles such as p90. It turned out our overhead was 20 to 30 ms on API calls, with ~15 ms between receiving an API request and actually sending it to a machine to run.
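To make the p90 idea concrete, here is a minimal, standard-library-only sketch of estimating a percentile from bucketed latency counts. It is written in the spirit of how Prometheus histograms are queried (counts per bucket, linear interpolation inside the bucket that contains the target rank), but it is not our production code and not the Prometheus implementation. The bucket bounds, the `LatencyHistogram` class, and the sample values are all hypothetical.

```python
import bisect

# Hypothetical bucket upper bounds in milliseconds (an assumption, not from our system).
BUCKET_BOUNDS_MS = [5, 10, 20, 30, 50, 100]


class LatencyHistogram:
    """Counts observations per bucket and estimates quantiles from the counts."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One counter per finite bucket, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, value_ms):
        # bisect_left puts a value equal to a bound into that bound's bucket
        # ("less than or equal" semantics).
        idx = bisect.bisect_left(self.bounds, value_ms)
        self.counts[idx] += 1
        self.total += 1

    def quantile(self, q):
        """Estimate the q-quantile (0 <= q <= 1) by interpolating within a bucket."""
        if self.total == 0:
            return None
        rank = q * self.total
        cumulative = 0
        lower = 0.0
        for upper, count in zip(self.bounds, self.counts):
            if count and cumulative + count >= rank:
                return lower + (upper - lower) * (rank - cumulative) / count
            cumulative += count
            lower = upper
        # The rank falls in the overflow bucket: report the highest finite bound.
        return float(self.bounds[-1])


# Hypothetical per-request overhead samples in milliseconds.
hist = LatencyHistogram(BUCKET_BOUNDS_MS)
for sample_ms in [3, 8, 12, 15, 18, 22, 25, 27, 28, 45]:
    hist.observe(sample_ms)
```

With these samples, `hist.quantile(0.9)` lands at the top of the 20 to 30 ms bucket. The estimate is only as fine-grained as the buckets, which is the usual trade-off when choosing bucket bounds.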
That seemed like quite a bit, so we wondered: could we cut that down? To answer that, we first created detailed graphs timing every major step of the API serving pipeline.
The graphs made it clear that a few sections of code dominated the total overhead for API calls, which made them prime targets for optimization. In a day we were able to optimize the API processing and scheduling pipeline so that a request would be sent to a machine within 4 ms, down from ~15 ms: a 73% reduction!
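As a rough illustration of timing each step, the sketch below wraps pipeline stages in a context manager that records how long each one takes. It uses only the Python standard library and keeps the durations in memory; in a real setup the recorded value would be handed to a metrics client instead. The step names and the `handle_request` function are hypothetical stand-ins, not our actual pipeline.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Durations in milliseconds, keyed by step name (in-memory stand-in for a metrics client).
step_durations_ms = defaultdict(list)


@contextmanager
def timed_step(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the step raises, so failures still show up in timings.
        step_durations_ms[name].append((time.perf_counter() - start) * 1000.0)


def handle_request(payload):
    """Hypothetical request handler with three placeholder steps."""
    with timed_step("authenticate"):
        user = payload.get("user", "anonymous")
    with timed_step("rate_limit"):
        allowed = True  # placeholder for a real rate-limit check
    with timed_step("schedule"):
        target = "worker-1"  # placeholder for choosing a machine
    return {"user": user, "allowed": allowed, "target": target}
```

Calling `handle_request({"user": "alice"})` appends one duration per step to `step_durations_ms`, which is the kind of per-step data a timing graph can be built on.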
The reduction came from fixing a few major bottlenecks:
- Added caching to a database query (many are cached already, but we had missed one)
- Made rate-limiting slightly less precise (we’d rather let someone go a tiny bit above our threshold than add a few ms to processing)
- Moved some cleanup actions to background threads
If all that wasn’t enough reason to love Grafana, it also helped us debug a curious issue. A teammate was making algorithm calls to our test system but never seemed to get results. Additionally, after a few seconds his calls started getting rate limited, even though he was only making a single call at a time. Our log files showed that sessions were being created for his user, and then nothing. As a result, the system thought he had many active sessions. I couldn’t reproduce it and was struggling to think of what to do when I pulled up the API pipeline timing graphs and changed the view from the actual timing to just a counter of occurrences.
One glance at that graph made it clear where in the processing pipeline requests were dropping off, which pointed me to a single function that was failing to delete user sessions in a certain case. Bug fixed!
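The same reasoning can be sketched in code: given a count of how many requests reached each pipeline stage, find the pair of consecutive stages with the largest fall-off. The stage names and counts below are made up for illustration and are not the real values from our dashboards.

```python
from collections import Counter

# Hypothetical pipeline stages, in the order a request passes through them.
PIPELINE_STAGES = [
    "request_received",
    "session_created",
    "scheduled",
    "result_returned",
    "session_deleted",
]


def largest_drop(stage_counts, stages):
    """Return (previous_stage, stage, lost) for the biggest fall between consecutive stages."""
    worst = None
    for previous, current in zip(stages, stages[1:]):
        lost = stage_counts[previous] - stage_counts[current]
        if worst is None or lost > worst[2]:
            worst = (previous, current, lost)
    return worst


# Made-up occurrence counts: most requests never get past session creation.
stage_counts = Counter(
    {
        "request_received": 12,
        "session_created": 12,
        "scheduled": 3,
        "result_returned": 3,
        "session_deleted": 3,
    }
)

drop = largest_drop(stage_counts, PIPELINE_STAGES)
```

Here `drop` names the transition from `session_created` to `scheduled`, which is the sort of pointer that narrows a search to the code running between two stages.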
Moving to Prometheus and Grafana gave us flexibility and power in our monitoring. It freed us from spending time developing charting and alerting features. We now have near-real-time data on the performance of the platform, and have been able to debug performance and system issues. We continue to optimize and thrive with many thanks to Grafana and Prometheus!