All posts by Patrick McQuighan

Reducing API Overhead by 70% with Prometheus and Grafana

Effectively monitoring any system is difficult. Ideally an engineer should be able to quickly get an idea of how well the system is functioning, whether things look normal, and be notified when they are not. A graph can convey much of this information with minimal thought processing for an engineer. However, they can be difficult to create.

In early days, we leveraged parts of the billing infrastructure to aggregate API call counts, error counts, and some timing metrics. We also would record server metrics (CPU, memory, etc) in our main application database. From there we used google charts plugins to visualize the data.

BlogPostOldTiming.png

Things were… okay. As the system grew, and we wanted to know more details about system performance, we found that every additional metric took a good deal of plumbing and database setup in order to add a small chart. We’d have to ensure that the data would get truncated or rotated in order to minimize strain on the database. A chart might be created with a 3 hour window, but you might want to zoom in or see what happened last night, and our charts pages would load incredibly slowly. Additionally, sending a notification or alert was a part of application code, which meant it was time-consuming to change a threshold or pause alerting. Read More…