Monitoring the ForgeRock Identity Platform 6.0 using Prometheus and Grafana

All products within the ForgeRock Identity Platform 6.0 release include native support for monitoring product metrics using Prometheus and visualising this information using Grafana.  To illustrate how these product metrics can be used, alongside each products' binary download on Backstage, you can also find a zip file containing Monitoring Dashboard Samples.  Each download includes a README file which explains how to setup these dashboards.  In this post, we'll review the AM 6.0.0 Overview dashboard included within the AM Monitoring Dashboard Samples.

As you might expect from the name, the AM 6.0.0 Overview dashboard is a high level dashboard which gives a cluster-wide summary of the load and average response time for key areas of AM.  For illustration purposes, I've used Prometheus to monitor three AM instances running on my laptop and exported an interactive snapshot of the AM 6.0.0 Overview dashboard.  The screenshots which follow have all been taken from that snapshot.

Variables

At the top of the dashboard, there are two dropdowns:

AM 6.0.0 Overview dashboard variables

  • Instance - Use this to select which AM instances within your cluster you'd like to monitor.  The instances shown within this dropdown are automatically derived from the metric data you have stored within Prometheus.
  • Aggregation Window - Use this to select a period of time over which you'd like to take a count of events shown within the dashboard. For example, how many sessions were started in the last minute, hour, day, etc.
Note. You'll need to be up and running with your own Grafana instance in order to use these dropdowns as they can't be updated within the interactive snapshot.

Authentications

AM 6.0.0 Overview dashboard Authentications section

The top row of the Authentications section shows a count of authentications which have started, succeeded, failed, or timed-out across all AM instances selected in the "Instance" variable drop-down over the selected "Aggregation Window".  The screenshot above shows that a total of 95 authentications were started over the past minute across all three of my AM instances.

The second row of the Authentications section shows the per second rate of authentications starting, succeeding, failing, or timing-out by AM instance.  This set of line graphs can be used to see how behaviour is changing over time and if any AM instance is taking more or less of the load.

All of the presented Authentications metrics are of the summary type.

Sessions

AM 6.0.0 Overview dashboard Sessions section

As with the Authentications section, the top row of the Sessions section shows cluster-wide aggregations while the second row contains a set of line graphs.

In the Sessions section, the metrics being presented are session creation, session termination, and average session lifetime.  Unlike the other metrics presented in the Authentications and Sessions sections, the session lifetime metric is of the timer type.

In both panels, Prometheus is calculating the average session lifetime by dividing am_session_lifetime_seconds_total by am_session_lifetime_count.  Because this calculation is happening within Prometheus rather than within each product instance, we have control over what period of time the average is calculated, we can choose to include or exclude certain session types by filtering on the session_type tag values, and we can calculate the cluster-wide average.

When working with any timer metric, it's also possible to present the 50th, 75th, 95th, 98th, 99th, or 99.9th percentiles.  These are calculated from within the monitored product using an exponential decay so that they are representative of roughly the last five minutes of data [1].  Because percentiles are calculated from within the product, this does mean that it's not possible to combine the results of multiple similar metrics or to aggregate across the whole cluster.

CTS

AM 6.0.0 Overview dashboard CTS section

The CTS is AM's storage service for tokens and other data objects.  When a token needs to be operated upon, AM describes the desired operation as a Task object and enqueues this to be handled by an available worker thread when a connection to DS is available.

In the CTS section, we can observe...
  • Task Throughput - how many CTS operations are being requested
  • Tasks Waiting in Queues - how many operations are waiting for an available worker thread and connection to DS
  • Task Queueing Time - the time each operation spends waiting for a worker thread and connection
  • Task Service Time - the time spent performing the requested operation against DS
If you'd like to dig deeper into the CTS behaviour, you can use the AM 6.0.0 CTS dashboard to observe any of the recorded percentiles by operation type and token type.  You can also use the AM 6.0.0 CTS Token Reaper dashboard to monitor AM's removal of expired CTS tokens.  Both of these dashboards are also included in the Monitoring Dashboard Samples zip file.

OAuth2

AM 6.0.0 Overview dashboard OAuth2 section

In the OAuth2 section, we can monitor OAuth2 grants completed or revoked, and tokens issued or revoked.  As OAuth2 refresh tokens can be long-lived and require state to be stored in CTS (to allow the resource owner to revoke consent) tracking grant issuance vs revocation can be helpful to aid with CTS capacity planning.

The dashboard currently shows a line per grant type.  You may prefer to use Prometheus' sum function to show the count of all OAuth2 grants.  You can see examples of the sum function used in the Authentications section.

Note. you may also prefer to filter the grant_type tag to exclude refresh.

Policy / Authorization

AM 6.0.0 Overview dashboard Policy / Authorizations section

The Policy section shows a count of policy evaluations requested and the average time it took to perform these evaluations.  As you can see from the screenshots, policy evaluation completed in next to no time on my AM instances as I have no policies defined.

JVM

AM 6.0.0 Overview dashboard JVM section

The JVM section shows how JVM metrics forwarded by AM to Prometheus can be used to track memory usage and garbage collections.  As can be seen in the screenshot, the JVM section is repeated per AM instance.  This is a nice feature of Grafana.

Note. the set of garbage collection metric available is dependent on the selected GC algorithm.

Passing thoughts

While I've focused heavily on the AM dashboards within this post, all products across the ForgeRock Identitty Platform 6.0 release include the same support for Prometheus and Grafana and you can find sample dashboards for each.  I encourage you to take a look at the sample dashboards for DS, IG and IDM.

References

  1. DropWizard Metrics Exponentially Decaying Reservoirs, https://metrics.dropwizard.io/4.0.0/manual/core.html

Comments

Popular Posts