Early Anomaly Detection
Introduction
Over the last few months, the InsightFinder team has been working closely with a large financial firm to build an environment that demonstrates the strengths of the InsightFinder Unified Intelligence Engine within the firm's technology ecosystem. During this time, we have found a few clear examples of how our product can reduce customer impact and the internal technical effort required to identify and resolve issues.
Our system currently receives metric, log, and incident data from several sources. In the examples below, we focus on Elasticsearch and TradeHub.
Scenario
An issue occurred in which an LDAP server had been taken out of service, but Elasticsearch still held a cached connection to it. Requests that went through this stale connection returned a "401 Unauthorized" error, which manifested as users being unable to log in.
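To make the failure mode concrete, the sketch below shows a naive connection cache that never invalidates its entries, so a handle to a decommissioned server keeps being reused until the cache is cleared. This is an illustration only, not Elasticsearch's actual LDAP realm implementation; the class and method names are hypothetical.

```python
# Illustrative sketch of the failure mode, not Elasticsearch's LDAP code.
class LdapConnectionCache:
    def __init__(self, connect_fn):
        self._connect_fn = connect_fn  # function that opens a connection to a host
        self._cache = {}

    def get(self, host):
        # The bug pattern: once a connection is cached, it is returned forever.
        # Nothing checks whether the server behind it is still in service, so
        # authentication attempts through the stale handle fail with 401s.
        if host not in self._cache:
            self._cache[host] = self._connect_fn(host)
        return self._cache[host]

    def clear(self):
        # Restarting Elasticsearch has the same effect as clearing this cache,
        # which is what ultimately resolved the incident described below.
        self._cache.clear()
```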
Post-mortem
On June 5th, 2019, three or four reports of login failures were received. Refreshing the browser allowed the login to succeed, because the refresh connected the user to a different Kibana server. Around the same time, the team also noticed errors from a REST plugin. This plugin was removed from the LTM and was assumed to be the source of the login issues.
On June 6th at approximately 10 AM Central, more reports came in. Customers emailed the IT team to report that they could not log into Kibana, making it evident that the previous day's change had not fixed the problem. The team then searched through the logs and found an exception. After finding the exception, an engineer tried to ping the server it referenced. When the ping failed, other Elasticsearch servers were checked to see whether they were having issues with the same IP address. The LDAP team was then contacted and confirmed that the IP address in question had been taken out of service. Once this was confirmed, Elasticsearch was restarted to clear its cached connections, which ultimately resolved the issue.
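For illustration, the manual triage described above (search the logs for the exception, then check whether the referenced server is reachable) could be scripted roughly as follows. The log path, exception marker, host, and port are hypothetical placeholders, not values from the firm's environment.

```python
import socket

# Hypothetical placeholders; the real log path, exception text, and LDAP host
# are not shown in this write-up.
LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"
EXCEPTION_MARKER = "LDAPException"
LDAP_HOST = "10.0.0.1"
LDAP_PORT = 389


def find_exceptions(log_path, marker):
    """Return log lines that contain the exception marker."""
    with open(log_path) as log_file:
        return [line.rstrip() for line in log_file if marker in line]


def host_reachable(host, port, timeout=3):
    """Rough reachability check: can we open a TCP connection to the host?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    hits = find_exceptions(LOG_PATH, EXCEPTION_MARKER)
    print(f"Found {len(hits)} lines mentioning {EXCEPTION_MARKER}")
    if hits and not host_reachable(LDAP_HOST, LDAP_PORT):
        print(f"{LDAP_HOST}:{LDAP_PORT} is unreachable; "
              "the LDAP server may have been taken out of service.")
```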
Our Solution
On June 5th at approximately 10 AM Central, InsightFinder identified a log entry that had never been seen before and added a new pattern to the Global View. This log entry is an LDAP exception, and it is also identified as a Whitelist event because it contains the word "fail."
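Conceptually, this detection combines two checks: whether the normalized log pattern has been seen before, and whether the line matches a whitelist keyword such as "fail." The sketch below illustrates that idea only; it is not InsightFinder's implementation, and the normalization rules and keyword list are assumptions.

```python
import re

# Assumed whitelist keywords; the product's actual list is configurable.
WHITELIST_KEYWORDS = {"fail", "error", "exception"}

known_patterns = set()  # log patterns observed so far


def normalize(log_line):
    """Collapse variable fields (IPs, numbers) so similar logs share one pattern."""
    line = re.sub(r"\d+\.\d+\.\d+\.\d+", "<ip>", log_line)
    line = re.sub(r"\d+", "<num>", line)
    return line.lower()


def classify(log_line):
    """Return flags describing why a log line is anomalous, if at all."""
    pattern = normalize(log_line)
    flags = []
    if pattern not in known_patterns:
        known_patterns.add(pattern)
        flags.append("new_pattern")
    if any(keyword in pattern for keyword in WHITELIST_KEYWORDS):
        flags.append("whitelist_match")
    return flags


print(classify("LDAPException: failed to connect to server 10.1.2.3"))
# -> ['new_pattern', 'whitelist_match']
```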
Every anomaly is assigned an anomaly score that indicates to the user how serious it may be. In this example, the LDAPException New Pattern received a high anomaly score of 360, both because it is a newly identified pattern and because the system had already seen this log entry nine times. Based on this score, the box on the Global View is colored red to indicate that attention is needed.
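As a rough illustration of how a score might map to the Global View coloring, the sketch below uses assumed thresholds; InsightFinder's actual scoring model and cutoffs are internal to the product.

```python
def anomaly_color(score, red_threshold=300, yellow_threshold=100):
    """Map an anomaly score to a display color (illustrative thresholds only)."""
    if score >= red_threshold:
        return "red"      # needs attention now
    if score >= yellow_threshold:
        return "yellow"   # worth a look
    return "green"        # normal


print(anomaly_color(360))  # -> 'red', as with the LDAPException example above
```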
When you hover over the Timeline on the Global View, a context box appears with a brief breakdown of the anomalies identified at that time.
Selecting the "Click for details" button takes us to the Events page, which provides much more information about this log entry, including its actual content.
Here we can see that an exception was thrown due to a failure to connect to the server at the following IP address:
Results
If InsightFinder had been part of the problem management workflow, the connection issue with the LDAP server would have been identified and localized roughly 24 hours before the team found it in this scenario, before impacting users. In that case, the team would have had everything they needed to resolve the issue without waiting for user reports or spending the time and effort to dig through logs for exceptions.