Introduction
The InsightFinder team has been working closely with a large financial firm to create an environment that demonstrates the strength of InsightFinder’s Unified Intelligence Engine within the firm’s technology ecosystem. Our system currently receives metric, log, and incident data from different sources. This scenario focuses on Elasticsearch and TradeHub. Below, see how the financial firm initially discovered and resolved the problem. Then, review how InsightFinder’s solution would have helped reduce customer impact and the internal technical resources required to identify and resolve issues.
Scenario
An LDAP server had been taken out of service, but Elasticsearch still held a cached connection to it. This cached connection caused a “401 Unauthorized” error that manifested as an inability to log in.
Post-mortem
On June 5th, three or four reports of failed log-ins were received. Refreshing the browser allowed the log-in to succeed (because the refresh connected the user to a different Kibana server). Around the same time, the team also noticed some errors on a REST plugin. This plugin was removed from the LTM and was thought to be the source of the log-in issues.
On June 6th, at around 10 AM Central, more reports came in highlighting a problem. Customers sent emails to the IT team, reporting that they could not log into Kibana. At this point, it was evident that the initially identified solution had not fixed the problem. The team then searched through logs and found an exception. Once the exception was found, the engineer tried to ping the server named in it. When the ping failed, other Elasticsearch servers were checked to see if they were having issues with the same IP address. Finally, the LDAP team was contacted, and it was found that the IP address being pinged had been taken out of service. Once this was confirmed, Elasticsearch was restarted to clear the cached connections, ultimately resolving the issue.
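The manual check described above (pinging the server from the exception, then repeating the check from other nodes) can be approximated with a small script. The following is only a minimal sketch, not the firm's actual tooling: it uses a TCP connection attempt rather than an ICMP ping, and the function names and endpoint list format are hypothetical.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    This is a TCP probe, not an ICMP ping; it is an assumption made for
    illustration, since the source only says the engineer "tried to ping".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def unreachable_endpoints(endpoints, timeout: float = 2.0):
    """Return the (host, port) pairs from a hypothetical list that cannot be reached."""
    return [(h, p) for h, p in endpoints if not is_reachable(h, p, timeout)]
```

Running such a probe against each configured LDAP endpoint from each Elasticsearch node would show whether a failure is specific to one address, which is the question the team answered by hand.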
Our Solution
On June 5th, at ~10 AM Central, InsightFinder identified a log that had never been seen before and thus added a new pattern to the Global View. This log happens to be an LDAP exception, and it is also identified as a Whitelist event because it contains the word “fail.”
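To make the two signals in this paragraph concrete (a log pattern that has never been seen before, and a keyword match on “fail”), here is a minimal sketch. It is not InsightFinder's algorithm: the masking rules, the `PatternTracker` class, and the keyword tuple are assumptions chosen for illustration, and any log lines passed to it would be sample inputs rather than logs from the incident.

```python
import re
from collections import Counter

# Hypothetical keyword list; the article only mentions the word "fail".
WHITELIST_KEYWORDS = ("fail",)


def to_template(line: str) -> str:
    """Mask variable tokens so lines that differ only in values share a template."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)  # IPv4 addresses
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)          # hex identifiers
    line = re.sub(r"\b\d+\b", "<NUM>", line)                     # remaining numbers
    return line.strip()


class PatternTracker:
    """Counts log templates and flags first-time patterns and keyword hits."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def observe(self, line: str) -> dict:
        template = to_template(line)
        is_new = template not in self.counts
        self.counts[template] += 1
        return {
            "template": template,
            "new_pattern": is_new,
            "whitelist": any(k in line.lower() for k in WHITELIST_KEYWORDS),
            "count": self.counts[template],
        }
```

In this sketch, the first line matching a given template is reported as a new pattern, and later lines with the same template only increase its count, which mirrors the idea of a new pattern that has then been seen several more times.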
All anomalies are given an Anomaly Score that helps show the user how serious an anomaly may be. In this example, the LDAPException New Pattern has been given a high Anomaly Score of 360, because it is a newly identified pattern and the system has already seen this log come through 9 times. Based on this Anomaly Score, the box on the Global View is colored red to indicate to the user that attention is needed.
When hovering the mouse over the Timeline on the Global View, a context box will appear with a brief breakdown of the anomalies identified at that time. (Please note that the screenshots are from a slightly older version of the interface. For the latest view, sign up for a free trial.)
When selecting the “Click for details” button, we are taken to the Events page where we are given much more information about this log, including its actual content.
Here, we can see that an exception has been thrown due to an inability to connect to a server with IP address:
Results
If InsightFinder had been used in the problem management workflow when the problem first occurred, the connection issues with the LDAP server would have been identified and localized 24 hours before the team found them in this scenario, before users were impacted. In the future, using InsightFinder, the team will have the tools they need to resolve such issues without relying on reports from users or spending time and effort manually digging through logs for exceptions.