Welcome to InsightFinder AI Observability Docs!

LLM Labs

Overview

This user guide explains how to manage Large Language Models (LLMs) in InsightFinder AI. LLM Labs lets you create and configure LLM sessions, chat with models, and monitor and evaluate prompts and responses in a controlled environment with built-in analytics and safety measures. Its purpose is to help data scientists and ML engineers select the model that best fits their needs.

Primary Use Cases for LLM Labs

  1. Test New LLM Versions

    1. Evaluate how a new version of an LLM (e.g., ChatGPT-5) performs compared to previous versions.
    2. Identify improvements or regressions in accuracy, relevance, or safety.
  2. Compare Different LLMs or Versions

    1. Benchmark two different models (e.g., GPT-4.1 vs. Gemini 2.5) or different versions of the same model using the same prompts.
    2. Measure differences in output quality, token efficiency, and handling of edge cases.
  3. Trust & Safety Analysis

    1. Assess LLM outputs for bias, hallucinations, or unsafe responses.
    2. Use prompt templates or interactive chat sessions to test guardrail effectiveness.
  4. Performance & Cost Insights (optional but valuable)

    1. Monitor token consumption, latency, and other performance metrics when running evaluations.

What is an LLM Session?

An LLM Session is a dedicated environment where you can interact with a Large Language Model in a controlled, monitored, and evaluated manner. Each session maintains:

  • Model Configuration: Specific model type, version, and settings
  • Real-time Evaluation: Continuous assessment of model responses for quality, safety, and relevance
  • Performance Metrics: Analytics on response quality, safety measures, and potential issues

Purpose and Benefits

  1. Quality Assurance: Every response is automatically evaluated for relevance, accuracy, and safety
  2. Safety Monitoring: Built-in guardrails detect and prevent malicious or harmful content
  3. Performance Tracking: Monitor how well your AI interactions are performing over time
  4. Organized Conversations: Keep different AI conversations separate and well-organized
  5. Model Comparison: Test different models and configurations side-by-side
  6. Compliance: Ensure AI interactions meet safety and ethical standards

Getting Started

To access LLM Session Management in InsightFinder AI:

  1. Log into your InsightFinder AI platform
  2. Navigate to the LLM Labs section
  3. Click on “Create Session” to begin

Creating a New Session

Step 1: Initiate Session Creation

  1. Click the “Create Session” button in the LLM Labs interface
  2. A session configuration dialog will appear

Step 2: Basic Session Information


Session Name
  • Provide a descriptive name for your session
  • Example: “Customer Support Bot V1”, “Marketing Content Generator”, “Research Assistant”
  • Choose names that help you identify the session’s purpose

Step 3: Model Selection

You have two main options for model configuration:

Option A: Built-in Models
  • Description: Pre-configured models with no setup required
  • Benefits:
    • Instant availability
    • Optimized configurations
    • No API key management needed
  • Available Models: Various pre-configured LLM options
  • Setup: Simply select from the dropdown menu

Option B: API-Connected Models
  • Description: Connect to external LLM providers via API
  • Benefits:
    • Custom configurations

Step 4: API Configuration (For API Models Only)

If you selected an API-connected model:

Model Type Selection
  • Choose your preferred LLM provider (OpenAI, Anthropic, etc.)

Version Selection
  • Select the specific model version you want to use
  • Example: GPT-4, GPT-3.5-turbo, Claude-3, etc.

Step 5: Confirm and Create

  1. Review your session configuration
  2. Click “Confirm”
  3. Wait for the session to initialize

Waiting for Session Ready

  1. After confirming your session, monitor the status indicator
  2. The session will show “Initializing” while setting up
  3. Wait for the status to change to “Ready”
  4. Once ready, you can begin interacting with your LLM

Chatting with Your Model

Starting a Conversation

  1. Locate Your Ready Session: Find a session with “Ready” status
  2. Click the Chat Icon: This opens the conversation interface
  3. Start Typing: Enter your message in the chat input field
  4. Send Your Message: Press Enter or click Send
  5. Receive Response: The LLM will respond with real-time evaluation

In addition to individual conversations, you can process multiple prompts at once using the Batch Prompts feature. You can provide batch prompts in three ways:

  1. Upload File: Upload a .txt file containing prompts. Each prompt must be on its own line
  2. Manual Entry: Enter prompts directly in the text area, one prompt per line
  3. Dropdown Selection: Choose from previously uploaded prompt collections saved in your Prompt Library
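
The one-prompt-per-line format for uploaded .txt files can be prepared with a short script. The following is a minimal sketch, not part of InsightFinder AI: the function names and the file name are hypothetical, and it only assumes what is stated above, namely that each prompt occupies exactly one line.

```python
from pathlib import Path


def write_prompt_file(prompts, path):
    """Write prompts to a text file, one prompt per line (hypothetical helper)."""
    lines = []
    for prompt in prompts:
        # Each prompt must fit on one line, so fold line breaks
        # (and other runs of whitespace) into single spaces.
        single_line = " ".join(prompt.split())
        if single_line:  # skip prompts that are empty after cleanup
            lines.append(single_line)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def read_prompt_file(path):
    """Read a prompt file back, ignoring blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, write_prompt_file(["Summarize the outage report.", "Explain\nthis alert"], "batch_prompts.txt") produces a two-line file, with the second prompt folded onto a single line.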

Evaluation and Analytics

Understanding Real-time Evaluation

Every response from your LLM is automatically evaluated across multiple dimensions. Click the Terminal/Evaluation icon to view detailed results.

Evaluation Categories

1. Hallucination & Relevance

Answer Irrelevance

Scale: 1.0 – 5.0

Example: 4.0 / 5.0

Meaning: Measures how relevant the AI’s response is to your question

What it catches: Off-topic responses, misunderstood questions

Hallucination

Scale: 1.0 – 5.0

Example: 4.0 / 5.0

Meaning: Detects when the AI makes up false information

What it catches: Fabricated facts, non-existent references, made-up statistics

2. Safety, Guardrails & Toxicity

Malicious Prompt

Scale: 1.0 – 5.0

Example: 3.0 / 5.0

Meaning: Identifies attempts to manipulate the AI into harmful responses

What it catches: Jailbreak attempts, prompt injection, manipulation tactics

PII/PHI Leakage

Scale: 1.0 – 5.0

Example: 3.0 / 5.0

Meaning: Detects potential leakage of Personally Identifiable Information (PII) or Protected Health Information (PHI)

What it catches:

– Social security numbers, credit card numbers, phone numbers

– Email addresses, home addresses, names

– Medical records, health conditions, treatment information

– Financial data, account numbers, sensitive personal details

3. Bias

Bias

Result: Passed/Failed

Meaning: Detects unfair bias in AI responses

What it catches: Gender bias, racial bias, cultural stereotypes
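
If you record evaluation results outside the platform, for example in a spreadsheet export or your own test notes, the categories above map naturally onto a small record type. This is a sketch under assumptions: the class and field names are hypothetical and do not reflect an InsightFinder AI API. It only encodes what the guide states, four scores on a 1.0 to 5.0 scale and a Passed/Failed bias result.

```python
from dataclasses import dataclass

# Score range stated in the guide for the four scaled metrics.
SCORE_MIN, SCORE_MAX = 1.0, 5.0


@dataclass
class EvaluationResult:
    """Hypothetical record of one response's evaluation."""
    answer_irrelevance: float
    hallucination: float
    malicious_prompt: float
    pii_phi_leakage: float
    bias_passed: bool  # True for "Passed", False for "Failed"

    def __post_init__(self):
        # Reject scores outside the documented 1.0 to 5.0 scale.
        for name in ("answer_irrelevance", "hallucination",
                     "malicious_prompt", "pii_phi_leakage"):
            value = getattr(self, name)
            if not SCORE_MIN <= value <= SCORE_MAX:
                raise ValueError(
                    f"{name}={value} is outside {SCORE_MIN}-{SCORE_MAX}"
                )


# The example values shown in the sections above.
result = EvaluationResult(4.0, 4.0, 3.0, 3.0, bias_passed=True)
```

The sketch validates the range only; it deliberately does not interpret whether a higher or lower score is better.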

Session Actions

Restarting a Stopped Session

If your session status shows “Stopped”:

  1. Locate the Restart Button (Play/Pause Icon): Find it in the Actions column
  2. Click Restart: This will reinitialize your session
  3. Wait for Ready Status: Monitor until status changes to “Ready”
  4. Resume Activity: Continue chatting or evaluating

Deleting a Session

To Delete a Session:

  1. Click the Delete Icon: Located in the Actions column
  2. Confirm Deletion: A confirmation dialog will appear
  3. Confirm: Click “Yes” or “Delete” to proceed
  4. Session Removed: The session will be permanently deleted

When to Delete Sessions:

  • Completed testing scenarios
  • Conversations that are no longer needed
  • Freeing up session limits
  • Cleaning up workspace

LLM Model Compare Dashboard

Overview

The LLM Model Compare dashboard helps you make informed decisions about which model best suits your specific use case.

Purpose: Experiment with prompts across different models. Compare responses, test parameters, and measure token usage.

Comparison Categories

You can compare models across three key evaluation dimensions:

1. Hallucination & Relevance
  • Measures accuracy and relevance of model responses
  • Lower percentages indicate better performance (fewer hallucinations)
  • Helps identify models that provide factually accurate information
2. Safety, Guardrails & Toxicity
  • Evaluates how well models handle potentially harmful content
  • Higher percentages may indicate more safety violations detected
  • Critical for applications requiring strict content moderation
3. Bias
  • Detects unfair bias in model responses
  • Lower percentages indicate less biased outputs
  • Important for ensuring fair and inclusive AI interactions

Token Usage Tracking

Monitor your token consumption across all model experiments:

Accessing Token Usage
  1. Click the Token Usage icon in the top-right corner of the Model Compare interface
  2. View detailed token consumption metrics

Understanding Token Metrics

Total Tokens
  • Combined input and output tokens used
  • Shows current usage against your limit
  • Helps track consumption patterns

Input Tokens
  • Tokens used for your prompts and questions
  • Represents the “cost” of your inputs to the models

Output Tokens
  • Tokens generated by the models in their responses
  • Often higher than input tokens for detailed responses
  • Major component of API costs

Token Limit
  • Your allocated token budget
  • Resets based on your subscription plan
  • Important for cost management
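
The relationship between these metrics is simple arithmetic: total tokens are the sum of input and output tokens, and usage is the total measured against the limit. The sketch below illustrates that calculation; the function name, the returned keys, and the sample numbers are assumptions made for illustration, not values or fields from the dashboard.

```python
def token_usage(input_tokens, output_tokens, token_limit):
    """Summarize token consumption against a budget (hypothetical helper)."""
    if token_limit <= 0:
        raise ValueError("token_limit must be positive")
    total = input_tokens + output_tokens  # Total Tokens = input + output
    return {
        "total_tokens": total,
        "remaining": max(token_limit - total, 0),  # never report below zero
        "percent_used": round(100 * total / token_limit, 1),
    }


print(token_usage(input_tokens=1200, output_tokens=3800, token_limit=100_000))
# {'total_tokens': 5000, 'remaining': 95000, 'percent_used': 5.0}
```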

A/B Testing – Compare Prompts Across Models & Datasets

Overview

The Multidimensional Comparison feature lets you evaluate how different prompt versions perform across models and datasets, so that prompt optimization is driven by data. By combining versioning with dataset management, it helps you analyze performance across model-dataset combinations and identify the best prompt version for your specific use case.

1. Prompt Template Version Control

Step 1: Initialize Prompt Templates
  • Navigate to LLM Labs: Go to the LLM Labs section in InsightFinder AI
  • Access Prompt Library: Click the Prompt Library tab to manage your templates. Use the Prompts Template Upload button to create your initial Version 1.0 prompt templates

Step 2: Manage Template Iterations

  • Edit Prompt Template: Click the Edit icon on an existing template to modify your prompt content or parameters
  • Save and Version Changes: Choose Overwrite for minor refinements to update the current version, or select Create New Version to define a distinct variation for comparative testing

2. Dataset Integration

  • Access Dataset Tab: Click the Dataset tab to manage your testing data
  • Upload Testing Data: Use the Upload Dataset button to add data files that will serve as the evaluation environment for your prompts
  • Identify Applicable Templates: For each dataset, you can tag an Applicable Prompt Template. This helps you better record and track the suggested relationships between specific data and templates

3. Link Dataset within the Prompt

  • Trigger Dataset Menu: When creating or editing prompt content, type [] to open a dropdown list of available datasets
  • Insert Dataset References: Select the desired dataset and its column to insert a dynamic reference (e.g., [DatasetSample.txt]) at the cursor position
  • Establish Prompt-Data Linkage: Once inserted, the prompt template stores the link to the data source, ensuring the prompt is correctly associated with the specific dataset at the chosen location
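
Before uploading a template that was drafted outside the editor, it can help to check that every bracketed reference names a dataset you have actually uploaded. The sketch below is an illustration only: it assumes references take the plain [name] form shown in the example above, and the function names are hypothetical. How the platform encodes a column selection inside a reference is not covered here.

```python
import re

# Matches a bracketed reference such as [DatasetSample.txt].
# Assumption: references contain no nested brackets.
DATASET_REF = re.compile(r"\[([^\[\]]+)\]")


def find_dataset_refs(prompt_template):
    """Return the dataset names referenced in a template, in order."""
    return DATASET_REF.findall(prompt_template)


def missing_datasets(prompt_template, available):
    """Return referenced names that are not in the set of uploaded datasets."""
    return [name for name in find_dataset_refs(prompt_template)
            if name not in available]


template = "Classify each record in [DatasetSample.txt] by severity."
print(find_dataset_refs(template))
# ['DatasetSample.txt']
```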

4. Multidimensional Compare & Analyze

  • Access LLM Evaluation: Click the LLM Evaluation tab to manage your evaluation sessions and initiate comparisons
  • Compare Models & Prompts: Click Compare Models. This feature evaluates different Prompt Template versions across multiple models. The system automatically runs tests based on the datasets linked within each prompt. By analyzing responses across these dimensions, you can identify the optimal configuration for your specific requirements
  • Test the Matrix: Evaluate your templates across various Model/Dataset combinations while monitoring Response Quality, Token Usage, and Latency
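
The comparison matrix grows multiplicatively with each dimension, which is worth keeping in mind for token budgets. The sketch below enumerates every prompt version, model, and dataset combination to show how many runs a comparison implies. It is not how the platform schedules tests internally; the function name and version labels are hypothetical, and the model and dataset names are borrowed from examples elsewhere in this guide.

```python
from itertools import product


def build_matrix(prompt_versions, models, datasets):
    """Enumerate every (prompt version, model, dataset) combination."""
    return [
        {"prompt_version": version, "model": model, "dataset": dataset}
        for version, model, dataset in product(prompt_versions, models, datasets)
    ]


matrix = build_matrix(
    prompt_versions=["1.0", "2.0"],        # hypothetical version labels
    models=["GPT-4.1", "Gemini 2.5"],
    datasets=["DatasetSample.txt"],
)
print(len(matrix))  # 2 versions x 2 models x 1 dataset
# 4
```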

Identify the Winner: In the View Details page, the system automatically highlights the top-performing version with a “Winner” badge. This conclusion is derived from a multi-dimensional analysis that tests your Prompt Versions against the selected Model and Dataset matrix

Analyze Evaluation Summary: Scroll to the bottom to view the Evaluation Summary table. You can compare critical performance metrics across different datasets and models, including:

  • Total Prompts – Number of prompts tested in this comparison
  • Passed Evaluations – Count and percentage of responses that met quality standards
    • Higher percentages indicate better model performance
  • Failed Evaluations – Count and percentage of responses that didn’t meet standards
    • Lower numbers are better
  • Top Failed Evaluation – Identifies the most common failure type (if any)
    • Helps understand model weaknesses
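
The summary fields above can be reproduced from per-prompt outcomes, which is useful when cross-checking exported results. This is a minimal sketch with assumed names and an assumed input shape: one entry per prompt, None for a pass and a failure label otherwise. The sample data is invented for illustration.

```python
from collections import Counter


def summarize(results):
    """Compute summary counts from per-prompt outcomes.

    results: one entry per prompt; None means the response passed,
    a string names the evaluation it failed (hypothetical input shape).
    """
    total = len(results)
    failures = [label for label in results if label is not None]
    passed = total - len(failures)
    top = Counter(failures).most_common(1)  # most frequent failure type

    def pct(count):
        return round(100 * count / total, 1) if total else 0.0

    return {
        "total_prompts": total,
        "passed": passed,
        "passed_pct": pct(passed),
        "failed": len(failures),
        "failed_pct": pct(len(failures)),
        "top_failed_evaluation": top[0][0] if top else None,
    }


outcomes = [None, None, "Hallucination", None, "Bias",
            "Hallucination", None, None]
print(summarize(outcomes)["top_failed_evaluation"])
# Hallucination
```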

5. View Comparison History

  • Access Comparison Results: Click the Prompt Comparison Result tab to view the performance history of your prompt templates across different models and datasets.
  • Analyze & Select the Best Prompt: Review the Best Model recommendations for each version and click View Details to inspect specific response data
  • Pick Your Favorite: You can then use Pick Your Favorite to mark the most effective prompt version for your production needs

 
