Welcome to InsightFinder AI Observability Docs!

LLM Labs

Overview

This user guide explains how to manage Large Language Models (LLMs) in InsightFinder AI. LLM Labs lets you create and configure sessions, chat with models, and monitor and evaluate LLM prompts and responses in a controlled environment with built-in analytics and safety measures. Its purpose is to help data scientists and ML engineers select the model that best fits their needs.

Primary Use Cases for LLM Labs

Test New LLM Versions
- Evaluate how a new version of an LLM (e.g., ChatGPT-5) performs compared to previous versions.
- Identify improvements or regressions in accuracy, relevance, or safety.
Compare Different LLMs or Versions
- Benchmark two different models (e.g., GPT-4.1 vs. Gemini 2.5) or different versions of the same model using the same prompts.
- Measure differences in output quality, token efficiency, and handling of edge cases.
Trust & Safety Analysis
- Assess LLM outputs for bias, hallucinations, or unsafe responses.
- Use prompt templates or interactive chat sessions to test guardrail effectiveness.
Performance & Cost Insights (optional but valuable)
- Monitor token consumption, latency, and other performance metrics when running evaluations.

What is an LLM Session?

An LLM Session is a dedicated environment where you can interact with a Large Language Model in a controlled, monitored, and evaluated manner. Each session maintains:

Model Configuration: Specific model type, version, and settings
Real-time Evaluation: Continuous assessment of model responses for quality, safety, and relevance
Performance Metrics: Analytics on response quality, safety measures, and potential issues

Purpose and Benefits

Quality Assurance: Every response is automatically evaluated for relevance, accuracy, and safety
Safety Monitoring: Built-in guardrails detect and prevent malicious or harmful content
Performance Tracking: Monitor how well your AI interactions are performing over time
Organized Conversations: Keep different AI conversations separate and well-organized
Model Comparison: Test different models and configurations side-by-side
Compliance: Ensure AI interactions meet safety and ethical standards

Getting Started

To access LLM Session Management in InsightFinder AI:

Log into your InsightFinder AI platform
Navigate to the LLM Labs section
Click on “Create Session” to begin

Creating a New Session

Step 1: Initiate Session Creation

Click the “Create Session” button in the LLM Labs interface
A session configuration dialog will appear

Step 2: Basic Session Information

Session Name
- Provide a descriptive name for your session
- Example: “Customer Support Bot V1”, “Marketing Content Generator”, “Research Assistant”
- Choose names that help you identify the session’s purpose

Step 3: Model Selection

You have two main options for model configuration:

Option A: Built-in Models

Description: Pre-configured models with no setup required
Benefits:
Instant availability
Optimized configurations
No API key management needed
Available Models: Various pre-configured LLM options
Setup: Simply select from the dropdown menu

Option B: API-Connected Models

Description: Connect to external LLM providers via API
Benefits:
Custom configurations

Step 4: API Configuration (For API Models Only)

If you selected an API-connected model:

Model Type Selection
- Choose your preferred LLM provider (OpenAI, Anthropic, etc.)

Version Selection
- Select the specific model version you want to use
- Example: GPT-4, GPT-3.5-turbo, Claude-3, etc.

Step 5: Confirm and Create

Review your session configuration
Click “Confirm”
Wait for the session to initialize

Waiting for Session Ready

After confirming your session, monitor the status indicator
The session will show “Initializing” while setting up
Wait for the status to change to “Ready”
Once ready, you can begin interacting with your LLM

Chatting with Your Model

Starting a Conversation

Locate Your Ready Session: Find a session with “Ready” status
Click the Chat Icon: This opens the conversation interface
Start Typing: Enter your message in the chat input field
Send Your Message: Press Enter or click Send
Receive Response: The LLM will respond with real-time evaluation

In addition to individual conversations, you can process multiple prompts at once using the Batch Prompts feature. You can provide batch prompts in three ways:

Upload File: Upload a .txt file containing your prompts, with each prompt on its own line
Manual Entry: Enter prompts directly in the text area, one prompt per line
Dropdown Selection: Choose from previously uploaded prompt collections saved in your Prompt Library
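The one-prompt-per-line format can be produced with a short script. The following is a minimal sketch, not part of the InsightFinder AI product: the function name `write_prompt_file` and the whitespace cleanup are assumptions made for illustration. It collapses any line breaks inside a prompt so that each prompt occupies exactly one line of the .txt file.

```python
from pathlib import Path


def write_prompt_file(prompts, path):
    """Write prompts to a .txt file, one prompt per line (hypothetical helper)."""
    lines = []
    for prompt in prompts:
        # Collapse runs of whitespace, including line breaks, into single
        # spaces so that each prompt stays on one line.
        single_line = " ".join(prompt.split())
        if single_line:  # skip prompts that are empty after cleanup
            lines.append(single_line)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, calling `write_prompt_file(["What is 1 + 2?", "Explain quantum computing\nin simple terms"], "batch_prompts.txt")` would write a two-line file that can then be selected under Upload File.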

Evaluation and Analytics

Understanding Real-time Evaluation

Every response from your LLM is automatically evaluated across multiple dimensions. Click the Terminal/Evaluation icon to view detailed results.

Evaluation Categories

1. Hallucination & Relevance

Answer Irrelevance – Scale: 1.0 – 5.0
Example: 4.0 / 5.0
Meaning: Measures how relevant the AI’s response is to your question
What it catches: Off-topic responses, misunderstood questions
Hallucination – Scale: 1.0 – 5.0
Example: 4.0 / 5.0
Meaning: Detects when the AI makes up false information
What it catches: Fabricated facts, non-existent references, made-up statistics

2. Safety, Guardrails & Toxicity

Malicious Prompt – Scale: 1.0 – 5.0
Example: 3.0 / 5.0
Meaning: Identifies attempts to manipulate the AI into harmful responses
What it catches: Jailbreak attempts, prompt injection, manipulation tactics
PII/PHI Leakage – Scale: 1.0 – 5.0
Example: 3.0 / 5.0
Meaning: Detects potential leakage of Personally Identifiable Information (PII) or Protected Health Information (PHI)
What it catches:
– Social security numbers, credit card numbers, phone numbers
– Email addresses, home addresses, names
– Medical records, health conditions, treatment information
– Financial data, account numbers, sensitive personal details

3. Bias

Bias – Result: Passed/Failed
Meaning: Detects unfair bias in AI responses
What it catches: Gender bias, racial bias, cultural stereotypes
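If you record evaluation results outside the platform, for instance in a spreadsheet export or a test log, the categories above map naturally onto a small record type. This is a sketch under assumptions: the class and field names are hypothetical and do not reflect an InsightFinder AI API. It only enforces what this guide states, namely that four metrics use a 1.0 to 5.0 scale and that Bias is a Passed/Failed result.

```python
from dataclasses import dataclass

# Scale bounds taken from the guide; everything else here is illustrative.
SCORE_MIN, SCORE_MAX = 1.0, 5.0
SCORED_FIELDS = (
    "answer_irrelevance",
    "hallucination",
    "malicious_prompt",
    "pii_phi_leakage",
)


@dataclass(frozen=True)
class EvaluationResult:
    """Hypothetical container for one response's evaluation results."""

    answer_irrelevance: float
    hallucination: float
    malicious_prompt: float
    pii_phi_leakage: float
    bias_passed: bool

    def __post_init__(self):
        # Reject any score that falls outside the documented 1.0 - 5.0 scale.
        for name in SCORED_FIELDS:
            value = getattr(self, name)
            if not SCORE_MIN <= value <= SCORE_MAX:
                raise ValueError(
                    f"{name} must be between {SCORE_MIN} and {SCORE_MAX}, got {value}"
                )


# Values mirror the example scores shown in this section.
example = EvaluationResult(
    answer_irrelevance=4.0,
    hallucination=4.0,
    malicious_prompt=3.0,
    pii_phi_leakage=3.0,
    bias_passed=True,
)
```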

Session Actions

Restarting a Stopped Session

If your session status shows “Stopped”:

Locate the Restart Button (Play/Pause Icon): Find it in the Actions column
Click Restart: This will reinitialize your session
Wait for Ready Status: Monitor until status changes to “Ready”
Resume Activity: Continue chatting or evaluating

Deleting a Session

To Delete a Session:

Click the Delete Icon: Located in the Actions column
Confirm Deletion: A confirmation dialog will appear
Confirm: Click “Yes” or “Delete” to proceed
Session Removed: The session will be permanently deleted

When to Delete Sessions:
- Completed testing scenarios
- Conversations that are no longer needed
- Freeing up session limits
- Cleaning up your workspace

LLM Model Compare Dashboard

Overview

The LLM Model Compare dashboard lets you experiment with prompts across different models, compare their responses, test parameters, and measure token usage, so you can make an informed decision about which model best suits your use case.

Comparison Categories

You can compare models across three key evaluation dimensions:

1. Hallucination & Relevance

Measures accuracy and relevance of model responses
Lower percentages indicate better performance (fewer hallucinations)
Helps identify models that provide factually accurate information

2. Safety, Guardrails & Toxicity

Evaluates how well models handle potentially harmful content
Higher percentages may indicate more safety violations detected
Critical for applications requiring strict content moderation

3. Bias

Detects unfair bias in model responses
Lower percentages indicate less biased outputs
Important for ensuring fair and inclusive AI interactions
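To read the three categories together, it can help to line up the percentages for two models and note which one is lower in each. The sketch below is illustrative only: the function name, the model names, and the numbers are hypothetical, and it assumes that a lower percentage is preferable in every category, which this guide states for Hallucination & Relevance and Bias and implies for Safety, Guardrails & Toxicity.

```python
CATEGORIES = (
    "Hallucination & Relevance",
    "Safety, Guardrails & Toxicity",
    "Bias",
)


def lower_rate_by_category(name_a, rates_a, name_b, rates_b):
    """Return, per category, the model with the lower percentage (or 'tie')."""
    result = {}
    for category in CATEGORIES:
        a, b = rates_a[category], rates_b[category]
        if a < b:
            result[category] = name_a
        elif b < a:
            result[category] = name_b
        else:
            result[category] = "tie"
    return result


# Made-up percentages for two hypothetical sessions.
comparison = lower_rate_by_category(
    "model-a",
    {"Hallucination & Relevance": 4.0, "Safety, Guardrails & Toxicity": 2.0, "Bias": 1.0},
    "model-b",
    {"Hallucination & Relevance": 6.5, "Safety, Guardrails & Toxicity": 2.0, "Bias": 0.5},
)
```

Here `comparison` names model-a for Hallucination & Relevance, a tie for Safety, Guardrails & Toxicity, and model-b for Bias. A split result like this is common, so weigh the categories by what matters most for your application.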

Token Usage Tracking

Monitor your token consumption across all model experiments:

Accessing Token Usage

Click the Token Usage icon in the top-right corner of the Model Compare interface
View detailed token consumption metrics

Understanding Token Metrics

Total Tokens
- Combined input and output tokens used
- Shows current usage against your limit
- Helps track consumption patterns

Input Tokens
- Tokens used for your prompts and questions
- Represents the “cost” of your inputs to the models

Output Tokens
- Tokens generated by the models in their responses
- Often higher than input tokens for detailed responses
- Major component of API costs

Token Limit
- Your allocated token budget
- Resets based on your subscription plan
- Important for cost management
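The relationship between these four metrics is simple arithmetic: total tokens are input plus output, and usage is the total measured against the limit. A minimal sketch follows, assuming hypothetical field names and made-up numbers; it is not the platform's data model.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Hypothetical record of the token metrics described above."""

    input_tokens: int
    output_tokens: int
    token_limit: int

    @property
    def total_tokens(self) -> int:
        # Total Tokens = Input Tokens + Output Tokens
        return self.input_tokens + self.output_tokens

    @property
    def remaining_tokens(self) -> int:
        # Never report a negative remainder if the limit is exceeded.
        return max(self.token_limit - self.total_tokens, 0)

    @property
    def percent_used(self) -> float:
        if self.token_limit <= 0:
            return 0.0
        return 100.0 * self.total_tokens / self.token_limit


# Illustrative numbers only.
usage = TokenUsage(input_tokens=1200, output_tokens=3800, token_limit=20000)
print(usage.total_tokens, usage.remaining_tokens, usage.percent_used)
# prints: 5000 15000 25.0
```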

A/B Testing – Compare Prompts Across Models

Overview

The A/B Testing feature in LLM Labs allows you to compare how different models respond to the same prompts, enabling data-driven decisions about model selection and prompt optimization. This powerful comparison tool helps you understand model behavior differences and choose the best performing model for your specific use case.

Accessing A/B Testing

Navigate to LLM Labs: Go to the LLM Labs section in InsightFinder AI
Click “Compare Models”: Find and click the “Compare Models” button
Configure Your Comparison: Set up your testing parameters

Comparison Configuration

Step 1: Comparison Mode

Currently available mode: Same Prompt, Different Models (automatically selected)
- Tests identical prompts across multiple models
- Reveals how different models interpret and respond to the same input
- Perfect for model selection and performance benchmarking

Step 2: Prompt Entry Method

You have two options for providing your test prompts:

Option A: Input Prompts

Description: Manually enter prompts directly into the interface
Format: One prompt per line
Best for: Quick tests with a few specific prompts
Example:

What is 1 + 2?
What is 2 + 3?
Explain quantum computing in simple terms

Option B: Select Prompt Template

Description: Use pre-existing prompt collections or upload files

Two sub-options available:

Select from Dropdown
- Choose from existing prompt templates in your workspace
- Pre-configured prompt sets for common use cases
- Consistent testing across different comparison sessions

Upload File
- File Format: .txt file with one prompt per line
- Preparation: Create a text file with each prompt on a separate line
- Benefits: Bulk testing with large prompt sets
- Auto-Save to Library: Uploaded prompts are automatically saved to the Prompt Library
- Example file content:

What is the capital of France?
How do you solve a quadratic equation?
Write a short story about friendship

Step 3: Model Selection

Requirements: Select exactly two models for comparison
Model Options:
- Choose from models you’ve created using the “Create Session” feature
- Only active/ready sessions appear in the dropdown
- Models must be in “Ready” status to be available for comparison
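The two selection rules above (exactly two models, both in “Ready” status) can be expressed as a quick pre-check. This is a sketch for reasoning about the rules, not an InsightFinder AI API: the function name and the `sessions` mapping of session name to status are hypothetical.

```python
def check_model_selection(sessions, selected):
    """Return a list of problems with a selection; an empty list means it is valid.

    sessions: dict mapping session name -> status string (hypothetical shape)
    selected: list of session names chosen for the comparison
    """
    problems = []
    if len(selected) != 2:
        problems.append(f"exactly two models must be selected, got {len(selected)}")
    for name in selected:
        status = sessions.get(name)
        if status is None:
            problems.append(f"unknown session: {name}")
        elif status != "Ready":
            problems.append(f"session '{name}' is '{status}', not 'Ready'")
    return problems


# Hypothetical sessions and statuses.
sessions = {
    "Customer Support Bot V1": "Ready",
    "Customer Support Bot V2": "Ready",
    "Research Assistant": "Stopped",
}
```

With these sessions, selecting the two Customer Support sessions yields no problems, while including "Research Assistant" reports that the session is stopped and would need to be restarted first.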

Step 4: Start Comparison

Review Settings: Verify your comparison configuration
Click “Compare”: Initiate the A/B testing process
Wait for Results: The system will process prompts across both models

Understanding Comparison Results

Side-by-Side Response Display

The results are presented side by side, so you can compare the two models’ responses to each prompt directly.

Response Analysis Features

For Each Model Response:
Response Content: The actual model output
Real-time Evaluations: Quality and safety assessments

Evaluation Summary

At the end of each comparison, you’ll receive a comprehensive Evaluation Summary:

Key Summary Components

Total Prompts
- Number of prompts tested in this comparison

Passed Evaluations
- Count and percentage of responses that met quality standards
- Higher percentages indicate better model performance

Failed Evaluations
- Count and percentage of responses that didn’t meet standards
- Lower numbers are better

Top Failed Evaluation
- Identifies the most common failure type (if any)
- Helps understand model weaknesses
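To make the summary components concrete, the sketch below computes the same four figures from a list of per-response results. It rests on assumptions that this guide does not confirm: one response per prompt, a response counts as passed only when it has no failed evaluation, and the function and key names are hypothetical. The platform's own counting rules may differ.

```python
from collections import Counter


def summarize(failed_evaluations_per_response):
    """Summarize results for one model.

    failed_evaluations_per_response: one list per response, holding the names
    of the evaluations that response failed (an empty list means it passed).
    """
    total = len(failed_evaluations_per_response)
    failed = sum(1 for failures in failed_evaluations_per_response if failures)
    passed = total - failed

    # Count every failure type across all responses to find the most common.
    counts = Counter(
        name for failures in failed_evaluations_per_response for name in failures
    )
    top_failed = counts.most_common(1)[0][0] if counts else None

    def percent(count):
        return round(100.0 * count / total, 1) if total else 0.0

    return {
        "total_prompts": total,
        "passed": passed,
        "passed_percent": percent(passed),
        "failed": failed,
        "failed_percent": percent(failed),
        "top_failed_evaluation": top_failed,
    }


# Made-up results for four prompts.
summary = summarize([[], ["Hallucination"], [], ["Hallucination", "Bias"]])
```

In this example, two of four responses pass (50.0%), two fail (50.0%), and the top failed evaluation is "Hallucination" because it appears twice while "Bias" appears once.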

Common Use Cases

1. Model Selection for Production

Compare candidate models before deployment
Validate model performance with real-world prompts
Ensure consistent quality across different input types

2. Prompt Optimization

Test how different models respond to prompt variations
Identify which models work best with your prompt style
Optimize prompts for specific model capabilities

3. Quality Assurance

Regular testing to monitor model performance
Detect degradation or improvements in model responses
Ensure compliance with quality standards

4. Research and Development

Understand model behavior differences
Document model capabilities and limitations
Support evidence-based model recommendations

Prompt Library Management

Overview

The Prompt Library is a centralized repository where you can store, organize, and manage your prompt collections for reuse across different A/B testing sessions and model comparisons.

How Prompts Are Saved

Automatic Saving: When you upload a .txt file during A/B testing setup, the prompts are automatically saved to the Prompt Library
Reusability: Saved prompts can be selected from the dropdown in future comparisons
Organization: Keep your frequently used prompt sets organized and easily accessible

Manual Upload to Prompt Library

Alternative Upload Method:
1. Navigate to Prompt Library: Click on “Prompt Library” in the LLM Labs interface
2. Click Upload Button: Find and click the “Prompts Upload” button
3. Select File: Choose your .txt file with prompts (one prompt per line)
4. Confirm Upload: The prompts will be added to your library

Managing Prompt Library

Viewing Your Prompts
- Access the Prompt Library section to see all saved prompt collections
- Browse through your uploaded prompt files
- Preview prompt content before selection

Deleting Prompts
1. Navigate to Prompt Library: Go to the Prompt Library section
2. Locate Target Prompts: Find the prompt collection you want to remove
3. Click Delete Button: Look for the delete button in the Actions column
4. Confirm Deletion: Confirm the removal of the prompt collection

Best Practices for Prompt Library

Organization Tips:
– Use descriptive filenames when uploading (e.g., “customer_service_prompts.txt”, “technical_qa_prompts.txt”)
– Group related prompts together in single files
– Regularly review and clean up unused prompt collections
– Keep prompt collections focused on specific use cases or domains

File Preparation Guidelines:
– One prompt per line: Ensure each prompt is on a separate line

Example Organized Prompt Collections:

customer_support.txt
How can I reset my password?
What are your business hours?
How do I cancel my subscription?
Where can I find my order history?

technical_qa.txt
Explain the difference between REST and GraphQL APIs
What is machine learning?
How does blockchain technology work?
Define microservices architecture
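Before uploading a collection, you may want to check the file locally. The following sketch reads a .txt file and flags blank lines and repeated prompts. Treating these as problems is an assumption made here for tidiness, since this guide only requires one prompt per line, and the function name `check_prompt_file` is hypothetical.

```python
from pathlib import Path


def check_prompt_file(path):
    """Return (prompts, warnings) for a one-prompt-per-line .txt file."""
    prompts, warnings, seen = [], [], set()
    text = Path(path).read_text(encoding="utf-8")
    for number, line in enumerate(text.splitlines(), start=1):
        prompt = line.strip()
        if not prompt:
            warnings.append(f"line {number}: blank line")
            continue
        if prompt in seen:
            warnings.append(f"line {number}: duplicate prompt")
            continue
        seen.add(prompt)
        prompts.append(prompt)
    return prompts, warnings
```

Running `check_prompt_file("customer_support.txt")` on a clean file returns the prompts in order with an empty warnings list. Any warnings point to the line numbers to fix before upload.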

Integration with A/B Testing

Seamless Workflow:
1. Upload Once: Upload prompt collections to the Prompt Library
2. Reuse Multiple Times: Select from dropdown in multiple A/B testing sessions
