LLM Labs
Overview
This user guide provides comprehensive instructions for managing Large Language Models (LLMs) in InsightFinder AI. LLM Labs lets you create and configure model sessions, chat with models, and monitor and evaluate LLM Prompts & Responses in a controlled environment with built-in analytics and safety measures. The purpose of LLM Labs is to enable data scientists and ML engineers to select the best model for their needs.
Primary Use Cases for LLM Labs
- Test New LLM Versions
  - Evaluate how a new version of an LLM (e.g., ChatGPT-5) performs compared to previous versions.
  - Identify improvements or regressions in accuracy, relevance, or safety.
- Compare Different LLMs or Versions
  - Benchmark two different models (e.g., GPT-4.1 vs. Gemini 2.5) or different versions of the same model using the same prompts.
  - Measure differences in output quality, token efficiency, and handling of edge cases.
- Trust & Safety Analysis
  - Assess LLM outputs for bias, hallucinations, or unsafe responses.
  - Use prompt templates or interactive chat sessions to test guardrail effectiveness.
- Performance & Cost Insights (optional but valuable)
  - Monitor token consumption, latency, and other performance metrics when running evaluations.
What is an LLM Session?
An LLM Session is a dedicated environment where you can interact with a Large Language Model in a controlled, monitored, and evaluated manner. Each session maintains:
- Model Configuration: Specific model type, version, and settings
- Real-time Evaluation: Continuous assessment of model responses for quality, safety, and relevance
- Performance Metrics: Analytics on response quality, safety measures, and potential issues
Purpose and Benefits
- Quality Assurance: Every response is automatically evaluated for relevance, accuracy, and safety
- Safety Monitoring: Built-in guardrails detect and prevent malicious or harmful content
- Performance Tracking: Monitor how well your AI interactions are performing over time
- Organized Conversations: Keep different AI conversations separate and well-organized
- Model Comparison: Test different models and configurations side-by-side
- Compliance: Ensure AI interactions meet safety and ethical standards
Getting Started
To access LLM Session Management in InsightFinder AI:
- Log into your InsightFinder AI platform
- Navigate to the LLM Labs section
- Click on “Create Session” to begin
Creating a New Session
Step 1: Initiate Session Creation
- Click the “Create Session” button in the LLM Labs interface
- A session configuration dialog will appear
Step 2: Basic Session Information
Session Name
- Provide a descriptive name for your session
- Example: “Customer Support Bot V1”, “Marketing Content Generator”, “Research Assistant”
- Choose names that help you identify the session’s purpose
Step 3: Model Selection
You have two main options for model configuration:
Option A: Built-in Models
- Description: Pre-configured models with no setup required
- Benefits:
- Instant availability
- Optimized configurations
- No API key management needed
- Available Models: Various pre-configured LLM options
- Setup: Simply select from the dropdown menu
Option B: API-Connected Models
- Description: Connect to external LLM providers via API
- Benefits:
- Custom configurations
Step 4: API Configuration (For API Models Only)
If you selected an API-connected model:
Model Type Selection – Choose your preferred LLM provider (OpenAI, Anthropic, etc.)
Version Selection – Select the specific model version you want to use – Example: GPT-4, GPT-3.5-turbo, Claude-3, etc.
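If you are unsure whether your credentials or chosen version are valid, it can help to test them against the provider before entering them in the dialog. Below is a minimal sketch using the OpenAI Python SDK as one example; the package, model string, and environment variable are assumptions about your provider setup, not part of the InsightFinder interface.

```python
# Minimal sketch: check that an API key and model version work with the
# provider (OpenAI shown here) before configuring an API-connected session.
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY environment variable.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4",  # the version you plan to select in the session dialog
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```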
Step 5: Confirm and Create
- Review your session configuration
- Click “Confirm”
- Wait for the session to initialize
Waiting for Session Ready
- After confirming your session, monitor the status indicator
- The session will show “Initializing” while setting up
- Wait for the status to change to “Ready”
- Once ready, you can begin interacting with your LLM
Chatting with Your Model
Starting a Conversation
- Locate Your Ready Session: Find a session with “Ready” status
- Click the Chat Icon: This opens the conversation interface
- Start Typing: Enter your message in the chat input field
- Send Your Message: Press Enter or click Send
- Receive Response: The LLM will respond with real-time evaluation
Batch Prompts
In addition to individual conversations, you can process multiple prompts at once using the Batch Prompts feature.
You can provide batch prompts in three different ways:
- Upload File: Upload a .txt file containing prompts; each prompt must be on its own line (see the sketch after this list)
- Manual Entry: Enter prompts directly in the text area, one prompt per line
- Dropdown Selection: Choose from previously uploaded prompt collections; these are the prompts saved in your Prompt Library
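For the Upload File option, the expected format is plain text with one prompt per line. A minimal sketch of preparing such a file (the file name and prompts are illustrative):

```python
# Write a batch-prompts file for the Upload File option:
# plain text, one prompt per line.
prompts = [
    "Summarize our refund policy in two sentences.",
    "Draft a polite reply to a customer asking about shipping delays.",
    "List three risks of storing PII in chat logs.",
]

with open("batch_prompts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(prompts))
```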
Evaluation and Analytics
Understanding Real-time Evaluation
Every response from your LLM is automatically evaluated across multiple dimensions. Click the Terminal/Evaluation icon to view detailed results.
Evaluation Categories
1. Hallucination & Relevance
Answer Irrelevance – Scale: 1.0 – 5.0
Example: 4.0 / 5.0
Meaning: Measures how relevant the AI’s response is to your question
What it catches: Off-topic responses, misunderstood questions
Hallucination – Scale: 1.0 – 5.0
Example: 4.0 / 5.0
Meaning: Detects when the AI makes up false information
What it catches: Fabricated facts, non-existent references, made-up statistics
2. Safety, Guardrails & Toxicity
Malicious Prompt – Scale: 1.0 – 5.0
Example: 3.0 / 5.0
Meaning: Identifies attempts to manipulate the AI into harmful responses
What it catches: Jailbreak attempts, prompt injection, manipulation tactics
PII/PHI Leakage – Scale: 1.0 – 5.0
Example: 3.0 / 5.0
Meaning: Detects potential leakage of Personally Identifiable Information (PII) or Protected Health Information (PHI)
What it catches:
– Social security numbers, credit card numbers, phone numbers
– Email addresses, home addresses, names
– Medical records, health conditions, treatment information
– Financial data, account numbers, sensitive personal details
3. Bias
Bias – Result: Passed/Failed
Meaning: Detects unfair bias in AI responses
What it catches: Gender bias, racial bias, cultural stereotypes
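How you consume these scores outside the UI depends on your deployment and export format. The sketch below is only an illustration: it assumes a simple dictionary of scores on the 1.0 – 5.0 scales above, treats higher scores as passing against an example threshold, and handles Bias as a separate Passed/Failed flag. The field names and threshold are assumptions, not an InsightFinder API.

```python
# Illustrative post-processing of per-response evaluation scores.
# Score names, values, and the threshold are assumptions for this sketch;
# the 1.0-5.0 scales follow the categories described above.
scores = {
    "answer_irrelevance": 4.0,
    "hallucination": 4.0,
    "malicious_prompt": 3.0,
    "pii_phi_leakage": 3.0,
}
bias_passed = True  # Bias is reported as Passed/Failed rather than a score

THRESHOLD = 3.5  # example cutoff: scores at or above this count as passing

results = {name: score >= THRESHOLD for name, score in scores.items()}
results["bias"] = bias_passed

for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```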
Session Actions
Restarting a Stopped Session
If your session status shows “Stopped”:
- Locate the Restart Button (Play/Pause Icon): Find it in the Actions column
- Click Restart: This will reinitialize your session
- Wait for Ready Status: Monitor until status changes to “Ready”
- Resume Activity: Continue chatting or evaluating
Deleting a Session
To Delete a Session:
- Click the Delete Icon: Located in the Actions column
- Confirm Deletion: A confirmation dialog will appear
- Confirm: Click “Yes” or “Delete” to proceed
- Session Removed: The session will be permanently deleted
When to Delete Sessions:
- Completed testing scenarios
- No longer needed conversations
- Freeing up session limits
- Cleaning up workspace
LLM Model Compare Dashboard
Overview
The LLM Model Compare dashboard feature allows you to experiment with prompts across different models, compare their responses, test parameters, and measure token usage. This powerful tool helps you make informed decisions about which model best suits your specific use case.
Comparison Categories
You can compare models across three key evaluation dimensions:
1. Hallucination & Relevance
- Measures accuracy and relevance of model responses
- Lower percentages indicate better performance (fewer hallucinations)
- Helps identify models that provide factually accurate information
2. Safety, Guardrails & Toxicity
- Evaluates how well models handle potentially harmful content
- Higher percentages may indicate more safety violations detected
- Critical for applications requiring strict content moderation
3. Bias
- Detects unfair bias in model responses
- Lower percentages indicate less biased outputs
- Important for ensuring fair and inclusive AI interactions
Token Usage Tracking
Monitor your token consumption across all model experiments:
Accessing Token Usage
- Click the Token Usage icon in the top-right corner of the Model Compare interface
- View detailed token consumption metrics
Understanding Token Metrics
Total Tokens – Combined input and output tokens used
- Shows current usage against your limit
- Helps track consumption patterns
Input Tokens – Tokens used for your prompts and questions
- Represents the “cost” of your inputs to the models
Output Tokens – Tokens generated by the models in their responses
- Often higher than input tokens for detailed responses
- Major component of API costs
Token Limit – Your allocated token budget
- Resets based on your subscription plan
- Important for cost management
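For rough budgeting before a large batch run, you can estimate input token counts locally. The sketch below uses the open-source tiktoken tokenizer as an example; it approximates OpenAI-style models only, and the counts shown in the Token Usage view may differ.

```python
# Rough local estimate of input tokens for a batch of prompts,
# using the tiktoken tokenizer (approximates OpenAI-style models only).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

with open("batch_prompts.txt", encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]

input_tokens = sum(len(enc.encode(p)) for p in prompts)
print(f"Estimated input tokens for {len(prompts)} prompts: {input_tokens}")
```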
A/B Testing – Compare Prompts Across Models & Datasets
Overview
The Multidimensional Comparison feature allows you to evaluate how different Prompt Versions perform across various Models and Datasets, enabling data-driven decisions for prompt optimization. By integrating Versioning with Dataset Management, this tool helps you analyze performance variations across model-dataset combinations and identify the Best Prompt Version, so you can select the most effective prompt for your specific use case based on evaluation results.
1. Prompt Template Version Control
Step 1: Initialize Prompt Templates
- Navigate to LLM Labs: Go to the LLM Labs section in InsightFinder AI
- Access Prompt Library: Click the Prompt Library tab to manage your templates. Use the Prompts Template Upload button to create your initial Version 1.0 prompt templates
Step 2: Manage Template Iterations
- Edit Prompt Template: Click the Edit icon on an existing template to modify your prompt content or parameters
- Save and Version Changes: Choose Overwrite for minor refinements to update the current version, or select Create New Version to define a distinct variation for comparative testing
2. Dataset Integration
- Access Dataset Tab: Click the Dataset tab to manage your testing data
- Upload Testing Data: Use the Upload Dataset button to add data files that will serve as the evaluation environment for your prompts
- Identify Applicable Templates: For each dataset, you can tag an Applicable Prompt Template. This helps you better record and track the suggested relationships between specific data and templates
3. Link Dataset within the Prompt
- Trigger Dataset Menu: When creating or editing prompt content, type the [] character to call up a dropdown list of available datasets
- Insert Dataset References: Select the desired dataset and its column to insert a dynamic reference (e.g., [DatasetSample.txt]) at the cursor position
- Establish Prompt-Data Linkage: Once inserted, the prompt template stores the link to the data source, ensuring the prompt is correctly associated with the specific dataset at the chosen location (a conceptual sketch follows this list)
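Conceptually, the inserted reference stands in for values drawn from the linked dataset at evaluation time; InsightFinder performs this substitution for you. The sketch below only illustrates the idea with a local file and the example placeholder from above.

```python
# Illustration only: how a prompt template with a dataset reference
# conceptually expands into one prompt per dataset row.
# InsightFinder performs this substitution automatically; the file and
# template here are hypothetical.
template = "Classify the sentiment of this review: [DatasetSample.txt]"

with open("DatasetSample.txt", encoding="utf-8") as f:
    samples = [line.strip() for line in f if line.strip()]

expanded = [template.replace("[DatasetSample.txt]", s) for s in samples]
for prompt in expanded[:3]:
    print(prompt)
```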
4. Multidimensional Compare & Analyze
- Access LLM Evaluation: Click the LLM Evaluation tab to manage your evaluation sessions and initiate comparisons
- Compare Models & Prompts: Click Compare Models. This feature allows you to evaluate different Prompt Template versions across multiple Models. The system will automatically run tests based on the linked datasets within each prompt. By analyzing responses across these dimensions, you can identify the optimal configuration for your specific requirements.
- Test the Matrix: Evaluate your templates across various Model/Dataset combinations while monitoring Response Quality, Token Usage, and Latency
- Identify the Winner: In the View Details page, the system automatically highlights the top-performing version with a “Winner” badge. This conclusion is derived from a multi-dimensional analysis that tests your Prompt Versions against the selected Model and Dataset matrix
- Analyze Evaluation Summary: Scroll to the bottom to view the Evaluation Summary table. You can compare critical performance metrics across different datasets and models (see the sketch after the list below), including:
- Total Prompts – Number of prompts tested in this comparison
- Passed Evaluations – Count and percentage of responses that met quality standards – Higher percentages indicate better model performance
- Failed Evaluations – Count and percentage of responses that didn’t meet standards – Lower numbers are better
- Top Failed Evaluation – Identifies the most common failure type (if any) – Helps understand model weaknesses
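If you export per-response comparison results for offline analysis, the summary columns above can be recomputed with a short script. The sketch below assumes a hypothetical CSV export with one row per evaluated response; the file name and column names are assumptions, not a documented export format.

```python
# Hypothetical offline recomputation of the Evaluation Summary columns
# from an exported per-response results file. Column names are assumptions:
# model, dataset, passed (bool), failure_type (str).
import pandas as pd

df = pd.read_csv("comparison_results.csv")

summary = df.groupby(["model", "dataset"]).agg(
    total_prompts=("passed", "size"),
    passed_evaluations=("passed", "sum"),
)
summary["failed_evaluations"] = summary["total_prompts"] - summary["passed_evaluations"]
summary["pass_rate_pct"] = 100 * summary["passed_evaluations"] / summary["total_prompts"]
print(summary)

# Most common failure type per model/dataset combination
top_failed = (
    df.loc[df["passed"] == False]
      .groupby(["model", "dataset"])["failure_type"]
      .agg(lambda s: s.value_counts().idxmax())
)
print(top_failed)
```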
5. View Comparison History
- Access Comparison Results: Click the Prompt Comparison Result tab to view the performance history of your prompt templates across different models and datasets.
- Analyze & Select the Best Prompt: Review the Best Model recommendations for each version and click View Details to inspect specific response data
- Pick Your Favorite: You can then use Pick Your Favorite to mark the most effective prompt version for your production needs