LLM Labs
Overview
This user guide provides comprehensive instructions for managing Large Language Models (LLMs) in InsightFinder AI. LLM Labs lets you create, configure, and chat with LLM sessions, and monitor and evaluate prompts and responses in a controlled environment with built-in analytics and safety measures. Its purpose is to help data scientists and ML engineers select the model that best fits their needs.
Primary Use Cases for LLM Labs
- Test New LLM Versions
  - Evaluate how a new version of an LLM (e.g., ChatGPT-5) performs compared to previous versions.
  - Identify improvements or regressions in accuracy, relevance, or safety.
- Compare Different LLMs or Versions
  - Benchmark two different models (e.g., GPT-4.1 vs. Gemini 2.5) or different versions of the same model using the same prompts.
  - Measure differences in output quality, token efficiency, and handling of edge cases.
- Trust & Safety Analysis
  - Assess LLM outputs for bias, hallucinations, or unsafe responses.
  - Use prompt templates or interactive chat sessions to test guardrail effectiveness.
- Performance & Cost Insights (optional but valuable)
  - Monitor token consumption, latency, and other performance metrics when running evaluations.
What is an LLM Session?
An LLM Session is a dedicated environment where you can interact with a Large Language Model in a controlled, monitored, and evaluated manner. Each session maintains:
- Model Configuration: Specific model type, version, and settings
- Real-time Evaluation: Continuous assessment of model responses for quality, safety, and relevance
- Performance Metrics: Analytics on response quality, safety measures, and potential issues
Purpose and Benefits
- Quality Assurance: Every response is automatically evaluated for relevance, accuracy, and safety
- Safety Monitoring: Built-in guardrails detect and prevent malicious or harmful content
- Performance Tracking: Monitor how well your AI interactions are performing over time
- Organized Conversations: Keep different AI conversations separate and well-organized
- Model Comparison: Test different models and configurations side-by-side
- Compliance: Ensure AI interactions meet safety and ethical standards
Getting Started
To access LLM Session Management in InsightFinder AI:
- Log into your InsightFinder AI platform
- Navigate to the LLM Labs section
- Click on “Create Session” to begin
Creating a New Session
Step 1: Initiate Session Creation
- Click the “Create Session” button in the LLM Labs interface
- A session configuration dialog will appear
Step 2: Basic Session Information
Session Name
- Provide a descriptive name for your session
- Examples: "Customer Support Bot V1", "Marketing Content Generator", "Research Assistant"
- Choose names that help you identify the session's purpose
Step 3: Model Selection
You have two main options for model configuration:
Option A: Built-in Models
- Description: Pre-configured models with no setup required
- Benefits:
- Instant availability
- Optimized configurations
- No API key management needed
- Available Models: Various pre-configured LLM options
- Setup: Simply select from the dropdown menu
Option B: API-Connected Models
- Description: Connect to external LLM providers via API
- Benefits:
- Custom configurations
Step 4: API Configuration (For API Models Only)
If you selected an API-connected model:
Model Type Selection
- Choose your preferred LLM provider (OpenAI, Anthropic, etc.)
Version Selection
- Select the specific model version you want to use
- Examples: GPT-4, GPT-3.5-turbo, Claude-3, etc.
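As a planning aid, the sketch below shows the kind of information a session configuration captures. The field names and structure are illustrative assumptions for your own notes or automation scripts, not InsightFinder's actual schema or API.

```python
# Illustrative only: these field names are assumptions, not InsightFinder's schema.
session_config = {
    "session_name": "Customer Support Bot V1",   # descriptive name from Step 2
    "model_source": "api",                       # "built-in" or "api"
    "provider": "OpenAI",                        # e.g., OpenAI, Anthropic
    "model_version": "gpt-4",                    # e.g., GPT-4, GPT-3.5-turbo, Claude-3
    "api_key": "<your-provider-api-key>",        # keep secrets out of source control
}
```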
Step 5: Confirm and Create
- Review your session configuration
- Click “Confirm”
- Wait for the session to initialize
Waiting for Session Ready
- After confirming your session, monitor the status indicator
- The session will show “Initializing” while setting up
- Wait for the status to change to “Ready”
- Once ready, you can begin interacting with your LLM
Chatting with Your Model
Starting a Conversation
- Locate Your Ready Session: Find a session with “Ready” status
- Click the Chat Icon: This opens the conversation interface
- Start Typing: Enter your message in the chat input field
- Send Your Message: Press Enter or click Send
- Receive Response: The LLM will respond with real-time evaluation
In addition to individual conversations, you have the option to process multiple prompts at once using the Batch Prompts feature:
You can provide batch prompts in three different ways:
- Upload File: Upload a .txt file containing prompts, with each prompt on a new line (a file-preparation sketch follows this list)
- Manual Entry: Enter prompts directly in the text area, one prompt per line
- Dropdown Selection: Choose from previously uploaded prompt collections saved in your Prompt Library
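If you prepare batch files programmatically, a minimal sketch such as the following writes one prompt per line to a .txt file; the filename and prompts are examples only.

```python
# Write a batch prompt file with one prompt per line (filename and prompts are examples).
prompts = [
    "What is the capital of France?",
    "Explain quantum computing in simple terms",
    "How do I cancel my subscription?",
]

with open("batch_prompts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(prompts) + "\n")
```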
Evaluation and Analytics
Understanding Real-time Evaluation
Every response from your LLM is automatically evaluated across multiple dimensions. Click the Terminal/Evaluation icon to view detailed results.
Evaluation Categories
1. Hallucination & Relevance
Answer Irrelevance – Scale: 1.0 – 5.0
Example: 4.0 / 5.0
Meaning: Measures how relevant the AI’s response is to your question
What it catches: Off-topic responses, misunderstood questions
Hallucination – Scale: 1.0 – 5.0
Example: 4.0 / 5.0
Meaning: Detects when the AI makes up false information
What it catches: Fabricated facts, non-existent references, made-up statistics
2. Safety, Guardrails & Toxicity
Malicious Prompt – Scale: 1.0 – 5.0
Example: 3.0 / 5.0
Meaning: Identifies attempts to manipulate the AI into harmful responses
What it catches: Jailbreak attempts, prompt injection, manipulation tactics
PII/PHI Leakage – Scale: 1.0 – 5.0
Example: 3.0 / 5.0
Meaning: Detects potential leakage of Personally Identifiable Information (PII) or Protected Health Information (PHI)
What it catches:
- Social security numbers, credit card numbers, phone numbers
- Email addresses, home addresses, names
- Medical records, health conditions, treatment information
- Financial data, account numbers, sensitive personal details
3. Bias
Bias – Result: Passed/Failed
Meaning: Detects unfair bias in AI responses
What it catches: Gender bias, racial bias, cultural stereotypes
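If you export evaluation scores for offline review, a hedged sketch like the one below shows how you might flag low-scoring responses against a threshold. The record structure and the 3.5 cutoff are assumptions for illustration, not a documented InsightFinder export format.

```python
# Hypothetical evaluation records; the structure and threshold are assumptions.
MIN_SCORE = 3.5  # example cutoff on the 1.0 - 5.0 scale

evaluations = [
    {"prompt_id": 1, "metric": "Hallucination", "score": 4.0},
    {"prompt_id": 1, "metric": "Malicious Prompt", "score": 3.0},
]

for record in evaluations:
    status = "pass" if record["score"] >= MIN_SCORE else "needs review"
    print(f"Prompt {record['prompt_id']} - {record['metric']}: {record['score']} ({status})")
```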
Session Actions
Restarting a Stopped Session
If your session status shows “Stopped”:
- Locate the Restart Button (Play/Pause Icon): Find it in the Actions column
- Click Restart: This will reinitialize your session
- Wait for Ready Status: Monitor until status changes to “Ready”
- Resume Activity: Continue chatting or evaluating
Deleting a Session
To Delete a Session:
- Click the Delete Icon: Located in the Actions column
- Confirm Deletion: A confirmation dialog will appear
- Confirm: Click “Yes” or “Delete” to proceed
- Session Removed: The session will be permanently deleted
When to Delete Sessions:
- Completed testing scenarios
- Conversations that are no longer needed
- Freeing up session limits
- Cleaning up your workspace
LLM Model Compare Dashboard
Overview
The LLM Model Compare dashboard feature allows you to experiment with prompts across different models, compare their responses, test parameters, and measure token usage. This powerful tool helps you make informed decisions about which model best suits your specific use case.
Comparison Categories
You can compare models across three key evaluation dimensions:
1. Hallucination & Relevance
- Measures accuracy and relevance of model responses
- Lower percentages indicate better performance (fewer hallucinations)
- Helps identify models that provide factually accurate information
2. Safety, Guardrails & Toxicity
- Evaluates how well models handle potentially harmful content
- Lower percentages indicate fewer safety violations detected
- Critical for applications requiring strict content moderation
3. Bias
- Detects unfair bias in model responses
- Lower percentages indicate less biased outputs
- Important for ensuring fair and inclusive AI interactions
Token Usage Tracking
Monitor your token consumption across all model experiments:
Accessing Token Usage
- Click the Token Usage icon in the top-right corner of the Model Compare interface
- View detailed token consumption metrics
Understanding Token Metrics
Total Tokens
- Combined input and output tokens used
- Shows current usage against your limit
- Helps track consumption patterns
Input Tokens
- Tokens used for your prompts and questions
- Represents the "cost" of your inputs to the models
Output Tokens
- Tokens generated by the models in their responses
- Often higher than input tokens for detailed responses
- Major component of API costs
Token Limit
- Your allocated token budget
- Resets based on your subscription plan
- Important for cost management
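These metrics relate to each other with simple arithmetic; the sketch below walks through the calculation with example numbers (the figures are illustrative, not defaults of any subscription plan).

```python
# Illustrative token accounting; the numbers are examples, not plan defaults.
input_tokens = 1_250     # tokens sent in prompts and questions
output_tokens = 3_400    # tokens generated in model responses
token_limit = 100_000    # allocated budget for the billing period

total_tokens = input_tokens + output_tokens
remaining = token_limit - total_tokens
usage_pct = 100 * total_tokens / token_limit

print(f"Used {total_tokens} of {token_limit} tokens ({usage_pct:.1f}%), {remaining} remaining")
```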
A/B Testing – Compare Prompts Across Models
Overview
The A/B Testing feature in LLM Labs allows you to compare how different models respond to the same prompts, enabling data-driven decisions about model selection and prompt optimization. This powerful comparison tool helps you understand model behavior differences and choose the best performing model for your specific use case.
Accessing A/B Testing
- Navigate to LLM Labs: Go to the LLM Labs section in InsightFinder AI
- Click “Compare Models”: Find and click the “Compare Models” button
- Configure Your Comparison: Set up your testing parameters
Comparison Configuration
Step 1: Comparison Mode
Currently available mode: Same Prompt, Different Models (automatically selected)
- Tests identical prompts across multiple models
- Reveals how different models interpret and respond to the same input
- Perfect for model selection and performance benchmarking
Step 2: Prompt Entry Method
You have two options for providing your test prompts:
Option A: Input Prompts
Description: Manually enter prompts directly into the interface
Format: One prompt per line
Best for: Quick tests with a few specific prompts
Example:
What is 1 + 2?
What is 2 + 3?
Explain quantum computing in simple terms
Option B: Select Prompt Template
Description: Use pre-existing prompt collections or upload files
Two sub-options available:
- Select from Dropdown
  - Choose from existing prompt templates in your workspace
  - Pre-configured prompt sets for common use cases
  - Consistent testing across different comparison sessions
- Upload File
  - File Format: .txt file with one prompt per line
  - Preparation: Create a text file with each prompt on a separate line
  - Benefits: Bulk testing with large prompt sets
  - Auto-Save to Library: Uploaded prompts are automatically saved to the Prompt Library
  - Example file content:
    What is the capital of France?
    How do you solve a quadratic equation?
    Write a short story about friendship
Step 3: Model Selection
Requirements: Select exactly two models for comparison
Model Options:
- Choose from models you've created using the "Create Session" feature
- Only active/ready sessions appear in the dropdown
- Models must be in "Ready" status to be available for comparison
Step 4: Start Comparison
- Review Settings: Verify your comparison configuration
- Click “Compare”: Initiate the A/B testing process
- Wait for Results: The system will process prompts across both models
Understanding Comparison Results
Side-by-Side Response Display
The results are presented in a clear, comparative format: each prompt appears once, with the two models' responses displayed side by side.
Response Analysis Features
For Each Model Response:
- Response Content: The actual model output
- Real-time Evaluations: Quality and safety assessments
Evaluation Summary
At the end of each comparison, you’ll receive a comprehensive Evaluation Summary:
Key Summary Components
Total Prompts
- Number of prompts tested in this comparison
Passed Evaluations
- Count and percentage of responses that met quality standards
- Higher percentages indicate better model performance
Failed Evaluations
- Count and percentage of responses that didn't meet standards
- Lower numbers are better
Top Failed Evaluation
- Identifies the most common failure type (if any)
- Helps understand model weaknesses
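As a rough sketch of how the summary figures fit together, the example below derives the pass rate and the top failed evaluation from hypothetical counts; the arithmetic is an assumption about how the percentages are computed.

```python
# Example summary arithmetic; the counts are illustrative assumptions.
from collections import Counter

total_prompts = 20
failed_by_type = Counter({"Hallucination": 2, "PII/PHI Leakage": 1})

failed = sum(failed_by_type.values())
passed = total_prompts - failed
pass_rate = 100 * passed / total_prompts
top_failure = failed_by_type.most_common(1)[0][0] if failed else "None"

print(f"Passed: {passed}/{total_prompts} ({pass_rate:.0f}%), top failed evaluation: {top_failure}")
```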
Common Use Cases
1. Model Selection for Production
- Compare candidate models before deployment
- Validate model performance with real-world prompts
- Ensure consistent quality across different input types
2. Prompt Optimization
- Test how different models respond to prompt variations
- Identify which models work best with your prompt style
- Optimize prompts for specific model capabilities
3. Quality Assurance
- Regular testing to monitor model performance
- Detect degradation or improvements in model responses
- Ensure compliance with quality standards
4. Research and Development
- Understand model behavior differences
- Document model capabilities and limitations
- Support evidence-based model recommendations
Prompt Library Management
Overview
The Prompt Library is a centralized repository where you can store, organize, and manage your prompt collections for reuse across different A/B testing sessions and model comparisons.
How Prompts Are Saved
- Automatic Saving: When you upload a .txt file during A/B testing setup, the prompts are automatically saved to the Prompt Library
- Reusability: Saved prompts can be selected from the dropdown in future comparisons
- Organization: Keep your frequently used prompt sets organized and easily accessible
Manual Upload to Prompt Library
Alternative Upload Method:
1. Navigate to Prompt Library: Click on “Prompt Library” in the LLM Labs interface
2. Click Upload Button: Find and click the “Prompts Upload” button
3. Select File: Choose your .txt file with prompts (one prompt per line)
4. Confirm Upload: The prompts will be added to your library
Managing Prompt Library
Viewing Your Prompts
- Access the Prompt Library section to see all saved prompt collections
- Browse through your uploaded prompt files
- Preview prompt content before selection
Deleting Prompts
1. Navigate to Prompt Library: Go to the Prompt Library section
2. Locate Target Prompts: Find the prompt collection you want to remove
3. Click Delete Button: Look for the delete button in the Actions column
4. Confirm Deletion: Confirm the removal of the prompt collection
Best Practices for Prompt Library
Organization Tips:
- Use descriptive filenames when uploading (e.g., "customer_service_prompts.txt", "technical_qa_prompts.txt")
- Group related prompts together in single files
- Regularly review and clean up unused prompt collections
- Keep prompt collections focused on specific use cases or domains
File Preparation Guidelines:
- One prompt per line: Ensure each prompt is on a separate line
Example Organized Prompt Collections:
customer_support.txt
How can I reset my password?
What are your business hours?
How do I cancel my subscription?
Where can I find my order history?
technical_qa.txt
Explain the difference between REST and GraphQL APIs
What is machine learning?
How does blockchain technology work?
Define microservices architecture
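Before uploading, you may want to sanity-check a prompt file against the one-prompt-per-line rule. This minimal sketch strips blank lines and reports the prompt count; the filename comes from the example collections above.

```python
# Minimal check that a prompt file contains one non-empty prompt per line.
def load_prompts(path):
    with open(path, encoding="utf-8") as f:
        # Drop blank lines and surrounding whitespace.
        return [line.strip() for line in f if line.strip()]

prompts = load_prompts("customer_support.txt")  # filename from the example above
print(f"{len(prompts)} prompts ready for upload")
```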
Integration with A/B Testing
Seamless Workflow:
1. Upload Once: Upload prompt collections to the Prompt Library
2. Reuse Multiple Times: Select from dropdown in multiple A/B testing sessions