ChainForge: An Open Source Visual Programming Environment for Testing and Evaluating the Effectiveness of Large Language Model Prompts

General Introduction

ChainForge is an open source visual programming environment for testing and evaluating prompts for large language models (LLMs). It provides a data-flow prompt engineering environment in which users can quickly explore and analyze how different prompts affect the quality of LLM responses. ChainForge supports a wide range of model providers, including OpenAI, HuggingFace, and Anthropic, so users can compare and evaluate multiple models in a single interface. The tool is particularly well suited to early-stage prompt exploration and rapid iteration, helping users tune prompts and model settings for the best response quality.

Feature List

Multi-model querying: Query multiple LLMs at the same time to quickly test prompt ideas and variants.
Response quality comparison: Compare response quality across prompts, models, and model settings.
Visual evaluation: Set up evaluation metrics and immediately visualize results across prompts, parameters, models, and settings.
Multi-turn conversations: Run conversations across template parameters and chat models, and inspect and evaluate the output at each turn.
Templated prompts: Template not only the initial prompt but also follow-up chat messages.
Example evaluation flows: Several example evaluation flows are included to demonstrate possible usage scenarios.
Local and online use: Install ChainForge locally or try it online, whichever suits your workflow.
Multiple provider support: Supports OpenAI, HuggingFace, Anthropic, Google PaLM2, Azure OpenAI, and many other model providers.
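
To make the idea of templated prompts and multi-model querying concrete, here is a minimal plain-Python sketch of how a template with variables fans out into one query per combination of variable values and model. It is not ChainForge's code; the function names, the sample template, and the model names "model-a" and "model-b" are all hypothetical.

```python
from itertools import product


def expand_prompts(template, variables):
    """Return one filled-in prompt per combination of variable values."""
    names = list(variables)
    prompts = []
    for combo in product(*(variables[name] for name in names)):
        # Map each variable name to the value chosen for this combination.
        prompts.append(template.format(**dict(zip(names, combo))))
    return prompts


def build_queries(template, variables, models):
    """Pair every expanded prompt with every model name."""
    return [
        (model, prompt)
        for prompt in expand_prompts(template, variables)
        for model in models
    ]


# Hypothetical template, variable values, and model names.
template = "Translate '{phrase}' into {language}."
variables = {
    "phrase": ["good morning", "thank you"],
    "language": ["French", "German"],
}
models = ["model-a", "model-b"]

queries = build_queries(template, variables, models)
print(len(queries))  # 2 phrases x 2 languages x 2 models = 8
```

The number of queries grows multiplicatively with each added variable or model, which is why a tool that runs and tabulates them automatically is useful.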

Usage Guide

Installation process

Local installation

Make sure Python 3.8 or later is installed.
Run the following command to install ChainForge:

   pip install chainforge

After the installation is complete, run the following command to start the ChainForge server:

   chainforge serve

Open your browser and visit localhost:8000. You can now start using ChainForge.
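
If the pip install step fails, the Python version requirement is worth checking first. The following is a small sketch that uses only the standard library; the helper name meets_minimum is hypothetical, and the 3.8 minimum is taken from the step above.

```python
import sys

# Minimum version stated in the installation steps above.
MINIMUM = (3, 8)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


if meets_minimum():
    print("Python version is new enough for ChainForge.")
else:
    print("Python 3.8 or later is required; found", sys.version.split()[0])
```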

Installing with Docker

Build the Docker image (run this from the directory that contains ChainForge's Dockerfile):

   docker build -t chainforge .

Run the Docker container:

   docker run -p 8000:8000 chainforge

Open your browser and visit 127.0.0.1:8000. You can now start using ChainForge.
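
If the page does not load, it can help to confirm that something is actually listening on port 8000 before looking elsewhere. The sketch below uses only Python's standard socket module; the helper name is_port_open is hypothetical, and the check only tells you that a TCP connection succeeds, not that the listener is ChainForge.

```python
import socket


def is_port_open(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("reachable" if is_port_open() else "not reachable")
```

With Docker, a "not reachable" result often means the -p 8000:8000 port mapping was omitted or the container has exited.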

Guidelines for use

Set the API keys: Click the Settings icon in the upper right corner and enter your API keys for OpenAI, Anthropic, Google PaLM, and so on.
Create a new project: Click the "New Project" button and select the desired model and prompt template.
Add prompts and models: Add prompt templates and models to the project and set the parameters you want to test.
Run the evaluation: Click the "Run" button. ChainForge queries all selected models and displays their responses.
Compare and visualize: Use the visualization tools to compare response quality across prompts and models, then choose the best prompt and model settings.
Save and share: When you are done, save the evaluation results and generate a link to share them with others.

Example Evaluation Flows

ChainForge provides several example evaluation flows to help users get started quickly. For example, you can use the "Response Length Comparison" example to compare the response lengths of different models given the same prompt. You can also create custom evaluation flows with your own evaluation metrics and visualizations.
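
The kind of comparison the "Response Length Comparison" example performs can be sketched in a few lines of plain Python. This is an illustration of the metric only, not ChainForge's implementation; the model names and response texts are made-up sample data.

```python
from statistics import mean


def mean_length_by_model(responses):
    """Map each model name to the mean character length of its responses."""
    return {
        model: mean(len(text) for text in texts)
        for model, texts in responses.items()
    }


# Hypothetical responses to the same prompt from two hypothetical models.
responses = {
    "model-a": ["Paris.", "The capital of France is Paris."],
    "model-b": ["Paris is the capital city of France, located on the Seine."],
}

for model, avg in mean_length_by_model(responses).items():
    print(f"{model}: {avg:.1f} characters on average")
```

Character count is a crude metric; swapping len(text) for a word or token count changes the metric without changing the structure of the comparison.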

Advanced Features

Custom evaluation nodes: Users can write Python code to define their own evaluation nodes for more complex response scoring.
Multi-turn conversation evaluation: Evaluate multi-turn conversations to test response quality at each turn.
Data export: Evaluation results can be exported to an Excel spreadsheet for further analysis.
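
As a rough illustration of what a custom Python evaluator might look like, the sketch below scores a response by whether it contains an expected answer. It assumes the evaluation node calls a function named evaluate once per response and passes an object exposing the response text and the template variables used to produce it. The FakeResponse class, its text and var attributes, and the "answer" variable are hypothetical stand-ins for this example; check the ChainForge documentation for the actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the object an evaluation node passes in."""
    text: str
    var: dict = field(default_factory=dict)


def evaluate(response):
    """Return True if the response text contains the expected answer.

    Assumes the expected answer was supplied as a template variable
    named "answer" (an assumption made for this sketch).
    """
    expected = response.var.get("answer", "")
    return bool(expected) and expected.lower() in response.text.lower()


# Try the evaluator locally on made-up responses.
print(evaluate(FakeResponse("The answer is Paris.", {"answer": "paris"})))
print(evaluate(FakeResponse("I think it is Lyon.", {"answer": "Paris"})))
```

Returning a boolean or a number per response keeps the scores easy to aggregate and plot by prompt or model.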

ChainForge helps researchers, developers, and data scientists tune prompts and model settings to improve the quality of LLM responses.