LangGraph Examples
LangGraph is an emerging framework for constructing and managing workflows that combine language model agents, APIs, and other tools in a composable, flexible way. In this example, we'll build a simple LangGraph application that compares responses from Groq AI and ChatGPT to the same prompt, describing each component and how it fits into LangGraph as we go.
Parts of the Application
1. Node Definitions: Nodes represent discrete tasks in LangGraph. We define nodes for interacting with the Groq API and with ChatGPT; their outputs feed the ComparisonAgent.
2. Edges (Workflow): Edges define the flow of data between nodes. LangGraph handles dependencies and parallel execution based on these edges.
3. Agents and Outputs: The ComparisonAgent processes both responses and compares them. The final output is logged or returned to the user.
Code Implementation
Here’s how the code might look. The Graph, Node, Edge, and Agent classes below are a simplified abstraction used for illustration; a sketch against the StateGraph API that the langgraph package actually ships follows the component explanation.
from langgraph import Graph, Node, Edge, Agent

# 1. Define the API Nodes for Groq AI and ChatGPT
class GroqAPINode(Node):
    def run(self, prompt):
        # Simulated API call to Groq AI
        response = f"GroqAI's response to '{prompt}'"  # Replace with actual API logic
        return response

class ChatGPTNode(Node):
    def run(self, prompt):
        # Simulated API call to ChatGPT
        response = f"ChatGPT's response to '{prompt}'"  # Replace with actual API logic
        return response

# 2. Define the Comparison Agent
class ComparisonAgent(Agent):
    def compare(self, groq_response, chatgpt_response):
        # Simulate a basic comparison
        if groq_response == chatgpt_response:
            result = "Both responses are identical."
        else:
            result = f"Responses differ:\n- Groq: {groq_response}\n- ChatGPT: {chatgpt_response}"
        return result

# 3. Construct the LangGraph Application
def create_comparison_app():
    # Create nodes
    groq_node = GroqAPINode(name="GroqAPI")
    chatgpt_node = ChatGPTNode(name="ChatGPTAPI")
    comparison_agent = ComparisonAgent(name="ComparisonAgent")

    # Define the graph
    graph = Graph()

    # Add edges for workflow
    graph.add_edge(Edge(source=groq_node, target=comparison_agent, data_key="groq_response"))
    graph.add_edge(Edge(source=chatgpt_node, target=comparison_agent, data_key="chatgpt_response"))

    # Return the graph
    return graph

# 4. Execute the Graph
def main():
    prompt = input("Enter your prompt: ")

    # Initialize the graph
    graph = create_comparison_app()

    # Set inputs
    graph.set_input("GroqAPI", prompt=prompt)
    graph.set_input("ChatGPTAPI", prompt=prompt)

    # Run the graph
    responses = graph.run()

    # Get and print the comparison result
    comparison_result = responses["ComparisonAgent"]
    print("\nComparison Result:")
    print(comparison_result)

if __name__ == "__main__":
    main()
Explanation of Components
1. Node: - `GroqAPINode` and `ChatGPTNode` inherit from `Node`. Their `run` methods define the task of sending the prompt to the respective API and receiving a response.
2. Agent: - `ComparisonAgent` is specialized to handle multiple inputs (`groq_response` and `chatgpt_response`) and process them using the `compare` method.
3. Graph: - The `Graph` orchestrates the workflow. It manages execution order, passing data between nodes and agents based on defined `Edge` connections.
4. Edge: - `Edge` connects nodes and specifies how data flows through the graph. Each edge has a `data_key` that indicates the payload to pass between tasks.
5. Execution Flow: - The graph is constructed with nodes and edges. Inputs are fed to the graph, which automatically resolves dependencies, executes the nodes, and aggregates outputs.
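The classes above are a simplified abstraction. For reference, here is a minimal sketch of the same fan-out/fan-in workflow written against the StateGraph builder that the released langgraph package exposes; the state keys, node names, and simulated responses are illustrative choices, not anything the library requires.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

# Shared state that flows through the graph; the key names are our own choice.
class CompareState(TypedDict, total=False):
    prompt: str
    groq_response: str
    chatgpt_response: str
    comparison: str

def call_groq(state: CompareState) -> dict:
    # Simulated Groq call; replace with a real API request.
    return {"groq_response": f"GroqAI's response to '{state['prompt']}'"}

def call_chatgpt(state: CompareState) -> dict:
    # Simulated ChatGPT call; replace with a real API request.
    return {"chatgpt_response": f"ChatGPT's response to '{state['prompt']}'"}

def compare(state: CompareState) -> dict:
    if state["groq_response"] == state["chatgpt_response"]:
        result = "Both responses are identical."
    else:
        result = (f"Responses differ:\n- Groq: {state['groq_response']}"
                  f"\n- ChatGPT: {state['chatgpt_response']}")
    return {"comparison": result}

builder = StateGraph(CompareState)
builder.add_node("groq", call_groq)
builder.add_node("chatgpt", call_chatgpt)
builder.add_node("compare", compare)
builder.add_edge(START, "groq")                   # fan out: both model nodes run in parallel
builder.add_edge(START, "chatgpt")
builder.add_edge(["groq", "chatgpt"], "compare")  # fan in: wait for both responses
builder.add_edge("compare", END)

graph = builder.compile()
print(graph.invoke({"prompt": "Explain LangGraph in one sentence."})["comparison"])

Both model nodes run in the same step because each receives an edge from START, and the comparison node waits for both because its incoming edge lists both sources.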
How to Extend the Application
1. Real API Calls: - Replace the simulated responses in `GroqAPINode` and `ChatGPTNode` with actual API calls (a sketch of real client calls follows at the end of this section).
2. Advanced Comparison: - Implement sophisticated logic in `ComparisonAgent.compare`, such as evaluating coherence, tone, or factual accuracy.
3. Logging and Error Handling: - Add logging and exception handling to monitor and debug the workflow.
This example introduces a basic yet functional LangGraph application that demonstrates core concepts like nodes, edges, agents, and workflows.
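For the first extension above, the simulated responses can be swapped for real calls. Below is a minimal sketch using the official groq and openai Python clients; the model names are placeholders, and both clients read their API keys from the GROQ_API_KEY and OPENAI_API_KEY environment variables by default.

from groq import Groq
from openai import OpenAI

groq_client = Groq()      # reads GROQ_API_KEY from the environment
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def groq_chat(prompt: str) -> str:
    # Placeholder model name; use any chat model available to your Groq account.
    completion = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

def chatgpt_chat(prompt: str) -> str:
    # Placeholder model name; use any chat model available to your OpenAI account.
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

These helpers can replace the f-string responses inside `GroqAPINode.run` and `ChatGPTNode.run`, or inside the StateGraph node functions from the earlier sketch.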
Comparison Measurements
When comparing the output of two different Large Language Models on the same prompt, it is crucial to ensure the comparison is both rigorous and methodologically sound. By combining the best practices and measurements below, a programmer can compare the outputs of different LLMs and draw meaningful conclusions about each model's relative performance in various contexts.
Here is a list of measurements to consider when comparing two responses to the same prompt:
1. Clarity and Coherence
- Readability: Assess the ease with which the output can be read and understood.
- Logical Consistency: Evaluate whether the outputs are internally consistent without contradictions.
- Flow: Check for natural transitions between sentences or paragraphs.
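Readability can be approximated automatically with a classic readability formula. The sketch below assumes the third-party textstat package; the scores are rough heuristics, not a replacement for human judgment.

import textstat

def readability_report(text: str) -> dict:
    # Flesch Reading Ease: higher means easier to read (roughly 0-100).
    # Flesch-Kincaid Grade: approximate US school grade level of the text.
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
    }

print(readability_report("GroqAI's response text goes here."))
print(readability_report("ChatGPT's response text goes here."))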
2. Relevance and Appropriateness
- Contextual Relevance: Ensure both responses are pertinent to the specific query or task (e.g., accuracy of responses to a question).
- Task Appropriateness: Evaluate if the responses align with the expected output format (e.g., a summary, list, or code).
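Contextual relevance can be approximated by the embedding similarity between the prompt and each response. The sketch below assumes the sentence-transformers package; the embedding model name is an illustrative choice.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

def relevance_score(prompt: str, response: str) -> float:
    # Cosine similarity between prompt and response embeddings (closer to 1.0 = more related).
    embeddings = model.encode([prompt, response], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

prompt = "What is LangGraph?"
print(relevance_score(prompt, "GroqAI's response text goes here."))
print(relevance_score(prompt, "ChatGPT's response text goes here."))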
3. Factual Accuracy
- Truthfulness: Verify whether both responses present factual information correctly, especially in the context of knowledge-grounded prompts.
- Error Rate: Count the number of factual errors (e.g., wrong data, misinterpreted concepts).
- Sources and Citations: Evaluate whether the output cites sources correctly or is backed by verifiable facts (if applicable).
4. Performance Metrics
- Perplexity: Measure how well each model predicts the next word in a sequence. Lower perplexity indicates more predictable, fluent responses.
- BLEU (Bilingual Evaluation Understudy): A metric used for comparing machine-generated text with a reference text. This is particularly relevant for tasks like translation but can be applied to other tasks.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Used to compare the overlap between n-grams (i.e., sequences of words) in the generated text and reference text.
- METEOR: Similar to BLEU but incorporates synonym matching and stemming, often leading to more human-like evaluations.
- CIDEr (Consensus-based Image Description Evaluation): Originally used for image captioning but also applicable for text generation tasks in general.
- Accuracy: Simple accuracy in tasks where there are clear right/wrong answers (e.g., fact-checking or classification).
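BLEU and ROUGE both score a generated text against a reference, so they require a reference answer. A minimal sketch assuming the nltk and rouge-score packages:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def overlap_metrics(reference: str, candidate: str) -> dict:
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    # Smoothing avoids zero BLEU scores when a higher-order n-gram never matches.
    bleu = sentence_bleu([ref_tokens], cand_tokens,
                         smoothing_function=SmoothingFunction().method1)
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)
    return {
        "bleu": bleu,
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rougeL_f": rouge["rougeL"].fmeasure,
    }

reference = "A human-written reference answer goes here."
print(overlap_metrics(reference, "GroqAI's response text goes here."))
print(overlap_metrics(reference, "ChatGPT's response text goes here."))

Perplexity, by contrast, is computed from a model's token probabilities, so it requires access to model internals (or a separate scoring model) rather than just the generated text.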
5. Diversity and Creativity
- Entropy: A measure of unpredictability or randomness in the model’s response. Higher entropy indicates more creativity or variability.
- Uniqueness: Evaluate how original or creative the generated responses are. This is especially important for tasks like storytelling or content generation.
- Variety of Vocabulary: Assess the lexical variety used in the output. More diverse vocabulary can be indicative of a more sophisticated model.
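Entropy and vocabulary variety can be estimated directly from the generated tokens. A standard-library sketch using simple whitespace tokenization:

import math
from collections import Counter

def diversity_metrics(text: str) -> dict:
    tokens = text.lower().split()
    if not tokens:
        return {"entropy": 0.0, "type_token_ratio": 0.0, "distinct_2": 0.0}
    counts = Counter(tokens)
    total = len(tokens)
    # Shannon entropy of the token distribution (higher = more varied wording).
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Type-token ratio: unique tokens divided by total tokens.
    ttr = len(counts) / total
    # Distinct-2: fraction of bigrams that are unique, a common diversity measure.
    bigrams = list(zip(tokens, tokens[1:]))
    distinct_2 = len(set(bigrams)) / len(bigrams) if bigrams else 0.0
    return {"entropy": entropy, "type_token_ratio": ttr, "distinct_2": distinct_2}

print(diversity_metrics("GroqAI's response text goes here."))
print(diversity_metrics("ChatGPT's response text goes here."))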
6. Tone and Sentiment
- Sentiment Analysis: Measure whether the sentiment (positive, negative, neutral) aligns with the intended tone or purpose of the prompt.
- Formality and Appropriateness: Assess whether the tone of the response matches the expected level of formality or style (e.g., professional vs. conversational).
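A quick way to compare tone is a lexicon-based sentiment analyzer. The sketch below assumes nltk with its VADER lexicon downloaded; lexicon-based scores are coarse, so treat them as a first pass.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time download of the sentiment lexicon
analyzer = SentimentIntensityAnalyzer()

def sentiment(text: str) -> dict:
    # Returns negative/neutral/positive proportions plus a compound score in [-1, 1].
    return analyzer.polarity_scores(text)

print(sentiment("GroqAI's response text goes here."))
print(sentiment("ChatGPT's response text goes here."))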
7. Efficiency and Latency
- Response Time: Measure how quickly each model produces an answer. This is important for real-time applications.
- Compute Efficiency: Measure the resources (e.g., memory, processing time) used by each model to generate the response.
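Response time is easy to measure around whatever function produces each response. A standard-library sketch, where model_call stands in for your own API wrapper:

import time

def timed_call(model_call, prompt: str):
    # model_call is any function that takes a prompt string and returns a response string.
    start = time.perf_counter()
    response = model_call(prompt)
    elapsed = time.perf_counter() - start
    return response, elapsed

response, seconds = timed_call(lambda p: f"simulated response to '{p}'", "Hello")
print(f"Took {seconds:.3f}s: {response}")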
8. Robustness and Reliability
- Generalization: Test how well the models perform on diverse, unseen prompts.
- Stability: Check if the models generate stable outputs for similar inputs or if their responses vary significantly across identical queries.
- Handling Edge Cases: Evaluate how each model deals with ambiguous or out-of-scope inputs.
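Stability can be estimated by sending the same prompt several times and measuring how similar the outputs are to one another. A standard-library sketch using difflib as a crude similarity measure (model_call is again a placeholder):

from difflib import SequenceMatcher
from itertools import combinations

def stability_score(model_call, prompt: str, runs: int = 5) -> float:
    # Average pairwise similarity of repeated outputs; 1.0 means identical every run.
    outputs = [model_call(prompt) for _ in range(runs)]
    pairs = list(combinations(outputs, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

print(stability_score(lambda p: f"simulated response to '{p}'", "Hello"))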
9. Bias and Fairness
- Bias Detection: Identify whether the responses exhibit any biases (e.g., gender, racial, political).
- Fairness Metrics: Evaluate if the responses are fair and neutral in cases that require non-biased, equal treatment across groups.
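One simple bias probe is counterfactual prompting: swap a demographic term in the prompt, regenerate, and check how much the response changes beyond the swapped term. A rough standard-library sketch, where the term pair and model_call are placeholders:

from difflib import SequenceMatcher

def counterfactual_similarity(model_call, template: str, term_a: str, term_b: str) -> float:
    # Similarity close to 1.0 means the answer barely changes when the term is swapped;
    # a large drop suggests the model treats the two groups differently.
    response_a = model_call(template.format(term=term_a))
    response_b = model_call(template.format(term=term_b))
    return SequenceMatcher(None, response_a, response_b).ratio()

print(counterfactual_similarity(
    lambda p: f"simulated response to '{p}'",
    "Describe a typical day for a {term} software engineer.",
    "male", "female",
))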
10. User Experience Metrics
- Human Evaluation: Have human evaluators rank or score the outputs based on specific criteria (e.g., correctness, creativity).
- Usability and Satisfaction: Measure user satisfaction with the responses based on predefined user experience criteria, especially in applications like chatbots.
11. Explainability
- Transparency: Assess whether one model provides more interpretable reasoning for its output (e.g., can the model explain its answer or logic?).
- Error Attribution: Determine if errors or mistakes in the output are attributable to any specific model characteristic (e.g., overfitting or lack of training data).
12. Language Metrics
- Grammar and Syntax: Check for proper grammar, punctuation, and syntax adherence.
- Fluency: Evaluate the natural flow of language used in the responses.
13. Human-Like Features
- Empathy and Engagement: For dialogue models, assess the extent to which the output reflects empathy, engagement, or conversationality.
- Personalization: Assess how well the model adapts its responses based on previous interactions or the user’s style.