Upstage's Groundedness Check service provides a robust API for verifying that AI-generated responses are firmly anchored in reliable source materials. By submitting context–answer pairs to the Upstage endpoint, we can immediately determine whether the provided context supports a given answer and obtain a confidence assessment of that grounding. In this tutorial, we demonstrate how to use Upstage's core capabilities, including single-shot verification, batch processing, and multi-domain testing, to ensure that our AI systems produce factual and trustworthy content across diverse subject areas.
!pip install -qU langchain-core langchain-upstage
import os
import json
from typing import List, Dict, Any
from langchain_upstage import UpstageGroundednessCheck
os.environ["UPSTAGE_API_KEY"] = "Use Your API Key Here"
We install the latest LangChain core and Upstage integration packages, import the required Python modules for data handling and typing, and set our Upstage API key in the environment to authenticate all subsequent groundedness check requests.
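Because every subsequent call depends on that environment variable, it can help to fail fast when the key is missing or still set to the placeholder. The small guard below is our own addition, not part of the Upstage SDK:

```python
import os

def require_upstage_key() -> str:
    """Return the Upstage API key, raising a clear error if it is missing or a placeholder."""
    key = os.environ.get("UPSTAGE_API_KEY", "")
    if not key or key == "Use Your API Key Here":
        raise RuntimeError("Set UPSTAGE_API_KEY before running groundedness checks.")
    return key
```

Calling `require_upstage_key()` at the top of a script surfaces a readable error instead of an opaque authentication failure later.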
class AdvancedGroundednessChecker:
    """Advanced wrapper for Upstage Groundedness Check with batch processing and analysis"""
    def __init__(self):
        self.checker = UpstageGroundednessCheck()
        self.results = []

    def check_single(self, context: str, answer: str) -> Dict[str, Any]:
        """Check groundedness for a single context-answer pair"""
        request = {"context": context, "answer": answer}
        response = self.checker.invoke(request)
        result = {
            "context": context,
            "answer": answer,
            "grounded": response,
            "confidence": self._extract_confidence(response)
        }
        self.results.append(result)
        return result

    def batch_check(self, test_cases: List[Dict[str, str]]) -> List[Dict[str, Any]]:
        """Process multiple test cases"""
        batch_results = []
        for case in test_cases:
            result = self.check_single(case["context"], case["answer"])
            batch_results.append(result)
        return batch_results

    def _extract_confidence(self, response) -> str:
        """Extract a confidence level from the response"""
        if hasattr(response, 'lower'):
            # Check the negative verdict first: "notGrounded" also contains "grounded"
            if 'notgrounded' in response.lower().replace(' ', ''):
                return 'low'
            elif 'grounded' in response.lower():
                return 'high'
        return 'medium'

    def analyze_results(self) -> Dict[str, Any]:
        """Analyze accumulated results"""
        total = len(self.results)
        grounded = sum(1 for r in self.results
                       if str(r['grounded']).lower() == 'grounded')
        return {
            "total_checks": total,
            "grounded_count": grounded,
            "not_grounded_count": total - grounded,
            "accuracy_rate": grounded / total if total > 0 else 0
        }
checker = AdvancedGroundednessChecker()
The AdvancedGroundednessChecker class wraps Upstage's groundedness API in a simple, reusable interface that lets us run both single and batch context–answer checks while accumulating results. It also includes helper methods to extract a confidence label from each response and compute overall accuracy statistics across all checks.
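Since each check hits the live endpoint, it can be useful to dry-run the wrapper pattern against a stub that mimics the checker's `invoke` interface. The stub below is our own illustration, not part of the Upstage SDK; in practice it could be assigned to `checker.checker` to exercise the batch and analysis logic offline:

```python
from typing import Dict, List

class StubGroundednessCheck:
    """Stand-in for UpstageGroundednessCheck that replays canned verdicts."""
    def __init__(self, responses: List[str]):
        self._responses = iter(responses)

    def invoke(self, request: Dict[str, str]) -> str:
        # The real service returns a verdict string such as "grounded" or "notGrounded".
        return next(self._responses)

def run_offline_demo() -> List[str]:
    """Feed two context-answer pairs through the stub and collect verdicts."""
    stub = StubGroundednessCheck(["grounded", "notGrounded"])
    verdicts = []
    for context, answer in [
        ("Paris is the capital of France.", "France's capital is Paris."),
        ("Paris is the capital of France.", "Lyon is the capital of France."),
    ]:
        verdicts.append(stub.invoke({"context": context, "answer": answer}))
    return verdicts
```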
print("=== Test Case 1: Height Discrepancy ===")
result1 = checker.check_single(
    context="Mauna Kea is an inactive volcano on the island of Hawai'i.",
    answer="Mauna Kea is 5,207.3 meters tall."
)
print(f"Result: {result1['grounded']}")

print("\n=== Test Case 2: Correct Information ===")
result2 = checker.check_single(
    context="Python is a high-level programming language created by Guido van Rossum in 1991. It emphasizes code readability and simplicity.",
    answer="Python was made by Guido van Rossum and focuses on code readability."
)
print(f"Result: {result2['grounded']}")

print("\n=== Test Case 3: Partial Information ===")
result3 = checker.check_single(
    context="The Great Wall of China is approximately 13,000 miles long and took over 2,000 years to build.",
    answer="The Great Wall of China is very long."
)
print(f"Result: {result3['grounded']}")

print("\n=== Test Case 4: Contradictory Information ===")
result4 = checker.check_single(
    context="Water boils at 100 degrees Celsius at sea level atmospheric pressure.",
    answer="Water boils at 90 degrees Celsius at sea level."
)
print(f"Result: {result4['grounded']}")
We run four standalone groundedness checks with the AdvancedGroundednessChecker, covering a factual height error, a correct statement, a vague partial match, and a contradictory claim. Each Upstage verdict is printed to illustrate how the service flags grounded versus ungrounded answers across these different scenarios.
print("\n=== Batch Processing Example ===")
test_cases = [
{
"context": "Shakespeare wrote Romeo and Juliet in the late 16th century.",
"answer": "Romeo and Juliet was written by Shakespeare."
},
{
"context": "The speed of light is approximately 299,792,458 meters per second.",
"answer": "Light travels at about 300,000 kilometers per second."
},
{
"context": "Earth has one natural satellite called the Moon.",
"answer": "Earth has two moons."
}
]
batch_results = checker.batch_check(test_cases)
for i, result in enumerate(batch_results, 1):
    print(f"Batch Test {i}: {result['grounded']}")
print("\n=== Results Analysis ===")
analysis = checker.analyze_results()
print(f"Total checks performed: {analysis['total_checks']}")
print(f"Grounded responses: {analysis['grounded_count']}")
print(f"Not grounded responses: {analysis['not_grounded_count']}")
print(f"Groundedness rate: {analysis['accuracy_rate']:.2%}")
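The groundedness rate reported here is simply the grounded count divided by the total number of checks. A minimal standalone version of that arithmetic, assuming the service's "grounded"/"notGrounded" verdict strings, looks like this:

```python
from typing import Iterable

def groundedness_rate(verdicts: Iterable[str]) -> float:
    """Fraction of verdicts that are exactly 'grounded' (case-insensitive)."""
    verdicts = list(verdicts)
    if not verdicts:
        return 0.0
    grounded = sum(1 for v in verdicts if str(v).lower() == "grounded")
    return grounded / len(verdicts)
```

Comparing for exact equality rather than substring membership matters: a substring test for "grounded" would also match "notGrounded" and inflate the rate.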
print("\n=== Multi-domain Testing ===")
domains = {
    "Science": {
        "context": "Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into glucose and oxygen.",
        "answer": "Plants use photosynthesis to make food from sunlight and CO2."
    },
    "History": {
        "context": "World War II ended in 1945 after the surrender of Japan following the atomic bombings.",
        "answer": "WWII ended in 1944 with Germany's surrender."
    },
    "Geography": {
        "context": "Mount Everest is the highest mountain on Earth, located in the Himalayas at 8,848.86 meters.",
        "answer": "Mount Everest is the tallest mountain and is located in the Himalayas."
    }
}

for domain, test_case in domains.items():
    result = checker.check_single(test_case["context"], test_case["answer"])
    print(f"{domain}: {result['grounded']}")
We execute a series of batched groundedness checks on predefined test cases, print the individual Upstage judgments, and then compute and display overall accuracy metrics. We also run multi-domain validations in science, history, and geography to illustrate how Upstage handles groundedness across different subject areas.
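When the domain list grows beyond three entries, tallying verdicts per domain makes weak areas easy to spot. The helper below is a hypothetical extension of our own, not part of the tutorial's wrapper; it assumes (domain, verdict) pairs like those collected in the loop above:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def tally_by_domain(results: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count grounded vs. not-grounded verdicts for each domain."""
    tally: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"grounded": 0, "not_grounded": 0}
    )
    for domain, verdict in results:
        key = "grounded" if str(verdict).lower() == "grounded" else "not_grounded"
        tally[domain][key] += 1
    return dict(tally)
```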
def create_test_report(checker_instance):
    """Generate a detailed test report"""
    report = {
        "summary": checker_instance.analyze_results(),
        "detailed_results": checker_instance.results,
        "recommendations": []
    }
    accuracy = report["summary"]["accuracy_rate"]
    if accuracy >= 0.9:
        report["recommendations"].append("High accuracy - system performing well")
    else:
        report["recommendations"].append("Review ungrounded responses to improve reliability")
    return report

print("\n=== Final Test Report ===")
report = create_test_report(checker)
print(f"Overall Performance: {report['summary']['accuracy_rate']:.2%}")
print("Recommendations:", report["recommendations"])
print("\n=== Tutorial Complete ===")
print("This tutorial demonstrated:")
print("• Basic groundedness checking")
print("• Batch processing capabilities")
print("• Multi-domain testing")
print("• Results analysis and reporting")
print("• Advanced wrapper implementation")
Finally, we define a create_test_report helper that compiles all accumulated groundedness checks into a summary report, complete with overall accuracy and tailored recommendations, and then print the final performance metrics along with a recap of the tutorial's key demonstrations.
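Since `json` is already imported at the top of the tutorial, the report can also be persisted for later review. The helper and default filename below are our own choices, not part of the tutorial's code:

```python
import json

def save_report(report: dict, path: str = "groundedness_report.json") -> str:
    """Write the report dict to disk as pretty-printed JSON and return the path."""
    with open(path, "w") as f:
        # default=str handles any non-serializable values in detailed_results
        json.dump(report, f, indent=2, default=str)
    return path
```

A call like `save_report(create_test_report(checker))` would then leave an auditable artifact alongside the console output.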
In conclusion, with Upstage's Groundedness Check at our disposal, we gain a scalable, domain-agnostic solution for real-time fact verification and confidence scoring. Whether we are validating isolated claims or processing large batches of responses, Upstage delivers clear grounded/not-grounded judgments and confidence signals that let us monitor accuracy rates and generate actionable quality reports. By integrating this service into our workflow, we can improve the reliability of AI-generated outputs and maintain rigorous standards of factual integrity across applications.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


