Evaluator
Our evaluation tool is designed to streamline the assessment of model outputs for the MMScan task, providing essential metrics to gauge model performance effectively.
from mmscan import MMScan
# The evaluator tools: 'VisualGroundingEvaluator', 'QuestionAnsweringEvaluator', 'GPTEvaluator'
from mmscan import VisualGroundingEvaluator, QuestionAnsweringEvaluator, GPTEvaluator
Visual Grounding Evaluator
Introduction: For the visual grounding task, our evaluator computes multiple metrics, including AP (Average Precision), AR (Average Recall), AP_C, AR_C, and gTop-k.
AP and AR: These metrics calculate the precision and recall by considering each sample as an individual category.
AP_C and AR_C: These versions categorize samples belonging to the same subclass and calculate them together.
gTop-k: A generalization of the traditional Top-k metric that offers greater flexibility and interpretability for multi-target grounding, i.e. samples with more than one ground-truth box (see the sketch below).
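To make the multi-target case concrete, the following toy sketch shows one way a generalized Top-k hit rate could be computed for a single sample; the IoU matrix, the threshold, and the matching rule are illustrative assumptions, not MMScan's exact definition.

import numpy as np

def gtop_k_hit_rate(pred_scores, iou_matrix, k=1, iou_thr=0.25):
    """Toy generalized Top-k for one multi-target sample (illustrative only).

    pred_scores: (num_pred,) confidence per predicted box.
    iou_matrix:  (num_pred, num_gt) IoU between predictions and ground truths.
    Returns the fraction of ground-truth boxes matched by the top k * num_gt
    scoring predictions.
    """
    num_gt = iou_matrix.shape[1]
    # Keep only the k * num_gt highest-scoring predictions.
    top = np.argsort(-pred_scores)[: k * num_gt]
    # A ground-truth box counts as hit if any kept prediction overlaps it enough.
    hit = (iou_matrix[top] >= iou_thr).any(axis=0)
    return hit.mean()

# Example: 4 predictions, 2 ground-truth targets.
scores = np.array([0.9, 0.8, 0.3, 0.1])
ious = np.array([[0.6, 0.0],
                 [0.0, 0.1],
                 [0.0, 0.7],
                 [0.2, 0.0]])
print(gtop_k_hit_rate(scores, ious, k=1))  # -> 0.5 (only the first target is found)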
Details:
# Initialize the evaluator with show_results enabled to display results
my_evaluator = VisualGroundingEvaluator(show_results=True)
# Update the evaluator with the model's output
my_evaluator.update(model_output)
# Start the evaluation process and retrieve metric results
metric_dict = my_evaluator.start_evaluation()
# Optional: Retrieve detailed sample-level results
print(my_evaluator.records)
# Optional: Show the table of results
print(my_evaluator.print_result())
# Important: Reset the evaluator after use
my_evaluator.reset()
The evaluator expects input data in a specific format, structured as follows:
[
    {
        "pred_scores" (tensor/ndarray): Confidence scores for each prediction. Shape: (num_pred, 1)
        "pred_bboxes"/"gt_bboxes" (tensor/ndarray): 9-DoF bounding boxes, given in either of two formats:
            (1) 9-DoF box tensor: (num_pred/gt, 9)
            (2) center, size and rotation matrix:
                "center": (num_pred/gt, 3),
                "size":   (num_pred/gt, 3),
                "rot":    (num_pred/gt, 3, 3)
        "subclass": The subclass of each VG sample.
        "index": Index of the sample.
    },
    ...
]
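For reference, here is a minimal sketch of assembling one sample in this format and passing it to the evaluator; the box values, subclass label, and index below are made-up placeholders.

import numpy as np
from mmscan import VisualGroundingEvaluator

# One dummy sample in the expected format (all values are placeholders, not real data).
model_output = [
    {
        "pred_scores": np.array([[0.95], [0.40]]),   # (num_pred, 1) confidence scores
        "pred_bboxes": np.random.rand(2, 9),         # 9-DoF boxes: (num_pred, 9)
        "gt_bboxes": np.random.rand(1, 9),           # 9-DoF boxes: (num_gt, 9)
        "subclass": "example_subclass",              # hypothetical subclass label
        "index": 0,
    },
]

my_evaluator = VisualGroundingEvaluator(show_results=True)
my_evaluator.update(model_output)
metric_dict = my_evaluator.start_evaluation()
my_evaluator.reset()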
Question Answering Evaluator
Introduction: The question answering evaluator measures performance using several established metrics.
Bleu-X: Evaluates n-gram overlap between the prediction and the ground truths.
Meteor: Focuses on precision, recall, and synonym matching.
CIDEr: Measures consensus-based agreement with the ground-truth answers.
SPICE: Evaluates semantic propositional content.
SimCSE/SBERT: Semantic similarity measures based on sentence embeddings.
EM (Exact Match) and Refine EM: Measure whether the prediction exactly matches a ground-truth answer (see the illustration below).
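As a rough illustration of the exact-match idea (not the library's actual implementation), a prediction can be checked against every ground-truth answer after simple normalization:

def exact_match(pred, gts):
    """Illustrative exact match: 1.0 if the prediction equals any ground truth."""
    norm = lambda s: s.strip().lower()
    return float(any(norm(pred) == norm(gt) for gt in gts))

print(exact_match("A wooden chair", ["a wooden chair", "the chair"]))  # -> 1.0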
Details:
# Initialize the evaluator with show_results enabled to display results
my_evaluator = QuestionAnsweringEvaluator(show_results=True)
# Update the evaluator with the model's output
my_evaluator.update(model_output)
# Start the evaluation process and retrieve metric results
metric_dict = my_evaluator.start_evaluation()
# Optional: Retrieve detailed sample-level results
print(my_evaluator.records)
# Optional: Show the table of results
print(my_evaluator.print_result())
# Important: Reset the evaluator after use
my_evaluator.reset()
The evaluator requires input data structured as follows:
[
    {
        "question" (str): The question text,
        "pred" (list[str]): The predicted answer (a single-element list),
        "gt" (list[str]): Ground-truth answers (may contain multiple elements),
        "ID": Unique ID of each QA sample,
        "index": Index of the sample,
    },
    ...
]
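A minimal sketch of assembling one record in this format and running it through the evaluator; the question, answers, and ID below are invented placeholders.

from mmscan import QuestionAnsweringEvaluator

# One dummy QA record (question, answers, and ID are placeholders, not real data).
model_output = [
    {
        "question": "What color is the sofa in the living room?",
        "pred": ["The sofa is gray."],                      # single-element list
        "gt": ["The sofa is gray.", "It is a gray sofa."],  # multiple references
        "ID": "qa_000000",                                  # hypothetical unique ID
        "index": 0,
    },
]

my_evaluator = QuestionAnsweringEvaluator(show_results=True)
my_evaluator.update(model_output)
metric_dict = my_evaluator.start_evaluation()
my_evaluator.reset()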
GPT Evaluator
Introduction: In addition to the classical QA metrics above, the GPT evaluator offers a more advanced, GPT-based assessment of answer quality.
Details:
# Initialize GPT evaluator with an API key for access
my_evaluator = GPTEvaluator(API_key='XXX')
# Load, evaluate with multiprocessing, and store results in temporary path
metric_dict = my_evaluator.load_and_eval(model_output, num_threads=5, tmp_path='XXX')
# Important: Reset evaluator when finished
my_evaluator.reset()
The input structure remains the same as for the question answering evaluator:
[
    {
        "question" (str): The question text,
        "pred" (list[str]): The predicted answer (a single-element list),
        "gt" (list[str]): Ground-truth answers (may contain multiple elements),
        "ID": Unique ID of each QA sample,
        "index": Index of the sample,
    },
    ...
]
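A sketch of running the GPT-based evaluation on records in this format; the API-key handling, thread count, and temporary directory below are illustrative placeholders, not requirements of the tool.

import os
from mmscan import GPTEvaluator

# Records in the same QA format as above (values are placeholders).
model_output = [
    {
        "question": "What color is the sofa in the living room?",
        "pred": ["The sofa is gray."],
        "gt": ["The sofa is gray.", "It is a gray sofa."],
        "ID": "qa_000000",
        "index": 0,
    },
]

# Reading the key from an environment variable and the temporary directory name
# are illustrative choices, not requirements of the tool.
my_evaluator = GPTEvaluator(API_key=os.environ["OPENAI_API_KEY"])
metric_dict = my_evaluator.load_and_eval(model_output,
                                         num_threads=5,
                                         tmp_path="./gpt_eval_tmp")
my_evaluator.reset()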