Evaluator
Our evaluation tool is designed to streamline the assessment of model outputs for the MMScan task, providing essential metrics to gauge model performance effectively.
from mmscan import MMScan
# The evaluator tools: 'VisualGroundingEvaluator', 'QuestionAnsweringEvaluator', 'GPTEvaluator'
from mmscan import VisualGroundingEvaluator, QuestionAnsweringEvaluator, GPTEvaluator
Visual Grounding Evaluator
Introduction: For the visual grounding task, our evaluator computes multiple metrics, including AP (Average Precision), AR (Average Recall), AP_C, AR_C, and gTop-k.
AP and AR: These metrics calculate the precision and recall by considering each sample as an individual category.
AP_C and AR_C: These versions categorize samples belonging to the same subclass and calculate them together.
gTop-k: A generalization of the traditional Top-k metric that offers greater flexibility and interpretability for multi-target grounding, i.e. samples with more than one ground-truth box (see the sketch below).
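To make the multi-target case concrete, the following toy sketch shows one way a generalized Top-k hit rate could be computed for a single sample; the IoU matrix, the threshold, and the matching rule are illustrative assumptions, not MMScan's exact definition.

import numpy as np

def gtop_k_hit_rate(pred_scores, iou_matrix, k=1, iou_thr=0.25):
    """Toy generalized Top-k for one multi-target sample (illustrative only).

    pred_scores: (num_pred,) confidence per predicted box.
    iou_matrix:  (num_pred, num_gt) IoU between predictions and ground truths.
    Returns the fraction of ground-truth boxes matched by the top k * num_gt
    scoring predictions.
    """
    num_gt = iou_matrix.shape[1]
    # Keep only the k * num_gt highest-scoring predictions.
    top = np.argsort(-pred_scores)[: k * num_gt]
    # A ground-truth box counts as hit if any kept prediction overlaps it enough.
    hit = (iou_matrix[top] >= iou_thr).any(axis=0)
    return hit.mean()

# Example: 4 predictions, 2 ground-truth targets.
scores = np.array([0.9, 0.8, 0.3, 0.1])
ious = np.array([[0.6, 0.0],
                 [0.0, 0.1],
                 [0.0, 0.7],
                 [0.2, 0.0]])
print(gtop_k_hit_rate(scores, ious, k=1))  # -> 0.5 (only the first target is found)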
Details:
# Initialize the evaluator with show_results enabled to display results
my_evaluator = VisualGroundingEvaluator(show_results=True)
# Update the evaluator with the model's output
my_evaluator.update(model_output)
# Start the evaluation process and retrieve metric results
metric_dict = my_evaluator.start_evaluation()
# Optional: Retrieve detailed sample-level results
print(my_evaluator.records)
# Optional: Show the table of results
print(my_evaluator.print_result())
# Important: Reset the evaluator after use
my_evaluator.reset()
The evaluator expects input data in a specific format, structured as follows:
[
    {
        "pred_scores" (tensor/ndarray): Confidence scores for each prediction. Shape: (num_pred, 1)
        "pred_bboxes"/"gt_bboxes" (tensor/ndarray): 9-DoF bounding boxes, given in either of two formats:
            (1) 9-DoF box tensor: (num_pred/gt, 9)
            (2) center, size and rotation matrix:
                "center": (num_pred/gt, 3),
                "size":   (num_pred/gt, 3),
                "rot":    (num_pred/gt, 3, 3)
        "subclass": The subclass of each VG sample.
        "index": Index of the sample.
    },
    ...
]
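For reference, here is a minimal sketch of assembling one sample in this format and passing it to the evaluator; the box values, subclass label, and index below are made-up placeholders.

import numpy as np
from mmscan import VisualGroundingEvaluator

# One dummy sample in the expected format (all values are placeholders, not real data).
model_output = [
    {
        "pred_scores": np.array([[0.95], [0.40]]),   # (num_pred, 1) confidence scores
        "pred_bboxes": np.random.rand(2, 9),         # 9-DoF boxes: (num_pred, 9)
        "gt_bboxes": np.random.rand(1, 9),           # 9-DoF boxes: (num_gt, 9)
        "subclass": "example_subclass",              # hypothetical subclass label
        "index": 0,
    },
]

my_evaluator = VisualGroundingEvaluator(show_results=True)
my_evaluator.update(model_output)
metric_dict = my_evaluator.start_evaluation()
my_evaluator.reset()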
Question Answering Evaluator
Introduction: The question answering evaluator measures performance using several established metrics.
Bleu-X: Evaluates n-gram overlap between the prediction and the ground truths.
Meteor: Focuses on precision, recall, and synonym matching.
CIDEr: Measures consensus-based agreement with the ground-truth answers.
SPICE: Evaluates semantic propositional content.
SimCSE/SBERT: Semantic similarity measures based on sentence embeddings.
EM (Exact Match) and Refine EM: Measure whether the prediction exactly matches a ground-truth answer (see the illustration below).
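As a rough illustration of the exact-match idea (not the library's actual implementation), a prediction can be checked against every ground-truth answer after simple normalization:

def exact_match(pred, gts):
    """Illustrative exact match: 1.0 if the prediction equals any ground truth."""
    norm = lambda s: s.strip().lower()
    return float(any(norm(pred) == norm(gt) for gt in gts))

print(exact_match("A wooden chair", ["a wooden chair", "the chair"]))  # -> 1.0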
Details:
# Initialize the evaluator with show_results enabled to display results
my_evaluator = QuestionAnsweringEvaluator(show_results=True)
# Update the evaluator with the model's output
my_evaluator.update(model_output)
# Start the evaluation process and retrieve metric results
metric_dict = my_evaluator.start_evaluation()
# Optional: Retrieve detailed sample-level results
print(my_evaluator.records)
# Optional: Show the table of results
print(my_evaluator.print_result())
# Important: Reset the evaluator after use
my_evaluator.reset()
The evaluator requires input data structured as follows:
[
    {
        "question" (str): The question text,
        "pred" (list[str]): The predicted answer (a single-element list),
        "gt" (list[str]): Ground-truth answers (may contain multiple elements),
        "ID": Unique ID of each QA sample,
        "index": Index of the sample,
    },
    ...
]
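A minimal sketch of assembling one record in this format and running it through the evaluator; the question, answers, and ID below are invented placeholders.

from mmscan import QuestionAnsweringEvaluator

# One dummy QA record (question, answers, and ID are placeholders, not real data).
model_output = [
    {
        "question": "What color is the sofa in the living room?",
        "pred": ["The sofa is gray."],                      # single-element list
        "gt": ["The sofa is gray.", "It is a gray sofa."],  # multiple references
        "ID": "qa_000000",                                  # hypothetical unique ID
        "index": 0,
    },
]

my_evaluator = QuestionAnsweringEvaluator(show_results=True)
my_evaluator.update(model_output)
metric_dict = my_evaluator.start_evaluation()
my_evaluator.reset()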
GPT Evaluator
Introduction: In addition to the classical QA metrics above, the GPT evaluator offers a more advanced, GPT-based assessment of answer quality.
Details:
# Initialize GPT evaluator with an API key for access
my_evaluator = GPTEvaluator(API_key='XXX')
# Load, evaluate with multiprocessing, and store results in temporary path
metric_dict = my_evaluator.load_and_eval(model_output, num_threads=5, tmp_path='XXX')
# Important: Reset evaluator when finished
my_evaluator.reset()
The input structure remains the same as for the question answering evaluator:
[
    {
        "question" (str): The question text,
        "pred" (list[str]): The predicted answer (a single-element list),
        "gt" (list[str]): Ground-truth answers (may contain multiple elements),
        "ID": Unique ID of each QA sample,
        "index": Index of the sample,
    },
    ...
]
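A sketch of running the GPT-based evaluation on records in this format; the API-key handling, thread count, and temporary directory below are illustrative placeholders, not requirements of the tool.

import os
from mmscan import GPTEvaluator

# Records in the same QA format as above (values are placeholders).
model_output = [
    {
        "question": "What color is the sofa in the living room?",
        "pred": ["The sofa is gray."],
        "gt": ["The sofa is gray.", "It is a gray sofa."],
        "ID": "qa_000000",
        "index": 0,
    },
]

# Reading the key from an environment variable and the temporary directory name
# are illustrative choices, not requirements of the tool.
my_evaluator = GPTEvaluator(API_key=os.environ["OPENAI_API_KEY"])
metric_dict = my_evaluator.load_and_eval(model_output,
                                         num_threads=5,
                                         tmp_path="./gpt_eval_tmp")
my_evaluator.reset()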