Evaluator

Our evaluation tool streamlines the assessment of model outputs for the MMScan tasks, providing the essential metrics to gauge model performance.

from mmscan import MMScan

# The evaluator tools: 'VisualGroundingEvaluator', 'QuestionAnsweringEvaluator', 'GPTEvaluator'
from mmscan import VisualGroundingEvaluator, QuestionAnsweringEvaluator, GPTEvaluator

Visual Grounding Evaluator

  • Introduction: For the visual grounding task, our evaluator computes multiple metrics, including AP (Average Precision), AR (Average Recall), AP_C, AR_C, and gTop-k.

    • AP and AR: Compute precision and recall by treating each sample as an individual category.

    • AP_C and AR_C: Group samples belonging to the same subclass and compute precision and recall over each group.

    • gTop-k: A generalization of the traditional Top-k metric that offers greater flexibility and interpretability for multi-target grounding.

  • Details:

# Initialize the evaluator with show_results enabled to display results
my_evaluator = VisualGroundingEvaluator(show_results=True)

# Update the evaluator with the model's output
my_evaluator.update(model_output)

# Start the evaluation process and retrieve metric results
metric_dict = my_evaluator.start_evaluation()

# Optional: Retrieve detailed sample-level results
print(my_evaluator.records)

# Optional: Show the table of results
print(my_evaluator.print_result())

# Important: Reset the evaluator after use
my_evaluator.reset()

The evaluator expects input data in a specific format, structured as follows:

[
    {
        "pred_scores" (tensor/ndarray): Confidence scores for each prediction. Shape: (num_pred, 1)

        "pred_bboxes"/"gt_bboxes" (tensor/ndarray): List of 9 DoF bounding boxes.
            Supports two input formats:
            1. 9-dof box format: (num_pred/gt, 9)
            2. center, size and rotation matrix:
                "center": (num_pred/gt, 3),
                "size"  : (num_pred/gt, 3),
                "rot"   : (num_pred/gt, 3, 3)

        "subclass": The subclass of each VG sample.
        "index": Index of the sample.
    }
    ...
]
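
As a reference for this format, the sketch below assembles a toy model_output with random NumPy data and runs it through the evaluator. The sample count, box values, and subclass label are placeholders chosen for illustration, not real predictions.

import numpy as np

from mmscan import VisualGroundingEvaluator

# Build two toy samples in the 9-DoF box format (option 1 above).
# All values are random placeholders used only to illustrate the structure.
model_output = []
for index in range(2):
    num_pred, num_gt = 4, 2
    model_output.append({
        "pred_scores": np.random.rand(num_pred, 1),   # confidence per prediction
        "pred_bboxes": np.random.rand(num_pred, 9),   # predicted 9-DoF boxes
        "gt_bboxes": np.random.rand(num_gt, 9),       # ground-truth 9-DoF boxes
        "subclass": "placeholder_subclass",           # subclass of this VG sample
        "index": index,                               # sample index
    })

my_evaluator = VisualGroundingEvaluator(show_results=True)
my_evaluator.update(model_output)
metric_dict = my_evaluator.start_evaluation()
my_evaluator.reset()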

Question Answering Evaluator

  • Introduction: The question answering evaluator measures performance using several established metrics.

    • Bleu-X: Evaluates n-gram overlap between predictions and ground truths.

    • Meteor: Focuses on precision, recall, and synonym matching.

    • CIDEr: Measures consensus-based agreement with the ground-truth answers.

    • SPICE: Evaluates semantic propositional content.

    • SimCSE/SBERT: Measure semantic similarity using sentence embeddings.

    • EM (Exact Match) and Refined EM: Check for exact matches between predictions and ground truths.

  • Details:

# Initialize the evaluator with show_results enabled to display results
my_evaluator = QuestionAnsweringEvaluator(show_results=True)

# Update the evaluator with the model's output
my_evaluator.update(model_output)

# Start the evaluation process and retrieve metric results
metric_dict = my_evaluator.start_evaluation()

# Optional: Retrieve detailed sample-level results
print(my_evaluator.records)

# Optional: Show the table of results
print(my_evaluator.print_result())

# Important: Reset the evaluator after use
my_evaluator.reset()

The evaluator requires input data structured as follows:

[
    {
        "question" (str): The question text,
        "pred" (list[str]): The predicted answer, single element list,
        "gt" (list[str]): Ground truth answers, containing multiple elements,
        "ID": Unique ID for each QA sample,
        "index": Index of the sample,
    }
    ...
]
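
For reference, here is a minimal sketch that builds one toy QA sample in this format and evaluates it, following the usage shown above; the question, answers, and ID are invented placeholders.

from mmscan import QuestionAnsweringEvaluator

# One toy QA sample; all strings below are placeholders for illustration.
model_output = [
    {
        "question": "What is the chair near the door made of?",
        "pred": ["It is made of wood."],              # single-element list
        "gt": ["The chair is wooden.", "Wood."],      # one or more reference answers
        "ID": "placeholder_qa_0",
        "index": 0,
    }
]

my_evaluator = QuestionAnsweringEvaluator(show_results=True)
my_evaluator.update(model_output)
metric_dict = my_evaluator.start_evaluation()
my_evaluator.reset()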

GPT Evaluator

  • Details:

In addition to the classical QA metrics above, the GPT evaluator uses a GPT model to judge predictions against the ground-truth answers, providing a more flexible evaluation.

# Initialize GPT evaluator with an API key for access
my_evaluator = GPTEvaluator(API_key='XXX')

# Load, evaluate with multiprocessing, and store results in temporary path
metric_dict = my_evaluator.load_and_eval(model_output, num_threads=5, tmp_path='XXX')

# Important: Reset evaluator when finished
my_evaluator.reset()

The input structure remains the same as for the question answering evaluator:

[
    {
        "question" (str): The question text,
        "pred" (list[str]): The predicted answer, single element list,
        "gt" (list[str]): Ground truth answers, containing multiple elements,
        "ID": Unique ID for each QA sample,
        "index": Index of the sample,
    }
    ...
]
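
The sketch below mirrors the call shown above on a single placeholder sample; the environment-variable name for the API key and the temporary path are assumptions chosen for illustration.

import os

from mmscan import GPTEvaluator

# A single placeholder sample in the QA input format shown above.
model_output = [
    {
        "question": "What is the chair near the door made of?",
        "pred": ["It is made of wood."],
        "gt": ["The chair is wooden.", "Wood."],
        "ID": "placeholder_qa_0",
        "index": 0,
    }
]

# The API key is read from an environment variable here (an assumed convention);
# tmp_path points at a placeholder directory for intermediate results.
my_evaluator = GPTEvaluator(API_key=os.environ["OPENAI_API_KEY"])
metric_dict = my_evaluator.load_and_eval(model_output,
                                         num_threads=5,
                                         tmp_path="./gpt_eval_tmp")
my_evaluator.reset()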
