Help Center

Your guide to using the AI Model Evaluator for various NLP benchmarks.

How to Use the Tool
A step-by-step guide to benchmarking your model.
  1. Select a Benchmark Suite: Choose the benchmark (e.g., GLUE, SuperGLUE) you want to evaluate your model against from the tabs at the top.
  2. Select a Dataset: Within your chosen benchmark suite, select the specific dataset from the dropdown menu.
  3. Prepare Your Model's Output: Ensure your model's predictions are in the correct format for the selected dataset. The output can be a single-column list of predictions in CSV format or a JSON array of predictions. See the dataset details below for specific formats, and the example just after this list.
  4. Provide the Output: You can either paste your model's output directly into the text area or upload a file (.csv, .json, .txt).
  5. Run Benchmark: Click the "Run Benchmark" button to start the evaluation.
  6. Review the Report: The tool will generate a comprehensive report with a summary of results (including pass/fail status and key metrics) and a detailed analysis of your model's performance.
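
For example (steps 3 and 4), five binary predictions could be supplied in either accepted format. The values below are purely illustrative.

As a JSON array:

[1, 0, 1, 1, 0]

As a single-column CSV, one prediction per line:

1
0
1
1
0
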
Benchmark & Dataset Information
Details on each task and the expected output format for your model.

The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks.

CoLA (Corpus of Linguistic Acceptability)

Task: Determine whether a sentence is grammatically acceptable. A binary classification task.

Expected Output: A list of 0s (unacceptable) and 1s (acceptable).

[1, 0, 1, 1, 0, ...]
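
If you generate the file programmatically, the short Python sketch below writes a list of binary predictions in both accepted formats. The prediction values and file names are placeholders for illustration; the same pattern works for the other binary classification tasks (SST-2, MRPC, QQP, QNLI, RTE, WNLI), only the label meanings differ.

import csv
import json

# Illustrative CoLA predictions: 0 = unacceptable, 1 = acceptable.
predictions = [1, 0, 1, 1, 0]

# JSON array of predictions.
with open("predictions.json", "w") as f:
    json.dump(predictions, f)

# Single-column CSV, one prediction per line.
with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for label in predictions:
        writer.writerow([label])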

SST-2 (Stanford Sentiment Treebank)

Task: Classify the sentiment of a sentence from a movie review. A binary classification task.

Expected Output: A list of 0s (negative) and 1s (positive).

[0, 1, 1, 0, 1, ...]

MRPC (Microsoft Research Paraphrase Corpus)

Task: Determine if two sentences are paraphrases of each other. A binary classification task.

Expected Output: A list of 0s (not equivalent) and 1s (equivalent).

[1, 0, 1, 1, 0, ...]

STS-B (Semantic Textual Similarity Benchmark)

Task: Predict a similarity score between 0 and 5 for two sentences. A regression task.

Expected Output: A list of floating point numbers between 0.0 and 5.0.

[5.0, 1.2, 3.8, 2.5, ...]
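
Regression models can produce raw scores slightly outside the valid range. Here is a minimal Python sketch (the variable names and raw values are placeholders) that clips predictions into the 0.0 to 5.0 range before export:

# Illustrative raw model outputs for STS-B.
raw_scores = [5.3, 1.2, 3.8, -0.1]

# Clip each score into the valid 0.0 to 5.0 range and round for readability.
predictions = [round(min(max(score, 0.0), 5.0), 2) for score in raw_scores]

print(predictions)  # [5.0, 1.2, 3.8, 0.0]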

QQP (Quora Question Pairs)

Task: Determine whether a pair of questions are semantically equivalent. A binary classification task.

Expected Output: A list of 0s (not duplicate) and 1s (duplicate).

[0, 1, 1, 0, 1, ...]

MNLI (Multi-Genre Natural Language Inference)

Task: Given a premise and a hypothesis, determine if the premise entails the hypothesis, contradicts it, or neither. A 3-class classification task.

Expected Output: A list of 0s (entailment), 1s (neutral), and 2s (contradiction).

[0, 2, 1, 0, 1, ...]
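
If your model emits string labels rather than integers, map them to the codes above before submitting. This is only a sketch with placeholder label strings; verify that your own framework's label order matches the 0/1/2 convention listed here.

# Map string labels to the integer codes expected for MNLI.
label_to_id = {"entailment": 0, "neutral": 1, "contradiction": 2}

# Illustrative model outputs.
predicted_labels = ["entailment", "contradiction", "neutral"]
predictions = [label_to_id[label] for label in predicted_labels]

print(predictions)  # [0, 2, 1]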

QNLI (Question-Answering NLI)

Task: For a question-sentence pair, determine if the sentence contains the answer to the question. A binary classification task.

Expected Output: A list of 0s (entailment) and 1s (not entailment).

[0, 1, 0, 0, 1, ...]

RTE (Recognizing Textual Entailment)

Task: Determine whether one sentence entails another. The data is drawn from a series of annual textual entailment challenges and is much smaller than MNLI. A binary classification task.

Expected Output: A list of 0s (entailment) and 1s (not entailment).

[1, 0, 0, 1, 1, ...]

WNLI (Winograd NLI)

Task: Determine whether a sentence in which an ambiguous pronoun has been replaced by one of its possible referents is entailed by the original sentence. A small reading comprehension task derived from the Winograd Schema Challenge. A binary classification task.

Expected Output: A list of 0s (not entailment) and 1s (entailment).

[0, 1, 1, 0, 1, ...]
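
Note that WNLI's label convention is the reverse of QNLI and RTE above: for WNLI, 1 means entailment. If your pipeline produces string labels for all three tasks, a per-task mapping helps avoid mix-ups. A sketch with placeholder label strings:

# Entailment label codes differ between tasks; keep one mapping per task.
label_maps = {
    "QNLI": {"entailment": 0, "not_entailment": 1},
    "RTE": {"entailment": 0, "not_entailment": 1},
    "WNLI": {"not_entailment": 0, "entailment": 1},
}

# Illustrative WNLI outputs.
task = "WNLI"
predicted_labels = ["not_entailment", "entailment", "entailment"]
predictions = [label_maps[task][label] for label in predicted_labels]

print(predictions)  # [0, 1, 1]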