Help Center
Your guide to using the AI Model Evaluator for various NLP benchmarks.
- Select a Benchmark Suite: Choose the benchmark (e.g., GLUE, SuperGLUE) you want to evaluate your model against from the tabs at the top.
- Select a Dataset: Within your chosen benchmark suite, select the specific dataset from the dropdown menu.
- Prepare Your Model's Output: Ensure your model's predictions are in the correct format for the selected dataset. The output can be a single-column CSV of predictions or a JSON array of predictions. See the dataset details below for the format each dataset expects, and the export sketch after this list for writing either format.
- Provide the Output: You can either paste your model's output directly into the text area or upload a file (.csv, .json, .txt).
- Run Benchmark: Click the "Run Benchmark" button to start the evaluation.
- Review the Report: The tool will generate a comprehensive report with a summary of results (including pass/fail status and key metrics) and a detailed analysis of your model's performance.
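For step 3, the snippet below is a minimal sketch of writing predictions in either accepted format. It assumes your predictions are already collected in a Python list; the `predictions` values and the file names are placeholders.

```python
import csv
import json

# Hypothetical predictions collected from your model (binary labels here).
predictions = [1, 0, 1, 1, 0]

# Option 1: JSON array, suitable for pasting into the text area or uploading as .json.
with open("predictions.json", "w") as f:
    json.dump(predictions, f)

# Option 2: single-column CSV, one prediction per row, for uploading as .csv.
with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for p in predictions:
        writer.writerow([p])
```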
The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks.
CoLA (Corpus of Linguistic Acceptability)
Task: Determine if a sentence is grammatically correct. A binary classification task.
Expected Output: A list of 0s (unacceptable) and 1s (acceptable).
[1, 0, 1, 1, 0, ...]

SST-2 (Stanford Sentiment Treebank)
Task: Classify the sentiment of a sentence from a movie review. A binary classification task.
Expected Output: A list of 0s (negative) and 1s (positive).
[0, 1, 1, 0, 1, ...]

MRPC (Microsoft Research Paraphrase Corpus)
Task: Determine if two sentences are paraphrases of each other. A binary classification task.
Expected Output: A list of 0s (not equivalent) and 1s (equivalent).
[1, 0, 1, 1, 0, ...]

STS-B (Semantic Textual Similarity Benchmark)
Task: Predict a similarity score between 1 and 5 for two sentences. A regression task.
Expected Output: A list of floating point numbers between 1.0 and 5.0.
[5.0, 1.2, 3.8, 2.5, ...]
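Because STS-B is a regression task, predictions should be floats within the scored range rather than integer labels. Below is a minimal, hypothetical sketch for clamping raw model scores into range before export; the `raw_scores` values and the file name are placeholders.

```python
import json

# Hypothetical raw regression outputs from your model.
raw_scores = [5.3, 1.2, 3.8, 2.5, 0.7]

# Clamp each score into the expected 1.0-5.0 range and round for readability.
predictions = [round(min(max(s, 1.0), 5.0), 2) for s in raw_scores]

with open("stsb_predictions.json", "w") as f:
    json.dump(predictions, f)
```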
QQP (Quora Question Pairs)
Task: Determine whether a pair of questions are semantically equivalent. A binary classification task.
Expected Output: A list of 0s (not duplicate) and 1s (duplicate).
[0, 1, 1, 0, 1, ...]

MNLI (Multi-Genre Natural Language Inference)
Task: Given a premise and a hypothesis, determine if the premise entails the hypothesis, contradicts it, or neither. A 3-class classification task.
Expected Output: A list of 0s (entailment), 1s (neutral), and 2s (contradiction).
[0, 2, 1, 0, 1, ...]
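Label conventions vary between libraries, so it may help to map label names explicitly to the integer IDs listed above before exporting. A minimal sketch, assuming your model emits label strings; the `label_names` list and file name are placeholders.

```python
import json

# Integer IDs expected for MNLI, as listed above.
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}

# Hypothetical label strings produced by your model.
label_names = ["entailment", "contradiction", "neutral", "entailment"]

predictions = [LABEL_TO_ID[name] for name in label_names]

with open("mnli_predictions.json", "w") as f:
    json.dump(predictions, f)
```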
QNLI (Question-Answering NLI)
Task: For a question-sentence pair, determine if the sentence contains the answer to the question. A binary classification task.
Expected Output: A list of 0s (entailment) and 1s (not entailment).
[0, 1, 0, 0, 1, ...]

RTE (Recognizing Textual Entailment)
Task: Determine whether one sentence (the premise) entails another (the hypothesis), using a smaller textual entailment dataset. A binary classification task.
Expected Output: A list of 0s (entailment) and 1s (not entailment).
[1, 0, 0, 1, 1, ...]

WNLI (Winograd NLI)
Task: A small reading-comprehension task derived from the Winograd Schema Challenge, recast as sentence-pair entailment. A binary classification task.
Expected Output: A list of 0s (not entailment) and 1s (entailment).
[0, 1, 1, 0, 1, ...]
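Before uploading, it can save a failed run to sanity-check that a predictions file contains only the values a dataset expects and has one prediction per example. A minimal, hypothetical sketch; the file name, allowed-values set, and expected count are placeholders.

```python
import json

def validate_predictions(path, allowed_values, expected_count=None):
    """Check that a JSON predictions file matches the expected label set and length."""
    with open(path) as f:
        predictions = json.load(f)

    bad = [p for p in predictions if p not in allowed_values]
    if bad:
        raise ValueError(f"Unexpected values in {path}: {sorted(set(bad))}")
    if expected_count is not None and len(predictions) != expected_count:
        raise ValueError(f"Expected {expected_count} predictions, found {len(predictions)}")
    return predictions

# Example: WNLI predictions should contain only 0 (not entailment) and 1 (entailment).
validate_predictions("wnli_predictions.json", allowed_values={0, 1})
```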