1. Hallucination-free AI for the Enterprise
2. Overcoming the Obstacles of Generative AI

PubMedQA

PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions.
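Because PubMedQA asks for a yes/no/maybe decision about each question given a research abstract, scoring reduces to exact-match accuracy over those three labels. Below is a minimal sketch of that scoring loop; the field names (`question`, `context`, `final_decision`) follow the published labeled split, but the example items themselves are invented for illustration.

```python
# Sketch of accuracy scoring for PubMedQA-style yes/no/maybe items.
# Field names follow the dataset's labeled split; the items are invented.

def pubmedqa_accuracy(items, predict):
    """Score a predictor that maps an item to 'yes', 'no', or 'maybe'."""
    correct = sum(predict(it) == it["final_decision"] for it in items)
    return correct / len(items)

items = [
    {"question": "Does drug X reduce mortality?",
     "context": "In the trial, mortality fell from 12% to 7% (p < 0.01).",
     "final_decision": "yes"},
    {"question": "Is biomarker Y prognostic?",
     "context": "No significant association was observed (p = 0.4).",
     "final_decision": "no"},
]

# A trivial always-"yes" baseline gets half of these items right.
print(pubmedqa_accuracy(items, lambda it: "yes"))  # 0.5
```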

MMLU

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans.
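Few-shot evaluation on MMLU typically means prepending a handful of solved questions to the one being asked. The sketch below shows one common way to build such a prompt; the A/B/C/D multiple-choice layout matches the benchmark, but the question texts and the exact prompt template are illustrative assumptions, not the official harness.

```python
# Sketch of k-shot prompt construction for MMLU-style items.
# Items are invented; the prompt template is one common convention.

def format_item(item, with_answer=True):
    lines = [item["question"]]
    lines += [f"{c}. {o}" for c, o in zip("ABCD", item["options"])]
    answer = "ABCD"[item["answer"]] if with_answer else ""
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def few_shot_prompt(shots, query):
    # Solved examples first, then the unanswered query.
    parts = [format_item(s) for s in shots]
    parts.append(format_item(query, with_answer=False))
    return "\n\n".join(parts)

shot = {"question": "What is 2 + 2?",
        "options": ["3", "4", "5", "6"], "answer": 1}
query = {"question": "What is the capital of France?",
         "options": ["Berlin", "Madrid", "Paris", "Rome"], "answer": 2}

print(few_shot_prompt([shot], query))
```

The model's completion is then compared against the held-out answer letter.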

MedQA

Multiple-choice question answering based on the United States Medical Licensing Examination (USMLE). The dataset is collected from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively.
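Scoring a model on multiple-choice exam questions like these requires pulling the chosen option letter out of a free-text completion. The snippet below is a hedged sketch of that step for a five-option (A-E) USMLE-style item; the regex and the "no match" fallback are our assumptions, not an official evaluation harness.

```python
import re

# Sketch: extract the chosen option letter (A-E) from a model's
# free-text completion. Regex and fallback policy are assumptions.

def extract_choice(completion, n_options=5):
    letters = "ABCDE"[:n_options]
    m = re.search(rf"\b([{letters}])\b", completion.strip())
    return m.group(1) if m else None

print(extract_choice("The correct answer is C because..."))  # C
print(extract_choice("Answer: B"))                           # B
print(extract_choice("I am not sure."))                      # None
```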

Coming Soon

HELM - NQ Dataset

Holistic Evaluation of Language Models (HELM) is a comprehensive benchmark framework designed to improve the transparency of language models (LMs) by taxonomizing the vast space of potential scenarios and metrics of interest for LMs. Developed by Stanford CRFM, HELM serves as a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

Coming Soon

MS Marco

Relevance labels are derived from which passages were marked as containing the answer in the QnA dataset, making this one of the largest relevance datasets ever assembled.
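With relevance labels per query, MS MARCO passage ranking is conventionally reported as MRR@10: the reciprocal rank of the first relevant passage within the top 10 results, averaged over queries. A minimal sketch, with invented query and passage ids:

```python
# Sketch of MRR@10, the standard MS MARCO passage-ranking metric.
# qrels maps query id -> set of relevant passage ids; rankings maps
# query id -> ranked list of retrieved passage ids (all invented).

def mrr_at_10(rankings, qrels):
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:10], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

rankings = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"]}
qrels = {"q1": {"p7"}, "q2": {"p5"}}

# q1: first relevant at rank 2 -> 1/2; q2: no hit -> 0; mean = 0.25
print(mrr_at_10(rankings, qrels))  # 0.25
```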

Coming Soon