Benchmarks for LM4Code/LM4SE
Table of contents
- Relevant papers
  - CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  - On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
  - Evaluating Large Language Models Trained on Code
  - A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends
- Bug Repair
- Code Generation/Synthesis
- Code Summarization
This page lists popular benchmarks for evaluating language models for code (LM4Code) and language models for software engineering (LM4SE).
Relevant papers
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
- Release year: 2021-02
- Paper
- Repository
- Description: Proposes a benchmark suite of 10 tasks across 14 datasets for evaluating code understanding and generation models such as CodeBERT, CodeGPT, and GraphCodeBERT.
On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
- Release year: 2023
- Paper
- Description: Proposes a taxonomy of neural code translation tasks and introduces G-TransEval, a benchmark whose translation tasks span multiple levels of difficulty and complexity.
Evaluating Large Language Models Trained on Code
- Release year: 2021
- Paper
- Description: Introduces Codex, a GPT language model fine-tuned on publicly available GitHub code, together with the HumanEval benchmark and the pass@k metric for measuring the functional correctness of generated programs (see the pass@k sketch below).
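The paper reports results as pass@k: the probability that at least one of k sampled completions passes all unit tests. Below is a minimal sketch of the paper's unbiased estimator in its numerically stable product form, where `n` is the number of samples drawn per problem and `c` the number that pass.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # Product form avoids overflow in the binomial coefficients.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with n = 200 samples per problem of which c = 10 pass, `pass_at_k(200, 10, 1)` is 0.05 and `pass_at_k(200, 10, 100)` is about 0.999.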
A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends
- Release year: 2023
- Paper
- Description: Surveys the evolution of LLMs for code and the benchmarks used to evaluate them, and discusses future research directions.
Bug Repair
Defects4J
- Release year: 2014
- Paper: “Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs”
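Defects4J is driven through its command-line front end, which checks out buggy or fixed project revisions and runs their test suites. A minimal sketch of reproducing one bug from Python (the project and bug IDs here are arbitrary examples):

```python
import subprocess

def reproduce_bug(project: str = "Lang", bug: int = 1,
                  workdir: str = "/tmp/lang-1-buggy") -> None:
    """Check out the buggy revision of a Defects4J bug and run its tests."""
    # "<bug>b" selects the buggy revision; "<bug>f" selects the fixed one.
    subprocess.run(["defects4j", "checkout", "-p", project,
                    "-v", f"{bug}b", "-w", workdir], check=True)
    subprocess.run(["defects4j", "compile"], cwd=workdir, check=True)
    # The test step prints the failing tests that expose the defect.
    subprocess.run(["defects4j", "test"], cwd=workdir)
```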
ManyBugs/IntroClass
- Release year: 2015
- Paper: “The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs”
BugAID
- Release year: 2016
- Paper: “Discovering Bug Patterns in JavaScript”
CoCoNuT
- Release year: 2020
- Paper: “CoCoNuT: Combining Context-Aware Neural Translation Models Using Ensemble for Program Repair”
QuixBugs
- Release year: 2017
- Paper: “QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge”
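Each QuixBugs program is a small classic algorithm, provided in both Python and Java, seeded with a single-line defect; a repair tool is judged on whether its patch makes the accompanying tests pass. An illustrative example in that style (hypothetical, not verbatim from the benchmark):

```python
# Hypothetical single-line defect in the QuixBugs style: binary search
# whose lower bound is never advanced past `mid`.
def binsearch(arr, x):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] == x:
            return mid
        elif arr[mid] < x:
            lo = mid  # BUG: should be `lo = mid + 1`; can loop forever
        else:
            hi = mid
    return -1
```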
Bugs.jar
- Release year: 2018
- Paper: “Bugs.jar: A Large-Scale, Diverse Dataset of Real-World Java Bugs”
BugsInPy
- Release year: 2020
- Paper: “BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies”
DeepFix
- Release year: 2017
- Paper: “DeepFix: Fixing Common C Language Errors by Deep Learning”
Code Generation/Synthesis
CONCODE
- Release year: 2018
- Paper: “Mapping Language to Code in Programmatic Context”
HumanEval
- Release year: 2021
- Paper: “Evaluating Large Language Models Trained on Code”
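Each HumanEval problem pairs a Python function signature and docstring (`prompt`) with hidden unit tests (`test`) and the name of the function under test (`entry_point`); a completion counts as correct only if the tests pass. A minimal sketch of scoring one completion, assuming the field layout of the released JSONL (the real harness runs untrusted code in a sandboxed subprocess with a timeout):

```python
def passes_tests(problem: dict, completion: str) -> bool:
    """Return True if `completion` makes the problem's unit tests pass."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    env: dict = {}
    try:
        exec(program, env)                         # define solution and check()
        env["check"](env[problem["entry_point"]])  # run the hidden tests
        return True
    except Exception:
        return False  # syntax errors, failed asserts, crashes all count as misses
```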
MBPP/MathQA-Python
- Release year: 2021
- Paper: “Program Synthesis with Large Language Models”
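MBPP pairs a short natural-language task description with three assert-style test cases (MathQA-Python converts MathQA word problems into the same executable format). A hypothetical entry in the MBPP style, for illustration only:

```python
# Text: "Write a function to find the maximum of two numbers."
def max_of_two(a, b):
    return a if a > b else b

# Each MBPP problem ships three assert-based tests:
assert max_of_two(10, 20) == 20
assert max_of_two(-1, -5) == -1
assert max_of_two(3, 3) == 3
```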
Code Summarization
CODE-NN
- Release year: 2016
- Paper: “Summarizing Source Code using a Neural Attention Model”
TL-CodeSum
- Release year: 2018
- Paper: “Summarizing Source Code with Transferred API Knowledge”
CodeSearchNet
- Release year: 2019
- Paper: “CodeSearchNet Challenge: Evaluating the State of Semantic Code Search”
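Summarization benchmarks like CODE-NN, TL-CodeSum, and the CodeSearchNet function–docstring pairs are typically scored with smoothed sentence-level BLEU between generated and reference summaries. A minimal sketch using NLTK, assuming simple whitespace tokenization:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(reference: str, candidate: str) -> float:
    """Smoothed BLEU-4 between a reference summary and a generated one."""
    smooth = SmoothingFunction().method4  # common smoothing for short texts
    return sentence_bleu(
        [reference.split()], candidate.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=smooth,
    )

print(bleu4("Returns the maximum of two numbers.",
            "Return the max of two numbers."))
```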