Benchmarks for LM4Code/LM4SE
Table of contents
- Relevant papers
  - CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  - On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
  - Evaluating Large Language Models Trained on Code
  - A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends
- Bug Repair
- Code Generation/Synthesis
- Code Summarization
This page lists popular benchmarks for evaluating language models for code (LM4Code) and language models for software engineering (LM4SE).
Relevant papers
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
- Release year: 2021-02
- Paper
- Repository
- Description: Proposes a benchmark suite of 10 tasks across 14 datasets for evaluating code understanding and generation models such as CodeBERT, CodeGPT, and GraphCodeBERT.
On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
- Release year: 2023
- Paper
- Description: Proposes a taxonomy of neural code translation tasks and introduces G-TransEval, a benchmark whose translation tasks span multiple levels of difficulty and complexity.
Evaluating Large Language Models Trained on Code
- Release year: 2021
- Paper
- Description: Introduces Codex, a GPT language model fine-tuned on publicly available GitHub code, together with the HumanEval benchmark and the pass@k metric for measuring the functional correctness of generated programs (see the pass@k sketch below).
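The paper reports results as pass@k: the probability that at least one of k sampled completions passes all unit tests. Below is a minimal sketch of the paper's unbiased estimator in its numerically stable product form, where `n` is the number of samples drawn per problem and `c` the number that pass.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # Product form avoids overflow in the binomial coefficients.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with n = 200 samples per problem of which c = 10 pass, `pass_at_k(200, 10, 1)` is 0.05 and `pass_at_k(200, 10, 100)` is about 0.999.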
A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends
- Release year: 2023
- Paper
- Description: Surveys the evolution of LLMs for code and the benchmarks used to evaluate them, and discusses future research directions.
Bug Repair
Defects4J
- Release year: 2014
- Paper: “Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs”
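Defects4J is driven through its command-line front end, which checks out buggy or fixed project revisions and runs their test suites. A minimal sketch of reproducing one bug from Python (the project and bug IDs here are arbitrary examples):

```python
import subprocess

def reproduce_bug(project: str = "Lang", bug: int = 1,
                  workdir: str = "/tmp/lang-1-buggy") -> None:
    """Check out the buggy revision of a Defects4J bug and run its tests."""
    # "<bug>b" selects the buggy revision; "<bug>f" selects the fixed one.
    subprocess.run(["defects4j", "checkout", "-p", project,
                    "-v", f"{bug}b", "-w", workdir], check=True)
    subprocess.run(["defects4j", "compile"], cwd=workdir, check=True)
    # The test step prints the failing tests that expose the defect.
    subprocess.run(["defects4j", "test"], cwd=workdir)
```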
ManyBugs/IntroClass
- Release year: 2015
- Paper: “The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs”
BugAID
- Release year: 2016
- Paper: “Discovering Bug Patterns in JavaScript”
CoCoNuT
- Release year: 2020
- Paper: “CoCoNuT: Combining Context-Aware Neural Translation Models Using Ensemble for Program Repair”
QuixBugs
- Release year: 2017
- Paper: “QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge”
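Each QuixBugs program is a small classic algorithm, provided in both Python and Java, seeded with a single-line defect; a repair tool is judged on whether its patch makes the accompanying tests pass. An illustrative example in that style (hypothetical, not verbatim from the benchmark):

```python
# Hypothetical single-line defect in the QuixBugs style: binary search
# whose lower bound is never advanced past `mid`.
def binsearch(arr, x):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] == x:
            return mid
        elif arr[mid] < x:
            lo = mid  # BUG: should be `lo = mid + 1`; can loop forever
        else:
            hi = mid
    return -1
```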
Bugs.jar
- Release year: 2018
- Paper: “Bugs.jar: A Large-Scale, Diverse Dataset of Real-World Java Bugs”
BugsInPy
- Release year: 2020
- Paper: “BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies”
DeepFix
- Release year: 2017
- Paper: “DeepFix: Fixing Common C Language Errors by Deep Learning”
Code Generation/Synthesis
CONCODE
- Release year: 2018
- Paper: “Mapping Language to Code in Programmatic Context”
HumanEval
- Release year: 2021
- Paper: “Evaluating Large Language Models Trained on Code”
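Each HumanEval problem pairs a Python function signature and docstring (`prompt`) with hidden unit tests (`test`) and the name of the function under test (`entry_point`); a completion counts as correct only if the tests pass. A minimal sketch of scoring one completion, assuming the field layout of the released JSONL (the real harness runs untrusted code in a sandboxed subprocess with a timeout):

```python
def passes_tests(problem: dict, completion: str) -> bool:
    """Return True if `completion` makes the problem's unit tests pass."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    env: dict = {}
    try:
        exec(program, env)                         # define solution and check()
        env["check"](env[problem["entry_point"]])  # run the hidden tests
        return True
    except Exception:
        return False  # syntax errors, failed asserts, crashes all count as misses
```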
MBPP/MathQA-Python
- Release year: 2021
- Paper: “Program Synthesis with Large Language Models”
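MBPP pairs a short natural-language task description with three assert-style test cases (MathQA-Python converts MathQA word problems into the same executable format). A hypothetical entry in the MBPP style, for illustration only:

```python
# Text: "Write a function to find the maximum of two numbers."
def max_of_two(a, b):
    return a if a > b else b

# Each MBPP problem ships three assert-based tests:
assert max_of_two(10, 20) == 20
assert max_of_two(-1, -5) == -1
assert max_of_two(3, 3) == 3
```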
Code Summarization
CODE-NN
- Release year: 2016
- Paper: “Summarizing Source Code using a Neural Attention Model”
TL-CodeSum
- Release year: 2018
- Paper: “Summarizing Source Code with Transferred API Knowledge”
CodeSearchNet
- Release year: 2019
- Paper: “CodeSearchNet Challenge: Evaluating the State of Semantic Code Search”
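Summarization benchmarks like CODE-NN, TL-CodeSum, and the CodeSearchNet function–docstring pairs are typically scored with smoothed sentence-level BLEU between generated and reference summaries. A minimal sketch using NLTK, assuming simple whitespace tokenization:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(reference: str, candidate: str) -> float:
    """Smoothed BLEU-4 between a reference summary and a generated one."""
    smooth = SmoothingFunction().method4  # common smoothing for short texts
    return sentence_bleu(
        [reference.split()], candidate.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=smooth,
    )

print(bleu4("Returns the maximum of two numbers.",
            "Return the max of two numbers."))
```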