
Reliable Language Models for Code (LM4Code)

A curated collection of research and resources toward more robust and reliable LLMs for software engineering tasks.

This repository extends our recent work, “Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey” and “Large language models for software engineering: A systematic literature review”. It contains the supporting material for our research and a curated collection of LM4Code papers and other resources (datasets, tutorials, etc.). The focus is primarily on papers that use pre-trained models, especially large language models, to improve the reliability of language models in software engineering research.


About our survey

Our survey paper, “Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey”, discusses how Language Models for Code Intelligence (LM4Code) are susceptible to potential pitfalls. These pitfalls can lead to unrealistic performance estimates and undermine reliability and applicability in real-world deployment.

The paper proposes a taxonomy of pitfalls in LM4Code research and conducts a systematic study to summarize the issues, implications, current solutions, and remaining challenges for each pitfall. The classification scheme covers four crucial aspects: data collection and labeling, system design and learning, performance evaluation, and deployment and maintenance. We hope this survey helps researchers and practitioners understand these pitfalls and build more robust language models for code intelligence.
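
As one concrete illustration of a data collection pitfall, duplication between training and evaluation splits can inflate reported performance. The sketch below is a minimal, hypothetical check for exact-match overlap between two code corpora; the `train_samples` and `test_samples` lists and the normalization step are illustrative assumptions, not tooling from the survey itself.

```python
import hashlib

def normalize(code: str) -> str:
    # Crude normalization: drop blank lines and surrounding whitespace.
    # A real deduplication pipeline would also handle comments, identifiers, etc.
    return "\n".join(line.strip() for line in code.splitlines() if line.strip())

def fingerprint(code: str) -> str:
    # Hash the normalized code so large corpora can be compared cheaply.
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()

def leakage_report(train_samples, test_samples):
    # Return indices of test samples whose normalized form also appears in training data.
    train_hashes = {fingerprint(code) for code in train_samples}
    return [i for i, code in enumerate(test_samples) if fingerprint(code) in train_hashes]

# Hypothetical usage with toy corpora.
train_samples = ["def add(a, b):\n    return a + b\n"]
test_samples = ["def add(a, b):\n    return a + b", "def sub(a, b):\n    return a - b"]
print(leakage_report(train_samples, test_samples))  # -> [0]
```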

To explore the survey results in detail, see the Survey Results section of this repository.

Repository Contents:

  • LM4Code Papers: A curated list of research papers focusing on reliable LM4Code with annotations and summaries.
  • Datasets: Links to relevant datasets for training and evaluating LM4Code models.
  • Tutorials: Resources and tutorials for learning about and implementing LM4Code techniques.
  • Tools and Libraries: Links to open-source tools and libraries for building and utilizing LM4Code models.
  • Blog Posts and Articles: Latest news and insights on LM4Code research and development.
  • Survey Results: Detailed results from our survey paper “Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey”.

Contribute to Reliable LM4Code

We welcome contributions from the community to help advance reliable LM4Code research. If you have research papers, datasets, or tools to share, please feel free to submit them to this repository.