Large Language Models for Software Engineering

Motivation: Large language models have shown great potential in assisting software development tasks, such as code generation, code completion, and bug fixing. However, the reliability and explainability of these models remain a concern, as they may generate incorrect or suboptimal code. Ensuring the quality and trustworthiness of large language models is critical for their adoption in real-world software engineering practices.

Approach: To address these challenges, we explore the capabilities and limitations of large language models for software engineering through extensive analysis. Our primary goal is to enhance the reliability, robustness, and trustworthiness of LLMs in practical applications. We approach this from several complementary angles:

  • Taxonomy and Review: We have conducted a systematic literature review to understand the current state of large language models in software engineering. This includes a taxonomy of the models, their applications, and their limitations [LLM4SE, Pitfalls].
  • Benchmark Datasets: Benchmark datasets are collections of data used to evaluate and compare the performance of LLM-based software development tools. Our research [TOSEM 24, arXiv 23] shows that data duplication and the lack of temporal splits inflate reported performance and lead to misleading evaluation results (a minimal sketch of these two checks follows this list).
  • Code Generation: Popular LLM products such as ChatGPT and GitHub Copilot can help software developers write code directly. However, LLM tools that produce low-quality code are not reliable enough for real-world use. We conducted a systematic study of the quality of 4,066 LLM-generated code implementations in two popular programming languages, Java and Python [TOSEM 24]. Our study reveals various issues in LLM-generated code, including solution inaccuracies and maintainability issues. To address these issues, we propose three strategies for improving code quality.
  • Explainability: Popular LLMs are often black-box models, making it important to understand their decision-making processes. Our explainability analysis [TOSEM 24] reveals that, across various experimental scenarios, language models can recognize code grammar and structural information but exhibit limited robustness to changes in input sequences (the second sketch after this list illustrates such a robustness probe).
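
To make the benchmark-hygiene issues above concrete, the following Python sketch shows the two checks in question: dropping test samples that duplicate training data, and evaluating only on samples created after an assumed training cutoff (a temporal split). The record layout, field names, and cutoff date are illustrative assumptions, not the actual pipeline from our studies.

    # Minimal sketch (not our studies' artifact) of two benchmark hygiene checks:
    # train/test deduplication and a temporal split at an assumed training cutoff.
    # Field names ("code", "created_at") and the cutoff date are illustrative.
    import hashlib
    from datetime import datetime

    def _fingerprint(code: str) -> str:
        # Hash whitespace-normalized code so trivially reformatted copies collide.
        normalized = " ".join(code.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def deduplicate(train: list[dict], test: list[dict]) -> list[dict]:
        # Drop test samples whose normalized code already appears in the training set.
        seen = {_fingerprint(sample["code"]) for sample in train}
        return [sample for sample in test if _fingerprint(sample["code"]) not in seen]

    def temporal_split(samples: list[dict], cutoff: datetime) -> tuple[list[dict], list[dict]]:
        # Keep only samples created after the model's training cutoff for evaluation.
        train = [s for s in samples if s["created_at"] <= cutoff]
        test = [s for s in samples if s["created_at"] > cutoff]
        return train, test

    if __name__ == "__main__":
        cutoff = datetime(2023, 1, 1)  # assumed cutoff, for illustration only
        samples = [
            {"code": "def add(a, b): return a + b", "created_at": datetime(2022, 6, 1)},
            {"code": "def add(a, b):\n    return a + b", "created_at": datetime(2023, 6, 1)},
        ]
        train, test = temporal_split(samples, cutoff)
        clean_test = deduplicate(train, test)
        print(len(train), len(test), len(clean_test))  # 1 1 0: the near-duplicate is filtered out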
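
The explainability finding above can likewise be illustrated with a small robustness probe: apply a semantics-preserving edit (here, renaming an identifier) and check whether a model's prediction changes. The model below is a toy stand-in, not one of the models studied; any real code model would be passed in as model_fn.

    # Minimal robustness-probe sketch: a semantics-preserving rename should not
    # change a model's prediction. `model_fn` is a placeholder for any code model;
    # the length-based toy model below is purely illustrative.
    import re
    from typing import Callable

    def rename_identifier(code: str, old: str, new: str) -> str:
        # Rename an identifier without touching substrings of longer names.
        return re.sub(rf"\b{re.escape(old)}\b", new, code)

    def is_robust(model_fn: Callable[[str], str], original: str, perturbed: str) -> bool:
        # The model is robust to this perturbation if its prediction is unchanged.
        return model_fn(original) == model_fn(perturbed)

    if __name__ == "__main__":
        snippet = "def add(a, b):\n    total = a + b\n    return total"
        perturbed = rename_identifier(snippet, "total", "result")

        def toy_model(code: str) -> str:
            # Stand-in "model": labels snippets by token count (illustrative only).
            return "long" if len(code.split()) > 10 else "short"

        print(is_robust(toy_model, snippet, perturbed))  # True for this toy model
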
Related Publications

    On the Reliability and Explainability of Language Models for Program Generation
    Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, and Li Li
    ACM Transactions on Software Engineering and Methodology (TOSEM 2024), to appear (Core A*, CCF A)

    Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues
    Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo
    ACM Transactions on Software Engineering and Methodology (TOSEM 2024), to appear (Core A*, CCF A)

    Automatically Recommend Code Updates: Are We There Yet?
    Knox Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Patanamon Thongtanunam, Li Li

    Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey
    Xinyu She, Yue Liu, Yanjie Zhao, Yiling He, Li Li, Chakkrit Tantithamthavorn, Zhan Qin, and Haoyu Wang

    Large Language Models for Software Engineering: A Systematic Literature Review
    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang