
Chain-of-Thought Codebank for LLM-based Automatic Debugging

Primary supervisor

Chunyang Chen

Recently, large language models (LLMs) have gained popularity for their emergent capabilities: given appropriate prompts, they can execute a task by following instructions or demonstrations. In this project, we focus on generating chain-of-thought (CoT) prompts from a codebank of basic sketches to measure LLMs' ability in automatic debugging.


We briefly introduce the terms "CoT prompts" and "automatic debugging". Compared with general prompts, CoT prompts explicitly expose the reasoning chain required by a task. By imitating such demonstrations, LLMs tend to reason along sound logical steps and are therefore more likely to give correct answers, as shown in Figure 1 [1]. Meanwhile, CoT prompts of progressively increasing difficulty offer a way to evaluate LLMs' ability, e.g., to find the point at which they no longer answer correctly. For automatic debugging, we want to study LLMs' capability in program state analysis, i.e., answering questions about a program's intermediate state through neural execution [2]. This differs from automatic program repair, where LLMs are strong because they learn data patterns [3]. To the best of our knowledge, LLMs can correctly debug over 96% of beginner-level programming tasks but perform poorly on advanced programming challenges (while still outperforming other approaches) [4].
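To make the idea of a CoT prompt for program state analysis concrete, the sketch below assembles a one-shot prompt from a demonstration (program, question, step-by-step reasoning, answer) and a target question. The function name, argument names, and prompt wording are all illustrative assumptions; in the actual project the codebank would supply the demonstrations.

```python
def build_cot_prompt(demo_program, demo_question, demo_reasoning, demo_answer,
                     target_program, target_question):
    """Assemble a one-shot chain-of-thought prompt for a program-state question.

    The demonstration exposes its reasoning chain explicitly, so the model
    is nudged to trace the target program step by step as well.
    """
    return (
        f"Program:\n{demo_program}\n"
        f"Question: {demo_question}\n"
        f"Let's trace step by step. {demo_reasoning}\n"
        f"Answer: {demo_answer}\n\n"
        f"Program:\n{target_program}\n"
        f"Question: {target_question}\n"
        "Let's trace step by step."
    )

prompt = build_cot_prompt(
    demo_program="x = 1\nx = x + 2",
    demo_question="What is the value of x after line 2?",
    demo_reasoning="Line 1 sets x to 1. Line 2 adds 2, so x becomes 3.",
    demo_answer="3",
    target_program="y = 4\ny = y * 2",
    target_question="What is the value of y after line 2?",
)
print(prompt)
```

The trailing "Let's trace step by step." mirrors the demonstration's reasoning cue, which is the core of the CoT format.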

Inspired by the BIG-bench project for LLMs, we are curious about two questions: 1) can we measure LLMs' limits with a benchmark? 2) can we guide LLMs' reasoning process with the most relevant demonstrations? We would also consider applying the project's deliverables to other open questions. The implementation idea is lightweight and straightforward:

  • collect and template basic code operations, statements, and blocks into a codebank;
  • decompose arbitrary programs to basic sketches, as shown in Figure 2;
  • mine rules to compose these basic sketches into random programs, as shown in Figure 3;
  • compute program states using an oracle (PySnooper) as the ground truth to evaluate LLMs.
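As a toy illustration of the second step (decomposing programs into basic sketches), the snippet below uses Python's `ast` module to abstract identifiers and constants into placeholders while keeping control-flow structure. The placeholder names `VAR` and `CONST` are our own assumption of what a "sketch" might look like, not the project's fixed format.

```python
import ast

class Sketcher(ast.NodeTransformer):
    """Abstract away identifiers and constants, keeping control-flow structure."""

    def visit_Name(self, node):
        # Replace every identifier with a VAR placeholder.
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_Constant(self, node):
        # Replace every literal with a CONST placeholder.
        return ast.copy_location(ast.Name(id="CONST", ctx=ast.Load()), node)

src = """
total = 0
for x in range(5):
    if x % 2 == 0:
        total += x
"""
sketch = ast.unparse(Sketcher().visit(ast.parse(src)))
print(sketch)
```

Sketches extracted this way could be deduplicated and templated into the codebank, and the inverse direction (filling placeholders back in) gives a way to compose random programs.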

In our plan, we want to realize a pipeline that builds a codebank and performs automatic debugging. The whole project is divided into the four steps above, which correspond to the four white blocks illustrated in Figure 4. Among the smaller blocks, the red ones are functions to be implemented or introduced, while the blue ones are data to be processed or produced.
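For the oracle step, the ground truth is a trace of the program's intermediate states. As a minimal stdlib stand-in for PySnooper, the sketch below uses `sys.settrace` to snapshot local variables at each line event; the function name `trace_states` and the `(line, locals)` record format are our own illustrative choices.

```python
import sys

def trace_states(func, *args):
    """Run func(*args), recording (line number, local variables) snapshots.

    A line event fires just before each line executes, so each snapshot
    shows the locals at that point in the run.
    """
    states = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            states.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, states

def toy_sum(n):
    # A toy program whose intermediate states we can question an LLM about.
    total = 0
    for i in range(n):
        total += i
    return total

result, states = trace_states(toy_sum, 3)
for lineno, local_vars in states:
    print(lineno, local_vars)
```

Comparing an LLM's answer about, say, the value of `total` after the second loop iteration against these recorded states is one way to score its program state analysis.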

The figures and more details can be seen at



    Student cohort

    Double Semester



    1. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai-hsin Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903, 2022.
    2. Wojciech Zaremba and Ilya Sutskever. Learning to execute. ArXiv, abs/1410.4615, 2014.
    3. Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. An analysis of the automatic bug fixing performance of ChatGPT. ArXiv, abs/2301.08653, 2023.
    4. Jialu Zhang, José Pablo Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. Repairing bugs in Python assignments using large language models. ArXiv, abs/2209.14876, 2022.
    5. Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. CERT: Continual pre-training on sketches for library-oriented code generation. In International Joint Conference on Artificial Intelligence, 2022.
    6. Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Haiquan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen: An open large language model for code with multi-turn program synthesis. 2022.

    Required knowledge

    • Software development
    • Data analysis
    • Basic knowledge about AI/Machine Learning is a plus