LONGCODEU

Benchmarking Long-Context Language Models on Long Code Understanding

ACL 2025
Jia Li, Xuyuan Guo, Lei Li, Kechi Zhang, Ge Li†, Jia Li♂,
Zhengwei Tao, Fang Liu, Chongyang Tao, Yuqi Zhu, Zhi Jin†
Key Lab of High Confidence Software Technology (Peking University), MoE,
School of Computer Science, Peking University, China

† Corresponding authors

Figure 1: Examples of a synthetic long code with independent functions and a real-world long code with non-standalone functions. Dependencies are highlighted.


Abstract

Current advanced long-context language models (LCLMs) offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To address this gap, we propose LONGCODEU, a long code understanding benchmark that evaluates the LCLM abilities required for practical applications from four aspects (8 tasks): code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs on LONGCODEU (i.e., 6 general models and 3 code models). Our experimental results reveal key limitations in current LCLMs' capabilities for long code understanding. In particular, the performance of LCLMs drops dramatically when the long code length exceeds 32K tokens, falling far short of their claimed 128K–1M context windows. Among the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.

Tasks

In this paper, we propose LONGCODEU to comprehensively evaluate LCLMs' long code understanding ability from four aspects: code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding.


Figure 2: Four understanding aspects in LONGCODEU.

  • Code Unit Perception: We treat a function as the code unit and require LCLMs to identify all functions defined in the long code and return their names, where the long code is composed of the contents of one or more code files collected from real-world repositories (see the sketch after this list).
  • Intra-Code Unit Understanding: We propose (1) Code Unit Data Flow Analysis: given a code unit in the long code, LCLMs are required to identify the lines where the value of a given variable changes by tracing its data flow (also illustrated in the sketch below); (2) Code Unit Semantic Analysis: LCLMs are asked to return a code unit from the long code that satisfies a given description.
  • Inter-Code Unit Relation Understanding: We propose (1) Dependency Relation Analysis: given a code unit (or a natural language description), this task requires LCLMs to find the code units in the long code that are invoked by the given unit (or that must be invoked to generate the desired code satisfying the description); (2) Semantic Relation Extraction: given a code unit (or a programming requirement), this task asks LCLMs to extract code units from the long code that are semantically similar to the given unit (or requirement).
  • Long Code Documentation Understanding: Given long documentation and a code unit name (e.g., a function name) contained in the documentation, this task requires LCLMs to extract the information related to that unit name.
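To make these tasks concrete, the snippet below is a minimal sketch of our own illustration (not the benchmark's released tooling), assuming the long code is Python source: extract_function_names mirrors the kind of gold-answer extraction code unit perception calls for, and the hypothetical SAMPLE_UNIT shows the kind of answer code unit data flow analysis expects for the variable `total`.

import ast

def extract_function_names(source: str) -> list[str]:
    """Collect every function defined in the source, including methods and
    nested functions, mirroring the code unit perception task."""
    tree = ast.parse(source)
    return [node.name for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]

# Hypothetical code unit for data flow analysis: given the variable `total`,
# the expected answer is the set of lines where its value changes
# (the two marked lines below).
SAMPLE_UNIT = '''
def running_sum(values):
    total = 0          # `total` is bound here
    for v in values:
        total += v     # `total` changes here
    return total
'''

print(extract_function_names(SAMPLE_UNIT))  # -> ['running_sum']

In the actual benchmark, the long code spans one or more whole files rather than a single unit; this sketch only pins down the tasks' input-output contract.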

Results and Analysis

We evaluate 9 popular LCLMs on LONGCODEU, comprising 6 general models (i.e., GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Flash, DeepSeek-V2.5, Mistral-v0.3, and Phi-3.5) and 3 code models (i.e., DeepSeek-Coder-V2, Qwen2.5-Coder, and CodeLlama). The experimental results reveal key limitations in current LCLMs' capabilities for long code understanding. In particular, LCLMs' performance drops dramatically when the long code length exceeds 32K tokens, falling far short of their claimed context windows of 128K–1M tokens. Among the four aspects, inter-code unit relation understanding is the most challenging for LCLMs.


Figure 3: Performance comparison across tasks and long code lengths on LONGCODEU (grey blocks indicate unavailable configurations). The rate of performance degradation exhibits task-specific and model-specific patterns.

BibTeX

@article{li2025longcodeu,
  title={LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding},
  author={Li, Jia and Guo, Xuyuan and Li, Lei and Zhang, Kechi and Li, Ge and Tao, Zhengwei and Liu, Fang and Tao, Chongyang and Zhu, Yuqi and Jin, Zhi},
  journal={arXiv preprint arXiv:2503.04359},
  year={2025}
}