CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is an experimental Python 3.9-compatible interpreter implementation in C++. Unlike CPython, this interpreter uses a register-based VM instead of a stack-based VM, implements Python objects as C++ classes, and includes MLIR integration for advanced optimizations.

Build System

Prerequisites

CMake 3.25+
C++23 compiler
LLVM 23+ with MLIR (required for MLIR backend)
GMP (GNU Multiple Precision library)
ICU (International Components for Unicode)

Install LLVM/MLIR on Ubuntu:

wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
sudo ./llvm.sh 23 all
sudo apt install libmlir-23-dev mlir-23-tools

Build Commands

Configure and build:

cmake --preset release
cmake --build --preset release

Run tests:

# Run all tests (unit tests + integration tests)
ctest --preset release

# Run just integration tests
ctest --preset release -R integration-tests

# Run just unit tests
ctest --preset release -E integration-tests

Run the Python interpreter:

# The binary is named `python` and lives under the preset's build dir
./build/release/src/python <script.py>

# Stress the garbage collector while running (recommended when debugging
# object-lifetime issues); the value is the number of allocations between
# collections (default 10000), so smaller values collect more often
./build/release/src/python <script.py> --gc-frequency 100

Useful diagnostic flags: -t/--tokenize (print tokens), -a/--ast (print AST), -b/--bytecode (print generated bytecode), -d/--debug and --trace (logging).

Development builds with sanitizers:

# Address sanitizer
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DENABLE_SANITIZER_ADDRESS=ON
cmake --build build

# Undefined behavior sanitizer
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DENABLE_SANITIZER_UNDEFINED_BEHAVIOR=ON
cmake --build build

Architecture Overview

Execution Pipeline

Source → Lexer → Parser → AST → Compiler → Program → VM → Runtime

Lexer (src/lexer/) tokenizes Python source using CPython-compatible tokens
Parser (src/parser/) builds an AST using the same grammar spec as CPython
AST (src/ast/) represents code with the same node types as CPython
Compiler has three backends (compiler::Backend in src/executable/Program.cpp):
- MLIR (current default): the python binary always compiles via Backend::MLIR. Uses MLIR dialects for optimization, then lowers to bytecode
- BytecodeGenerator: Register-based bytecode generated directly from the AST
- LLVM: JIT compilation (incomplete/experimental). Must be compiled in by configuring with -DENABLE_LLVM_BACKEND=ON (which defines the USE_LLVM macro), then selected at runtime with --use-llvm
VM (src/vm/) executes instructions with register-based architecture
Interpreter (src/interpreter/) manages execution state, frames, modules
Runtime (src/runtime/) implements Python objects as C++ classes

Register-Based VM Architecture

Unlike CPython's stack-based VM, this interpreter uses registers for intermediate values:

StackFrame structure:

registers: Vector of py::Value acting like CPU registers
locals: Stack-allocated local variables (separate from registers)
stack_pointer: For runtime stack management

Instructions specify register operands explicitly:

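// Example: BINARY_OPERATION r5 r3 r4  means  r5 = r3 + r4 (for the + operator)
const auto &lhs = vm.reg(m_lhs);
const auto &rhs = vm.reg(m_rhs);
auto result = add(lhs, rhs, interpreter);  // runtime call; error handling omitted here
vm.reg(m_destination) = result.unwrap();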

Benefits over stack-based:

Fewer memory accesses (intermediate values are not pushed to and popped from an operand stack)
More optimization opportunities
Closer to actual CPU architectures

Trade-offs:

Larger instruction encoding (includes register indices)
Currently no register-reuse optimization (registers are allocated sequentially)
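
To make the register model concrete, here is a minimal, self-contained sketch of a register machine. It is not the project's VM: Op, Instr, run and example_program are hypothetical names, and registers hold plain 64-bit integers instead of py::Value. It only illustrates explicit register operands and sequential register allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy instruction set; the real instructions live in
// src/executable/bytecode/instructions/ and operate on py::Value.
enum class Op : std::uint8_t { LoadConst, Add, Mul };

struct Instr {
    Op op;
    std::size_t dst = 0;   // destination register
    std::size_t lhs = 0;   // first source register
    std::size_t rhs = 0;   // second source register
    std::int64_t imm = 0;  // immediate value, used only by LoadConst
};

// Runs a straight-line program and returns the final register file.
std::vector<std::int64_t> run(const std::vector<Instr> &program, std::size_t register_count) {
    std::vector<std::int64_t> regs(register_count, 0);
    for (const Instr &i : program) {
        assert(i.dst < regs.size() && i.lhs < regs.size() && i.rhs < regs.size());
        switch (i.op) {
        case Op::LoadConst: regs[i.dst] = i.imm; break;
        case Op::Add: regs[i.dst] = regs[i.lhs] + regs[i.rhs]; break;
        case Op::Mul: regs[i.dst] = regs[i.lhs] * regs[i.rhs]; break;
        }
    }
    return regs;
}

// (2 + 3) * 4 with sequentially allocated registers and no reuse:
// five registers are used although two would be enough.
std::vector<Instr> example_program() {
    return {
        {Op::LoadConst, 0, 0, 0, 2},
        {Op::LoadConst, 1, 0, 0, 3},
        {Op::Add, 2, 0, 1, 0},
        {Op::LoadConst, 3, 0, 0, 4},
        {Op::Mul, 4, 2, 3, 0},
    };
}
```

Calling run(example_program(), 5) leaves the final product in register 4. Each instruction carries three register indices, which is the larger encoding mentioned in the trade-offs.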

MLIR Integration

MLIR provides an optimization infrastructure and alternative compilation path.

Compilation flow:

AST → MLIR Python Dialect → Optimizations → MLIR PythonBytecode Dialect → Bytecode

Key components:

Python Dialect (src/executable/mlir/Dialect/Python/): High-level Python operations (py.add, py.call, etc.) defined in TableGen
MLIRGenerator (src/executable/mlir/Dialect/Python/MLIRGenerator.hpp): Visitor over AST nodes that generates MLIR operations
PythonBytecode Dialect (src/executable/mlir/Dialect/EmitPythonBytecode/): Lower-level operations closer to final bytecode
Conversion Pass (src/executable/mlir/Conversion/PythonToPythonBytecode/): Lowers Python dialect → PythonBytecode dialect
Bytecode Emitter (src/executable/mlir/Target/PythonBytecode/): Translates MLIR to BytecodeProgram

Why MLIR?

Enables sophisticated optimizations (constant folding, DCE, inlining)
Infrastructure for future JIT compilation
Clean separation between frontend (Python semantics) and backend (codegen)
Can leverage MLIR's ecosystem of transformation passes

Python Objects as C++ Classes

All Python objects inherit from PyObject (src/runtime/PyObject.hpp):

class PyObject : public Cell {  // Cell enables garbage collection
    TypePrototype &m_type;       // Type information
    PyDict *m_attributes;        // Instance __dict__
};

TypePrototype pattern:

Template-based compile-time introspection
Slot functions for protocols (__add__, __getitem__, etc.)
Supports both C++ lambdas and PyObject methods

Value representation (src/runtime/Value.hpp):

py::Value is a discriminated union to avoid heap allocations for primitives
Can hold PyObject*, inline Number, String, or Bytes
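
As an illustration of the discriminated-union idea, the sketch below models a value with std::variant. The payload types (Number, String, Bytes) and the helpers is_inline and kind are simplified assumptions; the actual definitions in src/runtime/Value.hpp may look quite different.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

struct PyObject;  // stand-in for the heap-allocated, GC-managed object base

// Hypothetical simplified payloads; the real py::Value members may differ.
using Number = std::variant<std::int64_t, double>;
struct String { std::string s; };
struct Bytes { std::vector<std::uint8_t> b; };

using Value = std::variant<Number, String, Bytes, PyObject *>;

// True when the value is stored in the union itself rather than as a PyObject.
bool is_inline(const Value &v) { return !std::holds_alternative<PyObject *>(v); }

// Returns a tag naming the active alternative.
const char *kind(const Value &v) {
    struct Visitor {
        const char *operator()(const Number &) const { return "number"; }
        const char *operator()(const String &) const { return "string"; }
        const char *operator()(const Bytes &) const { return "bytes"; }
        const char *operator()(PyObject *) const { return "object"; }
    };
    return std::visit(Visitor{}, v);
}
```

An integer result can then travel through registers as a Number without creating a GC-managed object first; only values that need object identity or attributes fall back to the PyObject * alternative.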

Concrete types (src/runtime/):

Each Python type is a C++ class: PyInteger, PyString, PyList, PyDict, PyTuple, etc.
Implement Python protocols via methods

Interpreter and Runtime Interaction

Interpreter (src/interpreter/Interpreter.hpp) manages:

Current execution frame (m_current_frame: PyFrame*)
Module registry and import machinery
Global frame for module-level code
Exception state

Runtime provides object implementations and delegates protocol operations:

// VM executes the instruction and delegates object operations to the runtime
PyResult<Value> execute(VirtualMachine &vm, Interpreter &interpreter) {
    const auto &lhs = vm.reg(m_lhs);
    const auto &rhs = vm.reg(m_rhs);
    return add(lhs, rhs, interpreter);  // delegates to runtime
}

Frame management:

PyFrame: Python execution context (locals, globals, builtins)
StackFrame: VM state (registers, stack pointer)
Interpreter maintains frame chain for tracebacks

Important Patterns & Conventions

Result Type for Error Handling

All runtime operations return PyResult<T> for error propagation:

template<typename T> class PyResult;  // Either Ok(T) or Err(BaseException*)

PyResult<PyObject*> add(const PyObject*, const PyObject*);

Never throw C++ exceptions from runtime code; return a PyResult instead.
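
The following is a minimal sketch of how such a result type can work, assuming a std::variant-based layout. Result, its methods (is_ok, is_err, unwrap_err), divide and divide_twice are hypothetical illustrations and do not reproduce the real PyResult interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>

// Stand-in for the runtime's exception object (hypothetical, simplified).
struct BaseException { std::string message; };

template<typename T> class Result {
  public:
    static Result Ok(T value) { return Result(std::in_place_index<0>, std::move(value)); }
    static Result Err(BaseException *error) { return Result(std::in_place_index<1>, error); }

    bool is_ok() const { return m_state.index() == 0; }
    bool is_err() const { return !is_ok(); }
    const T &unwrap() const { assert(is_ok()); return std::get<0>(m_state); }
    BaseException *unwrap_err() const { assert(is_err()); return std::get<1>(m_state); }

  private:
    template<std::size_t I, typename U>
    Result(std::in_place_index_t<I> tag, U &&v) : m_state(tag, std::forward<U>(v)) {}

    std::variant<T, BaseException *> m_state;
};

BaseException g_zero_division{"division by zero"};

Result<long> divide(long lhs, long rhs) {
    if (rhs == 0) { return Result<long>::Err(&g_zero_division); }
    return Result<long>::Ok(lhs / rhs);
}

// Computes (a / b) / c and propagates the first error instead of throwing.
Result<long> divide_twice(long a, long b, long c) {
    auto first = divide(a, b);
    if (first.is_err()) { return first; }
    return divide(first.unwrap(), c);
}
```

The early return in divide_twice is the manual equivalent of exception propagation: every caller checks the result and hands the error upward unchanged.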

Visitor Pattern

Used extensively for:

AST traversal: ast::CodeGenerator with visit() methods for each AST node type
Garbage collection: Cell::Visitor for graph traversal
Both use the double-dispatch pattern
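
A compact sketch of double dispatch over a two-node AST is shown here. Node, Constant, BinaryAdd, Printer and render are hypothetical stand-ins; in the repository, ast::CodeGenerator plays the visitor role over the real node types.

```cpp
#include <memory>
#include <string>
#include <utility>

struct Constant;
struct BinaryAdd;

// Hypothetical visitor interface with one overload per node type.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visit(const Constant &node) = 0;
    virtual void visit(const BinaryAdd &node) = 0;
};

struct Node {
    virtual ~Node() = default;
    // First dispatch: the virtual call selects the node's dynamic type.
    virtual void accept(Visitor &v) const = 0;
};

struct Constant : Node {
    long value;
    explicit Constant(long v) : value(v) {}
    // Second dispatch: overload resolution picks visit(const Constant &).
    void accept(Visitor &v) const override { v.visit(*this); }
};

struct BinaryAdd : Node {
    std::unique_ptr<Node> lhs, rhs;
    BinaryAdd(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    void accept(Visitor &v) const override { v.visit(*this); }
};

// A visitor that renders the tree as text instead of emitting bytecode.
struct Printer : Visitor {
    std::string out;
    void visit(const Constant &node) override { out += std::to_string(node.value); }
    void visit(const BinaryAdd &node) override {
        out += "(";
        node.lhs->accept(*this);
        out += " + ";
        node.rhs->accept(*this);
        out += ")";
    }
};

std::string render(const Node &root) {
    Printer printer;
    root.accept(printer);
    return printer.out;
}
```

Adding a second visitor (for example, one that emits instructions) requires no change to the node classes, which is the main reason to use this pattern for both code generation and GC traversal.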

Scoping and Variables Resolution

VariablesResolver (src/executable/bytecode/codegen/VariablesResolver.hpp):

Pre-pass before bytecode generation
Analyzes variable scope (local, global, free variables, cell variables)
Critical for correct closure and nested function implementation

Name mangling (src/executable/Mangler.hpp):

Implements Python's private name mangling for class attributes (e.g., __private → _ClassName__private)
Used during bytecode generation
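
For reference, the mangling rule as documented for CPython can be sketched as a single function. The name and signature of mangle are assumptions made for this example; the interface in src/executable/Mangler.hpp may differ.

```cpp
#include <string>
#include <string_view>

// Applies private name mangling to an identifier used inside a class body.
// Hypothetical free function; not the Mangler.hpp interface.
std::string mangle(std::string_view class_name, std::string_view name) {
    const bool starts_private = name.size() >= 2 && name.substr(0, 2) == "__";
    const bool ends_dunder = name.size() >= 2 && name.substr(name.size() - 2) == "__";

    // Only names like __x are mangled; dunder names (__init__) and dotted
    // names (as in import statements) are left untouched.
    if (!starts_private || ends_dunder || name.find('.') != std::string_view::npos) {
        return std::string(name);
    }

    // Leading underscores of the class name are stripped; a class name made
    // only of underscores disables mangling.
    const auto first = class_name.find_first_not_of('_');
    if (first == std::string_view::npos) { return std::string(name); }

    std::string result = "_";
    result += class_name.substr(first);
    result += name;
    return result;
}
```

With this rule, mangle("ClassName", "__private") yields _ClassName__private, while __init__ and _single pass through unchanged.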

Control Flow

Uses Label objects for jumps and branches
Two-pass compilation: first emit code whose jumps reference labels, then resolve each label to its final instruction position
See src/executable/Label.hpp
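
The label-then-relocate scheme can be sketched as follows. Instr, Emitter and forward_jump_example are hypothetical names invented for this illustration; the real Label class and code generator are more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

struct Instr {
    bool is_jump = false;
    std::size_t label_id = 0;  // meaningful only for jumps
    std::size_t target = 0;    // absolute instruction index, set by relocate()
};

class Emitter {
  public:
    // Creates a label whose position is not known yet.
    std::size_t make_label() {
        m_labels.emplace_back();
        return m_labels.size() - 1;
    }

    void emit_plain() { m_code.push_back(Instr{}); }

    // Pass 1: record the jump with a symbolic target.
    void emit_jump(std::size_t label_id) {
        Instr instr;
        instr.is_jump = true;
        instr.label_id = label_id;
        m_code.push_back(instr);
    }

    // Binds the label to the index of the next instruction to be emitted.
    void bind(std::size_t label_id) { m_labels.at(label_id) = m_code.size(); }

    // Pass 2: replace every symbolic target with the bound position.
    // Returns false if some jump refers to a label that was never bound.
    bool relocate() {
        for (Instr &instr : m_code) {
            if (!instr.is_jump) { continue; }
            const auto &position = m_labels.at(instr.label_id);
            if (!position) { return false; }
            instr.target = *position;
        }
        return true;
    }

    const std::vector<Instr> &code() const { return m_code; }

  private:
    std::vector<Instr> m_code;
    std::vector<std::optional<std::size_t>> m_labels;
};

// Emits a forward jump over two instructions: 0: jump end, 1, 2, end: 3.
std::vector<Instr> forward_jump_example() {
    Emitter emitter;
    const auto end = emitter.make_label();
    emitter.emit_jump(end);  // forward reference: the target is unknown here
    emitter.emit_plain();
    emitter.emit_plain();
    emitter.bind(end);
    emitter.emit_plain();
    const bool ok = emitter.relocate();
    assert(ok);
    (void)ok;
    return emitter.code();
}
```

The second pass is needed because a forward jump, such as the exit of a loop or the else branch of an if, is emitted before the position of its target is known.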

Memory Management

Garbage Collection (src/memory/):

Mark-sweep collector
All objects inherit from Cell to participate in GC
Slab allocator for efficient small object allocation
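
A toy mark-sweep collector shows how Cell and its visitor cooperate. This is a sketch under simplifying assumptions: Cell, ListCell, Heap and their members are hypothetical, allocation uses std::unique_ptr instead of a slab allocator, and roots are passed in explicitly instead of being discovered from frames and registers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified Cell; the real one lives in src/memory/.
struct Cell {
    struct Visitor {
        virtual ~Visitor() = default;
        virtual void visit(Cell &cell) = 0;
    };
    virtual ~Cell() = default;
    // Reports every Cell this object references; leaf objects report none.
    virtual void visit_graph(Visitor &) {}
    bool marked = false;
};

struct ListCell : Cell {
    std::vector<Cell *> items;
    void visit_graph(Visitor &v) override {
        for (Cell *item : items) { v.visit(*item); }
    }
};

class Heap {
  public:
    template<typename T> T *allocate() {
        auto cell = std::make_unique<T>();
        T *raw = cell.get();
        m_cells.push_back(std::move(cell));
        return raw;
    }

    // Marks everything reachable from the roots, frees the rest,
    // and returns the number of cells freed.
    std::size_t collect(const std::vector<Cell *> &roots) {
        struct Marker : Cell::Visitor {
            void visit(Cell &cell) override {
                if (cell.marked) { return; }  // already seen; also breaks cycles
                cell.marked = true;
                cell.visit_graph(*this);
            }
        } marker;
        for (Cell *root : roots) {
            assert(root != nullptr);
            marker.visit(*root);
        }

        const std::size_t before = m_cells.size();
        m_cells.erase(std::remove_if(m_cells.begin(), m_cells.end(),
                          [](const std::unique_ptr<Cell> &c) { return !c->marked; }),
            m_cells.end());
        for (auto &c : m_cells) { c->marked = false; }  // reset for the next cycle
        return before - m_cells.size();
    }

    std::size_t live() const { return m_cells.size(); }

  private:
    std::vector<std::unique_ptr<Cell>> m_cells;
};
```

Because marking follows references from the roots rather than counting them, cyclic structures are reclaimed once nothing reachable points at them. A type that forgets to report a reference in visit_graph would have that object freed while still in use, which is the kind of bug a low --gc-frequency value helps expose.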

Factory functions:

static PyObject* create(...);  // Allocates via VirtualMachine::heap()

Directory Structure

Core Components

Execution:

src/vm/ - Register-based virtual machine
src/interpreter/ - Execution control, frame management, module system
src/executable/ - Compiled program representations (BytecodeProgram, etc.)

Frontend (CPython-compatible):

src/lexer/ - Tokenization
src/parser/ - Recursive descent parser
src/ast/ - Abstract syntax tree nodes

Compilation:

src/executable/bytecode/codegen/ - Register bytecode generator
src/executable/bytecode/instructions/ - ~80 instruction types
src/executable/mlir/ - MLIR compilation pipeline
- Dialect/Python/ - High-level Python dialect (TableGen definitions)
- Dialect/EmitPythonBytecode/ - Low-level bytecode dialect
- Conversion/ - Lowering passes between dialects
- Target/ - Final bytecode emission from MLIR

Runtime:

src/runtime/ - Python object implementations (PyInteger, PyList, PyDict, etc.)
src/runtime/types/ - Built-in type definitions
src/runtime/modules/ - Standard library modules (sys, builtins, math, etc.)

Memory:

src/memory/ - Mark-sweep garbage collector, slab allocator

Other:

src/utilities/ - Helper utilities and freeze tool
src/repl/ - Interactive shell (uses linenoise)
src/testing/ - Test infrastructure

Integration Tests

Location: integration/

Run integration tests:

# Language-feature test suite
./integration/run_python_tests.sh ./build/release/src/python

# Full integration run (examples + run_python_tests.sh + LLVM backend)
./integration/run_integration_tests.sh ./build/release/src/python

Test categories:

integration/tests/ - Python scripts testing various language features
integration/aoc/ - Advent of Code solutions used as larger programs
integration/fibonacci/ - Fibonacci example
integration/mandelbrot/ - Mandelbrot set computation
integration/llvm/ - LLVM backend tests (experimental)

Test structure:

Tests should assert using Python's assert statement
Scripts exit with code 0 on success, non-zero on failure
Tests run with --gc-frequency flag to stress-test garbage collector

Development Workflow

Adding a New Bytecode Instruction

Define instruction in src/executable/bytecode/instructions/
Add to instruction set enumeration
Implement execute() method that takes VM and Interpreter
Register in instruction decoder
Update BytecodeGenerator to emit the instruction when visiting relevant AST nodes

Adding a New MLIR Operation

Define operation in TableGen: src/executable/mlir/Dialect/Python/IR/PythonOps.td
Build to generate C++ code from TableGen
Add emission in MLIRGenerator when visiting AST nodes
Add lowering to PythonBytecode dialect in conversion pass
Add bytecode emission in Target

Adding a New Python Type

Create class inheriting from PyObject in src/runtime/
Implement Python protocols as methods
Create TypePrototype registration
Add factory function using VirtualMachine::heap()
Implement GC visitor if type contains references to other objects
Add to builtins in src/runtime/modules/BuiltinsModule.cpp

Debugging

GC debugging:

Use --gc-frequency N to trigger GC every N allocations
Useful for finding object lifetime bugs

Bytecode inspection:

Run with --bytecode (or -b) to print generated instructions; --ast/-a and --tokenize/-t dump the AST and token stream

MLIR pipeline debugging:

Set MLIR_PRINT_IR_AFTER_ALL=1 when running the python binary to dump the IR after every pass (e.g. MLIR_PRINT_IR_AFTER_ALL=1 ./build/release/src/python <script.py>). The interpreter parses its own args with cxxopts and does not expose MLIR's -mlir-print-* command-line flags directly.
The standalone python-mlir-opt tool (src/executable/mlir/tools/python-mlir-opt/) is a regular mlir-opt-style driver and does accept MLIR's CL flags.

Compatibility with CPython

What's the same:

Token types from the lexer
Grammar specification for the parser
AST node types
Python 3.9 language semantics

What's different:

VM architecture (register-based vs stack-based)
Runtime implementation (C++ classes vs C structs)
Bytecode format (incompatible with CPython .pyc files)
Performance characteristics (no complete JIT yet; the register VM trades larger instruction encodings for fewer memory accesses)

Testing Philosophy

The codebase maintains compatibility by keeping the frontend (lexer, parser, AST) identical to CPython while innovating in the backend (VM, runtime). Integration tests in integration/tests/ verify Python semantics are preserved.
