Debugging and Profiling JIT Compiled LLVM-IR

I have recently been working on a code-generating database that uses LLVM to compile SQL queries to machine code. My biggest pain point has been understanding and debugging the generated LLVM-IR and the compiled machine code. This is challenging for two reasons: 1) the code is generated and linked into the host process at runtime, and 2) we do not currently have debug symbols, because it is not immediately obvious what source the debug symbols should refer to. In this post, I have collected a few techniques I have been using and demonstrate how to use them.

This post is targeted at readers who are 1) generating LLVM-IR and 2) using or writing their own JIT compiler with the LLVM ORC APIs. To make things more concrete, I have implemented an example JIT similar to the KaleidoscopeJIT used in the LLVM Building a JIT tutorial. You can find all the code for this post here. I’ll be using LLVM-17, but the concepts are applicable at least as far back as LLVM-14, even if the code will not quite compile against it.

Infrastructure and Running Example

We start with a basic JIT using the ORCv2 APIs. For our running example, we will generate three functions in a module, optimize and compile the module, link the resulting symbols into our process, and call the generated functions from our main function. Two of the generated functions take two integer arguments and return their sum, but one (deliberately) has a bug. The third function takes a pointer to an array of integers and the array size and returns the sum of the array elements; LLVM can perform more interesting optimizations on this last function. Below is the code of our main function:

int main() {
  initializeLLVM();
  // inserting into an llvm::Module created in `initializeLLVM`
  createAddFunction();
  createBuggyAddFunction();
  createArraySumFunction();

  // compile our code
  auto TSM = llvm::orc::ThreadSafeModule(std::move(TheModule), std::move(TheContext));
  ExitOnErr(TheJIT->addModule(std::move(TSM)));

  auto add_symbol = ExitOnErr(TheJIT->lookup("add"));
  int (*add_fp)(int, int) = add_symbol.getAddress().toPtr<int (*)(int, int)>();

  auto array_sum_symbol = ExitOnErr(TheJIT->lookup("arraySum"));
  int (*array_sum_fp)(int*, int) = array_sum_symbol.getAddress().toPtr<int (*)(int*, int)>();

  auto buggy_add_symbol = ExitOnErr(TheJIT->lookup("buggyAdd"));
  int (*buggy_add_fp)(int, int) = buggy_add_symbol.getAddress().toPtr<int (*)(int, int)>();

  int* arr = new int[1024];
  for (int i = 0; i < 1024; i++) {
    arr[i] = i + 1;
  }

  std::cout << "Adding 1+2 = " << add_fp(1, 2) << std::endl;
  std::cout << "sum of {1, 2, ... 1024} = " << array_sum_fp(arr, 1024) << std::endl;
  std::cout << "(with bugs) adding 1+2 = " << buggy_add_fp(1, 2) << std::endl;
  return 0;
}
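
The IR generation itself is plain IRBuilder code. Below is a sketch of what createAddFunction might look like, assuming the globals TheContext and TheModule plus an llvm::IRBuilder<> named Builder are set up in initializeLLVM, as in the Kaleidoscope tutorials (the repository's actual code may differ):

// A sketch of createAddFunction; TheContext, TheModule and Builder are assumed
// to be set up in initializeLLVM, as in the Kaleidoscope tutorials.
llvm::Function *createAddFunction() {
  auto *Int32Ty = llvm::Type::getInt32Ty(*TheContext);
  auto *FT = llvm::FunctionType::get(Int32Ty, {Int32Ty, Int32Ty}, /*isVarArg=*/false);
  auto *F = llvm::Function::Create(FT, llvm::Function::ExternalLinkage, "add", TheModule.get());
  Builder->SetInsertPoint(llvm::BasicBlock::Create(*TheContext, "entry", F));
  // %2 = add i32 %0, %1 ; ret i32 %2
  Builder->CreateRet(Builder->CreateAdd(F->getArg(0), F->getArg(1)));
  return F;
}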

And this is our simple JIT, which optimizes the IR before compiling it to machine code.

// jit.hpp
class MyJIT {
 private:
  llvm::orc::ExecutionSession ES;
  llvm::orc::JITTargetMachineBuilder JTMB;
  llvm::DataLayout DL;
  llvm::orc::MangleAndInterner Mangle;
  llvm::orc::RTDyldObjectLinkingLayer LinkingLayer;
  llvm::orc::IRCompileLayer CompileLayer;
  llvm::orc::IRTransformLayer TransformLayer;
  llvm::orc::JITDylib &MainJD;

 public:
  MyJIT()
      : ES{llvm::cantFail(llvm::orc::SelfExecutorProcessControl::Create())},
        JTMB(llvm::cantFail(llvm::orc::JITTargetMachineBuilder::detectHost())
                 .setCodeGenOptLevel(llvm::CodeGenOpt::Aggressive)
                 .setCodeModel(llvm::CodeModel::Model::Large)),
        DL(llvm::cantFail(JTMB.getDefaultDataLayoutForTarget())),
        Mangle(ES, DL),
        LinkingLayer(ES, []() { return std::make_unique<llvm::SectionMemoryManager>(); }),
        CompileLayer(ES, LinkingLayer, std::make_unique<llvm::orc::ConcurrentIRCompiler>(JTMB)),
        TransformLayer(ES, CompileLayer,
                       [TM = JTMB.createTargetMachine()](
                           llvm::orc::ThreadSafeModule TSM,
                           const llvm::orc::MaterializationResponsibility &R) mutable
                       -> llvm::Expected<llvm::orc::ThreadSafeModule> {
                         if (!TM) return TM.takeError();
                         return optimizeModule(std::move(TSM), std::move(TM.get()));
                       }),
        MainJD(ES.createBareJITDylib("<main>")) {
    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }

  ~MyJIT();

  const llvm::DataLayout &getDataLayout() const { return DL; }

  llvm::Error addModule(llvm::orc::ThreadSafeModule TSM) {
    return TransformLayer.add(MainJD, std::move(TSM));
  }

  llvm::Expected<llvm::orc::ExecutorSymbolDef> lookup(llvm::StringRef Name) {
    return ES.lookup({&MainJD}, Mangle(Name.str()));
  }

 private:
  static llvm::Expected<llvm::orc::ThreadSafeModule> optimizeModule(
      llvm::orc::ThreadSafeModule TSM, std::unique_ptr<llvm::TargetMachine> TM);
};

The complete code for the baseline implementation is in this commit. Note that, like the LLVM tutorials, the code is minimal and intended to show how to use the APIs, not as an example of good software engineering.
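
The class only declares optimizeModule. As a rough sketch of what an implementation might look like using the new pass manager with a default O3 pipeline (the pipeline in the accompanying repository may differ):

// jit.cpp: a sketch of optimizeModule, assuming a default O3 pipeline
llvm::Expected<llvm::orc::ThreadSafeModule> MyJIT::optimizeModule(
    llvm::orc::ThreadSafeModule TSM, std::unique_ptr<llvm::TargetMachine> TM) {
  TSM.withModuleDo([&TM](llvm::Module &M) {
    // Standard new-pass-manager setup: analysis managers plus a PassBuilder
    // constructed with the JIT's TargetMachine so passes see the right target.
    llvm::LoopAnalysisManager LAM;
    llvm::FunctionAnalysisManager FAM;
    llvm::CGSCCAnalysisManager CGAM;
    llvm::ModuleAnalysisManager MAM;

    llvm::PassBuilder PB(TM.get());
    PB.registerModuleAnalyses(MAM);
    PB.registerCGSCCAnalyses(CGAM);
    PB.registerFunctionAnalyses(FAM);
    PB.registerLoopAnalyses(LAM);
    PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

    llvm::ModulePassManager MPM = PB.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O3);
    MPM.run(M, MAM);
  });
  return TSM;
}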

Printing Generated Code

The first and most obvious approach is to print the generated LLVM-IR in a human-readable format. This enables static analysis of the code¹: staring at it until you find your bug.

Dumping Generated IR

The simplest way to see your generated IR is with the dump() method, which prints the IR of the dumped object to stderr. This is convenient for debugging small amounts of generated IR, but it cannot easily be used to examine IR after optimization, lacks syntax highlighting, and is intermingled with anything else written to stderr. Example:

int main() {
  auto* buggy_add_func_ir = createBuggyAddFunction();
  buggy_add_func_ir->dump();
  ...

Output:

hamish@server:~$ ./main
define i32 @buggyAdd(i32 %0, i32 %1) {
entry:
  %2 = add i32 %0, %1
  %3 = load i32, ptr inttoptr (i64 42 to ptr), align 4
  %4 = add i32 %2, %3
  ret i32 %4
}

Adding 1+2 = 3
sum of {1, 2, ... 1024} = 524800
Segmentation fault

Printing Generated and Optimized IR with IRTransformLayer

My preferred way to capture both the generated and the optimized IR is to use an IRTransformLayer. The IRTransformLayer is quite straightforward: it is a layer that takes a module, applies a transformation to it, and then emits the module. In our case, we will not actually transform the module; we will print it to a file and return it unchanged. This approach is neat because it allows us to inspect the IR at multiple points in the compilation pipeline by adding multiple IR-printing IRTransformLayers, e.g., both before and after the module has been optimized.

Below, we have extended MyJIT with two additional layers, PrintGeneratedIRLayer and PrintOptimizedIRLayer, both of which use a lambda as the transformation function. The lambda conditionally calls a new method, printIR, depending on the boolean print_generated_code. Here, print_generated_code is always true, but you could set it through a command-line argument, for example. PrintGeneratedIRLayer is also now the first layer of our compilation pipeline, so we need to modify our addModule method accordingly.

jit.hpp
class MyJIT {
 private:
  const bool print_generated_code = true;
  llvm::orc::ExecutionSession ES;
  llvm::orc::JITTargetMachineBuilder JTMB;
  llvm::DataLayout DL;
  llvm::orc::MangleAndInterner Mangle;
  llvm::orc::RTDyldObjectLinkingLayer LinkingLayer;
  llvm::orc::IRCompileLayer CompileLayer;
  llvm::orc::IRTransformLayer PrintOptimizedIRLayer;
  llvm::orc::IRTransformLayer TransformLayer;
  llvm::orc::IRTransformLayer PrintGeneratedIRLayer;
  llvm::orc::JITDylib &MainJD;

 public:
  MyJIT()
      : ES{llvm::cantFail(llvm::orc::SelfExecutorProcessControl::Create())},
        JTMB(llvm::cantFail(llvm::orc::JITTargetMachineBuilder::detectHost())
                 .setCodeGenOptLevel(llvm::CodeGenOpt::Aggressive)
                 .setCodeModel(llvm::CodeModel::Model::Large)),
        DL(llvm::cantFail(JTMB.getDefaultDataLayoutForTarget())),
        Mangle(ES, DL),
        LinkingLayer(ES, []() { return std::make_unique<llvm::SectionMemoryManager>(); }),
        CompileLayer(ES, LinkingLayer, std::make_unique<llvm::orc::ConcurrentIRCompiler>(JTMB)),
        PrintOptimizedIRLayer(
            ES, CompileLayer,
            [print_generated_code = this->print_generated_code](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "_opt");
              }
              return std::move(TSM);
            }),
        TransformLayer(ES, PrintOptimizedIRLayer,
                       [TM = JTMB.createTargetMachine()](
                           llvm::orc::ThreadSafeModule TSM,
                           const llvm::orc::MaterializationResponsibility &R) mutable
                       -> llvm::Expected<llvm::orc::ThreadSafeModule> {
                         if (!TM) return TM.takeError();
                         return optimizeModule(std::move(TSM), std::move(TM.get()));
                       }),
        PrintGeneratedIRLayer(
            ES, TransformLayer,
            [print_generated_code = this->print_generated_code](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "");
              }
              return std::move(TSM);
            }),
        MainJD(ES.createBareJITDylib("<main>")) {
    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }

  ~MyJIT() {
    if (auto Err = ES.endSession()) ES.reportError(std::move(Err));
  }

  const llvm::DataLayout &getDataLayout() const { return DL; }

  llvm::Error addModule(llvm::orc::ThreadSafeModule TSM) {
    return PrintGeneratedIRLayer.add(MainJD, std::move(TSM));
  }

  llvm::Expected<llvm::orc::ExecutorSymbolDef> lookup(llvm::StringRef Name) {
    return ES.lookup({&MainJD}, Mangle(Name.str()));
  }

 private:
  static llvm::Expected<llvm::orc::ThreadSafeModule> optimizeModule(
      llvm::orc::ThreadSafeModule TSM, std::unique_ptr<llvm::TargetMachine> TM);

  static llvm::Expected<llvm::orc::ThreadSafeModule> printIR(llvm::orc::ThreadSafeModule TSM,
                                                             const std::string &suffix = "") {
    TSM.withModuleDo([&suffix](llvm::Module &m) {
      std::error_code EC;
      const std::string output_directory = "generated_code/";
      const std::string output_file = m.getName().str() + suffix + ".ll";
      llvm::sys::fs::create_directory(output_directory, /*ignoreExisting=*/true);

      llvm::raw_fd_ostream out(output_directory + output_file, EC,
                               llvm::sys::fs::OpenFlags::OF_None);
      m.print(out, nullptr, false, true);
    });
    return TSM;
  }
};


And now we have both our generated and optimized IR conveniently in the generated_code/ directory, where we can open and view them with syntax highlighting.

Generated (unoptimized) IR: my_module.ll
; ModuleID = 'my_module'
source_filename = "my_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i32 @add(i32 %0, i32 %1) {
entry:
  %2 = add i32 %0, %1
  ret i32 %2
}

define i32 @buggyAdd(i32 %0, i32 %1) {
entry:
  %2 = add i32 %0, %1
  %3 = load i32, ptr inttoptr (i64 42 to ptr), align 4
  %4 = add i32 %2, %3
  ret i32 %4
}

define i32 @arraySum(ptr %0, i32 %1) {
entry:
  %sum = alloca i32, align 4
  store i32 0, ptr %sum, align 4
  %index = alloca i32, align 4
  store i32 0, ptr %index, align 4
  br label %loop_cond

loop_cond:                                        ; preds = %entry
  %loaded_index = load i32, ptr %index, align 4
  %2 = icmp slt i32 %loaded_index, %1
  br i1 %2, label %loop_body, label %exit

loop_body:                                        ; preds = %loop_body, %loop_cond
  %3 = load i32, ptr %index, align 4
  %4 = getelementptr i32, ptr %0, i32 %3
  %5 = load i32, ptr %4, align 4
  %6 = load i32, ptr %index, align 4
  %7 = add i32 %6, 1
  %8 = load i32, ptr %sum, align 4
  %newSum = add i32 %8, %5
  store i32 %7, ptr %index, align 4
  store i32 %newSum, ptr %sum, align 4
  %9 = load i32, ptr %index, align 4
  %10 = icmp slt i32 %9, %1
  br i1 %10, label %loop_body, label %exit

exit:                                             ; preds = %loop_body, %loop_cond
  %11 = load i32, ptr %sum, align 4
  ret i32 %11
}
Optimized IR: my_module_opt.ll
; ModuleID = 'my_module'
source_filename = "my_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define i32 @add(i32 %0, i32 %1) local_unnamed_addr #0 {
entry:
  %2 = add i32 %1, %0
  ret i32 %2
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(read, inaccessiblemem: none)
define i32 @buggyAdd(i32 %0, i32 %1) local_unnamed_addr #1 {
entry:
  %2 = add i32 %1, %0
  %3 = load i32, ptr inttoptr (i64 42 to ptr), align 4
  %4 = add i32 %2, %3
  ret i32 %4
}

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: read)
define i32 @arraySum(ptr nocapture readonly %0, i32 %1) local_unnamed_addr #2 {
entry:
  %2 = icmp sgt i32 %1, 0
  br i1 %2, label %loop_body.preheader, label %exit

loop_body.preheader:                              ; preds = %entry
  %wide.trip.count = zext i32 %1 to i64
  %min.iters.check = icmp ult i32 %1, 32
  br i1 %min.iters.check, label %loop_body.preheader14, label %vector.ph

vector.ph:                                        ; preds = %loop_body.preheader
  %n.vec = and i64 %wide.trip.count, 4294967264
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %7, %vector.body ]
  %vec.phi6 = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %8, %vector.body ]
  %vec.phi7 = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %9, %vector.body ]
  %vec.phi8 = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %10, %vector.body ]
  %3 = getelementptr i32, ptr %0, i64 %index
  %wide.load = load <8 x i32>, ptr %3, align 4
  %4 = getelementptr i32, ptr %3, i64 8
  %wide.load9 = load <8 x i32>, ptr %4, align 4
  %5 = getelementptr i32, ptr %3, i64 16
  %wide.load10 = load <8 x i32>, ptr %5, align 4
  %6 = getelementptr i32, ptr %3, i64 24
  %wide.load11 = load <8 x i32>, ptr %6, align 4
  %7 = add <8 x i32> %wide.load, %vec.phi
  %8 = add <8 x i32> %wide.load9, %vec.phi6
  %9 = add <8 x i32> %wide.load10, %vec.phi7
  %10 = add <8 x i32> %wide.load11, %vec.phi8
  %index.next = add nuw i64 %index, 32
  %11 = icmp eq i64 %index.next, %n.vec
  br i1 %11, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  %bin.rdx = add <8 x i32> %8, %7
  %bin.rdx12 = add <8 x i32> %9, %bin.rdx
  %bin.rdx13 = add <8 x i32> %10, %bin.rdx12
  %12 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %bin.rdx13)
  %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
  br i1 %cmp.n, label %exit, label %loop_body.preheader14

loop_body.preheader14:                            ; preds = %loop_body.preheader, %middle.block
  %indvars.iv.ph = phi i64 [ 0, %loop_body.preheader ], [ %n.vec, %middle.block ]
  %sum.0.ph = phi i32 [ 0, %loop_body.preheader ], [ %12, %middle.block ]
  br label %loop_body

loop_body:                                        ; preds = %loop_body.preheader14, %loop_body
  %indvars.iv = phi i64 [ %indvars.iv.next, %loop_body ], [ %indvars.iv.ph, %loop_body.preheader14 ]
  %sum.0 = phi i32 [ %newSum, %loop_body ], [ %sum.0.ph, %loop_body.preheader14 ]
  %13 = getelementptr i32, ptr %0, i64 %indvars.iv
  %14 = load i32, ptr %13, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %newSum = add i32 %14, %sum.0
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %exit, label %loop_body, !llvm.loop !3

exit:                                             ; preds = %loop_body, %middle.block, %entry
  %sum.1 = phi i32 [ 0, %entry ], [ %12, %middle.block ], [ %newSum, %loop_body ]
  ret i32 %sum.1
}

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %0) #3

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
attributes #1 = { mustprogress nofree norecurse nosync nounwind willreturn memory(read, inaccessiblemem: none) }
attributes #2 = { nofree norecurse nosync nounwind memory(argmem: read) }
attributes #3 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }

!0 = distinct !{!0, !1, !2}
!1 = !{!"llvm.loop.isvectorized", i32 1}
!2 = !{!"llvm.loop.unroll.runtime.disable"}
!3 = distinct !{!3, !2, !1}


This is quite useful for inspecting which optimizations LLVM was able to apply to our generated code. For example, we can see that arraySum was vectorized.

Dumping Object Files

Up to this point, we’ve explored methods to examine the generated IR. Often, this is enough to debug our code, especially since I find LLVM-IR easier to parse than assembly. Occasionally, however, I need to know the specific CPU instructions that LLVM emits. For example, we may wish to know which CPU instruction(s) the load <8 x i32> in the vectorized arraySum is lowered to. One option is to compile the IR files we saved to disk with llc. However, we would have to carefully select the command-line flags to ensure llc generates the same machine code as our JIT.
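
For instance, an invocation along these lines; the flags shown are illustrative and would need to be matched against the options set in our JITTargetMachineBuilder (opt level, code model, target CPU):

hamish@server:~$ llc -O3 -mcpu=native --code-model=large -filetype=obj generated_code/my_module_opt.ll -o my_module.o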

If we want to examine the exact machine code our optimized IR is compiled to, it is better to dump the JIT-compiled code directly. Thankfully, LLVM includes the llvm::orc::DumpObjects utility class to dump a compiled in-memory module, in the form of an llvm::MemoryBuffer, to disk. We use a similar approach to the one used to save our IR to disk: we create an instance of DumpObjects, and instead of an IRTransformLayer, we use an ObjectTransformLayer between the CompileLayer and the LinkingLayer.

jit.hpp
class MyJIT {
 private:
  const bool print_generated_code = true;
  const bool dump_compiled_object_files = true;
  ...
  llvm::orc::RTDyldObjectLinkingLayer LinkingLayer;
  llvm::orc::ObjectTransformLayer::TransformFunction DumpObjectTransform;
  llvm::orc::ObjectTransformLayer DumpObjectTransformLayer;
  llvm::orc::IRCompileLayer CompileLayer;
  llvm::orc::IRTransformLayer PrintOptimizedIRLayer;
  llvm::orc::IRTransformLayer TransformLayer;
  llvm::orc::IRTransformLayer PrintGeneratedIRLayer;
  llvm::orc::JITDylib &MainJD;

 public:
  MyJIT()
      : ES{llvm::cantFail(llvm::orc::SelfExecutorProcessControl::Create())},
        JTMB(llvm::cantFail(llvm::orc::JITTargetMachineBuilder::detectHost())
                 .setCodeGenOptLevel(llvm::CodeGenOpt::Aggressive)
                 .setCodeModel(llvm::CodeModel::Model::Large)),
        DL(llvm::cantFail(JTMB.getDefaultDataLayoutForTarget())),
        Mangle(ES, DL),
        LinkingLayer(ES, []() { return std::make_unique<llvm::SectionMemoryManager>(); }),
        DumpObjectTransform{llvm::orc::DumpObjects("generated_code/")},
        DumpObjectTransformLayer(ES, LinkingLayer,
                                 [&transform = this->DumpObjectTransform,
                                  dump_compiled_object_files = this->dump_compiled_object_files](
                                     std::unique_ptr<llvm::MemoryBuffer> buf)
                                     -> llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>> {
                                   if (dump_compiled_object_files) {
                                     return transform(std::move(buf));
                                   } else {
                                     return std::move(buf);
                                   }
                                 }),
        CompileLayer(ES, DumpObjectTransformLayer,
                     std::make_unique<llvm::orc::ConcurrentIRCompiler>(JTMB)),
        PrintOptimizedIRLayer(
            ES, CompileLayer,
            [print_generated_code = this->print_generated_code](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "_opt");
              }
              return std::move(TSM);
            }),
        TransformLayer(ES, PrintOptimizedIRLayer,
                       [TM = JTMB.createTargetMachine()](
                           llvm::orc::ThreadSafeModule TSM,
                           const llvm::orc::MaterializationResponsibility &R) mutable
                       -> llvm::Expected<llvm::orc::ThreadSafeModule> {
                         if (!TM) return TM.takeError();
                         return optimizeModule(std::move(TSM), std::move(TM.get()));
                       }),
        PrintGeneratedIRLayer(
            ES, TransformLayer,
            [print_generated_code = this->print_generated_code](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "");
              }
              return std::move(TSM);
            }),
        MainJD(ES.createBareJITDylib("<main>")) {
    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }


Now when we JIT our code, we also get an object file in the generated_code/ directory, which can be inspected using standard utilities such as objdump. For example, to see how the arraySum function was lowered to x86 assembly:

hamish@server:~$ objdump -M intel --disassemble=arraySum generated_code/my_module-jitted-objectbuffer.o --no-show-raw-insn
generated_code/my_module-jitted-objectbuffer.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000020 <arraySum>:
arraySum():
  20:   test   esi,esi
  22:   jle    34 <arraySum+0x14>
  24:   mov    ecx,esi
  26:   cmp    esi,0x20
  29:   jae    37 <arraySum+0x17>
  2b:   xor    edx,edx
  2d:   xor    eax,eax
  2f:   jmp    c0 <arraySum+0xa0>
  34:   xor    eax,eax
  36:   ret
  37:   mov    eax,ecx
  39:   shr    eax,0x5
  3c:   mov    edx,ecx
  3e:   and    edx,0xffffffe0
  41:   vpxor  xmm0,xmm0,xmm0
  45:   xor    esi,esi
  47:   vpxor  xmm1,xmm1,xmm1
  4b:   vpxor  xmm2,xmm2,xmm2
  4f:   vpxor  xmm3,xmm3,xmm3
  53:   shl    rax,0x7
  57:   nop    WORD PTR [rax+rax*1+0x0]
  60:   vpaddd ymm0,ymm0,YMMWORD PTR [rdi+rsi*1]
  65:   vpaddd ymm1,ymm1,YMMWORD PTR [rdi+rsi*1+0x20]
  6b:   vpaddd ymm2,ymm2,YMMWORD PTR [rdi+rsi*1+0x40]
  71:   vpaddd ymm3,ymm3,YMMWORD PTR [rdi+rsi*1+0x60]
  77:   sub    rsi,0xffffffffffffff80
  7b:   cmp    rax,rsi
  7e:   jne    60 <arraySum+0x40>
  80:   vpaddd ymm0,ymm1,ymm0
  84:   vpaddd ymm1,ymm3,ymm2
  88:   vpaddd ymm0,ymm1,ymm0
  8c:   vextracti128 xmm1,ymm0,0x1
  92:   vpaddd xmm0,xmm0,xmm1
  96:   vpshufd xmm1,xmm0,0xee
  9b:   vpaddd xmm0,xmm0,xmm1
  9f:   vpshufd xmm1,xmm0,0x55
  a4:   vpaddd xmm0,xmm0,xmm1
  a8:   vmovd  eax,xmm0
  ac:   cmp    rdx,rcx
  af:   je     cb <arraySum+0xab>
  b1:   data16 data16 data16 data16 data16 cs nop WORD PTR [rax+rax*1+0x0]
  c0:   add    eax,DWORD PTR [rdi+rdx*4]
  c3:   inc    rdx
  c6:   cmp    rcx,rdx
  c9:   jne    c0 <arraySum+0xa0>
  cb:   vzeroupper
  ce:   ret

Here we can see that an AVX2 vpaddd with a memory source operand is used for the combination of the load <8 x i32> and add <8 x i32>.

Debugging with GDB and Profiling with VTune/Perf

Often, just staring at IR and assembly is not enough to debug or identify performance issues; we need tooling that helps us understand our code at runtime, such as debuggers like GDB or profilers like perf and VTune. To be useful, debuggers and profilers need to know the program’s symbols and debug information (if present). This is more complicated for programs that generate and compile code at runtime, as these tools usually rely on object files for symbols and debug info. Luckily for us, the hard work has already been done. GDB has an interface that allows JIT compilers to register code compiled at runtime. Perf has two interfaces to support JIT’ed code. The first (documented here) requires the application to write each symbol’s start address, code size, and name to /tmp/perf-{PID}.map; a sketch of this interface is shown below. This keeps the symbol information around after the process terminates so that perf report etc. can resolve symbols. It requires no change in how perf is used, but you will not be able to create source file annotations. The second perf interface is a jitdump file. This format provides more information to perf, including about the program source, but it requires slightly different usage of the perf commands, which is shown later. VTune also has an interface as part of Intel’s ITT API. Unfortunately, if your preferred profiler on AMD platforms is AMDuProf, you are out of luck: at least as of late 2023, AMDuProf does not support profiling LLVM JIT generated code.
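
As an aside, the map-file interface is simple enough to implement by hand. The following is a hypothetical sketch (registerSymbolWithPerf is not part of the example JIT); each map line is the symbol’s start address, code size, and name, with the numbers in hex:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unistd.h>

// Hypothetical sketch: append one "START SIZE NAME" line to /tmp/perf-<PID>.map.
void registerSymbolWithPerf(std::uintptr_t start, std::size_t size, const char *name) {
  static std::FILE *map = [] {
    char path[64];
    std::snprintf(path, sizeof(path), "/tmp/perf-%d.map", static_cast<int>(getpid()));
    return std::fopen(path, "a");
  }();
  if (map) {
    std::fprintf(map, "%llx %zx %s\n", static_cast<unsigned long long>(start), size, name);
    std::fflush(map);
  }
}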

The LLVM JITEventListener interface is used to notify clients when objects are loaded into memory, i.e., when we have newly compiled code. There are listener implementations for several tools, including GDB, perf, and VTune; the perf listener is based on the jitdump file interface. Note: the VTune listener requires LLVM to be configured with -DLLVM_USE_INTEL_JITEVENTS=ON, and the perf listener requires -DLLVM_USE_PERF=ON. If your LLVM is not built with these listeners enabled, nothing will happen, because failing to create the listeners is not an error; you may spend substantial time wondering why they are not working. Installing the listeners is straightforward. Using the GDB and perf listeners as an example:

jit.hpp
class MyJIT {
 private:
  ...
  llvm::JITEventListener *PerfListener;
  llvm::JITEventListener *GDBListener;

 public:
  MyJIT()
      : ...
        MainJD(ES.createBareJITDylib("<main>")),
        PerfListener(llvm::JITEventListener::createPerfJITEventListener()),
        GDBListener(llvm::JITEventListener::createGDBRegistrationListener()) {
    if (PerfListener == nullptr) {
      std::cout << "Could not create Perf listener. Perhaps LLVM was not "
                   "compiled with perf support (LLVM_USE_PERF)."
                << std::endl;
    } else {
      LinkingLayer.registerJITEventListener(*PerfListener);
    }

    if (GDBListener == nullptr) {
      std::cout << "Could not create GDB listener." << std::endl;
    } else {
      LinkingLayer.registerJITEventListener(*GDBListener);
    }

    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }
Debugging in GDB without the JIT listener (left) and with the listener (right)

While GDB does work without the listener, it does not know how to access the symbol table for the generated code. For example, you can find the instruction that triggers a segfault and inspect register values; however, without additional context such as logs, it is very difficult to work out where in the code the segfault occurs, because we do not even have frame names. When we add the listener, GDB has the symbol table, so we have frame names. This narrows down the region of suspect generated code to a single function.

To check whether the listeners are working in GDB, the set debug jit on command causes GDB to print information from its JIT API, maint info jit displays the address of the JIT’ed code and the address of the symbol file, and maint info symtabs lists all the symbol tables.
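
For example, a quick check that the JIT’ed symbols were registered might look like this (output omitted, as it varies by setup):

(gdb) set debug jit on
(gdb) run
(gdb) maint info jit
(gdb) maint info symtabs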

Unlike GDB, perf does not work at all for profiling JIT’ed code without the listener. This is because perf traces mmap events to locate the compiled program so that it can map instruction pointers to code, and JIT’ed code has no backing file. The LLVM perf listener uses the perf jitdump interface: the listener mmaps the jitdump file, a record of that mmap ends up in the recorded perf profile, and this allows perf to locate the jitdump file. The file is identified as a jitdump file based on the format of its file name.

To profile generated code that has been registered with the listener, we need to do a little extra work. First, when recording, we need to ensure we are using the monotonic clock (-k 1). Then, we run perf inject --jit, which creates an object file on disk for each function in the jitdump file and writes a new profile with updated mappings that point to these generated object files. After that, we can use the perf analysis tools as normal:

hamish@server:~$ perf record -k 1 -o /tmp/perf.data ./main
hamish@server:~$ perf inject -j -i /tmp/perf.data -o /tmp/perf.data.jit
hamish@server:~$ perf annotate -i /tmp/perf.data.jit arraySum
Profiling generated code with perf when using the perf JITEventListener

Self-referential Debug Information

The user experience of profilers and debuggers is greatly aided by debug symbols. Typically, compilers that use LLVM generate debug information at the same time as generating LLVM IR while visiting the input program’s abstract syntax tree.

What debug information to generate for a code-generating database like Proteus is not obvious. It is a complex question because the generated code results from translating high-level queries into low-level IR. The challenge is compounded by the fact that SQL is a declarative language, meaning it specifies what results are desired rather than detailing the exact steps needed to obtain those results. For example, code for parallelizing query execution would not correspond to any specific SQL. This makes it difficult to correlate specific parts of the LLVM IR back to the original SQL code.

We could add debug info that refers back to the C++ source that generated the IR, but determining which line of source to refer to is also tricky, because there are layers of abstraction and helper functions involved in IR generation. For instance, should the debug information refer to the call site of the helper function that initiates the code generation, or to the specific line within the helper where the LLVM code generation APIs are invoked? While it may be possible to develop a solution that generates debug information referring back to the generating code, I expect it would create a higher maintenance burden; that may make sense for a commercial codebase, but in an academic lab it may just end up unmaintained.

Instead, my approach is to inject debug information that refers back to the IR itself. The code to generate this self-referential debug info is fully contained within the JIT layer. IR-generating code does not need to be updated or maintained to support emitting this debug information. Just having debug info that refers back to the generated LLVM-IR is surprisingly helpful. Finding the code that generated the IR is relatively easy once I catch the bug and see the IR in GDB, especially if the generating code uses identifying names for instructions.

There used to be an LLVM IR pass to do this, but it was not maintained and was eventually removed. It has been revived as a standalone out-of-tree tool, debugir, which is still maintained. I have taken the approach of using the debugir source and adjusting it to insert debug information into my module based on the IR file that is already dumped to disk (commit). I then extended debugir to also support debugging function parameters (commit, upstreamed). I also name all unnamed instructions before printing the IR and generating debug info so they can be printed as named variables in GDB. This could be done with the InstructionNamerPass, but I chose to do it as a simple function in the PrintGeneratedIRLayer (commit). Because optimized IR can be substantially different, I like to have the option to switch between debug info that refers to the generated IR and debug info that refers to the optimized IR; I generally use the generated IR for debugging and the optimized IR for profiling.

jit.hpp
#ifndef MyJIT_HPP
#define MyJIT_HPP

#include <llvm/ADT/StringRef.h>
#include <llvm/ExecutionEngine/Orc/CompileUtils.h>
#include <llvm/ExecutionEngine/Orc/Core.h>
#include <llvm/ExecutionEngine/Orc/ExecutionUtils.h>
#include <llvm/ExecutionEngine/Orc/ExecutorProcessControl.h>
#include <llvm/ExecutionEngine/Orc/IRCompileLayer.h>
#include <llvm/ExecutionEngine/Orc/JITTargetMachineBuilder.h>
#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
#include <llvm/ExecutionEngine/Orc/Shared/ExecutorSymbolDef.h>
#include <llvm/ExecutionEngine/SectionMemoryManager.h>
#include <llvm/IR/DataLayout.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/ExecutionEngine/JITEventListener.h>
#include <llvm/ExecutionEngine/Orc/IRTransformLayer.h>
#include <llvm/Passes/PassBuilder.h>
#include <llvm/IR/Verifier.h>
#include <llvm/ExecutionEngine/Orc/ObjectTransformLayer.h>
#include <llvm/ExecutionEngine/Orc/DebugUtils.h>
#include "DebugIR.hpp"

class MyJIT {
 private:
  const bool insert_preopt_debug_info = false;
  const bool insert_postopt_debug_info = false;
  const bool print_generated_code = true;
  const bool dump_compiled_object_files = true;
  llvm::orc::ExecutionSession ES;
  llvm::orc::JITTargetMachineBuilder JTMB;
  llvm::DataLayout DL;
  llvm::orc::MangleAndInterner Mangle;
  llvm::orc::RTDyldObjectLinkingLayer LinkingLayer;
  llvm::orc::ObjectTransformLayer::TransformFunction DumpObjectTransform;
  llvm::orc::ObjectTransformLayer DumpObjectTransformLayer;
  llvm::orc::IRCompileLayer CompileLayer;
  llvm::orc::IRTransformLayer PrintOptimizedIRLayer;
  llvm::orc::IRTransformLayer TransformLayer;
  llvm::orc::IRTransformLayer PrintGeneratedIRLayer;
  llvm::orc::JITDylib &MainJD;
  llvm::JITEventListener *PerfListener;
  llvm::JITEventListener *GDBListener;

 public:
  MyJIT()
      : ES{llvm::cantFail(llvm::orc::SelfExecutorProcessControl::Create())},
        JTMB(llvm::cantFail(llvm::orc::JITTargetMachineBuilder::detectHost())
                 .setCodeGenOptLevel(llvm::CodeGenOpt::Aggressive)
                 .setCodeModel(llvm::CodeModel::Model::Large)),
        DL(llvm::cantFail(JTMB.getDefaultDataLayoutForTarget())),
        Mangle(ES, DL),
        LinkingLayer(ES, []() { return std::make_unique<llvm::SectionMemoryManager>(); }),
        DumpObjectTransform{llvm::orc::DumpObjects("generated_code/")},
        DumpObjectTransformLayer(ES, LinkingLayer,
                                 [&transform = this->DumpObjectTransform,
                                  dump_compiled_object_files = this->dump_compiled_object_files](
                                     std::unique_ptr<llvm::MemoryBuffer> buf)
                                     -> llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>> {
                                   if (dump_compiled_object_files) {
                                     return transform(std::move(buf));
                                   } else {
                                     return std::move(buf);
                                   }
                                 }),
        CompileLayer(ES, DumpObjectTransformLayer,
                     std::make_unique<llvm::orc::ConcurrentIRCompiler>(JTMB)),
        PrintOptimizedIRLayer(
            ES, CompileLayer,
            [print_generated_code = this->print_generated_code,
             insert_debug_info = this->insert_postopt_debug_info](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "_opt", insert_debug_info);
              }
              return std::move(TSM);
            }),
        TransformLayer(ES, PrintOptimizedIRLayer,
                       [TM = JTMB.createTargetMachine()](
                           llvm::orc::ThreadSafeModule TSM,
                           const llvm::orc::MaterializationResponsibility &R) mutable
                       -> llvm::Expected<llvm::orc::ThreadSafeModule> {
                         if (!TM) return TM.takeError();
                         return optimizeModule(std::move(TSM), std::move(TM.get()));
                       }),
        PrintGeneratedIRLayer(
            ES, TransformLayer,
            [print_generated_code = this->print_generated_code,
             insert_debug_info = this->insert_preopt_debug_info](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              TSM = nameInstructions(std::move(TSM));
              if (print_generated_code) {
                return printIR(std::move(TSM), "", insert_debug_info);
              }
              return std::move(TSM);
            }),
        MainJD(ES.createBareJITDylib("<main>")),
        PerfListener(llvm::JITEventListener::createPerfJITEventListener()),
        GDBListener(llvm::JITEventListener::createGDBRegistrationListener()) {
    if (PerfListener == nullptr) {
      std::cout << "Could not create Perf listener. Perhaps LLVM was not "
                   "compiled with perf support (LLVM_USE_PERF)."
                << std::endl;
    } else {
      LinkingLayer.registerJITEventListener(*PerfListener);
    }

    if (GDBListener == nullptr) {
      std::cout << "Could not create GDB listener." << std::endl;
    } else {
      LinkingLayer.registerJITEventListener(*GDBListener);
    }

    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }

  ~MyJIT() {
    if (auto Err = ES.endSession()) ES.reportError(std::move(Err));
  }

  const llvm::DataLayout &getDataLayout() const { return DL; }

  llvm::Error addModule(llvm::orc::ThreadSafeModule TSM);

  llvm::Expected<llvm::orc::ExecutorSymbolDef> lookup(llvm::StringRef Name);

 private:
  static llvm::Expected<llvm::orc::ThreadSafeModule> optimizeModule(
      llvm::orc::ThreadSafeModule TSM, std::unique_ptr<llvm::TargetMachine> TM);

  // based on InstructionNamerPass, but as a transform because we do not want to do any IR
  // optimization so we can print IR that is the same as the generated IR, just with renamed
  // instructions for readability
  static llvm::orc::ThreadSafeModule nameInstructions(llvm::orc::ThreadSafeModule TSM) {
    TSM.withModuleDo([](llvm::Module &M) {
      for (auto &F : M.functions()) {
        for (auto &Arg : F.args()) {
          if (!Arg.hasName()) Arg.setName("arg");
        }

        for (llvm::BasicBlock &BB : F) {
          if (!BB.hasName()) BB.setName("bb");

          for (llvm::Instruction &I : BB) {
            if (!I.hasName() && !I.getType()->isVoidTy()) I.setName("i");
          }
        }
      }
    });
    return TSM;
  }

  static llvm::Expected<llvm::orc::ThreadSafeModule> printIR(llvm::orc::ThreadSafeModule TSM,
                                                             const std::string &suffix = "",
                                                             bool add_debug_info = false) {
    TSM.withModuleDo([&suffix, &add_debug_info](llvm::Module &m) {
      std::error_code EC;
      const std::string output_directory = "generated_code/";
      const std::string output_file = m.getName().str() + suffix + ".ll";
      llvm::sys::fs::create_directory(output_directory, /*ignoreExisting=*/true);

      llvm::raw_fd_ostream out(output_directory + output_file, EC,
                               llvm::sys::fs::OpenFlags::OF_None);
      m.print(out, nullptr, false, true);
      if (add_debug_info) {
        llvm::createDebugInfo(m, output_directory, output_file);
      }
    });
    return TSM;
  }
};

#endif  // MyJIT_HPP


Debugging in GDB with debug information

Although much of this information is available online, it’s scattered across mailing lists and Stack Overflow, and the LLVM APIs have changed multiple times over the years. It took me a while to figure everything out, so I decided to write it down, primarily for my own reference, but I hope it proves helpful to others as well.



  1. To be clear, this is a joke and not the usual meaning of the phrase "static analysis".