Debugging and Profiling JIT Compiled LLVM-IR
I have recently been working on a code-generating database that uses LLVM to compile SQL queries to machine code. My biggest pain point has been understanding and debugging the generated LLVM-IR and the compiled machine code. This is challenging for two reasons: 1) the code is generated and linked into the host process at runtime, and 2) we do not currently have debug symbols, because it is not immediately obvious what source the debug symbols should refer to. In this post, I have collected a few techniques I have been using and demonstrate how to use them.
This post is targeted at readers who are 1) generating LLVM-IR and 2) using or writing their own JIT compiler with the LLVM ORC APIs. To make things more concrete, I have implemented an example JIT similar to the KaleidoscopeJIT used in the LLVM Building a JIT tutorial. You can find all the code for this post here. I'll be using LLVM-17, but the concepts are applicable at least as far back as LLVM-14, even if the code will not quite compile there.
- Infrastructure and Running Example
- Printing Generated Code
- Dumping Object Files
- Debugging with GDB and Profiling with VTune/Perf
- Self-referential Debug Information
Infrastructure and Running Example
We start with a basic JIT using the ORCv2 APIs. For our running example, we will generate three functions in a module, optimize and compile the module, link the symbols into our process, and call the generated functions from our main function. Two of the generated functions take two integer arguments and return their sum, but one (deliberately) has a bug. The third function takes a pointer to an array of integers and the array size, and returns the sum of the array elements; LLVM can perform more interesting optimizations on this last function. Below is the code of our main function:
int main() {
  initializeLLVM();
  // inserting into a llvm::Module created in `initializeLLVM`
  createAddFunction();
  createBuggyAddFunction();
  createArraySumFunction();
  // compile our code
  auto TSM = llvm::orc::ThreadSafeModule(std::move(TheModule), std::move(TheContext));
  ExitOnErr(TheJIT->addModule(std::move(TSM)));
  auto add_symbol = ExitOnErr(TheJIT->lookup("add"));
  int (*add_fp)(int, int) = add_symbol.getAddress().toPtr<int (*)(int, int)>();
  auto array_sum_symbol = ExitOnErr(TheJIT->lookup("arraySum"));
  int (*array_sum_fp)(int*, int) = array_sum_symbol.getAddress().toPtr<int (*)(int*, int)>();
  auto buggy_add_symbol = ExitOnErr(TheJIT->lookup("buggyAdd"));
  int (*buggy_add_fp)(int, int) = buggy_add_symbol.getAddress().toPtr<int (*)(int, int)>();
  int* arr = new int[1024];
  for (int i = 0; i < 1024; i++) {
    arr[i] = i + 1;
  }
  std::cout << "Adding 1+2 = " << add_fp(1, 2) << std::endl;
  std::cout << "sum of {1, 2, ... 1024} = " << array_sum_fp(arr, 1024) << std::endl;
  std::cout << "(with bugs) adding 1+2 = " << buggy_add_fp(1, 2) << std::endl;
  return 0;
}
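The IR-generation helpers are in the linked repository, but to give a flavour, here is a minimal sketch of roughly how createBuggyAddFunction might build its IR, assuming the globals TheContext, TheModule, and an llvm::IRBuilder<> named Builder are set up in initializeLLVM (illustrative, not the exact code from the repository). The deliberate bug is the load from the absolute address 42:

// Sketch: generate `i32 buggyAdd(i32, i32)` that adds its arguments and then
// (deliberately) dereferences the integer 42 as a pointer.
llvm::Function *createBuggyAddFunction() {
  auto *i32 = llvm::Type::getInt32Ty(*TheContext);
  auto *fn_type = llvm::FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
  auto *fn = llvm::Function::Create(fn_type, llvm::Function::ExternalLinkage,
                                    "buggyAdd", TheModule.get());
  Builder->SetInsertPoint(llvm::BasicBlock::Create(*TheContext, "entry", fn));
  llvm::Value *sum = Builder->CreateAdd(fn->getArg(0), fn->getArg(1));
  // The bug: load an i32 from the absolute address 42.
  llvm::Value *bad_ptr = Builder->CreateIntToPtr(Builder->getInt64(42), Builder->getPtrTy());
  llvm::Value *junk = Builder->CreateLoad(i32, bad_ptr);
  Builder->CreateRet(Builder->CreateAdd(sum, junk));
  return fn;
}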
And this is our simple JIT, which optimizes the IR before compiling it to machine code.
// jit.hpp
class MyJIT {
 private:
  llvm::orc::ExecutionSession ES;
  llvm::orc::JITTargetMachineBuilder JTMB;
  llvm::DataLayout DL;
  llvm::orc::MangleAndInterner Mangle;
  llvm::orc::RTDyldObjectLinkingLayer LinkingLayer;
  llvm::orc::IRCompileLayer CompileLayer;
  llvm::orc::IRTransformLayer TransformLayer;
  llvm::orc::JITDylib &MainJD;

 public:
  MyJIT()
      : ES{llvm::cantFail(llvm::orc::SelfExecutorProcessControl::Create())},
        JTMB(llvm::cantFail(llvm::orc::JITTargetMachineBuilder::detectHost())
                 .setCodeGenOptLevel(llvm::CodeGenOpt::Aggressive)
                 .setCodeModel(llvm::CodeModel::Model::Large)),
        DL(llvm::cantFail(JTMB.getDefaultDataLayoutForTarget())),
        Mangle(ES, DL),
        LinkingLayer(ES, []() { return std::make_unique<llvm::SectionMemoryManager>(); }),
        CompileLayer(ES, LinkingLayer, std::make_unique<llvm::orc::ConcurrentIRCompiler>(JTMB)),
        TransformLayer(ES, CompileLayer,
                       [TM = JTMB.createTargetMachine()](
                           llvm::orc::ThreadSafeModule TSM,
                           const llvm::orc::MaterializationResponsibility &R) mutable
                           -> llvm::Expected<llvm::orc::ThreadSafeModule> {
                         if (!TM) return TM.takeError();
                         return optimizeModule(std::move(TSM), std::move(TM.get()));
                       }),
        MainJD(ES.createBareJITDylib("<main>")) {
    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }

  ~MyJIT();

  const llvm::DataLayout &getDataLayout() const { return DL; }

  llvm::Error addModule(llvm::orc::ThreadSafeModule TSM) {
    return TransformLayer.add(MainJD, std::move(TSM));
  }

  llvm::Expected<llvm::orc::ExecutorSymbolDef> lookup(llvm::StringRef Name) {
    return ES.lookup({&MainJD}, Mangle(Name.str()));
  }

 private:
  static llvm::Expected<llvm::orc::ThreadSafeModule> optimizeModule(
      llvm::orc::ThreadSafeModule TSM, std::unique_ptr<llvm::TargetMachine> TM);
};
The complete code for the baseline implementation is in this commit. Note that, like the LLVM tutorials, the code is minimal and intended to show how to use the APIs, not as an example of good software engineering.
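For reference, optimizeModule (declared above but defined elsewhere) runs a standard optimization pipeline over the module. Here is a minimal sketch of what such an implementation can look like using the new pass manager; this shows the general shape rather than reproducing the linked commit exactly:

// Sketch: build the standard O3 pipeline against the JIT's TargetMachine and
// run it over the module inside the ThreadSafeModule lock.
llvm::Expected<llvm::orc::ThreadSafeModule> MyJIT::optimizeModule(
    llvm::orc::ThreadSafeModule TSM, std::unique_ptr<llvm::TargetMachine> TM) {
  TSM.withModuleDo([&](llvm::Module &M) {
    llvm::LoopAnalysisManager LAM;
    llvm::FunctionAnalysisManager FAM;
    llvm::CGSCCAnalysisManager CGAM;
    llvm::ModuleAnalysisManager MAM;
    llvm::PassBuilder PB(TM.get());
    PB.registerModuleAnalyses(MAM);
    PB.registerCGSCCAnalyses(CGAM);
    PB.registerFunctionAnalyses(FAM);
    PB.registerLoopAnalyses(LAM);
    PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);
    PB.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O3).run(M, MAM);
  });
  return std::move(TSM);
}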
Printing Generated Code
The first and most obvious approach is to print the generated LLVM-IR in a human-readable format. This enables static analysis of the code1: staring at it until you find your bug.
Dumping Generated IR
The simplest way to see your generated IR is with the dump() method, which prints the IR of the dumped object to stderr. This is convenient for debugging small amounts of generated IR, but it is limited: the output has no syntax highlighting, it is intermingled with anything else written to stderr, and it cannot easily be used to examine the IR after optimization. Example:
int main() {
  auto* buggy_add_func_ir = createBuggyAddFunction();
  buggy_add_func_ir->dump();
  ...
Output:
hamish@server:~$ ./main
define i32 @buggyAdd(i32 %0, i32 %1) {
entry:
  %2 = add i32 %0, %1
  %3 = load i32, ptr inttoptr (i64 42 to ptr), align 4
  %4 = add i32 %2, %3
  ret i32 %4
}
Adding 1+2 = 3
sum of {1, 2, ... 1024} = 524800
Segmentation fault
Printing Generated and Optimized IR with IRTransformLayer
My preferred way to get both the generated and the optimized IR is to use an IRTransformLayer. The IRTransformLayer is quite straightforward: it is a layer that takes a module, applies a transformation to it, and then emits the module. In our case, we will not actually perform a transformation; we will print the module to a file and return it unchanged. This approach is neat because it allows us to inspect the IR at multiple points in the compilation pipeline by adding several of these IR-printing IRTransformLayers, e.g., both before and after the module has been optimized.
Below, we have extended MyJIT with two additional layers: PrintGeneratedIRLayer and PrintOptimizedIRLayer, both of which use a lambda as the transformation function. The lambda conditionally calls a new method, printIR, depending on the boolean print_generated_code. Here, print_generated_code is always true, but you could set it through a command-line argument, for example. PrintGeneratedIRLayer is also now the first layer of our compilation pipeline, so we need to modify our addModule method to match.
// jit.hpp
class MyJIT {
 private:
  const bool print_generated_code = true;
  llvm::orc::ExecutionSession ES;
  llvm::orc::JITTargetMachineBuilder JTMB;
  llvm::DataLayout DL;
  llvm::orc::MangleAndInterner Mangle;
  llvm::orc::RTDyldObjectLinkingLayer LinkingLayer;
  llvm::orc::IRCompileLayer CompileLayer;
  llvm::orc::IRTransformLayer PrintOptimizedIRLayer;
  llvm::orc::IRTransformLayer TransformLayer;
  llvm::orc::IRTransformLayer PrintGeneratedIRLayer;
  llvm::orc::JITDylib &MainJD;

 public:
  MyJIT()
      : ES{llvm::cantFail(llvm::orc::SelfExecutorProcessControl::Create())},
        JTMB(llvm::cantFail(llvm::orc::JITTargetMachineBuilder::detectHost())
                 .setCodeGenOptLevel(llvm::CodeGenOpt::Aggressive)
                 .setCodeModel(llvm::CodeModel::Model::Large)),
        DL(llvm::cantFail(JTMB.getDefaultDataLayoutForTarget())),
        Mangle(ES, DL),
        LinkingLayer(ES, []() { return std::make_unique<llvm::SectionMemoryManager>(); }),
        CompileLayer(ES, LinkingLayer, std::make_unique<llvm::orc::ConcurrentIRCompiler>(JTMB)),
        PrintOptimizedIRLayer(
            ES, CompileLayer,
            [print_generated_code = this->print_generated_code](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "_opt");
              }
              return std::move(TSM);
            }),
        TransformLayer(ES, PrintOptimizedIRLayer,
                       [TM = JTMB.createTargetMachine()](
                           llvm::orc::ThreadSafeModule TSM,
                           const llvm::orc::MaterializationResponsibility &R) mutable
                           -> llvm::Expected<llvm::orc::ThreadSafeModule> {
                         if (!TM) return TM.takeError();
                         return optimizeModule(std::move(TSM), std::move(TM.get()));
                       }),
        PrintGeneratedIRLayer(
            ES, TransformLayer,
            [print_generated_code = this->print_generated_code](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "");
              }
              return std::move(TSM);
            }),
        MainJD(ES.createBareJITDylib("<main>")) {
    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }

  ~MyJIT() {
    if (auto Err = ES.endSession()) ES.reportError(std::move(Err));
  }

  const llvm::DataLayout &getDataLayout() const { return DL; }

  llvm::Error addModule(llvm::orc::ThreadSafeModule TSM) {
    return PrintGeneratedIRLayer.add(MainJD, std::move(TSM));
  }

  llvm::Expected<llvm::orc::ExecutorSymbolDef> lookup(llvm::StringRef Name) {
    return ES.lookup({&MainJD}, Mangle(Name.str()));
  }

 private:
  static llvm::Expected<llvm::orc::ThreadSafeModule> optimizeModule(
      llvm::orc::ThreadSafeModule TSM, std::unique_ptr<llvm::TargetMachine> TM);

  static llvm::Expected<llvm::orc::ThreadSafeModule> printIR(llvm::orc::ThreadSafeModule TSM,
                                                             const std::string &suffix = "") {
    TSM.withModuleDo([&suffix](llvm::Module &m) {
      std::error_code EC;
      const std::string output_directory = "generated_code/";
      const std::string output_file = m.getName().str() + suffix + ".ll";
      llvm::sys::fs::create_directory(output_directory, /*ignoreExisting=*/true);
      llvm::raw_fd_ostream out(output_directory + output_file, EC,
                               llvm::sys::fs::OpenFlags::OF_None);
      m.print(out, nullptr, false, true);
    });
    return TSM;
  }
};
And now we have both our generated and optimized IR conveniently in the generated_code/ directory, where we can open and view them with syntax highlighting.
Generated (unoptimized) IR: my_module.ll
; ModuleID = 'my_module'
source_filename = "my_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
define i32 @add(i32 %0, i32 %1) {
entry:
  %2 = add i32 %0, %1
  ret i32 %2
}

define i32 @buggyAdd(i32 %0, i32 %1) {
entry:
  %2 = add i32 %0, %1
  %3 = load i32, ptr inttoptr (i64 42 to ptr), align 4
  %4 = add i32 %2, %3
  ret i32 %4
}

define i32 @arraySum(ptr %0, i32 %1) {
entry:
  %sum = alloca i32, align 4
  store i32 0, ptr %sum, align 4
  %index = alloca i32, align 4
  store i32 0, ptr %index, align 4
  br label %loop_cond

loop_cond:                                        ; preds = %entry
  %loaded_index = load i32, ptr %index, align 4
  %2 = icmp slt i32 %loaded_index, %1
  br i1 %2, label %loop_body, label %exit

loop_body:                                        ; preds = %loop_body, %loop_cond
  %3 = load i32, ptr %index, align 4
  %4 = getelementptr i32, ptr %0, i32 %3
  %5 = load i32, ptr %4, align 4
  %6 = load i32, ptr %index, align 4
  %7 = add i32 %6, 1
  %8 = load i32, ptr %sum, align 4
  %newSum = add i32 %8, %5
  store i32 %7, ptr %index, align 4
  store i32 %newSum, ptr %sum, align 4
  %9 = load i32, ptr %index, align 4
  %10 = icmp slt i32 %9, %1
  br i1 %10, label %loop_body, label %exit

exit:                                             ; preds = %loop_body, %loop_cond
  %11 = load i32, ptr %sum, align 4
  ret i32 %11
}
Optimized IR: my_module_opt.ll
; ModuleID = 'my_module'
source_filename = "my_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define i32 @add(i32 %0, i32 %1) local_unnamed_addr #0 {
entry:
  %2 = add i32 %1, %0
  ret i32 %2
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(read, inaccessiblemem: none)
define i32 @buggyAdd(i32 %0, i32 %1) local_unnamed_addr #1 {
entry:
  %2 = add i32 %1, %0
  %3 = load i32, ptr inttoptr (i64 42 to ptr), align 4
  %4 = add i32 %2, %3
  ret i32 %4
}

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: read)
define i32 @arraySum(ptr nocapture readonly %0, i32 %1) local_unnamed_addr #2 {
entry:
  %2 = icmp sgt i32 %1, 0
  br i1 %2, label %loop_body.preheader, label %exit

loop_body.preheader:                              ; preds = %entry
  %wide.trip.count = zext i32 %1 to i64
  %min.iters.check = icmp ult i32 %1, 32
  br i1 %min.iters.check, label %loop_body.preheader14, label %vector.ph

vector.ph:                                        ; preds = %loop_body.preheader
  %n.vec = and i64 %wide.trip.count, 4294967264
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %7, %vector.body ]
  %vec.phi6 = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %8, %vector.body ]
  %vec.phi7 = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %9, %vector.body ]
  %vec.phi8 = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %10, %vector.body ]
  %3 = getelementptr i32, ptr %0, i64 %index
  %wide.load = load <8 x i32>, ptr %3, align 4
  %4 = getelementptr i32, ptr %3, i64 8
  %wide.load9 = load <8 x i32>, ptr %4, align 4
  %5 = getelementptr i32, ptr %3, i64 16
  %wide.load10 = load <8 x i32>, ptr %5, align 4
  %6 = getelementptr i32, ptr %3, i64 24
  %wide.load11 = load <8 x i32>, ptr %6, align 4
  %7 = add <8 x i32> %wide.load, %vec.phi
  %8 = add <8 x i32> %wide.load9, %vec.phi6
  %9 = add <8 x i32> %wide.load10, %vec.phi7
  %10 = add <8 x i32> %wide.load11, %vec.phi8
  %index.next = add nuw i64 %index, 32
  %11 = icmp eq i64 %index.next, %n.vec
  br i1 %11, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  %bin.rdx = add <8 x i32> %8, %7
  %bin.rdx12 = add <8 x i32> %9, %bin.rdx
  %bin.rdx13 = add <8 x i32> %10, %bin.rdx12
  %12 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %bin.rdx13)
  %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
  br i1 %cmp.n, label %exit, label %loop_body.preheader14

loop_body.preheader14:                            ; preds = %loop_body.preheader, %middle.block
  %indvars.iv.ph = phi i64 [ 0, %loop_body.preheader ], [ %n.vec, %middle.block ]
  %sum.0.ph = phi i32 [ 0, %loop_body.preheader ], [ %12, %middle.block ]
  br label %loop_body

loop_body:                                        ; preds = %loop_body.preheader14, %loop_body
  %indvars.iv = phi i64 [ %indvars.iv.next, %loop_body ], [ %indvars.iv.ph, %loop_body.preheader14 ]
  %sum.0 = phi i32 [ %newSum, %loop_body ], [ %sum.0.ph, %loop_body.preheader14 ]
  %13 = getelementptr i32, ptr %0, i64 %indvars.iv
  %14 = load i32, ptr %13, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %newSum = add i32 %14, %sum.0
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %exit, label %loop_body, !llvm.loop !3

exit:                                             ; preds = %loop_body, %middle.block, %entry
  %sum.1 = phi i32 [ 0, %entry ], [ %12, %middle.block ], [ %newSum, %loop_body ]
  ret i32 %sum.1
}

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %0) #3
attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
attributes #1 = { mustprogress nofree norecurse nosync nounwind willreturn memory(read, inaccessiblemem: none) }
attributes #2 = { nofree norecurse nosync nounwind memory(argmem: read) }
attributes #3 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
!0 = distinct !{!0, !1, !2}
!1 = !{!"llvm.loop.isvectorized", i32 1}
!2 = !{!"llvm.loop.unroll.runtime.disable"}
!3 = distinct !{!3, !2, !1}
This is quite useful for inspecting which optimizations LLVM was able to apply to our generated code. For example, we can see that arraySum was vectorized.
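If you would rather have LLVM tell you this directly than read the IR, one option is to replay the dumped unoptimized IR through opt with optimization remarks enabled. Note that this re-runs the pipeline outside the JIT, so the result can differ slightly from what the JIT produced (the flags here are illustrative):

hamish@server:~$ opt -passes='default<O3>' -pass-remarks=loop-vectorize \
    generated_code/my_module.ll -disable-output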
Dumping Object Files
Up to this point, we've explored methods to examine the generated IR. Often, this is enough to debug our code, especially since I find LLVM-IR easier to parse than assembly. Occasionally, though, I need to know the specific CPU instructions that LLVM emits. For example, we may wish to know what CPU instruction(s) the load <8 x i32> in the vectorized arraySum is lowered to. One option is to compile the IR files we saved to disk with llc. However, we would have to carefully select the command-line flags to ensure llc generates the same machine code as our JIT.
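For example, something like the following produces an object file from the dumped IR, but the optimization level, CPU, and code model all have to be kept in sync with the JIT's TargetMachine by hand (the flags below are illustrative):

hamish@server:~$ llc -O3 -mcpu=native -code-model=large -filetype=obj \
    generated_code/my_module_opt.ll -o my_module.o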
If we want to examine the exact machine code our optimized IR is compiled to, it is better to dump the JIT-compiled code directly. Thankfully, LLVM includes the llvm::orc::DumpObjects utility class to dump a compiled module in memory, in the form of an llvm::MemoryBuffer, to disk. We use a similar approach to the one used to save our IR to disk: we create an instance of DumpObjects, and instead of an IRTransformLayer, we use an ObjectTransformLayer between the CompileLayer and the LinkingLayer.
// jit.hpp
class MyJIT {
 private:
  const bool print_generated_code = true;
  const bool dump_compiled_object_files = true;
  ...
  llvm::orc::RTDyldObjectLinkingLayer LinkingLayer;
  llvm::orc::ObjectTransformLayer::TransformFunction DumpObjectTransform;
  llvm::orc::ObjectTransformLayer DumpObjectTransformLayer;
  llvm::orc::IRCompileLayer CompileLayer;
  llvm::orc::IRTransformLayer PrintOptimizedIRLayer;
  llvm::orc::IRTransformLayer TransformLayer;
  llvm::orc::IRTransformLayer PrintGeneratedIRLayer;
  llvm::orc::JITDylib &MainJD;

 public:
  MyJIT()
      : ES{llvm::cantFail(llvm::orc::SelfExecutorProcessControl::Create())},
        JTMB(llvm::cantFail(llvm::orc::JITTargetMachineBuilder::detectHost())
                 .setCodeGenOptLevel(llvm::CodeGenOpt::Aggressive)
                 .setCodeModel(llvm::CodeModel::Model::Large)),
        DL(llvm::cantFail(JTMB.getDefaultDataLayoutForTarget())),
        Mangle(ES, DL),
        LinkingLayer(ES, []() { return std::make_unique<llvm::SectionMemoryManager>(); }),
        DumpObjectTransform{llvm::orc::DumpObjects("generated_code/")},
        DumpObjectTransformLayer(ES, LinkingLayer,
                                 [&transform = this->DumpObjectTransform,
                                  dump_compiled_object_files = this->dump_compiled_object_files](
                                     std::unique_ptr<llvm::MemoryBuffer> buf)
                                     -> llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>> {
                                   if (dump_compiled_object_files) {
                                     return transform(std::move(buf));
                                   } else {
                                     return std::move(buf);
                                   }
                                 }),
        CompileLayer(ES, DumpObjectTransformLayer,
                     std::make_unique<llvm::orc::ConcurrentIRCompiler>(JTMB)),
        PrintOptimizedIRLayer(
            ES, CompileLayer,
            [print_generated_code = this->print_generated_code](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "_opt");
              }
              return std::move(TSM);
            }),
        TransformLayer(ES, PrintOptimizedIRLayer,
                       [TM = JTMB.createTargetMachine()](
                           llvm::orc::ThreadSafeModule TSM,
                           const llvm::orc::MaterializationResponsibility &R) mutable
                           -> llvm::Expected<llvm::orc::ThreadSafeModule> {
                         if (!TM) return TM.takeError();
                         return optimizeModule(std::move(TSM), std::move(TM.get()));
                       }),
        PrintGeneratedIRLayer(
            ES, TransformLayer,
            [print_generated_code = this->print_generated_code](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "");
              }
              return std::move(TSM);
            }),
        MainJD(ES.createBareJITDylib("<main>")) {
    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }
Now when we JIT our code, we also create an object file in the generated_code/ directory, which can be inspected using standard utilities such as objdump. For example, to see how the arraySum function was lowered to x86 assembly:
hamish@server:~$ objdump -M intel --disassemble=arraySum generated_code/my_module-jitted-objectbuffer.o --no-show-raw-insn
generated_code/my_module-jitted-objectbuffer.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000020 <arraySum>:
arraySum():
20: test esi,esi
22: jle 34 <arraySum+0x14>
24: mov ecx,esi
26: cmp esi,0x20
29: jae 37 <arraySum+0x17>
2b: xor edx,edx
2d: xor eax,eax
2f: jmp c0 <arraySum+0xa0>
34: xor eax,eax
36: ret
37: mov eax,ecx
39: shr eax,0x5
3c: mov edx,ecx
3e: and edx,0xffffffe0
41: vpxor xmm0,xmm0,xmm0
45: xor esi,esi
47: vpxor xmm1,xmm1,xmm1
4b: vpxor xmm2,xmm2,xmm2
4f: vpxor xmm3,xmm3,xmm3
53: shl rax,0x7
57: nop WORD PTR [rax+rax*1+0x0]
60: vpaddd ymm0,ymm0,YMMWORD PTR [rdi+rsi*1]
65: vpaddd ymm1,ymm1,YMMWORD PTR [rdi+rsi*1+0x20]
6b: vpaddd ymm2,ymm2,YMMWORD PTR [rdi+rsi*1+0x40]
71: vpaddd ymm3,ymm3,YMMWORD PTR [rdi+rsi*1+0x60]
77: sub rsi,0xffffffffffffff80
7b: cmp rax,rsi
7e: jne 60 <arraySum+0x40>
80: vpaddd ymm0,ymm1,ymm0
84: vpaddd ymm1,ymm3,ymm2
88: vpaddd ymm0,ymm1,ymm0
8c: vextracti128 xmm1,ymm0,0x1
92: vpaddd xmm0,xmm0,xmm1
96: vpshufd xmm1,xmm0,0xee
9b: vpaddd xmm0,xmm0,xmm1
9f: vpshufd xmm1,xmm0,0x55
a4: vpaddd xmm0,xmm0,xmm1
a8: vmovd eax,xmm0
ac: cmp rdx,rcx
af: je cb <arraySum+0xab>
b1: data16 data16 data16 data16 data16 cs nop WORD PTR [rax+rax*1+0x0]
c0: add eax,DWORD PTR [rdi+rdx*4]
c3: inc rdx
c6: cmp rcx,rdx
c9: jne c0 <arraySum+0xa0>
cb: vzeroupper
ce: ret
Here we can see that an AVX2 vpaddd with a memory source operand is used for the combination of load <8 x i32> and add <8 x i32>.
Debugging with GDB and Profiling with VTune/Perf
Often, staring at IR and assembly is not enough to debug or identify performance issues; we need tooling to understand our code at runtime, such as debuggers like GDB or profilers like perf and VTune. To be useful, debuggers and profilers need to know the program's symbols and debug information (if present). This is more complicated for programs that generate and compile code at runtime, because these tools usually rely on object files for symbols and debug info. Luckily for us, the hard work has already been done. GDB has an interface that allows JIT compilers to register code compiled at runtime. Perf has two interfaces to support JIT'ed code. The first (documented here) requires the application to write, for each symbol, its start address, code size, and name to /tmp/perf-{PID}.map (see the sketch below). This keeps the symbol information around after the process terminates so that perf report and friends can resolve symbols; it requires no change in how perf is used, but you will not be able to create source file annotations. The second perf interface is a jitdump file. This format provides more information to perf, including about the program source, but it requires slightly different usage of the perf commands, shown below. VTune also has an interface as part of Intel's ITT API. Unfortunately, if your preferred profiler on AMD platforms is AMDuProf, you are out of luck: at least as of late 2023, AMDuProf does not support profiling LLVM JIT-generated code.
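LLVM's listeners (below) use the jitdump interface, but the map-file interface is simple enough to implement by hand if you only need symbol names. Here is a minimal sketch, assuming you can obtain each symbol's address and size from your linking layer; writePerfMapEntry is a hypothetical helper, not part of the example JIT:

#include <cstdint>
#include <cstdio>
#include <unistd.h>

// Append one "<start-addr> <size> <name>" line (addresses in hex) to the
// /tmp/perf-{PID}.map file that perf consults when resolving symbols.
void writePerfMapEntry(uint64_t start, uint64_t size, const char *name) {
  static FILE *map_file = [] {
    char path[64];
    std::snprintf(path, sizeof(path), "/tmp/perf-%d.map", static_cast<int>(getpid()));
    return std::fopen(path, "a");
  }();
  if (map_file != nullptr) {
    std::fprintf(map_file, "%llx %llx %s\n", static_cast<unsigned long long>(start),
                 static_cast<unsigned long long>(size), name);
    std::fflush(map_file);
  }
}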
The LLVM JITEventListener interface is used to notify clients when objects are loaded into memory, i.e., when we have newly compiled code. There are listener implementations for several tools, including GDB, perf, and VTune; the perf listener is based on the jitdump file interface. Note: the VTune listener requires LLVM to be configured with -DLLVM_USE_INTEL_JITEVENTS=ON, and the perf listener requires -DLLVM_USE_PERF=ON. If your LLVM is not built with these listeners enabled, nothing will happen, because failing to create a listener is not an error, and you may spend substantial time wondering why they are not working. Installing the listeners is straightforward; using the GDB and perf listeners, for example:
// jit.hpp
class MyJIT {
 private:
  ...
  llvm::JITEventListener *PerfListener;
  llvm::JITEventListener *GDBListener;

 public:
  MyJIT()
      : ...
        MainJD(ES.createBareJITDylib("<main>")),
        PerfListener(llvm::JITEventListener::createPerfJITEventListener()),
        GDBListener(llvm::JITEventListener::createGDBRegistrationListener()) {
    if (PerfListener == nullptr) {
      std::cout << "Could not create Perf listener. Perhaps LLVM was not "
                   "compiled with perf support (LLVM_USE_PERF)."
                << std::endl;
    } else {
      LinkingLayer.registerJITEventListener(*PerfListener);
    }
    if (GDBListener == nullptr) {
      std::cout << "Could not create GDB listener." << std::endl;
    } else {
      LinkingLayer.registerJITEventListener(*GDBListener);
    }
    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }
While GDB does work without the listener, it does not know how to access the symbol table for the generated code. You can still find the instruction that triggers a segfault and inspect register values; however, without additional context such as logs, it is very difficult to work out where in the code the segfault occurs, because we do not even have frame names. When we add the listener, GDB has the symbol table, so we have frame names. This narrows the region of suspect generated code down to a single function.
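For example, after the buggyAdd segfault, a minimal session looks something like this (commands only; the exact output depends on your build):

hamish@server:~$ gdb -q ./main
(gdb) run
# ... SIGSEGV raised inside the JIT'ed code ...
(gdb) bt
# with the listener registered, the innermost frame is named (e.g. buggyAdd)
# rather than an anonymous `?? ()` frame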
To check whether the listeners are working in GDB, the set debug jit on command causes GDB to print information from its JIT API. maint info jit displays the address of the JIT'ed code and the address of the symbol file, and maint info symtabs lists all the symbol tables.
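As a quick checklist inside a GDB session:

(gdb) set debug jit on       # trace GDB's JIT interface activity
(gdb) maint info jit         # addresses of JIT'ed code and symbol files
(gdb) maint info symtabs     # all symbol tables GDB currently knows about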
Unlike GDB, perf does not work at all for profiling JIT'ed code without the listener. This is because perf traces mmap calls to locate the compiled program so that it can map instruction pointers back to code, and JIT'ed code has no backing file. The LLVM perf listener uses the perf jitdump interface: the listener mmaps the jitdump file, a record of this mmap ends up in the recorded profile, and that record allows perf to locate the jitdump file, which perf recognizes as such by the format of its file name.
To profile generated code that has been registered with the listener, we need to do a little extra work. First, when recording, we need to ensure we are using the monotonic clock (-k 1). Then, we need to run perf inject --jit, which creates an object file on disk for each function in the jitdump file and writes a new profile with updated mappings that point to the generated object files. After that, we can use the perf analysis tools as normal:
hamish@server:~$ perf record -k 1 -o /tmp/perf.data ./main
hamish@server:~$ perf inject -j -i /tmp/perf.data -o /tmp/perf.data.jit
hamish@server:~$ perf annotate -i /tmp/perf.data.jit arraySum
Self-referential Debug Information
The user experience of profilers and debuggers is greatly aided by debug symbols. Typically, compilers that use LLVM generate debug information at the same time as generating LLVM IR while visiting the input program’s abstract syntax tree.
What debug information to generate for a code-generating database like Proteus is not obvious. It is a complex question because the generated code results from translating high-level queries into low-level IR. The challenge is compounded by the fact that SQL is a declarative language, meaning it specifies what results are desired rather than detailing the exact steps needed to obtain those results. For example, code for parallelizing query execution would not correspond to any specific SQL. This makes it difficult to correlate specific parts of the LLVM IR back to the original SQL code.
We could add debug info that refers back to the C++ source that generated the IR, but determining which line of source to refer to is also tricky, because there are layers of abstractions and helper functions for IR generation. For instance, should the debug information refer to the call site of the helper function that initiates the code generation, or to the specific line within the helper function where the LLVM code-generation APIs are invoked? While it may be possible to develop a solution that generates debug information referring back to the generating code, I expect it would carry a high maintenance burden; that might be justified in a commercial codebase, but in an academic lab it would likely end up unmaintained.
Instead, my approach is to inject debug information that refers back to the IR itself. The code to generate this self-referential debug info is fully contained within the JIT layer. IR-generating code does not need to be updated or maintained to support emitting this debug information. Just having debug info that refers back to the generated LLVM-IR is surprisingly helpful. Finding the code that generated the IR is relatively easy once I catch the bug and see the IR in GDB, especially if the generating code uses identifying names for instructions.
There used to be an LLVM IR pass to do this, but it was not maintained and was eventually removed. It has been revived as a standalone out-of-tree tool, debugir, which is still maintained. I have taken the approach of using the debugir source and adjusting it to insert debug information into my module based on the IR file that is already dumped to disk (commit). I then extended debugir to also support debugging function parameters (commit, upstreamed). I also name all unnamed instructions before printing the IR and generating debug info, so they can be printed as named variables in GDB. This could be done with the InstructionNamerPass, but I chose to do it as a simple function in the PrintGeneratedIRLayer (commit). Because optimized IR can be substantially different, I like having the option to switch between debug info that refers to the generated IR and debug info that refers to the optimized IR; I generally use the former for debugging and the latter for profiling.
// jit.hpp
#ifndef MyJIT_HPP
#define MyJIT_HPP

#include <llvm/ADT/StringRef.h>
#include <llvm/ExecutionEngine/JITEventListener.h>
#include <llvm/ExecutionEngine/Orc/CompileUtils.h>
#include <llvm/ExecutionEngine/Orc/Core.h>
#include <llvm/ExecutionEngine/Orc/DebugUtils.h>
#include <llvm/ExecutionEngine/Orc/ExecutionUtils.h>
#include <llvm/ExecutionEngine/Orc/ExecutorProcessControl.h>
#include <llvm/ExecutionEngine/Orc/IRCompileLayer.h>
#include <llvm/ExecutionEngine/Orc/IRTransformLayer.h>
#include <llvm/ExecutionEngine/Orc/JITTargetMachineBuilder.h>
#include <llvm/ExecutionEngine/Orc/ObjectTransformLayer.h>
#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
#include <llvm/ExecutionEngine/Orc/Shared/ExecutorSymbolDef.h>
#include <llvm/ExecutionEngine/SectionMemoryManager.h>
#include <llvm/IR/DataLayout.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Verifier.h>
#include <llvm/Passes/PassBuilder.h>
#include <llvm/Support/FileSystem.h>
#include <llvm/Support/raw_ostream.h>

#include <iostream>

#include "DebugIR.hpp"

class MyJIT {
 private:
  const bool insert_preopt_debug_info = false;
  const bool insert_postopt_debug_info = false;
  const bool print_generated_code = true;
  const bool dump_compiled_object_files = true;
  llvm::orc::ExecutionSession ES;
  llvm::orc::JITTargetMachineBuilder JTMB;
  llvm::DataLayout DL;
  llvm::orc::MangleAndInterner Mangle;
  llvm::orc::RTDyldObjectLinkingLayer LinkingLayer;
  llvm::orc::ObjectTransformLayer::TransformFunction DumpObjectTransform;
  llvm::orc::ObjectTransformLayer DumpObjectTransformLayer;
  llvm::orc::IRCompileLayer CompileLayer;
  llvm::orc::IRTransformLayer PrintOptimizedIRLayer;
  llvm::orc::IRTransformLayer TransformLayer;
  llvm::orc::IRTransformLayer PrintGeneratedIRLayer;
  llvm::orc::JITDylib &MainJD;
  llvm::JITEventListener *PerfListener;
  llvm::JITEventListener *GDBListener;

 public:
  MyJIT()
      : ES{llvm::cantFail(llvm::orc::SelfExecutorProcessControl::Create())},
        JTMB(llvm::cantFail(llvm::orc::JITTargetMachineBuilder::detectHost())
                 .setCodeGenOptLevel(llvm::CodeGenOpt::Aggressive)
                 .setCodeModel(llvm::CodeModel::Model::Large)),
        DL(llvm::cantFail(JTMB.getDefaultDataLayoutForTarget())),
        Mangle(ES, DL),
        LinkingLayer(ES, []() { return std::make_unique<llvm::SectionMemoryManager>(); }),
        DumpObjectTransform{llvm::orc::DumpObjects("generated_code/")},
        DumpObjectTransformLayer(ES, LinkingLayer,
                                 [&transform = this->DumpObjectTransform,
                                  dump_compiled_object_files = this->dump_compiled_object_files](
                                     std::unique_ptr<llvm::MemoryBuffer> buf)
                                     -> llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>> {
                                   if (dump_compiled_object_files) {
                                     return transform(std::move(buf));
                                   } else {
                                     return std::move(buf);
                                   }
                                 }),
        CompileLayer(ES, DumpObjectTransformLayer,
                     std::make_unique<llvm::orc::ConcurrentIRCompiler>(JTMB)),
        PrintOptimizedIRLayer(
            ES, CompileLayer,
            [print_generated_code = this->print_generated_code,
             insert_debug_info = this->insert_postopt_debug_info](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              if (print_generated_code) {
                return printIR(std::move(TSM), "_opt", insert_debug_info);
              }
              return std::move(TSM);
            }),
        TransformLayer(ES, PrintOptimizedIRLayer,
                       [TM = JTMB.createTargetMachine()](
                           llvm::orc::ThreadSafeModule TSM,
                           const llvm::orc::MaterializationResponsibility &R) mutable
                           -> llvm::Expected<llvm::orc::ThreadSafeModule> {
                         if (!TM) return TM.takeError();
                         return optimizeModule(std::move(TSM), std::move(TM.get()));
                       }),
        PrintGeneratedIRLayer(
            ES, TransformLayer,
            [print_generated_code = this->print_generated_code,
             insert_debug_info = this->insert_preopt_debug_info](
                llvm::orc::ThreadSafeModule TSM, const llvm::orc::MaterializationResponsibility &R)
                -> llvm::Expected<llvm::orc::ThreadSafeModule> {
              TSM = nameInstructions(std::move(TSM));
              if (print_generated_code) {
                return printIR(std::move(TSM), "", insert_debug_info);
              }
              return std::move(TSM);
            }),
        MainJD(ES.createBareJITDylib("<main>")),
        PerfListener(llvm::JITEventListener::createPerfJITEventListener()),
        GDBListener(llvm::JITEventListener::createGDBRegistrationListener()) {
    if (PerfListener == nullptr) {
      std::cout << "Could not create Perf listener. Perhaps LLVM was not "
                   "compiled with perf support (LLVM_USE_PERF)."
                << std::endl;
    } else {
      LinkingLayer.registerJITEventListener(*PerfListener);
    }
    if (GDBListener == nullptr) {
      std::cout << "Could not create GDB listener." << std::endl;
    } else {
      LinkingLayer.registerJITEventListener(*GDBListener);
    }
    MainJD.addGenerator(llvm::cantFail(
        llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));
  }

  ~MyJIT() {
    if (auto Err = ES.endSession()) ES.reportError(std::move(Err));
  }

  const llvm::DataLayout &getDataLayout() const { return DL; }

  llvm::Error addModule(llvm::orc::ThreadSafeModule TSM);

  llvm::Expected<llvm::orc::ExecutorSymbolDef> lookup(llvm::StringRef Name);

 private:
  static llvm::Expected<llvm::orc::ThreadSafeModule> optimizeModule(
      llvm::orc::ThreadSafeModule TSM, std::unique_ptr<llvm::TargetMachine> TM);

  // Based on InstructionNamerPass, but as a transform because we do not want to do any IR
  // optimization, so we can print IR that is the same as the generated IR, just with renamed
  // instructions for readability.
  static llvm::orc::ThreadSafeModule nameInstructions(llvm::orc::ThreadSafeModule TSM) {
    TSM.withModuleDo([](llvm::Module &M) {
      for (auto &F : M.functions()) {
        for (auto &Arg : F.args()) {
          if (!Arg.hasName()) Arg.setName("arg");
        }
        for (llvm::BasicBlock &BB : F) {
          if (!BB.hasName()) BB.setName("bb");
          for (llvm::Instruction &I : BB) {
            if (!I.hasName() && !I.getType()->isVoidTy()) I.setName("i");
          }
        }
      }
    });
    return TSM;
  }

  static llvm::Expected<llvm::orc::ThreadSafeModule> printIR(llvm::orc::ThreadSafeModule TSM,
                                                             const std::string &suffix = "",
                                                             bool add_debug_info = false) {
    TSM.withModuleDo([&suffix, &add_debug_info](llvm::Module &m) {
      std::error_code EC;
      const std::string output_directory = "generated_code/";
      const std::string output_file = m.getName().str() + suffix + ".ll";
      llvm::sys::fs::create_directory(output_directory, /*ignoreExisting=*/true);
      llvm::raw_fd_ostream out(output_directory + output_file, EC,
                               llvm::sys::fs::OpenFlags::OF_None);
      m.print(out, nullptr, false, true);
      if (add_debug_info) {
        llvm::createDebugInfo(m, output_directory, output_file);
      }
    });
    return TSM;
  }
};

#endif  // MyJIT_HPP
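With insert_preopt_debug_info enabled, GDB treats the dumped .ll file as source, so breakpoints, stepping, and inspecting named values all work at the IR level. A session looks something like the following (illustrative; the line number is hypothetical, and GDB will offer to make the breakpoint pending, since the "source" only exists once the JIT has run):

hamish@server:~$ gdb -q ./main
(gdb) break my_module.ll:12      # hypothetical IR line inside buggyAdd
(gdb) run
(gdb) info args                  # function parameters, via the debugir extension
(gdb) next                       # step one IR line at a time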
Although much of this information is available online, it’s scattered across mailing lists and Stack Overflow, and the LLVM APIs have changed multiple times over the years. It took me a while to figure everything out, so I decided to write it down, primarily for my own reference, but I hope it proves helpful to others as well.
1. To be clear, this is a joke and not the usual use of the phrase "static analysis".