CMU 15-418 (Spring 2012) Final Project:
Parallel Malloc
Alex Reece
Reports

Project Proposal

Checkpoint Report

Final Report

Working Schedule

Week          What We Plan To Do                What We Actually Did
Apr 1-7       Sequential Malloc                 Malloc Research
Apr 8-14      Sequential Malloc                 Sequential Malloc
Apr 15-21     Thread local mallocs              False Sharing Checker
Apr 22-28     Sequential Malloc with TLS:
                • Single locked arena
                • Naive static-size slabs
                • Cache in TLS
Apr 29-May 5  Multiple Arenas:
                • Multiple arenas
                • Dynamically sized slabs per slice size
                • Clear / repopulate cache with high-water mark
May 6-11      GC Sim flag

Working Log
Email to dga
tl;dr: The Go memory allocator possibly induces false sharing. How do I measure / reduce this?

As part of 15-418, I've started looking into parallel memory allocators (I'll probably write a parallel malloc for my final project). I've been reading some research papers and production mallocs and have also looked at the Go memory allocator. One of the things I have been focusing on is the false sharing induced by the allocator returning objects on the same cache line to different threads. Interestingly, I suspect that the nature of Go makes this significantly more of an issue than with C.

Some background: as far as I can tell, the most popular modern parallel mallocs are jemalloc, tcmalloc, ptmalloc, concur, and hoard. Oracle did some investigation, and I have taken a look at the internals of jemalloc, tcmalloc, concur, and hoard. As I understand it:

As for the problem of false sharing, the interesting thing is that only hoard addresses it (by always returning a block to its original superblock). jemalloc explicitly ignores it (although I think freed blocks eventually go back to the arena from which they originated) on the basis that it would be unusual for developers to allocate blocks in one thread and free them in another. Neither tcmalloc nor concur even mentions it, although concur at least puts blocks back in their original arena. This raises the important question: how much does false sharing induced by memory allocators actually affect performance? Perhaps, as jemalloc claims, it is infrequent enough that it's not worth addressing (the same thread will probably allocate and free a given object). It's not clear to me exactly how to test this either. The hoard author constructed a test that seems artificial at best, and it would be nice to check for false sharing on real workloads.
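
For what it's worth, the kind of artificial test I mean looks roughly like the sketch below (my own construction, not the actual hoard benchmark): one goroutine allocates all the counters back to back, so the allocator is likely to pack neighbors onto shared cache lines, and each worker then hammers its own counter. Any allocator-induced sharing shows up as a slowdown relative to padded counters.

// Sketch of an artificial false-sharing test; sizes and structure are my own.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

const nWrites = 1 << 24 // writes per goroutine; arbitrary for this sketch

func main() {
	n := runtime.GOMAXPROCS(0)

	// Allocate from a single goroutine so adjacent counters likely come
	// from the same span (and thus share cache lines).
	counters := make([]*int64, n)
	for i := range counters {
		counters[i] = new(int64)
	}

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(p *int64) {
			defer wg.Done()
			for w := 0; w < nWrites; w++ {
				*p += 1 // contended if p shares a line with a neighbor
			}
		}(counters[i])
	}
	wg.Wait()
	fmt.Printf("%d goroutines, %d writes each: %v\n", n, nWrites, time.Since(start))
}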

Which brings me to Go. Go uses a memory allocator very strongly based on tcmalloc (thread-local caches backed by otherwise globally shared slabs, with no attempt to reduce false sharing). However, I suspect that the usage patterns of Go make false sharing significantly more of an issue than in C programs: the assumption that the same thread allocates and frees a given object is simply not valid when the garbage collector frees everything and goroutines migrate between kernel threads.

I sent a message to golang-dev earlier this week mentioning some of this. I want to investigate and measure the effect of false sharing induced by the memory allocator. In addition, I want to refactor the Go allocator so I can swap in other implementations and evaluate performance on real workloads. Ultimately, I think it would be cool to port jemalloc to Go (or write an equivalent memory allocator focused more on reducing false sharing).

Email to golang-dev
I've spent a little time looking at the Go memory allocator and wanted to investigate the impact of false sharing due to blocks migrating between threads.

Currently, the Go memory allocator is based on tcmalloc and (as far as I can tell) does not attempt to avoid false sharing caused by allocating objects on the same cache line to multiple threads. This false sharing would occur when one kernel thread allocated several blocks on the same cache line and then freed only one of them, with another kernel thread freeing another (perhaps because the object was passed to another thread or the goroutine migrated threads). These blocks would be put in each thread's MCache and could later be allocated in a way that induces false sharing.
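
For concreteness, here is the simple address check underlying all of the "same cache line" reasoning in this note; the 64-byte line size is an assumption about the target hardware, not anything the runtime promises.

// A minimal sketch: two blocks can falsely share iff their addresses fall
// on the same cache line.
package main

import (
	"fmt"
	"unsafe"
)

const cacheLineBits = 6 // log2 of the assumed 64-byte cache line

// sameCacheLine reports whether two pointers land on one cache line.
func sameCacheLine(a, b unsafe.Pointer) bool {
	return uintptr(a)>>cacheLineBits == uintptr(b)>>cacheLineBits
}

func main() {
	x, y := new(int64), new(int64)
	fmt.Println(sameCacheLine(unsafe.Pointer(x), unsafe.Pointer(y)))
}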

I'm not clear on how the current malloc could avoid false sharing. It would seem reasonable to do so by moving a cache-line-aligned group of objects out of the shared MCentral and then hoping that the objects don't later migrate. This assumption doesn't quite seem valid, since the garbage collector doesn't appear to align objects when putting them back into MCentral, and MCentral doesn't appear to hand out cache-aligned groups of objects. In addition, my suspicion is that the assumption that objects don't migrate between threads is less valid in Go than in C, because a migrating goroutine transfers control of its objects from one kernel thread to another.
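
To make the idea of a "cache line aligned group" concrete, here is a minimal sketch, assuming 64-byte lines and objects packed contiguously from a line-aligned span base (my assumptions, not guarantees of the runtime):

// groupStart identifies which objects in a span would have to travel
// together to avoid splitting a cache line between threads.
package main

import "fmt"

const lineSize = 64

// groupStart returns the index of the first object sharing object i's
// cache line, so [groupStart, groupStart+lineSize/objSize) is one group.
func groupStart(i, objSize int) int {
	perLine := lineSize / objSize
	return i - i%perLine
}

func main() {
	// With 16-byte objects, objects 0-3 share a line, 4-7 the next, etc.
	fmt.Println(groupStart(5, 16)) // prints 4
}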

I would like to investigate the effect of false sharing and work to reduce it if appropriate. I started working on a test to mimic the existing malloc1.go etc. tests in the tree, but don't know how to run it (the existing files appear to use runtime.Alloc and runtime.Free, which no longer appear to be valid). In addition, I would appreciate feedback on my assumptions about the existing code base and some advice on how to proceed (I understand there is a large CL out for the garbage collector - I guess I should wait until that gets pulled in?).

Email to dga

I'm thinking of constructing a test that keeps track of which threads received which allocations (in a map, but XORing or stripping bits so the GC won't consider the keys references), but that runs into the same problem with the runtime API. I'm thinking of using a modified Go tree that exposes runtime·free and runtime·alloc for testing, but then I won't have the expected behavior of the GC (in a different thread) freeing blocks (perhaps I can simulate that with another thread).
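
A minimal sketch of that bookkeeping, assuming a simple XOR mask is enough to hide the keys from the (conservative) collector; recordAlloc, ownerOf, and the mask value are placeholders of mine:

// Track which thread was handed each block, keyed by the block's address
// XORed with a mask so the map keys no longer look like heap addresses
// and thus won't be treated as references that keep the blocks alive.
package main

import "unsafe"

const gcMask uintptr = 0x5555555555555555 // arbitrary bit pattern

var owner = map[uintptr]int{} // masked block address -> thread id

func recordAlloc(p unsafe.Pointer, thread int) {
	owner[uintptr(p)^gcMask] = thread
}

func ownerOf(p unsafe.Pointer) (int, bool) {
	t, ok := owner[uintptr(p)^gcMask]
	return t, ok
}

func main() {
	x := new(int64)
	recordAlloc(unsafe.Pointer(x), 1)
	if t, ok := ownerOf(unsafe.Pointer(x)); ok {
		println("owner:", t)
	}
}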

Email to golang-dev: preliminary report
tl;dr I created a simple program to do some preliminary investigation of possible false sharing, and got some results that led me to think it's worth looking at.

I wrote a simple program, malloccheck.go, that spawned a number of threads that each performed many small memory allocations. I then added performance analysis on top of it to track which threads were allocated which blocks and to emit pprof data as necessary.
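
This isn't the actual malloccheck.go source, but a sketch of its shape; the flag names match the run below, while the counting only approximates what the real program does (it omits the cache-reuse statistic, for one):

// Each worker (pinned to an OS thread) performs many 16-byte allocations;
// afterwards we count cache lines whose blocks went to more than one worker.
package main

import (
	"flag"
	"fmt"
	"runtime"
	"sync"
	"unsafe"
)

var (
	procs = flag.Int("procs", 1, "number of worker threads")
	iters = flag.Int("iters", 1000000, "total number of allocations")
)

func main() {
	flag.Parse()
	runtime.GOMAXPROCS(*procs)

	perWorker := *iters / *procs
	addrs := make([][]uintptr, *procs)
	var wg sync.WaitGroup
	for w := 0; w < *procs; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			runtime.LockOSThread() // pin this goroutine to one kernel thread
			a := make([]uintptr, perWorker)
			for i := range a {
				a[i] = uintptr(unsafe.Pointer(new([16]byte)))
			}
			addrs[w] = a
		}(w)
	}
	wg.Wait()

	// Attribute each cache line to the first worker that got a block on
	// it; later blocks on that line from other workers count as shared.
	lineOwner := map[uintptr]int{}
	shared, total := 0, perWorker*(*procs)
	for w, a := range addrs {
		for _, p := range a {
			line := p >> 6 // 64-byte cache lines
			if prev, ok := lineOwner[line]; !ok {
				lineOwner[line] = w
			} else if prev != w {
				shared++
			}
		}
	}
	fmt.Printf("Malloc check ran with %d procs for %d iters of size 16\n", *procs, total)
	fmt.Printf("\tCache line shared %d (%.2f%%) times\n", shared, 100*float64(shared)/float64(total))
}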

Running the program several times, I was surprised to see data like the following:

$ for i in {1..4}; do ./malloccheck --procs=$i --checkcache; done
Malloc check ran with 1 procs for 1000000 iters of size 16
	Cache line shared 0 (0.00%) times
	Cache was reused 992768 (99.28%) times
Malloc check ran with 2 procs for 1000000 iters of size 16
	Cache line shared 117344 (11.73%) times
	Cache was reused 871328 (87.13%) times
Malloc check ran with 3 procs for 1000000 iters of size 16
	Cache line shared 158466 (15.85%) times
	Cache was reused 829757 (82.98%) times
Malloc check ran with 4 procs for 1000000 iters of size 16
	Cache line shared 180950 (18.09%) times
	Cache was reused 805802 (80.58%) times

This is a really preliminary step in the investigation, but it suggests that false sharing may occur more frequently in Go than one might think. I suspect this is because the garbage collector frees blocks to a global pool where all threads have an equal chance at the blocks that were just freed. I want to look into segregated arenas (à la jemalloc or concur) so that freed blocks migrate back to the thread that originally allocated them.

Real world traces

I spent some time (read: 6+ hours) today trying to get real-world parallel malloc traces. I experimented with ltrace, but found that Firefox uses an internal memory allocator (in fact, many internal memory allocators), so ltrace had nothing to hook into. ltrace also did not work on Chrome for unknown reasons. I then downloaded and compiled Firefox with --enable-trace-malloc, but that did not actually produce any useful traces. Perhaps this is due to the variety of internal memory allocators? I am not sure. I also spent some time trying to trace Go memory allocations, but the garbage collection makes it somewhat unclear what exactly I need to hook into. I have some existing trace infrastructure, but it might record excess frees.

Real Go traces

I realized that I could patch the runtime to print on every malloc and then post-process the trace. I did so (using the same logic as before: keeping track of each time a cache line migrated from one thread to another), with the understanding that this gives a lower bound on the number of invalidations due to false sharing / cache line migration (in the real world, threads might perform more than one write to a block): Source here
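
The post-processing pass is roughly the following, assuming each trace record is a "threadID hex-address" pair (the real format is whatever my runtime patch prints; this is illustrative):

// Replay a malloc trace from stdin, tracking the last kernel thread to
// allocate on each cache line and counting cross-thread migrations.
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	lastThread := map[uintptr]int{} // cache line -> last thread to allocate on it
	shared, cold, total := 0, 0, 0

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		var tid int
		var addr uintptr
		if _, err := fmt.Sscanf(sc.Text(), "%d %x", &tid, &addr); err != nil {
			continue // skip lines that aren't trace records
		}
		total++
		line := addr >> 6 // 64-byte cache lines
		if prev, ok := lastThread[line]; !ok {
			cold++ // first malloc we've seen on this line
		} else if prev != tid {
			shared++ // the line migrated between kernel threads
		}
		lastThread[line] = tid
	}
	if total == 0 {
		return
	}
	fmt.Printf("Malloc trace of %d mallocs\n", total)
	fmt.Printf("Cache line shared %d (%.2f%%) times\n", shared, 100*float64(shared)/float64(total))
	fmt.Printf("Cache line cold missed %d (%.2f%%) times\n", cold, 100*float64(cold)/float64(total))
}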

I haven't tried it on many test cases yet, but preliminary results indicate that my earlier test might not have been far off: the go.uik test program (an example of a program with multiple goroutines performing interesting logic) exhibited an unexpectedly high rate of cache line sharing:

# a trace of go.uik binary
Malloc trace of 25694 mallocs over 12 cores
Cache line shared 3784 (14.73%) times
Cache line cold missed 5219 (20.31%) times

It's not clear yet to what extent this is caused by thread migrations versus false sharing, but I think it is still worth looking into.

Also, small allocations (of size 8 or 16) appear to be the most common: Source here. If those get written to frequently, any false sharing could cause a real difference.
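
For reference, the size tally is just a histogram over the trace, assuming each record also carries the requested size (again, my own format assumption, not necessarily what the patch emits):

// Histogram allocation sizes from a malloc trace read on stdin.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
)

func main() {
	counts := map[int]int{} // requested size -> number of mallocs
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		var tid, size int
		var addr uintptr
		if _, err := fmt.Sscanf(sc.Text(), "%d %x %d", &tid, &addr, &size); err != nil {
			continue // skip lines that aren't trace records
		}
		counts[size]++
	}
	sizes := make([]int, 0, len(counts))
	for s := range counts {
		sizes = append(sizes, s)
	}
	sort.Ints(sizes)
	for _, s := range sizes {
		fmt.Printf("size %d: %d mallocs\n", s, counts[s])
	}
}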