Some little-known facts about Windows 32-bit heap management

[Version 1.01, January 2012]

This article attempts to summarize some findings about heap management under 32-bit versions of Windows that I had to learn about the hard way while trying to optimize memory usage of a fairly complex algorithm. While the algorithm was implemented as a portable Python/C hybrid (that is, Python code calling C libraries for the core algorithms), this occasionally complicated matters, but did not change most of the fundamental issues discussed here.

One interesting observation about this project was this: the particular out-of-memory issue I tried to solve did not seem to occur under Linux at all, and looked rather specific to Windows XP. After ruling out obvious things such as memory leaks and Python garbage collection, it eventually turned out that some aspects of Windows memory were responsible that don't appear to be documented explicitly in most places that talk about the Windows heap.

In this context, "very large" means "using a significant portion of the 4GB address space", in other words several 100MB of data in a single array. Of course, there were always a couple of "easy" answers out of this dilemma, such as:

Huge "monolithic" memory blocks are bad

Try to avoid memory allocations of more than about 100MB in a single chunk like the plague.

This is perhaps the single most important lesson: if you keep this in mind and write your code accordingly you could stop reading here, and ignore the remaining advice: it mainly deals with getting by if you can't avoid them completely, e.g. because some algorithms such as qsort() can be implemented much more easily on a flat array.

Please note that the 100MB limit is not a hard and fast number - it may be 50MB or 200MB, depending on what you do with your blocks, and how many there are. But this is the order of magnitude where it usually gets tricky. I'd say anything in excess of 200MB is begging for trouble.

Address-space fragmentation can be a problem, and VMMap is your friend

The following program can be used to exercise most of thes isses described in this text (note that it uses GlobalAlloc rather than HeapAlloc for simplicity, but in recent versions of Windows the only difference is that GlobalAlloc works with the default heap that gets created for a process anyway):

#include <windows.h>
#include <stdio.h>

int main()
{
    int i;
    void *p, *p2;
    static void* ptr[100000];

    printf("Attempting to allocate 1000MB: ");
    p = (void*)GlobalAlloc(0, 1000*1024*1024);
    if(p) puts("success"); else puts("failure");

    printf("Press [Enter] to continue"); getchar();

    printf("Attempting to reallocate to 1001MB: ");
    p2 = (void*)GlobalReAlloc(p, 1001*1024*1024, 0);
    if(p2) puts("success"); else { p2=p; puts("failure"); }

    printf("Press [Enter] to continue"); getchar();

    printf("Attempting to reallocate to 900MB: ");
    p = (void*)GlobalReAlloc(p2, 900*1024*1024, 0);
    if(p2) puts("success"); else { p=p2; puts("failure"); }

    printf("Press [Enter] to continue"); getchar();
    
    GlobalFree(p2);
    puts("Freed memory");

    printf("Attempting to allocate/free 500MB: ");
    p = (void*)GlobalAlloc(0, 500*1024*1024);
    if(p) puts("success"); else puts("failure");
    GlobalFree(p);

    printf("Press [Enter] to continue"); getchar();

    puts("Allocating/freeing 23,552 blocks of 63k");
    for(i=0; i<23*1024; i++)
    {
        ptr[i] = (void*)GlobalAlloc(0, 63*1024);
        if(ptr[i]==NULL) { puts("Unexpected failure"); return 1; }
    }
    for(i=0; i<100000; i++) GlobalFree(ptr[i]);

    printf("Press [Enter] to continue"); getchar();

    printf("Attempting to allocate/free 500MB: ");
    p = (void*)GlobalAlloc(0, 500*1024*1024);
    if(p) puts("success"); else puts("failure");
    GlobalFree(p);

    printf("Press [Enter] to continue"); getchar();

    return 0;
}

Source code and compiled version are available here.

This example is best run together with the the VMMap utility by Microsoft's Sysinternals team, which can be used to display the memory layout of a process, and is an invaluable tool for resolving address space fragmentation issues. While doing so, it is useful to enable the "Show free regions" option of VMMap, which demonstrates very nicely how randomly placed EXEs and DLLs punch holes into the otherwise contiguous address space of Windows, furtherer limit the ability to allocate large arrays.

The sample application above stops after each major re-allocation event, allowing you to inspect memory with VMMap, and continues after pressing [Enter]. The expected output on a 32-bit Windows system looks as follows:

Attempting to allocate 1000MB: success
Press [Enter] to continue
Attempting to reallocate to 1001MB: failure
Press [Enter] to continue
Attempting to reallocate to 900MB: success
Press [Enter] to continue
Freed memory
Attempting to allocate/free 500MB: success
Press [Enter] to continue
Allocating/freeing 23,552 blocks of 63k
Press [Enter] to continue
Attempting to allocate/free 500MB: failure
Press [Enter] to continue

There is no VirtualReAlloc

One very simple fact that causes the first failure is this: there simply is no VirtualReAlloc() call in Windows to complement VirtualAlloc() and VirtualFree(). This has important implications for the way the HeapReAlloc() function is implemented internally: because HeapAlloc() directly calls VirtualAlloc() for allocation of large blocks, Windows can only resize such blocks by allocating a new block of the desired size, then copying the old data and finally freeing the old block.

This strategy of course works only if the two blocks can exist side-by-side in memory for a short time - if there is only enough address space for one of them, resizing fails (as demonstrated by the first failure in the above example). On the other hand, it turns out that in the case of shrinking a block Windows "hides" this failure by simply not reducing its size, but still succeeding: if you inspect memory at the third stopping point, you will notice that the block was not actually shunk back to 900MB in the address space, but remains at its original 1000MB allocation size (however, the unneeded memory gets decommitted from virtual memory).

Heap regions allocated for holding "small" blocks do not shrink back

After allocating a large number of "small" items, the address region they occupied remains reserved for holding only such small items in the future, even after all of them have been freed again. The virtual memory held by these regions is "decommitted", so it doesn't show up in the memory usage of the process any more, but its address space is still unavailable for allocation of new blocks of virtual memory that are too big to be stored in those sub-allocated regions.

This means that large allocations following after freeing a lot of smaller items may fail, even though memory usage of the process appears to be low. In effect, the more "small" items are allocated by an application, the more of ites address space gets permanently dedicated to small-only allocations, and cannot be reclaimed other than by destroying the heap.

In the example above, this is the reason for the second failure of allocating a 500MB block (which used to work fine before) - having allocated about 1.5GB worth of 63k blocks before, most of the address space is now occupied by heap regions for holding "small" items, and thus become unavailable to large blocks.

DLLs and EXEs can fragment memory in unexpected ways

It is also important to look at how your EXEs and DLLs are spread accross the address space. If you look at it with VMMap, they appear to be placed in seemingly random fashion, thus reducing the maximum contiguous amout of memory that can be allocated in a single buffer. For example, when I look at the example above on my Windows 7 Home system, I see the following "Image" locations cutting through the upper end of the biggest available chunk of free memory:

Of course, these locations are not actually random, and may be different depending on your version of Windows - in fact, each EXE and DLL has a "preferred" location at which it can be loaded quickly, without having to relocate any absolute addresses in the executable file, and by default this is just where Windows will place them, in order to load quickly and reduce virtual memory usage. Each DLL loaded by your applications either directly or indirectly (such as an MSVC runtime library) will further fragment the address space.

However, when tuning an application for optimal usage of address space, this behaviour may be counter-productive, and in some cases it may be useful to optimize the loading location of application-specific DLLs to be clustered more closely in address space. This is best done at compile time, but in cases where rebuilding is not possible, the /REBASE option of the EDITBIN utility can also be used to move a DLL to a different base location.

One limitation I noticed when trying to rebase an executable (in my case, python.exe, which use a base location of 0x1d000000 by default, separating about 450MB from the bottom of the contiguous address space) is that EDITBIN will not rebase an EXE file, and may simply leave it unchanged without producing an error message. It is not clear to my why this happens, but in such cases only recompiling the EXE (or using a hex editor, for the foolishly brave :-)) will help. In each case, VMMap should be used to verify the intended effect of the address space reshuffling, as Windows may always take the liberty to resolve address space collisions by moving a DLL to a different place.