This article attempts to summarize some findings about heap management under 32-bit versions of Windows that I had to learn about the hard way while trying to optimize memory usage of a fairly complex algorithm. While the algorithm was implemented as a portable Python/C hybrid (that is, Python code calling C libraries for the core algorithms), this occasionally complicated matters, but did not change most of the fundamental issues discussed here.
One interesting observation about this project was this: the
particular out-of-memory issue I tried to solve did not seem to
occur under Linux at all, and looked rather specific to Windows
XP. After ruling out obvious things such as memory leaks and
Python garbage collection, it eventually turned out that some
aspects of Windows memory were responsible that don't appear to be
documented explicitly in most places that talk about the Windows
heap.
In this context, "very large" means "using a significant portion of the 4GB address space", in other words several 100MB of data in a single array. Of course, there were always a couple of "easy" answers out of this dilemma, such as:
- Use double indrection to split up a "huge" array into several "small" ones with more manageable size
- Maximize usable address space by booting Windows with the /3GB
switch and setting the LARGEADDRESSAWARE
flag on the executable.
Note: Any warnings about reduced overall Windows performance should be remembered: I experienced a slowdown of at least a factor of 10 (!) in rebuilding a large project after enabling the "/3GB" switch, and have also seen reports about a drop in some graphics cards' performance (probably due to the loss of memory apertures). - Move to a 64-bit version of Windows (and recompiling your code to make use of it!)
- Use Linux :-)
Huge "monolithic" memory blocks are bad
Try to avoid memory allocations of more than about 100MB in a single chunk like the plague.
This is perhaps the single most important lesson: if you keep this in mind and write your code accordingly you could stop reading here, and ignore the remaining advice: it mainly deals with getting by if you can't avoid them completely, e.g. because some algorithms such as qsort() can be implemented much more easily on a flat array.
Please note that the 100MB limit is not a hard and fast number - it may be 50MB or 200MB, depending on what you do with your blocks, and how many there are. But this is the order of magnitude where it usually gets tricky. I'd say anything in excess of 200MB is begging for trouble.
Address-space fragmentation can be a problem, and VMMap is your friend
The following program can be used to exercise most of thes isses described in this text (note that it uses GlobalAlloc rather than HeapAlloc for simplicity, but in recent versions of Windows the only difference is that GlobalAlloc works with the default heap that gets created for a process anyway):
#include <windows.h>
#include <stdio.h>
int main()
{
int i;
void *p, *p2;
static void* ptr[100000];
printf("Attempting to allocate 1000MB: ");
p = (void*)GlobalAlloc(0, 1000*1024*1024);
if(p) puts("success"); else puts("failure");
printf("Press [Enter] to continue");
getchar();
printf("Attempting to reallocate to 1001MB:
");
p2 = (void*)GlobalReAlloc(p, 1001*1024*1024,
0);
if(p2) puts("success"); else { p2=p;
puts("failure"); }
printf("Press [Enter] to continue");
getchar();
printf("Attempting to reallocate to 900MB:
");
p = (void*)GlobalReAlloc(p2, 900*1024*1024,
0);
if(p2) puts("success"); else { p=p2;
puts("failure"); }
printf("Press [Enter] to continue");
getchar();
GlobalFree(p2);
puts("Freed memory");
printf("Attempting to allocate/free 500MB:
");
p = (void*)GlobalAlloc(0, 500*1024*1024);
if(p) puts("success"); else puts("failure");
GlobalFree(p);
printf("Press [Enter] to continue");
getchar();
puts("Allocating/freeing 23,552 blocks of
63k");
for(i=0; i<23*1024; i++)
{
ptr[i] =
(void*)GlobalAlloc(0, 63*1024);
if(ptr[i]==NULL) {
puts("Unexpected failure"); return 1; }
}
for(i=0; i<100000; i++)
GlobalFree(ptr[i]);
printf("Press [Enter] to continue");
getchar();
printf("Attempting to allocate/free 500MB:
");
p = (void*)GlobalAlloc(0, 500*1024*1024);
if(p) puts("success"); else puts("failure");
GlobalFree(p);
printf("Press [Enter] to continue");
getchar();
return 0;
}
Source code and compiled version are available here.
This example is best run together with the the VMMap
utility by Microsoft's Sysinternals team, which can be used
to display the memory layout of a process, and is an invaluable
tool for resolving address space fragmentation issues. While doing
so, it is useful to enable the "Show free regions"
option of VMMap, which demonstrates very nicely how randomly
placed EXEs and DLLs punch holes into the otherwise contiguous
address space of Windows, furtherer limit the ability to allocate
large arrays.
The sample application above stops after each major re-allocation event, allowing you to inspect memory with VMMap, and continues after pressing [Enter]. The expected output on a 32-bit Windows system looks as follows:
Attempting to allocate 1000MB: success
Press [Enter] to continue
Attempting to reallocate to 1001MB: failure
Press [Enter] to continue
Attempting to reallocate to 900MB: success
Press [Enter] to continue
Freed memory
Attempting to allocate/free 500MB: success
Press [Enter] to continue
Allocating/freeing 23,552 blocks of 63k
Press [Enter] to continue
Attempting to allocate/free 500MB: failure
Press [Enter] to continue
There is no VirtualReAlloc
One very simple fact that causes the first failure is this: there simply is no VirtualReAlloc() call in Windows to complement VirtualAlloc() and VirtualFree(). This has important implications for the way the HeapReAlloc() function is implemented internally: because HeapAlloc() directly calls VirtualAlloc() for allocation of large blocks, Windows can only resize such blocks by allocating a new block of the desired size, then copying the old data and finally freeing the old block.
This strategy of course works only if the two blocks can exist
side-by-side in memory for a short time - if there is only enough
address space for one of them, resizing fails (as demonstrated by
the first failure in the above example). On the other hand, it
turns out that in the case of shrinking a block Windows "hides"
this failure by simply not reducing its size, but still
succeeding: if you inspect memory at the third stopping point, you
will notice that the block was not actually shunk back to 900MB in
the address space, but remains at its original 1000MB allocation
size (however, the unneeded memory gets decommitted from virtual
memory).
Heap regions allocated for holding "small" blocks do not shrink
back
After allocating a large number of "small" items, the address
region they occupied remains reserved for holding only such small
items in the future, even after all of them have been freed again.
The virtual memory held by these regions is "decommitted", so it
doesn't show up in the memory usage of the process any more, but
its address space is still unavailable for allocation of new
blocks of virtual memory that are too big to be stored in those
sub-allocated regions.
This means that large allocations following after freeing a lot
of smaller items may fail, even though memory usage of the process
appears to be low. In effect, the more "small" items are allocated
by an application, the more of ites address space gets permanently
dedicated to small-only allocations, and cannot be reclaimed other
than by destroying the heap.
In the example above, this is the reason for the second failure
of allocating a 500MB block (which used to work fine before) -
having allocated about 1.5GB worth of 63k blocks before, most of
the address space is now occupied by heap regions for holding
"small" items, and thus become unavailable to large blocks.
DLLs and EXEs can fragment memory in unexpected ways
It is also important to look at how your EXEs and DLLs are spread
accross the address space. If you look at it with VMMap, they
appear to be placed in seemingly random fashion, thus reducing the
maximum contiguous amout of memory that can be allocated in a
single buffer. For example, when I look at the example above on my
Windows 7 Home system, I see the following "Image" locations
cutting through the upper end of the biggest available chunk of
free memory:
- KernelBase.dll at 0x75AC0000
- ...followed by 28.120K of free memory
- ntdll.dll at 0x77680000
- ...followed by 80K of free memory
- kernel32.dll at 0x777D0000
- ...followed by 112K of free memory
- apisetschema.dll at 0x778C0000
- ...followed by 129.212K of free memory
Of course, these locations are not actually random, and may be
different depending on your version of Windows - in fact, each EXE
and DLL has a "preferred" location at which it can be loaded
quickly, without having to relocate any absolute addresses in the
executable file, and by default this is just where Windows will
place them, in order to load quickly and reduce virtual memory
usage. Each DLL loaded by your applications either directly or
indirectly (such as an MSVC runtime library) will further fragment
the address space.
However, when tuning an application for optimal usage of address
space, this behaviour may be counter-productive, and in some cases
it may be useful to optimize the loading location of
application-specific DLLs to be clustered more closely in address
space. This is best done at compile time, but in cases where
rebuilding is not possible, the /REBASE
option of the EDITBIN utility can also be used to move a DLL to a
different base location.
One limitation I noticed when trying to rebase an executable (in
my case, python.exe, which use a base location of 0x1d000000 by
default, separating about 450MB from the bottom of the contiguous
address space) is that EDITBIN will not rebase an EXE file, and
may simply leave it unchanged without producing an error message.
It is not clear to my why this happens, but in such cases only
recompiling the EXE (or using a hex editor, for the foolishly
brave :-)) will help. In each case, VMMap should be used to verify
the intended effect of the address space reshuffling, as Windows
may always take the liberty to resolve address space collisions by
moving a DLL to a different place.