- Joined: Jan 3, 2022
- Messages: 355
Disclaimer: This is not something you would ever encounter under normal circumstances.
TL;DR: Under very high GC pressure, the process's memory grows. Even after the GC catches up, that memory is not released back to the OS.
I'm currently debugging a memory leak in a large, trigger-heavy map that I'm porting from Jass to Lua. To do that, I hooked every single global function, both to catch errors automatically and to count function calls.
Each wrapper intercepts a function call and temporarily saves all input arguments in a local table.
This caused war3.exe's process memory to grow continuously: at one point it reached 17 GB before I noticed, and on another run "only" 11 GB in Working Set. I suspected the Lua GC as it was pre-configured after the 1.31 desyncs, so I tried to reduce the number of temporary objects my intercepting functions create.
For some reason this solved the issue. I don't understand exactly why, because Lua's garbage collector should simply work. I then made a demo map (attached) that reproduces garbage creation with a simple function generating a specified number of tables with dummy values every second.
Testing with "-gen 100000" (100k tables per second), the game's process reached a steady state at 2.7 GB of memory, with no further allocations. Smaller values settled at lower memory usage.
Further testing at 500k tables/s: the process memory reached 21 GB and was still growing when I dropped the creation rate back to 123/s. The process never released that memory. OK, I think this was just too much.
My best attempt to explain the difference: my tracing functions didn't create just any tables, they created tables holding references to in-game objects, because all API functions such as CreateGroup were hooked too. My small test shows that the Lua GC has no trouble with the allocation rate and eventually reaches an equilibrium, so my guess is that the game engine doesn't handle it well when references are released very late, which leads to unnecessary bloat in the internal structures holding that data. The original map created a lot of groups and locations and iterated over many units (it was a mix of Jass, vJass and GUI).
Anyway, since there's no way to force a collection or tune the GC, I'd call it a bug in Reforged. Again, you normally shouldn't encounter it.
Speculating on possible advice: clear saved references ("userdata") as early as possible. For example, if you keep them in a global table, remove the entry right after calling RemoveLocation etc. This isn't different from Jass. However, you don't need to manually nil local variables; the GC handles those fine. If anyone knows how the Lua GC works in depth, please comment if anything else can be done (collectgarbage and similar are not available to scripts in 1.32+).
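The advice about saved references can be illustrated outside the game. Below is a minimal plain Lua 5.3 sketch; it calls `collectgarbage`, which map scripts in 1.32+ can't, so it only runs in a desktop interpreter. A table with a `__gc` finalizer stands in for a game handle (real handles are userdata owned by the engine), and `newFakeHandle`, `savedLocations` and the `finalized` counter are hypothetical names. It checks that a handle referenced only by a local is collected without any `loc = nil`, while one stored in a global table survives full collections until its entry is cleared.

```lua
-- Plain Lua 5.3 sketch, NOT a map script: collectgarbage() is unavailable in 1.32+.
-- newFakeHandle / savedLocations / finalized are hypothetical stand-ins.

finalized = 0 -- number of fake handles the GC has finalized so far

-- Stand-in for a game handle such as a location: a table whose
-- __gc metamethod runs when the collector frees it.
local function newFakeHandle()
    return setmetatable({}, { __gc = function() finalized = finalized + 1 end })
end

savedLocations = {} -- global registry, like a map-wide cache of handles

-- 1) A handle used only through a local: no manual nil-ing needed.
local function useTemporary()
    local loc = newFakeHandle()
    -- ... use loc ...
end
useTemporary()
collectgarbage()          -- full cycle; finalizers of unreachable objects run
afterLocal = finalized    -- the temporary handle has been finalized

-- 2) A handle stored in a global table stays reachable...
local function store()
    savedLocations[1] = newFakeHandle()
end
store()
collectgarbage()
whileSaved = finalized    -- unchanged: savedLocations still references it

-- ...until the entry is cleared (in a map: right after RemoveLocation).
savedLocations[1] = nil
collectgarbage()
afterClear = finalized
```

This only shows when the Lua side lets go of a reference; what the engine does with the underlying object afterwards is the part that remains speculation.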
The hooking script (Lua):
```lua
-- safely call functions from other code and catch any errors, outputting them to all chat
-- Note: By default Reforged doesn't output any errors, they can only be found in memory dumps
function safeCall(...) -- TODO: not needed, GC
    return safeCallHandle(pcall(...))
end

-- This is needed due to how varargs work in Lua
function safeCallHandle(...)
    local arg = {...} -- TODO: not needed, GC
    local ok, err = arg[1], arg[2]
    if ok == true then
        return select(2, ...)
    else
        print("|cffee2222Error occurred: ")
        for line in tostring(err):gmatch("[^\n]+") do
            -- I think errors don't have stack traces here, so only single line :(
            print("|cffee2222" .. line)
        end
        -- abort execution and point to above function
        error(tostring(err), 3)
    end
end
---
originalFunctions = {
    -- [functionName] = func
    callCount = {
        -- [functionName] = call count
    }
}
blacklist = {} -- empty
-- calls to these functions with a nil first argument are counted separately ("+nil")
profileArg1 = {
    ["RemoveLocation"] = true,
    ["DestroyLightning"] = true,
    ["DestroyGroup"] = true,
    ["RemoveDestructable"] = true,
    ["KillDestructable"] = true,
    ["DestroyTextTag"] = true,
    ["DestroyEffect"] = true,
    ["RemoveWeatherEffect"] = true,
    ["DestroyFogModifier"] = true,
    ["KillUnit"] = true,
    ["RemoveUnit"] = true,
}
-- Wrap every global function, except the ones this script itself relies on,
-- so that each call is counted and runs through safeCall.
for k, v in pairs(_G) do
    if type(v) == "function" then
        if k ~= "pcall" and k ~= "xpcall"
            and k ~= "print" and k ~= "select" and k ~= "error"
            and k ~= "tostring" and k ~= "type" and k ~= "tonumber"
            and k ~= "next" and k ~= "pairs"
            and k ~= "safeCall" and k ~= "safeCallHandle" then
            print("Overwriting: " .. k)
            originalFunctions[k] = v
            local origFunc = v
            local origName = k
            _G[k] = function(...)
                local profilingName = origName
                -- check if first arg is nil
                if profileArg1[origName] then
                    local args = table.pack(...) -- TODO: not needed, GC
                    profilingName = profilingName .. (args[1] == nil and "+nil" or "")
                end
                if false and not blacklist[origName] then -- per-call debug output, disabled
                    print(profilingName)
                end
                originalFunctions.callCount[profilingName] =
                    1 + (originalFunctions.callCount[profilingName] or 0)
                return safeCall(origFunc, ...)
            end
        end
    end
end
do
    trg_dump = CreateTrigger()
    act_dump = function()
        print("Dumping...")
        local sortedFuncs = {}
        for funcName, calls in pairs(originalFunctions.callCount) do
            table.insert(sortedFuncs, funcName)
        end
        table.sort(sortedFuncs, function(a, b)
            return originalFunctions.callCount[a] < originalFunctions.callCount[b]
        end)
        local out = {}
        for i = 1, #sortedFuncs do
            local name = sortedFuncs[i]
            -- TODO: not needed, GC
            -- "\037" is the decimal escape for "%", so the format string is "%7d: %s"
            table.insert(out, string.format("\0377d: \037s", originalFunctions.callCount[name], name))
        end
        PreloadGenClear()
        PreloadGenStart()
        -- Preload has a string limit of 259 chars... Windows' MAX_PATH?
        for i = 1, #out do
            Preload(out[i])
        end
        PreloadGenEnd(os.date("profiling-%Y_%m_%d-%H_%M.txt"))
        out = nil -- idk something in the game was causing a 16GB mem leak
        return true
    end
    local act_dumpLocal = act_dump
    TriggerAddAction(trg_dump, function() return safeCall(act_dumpLocal) end)
    -- note: this command stops working after 15-20s in-game, I haven't figured out why
    -- maybe it's a rogue trigger doing something
    TriggerRegisterPlayerChatEvent(trg_dump, Player(0), "-dumpprofile", true)
    -- also dump automatically every 10 seconds
    local t = CreateTimer()
    TimerStart(t, 10, true, act_dump)
end
```
The change that reduced the temporary objects in the intercepting functions (diff):
```diff
 function safeCallHandle(...)
-    local arg = {...} -- TODO: not needed, GC
-    local ok, err = arg[1], arg[2]
+    local ok, err = ...
     if ok == true then
         return select(2, ...)
     else
@@ -447,8 +446,8 @@ for k,v in pairs(_G) do
                 local profilingName = origName
                 -- check if first arg is nil
                 if profileArg1[origName] then
-                    local args = table.pack(...) -- TODO: not needed, GC
-                    profilingName = profilingName .. (args[1] == nil and "+nil" or "")
+                    local arg1 = ...
+                    profilingName = profilingName .. (arg1 == nil and "+nil" or "")
                 end
```
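How much garbage the old argument handling produces can be checked in a desktop Lua 5.3 interpreter (`collectgarbage` isn't available to map scripts in 1.32+). This sketch pauses the collector and measures how much heap two hypothetical wrappers allocate over 100,000 calls: `wrapPacked` copies the arguments with `table.pack` like the old code, while `wrapUnpacked` reads the first argument by multiple assignment like the new code. Neither is the exact hook from the map; they only isolate the argument handling.

```lua
-- Plain Lua 5.3 sketch, NOT a map script: collectgarbage is unavailable in 1.32+.
-- wrapPacked / wrapUnpacked / measure / target are hypothetical names.

local function target(a, b)
    return a
end

-- Old style: copies every argument into a fresh table on each call.
local function wrapPacked(f)
    return function(...)
        local args = table.pack(...)
        local firstIsNil = (args[1] == nil) -- what the "+nil" check needs
        return f(...)
    end
end

-- New style: reads the first argument without allocating anything.
local function wrapUnpacked(f)
    return function(...)
        local arg1 = ...
        local firstIsNil = (arg1 == nil)
        return f(...)
    end
end

-- Heap growth in KB over `calls` invocations, with the collector paused
-- so that no garbage is reclaimed during the measurement.
local function measure(wrapped, calls)
    collectgarbage("collect")
    collectgarbage("stop")
    local before = collectgarbage("count")
    for i = 1, calls do
        wrapped(i, i + 1)
    end
    local grown = collectgarbage("count") - before
    collectgarbage("restart")
    return grown
end

packedKB = measure(wrapPacked(target), 100000)
unpackedKB = measure(wrapUnpacked(target), 100000)
print(string.format("table.pack: %.0f KB, multiple assignment: %.0f KB", packedKB, unpackedKB))
```

With the collector paused, every `table.pack` call leaves a table behind, so the first figure should land in the megabytes while the second stays near zero; the exact numbers depend on the Lua version.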
The garbage generator from the demo map (Lua):
```lua
function genGarbage(amount)
    local n = math.random(0, 65535)
    for i = 1, amount do
        -- avoid array reallocations
        local t = {n, n + 1, n + 2, n + 3}
    end
end
```
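To watch the Lua heap settle outside the game, the same generator can be driven from a plain loop in a desktop Lua 5.3 interpreter. In this sketch each iteration stands in for one timer tick of the demo map; `simulate` and the tick parameters (20,000 tables per tick, 50 ticks) are made up. `collectgarbage("count")` reports only the Lua heap in KB, not war3.exe's Working Set, so it can show the collector keeping up but says nothing about memory being returned to the OS.

```lua
-- Plain Lua 5.3 sketch, NOT a map script: collectgarbage is unavailable in 1.32+.

-- Same generator as the demo map: `amount` short-lived 4-element tables.
function genGarbage(amount)
    local n = math.random(0, 65535)
    for i = 1, amount do
        local t = {n, n + 1, n + 2, n + 3}
    end
end

-- Hypothetical driver: one genGarbage call per simulated "second",
-- recording the Lua heap size (KB) after each tick.
local function simulate(amountPerTick, ticks)
    local samples = {}
    for tick = 1, ticks do
        genGarbage(amountPerTick)
        samples[tick] = collectgarbage("count")
    end
    return samples
end

heapSamples = simulate(20000, 50)

peakKB = 0
for _, kb in ipairs(heapSamples) do
    if kb > peakKB then
        peakKB = kb
    end
end
print(string.format("peak Lua heap: %.0f KB over %d ticks", peakKB, #heapSamples))
```

If the collector keeps up, the samples should level off at a small value instead of growing with every tick; a million short-lived tables that were never collected would take on the order of 100 MB.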