BLP shared JPEG header

Barade · Aug 26, 2011

As wc3c.net seems to be quite inactive, I want to ask this here as well.

Dr Super Good · Aug 27, 2011

I recently ran into the problem of writing the JPEG-compressed MIP maps of a BLP file correctly. I searched around and found this thread, which doesn't really help me, since I still don't know how to write the JPEG header. There is always one shared header, but I think there must be some additional header data per MIP map that holds the size: when I read MIP maps from Warcraft III textures, I only use jpeglib to read their size, and it is always correct even though I prepend the shared header to the MIP map data itself.
Could someone please explain how I have to store the MIP maps properly? I am writing OGRE and Qt plugins for BLP and I do support one thread per MIP map for JPEG compression already.
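
To illustrate where a per-mipmap size can come from: in a JPEG stream the image dimensions are stored in the start-of-frame (SOF) segment. The following is a minimal sketch, not the libjpeg code path; it assumes the input is already the shared header concatenated with one mipmap's bytes, and `JpegSize` and `findJpegSize` are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct JpegSize {
    std::uint16_t width;
    std::uint16_t height;
};

// Walks the marker segments of a complete JPEG stream and returns the size
// stored in the first start-of-frame segment, or std::nullopt if the stream
// is malformed or the scan data starts before any frame header is seen.
std::optional<JpegSize> findJpegSize(const std::vector<std::uint8_t>& jpeg) {
    // Every JPEG stream starts with the SOI marker FF D8.
    if (jpeg.size() < 4 || jpeg[0] != 0xFF || jpeg[1] != 0xD8) {
        return std::nullopt;
    }
    std::size_t pos = 2;
    while (pos + 4 <= jpeg.size()) {
        if (jpeg[pos] != 0xFF) {
            return std::nullopt;  // not positioned on a marker
        }
        const std::uint8_t marker = jpeg[pos + 1];
        if (marker == 0xFF) {  // fill byte in front of a marker
            ++pos;
            continue;
        }
        if (marker == 0xDA || marker == 0xD9) {
            return std::nullopt;  // start of scan or end of image
        }
        const std::size_t length =
            (static_cast<std::size_t>(jpeg[pos + 2]) << 8) | jpeg[pos + 3];
        if (length < 2) {
            return std::nullopt;  // the length field counts its own 2 bytes
        }
        // SOF markers are C0..CF except C4 (DHT), C8 (JPG) and CC (DAC).
        const bool isFrameHeader = marker >= 0xC0 && marker <= 0xCF &&
                                   marker != 0xC4 && marker != 0xC8 &&
                                   marker != 0xCC;
        if (isFrameHeader) {
            // Payload layout: precision(1) height(2) width(2) ...
            if (pos + 9 > jpeg.size()) {
                return std::nullopt;
            }
            JpegSize size;
            size.height = static_cast<std::uint16_t>((jpeg[pos + 5] << 8) | jpeg[pos + 6]);
            size.width = static_cast<std::uint16_t>((jpeg[pos + 7] << 8) | jpeg[pos + 8]);
            return size;
        }
        pos += 2 + length;  // skip marker bytes plus the segment
    }
    return std::nullopt;
}
```

If the shared header ends before the frame header, the width and height bytes necessarily sit in each mipmap's own data, which would explain why the decoded size is correct per mipmap.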

I am a bit confused. You can read mipmaps from JPEG-compressed .blp files, yet you cannot write mipmaps to a JPEG-compressed .blp file?

Can your own working .blp reader read the .blp files you write? If not, then look at what you are doing differently in the reading steps compared to the writing steps.

Barade said:
I do support one thread per MIP map for JPEG compression already.

I am not too sure that is wise... Threads have a huge overhead to set up and synchronize. Especially since the final mipmap consists of 1 pixel, this seems like a huge waste of a thread. You will probably find that it takes over 1000 times longer to actually read the .blp file than it takes to decode the JPEG blocks in it. If you insist on multithreading, you should consider a pipeline approach where you decode JPEG blocks into the various final image components instead of working per mipmap, as this gives symmetrical thread loading and scales better on multi-core CPUs than one thread per mipmap level. It is still probably a huge waste of processor time (virtually no gain over a single thread in most practical cases).

I do not know much about JPEG-compressed .blp files offhand; however, I am willing to learn about them to help you.

Barade · Aug 29, 2011

Dr Super Good said:
I am a bit confused. You can read mipmaps from JPEG-compressed .blp files, yet you cannot write mipmaps to a JPEG-compressed .blp file?

Can your own working .blp reader read the .blp files you write? If not, then look at what you are doing differently in the reading steps compared to the writing steps.

It's not that easy. There is always a header which is shared by ALL MIP maps, and what I wanted to know is what that header contains and how I can omit data (like the image size) which obviously is not shared.
I've already written a private message to user Shadow Daemon, who is explaining the whole thing to me.
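
One way to obtain a shared header that is correct by construction is to take the longest common byte prefix of all encoded mipmap streams. This is only a sketch under that assumption; the thread does not establish that Blizzard's tools work this way or that the game accepts a prefix of any length. `SplitMipMaps`, `splitSharedHeader` and `joinSharedHeader` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

struct SplitMipMaps {
    Bytes sharedHeader;          // bytes written once, in front of all mipmaps
    std::vector<Bytes> mipData;  // what remains of each stream after the prefix
};

// Splits complete JPEG streams (one per mipmap) into their longest common
// prefix and the per-mipmap remainders.
SplitMipMaps splitSharedHeader(const std::vector<Bytes>& streams) {
    SplitMipMaps result;
    if (streams.empty()) {
        return result;
    }
    const Bytes& first = streams.front();
    std::size_t shared = first.size();
    for (const Bytes& stream : streams) {
        const std::size_t limit = std::min(shared, stream.size());
        const auto firstEnd = first.begin() + static_cast<std::ptrdiff_t>(limit);
        // std::mismatch stops at the first byte where the streams differ.
        const auto diff = std::mismatch(first.begin(), firstEnd, stream.begin());
        shared = static_cast<std::size_t>(diff.first - first.begin());
    }
    const auto offset = static_cast<std::ptrdiff_t>(shared);
    result.sharedHeader.assign(first.begin(), first.begin() + offset);
    for (const Bytes& stream : streams) {
        result.mipData.emplace_back(stream.begin() + offset, stream.end());
    }
    return result;
}

// Reader side: prepend the shared header to one mipmap's data to get a
// complete stream that can be handed to a JPEG decoder.
Bytes joinSharedHeader(const Bytes& sharedHeader, const Bytes& mipData) {
    Bytes stream(sharedHeader);
    stream.insert(stream.end(), mipData.begin(), mipData.end());
    return stream;
}
```

Because mipmaps of different sizes differ in the height and width bytes of the frame header, the common prefix ends there at the latest, so the size is never part of the shared bytes.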

Dr Super Good said:
I am not too sure that is wise... Threads have a huge overhead to set up and synchronize. Especially since the final mipmap consists of 1 pixel, this seems like a huge waste of a thread. You will probably find that it takes over 1000 times longer to actually read the .blp file than it takes to decode the JPEG blocks in it. If you insist on multithreading, you should consider a pipeline approach where you decode JPEG blocks into the various final image components instead of working per mipmap, as this gives symmetrical thread loading and scales better on multi-core CPUs than one thread per mipmap level. It is still probably a huge waste of processor time (virtually no gain over a single thread in most practical cases).

I do not know much about JPEG-compressed .blp files offhand; however, I am willing to learn about them to help you.

I don't think threads do have that huge overhead. You're right about the issue with really small MIP maps, where it's really senseless to use threads, but in my opinion it's hard to split everything up into scanline parts. Maybe I am wrong, but libjpeg doesn't seem to have an API which supports splitting up scanline operations easily. You would probably have to create a decompression structure for each scanline part. As most systems have multiple cores today it's still much faster than decompressing without threads.

Dr Super Good · Aug 29, 2011

Barade said:
I don't think threads do have that huge overhead.

I believe dispatching them does, as you need to prepare a new stack for the thread and it logically involves a lot of kernel instructions (informing the thread manager that a new thread is to start running). Yes, this is a small delay, but do remember that the operations you are carrying out hardly take a long time. Additionally, there is the time taken to synchronize the threads (you cannot return texture data until all mipmaps are loaded). As mipmaps add approximately 33% more pixel area to an image (and pixel area is a linear factor in the time taken to decode JPEG), the theoretical time saving is approximately 25% using this approach. Factor in the time taken for setting up all the threads, synchronizing the result, cache misses (as you now have many threads potentially needing to read from memory) and the fact that not all threads can run at the same time (a 1024*1024 image would produce 11 threads, one per level from 1024*1024 down to 1*1), and you will be very lucky to see anywhere near that 25% saving.
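
The 33% and 25% figures can be checked with a little arithmetic. The sketch below assumes decode time is proportional to pixel count and that each level halves both dimensions down to 1*1; `mipChainPixels` and `bestCaseSaving` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Pixel count of every mipmap level, halving both dimensions (never below 1)
// until the 1*1 level is reached.
std::vector<std::uint64_t> mipChainPixels(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0);
    std::vector<std::uint64_t> pixels;
    while (true) {
        pixels.push_back(static_cast<std::uint64_t>(width) * height);
        if (width == 1 && height == 1) {
            break;
        }
        width = std::max<std::uint32_t>(1, width / 2);
        height = std::max<std::uint32_t>(1, height / 2);
    }
    return pixels;
}

// Best-case fraction of time saved by decoding every level on its own thread:
// the sequential cost is the sum of all levels, the parallel cost is bounded
// below by the largest level. Thread setup and synchronization are ignored.
double bestCaseSaving(const std::vector<std::uint64_t>& pixels) {
    assert(!pixels.empty());
    const std::uint64_t total =
        std::accumulate(pixels.begin(), pixels.end(), std::uint64_t{0});
    const std::uint64_t largest = *std::max_element(pixels.begin(), pixels.end());
    return 1.0 - static_cast<double>(largest) / static_cast<double>(total);
}
```

For a 1024*1024 texture the chain has 11 levels and a total of about 4/3 of the base level's pixels, so the best case is a saving of just under 25%, before any thread overhead is subtracted.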

Barade said:
As most systems have multiple cores today it's still much faster than decompressing without threads.

Have you actually timed this? The theoretical maximum time saving is about 25% (it can vary due to JPEG being coded in blocks of at least 8*8 pixels).
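
Timing this does not need much code. A minimal sketch of a measurement helper, assuming the sequential and threaded decoders can each be wrapped in a callable; `bestOfRunsMs` is a hypothetical name:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>

// Runs `work` `runs` times and returns the fastest wall-clock time in
// milliseconds. Taking the minimum reduces noise from other processes.
double bestOfRunsMs(const std::function<void()>& work, std::size_t runs) {
    assert(runs > 0);
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

Calling it once with the single-threaded decode and once with the one-thread-per-mipmap decode, on the same in-memory file, gives two numbers that can be compared against the theoretical 25%.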

Ultimately, the time taken to read the file will be a lot longer than the time taken to actually decode it to a texture. As such, even if you decode with a single thread you could easily hide the decode time by multithreading the loading and decoding process (pipeline): one thread loads the files into memory as fast as possible while the other thread just decodes the loaded files. It might not be the quickest for a single texture, nor the most memory efficient (as you buffer whole files), but it is by far the most time-efficient way of loading a large number of files. If for some reason the files load at a comparatively fast rate relative to the processor (e.g. from a RAM drive or some "magical" future memory type), you could always create multiple decoding threads to decode multiple files at the same time (true scalability).
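
The loader/decoder pipeline described above can be sketched with one queue between two threads. This is an illustration under stated assumptions, not a tuned implementation: `FileQueue` and `runPipeline` are hypothetical names, the `load` and `decode` callables stand in for real file I/O and JPEG decoding, and error handling is omitted. It needs C++17.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Thread-safe queue of whole files that have been read into memory.
class FileQueue {
public:
    void push(Bytes file) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            files_.push(std::move(file));
        }
        ready_.notify_one();
    }
    // Signals that no more files will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until a file is available; returns std::nullopt once the queue
    // has been closed and drained.
    std::optional<Bytes> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !files_.empty() || closed_; });
        if (files_.empty()) {
            return std::nullopt;
        }
        Bytes file = std::move(files_.front());
        files_.pop();
        return file;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Bytes> files_;
    bool closed_ = false;
};

// One thread calls load(0..count-1) while the calling thread decodes each
// buffer as soon as it arrives. Returns the number of decoded files.
std::size_t runPipeline(std::size_t count,
                        const std::function<Bytes(std::size_t)>& load,
                        const std::function<void(const Bytes&)>& decode) {
    FileQueue queue;
    std::thread loader([&] {
        for (std::size_t i = 0; i < count; ++i) {
            queue.push(load(i));
        }
        queue.close();
    });
    std::size_t decoded = 0;
    while (std::optional<Bytes> file = queue.pop()) {
        decode(*file);
        ++decoded;
    }
    loader.join();
    return decoded;
}
```

Adding more decoding threads, as suggested for fast storage, would only require several consumers calling `pop()` on the same queue.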

Multi-threading is tricky. It almost never works well if you think small.

Barade · Aug 29, 2011

Dr Super Good said:
I believe dispatching them does, as you need to prepare a new stack for the thread and it logically involves a lot of kernel instructions (informing the thread manager that a new thread is to start running). Yes, this is a small delay, but do remember that the operations you are carrying out hardly take a long time. Additionally, there is the time taken to synchronize the threads (you cannot return texture data until all mipmaps are loaded).

Why not? I don't do this, because I store all MIP map data in C++ map structures for easier access, since my program aims to provide an easy API for all formats. But theoretically, why shouldn't you be able to access the texture data of one MIP map as soon as its decompression has finished? Of course you will need another thread, but if you don't use threads at all you will have to wait for all MIP maps to be decompressed, too (by the way, my library supports limited MIP map reading as well: you could read only one MIP map if you want to).
I haven't run any tests yet, but most libraries try to reduce thread creation overhead as much as possible, so I still don't think that the cost of creating threads is that high compared to the operations themselves.
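
Accessing one MIP map as soon as it is ready, while not spending a thread on tiny levels, can be sketched with one `std::future` per level. Everything here is an assumption for illustration: `decodeMipMaps` is a hypothetical name, the `decode` callable stands in for the real JPEG decoder, and the compressed size is used as a rough proxy for decode cost.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using Decoder = std::function<Bytes(const Bytes&)>;

// Starts decoding every mipmap and returns one future per level. Levels whose
// compressed size is at least `asyncThreshold` bytes run asynchronously; the
// smaller ones are deferred and decoded in the caller's thread on get().
// `compressed` must stay alive until all futures have been consumed.
std::vector<std::future<Bytes>> decodeMipMaps(const std::vector<Bytes>& compressed,
                                              const Decoder& decode,
                                              std::size_t asyncThreshold) {
    std::vector<std::future<Bytes>> results;
    results.reserve(compressed.size());
    for (const Bytes& mip : compressed) {
        const std::launch policy = mip.size() >= asyncThreshold
                                       ? std::launch::async
                                       : std::launch::deferred;
        results.push_back(std::async(policy, decode, std::cref(mip)));
    }
    return results;
}
```

A caller that only needs the largest level can call `get()` on the first future and use the pixels while the remaining levels are still being decoded.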

Dr Super Good said:
As mipmaps add approximately 33% more pixel area to an image (and pixel area is a linear factor in the time taken to decode JPEG), the theoretical time saving is approximately 25% using this approach. Factor in the time taken for setting up all the threads, synchronizing the result, cache misses (as you now have many threads potentially needing to read from memory) and the fact that not all threads can run at the same time (a 1024*1024 image would produce 11 threads, one per level from 1024*1024 down to 1*1), and you will be very lucky to see anywhere near that 25% saving.

That's a point I didn't really think about. I just knew they would run concurrently and would have access to the RAM, but I don't know how RAM accesses could block each other. Usually data should be cached/separated (I think on Linux each thread actually is a custom process), otherwise many thread operations on RAM wouldn't increase performance at all.

Dr Super Good said:
Have you actually timed this? The theoretical maximum time saving is about 25% (it can vary due to JPEG being coded in blocks of at least 8*8 pixels).

No.

Dr Super Good said:
Ultimately, the time taken to read the file will be a lot longer than the time taken to actually decode it to a texture. As such, even if you decode with a single thread you could easily hide the decode time by multithreading the loading and decoding process (pipeline): one thread loads the files into memory as fast as possible while the other thread just decodes the loaded files. It might not be the quickest for a single texture, nor the most memory efficient (as you buffer whole files), but it is by far the most time-efficient way of loading a large number of files. If for some reason the files load at a comparatively fast rate relative to the processor (e.g. from a RAM drive or some "magical" future memory type), you could always create multiple decoding threads to decode multiple files at the same time (true scalability).

I don't know how fast reading from disk is. I would have to compare it, but since I load everything into buffers at once (how fast that is depends on the filesystem, hardware, etc.), it might be faster than you would expect. Otherwise you're right, and the performance gain for decoding would be really small.

Dr Super Good said:
Multi-threading is tricky. It almost never works well if you think small.

Consider that this improvement has only been added for multicore systems.

There's no real reason (except for those really small MIP maps) why my solution should slow anything down on multicore systems. It increases performance, especially for large MIP maps. Please also consider that we're talking about very theoretical stuff here. Everything depends on the implementation and the hardware. There are different operating systems and multithreading implementations out there.

Dr Super Good · Aug 29, 2011

Barade said:
Why not? I don't do this, because I store all MIP map data in C++ map structures for easier access, since my program aims to provide an easy API for all formats. But theoretically, why shouldn't you be able to access the texture data of one MIP map as soon as its decompression has finished? Of course you will need another thread, but if you don't use threads at all you will have to wait for all MIP maps to be decompressed, too (by the way, my library supports limited MIP map reading as well: you could read only one MIP map if you want to).
I haven't run any tests yet, but most libraries try to reduce thread creation overhead as much as possible, so I still don't think that the cost of creating threads is that high compared to the operations themselves.

I was thinking of using the texture for hardware-accelerated graphics (like OpenGL or DirectX, as that is mostly what you do with mipmapped textures). I do not think it is possible, or at least practical, to load incomplete mipmaps as usable textures (I think it is most efficient to transfer them all at once).

Barade said:
That's a point I didn't really think about. I just knew they would run concurrently and would have access to the RAM, but I don't know how RAM accesses could block each other. Usually data should be cached/separated (I think on Linux each thread actually is a custom process), otherwise many thread operations on RAM wouldn't increase performance at all.

Memory is very complicated. Each processor has a private cache (the fastest). You then sometimes have a mid-level cache shared between processor clusters (like 2 processors). Finally you have a generic cache, which is the slowest cache (shared between all processors). Of course this varies depending on the processor vendor and model. System RAM is extremely slow compared to cache, so if many threads need memory from it at once, things can slow down. For small images, where the file input buffer is still cached, this will make virtually no difference, but for large images, where additional memory from RAM may be needed, it could start to make noticeable differences in performance. Although generally not a performance concern, it can result in worse than expected performance. A good example of this I saw was using a linear sort, O(n), on a variable sample size: when the processor cache became full, the gradient of time taken against sample size increased noticeably (deviating from the expected O(n) line for the worse).

Well, I hope everything works out.
