Shader Magic

[)roi(] · Apr 29, 2016

cguy said:
You're not even using the right terminology. What do you mean by "optimized out" - the compiler doesn't have advance knowledge of the data that will be run so it can't optimize out anything we've discussed. Do you perhaps mean "early out" or "skip" paths not taken? This won't happen either because this is a SIMT/D architecture, which will rather execute both sides of the branch and predicate the results for short code blocks. The error function can easily be scaled up to X/255, which will either have hardware to normalize it to 0-1, or will simply require that it be multiplied by 1/255, which is much faster than running through a flattened condition tree.

The error function can't be simply scaled up as a multiple of 255; as the error isn't a factor of colour component but error distribution; this dither when applied correctly has a distinct honey comb effect on the pixels, losing floating point precision results in something quite different. As to optimisation, I implied the results are calculated i.e. the float division is not executed for every pixel, as this is unrelated to the pixel values, but in the case of the texture it would have to be re its storage / retrieval.

As to the branch, sure it has a cost (that I don't doubt), but to understand how much, it would probably be far simpler to compare the frame rendering times for the current version of the kernel versus an alternative that avoids branch by using textures. Where I'm not so convinced is that dependant texture reads from the cache are also costly and whether this cost is as you imply marginal in comparison with the branches.

True error diffusion dithering is anyway impossible to implement on the GPU as all the good ones require that the calculated error be proportionately diffused across multiple neighboring pixels at the same time (carry forward).

The GPU compatible dithering formulation I've used is an adaptation of ordered dithering where an error is applied to the current pixel based on its value and the offset of its neighbouring pixels -- this is determined in my code with it's calculated modulus of the chosen dither matrix size. It's this linkage to the neighbouring pixels that I believe invalidates the advantage that could be derived from the use of textures and the cache. If I'm correct this is going to completely nullify any ability to optimise the texture cache reads. Now whether that is worse, the same or better than the ugly duckling is anyone's guess. Performance profiling would be needed.

Anyway I'm in two minds about whether its really worth the extra effort, yet I do agree that the nearest neighbour colour downsampling for the reduced palettes would probably benefit from using a texture as opposed to all the ifs.

Alway thanks for the thought provoking debate.

cguy · Apr 30, 2016

[)roi(];17520496 said:
The error function can't be simply scaled up as a multiple of 255; as the error isn't a factor of colour component but error distribution; this dither when applied correctly has a distinct honey comb effect on the pixels, losing floating point precision results in something quite different. As to optimisation, I implied the results are calculated i.e. the float division is not executed for every pixel, as this is unrelated to the pixel values, but in the case of the texture it would have to be re its storage / retrieval.

As to the branch, sure it has a cost (that I don't doubt), but to understand how much, it would probably be far simpler to compare the frame rendering times for the current version of the kernel versus an alternative that avoids branch by using textures. Where I'm not so convinced is that dependant texture reads from the cache are also costly and whether this cost is as you imply marginal in comparison with the branches.

True error diffusion dithering is anyway impossible to implement on the GPU as all the good ones require that the calculated error be proportionately diffused across multiple neighboring pixels at the same time (carry forward).

The GPU compatible dithering formulation I've used is an adaptation of ordered dithering where an error is applied to the current pixel based on its value and the offset of its neighbouring pixels -- this is determined in my code with it's calculated modulus of the chosen dither matrix size. It's this linkage to the neighbouring pixels that I believe invalidates the advantage that could be derived from the use of textures and the cache. If I'm correct this is going to completely nullify any ability to optimise the texture cache reads. Now whether that is worse, the same or better than the ugly duckling is anyone's guess. Performance profiling would be needed.

Anyway I'm in two minds about whether its really worth the extra effort, yet I do agree that the nearest neighbour colour downsampling for the reduced palettes would probably benefit from using a texture as opposed to all the ifs.

Alway thanks for the thought provoking debate.

You are very confused about this. I strongly suggest spending some time reading about GPU architecture. There are several papers available online, and the book RealTime Rendering has some good chapters on it. I also think that learning some basic CUDA, and the detailed architectural documentation that goes along with it would be hugely beneficial for you, since most of the issues are highly relevant to graphics shaders too. It is NVIDIA specific, but many of the same principles apply universally - even to mobile processors by other vendors.

Diffusing error over multiple pixels isn't impossible, it's just more expensive - either you loop over your source image kernel ("filter support", not "code kernel"), or you do multiple passes and accumulate the diffusion. I'm not sure why you think dependent texture reads within a kernel (do you even understand the issue when there is cross kernel dependency?) are an issue. I also don't understand the concern with the dither kernel size? This value is constant per kernel launch, right? It could be an issue if it was variable within the launch, but otherwise it is not.

[)roi(] · Apr 30, 2016

cguy said:
You are very confused about this. I strongly suggest spending some time reading about GPU architecture. There are several papers available online, and the book RealTime Rendering has some good chapters on it. I also think that learning some basic CUDA, and the detailed architectural documentation that goes along with it would be hugely beneficial for you, since most of the issues are highly relevant to graphics shaders too. It is NVIDIA specific, but many of the same principles apply universally - even to mobile processors by other vendors.

Diffusing error over multiple pixels isn't impossible, it's just more expensive - either you loop over your source image kernel ("filter support", not "code kernel"), or you do multiple passes and accumulate the diffusion. I'm not sure why you think dependent texture reads within a kernel (do you even understand the issue when there is cross kernel dependency?) are an issue. I also don't understand the concern with the dither kernel size? This value is constant per kernel launch, right? It could be an issue if it was variable within the launch, but otherwise it is not.

For somebody who says I'm confused (sorry, but you clearly appear to be); I think the problem is that you are not reading what I wrote or you didn't bother to look at the error diffusion algorithms that I linked, for example:

True error diffusion dithering is anyway impossible to implement on the GPU as all the good ones require that the calculated error be proportionately diffused across multiple neighboring pixels at the same time (carry forward).

+ I specifically linked to the algorithms that I was referring to: Floyd Steinberg, Atkins, etc...
Clearly I also stated that these algorithms cannot be implemented (this is not my singular view it's one supported by research papers), so please do yourself a favour and read a little before you jump to conclusions that are not based on fact.

Here for example is a quote from a research paper from the "Centre for Visual Information Technology International Institute of Information Technology" in India, wherein the 1st paragraph clearly confirms what I said:

Many image filtering operations provide ample parallelism, but progressive non-linear processing of images is among the hardest to parallelize due to long, sequential, and non-linear data dependency. A typical example of such an operation is error diffusion dithering, exemplified by the Floyd-Steinberg algorithm. In this paper, we present its parallelization on multicore CPUs using a block-based approach and on the GPU using a pixel based approach.

How can I make this more clear than that. Nobody has successfully implemented the original error diffusion algorithms on the GPU; what however has been done, as is for example covered in this research paper is the implementation of other GPU "friendly" algorithms to try to best approximate the results of true error diffusion dithering; however let's be clear these are formulations to approximate result and are not implementations of the original formulations as I provided in the link.
What can however be done re the original algorithms is to multi-thread them, however this is vastly different from what happens on the GPU.

Your assumptions are a bit on the irritating side i.e. you're assuming I haven't spent a lot of time researching this subject, or attempting various approaches to rewrite the algorithms to make them more GPU friendly; but as you should see in all the papers on the subject; it's possible create GPU friendly algorithms however their quality is considerably poor in comparison with the original algorithms. This brings me back to our original back and forth debate.

Diffusion as its stated is among the hardest to parallelise, so whilst you may hate my ugly duckling it is an accurate implementation of the bayer dither algorithm (which btw is not error diffusion, but rather a hybridization algorithm to try to approximate the result of the original algorithms; but as I also stated this one in particular is notable for its distinct honey comb pixel diffusion, i.e. its vastly substandard visually in comparison with the original error diffusion ones, a simple example is that these algorithms tend to apply diffusion to lighter areas i.e. whites become darker)

Bayer Example: (whites have been darkened, but it runs on the GPU at ~60 FPS)

Floyd Sternberg Example (takes ~600ms in a single thread on the CPU to complete):

FYI all the attempts I've made to implement the real algorithms on either the GPU or DSP or combination of both have always resulted in significantly slower processing times versus the original CPU coding.

Here's another paper on this, which should clearly confirm the approach is to find new GPU friendly algorithms of a quality near to the originals

Parallelizing error diffusion is a difficult task and usually comes with a compromise between performance and image quality. Metaxas4 has attempted to parallelize the Floyd-Steinberg algorithm along the image diagonal in a wavefront fashion. As the amount of parallelism of this method is limited by the number of diagonal pixels, it is not well-suited for the massively-parallel GPU.

Here's another one. This last paper I linked makes use of CUDA in conjunction with the GLSL on the GPU; which is pretty much similar to what Apple offers ito its Metal framework i.e. the ability to run far more complex CPU type operations on the GPU; but as I said I've avoided that as it's proprietary, which you should know also applies to CUDA i.e. its proprietary to NVIDIA cards. The open standard as you probably also know is OpenCL, but support of this is quite poor, hence at this stage GLSL is still best bet, but it comes with a compromise.

cguy · Apr 30, 2016

[)roi(];17521388 said:
For somebody who says I'm confused (sorry, but you clearly appear to be); I think the problem is that you are not reading what I wrote or you didn't bother to look at the error diffusion algorithms that I linked, for example:
+ I specifically linked to the algorithms that I was referring to: Floyd Steinberg, Atkins, etc...
Clearly I also stated that these algorithms cannot be implemented (this is not my singular view it's one supported by research papers), so please do yourself a favour and read a little before you jump to conclusions that are not based on fact.

Yes, you did clearly state that the algorithm cannot be implemented, which is also clearly bollocks - a GPU is Turing complete. It is hard to extract parallelism to do it efficiently, but even true diffusion isn't impossible to implement. Also, my statement about multi-pass methods said nothing about "true" error diffusion (which is still possible using that technique, just harder to make efficient), however, there is a quality/correctness-efficiency spectrum, and it is possible to do non-localized (although not fully global) diffusion easily and efficiently.

[)roi(];17521388 said:
Here for example is a quote from a research paper from the "Centre for Visual Information Technology International Institute of Information Technology" in India, wherein the 1st paragraph clearly confirms what I said:
How can I make this more clear than that. Nobody has successfully implemented the original error diffusion algorithms on the GPU;

This is what you quoted:

Many image filtering operations provide ample parallelism, but progressive non-linear processing of images is among the hardest to parallelize due to long, sequential, and non-linear data dependency. A typical example of such an operation is error diffusion dithering, exemplified by the Floyd-Steinberg algorithm. In this paper, we present its parallelization on multicore CPUs using a block-based approach and on the GPU using a pixel based approach.

I don't know how to make it more clear than that either.

[)roi(];17521388 said:
Your assumptions are a bit on the irritating side i.e. you're assuming I haven't spent a lot of time researching this subject, or attempting various approaches to rewrite the algorithms to make them more GPU friendly;

When someone says something as inane as:

[)roi(];17520272 said:
Finally you assume all GPUs are SIMT, some are SIMD.

Or (quite astoundingly) quotes things that contradict their assertions, it means that they haven't taken a disciplined approach to solving the problem. I am sure that you have spent a lot of time on it, but throwing yourself against a wall again and again isn't the best way to make progress.

[)roi(] · Apr 30, 2016

cguy said:
Yes, you did clearly state that the algorithm cannot be implemented, which is also clearly bollocks - a GPU is Turing complete. It is hard to extract parallelism to do it efficiently, but even true diffusion isn't impossible to implement. Also, my statement about multi-pass methods said nothing about "true" error diffusion (which is still possible using that technique, just harder to make efficient), however, there is a quality/correctness-efficiency spectrum, and it is possible to do non-localized (although not fully global) diffusion easily and efficiently.

This is what you quoted:

I don't know how to make it more clear than that either.

When someone says something as inane as:

Or (quite astoundingly) quotes things that contradict their assertions, it means that they haven't taken a disciplined approach to solving the problem. I am sure that you have spent a lot of time on it, but throwing yourself against a wall again and again isn't the best way to make progress.

Hahaha... You're a card... It's easy to stand on a fence and piss on somebody's crop. You selectively choose bits that supports your view (maybe you should read the papers). You clearly were challenging my insistence around true error diffusion, and the research supported that, but now when pushed into a corner you again throw out some weak ass insults. Hey you pissed on my crop; good for you.

Problem is you assume SIMD and SIMT are the same; they're not and there are specific optimisations that NVIDIA's has implemented with SIMT which is not available with SIMD i.e. they there e.g. to improve branch predication and avoid divergence, but doesn't matter in this case. Why? because as I already tried to explain, your suggestion that this would be faster with textures is flawed (except colour downsampling which I already agreed could benefit). Why? because of the texture reads are dependant on something that is not known i.e. it needs to be calculated 1st and the texture offset will vary depending on the calculated pixel modulus offset and the result a error calculation based on that modulus, which I already explained had to calculated (decimal precision is important).

So? simple it negates the advantage of that approach.
If you think I'm wrong then, how about you stop pissing from the fence and actually demonstrate with a working solution to prove that I am wrong (oh, and please conform to the limitations of CoreImage), because the research papers don't support your notion of this, there is a big penalty.

cguy · Apr 30, 2016

[)roi(];17522878 said:
Hahaha... You're a card... It's easy to stand on a fence and piss on somebody's crop. You selectively choose bits that supports your view (maybe you should read the papers). You clearly were challenging my insistence around true error diffusion, and the research supported that, but now when pushed into a corner you again throw out some weak ass insults. Hey you pissed on my crop; good for you.

I think that you should take a cold shower and reread your posts. If you think the above matches reality, you are a lost cause.

[)roi(];17522878 said:
Problem is you assume SIMD and SIMT are the same; they're not and there are specific optimisations that NVIDIA's has implemented with SIMT which is not available with SIMD i.e. they there e.g. to improve branch predication and avoid divergence, but doesn't matter in this case. Why? because as I already tried to explain, your suggestion that this would be faster with textures is flawed (except colour downsampling which I already agreed could benefit). Why? because of the texture reads are dependant on something that is not known i.e. it needs to be calculated 1st and the texture offset will vary depending on the calculated pixel modulus offset and the result a error calculation based on that modulus, which I already explained had to calculated (decimal precision is important).

This is why you need to spend more time reading and less time writing. You don't know what you are talking about. You say that you "tried to explain", but you haven't - you just make a bunch of random, patently incorrect statements. SIMT doesn't improve branch predication (do you even know what this is?). SIMT is a just a different ISA programming model, with all the divergence limitations of a modern SIMD architecture - which is why SIMT vs SIMD is irrelevant, hence, why bringing up the difference between the two just shows a poor understanding of the fundamentals of GPU programming.

You also keep bringing up dependent texture reads. So what? There is no reason to think that they'll be slow in this context. You just think that because you don't understand how any of this works - if you could, you could tell me why they're bad.

[)roi(];17522878 said:
So? simple it negates the advantage of that approach.
If you think I'm wrong then, how about you stop pissing from the fence and actually demonstrate with a working solution to prove that I am wrong (oh, and please conform to the limitations of CoreImage), because the research papers don't support your notion of this, there is a big penalty.

Do your own work. I give advice based on extensive knowledge in this area, and rather than trying to learn something, you just defend your crap, sorry: crop, because you have more ego than sense.

[)roi(] · Apr 30, 2016

cguy said:
I think that you should take a cold shower and reread your posts. If you think the above matches reality, you are a lost cause.

This is why you need to spend more time reading and less time writing. You don't know what you are talking about. You say that you "tried to explain", but you haven't - you just make a bunch of random, patently incorrect statements. SIMT doesn't improve branch predication (do you even know what this is?). SIMT is a just a different ISA programming model, with all the divergence limitations of a modern SIMD architecture - which is why SIMT vs SIMD is irrelevant, hence, why bringing up the difference between the two just shows a poor understanding of the fundamentals of GPU programming.

You also keep bringing up dependent texture reads. So what? There is no reason to think that they'll be slow in this context. You just think that because you don't understand how any of this works - if you could, you could tell me why they're bad.

Do your own work. I give advice based on extensive knowledge in this area, and rather than trying to learn something, you just defend your crap, sorry: crop, because you have more ego than sense.

Lol...
As always you're assuming I don't read and that I haven't read papers that back up my statements, even though they're not what you want it to be, it is part of a research paper, and I'll let you challenge the validity of that.

In addition, we make the observation that branch prediction on a SIMT architecture can actually perform better then branch prediction on a MIMD architecure like a modern multicore CPU.
Nicholas Arnold Department of Electrical Engineering and Computer Science
University of Central Florida Orlando, Florida 32826
Email: [email][email protected][/email]

We have demonstrated that the data-parallel nature of the SIMT model can be successfully used to dramatically increase prediction accuracy by sharing prediction structures accross multiple gpus. We have shown, through experiment, that even a simple branch predictor such as a Smith bimodal predictor benefits greatly from information given to it by peer contexts sharing dynamic branch information. We have shown several possible configurations of this branch predictor and demonstrated the performance gains.

Back to SIMT I used one example where it's different, that was the point; you implied SIMT is exactly the same as SIMD (i.e. no difference), but SIMT is different, it is SIMD + Multithreading + some a few other features, one arguably being it's ability to do better with branch prediction.

...but apparently all this research is BS according to you. Now I'm left wondering who's ego is the problem; you were the one who turned this personal, and you know what they say about that.

On showing a working example: well I guess the avoidance speaks volumes.

cguy · Apr 30, 2016

[)roi(];17524074 said:
Lol...
As always you're assuming I don't read and that I haven't read papers that back up my statements, even though they're not what you want it to be, it is part of a research paper, and I'll let you challenge the validity of that.

Back to SIMT I used one example where it's different, that was the point; you implied SIMT is exactly the same as SIMD (i.e. no difference), but SIMT is different, it is SIMD + Multithreading + some a few other features, one arguably being it's ability to do better with branch prediction.

...but apparently all this research is BS according to you. Now I'm left wondering who's ego is the problem; you were the one who turned this personal, and you know what they say about that.

On showing a working example: well I guess the avoidance speaks volumes.

That's branch prediction, Sunshine. We were talking about predication. These are two totally different things - the one you mention (prediction) being irrelevant because your code will never do any true branches.

Also AFAIK, no GPU in existence does dynamic branch prediction, as we have on the CPU. I doubt it would ever be worth the area.

EDIT: Now that I've read the paper: apart from the massive prediction/predication facepalm above, the paper you linked to says nothing about SIMT vs SIMD, it is comparing SIMT to MIMD, and SIMD has the same advantages over MIMD for branch prediction - that you think MIMD is the same as SIMD is really frightening. The only time SIMT is mentioned in the context of SIMD is:

we must have an appropriate simulator modeling SIMD/SIMT architecture.

Where they are so coupled, because from a performance perspective, they're the same goddam thing, and are treated as such for the rest of the paper.

[)roi(] · May 1, 2016

On all matters of a silly nature
SIMT == SIMD == SMT

CISC == RISC == MIPS

ORANGE == PEACH == BANANA

oh bugger it, let's just call it all a THING. :wtf:

Endless back & forward BS when the debate started on an approach that contradicted all existing research, but f..k the research, "we" know better than everyone else.

Back to reality...

SIMD: elements of short vectors are processed in parallel.
SMT: instructions of several threads are run in parallel.
SIMT: somewhere in between; a hybrid between vector processing and hardware threading.

Oh give it up already... f..ck it, call it a THING, it's certainly easier to remember.

cguy · May 1, 2016

[)roi(];17524398 said:
On all matters of a silly nature
SIMT == SIMD == SMT
CISC == RISC == MIPS
ORANGE == PEACH == BANANA
oh bugger it, let's just call it all a THING. :wtf:

Endless back & forward BS when the debate started on an approach that contradicted all existing research, but f..k the research, "we" know better than everyone else.

Back to reality...

SIMD: elements of short vectors are processed in parallel.

SMT: instructions of several threads are run in parallel.

SIMT: somewhere in between; a hybrid between vector processing and hardware threading.

Oh give it up already... f..ck it, call it a THING, it's certainly easier to remember.

SIMT: Looks like SMT from an instruction-set perspective. Behaves like SIMD.

Join the MyBroadband community

Get started

Shader Magic

[)roi(]

Executive Member

cguy

Executive Member

[)roi(]

Executive Member

cguy

Executive Member

[)roi(]

Executive Member

cguy

Executive Member

[)roi(]

Executive Member

cguy

Executive Member

[)roi(]

Executive Member

cguy

Executive Member