Multi-Queue Support #339

Draft

kvark wants to merge 4 commits into main from multi-queue

Conversation

@kvark (Owner) commented Apr 14, 2026

Closes #329

}

-#[derive(Clone, Debug)]
+#[derive(Clone, Debug, Default)]
Contributor

SyncPoint being default helps a lot with the ergonomics, nice!

@EriKWDev (Contributor) left a comment

Nice! I want to put this into the engine and see what I can do. I might need to have several encoders per frame now, which is fun!

Thanks for implementing it so quickly :)

The ping-pong raytrace fits perfectly into this API.

The other use case I have is that one initial encoder enqueues some work, then we fire off the main render encoder task and some async compute encoder tasks that depend on that initial work being done. With this API, the async queue submission and the main queue submission can both take in the initial encoder's sync point, and the initial encoder's submit can take in the previous async compute and main render sync points. I think this works too!
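
A minimal sketch of that dependency graph, assuming a submit shape like submit(&mut encoder, after: &[SyncPoint]) from this PR; all names (including prev_main_sp and prev_comp_sp, the previous frame's sync points) are illustrative, not the final API:

    // The initial encoder waits on the previous frame's main and async results.
    let init_sp = context.submit(&mut init_encoder, &[prev_main_sp, prev_comp_sp]);
    // Main render and async compute both wait on the initial work,
    // then run concurrently on their own queues.
    let main_sp = context.submit(&mut main_encoder, &[init_sp.clone()]);
    let comp_sp = context.submit(&mut compute_encoder, &[init_sp]);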

Comment thread blade-graphics/src/vulkan/init.rs
@@ -55,14 +54,14 @@ impl FramePacer {
}

pub fn end_frame(&mut self, context: &blade_graphics::Context) -> &blade_graphics::SyncPoint {
Contributor

Probably not useful for you, but we actually use a version of this frame pacer and the temporary-resources concept in our engine too :) Though I added support for all kinds of resources, as well as more than one frame in flight, so I will need to add the after: &[SyncPoint] here, or somehow integrate it into the pacer.

Comment thread blade-graphics/src/lib.rs
/// Enable multi-queue support (async compute and transfer).
/// When enabled, every `submit` call must provide explicit
/// synchronization via a non-empty list of sync points.
pub multi_queue: bool,
@EriKWDev (Contributor) commented Apr 14, 2026

Now we can request multi_queue, but it is difficult to inspect the context we get back to determine whether the internal async compute queue is truly a unique queue or just the same as the main one.

In the game, if we have true async compute, we will want to do our resource "ping-ponging"; but if we don't, it would be nice to only allocate one set of probe data resources and render from and sample the same set all the time.

So maybe it would be nice to be able to query, after Context creation, whether they are really all the same queue under the hood, e.g. Context::get_selected_queue_id(&self, kind: QueueType) -> u32, which would let us know if they are just the same.

Or, Context::enumerate could report more details about the queues in the DeviceInfo as well.

But maybe I am worrying about something that isn't really an issue, and this is too niche to expose. It's not the end of the world if we have some unnecessary probe data.
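
A sketch of how such a query could drive the allocation decision; get_selected_queue_id and QueueType are only the names proposed above, not an existing blade API:

    // Hypothetical: check whether async compute got a distinct hardware queue.
    let has_async_queue = context.get_selected_queue_id(QueueType::AsyncCompute)
        != context.get_selected_queue_id(QueueType::Main);
    // Ping-pong two probe-data sets only when the overlap can actually happen.
    let probe_set_count = if has_async_queue { 2 } else { 1 };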

@kvark (Owner, Author)
I wanted to add this to Capabilities at some point but then it slipped. I'll add it.

@EriKWDev (Contributor)

The particle example currently submits the compute on the async queue and the render on the main queue correctly, but due to the data dependency on particle_buf between compute and render, they are still executed sequentially.

[screenshot]

So as an exercise, I will try to rewrite the particle example to have two separate buffers of particle data, one being worked on by the compute task and one being sampled and rendered from, just to see if I can get the kind of parallelism I want; see the sketch below.
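
For instance, a hypothetical ping-pong indexing scheme (these names are mine, not from the example):

    // Hypothetical double buffering: each frame the async compute queue
    // writes one particle buffer while the main queue renders from the other.
    let compute_idx = frame_index % 2; // written by async compute this frame
    let draw_idx = 1 - compute_idx; // sampled by the render pass this frame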

@EriKWDev (Contributor) commented Apr 14, 2026

[screenshot]

Got double buffering working, but maybe I am naive in thinking this should be able to do better?

[screenshot from 2026-04-14 12:30:07, annotated]

Do you think the barrier at the end of the compute task might be preventing something? (It shouldn't, since it should be intra-queue only?) Or is this just the reality of the hardware?

@EriKWDev (Contributor) commented Apr 14, 2026

I've been working on double buffering the particles in the particle example.

Hmm, no matter what I do, it seems to take roughly the same time in the end. The one outlier I got was when the driver suddenly decided to schedule the async compute task at the very end of the vertex shader, overlapping both vertex and fragment work.

I split up the compute workload into chunks and submitted it after the main submission and suddenly this happened and was much faster:

[screenshot]

Other than that one exception, I get roughly the same perf from async compute as I get from removing some of the internal barriers between passes on a single queue. As an unsafe experiment, I removed most of the barriers inside blade and submitted everything on the main queue, and got this:

[screenshot]

I also tried various permutations of splitting the render task into multiple draws over chunks of particles, and splitting the compute into various chunks of work, but nothing seems to move the needle greatly.

[screenshots]

So it is a bit unclear from my experiments how to get the async compute work to land in an optimal location.

[screenshot: madness]

@kvark (Owner, Author) commented Apr 15, 2026

I just wanted to demonstrate in one of the examples how to specify dependencies. But you are totally right to expect the actual parallelism in the particle example. I'll look into it more.

@kvark marked this pull request as draft on April 15, 2026 at 15:48.
@kvark (Owner, Author) commented Apr 16, 2026

@EriKWDev please take another look.
I've extended the particle example with an option to toggle parallel execution on/off.
[recording: Particle-parallel]
When cranking up the spawn rate, I see a strong difference between parallel ON (about 4.8 ms) and OFF (about 5.7 ms) on my AMD Framework 13. Radeon GPU Profiler should agree with that.

@EriKWDev (Contributor) commented Apr 16, 2026

Nice to get access to which queues are available!

I'm afraid we did the same thing with the example xD But nice, now the example contains the parallelism.

The problem I tried to point out with the measurements was that my gains were very inconsistent frame to frame, and the best performance came when the compute task overlapped with the vertex shading in this particle example.

But with this API I don't have the granularity to specify that it should specifically run alongside the vertex work and not the fragment work.

In the game, however, multiple encoders should be enough. We probably want the raytracing compute to always overlap with the shadow-map generation and reflection pass, and I think I can achieve that with multiple encoders and the correct after: &[SyncPoint] sync points.

My experiment also seemed to suggest that removing the barriers internal to blade had a similar effect to async compute for this workload, but having manual barriers is perhaps undesirable. Though I could see a way for blade to expose it, with encoder.set_use_auto_barriers(bool) and then a manual encoder.barrier(&self, src, dst); see the sketch below. But if we ultimately need that for the game, I don't know if it needs to be included in blade proper.
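
A sketch of that opt-out; both methods and the stage arguments are hypothetical proposals from this comment, not existing blade APIs:

    // Hypothetical: disable blade's automatic inter-pass barriers for this encoder,
    encoder.set_use_auto_barriers(false);
    // and issue one manually only where a real hazard exists (stage names invented).
    encoder.barrier(Stage::ComputeStorageWrite, Stage::VertexShaderRead);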

Also, for correctness of the simulation, I believe there has to be a buffer copy as well, so that each compute task works on the resulting data of the most recent one:

    // Scoping the transfer pass in a block ends its borrow of the encoder.
    {
        let mut transfer = self.compute_encoder.transfer("copy");
        let prev_compute_idx = draw_idx; // only one in flight atm

        let a = &self.particle_systems[prev_compute_idx];
        let b = &self.particle_systems[compute_idx];
        // NOTE: an indirect dispatch, or at least a compute pass, could copy only the alive particles
        transfer.copy_buffer_to_buffer(
            a.particle_buf.at(0),
            b.particle_buf.at(0),
            a.particle_buf.size(),
        );
        transfer.copy_buffer_to_buffer(
            a.free_list_buf.at(0),
            b.free_list_buf.at(0),
            a.free_list_buf.size(),
        );
    }

@kvark (Owner, Author) commented Apr 16, 2026

Yeah, I'm not sure how I feel about skipping the barriers between passes. Something like this should be very explicit about what you are expecting to run at the same time. I'll think about it.

@EriKWDev (Contributor) commented Apr 16, 2026

> Yeah, I'm not sure how I feel about skipping the barriers between passes. Something like this should be very explicit about what you are expecting to run at the same time. I'll think about it.

Yeah, the barriers were just an experiment on my part and not really related, though I did open #343, as we have a perhaps much more motivating example for custom barriers, or just a single small pass without an automatic barrier.

@kvark (Owner, Author) commented Apr 17, 2026

Interesting. So in this case multi-queue would not help you. I'm actually surprised you are seeing this much cost for the barriers.


Development

Successfully merging this pull request may close these issues:

Support for multiple submission queues? (async compute)
