Multi-Queue Support #339

Draft

kvark wants to merge 4 commits into main from multi-queue

Conversation

@kvark (Owner) commented Apr 14, 2026

Closes #329

}

-#[derive(Clone, Debug)]
+#[derive(Clone, Debug, Default)]
Contributor

SyncPoint being default helps a lot with the ergonomics, nice!

@EriKWDev (Contributor) left a comment

Nice! I want to put this into the engine and see what I can do. I might need to have several encoders per frame now, which is fun!

Thanks for implementing it so quickly :)

The ping-pong raytrace fits perfectly into this API.

The other use case I have is that one initial encoder enqueues some work, then we fire off the main render encoder task and some async compute encoder tasks that depend on that initial work being done. With this API, the async queue submission and the main queue submission can both take in the initial encoder's sync point, and the initial encoder's submit can take in the previous async compute and main render sync points. I think this works too!
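
A minimal sketch of that dependency graph, assuming a submit shape like submit(&mut encoder, after: &[SyncPoint]) from this PR; all names (including prev_main_sp and prev_comp_sp, the previous frame's sync points) are illustrative, not the final API:

    // The initial encoder waits on the previous frame's main and async results.
    let init_sp = context.submit(&mut init_encoder, &[prev_main_sp, prev_comp_sp]);
    // Main render and async compute both wait on the initial work,
    // then run concurrently on their own queues.
    let main_sp = context.submit(&mut main_encoder, &[init_sp.clone()]);
    let comp_sp = context.submit(&mut compute_encoder, &[init_sp]);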

Comment thread blade-graphics/src/vulkan/init.rs
@@ -55,14 +54,14 @@ impl FramePacer {
}

pub fn end_frame(&mut self, context: &blade_graphics::Context) -> &blade_graphics::SyncPoint {
Contributor

Probably not useful for you, but we actually use a version of this frame pacer and the temporary-resources concept in our engine too :) Though I added support for all kinds of resources, as well as more than one frame in flight, so I will need to add the after: &[SyncPoint] here, or somehow integrate it into the pacer.

Comment thread blade-graphics/src/lib.rs
/// Enable multi-queue support (async compute and transfer).
/// When enabled, every `submit` call must provide explicit
/// synchronization via a non-empty list of sync points.
pub multi_queue: bool,
@EriKWDev (Contributor) commented Apr 14, 2026

Now we can request multi_queue, but it is difficult to inspect the context we get back to determine whether the internal async compute queue is truly a unique queue or just the same as the main one.

In the game, if we have true async compute, we will want to do our resource "ping-ponging"; but if we don't, it would be nice to only allocate one set of probe data resources and render from and sample the same set all the time.

So maybe it would be nice to be able to query, after Context creation, whether they are really all the same queue under the hood, e.g. Context::get_selected_queue_id(&self, kind: QueueType) -> u32, which would let us know if they are just the same.

Or, Context::enumerate could report more details about the queues in the DeviceInfo as well.

But maybe I am worrying about something that isn't really an issue, and this is too niche to expose. It's not the end of the world if we have some unnecessary probe data.
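
A sketch of how such a query could drive the allocation decision; get_selected_queue_id and QueueType are only the names proposed above, not an existing blade API:

    // Hypothetical: check whether async compute got a distinct hardware queue.
    let has_async_queue = context.get_selected_queue_id(QueueType::AsyncCompute)
        != context.get_selected_queue_id(QueueType::Main);
    // Ping-pong two probe-data sets only when the overlap can actually happen.
    let probe_set_count = if has_async_queue { 2 } else { 1 };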

@kvark (Owner, Author)
I wanted to add this to Capabilities at some point but then it slipped. I'll add it.

@EriKWDev (Contributor)

The particle example currently submits the compute on the async queue and the render on the main queue correctly, but due to the data dependency on particle_buf between compute and render, they are still executed sequentially.

[screenshot]

So as an exercise, I will try to rewrite the particle example to have two separate buffers of particle data, one being worked on by the compute task and one being sampled and rendered from, just to see if I can get the kind of parallelism I want; see the sketch below.
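
For instance, a hypothetical ping-pong indexing scheme (these names are mine, not from the example):

    // Hypothetical double buffering: each frame the async compute queue
    // writes one particle buffer while the main queue renders from the other.
    let compute_idx = frame_index % 2; // written by async compute this frame
    let draw_idx = 1 - compute_idx; // sampled by the render pass this frame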

@EriKWDev (Contributor) commented Apr 14, 2026

[screenshot]

Got double buffering working, but maybe I am naive in thinking this should be able to do better?

[screenshot from 2026-04-14 12:30:07, annotated]

Do you think the barrier at the end of the compute task might be preventing something? (It shouldn't, since it should be intra-queue only?) Or is this just the reality of the hardware?

@EriKWDev (Contributor) commented Apr 14, 2026

I've been working on double buffering the particles in the particle example.

Hmm, no matter what I do, it seems to take roughly the same time in the end. The one outlier I got was when the driver suddenly decided to schedule the async compute task at the very end of the vertex shader, overlapping both vertex and fragment work.

I split up the compute workload into chunks and submitted it after the main submission and suddenly this happened and was much faster:

[screenshot]

Other than that one exception, I get roughly the same perf from async compute as I get from removing some of the internal barriers between passes on a single queue. As an unsafe experiment, I removed most of the barriers inside blade and submitted everything on the main queue, and got this:

[screenshot]

I also tried various permutations of splitting the render task into multiple draws over chunks of particles, and splitting the compute into various chunks of work, but nothing seems to move the needle greatly.

[screenshots]

So it is a bit unclear from my experiments how to get the async compute work to land in an optimal location.

[screenshot: madness]

@kvark (Owner, Author) commented Apr 15, 2026

I just wanted to demonstrate in one of the examples how to specify dependencies. But you are totally right to expect the actual parallelism in the particle example. I'll look into it more.

@kvark marked this pull request as draft on April 15, 2026 at 15:48.
@kvark (Owner, Author) commented Apr 16, 2026

@EriKWDev please take another look.
I've extended the particle example with an option to toggle parallel execution on/off.
[recording: Particle-parallel]
When cranking up the spawn rate, I see a strong difference between parallel ON (about 4.8 ms) and OFF (about 5.7 ms) on my AMD Framework 13. Radeon GPU Profiler should agree with that.

@EriKWDev (Contributor) commented Apr 16, 2026

Nice to get access to which queues are available!

I'm afraid we did the same thing with the example xD But nice, now the example contains the parallelism.

The problem I tried to point out with the measurements was that my gains were very inconsistent frame to frame, and the best performance came when the compute task overlapped with the vertex shading in this particle example.

But with this API I don't have the granularity to specify that it should specifically run alongside the vertex work and not the fragment work.

In the game, however, multiple encoders should be enough. We probably want the raytracing compute to always overlap with the shadow-map generation and reflection pass, and I think I can achieve that with multiple encoders and the correct after: &[SyncPoint] sync points.

My experiment also seemed to suggest that removing the barriers internal to blade had a similar effect to async compute for this workload, but having manual barriers is perhaps undesirable. Though I could see a way for blade to expose it, with encoder.set_use_auto_barriers(bool) and then a manual encoder.barrier(&self, src, dst); see the sketch below. But if we ultimately need that for the game, I don't know if it needs to be included in blade proper.
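
A sketch of that opt-out; both methods and the stage arguments are hypothetical proposals from this comment, not existing blade APIs:

    // Hypothetical: disable blade's automatic inter-pass barriers for this encoder,
    encoder.set_use_auto_barriers(false);
    // and issue one manually only where a real hazard exists (stage names invented).
    encoder.barrier(Stage::ComputeStorageWrite, Stage::VertexShaderRead);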

Also, for correctness of the simulation, I believe there has to be a buffer copy as well, so that each compute task works on the resulting data of the most recent one:

    // Scoping the transfer pass in a block ends its borrow of the encoder.
    {
        let mut transfer = self.compute_encoder.transfer("copy");
        let prev_compute_idx = draw_idx; // only one in flight atm

        let a = &self.particle_systems[prev_compute_idx];
        let b = &self.particle_systems[compute_idx];
        // NOTE: an indirect dispatch, or at least a compute pass, could copy only the alive particles
        transfer.copy_buffer_to_buffer(
            a.particle_buf.at(0),
            b.particle_buf.at(0),
            a.particle_buf.size(),
        );
        transfer.copy_buffer_to_buffer(
            a.free_list_buf.at(0),
            b.free_list_buf.at(0),
            a.free_list_buf.size(),
        );
    }

@kvark (Owner, Author) commented Apr 16, 2026

Yeah, I'm not sure how I feel about skipping the barriers between passes. Something like this should be very explicit about what you are expecting to run at the same time. I'll think about it.

@EriKWDev (Contributor) commented Apr 16, 2026

> Yeah, I'm not sure how I feel about skipping the barriers between passes. Something like this should be very explicit about what you are expecting to run at the same time. I'll think about it.

Yeah, the barriers were just an experiment on my part and not really related, though I did open #343, as we have a perhaps much more motivating example for custom barriers, or just a single small pass without an automatic barrier.

@kvark (Owner, Author) commented Apr 17, 2026

Interesting. So in this case multi-queue would not help you. I'm actually surprised you are seeing this much cost for the barriers.


Development

Successfully merging this pull request may close these issues:

Support for multiple submission queues? (async compute)
