Workload descriptor format
==========================

ctx.engine.duration_us.dependency.wait,...
<uint>.<str>.<uint>[-<uint>]|*.<int <= 0>[/<int <= 0>][...].<0|1>,...
B.<uint>
M.<uint>.<str>[|<str>]...
P|S|X.<uint>.<int>
d|p|s|t|q|a|T.<int>,...
b.<uint>.<str>[|<str>].<str>
f
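
Reading the first template field by field, a minimal one-step workload (a
hypothetical illustration, not one of the examples below) would be:

  1.DEFAULT.1000.0.1

That is: context 1, the DEFAULT engine, a 1000us batch duration, no dependency
(0), and wait for completion (1).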

For duration a range can be given from which a random value will be picked
before every submit. Since this and seqno management require CPU access to
objects, care needs to be taken to ensure the submit queue is deep enough
that these operations do not affect the execution speed unless that is
desired.

Additional workload steps are also supported:

 'd' - Adds a delay (in microseconds).
 'p' - Adds a delay relative to the start of the previous loop so that each
       loop starts execution with a given period.
 's' - Synchronises the pipeline to a batch relative to the step.
 't' - Throttle every n batches.
 'q' - Throttle to n max queue depth.
 'f' - Create a sync fence.
 'a' - Advance the previously created sync fence.
 'B' - Turn on context load balancing.
 'b' - Set up engine bonds.
 'M' - Set up engine map.
 'P' - Context priority.
 'S' - Context SSEU configuration.
 'T' - Terminate an infinite batch.
 'X' - Context preemption control.

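The delay and throttling steps from the list above could, for instance, be
combined like this (a hypothetical sketch, not from the original example set):

  1.RCS.1000.0.0
  d.500
  t.4
  q.8

A 1ms RCS batch is followed by a 0.5ms delay, while submission is throttled to
wait every four batches and limited to a maximum queue depth of eight.
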
Engine ids: DEFAULT, RCS, BCS, VCS, VCS1, VCS2, VECS

Example (leading spaces must not be present in the actual file):
----------------------------------------------------------------

  1.VCS1.3000.0.1
  1.RCS.500-1000.-1.0
  1.RCS.3700.0.0
  1.RCS.1000.-2.0
  1.VCS2.2300.-2.0
  1.RCS.4700.-1.0
  1.VCS2.600.-1.1
  p.16000

The above workload described in human language works like this:

  1.   A batch is sent to the VCS1 engine which will be executing for 3ms on the
       GPU and userspace will wait until it is finished before proceeding.
  2-4. Now three batches are sent to RCS with durations of 0.5-1ms (random
       duration range), 3.7ms and 1ms respectively. The first batch has a data
       dependency on the preceding VCS1 batch, and the last of the group depends
       on the first from the group.
  5.   Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
       RCS batch.
  6.   This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
       VCS2 batch.
  7.   Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
       same step the tool is told to wait for the batch to complete before
       proceeding.
  8.   Finally the tool is told to wait long enough to ensure the next iteration
       starts 16ms after the previous one has started.

When workload descriptors are provided on the command line, commas must be used
instead of new lines.

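For example, the first three steps of the example above would be given on the
command line as:

  1.VCS1.3000.0.1,1.RCS.500-1000.-1.0,1.RCS.3700.0.0
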
Multiple dependencies can be given separated by forward slashes.

Example:

  1.VCS1.3000.0.1
  1.RCS.3700.0.0
  1.VCS2.2300.-1/-2.0

In this case the last step has a data dependency on both the first and second
steps.

Batch durations can also be specified as infinite by using '*' in the duration
field. Such batches must be ended by the terminate command ('T'), otherwise
they will cause a GPU hang to be reported.

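As an illustration (a hypothetical sketch, assuming 'T' takes a relative step
offset in the same way as 's' and 'a'):

  1.RCS.*.0.0
  d.2000
  T.-2

An infinite RCS batch is started, the tool idles for 2ms, and the 'T' step then
terminates the infinite batch two steps back.
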
Sync (fd) fences
----------------

Sync fences are also supported as dependencies.

To use them put a "f<N>" token in the step dependency list. N in this case is
the same relative step offset to the dependee batch, but instead of the data
dependency an output fence will be emitted at the dependee step, and passed in
as a dependency in the current step.

Example:

  1.VCS1.3000.0.0
  1.RCS.500-1000.-1/f-1.0

In this case the second step will have both a data dependency and a sync fence
dependency on the previous step.

Example:

  1.RCS.500-1000.0.0
  1.VCS1.3000.f-1.0
  1.VCS2.3000.f-2.0

VCS1 and VCS2 batches will have a sync fence dependency on the RCS batch.

Example:

  1.RCS.500-1000.0.0
  f
  2.VCS1.3000.f-1.0
  2.VCS2.3000.f-2.0
  1.RCS.500-1000.0.1
  a.-4
  s.-4
  s.-4

VCS1 and VCS2 batches have an input sync fence dependency on the standalone
fence created at the second step. They are submitted ahead of time while still
not runnable. When the second RCS batch completes the standalone fence is
signalled, which allows the two VCS batches to be executed. Finally we wait
until both VCS batches have completed before starting the (optional) next
iteration.

Submit fences
-------------

Submit fences are a type of input fence which are signalled when the
originating batch buffer is submitted to the GPU. (In contrast to normal sync
fences, which are signalled on completion.)

Submit fences have identical syntax to the sync fences, with the lower-case
's' used to select them. Eg:

  1.RCS.500-1000.0.0
  1.VCS1.3000.s-1.0
  1.VCS2.3000.s-2.0

Here VCS1 and VCS2 batches will only be submitted for execution once the RCS
batch enters the GPU.

Context priority
----------------

  P.1.-1
  1.RCS.1000.0.0
  P.2.1
  2.BCS.1000.-2.0

Context 1 is marked as low priority (-1) and then a batch buffer is submitted
against it. Context 2 is marked as high priority (1) and then a batch buffer
is submitted against it which depends on the batch from context 1.

The context priority command is executed at workload runtime and is valid until
overridden by another (optional) priority change on the same context. Actual
driver ioctls are executed only if the priority level has changed for the
context.

Context preemption control
--------------------------

  X.1.0
  1.RCS.1000.0.0
  X.1.500
  1.RCS.1000.0.0

Context 1 is marked as having non-preemptable batches and a batch is sent
against it. The same context is then marked to have batches which can be
preempted every 500us and another batch is submitted.

As with context priority, context preemption commands are valid until
optionally overridden by another preemption control change on the same context.

Engine maps
-----------

Engine maps are a per-context feature which changes the way engine selection is
done in the driver.

Example:

  M.1.VCS1|VCS2

This sets up context 1 with an engine map containing the VCS1 and VCS2 engines.
Submissions to this context can now only reference these two engines.

Engine maps can also be defined based on an engine class, like VCS.

Example:

  M.1.VCS

This sets up the engine map to all available VCS class engines.

Context load balancing
----------------------

Context load balancing (aka virtual engine) is an i915 feature where the driver
will pick the best (most idle) engine to submit to, given the previously
configured engine map.

Example:

  B.1

This enables load balancing for context number one.

Engine bonds
------------

Engine bonds are an extension of load balanced contexts. They allow expressing
rules of engine selection between two co-operating contexts tied with submit
fences. In other words, the rule expression is telling the driver: "If you pick
this engine for context one, then you have to pick that engine for context
two".

Syntax is:

  b.<context>.<engine_list>.<master_engine>

Engine list is a list of one or more sibling engines separated by a pipe
character (eg. "VCS1|VCS2").

There can be multiple bonds tied to the same context.

Example:

  M.1.RCS|VECS
  B.1
  M.2.VCS1|VCS2
  B.2
  b.2.VCS1.RCS
  b.2.VCS2.VECS

This tells the driver that if it picked RCS for context one, it has to pick
VCS1 for context two. And if it picked VECS for context one, it has to pick
VCS2 for context two.

If we extend the above example with more workload directives:

  1.DEFAULT.1000.0.0
  2.DEFAULT.1000.s-1.0

We get a fully functional example where two batch buffers are submitted in a
load balanced fashion, telling the driver they should run simultaneously and
that valid engine pairs are either RCS + VCS1 (for the two contexts
respectively), or VECS + VCS2.

This can also be extended using sync fences to improve the chances of the first
submission not reaching the hardware after the second one. The second block
would then look like:

  f
  1.DEFAULT.1000.f-1.0
  2.DEFAULT.1000.s-1.0
  a.-3

Context SSEU configuration
--------------------------

  S.1.1
  1.RCS.1000.0.0
  S.2.-1
  2.RCS.1000.0.0

Context 1 is configured to run with one enabled slice (slice mask 1) and a
batch is submitted against it. Context 2 is configured to run with all slices
(this is the default so the command could also be omitted) and a batch is
submitted against it.

This shows the dynamic SSEU reconfiguration cost between two contexts competing
for the render engine.

A slice mask of -1 has the special meaning of "all slices". Otherwise any
integer can be specified as the slice mask, but beware that any value apart
from 1 and -1 can make the workload not portable between different GPUs.