With DIGITAL PCA, you can examine how you have split the
processing of the application between scalar and vector
processors. You can also analyze how well your application's
algorithms use the vector processor. Certain programs can run
significantly faster on computers containing scalar and vector
processors than on those containing scalar processors alone.
Programs that use repetitive array and matrix operations can
run faster on a vector processor because they are the most
constrained by scalar performance bottlenecks. Programs that
spend most of their time performing I/O operations, system
services, or using data types not supported by vector hardware
(for example, BYTE and LOGICAL) do not benefit as much from being
executed on a computer with both scalar and vector processors.
1 – Finding Vector Processor Usage
The Collector provides two data kinds for sampling vector-
processing information: vector PC sampling and vector CPU
sampling. You use the SET command, as shown in the following
example, to enable sampling of PC values for random vector
instructions:
PCAC> SET VPC_SAMPLING
The preceding command enables the sampling of vector PC values
and shows you where the wall-clock time is being spent in the
application performing vector instructions. The sampling rate
defaults to an interval of 10 milliseconds and includes all
the idle process time associated with running the program. Call
stack information is collected by default. The following command
enables the sampling of vector CPU values and lets you examine the
particular areas of your application where process time is spent
performing vector instructions:
PCAC> SET VCPU_SAMPLING
The sampling rate defaults to an interval of 10 milliseconds
and includes only the time that the application is running on
the processor (process clock time). Call stack information is
collected by default.
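Either default interval can be changed. The following sketch
assumes that SET VPC_SAMPLING accepts the same /INTERVAL qualifier
that this topic shows for SET CPU_SAMPLING and SET VCPU_SAMPLING.
The value of 20 milliseconds is chosen only for illustration:
PCAC> SET VPC_SAMPLING/INTERVAL:20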
When you sample the vector PC values, you can determine the
scalar/vector parallelism throughout your entire program. The
collection of vector PC or CPU sampling data provides you with
the following information:
o The program counter of the vector instruction
o The program relative timestamp
o The vector instruction opcode
o The vector stride
o The vector control word (instruction dependent)
o The vector length register
o The vector mask register
o Call stack information (optionally)
2 – Collecting Concurrent Scalar and Vector Sampling
You can collect both scalar and vector PC samples during a
collection run. The timer intervals must be the same for both
types of PC sampling. If you have set different intervals
for each, the Collector uses the timer interval of the last
sampling command entered. The following example shows setting
the timer interval to 20 milliseconds for CPU sampling, and 100
milliseconds for vector CPU sampling.
PCAC> SET CPU_SAMPLING/INTERVAL:20
PCAC> SET VCPU_SAMPLING/INTERVAL:100
In the example above, the interval for both CPU sampling and
vector CPU sampling is set to 100 milliseconds, because the SET
VCPU_SAMPLING command was entered last.
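To avoid depending on the order of the commands, you can give
both data kinds the same interval. The following sketch assumes
that 20 milliseconds is the interval you want for both CPU
sampling and vector CPU sampling. It uses only the /INTERVAL
qualifier shown in the previous example:
PCAC> SET CPU_SAMPLING/INTERVAL:20
PCAC> SET VCPU_SAMPLING/INTERVAL:20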
3 – Counting Vector Processor Instructions
You can instruct the Collector to count all vector processor
instructions in all or in part of an application with the SET
VCOUNTERS command. From this information, you can determine to
what extent the vector processor is used. You must specify at
least one nodespec to indicate the domain of the data collected.
The following example shows collecting vector instruction counts
for an entire program using the nodespec of PROGRAM_ADDRESS BY
VINSTRUCTION:
PCAC> SET VCOUNTERS PROGRAM_ADDRESS BY VINSTRUCTION
The following example shows collecting vector instruction
counts for routine XYZ using the nodespec of ROUTINE BY
VINSTRUCTION:
PCAC> SET VCOUNTERS ROUTINE XYZ BY VINSTRUCTION
See the Command Dictionary in the Guide to DIGITAL PCA for a
complete list of available nodespecs with the SET VCOUNTERS
command.
4 – Analyzing Vector Processor Data
The Analyzer plots and displays the results of the vector
instructions data gathered in the Collector. You can use three
views to aid in the analysis of the scalar/vector processor
parallelism: Table, Histogram, and Annotated Source.
You can set the data kind to any of the following, depending
on what was gathered by the collection run:
o Vector instructions counted
o Vector PC sampling
o Vector CPU sampling
The following additional domains are available with vector
instruction analysis:
o INSTRUCTIONS - Sets the domain to be the vector instruction
found at the sampled or counted PC
o VLENGTH - Sets the domain to be the Vector Length Register (VLR)
values
o VMASK - Sets the domain to be the Vector Mask Register (VMR)
values
o VOPCODE - Sets the domain to be specific vector instructions
o VOPERATIONS - Sets the domain to be the number of operations per
vector instruction
o VREGISTERS - Sets the domain to be the vector register usage
o VSTRIDE - Sets the domain to be the vector stride values
5 – Finding Most Used Vector Instructions
In the INSTRUCTION domain, to determine which vector instructions
are used most by your program, enter the following command line:
PCAA> PLOT/VCOUNTERS INSTRUCTION BY VOPCODE
This command causes the report view to be based on the
disassembled opcode for each vector instruction in the entire
application that is sampled. The number of times a vector
instruction is used lets you see if your application is spending
a lot of time performing certain operations. For example, if you
see that the SYNC vector instruction is executed more than any
other vector instruction, you can infer that the scalar processor
is spending too much idle time waiting for the vector processor
to finish an operation.
6 – Finding Where Vector Instructions are Used
To find where in your program you are using vector instructions,
use the following command:
PCAA> PLOT/VCOUNTERS PROGRAM_ADDRESS BY VINSTRUCTION
This command displays the address of each vector instruction
that is used in your program and shows what percentage of program
execution time is spent on each instruction.
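The same plot can be narrowed to a single routine. The following
sketch assumes that the Analyzer accepts the ROUTINE nodespec
shown for SET VCOUNTERS in the Collector, and that vector
instruction counts were collected for that routine. XYZ is a
placeholder routine name:
PCAA> PLOT/VCOUNTERS ROUTINE XYZ BY VINSTRUCTION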
7 – System Configurations
The following list describes the possible system configurations
and their effect on performance:
o CPU1 and CPU2 with VVIEF support:
Efficient for program development, but can be 3-5 times slower
than scalar performance. Cost-effective for parallel
applications that do not use vector processing.
o CPU1 - CPU2 with Vector processor:
Efficient vector performance. As soon as a process issues its
first vector instruction, VMS schedules it only on the
vector-present (VP) CPU2. If the process is executing on CPU1,
VMS swaps it out and gives it to CPU2. If CPU2 is not free, the
process waits for it to become free. VMS does not use VVIEF
on this system.
o CPU1 and CPU2:
Fatal to vector programs. They fail when the first vector
instruction is issued, because neither VVIEF nor a vector
processor is present.
o CPU1 and CPU2 with Vector processors:
Most efficient and cost-effective parallel-vector performance.
o CPU1 and CPU2 - CPU3 and CPU4 with Vector processors:
Efficient parallel-vector performance.
7.1 – VVIEF on VAX Multiprocessors
If no vector-present CPU is available, OpenVMS executes vector
instructions using the VAX Vector Instruction Emulator Facility
(VVIEF), which is much slower than scalar execution.
NOTE
VVIEF must be enabled on the OpenVMS system; it is disabled
by default. To enable VVIEF, the system manager must execute
the command file SYS$UPDATE:VVIEF$INSTAL.COM. For more
information, refer to your OpenVMS documentation set.