With DIGITAL PCA, you can examine how you have split the processing of the application between scalar and vector processors. You can also analyze how well your application's algorithms use the vector processor.

Certain programs can run significantly faster on computers containing scalar and vector processors than on those containing scalar processors alone. Programs that use repetitive array and matrix operations can run faster on a vector processor because these operations are the ones most constrained by scalar performance bottlenecks. Programs that spend most of their time performing I/O operations, calling system services, or using data types not supported by vector hardware (for example, BYTE and LOGICAL) do not benefit as much from being executed on a computer with both scalar and vector processors.
1 – Finding Vector Processor Usage
The Collector provides two data kinds for sampling vector-processing information: vector PC sampling and vector CPU sampling. Use the SET command, as shown in the following example, to enable sampling of PC values for random vector instructions:

  PCAC> SET VPC_SAMPLING

The preceding command enables the sampling of vector PC values and shows you where wall-clock time is being spent in the application performing vector instructions. The sampling rate defaults to an interval of 10 milliseconds and includes all the idle process time associated with running the program. Call stack information is collected by default.

The following command enables the sampling of vector PC values against the process clock and lets you examine the particular areas of your application where process time is spent performing vector instructions:

  PCAC> SET VCPU_SAMPLING

The sampling rate defaults to an interval of 10 milliseconds and includes only the time that the application is running on the processor (process clock time). Call stack information is collected by default.

When you sample the vector PC values, you can determine the scalar/vector parallelism throughout your entire program. The collection of vector PC or CPU sampling data provides you with the following information:

o The program counter of the vector instruction
o The program relative timestamp
o The vector instruction opcode
o The vector stride
o The vector control word (instruction dependent)
o The vector length register
o The vector mask register
o Call stack information (optionally)
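As a rough aid to reading sampling results, each sample can be treated as standing for one timer interval. The sketch below is Python, not a PCA feature; the function name and the sample counts are hypothetical, and the estimate assumes samples are evenly spaced at the stated interval.

```python
def estimated_ms(sample_count, interval_ms=10):
    """Approximate the time represented by a number of samples.

    Assumes every sample stands for one full timer interval;
    10 ms is the default interval described above.
    """
    return sample_count * interval_ms


# Hypothetical sample counts for one routine under each data kind.
vpc_samples = 420    # vector PC sampling (wall-clock time, includes idle time)
vcpu_samples = 300   # vector CPU sampling (process clock time only)

print("wall-clock estimate (ms):", estimated_ms(vpc_samples))
print("process-time estimate (ms):", estimated_ms(vcpu_samples))
```

Because vector PC sampling includes idle process time and vector CPU sampling does not, the two estimates for the same code are not expected to match.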
2 – Collecting Concurrent Scalar and Vector Sampling
You can collect both scalar and vector PC samples during a collection run. The timer intervals must be the same for both types of PC sampling. If you have set different intervals for each, the Collector uses the timer interval of the last sampling command entered.

The following example sets the timer interval to 20 milliseconds for CPU sampling and then to 100 milliseconds for vector CPU sampling:

  PCAC> SET CPU_SAMPLING/INTERVAL:20
  PCAC> SET VCPU_SAMPLING/INTERVAL:100

Because the second command was entered last, the interval for both CPU sampling and vector CPU sampling is set to 100 milliseconds.
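The "last command wins" rule can be modeled in a few lines. This is only an illustrative Python sketch of the behavior described above, not Collector code; the function name and the list-of-pairs representation are assumptions.

```python
def effective_intervals(commands):
    """Map each enabled data kind to the interval the Collector would use.

    `commands` is a list of (data_kind, interval_ms) pairs in the order
    they were entered. Per the rule above, the interval given on the
    last sampling command applies to every sampling kind.
    """
    if not commands:
        raise ValueError("no sampling commands entered")
    _, last_interval = commands[-1]
    return {kind: last_interval for kind, _ in commands}


# Mirrors the two SET commands in the example above.
session = [("CPU_SAMPLING", 20), ("VCPU_SAMPLING", 100)]
print(effective_intervals(session))
```

Reversing the order of the two entries would leave both kinds at 20 milliseconds instead.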
3 – Counting Vector Processor Instructions
You can instruct the Collector to count all vector processor instructions in all or in part of an application with the SET VCOUNTERS command. From this information, you can determine to what extent the vector processor is used. You must specify at least one nodespec to indicate the domain of the data collected.

The following example collects vector instruction counts for an entire program, using the nodespec PROGRAM_ADDRESS BY VINSTRUCTION:

  PCAC> SET VCOUNTERS PROGRAM_ADDRESS BY VINSTRUCTION

The following example collects vector instruction counts for routine XYZ, using the nodespec ROUTINE BY VINSTRUCTION:

  PCAC> SET VCOUNTERS ROUTINE XYZ BY VINSTRUCTION

See the Command Dictionary in the Guide to DIGITAL PCA for a complete list of the nodespecs available with the SET VCOUNTERS command.
4 – Analyzing Vector Processor Data
The Analyzer plots and displays the results of the vector instruction data gathered by the Collector. You can use three views to aid in the analysis of scalar/vector processor parallelism: Table, Histogram, and Annotated Source.

You can set the data kind to any of the following, depending on what was gathered by the collection run:

o Vector instructions counted
o Vector PC sampling
o Vector CPU sampling

The following additional domains are available with vector instruction analysis:

o INSTRUCTIONS - Sets the domain to be the vector instruction found at the sampled or counted PC
o VLENGTH - Sets the domain to be the Vector Length Register (VLR) values
o VMASK - Sets the domain to be the Vector Mask Register (VMR) values
o VOPCODE - Sets the domain to be specific vector instructions
o VOPERATIONS - Sets the domain to be the number of operations per vector instruction
o VREGISTERS - Sets the domain to be the vector register usage
o VSTRIDE - Sets the domain to be the vector stride values
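Conceptually, choosing a domain means choosing the field by which the collected records are bucketed. The Python sketch below illustrates that idea for a VLENGTH-style breakdown; it is not how the Analyzer is implemented. The records, field names, and opcode mnemonics are made-up illustrations, and the 64-element maximum vector length is an assumption about the VAX vector registers.

```python
from collections import Counter

MAX_VLEN = 64  # assumption: a VAX vector register holds 64 elements

# Hypothetical records, each holding a few of the fields the Collector gathers.
samples = [
    {"pc": 0x2A10, "opcode": "VLDQ", "vlength": 64},
    {"pc": 0x2A18, "opcode": "VVADDD", "vlength": 64},
    {"pc": 0x2A20, "opcode": "VSTQ", "vlength": 64},
    {"pc": 0x2B40, "opcode": "VLDQ", "vlength": 12},
    {"pc": 0x2B48, "opcode": "VVMULD", "vlength": 12},
]


def bucket_by(records, field):
    """Count records per distinct value of `field` (the chosen domain)."""
    return Counter(record[field] for record in records)


by_vlength = bucket_by(samples, "vlength")
full_share = by_vlength[MAX_VLEN] / len(samples)

for vlen, count in sorted(by_vlength.items(), reverse=True):
    print(f"VLR={vlen:3d}  samples={count}")
print(f"share at full vector length: {full_share:.0%}")
```

A large share of short vector lengths would suggest loops that do not fill the vector registers; the same `bucket_by` call with `"opcode"` gives a VOPCODE-style view.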
5 – Finding Most Used Vector Instructions
In the INSTRUCTION domain, to determine which vector instructions are used most by your program, enter the following Analyzer command line:

  PCAA> PLOT/VCOUNTERS INSTRUCTION BY VOPCODE

This command causes the report view to be based on the disassembled opcode for each vector instruction in the entire application that is sampled. The number of times a vector instruction is used lets you see whether your application is spending a lot of time performing certain operations. For example, if you see that the SYNC vector instruction is executed more than any other vector instruction, you can infer that the scalar processor is spending too much idle time waiting for the vector processor to finish an operation.
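The SYNC inference above amounts to ranking opcodes by count and checking which one comes first. A minimal Python sketch of that check follows; the counts are invented for illustration, the mnemonics other than SYNC are assumed examples, and the helper name is hypothetical.

```python
from collections import Counter

# Hypothetical per-opcode counts, as might be read off a VOPCODE report.
opcode_counts = Counter({"VLDQ": 1200, "VVADDD": 900, "VSTQ": 600, "SYNC": 2500})


def most_used(counts):
    """Return (opcode, count, share_of_total) for the most frequent opcode."""
    opcode, count = counts.most_common(1)[0]
    return opcode, count, count / sum(counts.values())


opcode, count, share = most_used(opcode_counts)
print(f"most used: {opcode} ({count} executions, {share:.0%} of total)")

if opcode == "SYNC":
    # Same inference as in the text: the scalar processor is probably
    # idling while it waits for the vector processor.
    print("SYNC dominates: look for scalar/vector synchronization stalls")
```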
6 – Finding Where Vector Instructions are Used
To find where in your program you are using vector instructions, use the following command:

  PCAA> PLOT/VCOUNTERS PROGRAM_ADDRESS BY VINSTRUCTION

This command displays the address of each vector instruction that is used in your program and shows what percentage of program execution time is spent on each instruction.
7 – System Configurations
The following list shows the possible system configurations and their effect on performance:

o CPU1 and CPU2 with VVIEF support: Efficient for program development, but can be 3 to 5 times slower than scalar performance. Cost-effective for parallel applications that do not use vector processing.
o CPU1 - CPU2 with Vector processor: Efficient vector performance. As soon as a process issues its first vector instruction, VMS schedules it only on the vector-present (VP) CPU2. If the process is executing on CPU1, VMS swaps it out and gives it to CPU2. If CPU2 is not free, the process waits for it to become free; VMS does not use VVIEF on this system.
o CPU1 and CPU2: Fatal to vector programs. They fail when the first vector instruction is issued, because neither VVIEF nor a vector processor is present.
o CPU1 and CPU2 with Vector processors: Most efficient parallel-vector performance and cost-effective.
o CPU1 and CPU2 - CPU3 and CPU4 with Vector processors: Efficient parallel-vector performance.
7.1 – VVIEF on VAX Multiprocessors
If no vector-present CPU is available, OpenVMS executes vector instructions using the VAX Vector Instruction Emulator Facility (VVIEF), which is much slower than scalar execution.

NOTE

VVIEF must be enabled on the OpenVMS system; it is disabled by default. To enable VVIEF, the system manager must execute the command file SYS$UPDATE:VVIEF$INSTAL.COM. For more information, refer to your OpenVMS documentation set.
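The configuration rules in this topic reduce to a small decision: vector hardware is used when it is present, VVIEF emulation is the fallback when it is enabled, and otherwise a vector program fails. The Python sketch below only restates that logic; the function name and return strings are hypothetical and do not correspond to any OpenVMS interface.

```python
def vector_outcome(vector_cpu_present, vvief_enabled):
    """Describe what happens when a process issues its first vector instruction."""
    if vector_cpu_present:
        # The process is scheduled only on a vector-present CPU;
        # VVIEF is not used on such a system.
        return "runs on vector hardware"
    if vvief_enabled:
        # No vector-present CPU: instructions are emulated, much more slowly.
        return "emulated by VVIEF"
    # Neither vector hardware nor the emulator is available.
    return "fails on first vector instruction"


for present, vvief in [(True, False), (False, True), (False, False)]:
    print(f"vector CPU={present!s:5}  VVIEF={vvief!s:5} -> {vector_outcome(present, vvief)}")
```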