1  Vectors
   With DIGITAL PCA, you can examine how you have split the
   processing of the application between scalar and vector
   processors. You can also analyze how well your application's
   algorithms use the vector processor. Certain programs can run
   significantly faster on computers containing scalar and vector
   processors than on those containing scalar processors alone.
   Programs that use repetitive array and matrix operations can
   run faster on a vector processor because they are the most
   constrained by scalar performance bottlenecks. Programs that
   spend most of their time performing I/O operations, system
   services, or using data types not supported by vector hardware
   (for example, BYTE and LOGICAL) do not benefit as much from being
   executed on a computer with both scalar and vector processors.
 

2  Finding_Vector_Processor_Usage
   The Collector provides two data kinds for sampling vector-
   processing information: vector PC sampling and vector CPU
   sampling. You use the SET command, as shown in the following
   example, to enable sampling of PC values for random vector
   instructions:

 PCAC> SET VPC_SAMPLING

   The preceding command enables the sampling of vector PC values
   and shows you where the wall-clock time is being spent in the
   application performing vector instructions. The sampling rate
   defaults to an interval of 10 milliseconds and includes all
   the idle process time associated with running the program. Call
   stack information is collected by default. The following command
   enables vector CPU sampling and lets you examine the particular
   areas of your application where process time is spent performing
   vector instructions.

 PCAC> SET VCPU_SAMPLING


   The sampling rate defaults to an interval of 10 milliseconds
   and includes only the time that the application is running on
   the processor (process clock time). Call stack information is
   collected by default.

   When you sample the vector PC values, you can determine the
   scalar/vector parallelism throughout your entire program. The
   collection of vector PC or CPU sampling data provides you with
   the following information:

   o  The program counter of the vector instruction

   o  The program-relative timestamp

   o  The vector instruction opcode

   o  The vector stride

   o  The vector control word (instruction dependent)

   o  The vector length register

   o  The vector mask register

   o  Call stack information (optionally)
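
   As an illustration only, the items listed above can be pictured
   as one record per sample. The following Python sketch is not part
   of DIGITAL PCA: the class name VectorSample, its field names, and
   the sample values are assumptions made for this example and do not
   describe the Collector's actual data file format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VectorSample:
    """One vector sample. All names here are hypothetical, not PCA's."""
    pc: int                  # program counter of the vector instruction
    timestamp_ms: int        # program-relative timestamp
    opcode: str              # vector instruction opcode
    stride: int              # vector stride
    control_word: int        # vector control word (instruction dependent)
    vlr: int                 # vector length register
    vmr: int                 # vector mask register
    call_stack: Optional[Tuple[int, ...]] = None  # present only if collected


# Made-up values, used only to show how a record is filled in.
sample = VectorSample(pc=0x2A40, timestamp_ms=130, opcode="VLDL",
                      stride=4, control_word=0, vlr=64, vmr=0)
print(sample.opcode, sample.vlr)
```

   The call stack field defaults to None to mirror the fact that
   call stack information is optional.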
 

2  Collecting_Concurrent_Scalar_and_Vector_Sampling
   You can collect both scalar and vector PC samples during a
   collection run. The timer intervals must be the same for both
   types of PC sampling. If you have set different intervals
   for each, the Collector uses the timer interval of the last
   sampling command entered. The following example shows setting
   the timer interval to 20 milliseconds for CPU sampling, and 100
   milliseconds for vector CPU sampling.

 PCAC> SET CPU_SAMPLING/INTERVAL:20
 PCAC> SET VCPU_SAMPLING/INTERVAL:100

   In the preceding example, the interval for both CPU sampling and
   vector CPU sampling is set to 100 milliseconds.
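
   The rule that the last sampling command determines the shared
   timer interval can be sketched as follows. This Python fragment
   is a model of the rule only, not of the Collector itself; the
   function name effective_interval and the (data kind, interval)
   tuples are assumptions made for this example.

```python
DEFAULT_INTERVAL_MS = 10  # default sampling interval stated in this topic


def effective_interval(commands):
    """Return the shared interval after a sequence of sampling commands.

    `commands` is a list of (data_kind, interval_ms) pairs in the order
    entered. Every command overwrites the one shared timer interval, so
    the last command entered wins.
    """
    interval = DEFAULT_INTERVAL_MS
    for _data_kind, interval_ms in commands:
        interval = interval_ms
    return interval


entered = [("CPU_SAMPLING", 20), ("VCPU_SAMPLING", 100)]
print(effective_interval(entered))  # prints 100
```

   Entering the two commands in the opposite order would leave both
   data kinds sampling at 20 milliseconds.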
 

2  Counting_Vector_Processor_Instructions
   You can instruct the Collector to count all vector processor
   instructions in all or in part of an application with the SET
   VCOUNTERS command. From this information, you can determine to
   what extent the vector processor is used. You must specify at
   least one nodespec to indicate the domain of the data collected.

   The following example collects vector instruction counts for an
   entire program, using the nodespec PROGRAM_ADDRESS BY
   VINSTRUCTION:

 PCAC> SET VCOUNTERS PROGRAM_ADDRESS BY VINSTRUCTION

   The following example collects vector instruction counts for
   routine XYZ, using the nodespec ROUTINE BY VINSTRUCTION:

 PCAC> SET VCOUNTERS ROUTINE XYZ BY VINSTRUCTION

   See the Command Dictionary in the Guide to DIGITAL PCA for a
   complete list of available nodespecs with the SET VCOUNTERS
   command.
 

2  Analyzing_Vector_Processor_Data
   The Analyzer plots and displays the results of the vector
   instructions data gathered in the Collector. You can use three
   views to aid in the analysis of the scalar/vector processor
   parallelism: Table, Histogram, and Annotated Source.

   You can set the data kind to any of the following, depending
   on what was gathered by the collection run:

   o  Vector instructions counted

   o  Vector PC sampling

   o  Vector CPU sampling

   The following additional domains are available with vector
   instruction analysis:

   o  INSTRUCTIONS-Sets the domain to be the vector instruction
      found at the sampled or counted PC.

   o  VLENGTH-Sets the domain to be the Vector Length Register (VLR)
      values.

   o  VMASK-Sets the domain to be the Vector Mask Register (VMR)
      values.

   o  VOPCODE-Sets the domain to be specific vector instructions.

   o  VOPERATIONS-Sets the domain to be the number of operations per
      vector instruction.

   o  VREGISTERS-Sets the domain to be the vector register usage.

   o  VSTRIDE-Sets the domain to be the vector stride values.
 

2  Finding_Most_Used_Vector_Instructions
   In the INSTRUCTION domain, to determine which vector instructions
   are used most by your program, enter the following command line:

 PCAA> PLOT/VCOUNTERS INSTRUCTION BY VOPCODE

   This command causes the report view to be based on the
   disassembled opcode for each vector instruction in the entire
   application that is sampled. The number of times a vector
   instruction is used lets you see if your application is spending
   a lot of time performing certain operations. For example, if you
   see that the SYNC vector instruction is executed more than any
   other vector instruction, you can infer that the scalar processor
   is spending too much idle time waiting for the vector processor
   to finish an operation.
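
   The reasoning above amounts to tallying opcodes and checking
   which one dominates. The following Python sketch shows that tally
   on a hand-made list. It is not PCA output and not how the
   Analyzer is implemented; the function name most_used_opcode and
   the opcode list are assumptions made for this example.

```python
from collections import Counter


def most_used_opcode(opcodes):
    """Return (opcode, count) for the most frequent opcode in the list."""
    counts = Counter(opcodes)
    return counts.most_common(1)[0]


# Made-up opcode stream; a real run would take these from collected data.
seen = ["VLDL", "SYNC", "VVADDL", "SYNC", "VSTL", "SYNC"]

top, count = most_used_opcode(seen)
if top == "SYNC":
    # SYNC dominating suggests the scalar processor often waits
    # for the vector processor to finish.
    print("SYNC is the most used vector instruction:", count)
```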
 

2  Finding_Where_Vector_Instructions_are_Used
   To find where in your program you are using vector instructions,
   use the following command:

 PCAA> PLOT/VCOUNTERS PROGRAM_ADDRESS BY VINSTRUCTION

   This command displays the address of each vector instruction
   that is used in your program and shows what percentage of program
   execution time is spent on each instruction.
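
   A per-address percentage of this kind can be understood as each
   address's share of the collected data points. The following
   Python sketch computes such shares from a hand-made list of
   addresses. It is an illustration under that assumption, not a
   description of how the Analyzer computes its percentages, and the
   function name percent_by_address is hypothetical.

```python
from collections import Counter


def percent_by_address(addresses):
    """Map each address to its percentage share of all data points."""
    counts = Counter(addresses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {addr: 100.0 * n / total for addr, n in counts.items()}


# Made-up addresses of vector instructions, one entry per data point.
points = [0x2A40, 0x2A40, 0x2A48, 0x2A40]

for addr, share in sorted(percent_by_address(points).items()):
    print(f"{addr:#06x}  {share:5.1f}%")
```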
 

2  System_Configurations
   The following illustrates the possible system configurations and
   their effect on performance:

   o  CPU1 and CPU2 with VVIEF support:

      Efficient for program development, but can be 3-5 times slower
      than the scalar performance. Cost-effective for parallel
      applications that do not use vector processing.

   o  CPU1 - CPU2 with Vector processor:

      Efficient vector performance. As soon as a process issues its
      first vector instruction, VMS schedules it only on the
      vector-present (VP) CPU2. If the process is executing on CPU1,
      VMS swaps it out and gives it to CPU2. If CPU2 is not free, the
      process waits for it to become free; VMS does not use VVIEF
      on this system.

   o  CPU1 and CPU2:

      Fatal to vector programs. They fail when the first vector
      instruction is issued, because neither VVIEF nor a vector
      processor is present.

   o  CPU1 and CPU2 with Vector processors:

      Most efficient and cost-effective parallel-vector performance.

   o  CPU1 and CPU2 - CPU3 and CPU4 with Vector processors:

      Efficient parallel-vector performance.
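
   The configurations above reduce to one question: what happens
   when a process issues its first vector instruction? The following
   Python sketch restates that decision. It is a summary of the list
   above, not operating system code; the function name
   vector_outcome and its return strings are assumptions made for
   this example.

```python
def vector_outcome(has_vector_cpu: bool, vvief_enabled: bool) -> str:
    """Describe how a process's vector instructions are handled."""
    if has_vector_cpu:
        # A vector-present CPU takes precedence; VVIEF is not used.
        return "runs on vector hardware"
    if vvief_enabled:
        # No vector hardware: instructions are emulated, which is slow.
        return "emulated by VVIEF"
    # Neither vector hardware nor the emulator is available.
    return "fails on first vector instruction"


for config in [(True, False), (False, True), (False, False)]:
    print(config, "->", vector_outcome(*config))
```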
 

3  VVIEF_on_VAX_Multiprocessors
   If no vector-present CPU is available, OpenVMS executes vector
   instructions using the VAX Vector Instruction Emulator Facility
   (VVIEF), which is much slower than scalar execution.

                                  NOTE

      VVIEF must be enabled on the OpenVMS system; it is disabled
      by default. To enable VVIEF, the system manager must execute
      the command file SYS$UPDATE:VVIEF$INSTAL.COM. For more
      information, refer to your OpenVMS documentation set.