CLAIM TO PRIORITY OF PROVISIONAL APPLICATION
This application claims priority under 35 U.S.C. §119(e)(1) of provisional application Nos. 60/680,624, filed May 13, 2005 and 60/681,427, filed May 16, 2005.
TECHNICAL FIELD OF THE INVENTION
The technical field of this invention is processor and memory emulation technology.
BACKGROUND OF THE INVENTION
During applications code development, the development team traverses a repetitive development cycle shown below hundreds if not thousands of times:
- 1. Building code: compiling and linking a version of the applications code
- 2. Loading code: loading the code into a real hardware system or a software model
- 3. Debugging/Profiling code: chasing correctness or performance problems
- 4. Making changes: making source code edits, or changing the linker directives
The load and change portions of this cycle are generally viewed as non-productive time, as one is either waiting for code to download from the host to the target system or looking through files that need changes and making changes with a text editor.
Any trip through the loop can either introduce or eliminate bugs. When bugs are introduced, the development context changes to debug. When sufficient bugs are eliminated, the development context may change to profiling. There are obviously different classes of debug and profiling, some more advanced than others. Profiling can involve code performance, code size and power. The developer bounces between the concentric rings of the development context, as the applications code development proceeds.
Special emphasis must be placed on getting to the developer the system control, data transfers, or instrumentation applicable to the current debug or profiling context. This requires packaging the system control and instrumentation in readily accessible systems solutions form, where developers can easily access tools with capabilities targeting specific development problems. The presentation of capabilities must expose the complete capability of the toolset while making the selection of the right capability for the task at hand straightforward.
The need for emulation has significantly increased with the introduction of cache based architectures. This increased need primarily arises from the fact that on flat memory model architectures such as the Texas Instruments C620x devices, the performance that can be expected from running on the target could be accurately modeled with a simulator. The actual system performance with interrupts and Direct Memory Access (DMA) was within 10-15% of the simulated performance. This margin was reasonable for most applications of interest.
With the introduction of cache based architectures and the inability to model cache events and their impact on system performance accurately, today's developers find simulated performance to be anywhere from 50-100% away from the actual target performance. This inaccuracy results in a loss of confidence about the capabilities of the device and leads to fictitious performance de-rating factors between cache and flat memory performance. While some of the discrepancy between simulated and actual performance is due to inadequate modeling of the cache, there still exists a fundamental problem in modeling system related interactions such as interrupts or DMA accurately. Hence simulators typically have tended to play catch up with the target in modeling the system accurately. The period over which the simulator for a given target matures is unfortunately the same time that a developer is attempting to get to market.
Visibility into what the target is doing is key to extracting performance on cache-based architectures. The way to get this visibility for profiling system performance is through emulation. Visibility is also key for those writing behavioral simulators to countercheck the behavior of the target against what is expected. It is key to software developers in helping to reduce cache related stalls that impact performance. Visibility on the target is invaluable for system debug and development of applications in a timely manner. The absence of visibility leaves software developers with little else but to speculate about the probable reasons for loss of performance. The inability to know what is going on in the system leads to a trial and error approach to performance improvement that is gained by optimal code and data placement in memory. The lack of proper tools that allow for cache visualization precludes one from answering the question “Is this the most optimal software implementation for this target?” The ability to know if a given software module ever missed real-time in an actual system is of utmost importance to system developers who are bringing up complex systems. Such questions can be only accurately answered by the constant and non-intrusive monitoring of the actual system that advanced emulation offers.
Visibility is key in aiding complex system debug. Debugging memory corruption and being able to halt the CPU when such a corruption is detected is of primary importance, as memory exceptions are not currently supported on Texas Instruments C6x targets. In addition on the C6x Digital Signal Processor (DSP) data memory corruption can also result in program memory corruption causing the CPU execution to crash, as program and data share a unified memory. There is therefore a need to accurately trace the source code that is causing this malicious behavior. The ability to monitor Direct Memory Access (DMA) events, their submissions and completions relative to the CPU will provide additional dimensions to the programmer to tune the size of the data sets the algorithm is working on for more optimal performance. The ability to catch and warn users about spurious CPU writes or DMA writes to memory can prove to be invaluable in cutting down the software debug time. Advanced emulation features once again hold the key to all these critical capabilities. The need for good visibility only gets more serious with the introduction of multiple CPU cores moving forward. The need to know which CPU currently has access to a shared common data resource will be a question of prime importance in such scenarios. The detection and warning of possible memory incoherence is another critical capability that emulation can offer.
The new emulation features will provide enhanced debug and profiling capabilities that allow users to have better visibility into system and memory behavior. Further, several usability issues are addressed.
The aim is to make new debug and profiling capabilities available and fix problems encountered in previous implementations:
- Stall cycle profiling to identify parts of the user application that require code optimization.
- Event profiling to analyze system and memory behavior, which in turn allows effective optimization methods to be chosen.
- Cache viewer and coherence analysis to debug cache coherence problems.
- Software Pipelined Loop instruction (SPLOOP) Debug.
- Support for Memory protection and security
- Reduce Real-time Data Exchange intrusiveness.
- Richer set of Advanced Event Triggering events.
SUMMARY OF THE INVENTION
The master slave timing of interfaces coupled with clock insertion delays of devices causes degradation of performance as the insertion delay comes directly out of the sampling window, decreasing the sampling time. A programmable delay can be added to the clock and data that allows optimization of timing, resulting in a significant increase of transfer rates.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other aspects of this invention are illustrated in the drawings, in which:
FIG. 1 shows compression of trace words;
FIG. 2 shows compression of trace packets;
FIG. 3 demonstrates data extraction;
FIG. 4 shows clock source selection;
FIG. 5 shows input delay lines;
FIG. 6 illustrates dual channel operation for skew adjustments;
FIG. 7 shows the digital delay lines;
FIG. 8 shows the delay line control signals;
FIG. 9 demonstrates delay line cross coupling;
FIG. 10 illustrates tap measurement with a split delay line;
FIG. 11 shows a multi input recording interface;
FIG. 12 shows an alternate implementation of a multi input recording interface;
FIG. 13 shows chip and trace unit interconnections;
FIG. 14 shows clock insertion delay cancellation;
FIG. 15 is a block diagram showing scaled time simulation;
FIG. 16 is a distributed width trace receiver;
FIG. 17 is a flow diagram of a distributed depth trace receiver;
FIG. 18 shows message insertion into the trace stream;
FIG. 19 is a block diagram of a last stall standing implementation; and
FIG. 20 shows an example of a self simulation architecture.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Trace data is stored in trace memory as it is recorded. At times, the trace data may be repetitive for extended periods of time. Certain sequences may also be repetitive. This presents an opportunity to represent the trace data in a compressed format. This condition can arise when only certain types of trace data are generated; for example, only trace timing data is generated when program counter (PC) and data trace are turned off and timing remains on.
The trace recording format accommodates compression of consecutive trace words. When at least two consecutive trace words are the same value, the words 2 through n are replaced with a command and count that communicates how many times the word was repeated. The maximum storage for a burst of 2 through n words is two words as shown in FIG. 1, where word 101 does not repeat, words 102, 103, 104 and 105 are identical and then words 106 and 107 are identical. This sequence compresses as follows—word 108 is the same as word 101, word 109 has the value of word 102, and word 110 contains a 3 as the repetition factor for word 109. Similarly, words 106 and 107 are identical, and are encoded as word 111 containing the value of word 106 while word 112 contains the repetition factor of 1.
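The word compression described above can be sketched as a simple run-length encoder. This is a minimal illustration, not the recording format itself: the token names `WORD` and `REPEAT` are hypothetical stand-ins for the literal word and the "command and count" word, whose actual bit encoding is not given here.

```python
from typing import List, Tuple

# Hypothetical token types (assumptions, not the actual trace encoding):
#   ("WORD", value)  - a literal trace word
#   ("REPEAT", n)    - the preceding word occurred n more times
Token = Tuple[str, int]


def compress_words(words: List[int]) -> List[Token]:
    """Replace words 2..n of each run of identical words with one REPEAT token."""
    out: List[Token] = []
    i = 0
    while i < len(words):
        value = words[i]
        run = 1
        while i + run < len(words) and words[i + run] == value:
            run += 1
        out.append(("WORD", value))
        if run > 1:
            # A run of any length costs at most two stored words.
            out.append(("REPEAT", run - 1))
        i += run
    return out


def expand_words(tokens: List[Token]) -> List[int]:
    """Inverse of compress_words."""
    out: List[int] = []
    for kind, arg in tokens:
        if kind == "WORD":
            out.append(arg)
        else:
            out.extend([out[-1]] * arg)
    return out
```

Applied to a sequence shaped like the one in FIG. 1 (one unique word, four identical words, two identical words), the sketch yields five tokens with repetition factors of 3 and 1.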
This concept may be extended to data of any width before it is packed into words. In this case packets or packet patterns (sequences) may be recorded in compressed form. It is not necessary for the packets or patterns to be word aligned. This is shown in FIG. 2, where packet 201 does not repeat, packets 202, 203, 204 and 205 are identical and then packets 206 and 207 are identical. This sequence compresses as follows: packet 208 is the same as packet 201, packet 209 has the value of packet 202, and packet 210 contains a 3 as the repetition factor for packet 209. Similarly, packets 206 and 207 are identical, and are encoded as packet 211 containing the value of packet 206 while packet 212 contains the repetition factor of 1.
Data recording of single ended signals may use two out of phase clocks to extract the data, which substantially lessens the effects of duty cycle distortion. Using two out of phase clocks makes the data extraction logic considerably more tolerant of the input duty cycle distortion induced by any component (on-chip or off-chip) before the data is extracted from the transmission at the receiver.
The use of two clocks, hereafter called BE_BP mode (both edges, both phases), deals with the duty cycle distortion created by circuitry between the transmitter and receiver. If certain factors distort the waveform, the duty cycle could be as poor as 80%/20% by the time the data reaches the capture circuit.
Data from both a positive edge sample and negative edge sample are used to derive the data bit value stored in a circular buffer in BE_BP mode. The primary and secondary clocks capture two copies of the data. A sample is taken with the positive edge of one clock and the negative edge of the other clock during each bit period. These two captured data values are combined to create the data bit value (along with the data value captured by the previous negative edge). The captured data is clocked into the circular buffer based on the clock edges sampling the data.
BE_BP delivers better bandwidth by utilizing the fact that signals switching in the same direction will have similar distortion characteristics. This is best understood by following an example. Beginning with a data bit that is a zero for multiple bit periods, the data moves to a one. Assuming there is distortion in the duty cycle, the rising edge of the data input has similar characteristics to the rising edge of the clock moving high at the bit period where the data bit moves to a one. Since the bit is a zero previously, the data sampled by the clock that is rising is used to define the next data bit. Once the data bit is a high, the falling edge of the clock moving low at the bit period where the data bit moves to a zero is used to determine the bit value. The data extraction algorithm is defined by the following equation:
if (last bit == 0) {data = data sampled by next rising edge clock;}
else {data = data sampled by next falling edge clock;}
When a bit is sampled as a one by the positive and negative edges of the clock, the data is assumed to be a one. If the data sampled by the positive edge indicates a one while data sampled by the negative edge indicates a zero, the bit timing is close or the waveform is distorted. In this case the data sampled by the previous bit's negative edge is checked. If this data was captured as a zero, the data for this bit is declared a one because the data bit must be transitioning from a zero to a one. The converse is also true.
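The selection rule can be sketched in software as follows. This is a simplified model under stated assumptions: each bit period is assumed to supply exactly one rising-edge sample and one falling-edge sample, already aligned in two lists, and the function name `extract_bits` is hypothetical.

```python
from typing import List


def extract_bits(rising_samples: List[int],
                 falling_samples: List[int],
                 last_bit: int = 0) -> List[int]:
    """Pick, per bit period, the sample taken by the edge that tracks the
    direction in which the data could next change.

    Assumption: rising_samples[i] and falling_samples[i] belong to the same
    bit period; the hardware's actual sample alignment is not modeled.
    """
    data = []
    for rise, fall in zip(rising_samples, falling_samples):
        # Last bit 0: only a 0 -> 1 transition is possible, so trust the
        # rising-edge sample. Last bit 1: trust the falling-edge sample.
        bit = rise if last_bit == 0 else fall
        data.append(bit)
        last_bit = bit
    return data
```

In this model a disagreement between the two samples is resolved purely by the previously extracted bit, which mirrors the zero-to-one and one-to-zero cases discussed above.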
Looking at FIG. 3, one can see how data extraction works. As the equation above shows, data extraction is based on the last data bit extracted at 306 (DATA), data in 303 (DIN), and two clocks that are out of phase with each other 301 and 302 (CLK1 and CLK0). The data sampled by each edge of CLK1 is shown at 304 (SMP1) while the data sampled by each edge of CLK0 is shown as 305 (SMP0). Looking at points 307 (A) and 308 (B), the SMP0 value is used for data as the prior data value is a zero moving to a one at A while the SMP0 value is used for data as the prior value is a one moving to a zero at B. Note that the duty cycle distortion causes erroneous data values sampled by CLK1 (SMP1) at points A and B.
A single trace receiver may be used to record trace data from multiple trace transmitters. It may also be used to accept trace data from a cascaded trace unit, receiving data from another unit. In the example shown in FIG. 4, each input 401 may be used as either clock 403 or data 405, as selected by logic blocks 402 and 404. This allows any of the inputs to be assigned as a clock and all other inputs as data, or other channels. The trace channels that supply clock(s) and data may supply channels that are skewed. At times there is a need to de-skew clocks when multiple clocks are used. There is also a need to de-skew data inputs to a clock. As shown in FIG. 5, delay lines 501 are added within the trace receiver of FIG. 4 to provide for alignment of clocks to each other and clocks to data. Skew between data bits and data and clock may drift over time and can change with temperature.
This skew may be adjusted in a dynamic manner by using two data extraction circuits to accomplish dynamic recalibration. Two separate data paths are created from the same inputs. Both paths are initially calibrated (de-skewed). One circuit is used as the data path after initial calibration. The second circuit is operated in parallel with the first circuit. The skew of the second circuit is adjusted while the channel operates by comparing the data extracted by the two extraction circuits. Once the second circuit is calibrated, its function is changed to the data path with the data path circuit being changed to the calibration path. This process continues at a slow rate as the drift is slow.
Adaptive calibration of input sampling may be implemented to increase the robustness of the system. At very high data rates, the very small sampling windows may drift because of temperature over long periods of time. Adaptive calibration provides a mechanism to identify approaching marginal setup and hold time situations for the capture circuit creating the data sent to trace channels. Two copies of the data capture logic are used to create a collection and calibration copy of incoming data bits. By capturing the data with the same clocks and data sourced from different delay lines, it is possible to measure whether adequate data setup and hold time margins are being maintained. This is accomplished by alternately moving the delay of the calibration delay line before and after the delay setting of the collection delay line. The data values captured by the collection and calibration circuits are compared for mismatches when the collection data is passed to the channels.
If a mismatch occurs, the setup-time or hold-time margin of the collection data capture is identified. The calibration delay line is adjusted until data comparison errors are detected or the calibration delay line adjustment has reached its extreme. Since the delay lines can be calibrated so that the delay of each tap is known, and thermal drift is measured using an extra delay line, the trace software can adjust the collection delay setting to optimize the sampling point of the collection capture circuit.
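One way the trace software might turn these comparisons into a new collection setting is sketched below. It is a simplified sketch under several assumptions: the callback `matches(tap)` is a hypothetical stand-in for "the calibration capture at this tap agreed with the collection data", the two directions are searched one after the other rather than alternately, and the sampling point is simply re-centred between the two failure points.

```python
from typing import Callable, Tuple


def find_margins(matches: Callable[[int], bool],
                 collection_tap: int,
                 min_tap: int,
                 max_tap: int) -> Tuple[int, int]:
    """Return (early_margin, late_margin) in taps around collection_tap.

    The calibration tap is moved away from the collection setting until a
    comparison error is seen or the delay line reaches its extreme.
    """
    early = collection_tap
    while early > min_tap and matches(early - 1):
        early -= 1
    late = collection_tap
    while late < max_tap and matches(late + 1):
        late += 1
    return collection_tap - early, late - collection_tap


def recentred_tap(matches: Callable[[int], bool],
                  collection_tap: int,
                  min_tap: int,
                  max_tap: int) -> int:
    """Suggest a collection tap midway between the measured margins."""
    early_margin, late_margin = find_margins(
        matches, collection_tap, min_tap, max_tap)
    return collection_tap + (late_margin - early_margin) // 2
```

For a delay line with taps 0 to 31 whose good sampling window spans taps 10 to 20, a collection setting of 12 has margins of 2 and 8 taps, and the sketch suggests moving to tap 15.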
The collection and calibration data streams are compared. The failures are recorded separately for collection data a one and calibration data a zero. A more complete representation of the skew characteristics is provided with this approach. The application software makes adjustments in the collection skew delay when it determines the collection sampling point can be moved to provide more margin.
In the example shown in FIG. 6, there are two separate data paths 601 and 602 (A and B). During operation, the skew between data bits may change because of thermal changes. Both Path A and B are calibrated when the channel is activated. When the channel operates, either Path A or Path B is selected to generate channel data 603. The path not selected processes the same inputs as the path selected. Since the channel is operating, the data pattern is not known. The data extracted from the two channels is compared in block 604 as the delays are adjusted on the path not selected. The optimum sampling points are found for this path. This calibration may take a long time, maybe as much as several minutes. Checks that assure data with ones and zeroes has been passed through the channel are used to assure the path is properly exercised through calibration. Once calibration of the path not selected has been completed, the roles of the two paths are reversed, with the path supplying data to the channel turned into the calibration path at the same time the calibration path is changed to the data source for the channel.
In order to implement the calibration algorithms, a very long digital variable delay line is required, with minimal distortion. FIG. 7 shows an implementation of such a delay line.
The delay line has two inputs, normal 701 (PIN_in) and calibration 702 (Calibrate), as shown in FIG. 7. Either input or neither input may be selected. When neither input is selected, the delay line may be flushed with a level.
The calibration input is used to configure the delay line as a ring oscillator while the PIN_in is the signal that is normally delayed. Signal 703 (PIN_out) is the delay line output.
Two delay elements are shown, one designated as 704 (odd) and another designated as 705 (even). The odd element is controlled by the 706 (MORE_O) and 708 (LESS_O) control inputs while the even element is controlled by the 707 (MORE_E) and 709 (LESS_E) control inputs. The symmetry of the circuit and input connectivity of the cascaded elements provides extremely low distortion for delays as long as 10 nanoseconds.
The skew delay is initialized to the minimum when the input is disabled via the MODE codes associated with the input. As shown in FIG. 8, the delay is increased with the MORE DELAY command 801, and decreased with the LESS DELAY command 802. These commands generate MORE_E, MORE_O, LESS_E or LESS_O depending on the last ring control command issued, as shown in Table 1. Enable signal 803 enables or disables the control circuit, while Reset signal 804 initializes the delay line settings.
TABLE 1
| Command | Last Update | Current Update |
| MORE | MORE_E | MORE_O |
| MORE | MORE_O | MORE_E |
| MORE | LESS_E | MORE_E |
| MORE | LESS_O | MORE_O |
| LESS | LESS_E | LESS_O |
| LESS | LESS_O | LESS_E |
| LESS | MORE_E | LESS_E |
| LESS | MORE_O | LESS_O |
The number of delay elements included in the delay line is controlled by a master slave like shift register mechanism built into the delay element. The Control State of each element is stored locally in an R-S latch. Adjacent cells (even and odd) have different clocks updating these cells. This means the control state latches can be used like the front and back ends of a Master Slave FF. When the cells are connected together they form a left/right shift register. The MORE_O and MORE_E signals are generated by control logic external to the delay line. These signals cause the shift register to shift right one bit. Only half the cells are updated at any one time. A cell that was last updated with a right shift will contain the last one when the shift register structure is viewed from left to right. When the opposite set of cells is updated, a one is moved into the cell to the right of the cell that previously held the last one. This process continues as MORE_E and MORE_O are alternately generated. The circuit looks like a shift register that shifts right filling with ones. The latch implementation is chosen as it is smaller than one done with conventional flip flops.
The LESS_O and LESS_E signals cause the shift register to shift left one bit. Again, only half the cells are updated at any one time. A cell that was last updated with a left shift will contain the last zero when the shift register structure is viewed from right to left. When the opposite set of cells is updated, a zero is moved into the cell to the left of the cell that previously held the last zero. This process continues as LESS_E and LESS_O are alternately generated. The circuit looks like a shift register that shifts left, filling with zeros.
When a LESS directive follows a MORE directive, it will update the same set of delay elements as the MORE directive. When a MORE directive follows a LESS directive, it will update the same set of delay elements as the LESS directive. This is shown in Table 1.
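The command sequencing of Table 1 can be expressed as a lookup keyed on the new command and the last update issued. The sketch below is illustrative only: the function name `run_commands` is hypothetical, and because the update state immediately after reset is not stated, the caller supplies it.

```python
from typing import Dict, List, Tuple

# (command, last update) -> current update, transcribed from Table 1.
NEXT_UPDATE: Dict[Tuple[str, str], str] = {
    ("MORE", "MORE_E"): "MORE_O",
    ("MORE", "MORE_O"): "MORE_E",
    ("MORE", "LESS_E"): "MORE_E",
    ("MORE", "LESS_O"): "MORE_O",
    ("LESS", "LESS_E"): "LESS_O",
    ("LESS", "LESS_O"): "LESS_E",
    ("LESS", "MORE_E"): "LESS_E",
    ("LESS", "MORE_O"): "LESS_O",
}


def run_commands(commands: List[str], last_update: str) -> List[str]:
    """Return the sequence of update signals generated by MORE/LESS commands.

    last_update is the most recent update signal before the first command
    (an assumed starting state, since the reset state is not specified).
    """
    issued = []
    for command in commands:
        last_update = NEXT_UPDATE[(command, last_update)]
        issued.append(last_update)
    return issued
```

Repeated commands in one direction alternate between the even and odd cells, while a change of direction first revisits the set of cells that was updated last.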
Digital delay lines may be used to provide fixed delays within circuits. These delays may need to be a specific time value. To get a time value, the number of delay elements needed to create the delay must be chosen. This requires the delay of each delay line tap be determined. The ability to determine this delay in a precise fashion is described. It is not sufficient to just turn the delay line into a ring oscillator as a minimal setting will create an oscillator that runs too fast to be measured easily.
In the implementation shown in FIG. 9, delay lines 901 and 902 are cross coupled. After both delay lines are cross coupled, they are cleared. With one delay line at full length, the other delay line length is changed one tap at a time with the cross coupled delay lines functioning as a ring oscillator. The ring oscillator increments counter 903 once released. The counter is cleared before the delay line is enabled as an oscillator. After a certain period of time the counter is stopped, and the frequency determined. The difference in frequency when a tap is added gives the delay of the delay line tap.
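The arithmetic behind this measurement can be sketched as follows. The sketch works with oscillation periods rather than frequencies, and it assumes that the signal traverses the loop twice per oscillation period, as in a typical inverting ring oscillator; that factor, the function names, and the parameter `transits_per_period` are assumptions rather than details given above.

```python
def loop_period(count: int, window_seconds: float) -> float:
    """Oscillation period implied by a counter value over a fixed window."""
    if count <= 0:
        raise ValueError("counter must have advanced during the window")
    return window_seconds / count


def tap_delay(count_before: int,
              count_after: int,
              window_seconds: float,
              transits_per_period: int = 2) -> float:
    """Estimate the delay contributed by the one tap that was added.

    count_before / count_after are the counter values measured over the
    same window without and with the extra tap.
    """
    extra_period = (loop_period(count_after, window_seconds)
                    - loop_period(count_before, window_seconds))
    return extra_period / transits_per_period
```

Because one delay line is held at full length, the oscillator stays slow enough to count reliably even when the line under test is at its minimum setting.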
The same approach may be used with a single delay line as it may be split in half to appear as two delay lines 1001 and 1002 as shown in FIG. 10. The delays generated by the taps in one section are determined while the other section's delays are held static.
A trace data source may output trace packets in a width that is not native to the packet. For example, eight 10-bit trace packets may be transmitted as ten 8-bit transmission packets. On the receiver end, the 8-bit transmission packets may be packed into 16-bit, 32-bit, or 64-bit values and stored in trace memory. Any other word width is also acceptable.
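The width conversion can be illustrated with a small bit-repacking routine. This is a sketch only: the bit ordering (least significant bits first) and the function name `repack` are assumptions, since the ordering used on the trace port is not specified here.

```python
from typing import List


def repack(values: List[int], in_width: int, out_width: int) -> List[int]:
    """Re-slice a stream of in_width-bit values into out_width-bit values.

    Assumption: bits are streamed least significant bit first. Any bits
    left over at the end are emitted as a final, partially filled value.
    """
    in_mask = (1 << in_width) - 1
    out_mask = (1 << out_width) - 1
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = []
    for value in values:
        acc |= (value & in_mask) << nbits
        nbits += in_width
        while nbits >= out_width:
            out.append(acc & out_mask)
            acc >>= out_width
            nbits -= out_width
    if nbits:
        out.append(acc)
    return out
```

Eight 10-bit packets carry 80 bits, so they repack into exactly ten 8-bit transmission packets, and applying the routine in the opposite direction recovers the original packets.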
The function that performs the packing of a series of M-bit values into P-bit frames to be stored in memory is called a Packing Unit (PU). In one implementation, the PU stores a number of trace transmission packets in 64-bit words called PWORDs. These trace packets are conveyed to the PU through trace transmission packets that may be a different width than the native trace packet. In this implementation, the PU accommodates trace packet widths of 1 to 20 bits. Other widths are possible. The PU is presented a 48-bit input created from two 24-bit sections. The PU uses the data even valid (DE_VALID[n]) and data odd valid (DO_VALID[n]) indications to determine when sections of the input need processing. The Packing Unit processes the data frame based on:
- Transmission packet width
- Number of buffer entries in the 48-bit input (0, 1, or 2 transmission packets available)
- Number of transmission packets processed previously
A lookup table is used to map the incoming transmission packets in the input frame into the 64-bit words. It is programmed before a trace recording session begins based on the factors noted above. This processing creates 64-bit packed words (PWORDs). These words are then stored in trace memory.
In this example, the programmable implementation of a packing unit provides for the packing of any transmission width from 1 to 23 bits into PWORDs from 1 to 63 bits wide. The Packing Unit uses a lookup RAM to define the packing sequence of a series of trace packets that appear in the 48-bit data frame output from one of the AUs. When one works through examples of varied transmission packet and PWORD widths, it is found that the width of the PWORD (less than or equal to 63 bits) determines the programming depth of the lookup RAM.
The PWORD width is set to an integer multiple of the trace packet width. For a 10-bit trace packet the recording word width is set to 10, 20, 30, 40, 50, or 60 bits. For a 9-bit trace packet, the width is set to 9, 18, 27, 36, 45, 54, or 63 bits and so forth.
Let us assume a 4-bit element and a 63-bit recording frame. In this example, the number of recording frames built from the 4-bit input segments is defined by the recording frame width. In other words, the example builds four 63-bit words from 63 4-bit input values. If the input data width is five bits with a memory word width of 63-bits, five 63-bit words are built from 63 five bit input values.
If the number of words built and the recording word width have a common factor, both numbers can be divided by this factor. In the example of a 10-bit element and a 60-bit recording frame, the common factor is 10. This means the frame builder can construct one 60-bit word from six 10-bit elements. The relationship between number of words, recording width, and element width is defined by the following equation:
X words can be constructed from Y elements where:
X=Element width/common factor
Y=recording width/common factor
The lookup table must be programmed to the point it repeats (Y locations). A 6-bit register value is used to define the length of the packing sequence before it repeats.
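The relationship between X and Y can be checked numerically. The sketch below takes the "common factor" to be the greatest common divisor of the two widths, which is an interpretation rather than something stated above; the function name `packing_sequence` is hypothetical.

```python
from math import gcd
from typing import Tuple


def packing_sequence(element_width: int, recording_width: int) -> Tuple[int, int]:
    """Return (X, Y): X recording words are built from Y elements before the
    packing sequence repeats. Assumes the common factor is the GCD."""
    factor = gcd(element_width, recording_width)
    return element_width // factor, recording_width // factor
```

For a 10-bit element and a 60-bit recording frame this gives one word from six elements, while 4-bit or 5-bit elements with a 63-bit frame share no common factor, so the sequence repeats only after 63 elements.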
There is a separate lookup table for each of the 64 recording word bits. These lookup tables specify the input to PWORD bit mapping during the mapping sequence. An extra lookup table output bit is added to the table for bits 21:00 as these bits can straddle one of two PWORDS. The extra bit further defines the PWORD associated with this bit. Bits 62:22 do not need this bit so it is not implemented.
This results in a 64×7 bit (for PWORD bits 21:00) and a 64×6 bit lookup table (for PWORD bits 62:22). The lookup table specifies the mapping of the input bits (transmission frames) to the PWORDs each clock. The address to these lookup tables begins at zero and is incremented once for each transmission packet processed (0, 1, or 2 each clock). The address generation for a recording channel lookup RAM is defined by the following expression:
if (address + number of elements > maximum + 1) {next_address = 1;}
else if (address + number of elements > maximum) {next_address = 0;}
else {next_address = address + number of elements;}
The address generation is handled by a dedicated hardware block that uses the number of valid transmission packets in the input frame and the end of sequence value. The Bit Builders use the address to drive 64 lookup random access memories (RAMs), one for each of the 63 bits in the PWORD and a 64th to define when PWORDs are completely constructed. The tables within the lookup RAMs select the bit in the 48-bit input that is to be loaded into each PWORD bit. The Multiplexer Lookup RAMs are organized as 16 64×32-bit RAMs (not all bits are implemented), each RAM supplying the multiplexer control for four bits.
The address generation for the multiplexer control lookup tables increments the address by 0, 1, or 2. The wrap address is set through a register before activating the unit. The address generation begins at zero and progresses from there, with the signals indicating available transmission packets driving the address generation.
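If the lookup addresses are taken to run from zero up to the programmed wrap value, the address update reduces to modular arithmetic. The following sketch assumes that reading; `maximum` stands for the end of sequence value held in the register and `next_address` is a hypothetical name.

```python
def next_address(address: int, num_packets: int, maximum: int) -> int:
    """Advance the lookup address by the number of transmission packets
    processed this clock (0, 1, or 2), wrapping past `maximum` back to 0.

    Assumption: valid addresses are 0..maximum inclusive.
    """
    if num_packets not in (0, 1, 2):
        raise ValueError("0, 1, or 2 packets are processed each clock")
    return (address + num_packets) % (maximum + 1)
```

With a wrap value of 5, processing one packet at address 5 returns to address 0, and processing two packets at address 5 lands on address 1, so no table entry is skipped or repeated.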
While a typical trace receiver records from one input port, bandwidth requirements may dictate the use of multi port input trace receivers capable of recording on multiple channels. Such a multiple port, multiple channel receiver is shown as an example in FIG. 11, where multiple recording interfaces 1101-1102 connect to multiple recording channels 1103, 1104, 1105 and 1106 in a selectable manner so that input from each recording interface may be assigned to any recording channel 1107 through 1110. While FIG. 11 shows a two input, four channel system, there is no limitation on the number of inputs or channels.
In the interest of increasing bandwidth, recording may be time division multiplexed between the available recording channels. FIG. 12 shows such a trace receiver with multiple recording interface 1201 connecting to multiple recording channels 1202. Multiple clocks with offsets are used to direct the input data to the desired port.
Typical trace recorders control trace recording by starting and stopping recording at the source. This is done using gated clocks or an enable. With the advent of more sophisticated transmission methods, the recording control point may be moved to a point past the front end, much closer to the memory interface. The trace receiver front end is synchronized to chip transmission and remains synchronized, while the actual on/off control takes place at the memory interface. This allows the input to continue to operate while the data is either presented to the memory interface or may be discarded without affecting input data synchronization.
In a typical system, the trace is being recorded by an external device. The trace function may be treated as a peripheral of the device being traced. As shown on FIG. 13, a trace receiver 1301 is attached to the device 1302 being traced through a trace port 1303 and bus 1304. The trace device records activity through the trace port 1303, and may be programmed or the recorded data retrieved through bus 1304.
The trace function may be implemented on a development board as a trace chip shown in FIG. 13. In an alternate implementation the trace capability may be placed on a small add on board.
It is desirable to be able to look at trace information without halting trace recording. It is also preferable to be able to use the trace buffer as a large FIFO for data where the collection rate is less than the rate the host may empty the trace buffer.
Host transfers to and from trace memory while additional trace data is stored are called Real-time Transfers (RTTs). RTTs can take two forms:
- Chasing the most recently stored data (forward reads that progress from the start of buffer toward end of buffer)
- Snapshot the most recently stored data (reverse reads that progress from the end of buffer toward start of buffer)
When an RTT is initiated, the command causes the initial memory address for a host memory activity to be dynamically generated from the current trace buffer address. For real-time reads, a read command dynamically generates the initial transfer address. For reads where the read direction is opposite to the store direction, the last stored address is used for the initial read address. For reads where the read direction is the same as the store direction, the next store address is captured, assuming the buffer is full.
Trace buffers can be stored or read either forward or backward. Reads while the channel transfer is stopped are called Static Reads. Static Reads provide access to the entire trace buffer contents without the threat of the data being corrupted by subsequent stores. The storing of new data is suppressed by turning the channel off prior to performing a read. The debug software for this type of read specifies the initial transfer address. Static Reads can read the buffer forward or backward.
Since the trace buffer is circular, a read command can cross the start or end of buffer address. The hardware manages the buffer wrap conditions by resetting the address to the starting buffer address or ending buffer address as required. This may also be done by software.
When the data is read from the most recently stored data to the least recently stored data, the transfer is assumed to have two components. The first component is created from the current buffer address to the start address and second created from the end buffer address to the current buffer address.
When the data is read from the least recently stored data to the most recently stored data, the transfer is also assumed to have two components. The first component is created from the current buffer address to the end address and second created from the start buffer address to the current buffer address.
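The two transfer components described above may be sketched as follows. This is a minimal illustration only: the function name `read_components` is hypothetical, addresses are treated as word indices, and the choice to place the current buffer address in the first component is an assumption.

```python
def read_components(start, end, current, newest_first):
    """Return the two address sequences of a full read of a circular buffer.

    start and end are the first and last word addresses of the buffer and
    current is the address at which the read begins.
    """
    if newest_first:
        # Most recent to least recent: walk down to the start address,
        # wrap, then walk down from the end address toward current.
        first = list(range(current, start - 1, -1))
        second = list(range(end, current, -1))
    else:
        # Least recent to most recent: walk up to the end address,
        # wrap, then walk up from the start address toward current.
        first = list(range(current, end + 1))
        second = list(range(start, current))
    return first, second


rev_first, rev_second = read_components(0, 7, 5, newest_first=True)
fwd_first, fwd_second = read_components(0, 7, 5, newest_first=False)
```

In both directions the two components together visit every buffer address exactly once, which is the wrap handling that the hardware or software performs.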
For the reads from the most recently stored to the least recently stored data, the read processing proceeds as follows. A transfer incomplete error is set if the read terminates before the desired number of words is read. This is caused by a wrap condition occurring on real-time reads (new stores have overwritten data that was to be read creating a discontinuity in old and new data). A no data error is set if no data has been stored in the buffer.
Care must be taken to detect when the data being read is overwritten by data being stored in the case of real-time transfers. This condition may be detected with a collision counter. This counter detects two overrun conditions:
-
- Data is stored with incrementing/decrementing buffer addresses, data is read with decrementing/incrementing buffer addresses. The number of words stored plus the number of words read is equal to the buffer size. (Peek)
- Data is stored with incrementing/decrementing buffer addresses, data is read with incrementing/decrementing buffer addresses. The number of words stored minus the number of words read is equal to the buffer size. (Chase)
These overrun conditions are detected using a Collision Counter. This counter is used to determine the distance between the read and write pointers of the Trace Buffer. When this distance becomes zero, a buffer wrap condition is imminent (some accesses may still be in the pipeline and may not have actually happened yet). Before the Collision Counter has decremented to zero, each word read is valid, as it was definitely read before new data is stored in this location. A second counter, the Valid Transfer Counter, is incremented for each word read before the Collision Counter decrements past zero.
The Collision Counter is loaded with the trace buffer size prior to a host transfer. Once the host transfer request is issued, each trace word stored decrements the collision counter. Each word the Transfer Counter stores in the temporary buffer as a result of the channel read request also counts the counter down. When the sum of the two counts decrements past zero, the data read becomes suspect as a wrap condition has occurred or is on the verge of occurring.
Before the Collision Counter decrements to zero, the Valid Transfer Counter tracks the number of reads that are successful prior to the Collision Counter decrementing past zero. When the transfer completes, Debug Software uses the Valid Transfer Count value to determine how many of the words in the read buffer are really valid.
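One possible software model of the Collision Counter and Valid Transfer Counter during a peek is sketched below. The function name `peek_transfer` and the event encoding are hypothetical, and the exact boundary (a read counts as valid only while the counter is still above zero) is an assumption about how "before the counter has decremented to zero" is applied.

```python
def peek_transfer(buffer_size, events):
    """Model a peek: 'S' is a trace word stored, 'R' is a word read for the host.

    Returns (valid_transfer_count, total_reads).
    """
    collision = buffer_size      # loaded with the trace buffer size
    valid = 0                    # Valid Transfer Counter
    reads = 0
    for ev in events:
        if ev == "R":
            reads += 1
            if collision > 0:    # assumed boundary: still before zero
                valid += 1
        # Both stores and reads count the Collision Counter down.
        collision -= 1
    return valid, reads


valid_count, total_reads = peek_transfer(8, "RRSSRRSSRR")
```

In this example four stores and four reads exhaust the eight word buffer, so only the first four of the six words read are reported as valid.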
The chase operation has two components:
-
- Counting the words stored to the buffer and notifying the host
- The host initiating reads to retrieve the words after being notified
Once a chase operation is requested, channel stores decrement the Collision Counter and TC stores associated with the channel increment the Collision Counter. Since trace data stores have higher priority, the counter will never count up past the buffer size. An overrun condition occurs when the channel stores decrement the counter past zero. When this occurs, the channel store has stored the entire buffer without the host emptying it. Host reads will read out of order data in this situation.
At this point another counter, the Store Counter, comes into play. This counter is used to notify the host when a fixed number of words are stored beginning with the point the read request is issued (an interrupt may be generated). The interrupt interval may be made programmable. Once a transfer has been activated, it merely suspends when words are read. A read may be restarted by merely continuing the read from where it paused. The read continues to pause in this manner until it is terminated with either a TERMINATE or an INITIALIZE command.
The overrun condition is detected with the Collision Counter just as with peeks. The counter starts with the buffer size and is decremented by stores and incremented by TC stores related to the channel read transfer.
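The chase operation may be illustrated with the following sketch, which models the Collision Counter and the Store Counter together. The function name `chase` and the event encoding are hypothetical, and the model assumes the host never reads more words than have been stored.

```python
def chase(buffer_size, notify_interval, events):
    """Model a chase: 'S' is a channel store, 'R' is a TC store for the host read.

    Returns (notifications, overrun, final_collision_count).
    """
    collision = buffer_size          # starts at the buffer size
    stored_since_notify = 0          # Store Counter
    notifications = 0
    overrun = False
    for ev in events:
        if ev == "S":
            collision -= 1           # channel stores decrement
            stored_since_notify += 1
            if stored_since_notify == notify_interval:
                notifications += 1   # host is notified (e.g. by interrupt)
                stored_since_notify = 0
            if collision < 0:
                overrun = True       # buffer filled without being emptied
        elif ev == "R":
            collision += 1           # TC stores increment
    return notifications, overrun, collision


kept_up = chase(4, 2, "SSRRSSRR")
fell_behind = chase(4, 2, "SSSSS")
```

When the host keeps pace, the counter returns to the buffer size after each burst; when it does not, the fifth store into a four word buffer drives the counter past zero and the overrun is flagged.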
The master slave timing of interfaces, coupled with the clock insertion delays of devices, causes slower performance, as the insertion delay comes directly out of the sampling window. As shown in FIG. 14, programmable delays 1401 and 1403 can be added to the clock, and programmable delay 1402 to the data, to allow optimization of timing. The delay may be adjusted dynamically during operation to optimize performance. Scan rates and other transfers may be accelerated by as much as a third when the clock insertion delay is cancelled.
With traditional trace recorders such as logic analyzers, a time stamp is recorded in parallel with each sample stored into trace memory. Each trace sample corresponded to a cycle of system activity. With today's trace implementations on chip, the trace information does not represent a cycle of system activity. Instead a trace word may be an encoded view of many cycles of system activity. Additionally, on-chip trace export mechanisms may schedule output from multiple sources out of order of execution. This makes the exact arrival of trace information in the receiver imprecise.
Instead of using the traditional method of adding Time of the Day (TOD) or Time Stamp (TS) information to trace for every sample, this information may be placed in the trace stream itself and represented as a control word. This may be done periodically or at the first empty slot after some period has elapsed.
By partitioning trace logic to free run while functional logic is clock stepped, the device state of interest may be exported as trace information. When the trace generated by a single functional clock is exported, another functional clock is issued generating more trace information. The functional clock rate is slowed to a rate necessary to export the state of interest.
The operation of scaled-time simulation is relatively straightforward, as shown in FIG. 15. When a chip is built with trace, the trace logic 1501 is supplied clocks 1502 which are separate from clocks 1503 that normally run the system logic 1504. This allows the chip to be placed in a special mode where the functional logic is issued one clock. One frame of trace data is generated for each functional clock issued. The valid signal 1505 may be implemented as a toggle, changing state when new information is generated. The Trace Logic 1501, whose clock is free running, detects a change in state in the valid signal. It processes the trace information presented to it, exporting this information 1506 to a trace recorder. When transmission of this information has created sufficient space to accept a new frame of trace information, the Empty signal 1507 is generated. This causes the clock generation logic to issue another clock to the System Logic. This starts the process over. An optional stall 1508 may be generated by the Trace receiver so it may pace transactions.
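The handshake between the clock stepped system logic and the free running trace logic may be modeled, purely for illustration, as shown below. The function name `scaled_time_run` is hypothetical, the fixed number of trace clocks per frame is an assumption, and the optional stall 1508 is not modeled.

```python
def scaled_time_run(frames, export_clocks_per_frame):
    """Step the system logic one functional clock per exported trace frame.

    export_clocks_per_frame must be at least 1.
    Returns (functional_clocks, trace_clocks, exported_frames).
    """
    valid = False          # toggle driven by the system logic
    seen_valid = False     # last valid state observed by the trace logic
    empty = True           # Empty: room for a new frame
    pending = None
    remaining = 0
    functional_clocks = 0
    trace_clocks = 0
    exported = []
    while len(exported) < len(frames):
        if empty:
            # Clock generation issues exactly one functional clock.
            pending = frames[functional_clocks]
            functional_clocks += 1
            valid = not valid
            empty = False
        trace_clocks += 1              # trace clock is free running
        if valid != seen_valid:        # toggle detected: new frame
            seen_valid = valid
            remaining = export_clocks_per_frame
        remaining -= 1
        if remaining == 0:
            exported.append(pending)   # frame sent to the trace recorder
            empty = True               # request the next functional clock
    return functional_clocks, trace_clocks, exported


result = scaled_time_run(["f0", "f1", "f2"], 4)
```

Here three functional clocks consume twelve trace clocks, so the functional clock rate is scaled to one quarter of the trace clock rate.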
Generally, a trace receiver built with a programmable component, or potentially with another technology (standard cell or ASIC), may, for bandwidth reasons, have a limit as to the width of incoming trace data that can be processed. This is because the incoming data rates may outstrip the ability of the receiver to store the data to memory. At times parallel input units may be deployed to capture some portion of the input. The assignment of more than one input channel to a unit can constrain the number of bits that can be processed in parallel. For instance, when the data rate of the input is doubled and two input channels are used to process the input in an interleaved fashion, the unit's memory bandwidth or some other factor may require the width of the incoming data to be constrained to a level that can be handled by the unit.
The simplest way of dealing with a unit's input capacity problem is to place two units in parallel, with each unit recording some portion of the incoming data. In other cases, a wide but slower interface such as a memory bus may be used for recording data, with unused memory bandwidth used to export trace data. In this case the wider interface may also require the use of one or more units for recording.
FIG. 16 demonstrates an implementation of a distributed width architecture. The system logic 1601 connects to trace channels 1602, 1603 and 1604 in parallel. Each channel is supplied a set of controls that are identical, and may be as simple as the trace clock. The data 1608, 1609 and 1610 to be recorded by each unit are different.
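The distributed width arrangement may be illustrated by splitting each wide trace word into slices, one per recording unit. The function names `split_word` and `merge_slices` are hypothetical, and equal-width slices taken from the least significant end are an assumption.

```python
def split_word(word, total_width, units):
    """Split a trace word into equal slices, one per recording unit.

    Slice 0 holds the least significant bits. total_width must be a
    multiple of units in this simplified model.
    """
    slice_width = total_width // units
    mask = (1 << slice_width) - 1
    return [(word >> (i * slice_width)) & mask for i in range(units)]


def merge_slices(slices, slice_width):
    """Recombine the per-unit recordings into the original wide word."""
    word = 0
    for i, part in enumerate(slices):
        word |= part << (i * slice_width)
    return word


slices = split_word(0xABCDEF, 24, 3)
rebuilt = merge_slices(slices, 8)
```

Because every unit sees the same controls, such as the trace clock, the slices recorded at the same buffer position belong to the same trace word and can be merged after the fact.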
When multiple debug tools are connected to a target system it may be desirable for them to coordinate their activities. Examples of the need for coordination may be during trace compression or other functions where supervision by a master recording unit is required, and a master and one or more slave units must be designated. This coordination may need to be close to the physical connection. The coordination may involve wide trace, coordination of execution control, or global triggers. This coordination may take place in a variety of ways, including direct connections between the respective debug units. An alternate way of coordination may employ a connection through the target connector, wherein the debug units communicate with the connector which in turn implements the required interconnections.
It may be desirable to expand the trace recording in the deeper dimension. Generally, a trace receiver built with a programmable component, or potentially with another technology (standard cell or ASIC) may, for bandwidth reasons, have a limit as to the amount of incoming trace data that can be processed. In addition the depth of the trace recording may be doubled when the memory space of two or more units is combined. The simplest way of dealing with a trace depth issue is to place two or more units in series, with each unit recording some portion of the incoming data. FIG. 17 demonstrates this architecture. The system logic block 1701 being traced connects to trace unit 1702, which in turn connects to trace unit 1703 and then to 1704 thus expanding the depth of the trace.
When memory events are traced, the timing stream is used to associate events with instructions and indicate pipeline advances precluding the recording of stall cycles. These events are traced when the PC is traced. The tracing of data trace values may not be possible concurrent with memory events in some event encoding modes that use both the timing stream and data value.
When tracing processor activity, three streams are present: timing stream, program counter (PC) stream and data stream. The timing stream has the active and event information, PC stream has all the discontinuity information, and the data stream has all the detailed information. The various streams are synchronized using markers called sync points. The sync points provide a unique identifier field and a context to the data that will follow it. All streams may generate a sync point with this unique identifier. These unique identifiers allow synchronization between multiple streams. When a sync point is generated we will have the streams generated as shown in Table 2. It should be noted that the context information is provided only in the PC stream. There is no order dependency of the various streams with each other. However within each stream the order cannot be changed between sync points.
TABLE 2
| Timing stream | PC stream | Data stream |
| Timing sync point, id = 1 | PC sync point, id = 1 | Data sync point, id = 1 |
| Timing data | PC data | Memory Data |
| Timing data | | Memory Data |
| Timing data | PC data | Memory Data |
| | PC data | |
| Timing data | | Memory Data |
| Timing sync point, id = 2 | PC sync point, id = 2 | Data sync point, id = 2 |
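The use of sync point identifiers to associate the three streams may be sketched as follows. The tuple representation of stream entries and the function names `split_by_sync` and `align` are hypothetical and are chosen only for illustration.

```python
def split_by_sync(stream):
    """Group a stream's entries under the sync point id that precedes them.

    Each entry is ("sync", id) or ("data", payload). Entries seen before
    the first sync point have no context and are dropped.
    """
    segments = {}
    current_id = None
    for kind, value in stream:
        if kind == "sync":
            current_id = value
            segments[current_id] = []
        elif current_id is not None:
            segments[current_id].append(value)
    return segments


def align(timing, pc, data):
    """Associate the segments of all three streams that share a sync id."""
    t, p, d = split_by_sync(timing), split_by_sync(pc), split_by_sync(data)
    common = sorted(set(t) & set(p) & set(d))
    return {i: {"timing": t[i], "pc": p[i], "data": d[i]} for i in common}


timing = [("sync", 1), ("data", "t0"), ("data", "t1"), ("sync", 2)]
pc = [("data", "stale"), ("sync", 1), ("data", "p0"), ("sync", 2)]
data = [("sync", 1), ("data", "m0"), ("data", "m1"), ("sync", 2)]
aligned = align(timing, pc, data)
```

Order is preserved within each stream between sync points, while no ordering is assumed between entries of different streams.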
Four events will be sent to trace although at any one time only some of those events may be active. Information is sent to trace to inform how many and which events occurred.
A timing stream is shown with 0 being active cycle. A “1” however does not represent a stall cycle. Instead it indicates the occurrence of an event.
Bits [7:0]=00111000 is a timing packet.
A “1” in the timing stream implies there is at least one event that has occurred. The event profiling information will be encoded and sent to the data section of the data trace FIFO.
In the generic encoding method, every event that occurs inserts a “1” in the timing stream. If there are multiple events, then it is possible that many “1”s will be inserted in the stream forming an event group. A single “1” can also be an event group by itself. Event groups that occur in a cycle are separated by one or more “0”. The group of “1”s map to the count of events, as outlined in the following table, that occurred with the execute packet. The encoding bits are arranged from MSB to LSB. The total bits required in generic encoding are shown in Table 3. The columns are defined as follows:
-
- #Etrace: Total number of Events being traced;
- #Events: Total events that occurred in that cycle;
- Implication: The bits in the stream reflect these events have occurred
- #Bits: Total bits used for the generic encoding scheme;
- E0: Event 0;
- E1: Event 1;
- E2: Event 2;
- E3: Event 3.
Generic encoding should be used when all the events have equal probability of occurring. The user may opt to trace anywhere from one event to all four events.
TABLE 3
| Line No. | #Etrace | #Events | Timing [MSB:LSB] | Data [MSB:LSB] | Implication | # Bits |
| 1 | 1 | 1 | 1 | No bits in data stream | E0 | 1 |
| 2 | 2 | 1 | 1 | No bits in data stream | E0 | 1 |
| 3 | | 1 | 11 | No bits in data stream | E1 | 2 |
| 4 | | 2 | 111 | No bits in data stream | E0 E1 | 3 |
| 5 | 3 | 1 | 1 | 0 | E0 | 2 |
| 6 | | 1 | 1 | 01 | E1 | 3 |
| 7 | | 1 | 1 | 11 | E2 | 3 |
| 8 | | 2 | 11 | 0 | E0 E1 | 3 |
| 9 | | 2 | 11 | 01 | E0 E2 | 4 |
| 10 | | 2 | 11 | 11 | E1 E2 | 4 |
| 11 | | 3 | 111 | No bits in data stream | E0 E1 E2 | 3 |
| 12 | 4 | 1 | 1 | 00 | E0 | 3 |
| 13 | | 1 | 1 | 01 | E1 | 3 |
| 14 | | 1 | 1 | 11 | E2 | 3 |
| 15 | | 1 | 1 | 10 | E3 | 3 |
| 16 | | 2 | 11 | 01 | E0 E1 | 4 |
| 17 | | 2 | 11 | 11 | E0 E2 | 4 |
| 18 | | 2 | 11 | 000 | E0 E3 | 5 |
| 19 | | 2 | 11 | 010 | E1 E2 | 5 |
| 20 | | 2 | 11 | 100 | E1 E3 | 5 |
| 21 | | 2 | 11 | 110 | E2 E3 | 5 |
| 22 | | 3 | 111 | 10 | E1 E2 E3 | 5 |
| 23 | | 3 | 111 | 11 | E0 E2 E3 | 5 |
| 24 | | 3 | 111 | 00 | E0 E1 E3 | 5 |
| 25 | | 3 | 111 | 01 | E0 E1 E2 | 5 |
| 26 | | 4 | 1111 | No bits in data stream | E0 E1 E2 E3 | 4 |
The consecutive “1s” in the timing stream determine the number of events that are active and being reported. The encoding in the data stream can then be used to determine the exact events that are active in that group. The following table gives an example of the encoding and decoding of the events. The bits are filled in from the LSB. The latter events are packed in the higher bits. It is assumed that the encoding is in generic mode in the following example and all four AEG are active. Therefore only lines 12-26 of Table 3 are referenced for encoding and decoding this data. The same data stream is interpreted differently with reference to different timing streams. The (MSB: LSB) column is the data stored in the FIFO. “Lines” is the lines to be referred to in Table 3 with the current timing data. The table highlights the fact that the interpretation of the data stream changes based on the timing stream.
In the prioritized mode encoding scheme, fewer bits are used for some events while other events may take up more bits. This enables high frequency events to take up fewer bits, thus decreasing the stress on the available bandwidth. A classic example of this would be misses from the local cache (high frequency), versus misses from the external memory (low frequency).
A timing stream is shown with 0 being active cycle as before. A “1” however does not represent a stall cycle. Instead it indicates the occurrence of an event.
Bits [7:0]=00111000 is a timing packet.
A “1” in the timing stream implies there is at least one event that has occurred. The event profiling information will be encoded and sent to the data section of the data trace FIFO. The priority encoding of this information is based on the following table. The encoding bits are arranged from MSB to LSB.
The various columns in Table 4 are defined as follows:
-
- #AEG: Total number of AEG active;
- #Events: Total events that occurred in that cycle;
- Implication: The bits in the stream reflect these events have occurred;
- #Bits: Total bits used for the priority encoding scheme;
- E0: Event from AEG0;
- E1: Event from AEG1;
- E2: Event from AEG2;
- E3: Event from AEG3.
The consecutive “1's” in the timing stream determine the number of events that are active and being reported. The encoding in the data stream can then be used to determine the exact events that are active in that group. Table 4 gives an example of the encoding and decoding of the events. The bits are filled in from the LSB. The latter events are packed in the higher bits. It is assumed that the encoding is in prioritized mode in the following example and all four AEG are active. Therefore only lines 12-26 of Table 4 are referenced for encoding and decoding this data. The same data stream is interpreted differently with reference to different timing streams. The (MSB:LSB) column is the data stored in the FIFO. “Lines” is the lines to be referred to in Table 4 with the current timing data. Table 4 highlights the fact that the interpretation of the data stream changes based on the timing stream.
Table 4 shows the encoding for prioritized compression mode. The prioritized encoding can be used if the user has a mix of long and short stalls, or frequent versus infrequent. This method is skewed toward efficiently sending out a specific event. It is slightly less efficient in sending out rest of the events. This encoding scheme should be used for the case where one event either does not cause any stalls, or happens very frequently with very little stall duration. The longer stalls can be put in the group that take more bits to encode. The shorter stalls can be put in a group that takes fewer bits to be encoded. An example of this is L2 miss which is a long stall, versus L1D stall which is a short stall.
TABLE 4
| Line No. | #AEG | #Events | Timing [MSB:LSB] | Data [MSB:LSB] | Implication | # Bits |
| 1 | 1 | 1 | 1 | No bits in data stream | E0 | 1 |
| 2 | 2 | 1 | 1 | No bits in data stream | E0 | 1 |
| 3 | | 1 | 11 | No bits in data stream | E1 | 2 |
| 4 | | 2 | 111 | No bits in data stream | E0 E1 | 3 |
| 5 | 3 | 1 | 1 | No bits in the data stream | E0 | 1 |
| 6 | | 1 | 11 | 0 | E1 | 3 |
| 7 | | 1 | 11 | 11 | E2 | 4 |
| 8 | | 2 | 11 | 01 | E0 E1 | 4 |
| 9 | | 2 | 111 | 1 | E0 E2 | 4 |
| 10 | | 2 | 111 | 0 | E1 E2 | 4 |
| 11 | | 3 | 1111 | No bits in the data stream | E0 E1 E2 | 4 |
| 12 | 4 | 1 | 1 | No bits in the data stream | E0 | 1 |
| 13 | | 1 | 11 | 0 | E1 | 3 |
| 14 | | 1 | 11 | 11 | E2 | 4 |
| 15 | | 1 | 11 | 01 | E3 | 4 |
| 16 | | 2 | 111 | 01 | E0 E1 | 5 |
| 17 | | 2 | 111 | 11 | E0 E2 | 5 |
| 18 | | 2 | 111 | 000 | E0 E3 | 5 |
| 19 | | 2 | 111 | 010 | E1 E2 | 6 |
| 20 | | 2 | 111 | 100 | E1 E3 | 6 |
| 21 | | 2 | 111 | 110 | E2 E3 | 6 |
| 22 | | 3 | 1111 | 10 | E1 E2 E3 | 6 |
| 23 | | 3 | 1111 | 11 | E0 E2 E3 | 6 |
| 24 | | 3 | 1111 | 00 | E0 E1 E3 | 6 |
| 25 | | 3 | 1111 | 01 | E0 E1 E2 | 6 |
| 26 | | 4 | 1111 | 100 | E0 E1 E2 E3 | 7 |
An example of decoding the streams in the prioritized mode is shown in Table 5. The data stream interpretation changes based on the timing stream.
TABLE 5
| | MSB:LSB | Interpretation | Lines |
| Data stream | 001 | - | - |
| Timing example 1 | 011011110 | “1111” in TM => 3 or 4 events active | 22-25 |
| | | “01” in Data => E0 E1 E2 | 25 |
| | | “11” in TM => 1 event active | 12-15 |
| | | ‘0’ left in Data => E1 | 13 |
| Timing example 2 | 000111000 | “111” in TM => 2 events active | 16-21 |
| | | “01” in Data => E0 E1 | 16 |
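The decoding illustrated in Table 5 may be sketched in software as follows. This is a minimal, non-authoritative sketch: the names `PRIORITIZED_4`, `event_runs` and `decode` are hypothetical, the lookup holds only lines 12-25 of Table 4, and line 26 is left out as an assumption because its data code "100" shares its low-order bits with the "00" code of line 24 and the text does not state how the two are told apart.

```python
# Table 4, lines 12-25 (four AEGs active).
# Key: number of consecutive 1s in the timing stream.
# Value: data code (written MSB:LSB) -> events reported.
PRIORITIZED_4 = {
    1: {"": ("E0",)},
    2: {"0": ("E1",), "11": ("E2",), "01": ("E3",)},
    3: {"01": ("E0", "E1"), "11": ("E0", "E2"), "000": ("E0", "E3"),
        "010": ("E1", "E2"), "100": ("E1", "E3"), "110": ("E2", "E3")},
    4: {"10": ("E1", "E2", "E3"), "11": ("E0", "E2", "E3"),
        "00": ("E0", "E1", "E3"), "01": ("E0", "E1", "E2")},
}


def event_runs(timing):
    """Lengths of the groups of 1s in a timing string, LSB group first."""
    return [len(run) for run in reversed(timing.split("0")) if run]


def decode(timing, data):
    """Decode event groups, consuming data bits starting from the LSB."""
    decoded = []
    for run in event_runs(timing):
        for code, events in PRIORITIZED_4[run].items():
            if data.endswith(code):
                data = data[:len(data) - len(code)]  # drop consumed bits
                decoded.append(events)
                break
        else:
            raise ValueError("no code matches the remaining data bits")
    return decoded


example_1 = decode("011011110", "001")
example_2 = decode("000111000", "001")
```

The same data stream 001 yields E0 E1 E2 followed by E1 for the first timing example and E0 E1 for the second, consistent with Table 5.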
In normal trace, the timing stream reflects active and stall cycles. It is also possible to suppress the stall bits, and the stall encoding may instead be replaced with event information. When events are traced, the timing stream is used to associate events with instructions and indicate pipeline advances precluding the recording of stall cycles. This allows the real time tracing of the processor activity without disturbing or halting the processor, and visibility into the memory system activity with fewer trace pins than other approaches.
A timing stream is shown where a “0” is an active cycle. In normal encoding a “1” can therefore represent a stall cycle.
Bits [7:0]=00111000 is a timing packet.
Therefore this packet would indicate that there were 3 active cycles, followed by 3 stall cycles, which were then followed by 2 active cycles.
Instead we can now replace the stall information with event information. The stall information will be suppressed. A “1” now indicates the occurrence of an event. Therefore the above packet can now be interpreted as follows:
There are 3 active cycles, followed by some event (encoded in this case with 3-“1's”), which is then followed by 2 active cycles.
The exact encoding is completely user dependent on the protocol implemented. For example if 2 possible events are being traced, they could be encoded as follows:
-
- 1->Event 0 occurred
- 11->Event 1 occurred
- 111->Event 0 and 1 occurred.
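The interpretation of the timing packet under this two event encoding may be sketched as follows. The names `EVENT_GROUPS` and `parse_timing_packet` are hypothetical, and the packet is read starting from the LSB, which is the order that reproduces the 3, 3, 2 sequence given above.

```python
# Two-event encoding from the list above: run length of 1s -> events.
EVENT_GROUPS = {1: ("E0",), 2: ("E1",), 3: ("E0", "E1")}


def parse_timing_packet(bits):
    """Parse a timing packet written MSB first, reading it from the LSB.

    Returns a list of ("active", cycle_count) and ("event", events) items.
    """
    runs = []
    for bit in reversed(bits):
        kind = "event" if bit == "1" else "active"
        if runs and runs[-1][0] == kind:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([kind, 1])    # start a new run
    return [(kind, EVENT_GROUPS[n] if kind == "event" else n)
            for kind, n in runs]


parsed = parse_timing_packet("00111000")
```

For the packet 00111000 this yields three active cycles, an event group reporting events 0 and 1, and two active cycles.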
A timing stream is shown in FIG. 1 where a “0” is an active cycle. In normal encoding a “1” can, therefore represent a stall cycle.
Bits [7:0]=00111000 is a timing packet.
Therefore this packet would indicate that there were 3 active cycles, followed by 3 stall cycles, which were then followed by 2 active cycles.
The exact encoding may also be completely user dependent as to the protocol being implemented. For example if 3 possible events are being traced, they could be encoded as shown in Table 6:
TABLE 6
| Timing stream | Comment | Total bits used |
| 1 | Event 0 occurred | 1 |
| 11 | Event 1 occurred | 2 |
| 111 | Event 2 occurred | 3 |
| 1111 | Event 0 and 1 occurred | 4 |
| 11111 | Event 0 and 2 occurred | 5 |
| 111111 | Event 1 and 2 occurred | 6 |
| 1111111 | Event 0, 1 and 2 occurred | 7 |
The user can change the above encoding when the likelihood of events occurring alone, as well as in combination, is equal. In that case the above method can be changed to a different method, shown in Table 7, where a separate stream holds the reason for the event:
TABLE 7
| Timing stream | Data Stream | Comment | Total bits used |
| 1 | 00 | Event 0 occurred | 3 |
| 1 | 01 | Event 1 occurred | 3 |
| 1 | 10 | Event 2 occurred | 3 |
| 11 | 00 | Event 0 and 1 occurred | 4 |
| 11 | 01 | Event 0 and 2 occurred | 4 |
| 11 | 10 | Event 1 and 2 occurred | 4 |
| 11 | | Event 0, 1 and 2 occurred | 4 |
The user may be constrained on the total bandwidth available, and may want to profile the events in two runs. In the first run the user may have an implied blocking in the events, and thus send out only one event each time. Once the problem area is seen, the user can then focus on just part of the algorithm, enabling higher visibility in that run. Let us say that event 0 has the highest blocking priority. Then the above encoding can be changed to what is shown in Table 8:
TABLE 8
| Timing stream | Data Stream | Comment | Total bits used |
| 1 | Not used | Event 0 occurred | 1 |
| 11 | Not used | Event 1 occurred | 2 |
| 111 | Not used | Event 2 occurred | 3 |
| 1 | Not used | Event 0 and 1 occurred | 1 |
| 1 | Not used | Event 0 and 2 occurred | 1 |
| 11 | Not used | Event 1 and 2 occurred | 2 |
| 1 | | Event 0, 1 and 2 occurred | 1 |
If we compare the Tables 6, 7 and 8 the total bits that are used in each case is shown in Table 9:
TABLE 9
| Comment | Table 6 | Table 7 | Table 8 |
| Event 0 occurred | 1 | 3 | 1 |
| Event 1 occurred | 2 | 3 | 2 |
| Event 2 occurred | 3 | 3 | 3 |
| Event 0 and 1 occurred | 4 | 4 | 1 |
| Event 0 and 2 occurred | 5 | 4 | 1 |
| Event 1 and 2 occurred | 6 | 4 | 2 |
| Event 0, 1 and 2 occurred | 7 | 4 | 1 |
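The bit counts of Table 9 may be reproduced with the following sketch. The function names are hypothetical. It assumes that the Table 6 cost equals the position of the combination in that table, that the Table 7 data field is always two bits (including the combination of all three events, for which Table 7 lists no data code but four total bits), and that the Table 8 blocking priority runs from event 0 down to event 2.

```python
from itertools import combinations

EVENTS = (0, 1, 2)
# Same order as the rows of Table 9: singles, then pairs, then all three.
COMBOS = [c for size in (1, 2, 3) for c in combinations(EVENTS, size)]


def bits_table6(combo):
    """Timing stream only: one more 1 for each successive combination."""
    return COMBOS.index(combo) + 1


def bits_table7(combo):
    """One or two timing bits plus an assumed two-bit data code."""
    return (1 if len(combo) == 1 else 2) + 2


def bits_table8(combo):
    """Blocking priority: only the highest priority event is reported."""
    return min(combo) + 1


table9 = [(bits_table6(c), bits_table7(c), bits_table8(c)) for c in COMBOS]
```

Each tuple in `table9` corresponds to one row of Table 9, in the same row order.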
The exact encoding is user dependent. The point illustrated here is that the approach shown in Table 6 works well for Event 0 if it occurs very frequently, while it takes more bits if events occur together. It therefore gives the highest priority to the encoding of event 0, and the priority tapers off for the other events. The approach of Table 7 works well if all events have an equal likelihood of occurring. The blocking approach of Table 8 does not take many bits, but loses visibility into the details of the events.
The exact trade-offs between the various encoding schemes can be made based on the architecture and the variations most users are interested in.
The timing stream may be used to capture pipeline advances and recording of contributing stall cycles. These stalls are traced when the PC is traced. The trace of data trace values is not allowed concurrent with stall profiling as that stream is used for holding the reasons for the stalls. In a generic mode encoding scheme, all stall groups take up around the same number of bits.
A timing stream is shown where a “0” is an active cycle. In normal encoding a “1” can, therefore represent a stall cycle.
Bits [7:0]=00111000 is a timing packet. A “1” in the timing stream implies there is at least one contributing stall group active. At the first active cycle after that, the last contributing stall that was active (last stall standing, or LSS) will be encoded and stored. The encoding of this information is based on Table 10. The information is stored in the data part of the data trace FIFO if required. It should be noted that in this mode, tracing of the data values themselves is disabled. In Table 10, for example, L0 implies LSS group 0.
TABLE 10
| Stall groups | Data FIFO | Generic encoding (MSB:LSB) | Implication |
| 1 | not used | not used | L0 |
| 2 | 1 bit | 0 | L0 |
| | | 1 | L1 |
| 3 | 1-2 bits | 0 | L0 |
| | | 01 | L1 |
| | | 11 | L2 |
| 4 | 1-3 bits | 00 | L0 |
| | | 01 | L1 |
| | | 11 | L2 |
| | | 10 | L3 |
Generic encoding should be used when all the events have equal probability of occurring.
In prioritized mode encoding, fewer bits are used for some stall groups while other stall groups may take up more bits. This enables high frequency stall events to take up fewer bits, thus decreasing the stress on the available bandwidth. A classic example of this would be misses from the local cache (high frequency), versus misses from the external memory (low frequency).
A timing stream is shown where a “0” is an active cycle. In normal encoding a “1” can, therefore represent a stall cycle.
Bits [7:0]=00111000 is a timing packet.
A “1” in the timing stream implies there is at least one contributing stall group active. At the first active cycle after that, the last contributing stall that was active (last stall standing) will be encoded and stored. The encoding of this information is based on Table 11. The information is stored in the data part of the data trace FIFO if required. It should be noted that in this mode, tracing of the data values themselves is disabled. In Table 11, for example, L0 implies LSS group 0.
TABLE 11
| Stall groups | Data FIFO | Prioritized encoding (MSB:LSB) | Implication |
| 1 | not used | not used | L0 |
| 2 | 1 bit | 0 | L0 |
| | | 1 | L1 |
| 3 | 1-2 bits | 0 | L0 |
| | | 01 | L1 |
| | | 11 | L2 |
| 4 | 1-3 bits | 0 | L0 |
| | | 01 | L1 |
| | | 011 | L2 |
| | | 111 | L3 |
Prioritized encoding can be used if there is a mix of long and short stalls. This method is skewed toward efficiently sending out a specific event. It is slightly less efficient in sending out rest of the events. This encoding should be used for the case where one event either does not cause any stalls, or happens very frequently with very little stall duration. The longer stalls can be put in the group that take more bits to encode. The shorter stalls can be put in a group that takes fewer bits to be encoded. An example of this is L2 miss which is a long stall, versus L1D stall which is a short stall.
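The trade-off between the generic and prioritized stall encodings for four stall groups may be illustrated by computing the expected number of data bits per reported stall. The code tables are taken from the four stall group rows of Tables 10 and 11; the function name `expected_bits` and both probability distributions are hypothetical values chosen only for illustration.

```python
# Data FIFO codes (MSB:LSB) for four stall groups.
GENERIC = {"L0": "00", "L1": "01", "L2": "11", "L3": "10"}
PRIORITIZED = {"L0": "0", "L1": "01", "L2": "011", "L3": "111"}


def expected_bits(codes, probabilities):
    """Average data bits per stall report for a given stall distribution."""
    return sum(probabilities[group] * len(code)
               for group, code in codes.items())


# Hypothetical workloads: one dominated by L0 stalls, one evenly spread.
skewed = {"L0": 0.70, "L1": 0.20, "L2": 0.05, "L3": 0.05}
uniform = {"L0": 0.25, "L1": 0.25, "L2": 0.25, "L3": 0.25}

skewed_generic = expected_bits(GENERIC, skewed)
skewed_prioritized = expected_bits(PRIORITIZED, skewed)
uniform_generic = expected_bits(GENERIC, uniform)
uniform_prioritized = expected_bits(PRIORITIZED, uniform)
```

Under the skewed distribution the prioritized codes average about 1.4 bits against 2 bits for the generic codes, while under the uniform distribution they average 2.25 bits, which is why the frequent, short stalls belong in the group with the shortest code.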
External events can occur on an active or stall cycle. They need to be marked in the stream to indicate the position of their occurrence. The timing stream can be adjusted to send out that information. Some of the restrictions of this mode are:
Any packet can be terminated due to an external event.
The pattern matching and event profiling stream is shown in Table 12. The definition of C3 and C5 changes in these modes.
TABLE 12
| 11 | C1 | C2 | Packet 0 [4:0] |
| 10 | C3 | C0 | Packet 1 [6:0] |
The control bits definition for C0 defining the modes, stays the same as shown in Table 13:
TABLE 13
| C0 | Function |
| 0 or does not exist | Pattern mode |
| 1 | Pattern type either type “1010” (A) or “0101” (5) |
Mode 1 uses pattern length matching. The basic mode definition stays the same. It has been enhanced such that the timing packet will be sent out also if the event happens to fall at a pattern boundary. In which case, the event will be reported for the last of the pattern match counts.
If the event does not occur at a pattern boundary, the current timing pattern packets are rejected. In parallel with it, the 2