Reducing Sensitivity to NoC Latency in NUCA Caches
Pierfrancesco Foglia*, Giacomo Gabrielli*, Francesco Panicucci†, Marco Solinas*
{foglia, giacomo.gabrielli, marco.solinas}@iet.unipi.it, [email protected]
Members of the HiPEAC European Network of Excellence
* Università di Pisa, Dipartimento di Ingegneria dell’Informazione, Via Diotisalvi 2, 56122 Pisa (Italy)
† IMT Lucca, Institute for Advanced Studies, Piazza S. Ponziano 6, 55100 Lucca (Italy)

Abstract
Non Uniform Cache Architectures (NUCA) are a novel design paradigm for large last-level on-chip caches which have been introduced to deliver low access latencies in wire-delay dominated environments. Typically, NUCA caches make use of a network-on-chip (NoC) to connect the different sub-banks and the cache controller. This work analyzes how different network parameters, namely hop latency and buffering capacity of routers, affect the overall performance of NUCA-based systems for the single processor case, assuming a reference NUCA organization derived from previous works. This analysis leads to some important design guidelines: first, the sensitivity of the system to the hop latency is very high, and router architectures with long pipelines, suitable for throughput-oriented applications, are not adequate; second, limited buffering capacity is sufficient to achieve a good performance level. As a consequence, in this work we propose an alternative NUCA organization based on bank clustering that limits the average number of hops experienced by cache accesses. This organization performs better in most cases and scales better as the cut-through latency increases, thus simplifying the …

1. Introduction
This paper presents an analysis of the impact of two main NoC parameters on the performance of NUCA caches, i.e. the cut-through latency of routers (the latency to deliver a traffic unit from an input channel to an output channel in a no-load condition) and their buffering capacity. While several implementations of NoCs have been proposed and adopted so far in other scenarios, such as systems-on-chip [1] and tiled architectures [2, 3], NUCA caches are an emerging technology and, as far as we know, none of the previous studies has clearly focused on the impact of the characteristics of the network routers on the system performance. The analysis described in this paper shows that different implementations of network routers can significantly affect the overall performance of single processor systems adopting a reference NUCA L2 cache, whose structure has been derived from previously proposed schemes [4, 5]. Taking into account the considerations derived from the performed analysis, we derive an alternative NUCA organization based on bank clustering that performs better and is able to reduce the strong sensitivity to the cut-through latency of routers.

1.1. NUCA caches
Non Uniform Cache Architectures have been proposed as a novel design paradigm for large last-level on-chip caches [4] in order to reduce the effects of wire delays, which significantly limit the performance scaling of today’s high clock frequency microprocessors [6]. This is achieved by the adoption of a storage structure partitioned into banks, each being an independently accessible entity, and by the adoption of a fast interconnection network to connect the banks and the cache controller. The access latency exhibited by a NUCA cache is a function of the physical location of the requested line. The mapping between cache lines and banks can be either static (S-NUCA) or dynamic (D-NUCA). In a D-NUCA cache a line can be located in one of a set of allowed bank locations, which collectively form a bank set, and each bank of the bank set behaves like a single way of a set associative cache [4]. Lines can dynamically migrate from one bank to another, provided that the destination bank belongs to the pertaining bank set, and the migration is triggered by a certain number of consecutive line accesses. Different implementation policies have been proposed for D-NUCA, i.e. mapping policies, line search policies and migration policies [4]; in order to keep the number of variable parameters reasonably low, in this study a specific set of policies has been selected for D-NUCA, leading to a configuration that is a good tradeoff between performance and complexity. The selected policies are: simple mapping, with each row of banks making up a bank set; broadcast search; promotion in the adjacent bank upon each hit (1 bank/1 hit).

Figure 1. Partial 2D mesh topology. The NUCA structure represented here is made up of 64 banks (8x8). The black circles depict the network routers.

1.2. Networks-on-chip for NUCA caches
A viable solution to connect the banks and the controller of a NUCA cache is represented by a NoC. The NoC paradigm tends to favour the reuse of design and verification efforts, which is particularly important for modern VLSI processes. In addition, the resulting interconnection scheme is more scalable than traditional approaches based on broadcast media, such as buses and rings. The intrinsic features of NUCA caches introduce constraints on the design of the NoC, in particular on the design of the network routers. These constraints affect the characteristics of the network itself, such as topology, routing, and flow control, but, primarily, they are influenced by the way in which last-level on-chip caches are accessed by the CPU. A fundamental property of the NUCA on-chip network is that it is self-throttling …
interconnects. In fact, non-blocking caches are able to support only a limited number of outstanding misses; therefore the number of simultaneous requests on the last-level cache is limited by the number of outstanding misses supported by the higher level. This number is determined by the number and size of the Miss Status Holding Registers (MSHRs) [8], which are used to keep track of the pending misses. From these considerations, we might expect the traffic offered to the network to be quite moderate. Since the access latency is the fundamental performance metric of a NUCA cache, together with the hit rate, we might also expect latency, rather than bandwidth, to be the primary design goal for the switching elements of the network, in order to build fast NUCA caches. However, as far as we know, none of the previous studies has put the emphasis on the impact of the router parameters on the performance of NUCA-based systems, and it is not clear how the characteristics described above translate into constraints on the network design. Jin et al. [12] have focused on NoC-related aspects of NUCA, but in their work a fixed single-cycle router architecture is considered and the effects of this choice on the overall system performance are not reported; in this sense, our work can be considered as …

2. Methodology
The analysis described in this paper assumes a reference NUCA structure whose topology is derived from a 2D mesh, which will be called partial 2D mesh in the following, since only a subset of the links of a full 2D mesh is employed in order to reduce the area overhead (Figure 1). The sole injection point of the network is the L2 cache controller, which is assumed to be directly attached to the external DRAM controller.

The reference NoC architecture is based on a wormhole scheme, with routing and flow control policies working on a per-flit basis. The size of a flit is assumed to be equal to the link width, and the links are bidirectional. The routing scheme is deterministic, dimension ordered. For the NUCA architectures considered in this work, a request packet is first propagated along the vertical dimension (vertical links in Figure 1), then it is propagated along the horizontal dimension (horizontal links in Figure 1); reply packets follow the same path. For D-NUCA caches, since a bank set is mapped to a single row of banks, a flit first has to reach the pertaining bank set, then it is propagated to the nodes attached to the banks of its bank set, starting from the one nearest to the cache controller. This causes the global access latency to rise as the distance of the requested cache line from the first node of the pertaining bank set increases.

The network routers are assumed to be input buffered and the buffers are managed on a per-flit basis in a FIFO manner; a crossbar switch is adopted to minimize contention on output channels; the flow control is credit-based. A detailed model of the router architecture described so far has been incorporated into the selected …

Table 1. Configuration parameters for the CPU and the memory hierarchy. (Surviving entries: 64 KB, 2-way set associative, 64 B line, 1 cycle hit latency; access time = 13, cycle time = 3 (cycles); access time = 11, cycle time = 2 (cycles).)

For this study, we selected three different L2 cache architectures, i.e. UCA, S-NUCA and D-NUCA; for each one, we selected the best performing configuration, assuming a constant cache size fixed at 8 Mbytes and a line size of 64 bytes. The design space exploration for this step comprised several parameters: global associativity, bank associativity, number of banks and their organization in rows and columns. The configurations of the simulated …

The values of bank access latency and wire delay were obtained from CACTI 5.1 [9], which derives the technological parameters for devices and wires from the projections of the ITRS report [10].

The selected simulation platform is an extended version of sim-alpha [11], which is able to model NUCA cache architectures and the related NoC traffic with cycle-accurate fidelity. We selected L2 cache intensive applications from the SPEC CPU2000 and NAS Parallel Benchmarks suites (applu, art, bt, bzip2, cg, equake, galgel, gcc, mcf, mesa, mgrid, parser, perlbmk, sp, twolf) and we simulated a representative phase of each application; to identify the run phases we applied the same …

3. Results
Figure 2 shows the average IPC (Instructions Per Cycle) for the entire workload as the cut-through latency varies from 0 (when the hop latency is given only by the wire delay) to 5 clock cycles, for a buffer capacity of 10 flits per channel. The overall system performance for NUCA is highly sensitive to the hop latency. While D-NUCA always outperforms S-NUCA, the performance of NUCA-based architectures rapidly decreases from one simulation point to the next. For a cut-through latency of 2 cycles, S-NUCA performs worse than UCA, while the benefits of employing a D-NUCA are poor (only 2.7% improvement over UCA). This high sensitivity shows that the NoC latency has strong effects on the overall system performance, while the latency of bank accesses becomes less influential as we move towards higher hop latencies.

Figure 2. IPC vs. cut-through latency for 10 flits-per-channel buffer capacity. (Horizontal axis: cut-through latency, in cycles.)

Focusing on a single value for hop latency, e.g. 2 cycles, it is possible to quantitatively evaluate the performance degradation due to limited buffer capacity with respect to the ideal router case (infinite buffer capacity), for both S-NUCA and D-NUCA. Figure 3 highlights this degradation, reporting the IPC normalized with respect to the ideal router case with infinite buffer capacity. The resulting performance degradation is negligible even for a buffer capacity of 5 flits per channel; for both S-NUCA and D-NUCA the degradation is less …

Figure 3. IPC vs. buffer capacity for 1 cycle cut-through latency.

The limited performance sensitivity to the amount of buffering resources may be explained by analyzing the average network traffic, in terms of buffer occupancy. Figure 4 shows the distribution of the buffer occupancy for two applications: we selected the parser benchmark, which exhibits a moderate load on the network, as shown by the utilization of the link that experiences the highest occupancy (the link is occupied for 1.9% of the time), and the gcc benchmark, which experiences a relatively higher load (the link with the highest occupancy transmits for 15.9% of the time). The configuration consists of a D-NUCA architecture, with single-cycle cut-through latency and infinite buffering capacity. The queue length distribution is shown for the router located at the injection point of the network, which shows the highest average queue length for all the applications, since this router has to propagate all the traffic generated by the cache controller. We selected the queue that experiences the highest average occupancy w.r.t. the other queues of the router. For the gcc benchmark, the queue length at the injection point is zero (meaning that no buffering resources are occupied) for 90.43% of the time; the maximum measured queue length is 17 flits, but a queue longer than 5 flits is found with a very low frequency (less than 0.6% of the time), while a queue longer than 10 flits is found with a frequency lower than 0.001%. The parser benchmark experiences an even lower load condition: the maximum measured queue length is 7 flits, and an occupancy of more than 5 flits is found only for 0.002% of the time.
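
As a rough illustration of the reasoning above, the following Python sketch shows how an occupancy histogram of the kind plotted in Figure 4 could be turned into the fraction of time a queue would exceed a given buffer depth. The function name fraction_exceeding and the histogram values are hypothetical and chosen only for illustration; they are not data from the simulations reported in this paper.

```python
def fraction_exceeding(histogram, depth):
    """Return the fraction of observed time in which the queue
    holds more than `depth` flits.

    `histogram` maps a queue length (in flits) to the number of
    cycles during which that length was observed.
    """
    total = sum(histogram.values())
    over = sum(cycles for length, cycles in histogram.items() if length > depth)
    return over / total


# Hypothetical occupancy histogram (NOT measured data):
# queue length in flits -> number of cycles observed.
hypothetical_cycles = {0: 9000, 1: 500, 2: 250, 3: 120, 4: 80, 6: 40, 12: 10}

for depth in (5, 10):
    share = fraction_exceeding(hypothetical_cycles, depth)
    print(f"queue longer than {depth} flits: {share:.2%} of the time")
```

With a distribution this strongly concentrated at zero occupancy, the tail beyond a few flits is very small, which is the kind of evidence used above to argue that a limited buffering capacity is sufficient.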

Figure 4. Distribution of buffer occupancy for gcc (a) and parser (b). The percentage of total execution time spent for each occupancy state is shown. The selected queues belong to the injection point of the NUCA on-chip network. Data refer to a D-NUCA architecture, with single-cycle cut-through latency and infinite buffering capacity. (Horizontal axis: queue length, number of flits.)

4. Reducing sensitivity to NoC latency
One of the most effective ways to mitigate the high sensitivity to NoC latency is to reduce the average number of hops that cache accesses experience. This can be achieved by reducing the number of cache banks (assuming a constant cache capacity, this means that the size of banks is increased) or by clustering the banks so that each cluster is attached to a network node, while keeping the bank size fixed. Since a wire-delay dominated environment puts strong constraints on the topology, the only relevant scheme that we take into account for the clustered approach is a configuration with 4 banks per cluster, as depicted in Figure 5. The partitioning of the address space inside a single cluster is obtained by checking the least significant bits of the index field of the address. For D-NUCA caches, in order to achieve a significant improvement, we also introduced an alternative logical organization, which involves the way in which lines are mapped onto cache banks: with the clustered approach each bank set is mapped onto a row of clusters, and each column of clusters now behaves like a single …

Figure 5. Partial 2D mesh topology with 4 banks per node (clustered approach).

Figure 6 reports the performance achieved by the new scheme, when applied to both S-NUCA and D-NUCA architectures. Except for the null cut-through latency case, the clustered scheme always outperforms the reference one. These results indicate that the minimal cut-through latency constraint can be relaxed, as this configuration is much more scalable w.r.t. the reference architecture.

Figure 6. IPC vs. cut-through latency for the clustered scheme (4BPN = 4 banks per node), for 10 flits-per-channel buffer capacity. (Horizontal axis: cut-through latency, in cycles.)

Even for this architecture, the considerations about buffer occupancy are the same as for the non-clustered approach: the system exhibits very poor sensitivity to the …

The clustered scheme, while performing better, reduces the number of network routers, thus leading to a simpler implementation. The additional overhead is given by the additional ports needed to connect each router to its local banks (3 additional ports w.r.t. the traditional scheme). However, a solution based on the multiplexing of a single port to connect to the local banks could be used, employing a simple arbiter. We performed a set of simulations which indicated that the performance degradation due to the loss of parallelism introduced by this solution is negligible: for instance, for a D-NUCA with a single-cycle cut-through latency and infinite buffering capacity, the performance degradation is only …

5.

Acknowledgements
We wish to thank the anonymous reviewers for their helpful and valuable comments. We also wish to thank Stephen Keckler, who furnished us with the initial version of the modified sim-alpha simulator, José Duato for his suggestions on our work, and Cristian Croce for helping us in the development of the simulation platform.
This work is partially supported by the SARC project funded by the European Union under contract no. 27648.

6. References
[1] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. IEEE Computer, 35(1):70–78, 2002.
[2] S. Vangal et al. An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS. In Digest of Technical Papers, International Solid-State Circuits Conference (ISSCC), pages 98–589, 2007.
[3] K. Sankaralingam et al. Distributed microarchitectural protocols in the TRIPS prototype processor. In Proceedings of the 39th International Symposium on Microarchitecture (MICRO), pages 480–491, 2006.
[4] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 211–222, 2002.
[5] A. Bardine, P. Foglia, G. Gabrielli, C. A. Prete, and P. Stenstrom. Improving power efficiency of D-NUCA caches. ACM SIGARCH Computer Architecture News, 35(4):53–58, 2007.
[6] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In Proceedings of the 27th International Symposium on Computer Architecture (ISCA), pages 248–259, 2000.
[7] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.
[8] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th Annual Symposium on Computer Architecture (ISCA), pages 81–87, 1981.
[9] S. Thoziyoor, N. Muralimanohar, and N. P. Jouppi. CACTI 5.0. Technical report, HP, 2007.
[10] International Technology Roadmap for Semiconductors (ITRS), 2005 Edition Report.
[11] R. Desikan, D. Burger, S. Keckler, and T. Austin. Sim-alpha: a validated, execution-driven Alpha 21264 simulator. Technical report, Department of Computer Sciences, University of Texas at Austin, 2001.
[12] Y. Jin, E. J. Kim, and K. H. Yum. A domain-specific on-chip network design for large scale cache systems. In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA), pages 318–327, 2007.

Source: http://garga.iet.unipi.it/paper/2009/cipro%20foglia-noc-nuca-paper.pdf
