Reducing Sensitivity to NoC Latency in NUCA Caches
Pierfrancesco Foglia*, Giacomo Gabrielli*, Francesco Panicucci†, Marco Solinas*
{foglia, giacomo.gabrielli, marco.solinas}@iet.unipi.it, [email protected]Members of the HiPEAC European Network of ExcellenceAbstract
proposed schemes [4, 5]. Taking into account theconsiderations derived from the performed analysis, we
Non Uniform Cache Architectures (NUCA) are a novel
derive an alternative NUCA organization based on bank
design paradigm for large last-level on-chip caches which
clustering that is better performing and it is able to reduce
have been introduced to deliver low access latencies in
the strong sensitivity to the cut-through latency of routers. wire-delay dominated environments. Typically, NUCA caches make use of a network-on-chip (NoC) to connect 1.1. NUCA caches the different sub-banks and the cache controller. This work analyzes how different network parameters, namely
Non Uniform Cache Architectures have been proposed
hop latency and buffering capacity of routers, affect the
as a novel design paradigm for large last-level on-chip
overall performance of NUCA-based systems for the
caches [4] in order to reduce the effects of wire delays,
single processor case, assuming a reference NUCA
which significantly limit the performance scaling of
organization derived from previous works. This analysis
today’s high clock frequency microprocessors [6]. This is
leads to some important design guidelines: first, the
achieved by the adoption of a storage structure partitioned
sensitivity of the system to the hop latency is very high and router architectures with long pipelines, suitable for
independently accessible entity, and by the adoption of a
throughput-oriented applications, are not adequate;
fast interconnection network to connect the banks and the
second, limited buffering capacity is sufficient to achieve
cache controller. The access latency exhibited by a NUCA
a good performance level. As a consequence, in this work
cache is a function of the physical location of the
we propose an alternative NUCA organization based on
requested line. The mapping between cache lines and
bank clustering that limits the average number of hops
banks can be either static (S-NUCA) or dynamic (D-
experienced by cache accesses. This organization is better
NUCA). In a D-NUCA cache a line can be located in one
performing in most of the cases and scales better as the
of a set of allowed bank locations, which collectively form
cut-through latency increases, thus simplifying the
a bank set, and each bank of the bank set behaves like a
single way of a set associative cache [4]. Lines can dynamically migrate from one bank to another, provided
1. Introduction
that it belongs to the pertaining bank set, and the migration is triggered by a certain number of consecutive
This paper presents an analysis of the impact of two
line accesses. Different implementation policies have been
main NoC parameters on the performance of NUCA
proposed for D-NUCA, i.e. mapping policies, line search
caches, i.e. the cut-through latency of routers (the latency
policies and migration policies [4]; in order to keep the
to deliver a traffic unit from an input channel to an output
number of variable parameters reasonably low, in this
channel in a no-load condition) and their buffering
study a specific set of policies has been selected for D-
capacity. While several implementations of NoCs have
NUCA, leading to a configuration that is a good tradeoff
been proposed and adopted so far in other scenarios, such
between performance and complexity. The selected
as system-on-chips [1] and tiled architectures [2, 3],
policies are: simple mapping, with each row of banks
NUCA caches are an emerging technology and, as far as
making up a bank set; broadcast search; promotion in the
we know, none of the previous studies clearly focused on
adjacent bank upon each hit (1 bank/1 hit).
the impact of the characteristics of the network routers on the system performance. The analysis described in this
1.2. Networks-on-chip for NUCA caches
paper shows that different implementations of network routers can significantly affect the overall performance of
A viable solution to connect the banks and the
single processor systems adopting a reference NUCA L2
controller of a NUCA cache is represented by a NoC. The
cache, whose structure has been derived by previously
NoC paradigm tends to favour the reuse of design and
* Università di Pisa, Dipartimento di Ingegneria dell’Informazione, Via Diotisalvi 2, 56122 Pisa (Italy)† IMT Lucca, Institute for Advanced Studies, Piazza S. Ponziano 6, 55100 Lucca (Italy)
verification efforts, which is particularly important for modern VLSI processes. In addition, the resulting interconnection scheme is more scalable than traditional approaches based on broadcast media, such as busses and rings. The intrinsic features of NUCA caches introduce constraints on the design of the NoC, in particular on the design of the network routers. These constraints impact on the characteristics of the network itself, such as topology, routing, and flow control, but, primarily, they are influenced by the way with which last-level on-chip caches are accessed by the CPU. A fundamental property of the NUCA on-chip network is that it is self-throttling
Figure 1. Partial 2D mesh topology. The NUCA structure represented here is made up of 64 banks (8x8). The black circles depict the network routers.
interconnects. In fact, non-blocking caches are able to support only a limited number of outstanding misses, therefore the number of simultaneous requests on the last-
For the NUCA architectures considered in this work, a
level cache is limited by the number of outstanding misses
request packet is first propagated along the vertical
supported by the higher level. This number is determined
dimension (vertical links in Figure 1), then it is
by the number and size of the Miss Status Holding
propagated along the horizontal dimension (horizontal
Registers (MSHRs) [8], which are used to keep track of
links in Figure 1); reply packets follow the same path. For
the pending misses. From these considerations, we might
D-NUCA caches, since a bank set is mapped to a single
expect the network traffic offered to the network to be
row of banks, first a flit has to reach the pertaining bank
quite moderate. Since the access latency is the
set, then it is propagated to the nodes attached to the
fundamental performance metric of a NUCA cache,
banks of its bank set, starting from the nearest one to the
together with the hit rate, we also might expect latency,
cache controller. This causes the global access latency to
instead of bandwidth, to be the primary design goal for the
raise as the distance of the requested cache line from the
switching elements of the network, in order to build fast
first node of the pertaining bank set increases.
NUCA caches. However, as far as we know, none of the
The network routers are assumed to be input buffered
previous studies put the emphasis on the impact of the
and the buffers are managed on a per-flit basis in a FIFO
router parameters on the performance of NUCA-based
manner; a crossbar switch is adopted to minimize
systems and it is not clear how the characteristics
contention on output channels; the flow control is credit-
described above translate into constraints on the network
based. A detailed model of the router architecture
design. Jin et al. [12] have focused on NoC-related
described so far has been incorporated into the selected
aspects of NUCA, but in their work a fixed single-cycle
router architecture is considered and the effects of this choice on the overall system performance are not
Table 1. Configuration parameters for the CPU and the memory hierarchy
reported; in this sense, our work can be considered as
Parameter 2. Methodology
The analysis described in this paper assumes a
64 KB, 2-way s.a., 64B line, 1 cycle hit lat.
reference NUCA structure whose topology is derived from
a 2D mesh, which will be called partial 2D mesh in the
following, since only a subset of the links of a full 2D
mesh are employed in order to reduce the area overhead
(Figure 1). The sole injection point of the network is the
L2 cache controller, which is assumed to be directly
attached to the external DRAM controller.
with acc. time = 13, cyc. time = 3 (cycles)
The reference NoC architecture is based on a
wormhole scheme, with routing and flow control policies
with acc. time = 11, cyc. time = 2 (cycles)
working on a per-flit basis. The size of a flit is assumed to be equal to the link width, and the links are bidirectional.
For this study, we selected three different L2 cache
The routing scheme is deterministic, dimension ordered.
architectures, i.e. UCA, S-NUCA and D-NUCA; for each one, we selected the best performing configuration,
assuming a constant cache size fixed at 8 Mbytes and a
Focusing on a single value for hop latency, e.g. 2
line size of 64 bytes. The design space exploration for this
cycles, it is possible to quantitatively evaluate the
step comprised several parameters: global associativity,
performance degradation due to limited buffer capacity
bank associativity, number of banks and their organization
with respect to the ideal router case (infinite buffer
in rows and columns. The configurations of the simulated
capacity), for both S-NUCA and D-NUCA. Figure 3
highlights this degradation, reporting the normalized IPC
The values of bank access latency and wire delay were
with respect to the ideal router case with infinite buffer
obtained from CACTI 5.1 [9], which derives the
capacity. The resulting performance degradation is
technological parameters for devices and wires from the
negligible even for the 5 flits per channel buffer capacity;
projections of the ITRS report [10].
for both S-NUCA and D-NUCA the degradation is less
The selected simulation platform is an extended
version of sim-alpha [11], which is able to model NUCA
cache architectures and the related NoC traffic with cycle-
accurate fidelity. We selected L2 cache intensive
applications from the SPEC CPU2000 and NAS Parallel
Benchmarks suites (applu, art, bt, bzip2, cg, equake,
galgel, gcc, mcf, mesa, mgrid, parser, perlbmk, sp, twolf)
and we simulated a representative phase of each
application; to identify the run phases we applied the same
Figure 3. IPC vs. buffer capacity for 1 cycle cut-through 3. Results latency.
Figure 2 shows the average IPC (Instructions Per
The limited performance sensitivity to the amount of
Cycle) for the entire workload as the cut-through latency
buffering resources may be explained by analyzing the
varies from 0 (when the hop latency is given only by the
average network traffic, in terms of buffer occupancy.
wire delay) to 5 clock cycles for 10 flits per channel
Figure 4 shows the distribution of the buffer occupancy
buffer capacity. We can highlight that the overall system
for two applications: we selected the parser benchmark,
performance for NUCA is highly sensitive to the hop
which exhibits a moderate load on the network, as
latency. While D-NUCA always outperforms S-NUCA,
witnessed by the utilization of the link that experiences the
the performance of NUCA-based architectures rapidly
highest occupancy (the link is occupied for the 1.9% of
decreases from a simulation node to the next. For 2 cycles
the time), and the gcc benchmark, which experiences a
cut-through latency, S-NUCA is less performing than
relatively higher load (being the link with highest
UCA, while the benefits of employing a D-NUCA are
occupancy transmitting for the 15.9% of time). The
poor (only 2.7% improvement over UCA). This high
configuration consists of a D-NUCA architecture, with
sensitivity witnesses that the NoC latency has strong
single-cycle cut-through latency and infinite buffering
effects on the overall system performance, while the
capacity. The queue length distribution is shown for the
latency of bank accesses becomes less influential as we
router located at the injection point of the network, which
move towards higher latencies for hops.
shows the highest average queue length for all the applications, since this router has to propagate all the
10 flits-per-channel buffer capacity
traffic generated by the cache controller. We selected the
queue that experiences the highest average occupancy
w.r.t. the other queues of the router. For the gcc
benchmark, the queue length at the injection point is null
(meaning that no buffering resources are occupied) in
90.43% of the time; the maximum measured queue length
is 17 flits, but a queue longer than 5 flits is found with a
very low frequency (less than 0.6% of the time), while a
queue longer than 10 flits is found with a frequency lower
Cut-through latency (cycles)
than 0.001%. The parser benchmark experiences an even
lower load condition, being the maximum measured queue
Figure 2. IPC vs. cut-through latency for 10 flits-per-
length 7 flits, but with an occupancy of more than 5 flits
channel buffer capacity
being found only in the 0.002% of the time. parser (b)
architectures. Except for the null cut-through latency case,
the clustered scheme always outperforms the reference
one. These results indicate that the minimal cut-through
latency constraint can be relaxed, as this configuration is
much more scalable w.r.t. the reference architecture. 10 flits-per-channel buffer capacity queue length, n. of flits queue length, n. of flits Figure 4. Distribution of buffer occupancy for gcc (a) and parser (b). The percentage of total execution time spent for each occupancy state is shown. The selected queues belong to the injection point of the NUCA on- chip network. Data refer to a D-NUCA architecture, with single-cycle cut-through latency and infinite buffering capacity. Cut-through latency (cycles) 4. Reducing sensitivity to NoC latency Figure 6. IPC vs. cut-through latency for the clustered scheme (4BPN = 4 banks per node).
One of the most effective ways to mitigate the high
Even for this architecture, the considerations about
sensitivity to NoC latency is to reduce the average number
buffer occupancy are the same as for the non-clustered
of hops that cache accesses experience. This can be
approach: the system exhibits very poor sensitivity to the
achieved by reducing the number of cache banks
(assuming a constant cache capacity, this means that the
The clustered scheme, while being better performing,
size of banks is increased) or clustering the banks so that
reduces the number of network routers, thus leading to a
each cluster is attached to a network node, while keeping
simpler implementation. The additional overhead is given
the bank size fixed. Since a wire-delay dominated
by the additional ports to connect each router to its local
environment put strong constraints on the topology, the
banks (3 additional ports w.r.t. the traditional scheme).
only relevant scheme that we take into account for the
However, a solution based on the multiplexing of a single
clustered approach is a configuration with 4 banks per
port to connect to the local banks could be used,
cluster, as depicted in Figure 5. The partitioning of the
employing a simple arbiter. We performed a set of
address space inside a single cluster is obtained by
simulations which indicated that the performance
checking the least significant bits from the index field of
degradation due to the loss of parallelism introduced by
the address. For D-NUCA caches, in order to achieve a
this solution is negligible: for instance, for a D-NUCA
significant improvement, we also introduced an alternative
with a single-cycle cut-through latency and infinite
logical organization, which involves the way with which
buffering capacity, the performance degradation is only
lines are mapped onto cache banks: with the clustered
approach each bank set is mapped onto a row of clusters, and each column of clusters now behaves like a single
5. Acknowledgements
We wish to thank the anonymous reviewers for their
helpful and valuable comments. We also wish to thank Stephen Keckler who furnished us with the initial version of the modified sim-alpha simulator, José Duato for his suggestions on our work, and Cristian Croce for helping us in the development of the simulation platform.
This work is partially supported by the SARC project
funded by the European Union under contract no. 27648. Figure 5. Partial 2D mesh topology with 4 banks per 6. References node (clustered approach).
[1] L. Benini and G. De Micheli. Networks on chips: a
Figure 6 reports the performance achieved by the new
new SoC paradigm. IEEE Computer, 35(1):70–78, 2002.
scheme, when applied to both S-NUCA and D-NUCA
[2] S. Vangal et al. An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS. In Digest of Technical Papers, International Solid-State Circuits Conference (ISSCC), pages 98–589, 2007. [3] K. Sankaralingam et al. Distributed microarchitectural protocols in the TRIPS prototype processor. In Proceedings of the 39th International Symposium on Microarchitecture (MICRO), pages 480–491, 2006. [4] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 211–222, 2002. [5] A. Bardine, P. Foglia, G. Gabrielli, C. A. Prete, and P. Stenstrom. Improving power efficiency of D-NUCA caches. ACM SIGARCH Computer Architecture News, 35(4):53–58, 2007. [6] V. Agarwal, M. S. Hrishikesh, S.W. Keckler, and D. Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In Proceedings of the 27th International Symposium on Computer Architecture(ISCA), pages 248–259, 2000. [7] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003. [8] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th Annual Symposium on Computer Architecture (ISCA), pages 81–87, 1981. [9] S. Thoziyoor, N. Muralimanohar, and N. P. Jouppi. CACTI 5.0. Technical report, HP, 2007. [10]International
Semiconductors, 2005 Edition Report. [11]R. Desikan, D. Burger, S. Keckler, and T. Austin. Sim-alpha: a validated, execution-driven Alpha 21264 simulator. Technical report, Department of Computer Sciences, University of Texas at Austin, 2001. [12]Y. Jin, E.J. Kim, and K.H. Yum. A domain-specific on-chip network design for large scale cache systems. In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA), pages 318–327, 2007.
In an effort to promote the appropriate use of certain drugs and to help better manage the cost of expensive drugs, the ConnectiCare Pharmacy & Therapeutics Committee has developed a list of prescription drugs that require prior authorization. Prior authorization requests must be faxed to ConnectiCare’s Pharmacy Services department at 860-674-2851 or toll free 800-249-1367 by the prescribin
Excerpts from Screaming Freedom by Allen Michael Hines. Available from J.B. Solomon Editions at www.lulu.com/screamingfreedom. Raindrops slam the pavement of my patio like dimes falling from Sears Tower. I am made rich. I cannot speak now. I will have surgery tomorrow to ease a catheter up my spine. The catheter will ooze poison to relax my muscles. When I wake in post op, I will be richer,