Design and Performance Analysis of a Fast 4-Way Set Associative Cache Controller using Tree Pseudo Least Recently Used Algorithm

ABSTRACT


INTRODUCTION
Cache memory is an integral component of modern computing systems, bridging the speed gap between fast processors and slow main memory, especially given the exponential growth in processing power described by Moore's Law. Even though the number of transistors doubles approximately every two years, main memory speeds have not kept pace, resulting in potential performance bottlenecks [1,2]. Cache memory functions as a high-speed buffer to counteract this, enabling processors to retrieve data quickly. The cache controller is essential for optimizing the performance of the cache memory. This vital component, directed by a Finite State Machine (FSM), coordinates the data flow and operations between the processor, main memory, and cache memory.
The performance of a cache memory system is highly dependent on the effectiveness of the cache controller, which is primarily determined by its architectural design. The mapping function is the foundation of this design. The simplest form of mapping, direct mapping, assigns each main memory block to a single specific cache line. However, when multiple memory blocks compete for the same cache line, this one-to-one mapping can result in a high cache miss rate. In contrast, fully associative mapping, which necessitates searching all cache lines for the desired data, can be time-consuming and detrimental to performance. The intermediate solution, set-associative mapping, is not without difficulties: too many ways complicate the replacement policy, while too few ways increase the miss rate. The existing literature does not comprehensively cover the design of a 4-way set associative cache controller in VHDL. Therefore, this research aims to design an FSM-based 4-way set associative cache controller using VHDL, verify its behavior through Quartus Lite Edition timing diagrams, and evaluate its performance via ModelSim simulation.

CACHE MEMORY PRINCIPLES
This section explores the fundamental principles of cache memory, including its mapping functions, the nuances of replacement algorithms and write policies, and the cache controller's central role in the overall system.

Cache Mapping Functions
Mapping main memory blocks to cache lines is a non-trivial task that requires specific algorithms, especially because there are far fewer cache lines than main memory blocks. The architecture of the cache is determined by the selected mapping function [14,15]. Address mapping schemes were emphasized in [10] as pivotal cache parameters for this allocation. Three primary mapping techniques are detailed: direct mapping, fully associative mapping, and set-associative mapping. In direct mapping, each main memory block is mapped to exactly one cache line, determined by the block number modulo the number of cache lines. This method is straightforward because it requires no replacement decision; however, its rigidity can cause frequent block swaps when a program alternately accesses different blocks that map to the same line, a phenomenon known as "thrashing." Fully associative mapping is more flexible than direct mapping, allowing a block to be copied to any cache line. When no lines are free, a replacement algorithm comes into play.
Figure 1. The k-way Set-Associative Cache Organization [16]
Set-associative mapping combines the advantages of the direct and fully associative techniques while minimizing their drawbacks. It divides the cache lines into sets of a predetermined size k. A specific block can be mapped to any line within a particular set but is restricted to that set, as shown in Figure 1. Cache replacement algorithms may also be employed here when no line in the set is available. This method interprets a memory address via three fields: Tag, Set, and Word. The cache controller identifies the set through the Set field, compares the tag of each line within that set, and then uses the Word field to select the byte within the line. Notably, the set-associative approach degenerates into direct mapping or fully associative mapping depending on the number of sets and the number of lines per set. Two-way set-associative organization is the most common, with four-way set-associative organization offering a modest additional improvement at a relatively small added cost.
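As a brief worked illustration of this scheme (the numbers here are generic, not taken from [16]), a cache with $m$ lines organized into $k$-way sets contains $v = m/k$ sets, and main memory block $j$ maps to set

$$i = j \bmod v.$$

For example, with $m = 16$ lines and $k = 4$ ways there are $v = 4$ sets, so block 13 maps to set $13 \bmod 4 = 1$, and any of the four lines in set 1 may hold it.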

Replacement Algorithms and Write Policy
Cache memory design relies on the replacement algorithm [17]. This algorithm selects the cache line to be replaced by the incoming main memory block. Least Frequently Used (LFU), First-In-First-Out (FIFO), and Least Recently Used (LRU) are popular algorithms in this area. LRU stands out due to its intuitive nature and the underlying principle that recent memory accesses are a good predictor of near-future accesses, potentially optimizing hit ratios. LRU is described as replacing the cache line least recently accessed [18]. Several replacement algorithms are discussed in this section.
First-In-First-Out (FIFO) is one of the simplest cache replacement algorithms. It operates on the age principle, replacing the oldest item in the cache when a new item needs to be loaded. The algorithm assumes that data brought into the cache earlier is less likely to be used again soon. Assuming a queue structure for the cache with items $x_1, x_2, \ldots, x_n$, where $x_1$ is the oldest and $x_n$ is the most recent, the FIFO algorithm can be defined as

$$\text{evict } x_v \text{ such that } t_{\text{in}}(x_v) = \min_i t_{\text{in}}(x_i),$$

where $t_{\text{in}}(x_i)$ is the time at which item $x_i$ entered the cache. Random Replacement does exactly as its name suggests: it replaces a randomly chosen item in the cache when a new item is to be loaded. Due to its unpredictable nature, it might outperform more complex algorithms in certain scenarios, especially when access patterns do not exhibit strong temporal locality. The algorithm can be formulated as

$$\text{evict } x_v \text{ with } v \sim \text{Uniform}\{1, \ldots, n\},$$

where $n$ is the cache size.
The Optimal Replacement algorithm is theoretical and serves as a benchmark. It replaces the item that will not be accessed for the longest duration in the future. While it is not feasible in practice (it requires knowledge of future requests), it sets an upper performance limit. The algorithm can be formulated as

$$\text{evict } x_v \text{ such that } t_{\text{next}}(x_v) = \max_i t_{\text{next}}(x_i),$$

where $t_{\text{next}}(x_i)$ denotes the future time at which item $x_i$ will next be accessed.
Least Recently Used (LRU) is based on the principle that if an item has not been accessed recently, it is less likely to be accessed in the near future. It replaces the least recently accessed item when a new item is loaded. It can be defined as

$$\text{evict } x_v \text{ such that } t_{\text{last}}(x_v) = \min_i t_{\text{last}}(x_i),$$

where $t_{\text{last}}(x_i)$ is the time at which item $x_i$ was last accessed. Tree Pseudo-Least Recently Used (PLRU) approximates the LRU policy and is commonly used for set-associative caches. Instead of keeping perfect track of the exact LRU order, which can be hardware-intensive, Tree PLRU uses a binary tree to guide replacement decisions [19,20], making it more efficient to implement in hardware for larger set-associative caches. For a set with $2^n$ lines, PLRU maintains a binary tree whose $2^n$ leaves represent the cache lines. Each cache access updates the state of the tree: an internal node's state, set to '0' or '1', indicates which of its two subtrees (and hence which group of cache lines) was accessed last. The algorithm determines the pseudo-least-recently-used line by following the path from the root according to these states, at each node descending toward the side accessed less recently.
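To make the mechanism concrete, the following is a minimal VHDL sketch of the 3-bit tree PLRU logic for a single 4-way set; the entity and signal names are illustrative assumptions and do not reproduce the paper's actual code.

library ieee;
use ieee.std_logic_1164.all;

entity plru4 is
  port (
    clk       : in  std_logic;
    reset     : in  std_logic;
    access_en : in  std_logic;                     -- asserted on a hit or a line fill
    way_used  : in  std_logic_vector(1 downto 0);  -- way just accessed (0 to 3)
    victim    : out std_logic_vector(1 downto 0)   -- way to replace on the next miss
  );
end entity plru4;

architecture rtl of plru4 is
  -- b(0) is the root node, b(1) covers ways 0/1, b(2) covers ways 2/3.
  -- A bit value of '0' means the LRU candidate lies in the lower-numbered half.
  signal b : std_logic_vector(2 downto 0) := (others => '0');
begin
  -- Update: every node on the accessed path is pointed away from the accessed way.
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        b <= (others => '0');
      elsif access_en = '1' then
        if way_used(1) = '0' then        -- way 0 or 1 was accessed
          b(0) <= '1';                   -- LRU candidate now among ways 2/3
          b(1) <= not way_used(0);       -- point at the sibling of the accessed way
        else                             -- way 2 or 3 was accessed
          b(0) <= '0';                   -- LRU candidate now among ways 0/1
          b(2) <= not way_used(0);
        end if;
      end if;
    end if;
  end process;

  -- Victim selection: walk the tree from the root following the node bits.
  victim(1) <= b(0);
  victim(0) <= b(1) when b(0) = '0' else b(2);

end architecture rtl;

A usage note: on every cache hit or line fill the controller asserts access_en with the way number, and on a miss it reads victim to decide which of the four lines in the set to evict.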
When a cache-resident block is to be replaced, there are two possible cases. If the block has not been modified, it can simply be overwritten in the cache without updating the main memory. If the block has undergone at least one write operation, however, the main memory must be updated before the new block is brought in. Diverse write policies address these scenarios, each with its advantages and disadvantages. Two principal concerns arise. First, multiple devices, such as I/O modules, may access the main memory, so a word modified only in the cache can leave the memory copy invalid. Second, the situation becomes more complex with multiple CPUs on a shared bus, each with its own cache: changing a word in one cache can invalidate the corresponding word in other caches.
The "write-through" technique is simple.It ensures that both the cache and the main memory are updated simultaneously, preserving the integrity of the memory.Additionally, it enables other processor-cache modules to monitor main memory traffic, ensuring cache consistency.In contrast, the "write-back" technique uses additional flags in each cache line, such as Type (differentiating between data and instructions), Valid (indicating valid cache line data), Lock (preventing line replacement when set), and Dirty (denoting data written to cache but not main memory).As explained in [21], the write-back policy minimizes memory writes by updating the main memory only when the dirty bit is active during block replacement.The write-back policy struggles with error handling without an Error Correcting Code (ECC), and the write-through approach can excessively congest traffic [8,22].

Cache Controller
The cache controller is an essential piece of hardware that manages data transfers between the CPU, main memory, and cache memory. Whenever the processor attempts to access a specific location in the main memory, the cache controller first determines whether the data is present in the cache. If present, the data is transmitted directly to the processor; otherwise, the data is retrieved from the main memory and the cache is updated [3]. In addition, the cache controller is responsible for monitoring the cache miss rate [12,23].
The cache controller comprises a Finite State Machine (FSM), a Tag cache, a General Mux (GMux), and the LRU controller unit. The FSM manages read and write operations for the cache and main memory, directing requests to the GMux and the LRU controller unit for set associativity [18,24]. The Tag cache organizes data into distinct fields within the controller, including tag bits, a valid bit, a dirty bit, and LRU bits for each data cache line. The FSM generates signals such as "Dwith," which allow the GMux to connect the input and output data buses. When set-associative mode is enabled, the LRU controller identifies the way used least recently, while the FSM manages read and write operations for each set.

PROPOSED FAST 4-WAY SET ASSOCIATIVE CACHE CONTROLLER
This research focuses on modeling the cache controller to validate its functionality and simulate its performance. This modeling utilizes HDL because it can represent behavior at multiple abstraction levels. Abstraction is a pillar of engineering; it facilitates the comprehension of system operations without delving into complex internal processes [25]. In addition, omitting the specifics of the low-level implementation ensures that the simulation can be executed within a reasonable timeframe. Our research uses the Register Transfer Level, where the HDL specifies the transfer and transformation of data across and within subsystems in response to system inputs.

Cache Controller and Cache Memory Specifications
Before designing the cache controller, it is necessary to define its inputs and outputs, as shown in Table 1. This project uses 4-way set-associative mapping for the cache memory, which consists of 16 cache lines of 256 bytes each. The cache is divided into four sets, each containing four cache lines. Write-Through for write hits and Write-Around for write misses are the chosen write policies. For the replacement policy, the design adopts the Tree Pseudo-LRU (PLRU) algorithm. In addition, the cache controller includes a Tag Array that stores the tags to be compared with the address provided by the processor. Figure 2 shows the complete RTL layout. Standard bus specifications include 32-bit data buses, a 32-bit address bus (of which only 10 bits are used by the memory), and a 128-bit read data block.
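For reference, the key geometry parameters listed above could be captured in a VHDL constants package such as the minimal sketch below; the package and constant names are illustrative assumptions, not the paper's code.

package cache_config is
  constant LINE_SIZE_BYTES : natural := 256;  -- bytes per cache line
  constant NUM_LINES       : natural := 16;   -- total cache lines
  constant NUM_WAYS        : natural := 4;    -- 4-way set associative
  constant NUM_SETS        : natural := NUM_LINES / NUM_WAYS;  -- = 4 sets
  constant DATA_WIDTH      : natural := 32;   -- processor data bus width in bits
  constant ADDR_WIDTH      : natural := 32;   -- address bus width (only 10 bits used by the memory)
  constant BLOCK_WIDTH     : natural := 128;  -- read data block width in bits
end package cache_config;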

System Design Using Finite State Machine
After defining the cache controller and memory specifications, the next step is to design the Finite State Machine (FSM). FSMs are among the most powerful digital circuit constructs due to their capacity to represent and manage distinct system states. In the context of this study, the FSM allows us to precisely define the behavior of the cache controller based on its inputs. The architecture of our cache controller builds on prior research and consists of an FSM that performs four fundamental operations: retrieving addresses from the processor, reading data from both the cache and the main memory, writing data to the cache and the main memory, and finally returning the requested data to the processor. Figure 3 depicts a representative state diagram of the cache controller.

Integrated System Design and Development
Using Quartus Lite software, the intended circuit is described textually, using the fundamental Finite State Machine (FSM) to organize the VHDL design into distinct procedural blocks for the different circuits. Upon compilation, Quartus transforms the VHDL code into a tangible circuit of logic elements, providing a schematic representation and preparing it for FPGA implementation. A subsequent step entails the creation of a Test Bench, specific to the module code, to direct the simulator, ModelSim, to evaluate the operational behavior of the circuit without delving into timing nuances. Timing remains crucial, with clock stall metrics highlighting the model's performance; extensive clock stalls indicate potential processor data retrieval delays. Because Quartus is an Intel tool, those with the necessary hardware can implement the model on an Intel FPGA device, with the final phase focusing on the programming required to finalize the system's architecture.
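As a minimal illustration of how the four-operation controller FSM described above might be coded before compilation, the VHDL sketch below defines a simple state machine; all state names, signals, and transition conditions are illustrative assumptions and do not reproduce the design in Figure 3.

library ieee;
use ieee.std_logic_1164.all;

entity cache_fsm is
  port (
    clk, reset : in  std_logic;
    rd, wr     : in  std_logic;   -- processor read / write requests
    hit        : in  std_logic;   -- tag comparison result for the selected set
    mem_ready  : in  std_logic;   -- main memory has completed its access
    stall      : out std_logic;   -- hold the processor while memory is busy
    mem_rd     : out std_logic;   -- read a block from main memory
    mem_wr     : out std_logic    -- write data to main memory
  );
end entity cache_fsm;

architecture rtl of cache_fsm is
  type state_t is (IDLE, COMPARE, READ_MEM, WRITE_MEM, RESPOND);
  signal state : state_t := IDLE;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= IDLE;
      else
        case state is
          when IDLE =>                         -- wait for a processor request
            if rd = '1' or wr = '1' then
              state <= COMPARE;
            end if;
          when COMPARE =>                      -- check the tag array for a hit
            if rd = '1' and hit = '1' then
              state <= RESPOND;                -- read hit: data comes straight from the cache
            elsif rd = '1' then
              state <= READ_MEM;               -- read miss: fetch the block from main memory
            else
              state <= WRITE_MEM;              -- write hit (write-through) or write miss (write-around)
            end if;
          when READ_MEM | WRITE_MEM =>         -- stall until main memory finishes
            if mem_ready = '1' then
              state <= RESPOND;
            end if;
          when RESPOND =>                      -- return data / release the processor
            state <= IDLE;
        end case;
      end if;
    end if;
  end process;

  -- Moore-style outputs derived from the current state.
  stall  <= '1' when state = READ_MEM or state = WRITE_MEM else '0';
  mem_rd <= '1' when state = READ_MEM else '0';
  mem_wr <= '1' when state = WRITE_MEM else '0';

end architecture rtl;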

Netlist View
We now examine the circuit's netlist. This netlist, which serves as a blueprint for the interconnection of the system's components, is the result of successful code synthesis and compilation. The system's architecture consists of three essential components: the cache controller, the cache memory, and the main memory. Figure 4 depicts how the individual entities are integrated to form the overarching "top entity" of the system, each entity having its own significance and design considerations. The focus then shifts to the second entity, the cache controller, which consists of two segments, as depicted in Figure 6. The cache memory, a crucial component of our system, interfaces with the memory banks to optimize data retrieval and storage.

RESULTS AND DISCUSSION
In our examination of the performance of the top entity, we simulated the behavior of a non-pipelined processor using a Test Bench. This Test Bench generated continuous Read/Write data requests to the memory, effectively simulating the "Load" and "Store" instructions inherent to every program a processor executes. Our tests revealed that the cache controller operated as intended. Several notable observations were made during the research. First, the Read/Write Miss and Hit signals were generated accurately. Second, all four ways for the Read/Write data operations were fully functional. In addition, the PLRU algorithm demonstrated its effectiveness by replacing cache lines as anticipated. Notably, we did not encounter incoherent or inconsistent data during our tests. To ensure a thorough evaluation, we applied three sets of test vectors covering the read hit, read miss, write hit, and write miss scenarios. The following subsections analyze these scenarios across all three sets.

First Set of Test Vector Analysis
The initial set of test vectors mirrored those of a previously cited study to facilitate a direct comparison with prior research. The overarching objective was to identify potential inconsistencies or flaws in either study. The results showed:
• Read Miss: For addresses not present in the cache, the data block is fetched from the main memory, stalling the processor for 3 cycles and taking a total of 5.5 cycles for the read operation.
• Read Hit: If the address is in the cache, data is fetched directly, requiring 3 cycles.
• Write Hit: When the target address is in the cache, data is written and simultaneously copied to the main memory, owing to the Write-Through policy. This operation takes 3.5 cycles, with a 2-cycle processor stall.
Figure 7 zeroes in on the specific instance test_addr(0) <= x"00000201", a situation in which a read miss is registered. The processor's incoming address is mirrored in the tb_processor/addr signal. The rd signal, set to '1', indicates a read operation, which consumes 3 cycles. The difficulty arises when the desired address is absent from the cache: this triggers the transfer of an entire data block from the main memory to the cache before the data is sent to the processor. While effective at ensuring data integrity, this process introduces a delay, as the cache controller holds the processor for three additional cycles. This results in a total of 5.5 cycles for a read operation when a miss occurs. It highlights the inherent impact of cache misses on system performance and the importance of optimizing cache management to reduce their occurrence.
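For context, a processor-side stimulus of this kind can be generated with a small Test Bench process such as the sketch below; the signal names follow those quoted above (test-style addresses, rd), while the entity name, clock period, wait durations, and the omitted instantiation of the top entity are illustrative assumptions.

library ieee;
use ieee.std_logic_1164.all;

entity tb_processor is
end entity tb_processor;

architecture sim of tb_processor is
  signal clk  : std_logic := '0';
  signal rd   : std_logic := '0';
  signal wr   : std_logic := '0';
  signal addr : std_logic_vector(31 downto 0) := (others => '0');
begin
  clk <= not clk after 5 ns;  -- illustrative 10 ns clock period

  stimulus : process
  begin
    -- Read request to an address that is not yet cached: expect a read miss,
    -- a 3-cycle processor stall, and 5.5 cycles in total for the operation.
    addr <= x"00000201";
    rd   <= '1';
    wait for 60 ns;           -- allow the block fill from main memory to finish
    rd   <= '0';

    -- Repeat the same address: the block is now resident, so a 3-cycle read hit.
    wait for 20 ns;
    addr <= x"00000201";
    rd   <= '1';
    wait for 30 ns;
    rd   <= '0';

    wait;                     -- stop generating stimulus
  end process stimulus;
end architecture sim;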

Second Set of Test Vector Analysis
This set introduced the 'Write Miss' operation, which was not covered in the previous set. The analysis showed:
• Read Miss: As in the first set, data retrieval from the main memory resulted in a 3-cycle processor stall, with the read operation consuming 4.5 cycles.
• Read Hit: Consistent with the first set, direct data retrieval from the cache took 3 cycles.
• Write Miss: When the desired address was not in the cache, data was written directly to the main memory. This process took 3.5 cycles, stalling the processor for 2 cycles.
• Write Hit: As in the first set, a 3.5-cycle operation was observed, with data written to the cache and copied to the main memory, resulting in a 2-cycle processor stall.
Figure 8 illustrates a scenario involving a write miss, characterized by the specific instance test_addr(11) <= x"00000207". The crux of this situation lies in the absence of the desired address from the cache memory, which directly classifies the write request as a miss. Instead of using the cache, the data is written straight to the main memory, denoted by test_data(11) <= x"44444444". A critical observation here concerns the wr signal, set to '1'. This signal denotes a write operation, which requires three cycles as deduced from the results. Despite being direct, this procedure introduces an overhead: the processor is stalled for an additional 2 cycles, giving a write-miss access time of 3.5 cycles. It illustrates the inherent overhead of write misses and highlights the need to optimize cache strategies to reduce such occurrences and boost overall system efficiency.

Third Set of Test Vector Analysis
The third set provided a concise overview of the cache controller's behavior, allowing observation of specific scenarios such as data updates on the same address. The analysis showed:
• Read Miss and Hit: Behaviors were consistent with the previous sets.
• Write Miss and Hit: The cache controller responded correctly to write requests, showcasing its ability to handle data updates on the same address.
The third set of test vectors provides narrower coverage than the first and second sets, and the rationale behind its design is an important consideration. Although limited in scope, this set is designed to examine the behavior of the cache controller in four distinct scenarios: read miss, read hit, write miss, and write hit, with the added complexity of data updates on identical addresses. This deliberate limitation reflects an emphasis on depth rather than breadth. The value of such a strategy is evident: by narrowing the scope, observational granularity increases, allowing a more in-depth understanding of the cache controller's operation in particular scenarios rather than a broader but potentially superficial view.
Figure 9 provides additional visual distinction. Each scenario is color-coded: blue represents read misses, red represents read hits, green represents write misses, and orange represents write hits. Although this color coding is intuitive, one may question whether these choices accommodate all potential viewers, including those with color blindness. Read hits are the most efficient operations, requiring only three cycles and causing no processor stall. In contrast, read misses require data retrieval from the main memory, resulting in a 5.5-cycle operation. Write hits and write misses incurred similar costs, stalling the processor for two cycles and requiring 3.5 cycles per access. The design's Write-Through policy ensured data consistency by writing to the main memory on every write hit. In addition, the Write-Around policy became evident during write misses, in which data was written directly to the main memory without updating the cache. Table 2 compares our design with other works, revealing how the cycle efficiency of cache interactions has evolved. A relatively unbalanced cycle distribution is evident in [3], with read misses dominating at 9 cycles, nearly double the nearest metric (read hits at 4 cycles). This could indicate a system that, while robust, may struggle in scenarios involving frequent data access. In contrast, [5] displays a more balanced cycle distribution, particularly between write operations. Its reduced cycle count for read misses (7 cycles) compared to [3] suggests read-efficiency optimizations; however, it also features a significantly higher cycle count for write operations, which may indicate a design trade-off.
The proposed design provides convincing evidence of a refined approach. With cycle counts of 3 and 3.5 cycles for read and write hits, respectively, it establishes a balance between read and write operations, thereby addressing potential inefficiencies in both. Moreover, the reduction of the read miss latency to 5.5 cycles demonstrates an optimized read mechanism. In essence, the proposed design integrates the lessons from the two previous papers, combining their respective strengths into a system that promises speed and balanced performance across diverse cache interactions.
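To put these cycle counts in perspective, the standard average-access-time relation can be applied to the read path under an assumed hit ratio $h$ (the hit ratio itself is not a figure reported in this work):

$$\bar{T}_{\text{read}} = h \cdot 3 + (1 - h) \cdot 5.5 \ \text{cycles}.$$

For example, a 90% read hit ratio would give an average of $0.9 \times 3 + 0.1 \times 5.5 = 3.25$ cycles per read, illustrating how a lower miss penalty translates directly into a lower average access time.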
Beyond the comparative analysis outlined in Table 2, it is important to consider the implications of an FPGA implementation for our proposed design. The present investigation is based on ModelSim simulations; however, it is expected that adopting an FPGA platform, such as the Cyclone V, would verify and potentially strengthen these findings. An FPGA implementation provides a concrete setting in which the design can be evaluated under realistic conditions, yielding valuable insight into timing, power consumption, and overall system integration.
The cycle efficiency results demonstrate that our simulation accurately models the behavior of the cache controller in a variety of scenarios. By providing a controlled environment in which each parameter can be manipulated precisely, simulation enables a thorough understanding of the design's performance and establishes a strong foundation for subsequent FPGA implementation, during which the design's practical viability will be evaluated further. The shift from simulation to FPGA implementation is also expected to bring additional benefits: testing on FPGAs can provide valuable information about the scalability and adaptability of the design across hardware configurations, and it facilitates real-time debugging and testing, which can lead to more immediate and tangible design improvements. Hence, although the present investigation centers on performance analysis via simulation, implementing our cache controller design on an FPGA is an essential next step. In addition to validating the simulation results, it will furnish an all-encompassing understanding of the design's operational efficacy. By combining comprehensive simulations with rigorous hardware testing, the dependability of the proposed design can be established with confidence.

CONCLUSIONS
The primary objective of this study was to develop and evaluate a cache controller that uses a Tree Pseudo Least Recently Used (PLRU) replacement policy with a 4-way set associative mapped cache. An extensive literature review was conducted to identify effective cache controller design techniques; this led to the selection of four ways per cache set, which balances complexity, performance, and cost while optimizing latency. During the practical implementation phase, Quartus Prime 16.1 Lite Edition was used to develop VHDL code for the cache controller, main memory, and cache memory entities. The conversion of abstract principles into a concrete design during this critical stage was supported by compilation, synthesis, and ModelSim simulations for validation. The timing diagrams generated by these simulations served as the foundation for our performance evaluation. Comparative analysis with prior research demonstrated the effectiveness of our design, specifically the reduction of read miss latency to 5.5 cycles and the attainment of balanced latencies for write operations. It is important to note that the performance of the cache controller is affected by many factors, such as device specifications and the chosen replacement and write policies. Our investigation was limited to operation cycles; internal delay considerations were disregarded. Overall, this research contributes to the field of cache controller design by demonstrating, through simulation, the efficacy and feasibility of the 4-way set associative mapped cache controller combined with the Tree PLRU policy. Future work will implement and examine this architecture on an FPGA platform, thereby providing a thorough assessment of its practical efficacy. This stage is critical to verify the feasibility and effectiveness of the design in real-world situations, connecting theoretical analysis with practical implementation and contributing to the advancement of cache controller technology.


Figure 2. RTL View of the Entire System

Figure 3. Cache Controller State Diagram

Figure 4. Top Entity of the System
Figure 5.

Figure 9. Simulation result for the third set of test vectors

Table 1. I/O Ports for the Cache Controller

Table 2. Comparison with other works