level 1 cache

12 results


pages: 894 words: 190,485

Write Great Code, Volume 1 by Randall Hyde

AltaVista, business process, Donald Knuth, John von Neumann, level 1 cache, locality of reference, machine readable, Von Neumann architecture, Y2K

Instead, when the CPU requests data from memory, the L1 cache subsystem takes over. If the requested data is in the cache, then the L1 cache subsystem returns the data to the CPU, and that concludes the memory access. If the requested data is not present in the L1 cache, then the L1 cache subsystem passes the request on down to the L2 cache subsystem. If the L2 cache subsystem has the data, it returns this data to the L1 cache, which then returns the data to the CPU. Note that requests for the same data in the near future will be fulfilled by the L1 cache rather than the L2 cache because the L1 cache now has a copy of the data.
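
The lookup order described above can be sketched in a few lines of C. The line size, cache sizes, and direct-mapped organization are illustrative assumptions made for brevity, not details taken from the book:

-----------------------------------------------------------------------
/* Sketch of the lookup order: try L1, then L2, then main memory,
 * filling the upper levels on the way back.  Line size, cache sizes,
 * and the direct-mapped layout are assumptions, not book details. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64                /* assumed cache-line size   */
#define L1_LINES   512               /* 32 KB L1, direct-mapped   */
#define L2_LINES   4096              /* 256 KB L2, direct-mapped  */

typedef struct { uint64_t tag; int valid; } line_t;

static line_t l1[L1_LINES], l2[L2_LINES];

/* Returns 1 for an L1 hit, 2 for an L2 hit, 3 for a miss to memory. */
static int access_addr(uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    line_t *e1 = &l1[line % L1_LINES];
    line_t *e2 = &l2[line % L2_LINES];

    if (e1->valid && e1->tag == line)
        return 1;                    /* L1 returns the data; done      */
    if (e2->valid && e2->tag == line) {
        *e1 = *e2;                   /* L2 hands the line to L1 ...    */
        return 2;                    /* ... which hands it to the CPU  */
    }
    e2->tag = line; e2->valid = 1;   /* fetched from main memory       */
    *e1 = *e2;                       /* and installed in both caches   */
    return 3;
}

int main(void)
{
    int first  = access_addr(0x1000);   /* misses everywhere: prints 3 */
    int second = access_addr(0x1000);   /* now served by L1: prints 1  */
    printf("%d %d\n", first, second);
    return 0;
}
-----------------------------------------------------------------------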

On a 1-GHz processor the L1 cache must respond within one nanosecond if the cache operates with zero wait states (some processors actually introduce wait states in L1 cache accesses, but CPU designers try to avoid this). Accessing data in the L2 cache is always slower than in the L1 cache, and there is always the equivalent of at least one wait state, and up to eight, when accessing data in the L2 cache. There are several reasons why L2 cache accesses are slower than L1 accesses. First, it takes the CPU time to determine that the data it is seeking is not in the L1 cache. By the time it determines that the data is not present in the L1 cache, the memory-access cycle is nearly complete, and there is no time to access the data in the L2 cache.

Second, the circuitry of the L2 cache may be slower than the circuitry of the L1 cache in order to make the L2 cache less expensive. Third, L2 caches are usually 16 to 64 times larger than L1 caches, and larger memory subsystems tend to be slower than smaller ones. All this adds up to additional wait states when accessing data in the L2 cache. As noted earlier, the L2 cache can be as much as one order of magnitude slower than the L1 cache. The L1 and L2 caches also differ in the amount of data the system fetches when there is a cache miss (see Chapter 6). When the CPU fetches data from or writes data to the L1 cache, it generally fetches or writes only the data requested.
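
The cumulative effect of these penalties can be put into numbers with a back-of-the-envelope calculation. Only the 1 ns L1 figure comes from the text; the L2 and main-memory latencies and the hit rates below are illustrative assumptions:

-----------------------------------------------------------------------
/* Back-of-the-envelope average access time.  Only the 1 ns L1 figure
 * comes from the text; the other latencies and the hit rates are
 * illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double l1_ns = 1.0, l2_ns = 5.0, mem_ns = 50.0;  /* assumed latencies */
    double l1_hit = 0.95, l2_hit = 0.90;             /* assumed hit rates */

    /* Standard formulation: each miss pays the next level's latency. */
    double avg = l1_ns + (1.0 - l1_hit) * (l2_ns + (1.0 - l2_hit) * mem_ns);

    printf("average access time: %.2f ns\n", avg);   /* prints 1.50 ns */
    return 0;
}
-----------------------------------------------------------------------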


pages: 757 words: 193,541

The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2 by Thomas A. Limoncelli, Strata R. Chalup, Christina J. Hogan

active measures, Amazon Web Services, anti-pattern, barriers to entry, business process, cloud computing, commoditize, continuous integration, correlation coefficient, database schema, Debian, defense in depth, delayed gratification, DevOps, domain-specific language, en.wikipedia.org, fault tolerance, finite state, Firefox, functional programming, Google Glasses, information asymmetry, Infrastructure as a Service, intermodal, Internet of things, job automation, job satisfaction, Ken Thompson, Kickstarter, level 1 cache, load shedding, longitudinal study, loose coupling, machine readable, Malcom McLean invented shipping containers, Marc Andreessen, place-making, platform as a service, premature optimization, recommendation engine, revision control, risk tolerance, Salesforce, scientific management, seminal paper, side project, Silicon Valley, software as a service, sorting algorithm, standardized shipping container, statistical model, Steven Levy, supply-chain management, systems thinking, The future is already here, Toyota Production System, vertical integration, web application, Yogi Berra

The cache medium simply must be faster than the main medium. A disk can serve as a cache for data that has to be gathered from a remote server, because disks are generally faster than remote retrieval. For example, YouTube videos are cached on many servers around the world to conserve internet bandwidth. Very fast RAM can serve as a cache for normal RAM. For example, the L1 cache of a CPU caches the computer’s main memory. Caches are not just used to improve data lookups. For example, calculations can be cached. A function that does a difficult mathematical calculation might cache recent calculation results if they are likely to be requested again.

5.4.3 Cache Persistence

When a system starts, the cache is usually empty, or cold.
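
The calculation-caching idea above can be sketched in C as a small memoization table. The calculation, the hash, and the table size are invented for the example:

-----------------------------------------------------------------------
/* Caching the results of an expensive calculation (memoization).  The
 * calculation, the hash, and the table size are invented for the
 * example; compile with -lm. */
#include <math.h>
#include <stdio.h>

#define SLOTS 256

static struct { double arg, result; int filled; } memo[SLOTS];

static double expensive(double x)        /* stand-in for hard work */
{
    return sqrt(x) * log(x + 1.0);
}

static double cached_expensive(double x)
{
    unsigned slot = (unsigned long long)(x * 1000.0) % SLOTS; /* crude hash */
    if (memo[slot].filled && memo[slot].arg == x)
        return memo[slot].result;        /* cache hit: skip the work   */
    memo[slot].arg = x;
    memo[slot].result = expensive(x);    /* cache miss: compute, store */
    memo[slot].filled = 1;
    return memo[slot].result;
}

int main(void)
{
    printf("%f\n", cached_expensive(42.0));   /* computed                */
    printf("%f\n", cached_expensive(42.0));   /* answered from the cache */
    return 0;
}
-----------------------------------------------------------------------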

If reading from main memory took 1 second, how long would the other operations take? For extra credit, draw your answer to resemble a calendar or the solar system.
6. Take the data table in Figure 1.10 and add a column that identifies the cost of each item. Scale the costs to the same unit—for example, the cost of 1 terabyte of RAM, 1 terabyte of disk, and 1 terabyte of L1 cache. Add another column that shows the ratio of performance to cost.
7. What is the theoretical model that describes the different kinds of scaling techniques?
8. How do you know when scaling is needed?
9. What are the most common scaling techniques and how do they work? When are they most appropriate to use?

More surprisingly, a loop that reads every element of an array takes approximately the same amount of time to execute as a loop that reads every 16th element of the same array. Even though the latter does only 1/16th the number of operations, the number of cache misses is the same for both loops. Reading blocks of memory from RAM to the CPU’s L1 (Level 1) cache is slow and dominates the run-time of either loop; that the every-element loop is not correspondingly slower is due to special instructions for reading memory sequentially. Above and beyond RAM, virtual memory, and disk issues, we also have to consider the effect of threads, multiple CPUs trying to access the same data, locks, and mutexes.
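
The two loops being compared can be sketched as follows. The array size, element type, and timing method are illustrative choices, and absolute timings will vary by machine:

-----------------------------------------------------------------------
/* The two loops from the passage: touch every element versus every
 * 16th.  With 4-byte ints, 16 elements span a typical 64-byte cache
 * line, so both loops pull the same number of lines in from RAM.  The
 * array size is an assumption chosen to dwarf the caches. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)        /* 256 MB of ints */

int main(void)
{
    int *a = calloc(N, sizeof *a);
    if (!a) return 1;
    long sum = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < N; i++)      sum += a[i];   /* every element */
    clock_t t1 = clock();
    for (size_t i = 0; i < N; i += 16)  sum += a[i];   /* every 16th    */
    clock_t t2 = clock();

    printf("every element: %.3fs  every 16th: %.3fs  (sum=%ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(a);
    return 0;
}
-----------------------------------------------------------------------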


pages: 1,202 words: 144,667

The Linux kernel primer: a top-down approach for x86 and PowerPC architectures by Claudia Salzberg Rodriguez, Gordon Fischer, Steven Smolski

Debian, Dennis Ritchie, domain-specific language, en.wikipedia.org, Free Software Foundation, G4S, history of Unix, Ken Thompson, level 1 cache, Multics, recommendation engine, Richard Stallman

Let's look at their creation:

-----------------------------------------------------------------------------
mm/slab.c
486 static kmem_cache_t cache_cache = {
487     .lists      = LIST3_INIT(cache_cache.lists),
488     .batchcount = 1,
489     .limit      = BOOT_CPUCACHE_ENTRIES,
490     .objsize    = sizeof(kmem_cache_t),
491     .flags      = SLAB_NO_REAP,
492     .spinlock   = SPIN_LOCK_UNLOCKED,
493     .color_off  = L1_CACHE_BYTES,
494     .name       = "kmem_cache",
495 };
496
497 /* Guard access to the cache-chain. */
498 static struct semaphore cache_chain_sem;
499
500 struct list_head cache_chain;
-----------------------------------------------------------------------------

The cache_cache cache descriptor has the SLAB_NO_REAP flag.

= next)) {
2301         next->timestamp = now;
2302         rq->nr_switches++;
2303         rq->curr = next;
2304         ++*switch_count;
2305
2306         prepare_arch_switch(rq, next);
2307         prev = context_switch(rq, prev, next);
2308         barrier();
2309
2310         finish_task_switch(prev);
2311     } else
2312         spin_unlock_irq(&rq->lock);
2313
2314     reacquire_kernel_lock(current);
2315     preempt_enable_no_resched();
2316     if (test_thread_flag(TIF_NEED_RESCHED))
2317         goto need_resched;
2318 }
-----------------------------------------------------------------------

Line 2288

We attempt to get the memory of the new process' task structure into the CPU's L1 cache. (See include/linux/prefetch.h for more information.)

Line 2290

Because we're going through a context switch, we need to inform the current CPU that we're doing so. This allows a multi-CPU device to ensure data that is shared across CPUs is accessed exclusively. This process is called read-copy updating.
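
The prefetch mentioned for line 2288 can be illustrated outside the kernel with GCC's __builtin_prefetch, which is similar in spirit to the kernel's prefetch() helper. The list type below is invented for the example and is not kernel code:

-----------------------------------------------------------------------
/* Illustrative software prefetch: while working on the current node,
 * hint that the next one should be pulled toward the L1 cache.  The
 * list type is invented for the example and is not kernel code. */
struct node { struct node *next; long value; char pad[192]; };

long sum_values(struct node *head)
{
    long total = 0;
    for (struct node *n = head; n; n = n->next) {
        if (n->next)
            __builtin_prefetch(n->next); /* start fetching the next node */
        total += n->value;               /* work that hides the latency  */
    }
    return total;
}

int main(void)
{
    struct node b = { 0, 2, { 0 } };
    struct node a = { &b, 1, { 0 } };
    return (int)sum_values(&a);          /* exits with status 3 */
}
-----------------------------------------------------------------------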


pages: 303 words: 57,177

Hands-On Functional Programming in RUST by Andrew Johnson

anti-pattern, Debian, domain-specific language, don't repeat yourself, Donald Knuth, functional programming, higher-order functions, level 1 cache, premature optimization

More distance (d) in one direction also means an increase in available space of quadratic (d²) or cubic (d³) scale. In other words, building the fridge farther away provides more space for a larger fridge. Bringing this story back to a technical context, here are some latency numbers that every programmer should know (~2012: https://gist.github.com/jboner/2841832):

Request                                  Time
L1 cache reference                       0.5 ns
Branch mispredict                        5 ns
L2 cache reference                       7 ns
Mutex lock/unlock                        25 ns
Main memory reference                    100 ns
Compress 1 KB with Zippy                 3,000 ns
Send 1 KB over 1 Gbps network            10,000 ns
Read 4 KB randomly from SSD              150,000 ns
Read 1 MB sequentially from memory       250,000 ns
Round trip within same datacenter        500,000 ns
Send packet CA -> Netherlands -> CA      150,000,000 ns

Here, we can see in specific numbers that if you want a donut and some coffee, then you could eat 300,000,000 donuts from the fridge next to your computer before taking your first bite from a Danish.


pages: 292 words: 62,575

97 Things Every Programmer Should Know by Kevlin Henney

A Pattern Language, active measures, Apollo 11, business intelligence, business logic, commoditize, continuous integration, crowdsourcing, database schema, deliberate practice, domain-specific language, don't repeat yourself, Donald Knuth, fixed income, functional programming, general-purpose programming language, Grace Hopper, index card, inventory management, job satisfaction, level 1 cache, loose coupling, machine readable, Silicon Valley, sorting algorithm, The Wisdom of Crowds

Modern computer systems are organized as hierarchies of physical and virtual machines, including language runtimes, operating systems, CPUs, cache memory, random-access memory, disk drives, and networks. This table shows the limits on random access time and storage capacity for a typical networked server.

             Access time   Capacity
Register     < 1 ns        64 b
Cache line                 64 B
L1 cache     1 ns          64 KB
L2 cache     4 ns          8 MB
RAM          20 ns         32 GB
Disk         10 ms         10 TB
LAN          20 ms         > 1 PB
Internet     100 ms        > 1 ZB

Note that capacity and speed vary by several orders of magnitude. Caching and lookahead are used heavily at every level of our systems to hide this variation, but they only work when access is predictable.


pages: 1,237 words: 227,370

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

active measures, Amazon Web Services, billion-dollar mistake, bitcoin, blockchain, business intelligence, business logic, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, data science, database schema, deep learning, DevOps, distributed ledger, Donald Knuth, Edward Snowden, end-to-end encryption, Ethereum, ethereum blockchain, exponential backoff, fake news, fault tolerance, finite state, Flash crash, Free Software Foundation, full text search, functional programming, general-purpose programming language, Hacker News, informal economy, information retrieval, Infrastructure as a Service, Internet of things, iterative process, John von Neumann, Ken Thompson, Kubernetes, Large Hadron Collider, level 1 cache, loose coupling, machine readable, machine translation, Marc Andreessen, microservices, natural language processing, Network effects, no silver bullet, operational security, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, SQL injection, statistical model, surveillance capitalism, systematic bias, systems thinking, Tragedy of the Commons, undersea cable, web application, WebSocket, wikimedia commons

Besides reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles. For example, the query engine can take a chunk of compressed column data that fits comfortably in the CPU’s L1 cache and iterate through it in a tight loop (that is, with no function calls). A CPU can execute such a loop much faster than code that requires a lot of function calls and conditions for each record that is processed. Column compression allows more rows from a column to fit in the same amount of L1 cache. Operators, such as the bitwise AND and OR described previously, can be designed to operate on such chunks of compressed column data directly. This technique is known as vectorized processing [58, 49].
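
A minimal C sketch of such a tight loop, operating on plain (uncompressed) bitmap chunks sized to sit comfortably in a 32 KB L1 cache. Real column stores work on compressed encodings, so this only illustrates the shape of the loop:

-----------------------------------------------------------------------
/* A tight, call-free loop over bitmap chunks small enough that all
 * three (two inputs plus the output, 8 KB each) fit comfortably in a
 * 32 KB L1 cache.  Plain 64-bit bitmaps stand in for the compressed
 * column encodings a real engine would use. */
#include <stddef.h>
#include <stdint.h>

#define CHUNK_WORDS 1024             /* 1024 x 8 bytes = 8 KB per chunk */

void bitmap_and(uint64_t *restrict out,
                const uint64_t *restrict a,
                const uint64_t *restrict b)
{
    for (size_t i = 0; i < CHUNK_WORDS; i++)
        out[i] = a[i] & b[i];        /* one AND covers 64 rows at once  */
}

int main(void)
{
    static uint64_t a[CHUNK_WORDS], b[CHUNK_WORDS], out[CHUNK_WORDS];
    a[0] = 0xF0F0; b[0] = 0x00FF;
    bitmap_and(out, a, b);
    return (int)out[0];              /* 0x00F0 */
}
-----------------------------------------------------------------------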


pages: 612 words: 187,431

The Art of UNIX Programming by Eric S. Raymond

A Pattern Language, Albert Einstein, Apple Newton, barriers to entry, bioinformatics, Boeing 747, Clayton Christensen, combinatorial explosion, commoditize, Compatible Time-Sharing System, correlation coefficient, David Brooks, Debian, Dennis Ritchie, domain-specific language, don't repeat yourself, Donald Knuth, end-to-end encryption, Everything should be made as simple as possible, facts on the ground, finite state, Free Software Foundation, general-purpose programming language, George Santayana, history of Unix, Innovator's Dilemma, job automation, Ken Thompson, Larry Wall, level 1 cache, machine readable, macro virus, Multics, MVC pattern, Neal Stephenson, no silver bullet, OSI model, pattern recognition, Paul Graham, peer-to-peer, premature optimization, pre–internet, publish or perish, revision control, RFC: Request For Comment, Richard Stallman, Robert Metcalfe, Steven Levy, the Cathedral and the Bazaar, transaction costs, Turing complete, Valgrind, wage slave, web application

Here's a new one: you want the central data structures and the time-critical loops in your code never to fall out of cache. Consider your target machine as a hierarchy of memory types arranged by distance from the processor. There are the processor's own registers; its instruction pipeline; the level-one (L1) cache; the level-two (L2) cache; possibly a level-three (L3) cache; main memory (what Unix old hands still quaintly call ‘core’); and the disk drives where swap space lives. Technologies like SMP, shared-memory clusters, and non-uniform memory access (NUMA) add more layers to the picture but only widen the overall spread.


pages: 1,409 words: 205,237

Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale by Jan Kunigk, Ian Buss, Paul Wilkinson, Lars George

Amazon Web Services, barriers to entry, bitcoin, business intelligence, business logic, business process, cloud computing, commoditize, computer vision, continuous integration, create, read, update, delete, data science, database schema, Debian, deep learning, DevOps, domain-specific language, fault tolerance, Firefox, FOSDEM, functional programming, Google Chrome, Induced demand, information security, Infrastructure as a Service, Internet of things, job automation, Kickstarter, Kubernetes, level 1 cache, loose coupling, microservices, natural language processing, Network effects, platform as a service, single source of truth, source of truth, statistical model, vertical integration, web application

In this scenario, multiple threads could be running on both physical CPUs, each trying to access a location of memory that is distant for some of these threads and local to others. To improve the speed of repeated access, some of this remote memory will naturally reside in the local processor’s L3, L2, or L1 caches, but this comes at the cost of additional overhead in the coherency protocol, which now also spans the inter-processor connect. Linux provides tools and interfaces through which users and programs can influence NUMA behavior directly. Crucially, this allows for requesting an optimal mapping for applications on a given processor, which in NUMA terminology is called a NUMA node (not to be confused with a core or a thread within a processor).
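
A minimal sketch of requesting such a mapping from C with the libnuma API (the numactl command-line tool exposes similar controls). The node number and allocation size are illustrative choices, not recommendations from the text:

-----------------------------------------------------------------------
/* Minimal sketch of pinning work and memory to one NUMA node with the
 * libnuma C API (link with -lnuma).  Node 0 and the allocation size
 * are illustrative choices. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    numa_run_on_node(0);                      /* run on node 0's CPUs       */
    size_t sz = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(sz, 0);     /* memory local to node 0     */
    if (!buf) return 1;
    memset(buf, 0, sz);                       /* fault the pages in locally */
    numa_free(buf, sz);
    return 0;
}
-----------------------------------------------------------------------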


pages: 1,065 words: 229,099

Real World Haskell by Bryan O'Sullivan, John Goerzen, Donald Bruce Stewart

bash_history, database schema, Debian, distributed revision control, domain-specific language, duck typing, en.wikipedia.org, Firefox, functional programming, general-purpose programming language, Guido van Rossum, higher-order functions, job automation, Larry Wall, lateral thinking, level 1 cache, machine readable, p-value, panic early, plutocrats, revision control, sorting algorithm, SQL injection, transfer pricing, type inference, web application, Yochai Benkler

This represents a string of data as a list of chunks, arrays of up to 64 KB in size. Each ByteString type performs better under particular circumstances. For streaming a large quantity (hundreds of megabytes to terabytes) of data, the lazy ByteString type is usually best. Its chunk size is tuned to be friendly to a modern CPU’s L1 cache, and a garbage collector can quickly discard chunks of streamed data that are no longer being used. The strict ByteString type performs best for applications that are less concerned with memory footprint or that need to access data randomly.

Binary I/O and Qualified Imports

Let's develop a small function to illustrate some of the ByteString API.


pages: 385 words: 112,842

Arriving Today: From Factory to Front Door -- Why Everything Has Changed About How and What We Buy by Christopher Mims

air freight, Airbnb, Amazon Robotics, Amazon Web Services, Apollo 11, augmented reality, autonomous vehicles, big-box store, blue-collar work, Boeing 747, book scanning, business logic, business process, call centre, cloud computing, company town, coronavirus, cotton gin, COVID-19, creative destruction, data science, Dava Sobel, deep learning, dematerialisation, deskilling, digital twin, Donald Trump, easy for humans, difficult for computers, electronic logging device, Elon Musk, Frederick Winslow Taylor, fulfillment center, gentrification, gig economy, global pandemic, global supply chain, guest worker program, Hans Moravec, heat death of the universe, hive mind, Hyperloop, immigration reform, income inequality, independent contractor, industrial robot, interchangeable parts, intermodal, inventory management, Jacquard loom, Jeff Bezos, Jessica Bruder, job automation, John Maynard Keynes: Economic Possibilities for our Grandchildren, Joseph Schumpeter, Kaizen: continuous improvement, Kanban, Kiva Systems, level 1 cache, Lewis Mumford, lockdown, lone genius, Lyft, machine readable, Malacca Straits, Mark Zuckerberg, market bubble, minimum wage unemployment, Nomadland, Ocado, operation paperclip, Panamax, Pearl River Delta, planetary scale, pneumatic tube, polynesian navigation, post-Panamax, random stow, ride hailing / ride sharing, robot derives from the Czech word robota Czech, meaning slave, Rodney Brooks, rubber-tired gantry crane, scientific management, self-driving car, sensor fusion, Shenzhen special economic zone , Shoshana Zuboff, Silicon Valley, six sigma, skunkworks, social distancing, South China Sea, special economic zone, spinning jenny, standardized shipping container, Steve Jobs, supply-chain management, surveillance capitalism, TED Talk, the scientific method, Tim Cook: Apple, Toyota Production System, traveling salesman, Turing test, two-sided market, Uber and Lyft, Uber for X, uber lyft, Upton Sinclair, vertical integration, warehouse automation, warehouse robotics, workplace surveillance

The principles of computer architecture that are inherent in the design of the Kiva robotics systems (which is now the Amazon Robotics system) are fractal: these principles repeat themselves at every scale of the system, from the actions of individual robots, to the behavior of all of the robots in a fulfillment center, to the behavior of Amazon’s entire network of warehouses, which are constantly rebalancing inventory between each other as well as within themselves. Something similar happens in computer chip design. Chips have small local caches of data—known as level 1 caches—immediately accessible to processors and also caches slightly more removed both physically and temporally, including levels 2, 3, and 4, depending on their design. These caches contain copies of data that also exists in main memory (RAM) or on the computer’s (spinning or solid-state) hard drive.


pages: 400 words: 121,988

Trading at the Speed of Light: How Ultrafast Algorithms Are Transforming Financial Markets by Donald MacKenzie

algorithmic trading, automated trading system, banking crisis, barriers to entry, bitcoin, blockchain, Bonfire of the Vanities, Bretton Woods, Cambridge Analytica, centralized clearinghouse, Claude Shannon: information theory, coronavirus, COVID-19, cryptocurrency, disintermediation, diversification, en.wikipedia.org, Ethereum, ethereum blockchain, family office, financial intermediation, fixed income, Flash crash, Google Earth, Hacker Ethic, Hibernia Atlantic: Project Express, interest rate derivative, interest rate swap, inventory management, Jim Simons, level 1 cache, light touch regulation, linked data, lockdown, low earth orbit, machine readable, market design, market microstructure, Martin Wolf, proprietary trading, Renaissance Technologies, Satoshi Nakamoto, Small Order Execution System, Spread Networks laid a new fibre optics cable between New York and Chicago, statistical arbitrage, statistical model, Steven Levy, The Great Moderation, transaction costs, UUNET, zero-sum game

In respect to money, Andresen said, “There’s that whole hacker ethos of ‘I’m above money,’ ” but unlike many cases in which “they affect it because it’s considered cool,” he found it to be genuine in Levine’s case: “I’d be like, ‘Josh, here’s your bonus check,’ [and] he was like, ‘Ah, give it to those guys.’ ” (“Those guys” were the programmers who had taken over much of the programming burden from Levine; email to author from Levine, September 1, 2012.)

19. See https://colin-scott.github.io/personal_website/research/interactive_latency.html, which gives an estimate of 23 nanoseconds for a level 1 “cache reference,” using the technology of 1996 (accessed January 21, 2020). I am hugely grateful to interviewee AF for pointing me to this site, which is a wonderful resource for someone interested in computing’s materiality.

20. A “two-phase commit” is the procedure by which a system writes information onto a disk or other form of memory and receives back a message acknowledging that the information has indeed been written.

21.


pages: 523 words: 143,139

Algorithms to Live By: The Computer Science of Human Decisions by Brian Christian, Tom Griffiths

4chan, Ada Lovelace, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, Albert Einstein, algorithmic bias, algorithmic trading, anthropic principle, asset allocation, autonomous vehicles, Bayesian statistics, behavioural economics, Berlin Wall, Big Tech, Bill Duvall, bitcoin, Boeing 747, Charles Babbage, cognitive load, Community Supported Agriculture, complexity theory, constrained optimization, cosmological principle, cryptocurrency, Danny Hillis, data science, David Heinemeier Hansson, David Sedaris, delayed gratification, dematerialisation, diversification, Donald Knuth, Donald Shoup, double helix, Dutch auction, Elon Musk, exponential backoff, fault tolerance, Fellow of the Royal Society, Firefox, first-price auction, Flash crash, Frederick Winslow Taylor, fulfillment center, Garrett Hardin, Geoffrey Hinton, George Akerlof, global supply chain, Google Chrome, heat death of the universe, Henri Poincaré, information retrieval, Internet Archive, Jeff Bezos, Johannes Kepler, John Nash: game theory, John von Neumann, Kickstarter, knapsack problem, Lao Tzu, Leonard Kleinrock, level 1 cache, linear programming, martingale, multi-armed bandit, Nash equilibrium, natural language processing, NP-complete, P = NP, packet switching, Pierre-Simon Laplace, power law, prediction markets, race to the bottom, RAND corporation, RFC: Request For Comment, Robert X Cringely, Sam Altman, scientific management, sealed-bid auction, second-price auction, self-driving car, Silicon Valley, Skype, sorting algorithm, spectrum auction, Stanford marshmallow experiment, Steve Jobs, stochastic process, Thomas Bayes, Thomas Malthus, Tragedy of the Commons, traveling salesman, Turing machine, urban planning, Vickrey auction, Vilfredo Pareto, Walter Mischel, Y Combinator, zero-sum game

If you make a library bigger, it takes longer to find a book in the library. If you have a stack of papers on your desk that’s bigger, it takes longer to find the paper you’re looking for, right? Caches are actually a solution to that problem.… For example, right now, if you go to buy a processor, what you’ll get is a Level 1 cache and a Level 2 cache on the chip. The reason that there are—even just on the chip there are two caches!—is that in order to keep up with the cycle rate of the processor, the first-level cache is limited in size. Unavoidably, the larger a memory is, the more time it takes to search for and extract a piece of information from it.