exponential backoff

14 results


pages: 725 words: 168,262

API Design Patterns by JJ Geewax

Amazon Web Services, anti-pattern, bitcoin, blockchain, business logic, cognitive load, continuous integration, COVID-19, database schema, en.wikipedia.org, exponential backoff, imposter syndrome, Internet of things, Kubernetes, lateral thinking, loose coupling, machine readable, microservices, natural language processing, Paradox of Choice, ride hailing / ride sharing, social graph, sorting algorithm

Now that we have an idea of the various requests and whether they’re retriable (and in what circumstances), let’s switch topics and start looking at the cases where we’ve decided it’s acceptable to retry a request. In particular, let’s look at how we’ll decide how long to wait before retrying a request.

29.3.2 Exponential back-off

As we learned earlier, exponential back-off is an algorithm by which the delay between requests grows exponentially, doubling each time a request returns an error result. Perhaps the best way to understand this is by looking at some code.

Listing 29.1 Example demonstrating retrial with exponential back-off

async function getChatRoomWithRetries(id: string): Promise<ChatRoom> {
  return new Promise<ChatRoom>(async (resolve, reject) => {
    let delayMs = 1000;                            ❶
    while (true) {
      try {
        return resolve(await GetChatRoom({ id })); ❷
      } catch (e) {
        await new Promise((resolve) => {           ❸
          return setTimeout(resolve, delayMs);
        });
        delayMs *= 2;                              ❹
      }
    }
  });
}

❶ We define the initial delay as 1 second (1,000 milliseconds).
❷ First, attempt to retrieve the resource and resolve the promise.
❸ If the request fails, wait for a fixed amount of time, defined by the delay.
❹ Finally, double the amount of time we’ll need to wait next time.

It’s particularly tricky because we have no idea what’s happening once the request is sent, leaving us to simply make guesses about when the request is more likely to be handled successfully on the other side. Based on these constraints, the most effective, battle-tested algorithm for this problem is exponential back-off. The idea of exponential back-off is pretty simple. For a first retry attempt, we might wait one second before making the request again. After that, for every subsequent failure response we take the time that we waited previously and double it. If a request fails a second time, we wait two seconds and then try again.

We’ll introduce some extra pieces to this algorithm, but the core concept will remain unchanged. As we noted, this algorithm is great when the system on the other side is a complete black box. In other words, if we know nothing else about the system, exponential back-off works quite well. But is that the case with our web API? In truth, this is not the case and, as we’ll see in the next section, there is actually a better solution for a unique subset of scenarios.

29.2.2 Server-specified retry timing

While exponential back-off is still a reasonable algorithm, it turns out that there are certain cases in which there is a better option available. For example, consider the scenario where a request fails because the API service says that this request can only be executed once per minute.
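To sketch that idea (this is an illustration, not the book’s own implementation): assume a hypothetical ApiError whose retryAfterMs field carries the server’s hint, for example parsed from an HTTP Retry-After header; when no hint is present, the client falls back to plain exponential back-off.

// Hypothetical error shape: retryAfterMs would be parsed from a server
// hint such as an HTTP Retry-After header.
interface ApiError extends Error {
  retryAfterMs?: number;
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithServerTiming<T>(fn: () => Promise<T>): Promise<T> {
  let delayMs = 1000; // fallback delay when the server gives no hint
  while (true) {
    try {
      return await fn();
    } catch (e) {
      const hint = (e as ApiError).retryAfterMs;
      // Prefer the server-specified wait; otherwise back off exponentially.
      await sleep(hint ?? delayMs);
      if (hint === undefined) delayMs *= 2;
    }
  }
}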


pages: 523 words: 143,139

Algorithms to Live By: The Computer Science of Human Decisions by Brian Christian, Tom Griffiths

4chan, Ada Lovelace, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, Albert Einstein, algorithmic bias, algorithmic trading, anthropic principle, asset allocation, autonomous vehicles, Bayesian statistics, behavioural economics, Berlin Wall, Big Tech, Bill Duvall, bitcoin, Boeing 747, Charles Babbage, cognitive load, Community Supported Agriculture, complexity theory, constrained optimization, cosmological principle, cryptocurrency, Danny Hillis, data science, David Heinemeier Hansson, David Sedaris, delayed gratification, dematerialisation, diversification, Donald Knuth, Donald Shoup, double helix, Dutch auction, Elon Musk, exponential backoff, fault tolerance, Fellow of the Royal Society, Firefox, first-price auction, Flash crash, Frederick Winslow Taylor, fulfillment center, Garrett Hardin, Geoffrey Hinton, George Akerlof, global supply chain, Google Chrome, heat death of the universe, Henri Poincaré, information retrieval, Internet Archive, Jeff Bezos, Johannes Kepler, John Nash: game theory, John von Neumann, Kickstarter, knapsack problem, Lao Tzu, Leonard Kleinrock, level 1 cache, linear programming, martingale, multi-armed bandit, Nash equilibrium, natural language processing, NP-complete, P = NP, packet switching, Pierre-Simon Laplace, power law, prediction markets, race to the bottom, RAND corporation, RFC: Request For Comment, Robert X Cringely, Sam Altman, scientific management, sealed-bid auction, second-price auction, self-driving car, Silicon Valley, Skype, sorting algorithm, spectrum auction, Stanford marshmallow experiment, Steve Jobs, stochastic process, Thomas Bayes, Thomas Malthus, Tragedy of the Commons, traveling salesman, Turing machine, urban planning, Vickrey auction, Vilfredo Pareto, Walter Mischel, Y Combinator, zero-sum game

Since the maximum delay length (2, 4, 8, 16…) forms an exponential progression, it’s become known as Exponential Backoff. Exponential Backoff was a huge part of the successful functioning of the ALOHAnet beginning in 1971, and in the 1980s it was baked into TCP, becoming a critical part of the Internet. All these decades later, it still is. As one influential paper puts it, “For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working—exponential backoff.”

But it is the algorithm’s other uses that suggest something both more prescriptive and more profound. Beyond just collision avoidance, Exponential Backoff has become the default way of handling almost all cases of networking failure or unreliability. For instance, when your computer is trying to reach a website that appears to be down, it uses Exponential Backoff—trying again one second later, again a few seconds after that, and so forth. This is good for everyone: it prevents a host server that’s down from getting slammed with requests as soon as it comes back online, and it prevents your own machine from wasting too much effort trying to get blood from a stone.

The longer the round-trip time between sender and receiver, the longer it takes a silence to be significant—and the more information can be potentially “in flight” before the sender realizes there’s a problem. In networking, having the parties properly tune their expectations for the timeliness of acknowledgments is crucial to the system functioning correctly. The second question, of course, once we do recognize a breakdown, is what exactly we should do about it.

Exponential Backoff: The Algorithm of Forgiveness

The world’s most difficult word to translate has been identified as “ilunga,” from the Tshiluba language spoken in south-eastern DR Congo.… Ilunga means “a person who is ready to forgive any abuse for the first time, to tolerate it a second time, but never a third time.”


Designing Web APIs: Building APIs That Developers Love by Brenda Jin, Saurabh Sahni, Amir Shevat

active measures, Amazon Web Services, augmented reality, Big Tech, blockchain, business logic, business process, cognitive load, continuous integration, create, read, update, delete, exponential backoff, Google Hangouts, if you build it, they will come, Lyft, machine readable, MITM: man-in-the-middle, premature optimization, pull request, Salesforce, Silicon Valley, Snapchat, software as a service, the market place, uber lyft, web application, WebSocket

If your infrastructure can support additional quota and there is no better way to achieve the same result, you might want to consider giving them an exception.

• Starting with lower rate-limit thresholds is helpful. Increasing rate-limit thresholds is easier than reducing them because reducing them can negatively affect active developer apps.
• Implement exponential back-off in your client SDKs and provide sample code to developers on how to do that. This way, developers are less likely to continue hitting your API when they are rate-limited. For more information, see “Error Handling and Exponential Back-Off” on page 115.
• Use rate limits to reduce the impact of incidents by heavily rate-limiting noncritical traffic during an outage.

Lessons Learned from Slack’s Rate-Limiting

In March 2018, Slack rolled out an evolved rate-limiting system.

Caching Frequently Used Data

You can add support for storing API responses or frequently used data locally in a cache. This can help in reducing the number of API calls you will receive. If you have concerns or policies around what data clients can store, ensuring that the cache automatically expires in a few hours can help.

Error Handling and Exponential Back-Off

Errors are often handled poorly by developers. It’s difficult for developers to reproduce all possible errors during development, and that’s why they might not write code to handle those errors gracefully. As you build your SDK, first, you could implement local checks to return errors on invalid requests.

Some failures, like authorization errors, cannot be addressed by a retry. Your SDK should surface appropriate errors for these failures to the developer. For other errors, it’s simply better for the SDK to automatically retry the API call. To help developers avoid making too many API calls to your server, your SDK should implement exponential back-off. This is a standard error-handling strategy in which client applications periodically retry a failed request over an increasing amount of time. Exponential back-off can help in reducing the number of requests your server receives. When your SDKs implement this, it helps your web application recover gracefully from outages.
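As an illustration, here is a minimal TypeScript sketch of such an SDK retry wrapper. The SdkError shape, the isRetriable choice of status codes (429 and 5xx), and the attempt limit are all assumptions made for the example, not any particular vendor’s API.

// Illustrative error shape; a real SDK would map HTTP responses onto it.
interface SdkError extends Error {
  status: number; // HTTP status code
}

const wait = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumption: 429 and 5xx are transient; auth errors (401/403) are not.
function isRetriable(status: number): boolean {
  return status === 429 || status >= 500;
}

async function requestWithBackoff<T>(
  send: () => Promise<T>,
  maxAttempts = 5
): Promise<T> {
  let delayMs = 500;
  for (let attempt = 1; ; attempt++) {
    try {
      return await send();
    } catch (e) {
      const err = e as SdkError;
      // Surface non-retriable failures (and exhausted retries) to the developer.
      if (!isRetriable(err.status) || attempt >= maxAttempts) throw err;
      await wait(delayMs);
      delayMs *= 2; // exponential back-off between attempts
    }
  }
}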


pages: 719 words: 181,090

Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy

"Margaret Hamilton" Apollo, Abraham Maslow, Air France Flight 447, anti-pattern, barriers to entry, business intelligence, business logic, business process, Checklist Manifesto, cloud computing, cognitive load, combinatorial explosion, continuous integration, correlation does not imply causation, crowdsourcing, database schema, defense in depth, DevOps, en.wikipedia.org, exponential backoff, fail fast, fault tolerance, Flash crash, George Santayana, Google Chrome, Google Earth, if you see hoof prints, think horses—not zebras, information asymmetry, job automation, job satisfaction, Kubernetes, linear programming, load shedding, loose coupling, machine readable, meta-analysis, microservices, minimum viable product, MVC pattern, no silver bullet, OSI model, performance metric, platform as a service, proprietary trading, reproducible builds, revision control, risk tolerance, side project, six sigma, the long tail, the scientific method, Toyota Production System, trickle-down economics, warehouse automation, web application, zero day

When issuing automatic retries, keep in mind the following considerations:

• Most of the backend protection strategies described in “Preventing Server Overload” apply. In particular, testing the system can highlight problems, and graceful degradation can reduce the effect of the retries on the backend.
• Always use randomized exponential backoff when scheduling retries. See also “Exponential Backoff and Jitter” in the AWS Architecture Blog [Bro15]. If retries aren’t randomly distributed over the retry window, a small perturbation (e.g., a network blip) can cause retry ripples to schedule at the same time, which can then amplify themselves [Flo94]. A sketch of this follows the list.
• Limit retries per request.
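A minimal sketch of those last two considerations, in the “full jitter” style described in [Bro15], with a per-request retry cap; the base delay, cap, and retry limit below are illustrative values, not prescribed by the text.

const pause = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithJitter<T>(
  op: () => Promise<T>,
  baseMs = 100,
  capMs = 10_000,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (e) {
      if (attempt >= maxRetries) throw e; // limit retries per request
      // "Full jitter": sleep a uniformly random time in [0, backoff), so a
      // shared trigger can't synchronize clients into retry ripples.
      const backoff = Math.min(capMs, baseMs * 2 ** attempt);
      await pause(Math.random() * backoff);
    }
  }
}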

You might consider some of the following production tests:

• Reducing task counts quickly or slowly over time, beyond expected traffic patterns
• Rapidly losing a cluster’s worth of capacity
• Blackholing various backends

Test Popular Clients

Understand how large clients use your service. For example, you want to know if clients:

• Can queue work while the service is down
• Use randomized exponential backoff on errors
• Are vulnerable to external triggers that can create large amounts of load (e.g., an externally triggered software update might clear an offline client’s cache)

Depending on your service, you may or may not be in control of all the client code that talks to your service. However, it’s still a good idea to have an understanding of how large clients that interact with your service will behave.

As capacity becomes scarce, the service no longer returns pictures alongside text or small maps illustrating where a story takes place. And depending on its purpose, an RPC that times out is either not retried (for example, in the case of the aforementioned pictures), or is retried with a randomized exponential backoff. Despite these safeguards, the tasks fail one by one and are then restarted by Borg, which drives the number of working tasks down even more. As a result, some graphs on the service dashboard turn an alarming shade of red and SRE is paged. In response, SREs temporarily add capacity to the Asian datacenter by increasing the number of tasks available for the Shakespeare job.


Principles of Protocol Design by Robin Sharp

accounting loophole / creative accounting, business process, discrete time, exponential backoff, fault tolerance, finite state, functional programming, Gödel, Escher, Bach, information retrieval, loose coupling, MITM: man-in-the-middle, OSI model, packet switching, quantum cryptography, RFC: Request For Comment, stochastic process

Analysis of this situation (see for example Chapter 4 in [11]) shows that when the (new) traffic generated by the users exceeds a certain value, then the retransmissions themselves cause so many extra collisions that the throughput of the system actually begins to fall as the load of new traffic rises further. This leads to instability in the system, which can only be avoided by reducing the rate of retransmission as the load increases. Doubling the average retransmission delay for each failed attempt is a simple way of doing this. The technical term for this is Binary Exponential Backoff (BEB); or, if the doubling process terminates after a certain maximum number of attempts, as in the case of the ISO/IEEE CSMA/CD protocol, truncated BEB (sketched below). The receiver processes, described by R[i], are much simpler. These are controlled by the signals cs[i], which indicate the presence of an arriving transmission, and nc[i], which indicate the absence of such a transmission.
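An illustrative sketch of truncated BEB in the style of classic CSMA/CD: after the nth collision, the station waits a random number of slot times drawn from [0, 2^min(n, 10)), giving up after 16 attempts. The 10 and 16 are the usual IEEE 802.3 parameters, but treat this as a sketch under those assumptions, not the standard’s text.

// After the nth collision, wait a random number of slot times drawn
// uniformly from [0, 2^min(n, 10)); abort after 16 attempts (truncation).
function backoffSlots(collisions: number): number {
  if (collisions > 16) {
    throw new Error("excessive collisions: give up on this frame");
  }
  const exponent = Math.min(collisions, 10); // cap the doubling
  return Math.floor(Math.random() * 2 ** exponent);
}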

Since poor estimates of the timeout setting give rise to excessive retransmissions, which in turn may cause more congestion and further delays, the retransmission timeout timer value is doubled after each unsuccessful retransmission. As in the case of the CSMA/CD protocol described by Protocol 11, this mechanism, known as Binary Exponential Backoff, ensures stability in the system. After a successful retransmission, the normal way of evaluating the round trip time – and thus the timer value – is resumed.

Fig. 7.25 Congestion control using choking. System X has detected excessive utilisation of the link from X to Y.

Assumed underlying service: CSMA/CD LAN Physical Layer (ISO8802.3).
Connection phase: None.
Data transfer phase: Unacknowledged data transfer.
Disconnection phase: None.
Other features: Statistical time-division multiplexing with distributed control. Collision resolution by random wait with truncated binary exponential backoff.
Coding: Start of DPDU marked by delimiter field. CRC-32 block checksum. Ad hoc binary coding of all fields in DPDU.
Addressing: 16-bit or 48-bit flat addressing. Single-station or group addressing.
Fault tolerance: Corruption of data PDUs.

A number of other LAN MAC protocols, which are described in other parts of the ISO8802/IEEE 802 family of standards, are listed in Table 9.1.


pages: 1,025 words: 150,187

ZeroMQ by Pieter Hintjens

AGPL, anti-pattern, behavioural economics, carbon footprint, cloud computing, Debian, distributed revision control, domain-specific language, eat what you kill, Eben Moglen, exponential backoff, factory automation, fail fast, fault tolerance, fear of failure, finite state, Internet of things, iterative process, no silver bullet, power law, premature optimization, profit motive, pull request, revision control, RFC: Request For Comment, Richard Stallman, Skype, smart transportation, software patent, Steve Jobs, Valgrind, WebSocket

Take a look at the Paranoid Pirate worker in Example 4-13.

Example 4-13. Paranoid Pirate worker (ppworker.c)

//
//  Paranoid Pirate worker
//
#include "czmq.h"
#define HEARTBEAT_LIVENESS  3       //  3-5 is reasonable
#define HEARTBEAT_INTERVAL  1000    //  msec
#define INTERVAL_INIT       1000    //  Initial reconnect
#define INTERVAL_MAX       32000    //  After exponential backoff

//  Paranoid Pirate Protocol constants
#define PPP_READY       "\001"      //  Signals worker is ready
#define PPP_HEARTBEAT   "\002"      //  Signals worker heartbeat

//  Helper function that returns a new configured socket
//  connected to the Paranoid Pirate queue
static void *
s_worker_socket (zctx_t *ctx) {
    void *worker = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_connect (worker, "tcp://localhost:5556");

    //  Tell queue we're ready for work
    printf ("I: worker ready\n");
    zframe_t *frame = zframe_new (PPP_READY, 1);
    zframe_send (&frame, worker, 0);
    return worker;
}

We have a single task that implements the worker side of the Paranoid Pirate Protocol (PPP).

            printf ("W: heartbeat failure, can't reach queue - "
                    "retrying in %zd msec...\n", interval);
            zclock_sleep (interval);
            if (interval < INTERVAL_MAX)
                interval *= 2;
            zsocket_destroy (ctx, worker);
            worker = s_worker_socket (ctx);
            liveness = HEARTBEAT_LIVENESS;
        }
        //  Send heartbeat to queue if it's time
        if (zclock_time () > heartbeat_at) {
            heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
            printf ("I: worker heartbeat\n");
            zframe_t *frame = zframe_new (PPP_HEARTBEAT, 1);
            zframe_send (&frame, worker, 0);
        }
    }
    zctx_destroy (&ctx);
    return 0;
}

Some comments about this example:

The code includes simulation of failures, as before. This makes it (a) very hard to debug, and (b) dangerous to reuse. When you want to debug this code, disable the failure simulation.

The worker uses a reconnect strategy similar to the one we designed for the Lazy Pirate client, with two major differences: it does an exponential backoff, and it retries indefinitely (whereas the client retries a few times before reporting a failure).

You can try the client, queue, and workers by using a script like this:

ppqueue &
for i in 1 2 3 4; do
    ppworker &
    sleep 1
done
lpclient &

You should see the workers die one by one as they simulate a crash, and the client eventually give up.

In the zmq_poll() loop, whenever we pass this time, we send a heartbeat to the queue. Here’s the essential heartbeating code for the worker:

#define HEARTBEAT_LIVENESS  3       //  3-5 is reasonable
#define HEARTBEAT_INTERVAL  1000    //  msec
#define INTERVAL_INIT       1000    //  Initial reconnect
#define INTERVAL_MAX       32000    //  After exponential backoff
...
//  If liveness hits zero, queue is considered disconnected
size_t liveness = HEARTBEAT_LIVENESS;
size_t interval = INTERVAL_INIT;

//  Send out heartbeats at regular intervals
uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;

while (true) {
    zmq_pollitem_t items [] = { { worker, 0, ZMQ_POLLIN, 0 } };
    int rc = zmq_poll (items, 1, HEARTBEAT_INTERVAL * ZMQ_POLL_MSEC);

    if (items [0].revents & ZMQ_POLLIN) {
        //  Receive any message from queue
        liveness = HEARTBEAT_LIVENESS;
        interval = INTERVAL_INIT;
    }
    else
    if (--liveness == 0) {
        zclock_sleep (interval);
        if (interval < INTERVAL_MAX)
            interval *= 2;
        zsocket_destroy (ctx, worker);
        ...
        liveness = HEARTBEAT_LIVENESS;
    }
    //  Send heartbeat to queue if it's time
    if (zclock_time () > heartbeat_at) {
        heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
        //  Send heartbeat message to queue
    }
}

The queue does the same, but manages an expiration time for each worker.


pages: 1,380 words: 190,710

Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Ana Oprea, Piotr Lewandowski, Adam Stubblefield

air gap, anti-pattern, barriers to entry, bash_history, behavioural economics, business continuity plan, business logic, business process, Cass Sunstein, cloud computing, cognitive load, continuous integration, correlation does not imply causation, create, read, update, delete, cryptocurrency, cyber-physical system, database schema, Debian, defense in depth, DevOps, Edward Snowden, end-to-end encryption, exponential backoff, fault tolerance, fear of failure, general-purpose programming language, Google Chrome, if you see hoof prints, think horses—not zebras, information security, Internet of things, Kubernetes, load shedding, margin call, microservices, MITM: man-in-the-middle, NSO Group, nudge theory, operational security, performance metric, pull request, ransomware, reproducible builds, revision control, Richard Thaler, risk tolerance, self-driving car, single source of truth, Skype, slashdot, software as a service, source of truth, SQL injection, Stuxnet, the long tail, Turing test, undersea cable, uranium enrichment, Valgrind, web application, Y2K, zero day

If many clients are caught in this loop, the resulting demand makes recovering from the outage difficult.9 Client software should be carefully designed to avoid tight retry loops. If a server fails, the client may retry, but should implement exponential backoff—for example, doubling the wait period each time an attempt fails. This approach limits the number of requests to the server, but on its own is not sufficient—an outage can synchronize all clients, causing repeated bursts of high traffic. To avoid synchronous retries, each client should wait for a random duration, called jitter. At Google, we implement exponential backoff with jitter in most of our client software. What can you do if you don’t control the client? This is a common concern for people operating authoritative DNS servers.

Sooner or later, overload, network issues, or some other issue will result in a dependency being unavailable. In many cases, it would be reasonable to retry the request, but implement retries carefully in order to avoid a cascading failure (akin to falling dominoes).1 The most common solution is to retry with an exponential backoff.2 A good framework should provide support for such logic, rather than requiring the developer to implement the logic for every RPC call. A framework that gracefully handles unavailable dependencies and redirects traffic to avoid overloading the service or its dependencies naturally improves the reliability of both the service itself and the entire ecosystem.

    if err != nil {
        return ctx, err
    }
    if ai.allowedRoles[callInfo.User] {
        return ctx, nil
    }
    return ctx, fmt.Errorf("Unauthorized request from %q", callInfo.User)
}

func (*authzInterceptor) After(ctx context.Context, resp *Response) error {
    return nil // Nothing left to do here after the RPC is handled.
}

Example 12-3. Example logging interceptor that logs every incoming request (before stage) and then logs all the failed requests with their status (after stage); WithAttemptCount is a framework-provided RPC call option that implements exponential backoff

type logInterceptor struct {
    logger *LoggingBackendStub
}

func (*logInterceptor) Before(ctx context.Context, req *Request) (context.Context, error) {
    // callInfo was populated by the framework.
    callInfo, err := FromContext(ctx)
    if err != nil {
        return ctx, err
    }
    logReq := &pb.LogRequest{
        timestamp: time.Now().Unix(),
        user:      callInfo.User,
        request:   req.Payload,
    }
    resp, err := logger.Log(ctx, logReq, WithAttemptCount(3))
    return ctx, err
}

func (*logInterceptor) After(ctx context.Context, resp *Response) error {
    if resp.Err == nil {
        return nil
    }
    logErrorReq := &pb.LogErrorRequest{
        timestamp: time.Now().Unix(),
        error:     resp.Err.Error(),
    }
    resp, err := logger.LogError(ctx, logErrorReq, WithAttemptCount(3))
    return err
}

Common Security Vulnerabilities

In large codebases, a handful of classes account for the majority of security vulnerabilities, despite ongoing efforts to educate developers and introduce code review.


pages: 629 words: 109,663

Docker in Action by Jeff Nickoloff, Stephen Kuenzli

air gap, Amazon Web Services, cloud computing, computer vision, continuous integration, database schema, Debian, end-to-end encryption, exponential backoff, fail fast, failed state, information security, Kubernetes, microservices, MITM: man-in-the-middle, peer-to-peer, software as a service, web application

If that container was configured to always restart, and Docker always immediately restarted it, the system would do nothing but restart that container. Instead, Docker uses an exponential backoff strategy for timing restart attempts. A backoff strategy determines the amount of time that should pass between successive restart attempts. An exponential backoff strategy will do something like double the previous time spent waiting on each successive attempt. For example, if the first time the container needs to be restarted Docker waits 1 second, then on the second attempt it would wait 2 seconds, 4 seconds on the third attempt, 8 on the fourth, and so on. Exponential backoff strategies with low initial wait times are a common service-restoration technique.


UNIX® Network Programming, Volume 1: The Sockets Networking API, 3rd Edition by W. Richard Stevens, Bill Fenner, Andrew M. Rudoff

Dennis Ritchie, exponential backoff, failed state, fudge factor, global macro, history of Unix, information retrieval, OpenAI, OSI model, p-value, RFC: Request For Comment, Richard Stallman, UUNET, web application

This function is called when the retransmission timer expires.

lib/rtt.c
83 int
84 rtt_timeout(struct rtt_info *ptr)
85 {
86     ptr->rtt_rto *= 2;          /* next RTO */
87     if (++ptr->rtt_nrexmt > RTT_MAXNREXMT)
88         return (-1);            /* time to give up for this packet */
89     return (0);
90 }
lib/rtt.c

Figure 22.14 rtt_timeout function: applies exponential backoff.

86 The current RTO is doubled: this is the exponential backoff.
87–89 If we have reached the maximum number of retransmissions, −1 is returned to tell the caller to give up; otherwise, 0 is returned.

As an example, our client was run twice to two different echo servers across the Internet in the morning on a weekday.

This value will be overridden by the SCTP stack if it is greater than the maximum allowable streams the SCTP stack supports.

• sinit_max_attempts expresses how many times the SCTP stack should send the initial INIT message before considering the peer endpoint unreachable.
• sinit_max_init_timeo represents the maximum RTO value for the INIT timer. During exponential backoff of the initial timer, this value replaces RTO.max as the ceiling for retransmissions. This value is expressed in milliseconds.

Note that when setting these fields, any value set to 0 will be ignored by the SCTP socket. A user of the one-to-many-style socket (described in Section 9.2) may also pass an sctp_initmsg structure in ancillary data during implicit association setup.

Indeed, the TCP kernel implementation (Section 25.7 of TCPv2) normally uses fixed-point arithmetic for speed, but for simplicity, we use floating-point calculations in our code that follows. Another point made in [Jacobson 1988] is that when the retransmission timer expires, an exponential backoff must be used for the next RTO. For example, if our first RTO is 2 seconds and the reply is not received in this time, then the next RTO is 4 seconds. If there is still no reply, the next RTO is 8 seconds, and then 16, and so on. Jacobson’s algorithms tell us how to calculate the RTO each time we measure an RTT and how to increase the RTO when we retransmit.


pages: 933 words: 205,691

Hadoop: The Definitive Guide by Tom White

Amazon Web Services, bioinformatics, business intelligence, business logic, combinatorial explosion, data science, database schema, Debian, domain-specific language, en.wikipedia.org, exponential backoff, fallacies of distributed computing, fault tolerance, full text search, functional programming, Grace Hopper, information retrieval, Internet Archive, Kickstarter, Large Hadron Collider, linked data, loose coupling, openstreetmap, recommendation engine, RFID, SETI@home, social graph, sparse data, web application

Table: Reduce-side tuning properties

Property name                           | Type  | Default value | Description
mapred.reduce.parallel.copies           | int   | 5             | The number of threads used to copy map outputs to the reducer.
mapred.reduce.copy.backoff              | int   | 300           | The maximum amount of time, in seconds, to spend retrieving one map output for a reducer before declaring it as failed. The reducer may repeatedly reattempt a transfer within this time if it fails (using exponential backoff).
io.sort.factor                          | int   | 10            | The maximum number of streams to merge at once when sorting files. This property is also used in the map.
mapred.job.shuffle.input.buffer.percent | float | 0.70          | The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.

If, after trying all of the servers in the ensemble, it can’t connect, then it throws an IOException. The likelihood of all ZooKeeper servers being unavailable is low; nevertheless, some applications may choose to retry the operation in a loop until ZooKeeper is available. This is just one strategy for retry handling—there are many others, such as using exponential backoff where the period between retries is multiplied by a constant each time. The org.apache.hadoop.io.retry package in Hadoop Core is a set of utilities for adding retry logic into your code in a reusable way, and it may be helpful for building ZooKeeper applications. A Lock Service A distributed lock is a mechanism for providing mutual exclusion between a collection of processes.


Smart Grid Standards by Takuro Sato

business cycle, business process, carbon footprint, clean water, cloud computing, data acquisition, decarbonisation, demand response, distributed generation, electricity market, energy security, exponential backoff, factory automation, Ford Model T, green new deal, green transition, information retrieval, information security, Intergovernmental Panel on Climate Change (IPCC), Internet of things, Iridium satellite, iterative process, knowledge economy, life extension, linear programming, low earth orbit, machine readable, market design, MITM: man-in-the-middle, off grid, oil shale / tar sands, OSI model, packet switching, performance metric, RFC: Request For Comment, RFID, smart cities, smart grid, smart meter, smart transportation, Thomas Davenport

ESI shall be able to receive a message from a service provider’s back office and deliver it to devices in the HAN at the beginning of the valid period. Devices that have negotiated to receive messages shall accept those messages. If unable to deliver to a certain device, the ESI must retry sending with exponential back-off until successful or until message expiration (a loop of this shape is sketched below). Directed messages shall be able to be addressed to a specific premise. These messages shall be delivered to all devices configured to receive messages from the ESI through which they are delivered. Consumers may grant authorization to the service provider to communicate with or control a HAN device.
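A minimal sketch of such a delivery loop, bounding the exponential back-off by the message’s expiration time; the deliverUntilExpiry name and the deliver callback are hypothetical illustrations, not part of the standard.

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry delivery with exponential back-off until success or expiration.
async function deliverUntilExpiry(
  deliver: () => Promise<boolean>, // hypothetical per-device delivery attempt
  expiresAt: number                // message expiration, epoch milliseconds
): Promise<boolean> {
  let backoffMs = 1000;
  while (Date.now() < expiresAt) {
    if (await deliver()) return true;
    await delay(Math.min(backoffMs, Math.max(0, expiresAt - Date.now())));
    backoffMs *= 2;
  }
  return false; // message expired before delivery succeeded
}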


Seeking SRE: Conversations About Running Production Systems at Scale by David N. Blank-Edelman

Affordable Care Act / Obamacare, algorithmic trading, AlphaGo, Amazon Web Services, backpropagation, Black Lives Matter, Bletchley Park, bounce rate, business continuity plan, business logic, business process, cloud computing, cognitive bias, cognitive dissonance, cognitive load, commoditize, continuous integration, Conway's law, crowdsourcing, dark matter, data science, database schema, Debian, deep learning, DeepMind, defense in depth, DevOps, digital rights, domain-specific language, emotional labour, en.wikipedia.org, exponential backoff, fail fast, fallacies of distributed computing, fault tolerance, fear of failure, friendly fire, game design, Grace Hopper, imposter syndrome, information retrieval, Infrastructure as a Service, Internet of things, invisible hand, iterative process, Kaizen: continuous improvement, Kanban, Kubernetes, loose coupling, Lyft, machine readable, Marc Andreessen, Maslow's hierarchy, microaggression, microservices, minimum viable product, MVC pattern, performance metric, platform as a service, pull request, RAND corporation, remote working, Richard Feynman, risk tolerance, Ruby on Rails, Salesforce, scientific management, search engine result page, self-driving car, sentiment analysis, Silicon Valley, single page application, Snapchat, software as a service, software is eating the world, source of truth, systems thinking, the long tail, the scientific method, Toyota Production System, traumatic brain injury, value engineering, vertical integration, web application, WebSocket, zero day

Service discovery
Distributed applications need to find each other. Mechanisms range in complexity from Domain Name System (DNS) to fully consistent solutions like Consul.

Distributed system best practices
At a theoretical level, microservice practitioners are told that they need to employ best practices such as retries with exponential back-off, circuit breaking, rate limiting, and timeouts. The actual implementations of these best practices are usually varied or missing entirely.

Authentication and authorization
Although most internet architectures utilize edge encryption via Transport Layer Security (TLS) and some type of edge authentication, the implementations vary widely, from proprietary to OAuth.


Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

active measures, Amazon Web Services, billion-dollar mistake, bitcoin, blockchain, business intelligence, business logic, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, data science, database schema, deep learning, DevOps, distributed ledger, Donald Knuth, Edward Snowden, end-to-end encryption, Ethereum, ethereum blockchain, exponential backoff, fake news, fault tolerance, finite state, Flash crash, Free Software Foundation, full text search, functional programming, general-purpose programming language, Hacker News, informal economy, information retrieval, Internet of things, iterative process, John von Neumann, Ken Thompson, Kubernetes, Large Hadron Collider, level 1 cache, loose coupling, machine readable, machine translation, Marc Andreessen, microservices, natural language processing, Network effects, no silver bullet, operational security, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, SQL injection, statistical model, surveillance capitalism, systematic bias, systems thinking, Tragedy of the Commons, undersea cable, web application, WebSocket, wikimedia commons

Although retrying an aborted transaction is a simple and effective error handling mechanism, it isn’t perfect:

• If the transaction actually succeeded, but the network failed while the server tried to acknowledge the successful commit to the client (so the client thinks it failed), then retrying the transaction causes it to be performed twice—unless you have an additional application-level deduplication mechanism in place.
• If the error is due to overload, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently from other errors (if possible).
• It is only worth retrying after transient errors (for example due to deadlock, isolation violation, temporary network interruptions, and failover); after a permanent error (e.g., constraint violation) a retry would be pointless.
• If the transaction also has side effects outside of the database, those side effects may happen even if the transaction is aborted.


pages: 1,237 words: 227,370

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

active measures, Amazon Web Services, billion-dollar mistake, bitcoin, blockchain, business intelligence, business logic, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, data science, database schema, deep learning, DevOps, distributed ledger, Donald Knuth, Edward Snowden, end-to-end encryption, Ethereum, ethereum blockchain, exponential backoff, fake news, fault tolerance, finite state, Flash crash, Free Software Foundation, full text search, functional programming, general-purpose programming language, Hacker News, informal economy, information retrieval, Infrastructure as a Service, Internet of things, iterative process, John von Neumann, Ken Thompson, Kubernetes, Large Hadron Collider, level 1 cache, loose coupling, machine readable, machine translation, Marc Andreessen, microservices, natural language processing, Network effects, no silver bullet, operational security, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, SQL injection, statistical model, surveillance capitalism, systematic bias, systems thinking, Tragedy of the Commons, undersea cable, web application, WebSocket, wikimedia commons

Although retrying an aborted transaction is a simple and effective error handling mechanism, it isn’t perfect:

• If the transaction actually succeeded, but the network failed while the server tried to acknowledge the successful commit to the client (so the client thinks it failed), then retrying the transaction causes it to be performed twice—unless you have an additional application-level deduplication mechanism in place.
• If the error is due to overload, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently from other errors (if possible).
• It is only worth retrying after transient errors (for example due to deadlock, isolation violation, temporary network interruptions, and failover); after a permanent error (e.g., constraint violation) a retry would be pointless.