source of truth

43 results


pages: 355 words: 81,788

Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith by Sam Newman

Airbnb, business process, continuous integration, database schema, DevOps, fault tolerance, ghettoisation, inventory management, Jeff Bezos, Kubernetes, loose coupling, microservices, MVC pattern, price anchoring, pull request, single page application, software as a service, source of truth, telepresence

To resolve this, you have a few options: Write to one source: All writes are sent to one of the sources of truth. Data is synchronized to the other source of truth after the write occurs. Send writes to both sources: All write requests made by upstream clients are sent to both sources of truth. This occurs by making sure the client makes a call to each source of truth itself, or by relying on an intermediary to broadcast the request to each downstream service. Send writes to either source: Clients can send write requests to either source of truth, and behind the scenes the data is synchronized in a two-way fashion between the systems. The two separate options of sending writes to both sources of truth, or sending to one source of truth and relying on some form of background synchronization, seem like workable solutions, and the example we’ll explore in a moment uses both of these techniques.

With a tracer write, we move the source of truth for data in an incremental fashion, tolerating there being two sources of truth during the migration. You identify a new service that will host the relocated data. The current system still maintains a record of this data locally, but when making changes also ensures this data is written to the new service via its service interface. Existing code can be changed to start accessing the new service, and once all functionality is using the new service as the source of truth, the old source of truth can be retired. Careful consideration needs to be given regarding how data is synchronized between the two sources of truth. Figure 4-18. A tracer write allows for incremental migration of data from one system to another by accommodating two sources of truth during the migration. Wanting a single source of truth is a totally rational desire.

Wanting a single source of truth is a totally rational desire. It allows us to ensure consistency of data and to control access to that data, and it can reduce maintenance costs. The problem is that if we insist on only ever having one source of truth for a piece of data, then we are forced into a situation where changing where this data lives becomes a single big switchover. Before the release, the monolith is the source of truth. After the release, our new microservice is the source of truth. The issue is that various things can go wrong during this changeover. A pattern like the tracer write allows for a phased switchover, reducing the impact of each release, in exchange for being more tolerant of having more than one source of truth. The reason this pattern is called a tracer write is that you can start with a small set of data being synchronized and increase this over time, while also increasing the number of consumers of the new source of data.
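The tracer write described in these excerpts could be sketched roughly as follows. This is a minimal illustration, not code from the book: `InMemoryStore`, `TracerWriter`, and the `migratedFields` list are hypothetical names, and in-memory maps stand in for both the monolith's local storage and the new service's interface, which would in practice be a network call.

```javascript
// Hypothetical in-memory stand-in for a data store (monolith table or new service).
class InMemoryStore {
  constructor() {
    this.rows = new Map();
  }
  put(id, record) {
    this.rows.set(id, { ...record });
  }
  get(id) {
    return this.rows.get(id);
  }
}

// Assumed wrapper around the monolith's write path.
class TracerWriter {
  constructor(monolithStore, newService, migratedFields) {
    this.monolithStore = monolithStore;
    this.newService = newService;
    // The set of fields synchronized so far; it grows as the migration proceeds.
    this.migratedFields = migratedFields;
  }

  write(id, record) {
    // The current system still keeps its local record.
    this.monolithStore.put(id, record);
    // Only the fields migrated so far are also written to the new service.
    const subset = {};
    for (const field of this.migratedFields) {
      if (field in record) subset[field] = record[field];
    }
    this.newService.put(id, subset);
  }
}

const monolith = new InMemoryStore();
const invoiceService = new InMemoryStore();
const writer = new TracerWriter(monolith, invoiceService, ["total"]);
writer.write("inv-1", { total: 120, customer: "c-9" });
```

Widening `migratedFields` over successive releases mirrors the idea of starting with a small set of synchronized data and increasing it over time. The sketch ignores failure handling: if the second write fails, the two sources diverge, which is the synchronization concern the excerpt raises.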


pages: 461 words: 106,027

Zero to Sold: How to Start, Run, and Sell a Bootstrapped Business by Arvid Kahl

"side hustle", business process, centre right, Chuck Templeton: OpenTable:, continuous integration, coronavirus, COVID-19, Covid-19, crowdsourcing, domain-specific language, financial independence, Google Chrome, if you build it, they will come, information asymmetry, information retrieval, inventory management, Jeff Bezos, job automation, Kubernetes, minimum viable product, Network effects, performance metric, post-work, premature optimization, risk tolerance, Ruby on Rails, sentiment analysis, Silicon Valley, software as a service, source of truth, statistical model, subscription business, supply-chain management, trickle-down economics, web application

Your employees will be the first touch-points for customer interactions, co-founders and directors, and partners and other businesses. What once was a unified voice, your voice, is now a chorus. If you want to have a company that is consistent and aligned, you’ll need to be the conductor of that chorus. The reason that a chorus of 100 singers can create incredibly elegant harmonies is that they have a central source of truth: a score written in commonly readable music notation, and a conductor to help them keep in sync. You will need to be all of that for your employees and co-founders. Building a Source of Truth: Create a vision and mission document. Set the tone by explaining the “Why” and the “How” in your own authentic voice. This document will be the point of reference for any question that could crop up in the future. Whenever an employee, co-founder, or a potential acquirer needs to learn about the voice of your business, you can refer them to this writeup, as it will settle any dispute and answer any question.

Building software like this will surface dependency and configuration errors much more quickly and safely. After all, you will see the errors as they happen on your local computer, and not just after having deployed a new faulty version to production. It's a way to keep your operational peace of mind. You will still need to do ample testing before every deploy. Having your systems in immutable containers will force them to be mostly stateless, which allows you to save your data in a single source of truth, likely a database or in-memory storage system. Stateless containers will enable you to launch as many as you need to handle the increasing load over time since they can work on different tasks in parallel. Many orchestration systems can auto-scale containers to match the computational demand of your customers. Once you have multiple containers doing work in parallel, you may want to look into having the containers communicate with each other to distribute tasks and use shared resources evenly.

Whenever an employee, co-founder, or a potential acquirer needs to learn about the voice of your business, you can refer them to this writeup, as it will settle any dispute and answer any question. Write it down in extensive prose, or create a video in which you explain your motivation and your aspirations. Share this with everyone who joins the company, and make it clear to them that this is the source of truth whenever someone wonders how they should communicate the means and goals of the business. At FeedbackPanda, I stated in that document something along the lines of “At FeedbackPanda, we want to enable our teachers who are likely to come from fragile financial backgrounds. When they are in trouble, we help them teach more by using our product until they can catch up.” If you are a customer service representative tasked with deciding if a customer should get a few weeks’ extension because their credit card is overdrawn, you don’t have to think twice after reading this paragraph.


pages: 514 words: 111,012

The Art of Monitoring by James Turnbull

Amazon Web Services, anti-pattern, cloud computing, continuous integration, correlation does not imply causation, Debian, DevOps, domain-specific language, failed state, Kickstarter, Kubernetes, microservices, performance metric, pull request, Ruby on Rails, software as a service, source of truth, web application, WebSocket

We're going to update our Reactive environment to focus on events, metrics, and logs. We'll replace a lot of our existing monitoring infrastructure (for example, service and host-centric checks) with event and metric-driven checks. In our monitoring framework, events, metrics, and logs are going to be at the core of our solution. The data points that make up our events, metrics, and logs will provide the source of truth for the state of our environment and the performance of our environment. So, rather than infrastructure-centric checks like pinging a host to return its availability or monitoring a process to confirm if a service is running, we're going to replace most of those fault detection checks with metrics. If a metric is measuring then the service is available. If it stops measuring then it's likely the service is not available.
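The idea that a metric which stops reporting signals an unavailable service can be illustrated with a small freshness check. This is only a sketch under stated assumptions: `MetricFreshness`, the 30-second silence threshold, and the service names are all hypothetical and not taken from the book or from any monitoring tool.

```javascript
// Tracks when each service last reported a metric (all names are illustrative).
class MetricFreshness {
  constructor(maxSilenceMs) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastSeen = new Map(); // service name -> timestamp of last data point
  }

  // Called whenever a data point arrives for a service.
  record(service, timestampMs) {
    this.lastSeen.set(service, timestampMs);
  }

  // A service is treated as likely available only if it reported recently.
  isLikelyAvailable(service, nowMs) {
    const last = this.lastSeen.get(service);
    if (last === undefined) return false; // never reported
    return nowMs - last <= this.maxSilenceMs;
  }
}

const freshness = new MetricFreshness(30000);
freshness.record("web", 1000);
```

With these values, `freshness.isLikelyAvailable("web", 20000)` is true, while the same call at `40000` is false because the metric has been silent for longer than the threshold. Timestamps are passed in explicitly so the check stays deterministic.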

But first, in these initial chapters, we're going to look at data collection, metrics, aggregation, and visualization. Then we'll expand the framework to collect application and business metrics, culminating in a capstone chapter where we'll put everything together. We'll build a framework that focuses on events and metrics and collects data in a scalable and robust way. In our new monitoring paradigm, events and metrics are going to be at the core of our solution. This data will provide the source of truth for the state of our environment and the performance of our environment. Visualization of this data will also allow for the ready expression and interpretation of complex ideas that would otherwise take thousands of words or hours of explanation. In this chapter we're going to step through our proposed monitoring framework. We'll introduce the basic concepts and lay the groundwork that will help you understand the choice of tools and techniques we've made later in the book.

In virtual and cloud environments, a host or service being monitored may be highly ephemeral: appearing, disappearing, or migrating locations or hosts multiple times during its lifespan. Statically defined checks just don't handle this changing landscape, resulting in checks (and faults) on resources that do not exist or that have changed. Further, many monitoring systems require you to duplicate configuration on both a server and the object being monitored. This lack of a single source of truth leads to increased risk of inconsistency and difficulty in managing checks. It also generally means that the monitoring server needs to know about resources being monitored before they can be monitored. This is clearly problematic in dynamic or changing landscapes. Additionally, updates to monitoring are often considered secondary to scaling or evolving the systems themselves. Many faults are thus the result of incorrect configuration or orphaned checks.


pages: 292 words: 66,588

Learning Vue.js 2: Learn How to Build Amazing and Complex Reactive Web Applications Easily With Vue.js by Olga Filipova

Amazon Web Services, continuous integration, create, read, update, delete, en.wikipedia.org, Firefox, Google Chrome, MVC pattern, pull request, side project, single page application, Skype, source of truth, web application

Thus, the components are bound to the read-only brain state and can dispatch brain actions that will alter the state. The components are not aware of each other and cannot modify each other's state directly in any way. They also cannot directly affect the brain's initial state. They can only call the actions. Actions belong to the brain, and in their callbacks, the state can be modified. Thus, our brain is a single source of truth. Tip: Single source of truth in information systems is a way of designing the architecture of the application in such a way that every data element is only stored once. This data is read-only to prevent the application's components from corrupting the state that is accessed by other components. The Vuex store is designed in such a way that it is not possible to change its state from any component. How does the store work and what is so special about it?

And, it's reactive. Putting all this in statements: The Vuex store is reactive. Once components retrieve a state from it, they will reactively update their views every time the state changes. Components are not able to directly mutate the store's state. Instead, they have to dispatch mutations declared by the store, which allows easy tracking of changes. Our Vuex store thus becomes a single source of truth. Let's create a simple greetings example to see Vuex in action. Greetings with store We will create a very simple Vue application with two components: one of them will contain the greetings message and the other one will contain input that will allow us to change this message. Our store will contain the initial state that will represent the initial greeting and the mutation that will be able to change the message.
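The pattern described here, read-only state that changes only through mutations declared by the store, can be sketched in plain JavaScript. This is deliberately not the Vuex API: `createStore`, `commit`, `subscribe`, and `changeGreeting` are names assumed for illustration, and the freeze is shallow, so it only protects top-level properties.

```javascript
// A minimal store sketch: state is replaced, never edited in place by callers.
function createStore({ state, mutations }) {
  let current = Object.freeze({ ...state });
  const subscribers = [];

  return {
    // Components read the state but receive a frozen object.
    get state() {
      return current;
    },
    // The only way to change state is to name a declared mutation.
    commit(type, payload) {
      const mutation = mutations[type];
      if (!mutation) throw new Error(`Unknown mutation: ${type}`);
      const draft = { ...current };
      mutation(draft, payload);
      current = Object.freeze(draft);
      subscribers.forEach((notify) => notify(current));
    },
    // Stand-in for reactivity: views register to be told about changes.
    subscribe(notify) {
      subscribers.push(notify);
    },
  };
}

const store = createStore({
  state: { greeting: "Hello" },
  mutations: {
    changeGreeting(state, newGreeting) {
      state.greeting = newGreeting;
    },
  },
});

const seen = [];
store.subscribe((state) => seen.push(state.greeting));
store.commit("changeGreeting", "Hi there");
```

Because every change passes through `commit`, there is one place to log or trace state transitions, which is the tracking benefit the excerpt mentions. A direct assignment to `store.state.greeting` has no effect on the frozen object (and throws in strict mode).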


pages: 165 words: 50,798

Intertwingled: Information Changes Everything by Peter Morville

A Pattern Language, Airbnb, Albert Einstein, Arthur Eddington, augmented reality, Bernie Madoff, Black Swan, business process, Cass Sunstein, cognitive dissonance, collective bargaining, disruptive innovation, index card, information retrieval, Internet of things, Isaac Newton, iterative process, Jane Jacobs, John Markoff, Lean Startup, Lyft, minimum viable product, Mother of all demos, Nelson Mandela, Paul Graham, peer-to-peer, RFID, Richard Thaler, ride hailing / ride sharing, Schrödinger's Cat, self-driving car, semantic web, sharing economy, Silicon Valley, Silicon Valley startup, source of truth, Steve Jobs, Stewart Brand, Ted Nelson, The Death and Life of Great American Cities, the scientific method, The Wisdom of Crowds, theory of mind, uber lyft, urban planning, urban sprawl, Vannevar Bush, zero-sum game

We spoke the language of their culture, and they listened. Together we made search better. Of course, when it comes to co-cultures, there are some things you just can’t fix. I saw this firsthand while working with DaimlerChrysler soon after the 1998 merger. We were hired to build an information architecture strategy for a unified corporate portal. By integrating several American and German intranets into a single source of truth, executives hoped to bring the cultures together. While this seemed unrealistic, we were willing to give it a go. But the more we learned, the less we believed in the mission. In stakeholder interviews, the absence of trust was palpable. This was a culture clash of epic proportions. On the surface, friction was caused by different wage structures, org charts, values, and brands. But at a deeper level, conflict was driven by differences in national culture.

Metrics are defined to back the plan. It’s amazing how much is done in support of what we already know. Within a culture, the idiosyncrasy of authoritative information is mostly invisible. Insiders rarely question their institutional ways of knowing. As a consultant, I’ve had clients who place too much faith in my expertise, and others who don’t trust enough. Some use “usability tests” as the sole source of truth. Others track conversions as the way to know what’s right. And an awful lot of folks simply believe the boss knows best. As an outsider, it’s my role to ask the questions that never get asked, but when I begin I have no idea what they are or where to find them. So I use a multi-method research process that affords breadth and depth. I wallow in data of all sorts and talk to people from all walks.


pages: 540 words: 103,101

Building Microservices by Sam Newman

airport security, Amazon Web Services, anti-pattern, business process, call centre, continuous integration, create, read, update, delete, defense in depth, don't repeat yourself, Edward Snowden, fault tolerance, index card, information retrieval, Infrastructure as a Service, inventory management, job automation, Kubernetes, load shedding, loose coupling, microservices, MITM: man-in-the-middle, platform as a service, premature optimization, pull request, recommendation engine, social graph, software as a service, source of truth, the built environment, web application, WebSocket

We need to embrace the idea that a microservice will encompass the lifecycle of our core domain entities, like the Customer. We’ve already talked about the importance of the logic associated with changing this Customer being held in the customer service, and that if we want to change it we have to issue a request to the customer service. But it also follows that we should consider the customer service as being the source of truth for Customers. When we retrieve a given Customer resource from the customer service, we get to see what that resource looked like when we made the request. It is possible that after we requested that Customer resource, something else has changed it. What we have in effect is a memory of what the Customer resource once looked like. The longer we hold on to this memory, the higher the chance that this memory will be false.
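The point that a retrieved Customer is only a memory that grows less trustworthy with age can be sketched as a cache that refuses to serve entries beyond a maximum age. Everything here is assumed for illustration: `CustomerCache`, `fetchCustomer`, and the one-second limit are hypothetical, and the fetch is synchronous only to keep the example short, where a real call to the customer service would go over the network.

```javascript
// Hypothetical stand-in for the customer service, the source of truth.
const customerService = new Map([["c-1", { name: "Ann" }]]);
const fetchCustomer = (id) => ({ ...customerService.get(id) });

class CustomerCache {
  constructor(fetch, maxAgeMs) {
    this.fetch = fetch;
    this.maxAgeMs = maxAgeMs;
    this.entries = new Map(); // id -> { customer, fetchedAt }
  }

  get(id, nowMs) {
    const entry = this.entries.get(id);
    // Serve the remembered copy only while it is young enough to trust.
    if (entry && nowMs - entry.fetchedAt <= this.maxAgeMs) {
      return entry.customer;
    }
    // Otherwise go back to the source of truth and refresh the memory.
    const customer = this.fetch(id);
    this.entries.set(id, { customer, fetchedAt: nowMs });
    return customer;
  }
}

const cache = new CustomerCache(fetchCustomer, 1000);
const first = cache.get("c-1", 0);

// Something else changes the Customer after we fetched it.
customerService.set("c-1", { name: "Anna" });
const stale = cache.get("c-1", 500);
const fresh = cache.get("c-1", 2000);
```

Here `stale.name` is still `"Ann"` even though the source of truth has changed, while `fresh.name` is `"Anna"`. The maximum age is a trade-off between load on the customer service and the chance of acting on a false memory.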

These systems allow you to store information about principals, such as what roles they play in the organization. Often, the directory service and the identity provider are one and the same, while sometimes they are separate but linked. Okta, for example, is a hosted SAML identity provider that handles tasks like two-factor authentication, but can link to your company’s directory services as the source of truth. SAML is a SOAP-based standard, and is known for being fairly complex to work with despite the libraries and tooling available to support it. OpenID Connect is a standard that has emerged as a specific implementation of OAuth 2.0, based on the way Google and others handle SSO. It uses simpler REST calls, and in my opinion is likely to make inroads into enterprises due to its improved ease of use.

Other tools offer everything up to and including rate limiting, monetization, API catalogs, and discovery systems. Some API systems allow you to bridge API keys to existing directory services. This would allow you to issue API keys to principals (representing people or systems) in your organization, and control the lifecycle of those keys in the same way you’d manage their normal credentials. This opens up the possibility of allowing access to your services in different ways but keeping the same source of truth: for example, using SAML to authenticate humans for SSO, and using API keys for service-to-service communication, as shown in Figure 9-2. Figure 9-2. Using directory services to synchronize principal information between an SSO and an API gateway. The Deputy Problem: Having a principal authenticate with a given microservice is simple enough. But what happens if that service then needs to make additional calls to complete an operation?


pages: 1,380 words: 190,710

Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Ana Oprea, Piotr Lewandowski, Adam Stubblefield

anti-pattern, barriers to entry, bash_history, business continuity plan, business process, Cass Sunstein, cloud computing, continuous integration, correlation does not imply causation, create, read, update, delete, cryptocurrency, cyber-physical system, database schema, Debian, defense in depth, DevOps, Edward Snowden, fault tolerance, fear of failure, general-purpose programming language, Google Chrome, Internet of things, Kubernetes, load shedding, margin call, microservices, MITM: man-in-the-middle, performance metric, pull request, ransomware, revision control, Richard Thaler, risk tolerance, self-driving car, Skype, slashdot, software as a service, source of truth, Stuxnet, Turing test, undersea cable, uranium enrichment, Valgrind, web application, Y2K, zero day

Handling emergencies directly: In order to revoke keys and certificates quickly, you may want to design infrastructure to handle emergencies directly by deploying changes to a server’s authorized_users or Key Revocation List (KRL) files. This solution is troublesome for recovery in several ways. Note: It’s especially tempting to manage authorized_keys or known_hosts files directly when dealing with small numbers of nodes, but doing so scales poorly and smears ground truth across your entire fleet. It is very difficult to ensure that a given set of keys has been removed from the files on all servers, particularly if those files are the sole source of truth. Instead of managing authorized_keys or known_hosts files directly, you can ensure update processes are consistent by centrally managing keys and certificates, and distributing state to servers through a revocation list. In fact, deploying explicit revocation lists is an opportunity to minimize uncertainty during risky situations when you’re moving at maximum speed: you can use your usual mechanisms for updating and monitoring files, including your rate-limiting mechanisms, on individual nodes.

For example, old certificates may suddenly become valid again, letting in attackers, and correct certificates may suddenly fail validation, causing a service outage. These are complications you don’t want to experience during a stressful compromise or service outage. It’s better for your system’s certificate validation to depend on aspects you directly control, such as pushing out files containing root authority public keys or files containing revocation lists. The systems that push the files, the files themselves, and the central source of truth are likely to be much better secured, maintained, and monitored than the distribution of time. Recovery then becomes a matter of simply pushing out files, and monitoring only needs to check whether these are the intended files—standard processes that your systems already use. Revoking credentials at scale When using explicit revocation, it’s important to consider the implications of scalability.

Understanding, monitoring, and reproducing the state of your system to the greatest extent possible (through software versions, memory, wall-clock time, and so on) is key to reliably recovering the system to any previously working state, and ensuring that its current state matches your security requirements. As a last resort, emergency access permits responders to remain connected, assess a system, and mitigate the situation. Thoughtfully managing policy versus procedure, the central source of truth versus local functions, and the expected state versus the system’s actual state paves the way to recoverable systems, while also promoting resilience and robust everyday operations.


pages: 109 words: 29,486

Marx: A Very Short Introduction by Peter Singer

clockwatching, means of production, Paul Samuelson, source of truth

Chapter 1: A Life and its Impact. Marx’s impact can only be compared with that of religious figures like Jesus or Muhammad. For much of the second half of the twentieth century, nearly four out of every ten people on earth lived under governments that considered themselves Marxist and claimed – however implausibly – to use Marxist principles to decide how the nation should be run. In these countries Marx was a kind of secular Jesus; his writings were the ultimate source of truth and authority; his image was everywhere reverently displayed. The lives of hundreds of millions of people have been deeply affected by Marx’s legacy. Nor has Marx’s influence been limited to communist societies. Conservative governments have ushered in social reforms to cut the ground from under revolutionary Marxist opposition movements. Conservatives have also reacted in less benign ways: Mussolini and Hitler were helped to power by conservatives who saw their rabid nationalism as the answer to the Marxist threat.


pages: 161 words: 39,526

Applied Artificial Intelligence: A Handbook for Business Leaders by Mariya Yao, Adelyn Zhou, Marlene Jia

Airbnb, Amazon Web Services, artificial general intelligence, autonomous vehicles, business intelligence, business process, call centre, chief data officer, computer vision, conceptual framework, en.wikipedia.org, future of work, industrial robot, Internet of things, iterative process, Jeff Bezos, job automation, Marc Andreessen, natural language processing, new economy, pattern recognition, performance metric, price discrimination, randomized controlled trial, recommendation engine, self-driving car, sentiment analysis, Silicon Valley, skunkworks, software is eating the world, source of truth, speech recognition, statistical model, strong AI, technological singularity

Other companies have accumulated big piles of data, but aren’t actively transforming their information assets into improved business practices. Do You Have a Central Technology Infrastructure and Team? A key milestone in the corporate digital transformation is the development of a centralized data and technology infrastructure. These two elements connect consumer applications, enterprise systems, and third-party partners and provide access to a single source of truth that contains relevant, up-to-date, and accurate information for all parties. Designing and implementing the infrastructure needed for enterprise-scale AI requires a strong and dedicated technology team that can develop internal application programming interfaces (APIs) to standardize access to both data and your company’s internal business technology. Doing so will enable your company to streamline enterprise-wide data analysis, accelerate product development, and respond more quickly in evolving markets.


pages: 296 words: 41,381

Vue.js by Callum Macrae

Airbnb, single page application, source of truth, web application, WebSocket

Throughout the rest of the chapter, I’ll introduce the individual concepts you saw in that example—state, mutations, and actions—and explain a way we can structure our vuex modules in large applications to avoid having one large, messy file. State and State Helpers First, let’s look at state. State indicates how data is stored in our vuex store. It’s like one big object that we can access from anywhere in our application—it’s the single source of truth. Let’s take a simple store that contains only a number: import Vuex from 'vuex'; export default new Vuex.Store({ state: { messageCount: 10 } }); Now, in our application, we can access the messageCount property of the state object by accessing this.$store.state.messageCount. This is a bit verbose, so generally it’s better to put it in a computed property, like so: const NotificationCount = { template: `<p>Messages: {{ messageCount }}</p>`, computed: { messageCount() { return this.


pages: 394 words: 118,929

Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software by Scott Rosenberg

A Pattern Language, Benevolent Dictator For Life (BDFL), Berlin Wall, c2.com, call centre, collaborative editing, conceptual framework, continuous integration, Donald Knuth, Douglas Engelbart, Douglas Hofstadter, Dynabook, en.wikipedia.org, Firefox, Ford paid five dollars a day, Francis Fukuyama: the end of history, George Santayana, Grace Hopper, Guido van Rossum, Gödel, Escher, Bach, Howard Rheingold, HyperCard, index card, Internet Archive, inventory management, Jaron Lanier, John Markoff, John von Neumann, knowledge worker, Larry Wall, life extension, Loma Prieta earthquake, Menlo Park, Merlin Mann, Mitch Kapor, new economy, Nicholas Carr, Norbert Wiener, pattern recognition, Paul Graham, Potemkin village, RAND corporation, Ray Kurzweil, Richard Stallman, Ronald Reagan, Ruby on Rails, semantic web, side project, Silicon Valley, Singularitarianism, slashdot, software studies, source of truth, South of Market, San Francisco, speech recognition, stealth mode startup, stem cell, Stephen Hawking, Steve Jobs, Stewart Brand, Ted Nelson, Therac-25, thinkpad, Turing test, VA Linux, Vannevar Bush, Vernor Vinge, web application, Whole Earth Catalog, Y2K

Or you could avoid conflict in the first place through “locking,” which allows a user editing a document to make sure that no one else has access to it until the work is done and the changes are saved. If WebDAV could do it, why was it so hard for Chandler? Chandler’s peer-to-peer approach meant there was no central server to be what developers call, with a kind of flip reverence, “the source of truth.” WebDAV’s server stored the document, knew what was happening to it, and could coordinate messages about its status to multiple users. Under a decentralized peer-to-peer approach, multiple copies of a document can proliferate with no master copy for users to rely on, no authority to turn to. Life is harder without a “source of truth.” For programmers as for other human beings, a canonical authority can be convenient. It rescues you from having to figure out how to adjudicate dilemmas on your own. After just a few weeks at OSAF, Dusseault became convinced that the peer-to-peer road to Chandler sharing was likely to prove a dead end.


Pulling Strings With Puppet: Configuration Management Made Easy by James Turnbull

Debian, en.wikipedia.org, Kickstarter, revision control, Ruby on Rails, source of truth, SpamAssassin

External nodes provide the capability to store our node definitions in a data source external to Puppet, for example, generated by a script or drawn from a database. An extension of this functionality, LDAP nodes, allows you to store your node configurations in an LDAP server. This externalization of data provides a number of advantages when managing our configuration information, especially in providing a single source of truth and a centralized repository for asset and configuration information. As discussed in Chapter 1, the Puppet client-server model is not yet fully scalable to large installations, for example, the management of thousands of nodes. In this chapter, I’ll examine using the Mongrel web server in combination with an Apache proxy running the mod_ssl and mod_proxy_balancer modules to enhance Puppet’s scalability and allow you to run multiple master daemons.


pages: 212 words: 49,544

WikiLeaks and the Age of Transparency by Micah L. Sifry

1960s counterculture, Amazon Web Services, banking crisis, barriers to entry, Bernie Sanders, Buckminster Fuller, Chelsea Manning, citizen journalism, Climategate, crowdsourcing, Google Earth, Howard Rheingold, Internet Archive, Jacob Appelbaum, John Markoff, Julian Assange, Network effects, RAND corporation, school vouchers, Skype, social web, source of truth, Stewart Brand, web application, WikiLeaks

The government’s failure to tell a truth that the public already knew damaged its authority. In the second case, real news was released by ordinary people, in the face of the desire of the authorities to maintain an official but false narrative. The government’s failure to tell the truth that the public found out despite its efforts damaged its authority. In both cases, free agents are the sources of truths more credible than anything the government offers. And in both cases, it is the same human impulse to share the truth that shines through. Now, think about Bradley Manning and what motivated him. In the networked age, where the watched can also be the watchers, what is at stake is nothing less than the credibility of authority itself. Western governments presumably rest on the consent of the governed, but only if the governed trust the word of those who would govern them.


Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (O’Reilly, 2017)

active measures, Amazon Web Services, bitcoin, blockchain, business intelligence, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, database schema, DevOps, distributed ledger, Donald Knuth, Edward Snowden, Ethereum, ethereum blockchain, fault tolerance, finite state, Flash crash, full text search, general-purpose programming language, informal economy, information retrieval, Internet of things, iterative process, John von Neumann, Kubernetes, loose coupling, Marc Andreessen, microservices, natural language processing, Network effects, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, statistical model, undersea cable, web application, WebSocket, wikimedia commons

This aspect of system-building is often overlooked by vendors who claim that their product can satisfy all your needs. In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application. Systems of Record and Derived Data On a high level, systems that store and process data can be grouped into two broad categories: Systems of record A system of record, also known as source of truth, holds the authoritative version of your data. When new data comes in, e.g., as user input, it is first written here. Each fact is represented exactly once (the representation is typically normalized). If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one. Derived data systems Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way.
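The distinction between a system of record and a derived data system can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the book: the user records, the `derive_city_index` helper, and the city index itself are hypothetical. Writes go to the system of record first, and the derived index is repaired by recomputing it from that authoritative copy.

```python
from collections import defaultdict

# Hypothetical system of record: each fact is stored exactly once, keyed by user id.
system_of_record = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Grace", "city": "New York"},
    3: {"name": "Alan", "city": "London"},
}


def derive_city_index(records):
    """Derived data: a city -> sorted user ids index computed from the records."""
    index = defaultdict(list)
    for user_id, row in records.items():
        index[row["city"]].append(user_id)
    return {city: sorted(ids) for city, ids in index.items()}


# Build the derived index once from the system of record.
city_index = derive_city_index(system_of_record)

# New data is written to the system of record first...
system_of_record[4] = {"name": "Edsger", "city": "Austin"}

# ...so the derived index is now stale. On any discrepancy, the system of
# record is (by definition) the correct one.
stale = city_index != derive_city_index(system_of_record)

# Repair the derived data by recomputing it from the authoritative copy.
city_index = derive_city_index(system_of_record)
```

In this sketch the derived index can always be thrown away and rebuilt, which is what makes it derived rather than authoritative.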

See the discussions of read skew in “Snapshot Isolation and Repeatable Read” on page 237, write skew in “Write Skew and Phantoms” on page 246, and clock stored procedure A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See “Actual Serial Execution” on page 252. stream process A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11. synchronous The opposite of asynchronous. system of record A system that holds the primary, authoritative version of some data, also known as the source of truth. Changes are first written here, and other datasets may be derived from the system of record. See the introduction to Part III. timeout One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays” on page 281. total order A way of comparing things (e.g., time stamps) that allows you to always say which one of two things is greater and which one is lesser.
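The glossary entry for timeout can be made concrete with a small sketch. It is only an illustration under stated assumptions: `call_with_timeout`, `fast_node`, and `slow_node` are hypothetical names, and a local worker thread stands in for a remote node. The point it tries to show is that the caller only observes the lack of a response, not its cause.

```python
import queue
import threading
import time


def call_with_timeout(handler, timeout_s):
    """Run handler in a worker thread and wait at most timeout_s for a reply."""
    replies = queue.Queue()
    worker = threading.Thread(target=lambda: replies.put(handler()), daemon=True)
    worker.start()
    try:
        return ("ok", replies.get(timeout=timeout_s))
    except queue.Empty:
        # All the caller knows is that no reply arrived in time. A slow node,
        # a crashed node, and a lost reply all look the same from here.
        return ("timeout", None)


def fast_node():
    """Hypothetical node that replies immediately."""
    return "pong"


def slow_node():
    """Hypothetical node that replies only after the caller has given up."""
    time.sleep(0.5)
    return "pong"


fast_result = call_with_timeout(fast_node, timeout_s=0.2)
slow_result = call_with_timeout(slow_node, timeout_s=0.05)
```

Note that `slow_node` does eventually finish its work; the caller simply never learns of it, which is why a timeout alone cannot tell you whether the request was processed.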

The opposite of bounded. 558 | Glossary Index A aborts (transactions), 222, 224 in two-phase commit, 356 performance of optimistic concurrency con‐ trol, 266 retrying aborted transactions, 231 abstraction, 21, 27, 222, 266, 321 access path (in network model), 37, 60 accidental complexity, removing, 21 accountability, 535 ACID properties (transactions), 90, 223 atomicity, 223, 228 consistency, 224, 529 durability, 226 isolation, 225, 228 acknowledgements (messaging), 445 active/active replication (see multi-leader repli‐ cation) active/passive replication (see leader-based rep‐ lication) ActiveMQ (messaging), 137, 444 distributed transaction support, 361 ActiveRecord (object-relational mapper), 30, 232 actor model, 138 (see also message-passing) comparison to Pregel model, 425 comparison to stream processing, 468 Advanced Message Queuing Protocol (see AMQP) aerospace systems, 6, 10, 305, 372 aggregation data cubes and materialized views, 101 in batch processes, 406 in stream processes, 466 aggregation pipeline query language, 48 Agile, 22 minimizing irreversibility, 414, 497 moving faster with confidence, 532 Unix philosophy, 394 agreement, 365 (see also consensus) Airflow (workflow scheduler), 402 Ajax, 131 Akka (actor framework), 139 algorithms algorithm correctness, 308 B-trees, 79-83 for distributed systems, 306 hash indexes, 72-75 mergesort, 76, 402, 405 red-black trees, 78 SSTables and LSM-trees, 76-79 all-to-all replication topologies, 175 AllegroGraph (database), 50 ALTER TABLE statement (SQL), 40, 111 Amazon Dynamo (database), 177 Amazon Web Services (AWS), 8 Kinesis Streams (messaging), 448 network reliability, 279 postmortems, 9 RedShift (database), 93 S3 (object storage), 398 checking data integrity, 530 amplification of bias, 534 of failures, 364, 495 Index | 559 of tail latency, 16, 207 write amplification, 84 AMQP (Advanced Message Queuing Protocol), 444 (see also messaging systems) comparison to log-based messaging, 448, 451 message ordering, 446 
analytics, 90 comparison to transaction processing, 91 data warehousing (see data warehousing) parallel query execution in MPP databases, 415 predictive (see predictive analytics) relation to batch processing, 411 schemas for, 93-95 snapshot isolation for queries, 238 stream analytics, 466 using MapReduce, analysis of user activity events (example), 404 anti-caching (in-memory databases), 89 anti-entropy, 178 Apache ActiveMQ (see ActiveMQ) Apache Avro (see Avro) Apache Beam (see Beam) Apache BookKeeper (see BookKeeper) Apache Cassandra (see Cassandra) Apache CouchDB (see CouchDB) Apache Curator (see Curator) Apache Drill (see Drill) Apache Flink (see Flink) Apache Giraph (see Giraph) Apache Hadoop (see Hadoop) Apache HAWQ (see HAWQ) Apache HBase (see HBase) Apache Helix (see Helix) Apache Hive (see Hive) Apache Impala (see Impala) Apache Jena (see Jena) Apache Kafka (see Kafka) Apache Lucene (see Lucene) Apache MADlib (see MADlib) Apache Mahout (see Mahout) Apache Oozie (see Oozie) Apache Parquet (see Parquet) Apache Qpid (see Qpid) Apache Samza (see Samza) Apache Solr (see Solr) Apache Spark (see Spark) 560 | Index Apache Storm (see Storm) Apache Tajo (see Tajo) Apache Tez (see Tez) Apache Thrift (see Thrift) Apache ZooKeeper (see ZooKeeper) Apama (stream analytics), 466 append-only B-trees, 82, 242 append-only files (see logs) Application Programming Interfaces (APIs), 5, 27 for batch processing, 403 for change streams, 456 for distributed transactions, 361 for graph processing, 425 for services, 131-136 (see also services) evolvability, 136 RESTful, 133 SOAP, 133 application state (see state) approximate search (see similarity search) archival storage, data from databases, 131 arcs (see edges) arithmetic mean, 14 ASCII text, 119, 395 ASN.1 (schema language), 127 asynchronous networks, 278, 553 comparison to synchronous networks, 284 formal model, 307 asynchronous replication, 154, 553 conflict detection, 172 data loss on failover, 157 reads from asynchronous 
follower, 162 Asynchronous Transfer Mode (ATM), 285 atomic broadcast (see total order broadcast) atomic clocks (caesium clocks), 294, 295 (see also clocks) atomicity (concurrency), 553 atomic increment-and-get, 351 compare-and-set, 245, 327 (see also compare-and-set operations) replicated operations, 246 write operations, 243 atomicity (transactions), 223, 228, 553 atomic commit, 353 avoiding, 523, 528 blocking and nonblocking, 359 in stream processing, 360, 477 maintaining derived data, 453 for multi-object transactions, 229 for single-object writes, 230 auditability, 528-533 designing for, 531 self-auditing systems, 530 through immutability, 460 tools for auditable data systems, 532 availability, 8 (see also fault tolerance) in CAP theorem, 337 in service level agreements (SLAs), 15 Avro (data format), 122-127 code generation, 127 dynamically generated schemas, 126 object container files, 125, 131, 414 reader determining writer’s schema, 125 schema evolution, 123 use in Hadoop, 414 awk (Unix tool), 391 AWS (see Amazon Web Services) Azure (see Microsoft) B B-trees (indexes), 79-83 append-only/copy-on-write variants, 82, 242 branching factor, 81 comparison to LSM-trees, 83-85 crash recovery, 82 growing by splitting a page, 81 optimizations, 82 similarity to dynamic partitioning, 212 backpressure, 441, 553 in TCP, 282 backups database snapshot for replication, 156 integrity of, 530 snapshot isolation for, 238 use for ETL processes, 405 backward compatibility, 112 BASE, contrast to ACID, 223 bash shell (Unix), 70, 395, 503 batch processing, 28, 389-431, 553 combining with stream processing lambda architecture, 497 unifying technologies, 498 comparison to MPP databases, 414-418 comparison to stream processing, 464 comparison to Unix, 413-414 dataflow engines, 421-423 fault tolerance, 406, 414, 422, 442 for data integration, 494-498 graphs and iterative processing, 424-426 high-level APIs and languages, 403, 426-429 log-based messaging and, 451 maintaining derived 
state, 495 MapReduce and distributed filesystems, 397-413 (see also MapReduce) measuring performance, 13, 390 outputs, 411-413 key-value stores, 412 search indexes, 411 using Unix tools (example), 391-394 Bayou (database), 522 Beam (dataflow library), 498 bias, 534 big ball of mud, 20 Bigtable data model, 41, 99 binary data encodings, 115-128 Avro, 122-127 MessagePack, 116-117 Thrift and Protocol Buffers, 117-121 binary encoding based on schemas, 127 by network drivers, 128 binary strings, lack of support in JSON and XML, 114 BinaryProtocol encoding (Thrift), 118 Bitcask (storage engine), 72 crash recovery, 74 Bitcoin (cryptocurrency), 532 Byzantine fault tolerance, 305 concurrency bugs in exchanges, 233 bitmap indexes, 97 blockchains, 532 Byzantine fault tolerance, 305 blocking atomic commit, 359 Bloom (programming language), 504 Bloom filter (algorithm), 79, 466 BookKeeper (replicated log), 372 Bottled Water (change data capture), 455 bounded datasets, 430, 439, 553 (see also batch processing) bounded delays, 553 in networks, 285 process pauses, 298 broadcast hash joins, 409 Index | 561 brokerless messaging, 442 Brubeck (metrics aggregator), 442 BTM (transaction coordinator), 356 bulk synchronous parallel (BSP) model, 425 bursty network traffic patterns, 285 business data processing, 28, 90, 390 byte sequence, encoding data in, 112 Byzantine faults, 304-306, 307, 553 Byzantine fault-tolerant systems, 305, 532 Byzantine Generals Problem, 304 consensus algorithms and, 366 C caches, 89, 553 and materialized views, 101 as derived data, 386, 499-504 database as cache of transaction log, 460 in CPUs, 99, 338, 428 invalidation and maintenance, 452, 467 linearizability, 324 CAP theorem, 336-338, 554 Cascading (batch processing), 419, 427 hash joins, 409 workflows, 403 cascading failures, 9, 214, 281 Cascalog (batch processing), 60 Cassandra (database) column-family data model, 41, 99 compaction strategy, 79 compound primary key, 204 gossip protocol, 216 hash 
partitioning, 203-205 last-write-wins conflict resolution, 186, 292 leaderless replication, 177 linearizability, lack of, 335 log-structured storage, 78 multi-datacenter support, 184 partitioning scheme, 213 secondary indexes, 207 sloppy quorums, 184 cat (Unix tool), 391 causal context, 191 (see also causal dependencies) causal dependencies, 186-191 capturing, 191, 342, 494, 514 by total ordering, 493 causal ordering, 339 in transactions, 262 sending message to friends (example), 494 562 | Index causality, 554 causal ordering, 339-343 linearizability and, 342 total order consistent with, 344, 345 consistency with, 344-347 consistent snapshots, 340 happens-before relationship, 186 in serializable transactions, 262-265 mismatch with clocks, 292 ordering events to capture, 493 violations of, 165, 176, 292, 340 with synchronized clocks, 294 CEP (see complex event processing) certificate transparency, 532 chain replication, 155 linearizable reads, 351 change data capture, 160, 454 API support for change streams, 456 comparison to event sourcing, 457 implementing, 454 initial snapshot, 455 log compaction, 456 changelogs, 460 change data capture, 454 for operator state, 479 generating with triggers, 455 in stream joins, 474 log compaction, 456 maintaining derived state, 452 Chaos Monkey, 7, 280 checkpointing in batch processors, 422, 426 in high-performance computing, 275 in stream processors, 477, 523 chronicle data model, 458 circuit-switched networks, 284 circular buffers, 450 circular replication topologies, 175 clickstream data, analysis of, 404 clients calling services, 131 pushing state changes to, 512 request routing, 214 stateful and offline-capable, 170, 511 clocks, 287-299 atomic (caesium) clocks, 294, 295 confidence interval, 293-295 for global snapshots, 294 logical (see logical clocks) skew, 291-294, 334 slewing, 289 synchronization and accuracy, 289-291 synchronization using GPS, 287, 290, 294, 295 time-of-day versus monotonic clocks, 288 timestamping 
events, 471 cloud computing, 146, 275 need for service discovery, 372 network glitches, 279 shared resources, 284 single-machine reliability, 8 Cloudera Impala (see Impala) clustered indexes, 86 CODASYL model, 36 (see also network model) code generation with Avro, 127 with Thrift and Protocol Buffers, 118 with WSDL, 133 collaborative editing multi-leader replication and, 170 column families (Bigtable), 41, 99 column-oriented storage, 95-101 column compression, 97 distinction between column families and, 99 in batch processors, 428 Parquet, 96, 131, 414 sort order in, 99-100 vectorized processing, 99, 428 writing to, 101 comma-separated values (see CSV) command query responsibility segregation (CQRS), 462 commands (event sourcing), 459 commits (transactions), 222 atomic commit, 354-355 (see also atomicity; transactions) read committed isolation, 234 three-phase commit (3PC), 359 two-phase commit (2PC), 355-359 commutative operations, 246 compaction of changelogs, 456 (see also log compaction) for stream operator state, 479 of log-structured storage, 73 issues with, 84 size-tiered and leveled approaches, 79 CompactProtocol encoding (Thrift), 119 compare-and-set operations, 245, 327 implementing locks, 370 implementing uniqueness constraints, 331 implementing with total order broadcast, 350 relation to consensus, 335, 350, 352, 374 relation to transactions, 230 compatibility, 112, 128 calling services, 136 properties of encoding formats, 139 using databases, 129-131 using message-passing, 138 compensating transactions, 355, 461, 526 complex event processing (CEP), 465 complexity distilling in theoretical models, 310 hiding using abstraction, 27 of software systems, managing, 20 composing data systems (see unbundling data‐ bases) compute-intensive applications, 3, 275 concatenated indexes, 87 in Cassandra, 204 Concord (stream processor), 466 concurrency actor programming model, 138, 468 (see also message-passing) bugs from weak transaction isolation, 233 conflict 
resolution, 171, 174 detecting concurrent writes, 184-191 dual writes, problems with, 453 happens-before relationship, 186 in replicated systems, 161-191, 324-338 lost updates, 243 multi-version concurrency control (MVCC), 239 optimistic concurrency control, 261 ordering of operations, 326, 341 reducing, through event logs, 351, 462, 507 time and relativity, 187 transaction isolation, 225 write skew (transaction isolation), 246-251 conflict-free replicated datatypes (CRDTs), 174 conflicts conflict detection, 172 causal dependencies, 186, 342 in consensus algorithms, 368 in leaderless replication, 184 Index | 563 in log-based systems, 351, 521 in nonlinearizable systems, 343 in serializable snapshot isolation (SSI), 264 in two-phase commit, 357, 364 conflict resolution automatic conflict resolution, 174 by aborting transactions, 261 by apologizing, 527 convergence, 172-174 in leaderless systems, 190 last write wins (LWW), 186, 292 using atomic operations, 246 using custom logic, 173 determining what is a conflict, 174, 522 in multi-leader replication, 171-175 avoiding conflicts, 172 lost updates, 242-246 materializing, 251 relation to operation ordering, 339 write skew (transaction isolation), 246-251 congestion (networks) avoidance, 282 limiting accuracy of clocks, 293 queueing delays, 282 consensus, 321, 364-375, 554 algorithms, 366-368 preventing split brain, 367 safety and liveness properties, 365 using linearizable operations, 351 cost of, 369 distributed transactions, 352-375 in practice, 360-364 two-phase commit, 354-359 XA transactions, 361-364 impossibility of, 353 membership and coordination services, 370-373 relation to compare-and-set, 335, 350, 352, 374 relation to replication, 155, 349 relation to uniqueness constraints, 521 consistency, 224, 524 across different databases, 157, 452, 462, 492 causal, 339-348, 493 consistent prefix reads, 165-167 consistent snapshots, 156, 237-242, 294, 455, 500 (see also snapshots) 564 | Index crash recovery, 82 
enforcing constraints (see constraints) eventual, 162, 322 (see also eventual consistency) in ACID transactions, 224, 529 in CAP theorem, 337 linearizability, 324-338 meanings of, 224 monotonic reads, 164-165 of secondary indexes, 231, 241, 354, 491, 500 ordering guarantees, 339-352 read-after-write, 162-164 sequential, 351 strong (see linearizability) timeliness and integrity, 524 using quorums, 181, 334 consistent hashing, 204 consistent prefix reads, 165 constraints (databases), 225, 248 asynchronously checked, 526 coordination avoidance, 527 ensuring idempotence, 519 in log-based systems, 521-524 across multiple partitions, 522 in two-phase commit, 355, 357 relation to consensus, 374, 521 relation to event ordering, 347 requiring linearizability, 330 Consul (service discovery), 372 consumers (message streams), 137, 440 backpressure, 441 consumer offsets in logs, 449 failures, 445, 449 fan-out, 11, 445, 448 load balancing, 444, 448 not keeping up with producers, 441, 450, 502 context switches, 14, 297 convergence (conflict resolution), 172-174, 322 coordination avoidance, 527 cross-datacenter, 168, 493 cross-partition ordering, 256, 294, 348, 523 services, 330, 370-373 coordinator (in 2PC), 356 failure, 358 in XA transactions, 361-364 recovery, 363 copy-on-write (B-trees), 82, 242 CORBA (Common Object Request Broker Architecture), 134 correctness, 6 auditability, 528-533 Byzantine fault tolerance, 305, 532 dealing with partial failures, 274 in log-based systems, 521-524 of algorithm within system model, 308 of compensating transactions, 355 of consensus, 368 of derived data, 497, 531 of immutable data, 461 of personal data, 535, 540 of time, 176, 289-295 of transactions, 225, 515, 529 timeliness and integrity, 524-528 corruption of data detecting, 519, 530-533 due to pathological memory access, 529 due to radiation, 305 due to split brain, 158, 302 due to weak transaction isolation, 233 formalization in consensus, 366 integrity as absence of, 524 network 
packets, 306 on disks, 227 preventing using write-ahead logs, 82 recovering from, 414, 460 Couchbase (database) durability, 89 hash partitioning, 203-204, 211 rebalancing, 213 request routing, 216 CouchDB (database) B-tree storage, 242 change feed, 456 document data model, 31 join support, 34 MapReduce support, 46, 400 replication, 170, 173 covering indexes, 86 CPUs cache coherence and memory barriers, 338 caching and pipelining, 99, 428 increasing parallelism, 43 CRDTs (see conflict-free replicated datatypes) CREATE INDEX statement (SQL), 85, 500 credit rating agencies, 535 Crunch (batch processing), 419, 427 hash joins, 409 sharded joins, 408 workflows, 403 cryptography defense against attackers, 306 end-to-end encryption and authentication, 519, 543 proving integrity of data, 532 CSS (Cascading Style Sheets), 44 CSV (comma-separated values), 70, 114, 396 Curator (ZooKeeper recipes), 330, 371 curl (Unix tool), 135, 397 cursor stability, 243 Cypher (query language), 52 comparison to SPARQL, 59 D data corruption (see corruption of data) data cubes, 102 data formats (see encoding) data integration, 490-498, 543 batch and stream processing, 494-498 lambda architecture, 497 maintaining derived state, 495 reprocessing data, 496 unifying, 498 by unbundling databases, 499-515 comparison to federated databases, 501 combining tools by deriving data, 490-494 derived data versus distributed transac‐ tions, 492 limits of total ordering, 493 ordering events to capture causality, 493 reasoning about dataflows, 491 need for, 385 data lakes, 415 data locality (see locality) data models, 27-64 graph-like models, 49-63 Datalog language, 60-63 property graphs, 50 RDF and triple-stores, 55-59 query languages, 42-48 relational model versus document model, 28-42 data protection regulations, 542 data systems, 3 about, 4 Index | 565 concerns when designing, 5 future of, 489-544 correctness, constraints, and integrity, 515-533 data integration, 490-498 unbundling databases, 499-515 
heterogeneous, keeping in sync, 452 maintainability, 18-22 possible faults in, 221 reliability, 6-10 hardware faults, 7 human errors, 9 importance of, 10 software errors, 8 scalability, 10-18 unreliable clocks, 287-299 data warehousing, 91-95, 554 comparison to data lakes, 415 ETL (extract-transform-load), 92, 416, 452 keeping data systems in sync, 452 schema design, 93 slowly changing dimension (SCD), 476 data-intensive applications, 3 database triggers (see triggers) database-internal distributed transactions, 360, 364, 477 databases archival storage, 131 comparison of message brokers to, 443 dataflow through, 129 end-to-end argument for, 519-520 checking integrity, 531 inside-out, 504 (see also unbundling databases) output from batch workflows, 412 relation to event streams, 451-464 (see also changelogs) API support for change streams, 456, 506 change data capture, 454-457 event sourcing, 457-459 keeping systems in sync, 452-453 philosophy of immutable events, 459-464 unbundling, 499-515 composing data storage technologies, 499-504 designing applications around dataflow, 504-509 566 | Index observing derived state, 509-515 datacenters geographically distributed, 145, 164, 278, 493 multi-tenancy and shared resources, 284 network architecture, 276 network faults, 279 replication across multiple, 169 leaderless replication, 184 multi-leader replication, 168, 335 dataflow, 128-139, 504-509 correctness of dataflow systems, 525 differential, 504 message-passing, 136-139 reasoning about, 491 through databases, 129 through services, 131-136 dataflow engines, 421-423 comparison to stream processing, 464 directed acyclic graphs (DAG), 424 partitioning, approach to, 429 support for declarative queries, 427 Datalog (query language), 60-63 datatypes binary strings in XML and JSON, 114 conflict-free, 174 in Avro encodings, 122 in Thrift and Protocol Buffers, 121 numbers in XML and JSON, 114 Datomic (database) B-tree storage, 242 data model, 50, 57 Datalog query language, 60 
excision (deleting data), 463 languages for transactions, 255 serial execution of transactions, 253 deadlocks detection, in two-phase commit (2PC), 364 in two-phase locking (2PL), 258 Debezium (change data capture), 455 declarative languages, 42, 554 Bloom, 504 CSS and XSL, 44 Cypher, 52 Datalog, 60 for batch processing, 427 recursive SQL queries, 53 relational algebra and SQL, 42 SPARQL, 59 delays bounded network delays, 285 bounded process pauses, 298 unbounded network delays, 282 unbounded process pauses, 296 deleting data, 463 denormalization (data representation), 34, 554 costs, 39 in derived data systems, 386 materialized views, 101 updating derived data, 228, 231, 490 versus normalization, 462 derived data, 386, 439, 554 from change data capture, 454 in event sourcing, 458-458 maintaining derived state through logs, 452-457, 459-463 observing, by subscribing to streams, 512 outputs of batch and stream processing, 495 through application code, 505 versus distributed transactions, 492 deterministic operations, 255, 274, 554 accidental nondeterminism, 423 and fault tolerance, 423, 426 and idempotence, 478, 492 computing derived data, 495, 526, 531 in state machine replication, 349, 452, 458 joins, 476 DevOps, 394 differential dataflow, 504 dimension tables, 94 dimensional modeling (see star schemas) directed acyclic graphs (DAGs), 424 dirty reads (transaction isolation), 234 dirty writes (transaction isolation), 235 discrimination, 534 disks (see hard disks) distributed actor frameworks, 138 distributed filesystems, 398-399 decoupling from query engines, 417 indiscriminately dumping data into, 415 use by MapReduce, 402 distributed systems, 273-312, 554 Byzantine faults, 304-306 cloud versus supercomputing, 275 detecting network faults, 280 faults and partial failures, 274-277 formalization of consensus, 365 impossibility results, 338, 353 issues with failover, 157 limitations of distributed transactions, 363 multi-datacenter, 169, 335 network problems, 277-286 
quorums, relying on, 301 reasons for using, 145, 151 synchronized clocks, relying on, 291-295 system models, 306-310 use of clocks and time, 287 distributed transactions (see transactions) Django (web framework), 232 DNS (Domain Name System), 216, 372 Docker (container manager), 506 document data model, 30-42 comparison to relational model, 38-42 document references, 38, 403 document-oriented databases, 31 many-to-many relationships and joins, 36 multi-object transactions, need for, 231 versus relational model convergence of models, 41 data locality, 41 document-partitioned indexes, 206, 217, 411 domain-driven design (DDD), 457 DRBD (Distributed Replicated Block Device), 153 drift (clocks), 289 Drill (query engine), 93 Druid (database), 461 Dryad (dataflow engine), 421 dual writes, problems with, 452, 507 duplicates, suppression of, 517 (see also idempotence) using a unique ID, 518, 522 durability (transactions), 226, 554 duration (time), 287 measurement with monotonic clocks, 288 dynamic partitioning, 212 dynamically typed languages analogy to schema-on-read, 40 code generation and, 127 Dynamo-style databases (see leaderless replica‐ tion) E edges (in graphs), 49, 403 property graph model, 50 edit distance (full-text search), 88 effectively-once semantics, 476, 516 Index | 567 (see also exactly-once semantics) preservation of integrity, 525 elastic systems, 17 Elasticsearch (search server) document-partitioned indexes, 207 partition rebalancing, 211 percolator (stream search), 467 usage example, 4 use of Lucene, 79 ElephantDB (database), 413 Elm (programming language), 504, 512 encodings (data formats), 111-128 Avro, 122-127 binary variants of JSON and XML, 115 compatibility, 112 calling services, 136 using databases, 129-131 using message-passing, 138 defined, 113 JSON, XML, and CSV, 114 language-specific formats, 113 merits of schemas, 127 representations of data, 112 Thrift and Protocol Buffers, 117-121 end-to-end argument, 277, 519-520 checking integrity, 531 
publish/subscribe streams, 512 enrichment (stream), 473 Enterprise JavaBeans (EJB), 134 entities (see vertices) epoch (consensus algorithms), 368 epoch (Unix timestamps), 288 equi-joins, 403 erasure coding (error correction), 398 Erlang OTP (actor framework), 139 error handling for network faults, 280 in transactions, 231 error-correcting codes, 277, 398 Esper (CEP engine), 466 etcd (coordination service), 370-373 linearizable operations, 333 locks and leader election, 330 quorum reads, 351 service discovery, 372 use of Raft algorithm, 349, 353 Ethereum (blockchain), 532 Ethernet (networks), 276, 278, 285 packet checksums, 306, 519 568 | Index Etherpad (collaborative editor), 170 ethics, 533-543 code of ethics and professional practice, 533 legislation and self-regulation, 542 predictive analytics, 533-536 amplifying bias, 534 feedback loops, 536 privacy and tracking, 536-543 consent and freedom of choice, 538 data as assets and power, 540 meaning of privacy, 539 surveillance, 537 respect, dignity, and agency, 543, 544 unintended consequences, 533, 536 ETL (extract-transform-load), 92, 405, 452, 554 use of Hadoop for, 416 event sourcing, 457-459 commands and events, 459 comparison to change data capture, 457 comparison to lambda architecture, 497 deriving current state from event log, 458 immutability and auditability, 459, 531 large, reliable data systems, 519, 526 Event Store (database), 458 event streams (see streams) events, 440 deciding on total order of, 493 deriving views from event log, 461 difference to commands, 459 event time versus processing time, 469, 477, 498 immutable, advantages of, 460, 531 ordering to capture causality, 493 reads as, 513 stragglers, 470, 498 timestamp of, in stream processing, 471 EventSource (browser API), 512 eventual consistency, 152, 162, 308, 322 (see also conflicts) and perpetual inconsistency, 525 evolvability, 21, 111 calling services, 136 graph-structured data, 52 of databases, 40, 129-131, 461, 497 of message-passing, 
138 reprocessing data, 496, 498 schema evolution in Avro, 123 schema evolution in Thrift and Protocol Buffers, 120 schema-on-read, 39, 111, 128 exactly-once semantics, 360, 476, 516 parity with batch processors, 498 preservation of integrity, 525 exclusive mode (locks), 258 eXtended Architecture transactions (see XA transactions) extract-transform-load (see ETL) F Facebook Presto (query engine), 93 React, Flux, and Redux (user interface libra‐ ries), 512 social graphs, 49 Wormhole (change data capture), 455 fact tables, 93 failover, 157, 554 (see also leader-based replication) in leaderless replication, absence of, 178 leader election, 301, 348, 352 potential problems, 157 failures amplification by distributed transactions, 364, 495 failure detection, 280 automatic rebalancing causing cascading failures, 214 perfect failure detectors, 359 timeouts and unbounded delays, 282, 284 using ZooKeeper, 371 faults versus, 7 partial failures in distributed systems, 275-277, 310 fan-out (messaging systems), 11, 445 fault tolerance, 6-10, 555 abstractions for, 321 formalization in consensus, 365-369 use of replication, 367 human fault tolerance, 414 in batch processing, 406, 414, 422, 425 in log-based systems, 520, 524-526 in stream processing, 476-479 atomic commit, 477 idempotence, 478 maintaining derived state, 495 microbatching and checkpointing, 477 rebuilding state after a failure, 478 of distributed transactions, 362-364 transaction atomicity, 223, 354-361 faults, 6 Byzantine faults, 304-306 failures versus, 7 handled by transactions, 221 handling in supercomputers and cloud computing, 275 hardware, 7 in batch processing versus distributed data‐ bases, 417 in distributed systems, 274-277 introducing deliberately, 7, 280 network faults, 279-281 asymmetric faults, 300 detecting, 280 tolerance of, in multi-leader replication, 169 software errors, 8 tolerating (see fault tolerance) federated databases, 501 fence (CPU instruction), 338 fencing (preventing split brain), 158, 
302-304 generating fencing tokens, 349, 370 properties of fencing tokens, 308 stream processors writing to databases, 478, 517 Fibre Channel (networks), 398 field tags (Thrift and Protocol Buffers), 119-121 file descriptors (Unix), 395 financial data, 460 Firebase (database), 456 Flink (processing framework), 421-423 dataflow APIs, 427 fault tolerance, 422, 477, 479 Gelly API (graph processing), 425 integration of batch and stream processing, 495, 498 machine learning, 428 query optimizer, 427 stream processing, 466 flow control, 282, 441, 555 FLP result (on consensus), 353 FlumeJava (dataflow library), 403, 427 followers, 152, 555 (see also leader-based replication) foreign keys, 38, 403 forward compatibility, 112 forward decay (algorithm), 16 Index | 569 Fossil (version control system), 463 shunning (deleting data), 463 FoundationDB (database) serializable transactions, 261, 265, 364 fractal trees, 83 full table scans, 403 full-text search, 555 and fuzzy indexes, 88 building search indexes, 411 Lucene storage engine, 79 functional reactive programming (FRP), 504 functional requirements, 22 futures (asynchronous operations), 135 fuzzy search (see similarity search) G garbage collection immutability and, 463 process pauses for, 14, 296-299, 301 (see also process pauses) genome analysis, 63, 429 geographically distributed datacenters, 145, 164, 278, 493 geospatial indexes, 87 Giraph (graph processing), 425 Git (version control system), 174, 342, 463 GitHub, postmortems, 157, 158, 309 global indexes (see term-partitioned indexes) GlusterFS (distributed filesystem), 398 GNU Coreutils (Linux), 394 GoldenGate (change data capture), 161, 170, 455 (see also Oracle) Google Bigtable (database) data model (see Bigtable data model) partitioning scheme, 199, 202 storage layout, 78 Chubby (lock service), 370 Cloud Dataflow (stream processor), 466, 477, 498 (see also Beam) Cloud Pub/Sub (messaging), 444, 448 Docs (collaborative editor), 170 Dremel (query engine), 93, 96 
FlumeJava (dataflow library), 403, 427 GFS (distributed file system), 398 gRPC (RPC framework), 135 MapReduce (batch processing), 390 (see also MapReduce) building search indexes, 411 task preemption, 418 Pregel (graph processing), 425 Spanner (see Spanner) TrueTime (clock API), 294 gossip protocol, 216 government use of data, 541 GPS (Global Positioning System) use for clock synchronization, 287, 290, 294, 295 GraphChi (graph processing), 426 graphs, 555 as data models, 49-63 example of graph-structured data, 49 property graphs, 50 RDF and triple-stores, 55-59 versus the network model, 60 processing and analysis, 424-426 fault tolerance, 425 Pregel processing model, 425 query languages Cypher, 52 Datalog, 60-63 recursive SQL queries, 53 SPARQL, 59-59 Gremlin (graph query language), 50 grep (Unix tool), 392 GROUP BY clause (SQL), 406 grouping records in MapReduce, 406 handling skew, 407 H Hadoop (data infrastructure) comparison to distributed databases, 390 comparison to MPP databases, 414-418 comparison to Unix, 413-414, 499 diverse processing models in ecosystem, 417 HDFS distributed filesystem (see HDFS) higher-level tools, 403 join algorithms, 403-410 (see also MapReduce) MapReduce (see MapReduce) YARN (see YARN) happens-before relationship, 340 capturing, 187 concurrency and, 186 hard disks access patterns, 84 detecting corruption, 519, 530 faults in, 7, 227 sequential write throughput, 75, 450 hardware faults, 7 hash indexes, 72-75 broadcast hash joins, 409 partitioned hash joins, 409 hash partitioning, 203-205, 217 consistent hashing, 204 problems with hash mod N, 210 range queries, 204 suitable hash functions, 203 with fixed number of partitions, 210 HAWQ (database), 428 HBase (database) bug due to lack of fencing, 302 bulk loading, 413 column-family data model, 41, 99 dynamic partitioning, 212 key-range partitioning, 202 log-structured storage, 78 request routing, 216 size-tiered compaction, 79 use of HDFS, 417 use of ZooKeeper, 370 HDFS
(Hadoop Distributed File System), 398-399 (see also distributed filesystems) checking data integrity, 530 decoupling from query engines, 417 indiscriminately dumping data into, 415 metadata about datasets, 410 NameNode, 398 use by Flink, 479 use by HBase, 212 use by MapReduce, 402 HdrHistogram (numerical library), 16 head (Unix tool), 392 head vertex (property graphs), 51 head-of-line blocking, 15 heap files (databases), 86 Helix (cluster manager), 216 heterogeneous distributed transactions, 360, 364 heuristic decisions (in 2PC), 363 Hibernate (object-relational mapper), 30 hierarchical model, 36 high availability (see fault tolerance) high-frequency trading, 290, 299 high-performance computing (HPC), 275 hinted handoff, 183 histograms, 16 Hive (query engine), 419, 427 for data warehouses, 93 HCatalog and metastore, 410 map-side joins, 409 query optimizer, 427 skewed joins, 408 workflows, 403 Hollerith machines, 390 hopping windows (stream processing), 472 (see also windows) horizontal scaling (see scaling out) HornetQ (messaging), 137, 444 distributed transaction support, 361 hot spots, 201 due to celebrities, 205 for time-series data, 203 in batch processing, 407 relieving, 205 hot standbys (see leader-based replication) HTTP, use in APIs (see services) human errors, 9, 279, 414 HyperDex (database), 88 HyperLogLog (algorithm), 466 I I/O operations, waiting for, 297 IBM DB2 (database) distributed transaction support, 361 recursive query support, 54 serializable isolation, 242, 257 XML and JSON support, 30, 42 electromechanical card-sorting machines, 390 IMS (database), 36 imperative query APIs, 46 InfoSphere Streams (CEP engine), 466 MQ (messaging), 444 distributed transaction support, 361 System R (database), 222 WebSphere (messaging), 137 idempotence, 134, 478, 555 by giving operations unique IDs, 518, 522 idempotent operations, 517 immutability advantages of, 460, 531 deriving state from event log, 459-464 for crash recovery, 75 in B-trees, 82, 242
in event sourcing, 457 inputs to Unix commands, 397 limitations of, 463 Impala (query engine) for data warehouses, 93 hash joins, 409 native code generation, 428 use of HDFS, 417 impedance mismatch, 29 imperative languages, 42 setting element styles (example), 45 in doubt (transaction status), 358 holding locks, 362 orphaned transactions, 363 in-memory databases, 88 durability, 227 serial transaction execution, 253 incidents cascading failures, 9 crashes due to leap seconds, 290 data corruption and financial losses due to concurrency bugs, 233 data corruption on hard disks, 227 data loss due to last-write-wins, 173, 292 data on disks unreadable, 309 deleted items reappearing, 174 disclosure of sensitive data due to primary key reuse, 157 errors in transaction serializability, 529 gigabit network interface with 1 Kb/s throughput, 311 network faults, 279 network interface dropping only inbound packets, 279 network partitions and whole-datacenter failures, 275 poor handling of network faults, 280 sending message to ex-partner, 494 sharks biting undersea cables, 279 split brain due to 1-minute packet delay, 158, 279 vibrations in server rack, 14 violation of uniqueness constraint, 529 indexes, 71, 555 and snapshot isolation, 241 as derived data, 386, 499-504 B-trees, 79-83 building in batch processes, 411 clustered, 86 comparison of B-trees and LSM-trees, 83-85 concatenated, 87 covering (with included columns), 86 creating, 500 full-text search, 88 geospatial, 87 hash, 72-75 index-range locking, 260 multi-column, 87 partitioning and secondary indexes, 206-209, 217 secondary, 85 (see also secondary indexes) problems with dual writes, 452, 491 SSTables and LSM-trees, 76-79 updating when data changes, 452, 467 Industrial Revolution, 541 InfiniBand (networks), 285 InfiniteGraph (database), 50 InnoDB (storage engine) clustered index on primary key, 86 not preventing lost updates, 245 preventing write skew, 248, 257 serializable isolation, 257 snapshot isolation
support, 239 inside-out databases, 504 (see also unbundling databases) integrating different data systems (see data integration) integrity, 524 coordination-avoiding data systems, 528 correctness of dataflow systems, 525 in consensus formalization, 365 integrity checks, 530 (see also auditing) end-to-end, 519, 531 use of snapshot isolation, 238 maintaining despite software bugs, 529 Interface Definition Language (IDL), 117, 122 intermediate state, materialization of, 420-423 internet services, systems for implementing, 275 invariants, 225 (see also constraints) inversion of control, 396 IP (Internet Protocol) unreliability of, 277 ISDN (Integrated Services Digital Network), 284 isolation (in transactions), 225, 228, 555 correctness and, 515 for single-object writes, 230 serializability, 251-266 actual serial execution, 252-256 serializable snapshot isolation (SSI), 261-266 two-phase locking (2PL), 257-261 violating, 228 weak isolation levels, 233-251 preventing lost updates, 242-246 read committed, 234-237 snapshot isolation, 237-242 iterative processing, 424-426 J Java Database Connectivity (JDBC) distributed transaction support, 361 network drivers, 128 Java Enterprise Edition (EE), 134, 356, 361 Java Message Service (JMS), 444 (see also messaging systems) comparison to log-based messaging, 448, 451 distributed transaction support, 361 message ordering, 446 Java Transaction API (JTA), 355, 361 Java Virtual Machine (JVM) bytecode generation, 428 garbage collection pauses, 296 process reuse in batch processors, 422 JavaScript in MapReduce querying, 46 setting element styles (example), 45 use in advanced queries, 48 Jena (RDF framework), 57 Jepsen (fault tolerance testing), 515 jitter (network delay), 284 joins, 555 by index lookup, 403 expressing as relational operators, 427 in relational and document databases, 34 MapReduce map-side joins, 408-410 broadcast hash joins, 409 merge joins, 410 partitioned hash joins, 409 MapReduce reduce-side joins, 403-408 handling 
skew, 407 sort-merge joins, 405 parallel execution of, 415 secondary indexes and, 85 stream joins, 472-476 stream-stream join, 473 stream-table join, 473 table-table join, 474 time-dependence of, 475 support in document databases, 42 JOTM (transaction coordinator), 356 JSON Avro schema representation, 122 binary variants, 115 for application data, issues with, 114 in relational databases, 30, 42 representing a résumé (example), 31 Juttle (query language), 504 K k-nearest neighbors, 429 Kafka (messaging), 137, 448 Kafka Connect (database integration), 457, 461 Kafka Streams (stream processor), 466, 467 fault tolerance, 479 leader-based replication, 153 log compaction, 456, 467 message offsets, 447, 478 request routing, 216 transaction support, 477 usage example, 4 Ketama (partitioning library), 213 key-value stores, 70 as batch process output, 412 hash indexes, 72-75 in-memory, 89 partitioning, 201-205 by hash of key, 203, 217 by key range, 202, 217 dynamic partitioning, 212 skew and hot spots, 205 Kryo (Java), 113 Kubernetes (cluster manager), 418, 506 L lambda architecture, 497 Lamport timestamps, 345 Large Hadron Collider (LHC), 64 last write wins (LWW), 173, 334 discarding concurrent writes, 186 problems with, 292 prone to lost updates, 246 late binding, 396 latency instability under two-phase locking, 259 network latency and resource utilization, 286 response time versus, 14 tail latency, 15, 207 leader-based replication, 152-161 (see also replication) failover, 157, 301 handling node outages, 156 implementation of replication logs change data capture, 454-457 (see also changelogs) statement-based, 158 trigger-based replication, 161 write-ahead log (WAL) shipping, 159 linearizability of operations, 333 locking and leader election, 330 log sequence number, 156, 449 read-scaling architecture, 161 relation to consensus, 367 setting up new followers, 155 synchronous versus asynchronous, 153-155 leaderless replication, 177-191 (see also replication)
detecting concurrent writes, 184-191 capturing happens-before relationship, 187 happens-before relationship and concurrency, 186 last write wins, 186 merging concurrently written values, 190 version vectors, 191 multi-datacenter, 184 quorums, 179-182 consistency limitations, 181-183, 334 sloppy quorums and hinted handoff, 183 read repair and anti-entropy, 178 leap seconds, 8, 290 in time-of-day clocks, 288 leases, 295 implementation with ZooKeeper, 370 need for fencing, 302 ledgers, 460 distributed ledger technologies, 532 legacy systems, maintenance of, 18 less (Unix tool), 397 LevelDB (storage engine), 78 leveled compaction, 79 Levenshtein automata, 88 limping (partial failure), 311 linearizability, 324-338, 555 cost of, 335-338 CAP theorem, 336 memory on multi-core CPUs, 338 definition, 325-329 implementing with total order broadcast, 350 in ZooKeeper, 370 of derived data systems, 492, 524 avoiding coordination, 527 of different replication methods, 332-335 using quorums, 334 relying on, 330-332 constraints and uniqueness, 330 cross-channel timing dependencies, 331 locking and leader election, 330 stronger than causal consistency, 342 using to implement total order broadcast, 351 versus serializability, 329 LinkedIn Azkaban (workflow scheduler), 402 Databus (change data capture), 161, 455 Espresso (database), 31, 126, 130, 153, 216 Helix (cluster manager) (see Helix) profile (example), 30 reference to company entity (example), 34 Rest.li (RPC framework), 135 Voldemort (database) (see Voldemort) Linux, leap second bug, 8, 290 liveness properties, 308 LMDB (storage engine), 82, 242 load approaches to coping with, 17 describing, 11 load testing, 16 load balancing (messaging), 444 local indexes (see document-partitioned indexes) locality (data access), 32, 41, 555 in batch processing, 400, 405, 421 in stateful clients, 170, 511 in stream processing, 474, 478, 508, 522 location transparency, 134 in the actor model, 138 locks, 556 deadlock, 258
distributed locking, 301-304, 330 fencing tokens, 303 implementation with ZooKeeper, 370 relation to consensus, 374 for transaction isolation in snapshot isolation, 239 in two-phase locking (2PL), 257-261 making operations atomic, 243 performance, 258 preventing dirty writes, 236 preventing phantoms with index-range locks, 260, 265 read locks (shared mode), 236, 258 shared mode and exclusive mode, 258 in two-phase commit (2PC) deadlock detection, 364 in-doubt transactions holding locks, 362 materializing conflicts with, 251 preventing lost updates by explicit locking, 244 log sequence number, 156, 449 logic programming languages, 504 logical clocks, 293, 343, 494 for read-after-write consistency, 164 logical logs, 160 logs (data structure), 71, 556 advantages of immutability, 460 compaction, 73, 79, 456, 460 for stream operator state, 479 creating using total order broadcast, 349 implementing uniqueness constraints, 522 log-based messaging, 446-451 comparison to traditional messaging, 448, 451 consumer offsets, 449 disk space usage, 450 replaying old messages, 451, 496, 498 slow consumers, 450 using logs for message storage, 447 log-structured storage, 71-79 log-structured merge tree (see LSM-trees) replication, 152, 158-161 change data capture, 454-457 (see also changelogs) coordination with snapshot, 156 logical (row-based) replication, 160 statement-based replication, 158 trigger-based replication, 161 write-ahead log (WAL) shipping, 159 scalability limits, 493 loose coupling, 396, 419, 502 lost updates (see updates) LSM-trees (indexes), 78-79 comparison to B-trees, 83-85 Lucene (storage engine), 79 building indexes in batch processes, 411 similarity search, 88 Luigi (workflow scheduler), 402 LWW (see last write wins) M machine learning ethical considerations, 534 (see also ethics) iterative processing, 424 models derived from training data, 505 statistical and numerical algorithms, 428 MADlib (machine learning toolkit), 428 magic scaling sauce, 18 Mahout
(machine learning toolkit), 428 maintainability, 18-22, 489 defined, 23 design principles for software systems, 19 evolvability (see evolvability) operability, 19 simplicity and managing complexity, 20 many-to-many relationships in document model versus relational model, 39 modeling as graphs, 49 many-to-one and many-to-many relationships, 33-36 many-to-one relationships, 34 MapReduce (batch processing), 390, 399-400 accessing external services within job, 404, 412 comparison to distributed databases designing for frequent faults, 417 diversity of processing models, 416 diversity of storage, 415 comparison to stream processing, 464 comparison to Unix, 413-414 disadvantages and limitations of, 419 fault tolerance, 406, 414, 422 higher-level tools, 403, 426 implementation in Hadoop, 400-403 the shuffle, 402 implementation in MongoDB, 46-48 machine learning, 428 map-side processing, 408-410 broadcast hash joins, 409 merge joins, 410 partitioned hash joins, 409 mapper and reducer functions, 399 materialization of intermediate state, 419-423 output of batch workflows, 411-413 building search indexes, 411 key-value stores, 412 reduce-side processing, 403-408 analysis of user activity events (example), 404 grouping records by same key, 406 handling skew, 407 sort-merge joins, 405 workflows, 402 marshalling (see encoding) massively parallel processing (MPP), 216 comparison to composing storage technologies, 502 comparison to Hadoop, 414-418, 428 master-master replication (see multi-leader replication) master-slave replication (see leader-based replication) materialization, 556 aggregate values, 101 conflicts, 251 intermediate state (batch processing), 420-423 materialized views, 101 as derived data, 386, 499-504 maintaining, using stream processing, 467, 475 Maven (Java build tool), 428 Maxwell (change data capture), 455 mean, 14 media monitoring, 467 median, 14 meeting room booking (example), 249, 259, 521 membership services, 372 Memcached
(caching server), 4, 89 memory in-memory databases, 88 durability, 227 serial transaction execution, 253 in-memory representation of data, 112 random bit-flips in, 529 use by indexes, 72, 77 memory barrier (CPU instruction), 338 MemSQL (database) in-memory storage, 89 read committed isolation, 236 memtable (in LSM-trees), 78 Mercurial (version control system), 463 merge joins, MapReduce map-side, 410 mergeable persistent data structures, 174 merging sorted files, 76, 402, 405 Merkle trees, 532 Mesos (cluster manager), 418, 506 message brokers (see messaging systems) message-passing, 136-139 advantages over direct RPC, 137 distributed actor frameworks, 138 evolvability, 138 MessagePack (encoding format), 116 messages exactly-once semantics, 360, 476 loss of, 442 using total order broadcast, 348 messaging systems, 440-451 (see also streams) backpressure, buffering, or dropping messages, 441 brokerless messaging, 442 event logs, 446-451 comparison to traditional messaging, 448, 451 consumer offsets, 449 replaying old messages, 451, 496, 498 slow consumers, 450 message brokers, 443-446 acknowledgements and redelivery, 445 comparison to event logs, 448, 451 multiple consumers of same topic, 444 reliability, 442 uniqueness in log-based messaging, 522 Meteor (web framework), 456 microbatching, 477, 495 microservices, 132 (see also services) causal dependencies across services, 493 loose coupling, 502 relation to batch/stream processors, 389, 508 Microsoft Azure Service Bus (messaging), 444 Azure Storage, 155, 398 Azure Stream Analytics, 466 DCOM (Distributed Component Object Model), 134 MSDTC (transaction coordinator), 356 Orleans (see Orleans) SQL Server (see SQL Server) migrating (rewriting) data, 40, 130, 461, 497 modulus operator (%), 210 MongoDB (database) aggregation pipeline, 48 atomic operations, 243 BSON, 41 document data model, 31 hash partitioning (sharding), 203-204 key-range partitioning, 202 lack of join support, 34, 42 leader-based replication, 153
MapReduce support, 46, 400 oplog parsing, 455, 456 partition splitting, 212 request routing, 216 secondary indexes, 207 Mongoriver (change data capture), 455 monitoring, 10, 19 monotonic clocks, 288 monotonic reads, 164 MPP (see massively parallel processing) MSMQ (messaging), 361 multi-column indexes, 87 multi-leader replication, 168-177 (see also replication) handling write conflicts, 171 conflict avoidance, 172 converging toward a consistent state, 172 custom conflict resolution logic, 173 determining what is a conflict, 174 linearizability, lack of, 333 replication topologies, 175-177 use cases, 168 clients with offline operation, 170 collaborative editing, 170 multi-datacenter replication, 168, 335 multi-object transactions, 228 need for, 231 Multi-Paxos (total order broadcast), 367 multi-table index cluster tables (Oracle), 41 multi-tenancy, 284 multi-version concurrency control (MVCC), 239, 266 detecting stale MVCC reads, 263 indexes and snapshot isolation, 241 mutual exclusion, 261 (see also locks) MySQL (database) binlog coordinates, 156 binlog parsing for change data capture, 455 circular replication topology, 175 consistent snapshots, 156 distributed transaction support, 361 InnoDB storage engine (see InnoDB) JSON support, 30, 42 leader-based replication, 153 performance of XA transactions, 360 row-based replication, 160 schema changes in, 40 snapshot isolation support, 242 (see also InnoDB) statement-based replication, 159 Tungsten Replicator (multi-leader replication), 170 conflict detection, 177 N nanomsg (messaging library), 442 Narayana (transaction coordinator), 356 NATS (messaging), 137 near-real-time (nearline) processing, 390 (see also stream processing) Neo4j (database) Cypher query language, 52 graph data model, 50 Nephele (dataflow engine), 421 netcat (Unix tool), 397 Netflix Chaos Monkey, 7, 280 Network Attached Storage (NAS), 146, 398 network model, 36 graph databases versus, 60 imperative query APIs, 46 Network Time Protocol
(see NTP) networks congestion and queueing, 282 datacenter network topologies, 276 faults (see faults) linearizability and network delays, 338 network partitions, 279, 337 timeouts and unbounded delays, 281 next-key locking, 260 nodes (in graphs) (see vertices) nodes (processes), 556 handling outages in leader-based replication, 156 system models for failure, 307 noisy neighbors, 284 nonblocking atomic commit, 359 nondeterministic operations accidental nondeterminism, 423 partial failures in distributed systems, 275 nonfunctional requirements, 22 nonrepeatable reads, 238 (see also read skew) normalization (data representation), 33, 556 executing joins, 39, 42, 403 foreign key references, 231 in systems of record, 386 versus denormalization, 462 NoSQL, 29, 499 transactions and, 223 Notation3 (N3), 56 npm (package manager), 428 NTP (Network Time Protocol), 287 accuracy, 289, 293 adjustments to monotonic clocks, 289 multiple server addresses, 306 numbers, in XML and JSON encodings, 114 O object-relational mapping (ORM) frameworks, 30 error handling and aborted transactions, 232 unsafe read-modify-write cycle code, 244 object-relational mismatch, 29 observer pattern, 506 offline systems, 390 (see also batch processing) stateful, offline-capable clients, 170, 511 offline-first applications, 511 offsets consumer offsets in partitioned logs, 449 messages in partitioned logs, 447 OLAP (online analytic processing), 91, 556 data cubes, 102 OLTP (online transaction processing), 90, 556 analytics queries versus, 411 workload characteristics, 253 one-to-many relationships, 30 JSON representation, 32 online systems, 389 (see also services) Oozie (workflow scheduler), 402 OpenAPI (service definition format), 133 OpenStack Nova (cloud infrastructure) use of ZooKeeper, 370 Swift (object storage), 398 operability, 19 operating systems versus databases, 499 operation identifiers, 518, 522 operational transformation, 174 operators, 421 flow of data between, 424 in stream
processing, 464 optimistic concurrency control, 261 Oracle (database) distributed transaction support, 361 GoldenGate (change data capture), 161, 170, 455 lack of serializability, 226 leader-based replication, 153 multi-table index cluster tables, 41 not preventing write skew, 248 partitioned indexes, 209 PL/SQL language, 255 preventing lost updates, 245 read committed isolation, 236 Real Application Clusters (RAC), 330 recursive query support, 54 snapshot isolation support, 239, 242 TimesTen (in-memory database), 89 WAL-based replication, 160 XML support, 30 ordering, 339-352 by sequence numbers, 343-348 causal ordering, 339-343 partial order, 341 limits of total ordering, 493 total order broadcast, 348-352 Orleans (actor framework), 139 outliers (response time), 14 Oz (programming language), 504 P package managers, 428, 505 packet switching, 285 packets corruption of, 306 sending via UDP, 442 PageRank (algorithm), 49, 424 paging (see virtual memory) ParAccel (database), 93 parallel databases (see massively parallel processing) parallel execution of graph analysis algorithms, 426 queries in MPP databases, 216 Parquet (data format), 96, 131 (see also column-oriented storage) use in Hadoop, 414 partial failures, 275, 310 limping, 311 partial order, 341 partitioning, 199-218, 556 and replication, 200 in batch processing, 429 multi-partition operations, 514 enforcing constraints, 522 secondary index maintenance, 495 of key-value data, 201-205 by key range, 202 skew and hot spots, 205 rebalancing partitions, 209-214 automatic or manual rebalancing, 213 problems with hash mod N, 210 using dynamic partitioning, 212 using fixed number of partitions, 210 using N partitions per node, 212 replication and, 147 request routing, 214-216 secondary indexes, 206-209 document-based partitioning, 206 term-based partitioning, 208 serial execution of transactions and, 255 Paxos (consensus algorithm), 366 ballot number, 368 Multi-Paxos (total order broadcast), 367 percentiles, 14,
556 calculating efficiently, 16 importance of high percentiles, 16 use in service level agreements (SLAs), 15 Percona XtraBackup (MySQL tool), 156 performance describing, 13 of distributed transactions, 360 of in-memory databases, 89 of linearizability, 338 of multi-leader replication, 169 perpetual inconsistency, 525 pessimistic concurrency control, 261 phantoms (transaction isolation), 250 materializing conflicts, 251 preventing, in serializability, 259 physical clocks (see clocks) pickle (Python), 113 Pig (dataflow language), 419, 427 replicated joins, 409 skewed joins, 407 workflows, 403 Pinball (workflow scheduler), 402 pipelined execution, 423 in Unix, 394 point in time, 287 polyglot persistence, 29 polystores, 501 PostgreSQL (database) BDR (multi-leader replication), 170 causal ordering of writes, 177 Bottled Water (change data capture), 455 Bucardo (trigger-based replication), 161, 173 distributed transaction support, 361 foreign data wrappers, 501 full text search support, 490 leader-based replication, 153 log sequence number, 156 MVCC implementation, 239, 241 PL/pgSQL language, 255 PostGIS geospatial indexes, 87 preventing lost updates, 245 preventing write skew, 248, 261 read committed isolation, 236 recursive query support, 54 representing graphs, 51 serializable snapshot isolation (SSI), 261 snapshot isolation support, 239, 242 WAL-based replication, 160 XML and JSON support, 30, 42 pre-splitting, 212 Precision Time Protocol (PTP), 290 predicate locks, 259 predictive analytics, 533-536 amplifying bias, 534 ethics of (see ethics) feedback loops, 536 preemption of datacenter resources, 418 of threads, 298 Pregel processing model, 425 primary keys, 85, 556 compound primary key (Cassandra), 204 primary-secondary replication (see leader-based replication) privacy, 536-543 consent and freedom of choice, 538 data as assets and power, 540 deleting data, 463 ethical considerations (see ethics) legislation and self-regulation, 542 meaning of, 539
surveillance, 537 tracking behavioral data, 536 probabilistic algorithms, 16, 466 process pauses, 295-299 processing time (of events), 469 producers (message streams), 440 programming languages dataflow languages, 504 for stored procedures, 255 functional reactive programming (FRP), 504 logic programming, 504 Prolog (language), 61 (see also Datalog) promises (asynchronous operations), 135 property graphs, 50 Cypher query language, 52 Protocol Buffers (data format), 117-121 field tags and schema evolution, 120 provenance of data, 531 publish/subscribe model, 441 publishers (message streams), 440 punch card tabulating machines, 390 pure functions, 48 putting computation near data, 400 Q Qpid (messaging), 444 quality of service (QoS), 285 Quantcast File System (distributed filesystem), 398 query languages, 42-48 aggregation pipeline, 48 CSS and XSL, 44 Cypher, 52 Datalog, 60 Juttle, 504 MapReduce querying, 46-48 recursive SQL queries, 53 relational algebra and SQL, 42 SPARQL, 59 query optimizers, 37, 427 queueing delays (networks), 282 head-of-line blocking, 15 latency and response time, 14 queues (messaging), 137 quorums, 179-182, 556 for leaderless replication, 179 in consensus algorithms, 368 limitations of consistency, 181-183, 334 making decisions in distributed systems, 301 monitoring staleness, 182 multi-datacenter replication, 184 relying on durability, 309 sloppy quorums and hinted handoff, 183 R R-trees (indexes), 87 RabbitMQ (messaging), 137, 444 leader-based replication, 153 race conditions, 225 (see also concurrency) avoiding with linearizability, 331 caused by dual writes, 452 dirty writes, 235 in counter increments, 235 lost updates, 242-246 preventing with event logs, 462, 507 preventing with serializable isolation, 252 write skew, 246-251 Raft (consensus algorithm), 366 sensitivity to network problems, 369 term number, 368 use in etcd, 353 RAID (Redundant Array of Independent Disks), 7, 398 railways, schema migration on, 496 RAMCloud
(in-memory storage), 89 ranking algorithms, 424 RDF (Resource Description Framework), 57 querying with SPARQL, 59 RDMA (Remote Direct Memory Access), 276 read committed isolation level, 234-237 implementing, 236 multi-version concurrency control (MVCC), 239 no dirty reads, 234 no dirty writes, 235 read path (derived data), 509 read repair (leaderless replication), 178 for linearizability, 335 read replicas (see leader-based replication) read skew (transaction isolation), 238, 266 as violation of causality, 340 read-after-write consistency, 163, 524 cross-device, 164 read-modify-write cycle, 243 read-scaling architecture, 161 reads as events, 513 real-time collaborative editing, 170 near-real-time processing, 390 (see also stream processing) publish/subscribe dataflow, 513 response time guarantees, 298 time-of-day clocks, 288 rebalancing partitions, 209-214, 556 (see also partitioning) automatic or manual rebalancing, 213 dynamic partitioning, 212 fixed number of partitions, 210 fixed number of partitions per node, 212 problems with hash mod N, 210 recency guarantee, 324 recommendation engines batch process outputs, 412 batch workflows, 403, 420 iterative processing, 424 statistical and numerical algorithms, 428 records, 399 events in stream processing, 440 recursive common table expressions (SQL), 54 redelivery (messaging), 445 Redis (database) atomic operations, 243 durability, 89 Lua scripting, 255 single-threaded execution, 253 usage example, 4 redundancy hardware components, 7 of derived data, 386 (see also derived data) Reed–Solomon codes (error correction), 398 refactoring, 22 (see also evolvability) regions (partitioning), 199 register (data structure), 325 relational data model, 28-42 comparison to document model, 38-42 graph queries in SQL, 53 in-memory databases with, 89 many-to-one and many-to-many relationships, 33 multi-object transactions, need for, 231 NoSQL as alternative to, 29 object-relational mismatch, 29 relational algebra and SQL, 42 versus
document model convergence of models, 41 data locality, 41 relational databases eventual consistency, 162 history, 28 leader-based replication, 153 logical logs, 160 philosophy compared to Unix, 499, 501 schema changes, 40, 111, 130 statement-based replication, 158 use of B-tree indexes, 80 relationships (see edges) reliability, 6-10, 489 building a reliable system from unreliable components, 276 defined, 6, 22 hardware faults, 7 human errors, 9 importance of, 10 of messaging systems, 442 software errors, 8 Remote Method Invocation (Java RMI), 134 remote procedure calls (RPCs), 134-136 (see also services) based on futures, 135 data encoding and evolution, 136 issues with, 134 using Avro, 126, 135 using Thrift, 135 versus message brokers, 137 repeatable reads (transaction isolation), 242 replicas, 152 replication, 151-193, 556 and durability, 227 chain replication, 155 conflict resolution and, 246 consistency properties, 161-167 consistent prefix reads, 165 monotonic reads, 164 reading your own writes, 162 in distributed filesystems, 398 leaderless, 177-191 detecting concurrent writes, 184-191 limitations of quorum consistency, 181-183, 334 sloppy quorums and hinted handoff, 183 monitoring staleness, 182 multi-leader, 168-177 across multiple datacenters, 168, 335 handling write conflicts, 171-175 replication topologies, 175-177 partitioning and, 147, 200 reasons for using, 145, 151 single-leader, 152-161 failover, 157 implementation of replication logs, 158-161 relation to consensus, 367 setting up new followers, 155 synchronous versus asynchronous, 153-155 state machine replication, 349, 452 using erasure coding, 398 with heterogeneous data systems, 453 replication logs (see logs) reprocessing data, 496, 498 (see also evolvability) from log-based messaging, 451 request routing, 214-216 approaches to, 214 parallel query execution, 216 resilient systems, 6 (see also fault tolerance) response time as performance metric for services, 13, 389
guarantees on, 298 latency versus, 14 mean and percentiles, 14 user experience, 15 responsibility and accountability, 535 REST (Representational State Transfer), 133 (see also services) RethinkDB (database) document data model, 31 dynamic partitioning, 212 join support, 34, 42 key-range partitioning, 202 leader-based replication, 153 subscribing to changes, 456 Riak (database) Bitcask storage engine, 72 CRDTs, 174, 191 dotted version vectors, 191 gossip protocol, 216 hash partitioning, 203-204, 211 last-write-wins conflict resolution, 186 leaderless replication, 177 LevelDB storage engine, 78 linearizability, lack of, 335 multi-datacenter support, 184 preventing lost updates across replicas, 246 rebalancing, 213 search feature, 209 secondary indexes, 207 siblings (concurrently written values), 190 sloppy quorums, 184 ring buffers, 450 Ripple (cryptocurrency), 532 rockets, 10, 36, 305 RocksDB (storage engine), 78 leveled compaction, 79 rollbacks (transactions), 222 rolling upgrades, 8, 112 routing (see request routing) row-oriented storage, 96 row-based replication, 160 rowhammer (memory corruption), 529 RPCs (see remote procedure calls) Rubygems (package manager), 428 rules (Datalog), 61 S safety and liveness properties, 308 in consensus algorithms, 366 in transactions, 222 sagas (see compensating transactions) Samza (stream processor), 466, 467 fault tolerance, 479 streaming SQL support, 466 sandboxes, 9 SAP HANA (database), 93 scalability, 10-18, 489 approaches for coping with load, 17 defined, 22 describing load, 11 describing performance, 13 partitioning and, 199 replication and, 161 scaling up versus scaling out, 146 scaling out, 17, 146 (see also shared-nothing architecture) scaling up, 17, 146 scatter/gather approach, querying partitioned databases, 207 SCD (slowly changing dimension), 476 schema-on-read, 39 comparison to evolvable schema, 128 in distributed filesystems, 415 schema-on-write, 39 schemaless databases (see schema-on-read) schemas, 557 Avro, 
122-127 reader determining writer’s schema, 125 schema evolution, 123 dynamically generated, 126 evolution of, 496 affecting application code, 111 compatibility checking, 126 in databases, 129-131 in message-passing, 138 in service calls, 136 flexibility in document model, 39 for analytics, 93-95 for JSON and XML, 115 merits of, 127 schema migration on railways, 496 Thrift and Protocol Buffers, 117-121 schema evolution, 120 traditional approach to design, fallacy in, 462 searches building search indexes in batch processes, 411 k-nearest neighbors, 429 on streams, 467 partitioned secondary indexes, 206 secondaries (see leader-based replication) secondary indexes, 85, 557 partitioning, 206-209, 217 document-partitioned, 206 index maintenance, 495 term-partitioned, 208 problems with dual writes, 452, 491 updating, transaction isolation and, 231 secondary sorts, 405 sed (Unix tool), 392 self-describing files, 127 self-joins, 480 self-validating systems, 530 semantic web, 57 semi-synchronous replication, 154 sequence number ordering, 343-348 generators, 294, 344 insufficiency for enforcing constraints, 347 Lamport timestamps, 345 use of timestamps, 291, 295, 345 sequential consistency, 351 serializability, 225, 233, 251-266, 557 linearizability versus, 329 pessimistic versus optimistic concurrency control, 261 serial execution, 252-256 partitioning, 255 using stored procedures, 253, 349 serializable snapshot isolation (SSI), 261-266 detecting stale MVCC reads, 263 detecting writes that affect prior reads, 264 distributed execution, 265, 364 performance of SSI, 265 preventing write skew, 262-265 two-phase locking (2PL), 257-261 index-range locks, 260 performance, 258 Serializable (Java), 113 serialization, 113 (see also encoding) service discovery, 135, 214, 372 using DNS, 216, 372 service level agreements (SLAs), 15 service-oriented architecture (SOA), 132 (see also services) services, 131-136 microservices, 132 causal dependencies across services, 493 loose
coupling, 502 relation to batch/stream processors, 389, 508 remote procedure calls (RPCs), 134-136 issues with, 134 similarity to databases, 132 web services, 132, 135 session windows (stream processing), 472 (see also windows) sessionization, 407 sharding (see partitioning) shared mode (locks), 258 shared-disk architecture, 146, 398 shared-memory architecture, 146 shared-nothing architecture, 17, 146-147, 557 (see also replication) distributed filesystems, 398 (see also distributed filesystems) partitioning, 199 use of network, 277 sharks biting undersea cables, 279 counting (example), 46-48 finding (example), 42 website about (example), 44 shredding (in relational model), 38 siblings (concurrent values), 190, 246 (see also conflicts) similarity search edit distance, 88 genome data, 63 k-nearest neighbors, 429 single-leader replication (see leader-based replication) single-threaded execution, 243, 252 in batch processing, 406, 421, 426 in stream processing, 448, 463, 522 size-tiered compaction, 79 skew, 557 clock skew, 291-294, 334 in transaction isolation read skew, 238, 266 write skew, 246-251, 262-265 (see also write skew) meanings of, 238 unbalanced workload, 201 compensating for, 205 due to celebrities, 205 for time-series data, 203 in batch processing, 407 slaves (see leader-based replication) sliding windows (stream processing), 472 (see also windows) sloppy quorums, 183 (see also quorums) lack of linearizability, 334 slowly changing dimension (data warehouses), 476 smearing (leap seconds adjustments), 290 snapshots (databases) causal consistency, 340 computing derived data, 500 in change data capture, 455 serializable snapshot isolation (SSI), 261-266, 329 setting up a new replica, 156 snapshot isolation and repeatable read, 237-242 implementing with MVCC, 239 indexes and MVCC, 241 visibility rules, 240 synchronized clocks for global snapshots, 294 snowflake schemas, 95 SOAP, 133 (see also services) evolvability, 136 software bugs, 8
maintaining integrity, 529 solid state drives (SSDs) access patterns, 84 detecting corruption, 519, 530 faults in, 227 sequential write throughput, 75 Solr (search server) building indexes in batch processes, 411 document-partitioned indexes, 207 request routing, 216 usage example, 4 use of Lucene, 79 sort (Unix tool), 392, 394, 395 sort-merge joins (MapReduce), 405 Sorted String Tables (see SSTables) sorting sort order in column storage, 99 source of truth (see systems of record) Spanner (database) data locality, 41 snapshot isolation using clocks, 295 TrueTime API, 294 Spark (processing framework), 421-423 bytecode generation, 428 dataflow APIs, 427 fault tolerance, 422 for data warehouses, 93 GraphX API (graph processing), 425 machine learning, 428 query optimizer, 427 Spark Streaming, 466 microbatching, 477 stream processing on top of batch processing, 495 SPARQL (query language), 59 spatial algorithms, 429 split brain, 158, 557 in consensus algorithms, 352, 367 preventing, 322, 333 using fencing tokens to avoid, 302-304 spreadsheets, dataflow programming capabilities, 504 SQL (Structured Query Language), 21, 28, 43 advantages and limitations of, 416 distributed query execution, 48 graph queries in, 53 isolation levels standard, issues with, 242 query execution on Hadoop, 416 résumé (example), 30 SQL injection vulnerability, 305 SQL on Hadoop, 93 statement-based replication, 158 stored procedures, 255 SQL Server (database) data warehousing support, 93 distributed transaction support, 361 leader-based replication, 153 preventing lost updates, 245 preventing write skew, 248, 257 read committed isolation, 236 recursive query support, 54 serializable isolation, 257 snapshot isolation support, 239 T-SQL language, 255 XML support, 30 SQLstream (stream analytics), 466 SSDs (see solid state drives) SSTables (storage format), 76-79 advantages over hash indexes, 76 concatenated index, 204 constructing and maintaining, 78 making LSM-Tree from, 78 staleness (old data),
162 cross-channel timing dependencies, 331 in leaderless databases, 178 in multi-version concurrency control, 263 monitoring for, 182 of client state, 512 versus linearizability, 324 versus timeliness, 524 standbys (see leader-based replication) star replication topologies, 175 star schemas, 93-95 similarity to event sourcing, 458 Star Wars analogy (event time versus processing time), 469 state derived from log of immutable events, 459 deriving current state from the event log, 458 interplay between state changes and application code, 507 maintaining derived state, 495 maintenance by stream processor in stream-stream joins, 473 observing derived state, 509-515 rebuilding after stream processor failure, 478 separation of application code and, 505 state machine replication, 349, 452 statement-based replication, 158 statically typed languages analogy to schema-on-write, 40 code generation and, 127 statistical and numerical algorithms, 428 StatsD (metrics aggregator), 442 stdin, stdout, 395, 396 Stellar (cryptocurrency), 532 stock market feeds, 442 STONITH (Shoot The Other Node In The Head), 158 stop-the-world (see garbage collection) storage composing data storage technologies, 499-504 diversity of, in MapReduce, 415 Storage Area Network (SAN), 146, 398 storage engines, 69-104 column-oriented, 95-101 column compression, 97-99 defined, 96 distinction between column families and, 99 Parquet, 96, 131 sort order in, 99-100 writing to, 101 comparing requirements for transaction processing and analytics, 90-96 in-memory storage, 88 durability, 227 row-oriented, 70-90 B-trees, 79-83 comparing B-trees and LSM-trees, 83-85 defined, 96 log-structured, 72-79 stored procedures, 161, 253-255, 557 and total order broadcast, 349 pros and cons of, 255 similarity to stream processors, 505 Storm (stream processor), 466 distributed RPC, 468, 514 Trident state handling, 478 straggler events, 470, 498 stream processing, 464-481, 557 accessing external services within job,
474, 477, 478, 517 combining with batch processing lambda architecture, 497 unifying technologies, 498 comparison to batch processing, 464 complex event processing (CEP), 465 fault tolerance, 476-479 atomic commit, 477 idempotence, 478 microbatching and checkpointing, 477 rebuilding state after a failure, 478 for data integration, 494-498 maintaining derived state, 495 maintenance of materialized views, 467 messaging systems (see messaging systems) reasoning about time, 468-472 event time versus processing time, 469, 477, 498 knowing when window is ready, 470 types of windows, 472 relation to databases (see streams) relation to services, 508 search on streams, 467 single-threaded execution, 448, 463 stream analytics, 466 stream joins, 472-476 stream-stream join, 473 stream-table join, 473 table-table join, 474 time-dependence of, 475 streams, 440-451 end-to-end, pushing events to clients, 512 messaging systems (see messaging systems) processing (see stream processing) relation to databases, 451-464 (see also changelogs) API support for change streams, 456 change data capture, 454-457 derivative of state by time, 460 event sourcing, 457-459 keeping systems in sync, 452-453 philosophy of immutable events, 459-464 topics, 440 strict serializability, 329 strong consistency (see linearizability) strong one-copy serializability, 329 subjects, predicates, and objects (in triplestores), 55 subscribers (message streams), 440 (see also consumers) supercomputers, 275 surveillance, 537 (see also privacy) Swagger (service definition format), 133 swapping to disk (see virtual memory) synchronous networks, 285, 557 comparison to asynchronous networks, 284 formal model, 307 synchronous replication, 154, 557 chain replication, 155 conflict detection, 172 system models, 300, 306-310 assumptions in, 528 correctness of algorithms, 308 mapping to the real world, 309 safety and liveness, 308 systems of record, 386, 557 change data capture, 454, 491 treating event log as, 460
systems thinking, 536 T t-digest (algorithm), 16 table-table joins, 474 Tableau (data visualization software), 416 tail (Unix tool), 447 tail vertex (property graphs), 51 Tajo (query engine), 93 Tandem NonStop SQL (database), 200 TCP (Transmission Control Protocol), 277 comparison to circuit switching, 285 comparison to UDP, 283 connection failures, 280 flow control, 282, 441 packet checksums, 306, 519, 529 reliability and duplicate suppression, 517 retransmission timeouts, 284 use for transaction sessions, 229 telemetry (see monitoring) Teradata (database), 93, 200 term-partitioned indexes, 208, 217 termination (consensus), 365 Terrapin (database), 413 Tez (dataflow engine), 421-423 fault tolerance, 422 support by higher-level tools, 427 thrashing (out of memory), 297 threads (concurrency) actor model, 138, 468 (see also message-passing) atomic operations, 223 background threads, 73, 85 execution pauses, 286, 296-298 memory barriers, 338 preemption, 298 single (see single-threaded execution) three-phase commit, 359 Thrift (data format), 117-121 BinaryProtocol, 118 CompactProtocol, 119 field tags and schema evolution, 120 throughput, 13, 390 TIBCO, 137 Enterprise Message Service, 444 StreamBase (stream analytics), 466 time concurrency and, 187 cross-channel timing dependencies, 331 in distributed systems, 287-299 (see also clocks) clock synchronization and accuracy, 289 relying on synchronized clocks, 291-295 process pauses, 295-299 reasoning about, in stream processors, 468-472 event time versus processing time, 469, 477, 498 knowing when window is ready, 470 timestamp of events, 471 types of windows, 472 system models for distributed systems, 307 time-dependence in stream joins, 475 time-of-day clocks, 288 timeliness, 524 coordination-avoiding data systems, 528 correctness of dataflow systems, 525 timeouts, 279, 557 dynamic configuration of, 284 for failover, 158 length of, 281 timestamps, 343 assigning to events in stream processing, 471 for read-after-write 
consistency, 163 for transaction ordering, 295 insufficiency for enforcing constraints, 347 key range partitioning by, 203 Lamport, 345 logical, 494 ordering events, 291, 345 Titan (database), 50 tombstones, 74, 191, 456 topics (messaging), 137, 440 total order, 341, 557 limits of, 493 sequence numbers or timestamps, 344 total order broadcast, 348-352, 493, 522 consensus algorithms and, 366-368 implementation in ZooKeeper and etcd, 370 implementing with linearizable storage, 351 using, 349 using to implement linearizable storage, 350 tracking behavioral data, 536 (see also privacy) transaction coordinator (see coordinator) transaction manager (see coordinator) transaction processing, 28, 90-95 comparison to analytics, 91 comparison to data warehousing, 93 transactions, 221-267, 558 ACID properties of, 223 atomicity, 223 consistency, 224 durability, 226 isolation, 225 compensating (see compensating transactions) concept of, 222 distributed transactions, 352-364 avoiding, 492, 502, 521-528 failure amplification, 364, 495 in doubt/uncertain status, 358, 362 two-phase commit, 354-359 use of, 360-361 XA transactions, 361-364 OLTP versus analytics queries, 411 purpose of, 222 serializability, 251-266 actual serial execution, 252-256 pessimistic versus optimistic concurrency control, 261 serializable snapshot isolation (SSI), 261-266 two-phase locking (2PL), 257-261 single-object and multi-object, 228-232 handling errors and aborts, 231 need for multi-object transactions, 231 single-object writes, 230 snapshot isolation (see snapshots) weak isolation levels, 233-251 preventing lost updates, 242-246 read committed, 234-238 transitive closure (graph algorithm), 424 trie (data structure), 88 triggers (databases), 161, 441 implementing change data capture, 455 implementing replication, 161 triple-stores, 55-59 SPARQL query language, 59 tumbling windows (stream processing), 472 (see also windows) in microbatching, 477 tuple spaces (programming
model), 507 Turtle (RDF data format), 56 Twitter constructing home timelines (example), 11, 462, 474, 511 DistributedLog (event log), 448 Finagle (RPC framework), 135 Snowflake (sequence number generator), 294 Summingbird (processing library), 497 two-phase commit (2PC), 353, 355-359, 558 confusion with two-phase locking, 356 coordinator failure, 358 coordinator recovery, 363 how it works, 357 issues in practice, 363 performance cost, 360 transactions holding locks, 362 two-phase locking (2PL), 257-261, 329, 558 confusion with two-phase commit, 356 index-range locks, 260 performance of, 258 type checking, dynamic versus static, 40 U UDP (User Datagram Protocol) comparison to TCP, 283 multicast, 442 unbounded datasets, 439, 558 (see also streams) unbounded delays, 558 in networks, 282 process pauses, 296 unbundling databases, 499-515 composing data storage technologies, 499-504 federation versus unbundling, 501 need for high-level language, 503 designing applications around dataflow, 504-509 observing derived state, 509-515 materialized views and caching, 510 multi-partition data processing, 514 pushing state changes to clients, 512 uncertain (transaction status) (see in doubt) uniform consensus, 365 (see also consensus) uniform interfaces, 395 union type (in Avro), 125 uniq (Unix tool), 392 uniqueness constraints asynchronously checked, 526 requiring consensus, 521 requiring linearizability, 330 uniqueness in log-based messaging, 522 Unix philosophy, 394-397 command-line batch processing, 391-394 Unix pipes versus dataflow engines, 423 comparison to Hadoop, 413-414 comparison to relational databases, 499, 501 comparison to stream processing, 464 composability and uniform interfaces, 395 loose coupling, 396 pipes, 394 relation to Hadoop, 499 UPDATE statement (SQL), 40 updates preventing lost updates, 242-246 atomic write operations, 243 automatically detecting lost updates, 245 compare-and-set operations, 245 conflict resolution and replication, 246 using explicit 
locking, 244 preventing write skew, 246-251 V validity (consensus), 365 vBuckets (partitioning), 199 vector clocks, 191 (see also version vectors) vectorized processing, 99, 428 verification, 528-533 avoiding blind trust, 530 culture of, 530 designing for auditability, 531 end-to-end integrity checks, 531 tools for auditable data systems, 532 version control systems, reliance on immutable data, 463 version vectors, 177, 191 capturing causal dependencies, 343 versus vector clocks, 191 Vertica (database), 93 handling writes, 101 replicas using different sort orders, 100 vertical scaling (see scaling up) vertices (in graphs), 49 property graph model, 50 Viewstamped Replication (consensus algorithm), 366 view number, 368 virtual machines, 146 (see also cloud computing) context switches, 297 network performance, 282 noisy neighbors, 284 reliability in cloud services, 8 virtualized clocks in, 290 virtual memory process pauses due to page faults, 14, 297 versus memory management by databases, 89 VisiCalc (spreadsheets), 504 vnodes (partitioning), 199 Voice over IP (VoIP), 283 Voldemort (database) building read-only stores in batch processes, 413 hash partitioning, 203-204, 211 leaderless replication, 177 multi-datacenter support, 184 rebalancing, 213 reliance on read repair, 179 sloppy quorums, 184 VoltDB (database) cross-partition serializability, 256 deterministic stored procedures, 255 in-memory storage, 89 output streams, 456 secondary indexes, 207 serial execution of transactions, 253 statement-based replication, 159, 479 transactions in stream processing, 477 W WAL (write-ahead log), 82 web services (see services) Web Services Description Language (WSDL), 133 webhooks, 443 webMethods (messaging), 137 WebSocket (protocol), 512 windows (stream processing), 466, 468-472 infinite windows for changelogs, 467, 474 knowing when all events have arrived, 470 stream joins within a window, 473 types of windows, 472 winners (conflict resolution), 173 WITH
RECURSIVE syntax (SQL), 54 workflows (MapReduce), 402 outputs, 411-414 key-value stores, 412 search indexes, 411 with map-side joins, 410 working set, 393 write amplification, 84 write path (derived data), 509 write skew (transaction isolation), 246-251 characterizing, 246-251, 262 examples of, 247, 249 materializing conflicts, 251 occurrence in practice, 529 phantoms, 250 preventing in snapshot isolation, 262-265 in two-phase locking, 259-261 options for, 248 write-ahead log (WAL), 82, 159 writes (database) atomic write operations, 243 detecting writes affecting prior reads, 264 preventing dirty writes with read committed, 235 WS-* framework, 133 (see also services) WS-AtomicTransaction (2PC), 355 X XA transactions, 355, 361-364 heuristic decisions, 363 limitations of, 363 xargs (Unix tool), 392, 396 XML binary variants, 115 encoding RDF data, 57 for application data, issues with, 114 in relational databases, 30, 41 XSL/XPath, 45 Y Yahoo!


pages: 1,237 words: 227,370

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

active measures, Amazon Web Services, bitcoin, blockchain, business intelligence, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, database schema, DevOps, distributed ledger, Donald Knuth, Edward Snowden, Ethereum, ethereum blockchain, fault tolerance, finite state, Flash crash, full text search, general-purpose programming language, informal economy, information retrieval, Infrastructure as a Service, Internet of things, iterative process, John von Neumann, Kubernetes, loose coupling, Marc Andreessen, microservices, natural language processing, Network effects, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, statistical model, undersea cable, web application, WebSocket, wikimedia commons

This aspect of system-building is often overlooked by vendors who claim that their product can satisfy all your needs. In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application.
Systems of Record and Derived Data
On a high level, systems that store and process data can be grouped into two broad categories:
Systems of record: A system of record, also known as source of truth, holds the authoritative version of your data. When new data comes in, e.g., as user input, it is first written here. Each fact is represented exactly once (the representation is typically normalized). If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one.
Derived data systems: Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way.
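The distinction above can be illustrated with a minimal sketch. The names `system_of_record`, `derive_users_by_city`, and the sample data are hypothetical and do not come from the book; an in-memory dictionary stands in for a real database. Writes go to the system of record first, and the derived view is recomputed from it, so any stale copy loses by definition.

```python
from collections import defaultdict

# Hypothetical system of record: each fact (a user's city) is stored exactly once.
system_of_record = {
    "alice": "Berlin",
    "bob": "Paris",
    "carol": "Berlin",
}


def derive_users_by_city(records):
    """Build a derived view (city -> sorted user names) from the records."""
    view = defaultdict(list)
    for user, city in records.items():
        view[city].append(user)
    return {city: sorted(users) for city, users in view.items()}


# A derived view computed before the next write; it will become stale.
stale_view = derive_users_by_city(system_of_record)

# New data is written to the system of record first.
system_of_record["bob"] = "Berlin"

# The stale view now disagrees with the system of record, so it is simply
# thrown away and rebuilt; the system of record is the correct one.
users_by_city = derive_users_by_city(system_of_record)
```

Because the derived view can always be rebuilt from the authoritative data, losing or corrupting it costs recomputation time but no information.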

stored procedure: A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See “Actual Serial Execution”.
stream process: A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11.
synchronous: The opposite of asynchronous.
system of record: A system that holds the primary, authoritative version of some data, also known as the source of truth. Changes are first written here, and other datasets may be derived from the system of record. See the introduction to Part III.
timeout: One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays”.
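The timeout entry can be sketched as follows. This is an illustrative, single-process stand-in, not a real network client: `call_with_timeout`, `healthy_node`, and `make_silent_node` are hypothetical names, and a background thread plays the role of the remote node. The caller learns only that no reply arrived in time; the result carries no information about why.

```python
import queue
import threading


def call_with_timeout(handler, timeout_seconds):
    """Run handler in a background thread and wait for its reply.

    Returns ("ok", value) if a reply arrives in time, else ("timeout", None).
    """
    replies = queue.Queue(maxsize=1)
    worker = threading.Thread(target=lambda: replies.put(handler()), daemon=True)
    worker.start()
    try:
        return ("ok", replies.get(timeout=timeout_seconds))
    except queue.Empty:
        # No reply in time: the node may have crashed, or the request or the
        # reply may have been lost or delayed. The caller cannot tell which.
        return ("timeout", None)


def healthy_node():
    return "pong"


def make_silent_node(release):
    """Return a handler that blocks until released (simulates an unresponsive node)."""
    def handler():
        release.wait()
        return "late pong"
    return handler


release = threading.Event()
fast = call_with_timeout(healthy_node, timeout_seconds=1.0)
slow = call_with_timeout(make_silent_node(release), timeout_seconds=0.05)
release.set()  # let the simulated node finish so its thread can exit
```

Note that the simulated silent node does eventually answer after the caller has given up, which is one reason a timeout alone does not prove the remote side failed.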

setting up a new replica, Setting Up New Followers snapshot isolation and repeatable read, Snapshot Isolation and Repeatable Read-Repeatable read and naming confusion implementing with MVCC, Implementing snapshot isolation indexes and MVCC, Indexes and snapshot isolation visibility rules, Visibility rules for observing a consistent snapshot synchronized clocks for global snapshots, Synchronized clocks for global snapshots snowflake schemas, Stars and Snowflakes: Schemas for Analytics SOAP, Web services (see also services) evolvability, Data encoding and evolution for RPC software bugs, Software Errors maintaining integrity, Maintaining integrity in the face of software bugs solid state drives (SSDs) access patterns, Advantages of LSM-trees detecting corruption, The end-to-end argument, Don’t just blindly trust what they promise faults in, Durability sequential write throughput, Hash Indexes Solr (search server) building indexes in batch processes, Building search indexes document-partitioned indexes, Partitioning Secondary Indexes by Document request routing, Request Routing usage example, Thinking About Data Systems use of Lucene, Making an LSM-tree out of SSTables sort (Unix tool), Simple Log Analysis, Sorting versus in-memory aggregation, The Unix Philosophy sort-merge joins (MapReduce), Sort-merge joins Sorted String Tables (see SSTables) sorting sort order in column storage, Sort Order in Column Storage source of truth (see systems of record) Spanner (database) data locality, Data locality for queries snapshot isolation using clocks, Synchronized clocks for global snapshots TrueTime API, Clock readings have a confidence interval Spark (processing framework), Dataflow engines-Discussion of materialization bytecode generation, The move toward declarative query languages dataflow APIs, High-Level APIs and Languages fault tolerance, Fault tolerance for data warehouses, The divergence between OLTP databases and data warehouses GraphX API (graph processing), The Pregel
processing model machine learning, Specialization for different domains query optimizer, The move toward declarative query languages Spark Streaming, Stream analytics microbatching, Microbatching and checkpointing stream processing on top of batch processing, Batch and Stream Processing SPARQL (query language), The SPARQL query language spatial algorithms, Specialization for different domains split brain, Leader failure: Failover, Glossary in consensus algorithms, Distributed Transactions and Consensus, Single-leader replication and consensus preventing, Consistency and Consensus, Implementing Linearizable Systems using fencing tokens to avoid, The leader and the lock-Fencing tokens spreadsheets, dataflow programming capabilities, Designing Applications Around Dataflow SQL (Structured Query Language), Simplicity: Managing Complexity, Relational Model Versus Document Model, Query Languages for Data advantages and limitations of, Diversity of processing models distributed query execution, MapReduce Querying graph queries in, Graph Queries in SQL isolation levels standard, issues with, Repeatable read and naming confusion query execution on Hadoop, Diversity of processing models résumé (example), The Object-Relational Mismatch SQL injection vulnerability, Byzantine Faults SQL on Hadoop, The divergence between OLTP databases and data warehouses statement-based replication, Statement-based replication stored procedures, Pros and cons of stored procedures SQL Server (database) data warehousing support, The divergence between OLTP databases and data warehouses distributed transaction support, XA transactions leader-based replication, Leaders and Followers preventing lost updates, Automatically detecting lost updates preventing write skew, Characterizing write skew, Implementation of two-phase locking read committed isolation, Implementing read committed recursive query support, Graph Queries in SQL serializable isolation, Implementation of two-phase locking snapshot isolation
support, Snapshot Isolation and Repeatable Read T-SQL language, Pros and cons of stored procedures XML support, The Object-Relational Mismatch SQLstream (stream analytics), Complex event processing SSDs (see solid state drives) SSTables (storage format), SSTables and LSM-Trees-Performance optimizations advantages over hash indexes, SSTables and LSM-Trees concatenated index, Partitioning by Hash of Key constructing and maintaining, Constructing and maintaining SSTables making LSM-Tree from, Making an LSM-tree out of SSTables staleness (old data), Reading Your Own Writes cross-channel timing dependencies, Cross-channel timing dependencies in leaderless databases, Writing to the Database When a Node Is Down in multi-version concurrency control, Detecting stale MVCC reads monitoring for, Monitoring staleness of client state, Pushing state changes to clients versus linearizability, Linearizability versus timeliness, Timeliness and Integrity standbys (see leader-based replication) star replication topologies, Multi-Leader Replication Topologies star schemas, Stars and Snowflakes: Schemas for Analytics-Stars and Snowflakes: Schemas for Analytics similarity to event sourcing, Event Sourcing Star Wars analogy (event time versus processing time), Event time versus processing time state derived from log of immutable events, State, Streams, and Immutability deriving current state from the event log, Deriving current state from the event log interplay between state changes and application code, Dataflow: Interplay between state changes and application code maintaining derived state, Maintaining derived state maintenance by stream processor in stream-stream joins, Stream-stream join (window join) observing derived state, Observing Derived State-Multi-partition data processing rebuilding after stream processor failure, Rebuilding state after a failure separation of application code and, Separation of application code and state state machine replication, Using total order broadcast,
Databases and Streams statement-based replication, Statement-based replication statically typed languages analogy to schema-on-write, Schema flexibility in the document model code generation and, Code generation and dynamically typed languages statistical and numerical algorithms, Specialization for different domains StatsD (metrics aggregator), Direct messaging from producers to consumers stdin, stdout, A uniform interface, Separation of logic and wiring Stellar (cryptocurrency), Tools for auditable data systems stock market feeds, Direct messaging from producers to consumers STONITH (Shoot The Other Node In The Head), Leader failure: Failover stop-the-world (see garbage collection) storage composing data storage technologies, Composing Data Storage Technologies-What’s missing?


The Great Turning: From Empire to Earth Community by David C. Korten

Albert Einstein, banks create money, big-box store, Bretton Woods, British Empire, business cycle, clean water, colonial rule, Community Supported Agriculture, death of newspapers, declining real wages, different worldview, European colonialism, Francisco Pizarro, full employment, George Gilder, global supply chain, global village, God and Mammon, Hernando de Soto, Howard Zinn, informal economy, Intergovernmental Panel on Climate Change (IPCC), invisible hand, joint-stock company, land reform, market bubble, market fundamentalism, Monroe Doctrine, Naomi Klein, neoliberal agenda, new economy, peak oil, planetary scale, plutocrats, Plutocrats, Project for a New American Century, Ronald Reagan, Rosa Parks, sexual politics, shared worldview, social intelligence, source of truth, South Sea Bubble, stem cell, structural adjustment programs, The Chicago School, trade route, Washington Consensus, wealth creators, World Values Survey

This contest has raged in the West since the beginning of the scientific revolution.
Religion of the Strict Father
By the time of the early scientific revolution in the sixteenth century, prevailing Christian theology had fallen into a distrust of the human intellect and its ability to perceive truth from observations of the material world. Indeed, excessive concern with material phenomena was considered a sign of a neglected soul. Religious authorities maintained that divine revelation as enshrined in scripture and interpreted by themselves was the only valid source of truth and that the universe is governed by forces beyond human knowing. The prevailing Western worldview of that time, particularly as defined by the Catholic faith,
• viewed the human relation to God as one of a child to a father who demands strict loyalty and obedience;
• ascribed to God both human emotions and the power to create and destroy whole worlds by an arbitrary act of will;
• held humans to be both the purpose and center of God’s creation;
• venerated a pantheon of saints with powers to intervene in matters of the heart and flesh;
• attributed physical and mental afflictions to possession by malevolent spirits; and
• claimed for religious authorities the power to guarantee a place in heaven.

Unfortunately, however, the scientific revolution brought not only a rejection of the magical fantasies of the lowest order of consciousness but also a denial of the spiritual foundation of reality and a deep alienation from life.
Science of the Aging Clock
In sharp contrast to the belief systems of most religions, the ideological frame of standard Western science steadfastly maintains that the physical world is the only reality and that the disciplined observation of physical phenomena is the only source of truth. That stance began with the theories of Nicolaus Copernicus (1473–1543) and the proofs of Galileo Galilei (1564–1642) that the sun is the center of the solar system and Earth is but one of its several orbiting planets. The conventional scientific wisdom of that day held that nature functions with the predictable precision of a mechanical clock and that its mechanisms are fully amenable to human understanding. Unable to explain the origins of the complex machine postulated by their theories, the early philosophers of the scientific revolution conceded that territory to the theologians, suggesting that the universe was created and set in motion by a master clock maker who then left it to wind down as the embodied energy potential of its wound-up spring was depleted.


Learning Ansible 2 - Second Edition by Fabio Alessandro Locati

Amazon Web Services, anti-pattern, cloud computing, continuous integration, Debian, DevOps, don't repeat yourself, Infrastructure as a Service, inventory management, Kickstarter, revision control, source of truth, web application

Scaling to Multiple Hosts In the previous chapters, we have specified the hosts in the command line. This worked fine while having a single host to work on, but will not work very well when managing multiple servers. In this chapter, we will see exactly how to manage multiple servers. We'll explore the following topics: • Ansible inventories • Ansible host/group variables • Ansible loops Working with inventory files An inventory file is the source of truth for Ansible (there is also an advanced concept called dynamic inventory, which we will cover later). It follows the Initialization (INI) format and tells Ansible whether the remote host or hosts provided by the user are genuine. Ansible can run its tasks against multiple hosts in parallel. To do this, you can directly pass the list of hosts to Ansible using an inventory file. For such parallel execution, Ansible allows you to group your hosts in the inventory file; the file passes the group name to Ansible.
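As a rough illustration of the grouping described above, the following Python sketch reads a simplified INI-style inventory and returns the hosts under each group. It is not Ansible's own parser: the function name `parse_inventory`, the sample host names, and the `ungrouped` fallback label used here are assumptions made for this example, and real inventories support features (ranges, child groups, group variables) that this sketch ignores.

```python
from collections import defaultdict

# Hypothetical inventory content, for illustration only.
SAMPLE_INVENTORY = """\
# web tier
[webservers]
web1.example.com
web2.example.com

[databases]
db1.example.com ansible_user=admin
"""


def parse_inventory(text):
    """Return {group: [host, ...]} from a simplified INI-style inventory."""
    groups = defaultdict(list)
    current = "ungrouped"  # hosts listed before any [group] header
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups[current]  # register the group even if it stays empty
            continue
        # The first token is the host; any key=value pairs after it are ignored.
        groups[current].append(line.split()[0])
    return dict(groups)


groups = parse_inventory(SAMPLE_INVENTORY)
for name, hosts in groups.items():
    print(name, hosts)
```

Keeping the host list in one file and deriving everything else from it is what makes the inventory usable as a source of truth: a group name, rather than a hand-typed list of hosts, is what gets passed around.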


The Techno-Human Condition by Braden R. Allenby, Daniel R. Sarewitz

airport security, augmented reality, carbon footprint, clean water, cognitive dissonance, coherent worldview, conceptual framework, creative destruction, Credit Default Swap, decarbonisation, different worldview, facts on the ground, friendly fire, industrial cluster, Intergovernmental Panel on Climate Change (IPCC), invisible hand, Isaac Newton, Jane Jacobs, land tenure, life extension, Long Term Capital Management, market fundamentalism, mutually assured destruction, nuclear winter, Peter Singer: altruism, planetary scale, prediction markets, Ralph Waldo Emerson, Ray Kurzweil, Silicon Valley, smart grid, source of truth, stem cell, Stewart Brand, technoutopianism, the built environment, The Wealth of Nations by Adam Smith, transcontinental railway, Whole Earth Catalog

The strongest critics of the Enlightenment have been its children: Rousseau, Marx, Freud, postmodernists of all stripes. For these revolutionaries and critics, not only has the Enlightenment tradition been the source of the negation; it has itself been transformed, transcended, and made more universal and encompassing by the dialectic generated by the negation. Indeed, the Enlightenment framework succeeded (persisted) only to the extent it was able to continually negate itself as a unique source of "truth." But this process of self-negation was largely carried out in the domains of science and social theory, and largely in reaction to what had come before, not in anticipation of what might be coming. Institutions and Anticipatory Self-Negation What we want to suggest now is that the challenges of rapid and continual technological transformation require an acceleration of the life-giving process of self-negation that has allowed the Enlightenment, as a way of explaining and justifying certain types of human activity (especially the creation of knowledge and the accumulation of wealth), to flourish.


pages: 274 words: 66,721

Double Entry: How the Merchants of Venice Shaped the Modern World - and How Their Invention Could Make or Break the Planet by Jane Gleeson-White

Affordable Care Act / Obamacare, Bernie Madoff, Black Swan, British Empire, business cycle, carbon footprint, corporate governance, credit crunch, double entry bookkeeping, full employment, Gordon Gekko, income inequality, invention of movable type, invention of writing, Islamic Golden Age, Johann Wolfgang von Goethe, Johannes Kepler, joint-stock company, joint-stock limited liability company, Joseph Schumpeter, means of production, Naomi Klein, Nelson Mandela, Ponzi scheme, shareholder value, Silicon Valley, Simon Kuznets, source of truth, spice trade, spinning jenny, The Wealth of Nations by Adam Smith, Thomas Malthus, trade route, traveling salesman, upwardly mobile

It also marked the beginning of the demise of Latin as the universal language of Europe and its gradual replacement by the language of science, which, as Italian astronomer and philosopher Galileo pointed out in 1623, ‘was neither Latin nor the vernacular, but numbers and figures, circles, triangles and squares’. It is worth noting just how radically the printing of multiple copies of reliable charts and figures influenced the course of western civilisation. For a start, as Galileo understood, it made possible the triumph of science and the rise of mathematics as the universal language. Simultaneously, it brought about the demise of the Bible and religion as the ultimate and uncontested source of truth. As historian Elizabeth Eisenstein argues, the changes printing brought ‘provide the most plausible point of departure for explaining how confidence shifted from divine revelation to mathematical reasoning and man-made maps’. It turns out that what Leon Battista Alberti thought he was seeing in the early fifteenth century—the introduction of mathematics to the arts, to painting, sculpture and architecture, as expounded in his 1435 treatise on painting, De pictura—was in fact merely one localised example of the spread of mathematics through every sphere of life.


pages: 231 words: 71,248

Shipping Greatness by Chris Vander Mey

corporate raider, don't be evil, en.wikipedia.org, fudge factor, Google Chrome, Google Hangouts, Gordon Gekko, Jeff Bezos, Kickstarter, Lean Startup, minimum viable product, performance metric, recommendation engine, Skype, slashdot, sorting algorithm, source of truth, Steve Jobs, Superbowl ad, web application

I like dumb questions because when I answer them, I feel like I’ve made progress without expending much effort. It’s a rare and delightful feeling for me. If I think a user may ask the same question that I receive, I write the question in an “External” section of the same document. I also continue to update the document with new questions as they arrive, so the document becomes a “living” source of truth for people with questions. When I get a question I can’t answer, it goes into the FAQ too, along with the hope that someone else will answer it. Worst case, you can use the FAQ just like a personal bug list or source of topics for discussion with your team. When the number of open issues approaches zero, you’re ready to write a quality one-pager or product requirements doc. There are two major benefits of building an FAQ document.


pages: 243 words: 76,686

How to Do Nothing by Jenny Odell

Airbnb, augmented reality, back-to-the-land, Burning Man, collective bargaining, Donald Trump, Filter Bubble, full employment, gig economy, Google Earth, Internet Archive, Jane Jacobs, Jaron Lanier, Kickstarter, late capitalism, Mark Zuckerberg, market fundamentalism, means of production, Minecraft, peer-to-peer, Peter Thiel, Port of Oakland, Results Only Work Environment, Rosa Parks, Sand Hill Road, Silicon Valley, Silicon Valley startup, Snapchat, source of truth, Steve Jobs, strikebreaker, technoutopianism, union organizing, white flight, Works Progress Administration

One must ascend to higher ground to see reality: the government is admirable in many respects, “but seen from a point of view a little higher they are what I have described them; seen from a higher still, and the highest, who shall say what they are, or that they are worth looking at or thinking of at all?” As for Plato, for whom the escapee from the cave suffers and must be “dragged” into the light, Thoreau’s ascent is no Sunday stroll in the park. Instead it is a long hike to the top of a mountain when most would prefer to stay in the hills: They who know of no purer sources of truth, who have traced up its stream no higher, stand, and wisely stand, by the Bible and the Constitution, and drink at it there with reverence and humility; but they who behold where it comes trickling into this lake or that pool, gird up their loins once more, and continue their pilgrimage to its fountain-head.30 Things look different from up there, which explains why Thoreau’s world, like that of Diogenes and Zhuang Zhou, is full of reversals.


pages: 309 words: 81,975

Brave New Work: Are You Ready to Reinvent Your Organization? by Aaron Dignan

"side hustle", activist fund / activist shareholder / activist investor, Airbnb, Albert Einstein, autonomous vehicles, basic income, Bertrand Russell: In Praise of Idleness, bitcoin, Black Swan, blockchain, Buckminster Fuller, Burning Man, butterfly effect, cashless society, Clayton Christensen, clean water, cognitive bias, cognitive dissonance, corporate governance, corporate social responsibility, correlation does not imply causation, creative destruction, crony capitalism, crowdsourcing, cryptocurrency, David Heinemeier Hansson, deliberate practice, DevOps, disruptive innovation, don't be evil, Elon Musk, endowment effect, Ethereum, ethereum blockchain, Frederick Winslow Taylor, future of work, gender pay gap, Geoffrey West, Santa Fe Institute, gig economy, Google X / Alphabet X, hiring and firing, hive mind, income inequality, information asymmetry, Internet of things, Jeff Bezos, job satisfaction, Kevin Kelly, Kickstarter, Lean Startup, loose coupling, loss aversion, Lyft, Marc Andreessen, Mark Zuckerberg, minimum viable product, new economy, Paul Graham, race to the bottom, remote working, Richard Thaler, shareholder value, Silicon Valley, six sigma, smart contracts, Social Responsibility of Business Is to Increase Its Profits, software is eating the world, source of truth, Stanford marshmallow experiment, Steve Jobs, TaskRabbit, the High Line, too big to fail, Toyota Production System, uber lyft, universal basic income, Y Combinator, zero-sum game

If you’re still exchanging documents with names such as “presentation-v32.7-final-ad-final-final.ppt,” you’re missing out on the cheapest productivity boost in the world: multiplayer applications. Apps such as Google’s G Suite, Office 365, Dropbox Paper, Box Notes, Quip, Trello, Evernote, Basecamp, Asana, and Parabol allow multiple users to create and edit documents, files, and data simultaneously. Instead of information flying from inbox to inbox, everyone shares a single source of truth that is always one click away. Teams can coauthor presentations, documents, and even entire projects synchronously or asynchronously, in the same room or remotely, for less than you spend on printer paper. When media company PopSugar switched to G Suite, the average time to go from an interview to a published piece went from twenty-four hours to less than two. Market research company Forrester found that organizations that switch to G Suite experience a 213 percent ROI over three years.


pages: 282 words: 81,873

Live Work Work Work Die: A Journey Into the Savage Heart of Silicon Valley by Corey Pein

23andMe, 4chan, affirmative action, Affordable Care Act / Obamacare, Airbnb, Amazon Mechanical Turk, Anne Wojcicki, artificial general intelligence, bank run, barriers to entry, Benevolent Dictator For Life (BDFL), Bernie Sanders, bitcoin, Build a better mousetrap, California gold rush, cashless society, colonial rule, computer age, cryptocurrency, data is the new oil, disruptive innovation, Donald Trump, Douglas Hofstadter, Elon Musk, Extropian, gig economy, Google bus, Google Glasses, Google X / Alphabet X, hacker house, hive mind, illegal immigration, immigration reform, Internet of things, invisible hand, Isaac Newton, Jeff Bezos, job automation, Kevin Kelly, Khan Academy, Law of Accelerating Returns, Lean Startup, life extension, Lyft, Mahatma Gandhi, Marc Andreessen, Mark Zuckerberg, Menlo Park, minimum viable product, move fast and break things, mutually assured destruction, obamacare, passive income, patent troll, Paul Graham, peer-to-peer lending, Peter H. Diamandis: Planetary Resources, Peter Thiel, platform as a service, plutocrats, Ponzi scheme, post-work, Ray Kurzweil, regulatory arbitrage, rent control, RFID, Robert Mercer, rolodex, Ronald Reagan, Ross Ulbricht, Ruby on Rails, Sam Altman, Sand Hill Road, Scientific racism, self-driving car, sharing economy, side project, Silicon Valley, Silicon Valley startup, Singularitarianism, Skype, Snapchat, social software, software as a service, source of truth, South of Market, San Francisco, Startup school, stealth mode startup, Steve Jobs, Steve Wozniak, TaskRabbit, technological singularity, technoutopianism, telepresence, too big to fail, Travis Kalanick, tulip mania, Uber for X, uber lyft, ubercab, upwardly mobile, Vernor Vinge, X Prize, Y Combinator

The company made money from advertising and referral fees earned whenever it steered users toward a certain credit card, mortgage, insurance policy, business investment, or student loan. “I want to cover every substantial financial decision that anyone can make in their life,” CEO and cofounder Tim Chen, a former banker, told TechCrunch. “We’re talking about a shitload of big decisions there.” The NerdWallet homepage pronounced the company to be “your source of truth for all of life’s financial decisions.” A fine-print “advertiser disclosure” at the foot of the page revealed that the truth was proudly sponsored by Bank of America, Capital One, JPMorgan Chase, Citibank, Discover, and other financial institutions, noting that “compensation may impact which cards we review and write about and how and where products appear on this site.” No kidding! NerdWallet’s reviews had titles like “The Best Bank of America Credit Cards 2015” and “CitiBusiness® / AAdvantage® Platinum Select® World MasterCard®: Your Company’s Ticket to Perks.”


Seeking SRE: Conversations About Running Production Systems at Scale by David N. Blank-Edelman

Affordable Care Act / Obamacare, algorithmic trading, Amazon Web Services, bounce rate, business continuity plan, business process, cloud computing, cognitive bias, cognitive dissonance, commoditize, continuous integration, crowdsourcing, dark matter, database schema, Debian, defense in depth, DevOps, domain-specific language, en.wikipedia.org, fault tolerance, fear of failure, friendly fire, game design, Grace Hopper, information retrieval, Infrastructure as a Service, Internet of things, invisible hand, iterative process, Kubernetes, loose coupling, Lyft, Marc Andreessen, microservices, minimum viable product, MVC pattern, performance metric, platform as a service, pull request, RAND corporation, remote working, Richard Feynman, risk tolerance, Ruby on Rails, search engine result page, self-driving car, sentiment analysis, Silicon Valley, single page application, Snapchat, software as a service, software is eating the world, source of truth, the scientific method, Toyota Production System, web application, WebSocket, zero day

This web interface would show the queue of outstanding requests, the locality of servers, rack diversity, and the current stock of available servers in every cage of the data centers, giving developers more choice and a better understanding of how we built our hardware ecosystem. The next thing that was tackled was the DNS infrastructure. At the time, DNS records for servers were manually added and deployed by the ops team. The first thing the ops team did was to use a CMDB as the source of truth for determining what server records should be automatically added to the zone files. This helped reduce the number of mistakes made, like forgetting to add the trailing dot. When enough confidence was built that these zone files were accurate, they were automatically deployed to the authoritative DNS servers. This, again, freed up much of our time, and developer satisfaction increased. Services in our backend discover one another using DNS SRV records.
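The idea of generating zone-file records from a CMDB can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the passage does not describe the CMDB's schema, so the `hostname` and `ip` fields, the `render_a_records` helper, and the sample addresses (taken from the documentation-only 192.0.2.0/24 range) are hypothetical. The point is only that once records are rendered by code, details such as the trailing dot are applied uniformly rather than remembered by hand.

```python
def fqdn(name):
    """Return the name as a fully qualified domain name with a trailing dot."""
    return name if name.endswith(".") else name + "."


def render_a_records(servers, ttl=300):
    """Render one A record line per server, sorted by hostname for stable diffs."""
    lines = []
    for server in sorted(servers, key=lambda s: s["hostname"]):
        lines.append(f"{fqdn(server['hostname'])} {ttl} IN A {server['ip']}")
    return "\n".join(lines)


# Hypothetical rows as they might be exported from a CMDB.
cmdb_rows = [
    {"hostname": "web2.example.com", "ip": "192.0.2.11"},
    {"hostname": "web1.example.com.", "ip": "192.0.2.10"},
]

zone_fragment = render_a_records(cmdb_rows)
print(zone_fragment)
```

Sorting the output keeps successive generations of the file easy to compare, which helps when building the kind of confidence the passage mentions before deploying automatically.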

Instead, it’s important to modify how you approach certain problems that require consistent, global views of the world. Let’s take a throttling mechanism as an example. As Figure 25-8 illustrates, the throttle passes requests up to a certain limit for a given amount of time and blocks all subsequent requests until the next block of time. This sort of throttle typically requires having a single, consistent source of truth shared between all potential throttling points. Figure 25-8. Each request is registered in a database shared by both load balancers. An alternative to storing a counter in a shared data store would be to store the counter on each load balancer, as depicted in Figure 25-9, but divide the maximum throttle size by the number of load balancers receiving requests. If a single load balancer reaches the limit, all other load balancers are likely to have reached the same limit as well.
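The alternative described above, a counter kept on each load balancer with the global limit divided among them, might look roughly like the following Python sketch. It assumes a fixed-window throttle and an even split of traffic; the class name `LocalThrottle` and its parameters are invented for this example and do not come from the book.

```python
class LocalThrottle:
    """Fixed-window throttle holding only this load balancer's share of the limit."""

    def __init__(self, global_limit, num_balancers, window_seconds):
        # Each balancer enforces an equal slice of the global limit,
        # so no shared data store is consulted per request.
        self.limit = global_limit // num_balancers
        self.window = window_seconds
        self.window_start = None
        self.count = 0

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may pass."""
        if self.window_start is None or now - self.window_start >= self.window:
            # Start a new block of time and reset the local counter.
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


# Global limit of 10 requests per 60 seconds, shared by 2 load balancers.
throttle = LocalThrottle(global_limit=10, num_balancers=2, window_seconds=60)
decisions = [throttle.allow(now=0.0) for _ in range(6)]
print(decisions)
```

The trade-off is that the local limit is only as accurate as the assumption of evenly distributed traffic: if one balancer receives more than its share, it starts blocking before the global limit has actually been reached.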


pages: 291 words: 92,406

Thinking in Pictures: And Other Reports From My Life With Autism by Temple Grandin

Albert Einstein, Asperger Syndrome, factory automation, randomized controlled trial, Richard Feynman, selective serotonin reuptake inhibitor (SSRI), social intelligence, source of truth, theory of mind, twin studies

He went on to say in the same paper: “The road to this paradise was not as comfortable and alluring as the road to the religious paradise; but it has proved itself trustworthy, and I have never regretted having chosen it.” But my favorite of Einstein's words on religion is “Science without religion is lame. Religion without science is blind.” I like this because both science and religion are needed to answer life's great questions. Even scientists such as Richard Feynman, who rejected religion and poetry as sources of truth, concede grudgingly that there are questions that science cannot answer. I am deeply interested in the new chaos theory, because it means that order can arise out of disorder and randomness. I've read many popular articles about it, because I want scientific proof that the universe is orderly. I do not have the mathematical ability to understand chaos theory fully, but it confirms the idea that order can come from disorder and randomness.


pages: 374 words: 94,508

Infonomics: How to Monetize, Manage, and Measure Information as an Asset for Competitive Advantage by Douglas B. Laney

3D printing, Affordable Care Act / Obamacare, banking crisis, blockchain, business climate, business intelligence, business process, call centre, chief data officer, Claude Shannon: information theory, commoditize, conceptual framework, crowdsourcing, dark matter, data acquisition, digital twin, discounted cash flows, disintermediation, diversification, en.wikipedia.org, endowment effect, Erik Brynjolfsson, full employment, informal economy, intangible asset, Internet of things, linked data, Lyft, Nash equilibrium, Network effects, new economy, obamacare, performance metric, profit motive, recommendation engine, RFID, semantic web, smart meter, Snapchat, software as a service, source of truth, supply-chain management, text mining, uber lyft, Y2K, yield curve

And lacking the means to calculate the value of information itself, even seasoned information executives such as CDOs struggle with quantifying the benefits of information management. AIG’s CDO, Leandro DalleMule, told me that the core of his information vision is an overall philosophy of “data defense and data offense… building ‘data management in a box’ by attacking business cases module by module to create a single source of truth that ultimately pulls data from 3000 systems into one place.”1 Applied Asset Management for Information Vision After studying the library science domain, it’s clear that information should ascend to a level of importance on par with—or even above—other assets. However, information leaders should promote the principle that information capability is directly related to business process performance and a source of strategic advantage as their colleagues do in human capital management.


pages: 320 words: 97,509

Doctored: The Disillusionment of an American Physician by Sandeep Jauhar

Affordable Care Act / Obamacare, delayed gratification, illegal immigration, income inequality, Induced demand, medical malpractice, moral hazard, obamacare, profit motive, randomized controlled trial, source of truth, stem cell, The Wealth of Nations by Adam Smith, Yogi Berra

“Why doesn’t he protect everyone?” I asked. “Because we are in his inner circle,” he replied. Then he quickly added, “You must have faith, Doctor. It all depends on faith.” As a boy I’d had faith. I’d believed there were people who possessed special knowledge that I could not access. When I was in trouble, I prayed. But this all had changed. I no longer believed in prayer. I no longer trusted there was a greater source of truth than the thoughts in my own head. I was now apt to ignore the pronouncements of those in authority. Still, I missed that time when I thought others knew more than I about how to live my life. As much as the need for their approval had once unnerved me, my lack of faith was just as unsettling. Since Sonia’s family was playing host, I was accorded a coveted seat next to Guruji in the living room.


pages: 332 words: 100,601

Rebooting India: Realizing a Billion Aspirations by Nandan Nilekani

Airbnb, Atul Gawande, autonomous vehicles, barriers to entry, bitcoin, call centre, cashless society, clean water, cloud computing, collaborative consumption, congestion charging, DARPA: Urban Challenge, dematerialisation, demographic dividend, Edward Snowden, en.wikipedia.org, energy security, financial exclusion, Google Hangouts, illegal immigration, informal economy, Khan Academy, Kickstarter, knowledge economy, land reform, law of one price, M-Pesa, Mahatma Gandhi, Marc Andreessen, Mark Zuckerberg, mobile money, Mohammed Bouazizi, more computing power than Apollo, Negawatt, Network effects, new economy, offshore financial centre, price mechanism, price stability, rent-seeking, RFID, Ronald Coase, school choice, school vouchers, self-driving car, sharing economy, Silicon Valley, Skype, smart grid, smart meter, software is eating the world, source of truth, Steve Jobs, The Nature of the Firm, transaction costs, WikiLeaks

A survey they performed in New Delhi in 2015 found that nearly 22 per cent of the names on the voter list need to be updated or deleted as these individuals were not found at their listed address.28 Any changes to the voter list should be clearly visible, and all political parties and citizens should be brought on board with the process so as to forestall any allegations of malpractice. Equally important, individuals should be able to view their voter records easily so that any errors in their record can be quickly noted, a process that can be made faster through automation. Once the data is available, political campaigns can use it as a single source of truth. The data can be accessed with a simple device like a smartphone or a tablet. As we will discuss in the next chapter, one of the most successful technological innovations in Nandan’s campaign was also among the simplest—a digital voter roll that people could look up on a smartphone, making it easy to find voter data and to direct people to the correct polling location. Many volunteers told us of people who landed up at the wrong polling booth, or whose data was missing in the list of voters, and who would have had to return home without casting their vote if volunteers hadn’t looked up the correct data on the digital roll.


pages: 412 words: 96,251

Why We're Polarized by Ezra Klein

affirmative action, Affordable Care Act / Obamacare, barriers to entry, Bernie Sanders, Cass Sunstein, centre right, Climategate, collapse of Lehman Brothers, currency manipulation / currency intervention, David Brooks, demographic transition, desegregation, Donald Trump, ending welfare as we know it, Ferguson, Missouri, illegal immigration, immigration reform, Nate Silver, obamacare, Ralph Nader, Ronald Reagan, Silicon Valley, single-payer health, source of truth

Hopkins, “How Information Became Ideological,” Inside Higher Ed, October 11, 2016, insidehighered.com/views/2016/10/11/how-conservative-movement-has-undermined-trust-academe-essay. 14 David Roberts, “Donald Trump and the Rise of Tribal Epistemology: Journalism Cannot Be Neutral Toward a Threat to the Conditions That Make It Possible,” Vox, May 19, 2017, vox.com/policy-and-politics/2017/3/22/14762030/donald-trump-tribal-epistemology. 15 David Roberts, “Donald Trump Is the Sole Reliable Source of Truth, Says Chair of House Science Committee: ‘Better to Get Your News Directly from the President,’ said Rep. Lamar Smith of Texas,” Vox, January 27, 2017, vox.com/science-and-health/2017/1/27/14395978/donald-trump-lamar-smith. 16 David Hookstead, “This Sexy Model Is Blowing Up the Internet [SLIDESHOW],” Daily Caller, December 16, 2016, dailycaller.com/2016/12/16/this-sexy-model-is-blowing-up-the-internet-slideshow/; David Hookstead, “This UFC Octagon Girl’s Instagram Account Is Sizzling Hot [SLIDESHOW],” Daily Caller, December 24, 2016, dailycaller.com/2016/12/24/this-ufc-octagon-girls-instagram-account-is-sizzling-hot-slideshow/; Kaitlan Collins, “13 Syrian Refugees We’d Take Immediately [PHOTOS],” Daily Caller, November 18, 2015, dailycaller.com/2015/11/18/13-syrian-refugees-wed-take-immediately-photos/. 17 Jonathan A.


pages: 411 words: 108,119

The Irrational Economist: Making Decisions in a Dangerous World by Erwann Michel-Kerjan, Paul Slovic

"Robert Solow", Andrei Shleifer, availability heuristic, bank run, Black Swan, business cycle, Cass Sunstein, clean water, cognitive dissonance, collateralized debt obligation, complexity theory, conceptual framework, corporate social responsibility, Credit Default Swap, credit default swaps / collateralized debt obligations, cross-subsidies, Daniel Kahneman / Amos Tversky, endowment effect, experimental economics, financial innovation, Fractional reserve banking, George Akerlof, hindsight bias, incomplete markets, information asymmetry, Intergovernmental Panel on Climate Change (IPCC), invisible hand, Isaac Newton, iterative process, Kenneth Arrow, Loma Prieta earthquake, London Interbank Offered Rate, market bubble, market clearing, money market fund, moral hazard, mortgage debt, Pareto efficiency, Paul Samuelson, placebo effect, price discrimination, price stability, RAND corporation, Richard Thaler, Robert Shiller, Robert Shiller, Ronald Reagan, source of truth, statistical model, stochastic process, The Wealth of Nations by Adam Smith, Thomas Bayes, Thomas Kuhn: the structure of scientific revolutions, too big to fail, transaction costs, ultimatum game, University of East Anglia, urban planning, Vilfredo Pareto

There are honest brokers in the world of deeds, eager for scientifically sound proposals that they can carry forward. But there are also practitioners with foregone conclusions, looking for experts whose work they can invoke, in order to justify positions that they have already adopted. They may care little about the quality of our work, as long as it points in their direction and can be cited as a “neutral” source of truth. If our fame advances their cause, then they may help us to find better speaking engagements, better luck with our op-eds, and better consulting opportunities. But we are just means to their predetermined ends. How can we tell whether we are being “kept” by the powerful, rather than getting well-deserved audiences? One positive sign is finding that our supporters have followed a discovery process paralleling our own, independently discovering a behavioral regularity that we have documented and explained.


pages: 518 words: 49,555

Designing Social Interfaces by Christian Crumlish, Erin Malone

A Pattern Language, Amazon Mechanical Turk, anti-pattern, barriers to entry, c2.com, carbon footprint, cloud computing, collaborative editing, creative destruction, crowdsourcing, en.wikipedia.org, Firefox, game design, ghettoisation, Howard Rheingold, hypertext link, if you build it, they will come, Merlin Mann, Nate Silver, Network effects, Potemkin village, recommendation engine, RFC: Request For Comment, semantic web, SETI@home, Skype, slashdot, social graph, social software, social web, source of truth, stealth mode startup, Stewart Brand, telepresence, The Wisdom of Crowds, web application

The Corporate Identity and Profile First of all, the designer should consider how much of the basic social networking foundations need to be a part of the system. In most corporate environments, there is an intranet and an internal employee lookup system, such as LDAP, which gives employees information about role, title, email address, phone number, location, and other information about their fellow colleagues. This information is often managed and generated by the HR and IT departments and is a source of truth in terms of data. Any social tools built for this environment should pull in this existing profile and identity information rather than duplicate it (Figure 18-2). Users should not be required to create another profile. Figure 18-2. The user experience design (UED) team at Yahoo! has its own intranet, but taps into the main Yahoo! internal system for identity information, including username, reporting structure, phone numbers, and email.


pages: 312 words: 114,586

How I Found Freedom in an Unfree World: A Handbook for Personal Liberty by Harry Browne

full employment, Johann Wolfgang von Goethe, Ralph Waldo Emerson, source of truth, War on Poverty

Such a person wants someone else to guarantee that he's right—no matter what happens. You are responsible, because you will experience the consequences of your own acts, and those consequences are the final judge of whether you've been right or wrong. They provide a verdict from which there is no appeal. The insecure individual hopes somehow to bypass that verdict. He looks for a way to believe he's right, no matter what consequences he experiences. He looks for a source of "truth" that he can believe in. When he finds it, he accepts it totally. He feels that this gives him the security to know that he's right, and he prefers that kind of security to the need to rely upon his own ability. The philosophy he finds usually contains three basic ingredients. They are moral rightness, a leader, and an enemy. These ingredients arm him with an assurance that allows him to disregard the test of consequences.


pages: 411 words: 136,413

The Voice of Reason: Essays in Objectivist Thought by Ayn Rand, Leonard Peikoff, Peter Schwartz

affirmative action, Berlin Wall, British Empire, business process, cuban missile crisis, haute cuisine, invisible hand, Isaac Newton, laissez-faire capitalism, means of production, medical malpractice, profit motive, Ralph Nader, Ronald Reagan, source of truth, The Wealth of Nations by Adam Smith, trade route, transcontinental railway, urban renewal, War on Poverty

Freud was merely one of his many heirs, as are the modern skeptics who distort Einstein’s findings to rationalize their viewpoint, as are the rhetoric professors at Berkeley and all their like-minded colleagues. In countless forms, Kant’s rejection of reason is at the root of our modern colleges. Question, debate, dispute—the founding fathers urged men—because by this means you will reach answers to your questions and discover how to act. Question, debate, dispute—our Kantianized faculty urges today—not to find the answers, but to discover that there aren’t any, that there is no source of truth and no guide to action, that the Enlightenment viewpoint was merely a comfortable superstition or a naivete. Come to college, they say, and we’ll cure you of that superstition for life. Which, unfortunately, they often do. “On the first day of classes,” a student from Kent State University in Ohio wrote me, “my English professor said the purpose of college is to take a high-school graduate who’s sure of himself and make him confused.”


pages: 504 words: 147,660

In the Realm of Hungry Ghosts: Close Encounters With Addiction by Gabor Mate, Peter A. Levine

addicted to oil, Albert Einstein, Anton Chekhov, corporate governance, epigenetics, ghettoisation, impulse control, longitudinal study, mass immigration, meta analysis, meta-analysis, Naomi Klein, phenotype, placebo effect, Rat Park, selective serotonin reuptake inhibitor (SSRI), source of truth, twin studies, Yogi Berra

The sacred fire through which Moshe (Moses) experienced the presence of God on Mount Horeb did not annihilate the bush from which it arose: And YHWH’s messenger was seen by him in the flame of a fire out of the midst of a bush. He saw: here, the bush is burning with fire, and the bush is not consumed!3 Passion is divine fire: it enlivens and makes holy; it gives light and yields inspiration. Passion is generous because it’s not ego-driven; addiction is self-centred. Passion gives and enriches; addiction is a thief. Passion is a source of truth and enlightenment; addictive behaviours lead you into darkness. You’re more alive when you are passionate, and you triumph whether or not you attain your goal. But an addiction requires a specific outcome that feeds the ego; without that outcome, the ego feels empty and deprived. A consuming passion that you are helpless to resist, no matter what the consequences, is an addiction. You may even devote your entire life to a passion, but if it’s truly a passion and not an addiction, you’ll do so with freedom, joy and a full assertion of your truest self and values.


pages: 559 words: 155,372

Chaos Monkeys: Obscene Fortune and Random Failure in Silicon Valley by Antonio Garcia Martinez

Airbnb, airport security, always be closing, Amazon Web Services, Burning Man, Celtic Tiger, centralized clearinghouse, cognitive dissonance, collective bargaining, corporate governance, Credit Default Swap, crowdsourcing, death of newspapers, disruptive innovation, drone strike, El Camino Real, Elon Musk, Emanuel Derman, financial independence, global supply chain, Goldman Sachs: Vampire Squid, hive mind, income inequality, information asymmetry, interest rate swap, intermodal, Jeff Bezos, Kickstarter, Malcom McLean invented shipping containers, Marc Andreessen, Mark Zuckerberg, Maui Hawaii, means of production, Menlo Park, minimum viable product, MITM: man-in-the-middle, move fast and break things, Network effects, orbital mechanics / astrodynamics, Paul Graham, performance metric, Peter Thiel, Ponzi scheme, pre–internet, Ralph Waldo Emerson, random walk, Ruby on Rails, Sam Altman, Sand Hill Road, Scientific racism, second-price auction, self-driving car, Silicon Valley, Silicon Valley startup, Skype, Snapchat, social graph, social web, Socratic dialogue, source of truth, Steve Jobs, telemarketer, undersea cable, urban renewal, Y Combinator, zero-sum game, éminence grise

To continue my (now perhaps stretched) nautical analogy, they’re merely the container cranes that shuffle boxes around. But in fact they play a much more important role. Look: the advertiser doesn’t trust its agency, the agency doesn’t trust its trading desk, the trading desk doesn’t trust the ads-buying software it’s using, and the ads-buying technology company doesn’t trust the exchanges. The only thing that keeps this dishonest world honest is the existence of an agreed-upon source of truth. That oracle is the ad server. If a marketer wants to reach a million people in the eastern United States, showing each person no more than four ad impressions between the hours of four and ten p.m. on Thursday (a common buy for movies, incidentally, which always launch on a Friday), then that marketer will be satisfied only if the ad server report says that’s what he got. The ad server isn’t merely a data server spewing forth pixels on demand; it’s also the accounting system that decides what gets delivered when, to whom, how often, and where on the Internet.


pages: 1,402 words: 369,528

A History of Western Philosophy by Bertrand Russell

British Empire, Eratosthenes, Georg Cantor, George Santayana, invention of agriculture, liberation theology, Mahatma Gandhi, plutocrats, Plutocrats, source of truth, Thales and the olive presses, Thales of Miletus, the market place, William of Occam

There is a very sympathetic account of Plato, whom he places above all other philosophers. All others are to give place to him: “Let Thales depart with his water, Anaximenes with the air, the Stoics with their fire, Epicurus with his atoms.”* All these were materialists; Plato was not. Plato saw that God is not any bodily thing, but that all things have their being from God, and from something immutable. He was right, also, in saying that perception is not the source of truth. Platonists are the best in logic and ethics, and nearest to Christianity. “It is said that Plotinus, that lived but lately, understood Plato the best of any.” As for Aristotle, he was Plato’s inferior, but far above the rest. Both, however, said that all gods are good, and to be worshipped. As against the Stoics, who condemned all passion, Saint Augustine holds that the passions of Christians may be causes of virtue; anger, or pity, is not to be condemned per se, but we must inquire into its cause.

The subject was a thorny one; Augustine had dealt with it in his writings against Pelagius, but it was dangerous to agree with Augustine and still more dangerous to disagree with him explicitly. John supported free will, and this might have passed uncensured; but what roused indignation was the purely philosophic character of his argument. Not that he professed to controvert anything accepted in theology, but that he maintained the equal, or even superior, authority of a philosophy independent of revelation. He contended that reason and revelation are both sources of truth, and therefore cannot conflict; but if they ever seem to conflict, reason is to be preferred. True religion, he said, is true philosophy; but, conversely, true philosophy is true religion. His work was condemned by two councils, in 855 and 859; the first of these described it as “Scots porridge.” He escaped punishment, however, owing to the support of the king, with whom he seems to have been on familiar terms.


pages: 598 words: 169,194

Bernie Madoff, the Wizard of Lies: Inside the Infamous $65 Billion Swindle by Diana B. Henriques

accounting loophole / creative accounting, airport security, Albert Einstein, banking crisis, Bernie Madoff, break the buck, British Empire, buy and hold, centralized clearinghouse, collapse of Lehman Brothers, computerized trading, corporate raider, diversified portfolio, Donald Trump, dumpster diving, Edward Thorp, financial deregulation, financial thriller, fixed income, forensic accounting, Gordon Gekko, index fund, locking in a profit, mail merge, merger arbitrage, money market fund, plutocrats, Plutocrats, Ponzi scheme, Potemkin village, random walk, Renaissance Technologies, riskless arbitrage, Ronald Reagan, short selling, Small Order Execution System, source of truth, sovereign wealth fund, too big to fail, transaction costs, traveling salesman

But the current Fairfield account statements still showed options trades. There did not seem to be any explanation, other than that Madoff was lying to someone. Since Madoff’s feeder funds all believed he was still buying options for them, it would be devastating if the SEC started suggesting that he wasn’t. Still, even the discrepancy over his use of options did not suggest to the SEC team that relying on Madoff as a source of truthful information was a bad idea. He remained an extremely reputable figure on Wall Street, as far as they knew. Reflecting on the case later, one investigator wrote a colleague that there wasn’t “any real reason to suspect some kind of wrongdoing.” In December 2005 the chief risk officer at Fairfield Greenwich confirmed in an SEC interview—conducted after the SEC gave him permission to consult with Madoff before testifying—that options remained part of the Madoff strategy.


pages: 526 words: 160,601

A Generation of Sociopaths: How the Baby Boomers Betrayed America by Bruce Cannon Gibney

1960s counterculture, 2013 Report for America's Infrastructure - American Society of Civil Engineers - 19 March 2013, affirmative action, Affordable Care Act / Obamacare, American Society of Civil Engineers: Report Card, Bernie Madoff, Bernie Sanders, Bretton Woods, business cycle, buy and hold, carbon footprint, Charles Lindbergh, cognitive dissonance, collapse of Lehman Brothers, collateralized debt obligation, corporate personhood, Corrections Corporation of America, currency manipulation / currency intervention, Daniel Kahneman / Amos Tversky, dark matter, Deng Xiaoping, Donald Trump, Downton Abbey, Edward Snowden, Elon Musk, ending welfare as we know it, equal pay for equal work, failed state, financial deregulation, Francis Fukuyama: the end of history, future of work, gender pay gap, gig economy, Haight Ashbury, Home mortgage interest deduction, Hyperloop, illegal immigration, impulse control, income inequality, Intergovernmental Panel on Climate Change (IPCC), invisible hand, Jane Jacobs, Kitchen Debate, labor-force participation, Long Term Capital Management, Lyft, Mark Zuckerberg, market bubble, mass immigration, mass incarceration, McMansion, medical bankruptcy, Menlo Park, Mont Pelerin Society, moral hazard, mortgage debt, mortgage tax deduction, neoliberal agenda, Network effects, obamacare, offshore financial centre, oil shock, operation paperclip, plutocrats, Plutocrats, Ponzi scheme, price stability, quantitative easing, Ralph Waldo Emerson, RAND corporation, rent control, ride hailing / ride sharing, risk tolerance, Robert Shiller, Ronald Reagan, Rubik’s Cube, school choice, secular stagnation, self-driving car, shareholder value, short selling, side project, Silicon Valley, smart grid, Snapchat, source of truth, stem cell, Steve Jobs, Stewart Brand, survivorship bias, TaskRabbit, The Wealth of Nations by Adam Smith, Tim Cook: Apple, too big to fail, War on Poverty, white picket fence, Whole Earth Catalog, women in the workforce, Y2K, Yom Kippur War, zero-sum game

If all else failed, one could invoke a fake epistemic crisis by stating—accurately, if misleadingly—that scientists are not 100 percent certain about, e.g., global warming. We cannot be 100 percent sure that we’d lose a game of Russian roulette with a fully loaded six-shooter, either, though this is largely how Boomer climate policy has operated. Having deposed empiricism, the Boomers were free to seek new sources of truth, and these they located in feelings, a commodity not in short supply during the Age of Aquarius. The triumph of feelings shows up in the literature of the period. Usage of the word “feel,” stable for decades, rose dramatically from the mid-1960s, as did the more revealing “how I feel” and “I feel that.”25 For people without sufficient access to their own thought processes, the debut of the Mood Ring in 1975—a tacky contraption designed to change colors in response to a person’s mood (or body temperature, anyway)—provided a handy gauge.


pages: 661 words: 169,298

Coming of Age in the Milky Way by Timothy Ferris

Albert Einstein, Albert Michelson, Alfred Russel Wallace, anthropic principle, Arthur Eddington, Atahualpa, Cepheid variable, Commentariolus, cosmic abundance, cosmic microwave background, cosmological constant, cosmological principle, dark matter, delayed gratification, Edmond Halley, Eratosthenes, Ernest Rutherford, Gary Taubes, Harlow Shapley and Heber Curtis, Harvard Computers: women astronomers, Henri Poincaré, invention of writing, Isaac Newton, Johannes Kepler, John Harrison: Longitude, Karl Jansky, Lao Tzu, Louis Pasteur, Magellanic Cloud, mandelbrot fractal, Menlo Park, Murray Gell-Mann, music of the spheres, planetary scale, retrograde motion, Richard Feynman, Search for Extraterrestrial Intelligence, Searching for Interstellar Communications, Solar eclipse in 1919, source of truth, Stephen Hawking, Thales of Miletus, Thomas Kuhn: the structure of scientific revolutions, Thomas Malthus, Wilhelm Olbers

One reason for his reluctance to publish was that Copernicus, like Darwin, had reason to fear censure by the religious authorities. The threat of papal disapproval was real enough that the Lutheran theologian Andreas Osiander thought it prudent to oil the waters by writing an unsigned preface to Copernicus’s book, as if composed by the dying Copernicus himself, reassuring its readers that divine revelation was the sole source of truth and that astronomical treatises like this one were intended merely to “save the phenomena.” Nor were the Protestants any more apt to kiss the heliocentric hem. “Who will venture to place the authority of Copernicus above that of the Holy Spirit?” thundered Calvin,12 and Martin Luther complained, in his voluble way, that “this fool wishes to reverse the entire science of astronomy; but sacred Scripture tells us that Joshua commanded the sun to stand still, and not the earth.”13* The book survived, however, and changed the world, for much the same reason that Darwin’s Origin of Species did—because it was too technically competent for the professionals to ignore it.


Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals by David Aronson

Albert Einstein, Andrew Wiles, asset allocation, availability heuristic, backtesting, Black Swan, butter production in Bangladesh, buy and hold, capital asset pricing model, cognitive dissonance, compound rate of return, computerized trading, Daniel Kahneman / Amos Tversky, distributed generation, Elliott wave, en.wikipedia.org, feminist movement, hindsight bias, index fund, invention of the telescope, invisible hand, Long Term Capital Management, mental accounting, meta analysis, meta-analysis, p-value, pattern recognition, Paul Samuelson, Ponzi scheme, price anchoring, price stability, quantitative trading / quantitative finance, Ralph Nelson Elliott, random walk, retrograde motion, revision control, risk tolerance, risk-adjusted returns, riskless arbitrage, Robert Shiller, Sharpe ratio, short selling, source of truth, statistical model, stocks for the long run, systematic trading, the scientific method, transfer pricing, unbiased observer, yield curve, Yogi Berra

They tell us how to act (ethics), how governments should rule (political philosophy), what is beautiful (aesthetics), and of greatest concern to us, what constitutes valid knowledge (epistemology) and how we should go about getting it (philosophy of science). The scientific revolution was, in part, a revolt against Aristotelian science. The Greeks regarded the physical world as an unreliable source of truth. According to Plato, mentor of Aristotle, the world was merely a flawed copy of the truth and perfection that existed in the world of Forms, a metaphysical nonmaterial realm, where archetypes of the perfect dog, the perfect tree, and every other imaginable thing could be found. When the revolt against this view of reality finally arrived, it was harsh and unremitting.19 The new school of thought, empiricism, resoundingly rejected the Greek paradigm.


pages: 659 words: 203,574

The Collected Stories of Vernor Vinge by Vernor Vinge

anthropic principle, Asilomar, back-to-the-land, dematerialisation, gravity well, invisible hand, low earth orbit, Machinery of Freedom by David Friedman, MITM: man-in-the-middle, source of truth, technological singularity, unbiased observer, Vernor Vinge

It put him in a cold sweat to think how casually he published new twists on traditional themes, or allowed small inconsistencies into story cycles. And just a few days ago, he’d looked forward to testing the Hrala skit with these people! The tall priest’s tone remained friendly: “You have come at an appropriate moment, Master Guille. We have confronted blasphemers—who may be harbingers of the Final Battle. Now is a time when we must consult all sources of Truth.” Another priest, an older fellow with a limp, interrupted with something abrupt. The tall guy paused, and looked faintly embarrassed; suddenly Guille knew that he was more than an interpreter, but not one of the high priests. “It will be necessary to inspect both your boat and your persons. More blasphemers may come in fair forms…. Don’t be angered; it is but a formality. I, we recognize you from before.


pages: 904 words: 246,845

A History of the Bible: The Story of the World's Most Influential Book by John Barton

complexity theory, feminist movement, invention of the printing press, Johannes Kepler, lateral thinking, liberation theology, Republic of Letters, source of truth, the market place, trade route

What they call liberal biblical study – that which encourages a critical attitude to Scripture – strikes them as arid and uninspiring, even as faithless and essentially unChristian. In Britain and North America the churches that are growing tend to be those that adopt such a conservative approach to Scripture. They believe that the whole of the Christian faith can be derived from the Bible, which is seen as the only source of truth and inspiration. This produces at least five principles for reading the Bible, which more liberal Christians often endorse too, though in a watered-down version. First, it is claimed, we should read the Bible in the expectation that what we find there will be true. For some Christians the truth that is sought is literal and historical, so that whatever the biblical text affirms is taken to be factually accurate.


The Master and His Emissary: The Divided Brain and the Making of the Western World by Iain McGilchrist

Albert Einstein, Asperger Syndrome, Benoit Mandelbrot, Berlin Wall, cognitive bias, cognitive dissonance, computer age, Donald Trump, double helix, Douglas Hofstadter, epigenetics, experimental subject, Fellow of the Royal Society, Georg Cantor, hedonic treadmill, Henri Poincaré, Lao Tzu, longitudinal study, Louis Pasteur, mandelbrot fractal, meta analysis, meta-analysis, music of the spheres, Necker cube, Panopticon Jeremy Bentham, pattern recognition, randomized controlled trial, Sapir-Whorf hypothesis, Schrödinger's Cat, social intelligence, social web, source of truth, stem cell, Steven Pinker, the scientific method, theory of mind

As the name of him who burnt the library of Alexandria.’78 Like Heidegger, Wittgenstein too emphasised the primacy of context over rules and system building, of practice over theory: ‘What one acquires here is not a technique; one learns correct judgments. There are also rules, but they do not form a system, and only experienced people can apply them right. Unlike calculating rules.’79 He emphasised that it is not just minds that think and feel, but human beings. Like Heidegger, he grasped that truth can hide or deceive as well as reveal. Wittgenstein scholar Peter Hacker writes: Every source of truth is also unavoidably a source of falsehood, from which its own canons of reasoning and confirmation attempt to protect it. But it can also become a source of conceptual confusion, and consequently of forms of intellectual myth-making, against which it is typically powerless. Scientism, the illicit extension of the methods and categories of science beyond their legitimate domain, is one such form, and the conception of the unity of the sciences and the methodological homogeneity of the natural sciences and of humanistic studies one such myth.