recommendation engine

166 results


pages: 523 words: 112,185

Doing Data Science: Straight Talk From the Frontline by Cathy O'Neil, Rachel Schutt

Amazon Mechanical Turk, augmented reality, Augustin-Louis Cauchy, barriers to entry, Bayesian statistics, bioinformatics, computer vision, correlation does not imply causation, crowdsourcing, distributed generation, Edward Snowden, Emanuel Derman, fault tolerance, Filter Bubble, finite state, Firefox, game design, Google Glasses, index card, information retrieval, iterative process, John Harrison: Longitude, Khan Academy, Kickstarter, Mars Rover, Nate Silver, natural language processing, Netflix Prize, p-value, pattern recognition, performance metric, personalized medicine, pull request, recommendation engine, rent-seeking, selection bias, Silicon Valley, speech recognition, statistical model, stochastic process, text mining, the scientific method, The Wisdom of Crowds, Watson beat the top human players on Jeopardy!, X Prize

Back-of-book index entries (P–R) from this title, including: recommendation engines, Recommendation Engines: Building a User-Facing Data Product at Scale–Exercise: Build Your Own Recommendation System; Amazon; building your own (exercise); the dimensionality problem; k-Nearest Neighbors (k-NN); machine learning classification; Netflix; a real-world recommendation engine.

Qualitative surveys can really help. Chapter 8. Recommendation Engines: Building a User-Facing Data Product at Scale Recommendation engines, also called recommendation systems, are the quintessential data product and are a good starting point when you’re explaining to non–data scientists what you do or what data science really is. This is because many people have interacted with recommendation systems when they’ve been suggested books on Amazon.com or gotten recommended movies on Netflix. Beyond that, however, they likely have not thought much about the engineering and algorithms underlying those recommendations, nor the fact that their behavior when they buy a book or rate a movie is generating data that then feeds back into the recommendation engine and leads to (hopefully) improved recommendations for themselves and other people.

The service can also be used by third parties to personalize content for a given site—a nice business proposition that led to eBay acquiring Hunch.com. Matt has been building code since he was a kid, so he considers software engineering to be his strong suit. Hunch requires cross-domain experience so he doesn’t consider himself a domain expert in any focused way, except for recommendation systems themselves. The best quote Matt gave us was this: “Forming a data team is kind of like planning a heist.” He means that you need people with all sorts of skills, and that one person probably can’t do everything by herself. (Think Ocean’s Eleven, but sexier.) A Real-World Recommendation Engine Recommendation engines are used all the time—what movie would you like, knowing other movies you liked? What book would you like, keeping in mind past purchases? What kind of vacation are you likely to embark on, considering past trips? There are plenty of different ways to go about building such a model, but they have very similar feels if not implementation.


pages: 347 words: 91,318

Netflixed: The Epic Battle for America's Eyeballs by Gina Keating

activist fund / activist shareholder / activist investor, barriers to entry, business intelligence, collaborative consumption, corporate raider, inventory management, Jeff Bezos, late fees, Mark Zuckerberg, McMansion, Menlo Park, Netflix Prize, new economy, out of africa, performance metric, Ponzi scheme, pre–internet, price stability, recommendation engine, Saturday Night Live, shareholder value, Silicon Valley, Silicon Valley startup, Steve Jobs, subscription business, Superbowl ad, telemarketer, X Prize

Eventually Hastings brought in mathematicians to help formulate the underlying algorithms. They named the recommendation system Cinematch, and launched it in January 2000 with a “Movies for Two” promotion that promised to help couples find common ground in their movie choices. Rather than using their initial approach of assigning traits to each movie and trying to match similar films, Cinematch created customer clusters—people who had rated movies similarly. Cinematch noted the overlap in certain subscribers’ tastes through a five-star ratings system, then presented films highly rated by cluster members to others in the same cluster who had not previously rented or rated them on Netflix. The recommendation engine scoured Netflix’s inventory hourly to take into consideration how many copies of each movie it had before issuing recommendations.
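The cluster-then-recommend idea described here can be sketched in a few lines of Python. This is a toy illustration of the approach, not Netflix's actual Cinematch code; the names, ratings, and similarity formula are all invented for the example. It treats a subscriber's "cluster" as the handful of other raters whose five-star ratings agree with theirs most closely, then surfaces titles those cluster-mates rated highly that the subscriber has not yet rated.

from collections import defaultdict

# Toy five-star ratings: subscriber -> {movie: stars}. All data is invented.
ratings = {
    "ann":  {"Heat": 5, "Alien": 4, "Up": 2},
    "bob":  {"Heat": 5, "Alien": 5, "Se7en": 4},
    "cara": {"Up": 5, "Frozen": 4, "Heat": 1},
}

def agreement(a, b):
    """Crude closeness score: average agreement on commonly rated movies."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    return sum(4 - abs(ratings[a][m] - ratings[b][m]) for m in common) / (4 * len(common))

def recommend(user, k=2, min_stars=4):
    # The user's "cluster": the k raters who agree with them most.
    peers = sorted((u for u in ratings if u != user),
                   key=lambda u: agreement(user, u), reverse=True)[:k]
    scores = defaultdict(list)
    for peer in peers:
        for movie, stars in ratings[peer].items():
            if movie not in ratings[user] and stars >= min_stars:
                scores[movie].append(stars)
    # Rank unseen titles by how highly the cluster rated them.
    return sorted(scores, key=lambda m: sum(scores[m]) / len(scores[m]), reverse=True)

print(recommend("ann"))  # ['Se7en', 'Frozen']

A real system would then apply inventory constraints of the kind described above (how many copies are in stock) as a separate filtering step after scoring.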

The costs of buying enough DVDs to satisfy the growing subscriber base would eventually crush the company unless Lowe could persuade studios to drop DVD prices drastically in exchange for a share of rental revenues. In the meantime, Netflix engineers had been hard at work since shortly after launch on a recommendation engine—an in-house solution to DVD shortages that would theoretically drive up retention and get more of the company’s catalog into circulation by directing customers away from the most popular films toward more obscure titles that they would like just as much. As a result, the recommendation engine took over the editorial team’s tasks of determining which movies to feature on certain themed Web pages, using machine logic rather than human intuition. The program ran through several criteria when considering which movies to show consumers: How many copies do we have in stock?

The big prize seemed well within their grasp as they entered the contest’s second year. • • • WHEN NETFLIX’S FOUNDING software engineers, including Hastings, contemplated building a recommendation engine in 1999, their first approach was rudimentary and involved linking movies through common attributes: genre, actors, director, setting, happy or sad ending. As the film library grew, that method proved cumbersome and inaccurate, because no matter how many attributes they assigned each film, they could not capture why Pretty Woman was so different from, say, American Gigolo. Both were movies about prostitution set in a major U.S. city and starring Richard Gere, but they were unlikely to appeal to the same audiences. Early recommendation engines were unpredictable: In one famous gaffe, Walmart had to issue an apology and disable theirs after its Web site presented the film Planet of the Apes to shoppers looking for films related to Black History Month.
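The attribute-linking approach described above is easy to reproduce, and its weakness shows up immediately. A toy sketch with invented attribute lists (not Netflix's data): score two films by the overlap of their attribute sets, and Pretty Woman and American Gigolo come out looking similar even though they appeal to different audiences, which is exactly the failure that pushed Netflix toward rating-based clustering instead.

def jaccard(a, b):
    """Overlap between two attribute sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Invented attribute lists, for illustration only.
pretty_woman    = {"prostitution", "major US city", "Richard Gere", "romance", "happy ending"}
american_gigolo = {"prostitution", "major US city", "Richard Gere", "thriller", "bleak ending"}

print(round(jaccard(pretty_woman, american_gigolo), 2))  # 0.43 - "similar" on paper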


pages: 451 words: 103,606

Machine Learning for Hackers by Drew Conway, John Myles White

call centre, centre right, correlation does not imply causation, Debian, Erdős number, Nate Silver, natural language processing, Netflix Prize, p-value, pattern recognition, Paul Erdős, recommendation engine, social graph, SpamAssassin, statistical model, text mining, the scientific method, traveling salesman

Generating rules for ranking a list of items is an increasingly common task in machine learning, yet you may not have thought of it in these terms. More likely, you have heard of something like a recommendation system, which implicitly produces a ranking of products. Even if you have not heard of a recommendation system, it’s almost certain that you have used or interacted with a recommendation system at some point. Some of the most successful ecommerce websites have benefited from leveraging data on their users to generate recommendations for other products their users might be interested in. For example, if you have ever shopped at Amazon.com, then you have interacted with a recommendation system. The problem Amazon faces is simple: what items in their inventory are you most likely to buy? The implication of that statement is that the items in Amazon’s inventory have an ordering specific to each user.

To do so, we define measures of distance and describe methods for clustering observations based on their spatial distances. We use data from US Senator roll call voting to cluster those legislators based on their votes.

Recommendation system: suggesting R packages to users
To further the discussion of spatial similarities, we discuss how to build a recommendation system based on the closeness of observations in space. Here we introduce the k-nearest neighbors algorithm and use it to suggest R packages to programmers based on their currently installed packages (a minimal sketch of this idea appears after this overview).

Social network analysis: who to follow on Twitter
Here we attempt to combine many of the concepts previously discussed, as well as introduce a few new ones, to design and build a "who to follow" recommendation system from Twitter data. In this case we build a system for downloading Twitter network data, discover communities within the structure, and recommend new users to follow using basic social network analysis techniques.
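Here is the promised sketch of the k-nearest-neighbors package recommender (the chapter itself works in R; this sketch uses Python and invented data): represent each programmer as the set of packages they have installed, find the k programmers with the most similar sets, and suggest the packages they have that you are missing.

from collections import Counter

# Invented data: programmer -> set of installed R packages.
installed = {
    "p1": {"ggplot2", "dplyr", "knitr"},
    "p2": {"ggplot2", "dplyr", "shiny"},
    "p3": {"data.table", "knitr"},
}

def overlap(a, b):
    """Jaccard similarity between two sets of installed packages."""
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest(user_packages, k=2):
    # The k "nearest" programmers by installed-package overlap...
    neighbors = sorted(installed, key=lambda p: overlap(user_packages, installed[p]),
                       reverse=True)[:k]
    # ...vote for the packages they have that we don't.
    votes = Counter(pkg for p in neighbors for pkg in installed[p] - user_packages)
    return [pkg for pkg, _ in votes.most_common()]

print(suggest({"ggplot2", "dplyr"}))  # ['knitr', 'shiny']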

Perhaps the structure is quite obvious, as is the case with Drew’s network, or maybe the communities are more subtle. It can be quite interesting and informative to explore these structures in detail, and we encourage you to do so. In the next and final section, we will use these community structures to build our own “who to follow” recommendation engine for Twitter. Building Your Own “Who to Follow” Engine There are many ways that we might think about building our own friend recommendation engine for Twitter. Twitter has many dimensions of data in it, so we could think about recommending people based on what they “tweet” about. This would be an exercise in text mining and would require matching people based on some common set of words or topics within their corpus of tweets. Likewise, many tweets contain geo-location data, so we might recommend users who are active and in close proximity to you.
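Besides tweet text and geo-location, the third obvious signal is the follow graph itself, which is roughly what the chapter builds on. A toy baseline on an invented graph (not the authors' actual pipeline): recommend the accounts most often followed by the people you already follow.

from collections import Counter

# Invented directed follow graph: user -> accounts they follow.
follows = {
    "you":   {"alice", "bob"},
    "alice": {"bob", "carol", "dave"},
    "bob":   {"carol", "erin"},
}

def who_to_follow(user, top_n=3):
    already = follows[user] | {user}
    # Count how many of our followees follow each candidate account.
    votes = Counter(candidate
                    for followee in follows[user]
                    for candidate in follows.get(followee, set())
                    if candidate not in already)
    return [account for account, _ in votes.most_common(top_n)]

print(who_to_follow("you"))  # 'carol' ranks first: both of your followees follow her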


pages: 23 words: 5,264

Designing Great Data Products by Jeremy Howard, Mike Loukides, Margit Zwemer

AltaVista, Filter Bubble, PageRank, pattern recognition, recommendation engine, self-driving car, sentiment analysis, Silicon Valley, text mining

One of the authors of this paper was explaining an iterative optimization technique, and the host says, “So, in a sense Jeremy, your approach was like that of doing a startup, which is just get something out there and iterate and iterate and iterate.” The takeaway, whether you are a tiny startup or a giant insurance company, is that we unconsciously use optimization whenever we decide how to get to where we want to go. Drivetrain Approach to recommender systems Let’s look at how we could apply this process to another industry: marketing. We begin by applying the Drivetrain Approach to a familiar example, recommendation engines, and then building this up into an entire optimized marketing strategy. Recommendation engines are a familiar example of a data product based on well-built predictive models that do not achieve an optimal objective. The current algorithms predict what products a customer will like, based on purchase history and the histories of similar customers. A company like Amazon represents every purchase that has ever been made as a giant sparse matrix, with customers as the rows and products as the columns.
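That giant sparse matrix is easy to represent with standard tools, and a single matrix product already yields co-purchase counts. A small illustration with invented purchases (obviously not Amazon's data), using SciPy:

import numpy as np
from scipy.sparse import csr_matrix

# Invented purchase history: (customer index, product index) pairs.
customers = ["c0", "c1", "c2"]
products  = ["book A", "book B", "kettle"]
purchases = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]

rows, cols = zip(*purchases)
# Customers as rows, products as columns; almost every entry is zero,
# which is why the sparse representation matters at Amazon's scale.
X = csr_matrix((np.ones(len(purchases)), (rows, cols)),
               shape=(len(customers), len(products)))

# Entry (i, j) of X^T X counts the customers who bought both product i and product j.
co_purchase = (X.T @ X).toarray()
print(co_purchase)
# [[2. 2. 0.]
#  [2. 2. 0.]
#  [0. 0. 1.]]

Normalizing those raw counts (for example, by each product's overall popularity) is the step that turns them into "customers who bought this also bought" scores rather than a list of bestsellers.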

These models are good at predicting whether a customer will like a given product, but they often suggest products that the customer already knows about or has already decided not to buy. Amazon’s recommendation engine is probably the best one out there, but it’s easy to get it to show its warts. Here is a screenshot of the “Customers Who Bought This Item Also Bought” feed on Amazon from a search for the latest book in Terry Pratchett’s “Discworld series:” All of the recommendations are for other books in the same series, but it’s a good assumption that a customer who searched for “Terry Pratchett” is already aware of these books. There may be some unexpected recommendations on pages 2 through 14 of the feed, but how many customers are going to bother clicking through? Instead, let’s design an improved recommendation engine using the Drivetrain Approach, starting by reconsidering our objective. The objective of a recommendation engine is to drive additional sales by surprising and delighting the customer with books he or she would not have purchased without the recommendation.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein. O'Reilly Media * * * Chapter 1. Designing Great Data Products By Jeremy Howard, Margit Zwemer, and Mike Loukides In the past few years, we’ve seen many data products based on predictive modeling. These products range from weather forecasting to recommendation engines to services that predict airline flight times more accurately than the airline itself. But these products are still just making predictions, rather than asking what action they want someone to take as a result of a prediction. Prediction technology can be interesting and mathematically elegant, but we need to take the next step. The technology exists to build data products that can revolutionize entire industries.


pages: 519 words: 102,669

Programming Collective Intelligence by Toby Segaran

always be closing, correlation coefficient, Debian, en.wikipedia.org, Firefox, full text search, information retrieval, PageRank, prediction markets, recommendation engine, slashdot, Thomas Bayes, web application

To find a set of links similar to one that you found particularly interesting, you can try:

>>> url=recommendations.getRecommendations(delusers,user)[0][1]
>>> recommendations.topMatches(recommendations.transformPrefs(delusers),url)
[(0.312, u'http://www.fonttester.com/'), (0.312, u'http://www.cssremix.com/'), (0.266, u'http://www.logoorange.com/color/color-codes-chart.php'), (0.254, u'http://yotophoto.com/'), (0.254, u'http://www.wpdfd.com/editorial/basics/index.html')]

That's it! You've successfully added a recommendation engine to del.icio.us. There's a lot more that could be done here. Since del.icio.us supports searching by tags, you can look for tags that are similar to each other. You can even search for people trying to manipulate the "popular" pages by posting the same links with multiple accounts.

Item-Based Filtering
The way the recommendation engine has been implemented so far requires the use of all the rankings from every user in order to create a dataset. This will probably work well for a few thousand people or items, but a very large site like Amazon has millions of customers and products—comparing a user with every other user and then comparing every product each user has rated can be very slow.
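Item-based filtering, which the passage turns to next, sidesteps that scaling problem by computing an item-to-item similarity table offline and consulting only the items a given user has rated at recommendation time. A minimal standalone sketch of the idea, with invented similarity scores (this is not the book's own implementation):

# Precomputed offline: for each item, its most similar items and a similarity
# score (invented here; in practice derived from co-occurring ratings).
similar_items = {
    "fonttester.com": [("cssremix.com", 0.31), ("yotophoto.com", 0.25)],
    "cssremix.com":   [("fonttester.com", 0.31), ("logoorange.com", 0.27)],
}

def recommend_items(user_ratings, top_n=3):
    """Score unseen items by their similarity to items the user already rated,
    weighted by the user's rating - no scan over all other users is needed."""
    scores, weights = {}, {}
    for item, rating in user_ratings.items():
        for other, sim in similar_items.get(item, []):
            if other in user_ratings:
                continue
            scores[other] = scores.get(other, 0.0) + sim * rating
            weights[other] = weights.get(other, 0.0) + sim
    # Normalize so heavily linked items don't win on volume alone.
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]

print(recommend_items({"fonttester.com": 5.0}))  # ['cssremix.com', 'yotophoto.com']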

Introduction to Collective Intelligence Netflix is an online DVD rental company that lets people choose movies to be sent to their homes, and makes recommendations based on the movies that customers have previously rented. In late 2006 it announced a prize of $1 million to the first person to improve the accuracy of its recommendation system by 10 percent, along with progress prizes of $50,000 to the current leader each year for as long as the contest runs. Thousands of teams from all over the world entered and, as of April 2007, the leading team has managed to score an improvement of 7 percent. By using data about which movies each customer enjoyed, Netflix is able to recommend movies to other customers that they may never have even heard of and keep them coming back for more. Any way to improve its recommendation system is worth a lot of money to Netflix. The search engine Google was started in 1998, at a time when there were already several big search engines, and many assumed that a new player would never be able to take on the giants.

Google is likely the largest effort—it not only uses web links to rank pages, but it constantly gathers information on when advertisements are clicked by different users, which allows Google to target the advertising more effectively. In Chapter 4 you'll learn about search engines and the PageRank algorithm, an important part of Google's ranking system. Other examples include web sites with recommendation systems. Sites like Amazon and Netflix use information about the things people buy or rent to determine which people or items are similar to one another, and then make recommendations based on purchase history. Other sites like Pandora and Last.fm use your ratings of different bands and songs to create custom radio stations with music they think you will enjoy. Chapter 2 covers ways to build recommendation systems. Prediction markets are also a form of collective intelligence. One of the most well known of these is the Hollywood Stock Exchange (http://hsx.com), where people trade stocks on movies and movie stars.


pages: 1,085 words: 219,144

Solr in Action by Trey Grainger, Timothy Potter

business intelligence, cloud computing, commoditize, conceptual framework, crowdsourcing, data acquisition, en.wikipedia.org, failed state, fault tolerance, finite state, full text search, glass ceiling, information retrieval, natural language processing, openstreetmap, performance metric, premature optimization, recommendation engine, web application

Instead of thinking of Solr as a text search engine, it can be mentally freeing to think of Solr as a “matching engine that happens to be able to match on parsed text.” Whether the search is manual or automated is of no consequence to Solr. In fact, several organizations have successfully built recommender systems directly on top of Solr using this thinking. The following sections will cover how to build your own Solr-powered recommendation engine and ultimately how to merge the concepts of a user-driven search experience and an automated recommendation system to provide a powerful, personalized search experience. In particular, we will discuss several content-based recommendation approaches including attribute-based matching, hierarchical-classification-based matching, matching based upon extracted interesting terms (More Like This), concept-based matching, and geographical matching.
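For the "More Like This" style of matching listed above, Solr exposes a MoreLikeThis handler that returns documents sharing the interesting terms it extracts from a seed document. A rough sketch of calling it over HTTP; it assumes a core named products, a text field named description, and that the /mlt request handler is enabled, all of which depend on your own configuration:

import requests

SOLR = "http://localhost:8983/solr/products"  # assumed core name and location

def more_like_this(doc_id, rows=5):
    """Ask Solr for documents similar to one seed document, based on the
    interesting terms it extracts from the description field."""
    params = {
        "q": f"id:{doc_id}",       # the seed document
        "mlt.fl": "description",   # field(s) to mine for interesting terms
        "mlt.mintf": 1,            # loosen term-frequency cut-offs for small indexes
        "mlt.mindf": 1,
        "rows": rows,
        "wt": "json",
    }
    resp = requests.get(f"{SOLR}/mlt", params=params, timeout=10)
    resp.raise_for_status()
    return [doc["id"] for doc in resp.json()["response"]["docs"]]

# Example call (hypothetical document ID):
# print(more_like_this("SKU-1234"))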

This shifts the paradigm completely, because it requires software systems to be intelligent enough to recommend information to users as opposed to having them explicitly search for it. Although organizations such as Netflix and Amazon are well known for their recommender systems and have spent millions of dollars developing them, it’s both possible and easy to develop such systems yourself—particularly on top of Solr—to drastically improve the relevancy of your application. 16.5.1. Search vs. recommendations When one thinks of a search engine, the vision of a keyword box (and sometimes a separate location box) typically comes to mind. Likewise, when one thinks of a recommendation engine, the vision of a magical algorithm which automatically suggests information based upon past behavior and preferences likely comes to mind. In reality, both search and recommendations are just related forms of matching, with search engines generally matching keywords and locations in a query to keywords and locations in a document, and recommendation engines typically matching behavior of users to documents for which other users exhibited similar behaviors or matching content of one document to the content of another document.

The beauty of collaborative filtering, regardless of the implementation, is that it’s able to work without any knowledge about the content of your documents. Therefore, you could build a recommendation engine based upon Solr with documents containing nothing more than document IDs and users, and you should still see quality recommendations as long as you have enough users linking your documents together. If you don’t put any text content, attributes, or classifications into Solr, then it means you will not be able to make use of those additional techniques at all. The next section will discuss why you may want to consider combining multiple techniques to achieve optimal relevancy in your recommendation system. 16.5.8. Hybrid approaches Throughout this chapter, you have seen multiple different recommendation approaches, each with its own strengths and weaknesses.
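Returning to the bare-bones setup described above, documents carrying nothing but an ID and the users who interacted with them, here is a rough sketch of what the query side can look like. The core name and the liked_by field are assumptions, not a prescribed schema; the idea is simply that documents sharing many of "your" users score higher.

import requests

SOLR = "http://localhost:8983/solr/items"  # assumed core name
FIELD = "liked_by"                         # assumed multivalued user-ID field

def solr_docs(query, rows=50):
    resp = requests.get(f"{SOLR}/select",
                        params={"q": query, "rows": rows, "wt": "json"},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def recommend_for(user, rows=10):
    # 1. Documents this user already liked.
    mine = solr_docs(f'{FIELD}:"{user}"')
    my_ids = {doc["id"] for doc in mine}
    # 2. The other users on those documents are the links to new ones.
    peers = {u for doc in mine for u in doc.get(FIELD, []) if u != user}
    if not peers:
        return []
    # 3. Documents liked by those peers, minus what we already have;
    #    matching more peer clauses pushes a document up the ranking.
    peer_query = " OR ".join(f'{FIELD}:"{p}"' for p in peers)
    return [doc["id"] for doc in solr_docs(peer_query, rows * 3)
            if doc["id"] not in my_ids][:rows]

# print(recommend_for("user42"))  # hypothetical user ID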


Designing Search: UX Strategies for Ecommerce Success by Greg Nudelman, Pabini Gabriel-Petit

access to a mobile phone, Albert Einstein, AltaVista, augmented reality, barriers to entry, business intelligence, call centre, crowdsourcing, information retrieval, Internet of things, performance metric, QR code, recommendation engine, RFID, search engine result page, semantic web, Silicon Valley, social graph, social web, speech recognition, text mining, the map is not the territory, The Wisdom of Crowds, web application, zero-sum game, Zipcar

To drive people to explore, the overall design of each group must be fairly Spartan, so customers can make their decisions quickly and move on to exploring the main body of the search results. If fancy group formatting or Ajax carousels make customers disregard the more important More Like This buttons, such a page fails to meet its primary objective. Note—If you are still thinking about using a carousel for your More Like This groups, consider that Netflix has one of the best recommendation engines in the world and can usually select very relevant items to include among its 8 to 10 options. Amazon.com, which also has an exceptional recommendation engine, tried incorporating carousels for all its groups in the past, but has since dropped the feature. Amazon.com now uses the carousel feature sparingly, if at all, presumably, because the results underperformed the Spartan group design, which is optimized for quick scanning. Although a typical More Like This page does not warrant the use of carousels, in some circumstances they can be appropriate.

—Brynn Evans References Enterprise Social Search slides: www.slideshare.net/bmevans/designing-for-sociality-in-enterprise-search Wired article (Wired, November 2010): www.wired.com/magazine/2010/11/st_flowchart_social/ “Do your friends make you smarter” paper: http://brynnevans.com/papers/Do-your-friends-make-you-smarter.pdf Personalized Search and Recommender Systems Machine learning lets search engines draw reliable inferences and deliver improved search results by leveraging customers’ data. In the ecommerce realm, personalized search lets an online vendor use a customer’s past purchasing history—and possibly other data like product ratings, search history, the customer’s user profile, and even social networking activity—to interpret search strings, predict what products might be of interest to that customer, and deliver more relevant search results. On ecommerce sites, recommender systems—which are sometimes called implicit collaborative filtering systems, a bit of a misnomer—often use the past purchasing history of other customers who are similar in some way to a particular customer to predict what products might be of interest to that customer.

… Given a similar-items table, the algorithm finds items similar to each of the user’s purchases and ratings, aggregates those items, and then recommends the most popular or correlated items.” [1] Amazon employs its recommender system to great effect—delivering product recommendations that encourage customers to browse additional products and, thus, helping users to find similar products of interest. Recommendations are particularly effective on product pages, where Amazon uses them in cross-selling additional products to customers. Amazon also personalizes the content on its home page extensively by providing many different types of recommendations. The recommender system Amazon has innovated helps customers find what they need and, because its recommendations actually provide a valuable service to customers, increases customer loyalty—and ultimately enhances Amazon’s bottom line.
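The algorithm quoted at the start of this passage (build a similar-items table offline, then aggregate over a customer's purchases) is compact enough to sketch end to end. The data here is invented and the similarity measure is plain cosine over "who bought what" sets; Amazon's production details are of course not public.

from itertools import combinations
from math import sqrt
from collections import defaultdict

# Invented purchase sets: user -> items bought.
baskets = {
    "u1": {"repair manual", "allen keys", "tyre levers"},
    "u2": {"repair manual", "allen keys"},
    "u3": {"repair manual", "bell"},
}

# Offline step: build the similar-items table with cosine similarity
# between items' sets of buyers.
buyers = defaultdict(set)
for user, items in baskets.items():
    for item in items:
        buyers[item].add(user)

def cosine(i, j):
    return len(buyers[i] & buyers[j]) / sqrt(len(buyers[i]) * len(buyers[j]))

similar = defaultdict(list)
for i, j in combinations(buyers, 2):
    s = cosine(i, j)
    if s > 0:
        similar[i].append((j, s))
        similar[j].append((i, s))

# Online step, as in the quoted description: aggregate the items similar
# to each of the user's purchases and recommend the most correlated ones.
def recommend(user, top_n=3):
    scores = defaultdict(float)
    for item in baskets[user]:
        for other, s in similar[item]:
            if other not in baskets[user]:
                scores[other] += s
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("u2"))  # ['tyre levers', 'bell']

The offline table is what keeps the online step cheap: at request time only the customer's own purchases need to be looked up.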


pages: 523 words: 61,179

Human + Machine: Reimagining Work in the Age of AI by Paul R. Daugherty, H. James Wilson

3D printing, AI winter, algorithmic trading, Amazon Mechanical Turk, augmented reality, autonomous vehicles, blockchain, business process, call centre, carbon footprint, cloud computing, computer vision, correlation does not imply causation, crowdsourcing, digital twin, disintermediation, Douglas Hofstadter, en.wikipedia.org, Erik Brynjolfsson, friendly AI, future of work, industrial robot, Internet of things, inventory management, iterative process, Jeff Bezos, job automation, job satisfaction, knowledge worker, Lyft, natural language processing, personalized medicine, precision agriculture, Ray Kurzweil, recommendation engine, RFID, ride hailing / ride sharing, risk tolerance, Rodney Brooks, Second Machine Age, self-driving car, sensor fusion, sentiment analysis, Shoshana Zuboff, Silicon Valley, software as a service, speech recognition, telepresence, telepresence robot, text mining, the scientific method, uber lyft

Whereas, in the past, a salesperson might glean a sales opportunity based on physical or social cues over the phone or in person, 6sense is returning to salespeople some of the skills that more socially opaque online interactions, like the extensive use of email, had blunted.7 Your Buddy, the Brand Some of the biggest changes to the front office are happening through online tools and AI-enabled interfaces. Think how easily Amazon customers can purchase a vast array of consumer items, thanks to AI-enhanced product-recommendation engines and “Alexa” (the personal assistant bot), which is used via “Echo” (the smart, voice-enabled wireless speaker). AI systems similar to those designed for jobs like customer service are now beginning to play a much larger role in generating revenue, traditionally a front-office objective, and the ease of the purchasing experience has become a major factor for customers. In one study, 98 percent of online consumers said they would be likely or very likely to make another purchase if they had a good experience.8 When AI performs the job of customer interaction, the software can become a primary way for a company to distinguish itself from competitors.

Back-of-book index entries from this title, including: Amazon — Alexa; Echo; fulfillment; Go; Mechanical Turk; recommendation engine.

Identifies people, gestures, or trends in biometric measures (stress, activity, etc.) for purposes of natural human-machine interaction, or identification and verification.

Intelligent automation. Transfers some tasks from man to machine to fundamentally change the traditional ways of operating. Through machine-specific strengths and capabilities (speed, scale, and the ability to cut through complexity), these tools complement human work to expand what is possible.

Recommendation systems. Make suggestions based on subtle patterns detected by AI algorithms over time. These can be targeted toward consumers to suggest new products or used internally to make strategic suggestions.

Intelligent products. Have intelligence baked into their design so that they can evolve to continuously meet and anticipate customers’ needs and preferences.

Personalization. Analyzes trends and patterns for customers and employees to optimize tools and products for individual users or customers.


pages: 371 words: 108,317

The Inevitable: Understanding the 12 Technological Forces That Will Shape Our Future by Kevin Kelly

A Declaration of the Independence of Cyberspace, AI winter, Airbnb, Albert Einstein, Amazon Web Services, augmented reality, bank run, barriers to entry, Baxter: Rethink Robotics, bitcoin, blockchain, book scanning, Brewster Kahle, Burning Man, cloud computing, commoditize, computer age, connected car, crowdsourcing, dark matter, dematerialisation, Downton Abbey, Edward Snowden, Elon Musk, Filter Bubble, Freestyle chess, game design, Google Glasses, hive mind, Howard Rheingold, index card, indoor plumbing, industrial robot, Internet Archive, Internet of things, invention of movable type, invisible hand, Jaron Lanier, Jeff Bezos, job automation, John Markoff, Kevin Kelly, Kickstarter, lifelogging, linked data, Lyft, M-Pesa, Marc Andreessen, Marshall McLuhan, means of production, megacity, Minecraft, Mitch Kapor, multi-sided market, natural language processing, Netflix Prize, Network effects, new economy, Nicholas Carr, old-boy network, peer-to-peer, peer-to-peer lending, personalized medicine, placebo effect, planetary scale, postindustrial economy, recommendation engine, RFID, ride hailing / ride sharing, Rodney Brooks, self-driving car, sharing economy, Silicon Valley, slashdot, Snapchat, social graph, social web, software is eating the world, speech recognition, Stephen Hawking, Steven Levy, Ted Nelson, the scientific method, transport as a service, two-sided market, Uber for X, uber lyft, Watson beat the top human players on Jeopardy!, Whole Earth Review, zero-sum game

And I’ll make it personal. How would I like to choose what I give my attention to next? First I’d like to be delivered more of what I know I like. This personal filter already exists. It’s called a recommendation engine. It is in wide use at Amazon, Netflix, Twitter, LinkedIn, Spotify, Beats, and Pandora, among other aggregators. Twitter uses a recommendation system to suggest who I should follow based on whom I already follow. Pandora uses a similar system to recommend what new music I’ll like based on what I already like. Over half of the connections made on LinkedIn arise from their follower recommender. Amazon’s recommendation engine is responsible for the well-known banner that “others who like this item also liked this next item.” Netflix uses the same to recommend movies for me. Clever algorithms churn through a massive history of everyone’s behavior in order to closely predict my own behavior.

Amazon’s greatest asset is not its Prime delivery service but the millions of reader reviews it has accumulated over decades. Readers will pay for Amazon’s all-you-can-read ebook service, Kindle Unlimited, even though they will be able to find ebooks for free elsewhere, because Amazon’s reviews will guide them to books they want to read. Ditto for Netflix. Movie fans will pay Netflix because their recommendation engine finds gems they would not otherwise discover. They may be free somewhere else, but they are essentially lost and buried. In these examples, you are not paying for the copies, you are paying for the findability. • • • These eight qualities require a new skill set for creators. Success no longer derives from mastering distribution. Distribution is nearly automatic; it’s all streams. The Great Copy Machine in the Sky takes care of that.

Back-of-book index entries from this title, including: Amazon — accessibility vs. ownership; artificial intelligence; filtering systems; recommendation engines; user reviews.


pages: 302 words: 73,581

Platform Scale: How an Emerging Business Model Helps Startups Build Large Empires With Minimum Investment by Sangeet Paul Choudary

3D printing, Airbnb, Amazon Web Services, barriers to entry, bitcoin, blockchain, business process, Chuck Templeton: OpenTable:, Clayton Christensen, collaborative economy, commoditize, crowdsourcing, cryptocurrency, data acquisition, frictionless, game design, hive mind, Internet of things, invisible hand, Kickstarter, Lean Startup, Lyft, M-Pesa, Marc Andreessen, Mark Zuckerberg, means of production, multi-sided market, Network effects, new economy, Paul Graham, recommendation engine, ride hailing / ride sharing, shareholder value, sharing economy, Silicon Valley, Skype, Snapchat, social graph, social software, software as a service, software is eating the world, Spread Networks laid a new fibre optics cable between New York and Chicago, TaskRabbit, the payments system, too big to fail, transport as a service, two-sided market, Uber and Lyft, Uber for X, uber lyft, Wave and Pay

Context may be static or dynamic. Many Web 1.0 era filters were created based on long sign-up forms that the user filled out. Today, filters are created based on data captured on an ongoing basis through a user’s actions. Filters may be standalone or collaborative. Amazon’s “People who purchased this product also purchased this product” feature is based on a collaborative filter. Many recommendation platforms allow users to filter results based on a “people like you” parameter. This, again, is a collaborative filter. The most important innovation in recent times that has led to the spread of collaborative filters is the implementation of Facebook’s social graph. Through the social graph, third-party platforms like TripAdvisor serve reviews based on a collaborative filter of people who are close to you on the graph.
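A toy sketch of such a graph-scoped collaborative filter (invented data, and not TripAdvisor's implementation): instead of averaging every review, average only those left by people within a couple of hops of you on the social graph.

from collections import deque

# Invented social graph and review data.
friends = {
    "you":   {"asha", "ben"},
    "asha":  {"you", "chloe"},
    "ben":   {"you"},
    "chloe": {"asha"},
}
reviews = {  # hotel -> {reviewer: stars}
    "Hotel Sol":  {"asha": 5, "stranger1": 2},
    "Hotel Luna": {"chloe": 4, "stranger2": 5},
}

def within_hops(user, max_hops=2):
    """Everyone reachable from `user` in at most max_hops friendship links."""
    seen, frontier = {user}, deque([(user, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in friends.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {user}

def graph_filtered_rating(user, hotel):
    """Average stars, counting only reviewers close to you on the graph."""
    circle = within_hops(user)
    stars = [s for reviewer, s in reviews[hotel].items() if reviewer in circle]
    return sum(stars) / len(stars) if stars else None

print(graph_filtered_rating("you", "Hotel Sol"))   # 5.0 (only asha counts)
print(graph_filtered_rating("you", "Hotel Luna"))  # 4.0 (only chloe counts)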


pages: 202 words: 62,901

The People's Republic of Walmart: How the World's Biggest Corporations Are Laying the Foundation for Socialism by Leigh Phillips, Michal Rozworski

Berlin Wall, Bernie Sanders, call centre, carbon footprint, central bank independence, Colonization of Mars, combinatorial explosion, complexity theory, computer age, corporate raider, decarbonisation, discovery of penicillin, Elon Musk, G4S, Georg Cantor, germ theory of disease, Gordon Gekko, greed is good, hiring and firing, index fund, Intergovernmental Panel on Climate Change (IPCC), Internet of things, inventory management, invisible hand, Jeff Bezos, Joseph Schumpeter, linear programming, liquidity trap, mass immigration, Mont Pelerin Society, new economy, Norbert Wiener, oil shock, passive investing, Paul Samuelson, post scarcity, profit maximization, profit motive, purchasing power parity, recommendation engine, Ronald Coase, Ronald Reagan, sharing economy, Silicon Valley, Skype, sovereign wealth fund, strikebreaker, supply-chain management, technoutopianism, The Nature of the Firm, The Wealth of Nations by Adam Smith, theory of mind, transaction costs, Turing machine, union organizing

Behind the scenes, however, Amazon appears as a chaotic jumble of the most varied items zipping between warehouses, suppliers and end destinations. In truth, Amazon specializes in highly managed chaos. Two of the best examples of this are the “chaotic storage” system Amazon uses in its warehouses and the recommendations system buzzing in the background of its website, telling you which books or garden implements you might be interested in. Amazon’s recommendations system is the backbone of the company’s rapid success. This system drives those usually helpful (although sometimes comical—“Frequently bought together: baseball bat + black balaclava”) items that pop up in the “Customers who bought this also bought …” section of the website. Recommendations systems solve some of the information problems that have historically been associated with planning. This is a crucial innovation for dreamers of planned economies that also manage to satisfy consumer wants, historically the bane of Stalinist systems.

The chaos of individual tastes and opinions is condensed into something useable. A universe of the most disparate ratings and reviews—always partial and often contradictory—can, if parsed right, provide very useful and lucrative information. Amazon also uses a system it calls “item-to-item collaborative filtering.” The company made a breakthrough when it devised its recommendations algorithm by managing to avoid common pitfalls plaguing other early recommendation engines. Amazon’s system doesn’t look for similarities between people; not only do such systems slow down significantly once millions are profiled, but they report significant overlaps among people whose tastes are actually very different (e.g., hipsters and boomers who buy the same bestsellers). Nor does Amazon group people into “segments”—something that often ends up oversimplifying recommendations by ignoring the complexity of individual tastes.

For example, a bicycle repair manual may consistently be bought alongside a particular bike-friendly set of Allen keys, even though the set isn’t marketed as such. The two things may not be very obviously related, but it is enough that some people buy or browse them together. Combining millions of such interactions between people and things, Amazon’s algorithm creates a virtual map of its catalog that adapts very well to new information, even saving precious computing power when compared to the alternatives—clunkier recommendations systems that try to match similar users or find abstract similarities. Here is how the researchers at IBM’s labs describe Amazon’s recommendations: “When it takes other users’ behavior into account, collaborative filtering uses group knowledge to form a recommendation based on like users.” Filtering is an example of an IT-based rejoinder to one of the criticisms Hayek leveled against his socialist adversaries in the 1930s calculation debate: that only markets can aggregate and put to use the information dispersed throughout society.


pages: 267 words: 72,552

Reinventing Capitalism in the Age of Big Data by Viktor Mayer-Schönberger, Thomas Ramge

accounting loophole / creative accounting, Air France Flight 447, Airbnb, Alvin Roth, Atul Gawande, augmented reality, banking crisis, basic income, Bayesian statistics, bitcoin, blockchain, Capital in the Twenty-First Century by Thomas Piketty, carbon footprint, Cass Sunstein, centralized clearinghouse, Checklist Manifesto, cloud computing, cognitive bias, conceptual framework, creative destruction, Daniel Kahneman / Amos Tversky, disruptive innovation, Donald Trump, double entry bookkeeping, Elon Musk, en.wikipedia.org, Erik Brynjolfsson, Ford paid five dollars a day, Frederick Winslow Taylor, fundamental attribution error, George Akerlof, gig economy, Google Glasses, information asymmetry, interchangeable parts, invention of the telegraph, inventory management, invisible hand, James Watt: steam engine, Jeff Bezos, job automation, job satisfaction, joint-stock company, Joseph Schumpeter, Kickstarter, knowledge worker, labor-force participation, land reform, lone genius, low cost airline, low cost carrier, Marc Andreessen, market bubble, market design, market fundamentalism, means of production, meta analysis, meta-analysis, Moneyball by Michael Lewis explains big data, multi-sided market, natural language processing, Network effects, Norbert Wiener, offshore financial centre, Parag Khanna, payday loans, peer-to-peer lending, Peter Thiel, Ponzi scheme, prediction markets, price anchoring, price mechanism, purchasing power parity, random walk, recommendation engine, Richard Thaler, ride hailing / ride sharing, Sam Altman, Second Machine Age, self-driving car, Silicon Valley, Silicon Valley startup, six sigma, smart grid, smart meter, Snapchat, statistical model, Steve Jobs, technoutopianism, The Future of Employment, The Market for Lemons, The Nature of the Firm, transaction costs, universal basic income, William Langewiesche, Y Combinator

With data-richness, market participants may learn the preferences of others and pair them using matching algorithms, but how do market participants express their preferences and their relative weight and communicate them to each other? It’s a difficult challenge, and solving it is crucial. Nobody wants to transact on markets that require hours of time spent answering questionnaires. Fortunately, here, too, recent technical advances have gotten us much closer to viable solutions. Consider again Amazon’s product-recommendation engine: at first glance, it’s a matching system. It quite successfully matches our preferences with available products and makes recommendations about what we should order. But that is only half of the story. Amazon captures our preferences not from us directly but from the comprehensive data stream it gathers about our every interaction with its website—what products we look at, when and for how long we look at them, which reviews we read.
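Turning that interaction stream into a preference signal is usually a matter of weighting events by how strong a signal they are. A toy sketch with invented events and weights; the weighting Amazon actually uses is not public:

from collections import defaultdict

# Invented clickstream events: (product, event type, seconds of dwell time).
events = [
    ("espresso machine", "view", 4),
    ("espresso machine", "view", 95),
    ("espresso machine", "read_review", 0),
    ("usb cable", "view", 3),
]

# Assumed weights - tuning these is the real work in practice.
EVENT_WEIGHT = {"view": 1.0, "read_review": 3.0, "add_to_cart": 5.0}
LONG_DWELL_BONUS = 2.0   # extra credit for spending real time on a page

def implicit_preferences(stream):
    scores = defaultdict(float)
    for product, event, dwell in stream:
        scores[product] += EVENT_WEIGHT.get(event, 0.0)
        if dwell >= 60:
            scores[product] += LONG_DWELL_BONUS
    return dict(scores)

print(implicit_preferences(events))
# {'espresso machine': 7.0, 'usb cable': 1.0}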

To put an end to such inefficiencies, firms such as American Express, AT&T, and IBM have phased in software platforms that go far beyond classified-ad-type announcements of open positions on the company’s intranet. They match detailed (albeit standardized) job descriptions with detailed (albeit standardized) talent profiles. Filters make individuals and position pools easy to search, both for employees seeking a new challenge and for managers looking for new talent. And recommendation engines facilitate matchmaking across multiple dimensions. These internal talent marketplaces offer a number of advantages. First, they decentralize matching, reducing information overload within HR departments. Searching and matching is done outside HR, by managers with positions to fill and employees interested in making a move. Thanks to multidimensional information streams and talent-matching software, the costs of the search are kept comparatively low.

Companies can realize two or even all three of them at the same time. Consider Amazon: because of its sheer scale, it can fulfill customer orders at low cost. Network effects make Amazon a thick market, with lots of buyers and sellers, and many customers who leave valuable product reviews for others. Each additional customer adds value to the community. Finally, Amazon uses adaptive systems and feedback data to hone its recommendation engine, as well as its intelligent personal assistant, Alexa. Apple’s iPhone is another case in point. Because it can mass produce the phone, Apple can keep profit margins high while still holding to a price point that’s acceptable to consumers. A growing number of iPhone users have led to a vibrant app market. And Siri (among other services) continually improves, thanks to a huge and increasing volume of feedback data.


pages: 296 words: 78,631

Hello World: Being Human in the Age of Algorithms by Hannah Fry

23andMe, 3D printing, Air France Flight 447, Airbnb, airport security, augmented reality, autonomous vehicles, Brixton riot, chief data officer, computer vision, crowdsourcing, DARPA: Urban Challenge, Douglas Hofstadter, Elon Musk, Firefox, Google Chrome, Gödel, Escher, Bach, Ignaz Semmelweis: hand washing, John Markoff, Mark Zuckerberg, meta analysis, meta-analysis, pattern recognition, Peter Thiel, RAND corporation, ransomware, recommendation engine, ride hailing / ride sharing, selection bias, self-driving car, Shai Danziger, Silicon Valley, Silicon Valley startup, Snapchat, speech recognition, Stanislav Petrov, statistical model, Stephen Hawking, Steven Levy, Tesla Model S, The Wisdom of Crowds, Thomas Bayes, Watson beat the top human players on Jeopardy!, web of trust, William Langewiesche

There are algorithms that can automatically classify and remove inappropriate content on YouTube, algorithms that will label your holiday photos for you, and algorithms that can scan your handwriting and classify each mark on the page as a letter of the alphabet.

Association: finding links
Association is all about finding and marking relationships between things. Dating algorithms such as OKCupid have association at their core, looking for connections between members and suggesting matches based on the findings. Amazon’s recommendation engine uses a similar idea, connecting your interests to those of past customers. It’s what led to the intriguing shopping suggestion that confronted Reddit user Kerbobotat after buying a baseball bat on Amazon: ‘Perhaps you’ll be interested in this balaclava?’11

Filtering: isolating what’s important
Algorithms often need to remove some information to focus on what’s important, to separate the signal from the noise.
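The bat-and-balaclava association above comes out of nothing more exotic than co-occurrence counting. A toy sketch with invented baskets: the confidence of the rule "baseball bat → balaclava" is just the share of bat buyers who also bought a balaclava.

# Invented shopping baskets.
baskets = [
    {"baseball bat", "balaclava"},
    {"baseball bat", "glove"},
    {"baseball bat", "balaclava", "duffel bag"},
    {"novel"},
]

def confidence(antecedent, consequent):
    """P(consequent in basket | antecedent in basket), estimated from the data."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

print(confidence("baseball bat", "balaclava"))  # 0.666...: 2 of the 3 bat buyers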

Whatever statistical techniques, or artificial intelligence tricks, or machine-learning algorithms you deploy, trying to use numbers to latch on to the essence of artistic excellence is like clutching at smoke with your hands. But an algorithm needs something to go on. So, once you take away popularity and inherent quality, you’re left with the only thing that can be quantified: a metric for similarity to whatever has gone before. There’s still a great deal that can be done using measures of similarity. When it comes to building a recommendation engine, like the ones found in Netflix and Spotify, similarity is arguably the ideal measure. Both companies have a way to help users discover new films and songs, and, as subscription services, both have an incentive to accurately predict what users will enjoy. They can’t base their algorithms on what’s popular, or users would just get bombarded with suggestions for Justin Bieber and Peppa Pig The Movie.

The recommendation algorithms merely offer you songs and films that are good enough to insure you against disappointment. They’re giving you an inoffensive way of passing the time. Every now and then they will come up with something that you absolutely love, but it’s a bit like cold reading in that sense. You only need a strike every now and then to feel the serendipity of discovering new music. The engines don’t need to be right all the time. Similarity works perfectly well for recommendation engines. But when you ask algorithms to create art without a pure measure for quality, that’s where things start to get interesting. Can an algorithm be creative if its only sense of art is what happened in the past? Good artists borrow; great artists steal – Pablo Picasso In October 1997, an audience arrived at the University of Oregon to be treated to a rather unusual concert. A lone piano sat on the stage at the front.


pages: 368 words: 96,825

Bold: How to Go Big, Create Wealth and Impact the World by Peter H. Diamandis, Steven Kotler

3D printing, additive manufacturing, Airbnb, Amazon Mechanical Turk, Amazon Web Services, augmented reality, autonomous vehicles, Charles Lindbergh, cloud computing, creative destruction, crowdsourcing, Daniel Kahneman / Amos Tversky, dematerialisation, deskilling, disruptive innovation, Elon Musk, en.wikipedia.org, Exxon Valdez, fear of failure, Firefox, Galaxy Zoo, Google Glasses, Google Hangouts, gravity well, ImageNet competition, industrial robot, Internet of things, Jeff Bezos, John Harrison: Longitude, John Markoff, Jono Bacon, Just-in-time delivery, Kickstarter, Kodak vs Instagram, Law of Accelerating Returns, Lean Startup, life extension, loss aversion, Louis Pasteur, low earth orbit, Mahatma Gandhi, Marc Andreessen, Mark Zuckerberg, Mars Rover, meta analysis, meta-analysis, microbiome, minimum viable product, move fast and break things, Narrative Science, Netflix Prize, Network effects, Oculus Rift, optical character recognition, packet switching, PageRank, pattern recognition, performance metric, Peter H. Diamandis: Planetary Resources, Peter Thiel, pre–internet, Ray Kurzweil, recommendation engine, Richard Feynman, ride hailing / ride sharing, risk tolerance, rolodex, self-driving car, sentiment analysis, shareholder value, Silicon Valley, Silicon Valley startup, skunkworks, Skype, smart grid, stem cell, Stephen Hawking, Steve Jobs, Steven Levy, Stewart Brand, superconnector, technoutopianism, telepresence, telepresence robot, Turing test, urban renewal, web application, X Prize, Y Combinator, zero-sum game

Thus, if you could create an incentive prize that harnessed this competitive love of coding and this argumentative love of movies and tied them together—meaning design a prize around the intrinsic motivations at the core of coder culture—what might be possible? Well, in the case of Netflix, a better movie recommendation engine. A movie recommendation engine is a bit of software that tells you what movie you might want to watch next based on movies you’ve already watched and rated (on a scale of one to five stars). Netflix’s original recommendation engine, Cinematch, was created back in 2000 and quickly proved to be a wild success. Within a few years, nearly two-thirds of their rental business was being driven by their recommendation engine. Thus the obvious corollary: the better their recommendation engine, the better their business. And that was the problem. By the middle 2000s, Netflix engineers had plucked all the low-hanging fruit and the rate of Cinematch optimization had slowed to a crawl.
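The improvement the prize chased was measured on held-out one-to-five-star ratings, scored by root-mean-squared error (the Netflix Prize target was a 10 percent reduction relative to Cinematch). A minimal sketch of that yardstick, with invented numbers:

from math import sqrt

# Invented held-out ratings and a model's predictions for the same (user, movie) pairs.
actual    = [5, 3, 4, 1, 2]
predicted = [4.5, 3.2, 3.8, 1.5, 2.9]

def rmse(truth, preds):
    """Root-mean-squared error over held-out ratings - lower is better."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth))

print(round(rmse(actual, predicted), 3))  # 0.527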

The prize hunters, even the leaders, are startlingly open about the methods they’re using, acting more like academics huddled over a knotty problem than entrepreneurs jostling for a $1 million payday. In December 2006, a competitor called ‘simonfunk’ posted a complete description of his algorithm—which at the time was tied for third place—giving everyone else the opportunity to piggyback on his progress. ‘We had no idea the extent to which people would collaborate with each other,’ says Jim Bennett, vice president for recommendation systems at Netflix.”16 And this isn’t an aberration. Over the course of the eight XPRIZEs launched to date, there has been an extraordinary amount of cooperation. We’ve seen teams providing unsolicited advice, teams merging, teams acquiring and sharing technology and experts. When the prize is driven by an MTP, while a team’s primary purpose is to win, a close second is their desire to see the primary objective achieved; thus teams exhibit a much higher willingness to share.


pages: 353 words: 104,146

European Founders at Work by Pedro Gairifo Santos

business intelligence, cloud computing, crowdsourcing, fear of failure, full text search, information retrieval, inventory management, iterative process, Jeff Bezos, Joi Ito, Lean Startup, Mark Zuckerberg, natural language processing, pattern recognition, pre–internet, recommendation engine, Richard Stallman, Silicon Valley, Skype, slashdot, Steve Jobs, Steve Wozniak, subscription business, technology bubble, web application, Y Combinator

All this time, we were mainly concerned with keeping the site afloat, keeping it fast, scaling up properly, and this sort of scrobbling data and radio. The recommendation engine wasn't brilliant to begin with. And then, we finally decided we needed to hire somebody who knows what they're doing, who's going to work on this full-time. We e-mailed some mailing lists. We e-mailed the ISMIR (International Society for Music Information Retrieval) mailing list. They're a group who meet every year about music recommendations and information retrieval in music. We ended up hiring a guy called Norman, who was both a great scientist and understood all the algorithms and captive audience sort of things, but also an excellent programmer who was able to implement all these ideas. So we got really lucky. The first person we hired was great and he just took over. He chucked out all of our crappy recommendation systems we had and built something good, and then improved it constantly for the next several years. So we had some A/B testing, split testing systems in there for the radio so they could try out new tweaks to the algorithms and see what was performing better.
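Split tests of that kind usually need little more than a stable assignment of listeners to variants plus a per-variant metric. A generic sketch (not Last.fm's actual system; the experiment name and skip-rate metric are invented):

import hashlib

def variant(user_id, experiment, buckets=("control", "tweak")):
    """Deterministically assign a user to a variant, so they always
    hear the same version of the radio algorithm."""
    digest = hashlib.sha1(f"{experiment}:{user_id}".encode()).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]

# Invented outcome log: (user, did they skip the recommended track?)
plays = [("u1", False), ("u2", True), ("u3", False), ("u4", False)]

skip_rate = {b: [] for b in ("control", "tweak")}
for user, skipped in plays:
    skip_rate[variant(user, "radio-similarity-v2")].append(skipped)

for bucket, outcomes in skip_rate.items():
    if outcomes:
        print(bucket, sum(outcomes) / len(outcomes))  # lower skip rate wins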

They weren't even interested in recommendations at that point. I didn't really have a good recommender system for a long time. From your listening stats, you could click on an artist, and see who else had been listening to them. You could then see the listening stats of the other fans of artists you like. Just that system of connecting all the listening tastes proved to be really quite addictive. It spread by word of mouth. And then toward the end of my degree, I started working on some collaborative filtering recommendation stuff. Obviously that all tapped into some latent interest that people have in stats on their music listening. So I knew that recommendations weren't necessarily the main focus at that point. Not for a couple years after that did we have a really good recommender system. Music recommendation never really was my field, but I had a go at it, and then later on we hired somebody who knew what they were doing.

Santos: Did you ever have any court problems with any of the copyright holders?

Jones: Nothing substantial, really. I think sometimes rights holders, especially in the music industry, will use court action or the threat of court action as a sort of negotiating position. But, no. I think we managed to avoid anything serious in that regard.

Santos: From the technical point of view, the actual recommendation engine and statistics, how does that actually work? How hard was it to develop and tweak it? Did you change the approach many times? Did you have a clear idea on how to do it from the start?

Jones: So initially when I was building it, we tried all sorts of stuff. I think what I was doing for a long time in the beginning was just using Lucene, a document indexing system. We just created fake documents of people's profiles.
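As a rough illustration of the "fake documents" idea Jones describes, the sketch below uses scikit-learn's TF-IDF vectorizer in place of Lucene: each user's listening profile is treated as a bag-of-artists document, and similar listeners are found by cosine similarity. The profiles and artist names are invented, and Lucene's actual indexing and scoring machinery is not reproduced here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical listening profiles: one "document" per user, artists as tokens.
profiles = {
    "alice": "radiohead radiohead portishead massive_attack",
    "bob":   "radiohead muse portishead",
    "carol": "metallica slayer megadeth",
}

users = list(profiles)
vectorizer = TfidfVectorizer(token_pattern=r"\S+")  # whitespace-separated artist tags
matrix = vectorizer.fit_transform(profiles[u] for u in users)

def similar_listeners(user, top_n=2):
    # Rank other users by cosine similarity of their profile "documents".
    idx = users.index(user)
    scores = cosine_similarity(matrix[idx], matrix).ravel()
    ranked = sorted(zip(users, scores), key=lambda x: -x[1])
    return [(u, round(float(s), 3)) for u, s in ranked if u != user][:top_n]

print(similar_listeners("alice"))  # bob should rank above carol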


pages: 377 words: 97,144

Singularity Rising: Surviving and Thriving in a Smarter, Richer, and More Dangerous World by James D. Miller

23andMe, affirmative action, Albert Einstein, artificial general intelligence, Asperger Syndrome, barriers to entry, brain emulation, cloud computing, cognitive bias, correlation does not imply causation, crowdsourcing, Daniel Kahneman / Amos Tversky, David Brooks, David Ricardo: comparative advantage, Deng Xiaoping, en.wikipedia.org, feminist movement, Flynn Effect, friendly AI, hive mind, impulse control, indoor plumbing, invention of agriculture, Isaac Newton, John von Neumann, knowledge worker, Long Term Capital Management, low skilled workers, Netflix Prize, neurotypical, Norman Macrae, pattern recognition, Peter Thiel, phenotype, placebo effect, prisoner's dilemma, profit maximization, Ray Kurzweil, recommendation engine, reversible computing, Richard Feynman, Rodney Brooks, Silicon Valley, Singularitarianism, Skype, statistical model, Stephen Hawking, Steve Jobs, supervolcano, technological singularity, The Coming Technological Singularity, the scientific method, Thomas Malthus, transaction costs, Turing test, twin studies, Vernor Vinge, Von Neumann architecture

A big part of our brain is devoted to processing visual inputs. Hence, a good recommendation system would necessarily have powerful insights into a significant chunk of our brains.

3. Measurable Incremental Progress—Think of AI as a destination a thousand miles away with the entire pathway hidden by fog. To reach our destination, we need to take many small steps, and for each step we need a way to determine if we have gone in the right direction. A video recommendation system provides this corrective by gathering continuous feedback on how many users liked the recommended videos.

4. Profitable with Every Step—Businesses are more motivated to invest in a type of innovation if they can continually increase revenue with each small improvement. Consequently, an application such as a video recommendation engine in which each improvement increases consumer satisfaction is (all else being equal) more likely to attract large corporate investment than an application that would have value only if it achieved near-human-level intelligence.

5. Amenable to Parallel Processing—Imagine we want to move a heavy object from point A to point B.

Fortunately, with video recommendations, many challenges, such as finding what type of cat video a certain set of users might enjoy, can be worked on independently for reasonably long periods of time.

6. Free Labor from Customers—A recommendation system would rely on millions of people to freely help train the system by picking which videos to watch, rating some of the videos they see, writing reviews of videos, and labeling in words the content they upload.

7. Help from Advertisers and Political Consultants—Salesmen would eagerly seek to learn what types of messages appealed to different factions of the population. The recommendation system could piggyback on these salesmen's attempts to understand their clientele and use their insights to improve recommendation software.

8. AI and Human Recommenders Could Productively Work Together—Unlike what YouTube currently does, an effective AI recommendation system could make use of human evaluators. When my son was four, he enjoyed watching YouTube videos of supernovas and children's cartoons.

For example, if 90 percent of people who had some unusual allele or brain microstructure enjoyed a certain cat video, then the AI recommender would suggest the video to all other viewers who had that trait.

12. Amenable to Crowdsourcing—Netflix, the rent-by-mail and streaming video distributor, offered (and eventually paid) a $1 million prize to whichever group improved its recommendation system the most, so long as at least one group improved the system by at least 10 percent. This “crowdsourcing,” which occurs when a problem is thrown open to anyone, helps a company by allowing them to draw on the talents of strangers, while only paying the strangers if they help the firm. This kind of crowdsourcing works only if, as with a video recommendation system, there is an easy and objective way of measuring progress toward the crowdsourced goal.

13. Potential Improvement All the Way Up to Superhuman Artificial General Intelligence—A recommendation AI could slowly morph into a content creator.


pages: 404 words: 95,163

Amazon: How the World’s Most Relentless Retailer Will Continue to Revolutionize Commerce by Natalie Berg, Miya Knights

3D printing, Airbnb, Amazon Web Services, augmented reality, Bernie Sanders, big-box store, business intelligence, cloud computing, Colonization of Mars, commoditize, computer vision, connected car, Donald Trump, Doomsday Clock, Elon Musk, gig economy, Internet of things, inventory management, invisible hand, Jeff Bezos, market fragmentation, new economy, pattern recognition, Ponzi scheme, pre–internet, QR code, race to the bottom, recommendation engine, remote working, sensor fusion, sharing economy, Skype, supply-chain management, TaskRabbit, trade route, underbanked, urban planning, white picket fence

The value of recommendation Having identified AI as the culmination of the main drivers shaping technology innovation today (stemming from a need for more autonomous computer systems particularly) – and before diving straight into voice technology as its current apotheosis – it is necessary to undertake an examination of how Amazon capitalized on the development of AI systems across its business and not just in its customers’ homes, as we have already done with the drivers of ubiquitous connectivity and pervasive interfaces. This examination adds to our understanding of how it has achieved its aim of removing friction from the average shopping journey and, in so doing, created a virtuous cycle that, in turn, generates even more sales and growth. In fact, it is AI that underpins the power of its search and recommendation engines. Back in the 1990s, Amazon was one of the first e-commerce players to rely heavily on product recommendations, which also helped it to cross-sell new categories as it moved beyond books. It is a category of technology development that Bezos has described as ‘the practical application of machine learning’. Amazon’s search and recommendation machine learning capabilities also underpin its sophisticated supply chain proficiency, as well as its most recent voice shopping assistant functionality.

McKinsey estimates put the proportion of Amazon purchases driven by product recommendations at 35 per cent.3 In 2016, it made its AI framework, DSSTNE (pronounced ‘destiny’), free to help expand the ways deep learning can extend beyond speech and language understanding and object recognition to areas such as search and recommendations. The decision to open source DSSTNE also shows that, when it comes to making gains with the vast potential of AI, Amazon recognizes the need to collaborate. On the Amazon site, these recommendations can be personalized, based on categories and ranges previously searched or browsed, to increase conversion. Equally, Amazon’s recommendation engine can display products similar to those searched for or browsed in the hopes of converting customers to rival brands or products. There are also recommendations based on anything ‘related to the items you’ve viewed’. Or they can depend on items that are ‘frequently bought together’ or by ‘customers who bought this item also bought…’ with the aim of boosting average order value. In these cases, ‘if that, then this’ AI-powered decision engines work in the background to match the items in your basket with other complementary products.
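A deliberately simple sketch of the ‘frequently bought together’ pattern described above: count how often pairs of items appear in the same basket, then suggest the most frequent co-purchases. The baskets and item names are made up, and Amazon's real decision engines are of course far more elaborate.

from collections import Counter
from itertools import combinations

# Hypothetical order baskets.
baskets = [
    {"kindle", "kindle_case"},
    {"kindle", "kindle_case", "screen_protector"},
    {"kindle", "screen_protector"},
    {"coffee", "filter_papers"},
]

# Count how often each unordered pair of items is bought in the same basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def bought_together(item, top_n=3):
    # Items most often purchased in the same basket as `item`.
    related = Counter()
    for (a, b), n in pair_counts.items():
        if item == a:
            related[b] += n
        elif item == b:
            related[a] += n
    return related.most_common(top_n)

print(bought_together("kindle"))  # e.g. kindle_case and screen_protector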

According to Wei Hu, Alibaba Merchant Service Business Unit director of data technology, its engine can draw on data points from other browsing and shopping activity to match new shoppers with relevant items. Return customers to the Group’s Tmall and Taobao platforms are presented with product recommendations based not just on their past transactions, but also on browsing history, product feedback, bookmarks, geographic location and other online activity-related data. During the 2016 ‘Singles’ Day’ shopping festival, Alibaba said it used its AI recommendation engine to generate 6.7 billion personalized shopping pages based on merchants’ target customer data. Alibaba said that this large-scale personalization resulted in a 20 per cent improvement in conversion rate from the 11 November event.4 Recommendations and personalization aside, Amazon’s reliance on AI systems to orchestrate its vast business operations, as well as its customer-facing ones, is diverse.


pages: 304 words: 82,395

Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schonberger, Kenneth Cukier

23andMe, Affordable Care Act / Obamacare, airport security, barriers to entry, Berlin Wall, big data - Walmart - Pop Tarts, Black Swan, book scanning, business intelligence, business process, call centre, cloud computing, computer age, correlation does not imply causation, dark matter, double entry bookkeeping, Eratosthenes, Erik Brynjolfsson, game design, IBM and the Holocaust, index card, informal economy, intangible asset, Internet of things, invention of the printing press, Jeff Bezos, Joi Ito, lifelogging, Louis Pasteur, Mark Zuckerberg, Menlo Park, Moneyball by Michael Lewis explains big data, Nate Silver, natural language processing, Netflix Prize, Network effects, obamacare, optical character recognition, PageRank, paypal mafia, performance metric, Peter Thiel, post-materialism, random walk, recommendation engine, self-driving car, sentiment analysis, Silicon Valley, Silicon Valley startup, smart grid, smart meter, social graph, speech recognition, Steve Jobs, Steven Levy, the scientific method, The Signal and the Noise by Nate Silver, The Wealth of Nations by Adam Smith, Thomas Davenport, Turing test, Watson beat the top human players on Jeopardy!

In fact, the company approached its business model in that order, which is the inverse of the norm. It initially only had the idea for its celebrated recommendation system. Its stock market prospectus in 1997 described “collaborative filtering” before Amazon knew how it would work in practice or had enough data to make it useful. Both Google and Amazon span the categories, but their strategies differ. When Google first sets out to collect any sort of data, it has secondary uses in mind. Its Street View cars, as we have seen, collected GPS information not just for its map service but also to train self-driving cars. By contrast, Amazon is more focused on the primary use of data and only taps the secondary uses as a marginal bonus. Its recommendation system, for example, relies on clickstream data as a signal, but the company hasn’t used the information to do extraordinary things like predict the state of the economy or flu outbreaks.

Companies that have failed to appreciate the importance of data’s reuse have learned their lesson the hard way. For example, in Amazon’s early days it signed a deal with AOL to run the technology behind AOL’s e-commerce site. To most people, it looked like an ordinary outsourcing deal. But what really interested Amazon, explains Andreas Weigend, Amazon’s former chief scientist, was getting hold of data on what AOL users were looking at and buying, which would improve the performance of its recommendation engine. Poor AOL never realized this. It only saw the data’s value in terms of its primary purpose—sales. Clever Amazon knew it could reap benefits by putting the data to a secondary use. Or take the case of Google’s entry into speech recognition with GOOG-411 for local search listings, which ran from 2007 to 2010. The search giant didn’t have its own speech-recognition technology so needed to license it.

Buy a book on Poland and you’d be bombarded with Eastern European fare. Purchase one about babies and you’d be inundated with more of the same. “They tended to offer you tiny variations on your previous purchase, ad infinitum,” recalled James Marcus, an Amazon book reviewer from 1996 to 2001, in his memoir, Amazonia. “It felt as if you had gone shopping with the village idiot.” Greg Linden saw a solution. He realized that the recommendation system didn’t actually need to compare people with other people, a task that was technically cumbersome. All it needed to do was find associations among products themselves. In 1998 Linden and his colleagues applied for a patent on “item-to-item” collaborative filtering, as the technique is known. The shift in approach made a big difference. Because the calculations could be done ahead of time, the recommendations were lightning fast.
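The item-to-item idea can be sketched in a few lines: compute item-to-item similarities from the purchase matrix ahead of time, then answer a recommendation request with a cheap table lookup. This is only an illustration of the general technique, with an invented purchase matrix, and not Linden's patented implementation.

import numpy as np

items = ["poland_history", "warsaw_guide", "baby_names", "sleep_training"]
# Rows = users, columns = items; 1 means the user bought the item.
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
])

# Offline step: item-to-item cosine similarity, computed ahead of time.
norms = np.linalg.norm(purchases, axis=0)
norms[norms == 0] = 1.0
sim = (purchases.T @ purchases) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)  # an item should not recommend itself

def recommend_for(item, top_n=2):
    # Online step: a fast lookup into the precomputed similarity table.
    idx = items.index(item)
    ranked = np.argsort(-sim[idx])[:top_n]
    return [(items[i], round(float(sim[idx, i]), 2)) for i in ranked]

print(recommend_for("poland_history"))

The offline/online split is the point of the approach: the expensive pairwise comparisons happen ahead of time, so serving a recommendation is just a table lookup.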


The Ethical Algorithm: The Science of Socially Aware Algorithm Design by Michael Kearns, Aaron Roth

23andMe, affirmative action, algorithmic trading, Alvin Roth, Bayesian statistics, bitcoin, cloud computing, computer vision, crowdsourcing, Edward Snowden, Elon Musk, Filter Bubble, general-purpose programming language, Google Chrome, ImageNet competition, Lyft, medical residency, Nash equilibrium, Netflix Prize, p-value, Pareto efficiency, performance metric, personalized medicine, pre–internet, profit motive, quantitative trading / quantitative finance, RAND corporation, recommendation engine, replication crisis, ride hailing / ride sharing, Robert Bork, Ronald Coase, self-driving car, short selling, sorting algorithm, speech recognition, statistical model, Stephen Hawking, superintelligent machines, telemarketer, Turing machine, two-sided market, Vilfredo Pareto

The goal of a collaborative filtering engine is to predict how a given user will rate a movie she hasn’t seen yet. The engine can then recommend to a user the movies that it predicts she will rate the highest. Netflix had a basic recommendation system based on collaborative filtering, but the company wanted a better one. The Netflix Prize competition offered $1 million for improving the accuracy of Netflix’s existing system by 10 percent. A 10 percent improvement is hard, so Netflix expected a multiyear competition. An improvement of 1 percent over the previous year’s state of the art qualified a competitor for an annual $50,000 progress prize, which would go to the best recommendation system submitted that year. Of course, to build a recommendation system, you need data, so Netflix publicly released a lot of it—a dataset consisting of more than a hundred million movie rating records, corresponding to the ratings that roughly half a million users gave to a total of nearly eighteen thousand movies.
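Submissions were scored by prediction error on held-out ratings, so "improving the system by 10 percent" is a purely arithmetic check. The sketch below shows that calculation with invented ratings, using root-mean-squared error as the error metric; the numbers are placeholders, not actual Cinematch figures.

import math

def rmse(predicted, actual):
    # Root-mean-squared error between predicted and true ratings.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical held-out ratings and two sets of predictions.
actual    = [4, 3, 5, 2, 4]
baseline  = [3.6, 3.4, 4.1, 2.9, 3.5]   # stand-in for the existing system
candidate = [3.9, 3.1, 4.7, 2.3, 3.8]   # stand-in for a submitted model

base, cand = rmse(baseline, actual), rmse(candidate, actual)
improvement = (base - cand) / base * 100
print(f"baseline {base:.3f}, candidate {cand:.3f}, improvement {improvement:.1f}%")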

But now that we know this, can the problem of privacy be solved by simply concealing information about birthdate, sex, and zip code in future data releases? It turns out that lots of less obvious things can also identify you—like the movies you watch. In 2006, Netflix launched the Netflix Prize competition, a public data science competition to find the best “collaborative filtering” algorithm to power Netflix’s movie recommendation engine. A key feature of Netflix’s service is its ability to recommend to users movies that they might like, given how they have rated past movies. (This was especially important when Netflix was primarily a mail-order DVD rental service, rather than a streaming service—it was harder to quickly browse or sample movies.) Collaborative filtering is a kind of machine learning problem designed to recommend purchases to users based on what similar users rated well.
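To see why a few movie ratings can act like a fingerprint, consider a toy linkage sketch: a handful of ratings known from some public source are matched against "anonymized" records to see which record they single out. The data and the crude matching score are invented, and real de-anonymization research on rating data was far more careful, matching approximate ratings and dates rather than exact values.

# Toy linkage sketch: match a handful of publicly known ratings against
# "anonymized" records to see which record they single out.
anonymized = {
    "user_17": {"Fargo": 5, "Heat": 4, "Amelie": 2, "Se7en": 5},
    "user_42": {"Fargo": 5, "Amelie": 5, "Titanic": 3},
    "user_99": {"Heat": 2, "Titanic": 4, "Se7en": 1},
}
known = {"Fargo": 5, "Heat": 4, "Se7en": 5}  # ratings observed elsewhere, e.g. public reviews

def match_score(record):
    # Count how many of the known ratings the record agrees with.
    return sum(1 for movie, rating in known.items() if record.get(movie) == rating)

best = max(anonymized, key=lambda uid: match_score(anonymized[uid]))
print(best, match_score(anonymized[best]))  # user_17 matches all three known ratings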


Martin Kleppmann-Designing Data-Intensive Applications. The Big Ideas Behind Reliable, Scalable and Maintainable Systems-O’Reilly (2017) by Unknown

active measures, Amazon Web Services, bitcoin, blockchain, business intelligence, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, database schema, DevOps, distributed ledger, Donald Knuth, Edward Snowden, Ethereum, ethereum blockchain, fault tolerance, finite state, Flash crash, full text search, general-purpose programming language, informal economy, information retrieval, Internet of things, iterative process, John von Neumann, Kubernetes, loose coupling, Marc Andreessen, microservices, natural language processing, Network effects, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, statistical model, undersea cable, web application, WebSocket, wikimedia commons

Graphs and Iterative Processing

In “Graph-Like Data Models” on page 49 we discussed using graphs for modeling data, and using graph query languages to traverse the edges and vertices in a graph. The discussion in Chapter 2 was focused around OLTP-style use: quickly executing queries to find a small number of vertices matching certain criteria. It is also interesting to look at graphs in a batch processing context, where the goal is to perform some kind of offline processing or analysis on an entire graph. This need often arises in machine learning applications such as recommendation engines, or in ranking systems. For example, one of the most famous graph analysis algorithms is PageRank [69], which tries to estimate the popularity of a web page based on what other web pages link to it. It is used as part of the formula that determines the order in which web search engines present their results. Dataflow engines like Spark, Flink, and Tez (see “Materialization of Intermediate State” on page 419) typically arrange the operators in a job as a directed acyclic graph (DAG).
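As a concrete reference point, here is a minimal power-iteration version of PageRank in Python, using the commonly cited damping factor of 0.85 and a tiny invented link graph. It illustrates the shape of the algorithm, not the implementation used by any of the systems mentioned here.

def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))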

The opposite of bounded. 558 | Glossary Index A aborts (transactions), 222, 224 in two-phase commit, 356 performance of optimistic concurrency con‐ trol, 266 retrying aborted transactions, 231 abstraction, 21, 27, 222, 266, 321 access path (in network model), 37, 60 accidental complexity, removing, 21 accountability, 535 ACID properties (transactions), 90, 223 atomicity, 223, 228 consistency, 224, 529 durability, 226 isolation, 225, 228 acknowledgements (messaging), 445 active/active replication (see multi-leader repli‐ cation) active/passive replication (see leader-based rep‐ lication) ActiveMQ (messaging), 137, 444 distributed transaction support, 361 ActiveRecord (object-relational mapper), 30, 232 actor model, 138 (see also message-passing) comparison to Pregel model, 425 comparison to stream processing, 468 Advanced Message Queuing Protocol (see AMQP) aerospace systems, 6, 10, 305, 372 aggregation data cubes and materialized views, 101 in batch processes, 406 in stream processes, 466 aggregation pipeline query language, 48 Agile, 22 minimizing irreversibility, 414, 497 moving faster with confidence, 532 Unix philosophy, 394 agreement, 365 (see also consensus) Airflow (workflow scheduler), 402 Ajax, 131 Akka (actor framework), 139 algorithms algorithm correctness, 308 B-trees, 79-83 for distributed systems, 306 hash indexes, 72-75 mergesort, 76, 402, 405 red-black trees, 78 SSTables and LSM-trees, 76-79 all-to-all replication topologies, 175 AllegroGraph (database), 50 ALTER TABLE statement (SQL), 40, 111 Amazon Dynamo (database), 177 Amazon Web Services (AWS), 8 Kinesis Streams (messaging), 448 network reliability, 279 postmortems, 9 RedShift (database), 93 S3 (object storage), 398 checking data integrity, 530 amplification of bias, 534 of failures, 364, 495 Index | 559 of tail latency, 16, 207 write amplification, 84 AMQP (Advanced Message Queuing Protocol), 444 (see also messaging systems) comparison to log-based messaging, 448, 451 message ordering, 446 analytics, 90 comparison to transaction processing, 91 data warehousing (see data warehousing) parallel query execution in MPP databases, 415 predictive (see predictive analytics) relation to batch processing, 411 schemas for, 93-95 snapshot isolation for queries, 238 stream analytics, 466 using MapReduce, analysis of user activity events (example), 404 anti-caching (in-memory databases), 89 anti-entropy, 178 Apache ActiveMQ (see ActiveMQ) Apache Avro (see Avro) Apache Beam (see Beam) Apache BookKeeper (see BookKeeper) Apache Cassandra (see Cassandra) Apache CouchDB (see CouchDB) Apache Curator (see Curator) Apache Drill (see Drill) Apache Flink (see Flink) Apache Giraph (see Giraph) Apache Hadoop (see Hadoop) Apache HAWQ (see HAWQ) Apache HBase (see HBase) Apache Helix (see Helix) Apache Hive (see Hive) Apache Impala (see Impala) Apache Jena (see Jena) Apache Kafka (see Kafka) Apache Lucene (see Lucene) Apache MADlib (see MADlib) Apache Mahout (see Mahout) Apache Oozie (see Oozie) Apache Parquet (see Parquet) Apache Qpid (see Qpid) Apache Samza (see Samza) Apache Solr (see Solr) Apache Spark (see Spark) 560 | Index Apache Storm (see Storm) Apache Tajo (see Tajo) Apache Tez (see Tez) Apache Thrift (see Thrift) Apache ZooKeeper (see ZooKeeper) Apama (stream analytics), 466 append-only B-trees, 82, 242 append-only files (see logs) Application Programming Interfaces (APIs), 5, 27 for batch processing, 403 for change streams, 456 for distributed transactions, 361 for graph processing, 425 for services, 131-136 (see also services) 
evolvability, 136 RESTful, 133 SOAP, 133 application state (see state) approximate search (see similarity search) archival storage, data from databases, 131 arcs (see edges) arithmetic mean, 14 ASCII text, 119, 395 ASN.1 (schema language), 127 asynchronous networks, 278, 553 comparison to synchronous networks, 284 formal model, 307 asynchronous replication, 154, 553 conflict detection, 172 data loss on failover, 157 reads from asynchronous follower, 162 Asynchronous Transfer Mode (ATM), 285 atomic broadcast (see total order broadcast) atomic clocks (caesium clocks), 294, 295 (see also clocks) atomicity (concurrency), 553 atomic increment-and-get, 351 compare-and-set, 245, 327 (see also compare-and-set operations) replicated operations, 246 write operations, 243 atomicity (transactions), 223, 228, 553 atomic commit, 353 avoiding, 523, 528 blocking and nonblocking, 359 in stream processing, 360, 477 maintaining derived data, 453 for multi-object transactions, 229 for single-object writes, 230 auditability, 528-533 designing for, 531 self-auditing systems, 530 through immutability, 460 tools for auditable data systems, 532 availability, 8 (see also fault tolerance) in CAP theorem, 337 in service level agreements (SLAs), 15 Avro (data format), 122-127 code generation, 127 dynamically generated schemas, 126 object container files, 125, 131, 414 reader determining writer’s schema, 125 schema evolution, 123 use in Hadoop, 414 awk (Unix tool), 391 AWS (see Amazon Web Services) Azure (see Microsoft) B B-trees (indexes), 79-83 append-only/copy-on-write variants, 82, 242 branching factor, 81 comparison to LSM-trees, 83-85 crash recovery, 82 growing by splitting a page, 81 optimizations, 82 similarity to dynamic partitioning, 212 backpressure, 441, 553 in TCP, 282 backups database snapshot for replication, 156 integrity of, 530 snapshot isolation for, 238 use for ETL processes, 405 backward compatibility, 112 BASE, contrast to ACID, 223 bash shell (Unix), 70, 395, 503 batch processing, 28, 389-431, 553 combining with stream processing lambda architecture, 497 unifying technologies, 498 comparison to MPP databases, 414-418 comparison to stream processing, 464 comparison to Unix, 413-414 dataflow engines, 421-423 fault tolerance, 406, 414, 422, 442 for data integration, 494-498 graphs and iterative processing, 424-426 high-level APIs and languages, 403, 426-429 log-based messaging and, 451 maintaining derived state, 495 MapReduce and distributed filesystems, 397-413 (see also MapReduce) measuring performance, 13, 390 outputs, 411-413 key-value stores, 412 search indexes, 411 using Unix tools (example), 391-394 Bayou (database), 522 Beam (dataflow library), 498 bias, 534 big ball of mud, 20 Bigtable data model, 41, 99 binary data encodings, 115-128 Avro, 122-127 MessagePack, 116-117 Thrift and Protocol Buffers, 117-121 binary encoding based on schemas, 127 by network drivers, 128 binary strings, lack of support in JSON and XML, 114 BinaryProtocol encoding (Thrift), 118 Bitcask (storage engine), 72 crash recovery, 74 Bitcoin (cryptocurrency), 532 Byzantine fault tolerance, 305 concurrency bugs in exchanges, 233 bitmap indexes, 97 blockchains, 532 Byzantine fault tolerance, 305 blocking atomic commit, 359 Bloom (programming language), 504 Bloom filter (algorithm), 79, 466 BookKeeper (replicated log), 372 Bottled Water (change data capture), 455 bounded datasets, 430, 439, 553 (see also batch processing) bounded delays, 553 in networks, 285 process pauses, 298 broadcast hash joins, 409 Index | 561 
brokerless messaging, 442 Brubeck (metrics aggregator), 442 BTM (transaction coordinator), 356 bulk synchronous parallel (BSP) model, 425 bursty network traffic patterns, 285 business data processing, 28, 90, 390 byte sequence, encoding data in, 112 Byzantine faults, 304-306, 307, 553 Byzantine fault-tolerant systems, 305, 532 Byzantine Generals Problem, 304 consensus algorithms and, 366 C caches, 89, 553 and materialized views, 101 as derived data, 386, 499-504 database as cache of transaction log, 460 in CPUs, 99, 338, 428 invalidation and maintenance, 452, 467 linearizability, 324 CAP theorem, 336-338, 554 Cascading (batch processing), 419, 427 hash joins, 409 workflows, 403 cascading failures, 9, 214, 281 Cascalog (batch processing), 60 Cassandra (database) column-family data model, 41, 99 compaction strategy, 79 compound primary key, 204 gossip protocol, 216 hash partitioning, 203-205 last-write-wins conflict resolution, 186, 292 leaderless replication, 177 linearizability, lack of, 335 log-structured storage, 78 multi-datacenter support, 184 partitioning scheme, 213 secondary indexes, 207 sloppy quorums, 184 cat (Unix tool), 391 causal context, 191 (see also causal dependencies) causal dependencies, 186-191 capturing, 191, 342, 494, 514 by total ordering, 493 causal ordering, 339 in transactions, 262 sending message to friends (example), 494 562 | Index causality, 554 causal ordering, 339-343 linearizability and, 342 total order consistent with, 344, 345 consistency with, 344-347 consistent snapshots, 340 happens-before relationship, 186 in serializable transactions, 262-265 mismatch with clocks, 292 ordering events to capture, 493 violations of, 165, 176, 292, 340 with synchronized clocks, 294 CEP (see complex event processing) certificate transparency, 532 chain replication, 155 linearizable reads, 351 change data capture, 160, 454 API support for change streams, 456 comparison to event sourcing, 457 implementing, 454 initial snapshot, 455 log compaction, 456 changelogs, 460 change data capture, 454 for operator state, 479 generating with triggers, 455 in stream joins, 474 log compaction, 456 maintaining derived state, 452 Chaos Monkey, 7, 280 checkpointing in batch processors, 422, 426 in high-performance computing, 275 in stream processors, 477, 523 chronicle data model, 458 circuit-switched networks, 284 circular buffers, 450 circular replication topologies, 175 clickstream data, analysis of, 404 clients calling services, 131 pushing state changes to, 512 request routing, 214 stateful and offline-capable, 170, 511 clocks, 287-299 atomic (caesium) clocks, 294, 295 confidence interval, 293-295 for global snapshots, 294 logical (see logical clocks) skew, 291-294, 334 slewing, 289 synchronization and accuracy, 289-291 synchronization using GPS, 287, 290, 294, 295 time-of-day versus monotonic clocks, 288 timestamping events, 471 cloud computing, 146, 275 need for service discovery, 372 network glitches, 279 shared resources, 284 single-machine reliability, 8 Cloudera Impala (see Impala) clustered indexes, 86 CODASYL model, 36 (see also network model) code generation with Avro, 127 with Thrift and Protocol Buffers, 118 with WSDL, 133 collaborative editing multi-leader replication and, 170 column families (Bigtable), 41, 99 column-oriented storage, 95-101 column compression, 97 distinction between column families and, 99 in batch processors, 428 Parquet, 96, 131, 414 sort order in, 99-100 vectorized processing, 99, 428 writing to, 101 comma-separated values (see CSV) command query 
responsibility segregation (CQRS), 462 commands (event sourcing), 459 commits (transactions), 222 atomic commit, 354-355 (see also atomicity; transactions) read committed isolation, 234 three-phase commit (3PC), 359 two-phase commit (2PC), 355-359 commutative operations, 246 compaction of changelogs, 456 (see also log compaction) for stream operator state, 479 of log-structured storage, 73 issues with, 84 size-tiered and leveled approaches, 79 CompactProtocol encoding (Thrift), 119 compare-and-set operations, 245, 327 implementing locks, 370 implementing uniqueness constraints, 331 implementing with total order broadcast, 350 relation to consensus, 335, 350, 352, 374 relation to transactions, 230 compatibility, 112, 128 calling services, 136 properties of encoding formats, 139 using databases, 129-131 using message-passing, 138 compensating transactions, 355, 461, 526 complex event processing (CEP), 465 complexity distilling in theoretical models, 310 hiding using abstraction, 27 of software systems, managing, 20 composing data systems (see unbundling data‐ bases) compute-intensive applications, 3, 275 concatenated indexes, 87 in Cassandra, 204 Concord (stream processor), 466 concurrency actor programming model, 138, 468 (see also message-passing) bugs from weak transaction isolation, 233 conflict resolution, 171, 174 detecting concurrent writes, 184-191 dual writes, problems with, 453 happens-before relationship, 186 in replicated systems, 161-191, 324-338 lost updates, 243 multi-version concurrency control (MVCC), 239 optimistic concurrency control, 261 ordering of operations, 326, 341 reducing, through event logs, 351, 462, 507 time and relativity, 187 transaction isolation, 225 write skew (transaction isolation), 246-251 conflict-free replicated datatypes (CRDTs), 174 conflicts conflict detection, 172 causal dependencies, 186, 342 in consensus algorithms, 368 in leaderless replication, 184 Index | 563 in log-based systems, 351, 521 in nonlinearizable systems, 343 in serializable snapshot isolation (SSI), 264 in two-phase commit, 357, 364 conflict resolution automatic conflict resolution, 174 by aborting transactions, 261 by apologizing, 527 convergence, 172-174 in leaderless systems, 190 last write wins (LWW), 186, 292 using atomic operations, 246 using custom logic, 173 determining what is a conflict, 174, 522 in multi-leader replication, 171-175 avoiding conflicts, 172 lost updates, 242-246 materializing, 251 relation to operation ordering, 339 write skew (transaction isolation), 246-251 congestion (networks) avoidance, 282 limiting accuracy of clocks, 293 queueing delays, 282 consensus, 321, 364-375, 554 algorithms, 366-368 preventing split brain, 367 safety and liveness properties, 365 using linearizable operations, 351 cost of, 369 distributed transactions, 352-375 in practice, 360-364 two-phase commit, 354-359 XA transactions, 361-364 impossibility of, 353 membership and coordination services, 370-373 relation to compare-and-set, 335, 350, 352, 374 relation to replication, 155, 349 relation to uniqueness constraints, 521 consistency, 224, 524 across different databases, 157, 452, 462, 492 causal, 339-348, 493 consistent prefix reads, 165-167 consistent snapshots, 156, 237-242, 294, 455, 500 (see also snapshots) 564 | Index crash recovery, 82 enforcing constraints (see constraints) eventual, 162, 322 (see also eventual consistency) in ACID transactions, 224, 529 in CAP theorem, 337 linearizability, 324-338 meanings of, 224 monotonic reads, 164-165 of secondary indexes, 231, 241, 
354, 491, 500 ordering guarantees, 339-352 read-after-write, 162-164 sequential, 351 strong (see linearizability) timeliness and integrity, 524 using quorums, 181, 334 consistent hashing, 204 consistent prefix reads, 165 constraints (databases), 225, 248 asynchronously checked, 526 coordination avoidance, 527 ensuring idempotence, 519 in log-based systems, 521-524 across multiple partitions, 522 in two-phase commit, 355, 357 relation to consensus, 374, 521 relation to event ordering, 347 requiring linearizability, 330 Consul (service discovery), 372 consumers (message streams), 137, 440 backpressure, 441 consumer offsets in logs, 449 failures, 445, 449 fan-out, 11, 445, 448 load balancing, 444, 448 not keeping up with producers, 441, 450, 502 context switches, 14, 297 convergence (conflict resolution), 172-174, 322 coordination avoidance, 527 cross-datacenter, 168, 493 cross-partition ordering, 256, 294, 348, 523 services, 330, 370-373 coordinator (in 2PC), 356 failure, 358 in XA transactions, 361-364 recovery, 363 copy-on-write (B-trees), 82, 242 CORBA (Common Object Request Broker Architecture), 134 correctness, 6 auditability, 528-533 Byzantine fault tolerance, 305, 532 dealing with partial failures, 274 in log-based systems, 521-524 of algorithm within system model, 308 of compensating transactions, 355 of consensus, 368 of derived data, 497, 531 of immutable data, 461 of personal data, 535, 540 of time, 176, 289-295 of transactions, 225, 515, 529 timeliness and integrity, 524-528 corruption of data detecting, 519, 530-533 due to pathological memory access, 529 due to radiation, 305 due to split brain, 158, 302 due to weak transaction isolation, 233 formalization in consensus, 366 integrity as absence of, 524 network packets, 306 on disks, 227 preventing using write-ahead logs, 82 recovering from, 414, 460 Couchbase (database) durability, 89 hash partitioning, 203-204, 211 rebalancing, 213 request routing, 216 CouchDB (database) B-tree storage, 242 change feed, 456 document data model, 31 join support, 34 MapReduce support, 46, 400 replication, 170, 173 covering indexes, 86 CPUs cache coherence and memory barriers, 338 caching and pipelining, 99, 428 increasing parallelism, 43 CRDTs (see conflict-free replicated datatypes) CREATE INDEX statement (SQL), 85, 500 credit rating agencies, 535 Crunch (batch processing), 419, 427 hash joins, 409 sharded joins, 408 workflows, 403 cryptography defense against attackers, 306 end-to-end encryption and authentication, 519, 543 proving integrity of data, 532 CSS (Cascading Style Sheets), 44 CSV (comma-separated values), 70, 114, 396 Curator (ZooKeeper recipes), 330, 371 curl (Unix tool), 135, 397 cursor stability, 243 Cypher (query language), 52 comparison to SPARQL, 59 D data corruption (see corruption of data) data cubes, 102 data formats (see encoding) data integration, 490-498, 543 batch and stream processing, 494-498 lambda architecture, 497 maintaining derived state, 495 reprocessing data, 496 unifying, 498 by unbundling databases, 499-515 comparison to federated databases, 501 combining tools by deriving data, 490-494 derived data versus distributed transac‐ tions, 492 limits of total ordering, 493 ordering events to capture causality, 493 reasoning about dataflows, 491 need for, 385 data lakes, 415 data locality (see locality) data models, 27-64 graph-like models, 49-63 Datalog language, 60-63 property graphs, 50 RDF and triple-stores, 55-59 query languages, 42-48 relational model versus document model, 28-42 data protection regulations, 
542 data systems, 3 about, 4 Index | 565 concerns when designing, 5 future of, 489-544 correctness, constraints, and integrity, 515-533 data integration, 490-498 unbundling databases, 499-515 heterogeneous, keeping in sync, 452 maintainability, 18-22 possible faults in, 221 reliability, 6-10 hardware faults, 7 human errors, 9 importance of, 10 software errors, 8 scalability, 10-18 unreliable clocks, 287-299 data warehousing, 91-95, 554 comparison to data lakes, 415 ETL (extract-transform-load), 92, 416, 452 keeping data systems in sync, 452 schema design, 93 slowly changing dimension (SCD), 476 data-intensive applications, 3 database triggers (see triggers) database-internal distributed transactions, 360, 364, 477 databases archival storage, 131 comparison of message brokers to, 443 dataflow through, 129 end-to-end argument for, 519-520 checking integrity, 531 inside-out, 504 (see also unbundling databases) output from batch workflows, 412 relation to event streams, 451-464 (see also changelogs) API support for change streams, 456, 506 change data capture, 454-457 event sourcing, 457-459 keeping systems in sync, 452-453 philosophy of immutable events, 459-464 unbundling, 499-515 composing data storage technologies, 499-504 designing applications around dataflow, 504-509 566 | Index observing derived state, 509-515 datacenters geographically distributed, 145, 164, 278, 493 multi-tenancy and shared resources, 284 network architecture, 276 network faults, 279 replication across multiple, 169 leaderless replication, 184 multi-leader replication, 168, 335 dataflow, 128-139, 504-509 correctness of dataflow systems, 525 differential, 504 message-passing, 136-139 reasoning about, 491 through databases, 129 through services, 131-136 dataflow engines, 421-423 comparison to stream processing, 464 directed acyclic graphs (DAG), 424 partitioning, approach to, 429 support for declarative queries, 427 Datalog (query language), 60-63 datatypes binary strings in XML and JSON, 114 conflict-free, 174 in Avro encodings, 122 in Thrift and Protocol Buffers, 121 numbers in XML and JSON, 114 Datomic (database) B-tree storage, 242 data model, 50, 57 Datalog query language, 60 excision (deleting data), 463 languages for transactions, 255 serial execution of transactions, 253 deadlocks detection, in two-phase commit (2PC), 364 in two-phase locking (2PL), 258 Debezium (change data capture), 455 declarative languages, 42, 554 Bloom, 504 CSS and XSL, 44 Cypher, 52 Datalog, 60 for batch processing, 427 recursive SQL queries, 53 relational algebra and SQL, 42 SPARQL, 59 delays bounded network delays, 285 bounded process pauses, 298 unbounded network delays, 282 unbounded process pauses, 296 deleting data, 463 denormalization (data representation), 34, 554 costs, 39 in derived data systems, 386 materialized views, 101 updating derived data, 228, 231, 490 versus normalization, 462 derived data, 386, 439, 554 from change data capture, 454 in event sourcing, 458-458 maintaining derived state through logs, 452-457, 459-463 observing, by subscribing to streams, 512 outputs of batch and stream processing, 495 through application code, 505 versus distributed transactions, 492 deterministic operations, 255, 274, 554 accidental nondeterminism, 423 and fault tolerance, 423, 426 and idempotence, 478, 492 computing derived data, 495, 526, 531 in state machine replication, 349, 452, 458 joins, 476 DevOps, 394 differential dataflow, 504 dimension tables, 94 dimensional modeling (see star schemas) directed acyclic graphs (DAGs), 424 
dirty reads (transaction isolation), 234 dirty writes (transaction isolation), 235 discrimination, 534 disks (see hard disks) distributed actor frameworks, 138 distributed filesystems, 398-399 decoupling from query engines, 417 indiscriminately dumping data into, 415 use by MapReduce, 402 distributed systems, 273-312, 554 Byzantine faults, 304-306 cloud versus supercomputing, 275 detecting network faults, 280 faults and partial failures, 274-277 formalization of consensus, 365 impossibility results, 338, 353 issues with failover, 157 limitations of distributed transactions, 363 multi-datacenter, 169, 335 network problems, 277-286 quorums, relying on, 301 reasons for using, 145, 151 synchronized clocks, relying on, 291-295 system models, 306-310 use of clocks and time, 287 distributed transactions (see transactions) Django (web framework), 232 DNS (Domain Name System), 216, 372 Docker (container manager), 506 document data model, 30-42 comparison to relational model, 38-42 document references, 38, 403 document-oriented databases, 31 many-to-many relationships and joins, 36 multi-object transactions, need for, 231 versus relational model convergence of models, 41 data locality, 41 document-partitioned indexes, 206, 217, 411 domain-driven design (DDD), 457 DRBD (Distributed Replicated Block Device), 153 drift (clocks), 289 Drill (query engine), 93 Druid (database), 461 Dryad (dataflow engine), 421 dual writes, problems with, 452, 507 duplicates, suppression of, 517 (see also idempotence) using a unique ID, 518, 522 durability (transactions), 226, 554 duration (time), 287 measurement with monotonic clocks, 288 dynamic partitioning, 212 dynamically typed languages analogy to schema-on-read, 40 code generation and, 127 Dynamo-style databases (see leaderless replica‐ tion) E edges (in graphs), 49, 403 property graph model, 50 edit distance (full-text search), 88 effectively-once semantics, 476, 516 Index | 567 (see also exactly-once semantics) preservation of integrity, 525 elastic systems, 17 Elasticsearch (search server) document-partitioned indexes, 207 partition rebalancing, 211 percolator (stream search), 467 usage example, 4 use of Lucene, 79 ElephantDB (database), 413 Elm (programming language), 504, 512 encodings (data formats), 111-128 Avro, 122-127 binary variants of JSON and XML, 115 compatibility, 112 calling services, 136 using databases, 129-131 using message-passing, 138 defined, 113 JSON, XML, and CSV, 114 language-specific formats, 113 merits of schemas, 127 representations of data, 112 Thrift and Protocol Buffers, 117-121 end-to-end argument, 277, 519-520 checking integrity, 531 publish/subscribe streams, 512 enrichment (stream), 473 Enterprise JavaBeans (EJB), 134 entities (see vertices) epoch (consensus algorithms), 368 epoch (Unix timestamps), 288 equi-joins, 403 erasure coding (error correction), 398 Erlang OTP (actor framework), 139 error handling for network faults, 280 in transactions, 231 error-correcting codes, 277, 398 Esper (CEP engine), 466 etcd (coordination service), 370-373 linearizable operations, 333 locks and leader election, 330 quorum reads, 351 service discovery, 372 use of Raft algorithm, 349, 353 Ethereum (blockchain), 532 Ethernet (networks), 276, 278, 285 packet checksums, 306, 519 568 | Index Etherpad (collaborative editor), 170 ethics, 533-543 code of ethics and professional practice, 533 legislation and self-regulation, 542 predictive analytics, 533-536 amplifying bias, 534 feedback loops, 536 privacy and tracking, 536-543 consent and freedom of 
choice, 538 data as assets and power, 540 meaning of privacy, 539 surveillance, 537 respect, dignity, and agency, 543, 544 unintended consequences, 533, 536 ETL (extract-transform-load), 92, 405, 452, 554 use of Hadoop for, 416 event sourcing, 457-459 commands and events, 459 comparison to change data capture, 457 comparison to lambda architecture, 497 deriving current state from event log, 458 immutability and auditability, 459, 531 large, reliable data systems, 519, 526 Event Store (database), 458 event streams (see streams) events, 440 deciding on total order of, 493 deriving views from event log, 461 difference to commands, 459 event time versus processing time, 469, 477, 498 immutable, advantages of, 460, 531 ordering to capture causality, 493 reads as, 513 stragglers, 470, 498 timestamp of, in stream processing, 471 EventSource (browser API), 512 eventual consistency, 152, 162, 308, 322 (see also conflicts) and perpetual inconsistency, 525 evolvability, 21, 111 calling services, 136 graph-structured data, 52 of databases, 40, 129-131, 461, 497 of message-passing, 138 reprocessing data, 496, 498 schema evolution in Avro, 123 schema evolution in Thrift and Protocol Buffers, 120 schema-on-read, 39, 111, 128 exactly-once semantics, 360, 476, 516 parity with batch processors, 498 preservation of integrity, 525 exclusive mode (locks), 258 eXtended Architecture transactions (see XA transactions) extract-transform-load (see ETL) F Facebook Presto (query engine), 93 React, Flux, and Redux (user interface libra‐ ries), 512 social graphs, 49 Wormhole (change data capture), 455 fact tables, 93 failover, 157, 554 (see also leader-based replication) in leaderless replication, absence of, 178 leader election, 301, 348, 352 potential problems, 157 failures amplification by distributed transactions, 364, 495 failure detection, 280 automatic rebalancing causing cascading failures, 214 perfect failure detectors, 359 timeouts and unbounded delays, 282, 284 using ZooKeeper, 371 faults versus, 7 partial failures in distributed systems, 275-277, 310 fan-out (messaging systems), 11, 445 fault tolerance, 6-10, 555 abstractions for, 321 formalization in consensus, 365-369 use of replication, 367 human fault tolerance, 414 in batch processing, 406, 414, 422, 425 in log-based systems, 520, 524-526 in stream processing, 476-479 atomic commit, 477 idempotence, 478 maintaining derived state, 495 microbatching and checkpointing, 477 rebuilding state after a failure, 478 of distributed transactions, 362-364 transaction atomicity, 223, 354-361 faults, 6 Byzantine faults, 304-306 failures versus, 7 handled by transactions, 221 handling in supercomputers and cloud computing, 275 hardware, 7 in batch processing versus distributed data‐ bases, 417 in distributed systems, 274-277 introducing deliberately, 7, 280 network faults, 279-281 asymmetric faults, 300 detecting, 280 tolerance of, in multi-leader replication, 169 software errors, 8 tolerating (see fault tolerance) federated databases, 501 fence (CPU instruction), 338 fencing (preventing split brain), 158, 302-304 generating fencing tokens, 349, 370 properties of fencing tokens, 308 stream processors writing to databases, 478, 517 Fibre Channel (networks), 398 field tags (Thrift and Protocol Buffers), 119-121 file descriptors (Unix), 395 financial data, 460 Firebase (database), 456 Flink (processing framework), 421-423 dataflow APIs, 427 fault tolerance, 422, 477, 479 Gelly API (graph processing), 425 integration of batch and stream processing, 495, 498 machine 
learning, 428 query optimizer, 427 stream processing, 466 flow control, 282, 441, 555 FLP result (on consensus), 353 FlumeJava (dataflow library), 403, 427 followers, 152, 555 (see also leader-based replication) foreign keys, 38, 403 forward compatibility, 112 forward decay (algorithm), 16 Index | 569 Fossil (version control system), 463 shunning (deleting data), 463 FoundationDB (database) serializable transactions, 261, 265, 364 fractal trees, 83 full table scans, 403 full-text search, 555 and fuzzy indexes, 88 building search indexes, 411 Lucene storage engine, 79 functional reactive programming (FRP), 504 functional requirements, 22 futures (asynchronous operations), 135 fuzzy search (see similarity search) G garbage collection immutability and, 463 process pauses for, 14, 296-299, 301 (see also process pauses) genome analysis, 63, 429 geographically distributed datacenters, 145, 164, 278, 493 geospatial indexes, 87 Giraph (graph processing), 425 Git (version control system), 174, 342, 463 GitHub, postmortems, 157, 158, 309 global indexes (see term-partitioned indexes) GlusterFS (distributed filesystem), 398 GNU Coreutils (Linux), 394 GoldenGate (change data capture), 161, 170, 455 (see also Oracle) Google Bigtable (database) data model (see Bigtable data model) partitioning scheme, 199, 202 storage layout, 78 Chubby (lock service), 370 Cloud Dataflow (stream processor), 466, 477, 498 (see also Beam) Cloud Pub/Sub (messaging), 444, 448 Docs (collaborative editor), 170 Dremel (query engine), 93, 96 FlumeJava (dataflow library), 403, 427 GFS (distributed file system), 398 gRPC (RPC framework), 135 MapReduce (batch processing), 390 570 | Index (see also MapReduce) building search indexes, 411 task preemption, 418 Pregel (graph processing), 425 Spanner (see Spanner) TrueTime (clock API), 294 gossip protocol, 216 government use of data, 541 GPS (Global Positioning System) use for clock synchronization, 287, 290, 294, 295 GraphChi (graph processing), 426 graphs, 555 as data models, 49-63 example of graph-structured data, 49 property graphs, 50 RDF and triple-stores, 55-59 versus the network model, 60 processing and analysis, 424-426 fault tolerance, 425 Pregel processing model, 425 query languages Cypher, 52 Datalog, 60-63 recursive SQL queries, 53 SPARQL, 59-59 Gremlin (graph query language), 50 grep (Unix tool), 392 GROUP BY clause (SQL), 406 grouping records in MapReduce, 406 handling skew, 407 H Hadoop (data infrastructure) comparison to distributed databases, 390 comparison to MPP databases, 414-418 comparison to Unix, 413-414, 499 diverse processing models in ecosystem, 417 HDFS distributed filesystem (see HDFS) higher-level tools, 403 join algorithms, 403-410 (see also MapReduce) MapReduce (see MapReduce) YARN (see YARN) happens-before relationship, 340 capturing, 187 concurrency and, 186 hard disks access patterns, 84 detecting corruption, 519, 530 faults in, 7, 227 sequential write throughput, 75, 450 hardware faults, 7 hash indexes, 72-75 broadcast hash joins, 409 partitioned hash joins, 409 hash partitioning, 203-205, 217 consistent hashing, 204 problems with hash mod N, 210 range queries, 204 suitable hash functions, 203 with fixed number of partitions, 210 HAWQ (database), 428 HBase (database) bug due to lack of fencing, 302 bulk loading, 413 column-family data model, 41, 99 dynamic partitioning, 212 key-range partitioning, 202 log-structured storage, 78 request routing, 216 size-tiered compaction, 79 use of HDFS, 417 use of ZooKeeper, 370 HDFS (Hadoop Distributed File System), 
398-399 (see also distributed filesystems) checking data integrity, 530 decoupling from query engines, 417 indiscriminately dumping data into, 415 metadata about datasets, 410 NameNode, 398 use by Flink, 479 use by HBase, 212 use by MapReduce, 402 HdrHistogram (numerical library), 16 head (Unix tool), 392 head vertex (property graphs), 51 head-of-line blocking, 15 heap files (databases), 86 Helix (cluster manager), 216 heterogeneous distributed transactions, 360, 364 heuristic decisions (in 2PC), 363 Hibernate (object-relational mapper), 30 hierarchical model, 36 high availability (see fault tolerance) high-frequency trading, 290, 299 high-performance computing (HPC), 275 hinted handoff, 183 histograms, 16 Hive (query engine), 419, 427 for data warehouses, 93 HCatalog and metastore, 410 map-side joins, 409 query optimizer, 427 skewed joins, 408 workflows, 403 Hollerith machines, 390 hopping windows (stream processing), 472 (see also windows) horizontal scaling (see scaling out) HornetQ (messaging), 137, 444 distributed transaction support, 361 hot spots, 201 due to celebrities, 205 for time-series data, 203 in batch processing, 407 relieving, 205 hot standbys (see leader-based replication) HTTP, use in APIs (see services) human errors, 9, 279, 414 HyperDex (database), 88 HyperLogLog (algorithm), 466 I I/O operations, waiting for, 297 IBM DB2 (database) distributed transaction support, 361 recursive query support, 54 serializable isolation, 242, 257 XML and JSON support, 30, 42 electromechanical card-sorting machines, 390 IMS (database), 36 imperative query APIs, 46 InfoSphere Streams (CEP engine), 466 MQ (messaging), 444 distributed transaction support, 361 System R (database), 222 WebSphere (messaging), 137 idempotence, 134, 478, 555 by giving operations unique IDs, 518, 522 idempotent operations, 517 immutability advantages of, 460, 531 Index | 571 deriving state from event log, 459-464 for crash recovery, 75 in B-trees, 82, 242 in event sourcing, 457 inputs to Unix commands, 397 limitations of, 463 Impala (query engine) for data warehouses, 93 hash joins, 409 native code generation, 428 use of HDFS, 417 impedance mismatch, 29 imperative languages, 42 setting element styles (example), 45 in doubt (transaction status), 358 holding locks, 362 orphaned transactions, 363 in-memory databases, 88 durability, 227 serial transaction execution, 253 incidents cascading failures, 9 crashes due to leap seconds, 290 data corruption and financial losses due to concurrency bugs, 233 data corruption on hard disks, 227 data loss due to last-write-wins, 173, 292 data on disks unreadable, 309 deleted items reappearing, 174 disclosure of sensitive data due to primary key reuse, 157 errors in transaction serializability, 529 gigabit network interface with 1 Kb/s throughput, 311 network faults, 279 network interface dropping only inbound packets, 279 network partitions and whole-datacenter failures, 275 poor handling of network faults, 280 sending message to ex-partner, 494 sharks biting undersea cables, 279 split brain due to 1-minute packet delay, 158, 279 vibrations in server rack, 14 violation of uniqueness constraint, 529 indexes, 71, 555 and snapshot isolation, 241 as derived data, 386, 499-504 572 | Index B-trees, 79-83 building in batch processes, 411 clustered, 86 comparison of B-trees and LSM-trees, 83-85 concatenated, 87 covering (with included columns), 86 creating, 500 full-text search, 88 geospatial, 87 hash, 72-75 index-range locking, 260 multi-column, 87 partitioning and secondary indexes, 
206-209, 217 secondary, 85 (see also secondary indexes) problems with dual writes, 452, 491 SSTables and LSM-trees, 76-79 updating when data changes, 452, 467 Industrial Revolution, 541 InfiniBand (networks), 285 InfiniteGraph (database), 50 InnoDB (storage engine) clustered index on primary key, 86 not preventing lost updates, 245 preventing write skew, 248, 257 serializable isolation, 257 snapshot isolation support, 239 inside-out databases, 504 (see also unbundling databases) integrating different data systems (see data integration) integrity, 524 coordination-avoiding data systems, 528 correctness of dataflow systems, 525 in consensus formalization, 365 integrity checks, 530 (see also auditing) end-to-end, 519, 531 use of snapshot isolation, 238 maintaining despite software bugs, 529 Interface Definition Language (IDL), 117, 122 intermediate state, materialization of, 420-423 internet services, systems for implementing, 275 invariants, 225 (see also constraints) inversion of control, 396 IP (Internet Protocol) unreliability of, 277 ISDN (Integrated Services Digital Network), 284 isolation (in transactions), 225, 228, 555 correctness and, 515 for single-object writes, 230 serializability, 251-266 actual serial execution, 252-256 serializable snapshot isolation (SSI), 261-266 two-phase locking (2PL), 257-261 violating, 228 weak isolation levels, 233-251 preventing lost updates, 242-246 read committed, 234-237 snapshot isolation, 237-242 iterative processing, 424-426 J Java Database Connectivity (JDBC) distributed transaction support, 361 network drivers, 128 Java Enterprise Edition (EE), 134, 356, 361 Java Message Service (JMS), 444 (see also messaging systems) comparison to log-based messaging, 448, 451 distributed transaction support, 361 message ordering, 446 Java Transaction API (JTA), 355, 361 Java Virtual Machine (JVM) bytecode generation, 428 garbage collection pauses, 296 process reuse in batch processors, 422 JavaScript in MapReduce querying, 46 setting element styles (example), 45 use in advanced queries, 48 Jena (RDF framework), 57 Jepsen (fault tolerance testing), 515 jitter (network delay), 284 joins, 555 by index lookup, 403 expressing as relational operators, 427 in relational and document databases, 34 MapReduce map-side joins, 408-410 broadcast hash joins, 409 merge joins, 410 partitioned hash joins, 409 MapReduce reduce-side joins, 403-408 handling skew, 407 sort-merge joins, 405 parallel execution of, 415 secondary indexes and, 85 stream joins, 472-476 stream-stream join, 473 stream-table join, 473 table-table join, 474 time-dependence of, 475 support in document databases, 42 JOTM (transaction coordinator), 356 JSON Avro schema representation, 122 binary variants, 115 for application data, issues with, 114 in relational databases, 30, 42 representing a résumé (example), 31 Juttle (query language), 504 K k-nearest neighbors, 429 Kafka (messaging), 137, 448 Kafka Connect (database integration), 457, 461 Kafka Streams (stream processor), 466, 467 fault tolerance, 479 leader-based replication, 153 log compaction, 456, 467 message offsets, 447, 478 request routing, 216 transaction support, 477 usage example, 4 Ketama (partitioning library), 213 key-value stores, 70 as batch process output, 412 hash indexes, 72-75 in-memory, 89 partitioning, 201-205 by hash of key, 203, 217 by key range, 202, 217 dynamic partitioning, 212 skew and hot spots, 205 Kryo (Java), 113 Kubernetes (cluster manager), 418, 506 L lambda architecture, 497 Lamport timestamps, 345 Index | 573 Large 
Hadron Collider (LHC), 64 last write wins (LWW), 173, 334 discarding concurrent writes, 186 problems with, 292 prone to lost updates, 246 late binding, 396 latency instability under two-phase locking, 259 network latency and resource utilization, 286 response time versus, 14 tail latency, 15, 207 leader-based replication, 152-161 (see also replication) failover, 157, 301 handling node outages, 156 implementation of replication logs change data capture, 454-457 (see also changelogs) statement-based, 158 trigger-based replication, 161 write-ahead log (WAL) shipping, 159 linearizability of operations, 333 locking and leader election, 330 log sequence number, 156, 449 read-scaling architecture, 161 relation to consensus, 367 setting up new followers, 155 synchronous versus asynchronous, 153-155 leaderless replication, 177-191 (see also replication) detecting concurrent writes, 184-191 capturing happens-before relationship, 187 happens-before relationship and concur‐ rency, 186 last write wins, 186 merging concurrently written values, 190 version vectors, 191 multi-datacenter, 184 quorums, 179-182 consistency limitations, 181-183, 334 sloppy quorums and hinted handoff, 183 read repair and anti-entropy, 178 leap seconds, 8, 290 in time-of-day clocks, 288 leases, 295 implementation with ZooKeeper, 370 574 | Index need for fencing, 302 ledgers, 460 distributed ledger technologies, 532 legacy systems, maintenance of, 18 less (Unix tool), 397 LevelDB (storage engine), 78 leveled compaction, 79 Levenshtein automata, 88 limping (partial failure), 311 linearizability, 324-338, 555 cost of, 335-338 CAP theorem, 336 memory on multi-core CPUs, 338 definition, 325-329 implementing with total order broadcast, 350 in ZooKeeper, 370 of derived data systems, 492, 524 avoiding coordination, 527 of different replication methods, 332-335 using quorums, 334 relying on, 330-332 constraints and uniqueness, 330 cross-channel timing dependencies, 331 locking and leader election, 330 stronger than causal consistency, 342 using to implement total order broadcast, 351 versus serializability, 329 LinkedIn Azkaban (workflow scheduler), 402 Databus (change data capture), 161, 455 Espresso (database), 31, 126, 130, 153, 216 Helix (cluster manager) (see Helix) profile (example), 30 reference to company entity (example), 34 Rest.li (RPC framework), 135 Voldemort (database) (see Voldemort) Linux, leap second bug, 8, 290 liveness properties, 308 LMDB (storage engine), 82, 242 load approaches to coping with, 17 describing, 11 load testing, 16 load balancing (messaging), 444 local indexes (see document-partitioned indexes) locality (data access), 32, 41, 555 in batch processing, 400, 405, 421 in stateful clients, 170, 511 in stream processing, 474, 478, 508, 522 location transparency, 134 in the actor model, 138 locks, 556 deadlock, 258 distributed locking, 301-304, 330 fencing tokens, 303 implementation with ZooKeeper, 370 relation to consensus, 374 for transaction isolation in snapshot isolation, 239 in two-phase locking (2PL), 257-261 making operations atomic, 243 performance, 258 preventing dirty writes, 236 preventing phantoms with index-range locks, 260, 265 read locks (shared mode), 236, 258 shared mode and exclusive mode, 258 in two-phase commit (2PC) deadlock detection, 364 in-doubt transactions holding locks, 362 materializing conflicts with, 251 preventing lost updates by explicit locking, 244 log sequence number, 156, 449 logic programming languages, 504 logical clocks, 293, 343, 494 for read-after-write consistency, 
164 logical logs, 160 logs (data structure), 71, 556 advantages of immutability, 460 compaction, 73, 79, 456, 460 for stream operator state, 479 creating using total order broadcast, 349 implementing uniqueness constraints, 522 log-based messaging, 446-451 comparison to traditional messaging, 448, 451 consumer offsets, 449 disk space usage, 450 replaying old messages, 451, 496, 498 slow consumers, 450 using logs for message storage, 447 log-structured storage, 71-79 log-structured merge tree (see LSMtrees) replication, 152, 158-161 change data capture, 454-457 (see also changelogs) coordination with snapshot, 156 logical (row-based) replication, 160 statement-based replication, 158 trigger-based replication, 161 write-ahead log (WAL) shipping, 159 scalability limits, 493 loose coupling, 396, 419, 502 lost updates (see updates) LSM-trees (indexes), 78-79 comparison to B-trees, 83-85 Lucene (storage engine), 79 building indexes in batch processes, 411 similarity search, 88 Luigi (workflow scheduler), 402 LWW (see last write wins) M machine learning ethical considerations, 534 (see also ethics) iterative processing, 424 models derived from training data, 505 statistical and numerical algorithms, 428 MADlib (machine learning toolkit), 428 magic scaling sauce, 18 Mahout (machine learning toolkit), 428 maintainability, 18-22, 489 defined, 23 design principles for software systems, 19 evolvability (see evolvability) operability, 19 simplicity and managing complexity, 20 many-to-many relationships in document model versus relational model, 39 modeling as graphs, 49 many-to-one and many-to-many relationships, 33-36 many-to-one relationships, 34 MapReduce (batch processing), 390, 399-400 accessing external services within job, 404, 412 comparison to distributed databases designing for frequent faults, 417 diversity of processing models, 416 diversity of storage, 415 Index | 575 comparison to stream processing, 464 comparison to Unix, 413-414 disadvantages and limitations of, 419 fault tolerance, 406, 414, 422 higher-level tools, 403, 426 implementation in Hadoop, 400-403 the shuffle, 402 implementation in MongoDB, 46-48 machine learning, 428 map-side processing, 408-410 broadcast hash joins, 409 merge joins, 410 partitioned hash joins, 409 mapper and reducer functions, 399 materialization of intermediate state, 419-423 output of batch workflows, 411-413 building search indexes, 411 key-value stores, 412 reduce-side processing, 403-408 analysis of user activity events (exam‐ ple), 404 grouping records by same key, 406 handling skew, 407 sort-merge joins, 405 workflows, 402 marshalling (see encoding) massively parallel processing (MPP), 216 comparison to composing storage technolo‐ gies, 502 comparison to Hadoop, 414-418, 428 master-master replication (see multi-leader replication) master-slave replication (see leader-based repli‐ cation) materialization, 556 aggregate values, 101 conflicts, 251 intermediate state (batch processing), 420-423 materialized views, 101 as derived data, 386, 499-504 maintaining, using stream processing, 467, 475 Maven (Java build tool), 428 Maxwell (change data capture), 455 mean, 14 media monitoring, 467 median, 14 576 | Index meeting room booking (example), 249, 259, 521 membership services, 372 Memcached (caching server), 4, 89 memory in-memory databases, 88 durability, 227 serial transaction execution, 253 in-memory representation of data, 112 random bit-flips in, 529 use by indexes, 72, 77 memory barrier (CPU instruction), 338 MemSQL (database) in-memory storage, 89 
read committed isolation, 236 memtable (in LSM-trees), 78 Mercurial (version control system), 463 merge joins, MapReduce map-side, 410 mergeable persistent data structures, 174 merging sorted files, 76, 402, 405 Merkle trees, 532 Mesos (cluster manager), 418, 506 message brokers (see messaging systems) message-passing, 136-139 advantages over direct RPC, 137 distributed actor frameworks, 138 evolvability, 138 MessagePack (encoding format), 116 messages exactly-once semantics, 360, 476 loss of, 442 using total order broadcast, 348 messaging systems, 440-451 (see also streams) backpressure, buffering, or dropping mes‐ sages, 441 brokerless messaging, 442 event logs, 446-451 comparison to traditional messaging, 448, 451 consumer offsets, 449 replaying old messages, 451, 496, 498 slow consumers, 450 message brokers, 443-446 acknowledgements and redelivery, 445 comparison to event logs, 448, 451 multiple consumers of same topic, 444 reliability, 442 uniqueness in log-based messaging, 522 Meteor (web framework), 456 microbatching, 477, 495 microservices, 132 (see also services) causal dependencies across services, 493 loose coupling, 502 relation to batch/stream processors, 389, 508 Microsoft Azure Service Bus (messaging), 444 Azure Storage, 155, 398 Azure Stream Analytics, 466 DCOM (Distributed Component Object Model), 134 MSDTC (transaction coordinator), 356 Orleans (see Orleans) SQL Server (see SQL Server) migrating (rewriting) data, 40, 130, 461, 497 modulus operator (%), 210 MongoDB (database) aggregation pipeline, 48 atomic operations, 243 BSON, 41 document data model, 31 hash partitioning (sharding), 203-204 key-range partitioning, 202 lack of join support, 34, 42 leader-based replication, 153 MapReduce support, 46, 400 oplog parsing, 455, 456 partition splitting, 212 request routing, 216 secondary indexes, 207 Mongoriver (change data capture), 455 monitoring, 10, 19 monotonic clocks, 288 monotonic reads, 164 MPP (see massively parallel processing) MSMQ (messaging), 361 multi-column indexes, 87 multi-leader replication, 168-177 (see also replication) handling write conflicts, 171 conflict avoidance, 172 converging toward a consistent state, 172 custom conflict resolution logic, 173 determining what is a conflict, 174 linearizability, lack of, 333 replication topologies, 175-177 use cases, 168 clients with offline operation, 170 collaborative editing, 170 multi-datacenter replication, 168, 335 multi-object transactions, 228 need for, 231 Multi-Paxos (total order broadcast), 367 multi-table index cluster tables (Oracle), 41 multi-tenancy, 284 multi-version concurrency control (MVCC), 239, 266 detecting stale MVCC reads, 263 indexes and snapshot isolation, 241 mutual exclusion, 261 (see also locks) MySQL (database) binlog coordinates, 156 binlog parsing for change data capture, 455 circular replication topology, 175 consistent snapshots, 156 distributed transaction support, 361 InnoDB storage engine (see InnoDB) JSON support, 30, 42 leader-based replication, 153 performance of XA transactions, 360 row-based replication, 160 schema changes in, 40 snapshot isolation support, 242 (see also InnoDB) statement-based replication, 159 Tungsten Replicator (multi-leader replica‐ tion), 170 conflict detection, 177 N nanomsg (messaging library), 442 Narayana (transaction coordinator), 356 NATS (messaging), 137 near-real-time (nearline) processing, 390 (see also stream processing) Neo4j (database) Cypher query language, 52 graph data model, 50 Nephele (dataflow engine), 421 netcat (Unix tool), 397 
Netflix Chaos Monkey, 7, 280 Network Attached Storage (NAS), 146, 398 network model, 36 Index | 577 graph databases versus, 60 imperative query APIs, 46 Network Time Protocol (see NTP) networks congestion and queueing, 282 datacenter network topologies, 276 faults (see faults) linearizability and network delays, 338 network partitions, 279, 337 timeouts and unbounded delays, 281 next-key locking, 260 nodes (in graphs) (see vertices) nodes (processes), 556 handling outages in leader-based replica‐ tion, 156 system models for failure, 307 noisy neighbors, 284 nonblocking atomic commit, 359 nondeterministic operations accidental nondeterminism, 423 partial failures in distributed systems, 275 nonfunctional requirements, 22 nonrepeatable reads, 238 (see also read skew) normalization (data representation), 33, 556 executing joins, 39, 42, 403 foreign key references, 231 in systems of record, 386 versus denormalization, 462 NoSQL, 29, 499 transactions and, 223 Notation3 (N3), 56 npm (package manager), 428 NTP (Network Time Protocol), 287 accuracy, 289, 293 adjustments to monotonic clocks, 289 multiple server addresses, 306 numbers, in XML and JSON encodings, 114 O object-relational mapping (ORM) frameworks, 30 error handling and aborted transactions, 232 unsafe read-modify-write cycle code, 244 object-relational mismatch, 29 observer pattern, 506 offline systems, 390 (see also batch processing) 578 | Index stateful, offline-capable clients, 170, 511 offline-first applications, 511 offsets consumer offsets in partitioned logs, 449 messages in partitioned logs, 447 OLAP (online analytic processing), 91, 556 data cubes, 102 OLTP (online transaction processing), 90, 556 analytics queries versus, 411 workload characteristics, 253 one-to-many relationships, 30 JSON representation, 32 online systems, 389 (see also services) Oozie (workflow scheduler), 402 OpenAPI (service definition format), 133 OpenStack Nova (cloud infrastructure) use of ZooKeeper, 370 Swift (object storage), 398 operability, 19 operating systems versus databases, 499 operation identifiers, 518, 522 operational transformation, 174 operators, 421 flow of data between, 424 in stream processing, 464 optimistic concurrency control, 261 Oracle (database) distributed transaction support, 361 GoldenGate (change data capture), 161, 170, 455 lack of serializability, 226 leader-based replication, 153 multi-table index cluster tables, 41 not preventing write skew, 248 partitioned indexes, 209 PL/SQL language, 255 preventing lost updates, 245 read committed isolation, 236 Real Application Clusters (RAC), 330 recursive query support, 54 snapshot isolation support, 239, 242 TimesTen (in-memory database), 89 WAL-based replication, 160 XML support, 30 ordering, 339-352 by sequence numbers, 343-348 causal ordering, 339-343 partial order, 341 limits of total ordering, 493 total order broadcast, 348-352 Orleans (actor framework), 139 outliers (response time), 14 Oz (programming language), 504 P package managers, 428, 505 packet switching, 285 packets corruption of, 306 sending via UDP, 442 PageRank (algorithm), 49, 424 paging (see virtual memory) ParAccel (database), 93 parallel databases (see massively parallel pro‐ cessing) parallel execution of graph analysis algorithms, 426 queries in MPP databases, 216 Parquet (data format), 96, 131 (see also column-oriented storage) use in Hadoop, 414 partial failures, 275, 310 limping, 311 partial order, 341 partitioning, 199-218, 556 and replication, 200 in batch processing, 429 multi-partition operations, 514 
enforcing constraints, 522 secondary index maintenance, 495 of key-value data, 201-205 by key range, 202 skew and hot spots, 205 rebalancing partitions, 209-214 automatic or manual rebalancing, 213 problems with hash mod N, 210 using dynamic partitioning, 212 using fixed number of partitions, 210 using N partitions per node, 212 replication and, 147 request routing, 214-216 secondary indexes, 206-209 document-based partitioning, 206 term-based partitioning, 208 serial execution of transactions and, 255 Paxos (consensus algorithm), 366 ballot number, 368 Multi-Paxos (total order broadcast), 367 percentiles, 14, 556 calculating efficiently, 16 importance of high percentiles, 16 use in service level agreements (SLAs), 15 Percona XtraBackup (MySQL tool), 156 performance describing, 13 of distributed transactions, 360 of in-memory databases, 89 of linearizability, 338 of multi-leader replication, 169 perpetual inconsistency, 525 pessimistic concurrency control, 261 phantoms (transaction isolation), 250 materializing conflicts, 251 preventing, in serializability, 259 physical clocks (see clocks) pickle (Python), 113 Pig (dataflow language), 419, 427 replicated joins, 409 skewed joins, 407 workflows, 403 Pinball (workflow scheduler), 402 pipelined execution, 423 in Unix, 394 point in time, 287 polyglot persistence, 29 polystores, 501 PostgreSQL (database) BDR (multi-leader replication), 170 causal ordering of writes, 177 Bottled Water (change data capture), 455 Bucardo (trigger-based replication), 161, 173 distributed transaction support, 361 foreign data wrappers, 501 full text search support, 490 leader-based replication, 153 log sequence number, 156 MVCC implementation, 239, 241 PL/pgSQL language, 255 PostGIS geospatial indexes, 87 preventing lost updates, 245 preventing write skew, 248, 261 read committed isolation, 236 recursive query support, 54 representing graphs, 51 Index | 579 serializable snapshot isolation (SSI), 261 snapshot isolation support, 239, 242 WAL-based replication, 160 XML and JSON support, 30, 42 pre-splitting, 212 Precision Time Protocol (PTP), 290 predicate locks, 259 predictive analytics, 533-536 amplifying bias, 534 ethics of (see ethics) feedback loops, 536 preemption of datacenter resources, 418 of threads, 298 Pregel processing model, 425 primary keys, 85, 556 compound primary key (Cassandra), 204 primary-secondary replication (see leaderbased replication) privacy, 536-543 consent and freedom of choice, 538 data as assets and power, 540 deleting data, 463 ethical considerations (see ethics) legislation and self-regulation, 542 meaning of, 539 surveillance, 537 tracking behavioral data, 536 probabilistic algorithms, 16, 466 process pauses, 295-299 processing time (of events), 469 producers (message streams), 440 programming languages dataflow languages, 504 for stored procedures, 255 functional reactive programming (FRP), 504 logic programming, 504 Prolog (language), 61 (see also Datalog) promises (asynchronous operations), 135 property graphs, 50 Cypher query language, 52 Protocol Buffers (data format), 117-121 field tags and schema evolution, 120 provenance of data, 531 publish/subscribe model, 441 publishers (message streams), 440 punch card tabulating machines, 390 580 | Index pure functions, 48 putting computation near data, 400 Q Qpid (messaging), 444 quality of service (QoS), 285 Quantcast File System (distributed filesystem), 398 query languages, 42-48 aggregation pipeline, 48 CSS and XSL, 44 Cypher, 52 Datalog, 60 Juttle, 504 MapReduce querying, 46-48 
recursive SQL queries, 53 relational algebra and SQL, 42 SPARQL, 59 query optimizers, 37, 427 queueing delays (networks), 282 head-of-line blocking, 15 latency and response time, 14 queues (messaging), 137 quorums, 179-182, 556 for leaderless replication, 179 in consensus algorithms, 368 limitations of consistency, 181-183, 334 making decisions in distributed systems, 301 monitoring staleness, 182 multi-datacenter replication, 184 relying on durability, 309 sloppy quorums and hinted handoff, 183 R R-trees (indexes), 87 RabbitMQ (messaging), 137, 444 leader-based replication, 153 race conditions, 225 (see also concurrency) avoiding with linearizability, 331 caused by dual writes, 452 dirty writes, 235 in counter increments, 235 lost updates, 242-246 preventing with event logs, 462, 507 preventing with serializable isolation, 252 write skew, 246-251 Raft (consensus algorithm), 366 sensitivity to network problems, 369 term number, 368 use in etcd, 353 RAID (Redundant Array of Independent Disks), 7, 398 railways, schema migration on, 496 RAMCloud (in-memory storage), 89 ranking algorithms, 424 RDF (Resource Description Framework), 57 querying with SPARQL, 59 RDMA (Remote Direct Memory Access), 276 read committed isolation level, 234-237 implementing, 236 multi-version concurrency control (MVCC), 239 no dirty reads, 234 no dirty writes, 235 read path (derived data), 509 read repair (leaderless replication), 178 for linearizability, 335 read replicas (see leader-based replication) read skew (transaction isolation), 238, 266 as violation of causality, 340 read-after-write consistency, 163, 524 cross-device, 164 read-modify-write cycle, 243 read-scaling architecture, 161 reads as events, 513 real-time collaborative editing, 170 near-real-time processing, 390 (see also stream processing) publish/subscribe dataflow, 513 response time guarantees, 298 time-of-day clocks, 288 rebalancing partitions, 209-214, 556 (see also partitioning) automatic or manual rebalancing, 213 dynamic partitioning, 212 fixed number of partitions, 210 fixed number of partitions per node, 212 problems with hash mod N, 210 recency guarantee, 324 recommendation engines batch process outputs, 412 batch workflows, 403, 420 iterative processing, 424 statistical and numerical algorithms, 428 records, 399 events in stream processing, 440 recursive common table expressions (SQL), 54 redelivery (messaging), 445 Redis (database) atomic operations, 243 durability, 89 Lua scripting, 255 single-threaded execution, 253 usage example, 4 redundancy hardware components, 7 of derived data, 386 (see also derived data) Reed–Solomon codes (error correction), 398 refactoring, 22 (see also evolvability) regions (partitioning), 199 register (data structure), 325 relational data model, 28-42 comparison to document model, 38-42 graph queries in SQL, 53 in-memory databases with, 89 many-to-one and many-to-many relation‐ ships, 33 multi-object transactions, need for, 231 NoSQL as alternative to, 29 object-relational mismatch, 29 relational algebra and SQL, 42 versus document model convergence of models, 41 data locality, 41 relational databases eventual consistency, 162 history, 28 leader-based replication, 153 logical logs, 160 philosophy compared to Unix, 499, 501 schema changes, 40, 111, 130 statement-based replication, 158 use of B-tree indexes, 80 relationships (see edges) reliability, 6-10, 489 building a reliable system from unreliable components, 276 defined, 6, 22 hardware faults, 7 human errors, 9 importance of, 10 of messaging systems, 442 
Index | 581 software errors, 8 Remote Method Invocation (Java RMI), 134 remote procedure calls (RPCs), 134-136 (see also services) based on futures, 135 data encoding and evolution, 136 issues with, 134 using Avro, 126, 135 using Thrift, 135 versus message brokers, 137 repeatable reads (transaction isolation), 242 replicas, 152 replication, 151-193, 556 and durability, 227 chain replication, 155 conflict resolution and, 246 consistency properties, 161-167 consistent prefix reads, 165 monotonic reads, 164 reading your own writes, 162 in distributed filesystems, 398 leaderless, 177-191 detecting concurrent writes, 184-191 limitations of quorum consistency, 181-183, 334 sloppy quorums and hinted handoff, 183 monitoring staleness, 182 multi-leader, 168-177 across multiple datacenters, 168, 335 handling write conflicts, 171-175 replication topologies, 175-177 partitioning and, 147, 200 reasons for using, 145, 151 single-leader, 152-161 failover, 157 implementation of replication logs, 158-161 relation to consensus, 367 setting up new followers, 155 synchronous versus asynchronous, 153-155 state machine replication, 349, 452 using erasure coding, 398 with heterogeneous data systems, 453 replication logs (see logs) reprocessing data, 496, 498 (see also evolvability) from log-based messaging, 451 request routing, 214-216 582 | Index approaches to, 214 parallel query execution, 216 resilient systems, 6 (see also fault tolerance) response time as performance metric for services, 13, 389 guarantees on, 298 latency versus, 14 mean and percentiles, 14 user experience, 15 responsibility and accountability, 535 REST (Representational State Transfer), 133 (see also services) RethinkDB (database) document data model, 31 dynamic partitioning, 212 join support, 34, 42 key-range partitioning, 202 leader-based replication, 153 subscribing to changes, 456 Riak (database) Bitcask storage engine, 72 CRDTs, 174, 191 dotted version vectors, 191 gossip protocol, 216 hash partitioning, 203-204, 211 last-write-wins conflict resolution, 186 leaderless replication, 177 LevelDB storage engine, 78 linearizability, lack of, 335 multi-datacenter support, 184 preventing lost updates across replicas, 246 rebalancing, 213 search feature, 209 secondary indexes, 207 siblings (concurrently written values), 190 sloppy quorums, 184 ring buffers, 450 Ripple (cryptocurrency), 532 rockets, 10, 36, 305 RocksDB (storage engine), 78 leveled compaction, 79 rollbacks (transactions), 222 rolling upgrades, 8, 112 routing (see request routing) row-oriented storage, 96 row-based replication, 160 rowhammer (memory corruption), 529 RPCs (see remote procedure calls) Rubygems (package manager), 428 rules (Datalog), 61 S safety and liveness properties, 308 in consensus algorithms, 366 in transactions, 222 sagas (see compensating transactions) Samza (stream processor), 466, 467 fault tolerance, 479 streaming SQL support, 466 sandboxes, 9 SAP HANA (database), 93 scalability, 10-18, 489 approaches for coping with load, 17 defined, 22 describing load, 11 describing performance, 13 partitioning and, 199 replication and, 161 scaling up versus scaling out, 146 scaling out, 17, 146 (see also shared-nothing architecture) scaling up, 17, 146 scatter/gather approach, querying partitioned databases, 207 SCD (slowly changing dimension), 476 schema-on-read, 39 comparison to evolvable schema, 128 in distributed filesystems, 415 schema-on-write, 39 schemaless databases (see schema-on-read) schemas, 557 Avro, 122-127 reader determining writer’s schema, 125 schema 
evolution, 123 dynamically generated, 126 evolution of, 496 affecting application code, 111 compatibility checking, 126 in databases, 129-131 in message-passing, 138 in service calls, 136 flexibility in document model, 39 for analytics, 93-95 for JSON and XML, 115 merits of, 127 schema migration on railways, 496 Thrift and Protocol Buffers, 117-121 schema evolution, 120 traditional approach to design, fallacy in, 462 searches building search indexes in batch processes, 411 k-nearest neighbors, 429 on streams, 467 partitioned secondary indexes, 206 secondaries (see leader-based replication) secondary indexes, 85, 557 partitioning, 206-209, 217 document-partitioned, 206 index maintenance, 495 term-partitioned, 208 problems with dual writes, 452, 491 updating, transaction isolation and, 231 secondary sorts, 405 sed (Unix tool), 392 self-describing files, 127 self-joins, 480 self-validating systems, 530 semantic web, 57 semi-synchronous replication, 154 sequence number ordering, 343-348 generators, 294, 344 insufficiency for enforcing constraints, 347 Lamport timestamps, 345 use of timestamps, 291, 295, 345 sequential consistency, 351 serializability, 225, 233, 251-266, 557 linearizability versus, 329 pessimistic versus optimistic concurrency control, 261 serial execution, 252-256 partitioning, 255 using stored procedures, 253, 349 serializable snapshot isolation (SSI), 261-266 detecting stale MVCC reads, 263 detecting writes that affect prior reads, 264 distributed execution, 265, 364 performance of SSI, 265 preventing write skew, 262-265 two-phase locking (2PL), 257-261 index-range locks, 260 performance, 258 Serializable (Java), 113 Index | 583 serialization, 113 (see also encoding) service discovery, 135, 214, 372 using DNS, 216, 372 service level agreements (SLAs), 15 service-oriented architecture (SOA), 132 (see also services) services, 131-136 microservices, 132 causal dependencies across services, 493 loose coupling, 502 relation to batch/stream processors, 389, 508 remote procedure calls (RPCs), 134-136 issues with, 134 similarity to databases, 132 web services, 132, 135 session windows (stream processing), 472 (see also windows) sessionization, 407 sharding (see partitioning) shared mode (locks), 258 shared-disk architecture, 146, 398 shared-memory architecture, 146 shared-nothing architecture, 17, 146-147, 557 (see also replication) distributed filesystems, 398 (see also distributed filesystems) partitioning, 199 use of network, 277 sharks biting undersea cables, 279 counting (example), 46-48 finding (example), 42 website about (example), 44 shredding (in relational model), 38 siblings (concurrent values), 190, 246 (see also conflicts) similarity search edit distance, 88 genome data, 63 k-nearest neighbors, 429 single-leader replication (see leader-based rep‐ lication) single-threaded execution, 243, 252 in batch processing, 406, 421, 426 in stream processing, 448, 463, 522 size-tiered compaction, 79 skew, 557 584 | Index clock skew, 291-294, 334 in transaction isolation read skew, 238, 266 write skew, 246-251, 262-265 (see also write skew) meanings of, 238 unbalanced workload, 201 compensating for, 205 due to celebrities, 205 for time-series data, 203 in batch processing, 407 slaves (see leader-based replication) sliding windows (stream processing), 472 (see also windows) sloppy quorums, 183 (see also quorums) lack of linearizability, 334 slowly changing dimension (data warehouses), 476 smearing (leap seconds adjustments), 290 snapshots (databases) causal consistency, 340 computing 
derived data, 500 in change data capture, 455 serializable snapshot isolation (SSI), 261-266, 329 setting up a new replica, 156 snapshot isolation and repeatable read, 237-242 implementing with MVCC, 239 indexes and MVCC, 241 visibility rules, 240 synchronized clocks for global snapshots, 294 snowflake schemas, 95 SOAP, 133 (see also services) evolvability, 136 software bugs, 8 maintaining integrity, 529 solid state drives (SSDs) access patterns, 84 detecting corruption, 519, 530 faults in, 227 sequential write throughput, 75 Solr (search server) building indexes in batch processes, 411 document-partitioned indexes, 207 request routing, 216 usage example, 4 use of Lucene, 79 sort (Unix tool), 392, 394, 395 sort-merge joins (MapReduce), 405 Sorted String Tables (see SSTables) sorting sort order in column storage, 99 source of truth (see systems of record) Spanner (database) data locality, 41 snapshot isolation using clocks, 295 TrueTime API, 294 Spark (processing framework), 421-423 bytecode generation, 428 dataflow APIs, 427 fault tolerance, 422 for data warehouses, 93 GraphX API (graph processing), 425 machine learning, 428 query optimizer, 427 Spark Streaming, 466 microbatching, 477 stream processing on top of batch process‐ ing, 495 SPARQL (query language), 59 spatial algorithms, 429 split brain, 158, 557 in consensus algorithms, 352, 367 preventing, 322, 333 using fencing tokens to avoid, 302-304 spreadsheets, dataflow programming capabili‐ ties, 504 SQL (Structured Query Language), 21, 28, 43 advantages and limitations of, 416 distributed query execution, 48 graph queries in, 53 isolation levels standard, issues with, 242 query execution on Hadoop, 416 résumé (example), 30 SQL injection vulnerability, 305 SQL on Hadoop, 93 statement-based replication, 158 stored procedures, 255 SQL Server (database) data warehousing support, 93 distributed transaction support, 361 leader-based replication, 153 preventing lost updates, 245 preventing write skew, 248, 257 read committed isolation, 236 recursive query support, 54 serializable isolation, 257 snapshot isolation support, 239 T-SQL language, 255 XML support, 30 SQLstream (stream analytics), 466 SSDs (see solid state drives) SSTables (storage format), 76-79 advantages over hash indexes, 76 concatenated index, 204 constructing and maintaining, 78 making LSM-Tree from, 78 staleness (old data), 162 cross-channel timing dependencies, 331 in leaderless databases, 178 in multi-version concurrency control, 263 monitoring for, 182 of client state, 512 versus linearizability, 324 versus timeliness, 524 standbys (see leader-based replication) star replication topologies, 175 star schemas, 93-95 similarity to event sourcing, 458 Star Wars analogy (event time versus process‐ ing time), 469 state derived from log of immutable events, 459 deriving current state from the event log, 458 interplay between state changes and appli‐ cation code, 507 maintaining derived state, 495 maintenance by stream processor in streamstream joins, 473 observing derived state, 509-515 rebuilding after stream processor failure, 478 separation of application code and, 505 state machine replication, 349, 452 statement-based replication, 158 statically typed languages analogy to schema-on-write, 40 code generation and, 127 statistical and numerical algorithms, 428 StatsD (metrics aggregator), 442 stdin, stdout, 395, 396 Stellar (cryptocurrency), 532 Index | 585 stock market feeds, 442 STONITH (Shoot The Other Node In The Head), 158 stop-the-world (see garbage collection) storage 
composing data storage technologies, 499-504 diversity of, in MapReduce, 415 Storage Area Network (SAN), 146, 398 storage engines, 69-104 column-oriented, 95-101 column compression, 97-99 defined, 96 distinction between column families and, 99 Parquet, 96, 131 sort order in, 99-100 writing to, 101 comparing requirements for transaction processing and analytics, 90-96 in-memory storage, 88 durability, 227 row-oriented, 70-90 B-trees, 79-83 comparing B-trees and LSM-trees, 83-85 defined, 96 log-structured, 72-79 stored procedures, 161, 253-255, 557 and total order broadcast, 349 pros and cons of, 255 similarity to stream processors, 505 Storm (stream processor), 466 distributed RPC, 468, 514 Trident state handling, 478 straggler events, 470, 498 stream processing, 464-481, 557 accessing external services within job, 474, 477, 478, 517 combining with batch processing lambda architecture, 497 unifying technologies, 498 comparison to batch processing, 464 complex event processing (CEP), 465 fault tolerance, 476-479 atomic commit, 477 idempotence, 478 microbatching and checkpointing, 477 rebuilding state after a failure, 478 for data integration, 494-498 586 | Index maintaining derived state, 495 maintenance of materialized views, 467 messaging systems (see messaging systems) reasoning about time, 468-472 event time versus processing time, 469, 477, 498 knowing when window is ready, 470 types of windows, 472 relation to databases (see streams) relation to services, 508 search on streams, 467 single-threaded execution, 448, 463 stream analytics, 466 stream joins, 472-476 stream-stream join, 473 stream-table join, 473 table-table join, 474 time-dependence of, 475 streams, 440-451 end-to-end, pushing events to clients, 512 messaging systems (see messaging systems) processing (see stream processing) relation to databases, 451-464 (see also changelogs) API support for change streams, 456 change data capture, 454-457 derivative of state by time, 460 event sourcing, 457-459 keeping systems in sync, 452-453 philosophy of immutable events, 459-464 topics, 440 strict serializability, 329 strong consistency (see linearizability) strong one-copy serializability, 329 subjects, predicates, and objects (in triplestores), 55 subscribers (message streams), 440 (see also consumers) supercomputers, 275 surveillance, 537 (see also privacy) Swagger (service definition format), 133 swapping to disk (see virtual memory) synchronous networks, 285, 557 comparison to asynchronous networks, 284 formal model, 307 synchronous replication, 154, 557 chain replication, 155 conflict detection, 172 system models, 300, 306-310 assumptions in, 528 correctness of algorithms, 308 mapping to the real world, 309 safety and liveness, 308 systems of record, 386, 557 change data capture, 454, 491 treating event log as, 460 systems thinking, 536 T t-digest (algorithm), 16 table-table joins, 474 Tableau (data visualization software), 416 tail (Unix tool), 447 tail vertex (property graphs), 51 Tajo (query engine), 93 Tandem NonStop SQL (database), 200 TCP (Transmission Control Protocol), 277 comparison to circuit switching, 285 comparison to UDP, 283 connection failures, 280 flow control, 282, 441 packet checksums, 306, 519, 529 reliability and duplicate suppression, 517 retransmission timeouts, 284 use for transaction sessions, 229 telemetry (see monitoring) Teradata (database), 93, 200 term-partitioned indexes, 208, 217 termination (consensus), 365 Terrapin (database), 413 Tez (dataflow engine), 421-423 fault tolerance, 422 support by 
higher-level tools, 427 thrashing (out of memory), 297 threads (concurrency) actor model, 138, 468 (see also message-passing) atomic operations, 223 background threads, 73, 85 execution pauses, 286, 296-298 memory barriers, 338 preemption, 298 single (see single-threaded execution) three-phase commit, 359 Thrift (data format), 117-121 BinaryProtocol, 118 CompactProtocol, 119 field tags and schema evolution, 120 throughput, 13, 390 TIBCO, 137 Enterprise Message Service, 444 StreamBase (stream analytics), 466 time concurrency and, 187 cross-channel timing dependencies, 331 in distributed systems, 287-299 (see also clocks) clock synchronization and accuracy, 289 relying on synchronized clocks, 291-295 process pauses, 295-299 reasoning about, in stream processors, 468-472 event time versus processing time, 469, 477, 498 knowing when window is ready, 470 timestamp of events, 471 types of windows, 472 system models for distributed systems, 307 time-dependence in stream joins, 475 time-of-day clocks, 288 timeliness, 524 coordination-avoiding data systems, 528 correctness of dataflow systems, 525 timeouts, 279, 557 dynamic configuration of, 284 for failover, 158 length of, 281 timestamps, 343 assigning to events in stream processing, 471 for read-after-write consistency, 163 for transaction ordering, 295 insufficiency for enforcing constraints, 347 key range partitioning by, 203 Lamport, 345 logical, 494 ordering events, 291, 345 Titan (database), 50 tombstones, 74, 191, 456 topics (messaging), 137, 440 total order, 341, 557 limits of, 493 sequence numbers or timestamps, 344 total order broadcast, 348-352, 493, 522 consensus algorithms and, 366-368 Index | 587 implementation in ZooKeeper and etcd, 370 implementing with linearizable storage, 351 using, 349 using to implement linearizable storage, 350 tracking behavioral data, 536 (see also privacy) transaction coordinator (see coordinator) transaction manager (see coordinator) transaction processing, 28, 90-95 comparison to analytics, 91 comparison to data warehousing, 93 transactions, 221-267, 558 ACID properties of, 223 atomicity, 223 consistency, 224 durability, 226 isolation, 225 compensating (see compensating transac‐ tions) concept of, 222 distributed transactions, 352-364 avoiding, 492, 502, 521-528 failure amplification, 364, 495 in doubt/uncertain status, 358, 362 two-phase commit, 354-359 use of, 360-361 XA transactions, 361-364 OLTP versus analytics queries, 411 purpose of, 222 serializability, 251-266 actual serial execution, 252-256 pessimistic versus optimistic concur‐ rency control, 261 serializable snapshot isolation (SSI), 261-266 two-phase locking (2PL), 257-261 single-object and multi-object, 228-232 handling errors and aborts, 231 need for multi-object transactions, 231 single-object writes, 230 snapshot isolation (see snapshots) weak isolation levels, 233-251 preventing lost updates, 242-246 read committed, 234-238 transitive closure (graph algorithm), 424 trie (data structure), 88 triggers (databases), 161, 441 implementing change data capture, 455 implementing replication, 161 588 | Index triple-stores, 55-59 SPARQL query language, 59 tumbling windows (stream processing), 472 (see also windows) in microbatching, 477 tuple spaces (programming model), 507 Turtle (RDF data format), 56 Twitter constructing home timelines (example), 11, 462, 474, 511 DistributedLog (event log), 448 Finagle (RPC framework), 135 Snowflake (sequence number generator), 294 Summingbird (processing library), 497 two-phase commit (2PC), 353, 355-359, 558 
confusion with two-phase locking, 356 coordinator failure, 358 coordinator recovery, 363 how it works, 357 issues in practice, 363 performance cost, 360 transactions holding locks, 362 two-phase locking (2PL), 257-261, 329, 558 confusion with two-phase commit, 356 index-range locks, 260 performance of, 258 type checking, dynamic versus static, 40 U UDP (User Datagram Protocol) comparison to TCP, 283 multicast, 442 unbounded datasets, 439, 558 (see also streams) unbounded delays, 558 in networks, 282 process pauses, 296 unbundling databases, 499-515 composing data storage technologies, 499-504 federation versus unbundling, 501 need for high-level language, 503 designing applications around dataflow, 504-509 observing derived state, 509-515 materialized views and caching, 510 multi-partition data processing, 514 pushing state changes to clients, 512 uncertain (transaction status) (see in doubt) uniform consensus, 365 (see also consensus) uniform interfaces, 395 union type (in Avro), 125 uniq (Unix tool), 392 uniqueness constraints asynchronously checked, 526 requiring consensus, 521 requiring linearizability, 330 uniqueness in log-based messaging, 522 Unix philosophy, 394-397 command-line batch processing, 391-394 Unix pipes versus dataflow engines, 423 comparison to Hadoop, 413-414 comparison to relational databases, 499, 501 comparison to stream processing, 464 composability and uniform interfaces, 395 loose coupling, 396 pipes, 394 relation to Hadoop, 499 UPDATE statement (SQL), 40 updates preventing lost updates, 242-246 atomic write operations, 243 automatically detecting lost updates, 245 compare-and-set operations, 245 conflict resolution and replication, 246 using explicit locking, 244 preventing write skew, 246-251 V validity (consensus), 365 vBuckets (partitioning), 199 vector clocks, 191 (see also version vectors) vectorized processing, 99, 428 verification, 528-533 avoiding blind trust, 530 culture of, 530 designing for auditability, 531 end-to-end integrity checks, 531 tools for auditable data systems, 532 version control systems, reliance on immutable data, 463 version vectors, 177, 191 capturing causal dependencies, 343 versus vector clocks, 191 Vertica (database), 93 handling writes, 101 replicas using different sort orders, 100 vertical scaling (see scaling up) vertices (in graphs), 49 property graph model, 50 Viewstamped Replication (consensus algo‐ rithm), 366 view number, 368 virtual machines, 146 (see also cloud computing) context switches, 297 network performance, 282 noisy neighbors, 284 reliability in cloud services, 8 virtualized clocks in, 290 virtual memory process pauses due to page faults, 14, 297 versus memory management by databases, 89 VisiCalc (spreadsheets), 504 vnodes (partitioning), 199 Voice over IP (VoIP), 283 Voldemort (database) building read-only stores in batch processes, 413 hash partitioning, 203-204, 211 leaderless replication, 177 multi-datacenter support, 184 rebalancing, 213 reliance on read repair, 179 sloppy quorums, 184 VoltDB (database) cross-partition serializability, 256 deterministic stored procedures, 255 in-memory storage, 89 output streams, 456 secondary indexes, 207 serial execution of transactions, 253 statement-based replication, 159, 479 transactions in stream processing, 477 W WAL (write-ahead log), 82 web services (see services) Web Services Description Language (WSDL), 133 webhooks, 443 webMethods (messaging), 137 WebSocket (protocol), 512 Index | 589 windows (stream processing), 466, 468-472 infinite windows for changelogs, 
467, 474 knowing when all events have arrived, 470 stream joins within a window, 473 types of windows, 472 winners (conflict resolution), 173 WITH RECURSIVE syntax (SQL), 54 workflows (MapReduce), 402 outputs, 411-414 key-value stores, 412 search indexes, 411 with map-side joins, 410 working set, 393 write amplification, 84 write path (derived data), 509 write skew (transaction isolation), 246-251 characterizing, 246-251, 262 examples of, 247, 249 materializing conflicts, 251 occurrence in practice, 529 phantoms, 250 preventing in snapshot isolation, 262-265 in two-phase locking, 259-261 options for, 248 write-ahead log (WAL), 82, 159 writes (database) atomic write operations, 243 detecting writes affecting prior reads, 264 preventing dirty writes with read commit‐ ted, 235 WS-* framework, 133 (see also services) WS-AtomicTransaction (2PC), 355 590 | Index X XA transactions, 355, 361-364 heuristic decisions, 363 limitations of, 363 xargs (Unix tool), 392, 396 XML binary variants, 115 encoding RDF data, 57 for application data, issues with, 114 in relational databases, 30, 41 XSL/XPath, 45 Y Yahoo!

Derived data systems
Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes, and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs. Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized. You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”
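To make the fallback behavior concrete, here is a minimal Python sketch of the cache-as-derived-data pattern described above; the store names and data are invented for illustration and are not taken from any particular system:

    # Minimal sketch: a cache is derived data, so it can be dropped and
    # rebuilt from the source of truth at any time. Names are hypothetical.

    source_of_truth = {"user:1": {"name": "Ada"}, "user:2": {"name": "Grace"}}
    cache = {}  # derived, disposable

    def read(key):
        # Serve from the derived store if present; otherwise fall back to
        # the underlying store and repopulate the cache as a side effect.
        if key in cache:
            return cache[key]
        value = source_of_truth.get(key)
        if value is not None:
            cache[key] = value
        return value

    def rebuild_cache():
        # Losing derived data is not data loss: recreate it wholesale.
        cache.clear()
        cache.update(source_of_truth)

The same shape applies to indexes and materialized views: a deterministic transformation of the source that can always be recomputed.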


Remix: Making Art and Commerce Thrive in the Hybrid Economy by Lawrence Lessig

Amazon Web Services, Andrew Keen, Benjamin Mako Hill, Berlin Wall, Bernie Sanders, Brewster Kahle, Cass Sunstein, collaborative editing, commoditize, disintermediation, don't be evil, Erik Brynjolfsson, Internet Archive, invisible hand, Jeff Bezos, jimmy wales, Joi Ito, Kevin Kelly, Larry Wall, late fees, Mark Shuttleworth, Netflix Prize, Network effects, new economy, optical character recognition, PageRank, peer-to-peer, recommendation engine, revision control, Richard Stallman, Ronald Coase, Saturday Night Live, SETI@home, sharing economy, Silicon Valley, Skype, slashdot, Steve Jobs, The Nature of the Firm, thinkpad, transaction costs, VA Linux, yellow journalism

And so increasingly, we must ask how these different norms might be made to coexist. Jeff Jarvis, journalist and blogger, suggests companies “pay dividends back to [the] crowd” and avoid trying too hard “to control [the gathered] wisdom, and limit its use and the sharing of it.”19 Tapscott and Williams make the same recommendation: “platforms for participation will only remain viable for as long as all the stakeholders are adequately and appropriately compensated for their contributions—don’t expect a free ride forever.”20 The key word here is “appropriately.” Obviously, there must be adequate compensation. But the kind of compensation is the puzzle. Once again, the “sharing economy” of two lovers is one in which both need to be concerned that the other is “adequately and appropriately compensated for [his or her] contribution.”


pages: 398 words: 86,855

Bad Data Handbook by Q. Ethan McCallum

Amazon Mechanical Turk, asset allocation, barriers to entry, Benoit Mandelbrot, business intelligence, cellular automata, chief data officer, Chuck Templeton: OpenTable:, cloud computing, cognitive dissonance, combinatorial explosion, commoditize, conceptual framework, database schema, DevOps, en.wikipedia.org, Firefox, Flash crash, Gini coefficient, illegal immigration, iterative process, labor-force participation, loose coupling, natural language processing, Netflix Prize, quantitative trading / quantitative finance, recommendation engine, selection bias, sentiment analysis, statistical model, supply-chain management, survivorship bias, text mining, too big to fail, web application

Facebook is powered by its Open Graph, the “people and the connections they have to everything they care about.”[68] Facebook provides an API to access this social network and make it available for integration into other networked datasets. On Twitter, the network structure resulting from friends and followers leads to recommendations of “Who to follow.” On LinkedIn, network-based recommendations include “Jobs you may be interested in” and “Groups you may like.” The recommendation engine hunch.com is built on a “Taste Graph” that “uses signals from around the Web to map members with their predicted affinity for products, services, other people, websites, or just about anything, and customizes recommended topics for them.”[69] A search on Google can be considered a type of recommendation about which of possibly millions of search hits are most relevant for a particular query.
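As a rough sketch of how such network-based suggestions can be computed, the following Python snippet ranks “who to follow” candidates by how many of a user’s existing connections already follow them. The graph and the scoring rule are invented for illustration and are not the actual algorithm of any platform mentioned above:

    from collections import Counter

    # Toy follow graph: user -> set of accounts they follow (invented data).
    follows = {
        "alice": {"bob", "carol"},
        "bob": {"carol", "dave"},
        "carol": {"dave", "erin"},
    }

    def who_to_follow(user, graph):
        """Rank friends-of-friends by how many of the user's follows also follow them."""
        already = graph.get(user, set())
        counts = Counter()
        for friend in already:
            for candidate in graph.get(friend, set()):
                if candidate != user and candidate not in already:
                    counts[candidate] += 1
        return [name for name, _ in counts.most_common()]

    print(who_to_follow("alice", follows))  # e.g. ['dave', 'erin']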

Springer-Verlag New York, Inc., New York, NY, USA.
[63] http://en.wikipedia.org/wiki/File:KochFlake.svg
[64] http://blueprints.tinkerpop.com
[65] http://gremlin.tinkerpop.com
[66] http://gremlin.tinkerpop.com/Path-Pattern
[67] Ted G. Lewis. 2009. Network Science: Theory and Applications. Wiley Publishing.
[68] http://developers.facebook.com/docs/opengraph
[69] “eBay Acquires Recommendation Engine Hunch.com,” http://www.businesswire.com/news/home/20111121005831/en
[70] Brin, S.; Page, L. 1998. “The anatomy of a large-scale hypertextual Web search engine.” Computer Networks and ISDN Systems 30: 107–117.
Chapter 14. Myths of Cloud Computing
Steve Francia
Myths are an important and natural part of the emergence of any new technology, product, or idea as identified by the hype cycle.

I’ve written code to process accelerometer and hydrophone signals for analysis of dams and other large structures (as an undergraduate student in Engineering at Harvey Mudd College), analyzed recordings of calls from various species of bats (as a graduate student in Electrical Engineering at the University of Washington), built systems to visualize imaging sonar data (as a Graduate Research Assistant at the Applied Physics Lab), used large amounts of crawled web content to build content filtering systems (as the co-founder and CTO of N2H2, Inc.), designed intranet search systems for portal software (at DataChannel), and combined multiple sets of directory assistance data into a searchable website (as CTO at WhitePages.com). For the past five years or so, I’ve spent most of my time at Demand Media using a wide variety of data sources to build optimization systems for advertising and content recommendation systems, with various side excursions into large-scale data-driven search engine optimization (SEO) and search engine marketing (SEM). Most of my examples will be related to work I’ve done in Ad Optimization, Content Recommendation, SEO, and SEM. These areas, as with most, have their own terminology, so a few term definitions may be helpful.
Table 2-1. Term Definitions
PPC: Pay Per Click—Internet advertising model used to drive traffic to websites with a payment model based on clicks on advertisements.


pages: 320 words: 87,853

The Black Box Society: The Secret Algorithms That Control Money and Information by Frank Pasquale

Affordable Care Act / Obamacare, algorithmic trading, Amazon Mechanical Turk, American Legislative Exchange Council, asset-backed security, Atul Gawande, bank run, barriers to entry, basic income, Berlin Wall, Bernie Madoff, Black Swan, bonus culture, Brian Krebs, business cycle, call centre, Capital in the Twenty-First Century by Thomas Piketty, Chelsea Manning, Chuck Templeton: OpenTable:, cloud computing, collateralized debt obligation, computerized markets, corporate governance, Credit Default Swap, credit default swaps / collateralized debt obligations, crowdsourcing, cryptocurrency, Debian, don't be evil, drone strike, Edward Snowden, en.wikipedia.org, Fall of the Berlin Wall, Filter Bubble, financial innovation, financial thriller, fixed income, Flash crash, full employment, Goldman Sachs: Vampire Squid, Google Earth, Hernando de Soto, High speed trading, hiring and firing, housing crisis, informal economy, information asymmetry, information retrieval, interest rate swap, Internet of things, invisible hand, Jaron Lanier, Jeff Bezos, job automation, Julian Assange, Kevin Kelly, knowledge worker, Kodak vs Instagram, kremlinology, late fees, London Interbank Offered Rate, London Whale, Marc Andreessen, Mark Zuckerberg, mobile money, moral hazard, new economy, Nicholas Carr, offshore financial centre, PageRank, pattern recognition, Philip Mirowski, precariat, profit maximization, profit motive, quantitative easing, race to the bottom, recommendation engine, regulatory arbitrage, risk-adjusted returns, Satyajit Das, search engine result page, shareholder value, Silicon Valley, Snapchat, social intelligence, Spread Networks laid a new fibre optics cable between New York and Chicago, statistical arbitrage, statistical model, Steven Levy, the scientific method, too big to fail, transaction costs, two-sided market, universal basic income, Upton Sinclair, value at risk, WikiLeaks, zero-sum game

But what do we know about them? A bad credit score may cost a borrower hundreds of thousands of dollars, but he will never understand exactly how it was calculated. A predictive analytics firm may score someone as a “high cost” or “unreliable” worker, yet never tell her about the decision. More benignly, perhaps, these companies influence the choices we make ourselves. Recommendation engines at Amazon and YouTube affect an automated familiarity, gently suggesting offerings they think we’ll like. But don’t discount the significance of that “perhaps.” The economic, political, and cultural agendas behind their suggestions are hard to unravel. As middlemen, they specialize in shifting alliances, sometimes advancing the interests of customers, sometimes suppliers: all to orchestrate an online world that maximizes their own profits.

In short, they improve the quality of our daily lives in ways both noticeable and not. But where do we call a halt? Similar protocols also influence—invisibly—not only the route we take to a new restaurant, but which restaurant Google, Yelp, OpenTable, or Siri recommends to us. They might help us find reviews of the car we drive. Yet choosing a car, or even a restaurant, is not as straightforward as optimizing an engine or routing a drive. Does the recommendation engine take into account, say, whether the restaurant or car company gives its workers health benefits or maternity leave? Could we prompt it to do so? In their race for the most profitable methods of mapping social reality, the data scientists of Silicon Valley and Wall Street tend to treat recommendations as purely technical problems. The values and prerogatives that the encoded rules enact are hidden within black boxes.23 The most obvious question is: Are these algorithmic applications fair?

Even if it is the former, we should note that Google’s autosuggest feature may have automatically entered the word “bomb” after “pressure cooker” while he was typing—certainly many people would have done the search in the days after the Boston bombing merely to learn just how lethal such an attack could be. The police had no way of knowing whether Catalano had actually typed “bomb” himself, or accidentally clicked on it thanks to Google’s increasingly aggressive recommendation engines. See also Philip Bump, “Update: Now We Know Why Googling ‘Pressure Cookers’ Gets a Visit from the Cops,” The Wire, August 1, 2013, http://www.thewire.com/national/2013/08/government-knocking-doors-because-google-searches/67864/#.UfqCSAXy7zQ.facebook. 10. Martin Kuhn, Federal Dataveillance: Implications for Constitutional Privacy Protections (New York: LFB Scholarly Publishing, 2007), 178. 11.


pages: 475 words: 134,707

The Hype Machine: How Social Media Disrupts Our Elections, Our Economy, and Our Health--And How We Must Adapt by Sinan Aral

Airbnb, Albert Einstein, Any sufficiently advanced technology is indistinguishable from magic, augmented reality, Bernie Sanders, bitcoin, carbon footprint, Cass Sunstein, computer vision, coronavirus, correlation does not imply causation, COVID-19, Covid-19, crowdsourcing, cryptocurrency, death of newspapers, disintermediation, Donald Trump, Drosophila, Edward Snowden, Elon Musk, en.wikipedia.org, Erik Brynjolfsson, experimental subject, facts on the ground, Filter Bubble, global pandemic, hive mind, illegal immigration, income inequality, Kickstarter, knowledge worker, longitudinal study, low skilled workers, Lyft, Mahatma Gandhi, Mark Zuckerberg, Menlo Park, meta analysis, meta-analysis, Metcalfe’s law, mobile money, move fast and break things, move fast and break things, multi-sided market, Nate Silver, natural language processing, Network effects, performance metric, phenotype, recommendation engine, Robert Bork, Robert Shiller, Robert Shiller, Second Machine Age, sentiment analysis, shareholder value, skunkworks, Snapchat, social graph, social intelligence, social software, social web, statistical model, stem cell, Stephen Hawking, Steve Jobs, Telecommunications Act of 1996, The Chicago School, The Wisdom of Crowds, theory of mind, Tim Cook: Apple, Uber and Lyft, uber lyft, WikiLeaks, Yogi Berra

They are also the same communities in which disease outbreaks are occurring. In early 2019, social media platforms took notice. Instagram began blocking antivaccine-related hashtags like #vaccinescauseautism and #vaccinesarepoison. YouTube announced it is no longer allowing users to monetize antivaccine videos with ads. Pinterest banned searches for vaccine content. Facebook stopped showing pages and groups featuring antivaccine content and tweaked its recommendation engines to stop suggesting users join these groups. They also took down the Facebook ads that Larry Cook and others had been buying. The social platforms took similar steps to stem the spread of coronavirus fake news in 2020. Will these measures help slow the coronavirus, measles outbreaks, and future pandemics? Will fake news drive the spread of preventable diseases? Answers to these questions lie in the emerging science of fake news.

The Transparency Paradox
Immediately after the Cambridge Analytica scandal broke, in an interview by Martin Giles for the MIT Technology Review, I predicted the Hype Machine was about to face a dilemma that would pull it in competing directions. On the one hand, social media platforms would face pressure to be more open and transparent about their inner workings: how their trending and ad-targeting algorithms work, how misinformation diffuses through them, and whether recommendation engines increase polarization. The world wanted Facebook and Twitter to open the kimono and reveal how it all worked, so we could understand how to use and fix social media. On the other hand, the Hype Machine would also be pushed to protect our privacy and security, to lock down consumer data, to stop sharing private information with third parties, and to protect us from data breaches like Cambridge Analytica’s.

In this case, it’s important, because if people with more economic opportunity tend to develop more diverse networks (rather than the networks providing the opportunity), then the Hype Machine is more likely to reflect economic opportunity than to create it. How important is the machine in all this? Do we just replicate our existing social networks on social media, or do the Hype Machine’s recommendation engines provide us with new economic opportunities? Erik Brynjolfsson and I collaborated with Ya Xu and Guillaume Saint-Jacques of LinkedIn to find out. Guillaume was our PhD student at MIT before going to work for Ya, LinkedIn’s director of data science. The collaboration allowed us to test the cause and effect relationship between weak ties and job mobility. We used data from sixty randomized experiments conducted on LinkedIn’s “people you may know” (PYMK) algorithm, which recommends new connections to LinkedIn users.


pages: 201 words: 63,192

Graph Databases by Ian Robinson, Jim Webber, Emil Eifrem

Amazon Web Services, anti-pattern, bioinformatics, commoditize, corporate governance, create, read, update, delete, data acquisition, en.wikipedia.org, fault tolerance, linked data, loose coupling, Network effects, recommendation engine, semantic web, sentiment analysis, social graph, software as a service, SPARQL, web application

As in the social use case, making an effective recommendation depends on understanding the connections between things, as well as the quality and strength of those connections—all of which are best expressed as a property graph. Queries are primarily graph local, in that they start with one or more identifiable subjects, whether people or resources, and thereafter discover surrounding portions of the graph. Taken together, social networks and recommendation engines provide key differentiating capabilities in the areas of retail, recruitment, sentiment analysis, search, and knowledge management. Graphs are a good fit for the densely connected data structures germane to each of these areas; storing and querying this data using a graph database allows an application to surface end-user realtime results that reflect recent changes to the data, rather than pre-calculated, stale results.
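As a rough illustration of this kind of graph-local traversal, here is a minimal Python sketch (not from the book; the people, products, and adjacency-list representation are invented): start at one identifiable subject, hop out to the things connected to it, and rank what the surrounding portion of the graph suggests.

```python
from collections import Counter

# Tiny in-memory "graph": each person node points at the product nodes they bought.
# All names and data are illustrative only.
bought = {
    "alice": {"espresso maker", "grinder"},
    "bob": {"espresso maker", "milk frother"},
    "carol": {"grinder", "kettle", "milk frother"},
}

def recommend(person, graph):
    """Graph-local traversal: start at one person, hop to their products,
    hop back out to other buyers, and collect what those buyers also own."""
    mine = graph[person]
    scores = Counter()
    for other, theirs in graph.items():
        if other == person:
            continue
        overlap = mine & theirs            # strength of the connection to this buyer
        if overlap:
            for item in theirs - mine:     # things the person doesn't own yet
                scores[item] += len(overlap)
    return [item for item, _ in scores.most_common()]

print(recommend("alice", bought))  # e.g. ['milk frother', 'kettle']
```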

• Foreign key constraints add additional development and maintenance overhead just to make the database work.
• Sparse tables with nullable columns require special checking in code, despite the presence of a schema.
• Several expensive joins are needed just to discover what a customer bought.
• Reciprocal queries are even more costly. “What products did a customer buy?” is relatively cheap compared to “which customers bought this product?”, which is the basis of recommendation systems. We could introduce an index, but even with an index, recursive questions such as “which customers bought this product who also bought that product?” quickly become prohibitively expensive as the degree of recursion increases.
Relational databases struggle with highly-connected domains. To understand the cost of performing connected queries in a relational database, we’ll look at some simple and not-so-simple queries in a social network domain.
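To make the cost concrete, here is a minimal sketch of the reciprocal co-purchase query described above, written against an in-memory SQLite database. The schema and data are hypothetical, and each additional "who also bought ..." condition would require another self-join on the same table.

```python
import sqlite3

# Hypothetical schema: one row per (customer, product) purchase.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE purchases (customer_id INTEGER, product_id TEXT);
INSERT INTO purchases VALUES
  (1, 'espresso maker'), (1, 'grinder'),
  (2, 'espresso maker'), (2, 'milk frother'),
  (3, 'grinder'),        (3, 'milk frother');
""")

# "Which customers bought product A who also bought product B?"
# Every further "who also bought" clause adds another self-join like table b.
rows = con.execute("""
    SELECT DISTINCT a.customer_id
    FROM purchases AS a
    JOIN purchases AS b ON b.customer_id = a.customer_id
    WHERE a.product_id = 'espresso maker'
      AND b.product_id = 'grinder';
""").fetchall()
print(rows)  # [(1,)]
```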


pages: 481 words: 125,946

What to Think About Machines That Think: Today's Leading Thinkers on the Age of Machine Intelligence by John Brockman

agricultural Revolution, AI winter, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, algorithmic trading, artificial general intelligence, augmented reality, autonomous vehicles, basic income, bitcoin, blockchain, clean water, cognitive dissonance, Colonization of Mars, complexity theory, computer age, computer vision, constrained optimization, corporate personhood, cosmological principle, cryptocurrency, cuban missile crisis, Danny Hillis, dark matter, discrete time, Douglas Engelbart, Elon Musk, Emanuel Derman, endowment effect, epigenetics, Ernest Rutherford, experimental economics, Flash crash, friendly AI, functional fixedness, global pandemic, Google Glasses, hive mind, income inequality, information trail, Internet of things, invention of writing, iterative process, Jaron Lanier, job automation, Johannes Kepler, John Markoff, John von Neumann, Kevin Kelly, knowledge worker, loose coupling, microbiome, Moneyball by Michael Lewis explains big data, natural language processing, Network effects, Norbert Wiener, pattern recognition, Peter Singer: altruism, phenotype, planetary scale, Ray Kurzweil, recommendation engine, Republic of Letters, RFID, Richard Thaler, Rory Sutherland, Satyajit Das, Search for Extraterrestrial Intelligence, self-driving car, sharing economy, Silicon Valley, Skype, smart contracts, social intelligence, speech recognition, statistical model, stem cell, Stephen Hawking, Steve Jobs, Steven Pinker, Stewart Brand, strong AI, Stuxnet, superintelligent machines, supervolcano, the scientific method, The Wisdom of Crowds, theory of mind, Thorstein Veblen, too big to fail, Turing machine, Turing test, Von Neumann architecture, Watson beat the top human players on Jeopardy!, Y2K

Is it possible to create an artificial mentor for each student? We already have recommender systems on the Internet that tell us, “If you liked X, you might also like Y,” based on data of many others with similar patterns of preference. Someday the mind of each student may be tracked from childhood by a personalized deep-learning system. To achieve this level of understanding of a human mind is beyond the capabilities of current technology, but there are already efforts at Facebook to use their vast social database of friends, photos, and likes to create a Theory of Mind for every person on the planet. So my prediction is that as more and more cognitive appliances, like chess-playing programs and recommender systems, are devised, humans will become smarter and more capable.
SHALLOW LEARNING
SETH LLOYD
Professor of quantum mechanical engineering, MIT; author, Programming the Universe
Pity the poor folks at the National Security Agency: They’re spying on everyone (quelle surprise!)

Conceptually, autonomous or artificial intelligence systems can develop in two ways: either as an extension of human thinking or as radically new thinking. Call the first “Humanoid Thinking,” or Humanoid AI, and the second “Alien Thinking,” or Alien AI. Almost all AI today is Humanoid Thinking. We use AI to solve problems too difficult, time-consuming, or boring for our limited brains to process: electrical-grid balancing, recommendation engines, self-driving cars, face recognition, trading algorithms, and the like. These artificial agents work in narrow domains with clear goals their human creators specify. Such AI aims to accomplish human objectives—often better, with fewer cognitive errors, distractions, outbursts of bad temper, or processing limitations. In a couple of decades, AI agents might serve as virtual insurance sellers, doctors, psychotherapists, and maybe even virtual spouses and children.

He implies that the Age of the Thinking Machine is resulting in ossification rather than renewal. As our lives become increasingly recorded, archived, and accessed, we have become cannibals driven to consume our history and terrified of transgressing its established norms. To some extent, the future is blocked to us; we’re stuck in stasis; we’re stuck with a version of ourselves that’s becoming increasingly narrow. No thanks to recent tools such as “recommender systems,” we’re lodged in a seemingly endless feedback loop of “If you liked that, you’ll love this.” As we might become increasingly stuck in Curtis’s idea of the “you-loop,” so the nature of what it means to be human might be compromised by job-hogging machines that will render many of us obsolete. This Edge Question points to the next chapter in human history/evolution; we’re facing the beginning of a new definition of man, a new civilization.


pages: 579 words: 160,351

Breaking News: The Remaking of Journalism and Why It Matters Now by Alan Rusbridger

accounting loophole / creative accounting, Airbnb, banking crisis, Bernie Sanders, Boris Johnson, centre right, Chelsea Manning, citizen journalism, cross-subsidies, crowdsourcing, David Attenborough, David Brooks, death of newspapers, Donald Trump, Doomsday Book, Double Irish / Dutch Sandwich, Downton Abbey, Edward Snowden, Etonian, Filter Bubble, forensic accounting, Frank Gehry, future of journalism, G4S, high net worth, invention of movable type, invention of the printing press, Jeff Bezos, jimmy wales, Julian Assange, Mark Zuckerberg, Menlo Park, natural language processing, New Journalism, offshore financial centre, oil shale / tar sands, open borders, packet switching, Panopticon Jeremy Bentham, pre–internet, ransomware, recommendation engine, Ruby on Rails, sexual politics, Silicon Valley, Skype, Snapchat, social web, Socratic dialogue, sovereign wealth fund, speech recognition, Steve Jobs, The Wisdom of Crowds, Tim Cook: Apple, traveling salesman, upwardly mobile, WikiLeaks

The front page of Technorati – the Google of the blogosphere – told us that it was now tracking 24.5 million blogs and 1.8 billion links. Web 2.0 – the thing Emily had warned was going to take over the world – was now called social media. The GMG CEO Carolyn McCall and I took another swing to the West Coast to see what was on the horizon. We dropped in on Flickr, the picture-sharing platform; on Yahoo; on Google; on Topix.net, a content aggregator in Palo Alto. We had drinks with the founders of Digg, a social recommendation platform; tea with Knight Ridder in San Jose; coffee with Real Networks and then on to Microsoft in Seattle. So many people trying so many different things; vast sums of money in play; the speed of development; the seeming impossibility of picking who would be the next big thing and who, in a couple of months, would have shut up shop or sold out. What we were doing had got us noticed on the West Coast: everywhere Carolyn and I went people wanted to know when we would do more in America and whether we could partner with them in any way.


pages: 1,237 words: 227,370

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

active measures, Amazon Web Services, bitcoin, blockchain, business intelligence, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, database schema, DevOps, distributed ledger, Donald Knuth, Edward Snowden, Ethereum, ethereum blockchain, fault tolerance, finite state, Flash crash, full text search, general-purpose programming language, informal economy, information retrieval, Infrastructure as a Service, Internet of things, iterative process, John von Neumann, Kubernetes, loose coupling, Marc Andreessen, microservices, natural language processing, Network effects, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, statistical model, undersea cable, web application, WebSocket, wikimedia commons

Graphs and Iterative Processing
In “Graph-Like Data Models” we discussed using graphs for modeling data, and using graph query languages to traverse the edges and vertices in a graph. The discussion in Chapter 2 was focused around OLTP-style use: quickly executing queries to find a small number of vertices matching certain criteria. It is also interesting to look at graphs in a batch processing context, where the goal is to perform some kind of offline processing or analysis on an entire graph. This need often arises in machine learning applications such as recommendation engines, or in ranking systems. For example, one of the most famous graph analysis algorithms is PageRank [69], which tries to estimate the popularity of a web page based on what other web pages link to it. It is used as part of the formula that determines the order in which web search engines present their results.
Note: Dataflow engines like Spark, Flink, and Tez (see “Materialization of Intermediate State”) typically arrange the operators in a job as a directed acyclic graph (DAG).
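A minimal sketch of the PageRank idea mentioned above, as a plain power iteration over an adjacency dictionary (an illustrative toy, not the formulation in the book or the batch implementations it discusses):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over {page: [pages it links to]}.
    Pages with no outgoing links simply don't redistribute their rank here."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))
```

In a real batch setting the same update is expressed as repeated joins or message passing over the whole graph rather than an in-memory loop, but the per-iteration arithmetic is the same.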

Derived data systems
Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes, and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs. Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized. You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”
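A minimal read-through cache sketch of the point above that derived data can always be recreated from its source (the keys and values are invented):

```python
# The cache is derived data: a miss (or a wipe) is repaired from the source of truth.
database = {"user:1": {"name": "Ada"}, "user:2": {"name": "Lin"}}
cache = {}

def get(key):
    if key in cache:                 # serve the derived copy if present
        return cache[key]
    value = database[key]            # fall back to the underlying store
    cache[key] = value               # re-derive (warm) the cache entry
    return value

print(get("user:1"))  # miss: read from the database, then cached
print(get("user:1"))  # hit: served from the cache
```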

Therefore, one job in a workflow can only start when the prior jobs—that is, the jobs that produce its input directories—have completed successfully. To handle these dependencies between job executions, various workflow schedulers for Hadoop have been developed, including Oozie, Azkaban, Luigi, Airflow, and Pinball [28]. These schedulers also have management features that are useful when maintaining a large collection of batch jobs. Workflows consisting of 50 to 100 MapReduce jobs are common when building recommendation systems [29], and in a large organization, many different teams may be running different jobs that read each other’s output. Tool support is important for managing such complex dataflows. Various higher-level tools for Hadoop, such as Pig [30], Hive [31], Cascading [32], Crunch [33], and FlumeJava [34], also set up workflows of multiple MapReduce stages that are automatically wired together appropriately.
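As a hedged sketch of what declaring such dependencies looks like in one of these schedulers, here is a toy Airflow-style DAG (assuming Airflow 2.x; the task names, jar names, and paths are invented, and a real recommendation pipeline would have many more stages):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical three-stage pipeline: each task only starts once the jobs
# that produce its input directories have completed successfully.
with DAG(
    dag_id="recommendations_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_events = BashOperator(
        task_id="extract_events",
        bash_command="hadoop jar extract.jar /logs /staging/events",
    )
    build_features = BashOperator(
        task_id="build_features",
        bash_command="hadoop jar features.jar /staging/events /staging/features",
    )
    train_model = BashOperator(
        task_id="train_model",
        bash_command="hadoop jar train.jar /staging/features /models/latest",
    )

    # Declares the input/output dependency chain the scheduler enforces.
    extract_events >> build_features >> train_model
```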


pages: 518 words: 49,555

Designing Social Interfaces by Christian Crumlish, Erin Malone

A Pattern Language, Amazon Mechanical Turk, anti-pattern, barriers to entry, c2.com, carbon footprint, cloud computing, collaborative editing, creative destruction, crowdsourcing, en.wikipedia.org, Firefox, game design, ghettoisation, Howard Rheingold, hypertext link, if you build it, they will come, Merlin Mann, Nate Silver, Network effects, Potemkin village, recommendation engine, RFC: Request For Comment, semantic web, SETI@home, Skype, slashdot, social graph, social software, social web, source of truth, stealth mode startup, Stewart Brand, telepresence, The Wisdom of Crowds, web application

And TiVo’s whole user experience (including user-education movies that ship with the unit, its printed manual, and, heck, the dang remote has ’em hardcoded on there) is oriented around the thumbs-up and -down voting. I’d venture that thumb-voting and the recommender system are a huge part of why many people buy TiVo in the first place. (OK, that plus “pause live TV.”) Items with a great deal of persistence (on the extreme end are real-world establishments, such as restaurants or businesses) make excellent candidates for rateability. Furthermore, the types of ratings we can ask for may be more involved. Because these establishments will persist, we can be reasonably sure that others will always come along afterward and benefit from the work that the community has put into the item. When it comes to explicitly input recommender systems, we should acknowledge the limitations of folks’ interest in “feeding the machine.” If they understand the benefit, and they think that the work they’ll put in will at some point be worth something to them, then folks will play along.

This network can give rich social rewards to those who participate; however, more and more participants are finding that the rewards extend beyond just being social and discovering that the connectedness and serendipity of ambient intimacy can bring great professional gains as well. These days, ambient intimacy plays many roles in my life: it has stopped me from missing an important international flight and helped me keep sane whilst at home with a small baby. It is my outsourced tech support resource, my recommendation engine, my news filter. Twitter lets me virtually attend conferences I can’t get to but am interested in. But most valuable of all, it has allowed me to create, maintain, and even build professional and personal relationships with people in my field whose work I admire and from whom I have been able to learn and develop as a professional. So, although the question may be “what are you doing?” and perhaps you don’t really care, know that there is much more going on here than just a status update.


pages: 283 words: 85,824

The People's Platform: Taking Back Power and Culture in the Digital Age by Astra Taylor

A Declaration of the Independence of Cyberspace, American Legislative Exchange Council, Andrew Keen, barriers to entry, Berlin Wall, big-box store, Brewster Kahle, citizen journalism, cloud computing, collateralized debt obligation, Community Supported Agriculture, conceptual framework, corporate social responsibility, creative destruction, cross-subsidies, crowdsourcing, David Brooks, digital Maoism, disintermediation, don't be evil, Donald Trump, Edward Snowden, Fall of the Berlin Wall, Filter Bubble, future of journalism, George Gilder, Google Chrome, Google Glasses, hive mind, income inequality, informal economy, Internet Archive, Internet of things, invisible hand, Jane Jacobs, Jaron Lanier, Jeff Bezos, job automation, John Markoff, Julian Assange, Kevin Kelly, Kickstarter, knowledge worker, Mark Zuckerberg, means of production, Metcalfe’s law, Naomi Klein, Narrative Science, Network effects, new economy, New Journalism, New Urbanism, Nicholas Carr, oil rush, peer-to-peer, Peter Thiel, plutocrats, Plutocrats, post-work, pre–internet, profit motive, recommendation engine, Richard Florida, Richard Stallman, self-driving car, shareholder value, sharing economy, Silicon Valley, Silicon Valley ideology, slashdot, Slavoj Žižek, Snapchat, social graph, Steve Jobs, Stewart Brand, technoutopianism, trade route, Whole Earth Catalog, WikiLeaks, winner-take-all economy, Works Progress Administration, young professional

A more democratic culture is one where previously excluded populations are given the material means to fully engage. To create a culture that is more diverse and inclusive, we have to pioneer ways of addressing discrimination and bias head-on, despite the difficulties of applying traditional methods of mitigating prejudice to digital networks. We have to shape our tools of discovery, the recommendation engines and personalization filters, so they do more than reinforce our prior choices and private bubbles. Finally, if we want a culture that is more resistant to the short-term expectations of corporate shareholders and the whims of marketers, we have to invest in noncommercial enterprises. There is no shortage of good ideas. By not experimenting, we court disillusionment. The Internet was supposed to be free and ubiquitous, but a cable cartel would rather rake in profits than provide universal service.

,” Wired, blog post, November 15, 2008, http://www.longtail.com/the_long_tail/2008/11/does-the-long-t.html.
35. Fang Wu and Bernardo A. Huberman, “The Persistence Paradox,” First Monday 15, nos. 1–4 (January 2010).
36. James Evans, “Electronic Publication and the Narrowing of Science and Scholarship,” Science 321, no. 5887 (July 18, 2008): 395–99.
37. Daniel M. Fleder and Kartik Hosanagar, “Blockbuster Culture’s Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity,” Management Science 55, no. 5 (May 2009): 697–712.
38. Evan Hughes, “Here’s How Amazon Self-Destructs,” Salon, July 19, 2013.
39. Gary Flake et al., “Winners Don’t Take All: Characterizing the Competition for Links on the Web,” Proceedings of the National Academy of Sciences 99, no. 8 (April 16, 2002).
40. Eli Pariser, The Filter Bubble: What the Internet Is Hiding from You (New York: Penguin Press, 2011), 128.
41.


pages: 254 words: 79,052

Evil by Design: Interaction Design to Lead Us Into Temptation by Chris Nodder

4chan, affirmative action, Amazon Mechanical Turk, cognitive dissonance, crowdsourcing, Daniel Kahneman / Amos Tversky, Donald Trump, en.wikipedia.org, endowment effect, game design, haute couture, jimmy wales, Jony Ive, Kickstarter, late fees, loss aversion, Mark Zuckerberg, meta analysis, meta-analysis, Milgram experiment, Netflix Prize, Nick Leeson, Occupy movement, pets.com, price anchoring, recommendation engine, Rory Sutherland, Silicon Valley, Stanford prison experiment, stealth mode startup, Steve Jobs, telemarketer, Tim Cook: Apple, trickle-down economics, upwardly mobile

To reduce the confusion caused by the number of options while still retaining the perception of quality, many sites employ recommendation engines or filters. Recommendation engines provide a small set of options based on either comparison with prior behavior or on answers to a set of preference questions. Netflix uses a recommendation engine to suggest new movies based on ones that customers have already watched. Its business is so dependent upon this functionality that it recently offered a one million dollar prize to anyone who could increase the accuracy of the engine by more than 10 percent. Currently, 75 percent of movies watched on Netflix come from a recommendation made by the site. Recommendation engines are a great way to limit choice from an otherwise overwhelming quantity of items. (Netflix.com) Filters rely less on preference algorithms and more on on-screen choices.

However, maximizers like knowing that they chose from all the available options, so just presenting three alternatives may not be sufficient. The problem here is that they may look to other sellers for more choices. So the trick is to demonstrate that you have sufficient options to keep the maximizers happy but also provide tools that allow both the maximizers and the satisficers to find the options they want quickly. The three techniques you can use (alone or in combination) are to present many compatible choices, to use a recommendation engine or filter, and to offer a best choice guarantee. Brands that offer greater variety of compatible (that is, focused and internally consistent) options are perceived as having greater commitment and expertise in the category, which, in turn, enhances their perceived quality and purchase likelihood. When you want to increase the perceived importance of making the decision, allow users to choose between multiple similar options (all with positive outcomes for you).

How to design for fewer options If you want users to make a quick decision about your services, don’t give them too many options. More choices lead to more procrastination. Conversely, if you want to increase the perceived importance of a decision, or if customization is important, ensure that the only choices available to users are between multiple compatible options within narrow boundaries. If you have a larger number of items, use a recommendation engine or filters to quickly bring the number down to a manageable set. If you can’t easily reduce the number of available items, speed people to a decision by reassuring them with a best-choice guarantee. Pre-pick your preferred option Prime people so that they are open to accepting the choice you highlight. Psychologists have known for a long time that if they show you specific words or pictures beforehand, you’ll find it easier to recall those items or related ones in a later test, even after you have consciously forgotten the specific words.


pages: 406 words: 88,820

Television disrupted: the transition from network to networked TV by Shelly Palmer

barriers to entry, call centre, commoditize, disintermediation, en.wikipedia.org, hypertext link, interchangeable parts, invention of movable type, Irwin Jacobs: Qualcomm, James Watt: steam engine, Leonard Kleinrock, linear programming, Marc Andreessen, market design, Metcalfe’s law, pattern recognition, peer-to-peer, recommendation engine, Saturday Night Live, shareholder value, Skype, spectrum auction, Steve Jobs, subscription business, Telecommunications Act of 1996, There's no reason for any individual to have a computer in his home - Ken Olsen, Vickrey auction, Vilfredo Pareto, yield management

We could probably list dozens of reasons why a person might choose to be his or her own program director. The key problem with on-demand technology is not desire; it is complexity. It’s just too hard for the average person to do. Now, making a playlist in iTunes could not be simpler. But, putting your iPod in shuffle mode is actually easier, and it is also the path of least resistance. There are other factors that help with playlist creation. Recommendation engines and collaborative filtering like Amazon’s “if you like this … you might also like …” are good ways to help people pick the right stuff for their playlists. Consumers can also skew shuffle modes, setting them to play the content they manually play the most more often than the content they play less often. Of course, all of this technology requires consumers to collect all of their media into one place.
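A skewed shuffle of the kind described above can be as simple as weighting the random pick by play counts, so heavily played tracks come up more often without ever excluding the rest. A minimal sketch (track names and counts invented):

```python
import random

# Play counts stand in for "what the listener manually plays most often".
play_counts = {"Track A": 30, "Track B": 12, "Track C": 3, "Track D": 1}

def skewed_shuffle_pick(counts):
    """Pick the next track at random, weighted by how often it has been played."""
    tracks = list(counts)
    weights = [counts[t] + 1 for t in tracks]  # +1 keeps never-played tracks possible
    return random.choices(tracks, weights=weights, k=1)[0]

print([skewed_shuffle_pick(play_counts) for _ in range(5)])
```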

You can (and should) ask the same question about high traffic Web sites like Google, Yahoo!, MSN, Amazon, eBay, and of course, about every existing broadcast and cable network. A trip to the video section of the Apple Music Store through iTunes is a very interesting experience, particularly when you see how the interface handles show branding vs. network branding.
Social Search Solution
Another probable future is Tim Halle’s vision of a “social search,” a recommendation system that will emerge from social networking sites. Of course, the biggest social networking sites like friendster.com or myspace.com are also big brands, so this may be just another permutation of branded search. (See “Folksonomy” in Chapter 6.)


pages: 285 words: 86,853

What Algorithms Want: Imagination in the Age of Computing by Ed Finn

Airbnb, Albert Einstein, algorithmic trading, Amazon Mechanical Turk, Amazon Web Services, bitcoin, blockchain, Chuck Templeton: OpenTable:, Claude Shannon: information theory, commoditize, Credit Default Swap, crowdsourcing, cryptocurrency, disruptive innovation, Donald Knuth, Douglas Engelbart, Douglas Engelbart, Elon Musk, factory automation, fiat currency, Filter Bubble, Flash crash, game design, Google Glasses, Google X / Alphabet X, High speed trading, hiring and firing, invisible hand, Isaac Newton, iterative process, Jaron Lanier, Jeff Bezos, job automation, John Conway, John Markoff, Just-in-time delivery, Kickstarter, late fees, lifelogging, Loebner Prize, Lyft, Mother of all demos, Nate Silver, natural language processing, Netflix Prize, new economy, Nicholas Carr, Norbert Wiener, PageRank, peer-to-peer, Peter Thiel, Ray Kurzweil, recommendation engine, Republic of Letters, ride hailing / ride sharing, Satoshi Nakamoto, self-driving car, sharing economy, Silicon Valley, Silicon Valley ideology, Silicon Valley startup, social graph, software studies, speech recognition, statistical model, Steve Jobs, Steven Levy, Stewart Brand, supply-chain management, TaskRabbit, technological singularity, technoutopianism, The Coming Technological Singularity, the scientific method, The Signal and the Noise by Nate Silver, The Structural Transformation of the Public Sphere, The Wealth of Nations by Adam Smith, transaction costs, traveling salesman, Turing machine, Turing test, Uber and Lyft, Uber for X, uber lyft, urban planning, Vannevar Bush, Vernor Vinge, wage slave

Going farther from shore, the deep waters of algorithmic imagination draw us relentlessly back toward ourselves and the mysterious origins of cognition, inspiration, and serendipity that drive creative work. How are computational systems reinventing, channeling, or modulating those processes? On an individual level this is a straightforward extension of technics: when does the memory bank, the virtual assistant, or the recommendation engine deserve credit in the creative process? These tools manage cognition, inspiration, and serendipity for us, generating conversation and intellectual connection in our social media streams, our digital workspaces and notebooks, and more broadly, in the horizon of visible knowledge. The writer using a word processor to manage drafts; the scientist using research databases and citation tools to manage a field of professional knowledge; the artist using image editing software, photo sharing tools, and a virtual notebook to track observations—all of these creative processes depend on tools that are increasingly active, occasionally manipulative agents in their own use.

At the same time, we are deeply compelled by these abstracting systems, by the romance of clean interfaces and tidy ontologies. Even with thousands of human hours encoded into its recommendations, Netflix presents a seamless computational facade, because we have arrived at a stage where many of us will trust a strange computer’s suggestions more than we will trust a stranger’s. The rhetoric of the recommendation system is so successful because it black boxes the task of judgment, asking us to trust the efficacy of personalization embedded in the algorithm. By contrast, reading movie critics or browsing sites like IMDb or Rotten Tomatoes requires us to evaluate the evaluators in a much more complicated, human way, measuring the applicability of advice generated by other personalities who might not share our tastes.


pages: 504 words: 67,845

Designing Web Interfaces: Principles and Patterns for Rich Interactions by Bill Scott, Theresa Neil

A Pattern Language, anti-pattern, en.wikipedia.org, Firefox, recommendation engine, Ruby on Rails, Silicon Valley, web application

An alternate approach would be to hide them and show them on mouse hover (we will discuss this approach in the next section). It turns out that voting and rating systems are the most common places to make tools always visible. Netflix was the earliest to use a one-click rating system (Figure 4-4). Figure 4-4. Netflix star ratings are always visible Just as with Digg, rating movies is central to the health of Netflix. The Cinematch™ recommendation engine is driven largely by the user's ratings. So a clear call to action (to rate) is important. Not only do the stars serve as a strong call to action to rate movies, but they also provide important information for the other in-context tool: the "Add" button. Adding movies to your movie-shipping queue is key to having a good experience with the Netflix service. Relative importance One way to clarify this process is to decide on the relative importance of each exposed action.

Quick and easy The Gap integrates the shopping cart into its entire site as a drop-down shade. In fact, the Gap, Old Navy, Banana Republic, and PiperLime all share the same Inline Assistant Process-style shopping cart. The Gap is betting that making it quick and easy to add items to the cart across four stores will equal more sales. Additional step Amazon, on the other hand, is betting on its recommendation engine. By going to a second page, Amazon can display other shirts like the one added—as well as advertise the Amazon.com Visa card (Figure 8-8). Figure 8-8. Amazon shows recommendations when confirming an add to its shopping cart Which is the better experience? The Gap seems to be the clear winner in pure user experience. But which brings in more money? It's a question we cannot answer, but the right one for any site to ask

This is what Netflix does when a user adds movies to his shipping queue (Figure 8-9). Figure 8-9. Netflix displays its recommendations in an overlay Each movie on the site has an "Add" button. Clicking "Add" immediately adds the movie to the user's queue. As a confirmation and an opportunity for recommendations, a Dialog Overlay is displayed on top of the movie page. Just like Amazon, Netflix has a sophisticated recommendation engine. The bet is that since the user has expressed interest in an item (shirt or movie), the site can find other items similar to it to suggest. Amazon does this in a separate page. Netflix does it in an overlay that is easily dismissed by clicking anywhere outside the overlay (or by clicking the close button at the top or bottom). In a previous version of Netflix (or if JavaScript is disabled), this becomes a multiple-page experience (Figure 8-10).


pages: 344 words: 96,020

Hacking Growth: How Today's Fastest-Growing Companies Drive Breakout Success by Sean Ellis, Morgan Brown

Airbnb, Amazon Web Services, barriers to entry, Ben Horowitz, bounce rate, business intelligence, business process, correlation does not imply causation, crowdsourcing, DevOps, disruptive innovation, Elon Musk, game design, Google Glasses, Internet of things, inventory management, iterative process, Jeff Bezos, Khan Academy, Kickstarter, Lean Startup, Lyft, Mark Zuckerberg, market design, minimum viable product, Network effects, Paul Graham, Peter Thiel, Ponzi scheme, recommendation engine, ride hailing / ride sharing, side project, Silicon Valley, Silicon Valley startup, Skype, Snapchat, software as a service, Steve Jobs, subscription business, Travis Kalanick, Uber and Lyft, Uber for X, uber lyft, working poor, Y Combinator, young professional

Personalization is also a good monetization tactic, and particularly effective are customized recommendations, usually delivered on the site or in the app while a customer is visiting, and also through email and mobile push messages. Amazon is, once again, a leading practitioner, having developed one of the most powerful “recommendation engines,” the term for the algorithmic programs that customize which items are recommended to you while browsing the site. The selections are based on a combination of a customer’s search history and buying habits, and data about the habits of other shoppers like that customer. All Amazon shoppers in effect see their own version of Amazon with a unique experience tailored to their preferences. Some recommendation engines, such as Amazon’s, as well as those deployed by Google and Netflix, are incredibly complex, but many are based on relatively simple math. As Colin Zima, the chief analytics officer at Looker, a business intelligence software, explains, it can be relatively easy to generate recommendations based on a simple formula called a Jaccard index, or Jaccard similarity coefficient, which determines how similar two products are to each other.
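A minimal sketch of that calculation, assuming we have the set of customer IDs who bought each product (the products and IDs are invented; this illustrates the formula, not Looker's or Amazon's implementation):

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical sets of customer IDs who bought each product.
buyers = {
    "peanut butter": {1, 2, 3, 5, 8},
    "jelly":         {2, 3, 5, 9},
    "detergent":     {4, 8},
}

print(jaccard(buyers["peanut butter"], buyers["jelly"]))      # 0.5   -> strong pairing
print(jaccard(buyers["peanut butter"], buyers["detergent"]))  # ~0.17 -> weak pairing
```

Product pairs with a high score become candidates for "customers who bought this also bought that" suggestions.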

In contrast, the score for peanut butter and, for example, laundry detergent will almost surely be much lower. This calculation can be done for a host of combinations of every item in the store, creating powerful recommendations that lead to more purchases. And with the best recommendation engines, these product suggestions will only get better and more personalized over time because the more customers shop, the more data is available not just about what an individual customer has purchased, but also about common patterns among a large pool of shoppers. The grocery app recommendation engine might, for example, recommend seltzer water and limes when a shopper puts Red Bull in her shopping cart—even if that shopper has no history of buying any of those products—based on data that shows most people buying Red Bull are purchasing mixers for vodka.6
DON’T BE INTRUSIVE
An important word of caution about customizing is that it can backfire if you’re not sensitive about how you’re doing it.


pages: 380 words: 118,675

The Everything Store: Jeff Bezos and the Age of Amazon by Brad Stone

airport security, Amazon Mechanical Turk, Amazon Web Services, bank run, Bernie Madoff, big-box store, Black Swan, book scanning, Brewster Kahle, buy and hold, call centre, centre right, Chuck Templeton: OpenTable:, Clayton Christensen, cloud computing, collapse of Lehman Brothers, crowdsourcing, cuban missile crisis, Danny Hillis, Douglas Hofstadter, Elon Musk, facts on the ground, game design, housing crisis, invention of movable type, inventory management, James Dyson, Jeff Bezos, John Markoff, Kevin Kelly, Kodak vs Instagram, late fees, loose coupling, low skilled workers, Maui Hawaii, Menlo Park, Network effects, new economy, optical character recognition, pets.com, Ponzi scheme, quantitative hedge fund, recommendation engine, Renaissance Technologies, RFID, Rodney Brooks, search inside the book, shareholder value, Silicon Valley, Silicon Valley startup, six sigma, skunkworks, Skype, statistical arbitrage, Steve Ballmer, Steve Jobs, Steven Levy, Stewart Brand, Thomas L Friedman, Tony Hsieh, Whole Earth Catalog, why are manhole covers round?, zero-sum game

Once again, Amazon’s lawyers caught wind of this and renamed the program Vendor Realignment. Over the next year, Miller tangled with the European divisions of Random House, Hachette, and Bloomsbury, the publisher of the Harry Potter series. “I did everything I could to screw with their performance,” he says. He took selections of their catalog to full price and yanked their books from Amazon’s recommendation engine; with some titles, like travel books, he promoted comparable books from competitors. Miller’s constant search for new points of leverage exploited the anxieties of neurotic authors who obsessively tracked sales rank—the number on Amazon.com that showed an author how well his or her book was doing compared to other products on the site. “We would constantly meet with authors, so we’d know who would be watching their rankings.”

“Lyn was our ambassador. I credit her for maintaining these relationships.” Amazon approached large publishers aggressively. It demanded accommodations like steeper discounts on bulk purchases, longer periods to pay its bills, and shipping arrangements that leveraged Amazon’s discounts with UPS. To publishers that didn’t comply, Amazon threatened to pull their books out of its automated personalization and recommendation systems, meaning that they would no longer be suggested to customers. “Publishers didn’t really understand Amazon. They were very naïve about what was going on with their back catalog,” says Goss. “Most didn’t know their sales were up because their backlist was getting such visibility.” Amazon had an easy way to demonstrate its market power. When a publisher did not capitulate and the company shut off the recommendation algorithms for its books, the publisher’s sales usually fell by as much as 40 percent.


pages: 293 words: 78,439

Dual Transformation: How to Reposition Today's Business While Creating the Future by Scott D. Anthony, Mark W. Johnson

activist fund / activist shareholder / activist investor, additive manufacturing, Affordable Care Act / Obamacare, Airbnb, Amazon Web Services, autonomous vehicles, barriers to entry, Ben Horowitz, blockchain, business process, business process outsourcing, call centre, Clayton Christensen, cloud computing, commoditize, corporate governance, creative destruction, crowdsourcing, death of newspapers, disintermediation, disruptive innovation, distributed ledger, diversified portfolio, Internet of things, invention of hypertext, inventory management, Jeff Bezos, job automation, job satisfaction, Joseph Schumpeter, Kickstarter, late fees, Lean Startup, Lyft, M-Pesa, Marc Andreessen, Mark Zuckerberg, Minecraft, obamacare, Parag Khanna, Paul Graham, peer-to-peer lending, pez dispenser, recommendation engine, self-driving car, shareholder value, side project, Silicon Valley, Skype, software as a service, software is eating the world, Steve Jobs, the market place, the scientific method, Thomas Kuhn: the structure of scientific revolutions, transfer pricing, uber lyft, Watson beat the top human players on Jeopardy!, Y Combinator, Zipcar

In 2000, Netflix had discussions about selling to Blockbuster for $50 million. Blockbuster, then the market leader, passed. Netflix set to work building sophisticated inventory management systems to help ensure that people could get the DVDs they wanted when they wanted them. The company also invested heavily to build algorithms that predicted users’ desired content based on their ratings of movies they rented. The so-called recommendations engine is so critical to Netflix that in 2008 it announced a public contest wherein the team that most improved the performance of the engine would get $1 million, as long as they crossed a 10 percent improvement threshold. Two teams indeed crossed the threshold, with the winning team receiving a check from Hastings in 2009 (remarkably, that was the first time the team members met face-to-face; they had done their work virtually).

The company has had its ups and downs commercially, but its immediacy and rawness had transformational impact. There are others, of course, such as InterActive Corp (worth about $6 billion as of this writing), which runs a collection of websites such as Match.com, About.com, and The Daily Beast; travel recommendation site TripAdvisor (worth $10 billion); real estate platform Zillow ($1.5 billion); coupon disruptor Groupon ($3 billion); local recommendations engine Yelp ($2 billion), and listicle and algorithmic innovator BuzzFeed ($1.5 billion). As of late 2016, the dozen companies here had created almost $1 trillion in market value. FIGURE 3-1 Transformation B Just because newspaper publishers didn’t create these companies doesn’t mean they couldn’t have created them. In this chapter we go through three case studies of established organizations driving transformation B, finding new ways to solve different problems (see figure 3-1).

See also curiosity capabilities link and, 74–75 disruption as, 8–12, 47–50 focusing on highest-potential, 141–142 leaders on, 196–197 stopping exploration of, 126–127 strategic opportunity areas and, 123–127 Optus, 145, 147–148, 149 Orange Is the New Black, 35 O’Reilly, Charles III, 53, 54 outsiders, involving in decision making, 109–110 overshooting, 103 Palo Alto Research Center (PARC), 13, 31 Pandesic, 78–79 parable of the eleventh floor, 77 Pathway, 58 patientslikeme.com, 60 PayPal, 200 Paytm, 202 Pearson, 67 penicillin, 139 periphery, spotting warning signs from the, 107–108 Perry, Tyler, 98 Pfizer, 17, 22, 138–139 Pharmacyclics, 19 Photoshop Express, 32 Pixar Animation Studios, 3–4 planning fallacy, 120 Playing to Win (Martin and Lafley), 124 Plunify, 72, 74 Porter, Michael, 99–100, 177 portfolio management systems, 80–82 Potemkin portfolios, 120 potential estimating current operations’, 118–119 estimating existing investments’, 119 problem solving approaches, 140–141 Procter & Gamble (P&G), 23, 64, 109 capabilities identification at, 79–80 innovation at, 146 predictability versus innovation at, 137–138 Professional Golfers’ Association, 99 Project ET, 127–128 Psychology Today, 177 purpose, 175–179 leaders on, 194–195 QQ, 106 Quantum Solutions, 51, 52 Quattro Wireless, 67 Quicken, 132–133 Qwikster, 94 Rakuten Group, 143 recommendations engine, Netflix, 33–35 reinvention, 42–43 Reminder app, 152 repositioning, 12, 27–45. See also transformation A reinvention versus, 42–43 Research in Motion (RIM), 4 revenue models, 40–41 reverse mentors, 151 Ricks College, 37, 44, 170. See also Brigham Young University-Idaho (BYU-Idaho) Ries, Eric, 65, 153 risk management early warning signs of disruption and, 102–113 growth gap determination and, 120–121 through experimentation, 64–66 toolkit for, 218–219 Ronn, Karl, 109 Rotman School of Management, 140 Rubin, Andy, 4 Rumelt, Richard P., 78, 116 Safaricom, 201 sales careful management of, 45 salesforce and, 77 Salesforce.com, 27–28, 151 The Salt Lake Tribune, 8.


pages: 286 words: 87,401

Blitzscaling: The Lightning-Fast Path to Building Massively Valuable Companies by Reid Hoffman, Chris Yeh

activist fund / activist shareholder / activist investor, Airbnb, Amazon Web Services, autonomous vehicles, bitcoin, blockchain, Bob Noyce, business intelligence, Chuck Templeton: OpenTable:, cloud computing, crowdsourcing, cryptocurrency, Daniel Kahneman / Amos Tversky, database schema, discounted cash flows, Elon Musk, Firefox, forensic accounting, George Gilder, global pandemic, Google Hangouts, Google X / Alphabet X, hydraulic fracturing, Hyperloop, inventory management, Isaac Newton, Jeff Bezos, Joi Ito, Khan Academy, late fees, Lean Startup, Lyft, M-Pesa, Marc Andreessen, margin call, Mark Zuckerberg, minimum viable product, move fast and break things, move fast and break things, Network effects, Oculus Rift, oil shale / tar sands, Paul Buchheit, Paul Graham, Peter Thiel, pre–internet, recommendation engine, ride hailing / ride sharing, Sam Altman, Sand Hill Road, Saturday Night Live, self-driving car, shareholder value, sharing economy, Silicon Valley, Silicon Valley startup, Skype, smart grid, social graph, software as a service, software is eating the world, speech recognition, stem cell, Steve Jobs, subscription business, Tesla Model S, thinkpad, transaction costs, transport as a service, Travis Kalanick, Uber for X, uber lyft, web application, winner-take-all economy, Y Combinator, yellow journalism

This meant that Netflix had to climb a steep learning curve in terms of both DVD-specific tasks, such as negotiating with the studios for access to movie DVDs and coordinating the logistics required to ship them to and from consumers, and developing new features like the ability to recommend movies based on past selections. Climbing the learning curve for these tasks was painful and expensive, but it gave Netflix a competitive advantage over its competitors. Later, as broadband connections became more widespread, Netflix had to climb the learning curve when building out its massive streaming infrastructure while continuing to improve its consumer recommendation engine. That was when Netflix began running into a major strategic issue. Netflix relied on the studios for its content (movies and TV shows), but the studios now saw online video companies like YouTube and Netflix as a threat. In response, they began to increase the price they demanded from Netflix for licensing their content and held back some of their “crown jewels” (e.g., massively popular content like Saturday Night Live) for themselves and Hulu (an industry joint venture).

Today, Netflix might very well be the leader in original video content, and even traditional Hollywood power players, such as superproducer Shonda Rhimes (Grey’s Anatomy, Scandal, How to Get Away with Murder) and comedian Adam Sandler (Happy Gilmore, Grown Ups), have switched from traditional studios to Netflix. What’s more, the other learning curves that Netflix climbed along the way actually helped it beat the studios at their own game. The consumer recommendation engine gives Netflix an unprecedented ability to predict what content its users want to watch, which allows it to work with creators to produce that content (such as the popular drama Stranger Things). And because Netflix has greater confidence in its own predictions than its competitors have in theirs, it can outbid them for content when they go head-to-head. COMPETITION Yet despite these offensive reasons to scale, the most common driver of blitzscaling is the threat of competition.

LinkedIn is enormously valuable as a database of résumés, but it is even more valuable as the leading community for professionals. The challenge was figuring out how to develop a daily use case that helped LinkedIn users with their professional lives and encouraged them to use the service continuously rather than just when they were looking to switch jobs or hire a new employee. We tried a number of single-threaded efforts to meet the challenge. We rolled out features one after another, such as a recommendation engine for people that our users should meet and a professional Q&A service. None of them worked well enough to solve the problem. We concluded that the problem might require a Swiss Army knife approach with multiple use cases for multiple groups of users. After all, some people might want a news feed, some might want to track their career progress, and some might be keen on continuing education.


pages: 247 words: 81,135

The Great Fragmentation: And Why the Future of All Business Is Small by Steve Sammartino

3D printing, additive manufacturing, Airbnb, augmented reality, barriers to entry, Bill Gates: Altair 8800, bitcoin, BRICs, Buckminster Fuller, citizen journalism, collaborative consumption, cryptocurrency, David Heinemeier Hansson, disruptive innovation, Elon Musk, fiat currency, Frederick Winslow Taylor, game design, Google X / Alphabet X, haute couture, helicopter parent, illegal immigration, index fund, Jeff Bezos, jimmy wales, Kickstarter, knowledge economy, Law of Accelerating Returns, lifelogging, market design, Metcalfe's law, Minecraft, minimum viable product, Network effects, new economy, peer-to-peer, post scarcity, prediction markets, pre–internet, profit motive, race to the bottom, random walk, Ray Kurzweil, recommendation engine, remote working, RFID, Rubik’s Cube, self-driving car, sharing economy, side project, Silicon Valley, Silicon Valley startup, skunkworks, Skype, social graph, social web, software is eating the world, Steve Jobs, survivorship bias, too big to fail, US Airways Flight 1549, web application, zero-sum game

Creative types Collaboration, creative orientation and counter intuition Note
Chapter 6: Demographics is history: moving on from predictive marketing How to get profiled The price of pop culture The best average The weapon of choice Don’t fence me in How do you define a teenager? Stealing music or connecting? Marketing 1.0 Marketing revised The new intersection Social + interests = intention The story of cities Do I know you? The interest graph in action The anti-demographic recommendation engine
Chapter 7: The truth about pricing: technology and omnipresent deflation Technology deflation Real-world technology deflation The free super computer The crux is human It’s getting quicker Technology curve jumping Technology stacking Omnipresent deflation Consumer price index trickery Connections and the impact on prices Economic border hopping The new minimum wage Notes
Chapter 8: A zero-barrier world: how access to knowledge is breaking down barriers So what’s changed?

They focused on direct connection, one new fan at a time. They didn’t try to build an audience. They helped a person, which is a very different approach. It seems old-school BMXers are a little bit smarter than old-school marketers. What a great way to build a community; one that I’m now a part of. While everyone gets enamoured with ‘big data’, there’s probably a lot more we can do with ‘little data’. The anti-demographic recommendation engine A lot of e-commerce platforms and social-media engines seem to be able to do what mainstream marketers could never quite pull off. Every day, I’m exposed to products and services that I have zero interest in ever purchasing, mainly due to the laziness of the marketers who allocate the budget behind them. But occasionally I’m utterly inspired and thankful when great marketers (with permission) introduce me to things that are just perfectly suited.

Twitter is terrific at this with its who-to-follow recommendations. But the best example has to be Amazon’s ‘Recommended for you’ books. It’s always spot on, sitting perfectly in the centre of my personal interest graph, based on the simplicity of what I’ve bought, looked at, wish listed and what others have in their list when there are overlaps. For me personally, it’s very accurate indeed. What’s interesting is that this recommendation engine is what I’d coin an ‘anti-demographic’ profiler: It doesn’t care what sex I am. It doesn’t care where I live. It doesn’t care or know how much I earn. It doesn’t care if I finished school. None of this matters. What matters is the direct connection and the reality of my interests based on my digital footprint. It’s the type of efficiency that mass can never achieve. The smart marketing money now lives in a node-by-node approach.


pages: 420 words: 130,503

Actionable Gamification: Beyond Points, Badges and Leaderboards by Yu-Kai Chou

Apple's 1984 Super Bowl advert, barriers to entry, bitcoin, Burning Man, Cass Sunstein, crowdsourcing, Daniel Kahneman / Amos Tversky, delayed gratification, don't be evil, en.wikipedia.org, endowment effect, Firefox, functional fixedness, game design, IKEA effect, Internet of things, Kickstarter, late fees, lifelogging, loss aversion, Maui Hawaii, Minecraft, pattern recognition, peer-to-peer, performance metric, QR code, recommendation engine, Richard Thaler, Silicon Valley, Skype, software as a service, Stanford prison experiment, Steve Jobs, The Wealth of Nations by Adam Smith, transaction costs

As it does so, the sense of Ownership & Possession grows even more as people now identify it as a unique “My Amazon” experience that no other eCommerce site can provide. Don’t fall behind your neighbors! Accompanying the Alfred Effect is Amazon’s Recommendation Engine, now infamous in the personalization industry. Amazon’s recommendation engine, according to Amazon themselves, led to 30% of their sales5. That’s a fairly significant factor for a company that is already making billions of dollars every month. In fact, JP Mangalindan, a writer for Fortune and CNN money, argues that a significant part of Amazon’s 29% sales growth from the second fiscal quarter of 2011 to the second fiscal quarter of 2012 was attributed to the recommendation engine.6 And what does this recommendation engine look like? “Customers Who Bought This Item Also Bought.” Amazon quickly realized that, by learning about what other people similar to you are buying, you have a much higher tendency to buy the same items too.


pages: 567 words: 122,311

Lean Analytics: Use Data to Build a Better Startup Faster by Alistair Croll, Benjamin Yoskovitz

Airbnb, Amazon Mechanical Turk, Amazon Web Services, Any sufficiently advanced technology is indistinguishable from magic, barriers to entry, Bay Area Rapid Transit, Ben Horowitz, bounce rate, business intelligence, call centre, cloud computing, cognitive bias, commoditize, constrained optimization, en.wikipedia.org, Firefox, Frederick Winslow Taylor, frictionless, frictionless market, game design, Google X / Alphabet X, Infrastructure as a Service, Internet of things, inventory management, Kickstarter, lateral thinking, Lean Startup, lifelogging, longitudinal study, Marshall McLuhan, minimum viable product, Network effects, pattern recognition, Paul Graham, performance metric, place-making, platform as a service, recommendation engine, ride hailing / ride sharing, rolodex, sentiment analysis, skunkworks, Skype, social graph, social software, software as a service, Steve Jobs, subscription business, telemarketer, transaction costs, two-sided market, Uber for X, web application, Y Combinator

But modern e-commerce is seldom this simple: The majority of buyers find what they’re looking for through search rather than by navigating across a series of pages. Shoppers start with an external search and then bounce back and forth from sites they visit to their search results, seeking the scent of what they’re after. Once they find it, on-site navigation becomes more important. This means on-site funnels are somewhat outdated; keywords are more important. Retailers use recommendation engines to anticipate what else a buyer might want, basing their suggestions on past buyers and other users with similar profiles. Few visitors see the same pages as one another. Retailers are always optimizing performance, which means that they’re segmenting traffic. Mid- to large-size retailers segment their funnel by several tests that are being run to find the right products, offers, and prices.

Cost of customer acquisition: The money spent to get someone to buy something.
Revenue per customer: The lifetime value of each customer.
Top keywords driving traffic to the site: Those terms that people are looking for, and associate with you—a clue to adjacent products or markets.
Top search terms: Both those that lead to revenue, and those that don’t have any results.
Effectiveness of recommendation engines: How likely a visitor is to add a recommended product to the shopping cart.
Virality: Word of mouth, and sharing per visitor.
Mailing list effectiveness: Click-through rates and ability to make buyers return and buy.

More sophisticated retailers care about other metrics such as the number of reviews written or the number considered helpful, but this is really a secondary business within the organization, and we’ll deal with these when we look at the user-generated content model in Chapter 12.
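Each of these maps onto a very small calculation. A minimal sketch, using invented numbers and hypothetical field names (none of this comes from the book's own examples):

```python
# Minimal sketch of a few of these metrics; data and field names are invented.

def cost_of_customer_acquisition(marketing_spend, new_customers):
    """Money spent to get someone to buy something, per acquired customer."""
    return marketing_spend / new_customers if new_customers else 0.0

def revenue_per_customer(orders_by_customer):
    """Lifetime-value proxy: average total revenue per customer."""
    totals = [sum(orders) for orders in orders_by_customer.values()]
    return sum(totals) / len(totals) if totals else 0.0

def virality(shares, visitors):
    """Sharing per visitor."""
    return shares / visitors if visitors else 0.0

print(cost_of_customer_acquisition(12_000, 300))            # 40.0
print(revenue_per_customer({"c1": [40, 25], "c2": [90]}))   # 77.5
print(virality(shares=1_800, visitors=60_000))              # 0.03
```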

We’re not going to get into the details of search engine optimization and search engine marketing here—those are worlds unto themselves. For now, realize that search is a significant part of any e-commerce operation, and the old model of formal navigational steps toward a particular page is outdated (even though it remains in many analytics tools).

Recommendation Acceptance Rate

Big e-commerce companies use recommendation engines to suggest additional items to visitors. Today, these engines are becoming more widespread thanks to third-party recommendation services that work with smaller retailers. Even bloggers have this kind of algorithm, suggesting other articles similar to the one the visitor is currently reading. There are many different approaches to recommendations. Some use what the buyer has purchased in the past; others try to predict purchases from visitor attributes like geography, referral, or what the visitor has clicked so far.
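One way to make that metric concrete is to count, from an event log, how often a shown recommendation ends up in the cart. A minimal sketch, assuming hypothetical event names and made-up records:

```python
# Hypothetical sketch: recommendation acceptance rate from an event log.
# Event names and fields are assumptions for illustration, not from the book.

events = [
    {"visitor": "v1", "event": "recommendation_shown", "product": "p9"},
    {"visitor": "v1", "event": "added_to_cart", "product": "p9", "from_recommendation": True},
    {"visitor": "v2", "event": "recommendation_shown", "product": "p3"},
    {"visitor": "v3", "event": "recommendation_shown", "product": "p9"},
    {"visitor": "v3", "event": "added_to_cart", "product": "p9", "from_recommendation": True},
]

shown = sum(1 for e in events if e["event"] == "recommendation_shown")
accepted = sum(1 for e in events if e["event"] == "added_to_cart" and e.get("from_recommendation"))

acceptance_rate = accepted / shown if shown else 0.0
print(f"Recommendation acceptance rate: {acceptance_rate:.1%}")  # 66.7% on this toy data
```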


We Are the Nerds: The Birth and Tumultuous Life of Reddit, the Internet's Culture Laboratory by Christine Lagorio-Chafkin

4chan, Airbnb, Amazon Web Services, Bernie Sanders, big-box store, bitcoin, blockchain, Brewster Kahle, Burning Man, crowdsourcing, cryptocurrency, David Heinemeier Hansson, Donald Trump, East Village, game design, Golden Gate Park, hiring and firing, Internet Archive, Jacob Appelbaum, Jeff Bezos, jimmy wales, Joi Ito, Justin.tv, Kickstarter, Lean Startup, Lyft, Marc Andreessen, Mark Zuckerberg, medical residency, minimum viable product, natural language processing, Paul Buchheit, Paul Graham, paypal mafia, Peter Thiel, plutocrats, Plutocrats, QR code, recommendation engine, RFID, rolodex, Ruby on Rails, Sam Altman, Sand Hill Road, Saturday Night Live, self-driving car, semantic web, side project, Silicon Valley, Silicon Valley ideology, Silicon Valley startup, slashdot, Snapchat, social web, South of Market, San Francisco, Startup school, Stephen Hawking, Steve Jobs, Steve Wozniak, technoutopianism, uber lyft, web application, WikiLeaks, Y Combinator

The notebook contained some typical college scribbles (“I’m sorry I’ll shut up now”) and doodles (3-D cubes, a penis) he’d made during class at UVA, and some coursework notes, too, but on this day it transformed into a place where Huffman would document the origins of, and his progress on, their new, as yet unnamed project. “The site people go to find something new,” Huffman wrote in blue pen. “Points for being the first to recommend,” he also wrote, likely transcribing Graham’s exact words regarding building a recommendation engine before any of their preexisting competitors could. The recommendation engine was integral to the success of this hypothetical project, Graham thought, because one would need to dangle a carrot for users to entice them to post links in the first place, and then to return again and again to discover and share. Discover and share. Ohanian immediately considered his own personal use case: He spent a lot of time navigating to the New York Times, the Washington Post, and a host of blogs every morning.

It had been a longtime and significant priority of Huffman’s to keep the site loading quickly and, for users, functioning well (programmers call this keeping a site “perky”). Thanks to numerous small changes and additions to its functionality over the past years, the codebase had become unwieldy. Plus, there were portions of code that were now unused, features built and never launched, or pulled back on, such as the complex recommendation engine Paul Graham had pushed so hard for at Reddit’s inception. With a team of four in place in the conference room overlooking SoMa’s tech-company epicenter, it felt good. Reddit was ready to grow. Huffman and Slowe felt proud that they’d learned to navigate Condé Nast human resources well enough to hire, which allowed them finally to get ahead of the game on site maintenance. What they lacked was a plan for the future.

Upon seeing Mitchell’s handle, u/kemitche, in the thread, Slowe commented, “cough don’t know if you saw the hiring thing. nudge nudge wink wink know-what-I-mean. You get a punch card when you come back. Free small sundae on third rehire!” Huffman’s long-standing trust in Slowe was so deep that when Slowe returned to Reddit, Huffman said his mandate was simply: “Go do stuff, Chris.” Slowe dug into how Reddit’s homepage functions for various users, dubbing the project “Relevance.” Updating the homepage algorithm led him to revisit the recommendation engine project they’d worked on eleven years before. Soon, he added another major project to his plate: overseeing a department that would be dubbed “anti-evil.” It would build specific tools for use by the secretive trust and safety team, and essentially be its programming counterpart. As new engineers were hired, more were handed over to Slowe to build robust antispam systems. As Slowe’s team grew, he proved an adept manager and was handed an even larger team—eighteen engineers—leading the group in charge of maintaining and developing the full Reddit site’s architecture.


pages: 215 words: 55,212

The Mesh: Why the Future of Business Is Sharing by Lisa Gansky

Airbnb, Amazon Mechanical Turk, Amazon Web Services, banking crisis, barriers to entry, carbon footprint, Chuck Templeton: OpenTable:, cloud computing, credit crunch, crowdsourcing, diversification, Firefox, fixed income, Google Earth, industrial cluster, Internet of things, Joi Ito, Kickstarter, late fees, Network effects, new economy, peer-to-peer lending, recommendation engine, RFID, Richard Florida, Richard Thaler, ride hailing / ride sharing, sharing economy, Silicon Valley, smart grid, social web, software as a service, TaskRabbit, the built environment, walkable city, yield management, young professional, Zipcar

As the service developed, the company added layers of information to inform a user’s choices, such as reviews from people in the network whose profiles of selections and ratings were similar. Recently, it sponsored a contest awarding a million dollars to anyone who could significantly improve the movie recommendation service. Thousands of teams from more than a hundred nations competed. Netflix’s “recommendation engine” relies on algorithms culled from masses of data collected on the Web, including that provided directly by customers. The lesson learned from the contest, according to the New York Times, was the power of collaboration, as winning teams began sharing ideas and information: “The formula for success was to bring together people with complementary skills and combine different methods of problem solving.”

See Social networking starting Mesh company Sweet Spot trends influencing growth of trust building Millennial generation Mobile networks digital translation to physical and flash branding as foundation of the Mesh share-based business operation users, increase in Modular design Mohsenin, Kamran Movie rentals online, Mesh companies Mozilla Firefox Music-based businesses, Mesh companies Natural ecosystem, relationship to Mesh ecosystem Netflix annual sales as information business Mesh strategy perfection recommendation engine recommendations Network effect Niche markets for maintaining/servicing products Mesh companies opening, reason for sharing as North Portland Tool Library (NPTL) Ofoto Olapic Ombudsman Open Architecture Network Open Design Open innovation service provider Open networks advantages of Architecture for Humanity communal IP concept and marketing products openness versus proprietary approach and product improvement software development OpenTable O’Reilly, Tim Ostrom, Elinor Own-to-Mesh model car-sharing services profits, generation from retirees as customers Partnerships characteristics of corporations and Mesh companies income generation from in Mesh ecosystem unexpected value of Patagonia recycled textiles of Walmart partnership Paul, Sunil Payne, Steven Peer-to-peer lending.


pages: 334 words: 102,899

That Will Never Work: The Birth of Netflix and the Amazing Life of an Idea by Marc Randolph

Airbnb, crowdsourcing, high net worth, inventory management, Isaac Newton, Jeff Bezos, late fees, loose coupling, Mason jar, pets.com, recommendation engine, rolodex, Sand Hill Road, Silicon Valley, Silicon Valley startup, Snapchat, Steve Jobs, Travis Kalanick

To keep our customers happy and our costs reasonable, we needed to direct users to less in-demand movies that we knew they’d like—and probably like even better than new releases. For example: Say I rented (and loved) Pleasantville, one of the best movies of 1998 and a clever dark comedy about what happens when two teenagers from the nineties (Tobey Maguire and Reese Witherspoon) are sucked into a black-and-white television show set in 1950s small-town America. The ideal recommendation engine would be able to steer me away from more current new releases and toward other movies, like Pleasantville—movies like Doc Hollywood. That was a tall order. The thing about taste is that it’s subjective. And the number of factors in play, when trying to establish similarities between films, is almost endless. Do you group films by actor, by director, by genre? Release year, award nominations, screenwriter?
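One crude way to operationalize that notion of similarity is to score films by overlapping attributes—shared genres, same director, nearby release year. The sketch below only illustrates that idea; the film metadata and weights are invented, and this is not Netflix's actual approach:

```python
# Toy content-based similarity: score films by how many attributes they share.
# Film metadata and weights here are invented for illustration.

films = {
    "Pleasantville": {"genre": {"comedy", "drama", "fantasy"}, "year": 1998, "director": "Gary Ross"},
    "Doc Hollywood": {"genre": {"comedy", "romance"}, "year": 1991, "director": "Michael Caton-Jones"},
    "Armageddon":    {"genre": {"action", "sci-fi"}, "year": 1998, "director": "Michael Bay"},
}

def similarity(a, b):
    """Jaccard overlap of genres, plus small bonuses for a shared director or close release years."""
    jaccard = len(a["genre"] & b["genre"]) / len(a["genre"] | b["genre"])
    same_director = 0.2 if a["director"] == b["director"] else 0.0
    close_years = 0.1 if abs(a["year"] - b["year"]) <= 5 else 0.0
    return jaccard + same_director + close_years

liked = "Pleasantville"
ranked = sorted((title for title in films if title != liked),
                key=lambda t: similarity(films[liked], films[t]), reverse=True)
print(ranked)  # ['Doc Hollywood', 'Armageddon']
```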

It was remarkably easy to amass enough reviews to build a collaborative filtering function that could actually predict—with reasonable accuracy—what someone might like. After that, Reed’s team went to work integrating these taste predictions into a broader algorithm that made movie recommendations after weighing a number of factors—keyword, number of copies, number of copies in stock, cost per disc. The result—which launched in February of 2000 as Cinematch—was a seemingly more intuitive recommendation engine, one that outsourced qualitative assessment to users while also optimizing things on the back end. In many ways, it was the best of both worlds: an automated system that nonetheless felt human, like a video store clerk asking you what you’d seen lately and then recommending something he knew you’d like—and that he had in stock. Actually, it felt better than human. It felt invisible. If it sounds like two of the most innovative and influential developments in the history of Netflix happened quickly, hot on the heels of Reed and I deciding to run the company together—well, if it sounds that way, that’s because it’s true.
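The passage doesn't spell out how those factors were weighed, but the general shape—blend a collaborative-filtering taste prediction with inventory and cost signals—can be sketched roughly as follows. All weights, field names, and numbers are assumptions for illustration, not the Cinematch formula:

```python
# Rough sketch of blending a collaborative-filtering taste score with
# business factors (copies in stock, cost per disc). Everything here is invented.

def recommendation_score(predicted_rating, copies_in_stock, cost_per_disc,
                         w_taste=1.0, w_stock=0.3, w_cost=0.2):
    in_stock_bonus = w_stock if copies_in_stock > 0 else -w_stock   # don't push out-of-stock titles
    cost_penalty = w_cost * (cost_per_disc / 20.0)                  # higher-cost discs are nudged down
    return w_taste * predicted_rating + in_stock_bonus - cost_penalty

catalog = {
    "Doc Hollywood": {"predicted_rating": 4.1, "copies_in_stock": 37, "cost_per_disc": 8.0},
    "New Release X": {"predicted_rating": 4.3, "copies_in_stock": 0,  "cost_per_disc": 22.0},
}
for title, info in catalog.items():
    print(title, round(recommendation_score(**info), 2))
# Doc Hollywood scores higher despite its slightly lower predicted rating.
```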

Netflix now had more than 350 employees, and we had long since passed the point where I knew everybody. We’d continued on our streak of making major talent hires—the most recent being Leslie Kilgore, whom Reed had convinced to leave Amazon to head our marketing efforts as CMO, and Ted Sarandos, who now managed our content acquisition. Since walking away from à la carte rentals, our no-due-dates, no-late-fees program had steadily built up steam. Users loved Cinematch, our recommendation engine. We did, too. It kept our subscribers’ queues full—and nothing, we found, correlated more to retention than a queue with lots of movies in it. We were now approaching nearly 200,000 paying subscribers. Our other metrics were looking pretty impressive as well. We now carried 5,800 different DVD titles and shipped more than 800,000 discs a month, and our warehouse was packed with more than a million discs.


pages: 58 words: 12,386

Big Data Glossary by Pete Warden

business intelligence, crowdsourcing, fault tolerance, information retrieval, linked data, natural language processing, recommendation engine, web application

To achieve that scalability, most of the code is written as parallelizable jobs on top of Hadoop. It comes with algorithms to perform a lot of common tasks, like clustering and classifying objects into groups, recommending items based on other users’ behaviors, and spotting attributes that occur together a lot. In practical terms, the framework makes it easy to use analysis techniques to implement features such as Amazon’s “People who bought this also bought” recommendation engine on your own site. It’s a heavily used project with an active community of developers and users, and it’s well worth trying if you have any significant amount of transaction or similar data that you’d like to get more value out of.

Introducing Mahout
Using Mahout with Cassandra

scikits.learn

It’s hard to find good off-the-shelf tools for practical machine learning. Many of the projects are aimed at students and researchers who want access to the inner workings of the algorithms, which can be off-putting when you’re looking for more of a black box to solve a particular problem.
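Returning to the Mahout passage above: Mahout runs this kind of computation as parallel Hadoop jobs, but the item co-occurrence idea behind a “People who bought this also bought” feature fits in a few lines of plain Python. The baskets below are invented; this is a toy illustration, not Mahout's implementation:

```python
# Item co-occurrence: the core idea behind "people who bought this also bought".
# Invented shopping baskets; real systems compute this at much larger scale.
from collections import Counter
from itertools import combinations

baskets = [
    {"kettle", "teapot", "mugs"},
    {"kettle", "mugs"},
    {"kettle", "toaster"},
    {"teapot", "mugs"},
]

co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def also_bought(item, top_n=3):
    scores = Counter({other: n for (i, other), n in co_counts.items() if i == item})
    return [other for other, _ in scores.most_common(top_n)]

print(also_bought("kettle"))  # 'mugs' ranks first: it co-occurs with 'kettle' in two baskets
```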


pages: 222 words: 70,132

Move Fast and Break Things: How Facebook, Google, and Amazon Cornered Culture and Undermined Democracy by Jonathan Taplin

1960s counterculture, affirmative action, Affordable Care Act / Obamacare, Airbnb, Amazon Mechanical Turk, American Legislative Exchange Council, Apple's 1984 Super Bowl advert, back-to-the-land, barriers to entry, basic income, battle of ideas, big data - Walmart - Pop Tarts, bitcoin, Brewster Kahle, Buckminster Fuller, Burning Man, Clayton Christensen, commoditize, creative destruction, crony capitalism, crowdsourcing, data is the new oil, David Brooks, David Graeber, don't be evil, Donald Trump, Douglas Engelbart, Douglas Engelbart, Dynabook, Edward Snowden, Elon Musk, equal pay for equal work, Erik Brynjolfsson, future of journalism, future of work, George Akerlof, George Gilder, Google bus, Hacker Ethic, Howard Rheingold, income inequality, informal economy, information asymmetry, information retrieval, Internet Archive, Internet of things, invisible hand, Jaron Lanier, Jeff Bezos, job automation, John Markoff, John Maynard Keynes: technological unemployment, John von Neumann, Joseph Schumpeter, Kevin Kelly, Kickstarter, labor-force participation, life extension, Marc Andreessen, Mark Zuckerberg, Menlo Park, Metcalfe’s law, Mother of all demos, move fast and break things, move fast and break things, natural language processing, Network effects, new economy, Norbert Wiener, offshore financial centre, packet switching, Paul Graham, paypal mafia, Peter Thiel, plutocrats, Plutocrats, pre–internet, Ray Kurzweil, recommendation engine, rent-seeking, revision control, Robert Bork, Robert Gordon, Robert Metcalfe, Ronald Reagan, Ross Ulbricht, Sam Altman, Sand Hill Road, secular stagnation, self-driving car, sharing economy, Silicon Valley, Silicon Valley ideology, smart grid, Snapchat, software is eating the world, Steve Jobs, Stewart Brand, technoutopianism, The Chicago School, The Market for Lemons, The Rise and Fall of American Growth, Tim Cook: Apple, trade route, transfer pricing, Travis Kalanick, trickle-down economics, Tyler Cowen: Great Stagnation, universal basic income, unpaid internship, We wanted flying cars, instead we got 140 characters, web application, Whole Earth Catalog, winner-take-all economy, women in the workforce, Y Combinator

But it turned out it wasn’t just elite Harvard kids who wanted to fashion an online persona—it was everyone. When Thefacebook really started to grow, in the late spring of 2004, Zuckerberg and his right-hand man, Dustin Moskovitz, decided to go to Silicon Valley for the summer. Zuckerberg had met Sean Parker in a Chinese restaurant in New York in May and had been awed by his outlaw tales of Napster. Zuckerberg had written a music-recommendation engine while he was a senior at Exeter, and so Napster loomed large in his notion of hipness. When the two men got to Palo Alto in June, they ran into Parker, who was essentially homeless, having been thrown out of his latest company, Plaxo, an online address-book application. It is a tribute to Zuckerberg’s naive trust that he invited Parker to live in the house he and Moskovitz had rented. Parker promised to teach them about the shark tank known as Sand Hill Road—the center of the Valley’s venture capital business.

In Huxley’s world, the obsession with taking drugs, going to the “feelies” (his equivalent of IMAX movies), playing interactive games, and downloading porn filled the lives of the citizens. They had no time for politics or even for wondering why their horizons were so narrow. The kids attending DigiTour would fit right into the plot of Brave New World. The Internet’s self-curated view from everywhere has the amazing ability to distract us in trivial pursuits, narrow our choices, and keep us safe in a balkanized suburb of our own taste. Search engines and recommendation engines constantly favor the most popular options and constantly make our discovery more limited. I began this chapter wondering whether technology was robbing us of some of our essential humanity. Google’s chief technologist proclaims that technology will “allow us to transcend these limitations of our biological bodies and brains.… There will be no distinction, post-Singularity, between human and machine.”


pages: 274 words: 75,846

The Filter Bubble: What the Internet Is Hiding From You by Eli Pariser

A Declaration of the Independence of Cyberspace, A Pattern Language, Amazon Web Services, augmented reality, back-to-the-land, Black Swan, borderless world, Build a better mousetrap, Cass Sunstein, citizen journalism, cloud computing, cognitive dissonance, crowdsourcing, Danny Hillis, data acquisition, disintermediation, don't be evil, Filter Bubble, Flash crash, fundamental attribution error, global village, Haight Ashbury, Internet of things, Isaac Newton, Jaron Lanier, Jeff Bezos, jimmy wales, Kevin Kelly, knowledge worker, Mark Zuckerberg, Marshall McLuhan, megacity, Metcalfe’s law, Netflix Prize, new economy, PageRank, paypal mafia, Peter Thiel, recommendation engine, RFID, Robert Metcalfe, sentiment analysis, shareholder value, Silicon Valley, Silicon Valley startup, social graph, social software, social web, speech recognition, Startup school, statistical model, stem cell, Steve Jobs, Steven Levy, Stewart Brand, technoutopianism, the scientific method, urban planning, Whole Earth Catalog, WikiLeaks, Y Combinator

In a memo for fellow progressives, Mark Steitz, one of the primary Democratic data gurus, recently wrote that “targeting too often returns to a bombing metaphor—dropping message from planes. Yet the best data tools help build relationships based on observed contacts with people. Someone at the door finds out someone is interested in education; we get back to that person and others like him or her with more information. Amazon’s recommendation engine is the direction we need to head.” The trend is clear: We’re moving from swing states to swing people. Consider this scenario: It’s 2016, and the race is on for the presidency of the United States. Or is it? It depends on who you are, really. If the data says you vote frequently and that you may have been a swing voter in the past, the race is a maelstrom. You’re besieged with ads, calls, and invitations from friends.

Quora Forum, accessed Dec. 17, 2010, www.quora.com/Facebook-company/Whats-the-history-of-the-Awesome-Button-that-eventually-became-the-Like-button-on-Facebook.
151 “against the cruise line industry”: Hollis Thomases, “Google Drops Anti-Cruise Line Ads from AdWords,” Web Ad.vantage, Feb. 13, 2004, accessed Dec. 17, 2010, www.webadvantage.net/webadblog/google-drops-anti-cruise-line-ads-from-adwords-338.
151–52 identify who was persuadable: “How Rove Targeted the Republican Vote,” Frontline, accessed Feb. 8, 2011, www.pbs.org/wgbh/pages/frontline/shows/architect/rove/metrics.html.
152 “Amazon’s recommendation engine is the direction”: Mark Steitz and Laura Quinn, “An Introduction to Microtargeting in Politics,” accessed Dec. 17, 2010, www.docstoc.com/docs/43575201/An-Introduction-to-Microtargeting-in-Politics.
153 round-the-clock “war room”: “Google’s War Room for the Home Stretch of Campaign 2010,” e.politics, Sept. 24, 2010, accessed Feb. 9, 2011, www.epolitics.com/2010/09/24/googles-war-room-for-the-home-stretch-of-campaign-2010/.
155 “campaign wanted to spend on Facebook”: Vincent R.


pages: 301 words: 85,263

New Dark Age: Technology and the End of the Future by James Bridle

AI winter, Airbnb, Alfred Russel Wallace, Automated Insights, autonomous vehicles, back-to-the-land, Benoit Mandelbrot, Bernie Sanders, bitcoin, British Empire, Brownian motion, Buckminster Fuller, Capital in the Twenty-First Century by Thomas Piketty, carbon footprint, cognitive bias, cognitive dissonance, combinatorial explosion, computer vision, congestion charging, cryptocurrency, data is the new oil, Donald Trump, Douglas Engelbart, Douglas Engelbart, Douglas Hofstadter, drone strike, Edward Snowden, fear of failure, Flash crash, Google Earth, Haber-Bosch Process, hive mind, income inequality, informal economy, Internet of things, Isaac Newton, John von Neumann, Julian Assange, Kickstarter, late capitalism, lone genius, mandelbrot fractal, meta analysis, meta-analysis, Minecraft, mutually assured destruction, natural language processing, Network effects, oil shock, p-value, pattern recognition, peak oil, recommendation engine, road to serfdom, Robert Mercer, Ronald Reagan, self-driving car, Silicon Valley, Silicon Valley ideology, Skype, social graph, sorting algorithm, South China Sea, speech recognition, Spread Networks laid a new fibre optics cable between New York and Chicago, stem cell, Stuxnet, technoutopianism, the built environment, the scientific method, Uber for X, undersea cable, University of East Anglia, uranium enrichment, Vannevar Bush, WikiLeaks

YouTube’s official guidelines state that the site is for ages thirteen and up, with parental permission required for those below eighteen, but there’s nothing to stop a thirteen-year-old accessing it. Moreover, there’s no need to have an account at all; like most websites, YouTube tracks unique visitors by their address, browser and device profile, and behaviour, and it can build a detailed demographic and preference profile to feed the recommendation engines without the viewer ever consciously submitting any information about themselves. That applies even if the viewer is a three-year-old child plonked in front of their parent’s iPad and mashing the screen with a balled-up fist. The frequency with which such a situation occurs is obvious in the site’s own viewer statistics. Ryan’s Toy Review, a channel specialising in unboxing videos and other kids’ tropes, is the sixth most popular channel on the platform, only just behind Justin Bieber and the WWE.4 At one point in 2016, it was the most popular.

In the video cited, Peppa endures her horrendous dental experience, and then she transforms into a series of Iron Man/pig/robot hybrids and performs the Learn Colours dance. Whatever agency is at play here is far from clear: the video starts with a trollish Peppa parody, but later syncs into the kind of automated repetition of tropes we’ve seen before. It’s not just trolls, or just automation; it’s not just human actors playing out an algorithmic logic, or algorithms mindlessly responding to recommendation engines. It’s a vast and almost completely hidden matrix of interactions between desires and rewards, technologies and audiences, tropes and masks. Other examples seem less accidental, and more intentional. One whole strand of video production involves automated recuts of video game footage, reprogrammed with superheroes or cartoon characters instead of soldiers and gangsters. Spiderman breaks the legs of the Grim Reaper and Elsa from Frozen and buries them up to their neck in a pit.


pages: 292 words: 85,151

Exponential Organizations: Why New Organizations Are Ten Times Better, Faster, and Cheaper Than Yours (And What to Do About It) by Salim Ismail, Yuri van Geest

23andMe, 3D printing, Airbnb, Amazon Mechanical Turk, Amazon Web Services, augmented reality, autonomous vehicles, Baxter: Rethink Robotics, Ben Horowitz, bioinformatics, bitcoin, Black Swan, blockchain, Burning Man, business intelligence, business process, call centre, chief data officer, Chris Wanstrath, Clayton Christensen, clean water, cloud computing, cognitive bias, collaborative consumption, collaborative economy, commoditize, corporate social responsibility, cross-subsidies, crowdsourcing, cryptocurrency, dark matter, Dean Kamen, dematerialisation, discounted cash flows, disruptive innovation, distributed ledger, Edward Snowden, Elon Musk, en.wikipedia.org, Ethereum, ethereum blockchain, game design, Google Glasses, Google Hangouts, Google X / Alphabet X, gravity well, hiring and firing, Hyperloop, industrial robot, Innovator's Dilemma, intangible asset, Internet of things, Iridium satellite, Isaac Newton, Jeff Bezos, Joi Ito, Kevin Kelly, Kickstarter, knowledge worker, Kodak vs Instagram, Law of Accelerating Returns, Lean Startup, life extension, lifelogging, loose coupling, loss aversion, low earth orbit, Lyft, Marc Andreessen, Mark Zuckerberg, market design, means of production, minimum viable product, natural language processing, Netflix Prize, NetJets, Network effects, new economy, Oculus Rift, offshore financial centre, PageRank, pattern recognition, Paul Graham, paypal mafia, peer-to-peer, peer-to-peer model, Peter H. Diamandis: Planetary Resources, Peter Thiel, prediction markets, profit motive, publish or perish, Ray Kurzweil, recommendation engine, RFID, ride hailing / ride sharing, risk tolerance, Ronald Coase, Second Machine Age, self-driving car, sharing economy, Silicon Valley, skunkworks, Skype, smart contracts, Snapchat, social software, software is eating the world, speech recognition, stealth mode startup, Stephen Hawking, Steve Jobs, subscription business, supply-chain management, TaskRabbit, telepresence, telepresence robot, Tony Hsieh, transaction costs, Travis Kalanick, Tyler Cowen: Great Stagnation, uber lyft, urban planning, WikiLeaks, winner-take-all economy, X Prize, Y Combinator, zero-sum game

Ten years later, its revenues had jumped 125x and the company was generating a half-billion dollars every three days. At the heart of this staggering growth was the PageRank algorithm, which ranks the popularity of web pages. (Google doesn’t gauge which page is better from a human perspective; its algorithms simply respond to the pages that deliver the most clicks.) Google isn’t alone. Today, the world is pretty much run on algorithms. From automotive anti-lock braking to Amazon’s recommendation engine; from dynamic pricing for airlines to predicting the success of upcoming Hollywood blockbusters; from writing news posts to air traffic control; from credit card fraud detection to the 2 percent of posts that Facebook shows a typical user—algorithms are everywhere in modern life. Recently, McKinsey estimated that of the seven hundred end-to-end bank processes (opening an account or getting a car loan, for example), about half can be fully automated.

Not only has he made that rare transition from founder to large-company CEO, but he has also consistently avoided the short-term thinking that so often comes with running a public company—what Joi Ito calls “nowism.” Amazon regularly makes long bets (e.g., Amazon Web Services, Kindle, and now Fire smartphones and delivery drones), views new products as if they are seedlings needing careful tending for a five-to-seven-year period, is maniacal about growth over profits and ignores the short-term view of Wall Street analysts. Its pioneering initiatives include its Affiliate Program, its recommendation engine (collaborative filtering) and the Mechanical Turk project. As Bezos says, “If you’re competitor-focused, you have to wait until there is a competitor doing something. Being customer-focused allows you to be more pioneering.” Not only has Amazon built ExOs on its edges (such as AWS), it also has had the courage to cannibalize its own products (e.g., Kindle). In addition, after realizing that Amazon’s culture wasn’t a perfect fit with the outstanding service he wanted to offer, Bezos spent $1.2 billion in 2009 to acquire Zappos.


pages: 308 words: 84,713

The Glass Cage: Automation and Us by Nicholas Carr

Airbnb, Airbus A320, Andy Kessler, Atul Gawande, autonomous vehicles, Bernard Ziegler, business process, call centre, Captain Sullenberger Hudson, Charles Lindbergh, Checklist Manifesto, cloud computing, computerized trading, David Brooks, deliberate practice, deskilling, digital map, Douglas Engelbart, drone strike, Elon Musk, Erik Brynjolfsson, Flash crash, Frank Gehry, Frank Levy and Richard Murnane: The New Division of Labor, Frederick Winslow Taylor, future of work, global supply chain, Google Glasses, Google Hangouts, High speed trading, indoor plumbing, industrial robot, Internet of things, Jacquard loom, James Watt: steam engine, job automation, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, Kevin Kelly, knowledge worker, Lyft, Marc Andreessen, Mark Zuckerberg, means of production, natural language processing, new economy, Nicholas Carr, Norbert Wiener, Oculus Rift, pattern recognition, Peter Thiel, place-making, plutocrats, Plutocrats, profit motive, Ralph Waldo Emerson, RAND corporation, randomized controlled trial, Ray Kurzweil, recommendation engine, robot derives from the Czech word robota Czech, meaning slave, Second Machine Age, self-driving car, Silicon Valley, Silicon Valley ideology, software is eating the world, Stephen Hawking, Steve Jobs, TaskRabbit, technoutopianism, The Wealth of Nations by Adam Smith, turn-by-turn navigation, US Airways Flight 1549, Watson beat the top human players on Jeopardy!, William Langewiesche

Thanks to the proliferation of smartphones, tablets, and other small, affordable, and even wearable computers, we now depend on software to carry out many of our daily chores and pastimes. We launch apps to aid us in shopping, cooking, exercising, even finding a mate and raising a child. We follow turn-by-turn GPS instructions to get from one place to the next. We use social networks to maintain friendships and express our feelings. We seek advice from recommendation engines on what to watch, read, and listen to. We look to Google, or to Apple’s Siri, to answer our questions and solve our problems. The computer is becoming our all-purpose tool for navigating, manipulating, and understanding the world, in both its physical and its social manifestations. Just think what happens these days when people misplace their smartphones or lose their connections to the net.

Like all analytical programs, they have a bias toward criteria that lend themselves to statistical analysis, downplaying those that entail the exercise of taste or other subjective judgments. Automated essay-grading algorithms encourage in students a rote mastery of the mechanics of writing. The programs are deaf to tone, uninterested in knowledge’s nuances, and actively resistant to creative expression. The deliberate breaking of a grammatical rule may delight a reader, but it’s anathema to a computer. Recommendation engines, whether suggesting a movie or a potential love interest, cater to our established desires rather than challenging us with the new and unexpected. They assume we prefer custom to adventure, predictability to whimsy. The technologies of home automation, which allow things like lighting, heating, cooking, and entertainment to be meticulously programmed, impose a Taylorist mentality on domestic life.


Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman

cloud computing, crowdsourcing, en.wikipedia.org, first-price auction, G4S, information retrieval, John Snow's cholera map, Netflix Prize, NP-complete, PageRank, pattern recognition, random walk, recommendation engine, second-price auction, sentiment analysis, social graph, statistical model, web application

The items recommended to a user are those preferred by similar users. This sort of recommendation system can use the groundwork laid in Chapter 3 on similarity search and Chapter 7 on clustering. However, these technologies by themselves are not sufficient, and there are some new algorithms that have proven effective for recommendation systems.

9.1 A Model for Recommendation Systems

In this section we introduce a model for recommendation systems, based on a utility matrix of preferences. We introduce the concept of a “long-tail,” which explains the advantage of on-line vendors over conventional, brick-and-mortar vendors. We then briefly survey the sorts of applications in which recommendation systems have proved useful.

9.1.1 The Utility Matrix

In a recommendation-system application there are two classes of entities, which we shall refer to as users and items.
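A utility matrix is easy to picture as a sparse structure whose absent entries are exactly what the recommender must predict. A toy sketch with invented ratings:

```python
# A tiny utility matrix: rows are users, columns are items, entries are known ratings.
# Missing entries (the ones a recommender must predict) are simply absent.

utility = {
    "alice": {"Matrix": 5, "Titanic": 1},
    "bob":   {"Matrix": 4, "Star Wars": 5},
    "carol": {"Titanic": 5, "Star Wars": 1},
}

def known_rating(user, item):
    return utility.get(user, {}).get(item)   # None means "unknown entry"

print(known_rating("alice", "Matrix"))      # 5
print(known_rating("alice", "Star Wars"))   # None -> this is what the system tries to predict
```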

See, for example, www.chesterfields.info.
3 Thanks to Anna Karlin for this example.

9 Recommendation Systems

There is an extensive class of Web applications that involve predicting user responses to options. Such a facility is called a recommendation system. We shall begin this chapter with a survey of the most important examples of these systems. However, to bring the problem into focus, two good examples of recommendation systems are:

(1) Offering news articles to on-line newspaper readers, based on a prediction of reader interests.
(2) Offering customers of an on-line retailer suggestions about what they might like to buy, based on their past history of purchases and/or product searches.

Recommendation systems use a number of different technologies. We can classify these systems into two broad groups.

Thus, one can examine the ratings of any movie to see if its ratings have an upward or downward slope with time.

9.6 Summary of Chapter 9

✦ Utility Matrices: Recommendation systems deal with users and items. A utility matrix offers known information about the degree to which a user likes an item. Normally, most entries are unknown, and the essential problem of recommending items to users is predicting the values of the unknown entries based on the values of the known entries.
✦ Two Classes of Recommendation Systems: These systems attempt to predict a user’s response to an item by discovering similar items and the response of the user to those. One class of recommendation system is content-based; it measures similarity by looking for common features of the items. A second class of recommendation system uses collaborative filtering; these measure similarity of users by their item preferences and/or measure similarity of items by the users who like them.
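A minimal illustration of the collaborative side of that summary—measuring the similarity of items by the users who rate them—using invented ratings and plain cosine similarity:

```python
# Collaborative item-item similarity: two items are similar if the same users rate them similarly.
# Ratings are invented; similarity is cosine over each item's user-rating vector.
from math import sqrt

ratings = {   # item -> {user: rating}
    "Matrix":    {"alice": 5, "bob": 4, "dave": 4},
    "Star Wars": {"bob": 5, "carol": 1, "dave": 4},
    "Titanic":   {"alice": 1, "carol": 5},
}

def cosine(item_a, item_b):
    common = set(ratings[item_a]) & set(ratings[item_b])
    if not common:
        return 0.0
    dot = sum(ratings[item_a][u] * ratings[item_b][u] for u in common)
    norm_a = sqrt(sum(r * r for r in ratings[item_a].values()))
    norm_b = sqrt(sum(r * r for r in ratings[item_b].values()))
    return dot / (norm_a * norm_b)

print(round(cosine("Matrix", "Star Wars"), 2))  # higher: their shared viewers rate both well
print(round(cosine("Matrix", "Titanic"), 2))    # lower: their audiences barely overlap
```

Item-item formulations like this tend to be popular in practice because item similarities are usually more stable over time than user similarities.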


pages: 88 words: 25,047

The Mathematics of Love: Patterns, Proofs, and the Search for the Ultimate Equation by Hannah Fry

Brownian motion, John Nash: game theory, linear programming, Nash equilibrium, Pareto efficiency, recommendation engine, Skype, statistical model

And that’s it – apply this algorithm to the hundreds of available questions and repeat for each of the millions of users on OkCupid and you’ve got everything you need for one of the world’s most successful dating websites. It’s one of the most elegant approaches ever attempted to pairing couples based on their personal preferences. Together with eHarmony and other similar websites, OkCupid sits alongside Amazon and Netflix as one of the most widely used recommendation engines on the internet. But there’s one problem – if the internet is the ultimate matchmaker, why are people still going on terrible dates? If the science is so good, surely that first date will be the last first date of your life? Shouldn’t the algorithm be able to deliver the perfect partner and leave it at that? Maybe the questionnaires and match percentages aren’t all they’re cracked up to be.
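The excerpt doesn't restate the algorithm, but the matching scheme Fry describes is usually summarized as: each person weights questions by importance, you compute how satisfied each would be with the other's answers, and the match percentage is the geometric mean of the two scores. The sketch below follows that description only loosely; the questions, answers, and importance point values are assumptions for illustration:

```python
# Hedged sketch of an OkCupid-style match score: weighted satisfaction each way,
# then the geometric mean. Questions, answers, and point values are invented.
from math import sqrt

IMPORTANCE = {"irrelevant": 0, "a little": 1, "somewhat": 10, "very": 50}

# Each entry: (own answer, answers acceptable in a partner, importance to this person)
alice = {
    "smoking":      ("no",  {"no"},        "very"),
    "horror_films": ("yes", {"yes", "no"}, "a little"),
}
brian = {
    "smoking":      ("no", {"no"}, "somewhat"),
    "horror_films": ("no", {"no"}, "very"),
}

def satisfaction(person, partner):
    """How happy `person` would be with `partner`'s answers, as a weighted fraction."""
    earned = possible = 0
    for question, (_, acceptable, importance) in person.items():
        weight = IMPORTANCE[importance]
        possible += weight
        if partner[question][0] in acceptable:
            earned += weight
    return earned / possible if possible else 0.0

match = sqrt(satisfaction(alice, brian) * satisfaction(brian, alice))
print(f"match: {match:.0%}")   # match: 41%
```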


pages: 374 words: 94,508

Infonomics: How to Monetize, Manage, and Measure Information as an Asset for Competitive Advantage by Douglas B. Laney

3D printing, Affordable Care Act / Obamacare, banking crisis, blockchain, business climate, business intelligence, business process, call centre, chief data officer, Claude Shannon: information theory, commoditize, conceptual framework, crowdsourcing, dark matter, data acquisition, digital twin, discounted cash flows, disintermediation, diversification, en.wikipedia.org, endowment effect, Erik Brynjolfsson, full employment, informal economy, intangible asset, Internet of things, linked data, Lyft, Nash equilibrium, Network effects, new economy, obamacare, performance metric, profit motive, recommendation engine, RFID, semantic web, smart meter, Snapchat, software as a service, source of truth, supply-chain management, text mining, uber lyft, Y2K, yield curve

Open government initiatives for economic development and for health, welfare, and citizen services are in various stages of implementation throughout the world. This information can also have real commercial value—especially when mashed with other sources—to understand and act on local or global market conditions, population trends, and weather, for example. Public data even can be used to create new (ahem) high-value businesses such as Potbot, a virtual cannabis “budtender.” At its core is a recommendation engine that uses information on strains, cannabinoids, and medical applications aggregated via semantic web technology. Potbot also incorporates data from cannabis seed DNA scans along with recordings of brain activity in clinical tests. It monetizes this information, not just in the form of a consumer app, but also in helping growers improve their yields for the most popular or beneficial strains. Public data is most monetizable when integrated with your own proprietary information.

So with the help of analytics software from Emcien, it produced a demand-shaping pattern analysis for determining the optimal number of product configuration options, resulting in a $110 million bump in revenues and a 5 percent increase in sales efficiency.

Optimizing Business Processes

Ultimately, any form of information monetization is the result of some business process or combination of business processes. BI tools generally are standalone with respect to the business processes that they support. Even when embedded into business applications, they tend to present charts or numbers in an application window. Ideally, output is updated to reflect the user’s activity and needs, but less often is it used to affect the business process directly. Evolving to complex-event processing solutions, recommendation engines, rule-based systems, or artificial intelligence (AI), combined with business process management and workflow systems, can help to optimize business processes more directly, either supplementing or supplanting human intervention. Case in point: a company formed from a collection of shopping stalls in 1919 by an English trader named Jack Cohen today has hardwired its thousands of refrigeration units to a data warehouse.


pages: 307 words: 88,180

AI Superpowers: China, Silicon Valley, and the New World Order by Kai-Fu Lee

AI winter, Airbnb, Albert Einstein, algorithmic trading, artificial general intelligence, autonomous vehicles, barriers to entry, basic income, business cycle, cloud computing, commoditize, computer vision, corporate social responsibility, creative destruction, crony capitalism, Deng Xiaoping, deskilling, Donald Trump, Elon Musk, en.wikipedia.org, Erik Brynjolfsson, full employment, future of work, gig economy, Google Chrome, happiness index / gross national happiness, if you build it, they will come, ImageNet competition, income inequality, informal economy, Internet of things, invention of the telegraph, Jeff Bezos, job automation, John Markoff, Kickstarter, knowledge worker, Lean Startup, low skilled workers, Lyft, mandatory minimum, Mark Zuckerberg, Menlo Park, minimum viable product, natural language processing, new economy, pattern recognition, pirate software, profit maximization, QR code, Ray Kurzweil, recommendation engine, ride hailing / ride sharing, risk tolerance, Robert Mercer, Rodney Brooks, Rubik’s Cube, Sam Altman, Second Machine Age, self-driving car, sentiment analysis, sharing economy, Silicon Valley, Silicon Valley ideology, Silicon Valley startup, Skype, special economic zone, speech recognition, Stephen Hawking, Steve Jobs, strong AI, The Future of Employment, Travis Kalanick, Uber and Lyft, uber lyft, universal basic income, urban planning, Y Combinator

Do video streaming sites have an uncanny knack for recommending that next video that you’ve just got to check out before you get back to work? Does Amazon seem to know what you’ll want to buy before you do? If so, then you have been the beneficiary (or victim, depending on how you value your time, privacy, and money) of internet AI. This first wave began almost fifteen years ago but finally went mainstream around 2012. Internet AI is largely about using AI algorithms as recommendation engines: systems that learn our personal preferences and then serve up content hand-picked for us. The horsepower of these AI engines depends on the digital data they have access to, and there’s currently no greater storehouse of this data than the major internet companies. But that data only becomes truly useful to algorithms once it has been labeled. In this case, “labeled” doesn’t mean you have to actively rate the content or tag it with a keyword.

See artificial intelligence (AI) AI engineers, 14 Airbnb, 39, 49, 73 AI revolution deep learning and, 5, 25, 92, 94, 143 economic impact of, 151–52 speed of, 152–55 AI winters, 6–7, 8, 9, 10 algorithmic bias, 229 algorithms, AI AI revolution and, 152–53 computing power and, 14, 56 credit and, 112–13 data and, 14, 17, 56, 138 fake news detection by, 109 intelligence sharing and, 87 legal applications for, 115–16 medical diagnosis and, 114–15 as recommendation engines, 107–8 robot reporting, 108 white-collar workers and, 167, 168 Alibaba Amazon compared to, 109 Chinese startups and, 58 City Brain, 93–94, 117, 124, 228 as dominant AI player, 83, 91, 93–94 eBay and, 34–35 financial services spun off from, 73 four waves of AI and, 106, 107, 109 global markets and, 137 grid approach and, 95 Microsoft Research Asia and, 89 mobile payments transition, 76 New York Stock Exchange debut, 66–67 online purchasing and, 68 success of, 40 Tencent’s “Pearl Harbor attack” on, 60–61 Wang Xing and, 24 Alipay, 35, 60, 69, 73–74, 75, 112, 118 Alphabet, 92–93 AlphaGo, 1–4, 5, 6, 11, 199 AlphaGo Zero, 90 Altman, Sam, 207 Amazon Alibaba compared to, 109 Chinese market and, 39 data captured by, 77 as dominant AI player, 83, 91 four waves of AI and, 106 grid approach and, 95 innovation mentality at, 33 monopoly of e-commerce, 170 online purchasing and, 68 Wang Xing and, 24 warehouses, 129–30 Amazon Echo, 117, 127 Amazon Go, 163, 213 Anderson, Chris, 130 Andreesen Horowitz, 70 Ant Financial, 73 antitrust laws, 20, 28, 171, 229 Apollo project, 135 app constellation model, 70 Apple, 33, 75, 117, 126, 143, 177, 184 Apple Pay, 75, 76 app-within-an-app model, 59 ARM (British firm), 96 Armstrong, Neil, 3 artificial general intelligence (AGI), 140–44 artificial intelligence (AI) introduction to, ix–xi See also China; deep learning; economy and AI; four waves of AI; global AI story; human coexistence with AI; new world order artificial superintelligence.


pages: 366 words: 94,209

Throwing Rocks at the Google Bus: How Growth Became the Enemy of Prosperity by Douglas Rushkoff

activist fund / activist shareholder / activist investor, Airbnb, algorithmic trading, Amazon Mechanical Turk, Andrew Keen, bank run, banking crisis, barriers to entry, bitcoin, blockchain, Burning Man, business process, buy and hold, buy low sell high, California gold rush, Capital in the Twenty-First Century by Thomas Piketty, carbon footprint, centralized clearinghouse, citizen journalism, clean water, cloud computing, collaborative economy, collective bargaining, colonial exploitation, Community Supported Agriculture, corporate personhood, corporate raider, creative destruction, crowdsourcing, cryptocurrency, disintermediation, diversified portfolio, Elon Musk, Erik Brynjolfsson, Ethereum, ethereum blockchain, fiat currency, Firefox, Flash crash, full employment, future of work, gig economy, Gini coefficient, global supply chain, global village, Google bus, Howard Rheingold, IBM and the Holocaust, impulse control, income inequality, index fund, iterative process, Jaron Lanier, Jeff Bezos, jimmy wales, job automation, Joseph Schumpeter, Kickstarter, loss aversion, Lyft, Marc Andreessen, Mark Zuckerberg, market bubble, market fundamentalism, Marshall McLuhan, means of production, medical bankruptcy, minimum viable product, Mitch Kapor, Naomi Klein, Network effects, new economy, Norbert Wiener, Oculus Rift, passive investing, payday loans, peer-to-peer lending, Peter Thiel, post-industrial society, profit motive, quantitative easing, race to the bottom, recommendation engine, reserve currency, RFID, Richard Stallman, ride hailing / ride sharing, Ronald Reagan, Satoshi Nakamoto, Second Machine Age, shareholder value, sharing economy, Silicon Valley, Snapchat, social graph, software patent, Steve Jobs, TaskRabbit, The Future of Employment, trade route, transportation-network company, Turing test, Uber and Lyft, Uber for X, uber lyft, unpaid internship, Y Combinator, young professional, zero-sum game, Zipcar

., became one of the first publicly traded Internet giants, responsible (or to blame) for not only the first e-commerce Web sites but also the first banner ad. Matthew was likely just as surprised by where this all went as I was. The information superhighway morphed into an interactive strip mall; digital technology’s ability to connect people to products, facilitate payments, and track behaviors led to all sorts of new marketing and sales innovations. “Buy” buttons triggered the impulse for instant gratification, while recommendation engines personalized marketing pitches. It was commerce on crack. With a few notable exceptions—such as eBay and Etsy—we didn’t really get a return of the many-to-many marketplace or digital bazaar. No, in online commerce it’s mostly a few companies selling to many, and many people selling to the very few—if anyone at all. Take music. The best part of an online music catalogue is that it is unlimited in size.

Amazon then leveraged its monopoly in books and free shipping to develop monopolies in other verticals, beginning with home electronics (bankrupting Circuit City and Best Buy), and then every other link in the physical and virtual fulfillment chain, from shoes and food to music and videos. Finally, Amazon flips into personhood by reversing the traditional relationship between people and machines. Amazon’s patented recommendation engines attempt to drive our human selection process. Amazon Mechanical Turks gave computers the ability to mete out repetitive tasks to legions of human drones. The computers did the thinking and choosing; the people pointed and clicked as they were instructed or induced to do. Neither Amazon nor its founder, Jeff Bezos, is slipping to new lows here. The company is simply operating true to the core program of corporatism, expressed through new digital means.


pages: 301 words: 85,126

AIQ: How People and Machines Are Smarter Together by Nick Polson, James Scott

Air France Flight 447, Albert Einstein, Amazon Web Services, Atul Gawande, autonomous vehicles, availability heuristic, basic income, Bayesian statistics, business cycle, Cepheid variable, Checklist Manifesto, cloud computing, combinatorial explosion, computer age, computer vision, Daniel Kahneman / Amos Tversky, Donald Trump, Douglas Hofstadter, Edward Charles Pickering, Elon Musk, epigenetics, Flash crash, Grace Hopper, Gödel, Escher, Bach, Harvard Computers: women astronomers, index fund, Isaac Newton, John von Neumann, late fees, low earth orbit, Lyft, Magellanic Cloud, mass incarceration, Moneyball by Michael Lewis explains big data, Moravec's paradox, more computing power than Apollo, natural language processing, Netflix Prize, North Sea oil, p-value, pattern recognition, Pierre-Simon Laplace, ransomware, recommendation engine, Ronald Reagan, self-driving car, sentiment analysis, side project, Silicon Valley, Skype, smart cities, speech recognition, statistical model, survivorship bias, the scientific method, Thomas Bayes, Uber for X, uber lyft, universal basic income, Watson beat the top human players on Jeopardy!, young professional

As recently as 2010, the company’s core business involved filling red envelopes with DVDs that would incur “no late fees, ever!” Each envelope would come back a few days after it had been sent out, along with the subscriber’s rating of the film on a 1-to-5 scale. As that ratings data accumulated, Netflix’s algorithms would look for patterns, and over time, subscribers would get better film recommendations. (This kind of AI is usually called a “recommender system”; we also like the term “suggestion engine.”) Netflix 1.0 was so focused on improving its recommender system that in 2007, to great fanfare among math geeks the world over, it announced a public machine-learning contest with a prize of $1 million. The company put some of its ratings data on a public server, and it challenged all comers to improve upon Netflix’s own system, called Cinematch, by at least 10%—that is, by predicting how you’d rate a film with 10% better accuracy than Netflix could.
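The contest's yardstick was root-mean-square error on held-out ratings: an entry qualified if its RMSE beat Cinematch's by at least 10%. The sketch below shows that comparison with made-up ratings and predictions:

```python
# The Netflix Prize criterion: beat the baseline's root-mean-square error (RMSE)
# on held-out ratings by at least 10%. All ratings and predictions here are invented.
from math import sqrt

actual     = [4, 3, 5, 2, 4, 1]
cinematch  = [3.6, 3.4, 4.1, 2.9, 3.5, 2.2]   # baseline predictions (made up)
challenger = [3.9, 3.1, 4.6, 2.4, 3.8, 1.5]   # a contestant's predictions (made up)

def rmse(predicted, truth):
    return sqrt(sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth))

baseline, entry = rmse(cinematch, actual), rmse(challenger, actual)
improvement = (baseline - entry) / baseline
print(f"baseline RMSE {baseline:.3f}, entry RMSE {entry:.3f}, improvement {improvement:.1%}")
print("qualifies for the prize" if improvement >= 0.10 else "keep tuning")
```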

Their fighter escorts: the Spitfires and P-51 Mustangs sent along to defend the bombers from the Luftwaffe. 3. A Hungarian-American statistician named Abraham Wald. Abraham Wald never shot down a Messerschmitt or even saw the inside of a combat aircraft. Nonetheless, he made an outsized contribution to the Allied war effort using an equally potent weapon: conditional probability. Specifically, Wald built a recommender system that could make personalized survivability suggestions for different kinds of planes. At its heart, it was just like a modern AI-based recommender system for TV shows. And when you understand how he built it, you’ll also understand a lot more about Netflix, Hulu, Spotify, Instagram, Amazon, YouTube, and just about every tech company that’s ever made you an automatic suggestion worth following. Wald’s Early Years Abraham Wald was born in 1902 to a large Orthodox Jewish family in Kolozsvár, Hungary, which became part of Romania and changed its name to Cluj after World War I.

See health care and medicine Medtronic Menger, Karl Microsoft Microsoft Azure modeling assumptions and deep-learning models imputation and Inception latent feature massive models missing data and model rust natural language processing and prediction rules as reality versus rules-based (top-down) models training the model Moneyball Moore’s law Moravec paradox Morgenstern, Oskar Musk, Elon natural language processing (NLP) ambiguity and bottom-up approach chatbots digital assistants future trends Google Translate growth of statistical NLP knowing how versus knowing that natural language revolution “New Deal” for human-machine linguistic interaction prediction rules and programing language revolution robustness and rule bloat and speech recognition top-down approach word co-location statistics word vectors naturally occurring radioactive materials (NORM) Netflix Crown, The (series) data scientists history of House of Cards (series) Netflix Prize for recommender system personalization recommender systems neural networks deep learning and Friends new episodes and Inception model prediction rules and New England Patriots Newton, Isaac Nightingale, Florence coxcomb diagram (1858) Crimean War and early years and training evidence-based medicine legacy of “lady with the lamp” medical statistics legacy of nursing reform legacy of Nvidia Obama, Barack Office of Scientific Research and Development parallax pattern recognition cucumber sorting input and output learning a pattern maximum heart rate and prediction rules and toilet paper theft and See also prediction rules PayPal personalization conditional probability and latent feature models and Netflix and Wald’s survivability recommendations for aircraft and See also recommender systems; suggestion engines philosophy Pickering, Edward C.


Data Mining: Concepts and Techniques: Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei

bioinformatics, business intelligence, business process, Claude Shannon: information theory, cloud computing, computer vision, correlation coefficient, cyber-physical system, database schema, discrete time, distributed generation, finite state, information retrieval, iterative process, knowledge worker, linked data, natural language processing, Netflix Prize, Occam's razor, pattern recognition, performance metric, phenotype, random walk, recommendation engine, RFID, semantic web, sentiment analysis, speech recognition, statistical model, stochastic process, supply-chain management, text mining, thinkpad, Thomas Bayes, web application

If consumers follow a system recommendation but then do not end up liking the product, they are less likely to use the recommender system again. As with classification systems, recommender systems can make two types of errors: false negatives and false positives. Here, false negatives are products that the system fails to recommend, although the consumer would like them. False positives are products that are recommended, but which the consumer does not like. False positives are less desirable because they can annoy or anger consumers. Content-based recommender systems are limited by the features used to describe the items they recommend. Another challenge for both content-based and collaborative recommender systems is how to deal with new users for which a buying history is not yet available. Hybrid approaches integrate both content-based and collaborative methods to achieve further improved recommendations.
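In evaluation terms, those two error types are simple counts over a labeled test set, and the asymmetry described here is one reason recommender evaluation often emphasizes precision. A toy sketch with invented data:

```python
# Counting the two recommender error types described above, on invented test data.
# Each record: (item, was_recommended, user_actually_liked_it)
results = [
    ("book A", True,  True),   # good recommendation
    ("book B", True,  False),  # false positive: recommended but not liked (annoys the user)
    ("book C", False, True),   # false negative: would have been liked but never recommended
    ("book D", False, False),
    ("book E", True,  True),
]

false_positives = sum(1 for _, rec, liked in results if rec and not liked)
false_negatives = sum(1 for _, rec, liked in results if not rec and liked)
precision = (sum(1 for _, rec, liked in results if rec and liked)
             / sum(1 for _, rec, _ in results if rec))

print(false_positives, false_negatives, round(precision, 2))  # 1 1 0.67
```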

In summary, computer systems are at continual risk of breaks in security. Data mining technology can be used to develop strong intrusion detection and prevention systems, which may employ signature-based or anomaly-based detection. 13.3.5. Data Mining and Recommender Systems Today's consumers are faced with millions of goods and services when shopping online. Recommender systems help consumers by making product recommendations that are likely to be of interest to the user such as books, CDs, movies, restaurants, online news articles, and other services. Recommender systems may use either a content-based approach, a collaborative approach, or a hybrid approach that combines both content-based and collaborative methods. The content-based approach recommends items that are similar to items the user preferred or queried in the past.

They make use of keywords (describing the items) and user profiles that contain information about users' tastes and needs. Such profiles may be obtained explicitly (e.g., through questionnaires) or learned from users' transactional behavior over time. A collaborative recommender system tries to predict the utility of items for a user, u, based on items previously rated by other users who are similar to u. For example, when recommending books, a collaborative recommender system tries to find other users who have a history of agreeing with u (e.g., they tend to buy similar books, or give similar ratings for books). Collaborative recommender systems can be either memory (or heuristic) based or model based. Memory-based methods essentially use heuristics to make rating predictions based on the entire collection of items previously rated by users. That is, the unknown rating of an item–user combination can be estimated as an aggregate of ratings of the most similar users for the same item.
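A compact sketch of that memory-based approach: compute user-user similarity from co-rated items, then estimate the unknown rating as a similarity-weighted aggregate of the most similar users' ratings. The ratings, similarity measure, and neighborhood size below are illustrative assumptions:

```python
# Memory-based collaborative filtering: estimate an unknown rating as a
# similarity-weighted aggregate of the most similar users' ratings. Data is invented.
from math import sqrt

ratings = {  # user -> {book: rating}
    "u":     {"Dune": 5, "Neuromancer": 4},
    "ana":   {"Dune": 5, "Neuromancer": 5, "Foundation": 5},
    "boris": {"Dune": 2, "Neuromancer": 1, "Foundation": 2},
}

def cosine(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    dot = sum(ratings[a][i] * ratings[b][i] for i in common)
    na = sqrt(sum(ratings[a][i] ** 2 for i in common))
    nb = sqrt(sum(ratings[b][i] ** 2 for i in common))
    return dot / (na * nb)

def predict(user, item, k=2):
    """Weighted average of the k most similar users' ratings for `item`."""
    neighbours = sorted(
        ((cosine(user, other), ratings[other][item])
         for other in ratings if other != user and item in ratings[other]),
        reverse=True)[:k]
    num = sum(sim * r for sim, r in neighbours)
    den = sum(sim for sim, _ in neighbours)
    return num / den if den else None

print(round(predict("u", "Foundation", k=1), 2))  # 5.0  (only the single most similar user, ana)
print(round(predict("u", "Foundation", k=2), 2))  # 3.51 (similarity-weighted blend of ana and boris)
```

Real systems typically refine this by mean-centering each user's ratings or using Pearson correlation, which keeps generous and harsh raters comparable.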


pages: 94 words: 26,453

The End of Nice: How to Be Human in a World Run by Robots (Kindle Single) by Richard Newton

3D printing, Black Swan, British Empire, Buckminster Fuller, Clayton Christensen, crowdsourcing, deliberate practice, disruptive innovation, fear of failure, Filter Bubble, future of work, Google Glasses, Isaac Newton, James Dyson, Jaron Lanier, Jeff Bezos, job automation, lateral thinking, Lean Startup, low skilled workers, Mark Zuckerberg, move fast and break things, move fast and break things, Paul Erdős, Paul Graham, recommendation engine, rising living standards, Robert Shiller, Robert Shiller, Silicon Valley, Silicon Valley startup, skunkworks, social intelligence, Steve Ballmer, Steve Jobs, Y Combinator

Just as the sirens of legend sang sweet songs to lure sailors to crash on the rocky shore of their island, so Lanier thinks we must be wary of the attractions of the siren servers. They don’t want to make your life more complicated. They are there to make everything frictionless: “Leave it to me”, they sing. “I’ll find you new music you might like, books you’ll want to read, videos you want to watch and friends you should like.” We’re sort of used to the idea that recommendation engines work like this. We know that ads now follow us around the web and that books will be unhelpfully recommended to us by Amazon. But search results are also tailored to you. And that’s more of a concern. The search results you get will be different to the results for an identical search made by me. In fact, so much insight can be derived from your online behaviour that Google and other organisations can ensure you get news that makes you happy… or even angry the way you like to be angry.


pages: 268 words: 75,850

The Formula: How Algorithms Solve All Our Problems-And Create More by Luke Dormehl

3D printing, algorithmic trading, Any sufficiently advanced technology is indistinguishable from magic, augmented reality, big data - Walmart - Pop Tarts, call centre, Cass Sunstein, Clayton Christensen, commoditize, computer age, death of newspapers, deferred acceptance, disruptive innovation, Edward Lorenz: Chaos theory, Erik Brynjolfsson, Filter Bubble, Flash crash, Florence Nightingale: pie chart, Frank Levy and Richard Murnane: The New Division of Labor, Google Earth, Google Glasses, High speed trading, Internet Archive, Isaac Newton, Jaron Lanier, Jeff Bezos, job automation, John Markoff, Kevin Kelly, Kodak vs Instagram, lifelogging, Marshall McLuhan, means of production, Nate Silver, natural language processing, Netflix Prize, Panopticon Jeremy Bentham, pattern recognition, price discrimination, recommendation engine, Richard Thaler, Rosa Parks, self-driving car, sentiment analysis, Silicon Valley, Silicon Valley startup, Slavoj Žižek, social graph, speech recognition, Steve Jobs, Steven Levy, Steven Pinker, Stewart Brand, the scientific method, The Signal and the Noise by Nate Silver, upwardly mobile, Wall-E, Watson beat the top human players on Jeopardy!, Y Combinator

Conversely, scores fall dramatically in situations where the task takes longer than expected.33 Decimated-Reality Aggregators Speaking in October 1944, during the rebuilding of the House of Commons, which had sustained heavy bombing damage during the Battle of Britain, former British prime minister Winston Churchill observed, “We shape our buildings; thereafter they shape us.”34 A similar sentiment might be said in the age of The Formula, in which users shape their online profiles, and from that point forward their online profiles begin to shape them—both in terms of what we see and, perhaps more crucially, what we don’t. Writing about a start-up called Nara, in the middle of 2013, I coined the phrase “decimated reality aggregators” to describe what the company was trying to do.35 Starting out as a restaurant recommender system by connecting together thousands of restaurants around the world, Nara’s ultimate goal was to become the recommender system for your life: drawing on what it knew about you from the restaurants you ate in, to suggest everything from hotels to clothes. Nara even incorporated the idea of upward mobility into its algorithm. Say, for example, you wanted to be a wine connoisseur two years down the line, but currently had no idea how to tell your Chardonnay from your Chianti.

In all, eHarmony’s arrival represented not just another addition to an already crowded field of Internet dating websites but a qualitative change in the way that Internet dating was carried out. “Neil was adamant that this should be based on science,” Carter says. Before eHarmony, the majority of dating websites took the form of searchable personal ads, of the kind that have been appearing in print since the 17th century.11 After eHarmony, the search engine model was replaced with a recommender system praised in press materials for its “scientific precision.” Instead of allowing users to scan through page after page of profiles, eHarmony simply required them to answer a series of questions—and then picked out the right option on their behalf. The website opened its virtual doors for the first time on August 22, 2000. There were a few initial teething problems. “Some people were critical of the matches they were getting,” Warren admits.

All a character has to do—as occurs during one scene in which the novel’s bumbling protagonist, Lenny Abramov, visits a Staten Island nightclub with his friends—is to set the “community parameters” of their iPhone-like device to a particular physical space and hit a button. At this point, every aspect of a person’s profile is revealed, including their “fuckability” and “personality” scores (both ranked on a scale of 800), along with their ranked “anal/oral/vaginal” preferences. There is even a recommender system incorporated, so that a user’s history of romantic relationships can be scrutinized for insights in much the same way that a person’s previous orders on Amazon might dictate what they will be interested in next. As one of Abramov’s friends notes, “This girl [has] a long multimedia thing on how her father abused her . . . Like, you’ve dated a lot of abused girls, so it knows you’re into that shit.”24 The world presented by Super Sad True Love Story is, in many ways, closer than you might think.


pages: 382 words: 105,819

Zucked: Waking Up to the Facebook Catastrophe by Roger McNamee

4chan, Albert Einstein, algorithmic trading, AltaVista, Amazon Web Services, barriers to entry, Bernie Sanders, Boycotts of Israel, Cass Sunstein, cloud computing, computer age, cross-subsidies, data is the new oil, Donald Trump, Douglas Engelbart, Electric Kool-Aid Acid Test, Elon Musk, Filter Bubble, game design, income inequality, Internet of things, Jaron Lanier, Jeff Bezos, John Markoff, laissez-faire capitalism, Lean Startup, light touch regulation, Lyft, Marc Andreessen, Mark Zuckerberg, market bubble, Menlo Park, Metcalfe’s law, minimum viable product, Mother of all demos, move fast and break things, Network effects, paypal mafia, Peter Thiel, pets.com, post-work, profit maximization, profit motive, race to the bottom, recommendation engine, Robert Mercer, Ronald Reagan, Sand Hill Road, self-driving car, Silicon Valley, Silicon Valley startup, Skype, Snapchat, social graph, software is eating the world, Stephen Hawking, Steve Jobs, Steven Levy, Stewart Brand, The Chicago School, Tim Cook: Apple, two-sided market, Uber and Lyft, Uber for X, uber lyft, Upton Sinclair, WikiLeaks, Yom Kippur War

Whether by design or by accident, platforms empower extreme views in a variety of ways. The ease with which like-minded extremists can find one another creates the illusion of legitimacy. Protected from real-world stigma, communication among extreme voices over internet platforms generally evolves to more dangerous language. Normalization lowers a barrier for the curious; algorithmic reinforcement leads some users to increasingly extreme positions. Recommendation engines can and do exploit that. For example, former YouTube algorithm engineer Guillaume Chaslot created a program to take snapshots of what YouTube would recommend to users. He learned that when a user watches a regular 9/11 news video, YouTube will then recommend 9/11 conspiracies; if a teenage girl watches a video on food dietary habits, YouTube will recommend videos that promote anorexia-related behaviors.

It is not for nothing that the industry jokes about YouTube’s “three degrees of Alex Jones,” referring to the notion that no matter where you start, YouTube’s algorithms will often surface a Jones conspiracy theory video within three recommendations. In an op-ed in Wired, my colleague Renée DiResta quoted YouTube chief product officer Neal Mohan as saying that 70 percent of the views on his platform are from recommendations. In the absence of a commitment to civic responsibility, the recommendation engine will be programmed to do the things that generate the most profit. Conspiracy theories cause users to spend more time on the site. Once a person identifies with an extreme position on an internet platform, he or she will be subject to both filter bubbles and human nature. A steady flow of ideas that confirm beliefs will lead many users to make choices that exclude other ideas both online and off.


pages: 391 words: 105,382

Utopia Is Creepy: And Other Provocations by Nicholas Carr

Air France Flight 447, Airbnb, Airbus A320, AltaVista, Amazon Mechanical Turk, augmented reality, autonomous vehicles, Bernie Sanders, book scanning, Brewster Kahle, Buckminster Fuller, Burning Man, Captain Sullenberger Hudson, centralized clearinghouse, Charles Lindbergh, cloud computing, cognitive bias, collaborative consumption, computer age, corporate governance, crowdsourcing, Danny Hillis, deskilling, digital map, disruptive innovation, Donald Trump, Electric Kool-Aid Acid Test, Elon Musk, factory automation, failed state, feminist movement, Frederick Winslow Taylor, friendly fire, game design, global village, Google bus, Google Glasses, Google X / Alphabet X, Googley, hive mind, impulse control, indoor plumbing, interchangeable parts, Internet Archive, invention of movable type, invention of the steam engine, invisible hand, Isaac Newton, Jeff Bezos, jimmy wales, Joan Didion, job automation, Kevin Kelly, lifelogging, low skilled workers, Marc Andreessen, Mark Zuckerberg, Marshall McLuhan, means of production, Menlo Park, mental accounting, natural language processing, Network effects, new economy, Nicholas Carr, Norman Mailer, off grid, oil shale / tar sands, Peter Thiel, plutocrats, Plutocrats, profit motive, Ralph Waldo Emerson, Ray Kurzweil, recommendation engine, Republic of Letters, robot derives from the Czech word robota Czech, meaning slave, Ronald Reagan, self-driving car, SETI@home, side project, Silicon Valley, Silicon Valley ideology, Singularitarianism, Snapchat, social graph, social web, speech recognition, Startup school, stem cell, Stephen Hawking, Steve Jobs, Steven Levy, technoutopianism, the medium is the message, theory of mind, Turing test, Whole Earth Catalog, Y Combinator

The great power of modern digital filters lies in their ability to make information that is of inherent interest to us immediately visible to us. The information may take the form of personal messages or updates from friends or colleagues, broadcast messages from experts or celebrities whose opinions or observations we value, headlines and stories from writers or publications we like, alerts about the availability of various other sorts of content on favorite subjects, or suggestions from recommendation engines—but it all shares the quality of being tailored to our particular interests. It’s all needles. And modern filters don’t just organize that information for us; they push the information at us as alerts, updates, streams. We tend to point to spam as an example of information overload. But spam is just an annoyance. The real source of information overload, at least of the ambient sort, is the stuff we like, the stuff we want.

To thine own image be true.
16. No great work of literature could have been written in hypertext.
17. Social media is a palliative for underemployment.
18. The philistine appears ideally suited to the role of cultural impresario online.
19. Television became more interesting when people started paying for it.
20. Instagram shows us what a world without art looks like.

SECOND SERIES (2013)

21. Recommendation engines are the best cure for hubris.
22. Vines would be better if they were one second shorter.
23. Hell is other selfies.
24. Twitter has revealed that brevity and verbosity are not always antonyms.
25. Personalized ads provide a running critique of artificial intelligence.
26. Who you are is what you do between notifications.
27. Online is to offline as a swimming pool to a pond.
28. People in love leave the sparsest data trails.
29.


pages: 163 words: 42,402

Machine Learning for Email by Drew Conway, John Myles White

call centre, correlation does not imply causation, Debian, natural language processing, Netflix Prize, pattern recognition, recommendation engine, SpamAssassin, text mining

Generating rules for ranking a list of items is an increasingly common task in machine learning, yet you may not have thought of it in these terms. More likely, you have heard of something like a recommendation system, which implicitly produces a ranking of products. Even if you have not heard of a recommendation system, it’s almost certain that you have used or interacted with a recommendation system at some point. Some of the most successful e-commerce websites have benefitted from leveraging data on their users to generate recommendations for other products their users might be interested in. For example, if you have ever shopped at Amazon.com, then you have interacted with a recommendation system. The problem Amazon faces is simple: what items in their inventory are you most likely to buy? The implication of that statement is that the items in Amazon’s inventory have an ordering specific to each user.

There are many excellent books that focus on the fundamentals, the seminal work being Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning HTF09.[1] But another important part of the hacker mantra is to learn by doing. Many hackers may be more comfortable thinking of problems in terms of the process by which a solution is attained, rather than the theoretical foundation from which the solution is derived. From this perspective, an alternative approach to teaching machine learning would be to use “cookbook” style examples. To understand how a recommendation system works, for example, we might provide sample training data and a version of the model, and show how the latter uses the former. There are many useful texts of this kind as well—Toby Segaran’s Programming Collective Intelligence is a recent example Seg07. Such a discussion would certainly address the how of a hacker’s method of learning, but perhaps less of the why. Along with understanding the mechanics of a method, we may also want to learn why it is used in a certain context or to address a specific problem.

The implication of that statement is that the items in Amazon’s inventory have an ordering specific to each user. Likewise, Netflix.com has a massive library of DVDs available to its customers to rent. In order for those customers to get the most out of the site, Netflix employs a sophisticated recommendation system to present people with rental suggestions. For both companies, these recommendations are based on two kinds of data. First, there is the data pertaining to the inventory itself. For Amazon, if the product is a television, this data might contain the type (i.e., plasma, LCD, LED), manufacturer, price, and so on. For Netflix, this data might be the genre of a film, its cast, director, running time, etc. Second, there is the data related to the browsing and purchasing behavior of the customers. This sort of data can help Amazon understand what accessories most people look for when shopping for a new plasma TV and can help Netflix understand which romantic comedies George A.
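A toy sketch of how those two kinds of data might be combined: item attributes stand in for the inventory data, a short purchase history stands in for the behavioral data, and the remaining inventory is ordered per user by how well it matches a profile built from that history. All names, attributes, and the scoring rule are invented for illustration.

from collections import Counter

# First kind of data: attributes describing the inventory (invented).
inventory = {
    "tv_plasma": {"type": "plasma", "brand": "acme"},
    "tv_led": {"type": "led", "brand": "acme"},
    "hdmi_cable": {"type": "accessory", "brand": "generic"},
}

# Second kind of data: what this user has browsed or bought (invented).
history = ["tv_plasma", "hdmi_cable"]

# Build a simple profile: how often each attribute value appears in the history.
profile = Counter(value
                  for item in history
                  for value in inventory[item].values())

def score(item):
    # score unseen items by how well their attributes match the profile
    return sum(profile[value] for value in inventory[item].values())

ranking = sorted((i for i in inventory if i not in history),
                 key=score, reverse=True)
print(ranking)   # the inventory, ordered specifically for this user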


pages: 161 words: 39,526

Applied Artificial Intelligence: A Handbook for Business Leaders by Mariya Yao, Adelyn Zhou, Marlene Jia

Airbnb, Amazon Web Services, artificial general intelligence, autonomous vehicles, business intelligence, business process, call centre, chief data officer, computer vision, conceptual framework, en.wikipedia.org, future of work, industrial robot, Internet of things, iterative process, Jeff Bezos, job automation, Marc Andreessen, natural language processing, new economy, pattern recognition, performance metric, price discrimination, randomized controlled trial, recommendation engine, self-driving car, sentiment analysis, Silicon Valley, skunkworks, software is eating the world, source of truth, speech recognition, statistical model, strong AI, technological singularity

Extensions of this technology include applications such as Pinterest’s Lens and eBay’s ShopBot, which recognize items in pictures uploaded by consumers and make recommendations of similar items currently for sale. The next frontier in recommendation systems is the cold-start scenario, in which algorithms must be able to draw good inferences about users or items despite insufficient information. Layer 6 AI, recently acquired by TD Bank, has focused on making relatively accurate predictions on noisy data in a cold-start scenario. Customer personalization is like a recommendation system on steroids, delivering highly relevant content, experience, or products to consumers without their having to exert additional effort. Companies such as Monetate, Retail Rocket, BloomReach, and Dynamic Yield now offer personalization engines that use visitor preferences and purchase histories to adjust website content in real-time, highlighting products that are most likely to sell. 18.
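One common way to handle the cold-start case described here is to fall back on item content when no behavioral history exists. The sketch below, with invented items and attribute sets, ranks everything else by content similarity to a single item the new user has just viewed, since collaborative signals are not yet available.

# Cold-start sketch: no purchase history yet, so fall back on item content.
# Attribute sets are invented for illustration.
items = {
    "sneaker_a": {"shoes", "running", "mesh"},
    "sneaker_b": {"shoes", "running", "leather"},
    "boot_a":    {"shoes", "hiking", "leather"},
    "jacket_a":  {"outerwear", "hiking", "waterproof"},
}

def jaccard(a, b):
    # overlap between two attribute sets
    return len(a & b) / len(a | b)

def cold_start_recommend(just_viewed, k=2):
    # with no history, rank everything else by content similarity
    scores = [(jaccard(items[just_viewed], attrs), name)
              for name, attrs in items.items()
              if name != just_viewed]
    return [name for _, name in sorted(scores, reverse=True)[:k]]

print(cold_start_recommend("sneaker_a"))   # e.g. ['sneaker_b', 'boot_a']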

Semi-supervised learning lies between supervised and unsupervised learning. Many real-world datasets have noisy, incorrect labels or are missing labels entirely, meaning that inputs and outputs are paired incorrectly with each other or are not paired at all. Active learning, a special case of semi-supervised learning, occurs when an algorithm actively queries a user to discover the right output or label for a new input. Active learning is used to optimize recommendation systems, like the ones used to recommend movies on Netflix or products on Amazon. Reinforcement learning is learning by trial-and-error, in which a computer program is instructed to achieve a stated goal in a dynamic environment. The program learns by repeatedly taking actions, measuring the feedback from those actions, and iteratively improving its behavioral policy. Reinforcement learning can be successfully applied to game-playing, robotic control, and other well-defined and contained problems.
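As a rough illustration of the active-learning idea mentioned above, the sketch below asks the user about the item whose predicted rating is least certain; the tiny prediction table and its uncertainty numbers are placeholders rather than the output of a real recommender.

# Active-learning sketch: query the user about the item we are least sure about.
# Predicted ratings and uncertainties are placeholder numbers.
predictions = {
    "movie_a": {"rating": 4.1, "uncertainty": 0.2},
    "movie_b": {"rating": 3.0, "uncertainty": 1.5},
    "movie_c": {"rating": 2.2, "uncertainty": 0.9},
}

def next_query(preds):
    # ask about the prediction with the highest uncertainty
    return max(preds, key=lambda item: preds[item]["uncertainty"])

item = next_query(predictions)
print(f"How would you rate {item}?")   # the answer becomes a new labeled example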

In machine learning, you can easily incur massive ongoing systems costs by failing to mitigate risks early in the development process.(84) Your most talented data scientists and machine learning engineers want to build new models. Few of them are dedicated to the unsexy tasks of maintaining existing models. However, the performance of your existing models will deteriorate as environmental conditions change over time. For example, as your e-commerce inventory changes, your recommender system will need to learn to suggest new products to shoppers. As more machine learning algorithms are put into production, you will also need to dedicate more resources to model maintenance—monitoring, validating, and updating the model. A myriad of dependencies lead to machine learning debt, with certain practices incurring more technical debt than others. According to Google researchers, contributing factors include “probabilistic variables, data dependencies, recursive feedback loops, pipeline processes, configuration settings, and other factors that exacerbate the unpredictability of machine learning algorithm performance.”(85) Machine learning debt can be divided into three main types: code debt, data debt, and math debt.(86) Code debt arises from the need to revisit and repurpose older code that may no longer suit the project.
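The maintenance burden described here is often addressed with simple monitoring jobs. The sketch below, with invented metric values and thresholds, flags a deployed recommender for retraining when its evaluation metric slips or when too much of the live inventory was never seen at training time.

# Monitoring sketch: decide when a deployed recommender needs retraining.
# The numbers and threshold values are invented for illustration.
def needs_retraining(current_metric, baseline_metric,
                     live_items, training_items,
                     max_metric_drop=0.05, max_unseen_fraction=0.2):
    metric_drop = baseline_metric - current_metric
    unseen = set(live_items) - set(training_items)
    unseen_fraction = len(unseen) / len(live_items)
    return metric_drop > max_metric_drop or unseen_fraction > max_unseen_fraction

print(needs_retraining(
    current_metric=0.71, baseline_metric=0.78,      # offline precision slipped
    live_items=["a", "b", "c", "d", "e"],
    training_items=["a", "b", "c"],                  # 40% of inventory is new
))  # True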


Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage by Zdravko Markov, Daniel T. Larose

Firefox, information retrieval, Internet Archive, iterative process, natural language processing, pattern recognition, random walk, recommendation engine, semantic web, speech recognition, statistical model, William of Occam

COLLABORATIVE FILTERING (RECOMMENDER SYSTEMS) So far we have discussed approaches to content-based retrieval and clustering of documents, where the basic relation that is used in the document description is “document contains term.” At some point we looked into the role of web users as a source of feedback to improve the document ranking. However, we may consider web users as entities in a relation such as the document–term relation. This may, for example, be “web user likes web page.” Then we can build a user–document matrix and use documents to describe users in terms of web pages they like. A more general approach would be to consider persons and items again connected by the relation “person likes item.” This is the approach taken in the area of collaborative filtering (also called recommender systems) [3]. Assume that we have m persons and n items (e.g., books, songs, movies, web pages).
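A minimal sketch of that person-item relation, with invented data: each row of a small m x n matrix records which items a person likes, and similar persons can then be found by comparing rows.

# "person likes item" as an m x n 0/1 matrix (data invented for illustration).
persons = ["p1", "p2", "p3"]
items = ["song_a", "song_b", "song_c", "song_d"]

likes = [
    [1, 0, 1, 0],   # p1
    [1, 1, 1, 0],   # p2
    [0, 1, 0, 1],   # p3
]

def row_similarity(r1, r2):
    # fraction of items on which two persons agree (simple matching)
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

# compare every pair of persons by their rows
for i in range(len(persons)):
    for j in range(i + 1, len(persons)):
        print(persons[i], persons[j], row_similarity(likes[i], likes[j]))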

CONTENTS PREFACE xi PART I WEB STRUCTURE MINING 1 2 INFORMATION RETRIEVAL AND WEB SEARCH 3 Web Challenges Web Search Engines Topic Directories Semantic Web Crawling the Web Web Basics Web Crawlers Indexing and Keyword Search Document Representation Implementation Considerations Relevance Ranking Advanced Text Search Using the HTML Structure in Keyword Search Evaluating Search Quality Similarity Search Cosine Similarity Jaccard Similarity Document Resemblance References Exercises 3 4 5 5 6 6 7 13 15 19 20 28 30 32 36 36 38 41 43 43 HYPERLINK-BASED RANKING 47 Introduction Social Networks Analysis PageRank Authorities and Hubs Link-Based Similarity Search Enhanced Techniques for Page Ranking References Exercises 47 48 50 53 55 56 57 57 vii viii CONTENTS PART II WEB CONTENT MINING 3 4 5 CLUSTERING 61 Introduction Hierarchical Agglomerative Clustering k-Means Clustering Probabilty-Based Clustering Finite Mixture Problem Classification Problem Clustering Problem Collaborative Filtering (Recommender Systems) References Exercises 61 63 69 73 74 76 78 84 86 86 EVALUATING CLUSTERING 89 Approaches to Evaluating Clustering Similarity-Based Criterion Functions Probabilistic Criterion Functions MDL-Based Model and Feature Evaluation Minimum Description Length Principle MDL-Based Model Evaluation Feature Selection Classes-to-Clusters Evaluation Precision, Recall, and F-Measure Entropy References Exercises 89 90 95 100 101 102 105 106 108 111 112 112 CLASSIFICATION 115 General Setting and Evaluation Techniques Nearest-Neighbor Algorithm Feature Selection Naive Bayes Algorithm Numerical Approaches Relational Learning References Exercises 115 118 121 125 131 133 137 138 PART III WEB USAGE MINING 6 INTRODUCTION TO WEB USAGE MINING 143 Definition of Web Usage Mining Cross-Industry Standard Process for Data Mining Clickstream Analysis 143 144 147 CONTENTS 7 8 9 ix Web Server Log Files Remote Host Field Date/Time Field HTTP Request Field Status Code Field Transfer Volume (Bytes) Field Common Log Format Identification Field Authuser Field Extended Common Log Format Referrer Field User Agent Field Example of a Web Log Record Microsoft IIS Log Format Auxiliary Information References Exercises 148 PREPROCESSING FOR WEB USAGE MINING 156 Need for Preprocessing the Data Data Cleaning and Filtering Page Extension Exploration and Filtering De-Spidering the Web Log File User Identification Session Identification Path Completion Directories and the Basket Transformation Further Data Preprocessing Steps References Exercises 156 149 149 149 150 151 151 151 151 151 152 152 152 153 154 154 154 158 161 163 164 167 170 171 174 174 174 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING 177 Introduction Number of Visit Actions Session Duration Relationship between Visit Actions and Session Duration Average Time per Page Duration for Individual Pages References Exercises 177 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION Introduction Modeling Methodology Definition of Clustering The BIRCH Clustering Algorithm Affinity Analysis and the A Priori Algorithm 177 178 181 183 185 188 188 191 191 192 193 194 197 x CONTENTS Discretizing the Numerical Variables: Binning Applying the A Priori Algorithm to the CCSU Web Log Data Classification and Regression Trees The C4.5 Algorithm References Exercises INDEX 199 201 204 208 210 211 213 PREFACE DEFINING DATA MINING THE WEB By data mining the Web, we refer to the application of data mining methodologies, techniques, and models to the variety of data forms, structures, 
and usage patterns that comprise the World Wide Web.

Concept learning methods can also be used to generate explicit descriptions of sets of web documents, which can then be applied to categorization of new documents or to better understand the document area or topic.

CHAPTER 3 CLUSTERING
INTRODUCTION
HIERARCHICAL AGGLOMERATIVE CLUSTERING
k-MEANS CLUSTERING
PROBABILITY-BASED CLUSTERING
COLLABORATIVE FILTERING (RECOMMENDER SYSTEMS)

INTRODUCTION

The most popular approach to learning is by example. Given a set of objects, each labeled with a class (category), the learning system builds a mapping between objects and classes which can then be used for classifying new (unlabeled) objects. As the labeling (categorization) of the initial (training) set of objects is done by an agent external to the system (teacher), this setting is called supervised learning.


pages: 151 words: 39,757

Ten Arguments for Deleting Your Social Media Accounts Right Now by Jaron Lanier

4chan, basic income, cloud computing, corporate governance, Donald Trump, en.wikipedia.org, Filter Bubble, gig economy, Internet of things, Jaron Lanier, life extension, Mark Zuckerberg, market bubble, Milgram experiment, move fast and break things, Network effects, ransomware, Ray Kurzweil, recommendation engine, Silicon Valley, Snapchat, Stanford prison experiment, stem cell, Steve Jobs, Ted Nelson, theory of mind, WikiLeaks, zero-sum game

The correlations are effectively theories about the nature of each person, and those theories are constantly measured and rated for how predictive they are. Like all well-managed theories, they improve over time through adaptive feedback. C is for Cramming content down people’s throats Algorithms choose what each person experiences through their devices. This component might be called a feed, a recommendation engine, or personalization. Component C means each person sees different things. The immediate motivation is to deliver stimuli for individualized behavior modification. BUMMER makes it harder to understand why others think and act the way they do. The effects of this component will be examined more in the arguments about how you are losing access to truth and the capacity for empathy. (Not all personalization is part of BUMMER.


pages: 410 words: 119,823

Radical Technologies: The Design of Everyday Life by Adam Greenfield

3D printing, Airbnb, augmented reality, autonomous vehicles, bank run, barriers to entry, basic income, bitcoin, blockchain, business intelligence, business process, call centre, cellular automata, centralized clearinghouse, centre right, Chuck Templeton: OpenTable:, cloud computing, collective bargaining, combinatorial explosion, Computer Numeric Control, computer vision, Conway's Game of Life, cryptocurrency, David Graeber, dematerialisation, digital map, disruptive innovation, distributed ledger, drone strike, Elon Musk, Ethereum, ethereum blockchain, facts on the ground, fiat currency, global supply chain, global village, Google Glasses, IBM and the Holocaust, industrial robot, informal economy, information retrieval, Internet of things, James Watt: steam engine, Jane Jacobs, Jeff Bezos, job automation, John Conway, John Markoff, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, John von Neumann, joint-stock company, Kevin Kelly, Kickstarter, late capitalism, license plate recognition, lifelogging, M-Pesa, Mark Zuckerberg, means of production, megacity, megastructure, minimum viable product, money: store of value / unit of account / medium of exchange, natural language processing, Network effects, New Urbanism, Occupy movement, Oculus Rift, Pareto efficiency, pattern recognition, Pearl River Delta, performance metric, Peter Eisenman, Peter Thiel, planetary scale, Ponzi scheme, post scarcity, post-work, RAND corporation, recommendation engine, RFID, rolodex, Satoshi Nakamoto, self-driving car, sentiment analysis, shareholder value, sharing economy, Silicon Valley, smart cities, smart contracts, social intelligence, sorting algorithm, special economic zone, speech recognition, stakhanovite, statistical model, stem cell, technoutopianism, Tesla Model S, the built environment, The Death and Life of Great American Cities, The Future of Employment, transaction costs, Uber for X, undersea cable, universal basic income, urban planning, urban sprawl, Whole Earth Review, WikiLeaks, women in the workforce

The equivalent of classification for unsupervised learning is clustering, in which an algorithm starts to develop a sense for what is significant in its environment via a process of accretion. A concrete example will help us understand how this works. At the end of the 1990s, two engineers named Tim Westergren and Will Glaser developed a rudimentary music-recommendation engine called the Music Genome Project that worked by rebuilding genre from the bottom up. (The engineers eventually founded the Pandora streaming service, and folded their recommendation engine into it.) Music Genome compared the acoustic signatures and other performance characteristics of the pieces of music it was offered, and from them built up associative maps, clustering together all the songs that had similar qualities; after many iterations, these clusters developed a strong resemblance to the musical categories we’re familiar with.
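The clustering-by-accretion idea can be sketched with a small k-means routine over invented two-dimensional "acoustic signatures" (a real system would use many more attributes); songs with similar signatures end up in the same cluster, the bottom-up analogue of a genre.

# k-means sketch on invented two-dimensional "acoustic signatures"
# (say, normalized tempo and distortion); values are made up for illustration.
import random

songs = {
    "track_1": (0.9, 0.8), "track_2": (0.85, 0.9),   # fast and loud
    "track_3": (0.2, 0.1), "track_4": (0.15, 0.2),   # slow and quiet
}

def kmeans(points, k=2, steps=10):
    random.seed(0)
    centers = random.sample(list(points.values()), k)
    clusters = {}
    for _ in range(steps):
        # assign each point to its nearest center
        clusters = {i: [] for i in range(k)}
        for name, (x, y) in points.items():
            i = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            clusters[i].append(name)
        # move each center to the mean of its cluster
        for i, names in clusters.items():
            if names:
                xs = [points[n][0] for n in names]
                ys = [points[n][1] for n in names]
                centers[i] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return clusters

print(kmeans(songs))   # the two "genres" emerge from the data alone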


pages: 579 words: 76,657

Data Science from Scratch: First Principles with Python by Joel Grus

correlation does not imply causation, natural language processing, Netflix Prize, p-value, Paul Graham, recommendation engine, SpamAssassin, statistical model

principal component analysis, Dimensionality Reduction probability, Probability-For Further Exploration, MathematicsBayes's Theorem, Bayes’s Theorem central limit theorem, The Central Limit Theorem conditional, Conditional Probability continuous distributions, Continuous Distributions defined, Probability dependence and independence, Dependence and Independence normal distribution, The Normal Distribution random variables, Random Variables probability density function, Continuous Distributions programming languages for learning data science, From Scratch Python, A Crash Course in Python-For Further Explorationargs and kwargs, args and kwargs arithmetic, Arithmetic benefits of using for data science, From Scratch Booleans, Truthiness control flow, Control Flow Counter, Counter dictionaries, Dictionaries-defaultdict enumerate function, enumerate exceptions, Exceptions functional tools, Functional Tools functions, Functions generators and iterators, Generators and Iterators list comprehensions, List Comprehensions lists, Lists object-oriented programming, Object-Oriented Programming piping data through scripts using stdin and stdout, stdin and stdout random numbers, generating, Randomness regular expressions, Regular Expressions sets, Sets sorting in, The Not-So-Basics strings, Strings tuples, Tuples whitespace formatting, Whitespace Formatting zip function and argument unpacking, zip and Argument Unpacking Q quantile, computing, Central Tendencies query optimization (SQL), Query Optimization R R (programming language), From Scratch, R random forests, Random Forests random module (Python), Randomness random variables, Random VariablesBernoulli, The Central Limit Theorem binomial, The Central Limit Theorem conditioned on events, Random Variables expected value, Random Variables normal, The Normal Distribution-The Central Limit Theorem uniform, Continuous Distributions range, Dispersion range function (Python), Generators and Iterators reading files (see files, reading) recall, Correctness recommendations, Recommender Systems recommender systems, Recommender Systems-For Further ExplorationData Scientists You May Know (example), Data Scientists You May Know item-based collaborative filtering, Item-Based Collaborative Filtering-For Further Exploration manual curation, Manual Curation recommendations based on popularity, Recommending What’s Popular user-based collaborative filtering, User-Based Collaborative Filtering-User-Based Collaborative Filtering reduce function (Python), Functional Toolsusing with vectors, Vectors regression (see linear regression; logistic regression) regression trees, What Is a Decision Tree?

Additionally, both of his endorsers endorsed only him, which means that he doesn’t have to divide their rank with anyone else.

For Further Exploration

There are many other notions of centrality besides the ones we used (although the ones we used are pretty much the most popular ones). NetworkX is a Python library for network analysis. It has functions for computing centralities and for visualizing graphs. Gephi is a love-it/hate-it GUI-based network-visualization tool.

Chapter 22. Recommender Systems

O nature, nature, why art thou so dishonest, as ever to send men with these false recommendations into the world!

Henry Fielding

Another common data problem is producing recommendations of some sort. Netflix recommends movies you might want to watch. Amazon recommends products you might want to buy. Twitter recommends users you might want to follow. In this chapter, we’ll look at several ways to use data to make recommendations.

= other_interest_id and similarity > 0]
    return sorted(pairs,
                  key=lambda pair: pair[-1],   # sort by similarity
                  reverse=True)

which suggests the following similar interests:

[('Hadoop', 0.8164965809277261), ('Java', 0.6666666666666666), ('MapReduce', 0.5773502691896258), ('Spark', 0.5773502691896258), ('Storm', 0.5773502691896258), ('Cassandra', 0.4082482904638631), ('artificial intelligence', 0.4082482904638631), ('deep learning', 0.4082482904638631), ('neural networks', 0.4082482904638631), ('HBase', 0.3333333333333333)]

Now we can create recommendations for a user by summing up the similarities of the interests similar to his:

def item_based_suggestions(user_id, include_current_interests=False):
    # add up the similar interests
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = most_similar_interests_to(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity

    # sort them by weight
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[-1],   # sort by accumulated similarity
                         reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

For user 0, this generates the following (seemingly reasonable) recommendations:

[('MapReduce', 1.861807319565799), ('Postgres', 1.3164965809277263), ('MongoDB', 1.3164965809277263), ('NoSQL', 1.2844570503761732), ('programming languages', 0.5773502691896258), ('MySQL', 0.5773502691896258), ('Haskell', 0.5773502691896258), ('databases', 0.5773502691896258), ('neural networks', 0.4082482904638631), ('deep learning', 0.4082482904638631), ('C++', 0.4082482904638631), ('artificial intelligence', 0.4082482904638631), ('Python', 0.2886751345948129), ('R', 0.2886751345948129)]

For Further Exploration

Crab is a framework for building recommender systems in Python. Graphlab also has a recommender toolkit. The Netflix Prize was a somewhat famous competition to build a better system to recommend movies to Netflix users.

Chapter 23. Databases and SQL

Memory is man’s greatest friend and worst enemy. Gilbert Parker

The data you need will often live in databases, systems designed for efficiently storing and querying data. The bulk of these are relational databases, such as Oracle, MySQL, and SQL Server, which store data in tables and are typically queried using Structured Query Language (SQL), a declarative language for manipulating data.


pages: 396 words: 117,149

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World by Pedro Domingos

Albert Einstein, Amazon Mechanical Turk, Arthur Eddington, basic income, Bayesian statistics, Benoit Mandelbrot, bioinformatics, Black Swan, Brownian motion, cellular automata, Claude Shannon: information theory, combinatorial explosion, computer vision, constrained optimization, correlation does not imply causation, creative destruction, crowdsourcing, Danny Hillis, data is the new oil, double helix, Douglas Hofstadter, Erik Brynjolfsson, experimental subject, Filter Bubble, future of work, global village, Google Glasses, Gödel, Escher, Bach, information retrieval, job automation, John Markoff, John Snow's cholera map, John von Neumann, Joseph Schumpeter, Kevin Kelly, lone genius, mandelbrot fractal, Mark Zuckerberg, Moneyball by Michael Lewis explains big data, Narrative Science, Nate Silver, natural language processing, Netflix Prize, Network effects, NP-complete, off grid, P = NP, PageRank, pattern recognition, phenotype, planetary scale, pre–internet, random walk, Ray Kurzweil, recommendation engine, Richard Feynman, scientific worldview, Second Machine Age, self-driving car, Silicon Valley, social intelligence, speech recognition, Stanford marshmallow experiment, statistical model, Stephen Hawking, Steven Levy, Steven Pinker, superintelligent machines, the scientific method, The Signal and the Noise by Nate Silver, theory of mind, Thomas Bayes, transaction costs, Turing machine, Turing test, Vernor Vinge, Watson beat the top human players on Jeopardy!, white flight, zero-sum game

Satellites, DNA sequencers, and particle accelerators probe nature in ever-finer detail, and learning algorithms turn the torrents of data into new scientific knowledge. Companies know their customers like never before. The candidate with the best voter models wins, like Obama against Romney. Unmanned vehicles pilot themselves across land, sea, and air. No one programmed your tastes into the Amazon recommendation system; a learning algorithm figured them out on its own, by generalizing from your past purchases. Google’s self-driving car taught itself how to stay on the road; no engineer wrote an algorithm instructing it, step-by-step, how to get from A to B. No one knows how to program a car to drive, and no one needs to, because a car equipped with a learning algorithm picks it up by observing what the driver does.

It’s an ideal job for machine learning, and yet today’s learners aren’t up to it. Each has some of the needed capabilities but is missing others. The Master Algorithm is the complete package. Applying it to vast amounts of patient and drug data, combined with knowledge mined from the biomedical literature, is how we will cure cancer. A universal learner is sorely needed in many other areas, from life-and-death to mundane situations. Picture the ideal recommender system, one that recommends the books, movies, and gadgets you would pick for yourself if you had the time to check them all out. Amazon’s algorithm is a very far cry from it. That’s partly because it doesn’t have enough data—mainly it just knows which items you previously bought from Amazon—but if you went hog wild and gave it access to your complete stream of consciousness from birth, it wouldn’t know what to do with it.

The price, of course, is that its vision is blurrier: fine details of the frontier get washed away by the voting. When k goes up, variance decreases, but bias increases. Using the k nearest neighbors instead of one is not the end of the story. Intuitively, the examples closest to the test example should count for more. This leads us to the weighted k-nearest-neighbor algorithm. In 1994, a team of researchers from the University of Minnesota and MIT built a recommendation system based on what they called “a deceptively simple idea”: people who agreed in the past are likely to agree again in the future. That notion led directly to the collaborative filtering systems that all self-respecting e-commerce sites have. Suppose that, like Netflix, you’ve gathered a database of movie ratings, with each user giving a rating of one to five stars to the movies he or she has seen.
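A minimal sketch of that weighted nearest-neighbor prediction, with an invented ratings table: neighbors are users who agreed with you on past ratings, the k closest neighbors vote on the unseen movie, and closer neighbors count for more.

# Weighted k-NN sketch for movie ratings (data invented): neighbors are users
# who agreed with you in the past, and closer neighbors count for more.
ratings = {
    "you":   {"movie_a": 5, "movie_b": 2, "movie_c": 4},
    "alice": {"movie_a": 5, "movie_b": 2, "movie_c": 5, "movie_d": 4},
    "bob":   {"movie_a": 1, "movie_b": 5, "movie_d": 2},
}

def agreement(u, v):
    # how strongly two users agreed on the movies both have rated
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    # 1.0 means identical ratings, 0.0 means maximally different (1 vs 5 stars)
    return sum(1 - abs(ratings[u][m] - ratings[v][m]) / 4 for m in common) / len(common)

def predict(user, movie, k=2):
    neighbors = sorted(
        ((agreement(user, other), r[movie])
         for other, r in ratings.items()
         if other != user and movie in r),
        reverse=True)[:k]
    total = sum(w for w, _ in neighbors)
    return sum(w * stars for w, stars in neighbors) / total if total else None

print(predict("you", "movie_d"))   # weighted by how much each neighbor agreed with you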


pages: 463 words: 105,197

Radical Markets: Uprooting Capitalism and Democracy for a Just Society by Eric Posner, E. Weyl

3D printing, activist fund / activist shareholder / activist investor, Affordable Care Act / Obamacare, Airbnb, Amazon Mechanical Turk, anti-communist, augmented reality, basic income, Berlin Wall, Bernie Sanders, Branko Milanovic, business process, buy and hold, carbon footprint, Cass Sunstein, Clayton Christensen, cloud computing, collective bargaining, commoditize, Corn Laws, corporate governance, crowdsourcing, cryptocurrency, Donald Trump, Elon Musk, endowment effect, Erik Brynjolfsson, Ethereum, feminist movement, financial deregulation, Francis Fukuyama: the end of history, full employment, George Akerlof, global supply chain, guest worker program, hydraulic fracturing, Hyperloop, illegal immigration, immigration reform, income inequality, income per capita, index fund, informal economy, information asymmetry, invisible hand, Jane Jacobs, Jaron Lanier, Jean Tirole, Joseph Schumpeter, Kenneth Arrow, labor-force participation, laissez-faire capitalism, Landlord’s Game, liberal capitalism, low skilled workers, Lyft, market bubble, market design, market friction, market fundamentalism, mass immigration, negative equity, Network effects, obamacare, offshore financial centre, open borders, Pareto efficiency, passive investing, patent troll, Paul Samuelson, performance metric, plutocrats, Plutocrats, pre–internet, random walk, randomized controlled trial, Ray Kurzweil, recommendation engine, rent-seeking, Richard Thaler, ride hailing / ride sharing, risk tolerance, road to serfdom, Robert Shiller, Robert Shiller, Ronald Coase, Rory Sutherland, Second Machine Age, second-price auction, self-driving car, shareholder value, sharing economy, Silicon Valley, Skype, special economic zone, spectrum auction, speech recognition, statistical model, stem cell, telepresence, Thales and the olive presses, Thales of Miletus, The Death and Life of Great American Cities, The Future of Employment, The Market for Lemons, The Nature of the Firm, The Rise and Fall of American Growth, The Wealth of Nations by Adam Smith, Thorstein Veblen, trade route, transaction costs, trickle-down economics, Uber and Lyft, uber lyft, universal basic income, urban planning, Vanguard fund, women in the workforce, Zipcar

., 78 Cabral, Luís, 202 Cadappster app, 31 Caesar, Julius, 84 Canada, 10, 13, 159, 182 capitalism, xvi; basic structure of, 24–25; competition and, 17 (see also competition); corporate planning and, 39–40; cultural consequences of, 270, 273; Engels on, 239–40; freedom and, 34–39; George on, 36–37; growth and, 3 (see also growth, economic); industrial revolution, 36, 255; inequality and, 3 (see also inequality); labor and, 136–37, 143, 159, 165, 211, 224, 231, 239–40, 316n4; laissez-faire, 45; liberalism and, 3, 17, 22–27; markets and, 278, 288, 304n36; Marx on, 239–40; monopolies and, 22–23, 34–39, 44, 46–49, 132, 136, 173, 177, 179, 199, 258, 262; monopsony and, 190, 199–201, 223, 234, 238–41, 255; ownership and, 34–36, 39, 45–49, 75, 78–79; property and, 34–36, 39, 45–49, 75, 78–79; Radical Markets and, 169, 180–85, 203, 273; regulations and, 262; Schumpeter on, 47; shareholders and, 118, 170, 178–84, 189, 193–95; technology and, 34, 203, 316n4; wealth and, 45, 75, 78–79, 136, 143, 239, 273 Capitalism and Freedom (Friedman), xiii Capitalism for the People, A (Luigi), 203 Capra, Frank, 17 Carroll, Lewis, 176 central planning: computers and, 277–85, 288–93; consumers and, 19; democracy and, 89; governance and, 19–20, 39–42, 46–48, 62, 89, 277–85, 288–90, 293; healthcare and, 290–91; liberalism and, 19–20; markets and, 277–85, 288–93; property and, 39–42, 46–48, 62; recommendation systems and, 289–90; socialism and, 39–42, 47, 277, 281 Chetty, Raj, 11 Chiang Kai-shek, 46 China, 15, 46, 56, 133–34, 138 Christensen, Clayton, 202 Chrysler, 193 Citigroup, 183, 184, 191 Clarke, Edward, 99, 102, 105 Clayton Act, 176–77, 197, 311n25 Clemens, Michael, 162 Coase, Ronald, 40, 48–51, 299n26 Cold War, xix, 25, 288 collective bargaining, 240–41 collective decisions: democracy and, 97–105, 110–11, 118–20, 122, 124, 273, 303n17, 304n36; manipulation of, 99; markets for, 97–105; public goods and, 98; Quadratic Voting (QV) and, 110–11, 118–20, 122, 124, 273, 303n17, 304n36; Vickrey and, 99, 102, 105 colonialism, 8, 131 Coming of the Third Reich, The (Evans), 93 common ownership self-assessed tax (COST): broader application of, 273–76; cybersquatters and, 72; education and, 258–59; efficiency and, 256, 261; equality and, 258; globalization and, 269–70; growth and, 73, 256; human capital and, 258–61; immigrants and, 261, 269, 273; inequality and, 256–59; international trade and, 270; investment and, 258–59, 270; legal issues and, 275; markets and, 286; methodology of, 63–66; monopolies and, 256–61, 270, 300n43; objections to, 300n43; optimality and, 61, 73, 75–79, 317n18; personal possessions and, 301n47, 317n18; political effects of, 261–64; predatory outsiders and, 300n43; prices and, 62–63, 67–77, 256, 258, 263, 275, 300n43, 317n18; property and, 31, 61–79, 271–74, 300n43, 301n47; public goods and, 256; public leases and, 69–72; Quadratic Voting (QV) and, 123–25, 194, 261–63, 273, 275, 286; Radical Markets and, 79, 123–26, 257–58, 271–72, 286; taxes and, 61–69, 73–76, 258–61, 275, 317n18; technology and, 71–72, 257–59; true market economy and, 72–75; voting and, 263; wealth and, 256–57, 261–64, 269–70, 275, 286 communism, 19–20, 46–47, 93–94, 125, 278 competition: antitrust policies and, 23, 48, 174–77, 180, 184–86, 191, 197–203, 242, 255, 262, 286; auctions and, xv–xix, 49–51, 70–71, 97, 99, 147–49, 156–57; bargaining and, 240–41, 299n26; democracy and, 109, 119–20; by design, 49–55; elitism and, 25–28; equilibrium and, 305n40; eternal vigilance and, 204; horizontal concentration and, 175; imperfect, 
304n36; indexing and, 185–91, 302n63; innovation and, 202–3; investment and, 196–97; labor and, 145, 158, 162–63, 220, 234, 236, 239, 243, 245, 256, 266; laissez-faire and, 253; liberalism and, 6, 17, 20–28; lobbyists and, 262; monopolies and, 174; monopsony and, 190, 199–201, 223, 234, 238–41, 255; ownership and, 20–21, 41, 49–55, 79; perfect, 6, 25–28, 109; prices and, 20–22, 25, 173, 175, 180, 185–90, 193, 200–201, 204, 244; property and, 41, 49–55, 79; Quadratic Voting (QV) and, 304n36; regulations and, 262; resale price maintenance and, 200–201; restoring, 191–92; Section 7 and, 196–97, 311n25; selfishness and, 109, 270–71; Smith on, 17; tragedy of the commons and, 44 complexity, 218–20, 226–28, 274–75, 279, 281, 284, 287, 313n15 “Computer and the Market, The” (Lange), 277 computers: algorithms and, 208, 214, 219, 221, 281–82, 289–93; automation of labor and, 222–23, 251, 254; central planning and, 277–85, 288–93; data and, 213–14, 218, 222, 233, 244, 260; Deep Blue, 213; distributed computing and, 282–86, 293; growth in poor countries and, 255; as intermediaries, 274; machine learning (ML) and, 214 (see also machine learning [ML]); markets and, 277, 280–93; Mises and, 281; Moore’s Law and, 286–87; Open-Trac and, 31–32; parallel processing and, 282–86; prices of, 21; recommendation systems and, 289–90 Condorcet, Marquis de, 4, 90–93, 303n15, 306n51 conspicuous consumption, 78 Consumer Reports magazine, 291 consumers: antitrust suits and, 175, 197–98; central planning and, 19; data from, 47, 220, 238, 242–44, 248, 289; drone delivery to, 220; as entrepreneurs, 256; goods and services for, 27, 92, 123, 130, 175, 280, 292; institutional investment and, 190–91; international culture for, 270; lobbyists and, 262; machine learning (ML) and, 238; monopolies and, 175, 186, 197–98; preferences of, 280, 288–93; prices and, 172 (see also prices); recommendation systems and, 289–90; robots and, 287; sharing economy and, 117; Soviet collapse and, 289; technology and, 287 cooperatives, 118, 126, 261, 267, 299n24 Corbyn, Jeremy, 12, 13 corruption, 3, 23, 27, 57, 93, 122, 126, 157, 262 Cortana, 219 cost-benefit analysis, 2, 244 “Counterspeculation, Auctions and Competitive Sealed Tenders” (Vickrey), xx–xxi Cramton, Peter, 52, 54–55, 57 crowdsourcing, 235 crytocurrencies, 117–18 cybersquatters, 72 data: algorithms and, 208, 214, 219, 221, 281–82, 289–93; big, 213, 226, 293; computers and, 213–14, 218, 222, 233, 244, 260; consumer, 47, 220, 238, 242–44, 248, 289; diamond-water paradox and, 224–25; diminishing returns and, 226, 229–30; distribution of complexity and, 228; as entertainment, 233–39, 248–49; Facebook and, 28, 205–9, 212–13, 220–21, 231–48; feedback and, 114, 117, 233, 238, 245; free, 209, 211, 220, 224, 231–35, 239; Google and, 28, 202, 207–13, 219–20, 224, 231–36, 241–42, 246; investment in, 212, 224, 232, 244; labeled, 217–21, 227, 228, 230, 232, 234, 237; labor movement for, 241–43; Lanier and, 208, 220–24, 233, 237, 313n2, 315n48; marginal value and, 224–28, 247; network effects and, 211, 236, 238, 243; neural networks and, 214–19; online services and, 211, 235; overfitting and, 217–18; payment systems for, 210–13, 224–30; photographs and, 64, 214–15, 217, 219–21, 227–28, 291; programmers and, 163, 208–9, 214, 217, 219, 224; Radical Markets for, 246–49; reCAPTCHA and, 235–36; recommendation systems and, 289–90; rise of data work and, 209–13; sample complexity and, 217–18; siren servers and, 220–24, 230–41, 243; social networks and, 202, 212, 231, 233–36; technofeudalism and, 230–33; 
under-employment and, 256; value of, 243–45; venture capital and, 211, 224; virtual reality and, 206, 208, 229, 251, 253; women’s work and, 209, 313n4 Declaration of Independence, 86 Deep Blue, 213 DeFoe, Daniel, 132 Demanding Work (Gray and Suri), 233 democracy: 1p1v system and, 82–84, 94, 109, 119, 122–24, 304n36, 306n51; artificial intelligence (AI) and, 219; Athenians and, 55, 83–84, 131; auctions and, 97, 99; basic structure of, 24–25; central planning and, 89; check and balance systems and, 23, 25, 87, 92; collective decisions and, 97–105, 110–11, 118–20, 122, 124, 273, 303n17, 304n36; collective mediocrity and, 96; competition and, 109, 119–20; Declaration of Independence and, 86; efficiency and, 92, 110, 126; elections and, 22, 80, 93, 100, 115, 119–21, 124, 217–18, 296n20; elitism and, 89–91, 96, 124; Enlightenment and, 86, 95; Europe and, 90–96; France and, 90–95; governance and, 84, 117; gridlock and, 84, 88, 122–24, 261, 267; Hitler and, 93–94; House of Commons and, 84–85; House of Lords and, 85; impossibility theorem and, 92; inequality and, 123; Jury Theorem and, 90–92; liberalism and, 3–4, 25, 80, 86, 90; limits of, 85–86; majority rule and, 27, 83–89, 92–97, 100–101, 121, 306n51; markets and, 97–105, 262, 276; minorities and, 85–90, 93–97, 101, 106, 110; mixed constitution and, 84–85; multi-candidate, single-winner elections and, 119–20; origins of, 83–85; ownership and, 81–82, 89, 101, 105, 118, 124; public goods and, 28, 97–100, 107, 110, 120, 123, 126; Quadratic Voting (QV) and, 105–22; Radical Markets and, 82, 106, 123–26, 203; supermajorities and, 84–85, 88, 92; tyrannies and, 23, 25, 88, 96–100, 106, 108; United Kingdom and, 95–96; United States and, 86–90, 93, 95; voting and, 80–82, 85–93, 96, 99, 105, 108, 115–16, 119–20, 123–24, 303n14, 303n17, 303n20, 304n36, 305n39; wealth and, 83–84, 87, 95, 116 Demosthenes, 55 Denmark, 182 Department of Justice (DOJ), 176, 186, 191 deregulation, 3, 9, 24 Desmond, Matthew, 201–2 Dewey, John, 43 Dickens, Charles, 36 digital economy: data producers and, 208–9, 230–31; diamond-water paradox and, 224–25; as entertainment, 233–39; facial recognition and, 208, 216, 218–19; free access and, 211; Lanier and, 208, 220–24, 233, 237, 313n2, 315n48; machine learning (ML) and, 208–9, 213–14, 217–21, 226–31, 234–35, 238, 247, 289, 291, 315n48; payment systems for, 210–13, 221–30, 243–45; programmers and, 163, 208–9, 214, 217, 219, 224; rise of data work and, 209–13; siren servers and, 220–24, 230–41, 243; spam and, 210, 245; technofeudalism and, 230–33; virtual reality and, 206, 208, 229, 251, 253 diversification, 171–72, 180–81, 185, 191–92, 194–96, 310n22, 310n24 dot-com bubble, 211 double taxation, 65 Dupuit, Jules, 173 Durkheim, Émile, 297n23 Dworkin, Ronald, 305n40 dystopia, 18, 191, 273, 293 education, 114; common ownership self-assessed tax (COST) and, 258; data and, 229, 232, 248; elitism and, 260; equality in, 89; financing, 276; free compulsory, 23; immigrants and, 14, 143–44, 148; labor and, 140, 143–44, 148, 150, 158, 170–71, 232, 248, 258–60; Mill on, 96; populist movements and, 14; Stolper-Samuelson Theorem and, 143 efficient capital markets hypothesis, 180 elections, 80; data and, 217–18; democracy and, 22, 93, 100, 115, 119–21, 124, 217–18, 296n20; gridlock and, 124; Hitler and, 93; multi-candidate, single-winner, 119–20; polls and, 13, 111; Quadratic Voting (QV) and, 115, 119–21, 268, 306n52; U.S. 
2016, 93, 296n20 Elhauge, Einer, 176, 197 elitism: aristocracy and, 16–17, 22–23, 36–38, 84–85, 87, 90, 135–36; bourgeoisie and, 36; bureaucrats and, 267; democracy and, 89–91, 96, 124; education and, 260; feudalism and, 16, 34–35, 37, 41, 61, 68, 136, 230–33, 239; financial deregulation and, 3; immigrants and, 146, 166; liberalism and, 3, 15–16, 25–28; minorities and, 12, 14–15, 19, 23–27, 85–90, 93–97, 101, 106, 110, 181, 194, 273, 303n14, 304n36; monarchies and, 85–86, 91, 95, 160 Emergency Economic Stabilization Act, 121 eminent domain, 33, 62, 89 Empire State Building, 45 Engels, Friedrich, 78, 240 Enlightenment, 86, 95 entrepreneurs, xiv; immigrants and, 144–45, 159, 256; labor and, 129, 144–45, 159, 173, 177, 203, 209–12, 224, 226, 256; ownership and, 35, 39 equality: common ownership self-assessed tax (COST) and, 258; education and, 89; immigrants and, 257; labor and, 147, 166, 239, 257; liberalism and, 4, 8, 24, 29; living standards and, 3, 11, 13, 133, 135, 148, 153, 254, 257; Quadratic Voting (QV) and, 264; Radical Markets and, 262, 276; trickle down theories and, 9, 12 Espinosa, Alejandro, 30–32 Ethereum, 117 Europe, 177, 201; democracy and, 88, 90–95; European Union and, 15; fiefdoms in, 34; government utilities and, 48; income patterns in, 5; instability in, 88; labor and, 11, 130–31, 136–47, 165, 245; social democrats and, 24; unemployment rates in, 11 Evans, Richard, 93 Evicted (Desmond), 201–2 Ex Machina (film), 208 Facebook, xxi; advertising and, 50, 202; data and, 28, 205–9, 212–13, 220–21, 231–48; monetization by, 28; news service of, 289; Vickrey Commons and, 50 facial recognition, 208, 216–19 family reunification programs, 150, 152 farms, 17, 34–35, 37–38, 61, 72, 135, 142, 179, 283–85 Federal Communications Commission (FCC), 50, 71 Federal Trade Commission (FTC), 176, 186 feedback, 114, 117, 233, 238, 245 feudalism, 16, 34–35, 37, 41, 61, 68, 136, 230–33, 239 Fidelity, 171, 181–82, 184 financial crisis of 2008, 3, 121 Fitzgerald, F.

Today, machines learn from the statistical patterns in human behavior, and may be able to use this information to distribute goods (and jobs) as well as, or possibly better than, people can choose goods (and jobs) themselves. We are very far from this point, but we can see the outlines of the route that we might travel. Let us start with an increasingly familiar phenomenon: machine learning–based recommendation systems drawing on existing market behavior. How does Netflix guess what movies you are likely to enjoy? Roughly, it finds people who are like you—who watch many of the movies you watch and give those movies ratings similar to your ratings. It then infers that you will enjoy movies you have not yet seen that your hidden doppelgangers have seen and rated highly. Pandora and Spotify take a similar approach in recommending music.
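A toy sketch of that doppelganger logic, with invented viewing data: find the user whose history overlaps most with yours, then suggest the movies they rated highly that you have not yet seen.

# Doppelganger sketch (invented data): find the user whose viewing overlaps
# most with yours, then suggest movies they rated highly that you haven't seen.
watched = {
    "you":  {"heat": 5, "alien": 4},
    "dana": {"heat": 5, "alien": 5, "blade runner": 5, "memento": 2},
    "sam":  {"cars": 3, "up": 4},
}

def overlap(u, v):
    # number of movies both users have watched
    return len(set(watched[u]) & set(watched[v]))

def suggest(user, min_stars=4):
    doppelganger = max((v for v in watched if v != user),
                       key=lambda v: overlap(user, v))
    return [movie for movie, stars in watched[doppelganger].items()
            if movie not in watched[user] and stars >= min_stars]

print(suggest("you"))   # ['blade runner']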

., 240 Amazon, 112, 230–31, 234, 239, 248, 288, 290–91 American Constitution, 86–87 American Federation of Musicians, 210 America OnLine (AOL), 210 Anderson, Chris, 212 antitrust: Clayton Act and, 176–77, 197, 311n25; landlords and, 201–2; monopolies and, 23, 48, 174–77, 180, 184–86, 191, 197–203, 242, 255, 262, 286; resale price maintenance and, 200–201; social media and, 202 Apple, 117, 239, 289 Arginoussai Islands, 83 aristocracy, 16–17, 22–23, 36–38, 84–85, 87, 90, 135–36 Aristotle, 172 Arrow, Kenneth, 92, 303n17 Articles of Confederation, 88 artificial intelligence (AI), 202, 257, 287; Alexa and, 248; algorithms and, 208, 214, 219, 221, 281–82, 289–93; automated video editing and, 208; Cortana and, 219; data capacities and, 236; Deep Blue and, 213; democratization of, 219; diminishing returns and, 229–30; facial recognition and, 208, 216–19; factories for thinking machines and, 213–20; human-produced data for, 208–9; marginal value and, 224–28, 247; Microsoft and, 219; neural networks and, 214–19; payment systems for, 224–30; recommendation systems and, 289–90; siren servers and, 220–24, 230–41, 243; Siri and, 219, 248; technofeudalism and, 230–33; techno-optimists and, 254–55, 316n2; techno-pessimists and, 254–55, 316n2; worker replacement and, 223 Athens, 55, 83–84, 131 Atwood, Margaret, 18–19 auctions, xv–xxi, 49–51, 70–71, 97, 99, 147–49, 156–57, 300n34 au pair program, 154–55, 161 Australia, 10, 12, 13, 159, 162 Austrian school, 2 Autor, David, 240 Azar, José, 185, 189, 310n24 Bahrain, 158 banking industry, 182–84, 183, 190 Bank of America, 183, 184 Becker, Gary, 147 Beckford, William, 95 behavioral finance, 180–81 Bénabou, Roland, 236–37 Bentham, Jeremy, 4, 35, 95–96, 98, 132 Berle, Adolf, 177–78, 183, 193–94 Berlin Wall, 1, 140 Berners-Lee, Tim, 210 big data, 213, 226, 293 Bing, xxi BlackRock, 171, 181–84, 183, 187, 191 Brazil, xiii–xvii, 105, 135 Brin, Sergey, 211 broadcast spectrum, xxi, 50–51, 71 Bush, George W., 78
289; drone delivery to, 220; as entrepreneurs, 256; goods and services for, 27, 92, 123, 130, 175, 280, 292; institutional investment and, 190–91; international culture for, 270; lobbyists and, 262; machine learning (ML) and, 238; monopolies and, 175, 186, 197–98; preferences of, 280, 288–93; prices and, 172 (see also prices); recommendation systems and, 289–90; robots and, 287; sharing economy and, 117; Soviet collapse and, 289; technology and, 287 cooperatives, 118, 126, 261, 267, 299n24 Corbyn, Jeremy, 12, 13 corruption, 3, 23, 27, 57, 93, 122, 126, 157, 262 Cortana, 219 cost-benefit analysis, 2, 244 “Counterspeculation, Auctions and Competitive Sealed Tenders” (Vickrey), xx–xxi Cramton, Peter, 52, 54–55, 57 crowdsourcing, 235 crytocurrencies, 117–18 cybersquatters, 72 data: algorithms and, 208, 214, 219, 221, 281–82, 289–93; big, 213, 226, 293; computers and, 213–14, 218, 222, 233, 244, 260; consumer, 47, 220, 238, 242–44, 248, 289; diamond-water paradox and, 224–25; diminishing returns and, 226, 229–30; distribution of complexity and, 228; as entertainment, 233–39, 248–49; Facebook and, 28, 205–9, 212–13, 220–21, 231–48; feedback and, 114, 117, 233, 238, 245; free, 209, 211, 220, 224, 231–35, 239; Google and, 28, 202, 207–13, 219–20, 224, 231–36, 241–42, 246; investment in, 212, 224, 232, 244; labeled, 217–21, 227, 228, 230, 232, 234, 237; labor movement for, 241–43; Lanier and, 208, 220–24, 233, 237, 313n2, 315n48; marginal value and, 224–28, 247; network effects and, 211, 236, 238, 243; neural networks and, 214–19; online services and, 211, 235; overfitting and, 217–18; payment systems for, 210–13, 224–30; photographs and, 64, 214–15, 217, 219–21, 227–28, 291; programmers and, 163, 208–9, 214, 217, 219, 224; Radical Markets for, 246–49; reCAPTCHA and, 235–36; recommendation systems and, 289–90; rise of data work and, 209–13; sample complexity and, 217–18; siren servers and, 220–24, 230–41, 243; social networks and, 202, 212, 231, 233–36; technofeudalism and, 230–33; under-employment and, 256; value of, 243–45; venture capital and, 211, 224; virtual reality and, 206, 208, 229, 251, 253; women’s work and, 209, 313n4 Declaration of Independence, 86 Deep Blue, 213 DeFoe, Daniel, 132 Demanding Work (Gray and Suri), 233 democracy: 1p1v system and, 82–84, 94, 109, 119, 122–24, 304n36, 306n51; artificial intelligence (AI) and, 219; Athenians and, 55, 83–84, 131; auctions and, 97, 99; basic structure of, 24–25; central planning and, 89; check and balance systems and, 23, 25, 87, 92; collective decisions and, 97–105, 110–11, 118–20, 122, 124, 273, 303n17, 304n36; collective mediocrity and, 96; competition and, 109, 119–20; Declaration of Independence and, 86; efficiency and, 92, 110, 126; elections and, 22, 80, 93, 100, 115, 119–21, 124, 217–18, 296n20; elitism and, 89–91, 96, 124; Enlightenment and, 86, 95; Europe and, 90–96; France and, 90–95; governance and, 84, 117; gridlock and, 84, 88, 122–24, 261, 267; Hitler and, 93–94; House of Commons and, 84–85; House of Lords and, 85; impossibility theorem and, 92; inequality and, 123; Jury Theorem and, 90–92; liberalism and, 3–4, 25, 80, 86, 90; limits of, 85–86; majority rule and, 27, 83–89, 92–97, 100–101, 121, 306n51; markets and, 97–105, 262, 276; minorities and, 85–90, 93–97, 101, 106, 110; mixed constitution and, 84–85; multi-candidate, single-winner elections and, 119–20; origins of, 83–85; ownership and, 81–82, 89, 101, 105, 118, 124; public goods and, 28, 97–100, 107, 110, 120, 123, 126; Quadratic Voting (QV) and, 105–22; Radical Markets 
and, 82, 106, 123–26, 203; supermajorities and, 84–85, 88, 92; tyrannies and, 23, 25, 88, 96–100, 106, 108; United Kingdom and, 95–96; United States and, 86–90, 93, 95; voting and, 80–82, 85–93, 96, 99, 105, 108, 115–16, 119–20, 123–24, 303n14, 303n17, 303n20, 304n36, 305n39; wealth and, 83–84, 87, 95, 116 Demosthenes, 55 Denmark, 182 Department of Justice (DOJ), 176, 186, 191 deregulation, 3, 9, 24 Desmond, Matthew, 201–2 Dewey, John, 43 Dickens, Charles, 36 digital economy: data producers and, 208–9, 230–31; diamond-water paradox and, 224–25; as entertainment, 233–39; facial recognition and, 208, 216, 218–19; free access and, 211; Lanier and, 208, 220–24, 233, 237, 313n2, 315n48; machine learning (ML) and, 208–9, 213–14, 217–21, 226–31, 234–35, 238, 247, 289, 291, 315n48; payment systems for, 210–13, 221–30, 243–45; programmers and, 163, 208–9, 214, 217, 219, 224; rise of data work and, 209–13; siren servers and, 220–24, 230–41, 243; spam and, 210, 245; technofeudalism and, 230–33; virtual reality and, 206, 208, 229, 251, 253 diversification, 171–72, 180–81, 185, 191–92, 194–96, 310n22, 310n24 dot-com bubble, 211 double taxation, 65 Dupuit, Jules, 173 Durkheim, Émile, 297n23 Dworkin, Ronald, 305n40 dystopia, 18, 191, 273, 293 education, 114; common ownership self-assessed tax (COST) and, 258; data and, 229, 232, 248; elitism and, 260; equality in, 89; financing, 276; free compulsory, 23; immigrants and, 14, 143–44, 148; labor and, 140, 143–44, 148, 150, 158, 170–71, 232, 248, 258–60; Mill on, 96; populist movements and, 14; Stolper-Samuelson Theorem and, 143 efficient capital markets hypothesis, 180 elections, 80; data and, 217–18; democracy and, 22, 93, 100, 115, 119–21, 124, 217–18, 296n20; gridlock and, 124; Hitler and, 93; multi-candidate, single-winner, 119–20; polls and, 13, 111; Quadratic Voting (QV) and, 115, 119–21, 268, 306n52; U.S. 
2016, 93, 296n20 Elhauge, Einer, 176, 197 elitism: aristocracy and, 16–17, 22–23, 36–38, 84–85, 87, 90, 135–36; bourgeoisie and, 36; bureaucrats and, 267; democracy and, 89–91, 96, 124; education and, 260; feudalism and, 16, 34–35, 37, 41, 61, 68, 136, 230–33, 239; financial deregulation and, 3; immigrants and, 146, 166; liberalism and, 3, 15–16, 25–28; minorities and, 12, 14–15, 19, 23–27, 85–90, 93–97, 101, 106, 110, 181, 194, 273, 303n14, 304n36; monarchies and, 85–86, 91, 95, 160 Emergency Economic Stabilization Act, 121 eminent domain, 33, 62, 89 Empire State Building, 45 Engels, Friedrich, 78, 240 Enlightenment, 86, 95 entrepreneurs, xiv; immigrants and, 144–45, 159, 256; labor and, 129, 144–45, 159, 173, 177, 203, 209–12, 224, 226, 256; ownership and, 35, 39 equality: common ownership self-assessed tax (COST) and, 258; education and, 89; immigrants and, 257; labor and, 147, 166, 239, 257; liberalism and, 4, 8, 24, 29; living standards and, 3, 11, 13, 133, 135, 148, 153, 254, 257; Quadratic Voting (QV) and, 264; Radical Markets and, 262, 276; trickle down theories and, 9, 12 Espinosa, Alejandro, 30–32 Ethereum, 117 Europe, 177, 201; democracy and, 88, 90–95; European Union and, 15; fiefdoms in, 34; government utilities and, 48; income patterns in, 5; instability in, 88; labor and, 11, 130–31, 136–47, 165, 245; social democrats and, 24; unemployment rates in, 11 Evans, Richard, 93 Evicted (Desmond), 201–2 Ex Machina (film), 208 Facebook, xxi; advertising and, 50, 202; data and, 28, 205–9, 212–13, 220–21, 231–48; monetization by, 28; news service of, 289; Vickrey Commons and, 50 facial recognition, 208, 216–19 family reunification programs, 150, 152 farms, 17, 34–35, 37–38, 61, 72, 135, 142, 179, 283–85 Federal Communications Commission (FCC), 50, 71 Federal Trade Commission (FTC), 176, 186 feedback, 114, 117, 233, 238, 245 feudalism, 16, 34–35, 37, 41, 61, 68, 136, 230–33, 239 Fidelity, 171, 181–82, 184 financial crisis of 2008, 3, 121 Fitzgerald, F.


pages: 502 words: 107,657

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel

Albert Einstein, algorithmic trading, Amazon Mechanical Turk, Apple's 1984 Super Bowl advert, backtesting, Black Swan, book scanning, bounce rate, business intelligence, business process, butter production in bangladesh, call centre, Charles Lindbergh, commoditize, computer age, conceptual framework, correlation does not imply causation, crowdsourcing, dark matter, data is the new oil, en.wikipedia.org, Erik Brynjolfsson, Everything should be made as simple as possible, experimental subject, Google Glasses, happiness index / gross national happiness, job satisfaction, Johann Wolfgang von Goethe, lifelogging, Machine translation of "The spirit is willing, but the flesh is weak." to Russian and back, mass immigration, Moneyball by Michael Lewis explains big data, Nate Silver, natural language processing, Netflix Prize, Network effects, Norbert Wiener, personalized medicine, placebo effect, prediction markets, Ray Kurzweil, recommendation engine, risk-adjusted returns, Ronald Coase, Search for Extraterrestrial Intelligence, self-driving car, sentiment analysis, Shai Danziger, software as a service, speech recognition, statistical model, Steven Levy, text mining, the scientific method, The Signal and the Noise by Nate Silver, The Wisdom of Crowds, Thomas Bayes, Thomas Davenport, Turing test, Watson beat the top human players on Jeopardy!, X Prize, Yogi Berra, zero-sum game

I Knew You Were Going to Do That With this power at hand, what do we want to predict? Every important thing a person does is valuable to predict, namely: consume, think, work, quit, vote, love, procreate, divorce, mess up, lie, cheat, steal, kill, and die. Let’s explore some examples.2 People Consume Hollywood studios predict the success of a screenplay if produced. Netflix awarded $1 million to a team of scientists who best improved their recommendation system’s ability to predict which movies you will like. Australian energy company Energex predicts electricity demand in order to decide where to build out its power grid, and Con Edison predicts system failure in the face of high levels of consumption. Wall Street predicts stock prices by observing how demand drives them up and down. The firms AlphaGenius and Derwent Capital drive hedge fund trading by following trends across the general public’s activities on Twitter.

I was at Walgreens a few years ago, and upon checkout an attractive, colorful coupon spit out of the machine. The product it hawked, pictured for all my fellow shoppers to see, had the potential to mortify. It was a coupon for Beano, a medication for flatulence. I’d developed mild lactose intolerance, but, before figuring that out, had been trying anything to address my symptom. Acting blindly on data, Walgreens’ recommendation system seemed to suggest that others not stand so close. Other clinical data holds a more serious and sensitive status than digestive woes. Once, when teaching a summer program for talented teenagers, I received data I felt would have been better kept away from me. The administrator took me aside to inform me that one of my students had a diagnosis of bipolar disorder. I wasn’t trained in psychology.

Such a contest is a hard-nosed, objective bake-off—whoever can cook up the solution that best handles the predictive task at hand wins kudos and, usually, cash. Dark Horses And so it was with our two Montrealers, Martin and Martin, who took the Netflix Prize by storm despite their lack of experience—or, perhaps, because of it. Neither had a background in statistics or analytics, let alone recommendation systems in particular. By day, the two worked in the telecommunications industry developing software. But by night, at home, the two-member team plugged away, for 10 to 20 hours per week apiece, racing ahead in the contest under the team name PragmaticTheory. The “pragmatic” approach proved groundbreaking. The team wavered in and out of the number one slot; during the final months of the competition, the team was often in the top echelons.


pages: 170 words: 49,193

The People vs Tech: How the Internet Is Killing Democracy (And How We Save It) by Jamie Bartlett

Ada Lovelace, Airbnb, Amazon Mechanical Turk, Andrew Keen, autonomous vehicles, barriers to entry, basic income, Bernie Sanders, bitcoin, blockchain, Boris Johnson, central bank independence, Chelsea Manning, cloud computing, computer vision, creative destruction, cryptocurrency, Daniel Kahneman / Amos Tversky, Dominic Cummings, Donald Trump, Edward Snowden, Elon Musk, Filter Bubble, future of work, gig economy, global village, Google bus, hive mind, Howard Rheingold, information retrieval, Internet of things, Jeff Bezos, job automation, John Maynard Keynes: technological unemployment, Julian Assange, manufacturing employment, Mark Zuckerberg, Marshall McLuhan, Menlo Park, meta analysis, meta-analysis, mittelstand, move fast and break things, move fast and break things, Network effects, Nicholas Carr, off grid, Panopticon Jeremy Bentham, payday loans, Peter Thiel, prediction markets, QR code, ransomware, Ray Kurzweil, recommendation engine, Renaissance Technologies, ride hailing / ride sharing, Robert Mercer, Ross Ulbricht, Sam Altman, Satoshi Nakamoto, Second Machine Age, sharing economy, Silicon Valley, Silicon Valley ideology, Silicon Valley startup, smart cities, smart contracts, smart meter, Snapchat, Stanford prison experiment, Steve Jobs, Steven Levy, strong AI, TaskRabbit, technological singularity, technoutopianism, Ted Kaczynski, the medium is the message, the scientific method, The Spirit Level, The Wealth of Nations by Adam Smith, The Wisdom of Crowds, theory of mind, too big to fail, ultimatum game, universal basic income, WikiLeaks, World Values Survey, Y Combinator

Everything on social media is still curated, usually by some mysterious algorithm rather than a human editor. These algorithms are designed to serve you content that you’re likely to click on, as that means the potential to sell more advertising alongside it. For example, YouTube’s ‘up next’ videos are statistically selected based on an unbelievably sophisticated analysis of what is most likely to keep a person hooked in. According to Guillaume Chaslot, an AI specialist who worked on the recommendation engine for YouTube, the algorithms aren’t there to optimise what is truthful or honest – but to optimise watch-time. ‘Everything else was considered a distraction,’ he recently told the Guardian.17 These non-decision decisions have huge implications, because even mild confirmation bias can set off a cycle of self-perpetuation. Let’s say you’ve clicked on a link about left-wing politics. An algorithm interprets this as you expressing an interest in left-wing politics, and therefore shows you more of it.


Smart Mobs: The Next Social Revolution by Howard Rheingold

A Pattern Language, augmented reality, barriers to entry, battle of ideas, Brewster Kahle, Burning Man, business climate, citizen journalism, computer vision, conceptual framework, creative destruction, Douglas Engelbart, Douglas Engelbart, experimental economics, experimental subject, Extropian, Hacker Ethic, Hedy Lamarr / George Antheil, Howard Rheingold, invention of the telephone, inventory management, John Markoff, John von Neumann, Joi Ito, Joseph Schumpeter, Kevin Kelly, Metcalfe's law, Metcalfe’s law, more computing power than Apollo, New Urbanism, Norbert Wiener, packet switching, Panopticon Jeremy Bentham, pattern recognition, peer-to-peer, peer-to-peer model, pez dispenser, planetary scale, pre–internet, prisoner's dilemma, RAND corporation, recommendation engine, Renaissance Technologies, RFID, Richard Stallman, Robert Metcalfe, Robert X Cringely, Ronald Coase, Search for Extraterrestrial Intelligence, SETI@home, sharing economy, Silicon Valley, skunkworks, slashdot, social intelligence, spectrum auction, Steven Levy, Stewart Brand, the scientific method, transaction costs, ultimatum game, urban planning, web of trust, Whole Earth Review, zero-sum game

The most trusted reviewers are read by more people and therefore make more money. Slashdot and other self-organized online forums enable participants to rate the postings of other participants in discussions, causing the best writing to rise in prominence and most objectionable postings to sink. Amazon’s online recommendation system tells customers about books and records bought by people whose tastes are similar to their own. Google.com, the foremost Internet search engine, lists first those Web sites that have the most links pointing to them—an implicit form of recommendation system. Hordes of programmers who compete for bragging rights as well as paying work are already driving the evolution of the first-generation reputation systems toward more advanced forms. Upendra Shardanand and Pattie Maes at the MIT Media Lab started something growing on the Net when they introduced Ringo, the “social information filtering” system that recommended music on the basis of shared tastes.1 The MIT researchers “automated word-of-mouth recommendations” with computational methods.
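
A minimal sketch of the "social information filtering" idea described above — recommend things liked by people whose tastes resemble yours. The names, ratings, and function names are invented for illustration; this is not Ringo's or Amazon's actual system.

# Minimal user-based collaborative filtering in the spirit of "social
# information filtering": recommend items liked by people with similar tastes.
# Toy data; function names are illustrative, not any vendor's real system.
from math import sqrt

ratings = {
    "alice": {"Kind of Blue": 5, "OK Computer": 4, "Blue Train": 5},
    "bob":   {"Kind of Blue": 4, "Blue Train": 5, "Nevermind": 2},
    "carol": {"OK Computer": 5, "Nevermind": 5},
}

def similarity(a, b):
    """Cosine similarity over the items two users have both rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, k=3):
    """Score unseen items by neighbours' ratings, weighted by taste similarity."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        for item, r in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    ranked = {i: scores[i] / weights[i] for i in scores if weights[i] > 0}
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(recommend("alice"))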

Even simple instruments that enable groups to share knowledge online by recommending useful Web sites, without requiring any action by the participants beyond bookmarking them, can multiply the groups’ effectiveness. In 1997, Hui Guo, Thomas Kreifelts, and Angi Voss of the German National Research Center for Information Technology described their “SOaP” social filtering service designed to address several of the problems constraining recommender systems.10 Guo and his colleagues created software agents, programs that could search, query, gather information, report results, even negotiate and execute transactions with other programs. The SOaP agents could implicitly collect recommendation information by the members of a group and mediate among people, groups, and the Web. At the most implicit level, SOaP agents can collect and cluster URLs that members of the group bookmark in the course of their work.

All feedback comments have to be connected to a transaction; only the seller and winning bidder can leave feedback. Buyers searching for items can see the feedback scores of the sellers. Over time, consistently honest sellers build up substantial reputation scores, which are costly to discard, guarding against the temptation to cheat buyers and adopt a new reputation. Paul Resnick, whose GroupLens had been a pioneering recommender system in 1992, and Richard Zeckhauser performed empirical studies on “a large data set from 1999” that indicated that despite the lack of physical presence on eBay, “trust has emerged due to the feedback or reputation system.”29 Biological theories of cooperation and experiments in game theory point to the expectation of dealing with others in future interactions— the “shadow of the future” that influences behavior in the present.


pages: 375 words: 88,306

The Sharing Economy: The End of Employment and the Rise of Crowd-Based Capitalism by Arun Sundararajan

additive manufacturing, Airbnb, AltaVista, Amazon Mechanical Turk, autonomous vehicles, barriers to entry, basic income, bitcoin, blockchain, Burning Man, call centre, collaborative consumption, collaborative economy, collective bargaining, commoditize, corporate social responsibility, cryptocurrency, David Graeber, distributed ledger, employer provided health coverage, Erik Brynjolfsson, Ethereum, ethereum blockchain, Frank Levy and Richard Murnane: The New Division of Labor, future of work, George Akerlof, gig economy, housing crisis, Howard Rheingold, information asymmetry, Internet of things, inventory management, invisible hand, job automation, job-hopping, Kickstarter, knowledge worker, Kula ring, Lyft, Marc Andreessen, megacity, minimum wage unemployment, moral hazard, moral panic, Network effects, new economy, Oculus Rift, pattern recognition, peer-to-peer, peer-to-peer lending, peer-to-peer model, peer-to-peer rental, profit motive, purchasing power parity, race to the bottom, recommendation engine, regulatory arbitrage, rent control, Richard Florida, ride hailing / ride sharing, Robert Gordon, Ronald Coase, Ross Ulbricht, Second Machine Age, self-driving car, sharing economy, Silicon Valley, smart contracts, Snapchat, social software, supply-chain management, TaskRabbit, The Nature of the Firm, total factor productivity, transaction costs, transportation-network company, two-sided market, Uber and Lyft, Uber for X, uber lyft, universal basic income, Zipcar

Thus, a big fraction of Google’s impact on the economy isn’t captured since changes in consumer surplus are not reflected in the GDP. This point has been noted about digital markets more generally. While a conventional brick-and-mortar bookstore may hold 40,000 to 100,000 books, Amazon offers access to over 3 million books. The same expansion in variety holds true for music, movies, electronics, and myriad other products. Furthermore, since Amazon uses several recommender systems to help promote products, it is not just variety but “fit” that has increased.14 Capturing the economic impacts of enhanced variety and automated word-of-mouth promotions, however, is difficult, since once again, what has changed is primarily the quality of the consumer experience. As Erik Brynjolfsson, Yu (Jeffery) Hu, and Michael Smith argue in their study of consumer surplus in the digital economy, these benefits may be particularly difficult to measure because different consumers are impacted to varying degrees.

This improves the welfare of these consumers by allowing them to locate and buy specialty products they otherwise would not have purchased due to high transaction costs or low product awareness. This effect will be especially beneficial to those consumers who live in remote areas.”15 Analogous increases in consumer surplus were documented by Anindya Ghose, Rahul Telang and Michael Smith in their 2005 study of electronic markets for used books.16 These effects are exacerbated by a wide variety of recommender systems that use machine learning algorithms to better direct consumer choice. As Alexander Tuzhilin and Gedas Adomavicius document, such systems are ubiquitous in digital markets.17 It is natural to expect similar challenges when, for example, trying to encompass the different economic impacts of increased variety and fit from Airbnb, or increased convenience from Lyft, or Dennis’s increased access to financing on the Isle of Gigha.

Smith, “Consumer Surplus in the Digital Economy: Estimating the Value of Increased Product Variety at Online Booksellers,” Management Science 49, 11 (2003): 1580–1596, 1581. 16. Anindya Ghose, Rahul Telang and Michael D. Smith, “Internet Exchanges for Used Books: An Empirical Analysis of Product Cannibalization and Welfare Impact,” Information Systems Research 17, 1 (2006): 3–9. http://pubsonline.informs.org/doi/abs/10.1287/isre.1050.0072. 17. Alexander Tuzhilin and Gedas Adomavicius, ”Toward the next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions,” IEEE Transactions on Knowledge and Data Engineering 17, 6 (2006): 734–739. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1423975&tag=1. 18. Prasanna Tambe and Lorin M. Hitt, “Job Hopping, Information Technology Spillovers, and Productivity Growth,” Management Science 60, 2 (2013): 338–355. 19. One might instead consider using the term “efficiency” of capital or “productivity” of capital.


pages: 196 words: 54,339

Team Human by Douglas Rushkoff

1960s counterculture, autonomous vehicles, basic income, Berlin Wall, big-box store, bitcoin, blockchain, Burning Man, carbon footprint, clean water, clockwork universe, cloud computing, collective bargaining, corporate personhood, disintermediation, Donald Trump, drone strike, European colonialism, Filter Bubble, full employment, future of work, game design, gig economy, Google bus, Gödel, Escher, Bach, Internet of things, invention of the printing press, invention of writing, invisible hand, iterative process, Kevin Kelly, knowledge economy, life extension, lifelogging, Mark Zuckerberg, Marshall McLuhan, means of production, new economy, patient HM, pattern recognition, peer-to-peer, Peter Thiel, Ray Kurzweil, recommendation engine, ride hailing / ride sharing, Ronald Reagan, Ronald Reagan: Tear down this wall, shareholder value, sharing economy, Silicon Valley, social intelligence, sovereign wealth fund, Steve Jobs, Steven Pinker, Stewart Brand, technoutopianism, theory of mind, trade route, Travis Kalanick, Turing test, universal basic income, Vannevar Bush, winner-take-all economy, zero-sum game

Connectivity may be the key to participation, but it also gives corporations more license and capacity to extract what little value people have left. Instead of retrieving the peer-to-peer marketplace, the digital economy exacerbates the division of wealth and paralyzes the social instincts for mutual aid that usually mitigate its effects. Digital platforms amplify the power law dynamics that determine winners and losers. While digital music platforms make space for many more performers to sell their music, their architecture and recommendation engines end up promoting many fewer artists than a diverse ecosystem of record stores or FM radio did. One or two superstars get all the plays, and everyone else sells almost nothing. It’s the same across the board. While the net creates more access for artists and businesses of all kinds, it allows fewer than ever to make any money. The same phenomenon takes place on the stock market, where ultra-fast trading algorithms spur unprecedented momentum in certain shares, creating massive surpluses of capital in the biggest digital companies and sudden, disastrous collapses of their would-be competitors.


pages: 222 words: 53,317

Overcomplicated: Technology at the Limits of Comprehension by Samuel Arbesman

algorithmic trading, Anton Chekhov, Apple II, Benoit Mandelbrot, citation needed, combinatorial explosion, Danny Hillis, David Brooks, digital map, discovery of the americas, en.wikipedia.org, Erik Brynjolfsson, Flash crash, friendly AI, game design, Google X / Alphabet X, Googley, HyperCard, Inbox Zero, Isaac Newton, iterative process, Kevin Kelly, Machine translation of "The spirit is willing, but the flesh is weak." to Russian and back, mandelbrot fractal, Minecraft, Netflix Prize, Nicholas Carr, Parkinson's law, Ray Kurzweil, recommendation engine, Richard Feynman, Richard Feynman: Challenger O-ring, Second Machine Age, self-driving car, software studies, statistical model, Steve Jobs, Steve Wozniak, Steven Pinker, Stewart Brand, superintelligent machines, Therac-25, Tyler Cowen: Great Stagnation, urban planning, Watson beat the top human players on Jeopardy!, Whole Earth Catalog, Y2K

The sophisticated machine learning techniques used in linguistics—employing probability and a large array of parameters rather than principled rules—are increasingly being used in numerous other areas, both in science and outside it, from criminal detection to medicine, as well as in the insurance industry. Even our aesthetic tastes are rather complicated, as Netflix discovered when it awarded a prize for improvements in its recommendation engine to a team whose solution was cobbled together from a variety of different statistical techniques. The contest seemed to demonstrate that no simple algorithm could provide a significant improvement in recommendation accuracy; the winners needed to use a more complex suite of methods in order to capture and predict our personal and quirky tastes in films. This phenomenon occurs in all types of technology.
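
The winning "suite of methods" approach can be illustrated, very roughly, as a blend of several imperfect predictors whose weights are fit on held-out data. Everything below (the synthetic ratings and the three stand-in "models") is invented for demonstration; it is not the prize-winning solution.

# Toy illustration of blending several imperfect predictors, in the spirit of
# the ensemble approach that won the Netflix Prize; data and "models" are made up.
import numpy as np

rng = np.random.default_rng(0)
true_ratings = rng.uniform(1, 5, size=200)            # held-out ratings
preds = np.column_stack([
    true_ratings + rng.normal(0, 1.0, 200),           # model A: noisy
    true_ratings + rng.normal(0.5, 0.8, 200),         # model B: biased
    true_ratings + rng.normal(0, 0.6, 200),           # model C: better
])

# Fit blend weights by least squares on the held-out set.
weights, *_ = np.linalg.lstsq(preds, true_ratings, rcond=None)
blended = preds @ weights

rmse = lambda p: np.sqrt(np.mean((p - true_ratings) ** 2))
print("individual RMSEs:", [round(rmse(preds[:, i]), 3) for i in range(3)])
print("blended RMSE:    ", round(rmse(blended), 3))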


pages: 181 words: 52,147

The Driver in the Driverless Car: How Our Technology Choices Will Create the Future by Vivek Wadhwa, Alex Salkever

23andMe, 3D printing, Airbnb, artificial general intelligence, augmented reality, autonomous vehicles, barriers to entry, Bernie Sanders, bitcoin, blockchain, clean water, correlation does not imply causation, distributed ledger, Donald Trump, double helix, Elon Musk, en.wikipedia.org, epigenetics, Erik Brynjolfsson, Google bus, Hyperloop, income inequality, Internet of things, job automation, Kevin Kelly, Khan Academy, Kickstarter, Law of Accelerating Returns, license plate recognition, life extension, longitudinal study, Lyft, M-Pesa, Menlo Park, microbiome, mobile money, new economy, personalized medicine, phenotype, precision agriculture, RAND corporation, Ray Kurzweil, recommendation engine, Ronald Reagan, Second Machine Age, self-driving car, Silicon Valley, Skype, smart grid, stem cell, Stephen Hawking, Steve Wozniak, Stuxnet, supercomputer in your pocket, Tesla Model S, The Future of Employment, Thomas Davenport, Travis Kalanick, Turing test, Uber and Lyft, Uber for X, uber lyft, uranium enrichment, Watson beat the top human players on Jeopardy!, zero day

In general, narrow-A.I. systems can do a better job on a very specific range of tasks than humans can. I couldn’t, for example, recall the winning and losing pitcher in every baseball game of the major leagues from the previous night. Narrow A.I. is now embedded in the fabric of our everyday lives. The humanoid phone trees that route calls to airlines’ support desks are all narrow A.I., as are recommendation engines in Amazon and Spotify. Google Maps’ astonishingly smart route suggestions (and mid-course modifications to avoid traffic) are classic narrow A.I. Narrow-A.I. systems are much better than humans are at accessing information stored in complex databases, but their capabilities are specific and limited, and exclude creative thought. If you asked Siri to find the perfect gift for your mother for Valentine’s Day, she might make a snarky comment, but she couldn’t venture an educated guess.


pages: 554 words: 149,489

The Content Trap: A Strategist's Guide to Digital Change by Bharat Anand

Airbnb, Benjamin Mako Hill, Bernie Sanders, Clayton Christensen, cloud computing, commoditize, correlation does not imply causation, creative destruction, crowdsourcing, death of newspapers, disruptive innovation, Donald Trump, Google Glasses, Google X / Alphabet X, information asymmetry, Internet of things, inventory management, Jean Tirole, Jeff Bezos, John Markoff, Just-in-time delivery, Khan Academy, Kickstarter, late fees, Mark Zuckerberg, market design, Minecraft, multi-sided market, Network effects, post-work, price discrimination, publish or perish, QR code, recommendation engine, ride hailing / ride sharing, selection bias, self-driving car, shareholder value, Shenzhen was a fishing village, Silicon Valley, Silicon Valley startup, Skype, social graph, social web, special economic zone, Stephen Hawking, Steve Jobs, Steven Levy, Thomas L Friedman, transaction costs, two-sided market, ubercab, WikiLeaks, winner-take-all economy, zero-sum game

Consider the intrinsic technology properties of networked products, or the word-of-mouth benefits that arise from seemingly unpredictable acts of sharing by interested individuals. It’s tempting to view these user connections as “acts of nature” over which managers have little control. But that’s not the case. By 2002 Amazon had spent more than five years creating a formidable advantage in e-commerce. That came not only from a user-friendly platform and recommendation engine—both features were adopted by other entrants—but from its warehousing and logistics operation. By building distribution centers across the country, investing in algorithms to optimize pick-time in the centers, and hiring operational wizards from Walmart and other competitors, Amazon could get products to customers anywhere in the United States faster and cheaper than anyone else. Then, just when it appeared to be distancing itself from its rivals, Amazon did something that seemed incomprehensible: It opened its fulfillment and warehousing network to any third-party retailer that wanted to participate.

Figure 26: Connected Choices at Netflix From 1997 to 2008 Netflix expanded from a single distribution center to forty-four across the country, a significant capital expenditure. It was this system that anchored a set of other choices around it. Netflix’s queueing system, widely regarded as a tool to enhance user convenience, was instead really a powerful lever for demand forecasting: It told the company exactly what movies every customer in every part of the country wanted next, letting it tailor inventory in different warehouses to local preferences. The recommendation engine, also thought of as a means of increasing customer satisfaction, doubled as an inventory management tool: It let the company recommend not only movies a customer might like, but also those that were in stock! Netflix integrated its sorting machines with the U.S. Postal Service to make deliveries more efficient. It even hired a former postmaster general to guide its operations. And its distributed warehouse system allowed it to secure DVD titles at a relatively low cost per user, since it minimized inventory and maximized turns.
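
The dual use described above — recommending titles a customer will like and that are physically on the shelf — amounts to re-ranking predicted preferences under an availability constraint. A minimal sketch, with invented scores and stock counts rather than anything from Cinematch:

# Sketch of a recommendation list re-ranked by inventory availability, a
# stylised version of "recommend what the customer will like AND what is in
# stock". Scores and stock levels are invented.
predicted_score = {"Memento": 4.6, "Amelie": 4.4, "Gattaca": 4.1, "Heat": 3.9}
copies_in_stock = {"Memento": 0, "Amelie": 7, "Gattaca": 3, "Heat": 12}

def rank_for_shipping(scores, stock, top_n=3):
    # Keep only titles that can actually ship today, then sort by predicted
    # preference; a real system might soft-penalise scarce titles instead.
    available = [(t, s) for t, s in scores.items() if stock.get(t, 0) > 0]
    return sorted(available, key=lambda ts: ts[1], reverse=True)[:top_n]

print(rank_for_shipping(predicted_score, copies_in_stock))
# -> [('Amelie', 4.4), ('Gattaca', 4.1), ('Heat', 3.9)]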


How to Be a Liberal by Ian Dunt

4chan, Alfred Russel Wallace, bank run, battle of ideas, Big bang: deregulation of the City of London, Boris Johnson, bounce rate, British Empire, Brixton riot, Carmen Reinhart, centre right, David Ricardo: comparative advantage, Dominic Cummings, Donald Trump, eurozone crisis, experimental subject, feminist movement, Francis Fukuyama: the end of history, full employment, Growth in a Time of Debt, illegal immigration, invisible hand, John Bercow, Kenneth Rogoff, liberal world order, Mark Zuckerberg, mass immigration, means of production, Mohammed Bouazizi, Northern Rock, old-boy network, Paul Samuelson, Peter Thiel, price mechanism, profit motive, quantitative easing, recommendation engine, road to serfdom, Ronald Reagan, Saturday Night Live, Scientific racism, Silicon Valley, The Wealth of Nations by Adam Smith, too big to fail, upwardly mobile, Winter of Discontent, working poor, zero-sum game

And those users were subject to an algorithm that seemed to push them towards ever more extreme material for their political tribe. This was chiefly because of its recommendation engine, which presented a viewer with options for what they might want to watch after they finished a video. The YouTube algorithm was not based on how to make sure people came across alternate views so that it could preserve the health of liberal democracy. It was based, like that of other social media operations, purely on engagement. Initially, the website grounded it in ‘clicks to watch,’ but it then pivoted to ‘watchtime.’ Whatever got people watching longer was what mattered. The political effect was potentially very far-reaching. If someone clicked on a left-wing video and watched it to the end, the recommendation engine would provide more left-wing videos. Out of the options, the user might pick one.
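
Mechanically, the pivot from "clicks to watch" to "watchtime" is a change in the quantity the ranker maximises. The toy comparison below uses invented candidate videos and probabilities to show how the same list produces different winners under the two objectives.

# Ranking by expected clicks versus expected watch-time; candidates and
# probabilities are invented for illustration.
candidates = [
    # (title, probability of click, expected minutes watched if clicked)
    ("2-minute news clip",      0.30, 1.5),
    ("45-minute partisan rant", 0.10, 35.0),
    ("10-minute explainer",     0.20, 7.0),
]

by_clicks    = max(candidates, key=lambda c: c[1])
by_watchtime = max(candidates, key=lambda c: c[1] * c[2])

print("click-optimised pick:     ", by_clicks[0])
print("watch-time-optimised pick:", by_watchtime[0])
# The long, engagement-heavy video wins once watch-time is the objective.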


pages: 176 words: 55,819

The Start-Up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career by Reid Hoffman, Ben Casnocha

Airbnb, Andy Kessler, Black Swan, business intelligence, Cal Newport, Clayton Christensen, commoditize, David Brooks, Donald Trump, en.wikipedia.org, fear of failure, follow your passion, future of work, game design, Jeff Bezos, job automation, Joi Ito, late fees, lateral thinking, Marc Andreessen, Mark Zuckerberg, Menlo Park, out of africa, Paul Graham, paypal mafia, Peter Thiel, recommendation engine, Richard Bolles, risk tolerance, rolodex, shareholder value, side project, Silicon Valley, Silicon Valley startup, social web, Steve Jobs, Steve Wozniak, Tony Hsieh, transaction costs

In 1999 he set up a meeting at Blockbuster’s headquarters in part to discuss possibly partnering on local distribution and faster fulfillment. Blockbuster was not impressed. “They just about laughed us out of their office,” Reed recalls.16 Reed and his team kept at it. They perfected their distribution center network so that more than 80 percent of customers received overnight delivery of movies.17 They developed an innovative recommendation engine that prompted users with movies they might like based on past purchases. By 2005 Netflix had a subscriber base four million strong, had fended off competition from imitations like Walmart’s online movie-by-mail effort, and became the king of online movie rentals. In 2010 Netflix made a profit of more than $160 million. Blockbuster, in comparison, failed to adapt to the Internet era. That year it filed for bankruptcy.18 Netflix is not resting.


pages: 593 words: 118,995

Relevant Search: With Examples Using Elasticsearch and Solr by Doug Turnbull, John Berryman

commoditize, crowdsourcing, domain-specific language, finite state, fudge factor, full text search, information retrieval, natural language processing, premature optimization, recommendation engine, sentiment analysis

These methods are less intuitive than the simple co-occurrence counting method presented here, and they tend to be more challenging to implement. But they often provide better results, because they employ a more holistic understanding of item-user relationships. To dive deeper into recommendation systems, we recommend Practical Recommender Systems by Kim Falk (Manning, 2016). And no matter the method you choose, keep in mind that the end result is a model that lets you quickly find the item-to-item or user-to-item affinities. This understanding is important as we explain how collaborative filtering results can be used in the context of search. 11.2.3. Tying user behavior information back to the search index In the previous section, we demonstrated how to build a simple recommendation system. But we’re supposed to be talking about personalized search! In this section, we return to search and explain how the output of collaborative filtering can be used to build a more personalized search experience.
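
The "simple co-occurrence counting method" the excerpt refers to can be compressed into a short sketch: two items are related when many users have interacted with both. This is not the book's listing; the interaction data below is invented.

# Item-to-item affinity by co-occurrence counting over user histories.
from collections import Counter
from itertools import combinations

user_histories = {
    "u1": {"laptop", "mouse", "usb hub"},
    "u2": {"laptop", "mouse"},
    "u3": {"mouse", "usb hub", "keyboard"},
}

cooccurrence = Counter()
for items in user_histories.values():
    for a, b in combinations(sorted(items), 2):
        cooccurrence[(a, b)] += 1
        cooccurrence[(b, a)] += 1

def related_items(item, top_n=2):
    pairs = [(b, n) for (a, b), n in cooccurrence.items() if a == item]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:top_n]

print(related_items("mouse"))   # e.g. [('laptop', 2), ('usb hub', 2)]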

In both cases, we start with relatively simple methods and then outline more sophisticated approaches using machine learning. In the process of laying out personalized search, we introduce recommendations. You can provide users with personalized content recommendations even before they’ve made a search. In addition, you’ll see that a search engine can be a powerful platform for building a recommendation system. Figure 11.1 shows recommendations side-by-side with search, implemented by a relevance engineer. Figure 11.1. By incorporating knowledge about the content and the user, search can be extended to tasks such as personalized search and recommendations. 11.1. Personalizing search based on user profiles Until now, we’ve defined relevance in terms of how well a search result matches a user’s immediate information need.

Here, information comes in three flavors: information about the users, about the items in the catalog, and about the current context of recommendation: User information —As users interact with the application, you can identify patterns in their behavior and learn about their interests and tastes. Particularly engaged users might even be willing to directly tell us about their interests. Item information —To make good recommendations, it’s important to be familiar with the items in the catalog. At a minimum, the items need to have useful textual content to match on. Items also need good metadata for boosting and filtering. In more advanced recommendation systems, you should also take advantage of the overall user behavior that gives you new information about how items in the catalog are interrelated. Recommendation context —To provide users with the best recommendations possible, you must consider their current context. Are they looking at an item details page? Then you should make recommendations for related items in case they aren’t sold on this one.


pages: 223 words: 60,909

Technically Wrong: Sexist Apps, Biased Algorithms, and Other Threats of Toxic Tech by Sara Wachter-Boettcher

Airbnb, airport security, AltaVista, big data - Walmart - Pop Tarts, Donald Trump, Ferguson, Missouri, Firefox, Grace Hopper, job automation, Kickstarter, lifelogging, Mark Zuckerberg, Menlo Park, move fast and break things, natural language processing, pattern recognition, Peter Thiel, recommendation engine, ride hailing / ride sharing, self-driving car, Silicon Valley, Silicon Valley startup, Snapchat, Steve Jobs, Tim Cook: Apple, Travis Kalanick, upwardly mobile, women in the workforce, zero-sum game

In other words, if a system like Word2vec is fed data that reflects historical biases, then those biases will be reflected in the resulting word embeddings. The problem is that very few people have been talking about this—and meanwhile, because Google released Word2vec as an open-source technology, all kinds of companies are using it as the foundation for other products. These products include recommendation engines (the tools behind all those “you might also like . . .” features on websites), document classification, and search engines—all without considering the implications of relying on data that reflects historical biases and outdated norms to make future predictions. One of the most worrisome developments is this: using word embeddings to automatically review résumés. That’s what a company called Talla, which makes artificial-intelligence software, reported it was doing in 2016.
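
The worry about résumé screening can be made concrete with a toy example. The three-dimensional vectors below are invented stand-ins for real pretrained word2vec embeddings; only the arithmetic (cosine similarity) reflects how such systems actually compare words.

# Toy demonstration of how a biased embedding can leak into downstream scoring.
import numpy as np

emb = {
    "engineer": np.array([0.9, 0.1, 0.3]),
    "nurse":    np.array([0.2, 0.9, 0.3]),
    "man":      np.array([0.8, 0.2, 0.1]),
    "woman":    np.array([0.2, 0.8, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for job in ("engineer", "nurse"):
    print(job,
          "man:",   round(cosine(emb[job], emb["man"]), 2),
          "woman:", round(cosine(emb[job], emb["woman"]), 2))
# If a resume scorer ranks candidates by similarity to a job title in such a
# space, historical gender associations are silently carried into the ranking.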


pages: 229 words: 68,426

Everyware: The Dawning Age of Ubiquitous Computing by Adam Greenfield

augmented reality, business process, defense in depth, demand response, demographic transition, facts on the ground, game design, Howard Rheingold, Internet of things, James Dyson, knowledge worker, late capitalism, Marshall McLuhan, new economy, Norbert Wiener, packet switching, pattern recognition, profit motive, QR code, recommendation engine, RFID, Steve Jobs, technoutopianism, the built environment, the scientific method

But the word "hint" is well-chosen here, because that's really all the cup will be able to communicate. It may well be that a full mug on my desk implies that I am also in the room, but this is not always going to be the case, and any system that correlates the two facts had better do so pretty loosely. Products and services based on such pattern-recognition already exist in the world—I think of Amazon's "collaborative filtering"–driven recommendation engine—but for the most part, their designers are only now beginning to recognize that they have significantly underestimated the difficulty of deriving meaning from those patterns. The better part of my Amazon recommendations turn out to be utterly worthless—and of all commercial pattern-recognition systems, that's among those with the largest pools of data to draw on. Lest we forget: "simple" is hard.


pages: 233 words: 67,596

Competing on Analytics: The New Science of Winning by Thomas H. Davenport, Jeanne G. Harris

always be closing, big data - Walmart - Pop Tarts, business intelligence, business process, call centre, commoditize, data acquisition, digital map, en.wikipedia.org, global supply chain, high net worth, if you build it, they will come, intangible asset, inventory management, iterative process, Jeff Bezos, job satisfaction, knapsack problem, late fees, linear programming, Moneyball by Michael Lewis explains big data, Netflix Prize, new economy, performance metric, personalized medicine, quantitative hedge fund, quantitative trading / quantitative finance, recommendation engine, RFID, search inside the book, shareholder value, six sigma, statistical model, supply-chain management, text mining, the scientific method, traveling salesman, yield management

Netflix offers free shipping of DVDs to its roughly 6 million customers and provides a return shipping package, also free. Customers watch their cinematic choices at their leisure; there are no late fees. When the DVDs are returned, customers select their next films. Besides the logistical expertise that Netflix needs to make this a profitable venture, Netflix employs analytics in two important ways, both driven by customer behavior and buying patterns. The first is a movie-recommendation “engine” called Cinematch that’s based on proprietary, algorithmically driven software. Netflix hired mathematicians with programming experience to write the algorithms and code to define clusters of movies, connect customer movie rankings to the clusters, evaluate thousands of ratings per second, and factor in current Web site behavior—all to ensure a personalized Web page for each visiting customer.
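
The cluster-then-personalise workflow described for Cinematch can be sketched as: assign titles to clusters by simple attribute vectors, score each cluster from a member's own ratings, and suggest unseen titles from the best-liked cluster. The attributes, centroids, and ratings below are invented; Netflix's real algorithms are proprietary.

# Stylised cluster-based recommendation: invented "flavour" vectors and clusters.
movies = {
    # title: (action, romance, documentary) flavour vector
    "Heat":           (0.9, 0.1, 0.0),
    "Ronin":          (0.8, 0.1, 0.0),
    "Amelie":         (0.1, 0.9, 0.0),
    "Before Sunrise": (0.0, 0.9, 0.1),
}
member_ratings = {"Heat": 5, "Amelie": 2}
centroids = {"action": (0.85, 0.1, 0.0), "romance": (0.05, 0.9, 0.05)}

def nearest_cluster(vec):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist(vec, centroids[c]))

cluster_of = {t: nearest_cluster(v) for t, v in movies.items()}

# Average the member's ratings per cluster, then suggest unseen titles from
# the best-liked cluster.
cluster_scores = {}
for title, rating in member_ratings.items():
    cluster_scores.setdefault(cluster_of[title], []).append(rating)
best = max(cluster_scores, key=lambda c: sum(cluster_scores[c]) / len(cluster_scores[c]))
suggestions = [t for t in movies if cluster_of[t] == best and t not in member_ratings]
print(best, suggestions)   # -> action ['Ronin']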


pages: 499 words: 144,278

Coders: The Making of a New Tribe and the Remaking of the World by Clive Thompson

2013 Report for America's Infrastructure - American Society of Civil Engineers - 19 March 2013, 4chan, 8-hour work day, Ada Lovelace, AI winter, Airbnb, Amazon Web Services, Asperger Syndrome, augmented reality, Ayatollah Khomeini, barriers to entry, basic income, Bernie Sanders, bitcoin, blockchain, blue-collar work, Brewster Kahle, Brian Krebs, Broken windows theory, call centre, cellular automata, Chelsea Manning, clean water, cloud computing, cognitive dissonance, computer vision, Conway's Game of Life, crowdsourcing, cryptocurrency, Danny Hillis, David Heinemeier Hansson, don't be evil, don't repeat yourself, Donald Trump, dumpster diving, Edward Snowden, Elon Musk, Erik Brynjolfsson, Ernest Rutherford, Ethereum, ethereum blockchain, Firefox, Frederick Winslow Taylor, game design, glass ceiling, Golden Gate Park, Google Hangouts, Google X / Alphabet X, Grace Hopper, Guido van Rossum, Hacker Ethic, HyperCard, illegal immigration, ImageNet competition, Internet Archive, Internet of things, Jane Jacobs, John Markoff, Jony Ive, Julian Assange, Kickstarter, Larry Wall, lone genius, Lyft, Marc Andreessen, Mark Shuttleworth, Mark Zuckerberg, Menlo Park, microservices, Minecraft, move fast and break things, move fast and break things, Nate Silver, Network effects, neurotypical, Nicholas Carr, Oculus Rift, PageRank, pattern recognition, Paul Graham, paypal mafia, Peter Thiel, pink-collar, planetary scale, profit motive, ransomware, recommendation engine, Richard Stallman, ride hailing / ride sharing, Rubik’s Cube, Ruby on Rails, Sam Altman, Satoshi Nakamoto, Saturday Night Live, self-driving car, side project, Silicon Valley, Silicon Valley ideology, Silicon Valley startup, single-payer health, Skype, smart contracts, Snapchat, social software, software is eating the world, sorting algorithm, South of Market, San Francisco, speech recognition, Steve Wozniak, Steven Levy, TaskRabbit, the High Line, Travis Kalanick, Uber and Lyft, Uber for X, uber lyft, universal basic income, urban planning, Wall-E, Watson beat the top human players on Jeopardy!, WikiLeaks, women in the workforce, Y Combinator, Zimmermann PGP, éminence grise

At Columbia University, the researcher Jonathan Albright experimentally searched on YouTube for the phrase “crisis actors,” in the wake of a major school shooting, and took the “next up” recommendation from the recommendation system. He quickly amassed 9,000 videos, a large percentage of which seemed custom-designed to shock, inflame, or mislead, ranging from “rape game jokes, shock reality social experiments, celebrity pedophilia, ‘false flag’ rants, and terror-related conspiracy theories,” as he wrote. Some of it, he figured, was driven by sheer profit motive: Post outrageous nonsense, get into the recommendation system, and reap the profit from the clicks. Recommender systems, in other words, may have a bias toward “inflammatory content,” as Tufekci notes. Another academic, Renée DiResta, found the same problem with Facebook’s recommendation system for its “Groups.” People who read posts about vaccines were urged to join anti-vaccination groups, and thence to groups devoted to even more unhinged conspiracies like “chemtrails.”

They couldn’t show users every posting of their friend, because that would drown them in trivia. They needed automation, an algorithm that would pick only posts you’d most likely find interesting. How does Facebook figure that out? It’s hard to know for sure. Social networks do not discuss their ranking systems with much detail, to prevent people from gaming their algorithms; spammers constantly try to suss out how recommendation systems work so they can produce spammy material that will get upranked. So few outside the firms truly know. But generally, the algorithms uprank the type of content you’d expect: posts and photos and videos that have amassed tons of likes or “faves” or attracted many comments, reposts, and retweets, with a particular bias toward recent activity. Signals like these help fuel the “recommended” videos on YouTube, the “trending” topics on Twitter or Reddit, and the posts that materialize in your News Feed.
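
A generic version of the upranking the passage describes — weight likes, comments, and reshares, then decay the score with age — fits in a few lines. The weights and half-life below are invented; real feed rankers are far more elaborate and personalised.

# Engagement-ranking sketch: weighted interactions with exponential time decay.
import math

posts = [
    {"id": "a", "likes": 120, "comments": 4,  "reshares": 2,  "age_hours": 30},
    {"id": "b", "likes": 40,  "comments": 25, "reshares": 10, "age_hours": 3},
    {"id": "c", "likes": 500, "comments": 1,  "reshares": 0,  "age_hours": 90},
]

def score(post, half_life_hours=24):
    engagement = post["likes"] + 4 * post["comments"] + 6 * post["reshares"]
    decay = math.exp(-math.log(2) * post["age_hours"] / half_life_hours)
    return engagement * decay

for p in sorted(posts, key=score, reverse=True):
    print(p["id"], round(score(p), 1))
# The recent, heavily commented post outranks the older, more-liked ones.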


pages: 252 words: 74,167

Thinking Machines: The Inside Story of Artificial Intelligence and Our Race to Build the Future by Luke Dormehl

Ada Lovelace, agricultural Revolution, AI winter, Albert Einstein, Alexey Pajitnov wrote Tetris, algorithmic trading, Amazon Mechanical Turk, Apple II, artificial general intelligence, Automated Insights, autonomous vehicles, book scanning, borderless world, call centre, cellular automata, Claude Shannon: information theory, cloud computing, computer vision, correlation does not imply causation, crowdsourcing, drone strike, Elon Musk, Flash crash, friendly AI, game design, global village, Google X / Alphabet X, hive mind, industrial robot, information retrieval, Internet of things, iterative process, Jaron Lanier, John Markoff, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, John von Neumann, Kickstarter, Kodak vs Instagram, Law of Accelerating Returns, life extension, Loebner Prize, Marc Andreessen, Mark Zuckerberg, Menlo Park, natural language processing, Norbert Wiener, out of africa, PageRank, pattern recognition, Ray Kurzweil, recommendation engine, remote working, RFID, self-driving car, Silicon Valley, Skype, smart cities, Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia, social intelligence, speech recognition, Stephen Hawking, Steve Jobs, Steve Wozniak, Steven Pinker, strong AI, superintelligent machines, technological singularity, The Coming Technological Singularity, The Future of Employment, Tim Cook: Apple, too big to fail, Turing machine, Turing test, Vernor Vinge, Watson beat the top human players on Jeopardy!

Much like laws continue to be followed after lawmakers have passed away, the idea of an expert system is that we ought to be able to continue drawing on an expert’s knowledge about a specialist subject after the person is no longer available to us. The concept failed, but the intention (and, for a while, the funding) was absolutely there. In some senses, the modern parallel of the expert system is the so-called ‘recommender system’. This subclass of information filtering system sets out to anticipate and predict what rating or selection a user is likely to give an item in a specific narrow domain. Everyone reading this will likely have come across the feature on Amazon or Netflix which suggests that, ‘You liked X, so you may also enjoy Y.’ Sometimes these predictions are less than stellar, but as much as we like to think of ourselves as fundamentally unpredictable beings, it’s often surprising just how accurate they are.
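
The "You liked X, so you may also enjoy Y" prediction can be sketched as a similarity-weighted average of a user's own ratings on related items. The item-similarity numbers and ratings below are invented for illustration.

# Item-based rating prediction from a handful of invented similarities.
item_similarity = {
    ("Blade Runner", "Alien"): 0.8,
    ("Blade Runner", "Notting Hill"): 0.1,
    ("Alien", "Notting Hill"): 0.05,
}

def sim(a, b):
    return item_similarity.get((a, b)) or item_similarity.get((b, a), 0.0)

def predict_rating(user_ratings, candidate):
    num = sum(sim(candidate, seen) * r for seen, r in user_ratings.items())
    den = sum(sim(candidate, seen) for seen in user_ratings)
    return num / den if den else None

my_ratings = {"Blade Runner": 5, "Notting Hill": 2}
print(round(predict_rating(my_ratings, "Alien"), 2))   # -> 4.82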

But Minsky didn’t simply want to save information and make it available for future generations. He wanted to be around to see it. ‘Eventually we will entirely replace our brains using nanotechnology,’ he wrote. ‘Once delivered from the limitations of biology, we will be able to decide the length of our lives – with the option of immortality – and choose among other, unimagined capabilities as well.’ The Connectome A complex recommender system ‘mindfile’ of the sort described by Marius Ursache and William Sims Bainbridge may go some way towards replicating us in software form. However, the only truly faithful means of making sure that a person is reconstructed in a form other than their original one would be to duplicate all of the cellular pathways in the brain – neuron by painstaking neuron. For this to be possible, we must first accept the central tenet of Artificial Intelligence: that the main task that the brain carries out can be viewed as information processing not dissimilar to that which is carried out by a computer.

(TV show) 135–9, 162, 189–90, 225, 254 Jobs, Steve 6–7, 32, 35, 108, 113, 181, 193, 231 Jochem, Todd 55–6 judges 153–4 Kasparov, Garry 137, 138–9, 177 Katz, Lawrence 159–60 Keck, George Fred 81–2 Keynes, John Maynard 139–40 Kjellberg, Felix (PewDiePie) 151 ‘knowledge engineers’ 29, 37 Knowledge Narrator 110–11 Kodak 238 Kolibree 67 Koza, John 188–9 Ktesibios of Alexandria 71–2 Kubrick, Stanley 2, 228 Kurzweil, Ray 213–14, 231–3 Landauer, Thomas 201–2 Lanier, Jaron 156, 157 Laorden, Carlos 100, 101 learning 37–9, 41–4, 52–3, 55 Deep 11–2, 56–63, 96–7, 164, 225 and email filters 88 machine 3, 71, 84–6, 88, 100, 112, 154, 158, 197, 215, 233, 237, 239 reinforcement 83, 232 and smart homes 84, 85 supervised 57 unsupervised 57–8 legal profession 145, 188, 192 LegalZoom 145 LG 132 Lickel, Charles 136–7 ‘life logging’ software 200 Linden, David J. 213–14 Loebner, Hugh 102–3, 105 Loebner Prize 102–5 Lohn, Jason 182, 183–5, 186 long-term potentiation 39–40 love 122–4 Lovelace, Ada 185, 189 Lovelace Test 185–6 Lucas, George 110–11 M2M communication 70–71 ‘M’ (AI assistant) 153 Machine Intelligence from Cortical Networks (MICrONS) project 214–15 machine learners 38 machine learning 3, 71, 84–6, 88, 100, 112, 154, 158, 197, 215, 233, 237, 239 Machine Translator 8–9, 11 ‘machine-aided recognition’ 19–20 Manhattan Project 14, 229 MARK 1 (computer) 43–4 Mattersight Corporation 127 McCarthy, John 18, 19, 20, 27, 42, 54, 253 McCulloch, Warren 40–2, 43, 60, 142–3 Mechanical Turk jobs 152–7 medicine 11, 30, 87–8, 92–5, 187–8, 192, 247, 254 memory 13, 14, 16, 38–9, 42, 49 ‘micro-worlds’ 25 Microsoft 62–3, 106–7, 111–12, 114, 118, 129 mind mapping the 210–14, 217, 218 ‘mind clones’ 203 uploads 221 mindfiles 201–2, 207, 212 Minsky, Marvin 18, 21, 24, 32, 42, 44–6, 49, 105, 205–7, 253–4 MIT 19–20, 27, 96–7, 129, 194–5 Mitsuku (chatterbot) 103–6, 108 Modernising Medicine 11 Momentum Machines, Inc. 
141 Moore’s Law 209, 220, 231 Moravec’s paradox 26–7 mortgage applications 237–8 MTurk platform 153, 154, 155 music 168, 172–7, 179 Musk, Elon 149–50, 223–4 MYCIN (expert system) 30–1 nanobots 213–14 nanosensors 92 Nara Logics 118 NASA 6, 182, 184–5 natural selection 182–3 navigational aids 90–1, 126, 127, 128, 241 Nazis 15, 17, 227 Negobot 99–102 Nest Labs 67, 96, 254 Netflix 156, 198 NETtalk 51, 52–3, 60 neural networks 11–12, 38–9, 41, 42–3, 97, 118, 164–6, 168, 201, 208–9, 211, 214–15, 218, 220, 224–5, 233, 237–8, 249, 254, 256–7 neurons 40, 41–2, 46, 49–50, 207, 209–13, 216 neuroscience 40–2, 211, 212, 214, 215 New York World’s Fair 1964 5–11 Newell, Alan 19, 226 Newman, Judith 128–9 Nuance Communications 109 offices, smart 90 OpenWorm 210 ‘Optical Scanning and Information Retrieval’ 7–8, 10 paedophile detection 99–102 Page, Larry 6–7, 34, 220 ‘paperclip maximiser’ scenario 235 Papert, Seymour 27, 44, 45–6, 49 Paro (therapeutic robot) 130–1 patents 188–9 Perceiving and Recognising Automation (PARA) 43 perceptrons 43–6 personality capture 200–4 pharmaceuticals 187–8 Pitts, Walter 40–2, 43, 60 politics 119–2 Pomerlau, Dean 54, 55–6, 90 prediction 87, 198–9 Profound Hypothermia and Circulatory Arrest 219–20 punch-cards 8 Qualcomm 93 radio-frequency identification device (RFID) 65–6 Ramón y Cajal, Santiago 39–40 Rapidly Adapting Lateral Position Handler (RALPH) 55 ‘recommender system’ 198 refuse collection 142 ‘relational agents’ 130 remote working 238–9 reverse engineering 208, 216, 217 rights for AIs 248–51 risks of AI 223–40 accountability issues 240–4, 246–8 ethics 244–8 rights for AIs 248–51 technological unemployment 139–50, 163, 225, 255 robots 62, 74–7, 89–90, 130–1, 141, 149, 162, 217, 225, 227, 246–7, 255–6 Asimov’s three ethical rules of 244–8 robotic limbs 211–12 Roomba robot vacuum cleaner 75–7, 234, 236 Rosenblatt, Frank 42–6, 61, 220 rules 36–7, 79–80 Rumelhart, David 48, 50–1, 63 Russell, Bertrand 41 Rutter, Brad 138, 139 SAINT program 20 sampling (music) 155, 157 ‘Scheherazade’ (Ai storyteller) 169–70 scikit-learn 239 Scripps Health 92 Sculley, John 110–11 search engines 109–10 Searle, John 24–5 Second Life (video game) 194 Second World War 12–13, 14–15, 17, 72, 227 Sejnowski, Terry 48, 51–3 self-awareness 77, 246–7 self-driving 53–6, 90, 143, 149–50 Semantic Information Retrieval (SIR) 20–2 sensors 75–6, 80, 84–6, 93 SHAKEY robot 23–4, 27–8, 90 Shamir, Lior 172–7, 179, 180 Shannon, Claude 13, 16–18, 28, 253 shipping systems 198 Simon, Herbert 10, 19, 24, 226 Sinclair Oil Corporation 6 Singularity, the 228–3, 251, 256 Siri (AI assistant) 108–11, 113–14, 116, 118–19, 125–30, 132, 225–6, 231, 241, 256 SITU 69, 93 Skynet 231 smart devices 3, 66–7, 69–71, 73–7, 80–8, 92–7, 230–1, 254 and AI assistants 116 and feedback 73–4 problems with 94–7 ubiquitous 92–4 and unemployment 141–2 smartwatches 66, 93, 199 Sony 199–200 Sorto, Erik 211, 212 Space Invaders (video game) 37 spectrometers 93 speech recognition 59, 62, 109, 111, 114, 120 SRI International 28, 89–90, 112–13 StarCraft II (video game) 186–7 story generation 169–70 strategy 36 STUDENT program 20 synapses 209 Synthetic Interview 202–3 Tamagotchis 123–5 Tay (chatbot) 106–7 Taylorism 95–6 Teknowledge 32, 33 Terminator franchise 231, 235 Tetris (video game) 28 Theme Park (video game) 29 thermostats 73, 79, 80 ‘three wise men’ puzzle 246–7 Tojan Room, Cambridge University 69–70 ‘tortoises’ (robots) 74–7 toys 123–5 traffic congestion 90–1 transhumanists 205 transistors 16–17 Transits – Into an abyss (musical composition) 168 
translation 8–9, 11, 62–3, 155, 225 Turing, Alan 3, 13–17, 28, 35, 102, 105–6, 227, 232 Turing Test 15, 101–7, 229, 232 tutors, remote 160–1 TV, smart 80, 82 Twitter 153–4 ‘ubiquitous computing’ 91–4 unemployment, technological 139–50, 163, 225, 255 universal micropayment system 156 Universal Turing Machine 15–16 Ursache, Marius 193–7, 203–4, 207 vacuum cleaners, robotic 75–7, 234, 236 video games 28–9, 35–7, 151–2, 186–7, 194, 197 Vinge, Vernor 229–30 virtual assistants 107–32, 225–6, 240–1 characteristics 126–8 falling in love with 122–4 political 119–22 proactive 116–18 therapeutic 128–31 voices 124–126, 127–8 Viv Labs 132 Vladeck, David 242–4 ‘vloggers’ 151–2 von Neumann, John 13–14, 17, 100, 229 Voxta (AI assistant) 119–20 waiter drones 141 ‘Walking Cities’ 89–90 Walter, William Grey 74–7 Warwick, Kevin 65–6 Watson (Blue J) 138–9, 162, 189–92 Waze 90–91, 126 weapons 14, 17, 72, 224–5, 234–5, 247, 255–6 ‘wetware’ 208 Wevorce 145 Wiener, Norbert 72–3, 227 Winston, Patrick 49–50 Wofram Alpha tool 108–9 Wozniak, Steve 35, 114 X.ai 116–17 Xbox 360, Kinect device 114 XCoffee 70 XCON (expert system) 31 Xiaoice 129, 130 YouTube 151 Yudkowsky, Eliezer 237–8 Zuckerberg, Mark 7, 107–8, 230–1, 254–5 Acknowledgments WRITING A BOOK is always a bit of a solitary process.


pages: 561 words: 120,899

The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant From Two Centuries of Controversy by Sharon Bertsch McGrayne

Bayesian statistics, bioinformatics, British Empire, Claude Shannon: information theory, Daniel Kahneman / Amos Tversky, double helix, Edmond Halley, Fellow of the Royal Society, full text search, Henri Poincaré, Isaac Newton, Johannes Kepler, John Markoff, John Nash: game theory, John von Neumann, linear programming, longitudinal study, meta analysis, meta-analysis, Nate Silver, p-value, Pierre-Simon Laplace, placebo effect, prediction markets, RAND corporation, recommendation engine, Renaissance Technologies, Richard Feynman, Richard Feynman: Challenger O-ring, Robert Mercer, Ronald Reagan, speech recognition, statistical model, stochastic process, Thomas Bayes, Thomas Kuhn: the structure of scientific revolutions, traveling salesman, Turing machine, Turing test, uranium enrichment, Yom Kippur War

Pouget A et al. (2009) Neural Computations as Laplacian (or is it Bayesian?) probabilistic inference. In draft. Quatse JT, Najmi A. (2007) Empirical Bayesian targeting. Proceedings, 2007 World Congress in Computer Science, Computer Engineering, and Applied Computing, June 25–28, 2007. Schafer JB, Konstan J, Riedl J. (1999) Recommender systems in E-commerce. In ACM Conference on Electronic Commerce (EC-99) 158–66. Schafer JB, Konstan J, Riedl J. (2001) Recommender systems in E-commerce. Data Mining and Knowledge Discovery (5) 115–53. Schneider, Stephen H. (2005) The Patient from Hell. Perseus Books. Spolsky, Joel. (2005) (http://www.joelonsoftware.com/items/2005/10/17.html). Swinburne, Richard, ed. (2002) Bayes’s Theorem. Oxford University Press. Taylor BL et al. (2000) Incorporating uncertainty into management models for marine mammals.

Users refine their own filters by reading low-scoring messages and either keeping them or sending them to trash and junk files. This use of Bayesian optimal classifiers is similar to the technique used by Frederick Mosteller and David Wallace to determine who wrote certain Federalist papers. Bayesian theory is firmly embedded in Microsoft’s Windows operating system. In addition, a variety of Bayesian techniques are involved in Microsoft’s handwriting recognition; recommender systems; the question-answering box in the upper right corner of a PC’s monitor screen; a data-mining software package for tracking business sales; a program that infers the applications that users will want and preloads them before they are requested; and software to make traffic jam predictions for drivers to check before their commute. Bayes was blamed—unfairly, say Heckerman and Horvitz—for Microsoft’s memorably annoying paperclip, Clippy.
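The Bayesian mail filtering described at the start of this passage can be sketched as a tiny naive Bayes scorer: combine per-word spam/ham likelihoods with class priors learned from labelled messages. A minimal sketch, assuming an invented toy corpus (none of this comes from the book):

```python
import math
from collections import Counter

# Invented toy corpus: three spam and three legitimate ("ham") messages.
spam = ["win cash now", "cheap pills win big", "claim your cash prize"]
ham = ["meeting moved to monday", "please review the draft", "lunch on friday"]

def word_counts(messages):
    counts = Counter()
    for message in messages:
        counts.update(message.split())
    return counts

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocab_size = len(set(spam_counts) | set(ham_counts))

def log_likelihood(message, counts):
    # Laplace-smoothed log P(words | class); unseen words get a small probability.
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in message.split())

def spam_probability(message):
    prior_spam = len(spam) / (len(spam) + len(ham))
    log_spam = math.log(prior_spam) + log_likelihood(message, spam_counts)
    log_ham = math.log(1 - prior_spam) + log_likelihood(message, ham_counts)
    # Normalise the two joint scores into P(spam | message).
    return 1 / (1 + math.exp(log_ham - log_spam))

print(spam_probability("win a cash prize now"))        # high score -> junk folder
print(spam_probability("review the draft on monday"))  # low score -> inbox
```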

As the e-commerce refrain goes, “If you liked this book/song/movie, you’ll like that one too.” The updating used in machine learning does not necessarily follow Bayes’ theorem formally but “shares its perspective.” A $1 million contest sponsored by Netflix.com illustrates the prominent role of Bayesian concepts in modern e-commerce and learning theory. In 2006 the online film-rental company launched a search for the best recommender system to improve its own algorithm. More than 50,000 contestants from 186 countries vied over the four years of the competition. The AT&T Labs team organized around Yehuda Koren, Christopher T. Volinsky, and Robert M. Bell won the prize in September 2009. Interestingly, although no contestants questioned Bayes as a legitimate method, almost none wrote a formal Bayesian model. The winning group relied on empirical Bayes but estimated the initial priors according to their frequencies.
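The "empirical Bayes with priors estimated from frequencies" idea can be illustrated with a damped movie mean: a title with few ratings is shrunk toward the global average, with the damping constant acting as the strength of the prior. The ratings and the constant k below are made-up illustrative values:

```python
# movie -> observed star ratings (illustrative numbers only)
ratings = {
    "well_rated": [5, 4, 5, 4, 5, 4, 5],
    "one_rating": [5],
}

all_ratings = [r for rs in ratings.values() for r in rs]
global_mean = sum(all_ratings) / len(all_ratings)
k = 5  # prior strength: behaves like k pseudo-ratings at the global mean

for movie, rs in ratings.items():
    raw_mean = sum(rs) / len(rs)
    shrunk = (sum(rs) + k * global_mean) / (len(rs) + k)  # empirical-Bayes style damped mean
    print(f"{movie}: raw {raw_mean:.2f} -> shrunk {shrunk:.2f}")
```

The single five-star rating is pulled much closer to the overall average than the well-sampled movie, which is the behaviour the prior is there to provide.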


pages: 265 words: 74,000

The Numerati by Stephen Baker

Berlin Wall, Black Swan, business process, call centre, correlation does not imply causation, Drosophila, full employment, illegal immigration, index card, Isaac Newton, job automation, job satisfaction, McMansion, Myron Scholes, natural language processing, PageRank, personalized medicine, recommendation engine, RFID, Silicon Valley, Skype, statistical model, Watson beat the top human players on Jeopardy!

It will simply issue alerts when it detects changes in patterns and perhaps urge the user to schedule a medical appointment. It will be up to doctors and nurses to follow up, figuring out why someone is limping or swaying differently at the kitchen sink. But in time, these systems will have enough feedback from thousands of users that they should be able to point people—either doctors or patients—to the most probable cause. In this way, they will work like the recommendation engines on Netflix or Amazon.com, which point people toward books or movies that are popular among customers with similar patterns. (Amazon and Netflix, of course, don't always get it right, and neither will the analysis issuing from the magic carpet. It will only point caregivers toward statistically probable causes.) Dishman's team has installed magic carpets in the homes of people with neurological disorders or a history of falling.


pages: 231 words: 71,248

Shipping Greatness by Chris Vander Mey

corporate raider, don't be evil, en.wikipedia.org, fudge factor, Google Chrome, Google Hangouts, Gordon Gekko, Jeff Bezos, Kickstarter, Lean Startup, minimum viable product, performance metric, recommendation engine, Skype, slashdot, sorting algorithm, source of truth, Steve Jobs, Superbowl ad, web application

We chose to focus initially on professionals because while teens and tweens have time to spend on Facebook and YouTube, professionals have less time but also have rich networks and strong opinions—not to mention disposable capital to spend on content. Using IMDb’s unique collection of movie data and Amazon’s ability to distribute digital content and proven personalization tools, we will uniquely solve the content discovery problem by integrating these technologies and building unique suggestion algorithms. Unlike competitors such as Netflix, who already have a recommendations engine, we’ll integrate across all video sources and use our richer data to provide more interesting in-viewing experiences and more accurate recommendations. We will deliver these in-viewing experiences through platforms that can expose contextually relevant data (e.g., the cast of a YouTube video), such as a browser plug-in for YouTube and mobile applications for phones. We can also enlighten viewers by providing rich information about the content they are consuming, and prompt for feedback—creating a virtuous cycle in which all users benefit.


pages: 260 words: 76,223

Ctrl Alt Delete: Reboot Your Business. Reboot Your Life. Your Future Depends on It. by Mitch Joel

3D printing, Amazon Web Services, augmented reality, call centre, clockwatching, cloud computing, Firefox, future of work, ghettoisation, Google Chrome, Google Glasses, Google Hangouts, Khan Academy, Kickstarter, Kodak vs Instagram, Lean Startup, Marc Andreessen, Mark Zuckerberg, Network effects, new economy, Occupy movement, place-making, prediction markets, pre–internet, QR code, recommendation engine, Richard Florida, risk tolerance, self-driving car, Silicon Valley, Silicon Valley startup, Skype, social graph, social web, Steve Jobs, Steve Wozniak, Thomas L Friedman, Tim Cook: Apple, Tony Hsieh, white picket fence, WikiLeaks, zero-sum game

In fact, it’s actually very squiggly. Always bear that in mind. Embrace the squiggle. THE REALITY OF CAREER CHOICES IN A CTRL ALT DELETE WORLD. You can contrast the fictional story above with the tale of a friend of mine. This individual was never really sure what she wanted to do. There was no clear desire or talent in a single area of interest. In her final years of high school, a guidance counselor recommended engineering or the sciences because she had above-average math grades. So my friend studied engineering through university and squeaked by. Never passionate about it, she got her diploma and entered the workforce. I had lunch with her a while back and she confessed that she was miserable because of her work but could not figure out why. She had followed all the rules; she did okay in school, she advanced in a field that typically enables you to be both employable and well paid.


pages: 229 words: 72,431

Shadow Work: The Unpaid, Unseen Jobs That Fill Your Day by Craig Lambert

airline deregulation, Asperger Syndrome, banking crisis, Barry Marshall: ulcers, big-box store, business cycle, carbon footprint, cashless society, Clayton Christensen, cognitive dissonance, collective bargaining, Community Supported Agriculture, corporate governance, crowdsourcing, disintermediation, disruptive innovation, financial independence, Galaxy Zoo, ghettoisation, gig economy, global village, helicopter parent, IKEA effect, industrial robot, informal economy, Jeff Bezos, job automation, John Maynard Keynes: Economic Possibilities for our Grandchildren, Mark Zuckerberg, new economy, pattern recognition, plutocrats, Plutocrats, recommendation engine, Schrödinger's Cat, Silicon Valley, single-payer health, statistical model, Thorstein Veblen, Turing test, unpaid internship, Vanguard fund, Vilfredo Pareto, zero-sum game, Zipcar

Routinely, businesses now ask shadow-working customers to cough up personal information as a way to smooth transactions, or even enable them to buy things at all. To make online purchases, customers open accounts with bookstores, banks, newspapers, utilities, sports teams, apparel vendors, phone service providers, and so on. Everyone wants you to open an account. This means supplying contact and demographic data and then having all transactions tracked, building a personal profile for the vendor. That profile enables vendors to activate “recommendation engines.” Once its algorithms have examined your past purchases, Amazon can recommend books or desk lamps you might like, and Netflix can suggest movies to rent. On my computer, opening Amazon.com brings up thumbnails of books by Bill Bryson, an author whose works I have purchased, and books on pharmaceutical companies, a topic I’ve browsed. I also see displays of clock radios, pressure cookers, and Egyptian cotton towels—gifts I’ve bought from Amazon.


pages: 326 words: 74,433

Do More Faster: TechStars Lessons to Accelerate Your Startup by Brad Feld, David Cohen

augmented reality, computer vision, corporate governance, crowdsourcing, disintermediation, hiring and firing, Inbox Zero, Jeff Bezos, Kickstarter, knowledge worker, Lean Startup, Ray Kurzweil, recommendation engine, risk tolerance, Silicon Valley, Skype, slashdot, social web, software as a service, Steve Jobs

—thehighwaygirl.com Travelfli (2008)—Now UsingMiles, helps frequent flyers maximize the full potential of their loyalty programs.—usingmiles.com TutuorialTab (2010)—lets companies make their web site more learnable.—tutorialtab.com Usermojo (2010)—is an emotion analytics platform that tells you why users do what they do.—usermojo.com Vanilla (2009)—is open source forum software.—vanillaforums.com Villij (2007)—is a recommendation engine for people.—villij.com Vacation Rental Partner (2010)—makes it easy to generate revenue from a second home. We offer tools that eliminate the need for traditional property management companies.—vacationrentalpartner.com TechStars companies funded after publication are listed on the TechStars web site. About the Authors Brad Feld is a co-founder and managing director at Foundry Group, an early stage venture capital firm, and a co-founder of TechStars.


pages: 252 words: 72,473

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O'Neil

Affordable Care Act / Obamacare, Bernie Madoff, big data - Walmart - Pop Tarts, call centre, carried interest, cloud computing, collateralized debt obligation, correlation does not imply causation, Credit Default Swap, credit default swaps / collateralized debt obligations, crowdsourcing, Emanuel Derman, housing crisis, I will remember that I didn’t make the world, and it doesn’t satisfy my equations, illegal immigration, Internet of things, late fees, mass incarceration, medical bankruptcy, Moneyball by Michael Lewis explains big data, new economy, obamacare, Occupy movement, offshore financial centre, payday loans, peer-to-peer lending, Peter Thiel, Ponzi scheme, prediction markets, price discrimination, quantitative hedge fund, Ralph Nader, RAND corporation, recommendation engine, Rubik’s Cube, Sharpe ratio, statistical model, Tim Cook: Apple, too big to fail, Unsafe at Any Speed, Upton Sinclair, Watson beat the top human players on Jeopardy!, working poor

Investors, of course, feast on these returns and shower WMD companies with more money. And the victims? Well, an internal data scientist might say, no statistical system can be perfect. Those folks are collateral damage. And often, like Sarah Wysocki, they are deemed unworthy and expendable. Forget about them for a minute, they might say, and focus on all the people who get helpful suggestions from recommendation engines or who find music they love on Pandora, the ideal job on LinkedIn, or perhaps the love of their life on Match.​com. Think of the astounding scale, and ignore the imperfections. Big Data has plenty of evangelists, but I’m not one of them. This book will focus sharply in the other direction, on the damage inflicted by WMDs and the injustice they perpetuate. We will explore harmful examples that affect people at critical life moments: going to college, borrowing money, getting sentenced to prison, or finding and holding a job.


pages: 1,535 words: 337,071

Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley, Jon Kleinberg

Albert Einstein, AltaVista, clean water, conceptual framework, Daniel Kahneman / Amos Tversky, Douglas Hofstadter, Erdős number, experimental subject, first-price auction, fudge factor, George Akerlof, Gerard Salton, Gerard Salton, Gödel, Escher, Bach, incomplete markets, information asymmetry, information retrieval, John Nash: game theory, Kenneth Arrow, longitudinal study, market clearing, market microstructure, moral hazard, Nash equilibrium, Network effects, Pareto efficiency, Paul Erdős, planetary scale, prediction markets, price anchoring, price mechanism, prisoner's dilemma, random walk, recommendation engine, Richard Thaler, Ronald Coase, sealed-bid auction, search engine result page, second-price auction, second-price sealed-bid, Simon Singh, slashdot, social web, Steve Jobs, stochastic process, Ted Nelson, The Market for Lemons, The Wisdom of Crowds, trade route, transaction costs, ultimatum game, Vannevar Bush, Vickrey auction, Vilfredo Pareto, Yogi Berra, zero-sum game

We will consider both of these settings in this chapter. Ideas from the theory of voting have been adopted in a number of recent on-line applications [139]. Different Web search engines produce different rankings of results; a line of work on meta-search has developed tools for combining these rankings into a single aggregate ranking. Recommendation systems for books, music, and other items — such as Amazon’s product-recommendation system — have employed related ideas for aggregating preferences. In this case, a recommendation system determines a set of users whose past history indicates tastes similar to yours, and then uses voting methods to combine the preferences of these other users to produce a ranked list of recommendations (or a single best recommendation) for you. Note that in this case, the goal is not a single aggregate ranking for the whole population, but instead an aggregate ranking for each user, based on the preferences of similar users.
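One concrete way to "combine the preferences of these other users" is a positional voting rule such as a Borda count over the neighbours' ranked lists; the neighbour rankings below are invented for illustration:

```python
from collections import defaultdict

# Ranked lists (best first) from three users judged similar to the target user (invented).
neighbour_rankings = [
    ["book_a", "book_b", "book_c"],
    ["book_b", "book_a", "book_d"],
    ["book_a", "book_d", "book_b"],
]

scores = defaultdict(int)
for ranking in neighbour_rankings:
    for position, item in enumerate(ranking):
        # Borda-style points: the top of a list of n items earns n, the next n - 1, and so on.
        scores[item] += len(ranking) - position

aggregate = sorted(scores, key=scores.get, reverse=True)
print(aggregate)  # aggregate ranking shown to the target user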

For example, reputation systems and trust systems enable users to provide signals about the behavior — and misbehavior — of other users. We discussed such systems in the context of structural balance in Chapter 5, and will see their role in providing information essential to the functioning of on-line markets in Chapter 22. Web 2.0 sites also make use of recommendations systems, to guide users toward items that they may not know about. In addition to serving as helpful features for a site’s users, such recommendation systems interact in complex but important ways with distributions of popularity and the long tail of niche content, as we will see in Chapter 18. The development of the current generation of Web search engines, led by Google, is sometimes seen as a crucial step in the pivot from the early days of the Web to the era of Web 2.0.

. . . . . . . . . . . . . . . . 299 10.6 Advanced Material: A Proof of the Matching Theorem . . . . . . . . . . . . 300 10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 11 Network Models of Markets with Intermediaries 319 11.1 Price-Setting in Markets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319 11.2 A Model of Trade on Networks . . . . . . . . . . . . . . . . . . . . . . . . . 323 11.3 Equilibria in Trading Networks . . . . . . . . . . . . . . . . . . . . . . . . . 330 11.4 Further Equilibrium Phenomena: Auctions and Ripple Effects . . . . . . . . 334 11.5 Social Welfare in Trading Networks . . . . . . . . . . . . . . . . . . . . . . . 338 11.6 Trader Profits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339 11.7 Reflections on Trade with Intermediaries . . . . . . . . . . . . . . . . . . . . 342 11.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342 12 Bargaining and Power in Networks 347 12.1 Power in Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347 12.2 Experimental Studies of Power and Exchange . . . . . . . . . . . . . . . . . 350 12.3 Results of Network Exchange Experiments . . . . . . . . . . . . . . . . . . . 352 12.4 A Connection to Buyer-Seller Networks . . . . . . . . . . . . . . . . . . . . . 356 12.5 Modeling Two-Person Interaction: The Nash Bargaining Solution . . . . . . 357 12.6 Modeling Two-Person Interaction: The Ultimatum Game . . . . . . . . . . . 360 12.7 Modeling Network Exchange: Stable Outcomes . . . . . . . . . . . . . . . . 362 12.8 Modeling Network Exchange: Balanced Outcomes . . . . . . . . . . . . . . . 366 12.9 Advanced Material: A Game-Theoretic Approach to Bargaining . . . . . . . 369 12.10Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 IV Information Networks and the World Wide Web 381 13 The Structure of the Web 383 13.1 The World Wide Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384 13.2 Information Networks, Hypertext, and Associative Memory . . . . . . . . . . 386 13.3 The Web as a Directed Graph . . . . . . . . . . . . . . . . . . . . . . . . . . 394 13.4 The Bow-Tie Structure of the Web . . . . . . . . . . . . . . . . . . . . . . . 397 13.5 The Emergence of Web 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400 13.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403 6 CONTENTS 14 Link Analysis and Web Search 405 14.1 Searching the Web: The Problem of Ranking . . . . . . . . . . . . . . . . . . 405 14.2 Link Analysis using Hubs and Authorities . . . . . . . . . . . . . . . . . . . 407 14.3 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 14.4 Applying Link Analysis in Modern Web Search . . . . . . . . . . . . . . . . 420 14.5 Applications beyond the Web . . . . . . . . . . . . . . . . . . . . . . . . . . 423 14.6 Advanced Material: Spectral Analysis, Random Walks, and Web Search . . . 425 14.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437 15 Sponsored Search Markets 445 15.1 Advertising Tied to Search Behavior . . . . . . . . . . . . . . . . . . . . . . 445 15.2 Advertising as a Matching Market . . . . . . . . . . . . . . . . . . . . . . . . 448 15.3 Encouraging Truthful Bidding in Matching Markets: The VCG Principle . . 452 15.4 Analyzing the VCG Procedure: Truth-Telling as a Dominant Strategy . . . . 
457 15.5 The Generalized Second Price Auction . . . . . . . . . . . . . . . . . . . . . 460 15.6 Equilibria of the Generalized Second Price Auction . . . . . . . . . . . . . . 464 15.7 Ad Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467 15.8 Complex Queries and Interactions Among Keywords . . . . . . . . . . . . . 469 15.9 Advanced Material: VCG Prices and the Market-Clearing Property . . . . . 470 15.10Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486 V Network Dynamics: Population Models 489 16 Information Cascades 491 16.1 Following the Crowd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491 16.2 A Simple Herding Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 493 16.3 Bayes’ Rule: A Model of Decision-Making Under Uncertainty . . . . . . . . . 497 16.4 Bayes’ Rule in the Herding Experiment . . . . . . . . . . . . . . . . . . . . . 502 16.5 A Simple, General Cascade Model . . . . . . . . . . . . . . . . . . . . . . . . 504 16.6 Sequential Decision-Making and Cascades . . . . . . . . . . . . . . . . . . . 508 16.7 Lessons from Cascades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511 16.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513 17 Network Effects 517 17.1 The Economy Without Network Effects . . . . . . . . . . . . . . . . . . . . . 518 17.2 The Economy with Network Effects . . . . . . . . . . . . . . . . . . . . . . . 522 17.3 Stability, Instability, and Tipping Points . . . . . . . . . . . . . . . . . . . . 525 17.4 A Dynamic View of the Market . . . . . . . . . . . . . . . . . . . . . . . . . 527 17.5 Industries with Network Goods . . . . . . . . . . . . . . . . . . . . . . . . . 534 17.6 Mixing Individual Effects with Population-Level Effects . . . . . . . . . . . . 536 17.7 Advanced Material: Negative Externalities and The El Farol Bar Problem . 541 17.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549 CONTENTS 7 18 Power Laws and Rich-Get-Richer Phenomena 553 18.1 Popularity as a Network Phenomenon . . . . . . . . . . . . . . . . . . . . . . 553 18.2 Power Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555 18.3 Rich-Get-Richer Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557 18.4 The Unpredictability of Rich-Get-Richer Effects . . . . . . . . . . . . . . . . 559 18.5 The Long Tail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561 18.6 The Effect of Search Tools and Recommendation Systems . . . . . . . . . . . 564 18.7 Advanced Material: Analysis of Rich-Get-Richer Processes . . . . . . . . . . 565 18.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569 VI Network Dynamics: Structural Models 571 19 Cascading Behavior in Networks 573 19.1 Diffusion in Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573 19.2 Modeling Diffusion through a Network . . . . . . . . . . . . . . . . . . . . . 575 19.3 Cascades and Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583 19.4 Diffusion, Thresholds, and the Role of Weak Ties . . . . . . . . . . . . . . . 588 19.5 Extensions of the Basic Cascade Model . . . . . . . . . . . . . . . . . . . . . 590 19.6 Knowledge, Thresholds, and Collective Action . . . . . . . . . . . . . . . . . 593 19.7 Advanced Material: The Cascade Capacity . . . . . . . . . . . . . . . . . . . 597 19.8 Exercises . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 20 The Small-World Phenomenon 621 20.1 Six Degrees of Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621 20.2 Structure and Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 20.3 Decentralized Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626 20.4 Modeling the Process of Decentralized Search . . . . . . . . . . . . . . . . . 629 20.5 Empirical Analysis and Generalized Models . . . . . . . . . . . . . . . . . . 632 20.6 Core-Periphery Structures and Difficulties in Decentralized Search . . . . . . 638 20.7 Advanced Material: Analysis of Decentralized Search . . . . . . . . . . . . . 640 20.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652 21 Epidemics 655 21.1 Diseases and the Networks that Transmit Them . . . . . . . . . . . . . . . . 655 21.2 Branching Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657 21.3 The SIR Epidemic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660 21.4 The SIS Epidemic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666 21.5 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669 21.6 Transient Contacts and the Dangers of Concurrency . . . . . . . . . . . . . . 672 21.7 Genealogy, Genetic Inheritance, and Mitochondrial Eve . . . . . . . . . . . . 676 21.8 Advanced Material: Analysis of Branching and Coalescent Processes . . . . . 682 21.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695 8 CONTENTS VII Institutions and Aggregate Behavior 699 22 Markets and Information 701 22.1 Markets with Exogenous Events . . . . . . . . . . . . . . . . . . . . . . . . . 702 22.2 Horse Races, Betting, and Beliefs . . . . . . . . . . . . . . . . . . . . . . . . 704 22.3 Aggregate Beliefs and the “Wisdom of Crowds” . . . . . . . . . . . . . . . . 710 22.4 Prediction Markets and Stock Markets . . . . . . . . . . . . . . . . . . . . . 714 22.5 Markets with Endogenous Events . . . . . . . . . . . . . . . . . . . . . . . . 717 22.6 The Market for Lemons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719 22.7 Asymmetric Information in Other Markets . . . . . . . . . . . . . . . . . . . 724 22.8 Signaling Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728 22.9 Quality Uncertainty On-Line: Reputation Systems and Other Mechanisms . 729 22.10Advanced Material: Wealth Dynamics in Markets . . . . . . . . . . . . . . . 732 22.11Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740 23 Voting 745 23.1 Voting for Group Decision-Making . . . . . . . . . . . . . . . . . . . . . . . 745 23.2 Individual Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747 23.3 Voting Systems: Majority Rule . . . . . . . . . . . . . . . . . . . . . . . . . 750 23.4 Voting Systems: Positional Voting . . . . . . . . . . . . . . . . . . . . . . . . 755 23.5 Arrow’s Impossibility Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 758 23.6 Single-Peaked Preferences and the Median Voter Theorem . . . . . . . . . . 760 23.7 Voting as a Form of Information Aggregation . . . . . . . . . . . . . . . . . . 766 23.8 Insincere Voting for Information Aggregation . . . . . . . . . . . . . . . . . . 768 23.9 Jury Decisions and the Unanimity Rule . . . . . . . . . . . . . . . . . . . . . 
771 23.10Sequential Voting and the Relation to Information Cascades . . . . . . . . . 776 23.11Advanced Material: A Proof of Arrow’s Impossibility Theorem . . . . . . . . 777 23.12Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 782 24 Property Rights 785 24.1 Externalities and the Coase Theorem . . . . . . . . . . . . . . . . . . . . . . 785 24.2 The Tragedy of the Commons . . . . . . . . . . . . . . . . . . . . . . . . . . 790 24.3 Intellectual Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793 24.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796 Chapter 1 Overview Over the past decade there has been a growing public fascination with the complex “connectedness” of modern society.


Beautiful Visualization by Julie Steele

barriers to entry, correlation does not imply causation, data acquisition, database schema, Drosophila, en.wikipedia.org, epigenetics, global pandemic, Hans Rosling, index card, information retrieval, iterative process, linked data, Mercator projection, meta analysis, meta-analysis, natural language processing, Netflix Prize, pattern recognition, peer-to-peer, performance metric, QR code, recommendation engine, semantic web, social graph, sorting algorithm, Steve Jobs, web application, wikimedia commons

Preference Similarity A well-known measure of similarity used in many recommendation systems is cosine similarity. A practical introduction to this technique can be found in Linden, Smith, and York (2003). In the case of movies, intuitively, the measure indicates that two movies are similar if users who rated one highly rated the other highly or, conversely, users who rated one poorly rated the other poorly. We’ll use this similarity measure to generate similarity data for all 17,700 movies in the Netflix Prize dataset, then generate coordinates based on that data. If we were interested in building an actual movie recommender system, we might do so simply by recommending the movies that were similar to those a user had rated highly. However, the goal here is just to gain insight into the dynamics of such a recommender system. Labeling The YELLOWPAGES.COM visualization was easier to label than this Netflix Prize visualization for a number of reasons, including fewer nodes and shorter labels, but mostly because the nodes were more uniformly distributed.
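As a minimal sketch of the measure itself (not of the authors' pipeline), cosine similarity between two movies can be computed from the ratings of the users who rated both; the toy ratings dictionary below is invented, and a real system would work over the full sparse rating matrix:

```python
import math

# ratings[user][movie] = stars given (invented toy data)
ratings = {
    "u1": {"m1": 5, "m2": 4},
    "u2": {"m1": 4, "m2": 5, "m3": 1},
    "u3": {"m1": 1, "m3": 5},
}

def cosine_similarity(movie_a, movie_b):
    # Restrict to users who rated both movies, then take the cosine of the two rating vectors.
    users = [u for u, r in ratings.items() if movie_a in r and movie_b in r]
    if not users:
        return 0.0
    a = [ratings[u][movie_a] for u in users]
    b = [ratings[u][movie_b] for u in users]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity("m1", "m2"))  # rated alike by shared users -> high
print(cosine_similarity("m1", "m3"))  # rated oppositely -> lower
```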


pages: 201 words: 21,180

Designing for the Social Web by Joshua Porter

barriers to entry, en.wikipedia.org, endowment effect, Howard Rheingold, late fees, Marc Andreessen, Mark Zuckerberg, Milgram experiment, Paul Buchheit, Ralph Waldo Emerson, recommendation engine, social software, social web, Steve Jobs, web application, zero-sum game

Del.icio.us simply counts the number of bookmarks that people have saved in the last x hours and orders them from most popular to least popular, displaying as a “most popular” list of bookmarks that people have saved recently. • Participant ranking. The Digg Top Diggers page was a ranking system that took into account measures of desired behavior to come up with an overall rank for each Digger. • Collaborative filtering. Netflix’s recommendation system relies on collaborative filtering to display recommended movies based on your previous ratings. • Relevance. Services like Google rely on a complex algorithm to determine what to display. Figuring out which content is relevant is a big deal to Google—it’s the core value of the entire service. • Social. Social network sites like Slideshare and Flickr display content based on who it is from.
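The first of these schemes, counting recent saves and sorting, is about as simple as ranking gets. A rough sketch with invented bookmark events and a 24-hour window:

```python
import time
from collections import Counter

now = time.time()
window = 24 * 3600  # "the last x hours", here 24

# (url, timestamp-of-save) pairs representing individual bookmark saves (invented).
saves = [
    ("example.com/a", now - 1800),
    ("example.com/b", now - 7200),
    ("example.com/a", now - 3600),
    ("example.com/c", now - 90000),  # older than the window, so it is ignored
]

recent = Counter(url for url, saved_at in saves if now - saved_at <= window)
print([url for url, _ in recent.most_common()])  # most-saved recent bookmarks first
```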

See also Netflix Movies For You screen, Netflix, 105–106 MSN Groups, 122 MSNBC.com, 157–158 MusicLab study, 137–139 MySpace, 13, 16, 18, 119 N nature vs. nurture debate, 8 navigation, non-linear, 171–172 Neeleman, David, 61, 62 negative feedback, 57–62, 139. See also feedback Netflix collaborative filtering of ratings on, 136 as example of complex adaptive system, 128, 129 as example of successful social object, 32 goals/activities/tasks for, 27 “How It Works” graphic, 73–74 Movies For You screen, 105–106 primary activity for, 26 recommendation system, 136 Netvibes, 92–93 network value, 24 networked world, designing for, viii New York Times most-shared articles screen, 160–161 sharing call to action, 149, 150–151, 152 Newmark, Craig, 51, 54 news feed blowup, Facebook, 116–118 news sites, 17, 133, 136 Newsvine, 153 Nielsen/NetRatings, 20 Nike+, 17 non-interactive entertainment, vii–viii non-linear navigation, 171–172 Norman, Dan, 25 notifications feature, 104 nytimes.com, 149.


pages: 660 words: 141,595

Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking by Foster Provost, Tom Fawcett

Albert Einstein, Amazon Mechanical Turk, big data - Walmart - Pop Tarts, bioinformatics, business process, call centre, chief data officer, Claude Shannon: information theory, computer vision, conceptual framework, correlation does not imply causation, crowdsourcing, data acquisition, David Brooks, en.wikipedia.org, Erik Brynjolfsson, Gini coefficient, information retrieval, intangible asset, iterative process, Johann Wolfgang von Goethe, Louis Pasteur, Menlo Park, Nate Silver, Netflix Prize, new economy, p-value, pattern recognition, placebo effect, price discrimination, recommendation engine, Ronald Coase, selection bias, Silicon Valley, Skype, speech recognition, Steve Jobs, supply-chain management, text mining, The Signal and the Noise by Nate Silver, Thomas Bayes, transaction costs, WikiLeaks

At the time of this writing, discussions of data science commonly mention not just analytical skills and techniques for understanding data but popular tools used. Definitions of data scientists (and advertisements for positions) specify not just areas of expertise but also specific programming languages and tools. It is common to see job advertisements mentioning data mining techniques (e.g., random forests, support vector machines), specific application areas (recommendation systems, ad placement optimization), alongside popular software tools for processing big data (Hadoop, MongoDB). There is often little distinction between the science and the technology for dealing with large datasets. We must point out that data science, like computer science, is a young field. The particular concerns of data science are fairly new and general principles are just beginning to emerge.

For example, analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce much more frequently than we might expect. Deciding how to act upon this discovery might require some creativity, but it could suggest a special promotion, product display, or combination offer. Co-occurrence of products in purchases is a common type of grouping known as market-basket analysis. Some recommendation systems also perform a type of affinity grouping by finding, for example, pairs of books that are purchased frequently by the same people (“people who bought X also bought Y”). The result of co-occurrence grouping is a description of items that occur together. These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is. Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population.
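A market-basket co-occurrence pass can be sketched by counting how often two items land in the same basket and comparing that with what independence would predict (lift). The baskets below are invented:

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets.
baskets = [
    {"ground meat", "hot sauce", "buns"},
    {"ground meat", "hot sauce"},
    {"milk", "buns"},
    {"ground meat", "buns"},
]

n = len(baskets)
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(frozenset(pair) for basket in baskets
                      for pair in combinations(sorted(basket), 2))

for pair, count in pair_counts.most_common(3):
    a, b = sorted(pair)
    support = count / n
    # Lift > 1 means the pair co-occurs more often than independent purchases would predict.
    lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
    print(f"{a} + {b}: support={support:.2f}, lift={lift:.2f}")
```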

Maximizing simple prediction accuracy is usually not an appropriate goal. If that’s what our algorithm is doing, we’re using the wrong algorithm. For regression problems we have a directly analogous baseline: predict the average value over the population (usually the mean or median). In some applications there are multiple simple averages that one may want to combine. For example, when evaluating recommender systems that internally predict how many “stars” a particular customer would give to a particular movie, we have the average number of stars a movie gets across the population (how well liked it is) and the average number of stars a particular customer gives to movies (what that customer’s overall bias is). A simple prediction based on these two may do substantially better than using one or the other in isolation.
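That two-average baseline is often written as the prediction mu + b_u + b_i: the global mean plus a user offset plus an item offset. A minimal sketch with made-up ratings:

```python
# (user, movie, stars) tuples; the numbers are made up.
ratings = [
    ("alice", "m1", 5), ("alice", "m2", 4),
    ("bob", "m1", 3), ("bob", "m2", 2), ("bob", "m3", 1),
]

mu = sum(stars for _, _, stars in ratings) / len(ratings)  # global mean

def user_bias(user):
    rs = [stars for u, _, stars in ratings if u == user]
    return sum(rs) / len(rs) - mu  # how generous this user is relative to everyone

def item_bias(movie):
    rs = [stars for _, m, stars in ratings if m == movie]
    return sum(rs) / len(rs) - mu  # how well liked this movie is relative to the average

def baseline(user, movie):
    # Predicted stars = global mean + user bias + movie bias.
    return mu + user_bias(user) + item_bias(movie)

print(round(baseline("alice", "m3"), 2))  # a generous user meets an unpopular movie
```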


pages: 265 words: 69,310

What's Yours Is Mine: Against the Sharing Economy by Tom Slee

4chan, Airbnb, Amazon Mechanical Turk, asset-backed security, barriers to entry, Berlin Wall, big-box store, bitcoin, blockchain, citizen journalism, collaborative consumption, congestion charging, Credit Default Swap, crowdsourcing, data acquisition, David Brooks, don't be evil, gig economy, Hacker Ethic, income inequality, informal economy, invisible hand, Jacob Appelbaum, Jane Jacobs, Jeff Bezos, Khan Academy, Kibera, Kickstarter, license plate recognition, Lyft, Marc Andreessen, Mark Zuckerberg, move fast and break things, move fast and break things, natural language processing, Netflix Prize, Network effects, new economy, Occupy movement, openstreetmap, Paul Graham, peer-to-peer, peer-to-peer lending, Peter Thiel, pre–internet, principal–agent problem, profit motive, race to the bottom, Ray Kurzweil, recommendation engine, rent control, ride hailing / ride sharing, sharing economy, Silicon Valley, Snapchat, software is eating the world, South of Market, San Francisco, TaskRabbit, The Nature of the Firm, Thomas L Friedman, transportation-network company, Travis Kalanick, Uber and Lyft, Uber for X, uber lyft, ultimatum game, urban planning, WikiLeaks, winner-take-all economy, Y Combinator, Zipcar

This meant everyone using the system would pretty quickly develop a relevant ‘reputation’ visible to everyone else in the system.” 2 Friedman was writing just a couple of weeks after his New York Times stablemate David Brooks described “How Airbnb and Lyft Finally Got Americans to Trust Each Other”: “Companies like Airbnb establish trust through ratings mechanisms . . . People in the Airbnb economy don’t have the option of trusting each other on the basis of institutional affiliations, so they do it on the basis of online signaling and peer evaluations.” 3 Sharing Economy companies are not the first to use ratings and algorithms to guide behavior. Their trust systems build on the rating and recommendation systems used by Amazon, Netflix, eBay, Yelp, TripAdvisor, iTunes, the App Store and many others. Each takes individual ratings as their input and transforms them into some form of recommendation. As rating systems have become ubiquitous their usefulness has become a matter of faith in the world of software development. The Sharing Economy is at the cutting edge of a push for “algorithmic regulation” in which rules protecting consumers are replaced by ratings and software algorithms.

For Anderson, Amazon represents the return of variety and diversity after decades of homogenous blockbusters: “We are turning from a mass market back into a niche nation, defined not by geography but by interests.” 19 In a Long Tail world there is no need for formal gatekeepers who select or restrict the works that can find their public; instead, Web 2.0 platforms will do it for us using crowdsourced consumer reviews and recommender systems: “By combining infinite shelf space with real-time information about buying trends and public opinion . . . unlimited selection is revealing truths about what consumers want and how they want to get it.” 20 Amazon and Airbnb are similar in many ways. Both are, at least in part, software companies whose inventory is simply a set of entries in a database, accessed via a web site. Anything can go into the database: for Amazon’s books it might be Harry Potter or a self-published obscurity, or anything in between.


pages: 319 words: 89,477

The Power of Pull: How Small Moves, Smartly Made, Can Set Big Things in Motion by John Hagel Iii, John Seely Brown

Albert Einstein, Andrew Keen, barriers to entry, Black Swan, business process, call centre, Clayton Christensen, cleantech, cloud computing, commoditize, corporate governance, creative destruction, disruptive innovation, Elon Musk, en.wikipedia.org, future of work, game design, George Gilder, intangible asset, Isaac Newton, job satisfaction, Joi Ito, knowledge economy, knowledge worker, loose coupling, Louis Pasteur, Malcom McLean invented shipping containers, Maui Hawaii, medical residency, Network effects, old-boy network, packet switching, pattern recognition, peer-to-peer, pre–internet, profit motive, recommendation engine, Ronald Coase, shareholder value, Silicon Valley, Skype, smart transportation, software as a service, supply-chain management, The Nature of the Firm, the new new thing, too big to fail, trade liberalization, transaction costs

Blurring Creation and Use Pull platforms tend to allow us to perform the following activities, with a blurring of the boundaries between creation and use: • Find. Pull platforms allow us to find not just raw materials, products, and services, but also people with relevant skills and experience. Some of the tools and services that pull platforms use to help participants find relevant resources include search, recommendation engines, directories, agents, and reputation services. • Connect. Again, pull platforms connect us not just to raw materials, products, and services, but also to people with relevant skills and experiences. Performance fabrics5 are particularly helpful in establishing appropriate connections. The mobile Internet is dramatically extending our ability to connect wherever we are. • Innovate. Pull platforms provide much more flexible environments for participants to innovate with the resources made available to them.


pages: 713 words: 93,944

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond, Jim Wilson, Jim R. Wilson

AGPL, Amazon Web Services, create, read, update, delete, data is the new oil, database schema, Debian, domain-specific language, en.wikipedia.org, fault tolerance, full text search, general-purpose programming language, Kickstarter, linked data, MVC pattern, natural language processing, node package manager, random walk, recommendation engine, Ruby on Rails, Skype, social graph, web application

Neo4j, as our open source example, is growing in popularity for many social network applications. Unlike other database styles that group collections of like objects into common buckets, graph databases are more free-form—queries consist of following edges shared by two nodes or, namely, traversing nodes. As more projects use them, graph databases are growing beyond the straightforward social examples to occupy more nuanced use cases, such as recommendation engines, access control lists, and geographic data. Good For: Graph databases seem to be tailor-made for networking applications. The prototypical example is a social network, where nodes represent users who have various kinds of relationships to each other. Modeling this kind of data using any of the other styles is often a tough fit, but a graph database would accept it with relish. They are also perfect matches for an object-oriented system.
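The "traversing nodes" style of query can be imitated on a plain adjacency dict, for example recommending friends-of-friends a user is not yet connected to; in Neo4j this would be expressed as a graph pattern query, but the toy Python version below (invented graph) shows the idea:

```python
# Invented social graph stored as an adjacency dict.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}

def recommend(user):
    # Traverse one hop past each friend and keep only people not already connected.
    candidates = set()
    for friend in friends[user]:
        candidates |= friends[friend]
    return candidates - friends[user] - {user}

print(recommend("alice"))  # {'dave', 'erin'}
```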


pages: 339 words: 88,732

The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson, Andrew McAfee

"Robert Solow", 2013 Report for America's Infrastructure - American Society of Civil Engineers - 19 March 2013, 3D printing, access to a mobile phone, additive manufacturing, Airbnb, Albert Einstein, Amazon Mechanical Turk, Amazon Web Services, American Society of Civil Engineers: Report Card, Any sufficiently advanced technology is indistinguishable from magic, autonomous vehicles, barriers to entry, basic income, Baxter: Rethink Robotics, British Empire, business cycle, business intelligence, business process, call centre, Charles Lindbergh, Chuck Templeton: OpenTable:, clean water, combinatorial explosion, computer age, computer vision, congestion charging, corporate governance, creative destruction, crowdsourcing, David Ricardo: comparative advantage, digital map, employer provided health coverage, en.wikipedia.org, Erik Brynjolfsson, factory automation, falling living standards, Filter Bubble, first square of the chessboard / second half of the chessboard, Frank Levy and Richard Murnane: The New Division of Labor, Freestyle chess, full employment, G4S, game design, global village, happiness index / gross national happiness, illegal immigration, immigration reform, income inequality, income per capita, indoor plumbing, industrial robot, informal economy, intangible asset, inventory management, James Watt: steam engine, Jeff Bezos, jimmy wales, job automation, John Markoff, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, Joseph Schumpeter, Kevin Kelly, Khan Academy, knowledge worker, Kodak vs Instagram, law of one price, low skilled workers, Lyft, Mahatma Gandhi, manufacturing employment, Marc Andreessen, Mark Zuckerberg, Mars Rover, mass immigration, means of production, Narrative Science, Nate Silver, natural language processing, Network effects, new economy, New Urbanism, Nicholas Carr, Occupy movement, oil shale / tar sands, oil shock, pattern recognition, Paul Samuelson, payday loans, post-work, price stability, Productivity paradox, profit maximization, Ralph Nader, Ray Kurzweil, recommendation engine, Report Card for America’s Infrastructure, Robert Gordon, Rodney Brooks, Ronald Reagan, Second Machine Age, self-driving car, sharing economy, Silicon Valley, Simon Kuznets, six sigma, Skype, software patent, sovereign wealth fund, speech recognition, statistical model, Steve Jobs, Steven Pinker, Stuxnet, supply-chain management, TaskRabbit, technological singularity, telepresence, The Bell Curve by Richard Herrnstein and Charles Murray, The Signal and the Noise by Nate Silver, The Wealth of Nations by Adam Smith, total factor productivity, transaction costs, Tyler Cowen: Great Stagnation, Vernor Vinge, Watson beat the top human players on Jeopardy!, winner-take-all economy, Y2K

When there are many small local markets, there can be a ‘best’ provider in each, and these local heroes frequently can all earn a good income. If these markets merge into a single global market, top performers have an opportunity to win more customers, while the next-best performers face harsher competition from all directions. A similar dynamic comes into play when technologies like Google or even Amazon’s recommendation engine reduce search costs. Suddenly second-rate producers can no longer count on consumer ignorance or geographic barriers to protect their margins. Digital technologies have aided the transition to winner-take-all markets, even for products we wouldn’t think would have superstar status. In a traditional camera store, cameras typically are not ranked number one versus number ten. But online retailers make it easy to list products in rank order by customer ratings, or to filter results to include only products with every conceivable desirable feature.


pages: 389 words: 87,758

No Ordinary Disruption: The Four Global Forces Breaking All the Trends by Richard Dobbs, James Manyika

2013 Report for America's Infrastructure - American Society of Civil Engineers - 19 March 2013, access to a mobile phone, additive manufacturing, Airbnb, Amazon Mechanical Turk, American Society of Civil Engineers: Report Card, autonomous vehicles, Bakken shale, barriers to entry, business cycle, business intelligence, Carmen Reinhart, central bank independence, cloud computing, corporate governance, creative destruction, crowdsourcing, demographic dividend, deskilling, disintermediation, disruptive innovation, distributed generation, Erik Brynjolfsson, financial innovation, first square of the chessboard, first square of the chessboard / second half of the chessboard, Gini coefficient, global supply chain, global village, hydraulic fracturing, illegal immigration, income inequality, index fund, industrial robot, intangible asset, Intergovernmental Panel on Climate Change (IPCC), Internet of things, inventory management, job automation, Just-in-time delivery, Kenneth Rogoff, Kickstarter, knowledge worker, labor-force participation, low skilled workers, Lyft, M-Pesa, mass immigration, megacity, mobile money, Mohammed Bouazizi, Network effects, new economy, New Urbanism, oil shale / tar sands, oil shock, old age dependency ratio, openstreetmap, peer-to-peer lending, pension reform, private sector deleveraging, purchasing power parity, quantitative easing, recommendation engine, Report Card for America’s Infrastructure, RFID, ride hailing / ride sharing, Second Machine Age, self-driving car, sharing economy, Silicon Valley, Silicon Valley startup, Skype, smart cities, Snapchat, sovereign wealth fund, spinning jenny, stem cell, Steve Jobs, supply-chain management, TaskRabbit, The Great Moderation, trade route, transaction costs, Travis Kalanick, uber lyft, urban sprawl, Watson beat the top human players on Jeopardy!, working-age population, Zipcar

To illustrate the scale of the opportunity, consider this change: on July 31, 2013, the US Bureau of Economic Analysis released GDP figures that for the first time categorized research and development and software into a new category of “intellectual property products.” We estimate that digital capital is now the source of roughly one-third of total global GDP growth, with intangible assets (think of the value of Google’s search algorithm or Amazon’s recommendation engine) being the main driver.41 For businesses and governments alike, failing to navigate today’s technological tide will mean losing out on a huge economic opportunity as well as increasing vulnerability to potential disruptions. Digitization and technological advances can transform industries in the blink of an eye, as BlackBerry has learned. History is littered with such corporate casualties.


pages: 294 words: 96,661

The Fourth Age: Smart Robots, Conscious Computers, and the Future of Humanity by Byron Reese

agricultural Revolution, AI winter, artificial general intelligence, basic income, Buckminster Fuller, business cycle, business process, Claude Shannon: information theory, clean water, cognitive bias, computer age, crowdsourcing, dark matter, Elon Musk, Eratosthenes, estate planning, financial independence, first square of the chessboard, first square of the chessboard / second half of the chessboard, full employment, Hans Rosling, income inequality, invention of agriculture, invention of movable type, invention of the printing press, invention of writing, Isaac Newton, Islamic Golden Age, James Hargreaves, job automation, Johannes Kepler, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, John von Neumann, Kevin Kelly, lateral thinking, life extension, Louis Pasteur, low skilled workers, manufacturing employment, Marc Andreessen, Mark Zuckerberg, Marshall McLuhan, Mary Lou Jepsen, Moravec's paradox, On the Revolutions of the Heavenly Spheres, pattern recognition, profit motive, Ray Kurzweil, recommendation engine, Rodney Brooks, Sam Altman, self-driving car, Silicon Valley, Skype, spinning jenny, Stephen Hawking, Steve Wozniak, Steven Pinker, strong AI, technological singularity, telepresence, telepresence robot, The Future of Employment, the scientific method, Turing machine, Turing test, universal basic income, Von Neumann architecture, Wall-E, Watson beat the top human players on Jeopardy!, women in the workforce, working poor, Works Progress Administration, Y Combinator

How would you program those actions? Reducing it down to ones and zeros is obviously possible, but equally obviously difficult for a device that can only manipulate abstract symbols in memory. One wrinkle with these sorts of perception problems is that we don’t have the training data to teach the robots. Amazon has a huge database of “people who bought this also bought that” with which to train its recommendation engine. But we don’t have all the tactile data of a million adults holding a million babies in a thousand situations. We could certainly collect the data by making a version of those CGI suits that people wear when making movies. Using upgraded sensors in the hands and fingers, we could get a thousand parents to wear them for a year to begin to collect that data. But no one is doing that right now.


Text Analytics With Python: A Practical Real-World Approach to Gaining Actionable Insights From Your Data by Dipanjan Sarkar

bioinformatics, business intelligence, computer vision, continuous integration, en.wikipedia.org, general-purpose programming language, Guido van Rossum, information retrieval, Internet of things, invention of the printing press, iterative process, natural language processing, out of africa, performance metric, premature optimization, recommendation engine, self-driving car, semantic web, sentiment analysis, speech recognition, statistical model, text mining, Turing test, web application

Keyphrase extraction, also known as terminology extraction, is defined as the process or technique of extracting the most important and relevant terms or phrases from a body of unstructured text such that the core topics or themes of the text document(s) are captured in these key phrases. This technique falls under the broad umbrella of information retrieval and extraction. Keyphrase extraction finds uses in many areas, including the semantic web, query-based search engines and crawlers, recommendation systems, tagging systems, document similarity, and translation. Keyphrase extraction is often the starting point for carrying out more complex tasks in text analytics or NLP, and its output can itself act as features for more complex systems. There are various approaches to keyphrase extraction; we will be covering two techniques: collocations and weighted tag–based phrase extraction. An important thing to remember here is that the extracted phrases are usually collections of words, though sometimes a single word can qualify.
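To make the collocation approach concrete, here is a minimal sketch (my own illustration, not the book's code) of extracting candidate keyphrases as statistically interesting bigrams with NLTK; the sample text, the frequency filter, and the PMI scoring measure are all assumptions made for the example.

```python
# A minimal sketch of collocation-based keyphrase extraction with NLTK.
# The sample text, frequency filter, and PMI measure are illustrative choices.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("keyphrase extraction pulls key terms and phrases from unstructured text "
        "so that search engines recommendation systems and tagging systems "
        "can capture the core topics of a document using keyphrase extraction")

tokens = text.lower().split()                      # naive tokenization for the demo
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                        # keep bigrams that occur at least twice
measures = BigramAssocMeasures()

# Rank candidate two-word phrases by pointwise mutual information (PMI).
for phrase in finder.nbest(measures.pmi, 5):
    print(" ".join(phrase))
```

On a real corpus you would raise the frequency filter and might compare other association measures, such as the likelihood ratio, before settling on PMI.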

The core algorithm in PageRank is a graph-based scoring or ranking algorithm, where pages are scored or ranked based on their importance. Websites and pages contain further links embedded in them, which lead to more pages with more links, and this continues across the Internet. This can be represented as a graph-based model where vertices indicate the web pages and edges indicate the links among them. This can be used to form a voting or recommendation system: when one vertex links to another one in the graph, it is basically casting a vote. Vertex importance is determined not only by the number of votes or edges a vertex receives but also by the importance of the vertices casting those votes. This helps in determining the score or rank for each vertex or page. This is evident from Figure 5-4, which represents a sample of pages with their importance.
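The voting idea fits in a few lines. Below is a toy power-iteration sketch of PageRank-style scoring (not the book's implementation); the four-page link graph, the damping factor, and the iteration count are invented for illustration.

```python
# PageRank-style scoring by power iteration: a toy sketch.
# The link graph below is invented; each edge A -> B is a "vote" from A for B.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping, iterations = 0.85, 50
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(iterations):
    new_rank = {}
    for page in pages:
        # A page's score combines a small base amount with the shared scores
        # of every page that links to (votes for) it.
        incoming = sum(rank[q] / len(links[q]) for q in pages if page in links[q])
        new_rank[page] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

Each iteration redistributes every page's current score along its outgoing links, so a page that is linked to by important pages ends up important itself.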

Finally, we took a real-world example of clustering the top hundred greatest movies of all time using IMDb movie synopses data and used different clustering models like k-means, affinity propagation, and Ward’s hierarchical clustering to build, analyze, and visualize clusters. This should be enough for you to get started with analyzing document similarity and clustering, and you can even start combining various techniques from the chapters covered so far. (Hint: Topic models with clustering, building classifiers by combining supervised and unsupervised learning, and augmenting recommendation systems using document clusters—just to name a few!) Natural language understanding has gained significant importance in the last decade with the advent of machine learning (ML) and further advances like deep learning and artificial intelligence.
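As a concrete companion to the clustering workflow summarized above, here is a minimal sketch of grouping documents by TF-IDF features with k-means in scikit-learn; the toy synopses and the choice of two clusters are assumptions, not the chapter's IMDb data or code.

```python
# Clustering short documents with TF-IDF features and k-means: a minimal sketch.
# The synopses below are placeholders, not the IMDb data used in the chapter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

synopses = [
    "A mob family struggles for power across generations.",
    "Gangsters rise and fall in organised crime.",
    "Astronauts travel through a wormhole to save humanity.",
    "A crew explores deep space and a mysterious signal.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(synopses)           # sparse document-term matrix

km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)

for doc, label in zip(synopses, labels):
    print(label, doc[:45])
```

Swapping in affinity propagation or Ward's hierarchical clustering is mostly a matter of replacing the estimator, since all three accept the same feature matrix.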


Hands-On Machine Learning With Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Geron

Amazon Mechanical Turk, Bayesian statistics, centre right, combinatorial explosion, constrained optimization, correlation coefficient, crowdsourcing, en.wikipedia.org, iterative process, Netflix Prize, NP-complete, optical character recognition, P = NP, p-value, pattern recognition, performance metric, recommendation engine, self-driving car, SpamAssassin, speech recognition, statistical model

Wouldn’t it be great if the algorithm could just exploit the unlabeled data without needing humans to label every picture? Enter unsupervised learning. In Chapter 8, we looked at the most common unsupervised learning task: dimensionality reduction. In this chapter, we will look at a few more unsupervised learning tasks and algorithms: Clustering: the goal is to group similar instances together into clusters. This is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more. Anomaly detection: the objective is to learn what “normal” data looks like, and use this to detect abnormal instances, such as defective items on a production line or a new trend in a time series. Density estimation: this is the task of estimating the probability density function (PDF) of the random process that generated the dataset.
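For the anomaly detection and density estimation tasks just listed, a common pattern is to fit a density model and flag low-density points. The sketch below uses scikit-learn's GaussianMixture on synthetic data; the data, the single mixture component, and the 2% threshold are all illustrative assumptions rather than the book's example.

```python
# Density estimation and anomaly detection with a Gaussian mixture: a minimal sketch.
# The synthetic data and the 2% contamination threshold are illustrative choices.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal = rng.normal(loc=[0, 0], scale=1.0, size=(500, 2))   # "normal" instances
outliers = rng.uniform(low=-6, high=6, size=(10, 2))         # scattered anomalies
X = np.vstack([normal, outliers])

gm = GaussianMixture(n_components=1, random_state=0).fit(X)

# score_samples returns the log of the estimated density at each point;
# flag the lowest-density 2% of instances as anomalies.
log_density = gm.score_samples(X)
threshold = np.percentile(log_density, 2)
anomalies = X[log_density < threshold]
print(f"flagged {len(anomalies)} anomalous instances")
```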

Classification (left) versus clustering (right) Clustering is used in a wide variety of applications, including: For customer segmentation: you can cluster your customers based on their purchases, their activity on your website, and so on. This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment. For example, this can be useful in recommender systems to suggest content that other users in the same cluster enjoyed. For data analysis: when analyzing a new dataset, it is often useful to first discover clusters of similar instances, as it is often easier to analyze clusters separately. As a dimensionality reduction technique: once a dataset has been clustered, it is usually possible to measure each instance’s affinity with each cluster (affinity is any measure of how well an instance fits into a cluster).
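A hedged sketch of the customer segmentation and cluster-affinity ideas, using k-means on made-up customer features (the feature columns and the cluster count are assumptions): scikit-learn's `KMeans.transform` returns each instance's distance to every centroid, which can serve as a compact replacement for the raw features.

```python
# Customer segmentation with k-means, plus cluster affinities as reduced features:
# a minimal sketch on made-up purchase/activity data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Columns (invented): yearly spend, site visits per month, items per order.
customers = rng.normal(loc=[500, 12, 3], scale=[200, 5, 1], size=(200, 3))

km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(customers)

segments = km.labels_                 # which segment each customer belongs to
affinities = km.transform(customers)  # distance from each customer to each centroid

# The 4-column affinity matrix can replace the raw features downstream,
# e.g. as input to a recommender or a supervised model.
print(segments[:10])
print(affinities.shape)               # (200, 4)
```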


pages: 326 words: 103,170

The Seventh Sense: Power, Fortune, and Survival in the Age of Networks by Joshua Cooper Ramo

Airbnb, Albert Einstein, algorithmic trading, barriers to entry, Berlin Wall, bitcoin, British Empire, cloud computing, crowdsourcing, Danny Hillis, defense in depth, Deng Xiaoping, drone strike, Edward Snowden, Fall of the Berlin Wall, Firefox, Google Chrome, income inequality, Isaac Newton, Jeff Bezos, job automation, Joi Ito, market bubble, Menlo Park, Metcalfe’s law, Mitch Kapor, natural language processing, Network effects, Norbert Wiener, Oculus Rift, packet switching, Paul Graham, price stability, quantitative easing, RAND corporation, recommendation engine, Republic of Letters, Richard Feynman, road to serfdom, Robert Metcalfe, Sand Hill Road, secular stagnation, self-driving car, Silicon Valley, Skype, Snapchat, social web, sovereign wealth fund, Steve Jobs, Steve Wozniak, Stewart Brand, Stuxnet, superintelligent machines, technological singularity, The Coming Technological Singularity, The Wealth of Nations by Adam Smith, too big to fail, Vernor Vinge, zero day

And then the machine would spit back some films you might enjoy. The Paul Newman classic Cool Hand Luke, for instance. And, well, you had liked that film. This seemed magic, just the sort of data-meets-human question that showcased a machine learning and thinking. An honestly artificial intelligence. Maes hoped to design a computer that could predict what movies or music or books you or I might enjoy. (And, of course, buy.) A recommendation engine. We all know how sputtering our own suggestion motors can be. Think of that primitive analog exchange known as the First Date: Oh, you like Radiohead? Do you know Sigur Rós? Pause. Hate them. Can you really predict what albums or novels even your closest friend will enjoy? You might offer an occasional lucky suggestion. But to confidently bridge your knowledge of a friend’s taste and the nearly endless library of movies and songs and books?
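The kind of prediction Maes was after is usually built on collaborative filtering: lean on people whose past tastes resemble yours. The sketch below is a generic user-based variant with an invented ratings matrix; it is not a description of how her system actually worked.

```python
# A tiny user-based collaborative filtering sketch: suggest films a user hasn't
# rated, weighted by how similarly other users have rated the same films.
# The ratings matrix is invented for illustration only.
import numpy as np

films = ["Cool Hand Luke", "The Sting", "Alien", "Blade Runner"]
ratings = np.array([           # rows = users, columns = films, 0 = unrated
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 0                     # recommend for the first user
sims = np.array([cosine(ratings[target], ratings[i]) for i in range(len(ratings))])
sims[target] = 0               # ignore self-similarity

# Predicted score for each film = similarity-weighted average of others' ratings.
predicted = sims @ ratings / (sims.sum() + 1e-9)
for i in np.argsort(-predicted):
    if ratings[target, i] == 0:
        print("suggest:", films[i], round(float(predicted[i]), 2))
```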


pages: 323 words: 95,939

Present Shock: When Everything Happens Now by Douglas Rushkoff

algorithmic trading, Andrew Keen, bank run, Benoit Mandelbrot, big-box store, Black Swan, British Empire, Buckminster Fuller, business cycle, cashless society, citizen journalism, clockwork universe, cognitive dissonance, Credit Default Swap, crowdsourcing, Danny Hillis, disintermediation, Donald Trump, double helix, East Village, Elliott wave, European colonialism, Extropian, facts on the ground, Flash crash, game design, global pandemic, global supply chain, global village, Howard Rheingold, hypertext link, Inbox Zero, invention of agriculture, invention of hypertext, invisible hand, iterative process, John Nash: game theory, Kevin Kelly, laissez-faire capitalism, lateral thinking, Law of Accelerating Returns, loss aversion, mandelbrot fractal, Marshall McLuhan, Merlin Mann, Milgram experiment, mutually assured destruction, negative equity, Network effects, New Urbanism, Nicholas Carr, Norbert Wiener, Occupy movement, passive investing, pattern recognition, peak oil, price mechanism, prisoner's dilemma, Ralph Nelson Elliott, RAND corporation, Ray Kurzweil, recommendation engine, selective serotonin reuptake inhibitor (SSRI), Silicon Valley, Skype, social graph, South Sea Bubble, Steve Jobs, Steve Wozniak, Steven Pinker, Stewart Brand, supply-chain management, the medium is the message, The Wisdom of Crowds, theory of mind, Turing test, upwardly mobile, Whole Earth Catalog, WikiLeaks, Y2K, zero-sum game

Today’s most vocal critic of this trend, The Cult of the Amateur author Andrew Keen, explains, “According to a June 2006 study by the Pew Internet and American Life Project, 34 percent of the 12 million bloggers in America consider their online ‘work’ to be a form of journalism. That adds up to millions of unskilled, untrained, unpaid, unknown ‘journalists’—a thousandfold growth between 1996 and 2006—spewing their (mis)information out in the cyberworld.” More sanguine voices, such as City University of New York journalism professor and BuzzFeed blogger Jeff Jarvis, argue that the market—amplified by search results and recommendation engines—will eventually allow the better journalism to rise to the top of the pile. But even market mechanisms may have a hard time functioning as we consumers of all this media lose our ability to distinguish between facts, informed opinions, and wild assertions. Our impatient disgust with politics as usual combined with our newfound faith in our own gut sensibilities drives us to take matters into our own hands—in journalism and beyond.


Mindf*ck: Cambridge Analytica and the Plot to Break America by Christopher Wylie

4chan, affirmative action, Affordable Care Act / Obamacare, availability heuristic, Berlin Wall, Bernie Sanders, big-box store, Boris Johnson, British Empire, call centre, Chelsea Manning, chief data officer, cognitive bias, cognitive dissonance, colonial rule, computer vision, conceptual framework, cryptocurrency, Daniel Kahneman / Amos Tversky, desegregation, Dominic Cummings, Donald Trump, Downton Abbey, Edward Snowden, Elon Musk, Etonian, first-past-the-post, Google Earth, housing crisis, income inequality, indoor plumbing, information asymmetry, Internet of things, Julian Assange, Lyft, Marc Andreessen, Mark Zuckerberg, Menlo Park, move fast and break things, move fast and break things, Network effects, new economy, obamacare, Peter Thiel, Potemkin village, recommendation engine, Renaissance Technologies, Robert Mercer, Ronald Reagan, Rosa Parks, Sand Hill Road, Scientific racism, Shoshana Zuboff, side project, Silicon Valley, Skype, uber lyft, unpaid internship, Valery Gerasimov, web application, WikiLeaks, zero-sum game

CA wanted to provoke people, to get them to engage. Cambridge Analytica did this because of a specific feature of Facebook’s algorithm at the time. When someone follows pages of generic brands like Walmart or some prime-time sitcom, nothing much changes in his newsfeed. But liking an extreme group, such as the Proud Boys or the Incel Liberation Army, marks the user as distinct from others in such a way that a recommendation engine will prioritize these topics for personalization. Which means the site’s algorithm will start to funnel the user similar stories and pages—all to increase engagement. For Facebook, rising engagement is the only metric that matters, as more engagement means more screen time to be exposed to advertisements. This is the darker side of Silicon Valley’s much celebrated metric of “user engagement.”


pages: 364 words: 99,897

The Industries of the Future by Alec Ross

23andMe, 3D printing, Airbnb, algorithmic trading, AltaVista, Anne Wojcicki, autonomous vehicles, banking crisis, barriers to entry, Bernie Madoff, bioinformatics, bitcoin, blockchain, Brian Krebs, British Empire, business intelligence, call centre, carbon footprint, cloud computing, collaborative consumption, connected car, corporate governance, Credit Default Swap, cryptocurrency, David Brooks, disintermediation, Dissolution of the Soviet Union, distributed ledger, Edward Glaeser, Edward Snowden, en.wikipedia.org, Erik Brynjolfsson, fiat currency, future of work, global supply chain, Google X / Alphabet X, industrial robot, Internet of things, invention of the printing press, Jaron Lanier, Jeff Bezos, job automation, John Markoff, Joi Ito, Kickstarter, knowledge economy, knowledge worker, lifelogging, litecoin, M-Pesa, Marc Andreessen, Mark Zuckerberg, Mikhail Gorbachev, mobile money, money: store of value / unit of account / medium of exchange, Nelson Mandela, new economy, offshore financial centre, open economy, Parag Khanna, paypal mafia, peer-to-peer, peer-to-peer lending, personalized medicine, Peter Thiel, precision agriculture, pre–internet, RAND corporation, Ray Kurzweil, recommendation engine, ride hailing / ride sharing, Rubik’s Cube, Satoshi Nakamoto, selective serotonin reuptake inhibitor (SSRI), self-driving car, sharing economy, Silicon Valley, Silicon Valley startup, Skype, smart cities, social graph, software as a service, special economic zone, supply-chain management, supply-chain management software, technoutopianism, The Future of Employment, Travis Kalanick, underbanked, Vernor Vinge, Watson beat the top human players on Jeopardy!, women in the workforce, Y Combinator, young professional

Academics have likened it to both a microscope and telescope—a tool that allows us to both examine smaller details than could previously be observed and to see data at a larger scale, revealing correlations that were previously too distant for us to notice. The story of big data’s real-world impact to this point has been largely about logistics and persuasion. It has been great for supply chains, elections, and advertising because these tend to be fields with lots of small, repeated, and quantifiable actions—hence the “recommendation engines” used by Amazon and Netflix that help make more precise recommendations to customers. But these fields are just the beginning, and by the time my kids enter the workforce, big data won’t be a buzz phrase any longer. It will have permeated parts of our lives that we do not think of today as being rooted in analytics. It will change what we eat, how we speak, and where we draw the line between our public and private personas.


pages: 416 words: 108,370

Hit Makers: The Science of Popularity in an Age of Distraction by Derek Thompson

Airbnb, Albert Einstein, Alexey Pajitnov wrote Tetris, always be closing, augmented reality, Clayton Christensen, Donald Trump, Downton Abbey, full employment, game design, Gordon Gekko, hindsight bias, indoor plumbing, industrial cluster, information trail, invention of the printing press, invention of the telegraph, Jeff Bezos, John Snow's cholera map, Kodak vs Instagram, linear programming, Lyft, Marc Andreessen, Mark Zuckerberg, Marshall McLuhan, Menlo Park, Metcalfe’s law, Minecraft, Nate Silver, Network effects, Nicholas Carr, out of africa, randomized controlled trial, recommendation engine, Robert Gordon, Ronald Reagan, Silicon Valley, Skype, Snapchat, statistical model, Steve Ballmer, Steve Jobs, Steven Levy, Steven Pinker, subscription business, telemarketer, the medium is the message, The Rise and Fall of American Growth, Uber and Lyft, Uber for X, uber lyft, Vilfredo Pareto, Vincenzo Peruggia: Mona Lisa, women in the workforce

As Loewy understood, neophilia and neophobia are not isolated states, but rather warring states, constantly doing battle both within the mind of every buyer and within an entire economy of buyers. I recently visited Spotify, the large online streaming music company, to talk to Matt Ogle, the lead engineer on a new hit product called Discover Weekly, a personalized list of thirty songs delivered every Monday to tens of millions of users. For about a decade, Ogle had worked for several music companies to design the perfect music recommendation engine. His philosophy of music was that most people enjoy new songs, but they don’t enjoy the effort that it takes to find them. They want effortless, frictionless musical revelations, a series of achievable challenges. In the design of Discover Weekly, “every decision we made was shaped by the notion that this should feel like a friend giving you a mix tape,” he said. So the playlist was weekly and included only thirty songs.


The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences by Rob Kitchin

Bayesian statistics, business intelligence, business process, cellular automata, Celtic Tiger, cloud computing, collateralized debt obligation, conceptual framework, congestion charging, corporate governance, correlation does not imply causation, crowdsourcing, discrete time, disruptive innovation, George Gilder, Google Earth, Infrastructure as a Service, Internet Archive, Internet of things, invisible hand, knowledge economy, late capitalism, lifelogging, linked data, longitudinal study, Masdar, means of production, Nate Silver, natural language processing, openstreetmap, pattern recognition, platform as a service, recommendation engine, RFID, semantic web, sentiment analysis, slashdot, smart cities, Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia, smart grid, smart meter, software as a service, statistical model, supply-chain management, the scientific method, The Signal and the Noise by Nate Silver, transaction costs

Discovering correlations between certain items led to new product placements and alterations to shelf space management and a 16 per cent increase in revenue per shopping cart in the first month’s trial. There was no hypothesis that Product A was often bought with Product H that was then tested. The data were simply queried to discover what relationships existed that might have previously been unnoticed. Similarly, Amazon’s recommendation system produces suggestions for other items a shopper might be interested in without knowing anything about the culture and conventions of books and reading; it simply identifies patterns of purchasing across customers in order to determine whether, if Person A likes Book X, they are also likely to like Book Y given their own and others’ consumption patterns. Dyche’s contention is that this open, rather than directed, approach to discovery is more likely to reveal unknown, underlying patterns with respect to customer behaviours, product affinities, and financial risks, that can then be exploited.
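The undirected, query-the-data style of discovery described here is essentially market-basket analysis. A minimal co-occurrence sketch (invented baskets, confidence scores only; real systems would also normalise by item popularity, for example with lift) looks like this:

```python
# Market-basket style discovery of "bought together" pairs: a minimal sketch.
# The baskets are invented; the printed "confidence" is P(B in basket | A in basket).
from itertools import combinations
from collections import Counter

baskets = [
    {"book_x", "book_y", "coffee"},
    {"book_x", "book_y"},
    {"book_x", "lamp"},
    {"book_y", "coffee"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

item_counts = Counter(item for basket in baskets for item in basket)

# Confidence that someone who takes item A also takes item B.
for (a, b), n in pair_counts.most_common(3):
    print(f"{a} -> {b}: confidence {n / item_counts[a]:.2f}")
```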

In fact, both deductive and inductive reasoning are always discursively framed and do not arise out of nowhere. Popper (1979, cited in Callebaut 2012: 74) thus suggests that all science adopts a searchlight approach to scientific discovery, with the focus of light guided by previous findings, theories and training; by speculation that is grounded in experience and knowledge. The same is true for Amazon, Hunch, Ayasdi, and Google. How Amazon constructed its recommendation system was based on scientific reasoning, underpinned by a guiding model and accompanied by empirical testing designed to improve the performance of the algorithms it uses. Likewise, Google undertakes extensive research and development, it works in partnership with scientists and it buys scientific knowledge, either funding research within universities or by buying the IP of other companies, to refine and extend the utility of how it organises, presents and extracts value from data.


pages: 442 words: 94,734

The Art of Statistics: Learning From Data by David Spiegelhalter

Antoine Gombaud: Chevalier de Méré, Bayesian statistics, Carmen Reinhart, complexity theory, computer vision, correlation coefficient, correlation does not imply causation, dark matter, Edmond Halley, Estimating the Reproducibility of Psychological Science, Hans Rosling, Kenneth Rogoff, meta analysis, meta-analysis, Nate Silver, Netflix Prize, p-value, placebo effect, probability theory / Blaise Pascal / Pierre de Fermat, publication bias, randomized controlled trial, recommendation engine, replication crisis, self-driving car, speech recognition, statistical model, The Design of Experiments, The Signal and the Noise by Nate Silver, The Wisdom of Crowds, Thomas Bayes, Thomas Malthus

Then in the twentieth century statistics became more mathematical and, unfortunately for many students and practitioners, the topic became synonymous with the mechanical application of a bag of statistical tools, many named after eccentric and argumentative statisticians that we shall meet later in this book. This common view of statistics as a basic ‘bag of tools’ is now facing major challenges. First, we are in an age of data science, in which large and complex data sets are collected from routine sources such as traffic monitors, social media posts and internet purchases, and used as a basis for technological innovations such as optimizing travel routes, targeted advertising or purchase recommendation systems – we shall look at algorithms based on ‘big data’ in Chapter 6. Statistical training is increasingly seen as just one necessary component of being a data scientist, together with skills in data management, programming and algorithm development, as well as proper knowledge of the subject matter. Another challenge to the traditional view of statistics comes from the huge rise in the amount of scientific research being carried out, particularly in the biomedical and social sciences, combined with pressure to publish in high-ranking journals.

This reflects a general concern that algorithms that win Kaggle competitions tend to be very complex in order to achieve that tiny final margin needed to win. A major problem is that these algorithms tend to be inscrutable black boxes – they come up with a prediction, but it is almost impossible to work out what is going on inside. This has three negative aspects. First, extreme complexity makes implementation and upgrading a great effort: when Netflix offered a $1m prize for improving its film recommendation system, the winning entry was so complicated that Netflix ended up not using it. The second negative feature is that we do not know how the conclusion was arrived at, or what confidence we should have in it: we just have to take it or leave it. Simpler algorithms can better explain themselves. Finally, if we do not know how an algorithm is producing its answer, we cannot investigate it for implicit but systematic biases against some members of the community – a point I expand on below.


pages: 340 words: 94,464

Randomistas: How Radical Researchers Changed Our World by Andrew Leigh

Albert Einstein, Amazon Mechanical Turk, Anton Chekhov, Atul Gawande, basic income, Black Swan, correlation does not imply causation, crowdsourcing, David Brooks, Donald Trump, ending welfare as we know it, Estimating the Reproducibility of Psychological Science, experimental economics, Flynn Effect, germ theory of disease, Ignaz Semmelweis: hand washing, Indoor air pollution, Isaac Newton, Kickstarter, longitudinal study, loss aversion, Lyft, Marshall McLuhan, meta analysis, meta-analysis, microcredit, Netflix Prize, nudge unit, offshore financial centre, p-value, placebo effect, price mechanism, publication bias, RAND corporation, randomized controlled trial, recommendation engine, Richard Feynman, ride hailing / ride sharing, Robert Metcalfe, Ronald Reagan, statistical model, Steven Pinker, uber lyft, universal basic income, War on Poverty

Landon ended up with just 8 of the 531 electoral college votes. 61Huizhi Xie & Juliette Aurisset, ‘Improving the sensitivity of online controlled experiments: Case studies at Netflix’, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 645–54. ACM, 2016. 62Carlos A. Gomez-Uribe & Neil Hunt, ‘The Netflix recommender system: Algorithms, business value, and innovation’, ACM Transactions on Management Information Systems (TMIS), vol. 6, no. 4, 2016, p. 13. 63Gomez-Uribe & Hunt, ‘The Netflix recommender system’, p. 13. 64Adam D.I. Kramer, Jamie E. Guillory & Jeffrey T. Hancock, ‘Experimental evidence of massive-scale emotional contagion through social networks’, Proceedings of the National Academy of Sciences, vol. 111, no. 24, 2014, pp. 8788–90. 65Because 22.4 per cent of Facebook posts contained negative words, and 46.8 per cent contained positive words, the study also had two control groups: one of which randomly omitted 2.24 per cent of all posts, and another that randomly omitted 4.68 per cent of all posts. 66Oddly, some commentators seem unaware of the finding, continuing to make claims like ‘Facebook makes us feel inadequate, so we try to compete, putting a positive spin and a pretty filter on an ordinary moment – prompting someone else to do the same . . . when you sign up to Facebook you put yourself under pressure to appear popular, fun and loved, regardless of your reality’: Daisy Buchanan, ‘Facebook bragging’s route to divorce’, Australian Financial Review, 27 August 2016. 67Kate Bullen & John Oates, ‘Facebook’s ‘experiment’ was socially irresponsible’, Guardian, 2 July 2014. 68Quoted in David Goldman, ‘Facebook still won’t say “sorry” for mind games experiment’, CNNMoney, 2 July 2014. 9 TESTING THEORIES IN POLITICS AND PHILANTHROPY 1Julian Jamison & Dean Karlan, ‘Candy elasticity: Halloween experiments on public political statements’, Economic Inquiry, vol. 54, no. 1, 2016, pp. 543–7. 2This experiment is outlined in detail in Dan Siroker, ‘How Obama raised $60 million by running a simple experiment’, Optimizely blog, 29 November 2010. 3Quoted in Brian Christian, ‘The A/B test: Inside the technology that’s changing the rules of business’, Wired, 25 April 2012. 4Alan S.


pages: 421 words: 110,406

Platform Revolution: How Networked Markets Are Transforming the Economy--And How to Make Them Work for You by Sangeet Paul Choudary, Marshall W. van Alstyne, Geoffrey G. Parker

3D printing, Affordable Care Act / Obamacare, Airbnb, Alvin Roth, Amazon Mechanical Turk, Amazon Web Services, Andrei Shleifer, Apple's 1984 Super Bowl advert, autonomous vehicles, barriers to entry, big data - Walmart - Pop Tarts, bitcoin, blockchain, business cycle, business process, buy low sell high, chief data officer, Chuck Templeton: OpenTable:, clean water, cloud computing, connected car, corporate governance, crowdsourcing, data acquisition, data is the new oil, digital map, discounted cash flows, disintermediation, Edward Glaeser, Elon Musk, en.wikipedia.org, Erik Brynjolfsson, financial innovation, Haber-Bosch Process, High speed trading, information asymmetry, Internet of things, inventory management, invisible hand, Jean Tirole, Jeff Bezos, jimmy wales, John Markoff, Khan Academy, Kickstarter, Lean Startup, Lyft, Marc Andreessen, market design, Metcalfe’s law, multi-sided market, Network effects, new economy, payday loans, peer-to-peer lending, Peter Thiel, pets.com, pre–internet, price mechanism, recommendation engine, RFID, Richard Stallman, ride hailing / ride sharing, Robert Metcalfe, Ronald Coase, Satoshi Nakamoto, self-driving car, shareholder value, sharing economy, side project, Silicon Valley, Skype, smart contracts, smart grid, Snapchat, software is eating the world, Steve Jobs, TaskRabbit, The Chicago School, the payments system, Tim Cook: Apple, transaction costs, Travis Kalanick, two-sided market, Uber and Lyft, Uber for X, uber lyft, winner-take-all economy, zero-sum game, Zipcar

Even more unsettling are some of the less obvious ways in which personal data are used. Many firms—both platform businesses and others—track consumers’ web usage, financial interactions, magazine subscriptions, political and charitable contributions, and much more to create highly detailed individual profiles. In the aggregate, such data can be used for cross-marketing to people who share profiles, as when a recommendation engine on a shopping site tells you, “People like you who bought product A often enjoy product B, too!” The anonymity of this process renders it unobjectionable to most people. But the same underlying data can be, and is, sold to prospective employers, government agencies, health care providers, and marketers of all kinds. Individually identifiable data about sensitive topics such as sexual orientation, prescription drug use, alcoholism, and personal travel (tracked through cell phone location data) can be purchased through data broker firms such as Acxiom.32 Consumer concern over the practices of the data broker industry has led to a number of investigations, including a major FTC inquiry that resulted in a report titled “Data Brokers: A Call for Transparency and Accountability.”33 But very little has actually changed to prevent practices that many find objectionable.34 Skeptics say that, in reality, citizen concerns about data privacy are superficial.


The Deep Learning Revolution (The MIT Press) by Terrence J. Sejnowski

AI winter, Albert Einstein, algorithmic trading, Amazon Web Services, Any sufficiently advanced technology is indistinguishable from magic, augmented reality, autonomous vehicles, Baxter: Rethink Robotics, bioinformatics, cellular automata, Claude Shannon: information theory, cloud computing, complexity theory, computer vision, conceptual framework, constrained optimization, Conway's Game of Life, correlation does not imply causation, crowdsourcing, Danny Hillis, delayed gratification, discovery of DNA, Donald Trump, Douglas Engelbart, Drosophila, Elon Musk, en.wikipedia.org, epigenetics, Flynn Effect, Frank Gehry, future of work, Google Glasses, Google X / Alphabet X, Guggenheim Bilbao, Gödel, Escher, Bach, haute couture, Henri Poincaré, I think there is a world market for maybe five computers, industrial robot, informal economy, Internet of things, Isaac Newton, John Conway, John Markoff, John von Neumann, Mark Zuckerberg, Minecraft, natural language processing, Netflix Prize, Norbert Wiener, orbital mechanics / astrodynamics, PageRank, pattern recognition, prediction markets, randomized controlled trial, recommendation engine, Renaissance Technologies, Rodney Brooks, self-driving car, Silicon Valley, Silicon Valley startup, Socratic dialogue, speech recognition, statistical model, Stephen Hawking, theory of mind, Thomas Bayes, Thomas Kuhn: the structure of scientific revolutions, traveling salesman, Turing machine, Von Neumann architecture, Watson beat the top human players on Jeopardy!, X Prize, Yogi Berra

As a consequence, there are fewer parameters to train on each epoch, and the resulting network has fewer dependencies between units than would be the case if the same large network were trained on every epoch. Dropout decreases the error rate in deep learning networks by 10 percent, which is a large improvement. In 2009, Netflix conducted an open competition, offering a prize of $1 million to the first person who could reduce the error of their recommender system by 10 percent.16 Almost every graduate student in machine learning entered the competition. Netflix probably inspired $10 million of research for the cost of the prize. And deep networks are now a core technology for online streaming.17 Intriguingly, cortical synapses drop out at a high rate. On every spike along an input, the typical excitatory synapse in the cortex has a 90 percent failure rate.18 This is like a baseball team where almost all the players are batting .100.
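For reference, dropout itself takes only a few lines. Here is a minimal NumPy sketch of inverted dropout applied to one layer's activations; the layer size and the 50% keep probability are illustrative choices, not taken from the book.

```python
# Inverted dropout on a layer's activations: a minimal NumPy sketch.
# The batch size, layer width, and keep probability are illustrative.
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 8))   # a batch of 4 examples, 8 hidden units
keep_prob = 0.5

# During training, randomly silence units and rescale the survivors so the
# expected activation stays the same; at test time the layer is used as-is.
mask = rng.random(activations.shape) < keep_prob
dropped = activations * mask / keep_prob
print(dropped)
```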

If a network of simulated neurons is trained to read and then is damaged, it produces strikingly similar behavior” (76). 15. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research 15 (2014): 1929–1958. 16. “Netflix Prize,” Wikipedia, last modified August 23, 2017, https://en.wikipedia.org/wiki/Netflix_Prize. 17. Carlos A. Gomez-Uribe and Neil Hunt, “The Netflix Recommender System: Algorithms, Business Value, and Innovation,” ACM Transactions on Management Information Systems 6, no. 4 (2016), article no. 13. 18. T. M. Bartol Jr., C. Bromer, J. Kinney, M. A. Chirillo, J. N. Bourne, K. M. Harris, and T. J. Sejnowski, “Nanoconnectomic Upper Bound on the Variability of Synaptic Plasticity,” eLife 4: e10778, 2015, doi:10.7554/eLife.10778. 19. This follows from the “law of large numbers” in probability theory.


pages: 540 words: 103,101

Building Microservices by Sam Newman

airport security, Amazon Web Services, anti-pattern, business process, call centre, continuous integration, create, read, update, delete, defense in depth, don't repeat yourself, Edward Snowden, fault tolerance, index card, information retrieval, Infrastructure as a Service, inventory management, job automation, Kubernetes, load shedding, loose coupling, microservices, MITM: man-in-the-middle, platform as a service, premature optimization, pull request, recommendation engine, social graph, software as a service, source of truth, the built environment, web application, WebSocket

Then we want to try to understand what bounded contexts the monolith maps to. Let’s imagine that initially we identify four contexts we think our monolithic backend covers. Catalog: everything to do with metadata about the items we offer for sale. Finance: reporting for accounts, payments, refunds, etc. Warehouse: dispatching and returning of customer orders, managing inventory levels, etc. Recommendation: our patent-pending, revolutionary recommendation system, which is highly complex code written by a team with more PhDs than the average science lab. The first thing to do is to create packages representing these contexts, and then move the existing code into them. With modern IDEs, code movement can be done automatically via refactorings, and can be done incrementally while we are doing other things. You’ll still need tests to catch any breakages made by moving code, however, especially if you’re using a dynamically typed language where the IDEs have a harder time performing refactoring.

Security: MusicCorp has had a security audit, and has decided to tighten up its protection of sensitive information. Currently, all of this is handled by the finance-related code. If we split this service out, we can provide additional protections to this individual service in terms of monitoring, protection of data in transit, and protection of data at rest — ideas we’ll look at in more detail in Chapter 9. Technology: The team looking after our recommendation system has been spiking out some new algorithms using a logic programming library in the language Clojure. The team thinks this could benefit our customers by improving what we offer them. If we could split out the recommendation code into a separate service, it would be easy to consider building an alternative implementation that we could test against. Tangled Dependencies: The other point to consider when you’ve identified a couple of seams to separate is how entangled that code is with the rest of the system.


pages: 480 words: 123,979

Dawn of the New Everything: Encounters With Reality and Virtual Reality by Jaron Lanier

4chan, augmented reality, back-to-the-land, Buckminster Fuller, Burning Man, carbon footprint, cloud computing, collaborative editing, commoditize, cosmological constant, creative destruction, crowdsourcing, Donald Trump, Douglas Engelbart, Douglas Hofstadter, El Camino Real, Elon Musk, Firefox, game design, general-purpose programming language, gig economy, Google Glasses, Grace Hopper, Gödel, Escher, Bach, Hacker Ethic, Howard Rheingold, impulse control, information asymmetry, invisible hand, Jaron Lanier, John von Neumann, Kevin Kelly, Kickstarter, Kuiper Belt, lifelogging, mandelbrot fractal, Mark Zuckerberg, Marshall McLuhan, Menlo Park, Minecraft, Mitch Kapor, Mother of all demos, Murray Gell-Mann, Netflix Prize, Network effects, new economy, Norbert Wiener, Oculus Rift, pattern recognition, Paul Erdős, profit motive, Ray Kurzweil, recommendation engine, Richard Feynman, Richard Stallman, Ronald Reagan, self-driving car, Silicon Valley, Silicon Valley startup, Skype, Snapchat, stem cell, Stephen Hawking, Steve Jobs, Steven Levy, Stewart Brand, technoutopianism, Ted Nelson, telemarketer, telepresence, telepresence robot, Thorstein Veblen, Turing test, Vernor Vinge, Whole Earth Catalog, Whole Earth Review, WikiLeaks, wikimedia commons

Consider Netflix. The company claims that its smart algorithm gets to know you and then recommends movies. The company even offered a million-dollar prize for ideas to make the algorithm smarter. The thing about Netflix, though, is that it doesn’t offer a comprehensive catalog, especially of recent, hot releases. If you think of any particular movie, it might not be available for streaming. The recommendation engine is a magician’s misdirection, distracting you from the fact that not everything is available. So is the algorithm intelligent, or are people making themselves somewhat blind and silly in order to make the algorithm seem intelligent? What Netflix has done is admirable, because the whole point of Netflix is to deliver theatrical illusions to you. Bravo! (And by the way, after decades of self-certain, snide arguments against copyright and for making art and entertainment “free,” into an all-volunteer zone, look what happened when companies like Netflix and HBO were able to get people to pay for subscriptions for good TV.


pages: 382 words: 120,064

Bank 3.0: Why Banking Is No Longer Somewhere You Go but Something You Do by Brett King

3D printing, additive manufacturing, Airbus A320, Albert Einstein, Amazon Web Services, Any sufficiently advanced technology is indistinguishable from magic, asset-backed security, augmented reality, barriers to entry, bitcoin, bounce rate, business intelligence, business process, business process outsourcing, call centre, capital controls, citizen journalism, Clayton Christensen, cloud computing, credit crunch, crowdsourcing, disintermediation, en.wikipedia.org, fixed income, George Gilder, Google Glasses, high net worth, I think there is a world market for maybe five computers, Infrastructure as a Service, invention of the printing press, Jeff Bezos, jimmy wales, Kickstarter, London Interbank Offered Rate, M-Pesa, Mark Zuckerberg, mass affluent, Metcalfe’s law, microcredit, mobile money, more computing power than Apollo, Northern Rock, Occupy movement, optical character recognition, peer-to-peer, performance metric, Pingit, platform as a service, QR code, QWERTY keyboard, Ray Kurzweil, recommendation engine, RFID, risk tolerance, Robert Metcalfe, self-driving car, Skype, speech recognition, stem cell, telepresence, Tim Cook: Apple, transaction costs, underbanked, US Airways Flight 1549, web application

In Siri’s patent application, various possibilities are hinted at, including being a voice agent providing assistance for “automated teller machines”.4 In fact, SRI (the creator of Siri™) and BBVA recently announced a collaboration to introduce Lola5, a Siri-like technology, to customers through the Internet and via voice. Siri’s near-term capabilities include: 1. Being able to make simple online purchases, such as “Purchase Bank 3.0 from Amazon Kindle” 2. Serving as a recommendation engine or intelligent automated assistant—an “agent avatar”, as it has sometimes been labelled However, there are some challenges in having customers talk into their phones for customer support, or replacing an IVR system with technologies such as Lola, as a recent New York Times article pointed out when it called Siri “the latest public nuisance in the cell phone revolution”. It outlined several scenarios of people using Siri in less than desirable situations (e.g. public transportation) for things as mundane as sending an SMS message wishing a friend a happy birthday.