recommendation engine

90 results


pages: 23 words: 5,264

Designing Great Data Products by Jeremy Howard, Mike Loukides, Margit Zwemer

AltaVista, Filter Bubble, PageRank, pattern recognition, recommendation engine, self-driving car, sentiment analysis, Silicon Valley, text mining

One of the authors of this paper was explaining an iterative optimization technique, and the host says, “So, in a sense Jeremy, your approach was like that of doing a startup, which is just get something out there and iterate and iterate and iterate.” The takeaway, whether you are a tiny startup or a giant insurance company, is that we unconsciously use optimization whenever we decide how to get to where we want to go.

Drivetrain Approach to recommender systems

Let’s look at how we could apply this process to another industry: marketing. We begin by applying the Drivetrain Approach to a familiar example, recommendation engines, and then building this up into an entire optimized marketing strategy. Recommendation engines are a familiar example of a data product based on well-built predictive models that do not achieve an optimal objective. The current algorithms predict what products a customer will like, based on purchase history and the histories of similar customers. A company like Amazon represents every purchase that has ever been made as a giant sparse matrix, with customers as the rows and products as the columns.
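The sparse customer-by-product matrix described above can be sketched in a few lines. This is a minimal illustration, not Amazon's method: the purchase data is made up, the names `build_sparse_matrix`, `jaccard`, and `recommend` are hypothetical, and Jaccard overlap stands in for whatever similarity measure a production system would use.

```python
from collections import defaultdict

# Hypothetical purchase log: (customer, product) pairs. Illustrative data only.
purchases = [
    ("alice", "book_a"), ("alice", "book_b"),
    ("bob", "book_a"), ("bob", "book_c"),
    ("carol", "book_b"), ("carol", "book_c"), ("carol", "book_d"),
]


def build_sparse_matrix(pairs):
    """Store only the non-zero cells: one set of purchased products per customer row."""
    rows = defaultdict(set)
    for customer, product in pairs:
        rows[customer].add(product)
    return dict(rows)


def jaccard(a, b):
    """Overlap between two customers' purchase sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(rows, customer, top_n=3):
    """Score products the customer lacks by the similarity of the customers who own them."""
    own = rows[customer]
    scores = defaultdict(float)
    for other, items in rows.items():
        if other == customer:
            continue
        sim = jaccard(own, items)
        for product in items - own:
            scores[product] += sim
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, score in ranked[:top_n] if score > 0]


matrix = build_sparse_matrix(purchases)
print(recommend(matrix, "alice"))
```

Storing a set per customer keeps memory proportional to the number of purchases rather than to customers times products, which is the point of calling the matrix sparse.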

These models are good at predicting whether a customer will like a given product, but they often suggest products that the customer already knows about or has already decided not to buy. Amazon’s recommendation engine is probably the best one out there, but it’s easy to get it to show its warts. Here is a screenshot of the “Customers Who Bought This Item Also Bought” feed on Amazon from a search for the latest book in Terry Pratchett’s “Discworld” series: All of the recommendations are for other books in the same series, but it’s a good assumption that a customer who searched for “Terry Pratchett” is already aware of these books. There may be some unexpected recommendations on pages 2 through 14 of the feed, but how many customers are going to bother clicking through? Instead, let’s design an improved recommendation engine using the Drivetrain Approach, starting by reconsidering our objective. The objective of a recommendation engine is to drive additional sales by surprising and delighting the customer with books he or she would not have purchased without the recommendation.

O'Reilly Media

Chapter 1. Designing Great Data Products

By Jeremy Howard, Margit Zwemer, and Mike Loukides

In the past few years, we’ve seen many data products based on predictive modeling. These products range from weather forecasting to recommendation engines to services that predict airline flight times more accurately than the airline itself. But these products are still just making predictions, rather than asking what action they want someone to take as a result of a prediction. Prediction technology can be interesting and mathematically elegant, but we need to take the next step. The technology exists to build data products that can revolutionize entire industries.


pages: 519 words: 102,669

Programming Collective Intelligence by Toby Segaran

always be closing, correlation coefficient, Debian, en.wikipedia.org, Firefox, full text search, information retrieval, PageRank, prediction markets, recommendation engine, slashdot, Thomas Bayes, web application

To find a set of links similar to one that you found particularly interesting, you can try:

    >>> url=recommendations.getRecommendations(delusers,user)[0][1]
    >>> recommendations.topMatches(recommendations.transformPrefs(delusers),url)
    [(0.312, u'http://www.fonttester.com/'),
     (0.312, u'http://www.cssremix.com/'),
     (0.266, u'http://www.logoorange.com/color/color-codes-chart.php'),
     (0.254, u'http://yotophoto.com/'),
     (0.254, u'http://www.wpdfd.com/editorial/basics/index.html')]

That's it! You've successfully added a recommendation engine to del.icio.us. There's a lot more that could be done here. Since del.icio.us supports searching by tags, you can look for tags that are similar to each other. You can even search for people trying to manipulate the "popular" pages by posting the same links with multiple accounts.

Item-Based Filtering

The way the recommendation engine has been implemented so far requires the use of all the rankings from every user in order to create a dataset. This will probably work well for a few thousand people or items, but a very large site like Amazon has millions of customers and products—comparing a user with every other user and then comparing every product each user has rated can be very slow.
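The session quoted above depends on the book's `recommendations` module, which is not reproduced in this excerpt. The following is a rough, self-contained sketch of the same idea: flip a user-to-item preference dictionary so that items become the rows, then rank items by how similarly they were rated. The function names (`transform_prefs`, `pearson`, `top_matches`) and the ratings are assumptions made for illustration; this is not the book's code and may differ from it in detail.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: score}. Illustrative data only.
prefs = {
    "ann":  {"link1": 5.0, "link2": 3.0, "link3": 4.0},
    "ben":  {"link1": 4.0, "link2": 2.0, "link3": 5.0},
    "cara": {"link1": 1.0, "link2": 5.0, "link3": 2.0},
}


def transform_prefs(prefs):
    """Flip user -> {item: score} into item -> {user: score}."""
    flipped = {}
    for user, ratings in prefs.items():
        for item, score in ratings.items():
            flipped.setdefault(item, {})[user] = score
    return flipped


def pearson(prefs, a, b):
    """Pearson correlation over the keys that rows a and b have in common."""
    shared = [k for k in prefs[a] if k in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][k] for k in shared]
    ys = [prefs[b][k] for k in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant row has no defined correlation
    return cov / sqrt(var_x * var_y)


def top_matches(prefs, target, n=5):
    """Rank every other row by its similarity to target, best first."""
    scores = [(pearson(prefs, target, other), other)
              for other in prefs if other != target]
    scores.sort(reverse=True)
    return scores[:n]


item_prefs = transform_prefs(prefs)
print(top_matches(item_prefs, "link1"))
```

Because the similarity function only sees dictionary rows, the same `top_matches` call works on users or on items, depending on which orientation of the dictionary is passed in.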

Introduction to Collective Intelligence

Netflix is an online DVD rental company that lets people choose movies to be sent to their homes, and makes recommendations based on the movies that customers have previously rented. In late 2006 it announced a prize of $1 million to the first person to improve the accuracy of its recommendation system by 10 percent, along with progress prizes of $50,000 to the current leader each year for as long as the contest runs. Thousands of teams from all over the world entered and, as of April 2007, the leading team has managed to score an improvement of 7 percent. By using data about which movies each customer enjoyed, Netflix is able to recommend movies to other customers that they may never have even heard of and keep them coming back for more. Any way to improve its recommendation system is worth a lot of money to Netflix. The search engine Google was started in 1998, at a time when there were already several big search engines, and many assumed that a new player would never be able to take on the giants.
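Returning to the Netflix contest mentioned above: the excerpt does not say how "accuracy" was measured. The Netflix Prize scored entries by root mean squared error (RMSE) on held-out ratings, so a 10 percent improvement means an RMSE 10 percent lower than the baseline system's. The sketch below shows the arithmetic on made-up ratings; `rmse` and `relative_improvement` are hypothetical helper names, and the numbers are not contest data.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared / len(actual))


def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which the new error is lower than the baseline error."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Made-up star ratings for four (customer, movie) pairs.
actual = [4.0, 3.0, 5.0, 2.0]
baseline_predictions = [3.0, 3.0, 4.0, 3.0]
candidate_predictions = [3.5, 3.0, 4.5, 2.5]

baseline = rmse(baseline_predictions, actual)
candidate = rmse(candidate_predictions, actual)
print(f"baseline RMSE:  {baseline:.4f}")
print(f"candidate RMSE: {candidate:.4f}")
print(f"improvement:    {relative_improvement(baseline, candidate):.1%}")
```

In this toy case every error is halved, so the improvement is far larger than anything seen in the contest; the point is only how the percentage is computed.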

Google is likely the largest effort—it not only uses web links to rank pages, but it constantly gathers information on when advertisements are clicked by different users, which allows Google to target the advertising more effectively. In Chapter 4 you'll learn about search engines and the PageRank algorithm, an important part of Google's ranking system. Other examples include web sites with recommendation systems. Sites like Amazon and Netflix use information about the things people buy or rent to determine which people or items are similar to one another, and then make recommendations based on purchase history. Other sites like Pandora and Last.fm use your ratings of different bands and songs to create custom radio stations with music they think you will enjoy. Chapter 2 covers ways to build recommendation systems. Prediction markets are also a form of collective intelligence. One of the most well known of these is the Hollywood Stock Exchange (http://hsx.com), where people trade stocks on movies and movie stars.


pages: 1,085 words: 219,144

Solr in Action by Trey Grainger, Timothy Potter

business intelligence, cloud computing, commoditize, conceptual framework, crowdsourcing, data acquisition, en.wikipedia.org, failed state, fault tolerance, finite state, full text search, glass ceiling, information retrieval, natural language processing, performance metric, premature optimization, recommendation engine, web application

Instead of thinking of Solr as a text search engine, it can be mentally freeing to think of Solr as a “matching engine that happens to be able to match on parsed text.” Whether the search is manual or automated is of no consequence to Solr. In fact, several organizations have successfully built recommender systems directly on top of Solr using this thinking. The following sections will cover how to build your own Solr-powered recommendation engine and ultimately how to merge the concepts of a user-driven search experience and an automated recommendation system to provide a powerful, personalized search experience. In particular, we will discuss several content-based recommendation approaches including attribute-based matching, hierarchical-classification-based matching, matching based upon extracted interesting terms (More Like This), concept-based matching, and geographical matching.

This shifts the paradigm completely, because it requires software systems to be intelligent enough to recommend information to users as opposed to having them explicitly search for it. Although organizations such as Netflix and Amazon are well known for their recommender systems and have spent millions of dollars developing them, it’s both possible and easy to develop such systems yourself—particularly on top of Solr—to drastically improve the relevancy of your application.

16.5.1. Search vs. recommendations

When one thinks of a search engine, the vision of a keyword box (and sometimes a separate location box) typically comes to mind. Likewise, when one thinks of a recommendation engine, the vision of a magical algorithm which automatically suggests information based upon past behavior and preferences likely comes to mind. In reality, both search and recommendations are just related forms of matching, with search engines generally matching keywords and locations in a query to keywords and locations in a document, and recommendation engines typically matching behavior of users to documents for which other users exhibited similar behaviors or matching content of one document to the content of another document.

The beauty of collaborative filtering, regardless of the implementation, is that it’s able to work without any knowledge about the content of your documents. Therefore, you could build a recommendation engine based upon Solr with documents containing nothing more than document IDs and users, and you should still see quality recommendations as long as you have enough users linking your documents together. If you don’t put any text content, attributes, or classifications into Solr, then it means you will not be able to make use of those additional techniques at all. The next section will discuss why you may want to consider combining multiple techniques to achieve optimal relevancy in your recommendation system.

16.5.8. Hybrid approaches

Throughout this chapter, you have seen multiple different recommendation approaches, each with its own strengths and weaknesses.


pages: 371 words: 108,317

The Inevitable: Understanding the 12 Technological Forces That Will Shape Our Future by Kevin Kelly

3D printing, A Declaration of the Independence of Cyberspace, AI winter, Airbnb, Albert Einstein, Amazon Web Services, augmented reality, bank run, barriers to entry, Baxter: Rethink Robotics, bitcoin, blockchain, book scanning, Brewster Kahle, Burning Man, cloud computing, commoditize, computer age, connected car, crowdsourcing, dark matter, dematerialisation, Downton Abbey, Edward Snowden, Elon Musk, Filter Bubble, Freestyle chess, game design, Google Glasses, hive mind, Howard Rheingold, index card, indoor plumbing, industrial robot, Internet Archive, Internet of things, invention of movable type, invisible hand, Jaron Lanier, Jeff Bezos, job automation, John Markoff, Kevin Kelly, Kickstarter, lifelogging, linked data, Lyft, M-Pesa, Marc Andreessen, Marshall McLuhan, means of production, megacity, Minecraft, multi-sided market, natural language processing, Netflix Prize, Network effects, new economy, Nicholas Carr, old-boy network, peer-to-peer, peer-to-peer lending, personalized medicine, placebo effect, planetary scale, postindustrial economy, recommendation engine, RFID, ride hailing / ride sharing, Rodney Brooks, self-driving car, sharing economy, Silicon Valley, slashdot, Snapchat, social graph, social web, software is eating the world, speech recognition, Stephen Hawking, Steven Levy, Ted Nelson, the scientific method, transport as a service, two-sided market, Uber for X, Watson beat the top human players on Jeopardy!, Whole Earth Review, zero-sum game

And I’ll make it personal. How would I like to choose what I give my attention to next? First I’d like to be delivered more of what I know I like. This personal filter already exists. It’s called a recommendation engine. It is in wide use at Amazon, Netflix, Twitter, LinkedIn, Spotify, Beats, and Pandora, among other aggregators. Twitter uses a recommendation system to suggest who I should follow based on whom I already follow. Pandora uses a similar system to recommend what new music I’ll like based on what I already like. Over half of the connections made on LinkedIn arise from their follower recommender. Amazon’s recommendation engine is responsible for the well-known banner that “others who like this item also liked this next item.” Netflix uses the same to recommend movies for me. Clever algorithms churn through a massive history of everyone’s behavior in order to closely predict my own behavior.

Amazon’s greatest asset is not its Prime delivery service but the millions of reader reviews it has accumulated over decades. Readers will pay for Amazon’s all-you-can-read ebook service, Kindle Unlimited, even though they will be able to find ebooks for free elsewhere, because Amazon’s reviews will guide them to books they want to read. Ditto for Netflix. Movie fans will pay Netflix because their recommendation engine finds gems they would not otherwise discover. They may be free somewhere else, but they are essentially lost and buried. In these examples, you are not paying for the copies, you are paying for the findability. • • • These eight qualities require a new skill set for creators. Success no longer derives from mastering distribution. Distribution is nearly automatic; it’s all streams. The Great Copy Machine in the Sky takes care of that.



pages: 348 words: 39,850

Data Scientists at Work by Sebastian Gutierrez

Albert Einstein, algorithmic trading, Bayesian statistics, bioinformatics, bitcoin, business intelligence, chief data officer, clean water, cloud computing, commoditize, computer vision, continuous integration, correlation does not imply causation, creative destruction, crowdsourcing, data is the new oil, DevOps, domain-specific language, Donald Knuth, follow your passion, full text search, informal economy, information retrieval, Infrastructure as a Service, Intergovernmental Panel on Climate Change (IPCC), inventory management, iterative process, lifelogging, linked data, Mark Zuckerberg, microbiome, Moneyball by Michael Lewis explains big data, move fast and break things, natural language processing, Network effects, nuclear winter, optical character recognition, pattern recognition, Paul Graham, personalized medicine, Peter Thiel, pre–internet, quantitative hedge fund, quantitative trading / quantitative finance, recommendation engine, Renaissance Technologies, Richard Feynman, self-driving car, side project, Silicon Valley, Skype, software as a service, speech recognition, statistical model, Steve Jobs, stochastic process, technology bubble, text mining, the scientific method, web application

If you were looking for more of an individual output thing, I’m probably most proud of some work I did at Intuit prior to it acquiring Mint. We had a scrappy little team of four people doing an internal startup-like project. I had the chance to lead the creation of a personalization system. It was Mint-like in that we were using a recommendation engine to match a couple hundred advertisers we signed up and who had coupons to people based on people’s spending behaviors. It was super exciting to build a whole recommendation system from scratch that actually worked quite well. It contributed to Intuit’s decision to acquire Mint, because the project was sort of a proof of concept that we could do it and make it work. Gutierrez: What is a typical Netflix day for you and your team? Smallwood: It would be quite different for me versus my team, so I’ll talk about my team.

It’s not about celebrating the material part of it, but it’s about you wanting to look good and have a great night. And everyone should be able to do that. Gutierrez: How do you pick projects to work on? Smith: Interest and ability to persuade others that it’s a good project. A great deal of my work here has been in support for other people’s projects. For instance, one thing I’ve worked on is research into the recommendations system. They built the recommendation system and it’s been running. Now I am doing the research into how it’s actually working and if it’s actually working. Many of the projects end up being formulated this way. I think of an idea or a different hypothesis or assumption than what we are currently doing, and I go and test it. Then I present the data and we discuss the findings. From there we can figure out where to go next.

So as a data scientist, even if I don’t have the domain expertise I can learn it, and can work on any problem that can be quantitatively described. I can almost guarantee that I won’t be in fashion retail in my forties, but I’m sure I’ll be working on something that relies on data and using similar techniques and methodologies. Gutierrez: How would you describe your work to a data scientist? Shellman: I build the recommendation engines like the ones you’re used to seeing all over the web, and sometimes I do it with really unique data, like transactions involving personal stylists in our brick-and-mortar stores or color trends from fabrics. Gutierrez: What have you been working on recently? Shellman: Over the last year and a half I’ve mostly worked on Recommendo, building new algorithms and the real-time scorer. For the last couple months we’ve been working on a follow-up to Recommendo that will offer customer segmentation as a service.


pages: 302 words: 73,581

Platform Scale: How an Emerging Business Model Helps Startups Build Large Empires With Minimum Investment by Sangeet Paul Choudary

3D printing, Airbnb, Amazon Web Services, barriers to entry, bitcoin, blockchain, business process, Chuck Templeton: OpenTable, Clayton Christensen, collaborative economy, commoditize, crowdsourcing, cryptocurrency, data acquisition, frictionless, game design, hive mind, Internet of things, invisible hand, Kickstarter, Lean Startup, Lyft, M-Pesa, Marc Andreessen, Mark Zuckerberg, means of production, multi-sided market, Network effects, new economy, Paul Graham, recommendation engine, ride hailing / ride sharing, shareholder value, sharing economy, Silicon Valley, Skype, Snapchat, social graph, social software, software as a service, software is eating the world, Spread Networks laid a new fibre optics cable between New York and Chicago, TaskRabbit, the payments system, too big to fail, transport as a service, two-sided market, Uber and Lyft, Uber for X, Wave and Pay

Context may be static or dynamic. Many Web 1.0 era filters were created based on long sign-up forms that the user filled out. Today, filters are created based on data captured on an ongoing basis through a user’s actions. Filters may be standalone or collaborative. Amazon’s “People who purchased this product also purchased this product” feature is based on a collaborative filter. Many recommendation platforms allow users to filter results based on a “people like you” parameter. This, again, is a collaborative filter. The most important innovation in recent times that has led to the spread of collaborative filters is the implementation of Facebook’s social graph. Through the social graph, third-party platforms like TripAdvisor serve reviews based on a collaborative filter of people who are close to you on the graph.


pages: 353 words: 104,146

European Founders at Work by Pedro Gairifo Santos

business intelligence, cloud computing, crowdsourcing, fear of failure, full text search, information retrieval, inventory management, iterative process, Jeff Bezos, Lean Startup, Mark Zuckerberg, natural language processing, pattern recognition, pre–internet, recommendation engine, Richard Stallman, Silicon Valley, Skype, slashdot, Steve Jobs, Steve Wozniak, subscription business, technology bubble, web application, Y Combinator

All this time, we were mainly concerned with keeping the site afloat, keeping it fast, scaling up properly, and this sort of scrobbling data and radio. The recommendation engine wasn't brilliant to begin with. And then, we finally decided we needed to hire somebody who knows what they're doing, who's going to work on this full-time. We e-mailed some mailing lists. We e-mailed the ISMIR[2] mailing list. They're a group who meet every year about music recommendations and information retrieval in music. We ended up hiring a guy called Norman, who was both a great scientist and understood all the algorithms and captive audience sort of things, but also an excellent programmer who was able to implement all these ideas. So we got really lucky. The first person we hired was great and he just took over. He chucked out all of our crappy recommendation systems we had and built something good, and then improved it constantly for the next several years. So we had some A/B testing, split testing systems in there for the radio so they could try out new tweaks to the algorithms and see what was performing better.

__________
[2] The International Society for Music Information Retrieval

They weren't even interested in recommendations at that point. I didn't really have a good recommender system for a long time. From your listening stats, you could click on an artist, and see who else had been listening to them. You could then see the listening stats of the other fans of artists you like. Just that system of connecting all the listening tastes proved to be really quite addictive. It spread by word of mouth. And then toward the end of my degree, I started working on some collaborative filtering recommendation stuff. Obviously that all tapped into some latent interest that people have in stats on their music listening. So I knew that recommendations weren't necessarily the main focus at that point. Not for a couple years after that did we have a really good recommender system. Music recommendation never really was my field, but I had a go at it, and then later on we hired somebody who knew what they were doing.

Santos: Did you ever have any court problems with any of the copyright holders? Jones: Nothing substantial, really. I think sometimes rights holders, especially in the music industry, will use court action or the threat of court action as a sort of negotiating position. But, no. I think we managed to avoid anything serious in that regard. Santos: From the technical point of view, the actual recommendation engine and statistics, how does that actually work? How hard was it to develop it and tweak it? Did you change the approach many times? Did you have a clear idea on how to do it from the start? Jones: So initially when I was building it, we tried all sorts of stuff. I think what I was using for a long time in the beginning was just to use Lucene, a document indexing system. We just created fake documents of people's profiles.
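The excerpt stops before explaining how the "fake documents" were used. One plausible reading is that each listener's profile is treated as a document whose terms are artist names, so that finding similar listeners becomes a document-similarity search. The sketch below illustrates that reading in plain Python with a simplified TF-IDF weighting and cosine similarity. It does not use Lucene, it is not Lucene's scoring formula, and the profiles and function names are invented for illustration.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical listening profiles: user -> play counts per artist.
profiles = {
    "u1": {"radiohead": 40, "portishead": 10},
    "u2": {"radiohead": 25, "portishead": 5, "bjork": 8},
    "u3": {"metallica": 50, "slayer": 20},
}


def tfidf_vectors(profiles):
    """Treat each profile as a document of artist terms, weighted by tf * idf."""
    n = len(profiles)
    # Document frequency: in how many profiles each artist appears.
    doc_freq = Counter(artist for plays in profiles.values() for artist in plays)
    return {
        user: {artist: count * (1.0 + log(n / doc_freq[artist]))
               for artist, count in plays.items()}
        for user, plays in profiles.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def similar_users(profiles, user):
    """Rank all other profiles by similarity to the given user's profile."""
    vectors = tfidf_vectors(profiles)
    scores = [(cosine(vectors[user], vec), other)
              for other, vec in vectors.items() if other != user]
    return sorted(scores, reverse=True)


print(similar_users(profiles, "u1"))
```

The inverse-document-frequency term gives less weight to artists that appear in many profiles, which is the usual reason for borrowing text-search weighting for this kind of matching.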


pages: 368 words: 96,825

Bold: How to Go Big, Create Wealth and Impact the World by Peter H. Diamandis, Steven Kotler

3D printing, additive manufacturing, Airbnb, Amazon Mechanical Turk, Amazon Web Services, augmented reality, autonomous vehicles, cloud computing, creative destruction, crowdsourcing, Daniel Kahneman / Amos Tversky, dematerialisation, deskilling, Elon Musk, en.wikipedia.org, Exxon Valdez, fear of failure, Firefox, Galaxy Zoo, Google Glasses, Google Hangouts, Google X / Alphabet X, gravity well, ImageNet competition, industrial robot, Internet of things, Jeff Bezos, John Harrison: Longitude, John Markoff, Jono Bacon, Just-in-time delivery, Kickstarter, Kodak vs Instagram, Law of Accelerating Returns, Lean Startup, life extension, loss aversion, Louis Pasteur, Mahatma Gandhi, Marc Andreessen, Mark Zuckerberg, Mars Rover, meta analysis, meta-analysis, microbiome, minimum viable product, move fast and break things, Narrative Science, Netflix Prize, Network effects, Oculus Rift, optical character recognition, packet switching, PageRank, pattern recognition, performance metric, Peter H. Diamandis: Planetary Resources, Peter Thiel, pre–internet, Ray Kurzweil, recommendation engine, Richard Feynman, ride hailing / ride sharing, risk tolerance, rolodex, self-driving car, sentiment analysis, shareholder value, Silicon Valley, Silicon Valley startup, skunkworks, Skype, smart grid, stem cell, Stephen Hawking, Steve Jobs, Steven Levy, Stewart Brand, technoutopianism, telepresence, telepresence robot, Turing test, urban renewal, web application, X Prize, Y Combinator, zero-sum game

Thus, if you could create an incentive prize that harnessed this competitive love of coding and this argumentative love of movies and tied them together—meaning design a prize around the intrinsic motivations at the core of coder culture—what might be possible? Well, in the case of Netflix, a better movie recommendation engine. A movie recommendation engine is a bit of software that tells you what movie you might want to watch next based on movies you’ve already watched and rated (on a scale of one to five stars). Netflix’s original recommendation engine, Cinematch, was created back in 2000 and quickly proved to be a wild success. Within a few years, nearly two-thirds of their rental business was being driven by their recommendation engine. Thus the obvious corollary: the better their recommendation engine, the better their business. And that was the problem. By the middle 2000s, Netflix engineers had plucked all the low-hanging fruit and the rate of Cinematch optimization had slowed to a crawl.

The prize hunters, even the leaders, are startlingly open about the methods they’re using, acting more like academics huddled over a knotty problem than entrepreneurs jostling for a $1 million payday. In December 2006, a competitor called ‘simonfunk’ posted a complete description of his algorithm—which at the time was tied for third place—giving everyone else the opportunity to piggyback on his progress. ‘We had no idea the extent to which people would collaborate with each other,’ says Jim Bennett, vice president for recommendation systems at Netflix.”16 And this isn’t an aberration. Over the course of the eight XPRIZEs launched to date, there has been an extraordinary amount of cooperation. We’ve seen teams providing unsolicited advice, teams merging, teams acquiring and sharing technology and experts. When the prize is driven by an MTP, while a team’s primary purpose is to win, a close second is their desire to see the primary objective achieved; thus teams exhibit a much higher willingness to share.


pages: 304 words: 82,395

Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schonberger, Kenneth Cukier

23andMe, Affordable Care Act / Obamacare, airport security, AltaVista, barriers to entry, Berlin Wall, big data - Walmart - Pop Tarts, Black Swan, book scanning, business intelligence, business process, call centre, cloud computing, computer age, correlation does not imply causation, dark matter, double entry bookkeeping, Eratosthenes, Erik Brynjolfsson, game design, IBM and the Holocaust, index card, informal economy, intangible asset, Internet of things, invention of the printing press, Jeff Bezos, lifelogging, Louis Pasteur, Mark Zuckerberg, Menlo Park, Moneyball by Michael Lewis explains big data, Nate Silver, natural language processing, Netflix Prize, Network effects, obamacare, optical character recognition, PageRank, performance metric, Peter Thiel, Post-materialism, post-materialism, random walk, recommendation engine, self-driving car, sentiment analysis, Silicon Valley, Silicon Valley startup, smart grid, smart meter, social graph, speech recognition, Steve Jobs, Steven Levy, the scientific method, The Signal and the Noise by Nate Silver, The Wealth of Nations by Adam Smith, Turing test, Watson beat the top human players on Jeopardy!

In fact, the company approached its business model in that order, which is the inverse of the norm. It initially only had the idea for its celebrated recommendation system. Its stock market prospectus in 1997 described “collaborative filtering” before Amazon knew how it would work in practice or had enough data to make it useful. Both Google and Amazon span the categories, but their strategies differ. When Google first sets out to collect any sort of data, it has secondary uses in mind. Its Street View cars, as we have seen, collected GPS information not just for its map service but also to train self-driving cars. By contrast, Amazon is more focused on the primary use of data and only taps the secondary uses as a marginal bonus. Its recommendation system, for example, relies on clickstream data as a signal, but the company hasn’t used the information to do extraordinary things like predict the state of the economy or flu outbreaks.

Companies that have failed to appreciate the importance of data’s reuse have learned their lesson the hard way. For example, in Amazon’s early days it signed a deal with AOL to run the technology behind AOL’s e-commerce site. To most people, it looked like an ordinary outsourcing deal. But what really interested Amazon, explains Andreas Weigend, Amazon’s former chief scientist, was getting hold of data on what AOL users were looking at and buying, which would improve the performance of its recommendation engine. Poor AOL never realized this. It only saw the data’s value in terms of its primary purpose—sales. Clever Amazon knew it could reap benefits by putting the data to a secondary use. Or take the case of Google’s entry into speech recognition with GOOG-411 for local search listings, which ran from 2007 to 2010. The search giant didn’t have its own speech-recognition technology so needed to license it.

Buy a book on Poland and you’d be bombarded with Eastern European fare. Purchase one about babies and you’d be inundated with more of the same. “They tended to offer you tiny variations on your previous purchase, ad infinitum,” recalled James Marcus, an Amazon book reviewer from 1996 to 2001, in his memoir, Amazonia. “It felt as if you had gone shopping with the village idiot.” Greg Linden saw a solution. He realized that the recommendation system didn’t actually need to compare people with other people, a task that was technically cumbersome. All it needed to do was find associations among products themselves. In 1998 Linden and his colleagues applied for a patent on “item-to-item” collaborative filtering, as the technique is known. The shift in approach made a big difference. Because the calculations could be done ahead of time, the recommendations were lightning fast.
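The item-to-item idea in the excerpt above can be sketched in a few lines. This is a minimal illustration, not Amazon's patented method: the cosine-style score over co-purchase counts, the function names `build_item_similarities` and `recommend`, and the toy baskets are all assumptions made for the example. It shows why the approach is fast at request time: the item-to-item table is computed ahead of time, and a recommendation is then only a lookup and a sort.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_item_similarities(baskets):
    """Offline step: score every pair of items that were bought together.

    `baskets` maps a customer id to the set of items that customer bought.
    The score is co-purchases divided by the geometric mean of the two
    items' purchase counts (cosine similarity on binary purchase vectors).
    """
    item_counts = defaultdict(int)  # customers who bought each item
    pair_counts = defaultdict(int)  # customers who bought both items

    for items in baskets.values():
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1

    similar = defaultdict(dict)
    for (a, b), both in pair_counts.items():
        score = both / sqrt(item_counts[a] * item_counts[b])
        similar[a][b] = score
        similar[b][a] = score
    return similar


def recommend(similar, purchased, top_n=3):
    """Online step: rank unseen items by summed similarity to past purchases."""
    scores = defaultdict(float)
    for item in purchased:
        for other, score in similar.get(item, {}).items():
            if other not in purchased:
                scores[other] += score
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical toy data, for illustration only.
baskets = {
    "alice": {"poland_guide", "prague_guide", "phrasebook"},
    "bob": {"poland_guide", "phrasebook"},
    "carol": {"baby_names", "sleep_training"},
    "dave": {"poland_guide", "prague_guide"},
}
similar = build_item_similarities(baskets)   # computed ahead of time
print(recommend(similar, {"poland_guide"}))  # cheap lookup at request time
```

Note that no customer is ever compared with another customer here; only products are compared with products, which is the shift the excerpt describes.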


pages: 377 words: 97,144

Singularity Rising: Surviving and Thriving in a Smarter, Richer, and More Dangerous World by James D. Miller


23andMe, affirmative action, Albert Einstein, artificial general intelligence, Asperger Syndrome, barriers to entry, brain emulation, cloud computing, cognitive bias, correlation does not imply causation, crowdsourcing, Daniel Kahneman / Amos Tversky, David Brooks, David Ricardo: comparative advantage, Deng Xiaoping, en.wikipedia.org, feminist movement, Flynn Effect, friendly AI, hive mind, impulse control, indoor plumbing, invention of agriculture, Isaac Newton, John von Neumann, knowledge worker, Long Term Capital Management, low skilled workers, Netflix Prize, neurotypical, pattern recognition, Peter Thiel, phenotype, placebo effect, prisoner's dilemma, profit maximization, Ray Kurzweil, recommendation engine, reversible computing, Richard Feynman, Rodney Brooks, Silicon Valley, Singularitarianism, Skype, statistical model, Stephen Hawking, Steve Jobs, supervolcano, technological singularity, The Coming Technological Singularity, the scientific method, Thomas Malthus, transaction costs, Turing test, Vernor Vinge, Von Neumann architecture

A big part of our brain is devoted to processing visual inputs. Hence, a good recommendation system would necessarily have powerful insights into a significant chunk of our brains.
3. Measurable Incremental Progress: Think of AI as a destination a thousand miles away with the entire pathway hidden by fog. To reach our destination, we need to take many small steps, and for each step we need a way to determine if we have gone in the right direction. A video recommendation system provides this corrective by gathering continuous feedback on how many users liked the recommended videos.
4. Profitable with Every Step: Businesses are more motivated to invest in a type of innovation if they can continually increase revenue with each small improvement. Consequently, an application such as a video recommendation engine in which each improvement increases consumer satisfaction is (all else being equal) more likely to attract large corporate investment than an application that would have value only if it achieved near-human-level intelligence.
5. Amenable to Parallel Processing: Imagine we want to move a heavy object from point A to point B.

Fortunately, with video recommendations, many challenges, such as finding what type of cat video a certain set of users might enjoy, can be worked on independently for reasonably long periods of time.
6. Free Labor from Customers: A recommendation system would rely on millions of people to freely help train the system by picking which videos to watch, rating some of the videos they see, writing reviews of videos, and labeling in words the content they upload.
7. Help from Advertisers and Political Consultants: Salesmen would eagerly seek to learn what types of messages appealed to different factions of the population. The recommendation system could piggyback on these salesmen’s attempts to understand their clientele and use their insights to improve recommendation software.
8. AI and Human Recommenders Could Productively Work Together: Unlike what YouTube currently does, an effective AI recommendation system could make use of human evaluators. When my son was four, he enjoyed watching YouTube videos of supernovas and children’s cartoons.

For example, if 90 percent of people who had some unusual allele or brain microstructure enjoyed a certain cat video, then the AI recommender would suggest the video to all other viewers who had that trait.
12. Amenable to Crowdsourcing: Netflix, the rent-by-mail and streaming video distributor, offered (and eventually paid) a $1 million prize to whichever group improved its recommendation system the most, so long as at least one group improved the system by at least 10 percent. This “crowdsourcing,” which occurs when a problem is thrown open to anyone, helps a company by allowing them to draw on the talents of strangers, while only paying the strangers if they help the firm. This kind of crowdsourcing works only if, as with a video recommendation system, there is an easy and objective way of measuring progress toward the crowdsourced goal.
13. Potential Improvement All the Way Up to Superhuman Artificial General Intelligence: A recommendation AI could slowly morph into a content creator.
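To make the "easy and objective way of measuring progress" concrete, here is a minimal sketch of how a percentage improvement between two rating predictors could be scored. The excerpt does not name the metric; root-mean-square error (RMSE) on held-out ratings is assumed here, and the helper names and the toy ratings are hypothetical.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared_error / len(actual))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate reduces the baseline's error."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


# Hypothetical held-out ratings and two sets of predictions.
actual = [4, 3, 5, 1]
baseline = [3, 3, 3, 3]    # e.g. always predict the average rating
candidate = [4, 3, 4, 2]   # a contestant's predictions

gain = relative_improvement(rmse(baseline, actual), rmse(candidate, actual))
qualifies = gain >= 0.10   # the 10 percent threshold from the excerpt
print(f"improvement: {gain:.1%}, qualifies: {qualifies}")
```

Because the score is a single number computed from data the organizer holds back, any stranger's submission can be ranked without anyone having to judge the approach itself.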

Remix: Making Art and Commerce Thrive in the Hybrid Economy by Lawrence Lessig


Amazon Web Services, Andrew Keen, Benjamin Mako Hill, Berlin Wall, Bernie Sanders, Brewster Kahle, Cass Sunstein, collaborative editing, commoditize, disintermediation, don't be evil, Erik Brynjolfsson, Internet Archive, invisible hand, Jeff Bezos, jimmy wales, Kevin Kelly, Larry Wall, late fees, Mark Shuttleworth, Netflix Prize, Network effects, new economy, optical character recognition, PageRank, peer-to-peer, recommendation engine, revision control, Richard Stallman, Ronald Coase, Saturday Night Live, SETI@home, sharing economy, Silicon Valley, Skype, slashdot, Steve Jobs, The Nature of the Firm, thinkpad, transaction costs, VA Linux, yellow journalism

And so increasingly, we must ask how these different norms might be made to coexist. Jeff Jarvis, journalist and blogger, suggests companies “pay dividends back to [the] crowd” and avoid trying too hard “to control [the gathered] wisdom, and limit its use and the sharing of it.” [19] Tapscott and Williams make the same recommendation: “platforms for participation will only remain viable for as long as all the stakeholders are adequately and appropriately compensated for their contributions; don’t expect a free ride forever.” [20] The key word here is “appropriately.” Obviously, there must be adequate compensation. But the kind of compensation is the puzzle. Once again, the “sharing economy” of two lovers is one in which both need to be concerned that the other is “adequately and appropriately compensated for [his or her] contribution.”


Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (O’Reilly, 2017)

active measures, Amazon Web Services, bitcoin, blockchain, business intelligence, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, database schema, DevOps, distributed ledger, Donald Knuth, Edward Snowden, ethereum blockchain, fault tolerance, finite state, Flash crash, full text search, general-purpose programming language, informal economy, information retrieval, Internet of things, iterative process, John von Neumann, loose coupling, Marc Andreessen, natural language processing, Network effects, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, statistical model, web application, WebSocket, wikimedia commons

Graphs and Iterative Processing
In “Graph-Like Data Models” on page 49 we discussed using graphs for modeling data, and using graph query languages to traverse the edges and vertices in a graph. The discussion in Chapter 2 was focused around OLTP-style use: quickly executing queries to find a small number of vertices matching certain criteria. It is also interesting to look at graphs in a batch processing context, where the goal is to perform some kind of offline processing or analysis on an entire graph. This need often arises in machine learning applications such as recommendation engines, or in ranking systems. For example, one of the most famous graph analysis algorithms is PageRank [69], which tries to estimate the popularity of a web page based on what other web pages link to it. It is used as part of the formula that determines the order in which web search engines present their results. Dataflow engines like Spark, Flink, and Tez (see “Materialization of Intermediate State” on page 419) typically arrange the operators in a job as a directed acyclic graph (DAG).
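As a rough illustration of what "offline processing on an entire graph" means, the following is a simplified, single-machine sketch of PageRank by repeated iteration. It is not how a search engine or a distributed batch framework implements it: the damping factor of 0.85, the fixed iteration count, the even redistribution from pages without outgoing links, and the toy link graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page popularity from a dict of page -> list of linked pages.

    Every iteration visits the whole graph, which is why this kind of
    algorithm fits batch processing rather than OLTP-style lookups.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass this page's rank along its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web, for illustration only.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Each pass needs the complete output of the previous pass, so the loop is an iterative job over the full dataset; that repeated whole-graph traversal is the property that makes the batch setting a natural fit.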

The opposite of bounded. 558 | Glossary Index A aborts (transactions), 222, 224 in two-phase commit, 356 performance of optimistic concurrency con‐ trol, 266 retrying aborted transactions, 231 abstraction, 21, 27, 222, 266, 321 access path (in network model), 37, 60 accidental complexity, removing, 21 accountability, 535 ACID properties (transactions), 90, 223 atomicity, 223, 228 consistency, 224, 529 durability, 226 isolation, 225, 228 acknowledgements (messaging), 445 active/active replication (see multi-leader repli‐ cation) active/passive replication (see leader-based rep‐ lication) ActiveMQ (messaging), 137, 444 distributed transaction support, 361 ActiveRecord (object-relational mapper), 30, 232 actor model, 138 (see also message-passing) comparison to Pregel model, 425 comparison to stream processing, 468 Advanced Message Queuing Protocol (see AMQP) aerospace systems, 6, 10, 305, 372 aggregation data cubes and materialized views, 101 in batch processes, 406 in stream processes, 466 aggregation pipeline query language, 48 Agile, 22 minimizing irreversibility, 414, 497 moving faster with confidence, 532 Unix philosophy, 394 agreement, 365 (see also consensus) Airflow (workflow scheduler), 402 Ajax, 131 Akka (actor framework), 139 algorithms algorithm correctness, 308 B-trees, 79-83 for distributed systems, 306 hash indexes, 72-75 mergesort, 76, 402, 405 red-black trees, 78 SSTables and LSM-trees, 76-79 all-to-all replication topologies, 175 AllegroGraph (database), 50 ALTER TABLE statement (SQL), 40, 111 Amazon Dynamo (database), 177 Amazon Web Services (AWS), 8 Kinesis Streams (messaging), 448 network reliability, 279 postmortems, 9 RedShift (database), 93 S3 (object storage), 398 checking data integrity, 530 amplification of bias, 534 of failures, 364, 495 Index | 559 of tail latency, 16, 207 write amplification, 84 AMQP (Advanced Message Queuing Protocol), 444 (see also messaging systems) comparison to log-based messaging, 448, 451 message ordering, 446 
analytics, 90 comparison to transaction processing, 91 data warehousing (see data warehousing) parallel query execution in MPP databases, 415 predictive (see predictive analytics) relation to batch processing, 411 schemas for, 93-95 snapshot isolation for queries, 238 stream analytics, 466 using MapReduce, analysis of user activity events (example), 404 anti-caching (in-memory databases), 89 anti-entropy, 178 Apache ActiveMQ (see ActiveMQ) Apache Avro (see Avro) Apache Beam (see Beam) Apache BookKeeper (see BookKeeper) Apache Cassandra (see Cassandra) Apache CouchDB (see CouchDB) Apache Curator (see Curator) Apache Drill (see Drill) Apache Flink (see Flink) Apache Giraph (see Giraph) Apache Hadoop (see Hadoop) Apache HAWQ (see HAWQ) Apache HBase (see HBase) Apache Helix (see Helix) Apache Hive (see Hive) Apache Impala (see Impala) Apache Jena (see Jena) Apache Kafka (see Kafka) Apache Lucene (see Lucene) Apache MADlib (see MADlib) Apache Mahout (see Mahout) Apache Oozie (see Oozie) Apache Parquet (see Parquet) Apache Qpid (see Qpid) Apache Samza (see Samza) Apache Solr (see Solr) Apache Spark (see Spark) 560 | Index Apache Storm (see Storm) Apache Tajo (see Tajo) Apache Tez (see Tez) Apache Thrift (see Thrift) Apache ZooKeeper (see ZooKeeper) Apama (stream analytics), 466 append-only B-trees, 82, 242 append-only files (see logs) Application Programming Interfaces (APIs), 5, 27 for batch processing, 403 for change streams, 456 for distributed transactions, 361 for graph processing, 425 for services, 131-136 (see also services) evolvability, 136 RESTful, 133 SOAP, 133 application state (see state) approximate search (see similarity search) archival storage, data from databases, 131 arcs (see edges) arithmetic mean, 14 ASCII text, 119, 395 ASN.1 (schema language), 127 asynchronous networks, 278, 553 comparison to synchronous networks, 284 formal model, 307 asynchronous replication, 154, 553 conflict detection, 172 data loss on failover, 157 reads from asynchronous 
follower, 162 Asynchronous Transfer Mode (ATM), 285 atomic broadcast (see total order broadcast) atomic clocks (caesium clocks), 294, 295 (see also clocks) atomicity (concurrency), 553 atomic increment-and-get, 351 compare-and-set, 245, 327 (see also compare-and-set operations) replicated operations, 246 write operations, 243 atomicity (transactions), 223, 228, 553 atomic commit, 353 avoiding, 523, 528 blocking and nonblocking, 359 in stream processing, 360, 477 maintaining derived data, 453 for multi-object transactions, 229 for single-object writes, 230 auditability, 528-533 designing for, 531 self-auditing systems, 530 through immutability, 460 tools for auditable data systems, 532 availability, 8 (see also fault tolerance) in CAP theorem, 337 in service level agreements (SLAs), 15 Avro (data format), 122-127 code generation, 127 dynamically generated schemas, 126 object container files, 125, 131, 414 reader determining writer’s schema, 125 schema evolution, 123 use in Hadoop, 414 awk (Unix tool), 391 AWS (see Amazon Web Services) Azure (see Microsoft) B B-trees (indexes), 79-83 append-only/copy-on-write variants, 82, 242 branching factor, 81 comparison to LSM-trees, 83-85 crash recovery, 82 growing by splitting a page, 81 optimizations, 82 similarity to dynamic partitioning, 212 backpressure, 441, 553 in TCP, 282 backups database snapshot for replication, 156 integrity of, 530 snapshot isolation for, 238 use for ETL processes, 405 backward compatibility, 112 BASE, contrast to ACID, 223 bash shell (Unix), 70, 395, 503 batch processing, 28, 389-431, 553 combining with stream processing lambda architecture, 497 unifying technologies, 498 comparison to MPP databases, 414-418 comparison to stream processing, 464 comparison to Unix, 413-414 dataflow engines, 421-423 fault tolerance, 406, 414, 422, 442 for data integration, 494-498 graphs and iterative processing, 424-426 high-level APIs and languages, 403, 426-429 log-based messaging and, 451 maintaining derived 
state, 495 MapReduce and distributed filesystems, 397-413 (see also MapReduce) measuring performance, 13, 390 outputs, 411-413 key-value stores, 412 search indexes, 411 using Unix tools (example), 391-394 Bayou (database), 522 Beam (dataflow library), 498 bias, 534 big ball of mud, 20 Bigtable data model, 41, 99 binary data encodings, 115-128 Avro, 122-127 MessagePack, 116-117 Thrift and Protocol Buffers, 117-121 binary encoding based on schemas, 127 by network drivers, 128 binary strings, lack of support in JSON and XML, 114 BinaryProtocol encoding (Thrift), 118 Bitcask (storage engine), 72 crash recovery, 74 Bitcoin (cryptocurrency), 532 Byzantine fault tolerance, 305 concurrency bugs in exchanges, 233 bitmap indexes, 97 blockchains, 532 Byzantine fault tolerance, 305 blocking atomic commit, 359 Bloom (programming language), 504 Bloom filter (algorithm), 79, 466 BookKeeper (replicated log), 372 Bottled Water (change data capture), 455 bounded datasets, 430, 439, 553 (see also batch processing) bounded delays, 553 in networks, 285 process pauses, 298 broadcast hash joins, 409 Index | 561 brokerless messaging, 442 Brubeck (metrics aggregator), 442 BTM (transaction coordinator), 356 bulk synchronous parallel (BSP) model, 425 bursty network traffic patterns, 285 business data processing, 28, 90, 390 byte sequence, encoding data in, 112 Byzantine faults, 304-306, 307, 553 Byzantine fault-tolerant systems, 305, 532 Byzantine Generals Problem, 304 consensus algorithms and, 366 C caches, 89, 553 and materialized views, 101 as derived data, 386, 499-504 database as cache of transaction log, 460 in CPUs, 99, 338, 428 invalidation and maintenance, 452, 467 linearizability, 324 CAP theorem, 336-338, 554 Cascading (batch processing), 419, 427 hash joins, 409 workflows, 403 cascading failures, 9, 214, 281 Cascalog (batch processing), 60 Cassandra (database) column-family data model, 41, 99 compaction strategy, 79 compound primary key, 204 gossip protocol, 216 hash 
partitioning, 203-205 last-write-wins conflict resolution, 186, 292 leaderless replication, 177 linearizability, lack of, 335 log-structured storage, 78 multi-datacenter support, 184 partitioning scheme, 213 secondary indexes, 207 sloppy quorums, 184 cat (Unix tool), 391 causal context, 191 (see also causal dependencies) causal dependencies, 186-191 capturing, 191, 342, 494, 514 by total ordering, 493 causal ordering, 339 in transactions, 262 sending message to friends (example), 494 562 | Index causality, 554 causal ordering, 339-343 linearizability and, 342 total order consistent with, 344, 345 consistency with, 344-347 consistent snapshots, 340 happens-before relationship, 186 in serializable transactions, 262-265 mismatch with clocks, 292 ordering events to capture, 493 violations of, 165, 176, 292, 340 with synchronized clocks, 294 CEP (see complex event processing) certificate transparency, 532 chain replication, 155 linearizable reads, 351 change data capture, 160, 454 API support for change streams, 456 comparison to event sourcing, 457 implementing, 454 initial snapshot, 455 log compaction, 456 changelogs, 460 change data capture, 454 for operator state, 479 generating with triggers, 455 in stream joins, 474 log compaction, 456 maintaining derived state, 452 Chaos Monkey, 7, 280 checkpointing in batch processors, 422, 426 in high-performance computing, 275 in stream processors, 477, 523 chronicle data model, 458 circuit-switched networks, 284 circular buffers, 450 circular replication topologies, 175 clickstream data, analysis of, 404 clients calling services, 131 pushing state changes to, 512 request routing, 214 stateful and offline-capable, 170, 511 clocks, 287-299 atomic (caesium) clocks, 294, 295 confidence interval, 293-295 for global snapshots, 294 logical (see logical clocks) skew, 291-294, 334 slewing, 289 synchronization and accuracy, 289-291 synchronization using GPS, 287, 290, 294, 295 time-of-day versus monotonic clocks, 288 timestamping 
events, 471 cloud computing, 146, 275 need for service discovery, 372 network glitches, 279 shared resources, 284 single-machine reliability, 8 Cloudera Impala (see Impala) clustered indexes, 86 CODASYL model, 36 (see also network model) code generation with Avro, 127 with Thrift and Protocol Buffers, 118 with WSDL, 133 collaborative editing multi-leader replication and, 170 column families (Bigtable), 41, 99 column-oriented storage, 95-101 column compression, 97 distinction between column families and, 99 in batch processors, 428 Parquet, 96, 131, 414 sort order in, 99-100 vectorized processing, 99, 428 writing to, 101 comma-separated values (see CSV) command query responsibility segregation (CQRS), 462 commands (event sourcing), 459 commits (transactions), 222 atomic commit, 354-355 (see also atomicity; transactions) read committed isolation, 234 three-phase commit (3PC), 359 two-phase commit (2PC), 355-359 commutative operations, 246 compaction of changelogs, 456 (see also log compaction) for stream operator state, 479 of log-structured storage, 73 issues with, 84 size-tiered and leveled approaches, 79 CompactProtocol encoding (Thrift), 119 compare-and-set operations, 245, 327 implementing locks, 370 implementing uniqueness constraints, 331 implementing with total order broadcast, 350 relation to consensus, 335, 350, 352, 374 relation to transactions, 230 compatibility, 112, 128 calling services, 136 properties of encoding formats, 139 using databases, 129-131 using message-passing, 138 compensating transactions, 355, 461, 526 complex event processing (CEP), 465 complexity distilling in theoretical models, 310 hiding using abstraction, 27 of software systems, managing, 20 composing data systems (see unbundling data‐ bases) compute-intensive applications, 3, 275 concatenated indexes, 87 in Cassandra, 204 Concord (stream processor), 466 concurrency actor programming model, 138, 468 (see also message-passing) bugs from weak transaction isolation, 233 conflict 
resolution, 171, 174 detecting concurrent writes, 184-191 dual writes, problems with, 453 happens-before relationship, 186 in replicated systems, 161-191, 324-338 lost updates, 243 multi-version concurrency control (MVCC), 239 optimistic concurrency control, 261 ordering of operations, 326, 341 reducing, through event logs, 351, 462, 507 time and relativity, 187 transaction isolation, 225 write skew (transaction isolation), 246-251 conflict-free replicated datatypes (CRDTs), 174 conflicts conflict detection, 172 causal dependencies, 186, 342 in consensus algorithms, 368 in leaderless replication, 184 Index | 563 in log-based systems, 351, 521 in nonlinearizable systems, 343 in serializable snapshot isolation (SSI), 264 in two-phase commit, 357, 364 conflict resolution automatic conflict resolution, 174 by aborting transactions, 261 by apologizing, 527 convergence, 172-174 in leaderless systems, 190 last write wins (LWW), 186, 292 using atomic operations, 246 using custom logic, 173 determining what is a conflict, 174, 522 in multi-leader replication, 171-175 avoiding conflicts, 172 lost updates, 242-246 materializing, 251 relation to operation ordering, 339 write skew (transaction isolation), 246-251 congestion (networks) avoidance, 282 limiting accuracy of clocks, 293 queueing delays, 282 consensus, 321, 364-375, 554 algorithms, 366-368 preventing split brain, 367 safety and liveness properties, 365 using linearizable operations, 351 cost of, 369 distributed transactions, 352-375 in practice, 360-364 two-phase commit, 354-359 XA transactions, 361-364 impossibility of, 353 membership and coordination services, 370-373 relation to compare-and-set, 335, 350, 352, 374 relation to replication, 155, 349 relation to uniqueness constraints, 521 consistency, 224, 524 across different databases, 157, 452, 462, 492 causal, 339-348, 493 consistent prefix reads, 165-167 consistent snapshots, 156, 237-242, 294, 455, 500 (see also snapshots) 564 | Index crash recovery, 82 
enforcing constraints (see constraints) eventual, 162, 322 (see also eventual consistency) in ACID transactions, 224, 529 in CAP theorem, 337 linearizability, 324-338 meanings of, 224 monotonic reads, 164-165 of secondary indexes, 231, 241, 354, 491, 500 ordering guarantees, 339-352 read-after-write, 162-164 sequential, 351 strong (see linearizability) timeliness and integrity, 524 using quorums, 181, 334 consistent hashing, 204 consistent prefix reads, 165 constraints (databases), 225, 248 asynchronously checked, 526 coordination avoidance, 527 ensuring idempotence, 519 in log-based systems, 521-524 across multiple partitions, 522 in two-phase commit, 355, 357 relation to consensus, 374, 521 relation to event ordering, 347 requiring linearizability, 330 Consul (service discovery), 372 consumers (message streams), 137, 440 backpressure, 441 consumer offsets in logs, 449 failures, 445, 449 fan-out, 11, 445, 448 load balancing, 444, 448 not keeping up with producers, 441, 450, 502 context switches, 14, 297 convergence (conflict resolution), 172-174, 322 coordination avoidance, 527 cross-datacenter, 168, 493 cross-partition ordering, 256, 294, 348, 523 services, 330, 370-373 coordinator (in 2PC), 356 failure, 358 in XA transactions, 361-364 recovery, 363 copy-on-write (B-trees), 82, 242 CORBA (Common Object Request Broker Architecture), 134 correctness, 6 auditability, 528-533 Byzantine fault tolerance, 305, 532 dealing with partial failures, 274 in log-based systems, 521-524 of algorithm within system model, 308 of compensating transactions, 355 of consensus, 368 of derived data, 497, 531 of immutable data, 461 of personal data, 535, 540 of time, 176, 289-295 of transactions, 225, 515, 529 timeliness and integrity, 524-528 corruption of data detecting, 519, 530-533 due to pathological memory access, 529 due to radiation, 305 due to split brain, 158, 302 due to weak transaction isolation, 233 formalization in consensus, 366 integrity as absence of, 524 network 
packets, 306 on disks, 227 preventing using write-ahead logs, 82 recovering from, 414, 460 Couchbase (database) durability, 89 hash partitioning, 203-204, 211 rebalancing, 213 request routing, 216 CouchDB (database) B-tree storage, 242 change feed, 456 document data model, 31 join support, 34 MapReduce support, 46, 400 replication, 170, 173 covering indexes, 86 CPUs cache coherence and memory barriers, 338 caching and pipelining, 99, 428 increasing parallelism, 43 CRDTs (see conflict-free replicated datatypes) CREATE INDEX statement (SQL), 85, 500 credit rating agencies, 535 Crunch (batch processing), 419, 427 hash joins, 409 sharded joins, 408 workflows, 403 cryptography defense against attackers, 306 end-to-end encryption and authentication, 519, 543 proving integrity of data, 532 CSS (Cascading Style Sheets), 44 CSV (comma-separated values), 70, 114, 396 Curator (ZooKeeper recipes), 330, 371 curl (Unix tool), 135, 397 cursor stability, 243 Cypher (query language), 52 comparison to SPARQL, 59 D data corruption (see corruption of data) data cubes, 102 data formats (see encoding) data integration, 490-498, 543 batch and stream processing, 494-498 lambda architecture, 497 maintaining derived state, 495 reprocessing data, 496 unifying, 498 by unbundling databases, 499-515 comparison to federated databases, 501 combining tools by deriving data, 490-494 derived data versus distributed transac‐ tions, 492 limits of total ordering, 493 ordering events to capture causality, 493 reasoning about dataflows, 491 need for, 385 data lakes, 415 data locality (see locality) data models, 27-64 graph-like models, 49-63 Datalog language, 60-63 property graphs, 50 RDF and triple-stores, 55-59 query languages, 42-48 relational model versus document model, 28-42 data protection regulations, 542 data systems, 3 about, 4 Index | 565 concerns when designing, 5 future of, 489-544 correctness, constraints, and integrity, 515-533 data integration, 490-498 unbundling databases, 499-515 
heterogeneous, keeping in sync, 452 maintainability, 18-22 possible faults in, 221 reliability, 6-10 hardware faults, 7 human errors, 9 importance of, 10 software errors, 8 scalability, 10-18 unreliable clocks, 287-299 data warehousing, 91-95, 554 comparison to data lakes, 415 ETL (extract-transform-load), 92, 416, 452 keeping data systems in sync, 452 schema design, 93 slowly changing dimension (SCD), 476 data-intensive applications, 3 database triggers (see triggers) database-internal distributed transactions, 360, 364, 477 databases archival storage, 131 comparison of message brokers to, 443 dataflow through, 129 end-to-end argument for, 519-520 checking integrity, 531 inside-out, 504 (see also unbundling databases) output from batch workflows, 412 relation to event streams, 451-464 (see also changelogs) API support for change streams, 456, 506 change data capture, 454-457 event sourcing, 457-459 keeping systems in sync, 452-453 philosophy of immutable events, 459-464 unbundling, 499-515 composing data storage technologies, 499-504 designing applications around dataflow, 504-509 566 | Index observing derived state, 509-515 datacenters geographically distributed, 145, 164, 278, 493 multi-tenancy and shared resources, 284 network architecture, 276 network faults, 279 replication across multiple, 169 leaderless replication, 184 multi-leader replication, 168, 335 dataflow, 128-139, 504-509 correctness of dataflow systems, 525 differential, 504 message-passing, 136-139 reasoning about, 491 through databases, 129 through services, 131-136 dataflow engines, 421-423 comparison to stream processing, 464 directed acyclic graphs (DAG), 424 partitioning, approach to, 429 support for declarative queries, 427 Datalog (query language), 60-63 datatypes binary strings in XML and JSON, 114 conflict-free, 174 in Avro encodings, 122 in Thrift and Protocol Buffers, 121 numbers in XML and JSON, 114 Datomic (database) B-tree storage, 242 data model, 50, 57 Datalog query language, 60 
excision (deleting data), 463 languages for transactions, 255 serial execution of transactions, 253 deadlocks detection, in two-phase commit (2PC), 364 in two-phase locking (2PL), 258 Debezium (change data capture), 455 declarative languages, 42, 554 Bloom, 504 CSS and XSL, 44 Cypher, 52 Datalog, 60 for batch processing, 427 recursive SQL queries, 53 relational algebra and SQL, 42 SPARQL, 59 delays bounded network delays, 285 bounded process pauses, 298 unbounded network delays, 282 unbounded process pauses, 296 deleting data, 463 denormalization (data representation), 34, 554 costs, 39 in derived data systems, 386 materialized views, 101 updating derived data, 228, 231, 490 versus normalization, 462 derived data, 386, 439, 554 from change data capture, 454 in event sourcing, 458-458 maintaining derived state through logs, 452-457, 459-463 observing, by subscribing to streams, 512 outputs of batch and stream processing, 495 through application code, 505 versus distributed transactions, 492 deterministic operations, 255, 274, 554 accidental nondeterminism, 423 and fault tolerance, 423, 426 and idempotence, 478, 492 computing derived data, 495, 526, 531 in state machine replication, 349, 452, 458 joins, 476 DevOps, 394 differential dataflow, 504 dimension tables, 94 dimensional modeling (see star schemas) directed acyclic graphs (DAGs), 424 dirty reads (transaction isolation), 234 dirty writes (transaction isolation), 235 discrimination, 534 disks (see hard disks) distributed actor frameworks, 138 distributed filesystems, 398-399 decoupling from query engines, 417 indiscriminately dumping data into, 415 use by MapReduce, 402 distributed systems, 273-312, 554 Byzantine faults, 304-306 cloud versus supercomputing, 275 detecting network faults, 280 faults and partial failures, 274-277 formalization of consensus, 365 impossibility results, 338, 353 issues with failover, 157 limitations of distributed transactions, 363 multi-datacenter, 169, 335 network problems, 277-286 
quorums, relying on, 301 reasons for using, 145, 151 synchronized clocks, relying on, 291-295 system models, 306-310 use of clocks and time, 287 distributed transactions (see transactions) Django (web framework), 232 DNS (Domain Name System), 216, 372 Docker (container manager), 506 document data model, 30-42 comparison to relational model, 38-42 document references, 38, 403 document-oriented databases, 31 many-to-many relationships and joins, 36 multi-object transactions, need for, 231 versus relational model convergence of models, 41 data locality, 41 document-partitioned indexes, 206, 217, 411 domain-driven design (DDD), 457 DRBD (Distributed Replicated Block Device), 153 drift (clocks), 289 Drill (query engine), 93 Druid (database), 461 Dryad (dataflow engine), 421 dual writes, problems with, 452, 507 duplicates, suppression of, 517 (see also idempotence) using a unique ID, 518, 522 durability (transactions), 226, 554 duration (time), 287 measurement with monotonic clocks, 288 dynamic partitioning, 212 dynamically typed languages analogy to schema-on-read, 40 code generation and, 127 Dynamo-style databases (see leaderless replica‐ tion) E edges (in graphs), 49, 403 property graph model, 50 edit distance (full-text search), 88 effectively-once semantics, 476, 516 Index | 567 (see also exactly-once semantics) preservation of integrity, 525 elastic systems, 17 Elasticsearch (search server) document-partitioned indexes, 207 partition rebalancing, 211 percolator (stream search), 467 usage example, 4 use of Lucene, 79 ElephantDB (database), 413 Elm (programming language), 504, 512 encodings (data formats), 111-128 Avro, 122-127 binary variants of JSON and XML, 115 compatibility, 112 calling services, 136 using databases, 129-131 using message-passing, 138 defined, 113 JSON, XML, and CSV, 114 language-specific formats, 113 merits of schemas, 127 representations of data, 112 Thrift and Protocol Buffers, 117-121 end-to-end argument, 277, 519-520 checking integrity, 531 
publish/subscribe streams, 512 enrichment (stream), 473 Enterprise JavaBeans (EJB), 134 entities (see vertices) epoch (consensus algorithms), 368 epoch (Unix timestamps), 288 equi-joins, 403 erasure coding (error correction), 398 Erlang OTP (actor framework), 139 error handling for network faults, 280 in transactions, 231 error-correcting codes, 277, 398 Esper (CEP engine), 466 etcd (coordination service), 370-373 linearizable operations, 333 locks and leader election, 330 quorum reads, 351 service discovery, 372 use of Raft algorithm, 349, 353 Ethereum (blockchain), 532 Ethernet (networks), 276, 278, 285 packet checksums, 306, 519 Etherpad (collaborative editor), 170 ethics, 533-543 code of ethics and professional practice, 533 legislation and self-regulation, 542 predictive analytics, 533-536 amplifying bias, 534 feedback loops, 536 privacy and tracking, 536-543 consent and freedom of choice, 538 data as assets and power, 540 meaning of privacy, 539 surveillance, 537 respect, dignity, and agency, 543, 544 unintended consequences, 533, 536 ETL (extract-transform-load), 92, 405, 452, 554 use of Hadoop for, 416 event sourcing, 457-459 commands and events, 459 comparison to change data capture, 457 comparison to lambda architecture, 497 deriving current state from event log, 458 immutability and auditability, 459, 531 large, reliable data systems, 519, 526 Event Store (database), 458 event streams (see streams) events, 440 deciding on total order of, 493 deriving views from event log, 461 difference to commands, 459 event time versus processing time, 469, 477, 498 immutable, advantages of, 460, 531 ordering to capture causality, 493 reads as, 513 stragglers, 470, 498 timestamp of, in stream processing, 471 EventSource (browser API), 512 eventual consistency, 152, 162, 308, 322 (see also conflicts) and perpetual inconsistency, 525 evolvability, 21, 111 calling services, 136 graph-structured data, 52 of databases, 40, 129-131, 461, 497 of message-passing, 
138 reprocessing data, 496, 498 schema evolution in Avro, 123 schema evolution in Thrift and Protocol Buffers, 120 schema-on-read, 39, 111, 128 exactly-once semantics, 360, 476, 516 parity with batch processors, 498 preservation of integrity, 525 exclusive mode (locks), 258 eXtended Architecture transactions (see XA transactions) extract-transform-load (see ETL) F Facebook Presto (query engine), 93 React, Flux, and Redux (user interface libraries), 512 social graphs, 49 Wormhole (change data capture), 455 fact tables, 93 failover, 157, 554 (see also leader-based replication) in leaderless replication, absence of, 178 leader election, 301, 348, 352 potential problems, 157 failures amplification by distributed transactions, 364, 495 failure detection, 280 automatic rebalancing causing cascading failures, 214 perfect failure detectors, 359 timeouts and unbounded delays, 282, 284 using ZooKeeper, 371 faults versus, 7 partial failures in distributed systems, 275-277, 310 fan-out (messaging systems), 11, 445 fault tolerance, 6-10, 555 abstractions for, 321 formalization in consensus, 365-369 use of replication, 367 human fault tolerance, 414 in batch processing, 406, 414, 422, 425 in log-based systems, 520, 524-526 in stream processing, 476-479 atomic commit, 477 idempotence, 478 maintaining derived state, 495 microbatching and checkpointing, 477 rebuilding state after a failure, 478 of distributed transactions, 362-364 transaction atomicity, 223, 354-361 faults, 6 Byzantine faults, 304-306 failures versus, 7 handled by transactions, 221 handling in supercomputers and cloud computing, 275 hardware, 7 in batch processing versus distributed databases, 417 in distributed systems, 274-277 introducing deliberately, 7, 280 network faults, 279-281 asymmetric faults, 300 detecting, 280 tolerance of, in multi-leader replication, 169 software errors, 8 tolerating (see fault tolerance) federated databases, 501 fence (CPU instruction), 338 fencing (preventing split brain), 158, 
302-304 generating fencing tokens, 349, 370 properties of fencing tokens, 308 stream processors writing to databases, 478, 517 Fibre Channel (networks), 398 field tags (Thrift and Protocol Buffers), 119-121 file descriptors (Unix), 395 financial data, 460 Firebase (database), 456 Flink (processing framework), 421-423 dataflow APIs, 427 fault tolerance, 422, 477, 479 Gelly API (graph processing), 425 integration of batch and stream processing, 495, 498 machine learning, 428 query optimizer, 427 stream processing, 466 flow control, 282, 441, 555 FLP result (on consensus), 353 FlumeJava (dataflow library), 403, 427 followers, 152, 555 (see also leader-based replication) foreign keys, 38, 403 forward compatibility, 112 forward decay (algorithm), 16 Fossil (version control system), 463 shunning (deleting data), 463 FoundationDB (database) serializable transactions, 261, 265, 364 fractal trees, 83 full table scans, 403 full-text search, 555 and fuzzy indexes, 88 building search indexes, 411 Lucene storage engine, 79 functional reactive programming (FRP), 504 functional requirements, 22 futures (asynchronous operations), 135 fuzzy search (see similarity search) G garbage collection immutability and, 463 process pauses for, 14, 296-299, 301 (see also process pauses) genome analysis, 63, 429 geographically distributed datacenters, 145, 164, 278, 493 geospatial indexes, 87 Giraph (graph processing), 425 Git (version control system), 174, 342, 463 GitHub, postmortems, 157, 158, 309 global indexes (see term-partitioned indexes) GlusterFS (distributed filesystem), 398 GNU Coreutils (Linux), 394 GoldenGate (change data capture), 161, 170, 455 (see also Oracle) Google Bigtable (database) data model (see Bigtable data model) partitioning scheme, 199, 202 storage layout, 78 Chubby (lock service), 370 Cloud Dataflow (stream processor), 466, 477, 498 (see also Beam) Cloud Pub/Sub (messaging), 444, 448 Docs (collaborative editor), 170 Dremel (query engine), 93, 96 
FlumeJava (dataflow library), 403, 427 GFS (distributed file system), 398 gRPC (RPC framework), 135 MapReduce (batch processing), 390 (see also MapReduce) building search indexes, 411 task preemption, 418 Pregel (graph processing), 425 Spanner (see Spanner) TrueTime (clock API), 294 gossip protocol, 216 government use of data, 541 GPS (Global Positioning System) use for clock synchronization, 287, 290, 294, 295 GraphChi (graph processing), 426 graphs, 555 as data models, 49-63 example of graph-structured data, 49 property graphs, 50 RDF and triple-stores, 55-59 versus the network model, 60 processing and analysis, 424-426 fault tolerance, 425 Pregel processing model, 425 query languages Cypher, 52 Datalog, 60-63 recursive SQL queries, 53 SPARQL, 59 Gremlin (graph query language), 50 grep (Unix tool), 392 GROUP BY clause (SQL), 406 grouping records in MapReduce, 406 handling skew, 407 H Hadoop (data infrastructure) comparison to distributed databases, 390 comparison to MPP databases, 414-418 comparison to Unix, 413-414, 499 diverse processing models in ecosystem, 417 HDFS distributed filesystem (see HDFS) higher-level tools, 403 join algorithms, 403-410 (see also MapReduce) MapReduce (see MapReduce) YARN (see YARN) happens-before relationship, 340 capturing, 187 concurrency and, 186 hard disks access patterns, 84 detecting corruption, 519, 530 faults in, 7, 227 sequential write throughput, 75, 450 hardware faults, 7 hash indexes, 72-75 broadcast hash joins, 409 partitioned hash joins, 409 hash partitioning, 203-205, 217 consistent hashing, 204 problems with hash mod N, 210 range queries, 204 suitable hash functions, 203 with fixed number of partitions, 210 HAWQ (database), 428 HBase (database) bug due to lack of fencing, 302 bulk loading, 413 column-family data model, 41, 99 dynamic partitioning, 212 key-range partitioning, 202 log-structured storage, 78 request routing, 216 size-tiered compaction, 79 use of HDFS, 417 use of ZooKeeper, 370 HDFS 
(Hadoop Distributed File System), 398-399 (see also distributed filesystems) checking data integrity, 530 decoupling from query engines, 417 indiscriminately dumping data into, 415 metadata about datasets, 410 NameNode, 398 use by Flink, 479 use by HBase, 212 use by MapReduce, 402 HdrHistogram (numerical library), 16 head (Unix tool), 392 head vertex (property graphs), 51 head-of-line blocking, 15 heap files (databases), 86 Helix (cluster manager), 216 heterogeneous distributed transactions, 360, 364 heuristic decisions (in 2PC), 363 Hibernate (object-relational mapper), 30 hierarchical model, 36 high availability (see fault tolerance) high-frequency trading, 290, 299 high-performance computing (HPC), 275 hinted handoff, 183 histograms, 16 Hive (query engine), 419, 427 for data warehouses, 93 HCatalog and metastore, 410 map-side joins, 409 query optimizer, 427 skewed joins, 408 workflows, 403 Hollerith machines, 390 hopping windows (stream processing), 472 (see also windows) horizontal scaling (see scaling out) HornetQ (messaging), 137, 444 distributed transaction support, 361 hot spots, 201 due to celebrities, 205 for time-series data, 203 in batch processing, 407 relieving, 205 hot standbys (see leader-based replication) HTTP, use in APIs (see services) human errors, 9, 279, 414 HyperDex (database), 88 HyperLogLog (algorithm), 466 I I/O operations, waiting for, 297 IBM DB2 (database) distributed transaction support, 361 recursive query support, 54 serializable isolation, 242, 257 XML and JSON support, 30, 42 electromechanical card-sorting machines, 390 IMS (database), 36 imperative query APIs, 46 InfoSphere Streams (CEP engine), 466 MQ (messaging), 444 distributed transaction support, 361 System R (database), 222 WebSphere (messaging), 137 idempotence, 134, 478, 555 by giving operations unique IDs, 518, 522 idempotent operations, 517 immutability advantages of, 460, 531 deriving state from event log, 459-464 for crash recovery, 75 in B-trees, 82, 242 
in event sourcing, 457 inputs to Unix commands, 397 limitations of, 463 Impala (query engine) for data warehouses, 93 hash joins, 409 native code generation, 428 use of HDFS, 417 impedance mismatch, 29 imperative languages, 42 setting element styles (example), 45 in doubt (transaction status), 358 holding locks, 362 orphaned transactions, 363 in-memory databases, 88 durability, 227 serial transaction execution, 253 incidents cascading failures, 9 crashes due to leap seconds, 290 data corruption and financial losses due to concurrency bugs, 233 data corruption on hard disks, 227 data loss due to last-write-wins, 173, 292 data on disks unreadable, 309 deleted items reappearing, 174 disclosure of sensitive data due to primary key reuse, 157 errors in transaction serializability, 529 gigabit network interface with 1 Kb/s throughput, 311 network faults, 279 network interface dropping only inbound packets, 279 network partitions and whole-datacenter failures, 275 poor handling of network faults, 280 sending message to ex-partner, 494 sharks biting undersea cables, 279 split brain due to 1-minute packet delay, 158, 279 vibrations in server rack, 14 violation of uniqueness constraint, 529 indexes, 71, 555 and snapshot isolation, 241 as derived data, 386, 499-504 B-trees, 79-83 building in batch processes, 411 clustered, 86 comparison of B-trees and LSM-trees, 83-85 concatenated, 87 covering (with included columns), 86 creating, 500 full-text search, 88 geospatial, 87 hash, 72-75 index-range locking, 260 multi-column, 87 partitioning and secondary indexes, 206-209, 217 secondary, 85 (see also secondary indexes) problems with dual writes, 452, 491 SSTables and LSM-trees, 76-79 updating when data changes, 452, 467 Industrial Revolution, 541 InfiniBand (networks), 285 InfiniteGraph (database), 50 InnoDB (storage engine) clustered index on primary key, 86 not preventing lost updates, 245 preventing write skew, 248, 257 serializable isolation, 257 snapshot isolation 
support, 239 inside-out databases, 504 (see also unbundling databases) integrating different data systems (see data integration) integrity, 524 coordination-avoiding data systems, 528 correctness of dataflow systems, 525 in consensus formalization, 365 integrity checks, 530 (see also auditing) end-to-end, 519, 531 use of snapshot isolation, 238 maintaining despite software bugs, 529 Interface Definition Language (IDL), 117, 122 intermediate state, materialization of, 420-423 internet services, systems for implementing, 275 invariants, 225 (see also constraints) inversion of control, 396 IP (Internet Protocol) unreliability of, 277 ISDN (Integrated Services Digital Network), 284 isolation (in transactions), 225, 228, 555 correctness and, 515 for single-object writes, 230 serializability, 251-266 actual serial execution, 252-256 serializable snapshot isolation (SSI), 261-266 two-phase locking (2PL), 257-261 violating, 228 weak isolation levels, 233-251 preventing lost updates, 242-246 read committed, 234-237 snapshot isolation, 237-242 iterative processing, 424-426 J Java Database Connectivity (JDBC) distributed transaction support, 361 network drivers, 128 Java Enterprise Edition (EE), 134, 356, 361 Java Message Service (JMS), 444 (see also messaging systems) comparison to log-based messaging, 448, 451 distributed transaction support, 361 message ordering, 446 Java Transaction API (JTA), 355, 361 Java Virtual Machine (JVM) bytecode generation, 428 garbage collection pauses, 296 process reuse in batch processors, 422 JavaScript in MapReduce querying, 46 setting element styles (example), 45 use in advanced queries, 48 Jena (RDF framework), 57 Jepsen (fault tolerance testing), 515 jitter (network delay), 284 joins, 555 by index lookup, 403 expressing as relational operators, 427 in relational and document databases, 34 MapReduce map-side joins, 408-410 broadcast hash joins, 409 merge joins, 410 partitioned hash joins, 409 MapReduce reduce-side joins, 403-408 handling 
skew, 407 sort-merge joins, 405 parallel execution of, 415 secondary indexes and, 85 stream joins, 472-476 stream-stream join, 473 stream-table join, 473 table-table join, 474 time-dependence of, 475 support in document databases, 42 JOTM (transaction coordinator), 356 JSON Avro schema representation, 122 binary variants, 115 for application data, issues with, 114 in relational databases, 30, 42 representing a résumé (example), 31 Juttle (query language), 504 K k-nearest neighbors, 429 Kafka (messaging), 137, 448 Kafka Connect (database integration), 457, 461 Kafka Streams (stream processor), 466, 467 fault tolerance, 479 leader-based replication, 153 log compaction, 456, 467 message offsets, 447, 478 request routing, 216 transaction support, 477 usage example, 4 Ketama (partitioning library), 213 key-value stores, 70 as batch process output, 412 hash indexes, 72-75 in-memory, 89 partitioning, 201-205 by hash of key, 203, 217 by key range, 202, 217 dynamic partitioning, 212 skew and hot spots, 205 Kryo (Java), 113 Kubernetes (cluster manager), 418, 506 L lambda architecture, 497 Lamport timestamps, 345 Large Hadron Collider (LHC), 64 last write wins (LWW), 173, 334 discarding concurrent writes, 186 problems with, 292 prone to lost updates, 246 late binding, 396 latency instability under two-phase locking, 259 network latency and resource utilization, 286 response time versus, 14 tail latency, 15, 207 leader-based replication, 152-161 (see also replication) failover, 157, 301 handling node outages, 156 implementation of replication logs change data capture, 454-457 (see also changelogs) statement-based, 158 trigger-based replication, 161 write-ahead log (WAL) shipping, 159 linearizability of operations, 333 locking and leader election, 330 log sequence number, 156, 449 read-scaling architecture, 161 relation to consensus, 367 setting up new followers, 155 synchronous versus asynchronous, 153-155 leaderless replication, 177-191 (see also replication) 
detecting concurrent writes, 184-191 capturing happens-before relationship, 187 happens-before relationship and concurrency, 186 last write wins, 186 merging concurrently written values, 190 version vectors, 191 multi-datacenter, 184 quorums, 179-182 consistency limitations, 181-183, 334 sloppy quorums and hinted handoff, 183 read repair and anti-entropy, 178 leap seconds, 8, 290 in time-of-day clocks, 288 leases, 295 implementation with ZooKeeper, 370 need for fencing, 302 ledgers, 460 distributed ledger technologies, 532 legacy systems, maintenance of, 18 less (Unix tool), 397 LevelDB (storage engine), 78 leveled compaction, 79 Levenshtein automata, 88 limping (partial failure), 311 linearizability, 324-338, 555 cost of, 335-338 CAP theorem, 336 memory on multi-core CPUs, 338 definition, 325-329 implementing with total order broadcast, 350 in ZooKeeper, 370 of derived data systems, 492, 524 avoiding coordination, 527 of different replication methods, 332-335 using quorums, 334 relying on, 330-332 constraints and uniqueness, 330 cross-channel timing dependencies, 331 locking and leader election, 330 stronger than causal consistency, 342 using to implement total order broadcast, 351 versus serializability, 329 LinkedIn Azkaban (workflow scheduler), 402 Databus (change data capture), 161, 455 Espresso (database), 31, 126, 130, 153, 216 Helix (cluster manager) (see Helix) profile (example), 30 reference to company entity (example), 34 Rest.li (RPC framework), 135 Voldemort (database) (see Voldemort) Linux, leap second bug, 8, 290 liveness properties, 308 LMDB (storage engine), 82, 242 load approaches to coping with, 17 describing, 11 load testing, 16 load balancing (messaging), 444 local indexes (see document-partitioned indexes) locality (data access), 32, 41, 555 in batch processing, 400, 405, 421 in stateful clients, 170, 511 in stream processing, 474, 478, 508, 522 location transparency, 134 in the actor model, 138 locks, 556 deadlock, 258 
distributed locking, 301-304, 330 fencing tokens, 303 implementation with ZooKeeper, 370 relation to consensus, 374 for transaction isolation in snapshot isolation, 239 in two-phase locking (2PL), 257-261 making operations atomic, 243 performance, 258 preventing dirty writes, 236 preventing phantoms with index-range locks, 260, 265 read locks (shared mode), 236, 258 shared mode and exclusive mode, 258 in two-phase commit (2PC) deadlock detection, 364 in-doubt transactions holding locks, 362 materializing conflicts with, 251 preventing lost updates by explicit locking, 244 log sequence number, 156, 449 logic programming languages, 504 logical clocks, 293, 343, 494 for read-after-write consistency, 164 logical logs, 160 logs (data structure), 71, 556 advantages of immutability, 460 compaction, 73, 79, 456, 460 for stream operator state, 479 creating using total order broadcast, 349 implementing uniqueness constraints, 522 log-based messaging, 446-451 comparison to traditional messaging, 448, 451 consumer offsets, 449 disk space usage, 450 replaying old messages, 451, 496, 498 slow consumers, 450 using logs for message storage, 447 log-structured storage, 71-79 log-structured merge tree (see LSM-trees) replication, 152, 158-161 change data capture, 454-457 (see also changelogs) coordination with snapshot, 156 logical (row-based) replication, 160 statement-based replication, 158 trigger-based replication, 161 write-ahead log (WAL) shipping, 159 scalability limits, 493 loose coupling, 396, 419, 502 lost updates (see updates) LSM-trees (indexes), 78-79 comparison to B-trees, 83-85 Lucene (storage engine), 79 building indexes in batch processes, 411 similarity search, 88 Luigi (workflow scheduler), 402 LWW (see last write wins) M machine learning ethical considerations, 534 (see also ethics) iterative processing, 424 models derived from training data, 505 statistical and numerical algorithms, 428 MADlib (machine learning toolkit), 428 magic scaling sauce, 18 Mahout 
(machine learning toolkit), 428 maintainability, 18-22, 489 defined, 23 design principles for software systems, 19 evolvability (see evolvability) operability, 19 simplicity and managing complexity, 20 many-to-many relationships in document model versus relational model, 39 modeling as graphs, 49 many-to-one and many-to-many relationships, 33-36 many-to-one relationships, 34 MapReduce (batch processing), 390, 399-400 accessing external services within job, 404, 412 comparison to distributed databases designing for frequent faults, 417 diversity of processing models, 416 diversity of storage, 415 comparison to stream processing, 464 comparison to Unix, 413-414 disadvantages and limitations of, 419 fault tolerance, 406, 414, 422 higher-level tools, 403, 426 implementation in Hadoop, 400-403 the shuffle, 402 implementation in MongoDB, 46-48 machine learning, 428 map-side processing, 408-410 broadcast hash joins, 409 merge joins, 410 partitioned hash joins, 409 mapper and reducer functions, 399 materialization of intermediate state, 419-423 output of batch workflows, 411-413 building search indexes, 411 key-value stores, 412 reduce-side processing, 403-408 analysis of user activity events (example), 404 grouping records by same key, 406 handling skew, 407 sort-merge joins, 405 workflows, 402 marshalling (see encoding) massively parallel processing (MPP), 216 comparison to composing storage technologies, 502 comparison to Hadoop, 414-418, 428 master-master replication (see multi-leader replication) master-slave replication (see leader-based replication) materialization, 556 aggregate values, 101 conflicts, 251 intermediate state (batch processing), 420-423 materialized views, 101 as derived data, 386, 499-504 maintaining, using stream processing, 467, 475 Maven (Java build tool), 428 Maxwell (change data capture), 455 mean, 14 media monitoring, 467 median, 14 meeting room booking (example), 249, 259, 521 membership services, 372 Memcached 
(caching server), 4, 89 memory in-memory databases, 88 durability, 227 serial transaction execution, 253 in-memory representation of data, 112 random bit-flips in, 529 use by indexes, 72, 77 memory barrier (CPU instruction), 338 MemSQL (database) in-memory storage, 89 read committed isolation, 236 memtable (in LSM-trees), 78 Mercurial (version control system), 463 merge joins, MapReduce map-side, 410 mergeable persistent data structures, 174 merging sorted files, 76, 402, 405 Merkle trees, 532 Mesos (cluster manager), 418, 506 message brokers (see messaging systems) message-passing, 136-139 advantages over direct RPC, 137 distributed actor frameworks, 138 evolvability, 138 MessagePack (encoding format), 116 messages exactly-once semantics, 360, 476 loss of, 442 using total order broadcast, 348 messaging systems, 440-451 (see also streams) backpressure, buffering, or dropping messages, 441 brokerless messaging, 442 event logs, 446-451 comparison to traditional messaging, 448, 451 consumer offsets, 449 replaying old messages, 451, 496, 498 slow consumers, 450 message brokers, 443-446 acknowledgements and redelivery, 445 comparison to event logs, 448, 451 multiple consumers of same topic, 444 reliability, 442 uniqueness in log-based messaging, 522 Meteor (web framework), 456 microbatching, 477, 495 microservices, 132 (see also services) causal dependencies across services, 493 loose coupling, 502 relation to batch/stream processors, 389, 508 Microsoft Azure Service Bus (messaging), 444 Azure Storage, 155, 398 Azure Stream Analytics, 466 DCOM (Distributed Component Object Model), 134 MSDTC (transaction coordinator), 356 Orleans (see Orleans) SQL Server (see SQL Server) migrating (rewriting) data, 40, 130, 461, 497 modulus operator (%), 210 MongoDB (database) aggregation pipeline, 48 atomic operations, 243 BSON, 41 document data model, 31 hash partitioning (sharding), 203-204 key-range partitioning, 202 lack of join support, 34, 42 leader-based replication, 153 
MapReduce support, 46, 400 oplog parsing, 455, 456 partition splitting, 212 request routing, 216 secondary indexes, 207 Mongoriver (change data capture), 455 monitoring, 10, 19 monotonic clocks, 288 monotonic reads, 164 MPP (see massively parallel processing) MSMQ (messaging), 361 multi-column indexes, 87 multi-leader replication, 168-177 (see also replication) handling write conflicts, 171 conflict avoidance, 172 converging toward a consistent state, 172 custom conflict resolution logic, 173 determining what is a conflict, 174 linearizability, lack of, 333 replication topologies, 175-177 use cases, 168 clients with offline operation, 170 collaborative editing, 170 multi-datacenter replication, 168, 335 multi-object transactions, 228 need for, 231 Multi-Paxos (total order broadcast), 367 multi-table index cluster tables (Oracle), 41 multi-tenancy, 284 multi-version concurrency control (MVCC), 239, 266 detecting stale MVCC reads, 263 indexes and snapshot isolation, 241 mutual exclusion, 261 (see also locks) MySQL (database) binlog coordinates, 156 binlog parsing for change data capture, 455 circular replication topology, 175 consistent snapshots, 156 distributed transaction support, 361 InnoDB storage engine (see InnoDB) JSON support, 30, 42 leader-based replication, 153 performance of XA transactions, 360 row-based replication, 160 schema changes in, 40 snapshot isolation support, 242 (see also InnoDB) statement-based replication, 159 Tungsten Replicator (multi-leader replication), 170 conflict detection, 177 N nanomsg (messaging library), 442 Narayana (transaction coordinator), 356 NATS (messaging), 137 near-real-time (nearline) processing, 390 (see also stream processing) Neo4j (database) Cypher query language, 52 graph data model, 50 Nephele (dataflow engine), 421 netcat (Unix tool), 397 Netflix Chaos Monkey, 7, 280 Network Attached Storage (NAS), 146, 398 network model, 36 graph databases versus, 60 imperative query APIs, 46 Network Time Protocol 
(see NTP) networks congestion and queueing, 282 datacenter network topologies, 276 faults (see faults) linearizability and network delays, 338 network partitions, 279, 337 timeouts and unbounded delays, 281 next-key locking, 260 nodes (in graphs) (see vertices) nodes (processes), 556 handling outages in leader-based replication, 156 system models for failure, 307 noisy neighbors, 284 nonblocking atomic commit, 359 nondeterministic operations accidental nondeterminism, 423 partial failures in distributed systems, 275 nonfunctional requirements, 22 nonrepeatable reads, 238 (see also read skew) normalization (data representation), 33, 556 executing joins, 39, 42, 403 foreign key references, 231 in systems of record, 386 versus denormalization, 462 NoSQL, 29, 499 transactions and, 223 Notation3 (N3), 56 npm (package manager), 428 NTP (Network Time Protocol), 287 accuracy, 289, 293 adjustments to monotonic clocks, 289 multiple server addresses, 306 numbers, in XML and JSON encodings, 114 O object-relational mapping (ORM) frameworks, 30 error handling and aborted transactions, 232 unsafe read-modify-write cycle code, 244 object-relational mismatch, 29 observer pattern, 506 offline systems, 390 (see also batch processing) stateful, offline-capable clients, 170, 511 offline-first applications, 511 offsets consumer offsets in partitioned logs, 449 messages in partitioned logs, 447 OLAP (online analytic processing), 91, 556 data cubes, 102 OLTP (online transaction processing), 90, 556 analytics queries versus, 411 workload characteristics, 253 one-to-many relationships, 30 JSON representation, 32 online systems, 389 (see also services) Oozie (workflow scheduler), 402 OpenAPI (service definition format), 133 OpenStack Nova (cloud infrastructure) use of ZooKeeper, 370 Swift (object storage), 398 operability, 19 operating systems versus databases, 499 operation identifiers, 518, 522 operational transformation, 174 operators, 421 flow of data between, 424 in stream 
processing, 464 optimistic concurrency control, 261 Oracle (database) distributed transaction support, 361 GoldenGate (change data capture), 161, 170, 455 lack of serializability, 226 leader-based replication, 153 multi-table index cluster tables, 41 not preventing write skew, 248 partitioned indexes, 209 PL/SQL language, 255 preventing lost updates, 245 read committed isolation, 236 Real Application Clusters (RAC), 330 recursive query support, 54 snapshot isolation support, 239, 242 TimesTen (in-memory database), 89 WAL-based replication, 160 XML support, 30 ordering, 339-352 by sequence numbers, 343-348 causal ordering, 339-343 partial order, 341 limits of total ordering, 493 total order broadcast, 348-352 Orleans (actor framework), 139 outliers (response time), 14 Oz (programming language), 504 P package managers, 428, 505 packet switching, 285 packets corruption of, 306 sending via UDP, 442 PageRank (algorithm), 49, 424 paging (see virtual memory) ParAccel (database), 93 parallel databases (see massively parallel processing) parallel execution of graph analysis algorithms, 426 queries in MPP databases, 216 Parquet (data format), 96, 131 (see also column-oriented storage) use in Hadoop, 414 partial failures, 275, 310 limping, 311 partial order, 341 partitioning, 199-218, 556 and replication, 200 in batch processing, 429 multi-partition operations, 514 enforcing constraints, 522 secondary index maintenance, 495 of key-value data, 201-205 by key range, 202 skew and hot spots, 205 rebalancing partitions, 209-214 automatic or manual rebalancing, 213 problems with hash mod N, 210 using dynamic partitioning, 212 using fixed number of partitions, 210 using N partitions per node, 212 replication and, 147 request routing, 214-216 secondary indexes, 206-209 document-based partitioning, 206 term-based partitioning, 208 serial execution of transactions and, 255 Paxos (consensus algorithm), 366 ballot number, 368 Multi-Paxos (total order broadcast), 367 percentiles, 14, 
556 calculating efficiently, 16 importance of high percentiles, 16 use in service level agreements (SLAs), 15 Percona XtraBackup (MySQL tool), 156 performance describing, 13 of distributed transactions, 360 of in-memory databases, 89 of linearizability, 338 of multi-leader replication, 169 perpetual inconsistency, 525 pessimistic concurrency control, 261 phantoms (transaction isolation), 250 materializing conflicts, 251 preventing, in serializability, 259 physical clocks (see clocks) pickle (Python), 113 Pig (dataflow language), 419, 427 replicated joins, 409 skewed joins, 407 workflows, 403 Pinball (workflow scheduler), 402 pipelined execution, 423 in Unix, 394 point in time, 287 polyglot persistence, 29 polystores, 501 PostgreSQL (database) BDR (multi-leader replication), 170 causal ordering of writes, 177 Bottled Water (change data capture), 455 Bucardo (trigger-based replication), 161, 173 distributed transaction support, 361 foreign data wrappers, 501 full text search support, 490 leader-based replication, 153 log sequence number, 156 MVCC implementation, 239, 241 PL/pgSQL language, 255 PostGIS geospatial indexes, 87 preventing lost updates, 245 preventing write skew, 248, 261 read committed isolation, 236 recursive query support, 54 representing graphs, 51 serializable snapshot isolation (SSI), 261 snapshot isolation support, 239, 242 WAL-based replication, 160 XML and JSON support, 30, 42 pre-splitting, 212 Precision Time Protocol (PTP), 290 predicate locks, 259 predictive analytics, 533-536 amplifying bias, 534 ethics of (see ethics) feedback loops, 536 preemption of datacenter resources, 418 of threads, 298 Pregel processing model, 425 primary keys, 85, 556 compound primary key (Cassandra), 204 primary-secondary replication (see leader-based replication) privacy, 536-543 consent and freedom of choice, 538 data as assets and power, 540 deleting data, 463 ethical considerations (see ethics) legislation and self-regulation, 542 meaning of, 539 
surveillance, 537 tracking behavioral data, 536 probabilistic algorithms, 16, 466 process pauses, 295-299 processing time (of events), 469 producers (message streams), 440 programming languages dataflow languages, 504 for stored procedures, 255 functional reactive programming (FRP), 504 logic programming, 504 Prolog (language), 61 (see also Datalog) promises (asynchronous operations), 135 property graphs, 50 Cypher query language, 52 Protocol Buffers (data format), 117-121 field tags and schema evolution, 120 provenance of data, 531 publish/subscribe model, 441 publishers (message streams), 440 punch card tabulating machines, 390 pure functions, 48 putting computation near data, 400 Q Qpid (messaging), 444 quality of service (QoS), 285 Quantcast File System (distributed filesystem), 398 query languages, 42-48 aggregation pipeline, 48 CSS and XSL, 44 Cypher, 52 Datalog, 60 Juttle, 504 MapReduce querying, 46-48 recursive SQL queries, 53 relational algebra and SQL, 42 SPARQL, 59 query optimizers, 37, 427 queueing delays (networks), 282 head-of-line blocking, 15 latency and response time, 14 queues (messaging), 137 quorums, 179-182, 556 for leaderless replication, 179 in consensus algorithms, 368 limitations of consistency, 181-183, 334 making decisions in distributed systems, 301 monitoring staleness, 182 multi-datacenter replication, 184 relying on durability, 309 sloppy quorums and hinted handoff, 183 R R-trees (indexes), 87 RabbitMQ (messaging), 137, 444 leader-based replication, 153 race conditions, 225 (see also concurrency) avoiding with linearizability, 331 caused by dual writes, 452 dirty writes, 235 in counter increments, 235 lost updates, 242-246 preventing with event logs, 462, 507 preventing with serializable isolation, 252 write skew, 246-251 Raft (consensus algorithm), 366 sensitivity to network problems, 369 term number, 368 use in etcd, 353 RAID (Redundant Array of Independent Disks), 7, 398 railways, schema migration on, 496 RAMCloud 
(in-memory storage), 89 ranking algorithms, 424 RDF (Resource Description Framework), 57 querying with SPARQL, 59 RDMA (Remote Direct Memory Access), 276 read committed isolation level, 234-237 implementing, 236 multi-version concurrency control (MVCC), 239 no dirty reads, 234 no dirty writes, 235 read path (derived data), 509 read repair (leaderless replication), 178 for linearizability, 335 read replicas (see leader-based replication) read skew (transaction isolation), 238, 266 as violation of causality, 340 read-after-write consistency, 163, 524 cross-device, 164 read-modify-write cycle, 243 read-scaling architecture, 161 reads as events, 513 real-time collaborative editing, 170 near-real-time processing, 390 (see also stream processing) publish/subscribe dataflow, 513 response time guarantees, 298 time-of-day clocks, 288 rebalancing partitions, 209-214, 556 (see also partitioning) automatic or manual rebalancing, 213 dynamic partitioning, 212 fixed number of partitions, 210 fixed number of partitions per node, 212 problems with hash mod N, 210 recency guarantee, 324 recommendation engines batch process outputs, 412 batch workflows, 403, 420 iterative processing, 424 statistical and numerical algorithms, 428 records, 399 events in stream processing, 440 recursive common table expressions (SQL), 54 redelivery (messaging), 445 Redis (database) atomic operations, 243 durability, 89 Lua scripting, 255 single-threaded execution, 253 usage example, 4 redundancy hardware components, 7 of derived data, 386 (see also derived data) Reed–Solomon codes (error correction), 398 refactoring, 22 (see also evolvability) regions (partitioning), 199 register (data structure), 325 relational data model, 28-42 comparison to document model, 38-42 graph queries in SQL, 53 in-memory databases with, 89 many-to-one and many-to-many relation‐ ships, 33 multi-object transactions, need for, 231 NoSQL as alternative to, 29 object-relational mismatch, 29 relational algebra and SQL, 42 versus 
document model convergence of models, 41 data locality, 41 relational databases eventual consistency, 162 history, 28 leader-based replication, 153 logical logs, 160 philosophy compared to Unix, 499, 501 schema changes, 40, 111, 130 statement-based replication, 158 use of B-tree indexes, 80 relationships (see edges) reliability, 6-10, 489 building a reliable system from unreliable components, 276 defined, 6, 22 hardware faults, 7 human errors, 9 importance of, 10 of messaging systems, 442 Index | 581 software errors, 8 Remote Method Invocation (Java RMI), 134 remote procedure calls (RPCs), 134-136 (see also services) based on futures, 135 data encoding and evolution, 136 issues with, 134 using Avro, 126, 135 using Thrift, 135 versus message brokers, 137 repeatable reads (transaction isolation), 242 replicas, 152 replication, 151-193, 556 and durability, 227 chain replication, 155 conflict resolution and, 246 consistency properties, 161-167 consistent prefix reads, 165 monotonic reads, 164 reading your own writes, 162 in distributed filesystems, 398 leaderless, 177-191 detecting concurrent writes, 184-191 limitations of quorum consistency, 181-183, 334 sloppy quorums and hinted handoff, 183 monitoring staleness, 182 multi-leader, 168-177 across multiple datacenters, 168, 335 handling write conflicts, 171-175 replication topologies, 175-177 partitioning and, 147, 200 reasons for using, 145, 151 single-leader, 152-161 failover, 157 implementation of replication logs, 158-161 relation to consensus, 367 setting up new followers, 155 synchronous versus asynchronous, 153-155 state machine replication, 349, 452 using erasure coding, 398 with heterogeneous data systems, 453 replication logs (see logs) reprocessing data, 496, 498 (see also evolvability) from log-based messaging, 451 request routing, 214-216 582 | Index approaches to, 214 parallel query execution, 216 resilient systems, 6 (see also fault tolerance) response time as performance metric for services, 13, 389 
guarantees on, 298 latency versus, 14 mean and percentiles, 14 user experience, 15 responsibility and accountability, 535 REST (Representational State Transfer), 133 (see also services) RethinkDB (database) document data model, 31 dynamic partitioning, 212 join support, 34, 42 key-range partitioning, 202 leader-based replication, 153 subscribing to changes, 456 Riak (database) Bitcask storage engine, 72 CRDTs, 174, 191 dotted version vectors, 191 gossip protocol, 216 hash partitioning, 203-204, 211 last-write-wins conflict resolution, 186 leaderless replication, 177 LevelDB storage engine, 78 linearizability, lack of, 335 multi-datacenter support, 184 preventing lost updates across replicas, 246 rebalancing, 213 search feature, 209 secondary indexes, 207 siblings (concurrently written values), 190 sloppy quorums, 184 ring buffers, 450 Ripple (cryptocurrency), 532 rockets, 10, 36, 305 RocksDB (storage engine), 78 leveled compaction, 79 rollbacks (transactions), 222 rolling upgrades, 8, 112 routing (see request routing) row-oriented storage, 96 row-based replication, 160 rowhammer (memory corruption), 529 RPCs (see remote procedure calls) Rubygems (package manager), 428 rules (Datalog), 61 S safety and liveness properties, 308 in consensus algorithms, 366 in transactions, 222 sagas (see compensating transactions) Samza (stream processor), 466, 467 fault tolerance, 479 streaming SQL support, 466 sandboxes, 9 SAP HANA (database), 93 scalability, 10-18, 489 approaches for coping with load, 17 defined, 22 describing load, 11 describing performance, 13 partitioning and, 199 replication and, 161 scaling up versus scaling out, 146 scaling out, 17, 146 (see also shared-nothing architecture) scaling up, 17, 146 scatter/gather approach, querying partitioned databases, 207 SCD (slowly changing dimension), 476 schema-on-read, 39 comparison to evolvable schema, 128 in distributed filesystems, 415 schema-on-write, 39 schemaless databases (see schema-on-read) schemas, 557 Avro, 
122-127 reader determining writer’s schema, 125 schema evolution, 123 dynamically generated, 126 evolution of, 496 affecting application code, 111 compatibility checking, 126 in databases, 129-131 in message-passing, 138 in service calls, 136 flexibility in document model, 39 for analytics, 93-95 for JSON and XML, 115 merits of, 127 schema migration on railways, 496 Thrift and Protocol Buffers, 117-121 schema evolution, 120 traditional approach to design, fallacy in, 462 searches building search indexes in batch processes, 411 k-nearest neighbors, 429 on streams, 467 partitioned secondary indexes, 206 secondaries (see leader-based replication) secondary indexes, 85, 557 partitioning, 206-209, 217 document-partitioned, 206 index maintenance, 495 term-partitioned, 208 problems with dual writes, 452, 491 updating, transaction isolation and, 231 secondary sorts, 405 sed (Unix tool), 392 self-describing files, 127 self-joins, 480 self-validating systems, 530 semantic web, 57 semi-synchronous replication, 154 sequence number ordering, 343-348 generators, 294, 344 insufficiency for enforcing constraints, 347 Lamport timestamps, 345 use of timestamps, 291, 295, 345 sequential consistency, 351 serializability, 225, 233, 251-266, 557 linearizability versus, 329 pessimistic versus optimistic concurrency control, 261 serial execution, 252-256 partitioning, 255 using stored procedures, 253, 349 serializable snapshot isolation (SSI), 261-266 detecting stale MVCC reads, 263 detecting writes that affect prior reads, 264 distributed execution, 265, 364 performance of SSI, 265 preventing write skew, 262-265 two-phase locking (2PL), 257-261 index-range locks, 260 performance, 258 Serializable (Java), 113 Index | 583 serialization, 113 (see also encoding) service discovery, 135, 214, 372 using DNS, 216, 372 service level agreements (SLAs), 15 service-oriented architecture (SOA), 132 (see also services) services, 131-136 microservices, 132 causal dependencies across services, 493 loose 
coupling, 502 relation to batch/stream processors, 389, 508 remote procedure calls (RPCs), 134-136 issues with, 134 similarity to databases, 132 web services, 132, 135 session windows (stream processing), 472 (see also windows) sessionization, 407 sharding (see partitioning) shared mode (locks), 258 shared-disk architecture, 146, 398 shared-memory architecture, 146 shared-nothing architecture, 17, 146-147, 557 (see also replication) distributed filesystems, 398 (see also distributed filesystems) partitioning, 199 use of network, 277 sharks biting undersea cables, 279 counting (example), 46-48 finding (example), 42 website about (example), 44 shredding (in relational model), 38 siblings (concurrent values), 190, 246 (see also conflicts) similarity search edit distance, 88 genome data, 63 k-nearest neighbors, 429 single-leader replication (see leader-based rep‐ lication) single-threaded execution, 243, 252 in batch processing, 406, 421, 426 in stream processing, 448, 463, 522 size-tiered compaction, 79 skew, 557 584 | Index clock skew, 291-294, 334 in transaction isolation read skew, 238, 266 write skew, 246-251, 262-265 (see also write skew) meanings of, 238 unbalanced workload, 201 compensating for, 205 due to celebrities, 205 for time-series data, 203 in batch processing, 407 slaves (see leader-based replication) sliding windows (stream processing), 472 (see also windows) sloppy quorums, 183 (see also quorums) lack of linearizability, 334 slowly changing dimension (data warehouses), 476 smearing (leap seconds adjustments), 290 snapshots (databases) causal consistency, 340 computing derived data, 500 in change data capture, 455 serializable snapshot isolation (SSI), 261-266, 329 setting up a new replica, 156 snapshot isolation and repeatable read, 237-242 implementing with MVCC, 239 indexes and MVCC, 241 visibility rules, 240 synchronized clocks for global snapshots, 294 snowflake schemas, 95 SOAP, 133 (see also services) evolvability, 136 software bugs, 8 
maintaining integrity, 529 solid state drives (SSDs) access patterns, 84 detecting corruption, 519, 530 faults in, 227 sequential write throughput, 75 Solr (search server) building indexes in batch processes, 411 document-partitioned indexes, 207 request routing, 216 usage example, 4 use of Lucene, 79 sort (Unix tool), 392, 394, 395 sort-merge joins (MapReduce), 405 Sorted String Tables (see SSTables) sorting sort order in column storage, 99 source of truth (see systems of record) Spanner (database) data locality, 41 snapshot isolation using clocks, 295 TrueTime API, 294 Spark (processing framework), 421-423 bytecode generation, 428 dataflow APIs, 427 fault tolerance, 422 for data warehouses, 93 GraphX API (graph processing), 425 machine learning, 428 query optimizer, 427 Spark Streaming, 466 microbatching, 477 stream processing on top of batch process‐ ing, 495 SPARQL (query language), 59 spatial algorithms, 429 split brain, 158, 557 in consensus algorithms, 352, 367 preventing, 322, 333 using fencing tokens to avoid, 302-304 spreadsheets, dataflow programming capabili‐ ties, 504 SQL (Structured Query Language), 21, 28, 43 advantages and limitations of, 416 distributed query execution, 48 graph queries in, 53 isolation levels standard, issues with, 242 query execution on Hadoop, 416 résumé (example), 30 SQL injection vulnerability, 305 SQL on Hadoop, 93 statement-based replication, 158 stored procedures, 255 SQL Server (database) data warehousing support, 93 distributed transaction support, 361 leader-based replication, 153 preventing lost updates, 245 preventing write skew, 248, 257 read committed isolation, 236 recursive query support, 54 serializable isolation, 257 snapshot isolation support, 239 T-SQL language, 255 XML support, 30 SQLstream (stream analytics), 466 SSDs (see solid state drives) SSTables (storage format), 76-79 advantages over hash indexes, 76 concatenated index, 204 constructing and maintaining, 78 making LSM-Tree from, 78 staleness (old data), 
162 cross-channel timing dependencies, 331 in leaderless databases, 178 in multi-version concurrency control, 263 monitoring for, 182 of client state, 512 versus linearizability, 324 versus timeliness, 524 standbys (see leader-based replication) star replication topologies, 175 star schemas, 93-95 similarity to event sourcing, 458 Star Wars analogy (event time versus process‐ ing time), 469 state derived from log of immutable events, 459 deriving current state from the event log, 458 interplay between state changes and appli‐ cation code, 507 maintaining derived state, 495 maintenance by stream processor in streamstream joins, 473 observing derived state, 509-515 rebuilding after stream processor failure, 478 separation of application code and, 505 state machine replication, 349, 452 statement-based replication, 158 statically typed languages analogy to schema-on-write, 40 code generation and, 127 statistical and numerical algorithms, 428 StatsD (metrics aggregator), 442 stdin, stdout, 395, 396 Stellar (cryptocurrency), 532 Index | 585 stock market feeds, 442 STONITH (Shoot The Other Node In The Head), 158 stop-the-world (see garbage collection) storage composing data storage technologies, 499-504 diversity of, in MapReduce, 415 Storage Area Network (SAN), 146, 398 storage engines, 69-104 column-oriented, 95-101 column compression, 97-99 defined, 96 distinction between column families and, 99 Parquet, 96, 131 sort order in, 99-100 writing to, 101 comparing requirements for transaction processing and analytics, 90-96 in-memory storage, 88 durability, 227 row-oriented, 70-90 B-trees, 79-83 comparing B-trees and LSM-trees, 83-85 defined, 96 log-structured, 72-79 stored procedures, 161, 253-255, 557 and total order broadcast, 349 pros and cons of, 255 similarity to stream processors, 505 Storm (stream processor), 466 distributed RPC, 468, 514 Trident state handling, 478 straggler events, 470, 498 stream processing, 464-481, 557 accessing external services within job, 
474, 477, 478, 517 combining with batch processing lambda architecture, 497 unifying technologies, 498 comparison to batch processing, 464 complex event processing (CEP), 465 fault tolerance, 476-479 atomic commit, 477 idempotence, 478 microbatching and checkpointing, 477 rebuilding state after a failure, 478 for data integration, 494-498 586 | Index maintaining derived state, 495 maintenance of materialized views, 467 messaging systems (see messaging systems) reasoning about time, 468-472 event time versus processing time, 469, 477, 498 knowing when window is ready, 470 types of windows, 472 relation to databases (see streams) relation to services, 508 search on streams, 467 single-threaded execution, 448, 463 stream analytics, 466 stream joins, 472-476 stream-stream join, 473 stream-table join, 473 table-table join, 474 time-dependence of, 475 streams, 440-451 end-to-end, pushing events to clients, 512 messaging systems (see messaging systems) processing (see stream processing) relation to databases, 451-464 (see also changelogs) API support for change streams, 456 change data capture, 454-457 derivative of state by time, 460 event sourcing, 457-459 keeping systems in sync, 452-453 philosophy of immutable events, 459-464 topics, 440 strict serializability, 329 strong consistency (see linearizability) strong one-copy serializability, 329 subjects, predicates, and objects (in triplestores), 55 subscribers (message streams), 440 (see also consumers) supercomputers, 275 surveillance, 537 (see also privacy) Swagger (service definition format), 133 swapping to disk (see virtual memory) synchronous networks, 285, 557 comparison to asynchronous networks, 284 formal model, 307 synchronous replication, 154, 557 chain replication, 155 conflict detection, 172 system models, 300, 306-310 assumptions in, 528 correctness of algorithms, 308 mapping to the real world, 309 safety and liveness, 308 systems of record, 386, 557 change data capture, 454, 491 treating event log as, 460 
systems thinking, 536 T t-digest (algorithm), 16 table-table joins, 474 Tableau (data visualization software), 416 tail (Unix tool), 447 tail vertex (property graphs), 51 Tajo (query engine), 93 Tandem NonStop SQL (database), 200 TCP (Transmission Control Protocol), 277 comparison to circuit switching, 285 comparison to UDP, 283 connection failures, 280 flow control, 282, 441 packet checksums, 306, 519, 529 reliability and duplicate suppression, 517 retransmission timeouts, 284 use for transaction sessions, 229 telemetry (see monitoring) Teradata (database), 93, 200 term-partitioned indexes, 208, 217 termination (consensus), 365 Terrapin (database), 413 Tez (dataflow engine), 421-423 fault tolerance, 422 support by higher-level tools, 427 thrashing (out of memory), 297 threads (concurrency) actor model, 138, 468 (see also message-passing) atomic operations, 223 background threads, 73, 85 execution pauses, 286, 296-298 memory barriers, 338 preemption, 298 single (see single-threaded execution) three-phase commit, 359 Thrift (data format), 117-121 BinaryProtocol, 118 CompactProtocol, 119 field tags and schema evolution, 120 throughput, 13, 390 TIBCO, 137 Enterprise Message Service, 444 StreamBase (stream analytics), 466 time concurrency and, 187 cross-channel timing dependencies, 331 in distributed systems, 287-299 (see also clocks) clock synchronization and accuracy, 289 relying on synchronized clocks, 291-295 process pauses, 295-299 reasoning about, in stream processors, 468-472 event time versus processing time, 469, 477, 498 knowing when window is ready, 470 timestamp of events, 471 types of windows, 472 system models for distributed systems, 307 time-dependence in stream joins, 475 time-of-day clocks, 288 timeliness, 524 coordination-avoiding data systems, 528 correctness of dataflow systems, 525 timeouts, 279, 557 dynamic configuration of, 284 for failover, 158 length of, 281 timestamps, 343 assigning to events in stream processing, 471 for read-after-write 
consistency, 163 for transaction ordering, 295 insufficiency for enforcing constraints, 347 key range partitioning by, 203 Lamport, 345 logical, 494 ordering events, 291, 345 Titan (database), 50 tombstones, 74, 191, 456 topics (messaging), 137, 440 total order, 341, 557 limits of, 493 sequence numbers or timestamps, 344 total order broadcast, 348-352, 493, 522 consensus algorithms and, 366-368 Index | 587 implementation in ZooKeeper and etcd, 370 implementing with linearizable storage, 351 using, 349 using to implement linearizable storage, 350 tracking behavioral data, 536 (see also privacy) transaction coordinator (see coordinator) transaction manager (see coordinator) transaction processing, 28, 90-95 comparison to analytics, 91 comparison to data warehousing, 93 transactions, 221-267, 558 ACID properties of, 223 atomicity, 223 consistency, 224 durability, 226 isolation, 225 compensating (see compensating transac‐ tions) concept of, 222 distributed transactions, 352-364 avoiding, 492, 502, 521-528 failure amplification, 364, 495 in doubt/uncertain status, 358, 362 two-phase commit, 354-359 use of, 360-361 XA transactions, 361-364 OLTP versus analytics queries, 411 purpose of, 222 serializability, 251-266 actual serial execution, 252-256 pessimistic versus optimistic concur‐ rency control, 261 serializable snapshot isolation (SSI), 261-266 two-phase locking (2PL), 257-261 single-object and multi-object, 228-232 handling errors and aborts, 231 need for multi-object transactions, 231 single-object writes, 230 snapshot isolation (see snapshots) weak isolation levels, 233-251 preventing lost updates, 242-246 read committed, 234-238 transitive closure (graph algorithm), 424 trie (data structure), 88 triggers (databases), 161, 441 implementing change data capture, 455 implementing replication, 161 588 | Index triple-stores, 55-59 SPARQL query language, 59 tumbling windows (stream processing), 472 (see also windows) in microbatching, 477 tuple spaces (programming 
model), 507 Turtle (RDF data format), 56 Twitter constructing home timelines (example), 11, 462, 474, 511 DistributedLog (event log), 448 Finagle (RPC framework), 135 Snowflake (sequence number generator), 294 Summingbird (processing library), 497 two-phase commit (2PC), 353, 355-359, 558 confusion with two-phase locking, 356 coordinator failure, 358 coordinator recovery, 363 how it works, 357 issues in practice, 363 performance cost, 360 transactions holding locks, 362 two-phase locking (2PL), 257-261, 329, 558 confusion with two-phase commit, 356 index-range locks, 260 performance of, 258 type checking, dynamic versus static, 40 U UDP (User Datagram Protocol) comparison to TCP, 283 multicast, 442 unbounded datasets, 439, 558 (see also streams) unbounded delays, 558 in networks, 282 process pauses, 296 unbundling databases, 499-515 composing data storage technologies, 499-504 federation versus unbundling, 501 need for high-level language, 503 designing applications around dataflow, 504-509 observing derived state, 509-515 materialized views and caching, 510 multi-partition data processing, 514 pushing state changes to clients, 512 uncertain (transaction status) (see in doubt) uniform consensus, 365 (see also consensus) uniform interfaces, 395 union type (in Avro), 125 uniq (Unix tool), 392 uniqueness constraints asynchronously checked, 526 requiring consensus, 521 requiring linearizability, 330 uniqueness in log-based messaging, 522 Unix philosophy, 394-397 command-line batch processing, 391-394 Unix pipes versus dataflow engines, 423 comparison to Hadoop, 413-414 comparison to relational databases, 499, 501 comparison to stream processing, 464 composability and uniform interfaces, 395 loose coupling, 396 pipes, 394 relation to Hadoop, 499 UPDATE statement (SQL), 40 updates preventing lost updates, 242-246 atomic write operations, 243 automatically detecting lost updates, 245 compare-and-set operations, 245 conflict resolution and replication, 246 using explicit 
locking, 244 preventing write skew, 246-251 V validity (consensus), 365 vBuckets (partitioning), 199 vector clocks, 191 (see also version vectors) vectorized processing, 99, 428 verification, 528-533 avoiding blind trust, 530 culture of, 530 designing for auditability, 531 end-to-end integrity checks, 531 tools for auditable data systems, 532 version control systems, reliance on immutable data, 463 version vectors, 177, 191 capturing causal dependencies, 343 versus vector clocks, 191 Vertica (database), 93 handling writes, 101 replicas using different sort orders, 100 vertical scaling (see scaling up) vertices (in graphs), 49 property graph model, 50 Viewstamped Replication (consensus algo‐ rithm), 366 view number, 368 virtual machines, 146 (see also cloud computing) context switches, 297 network performance, 282 noisy neighbors, 284 reliability in cloud services, 8 virtualized clocks in, 290 virtual memory process pauses due to page faults, 14, 297 versus memory management by databases, 89 VisiCalc (spreadsheets), 504 vnodes (partitioning), 199 Voice over IP (VoIP), 283 Voldemort (database) building read-only stores in batch processes, 413 hash partitioning, 203-204, 211 leaderless replication, 177 multi-datacenter support, 184 rebalancing, 213 reliance on read repair, 179 sloppy quorums, 184 VoltDB (database) cross-partition serializability, 256 deterministic stored procedures, 255 in-memory storage, 89 output streams, 456 secondary indexes, 207 serial execution of transactions, 253 statement-based replication, 159, 479 transactions in stream processing, 477 W WAL (write-ahead log), 82 web services (see services) Web Services Description Language (WSDL), 133 webhooks, 443 webMethods (messaging), 137 WebSocket (protocol), 512 Index | 589 windows (stream processing), 466, 468-472 infinite windows for changelogs, 467, 474 knowing when all events have arrived, 470 stream joins within a window, 473 types of windows, 472 winners (conflict resolution), 173 WITH 
RECURSIVE syntax (SQL), 54 workflows (MapReduce), 402 outputs, 411-414 key-value stores, 412 search indexes, 411 with map-side joins, 410 working set, 393 write amplification, 84 write path (derived data), 509 write skew (transaction isolation), 246-251 characterizing, 246-251, 262 examples of, 247, 249 materializing conflicts, 251 occurrence in practice, 529 phantoms, 250 preventing in snapshot isolation, 262-265 in two-phase locking, 259-261 options for, 248 write-ahead log (WAL), 82, 159 writes (database) atomic write operations, 243 detecting writes affecting prior reads, 264 preventing dirty writes with read commit‐ ted, 235 WS-* framework, 133 (see also services) WS-AtomicTransaction (2PC), 355 590 | Index X XA transactions, 355, 361-364 heuristic decisions, 363 limitations of, 363 xargs (Unix tool), 392, 396 XML binary variants, 115 encoding RDF data, 57 for application data, issues with, 114 in relational databases, 30, 41 XSL/XPath, 45 Y Yahoo!

Derived data systems: Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes, and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs. Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized. You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”
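A minimal Python sketch of the two ideas in this excerpt: a summary derived from usage logs, and a cache that falls back to the underlying source. The log contents and the names `USAGE_LOG`, `count_views`, `derive_popularity`, and `ReadThroughCache` are hypothetical, chosen only for illustration; they do not come from the book.

```python
from collections import Counter

# Hypothetical system of record: an append-only usage log of (user, item) views.
USAGE_LOG = [
    ("alice", "book-1"),
    ("bob", "book-1"),
    ("alice", "book-2"),
    ("carol", "book-1"),
    ("bob", "book-3"),
]


def count_views(item):
    """Answer a query directly from the source by scanning the whole log."""
    return sum(1 for _user, logged_item in USAGE_LOG if logged_item == item)


def derive_popularity(log):
    """Derive a summary dataset (view counts per item) from the raw log."""
    return Counter(item for _user, item in log)


class ReadThroughCache:
    """Serve from the cache when possible, otherwise fall back to the source."""

    def __init__(self, load_from_source):
        self._load = load_from_source
        self._entries = {}
        self.misses = 0

    def get(self, key):
        if key not in self._entries:
            # Cache miss: recompute the value from the original source.
            self.misses += 1
            self._entries[key] = self._load(key)
        return self._entries[key]

    def clear(self):
        """Simulate losing the derived data."""
        self._entries.clear()


if __name__ == "__main__":
    cache = ReadThroughCache(count_views)
    print(cache.get("book-1"))   # first read falls back to the log
    print(cache.get("book-1"))   # second read is served from the cache
    cache.clear()                # the derived data is lost...
    print(cache.get("book-1"))   # ...and rebuilt from the source
    print(derive_popularity(USAGE_LOG).most_common(1))
```

Because both the cache and the popularity summary can be rebuilt from `USAGE_LOG` at any time, losing either costs only recomputation, which is the sense in which the excerpt calls derived data redundant.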


pages: 201 words: 63,192

Graph Databases by Ian Robinson, Jim Webber, Emil Eifrem

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Amazon Web Services, anti-pattern, bioinformatics, commoditize, corporate governance, create, read, update, delete, data acquisition, en.wikipedia.org, fault tolerance, linked data, loose coupling, Network effects, recommendation engine, semantic web, sentiment analysis, social graph, software as a service, SPARQL, web application

As in the social use case, making an effective recommendation depends on understanding the connections between things, as well as the quality and strength of those connections, all of which are best expressed as a property graph. Queries are primarily graph local, in that they start with one or more identifiable subjects, whether people or resources, and thereafter discover surrounding portions of the graph. Taken together, social networks and recommendation engines provide key differentiating capabilities in the areas of retail, recruitment, sentiment analysis, search, and knowledge management. Graphs are a good fit for the densely connected data structures germane to each of these areas; storing and querying this data using a graph database allows an application to surface end-user realtime results that reflect recent changes to the data, rather than pre-calculated, stale results.
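The following Python sketch illustrates what a graph-local query over a property graph might look like. It is not a graph database and not the book's code: the edge list, the `strength` property, the scoring rule (multiplying strengths along a path), and the function name `recommend` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical property graph: (source, edge type, target, properties).
EDGES = [
    ("alice", "BOUGHT", "dune", {"strength": 1.0}),
    ("alice", "BOUGHT", "neuromancer", {"strength": 0.5}),
    ("bob", "BOUGHT", "dune", {"strength": 1.0}),
    ("bob", "BOUGHT", "hyperion", {"strength": 0.8}),
    ("carol", "BOUGHT", "neuromancer", {"strength": 1.0}),
    ("carol", "BOUGHT", "snow-crash", {"strength": 0.6}),
]

# Index BOUGHT edges in both directions so a traversal can follow them
# from a person to items and from an item back to people.
outgoing = defaultdict(list)
incoming = defaultdict(list)
for source, edge_type, target, props in EDGES:
    if edge_type == "BOUGHT":
        outgoing[source].append((target, props["strength"]))
        incoming[target].append((source, props["strength"]))


def recommend(person):
    """Start at one subject and explore only its neighbourhood:
    person -> items -> other buyers -> their other items."""
    owned = {item for item, _ in outgoing[person]}
    scores = defaultdict(float)
    for item, w_own in outgoing[person]:
        for peer, w_peer in incoming[item]:
            if peer == person:
                continue
            for candidate, w_candidate in outgoing[peer]:
                if candidate not in owned:
                    # Assumed scoring rule: product of edge strengths on the path.
                    scores[candidate] += w_own * w_peer * w_candidate
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for title, score in recommend("alice"):
        print(f"{title}: {score:.2f}")
```

The traversal touches only the nodes reachable from the starting person, so its cost depends on the size of that neighbourhood rather than on the size of the whole dataset.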

• Foreign key constraints add additional development and maintenance overhead just to make the database work.
• Sparse tables with nullable columns require special checking in code, despite the presence of a schema.
• Several expensive joins are needed just to discover what a customer bought.
• Reciprocal queries are even more costly. “What products did a customer buy?” is relatively cheap compared to “which customers bought this product?”, which is the basis of recommendation systems. We could introduce an index, but even with an index, recursive questions such as “which customers bought this product who also bought that product?” quickly become prohibitively expensive as the degree of recursion increases.

Relational databases struggle with highly-connected domains. To understand the cost of performing connected queries in a relational database, we’ll look at some simple and not-so-simple queries in a social network domain.
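To make the shape of these queries concrete, here is a small sketch using Python's built-in `sqlite3` module. The schema, table names, and sample rows are hypothetical and are not taken from the book; the point is only to show how many joins each question needs, not to measure cost.

```python
import sqlite3

# Hypothetical schema: customers, products, and a join table of purchases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE purchases (
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        product_id  INTEGER NOT NULL REFERENCES products(id)
    );
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO products  VALUES (10, 'tent'), (11, 'stove'), (12, 'lantern');
    INSERT INTO purchases VALUES (1, 10), (1, 11), (2, 10), (2, 12), (3, 11);
""")


def products_bought_by(customer):
    """What products did a customer buy? Two joins."""
    rows = conn.execute("""
        SELECT p.name
        FROM customers c
        JOIN purchases pu ON pu.customer_id = c.id
        JOIN products p   ON p.id = pu.product_id
        WHERE c.name = ?
        ORDER BY p.name
    """, (customer,))
    return [name for (name,) in rows]


def customers_who_bought(product):
    """The reciprocal query: which customers bought this product?"""
    rows = conn.execute("""
        SELECT c.name
        FROM products p
        JOIN purchases pu ON pu.product_id = p.id
        JOIN customers c  ON c.id = pu.customer_id
        WHERE p.name = ?
        ORDER BY c.name
    """, (product,))
    return [name for (name,) in rows]


def also_bought(product):
    """One level of recursion: the join table has to be joined to itself."""
    rows = conn.execute("""
        SELECT also.name, COUNT(*) AS buyers
        FROM products wanted
        JOIN purchases pu1 ON pu1.product_id = wanted.id
        JOIN purchases pu2 ON pu2.customer_id = pu1.customer_id
                          AND pu2.product_id != wanted.id
        JOIN products also ON also.id = pu2.product_id
        WHERE wanted.name = ?
        GROUP BY also.name
        ORDER BY buyers DESC, also.name
    """, (product,))
    return rows.fetchall()


if __name__ == "__main__":
    print(products_bought_by("alice"))
    print(customers_who_bought("tent"))
    print(also_bought("tent"))
```

Each further degree of recursion (for example, products bought by customers who bought something that buyers of this product also bought) adds another self-join on `purchases`, which is the growth the excerpt warns about.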


pages: 398 words: 86,855

Bad Data Handbook by Q. Ethan McCallum

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Amazon Mechanical Turk, asset allocation, barriers to entry, Benoit Mandelbrot, business intelligence, cellular automata, chief data officer, Chuck Templeton: OpenTable, cloud computing, cognitive dissonance, combinatorial explosion, commoditize, conceptual framework, database schema, en.wikipedia.org, Firefox, Flash crash, Gini coefficient, illegal immigration, iterative process, labor-force participation, loose coupling, natural language processing, Netflix Prize, quantitative trading / quantitative finance, recommendation engine, selection bias, sentiment analysis, statistical model, supply-chain management, survivorship bias, text mining, too big to fail, web application

Facebook is powered by its Open Graph, the “people and the connections they have to everything they care about.”[68] Facebook provides an API to access this social network and make it available for integration into other networked datasets. On Twitter, the network structure resulting from friends and followers leads to recommendations of “Who to follow.” On LinkedIn, network-based recommendations include “Jobs you may be interested in” and “Groups you may like.” The recommendation engine hunch.com is built on a “Taste Graph” that “uses signals from around the Web to map members with their predicted affinity for products, services, other people, websites, or just about anything, and customizes recommended topics for them.”[69] A search on Google can be considered a type of recommendation about which of possibly millions of search hits are most relevant for a particular query.

Springer-Verlag New York, Inc., New York, NY, USA.
[63] http://en.wikipedia.org/wiki/File:KochFlake.svg
[64] http://blueprints.tinkerpop.com
[65] http://gremlin.tinkerpop.com
[66] http://gremlin.tinkerpop.com/Path-Pattern
[67] Ted G. Lewis. 2009. Network Science: Theory and Applications. Wiley Publishing.
[68] http://developers.facebook.com/docs/opengraph
[69] “eBay Acquires Recommendation Engine Hunch.com,” http://www.businesswire.com/news/home/20111121005831/en
[70] Brin, S.; Page, L. 1998. “The anatomy of a large-scale hypertextual Web search engine.” Computer Networks and ISDN Systems 30: 107–117

Chapter 14. Myths of Cloud Computing
Steve Francia

Myths are an important and natural part of the emergence of any new technology, product, or idea as identified by the hype cycle.

I’ve written code to process accelerometer and hydrophone signals for analysis of dams and other large structures (as an undergraduate student in Engineering at Harvey Mudd College), analyzed recordings of calls from various species of bats (as a graduate student in Electrical Engineering at the University of Washington), built systems to visualize imaging sonar data (as a Graduate Research Assistant at the Applied Physics Lab), used large amounts of crawled web content to build content filtering systems (as the co-founder and CTO of N2H2, Inc.), designed intranet search systems for portal software (at DataChannel), and combined multiple sets of directory assistance data into a searchable website (as CTO at WhitePages.com). For the past five years or so, I’ve spent most of my time at Demand Media using a wide variety of data sources to build optimization systems for advertising and content recommendation systems, with various side excursions into large-scale data-driven search engine optimization (SEO) and search engine marketing (SEM). Most of my examples will be related to work I’ve done in Ad Optimization, Content Recommendation, SEO, and SEM. These areas, as with most, have their own terminology, so a few term definitions may be helpful.

Table 2-1. Term Definitions
Term | Definition
PPC | Pay Per Click: Internet advertising model used to drive traffic to websites with a payment model based on clicks on advertisements.


pages: 320 words: 87,853

The Black Box Society: The Secret Algorithms That Control Money and Information by Frank Pasquale

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Affordable Care Act / Obamacare, algorithmic trading, Amazon Mechanical Turk, American Legislative Exchange Council, asset-backed security, Atul Gawande, bank run, barriers to entry, basic income, Berlin Wall, Bernie Madoff, Black Swan, bonus culture, Brian Krebs, call centre, Capital in the Twenty-First Century by Thomas Piketty, Chelsea Manning, Chuck Templeton: OpenTable, cloud computing, collateralized debt obligation, computerized markets, corporate governance, Credit Default Swap, credit default swaps / collateralized debt obligations, crowdsourcing, cryptocurrency, Debian, don't be evil, drone strike, Edward Snowden, en.wikipedia.org, Fall of the Berlin Wall, Filter Bubble, financial innovation, financial thriller, fixed income, Flash crash, full employment, Goldman Sachs: Vampire Squid, Google Earth, Hernando de Soto, High speed trading, hiring and firing, housing crisis, informal economy, information asymmetry, information retrieval, interest rate swap, Internet of things, invisible hand, Jaron Lanier, Jeff Bezos, job automation, Julian Assange, Kevin Kelly, knowledge worker, Kodak vs Instagram, kremlinology, late fees, London Interbank Offered Rate, London Whale, Marc Andreessen, Mark Zuckerberg, mobile money, moral hazard, new economy, Nicholas Carr, offshore financial centre, PageRank, pattern recognition, Philip Mirowski, precariat, profit maximization, profit motive, quantitative easing, race to the bottom, recommendation engine, regulatory arbitrage, risk-adjusted returns, Satyajit Das, search engine result page, shareholder value, Silicon Valley, Snapchat, Spread Networks laid a new fibre optics cable between New York and Chicago, statistical arbitrage, statistical model, Steven Levy, the scientific method, too big to fail, transaction costs, two-sided market, universal basic income, Upton Sinclair, value at risk, WikiLeaks, zero-sum game

But what do we know about them? A bad credit score may cost a borrower hundreds of thousands of dollars, but he will never understand exactly how it was calculated. A predictive analytics firm may score someone as a “high cost” or “unreliable” worker, yet never tell her about the decision. More benignly, perhaps, these companies influence the choices we make ourselves. Recommendation engines at Amazon and YouTube affect an automated familiarity, gently suggesting offerings they think we’ll like. But don’t discount the significance of that “perhaps.” The economic, political, and cultural agendas behind their suggestions are hard to unravel. As middlemen, they specialize in shifting alliances, sometimes advancing the interests of customers, sometimes suppliers: all to orchestrate an online world that maximizes their own profits.

In short, they improve the quality of our daily lives in ways both noticeable and not. But where do we call a halt? Similar protocols also influence—invisibly—not only the route we take to a new restaurant, but which restaurant Google, Yelp, OpenTable, or Siri recommends to us. They might help us find reviews of the car we drive. Yet choosing a car, or even a restaurant, is not as straightforward as optimizing an engine or routing a drive. Does the recommendation engine take into account, say, whether the restaurant or car company gives its workers health benefits or maternity leave? Could we prompt it to do so? In their race for the most profitable methods of mapping social reality, the data scientists of Silicon Valley and Wall Street tend to treat recommendations as purely technical problems. The values and prerogatives that the encoded rules enact are hidden within black boxes. The most obvious question is: Are these algorithmic applications fair?

Even if it is the former, we should note that Google’s autosuggest feature may have automatically entered the word “bomb” after “pressure cooker” while he was typing—certainly many people would have done the search in the days after the Boston bombing merely to learn just how lethal such an attack could be. The police had no way of knowing whether Catalano had actually typed “bomb” himself, or accidentally clicked on it thanks to Google’s increasingly aggressive recommendation engines. See also Philip Bump, “Update: Now We Know Why Googling ‘Pressure Cookers’ Gets a Visit from the Cops,” The Wire, August 1, 2013, http://www.thewire.com/national/2013/08/government-knocking-doors-because-google-searches/67864/#.UfqCSAXy7zQ.facebook. 10. Martin Kuhn, Federal Dataveillance: Implications for Constitutional Privacy Protections (New York: LFB Scholarly Publishing, 2007), 178. 11.


pages: 254 words: 79,052

Evil by Design: Interaction Design to Lead Us Into Temptation by Chris Nodder

4chan, affirmative action, Amazon Mechanical Turk, cognitive dissonance, crowdsourcing, Daniel Kahneman / Amos Tversky, Donald Trump, en.wikipedia.org, endowment effect, game design, haute couture, jimmy wales, Jony Ive, Kickstarter, late fees, loss aversion, Mark Zuckerberg, meta analysis, meta-analysis, Milgram experiment, Netflix Prize, Nick Leeson, Occupy movement, pets.com, price anchoring, recommendation engine, Rory Sutherland, Silicon Valley, stealth mode startup, Steve Jobs, telemarketer, Tim Cook: Apple, trickle-down economics, upwardly mobile

To reduce the confusion caused by the number of options while still retaining the perception of quality, many sites employ recommendation engines or filters. Recommendation engines provide a small set of options based on either comparison with prior behavior or on answers to a set of preference questions. Netflix uses a recommendation engine to suggest new movies based on ones that customers have already watched. Its business is so dependent upon this functionality that it recently offered a one million dollar prize to anyone who could increase the accuracy of the engine by more than 10 percent. Currently, 75 percent of movies watched on Netflix come from a recommendation made by the site. Recommendation engines are a great way to limit choice from an otherwise overwhelming quantity of items. (Netflix.com) Filters rely less on preference algorithms and more on on-screen choices.

However, maximizers like knowing that they chose from all the available options, so just presenting three alternatives may not be sufficient. The problem here is that they may look to other sellers for more choices. So the trick is to demonstrate that you have sufficient options to keep the maximizers happy but also provide tools that allow both the maximizers and the satisficers to find the options they want quickly. The three techniques you can use (alone or in combination) are to present many compatible choices, to use a recommendation engine or filter, and to offer a best choice guarantee. Brands that offer greater variety of compatible (that is, focused and internally consistent) options are perceived as having greater commitment and expertise in the category, which, in turn, enhances their perceived quality and purchase likelihood. When you want to increase the perceived importance of making the decision, allow users to choose between multiple similar options (all with positive outcomes for you).

How to design for fewer options If you want users to make a quick decision about your services, don’t give them too many options. More choices lead to more procrastination. Conversely, if you want to increase the perceived importance of a decision, or if customization is important, ensure that the only choices available to users are between multiple compatible options within narrow boundaries. If you have a larger number of items, use a recommendation engine or filters to quickly bring the number down to a manageable set. If you can’t easily reduce the number of available items, speed people to a decision by reassuring them with a best-choice guarantee. Pre-pick your preferred option Prime people so that they are open to accepting the choice you highlight. Psychologists have known for a long time that if they show you specific words or pictures beforehand, you’ll find it easier to recall those items or related ones in a later test, even after you have consciously forgotten the specific words.


pages: 481 words: 125,946

What to Think About Machines That Think: Today's Leading Thinkers on the Age of Machine Intelligence by John Brockman

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

3D printing, agricultural Revolution, AI winter, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, algorithmic trading, artificial general intelligence, augmented reality, autonomous vehicles, basic income, bitcoin, blockchain, clean water, cognitive dissonance, Colonization of Mars, complexity theory, computer age, computer vision, constrained optimization, corporate personhood, cosmological principle, cryptocurrency, cuban missile crisis, Danny Hillis, dark matter, discrete time, Douglas Engelbart, Elon Musk, Emanuel Derman, endowment effect, epigenetics, Ernest Rutherford, experimental economics, Flash crash, friendly AI, functional fixedness, Google Glasses, hive mind, income inequality, information trail, Internet of things, invention of writing, iterative process, Jaron Lanier, job automation, John Markoff, John von Neumann, Kevin Kelly, knowledge worker, loose coupling, microbiome, Moneyball by Michael Lewis explains big data, natural language processing, Network effects, Norbert Wiener, pattern recognition, Peter Singer: altruism, phenotype, planetary scale, Ray Kurzweil, recommendation engine, Republic of Letters, RFID, Richard Thaler, Rory Sutherland, Satyajit Das, Search for Extraterrestrial Intelligence, self-driving car, sharing economy, Silicon Valley, Skype, smart contracts, speech recognition, statistical model, stem cell, Stephen Hawking, Steve Jobs, Steven Pinker, Stewart Brand, strong AI, Stuxnet, superintelligent machines, supervolcano, the scientific method, The Wisdom of Crowds, theory of mind, Thorstein Veblen, too big to fail, Turing machine, Turing test, Von Neumann architecture, Watson beat the top human players on Jeopardy!, Y2K

Is it possible to create an artificial mentor for each student? We already have recommender systems on the Internet that tell us, “If you liked X, you might also like Y,” based on data of many others with similar patterns of preference. Someday the mind of each student may be tracked from childhood by a personalized deep-learning system. To achieve this level of understanding of a human mind is beyond the capabilities of current technology, but there are already efforts at Facebook to use their vast social database of friends, photos, and likes to create a Theory of Mind for every person on the planet. So my prediction is that as more and more cognitive appliances, like chess-playing programs and recommender systems are devised, humans will become smarter and more capable. SHALLOW LEARNING SETH LLOYD Professor of quantum mechanical engineering, MIT; author, Programming the Universe Pity the poor folks at the National Security Agency: They’re spying on everyone (quelle surprise!)

Conceptually, autonomous or artificial intelligence systems can develop in two ways: either as an extension of human thinking or as radically new thinking. Call the first “Humanoid Thinking,” or Humanoid AI, and the second “Alien Thinking,” or Alien AI. Almost all AI today is Humanoid Thinking. We use AI to solve problems too difficult, time-consuming, or boring for our limited brains to process: electrical-grid balancing, recommendation engines, self-driving cars, face recognition, trading algorithms, and the like. These artificial agents work in narrow domains with clear goals their human creators specify. Such AI aims to accomplish human objectives—often better, with fewer cognitive errors, distractions, outbursts of bad temper, or processing limitations. In a couple of decades, AI agents might serve as virtual insurance sellers, doctors, psychotherapists, and maybe even virtual spouses and children.

He implies that the Age of the Thinking Machine is resulting in ossification rather than renewal. As our lives become increasingly recorded, archived, and accessed, we have become cannibals driven to consume our history and terrified of transgressing its established norms. To some extent, the future is blocked to us; we’re stuck in stasis; we’re stuck with a version of ourselves that’s becoming increasingly narrow. No thanks to recent tools such as “recommender systems,” we’re lodged in a seemingly endless feedback loop of “If you liked that, you’ll love this.” As we might become increasingly stuck in Curtis’s idea of the “you-loop,” so the nature of what it means to be human might be compromised by job-hogging machines that will render many of us obsolete. This Edge Question points to the next chapter in human history/evolution; we’re facing the beginning of a new definition of man, a new civilization.


pages: 504 words: 67,845

Designing Web Interfaces: Principles and Patterns for Rich Interactions by Bill Scott, Theresa Neil

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

A Pattern Language, anti-pattern, en.wikipedia.org, Firefox, recommendation engine, Ruby on Rails, Silicon Valley, web application

An alternate approach would be to hide them and show them on mouse hover (we will discuss this approach in the next section). It turns out that voting and rating systems are the most common places to make tools always visible. Netflix was the earliest to use a one-click rating system (Figure 4-4). Figure 4-4. Netflix star ratings are always visible Just as with Digg, rating movies is central to the health of Netflix. The Cinematch™ recommendation engine is driven largely by the user's ratings. So a clear call to action (to rate) is important. Not only do the stars serve as a strong call to action to rate movies, but they also provide important information for the other in-context tool: the "Add" button. Adding movies to your movie-shipping queue is key to having a good experience with the Netflix service. Relative importance One way to clarify this process is to decide on the relative importance of each exposed action.

Quick and easy The Gap integrates the shopping cart into its entire site as a drop-down shade. In fact, the Gap, Old Navy, Banana Republic, and PiperLime all share the same Inline Assistant Process-style shopping cart. The Gap is betting that making it quick and easy to add items to the cart across four stores will equal more sales. Additional step Amazon, on the other hand, is betting on its recommendation engine. By going to a second page, Amazon can display other shirts like the one added—as well as advertise the Amazon.com Visa card (Figure 8-8). Figure 8-8. Amazon shows recommendations when confirming an add to its shopping cart Which is the better experience? The Gap seems to be the clear winner in pure user experience. But which brings in more money? It's a question we cannot answer, but the right one for any site to ask

This is what Netflix does when a user adds movies to his shipping queue (Figure 8-9). Figure 8-9. Netflix displays its recommendations in an overlay Each movie on the site has an "Add" button. Clicking "Add" immediately adds the movie to the user's queue. As a confirmation and an opportunity for recommendations, a Dialog Overlay is displayed on top of the movie page. Just like Amazon, Netflix has a sophisticated recommendation engine. The bet is that since the user has expressed interest in an item (shirt or movie), the site can find other items similar to it to suggest. Amazon does this in a separate page. Netflix does it in an overlay that is easily dismissed by clicking anywhere outside the overlay (or by clicking the close button at the top or bottom). In a previous version of Netflix (or if JavaScript is disabled), this becomes a multiple-page experience (Figure 8-10).


pages: 406 words: 88,820

Television disrupted: the transition from network to networked TV by Shelly Palmer

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

barriers to entry, call centre, commoditize, disintermediation, en.wikipedia.org, hypertext link, interchangeable parts, invention of movable type, Irwin Jacobs: Qualcomm, James Watt: steam engine, Leonard Kleinrock, linear programming, Marc Andreessen, market design, Metcalfe’s law, pattern recognition, peer-to-peer, recommendation engine, Saturday Night Live, shareholder value, Skype, spectrum auction, Steve Jobs, subscription business, Telecommunications Act of 1996, There's no reason for any individual to have a computer in his home - Ken Olsen, Vickrey auction, Vilfredo Pareto, yield management

We could probably list dozens of reasons why a person might choose to be his or her own program director. The key problem with on-demand technology is not desire; it is complexity. It’s just too hard for the average person to do. Now, making a playlist in iTunes could not be simpler. But, putting your iPod in shuffle mode is actually easier, and it is also the path of least resistance. There are other factors that help with playlist creation. Recommendation engines and collaborative filtering like Amazon’s “if you like this … you might also like …” are good ways to help people pick the right stuff for their playlists. Consumers can also skew shuffle modes, setting them to play the content they manually play the most more often than the content they play less often. Of course, all of this technology requires consumers to collect all of their media into one place.

You can (and should) ask the same question about high traffic Web sites like Google, Yahoo!, MSN, Amazon, eBay, and of course, about every existing broadcast and cable network. A trip to the video section of the Apple Music Store through iTunes is a very interesting experience, particularly when you see how the interface handles show branding vs. network branding. Social Search Solution Another probable future is Tim Halle’s vision of a “social search,” a recommendation system that will emerge from social networking sites. Of course, the biggest social networking sites like friendster.com or myspace.com are also big brands, so this may be just another permutation of branded search. (See “Folksonomy” in Chapter 6.)


pages: 283 words: 85,824

The People's Platform: Taking Back Power and Culture in the Digital Age by Astra Taylor

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

A Declaration of the Independence of Cyberspace, American Legislative Exchange Council, Andrew Keen, barriers to entry, Berlin Wall, big-box store, Brewster Kahle, citizen journalism, cloud computing, collateralized debt obligation, Community Supported Agriculture, conceptual framework, corporate social responsibility, creative destruction, cross-subsidies, crowdsourcing, David Brooks, digital Maoism, disintermediation, don't be evil, Donald Trump, Edward Snowden, Fall of the Berlin Wall, Filter Bubble, future of journalism, George Gilder, Google Chrome, Google Glasses, hive mind, income inequality, informal economy, Internet Archive, Internet of things, invisible hand, Jane Jacobs, Jaron Lanier, Jeff Bezos, job automation, John Markoff, Julian Assange, Kevin Kelly, Kickstarter, knowledge worker, Mark Zuckerberg, means of production, Metcalfe’s law, Naomi Klein, Narrative Science, Network effects, new economy, New Journalism, New Urbanism, Nicholas Carr, oil rush, peer-to-peer, Peter Thiel, Plutocrats, plutocrats, pre–internet, profit motive, recommendation engine, Richard Florida, Richard Stallman, self-driving car, shareholder value, sharing economy, Silicon Valley, Silicon Valley ideology, slashdot, Slavoj Žižek, Snapchat, social graph, Steve Jobs, Stewart Brand, technoutopianism, trade route, Whole Earth Catalog, WikiLeaks, winner-take-all economy, Works Progress Administration, young professional

A more democratic culture is one where previously excluded populations are given the material means to fully engage. To create a culture that is more diverse and inclusive, we have to pioneer ways of addressing discrimination and bias head-on, despite the difficulties of applying traditional methods of mitigating prejudice to digital networks. We have to shape our tools of discovery, the recommendation engines and personalization filters, so they do more than reinforce our prior choices and private bubbles. Finally, if we want a culture that is more resistant to the short-term expectations of corporate shareholders and the whims of marketers, we have to invest in noncommercial enterprises. There is no shortage of good ideas. By not experimenting, we court disillusionment. The Internet was supposed to be free and ubiquitous, but a cable cartel would rather rake in profits than provide universal service.

,” Wired, blog post, November 15, 2008, http://www.longtail.com/the_long_tail/2008/11/does-the-long-t.html. 35. Fang Wu and Bernardo A. Huberman, “The Persistence Paradox,” First Monday 15, nos. 1–4 (January 2010). 36. James Evans, “Electronic Publication and the Narrowing of Science and Scholarship,” Science 321, no. 5887 (July 18, 2008): 395–99. 37. Daniel M. Fleder and Kartik Hosanagar, “Blockbuster Culture’s Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity,” Management Science 55, no. 5 (May 2009): 697–712. 38. Evan Hughes, “Here’s How Amazon Self-Destructs,” Salon, July 19, 2013. 39. Gary Flake et al., “Winners Don’t Take All: Characterizing the Competition for Links on the Web,” Proceedings of the National Academy of Sciences 99, no. 8 (April 16, 2002). 40. Eli Pariser, The Filter Bubble: What the Internet Is Hiding from You (New York: Penguin Press, 2011), 128. 41.


pages: 344 words: 96,020

Hacking Growth: How Today's Fastest-Growing Companies Drive Breakout Success by Sean Ellis, Morgan Brown

Airbnb, Amazon Web Services, barriers to entry, bounce rate, business intelligence, business process, correlation does not imply causation, crowdsourcing, DevOps, Elon Musk, game design, Google Glasses, Internet of things, inventory management, iterative process, Jeff Bezos, Khan Academy, Lean Startup, Lyft, Mark Zuckerberg, market design, minimum viable product, Network effects, Paul Graham, Peter Thiel, Ponzi scheme, recommendation engine, ride hailing / ride sharing, side project, Silicon Valley, Silicon Valley startup, Skype, Snapchat, software as a service, Steve Jobs, subscription business, Uber and Lyft, Uber for X, working poor, Y Combinator, young professional

Personalization is also a good monetization tactic, and particularly effective are customized recommendations, usually delivered on the site or in the app while a customer is visiting, and also through email and mobile push messages. Amazon is, once again, a leading practitioner, having developed one of the most powerful “recommendation engines,” the term for the algorithmic programs that customize which items are recommended to you while browsing the site. The selections are based on a combination of a customer’s search history and buying habits, and data about the habits of other shoppers like that customer. All Amazon shoppers in effect see their own version of Amazon with a unique experience tailored to their preferences. Some recommendation engines, such as Amazon’s, as well as those deployed by Google and Netflix, are incredibly complex, but many are based on relatively simple math. As Colin Zima, the chief analytics officer at Looker, a business intelligence software, explains, it can be relatively easy to generate recommendations based on a simple formula called a Jaccard index, or Jaccard similarity coefficient, which determines how similar two products are to each other.
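As a rough illustration of the Jaccard index described in this excerpt, here is a minimal Python sketch. It assumes each product is represented by the set of order IDs it appeared in; the product names, order IDs, and the helper names `jaccard` and `most_similar` are hypothetical and are not taken from the book or from any retailer's system.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        # Two empty sets share nothing; treat their similarity as 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical data: product -> set of order IDs that contained it.
orders_by_product = {
    "peanut butter": {1, 2, 3, 4},
    "jelly": {2, 3, 4, 5},
    "laundry detergent": {1, 6, 7, 8},
}


def most_similar(product: str, orders: dict) -> list:
    """Rank every other product by Jaccard similarity to `product`."""
    scores = [
        (other, jaccard(orders[product], baskets))
        for other, baskets in orders.items()
        if other != product
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("peanut butter", orders_by_product):
        print(f"{other}: {score:.2f}")
```

With this made-up data, peanut butter and jelly share three of five distinct orders, so they score well above peanut butter and laundry detergent, which share one of seven. A production system would likely add minimum-support thresholds and avoid comparing every pair of products directly.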

In contrast, the score for peanut butter and, for example, laundry detergent will almost surely be much lower. This calculation can be done for a host of combinations of every item in the store, creating powerful recommendations that lead to more purchases. And with the best recommendation engines, these product suggestions will only get better and more personalized over time because the more customers shop, the more data is available not just about what an individual customer has purchased, but also about common patterns among a large pool of shoppers. The grocery app recommendation engine might, for example, recommend seltzer water and limes when a shopper puts Red Bull in her shopping cart—even if that shopper has no history of buying any of those products—based on data that shows most people buying Red Bull are purchasing mixers for vodka. DON’T BE INTRUSIVE An important word of caution about customizing is that it can backfire if you’re not sensitive about how you’re doing it.


pages: 247 words: 81,135

The Great Fragmentation: And Why the Future of All Business Is Small by Steve Sammartino

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

3D printing, additive manufacturing, Airbnb, augmented reality, barriers to entry, Bill Gates: Altair 8800, bitcoin, BRICs, Buckminster Fuller, citizen journalism, collaborative consumption, cryptocurrency, David Heinemeier Hansson, Elon Musk, fiat currency, Frederick Winslow Taylor, game design, Google X / Alphabet X, haute couture, helicopter parent, illegal immigration, index fund, Jeff Bezos, jimmy wales, Kickstarter, knowledge economy, Law of Accelerating Returns, lifelogging, market design, Metcalfe's law, Metcalfe’s law, Minecraft, minimum viable product, Network effects, new economy, peer-to-peer, post scarcity, prediction markets, pre–internet, profit motive, race to the bottom, random walk, Ray Kurzweil, recommendation engine, remote working, RFID, Rubik’s Cube, self-driving car, sharing economy, side project, Silicon Valley, Silicon Valley startup, skunkworks, Skype, social graph, social web, software is eating the world, Steve Jobs, survivorship bias, too big to fail, US Airways Flight 1549, web application, zero-sum game

Creative types Collaboration, creative orientation and counter intuition Note Chapter 6: Demographics is history: moving on from predictive marketing How to get profiled The price of pop culture The best average The weapon of choice Don’t fence me in How do you define a teenager? Stealing music or connecting? Marketing 1.0 Marketing revised The new intersection Social + interests = intention The story of cities Do I know you? The interest graph in action The anti-demographic recommendation engine Chapter 7: The truth about pricing: technology and omnipresent deflation Technology deflation Real-world technology deflation The free super computer The crux is human It’s getting quicker Technology curve jumping Technology stacking Omnipresent deflation Consumer price index trickery Connections and the impact on prices Economic border hopping The new minimum wage Notes Chapter 8: A zero-barrier world: how access to knowledge is breaking down barriers So what’s changed?

They focused on direct connection, one new fan at a time. They didn’t try to build an audience. They helped a person, which is a very different approach. It seems old-school BMXers are a little bit smarter than old-school marketers. What a great way to build a community; one that I’m now a part of. While everyone gets enamoured with ‘big data’, there’s probably a lot more we can do with ‘little data’. The anti-demographic recommendation engine A lot of e-commerce platforms and social-media engines seem to be able to do what mainstream marketers could never quite pull off. Every day, I’m exposed to products and services that I have zero interest in ever purchasing, mainly due to the laziness of the marketers who allocate the budget behind them. But occasionally I’m utterly inspired and thankful when great marketers (with permission) introduce me to things that are just perfectly suited.

Twitter is terrific at this with its who-to-follow recommendations. But the best example has to be Amazon’s ‘Recommended for you’ books. It’s always spot on, sitting perfectly in the centre of my personal interest graph, based on the simplicity of what I’ve bought, looked at, wish listed and what others have in their list when there are overlaps. For me personally, it’s very accurate indeed. What’s interesting is that this recommendation engine is what I’d coin an ‘anti-demographic’ profiler: It doesn’t care what sex I am. It doesn’t care where I live. It doesn’t care or know how much I earn. It doesn’t care if I finished school. None of this matters. What matters is the direct connection and the reality of my interests based on my digital footprint. It’s the type of efficiency that mass can never achieve. The smart marketing money now lives in a node-by-node approach.


pages: 215 words: 55,212

The Mesh: Why the Future of Business Is Sharing by Lisa Gansky

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Airbnb, Amazon Mechanical Turk, Amazon Web Services, banking crisis, barriers to entry, carbon footprint, Chuck Templeton: OpenTable, cloud computing, credit crunch, crowdsourcing, diversification, Firefox, fixed income, Google Earth, industrial cluster, Internet of things, Kickstarter, late fees, Network effects, new economy, peer-to-peer lending, recommendation engine, RFID, Richard Florida, Richard Thaler, ride hailing / ride sharing, sharing economy, Silicon Valley, smart grid, social web, software as a service, TaskRabbit, the built environment, walkable city, yield management, young professional, Zipcar

As the service developed, the company added layers of information to inform a user’s choices, such as reviews from people in the network whose profile of selections and ratings were similar. Recently, it sponsored a contest awarding a million dollars to anyone who could significantly improve the movie recommendation service. Thousands of teams from more than a hundred nations competed. Netflix’s “recommendation engine” relies on algorithms culled from masses of data collected on the Web, including that provided directly by customers. The lesson learned from the contest, according to the New York Times, was the power of collaboration, as winning teams began sharing ideas and information: “The formula for success was to bring together people with complementary skills and combine different methods of problem solving.”

See Social networking starting Mesh company Sweet Spot trends influencing growth of trust building Millennial generation Mobile networks digital translation to physical and flash branding as foundation of the Mesh share-based business operation users, increase in Modular design Mohsenin, Kamran Movie rentals online, Mesh companies Mozilla Firefox Music-based businesses, Mesh companies Natural ecosystem, relationship to Mesh ecosystem Netflix annual sales as information business Mesh strategy perfection recommendation engine recommendations Network effect Niche markets for maintaining/servicing products Mesh companies opening, reason for sharing as North Portland Tool Library (NPTL) Ofoto Olapic Ombudsman Open Architecture Network Open Design Open innovation service provider Open networks advantages of Architecture for Humanity communal IP concept and marketing products openness versus proprietary approach and product improvement software development OpenTable O’Reilly, Tim Ostrom, Elinor Own-to-Mesh model car-sharing services profits, generation from retirees as customers Partnerships characteristics of corporations and Mesh companies income generation from in Mesh ecosystem unexpected value of Patagonia recycled textiles of Walmart partnership Paul, Sunil Payne, Steven Peer-to-peer lending.


pages: 58 words: 12,386

Big Data Glossary by Pete Warden


business intelligence, crowdsourcing, fault tolerance, information retrieval, linked data, natural language processing, recommendation engine, web application

To achieve that scalability, most of the code is written as parallelizable jobs on top of Hadoop. It comes with algorithms to perform a lot of common tasks, like clustering and classifying objects into groups, recommending items based on other users’ behaviors, and spotting attributes that occur together a lot. In practical terms, the framework makes it easy to use analysis techniques to implement features such as Amazon’s “People who bought this also bought” recommendation engine on your own site. It’s a heavily used project with an active community of developers and users, and it’s well worth trying if you have any significant amount of transaction or similar data that you’d like to get more value out of.

Introducing Mahout
Using Mahout with Cassandra

scikits.learn

It’s hard to find good off-the-shelf tools for practical machine learning. Many of the projects are aimed at students and researchers who want access to the inner workings of the algorithms, which can be off-putting when you’re looking for more of a black box to solve a particular problem.
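As a rough illustration of the “People who bought this also bought” feature mentioned in the excerpt above, the sketch below counts how often items appear together in the same purchase. It is plain Python rather than Mahout's own API, and the basket data and the function names (build_cooccurrence, also_bought) are hypothetical, chosen only for this example.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical transaction data: each basket is the set of items one customer bought.
baskets = [
    {"tent", "sleeping bag", "stove"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"stove", "lantern"},
]


def build_cooccurrence(baskets):
    """Count, for every item, how often each other item shares a basket with it."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # Every ordered pair (a, b) of distinct items in the basket.
        for a, b in permutations(basket, 2):
            counts[a][b] += 1
    return counts


def also_bought(item, counts, k=2):
    """Return up to k items most often bought together with `item`."""
    # Sort by count (descending), then by name so that ties are deterministic.
    ranked = sorted(counts[item].items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:k]]


counts = build_cooccurrence(baskets)
print(also_bought("tent", counts))
```

A production system would normalize these raw counts (popular items co-occur with everything) and would run the counting as a distributed job; this sketch only shows the core idea on a handful of baskets.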


pages: 274 words: 75,846

The Filter Bubble: What the Internet Is Hiding From You by Eli Pariser


A Declaration of the Independence of Cyberspace, A Pattern Language, Amazon Web Services, augmented reality, back-to-the-land, Black Swan, borderless world, Build a better mousetrap, Cass Sunstein, citizen journalism, cloud computing, cognitive dissonance, crowdsourcing, Danny Hillis, data acquisition, disintermediation, don't be evil, Filter Bubble, Flash crash, fundamental attribution error, global village, Haight Ashbury, Internet of things, Isaac Newton, Jaron Lanier, Jeff Bezos, jimmy wales, Kevin Kelly, knowledge worker, Mark Zuckerberg, Marshall McLuhan, megacity, Metcalfe’s law, Netflix Prize, new economy, PageRank, paypal mafia, Peter Thiel, recommendation engine, RFID, Robert Metcalfe, sentiment analysis, shareholder value, Silicon Valley, Silicon Valley startup, social graph, social software, social web, speech recognition, Startup school, statistical model, stem cell, Steve Jobs, Steven Levy, Stewart Brand, technoutopianism, the scientific method, urban planning, Whole Earth Catalog, WikiLeaks, Y Combinator

In a memo for fellow progressives, Mark Steitz, one of the primary Democratic data gurus, recently wrote that “targeting too often returns to a bombing metaphor—dropping message from planes. Yet the best data tools help build relationships based on observed contacts with people. Someone at the door finds out someone is interested in education; we get back to that person and others like him or her with more information. Amazon’s recommendation engine is the direction we need to head.” The trend is clear: We’re moving from swing states to swing people. Consider this scenario: It’s 2016, and the race is on for the presidency of the United States. Or is it? It depends on who you are, really. If the data says you vote frequently and that you may have been a swing voter in the past, the race is a maelstrom. You’re besieged with ads, calls, and invitations from friends.

Quora Forum, accessed Dec. 17, 2010, www.quora.com/Facebook-company/Whats-the-history-of-the-Awesome-Button-that-eventually-became-the-Like-button-on-Facebook.
151 “against the cruise line industry”: Hollis Thomases, “Google Drops Anti-Cruise Line Ads from AdWords,” Web Ad.vantage, Feb. 13, 2004, accessed Dec. 17, 2010, www.webadvantage.net/webadblog/google-drops-anti-cruise-line-ads-from-adwords-338.
151–52 identify who was persuadable: “How Rove Targeted the Republican Vote,” Frontline, accessed Feb. 8, 2011, www.pbs.org/wgbh/pages/frontline/shows/architect/rove/metrics.html.
152 “Amazon’s recommendation engine is the direction”: Mark Steitz and Laura Quinn, “An Introduction to Microtargeting in Politics,” accessed Dec. 17, 2010, www.docstoc.com/docs/43575201/An-Introduction-to-Microtargeting-in-Politics.
153 round-the-clock “war room”: “Google’s War Room for the Home Stretch of Campaign 2010,” e.politics, Sept. 24, 2010, accessed Feb. 9, 2011, www.epolitics.com/2010/09/24/googles-war-room-for-the-home-stretch-of-campaign-2010/.
155 “campaign wanted to spend on Facebook”: Vincent R.


pages: 222 words: 70,132

Move Fast and Break Things: How Facebook, Google, and Amazon Cornered Culture and Undermined Democracy by Jonathan Taplin

1960s counterculture, 3D printing, affirmative action, Affordable Care Act / Obamacare, Airbnb, Amazon Mechanical Turk, American Legislative Exchange Council, Apple's 1984 Super Bowl advert, back-to-the-land, barriers to entry, basic income, battle of ideas, big data - Walmart - Pop Tarts, bitcoin, Brewster Kahle, Buckminster Fuller, Burning Man, Clayton Christensen, commoditize, creative destruction, crony capitalism, crowdsourcing, data is the new oil, David Brooks, David Graeber, don't be evil, Donald Trump, Douglas Engelbart, Dynabook, Edward Snowden, Elon Musk, equal pay for equal work, Erik Brynjolfsson, future of journalism, future of work, George Akerlof, George Gilder, Google bus, Hacker Ethic, Howard Rheingold, income inequality, informal economy, information asymmetry, information retrieval, Internet Archive, Internet of things, invisible hand, Jaron Lanier, Jeff Bezos, job automation, John Markoff, John Maynard Keynes: technological unemployment, John von Neumann, Joseph Schumpeter, Kevin Kelly, Kickstarter, labor-force participation, life extension, Marc Andreessen, Mark Zuckerberg, Menlo Park, Metcalfe’s law, Mother of all demos, move fast and break things, natural language processing, Network effects, new economy, Norbert Wiener, offshore financial centre, packet switching, Paul Graham, Peter Thiel, plutocrats, pre–internet, Ray Kurzweil, recommendation engine, rent-seeking, revision control, Robert Bork, Robert Gordon, Robert Metcalfe, Ronald Reagan, Sand Hill Road, secular stagnation, self-driving car, sharing economy, Silicon Valley, Silicon Valley ideology, smart grid, Snapchat, software is eating the world, Steve Jobs, Stewart Brand, technoutopianism, The Chicago School, The Market for Lemons, Tim Cook: Apple, trade route, transfer pricing, trickle-down economics, Tyler Cowen: Great Stagnation, universal basic income, unpaid internship, We wanted flying cars, instead we got 140 characters, web application, Whole Earth Catalog, winner-take-all economy, women in the workforce, Y Combinator

But it turned out it wasn’t just elite Harvard kids who wanted to fashion an online persona—it was everyone. When Thefacebook really started to grow, in the late spring of 2004, Zuckerberg and his right-hand man, Dustin Moskovitz, decided to go to Silicon Valley for the summer. Zuckerberg had met Sean Parker in a Chinese restaurant in New York in May and had been awed by his outlaw tales of Napster. Zuckerberg had written a music-recommendation engine while he was a senior at Exeter, and so Napster loomed large in his notion of hipness. When the two men got to Palo Alto in June, they ran into Parker, who was essentially homeless, having been thrown out of his latest company, Plaxo, an online address-book application. It is a tribute to Zuckerberg’s naive trust that he invited Parker to live in the house he and Moskovitz had rented. Parker promised to teach them about the shark tank known as Sand Hill Road—the center of the Valley’s venture capital business.

In Huxley’s world, the obsession with taking drugs, going to the “feelies” (his equivalent of IMAX movies), playing interactive games, and downloading porn filled the lives of the citizens. They had no time for politics or even for wondering why their horizons were so narrow. The kids attending DigiTour would fit right into the plot of Brave New World. The Internet’s self-curated view from everywhere has the amazing ability to distract us in trivial pursuits, narrow our choices, and keep us safe in a balkanized suburb of our own taste. Search engines and recommendation engines constantly favor the most popular options and constantly make our discovery more limited. I began this chapter wondering whether technology was robbing us of some of our essential humanity. Google’s chief technologist proclaims that technology will “allow us to transcend these limitations of our biological bodies and brains.… There will be no distinction, post-Singularity, between human and machine.”

Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei


bioinformatics, business intelligence, business process, Claude Shannon: information theory, cloud computing, computer vision, correlation coefficient, cyber-physical system, database schema, discrete time, distributed generation, finite state, information retrieval, iterative process, knowledge worker, linked data, natural language processing, Netflix Prize, Occam's razor, pattern recognition, performance metric, phenotype, random walk, recommendation engine, RFID, semantic web, sentiment analysis, speech recognition, statistical model, stochastic process, supply-chain management, text mining, thinkpad, Thomas Bayes, web application

If consumers follow a system recommendation but then do not end up liking the product, they are less likely to use the recommender system again. As with classification systems, recommender systems can make two types of errors: false negatives and false positives. Here, false negatives are products that the system fails to recommend, although the consumer would like them. False positives are products that are recommended, but which the consumer does not like. False positives are less desirable because they can annoy or anger consumers. Content-based recommender systems are limited by the features used to describe the items they recommend. Another challenge for both content-based and collaborative recommender systems is how to deal with new users for which a buying history is not yet available. Hybrid approaches integrate both content-based and collaborative methods to achieve further improved recommendations.
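The two error types described above can be made concrete with a small sketch. It assumes that, for evaluation purposes, we already know which items a consumer actually likes; the item names and the function recommendation_errors are hypothetical. Precision and recall are added here as common summary measures and are not part of the excerpt.

```python
def recommendation_errors(recommended, liked):
    """Split one consumer's recommendations into the two error types.

    recommended: set of items the system showed the consumer
    liked: set of items the consumer actually likes (assumed known for evaluation)
    """
    false_positives = recommended - liked   # recommended, but not liked
    false_negatives = liked - recommended   # liked, but never recommended
    true_positives = recommended & liked    # recommended and liked
    precision = len(true_positives) / len(recommended) if recommended else 0.0
    recall = len(true_positives) / len(liked) if liked else 0.0
    return false_positives, false_negatives, precision, recall


# Hypothetical evaluation data for one consumer.
recommended = {"book A", "book B", "book C", "book D"}
liked = {"book A", "book B", "book E"}

fp, fn, precision, recall = recommendation_errors(recommended, liked)
print(sorted(fp), sorted(fn), precision, round(recall, 3))
```

Because false positives are the more costly error in this setting, a system tuned along these lines would typically favor precision over recall, for example by recommending fewer items with higher confidence.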

In summary, computer systems are at continual risk of breaks in security. Data mining technology can be used to develop strong intrusion detection and prevention systems, which may employ signature-based or anomaly-based detection.

13.3.5. Data Mining and Recommender Systems

Today's consumers are faced with millions of goods and services when shopping online. Recommender systems help consumers by making product recommendations that are likely to be of interest to the user, such as books, CDs, movies, restaurants, online news articles, and other services. Recommender systems may use either a content-based approach, a collaborative approach, or a hybrid approach that combines both content-based and collaborative methods. The content-based approach recommends items that are similar to items the user preferred or queried in the past.

They make use of keywords (describing the items) and user profiles that contain information about users' tastes and needs. Such profiles may be obtained explicitly (e.g., through questionnaires) or learned from users' transactional behavior over time. A collaborative recommender system tries to predict the utility of items for a user, u, based on items previously rated by other users who are similar to u. For example, when recommending books, a collaborative recommender system tries to find other users who have a history of agreeing with u (e.g., they tend to buy similar books, or give similar ratings for books). Collaborative recommender systems can be either memory (or heuristic) based or model based. Memory-based methods essentially use heuristics to make rating predictions based on the entire collection of items previously rated by users. That is, the unknown rating of an item–user combination can be estimated as an aggregate of ratings of the most similar users for the same item.
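The memory-based estimate described above can be sketched as follows. This is one minimal reading of it, not the book's own algorithm: it assumes cosine similarity over co-rated items as the similarity measure and a similarity-weighted average as the aggregate. The ratings data and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "u":     {"book1": 5, "book2": 3},
    "alice": {"book1": 5, "book2": 3, "book3": 4},
    "bob":   {"book1": 1, "book2": 5, "book3": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, item, ratings, k=2):
    """Estimate an unknown rating as a similarity-weighted average of the
    ratings given to `item` by the k users most similar to `user`."""
    neighbors = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbors[:k]
    total_weight = sum(sim for sim, _ in top)
    if total_weight == 0:
        return None  # no similar user has rated the item
    return sum(sim * rating for sim, rating in top) / total_weight


predicted = predict_rating("u", "book3", ratings)
print(round(predicted, 2))
```

Here alice agrees with u on every shared book, so her rating of 4 dominates the estimate, while bob's rating of 2 pulls it down only slightly. Other similarity measures, such as Pearson correlation, would slot into the same structure.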


pages: 380 words: 118,675

The Everything Store: Jeff Bezos and the Age of Amazon by Brad Stone


3D printing, airport security, AltaVista, Amazon Mechanical Turk, Amazon Web Services, bank run, Bernie Madoff, big-box store, Black Swan, book scanning, Brewster Kahle, call centre, centre right, Chuck Templeton: OpenTable, Clayton Christensen, cloud computing, collapse of Lehman Brothers, crowdsourcing, cuban missile crisis, Danny Hillis, Douglas Hofstadter, Elon Musk, facts on the ground, game design, housing crisis, invention of movable type, inventory management, James Dyson, Jeff Bezos, John Markoff, Kevin Kelly, Kodak vs Instagram, late fees, loose coupling, low skilled workers, Maui Hawaii, Menlo Park, Network effects, new economy, optical character recognition, pets.com, Ponzi scheme, quantitative hedge fund, recommendation engine, Renaissance Technologies, RFID, Rodney Brooks, search inside the book, shareholder value, Silicon Valley, Silicon Valley startup, six sigma, skunkworks, Skype, statistical arbitrage, Steve Ballmer, Steve Jobs, Steven Levy, Stewart Brand, Thomas L Friedman, Tony Hsieh, Whole Earth Catalog, why are manhole covers round?, zero-sum game

Once again, Amazon’s lawyers caught wind of this and renamed the program Vendor Realignment. Over the next year, Miller tangled with the European divisions of Random House, Hachette, and Bloomsbury, the publisher of the Harry Potter series. “I did everything I could to screw with their performance,” he says. He took selections of their catalog to full price and yanked their books from Amazon’s recommendation engine; with some titles, like travel books, he promoted comparable books from competitors. Miller’s constant search for new points of leverage exploited the anxieties of neurotic authors who obsessively tracked sales rank—the number on Amazon.com that showed an author how well his or her book was doing compared to other products on the site. “We would constantly meet with authors, so we’d know who would be watching their rankings.”

“Lyn was our ambassador. I credit her for maintaining these relationships.” Amazon approached large publishers aggressively. It demanded accommodations like steeper discounts on bulk purchases, longer periods to pay its bills, and shipping arrangements that leveraged Amazon’s discounts with UPS. To publishers that didn’t comply, Amazon threatened to pull their books out of its automated personalization and recommendation systems, meaning that they would no longer be suggested to customers. “Publishers didn’t really understand Amazon. They were very naïve about what was going on with their back catalog,” says Goss. “Most didn’t know their sales were up because their backlist was getting such visibility.” Amazon had an easy way to demonstrate its market power. When a publisher did not capitulate and the company shut off the recommendation algorithms for its books, the publisher’s sales usually fell by as much as 40 percent.


pages: 268 words: 75,850

The Formula: How Algorithms Solve All Our Problems-And Create More by Luke Dormehl


3D printing, algorithmic trading, Any sufficiently advanced technology is indistinguishable from magic, augmented reality, big data - Walmart - Pop Tarts, call centre, Cass Sunstein, Clayton Christensen, commoditize, computer age, death of newspapers, deferred acceptance, Edward Lorenz: Chaos theory, Erik Brynjolfsson, Filter Bubble, Flash crash, Florence Nightingale: pie chart, Frank Levy and Richard Murnane: The New Division of Labor, Google Earth, Google Glasses, High speed trading, Internet Archive, Isaac Newton, Jaron Lanier, Jeff Bezos, job automation, John Markoff, Kevin Kelly, Kodak vs Instagram, lifelogging, Marshall McLuhan, means of production, Nate Silver, natural language processing, Netflix Prize, pattern recognition, price discrimination, recommendation engine, Richard Thaler, Rosa Parks, self-driving car, sentiment analysis, Silicon Valley, Silicon Valley startup, Slavoj Žižek, social graph, speech recognition, Steve Jobs, Steven Levy, Steven Pinker, Stewart Brand, the scientific method, The Signal and the Noise by Nate Silver, upwardly mobile, Wall-E, Watson beat the top human players on Jeopardy!, Y Combinator

Conversely, scores fall dramatically in situations where the task takes longer than expected.33

Decimated-Reality Aggregators

Speaking in October 1944, during the rebuilding of the House of Commons, which had sustained heavy bombing damage during the Battle of Britain, former British prime minister Winston Churchill observed, “We shape our buildings; thereafter they shape us.”34 A similar sentiment might be expressed in the age of The Formula, in which users shape their online profiles, and from that point forward their online profiles begin to shape them, both in terms of what we see and, perhaps more crucially, what we don’t. Writing about a start-up called Nara, in the middle of 2013, I coined the phrase “decimated reality aggregators” to describe what the company was trying to do.35 Starting out as a restaurant recommender system by connecting together thousands of restaurants around the world, Nara’s ultimate goal was to become the recommender system for your life: drawing on what it knew about you from the restaurants you ate in, to suggest everything from hotels to clothes. Nara even incorporated the idea of upward mobility into its algorithm. Say, for example, you wanted to be a wine connoisseur two years down the line, but currently had no idea how to tell your Chardonnay from your Chianti.

In all, eHarmony’s arrival represented more than just another addition to an already crowded field of Internet dating websites—but a qualitative change in the way that Internet dating was carried out. “Neil was adamant that this should be based on science,” Carter says. Before eHarmony, the majority of dating websites took the form of searchable personal ads, of the kind that have been appearing in print since the 17th century.11 After eHarmony, the search engine model was replaced with a recommender system praised in press materials for its “scientific precision.” Instead of allowing users to scan through page after page of profiles, eHarmony simply required them to answer a series of questions—and then picked out the right option on their behalf. The website opened its virtual doors for the first time on August 22, 2000. There were a few initial teething problems. “Some people were critical of the matches they were getting,” Warren admits.

All a character has to do—as occurs during one scene in which the novel’s bumbling protagonist, Lenny Abramov, visits a Staten Island nightclub with his friends—is to set the “community parameters” of their iPhone-like device to a particular physical space and hit a button. At this point, every aspect of a person’s profile is revealed, including their “fuckability” and “personality” scores (both ranked on a scale of 800), along with their ranked “anal/oral/vaginal” preferences. There is even a recommender system incorporated, so that a user’s history of romantic relationships can be scrutinized for insights in much the same way that a person’s previous orders on Amazon might dictate what they will be interested in next. As one of Abramov’s friends notes, “This girl [has] a long multimedia thing on how her father abused her . . . Like, you’ve dated a lot of abused girls, so it knows you’re into that shit.”24 The world presented by Super Sad True Love Story is, in many ways, closer than you might think.


pages: 94 words: 26,453

The End of Nice: How to Be Human in a World Run by Robots (Kindle Single) by Richard Newton


3D printing, Black Swan, British Empire, Buckminster Fuller, Clayton Christensen, crowdsourcing, deliberate practice, fear of failure, Filter Bubble, future of work, Google Glasses, Isaac Newton, James Dyson, Jaron Lanier, Jeff Bezos, job automation, Lean Startup, low skilled workers, Mark Zuckerberg, move fast and break things, Paul Erdős, Paul Graham, recommendation engine, rising living standards, Robert Shiller, Silicon Valley, Silicon Valley startup, skunkworks, Steve Ballmer, Steve Jobs, Y Combinator

Just as the sirens of legend sang sweet songs to lure sailors to crash on the rocky shore of their island, so Lanier thinks we must be wary of the attractions of the siren servers. They don’t want to make your life more complicated. They are there to make everything frictionless: “Leave it to me”, they sing. “I’ll find you new music you might like, books you’ll want to read, videos you want to watch and friends you should like.” We’re sort of used to the idea that recommendation engines work like this. We know that ads now follow us around the web and that books will be unhelpfully recommended to us by Amazon. But search results are also tailored to you. And that’s more of a concern. The search results you get will be different to the results for an identical search made by me. In fact, so much insight can be derived from your online behaviour that Google and other organisations can ensure you get news that makes you happy… or even angry the way you like to be angry.


pages: 88 words: 25,047

The Mathematics of Love: Patterns, Proofs, and the Search for the Ultimate Equation by Hannah Fry

Brownian motion, John Nash: game theory, linear programming, Nash equilibrium, Pareto efficiency, recommendation engine, Skype, statistical model

And that’s it – apply this algorithm to the hundreds of available questions and repeat for each of the millions of users on OkCupid and you’ve got everything you need for one of the world’s most successful dating websites. It’s one of the most elegant approaches ever attempted to pairing couples based on their personal preferences. Together with eHarmony and other similar websites, OkCupid sits alongside Amazon and Netflix as one of the most widely used recommendation engines on the internet. But there’s one problem – if the internet is the ultimate matchmaker, why are people still going on terrible dates? If the science is so good, surely that first date will be the last first date of your life? Shouldn’t the algorithm be able to deliver the perfect partner and leave it at that? Maybe the questionnaires and match percentages aren’t all they’re cracked up to be.


pages: 163 words: 42,402

Machine Learning for Email by Drew Conway, John Myles White


call centre, correlation does not imply causation, Debian, natural language processing, Netflix Prize, pattern recognition, recommendation engine, SpamAssassin, text mining

Generating rules for ranking a list of items is an increasingly common task in machine learning, yet you may not have thought of it in these terms. More likely, you have heard of something like a recommendation system, which implicitly produces a ranking of products. Even if you have not heard of a recommendation system, it’s almost certain that you have used or interacted with a recommendation system at some point. Some of the most successful e-commerce websites have benefitted from leveraging data on their users to generate recommendations for other products their users might be interested in. For example, if you have ever shopped at Amazon.com, then you have interacted with a recommendation system. The problem Amazon faces is simple: what items in their inventory are you most likely to buy? The implication of that statement is that the items in Amazon’s inventory have an ordering specific to each user.
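The idea that a recommendation system implicitly produces a per-user ordering of the inventory can be sketched in a few lines. The scores below are hypothetical stand-ins for whatever a trained model would output (for example, an estimated purchase probability); nothing here reflects how Amazon actually computes them.

```python
# Hypothetical per-user scores, e.g. estimated purchase probabilities from some model.
scores = {
    "user_1": {"tv": 0.10, "hdmi cable": 0.60, "novel": 0.30},
    "user_2": {"tv": 0.70, "hdmi cable": 0.20, "novel": 0.05},
}


def rank_items(user, scores):
    """Return the inventory ordered from most to least likely purchase for one user."""
    user_scores = scores[user]
    return sorted(user_scores, key=user_scores.get, reverse=True)


for user in scores:
    print(user, rank_items(user, scores))
```

The same three items come back in a different order for each user, which is the sense in which the ordering is specific to each user.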

There are many excellent books that focus on the fundamentals, the seminal work being Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning [HTF09].[1] But another important part of the hacker mantra is to learn by doing. Many hackers may be more comfortable thinking of problems in terms of the process by which a solution is attained, rather than the theoretical foundation from which the solution is derived. From this perspective, an alternative approach to teaching machine learning would be to use “cookbook” style examples. To understand how a recommendation system works, for example, we might provide sample training data and a version of the model, and show how the latter uses the former. There are many useful texts of this kind as well; Toby Segaran’s Programming Collective Intelligence is a recent example [Seg07]. Such a discussion would certainly address the how of a hacker’s method of learning, but perhaps less of the why. Along with understanding the mechanics of a method, we may also want to learn why it is used in a certain context or to address a specific problem.

The implication of that statement is that the items in Amazon’s inventory have an ordering specific to each user. Likewise, Netflix.com has a massive library of DVDs available to its customers to rent. In order for those customers to get the most out of the site, Netflix employs a sophisticated recommendation system to present people with rental suggestions. For both companies, these recommendations are based on two kinds of data. First, there is the data pertaining to the inventory itself. For Amazon, if the product is a television, this data might contain the type (e.g., plasma, LCD, LED), manufacturer, price, and so on. For Netflix, this data might be the genre of a film, its cast, director, running time, etc. Second, there is the data related to the browsing and purchasing behavior of the customers. This sort of data can help Amazon understand what accessories most people look for when shopping for a new plasma TV and can help Netflix understand which romantic comedies George A.

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage by Zdravko Markov, Daniel T. Larose


Firefox, information retrieval, Internet Archive, iterative process, natural language processing, pattern recognition, random walk, recommendation engine, semantic web, speech recognition, statistical model, William of Occam

COLLABORATIVE FILTERING (RECOMMENDER SYSTEMS)

So far we have discussed approaches to content-based retrieval and clustering of documents, where the basic relation that is used in the document description is “document contains term.” At some point we looked into the role of web users as a source of feedback to improve the document ranking. However, we may consider web users as entities in a relation such as the document–term relation. This may, for example, be “web user likes web page.” Then we can build a user–document matrix and use documents to describe users in terms of web pages they like. A more general approach would be to consider persons and items again connected by the relation “person likes item.” This is the approach taken in the area of collaborative filtering (also called recommender systems) [3]. Assume that we have m persons and n items (e.g., books, songs, movies, web pages).
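The “person likes item” relation for m persons and n items can be sketched as a small binary matrix. The observations below are hypothetical, and a dense list of lists is used only for readability; with realistic data most entries are zero and a sparse representation would be the usual choice.

```python
# Hypothetical "person likes item" observations.
likes = [
    ("ann", "book"),
    ("ann", "song"),
    ("ben", "song"),
    ("cat", "movie"),
    ("cat", "song"),
]

persons = sorted({p for p, _ in likes})   # m persons, one per row
items = sorted({i for _, i in likes})     # n items, one per column
row = {p: r for r, p in enumerate(persons)}
col = {i: c for c, i in enumerate(items)}

# m x n matrix: entry [r][c] is 1 if person r likes item c, else 0.
matrix = [[0] * len(items) for _ in persons]
for person, item in likes:
    matrix[row[person]][col[item]] = 1


def describe(person):
    """Describe a person by the set of items they like (one row of the matrix)."""
    return {items[c] for c, value in enumerate(matrix[row[person]]) if value}


for person in persons:
    print(person, matrix[row[person]])
```

Each row describes a person in terms of the items they like, in the same way a row of the document–term matrix describes a document in terms of the terms it contains.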

CONTENTS

PREFACE

PART I WEB STRUCTURE MINING

1 INFORMATION RETRIEVAL AND WEB SEARCH
Web Challenges
Web Search Engines
Topic Directories
Semantic Web
Crawling the Web
Web Basics
Web Crawlers
Indexing and Keyword Search
Document Representation
Implementation Considerations
Relevance Ranking
Advanced Text Search
Using the HTML Structure in Keyword Search
Evaluating Search Quality
Similarity Search
Cosine Similarity
Jaccard Similarity
Document Resemblance
References
Exercises

2 HYPERLINK-BASED RANKING
Introduction
Social Networks Analysis
PageRank
Authorities and Hubs
Link-Based Similarity Search
Enhanced Techniques for Page Ranking
References
Exercises

PART II WEB CONTENT MINING

3 CLUSTERING
Introduction
Hierarchical Agglomerative Clustering
k-Means Clustering
Probability-Based Clustering
Finite Mixture Problem
Classification Problem
Clustering Problem
Collaborative Filtering (Recommender Systems)
References
Exercises

4 EVALUATING CLUSTERING
Approaches to Evaluating Clustering
Similarity-Based Criterion Functions
Probabilistic Criterion Functions
MDL-Based Model and Feature Evaluation
Minimum Description Length Principle
MDL-Based Model Evaluation
Feature Selection
Classes-to-Clusters Evaluation
Precision, Recall, and F-Measure
Entropy
References
Exercises

5 CLASSIFICATION
General Setting and Evaluation Techniques
Nearest-Neighbor Algorithm
Feature Selection
Naive Bayes Algorithm
Numerical Approaches
Relational Learning
References
Exercises

PART III WEB USAGE MINING

6 INTRODUCTION TO WEB USAGE MINING
Definition of Web Usage Mining
Cross-Industry Standard Process for Data Mining
Clickstream Analysis
Web Server Log Files
Remote Host Field
Date/Time Field
HTTP Request Field
Status Code Field
Transfer Volume (Bytes) Field
Common Log Format
Identification Field
Authuser Field
Extended Common Log Format
Referrer Field
User Agent Field
Example of a Web Log Record
Microsoft IIS Log Format
Auxiliary Information
References
Exercises

7 PREPROCESSING FOR WEB USAGE MINING
Need for Preprocessing the Data
Data Cleaning and Filtering
Page Extension Exploration and Filtering
De-Spidering the Web Log File
User Identification
Session Identification
Path Completion
Directories and the Basket Transformation
Further Data Preprocessing Steps
References
Exercises

8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING
Introduction
Number of Visit Actions
Session Duration
Relationship between Visit Actions and Session Duration
Average Time per Page
Duration for Individual Pages
References
Exercises

9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION
Introduction
Modeling Methodology
Definition of Clustering
The BIRCH Clustering Algorithm
Affinity Analysis and the A Priori Algorithm
Discretizing the Numerical Variables: Binning
Applying the A Priori Algorithm to the CCSU Web Log Data
Classification and Regression Trees
The C4.5 Algorithm
References
Exercises

INDEX

PREFACE

DEFINING DATA MINING THE WEB

By data mining the Web, we refer to the application of data mining methodologies, techniques, and models to the variety of data forms, structures, and usage patterns that comprise the World Wide Web.

Concept learning methods can also be used to generate explicit descriptions of sets of web documents, which can then be applied to categorization of new documents or to better understand the document area or topic.

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, by Zdravko Markov and Daniel T. Larose. Copyright © 2007 John Wiley & Sons, Inc.

CHAPTER 3 CLUSTERING

INTRODUCTION
HIERARCHICAL AGGLOMERATIVE CLUSTERING
k-MEANS CLUSTERING
PROBABILITY-BASED CLUSTERING
COLLABORATIVE FILTERING (RECOMMENDER SYSTEMS)

INTRODUCTION

The most popular approach to learning is by example. Given a set of objects, each labeled with a class (category), the learning system builds a mapping between objects and classes which can then be used for classifying new (unlabeled) objects. As the labeling (categorization) of the initial (training) set of objects is done by an agent external to the system (teacher), this setting is called supervised learning.


pages: 579 words: 76,657

Data Science from Scratch: First Principles with Python by Joel Grus

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

correlation does not imply causation, natural language processing, Netflix Prize, p-value, Paul Graham, recommendation engine, SpamAssassin, statistical model

principal component analysis, Dimensionality Reduction; probability, Probability-For Further Exploration, Mathematics (Bayes's Theorem, Bayes’s Theorem; central limit theorem, The Central Limit Theorem; conditional, Conditional Probability; continuous distributions, Continuous Distributions; defined, Probability; dependence and independence, Dependence and Independence; normal distribution, The Normal Distribution; random variables, Random Variables); probability density function, Continuous Distributions; programming languages for learning data science, From Scratch; Python, A Crash Course in Python-For Further Exploration (args and kwargs, args and kwargs; arithmetic, Arithmetic; benefits of using for data science, From Scratch; Booleans, Truthiness; control flow, Control Flow; Counter, Counter; dictionaries, Dictionaries-defaultdict; enumerate function, enumerate; exceptions, Exceptions; functional tools, Functional Tools; functions, Functions; generators and iterators, Generators and Iterators; list comprehensions, List Comprehensions; lists, Lists; object-oriented programming, Object-Oriented Programming; piping data through scripts using stdin and stdout, stdin and stdout; random numbers, generating, Randomness; regular expressions, Regular Expressions; sets, Sets; sorting in, The Not-So-Basics; strings, Strings; tuples, Tuples; whitespace formatting, Whitespace Formatting; zip function and argument unpacking, zip and Argument Unpacking). Q: quantile, computing, Central Tendencies; query optimization (SQL), Query Optimization. R: R (programming language), From Scratch, R; random forests, Random Forests; random module (Python), Randomness; random variables, Random Variables (Bernoulli, The Central Limit Theorem; binomial, The Central Limit Theorem; conditioned on events, Random Variables; expected value, Random Variables; normal, The Normal Distribution-The Central Limit Theorem; uniform, Continuous Distributions); range, Dispersion; range function (Python), Generators and Iterators; reading files (see files, reading);
recall, Correctness; recommendations, Recommender Systems; recommender systems, Recommender Systems-For Further Exploration (Data Scientists You May Know (example), Data Scientists You May Know; item-based collaborative filtering, Item-Based Collaborative Filtering-For Further Exploration; manual curation, Manual Curation; recommendations based on popularity, Recommending What’s Popular; user-based collaborative filtering, User-Based Collaborative Filtering-User-Based Collaborative Filtering); reduce function (Python), Functional Tools (using with vectors, Vectors); regression (see linear regression; logistic regression); regression trees, What Is a Decision Tree?

Additionally, both of his endorsers endorsed only him, which means that he doesn’t have to divide their rank with anyone else.

For Further Exploration

There are many other notions of centrality besides the ones we used (although the ones we used are pretty much the most popular ones). NetworkX is a Python library for network analysis. It has functions for computing centralities and for visualizing graphs. Gephi is a love-it/hate-it GUI-based network-visualization tool.

Chapter 22. Recommender Systems

O nature, nature, why art thou so dishonest, as ever to send men with these false recommendations into the world!
Henry Fielding

Another common data problem is producing recommendations of some sort. Netflix recommends movies you might want to watch. Amazon recommends products you might want to buy. Twitter recommends users you might want to follow. In this chapter, we’ll look at several ways to use data to make recommendations.

    = other_interest_id and similarity > 0]
    # Python 3 no longer allows tuple-unpacking lambdas, so index the pair instead
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

which suggests the following similar interests:

    [('Hadoop', 0.8164965809277261),
     ('Java', 0.6666666666666666),
     ('MapReduce', 0.5773502691896258),
     ('Spark', 0.5773502691896258),
     ('Storm', 0.5773502691896258),
     ('Cassandra', 0.4082482904638631),
     ('artificial intelligence', 0.4082482904638631),
     ('deep learning', 0.4082482904638631),
     ('neural networks', 0.4082482904638631),
     ('HBase', 0.3333333333333333)]

Now we can create recommendations for a user by summing up the similarities of the interests similar to his:

    from collections import defaultdict

    # relies on user_interest_matrix, users_interests, and
    # most_similar_interests_to, which are defined earlier in the chapter
    def item_based_suggestions(user_id, include_current_interests=False):
        # add up the similar interests
        suggestions = defaultdict(float)
        user_interest_vector = user_interest_matrix[user_id]
        for interest_id, is_interested in enumerate(user_interest_vector):
            if is_interested == 1:
                similar_interests = most_similar_interests_to(interest_id)
                for interest, similarity in similar_interests:
                    suggestions[interest] += similarity

        # sort them by weight
        suggestions = sorted(suggestions.items(),
                             key=lambda pair: pair[1],
                             reverse=True)

        if include_current_interests:
            return suggestions
        else:
            return [(suggestion, weight)
                    for suggestion, weight in suggestions
                    if suggestion not in users_interests[user_id]]

For user 0, this generates the following (seemingly reasonable) recommendations:

    [('MapReduce', 1.861807319565799),
     ('Postgres', 1.3164965809277263),
     ('MongoDB', 1.3164965809277263),
     ('NoSQL', 1.2844570503761732),
     ('programming languages', 0.5773502691896258),
     ('MySQL', 0.5773502691896258),
     ('Haskell', 0.5773502691896258),
     ('databases', 0.5773502691896258),
     ('neural networks', 0.4082482904638631),
     ('deep learning', 0.4082482904638631),
     ('C++', 0.4082482904638631),
     ('artificial intelligence', 0.4082482904638631),
     ('Python', 0.2886751345948129),
     ('R', 0.2886751345948129)]

For Further Exploration

Crab is a framework for building recommender systems in Python.
Graphlab also has a recommender toolkit. The Netflix Prize was a somewhat famous competition to build a better system to recommend movies to Netflix users.

Chapter 23. Databases and SQL

Memory is man’s greatest friend and worst enemy.
Gilbert Parker

The data you need will often live in databases, systems designed for efficiently storing and querying data. The bulk of these are relational databases, such as Oracle, MySQL, and SQL Server, which store data in tables and are typically queried using Structured Query Language (SQL), a declarative language for manipulating data.
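Returning to the item-based collaborative filtering excerpt from Chapter 22: because that excerpt is cut off and depends on helpers defined elsewhere in the book, a self-contained sketch may help. The toy `users_interests` data below is hypothetical, and the helper bodies are a plausible reconstruction that uses cosine similarity between interest vectors, not a copy of the author's code.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: one list of interests per user.
users_interests = [
    ["Hadoop", "Big Data", "Java", "Spark"],
    ["Python", "pandas", "statistics"],
    ["Java", "Spark", "Big Data"],
    ["Python", "statistics", "regression"],
]

unique_interests = sorted({i for interests in users_interests for i in interests})

# One row per interest, one column per user (1 if that user has the interest).
interest_user_matrix = [
    [1 if interest in interests else 0 for interests in users_interests]
    for interest in unique_interests
]

def cosine_similarity(v, w):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

def most_similar_interests_to(interest_id):
    """Other interests ranked by similarity to the given one, dropping zero scores."""
    target = interest_user_matrix[interest_id]
    pairs = [
        (unique_interests[other_id], cosine_similarity(target, row))
        for other_id, row in enumerate(interest_user_matrix)
        if other_id != interest_id
    ]
    pairs = [(name, similarity) for name, similarity in pairs if similarity > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def item_based_suggestions(user_id):
    """Sum similarities over the user's interests and drop what the user already has."""
    suggestions = defaultdict(float)
    for interest_id, interest in enumerate(unique_interests):
        if interest in users_interests[user_id]:
            for other, similarity in most_similar_interests_to(interest_id):
                suggestions[other] += similarity
    ranked = sorted(suggestions.items(), key=lambda pair: pair[1], reverse=True)
    return [(name, weight) for name, weight in ranked
            if name not in users_interests[user_id]]

print(item_based_suggestions(2))
```

In this toy data user 2 shares all three interests with user 0, so the only new suggestion is user 0's remaining interest. With a realistic interest matrix the same summation produces a longer ranked list like the one in the excerpt.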


pages: 308 words: 84,713

The Glass Cage: Automation and Us by Nicholas Carr

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Airbnb, Airbus A320, Andy Kessler, Atul Gawande, autonomous vehicles, Bernard Ziegler, business process, call centre, Captain Sullenberger Hudson, Checklist Manifesto, cloud computing, computerized trading, David Brooks, deliberate practice, deskilling, digital map, Douglas Engelbart, drone strike, Elon Musk, Erik Brynjolfsson, Flash crash, Frank Gehry, Frank Levy and Richard Murnane: The New Division of Labor, Frederick Winslow Taylor, future of work, global supply chain, Google Glasses, Google Hangouts, High speed trading, indoor plumbing, industrial robot, Internet of things, Jacquard loom, James Watt: steam engine, job automation, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, Kevin Kelly, knowledge worker, Lyft, Marc Andreessen, Mark Zuckerberg, means of production, natural language processing, new economy, Nicholas Carr, Norbert Wiener, Oculus Rift, pattern recognition, Peter Thiel, place-making, Plutocrats, plutocrats, profit motive, Ralph Waldo Emerson, RAND corporation, randomized controlled trial, Ray Kurzweil, recommendation engine, robot derives from the Czech word robota Czech, meaning slave, Second Machine Age, self-driving car, Silicon Valley, Silicon Valley ideology, software is eating the world, Stephen Hawking, Steve Jobs, TaskRabbit, technoutopianism, The Wealth of Nations by Adam Smith, turn-by-turn navigation, US Airways Flight 1549, Watson beat the top human players on Jeopardy!, William Langewiesche

Thanks to the proliferation of smartphones, tablets, and other small, affordable, and even wearable computers, we now depend on software to carry out many of our daily chores and pastimes. We launch apps to aid us in shopping, cooking, exercising, even finding a mate and raising a child. We follow turn-by-turn GPS instructions to get from one place to the next. We use social networks to maintain friendships and express our feelings. We seek advice from recommendation engines on what to watch, read, and listen to. We look to Google, or to Apple’s Siri, to answer our questions and solve our problems. The computer is becoming our all-purpose tool for navigating, manipulating, and understanding the world, in both its physical and its social manifestations. Just think what happens these days when people misplace their smartphones or lose their connections to the net.

Like all analytical programs, they have a bias toward criteria that lend themselves to statistical analysis, downplaying those that entail the exercise of taste or other subjective judgments. Automated essay-grading algorithms encourage in students a rote mastery of the mechanics of writing. The programs are deaf to tone, uninterested in knowledge’s nuances, and actively resistant to creative expression. The deliberate breaking of a grammatical rule may delight a reader, but it’s anathema to a computer. Recommendation engines, whether suggesting a movie or a potential love interest, cater to our established desires rather than challenging us with the new and unexpected. They assume we prefer custom to adventure, predictability to whimsy. The technologies of home automation, which allow things like lighting, heating, cooking, and entertainment to be meticulously programmed, impose a Taylorist mentality on domestic life.


pages: 292 words: 85,151

Exponential Organizations: Why New Organizations Are Ten Times Better, Faster, and Cheaper Than Yours (And What to Do About It) by Salim Ismail, Yuri van Geest

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

23andMe, 3D printing, Airbnb, Amazon Mechanical Turk, Amazon Web Services, augmented reality, autonomous vehicles, Baxter: Rethink Robotics, bioinformatics, bitcoin, Black Swan, blockchain, Burning Man, business intelligence, business process, call centre, chief data officer, Chris Wanstrath, Clayton Christensen, clean water, cloud computing, cognitive bias, collaborative consumption, collaborative economy, commoditize, corporate social responsibility, cross-subsidies, crowdsourcing, cryptocurrency, dark matter, Dean Kamen, dematerialisation, discounted cash flows, distributed ledger, Edward Snowden, Elon Musk, en.wikipedia.org, ethereum blockchain, Galaxy Zoo, game design, Google Glasses, Google Hangouts, Google X / Alphabet X, gravity well, hiring and firing, Hyperloop, industrial robot, Innovator's Dilemma, intangible asset, Internet of things, Iridium satellite, Isaac Newton, Jeff Bezos, Kevin Kelly, Kickstarter, knowledge worker, Kodak vs Instagram, Law of Accelerating Returns, Lean Startup, life extension, lifelogging, loose coupling, loss aversion, Lyft, Marc Andreessen, Mark Zuckerberg, market design, means of production, minimum viable product, natural language processing, Netflix Prize, Network effects, new economy, Oculus Rift, offshore financial centre, p-value, PageRank, pattern recognition, Paul Graham, peer-to-peer, peer-to-peer model, Peter H. Diamandis: Planetary Resources, Peter Thiel, prediction markets, profit motive, publish or perish, Ray Kurzweil, recommendation engine, RFID, ride hailing / ride sharing, risk tolerance, Ronald Coase, Second Machine Age, self-driving car, sharing economy, Silicon Valley, skunkworks, Skype, smart contracts, Snapchat, social software, software is eating the world, speech recognition, stealth mode startup, Stephen Hawking, Steve Jobs, subscription business, supply-chain management, TaskRabbit, telepresence, telepresence robot, Tony Hsieh, transaction costs, Tyler Cowen: Great Stagnation, urban planning, WikiLeaks, winner-take-all economy, X Prize, Y Combinator, zero-sum game

Ten years later, its revenues had jumped 125x and the company was generating a half-billion dollars every three days. At the heart of this staggering growth was the PageRank algorithm, which ranks the popularity of web pages. (Google doesn’t gauge which page is better from a human perspective; its algorithms simply respond to the pages that deliver the most clicks.) Google isn’t alone. Today, the world is pretty much run on algorithms. From automotive anti-lock braking to Amazon’s recommendation engine; from dynamic pricing for airlines to predicting the success of upcoming Hollywood blockbusters; from writing news posts to air traffic control; from credit card fraud detection to the 2 percent of posts that Facebook shows a typical user—algorithms are everywhere in modern life. Recently, McKinsey estimated that of the seven hundred end-to-end bank processes (opening an account or getting a car loan, for example), about half can be fully automated.

Not only has he made that rare transition from founder to large-company CEO, but he has also consistently avoided the short-term thinking that so often comes with running a public company—what Joi Ito calls “nowism.” Amazon regularly makes long bets (e.g., Amazon Web Services, Kindle, and now Fire smartphones and delivery drones), views new products as if they are seedlings needing careful tending for a five-to-seven-year period, is maniacal about growth over profits and ignores the short-term view of Wall Street analysts. Its pioneering initiatives include its Affiliate Program, its recommendation engine (collaborative filtering) and the Mechanical Turk project. As Bezos says, “If you’re competitor-focused, you have to wait until there is a competitor doing something. Being customer-focused allows you to be more pioneering.” Not only has Amazon built ExOs on its edges (such as AWS), it also has had the courage to cannibalize its own products (e.g., Kindle). In addition, after realizing that Amazon’s culture wasn’t a perfect fit with the outstanding service he wanted to offer, Bezos spent $1.2 billion in 2009 to acquire Zappos.


pages: 391 words: 105,382

Utopia Is Creepy: And Other Provocations by Nicholas Carr

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Air France Flight 447, Airbnb, Airbus A320, AltaVista, Amazon Mechanical Turk, augmented reality, autonomous vehicles, Bernie Sanders, book scanning, Brewster Kahle, Buckminster Fuller, Burning Man, Captain Sullenberger Hudson, centralized clearinghouse, cloud computing, cognitive bias, collaborative consumption, computer age, corporate governance, crowdsourcing, Danny Hillis, deskilling, digital map, Donald Trump, Electric Kool-Aid Acid Test, Elon Musk, factory automation, failed state, feminist movement, Frederick Winslow Taylor, friendly fire, game design, global village, Google bus, Google Glasses, Google X / Alphabet X, Googley, hive mind, impulse control, indoor plumbing, interchangeable parts, Internet Archive, invention of movable type, invention of the steam engine, invisible hand, Isaac Newton, Jeff Bezos, jimmy wales, job automation, Kevin Kelly, lifelogging, low skilled workers, Marc Andreessen, Mark Zuckerberg, Marshall McLuhan, means of production, Menlo Park, mental accounting, natural language processing, Network effects, new economy, Nicholas Carr, Norman Mailer, off grid, oil shale / tar sands, Peter Thiel, Plutocrats, plutocrats, profit motive, Ralph Waldo Emerson, Ray Kurzweil, recommendation engine, Republic of Letters, robot derives from the Czech word robota Czech, meaning slave, Ronald Reagan, self-driving car, SETI@home, side project, Silicon Valley, Silicon Valley ideology, Singularitarianism, Snapchat, social graph, social web, speech recognition, Startup school, stem cell, Stephen Hawking, Steve Jobs, Steven Levy, technoutopianism, the medium is the message, theory of mind, Turing test, Whole Earth Catalog, Y Combinator

The great power of modern digital filters lies in their ability to make information that is of inherent interest to us immediately visible to us. The information may take the form of personal messages or updates from friends or colleagues, broadcast messages from experts or celebrities whose opinions or observations we value, headlines and stories from writers or publications we like, alerts about the availability of various other sorts of content on favorite subjects, or suggestions from recommendation engines—but it all shares the quality of being tailored to our particular interests. It’s all needles. And modern filters don’t just organize that information for us; they push the information at us as alerts, updates, streams. We tend to point to spam as an example of information overload. But spam is just an annoyance. The real source of information overload, at least of the ambient sort, is the stuff we like, the stuff we want.

To thine own image be true.
16. No great work of literature could have been written in hypertext.
17. Social media is a palliative for underemployment.
18. The philistine appears ideally suited to the role of cultural impresario online.
19. Television became more interesting when people started paying for it.
20. Instagram shows us what a world without art looks like.
SECOND SERIES (2013)
21. Recommendation engines are the best cure for hubris.
22. Vines would be better if they were one second shorter.
23. Hell is other selfies.
24. Twitter has revealed that brevity and verbosity are not always antonyms.
25. Personalized ads provide a running critique of artificial intelligence.
26. Who you are is what you do between notifications.
27. Online is to offline as a swimming pool to a pond.
28. People in love leave the sparsest data trails.
29.


pages: 366 words: 94,209

Throwing Rocks at the Google Bus: How Growth Became the Enemy of Prosperity by Douglas Rushkoff

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

3D printing, activist fund / activist shareholder / activist investor, Airbnb, algorithmic trading, Amazon Mechanical Turk, Andrew Keen, bank run, banking crisis, barriers to entry, bitcoin, blockchain, Burning Man, business process, buy low sell high, California gold rush, Capital in the Twenty-First Century by Thomas Piketty, carbon footprint, centralized clearinghouse, citizen journalism, clean water, cloud computing, collaborative economy, collective bargaining, colonial exploitation, Community Supported Agriculture, corporate personhood, corporate raider, creative destruction, crowdsourcing, cryptocurrency, disintermediation, diversified portfolio, Elon Musk, Erik Brynjolfsson, ethereum blockchain, fiat currency, Firefox, Flash crash, full employment, future of work, gig economy, Gini coefficient, global supply chain, global village, Google bus, Howard Rheingold, IBM and the Holocaust, impulse control, income inequality, index fund, iterative process, Jaron Lanier, Jeff Bezos, jimmy wales, job automation, Joseph Schumpeter, Kickstarter, loss aversion, Lyft, Marc Andreessen, Mark Zuckerberg, market bubble, market fundamentalism, Marshall McLuhan, means of production, medical bankruptcy, minimum viable product, Naomi Klein, Network effects, new economy, Norbert Wiener, Oculus Rift, passive investing, payday loans, peer-to-peer lending, Peter Thiel, post-industrial society, profit motive, quantitative easing, race to the bottom, recommendation engine, reserve currency, RFID, Richard Stallman, ride hailing / ride sharing, Ronald Reagan, Satoshi Nakamoto, Second Machine Age, shareholder value, sharing economy, Silicon Valley, Snapchat, social graph, software patent, Steve Jobs, TaskRabbit, The Future of Employment, trade route, transportation-network company, Turing test, Uber and Lyft, Uber for X, unpaid internship, Y Combinator, young professional, zero-sum game, Zipcar

., became one of the first publicly traded Internet giants, responsible (or to blame) for not only the first e-commerce Web sites but also the first banner ad.6 Matthew was likely just as surprised by where this all went as I was. The information superhighway morphed into an interactive strip mall; digital technology’s ability to connect people to products, facilitate payments, and track behaviors led to all sorts of new marketing and sales innovations. “Buy” buttons triggered the impulse for instant gratification, while recommendation engines personalized marketing pitches. It was commerce on crack. With a few notable exceptions—such as eBay and Etsy—we didn’t really get a return of the many-to-many marketplace or digital bazaar. No, in online commerce it’s mostly a few companies selling to many, and many people selling to the very few—if anyone at all. Take music. The best part of an online music catalogue is that it is unlimited in size.

Amazon then leveraged its monopoly in books and free shipping to develop monopolies in other verticals, beginning with home electronics (bankrupting Circuit City and Best Buy), and then every other link in the physical and virtual fulfillment chain, from shoes and food to music and videos. Finally, Amazon flips into personhood by reversing the traditional relationship between people and machines. Amazon’s patented recommendation engines attempt to drive our human selection process. Amazon Mechanical Turks gave computers the ability to mete out repetitive tasks to legions of human drones. The computers did the thinking and choosing; the people pointed and clicked as they were instructed or induced to do. Neither Amazon nor its founder, Jeff Bezos, is slipping to new lows here. The company is simply operating true to the core program of corporatism, expressed through new digital means.


pages: 502 words: 107,657

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Albert Einstein, algorithmic trading, Amazon Mechanical Turk, Apple's 1984 Super Bowl advert, backtesting, Black Swan, book scanning, bounce rate, business intelligence, business process, call centre, commoditize, computer age, conceptual framework, correlation does not imply causation, crowdsourcing, dark matter, data is the new oil, en.wikipedia.org, Erik Brynjolfsson, Everything should be made as simple as possible, experimental subject, Google Glasses, happiness index / gross national happiness, job satisfaction, Johann Wolfgang von Goethe, lifelogging, Machine translation of "The spirit is willing, but the flesh is weak." to Russian and back, mass immigration, Moneyball by Michael Lewis explains big data, Nate Silver, natural language processing, Netflix Prize, Network effects, Norbert Wiener, personalized medicine, placebo effect, prediction markets, Ray Kurzweil, recommendation engine, risk-adjusted returns, Ronald Coase, Search for Extraterrestrial Intelligence, self-driving car, sentiment analysis, software as a service, speech recognition, statistical model, Steven Levy, text mining, the scientific method, The Signal and the Noise by Nate Silver, The Wisdom of Crowds, Thomas Bayes, Turing test, Watson beat the top human players on Jeopardy!, X Prize, Yogi Berra, zero-sum game

I Knew You Were Going to Do That

With this power at hand, what do we want to predict? Every important thing a person does is valuable to predict, namely: consume, think, work, quit, vote, love, procreate, divorce, mess up, lie, cheat, steal, kill, and die. Let’s explore some examples.2

People Consume

Hollywood studios predict the success of a screenplay if produced. Netflix awarded $1 million to a team of scientists who best improved their recommendation system’s ability to predict which movies you will like. Australian energy company Energex predicts electricity demand in order to decide where to build out its power grid, and Con Edison predicts system failure in the face of high levels of consumption. Wall Street predicts stock prices by observing how demand drives them up and down. The firms AlphaGenius and Derwent Capital drive hedge fund trading by following trends across the general public’s activities on Twitter.

I was at Walgreens a few years ago, and upon checkout an attractive, colorful coupon spit out of the machine. The product it hawked, pictured for all my fellow shoppers to see, had the potential to mortify. It was a coupon for Beano, a medication for flatulence. I’d developed mild lactose intolerance, but, before figuring that out, had been trying anything to address my symptom. Acting blindly on data, Walgreens’ recommendation system seemed to suggest that others not stand so close. Other clinical data holds a more serious and sensitive status than digestive woes. Once, when teaching a summer program for talented teenagers, I received data I felt would have been better kept away from me. The administrator took me aside to inform me that one of my students had a diagnosis of bipolar disorder. I wasn’t trained in psychology.

Such a contest is a hard-nosed, objective bake-off—whoever can cook up the solution that best handles the predictive task at hand wins kudos and, usually, cash.

Dark Horses

And so it was with our two Montrealers, Martin and Martin, who took the Netflix Prize by storm despite their lack of experience—or, perhaps, because of it. Neither had a background in statistics or analytics, let alone recommendation systems in particular. By day, the two worked in the telecommunications industry developing software. But by night, at home, the two-member team plugged away, for 10 to 20 hours per week apiece, racing ahead in the contest under the team name PragmaticTheory. The “pragmatic” approach proved groundbreaking. The team wavered in and out of the number one slot; during the final months of the competition, the team was often in the top echelons.


pages: 396 words: 117,149

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World by Pedro Domingos

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

3D printing, Albert Einstein, Amazon Mechanical Turk, Arthur Eddington, basic income, Bayesian statistics, Benoit Mandelbrot, bioinformatics, Black Swan, Brownian motion, cellular automata, Claude Shannon: information theory, combinatorial explosion, computer vision, constrained optimization, correlation does not imply causation, creative destruction, crowdsourcing, Danny Hillis, data is the new oil, double helix, Douglas Hofstadter, Erik Brynjolfsson, experimental subject, Filter Bubble, future of work, global village, Google Glasses, Gödel, Escher, Bach, information retrieval, job automation, John Markoff, John Snow's cholera map, John von Neumann, Joseph Schumpeter, Kevin Kelly, lone genius, mandelbrot fractal, Mark Zuckerberg, Moneyball by Michael Lewis explains big data, Narrative Science, Nate Silver, natural language processing, Netflix Prize, Network effects, NP-complete, off grid, P = NP, PageRank, pattern recognition, phenotype, planetary scale, pre–internet, random walk, Ray Kurzweil, recommendation engine, Richard Feynman, Second Machine Age, self-driving car, Silicon Valley, speech recognition, statistical model, Stephen Hawking, Steven Levy, Steven Pinker, superintelligent machines, the scientific method, The Signal and the Noise by Nate Silver, theory of mind, Thomas Bayes, transaction costs, Turing machine, Turing test, Vernor Vinge, Watson beat the top human players on Jeopardy!, white flight, zero-sum game

Satellites, DNA sequencers, and particle accelerators probe nature in ever-finer detail, and learning algorithms turn the torrents of data into new scientific knowledge. Companies know their customers like never before. The candidate with the best voter models wins, like Obama against Romney. Unmanned vehicles pilot themselves across land, sea, and air. No one programmed your tastes into the Amazon recommendation system; a learning algorithm figured them out on its own, by generalizing from your past purchases. Google’s self-driving car taught itself how to stay on the road; no engineer wrote an algorithm instructing it, step-by-step, how to get from A to B. No one knows how to program a car to drive, and no one needs to, because a car equipped with a learning algorithm picks it up by observing what the driver does.

It’s an ideal job for machine learning, and yet today’s learners aren’t up to it. Each has some of the needed capabilities but is missing others. The Master Algorithm is the complete package. Applying it to vast amounts of patient and drug data, combined with knowledge mined from the biomedical literature, is how we will cure cancer. A universal learner is sorely needed in many other areas, from life-and-death to mundane situations. Picture the ideal recommender system, one that recommends the books, movies, and gadgets you would pick for yourself if you had the time to check them all out. Amazon’s algorithm is a very far cry from it. That’s partly because it doesn’t have enough data—mainly it just knows which items you previously bought from Amazon—but if you went hog wild and gave it access to your complete stream of consciousness from birth, it wouldn’t know what to do with it.

The price, of course, is that its vision is blurrier: fine details of the frontier get washed away by the voting. When k goes up, variance decreases, but bias increases. Using the k nearest neighbors instead of one is not the end of the story. Intuitively, the examples closest to the test example should count for more. This leads us to the weighted k-nearest-neighbor algorithm. In 1994, a team of researchers from the University of Minnesota and MIT built a recommendation system based on what they called “a deceptively simple idea”: people who agreed in the past are likely to agree again in the future. That notion led directly to the collaborative filtering systems that all self-respecting e-commerce sites have. Suppose that, like Netflix, you’ve gathered a database of movie ratings, with each user giving a rating of one to five stars to the movies he or she has seen.
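The idea that "people who agreed in the past are likely to agree again" can be sketched as a weighted k-nearest-neighbor prediction over a ratings table. Everything below is illustrative rather than drawn from the book: the `ratings` data, the function names, the use of Pearson correlation as the agreement measure, and the mean-centered weighted average are all assumptions chosen to keep the example small.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: stars from 1 to 5}.
ratings = {
    "ann":  {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob":  {"Alien": 5, "Brazil": 5, "Casablanca": 2, "Dune": 5},
    "cara": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1},
    "dave": {"Alien": 5, "Brazil": 3, "Dune": 4},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def agreement(a, b):
    """Pearson correlation of two users' ratings on the movies both have rated."""
    common = set(ratings[a]) & set(ratings[b])
    if len(common) < 2:
        return 0.0
    mean_a = mean(ratings[a][m] for m in common)
    mean_b = mean(ratings[b][m] for m in common)
    dev_a = [ratings[a][m] - mean_a for m in common]
    dev_b = [ratings[b][m] - mean_b for m in common]
    denom = sqrt(sum(x * x for x in dev_a)) * sqrt(sum(y * y for y in dev_b))
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denom if denom else 0.0

def predict(user, movie, k=2):
    """Predict a rating from the k most-agreeing users who rated the movie.

    Each neighbor's vote is how far they rated the movie above or below
    their own average, weighted by how strongly they agree with `user`.
    """
    neighbors = [
        (agreement(user, other), other)
        for other in ratings
        if other != user and movie in ratings[other]
    ]
    # Keep only users who agreed in the past, closest first.
    neighbors = sorted((n for n in neighbors if n[0] > 0), reverse=True)[:k]
    baseline = mean(ratings[user].values())
    total_weight = sum(weight for weight, _ in neighbors)
    if total_weight == 0:
        return baseline  # nobody comparable: fall back to the user's own average
    offset = sum(
        weight * (ratings[other][movie] - mean(ratings[other].values()))
        for weight, other in neighbors
    )
    return baseline + offset / total_weight

print(round(predict("ann", "Dune"), 2))
```

Here cara disagrees with ann on every shared movie and is therefore ignored, while bob and dave pull the prediction above ann's own average. Raising `k` averages over more neighbors, which is the bias and variance trade-off the excerpt describes.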


pages: 375 words: 88,306

The Sharing Economy: The End of Employment and the Rise of Crowd-Based Capitalism by Arun Sundararajan

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

3D printing, additive manufacturing, Airbnb, Amazon Mechanical Turk, autonomous vehicles, barriers to entry, basic income, bitcoin, blockchain, Burning Man, call centre, collaborative consumption, collaborative economy, collective bargaining, commoditize, corporate social responsibility, cryptocurrency, David Graeber, distributed ledger, employer provided health coverage, Erik Brynjolfsson, ethereum blockchain, Frank Levy and Richard Murnane: The New Division of Labor, future of work, George Akerlof, gig economy, housing crisis, Howard Rheingold, information asymmetry, Internet of things, inventory management, invisible hand, job automation, job-hopping, Kickstarter, knowledge worker, Kula ring, Lyft, Marc Andreessen, megacity, minimum wage unemployment, moral hazard, moral panic, Network effects, new economy, Oculus Rift, pattern recognition, peer-to-peer, peer-to-peer lending, peer-to-peer model, peer-to-peer rental, profit motive, purchasing power parity, race to the bottom, recommendation engine, regulatory arbitrage, rent control, Richard Florida, ride hailing / ride sharing, Robert Gordon, Ronald Coase, Second Machine Age, self-driving car, sharing economy, Silicon Valley, smart contracts, Snapchat, social software, supply-chain management, TaskRabbit, The Nature of the Firm, total factor productivity, transaction costs, transportation-network company, two-sided market, Uber and Lyft, Uber for X, universal basic income, Zipcar

Thus, a big fraction of Google’s impact on the economy isn’t captured since changes in consumer surplus are not reflected in the GDP. This point has been noted about digital markets more generally. While a conventional brick-and-mortar bookstore may hold 40,000 to 100,000 books, Amazon offers access to over 3 million books. The same expansion in variety holds true for music, movies, electronics, and myriad other products. Furthermore, since Amazon uses several recommender systems to help promote products, it is not just variety but “fit” that has increased.14 Capturing the economic impacts of enhanced variety and automated word-of-mouth promotions, however, is difficult, since once again, what has changed is primarily the quality of the consumer experience. As Erik Brynjolfsson, Yu (Jeffery) Hu, and Michael Smith argue in their study of consumer surplus in the digital economy, these benefits may be particularly difficult to measure because different consumers are impacted to varying degrees.

This improves the welfare of these consumers by allowing them to locate and buy specialty products they otherwise would not have purchased due to high transaction costs or low product awareness. This effect will be especially beneficial to those consumers who live in remote areas.”15 Analogous increases in consumer surplus were documented by Anindya Ghose, Rahul Telang and Michael Smith in their 2005 study of electronic markets for used books.16 These effects are exacerbated by a wide variety of recommender systems that use machine learning algorithms to better direct consumer choice. As Alexander Tuzhilin and Gedas Adomavicius document, such systems are ubiquitous in digital markets.17 It is natural to expect similar challenges when, for example, trying to encompass the different economic impacts of increased variety and fit from Airbnb, or increased convenience from Lyft, or Dennis’s increased access to financing on the Isle of Gigha.

Smith, “Consumer Surplus in the Digital Economy: Estimating the Value of Increased Product Variety at Online Booksellers,” Management Science 49, 11 (2003): 1580–1596, 1581. 16. Anindya Ghose, Rahul Telang and Michael D. Smith, “Internet Exchanges for Used Books: An Empirical Analysis of Product Cannibalization and Welfare Impact,” Information Systems Research 17, 1 (2006): 3–9. http://pubsonline.informs.org/doi/abs/10.1287/isre.1050.0072. 17. Alexander Tuzhilin and Gedas Adomavicius, “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions,” IEEE Transactions on Knowledge and Data Engineering 17, 6 (2006): 734–739. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1423975&tag=1. 18. Prasanna Tambe and Lorin M. Hitt, “Job Hopping, Information Technology Spillovers, and Productivity Growth,” Management Science 60, 2 (2013): 338–355. 19. One might instead consider using the term “efficiency” of capital or “productivity” of capital.


pages: 410 words: 119,823

Radical Technologies: The Design of Everyday Life by Adam Greenfield

3D printing, Airbnb, augmented reality, autonomous vehicles, bank run, barriers to entry, basic income, bitcoin, blockchain, business intelligence, business process, call centre, cellular automata, centralized clearinghouse, centre right, Chuck Templeton: OpenTable, cloud computing, collective bargaining, combinatorial explosion, Computer Numeric Control, computer vision, Conway's Game of Life, cryptocurrency, David Graeber, dematerialisation, digital map, distributed ledger, drone strike, Elon Musk, ethereum blockchain, facts on the ground, fiat currency, global supply chain, global village, Google Glasses, IBM and the Holocaust, industrial robot, informal economy, information retrieval, Internet of things, James Watt: steam engine, Jane Jacobs, Jeff Bezos, job automation, John Conway, John Markoff, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, John von Neumann, joint-stock company, Kevin Kelly, Kickstarter, late capitalism, license plate recognition, lifelogging, M-Pesa, Mark Zuckerberg, means of production, megacity, megastructure, minimum viable product, money: store of value / unit of account / medium of exchange, natural language processing, Network effects, New Urbanism, Occupy movement, Oculus Rift, Pareto efficiency, pattern recognition, Pearl River Delta, performance metric, Peter Eisenman, Peter Thiel, planetary scale, Ponzi scheme, post scarcity, RAND corporation, recommendation engine, RFID, rolodex, Satoshi Nakamoto, self-driving car, sentiment analysis, shareholder value, sharing economy, Silicon Valley, smart cities, smart contracts, sorting algorithm, special economic zone, speech recognition, stakhanovite, statistical model, stem cell, technoutopianism, Tesla Model S, the built environment, The Death and Life of Great American Cities, The Future of Employment, transaction costs, Uber for X, universal basic income, urban planning, urban sprawl, Whole Earth Review, WikiLeaks, women in the workforce

The equivalent of classification for unsupervised learning is clustering, in which an algorithm starts to develop a sense for what is significant in its environment via a process of accretion. A concrete example will help us understand how this works. At the end of the 1990s, two engineers named Tim Westergren and Will Glaser developed a rudimentary music-recommendation engine called the Music Genome Project that worked by rebuilding genre from the bottom up. (The engineers eventually founded the Pandora streaming service, and folded their recommendation engine into it.) Music Genome compared the acoustic signatures and other performance characteristics of the pieces of music it was offered, and from them built up associative maps, clustering together all the songs that had similar qualities; after many iterations, these clusters developed a strong resemblance to the musical categories we’re familiar with.
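
The clustering idea in this excerpt can be illustrated with a minimal sketch. It uses plain k-means, which is a stand-in chosen for illustration and not a description of how the Music Genome Project actually works. The song names and the two-number feature vectors (a scaled tempo and a distortion level) are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical features: (tempo scaled to 0..1, distortion level 0..1)
songs = {
    "quiet_ballad": (0.30, 0.05),
    "acoustic_folk": (0.35, 0.10),
    "thrash_track": (0.90, 0.95),
    "punk_single": (0.85, 0.80),
}
names = list(songs)
points = [songs[name] for name in names]

# Seed the two clusters with two of the songs (an arbitrary starting choice).
labels, _ = kmeans(points, centroids=[points[0], points[2]])
clusters = dict(zip(names, labels))
```

No genre labels are supplied anywhere; the grouping of the two quiet songs and the two loud songs emerges only from the distances between feature vectors.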


pages: 176 words: 55,819

The Start-Up of You by Reid Hoffman

Airbnb, Andy Kessler, Black Swan, business intelligence, Cal Newport, Clayton Christensen, commoditize, David Brooks, Donald Trump, en.wikipedia.org, fear of failure, follow your passion, future of work, game design, Jeff Bezos, job automation, late fees, Marc Andreessen, Mark Zuckerberg, Menlo Park, out of africa, Paul Graham, Peter Thiel, recommendation engine, Richard Bolles, risk tolerance, rolodex, shareholder value, side project, Silicon Valley, Silicon Valley startup, social web, Steve Jobs, Steve Wozniak, Tony Hsieh, transaction costs

In 1999 he set up a meeting at Blockbuster’s headquarters in part to discuss possibly partnering on local distribution and faster fulfillment. Blockbuster was not impressed. “They just about laughed us out of their office,” Reed recalls.16 Reed and his team kept at it. They perfected their distribution center network so that more than 80 percent of customers received overnight delivery of movies.17 They developed an innovative recommendation engine that prompted users with movies they might like based on past purchases. By 2005 Netflix had a subscriber base four million strong, had fended off competition from imitations like Walmart’s online movie-by-mail effort, and became the king of online movie rentals. In 2010 Netflix made a profit of more than $160 million. Blockbuster, in comparison, failed to adapt to the Internet era. That year it filed for bankruptcy.18 Netflix is not resting.


pages: 222 words: 53,317

Overcomplicated: Technology at the Limits of Comprehension by Samuel Arbesman

3D printing, algorithmic trading, Anton Chekhov, Apple II, Benoit Mandelbrot, citation needed, combinatorial explosion, Danny Hillis, David Brooks, digital map, discovery of the americas, en.wikipedia.org, Erik Brynjolfsson, Flash crash, friendly AI, game design, Google X / Alphabet X, Googley, HyperCard, Inbox Zero, Isaac Newton, iterative process, Kevin Kelly, Machine translation of "The spirit is willing, but the flesh is weak." to Russian and back, mandelbrot fractal, Minecraft, Netflix Prize, Nicholas Carr, Parkinson's law, Ray Kurzweil, recommendation engine, Richard Feynman, Richard Feynman: Challenger O-ring, Second Machine Age, self-driving car, software studies, statistical model, Steve Jobs, Steve Wozniak, Steven Pinker, Stewart Brand, superintelligent machines, Therac-25, Tyler Cowen: Great Stagnation, urban planning, Watson beat the top human players on Jeopardy!, Whole Earth Catalog, Y2K

The sophisticated machine learning techniques used in linguistics—employing probability and a large array of parameters rather than principled rules—are increasingly being used in numerous other areas, both in science and outside it, from criminal detection to medicine, as well as in the insurance industry. Even our aesthetic tastes are rather complicated, as Netflix discovered when it awarded a prize for improvements in its recommendation engine to a team whose solution was cobbled together from a variety of different statistical techniques. The contest seemed to demonstrate that no simple algorithm could provide a significant improvement in recommendation accuracy; the winners needed to use a more complex suite of methods in order to capture and predict our personal and quirky tastes in films. This phenomenon occurs in all types of technology.


pages: 181 words: 52,147

The Driver in the Driverless Car: How Our Technology Choices Will Create the Future by Vivek Wadhwa, Alex Salkever

23andMe, 3D printing, Airbnb, artificial general intelligence, augmented reality, autonomous vehicles, barriers to entry, Bernie Sanders, bitcoin, blockchain, clean water, correlation does not imply causation, distributed ledger, Donald Trump, double helix, Elon Musk, en.wikipedia.org, epigenetics, Erik Brynjolfsson, Google bus, Hyperloop, income inequality, Internet of things, job automation, Kevin Kelly, Khan Academy, Law of Accelerating Returns, license plate recognition, life extension, Lyft, M-Pesa, Menlo Park, microbiome, mobile money, new economy, personalized medicine, phenotype, precision agriculture, RAND corporation, Ray Kurzweil, recommendation engine, Ronald Reagan, Second Machine Age, self-driving car, Silicon Valley, Skype, smart grid, stem cell, Stephen Hawking, Steve Wozniak, Stuxnet, supercomputer in your pocket, Tesla Model S, The Future of Employment, Turing test, Uber and Lyft, Uber for X, uranium enrichment, Watson beat the top human players on Jeopardy!, zero day

In general, narrow-A.I. systems can do a better job on a very specific range of tasks than humans can. I couldn’t, for example, recall the winning and losing pitcher in every baseball game of the major leagues from the previous night. Narrow A.I. is now embedded in the fabric of our everyday lives. The humanoid phone trees that route calls to airlines’ support desks are all narrow A.I., as are recommendation engines in Amazon and Spotify. Google Maps’ astonishingly smart route suggestions (and mid-course modifications to avoid traffic) are classic narrow A.I. Narrow-A.I. systems are much better than humans are at accessing information stored in complex databases, but their capabilities are specific and limited, and exclude creative thought. If you asked Siri to find the perfect gift for your mother for Valentine’s Day, she might make a snarky comment, but she couldn’t venture an educated guess.


pages: 593 words: 118,995

Relevant Search: With Examples Using Elasticsearch and Solr by Doug Turnbull, John Berryman

commoditize, crowdsourcing, domain-specific language, finite state, fudge factor, full text search, information retrieval, natural language processing, premature optimization, recommendation engine, sentiment analysis

These methods are less intuitive than the simple co-occurrence counting method presented here, and they tend to be more challenging to implement. But they often provide better results, because they employ a more holistic understanding of item-user relationships. To dive deeper into recommendation systems, we recommend Practical Recommender Systems by Kim Falk (Manning, 2016). And no matter the method you choose, keep in mind that the end result is a model that lets you quickly find the item-to-item or user-to-item affinities. This understanding is important as we explain how collaborative filtering results can be used in the context of search. 11.2.3. Tying user behavior information back to the search index In the previous section, we demonstrated how to build a simple recommendation system. But we’re supposed to be talking about personalized search! In this section, we return to search and explain how the output of collaborative filtering can be used to build a more personalized search experience.
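
As a rough illustration of the "simple co-occurrence counting method" this excerpt refers to, here is a minimal sketch. It is not the book's own code: the function names, the user histories, and the item names are all hypothetical, and it only counts how often two items appear in the same user's history.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_counts(histories):
    """For every item, count how often each other item shares a user history with it."""
    counts = defaultdict(Counter)
    for items in histories.values():
        # Each unordered pair of distinct items in one history co-occurs once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(item, counts, n=3):
    """Return up to n items that most often co-occur with the given item."""
    return [other for other, _ in counts[item].most_common(n)]

# Hypothetical viewing histories keyed by user.
histories = {
    "user_1": ["alien", "aliens", "blade_runner"],
    "user_2": ["alien", "aliens"],
    "user_3": ["alien", "blade_runner", "amelie"],
    "user_4": ["amelie", "chocolat"],
}
counts = cooccurrence_counts(histories)
top_for_aliens = recommend("aliens", counts, n=1)
```

The resulting `counts` structure is the item-to-item affinity model the excerpt describes: once built, looking up related items is a cheap dictionary read.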

In both cases, we start with relatively simple methods and then outline more sophisticated approaches using machine learning. In the process of laying out personalized search, we introduce recommendations. You can provide users with personalized content recommendations even before they’ve made a search. In addition, you’ll see that a search engine can be a powerful platform for building a recommendation system. Figure 11.1 shows recommendations side-by-side with search, implemented by a relevance engineer. Figure 11.1. By incorporating knowledge about the content and the user, search can be extended to tasks such as personalized search and recommendations. 11.1. Personalizing search based on user profiles Until now, we’ve defined relevance in terms of how well a search result matches a user’s immediate information need.

Here, information comes in three flavors: information about the users, about the items in the catalog, and about the current context of recommendation: User information —As users interact with the application, you can identify patterns in their behavior and learn about their interests and tastes. Particularly engaged users might even be willing to directly tell us about their interests. Item information —To make good recommendations, it’s important to be familiar with the items in the catalog. At a minimum, the items need to have useful textual content to match on. Items also need good metadata for boosting and filtering. In more advanced recommendation systems, you should also take advantage of the overall user behavior that gives you new information about how items in the catalog are interrelated. Recommendation context —To provide users with the best recommendations possible, you must consider their current context. Are they looking at an item details page? Then you should make recommendations for related items in case they aren’t sold on this one.


pages: 561 words: 120,899

The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant From Two Centuries of Controversy by Sharon Bertsch McGrayne

Bayesian statistics, bioinformatics, British Empire, Claude Shannon: information theory, Daniel Kahneman / Amos Tversky, double helix, Edmond Halley, Fellow of the Royal Society, full text search, Henri Poincaré, Isaac Newton, John Markoff, John Nash: game theory, John von Neumann, linear programming, meta-analysis, Nate Silver, p-value, Pierre-Simon Laplace, placebo effect, prediction markets, RAND corporation, recommendation engine, Renaissance Technologies, Richard Feynman, Richard Feynman: Challenger O-ring, Ronald Reagan, speech recognition, statistical model, stochastic process, Thomas Bayes, Thomas Kuhn: the structure of scientific revolutions, traveling salesman, Turing machine, Turing test, uranium enrichment, Yom Kippur War

Pouget A et al. (2009) Neural Computations as Laplacian (or is it Bayesian?) probabilistic inference. In draft. Quatse JT, Najmi A. (2007) Empirical Bayesian targeting. Proceedings, 2007 World Congress in Computer Science, Computer Engineering, and Applied Computing, June 25–28, 2007. Schafer JB, Konstan J, Riedl J. (1999) Recommender systems in E-commerce. In ACM Conference on Electronic Commerce (EC-99) 158–66. Schafer JB, Konstan J, Riedl J. (2001) Recommender systems in E-commerce. Data Mining and Knowledge Discovery (5) 115–53. Schneider, Stephen H. (2005) The Patient from Hell. Perseus Books. Spolsky, Joel. (2005) (http://www.joelonsoftware.com/items/2005/10/17.html). Swinburne, Richard, ed. (2002) Bayes’s Theorem. Oxford University Press. Taylor BL et al. (2000) Incorporating uncertainty into management models for marine mammals.

Users refine their own filters by reading low-scoring messages and either keeping them or sending them to trash and junk files. This use of Bayesian optimal classifiers is similar to the technique used by Frederick Mosteller and David Wallace to determine who wrote certain Federalist papers. Bayesian theory is firmly embedded in Microsoft’s Windows operating system. In addition, a variety of Bayesian techniques are involved in Microsoft’s handwriting recognition; recommender systems; the question-answering box in the upper right corner of a PC’s monitor screen; a datamining software package for tracking business sales; a program that infers the applications that users will want and preloads them before they are requested; and software to make traffic jam predictions for drivers to check before their commute. Bayes was blamed—unfairly, say Heckerman and Horwitz—for Microsoft’s memorably annoying paperclip, Clippy.

As the e-commerce refrain goes, “If you liked this book/song/movie, you’ll like that one too.” The updating used in machine learning does not necessarily follow Bayes’ theorem formally but “shares its perspective.” A $1-million contest sponsored by Netflix.com illustrates the prominent role of Bayesian concepts in modern e-commerce and learning theory. In 2006 the online film-rental company launched a search for the best recommender system to improve its own algorithm. More than 50,000 contestants from 186 countries vied over the four years of the competition. The AT&T Labs team organized around Yehuda Koren, Christopher T. Volinsky, and Robert M. Bell won the prize in September 2009. Interestingly, although no contestants questioned Bayes as a legitimate method, almost none wrote a formal Bayesian model. The winning group relied on empirical Bayes but estimated the initial priors according to their frequencies.
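
To make the phrase "empirical Bayes" in this excerpt concrete, here is a minimal sketch of one common form of it: estimating the prior from the data itself and shrinking each movie's average rating toward that prior. This is an illustration only and not the winning team's model. The movie names, the ratings, and the `prior_weight` of 5 pseudo-ratings are hypothetical.

```python
def shrunk_mean(ratings, prior_mean, prior_weight):
    """Blend the observed ratings with prior_weight pseudo-ratings at prior_mean."""
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + len(ratings))

# Hypothetical star ratings per movie.
ratings_by_movie = {
    "widely_rated": [5, 4, 5, 4, 5, 4, 5, 4],
    "mediocre": [3, 2, 3, 3, 2, 3, 3, 3],
    "single_five": [5],
}

# The "empirical" step: the prior mean is estimated from all observed ratings.
all_ratings = [r for ratings in ratings_by_movie.values() for r in ratings]
prior_mean = sum(all_ratings) / len(all_ratings)

estimates = {
    movie: shrunk_mean(ratings, prior_mean, prior_weight=5)
    for movie, ratings in ratings_by_movie.items()
}
```

A movie with a single five-star rating has a raw average of 5, but its shrunk estimate stays close to the overall mean, so it ranks below a movie with many consistently high ratings.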


pages: 229 words: 68,426

Everyware: The Dawning Age of Ubiquitous Computing by Adam Greenfield

augmented reality, business process, defense in depth, demand response, demographic transition, facts on the ground, game design, Howard Rheingold, Internet of things, James Dyson, knowledge worker, late capitalism, Marshall McLuhan, new economy, Norbert Wiener, packet switching, pattern recognition, profit motive, QR code, recommendation engine, RFID, Steve Jobs, technoutopianism, the built environment, the scientific method

But the word "hint" is well-chosen here, because that's really all the cup will be able to communicate. It may well be that a full mug on my desk implies that I am also in the room, but this is not always going to be the case, and any system that correlates the two facts had better do so pretty loosely. Products and services based on such pattern-recognition already exist in the world—I think of Amazon's "collaborative filtering"–driven recommendation engine—but for the most part, their designers are only now beginning to recognize that they have significantly underestimated the difficulty of deriving meaning from those patterns. The better part of my Amazon recommendations turn out to be utterly worthless—and of all commercial pattern-recognition systems, that's among those with the largest pools of data to draw on. Lest we forget: "simple" is hard.


pages: 265 words: 74,000

The Numerati by Stephen Baker

Berlin Wall, Black Swan, business process, call centre, correlation does not imply causation, Drosophila, full employment, illegal immigration, index card, Isaac Newton, job automation, job satisfaction, McMansion, Myron Scholes, natural language processing, PageRank, personalized medicine, recommendation engine, RFID, Silicon Valley, Skype, statistical model, Watson beat the top human players on Jeopardy!

It will simply issue alerts when it detects changes in patterns and perhaps urge the user to schedule a medical appointment. It will be up to doctors and nurses to follow up, figuring out why someone is limping or swaying differently at the kitchen sink. But in time, these systems will have enough feedback from thousands of users that they should be able to point people—either doctors or patients—to the most probable cause. In this way, they will work like the recommendation engines on Netflix or Amazon.com, which point people toward books or movies that are popular among customers with similar patterns. (Amazon and Netflix, of course, don't always get it right, and neither will the analysis issuing from the magic carpet. It will only point caregivers toward statistically probable causes.) Dishman's team has installed magic carpets in the homes of people with neurological disorders or a history of falling.


pages: 231 words: 71,248

Shipping Greatness by Chris Vander Mey

corporate raider, don't be evil, en.wikipedia.org, fudge factor, Google Chrome, Google Hangouts, Gordon Gekko, Jeff Bezos, Kickstarter, Lean Startup, minimum viable product, performance metric, recommendation engine, Skype, slashdot, sorting algorithm, source of truth, Steve Jobs, Superbowl ad, web application

We chose to focus initially on professionals because while teens and tweens have time to spend on Facebook and YouTube, professionals have less time but also have rich networks and strong opinions—not to mention disposable capital to spend on content. Using IMDb’s unique collection of movie data and Amazon’s ability to distribute digital content and proven personalization tools, we will uniquely solve the content discovery problem by integrating these technologies and building unique suggestion algorithms. Unlike competitors such as Netflix, who already have a recommendations engine, we’ll integrate across all video sources and use our richer data to provide more interesting in-viewing experiences and more accurate recommendations. We will deliver these in-viewing experiences through platforms that can expose contextually relevant data (e.g., the cast of a YouTube video), such as a browser plug-in for YouTube and mobile applications for phones. We can also enlighten viewers by providing rich information about the content they are consuming, and prompt for feedback—creating a virtuous cycle in which all users benefit.


pages: 260 words: 76,223

Ctrl Alt Delete: Reboot Your Business. Reboot Your Life. Your Future Depends on It. by Mitch Joel

3D printing, Amazon Web Services, augmented reality, call centre, clockwatching, cloud computing, Firefox, future of work, ghettoisation, Google Chrome, Google Glasses, Google Hangouts, Khan Academy, Kickstarter, Kodak vs Instagram, Lean Startup, Marc Andreessen, Mark Zuckerberg, Network effects, new economy, Occupy movement, place-making, prediction markets, pre–internet, QR code, recommendation engine, Richard Florida, risk tolerance, self-driving car, Silicon Valley, Silicon Valley startup, Skype, social graph, social web, Steve Jobs, Steve Wozniak, Thomas L Friedman, Tim Cook: Apple, Tony Hsieh, white picket fence, WikiLeaks, zero-sum game

In fact, it’s actually very squiggly. Always bear that in mind. Embrace the squiggle. THE REALITY OF CAREER CHOICES IN A CTRL ALT DELETE WORLD. You can contrast the fictional story above with the tale of a friend of mine. This individual was never really sure what she wanted to do. There was no clear desire or talent in a single area of interest. In her final years of high school, a guidance counselor recommended engineering or the sciences because she had above-average math grades. So my friend studied engineering through university and squeaked by. Never passionate about it, she got her diploma and entered the workforce. I had lunch with her a while back and she confessed that she was miserable because of her work but could not figure out why. She had followed all the rules; she did okay in school, she advanced in a field that typically enables you to be both employable and well paid.


pages: 326 words: 74,433

Do More Faster: TechStars Lessons to Accelerate Your Startup by Brad Feld, David Cohen

augmented reality, computer vision, corporate governance, crowdsourcing, disintermediation, hiring and firing, Inbox Zero, Jeff Bezos, knowledge worker, Lean Startup, Ray Kurzweil, recommendation engine, risk tolerance, Silicon Valley, Skype, slashdot, social web, software as a service, Steve Jobs

—thehighwaygirl.com Travelfli (2008)—Now UsingMiles, helps frequent flyers maximize the full potential of their loyalty programs.—usingmiles.com TutorialTab (2010)—lets companies make their web site more learnable.—tutorialtab.com Usermojo (2010)—is an emotion analytics platform that tells you why users do what they do.—usermojo.com Vanilla (2009)—is open source forum software.—vanillaforums.com Villij (2007)—is a recommendation engine for people.—villij.com Vacation Rental Partner (2010)—makes it easy to generate revenue from a second home. We offer tools that eliminate the need for traditional property management companies.—vacationrentalpartner.com TechStars companies funded after publication are listed on the TechStars web site. About the Authors Brad Feld is a co-founder and managing director at Foundry Group, an early stage venture capital firm, and a co-founder of TechStars.


pages: 252 words: 72,473

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O'Neil

Affordable Care Act / Obamacare, Bernie Madoff, big data - Walmart - Pop Tarts, call centre, carried interest, cloud computing, collateralized debt obligation, correlation does not imply causation, Credit Default Swap, credit default swaps / collateralized debt obligations, crowdsourcing, Emanuel Derman, housing crisis, I will remember that I didn’t make the world, and it doesn’t satisfy my equations, illegal immigration, Internet of things, late fees, mass incarceration, medical bankruptcy, Moneyball by Michael Lewis explains big data, new economy, obamacare, Occupy movement, offshore financial centre, payday loans, peer-to-peer lending, Peter Thiel, Ponzi scheme, prediction markets, price discrimination, quantitative hedge fund, Ralph Nader, RAND corporation, recommendation engine, Rubik’s Cube, Sharpe ratio, statistical model, Tim Cook: Apple, too big to fail, Unsafe at Any Speed, Upton Sinclair, Watson beat the top human players on Jeopardy!, working poor

Investors, of course, feast on these returns and shower WMD companies with more money. And the victims? Well, an internal data scientist might say, no statistical system can be perfect. Those folks are collateral damage. And often, like Sarah Wysocki, they are deemed unworthy and expendable. Forget about them for a minute, they might say, and focus on all the people who get helpful suggestions from recommendation engines or who find music they love on Pandora, the ideal job on LinkedIn, or perhaps the love of their life on Match.com. Think of the astounding scale, and ignore the imperfections. Big Data has plenty of evangelists, but I’m not one of them. This book will focus sharply in the other direction, on the damage inflicted by WMDs and the injustice they perpetuate. We will explore harmful examples that affect people at critical life moments: going to college, borrowing money, getting sentenced to prison, or finding and holding a job.


pages: 265 words: 69,310

What's Yours Is Mine: Against the Sharing Economy by Tom Slee

4chan, Airbnb, Amazon Mechanical Turk, asset-backed security, barriers to entry, Berlin Wall, big-box store, bitcoin, blockchain, citizen journalism, collaborative consumption, congestion charging, Credit Default Swap, crowdsourcing, data acquisition, David Brooks, don't be evil, gig economy, Hacker Ethic, income inequality, informal economy, invisible hand, Jacob Appelbaum, Jane Jacobs, Jeff Bezos, Khan Academy, Kibera, Kickstarter, license plate recognition, Lyft, Marc Andreessen, Mark Zuckerberg, move fast and break things, natural language processing, Netflix Prize, Network effects, new economy, Occupy movement, openstreetmap, Paul Graham, peer-to-peer, peer-to-peer lending, Peter Thiel, pre–internet, principal–agent problem, profit motive, race to the bottom, Ray Kurzweil, recommendation engine, rent control, ride hailing / ride sharing, sharing economy, Silicon Valley, Snapchat, software is eating the world, South of Market, San Francisco, TaskRabbit, The Nature of the Firm, Thomas L Friedman, transportation-network company, Uber and Lyft, Uber for X, ultimatum game, urban planning, WikiLeaks, winner-take-all economy, Y Combinator, Zipcar

This meant everyone using the system would pretty quickly develop a relevant ‘reputation’ visible to everyone else in the system.” 2 Friedman was writing just a couple of weeks after his New York Times stablemate David Brooks described “How Airbnb and Lyft Finally Got Americans to Trust Each Other”: “Companies like Airbnb establish trust through ratings mechanisms . . . People in the Airbnb economy don’t have the option of trusting each other on the basis of institutional affiliations, so they do it on the basis of online signaling and peer evaluations.” 3 Sharing Economy companies are not the first to use ratings and algorithms to guide behavior. Their trust systems build on the rating and recommendation systems used by Amazon, Netflix, eBay, Yelp, TripAdvisor, iTunes, the App Store and many others. Each takes individual ratings as their input and transforms them into some form of recommendation. As rating systems have become ubiquitous their usefulness has become a matter of faith in the world of software development. The Sharing Economy is at the cutting edge of a push for “algorithmic regulation” in which rules protecting consumers are replaced by ratings and software algorithms.

For Anderson, Amazon represents the return of variety and diversity after decades of homogenous blockbusters: “We are turning from a mass market back into a niche nation, defined not by geography but by interests.” 19 In a Long Tail world there is no need for formal gatekeepers who select or restrict the works that can find their public; instead, Web 2.0 platforms will do it for us using crowdsourced consumer reviews and recommender systems: “By combining infinite shelf space with real-time information about buying trends and public opinion . . . unlimited selection is revealing truths about what consumers want and how they want to get it.” 20 Amazon and Airbnb are similar in many ways. Both are, at least in part, software companies whose inventory is simply a set of entries in a database, accessed via a web site. Anything can go into the database: for Amazon’s books it might be Harry Potter or a self-published obscurity, or anything in between.


pages: 319 words: 89,477

The Power of Pull: How Small Moves, Smartly Made, Can Set Big Things in Motion by John Hagel Iii, John Seely Brown

Albert Einstein, Andrew Keen, barriers to entry, Black Swan, business process, call centre, Clayton Christensen, cleantech, cloud computing, commoditize, corporate governance, creative destruction, Elon Musk, en.wikipedia.org, future of work, game design, George Gilder, intangible asset, Isaac Newton, job satisfaction, knowledge economy, knowledge worker, loose coupling, Louis Pasteur, Malcom McLean invented shipping containers, Maui Hawaii, medical residency, Network effects, old-boy network, packet switching, pattern recognition, peer-to-peer, pre–internet, profit motive, recommendation engine, Ronald Coase, shareholder value, Silicon Valley, Skype, smart transportation, software as a service, supply-chain management, The Nature of the Firm, the new new thing, too big to fail, trade liberalization, transaction costs

Blurring Creation and Use Pull platforms tend to allow us to perform the following activities, with a blurring of the boundaries between creation and use: • Find. Pull platforms allow us to find not just raw materials, products, and services, but also people with relevant skills and experience. Some of the tools and services that pull platforms use to help participants find relevant resources include search, recommendation engines, directories, agents, and reputation services. • Connect. Again, pull platforms connect us not just to raw materials, products, and services, but also to people with relevant skills and experiences. Performance fabrics5 are particularly helpful in establishing appropriate connections. The mobile Internet is dramatically extending our ability to connect wherever we are. • Innovate. Pull platforms provide much more flexible environments for participants to innovate with the resources made available to them.


pages: 713 words: 93,944

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond, Jim R. Wilson

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Amazon Web Services, create, read, update, delete, data is the new oil, database schema, Debian, domain-specific language, en.wikipedia.org, fault tolerance, full text search, general-purpose programming language, linked data, MVC pattern, natural language processing, node package manager, random walk, recommendation engine, Ruby on Rails, Skype, social graph, web application

Neo4j, as our open source example, is growing in popularity for many social network applications. Unlike other database styles that group collections of like objects into common buckets, graph databases are more free-form—queries consist of following edges shared by two nodes or, namely, traversing nodes. As more projects use them, graph databases are growing beyond the straightforward social examples to occupy more nuanced use cases, such as recommendation engines, access control lists, and geographic data. Good For: Graph databases seem to be tailor-made for networking applications. The prototypical example is a social network, where nodes represent users who have various kinds of relationships to each other. Modeling this kind of data using any of the other styles is often a tough fit, but a graph database would accept it with relish. They are also perfect matches for an object-oriented system.
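The traversal idea behind a graph-based recommendation can be sketched in a few lines. This is a minimal illustration in plain Python, not Neo4j's own query language or API; the toy data, the `FRIEND` and `LIKES` edge labels, and the `recommend` helper are all hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical toy graph: each node maps an edge label to its neighbouring nodes.
GRAPH = {
    "alice": {"FRIEND": ["bob", "carol"], "LIKES": ["dune"]},
    "bob": {"FRIEND": ["alice"], "LIKES": ["dune", "neuromancer"]},
    "carol": {"FRIEND": ["alice"], "LIKES": ["neuromancer", "hyperion"]},
}


def recommend(graph, user):
    """Follow FRIEND edges, then LIKES edges, and rank items the user has not liked."""
    already_liked = set(graph[user].get("LIKES", []))
    votes = Counter()
    for friend in graph[user].get("FRIEND", []):
        for item in graph.get(friend, {}).get("LIKES", []):
            if item not in already_liked:
                votes[item] += 1
    # Items liked by the most friends come first; ties break alphabetically.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked]


print(recommend(GRAPH, "alice"))  # ['neuromancer', 'hyperion']
```

The query is just two hops along labelled edges, which is the kind of access pattern the excerpt describes as a natural fit for a graph database.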


pages: 339 words: 88,732

The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson, Andrew McAfee

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

2013 Report for America's Infrastructure - American Society of Civil Engineers - 19 March 2013, 3D printing, access to a mobile phone, additive manufacturing, Airbnb, Albert Einstein, Amazon Mechanical Turk, Amazon Web Services, American Society of Civil Engineers: Report Card, Any sufficiently advanced technology is indistinguishable from magic, autonomous vehicles, barriers to entry, basic income, Baxter: Rethink Robotics, British Empire, business intelligence, business process, call centre, Chuck Templeton: OpenTable, clean water, combinatorial explosion, computer age, computer vision, congestion charging, corporate governance, creative destruction, crowdsourcing, David Ricardo: comparative advantage, digital map, employer provided health coverage, en.wikipedia.org, Erik Brynjolfsson, factory automation, falling living standards, Filter Bubble, first square of the chessboard / second half of the chessboard, Frank Levy and Richard Murnane: The New Division of Labor, Freestyle chess, full employment, game design, global village, happiness index / gross national happiness, illegal immigration, immigration reform, income inequality, income per capita, indoor plumbing, industrial robot, informal economy, intangible asset, inventory management, James Watt: steam engine, Jeff Bezos, jimmy wales, job automation, John Markoff, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, Joseph Schumpeter, Kevin Kelly, Khan Academy, knowledge worker, Kodak vs Instagram, law of one price, low skilled workers, Lyft, Mahatma Gandhi, manufacturing employment, Marc Andreessen, Mark Zuckerberg, Mars Rover, mass immigration, means of production, Narrative Science, Nate Silver, natural language processing, Network effects, new economy, New Urbanism, Nicholas Carr, Occupy movement, oil shale / tar sands, oil shock, pattern recognition, Paul Samuelson, payday loans, price stability, Productivity paradox, profit maximization, Ralph Nader, Ray Kurzweil, recommendation engine, Report Card for America’s Infrastructure, Robert Gordon, Rodney Brooks, Ronald Reagan, Second Machine Age, self-driving car, sharing economy, Silicon Valley, Simon Kuznets, six sigma, Skype, software patent, sovereign wealth fund, speech recognition, statistical model, Steve Jobs, Steven Pinker, Stuxnet, supply-chain management, TaskRabbit, technological singularity, telepresence, The Bell Curve by Richard Herrnstein and Charles Murray, The Signal and the Noise by Nate Silver, The Wealth of Nations by Adam Smith, total factor productivity, transaction costs, Tyler Cowen: Great Stagnation, Vernor Vinge, Watson beat the top human players on Jeopardy!, winner-take-all economy, Y2K

When there are many small local markets, there can be a ‘best’ provider in each, and these local heroes frequently can all earn a good income. If these markets merge into a single global market, top performers have an opportunity to win more customers, while the next-best performers face harsher competition from all directions. A similar dynamic comes into play when technologies like Google or even Amazon’s recommendation engine reduce search costs. Suddenly second-rate producers can no longer count on consumer ignorance or geographic barriers to protect their margins. Digital technologies have aided the transition to winner-take-all markets, even for products we wouldn’t think would have superstar status. In a traditional camera store, cameras typically are not ranked number one versus number ten. But online retailers make it easy to list products in rank order by customer ratings, or to filter results to include only products with every conceivable desirable feature.


pages: 323 words: 95,939

Present Shock: When Everything Happens Now by Douglas Rushkoff

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

algorithmic trading, Andrew Keen, bank run, Benoit Mandelbrot, big-box store, Black Swan, British Empire, Buckminster Fuller, cashless society, citizen journalism, clockwork universe, cognitive dissonance, Credit Default Swap, crowdsourcing, Danny Hillis, disintermediation, Donald Trump, double helix, East Village, Elliott wave, European colonialism, Extropian, facts on the ground, Flash crash, game design, global supply chain, global village, Howard Rheingold, hypertext link, Inbox Zero, invention of agriculture, invention of hypertext, invisible hand, iterative process, John Nash: game theory, Kevin Kelly, laissez-faire capitalism, Law of Accelerating Returns, loss aversion, mandelbrot fractal, Marshall McLuhan, Merlin Mann, Milgram experiment, mutually assured destruction, negative equity, Network effects, New Urbanism, Nicholas Carr, Norbert Wiener, Occupy movement, passive investing, pattern recognition, peak oil, price mechanism, prisoner's dilemma, Ralph Nelson Elliott, RAND corporation, Ray Kurzweil, recommendation engine, selective serotonin reuptake inhibitor (SSRI), Silicon Valley, Skype, social graph, South Sea Bubble, Steve Jobs, Steve Wozniak, Steven Pinker, Stewart Brand, supply-chain management, the medium is the message, The Wisdom of Crowds, theory of mind, Turing test, upwardly mobile, Whole Earth Catalog, WikiLeaks, Y2K, zero-sum game

Today’s most vocal critic of this trend, The Cult of the Amateur author Andrew Keen, explains, “According to a June 2006 study by the Pew Internet and American Life Project, 34 percent of the 12 million bloggers in America consider their online ‘work’ to be a form of journalism. That adds up to millions of unskilled, untrained, unpaid, unknown ‘journalists’—a thousandfold growth between 1996 and 2006—spewing their (mis)information out in the cyberworld.” More sanguine voices, such as City University of New York journalism professor and BuzzFeed blogger Jeff Jarvis, argue that the market—amplified by search results and recommendation engines—will eventually allow the better journalism to rise to the top of the pile. But even market mechanisms may have a hard time functioning as we consumers of all this media lose our ability to distinguish between facts, informed opinions, and wild assertions. Our impatient disgust with politics as usual combined with our newfound faith in our own gut sensibilities drives us to take matters into our own hands—in journalism and beyond.


pages: 364 words: 99,897

The Industries of the Future by Alec Ross

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

23andMe, 3D printing, Airbnb, algorithmic trading, AltaVista, Anne Wojcicki, autonomous vehicles, banking crisis, barriers to entry, Bernie Madoff, bioinformatics, bitcoin, blockchain, Brian Krebs, British Empire, business intelligence, call centre, carbon footprint, cloud computing, collaborative consumption, connected car, corporate governance, Credit Default Swap, cryptocurrency, David Brooks, disintermediation, Dissolution of the Soviet Union, distributed ledger, Edward Glaeser, Edward Snowden, en.wikipedia.org, Erik Brynjolfsson, fiat currency, future of work, global supply chain, Google X / Alphabet X, industrial robot, Internet of things, invention of the printing press, Jaron Lanier, Jeff Bezos, job automation, John Markoff, knowledge economy, knowledge worker, lifelogging, litecoin, M-Pesa, Marc Andreessen, Mark Zuckerberg, Mikhail Gorbachev, mobile money, money: store of value / unit of account / medium of exchange, new economy, offshore financial centre, open economy, Parag Khanna, peer-to-peer, peer-to-peer lending, personalized medicine, Peter Thiel, precision agriculture, pre–internet, RAND corporation, Ray Kurzweil, recommendation engine, ride hailing / ride sharing, Rubik’s Cube, Satoshi Nakamoto, selective serotonin reuptake inhibitor (SSRI), self-driving car, sharing economy, Silicon Valley, Silicon Valley startup, Skype, smart cities, social graph, software as a service, special economic zone, supply-chain management, supply-chain management software, technoutopianism, The Future of Employment, underbanked, Vernor Vinge, Watson beat the top human players on Jeopardy!, women in the workforce, Y Combinator, young professional

Academics have likened it to both a microscope and telescope—a tool that allows us to both examine smaller details than could previously be observed and to see data at a larger scale, revealing correlations that were previously too distant for us to notice. The story of big data’s real-world impact to this point has been largely about logistics and persuasion. It has been great for supply chains, elections, and advertising because these tend to be fields with lots of small, repeated, and quantifiable actions—hence the “recommendation engines” used by Amazon and Netflix that help make more precise recommendations to customers. But these fields are just the beginning, and by the time my kids enter the workforce, big data won’t be a buzz phrase any longer. It will have permeated parts of our lives that we do not think of today as being rooted in analytics. It will change what we eat, how we speak, and where we draw the line between our public and private personas.


pages: 421 words: 110,406

Platform Revolution: How Networked Markets Are Transforming the Economy--And How to Make Them Work for You by Sangeet Paul Choudary, Marshall W. van Alstyne, Geoffrey G. Parker

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

3D printing, Affordable Care Act / Obamacare, Airbnb, Alvin Roth, Amazon Mechanical Turk, Amazon Web Services, Andrei Shleifer, Apple's 1984 Super Bowl advert, autonomous vehicles, barriers to entry, big data - Walmart - Pop Tarts, bitcoin, blockchain, business process, buy low sell high, chief data officer, Chuck Templeton: OpenTable, clean water, cloud computing, connected car, corporate governance, crowdsourcing, data acquisition, data is the new oil, digital map, discounted cash flows, disintermediation, Edward Glaeser, Elon Musk, en.wikipedia.org, Erik Brynjolfsson, financial innovation, Haber-Bosch Process, High speed trading, information asymmetry, Internet of things, inventory management, invisible hand, Jean Tirole, Jeff Bezos, jimmy wales, John Markoff, Khan Academy, Kickstarter, Lean Startup, Lyft, Marc Andreessen, market design, Metcalfe’s law, multi-sided market, Network effects, new economy, payday loans, peer-to-peer lending, Peter Thiel, pets.com, pre–internet, price mechanism, recommendation engine, RFID, Richard Stallman, ride hailing / ride sharing, Robert Metcalfe, Ronald Coase, Satoshi Nakamoto, self-driving car, shareholder value, sharing economy, side project, Silicon Valley, Skype, smart contracts, smart grid, Snapchat, software is eating the world, Steve Jobs, TaskRabbit, The Chicago School, the payments system, Tim Cook: Apple, transaction costs, two-sided market, Uber and Lyft, Uber for X, winner-take-all economy, zero-sum game, Zipcar

Even more unsettling are some of the less obvious ways in which personal data are used. Many firms—both platform businesses and others—track consumers’ web usage, financial interactions, magazine subscriptions, political and charitable contributions, and much more to create highly detailed individual profiles. In the aggregate, such data can be used for cross-marketing to people who share profiles, as when a recommendation engine on a shopping site tells you, “People like you who bought product A often enjoy product B, too!” The anonymity of this process renders it unobjectionable to most people. But the same underlying data can be, and is, sold to prospective employers, government agencies, health care providers, and marketers of all kinds. Individually identifiable data about sensitive topics such as sexual orientation, prescription drug use, alcoholism, and personal travel (tracked through cell phone location data) can be purchased through data broker firms such as Acxiom.32 Consumer concern over the practices of the data broker industry has led to a number of investigations, including a major FTC inquiry that resulted in a report titled “Data Brokers: A Call for Transparency and Accountability.”33 But very little has actually changed to prevent practices that many find objectionable.34 Skeptics say that, in reality, citizen concerns about data privacy are superficial.


pages: 326 words: 103,170

The Seventh Sense: Power, Fortune, and Survival in the Age of Networks by Joshua Cooper Ramo

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Airbnb, Albert Einstein, algorithmic trading, barriers to entry, Berlin Wall, bitcoin, British Empire, cloud computing, crowdsourcing, Danny Hillis, defense in depth, Deng Xiaoping, drone strike, Edward Snowden, Fall of the Berlin Wall, Firefox, Google Chrome, income inequality, Isaac Newton, Jeff Bezos, job automation, market bubble, Menlo Park, Metcalfe’s law, natural language processing, Network effects, Norbert Wiener, Oculus Rift, packet switching, Paul Graham, price stability, quantitative easing, RAND corporation, recommendation engine, Republic of Letters, Richard Feynman, road to serfdom, Robert Metcalfe, Sand Hill Road, secular stagnation, self-driving car, Silicon Valley, Skype, Snapchat, social web, sovereign wealth fund, Steve Jobs, Steve Wozniak, Stewart Brand, Stuxnet, superintelligent machines, technological singularity, The Coming Technological Singularity, The Wealth of Nations by Adam Smith, too big to fail, Vernor Vinge, zero day

And then the machine would spit back some films you might enjoy. The Paul Newman classic Cool Hand Luke, for instance. And, well, you had liked that film. This seemed magic, just the sort of data-meets-human question that showcased a machine learning and thinking. An honestly artificial intelligence. Maes hoped to design a computer that could predict what movies or music or books you or I might enjoy. (And, of course, buy.) A recommendation engine. We all know how sputtering our own suggestion motors can be. Think of that primitive analog exchange known as the First Date: Oh, you like Radiohead? Do you know Sigur Rós? Pause. Hate them. Can you really predict what albums or novels even your closest friend will enjoy? You might offer an occasional lucky suggestion. But to confidently bridge your knowledge of a friend’s taste and the nearly endless library of movies and songs and books?


pages: 540 words: 103,101

Building Microservices by Sam Newman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

airport security, Amazon Web Services, anti-pattern, business process, call centre, continuous integration, create, read, update, delete, defense in depth, don't repeat yourself, Edward Snowden, fault tolerance, index card, information retrieval, Infrastructure as a Service, inventory management, job automation, load shedding, loose coupling, platform as a service, premature optimization, pull request, recommendation engine, social graph, software as a service, source of truth, the built environment, web application, WebSocket, x509 certificate

Then we want to try to understand what bounded contexts the monolith maps to. Let’s imagine that initially we identify four contexts we think our monolithic backend covers:
• Catalog: Everything to do with metadata about the items we offer for sale
• Finance: Reporting for accounts, payments, refunds, etc.
• Warehouse: Dispatching and returning of customer orders, managing inventory levels, etc.
• Recommendation: Our patent-pending, revolutionary recommendation system, which is highly complex code written by a team with more PhDs than the average science lab
The first thing to do is to create packages representing these contexts, and then move the existing code into them. With modern IDEs, code movement can be done automatically via refactorings, and can be done incrementally while we are doing other things. You’ll still need tests to catch any breakages made by moving code, however, especially if you’re using a dynamically typed language where the IDEs have a harder time of performing refactoring.

Security: MusicCorp has had a security audit, and has decided to tighten up its protection of sensitive information. Currently, all of this is handled by the finance-related code. If we split this service out, we can provide additional protections to this individual service in terms of monitoring, protection of data at transit, and protection of data at rest — ideas we’ll look at in more detail in Chapter 9.
Technology: The team looking after our recommendation system has been spiking out some new algorithms using a logic programming library in the language Clojure. The team thinks this could benefit our customers by improving what we offer them. If we could split out the recommendation code into a separate service, it would be easy to consider building an alternative implementation that we could test against.
Tangled Dependencies: The other point to consider when you’ve identified a couple of seams to separate is how entangled that code is with the rest of the system.
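One way to picture the "alternative implementation that we could test against" is a shared interface with two interchangeable implementations behind it. The sketch below is an assumption-laden illustration in Python, not the book's code and not Clojure: the `Recommender` protocol, both implementation classes, their return values, and `choose_recommender` are hypothetical names invented here.

```python
import zlib
from typing import List, Protocol


class Recommender(Protocol):
    """Hypothetical seam: anything the rest of the system calls for recommendations."""

    def recommend(self, customer_id: str, limit: int) -> List[str]: ...


class LegacyRecommender:
    """Stand-in for the existing recommendation code, split out of the monolith."""

    def recommend(self, customer_id: str, limit: int) -> List[str]:
        return ["best-seller-1", "best-seller-2"][:limit]


class ExperimentalRecommender:
    """Stand-in for a new algorithm being trialled against the legacy one."""

    def recommend(self, customer_id: str, limit: int) -> List[str]:
        return [f"pick-for-{customer_id}"][:limit]


def choose_recommender(customer_id: str, experiment_share: float) -> Recommender:
    """Route a stable fraction of customers to the experimental implementation."""
    # crc32 gives the same bucket (0-99) for the same customer on every call.
    bucket = zlib.crc32(customer_id.encode("utf-8")) % 100
    if bucket < experiment_share * 100:
        return ExperimentalRecommender()
    return LegacyRecommender()


engine = choose_recommender("customer-42", experiment_share=0.0)
print(engine.recommend("customer-42", limit=1))  # ['best-seller-1']
```

Because callers depend only on the interface, either implementation could sit behind its own service boundary without the rest of the system changing.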

The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences by Rob Kitchin

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Bayesian statistics, business intelligence, business process, cellular automata, Celtic Tiger, cloud computing, collateralized debt obligation, conceptual framework, congestion charging, corporate governance, correlation does not imply causation, crowdsourcing, discrete time, George Gilder, Google Earth, Infrastructure as a Service, Internet Archive, Internet of things, invisible hand, knowledge economy, late capitalism, lifelogging, linked data, Masdar, means of production, Nate Silver, natural language processing, openstreetmap, pattern recognition, platform as a service, recommendation engine, RFID, semantic web, sentiment analysis, slashdot, smart cities, Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia, smart grid, smart meter, software as a service, statistical model, supply-chain management, the scientific method, The Signal and the Noise by Nate Silver, transaction costs

Discovering correlations between certain items led to new product placements and alterations to shelf space management and a 16 per cent increase in revenue per shopping cart in the first month’s trial. There was no hypothesis that Product A was often bought with Product H that was then tested. The data were simply queried to discover what relationships existed that might have previously been unnoticed. Similarly, Amazon’s recommendation system produces suggestions for other items a shopper might be interested in without knowing anything about the culture and conventions of books and reading; it simply identifies patterns of purchasing across customers in order to determine whether, if Person A likes Book X, they are also likely to like Book Y given their own and others’ consumption patterns. Dyche’s contention is that this open, rather than directed, approach to discovery is more likely to reveal unknown, underlying patterns with respect to customer behaviours, product affinities, and financial risks, that can then be exploited.
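The "if Person A likes Book X, they are also likely to like Book Y" pattern can be illustrated with simple co-purchase counting. This is a minimal sketch under stated assumptions, not Amazon's actual algorithm: the purchase data is invented, and `co_purchase_scores` is a hypothetical helper that estimates how often buyers of one book also bought another.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical purchase histories: customer -> set of books bought.
PURCHASES = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"X"},
    "D": {"Z"},
}


def co_purchase_scores(purchases):
    """Estimate P(bought `other` | bought `item`) from raw co-occurrence counts."""
    item_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for basket in purchases.values():
        for item in basket:
            item_counts[item] += 1
        # Count every ordered pair of distinct books bought by the same customer.
        for item, other in permutations(basket, 2):
            pair_counts[(item, other)] += 1
    return {pair: n / item_counts[pair[0]] for pair, n in pair_counts.items()}


scores = co_purchase_scores(PURCHASES)
print(scores[("Y", "X")])  # 1.0: every buyer of Y also bought X
print(scores[("Z", "X")])  # 0.5: half of Z's buyers also bought X
```

Nothing in the calculation knows what a book is; it only sees which items appear together, which matches the excerpt's point about pattern discovery without domain knowledge.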

In fact, both deductive and inductive reasoning are always discursively framed and do not arise out of nowhere. Popper (1979, cited in Callebaut 2012: 74) thus suggests that all science adopts a searchlight approach to scientific discovery, with the focus of light guided by previous findings, theories and training; by speculation that is grounded in experience and knowledge. The same is true for Amazon, Hunch, Ayasdi, and Google. How Amazon constructed its recommendation system was based on scientific reasoning, underpinned by a guiding model and accompanied by empirical testing designed to improve the performance of the algorithms it uses. Likewise, Google undertakes extensive research and development, it works in partnership with scientists and it buys scientific knowledge, either funding research within universities or by buying the IP of other companies, to refine and extend the utility of how it organises, presents and extracts value from data.


pages: 493 words: 139,845

Women Leaders at Work: Untold Tales of Women Achieving Their Ambitions by Elizabeth Ghaffari

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Albert Einstein, AltaVista, business process, cloud computing, Columbine, corporate governance, corporate social responsibility, dark matter, family office, Fellow of the Royal Society, financial independence, follow your passion, glass ceiling, Grace Hopper, high net worth, knowledge worker, Long Term Capital Management, performance metric, pink-collar, profit maximization, profit motive, recommendation engine, Ronald Reagan, shareholder value, Silicon Valley, Silicon Valley startup, Steve Ballmer, Steve Jobs, thinkpad, trickle-down economics, urban planning, women in the workforce, young professional

You begin to see both similarities and huge cultural differences. Kate just came back from rural India, studying the ways people there use technology. These are intriguing issues. I find it especially interesting to bring such people together with more mathematical people like me. I have worked on models of social networks and recommendation systems that exist in social networks. When I talk to danah, I'm trying to understand what people are seeking through recommendation systems. When you merge qualitative and quantitative skill sets, it takes a while for each to adapt to the other because there are language barriers and differences in what we're trying to achieve. When we finally do achieve something jointly, I find that it's usually very good and very deep. __________ 3 The lower case spelling of danah boyd is “how she chooses to identify” herself.


pages: 170 words: 51,205

Information Doesn't Want to Be Free: Laws for the Internet Age by Cory Doctorow, Amanda Palmer, Neil Gaiman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Airbnb, barriers to entry, Brewster Kahle, cloud computing, Dean Kamen, Edward Snowden, game design, Internet Archive, John von Neumann, Kickstarter, optical character recognition, Plutocrats, plutocrats, pre–internet, profit maximization, recommendation engine, rent-seeking, Saturday Night Live, Skype, Steve Jobs, Steve Wozniak, Stewart Brand, transfer pricing, Whole Earth Catalog, winner-take-all economy

But all these sectors are in sharp decline, and in many cases the most significant channel for creative work is now the Internet. Customers don’t necessarily deliver themselves to “stores”—virtual or physical—and when they do, the titles on offer are rarely the neatly curated, finite, and browsable selections that once dominated. The shelves, instead, are nearly infinite. Browsing has been augmented by search algorithms and automated recommendation systems. And the number of ways for customers to discover new work has exploded. Word of mouth has always been a creator’s best friend. Recommendations from personally trusted sources were a surefire way to sell products. When I worked in a bookstore, one of the most reliable indicators of an imminent sale was two friends entering the store together, and one of them picking up a book and handing it to the other with the words “Oh, you’ve got to read this; you’ll love it.”


pages: 606 words: 157,120

To Save Everything, Click Here: The Folly of Technological Solutionism by Evgeny Morozov

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

3D printing, algorithmic trading, Amazon Mechanical Turk, Andrew Keen, augmented reality, Automated Insights, Berlin Wall, big data - Walmart - Pop Tarts, Buckminster Fuller, call centre, carbon footprint, Cass Sunstein, choice architecture, citizen journalism, cloud computing, cognitive bias, creative destruction, crowdsourcing, data acquisition, Dava Sobel, disintermediation, East Village, en.wikipedia.org, Fall of the Berlin Wall, Filter Bubble, Firefox, Francis Fukuyama: the end of history, frictionless, future of journalism, game design, Gary Taubes, Google Glasses, illegal immigration, income inequality, invention of the printing press, Jane Jacobs, Jean Tirole, Jeff Bezos, jimmy wales, Julian Assange, Kevin Kelly, Kickstarter, license plate recognition, lifelogging, lone genius, Louis Pasteur, Mark Zuckerberg, market fundamentalism, Marshall McLuhan, moral panic, Narrative Science, Nicholas Carr, packet switching, PageRank, Parag Khanna, Paul Graham, peer-to-peer, Peter Singer: altruism, Peter Thiel, pets.com, placebo effect, pre–internet, Ray Kurzweil, recommendation engine, Richard Thaler, Ronald Coase, Rosa Parks, self-driving car, Silicon Valley, Silicon Valley ideology, Silicon Valley startup, Skype, Slavoj Žižek, smart meter, social graph, social web, stakhanovite, Steve Jobs, Steven Levy, Stuxnet, technoutopianism, the built environment, The Chicago School, The Death and Life of Great American Cities, the medium is the message, The Nature of the Firm, the scientific method, The Wisdom of Crowds, Thomas Kuhn: the structure of scientific revolutions, Thomas L Friedman, transaction costs, urban decay, urban planning, urban sprawl, Vannevar Bush, WikiLeaks

., do you think the government has a role to play in education?). Ruck.us then calculates your “political DNA” in order to match you with similar users and encourage you to join relevant “rucks” (according to the site, “the word comes from rugby, where players form a ruck when they loosely come together to fight the other team for possession of the ball.”). Ruck.us is like Netflix for politics, with its cause-recommendation engine essentially encouraging you to, say, check out a campaign to ban abortion if you have expressed strong opposition to gun control, much in the way that Netflix would recommend that you check out Rambo if you liked Rocky. Once in a “ruck,” members can simply follow news posted by other members or be more proactive and share information themselves: links to relevant petitions, organizations, and events are particularly encouraged.


pages: 382 words: 120,064

Bank 3.0: Why Banking Is No Longer Somewhere You Go but Something You Do by Brett King

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

3D printing, additive manufacturing, Airbus A320, Albert Einstein, Amazon Web Services, Any sufficiently advanced technology is indistinguishable from magic, asset-backed security, augmented reality, barriers to entry, bitcoin, bounce rate, business intelligence, business process, business process outsourcing, call centre, capital controls, citizen journalism, Clayton Christensen, cloud computing, credit crunch, crowdsourcing, disintermediation, en.wikipedia.org, fixed income, George Gilder, Google Glasses, high net worth, I think there is a world market for maybe five computers, Infrastructure as a Service, invention of the printing press, Jeff Bezos, jimmy wales, London Interbank Offered Rate, M-Pesa, Mark Zuckerberg, mass affluent, Metcalfe’s law, microcredit, mobile money, more computing power than Apollo, Northern Rock, Occupy movement, optical character recognition, peer-to-peer, performance metric, Pingit, platform as a service, QR code, QWERTY keyboard, Ray Kurzweil, recommendation engine, RFID, risk tolerance, Robert Metcalfe, self-driving car, Skype, speech recognition, stem cell, telepresence, Tim Cook: Apple, transaction costs, underbanked, US Airways Flight 1549, web application

In Siri’s patent application, various possibilities are hinted at, including being a voice agent providing assistance for “automated teller machines”.4 In fact, SRI (the creator of Siri™) and BBVA recently announced a collaboration to introduce Lola5, a Siri-like technology, to customers through the Internet and via voice. Siri’s near-term capabilities include:
1. Being able to make simple online purchases, such as “Purchase Bank 3.0 from Amazon Kindle”
2. Serving as a recommendation engine or intelligent automated assistant—an “agent avatar”, as it has sometimes been labelled
However, there are some challenges in having customers talk into their phones for customer support, or replacing an IVR system with technologies such as Lola, as a recent New York Times article pointed out when it called Siri “the latest public nuisance in the cell phone revolution”. It outlined several scenarios of people using Siri in less than desirable situations (e.g. public transportation) for things as mundane as sending an SMS message wishing a friend a happy birthday.


pages: 527 words: 147,690

Terms of Service: Social Media and the Price of Constant Connection by Jacob Silverman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

23andMe, 4chan, A Declaration of the Independence of Cyberspace, Airbnb, airport security, Amazon Mechanical Turk, augmented reality, basic income, Brian Krebs, California gold rush, call centre, cloud computing, cognitive dissonance, commoditize, correlation does not imply causation, Credit Default Swap, crowdsourcing, don't be evil, drone strike, Edward Snowden, feminist movement, Filter Bubble, Firefox, Flash crash, game design, global village, Google Chrome, Google Glasses, hive mind, income inequality, informal economy, information retrieval, Internet of things, Jaron Lanier, jimmy wales, Kevin Kelly, Kickstarter, knowledge economy, knowledge worker, late capitalism, license plate recognition, life extension, lifelogging, Lyft, Mark Zuckerberg, Mars Rover, Marshall McLuhan, mass incarceration, meta analysis, meta-analysis, Minecraft, move fast and break things, national security letter, Network effects, new economy, Nicholas Carr, Occupy movement, optical character recognition, payday loans, Peter Thiel, postindustrial economy, prediction markets, pre–internet, price discrimination, price stability, profit motive, quantitative hedge fund, race to the bottom, Ray Kurzweil, recommendation engine, rent control, RFID, ride hailing / ride sharing, self-driving car, sentiment analysis, shareholder value, sharing economy, Silicon Valley, Silicon Valley ideology, Snapchat, social graph, social web, sorting algorithm, Steve Ballmer, Steve Jobs, Steven Levy, TaskRabbit, technoutopianism, telemarketer, transportation-network company, Turing test, Uber and Lyft, Uber for X, universal basic income, unpaid internship, women in the workforce, Y Combinator, Zipcar

Negative reviews proliferate as acts of revenge against scorned rivals or as ways to push one’s own rating ahead of a competitor. Even so, companies remain extraordinarily reliant on these reviews. A 2011 Harvard Business School study found that, on Yelp, “an extra star is worth an extra 5 to 9 percent in revenue.” The result of all this reviewing has been the atrophying of the critical culture, with professional critics seen as dispensable, nothing more than recommendation engines who can be replaced with algorithms and free, crowdsourced reviews. (Even so, some prominent cultural critics remain, though with less influence than they used to hold, and a smattering of publications, from the actuarially precise Consumer Reports to the liberal humanist New York Review of Books, continue to thrive.) It’s also expanded the idea of what should be reviewed, with everything now potentially susceptible to, if not a star rating, then the kind of up-or-down judgment we perform all the time when we choose to like things.


pages: 236 words: 77,098

I Live in the Future & Here's How It Works: Why Your World, Work, and Brain Are Being Creatively Disrupted by Nick Bilton

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

3D printing, 4chan, Albert Einstein, augmented reality, barriers to entry, book scanning, Cass Sunstein, death of newspapers, en.wikipedia.org, Internet of things, John Gruber, John Markoff, Marshall McLuhan, Nicholas Carr, QR code, recommendation engine, RFID, Saturday Night Live, Steve Jobs, Steven Pinker, Stewart Brand

In short, I base my choice on the overall experience and what I want at that particular time. Here are three different ways people, especially young ones, may evaluate whether something is worth purchasing. Bad = Free My friend Mike loves music. In fact, Mike is a music fanatic. In every spare moment he has, Mike scours the Web and his social networks, searching for new music to listen to and potentially purchase. Like most of his friends, Mike uses his recommendation systems and social networks to find the music he’s interested in. He’ll preview a few songs, and if he decides the content is good, he’ll follow through with a purchase. He rarely buys entire albums because he believes most albums contain only one or two good songs. Mike also follows a handful of bands and immediately buys their entire albums on release day. But Mike steals music, too. He doesn’t steal music because he can’t afford it or to take a stand against media moguls and corporations, and he definitely doesn’t do it for the thrill.


pages: 204 words: 67,922

Elsewhere, U.S.A: How We Got From the Company Man, Family Dinners, and the Affluent Society to the Home Office, BlackBerry Moms, and Economic Anxiety by Dalton Conley

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

3D printing, assortative mating, call centre, clean water, commoditize, dematerialisation, demographic transition, Edward Glaeser, extreme commuting, feminist movement, financial independence, Firefox, Frank Levy and Richard Murnane: The New Division of Labor, Home mortgage interest deduction, income inequality, informal economy, Jane Jacobs, John Maynard Keynes: Economic Possibilities for our Grandchildren, knowledge economy, knowledge worker, labor-force participation, late capitalism, low skilled workers, manufacturing employment, mass immigration, McMansion, mortgage tax deduction, new economy, off grid, oil shock, PageRank, Ponzi scheme, positional goods, post-industrial society, Post-materialism, post-materialism, principal–agent problem, recommendation engine, Richard Florida, rolodex, Ronald Reagan, Silicon Valley, Skype, statistical model, The Death and Life of Great American Cities, The Great Moderation, The Wealth of Nations by Adam Smith, Thomas Malthus, Thorstein Veblen, transaction costs, women in the workforce, Yom Kippur War

Not only would my local video store not have been able to afford the shelf space to stock Ring of Bright Water, but the issue more germane to the present discussion is that I would have never even known to ask for it. In fact, short of some chance encounter of a recommendation at a dinner party, I would have never even known that this 1969 British film existed. The fact that I now know it exists can be attributed to the network basis of the Netflix recommendation system. The connected economy, then, does not merely facilitate sameness and the diffusion of hits. It can encourage niche consumption (as Chris Anderson celebrates in The Long Tail). But as wonderful as it is to have a computer recommend a sleeper film that even the slacker clerks at my neighborhood video store wouldn’t be able to name, there is a subtle cost to this form of knowledge diffusion.


pages: 411 words: 80,925

What's Mine Is Yours: How Collaborative Consumption Is Changing the Way We Live by Rachel Botsman, Roo Rogers

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Airbnb, barriers to entry, Bernie Madoff, bike sharing scheme, Buckminster Fuller, carbon footprint, Cass Sunstein, collaborative consumption, collaborative economy, commoditize, Community Supported Agriculture, credit crunch, crowdsourcing, dematerialisation, disintermediation, en.wikipedia.org, experimental economics, George Akerlof, global village, Hugh Fearnley-Whittingstall, information retrieval, iterative process, Kevin Kelly, Kickstarter, late fees, Mark Zuckerberg, market design, Menlo Park, Network effects, new economy, new new economy, out of africa, Parkinson's law, peer-to-peer, peer-to-peer lending, peer-to-peer rental, Ponzi scheme, pre–internet, recommendation engine, RFID, Richard Stallman, ride hailing / ride sharing, Robert Shiller, Ronald Coase, Search for Extraterrestrial Intelligence, SETI@home, Simon Kuznets, Skype, slashdot, smart grid, South of Market, San Francisco, Stewart Brand, The Nature of the Firm, The Spirit Level, The Wealth of Nations by Adam Smith, The Wisdom of Crowds, Thorstein Veblen, Torches of Freedom, transaction costs, traveling salesman, ultimatum game, Victor Gruen, web of trust, women in the workforce, Zipcar

Collective Wisdom of Members At the same time, Netflix has built a sophisticated platform to foster a community among members, and to tailor recommendations to individual tastes. Talk to anyone who has ever used Netflix and they will tell you about how they “discovered releases,” “learned about classics,” and “found rare gems” they never would have found on their own at a store. Approximately 60 percent of members base their selections on Netflix’s Cinematch recommendations system. Early on, people’s willingness to share and rate the films they had watched and to make suggestions to “friends” surprised the founders. The user community itself adopted the ethos of “Millions of members helping you.” Impressively, there are now more than 2 billion ratings from members, and the average member has evaluated approximately two hundred movies. The result is an invaluable collective wisdom impossible to replicate elsewhere.


pages: 247 words: 71,698

Avogadro Corp by William Hertling

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Any sufficiently advanced technology is indistinguishable from magic, cloud computing, crowdsourcing, Hacker Ethic, hive mind, invisible hand, natural language processing, Netflix Prize, private military company, Ray Kurzweil, recommendation engine, Richard Stallman, Ruby on Rails, technological singularity, Turing test, web application

If there was one thing that drove Mike crazy about David, it was his tendency to become uncommunicative exactly when the stakes were highest. Another minute passed, and Mike started to mentally squirm. “I wish I could find something,” he finally said, “but I don’t know what. There’s this brilliant self-taught Serbian kid who is doing some stuff with artificial intelligence algorithms, and he’s doing it all on his home PC. I’ve been reading his blog, and it sounds like he has some really novel approaches to recommendation systems. But I don’t see any way we could duplicate what he’s doing before the end of the week.” Mike was really grasping at straws. Thin straws at that. He hated to bring bad news to David. “Maybe we can turn down the accuracy of the system. If we use fewer language-goal clusters, we can run with less memory and fewer processor cycles. Maybe...” “No, don’t do that.” David’s soft voice floated up out of the dim light, startling Mike.


pages: 270 words: 64,235

Effective Programming: More Than Writing Code by Jeff Atwood

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

AltaVista, Amazon Web Services, barriers to entry, cloud computing, endowment effect, Firefox, future of work, game design, Google Chrome, gravity well, job satisfaction, Khan Academy, Kickstarter, loss aversion, Marc Andreessen, Mark Zuckerberg, Merlin Mann, Minecraft, Paul Buchheit, Paul Graham, price anchoring, race to the bottom, recommendation engine, science of happiness, Skype, social software, Steve Jobs, web application, Y Combinator, zero-sum game

AWS is, of course, the preeminent provider of so-called “cloud computing,” so this can essentially be read as key advice for any website considering a move to the cloud. And it’s great advice, too. Here’s the one bit that struck me as most essential: We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends. If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine. One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture.
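The degrade-but-still-respond behavior described in this excerpt can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Netflix's implementation: `personalized_picks`, `homepage_rows`, `chaos_monkey`, `RecommendationsDown`, and the `POPULAR_TITLES` list are all hypothetical names invented for the example, and the `healthy` flag stands in for a real remote call succeeding or failing.

```python
import random

# Hypothetical fallback list shown when personalization is unavailable.
POPULAR_TITLES = ["Title A", "Title B", "Title C"]


class RecommendationsDown(Exception):
    """Raised when the (hypothetical) recommendations service is unavailable."""


def personalized_picks(user_id, healthy=True):
    # Stand-in for a call to a remote recommendations service.
    if not healthy:
        raise RecommendationsDown("recommendations service unavailable")
    return [f"Pick {i} for {user_id}" for i in range(1, 4)]


def homepage_rows(user_id, healthy=True):
    """Always respond: fall back to popular titles if personalization fails."""
    try:
        titles = personalized_picks(user_id, healthy)
        return {"source": "personalized", "titles": titles}
    except RecommendationsDown:
        return {"source": "popular", "titles": list(POPULAR_TITLES)}


def chaos_monkey(instances, rng):
    """Kill one randomly chosen instance; return (victim, survivors)."""
    victim = rng.choice(instances)
    survivors = [name for name in instances if name != victim]
    return victim, survivors


if __name__ == "__main__":
    print(homepage_rows("user-1", healthy=False)["source"])
    print(chaos_monkey(["api-1", "api-2", "api-3"], random.Random()))
```

The caller never sees the failure; it only sees a lower-quality response, which is the property the random instance killing is meant to exercise.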

Big Data at Work: Dispelling the Myths, Uncovering the Opportunities by Thomas H. Davenport

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Automated Insights, autonomous vehicles, bioinformatics, business intelligence, business process, call centre, chief data officer, cloud computing, commoditize, data acquisition, Edward Snowden, Erik Brynjolfsson, intermodal, Internet of things, Jeff Bezos, knowledge worker, lifelogging, Mark Zuckerberg, move fast and break things, Narrative Science, natural language processing, Netflix Prize, New Journalism, recommendation engine, RFID, self-driving car, sentiment analysis, Silicon Valley, smart grid, smart meter, social graph, sorting algorithm, statistical model, Tesla Model S, text mining

LinkedIn also employs big data for internal processes, including sales and marketing campaigns. For instance, LinkedIn has used some of its own internal data to predict which companies will buy LinkedIn products, and even who in those firms has the highest likelihood of buying. This work led to an internal recommendation system for salespeople that makes it much easier for them to get the data in one place, and has improved conversion rates by several hundred percent. LinkedIn’s cofounder, Reid Hoffman, is a strong advocate for big data: Because of Web 2.0 [the explosion of social networks and consumer participation in the web] and the increasing number of sensors, there’s all this data. With these massive amounts of highly semantically indexed data that’s indexed around people and places and all the things that matter to us and our lives, I believe there are going to be a ton of interesting apps that come out of that . . . the way our products and services are constituted, how we determine our strategy and maintain a competitive edge against other folks: if data is a very strong element of each of these, and you’re not doing anything, it’s like trying to run a business without business intelligence.

Writing Effective Use Cases by Alistair Cockburn

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

business process, c2.com, create, read, update, delete, finite state, index card, information retrieval, iterative process, recommendation engine, Silicon Valley, web application

System presents a list of saved solutions for this Shopper
26c2. Shopper selects the solution they wish to recall
26c3. System recalls the selected solution.
26c4. Continue at step 26
26d. Shopper wants to finance products in the shopping cart with available Finance Plans:
26d1. Shopper chooses to finance products in the shopping cart
26d2. System will present a series of questions that are dependent on previous answers to determine finance plan recommendations. System interfaces with Finance System to obtain credit rating approval. Initiate Obtain Finance Rating.
26d3. Shopper will select a finance plan
26d4. System will present a series of questions based on previous answers to determine details of the selected finance plan.
26d5. Shopper will view financial plan details and chooses to go with the plan.
26d6. System will place the finance plan order with the Finance System initiate Place Finance order.


pages: 903 words: 235,753

The Stack: On Software and Sovereignty by Benjamin H. Bratton

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

1960s counterculture, 3D printing, 4chan, Ada Lovelace, additive manufacturing, airport security, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, algorithmic trading, Amazon Mechanical Turk, Amazon Web Services, augmented reality, autonomous vehicles, basic income, Benevolent Dictator For Life (BDFL), Berlin Wall, bioinformatics, bitcoin, blockchain, Buckminster Fuller, Burning Man, call centre, carbon footprint, carbon-based life, Cass Sunstein, Celebration, Florida, charter city, clean water, cloud computing, connected car, corporate governance, crowdsourcing, cryptocurrency, dark matter, David Graeber, deglobalization, dematerialisation, disintermediation, distributed generation, don't be evil, Douglas Engelbart, Edward Snowden, Elon Musk, en.wikipedia.org, Eratosthenes, ethereum blockchain, facts on the ground, Flash crash, Frank Gehry, Frederick Winslow Taylor, future of work, Georg Cantor, gig economy, global supply chain, Google Earth, Google Glasses, Guggenheim Bilbao, High speed trading, Hyperloop, illegal immigration, industrial robot, information retrieval, Intergovernmental Panel on Climate Change (IPCC), intermodal, Internet of things, invisible hand, Jacob Appelbaum, Jaron Lanier, John Markoff, Jony Ive, Julian Assange, Khan Academy, liberal capitalism, lifelogging, linked data, Mark Zuckerberg, market fundamentalism, Marshall McLuhan, Masdar, McMansion, means of production, megacity, megastructure, Menlo Park, Minecraft, Monroe Doctrine, Network effects, new economy, offshore financial centre, oil shale / tar sands, packet switching, PageRank, pattern recognition, peak oil, peer-to-peer, performance metric, personalized medicine, Peter Eisenman, Peter Thiel, phenotype, Philip Mirowski, Pierre-Simon Laplace, place-making, planetary scale, RAND corporation, recommendation engine, reserve currency, RFID, Robert Bork, Sand Hill Road, self-driving car, semantic web, sharing economy, Silicon Valley, Silicon Valley ideology, Slavoj Žižek, smart cities, smart grid, smart meter, social graph, software studies, South China Sea, sovereign wealth fund, special economic zone, spectrum auction, Startup school, statistical arbitrage, Steve Jobs, Steven Levy, Stewart Brand, Stuxnet, Superbowl ad, supply-chain management, supply-chain management software, TaskRabbit, the built environment, The Chicago School, the scientific method, Torches of Freedom, transaction costs, Turing complete, Turing machine, Turing test, universal basic income, urban planning, Vernor Vinge, Washington Consensus, web application, Westphalian system, WikiLeaks, working poor, Y Combinator

In that the service is provided to a device-User that is in motion, moving through the City layer and encountering different contexts on the go, the App platform provides that provisional link between a preexisting physical spatial context and this User-directed overlay of a Cloud service onto immediate circumstances. As discussed in the City layer chapter, there is then a kind of programmatic blending between the urban situation through which a User moves and the interactions he may be having with a specific App and Cloud service. A mall becomes a game board, a sidewalk becomes a banking center, a restaurant becomes the scene of a crime in a crowd-sourced recommendation engine, birds are angry and enemies are identified, and the experience of these may be very different for different people and purposes. At any given moment, multiple Users interacting with different Apps in the same place may have brought their shared location into contrasting Cloud dramas; one may be ensconced in a first-person shooter game and the other in measuring his carbon footprint, further fragmenting any apparent solidarity of the crowd.


pages: 669 words: 210,153

Tools of Titans: The Tactics, Routines, and Habits of Billionaires, Icons, and World-Class Performers by Timothy Ferriss

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Airbnb, Alexander Shulgin, artificial general intelligence, asset allocation, Atul Gawande, augmented reality, back-to-the-land, Bernie Madoff, Bertrand Russell: In Praise of Idleness, Black Swan, blue-collar work, Buckminster Fuller, business process, Cal Newport, call centre, Checklist Manifesto, cognitive bias, cognitive dissonance, Colonization of Mars, Columbine, commoditize, correlation does not imply causation, David Brooks, David Graeber, diversification, diversified portfolio, Donald Trump, effective altruism, Elon Musk, fault tolerance, fear of failure, Firefox, follow your passion, future of work, Google X / Alphabet X, Howard Zinn, Hugh Fearnley-Whittingstall, Jeff Bezos, job satisfaction, Johann Wolfgang von Goethe, John Markoff, Kevin Kelly, Kickstarter, Lao Tzu, life extension, lifelogging, Mahatma Gandhi, Marc Andreessen, Mark Zuckerberg, Mason jar, Menlo Park, Mikhail Gorbachev, Nicholas Carr, optical character recognition, PageRank, passive income, pattern recognition, Paul Graham, peer-to-peer, Peter H. Diamandis: Planetary Resources, Peter Singer: altruism, Peter Thiel, phenotype, PIHKAL and TIHKAL, post scarcity, premature optimization, QWERTY keyboard, Ralph Waldo Emerson, Ray Kurzweil, recommendation engine, rent-seeking, Richard Feynman, risk tolerance, Ronald Reagan, selection bias, sharing economy, side project, Silicon Valley, skunkworks, Skype, Snapchat, social graph, software as a service, software is eating the world, stem cell, Stephen Hawking, Steve Jobs, Stewart Brand, superintelligent machines, Tesla Model S, The Wisdom of Crowds, Thomas L Friedman, Wall-E, Washington Consensus, Whole Earth Catalog, Y Combinator, zero-sum game

Chris Anderson (my successor at Wired) named this effect “the Long Tail,” for the visually graphed shape of the sales distribution curve: a low, nearly interminable line of items selling only a few copies per year that form a long “tail” for the abrupt vertical beast of a few bestsellers. But the area of the tail was as big as the head. With that insight, the aggregators had great incentive to encourage audiences to click on the obscure items. They invented recommendation engines and other algorithms to channel attention to the rare creations in the long tail. Even web search companies like Google, Bing, and Baidu found it in their interests to reward searchers with the obscure because they could sell ads in the long tail as well. The result was that the most obscure became less obscure. If you live in any of the 2 million small towns on Earth, you might be the only one in your town to crave death metal music, or get turned on by whispering, or want a left-handed fishing reel.


pages: 757 words: 193,541

The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2 by Thomas A. Limoncelli, Strata R. Chalup, Christina J. Hogan

active measures, Amazon Web Services, anti-pattern, barriers to entry, business process, cloud computing, commoditize, continuous integration, correlation coefficient, database schema, Debian, defense in depth, delayed gratification, DevOps, domain-specific language, en.wikipedia.org, fault tolerance, finite state, Firefox, Google Glasses, information asymmetry, Infrastructure as a Service, intermodal, Internet of things, job automation, job satisfaction, load shedding, loose coupling, Malcom McLean invented shipping containers, Marc Andreessen, place-making, platform as a service, premature optimization, recommendation engine, revision control, risk tolerance, side project, Silicon Valley, software as a service, sorting algorithm, statistical model, Steven Levy, supply-chain management, Toyota Production System, web application, Yogi Berra

It could not be installed by users because the framework does not permit Python libraries that include portions written in compiled languages. PaaS provides many high-level services including storage services, database services, and many of the same services available in IaaS offerings. Some offer more esoteric services such as Google’s Machine Learning service, which can be used to build a recommendation engine. Additional services are announced periodically. 3.1.3 Software as a Service SaaS is what we used to call a web site before the marketing department decided adding “as a service” made it more appealing. SaaS is a web-accessible application. The application is the service, and you interact with it as you would any web site. The provider handles all the details of hardware, operating system, and platform.


pages: 476 words: 132,042

What Technology Wants by Kevin Kelly

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Albert Einstein, Alfred Russel Wallace, Buckminster Fuller, c2.com, carbon-based life, Cass Sunstein, charter city, Clayton Christensen, cloud computing, computer vision, Danny Hillis, dematerialisation, demographic transition, double entry bookkeeping, Douglas Engelbart, en.wikipedia.org, Exxon Valdez, George Gilder, gravity well, hive mind, Howard Rheingold, interchangeable parts, invention of air conditioning, invention of writing, Isaac Newton, Jaron Lanier, John Conway, John Markoff, John von Neumann, Kevin Kelly, knowledge economy, Lao Tzu, life extension, Louis Daguerre, Marshall McLuhan, megacity, meta analysis, meta-analysis, new economy, off grid, out of africa, performance metric, personalized medicine, phenotype, Picturephone, planetary scale, RAND corporation, random walk, Ray Kurzweil, recommendation engine, refrigerator car, Richard Florida, Rubik’s Cube, Silicon Valley, silicon-based life, Skype, speech recognition, Stephen Hawking, Steve Jobs, Stewart Brand, Ted Kaczynski, the built environment, the scientific method, Thomas Malthus, Vernor Vinge, wealth creators, Whole Earth Catalog, Y2K

It is true that too many choices may induce regret, but “no choice” is a far worse option. Civilization is a steady migration away from “no choice.” As always, the solution to the problems that technology brings, such as an overwhelming diversity of choices, is better technologies. The solution to ultradiversity will be choice-assist technologies. These better tools will aid humans in making choices among bewildering options. That is what search engines, recommendation systems, tagging, and a lot of social media are all about. Diversity, in fact, will produce tools to handle diversity. (Diversity-taming tools will be among the wildly diversity-making 821 million patents that current rates predict will have been filed in the U.S. Patent Office by 2060!) We are already discovering how to use computers to augment our choices with information and web pages (Google is one such tool), but it will take additional learning and technologies to do this with tangible stuff and idiosyncratic media.


pages: 574 words: 164,509

Superintelligence: Paths, Dangers, Strategies by Nick Bostrom

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

agricultural Revolution, AI winter, Albert Einstein, algorithmic trading, anthropic principle, anti-communist, artificial general intelligence, autonomous vehicles, barriers to entry, Bayesian statistics, bioinformatics, brain emulation, cloud computing, combinatorial explosion, computer vision, cosmological constant, dark matter, DARPA: Urban Challenge, data acquisition, delayed gratification, demographic transition, Donald Knuth, Douglas Hofstadter, Drosophila, Elon Musk, en.wikipedia.org, endogenous growth, epigenetics, fear of failure, Flash crash, Flynn Effect, friendly AI, Gödel, Escher, Bach, income inequality, industrial robot, informal economy, information retrieval, interchangeable parts, iterative process, job automation, John Markoff, John von Neumann, knowledge worker, Menlo Park, meta analysis, meta-analysis, mutually assured destruction, Nash equilibrium, Netflix Prize, new economy, Norbert Wiener, NP-complete, nuclear winter, optical character recognition, pattern recognition, performance metric, phenotype, prediction markets, price stability, principal–agent problem, race to the bottom, random walk, Ray Kurzweil, recommendation engine, reversible computing, social graph, speech recognition, Stanislav Petrov, statistical model, stem cell, Stephen Hawking, strong AI, superintelligent machines, supervolcano, technological singularity, technoutopianism, The Coming Technological Singularity, The Nature of the Firm, Thomas Kuhn: the structure of scientific revolutions, transaction costs, Turing machine, Vernor Vinge, Watson beat the top human players on Jeopardy!, World Values Survey, zero-sum game

Then the entire system was overthrown by the heliocentric theory of Copernicus, which was simpler and—though only after further elaboration by Kepler—more predictively accurate.63 Artificial intelligence methods are now used in more areas than it would make sense to review here, but mentioning a sampling of them will give an idea of the breadth of applications. Aside from the game AIs listed in Table 1, there are hearing aids with algorithms that filter out ambient noise; route-finders that display maps and offer navigation advice to drivers; recommender systems that suggest books and music albums based on a user’s previous purchases and ratings; and medical decision support systems that help doctors diagnose breast cancer, recommend treatment plans, and aid in the interpretation of electrocardiograms. There are robotic pets and cleaning robots, lawn-mowing robots, rescue robots, surgical robots, and over a million industrial robots.64 The world population of robots exceeds 10 million.65 Modern speech recognition, based on statistical techniques such as hidden Markov models, has become sufficiently accurate for practical use (some fragments of this book were drafted with the help of a speech recognition program).


pages: 598 words: 134,339

Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World by Bruce Schneier

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

23andMe, Airbnb, airport security, AltaVista, Anne Wojcicki, augmented reality, Benjamin Mako Hill, Black Swan, Brewster Kahle, Brian Krebs, call centre, Cass Sunstein, Chelsea Manning, citizen journalism, cloud computing, congestion charging, disintermediation, drone strike, Edward Snowden, experimental subject, failed state, fault tolerance, Ferguson, Missouri, Filter Bubble, Firefox, friendly fire, Google Chrome, Google Glasses, hindsight bias, informal economy, Internet Archive, Internet of things, Jacob Appelbaum, Jaron Lanier, John Markoff, Julian Assange, Kevin Kelly, license plate recognition, lifelogging, linked data, Lyft, Mark Zuckerberg, moral panic, Nash equilibrium, Nate Silver, national security letter, Network effects, Occupy movement, payday loans, pre–internet, price discrimination, profit motive, race to the bottom, RAND corporation, recommendation engine, RFID, self-driving car, Shoshana Zuboff, Silicon Valley, Skype, smart cities, smart grid, Snapchat, social graph, software as a service, South China Sea, stealth mode startup, Steven Levy, Stuxnet, TaskRabbit, telemarketer, Tim Cook: Apple, transaction costs, Uber and Lyft, urban planning, WikiLeaks, zero day

The idea was that it would be useful for researchers; to protect people’s identity, they replaced names with numbers. So, for example, Bruce Schneier might be 608429. They were surprised when researchers were able to attach names to numbers by correlating different items in individuals’ search history. In 2008, Netflix published 10 million movie rankings by 500,000 anonymized customers, as part of a challenge for people to come up with better recommendation systems than the one the company was using at that time. Researchers were able to de-anonymize people by comparing rankings and time stamps with public rankings and time stamps in the Internet Movie Database. These might seem like special cases, but correlation opportunities pop up more frequently than you might think. Someone with access to an anonymous data set of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchant’s telephone order database.
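The correlation step described in this excerpt, matching anonymized rankings and time stamps against public ones, can be illustrated with a toy linkage sketch. Everything here is an assumption made for the example: the data are invented, `link` is a hypothetical helper, and exact matching on (movie, rating, date) is a simplification; a real attack would likely need to tolerate small differences in ratings and dates.

```python
from collections import defaultdict

# Toy "anonymized" release: (pseudonym, movie, rating, date). Invented values.
anonymized = [
    (608429, "Movie A", 5, "2005-03-01"),
    (608429, "Movie B", 2, "2005-03-04"),
    (608429, "Movie C", 4, "2005-03-09"),
    (112233, "Movie A", 3, "2005-06-11"),
    (112233, "Movie D", 5, "2005-06-12"),
]

# Toy public reviews posted under real names. Invented values.
public = [
    ("alice", "Movie A", 5, "2005-03-01"),
    ("alice", "Movie C", 4, "2005-03-09"),
    ("bob", "Movie D", 5, "2005-06-12"),
]


def link(anonymized, public, min_matches=2):
    """Map pseudonyms to names that share at least min_matches identical records."""
    # Index the anonymized records by their observable attributes.
    index = defaultdict(set)
    for pseudonym, movie, rating, date in anonymized:
        index[(movie, rating, date)].add(pseudonym)

    # Count, per pseudonym, how many public records of each name line up.
    votes = defaultdict(lambda: defaultdict(int))
    for name, movie, rating, date in public:
        for pseudonym in index.get((movie, rating, date), ()):
            votes[pseudonym][name] += 1

    # Keep only the best-supported name, and only if it clears the threshold.
    linked = {}
    for pseudonym, counts in votes.items():
        name, matches = max(counts.items(), key=lambda item: item[1])
        if matches >= min_matches:
            linked[pseudonym] = name
    return linked


if __name__ == "__main__":
    print(link(anonymized, public))  # {608429: 'alice'}
```

With the default threshold only the pseudonym backed by two matching records is linked; lowering `min_matches` to 1 links more pseudonyms at the cost of more false matches.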


pages: 752 words: 131,533

Python for Data Analysis by Wes McKinney

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

backtesting, cognitive dissonance, crowdsourcing, Debian, Firefox, Google Chrome, Guido van Rossum, index card, random walk, recommendation engine, revision control, sentiment analysis, Sharpe ratio, side project, sorting algorithm, statistical model, type inference

MovieLens 1M Data Set GroupLens Research (http://www.grouplens.org/node/73) provides a number of collections of movie ratings data collected from users of MovieLens in the late 1990s and early 2000s. The data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender, and occupation). Such data is often of interest in the development of recommendation systems based on machine learning algorithms. While I will not be exploring machine learning techniques in great detail in this book, I will show you how to slice and dice data sets like these into the exact form you need. The MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies. It’s spread across 3 tables: ratings, user information, and movie information.
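As a rough illustration of slicing the three tables mentioned in this excerpt, the sketch below parses a few sample rows in the `::`-delimited layout used by the MovieLens 1M files and joins them to get a mean rating per title and gender. It uses only the standard library rather than the book's own tooling; the rows are a tiny illustrative sample, and the column names and the helper functions `parse` and `mean_rating_by_title_and_gender` are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative sample rows; the real files hold about 1 million ratings.
USERS = """1::F::1::10::48067
2::M::56::16::70072"""
MOVIES = """1193::One Flew Over the Cuckoo's Nest (1975)::Drama
661::James and the Giant Peach (1996)::Animation|Children's|Musical"""
RATINGS = """1::1193::5::978300760
1::661::3::978302109
2::1193::4::978298413"""


def parse(text, names):
    """Split '::'-delimited lines into dicts keyed by the given column names."""
    return [dict(zip(names, line.split("::"))) for line in text.splitlines()]


def mean_rating_by_title_and_gender(ratings, users, movies):
    """Join the three tables and average ratings per (title, gender) pair."""
    users_by_id = {u["user_id"]: u for u in users}
    movies_by_id = {m["movie_id"]: m for m in movies}
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of ratings, count]
    for r in ratings:
        key = (movies_by_id[r["movie_id"]]["title"],
               users_by_id[r["user_id"]]["gender"])
        totals[key][0] += int(r["rating"])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


users = parse(USERS, ["user_id", "gender", "age", "occupation", "zip"])
movies = parse(MOVIES, ["movie_id", "title", "genres"])
ratings = parse(RATINGS, ["user_id", "movie_id", "rating", "timestamp"])
means = mean_rating_by_title_and_gender(ratings, users, movies)

if __name__ == "__main__":
    for (title, gender), mean in sorted(means.items()):
        print(f"{title} [{gender}]: {mean:.1f}")
```

On the full data set a dataframe library would be the more practical choice; the dictionary join here just makes the table relationships explicit.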

Enriching the Earth: Fritz Haber, Carl Bosch, and the Transformation of World Food Production by Vaclav Smil

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

agricultural Revolution, Albert Einstein, demographic transition, Deng Xiaoping, Haber-Bosch Process, invention of gunpowder, Louis Pasteur, Pearl River Delta, precision agriculture, recommendation engine, The Design of Experiments

Power. 1997. Soil Fertility Management for Sustainable Agriculture. Boca Raton, Fla.: Lewis Publishing; Trenkel, M. A. 1997. Improving Fertilizer Use Efficiency. Paris: IFA. 11. Havlin, J. L., et al., eds. 1994. Soil Testing: Prospects for Improving Nutrient Recommendations. Madison, Wis.: Soil Science Society of America; MacKenzie, G. H., and J.-C. Taureau. 1997. Recommendation Systems for Nitrogen: A Review. York: Fertiliser Society. Periodic testing for major macronutrients has been common in high-income nations for decades, but testing for micronutrient deficiencies (ranging from boron and copper in many crops to molybdenum and cobalt needed by nitrogenase in leguminous species) has been much less frequent. 12. Cassman, K. G., et al. 1993. Nitrogen use efficiency of rice reconsidered: what are the key constraints?


pages: 1,202 words: 144,667

The Linux kernel primer: a top-down approach for x86 and PowerPC architectures by Claudia Salzberg Rodriguez, Gordon Fischer, Steven Smolski

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Debian, domain-specific language, en.wikipedia.org, recommendation engine, Richard Stallman

Many of the C library routines available to user mode programs, such as the fork() function in Figure 3.9, bundle code and one or more system calls to accomplish a single function. When a user process calls one of these functions, certain values are placed into the appropriate processor registers and a software interrupt is generated. This software interrupt then calls the kernel entry point. Although not recommended, system calls (syscalls) can also be accessed from kernel code. From where a syscall should be accessed is the source of some discussion because syscalls called from the kernel can have an improvement in performance. This improvement in performance is weighed against the added complexity and maintainability of the code. In this section, we explore the "traditional" syscall implementation where syscalls are called from user space.
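The user-space path described in this excerpt can be observed from a high-level language as well. The sketch below is an illustration under stated assumptions rather than anything from the book: it assumes a POSIX system, uses Python's `os.fork`, `os.waitpid`, and `os._exit` (thin wrappers over the corresponding C library routines), and `run_child_and_wait` is a hypothetical helper name. Which system call the C library ultimately issues for `fork()` depends on the library and kernel version.

```python
import os


def run_child_and_wait(exit_code):
    """Fork a child that exits with exit_code; return the code the parent sees."""
    pid = os.fork()  # library wrapper: user space traps into the kernel here
    if pid == 0:
        # Child process: terminate immediately, skipping Python cleanup handlers.
        os._exit(exit_code)
    # Parent process: block until the child terminates, then decode its status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(run_child_and_wait(7))  # 7
```

Each call crosses from user mode into the kernel and back, which is the "traditional" path the excerpt goes on to explore.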


pages: 666 words: 181,495

In the Plex: How Google Thinks, Works, and Shapes Our Lives by Steven Levy

23andMe, AltaVista, Anne Wojcicki, Apple's 1984 Super Bowl advert, autonomous vehicles, book scanning, Brewster Kahle, Burning Man, business process, clean water, cloud computing, crowdsourcing, Dean Kamen, discounted cash flows, don't be evil, Donald Knuth, Douglas Engelbart, El Camino Real, fault tolerance, Firefox, Gerard Salton, Google bus, Google Chrome, Google Earth, Googley, HyperCard, hypertext link, IBM and the Holocaust, informal economy, information retrieval, Internet Archive, Jeff Bezos, John Markoff, Kevin Kelly, Mark Zuckerberg, Menlo Park, one-China policy, optical character recognition, PageRank, Paul Buchheit, Potemkin village, prediction markets, recommendation engine, risk tolerance, Rubik’s Cube, Sand Hill Road, Saturday Night Live, search inside the book, second-price auction, selection bias, Silicon Valley, skunkworks, Skype, slashdot, social graph, social software, social web, spectrum auction, speech recognition, statistical model, Steve Ballmer, Steve Jobs, Steven Levy, Ted Nelson, telemarketer, trade route, traveling salesman, turn-by-turn navigation, Vannevar Bush, web application, WikiLeaks, Y Combinator

While he put the pieces of YouTube together, though, he always kept in mind that he was documenting a traditional media system on the verge of collapse. He had to deal with the music world as it was but also plan for the way it would be after disruptions, which Google and YouTube were accelerating. Kamangar had some specific ideas for improvement of YouTube. He urged a simpler user interface and a smarter recommendation system to point users to other videos they might enjoy. He urged more flexibility with producers of professional video so YouTube would get more commercial content. He also emphasized how some of Google’s key attributes—notably speed—had a huge impact on the overall experience. If Google could reliably deliver videos with almost no latency, he reasoned, users might not balk so much at the “preroll” ads that come before the actual content, especially if the video was one of a series that users subscribed to and so were already eager to see what was coming.


pages: 933 words: 205,691

Hadoop: The Definitive Guide by Tom White

Amazon Web Services, bioinformatics, business intelligence, combinatorial explosion, database schema, Debian, domain-specific language, en.wikipedia.org, fault tolerance, full text search, Grace Hopper, information retrieval, Internet Archive, linked data, loose coupling, openstreetmap, recommendation engine, RFID, SETI@home, social graph, web application

The Last.fm player or website can be used to access these streams and extra functionality is made available to the user, allowing her to love, skip, or ban each track that she listens to. When processing the received data, we distinguish between a track listen submitted by a user (the first source above, referred to as a scrobble from here on) and a track listened to on the Last.fm radio (the second source, mentioned earlier, referred to as a radio listen from here on). This distinction is very important in order to prevent a feedback loop in the Last.fm recommendation system, which is based only on scrobbles. One of the most fundamental Hadoop jobs at Last.fm takes the incoming listening data and summarizes it into a format that can be used for display purposes on the Last.fm website as well as for input to other Hadoop programs. This is achieved by the Track Statistics program, which is the example described in the following sections.

The Track Statistics Program

When track listening data is submitted to Last.fm, it undergoes a validation and conversion phase, the end result of which is a number of space-delimited text files containing the user ID, the track ID, the number of times the track was scrobbled, the number of times the track was listened to on the radio, and the number of times it was skipped.
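The record format described in this excerpt can be sketched in a few lines of plain Python. This is not the book's Hadoop job: the field order follows the sentence above, while the names TrackStats and summarize and the sample rows are hypothetical, invented for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TrackStats:
    """Per-track totals (hypothetical structure, for illustration)."""
    listeners: set = field(default_factory=set)  # distinct user IDs
    scrobbles: int = 0
    radio_listens: int = 0
    skips: int = 0


def summarize(lines):
    """Aggregate space-delimited records of the assumed form:
    user_id track_id scrobbles radio_listens skips
    """
    stats = defaultdict(TrackStats)
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # ignore records that do not have exactly five fields
        user_id, track_id = parts[0], parts[1]
        scrobbles, radio, skips = (int(p) for p in parts[2:])
        entry = stats[track_id]
        entry.listeners.add(user_id)
        # Scrobbles and radio listens are kept apart, mirroring the
        # distinction the excerpt draws between the two sources.
        entry.scrobbles += scrobbles
        entry.radio_listens += radio
        entry.skips += skips
    return dict(stats)


# Made-up sample rows, not real Last.fm data.
sample = [
    "111115 222 0 1 0",
    "111113 225 1 0 0",
    "111117 223 0 1 1",
    "111115 225 2 0 0",
]
result = summarize(sample)
for track_id, s in sorted(result.items()):
    print(track_id, len(s.listeners), s.scrobbles, s.radio_listens, s.skips)
```

Keeping scrobbles and radio listens in separate counters lets a downstream consumer use scrobbles alone, which is what the excerpt says the recommendation system relies on.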


pages: 678 words: 216,204

The Wealth of Networks: How Social Production Transforms Markets and Freedom by Yochai Benkler

affirmative action, barriers to entry, bioinformatics, Brownian motion, call centre, Cass Sunstein, centre right, clean water, commoditize, dark matter, desegregation, East Village, fear of failure, Firefox, game design, George Gilder, hiring and firing, Howard Rheingold, informal economy, invention of radio, Isaac Newton, iterative process, Jean Tirole, jimmy wales, John Markoff, Kenneth Arrow, market bubble, market clearing, Marshall McLuhan, New Journalism, optical character recognition, pattern recognition, peer-to-peer, pre–internet, price discrimination, profit maximization, profit motive, random walk, recommendation engine, regulatory arbitrage, rent-seeking, RFID, Richard Stallman, Ronald Coase, Search for Extraterrestrial Intelligence, SETI@home, shareholder value, Silicon Valley, Skype, slashdot, social software, software patent, spectrum auction, technoutopianism, The Fortune at the Bottom of the Pyramid, The Nature of the Firm, transaction costs, Vilfredo Pareto

Cable broadband covers roughly two-thirds of the home market, in many places without alternative; and where there is an alternative, there is only one--the incumbent telephone company. Without one of these noncompetitive infrastructure owners, the home user has no broadband access to the Internet. In Amazon's case, the consumer outrage when the practice was revealed focused on the lack of transparency. Users had little objection to clearly demarcated advertisement. The resistance was to the nontransparent manipulation of the recommendation system aimed at causing the consumers to act in ways consistent with Amazon's goals, rather than their own. In that case, however, there were alternatives. There are many different places from which to find book reviews and recommendations, and at the time, barnesandnoble.com was already available as an online bookseller--and had not significantly adopted similar practices. The exaction was therefore less significant.


pages: 647 words: 43,757

Types and Programming Languages by Benjamin C. Pierce

Albert Einstein, combinatorial explosion, experimental subject, finite state, Henri Poincaré, Perl 6, recommendation engine, sorting algorithm, Turing complete, Turing machine, type inference, Y Combinator

In particular, the proofs of type preservation and progress are straightforward extensions of the ones we saw in Chapter 9. 23.5.1 Theorem [Preservation]: If Γ ⊢ t : T and t → t′, then Γ ⊢ t′ : T. Proof: Exercise [Recommended, ★★★]. 23.5.2 Theorem [Progress]: If t is a closed, well-typed term, then either t is a value or else there is some t′ with t → t′. Proof: Exercise [Recommended, ★★★]. System F also shares with λ→ the property of normalization: the fact that the evaluation of every well-typed program terminates. Unlike the type safety theorems above, normalization is quite difficult to prove (indeed, it is somewhat astonishing that it holds at all, considering that we can code things like sorting functions in the pure language, as we did in Exercise 23.4.12, without resorting to fix).


pages: 834 words: 180,700

The Architecture of Open Source Applications by Amy Brown, Greg Wilson

8-hour work day, anti-pattern, bioinformatics, c2.com, cloud computing, collaborative editing, combinatorial explosion, computer vision, continuous integration, create, read, update, delete, David Heinemeier Hansson, Debian, domain-specific language, Donald Knuth, en.wikipedia.org, fault tolerance, finite state, Firefox, friendly fire, Guido van Rossum, linked data, load shedding, locality of reference, loose coupling, Mars Rover, MVC pattern, peer-to-peer, Perl 6, premature optimization, recommendation engine, revision control, Ruby on Rails, side project, Skype, slashdot, social web, speech recognition, the scientific method, The Wisdom of Crowds, web application, WebSocket

VisTrails addresses important usability issues that have hampered a wider adoption of workflow and visualization systems. To cater to a broader set of users, including many who do not have programming expertise, it provides a series of operations and user interfaces that simplify workflow design and use [FSC+06], including the ability to create and refine workflows by analogy, to query workflows by example, and to suggest workflow completions as users interactively construct their workflows using a recommendation system [SVK+07]. We have also developed a new framework that allows the creation of custom applications that can be more easily deployed to (non-expert) end users. The extensibility of VisTrails comes from an infrastructure that makes it simple for users to integrate tools and libraries, as well as to quickly prototype new functions. This has been instrumental in enabling the use of the system in a wide range of application areas, including environmental sciences, psychiatry, astronomy, cosmology, high-energy physics, quantum physics, and molecular modeling.


pages: 775 words: 208,604

The Great Leveler: Violence and the History of Inequality From the Stone Age to the Twenty-First Century by Walter Scheidel

Agricultural Revolution, assortative mating, basic income, Berlin Wall, Bernie Sanders, Branko Milanovic, British Empire, capital controls, Capital in the Twenty-First Century by Thomas Piketty, collective bargaining, colonial rule, Columbian Exchange, conceptual framework, corporate governance, cosmological principle, crony capitalism, dark matter, declining real wages, demographic transition, Dissolution of the Soviet Union, Downton Abbey, Edward Glaeser, failed state, Fall of the Berlin Wall, financial deregulation, fixed income, Francisco Pizarro, full employment, Gini coefficient, hiring and firing, income inequality, John Markoff, knowledge worker, land reform, land tenure, low skilled workers, means of production, mega-rich, Network effects, nuclear winter, offshore financial centre, plutocrats, race to the bottom, recommendation engine, rent control, rent-seeking, road to serfdom, Robert Gordon, Ronald Reagan, Second Machine Age, Simon Kuznets, The Future of Employment, The Wealth of Nations by Adam Smith, Thomas Malthus, transaction costs, transatlantic slave trade, universal basic income, very high income, working-age population, zero-sum game

The wealthy either held office themselves or were linked to those who did, and state service and connections to those who performed it in turn generated more personal wealth. These dynamics both favored and constrained familial continuity in wealth holding. On the one hand, the sons of high officials were more likely to follow in their footsteps. They and other junior relatives were automatically entitled to enter officialdom and benefited disproportionately from the recommendation system employed to fill governmental positions. We hear of officials among whose brothers and sons six or seven (in one case, no fewer than thirteen sons) also came to serve as imperial administrators. On the other hand, the same predatory and capricious exercise of political power that turned civil servants into plutocrats also undermined their success. Guan Fu, a highly placed government official, had accumulated a large fortune and owned so much land in his native region that widespread loathing of this preeminence inspired a local children’s song: While the Ying River is clear the Guan family will be secure; when the Ying River is muddy the Guan family will be exterminated!