information retrieval


Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage by Zdravko Markov, Daniel T. Larose


Firefox, information retrieval, Internet Archive, iterative process, natural language processing, pattern recognition, random walk, recommendation engine, semantic web, speech recognition, statistical model, William of Occam

Then we describe briefly the basics of the Web and explore the approaches taken by web search engines to retrieve web pages by keyword search. To do this we look into the technology for text analysis and search developed earlier in the area of information retrieval and extended recently with ranking methods based on web hyperlink structure. All that may be seen as a preprocessing step in the overall process of data mining the web content, which provides the input to machine learning methods for extracting knowledge from hypertext data, discussed in the second part of the book.

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, by Zdravko Markov and Daniel T. Larose. Copyright © 2007 John Wiley & Sons, Inc.

CHAPTER 1 INFORMATION RETRIEVAL AND WEB SEARCH

WEB CHALLENGES
CRAWLING THE WEB
INDEXING AND KEYWORD SEARCH
EVALUATING SEARCH QUALITY
SIMILARITY SEARCH

WEB CHALLENGES

As originally proposed by Tim Berners-Lee [1], the Web was intended to improve the management of general information about accelerators and experiments at CERN.

For example, the web page with the phone numbers mentioned above can be indexed by all the terms that occur in the anchor text pointing to it: department, chairs, locations, phone, and numbers. More terms may be collected from other pages pointing to it. This idea was implemented in one of the first search engines, the World Wide Web Worm system [4], and later used by Lycos and Google. This allows search engines to increase their indices with pages that have never been crawled, are unavailable, or include nontextual content that cannot be indexed, such as images and programs. As reported by Brin and Page [5] in 1998, Google indexed 24 million pages and over 259 million anchors.

EVALUATING SEARCH QUALITY

Information retrieval systems do not have formal semantics (such as that of databases), and consequently, the query and the set of documents retrieved (the response of the IR system) cannot be mapped one to one.
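The anchor-text indexing idea in this excerpt can be illustrated with a minimal sketch. It assumes links are already available as (source URL, target URL, anchor text) triples; the URLs, the `tokenize` helper, and the `index_by_anchor_text` function are hypothetical names chosen for illustration, not anything taken from the book or from a real search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def index_by_anchor_text(links):
    """Map each term to the set of target URLs whose incoming anchor text contains it.

    `links` is an iterable of (source_url, target_url, anchor_text) triples.
    The target page itself is never fetched, so images or uncrawled pages
    can still be indexed.
    """
    index = defaultdict(set)
    for _source, target, anchor in links:
        for term in tokenize(anchor):
            index[term].add(target)
    return index


# Hypothetical link data, made up for this example.
links = [
    ("http://example.edu/", "http://example.edu/phones.html",
     "Department chairs, locations, phone numbers"),
    ("http://example.edu/staff.html", "http://example.edu/phones.html",
     "Faculty directory"),
    ("http://example.edu/", "http://example.edu/logo.png",
     "University logo"),
]

index = index_by_anchor_text(links)
print(sorted(index["phone"]))
print(sorted(index["logo"]))
```

In this sketch the image `logo.png` becomes findable by the term "logo" even though it has no text of its own, and the phone page gains the extra terms "faculty" and "directory" from a second page that points to it.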

Includes index. 978-0-471-66655-4 (cloth) 1. Data mining. 2. Web databases. I. Larose, Daniel T. II. Title. QA76.9.D343M38 2007 005.74 – dc22 2006025099

Printed in the United States of America 10 9 8 7 6 5 4 3 2 1

For my children Teodora, Kalin, and Svetoslav – Z.M.
For my children Chantal, Ellyriane, Tristan, and Ravel – D.T.L.

CONTENTS

PREFACE

PART I WEB STRUCTURE MINING

1 INFORMATION RETRIEVAL AND WEB SEARCH
Web Challenges; Web Search Engines; Topic Directories; Semantic Web; Crawling the Web; Web Basics; Web Crawlers; Indexing and Keyword Search; Document Representation; Implementation Considerations; Relevance Ranking; Advanced Text Search; Using the HTML Structure in Keyword Search; Evaluating Search Quality; Similarity Search; Cosine Similarity; Jaccard Similarity; Document Resemblance; References; Exercises

2 HYPERLINK-BASED RANKING
Introduction; Social Networks Analysis; PageRank; Authorities and Hubs; Link-Based Similarity Search; Enhanced Techniques for Page Ranking; References; Exercises

PART II WEB CONTENT MINING

3 CLUSTERING
Introduction; Hierarchical Agglomerative Clustering; k-Means Clustering; Probability-Based Clustering; Finite Mixture Problem; Classification Problem; Clustering Problem; Collaborative Filtering (Recommender Systems); References; Exercises

4 EVALUATING CLUSTERING
Approaches to Evaluating Clustering; Similarity-Based Criterion Functions; Probabilistic Criterion Functions; MDL-Based Model and Feature Evaluation; Minimum Description Length Principle; MDL-Based Model Evaluation; Feature Selection; Classes-to-Clusters Evaluation; Precision, Recall, and F-Measure; Entropy; References; Exercises

5 CLASSIFICATION
General Setting and Evaluation Techniques; Nearest-Neighbor Algorithm; Feature Selection; Naive Bayes Algorithm; Numerical Approaches; Relational Learning; References; Exercises

PART III WEB USAGE MINING

6 INTRODUCTION TO WEB USAGE MINING
Definition of Web Usage Mining; Cross-Industry Standard Process for Data Mining; Clickstream Analysis; Web Server Log Files; Remote Host Field; Date/Time Field; HTTP Request Field; Status Code Field; Transfer Volume (Bytes) Field; Common Log Format; Identification Field; Authuser Field; Extended Common Log Format; Referrer Field; User Agent Field; Example of a Web Log Record; Microsoft IIS Log Format; Auxiliary Information; References; Exercises

7 PREPROCESSING FOR WEB USAGE MINING
Need for Preprocessing the Data; Data Cleaning and Filtering; Page Extension Exploration and Filtering; De-Spidering the Web Log File; User Identification; Session Identification; Path Completion; Directories and the Basket Transformation; Further Data Preprocessing Steps; References; Exercises

8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING
Introduction; Number of Visit Actions; Session Duration; Relationship between Visit Actions and Session Duration; Average Time per Page; Duration for Individual Pages; References; Exercises

9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION
Introduction; Modeling Methodology; Definition of Clustering; The BIRCH Clustering Algorithm; Affinity Analysis and the A Priori Algorithm; Discretizing the Numerical Variables: Binning; Applying the A Priori Algorithm to the CCSU Web Log Data; Classification and Regression Trees; The C4.5 Algorithm; References; Exercises

INDEX

PREFACE

DEFINING DATA MINING THE WEB

By data mining the Web, we refer to the application of data mining methodologies, techniques, and models to the variety of data forms, structures, and usage patterns that comprise the World Wide Web.


pages: 298 words: 43,745

Understanding Sponsored Search: Core Elements of Keyword Advertising by Jim Jansen


AltaVista, barriers to entry, Black Swan, bounce rate, business intelligence, butterfly effect, call centre, Claude Shannon: information theory, complexity theory, correlation does not imply causation, first-price auction, information retrieval, inventory management, life extension, linear programming, megacity, Nash equilibrium, Network effects, PageRank, place-making, price mechanism, psychological pricing, random walk, Schrödinger's Cat, sealed-bid auction, search engine result page, second-price auction, second-price sealed-bid, sentiment analysis, social web, software as a service, stochastic process, telemarketer, the market place, The Present Situation in Quantum Mechanics, the scientific method, The Wisdom of Crowds, Vickrey auction, yield management

Journal of the American Society for Information Science and Technology, vol. 56(6), pp. 559–570. [46] Belkin, N. J. 1993. “Interaction with Texts: Information Retrieval as Information-Seeking Behavior.” In Information retrieval ’93. Von der Modellierung zur Anwendung. Konstanz, Germany: Universitaetsverlag Konstanz, pp. 55–66. [47] Saracevic, T. 1997. “Extension and Application of the Stratified Model of Information Retrieval Interaction.” In the Annual Meeting of the American Society for Information Science, Washington, DC, pp. 313–327. [48] Saracevic, T. 1996. “Modeling Interaction in Information Retrieval (IR): A Review and Proposal.” In the 59th American Society for Information Science Annual Meeting, Baltimore, MD, pp. 3–9. [49] Belkin, N., Cool, C., Croft, W. B., and Callan, J. 1993. “The Effect of Multiple Query Representations on Information Retrieval Systems.” In 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 339–346. [50] Belkin, N., Cool, C., Kelly, D., Lee, H.

In 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 339–346. [50] Belkin, N., Cool, C., Kelly, D., Lee, H.-J., Muresan, G., Tang, M.-C., and Yuan, X.-J. 2003. “Query Length in Interactive Information Retrieval.” In 26th Annual International ACM Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 205–212. [51] Cronen-Townsend, S., Zhou, Y., and Croft, W. B. 2002. “Predicting Query Performance.” In 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, pp. 299–306. [52] Efthimiadis, E. N. 2000. “Interactive Query Expansion: A User-Based Evaluation in a Relevance Feedback Environment.” Journal of the American Society of Information Science and Technology, vol. 51(11), pp. 989–1003. [53] Belkin, N.

Information overload: refers to the difficulty a person can have understanding an issue and making decisions that can be caused by the presence of too much information (see Chapter 5 customers).

Information retrieval: a field of study related to information extraction. Information retrieval is about developing systems to effectively index and search vast amounts of data (Source: (see Chapter 3 keywords).

Information scent: cues related to the desired outcome (see Chapter 3 keywords).

Information searching: refers to people’s interaction with information-retrieval systems, ranging from adopting search strategy to judging the relevance of information retrieved (see Chapter 3 keywords).

Insertion: actual placement of an ad in a document, as recorded by the ad server (Source: IAB) (see Chapter 2 model).

Insertion order: purchase order between a seller of interactive advertising and a buyer (usually an advertiser or its agency) regarding the insertion date(s), number of insertions in a stated period, ad size (or commercial length), and ad placement (or time slot).


pages: 593 words: 118,995

Relevant Search: With Examples Using Elasticsearch and Solr by Doug Turnbull, John Berryman


crowdsourcing, domain-specific language, finite state, fudge factor, full text search, information retrieval, natural language processing, premature optimization, recommendation engine, sentiment analysis

In reality, there is a discipline behind relevance: the academic field of information retrieval. It has generally accepted practices to improve relevance broadly across many domains. But you’ve seen that what’s relevant depends a great deal on your application. Given that, as we introduce information retrieval, think about how its general findings can be used to solve your narrower relevance problem.[2]

2 For an introduction to the field of information retrieval, we highly recommend the classic text Introduction to Information Retrieval by Christopher D. Manning et al. (Cambridge University Press, 2008); see

1.3.1. Information retrieval

Luckily, experts have been studying search for decades. The academic field of information retrieval focuses on the precise recall of information to satisfy a user’s information need.

Example of making a relevance judgment for the query “Rambo” in Quepid, a judgment list management application

Using judgment lists, researchers aim to measure whether changes to text relevance calculations improve the overall relevance of the results across every test collection. To classic information retrieval, a solution that improves a dozen text-heavy test collections 1% overall is a success. Rather than focusing on one particular problem in depth, information retrieval focuses on solving search for a broad set of problems.

1.3.2. Can we use information retrieval to solve relevance?

You’ve already seen there’s no silver bullet. But information retrieval does seem to systematically create relevance solutions. So ask yourself: Do these insights apply to your application? Does your application care about solutions that offer incremental, general improvements to searching article-length text? Would it be better to solve the specific problems faced by your application, here and now? To be more precise, classic information retrieval begs several questions when brought to bear on applied relevance problems.

If you’re fortunate, you’ll find a result addressing a problem similar to your own. That information will solve your problem, and you’ll move on. In information retrieval, relevance is defined as the practice of returning search results that most satisfy the user’s information needs. Further, classic information retrieval focuses on text ranking. Many findings in information retrieval try to measure how likely a given article is going to be relevant to a user’s text search. You’ll learn about several of these invaluable methods throughout this book—as many of these findings are implemented in open source search engines. To discover better text-searching methods, information retrieval researchers benchmark different strategies by using test collections of articles. These test collections include Amazon reviews, Reuters news articles, Usenet posts, and other similar, article-length data sets.
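As a rough illustration of benchmarking strategies against a test collection, the sketch below scores two hypothetical ranking strategies with mean precision at k. It assumes a simplified, binary judgment list (each query maps to a set of relevant document IDs); real judgment lists, including those managed in tools such as Quepid, are often graded, and the document IDs, queries, and function names here are invented for the example.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(run, judgments, k):
    """Average precision@k over every judged query.

    `run` maps query -> ranked list of doc IDs produced by one strategy.
    `judgments` maps query -> set of doc IDs judged relevant.
    """
    scores = [
        precision_at_k(run.get(query, []), relevant, k)
        for query, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical judgment list and two hypothetical ranking strategies.
judgments = {
    "rambo": {"d1", "d2", "d5"},
    "rocky": {"d7"},
}
baseline = {
    "rambo": ["d3", "d1", "d4", "d2"],
    "rocky": ["d8", "d7", "d9", "d6"],
}
candidate = {
    "rambo": ["d1", "d2", "d3", "d5"],
    "rocky": ["d7", "d8", "d9", "d6"],
}

print(mean_precision_at_k(baseline, judgments, k=2))   # 0.5
print(mean_precision_at_k(candidate, judgments, k=2))  # 0.75
```

A single averaged number like this is what allows a change to be called an improvement "overall", which is also why it can hide regressions on the specific queries a given application cares about.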


Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei


bioinformatics, business intelligence, business process, Claude Shannon: information theory, cloud computing, computer vision, correlation coefficient, cyber-physical system, database schema, discrete time, distributed generation, finite state, information retrieval, iterative process, knowledge worker, linked data, natural language processing, Netflix Prize, Occam's razor, pattern recognition, performance metric, phenotype, random walk, recommendation engine, RFID, semantic web, sentiment analysis, speech recognition, statistical model, stochastic process, supply-chain management, text mining, thinkpad, web application

Textbooks and reference books on information retrieval include Introduction to Information Retrieval by Manning, Raghavan, and Schütze [MRS08]; Information Retrieval: Implementing and Evaluating Search Engines by Büttcher, Clarke, and Cormack [BCC10]; Search Engines: Information Retrieval in Practice by Croft, Metzler, and Strohman [CMS09]; Modern Information Retrieval: The Concepts and Technology Behind Search by Baeza-Yates and Ribeiro-Neto [BYRN11]; and Information Retrieval: Algorithms and Heuristics by Grossman and Frieder [GR04]. Information retrieval research is published in the proceedings of several information retrieval and Web search and mining conferences, including the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), the International World Wide Web Conference (WWW), the ACM International Conference on Web Search and Data Mining (WSDM), the ACM Conference on Information and Knowledge Management (CIKM), the European Conference on Information Retrieval (ECIR), the Text Retrieval Conference (TREC), and the ACM/IEEE Joint Conference on Digital Libraries (JCDL).

The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining (see Section 1.3.2).

1.5.4. Information Retrieval

Information retrieval (IR) is the science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the Web. The differences between traditional information retrieval and database systems are twofold: Information retrieval assumes that (1) the data under search are unstructured; and (2) the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems). The typical approaches in information retrieval adopt probabilistic models. For example, a text document can be regarded as a bag of words, that is, a multiset of words appearing in the document.
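The bag-of-words view mentioned here can be made concrete with a short sketch. It uses Python's `collections.Counter` as the multiset; the tokenizer (lowercasing and splitting on non-alphanumeric characters) and the sample sentence are assumptions made for illustration, and real systems typically add steps such as stop-word removal or stemming.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return the multiset of lowercase word tokens in `text`."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


doc = ("Information retrieval is the science of searching for documents "
       "or information in documents.")
bag = bag_of_words(doc)

print(bag["information"])  # 2
print(bag["documents"])    # 2
print(bag["science"])      # 1

# Word order is discarded: two texts with the same words in a different
# order produce the same bag.
print(bag_of_words("web search") == bag_of_words("search web"))  # True

# A simple relative-frequency estimate, the kind of quantity a
# probabilistic model might start from.
total = sum(bag.values())
print(bag["information"] / total)
```

Because only counts survive, the representation is cheap to compare and combine, at the cost of losing phrase and position information.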

Other sources of publication include major information retrieval, information systems, and Web journals, such as Journal of Information Retrieval, ACM Transactions on Information Systems (TOIS), Information Processing and Management, Knowledge and Information Systems (KAIS), and IEEE Transactions on Knowledge and Data Engineering (TKDE).

2. Getting to Know Your Data

It's tempting to jump straight into mining, but first, we need to get the data ready. This involves having a closer look at attributes and data values.


pages: 263 words: 75,610

Delete: The Virtue of Forgetting in the Digital Age by Viktor Mayer-Schönberger

Amazon:, Erik Brynjolfsson, Firefox, full text search, George Akerlof, information retrieval, information trail, Internet Archive, invention of movable type, invention of the printing press, moveable type in China, Network effects, packet switching, pattern recognition, RFID, slashdot, Steve Jobs, Steven Levy, The Market for Lemons, The Structural Transformation of the Public Sphere, Vannevar Bush

See information dossiers Dutch citizen registry, 141, 157–58 DVD, 64–65, 145 eBay, 93, 95 Ecommerce, 131 Egypt, 32 Eisenstein, Elizabeth, 37, 38 e-mails: preservation of, 69 entropy, 22 epics, 25, 26, 27 European Human Rights Convention, 110 European Union Privacy Directive, 158–59, 160 exit, 99, 8 expiration dates for information, 171–95, 198–99 binary nature of, 192–93 imperfection of, 194–95 negotiating, 185–89, 187 persistence of, 183–85 societal preferences for, 182–83 external memory, limitations of, 34 Facebook, 2, 3, 84, 86, 197 Feldmar, Andrew, 3–4, 5, 104–5, 109, 111, 197 Felten, Edward, 151–52, 188 fiber-optic cables, 80–81 fidelity, 60 filing systems, 74 film, 47 fingerprints, 78 First Amendment, 110 Flash memory, 63 Flickr, 84, 102, 124 flight reservation, 8 Foer, Joshua, 21 forgetting: cost of, 68, 91, 92 human, 19–20, 114–17 central importance of, 13, 21 societal, 13 forgiving, 197 Foucault, Michel, 11, 112 free-riding, 133 Friedman, Lawrence, 106 Gandy, Oscar, 11, 165 Gasser, Urs, 3, 130 “Goblin edits,” 62 Google, 2, 6–8, 70–71, 84, 103, 104, 109, 130–31, 175–78, 179, 186, 197 governmental decision-making, 94 GPS, 9 Graham, Mary, 94 Gutenberg, 37–38 hard disks, 62–63 hieroglyphs, 32 Hilton, Paris, 86 history: omnipresence of, 125 Hotmail, 69 human cognition, 154–57 “Imagined Communities,” 43 index, 73–74, 90 full-text, 76–77 information: abstract, 17 biometric, 9 bundling of, 82–83 control over, 85–87, 91, 97–112, 135–36, 140, 167–68, 181–82 deniability of, 87 decontextualization of, 78, 89–90, 142 economics of, 82–83 incompleteness of, 156 interpretation of, 96 leakages of, 105, 133–34 legally mandated retention of, 160–61 lifespan of, 172 markets for, 145–46 misuse of, 140 peer-to-peer sharing of, 84, 86 processors of, 175–78 production cost of, 82–83 property of, 143 quality of, 96–97 recall of, 18–19 recombining of, 61–62, 85, 88–90 recontextualization of, 89–90 retrieval of, 72–79 risk of collecting, 158 role of, 85 self-disclosure of, 4 
sharing of, 3, 84–85 total amount of, 52 information control: relational concepts of, 153 information dossiers, 104 digital, 123–25 information ecology, 157–63 information power, 112 differences in, 107, 133, 187, 191, 192 information privacy, 100, 108, 135, 174, 181–82 effectiveness of rights to, 135–36, 139–40, 143–44 enforcement of right to, 139–40 purpose limitation principle in, 136, 138, 159 rights to, 134–44 information retrieval. See information: retrieval of information sharing: default of, 88 information storage: capacity, 66 cheap, 62–72 corporate, 68–69 density of, 71 economics of, 68 increase in, 71–72 magnetic, 62–64 optical, 64–65 relative cost of, 65–66 sequential nature of analog, 75 informational self-determination, 137 relational dimension of, 170 intellectual property (IP), 144, 146, 150, 174 Internet, 79 “future proof,” 59–60 peer-production and, 131–32 Internet archives, 4 Islam: printing in, 40 Ito, Joi, 126 Johnson, Deborah, 14 Keohane, Robert, 98 Kodak, Eastman, 45–46 Korea: printing in, 40 language, 23–28 Lasica, J.

The likely medium-term outcome is that storage capacity will continue to double and storage costs to halve about every eighteen to twenty-four months, leaving us with an abundance of cheap digital storage. Easy Retrieval Remembering is more than committing information to memory. It includes the ability to retrieve that information later easily and at will. As humans, we are all too familiar with the challenges of information retrieval from our brain’s long-term memory. External analog memory, like books, hold huge amounts of information, but finding a particular piece of information in it is difficult and time-consuming. Much of the latent value of stored information remains trapped, unlikely to be utilized. Even though we may have stored it, analog information that cannot be retrieved easily in practical terms is no different from having been forgotten.

In contrast, retrieval from digital memory is vastly easier, cheaper, and swifter: a few words in the search box, a click, and within a few seconds a list of matching information is retrieved and presented in neatly formatted lists. Such trouble-free retrieval greatly enhances the value of information. To be sure, humans have always tried to make information retrieval easier and less cumbersome, but they faced significant hurdles. Take written information. The switch from tablets and scrolls to bound books helped in keeping information together, and certainly improved accessibility, but it did not revolutionize retrieval. Similarly, libraries helped amass information, but didn’t do as much in tracking it down. Only well into the second millennium, when workable indices of book collections (initially perhaps developed out of the extensive organization into subdivisions, later chapters and verses of Hebrew and Christian scriptures) became common, were librarians able to locate a book based on title and author.30 It took centuries of refinement to develop standardized book cataloguing and shelving techniques, as part of the rise of the modern library.


pages: 291 words: 77,596

Total Recall: How the E-Memory Revolution Will Change Everything by C. Gordon Bell, Jim Gemmell


airport security, Albert Einstein, book scanning, cloud computing, conceptual framework, full text search, information retrieval, invention of writing, inventory management, Isaac Newton, Menlo Park, optical character recognition, pattern recognition, performance metric, RAND corporation, RFID, semantic web, Silicon Valley, Skype, social web, statistical model, Stephen Hawking, Steve Ballmer, Ted Nelson, telepresence, Turing test, Vannevar Bush, web application

Cathal Gurrin’s Web page is here are some of his papers about e-memories. Doherty, A., C. Gurrin, G. Jones, and A. F. Smeaton. “Retrieval of Similar Travel Routes Using GPS Tracklog Place Names.” SIGIR 2006—Conference on Research and Development on Information Retrieval, Workshop on Geographic Information Retrieval, Seattle, Washington, August 6-11, 2006. Gurrin, C., A. F. Smeaton, D. Byrne, N. O’Hare, G. Jones, and N. O’Connor. “An Examination of a Large Visual Lifelog.” AIRS 2008—Asia Information Retrieval Symposium, Harbin, China, January 16-18, 2008. Lavelle, B., D. Byrne, C. Gurrin, A. F. Smeaton, and G. Jones. “Bluetooth Familiarity: Methods of Calculation, Applications and Limitations.” MIRW 2007—Mobile Interaction with the Real World, Workshop at the MobileHCI07: 9th International Conference on Human Computer Interaction with Mobile Devices and Services, Singapore, September 9, 2007.

“Physical Context for Just-in-Time Information Retrieval.” IEEE Transactions on Computers 52, no. 8 (August): 1011-14. ———. 1997. “The Wearable Remembrance Agent: A System for Augmented Memory.” Special Issue on Wearable Computing, Personal Technologies Journal 1:218-24. Rhodes, Bradley J. “Margin Notes: Building a Contextually Aware Associative Memory” (html), to appear in The Proceedings of the International Conference on Intelligent User Interfaces (IUI ’00), New Orleans, Louisiana, January 9-12, 2000. Rhodes, Bradley, and Pattie Maes. 2000. “Just-in-Time Information Retrieval Agents.” Special issue on the MIT Media Laboratory, IBM Systems Journal 39, nos. 3 and 4: 685-704. Rhodes, Bradley, and Thad Starner. “The Remembrance Agent: A Continuously Running Automated Information Retrieval System.” The Proceedings of the First International Conference on the Practical Application of Intelligent Agents and Multi Agent Technology (PAAM ’96), London, UK, April 1996, 487-95.

Ellis. “Multimodal Segmentation of Lifelog Data.” Eighth RIAO Conference—Large-Scale Semantic Access to Content (Text, Image, Video and Sound), Pittsburgh, Pennsylvania, May 30-June 1, 2007. Lee, Hyowon, Alan F. Smeaton, Noel E. O’Connor, and Gareth J. F. Jones. “Adaptive Visual Summary of LifeLog Photos for Personal Information Management.” AIR 2006—First International Workshop on Adaptive Information Retrieval, Glasgow, UK, October 14, 2006. O’Conaire, C., N. O’Connor, A. F. Smeaton, and G. Jones. “Organizing a Daily Visual Diary Using Multi-Feature Clustering.” SPIE Electronic Imaging—Multimedia Content Access: Algorithms and Systems (EI121), San Jose, California, January 28-February 1, 2007. Smeaton, A. F. “Content vs. Context for Multimedia Semantics: The Case of SenseCam Image Structuring.”


pages: 193 words: 19,478

Memory Machines: The Evolution of Hypertext by Belinda Barnet


augmented reality, Benoit Mandelbrot, Bill Duvall, British Empire, Buckminster Fuller, Claude Shannon: information theory, collateralized debt obligation, computer age, conceptual framework, Douglas Engelbart, game design, hiring and firing, Howard Rheingold, HyperCard, hypertext link, information retrieval, Internet Archive, linked data, mandelbrot fractal, Marshall McLuhan, Menlo Park, nonsequential writing, Norbert Wiener, publish or perish, semantic web, Steve Jobs, Stewart Brand, technoutopianism, Ted Nelson, the scientific method, Vannevar Bush, wikimedia commons

Engelbart, however, believes that his paper failed not for lack of prototypes but because the institution of computing science at the time ‘kept trying to fit [his] ideas into the existing paradigm’, claiming that he should just ‘join their forefront problem pursuits’ and stop setting himself apart with far-flung augmentation babble (Engelbart 1988, 190). He protested that he was doing neither ‘information retrieval’ nor ‘electrical engineering’, but a new thing somewhere in between, and that it should be recognized as a new field of research. In our interview he remembered that: After I’d given a talk at Stanford, [three angry guys] got me later outside at a table. They said, ‘All you’re talking about is information retrieval.’ I said no. They said, ‘YES, it is, we’re professionals and we know, so we’re telling you don’t know enough so stay out of it, ’cause goddamit, you’re bollocksing it all up. You’re in engineering, not information retrieval.’ (Engelbart 1999) Computers, in large part, were still seen as number crunchers, and computer engineers had no business talking about psychology and the human beings who used these machines.

In making this passage, however, Engelbart also fell into a kind of failure, at least by the common understanding of an engineer’s calling in the national security state. As Engelbart told the author of this book in 1999, he was often told to mind his own business and keep off well-defined turf: After I’d given a talk at Stanford, [three angry guys] got me later outside at a table. They said, ‘All you’re talking about is information retrieval.’ I said no. They said, ‘YES, it is, we’re professionals and we know, so we’re telling you don’t know enough so stay out of it, ’cause goddamit, you’re bollocksing it all up. You’re in engineering, not information retrieval.’ (Engelbart 1999) My hero; the man who never knew too much about disciplinary confines, professional flocking rules and the mere retrieval of information; the man who straps bricks to pencils, who annoys the specialists, who insists on bollocksing up the computer world in all kinds of fascinating ways.

Gleick quotes a rather different assessment of Babbage from an early twentieth-century edition of the Dictionary of National Biography: Mathematician and scientific mechanician […] obtained government grant for making a calculating machine […] but the work of construction ceased, owing to disagreements with the engineer; offered the government an improved design, which was refused on grounds of expense […] Lucasian professor of mathematics, Cambridge, but delivered no lectures. (Cited in Gleick 2011, 121) In the words of the information retrievers, Babbage seems a resounding failure, no matter if he did (undeservedly, according to the insinuation) have Newton’s chair. Perhaps biography does not belong in dictionaries. Among other blessings that came to Babbage was one of the great friendships in intellectual history, with Augusta Ada King, Countess Lovelace. She also fell into obscurity for nearly a century after her death, but is now remembered as prodigy and prophet, the first lady of computing.


pages: 1,085 words: 219,144

Solr in Action by Trey Grainger, Timothy Potter


business intelligence, cloud computing, conceptual framework, crowdsourcing, data acquisition, failed state, fault tolerance, finite state, full text search, glass ceiling, information retrieval, natural language processing, performance metric, premature optimization, recommendation engine, web application

To begin, we need to know how Solr matches home listings in the index to queries entered by users, as this is the basis for all search applications.

1.2.1. Information retrieval engine

Solr is built on Apache Lucene, a popular, Java-based, open source, information retrieval library. We’ll save a detailed discussion of what information retrieval is for chapter 3. For now, we’ll touch on the key concepts behind information retrieval, starting with the formal definition taken from one of the prominent academic texts on modern search concepts:

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).[1]

1 Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval (Cambridge University Press, 2008).

In our example real estate application, the user’s primary need is finding a home to purchase based on location, home style, features, and price.

The key data structure supporting information retrieval is the inverted index. You’ll learn all about how an inverted index works in chapter 3. For now, it’s sufficient to review figure 1.2 to get a feel for what happens when a new document (#44 in the diagram) is added to the index and how documents are matched to query terms using the inverted index. You might be thinking that a relational database could easily return the same results using an SQL query, which is true for this simple example. But one key difference between a Lucene query and a database query is that in Lucene results are ranked by their relevance to a query, and database results can only be sorted by one or more of the table columns. In other words, ranking documents by relevance is a key aspect of information retrieval and helps differentiate it from other types of queries.
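To make the inverted-index idea and relevance ranking more tangible, here is a minimal in-memory sketch. It is not Lucene's or Solr's implementation: the `InvertedIndex` class, the sample listings, and the scoring rule (a plain sum of term frequencies) are assumptions chosen for brevity, whereas Lucene uses considerably more refined similarity functions.

```python
import re
from collections import defaultdict


def terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        """Add a document by recording each of its terms in the postings."""
        for term in terms(text):
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, query):
        """Return matching doc IDs, highest score first (ties by doc ID).

        The score is the summed frequency of the query terms in the
        document, a deliberately naive stand-in for real relevance scoring.
        """
        scores = defaultdict(int)
        for term in terms(query):
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf
        return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical home listings.
index = InvertedIndex()
index.add(42, "New home with a new kitchen")
index.add(43, "Lake home for sale")
index.add(44, "Lake house with a new dock")  # a newly added document

print(index.search("new home"))  # [42, 43, 44]
print(index.search("lake"))      # [43, 44]
```

Adding document 44 only touches the postings of the terms it contains, and a query only reads the postings of its own terms. The ordering of results comes from the computed score rather than from any stored column, which is the contrast with a database sort drawn in the excerpt.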

IBSimilarity class ICUFoldingFilterFactory idf (inverse document frequency), 2nd, 3rd, 4th if function implicit routing importing documents common formats DIH ExtractingRequestHandler Nutch relational database data using JSON using SolrJ library using XML Inactive state incremental indexing indent parameter indexlog utility IndicNormalizationFilterFactory Indonesian language IndonesianStemFilterFactory information discovery use case information retrieval. See IR. installing Solr instanceDir parameter <int> element Integrated Development Environment. See IDE. IntelliJ IDEA internationalization. See multilingual search. Intersects operation invalidating cached objects invariants section inverse document frequency. See idf. inverted index ordering of terms overview IR (information retrieval) Irish language IrishLowerCaseFilterFactory, 2nd IsDisjointTo operation IsWithin operation Italian language ItalianLightStemFilterFactory J J2EE (Java 2 Platform, Enterprise Edition) Japanese language, 2nd JapaneseBaseFormFilterFactory JapaneseKatakanaStemFilterFactory JapaneseTokenizerFactory JAR files Java 2 Platform, Enterprise Edition.


pages: 666 words: 181,495

In the Plex: How Google Thinks, Works, and Shapes Our Lives by Steven Levy


23andMe, AltaVista, Anne Wojcicki, Apple's 1984 Super Bowl advert, autonomous vehicles, book scanning, Brewster Kahle, Burning Man, business process, clean water, cloud computing, crowdsourcing, Dean Kamen, discounted cash flows, don't be evil, Douglas Engelbart, El Camino Real, fault tolerance, Firefox, Gerard Salton, Google bus, Google Chrome, Google Earth, Googley, HyperCard, hypertext link, IBM and the Holocaust, informal economy, information retrieval, Internet Archive, Jeff Bezos, Kevin Kelly, Mark Zuckerberg, Menlo Park, optical character recognition, PageRank, Paul Buchheit, Potemkin village, prediction markets, recommendation engine, risk tolerance, Sand Hill Road, Saturday Night Live, search inside the book, second-price auction, Silicon Valley, skunkworks, Skype, slashdot, social graph, social software, social web, spectrum auction, speech recognition, statistical model, Steve Ballmer, Steve Jobs, Steven Levy, Ted Nelson, telemarketer, trade route, traveling salesman, Vannevar Bush, web application, WikiLeaks, Y Combinator

When DEC opened it to outsiders on December 15, 1995, nearly 300,000 people tried it out. They were dazzled. AltaVista’s actual search quality techniques—what determined the ranking of results—were based on traditional information retrieval (IR) algorithms. Many of those algorithms arose from the work of one man, a refugee from Nazi Germany named Gerard Salton, who had come to America, got a PhD at Harvard, and moved to Cornell University, where he cofounded its computer science department. Searching through databases using the same commands you’d use with a human—“natural language” became the term of art—was Salton’s specialty. During the 1960s, Salton developed a system that was to become a model for information retrieval. It was called SMART, supposedly an acronym for “Salton’s Magical Retriever of Text.” The system established many conventions that still persist in search, including indexing and relevance algorithms.

Fortunately, Page’s visions extended to the commercial: “Probably from when I was twelve, I knew I was going to start a company eventually,” he’d later say. Page’s brother, nine years older, was already in Silicon Valley, working for an Internet start-up. Page chose to work in the department’s Human-Computer Interaction Group. The subject would stand Page in good stead in the future with respect to product development, even though it was not in the HCI domain to figure out a new model of information retrieval. On his desk and permeating his conversations was Apple interface guru Donald Norman’s classic tome The Psychology of Everyday Things, the bible of a religion whose first, and arguably only, commandment is “The user is always right.” (Other Norman disciples, such as Jeff Bezos at, were adopting this creed on the web.) Another influential book was a biography of Nikola Tesla, the brilliant Serb scientist; though Tesla’s contributions arguably matched Thomas Edison’s—and his ambitions were grand enough to impress even Page—he died in obscurity.

A key designer was Louis Monier, a droll Frenchman and idealistic geek who had come to America with a doctorate in 1980. DEC had been built on the minicomputer, a once innovative category now rendered a dinosaur by the personal computer revolution. “DEC was very much living in the past,” says Monier. “But they had small groups of people who were very forward-thinking, experimenting with lots of toys.” One of those toys was the web. Monier himself was no expert in information retrieval but a big fan of data in the abstract. “To me, that was the secret—data,” he says. What the data was telling him was that if you had the right tools, it was possible to treat everything in the open web like a single document. Even at that early date, the basic building blocks of web search had been already set in stone. Search was a four-step process. First came a sweeping scan of all the world’s web pages, via a spider.


pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack by Matthew A. Russell


Climategate, cloud computing, crowdsourcing, fault tolerance, Firefox, full text search, Georg Cantor, Google Earth, information retrieval, Mark Zuckerberg, natural language processing, NP-complete, profit motive, Saturday Night Live, semantic web, Silicon Valley, slashdot, social graph, social web, statistical model, Steve Jobs, supply-chain management, text mining, traveling salesman, Turing test, web application

Text Mining Fundamentals Although rigorous approaches to natural language processing (NLP) that include such things as sentence segmentation, tokenization, word chunking, and entity detection are necessary in order to achieve the deepest possible understanding of textual data, it’s helpful to first introduce some fundamentals from Information Retrieval theory. The remainder of this chapter introduces some of its more foundational aspects, including TF-IDF, the cosine similarity metric, and some of the theory behind collocation detection. Chapter 8 provides a deeper discussion of NLP. Note If you want to dig deeper into IR theory, the full text of Introduction to Information Retrieval is available online and provides more information than you could ever want to know about the field. A Whiz-Bang Introduction to TF-IDF Information retrieval is an extensive field with many specialties. This discussion narrows in on TF-IDF, one of the most fundamental techniques for retrieving relevant documents from a corpus.
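As a rough illustration of TF-IDF, the sketch below scores a term by multiplying its frequency within one document by the log of its rarity across the corpus. The exact formulas are an assumption: this is one common textbook variant (length-normalized term frequency, unsmoothed logarithmic IDF), not necessarily the one the book develops, and the function names and sample corpus are hypothetical.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens equal to the term."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    # A term absent from the corpus gets 0.0 here (a simplifying assumption).
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, doc_tokens, corpus):
    """Score of a term for one document relative to the whole corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical three-document corpus, already tokenized.
corpus = [
    "mr green killed colonel mustard".split(),
    "professor plum has a green plant".split(),
    "miss scarlett watered the green plant".split(),
]

# "green" occurs in every document, so its IDF (and its score) is zero;
# "mustard" occurs in only one document and scores highest there.
for term in ("green", "plant", "mustard"):
    print(term, [round(tf_idf(term, doc, corpus), 3) for doc in corpus])
```

The effect worth noticing is that a word shared by every document contributes nothing to the ranking, while a word concentrated in one document dominates that document's score.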

I identity consolidation, Brief analysis of breadth-first techniques IDF (inverse document frequency), A Whiz-Bang Introduction to TF-IDF, A Whiz-Bang Introduction to TF-IDF (see also TF-IDF) calculation of, A Whiz-Bang Introduction to TF-IDF idf function, A Whiz-Bang Introduction to TF-IDF IETF OAuth 2.0 protocol, No, You Can’t Have My Password IMAP (Internet Message Access Protocol), Analyzing Your Own Mail Data, Accessing Gmail with OAuth, Fetching and Parsing Email Messages connecting to, using OAuth, Accessing Gmail with OAuth constructing an IMAP query, Fetching and Parsing Email Messages imaplib, Fetching and Parsing Email Messages ImportError, Installing Python Development Tools indexing function, JavaScript-based, couchdb-lucene: Full-Text Indexing and More inference, Open-World Versus Closed-World Assumptions, Inferencing About an Open World with FuXi application to machine knowledge, Inferencing About an Open World with FuXi in logic-based programming languages and RDF, Open-World Versus Closed-World Assumptions influence, measuring for Twitter users, Measuring Influence, Measuring Influence, Measuring Influence, Measuring Influence calculating Twitterer’s most popular followers, Measuring Influence crawling friends/followers connections, Measuring Influence Infochimps, Strong Links API, The Infochimps “Strong Links” API, Interactive 3D Graph Visualization information retrieval industry, Before You Go Off and Try to Build a Search Engine… information retrieval theory, Text Mining Fundamentals (see IR theory) intelligent clustering, Intelligent clustering enables compelling user experiences interactive 3D graph visualization, Interactive 3D Graph Visualization interactive 3D tag clouds for tweet entities co-occurring with #JustinBieber and #TeaParty, Visualizing Tweets with Tricked-Out Tag Clouds interpreter, Python (IPython), Closing Remarks intersection operations, Elementary Set Operations, How Much Overlap Exists Between the Entities of #TeaParty and 
#JustinBieber Tweets?

For comparative purposes, note that it’s certainly possible to perform text-based indexing by writing a simple mapping function that associates keywords and documents, like the one in Example 3-10.

Example 3-10. A mapper that tokenizes documents

    def tokenizingMapper(doc):
        tokens = doc.split()
        for token in tokens:
            if isInteresting(token):  # Filter out stop words, etc.
                yield token, doc

However, you’ll quickly find that you need to do a lot more homework about basic Information Retrieval (IR) concepts if you want to establish a good scoring function to rank documents by relevance or anything beyond basic frequency analysis. Fortunately, the benefits of Lucene are many, and chances are good that you’ll want to use couchdb-lucene instead of writing your own mapping function for full-text indexing. Note Unlike the previous sections that opted to use the couchdb module, this section uses httplib to exercise CouchDB’s REST API directly and includes view functions written in JavaScript.


pages: 502 words: 107,510

Natural Language Annotation for Machine Learning by James Pustejovsky, Amber Stubbs


Amazon Mechanical Turk, bioinformatics, cloud computing, computer vision, crowdsourcing, easy for humans, difficult for computers, finite state, game design, information retrieval, iterative process, natural language processing, pattern recognition, performance metric, sentiment analysis, social web, speech recognition, statistical model, text mining

Timeline of Corpus Linguistics

Here's a quick overview of some of the milestones in the field, leading up to where we are now.

1950s: Descriptive linguists compile collections of spoken and written utterances of various languages from field research. Literary researchers begin compiling systematic collections of the complete works of different authors. Key Word in Context (KWIC) is invented as a means of indexing documents and creating concordances.

1960s: Kucera and Francis publish A Standard Corpus of Present-Day American English (the Brown Corpus), the first broadly available large corpus of language texts. Work in Information Retrieval (IR) develops techniques for statistical similarity of document content.

1970s: Stochastic models developed from speech corpora make Speech Recognition systems possible. The vector space model is developed for document indexing. The London-Lund Corpus (LLC) is developed through the work of the Survey of English Usage.

1980s: The Lancaster-Oslo-Bergen (LOB) Corpus, designed to match the Brown Corpus in terms of size and genres, is compiled.

They are also used in speech disambiguation—if a person speaks unclearly but utters a sequence that does not commonly (or ever) occur in the language being spoken, an n-gram model can help recognize that problem and find the words that the speaker probably intended to say. Another modern corpus is ClueWeb09, a dataset “created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009.” This corpus is too large to use for an annotation project (it’s about 25 terabytes uncompressed), but some projects have taken parts of the dataset (such as a subset of the English websites) and used them for research (Pomikálek et al. 2012).

So the first word in the ranking occurs about twice as often as the second word in the ranking, and three times as often as the third word in the ranking, and so on. N-grams In this section we introduce the notion of an n-gram. N-grams are important for a wide range of applications in Natural Language Processing (NLP), because fairly straightforward language models can be built using them, for speech, Machine Translation, indexing, Information Retrieval (IR), and, as we will see, classification. Imagine that we have a string of tokens, W, consisting of the elements w1, w2, … , wn. Now consider a sliding window over W. If the sliding window consists of one cell (wi), then the collection of one-cell substrings is called the unigram profile of the string; there will be as many unigram profiles as there are elements in the string. Consider now all two-cell substrings, where we look at w1 w2, w2 w3, and so forth, to the end of the string, wn–1 wn.
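The sliding window described above can be sketched in a few lines. The function name `ngrams` and the placeholder tokens are hypothetical; the sketch simply collects every run of n adjacent tokens, so a string of length 4 yields four unigrams and three bigrams.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order, as tuples."""
    # The window starts at each index that leaves room for n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Placeholder tokens standing in for the string W = w1, w2, ..., wn.
W = ["w1", "w2", "w3", "w4"]

print(ngrams(W, 1))  # [('w1',), ('w2',), ('w3',), ('w4',)]
print(ngrams(W, 2))  # [('w1', 'w2'), ('w2', 'w3'), ('w3', 'w4')]
```

If n exceeds the number of tokens, the range is empty and the function returns an empty list rather than raising an error.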


pages: 481 words: 121,669

The Invisible Web: Uncovering Information Sources Search Engines Can't See by Gary Price, Chris Sherman, Danny Sullivan


AltaVista, American Society of Civil Engineers: Report Card, bioinformatics, Brewster Kahle, business intelligence, dark matter, Douglas Engelbart, full text search, HyperCard, hypertext link, information retrieval, Internet Archive, joint-stock company, knowledge worker, natural language processing, pre–internet, profit motive, publish or perish, search engine result page, side project, Silicon Valley, speech recognition, stealth mode startup, Ted Nelson, Vannevar Bush, web application

Despite this increased accessibility, the Internet was still primarily a tool for academics and government contractors well into the early 1990s. As more and more computers connected to the Internet, users began to demand tools that would allow them to search for and locate text and other files on computers anywhere on the Net. Early Net Search Tools Although sophisticated search and information retrieval techniques date back to the late 1950s and early ‘60s, these techniques were used primarily in closed or proprietary systems. Early Internet search and retrieval tools lacked even the most basic capabilities, primarily because it was thought that traditional information retrieval techniques would not work well on an open, unstructured information universe like the Internet. Accessing a file on the Internet was a two-part process. First, you needed to establish direct connection to the remote computer where the file was located using a terminal emulation program called Telnet.

But they relied on Web page authors to submit information, and the Web’s relentless growth rate ultimately made it impossible to keep the lists either current or comprehensive. What was needed was an automated approach to Web page discovery and indexing. The Web had now grown large enough that information scientists became interested in creating search services specifically for the Web. Sophisticated information retrieval techniques had been available since the early 1960s, but they were only effective when searching closed, relatively structured databases. The open, laissez-faire nature of the Web made it too messy to easily adapt traditional information retrieval techniques. New, Web-centric approaches were needed. But how best to approach the problem? Web search would clearly have to be more sophisticated than a simple Archie-type service. But should these new “search engines” attempt to index the full text of Web documents, much as earlier Gopher tools had done, or simply broker requests to local Web search services on individual computers, following the WAIS model?

The First Search Engines Tim Berners-Lee’s vision of the Web was of an information space where data of all types could be freely accessed. But in the early days of the Web, the reality was that most of the Web consisted of simple HTML text documents. Since few servers offered local site search services, developers of the first Web search engines opted for the model of indexing the full text of pages stored on Web servers. To adapt traditional information retrieval techniques to Web search, they built huge databases that attempted to replicate the Web, searching over these relatively controlled, closed archives of pages rather than trying to search the Web itself in real time. With this fateful architectural decision, limiting search engines to HTML text documents and essentially ignoring all other types of data available via the Web, the Invisible Web was born.


pages: 205 words: 20,452

Data Mining in Time Series Databases by Mark Last, Abraham Kandel, Horst Bunke


4chan, call centre, computer vision, discrete time, information retrieval, iterative process, NP-complete, p-value, pattern recognition, random walk, sensor fusion, speech recognition, web application

., Sawhney, H.S., and Shim, K. (1995). Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Database. Proc. 21st Int. Conf. on Very Large Databases (VLDB), pp. 490–501. 3. Baeza-Yates, R. and Gonnet, G.H. (1999). A Fast Algorithm on Average for All-Against-All Sequence Matching. Proc. 6th String Processing and Information Retrieval Symposium (SPIRE), pp. 16–23. 4. Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press/Addison–Wesley Longman Limited. 5. Chakrabarti, K. and Mehrotra, S. (1999). The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces. Proc. 15th Int. Conf. on Data Engineering (ICDE), pp. 440–447. 6. Chan, K. and Fu, A.W. (1999). Efficient Time Series Matching by Wavelets. Proc. 15th Int. Conf. on Data Engineering (ICDE), pp. 126–133. 7.

An Enhanced Representation of Time Series which Allows Fast and Accurate Classification, Clustering and Relevance Feedback. Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining, AAAI Press, pp. 239–241. 14. Keogh, E. and Pazzani, M. (1999). Relevance Feedback Retrieval of Time Series Data. Proceedings of the 22nd Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 183–190. 15. Keogh, E. and Smyth, P. (1997). A Probabilistic Approach to Fast Pattern Matching in Time Series Databases. Proceedings of the 3rd International Conference of Knowledge Discovery and Data Mining, pp. 24–20. 16. Last, M., Klein, Y., and Kandel, A. (2001). Knowledge Discovery in Time Series Databases. IEEE Transactions on Systems, Man, and Cybernetics, 31B(1), 160–169. 17.

In many of these applications, searching through large, unstructured databases based on sample sequences is often desirable. Such similarity-based retrieval has attracted a great deal of attention in recent years. Although several different approaches have appeared, most are based on the common premise of dimensionality reduction and spatial access methods. This chapter gives an overview of recent research and shows how the methods fit into a general context of signature extraction. Keywords: Information retrieval; sequence databases; similarity search; spatial indexing; time sequences. 1. Introduction Time sequences arise in many applications—any applications that involve storing sensor inputs, or sampling a value that changes over time. A problem which has received an increasing amount of attention lately is the problem of similarity retrieval in databases of time sequences, so-called “query by example.”


Bootstrapping: Douglas Engelbart, Coevolution, and the Origins of Personal Computing (Writing Science) by Thierry Bardini


Apple II, augmented reality, Bill Duvall, conceptual framework, Douglas Engelbart, Dynabook, experimental subject, Grace Hopper, hiring and firing, hypertext link, index card, information retrieval, invention of hypertext, Jaron Lanier, Jeff Rulifson, John von Neumann, knowledge worker, Menlo Park, Mother of all demos, new economy, Norbert Wiener, packet switching, QWERTY keyboard, Ralph Waldo Emerson, RAND corporation, RFC: Request For Comment, Silicon Valley, Steve Crocker, Steve Jobs, Steve Wozniak, Steven Levy, Stewart Brand, stochastic process, Ted Nelson, the medium is the message, theory of mind, Turing test, unbiased observer, Vannevar Bush, Whole Earth Catalog

The regnant term at the time for what Bush was proposing was indeed "information retrieval," and Engelbart himself has testified to the power that a preconceived notion of information retrieval held for creating misunderstanding of his work on hypertext networks: I started trying to reach out to make connections in domains of interest and concerns out there that fit along the vector I was interested in. I went to the information retrieval people. I remember one instance when I went to the Ford Foundation's Center for Advanced Study in Social Sciences to see somebody who was there for a year, who was into information retrieval. We sat around. In fact, at coffee break, there were about five people sitting there. I was trying to explain what I wanted to do and one guy just kept telling me, "You are just giving fancy names to information retrieval. Why do that? Why don't you just admit that it's information retrieval and get on with the rest of it and make it all work?"

Why don't you just admit that it's information retrieval and get on with the rest of it and make it all work?" He was getting kind of nasty. The other guy was trying to get him to back off. (Engelbart 1996) It seems difficult to dispute, therefore, that the Memex was not conceived as a medium, only as a personal "tool" for information retrieval. Personal access to information was emphasized over communication. The later research of Ted Nelson on hypertext is very representative of that emphasis. 4 It is problematic, however, to grant Bush the status of the "unique forefather" of computerized hypertext systems. The situation is more complicated than that. 5 For the development of hypertext, the important distinction is not between personal access to information and communication, but between different conceptions of what communication could mean, and there were in fact two different approaches to communication at the origin of current hypertext and hypermedia systems.

The second is represented by Douglas Engelbart and his NLS, as his oN-Line System was called, which was conceived as a way to support group collaboration. The difference in objectives signals the difference in means that characterized the two approaches. The first revolved around the "association" of ideas on the model of how the individual mind is supposed to work. The second revolved around the intersubjective "connection" of words in the systems of natural languages. What actually differentiates hypertext systems from information-retrieval systems is not the process of "association," the term Bush proposed as analogous to the way the individual mind works. Instead, what constitutes a hypertext system is clear in the definition of hypertext already cited: "a style of building systems for information representation and management around a network of nodes connected together by typed links." A hypertext system is constituted by the presence of "links."


pages: 394 words: 108,215

What the Dormouse Said: How the Sixties Counterculture Shaped the Personal Computer Industry by John Markoff


Any sufficiently advanced technology is indistinguishable from magic, Apple II, back-to-the-land, Bill Duvall, Bill Gates: Altair 8800, Buckminster Fuller, California gold rush, card file, computer age, computer vision, conceptual framework, cuban missile crisis, Douglas Engelbart, Dynabook, El Camino Real, general-purpose programming language, Golden Gate Park, Hacker Ethic, hypertext link, informal economy, information retrieval, invention of the printing press, Jeff Rulifson, John Nash: game theory, John von Neumann, Kevin Kelly, knowledge worker, Mahatma Gandhi, Menlo Park, Mother of all demos, Norbert Wiener, packet switching, Paul Terrell, popular electronics, QWERTY keyboard, RAND corporation, RFC: Request For Comment, Richard Stallman, Robert X Cringely, Sand Hill Road, Silicon Valley, Silicon Valley startup, South of Market, San Francisco, speech recognition, Steve Crocker, Steve Jobs, Steve Wozniak, Steven Levy, Stewart Brand, Ted Nelson, Thorstein Veblen, Turing test, union organizing, Vannevar Bush, Whole Earth Catalog, William Shockley: the traitorous eight

But the AI researchers translated his ideas into their own, and the concept of Augmentation seemed pallid when viewed through their eyes, reduced to the more mundane idea of information retrieval, missing Engelbart’s dream entirely.4 Gradually, he began to understand that the AI community was actually his philosophical enemy. After all, their vision was to replace humans with machines, while he wanted to extend and empower people. Engelbart would later say that he had nothing against the vision of AI but just believed that it would be decades and decades before it could be realized. He thought his idea was the one that was more practical. He frequently ran up against a wall of intellectual prejudice, which continued to plague him throughout his career. In 1960, Engelbart presented a paper at the annual meeting of the American Documentation Institute, outlining how computer systems of the future might change the role of information-retrieval specialists.

In 1960, Engelbart presented a paper at the annual meeting of the American Documentation Institute, outlining how computer systems of the future might change the role of information-retrieval specialists. The idea didn’t sit at all well with his audience, which gave his paper a blasé reception. He also got into an argument with a researcher who asserted that Engelbart was proposing nothing that was any different from any of the other information-retrieval efforts that were already under way. It was a long and lonely two years. The state of the art of computer science was moving quickly toward mathematical algorithms, and the computer scientists looked down their nose at his work, belittling it as mere office automation and hence beneath their notice. Moreover, his support from the air force was slightly suspect as well. The Office of Scientific Research had a reputation for funding way-out ideas, or in some cases outright kooks. Engelbart’s research was in danger of being thrown in with the work of somebody who was studying the clustering behavior of gnats.

There was an abyss between the original work done by Engelbart’s group in the sixties and the motley crew of hobbyists that would create the personal-computer industry beginning in 1975. In their hunger to possess their own computers, the PC hobbyists would miss the crux of the original idea: communications as an integral part of the design. That was at the heart of the epiphanies that Engelbart had years earlier, which led to the realization of Vannevar Bush’s Memex information-retrieval system of the 1940s. During the period from the early 1960s until 1969, when most of the development of the NLS system was completed, Engelbart and his band of researchers remained in a comfortable bubble. They were largely Pentagon funded, but unlike many of the engineering and computing groups that surrounded them at SRI, they weren’t doing work that directly contributed to the Vietnam War.


pages: 504 words: 89,238

Natural language processing with Python by Steven Bird, Ewan Klein, Edward Loper


bioinformatics, business intelligence, conceptual framework, elephant in my pajamas, finite state, Firefox, information retrieval, Menlo Park, natural language processing, P = NP, search inside the book, speech recognition, statistical model, text mining, Turing test

This can be broken down into two subtasks: identifying the boundaries of the NE, and identifying its type. While named entity recognition is frequently a prelude to identifying relations in Information Extraction, it can also contribute to other tasks. For example, in Question Answering (QA), we try to improve the precision of Information Retrieval by recovering not whole pages, but just those parts which contain an answer to the user’s question. Most QA systems take the documents returned by standard Information Retrieval, and then attempt to isolate the minimal text snippet in the document containing the answer. Now suppose the question was Who was the first President of the US?, and one of the documents that was retrieved contained the following passage: (5) The Washington Monument is the most prominent structure in Washington, D.C. and one of the city’s early attractions.

English, 63 code blocks, nested, 25 code examples, downloading, 57 code points, 94 codecs module, 95 coindex (in feature structure), 340 collocations, 20, 81 comma operator (,), 133 comparative wordlists, 65 comparison operators numerical, 22 for words, 23 complements of lexical head, 347 complements of verbs, 313 complex types, 373 complex values, 336 components, language understanding, 31 computational linguistics, challenges of natural language, 441 computer understanding of sentence meaning, 368 concatenation, 11, 88 lists and strings, 87 strings, 16 conclusions in logic, 369 concordances creating, 40 graphical POS-concordance tool, 184 conditional classifiers, 254 conditional expressions, 25 conditional frequency distributions, 44, 52–56 combining with regular expressions, 103 condition and event pairs, 52 counting words by genre, 52 generating random text with bigrams, 55 male and female names ending in each alphabet letter, 62 plotting and tabulating distributions, 53 using to find minimally contrasting set of words, 64 ConditionalFreqDist, 52 commonly used methods, 56 conditionals, 22, 133 confusion matrix, 207, 240 consecutive classification, 232 non phrase chunking with consecutive classifier, 275 consistent, 366 466 | General Index constituent structure, 296 constituents, 297 context exploiting in part-of-speech classifier, 230 for taggers, 203 context-free grammar, 298, 300 (see also grammars) probabilistic context-free grammar, 320 contractions in tokenization, 112 control, 22 control structures, 26 conversion specifiers, 118 conversions of data formats, 419 coordinate structures, 295 coreferential, 373 corpora, 39–52 annotated text corpora, 46–48 Brown Corpus, 42–44 creating and accessing, resources for further reading, 438 defined, 39 differences in corpus access methods, 50 exploring text corpora using a chunker, 267 Gutenberg Corpus, 39–42 Inaugural Address Corpus, 45 from languages other than English, 48 loading your own corpus, 51 obtaining from 
Web, 416 Reuters Corpus, 44 sources of, 73 tagged, 181–189 text corpus structure, 49–51 web and chat text, 42 wordlists, 60–63 corpora, included with NLTK, 46 corpus case study, structure of TIMIT, 407–412 corpus HOWTOs, 122 life cycle of, 412–416 creation scenarios, 412 curation versus evolution, 415 quality control, 413 widely-used format for, 421 counters, legitimate uses of, 141 cross-validation, 241 CSV (comma-separated value) format, 418 CSV (comma-separated-value) format, 170 D \d decimal digits in regular expressions, 110 \D nondigit characters in regular expressions, 111 data formats, converting, 419 data types dictionary, 190 documentation for Python standard types, 173 finding type of Python objects, 86 function parameter, 146 operations on objects, 86 database query via natural language, 361–365 databases, obtaining data from, 418 debugger (Python), 158 debugging techniques, 158 decimal integers, formatting, 119 decision nodes, 242 decision stumps, 243 decision trees, 242–245 entropy and information gain, 243 decision-tree classifier, 229 declarative style, 140 decoding, 94 def keyword, 9 defaultdict, 193 defensive programming, 159 demonstratives, agreement with noun, 329 dependencies, 310 criteria for, 312 existential dependencies, modeling in XML, 427 non-projective, 312 projective, 311 unbounded dependency constructions, 349–353 dependency grammars, 310–315 valency and the lexicon, 312 dependents, 310 descriptive models, 255 determiners, 186 agreement with nouns, 333 deve-test set, 225 development set, 225 similarity to test set, 238 dialogue act tagging, 214 dialogue acts, identifying types, 235 dialogue systems (see spoken dialogue systems) dictionaries feature set, 223 feature structures as, 337 pronouncing dictionary, 63–65 Python, 189–198 default, 193 defining, 193 dictionary data type, 190 finding key given a value, 197 indexing lists versus, 189 summary of dictionary methods, 197 updating incrementally, 195 storing features and values, 327 
translation, 66 dictionary methods, 197 dictionary data structure (Python), 65 directed acyclic graphs (DAGs), 338 discourse module, 401 discourse semantics, 397–402 discourse processing, 400–402 discourse referents, 397 discourse representation structure (DRS), 397 Discourse Representation Theory (DRT), 397–400 dispersion plot, 6 divide-and-conquer strategy, 160 docstrings, 143 contents and structure of, 148 example of complete docstring, 148 module-level, 155 doctest block, 148 doctest module, 160 document classification, 227 documentation functions, 148 online Python documentation, versions and, 173 Python, resources for further information, 173 docutils module, 148 domain (of a model), 377 DRS (discourse representation structure), 397 DRS conditions, 397 DRT (Discourse Representation Theory), 397– 400 Dublin Core Metadata initiative, 435 duck typing, 281 dynamic programming, 165 General Index | 467 application to parsing with context-free grammar, 307 different approaches to, 167 E Earley chart parser, 334 electronic books, 80 elements, XML, 425 ElementTree interface, 427–429 using to access Toolbox data, 429 elif clause, if . . . 
elif statement, 133 elif statements, 26 else statements, 26 encoding, 94 encoding features, 223 encoding parameters, codecs module, 95 endangered languages, special considerations with, 423–424 entities, 373 entity detection, using chunking, 264–270 entries adding field to, in Toolbox, 431 contents of, 60 converting data formats, 419 formatting in XML, 430 entropy, 251 (see also Maximum Entropy classifiers) calculating for gender prediction task, 243 maximizing in Maximum Entropy classifier, 252 epytext markup language, 148 equality, 132, 372 equivalence (<->) operator, 368 equivalent, 340 error analysis, 225 errors runtime, 13 sources of, 156 syntax, 3 evaluation sets, 238 events, pairing with conditions in conditional frequency distribution, 52 exceptions, 158 existential quantifier, 374 exists operator, 376 Expected Likelihood Estimation, 249 exporting data, 117 468 | General Index F f-structure, 357 feature extractors defining for dialogue acts, 235 defining for document classification, 228 defining for noun phrase (NP) chunker, 276–278 defining for punctuation, 234 defining for suffix checking, 229 Recognizing Textual Entailment (RTE), 236 selecting relevant features, 224–227 feature paths, 339 feature sets, 223 feature structures, 328 order of features, 337 resources for further reading, 357 feature-based grammars, 327–360 auxiliary verbs and inversion, 348 case and gender in German, 353 example grammar, 333 extending, 344–356 lexical heads, 347 parsing using Earley chart parser, 334 processing feature structures, 337–344 subsumption and unification, 341–344 resources for further reading, 357 subcategorization, 344–347 syntactic agreement, 329–331 terminology, 336 translating from English to SQL, 362 unbounded dependency constructions, 349–353 using attributes and constraints, 331–336 features, 223 non-binary features in naive Bayes classifier, 249 fields, 136 file formats, libraries for, 172 files opening and reading local files, 84 writing program output 
to, 120 fillers, 349 first-order logic, 372–385 individual variables and assignments, 378 model building, 383 quantifier scope ambiguity, 381 summary of language, 376 syntax, 372–375 theorem proving, 375 truth in model, 377 floating-point numbers, formatting, 119 folds, 241 for statements, 26 combining with if statements, 26 inside a list comprehension, 63 iterating over characters in strings, 90 format strings, 118 formatting program output, 116–121 converting from lists to strings, 116 strings and formats, 117–118 text wrapping, 120 writing results to file, 120 formulas of propositional logic, 368 formulas, type (t), 373 free, 375 Frege’s Principle, 385 frequency distributions, 17, 22 conditional (see conditional frequency distributions) functions defined for, 22 letters, occurrence in strings, 90 functions, 142–154 abstraction provided by, 147 accumulative, 150 as arguments to another function, 149 call-by-value parameter passing, 144 checking parameter types, 146 defined, 9, 57 documentation for Python built-in functions, 173 documenting, 148 errors from, 157 for frequency distributions, 22 for iteration over sequences, 134 generating plurals of nouns (example), 58 higher-order, 151 inputs and outputs, 143 named arguments, 152 naming, 142 poorly-designed, 147 recursive, call structure, 165 saving in modules, 59 variable scope, 145 well-designed, 147 gazetteer, 282 gender identification, 222 Decision Tree model for, 242 gender in German, 353–356 Generalized Phrase Structure Grammar (GPSG), 345 generate_model ( ) function, 55 generation of language output, 29 generative classifiers, 254 generator expressions, 138 functions exemplifying, 151 genres, systematic differences between, 42–44 German, case and gender in, 353–356 gerunds, 211 glyphs, 94 gold standard, 201 government-sponsored challenges to machine learning application in NLP, 257 gradient (grammaticality), 318 grammars, 327 (see also feature-based grammars) chunk grammar, 265 context-free, 298–302 parsing 
with, 302–310 validating Toolbox entries with, 433 writing your own, 300 dependency, 310–315 development, 315–321 problems with ambiguity, 317 treebanks and grammars, 315–317 weighted grammar, 318–321 dilemmas in sentence structure analysis, 292–295 resources for further reading, 322 scaling up, 315 grammatical category, 328 graphical displays of data conditional frequency distributions, 56 Matplotlib, 168–170 graphs defining and manipulating, 170 directed acyclic graphs, 338 greedy sequence classification, 232 Gutenberg Corpus, 40–42, 80 G hapaxes, 19 hash arrays, 189, 190 (see also dictionaries) gaps, 349 H General Index | 469 head of a sentence, 310 criteria for head and dependencies, 312 heads, lexical, 347 headword (lemma), 60 Heldout Estimation, 249 hexadecimal notation for Unicode string literal, 95 Hidden Markov Models, 233 higher-order functions, 151 holonyms, 70 homonyms, 60 HTML documents, 82 HTML markup, stripping out, 418 hypernyms, 70 searching corpora for, 106 semantic similarity and, 72 hyphens in tokenization, 110 hyponyms, 69 I identifiers for variables, 15 idioms, Python, 24 IDLE (Interactive DeveLopment Environment), 2 if . . . 
elif statements, 133 if statements, 25 combining with for statements, 26 conditions in, 133 immediate constituents, 297 immutable, 93 implication (->) operator, 368 in operator, 91 Inaugural Address Corpus, 45 inconsistent, 366 indenting code, 138 independence assumption, 248 naivete of, 249 indexes counting from zero (0), 12 list, 12–14 mapping dictionary definition to lexeme, 419 speeding up program by using, 163 string, 15, 89, 91 text index created using a stemmer, 107 words containing a given consonant-vowel pair, 103 inference, 369 information extraction, 261–289 470 | General Index architecture of system, 263 chunking, 264–270 defined, 262 developing and evaluating chunkers, 270– 278 named entity recognition, 281–284 recursion in linguistic structure, 278–281 relation extraction, 284 resources for further reading, 286 information gain, 243 inside, outside, begin tags (see IOB tags) integer ordinal, finding for character, 95 interpreter >>> prompt, 2 accessing, 2 using text editor instead of to write programs, 56 inverted clauses, 348 IOB tags, 269, 286 reading, 270–272 is operator, 145 testing for object identity, 132 ISO 639 language codes, 65 iterative optimization techniques, 251 J joint classifier models, 231 joint-features (maximum entropy model), 252 K Kappa coefficient (k), 414 keys, 65, 191 complex, 196 keyword arguments, 153 Kleene closures, 100 L lambda expressions, 150, 386–390 example, 152 lambda operator (λ), 386 Lancaster stemmer, 107 language codes, 65 language output, generating, 29 language processing, symbol processing versus, 442 language resources describing using OLAC metadata, 435–437 LanguageLog (linguistics blog), 35 latent semantic analysis, 171 Latin-2 character encoding, 94 leaf nodes, 242 left-corner parser, 306 left-recursive, 302 lemmas, 60 lexical relationships between, 71 pairing of synset with a word, 68 lemmatization, 107 example of, 108 length of a text, 7 letter trie, 162 lexical categories, 179 lexical entry, 60 lexical 
relations, 70 lexical resources comparative wordlists, 65 pronouncing dictionary, 63–65 Shoebox and Toolbox lexicons, 66 wordlist corpora, 60–63 lexicon, 60 (see also lexical resources) chunking Toolbox lexicon, 434 defined, 60 validating in Toolbox, 432–435 LGB rule of name resolution, 145 licensed, 350 likelihood ratios, 224 Linear-Chain Conditional Random Field Models, 233 linguistic objects, mappings from keys to values, 190 linguistic patterns, modeling, 255 linguistics and NLP-related concepts, resources for, 34 list comprehensions, 24 for statement in, 63 function invoked in, 64 used as function parameters, 55 lists, 10 appending item to, 11 concatenating, using + operator, 11 converting to strings, 116 indexing, 12–14 indexing, dictionaries versus, 189 normalizing and sorting, 86 Python list type, 86 sorted, 14 strings versus, 92 tuples versus, 136 local variables, 58 logic first-order, 372–385 natural language, semantics, and, 365–368 propositional, 368–371 resources for further reading, 404 logical constants, 372 logical form, 368 logical proofs, 370 loops, 26 looping with conditions, 26 lowercase, converting text to, 45, 107 M machine learning application to NLP, web pages for government challenges, 257 decision trees, 242–245 Maximum Entropy classifiers, 251–254 naive Bayes classifiers, 246–250 packages, 237 resources for further reading, 257 supervised classification, 221–237 machine translation (MT) limitations of, 30 using NLTK’s babelizer, 30 mapping, 189 Matplotlib package, 168–170 maximal projection, 347 Maximum Entropy classifiers, 251–254 Maximum Entropy Markov Models, 233 Maximum Entropy principle, 253 memoization, 167 meronyms, 70 metadata, 435 OLAC (Open Language Archives Community), 435 modals, 186 model building, 383 model checking, 379 models interpretation of sentences of logical language, 371 of linguistic patterns, 255 representation using set theory, 367 truth-conditional semantics in first-order logic, 377 General Index | 471 what can 
be learned from models of language, 255 modifiers, 314 modules defined, 59 multimodule programs, 156 structure of Python module, 154 morphological analysis, 213 morphological cues to word category, 211 morphological tagging, 214 morphosyntactic information in tagsets, 212 MSWord, text from, 85 mutable, 93 N \n newline character in regular expressions, 111 n-gram tagging, 203–208 across sentence boundaries, 208 combining taggers, 205 n-gram tagger as generalization of unigram tagger, 203 performance limitations, 206 separating training and test data, 203 storing taggers, 206 unigram tagging, 203 unknown words, 206 naive Bayes assumption, 248 naive Bayes classifier, 246–250 developing for gender identification task, 223 double-counting problem, 250 as generative classifier, 254 naivete of independence assumption, 249 non-binary features, 249 underlying probabilistic model, 248 zero counts and smoothing, 248 name resolution, LGB rule for, 145 named arguments, 152 named entities commonly used types of, 281 relations between, 284 named entity recognition (NER), 281–284 Names Corpus, 61 negative lookahead assertion, 284 NER (see named entity recognition) nested code blocks, 25 NetworkX package, 170 new words in languages, 212 472 | General Index newlines, 84 matching in regular expressions, 109 printing with print statement, 90 resources for further information, 122 non-logical constants, 372 non-standard words, 108 normalizing text, 107–108 lemmatization, 108 using stemmers, 107 noun phrase (NP), 297 noun phrase (NP) chunking, 264 regular expression–based NP chunker, 267 using unigram tagger, 272 noun phrases, quantified, 390 nouns categorizing and tagging, 184 program to find most frequent noun tags, 187 syntactic agreement, 329 numerically intense algorithms in Python, increasing efficiency of, 257 NumPy package, 171 O object references, 130 copying, 132 objective function, 114 objects, finding data type for, 86 OLAC metadata, 74, 435 definition of metadata, 435 Open 
Language Archives Community, 435 Open Archives Initiative (OAI), 435 open class, 212 open formula, 374 Open Language Archives Community (OLAC), 435 operators, 369 (see also names of individual operators) addition and multiplication, 88 Boolean, 368 numerical comparison, 22 scope of, 157 word comparison, 23 or operator, 24 orthography, 328 out-of-vocabulary items, 206 overfitting, 225, 245 P packages, 59 parameters, 57 call-by-value parameter passing, 144 checking types of, 146 defined, 9 defining for functions, 143 parent nodes, 279 parsing, 318 (see also grammars) with context-free grammar left-corner parser, 306 recursive descent parsing, 303 shift-reduce parsing, 304 well-formed substring tables, 307–310 Earley chart parser, parsing feature-based grammars, 334 parsers, 302 projective dependency parser, 311 part-of-speech tagging (see POS tagging) partial information, 341 parts of speech, 179 PDF text, 85 Penn Treebank Corpus, 51, 315 personal pronouns, 186 philosophical divides in contemporary NLP, 444 phonetics computer-readable phonetic alphabet (SAMPA), 137 phones, 63 resources for further information, 74 phrasal level, 347 phrasal projections, 347 pipeline for NLP, 31 pixel images, 169 plotting functions, Matplotlib, 168 Porter stemmer, 107 POS (part-of-speech) tagging, 179, 208, 229 (see also tagging) differences in POS tagsets, 213 examining word context, 230 finding IOB chunk tag for word's POS tag, 272 in information retrieval, 263 morphology in POS tagsets, 212 resources for further reading, 214 simplified tagset, 183 storing POS tags in tagged corpora, 181 tagged data from four Indian languages, 182 unsimplified tags, 187 use in noun phrase chunking, 265 using consecutive classifier, 231 pre-sorting, 160 precision, evaluating search tasks for, 239 precision/recall trade-off in information retrieval, 205 predicates (first-order logic), 372 prepositional phrase (PP), 297 prepositional phrase attachment ambiguity, 300 Prepositional Phrase Attachment 
Corpus, 316 prepositions, 186 present participles, 211 Principle of Compositionality, 385, 443 print statements, 89 newline at end, 90 string formats and, 117 prior probability, 246 probabilistic context-free grammar (PCFG), 320 probabilistic model, naive Bayes classifier, 248 probabilistic parsing, 318 procedural style, 139 processing pipeline (NLP), 86 productions in grammars, 293 rules for writing CFGs for parsing in NLTK, 301 program development, 154–160 debugging techniques, 158 defensive programming, 159 multimodule programs, 156 Python module structure, 154 sources of error, 156 programming style, 139 programs, writing, 129–177 advanced features of functions, 149–154 algorithm design, 160–167 assignment, 130 conditionals, 133 equality, 132 functions, 142–149 resources for further reading, 173 sequences, 133–138 style considerations, 138–142 legitimate uses for counters, 141 procedural versus declarative style, 139 General Index | 473 Python coding style, 138 summary of important points, 172 using Python libraries, 167–172 Project Gutenberg, 80 projections, 347 projective, 311 pronouncing dictionary, 63–65 pronouns anaphoric antecedents, 397 interpreting in first-order logic, 373 resolving in discourse processing, 401 proof goal, 376 properties of linguistic categories, 331 propositional logic, 368–371 Boolean operators, 368 propositional symbols, 368 pruning decision nodes, 245 punctuation, classifier for, 233 Python carriage return and linefeed characters, 80 codecs module, 95 dictionary data structure, 65 dictionary methods, summary of, 197 documentation, 173 documentation and information resources, 34 ElementTree module, 427 errors in understanding semantics of, 157 finding type of any object, 86 getting started, 2 increasing efficiency of numerically intense algorithms, 257 libraries, 167–172 CSV, 170 Matplotlib, 168–170 NetworkX, 170 NumPy, 171 other, 172 reference materials, 122 style guide for Python code, 138 textwrap module, 120 Python Package 
Index, 172 Q quality control in corpus creation, 413 quantification first-order logic, 373, 380 quantified noun phrases, 390 scope ambiguity, 381, 394–397 474 | General Index quantified formulas, interpretation of, 380 questions, answering, 29 quotation marks in strings, 87 R random text generating in various styles, 6 generating using bigrams, 55 raster (pixel) images, 169 raw strings, 101 raw text, processing, 79–128 capturing user input, 85 detecting word patterns with regular expressions, 97–101 formatting from lists to strings, 116–121 HTML documents, 82 NLP pipeline, 86 normalizing text, 107–108 reading local files, 84 regular expressions for tokenizing text, 109– 112 resources for further reading, 122 RSS feeds, 83 search engine results, 82 segmentation, 112–116 strings, lowest level text processing, 87–93 summary of important points, 121 text from web and from disk, 80 text in binary formats, 85 useful applications of regular expressions, 102–106 using Unicode, 93–97 raw( ) function, 41 re module, 101, 110 recall, evaluating search tasks for, 240 Recognizing Textual Entailment (RTE), 32, 235 exploiting word context, 230 records, 136 recursion, 161 function to compute Sanskrit meter (example), 165 in linguistic structure, 278–281 tree traversal, 280 trees, 279–280 performance and, 163 in syntactic structure, 301 recursive, 301 recursive descent parsing, 303 reentrancy, 340 references (see object references) regression testing framework, 160 regular expressions, 97–106 character class and other symbols, 110 chunker based on, evaluating, 272 extracting word pieces, 102 finding word stems, 104 matching initial and final vowel sequences and all consonants, 102 metacharacters, 101 metacharacters, summary of, 101 noun phrase (NP) chunker based on, 265 ranges and closures, 99 resources for further information, 122 searching tokenized text, 105 symbols, 110 tagger, 199 tokenizing text, 109–112 use in PlaintextCorpusReader, 51 using basic metacharacters, 98 using for 
relation extraction, 284 using with conditional frequency distributions, 103 relation detection, 263 relation extraction, 284 relational operators, 22 reserved words, 15 return statements, 144 return value, 57 reusing code, 56–59 creating programs using a text editor, 56 functions, 57 modules, 59 Reuters Corpus, 44 root element (XML), 427 root hypernyms, 70 root node, 242 root synsets, 69 Rotokas language, 66 extracting all consonant-vowel sequences from words, 103 Toolbox file containing lexicon, 429 RSS feeds, 83 feedparser library, 172 RTE (Recognizing Textual Entailment), 32, 235 exploiting word context, 230 runtime errors, 13 S \s whitespace characters in regular expressions, 111 \S nonwhitespace characters in regular expressions, 111 SAMPA computer-readable phonetic alphabet, 137 Sanskrit meter, computing, 165 satisfies, 379 scope of quantifiers, 381 scope of variables, 145 searches binary search, 160 evaluating for precision and recall, 239 processing search engine results, 82 using POS tags, 187 segmentation, 112–116 in chunking and tokenization, 264 sentence, 112 word, 113–116 semantic cues to word category, 211 semantic interpretations, NLTK functions for, 393 semantic role labeling, 29 semantics natural language, logic and, 365–368 natural language, resources for information, 403 semantics of English sentences, 385–397 quantifier ambiguity, 394–397 transitive verbs, 391–394 λ-calculus, 386–390 SemCor tagging, 214 sentence boundaries, tagging across, 208 sentence segmentation, 112, 233 in chunking, 264 in information retrieval process, 263 sentence structure, analyzing, 291–326 context-free grammar, 298–302 dependencies and dependency grammar, 310–315 grammar development, 315–321 grammatical dilemmas, 292 parsing with context-free grammar, 302–310 resources for further reading, 322 summary of important points, 321 syntax, 295–298 sents( ) function, 41 sequence classification, 231–233 other methods, 233 POS tagging with consecutive 
classifier, 232 sequence iteration, 134 sequences, 133–138 combining different sequence types, 136 converting between sequence types, 135 operations on sequence types, 134 processing using generator expressions, 137 strings and lists as, 92 shift operation, 305 shift-reduce parsing, 304 Shoebox, 66, 412 sibling nodes, 279 signature, 373 similarity, semantic, 71 Sinica Treebank Corpus, 316 slash categories, 350 slicing lists, 12, 13 strings, 15, 90 smoothing, 249 space-time trade-offs in algorithm design, 163 spaces, matching in regular expressions, 109 Speech Synthesis Markup Language (W3C SSML), 214 spellcheckers, Words Corpus used by, 60 spoken dialogue systems, 31 spreadsheets, obtaining data from, 418 SQL (Structured Query Language), 362 translating English sentence to, 362 stack trace, 158 standards for linguistic data creation, 421 standoff annotation, 415, 421 start symbol for grammars, 298, 334 startswith( ) function, 45 stemming, 107 NLTK HOWTO, 122 stemmers, 107 using regular expressions, 104 using stem( ) function, 105 stopwords, 60 stress (in pronunciation), 64 string formatting expressions, 117 string literals, Unicode string literal in Python, 95 strings, 15, 87–93 accessing individual characters, 89 accessing substrings, 90 basic operations with, 87–89 converting lists to, 116 formats, 117–118 formatting lining things up, 118 tabulating data, 119 immutability of, 93 lists versus, 92 methods, 92 more operations on, useful string methods, 92 printing, 89 Python’s str data type, 86 regular expressions as, 101 tokenizing, 86 structurally ambiguous sentences, 300 structure sharing, 340 interaction with unification, 343 structured data, 261 style guide for Python code, 138 stylistics, 43 subcategories of verbs, 314 subcategorization, 344–347 substrings (WFST), 307 substrings, accessing, 90 subsumes, 341 subsumption, 341–344 suffixes, classifier for, 229 supervised classification, 222–237 choosing features, 224–227 documents, 227 
exploiting context, 230 gender identification, 222 identifying dialogue act types, 235 part-of-speech tagging, 229 Recognizing Textual Entailment (RTE), 235 scaling up to large datasets, 237 sentence segmentation, 233 sequence classification, 231–233 Swadesh wordlists, 65 symbol processing, language processing versus, 442 synonyms, 67 synsets, 67 semantic similarity, 71 in WordNet concept hierarchy, 69 syntactic agreement, 329–331 syntactic cues to word category, 211 syntactic structure, recursion in, 301 syntax, 295–298 syntax errors, 3 T \t tab character in regular expressions, 111 T9 system, entering text on mobile phones, 99 tabs avoiding in code indentation, 138 matching in regular expressions, 109 tag patterns, 266 matching, precedence in, 267 tagging, 179–219 adjectives and adverbs, 186 combining taggers, 205 default tagger, 198 evaluating tagger performance, 201 exploring tagged corpora, 187–189 lookup tagger, 200–201 mapping words to tags using Python dictionaries, 189–198 nouns, 184 part-of-speech (POS) tagging, 229 performance limitations, 206 reading tagged corpora, 181 regular expression tagger, 199 representing tagged tokens, 181 resources for further reading, 214 across sentence boundaries, 208 separating training and testing data, 203 simplified part-of-speech tagset, 183 storing taggers, 206 transformation-based, 208–210 unigram tagging, 202 unknown words, 206 unsimplified POS tags, 187 using POS (part-of-speech) tagger, 179 verbs, 185 tags in feature structures, 340 IOB tags representing chunk structures, 269 XML, 425 tagsets, 179 morphosyntactic information in POS tagsets, 212 simplified POS tagset, 183 terms (first-order logic), 372 test sets, 44, 223 choosing for classification models, 238 testing classifier for document classification, 228 text, 1 computing statistics from, 16–22 counting vocabulary, 7–10 entering on mobile phones (T9 system), 99 as lists of words, 10–16 searching, 4–7 examining common contexts, 5 text alignment, 30 text 
editor, creating programs with, 56 textonyms, 99 textual entailment, 32 textwrap module, 120 theorem proving in first-order logic, 375 timeit module, 164 TIMIT Corpus, 407–412 tokenization, 80 chunking and, 264 in information retrieval, 263 issues with, 111 list produced from tokenizing string, 86 regular expressions for, 109–112 representing tagged tokens, 181 segmentation and, 112 with Unicode strings as input and output, 97 tokenized text, searching, 105 tokens, 8 Toolbox, 66, 412, 431–435 accessing data from XML, using ElementTree, 429 adding field to each entry, 431 resources for further reading, 438 validating lexicon, 432–435 tools for creation, publication, and use of linguistic data, 421 top-down approach to dynamic programming, 167 top-down parsing, 304 total likelihood, 251 training classifier, 223 classifier for document classification, 228 classifier-based chunkers, 274–278 taggers, 203 unigram chunker using CoNLL 2000 Chunking Corpus, 273 training sets, 223, 225 transformation-based tagging, 208–210 transitive verbs, 314, 391–394 translations comparative wordlists, 66 machine (see machine translation) treebanks, 315–317 trees, 279–281 representing chunks, 270 traversal of, 280 trie, 162 trigram taggers, 204 truth conditions, 368 truth-conditional semantics in first-order logic, 377 tuples, 133 lists versus, 136 parentheses with, 134 representing tagged tokens, 181 Turing Test, 31, 368 type-raising, 390 type-token distinction, 8 TypeError, 157 types, 8, 86 (see also data types) types (first-order logic), 373 U unary predicate, 372 unbounded dependency constructions, 349–353 defined, 350 underspecified, 333 Unicode, 93–97 decoding and encoding, 94 definition and description of, 94 extracting from files, 94 resources for further information, 122 using your local encoding in Python, 97 unicodedata module, 96 unification, 342–344 unigram taggers confusion matrix for, 240 noun phrase chunking with, 272 unigram tagging, 202 lookup 
tagger (example), 200 separating training and test data, 203 478 | General Index unique beginners, 69 Universal Feed Parser, 83 universal quantifier, 374 unknown words, tagging, 206 updating dictionary incrementally, 195 US Presidential Inaugural Addresses Corpus, 45 user input, capturing, 85 V valencies, 313 validity of arguments, 369 validity of XML documents, 426 valuation, 377 examining quantifier scope ambiguity, 381 Mace4 model converted to, 384 valuation function, 377 values, 191 complex, 196 variables arguments of predicates in first-order logic, 373 assignment, 378 bound by quantifiers in first-order logic, 373 defining, 14 local, 58 naming, 15 relabeling bound variables, 389 satisfaction of, using to interpret quantified formulas, 380 scope of, 145 verb phrase (VP), 297 verbs agreement paradigm for English regular verbs, 329 auxiliary, 336 auxiliary verbs and inversion of subject and verb, 348 categorizing and tagging, 185 examining for dependency grammar, 312 head of sentence and dependencies, 310 present participle, 211 transitive, 391–394 W \W non-word characters in Python, 110, 111 \w word characters in Python, 110, 111 web text, 42 Web, obtaining data from, 416 websites, obtaining corpora from, 416 weighted grammars, 318–321 probabilistic context-free grammar (PCFG), 320 well-formed (XML), 425 well-formed formulas, 368 well-formed substring tables (WFST), 307– 310 whitespace regular expression characters for, 109 tokenizing text on, 109 wildcard symbol (.), 98 windowdiff scorer, 414 word classes, 179 word comparison operators, 23 word occurrence, counting in text, 8 word offset, 45 word processor files, obtaining data from, 417 word segmentation, 113–116 word sense disambiguation, 28 word sequences, 7 wordlist corpora, 60–63 WordNet, 67–73 concept hierarchy, 69 lemmatizer, 108 more lexical relations, 70 semantic similarity, 71 visualization of hypernym hierarchy using Matplotlib and NetworkX, 170 Words Corpus, 60 words( ) function, 40 wrapping text, 
120 X XML, 425–431 ElementTree interface, 427–429 formatting entries, 430 representation of lexical entry from chunk parsing Toolbox record, 434 resources for further reading, 438 role of, in using to represent linguistic structures, 426 using ElementTree to access Toolbox data, 429 using for linguistic structures, 425 validity of documents, 426 Z zero counts (naive Bayes classifier), 249 zero projection, 347 About the Authors Steven Bird is Associate Professor in the Department of Computer Science and Software Engineering at the University of Melbourne, and Senior Research Associate in the Linguistic Data Consortium at the University of Pennsylvania.

Other useful books in this area include (Biber, Conrad, & Reppen, 1998), (McEnery, 2006), (Meyer, 2002), (Sampson & McCarthy, 2005), and (Scott & Tribble, 2006). Further readings in quantitative data analysis in linguistics are: (Baayen, 2008), (Gries, 2009), and (Woods, Fletcher, & Hughes, 1986). The original description of WordNet is (Fellbaum, 1998). Although WordNet was originally developed for research in psycholinguistics, it is now widely used in NLP and Information Retrieval. WordNets are being developed for many other languages, as documented at For a study of WordNet similarity measures, see (Budanitsky & Hirst, 2006). Other topics touched on in this chapter were phonetics and lexical semantics, and we refer readers to Chapters 7 and 20 of (Jurafsky & Martin, 2008). 2.8 Exercises 1. ○ Create a variable phrase containing a list of words.


pages: 523 words: 143,139

Algorithms to Live By: The Computer Science of Human Decisions by Brian Christian, Tom Griffiths


4chan, Ada Lovelace, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, Albert Einstein, algorithmic trading, anthropic principle, asset allocation, autonomous vehicles, Berlin Wall, Bill Duvall, bitcoin, Community Supported Agriculture, complexity theory, constrained optimization, cosmological principle, cryptocurrency, Danny Hillis, delayed gratification, dematerialisation, diversification, double helix, Elon Musk, fault tolerance, Fellow of the Royal Society, Firefox, first-price auction, Flash crash, Frederick Winslow Taylor, George Akerlof, global supply chain, Google Chrome, Henri Poincaré, information retrieval, Internet Archive, Jeff Bezos, John Nash: game theory, John von Neumann, knapsack problem, Lao Tzu, linear programming, martingale, Nash equilibrium, natural language processing, NP-complete, P = NP, packet switching, prediction markets, race to the bottom, RAND corporation, RFC: Request For Comment, Robert X Cringely, sealed-bid auction, second-price auction, self-driving car, Silicon Valley, Skype, sorting algorithm, spectrum auction, Steve Jobs, stochastic process, Thomas Malthus, traveling salesman, Turing machine, urban planning, Vickrey auction, Walter Mischel, Y Combinator

the information retrieval systems of university libraries: Anderson’s findings on human memory are published in Anderson and Milson, “Human Memory,” and in the book The Adaptive Character of Thought. This book has been influential for laying out a strategy for analyzing everyday cognition in terms of ideal solutions, used by Tom and many others in their research. Anderson and Milson, “Human Memory,” in turn, draws from a statistical study of library borrowing that appears in Burrell, “A Simple Stochastic Model for Library Loans.” the missing piece in the study of the mind: Anderson’s initial exploration of connections between information retrieval by computers and the organization of human memory was conducted in an era when most people had never interacted with an information retrieval system, and the systems in use were quite primitive.

In 1987, Carnegie Mellon psychologist and computer scientist John Anderson found himself reading about the information retrieval systems of university libraries. Anderson’s goal—or so he thought—was to write about how the design of those systems could be informed by the study of human memory. Instead, the opposite happened: he realized that information science could provide the missing piece in the study of the mind. “For a long time,” says Anderson, “I had felt that there was something missing in the existing theories of human memory, including my own. Basically, all of these theories characterize memory as an arbitrary and non-optimal configuration.… I had long felt that the basic memory processes were quite adaptive and perhaps even optimal; however, I had never been able to see a framework in which to make this point. In the computer science work on information retrieval, I saw that framework laid out before me.”

“Some things that might seem frustrating as we grow older (like remembering names!) are a function of the amount of stuff we have to sift through … and are not necessarily a sign of a failing mind.” As he puts it, “A lot of what is currently called decline is simply learning.” Caching gives us the language to understand what’s happening. We say “brain fart” when we should really say “cache miss.” The disproportionate occasional lags in information retrieval are a reminder of just how much we benefit the rest of the time by having what we need at the front of our minds. So as you age, and begin to experience these sporadic latencies, take heart: the length of a delay is partly an indicator of the extent of your experience. The effort of retrieval is a testament to how much you know. And the rarity of those lags is a testament to how well you’ve arranged it: keeping the most important things closest to hand. 5 Scheduling First Things First How we spend our days is, of course, how we spend our lives.


pages: 397 words: 102,910

The Idealist: Aaron Swartz and the Rise of Free Culture on the Internet by Justin Peters


4chan, Any sufficiently advanced technology is indistinguishable from magic, Brewster Kahle, buy low sell high, corporate governance, crowdsourcing, disintermediation, don't be evil, global village, Hacker Ethic, hypertext link, index card, informal economy, information retrieval, Internet Archive, invention of movable type, invention of writing, Isaac Newton, Lean Startup, Paul Buchheit, Paul Graham, profit motive, RAND corporation, Republic of Letters, Richard Stallman, semantic web, Silicon Valley, social web, Steve Jobs, Steven Levy, Stewart Brand, strikebreaker, Vannevar Bush, Whole Earth Catalog, Y Combinator

His brief remarks to the group at Woods Hole were wistful: “I merely wish I were young enough to participate with you in the fascinating intricacies you will encounter and bring under your control.”48 Vannevar rhymes with believer, and when it came to government funding of scientific research, Bush certainly was. He was also a lifelong believer in libraries, and the benefits to be derived from their automation. In 1945, he published an article in the Atlantic Monthly that proposed a rudimentary mechanized library called Memex, a linked-information retrieval system. Memex was a desk-size machine that was equal parts stenographer, filing cabinet, and reference librarian: “a device in which an individual stores his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility.”49 The goal was to build a machine that could capture a user’s thought patterns, compile and organize his reading material and correspondence, and record the resulting “associative trails” between them all, such that the user could trace his end insights back to conception.

The Chicago Tribune opined, “The school mimeograph should be viewed not as a piratical rival to the trade publisher but as a helpful unpaid publicity agent who helps publishers’ long-term sales.”59 That argument didn’t take root. The rise of photoduplication technologies that facilitated the rapid spread of information merely underscored the fragility of copyright holders’ claims that intellectual property was indistinguishable from regular physical property. “We know that volumes of information can be stored on microfilm and magnetic tape. We keep hearing about information-retrieval networks,” former senator Kenneth B. Keating told Congress in 1965. “The inexorable question arises—what will happen in the long run if authors’ income is cut down and down by increasing free uses by photocopy and information storage and retrieval? Will the authors continue writing? Will the publishers continue publishing if their markets are diluted, eroded, and eventually, the profit motive and incentive completely destroyed?

Though he took a moment to celebrate what he deemed “the pinnacle of my career,” he couldn’t help but predict that future milestones would likely be few and far between, unless the American reading public took control of its nation’s copyright laws. Project Gutenberg had become an eloquent counterargument to copyright advocates’ dismissive claims about the public domain. It demonstrated just how easily a network could be used to breathe new life into classics that might otherwise go unseen. Despite the existence of initiatives such as Project Gutenberg, despite the emergence of the Internet as a new medium for information retrieval and distribution, the same official attitudes about intellectual property prevailed. The public domain was regarded as a penalty rather than as an opportunity. Parochial concerns were conflated with the public interest. The rise of the Internet might portend an informational revolution, but from the standpoint of the people in power, Hart warned, revolution was a bad thing. “Every single time a new publishing technique has promised to get the common people a home library, laws have been passed to stop, dead in its tracks, this kind of ‘Information Age,’ ” Hart wrote.


pages: 250 words: 73,574

Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers by John MacCormick, Chris Bishop


Ada Lovelace, AltaVista, Claude Shannon: information theory, fault tolerance, information retrieval, Menlo Park, PageRank, pattern recognition, Richard Feynman, Silicon Valley, Simon Singh, sorting algorithm, speech recognition, Stephen Hawking, Steve Jobs, Steve Wozniak, traveling salesman, Turing machine, Turing test, Vannevar Bush

To summarize: although humans don't use NEAR queries much, search engines use the information about nearness constantly to improve their rankings—and the reason they can do this efficiently is because they use the word-location trick. An example set of web pages that each have a title and a body. We already know that the Babylonians were using indexing 5000 years before search engines existed. It turns out that search engines did not invent the word-location trick either: this is a well-known technique that was used in other types of information retrieval before the internet arrived on the scene. However, in the next section we will learn about a new trick that does appear to have been invented by search engine designers: the metaword trick. The cunning use of this trick and various related ideas helped to catapult the AltaVista search engine to the top of the search industry in the late 1990s. THE METAWORD TRICK So far, we've been using extremely simple examples of web pages.
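The word-location trick described in this excerpt can be sketched as a positional index: each word maps to the pages it occurs on, together with the positions at which it occurs. The following is a minimal illustration in Python, not the implementation used by any real search engine; the function names, the `max_distance` parameter, and the sample pages are all assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(pages):
    """Map each word to {page_id: [positions]} (the word-location idea)."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(page_id, []).append(position)
    return index


def near(index, word_a, word_b, max_distance=5):
    """Return sorted ids of pages where both words occur within max_distance positions."""
    postings_a = index.get(word_a, {})
    postings_b = index.get(word_b, {})
    hits = []
    # Only pages containing both words can match.
    for page_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= max_distance
               for pa in postings_a[page_id]
               for pb in postings_b[page_id]):
            hits.append(page_id)
    return sorted(hits)


# Hypothetical two-page collection, for illustration only.
pages = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat around the large old garden until it reached the mat",
}
index = build_positional_index(pages)
close_hits = near(index, "cat", "mat", max_distance=5)
loose_hits = near(index, "cat", "mat", max_distance=10)
```

Because positions are stored alongside page ids, nearness can be checked from the index alone, without rereading the pages; a ranking function could use the same distances as a relevance signal.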

Public key cryptography and the related digital signature algorithms are examples of this. In other cases, the algorithms may have existed in the research community for some time, waiting in the wings for the right wave of new technology to give them wide applicability. The search algorithms for indexing and ranking fall into this category: similar algorithms had existed for years in the field known as information retrieval, but it took the phenomenon of web search to make these algorithms “great,” in the sense of daily use by ordinary computer users. Of course, the algorithms also evolved for their new application; PageRank is a good example of this. Note that the emergence of new technology does not necessarily lead to new algorithms. Consider the phenomenal growth of laptop computers over the 1980s and 1990s.

Among the many college-level computer science texts on algorithms, three particularly readable options are Algorithms, by Dasgupta, Papadimitriou, and Vazirani; Algorithmics: The Spirit of Computing, by Harel and Feldman; and Introduction to Algorithms, by Cormen, Leiserson, Rivest, and Stein. Search engine indexing (chapter 2). The original AltaVista patent covering the metaword trick is U.S. patent 6105019, “Constrained Searching of an Index,” by Mike Burrows (2000). For readers with a computer science background, Search Engines: Information Retrieval in Practice, by Croft, Metzler, and Strohman, is a good option for learning more about indexing and many other aspects of search engines. PageRank (chapter 3). The opening quotation by Larry Page is taken from an interview by Ben Elgin, published in Businessweek, May 3, 2004. Vannevar Bush's “As We May Think” was, as mentioned above, originally published in The Atlantic magazine (July 1945).


pages: 893 words: 199,542

Structure and interpretation of computer programs by Harold Abelson, Gerald Jay Sussman, Julie Sussman


Andrew Wiles, conceptual framework, Douglas Hofstadter, Eratosthenes, Fermat's Last Theorem, Gödel, Escher, Bach, industrial robot, information retrieval, iterative process, loose coupling, probability theory / Blaise Pascal / Pierre de Fermat, Richard Stallman, Turing machine

What is the order of growth in the number of steps required by list->tree to convert a list of n elements? Exercise 2.65. Use the results of exercises 2.63 and 2.64 to give Θ(n) implementations of union-set and intersection-set for sets implemented as (balanced) binary trees.41 Sets and information retrieval We have examined options for using lists to represent sets and have seen how the choice of representation for a data object can have a large impact on the performance of the programs that use the data. Another reason for concentrating on sets is that the techniques discussed here appear again and again in applications involving information retrieval. Consider a data base containing a large number of individual records, such as the personnel files for a company or the transactions in an accounting system. A typical data-management system spends a large amount of time accessing or modifying the data in the records and therefore requires an efficient method for accessing records.
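One way to picture efficient access to keyed records is a binary search tree in which each node holds a key and its record. The sketch below is written in Python rather than the book's Scheme and is only an illustration: the `Node` class, the `insert` and `lookup` names, and the sample personnel records are hypothetical, and the tree is not rebalanced, so lookups are fast only while it stays reasonably balanced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    record: dict
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(tree, key, record):
    """Insert a record under key and return the (possibly new) root."""
    if tree is None:
        return Node(key, record)
    if key < tree.key:
        tree.left = insert(tree.left, key, record)
    elif key > tree.key:
        tree.right = insert(tree.right, key, record)
    else:
        tree.record = record  # same key: overwrite the stored record
    return tree


def lookup(tree, key):
    """Return the record stored under key, or None if it is absent."""
    while tree is not None:
        if key == tree.key:
            return tree.record
        tree = tree.left if key < tree.key else tree.right
    return None


# Hypothetical personnel records keyed by employee number.
root = None
for emp_id, name in [(50, "Ada"), (20, "Ben"), (70, "Cy"), (60, "Dee")]:
    root = insert(root, emp_id, {"name": name})

found = lookup(root, 60)
missing = lookup(root, 99)
```

Each comparison discards one subtree, so a lookup in a balanced tree of n records takes a number of steps that grows like log n, compared with n for an unordered list.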

In particular, there will be an “eval” part that classifies expressions according to type and an “apply” part that implements the language's abstraction mechanism (procedures in the case of Lisp, and rules in the case of logic programming). Also, a central role is played in the implementation by a frame data structure, which determines the correspondence between symbols and their associated values. One additional interesting aspect of our query-language implementation is that we make substantial use of streams, which were introduced in chapter 3. 4.4.1 Deductive Information Retrieval Logic programming excels in providing interfaces to data bases for information retrieval. The query language we shall implement in this chapter is designed to be used in this way. In order to illustrate what the query system does, we will show how it can be used to manage the data base of personnel records for Microshaft, a thriving high-technology company in the Boston area. The language provides pattern-directed access to personnel information and can also take advantage of general rules in order to make logical deductions.
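Pattern-directed access of the kind described here can be illustrated with a very small matcher. This is a sketch in Python, not the book's query system: it handles only flat patterns, has no rules or deduction, and uses a generator as a rough stand-in for streams. The `?name` variable syntax, the function names, and the sample assertions are assumptions made for the example.

```python
def is_variable(term):
    """Pattern variables are strings that start with '?', e.g. '?who'."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, assertion, frame):
    """Return a frame extended with new bindings if pattern matches, else None."""
    if len(pattern) != len(assertion):
        return None
    frame = dict(frame)  # copy, so the caller's frame is left untouched
    for p, a in zip(pattern, assertion):
        if is_variable(p):
            if p in frame and frame[p] != a:
                return None  # conflicts with an earlier binding
            frame[p] = a
        elif p != a:
            return None
    return frame


def query(pattern, database):
    """Lazily yield one frame per assertion that matches the pattern."""
    for assertion in database:
        frame = match(pattern, assertion, {})
        if frame is not None:
            yield frame


# Hypothetical personnel assertions, for illustration only.
database = [
    ("job", "alice", "programmer"),
    ("job", "bob", "technician"),
    ("job", "carol", "programmer"),
    ("salary", "alice", 40000),
]
programmers = [f["?who"] for f in query(("job", "?who", "programmer"), database)]
```

The frame plays the role the excerpt assigns to it: it records the correspondence between pattern variables and the values they were matched against.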

The resulting RSA algorithm has become a widely used technique for enhancing the security of electronic communications. Because of this and related developments, the study of prime numbers, once considered the epitome of a topic in “pure” mathematics to be studied only for its own sake, now turns out to have important practical applications to cryptography, electronic funds transfer, and information retrieval. 1.3 Formulating Abstractions with Higher-Order Procedures We have seen that procedures are, in effect, abstractions that describe compound operations on numbers independent of the particular numbers. For example, when we (define (cube x) (* x x x)) we are not talking about the cube of a particular number, but rather about a method for obtaining the cube of any number.


pages: 223 words: 52,808

Intertwingled: The Work and Influence of Ted Nelson (History of Computing) by Douglas R. Dechow


3D printing, Apple II, Bill Duvall, Brewster Kahle, Buckminster Fuller, Claude Shannon: information theory, cognitive dissonance, computer age, conceptual framework, Douglas Engelbart, Dynabook, Edward Snowden, game design, HyperCard, hypertext link, information retrieval, Internet Archive, Jaron Lanier, knowledge worker, linked data, Marshall McLuhan, Menlo Park, Mother of all demos, pre–internet, RAND corporation, semantic web, Silicon Valley, software studies, Steve Jobs, Steve Wozniak, Stewart Brand, Ted Nelson, the medium is the message, Vannevar Bush, Wall-E, Whole Earth Catalog

Childress V (1998) Engineering problem solving for mathematics, science, and technology education. J Technol Educ 10(1). http://​scholar.​lib.​vt.​edu/​ejournals/​JTE/​v10n1/​childress.​html 5. Nelson TH (1965) A file structure for the complex, the changing and the indeterminate. In: Proceedings of the ACM 20th national conference. ACM Press, New York, pp 84–100 6. Nelson TH (1967) Getting it out of our system. In: Schlechter G (ed) Information retrieval: a critical review. Thompson Books, Washington, DC, pp 191–210 7. Nelson TH (1968) Hypertext implementation notes, 6–10 March 1968. Xuarchives. http://​xanadu.​com/​REF%20​XUarchive%20​SET%20​03.​11.​06/​hin68.​tif 8. Nelson TH (1974) Computer lib: you can and must understand computers now/dream machines. Hugo’s Book Service, Chicago 9. Nelson TH (1993) Literary machines. Mindful Press, Sausalito 10.

Ted signed my copy of Literary Machines [25] at a talk in the mid-1990s, thus I was in awe of the man when Bill Dutton put us together as visiting scholars in the OII attic, a wonderful space overlooking the Ashmolean Museum. Ted and I arrived at concepts of data and metadata from very different paths. He brought his schooling in the theater and literary theory to the pioneer days of personal computing. I brought my schooling in mathematics, information retrieval, documentation, libraries, and communication to the study of scholarship. While Ted was sketching personal computers to revolutionize written communication [24], I was learning how to pry data out of card catalogs and move them into the first generation of online catalogs [6]. Our discussions that began 30 years later revealed the interaction of these threads, which have since converged. 10.2 Collecting and Organizing Data Ted overwhelms himself in data, hence he needs metadata to manage his collections.

Paper presented to Sixth National Symposium on Information Display, Los Angeles, pp 31–39 *Nelson TH (1965) The hypertext. In: Proceedings of the World Documentation Federation Nelson TH (1966–1967) Hypertext notes. http://​web.​archive.​org/​web/​20031127035740/​http://​www.​xanadu.​com/​XUarchive/​. Unpublished series of ten short essays or “notes“ Nelson TH (1967) Getting it out of our system. In: Schechter G (ed) Information retrieval: a critical review. Thompson Books, Washington, DC, pp 191–210 Nelson TH, Carmody S, Gross W, Rice D, van Dam A (1969) A hypertext editing system for the/360. In: Faiman M, Nievergelt J (eds) Pertinent concepts in computer graphics. Proceedings of the Second University of Illinois conference on computer graphics. University of Illinois Press, Urbana, pp 291–330 Nelson TH (1970) Las Vegas confrontation sit-out: a CAI radical’s view from solitary.


pages: 1,387 words: 202,295

Structure and Interpretation of Computer Programs, Second Edition by Harold Abelson, Gerald Jay Sussman, Julie Sussman


Andrew Wiles, conceptual framework, Douglas Hofstadter, Eratosthenes, Gödel, Escher, Bach, industrial robot, information retrieval, iterative process, loose coupling, probability theory / Blaise Pascal / Pierre de Fermat, Richard Stallman, Turing machine, wikimedia commons

What is the order of growth in the number of steps required by list->tree to convert a list of n elements? Exercise 2.65: Use the results of Exercise 2.63 and Exercise 2.64 to give Θ(n) implementations of union-set and intersection-set for sets implemented as (balanced) binary trees.107 Sets and information retrieval We have examined options for using lists to represent sets and have seen how the choice of representation for a data object can have a large impact on the performance of the programs that use the data. Another reason for concentrating on sets is that the techniques discussed here appear again and again in applications involving information retrieval. Consider a data base containing a large number of individual records, such as the personnel files for a company or the transactions in an accounting system. A typical data-management system spends a large amount of time accessing or modifying the data in the records and therefore requires an efficient method for accessing records.
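Set operations on ordered sequences can be carried out in a single merge-style pass, which takes time proportional to the combined length of the inputs. The Python sketch below shows that pass for intersection only; it is not a solution to the exercise for trees, the function name is illustrative, and it assumes both inputs are sorted in increasing order with no duplicates.

```python
def intersection_ordered(a, b):
    """Intersect two sorted, duplicate-free lists in one linear pass."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] is smaller than everything left in b
        else:
            j += 1  # b[j] is smaller than everything left in a
    return result


common = intersection_ordered([1, 3, 5, 7, 9], [3, 4, 5, 6, 7])
```

Every step advances at least one of the two indices, so the loop runs at most len(a) + len(b) times.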

In particular, there will be an “eval” part that classifies expressions according to type and an “apply” part that implements the language’s abstraction mechanism (procedures in the case of Lisp, and rules in the case of logic programming). Also, a central role is played in the implementation by a frame data structure, which determines the correspondence between symbols and their associated values. One additional interesting aspect of our query-language implementation is that we make substantial use of streams, which were introduced in Chapter 3. 4.4.1 Deductive Information Retrieval Logic programming excels in providing interfaces to data bases for information retrieval. The query language we shall implement in this chapter is designed to be used in this way. In order to illustrate what the query system does, we will show how it can be used to manage the data base of personnel records for Microshaft, a thriving high-technology company in the Boston area. The language provides pattern-directed access to personnel information and can also take advantage of general rules in order to make logical deductions.

2.1.4 Extended Exercise: Interval Arithmetic 2.2 Hierarchical Data and the Closure Property 2.2.1 Representing Sequences 2.2.2 Hierarchical Structures 2.2.3 Sequences as Conventional Interfaces 2.2.4 Example: A Picture Language 2.3 Symbolic Data 2.3.1 Quotation 2.3.2 Example: Symbolic Differentiation 2.3.3 Example: Representing Sets 2.3.4 Example: Huffman Encoding Trees 2.4 Multiple Representations for Abstract Data 2.4.1 Representations for Complex Numbers 2.4.2 Tagged data 2.4.3 Data-Directed Programming and Additivity 2.5 Systems with Generic Operations 2.5.1 Generic Arithmetic Operations 2.5.2 Combining Data of Different Types 2.5.3 Example: Symbolic Algebra 3 Modularity, Objects, and State 3.1 Assignment and Local State 3.1.1 Local State Variables 3.1.2 The Benefits of Introducing Assignment 3.1.3 The Costs of Introducing Assignment 3.2 The Environment Model of Evaluation 3.2.1 The Rules for Evaluation 3.2.2 Applying Simple Procedures 3.2.3 Frames as the Repository of Local State 3.2.4 Internal Definitions 3.3 Modeling with Mutable Data 3.3.1 Mutable List Structure 3.3.2 Representing Queues 3.3.3 Representing Tables 3.3.4 A Simulator for Digital Circuits 3.3.5 Propagation of Constraints 3.4 Concurrency: Time Is of the Essence 3.4.1 The Nature of Time in Concurrent Systems 3.4.2 Mechanisms for Controlling Concurrency 3.5 Streams 3.5.1 Streams Are Delayed Lists 3.5.2 Infinite Streams 3.5.3 Exploiting the Stream Paradigm 3.5.4 Streams and Delayed Evaluation 3.5.5 Modularity of Functional Programs and Modularity of Objects 4 Metalinguistic Abstraction 4.1 The Metacircular Evaluator 4.1.1 The Core of the Evaluator 4.1.2 Representing Expressions 4.1.3 Evaluator Data Structures 4.1.4 Running the Evaluator as a Program 4.1.5 Data as Programs 4.1.6 Internal Definitions 4.1.7 Separating Syntactic Analysis from Execution 4.2 Variations on a Scheme — Lazy Evaluation 4.2.1 Normal Order and Applicative Order 4.2.2 An Interpreter with Lazy Evaluation 4.2.3 Streams as Lazy 
Lists 4.3 Variations on a Scheme — Nondeterministic Computing 4.3.1 Amb and Search 4.3.2 Examples of Nondeterministic Programs 4.3.3 Implementing the Amb Evaluator 4.4 Logic Programming 4.4.1 Deductive Information Retrieval 4.4.2 How the Query System Works 4.4.3 Is Logic Programming Mathematical Logic? 4.4.4 Implementing the Query System The Driver Loop and Instantiation The Evaluator Finding Assertions by Pattern Matching Rules and Unification Maintaining the Data Base Stream Operations Query Syntax Procedures Frames and Bindings 5 Computing with Register Machines 5.1 Designing Register Machines 5.1.1 A Language for Describing Register Machines 5.1.2 Abstraction in Machine Design 5.1.3 Subroutines 5.1.4 Using a Stack to Implement Recursion 5.1.5 Instruction Summary 5.2 A Register-Machine Simulator 5.2.1 The Machine Model 5.2.2 The Assembler 5.2.3 Generating Execution Procedures for Instructions 5.2.4 Monitoring Machine Performance 5.3 Storage Allocation and Garbage Collection 5.3.1 Memory as Vectors 5.3.2 Maintaining the Illusion of Infinite Memory 5.4 The Explicit-Control Evaluator 5.4.1 The Core of the Explicit-Control Evaluator 5.4.2 Sequence Evaluation and Tail Recursion 5.4.3 Conditionals, Assignments, and Definitions 5.4.4 Running the Evaluator 5.5 Compilation 5.5.1 Structure of the Compiler 5.5.2 Compiling Expressions 5.5.3 Compiling Combinations 5.5.4 Combining Instruction Sequences 5.5.5 An Example of Compiled Code 5.5.6 Lexical Addressing 5.5.7 Interfacing Compiled Code to the Evaluator References List of Exercises List of Figures Term Index Colophon Unofficial Texinfo Format This is the second edition SICP book, from Unofficial Texinfo Format.


pages: 348 words: 39,850

Data Scientists at Work by Sebastian Gutierrez


Albert Einstein, algorithmic trading, bioinformatics, bitcoin, business intelligence, chief data officer, clean water, cloud computing, computer vision, continuous integration, correlation does not imply causation, crowdsourcing, data is the new oil, DevOps, domain-specific language, follow your passion, full text search, informal economy, information retrieval, Infrastructure as a Service, inventory management, iterative process, linked data, Mark Zuckerberg, microbiome, Moneyball by Michael Lewis explains big data, move fast and break things, natural language processing, Network effects, nuclear winter, optical character recognition, pattern recognition, Paul Graham, personalized medicine, Peter Thiel, pre–internet, quantitative hedge fund, quantitative trading / quantitative finance, recommendation engine, Renaissance Technologies, Richard Feynman, self-driving car, side project, Silicon Valley, Skype, software as a service, speech recognition, statistical model, Steve Jobs, stochastic process, technology bubble, text mining, the scientific method, web application

Search is the problem at the heart of the information economy. The information is out there, if only we can find it. What’s also great about search is that it’s an area full of open problems, many of them pretty fundamental. Maybe search will be boring fifty years from now, but I doubt it. Gutierrez: If someone wants to get started in search today, what should they do? Tunkelang: They should start by taking a class on information retrieval or learn from the vast array of resources available offline and online. Given the open source technology for search, they should learn by doing—for instance, implementing a basic search engine for a public data collection. It’s not hard to get started with search. Where things get interesting is in the details. Ranking is still an open area of research, especially for personalized and social search applications.

Gutierrez: When did you realize you wanted to work with data as a career? Tunkelang: I’m not sure there was any particular moment of realization. I always loved math and computer science. Early on, I was more tempted by theory than practice, obsessed with open problems in combinatorics and computational complexity. But ultimately I couldn’t resist working on problems with practical consequences, and that’s how I found myself specializing in information retrieval and data science more broadly. Gutierrez: How did you get interested in working with data? Tunkelang: One of the problems I worked on at IBM was visualizing semantic networks obtained by applying natural language processing algorithms to large document collections. Even though my focus was on the network visualization algorithms, I couldn’t help noticing that the natural language processing algorithms had their good moments and bad moments. Several years later, when I was at Endeca, I found myself working on terminology extraction and had to confront the noise problems personally. Ironically, we ended up licensing our terminology extraction algorithms to IBM as part of a search application we built for them. Gutierrez: What was the first data set you worked with? Tunkelang: I feel bad that I can’t remember my first, as that makes it sound like it wasn’t a deep, meaningful experience! I did spend a lot of time working with a Reuters news corpus to test out information retrieval and information extraction algorithms. One of the great things about my time at Endeca was the opportunity to work with our customers’ data, especially when we were prototyping new product features. Gutierrez: How did the fact that the Endeca data was customers’ data make you think about the data? Tunkelang: It was nice to have a diverse set of customers and thus gain exposure to lots of different problems.


pages: 480 words: 99,288

Mastering ElasticSearch by Rafal Kuc, Marek Rogozinski


Amazon Web Services, create, read, update, delete, fault tolerance, finite state, full text search, information retrieval

Keep in mind that in order to adjust your query relevance, you don't need to understand that, but it is very important to at least know how it works. The Lucene conceptual formula The conceptual version of the TF/IDF formula looks like: The previously presented formula is a representation of the Boolean model of Information Retrieval combined with the Vector Space Model of Information Retrieval. Let's not discuss it and let's just jump into the practical formula, which is implemented by Apache Lucene and is actually used. Note Information about the Boolean model and the Vector Space Model of Information Retrieval is far beyond the scope of this book. If you would like to read more about it, start with and The Lucene practical formula Now let's look at the practical formula Apache Lucene uses: As you may be able to see, the score factor for the document is a function of query q and document d.
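The general idea of a score that depends on both the query q and the document d can be illustrated with a textbook-style TF/IDF computation. The Python sketch below is a deliberate simplification and is not the formula Apache Lucene implements: it leaves out boosts, coordination factors and normalization, and the function name, the particular weighting, and the sample documents are assumptions chosen for the example.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score every document against the query with a simplified TF/IDF sum."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # the term does not occur in this document
            tf = counts[term] / len(tokens)                # term frequency in d
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # rarer terms weigh more
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical documents, for illustration only.
documents = [
    "elasticsearch is built on lucene",
    "lucene scores documents",
    "cats sleep all day",
]
scores = tf_idf_scores("lucene scores", documents)
```

A document matching more query terms, or matching rarer ones, receives a higher score, which is the intuition the TF/IDF family of formulas builds on.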

He is also a speaker for various conferences around the world such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, and Lucene Revolution. Rafał began his journey with Lucene in 2002 and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and this was it. He started working with ElasticSearch in the middle of 2010. Currently, Lucene, Solr, ElasticSearch, and information retrieval are his main points of interest. Rafał is also an author of Solr 3.1 Cookbook, the update to it—Solr 4.0 Cookbook, and is a co-author of ElasticSearch Server all published by Packt Publishing. The book you are holding in your hands was something that I wanted to write after finishing the ElasticSearch Server book and I got the opportunity. I wanted not to jump from topic to topic, but concentrate on a few of them and write about what I know and share the knowledge.


pages: 402 words: 110,972

Nerds on Wall Street: Math, Machines and Wired Markets by David J. Leinweber


AI winter, algorithmic trading, asset allocation, banking crisis, barriers to entry, Big bang: deregulation of the City of London, butterfly effect, buttonwood tree, buy low sell high, capital asset pricing model, citizen journalism, collateralized debt obligation, corporate governance, Craig Reynolds: boids flock, credit crunch, Credit Default Swap, credit default swaps / collateralized debt obligations, Danny Hillis, demand response, disintermediation, distributed generation, diversification, diversified portfolio, Emanuel Derman, experimental economics, financial innovation, Gordon Gekko, implied volatility, index arbitrage, index fund, information retrieval, Internet Archive, John Nash: game theory, Khan Academy, load shedding, Long Term Capital Management, Machine translation of "The spirit is willing, but the flesh is weak." to Russian and back, market fragmentation, market microstructure, Mars Rover, moral hazard, mutually assured destruction, natural language processing, Network effects, optical character recognition, paper trading, passive investing, pez dispenser, phenotype, prediction markets, quantitative hedge fund, quantitative trading / quantitative finance, QWERTY keyboard, RAND corporation, random walk, Ray Kurzweil, Renaissance Technologies, Richard Stallman, risk tolerance, risk-adjusted returns, risk/return, Ronald Reagan, semantic web, Sharpe ratio, short selling, Silicon Valley, Small Order Execution System, smart grid, smart meter, social web, South Sea Bubble, statistical arbitrage, statistical model, Steve Jobs, Steven Levy, Tacoma Narrows Bridge, the scientific method, The Wisdom of Crowds, time value of money, too big to fail, transaction costs, Turing machine, Upton Sinclair, value at risk, Vernor Vinge, yield curve, Yogi Berra

Reporters were necessary intermediaries in an era when (for example) press releases were sent to a few thousand fax machines and assigned to reporters by editors, and when SEC filings were found on a shelf in the Commission’s reading rooms in major cities. Press releases go to everyone over the Web. SEC filings are completely electronic. The reading rooms are closed. There is a great deal of effort to develop persistent specialized information-retrieval software agents for these sorts of routine newsgathering activities, which in turn creates incentives for reporters to move up from moving information around to interpretation and analysis. Examples and more in-depth discussion on these “new research” topics are forthcoming in Chapters 9 and 10. Innovative algo systems will facilitate the use of news, in processed and raw forms.

Dow Jones Elementized News Feed, 26. Reuters Newscope algorithmic offerings, newsscoperealtime/index.aspx?user=1&. 27. These tools are called Open Calais. 28. For the technically ambitious reader, Lucene, Lingpipe, and Lemur are popular open source language and information retrieval tools. 29. Anthony Oettinger, a pioneer in machine translation at Harvard going back to the 1950s, told a story of an early English-Russian-English system sponsored by U.S. intelligence agencies. The English “The spirit is willing but the flesh is weak” went in, was translated to Russian, which was then sent in again to be translated back into English. The result: “The vodka is ready but the meat is rotten.”

Direct market access has disintermediated brokers, many of whom are now in other lines of work. Direct access to primary sources of financially relevant information is disintermediating reporters, who now have to provide more than just a conduit to earn their keep. We would be hard-pressed to find more innovation than we see today on the Web. Google Finance, Yahoo! Finance, and their brethren have made more advanced information retrieval and analysis tools available for free than could be purchased for any amount in the not-so-distant past. Other new technologies enable a new level of human-machine collaboration in investment research, such as XML (extensible markup language), discussed in Chapter 2. One of this technology’s most vocal proponents is Christopher Cox, former chairman of the SEC, who has taken the lead in encouraging the adoption of XBRL (extensible Business Reporting Language) to keep U.S. markets, exchanges, companies, and investors ahead of the curve. We constantly hear about information overload, information glut, information anxiety, data smog, and the like.


pages: 32 words: 7,759

8 Day Trips From London by Dee Maldon


Doomsday Book, information retrieval, Isaac Newton, Stephen Hawking, the market place

8 Day Trips from London A simple guide for visitors who want to see more than the capital By Dee Maldon Bookline & Thinker Ltd Bookline & Thinker Ltd #231, 405 King’s Road London SW10 OBB Eight Days Out From London Copyright © Bookline & Thinker Ltd 2010 This book is a work of non-fiction A CIP catalogue record for this book is available from the British Library All rights reserved. No part of this work may be reproduced or stored in an information retrieval system without the express permission of the publisher ISBN: 9780956517715 Printed and bound by Lightning Source UK Book cover designed by Donald McColl Contents Bath Brighton Cambridge Canterbury Oxford Stonehenge Winchester Windsor Introduction Why take any day trips from London? After all London has so much to see and do. Who could ever be bored there? But escaping London is not about being bored.


pages: 413 words: 119,587

Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots by John Markoff


A Declaration of the Independence of Cyberspace, AI winter, airport security, Apple II, artificial general intelligence, augmented reality, autonomous vehicles, Baxter: Rethink Robotics, Bill Duvall, bioinformatics, Brewster Kahle, Burning Man, call centre, cellular automata, Chris Urmson, Claude Shannon: information theory, Clayton Christensen, clean water, cloud computing, collective bargaining, computer age, computer vision, crowdsourcing, Danny Hillis, DARPA: Urban Challenge, data acquisition, Dean Kamen, deskilling, don't be evil, Douglas Engelbart, Douglas Hofstadter, Dynabook, Edward Snowden, Elon Musk, Erik Brynjolfsson, factory automation, From Mathematics to the Technologies of Life and Death, future of work, Galaxy Zoo, Google Glasses, Google X / Alphabet X, Grace Hopper, Gödel, Escher, Bach, Hacker Ethic, haute couture, hive mind, hypertext link, indoor plumbing, industrial robot, information retrieval, Internet Archive, Internet of things, invention of the wheel, Jacques de Vaucanson, Jaron Lanier, Jeff Bezos, job automation, John Conway, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, John von Neumann, Kevin Kelly, knowledge worker, Kodak vs Instagram, labor-force participation, loose coupling, Mark Zuckerberg, Marshall McLuhan, medical residency, Menlo Park, Mother of all demos, natural language processing, new economy, Norbert Wiener, PageRank, pattern recognition, pre–internet, RAND corporation, Ray Kurzweil, Richard Stallman, Robert Gordon, Rodney Brooks, Sand Hill Road, Second Machine Age, self-driving car, semantic web, shareholder value, side project, Silicon Valley, Silicon Valley startup, Singularitarianism, skunkworks, Skype, social software, speech recognition, stealth mode startup, Stephen Hawking, Steve Ballmer, Steve Jobs, Steve Wozniak, Steven Levy, Stewart Brand, strong AI, superintelligent machines, technological singularity, Ted Nelson, telemarketer, telepresence, 
telepresence robot, Tenerife airport disaster, The Coming Technological Singularity, the medium is the message, Thorstein Veblen, Turing test, Vannevar Bush, Vernor Vinge, Watson beat the top human players on Jeopardy!, Whole Earth Catalog, William Shockley: the traitorous eight

Engelbart’s researchers, an eclectic collection of buttoned-down white-shirted engineers and long-haired computer hackers, were taking computing in a direction so different it was not even in the same coordinate system. The Shakey project was struggling to mimic the human mind and body. Engelbart had a very different goal. During World War II he had stumbled across an article by Vannevar Bush, who had proposed a microfiche-based information retrieval system called Memex to manage all of the world’s knowledge. Engelbart later decided that such a system could be assembled based on the then newly available computers. He thought the time was right to build an interactive system to capture knowledge and organize information in such a way that it would now be possible for a small group of people—scientists, engineers, educators—to create and collaborate more effectively.

In one sense the company began as the quintessential intelligence augmentation, or IA, company. The PageRank algorithm Larry Page developed to improve Internet search results essentially mined human intelligence by using the crowd-sourced accumulation of human decisions about valuable information sources. Google initially began by collecting and organizing human knowledge and then making it available to humans as part of a glorified Memex, the original global information retrieval system first proposed by Vannevar Bush in the Atlantic Monthly in 1945.11 As the company has evolved, however, it has started to push heavily toward systems that replace rather than extend humans. Google’s executives have obviously thought to some degree about the societal consequences of the systems they are creating. Their corporate motto remains “Don’t be evil.” Of course, that is nebulous enough to be construed to mean almost anything.

A student in computer science first at the State University of New York at Buffalo, he then entered graduate programs in computer science at both Washington University in St. Louis and Stanford, but dropped out of both programs before receiving an advanced degree. Once he was on the West Coast, he had gotten involved with Brewster Kahle’s Internet Archive Project, which sought to save a copy of every Web page on the Internet. Larry Page and Sergey Brin had given Hassan stock for programming PageRank, and Hassan also sold E-Groups, another of his information retrieval projects, to Yahoo! for almost a half-billion dollars. By then, he was a very wealthy Silicon Valley technologist looking for interesting projects. In 2006 he backed both Ng and Salisbury and hired Salisbury’s students to join Willow Garage, a laboratory he’d already created to facilitate the next generation of robotics technology—like designing driverless cars. Hassan believed that building a home robot was a more marketable and achievable goal, so he set Willow Garage to work designing a PR2 robot to develop technology that he could ultimately introduce into more commercial projects.


pages: 331 words: 60,536

The Sovereign Individual: How to Survive and Thrive During the Collapse of the Welfare State by James Dale Davidson, William Rees-Mogg


affirmative action, agricultural Revolution, bank run, barriers to entry, Berlin Wall, borderless world, British Empire, California gold rush, clean water, colonial rule, Columbine, compound rate of return, Danny Hillis, debt deflation, ending welfare as we know it, epigenetics, Fall of the Berlin Wall, falling living standards, feminist movement, financial independence, Francis Fukuyama: the end of history, full employment, George Gilder, Hernando de Soto, illegal immigration, income inequality, informal economy, information retrieval, Isaac Newton, Kevin Kelly, market clearing, Martin Wolf, Menlo Park, money: store of value / unit of account / medium of exchange, new economy, New Urbanism, offshore financial centre, Parkinson's law, pattern recognition, phenotype, price mechanism, profit maximization, rent-seeking, reserve currency, road to serfdom, Ronald Coase, school vouchers, seigniorage, Silicon Valley, spice trade, statistical model, telepresence, The Nature of the Firm, the scientific method, The Wealth of Nations by Adam Smith, Thomas L Friedman, Thomas Malthus, trade route, transaction costs, Turing machine, union organizing, very high income

They may perform the whole operation from another jurisdiction where taxes are lower and courts do not honor exorbitant malpractice claims.

Digital Lawyers

Before agreeing to perform an operation, the skilled surgeon will probably call upon a digital lawyer to draft an instant contract that specifies and limits liability based upon the size and characteristics of the tumor revealed in images displayed by the magnetic resonance machine. Digital lawyers will be information-retrieval systems that automate selection of contract provisions, employing artificial intelligence processes such as neural networks to customize private contracts to meet transnational legal conditions. Participants in most high-value or important transactions will not only shop for suitable partners with whom to conduct a business; they will also shop for a suitable domicile for their transactions.

Lifetime employment will disappear as "jobs" increasingly become tasks or "piece work" rather than positions within an organization. Control over economic resources will shift away from the state to persons of superior skills and intelligence, as it becomes increasingly easy to create wealth by adding knowledge to products. Many members of learned professions will be displaced by interactive information-retrieval systems. New survival strategies for persons of lower intelligence will evolve, involving greater concentration on development of leisure skills, sports abilities, and crime, as well as service to the growing numbers of Sovereign Individuals as income inequality within jurisdictions rises. Political systems that grew up at a time when there were rising returns to violence must undergo wrenching adjustments.

Rapidly changing technology is undermining the megapolitical basis of social and economic organization. As a consequence, broad paradigmatic understanding, or unspoken theories about the way the world works, are being antiquated more quickly than in the past. This increases the importance of the broad overview and diminishes the value of individual "facts" of the kind that are readily available to almost anyone with an information retrieval system. 3. The growing tribalization and marginalization of life have had a stunting effect on discourse, and even on thinking. Many people have consequently gotten into the habit of shying away from conclusions that are obviously implied by the facts at their disposal. A recent psychological study disguised as a public opinion poll showed that members of individual occupational groups were almost uniformly unwilling to accept any conclusion that implied a loss of income for them, no matter how airtight the logic supporting it.


pages: 379 words: 109,612

Is the Internet Changing the Way You Think?: The Net's Impact on Our Minds and Future by John Brockman


A Declaration of the Independence of Cyberspace, Albert Einstein, AltaVista, Amazon Mechanical Turk, Asperger Syndrome, availability heuristic, Benoit Mandelbrot, biofilm, Black Swan, British Empire, conceptual framework, corporate governance, Danny Hillis, Douglas Engelbart, Emanuel Derman, epigenetics, Flynn Effect, Frank Gehry, Google Earth, hive mind, Howard Rheingold, index card, information retrieval, Internet Archive, invention of writing, Jane Jacobs, Jaron Lanier, Kevin Kelly, lone genius, loss aversion, mandelbrot fractal, Marshall McLuhan, Menlo Park, meta analysis, meta-analysis, New Journalism, Nicholas Carr, out of africa, Ponzi scheme, pre–internet, Richard Feynman, Rodney Brooks, Ronald Reagan, Schrödinger's Cat, Search for Extraterrestrial Intelligence, SETI@home, Silicon Valley, Skype, slashdot, smart grid, social graph, social software, social web, Stephen Hawking, Steve Wozniak, Steven Pinker, Stewart Brand, Ted Nelson, telepresence, the medium is the message, the scientific method, The Wealth of Nations by Adam Smith, theory of mind, trade route, upwardly mobile, Vernor Vinge, Whole Earth Catalog, X Prize

And when a file becomes corrupt, all I am left with is a pointer, a void where an idea should be, the ghost of a departed thought.

The New Balance: More Processing, Less Memorization
Fiery Cushman, Postdoctoral fellow, Mind/Brain/Behavior Interfaculty Initiative, Harvard University

The Internet changes the way I behave, and possibly the way I think, by reducing the processing costs of information retrieval. I focus more on knowing how to obtain and use information online and less on memorizing it. This tradeoff between processing and memory reminds me of one of my father’s favorite stories, perhaps apocryphal, about studying the periodic table of the elements in his high school chemistry class. On their test, the students were given a blank table and asked to fill in names and atomic weights.

I look up recipes after I arrive at the supermarket. And when a friend cooks a good meal, I’m more interested to learn what Website it came from than how it was spiced. I don’t know most of the American Psychological Association rules for style and citation, but my computer does. For any particular “computation” I perform, I don’t need the same depth of knowledge, because I have access to profoundly more efficient processes of information retrieval. So the Internet clearly changes the way I behave. It must be changing the way I think at some level, insofar as my behavior is a product of my thoughts. It probably is not changing the basic kinds of mental processes I can perform but it might be changing their relative weighting. We psychologists love to impress undergraduates with the fact that taxi drivers have unusually large hippocampi.

Anthony Aguirre, Associate professor of physics, University of California, Santa Cruz

Recently I wanted to learn about twelfth-century China—not a deep or scholarly understanding, just enough to add a bit of not-wrong color to something I was writing. Wikipedia was perfect! More regularly, my astrophysics and cosmology endeavors bring me to databases such as arXiv, ADS (Astrophysics Data System), and SPIRES (Stanford Physics Information Retrieval System), which give instant and organized access to all the articles and information I might need to research and write. Between such uses and an appreciable fraction of my time spent processing e-mails, I, like most of my colleagues, spend a lot of time connected to the Internet. It is a central tool in my research life. Yet what I do that is most valuable—to me, at least—is the occasional generation of genuine creative insights.


pages: 347 words: 97,721

Only Humans Need Apply: Winners and Losers in the Age of Smart Machines by Thomas H. Davenport, Julia Kirby


AI winter, Andy Kessler, artificial general intelligence, asset allocation, Automated Insights, autonomous vehicles, Baxter: Rethink Robotics, business intelligence, business process, call centre, carbon-based life, Clayton Christensen, clockwork universe, conceptual framework, dark matter, David Brooks, deliberate practice, deskilling, Edward Lloyd's coffeehouse, Elon Musk, Erik Brynjolfsson, estate planning, follow your passion, Frank Levy and Richard Murnane: The New Division of Labor, Freestyle chess, game design, general-purpose programming language, Google Glasses, Hans Lippershey, haute cuisine, income inequality, index fund, industrial robot, information retrieval, intermodal, Internet of things, inventory management, Isaac Newton, job automation, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, Khan Academy, knowledge worker, labor-force participation, loss aversion, Mark Zuckerberg, Narrative Science, natural language processing, Norbert Wiener, nuclear winter, pattern recognition, performance metric, Peter Thiel, precariat, quantitative trading / quantitative finance, Ray Kurzweil, Richard Feynman, risk tolerance, Robert Shiller, Rodney Brooks, Second Machine Age, self-driving car, Silicon Valley, six sigma, Skype, speech recognition, spinning jenny, statistical model, Stephen Hawking, Steve Jobs, Steve Wozniak, strong AI, superintelligent machines, supply-chain management, transaction costs, Tyler Cowen: Great Stagnation, Watson beat the top human players on Jeopardy!, Works Progress Administration, Zipcar

When a machine greatly augments your powers of information retrieval, as many information systems do, we would call that gaining a superpower. Indeed, in the Terminator film franchise, out of all the superhuman capabilities Skynet designed into its “cybernetic organisms,” the one filmgoers covet most is the instant pop-up retrieval of biographical information on any humans encountered. It was the inspiration, for example, for Google Glass, according to the technical lead on that product, Thad Starner.6 (And although we had to say Hasta la vista, baby, to that particular product, Google assures us it will be back.) When Tom wrote a book about knowledge workers a decade ago, there were already some examples of how empowering such information retrieval can be for them. He wrote in some detail, for example, about the idea of “computer-aided physician order entry,” particularly focusing on an example of this type of system at Partners HealthCare, a care network in Boston.

See also augmentation; specific professions augmentation and, 31–32, 62, 65, 74, 76, 100, 122, 139, 176, 185, 228, 234, 251 big-picture perspective and, 100 codified tasks and automation, 12–13, 14, 16–18, 19, 27–28, 30, 70, 139, 156, 167, 191, 204, 216, 246 creativity and, 120–21 defined, 5 demand peak, 6 deskilling and, 16 five options for, 76–77, 218, 232 (see also specific steps) how job loss happens, 23–24 information retrieval and, 65–66 lack of wage growth, 24 machine encroachment, 13, 24–25 political strategy to help, 239 roles better done by humans, 26–30 signs of coming automation, 19–22 Stepping In, post-automation work, 30–32 taking charge of destiny, 8–9 time frame for dislocation of, 24–26 who they are, 5–6 working hours of, 70 Kraft, Robert, 172–73 Krans, Mike, 102–3, 132, 134–35, 138 Kurup, Deepika, 164 Kurzweil, Ray, 36 labor unions, 1, 16, 25 Lacerte, 22 language recognition technologies, 39–40, 43, 44–45, 50, 53, 56, 212 natural language processing (NLP), 34, 37, 178 Lawton, Jim, 50, 182–83, 193 Learning by Doing (Bessen), 133, 233 legal field augmentation as leverage in, 68 automation (e-discovery), 13, 142–44, 145, 151 content analysis and automation, 20 narrow specializations, 159–60, 162 number of U.S. lawyers, 68 Stepping Up in, 93 Leibniz Institute for Astrophysics, 59 Levasseur, M.


pages: 58 words: 12,386

Big Data Glossary by Pete Warden


business intelligence, crowdsourcing, fault tolerance, information retrieval, linked data, natural language processing, recommendation engine, web application

As a user, you select elements on an example page that contain the data you’re interested in, and the tool then uses the patterns you’ve defined to pull out information from other pages on a site with a similar structure. For example, you might want to extract product names and prices from a shopping site. With the tool, you could find a single product page, select the product name and price, and then the same elements would be pulled for every other page it crawled from the site. It relies on the fact that most web pages are generated by combining templates with information retrieved from a database, and so have a very consistent structure. Once you’ve gathered the data, it offers some features that are a bit like Google Refine’s for de-duplicating and cleaning up the data. All in all, it’s a very powerful tool for turning web content into structured information, with a very approachable interface. ScraperWiki ScraperWiki is a hosted environment for writing automated processes to scan public websites and extract structured information from the pages they’ve published.
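The template-driven extraction described above can be illustrated with a minimal sketch. It is not the tool's own implementation: the class names `product-name` and `price`, the `TEMPLATE` string, and the `extract` helper are all hypothetical, chosen only to show how a pattern defined once on an example page can be reused on every page generated from the same template.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text inside elements whose class matches a selected field."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field  # e.g. {"product-name": "name"}
        self.current_field = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        # An element may carry several classes; check each against the pattern.
        classes = (dict(attrs).get("class") or "").split()
        for css_class in classes:
            if css_class in self.class_to_field:
                self.current_field = self.class_to_field[css_class]

    def handle_data(self, data):
        # The first non-blank text after a matching tag is taken as the value.
        if self.current_field and data.strip():
            self.record[self.current_field] = data.strip()
            self.current_field = None


def extract(html, class_to_field):
    """Apply one pattern (hypothetical helper) to a single page."""
    parser = FieldExtractor(class_to_field)
    parser.feed(html)
    return parser.record


# Hypothetical template shared by every product page on the site.
TEMPLATE = (
    '<div><h1 class="product-name">{name}</h1>'
    '<span class="price">{price}</span></div>'
)
# The "selection" made once on an example page.
PATTERN = {"product-name": "name", "price": "price"}

pages = [
    TEMPLATE.format(name="Teapot", price="$12.50"),
    TEMPLATE.format(name="Kettle", price="$30.00"),
]
records = [extract(page, PATTERN) for page in pages]
print(records)
```

Because both pages come from the same template, the single pattern recovers the name and price from each. A real site would need more robust selectors and handling for pages that deviate from the template.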


pages: 259 words: 73,193

The End of Absence: Reclaiming What We've Lost in a World of Constant Connection by Michael Harris


4chan, Albert Einstein, AltaVista, Andrew Keen, augmented reality, Burning Man, cognitive dissonance, crowdsourcing, dematerialisation, Filter Bubble, Firefox, Google Glasses, informal economy, information retrieval, invention of movable type, invention of the printing press, invisible hand, James Watt: steam engine, Jaron Lanier, jimmy wales, Kevin Kelly, Loebner Prize, Marshall McLuhan, McMansion, Nicholas Carr, pattern recognition, pre–internet, Republic of Letters, Silicon Valley, Skype, Snapchat, social web, Steve Jobs, the medium is the message, The Wisdom of Crowds, Turing test

Others argue that future generations will learn to make new connections with facts that aren’t held in their heads, that dematerialized knowledge can still lead to innovation. As we inevitably off-load media content to the cloud—storing our books, our television programs, our videos of the trip to Taiwan, and photos of Grandma’s ninetieth birthday, all on a nameless server—can we happily dematerialize our mind’s stores, too? Perhaps we should side with philosopher Lewis Mumford, who insisted in The Myth of the Machine that “information retrieving,” however expedient, is simply no substitute for the possession of knowledge accrued through personal and direct labor. Author Clive Thompson wondered about this when he came across recent research suggesting that we remember fewer and fewer facts these days—of three thousand people polled by neuroscientist Ian Robertson, the young were less able to recall basic personal information (a full one-third, for example, didn’t know their own phone numbers).

., 92 Franklin, Benjamin, 192 friends, 30–31 Frind, Markus, 182–83 Furbies, 29–30 Füssel, Stephan, 103 Gaddam, Sai, 173 Gallup, 123 genes, 41–43 Gentile, Douglas, 118–21 German Ideology, The (Marx), 12n Gleick, James, 137 Globe and Mail, 81–82, 89 glossary, 211–16 Google, 3, 8, 18–19, 24, 33, 43, 49, 82, 96, 142, 185 memory and, 143–47 search results on, 85–86, 91 Google AdSense, 85 Google Books, 102–3 Google Glass, 99–100 Google Maps, 91 Google Plus, 31 Gopnik, Alison, 33–34 Gould, Glenn, 200–201, 204 GPS, 35, 59, 68, 171 Greenfield, Susan, 20, 25 Grindr, 165, 167, 171, 173–74, 176 Guardian, 66n Gutenberg, Johannes, 11–13, 14, 16, 21, 34, 98 Gutenberg Bible, 83, 103 Gutenberg Galaxy, The (McLuhan), 179, 201 Gutenberg Revolution, The (Man), 12n, 103 GuySpy, 171, 172, 173 Hangul, 12n Harari, Haim, 141 Harry Potter series, 66n Hazlehurst, Ronnie, 74 Heilman, James, 75–79 Henry, William A., III, 84–85 “He Poos Clouds” (Pallett), 164 History of Reading, A (Manguel), 16, 117, 159 Hollinghurst, Alan, 115 Holmes, Sherlock, 147–48 House at Pooh Corner, The (Milne), 93 Hugo, Victor, 20–21 “Idea of North, The” (Gould), 200–201 In Defense of Elitism (Henry), 84–85 Information, The (Gleick), 137 information retrieval, 141–42 Innis, Harold, 202 In Search of Lost Time (Proust), 160 Instagram, 19, 104, 149 Internet, 19, 20, 21, 23, 26–27, 55, 69, 125, 126, 129, 141, 143, 145, 146, 187, 199, 205 brain and, 37–38, 40, 142, 185 going without, 185, 186, 189–97, 200, 208–9 remembering life before, 7–8, 15–16, 21–22, 48, 55, 203 Internship, The, 89 iPad, 21, 31 children and, 26–27, 45 iPhone, see phones iPotty, 26 iTunes, 89 Jobs, Steve, 134 Jones, Patrick, 152n Justification of Johann Gutenberg, The (Morrison), 12 Kaiser Foundation, 27, 28n Kandel, Eric, 154 Kaufman, Charlie, 155 Keen, Andrew, 88 Kelly, Kevin, 43 Kierkegaard, Søren, 49 Kinsey, Alfred, 173 knowledge, 11–12, 75, 80, 82, 83, 86, 92, 94, 98, 141, 145–46 Google Books and, 102–3 Wikipedia and, 63, 78 Koller, Daphne, 95 
Kranzberg, Melvin, 7 Kundera, Milan, 184 Lanier, Jaron, 85, 106–7, 189 latent Dirichlet allocation (LDA), 64–65 Leonardo da Vinci, 56 Lewis, R.


pages: 281 words: 95,852

The Googlization of Everything: by Siva Vaidhyanathan


1960s counterculture, AltaVista, barriers to entry, Berlin Wall, borderless world, Burning Man, Cass Sunstein, choice architecture, cloud computing, computer age, corporate social responsibility, correlation does not imply causation, data acquisition, death of newspapers, don't be evil, Firefox, Francis Fukuyama: the end of history, full text search, global village, Google Earth, Howard Rheingold, informal economy, information retrieval, Joseph Schumpeter, Kevin Kelly, knowledge worker, libertarian paternalism, market fundamentalism, Marshall McLuhan, means of production, Mikhail Gorbachev, Naomi Klein, Network effects, new economy, Nicholas Carr, PageRank, pirate software, Ray Kurzweil, Richard Thaler, Ronald Reagan, side project, Silicon Valley, Silicon Valley ideology, single-payer health, Skype, social web, Steven Levy, Stewart Brand, technoutopianism, The Nature of the Firm, The Structural Transformation of the Public Sphere, Thorstein Veblen, urban decay, web application

In 2009 the core service of Google—its Web search engine—handled more than 70 percent of the Web search business in the United States and more than 90 percent in much of Europe, and grew at impressive rates elsewhere around the world. 15. Thorsten Joachims et al., “Accurately Interpreting Clickthrough Data as Implicit Feedback,” Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Salvador, Brazil: ACM, 2005), 154–61. 16. B. J. Jansen and U. Pooch, “A Review of Web Searching Studies and a Framework for Future Research,” Journal of the American Society for Information Science and Technology 52, no. 3 (2001): 235–46; Amanda Spink and Bernard J. Jansen, Web Search: Public Searching on the Web (Dordrecht: Kluwer Academic Publishers, 2004); Caroline M. Eastman and Bernard J.

Liwen Vaughan and Yanjun Zhang, “Equal Representation by Search Engines? A Comparison of Websites across Countries and Domains,” Journal of Computer-Mediated Communication 12, no. 3 (2007), 69. Wingyan Chung, “Web Searching in a Multilingual World,” Communications of the ACM 51, no. 5 (2008): 32–40; Fotis Lazarinis et al., “Current Research Issues and Trends in Non-English Web Searching,” Information Retrieval 12, no. 3 (2009): 230–50. 70. “Google’s Market Share in Your Country.” 71. Choe Sang-Hun, “Crowd’s Wisdom Helps South Korean Search Engine Beat Google and Yahoo,” New York Times, July 4, 2007. 72. “S. Korea May Clash with Google over Internet Regulation Differences,” Hankyoreh, April 17, 2009; Kim Tong-hyung, “Google Refuses to Bow to Gov’t Pressure,” Korea Times, April 9, 2009. 73. Marcus Alexander, “The Internet and Democratization: The Development of Russian Internet Policy,” Demokratizatsiya 12, no. 4 (Fall 2004): 607–27; Ronald Deibert et al., Access Denied: The Practice and Policy of Global Internet Filtering (Cambridge, MA: MIT Press, 2008). 74.


pages: 353 words: 104,146

European Founders at Work by Pedro Gairifo Santos


business intelligence, cloud computing, crowdsourcing, fear of failure, full text search, information retrieval, inventory management, iterative process, Jeff Bezos, Lean Startup, Mark Zuckerberg, natural language processing, pattern recognition, pre–internet, recommendation engine, Richard Stallman, Silicon Valley, Skype, slashdot, Steve Jobs, Steve Wozniak, subscription business, technology bubble, web application, Y Combinator

We e-mailed some mailing lists. We e-mailed the ISMIR (International Society for Music Information Retrieval) mailing list. They're a group who meet every year about music recommendations and information retrieval in music. We ended up hiring a guy called Norman, who was both a great scientist and understood all the algorithms and captive audience sort of things, but also an excellent programmer who was able to implement all these ideas. So we got really lucky. The first person we hired was great and he just took over. He chucked out all of our crappy recommendation systems we had and built something good, and then improved it constantly for the next several years. So we had some A/B testing, split testing systems in there for the radio so they could try out new tweaks to the algorithms and see what was performing better.
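The split testing mentioned above can be sketched roughly as follows. This is not the system described in the interview: the function names, the experiment label, and the choice of skip rate as the comparison metric are all assumptions made for illustration. The sketch hashes each listener into a stable variant and then compares a simple per-variant rate.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "tweak")):
    """Deterministically map a user to a variant so repeat visits stay consistent."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def skip_rates(events):
    """events: iterable of (variant, skipped) pairs; returns skip rate per variant."""
    totals, skips = {}, {}
    for variant, skipped in events:
        totals[variant] = totals.get(variant, 0) + 1
        skips[variant] = skips.get(variant, 0) + (1 if skipped else 0)
    return {variant: skips[variant] / totals[variant] for variant in totals}


# Hypothetical log: whether a listener skipped the recommended track.
events = [
    ("control", True),
    ("control", False),
    ("tweak", False),
    ("tweak", False),
]
rates = skip_rates(events)
print(assign_variant("listener-1", "radio-tweak"), rates)
```

Hashing the user and experiment name together keeps assignments stable without storing them. A real comparison would also need enough traffic and a significance test before declaring one variant better.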


pages: 153 words: 27,424

REST API Design Rulebook by Mark Masse


anti-pattern, conceptual framework, create, read, update, delete, data acquisition, database schema, hypertext link, information retrieval, web application

The HyperText Mark-up Language (HTML), to represent informative documents that contain links to related documents. The first web server.[8] The first web browser, which Berners-Lee also named “WorldWideWeb” and later renamed “Nexus” to avoid confusion with the Web itself. The first WYSIWYG[9] HTML editor, which was built right into the browser. On August 6, 1991, on the Web’s first page, Berners-Lee wrote, The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents.[10] From that moment, the Web began to grow, at times exponentially. Within five years, the number of web users skyrocketed to 40 million. At one point, the number was doubling every two months. The “universe of documents” that Berners-Lee had described was indeed expanding. In fact, the Web was growing too large, too fast, and it was heading toward collapse.


pages: 123 words: 32,382

Grouped: How Small Groups of Friends Are the Key to Influence on the Social Web by Paul Adams


Airbnb, Cass Sunstein, cognitive dissonance, David Brooks, information retrieval, invention of the telegraph, planetary scale, race to the bottom, Richard Thaler, sentiment analysis, social web, statistical model, The Wisdom of Crowds, web application, white flight

To reiterate, the first driving factor is that our online world is catching up with our offline world. Just as we are surrounded by people throughout our daily life, the web is being rebuilt around people. People are increasingly using the web to seek the information they need from each other, rather than from businesses directly. People always sourced information from each other offline, but up until now, online information retrieval tended to be from a business to a person. The second driving factor is an acknowledgment in our business models of the fact that people live in networks. For many years, we considered people as isolated, independent actors. Most of our consumer behavior models are structured this way—people acting independently, moving down a decision funnel, making objective choices along the way. Recent research in psychology and neuroscience shows that this isn’t how people make decisions.


pages: 855 words: 178,507

The Information: A History, a Theory, a Flood by James Gleick


Ada Lovelace, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, Albert Einstein, AltaVista, bank run, bioinformatics, Brownian motion, butterfly effect, citation needed, Claude Shannon: information theory, clockwork universe, computer age, conceptual framework, crowdsourcing, death of newspapers, discovery of DNA, double helix, Douglas Hofstadter, Eratosthenes, Fellow of the Royal Society, Gödel, Escher, Bach, Henri Poincaré, Honoré de Balzac, index card, informal economy, information retrieval, invention of the printing press, invention of writing, Isaac Newton, Jacquard loom, Jaron Lanier, jimmy wales, John von Neumann, Joseph-Marie Jacquard, Louis Daguerre, Marshall McLuhan, Menlo Park, microbiome, Milgram experiment, Network effects, New Journalism, Norbert Wiener, On the Economy of Machinery and Manufactures, PageRank, pattern recognition, phenotype, pre–internet, Ralph Waldo Emerson, RAND corporation, reversible computing, Richard Feynman, Simon Singh, Socratic dialogue, Stephen Hawking, Steven Pinker, stochastic process, talking drums, the High Line, The Wisdom of Crowds, transcontinental railway, Turing machine, Turing test, women in the workforce

And then, when it was made simple, distilled, counted in bits, information was found to be everywhere. Shannon’s theory made a bridge between information and uncertainty; between information and entropy; and between information and chaos. It led to compact discs and fax machines, computers and cyberspace, Moore’s law and all the world’s Silicon Alleys. Information processing was born, along with information storage and information retrieval. People began to name a successor to the Iron Age and the Steam Age. “Man the food-gatherer reappears incongruously as information-gatherer,”♦ remarked Marshall McLuhan in 1967.♦ He wrote this an instant too soon, in the first dawn of computation and cyberspace. We can see now that information is what our world runs on: the blood and the fuel, the vital principle. It pervades the sciences from top to bottom, transforming every branch of knowledge.

(Eliot said that, too: “Where is the wisdom we have lost in knowledge? / Where is the knowledge we have lost in information?”) It is an ancient observation, but one that seemed to bear restating when information became plentiful—particularly in a world where all bits are created equal and information is divorced from meaning. The humanist and philosopher of technology Lewis Mumford, for example, restated it in 1970: “Unfortunately, ‘information retrieving,’ however swift, is no substitute for discovering by direct personal inspection knowledge whose very existence one had possibly never been aware of, and following it at one’s own pace through the further ramification of relevant literature.”♦ He begged for a return to “moral self-discipline.” There is a whiff of nostalgia in this sort of warning, along with an undeniable truth: that in the pursuit of knowledge, slower can be better.

♦ “THOSE DAYS, WHEN (AFTER PROVIDENCE”: Alexander Pope, The Dunciad (1729) (London: Methuen, 1943), 41. ♦ “KNOWLEDGE OF SPEECH, BUT NOT OF SILENCE”: T. S. Eliot, “The Rock,” in Collected Poems: 1909–1962 (New York: Harcourt Brace, 1963), 147. ♦ “THE TSUNAMI OF AVAILABLE FACT”: David Foster Wallace, Introduction to The Best American Essays 2007 (New York: Mariner, 2007). ♦ “UNFORTUNATELY, ‘INFORMATION RETRIEVING,’ HOWEVER SWIFT”: Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt, Brace, 1970), 182. ♦ “ELECTRONIC MAIL SYSTEM”: Jacob Palme, “You Have 134 Unread Mail! Do You Want to Read Them Now?” in Computer-Based Message Services, ed. Hugh T. Smith (North Holland: Elsevier, 1984), 175–76. ♦ A PAIR OF PSYCHOLOGISTS: C. J. Bartlett and Calvin G. Green, “Clinical Prediction: Does One Sometimes Know Too Much,” Journal of Counseling Psychology 13, no. 3 (1966): 267–70


pages: 648 words: 108,814

Solr 1.4 Enterprise Search Server by David Smiley, Eric Pugh


Amazon Web Services, bioinformatics, cloud computing, continuous integration, database schema, domain-specific language, fault tolerance, Firefox, information retrieval, Internet Archive, web application, Y Combinator

The major features found in Lucene are as follows:
• A text-based inverted index persistent storage for efficient retrieval of documents by indexed terms
• A rich set of text analyzers to transform a string of text into a series of terms (words), which are the fundamental units indexed and searched
• A query syntax with a parser and a variety of query types from a simple term lookup to exotic fuzzy matches
• A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the more likely candidates first, with flexible means to affect the scoring
• A highlighter feature to show words found in context
• A query spellchecker based on indexed content

For even more information on the query spellchecker, check out the Lucene In Action book (LINA for short) by Erik Hatcher and Otis Gospodnetić.

Solr, the Server-ization of Lucene

With the definition of Lucene behind us, Solr can be described succinctly as the server-ization of Lucene.
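As a rough illustration of the first two features in the Lucene list above (an inverted index and a text analyzer), the following Python sketch builds a toy in-memory index. It is not Lucene's implementation or API: the function names `analyze`, `build_inverted_index`, and `search`, the regular-expression tokenizer, and the sample documents are all assumptions made for this example, and there is no scoring, persistence, or fuzzy matching.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumed): lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query (AND semantics)."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents
docs = {
    1: "Solr is the server-ization of Lucene",
    2: "Lucene provides an inverted index",
    3: "An index card is not an inverted index",
}
index = build_inverted_index(docs)
print(sorted(search(index, "inverted index")))  # [2, 3]
print(sorted(search(index, "Lucene")))          # [1, 2]
```

Because the same `analyze` function is applied at index time and at query time, a query such as "LUCENE" matches documents that contain "Lucene"; keeping the two sides consistent is the main point of the sketch.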

See hl.fl fq, query parameter 95 highlighting component, search function argument components limitations 120 about 161 function queries configuring 163 _val_ pseudo-field hack 117 example 161, 163 about 117 hl 164 bf parameter 117 hl.fl 164 Daydreaming search example 119 hl.fragsize 164 example 118 hl.highlightMultiTerm 164 field references 120 hl.mergeContiguous 165 function references 120 hl.requireFieldMatch 164 incorporating, to searches 117 hl.snippets 164 t_trm_lookups 118 hl.usePhraseHighlighter 164 function query, tips 128 hl alternateField 165 function references hl formatter 165 mathematical primitives 121 hl fragmenter 165 function references, function queries 120 hl maxAnalyzedChars 165 parameters 164 G hl, highlighting component 164 hl.fl 161 g, query parameter 95 hl.fl, highlighting component 164 g.op, query parameter 95 hl.fragsize, highlighting component 164 generic XML data structure hl.highlightMultiTerm, highlighting about 92 component 164 appends 111 hl.increment, regex fragmenter 166 arr, XML element 92 hl.mergeContiguous, highlighting bool element 92 component 165 components 111 hl.regex.maxAnalyzedChars, regex date element 93 fragmenter 166 defaults 111 hl.regex.pattern, regex fragmenter 166 double element 92 hl.regex.slop, regex fragmenter 166 first-components 111 hl.requireFieldMatch, highlighting float element 92 component 164 int element 92 hl.snippets, highlighting component 164 invariants 111 hl.usePhraseHighlighter, highlighting last-components 111 component 164 long element 92 hl alternateField, highlighting component lst, XML element 92 165 str element 92 hl formatter, highlighting component Git about 165 URL 11 hl.simple.pre and 165 hl fragmenter, highlighting component 165 factors, committing 285 hl maxAlternateFieldLength, highlighting factors, optimizing 285 component 165 unique document checking, disabling 285 hl maxAnalyzedChars, highlighting Index Searchers 280 component 165 Information Retrieval. See IR home directory, Solr int element 92 bin 15 InternetArchive 226 conf 15 invariants 111 conf/schema.xml 15 Inverse Document Frequency. See IDF conf/solrconfig.xml 15 inverse reciprocals 125 conf/xslt 15 IR 8 data 15 ISOLatin1AccentFilterFactory filter 62 lib 15 issue tracker, Solr 27 HTML, indexing in Solr 227 HTMLStripStandardTokenizerFactory 52 J HTMLStripStandardTokenizerFactory tokenizer 227 J2SE HTMLStripWhitespaceTokenizerFactory 52 with JConsole 212 HTTP caching 277-279 JARmageddon 205 HTTP server request access logs, logging jarowinkler, spellchecker 172 about 201, 202 java.util.logging package 203 log directory, creating 201 Java class names Tailing 202 abbreviated 40 org.apache.solr.schema.BoolField 40 I Java Development Kit (JDK) URL 11 IDF 33 JavaDoc tags 234 idf 112 Java Management Extensions.


pages: 429 words: 114,726

The Computer Boys Take Over: Computers, Programmers, and the Politics of Technical Expertise by Nathan L. Ensmenger


barriers to entry, business process, Claude Shannon: information theory, computer age, deskilling, Firefox, Frederick Winslow Taylor, future of work, Grace Hopper, informal economy, information retrieval, interchangeable parts, Isaac Newton, Jacquard loom, job satisfaction, John von Neumann, knowledge worker, loose coupling, new economy, Norbert Wiener, pattern recognition, performance metric, post-industrial society, Productivity paradox, RAND corporation, Robert Gordon, sorting algorithm, Steve Jobs, Steven Levy, the market place, Thomas Kuhn: the structure of scientific revolutions, Thorstein Veblen, Turing machine, Von Neumann architecture, Y2K

Mahoney, “Computer Science.” 79. Daniel McCracken, “The Human Side of Computing,” Datamation 7, no. 1 (1961): 9–11. Chapter 6 1. “The Thinking Machine,” Time magazine, January 23, 1950, 54–60. 2. J. Lear, “Can a Mechanical Brain Replace You?” Colliers, no. 131 (1953), 58–63. 3. “Office Robots,” Fortune 45 (January 1952), 82–87, 112, 114, 116, 118. 4. Cheryl Knott Malone, “Imagining Information Retrieval in the Library: Desk Set in Historical Context,” IEEE Annals of the History of Computing 24, no. 3 (2002): 14–22. 5. Ibid. 6. Ibid. 7. Thorstein Veblen, The Theory of the Leisure Class (New York: Macmillan, 1899). 8. Thomas Haigh, “The Chromium-Plated Tabulator: Institutionalizing an Electronic Revolution, 1954–1958,” IEEE Annals of the History of Computing 23, no. 4 (2001), 75–104. 9.

New York: Oxford University Press, 2002. Mahoney, Michael. “Software as Science—Science as Software.” In History of Computing: Software Issues, ed. Ulf Hashagen, Reinhard Keil-Slawik, and Arthur Norberg. Berlin: Springer-Verlag, 2002, 25–48. Mahoney, Michael. “What Makes the History of Software Hard.” IEEE Annals of the History of Computing 30 (3) (2008): 8–18. Malone, Cheryl Knott. “Imagining Information Retrieval in the Library: Desk Set in Historical Context.” IEEE Annals of the History of Computing 24 (3) (2002): 14–22. Mandel, Lois. “The Computer Girls.” Cosmopolitan, April 1967, 52–56. Manion, Mark, and William M. Evan. “The Y2K problem: technological risk and professional responsibility.” ACM SIGCAS Computers and Society 29 (4) (1999): 24–29. Markham, Edward. “EDP Schools: An Inside View.”


pages: 286 words: 94,017

Future Shock by Alvin Toffler


Albert Einstein, Brownian motion, Buckminster Fuller, cognitive dissonance, Colonization of Mars, corporate governance, East Village, global village, Haight Ashbury, information retrieval, invention of agriculture, invention of movable type, invention of writing, Marshall McLuhan, Menlo Park, New Urbanism, post-industrial society, RAND corporation, the market place, Thomas Kuhn: the structure of scientific revolutions, urban renewal, Whole Earth Catalog

The profession of airline flight engineer, he notes, emerged and then began to die out within a brief period of fifteen years. A look at the "help wanted" pages of any major newspaper brings home the fact that new occupations are increasing at a mind-dazzling rate. Systems analyst, console operator, coder, tape librarian, tape handler, are only a few of those connected with computer operations. Information retrieval, optical scanning, thin-film technology all require new kinds of expertise, while old occupations lose importance or vanish altogether. When Fortune magazine in the mid-1960's surveyed 1,003 young executives employed by major American corporations, it found that fully one out of three held a job that simply had not existed until he stepped into it. Another large group held positions that had been filled by only one incumbent before them.

Just as economic mass production required large numbers of workers to be assembled in factories, educational mass production required large numbers of students to be assembled in schools. This itself, with its demands for uniform discipline, regular hours, attendance checks and the like, was a standardizing force. Advanced technology will, in the future, make much of this unnecessary. A good deal of education will take place in the student's own room at home or in a dorm, at hours of his own choosing. With vast libraries of data available to him via computerized information retrieval systems, with his own tapes and video units, his own language laboratory and his own electronically equipped study carrel, he will be freed, for much of the time, of the restrictions and unpleasantness that dogged him in the lockstep classroom. The technology upon which these new freedoms will be based will inevitably spread through the schools in the years ahead—aggressively pushed, no doubt, by major corporations like IBM, RCA, and Xerox.


pages: 742 words: 137,937

The Future of the Professions: How Technology Will Transform the Work of Human Experts by Richard Susskind, Daniel Susskind


23andMe, 3D printing, additive manufacturing, AI winter, Albert Einstein, Amazon Mechanical Turk, Amazon Web Services, Andrew Keen, Atul Gawande, Automated Insights, autonomous vehicles, Big bang: deregulation of the City of London, big data - Walmart - Pop Tarts, Bill Joy: nanobots, business process, business process outsourcing, Cass Sunstein, Checklist Manifesto, Clapham omnibus, Clayton Christensen, clean water, cloud computing, computer age, computer vision, conceptual framework, corporate governance, crowdsourcing, Daniel Kahneman / Amos Tversky, death of newspapers, disintermediation, Douglas Hofstadter, Erik Brynjolfsson, Filter Bubble, Frank Levy and Richard Murnane: The New Division of Labor, full employment, future of work, Google Glasses, Google X / Alphabet X, Hacker Ethic, industrial robot, informal economy, information retrieval, interchangeable parts, Internet of things, Isaac Newton, James Hargreaves, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, Joseph Schumpeter, Khan Academy, knowledge economy, lump of labour, Marshall McLuhan, Narrative Science, natural language processing, Network effects, optical character recognition, personalized medicine, pre–internet, Ray Kurzweil, Richard Feynman, Second Machine Age, self-driving car, semantic web, Skype, social web, speech recognition, spinning jenny, strong AI, supply-chain management, telepresence, the market place, The Wealth of Nations by Adam Smith, The Wisdom of Crowds, transaction costs, Turing test, Watson beat the top human players on Jeopardy!, young professional

For us, it represents the coming of the second wave of AI (section 4.9). Here is a system that undoubtedly performs tasks that we would normally think require human intelligence. The version of Watson that competed on Jeopardy! holds over 200 million pages of documents and implements a wide range of AI tools and techniques, including natural language processing, machine learning, speech synthesis, game-playing, information retrieval, intelligent search, knowledge processing and reasoning, and much more. This type of AI, we stress again, is radically different from the first wave of rule-based expert systems of the 1980s (see section 4.9). It is interesting to note, harking back again to the exponential growth of information technology, that the hardware on which Watson ran in 2011 was said to be about the size of the average bedroom.

It has instead been hibernating, conserving its energy, as it were, ticking over quietly in the background, waiting for enabling technologies to emerge and catch up with some of the original aspirations of the early AI scientists. In the thaw that has followed the winter, over the past few years, we have seen a series of significant developments—Big Data, Watson, robotics, and affective computing—that we believe point to a second wave of AI. In summary, the computerization of the work of professionals began in earnest in the late 1970s with information retrieval systems. Then, in the 1980s, there were first-generation AI systems in the professions, whose main focus was expert systems technologies. In the next decade, the 1990s, there was a shift towards the field of knowledge management, when professionals started to store and retrieve not just source materials but know-how and working practices. In the 2000s, Google came to dominate the research habits of many professionals, and grew to become the indispensable tool of practitioners searching for materials, if not for solutions.


Principles of Protocol Design by Robin Sharp


accounting loophole / creative accounting, business process, discrete time, fault tolerance, finite state, Gödel, Escher, Bach, information retrieval, loose coupling, packet switching, RFC: Request For Comment, stochastic process, x509 certificate

Specification of acceptable message types, languages, content encodings, character sets. Challenge-response mechanism for authentication of client (see Section 11.4.4). Coding: ASCII encoding of all PDUs. Addressing: Uniform Resource Identifier (URI) identifies destination system and path to resource. Fault tolerance: Resistance to corruption via optional MD5 checksumming of resource content during transfer. 11.4.3 Web Caching Since most distributed information retrieval applications involve transfer of considerable amounts of data through the network, caching is commonly used in order to reduce the amount of network traffic and reduce response times. HTTP, which is intended to support such applications, therefore includes explicit mechanisms for controlling the operation of caching. Since these illustrate a number of ideas which are important in several application areas, they will be described in some detail here.
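To make the idea of explicit cache control concrete, here is a minimal Python sketch of the freshness decision a cache might make from an HTTP/1.1 `Cache-Control` header. It is a simplification under stated assumptions, not the book's description or a complete implementation: only the `max-age`, `no-cache`, and `no-store` directives are considered, and the helper names `parse_cache_control` and `is_fresh` are hypothetical. A real cache also has to handle `Expires`, heuristic freshness, and revalidation.

```python
def parse_cache_control(header_value):
    """Parse a Cache-Control header value into {directive: value or True}."""
    directives = {}
    for part in header_value.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            # Directives without a value (e.g. "no-cache") are stored as True
            directives[name.strip().lower()] = value.strip().strip('"') if value else True
    return directives


def is_fresh(cache_control, age_seconds):
    """Return True if a stored response may be reused without contacting the server."""
    directives = parse_cache_control(cache_control)
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    if not isinstance(max_age, str) or not max_age.isdigit():
        # No usable explicit lifetime: this sketch conservatively treats the response as stale
        return False
    # A response is fresh while its age is strictly below its freshness lifetime
    return age_seconds < int(max_age)


print(is_fresh("public, max-age=60", 30))    # True
print(is_fresh("public, max-age=60", 60))    # False
print(is_fresh("no-cache, max-age=60", 0))   # False
```

Serving a fresh stored response avoids both the network transfer and the round trip, which is the reduction in traffic and response time that the passage above attributes to caching.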

A good review of coordination languages and the protocols used to implement them can be found in the monograph edited by Omicini et al. [103], while Baumann [6] gives a good overview of the technologies behind mobile agents. The proceedings of the two series of international workshops on “Intelligent Agents for Telecommunication Applications”, and on “Cooperative Information Agents” are good places to search for the results of recent research into both theory and applications of agents in the telecommunications and information retrieval areas. A new trend in the construction of very large distributed systems is to base them on Grid technology. This is a technology for coordinating the activities of a potentially huge number of computers, in order to supply users with computer power, in the form of CPU power, storage and other resources. The analogy is to the electric grid, which provides users with electric power without their having to think about exactly where it comes from.


pages: 189 words: 57,632

Content: Selected Essays on Technology, Creativity, Copyright, and the Future of the Future by Cory Doctorow


book scanning, Brewster Kahle, Burning Man, informal economy, information retrieval, Internet Archive, invention of movable type, Jeff Bezos, Law of Accelerating Returns, Metcalfe's law, mutually assured destruction, new economy, optical character recognition, patent troll, pattern recognition, Ponzi scheme, post scarcity, QWERTY keyboard, Ray Kurzweil, RFID, Sand Hill Road, Skype, slashdot, social software, speech recognition, Steve Jobs, Turing test, Vernor Vinge

This sort of observational metadata is far more reliable than the stuff that human beings create for the purposes of having their documents found. It cuts through the marketing bullshit, the self-delusion, and the vocabulary collisions. Taken more broadly, this kind of metadata can be thought of as a pedigree: who thinks that this document is valuable? How closely correlated have this person's value judgments been with mine in times gone by? This kind of implicit endorsement of information is a far better candidate for an information-retrieval panacea than all the world's schema combined.

Amish for QWERTY (Originally published on the O'Reilly Network, 07/09/2003)

I learned to type before I learned to write. The QWERTY keyboard layout is hard-wired to my brain, such that I can't write anything of significance without a 101-key keyboard in front of me. This has always been a badge of geek pride: unlike the creaking pen-and-ink dinosaurs that I grew up reading, I'm well adapted to the modern reality of technology.


pages: 144 words: 55,142

Interlibrary Loan Practices Handbook by Cherie L. Weible, Karen L. Janke


Firefox, information retrieval, Internet Archive, late fees, optical character recognition, pull request, transaction costs, Works Progress Administration

If an electronic resources management system is not available or used, it is important to find the interlibrary loan terms on a license and record this information in the ILL department. The terms of the license should be upheld. Regular communication with library staff who are responsible for licensing will ensure that ILL staff are aware of any new or updated license information.

Retrieving the Item

If the print item is owned and available, the call number or other location-specific information should be noted on the request. Borrowers might request a particular edition or year, so careful attention should be paid to make sure the call number and item are an exact match. All requests should be collected and sorted by location and the items pulled from the stacks at least daily.


pages: 222 words: 74,587

Paper Machines: About Cards & Catalogs, 1548-1929 by Markus Krajewski, Peter Krapp


business process, double entry bookkeeping, Frederick Winslow Taylor, Gödel, Escher, Bach, index card, Index librorum prohibitorum, information retrieval, invention of movable type, invention of the printing press, Jacques de Vaucanson, Johann Wolfgang von Goethe, Joseph-Marie Jacquard, knowledge worker, means of production, new economy, paper trading, Turing machine

Paper Machines History and Foundations of Information Science Edited by Michael Buckland, Jonathan Furner, and Markus Krajewski Human Information Retrieval by Julian Warner Good Faith Collaboration: The Culture of Wikipedia by Joseph Michael Reagle Jr. Paper Machines: About Cards & Catalogs, 1548–1929 by Markus Krajewski Paper Machines About Cards & Catalogs, 1548–1929 Markus Krajewski translated by Peter Krapp The MIT Press Cambridge, Massachusetts London, England © 2011 Massachusetts Institute of Technology © für die deutsche Ausgabe 2002, Kulturverlag Kadmos Berlin All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.


pages: 242 words: 71,938

The Google Resume: How to Prepare for a Career and Land a Job at Apple, Microsoft, Google, or Any Top Tech Company by Gayle Laakmann Mcdowell


barriers to entry, cloud computing, game design, information retrieval, job-hopping, side project, Silicon Valley, Steve Jobs, why are manhole covers round?

They provide just the right amount of detail to be useful, without overwhelming the reader. 10. The one thing that would make this slightly stronger is for Bill to list the dates of the projects. Distributed Hash Table (Language/Platform: Java/Linux) Successfully implemented Distributed Hash Table based on chord lookup protocol, Chord protocol is one solution for connecting the peers of a P2P network. Chord consistently maps a key onto a node. Information Retrieval System (Language/Platform: Java/Linux) Developed an indexer to index corpus of file and a Query Processor to process the Boolean query. The Query Processor outputs the file name, title, line number, and word position. Implemented using Java API such as serialization and collections (Sortedset, Hashmaps). Achievements Won Star Associate Award at Capgemini for outstanding performance. Received client appreciation for increasing productivity by developing Batch Stat Automation tool. 11.


pages: 411 words: 80,925

What's Mine Is Yours: How Collaborative Consumption Is Changing the Way We Live by Rachel Botsman, Roo Rogers


Airbnb, barriers to entry, Bernie Madoff, bike sharing scheme, Buckminster Fuller, carbon footprint, Cass Sunstein, collaborative consumption, collaborative economy, Community Supported Agriculture, credit crunch, crowdsourcing, dematerialisation, disintermediation, experimental economics, George Akerlof, global village, Hugh Fearnley-Whittingstall, information retrieval, iterative process, Kevin Kelly, Kickstarter, late fees, Mark Zuckerberg, market design, Menlo Park, Network effects, new economy, new new economy, out of africa, Parkinson's law, peer-to-peer lending, Ponzi scheme, pre–internet, recommendation engine, RFID, Richard Stallman, ride hailing / ride sharing, Robert Shiller, Ronald Coase, Search for Extraterrestrial Intelligence, SETI@home, Simon Kuznets, Skype, slashdot, smart grid, South of Market, San Francisco, Stewart Brand, The Nature of the Firm, The Spirit Level, The Wealth of Nations by Adam Smith, The Wisdom of Crowds, Thorstein Veblen, Torches of Freedom, transaction costs, traveling salesman, ultimatum game, Victor Gruen, web of trust, women in the workforce, Zipcar

This section was heavily influenced by Richard Grant’s “Drowning in Plastic: The Great Pacific Garbage Patch Is Twice the Size of France,” Telegraph (April 24, 2009), 5. Statistics on annual consumption of plastic materials come from “Plastics Recycling Information.” Retrieved August 2009, 6. Thomas M. Kostigen, “The World’s Largest Dump: The Great Pacific Garbage Patch,” Discover Magazine (July 10, 2008), 7. Paul Hawken, Amory Lovins, and L. Hunter Lovins, Natural Capitalism (Rocky Mountain Institute, 1999), 4, 8.


pages: 238 words: 77,730

Final Jeopardy: Man vs. Machine and the Quest to Know Everything by Stephen Baker


23andMe, AI winter, Albert Einstein, artificial general intelligence, business process, call centre, clean water, computer age, Frank Gehry, information retrieval, Iridium satellite, Isaac Newton, job automation, pattern recognition, Ray Kurzweil, Silicon Valley, Silicon Valley startup, statistical model, theory of mind, thinkpad, Turing test, Vernor Vinge, Wall-E, Watson beat the top human players on Jeopardy!

And the brain appears to busy itself with this internal dispute instead of systematically trawling for the most promising clues and pathways. Researchers at Harvard, studying the brain scans of people suffering from tip of the tongue syndrome, have noted increased activity in the anterior cingulate—a part of the brain behind the frontal lobe, devoted to conflict resolution and detecting surprise. Few of these conflicts appeared to interfere with Jennings’s information retrieval. During his unprecedented seventy-four-game streak, he routinely won the buzz on more than half the clues. And his snap judgments that the answers were on call in his head somewhere led him to a remarkable 92 percent precision rate, according to statistics compiled by the quiz show’s fans. This topped the average champion by 10 percent. As IBM’s scientists contemplated building a machine that could compete with the likes of Ken Jennings, they understood their constraints.


Deep Work: Rules for Focused Success in a Distracted World by Cal Newport


8-hour work day, Albert Einstein, barriers to entry, business climate, Cal Newport, Capital in the Twenty-First Century by Thomas Piketty, Clayton Christensen, David Brooks, deliberate practice, Donald Trump, Downton Abbey, Erik Brynjolfsson, experimental subject, follow your passion, Frank Gehry, informal economy, information retrieval, Internet Archive, Jaron Lanier, knowledge worker, Mark Zuckerberg, Marshall McLuhan, Merlin Mann, Nate Silver, new economy, Nicholas Carr, popular electronics, remote working, Richard Feynman, Silicon Valley, Silicon Valley startup, Snapchat, statistical model, the medium is the message, Watson beat the top human players on Jeopardy!, web application, winner-take-all economy

If you find yourself glued to a smartphone or laptop throughout your evenings and weekends, then it’s likely that your behavior outside of work is undoing many of your attempts during the workday to rewire your brain (which makes little distinction between the two settings). In this case, I would suggest that you maintain the strategy of scheduling Internet use even after the workday is over. To simplify matters, when scheduling Internet use after work, you can allow time-sensitive communication into your offline blocks (e.g., texting with a friend to agree on where you’ll meet for dinner), as well as time-sensitive information retrieval (e.g., looking up the location of the restaurant on your phone). Outside of these pragmatic exceptions, however, when in an offline block, put your phone away, ignore texts, and refrain from Internet usage. As in the workplace variation of this strategy, if the Internet plays a large and important role in your evening entertainment, that’s fine: Schedule lots of long Internet blocks. The key here isn’t to avoid or even to reduce the total amount of time you spend engaging in distracting behavior, but is instead to give yourself plenty of opportunities throughout your evening to resist switching to these distractions at the slightest hint of boredom.


Mastering Book-Keeping: A Complete Guide to the Principles and Practice of Business Accounting by Peter Marshall


accounting loophole / creative accounting, asset allocation, double entry bookkeeping, information retrieval, the market place

Tel: (01865) 375794. Fax: (01865) 379162. © 2009 Dr Peter Marshall First edition 1992 Second edition 1995 Third edition 1997 Fourth edition 1999 Fifth edition 2001 Sixth edition 2003 Seventh edition 2005 Reprinted 2006 Eighth edition 2009 First published in electronic form 2009 All rights reserved. No part of this work may be reproduced or stored in an information retrieval system (other than for purposes of review) without the express permission of the publisher in writing. The rights of Peter Marshall to be identified as the author of this work have been asserted by him in accordance with the Copyright Designs and Patents Act 1988. British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN 978 1 84803 324 5 Produced for How To Books by Deer Park Productions, Tavistock, Devon Typeset by PDQ Typesetting, Newcastle-under-Lyme, Staffordshire Cover design by Baseline Arts Ltd, Oxford NOTE: The material contained in this book is set out in good faith for general guidance and no liability can be accepted for loss or expense incurred as a result of relying in particular circumstances on statements made in the book.


pages: 671 words: 228,348

Pro AngularJS by Adam Freeman


business process, create, read, update, delete, Google Chrome, information retrieval, inventory management, MVC pattern, place-making, premature optimization, revision control, single page application, web application

For the URL, I specified productData.json. A URL like this will be requested relative to the main HTML document, which means that I don’t have to hard-code protocols, hostnames, and ports into the application.

GET AND POST: PICK THE RIGHT ONE

The rule of thumb is that GET requests should be used for all read-only information retrieval, while POST requests should be used for any operation that changes the application state. In standards-compliance terms, GET requests are for safe interactions (having no side effects besides information retrieval), and POST requests are for unsafe interactions (making a decision or changing something). These conventions are set by the World Wide Web Consortium (W3C). GET requests are addressable—all the information is contained in the URL, so it’s possible to bookmark and link to these addresses.
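The two points above (relative URL resolution, and GET for reads versus POST for state changes) can be sketched with Python's standard library. This is only an illustration and not the AngularJS code the excerpt discusses: the document URL `http://localhost:5000/app/index.html`, the `orders` endpoint, and the order payload are hypothetical, and the requests are constructed but never sent.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request

# Hypothetical address of the main HTML document
DOCUMENT_URL = "http://localhost:5000/app/index.html"

# A relative URL is resolved against the document that requests it,
# so protocol, hostname, and port never appear in the application code.
data_url = urljoin(DOCUMENT_URL, "productData.json")

# Read-only retrieval: no request body, so the method defaults to GET
read_request = Request(data_url)

# State-changing operation: the data travels in the body of a POST
payload = json.dumps({"product": "Kayak", "count": 1}).encode("utf-8")
write_request = Request(
    urljoin(DOCUMENT_URL, "orders"),  # hypothetical endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(data_url)                    # http://localhost:5000/app/productData.json
print(read_request.get_method())   # GET
print(write_request.get_method())  # POST
```

Everything the GET needs is in `data_url`, which is why such a request can be bookmarked or linked to; the POST carries its information in the body, so its URL alone does not describe the operation.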


pages: 519 words: 102,669

Programming Collective Intelligence by Toby Segaran


correlation coefficient, Debian, Firefox, full text search, information retrieval, PageRank, prediction markets, recommendation engine, slashdot, web application

Searching and Ranking This chapter covers full-text search engines, which allow people to search a large set of documents for a list of words, and which rank results according to how relevant the documents are to those words. Algorithms for full-text searches are among the most important collective intelligence algorithms, and many fortunes have been made by new ideas in this field. It is widely believed that Google's rapid rise from an academic project to the world's most popular search engine was based largely on the PageRank algorithm, a variation that you'll learn about in this chapter. Information retrieval is a huge field with a long history. This chapter will only be able to cover a few key concepts, but we'll go through the construction of a search engine that will index a set of documents and leave you with ideas on how to improve things further. Although the focus will be on algorithms for searching and ranking rather than on the infrastructure requirements for indexing large portions of the Web, the search engine you build should have no problem with collections of up to 100,000 pages.
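For readers who want a feel for link-based ranking before reaching that part of the chapter, here is a minimal Python sketch of the commonly taught power-iteration form of PageRank. It is not necessarily the variation the book goes on to develop: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page link graph are all assumptions chosen for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a dict mapping each page to the pages it links to."""
    pages = set(links) | {target for targets in links.values() for target in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of its inbound links
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its damped rank equally among the pages it links to
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed handling of dangling pages: spread their rank over all pages
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "a" is linked to by both "c" and "d"
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = pagerank(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this graph "a" ends up with the highest score because it has the most inbound links, while "d", which nothing links to, keeps only the baseline share.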


How to Form Your Own California Corporation by Anthony Mancuso


corporate governance, distributed generation, estate planning, information retrieval, passive income, passive investing, Silicon Valley

If not, use the form provided on the website.

California Secretary of State Contact Information

Office hours for all locations are Monday through Friday 8:00 a.m. to 5:00 p.m.

Sacramento Office
1500 11th Street Sacramento, CA 95814 (916) 657-5448*
• Name Availability Unit (*recorded information on how to obtain)
• Document Filing Support Unit
• Legal Review Unit
• Information Retrieval and Certification Unit
• Status (*recorded information on how to obtain)
• Statement of Information Unit (filings only) P.O. Box 944230 Sacramento, CA 94244-2300
• Substituted Service of Process (must be hand delivered to the Sacramento office)

San Francisco Regional Office
455 Golden Gate Avenue, Suite 14500 San Francisco, CA 94102-7007 415-557-8000

Fresno Regional Office
1315 Van Ness Ave., Suite 203 Fresno, CA 93721-1729 559-445-6900

Los Angeles Regional Office
300 South Spring Street, Room 12513 Los Angeles, CA 90013-1233 213-897-3062

San Diego Regional Office
1350 Front Street, Suite 2060 San Diego, CA 92101-3609 619-525-4113

California Department of Corporations Contact Information

The Department of Corporations, the office that receives your Notice of Stock Issuance, as explained in Chapter 5, Step 7, has four offices.


pages: 540 words: 103,101

Building Microservices by Sam Newman


airport security, Amazon Web Services, anti-pattern, business process, call centre, continuous integration, create, read, update, delete, defense in depth, Edward Snowden, fault tolerance, index card, information retrieval, Infrastructure as a Service, inventory management, job automation, load shedding, loose coupling, platform as a service, premature optimization, pull request, recommendation engine, social graph, software as a service, the built environment, web application, WebSocket, x509 certificate

The benefit here is that if we use existing software for this purpose, someone has done much of the hard work for us. However, we still need to know how to set up and maintain these systems in a resilient fashion. Starting Again The architecture that gets you started may not be the architecture that keeps you going when your system has to handle very different volumes of load. As Jeff Dean said in his presentation “Challenges in Building Large-Scale Information Retrieval Systems” (WSDM 2009 conference), you should “design for ~10× growth, but plan to rewrite before ~100×.” At certain points, you need to do something pretty radical to support the next level of growth. Recall the story of Gilt, which we touched on in Chapter 6. A simple monolithic Rails application did well for Gilt for two years. Its business became increasingly successful, which meant more customers and more load.


pages: 791 words: 85,159

Social Life of Information by John Seely Brown, Paul Duguid


AltaVista, business process, Claude Shannon: information theory, computer age, cross-subsidies, disintermediation, double entry bookkeeping, Frank Gehry, frictionless, frictionless market, future of work, George Gilder, global village, Howard Rheingold, informal economy, information retrieval, invisible hand, Isaac Newton, Just-in-time delivery, Kevin Kelly, knowledge economy, knowledge worker, loose coupling, Marshall McLuhan, medical malpractice, moral hazard, Network effects, new economy, Productivity paradox, rolodex, Ronald Coase, shareholder value, Silicon Valley, Steve Jobs, Superbowl ad, Ted Nelson, telepresence, the medium is the message, The Nature of the Firm, The Wealth of Nations by Adam Smith, Thomas Malthus, transaction costs, Turing test, Vannevar Bush, Y2K

The difficulty of this central challenge, however, has been obscured by the redefinition that, as we noted earlier, infoenthusiasts tend to indulge. The definitions of knowledge management that began this chapter perform a familiar two-step. First, they define the core problem in terms of information, so that, second, they can put solutions in the province of information technology.13 Here, retrieval looks as easy as search. If information retrieval were all that is required for such things as knowledge management or best practice, HP would have nothing to worry about. It has an abundance of very good information technology. The persistence of HP's problem, then, argues that knowledge management, knowledge, and learning involve more than information. In the rest of this chapter we try to understand what else is involved, looking primarily at knowledge and learning on the assumption that these need to be understood before knowledge management can be considered.


pages: 382 words: 92,138

The Entrepreneurial State: Debunking Public vs. Private Sector Myths by Mariana Mazzucato


Apple II, banking crisis, barriers to entry, Bretton Woods, California gold rush, call centre, carbon footprint, Carmen Reinhart, cleantech, computer age, credit crunch, David Ricardo: comparative advantage, demand response, deskilling, energy security, energy transition, eurozone crisis, everywhere but in the productivity statistics, Financial Instability Hypothesis, full employment, Growth in a Time of Debt, Hyman Minsky, incomplete markets, information retrieval, invisible hand, Joseph Schumpeter, Kenneth Rogoff, knowledge economy, knowledge worker, natural language processing, new economy, offshore financial centre, popular electronics, profit maximization, Ralph Nader, renewable energy credits, rent-seeking, ride hailing / ride sharing, risk tolerance, shareholder value, Silicon Valley, Silicon Valley ideology, smart grid, Steve Jobs, Steve Wozniak, The Wealth of Nations by Adam Smith, Tim Cook: Apple, too big to fail, total factor productivity, trickle-down economics, Washington Consensus, William Shockley: the traitorous eight

Available online at (accessed 10 October 2012). DIUS (Department of Innovation, Universities and Skills). 2008. Innovation Nation, March. Cm 7345. London: DIUS. DoD (United States Department of Defense). 2011. Selected Acquisition Report (SAR): RCS: DD-A&T(Q&A)823-166 : NAVSTAR GPS: Defense Acquisition Management Information Retrieval (DAMIR). Los Angeles, 31 December. DoE (United States Department of Energy). 2007. ‘DOE-Supported Researcher Is Co-winner of 2007 Nobel Prize in Physics’. 10 September. Available online at (accessed 21 January 2013). _____. 2009. ‘DOE Awards $377 Million in Funding for 46 Energy Frontier Research Centers’., 6 August.


pages: 416 words: 106,582

This Will Make You Smarter: 150 New Scientific Concepts to Improve Your Thinking by John Brockman


23andMe, Albert Einstein, Alfred Russel Wallace, banking crisis, Barry Marshall: ulcers, Benoit Mandelbrot, Berlin Wall, biofilm, Black Swan, butterfly effect, Cass Sunstein, cloud computing, congestion charging, correlation does not imply causation, Daniel Kahneman / Amos Tversky, dark matter, data acquisition, David Brooks, delayed gratification, Emanuel Derman, epigenetics, Exxon Valdez, Flash crash, Flynn Effect, hive mind, impulse control, information retrieval, Isaac Newton, Jaron Lanier, John von Neumann, Kevin Kelly, mandelbrot fractal, market design, Mars Rover, Marshall McLuhan, microbiome, Murray Gell-Mann, Nicholas Carr, open economy, place-making, placebo effect, pre–internet, QWERTY keyboard, random walk, randomized controlled trial, rent control, Richard Feynman, Richard Feynman: Challenger O-ring, Richard Thaler, Schrödinger's Cat, security theater, Silicon Valley, stem cell, Steve Jobs, Steven Pinker, Stewart Brand, the scientific method, Thorstein Veblen, Turing complete, Turing machine, Walter Mischel, Whole Earth Catalog

Although some have written about information overload, data smog, and the like, my view has always been the more information online the better, as long as good search tools are available. Sometimes this information is found by directed search using a Web search engine, sometimes by serendipity by following links, and sometimes by asking hundreds of people in your social network or hundreds of thousands of people on a question-answering Web site such as, Quora, or Yahoo Answers. I do not actually know of a real findability index, but tools in the field of information retrieval could be applied to develop one. One of the unsolved problems in the field is how to help the searcher to determine if the information simply is not available. An Assertion Is Often an Empirical Question, Settled by Collecting Evidence Susan Fiske Eugene Higgins Professor of Psychology, Princeton University; author, Envy Up, Scorn Down: How Status Divides Us The most important scientific concept is that an assertion is often an empirical question, settled by collecting evidence.


Writing Effective Use Cases by Alistair Cockburn


business process, create, read, update, delete, finite state, index card, information retrieval, iterative process, recommendation engine, Silicon Valley, web application

The shopper passes a point that the web site owner has predetermined to generate sales leads (dynamic business rule): System generates sales lead. 2e. System has been set up to require the Shopper to identify themselves: Shopper establishes identity 2f. System is set up to interact with known other systems (parts inventory, process & planning) that will affect product availability and selection: 2f.1. System interacts with known other systems (parts inventory, process & planning) to get the needed information. (Retrieve Part Availability, Retrieve Build Schedule). 2f.2. System uses the results to filter or show availability of product and/or options (parts). 2g. Shopper was presented and selects a link to an Industry related web-site: Shopper views other web-site. 2h. System is set up to interact with known Customer Information System: 2h.1. System retrieves customer information from Customer Information System 2h.2.


pages: 352 words: 96,532

Where Wizards Stay Up Late: The Origins of the Internet by Katie Hafner, Matthew Lyon


air freight, Bill Duvall, computer age, conceptual framework, Douglas Engelbart, fault tolerance, Hush-A-Phone, information retrieval, Kevin Kelly, Menlo Park, natural language processing, packet switching, RAND corporation, RFC: Request For Comment, Ronald Reagan, Silicon Valley, speech recognition, Steve Crocker, Steven Levy

The comments appeared in a paper written jointly, using e-mail, with five hundred miles between them. It was “published” electronically in the MsgGroup in 1977. They went on: “As computer communication systems become more powerful, more humane, more forgiving and above all, cheaper, they will become ubiquitous.” Automated hotel reservations, credit checking, real-time financial transactions, access to insurance and medical records, general information retrieval, and real-time inventory control in businesses would all come. In the late 1970s, the Information Processing Techniques Office’s final report to ARPA management on the completion of the ARPANET research program concluded similarly: “The largest single surprise of the ARPANET program has been the incredible popularity and success of network mail. There is little doubt that the techniques of network mail developed in connection with the ARPANET program are going to sweep the country and drastically change the techniques used for intercommunication in the public and private sectors.”


pages: 344 words: 94,332

The 100-Year Life: Living and Working in an Age of Longevity by Lynda Gratton, Andrew Scott


3D printing, Airbnb, carbon footprint, Clayton Christensen, collapse of Lehman Brothers, crowdsourcing, delayed gratification, diversification, Downton Abbey, Erik Brynjolfsson, falling living standards, financial independence, first square of the chessboard, first square of the chessboard / second half of the chessboard, future of work, gender pay gap, gig economy, Google Glasses, indoor plumbing, information retrieval, Isaac Newton, job satisfaction, low skilled workers, Lyft, Network effects, New Economic Geography, pattern recognition, pension reform, Peter Thiel, Ray Kurzweil, Richard Florida, Richard Thaler, Second Machine Age, sharing economy, side project, Silicon Valley, smart cities, Stephen Hawking, Steve Jobs, women in the workforce, young professional

Indeed this is already happening at the authors’ own institution of London Business School, where there is an ever-increasing emphasis on the part of both students and firms on ideas and innovation, creativity and entrepreneurship. Aligned to this is the growing importance of human skills and judgement. There are those who argue that even these skills can be performed by AI – pointing, for example, to the development of IBM’s supercomputer Watson, which is able to perform detailed oncological diagnosis. This means that with diagnostic augmentation, the skill set for the medical profession will shift from information retrieval to deeper intuitive experience, more person-to-person skills and greater emphasis on team motivation and judgement. The same technological developments will occur in the education sector, where digital teaching will replace textbooks and classroom teaching and the valuable skills will move towards the intricate human skills of empathy, motivation and encouragement. Across a long productive life there will be an increasing focus on general portable skills and capabilities such as mental flexibility and agility.


pages: 364 words: 102,926

What the F: What Swearing Reveals About Our Language, Our Brains, and Ourselves by Benjamin K. Bergen


correlation does not imply causation, information retrieval, pre–internet, Ronald Reagan, statistical model, Steven Pinker

Smith, M. (March 3, 2014). Richard Sherman calls NFL banning the N-word “an atrocious idea.” NBC Sports. Retrieved from http://profootballtalk.nbcsports.com/2014/03/03/richard-sherman-calls-nfl-banning-the-n-word-an-atrocious-idea. Snopes. (October 11, 2014). Pluck Yew. Retrieved from Social Security Administration. (n.d.). Background information. Retrieved from Songbass. (November 3, 2008). Obama gives McCain the middle finger. YouTube. Retrieved from Spears, A. K. (1998). African-American language use: Ideology and so-called obscenity. In African-American English: Structure, history, and use, Salikoko S. Mufwene (Ed.), 226–250. New York: Routledge.


pages: 433 words: 106,048

The End of Illness by David B. Agus M. D.


Danny Hillis, discovery of penicillin, double helix, epigenetics, germ theory of disease, Google Earth, impulse control, information retrieval, meta analysis, meta-analysis, microbiome, Murray Gell-Mann, pattern recognition, personalized medicine, randomized controlled trial, risk tolerance, Steve Jobs, the scientific method

The poll also found that 68 percent of those who have access have used the Internet to look for information about specific medicines, and nearly four in ten use it to look for other patients’ experiences of a condition. Without a doubt new technologies are helping more people around the world to find out more about their health and to make better-informed decisions, but often their online searches lack usefulness because the information retrieved cannot be personalized. Relying on dodgy information can easily lead people to take risks with inappropriate tests and treatments, wasting money and causing unnecessary worry. But with a health-record system like Dell’s and its developing infrastructure to tailor health advice and guidance to individual people based on their personal records, the outcome could be revolutionary to our health-care system, instigating the reform that’s sorely needed.


Paper Knowledge: Toward a Media History of Documents by Lisa Gitelman


Andrew Keen, computer age, corporate governance, deskilling, Douglas Engelbart, East Village, information retrieval, Internet Archive, invention of movable type, Jaron Lanier, knowledge economy, Marshall McLuhan, Mikhail Gorbachev, national security letter, On the Economy of Machinery and Manufactures, optical character recognition, profit motive, RAND corporation, RFC: Request For Comment, Silicon Valley, Steve Jobs, The Structural Transformation of the Public Sphere, Turing test, Works Progress Administration

Even his text was drawn, in the sense that the text display program used to put legends on drawings built characters “by means of special tables which indicate the locations of line and circle segments to make up the letters and numbers”—a process that, in a sense, looked forward to PostScript, TrueType, and pdf. 30 Characters were typed in, but then they were generated graphically by the system for display on screen. Both microform databanks and Sutherland’s Sketchpad gesture selectively toward a prehistory for the pdf page image because both—though differently—mobilized pages and images of pages for a screen-based interface. The databanks retrieved televisual reproductions of existing source pages, modeling not just information retrieval but also encouraging certain citation norms (since users could indicate that, for example, “the information appears on page 10”). Meanwhile, Sketchpad established a page as a fixed computational field, a visible ground on which further computational objects might be rendered. The portable document format is related more tenuously to mainframes and microform, even though today’s reference databases—the majority of which of course include and serve up pdf—clearly descend in some measure from experiments like Intrex and the Times Information Bank.


pages: 386 words: 91,913

The Elements of Power: Gadgets, Guns, and the Struggle for a Sustainable Future in the Rare Metal Age by David S. Abraham


3D printing, carbon footprint, clean water, cleantech, Deng Xiaoping, Elon Musk, glass ceiling, global supply chain, information retrieval, Internet of things, new economy, oil shale / tar sands, oil shock, reshoring, Ronald Reagan, Silicon Valley, South China Sea, Steve Ballmer, Steve Jobs, telemarketer, Tesla Model S, thinkpad, upwardly mobile, uranium enrichment, Y2K

The weights of rare earth materials included in the congressional committee report likely include a substantial weight portion from other metals as well. Ronald H. O’Rourke, “Navy Virginia (SSN-774) Class Attack Submarine Procurement: Background and Issues for Congress,” Congressional Research Service, July 31, 2014, For information on Virginia class submarine purchases, see, “DDG 51 Arleigh Burke Class Guided Missile Destroyer,” Defense Acquisition Management Information Retrieval, December 31, 2012, accessed December 18, 2014, For information on the DDG 51 Aegis Destroyer Ships as of 2012, including expected production until 2016, see “Next Global Positioning System Receiver Equipment,” Committee Reports 113th Congress (2013–2014), House Report 113-102, June 7, 2013, accessed December 18, 2014,


The Dream Machine: J.C.R. Licklider and the Revolution That Made Computing Personal by M. Mitchell Waldrop


Ada Lovelace, air freight, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, Albert Einstein, anti-communist, Apple II, battle of ideas, Berlin Wall, Bill Duvall, Bill Gates: Altair 8800, Byte Shop, Claude Shannon: information theory, computer age, conceptual framework, cuban missile crisis, double helix, Douglas Engelbart, Dynabook, experimental subject, fault tolerance, Frederick Winslow Taylor, friendly fire, From Mathematics to the Technologies of Life and Death, Haight Ashbury, Howard Rheingold, information retrieval, invisible hand, Isaac Newton, James Watt: steam engine, Jeff Rulifson, John von Neumann, Menlo Park, New Journalism, Norbert Wiener, packet switching, pink-collar, popular electronics, RAND corporation, RFC: Request For Comment, Silicon Valley, Steve Crocker, Steve Jobs, Steve Wozniak, Steven Levy, Stewart Brand, Ted Nelson, Turing machine, Turing test, Vannevar Bush, Von Neumann architecture, Wiener process

I first tried to find close relevance within established disciplines [such as artificial intelligence,] but in each case I found that the people I would talk with would immediately translate my admittedly strange (for the times) statements of purpose and possibility into their own discipline's framework."9 At the 1960 meeting of the American Documentation Institute, a talk he gave was greeted with yawns, and his proposed augmentation environment was dismissed as just another information-retrieval system. No, Engelbart realized, if his augmentation ideas were ever going to fly, he would have to create a new discipline from scratch. And to do that, he would have to give this new discipline a conceptual framework all its own, a manifesto that would lay out his thinking in the most compelling way possible. Creating that manifesto took him the better part of two years. "Augmenting the Human Intellect: A Conceptual Framework" wouldn't be completed until October 1962.

No, he didn't, though, as is so often the case with Bob Taylor, the reasons were more complex than they seemed on the surface. To begin with, while he very much liked the idea of having a big influence on PARC's research, he considered Pake's notion of a "graphics research group" a complete nonstarter. Sure, graphics technology was a critical part of this whatever-it-was he wanted to create. But so were text display, mass-storage technology, networking technology, information retrieval, and all the rest. Taylor wanted to go after the whole, integrated vision, just as he'd gone after the whole Intergalactic Network. To focus entirely on graphics would be like trying to build the Arpanet by focusing entirely on the technology of telephone lines. And yet Pake did have a point, damn it. At age thirty-eight Taylor had spent his entire adult career funding computer research, but he had never actually done computer research.


pages: 696 words: 143,736

The Age of Spiritual Machines: When Computers Exceed Human Intelligence by Ray Kurzweil


Ada Lovelace, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, Albert Einstein, Any sufficiently advanced technology is indistinguishable from magic, Buckminster Fuller, call centre, cellular automata, combinatorial explosion, complexity theory, computer age, computer vision, cosmological constant, cosmological principle, Danny Hillis, double helix, Douglas Hofstadter, first square of the chessboard / second half of the chessboard, fudge factor, George Gilder, Gödel, Escher, Bach, I think there is a world market for maybe five computers, information retrieval, invention of movable type, Isaac Newton, iterative process, Jacquard loom, John von Neumann, Lao Tzu, Law of Accelerating Returns, mandelbrot fractal, Marshall McLuhan, Menlo Park, natural language processing, Norbert Wiener, optical character recognition, pattern recognition, phenotype, Ralph Waldo Emerson, Ray Kurzweil, Richard Feynman, Schrödinger's Cat, Search for Extraterrestrial Intelligence, self-driving car, Silicon Valley, speech recognition, Steven Pinker, Stewart Brand, stochastic process, technological singularity, Ted Kaczynski, telepresence, the medium is the message, traveling salesman, Turing machine, Turing test, Whole Earth Review, Y2K

Cybernetic poet A computer program that is able to create original poetry. Cybernetics A term coined by Norbert Wiener to describe the “science of control and communication in animals and machines.” Cybernetics is based on the theory that intelligent living beings adapt to their environments and accomplish objectives primarily by reacting to feedback from their surroundings. Database The structured collection of data that is designed in connection with an information retrieval system. A database management system (DBMS) allows monitoring, updating, and interacting with the database. Debugging The process of discovering and correcting errors in computer hardware and software. The issue of bugs or errors in a program will become increasingly important as computers are integrated into the human brain and physiology throughout the twenty-first century. The first “bug” was an actual moth, discovered by Grace Murray Hopper, the first programmer of the Mark I computer.


pages: 470 words: 109,589

Apache Solr 3 Enterprise Search Server by Unknown


bioinformatics, continuous integration, database schema, fault tolerance, Firefox, full text search, information retrieval, Internet Archive, natural language processing, performance metric, platform as a service, web application

The major features found in Lucene are: An inverted index for efficient retrieval of documents by indexed terms. The same technology supports numeric data with range queries too. A rich set of chainable text analysis components, such as tokenizers and language-specific stemmers that transform a text string into a series of terms (words). A query syntax with a parser and a variety of query types from a simple term lookup to exotic fuzzy matching. A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the more likely candidates first, with flexible means to affect the scoring. Search enhancing features like: A highlighter feature to show query words found in context. A query spellchecker based on indexed content or a supplied dictionary. A "more like this" feature to list documents that are statistically similar to provided text. To learn more about Lucene, read Lucene In Action, 2nd Edition by Michael McCandless, Erik Hatcher, and Otis Gospodnetić.
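To make the inverted-index idea in the excerpt above concrete, here is a minimal Python sketch. It is not Lucene's API or implementation: the function names (`tokenize`, `build_inverted_index`, `search`) and the sample documents are assumptions chosen for illustration, the tokenizer is a crude stand-in for a real analysis chain (no stemming), and there is no scoring, only a boolean AND over terms.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms.

    A hypothetical stand-in for a real analyzer (no stemming, no stop words).
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # .get() avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Illustrative documents (made up for this sketch).
docs = {
    1: "Lucene uses an inverted index for retrieval",
    2: "A tokenizer turns text into terms",
    3: "The inverted index maps terms to documents",
}
index = build_inverted_index(docs)
print(search(index, "inverted index"))
print(search(index, "Inverted TERMS"))
```

The point of the structure is that a lookup touches only the posting sets of the query terms rather than scanning every document; a ranking step of the kind the excerpt mentions would then order the matching ids.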


pages: 597 words: 119,204

Website Optimization by Andrew B. King


AltaVista, bounce rate, don't be evil, Firefox, In Cold Blood by Truman Capote, information retrieval, iterative process, medical malpractice, Network effects, performance metric, search engine result page, second-price auction, second-price sealed-bid, semantic web, Silicon Valley, slashdot, social graph, Steve Jobs, web application

If all you have is the total page download time, or even large "buckets" of time (like content download versus network), you won't be able to improve performance. This is especially true in the more complex world of the Web where application calls are hidden within the content portion of the page and third parties are critical to the overall download time. You need to have a view into every piece of the page load in order to manage and improve it. * * * [167] Roast, C. 1998. "Designing for Delay in Interactive Information Retrieval." Interacting with Computers 10 (1): 87–104. [168] Balashov, K., and A. King. 2003. "Compressing the Web." In Speed Up Your Site: Web Site Optimization. Indianapolis: New Riders, 412. A test of 25 popular sites found that HTTP gzip compression saved 75% on average off text file sizes and 37% overall. [169] Bent, L. et al. 2004. "Characterization of a large web site population with implications for content delivery."


pages: 542 words: 161,731

Alone Together by Sherry Turkle


Albert Einstein, Columbine, global village, Hacker Ethic, helicopter parent, Howard Rheingold, industrial robot, information retrieval, Jacques de Vaucanson, Jaron Lanier, Kevin Kelly, Loebner Prize, Marshall McLuhan, meta analysis, meta-analysis, Nicholas Carr, Norbert Wiener, Ralph Waldo Emerson, Rodney Brooks, Skype, stem cell, technoutopianism, The Great Good Place, the medium is the message, theory of mind, Turing test, Vannevar Bush, Wall-E, women in the workforce

From 1996 on, Thad Starner, who like Steve Mann was a member of the MIT cyborg group, worked on the Remembrance Agent, a tool that would sit on your computer desktop (or now, your mobile device) and not only record what you were doing but make suggestions about what you might be interested in looking at next. See Bradley J. Rhodes and Thad Starner, “Remembrance Agent: A Continuously Running Personal Information Retrieval System,” Proceedings of the First International Conference on the Practical Application of Intelligent Agents and Multi Agent Technology (PAAM ’96), 487-495 (accessed December 14, 2009). Albert Frigo’s “Storing, Indexing and Retrieving My Autobiography,” presented at the 2004 Workshop on Memory and the Sharing of Experience in Vienna, Austria, describes a device to take pictures of what comes into his hand.


pages: 436 words: 124,373

Galactic North by Alastair Reynolds


back-to-the-land, Buckminster Fuller, hive mind, information retrieval, risk/return, stem cell, trade route

"Disclose all our confidential practices while you're at it, Mirsky," Seven said. She glared at him through her visor. "Veda would have figured it out." "We'll never know now, will we?" "What does it matter?" she said. "Gonna kill them anyway, aren't you?" Seven flashed an arc of teeth filed to points and waved a hand towards the female pirate. "Allow me to introduce Mirsky, our loose-tongued but efficient information retrieval specialist. She's going to take you on a little trip down memory lane; see if we can't remember those access codes." "What codes?" "It'll come back to you," Seven said. They were taken through the tunnels, past half-assembled mining machines, onto the surface and then into the pirate ship. The ship was huge: most of it living space. Cramped corridors snaked through hydroponics galleries of spring wheat and dwarf papaya, strung with xenon lights.


pages: 550 words: 154,725

The Idea Factory: Bell Labs and the Great Age of American Innovation by Jon Gertner


Albert Einstein, back-to-the-land, Black Swan, business climate, Claude Shannon: information theory, Clayton Christensen, complexity theory, corporate governance, cuban missile crisis, horn antenna, Hush-A-Phone, information retrieval, invention of the telephone, James Watt: steam engine, Karl Jansky, knowledge economy, Nicholas Carr, Norbert Wiener, Picturephone, Richard Feynman, Sand Hill Road, Silicon Valley, Skype, Steve Jobs, Telecommunications Act of 1996, traveling salesman, uranium enrichment, William Shockley: the traitorous eight

A visitor could also try something called a portable “pager,” a big, blocky device that could alert doctors and other busy professionals when they received urgent calls.2 New York’s fair would dwarf Seattle’s. The crowds were expected to be immense—probably somewhere around 50 or 60 million people in total. Pierce and David’s 1961 memo recommended a number of exhibits: “personal hand-carried telephones,” “business letters in machine-readable form, transmitted by wire,” “information retrieval from a distant computer-automated library,” and “satellite and space communications.” By the time the fair opened in April 1964, though, the Bell System exhibits, housed in a huge white cantilevered building nicknamed the “floating wing,” described a more conservative future than the one Pierce and David had envisioned. The exhibit was primarily explanatory. Visitors could get a sense of how quality control worked at Western Electric factories, or how researchers at Bell Labs grew pure crystals necessary for transistors.


pages: 574 words: 164,509

Superintelligence: Paths, Dangers, Strategies by Nick Bostrom


agricultural Revolution, AI winter, Albert Einstein, algorithmic trading, anthropic principle, anti-communist, artificial general intelligence, autonomous vehicles, barriers to entry, bioinformatics, brain emulation, cloud computing, combinatorial explosion, computer vision, cosmological constant, dark matter, DARPA: Urban Challenge, data acquisition, delayed gratification, demographic transition, Douglas Hofstadter, Drosophila, Elon Musk, epigenetics, fear of failure, Flash crash, Flynn Effect, friendly AI, Gödel, Escher, Bach, income inequality, industrial robot, informal economy, information retrieval, interchangeable parts, iterative process, job automation, John von Neumann, knowledge worker, Menlo Park, meta analysis, meta-analysis, mutually assured destruction, Nash equilibrium, Netflix Prize, new economy, Norbert Wiener, NP-complete, nuclear winter, optical character recognition, pattern recognition, performance metric, phenotype, prediction markets, price stability, principal–agent problem, race to the bottom, random walk, Ray Kurzweil, recommendation engine, reversible computing, social graph, speech recognition, Stanislav Petrov, statistical model, stem cell, Stephen Hawking, strong AI, superintelligent machines, supervolcano, technological singularity, technoutopianism, The Coming Technological Singularity, The Nature of the Firm, Thomas Kuhn: the structure of scientific revolutions, transaction costs, Turing machine, Vernor Vinge, Watson beat the top human players on Jeopardy!, World Values Survey

Software polices the world’s email traffic, and despite continual adaptation by spammers to circumvent the countermeasures being brought against them, Bayesian spam filters have largely managed to hold the spam tide at bay. Software using AI components is responsible for automatically approving or declining credit card transactions, and continuously monitors account activity for signs of fraudulent use. Information retrieval systems also make extensive use of machine learning. The Google search engine is, arguably, the greatest AI system that has yet been built. Now, it must be stressed that the demarcation between artificial intelligence and software in general is not sharp. Some of the applications listed above might be viewed more as generic software applications rather than AI in particular—though this brings us back to McCarthy’s dictum that when something works it is no longer called AI.


pages: 320 words: 87,853

The Black Box Society: The Secret Algorithms That Control Money and Information by Frank Pasquale


Affordable Care Act / Obamacare, algorithmic trading, Amazon Mechanical Turk, asset-backed security, Atul Gawande, bank run, barriers to entry, Berlin Wall, Bernie Madoff, Black Swan, bonus culture, Brian Krebs, call centre, Capital in the Twenty-First Century by Thomas Piketty, Chelsea Manning, cloud computing, collateralized debt obligation, corporate governance, Credit Default Swap, credit default swaps / collateralized debt obligations, crowdsourcing, cryptocurrency, Debian, don't be evil, Edward Snowden, Fall of the Berlin Wall, Filter Bubble, financial innovation, Flash crash, full employment, Goldman Sachs: Vampire Squid, Google Earth, Hernando de Soto, High speed trading, hiring and firing, housing crisis, informal economy, information retrieval, interest rate swap, Internet of things, invisible hand, Jaron Lanier, Jeff Bezos, job automation, Julian Assange, Kevin Kelly, knowledge worker, Kodak vs Instagram, kremlinology, late fees, London Interbank Offered Rate, London Whale, Mark Zuckerberg, mobile money, moral hazard, new economy, Nicholas Carr, offshore financial centre, PageRank, pattern recognition, precariat, profit maximization, profit motive, quantitative easing, race to the bottom, recommendation engine, regulatory arbitrage, risk-adjusted returns, search engine result page, shareholder value, Silicon Valley, Snapchat, Spread Networks laid a new fibre optics cable between New York and Chicago, statistical arbitrage, statistical model, Steven Levy, the scientific method, too big to fail, transaction costs, two-sided market, universal basic income, Upton Sinclair, value at risk, WikiLeaks

But here, again, competition may be illusory: it’s hard to see the rationale (or investor or public enthusiasm) for subjecting millions of volumes (many of them delicate) to another round of scanning. Once again, Google reigns by default. The question now is whether its dictatorship will be benign. Does Google intend Book Search to promote widespread public access, or is it envisioning finely tiered access to content, granted (and withheld) in opaque ways?168 Will Google grant open access to search results on its platform, so experts in library science and information retrieval can understand (and critique) its orderings of results?169 Finally, where will the profits go from this immense cooperative project? Will they be distributed fairly among contributors, or will this be another instance in which the aggregator of content captures an unfair share of revenues from well-established dynamics of content digitization? If the Internet is to prosper, all who provide content—its critical source of value—must share in the riches now enjoyed mainly by the megafirms that organize it.170 And to the extent that Google, Amazon, or any other major search engine limits access to an index of books, its archiving projects are suspect, whatever public-spirited slogans it may adduce in defense of them.171 Philosopher Iris Murdoch once said, “Man is a creature who makes pictures of himself and then comes to resemble the picture.


pages: 303 words: 67,891

Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms: Proceedings of the Agi Workshop 2006 by Ben Goertzel, Pei Wang


AI winter, artificial general intelligence, bioinformatics, brain emulation, combinatorial explosion, complexity theory, computer vision, conceptual framework, correlation coefficient, epigenetics, friendly AI, information retrieval, Isaac Newton, John Conway, Loebner Prize, Menlo Park, natural language processing, Occam's razor, p-value, pattern recognition, performance metric, Ray Kurzweil, Rodney Brooks, semantic web, statistical model, strong AI, theory of mind, traveling salesman, Turing machine, Turing test, Von Neumann architecture, Y2K

NARS can be connected to existing knowledge bases, such as Cyc (for commonsense knowledge), WordNet (for linguistic knowledge), Mizar (for mathematical knowledge), and so on. For each of them, a special interface module should be able to approximately translate knowledge from its original format into Narsese. • The Internet. It is possible for NARS to be equipped with additional modules, which use techniques like semantic web, information retrieval, and data mining, to directly acquire certain knowledge from the Internet, and put them into Narsese. • Natural language interface. After NARS has learned a natural language (as discussed previously), it should be able to accept knowledge from various sources in that language. Additionally, interactive tutoring will be necessary, which allows a human trainer to monitor the establishing of the knowledge base, to answer questions, to guide the system to form a proper goal structure and priority distributions among its concepts, tasks, and beliefs.


pages: 570 words: 115,722

The Tangled Web: A Guide to Securing Modern Web Applications by Michal Zalewski


barriers to entry, business process, defense in depth, easy for humans, difficult for computers, fault tolerance, finite state, Firefox, Google Chrome, information retrieval, RFC: Request For Comment, semantic web, Steve Jobs, telemarketer, Turing test, Vannevar Bush, web application, WebRTC, WebSocket

The subsequent proposals experimented with an increasingly bizarre set of methods to permit interactions other than retrieving a document or running a script, including such curiosities as SHOWMETHOD, CHECKOUT, or—why not—SPACEJUMP.[122] Most of these thought experiments have been abandoned in HTTP/1.1, which settles on a more manageable set of eight methods. Only the first two request types—GET and POST—are of any significance to most of the modern Web. GET The GET method is meant to signify information retrieval. In practice, it is used for almost all client-server interactions in the course of a normal browsing session. Regular GET requests carry no browser-supplied payloads, although they are not strictly prohibited from doing so. The expectation is that GET requests should not have, to quote the RFC, “significance of taking an action other than retrieval” (that is, they should make no persistent changes to the state of the application).
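The distinction drawn in this excerpt can be illustrated with a minimal Python sketch. It assumes the eight methods are those listed in RFC 2616 and that, per the RFC sentence quoted above, only GET and HEAD are treated as retrieval-only. The names `HTTP_1_1_METHODS`, `RETRIEVAL_ONLY` and `is_retrieval_only` are hypothetical and do not come from the book.

```python
# Illustrative sketch only; the names below are hypothetical.

# The eight request methods defined by HTTP/1.1 (RFC 2616, section 9).
HTTP_1_1_METHODS = (
    "OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE", "TRACE", "CONNECT",
)

# RFC 2616 (section 9.1.1) says GET and HEAD should not have the
# significance of taking an action other than retrieval.
RETRIEVAL_ONLY = frozenset({"GET", "HEAD"})


def is_retrieval_only(method: str) -> bool:
    """Return True if the method is expected to make no persistent changes."""
    name = method.upper()
    if name not in HTTP_1_1_METHODS:
        raise ValueError(f"not an HTTP/1.1 method: {method!r}")
    return name in RETRIEVAL_ONLY


if __name__ == "__main__":
    for m in HTTP_1_1_METHODS:
        print(m, is_retrieval_only(m))
```

This is a convention that well-behaved applications follow, not something the protocol enforces: a server can still change state in response to a GET request.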


pages: 492 words: 153,565

Countdown to Zero Day: Stuxnet and the Launch of the World's First Digital Weapon by Kim Zetter


Ayatollah Khomeini, Brian Krebs, crowdsourcing, data acquisition, Doomsday Clock, Edward Snowden, facts on the ground, Firefox, friendly fire, Google Earth, information retrieval, Julian Assange, Loma Prieta earthquake, Maui Hawaii, pre–internet, RAND corporation, Silicon Valley, skunkworks, smart grid, smart meter, South China Sea, Stuxnet, uranium enrichment, Vladimir Vetrov: Farewell Dossier, WikiLeaks, Y2K, zero day

See “Software Problem Led to System Failure at Dhahran, Saudi Arabia,” US Government Accountability Office, February 4, 1992, available at 22 Bryan, “Lessons from Our Cyber Past.” 23 “The Information Operations Roadmap,” dated October 30, 2003, is a seventy-four-page report that was declassified in 2006, though the pages dealing with computer network attacks are heavily redacted. The document is available at 24 Arquilla Frontline “CyberWar!” interview. A Washington Post story indicates that attacks on computers controlling air-defense systems in Kosovo were launched from electronic-jamming aircraft rather than over computer networks from ground-based keyboards. Bradley Graham, “Military Grappling with Rules for Cyber,” Washington Post, November 8, 1999. 25 James Risen, “Crisis in the Balkans: Subversion; Covert Plan Said to Take Aim at Milosevic’s Hold on Power,” New York Times, June 18, 1999.


pages: 396 words: 117,149

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World by Pedro Domingos


3D printing, Albert Einstein, Amazon Mechanical Turk, Arthur Eddington, Benoit Mandelbrot, bioinformatics, Black Swan, Brownian motion, cellular automata, Claude Shannon: information theory, combinatorial explosion, computer vision, constrained optimization, correlation does not imply causation, crowdsourcing, Danny Hillis, data is the new oil, double helix, Douglas Hofstadter, Erik Brynjolfsson, experimental subject, Filter Bubble, future of work, global village, Google Glasses, Gödel, Escher, Bach, information retrieval, job automation, John Snow's cholera map, John von Neumann, Joseph Schumpeter, Kevin Kelly, lone genius, mandelbrot fractal, Mark Zuckerberg, Moneyball by Michael Lewis explains big data, Narrative Science, Nate Silver, natural language processing, Netflix Prize, Network effects, NP-complete, P = NP, PageRank, pattern recognition, phenotype, planetary scale, pre–internet, random walk, Ray Kurzweil, recommendation engine, Richard Feynman, Second Machine Age, self-driving car, Silicon Valley, speech recognition, statistical model, Stephen Hawking, Steven Levy, Steven Pinker, superintelligent machines, the scientific method, The Signal and the Noise by Nate Silver, theory of mind, transaction costs, Turing machine, Turing test, Vernor Vinge, Watson beat the top human players on Jeopardy!, white flight

Milton Friedman argues for oversimplified theories in “The methodology of positive economics,” which appears in Essays in Positive Economics (University of Chicago Press, 1966). The use of Naïve Bayes in spam filtering is described in “Stopping spam,” by Joshua Goodman, David Heckerman, and Robert Rounthwaite (Scientific American, 2005). “Relevance weighting of search terms,”* by Stephen Robertson and Karen Sparck Jones (Journal of the American Society for Information Science, 1976), explains the use of Naïve Bayes–like methods in information retrieval. “First links in the Markov chain,” by Brian Hayes (American Scientist, 2013), recounts Markov’s invention of the eponymous chains. “Large language models in machine translation,”* by Thorsten Brants et al. (Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007), explains how Google Translate works.


pages: 455 words: 138,716

The Divide: American Injustice in the Age of the Wealth Gap by Matt Taibbi


banking crisis, Bernie Madoff, butterfly effect, collapse of Lehman Brothers, collateralized debt obligation, Credit Default Swap, credit default swaps / collateralized debt obligations, Edward Snowden, ending welfare as we know it, forensic accounting, Gordon Gekko, greed is good, illegal immigration, information retrieval, London Interbank Offered Rate, London Whale, naked short selling, offshore financial centre, Ponzi scheme, profit motive, regulatory arbitrage, short selling, telemarketer, too big to fail, War on Poverty

“Just think what I could do with your emails,” he hissed, adding that he, Spyro, was going to “consider all my options as maintaining our confidentiality,” and that if the executive didn’t cooperate, he could “no longer rely on my discretion.” Contogouris seemed to be playing a triple game. First, he was genuinely trying to deliver an informant to the FBI and set himself up as an FBI informant. Second, he was trying to deliver confidential information to the hedge funds, to whom he had set himself up as an expert at information retrieval. And third, he was playing secret source to “reputable” journalists, to whom he had promised to deliver stunning exposés. Contogouris even referenced one of those contacts in his adolescent coded emails to Sender sent from London that day: CONTOGOURIS: We have been rapping here about the postman. He’s going to deliver mail. The senders want a message delivered*11 “The postman” here was Boyd of the New York Post, with whom Contogouris had been working to prepare a major “exposé” on Fairfax.


pages: 365 words: 117,713

The Selfish Gene by Richard Dawkins


double helix, information retrieval, Necker cube, pattern recognition, phenotype, prisoner's dilemma

A notable advance was the evolutionary 'invention' of memory. By this device, the timing of muscle contractions could be influenced not only by events in the immediate past, but by events in the distant past as well. The memory, or store, is an essential part of a digital computer too. Computer memories are more reliable than human ones, but they are less capacious, and enormously less sophisticated in their techniques of information-retrieval. One of the most striking properties of survival-machine behaviour is its apparent purposiveness. By this I do not just mean that it seems to be well calculated to help the animal's genes to survive, although of course it is. I am talking about a closer analogy to human purposeful behaviour. When we watch an animal 'searching' for food, or for a mate, or for a lost child, we can hardly help imputing to it some of the subjective feelings we ourselves experience when we search.


pages: 437 words: 113,173

Age of Discovery: Navigating the Risks and Rewards of Our New Renaissance by Ian Goldin, Chris Kutarna


2013 Report for America's Infrastructure - American Society of Civil Engineers - 19 March 2013, 3D printing, Airbnb, Albert Einstein, AltaVista, Asian financial crisis, asset-backed security, autonomous vehicles, banking crisis, barriers to entry, battle of ideas, Berlin Wall, bioinformatics, bitcoin, Bonfire of the Vanities, clean water, collective bargaining, Colonization of Mars, Credit Default Swap, crowdsourcing, cryptocurrency, Dava Sobel, demographic dividend, Deng Xiaoping, Doha Development Round, double helix, Edward Snowden, Elon Musk, epigenetics, experimental economics, failed state, Fall of the Berlin Wall, financial innovation, full employment, Galaxy Zoo, global supply chain, Hyperloop, immigration reform, income inequality, indoor plumbing, industrial robot, information retrieval, intermodal, Internet of things, invention of the printing press, Isaac Newton, Islamic Golden Age, Khan Academy, Kickstarter, labour market flexibility, low cost carrier, low skilled workers, Lyft, Malacca Straits, megacity, Mikhail Gorbachev, moral hazard, Network effects, New Urbanism, non-tariff barriers, Occupy movement, On the Revolutions of the Heavenly Spheres, open economy, Panamax, personalized medicine, Peter Thiel, post-Panamax, profit motive, rent-seeking, reshoring, Robert Gordon, Search for Extraterrestrial Intelligence, Second Machine Age, self-driving car, Shenzhen was a fishing village, Silicon Valley, Silicon Valley startup, Skype, smart grid, Snapchat, special economic zone, spice trade, statistical model, Stephen Hawking, Steve Jobs, Stuxnet, TaskRabbit, too big to fail, trade liberalization, trade route, transaction costs, transatlantic slave trade, uranium enrichment, We are the 99%, We wanted flying cars, instead we got 140 characters, working poor, working-age population, zero day

Goldin, Ian, ed. (2014). Is the Planet Full? Oxford: Oxford University Press. 48. Goldin, Ian and Kenneth Reinert (2012). Globalization for Development. Oxford: Oxford University Press. 49. Vietnam Food Association (2014). “Yearly Export Statistics.” Retrieved from 50. Bangladesh Garment Manufacturers and Exporters Association (2015). “Trade Information.” Retrieved from 51. Burke, Jason (2013, November 14). “Bangladesh Garment Workers Set for 77% Pay Rise.” The Guardian. Retrieved from 52. Goldin, Ian and Kenneth Reinert (2012). Globalization for Development. Oxford: Oxford University Press. 53. Industrial Development Bureau (2015). “Industry Introduction—History of Industrial Development.”


pages: 397 words: 110,130

Smarter Than You Think: How Technology Is Changing Our Minds for the Better by Clive Thompson


3D printing, 4chan, A Declaration of the Independence of Cyberspace, augmented reality, barriers to entry, Benjamin Mako Hill, butterfly effect, citizen journalism, Claude Shannon: information theory, conceptual framework, corporate governance, crowdsourcing, Deng Xiaoping, discovery of penicillin, Douglas Engelbart, Edward Glaeser, experimental subject, Filter Bubble, Freestyle chess, Galaxy Zoo, Google Earth, Google Glasses, Henri Poincaré, hindsight bias, hive mind, Howard Rheingold, information retrieval, iterative process, jimmy wales, Kevin Kelly, Khan Academy, knowledge worker, Mark Zuckerberg, Marshall McLuhan, Menlo Park, Netflix Prize, Nicholas Carr, patent troll, pattern recognition, pre–internet, Richard Feynman, Ronald Coase, Ronald Reagan, sentiment analysis, Silicon Valley, Skype, Snapchat, Socratic dialogue, spaced repetition, telepresence, telepresence robot, The Nature of the Firm, the scientific method, The Wisdom of Crowds, theory of mind, transaction costs, Vannevar Bush, Watson beat the top human players on Jeopardy!, WikiLeaks, X Prize, éminence grise

the Wikipedia page on “Drone attacks in Pakistan”: “Drone attacks in Pakistan,” Wikipedia, accessed March 24, 2013, 40 percent of all queries are acts of remembering: Jaime Teevan, Eytan Adar, Rosie Jones, and Michael A. S. Potts, “Information Re-Retrieval: Repeat Queries in Yahoo’s Logs,” in SIGIR ’07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2007), 151–58. collaborative inhibition: Celia B. Harris, Paul G. Keil, John Sutton, and Amanda J. Barnier, “We Remember, We Forget: Collaborative Remembering in Older Couples,” Discourse Processes 48, no. 4 (2011), 267–303. In his essay “Mathematical Creation”: Henri Poincaré, “Mathematical Creation,” in The Anatomy of Memory: An Anthology (New York: Oxford University Press, 1996), 126–35.


pages: 527 words: 147,690

Terms of Service: Social Media and the Price of Constant Connection by Jacob Silverman


23andMe, 4chan, A Declaration of the Independence of Cyberspace, Airbnb, airport security, Amazon Mechanical Turk, augmented reality, Brian Krebs, California gold rush, call centre, cloud computing, cognitive dissonance, correlation does not imply causation, Credit Default Swap, crowdsourcing, don't be evil, Edward Snowden, feminist movement, Filter Bubble, Firefox, Flash crash, game design, global village, Google Chrome, Google Glasses, hive mind, income inequality, informal economy, information retrieval, Internet of things, Jaron Lanier, jimmy wales, Kevin Kelly, Kickstarter, knowledge economy, knowledge worker, late capitalism, license plate recognition, life extension, Lyft, Mark Zuckerberg, Mars Rover, Marshall McLuhan, meta analysis, meta-analysis, Minecraft, move fast and break things, national security letter, Network effects, new economy, Nicholas Carr, Occupy movement, optical character recognition, payday loans, Peter Thiel, postindustrial economy, prediction markets, pre–internet, price discrimination, price stability, profit motive, quantitative hedge fund, race to the bottom, Ray Kurzweil, recommendation engine, rent control, RFID, ride hailing / ride sharing, self-driving car, sentiment analysis, shareholder value, sharing economy, Silicon Valley, Silicon Valley ideology, Snapchat, social graph, social web, sorting algorithm, Steve Ballmer, Steve Jobs, Steven Levy, TaskRabbit, technoutopianism, telemarketer, transportation-network company, Turing test, Uber and Lyft, Uber for X, universal basic income, unpaid internship, women in the workforce, Y Combinator, Zipcar

Already the NSA is able to record and sort all phone calls in several countries in real time. As storage costs decrease and analytical powers grow, it’s not unreasonable to think that this capability will be extended to other targets, including, should the political environment allow it, the United States. Some of the NSA’s surveillance capacity derives from deals made with Internet firms—procedures for automating court-authorized information retrieval, direct access to central servers, and even (as in the case of Verizon) fiber optic cables piped from military bases into major Internet hubs. In the United States, the NSA uses the FBI to conduct surveillance authorized under the Patriot Act and to issue National Security Letters (NSLs)—subpoenas requiring recipients to turn over any information deemed relevant to an ongoing investigation.


The Singularity Is Near: When Humans Transcend Biology by Ray Kurzweil


additive manufacturing, AI winter, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, Albert Einstein, anthropic principle, Any sufficiently advanced technology is indistinguishable from magic, artificial general intelligence, augmented reality, autonomous vehicles, Benoit Mandelbrot, Bill Joy: nanobots, bioinformatics, brain emulation, Brewster Kahle, Brownian motion, business intelligence, call centre, carbon-based life, cellular automata, Claude Shannon: information theory, complexity theory, conceptual framework, Conway's Game of Life, cosmological constant, cosmological principle, cuban missile crisis, data acquisition, Dava Sobel, David Brooks, Dean Kamen, disintermediation, double helix, Douglas Hofstadter, epigenetics, factory automation, friendly AI, George Gilder, Gödel, Escher, Bach, informal economy, information retrieval, invention of the telephone, invention of the telescope, invention of writing, Isaac Newton, iterative process, Jaron Lanier, Jeff Bezos, job automation, job satisfaction, John von Neumann, Kevin Kelly, Law of Accelerating Returns, life extension, linked data, Loebner Prize, Louis Pasteur, mandelbrot fractal, Mikhail Gorbachev, mouse model, Murray Gell-Mann, mutually assured destruction, natural language processing, Network effects, new economy, Norbert Wiener, oil shale / tar sands, optical character recognition, pattern recognition, phenotype, premature optimization, randomized controlled trial, Ray Kurzweil, remote working, reversible computing, Richard Feynman, Rodney Brooks, Search for Extraterrestrial Intelligence, semantic web, Silicon Valley, Singularitarianism, speech recognition, statistical model, stem cell, Stephen Hawking, Stewart Brand, strong AI, superintelligent machines, technological singularity, Ted Kaczynski, telepresence, The Coming Technological Singularity, transaction costs, Turing machine, Turing test, Vernor Vinge, Y2K, Yogi Berra

John Smith, director of the ABC Institute—you last saw him six months ago at the XYZ conference" or, "That's the Time-Life Building—your meeting is on the tenth floor." We'll have real-time translation of foreign languages, essentially subtitles on the world, and access to many forms of online information integrated into our daily activities. Virtual personalities that overlay the real world will help us with information retrieval and our chores and transactions. These virtual assistants won't always wait for questions and directives but will step forward if they see us struggling to find a piece of information. (As we wonder about "That actress ... who played the princess, or was it the queen ... in that movie with the robot," our virtual assistant may whisper in our ear or display in our visual field of view: "Natalie Portman as Queen Amidala in Star Wars, episodes 1, 2, and 3.")


pages: 496 words: 174,084

Masterminds of Programming: Conversations With the Creators of Major Programming Languages by Federico Biancuzzi, Shane Warden


business intelligence, business process, cellular automata, cloud computing, complexity theory, conceptual framework, continuous integration, data acquisition, domain-specific language, Douglas Hofstadter, Fellow of the Royal Society, finite state, Firefox, follow your passion, Frank Gehry, general-purpose programming language, HyperCard, information retrieval, iterative process, John von Neumann, linear programming, loose coupling, Mars Rover, millennium bug, NP-complete, Paul Graham, performance metric, QWERTY keyboard, RAND corporation, randomized controlled trial, Renaissance Technologies, Silicon Valley, slashdot, software as a service, software patent, sorting algorithm, Steve Jobs, traveling salesman, Turing complete, type inference, Valgrind, Von Neumann architecture, web application

Don: That’s right, and if you misspell something, or don’t remember exactly what the join column is in a table, your query might not work at all in SQL, whereas less deterministic interfaces like Google are much more forgiving on small mistakes like that. You believe in the importance of determinism. When I write a line of code, I need to rely on understanding what it’s going to do. Don: Well, there are applications where determinism is important and applications where it is not. Traditionally there has been a dividing line between what you might call databases and what you might call information retrieval. Certainly both of those are flourishing fields and they have their respective uses. XQuery and XML Will XML affect the way we use search engines in the future? Don: I think it’s possible. Search engines already exploit the kinds of metadata that are included in HTML tags such as hyperlinks. As you know, XML is a more extensible markup language than HTML. As we begin to see more XML-based standards for marking up specialized documents such as medical and business documents, I think that search engines will learn to take advantage of the semantic information in that markup.


pages: 1,201 words: 233,519

Coders at Work by Peter Seibel


Ada Lovelace, bioinformatics, cloud computing, Conway's Game of Life, domain-specific language, fault tolerance, Fermat's Last Theorem, Firefox, George Gilder, glass ceiling, HyperCard, information retrieval, loose coupling, Menlo Park, Metcalfe's law, premature optimization, publish or perish, random walk, revision control, Richard Stallman, rolodex, Saturday Night Live, side project, slashdot, speech recognition, the scientific method, Therac-25, Turing complete, Turing machine, Turing test, type inference, Valgrind, web application

If I was going to draw lessons from it—well again, I'm kind of an elitist: I would say that the people who should be programming are the people who feel comfortable in the world of symbols. If you don't feel really pretty comfortable swimming around in that world, maybe programming isn't what you should be doing. Seibel: Did you have any important mentors? Deutsch: There were two people. One of them is someone who's no longer around; his name was Calvin Mooers. He was an early pioneer in information systems. I believe he is credited with actually coining the term information retrieval. His background was originally in library science. I met him when I was, I think, high-school or college age. He had started to design a programming language that he thought would be usable directly by just people. But he didn't know anything about programming languages. And at that point, I did because I had built this Lisp system and I'd studied some other programming languages. So we got together and the language that he eventually wound up making was one that I think it's fair to say he and I kind of codesigned.


pages: 933 words: 205,691

Hadoop: The Definitive Guide by Tom White


Amazon Web Services, bioinformatics, business intelligence, combinatorial explosion, database schema, Debian, domain-specific language, fault tolerance, full text search, Grace Hopper, information retrieval, Internet Archive, linked data, loose coupling, openstreetmap, recommendation engine, RFID, SETI@home, social graph, web application

Here are the contents of CounterGroupName=Air Temperature Records Hadoop uses the standard Java localization mechanisms to load the correct properties for the locale you are running in, so, for example, you can create a Chinese version of the properties in a file named, and they will be used when running in the zh_CN locale. Refer to the documentation for java.util.PropertyResourceBundle for more information. Retrieving counters In addition to being available via the web UI and the command line (using hadoop job -counter), you can retrieve counter values using the Java API. You can do this while the job is running, although it is more usual to get counters at the end of a job run, when they are stable. Example 8-2 shows a program that calculates the proportion of records that have missing temperature fields.
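The calculation the excerpt mentions, the proportion of records with missing temperature fields, comes down to dividing one counter value by another once the job has finished. The sketch below is plain Python, not the Hadoop Java API and not the book's Example 8-2. The counter names `MISSING` and `TOTAL_RECORDS` and the function `missing_proportion` are assumptions made for illustration.

```python
# Illustrative sketch only: counter names and the function are hypothetical,
# and the dict stands in for values already retrieved from a finished job.

def missing_proportion(counters, missing_key="MISSING", total_key="TOTAL_RECORDS"):
    """Return the fraction of records whose temperature field was missing."""
    total = counters[total_key]
    if total == 0:
        # No records were processed, so report zero rather than divide by zero.
        return 0.0
    return counters[missing_key] / total


if __name__ == "__main__":
    sample = {"MISSING": 25, "TOTAL_RECORDS": 100}
    print(f"Records with missing temperature fields: {missing_proportion(sample):.2%}")
```

Reading the counters after the job completes, as the excerpt suggests, avoids computing a ratio from values that are still changing.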


pages: 1,156 words: 229,431

The IDA Pro Book by Chris Eagle


barriers to entry, business process, information retrieval, iterative process

The behavior of the breakpoint is implemented by the IDC function bpt_NtContinue, which is shown here:

    static bpt_NtContinue() {
        auto p_ctx = Dword(ESP + 4);             //get CONTEXT pointer argument
        auto next_eip = Dword(p_ctx + 0xB8);     //retrieve eip from CONTEXT
        AddBpt(next_eip);                        //set a breakpoint at the new eip
        SetBptCnd(next_eip, "Warning(\"Exception return hit\") || 1");
        return 0;                                //don't stop
    }

This function locates the pointer to the process’s saved register context information, retrieves the saved instruction pointer value from offset 0xB8 within the CONTEXT structure, and sets a breakpoint on this address. In order to make it clear to the user why execution has stopped, a breakpoint condition (which is always true) is added to display a message to the user. We choose to do this because the breakpoint was not set explicitly by the user, and the user may not correlate the event to the return from an exception handler.


Red Rabbit by Tom Clancy, Scott Brick


anti-communist, battle of ideas, diversified portfolio, Ignaz Semmelweis: hand washing, information retrieval, union organizing, urban renewal

"They operate within fairly strict rules, and both sides seem to play by them." And on both sides, killings had to be authorized at a very high level. Not that this would matter all that much to the corpse in question. "Wet" operations interfered with the main mission, which was gathering information. That was something people occasionally forgot, but something that CIA and KGB mainly understood, which was why both agencies had gotten away from it. But when the information retrieved frightened or otherwise upset the politicians who oversaw the intelligence services, then the spook shops were ordered to do things that they usually preferred to avoid—and so, then, they took their action through surrogates and/or mercenaries, mainly… "Arthur, if KGB wants to hurt the Pope, how do you suppose they'd go about it?" "Not one of their own," Moore thought. "Too dangerous. It would be a political catastrophe, like a tornado going right through the Kremlin.


pages: 685 words: 203,949

The Organized Mind: Thinking Straight in the Age of Information Overload by Daniel J. Levitin


airport security, Albert Einstein, Amazon Mechanical Turk, Anton Chekhov, big-box store, business process, call centre, Claude Shannon: information theory, cloud computing, cognitive bias, complexity theory, computer vision, conceptual framework, correlation does not imply causation, crowdsourcing, cuban missile crisis, Daniel Kahneman / Amos Tversky, delayed gratification, Donald Trump, epigenetics, Eratosthenes, Exxon Valdez, framing effect, friendly fire, fundamental attribution error, Golden Gate Park, Google Glasses, haute cuisine, impulse control, index card, indoor plumbing, information retrieval, invention of writing, iterative process, jimmy wales, job satisfaction, Kickstarter, life extension, meta analysis, meta-analysis, more computing power than Apollo, Network effects, new economy, Nicholas Carr, optical character recognition, pattern recognition, phenotype, placebo effect, pre–internet, profit motive, randomized controlled trial, Skype, Snapchat, statistical model, Steve Jobs, supply-chain management, the scientific method, The Wealth of Nations by Adam Smith, The Wisdom of Crowds, theory of mind, Turing test, ultimatum game

Publisher’s Weekly, 246(2), p. 63. All bits are created equal After writing this, I discovered the same phrase “all bits are created equal” in Gleick, J. (2011). The information: A history, a theory, a flood. New York, NY: Vintage. Information has thus become separated from meaning Gleick writes “information is divorced from meaning.” He cites the technology philosopher Lewis Mumford from 1970: “Unfortunately, ‘information retrieving,’ however swift, is no substitute for discovering by direct personal inspection knowledge whose very existence one had possibly never been aware of, and following it at one’s own pace through the further ramification of relevant literature.” Gleick, J. (2011). The information: A history, a theory, a flood. New York, NY: Vintage. “The medium does matter. . . .” Carr, N. (2010). The shallows: What the internet is doing to our brains.


pages: 843 words: 223,858

The Rise of the Network Society by Manuel Castells


Apple II, Asian financial crisis, barriers to entry, Big bang: deregulation of the City of London, borderless world, British Empire, capital controls, complexity theory, computer age, Credit Default Swap, declining real wages, deindustrialization, delayed gratification, dematerialisation, deskilling, disintermediation, double helix, Douglas Engelbart, edge city, experimental subject, financial deregulation, financial independence, floating exchange rates, future of work, global village, Hacker Ethic, hiring and firing, Howard Rheingold, illegal immigration, income inequality, industrial robot, informal economy, information retrieval, intermodal, invention of the steam engine, invention of the telephone, inventory management, James Watt: steam engine, job automation, job-hopping, knowledge economy, knowledge worker, labor-force participation, labour market flexibility, labour mobility, laissez-faire capitalism, low skilled workers, manufacturing employment, Marshall McLuhan, means of production, megacity, Menlo Park, new economy, New Urbanism, offshore financial centre, oil shock, open economy, packet switching, planetary scale, popular electronics, post-industrial society, postindustrial economy, prediction markets, Productivity paradox, profit maximization, purchasing power parity, RAND corporation, Robert Gordon, Silicon Valley, Silicon Valley startup, social software, South China Sea, South of Market, San Francisco, special economic zone, spinning jenny, statistical model, Steve Jobs, Steve Wozniak, Ted Nelson, the built environment, the medium is the message, The Wealth of Nations by Adam Smith, Thomas Kuhn: the structure of scientific revolutions, total factor productivity, trade liberalization, transaction costs, urban renewal, urban sprawl

From these he accepts only a few dozen each instant, from which to make an image.19 Because of the low definition of TV, McLuhan argued, viewers have to fill in the gaps in the image, thus becoming more emotionally involved in the viewing (what he, paradoxically, characterized as a “cool medium”). Such involvement does not contradict the hypothesis of the least effort because TV appeals to the associative/lyrical mind, not involving the psychological effort of information retrieving and analyzing to which Herbert Simon’s theory refers. This is why Neil Postman, a leading media scholar, considers that television represents an historical rupture with the typographic mind. While print favors systematic exposition, TV is best suited to casual conversation. To make the distinction sharply, in his own words: “Typography has the strongest possible bias towards exposition: a sophisticated ability to think conceptually, deductively and sequentially; a high valuation of reason and order; an abhorrence of contradiction; a large capacity for detachment and objectivity; and a tolerance for delayed response.”20 While for television, “entertainment is the supra-ideology of all discourse on television.


pages: 903 words: 235,753

The Stack: On Software and Sovereignty by Benjamin H. Bratton


1960s counterculture, 3D printing, 4chan, Ada Lovelace, additive manufacturing, airport security, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, algorithmic trading, Amazon Mechanical Turk, Amazon Web Services, augmented reality, autonomous vehicles, Berlin Wall, bioinformatics, bitcoin, blockchain, Buckminster Fuller, Burning Man, call centre, carbon footprint, carbon-based life, Cass Sunstein, Celebration, Florida, charter city, clean water, cloud computing, connected car, corporate governance, crowdsourcing, cryptocurrency, dark matter, David Graeber, deglobalization, dematerialisation, disintermediation, distributed generation, don't be evil, Douglas Engelbart, Edward Snowden, Elon Musk, Eratosthenes, ethereum blockchain, facts on the ground, Flash crash, Frank Gehry, Frederick Winslow Taylor, future of work, Georg Cantor, gig economy, global supply chain, Google Earth, Google Glasses, Guggenheim Bilbao, High speed trading, Hyperloop, illegal immigration, industrial robot, information retrieval, intermodal, Internet of things, invisible hand, Jacob Appelbaum, Jaron Lanier, Jony Ive, Julian Assange, Khan Academy, linked data, Mark Zuckerberg, market fundamentalism, Marshall McLuhan, Masdar, McMansion, means of production, megacity, megastructure, Menlo Park, Minecraft, Monroe Doctrine, Network effects, new economy, offshore financial centre, oil shale / tar sands, packet switching, PageRank, pattern recognition, peak oil, performance metric, personalized medicine, Peter Thiel, phenotype, place-making, planetary scale, RAND corporation, recommendation engine, reserve currency, RFID, Sand Hill Road, self-driving car, semantic web, sharing economy, Silicon Valley, Silicon Valley ideology, Slavoj Žižek, smart cities, smart grid, smart meter, social graph, software studies, South China Sea, sovereign wealth fund, special economic zone, spectrum auction, Startup school, statistical arbitrage, Steve Jobs, Steven Levy, Stewart Brand, Stuxnet, Superbowl ad, supply-chain management, supply-chain management software, TaskRabbit, the built environment, The Chicago School, the scientific method, Torches of Freedom, transaction costs, Turing complete, Turing machine, Turing test, universal basic income, urban planning, Vernor Vinge, Washington Consensus, web application, WikiLeaks, working poor, Y Combinator

., telcos, states, standards bodies, hardware original equipment manufacturers, and cloud software platforms) all play different roles and control hardware and software applications in different ways and toward different ends. Internet backbone is generally provided and shared by tier 1 bandwidth providers (such as telcos), but one key trend is for very large platforms, such as Google, to bypass other actors and architect complete end-to-end networks, from browser, to fiber, to data center, such that information retrieval, composition, and analysis are consolidated and optimized on private loops. Consider that if Google's own networks, both internal and external, were compared to others, they would represent one of the largest Internet service providers in the world, and by the time this sentence is published, they may very well be the largest. Google indexes the public Internet and mirrors as much of it as possible on its own servers so that it can serve search results and popular pages quickly to Users, regardless of where the original page may originally be coded, sourced, and hosted.


pages: 1,199 words: 332,563

Golden Holocaust: Origins of the Cigarette Catastrophe and the Case for Abolition by Robert N. Proctor


bioinformatics, carbon footprint, clean water, corporate social responsibility, Deng Xiaoping, desegregation, facts on the ground, friendly fire, germ theory of disease, index card, Indoor air pollution, information retrieval, invention of gunpowder, John Snow's cholera map, language of flowers, life extension, New Journalism, optical character recognition, pink-collar, Ponzi scheme, Potemkin village, Ralph Nader, Ronald Reagan, speech recognition, stem cell, telemarketer, Thomas Kuhn: the structure of scientific revolutions, Triangle Shirtwaist Factory, Upton Sinclair, Yogi Berra

The Ad Hoc Committee was a group of lawyers spun off from the Policy Committee whose duties included maintaining the Central File (aka “Cenfile”), “a collection of every document which can be found relating to the smoking and health controversy” (Bates 80684691–4695). The Ad Hoc Committee was also responsible for helping to locate medical witnesses and prepare testimony. Edwin Jacob from Jacob, Medinger & Finnegan supervised the Central File with financial support from all parties to the conspiracy. Responsibility for maintaining the Central File Information Center in 1971 was transferred to the CTR, which managed “informational retrieval” and maintenance through a CTR Special Project, organized as part of a new Information Systems division, by which means the CTR became a crucial resource for the industry’s effort to defend itself against litigation. See Kessler’s “Amended Final Opinion,” pp. 165–68. 46. “Congressional Preparation,” Jan. 26, 1968, Bates 955007434–7439; F. P. Haas, “Memorandum,” Nov. 4, 1965, Bates 502052217–2220.


The Art of Computer Programming by Donald Ervin Knuth


Brownian motion, complexity theory, correlation coefficient, Eratosthenes, Georg Cantor, information retrieval, Isaac Newton, iterative process, John von Neumann, Louis Pasteur, mandelbrot fractal, Menlo Park, NP-complete, P = NP, Paul Erdős, probability theory / Blaise Pascal / Pierre de Fermat, RAND corporation, random walk, sorting algorithm, Turing machine, Y2K

The probability that $\max(U_1, U_2, \ldots, U_t) < x$ is the probability that $U_1 < x$ and $U_2 < x$ and ... and $U_t < x$, which is the product of the individual probabilities, namely $x \cdot x \cdots x = x^t$. I. Collision test. Chi-square tests can be made only when a nontrivial number of items are expected in each category. But another kind of test can be used when the number of categories is much larger than the number of observations; this test is related to "hashing," an important method for information retrieval that we shall study in Section 6.4. Suppose we have m urns and we throw n balls at random into those urns, where m is much greater than n. Most of the balls will land in urns that were previously empty, but if a ball falls into an urn that already contains at least one ball we say that a "collision" has occurred. The collision test counts the number of collisions, and a generator passes this test if it doesn't induce too many or too few collisions.
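
The urn-and-ball description above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's own procedure: the function names (`count_collisions`, `expected_collisions`, `simulate`) are hypothetical, Python's built-in `random` module stands in for the generator under test, and the comparison value uses the standard closed form for the expected number of occupied urns rather than the exact collision distribution a full test would need.

```python
import random


def count_collisions(urns):
    """Count balls that land in an urn that already holds at least one ball.

    `urns` is an iterable of urn indices, one per ball thrown.
    """
    occupied = set()
    collisions = 0
    for urn in urns:
        if urn in occupied:
            collisions += 1
        else:
            occupied.add(urn)
    return collisions


def expected_collisions(n, m):
    """Expected collisions for n balls thrown uniformly into m urns.

    The expected number of occupied urns is m * (1 - (1 - 1/m)**n);
    every ball that does not occupy a new urn is a collision.
    """
    return n - m * (1 - (1 - 1 / m) ** n)


def simulate(n, m, rng=random):
    """Throw n balls into m urns using `rng` and return the collision count."""
    return count_collisions(rng.randrange(m) for _ in range(n))
```

Calling `simulate(n, m)` with m much larger than n and comparing the result against `expected_collisions(n, m)` gives a rough feel for the idea; deciding what counts as "too many or too few" collisions requires the probability distribution of the count, which this sketch does not compute.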


pages: 889 words: 433,897

The Best of 2600: A Hacker Odyssey by Emmanuel Goldstein


affirmative action, Apple II, call centre, don't be evil, Firefox, game design, Hacker Ethic, hiring and firing, information retrieval, late fees, license plate recognition, optical character recognition, packet switching, pirate software, place-making, profit motive, QWERTY keyboard, RFID, Robert Hanssen: Double agent, rolodex, Ronald Reagan, Silicon Valley, Skype, spectrum auction, statistical model, Steve Jobs, Steve Wozniak, Steven Levy, Telecommunications Act of 1996, telemarketer, Y2K

Immediately, you’ll get a list of everyone with that name, as well as their city and state, which often don’t fit properly on the line. There are no reports of any wildcards that allow you to see everybody at once. (The closest thing is *R, which will show all of the usernames that you’re sending to.) It’s also impossible for a user not to be seen if you get his name or alias right. It’s a good free information retrieval system. But there’s more. MCI Mail can also be used as a free word processor of sorts. The system will allow you to enter a letter, or for that matter, a manuscript. You can then hang up and do other things, come back within 24 hours, and your words will still be there. You can conceivably list them out using your own printer on a fresh sheet of paper and send it through the mail all by yourself, thus sparing MCI Mail’s laser printer the trouble.