text mining

35 results


Exploring Everyday Things with R and Ruby by Sau Sheong Chang

Alfred Russel Wallace, bioinformatics, business process, butterfly effect, cloud computing, Craig Reynolds: boids flock, Debian, Edward Lorenz: Chaos theory, Gini coefficient, income inequality, invisible hand, p-value, price stability, Ruby on Rails, Skype, statistical model, stem cell, Stephen Hawking, text mining, The Wealth of Nations by Adam Smith, We are the 99%, web application, wikimedia commons

About the Author

Sau Sheong Chang has been in software development, mostly web applications and recently cloud- and data-related systems, for almost 17 years and is still a keen and enthusiastic programmer.

Along the way, we looked at the publicly available Enron email dataset and focused on one of the executives in that dataset. The code we have written might be simple, but the insights it yields can be significant. We have ventured a bit into the territory of text mining, but overall we have barely scratched the surface of what could be done. The tm library, for example, is extremely powerful for text mining, and various other text-mining packages have been built on top of it. One thing to take note of before you wend your way to mining your own mailbox: the Enron dataset was cleaned up before it was published, so it was much easier to mine. Your own mailbox, on the other hand, could be wild and unruly, so your mileage will definitely vary.
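To make the idea concrete, here is a minimal Ruby sketch of the term-document matrix that underlies this kind of mail mining, loosely analogous to what R's tm package builds from a corpus. The three sample messages and their names are invented for illustration, not drawn from the Enron data.

```ruby
# Hypothetical mini corpus: message id => body text (made-up examples).
docs = {
  "msg1" => "please review the trading report",
  "msg2" => "the report is attached please confirm",
  "msg3" => "trading desk confirm the numbers"
}

# Tokenize each document into lowercase word counts.
counts = docs.transform_values do |text|
  text.downcase.scan(/[a-z]+/).tally
end

# Vocabulary: every word seen in any document, sorted for stable output.
vocab = counts.values.flat_map(&:keys).uniq.sort

# Term-document matrix: one row per term, one column per document.
tdm = vocab.map { |w| [w, docs.keys.map { |d| counts[d].fetch(w, 0) }] }.to_h

puts "terms x docs: #{vocab.size} x #{docs.size}"
p tdm["report"]   # => [1, 1, 0]  ("report" occurs in msg1 and msg2 only)
```

Once text is in this matrix form, counting, clustering, and the other analyses mentioned above reduce to ordinary operations on rows and columns.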



pages: 721 words: 197,134

Data Mining: Concepts, Models, Methods, and Algorithms by Mehmed Kantardzić

Albert Einstein, algorithmic bias, backpropagation, bioinformatics, business cycle, business intelligence, business process, butter production in bangladesh, combinatorial explosion, computer vision, conceptual framework, correlation coefficient, correlation does not imply causation, data acquisition, discrete time, El Camino Real, fault tolerance, finite state, Gini coefficient, information retrieval, Internet Archive, inventory management, iterative process, knowledge worker, linked data, loose coupling, Menlo Park, natural language processing, Netflix Prize, NP-complete, PageRank, pattern recognition, peer-to-peer, phenotype, random walk, RFID, semantic web, speech recognition, statistical model, Telecommunications Act of 1996, telemarketer, text mining, traveling salesman, web application

All algorithms are presented in easily understood pseudo-code, and they are suitable for use in real-world, large-scale data-mining projects, including advanced applications such as Web mining and text mining.

11 WEB MINING AND TEXT MINING

Chapter Objectives

Explain the specifics of Web mining.
Introduce a classification of basic Web-mining subtasks.
Illustrate the possibilities of Web mining using Hyperlink-Induced Topic Search (HITS), LOGSOM, and Path Traversal algorithms.
Describe query-independent ranking of Web pages.
Formalize a text-mining framework specifying the refining and distillation phases.
Outline latent semantic indexing.

11.1 WEB MINING

In a distributed information environment, documents or objects are usually linked together to facilitate interactive access.
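The HITS algorithm mentioned among the chapter objectives scores pages iteratively: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. A minimal sketch of that iteration, on a toy link graph given as a dict of outgoing links (this is illustrative code, not the textbook's pseudo-code):

```python
# Toy HITS iteration: graph maps each page to the pages it links to.
def hits(graph, iterations=20):
    pages = list(graph)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages that link to p.
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of pages that p links to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub
```

For example, in a graph where two pages both link to a third, the third page accumulates the highest authority score while the two linkers become hubs.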

The automatic analysis of text information can be used for several different general purposes:

1. to provide an overview of the contents of a large document collection and organize them in the most efficient way;
2. to identify hidden structures between documents or groups of documents;
3. to increase the efficiency and effectiveness of a search process to find similar or related information; and
4. to detect duplicate information or documents in an archive.

Text mining is an emerging set of functionalities that are primarily built on text-analysis technology. Text is the most common vehicle for the formal exchange of information. The motivation for trying to automatically extract, organize, and use information from it is compelling, even if success is only partial. While traditional commercial text-retrieval systems are based on inverted text indices composed of statistics such as word occurrence per document, text mining must provide values beyond the retrieval of text indices such as keywords. Text mining is about looking for semantic patterns in text, and it may be defined as the process of analyzing text to extract interesting, nontrivial information that is useful for particular purposes.
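The inverted index mentioned above is the core data structure of traditional text retrieval: it maps each word to the documents containing it, together with per-document occurrence counts. A minimal sketch (illustrative only, with naive whitespace tokenization):

```python
# Build a toy inverted index: {word: {doc_id: occurrence count}}.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns word -> {doc_id: count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)
```

Looking up a keyword then costs a single dictionary access instead of a scan over every document, which is exactly the "word occurrence per document" statistic the passage refers to.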

As text is the most natural form of storing information, text mining is believed to have commercial potential even higher than that of traditional data mining with structured data; in fact, recent studies indicate that 80% of a company's information is contained in text documents. Text mining, however, is also a much more complex task than traditional data mining, as it involves dealing with unstructured text data that are inherently ambiguous. Text mining is a multidisciplinary field involving information retrieval, text analysis, information extraction, natural language processing, clustering, categorization, visualization, machine learning, and other methodologies already included in the data-mining "menu"; some additional techniques developed recently for semi-structured data can also be included in this field.


Programming Computer Vision with Python by Jan Erik Solem

augmented reality, computer vision, database schema, en.wikipedia.org, optical character recognition, pattern recognition, text mining, Thomas Bayes, web application

Symbols 3D plotting, A Sample Data Set 3D reconstruction, 3D Reconstruction Example 4-neighborhood, 9.1 Graph Cuts A affine transformation, 3.1 Homographies affine warping, Affine Transformations affinity matrix, Clustering Images agglomerative clustering, 6.2 Hierarchical Clustering alpha map, Image in Image AR, 4.3 Pose Estimation from Planes and Markers array, Interactive Annotation array slicing, Array Image Representation aspect ratio, 4.1 The Pin-Hole Camera Model association, 9.2 Segmentation Using Clustering augmented reality, 4.3 Pose Estimation from Planes and Markers B bag-of-visual-words, Inspiration from Text Mining—The Vector Space Model bag-of-word representation, Searching Images baseline, Bundle adjustment Bayes classifier, Classifying Images—Hand Gesture Recognition binary image, Morphology—Counting Objects blurring, Using the Pickle Module bundle adustment, Bundle adjustment C calibration matrix, 4.1 The Pin-Hole Camera Model camera calibration, Computing the Camera Center camera center, Camera Models and Augmented Reality camera matrix, Camera Models and Augmented Reality camera model, Camera Models and Augmented Reality camera pose estimation, 4.3 Pose Estimation from Planes and Markers camera resectioning, Triangulation CBIR, Searching Images Chan-Vese segmentation, 9.3 Variational Methods characteristic functions, 9.3 Variational Methods CherryPy, 7.6 Building Demos and Web Applications, Image Search Demo class centroids, Clustering Images classifying images, Classifying Image Content clustering images, Clustering Images, Clustering Images complete linking, 6.2 Hierarchical Clustering confusion matrix, Classifying Images—Hand Gesture Recognition content-based image retrieval, Searching Images convex combination, Image in Image corner detection, Local Image Descriptors correlation, 2.1 Harris Corner Detector corresponding points, 2.1 Harris Corner Detector cpickle, PCA of Images cross-correlation, Finding Corresponding Points Between Images 
cumulative distribution function, Graylevel Transforms cv, OpenCV, 10.4 Tracking cv2, OpenCV D de-noising, Reading and writing .mat files Delaunay triangulation, Piecewise Affine Warping dendrogram, Clustering Images dense depth reconstruction, Bundle adjustment dense image features, A Simple 2D Example dense SIFT, A Simple 2D Example descriptor, 2.1 Harris Corner Detector difference-of-Gaussian, Finding Corresponding Points Between Images digit classification, Hand Gesture Recognition Again direct linear transformation, 3.1 Homographies directed graph, Image Segmentation distance matrix, Clustering Images E Edmonds-Karp algorithm, 9.1 Graph Cuts eight point algorithm, Plotting 3D Data with Matplotlib epipolar constraint, 5.1 Epipolar Geometry epipolar geometry, Multiple View Geometry epipolar line, 5.1 Epipolar Geometry epipole, 5.1 Epipolar Geometry essential matrix, The calibrated case—metric reconstruction F factorization, Factoring the Camera Matrix feature matches, Finding Corresponding Points Between Images feature matching, Matching Descriptors flood fill, Displaying Images and Results focal length, 4.1 The Pin-Hole Camera Model fundamental matrix, 5.1 Epipolar Geometry fundamental matrix estimation, 5.3 Multiple View Reconstruction G Gaussian blurring, Using the Pickle Module Gaussian derivative filters, Image Derivatives Gaussian distributions, 8.2 Bayes Classifier gesture recognition, Dense SIFT as Image Feature GL_MODELVIEW, PyGame and PyOpenGL GL_PROJECTION, PyGame and PyOpenGL Grab Cut dataset, Segmentation with User Input gradient angle, Blurring Images gradient magnitude, Blurring Images graph, Image Segmentation graph cut, Image Segmentation GraphViz, Matching Using Local Descriptors graylevel transforms, Array Image Representation H Harris corner detection, Local Image Descriptors Harris matrix, Local Image Descriptors hierarchical clustering, 6.2 Hierarchical Clustering hierarchical k-means, 6.3 Spectral Clustering histogram equalization, 
Graylevel Transforms Histogram of Oriented Gradients, A Simple 2D Example HOG, A Simple 2D Example homogeneous coordinates, Image to Image Mappings homography, Image to Image Mappings homography estimation, 3.1 Homographies Hough transform, Inpainting I Image, Basic Image Handling and Processing image contours, Plotting Images, Points, and Lines image gradient, Blurring Images image graph, 9.1 Graph Cuts image histograms, Plotting Images, Points, and Lines image patch, 2.1 Harris Corner Detector image plane, Camera Models and Augmented Reality image registration, Piecewise Affine Warping image retrieval, Searching Images image search demo, 7.6 Building Demos and Web Applications image segmentation, Visualizing the Images on Principal Components, Image Segmentation image thumbnails, Convert Images to Another Format ImageDraw, Clustering Images inliers, 3.3 Creating Panoramas inpainting, Using generators integral image, Color Spaces interest point descriptor, 2.1 Harris Corner Detector interest points, Local Image Descriptors inverse depth, 4.1 The Pin-Hole Camera Model inverse document frequency, Inspiration from Text Mining—The Vector Space Model io, Useful SciPy Modules iso-contours, Plotting Images, Points, and Lines J JSON, Downloading Geotagged Images from Panoramio K k-means, Clustering Images k-nearest neighbor classifier, Classifying Image Content kernel functions, 8.3 Support Vector Machines kNN, Classifying Image Content L Laplacian matrix, 6.3 Spectral Clustering least squares triangulation, Triangulation LibSVM, 8.3 Support Vector Machines local descriptors, Local Image Descriptors Lucas-Kanade tracking algorithm, Optical Flow M marking points, Interactive Annotation mathematical morphology, Morphology—Counting Objects Matplotlib, Create Thumbnails maximum flow (max flow), 9.1 Graph Cuts measurements, Morphology—Counting Objects, Extracting Cells and Recognizing Characters metric reconstruction, 5.1 Epipolar Geometry, Computing the Camera Matrix from a 
Fundamental Matrix minidom, Registering Images minimum cut (min cut), 9.1 Graph Cuts misc, Useful SciPy Modules morphology, Morphology—Counting Objects, Morphology—Counting Objects, Exercises mplot3d, A Sample Data Set, 3D Reconstruction Example multi-class SVM, Selecting Features multi-dimensional arrays, Interactive Annotation multi-dimensional histograms, Clustering Images multiple view geometry, Multiple View Geometry N naive Bayes classifier, Classifying Images—Hand Gesture Recognition ndimage, Affine Transformations ndimage.filters, Computing Disparity Maps normalized cross-correlation, Finding Corresponding Points Between Images normalized cut, 9.2 Segmentation Using Clustering NumPy, Interactive Annotation O objloader, Tying It All Together OCR, Hand Gesture Recognition Again OpenCV, Chapter Overview, OpenCV OpenGL, PyGame and PyOpenGL OpenGL projection matrix, From Camera Matrix to OpenGL Format optic flow, 10.4 Tracking optical axis, Camera Models and Augmented Reality optical center, The Camera Matrix optical character recognition, Hand Gesture Recognition Again optical flow, 10.4 Tracking optical flow equation, 10.4 Tracking outliers, 3.3 Creating Panoramas overfitting, Exercises P panograph, Exercises panorama, 3.3 Creating Panoramas PCA, PCA of Images pickle, PCA of Images, The SciPy Clustering Package, Creating a Vocabulary pickling, PCA of Images piecewise affine warping, Image in Image piecewise constant image model, 9.3 Variational Methods PIL, Basic Image Handling and Processing pin-hole camera, Camera Models and Augmented Reality plane sweeping, 5.4 Stereo Images plot formatting, Plotting Images, Points, and Lines plotting, Create Thumbnails point correspondence, 2.1 Harris Corner Detector pose estimation, 4.3 Pose Estimation from Planes and Markers Prewitt filters, Blurring Images Principal Component Analysis, PCA of Images, 8.2 Bayes Classifier principal point, The Camera Matrix projection, Camera Models and Augmented Reality projection 
matrix, Camera Models and Augmented Reality projective camera, Camera Models and Augmented Reality projective transformation, Image to Image Mappings pydot, Matching Using Local Descriptors pygame, PyGame and PyOpenGL pygame.image, PyGame and PyOpenGL pygame.locals, PyGame and PyOpenGL Pylab, Create Thumbnails PyOpenGL, PyGame and PyOpenGL pyplot, Exercises pysqlite, Setting Up the Database pysqlite2, Setting Up the Database Python Imaging Library, Basic Image Handling and Processing python-graph, 9.1 Graph Cuts Q quad, From Camera Matrix to OpenGL Format query with image, Querying with an Image quotient image, Exercises R radial basis functions, 8.3 Support Vector Machines ranking using homographies, 7.5 Ranking Results Using Geometry RANSAC, 3.3 Creating Panoramas, 5.3 Multiple View Reconstruction rectified image pair, Bundle adjustment rectifying images, Extracting Cells and Recognizing Characters registration, Piecewise Affine Warping rigid transformation, 3.1 Homographies robust homography estimation, RANSAC ROF, Reading and writing .mat files, 9.3 Variational Methods RQ-factorization, Factoring the Camera Matrix Rudin-Osher-Fatemi de-noising model, Reading and writing .mat files S Scale-Invariant Feature Transform, Finding Corresponding Points Between Images scikit.learn, Exercises Scipy, Using the Pickle Module scipy.cluster.vq, The SciPy Clustering Package, Clustering Images scipy.io, Useful SciPy Modules, Reading and writing .mat files scipy.misc, Reading and writing .mat files scipy.ndimage, Blurring Images, Morphology—Counting Objects, Extracting Cells and Recognizing Characters, Rectifying Images, Exercises scipy.ndimage.filters, Blurring Images, Blurring Images, 2.1 Harris Corner Detector scipy.sparse, Exercises searching images, Searching Images, Adding Images segmentation, Image Segmentation self-calibration, Bundle adjustment separating hyperplane, Using PCA to Reduce Dimensions SfM, The calibrated case—metric reconstruction SIFT, Finding 
Corresponding Points Between Images similarity matrix, Clustering Images similarity transformation, 3.1 Homographies similarity tree, 6.2 Hierarchical Clustering simplejson, Downloading Geotagged Images from Panoramio, Downloading Geotagged Images from Panoramio single linking, 6.2 Hierarchical Clustering slicing, Array Image Representation Sobel filters, Blurring Images spectral clustering, Clustering Images, 9.2 Segmentation Using Clustering SQLite, Setting Up the Database SSD, Finding Corresponding Points Between Images stereo imaging, Bundle adjustment stereo reconstruction, Bundle adjustment stereo rig, Bundle adjustment stereo vision, Bundle adjustment stitching images, Robust Homography Estimation stop words, Inspiration from Text Mining—The Vector Space Model structure from motion, The calibrated case—metric reconstruction structuring element, Morphology—Counting Objects Sudoku reader, Hand Gesture Recognition Again sum of squared differences, Finding Corresponding Points Between Images Support Vector Machines, Using PCA to Reduce Dimensions support vectors, 8.3 Support Vector Machines SVM, Using PCA to Reduce Dimensions T term frequency, Inspiration from Text Mining—The Vector Space Model term frequency–inverse document frequency, Inspiration from Text Mining—The Vector Space Model text mining, Searching Images tf-idf weighting, Inspiration from Text Mining—The Vector Space Model total variation, Reading and writing .mat files total within-class variance, Clustering Images tracking, 10.4 Tracking triangulation, 5.2 Computing with Cameras and 3D Structure U unpickling, PCA of Images unsharp masking, 1.5 Advanced Example: Image De-Noising urllib, Downloading Geotagged Images from Panoramio V variational methods, 9.3 Variational Methods variational problems, 9.3 Variational Methods vector quantization, The SciPy Clustering Package vector space model, Searching Images vertical field of view, From Camera Matrix to OpenGL Format video, Displaying Images and 
Results visual codebook, Inspiration from Text Mining—The Vector Space Model visual vocabulary, Inspiration from Text Mining—The Vector Space Model visual words, Inspiration from Text Mining—The Vector Space Model visualizing image distribution, Visualizing the Images on Principal Components VLFeat, Interest Points W warping, Affine Transformations watershed, Inpainting web applications, 7.6 Building Demos and Web Applications webcam, Optical Flow word index, Setting Up the Database X XML, Registering Images xml.dom, Registering Images About the Author Jan Erik Solem is a Python enthusiast and a computer vision researcher and entrepreneur.


For high-level queries, like finding similar objects, it is not feasible to do a full comparison (for example using feature matching) between a query image and all images in the database. It would simply take too much time to return any results if the database is large. In the last couple of years, researchers have successfully introduced techniques from the world of text mining for CBIR problems, making it possible to search millions of images for similar content. Inspiration from Text Mining—The Vector Space Model The vector space model is a model for representing and searching text documents. As we will see, it can be applied to essentially any kind of objects, including images. The name comes from the fact that text documents are represented with vectors that are histograms of the word frequencies in the text.[18] In other words, the vector will contain the number of occurrences of every word (at the position corresponding to that word) and zeros everywhere else.
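The vector space model described above can be sketched in a few lines: each document becomes a histogram of word frequencies over a shared vocabulary, and two documents are compared by the cosine of the angle between their vectors (a common choice, though the excerpt itself stops at the representation). This toy code is not the book's implementation:

```python
# Represent documents as term-frequency vectors and compare with cosine similarity.
import math
from collections import Counter

def doc_vector(text, vocabulary):
    """Histogram of word frequencies at positions fixed by the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Because every document maps to a fixed-length vector, the same machinery applies to "bags" of visual words extracted from images, which is how the text-mining idea transfers to CBIR.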


pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack by Matthew A. Russell

Climategate, cloud computing, crowdsourcing, en.wikipedia.org, fault tolerance, Firefox, full text search, Georg Cantor, Google Earth, information retrieval, Mark Zuckerberg, natural language processing, NP-complete, Saturday Night Live, semantic web, Silicon Valley, slashdot, social graph, social web, statistical model, Steve Jobs, supply-chain management, text mining, traveling salesman, Turing test, web application

There’s no real reason to introduce Buzz earlier in the book than blogs (the topic of Chapter 8), other than the fact that it fills an interesting niche somewhere between Twitter and blogs, so this ordering facilitates telling a story from cover to cover. All in all, the text-mining techniques you’ll learn in any chapter of this book could just as easily be applied to any other chapter. Wherever possible we won’t reinvent the wheel and implement analysis tools from scratch, but we will take a couple of “deep dives” when particularly foundational topics come up that are essential to an understanding of text mining. The Natural Language Toolkit (NLTK), a powerful technology that you may recall from some opening examples in Chapter 1, provides many of the tools.

The Graph Your Inbox Chrome Extension provides a concise summary of your Gmail activity. Tapping into Your Gmail provides an overview of how to use Python's imaplib module to tap into your Gmail account (or any other mail account that speaks IMAP) and mine the textual information in messages. Be sure to check it out when you're interested in moving beyond mail header information and ready to dig into text mining.

Closing Remarks

We've covered a lot of ground in this chapter, but we've just barely begun to scratch the surface of what's possible with mail data. Our focus has been on mboxes, a simple and convenient file format that lends itself to high portability and easy analysis by many Python tools and packages.
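Part of why the mbox format is so convenient is that Python's standard library reads it directly. A minimal sketch using the stdlib mailbox module (the path is a placeholder, and this is not the book's code):

```python
# Read every message's Subject header out of an mbox file.
import mailbox

def subjects(mbox_path):
    """Return the Subject header of each message in the mbox, '' if missing."""
    box = mailbox.mbox(mbox_path)
    return [msg["subject"] or "" for msg in box]
```

From a list of subjects (or full message bodies via msg.get_payload()), the text-mining techniques from the earlier chapters apply unchanged.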

As a follow-up exercise, it could be interesting to compute the average number of hyperlink entities per tweet, or even go so far as to follow the links and try to discover new information about Tim’s interests by inspecting the title or content of the linked web pages. (In the chapters ahead, especially Chapters 7 and 8, we’ll learn more about text mining, an essential skill for analyzing web pages.) * * * [30] The Twitter API documentation states that the friend time line is similar to the home time line, except that it does not contain retweets for backward-compatibility purposes. [31] See the May 2010 cover of Inc. magazine. [32] Note that as of late December 2010, the retweet_count field maxes out at 100.


pages: 163 words: 42,402

Machine Learning for Email by Drew Conway, John Myles White

call centre, correlation does not imply causation, Debian, natural language processing, Netflix Prize, pattern recognition, recommendation engine, SpamAssassin, text mining

In these cases you will need to install from source: install.packages("tm", dependencies=TRUE) setwd("~/Downloads/") install.packages("RCurl_1.5-0.tar.gz", repos=NULL, type="source") In the first example above, we use the default settings to install the tm package from CRAN. The tm package provides functions used to do text mining, and we will use it in Chapter 3 to perform classification on email text. One useful parameter in the install.packages function is suggests, which by default is set to FALSE, but if activated will instruct the function to download and install any secondary packages used by the primary installation.

R packages used in this book:
- ggplot2 (http://had.co.nz/ggplot2/), Hadley Wickham: an implementation of the grammar of graphics in R; the premier package for creating high-quality graphics.
- plyr (http://had.co.nz/plyr/), Hadley Wickham: a set of tools used to manipulate, aggregate, and manage data in R.
- tm (http://www.spatstat.org/spatstat/), Ingo Feinerer: a collection of functions for performing text mining in R; used to work with unstructured text data.
R Basics for Machine Learning: UFO Sightings in the United States, 1990-2010. As we stated at the outset, we believe that the best way to learn a new technical skill is to start with a problem you wish to solve or a question you wish to answer.

In our case, that will be a 0/1 coding strategy: spam or ham. For example, we may want to determine the following: "Does containing HTML tags make an email more likely to be spam?" To answer this, we will need a strategy for turning the text in our email into numbers. Fortunately, the general-purpose text mining packages available in R will do much of this work for us. For that reason, much of this chapter will focus on building up your intuition for the types of features that people have used in the past when working with text data. Feature generation is a major topic in current machine learning research and is still very far from being automated in a general-purpose way.
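The 0/1 coding idea for the HTML-tag question can be sketched in a few lines. The book does this in R with the tm package; this Python version, with the hypothetical helper has_html and made-up example messages, only illustrates the coding strategy:

```python
import re

def has_html(email_text: str) -> int:
    # 1 if the message body contains something that looks like an HTML tag,
    # 0 otherwise -- a single binary feature for the classifier.
    return 1 if re.search(r"<[a-zA-Z]+[^>]*>", email_text) else 0

spam_like = "<html><body><b>WIN NOW</b></body></html>"
ham_like = "Hi team, notes from today's meeting are attached."
features = [has_html(spam_like), has_html(ham_like)]
```

Each message is reduced to a number the model can use; a real feature set would stack many such columns.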


pages: 233 words: 67,596

Competing on Analytics: The New Science of Winning by Thomas H. Davenport, Jeanne G. Harris

always be closing, big data - Walmart - Pop Tarts, business intelligence, business process, call centre, commoditize, data acquisition, digital map, en.wikipedia.org, global supply chain, high net worth, if you build it, they will come, intangible asset, inventory management, iterative process, Jeff Bezos, job satisfaction, knapsack problem, late fees, linear programming, Moneyball by Michael Lewis explains big data, Netflix Prize, new economy, performance metric, personalized medicine, quantitative hedge fund, quantitative trading / quantitative finance, recommendation engine, RFID, search inside the book, shareholder value, six sigma, statistical model, supply-chain management, text mining, The future is already here, the scientific method, traveling salesman, yield management

SAS offers both data and text mining capabilities and is a major vendor in both categories. Text mining tools can help managers quickly identify emerging trends in near–real time. Spiders, or data crawlers, which identify and count words and phrases on Web sites, are a simple example of text mining. Text mining tools can be invaluable in sniffing out new trends or relationships. For example, by monitoring technical-user blogs, a vendor can recognize that a new product has a defect within hours of being shipped instead of waiting for complaints to arrive from customers. Other text mining products can recognize references to people, places, things, or topics and use this information to draw inferences about competitor behavior.

The mining or detailed analysis of data is by now quite advanced, but the mining of text is clearly in its early stages and is likely to expand considerably over the next few years. In chapter 4, we described an example of text mining of automobile service reports at Honda, which supports its distinctive capability of quality manufacturing and service. Other text mining applications could be equally important in analyzing strategic and market trends. The vast amount of textual information on the Internet will undoubtedly fuel a rapid rise for text mining. There are also a few technology-oriented changes in the world of analytical competition that are not generally present today, even among the most sophisticated firms.

Technology could play a substantial (but not exclusive) role in generating and analyzing intangible capabilities.10 Intangibles are not easily reducible to a set of numbers, and hence this type of analysis will often involve work with text and other less structured forms of information. We’ve already described, for example, how Honda uses text mining to identify and head off potential quality problems in its cars. Similar text and Web mining approaches could be used to, for example, better understand customer perceptions of service and brand value. Consultant and former academic Richard Hackathorn was writing as early as 1999 that “the web is the mother of all data warehouses,” and even then described a Web farming application for Eaton Corporation that was “monitoring hundreds of markets for technology shifts, emerging competitors, and governmental regulations.”11 What could be more strategic and competitive?


Data Mining: Concepts and Techniques: Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei

backpropagation, bioinformatics, business intelligence, business process, Claude Shannon: information theory, cloud computing, computer vision, correlation coefficient, cyber-physical system, database schema, discrete time, disinformation, distributed generation, finite state, information retrieval, iterative process, knowledge worker, linked data, natural language processing, Netflix Prize, Occam's razor, pattern recognition, performance metric, phenotype, random walk, recommendation engine, RFID, semantic web, sentiment analysis, speech recognition, statistical model, stochastic process, supply-chain management, text mining, thinkpad, Thomas Bayes, web application

Other examples include multilingual data mining, multidimensional text analysis, contextual text mining, and trust and evolution analysis in text data, as well as text mining applications in security, biomedical literature analysis, online media analysis, and analytical customer relationship management. Various kinds of text mining and analysis software and tools are available in academic institutions, open-source forums, and industry. Text mining often also uses WordNet, Semantic Web, Wikipedia, and other information sources to enhance the understanding and mining of text data.

This is typically done through the discovery of patterns and trends by means such as statistical pattern learning, topic modeling, and statistical language modeling. Text mining usually requires structuring the input text (e.g., parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database). This is followed by deriving patterns within the structured data, and evaluation and interpretation of the output. “High quality” in text mining usually refers to a combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity-relation modeling (i.e., learning relations between named entities).
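The two-stage process described here — structure the input text, then derive patterns from the structured form — can be sketched with a plain bag-of-words representation. This is a minimal illustration with made-up documents, not the book's code:

```python
import re
from collections import Counter

def tokenize(doc):
    # Crude structuring step: lowercase and split into word tokens.
    return re.findall(r"[a-z']+", doc.lower())

docs = [
    "Text mining derives high-quality information from text.",
    "Topic modeling and statistical language modeling find patterns in text.",
]

# "Structuring the input text": one token-count vector per document.
term_counts = [Counter(tokenize(d)) for d in docs]

# "Deriving patterns within the structured data": terms that recur
# across documents are candidate signals.
shared = set(term_counts[0]) & set(term_counts[1])
```

Real systems replace the tokenizer with parsing and linguistic feature extraction, and the set intersection with topic models or statistical language models, but the structure-then-mine shape is the same.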

Other topics in multimedia mining include classification and prediction analysis, mining associations, and video and audio data mining (Section 13.2.3). Mining Text Data Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. A substantial portion of information is stored as text such as news articles, technical papers, books, digital libraries, email messages, blogs, and web pages. Hence, research in text mining has been very active. An important goal is to derive high-quality information from text. This is typically done through the discovery of patterns and trends by means such as statistical pattern learning, topic modeling, and statistical language modeling.


pages: 451 words: 103,606

Machine Learning for Hackers by Drew Conway, John Myles White

call centre, centre right, correlation does not imply causation, Debian, Erdős number, Nate Silver, natural language processing, Netflix Prize, p-value, pattern recognition, Paul Erdős, recommendation engine, social graph, SpamAssassin, statistical model, text mining, the scientific method, traveling salesman

In these cases you will need to install from source: install.packages("tm", dependencies=TRUE) setwd("~/Downloads/") install.packages("RCurl_1.5-0.tar.gz", repos=NULL, type="source") In the first example, we use the default settings to install the tm package from CRAN. The tm package provides functions used to do text mining, and we will use it in Chapter 3 to perform classification on email text. One useful parameter in the install.packages function is suggests, which by default is set to FALSE, but if activated will instruct the function to download and install any secondary packages used by the primary installation.

- reshape (http://had.co.nz/plyr/), Hadley Wickham: a set of tools used to manipulate, aggregate, and manage data in R.
- RJSONIO (http://www.omegahat.org/RJSONIO/), Duncan Temple Lang: provides functions for reading and writing JavaScript Object Notation (JSON); used to parse data from web-based APIs.
- tm (http://www.spatstat.org/spatstat/), Ingo Feinerer: a collection of functions for performing text mining in R; used to work with unstructured text data.
- XML (http://www.omegahat.org/RSXML/), Duncan Temple Lang: provides the facility to parse XML and HTML documents; used to extract structured data from the Web.
As mentioned, we will use several packages through the course of this book. Table 1-2 lists all of the packages used in the case studies and includes a brief description of their purpose, along with a link to additional information about each.

In our case, that will be a 0/1 coding strategy: spam or ham. For example, we may want to determine the following: “Does containing HTML tags make an email more likely to be spam?” To answer this, we will need a strategy for turning the text in our email into numbers. Fortunately, the general-purpose text-mining packages available in R will do much of this work for us. For that reason, much of this chapter will focus on building up your intuition for the types of features that people have used in the past when working with text data. Feature generation is a major topic in current machine learning research and is still very far from being automated in a general-purpose way.


pages: 502 words: 107,657

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel

Albert Einstein, algorithmic trading, Amazon Mechanical Turk, Apple's 1984 Super Bowl advert, backtesting, Black Swan, book scanning, bounce rate, business intelligence, business process, butter production in bangladesh, call centre, Charles Lindbergh, commoditize, computer age, conceptual framework, correlation does not imply causation, crowdsourcing, dark matter, data is the new oil, en.wikipedia.org, Erik Brynjolfsson, Everything should be made as simple as possible, experimental subject, Google Glasses, happiness index / gross national happiness, job satisfaction, Johann Wolfgang von Goethe, lifelogging, Machine translation of "The spirit is willing, but the flesh is weak." to Russian and back, mass immigration, Moneyball by Michael Lewis explains big data, Nate Silver, natural language processing, Netflix Prize, Network effects, Norbert Wiener, personalized medicine, placebo effect, prediction markets, Ray Kurzweil, recommendation engine, risk-adjusted returns, Ronald Coase, Search for Extraterrestrial Intelligence, self-driving car, sentiment analysis, Shai Danziger, software as a service, speech recognition, statistical model, Steven Levy, text mining, the scientific method, The Signal and the Noise by Nate Silver, The Wisdom of Crowds, Thomas Bayes, Thomas Davenport, Turing test, Watson beat the top human players on Jeopardy!, X Prize, Yogi Berra, zero-sum game

For final results, see http://stat.duke.edu/datafest/results. G. Miner, D. Delen, J. Elder, A. Fast, T. Hill, and B. Nisbet, Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications (Academic Press, 2012), Part II, Tutorial K, p. 417, by Richard Foley of SAS. Tap directly into Kiva's loan database with the Kiva API: http://build.kiva.org. U.S. Social Security Administration: Thanks to John Elder, PhD, Elder Research, Inc. (www.datamininglab.com), for this case study. John Elder, PhD, "Text Mining to Fast-Track Deserving Disability Applicants," Elder Research, Inc., August 7, 2010. http://videolectures.net/site/normal_dl/tag=73772/kdd2010_elder_tmft_01.pdf.

Machine and the Quest to Know Everything (Houghton Mifflin Harcourt, 2011), 212–224. Quote about Google’s book scanning project: George Dyson, Turing’s Cathedral: The Origins of the Digital Universe (Pantheon Books, 2012). Natural language processing: Dursun Delen, Andrew Fast, Thomas Hill, Robert Nisbet, John Elder, and Gary Miner, Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications (Academic Press, 2012). James Allen, Natural Language Understanding, 2nd ed. (Addison-Wesley, 1994). Regarding the translation of “The spirit is willing, but the flesh is weak”: John Hutchins, “The Whisky Was Invisible or Persistent Myths of MT,” MT News International 11 (June 1995), 17–18. www.hutchinsweb.me.uk/MTNI-11-1995.pdf.

TTX: Thanks to Mahesh Kumar at Tiger Analytics for this case study, “Predicting Wheel Failure Rate for Railcars.” Fortune 500 global technology company: Thanks to Dean Abbott, Abbott Analytics (http://abbottanalytics.com/index.php) for information about this case study. “Inductive Business-Rule Discovery in Text Mining.” Text Analytics World San Francisco Conference, March 7, 2012, San Francisco, CA. www.textanalyticsworld.com/sanfrancisco/2012/agenda/full-agenda#day11040-11-2. Leading payments processor: Thanks to Robert Grossman, Open Data Group (http://opendatagroup.com), for this case study. “Scaling Health and Status Models to Large, Complex Systems,” Predictive Analytics World San Francisco Conference, February 16, 2010, San Francisco, CA. www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day1-17.


pages: 660 words: 141,595

Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking by Foster Provost, Tom Fawcett

Albert Einstein, Amazon Mechanical Turk, big data - Walmart - Pop Tarts, bioinformatics, business process, call centre, chief data officer, Claude Shannon: information theory, computer vision, conceptual framework, correlation does not imply causation, crowdsourcing, data acquisition, David Brooks, en.wikipedia.org, Erik Brynjolfsson, Gini coefficient, independent contractor, information retrieval, intangible asset, iterative process, Johann Wolfgang von Goethe, Louis Pasteur, Menlo Park, Nate Silver, Netflix Prize, new economy, p-value, pattern recognition, placebo effect, price discrimination, recommendation engine, Ronald Coase, selection bias, Silicon Valley, Skype, speech recognition, Steve Jobs, supply-chain management, text mining, The Signal and the Noise by Nate Silver, Thomas Bayes, transaction costs, WikiLeaks

From this point of view, there is a huge stream of market news coming in—some interesting, most not. We’d like predictive text mining to recommend interesting news stories that we should pay attention to. What’s an interesting story? Here we’ll define it as news that will likely result in a significant change in a stock’s price. We have to simplify the problem further to make it more tractable (in fact, this task is a good example of problem formulation as much as it is of text mining). Here are some of the problems and simplifying assumptions: It is difficult to predict the effect of news far in advance.
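The simplifying definition — an "interesting" story is one followed by a significant change in the stock's price — can be expressed as a tiny labeling rule. This is a hypothetical sketch with a made-up 5% threshold, not the book's formulation:

```python
def label_story(price_before: float, price_after: float,
                threshold: float = 0.05) -> int:
    # Label 1 ("interesting") if the relative price move around the
    # story meets the threshold, else 0 ("not interesting").
    change = abs(price_after - price_before) / price_before
    return 1 if change >= threshold else 0

# A 7% move qualifies; a 1% move does not.
labels = [label_story(100.0, 107.0), label_story(100.0, 101.0)]
```

Labels produced this way would become the training targets for the predictive text-mining model the authors describe.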

Expressing this formally would lead to equations like: Instead we opt for the more readable: with the understanding that x is a vector and Age and Balance are components of it. We have tried to be consistent with typography, reserving fixed-width typewriter fonts like sepal_width to indicate attributes or keywords in data. For example, in the text-mining chapter, a word like 'discussing' designates a word in a document while discuss might be the resulting token in the data. The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

In principle, text is just another form of data, and text processing is just a special case of representation engineering. In reality, dealing with text requires dedicated pre-processing steps and sometimes specific expertise on the part of the data science team. Entire books and conferences (and companies) are devoted to text mining. In this chapter we can only scratch the surface, to give a basic overview of the techniques and issues involved in typical business applications. First, let’s discuss why text is so important and why it’s difficult. Why Text Is Important Text is everywhere. Many legacy applications still produce or record text.


pages: 398 words: 86,855

Bad Data Handbook by Q. Ethan McCallum

Amazon Mechanical Turk, asset allocation, barriers to entry, Benoit Mandelbrot, business intelligence, cellular automata, chief data officer, Chuck Templeton: OpenTable:, cloud computing, cognitive dissonance, combinatorial explosion, commoditize, conceptual framework, database schema, DevOps, en.wikipedia.org, Firefox, Flash crash, functional programming, Gini coefficient, illegal immigration, iterative process, labor-force participation, loose coupling, natural language processing, Netflix Prize, quantitative trading / quantitative finance, recommendation engine, selection bias, sentiment analysis, statistical model, supply-chain management, survivorship bias, text mining, too big to fail, web application

- unicode.encode: used with unicode.encode('ascii', 'xmlcharrefreplace') (Examples 4-9, 4-10)
- HTMLParser.unescape: decodes an HTML-encoded string (Examples 4-9, 4-10)
- csv.reader: parses delimited text (Example 4-11)
The functions listed in Table 4-5 are good low-level building blocks for creating text processing and text mining applications. There are a lot of excellent Open Source Python libraries for higher level text analysis. A few of my favorites are listed in Table 4-6.
Table 4-6. Third-party Python reference:
- NLTK: parsers, tokenizers, stemmers, classifiers
- BeautifulSoup: HTML & XML parsers, tolerant of bad inputs
- gensim: topic modeling
- jellyfish: approximate and phonetic string matching
These tools provide a great starting point for many text processing, text mining, and text analysis applications. Exercises: The results shown in Example 4-3 were generated when the n parameter to make_alnum_sample was set to 512.
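The building blocks named in the excerpt are Python 2 spellings; a sketch of their Python 3 equivalents (str.encode with the xmlcharrefreplace error handler, html.unescape, and csv.reader) looks like this:

```python
import csv
import html
import io

# Encode non-ASCII text, replacing unencodable characters with
# XML character references instead of raising an error.
encoded = "café".encode("ascii", "xmlcharrefreplace")

# Decode an HTML-encoded string back to Unicode
# (html.unescape replaces Python 2's HTMLParser.unescape).
decoded = html.unescape("caf&#233;")

# Parse delimited text from any file-like object.
rows = list(csv.reader(io.StringIO("a,b\n1,2\n")))
```

The round trip through character references is what makes this handler useful for shoving Unicode through ASCII-only pipelines without losing data.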

His research area is Statistical Computing and Graphics and he is a member of the core development team for the R project. He is the author of two books, R Graphics and Introduction to Data Technologies, and is a Fellow of the American Statistical Association. Josh Levy is a data scientist in Austin, Texas. He works on content recommendation and text mining systems. He earned his doctorate at the University of North Carolina where he researched statistical shape models for medical image segmentation. His favorite foosball shot is banked from the backfield. Adam Laiacano has a BS in Electrical Engineering from Northeastern University and spent several years designing signal detection systems for atomic clocks before joining a prominent NYC-based startup.

Problem: Application-Specific Characters Leaking into Plain Text Some applications have characters or sequences of characters with application-specific meanings. One source of bad text data I have encountered is when these sequences leak into places where they don’t belong. This can arise anytime the data flows through a tool with a restricted vocabulary. One project where I had to clean up this type of bad data involved text mining on web content. The users of one system would submit content through a web form to a server where it was stored in a database and in a text index before being embedded in other HTML files for display to additional users. The analysis I performed looked at dumps from various tools that sat on top of the database and/or the final HTML files.


pages: 523 words: 112,185

Doing Data Science: Straight Talk From the Frontline by Cathy O'Neil, Rachel Schutt

Amazon Mechanical Turk, augmented reality, Augustin-Louis Cauchy, barriers to entry, Bayesian statistics, bioinformatics, computer vision, correlation does not imply causation, crowdsourcing, distributed generation, Edward Snowden, Emanuel Derman, fault tolerance, Filter Bubble, finite state, Firefox, game design, Google Glasses, index card, information retrieval, iterative process, John Harrison: Longitude, Khan Academy, Kickstarter, Mars Rover, Nate Silver, natural language processing, Netflix Prize, p-value, pattern recognition, performance metric, personalized medicine, pull request, recommendation engine, rent-seeking, selection bias, Silicon Valley, speech recognition, statistical model, stochastic process, text mining, the scientific method, The Wisdom of Crowds, Watson beat the top human players on Jeopardy!, X Prize

For the first class, the initial thought experiment was: can we use data science to define data science? The class broke into small groups to think about and discuss this question. Here are a few interesting things that emerged from those conversations: Start with a text-mining model. We could do a Google search for “data science” and perform a text-mining model. But that would depend on us being a usagist rather than a prescriptionist with respect to language. A usagist would let the masses define data science (where “the masses” refers to whatever Google’s search engine finds). Would it be better to be a prescriptionist and refer to an authority such as the Oxford English Dictionary?

Just because your method converges, it doesn’t mean the results are meaningful. Make sure you’ve created a reasonable narrative and ways to check its validity. Chapter 12. Epidemiology The contributor for this chapter is David Madigan, professor and chair of statistics at Columbia. Madigan has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance, and probabilistic graphical models. Madigan’s Background Madigan went to college at Trinity College Dublin in 1980, and specialized in math except for his final year, when he took a bunch of stats courses, and learned a bunch about computers: Pascal, operating systems, compilers, artificial intelligence, database theory, and rudimentary computing skills.

goodness, Research Experiment (Observational Medical Outcomes Partnership) Google, Big Data and Data Science Hype, Getting Past the Hype, Getting Past the Hype, Datafication, Populations and Samples of Big Data, Machine Learning Algorithms, Evaluation, David Huffaker: Google’s Hybrid Approach to Social Research Bell Labs and, Exploratory Data Analysis experimental infrastructures, A/B Tests issues with, Feature Selection machine learning and, Machine Learning Algorithms MapReduce and, Data Engineering: MapReduce, Pregel, and Hadoop mixed-method approaches and, Moving from Descriptive to Predictive privacy and, Privacy sampling and, Populations and Samples of Big Data skills for, The Current Landscape (with a Little History) social layer at, Social at Google social research, approach to, David Huffaker: Google’s Hybrid Approach to Social Research–Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control? text-mining models and, Thought Experiment: Meta-Definition Google glasses, Datafication Google+, The Data Science Process, David Huffaker: Google’s Hybrid Approach to Social Research, Moving from Descriptive to Predictive, Social Networks and Data Journalism graph statistics, A Second Example of Random Graphs: The Exponential Random Graph Model graph theory, Social Network Analysis grouping data, k-means groups, Terminology from Social Networks Guyon, Isabelle, Example: User Retention, Filters H Hadoop, Populations and Samples of Big Data, Economic Interlude: Hadoop–Cloudera analytical applications, So How to Get Started with Hadoop?


pages: 504 words: 89,238

Natural language processing with Python by Steven Bird, Ewan Klein, Edward Loper

bioinformatics, business intelligence, conceptual framework, Donald Knuth, elephant in my pajamas, en.wikipedia.org, finite state, Firefox, functional programming, Guido van Rossum, information retrieval, Menlo Park, natural language processing, P = NP, search inside the book, speech recognition, statistical model, text mining, Turing test

By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society. This book provides a highly accessible introduction to the field of NLP. It can be used for individual study or as the textbook for a course on natural language processing or computational linguistics, or as a supplement to courses in artificial intelligence, text mining, or corpus linguistics. The book is intensely practical, containing hundreds of fully worked examples and graded exercises. The book is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK). NLTK includes extensive software, data, and documentation, all freely downloadable from http://www.nltk.org/.

The same format was adopted by CoNLL 2000 for annotating a section of Wall Street Journal text as part of a shared task on NP chunking. Section 13.5 of (Jurafsky & Martin, 2008) contains a discussion of chunking. Chapter 22 covers information extraction, including named entity recognition. For information about text mining in biology and medicine, see (Ananiadou & McNaught, 2006). For more information on the Getty and Alexandria gazetteers, see http://en.wikipedia.org/wiki/Getty_Thesaurus_of_Geographic_Names and http://www.alexandria.ucsb.edu/gazetteer/. 7.9 Exercises 1. ○ The IOB format categorizes tagged tokens as I, O, and B.
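The IOB scheme the exercise refers to can be illustrated with a short grouping loop: B begins a chunk, I continues it, and O marks tokens outside any chunk. This is a plain-Python sketch with a made-up tagged sentence, independent of NLTK:

```python
# Tokens paired with IOB tags, as a chunker would emit them.
tagged = [("We", "O"), ("saw", "O"), ("the", "B-NP"),
          ("yellow", "I-NP"), ("dog", "I-NP")]

chunks, current = [], []
for token, tag in tagged:
    if tag.startswith("B"):          # B: start a new chunk
        if current:
            chunks.append(current)
        current = [token]
    elif tag.startswith("I") and current:
        current.append(token)        # I: extend the open chunk
    else:                            # O: close any open chunk
        if current:
            chunks.append(current)
            current = []
if current:
    chunks.append(current)
```

Here the single NP chunk recovered is ["the", "yellow", "dog"], matching how CoNLL 2000 chunk files are interpreted.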

Chapman and Hall, 2008. [Agirre and Edmonds, 2007] Eneko Agirre and Philip Edmonds. Word Sense Disambiguation: Algorithms and Applications. Springer, 2007. [Alpaydin, 2004] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004. [Ananiadou and McNaught, 2006] Sophia Ananiadou and John McNaught, editors. Text Mining for Biology and Biomedicine. Artech House, 2006. [Androutsopoulos et al., 1995] Ion Androutsopoulos, Graeme Ritchie, and Peter Thanisch. Natural language interfaces to databases—an introduction. Journal of Natural Language Engineering, 1:29–81, 1995. [Artstein and Poesio, 2008] Ron Artstein and Massimo Poesio.


The Data Journalism Handbook by Jonathan Gray, Lucy Chambers, Liliana Bounegru

Amazon Web Services, barriers to entry, bioinformatics, business intelligence, carbon footprint, citizen journalism, correlation does not imply causation, crowdsourcing, David Heinemeier Hansson, eurozone crisis, Firefox, Florence Nightingale: pie chart, game design, Google Earth, Hans Rosling, information asymmetry, Internet Archive, John Snow's cholera map, Julian Assange, linked data, moral hazard, MVC pattern, New Journalism, openstreetmap, Ronald Reagan, Ruby on Rails, Silicon Valley, social graph, SPARQL, text mining, web application, WikiLeaks

It was built using ArcView with Spatial Analyst. Figure 6-25. Animated heat map (Verdens Gang) Text Mining For this visualization, we text mined speeches held by the seven Norwegian party leaders during their conventions. All speeches were analyzed, and the analyses supplied angles for some stories. Every story was linked to the graph, and the readers could explore and study the language of politicians. This was built using Excel, Access, Flash, and Illustrator. If this had been built in 2012, we would have made the interactive graph in JavaScript. Figure 6-26. Text mining speeches from party leaders (Verdens Gang) Concluding Notes When do we need to visualize a story?


pages: 23 words: 5,264

Designing Great Data Products by Jeremy Howard, Mike Loukides, Margit Zwemer

AltaVista, Filter Bubble, PageRank, pattern recognition, recommendation engine, self-driving car, sentiment analysis, Silicon Valley, text mining

In another area where objective-based data products have the power to change lives, the CMU extension in Silicon Valley has an active project for building data products to help first responders after natural or man-made disasters. Jeannie Stamberger of Carnegie Mellon University Silicon Valley explained to us many of the possible applications of predictive algorithms to disaster response, from text-mining and sentiment analysis of tweets to determine the extent of the damage, to swarms of autonomous robots for reconnaissance and rescue, to logistic optimization tools that help multiple jurisdictions coordinate their responses. These disaster applications are a particularly good example of why data products need simple, well-designed interfaces that produce concrete recommendations.


pages: 204 words: 58,565

Keeping Up With the Quants: Your Guide to Understanding and Using Analytics by Thomas H. Davenport, Jinho Kim

Black-Scholes formula, business intelligence, business process, call centre, computer age, correlation coefficient, correlation does not imply causation, Credit Default Swap, en.wikipedia.org, feminist movement, Florence Nightingale: pie chart, forensic accounting, global supply chain, Hans Rosling, hypertext link, invention of the telescope, inventory management, Jeff Bezos, Johannes Kepler, longitudinal study, margin call, Moneyball by Michael Lewis explains big data, Myron Scholes, Netflix Prize, p-value, performance metric, publish or perish, quantitative hedge fund, random walk, Renaissance Technologies, Robert Shiller, self-driving car, sentiment analysis, six sigma, Skype, statistical model, supply-chain management, text mining, the scientific method, Thomas Davenport

There are various types of analytics that serve different purposes for researchers:
- Statistics: the science of collection, organization, analysis, interpretation, and presentation of data
- Forecasting: the estimation of some variable of interest at some specified future point in time as a function of past data
- Data mining: the automatic or semiautomatic extraction of previously unknown, interesting patterns in large quantities of data through the use of computational algorithmic and statistical techniques
- Text mining: the process of deriving patterns and trends from text in a manner similar to data mining
- Optimization: the use of mathematical techniques to find optimal solutions with regard to some criteria while satisfying constraints
- Experimental design: the use of test and control groups, with random assignment of subjects or cases to each group, to elicit the cause and effect relationships in a particular outcome
Although the list presents a range of analytics approaches in common use, it is unavoidable that considerable overlaps exist in the use of techniques across the types.

A demanding course of study: Full-time (M–F/9–5) study on campus; an integrated curriculum shared with the other students; working in teams throughout the program; and, typically, working on projects when not in class
A broad and practical content focus: An integrated, multidisciplinary curriculum (drawing from multiple schools and departments at NC State) aimed at the acquisition of practical skills that can be applied to real-world problems, drawing on fields such as statistics, applied mathematics, computer science, operations research, finance and economics, and marketing science
Learning by doing: Use of a practicum rather than the traditional MS thesis (students work in teams of five, using real-world problems and data provided by an industry sponsor; highly structured, substantive work conducted over seven months culminates in a final report to the sponsor)
The NCSU MSA has a novel curriculum consisting of classes developed exclusively for the program. Topics include data mining, text mining, forecasting, optimization, databases, data visualization, data privacy and security, financial analytics, and customer analytics. Students come into the program with a variety of backgrounds, though some degree of quantitative orientation is desired. The average age of students is twenty-seven, and about 26 percent of students enrolled have a prior graduate degree.


Big Data at Work: Dispelling the Myths, Uncovering the Opportunities by Thomas H. Davenport

Automated Insights, autonomous vehicles, bioinformatics, business intelligence, business process, call centre, chief data officer, cloud computing, commoditize, data acquisition, disruptive innovation, Edward Snowden, Erik Brynjolfsson, intermodal, Internet of things, Jeff Bezos, knowledge worker, lifelogging, Mark Zuckerberg, move fast and break things, Narrative Science, natural language processing, Netflix Prize, New Journalism, recommendation engine, RFID, self-driving car, sentiment analysis, Silicon Valley, smart grid, smart meter, social graph, sorting algorithm, statistical model, Tesla Model S, text mining, Thomas Davenport

In keeping with the high-performance capabilities of the platform, MapReduce instructions are processed in parallel across various nodes on the big data platform, and then quickly assembled to provide a new data structure or answer set. An example of a big data application in Hadoop might be “Find the number of all the influential customers who like us on social media.” A text-mining application might crunch through social media transactions, searching for words such as fan, love, bought, or awesome and consolidating a list of key influencer customers with positive sentiment. Apache Pig and Hive are two open-source scripting languages that sit on top of Hadoop and provide a higher-level language for carrying out MapReduce functionality in application code.
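As a rough sketch of the MapReduce pattern described here (the keyword list, sample posts, and function names are invented for illustration), a map step can emit keyword hits per customer and a reduce step can consolidate them into an influencer list:

```python
from collections import defaultdict

POSITIVE_WORDS = {"fan", "love", "bought", "awesome"}

def map_phase(post):
    """Map: emit a (customer, 1) pair for each positive keyword in a post."""
    customer, text = post
    return [(customer, 1) for word in text.lower().split() if word in POSITIVE_WORDS]

def reduce_phase(pairs):
    """Reduce: sum keyword hits per customer."""
    totals = defaultdict(int)
    for customer, count in pairs:
        totals[customer] += count
    return dict(totals)

posts = [("alice", "I love this brand, awesome service"),
         ("bob", "meh"),
         ("alice", "bought another one")]
pairs = [pair for post in posts for pair in map_phase(post)]
influencers = reduce_phase(pairs)
```

In Hadoop, the map calls run in parallel across nodes and the framework shuffles pairs to reducers by key; the single-process version above only illustrates the contract.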

Over time there will be increasing integration between Macys.com and the rest of Macy’s systems and data on customers, since Tomak and his colleagues believe that an omnichannel approach to customer relationships is the right direction for the future. …natural language processing or text-mining skills, video or image analytics, and visual analytics. Many of the data scientists are also able to code in scripting languages like Python, Pig, and Hive. In terms of backgrounds, some have PhDs in scientific fields; others are simply strong programmers with some analytical skills. Many of our interviewees questioned whether a data scientist could possess all the needed skills and were taking a team-based approach to assembling them.


Beautiful Data: The Stories Behind Elegant Data Solutions by Toby Segaran, Jeff Hammerbacher

23andMe, airport security, Amazon Mechanical Turk, bioinformatics, Black Swan, business intelligence, card file, cloud computing, computer vision, correlation coefficient, correlation does not imply causation, crowdsourcing, Daniel Kahneman / Amos Tversky, DARPA: Urban Challenge, data acquisition, database schema, double helix, en.wikipedia.org, epigenetics, fault tolerance, Firefox, Hans Rosling, housing crisis, information retrieval, lake wobegon effect, longitudinal study, Mars Rover, natural language processing, openstreetmap, prediction markets, profit motive, semantic web, sentiment analysis, Simon Singh, social graph, SPARQL, speech recognition, statistical model, supply-chain management, text mining, Vernor Vinge, web application

Google, seeing an opportunity to leverage its expertise in the search domain, introduced an enterprise search appliance in 2000. In a few short years, enterprise search has grown into a multibillion-dollar market segment that is almost totally separate from the data warehouse market. Endeca has some tools for more traditional business intelligence, and some database vendors have worked to introduce text mining capabilities into their systems, but a complete, integrated solution for structured and unstructured enterprise data management remains unrealized. Both enterprise search and data warehousing are technical solutions to the larger problem of leveraging the information resources of an organization to improve performance.

Her primary area of interest is helping the law enforcement and intelligence communities discover actionable information buried within their very large data collections. She has architected a large number of systems that detect and assess threat risk relative to fraud, terrorism, counterintelligence, and criminal activity. She has helped pioneer the application of technologies such as data mining, text mining, and machine translation to exploit the information accessible to shared intelligence environments. Dr. Sokol has numerous papers published in these areas. She received her doctorate in Operations Research from the University of Massachusetts. Utkarsh Srivastava is a senior research scientist at Yahoo!


Practical OCaml by Joshua B. Smith

cellular automata, Debian, domain-specific language, functional programming, general-purpose programming language, Grace Hopper, hiring and firing, John Conway, Paul Graham, slashdot, SpamAssassin, text mining, Turing complete, type inference, web application, Y2K

DSLs are found in a variety of applications: configuration languages, embedded scripting languages, and data description languages. SQL is probably the most popular DSL in wide use. These kinds of languages can be built using ocamlyacc. You can also use a parser to handle situations that are too complicated for regular expressions alone. Text mining and log file analysis are two areas in which having a lexer/parser combination can result in better code and easier maintenance.
A Small Discussion of Small Languages
DSLs are programming languages that are focused on one problem domain. That problem domain can be anything: text processing, image manipulation, configuration, page layout, and so on.
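The lexer/parser pairing for log analysis can be sketched in a few lines. This is an illustrative stand-in in Python rather than the book's ocamllex/ocamlyacc, and the log format is invented:

```python
import re

# Hypothetical log format: "2023-01-05 ERROR disk full"
LOG_LINE = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>[A-Z]+)\s+(?P<message>.*)")

def parse_line(line):
    """Lex a log line into structured fields; None if it doesn't match."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

def count_levels(lines):
    """Aggregate parsed lines by severity level, skipping unparseable ones."""
    counts = {}
    for line in lines:
        rec = parse_line(line)
        if rec:
            counts[rec["level"]] = counts.get(rec["level"], 0) + 1
    return counts
```

A generated lexer/parser earns its keep once the grammar grows nested structure (multi-line records, quoted fields) that a single regular expression handles poorly.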

text mining, parsers and, 203 theorem solvers, 29 -thread flag, 407 Thread module, 311 threads, 309–327 creating/using, 310–316 modules for, 316–322 sample POP3 client and, 322–327


pages: 169 words: 56,250

Startup Communities: Building an Entrepreneurial Ecosystem in Your City by Brad Feld

barriers to entry, cleantech, cloud computing, corporate social responsibility, G4S, Grace Hopper, job satisfaction, Kickstarter, Lean Startup, minimum viable product, Network effects, paypal mafia, Peter Thiel, place-making, pre–internet, Richard Florida, Ruby on Rails, Silicon Valley, Silicon Valley startup, smart cities, software as a service, Steve Jobs, text mining, Y Combinator, zero-sum game, Zipcar

I bumped into Fred Wilson’s blog (http://startuprev.com/l4) and Brad Feld’s blog (http://startuprev.com/o4 and http://startuprev.com/h1) and was amazed at the wealth of knowledge and wisdom that these two individuals were sharing freely on the Internet. I met this small team sitting in the old fishing factory in the Reykjavik Harbor, working on text mining. They were each younger than 25 years old and called their company CLARA. They wanted to build a software-as-a-service company that helped gaming companies understand their communities. I was startled. These kids were not worried about the ISK or the government or the global financial crisis or anything.


pages: 523 words: 61,179

Human + Machine: Reimagining Work in the Age of AI by Paul R. Daugherty, H. James Wilson

3D printing, AI winter, algorithmic trading, Amazon Mechanical Turk, augmented reality, autonomous vehicles, blockchain, business process, call centre, carbon footprint, cloud computing, computer vision, correlation does not imply causation, crowdsourcing, digital twin, disintermediation, Douglas Hofstadter, en.wikipedia.org, Erik Brynjolfsson, friendly AI, future of work, industrial robot, Internet of things, inventory management, iterative process, Jeff Bezos, job automation, job satisfaction, knowledge worker, Lyft, natural language processing, personalized medicine, precision agriculture, Ray Kurzweil, recommendation engine, RFID, ride hailing / ride sharing, risk tolerance, robotic process automation, Rodney Brooks, Second Machine Age, self-driving car, sensor fusion, sentiment analysis, Shoshana Zuboff, Silicon Valley, software as a service, speech recognition, telepresence, telepresence robot, text mining, the scientific method, uber lyft

AI now enables even more rapid analysis of customer preferences, allowing for personalized and customizable experiences. IntelligentX Brewing Company bills its products as the first beer brewed by AI. It translates its customer feedback, directed through Facebook Messenger, into recipe changes, which affect the brew composition over time. Lenovo uses text-mining tools to listen to customers voicing their problems worldwide. Insights from discussions of those problems then feed into product and service improvements. Las Vegas Sands Corp. uses AI to model different layouts of gaming stations throughout its casino to optimize financial performance. By monitoring how different layouts affect profits, the company gains continuous insights that inform future renovations.


pages: 284 words: 79,265

The Half-Life of Facts: Why Everything We Know Has an Expiration Date by Samuel Arbesman

Albert Einstein, Alfred Russel Wallace, Amazon Mechanical Turk, Andrew Wiles, bioinformatics, British Empire, Cesare Marchetti: Marchetti’s constant, Chelsea Manning, Clayton Christensen, cognitive bias, cognitive dissonance, conceptual framework, David Brooks, demographic transition, double entry bookkeeping, double helix, Galaxy Zoo, guest worker program, Gödel, Escher, Bach, Ignaz Semmelweis: hand washing, index fund, invention of movable type, Isaac Newton, John Harrison: Longitude, Kevin Kelly, life extension, Marc Andreessen, meta-analysis, Milgram experiment, National Debt Clock, Nicholas Carr, P = NP, p-value, Paul Erdős, Pluto: dwarf planet, publication bias, randomized controlled trial, Richard Feynman, Rodney Brooks, scientific worldview, social graph, social web, text mining, the scientific method, the strength of weak ties, Thomas Kuhn: the structure of scientific revolutions, Thomas Malthus, Tyler Cowen: Great Stagnation

While scientific progress isn’t necessarily correlated with a single publication—some papers might have multiple discoveries, and others might simply be confirming something we already know—it is often a good unit of study. Focusing on the scientific paper gives us many pieces of data to measure and study. We can look at the title and text and, using sophisticated algorithms from computational linguistics or text mining, determine the subject area. We can look at the authors themselves and create a web illustrating the interactions between scientists who write papers together. We can examine the affiliations of each of the authors and try to see which collaborations between individuals at different institutions are more effective.


Mastering Structured Data on the Semantic Web: From HTML5 Microdata to Linked Open Data by Leslie Sikos

AGPL, Amazon Web Services, bioinformatics, business process, cloud computing, create, read, update, delete, Debian, en.wikipedia.org, fault tolerance, Firefox, Google Chrome, Google Earth, information retrieval, Infrastructure as a Service, Internet of things, linked data, natural language processing, openstreetmap, optical character recognition, platform as a service, search engine result page, semantic web, Silicon Valley, social graph, software as a service, SPARQL, text mining, Watson beat the top human players on Jeopardy!, web application, wikimedia commons

Oracle Spatial and Graph features native inferencing with parallel, incremental, and secure operation for scalable reasoning with RDFS, OWL 2, SKOS, user-defined rules, and user-defined inference extensions. It has reasoner plug-ins for PelletDB and TrOWL. The semantic indexing of Oracle Spatial and Graph is suitable for text mining and entity analytics with integrated natural language processors. The database also supports R2RML direct mapping of relational data to RDF triples. For spatial RDF data storage and querying, Oracle supports GeoSPARQL as well. Oracle Spatial and Graph can be integrated with the Apache Jena and Sesame application development environments, along with the leading Semantic Web tools for querying, visualization, and ontology management.


Raw Data Is an Oxymoron by Lisa Gitelman

23andMe, collateralized debt obligation, computer age, continuous integration, crowdsourcing, disruptive innovation, Drosophila, Edmond Halley, Filter Bubble, Firefox, fixed income, Google Earth, Howard Rheingold, index card, informal economy, Isaac Newton, Johann Wolfgang von Goethe, knowledge worker, liberal capitalism, lifelogging, longitudinal study, Louis Daguerre, Menlo Park, optical character recognition, Panopticon Jeremy Bentham, peer-to-peer, RFID, Richard Thaler, Silicon Valley, social graph, software studies, statistical model, Stephen Hawking, Steven Pinker, text mining, time value of money, trade route, Turing machine, urban renewal, Vannevar Bush, WikiLeaks

In 1837, Weld had published The Bible Against Slavery, initially in the Anti-Slavery Quarterly Magazine, and then as a ninety-eight-page pamphlet. In it, he interpreted slavery in the Hebrew Bible as a form of paid service that could be stepped out of essentially at will, thus refuting claims that the Bible sanctioned chattel slavery as it was practiced in the United States. His biblical interpretation drew on another form of text mining, familiar to ministers: the concordance, essentially a keyword search through the text, providing context, in use since the thirteenth century. American Slavery As It Is importantly shifted the focus to the present when it took as its text the newspapers, along with testimony derived from questionnaires.


Designing Search: UX Strategies for Ecommerce Success by Greg Nudelman, Pabini Gabriel-Petit

access to a mobile phone, Albert Einstein, AltaVista, augmented reality, barriers to entry, business intelligence, call centre, crowdsourcing, information retrieval, Internet of things, performance metric, QR code, recommendation engine, RFID, search engine result page, semantic web, Silicon Valley, social graph, social web, speech recognition, text mining, the map is not the territory, The Wisdom of Crowds, web application, zero-sum game, Zipcar

Some search features that support discovery include the following:
faceted search—which shows people results having particular attributes and lets them browse and refine the results
user-review search—which lets people read the frank feedback of other consumers and possibly exposes the attributes of the reviewers through faceted search
buying guide, product-info sheet, or demo video search—which incorporates features of document search, like text mining and advanced relevancy, to show supplemental information alongside results from the product catalog
Context of the Channel
People’s expectations of search features vary according to the context of the channel. For example, they expect mobile to be location-aware, and in-store kiosks to be inventory-aware.


pages: 321

Finding Alphas: A Quantitative Approach to Building Trading Strategies by Igor Tulchinsky

algorithmic trading, asset allocation, automated trading system, backpropagation, backtesting, barriers to entry, business cycle, buy and hold, capital asset pricing model, constrained optimization, corporate governance, correlation coefficient, credit crunch, Credit Default Swap, discounted cash flows, discrete time, diversification, diversified portfolio, Eugene Fama: efficient market hypothesis, financial intermediation, Flash crash, implied volatility, index arbitrage, index fund, intangible asset, iterative process, Long Term Capital Management, loss aversion, market design, market microstructure, merger arbitrage, natural language processing, passive investing, pattern recognition, performance metric, Performance of Mutual Funds in the Period, popular capitalism, prediction markets, price discovery process, profit motive, quantitative trading / quantitative finance, random walk, Renaissance Technologies, risk free rate, risk tolerance, risk-adjusted returns, risk/return, selection bias, sentiment analysis, shareholder value, Sharpe ratio, short selling, Silicon Valley, speech recognition, statistical arbitrage, statistical model, stochastic process, survivorship bias, systematic trading, text mining, transaction costs, Vanguard fund, yield curve

If the footnotes do not appear meaningful, chances are the company is being intentionally obscure. The ability to detect early warning signs in the footnotes of financial reports sets elite investors apart from average ones. Because the footnotes appear as unstructured text, the advanced alpha researcher must find or develop appropriate text-mining systems to convert them to usable signals. Though this step adds to the difficulty of utilizing such data, it provides an opportunity to extract uncorrelated signals. Alpha researchers also see quarterly conference calls as a tool for corporate disclosure. While financial statements give insight into a company’s past performance, conference calls give investors both an idea of the current situation and management’s expectations for future performance.


pages: 374 words: 94,508

Infonomics: How to Monetize, Manage, and Measure Information as an Asset for Competitive Advantage by Douglas B. Laney

3D printing, Affordable Care Act / Obamacare, banking crisis, blockchain, business climate, business intelligence, business process, call centre, chief data officer, Claude Shannon: information theory, commoditize, conceptual framework, crowdsourcing, dark matter, data acquisition, digital twin, discounted cash flows, disintermediation, diversification, en.wikipedia.org, endowment effect, Erik Brynjolfsson, full employment, informal economy, intangible asset, Internet of things, linked data, Lyft, Nash equilibrium, Network effects, new economy, obamacare, performance metric, profit motive, recommendation engine, RFID, semantic web, smart meter, Snapchat, software as a service, source of truth, supply-chain management, text mining, uber lyft, Y2K, yield curve

While Infinity already had a rudimentary system which screened questionable claims based on “red flags,” it still required quite a bit of manual intervention, thereby slowing down the claims process, and affecting customer service. Moreover, its hard-coded flagging process had trouble catching emerging fraud patterns. However, by text mining the content of previous claims known to be fraudulent, Infinity matches these patterns to the content of police reports, medical files, and other accident-related documents. These patterns of “narrative inconsistency” indicate probable fraud. With this new predictive claims analytics system in place, Infinity’s success rate in pursuing fraudulent claims jumped from 55 percent to 88 percent, and in the first six months of operation the system increased subrogation recovery by $12 million.
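One naive way to match a new claim against the "patterns of narrative inconsistency" found in known-fraud documents is keyword overlap between narratives. This is a minimal sketch; the threshold, function names, and sample narratives are all invented, and a production system would use far richer features than bag-of-words similarity:

```python
def keyword_set(text):
    """Reduce a narrative to its set of lowercase words."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard overlap between two keyword sets (0.0 to 1.0)."""
    sa, sb = keyword_set(a), keyword_set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def flag_claim(claim, known_fraud_texts, threshold=0.3):
    """Flag a claim whose narrative resembles any known-fraud narrative."""
    return any(similarity(claim, f) >= threshold for f in known_fraud_texts)
```

Matching claims against police reports and medical files, as described above, would apply the same idea across multiple documents per claim and score the combined evidence.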


Data and the City by Rob Kitchin,Tracey P. Lauriault,Gavin McArdle

A Declaration of the Independence of Cyberspace, bike sharing scheme, bitcoin, blockchain, Bretton Woods, Chelsea Manning, citizen journalism, Claude Shannon: information theory, clean water, cloud computing, complexity theory, conceptual framework, corporate governance, correlation does not imply causation, create, read, update, delete, crowdsourcing, cryptocurrency, dematerialisation, digital map, distributed ledger, fault tolerance, fiat currency, Filter Bubble, floating exchange rates, functional programming, global value chain, Google Earth, hive mind, Internet of things, Kickstarter, knowledge economy, lifelogging, linked data, loose coupling, new economy, New Urbanism, Nicholas Carr, open economy, openstreetmap, packet switching, pattern recognition, performance metric, place-making, RAND corporation, RFID, Richard Florida, ride hailing / ride sharing, semantic web, sentiment analysis, sharing economy, Silicon Valley, Skype, smart cities, Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia, smart contracts, smart grid, smart meter, social graph, software studies, statistical model, TaskRabbit, text mining, The Chicago School, The Death and Life of Great American Cities, the market place, the medium is the message, the scientific method, Toyota Production System, urban planning, urban sprawl, web application

The stickiness of social media data resists the operationalization in automatic pipelines for knowledge extraction and manifests itself in false positives that can only be identified and resolved by a close reading of the source. This has consequences for the use of big data in urban governance, urban operation centres and predictive policing – applications that often rely on decontextualized data and reductive modes of analysis, such as text mining based on trigger words or dictionary-based sentiment analysis. Ignoring stickiness of context can lead to cases where a terrorism suspect identified by unsupervised text analysis turns out to be the journalist who reported on the issue (Currier et al. 2015). In this sense, stickiness points to issues of privacy even within the realm of publicly accessible data sources.
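Dictionary-based sentiment analysis of the kind criticized here fits in a few lines (the lexicons below are invented), and the sketch shows exactly the failure mode described: a report about an event trips the same trigger words as the event itself.

```python
# Invented trigger-word dictionaries; real lexicons hold thousands of entries.
POSITIVE = {"good", "great", "safe"}
NEGATIVE = {"attack", "bomb", "riot"}

def sentiment_score(text):
    """Count dictionary hits: positive terms minus negative terms."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# A journalist's report trips the same trigger words as the event itself:
report = "reporting on the bomb attack downtown"
```

Nothing in the score distinguishes the journalist from the incident; that disambiguation is the "close reading of the source" the passage calls for.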


The Art of SEO by Eric Enge, Stephan Spencer, Jessie Stricchiola, Rand Fishkin

AltaVista, barriers to entry, bounce rate, Build a better mousetrap, business intelligence, cloud computing, dark matter, en.wikipedia.org, Firefox, Google Chrome, Google Earth, hypertext link, index card, information retrieval, Internet Archive, Law of Accelerating Returns, linked data, mass immigration, Metcalfe’s law, Network effects, optical character recognition, PageRank, performance metric, risk tolerance, search engine result page, self-driving car, sentiment analysis, social web, sorting algorithm, speech recognition, Steven Levy, text mining, web application, wikimedia commons

Sort through the most common remnants first, and comb as far down as you feel is valuable. Through this process, you are basically text-mining documents relevant to the subject of your industry/service/product for terms that, although lower in search volume, have a reasonable degree of relation. When using this process, it is imperative to have human eyes reviewing the extracted data to make sure it passes the “common sense” test. You may even find previously unidentified terms at the head of the keyword distribution graph. You can expand on this method in the following ways: Text-mine Technorati or Delicious for relevant results. Use documents purely from specific types of results—local, academic, etc.
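The remnant-mining step described above can be sketched as counting terms across relevant documents and filtering out terms already on the head list. The sample documents, known-term set, and function name are invented for illustration:

```python
import re
from collections import Counter

def candidate_terms(documents, known_terms, top_n=5):
    """Count words across documents and surface the most frequent terms
    not already in the known keyword list."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return [t for t, _ in counts.most_common() if t not in known_terms][:top_n]

docs = ["seo audit checklist", "technical seo audit guide", "seo checklist"]
terms = candidate_terms(docs, known_terms={"seo"})
```

As the passage stresses, the output still needs human review against the "common sense" test before any term makes it into a keyword plan.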


pages: 502 words: 107,510

Natural Language Annotation for Machine Learning by James Pustejovsky, Amber Stubbs

Amazon Mechanical Turk, bioinformatics, cloud computing, computer vision, crowdsourcing, easy for humans, difficult for computers, finite state, game design, information retrieval, iterative process, natural language processing, pattern recognition, performance metric, sentiment analysis, social web, speech recognition, statistical model, text mining

In Proceedings of the 5th International Workshop on Semantic Evaluation. Stubbs, Amber. A Methodology for Using Professional Knowledge in Corpus Annotation. Doctoral dissertation. Brandeis University, August 2012; to be published February 2013. Stubbs, Amber, and Benjamin Harshfield. 2010. “Applying the TARSQI Toolkit to augment text mining of EHRs.” BioNLP ’10 poster session: In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. Tenny, Carol. 2000. “Core events and adverbial modification.” In J. Pustejovsky and C. Tenny (Eds.), Events as Grammatical Objects. Stanford, CA: Center for the Study of Language and Information, pp. 285–334.


pages: 424 words: 114,905

Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again by Eric Topol

23andMe, Affordable Care Act / Obamacare, AI winter, Alan Turing: On Computable Numbers, with an Application to the Entscheidungsproblem, algorithmic bias, artificial general intelligence, augmented reality, autonomous vehicles, backpropagation, bioinformatics, blockchain, cloud computing, cognitive bias, Colonization of Mars, computer age, computer vision, conceptual framework, creative destruction, crowdsourcing, Daniel Kahneman / Amos Tversky, dark matter, David Brooks, digital twin, Elon Musk, en.wikipedia.org, epigenetics, Erik Brynjolfsson, fault tolerance, George Santayana, Google Glasses, ImageNet competition, Jeff Bezos, job automation, job satisfaction, Joi Ito, Mark Zuckerberg, medical residency, meta-analysis, microbiome, natural language processing, new economy, Nicholas Carr, nudge unit, pattern recognition, performance metric, personalized medicine, phenotype, placebo effect, randomized controlled trial, recommendation engine, Rubik’s Cube, Sam Altman, self-driving car, Silicon Valley, speech recognition, Stephen Hawking, text mining, the scientific method, Tim Cook: Apple, War on Poverty, Watson beat the top human players on Jeopardy!, working-age population

Hutson, “Machine-Learning Algorithms Can Predict Suicide Risk.” 40. Hutson, “Machine-Learning Algorithms Can Predict Suicide Risk”; Horwitz, B., “Identifying Suicidal Young Adults.” Nature Human Behavior, 2017. 1: pp. 860–861. 41. Cheng, Q., et al., “Assessing Suicide Risk and Emotional Distress in Chinese Social Media: A Text Mining and Machine Learning Study.” J Med Internet Res, 2017. 19(7): p. e243. 42. McConnon, “AI Helps Identify Those at Risk for Suicide.” 43. “Crisis Trends,” July 19, 2018. https://crisistrends.org/#visualizations. 44. Resnick, B., “How Data Scientists Are Using AI for Suicide Prevention,” Vox. 2018. 45.


Text Analytics With Python: A Practical Real-World Approach to Gaining Actionable Insights From Your Data by Dipanjan Sarkar

bioinformatics, business intelligence, computer vision, continuous integration, en.wikipedia.org, functional programming, general-purpose programming language, Guido van Rossum, information retrieval, Internet of things, invention of the printing press, iterative process, natural language processing, out of africa, performance metric, premature optimization, recommendation engine, self-driving car, semantic web, sentiment analysis, speech recognition, statistical model, text mining, Turing test, web application

Hence, standard statistical methods aren’t helpful when applied out of the box on unstructured text data. This section covers some of the main concepts in text analytics and also discusses the definition and scope of text analytics, which will give you a broad idea of what you can expect in the upcoming chapters. Text analytics, also known as text mining, is the methodology and process followed to derive high-quality, actionable information and insights from textual data. This involves using NLP, information retrieval, and machine learning techniques to parse unstructured text data into more structured forms and to derive from that data patterns and insights that are useful to the end user.
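The "unstructured to structured" step described above can be illustrated with a toy example: raw text goes in, and a record with sentence counts, tokens, and term frequencies comes out, which standard statistical methods can then operate on. This is a minimal sketch, not the book's own pipeline; the field names are invented for illustration:

```python
import re
from collections import Counter

def structure_text(raw_text):
    """Turn an unstructured string into a small structured record:
    sentence count, token count, and a term-frequency table."""
    # Naive sentence split on terminal punctuation; real pipelines use NLP tokenizers.
    sentences = [s.strip() for s in re.split(r"[.!?]+", raw_text) if s.strip()]
    tokens = re.findall(r"\w+", raw_text.lower())
    return {
        "n_sentences": len(sentences),
        "n_tokens": len(tokens),
        "term_freq": Counter(tokens),
    }

record = structure_text("Text mining derives insight from text. Text is unstructured.")
print(record["n_sentences"], record["n_tokens"])   # prints: 2 9
print(record["term_freq"].most_common(1))          # prints: [('text', 3)]
```

Once text is in this tabular form, frequency counts and the like become ordinary structured data that downstream models can consume.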


pages: 519 words: 142,646

Track Changes by Matthew G. Kirschenbaum

active measures, Apple II, Apple's 1984 Super Bowl advert, Bill Gates: Altair 8800, Buckminster Fuller, commoditize, computer age, corporate governance, David Brooks, dematerialisation, Donald Knuth, Douglas Hofstadter, Dynabook, East Village, en.wikipedia.org, feminist movement, forensic accounting, future of work, Google Earth, Gödel, Escher, Bach, Haight Ashbury, HyperCard, Jason Scott: textfiles.com, Joan Didion, John Markoff, John von Neumann, Kickstarter, low earth orbit, mail merge, Marshall McLuhan, Mother of all demos, New Journalism, Norman Mailer, pattern recognition, pink-collar, popular electronics, RAND corporation, rolodex, Ronald Reagan, self-driving car, Shoshana Zuboff, Silicon Valley, social web, Stephen Hawking, Steve Jobs, Steve Wozniak, Steven Levy, Stewart Brand, technoutopianism, Ted Nelson, text mining, thinkpad, Turing complete, Vannevar Bush, Whole Earth Catalog, Y2K, Year of Magical Thinking

Voice recognition, which allows us (like Henry James) to write simply by speaking, is an everyday option. “Text” itself has become a verb. Track Changes is not a stylistic study. It does not seek to tease out subtleties of how individual authors’ writing styles may have been altered following their adoption of word processing. I do not engage in the kind of computational text analysis or text mining we nowadays associate with the digital humanities. Neither do I seek to appraise whether word processing has been “good” or “bad” for literature as such. My agenda is both more ambitious and more modest: I accept that computers and word processing are aspects of the literary, and thus I try to help reconstruct the way in which this came to be and its significance for how we think about the material act of writing.


pages: 552 words: 168,518

MacroWikinomics: Rebooting Business and the World by Don Tapscott, Anthony D. Williams

accounting loophole / creative accounting, airport security, Andrew Keen, augmented reality, Ayatollah Khomeini, barriers to entry, Ben Horowitz, bioinformatics, Bretton Woods, business climate, business process, buy and hold, car-free, carbon footprint, Charles Lindbergh, citizen journalism, Clayton Christensen, clean water, Climategate, Climatic Research Unit, cloud computing, collaborative editing, collapse of Lehman Brothers, collateralized debt obligation, colonial rule, commoditize, corporate governance, corporate social responsibility, creative destruction, crowdsourcing, death of newspapers, demographic transition, disruptive innovation, distributed generation, don't be evil, en.wikipedia.org, energy security, energy transition, Exxon Valdez, failed state, fault tolerance, financial innovation, Galaxy Zoo, game design, global village, Google Earth, Hans Rosling, hive mind, Home mortgage interest deduction, information asymmetry, interchangeable parts, Internet of things, invention of movable type, Isaac Newton, James Watt: steam engine, Jaron Lanier, jimmy wales, Joseph Schumpeter, Julian Assange, Kevin Kelly, Kickstarter, knowledge economy, knowledge worker, Marc Andreessen, Marshall McLuhan, mass immigration, medical bankruptcy, megacity, mortgage tax deduction, Netflix Prize, new economy, Nicholas Carr, oil shock, old-boy network, online collectivism, open borders, open economy, pattern recognition, peer-to-peer lending, personalized medicine, Ray Kurzweil, RFID, ride hailing / ride sharing, Ronald Reagan, Rubik’s Cube, scientific mainstream, shareholder value, Silicon Valley, Skype, smart grid, smart meter, social graph, social web, software patent, Steve Jobs, text mining, the scientific method, The Wisdom of Crowds, transaction costs, transfer pricing, University of East Anglia, urban sprawl, value at risk, WikiLeaks, X Prize, Yochai Benkler, young professional, Zipcar

In other words, journals like PLoS ONE could become platforms for innovation in the same way the iPhone is a platform for 100,000 third-party apps. In some cases, the ultimate applications for research may not even be known until some time in the future. “You never know what could turn out to be valuable down the line,” says Binfield. “In years to come when there’s better data discovery tools and better text mining, somebody or some machine somewhere will pull out the one data point or one insight in a paper that may have been disregarded when it was first published.” But the key point for Binfield is that open-access content is inherently more valuable and more useful than subscription content because it reaches a broader audience—the same audience that traditional science publishers assume is irrelevant.


Engineering Security by Peter Gutmann

active measures, algorithmic trading, Amazon Web Services, Asperger Syndrome, bank run, barriers to entry, bitcoin, Brian Krebs, business process, call centre, card file, cloud computing, cognitive bias, cognitive dissonance, combinatorial explosion, Credit Default Swap, crowdsourcing, cryptocurrency, Daniel Kahneman / Amos Tversky, Debian, domain-specific language, Donald Davies, Donald Knuth, double helix, en.wikipedia.org, endowment effect, fault tolerance, Firefox, fundamental attribution error, George Akerlof, glass ceiling, GnuPG, Google Chrome, iterative process, Jacob Appelbaum, Jane Jacobs, Jeff Bezos, John Conway, John Markoff, John von Neumann, Kickstarter, lake wobegon effect, Laplace demon, linear programming, litecoin, load shedding, MITM: man-in-the-middle, Network effects, Parkinson's law, pattern recognition, peer-to-peer, Pierre-Simon Laplace, place-making, post-materialism, QR code, race to the bottom, random walk, recommendation engine, RFID, risk tolerance, Robert Metcalfe, Ruby on Rails, Sapir-Whorf hypothesis, Satoshi Nakamoto, security theater, semantic web, Skype, slashdot, smart meter, social intelligence, speech recognition, statistical model, Steve Jobs, Steven Pinker, Stuxnet, sunk-cost fallacy, telemarketer, text mining, the built environment, The Death and Life of Great American Cities, The Market for Lemons, the payments system, Therac-25, too big to fail, Tragedy of the Commons, Turing complete, Turing machine, Turing test, web application, web of trust, x509 certificate, Y2K, zero day, Zimmermann PGP

What these checks are doing is making the expert user’s ability to check a site’s validity available to all users. Risk Diversification through Content Analysis As is already being done for spam filtering, there are statistical classification techniques that you can apply to web sites to try to detect potential phishing sites. One straightforward approach uses text-mining feature extraction on the suspect web page to create a list of keywords (a so-called lexical signature) to submit to Google as a search query. If the domain hosting the site matches the first few Google search results, then it’s regarded as probably OK. If it doesn’t match—for example, a page with the characteristics of the eBay login page hosted in Ukraine (or at least hosted somewhere that doesn’t match an immediate Google search result)—then it’s regarded as suspicious.