text mining

17 results

Exploring Everyday Things with R and Ruby by Sau Sheong Chang

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Alfred Russel Wallace, bioinformatics, business process, butterfly effect, cloud computing, Craig Reynolds: boids flock, Debian, Edward Lorenz: Chaos theory, Gini coefficient, income inequality, invisible hand, p-value, price stability, Skype, statistical model, stem cell, Stephen Hawking, text mining, The Wealth of Nations by Adam Smith, We are the 99%, web application, wikimedia commons

[Back-of-book index excerpt (entries R–X), flattened during extraction; entries related to this topic include: term-document matrix, Text Mining; text document, Text Mining; text mining, Text Mining; tm library, Text Mining.]
About the Author
Sau Sheong Chang has been in software development, mostly web applications and recently cloud- and data-related systems, for almost 17 years and is still a keen and enthusiastic programmer.

The code we have written might be simple, but the insights could be significant. I have ventured a bit into the territory of text mining, but overall we’ve barely scraped the surface of what could be done. The tm library, for example, is extremely powerful for text mining, and various other text mining packages have been built on it as well. A few things you should take note of (especially for text mining) before you wend your way to mining your mailbox: The Enron dataset was cleaned up before it was published, so it was a lot easier to mine. Your own mailbox, on the other hand, could be wild and unruly, so your mileage will definitely vary. The Enron dataset comprises office email accounts derived from Exchange and Outlook files. For the text mining section, you will definitely want to tweak the write_row method to give you better results.

[Back-of-book index excerpt (symbols and entries A–L), flattened during extraction; entries related to this topic include: corpus, Text Mining; email example (content of messages, analyzing), Text Mining.]


pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack by Matthew A. Russell

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Climategate, cloud computing, crowdsourcing, en.wikipedia.org, fault tolerance, Firefox, full text search, Georg Cantor, Google Earth, information retrieval, Mark Zuckerberg, natural language processing, NP-complete, profit motive, Saturday Night Live, semantic web, Silicon Valley, slashdot, social graph, social web, statistical model, Steve Jobs, supply-chain management, text mining, traveling salesman, Turing test, web application

There’s no real reason to introduce Buzz earlier in the book than blogs (the topic of Chapter 8), other than the fact that it fills an interesting niche somewhere between Twitter and blogs, so this ordering facilitates telling a story from cover to cover. All in all, the text-mining techniques you’ll learn in any chapter of this book could just as easily be applied to any other chapter. Wherever possible we won’t reinvent the wheel and implement analysis tools from scratch, but we will take a couple of “deep dives” when particularly foundational topics come up that are essential to an understanding of text mining. The Natural Language Toolkit (NLTK), a powerful technology that you may recall from some opening examples in Chapter 1, provides many of the tools. Its rich suites of APIs can be a bit overwhelming at first, but don’t worry: text analytics is an incredibly diverse and complex field of study, but there are lots of powerful fundamentals that can take you a long way without too significant an investment.
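To make that concrete, here is a minimal sketch (not from the book) of the kind of task NLTK handles in a few lines: tokenize a snippet of text and count word frequencies. It assumes NLTK is installed and the punkt tokenizer models have been downloaded via nltk.download('punkt').

import nltk

text = ("All in all, the text-mining techniques you'll learn in any chapter "
        "of this book could just as easily be applied to any other chapter.")

tokens = nltk.word_tokenize(text.lower())   # split the text into word tokens
words = [t for t in tokens if t.isalpha()]  # drop punctuation-only tokens
freq = nltk.FreqDist(words)                 # count how often each word appears

print(freq.most_common(5))                  # e.g. [('all', 2), ('in', 2), ('any', 2), ...]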

[Back-of-book index excerpt (entries I–J), flattened during extraction; entries related to this topic include: IR (information retrieval) theory, Text Mining Fundamentals; introduction to TF-IDF; querying Buzz data with TF-IDF; vector space models and cosine similarity; clustering posts with cosine similarity.]

The Graph Your Inbox Chrome Extension provides a concise summary of your Gmail activity.
“Tapping into Your Gmail” provides an overview of how to use Python’s imaplib module to tap into your Gmail account (or any other mail account that speaks IMAP) and mine the textual information in messages. Be sure to check it out when you’re interested in moving beyond mail header information and ready to dig into text mining.
Closing Remarks
We’ve covered a lot of ground in this chapter, but we’ve just barely begun to scratch the surface of what’s possible with mail data. Our focus has been on mboxes, a simple and convenient file format that lends itself to high portability and easy analysis by many Python tools and packages. There’s an incredible amount of open source technology available for mining mboxes, and Python is a terrific language for slicing and dicing them.
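As a taste of that kind of slicing and dicing, here is a minimal sketch (not the book's code) that uses Python's standard mailbox module to count messages per sender in an mbox file; the filename is a placeholder.

import mailbox
from collections import Counter

mbox = mailbox.mbox("inbox.mbox")   # hypothetical mbox file on disk

senders = Counter()
for message in mbox:
    sender = message["From"]        # header lookup; returns None if the header is absent
    if sender:
        senders[sender] += 1

for sender, count in senders.most_common(10):
    print(count, sender)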


pages: 163 words: 42,402

Machine Learning for Email by Drew Conway, John Myles White

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

call centre, correlation does not imply causation, Debian, natural language processing, Netflix Prize, pattern recognition, recommendation engine, SpamAssassin, text mining

Though uncommon, occasionally you may want to install a package that is not yet available on CRAN—for example, if you’re updating to an experimental version of a package. In these cases you will need to install from source:
install.packages("tm", dependencies=TRUE)
setwd("~/Downloads/")
install.packages("RCurl_1.5-0.tar.gz", repos=NULL, type="source")
In the first example above, we install the tm package from CRAN. The tm package provides functions for text mining, and we will use it in Chapter 3 to perform classification on email text. One useful parameter in the install.packages function is dependencies, which by default does not pull in suggested packages but, if set to TRUE, instructs the function to download and install any secondary packages used by the primary installation. As a best practice, we recommend always setting this to TRUE, especially if you are working with a clean installation of R.

R packages used in this book
Name | Location | Author | Description & Use
ggplot2 | http://had.co.nz/ggplot2/ | Hadley Wickham | An implementation of the grammar of graphics in R. The premier package for creating high-quality graphics.
plyr | http://had.co.nz/plyr/ | Hadley Wickham | A set of tools used to manipulate, aggregate, and manage data in R.
tm | http://www.spatstat.org/spatstat/ | Ingo Feinerer | A collection of functions for performing text mining in R. Used to work with unstructured text data.
R Basics for Machine Learning
UFO Sightings in the United States, from 1990-2010
As we stated at the outset, we believe that the best way to learn a new technical skill is to start with a problem you wish to solve or a question you wish to answer. Being excited about the higher-level vision of your work makes learning from case studies work.

We need to transform our raw text data into a set of features that describe qualitative concepts in a quantitative way. In our case, that will be a 0/1 coding strategy: spam or ham. For example, we may want to determine the following: “does the presence of HTML tags make an email more likely to be spam?” To answer this, we will need a strategy for turning the text in our email into numbers. Fortunately, the general-purpose text mining packages available in R will do much of this work for us. For that reason, much of this chapter will focus on building up your intuition for the types of features that people have used in the past when working with text data. Feature generation is a major topic in current machine learning research and is still very far from being automated in a general-purpose way. At present, it’s best to think of the features being used as part of a vocabulary of machine learning that you become more familiar with as you perform more machine learning tasks.
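The book builds these features in R with the tm package, but the underlying idea is language-neutral. As a rough sketch (hypothetical Python, not the book's code), a binary HTML-tag feature might look like this:

import re

HTML_TAG = re.compile(r"<[^>]+>")   # crude pattern; real HTML detection is messier

def has_html(email_body):
    """Return 1 if the message body appears to contain HTML tags, else 0."""
    return 1 if HTML_TAG.search(email_body) else 0

print(has_html("Dear friend, <b>click here</b> now!"))  # 1
print(has_html("Lunch at noon?"))                       # 0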


pages: 502 words: 107,657

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Albert Einstein, algorithmic trading, Amazon Mechanical Turk, Apple's 1984 Super Bowl advert, backtesting, Black Swan, book scanning, bounce rate, business intelligence, business process, call centre, computer age, conceptual framework, correlation does not imply causation, crowdsourcing, dark matter, data is the new oil, en.wikipedia.org, Erik Brynjolfsson, experimental subject, Google Glasses, happiness index / gross national happiness, job satisfaction, Johann Wolfgang von Goethe, Machine translation of "The spirit is willing, but the flesh is weak." to Russian and back, Moneyball by Michael Lewis explains big data, Nate Silver, natural language processing, Netflix Prize, Network effects, Norbert Wiener, personalized medicine, placebo effect, prediction markets, Ray Kurzweil, recommendation engine, risk-adjusted returns, Ronald Coase, Search for Extraterrestrial Intelligence, self-driving car, sentiment analysis, software as a service, speech recognition, statistical model, Steven Levy, text mining, the scientific method, The Signal and the Noise by Nate Silver, The Wisdom of Crowds, Turing test, Watson beat the top human players on Jeopardy!, X Prize, Yogi Berra

For final results, see http://stat.duke.edu/datafest/results. G. Miner, D. Delen, J. Elder, A. Fast, T. Hill, and B. Nisbet, Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications (Academic Press, 2012), Part II, Tutorial K, p. 417, by Richard Foley of SAS. Tap directly into Kiva’s loan database with the Kiva API: http://build.kiva.org. U.S. Social Security Administration: Thanks to John Elder, PhD, Elder Research, Inc. (www.datamininglab.com), for this case study. John Elder, PhD, “Text Mining to Fast-Track Deserving Disability Applicants,” Elder Research, Inc., August 7, 2010. http://videolectures.net/site/normal_dl/tag=73772/kdd2010_elder_tmft_01.pdf. John Elder, PhD, “Text Mining: Lessons Learned,” Text Analytics World San Francisco Conference, March 7, 2012, San Francisco, CA. www.textanalyticsworld.com/sanfrancisco/2012/agenda/full-agenda#day1520–605.

questions right: Stephen Baker, Final Jeopardy: Man vs. Machine and the Quest to Know Everything (Houghton Mifflin Harcourt, 2011), 212–224. Quote about Google’s book scanning project: George Dyson, Turing’s Cathedral: The Origins of the Digital Universe (Pantheon Books, 2012). Natural language processing: Dursun Delen, Andrew Fast, Thomas Hill, Robert Nisbet, John Elder, and Gary Miner, Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications (Academic Press, 2012). James Allen, Natural Language Understanding, 2nd ed. (Addison-Wesley, 1994). Regarding the translation of “The spirit is willing, but the flesh is weak”: John Hutchins, “The Whisky Was Invisible or Persistent Myths of MT,” MT News International 11 (June 1995), 17–18. www.hutchinsweb.me.uk/MTNI-11–1995.pdf.

Stehly, “Multivariate Statistical Model for Predicting Occurrence and Location of Broken Rails,” Transportation Research Board of the National Academies, January 26, 2007. http://trb.metapress.com/content/v2j6022171r41478/. See also: http://ict.uiuc.edu/railroad/cee/pdf/Dick_et_al_2003.pdf. TTX: Thanks to Mahesh Kumar at Tiger Analytics for this case study, “Predicting Wheel Failure Rate for Railcars.” Fortune 500 global technology company: Thanks to Dean Abbott, Abbott Analytics (http://abbottanalytics.com/index.php) for information about this case study. “Inductive Business-Rule Discovery in Text Mining.” Text Analytics World San Francisco Conference, March 7, 2012, San Francisco, CA. www.textanalyticsworld.com/sanfrancisco/2012/agenda/full-agenda#day11040–11–2. Leading payments processor: Thanks to Robert Grossman, Open Data Group (http://opendatagroup.com), for this case study. “Scaling Health and Status Models to Large, Complex Systems,” Predictive Analytics World San Francisco Conference, February 16, 2010, San Francisco, CA. www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day1–17.

Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

bioinformatics, business intelligence, business process, Claude Shannon: information theory, cloud computing, computer vision, correlation coefficient, cyber-physical system, database schema, discrete time, distributed generation, finite state, information retrieval, iterative process, knowledge worker, linked data, natural language processing, Netflix Prize, Occam's razor, pattern recognition, performance metric, phenotype, random walk, recommendation engine, RFID, semantic web, sentiment analysis, speech recognition, statistical model, stochastic process, supply-chain management, text mining, thinkpad, web application

This is followed by deriving patterns within the structured data, and evaluation and interpretation of the output. “High quality” in text mining usually refers to a combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity-relation modeling (i.e., learning relations between named entities). Other examples include multilingual data mining, multidimensional text analysis, contextual text mining, and trust and evolution analysis in text data, as well as text mining applications in security, biomedical literature analysis, online media analysis, and analytical customer relationship management. Various kinds of text mining and analysis software and tools are available in academic institutions, open-source forums, and industry.

Other topics in multimedia mining include classification and prediction analysis, mining associations, and video and audio data mining (Section 13.2.3).
Mining Text Data
Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. A substantial portion of information is stored as text such as news articles, technical papers, books, digital libraries, email messages, blogs, and web pages. Hence, research in text mining has been very active. An important goal is to derive high-quality information from text. This is typically done through the discovery of patterns and trends by means such as statistical pattern learning, topic modeling, and statistical language modeling. Text mining usually requires structuring the input text (e.g., parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database).
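As a small illustration of that structuring step (not from the book, and assuming scikit-learn as a dependency), the text below is turned into a term-document matrix, the kind of intermediate structure on which pattern discovery is then performed:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text mining derives high-quality information from text.",
    "Research in text mining has been very active.",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)      # documents x terms, sparse word counts

print(vectorizer.get_feature_names_out())    # the vocabulary (recent scikit-learn versions)
print(matrix.toarray())                      # term counts per document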

Text data analysis has been studied extensively in information retrieval, with many textbooks and survey articles such as Croft, Metzler, and Strohman [CMS09]; S. Buttcher, C. Clarke, G. Cormack [BCC10]; Manning, Raghavan, and Schutze [MRS08]; Grossman and Frieder [GR04]; Baeza-Yates and Ribeiro-Neto [BYRN11]; Zhai [Zha08]; Feldman and Sanger [FS06]; Berry [Ber03]; and Weiss, Indurkhya, Zhang, and Damerau [WIZD04]. Text mining is a fast-developing field with numerous papers published in recent years, covering many topics such as topic models (e.g., Blei and Lafferty [BL09]); sentiment analysis (e.g., Pang and Lee [PL07]); and contextual text mining (e.g., Mei and Zhai [MZ06]). Web mining is another focused theme, with books like Chakrabarti [Cha03a], Liu [Liu06] and Berry [Ber03]. Web mining has substantially improved search engines with a few influential milestone works, such as Brin and Page [BP98]; Kleinberg [Kle99]; Chakrabarti, Dom, Kumar, et al.


pages: 398 words: 86,855

Bad Data Handbook by Q. Ethan McCallum

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

Amazon Mechanical Turk, asset allocation, barriers to entry, Benoit Mandelbrot, business intelligence, cellular automata, chief data officer, cloud computing, cognitive dissonance, combinatorial explosion, conceptual framework, database schema, en.wikipedia.org, Firefox, Flash crash, Gini coefficient, illegal immigration, iterative process, labor-force participation, loose coupling, natural language processing, Netflix Prize, quantitative trading / quantitative finance, recommendation engine, sentiment analysis, statistical model, supply-chain management, text mining, too big to fail, web application

… | Used with unicode.encode('ascii', 'xmlcharrefreplace') | Example 4-9, Example 4-10
HTMLParser.unescape | Decodes HTML-encoded string | Example 4-9, Example 4-10
csv.reader | Parses delimited text | Example 4-11
The functions listed in Table 4-5 are good low-level building blocks for creating text processing and text mining applications. There are a lot of excellent Open Source Python libraries for higher-level text analysis. A few of my favorites are listed in Table 4-6.
Table 4-6. Third-party Python reference
Library | Notes
NLTK | Parsers, tokenizers, stemmers, classifiers
BeautifulSoup | HTML & XML parsers, tolerant of bad inputs
gensim | Topic modeling
jellyfish | Approximate and phonetic string matching
These tools provide a great starting point for many text processing, text mining, and text analysis applications.
Exercises
The results shown in Example 4-3 were generated when the n parameter to make_alnum_sample was set to 512.
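As a small demonstration of how these building blocks combine (a sketch using the Python 3 equivalents, where html.unescape replaces HTMLParser.unescape; the sample data is made up):

import csv
import html
import io

raw = "name,comment\nana,Fish &amp; chips &#233;clair\n"

rows = list(csv.reader(io.StringIO(raw)))[1:]                  # parse delimited text, skip header
for name, comment in rows:
    decoded = html.unescape(comment)                           # '&amp;' -> '&', '&#233;' -> 'é'
    ascii_safe = decoded.encode("ascii", "xmlcharrefreplace")  # non-ASCII back to &#...; references
    print(name, decoded, ascii_safe)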

Paul Murrell is a senior lecturer in the Department of Statistics at the University of Auckland, New Zealand. His research area is Statistical Computing and Graphics and he is a member of the core development team for the R project. He is the author of two books, R Graphics and Introduction to Data Technologies, and is a Fellow of the American Statistical Association. Josh Levy is a data scientist in Austin, Texas. He works on content recommendation and text mining systems. He earned his doctorate at the University of North Carolina where he researched statistical shape models for medical image segmentation. His favorite foosball shot is banked from the backfield. Adam Laiacano has a BS in Electrical Engineering from Northeastern University and spent several years designing signal detection systems for atomic clocks before joining a prominent NYC-based startup.

Problem: Application-Specific Characters Leaking into Plain Text
Some applications have characters or sequences of characters with application-specific meanings. One source of bad text data I have encountered is when these sequences leak into places where they don’t belong. This can arise anytime the data flows through a tool with a restricted vocabulary. One project where I had to clean up this type of bad data involved text mining on web content. The users of one system would submit content through a web form to a server where it was stored in a database and in a text index before being embedded in other HTML files for display to additional users. The analysis I performed looked at dumps from various tools that sat on top of the database and/or the final HTML files. That analysis would have been corrupted had I not detected and normalized these application-specific encodings: URL encoding, HTML encoding, and database escaping. In most cases, the user’s browser will URL encode the content before submitting it.
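A sketch of what that normalization can look like (not the book's code; the backslash-escape rule at the end is an assumption about the upstream database tool):

import html
from urllib.parse import unquote_plus

def normalize(text):
    text = unquote_plus(text)   # URL encoding: '+' -> ' ', '%3C' -> '<', '%27' -> "'"
    text = html.unescape(text)  # HTML encoding: '&lt;' -> '<', '&amp;' -> '&'
    text = text.replace("\\'", "'").replace('\\"', '"')  # assumed database escaping
    return text

print(normalize("It%27s+a+%3Cb%3Etest%3C%2Fb%3E &amp; more \\'quoted\\'"))
# It's a <b>test</b> & more 'quoted'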


pages: 23 words: 5,264

Designing Great Data Products by Jeremy Howard, Mike Loukides, Margit Zwemer

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

AltaVista, Filter Bubble, PageRank, pattern recognition, recommendation engine, self-driving car, sentiment analysis, Silicon Valley, text mining

In another area where objective-based data products have the power to change lives, the CMU extension in Silicon Valley has an active project for building data products to help first responders after natural or man-made disasters. Jeannie Stamberger of Carnegie Mellon University Silicon Valley explained to us many of the possible applications of predictive algorithms to disaster response, from text-mining and sentiment analysis of tweets to determine the extent of the damage, to swarms of autonomous robots for reconnaissance and rescue, to logistic optimization tools that help multiple jurisdictions coordinate their responses. These disaster applications are a particularly good example of why data products need simple, well-designed interfaces that produce concrete recommendations. In an emergency, a data product that just produces more data is of little use.


pages: 504 words: 89,238

Natural Language Processing with Python by Steven Bird, Ewan Klein, Edward Loper

Amazon: amazon.com, amazon.co.uk, amazon.de, amazon.fr

bioinformatics, business intelligence, conceptual framework, elephant in my pajamas, en.wikipedia.org, finite state, Firefox, information retrieval, Menlo Park, natural language processing, P = NP, search inside the book, speech recognition, statistical model, text mining, Turing test

By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society. This book provides a highly accessible introduction to the field of NLP. It can be used for individual study or as the textbook for a course on natural language processing or computational linguistics, or as a supplement to courses in artificial intelligence, text mining, or corpus linguistics. The book is intensely practical, containing hundreds of fully worked examples and graded exercises. The book is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK). NLTK includes extensive software, data, and documentation, all freely downloadable from http://www.nltk.org/. Distributions are provided for Windows, Macintosh, and Unix platforms.

The IOB format (or sometimes BIO format) was developed for NP chunking by (Ramshaw & Marcus, 1995), and was used for the shared NP bracketing task run by the Conference on Natural Language Learning (CoNLL) in 1999. The same format was adopted by CoNLL 2000 for annotating a section of Wall Street Journal text as part of a shared task on NP chunking. Section 13.5 of (Jurafsky & Martin, 2008) contains a discussion of chunking. Chapter 22 covers information extraction, including named entity recognition. For information about text mining in biology and medicine, see (Ananiadou & McNaught, 2006). For more information on the Getty and Alexandria gazetteers, see http://en.wikipedia.org/wiki/Getty_Thesaurus_of_Geographic_Names and http://www.alexandria.ucsb.edu/gazetteer/.

7.9 Exercises

1. ○ The IOB format categorizes tagged tokens as I, O, and B. Why are three tags necessary? What problem would be caused if we used I and O tags exclusively?
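
As a concrete illustration (not taken from the book), NLTK can convert a hand-built chunk tree into IOB tags; tree2conlltags lives in nltk.chunk.util:

    from nltk.tree import Tree
    from nltk.chunk.util import tree2conlltags

    # Chunked parse of "the little dog saw the cat": [NP the little dog] saw [NP the cat]
    tree = Tree("S", [
        Tree("NP", [("the", "DT"), ("little", "JJ"), ("dog", "NN")]),
        ("saw", "VBD"),
        Tree("NP", [("the", "DT"), ("cat", "NN")]),
    ])

    for word, pos, iob in tree2conlltags(tree):
        print(word, pos, iob)
    # the DT B-NP
    # little JJ I-NP
    # dog NN I-NP
    # saw VBD O
    # the DT B-NP
    # cat NN I-NP

Without the B tag, two chunks that directly abut each other could not be distinguished from one long chunk, which is the point of the exercise above.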

[Abney, 2008] Steven Abney. Semisupervised Learning for Computational Linguistics. Chapman and Hall, 2008. [Agirre and Edmonds, 2007] Eneko Agirre and Philip Edmonds. Word Sense Disambiguation: Algorithms and Applications. Springer, 2007. [Alpaydin, 2004] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004. [Ananiadou and McNaught, 2006] Sophia Ananiadou and John McNaught, editors. Text Mining for Biology and Biomedicine. Artech House, 2006. [Androutsopoulos et al., 1995] Ion Androutsopoulos, Graeme Ritchie, and Peter Thanisch. Natural language interfaces to databases—an introduction. Journal of Natural Language Engineering, 1:29–81, 1995. [Artstein and Poesio, 2008] Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, pages 555–596, 2008.


pages: 204 words: 58,565

Keeping Up With the Quants: Your Guide to Understanding and Using Analytics by Thomas H. Davenport, Jinho Kim

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Black-Scholes formula, business intelligence, business process, call centre, computer age, correlation coefficient, correlation does not imply causation, Credit Default Swap, en.wikipedia.org, feminist movement, Florence Nightingale: pie chart, forensic accounting, global supply chain, Hans Rosling, hypertext link, invention of the telescope, inventory management, Jeff Bezos, margin call, Moneyball by Michael Lewis explains big data, Netflix Prize, p-value, performance metric, publish or perish, quantitative hedge fund, random walk, Renaissance Technologies, Robert Shiller, self-driving car, sentiment analysis, six sigma, Skype, statistical model, supply-chain management, text mining, the scientific method

There are various types of analytics that serve different purposes for researchers:

Statistics: The science of collection, organization, analysis, interpretation, and presentation of data

Forecasting: The estimation of some variable of interest at some specified future point in time as a function of past data

Data mining: The automatic or semiautomatic extraction of previously unknown, interesting patterns in large quantities of data through the use of computational algorithmic and statistical techniques

Text mining: The process of deriving patterns and trends from text in a manner similar to data mining

Optimization: The use of mathematical techniques to find optimal solutions with regard to some criteria while satisfying constraints

Experimental design: The use of test and control groups, with random assignment of subjects or cases to each group, to elicit the cause and effect relationships in a particular outcome

Although the list presents a range of analytics approaches in common use, it is unavoidable that considerable overlaps exist in the use of techniques across the types.

A demanding course of study: Full-time (M–F/9–5) study on campus; an integrated curriculum shared with the other students; working in teams throughout the program; and, typically, working on projects when not in class

A broad and practical content focus: An integrated, multidisciplinary curriculum (drawing from multiple schools and departments at NC State) aimed at the acquisition of practical skills which can be applied to real-world problems, drawing on fields such as statistics, applied mathematics, computer science, operations research, finance and economics, and marketing science

Learning by doing: Use of a practicum rather than the traditional MS thesis (students work in teams of five individuals, then use real-world problems and data provided by an industry sponsor; highly structured, substantive work conducted over seven months culminates in a final report to the sponsor)

The NCSU MSA has a novel curriculum consisting of classes developed exclusively for the program. Topics include data mining, text mining, forecasting, optimization, databases, data visualization, data privacy and security, financial analytics, and customer analytics. Students come into the program with a variety of backgrounds, though some degree of quantitative orientation is desired. The average age of students is twenty-seven, and about 26 percent of students enrolled have a prior graduate degree. About half the students were previously employed full time.

Big Data at Work: Dispelling the Myths, Uncovering the Opportunities by Thomas H. Davenport

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Automated Insights, autonomous vehicles, bioinformatics, business intelligence, business process, call centre, chief data officer, cloud computing, data acquisition, Edward Snowden, Erik Brynjolfsson, intermodal, Internet of things, Jeff Bezos, knowledge worker, Mark Zuckerberg, move fast and break things, Narrative Science, natural language processing, Netflix Prize, New Journalism, recommendation engine, RFID, self-driving car, sentiment analysis, Silicon Valley, smart grid, smart meter, social graph, sorting algorithm, statistical model, Tesla Model S, text mining

Hadoop uses a processing framework called MapReduce not only to distribute data across the disks but also to apply complex computational instructions to that data. In keeping with the high-performance capabilities of the platform, MapReduce instructions are processed in parallel across various nodes on the big data platform, and then quickly assembled to provide a new data structure or answer set. An example of a big data application in Hadoop might be “Find the number of all the influential customers who like us on social media.” A text-mining application might crunch through social media transactions, searching for words such as fan, love, bought, or awesome and consolidating a list of key influencer customers with positive sentiment. Apache Pig and Hive are two open-source scripting languages that sit on top of Hadoop and provide a higher-level language for carrying out MapReduce functionality in application code.
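
As a rough, hypothetical sketch of that text-mining example (not code from the book), here is the map and reduce logic simulated locally in Python; the record layout and keyword list are invented, and in a real Hadoop job the two phases would run as distributed tasks, for example via Hadoop Streaming:

    from collections import defaultdict

    POSITIVE = {"fan", "love", "bought", "awesome"}

    def map_phase(record):
        # record = (customer_id, post_text); emit (customer_id, 1) for positive posts
        customer_id, text = record
        if POSITIVE & set(text.lower().split()):
            yield customer_id, 1

    def reduce_phase(pairs):
        # group by customer and sum, as Hadoop does after the shuffle/sort step
        totals = defaultdict(int)
        for customer_id, count in pairs:
            totals[customer_id] += count
        return dict(totals)

    posts = [
        ("alice", "love the new store, awesome service"),
        ("bob", "waiting in line again"),
        ("alice", "bought two more yesterday"),
    ]
    mapped = [pair for record in posts for pair in map_phase(record)]
    print(reduce_phase(mapped))  # {'alice': 2}

Pig or Hive would express the same logic declaratively rather than as hand-written map and reduce functions.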

“We are very ROI-driven, and we only invest in a technology if it solves a business problem for us,” he noted. Over time there will be increasing integration between Macys.com and the rest of Macy’s systems and data on customers, since Tomak and his colleagues believe that an omnichannel approach to customer relationships is the right direction for the future.

natural language processing or text-mining skills, video or image analytics, and visual analytics. Many of the data scientists are also able to code in scripting languages like Python, Pig, and Hive. In terms of backgrounds, some have PhDs in scientific fields; others are simply strong programmers with some analytical skills. Many of our interviewees questioned whether a data scientist could possess all the needed skills and were taking a team-based approach to assembling them.

Practical OCaml by Joshua B. Smith

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

cellular automata, Debian, domain-specific language, general-purpose programming language, Grace Hopper, hiring and firing, John Conway, Paul Graham, slashdot, text mining, Turing complete, type inference, web application, Y2K

You might need a parser when you implement a domain-specific language (DSL). DSLs, which are found in a variety of applications, are configuration languages, embedded scripting languages, and data description languages. SQL is probably the most popular DSL in wide use. These kinds of languages can be built using ocamlyacc. You also can use a parser to handle situations that are too complicated for regular expressions alone. Text mining and log file analysis are two areas in which having a lexer/parser combination can result in better code and easier maintenance.

A Small Discussion of Small Languages

DSLs are programming languages that are focused on one problem domain. That problem domain can be anything: text processing, image manipulation, configuration, page layout, and so on. This focus on a single domain is what separates DSLs from general-purpose programming languages such as OCaml.
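
The book builds these with ocamllex and ocamlyacc; purely as an illustration of the lexer/parser split (sketched here in Python rather than OCaml), a hypothetical toy DSL for log-file queries, with an invented one-rule grammar:

    import re

    # Toy grammar:  query := "count" STRING "in" NAME
    # Example:      count "connection reset" in app.log
    TOKEN = re.compile(r'\s*(?:(count|in)\b|"([^"]*)"|([\w.\-]+))')

    def tokenize(src):
        tokens, pos = [], 0
        src = src.strip()
        while pos < len(src):
            m = TOKEN.match(src, pos)
            if not m:
                raise SyntaxError(f"unexpected input at {pos}: {src[pos:]!r}")
            keyword, string, name = m.groups()
            if keyword:
                tokens.append(("KEYWORD", keyword))
            elif string is not None:
                tokens.append(("STRING", string))
            else:
                tokens.append(("NAME", name))
            pos = m.end()
        return tokens

    def parse(tokens):
        # Recursive-descent parser for the one-rule grammar above
        def expect(kind, value=None):
            token_kind, token_value = tokens.pop(0)
            if token_kind != kind or (value is not None and token_value != value):
                raise SyntaxError(f"expected {value or kind}, got {token_value!r}")
            return token_value
        expect("KEYWORD", "count")
        pattern = expect("STRING")
        expect("KEYWORD", "in")
        filename = expect("NAME")
        return {"action": "count", "pattern": pattern, "file": filename}

    print(parse(tokenize('count "connection reset" in app.log')))
    # {'action': 'count', 'pattern': 'connection reset', 'file': 'app.log'}

Once a grammar grows past a rule or two, a generated parser (ocamlyacc in the book's case) quickly pays for itself over regular expressions alone.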

See functional programming Haskell, 253 imperative, 249 non-LISPy, 254 PHP, 273 programming styles and, 261 Prolog, 263 reasoning and, 272 SPARK Ada, 271 web programming and, 273–291 Prolog programming language, 263 prototyping languages, 231 purity, OCaml and, 249–260 defined, 251 impurity and, 254–260 Pythagorean Theorem, 353 ■Q queues, 103 quotations, 420 quote function, 137 quoting, 137, 140 ■R race conditions, 315 random numbers, sample code and, 226, 236 range operator (…), 64 Ratio libraries, 30 reading directories, 119 readline support, 357 really_input function, 114 really_output function, 114 reasoning, about applications, 272 receive function, 320 record types, 53 records, 23–29 defining, 28 mutability and, 28 name clashes and, 29 -rectypes compiler flag, 406 recursion, 30, 44 recursive algorithms, 31 Find it faster at http://superindex.apress.com/ pattern matching, 31, 46–48 chars/strings and, 64 exceptions and, 123 Paul Graham algorithm, 170 paul_graham function, 172 Pcre library, 142 permission integers, 116 perror function, 137 persistence, 249 Pervasives module, 66 comparison functions and, 90, 247 primitives and, 113 PF_INET type, 121 -pflag, 408 PHP programming language, 273 physical equality, 39 poll function, 320 polymorphic functions, 68 polymorphic types, 25, 35, 67 polymorphic variant types, 69 polymorphism arrays/lists and, 66 classes and, 230, 234 exceptions and, 129 methods and, 229 POP3 client (example), 322–327 Portable Operating System Interface (POSIX), 73, 120 portable paths, 118 position type, 53, 80 positions, 197 POSIX (Portable Operating System Interface), 73, 120 pos_bol, 197 pos_cnum, 197 pos_fname, 197 pos_in function, 114 pos_lnum, 197 pos_out function, 114 pound sign (#), 91 -pp PREPROC compiler flag, 406 pr_dump.cmo module, 420 pr_o.cmo module, 420, 422 prefix functions, 34 preprocessors, Camlp4 and, 411–429 primitive types, 24, 61–71 camlidl tool for, 355 implementing your own, 350 -principal compiler flag, 406 printers, 80, 411, 413 Printexc function, 124 printf command, 17 printf format strings, 17 printf function, 64, 75 453 620Xidxfinal.qxd 454 9/22/06 4:19 PM Page 454 ■INDEX recursive functions, 44 recursive types, 25 recv function, 121 reducing (folding), 39 references, 21, 23 regular expressions, 142, 330 ocamllex and, 193, 199 reasons for using Printf/Scanf functions instead of, 77 strings and, 65 Remote Procedure Call (RPC) mechanism, 349 remove function hashtables and, 101 lists and, 95 replace function, hashtables and, 101 reports, 73–87 research and analysis applications, 3 resources for further reading, 433 built-in functions, 48 LGPL, 11 parametric polymorphism, 26 responsetype type, 339 resume (sample), 434–443 return types, 35 reversed lists, 92 revised syntax, 411–429 rev_append function, 93 RFC 2396 (URI syntax), 329 rlwrap, 404 Rot13 quoter, 141 rounding, floats and, 63 RPC (Remote Procedure Call) mechanism, 349 RRR#remove_printer directive, 80 rstrip function, 298 rule keyword, 198 run function, 183 runner function, 337, 342 run_server function, 188 ■S sample code average number of words in set of files, 263–266 blog server, 278–288 BMP files, 386, 397 command-line client, 190 configuration file parser, 415–419 configuration language for set/unset variables, 208 functors and modules, 163 log files and, 213– 223 network-aware scoring function, 179–191 ocamldoc documentation, 149, 151 POP3 client, 322–327 random numbers, 226, 236 resume, 434–443 securities trades database, 54–60 spam filter, 171–178 strings and, 243–248 syntax 
extension, 421–429 URI module, 137–140 web crawler, 329–348 sample data, for sample securities trades database, 53 scanf commands, 38 scanf function, 77 Scanf functions, 76, 142, 185 binary files and, 295 reasons for using vs. regular expressions, 77 Scanf-specific formatting codes, 76 Scanning module, 76 scan_buffer function, 185 scope, 23 scripts (fragments), 274 search_forward function, 331 secure-by-design programming, 276 securities trades sample database, 51–60 displaying/importing data and, 73–87 generating data and, 86 interacting with, 56 reports and, 73– 87 stock price information and, 59 Seek_in function, 114 Seek_out function, 114 select function, 122, 183, 317 blocking and, 320 double call to, 187 select-based servers, 307 self function, 316 self_init, 71 sell function, for securities trades database, 54 semantic actions, 193, 197, 203 semantics, 29, 40 semicolons (;;), 91 send function, 121, 320 sender function, 341 servers creating, 179 high-level functions and, 122 OCaml support for, 179–191 server_setup function, 185 Set functor 136, 333 set methods, arrays and, 97 sets, 106 Shared-Memory-Processor (SMP) systems, 309 shells, 404 Shootout, 266 620Xidxfinal.qxd 9/22/06 4:19 PM Page 455 ■INDEX String_val(v) function, 351 strip command, caution for, 409 strongly typed types, 62 strptime function, 359, 362 structured programming, 262 str_item level, 420 style.css file, ocamldoc HTML output and, 147 sub function, 98 subsections, 262 sum function, 39 SWIG (Simplified Wrapper Interface Generator), 358 symmetric multiprocessing (SMP), 309 sync function, 315, 320 synchronization, threads and, 309 syntax, 21–32 extending, 411, 416, 420–428 semantics and, 40 Sys.file_exists, 117 syslog, 365 system threads, 310 Sys_blocked_io exception, 127 Sys_error of string exception, 126 ■T t type, 136 tags, creating custom, 153 tail recursion, 44 templates, mod_caml library and, 289 temporary files, 118, 135 temp_file function, 119 Texinfo pages, ocamldoc output and, 148 text mining, parsers and, 203 theorem solvers, 29 -thread flag, 407 Thread module, 311 threaded_link_harvest class, 339 threads, 309–27 creating/using, 310–316 exiting/killing, 316 modules for, 316–322 sample POP3 client and, 322–327 THREADS variable, 189 threadsafe libraries, 310 Time library, 359–365 Tk graphical user interface, 13 tokens, 193–210 tools camlidl, 349, 355, 357 Coq proof assistant, 29 Findlib, 167, 311, 409 GraphViz, 347 ocamldep, 401–410 xxd, 376 top-down design, 262, 267 Find it faster at http://superindex.apress.com/ shortest keyword, 198 Shoutcast protocol, 293 Shoutcast server, 293–308 connecting to, 307 framework for, 300–305 implementing, 305 shutdown_connection function, 122 side effects, 251, 253–260 signal function, 318 signatures, 32 functions and, 33, 89 modules and, 159 Signoles, Julien, 360 Simplified Wrapper Interface Generator (SWIG), 358 Singleton design pattern, multiple module views and, 160 sitemaps, 329 SMP, 309 .so files, 408 socket functions, 120 sort function, 96 source files, processed by ocamllex, 197–201 spam filters, 169–178 spam server, 182–89 spam.cmo library, 174 SPARK Ada programming language, 271 split function, 96 sprintf function, 75 sscanf function, 77 stacks, 105 Stack_overflow exception, 45, 126 stat function, 117, 423 state, CGI and, 274 static linking of code, 356 stdout, 17 Stolpmann, Gerd, 135, 409 store function, for securities trades database, 56 Store_field(block,index,val) function, 352 Str library, 142 Str module, 330 Str.regexp function, 331 strcopy function, 351 
stream parsers, 432 streams, 413–419 strftime function, 359 string keywords, 415 StringMap module, 335 strings, 24, 64, 110, 377 allocating, 352 copying, 352 sample code and, 243–248 StringSet module, 335 string_length(v) function, 351 string_match function, 331 455 620Xidxfinal.qxd 456 9/22/06 4:19 PM Page 456 ■INDEX toplevel.


pages: 169 words: 56,250

Startup Communities: Building an Entrepreneurial Ecosystem in Your City by Brad Feld

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

barriers to entry, cleantech, cloud computing, corporate social responsibility, Grace Hopper, job satisfaction, Kickstarter, labour mobility, Lean Startup, minimum viable product, Network effects, Peter Thiel, place-making, pre–internet, Richard Florida, Silicon Valley, Silicon Valley startup, smart cities, software as a service, Steve Jobs, text mining, Y Combinator, Zipcar

I started reading everything that I could lay my hands on. I bumped into Fred Wilson’s blog (http://startuprev.com/l4) and Brad Feld’s blog (http://startuprev.com/o4 and http://startuprev.com/h1) and was amazed at the wealth of knowledge and wisdom that these two individuals were sharing freely on the Internet. I met this small team sitting in the old fishing factory in the Reykjavik Harbor, working on text mining. They were each younger than 25 years old and called their company CLARA. They wanted to build a software-as-a-service company that helped gaming companies understand their communities. I was startled. These kids were not worried about the ISK or the government or the global financial crisis or anything. They were building something and wanted to sell it to create value. I was impressed. I found out that they needed capital to get their minimum viable product onto the market.


pages: 284 words: 79,265

The Half-Life of Facts: Why Everything We Know Has an Expiration Date by Samuel Arbesman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Albert Einstein, Alfred Russel Wallace, Amazon Mechanical Turk, Andrew Wiles, bioinformatics, British Empire, Chelsea Manning, Clayton Christensen, cognitive bias, cognitive dissonance, conceptual framework, David Brooks, demographic transition, double entry bookkeeping, double helix, Galaxy Zoo, guest worker program, Gödel, Escher, Bach, Ignaz Semmelweis: hand washing, index fund, invention of movable type, Isaac Newton, John Harrison: Longitude, Kevin Kelly, life extension, meta analysis, meta-analysis, Milgram experiment, Nicholas Carr, p-value, Paul Erdős, Pluto: dwarf planet, randomized controlled trial, Richard Feynman, Rodney Brooks, social graph, social web, text mining, the scientific method, Thomas Kuhn: the structure of scientific revolutions, Thomas Malthus, Tyler Cowen: Great Stagnation

While scientific progress isn’t necessarily correlated with a single publication—some papers might have multiple discoveries, and others might simply be confirming something we already know—it is often a good unit of study. Focusing on the scientific paper gives us many pieces of data to measure and study. We can look at the title and text and, using sophisticated algorithms from computational linguistics or text mining, determine the subject area. We can look at the authors themselves and create a web illustrating the interactions between scientists who write papers together. We can examine the affiliations of each of the authors and try to see which collaborations between individuals at different institutions are more effective. And we can comb through the papers’ citations, in order to get a sense of the research a paper is building upon.
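
For the co-authorship “web,” a minimal sketch of the idea (the author lists are invented, and networkx is just one convenient graph library):

    from itertools import combinations
    import networkx as nx  # third-party graph library

    # Hypothetical author lists, one per paper
    papers = [
        ["Ada", "Grace"],
        ["Ada", "Alan"],
        ["Grace", "Alan", "Edsger"],
    ]

    G = nx.Graph()
    for authors in papers:
        for a, b in combinations(authors, 2):  # every pair of co-authors shares an edge
            G.add_edge(a, b)

    # Number of distinct collaborators per author
    print(sorted(G.degree, key=lambda pair: pair[1], reverse=True))
    # [('Grace', 3), ('Alan', 3), ('Ada', 2), ('Edsger', 2)]

The same construction, with directed edges, gives the citation web that shows which research a paper builds upon.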

Raw Data Is an Oxymoron by Lisa Gitelman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

collateralized debt obligation, computer age, continuous integration, crowdsourcing, Drosophila, Edmond Halley, Filter Bubble, Firefox, Google Earth, Howard Rheingold, index card, informal economy, Isaac Newton, Johann Wolfgang von Goethe, knowledge worker, Louis Daguerre, Menlo Park, optical character recognition, RFID, Richard Thaler, Silicon Valley, social graph, software studies, statistical model, Stephen Hawking, Steven Pinker, text mining, time value of money, trade route, Turing machine, urban renewal, Vannevar Bush

In 1837, Weld had published The Bible Against Slavery, initially in the Anti-Slavery Quarterly Magazine, and then as a ninety-eight-page pamphlet. In it, he interpreted slavery in the Hebrew Bible as a form of paid service that could be stepped out of essentially at will, thus refuting claims that the Bible sanctioned chattel slavery as it was practiced in the United States. His biblical interpretation drew on another form of text mining, familiar to ministers: the concordance, essentially a keyword search through the text, providing context, in use since the thirteenth century. American Slavery As It Is importantly shifted the focus to the present when it took as its text the newspapers, along with testimony derived from questionnaires. It represented data mined from an enormous number of papers. Forty-five years later, Weld recalled: After the work was finished, we were curious to know how many newspapers had been examined.
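
As an aside, the keyword-in-context idea behind a concordance is easy to sketch; the sample text and window size below are invented for illustration:

    def concordance(text, keyword, context=5):
        # Print each occurrence of keyword with `context` words on either side
        words = text.split()
        for i, w in enumerate(words):
            if w.lower().strip(".,;:") == keyword.lower():
                left = " ".join(words[max(0, i - context):i])
                right = " ".join(words[i + 1:i + 1 + context])
                print(f"{left} | {w} | {right}")

    concordance("The raw data arrived late. We cleaned the raw data twice before use.", "raw")
    # The | raw | data arrived late. We cleaned
    # arrived late. We cleaned the | raw | data twice before use.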


pages: 348 words: 39,850

Data Scientists at Work by Sebastian Gutierrez

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Albert Einstein, algorithmic trading, bioinformatics, bitcoin, business intelligence, chief data officer, clean water, cloud computing, computer vision, continuous integration, correlation does not imply causation, crowdsourcing, data is the new oil, DevOps, domain-specific language, follow your passion, full text search, informal economy, information retrieval, Infrastructure as a Service, inventory management, iterative process, linked data, Mark Zuckerberg, microbiome, Moneyball by Michael Lewis explains big data, move fast and break things, natural language processing, Network effects, nuclear winter, optical character recognition, pattern recognition, Paul Graham, personalized medicine, Peter Thiel, pre–internet, quantitative hedge fund, quantitative trading / quantitative finance, recommendation engine, Renaissance Technologies, Richard Feynman, self-driving car, side project, Silicon Valley, Skype, software as a service, speech recognition, statistical model, Steve Jobs, stochastic process, technology bubble, text mining, the scientific method, web application

I had to decide what was relevant and what was not. In many cases, I had just been CC’ed and I wasn’t supposed to take any action. In other cases, however, I was supposed to take action. To try to get my communications under control, I started looking at those data sets by integrating them, visualizing them, and analyzing them. Some of the analysis methods included text mining and network analytics. Then, through that process, a friend and I created a productivity tool which helped me to understand the important people in my current workflow and the important contexts from past conversations I’d had with them, as opposed to conversations in which I was just CC’ed. This made it so that when I talked with someone, there would already be something that helped me to understand what we had talked about before in relation to what we were talking about now, the people the person knew, and the projects they were involved in.

There are a lot of interesting people from a wide range of different viewpoints who tweet links and references that are good to be aware of. Some of them are journalists from places like The Guardian, and some of them are organizations like the Stanford NLP group. For example, Simon Rogers, who was with The Guardian and has now moved to Twitter, posts links to different visualizations. Stanford NLP posts links to recent text mining research. There are also creative people who use data in imaginative ways. For example, Pete Warden just released a software development toolkit that lets you leverage deep learning in smartphone apps. You can basically set it up to make it easy for you to create a mobile app that uses deep learning in the background. There’s a wonderful video on his blog of him making his app detect whether his cat Dude is in the picture or not.


pages: 502 words: 107,510

Natural Language Annotation for Machine Learning by James Pustejovsky, Amber Stubbs

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Amazon Mechanical Turk, bioinformatics, cloud computing, computer vision, crowdsourcing, easy for humans, difficult for computers, finite state, game design, information retrieval, iterative process, natural language processing, pattern recognition, performance metric, sentiment analysis, social web, speech recognition, statistical model, text mining

“HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions.” In Proceedings of the 5th International Workshop on Semantic Evaluation. Stubbs, Amber. A Methodology for Using Professional Knowledge in Corpus Annotation. Doctoral dissertation. Brandeis University, August 2012; to be published February 2013. Stubbs, Amber, and Benjamin Harshfield. 2010. “Applying the TARSQI Toolkit to augment text mining of EHRs.” BioNLP ’10 poster session: In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. Tenny, Carol. 2000. “Core events and adverbial modification.” In J. Pustejovsky and C. Tenny (Eds.), Events as Grammatical Objects. Stanford, CA: Center for the Study of Language and Information, pp. 285–334. Tsuruoka, Yoshimasa, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun’ichi Tsujii. 2005.


pages: 552 words: 168,518

MacroWikinomics: Rebooting Business and the World by Don Tapscott, Anthony D. Williams

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

accounting loophole / creative accounting, airport security, Andrew Keen, augmented reality, Ayatollah Khomeini, barriers to entry, bioinformatics, Bretton Woods, business climate, business process, car-free, carbon footprint, citizen journalism, Clayton Christensen, clean water, Climategate, Climatic Research Unit, cloud computing, collaborative editing, collapse of Lehman Brothers, collateralized debt obligation, colonial rule, corporate governance, corporate social responsibility, crowdsourcing, death of newspapers, demographic transition, distributed generation, don't be evil, en.wikipedia.org, energy security, energy transition, Exxon Valdez, failed state, fault tolerance, financial innovation, Galaxy Zoo, game design, global village, Google Earth, Hans Rosling, hive mind, Home mortgage interest deduction, interchangeable parts, Internet of things, invention of movable type, Isaac Newton, James Watt: steam engine, Jaron Lanier, jimmy wales, Joseph Schumpeter, Julian Assange, Kevin Kelly, knowledge economy, knowledge worker, Marshall McLuhan, medical bankruptcy, megacity, mortgage tax deduction, Netflix Prize, new economy, Nicholas Carr, oil shock, online collectivism, open borders, open economy, pattern recognition, peer-to-peer lending, personalized medicine, Ray Kurzweil, RFID, ride hailing / ride sharing, Ronald Reagan, scientific mainstream, shareholder value, Silicon Valley, Skype, smart grid, smart meter, social graph, social web, software patent, Steve Jobs, text mining, the scientific method, The Wisdom of Crowds, transaction costs, transfer pricing, University of East Anglia, urban sprawl, value at risk, WikiLeaks, X Prize, young professional, Zipcar

In other words, journals like PLoS ONE could become platforms for innovation in the same way the iPhone is a platform for 100,000 third-party apps. In some cases, the ultimate applications for research may not even be known until some time in the future. “You never know what could turn out to be valuable down the line,” says Binfield. “In years to come when there’s better data discovery tools and better text mining, somebody or some machine somewhere will pull out the one data point or one insight in a paper that may have been disregarded when it was first published.” But the key point for Binfield is that open-access content is inherently more valuable and more useful than subscription content because it reaches a broader audience—the same audience that traditional science publishers assume is irrelevant.