full text search



Learning Flask Framework by Matt Copperwaite, Charles Leifer

create, read, update, delete, database schema, Debian, DevOps, don't repeat yourself, full text search, place-making, Skype, web application

Create the file entries/tag_index.html and add the following code:

{% extends "base.html" %}

{% block title %}Tags{% endblock %}
{% block content_title %}Tags{% endblock %}

{% block content %}
  <ul>
    {% for tag in object_list.items %}
      <li><a href="{{ url_for('entries.tag_detail', slug=tag.slug) }}">{{ tag.name }}</a></li>
    {% endfor %}
  </ul>
{% endblock %}

If you like, you can add a link to the tag list in the base template's navigation.

Full-text search
In order to allow users to find posts containing certain words or phrases, we will add simple full-text search to the pages that list blog entries. To accomplish this, we will do some refactoring: we will add a search form to the sidebar of every page that lists entries. While we could copy and paste the same code into both entries/index.html and entries/tag_detail.html, we will instead create another base template that contains the search widget.
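To make the idea concrete, here is a minimal sketch of what a search-enabled listing view could look like in Flask. The Entry model, blueprint name, and pagination call are assumptions drawn from the surrounding excerpt, not the book's exact code:

from flask import Blueprint, request, render_template

entries = Blueprint('entries', __name__)  # hypothetical blueprint name

@entries.route('/entries/')
def index():
    query = request.args.get('q', '').strip()
    entry_query = Entry.query  # Entry: the SQLAlchemy model assumed from context
    if query:
        # Naive "full-text" matching: keep entries whose title or body
        # contains the search term.
        entry_query = entry_query.filter(
            Entry.title.contains(query) | Entry.body.contains(query))
    return render_template('entries/index.html',
                           object_list=entry_query.paginate(page=1, per_page=10))

Because both listing pages would extend the same base template, the search form itself only has to be written once.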

Table of Contents (excerpt; page numbers omitted)

Introducing SQLAlchemy; Installing SQLAlchemy; Using SQLAlchemy in our Flask app; Choosing a database engine; Connecting to the database; Creating the Entry model; Creating the Entry table; Working with the Entry model; Making changes to an existing entry; Deleting an entry; Retrieving blog entries; Filtering the list of entries; Special lookups; Combining expressions; Negation; Operator precedence; Building a tagging system; Adding and removing tags from entries; Using backrefs; Making changes to the schema; Adding Flask-Migrate to our project; Creating the initial migration; Adding a status column; Summary

Chapter 3: Templates and Views
Introducing Jinja2; Basic template operations; Loops, control structures, and template programming; Jinja2 built-in filters; Creating a base template for the blog; Creating a URL scheme; Defining the URL routes; Building the index view; Building the detail view; Listing entries matching a given tag; Listing all the tags; Full-text search; Adding pagination links; Enhancing the blog app; Summary

Chapter 4: Forms and Validation
Getting started with WTForms; Defining a form for the Entry model; A form with a view; The create.html template; Handling form submissions; Validating input and displaying error messages; Editing existing entries; The edit.html template; Deleting entries; Cleaning up; Using flash messages; Displaying flash messages in the template; Saving and modifying tags on posts; Image uploads; Processing file uploads; The image upload template; Serving static files; Summary

Chapter 5: Authenticating Users
Creating a user model; Installing Flask-Login; Implementing the Flask-Login interface; Creating user objects; Login and logout views; The login template; Logging out; Accessing the current user; Restricting access to views; Storing an entry's author; Setting the author on blog entries; Protecting the edit and delete views; Displaying a user's drafts; Sessions; Summary

Chapter 6: Building an Administrative Dashboard
Installing Flask-Admin; Adding Flask-Admin to our app; Exposing models through the Admin; Customizing the list views; Adding search and filtering to the list view; Customizing Admin model forms; Enhancing the User form; Generating slugs; Managing static assets via the Admin; Securing the admin website; Creating an authentication and authorization mixin; Setting up a custom index page; Flask-Admin templates; Reading more; Summary

Chapter 7: AJAX and RESTful APIs
Creating a comment model; Creating a schema migration; Installing Flask-Restless; Setting up Flask-Restless; Making API requests; Creating comments using AJAX; AJAX form submissions; Validating data in the API; Preprocessors and postprocessors; Loading comments using AJAX; Retrieving the list of comments; Reading more; Summary

Chapter 8: Testing Flask Apps
Unit testing; Python's unit test module; A simple math test; Flask and unit testing; Testing a page; Testing an API; Test-friendly configuration; Mocking objects; Logging and error reporting; Logging; Logging to file; Custom log messages; Levels; Error reporting; Read more; Summary

Chapter 9: Excellent Extensions
SeaSurf and CSRF protection of forms; Creating Atom feeds; Syntax highlighting using Pygments; Simple editing with Markdown; Caching with Flask-Cache and Redis; Creating secure, stable versions of your site by creating static content; Commenting on a static site; Synchronizing multiple editors; Asynchronous tasks with Celery; Creating command line instructions with Flask-script; References; Summary

Chapter 10: Deploying Your Application
Running Flask with a WSGI server; Apache's httpd; Serving static files; Nginx; Serving static files; Gunicorn; Securing your site with SSL; Getting your certificate; Apache httpd; Nginx; Gunicorn; Automating deployment using Ansible; Read more; Summary

Index

Preface
Welcome to Learning Flask, the book that will teach you the necessary skills to build web applications with Flask, a lightweight Python web framework.

In each chapter, you will learn a new skill through practical, hands-on coding projects. In the following table, I've listed a brief description of the core skills paired with the corresponding features of the blog:

Skill (library): Blog site feature(s)
- Relational databases with SQLAlchemy (Flask-SQLAlchemy): Store entries and tags in a relational database. Perform a wide variety of queries, including pagination, date-ranges, full-text search, inner and outer joins, and more.
- Form processing and validation (Flask-WTF): Create and edit blog entries using forms. In later chapters, we will also use forms for logging users into the site and allowing visitors to post comments.
- Template rendering with Jinja2 (Jinja2)
- User authentication and administrative dashboards (Flask-Login)
- Ajax and RESTful APIs (Flask-API)
- Unit testing (unittest)
- Everything else


Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (O’Reilly, 2017)

active measures, Amazon Web Services, bitcoin, blockchain, business intelligence, business process, c2.com, cloud computing, collaborative editing, commoditize, conceptual framework, cryptocurrency, database schema, DevOps, distributed ledger, Donald Knuth, Edward Snowden, ethereum blockchain, fault tolerance, finite state, Flash crash, full text search, general-purpose programming language, informal economy, information retrieval, Internet of things, iterative process, John von Neumann, loose coupling, Marc Andreessen, natural language processing, Network effects, packet switching, peer-to-peer, performance metric, place-making, premature optimization, recommendation engine, Richard Feynman, self-driving car, semantic web, Shoshana Zuboff, social graph, social web, software as a service, software is eating the world, sorting algorithm, source of truth, SPARQL, speech recognition, statistical model, web application, WebSocket, wikimedia commons

Although Google later moved away from using MapReduce for this purpose [43], it helps to understand MapReduce if you look at it through the lens of building a search index. (Even today, Hadoop MapReduce remains a good way of building indexes for Lucene/Solr [44].) We saw briefly in “Full-text search and fuzzy indexes” on page 88 how a full-text search index such as Lucene works: it is a file (the term dictionary) in which you can efficiently look up a particular keyword and find the list of all the document IDs containing that keyword (the postings list). This is a very simplified view of a search index—in reality it requires various additional data, in order to rank search results by relevance, correct misspellings, resolve synonyms, and so on—but the principle holds. If you need to perform a full-text search over a fixed set of documents, then a batch process is a very effective way of building the indexes: the mappers partition the set of documents as needed, each reducer builds the index for its partition, and the index files are written to the distributed filesystem.
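The pattern is easy to see in miniature. A toy sketch in Python (plain in-process code, not Hadoop): the map step emits (term, document ID) pairs, and the reduce step groups them into postings lists.

from collections import defaultdict

documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dogs are lazy",
}

# Map: emit a (term, doc_id) pair for every word occurrence.
pairs = [(term, doc_id)
         for doc_id, text in documents.items()
         for term in text.split()]

# Shuffle/reduce: group pairs by term; each term's set of document IDs
# is its postings list.
index = defaultdict(set)
for term, doc_id in pairs:
    index[term].add(doc_id)

print(sorted(index["quick"]))  # -> [1, 3], the documents containing "quick"

In the real batch job, each reducer would build the index files for its own partition of the documents and write them out to the distributed filesystem.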

Composing Data Storage Technologies
Over the course of this book we have discussed various features provided by databases and how they work, including:

- Secondary indexes, which allow you to efficiently search for records based on the value of a field (see “Other Indexing Structures” on page 85)
- Materialized views, which are a kind of precomputed cache of query results (see “Aggregation: Data Cubes and Materialized Views” on page 101)
- Replication logs, which keep copies of the data on other nodes up to date (see “Implementation of Replication Logs” on page 158)
- Full-text search indexes, which allow keyword search in text (see “Full-text search and fuzzy indexes” on page 88) and which are built into some relational databases [1]

In Chapters 10 and 11, similar themes emerged. We talked about building full-text search indexes (see “The Output of Batch Workflows” on page 411), about materialized view maintenance (see “Maintaining materialized views” on page 467), and about replicating changes from a database to derived data systems (see “Change Data Capture” on page 454). It seems that there are parallels between the features that are built into databases and the derived data systems that people are building with batch and stream processors.

With a one-dimensional index, you would have to either scan over all the records from 2013 (regardless of temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by timestamp and temperature simultaneously. This technique is used by HyperDex [36].

Full-text search and fuzzy indexes
All the indexes discussed so far assume that you have exact data and allow you to query for exact values of a key, or a range of values of a key with a sort order. What they don’t allow you to do is search for similar keys, such as misspelled words. Such fuzzy querying requires different techniques. For example, full-text search engines commonly allow a search for one word to be expanded to include synonyms of the word, to ignore grammatical variations of words, and to search for occurrences of words near each other in the same document, and support various other features that depend on linguistic analysis of the text.
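A crude way to get a feel for fuzzy matching (much simpler than what production search engines actually do): expand the query into every string one edit away and intersect the candidates with the term dictionary.

import string

def edits1(word):
    # Every string one deletion, substitution, or insertion away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    subs = [a + c + b[1:] for a, b in splits if b for c in string.ascii_lowercase]
    inserts = [a + c + b for a, b in splits for c in string.ascii_lowercase]
    return set(deletes + subs + inserts)

term_dictionary = {"weather", "whether", "wealth"}
print(term_dictionary & edits1("wether"))  # -> {'weather', 'whether'}

Real engines avoid generating candidates like this by walking the term dictionary with an automaton, but the end result is the same kind of "similar keys" lookup the passage describes.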


pages: 713 words: 93,944

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond, Jim R. Wilson


Amazon Web Services, create, read, update, delete, data is the new oil, database schema, Debian, domain-specific language, en.wikipedia.org, fault tolerance, full text search, general-purpose programming language, linked data, MVC pattern, natural language processing, node package manager, random walk, recommendation engine, Ruby on Rails, Skype, social graph, web application

At first, the dusting wasn’t even enough to cover this morning’s earliest tracks, but the power of the storm took over, replenishing the landscape and delivering the perfect skiing experience with the diversity and quality that we craved. Just this past year, I woke up to the realization that the database world, too, is covered with a fresh blanket of snow. Sure, the relational databases are there, and you can get a surprisingly rich experience with open source RDBMS software. You can do clustering, full-text search, and even fuzzy searching. But you’re no longer limited to that approach. I have not built a fully relational solution in a year. Over that time, I’ve used a document-based database and a couple of key-value datastores. The truth is that relational databases no longer have a monopoly on flexibility or even scalability. For the kinds of applications that we build, there are more appropriate models that are simpler, faster, and more reliable.

Sometimes, users can’t remember the full name of “J. Roberts.” In other cases, we just plain don’t know how to spell “Benn Aflek.” We’ll look into a few PostgreSQL packages that make text searching easy. It’s worth noting that as we progress, this kind of string matching blurs the lines between relational queries and searching frameworks like Lucene.[8] Although some may feel features like full-text search belong with the application code, there can be performance and administrative benefits of pushing these packages to the database, where the data lives.

SQL Standard String Matches
PostgreSQL has many ways of performing text matches, but the two big default methods are LIKE and regular expressions.

I Like LIKE and ILIKE
LIKE and ILIKE (case-insensitive LIKE) are the simplest forms of text search.
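For example, a hedged sketch with psycopg2 against the movies table used elsewhere in this excerpt (connection settings are placeholders; % matches any sequence of characters):

import psycopg2

conn = psycopg2.connect(dbname="booktown")  # placeholder connection settings
cur = conn.cursor()

# LIKE is case-sensitive: only titles actually starting with 'Star' match.
cur.execute("SELECT title FROM movies WHERE title LIKE 'Star%';")

# ILIKE ignores case, so 'STAR', 'star', and 'Star' prefixes all match.
cur.execute("SELECT title FROM movies WHERE title ILIKE 'star%';")
print(cur.fetchall())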

CREATE INDEX movies_title_trigram ON movies
USING gist (title gist_trgm_ops);

Now you can query with a few misspellings and still get decent results.

SELECT *
FROM movies
WHERE title % 'Avatre';

  title
---------
 Avatar

Trigrams are an excellent choice for accepting user input, without weighing them down with wildcard complexity.

Full-Text Fun
Next, we want to allow users to perform full-text searches based on matching words, even if they’re pluralized. If a user wants to search for certain words in a movie title but can remember only some of them, Postgres supports simple natural-language processing.

TSVector and TSQuery
Let’s look for a movie that contains the words night and day. This is a perfect job for text search using the @@ full-text query operator.

SELECT title
FROM movies
WHERE title @@ 'night & day';

             title
-------------------------------
 A Hard Day’s Night
 Six Days Seven Nights
 Long Day’s Journey Into Night

The query returns titles like A Hard Day’s Night, despite the word Day being in possessive form and the two words being out of order in the query.


pages: 205 words: 47,169

PostgreSQL: Up and Running by Regina Obe, Leo Hsu


cloud computing, database schema, Debian, en.wikipedia.org, full text search, web application

You can usually get by with just this one alone if you don’t want to experiment with additional types. If PostgreSQL automatically creates an index for you or you don’t bother picking the type, B-tree will be chosen. It is currently the only index type allowed for primary key and unique indexes.

GiST
Generalized Search Tree (GiST) is an index type optimized for full text search, spatial data, astronomical data, and hierarchical data. You can’t use it to enforce uniqueness; however, you can use it in exclusion constraints.

GIN
Generalized Inverted Index (GIN) is an index type commonly used for the built-in full text search of PostgreSQL and the trigram extensions. GIN is a descendant of GiST, but it’s not lossy. GIN indexes are generally faster to search than GiST, but slower to update. You can see an example at Waiting for Faster LIKE/ILIKE.

SP-GiST
Space-Partitioning Trees Generalized Search Tree (SP-GiST) is an index type introduced in PostgreSQL 9.2.
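As a hedged illustration of the GIN/full-text pairing (table and column names here are made up):

import psycopg2

conn = psycopg2.connect(dbname="mydb")  # placeholder connection settings
cur = conn.cursor()

# GIN is the usual choice for indexing tsvector expressions.
cur.execute("""
    CREATE INDEX articles_body_fts
        ON articles USING gin (to_tsvector('english', body));
""")

# A query written against the same expression can use the index.
cur.execute("""
    SELECT id
      FROM articles
     WHERE to_tsvector('english', body) @@ to_tsquery('english', 'search & engine');
""")
print(cur.fetchall())
conn.commit()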

Unlogged tables speed up queries against tables where logging is unnecessary.

Triggers on views: in prior versions, to make views updatable you used DO INSTEAD rules, which only supported SQL for programming logic. Triggers can be written in most procedural languages (except SQL), which opens the door for more complex abstraction using views.

KNN GiST adds improvements to popular extensions like full-text search, trigram (for fuzzy and case-insensitive search), and PostGIS.

Database Drivers
If you are using or plan to use PostgreSQL, chances are that you’re not going to use it in a vacuum. To have it interact with other applications, you’re going to need database drivers. PostgreSQL enjoys a generous number of freely available database drivers that can be used in many programming languages.

Some past extensions have gained enough traction to become part of the PostgreSQL core, so if you’re upgrading from an ancient version, you may not even have to worry about extensions.

Old Extensions Absorbed into PostgreSQL
Prior to PostgreSQL 8.3, the following extensions weren’t part of core:

PL/PgSQL wasn’t always installed by default in every database. In old versions, you had to run CREATE LANGUAGE plpgsql; in your database. From around 8.3 on, it’s installed by default, but you retain the option of uninstalling it.

tsearch is a suite for supporting full-text searches by adding indexes, operators, custom dictionaries, and functions. It became part of PostgreSQL core in 8.3. You don’t have the option to uninstall it. If you’re still relying on old behavior, you can install the tsearch2 extension, which retained old functions that are no longer available in the newer version. A better approach would be just to update where you’re using the functions, because compatibility with the old tsearch could end at any time.


pages: 481 words: 121,669

The Invisible Web: Uncovering Information Sources Search Engines Can't See by Gary Price, Chris Sherman, Danny Sullivan


AltaVista, American Society of Civil Engineers: Report Card, bioinformatics, Brewster Kahle, business intelligence, dark matter, Donald Davies, Douglas Engelbart, full text search, HyperCard, hypertext link, information retrieval, Internet Archive, joint-stock company, knowledge worker, natural language processing, pre–internet, profit motive, publish or perish, search engine result page, side project, Silicon Valley, speech recognition, stealth mode startup, Ted Nelson, Vannevar Bush, web application

Thoroughness and accuracy are absolutely critical to the patent searcher. Major business decisions involving significant expense or potential litigation often hinge on the details of a patent search, so using a general-purpose search engine for this type of search is effectively out of the question. Many government patent offices maintain Web sites, but Delphion’s Intellectual Property Network (http://www.delphion.com/) allows full-text searching of U.S. and European patents and abstracts of Japanese patents simultaneously. Additionally, the United States Patent Office (http://www.uspto.gov) provides patent information dating back to 1790, as well as U.S. Trademark data. 6. Out of Print Books. The growth of the Web has proved to be a boon for bibliophiles. Countless out of print booksellers have established Web sites, obliterating the geographical constraints that formerly limited their business to local customers.

These newsletters don’t limit themselves to the Invisible Web, but the news and information they provide is exceptionally useful for all serious Web searchers. All of these newsletters are free.

The Scout Report
http://scout.cs.wisc.edu/scout/report/current/
The Scout Report provides the closest thing to an “official” seal of approval for quality Web sites. Published weekly, it provides organized summaries of the most valuable and authoritative Web resources available. The Scout Report Signpost provides the full-text search of nearly 6,000 of these summaries. The Scout Report staff is made up of a group of librarians and information professionals, and their standards for inclusion in the report are quite high.

Librarians’ Index to the Internet (LII)
http://www.lii.org
This searchable, annotated directory of Web resources, maintained by Carole Leita and a volunteer team of more than 70 reference librarians, is organized into categories including “best of,” “directories,” “databases,” and “specific resources.”

The resulting page offers several searchable databases: the Patent Full-Text Database with Full-Page Images, the Patent Bibliographic and Abstract Database, an Expired Patent Search, and several others. Wally chooses the full-text database http://www.uspto.gov/patft/index.html over a bibliographic database that provides only limited information for each patent. Clicking the full-text database link brings up further options. After scanning the page, Wally notices a direct link that allows for full-text searching by patent number (http://164.195.100.11/netahtml/srchnum.htm). Wally quickly types in the number and in less than a second has a link to the full-text of patent number 3541541. Wally’s job is complete and his boss is very impressed. This is a case where a general-purpose search engine failed to find the desired end result, but was indispensable in helping Wally locate the “front door” of the Invisible Web database that ultimately provided what he was looking for.


pages: 1,085 words: 219,144

Solr in Action by Trey Grainger, Timothy Potter


business intelligence, cloud computing, commoditize, conceptual framework, crowdsourcing, data acquisition, en.wikipedia.org, failed state, fault tolerance, finite state, full text search, glass ceiling, information retrieval, natural language processing, performance metric, premature optimization, recommendation engine, web application

Even though I had no formal search background when I started writing Solr, it felt like a very natural fit, because I have always enjoyed making software “go fast.” I viewed Solr more as an alternate type of datastore designed around an inverted index than as a full-text search engine, and that has helped Solr extend beyond the legacy enterprise search market. By the end of 2005, Solr was powering the search and faceted navigation of a number of CNET sites, and soon it was made open source. Solr was contributed to the Apache Software Foundation in January 2006 and became a subproject of the Lucene PMC (with Lucene Java as its sibling). There had always been a large degree of overlap with Lucene (the core full-text search library used by Solr) committers, and in 2010 the projects were merged. Separate Lucene and Solr downloads would still be available, but they would be developed by a single unified team.

I could not have done this without their insightful questions about Solr and their giving me the opportunity to build a large-scale search solution using Solr.

About this Book
Whether handling big data, building cloud-based services, or developing multitenant web applications, it’s vital to have a fast, reliable search solution. Apache Solr is a scalable and ready-to-deploy open source full-text search engine powered by Lucene. It offers key features like multilingual keyword searching, faceted search, intelligent matching, content clustering, and relevancy weighting right out of the box. Solr in Action is the definitive guide to implementing fast and scalable search using Apache Solr. It uses well-documented examples ranging from basic keyword searching to scaling a system for billions of documents and queries.

Chapter 5 teaches how Solr indexes documents, starting with a discussion of another important configuration file: schema.xml. You’ll learn how to define fields to represent structured data like numbers, dates, prices, and unique identifiers. We also cover how update requests are processed and configured using solrconfig.xml. Chapter 6 builds on the material in chapter 5 by showing how to index text fields using text analysis. Solr was designed to efficiently search and rank documents requiring full-text search. Text analysis is an important part of the search process in that it removes the linguistic variations between indexed text and queries. At this point in the book, you’ll have a solid foundation and will be ready to put Solr to work on your own search needs. As your knowledge of search and Solr grows, so too will your need to go beyond basic keyword searching and implement common search features such as advanced query parsing, hit highlighting, spell-checking, autosuggest, faceting, and result grouping.


pages: 82 words: 17,229

Redis Cookbook by Tiago Macedo, Fred Oliveira


Debian, full text search, loose coupling, Ruby on Rails, Silicon Valley

The possibilities are endless, and Redis’s pub/sub implementation makes it trivial to implement robust solutions for chat or notifications.

Implementing an Inverted-Index Text Search with Redis

Problem
An inverted index is an index data structure that stores mappings of words (or other content) to their locations in a file, document, database, etc. This is generally used to implement full text search, but it requires previous indexing of the documents to be searched. In this recipe, we’ll use Redis as the storage backend for a full-text search implementation.

Solution
Our implementation will use one set per word, containing document IDs. In order to allow fast searches, we’ll index all the documents beforehand. Search itself is performed by splitting the query into words and intersecting the matching sets. This will return the IDs of the documents containing all the words we search for.[1]

Discussion
Indexing
Let’s say we have a hundred documents or web pages that we want to allow searches on.
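A compact rendering of that recipe with the redis-py client (the tokenizer and document IDs are simplified for illustration):

import redis

r = redis.Redis()  # assumes a Redis server on localhost

def index_document(doc_id, text):
    # One set per word: record that this document contains the word.
    for word in text.lower().split():
        r.sadd('word:' + word, doc_id)

def search(query):
    # Intersect the per-word sets; only documents containing every
    # query word survive the intersection.
    return r.sinter(['word:' + word for word in query.lower().split()])

index_document(1, "redis is an in-memory data store")
index_document(2, "redis supports sets and sorted sets")
print(search("redis sets"))  # -> {b'2'}

Because SINTER runs server-side, the application never has to pull whole postings lists over the network just to combine them.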


pages: 519 words: 102,669

Programming Collective Intelligence by Toby Segaran


always be closing, correlation coefficient, Debian, en.wikipedia.org, Firefox, full text search, information retrieval, PageRank, prediction markets, recommendation engine, slashdot, Thomas Bayes, web application

Multidimensional scaling in two dimensions is easy to print, but scaling can be done in any number of dimensions. Try changing the code to scale in one dimension (all the points on a line). Now try making it work for three dimensions.

Chapter 4. Searching and Ranking
This chapter covers full-text search engines, which allow people to search a large set of documents for a list of words, and which rank results according to how relevant the documents are to those words. Algorithms for full-text searches are among the most important collective intelligence algorithms, and many fortunes have been made by new ideas in this field. It is widely believed that Google's rapid rise from an academic project to the world's most popular search engine was based largely on the PageRank algorithm, a variation that you'll learn about in this chapter.

The neural network will learn to associate searches with results based on what links people click on after they get a list of search results. The neural network will use this information to change the ordering of the results to better reflect what people have clicked on in the past. To work through the examples in this chapter, you'll need to create a Python module called searchengine, which has two classes: one for crawling and creating the database, and the other for doing full-text searches by querying the database. The examples will use SQLite, but they can easily be adapted to work with a traditional client-server database. To start, create a new file called searchengine.py and add the following crawler class and method signatures, which you'll be filling in throughout this chapter:

class crawler:
  # Initialize the crawler with the name of database
  def __init__(self, dbname):
    pass

  def __del__(self):
    pass

  def dbcommit(self):
    pass

  # Auxiliary function for getting an entry id and adding
  # it if it's not present
  def getentryid(self, table, field, value, createnew=True):
    return None

  # Index an individual page
  def addtoindex(self, url, soup):
    print 'Indexing %s' % url

  # Extract the text from an HTML page (no tags)
  def gettextonly(self, soup):
    return None

  # Separate the words by any non-whitespace character
  def separatewords(self, text):
    return None

  # Return true if this url is already indexed
  def isindexed(self, url):
    return False

  # Add a link between two pages
  def addlinkref(self, urlFrom, urlTo, linkText):
    pass

  # Starting with a list of pages, do a breadth
  # first search to the given depth, indexing pages
  # as we go
  def crawl(self, pages, depth=2):
    pass

  # Create the database tables
  def createindextables(self):
    pass

A Simple Crawler
I'll assume for now that you don't have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I'll show you how to build a simple crawler.

If you'd like to make sure that the crawl worked properly, you can try checking the entries for a word by querying the database:

>>> [row for row in crawler.con.execute(
...   'select rowid from wordlocation where wordid=1')]
[(1,), (46,), (330,), (232,), (406,), (271,), (192,), ...

The list that is returned is the list of all the URL IDs containing "word," which means that you've successfully run a full-text search. This is a great start, but it will only work with one word at a time, and will just return the documents in the order in which they were loaded. The next section will show you how to expand this functionality by doing these searches with multiple words in the query.

Querying
You now have a working crawler and a big collection of documents indexed, and you're ready to set up the search part of the search engine.
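As a preview of where that leads, the core of a multiword query is a self-join on the wordlocation table. This hypothetical sketch shows the two-word case; the book builds the equivalent SQL dynamically for any number of words:

# Hypothetical two-word query: self-join wordlocation so that only
# URLs containing BOTH word IDs are returned.
sql = '''
select w0.urlid
  from wordlocation w0
  join wordlocation w1 on w0.urlid = w1.urlid
 where w0.wordid = 1 and w1.wordid = 2
'''
print([row for row in crawler.con.execute(sql)])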


pages: 696 words: 111,976

SQL Hacks by Andrew Cumming, Gordon Russell

bioinformatics, business intelligence, business process, database schema, en.wikipedia.org, Erdős number, Firefox, full text search, Hacker Ethic, Paul Erdős, Stewart Brand, web application

It does not give scores other than 1 or 0. You could use this to perform the search:

mysql> SELECT author,
    ->   MATCH (body) AGAINST ('+database +systems' IN BOOLEAN MODE)
    ->   AS SCORE
    -> FROM story
    -> ORDER BY 2 DESC;
+---------------------+-------+
| author              | SCORE |
+---------------------+-------+
| Atzeni              |     1 |
| Adams               |     0 |
| Russell and Cumming |     0 |
+---------------------+-------+

3.1.2. PostgreSQL
To get full text searching in PostgreSQL, you need to use the Tsearch2 module. A more detailed guide on how to do this is available from devx (http://www.devx.com/opensource/Article/21674). To install Tsearch2 (from a source-code install), go to your source directory for PostgreSQL and type the following at the Linux or Unix shell prompt (you may need to be root for the install step):

$ cd contrib/tsearch2
$ make
$ make install

To use Tsearch2 in a particular database, you need to issue this command:

$ psql dbname < tsearch2.sql

tsearch2.sql should be in your install directory's contrib directory (for instance, /usr/local/pgsql/share/contrib/tsearch2.sql).

To use this new searching capability, you need to add a column to the tables to be searched (to hold some system vector data concerning the field to be searched), add an index, and prepare the new column for searching:

ALTER TABLE story ADD COLUMN vectors tsvector;
CREATE INDEX story_index ON story USING gist(vectors);
SELECT set_curcfg('default');
UPDATE story SET vectors = to_tsvector(body);

Finally, you can perform your search:

dbname=> SELECT author, rank(vectors, q)
dbname-> FROM story, to_tsquery('database&systems') AS q
dbname-> ORDER BY rank(vectors, q) DESC;
       author        |   rank
---------------------+-----------
 Atzeni              | 0.0991032
 Adams               |     1e-20
 Russell and Cumming |     1e-20

3.1.3. SQL Server
Implementation of full text searching in SQL Server utilizes the Microsoft Search Engine. This is external to the database and has to be configured separately first. Part of this configuration requires you to specify a location for saving the full text indexes. Because these indexes are stored outside the normal database structure, you need to remember to back these up separately when you are doing a database backup. Make sure you have the Microsoft Search Engine installed on your machine.



pages: 480 words: 99,288

Mastering ElasticSearch by Rafal Kuc, Marek Rogozinski


Amazon Web Services, create, read, update, delete, en.wikipedia.org, fault tolerance, finite state, full text search, information retrieval

You should also know how to send queries to get the documents you are interested in, how to narrow down the results of your queries by using filtering, and how to calculate statistics for your data with the use of the faceting/aggregation mechanism. However, before getting to the exciting functionality that ElasticSearch offers, we think that we should start with a quick tour of Apache Lucene, the full text search library that ElasticSearch uses to build and search its indices, as well as the basic concepts that ElasticSearch is built on. In order to move forward and extend our learning, we need to ensure we don't forget the basics, which is easy to do. We also need to make sure that we understand Lucene correctly, as Mastering ElasticSearch requires this understanding. By the end of this chapter we will have covered:

- What Apache Lucene is
- What the overall Lucene architecture looks like
- How the analysis process is done
- What the Apache Lucene query language is and how to use it
- What the basic concepts of ElasticSearch are
- How ElasticSearch communicates internally

Introducing Apache Lucene
In order to fully understand how ElasticSearch works, especially when it comes to indexing and query processing, it is crucial to understand how the Apache Lucene library works.

Getting familiar with Lucene
You may wonder why the ElasticSearch creators decided to use Apache Lucene instead of developing their own functionality. We don't know for sure, because we were not the ones who made the decision, but we assume that it was because Lucene is mature, highly performing, scalable, light, and yet very powerful. Its core comes as a single Java library file with no dependencies, and allows you to index documents and search them with its out-of-the-box full text search capabilities. Of course, there are extensions to Apache Lucene that allow handling of different languages and enable spellchecking, highlighting, and much more; but if you don't need those features, you can download a single file and use it in your application.

Overall architecture
Although I would like to jump straight to the Apache Lucene architecture, there are some things we need to know first in order to fully understand it, and those are:

- Document: the main data carrier used during indexing and search, containing one or more fields, which hold the data we put into and get from Lucene
- Field: a section of the document, built of two parts: the name and the value
- Term: a unit of search representing a word from the text
- Token: an occurrence of a term in the text of a field.
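To make the token/term distinction concrete, here is a toy analysis chain in Python (real Lucene analyzers are configurable pipelines of a tokenizer plus filters):

STOPWORDS = {"the", "a", "an", "of"}

def analyze(text):
    # Tokenize: break the field value into (position, token) pairs.
    tokens = [(pos, word.lower().strip(".,!?"))
              for pos, word in enumerate(text.split())]
    # Filter: drop stopwords; the surviving tokens become indexed terms.
    return [(pos, term) for pos, term in tokens if term not in STOPWORDS]

print(analyze("The Analysis of Text"))  # -> [(1, 'analysis'), (3, 'text')]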

By the end of this chapter we will have covered the following topics:

- How to use different scoring formulae and what they can bring
- How to use different posting formats and what they can bring
- How to handle Near Real Time searching, real-time GET, and what searcher reopening means
- Looking deeper into multilingual data handling
- Configuring the transaction log to our needs and seeing how it affects our deployments
- Segment merging, different merge policies, and merge scheduling

Altering Apache Lucene scoring
With the release of Apache Lucene 4.0 in 2012, all users of this great full text search library were given the opportunity to alter the default TF/IDF-based algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change made to Lucene when it comes to document score calculation. Lucene 4.0 shipped with additional similarity models, which basically allow us to use a different scoring formula for our documents.
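For orientation, the textbook TF/IDF weighting that the default scoring builds on looks roughly like this (a simplified sketch, not Lucene's exact practical scoring formula):

import math

corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "cats chase dogs",
]

def tf_idf(term, doc, corpus):
    # Term frequency: how prominent the term is within this document.
    tf = doc.split().count(term) / len(doc.split())
    # Inverse document frequency: rarer terms carry more weight.
    df = sum(1 for d in corpus if term in d.split())
    return tf * math.log(len(corpus) / (1 + df))

print(tf_idf("cat", corpus[0], corpus))  # rare "cat" outscores common "the"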


pages: 468 words: 233,091

Founders at Work: Stories of Startups' Early Days by Jessica Livingston


8-hour work day, affirmative action, AltaVista, Apple II, Brewster Kahle, business process, Byte Shop, Danny Hillis, David Heinemeier Hansson, don't be evil, fear of failure, financial independence, Firefox, full text search, game design, Googley, HyperCard, illegal immigration, Internet Archive, Jeff Bezos, Justin.tv, Larry Wall, Maui Hawaii, Menlo Park, nuclear winter, Paul Buchheit, Paul Graham, Peter Thiel, Richard Feynman, Robert Metcalfe, Ruby on Rails, Sand Hill Road, side project, Silicon Valley, slashdot, social software, software patent, South of Market, San Francisco, Startup school, stealth mode startup, Steve Ballmer, Steve Jobs, Steve Wozniak, web application, Y Combinator

The expectation when they came to Yahoo was that they could find anything, but we didn't necessarily deliver on that needle-in-the-haystack expectation. So what we did was that we searched our directory first, we gave you those results, and then, if we didn't find anything, we kicked you over to a full-text search. So, when I say we "rented" that technology, we essentially partnered with full-text search companies to be the falloff searches that we had.

Livingston: That's what you did with Google?

Brady: Yes. Strategically, it was spot-on until Google showed up. Because we always thought it was going to be a leapfrogging game. No one is ever going to be able to get so far ahead that we'd ever be in strategic risk of kingmaking a full-text search engine, because you just can't do that. Google ended up doing exactly that. At the time, until 2000/2001, we had Open Text first, then I think we had AltaVista, then Inktomi.

Brady: In the early days, not too much. Jerry and Dave were way ahead of the curve. The ideas that they had really early on were right strategically and creatively. So everything we did through the middle of ’97, invariably we were first and we did it very well. The one thing we didn’t do that all our competitors were spending a lot of time doing was search. They were crawling the Web and doing full text search, and our strategy was, “Look, that’s a technology game. We’re not a technology company, we’re a media company. Since there are so many of them out there, we’re always going to be able to rent it.” That was the thought back then, and until Google came along that strategy was perfect. Because, as things played out, that’s exactly what happened. We had this searchable directory. It was big, and it had all the popular sites, so you could search for anything on it.



pages: 291 words: 77,596

Total Recall: How the E-Memory Revolution Will Change Everything by C. Gordon Bell, Jim Gemmell


airport security, Albert Einstein, book scanning, cloud computing, conceptual framework, Douglas Engelbart, full text search, information retrieval, invention of writing, inventory management, Isaac Newton, John Markoff, lifelogging, Menlo Park, optical character recognition, pattern recognition, performance metric, RAND corporation, RFID, semantic web, Silicon Valley, Skype, social web, statistical model, Stephen Hawking, Steve Ballmer, Ted Nelson, telepresence, Turing test, Vannevar Bush, web application

A database is a program for storing and retrieving large collections of interrelated information. Modern databases let you very quickly retrieve all the records with a given attribute. You can rapidly sort, sift, and combine information in just about any way you can imagine. There was once a slight technical distinction to be made between how a database could index and look up records and full-text retrieval of documents, but by now databases have subsumed full-text search; they are happy to store documents and perform Google-like retrieval. In his memex paper, Bush had expressed hope that the search algorithms of the future would be better than simple index-lookup on some attribute like author or date. He held up the human brain’s associative memory as the ideal. In an associative network, items are linked together by contingency in time and space, by similarity, by context, and by usefulness.
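As a concrete illustration of databases subsuming full-text search (not from the book): most SQLite builds ship the FTS5 extension, so a few lines of Python are enough to store documents and run ranked keyword queries. A minimal sketch, assuming your Python's bundled SQLite includes FTS5:

import sqlite3

conn = sqlite3.connect(":memory:")
# A virtual table whose contents are full-text indexed.
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [
        ("Bush's memex paper imagined associative trails",),
        ("Databases now store documents and rank matches",),
    ],
)
# MATCH runs the full-text query; ORDER BY rank sorts by relevance.
for (body,) in conn.execute(
    "SELECT body FROM notes WHERE notes MATCH 'memex' ORDER BY rank"
):
    print(body)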

MyCyberTwin is great, but we are still waiting for the first company that can take a heap of someone’s correspondence (e-mail, chats, letters, et cetera) and produce a really convincing impersonation. Any team that can take my corpus and turn it into my digitally immortal chatting self will get my support. And that’s not just vanity—if you can imitate me, you can imitate help-desk personnel and make a ton of money. START-UP #8—DOCUMENT MANAGEMENT It sounds great to declutter your life by scanning all your documents, but full-text search on a heap of files is not always the best way to retrieve information. This service (or program that you run) will automatically group similar items. It will build a knowledge base of every kind of document it can learn about, for example from all major utility and phone companies. It will be able to pull out the date, the total, and who the bill is from. It will create descriptive file names for all your documents and also create a human-readable XML file containing all the information it was able to extract.
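A toy sketch of the kind of extraction such a service would perform. Everything here is hypothetical (a real system would need OCR for scanned documents and per-vendor templates rather than one regex):

import re

# Hypothetical bill text; a real pipeline would OCR the scan first.
bill = "AnyTel Phone Company\nInvoice date: 2009-03-14\nTotal due: $42.17\n"

date = re.search(r"(\d{4}-\d{2}-\d{2})", bill).group(1)
total = re.search(r"\$([\d.]+)", bill).group(1)
vendor = bill.splitlines()[0]

# A descriptive file name plus a human-readable XML record, as the authors suggest.
filename = f"{date}_{vendor.replace(' ', '-')}_{total}.pdf"
xml = f"<bill><from>{vendor}</from><date>{date}</date><total>{total}</total></bill>"
print(filename)
print(xml)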

Python Web Development With Django by Jeff Forcier


create, read, update, delete, database schema, Debian, don't repeat yourself, en.wikipedia.org, Firefox, full text search, Guido van Rossum, loose coupling, MVC pattern, revision control, Ruby on Rails, Silicon Valley, slashdot, web application

MySQL lacks some advanced functionality present in Postgres, but is also a bit more common, partly due to its tight integration with the common Web language PHP. Unlike some database servers, MySQL has a couple of different internal database types that determine the effective feature set: one is MyISAM, which lacks transactional support and foreign keys but is capable of full-text searching, and another is InnoDB, which is newer and has a better feature set but currently lacks full-text search. There are others, but these two are the most commonly used. If you’re on Windows or your package manager doesn’t have a recent version of MySQL, its official Web site is http://www.mysql.com, which offers binaries for most platforms. Django’s preferred MySQL Python library is MySQLdb, whose official site is http://www.sourceforge.net/projects/mysql-python; you need version 1.2.1p2 or newer.
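A minimal sketch of MyISAM's full-text searching driven from Python via MySQLdb. This is not from the book; the connection details and table are invented for illustration:

import MySQLdb  # the book's preferred driver; credentials here are made up

conn = MySQLdb.connect(host="localhost", user="django", passwd="secret", db="cms")
cur = conn.cursor()

# MyISAM is what provides FULLTEXT indexes in this era of MySQL.
cur.execute("CREATE TABLE story (body TEXT, FULLTEXT (body)) ENGINE=MyISAM")
cur.execute("INSERT INTO story (body) VALUES ('Django enters the newsroom')")
conn.commit()

# Query the index with MATCH ... AGAINST; boolean mode sidesteps the
# natural-language 50%-threshold surprise on tiny tables.
cur.execute(
    "SELECT body FROM story WHERE MATCH (body) AGAINST (%s IN BOOLEAN MODE)",
    ("newsroom",),
)
print(cur.fetchall())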

For more, see the official Django documentation.

More powerful search. Our search function is handy, but doesn’t offer the power that something as familiar as a Web search engine does; a multiword phrase, for example, should ideally be treated as a collection of independent search terms unless otherwise specified. The implementation here could be made more sophisticated, but if you are doing full-text searching over large numbers of records, you would probably benefit from something such as Sphinx, a search engine with available Django integration. For more, see withdjango.com.

Status change notifications. We’ve already got a custom save method that handles our Markdown rendering. We could easily extend this to improve our workflow system by detecting when a story’s status has been changed and sending a notification e-mail to the person responsible for handling stories at that stage. A key piece of implementing this would be to replace our status field with a ForeignKey to a full-fledged Status model, which in addition to the numerical value and label fields implied by our STATUS_CHOICES list would have a status_owner field, a ForeignKey field to the User model.
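One hedged sketch of the multiword improvement described above, using Django Q objects to AND together one case-insensitive filter per term. The Story model and body field are assumed from the book's example and may differ:

import operator
from functools import reduce

from django.db.models import Q
from cms.models import Story  # hypothetical app and model names

def search(phrase):
    # Treat each word of a multiword phrase as an independent search term.
    terms = phrase.split()
    if not terms:
        return Story.objects.none()
    combined = reduce(operator.and_, (Q(body__icontains=term) for term in terms))
    return Story.objects.filter(combined)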


pages: 1,266 words: 278,632

Backup & Recovery by W. Curtis Preston


Berlin Wall, business intelligence, business process, database schema, Debian, dumpster diving, failed state, fault tolerance, full text search, job automation, side project, Silicon Valley, web application

Metadata might also contain the project the item is attached to or some other logical grouping. An email archive system would include who sent and received an email, the subject of the email, and all other appropriate metadata. Finally, an archive system may import the full text of the item into its database, allowing for full-text searches against the archive. This can be useful, especially if multiple formats can be supported. It’s particularly expedient to be able to do a full-text search against all emails, Word documents, PDF files, etc. Another important feature of archive systems is their ability to store a predetermined number of copies of an archived item. A company can then decide how many copies it wishes to keep. For example, if a firm is storing its archives on a RAID-protected system, it may choose to have one copy on disk and another on a removable medium such as optical or tape.

The most common use of backups as archives is for the retrieval of reference data. The assumption is that if someone asks for widget ABC’s parts (or some other piece of reference data), the appropriate files can just be restored from the system where they used to reside. The first problem with that scenario is remembering where the files were several years ago. While backup products and even some backup devices are starting to offer full-text search against all your backups, the problems in the following paragraph still exist. Even if you can remember where the files belong, the number of operating systems or application versions that have come and gone in the intervening time can stymie the effort. To restore files that were backed up from “Apollo” five years ago, the first requirement is a system named Apollo. Someone also has to handle any authentication issues between the backup server and the new Apollo because it isn’t the same Apollo it backed up from five years ago.

This feature breaks decades of backup tradition by giving you another way to access your backups. For too long our files and databases have been put into backup formats that required the backup software to extract them. We’ve lived with this for so long that it’s actually quite hard to imagine the possibilities that this brings to the table. Here’s a short list to help you wrap your brain around this one: You can point a full-text search appliance directly at your backups and search the full text of all files ever backed up. If you’re running multiple backup products, users and administrators can use a single method of recovery. Imagine how easy end-user recoveries would be if you could just point them at a mount point such as \\backupserver\yourclientname\date. If the disk device allows you to mount the backup read/write, you could actually use the backup as the production filesystem if the production filesystem were down.

Django Book by Matt Behrens


Benevolent Dictator For Life (BDFL), create, read, update, delete, database schema, distributed revision control, don't repeat yourself, en.wikipedia.org, Firefox, full text search, loose coupling, MVC pattern, revision control, Ruby on Rails, school choice, slashdot, web application

The icontains is a lookup type (as explained in Chapter 5 and Appendix B), and the statement can be roughly translated as “Get the books whose title contains q, without being case-sensitive.” This is a very simple way to do a book search. We wouldn’t recommend using a simple icontains query on a large production database, as it can be slow. (In the real world, you’d want to use a custom search system of some sort. Search the Web for open-source full-text search to get an idea of the possibilities.) We pass books, a list of Book objects, to the template. The template code for search_results.html might include something like this:

<p>You searched for: <strong>{{ query }}</strong></p>

{% if books %}
    <p>Found {{ books|length }} book{{ books|pluralize }}.</p>
    <ul>
        {% for book in books %}
        <li>{{ book.title }}</li>
        {% endfor %}
    </ul>
{% else %}
    <p>No books matched your search criteria.
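The view that feeds this template isn't shown in the excerpt; on a modern Django it might look roughly like this (app, model, and template names are assumed):

from django.shortcuts import render
from books.models import Book  # assumed from the example

def search(request):
    query = request.GET.get('q', '')
    # The simple icontains lookup the passage describes.
    books = Book.objects.filter(title__icontains=query) if query else []
    return render(request, 'search_results.html',
                  {'query': query, 'books': books})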

year, month, and day
For date/datetime fields, perform exact year, month, or day matches:

# Return all entries published in 2005
>>> Entry.objects.filter(pub_date__year=2005)
# Return all entries published in December
>>> Entry.objects.filter(pub_date__month=12)
# Return all entries published on the 3rd of the month
>>> Entry.objects.filter(pub_date__day=3)
# Combination: return all entries on Christmas of any year
>>> Entry.objects.filter(pub_date__month=12, pub_date__day=25)

isnull
Takes either True or False, which correspond to SQL queries of IS NULL and IS NOT NULL, respectively:

>>> Entry.objects.filter(pub_date__isnull=True)

search
A Boolean full-text search that takes advantage of full-text indexing. This is like contains but is significantly faster due to full-text indexing. Note this is available only in MySQL and requires direct manipulation of the database to add the full-text index.

The pk Lookup Shortcut
For convenience, Django provides a pk lookup type, which stands for “primary_key”. In the example Blog model, the primary key is the id field, so these three statements are equivalent:

>>> Blog.objects.get(id__exact=14)  # Explicit form
>>> Blog.objects.get(id=14)         # __exact is implied
>>> Blog.objects.get(pk=14)         # pk implies id__exact

The use of pk isn’t limited to __exact queries – any query term can be combined with pk to perform a query on the primary key of a model:

# Get blog entries with id 1, 4, and 7
>>> Blog.objects.filter(pk__in=[1, 4, 7])
# Get all blog entries with id > 14
>>> Blog.objects.filter(pk__gt=14)

pk lookups also work across joins.


pages: 593 words: 118,995

Relevant Search: With Examples Using Elasticsearch and Solr by Doug Turnbull, John Berryman


commoditize, crowdsourcing, domain-specific language, finite state, fudge factor, full text search, information retrieval, natural language processing, premature optimization, recommendation engine, sentiment analysis

And although tokens are typically generated from text, as you’ll see in chapter 4, analysis can be applied and tokens generated for nontext values such as floating-point numbers and geographic locations. In chapter 1, we mentioned the notion of features. In machine learning, features are descriptors for the items being classified. Features used to classify fruit may be things such as color, flavor, and shape. With full-text search, the tokens produced during analysis are the dominant features used to match a user’s query with documents in the index. Don’t worry if this seems vague right now; the greater portion of this book is dedicated to making these ideas clear. After analysis is complete, the documents are indexed; the tokens from the analysis step are stored into search engine data structures for document retrieval.
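A toy stand-in for an analysis chain, showing how analysis turns text into the token features that queries and documents share. This is only an illustration, not Elasticsearch's or Solr's actual analyzers:

import re

STOPWORDS = {"the", "a", "of", "and"}

def analyze(text):
    # Lowercase, split into word tokens, drop stopwords, and apply a crude
    # plural-stripping "stemmer" as a stand-in for real token filters.
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]

# Query and document produce the same features, so they can match in the index.
print(analyze("The flavors of fruits"))  # ['flavor', 'fruit']
print(analyze("fruit flavor"))           # ['fruit', 'flavor']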

[The remainder of this result is the book's back-of-book index; the matching entry is “full-text search.”]


pages: 485 words: 74,211

Developing Web Applications with Haskell and Yesod by Michael Snoyman


create, read, update, delete, database schema, Debian, domain-specific language, don't repeat yourself, full text search, MVC pattern, web application

Persistent: Raw SQL

The Persistent package provides a type-safe interface to data stores. It tries to be backend-agnostic, such as not relying on relational features of SQL. My experience has been that you can easily perform 95% of what you need to do with the high-level interface. (In fact, most of my web apps use the high-level interface exclusively.) But occasionally you’ll want to use a feature that’s specific to a backend. One feature I’ve used in the past is full text search. In this case, we’ll use the SQL “LIKE” operator, which is not modeled in Persistent. We’ll get all people with the last name “Snoyman” and print the records out.

Note: Actually, you can express a LIKE operator directly in the normal syntax due to a feature added in Persistent 0.6, which allows backend-specific operators. But this is still a good example, so let’s roll with it.

{-# LANGUAGE OverloadedStrings, TemplateHaskell, QuasiQuotes, TypeFamilies #-}
{-# LANGUAGE GeneralizedNewtypeDeriving, GADTs, FlexibleContexts #-}
import Database.Persist.Sqlite (withSqliteConn)
import Database.Persist.TH (mkPersist, persist, share, mkMigrate, sqlSettings)
import Database.Persist.GenericSql (runSqlConn, runMigration, SqlPersist)
import Database.Persist.GenericSql.Raw (withStmt)
import Data.Text (Text)
import Database.Persist
import Database.Persist.Store (PersistValue)
import Control.Monad.IO.Class (liftIO)
import qualified Data.Conduit as C
import qualified Data.Conduit.List as CL

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persist|
Person
    name Text
|]

main :: IO ()
main = withSqliteConn ":memory:" $ runSqlConn $ do
    runMigration migrateAll
    insert $ Person "Michael Snoyman"
    insert $ Person "Miriam Snoyman"
    insert $ Person "Eliezer Snoyman"
    insert $ Person "Gavriella Snoyman"
    insert $ Person "Greg Weber"
    insert $ Person "Rick Richardson"
    -- Persistent does not provide the LIKE keyword, but we'd like to get the
    -- whole Snoyman family...
    let sql = "SELECT name FROM Person WHERE name LIKE '%Snoyman'"
    C.runResourceT $ withStmt sql [] C.$$ CL.mapM_ $ liftIO . print

There is also higher-level support that allows for automated data marshaling.


pages: 420 words: 79,867

Developing Backbone.js Applications by Addy Osmani


Airbnb, anti-pattern, create, read, update, delete, database schema, don't repeat yourself, Firefox, full text search, Google Chrome, Khan Academy, loose coupling, MVC pattern, node package manager, pull request, Ruby on Rails, side project, single page application, web application

For clientPager these include:

Collection.goTo(n, options) - go to a specific page
Collection.prevPage(options) - go to the previous page
Collection.nextPage(options) - go to the next page
Collection.howManyPer(n) - set how many items to display per page
Collection.setSort(sortBy, sortDirection) - update sort on the current view. Sorting will automatically detect if you’re trying to sort numbers (even if they’re stored as strings) and will do the right thing.
Collection.setFilter(filterFields, filterWords) - filter the current view. Filtering supports multiple words without any specific order, so you’ll basically get a full-text search ability. Also, you can pass it only one field from the model, or you can pass an array with fields and all of them will get filtered. The last option is to pass it an object containing a comparison method and rules; currently, only the levenshtein method is available.

The goTo(), prevPage(), and nextPage() functions do not require the options param since they will be executed synchronously. However, when specified, the success callback will be invoked before the function returns.


pages: 420 words: 61,808

Flask Web Development: Developing Web Applications With Python by Miguel Grinberg


database schema, Firefox, full text search, Minecraft, platform as a service, web application

Following is a short list of some additional packages that are worth exploring:

Flask-Babel: Internationalization and localization support
Flask-RESTful: Tools for building RESTful APIs
Celery: Task queue for processing background jobs
Frozen-Flask: Conversion of a Flask application to a static website
Flask-DebugToolbar: In-browser debugging tools
Flask-Assets: Merging, minifying, and compiling of CSS and JavaScript assets
Flask-OAuth: Authentication against OAuth providers
Flask-OpenID: Authentication against OpenID providers
Flask-WhooshAlchemy: Full-text search for Flask-SQLAlchemy models based on Whoosh
Flask-KVsession: Alternative implementation of user sessions that use server-side storage

If the functionality that you need for your project is not covered by any of the extensions and packages mentioned in this book, then your first destination to look for additional extensions should be the official Flask Extension Registry. Other good places to search are the Python Package Index, GitHub, and BitBucket.


pages: 263 words: 75,610

Delete: The Virtue of Forgetting in the Digital Age by Viktor Mayer-Schönberger


en.wikipedia.org, Erik Brynjolfsson, Firefox, full text search, George Akerlof, information asymmetry, information retrieval, information trail, Internet Archive, invention of movable type, invention of the printing press, John Markoff, lifelogging, moveable type in China, Network effects, packet switching, pattern recognition, RFID, slashdot, Steve Jobs, Steven Levy, The Market for Lemons, The Structural Transformation of the Public Sphere, Vannevar Bush

In the United States in the 1970s, Lexis and Westlaw, for example, made available to their customers huge databases with the full text of tens of thousands of court decisions, but these could only be retrieved using a limited set of keys. Customers, however, wanted to find relevant decisions by searching for words in the text of the decision, not just the case name, docket number, date, and a few subject words that had been indexed. The solution was to make searchable every word of every document in the database. Such full-text searches still require input of the precise words or terms, and so they are no surefire recipe for finding the desired information, but they are eminently easier and more powerful than a search that is restricted to a small number of predefined search keys. At first, full-text indexing and searches were used by large providers of information databases, but by the beginning of the twenty-first century it had become a standard feature of all major PC operating systems, bringing the power of pinpoint information retrieval to people’s desktops.


pages: 313 words: 75,583

Ansible for DevOps: Server and Configuration Management for Humans by Jeff Geerling


Amazon Web Services, Any sufficiently advanced technology is indistinguishable from magic, cloud computing, continuous integration, database schema, Debian, defense in depth, DevOps, fault tolerance, Firefox, full text search, Google Chrome, inventory management, loose coupling, Minecraft, Ruby on Rails, web application

A similar server configuration, running Apache, MySQL, and PHP, can be used to run many popular web frameworks and CMSes besides Drupal, including Symfony, Wordpress, Joomla, Laravel, etc. You can find the entire example Drupal LAMP server playbook in this book’s code repository at https://github.com/geerlingguy/ansible-for-devops, in the drupal directory.

Real-world playbook: Ubuntu Apache Tomcat server with Solr

Apache Solr is a fast and scalable search server optimized for full-text search, word highlighting, faceted search, fast indexing, and more. It’s a very popular search server, and it’s pretty easy to install and configure using Ansible. In the following example, we’re going to set up Apache Solr using Ubuntu 12.04 and Apache Tomcat.

Apache Solr Server.

Include a variables file, and discover pre_tasks and handlers

Just like the previous LAMP server example, we’ll begin this playbook by telling Ansible our variables will be in a separate vars.yml file:

- hosts: all

  vars_files:
    - vars.yml

Let’s quickly create the vars.yml file, while we’re thinking about it.


pages: 260 words: 77,007

Are You Smart Enough to Work at Google?: Trick Questions, Zen-Like Riddles, Insanely Difficult Puzzles, and Other Devious Interviewing Techniques You ... Know to Get a Job Anywhere in the New Economy by William Poundstone

affirmative action, Albert Einstein, big-box store, Buckminster Fuller, car-free, cloud computing, creative destruction, en.wikipedia.org, full text search, hiring and firing, index card, Isaac Newton, John von Neumann, loss aversion, mental accounting, new economy, Paul Erdős, RAND corporation, random walk, Richard Feynman, Richard Feynman, rolodex, Rubik’s Cube, Silicon Valley, Silicon Valley startup, sorting algorithm, Steve Ballmer, Steve Jobs, The Spirit Level, Tony Hsieh, why are manhole covers round?, William Shockley: the traitorous eight

But it has baggage—legacy products, legacy users, and a corporate culture forged in the 1980s. Google entered the new millennium with a clean slate. As the tech blogger Joel Spolsky wrote, A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. “Google uses Bayesian filtering the way Microsoft uses the ‘if’ statement,” he said. That’s true. Google also uses full-text-search-of-the-entire-Internet the way Microsoft uses little tables that list what error IDs correspond to which help text. Look at how Google does spell checking: it’s not based on dictionaries; it’s based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn’t. Bob and Eve This “higher level of abstraction” figures in many of Google’s interview questions.
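The word-usage-statistics idea is easy to sketch. What follows is a Norvig-style toy corrector with a tiny inline corpus standing in for Google's web-scale counts; it is an illustration of the principle, not Google's actual algorithm:

import re
from collections import Counter

# Toy corpus; Google's advantage was deriving these counts from the whole Web.
CORPUS = "sharon shared a sharp chart sharon sharon shore short"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS))

def edits1(word):
    # Every string one edit away: deletes, transposes, replaces, inserts.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Prefer the candidate the corpus has seen most often.
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=lambda w: WORDS[w])

print(correct("shaorn"))  # -> 'sharon'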


pages: 281 words: 95,852

The Googlization of Everything: by Siva Vaidhyanathan


1960s counterculture, activist fund / activist shareholder / activist investor, AltaVista, barriers to entry, Berlin Wall, borderless world, Burning Man, Cass Sunstein, choice architecture, cloud computing, computer age, corporate social responsibility, correlation does not imply causation, creative destruction, data acquisition, death of newspapers, don't be evil, Firefox, Francis Fukuyama: the end of history, full text search, global village, Google Earth, Howard Rheingold, informal economy, information retrieval, John Markoff, Joseph Schumpeter, Kevin Kelly, knowledge worker, libertarian paternalism, market fundamentalism, Marshall McLuhan, means of production, Mikhail Gorbachev, moral panic, Naomi Klein, Network effects, new economy, Nicholas Carr, PageRank, pirate software, Ray Kurzweil, Richard Thaler, Ronald Reagan, side project, Silicon Valley, Silicon Valley ideology, single-payer health, Skype, social web, Steven Levy, Stewart Brand, technoutopianism, The Nature of the Firm, The Structural Transformation of the Public Sphere, Thorstein Veblen, urban decay, web application, zero-sum game

At that dinner, Tim said “I know this doesn’t have anything to do with the matter at hand, but out of curiosity, how many people here use Google?” Every hand went up. From library consultant Karen Coyle: I was chatting with the brother of one of the Google founders. He told me that his brother was working on a new search engine that would be better than anything ever seen before. I tried to argue that it would still be limited by the reality of the full-text search. I probably looked at Google when it was first made available, and I was pretty unimpressed. Just more keyword searching. Today I use it constantly, but I’m very aware of the fact that it works quite well for nouns and proper nouns (people, companies, named things), and less well for concepts. . . . I think of it as a giant phone book for the Internet, not as a classification of knowledge.


pages: 360 words: 96,275

PostgreSQL 9 Admin Cookbook: Over 80 Recipes to Help You Run an Efficient PostgreSQL 9. 0 Database by Simon Riggs, Hannu Krosing


business intelligence, business process, database schema, Debian, en.wikipedia.org, full text search, Skype

In this case, you should also have lots of postgres processes in status D.

Reducing the number of rows returned

Although often the problem is producing many rows in the first place, it is made worse by returning all the unneeded rows to the client. This is especially true if the client and server are not on the same host. Here are some ways to reduce the traffic between the client and server.

A full-text search returns 10,000 documents, but only the first 20 are displayed to the user

In this case, order the documents by ranking on the server, and return only the top 20 actually displayed:

SELECT title, ts_rank_cd(body_tsv, query, 20) AS text_rank
FROM articles, plainto_tsquery('spicy potatoes') AS query
WHERE body_tsv @@ query
ORDER BY text_rank DESC
LIMIT 20;

If you need the next 20, don't just query with LIMIT 40 and throw away the first 20; use "OFFSET 20 LIMIT 20" to return just the next 20.
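Wrapped in Python via psycopg2, the same server-side ranking plus pagination might look like the sketch below; the connection string is hypothetical, and the table and columns come from the recipe above:

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # made-up connection details
cur = conn.cursor()

def page_of_results(query_text, page, per_page=20):
    # Rank on the server and fetch only the rows this page displays.
    cur.execute(
        """
        SELECT title, ts_rank_cd(body_tsv, query, 20) AS text_rank
        FROM articles, plainto_tsquery(%s) AS query
        WHERE body_tsv @@ query
        ORDER BY text_rank DESC
        OFFSET %s LIMIT %s
        """,
        (query_text, page * per_page, per_page),
    )
    return cur.fetchall()

print(page_of_results("spicy potatoes", page=0))  # first 20
print(page_of_results("spicy potatoes", page=1))  # next 20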


pages: 353 words: 104,146

European Founders at Work by Pedro Gairifo Santos


business intelligence, cloud computing, crowdsourcing, fear of failure, full text search, information retrieval, inventory management, iterative process, Jeff Bezos, Lean Startup, Mark Zuckerberg, natural language processing, pattern recognition, pre–internet, recommendation engine, Richard Stallman, Silicon Valley, Skype, slashdot, Steve Jobs, Steve Wozniak, subscription business, technology bubble, web application, Y Combinator

Really try to break those barriers, break those borders, and take inspiration from things around you and be curious, but apply it to problems in a smart, differentiated, useful way. Ilya Segalovich Yandex Ilya Segalovich is co-founder of Yandex, the leading search engine in Russian-speaking countries. The roots of Yandex trace back to a company called Arkadia, which in the early 1990s developed software featuring full-text search supporting the Russian language. In 1993, Segalovich and Arkady Volozh came up with the word “Yandex” to describe their search technologies. The web site, Yandex.ru, was launched in 1997 and in 2000 Yandex was incorporated as a standalone company. In May 2011, Yandex raised $1.3 billion in an initial public offering on NASDAQ. It was the biggest IPO for a dot-com since Google went public in 2004.


pages: 349 words: 114,038

Culture & Empire: Digital Revolution by Pieter Hintjens


4chan, airport security, anti-communist, anti-pattern, barriers to entry, Bill Duvall, bitcoin, blockchain, business climate, business intelligence, business process, Chelsea Manning, clean water, commoditize, congestion charging, Corn Laws, correlation does not imply causation, cryptocurrency, Debian, Edward Snowden, failed state, financial independence, Firefox, full text search, German hyperinflation, global village, GnuPG, Google Chrome, greed is good, Hernando de Soto, hiring and firing, informal economy, intangible asset, invisible hand, James Watt: steam engine, Jeff Rulifson, Julian Assange, Kickstarter, M-Pesa, mass immigration, mass incarceration, mega-rich, mutually assured destruction, Naomi Klein, national security letter, new economy, New Urbanism, Occupy movement, offshore financial centre, packet switching, patent troll, peak oil, pre–internet, private military company, race to the bottom, rent-seeking, reserve currency, RFC: Request For Comment, Richard Feynman, Richard Feynman, Richard Stallman, Satoshi Nakamoto, security theater, selection bias, Skype, slashdot, software patent, spectrum auction, Steve Crocker, Steve Jobs, Steven Pinker, Stuxnet, The Wealth of Nations by Adam Smith, The Wisdom of Crowds, trade route, transaction costs, union organizing, wealth creators, web application, WikiLeaks, Y2K, zero day, Zipf's Law

Also in 1998, Google was founded, and soon their revolutionary concept of "it works the way you expect" made them King of the Search Engines. Once upon a time, the list of all websites was twenty pages long. I still have a book that has the entire World Wide Web printed as an appendix. Then the list got too long to print and sites like Yahoo! organized them into categories. Then the category list got too large to keep updated, and Lycos invented the full-text search. This was too slow, so Digital Equipment Corporation built a natty search engine called Altavista to show how to do it properly. The results for any search got too long, so Google invented the ranked search, which pretty much fixed the search issue. Google also threw all the clutter off the main page. Less is more. The dot-com boom bubbled in 1999, driven by the dream of cheap access to millions -- no, billions -- of consumers.


pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack by Matthew A. Russell


Climategate, cloud computing, crowdsourcing, en.wikipedia.org, fault tolerance, Firefox, full text search, Georg Cantor, Google Earth, information retrieval, Mark Zuckerberg, natural language processing, NP-complete, profit motive, Saturday Night Live, semantic web, Silicon Valley, slashdot, social graph, social web, statistical model, Steve Jobs, supply-chain management, text mining, traveling salesman, Turing test, web application

Example 5-17 is a trivial adaptation of Example 5-4 that illustrates a routine emitting a simple JSON structure (a list of [term, URL, frequency] tuples) that can be fed into an HTML template for WP-Cumulus. We’ll pass in empty strings for the URL portion of those tuples, but you could use your imagination and hyperlink to a simple web service that displays a list of tweets containing the entities. (Recall that Example 5-7 provides just about everything you’d need to wire this up by using couchdb-lucene to perform a full-text search on tweets stored in CouchDB.) Another option might be to write a web service and link to a URL that provides any tweet containing the specified entity.

Example 5-17. Generating the data for an interactive tag cloud using WP-Cumulus (the_tweet__tweet_tagcloud_code.py)

# -*- coding: utf-8 -*-

import os
import sys
import webbrowser
import json
from cgi import escape
from math import log

import couchdb
from couchdb.design import ViewDefinition

DB = sys.argv[1]
MIN_FREQUENCY = int(sys.argv[2])
HTML_TEMPLATE = '..


pages: 470 words: 109,589

Apache Solr 3 Enterprise Search Server by David Smiley and Eric Pugh


bioinformatics, continuous integration, database schema, en.wikipedia.org, fault tolerance, Firefox, full text search, information retrieval, Internet Archive, natural language processing, performance metric, platform as a service, Ruby on Rails, web application

Which Rails/Ruby library should I use? The two most common high-level libraries for interacting with Solr are acts_as_solr and Sunspot. However, in the last couple of years, Sunspot has become the more popular choice, and comes in a version designed to work explicitly with Rails called sunspot_rails that allows Rails ActiveRecord database objects to be transparently backed by a Solr index for full text search. For a lower-level client interface to Solr from Ruby environments, there are two libraries duking it out to be the client of choice: solr-ruby, a client library developed by the Apache Solr project, and rsolr, a reimplementation of a Ruby-centric client library. Both of these solutions are solid and act as great low-level API libraries. However, rsolr has gained more attention, has better documentation, and some nice features such as a direct embedded Solr connection through JRuby. rsolr also has support for using curb (Ruby bindings to curl, a very fast HTTP library) instead of the standard Net::HTTP library for the HTTP transport layer.


pages: 561 words: 120,899

The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant From Two Centuries of Controversy by Sharon Bertsch McGrayne


Bayesian statistics, bioinformatics, British Empire, Claude Shannon: information theory, Daniel Kahneman / Amos Tversky, double helix, Edmond Halley, Fellow of the Royal Society, full text search, Henri Poincaré, Isaac Newton, John Markoff, John Nash: game theory, John von Neumann, linear programming, meta analysis, meta-analysis, Nate Silver, p-value, Pierre-Simon Laplace, placebo effect, prediction markets, RAND corporation, recommendation engine, Renaissance Technologies, Richard Feynman, Richard Feynman, Richard Feynman: Challenger O-ring, Ronald Reagan, speech recognition, statistical model, stochastic process, Thomas Bayes, Thomas Kuhn: the structure of scientific revolutions, traveling salesman, Turing machine, Turing test, uranium enrichment, Yom Kippur War

According to Google’s research director, Peter Norvig, “There must have been dozens of times when a project started with naïve Bayes, just because it was easy to do and we expected to replace it with something more sophisticated later, but in the end the vast amount of data meant that a more complex technique was not needed.” Google also uses Bayesian techniques to classify spam and pornography and to find related words, phrases, and documents. A very large Bayesian network finds synonyms of words and phrases. Instead of downloading dictionaries for a spelling checker, Google conducted a full-text search of the entire Internet looking for all the different ways words can be spelled. The result was a flexible system that could recognize that “shaorn” should have been “Sharon” and correct the typo. While Bayes has helped revolutionize modern life on the web, it is also helping to finesse the Tower of Babel that has separated linguistic communities for millennia. During the Second World War, Warren Weaver of the Rockefeller Foundation was impressed with how “a multiplicity of languages impedes cultural interchange between the peoples of the earth and is a serious deterrent to international understanding.”6 Struck by the power of mechanized cryptography and by Claude Shannon’s new information theory, Weaver suggested that computerized statistical methods could treat translation as a cryptography problem.

Programming Erlang: Software for a Concurrent World by Joe Armstrong (Pragmatic Bookshelf, July 2007)


Chuck Templeton: OpenTable, Debian, en.wikipedia.org, fault tolerance, finite state, full text search, RFC: Request For Comment, sorting algorithm

We talk in general terms about shared memory and message passing concurrency and why we strongly believe that languages with no mutable state and concurrency are ideally suited to programming multicore computers.

• Chapter 20, Programming Multicore CPUs, on page 367 is about programming multicore computers. We talk about the techniques for ensuring that an Erlang program will run efficiently on multicore computers. We introduce a number of abstractions for speeding up sequential programs on multicore computers. Finally we perform some measurements and develop our third major program, a full-text search engine. To write this, we first implement a function called mapreduce—this is a higher-order function for parallelizing a computation over a set of processing elements.
• Appendix A, on page 390, describes the type system used to document Erlang functions.
• Appendix B, on page 396, describes how to set up Erlang on the Windows operating system (and how to configure emacs on all operating systems).
• Appendix C, on page 399, has a catalog of Erlang resources.
• Appendix D, on page 403, describes lib_chan, which is a library for programming socket-based distribution.
• Appendix E, on page 419, looks at techniques for analyzing, profiling, debugging, and tracing your code.
• Appendix F, on page 439, has one-line summaries of the most used modules in the Erlang standard libraries.

1.2 Begin Again

Once upon a time a programmer came across a book describing a funny programming language.
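The book's implementation is in Erlang; as a language-neutral illustration of the same idea, here is a sequential Python sketch of a mapreduce helper used to build a toy inverted index. The point of the book's version is parallelizing the map phase across processes, which this sketch deliberately omits:

from collections import defaultdict

def mapreduce(map_fn, reduce_fn, inputs):
    # Map phase: apply map_fn to every input, collecting (key, value) pairs.
    intermediate = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            intermediate[key].append(value)
    # Reduce phase: combine all values that share a key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Toy inverted index: map emits (word, doc_id); reduce deduplicates the ids.
docs = {1: "erlang makes concurrency easy", 2: "full text search with mapreduce"}

def emit_words(item):
    doc_id, text = item
    return [(word, doc_id) for word in text.split()]

index = mapreduce(emit_words, lambda word, ids: sorted(set(ids)), docs.items())
print(index["search"])  # -> [2]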


pages: 933 words: 205,691

Hadoop: The Definitive Guide by Tom White


Amazon Web Services, bioinformatics, business intelligence, combinatorial explosion, database schema, Debian, domain-specific language, en.wikipedia.org, fault tolerance, full text search, Grace Hopper, information retrieval, Internet Archive, linked data, loose coupling, openstreetmap, recommendation engine, RFID, SETI@home, social graph, web application

Parsed page data

Page content is then parsed using a suitable parser—Nutch provides parsers for documents in many popular formats, such as HTML, PDF, Open Office and Microsoft Office, RSS, and others.

Link graph database

This database is necessary to compute link-based page ranking scores, such as PageRank. For each URL known to Nutch, it contains a list of other URLs pointing to it, and their associated anchor text (from HTML <a href="..">anchor text</a> elements). This database is called LinkDb.

Full-text search index

This is a classical inverted index, built from the collected page metadata and from the extracted plain-text content. It is implemented using the excellent Lucene library.

We briefly mentioned before that Hadoop began its life as a component in Nutch, intended to improve its scalability and to address clear performance bottlenecks caused by a centralized data processing model. Nutch was also the first public proof-of-concept application ported to the framework that would later become Hadoop, and the effort required to port Nutch algorithms and data structures to Hadoop proved to be surprisingly small.
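A toy sketch of the LinkDb inversion described above, using plain Python dictionaries rather than Nutch's actual MapReduce jobs; the URLs and anchors are made up:

from collections import defaultdict

# Hypothetical crawl output: page -> list of (outgoing_url, anchor_text).
outlinks = {
    "a.html": [("b.html", "see B"), ("c.html", "see C")],
    "b.html": [("c.html", "more on C")],
}

# Invert it LinkDb-style: target URL -> list of (source_url, anchor_text).
linkdb = defaultdict(list)
for source, links in outlinks.items():
    for target, anchor in links:
        linkdb[target].append((source, anchor))

print(linkdb["c.html"])  # -> [('a.html', 'see C'), ('b.html', 'more on C')]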


pages: 602 words: 207,965

Practical Ext JS Projects With Gears by Frank Zammetti


a long time ago in a galaxy far, far away, Albert Einstein, corporate raider, create, read, update, delete, database schema, en.wikipedia.org, Firefox, full text search, Gordon Gekko, Larry Wall, loose coupling, Ronald Reagan, web application

The Create Index and Drop Index functions will first require that you figure out how to retrieve the list of indexes for a table and then present a Window to enter the index details in the case of Create Index, or a list of existing indexes to choose from in the case of Drop Index. None of this is especially hard, and would make for a good exercise (hint: getting the list of indexes is a slight modification to the query to retrieve a list of tables in a database).

• The SQLite engine Gears uses has a full-text search capability, and it would be nice if there was a Text Search tool, similar to the Query tool, where that could be used.
• Provide the ability to add a new record from the Browse tab of the Table Details Window, as well as the ability to duplicate, edit, or delete the selected record.
• Allow more than 20 fields to be added in the Create Table Window. You can implement this however you choose; one way would be to dynamically add a new row to the form any time you detect that all existing rows have been populated.


pages: 348 words: 39,850

Data Scientists at Work by Sebastian Gutierrez


Albert Einstein, algorithmic trading, Bayesian statistics, bioinformatics, bitcoin, business intelligence, chief data officer, clean water, cloud computing, commoditize, computer vision, continuous integration, correlation does not imply causation, creative destruction, crowdsourcing, data is the new oil, DevOps, domain-specific language, Donald Knuth, follow your passion, full text search, informal economy, information retrieval, Infrastructure as a Service, Intergovernmental Panel on Climate Change (IPCC), inventory management, iterative process, lifelogging, linked data, Mark Zuckerberg, microbiome, Moneyball by Michael Lewis explains big data, move fast and break things, move fast and break things, natural language processing, Network effects, nuclear winter, optical character recognition, pattern recognition, Paul Graham, personalized medicine, Peter Thiel, pre–internet, quantitative hedge fund, quantitative trading / quantitative finance, recommendation engine, Renaissance Technologies, Richard Feynman, Richard Feynman, self-driving car, side project, Silicon Valley, Skype, software as a service, speech recognition, statistical model, Steve Jobs, stochastic process, technology bubble, text mining, the scientific method, web application

The first was when I was working with a digital library and realized we could dramatically improve document tagging by algorithmically recycling author-supplied labels. While authors tagged articles with keywords and phrases, the tagging was sparse and inconsistent. As a result of this type of tagging, the use of tags for article retrieval offered high precision but low recall. Unfortunately, the alternative of performing full-text search on the tags provided unacceptably low precision. So we developed a system to bootstrap on author-supplied tags, thus improving tagging across the collection. The result was an order of magnitude increase in recall without sacrificing precision. The second was using entropy calculations on language models to automatically detect events in a news archive. We started by performing entity extraction on the archive to detect named entities and key phrases.
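One hedged sketch of the entropy idea mentioned above: score each day's text by its average surprise under a background language model, and flag high-surprise days as candidate events. The data is a toy and this is not the interviewee's actual system:

import math
from collections import Counter

def distribution(text):
    # Unigram language model: word -> relative frequency.
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cross_entropy(day_text, background):
    # Average surprise (in bits) of the day's words under the background model;
    # unseen words get a tiny floor probability instead of zero.
    words = day_text.split()
    return -sum(math.log2(background.get(w, 1e-6)) for w in words) / len(words)

background = distribution("markets calm trading steady markets steady calm")
print(cross_entropy("markets calm", background))    # low: a typical day
print(cross_entropy("volcano erupts", background))  # high: a likely event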