full text search

28 results

pages: 713 words: 93,944

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond, Jim Wilson, Jim R. Wilson

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Amazon Web Services, create, read, update, delete, data is the new oil, database schema, Debian, domain-specific language, en.wikipedia.org, fault tolerance, full text search, general-purpose programming language, linked data, MVC pattern, natural language processing, node package manager, random walk, recommendation engine, Skype, social graph, web application

At first, the dusting wasn’t even enough to cover this morning’s earliest tracks, but the power of the storm took over, replenishing the landscape and delivering the perfect skiing experience with the diversity and quality that we craved. Just this past year, I woke up to the realization that the database world, too, is covered with a fresh blanket of snow. Sure, the relational databases are there, and you can get a surprisingly rich experience with open source RDBMS software. You can do clustering, full-text search, and even fuzzy searching. But you’re no longer limited to that approach. I have not built a fully relational solution in a year. Over that time, I’ve used a document-based database and a couple of key-value datastores. The truth is that relational databases no longer have a monopoly on flexibility or even scalability. For the kinds of applications that we build, there are more appropriate models that are simpler, faster, and more reliable.

Sometimes, users can’t remember the full name of “J. Roberts.” In other cases, we just plain don’t know how to spell “Benn Aflek.” We’ll look into a few PostgreSQL packages that make text searching easy. It’s worth noting that as we progress, this kind of string matching blurs the lines between relational queries and searching frameworks like Lucene.[8] Although some may feel that features like full-text search belong in the application code, there can be performance and administrative benefits to pushing these packages down to the database, where the data lives.

SQL Standard String Matches

PostgreSQL has many ways of performing text matches, but the two big default methods are LIKE and regular expressions.

I Like LIKE and ILIKE

LIKE and ILIKE (case-insensitive LIKE) are the simplest forms of text search.
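As a sketch of the matching semantics only (Postgres of course implements this internally): in a LIKE pattern, % matches any run of characters and _ matches exactly one, so a pattern can be translated mechanically into a regular expression. The pattern strings below are invented examples.

```python
import re

def sql_like_to_regex(pattern, case_insensitive=False):
    """Translate a SQL LIKE pattern into a compiled Python regex.

    '%' matches any run of characters, '_' matches exactly one character,
    and everything else matches literally. case_insensitive=True gives
    ILIKE behavior."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile("^" + "".join(parts) + "$", flags)

# LIKE 'Stardust%' -- titles beginning with "Stardust"
like = sql_like_to_regex("Stardust%")
# ILIKE 'stardust%' -- the same match, ignoring case
ilike = sql_like_to_regex("stardust%", case_insensitive=True)
```

The translation also shows why leading-wildcard patterns are expensive: a pattern beginning with % can no longer be anchored to the start of the string.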

CREATE INDEX movies_title_trigram ON movies
USING gist (title gist_trgm_ops);

Now you can query with a few misspellings and still get decent results.

SELECT *
FROM movies
WHERE title % 'Avatre';

 title
---------
 Avatar

Trigrams are an excellent choice for accepting user input, without weighing them down with wildcard complexity.

Full-Text Fun

Next, we want to allow users to perform full-text searches based on matching words, even if they’re pluralized. If a user wants to search for certain words in a movie title but can remember only some of them, Postgres supports simple natural-language processing.

TSVector and TSQuery

Let’s look for a movie that contains the words night and day. This is a perfect job for text search using the @@ full-text query operator.

SELECT title
FROM movies
WHERE title @@ 'night & day';

             title
-------------------------------
 A Hard Day’s Night
 Six Days Seven Nights
 Long Day’s Journey Into Night

The query returns titles like A Hard Day’s Night even though the word Day is in possessive form and the two words are out of order in the query.
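To see why the misspelling still matches, here is a back-of-the-envelope sketch of trigram similarity in plain Python. It mirrors the padding and Jaccard-style arithmetic that pg_trgm uses (the real extension is C code inside Postgres); treat the exact numbers as illustrative.

```python
def trigrams(s):
    """Extract the set of three-character substrings, pg_trgm style:
    lowercase the string and pad it with two leading blanks and one
    trailing blank before slicing."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# 'Avatar' and the misspelling 'Avatre' share the trigrams
# '  a', ' av', 'ava', 'vat', so they score far above unrelated titles,
# which is why the % operator still finds the row.
```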


pages: 205 words: 47,169

PostgreSQL: Up and Running by Regina Obe, Leo Hsu


cloud computing, database schema, Debian, en.wikipedia.org, full text search, web application

You can usually get by with just this one alone if you don’t want to experiment with additional types. If PostgreSQL automatically creates an index for you or you don’t bother picking the type, B-tree will be chosen. It is currently the only index type allowed for primary key and unique indexes.

GiST

Generalized Search Tree (GiST) is an index type optimized for full-text search, spatial data, astronomical data, and hierarchical data. You can’t use it to enforce uniqueness; however, you can use it in exclusion constraints.

GIN

Generalized Inverted Index (GIN) is an index type commonly used for the built-in full-text search of PostgreSQL and the trigram extensions. GIN is a descendant of GiST, but it’s not lossy. GIN indexes are generally faster to search than GiST, but slower to update. You can see an example at Waiting for Faster LIKE/ILIKE.

SP-GiST

Space-Partitioning Trees Generalized Search Tree (SP-GiST) is an index type introduced in PostgreSQL 9.2.

Unlogged tables speed up queries against tables where logging is unnecessary. Triggers on views. In prior versions, to make views updatable you used DO INSTEAD rules, which only supported SQL for programming logic. Triggers can be written in most procedural languages (except SQL), which opens the door for more complex abstraction using views. KNN GiST brings improvements to popular extensions like full-text search, trigram (for fuzzy search and case-insensitive search), and PostGIS.

Database Drivers

If you are using or plan to use PostgreSQL, chances are that you’re not going to use it in a vacuum. To have it interact with other applications, you’re going to need database drivers. PostgreSQL enjoys a generous number of freely available database drivers that can be used in many programming languages.

Some past extensions have gained enough traction to become part of the PostgreSQL core, so if you’re upgrading from an ancient version, you may not even have to worry about extensions.

Old Extensions Absorbed into PostgreSQL

Prior to PostgreSQL 8.3, the following extensions weren’t part of core: PL/pgSQL wasn’t always installed by default in every database. In old versions, you had to run CREATE LANGUAGE plpgsql; in your database. From around 8.3 on, it’s installed by default, but you retain the option of uninstalling it. tsearch is a suite for supporting full-text searches by adding indexes, operators, custom dictionaries, and functions. It became part of PostgreSQL core in 8.3. You don’t have the option to uninstall it. If you’re still relying on old behavior, you can install the tsearch2 extension, which retains old functions that are no longer available in the newer version. A better approach would be just to update wherever you’re using the functions, because compatibility with the old tsearch could end at any time.


pages: 481 words: 121,669

The Invisible Web: Uncovering Information Sources Search Engines Can't See by Gary Price, Chris Sherman, Danny Sullivan


AltaVista, American Society of Civil Engineers: Report Card, bioinformatics, Brewster Kahle, business intelligence, dark matter, Douglas Engelbart, full text search, HyperCard, hypertext link, information retrieval, Internet Archive, joint-stock company, knowledge worker, natural language processing, pre–internet, profit motive, publish or perish, search engine result page, side project, Silicon Valley, speech recognition, stealth mode startup, Ted Nelson, Vannevar Bush, web application

Thoroughness and accuracy are absolutely critical to the patent searcher. Major business decisions involving significant expense or potential litigation often hinge on the details of a patent search, so using a general-purpose search engine for this type of search is effectively out of the question. Many government patent offices maintain Web sites, but Delphion’s Intellectual Property Network (http://www.delphion.com/) allows full-text searching of U.S. and European patents and abstracts of Japanese patents simultaneously. Additionally, the United States Patent Office (http://www.uspto.gov) provides patent information dating back to 1790, as well as U.S. Trademark data. 6. Out of Print Books. The growth of the Web has proved to be a boon for bibliophiles. Countless out of print booksellers have established Web sites, obliterating the geographical constraints that formerly limited their business to local customers.

These newsletters don’t limit themselves to the Invisible Web, but the news and information they provide is exceptionally useful for all serious Web searchers. All of these newsletters are free. The Scout Report http://scout.cs.wisc.edu/scout/report/current/ The Scout Report provides the closest thing to an “official” seal of approval for quality Web sites. Published weekly, it provides organized summaries of the most valuable and authoritative Web resources available. The Scout Report Signpost provides full-text search of nearly 6,000 of these summaries. The Scout Report staff is made up of a group of librarians and information professionals, and their standards for inclusion in the report are quite high. Librarians’ Index to the Internet (LII) http://www.lii.org This searchable, annotated directory of Web resources, maintained by Carole Leita and a volunteer team of more than 70 reference librarians, is organized into categories including “best of,” “directories,” “databases,” and “specific resources.”

The resulting page offers several searchable databases: the Patent Full-Text Database with Full-Page Images, the Patent Bibliographic and Abstract Database, an Expired Patent Search, and several others. Wally chooses the full-text database http://www.uspto.gov/patft/index.html over a bibliographic database that provides only limited information for each patent. Clicking the full-text database link brings up further options. After scanning the page, Wally notices a direct link that allows for full-text searching by patent number ( srchnum.htm). Wally quickly types in the number and in less than a second has a link to the full text of patent number 3541541. Wally’s job is complete and his boss is very impressed. This is a case where a general-purpose search engine failed to find the desired end result, but was indispensable in helping Wally locate the “front door” of the Invisible Web database that ultimately provided what he was looking for.


pages: 1,085 words: 219,144

Solr in Action by Trey Grainger, Timothy Potter


business intelligence, cloud computing, conceptual framework, crowdsourcing, data acquisition, en.wikipedia.org, failed state, fault tolerance, finite state, full text search, glass ceiling, information retrieval, natural language processing, performance metric, premature optimization, recommendation engine, web application

Even though I had no formal search background when I started writing Solr, it felt like a very natural fit, because I have always enjoyed making software “go fast.” I viewed Solr more as an alternate type of datastore designed around an inverted index than as a full-text search engine, and that has helped Solr extend beyond the legacy enterprise search market. By the end of 2005, Solr was powering the search and faceted navigation of a number of CNET sites, and soon it was made open source. Solr was contributed to the Apache Software Foundation in January 2006 and became a subproject of the Lucene PMC (with Lucene Java as its sibling). There had always been a large degree of overlap with Lucene (the core full-text search library used by Solr) committers, and in 2010 the projects were merged. Separate Lucene and Solr downloads would still be available, but they would be developed by a single unified team.

I could not have done this without their insightful questions about Solr and their giving me the opportunity to build a large-scale search solution using Solr. About this Book Whether handling big data, building cloud-based services, or developing multitenant web applications, it’s vital to have a fast, reliable search solution. Apache Solr is a scalable and ready-to-deploy open source full-text search engine powered by Lucene. It offers key features like multilingual keyword searching, faceted search, intelligent matching, content clustering, and relevancy weighting right out of the box. Solr in Action is the definitive guide to implementing fast and scalable search using Apache Solr. It uses well-documented examples ranging from basic keyword searching to scaling a system for billions of documents and queries.

Chapter 5 teaches how Solr indexes documents, starting with a discussion of another important configuration file: schema.xml. You’ll learn how to define fields to represent structured data like numbers, dates, prices, and unique identifiers. We also cover how update requests are processed and configured using solrconfig.xml. Chapter 6 builds on the material in chapter 5 by showing how to index text fields using text analysis. Solr was designed to efficiently search and rank documents requiring full-text search. Text analysis is an important part of the search process in that it removes the linguistic variations between indexed text and queries. At this point in the book, you’ll have a solid foundation and will be ready to put Solr to work on your own search needs. As your knowledge of search and Solr grows, so too will your need to go beyond basic keyword searching and implement common search features such as advanced query parsing, hit highlighting, spell-checking, autosuggest, faceting, and result grouping.


pages: 82 words: 17,229

Redis Cookbook by Tiago Macedo, Fred Oliveira


Debian, full text search, loose coupling, Silicon Valley

The possibilities are endless, and Redis’s pub/sub implementation makes it trivial to implement robust solutions for chat or notifications.

Implementing an Inverted-Index Text Search with Redis

Problem

An inverted index is an index data structure that stores mappings of words (or other content) to their locations in a file, document, database, etc. This is generally used to implement full-text search, but it requires prior indexing of the documents to be searched. In this recipe, we’ll use Redis as the storage backend for a full-text search implementation.

Solution

Our implementation will use one set per word, containing document IDs. In order to allow fast searches, we’ll index all the documents beforehand. Search itself is performed by splitting the query into words and intersecting the matching sets. This will return the IDs of the documents containing all the words we search for.[1]

Discussion

Indexing

Let’s say we have a hundred documents or web pages that we want to allow searches on.
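A minimal sketch of the recipe in plain Python: standard sets stand in for Redis sets, so the dict operations below correspond to SADD at index time and SINTER at query time in the real implementation. The sample documents are invented.

```python
from collections import defaultdict

# One set per word, containing document IDs
index = defaultdict(set)

def index_document(doc_id, text):
    """Index beforehand: add the document's ID to every word's set (SADD)."""
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Split the query into words and intersect the matching sets (SINTER)."""
    sets = [index[word] for word in query.lower().split()]
    return set.intersection(*sets) if sets else set()

index_document(1, "redis is an in memory data store")
index_document(2, "redis supports sets and sorted sets")
index_document(3, "postgres is a relational data store")
```

Only documents containing all the query words come back, which is exactly the set-intersection behavior the recipe describes.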


pages: 519 words: 102,669

Programming Collective Intelligence by Toby Segaran


correlation coefficient, Debian, en.wikipedia.org, Firefox, full text search, information retrieval, PageRank, prediction markets, recommendation engine, slashdot, web application

Multidimensional scaling in two dimensions is easy to print, but scaling can be done in any number of dimensions. Try changing the code to scale in one dimension (all the points on a line). Now try making it work for three dimensions. Chapter 4. Searching and Ranking This chapter covers full-text search engines, which allow people to search a large set of documents for a list of words, and which rank results according to how relevant the documents are to those words. Algorithms for full-text searches are among the most important collective intelligence algorithms, and many fortunes have been made by new ideas in this field. It is widely believed that Google's rapid rise from an academic project to the world's most popular search engine was based largely on the PageRank algorithm, a variation that you'll learn about in this chapter.
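The chapter goes on to build a PageRank variant; as a rough standalone sketch of the core idea (the function name and the tiny link graph here are invented, not the book's code), the power iteration looks like this:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank. links maps each page to the list of
    pages it links to; returns a score per page (scores sum to ~1.0)."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small teleport share...
        new = {p: (1.0 - damping) / n for p in pages}
        # ...and passes the rest of its rank along its outgoing links
        for page, outgoing in links.items():
            if outgoing:
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
            else:
                # A dead-end page spreads its rank evenly everywhere
                for target in pages:
                    new[target] += damping * ranks[page] / n
        ranks = new
    return ranks

# The page that everyone links to ends up with the highest rank:
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```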

The neural network will learn to associate searches with results based on what links people click on after they get a list of search results. The neural network will use this information to change the ordering of the results to better reflect what people have clicked on in the past. To work through the examples in this chapter, you'll need to create a Python module called searchengine, which has two classes: one for crawling and creating the database, and the other for doing full-text searches by querying the database. The examples will use SQLite, but they can easily be adapted to work with a traditional client-server database. To start, create a new file called searchengine.py and add the following crawler class and method signatures, which you'll be filling in throughout this chapter:

class crawler:
    # Initialize the crawler with the name of database
    def __init__(self,dbname):
        pass

    def __del__(self):
        pass

    def dbcommit(self):
        pass

    # Auxiliary function for getting an entry id and adding
    # it if it's not present
    def getentryid(self,table,field,value,createnew=True):
        return None

    # Index an individual page
    def addtoindex(self,url,soup):
        print 'Indexing %s' % url

    # Extract the text from an HTML page (no tags)
    def gettextonly(self,soup):
        return None

    # Separate the words by any non-whitespace character
    def separatewords(self,text):
        return None

    # Return true if this url is already indexed
    def isindexed(self,url):
        return False

    # Add a link between two pages
    def addlinkref(self,urlFrom,urlTo,linkText):
        pass

    # Starting with a list of pages, do a breadth
    # first search to the given depth, indexing pages
    # as we go
    def crawl(self,pages,depth=2):
        pass

    # Create the database tables
    def createindextables(self):
        pass

A Simple Crawler

I'll assume for now that you don't have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I'll show you how to build a simple crawler.

If you'd like to make sure that the crawl worked properly, you can try checking the entries for a word by querying the database:

>>> [row for row in crawler.con.execute(
...   'select rowid from wordlocation where wordid=1')]
[(1,), (46,), (330,), (232,), (406,), (271,), (192,),...

The list that is returned is the list of all the URL IDs containing "word," which means that you've successfully run a full-text search. This is a great start, but it will only work with one word at a time, and will just return the documents in the order in which they were loaded. The next section will show you how to expand this functionality by doing these searches with multiple words in the query.

Querying

You now have a working crawler and a big collection of documents indexed, and you're ready to set up the search part of the search engine.
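As a toy illustration of that next step, here is a sketch using Python's standard sqlite3 module. The rows are invented to mirror the chapter's wordlocation schema, not taken from it; the point is that a multi-word query is just a self-join, i.e., an intersection of URL ID sets.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table wordlocation(urlid integer, wordid integer)")
con.executemany("insert into wordlocation values (?, ?)",
                [(1, 1), (46, 1), (330, 1), (46, 2)])

# Single-word search: every URL containing word 1
one_word = sorted(r[0] for r in con.execute(
    "select urlid from wordlocation where wordid=1"))

# Two-word search: the self-join keeps only URLs containing both words
two_words = sorted(r[0] for r in con.execute(
    "select w1.urlid from wordlocation w1 "
    "join wordlocation w2 on w1.urlid = w2.urlid "
    "where w1.wordid = 1 and w2.wordid = 2"))
```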


pages: 480 words: 99,288

Mastering ElasticSearch by Rafal Kuc, Marek Rogozinski


Amazon Web Services, create, read, update, delete, en.wikipedia.org, fault tolerance, finite state, full text search, information retrieval

You should also know how to send queries to get the documents you are interested in, how to narrow down the results of your queries by using filtering, and how to calculate statistics for your data with the use of the faceting/aggregation mechanism. However, before getting to the exciting functionality that ElasticSearch offers, we think we should start with a quick tour of Apache Lucene, the full-text search library that ElasticSearch uses to build and search its indices, as well as the basic concepts that ElasticSearch is built on. In order to move forward and extend our learning, we need to ensure we don't forget the basics; that is easy to do. We also need to make sure that we understand Lucene correctly, as Mastering ElasticSearch requires this understanding. By the end of this chapter we will have covered:

- What Apache Lucene is
- What the overall Lucene architecture looks like
- How the analysis process is done
- What the Apache Lucene query language is and how to use it
- What the basic concepts of ElasticSearch are
- How ElasticSearch communicates internally

Introducing Apache Lucene

In order to fully understand how ElasticSearch works, especially when it comes to indexing and query processing, it is crucial to understand how the Apache Lucene library works.

Getting familiar with Lucene

You may wonder why the ElasticSearch creators decided to use Apache Lucene instead of developing their own functionality. We don't know for sure, because we were not the ones who made the decision, but we assume it was because Lucene is mature, highly performant, scalable, light, and yet very powerful. Its core comes as a single Java library file with no dependencies, and allows you to index documents and search them with its out-of-the-box full-text search capabilities. Of course, there are extensions to Apache Lucene that allow handling of different languages and enable spellchecking, highlighting, and much more; but if you don't need those features, you can download a single file and use it in your application.

Overall architecture

Although I would like to jump straight to the Apache Lucene architecture, there are some things we need to know first in order to fully understand it:

- Document: the main data carrier used during indexing and search, containing one or more fields, which hold the data we put into and get from Lucene
- Field: a section of the document, built of two parts: the name and the value
- Term: a unit of search representing a word from the text
- Token: an occurrence of a term in the text of the field
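To make the term/token distinction concrete, here is a rough illustration of a trivial analyzer (plain Python, not Lucene's API): it splits a field's text into tokens, each carrying a term plus its position and character offsets, roughly the information Lucene records per token.

```python
import re

def analyze(text):
    """A whitespace-and-lowercase analyzer sketch: every word in the field
    text becomes one token; the lowercased word is the term (the unit of
    search), and position/offsets say where that occurrence was."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "term": match.group().lower(),
            "position": position,
            "start": match.start(),
            "end": match.end(),
        })
    return tokens

tokens = analyze("Mastering ElasticSearch")
```

Two occurrences of the same word in a field would produce two tokens but only one distinct term, which is exactly the distinction the definitions above draw.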

By the end of this chapter we will have covered the following topics:

- How to use different scoring formulae and what they can bring
- How to use different posting formats and what they can bring
- How to handle Near Real Time searching and real-time GET, and what searcher reopening means
- Looking deeper into multilingual data handling
- Configuring the transaction log to our needs and seeing how it affects our deployments
- Segment merging, different merge policies, and merge scheduling

Altering Apache Lucene scoring

With the release of Apache Lucene 4.0 in 2012, all the users of this great full-text search library were given the opportunity to alter the default TF/IDF-based algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change made to Lucene when it comes to document score calculation. Lucene 4.0 shipped with additional similarity models, which basically allow us to use a different scoring formula for our documents.
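The intuition behind that TF/IDF default can be sketched in a few lines of plain Python. Lucene's actual classic similarity dampens these raw counts and adds field norms, boosts, and a coordination factor on top, so treat this as the underlying idea only; the tiny corpus is invented.

```python
import math

def tf_idf(term, doc, corpus):
    """Plain TF-IDF: term frequency in the document times the log of how
    rare the term is across the corpus. Rare terms weigh more."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["lucene", "scoring", "lucene"],
    ["lucene", "index"],
    ["postgres", "index"],
]
# "scoring" appears in one document, "lucene" in two, so a hit on
# "scoring" contributes more to a document's score.
```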


pages: 468 words: 233,091

Founders at Work: Stories of Startups' Early Days by Jessica Livingston


8-hour work day, affirmative action, AltaVista, Apple II, Brewster Kahle, business process, Byte Shop, Danny Hillis, don't be evil, fear of failure, financial independence, Firefox, full text search, game design, Googley, HyperCard, illegal immigration, Internet Archive, Jeff Bezos, Maui Hawaii, Menlo Park, nuclear winter, Paul Buchheit, Paul Graham, Peter Thiel, Richard Feynman, Richard Feynman, Sand Hill Road, side project, Silicon Valley, slashdot, social software, software patent, South of Market, San Francisco, Startup school, stealth mode startup, Steve Ballmer, Steve Jobs, Steve Wozniak, web application, Y Combinator

The expectation when they came to Yahoo was that they could find anything, but we didn’t necessarily deliver on that needle in the haystack expectation. So what we did was that we searched our directory first, we gave you those results, and then, if we didn’t find anything, we kicked you over to a full-text search. So, when I say we “rented” that technology, we essentially partnered with full-text search companies to be the falloff searches that we had. Livingston: That’s what you did with Google? Brady: Yes. Strategically, it was spot-on until Google showed up. Because we always thought it was going to be a leapfrogging game. No one is ever going to be able to get so far ahead that we’d ever be in strategic risk of kingmaking a full-text search engine, because you just can’t do that. Google ended up doing exactly that. At the time, until 2000/2001, we had Open Text first, then I think we had AltaVista, then Inktomi.

Brady: In the early days, not too much. Jerry and Dave were way ahead of the curve. The ideas that they had really early on were right strategically and creatively. So everything we did through the middle of ’97, invariably we were first and we did it very well. The one thing we didn’t do that all our competitors were spending a lot of time doing was search. They were crawling the Web and doing full text search, and our strategy was, “Look, that’s a technology game. We’re not a technology company, we’re a media company. Since there are so many of them out there, we’re always going to be able to rent it.” That was the thought back then, and until Google came along that strategy was perfect. Because, as things played out, that’s exactly what happened. We had this searchable directory. It was big, and it had all the popular sites, so you could search for anything on it.

See also angel investors Firefox, 395–404 Firefox 1.0, 395 Firefox toolbar, 226 FirePower Systems, 17–18 flagging, 251–252 flash card program, 52 Fletcher, Mark, 233–246 Flickr, 257–264 floppy disk drive, 52, 55 Fog Creek Software, 345–360 FogBugz, 348–350 Forum for Women Entrepreneurs, 264 Founders Award, 169 Fourier transform, 178 Frankston, Bob, 73–88, 90, 91 fraud, 6–11 fraud investigators, 8 fraud unit, 9 Fregin, Doug, 141–151 Fried, Jason, 309 Fry’s stores, 199 full-text search companies, 133 Fuzzy. See Mauldin, Michael (Fuzzy) Fylstra, Dan, 76, 83–84, 90 G Galbraith, David, 118 Game Neverending (Ludicorp), 257, 258 Gates, Bill, 292, 307 Gecko, 397 General Magic, 174, 189 General Motors, 141, 145–146 GeoURL, 223 Geschke, Charles, 281–296 GlaxoSmithKline, 106 Gmail, 161–172 Goldman, Phil, 178 Goodger, Ben, 397 Google, 27, 122–123, 161, 167–170 Google’s Founders Award, 168–169 Government Printing Office, 270 Graham, Paul, 205–222 Greenspun, Philip, 317–344 Groove Networks, 103–110 Gruner, Ron, 427–446 H hackers, 230 Hambrecht & Quist, 283–284, 429 Handler, Sheryl, 265 hard-disk drive, 196 hardware vs. software designers, 21 Harris 2200, 79 Harvard Business School, 75–76 Heinemeier Hansson, David, 309, 313 Hembrecht, Bill, 283–284, 429 Hendricks, John, 202 Hewitt, Joe, 395, 402 Hewlett-Packard (HP), 32, 191–192 Hewlett-Packard 3000 minicomputer, 42 Highland Capital, 419 Hillis, Danny, 265, 278 Hoffman, Reid, 261 Homebrew Computer Club, 33 Hong, James, 377–386 HOT or NOT, 377–386 Hotmail, 17–29, 135 Hourihan, Meg, 112, 119, 120 House of Representatives, 270 HP LaserJet printers, 289, 296 Huffman, Steve, 448 Human Resources, 391 Hummer Winblad Venture Partners, 297–298 Hyatt, Dave, 395, 397 I IBM, 89, 93–94, 289 IBM PC, 51, 94, 186 IDG, 65 IFC (Internet Foundation Class), 154 IGOR, 8–10 IM (Instant Messaging), 316 Imbach, Sarah, 9 InfoWorld, 65 instant messenger application, 259 intellectual property issues, 21 interactive pager, 149 InterActiveCorp.


pages: 291 words: 77,596

Total Recall: How the E-Memory Revolution Will Change Everything by C. Gordon Bell, Jim Gemmell


airport security, Albert Einstein, book scanning, cloud computing, conceptual framework, full text search, information retrieval, invention of writing, inventory management, Isaac Newton, Menlo Park, optical character recognition, pattern recognition, performance metric, RAND corporation, RFID, semantic web, Silicon Valley, Skype, social web, statistical model, Stephen Hawking, Steve Ballmer, Ted Nelson, telepresence, Turing test, Vannevar Bush, web application

A database is a program for storing and retrieving large collections of interrelated information. Modern databases let you very quickly retrieve all the records with a given attribute. You can rapidly sort, sift, and combine information in just about any way you can imagine. There was once a slight technical distinction to be made between how a database could index and look up records and full-text retrieval of documents, but by now databases have subsumed full-text search; they are happy to store documents and perform Google-like retrieval. In his memex paper, Bush had expressed hope that the search algorithms of the future would be better than simple index-lookup on some attribute like author or date. He held up the human brain’s associative memory as the ideal. In an associative network, items are linked together by contingency in time and space, by similarity, by context, and by usefulness.

MyCyberTwin is great, but we are still waiting for the first company that can take a heap of someone’s correspondence (e-mail, chats, letters, et cetera) and produce a really convincing impersonation. Any team that can take my corpus and turn it into my digitally immortal chatting self will get my support. And that’s not just vanity—if you can imitate me, you can imitate help-desk personnel and make a ton of money. START-UP #8—DOCUMENT MANAGEMENT It sounds great to declutter your life by scanning all your documents, but full-text search on a heap of files is not always the best way to retrieve information. This service (or program that you run) will automatically group similar items. It will build a knowledge base of every kind of document it can learn about, for example from all major utility and phone companies. It will be able to pull out the date, the total, and who the bill is from. It will create descriptive file names for all your documents and also create a human-readable XML file containing all the information it was able to extract.


Python Web Development With Django by Jeff Forcier


create, read, update, delete, database schema, Debian, en.wikipedia.org, Firefox, full text search, loose coupling, MVC pattern, revision control, Silicon Valley, slashdot, web application

MySQL lacks some advanced functionality present in Postgres, but is also a bit more common, partly due to its tight integration with the common Web language PHP. Unlike some database servers, MySQL has a couple of different internal database types that determine the effective feature set: one is MyISAM, which lacks transactional support and foreign keys but is capable of full-text searching, and another is InnoDB, which is newer and has a better feature set but currently lacks full-text search. There are others, but these two are the most commonly used. If you’re on Windows or your package manager doesn’t have a recent version of MySQL, its official Web site is http://www.mysql.com, which offers binaries for most platforms. Django’s preferred MySQL Python library is MySQLdb, whose official site is http://www.sourceforge.net/projects/mysql-python, and you need version 1.2.1p2 or newer.

For more, see the official Django documentation. More powerful search. Our search function is handy, but doesn’t offer the power that something as familiar as a Web search engine does; a multiword phrase, for example, should ideally be treated as a collection of independent search terms unless otherwise specified. The implementation here could be made more sophisticated, but if you are doing full-text searching over large numbers of records, you would probably benefit from something such as Sphinx, a search engine with available Django integration. For more, see withdjango.com. Status change notifications. We’ve already got a custom save method that handles our Markdown rendering. We could easily extend this to improve our workflow system by detecting when a story’s status has been changed and sending a notification e-mail to the person responsible for handling stories at that stage. A key piece of implementing this would be to replace our status field with a ForeignKey to a full-fledged Status model, which in addition to the numerical value and label fields implied by our STATUS_CHOICES list would have a status_owner field, a ForeignKey field to the User model.
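The "independent search terms" behavior described above can be sketched without any framework: filtering plain dicts here stands in for chaining one case-insensitive filter per term in the Django view (the story data is invented for illustration).

```python
def matches_all_terms(query, text):
    """True only if every whitespace-separated term in the query appears
    (case-insensitively) somewhere in the text."""
    return all(term in text.lower() for term in query.lower().split())

stories = [
    {"title": "Django search tips"},
    {"title": "Searching with Sphinx"},
    {"title": "Django deployment notes"},
]

def search(query):
    """Treat a multiword phrase as a collection of independent terms."""
    return [s["title"] for s in stories if matches_all_terms(query, s["title"])]
```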


pages: 1,266 words: 278,632

Backup & Recovery by W. Curtis Preston

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Berlin Wall, business intelligence, business process, database schema, Debian, dumpster diving, failed state, fault tolerance, full text search, job automation, side project, Silicon Valley, web application

Metadata might also contain the project the item is attached to or some other logical grouping. An email archive system would include who sent and received an email, the subject of the email, and all other appropriate metadata. Finally, an archive system may import the full text of the item into its database, allowing for full-text searches against the archive. This can be useful, especially if multiple formats can be supported. It’s particularly expedient to be able to do a full-text search against all emails, Word documents, PDF files, etc. Another important feature of archive systems is their ability to store a predetermined number of copies of an archived item. A company can then decide how many copies it wishes to keep. For example, if a firm is storing its archives on a RAID-protected system, it may choose to have one copy on disk and another on a removable medium such as optical or tape.

The most common use of backups as archives is for the retrieval of reference data. The assumption is that if someone asks for widget ABC’s parts (or some other piece of reference data), the appropriate files can just be restored from the system where they used to reside. The first problem with that scenario is remembering where the files were several years ago. While backup products and even some backup devices are starting to offer full-text search against all your backups, the problems in the following paragraph still exist. Even if you can remember where the files belong, the number of operating systems or application versions that have come and gone in the intervening time can stymie the effort. To restore files that were backed up from “Apollo” five years ago, the first requirement is a system named Apollo. Someone also has to handle any authentication issues between the backup server and the new Apollo because it isn’t the same Apollo it backed up from five years ago.

This feature breaks decades of backup tradition by giving you another way to access your backups. For too long our files and databases have been put into backup formats that required the backup software to extract them. We’ve lived with this for so long that it’s actually quite hard to imagine the possibilities that this brings to the table. Here’s a short list to help you wrap your brain around this one: You can point a full-text search appliance directly at your backups and search the full text of all files ever backed up. If you’re running multiple backup products, users and administrators can use a single method of recovery. Imagine how easy end-user recoveries would be if you could just point them at a mount point such as \\backupserver\yourclientname\date. If the disk device allows you to mount the backup read/write, you could actually use the backup as the production filesystem if the production filesystem were down.


Django Book by Matt Behrens

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

create, read, update, delete, database schema, distributed revision control, en.wikipedia.org, Firefox, full text search, loose coupling, MVC pattern, revision control, school choice, slashdot, web application

The icontains is a lookup type (as explained in Chapter 5 and Appendix B), and the statement can be roughly translated as “Get the books whose title contains q, without being case-sensitive.” This is a very simple way to do a book search. We wouldn’t recommend using a simple icontains query on a large production database, as it can be slow. (In the real world, you’d want to use a custom search system of some sort. Search the Web for open-source full-text search to get an idea of the possibilities.) We pass books, a list of Book objects, to the template. The template code for search_results.html might include something like this: <p>You searched for: <strong>{{ query }}</strong></p> {% if books %} <p>Found {{ books|length }} book{{ books|pluralize }}.</p> <ul> {% for book in books %} <li>{{ book.title }}</li> {% endfor %} </ul> {% else %} <p>No books matched your search criteria.

year, month, and day For date/datetime fields, perform exact year, month, or day matches: # Return all entries published in 2005 >>> Entry.objects.filter(pub_date__year=2005) # Return all entries published in December >>> Entry.objects.filter(pub_date__month=12) # Return all entries published on the 3rd of the month >>> Entry.objects.filter(pub_date__day=3) # Combination: return all entries on Christmas of any year >>> Entry.objects.filter(pub_date__month=12, pub_date__day=25) isnull Takes either True or False, which correspond to SQL queries of IS NULL and IS NOT NULL, respectively: >>> Entry.objects.filter(pub_date__isnull=True) search A Boolean full-text search that takes advantage of full-text indexing. This is like contains but is significantly faster due to full-text indexing. Note this is available only in MySQL and requires direct manipulation of the database to add the full-text index. The pk Lookup Shortcut For convenience, Django provides a pk lookup type, which stands for “primary_key”. In the example Blog model, the primary key is the id field, so these three statements are equivalent: >>> Blog.objects.get(id__exact=14) # Explicit form >>> Blog.objects.get(id=14) # __exact is implied >>> Blog.objects.get(pk=14) # pk implies id__exact The use of pk isn’t limited to __exact queries – any query term can be combined with pk to perform a query on the primary key of a model: # Get blog entries with id 1, 4, and 7 >>> Blog.objects.filter(pk__in=[1,4,7]) # Get all blog entries with id > 14 >>> Blog.objects.filter(pk__gt=14) pk lookups also work across joins.


pages: 593 words: 118,995

Relevant Search: With Examples Using Elasticsearch and Solr by Doug Turnbull, John Berryman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

crowdsourcing, domain-specific language, finite state, fudge factor, full text search, information retrieval, natural language processing, premature optimization, recommendation engine, sentiment analysis

And although tokens are typically generated text, as you’ll see in chapter 4, analysis can be applied and tokens generated for nontext values such as floating-point numbers and geographic locations. In chapter 1, we mentioned the notion of features. In machine learning, features are descriptors for the items being classified. Features used to classify fruit may be things such as color, flavor, and shape. With full-text search, the tokens produced during analysis are the dominant features used to match a user’s query with documents in the index. Don’t worry if this seems vague right now; the greater portion of this book is dedicated to making these ideas clear. After analysis is complete, the documents are indexed; the tokens from the analysis step are stored into search engine data structures for document retrieval.
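The analysis step described above can be illustrated with a toy analyzer (this is a simplified sketch, not Elasticsearch's or Solr's actual analysis chain): lowercase the text, then split it into tokens on non-alphanumeric characters. The tokens that come out are the "features" used to match queries against documents.

```python
import re

def analyze(text):
    # Toy analysis chain: lowercase, then split on runs of
    # non-alphanumeric characters, dropping empty tokens.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

tokens = analyze("Full-Text Search, explained!")
print(tokens)  # ['full', 'text', 'search', 'explained']
```

A query analyzed the same way ("full text") produces tokens that overlap with the document's tokens, which is what makes the document retrievable.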

simple signals, 2nd Solr additive, with Boolean queries boosting feature mappings multiplicative, with function queries user ratings vs. filtering breadcrumb navigation browse experience browse interface, Yowl buckets section building signals bulk index API bulkMovies string business and domain awareness business concerns group business weight business-ranking logic BusinessScore C cast.name field, 2nd, 3rd, 4th, 5th cast.name scores cast.name.bigrammed field, 2nd, 3rd, 4th character filtering, 2nd, 3rd character offsets classic similarity classification features cleaning click-through rate co-occurrence counting cold-start problem COLLAB_FILTER filter, 2nd collaboration filtering, using co-occurrence counting search relevance and collation collocation extraction combining fields committed documents common words, removing completion field, 2nd completion suggester completion_analyzer completion_prefix variable complexphrase query parser compound queries, 2nd, 3rd concept search basic methods for building augmenting content with synonyms concept signals building using machine learning personalized search and configurations conflate tokens constant_score query, 2nd content augmentation curation engineer/curator pairing risk of miscommunication with content curator role of content curator exploring extracting into documents providing to search engine searching content group content weight, 2nd ContentScore control analysis controlling field matching converge conversion rate coord (coordinating factor), 2nd, 3rd, 4th, 5th, 6th, 7th copyField, 2nd, 3rd copy_to option, 2nd cosine similarity cross_fields, 2nd, 3rd searching, 2nd, 3rd, 4th Solr solving signal discordance with cuisine field cuisine_hifi field, 2nd cuisine_lofi field curation, search relevance and custom all field custom score query D data-driven culture debugging example search application Elasticsearch first searches with The Movie Database Python matching query matching analysis to solve matching issues 
comparing query to inverted index fixing by changing analyzers query parsing underlying strategy ranking computing weight explain feature scoring matches to measure relevance search term importance similarity vector-space model decay functions, 2nd deep paging default analyzer defType parameter delimiters acronyms modeling specificity phone numbers synonyms tokenizing geographic data tokenizing integers tokenizing melodies deployment, relevance-focused search application description field, 2nd, 3rd, 4th, 5th descriptive query directors field directors.name field, 2nd, 3rd, 4th directors.name score directors.name.bigrammed, 2nd, 3rd disable_coord option disabling tokenization discriminating fields DisjunctionMaximumQuery dismax, 2nd doc frequency, 2nd doc values document search and retrieval aggregations Boolean search facets filtering Lucene-based search positional and phrase matching ranked results relevance sorting document-ranking system documents analysis enhancement enrichment extraction flattening nested grouping similar matching meaning of scored search completion from documents being searched tokens as features of matching process meaning of documents dot character dot product, 2nd down-boosting title DSL (domain-specific language) E e-commerce search, 2nd easy_install utility edismax query parser, 2nd Elasticsearch example search application overview end sentinels engaged field engaged restaurants English analyzer overview reindexing with english_* filters english_bigrams analyzer english_keywords filter english_possessive_stemmer filter english_stemmer filter english_stop filter enrichment, 2nd ETL (extract, transform, load), 2nd every field gets a vote exact matching, 2nd, 3rd, 4th expert search, 2nd, 3rd explanation field external sources extract function, 2nd, 3rd extracting features extraction F faceted browsing overview Solr facet.prefix option facets, 2nd, 3rd fail fast, 2nd, 3rd, 4th fast vector highlighter feature modeling, 2nd feature selection 
feature space features creation of overview, 2nd, 3rd feedback at search box search completion search suggestions search-as-you-type business and domain awareness content curation risk of miscommunication with content curator role of content curator in search results listing grouping similar documents information presented snippet highlighting when there are no results search relevance and Solr faceted browsing field collapsing match phrase prefix relevance feedback feature mappings suggestion and highlighting components while browsing alternative results ordering breadcrumb navigation faceted browsing field boosts field collapsing overview Solr field discordance field mappings field normalization field scores, 2nd field synchronicity, signal modeling and field-by-field dismax field-centric methods, 2nd field-centric search, combining term-centric search and combining greedy search and conservative amplifiers like fields precision vs. recall Solr fieldNorms, 2nd, 3rd fields fieldType field_value_factor function fieldWeight, 2nd, 3rd, 4th filter clause filter element filter queries filtering Amazon-style collaborative overview using co-occurrence counting score shaping vs. boosting finite state transducer fire token first_name field floating-point numbers fragment_size parameter fudge factors full bulk command full search string full-text search full_name field function decay function queries, multiplicative boosting with Boolean queries vs. 
combining high-value tiers scored with simple Solr function_score query, 2nd, 3rd, 4th G garbage features Gaussian decay generalizing matches generate_word_parts genres aggregation genres.name field geographic data, tokenizing geohashing geolocation, 2nd getCastAndCrew function GitHub repository granular fields grouping fields H has_discount field high-quality signals highlighted snippets highlights HTMLStripCharFilter HTTP commands, 2nd I ideal document IDF (inverse document frequency) ignoring when ranking overview, 2nd inconsistent scoring index-time analysis, 2nd index-time personalization indexing documents information and requirements gathering business needs required and available information users and information needs information retrieval, creating relevance solutions through inner objects innermost calculation integers, tokenizing inventory-related files inventory_dir configuration inverse document frequency.


pages: 485 words: 74,211

Developing Web Applications with Haskell and Yesod by Michael Snoyman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

create, read, update, delete, database schema, Debian, domain-specific language, full text search, MVC pattern, web application

Persistent: Raw SQL The Persistent package provides a type-safe interface to data stores. It tries to be backend-agnostic, such as not relying on relational features of SQL. My experience has been that you can easily perform 95% of what you need to do with the high-level interface. (In fact, most of my web apps use the high-level interface exclusively.) But occasionally you’ll want to use a feature that’s specific to a backend. One feature I’ve used in the past is full-text search. In this case, we’ll use the SQL “LIKE” operator, which is not modeled in Persistent. We’ll get all people with the last name “Snoyman” and print the records out. Note Actually, you can express a LIKE operator directly in the normal syntax due to a feature added in Persistent 0.6, which allows backend-specific operators. But this is still a good example, so let’s roll with it.

{-# LANGUAGE OverloadedStrings, TemplateHaskell, QuasiQuotes, TypeFamilies #-}
{-# LANGUAGE GeneralizedNewtypeDeriving, GADTs, FlexibleContexts #-}
import Database.Persist.Sqlite (withSqliteConn)
import Database.Persist.TH (mkPersist, persist, share, mkMigrate, sqlSettings)
import Database.Persist.GenericSql (runSqlConn, runMigration, SqlPersist)
import Database.Persist.GenericSql.Raw (withStmt)
import Data.Text (Text)
import Database.Persist
import Database.Persist.Store (PersistValue)
import Control.Monad.IO.Class (liftIO)
import qualified Data.Conduit as C
import qualified Data.Conduit.List as CL

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persist|
Person
    name Text
|]

main :: IO ()
main = withSqliteConn ":memory:" $ runSqlConn $ do
    runMigration migrateAll
    insert $ Person "Michael Snoyman"
    insert $ Person "Miriam Snoyman"
    insert $ Person "Eliezer Snoyman"
    insert $ Person "Gavriella Snoyman"
    insert $ Person "Greg Weber"
    insert $ Person "Rick Richardson"
    -- Persistent does not provide the LIKE keyword, but we'd like to get the
    -- whole Snoyman family...
    let sql = "SELECT name FROM Person WHERE name LIKE '%Snoyman'"
    C.runResourceT $ withStmt sql [] C.$$ CL.mapM_ $ liftIO . print

There is also higher-level support that allows for automated data marshaling.
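For comparison, the same raw LIKE query can be sketched with Python's built-in sqlite3 module (the table and names mirror the Haskell example; this illustration is not part of Persistent):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (name TEXT)")
people = ["Michael Snoyman", "Miriam Snoyman", "Eliezer Snoyman",
          "Gavriella Snoyman", "Greg Weber", "Rick Richardson"]
conn.executemany("INSERT INTO Person (name) VALUES (?)", [(p,) for p in people])

# LIKE '%Snoyman' matches any name ending in "Snoyman"
rows = conn.execute(
    "SELECT name FROM Person WHERE name LIKE '%Snoyman'"
).fetchall()
print([r[0] for r in rows])
```

As in the Haskell version, only the four Snoymans come back; the two other names fail the pattern.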


pages: 420 words: 79,867

Developing Backbone.js Applications by Addy Osmani

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Airbnb, anti-pattern, create, read, update, delete, database schema, Firefox, full text search, Google Chrome, Khan Academy, loose coupling, MVC pattern, node package manager, pull request, side project, single page application, web application

For clientPager these include:

Collection.goTo(n, options) - go to a specific page
Collection.prevPage(options) - go to the previous page
Collection.nextPage(options) - go to the next page
Collection.howManyPer(n) - set how many items to display per page
Collection.setSort(sortBy, sortDirection) - update sort on the current view. Sorting will automatically detect if you’re trying to sort numbers (even if they’re stored as strings) and will do the right thing.
Collection.setFilter(filterFields, filterWords) - filter the current view. Filtering supports multiple words without any specific order, so you’ll basically get full-text search ability. Also, you can pass it only one field from the model, or you can pass an array with fields and all of them will get filtered. The last option is to pass it an object containing a comparison method and rules. Currently, only the levenshtein method is available.

The goTo(), prevPage(), and nextPage() functions do not require the options param since they will be executed synchronously. However, when specified, the success callback will be invoked before the function returns.


pages: 420 words: 61,808

Flask Web Development: Developing Web Applications With Python by Miguel Grinberg

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

database schema, Firefox, full text search, Minecraft, platform as a service, web application

Following is a short list of some additional packages that are worth exploring:

Flask-Babel: Internationalization and localization support
Flask-RESTful: Tools for building RESTful APIs
Celery: Task queue for processing background jobs
Frozen-Flask: Conversion of a Flask application to a static website
Flask-DebugToolbar: In-browser debugging tools
Flask-Assets: Merging, minifying, and compiling of CSS and JavaScript assets
Flask-OAuth: Authentication against OAuth providers
Flask-OpenID: Authentication against OpenID providers
Flask-WhooshAlchemy: Full-text search for Flask-SQLAlchemy models based on Whoosh
Flask-KVsession: Alternative implementation of user sessions that use server-side storage

If the functionality that you need for your project is not covered by any of the extensions and packages mentioned in this book, then your first destination to look for additional extensions should be the official Flask Extension Registry. Other good places to search are the Python Package Index, GitHub, and BitBucket.


pages: 263 words: 75,610

Delete: The Virtue of Forgetting in the Digital Age by Viktor Mayer-Schönberger

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

en.wikipedia.org, Erik Brynjolfsson, Firefox, full text search, George Akerlof, information retrieval, information trail, Internet Archive, invention of movable type, invention of the printing press, moveable type in China, Network effects, packet switching, pattern recognition, RFID, slashdot, Steve Jobs, Steven Levy, The Market for Lemons, The Structural Transformation of the Public Sphere, Vannevar Bush

In the United States in the 1970s, Lexis and Westlaw, for example, made available to their customers huge databases with the full text of tens of thousands of court decisions, but these could only be retrieved using a limited set of keys. Customers, however, wanted to find relevant decisions by searching for words in the text of the decision, not just the case name, docket number, date, and a few subject words that had been indexed. The solution was to make searchable every word of every document in the database. Such full-text searches still require input of the precise words or terms, and so are no surefire recipe for finding the desired information, but they are eminently easier and more powerful than a search that is restricted to a small number of predefined search keys. At first, full-text indexing and searches were used by large providers of information databases, but by the beginning of the twenty-first century it had become a standard feature of all major PC operating systems, bringing the power of pinpoint information retrieval to people’s desktops.
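"Make searchable every word of every document" is exactly what an inverted index does. A minimal sketch (documents and ids are invented for illustration): map each word to the set of documents that contain it, so any word, not just a predefined key, becomes a search term.

```python
from collections import defaultdict

def build_index(docs):
    # Map every word to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "the court affirmed the ruling",
    2: "the appeal was denied",
    3: "ruling on appeal",
}
index = build_index(docs)
print(sorted(index["ruling"]))  # [1, 3]
print(sorted(index["appeal"]))  # [2, 3]
```

Looking up a word is then a dictionary access rather than a scan of every document, which is what made full-text search over tens of thousands of decisions practical.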


pages: 281 words: 95,852

The Googlization of Everything: by Siva Vaidhyanathan

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

1960s counterculture, AltaVista, barriers to entry, Berlin Wall, borderless world, Burning Man, Cass Sunstein, choice architecture, cloud computing, computer age, corporate social responsibility, correlation does not imply causation, data acquisition, death of newspapers, don't be evil, Firefox, Francis Fukuyama: the end of history, full text search, global village, Google Earth, Howard Rheingold, informal economy, information retrieval, Joseph Schumpeter, Kevin Kelly, knowledge worker, libertarian paternalism, market fundamentalism, Marshall McLuhan, means of production, Mikhail Gorbachev, Naomi Klein, Network effects, new economy, Nicholas Carr, PageRank, pirate software, Ray Kurzweil, Richard Thaler, Ronald Reagan, side project, Silicon Valley, Silicon Valley ideology, single-payer health, Skype, social web, Steven Levy, Stewart Brand, technoutopianism, The Nature of the Firm, The Structural Transformation of the Public Sphere, Thorstein Veblen, urban decay, web application

At that dinner, Tim said “I know this doesn’t have anything to do with the matter at hand, but out of curiosity, how many people here use Google?” Every hand went up. From library consultant Karen Coyle: I was chatting with the brother of one of the Google founders. He told me that his brother was working on a new search engine that would be better than anything ever seen before. I tried to argue that it would still be limited by the reality of the full-text search. I probably looked at Google when it was first made available, and I was pretty unimpressed. Just more keyword searching. Today I use it constantly, but I’m very aware of the fact that it works quite well for nouns and proper nouns (people, companies, named things), and less well for concepts. . . . I think of it as a giant phone book for the Internet, not as a classification of knowledge.


pages: 360 words: 96,275

PostgreSQL 9 Admin Cookbook: Over 80 Recipes to Help You Run an Efficient PostgreSQL 9. 0 Database by Simon Riggs, Hannu Krosing

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

business intelligence, business process, database schema, Debian, en.wikipedia.org, full text search, Skype

In this case, you should also have lots of postgres processes in status D. Reducing the number of rows returned Although often the problem is producing many rows in the first place, it is made worse by returning all the unneeded rows to the client. This is especially true if the client and server are not on the same host. Here are some ways to reduce the traffic between the client and server. A full-text search returns 10,000 documents, but only the first 20 are displayed to the user In this case, order the documents by ranking on the server, and return only the top 20 actually displayed: SELECT title, ts_rank_cd(body_tsv, query, 20) AS text_rank FROM articles, plainto_tsquery('spicy potatoes') AS query WHERE body_tsv @@ query ORDER BY text_rank DESC LIMIT 20; If you need the next 20, don't just query with LIMIT 40 and throw away the first 20; use "OFFSET 20 LIMIT 20" to return just the next 20.
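The rank-then-page pattern above can be sketched outside SQL as well (the result records and helper name are invented for illustration): sort by rank on the server side, then slice out only the page being displayed, which is what ORDER BY ... LIMIT/OFFSET does for you.

```python
def top_ranked(results, page=0, per_page=20):
    # Sort by rank descending, then return only the requested page,
    # mirroring ORDER BY text_rank DESC LIMIT per_page OFFSET page*per_page.
    ordered = sorted(results, key=lambda r: r["rank"], reverse=True)
    start = page * per_page
    return ordered[start:start + per_page]

results = [{"title": f"doc{i}", "rank": i} for i in range(50)]
first = top_ranked(results, page=0)   # ranks 49 down to 30
second = top_ranked(results, page=1)  # ranks 29 down to 10
print(first[0]["rank"], second[0]["rank"])  # 49 29
```

The point of doing this in the database rather than in application code is that only 20 rows ever cross the wire, not all 10,000.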


pages: 353 words: 104,146

European Founders at Work by Pedro Gairifo Santos

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

business intelligence, cloud computing, crowdsourcing, fear of failure, full text search, information retrieval, inventory management, iterative process, Jeff Bezos, Lean Startup, Mark Zuckerberg, natural language processing, pattern recognition, pre–internet, recommendation engine, Richard Stallman, Silicon Valley, Skype, slashdot, Steve Jobs, Steve Wozniak, subscription business, technology bubble, web application, Y Combinator

Really try to break those barriers, break those borders, and take inspiration from things around you and be curious, but apply it to problems in a smart, differentiated, useful way. Ilya Segalovich Yandex Ilya Segalovich is co-founder of Yandex, the leading search engine in Russian-speaking countries. The roots of Yandex trace back to a company called Arkadia, which in the early 1990s developed software featuring full-text search supporting the Russian language. In 1993, Segalovich and Arkady Volozh came up with the word “Yandex” to describe their search technologies. The web site, Yandex.ru, was launched in 1997 and in 2000 Yandex was incorporated as a standalone company. In May 2011, Yandex raised $1.3 billion in an initial public offering on NASDAQ. It was the biggest IPO for a dot-com since Google went public in 2004.


pages: 349 words: 114,038

Culture & Empire: Digital Revolution by Pieter Hintjens

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

4chan, airport security, anti-communist, anti-pattern, barriers to entry, Bill Duvall, bitcoin, blockchain, business climate, business intelligence, business process, Chelsea Manning, clean water, congestion charging, Corn Laws, correlation does not imply causation, cryptocurrency, Debian, Edward Snowden, failed state, financial independence, Firefox, full text search, German hyperinflation, global village, GnuPG, Google Chrome, greed is good, Hernando de Soto, hiring and firing, informal economy, invisible hand, James Watt: steam engine, Jeff Rulifson, Julian Assange, Kickstarter, M-Pesa, mutually assured destruction, Naomi Klein, national security letter, new economy, New Urbanism, Occupy movement, offshore financial centre, packet switching, patent troll, peak oil, pre–internet, private military company, race to the bottom, rent-seeking, reserve currency, RFC: Request For Comment, Richard Feynman, Richard Feynman, Richard Stallman, Satoshi Nakamoto, security theater, Skype, slashdot, software patent, spectrum auction, Steve Crocker, Steve Jobs, Steven Pinker, Stuxnet, The Wealth of Nations by Adam Smith, The Wisdom of Crowds, trade route, transaction costs, union organizing, web application, WikiLeaks, Y2K, zero day, Zipf's Law

Also in 1998, Google was founded, and soon their revolutionary concept of "it works the way you expect" made them King of the Search Engines. Once upon a time, the list of all websites was twenty pages long. I still have a book that has the entire World Wide Web printed as an appendix. Then the list got too long to print and sites like Yahoo! organized them into categories. Then the category list got too large to keep updated, and Lycos invented the full-text search. This was too slow, so Digital Equipment Corporation built a natty search engine called Altavista to show how to do it properly. The results for any search got too long, so Google invented the ranked search, which pretty much fixed the search issue. Google also threw all the clutter off the main page. Less is more. The dot-com boom bubbled in 1999, driven by the dream of cheap access to millions -- no, billions -- of consumers.


pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack by Matthew A. Russell

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Climategate, cloud computing, crowdsourcing, en.wikipedia.org, fault tolerance, Firefox, full text search, Georg Cantor, Google Earth, information retrieval, Mark Zuckerberg, natural language processing, NP-complete, profit motive, Saturday Night Live, semantic web, Silicon Valley, slashdot, social graph, social web, statistical model, Steve Jobs, supply-chain management, text mining, traveling salesman, Turing test, web application

Example 5-17 is a trivial adaptation of Example 5-4 that illustrates a routine emitting a simple JSON structure (a list of [term, URL, frequency] tuples) that can be fed into an HTML template for WP-Cumulus. We’ll pass in empty strings for the URL portion of those tuples, but you could use your imagination and hyperlink to a simple web service that displays a list of tweets containing the entities. (Recall that Example 5-7 provides just about everything you’d need to wire this up by using couchdb-lucene to perform a full-text search on tweets stored in CouchDB.) Another option might be to write a web service and link to a URL that provides any tweet containing the specified entity. Example 5-17. Generating the data for an interactive tag cloud using WP-Cumulus (the_tweet__tweet_tagcloud_code.py) # -*- coding: utf-8 -*- import os import sys import webbrowser import json from cgi import escape from math import log import couchdb from couchdb.design import ViewDefinition DB = sys.argv[1] MIN_FREQUENCY = int(sys.argv[2]) HTML_TEMPLATE = '..


pages: 470 words: 109,589

Apache Solr 3 Enterprise Search Server by Unknown

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

bioinformatics, continuous integration, database schema, en.wikipedia.org, fault tolerance, Firefox, full text search, information retrieval, Internet Archive, natural language processing, performance metric, platform as a service, web application

Which Rails/Ruby library should I use? The two most common high-level libraries for interacting with Solr are acts_as_solr and Sunspot. However, in the last couple of years, Sunspot has become the more popular choice, and comes in a version designed to work explicitly with Rails called sunspot_rails that allows Rails ActiveRecord database objects to be transparently backed by a Solr index for full-text search. For a lower-level client interface to Solr from Ruby environments, there are two libraries duking it out to be the client of choice: solr-ruby, a client library developed by the Apache Solr project, and rsolr, which is a reimplementation of a Ruby-centric client library. Both of these solutions are solid and act as great low-level API libraries. However, rsolr has gained more attention, has better documentation, and some nice features such as a direct embedded Solr connection through JRuby. rsolr also has support for using curb (Ruby bindings to curl, a very fast HTTP library) instead of the standard Net::HTTP library for the HTTP transport layer.


pages: 561 words: 120,899

The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant From Two Centuries of Controversy by Sharon Bertsch McGrayne

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

bioinformatics, British Empire, Claude Shannon: information theory, Daniel Kahneman / Amos Tversky, double helix, Edmond Halley, Fellow of the Royal Society, full text search, Henri Poincaré, Isaac Newton, John Nash: game theory, John von Neumann, linear programming, meta analysis, meta-analysis, Nate Silver, p-value, placebo effect, prediction markets, RAND corporation, recommendation engine, Renaissance Technologies, Richard Feynman, Richard Feynman, Richard Feynman: Challenger O-ring, Ronald Reagan, speech recognition, statistical model, stochastic process, Thomas Kuhn: the structure of scientific revolutions, traveling salesman, Turing machine, Turing test, uranium enrichment, Yom Kippur War

According to Google’s research director, Peter Norvig, “There must have been dozens of times when a project started with naïve Bayes, just because it was easy to do and we expected to replace it with something more sophisticated later, but in the end the vast amount of data meant that a more complex technique was not needed.” Google also uses Bayesian techniques to classify spam and pornography and to find related words, phrases, and documents. A very large Bayesian network finds synonyms of words and phrases. Instead of downloading dictionaries for a spelling checker, Google conducted a full-text search of the entire Internet looking for all the different ways words can be spelled. The result was a flexible system that could recognize that “shaorn” should have been “Sharon” and correct the typo. While Bayes has helped revolutionize modern life on the web, it is also helping to finesse the Tower of Babel that has separated linguistic communities for millennia. During the Second World War, Warren Weaver of the Rockefeller Foundation was impressed with how “a multiplicity of languages impedes cultural interchange between the peoples of the earth and is a serious deterrent to international understanding.”6 Struck by the power of mechanized cryptography and by Claude Shannon’s new information theory, Weaver suggested that computerized statistical methods could treat translation as a cryptography problem.
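The spelling-correction idea described above ("shaorn" should have been "Sharon") can be sketched with a tiny word-frequency model in the style popularized by Peter Norvig. The vocabulary counts here are made up; in the scenario above they would come from a full-text crawl of the Web.

```python
def edits1(word):
    # All strings one deletion, transposition, replacement, or insertion away.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# Hypothetical counts standing in for frequencies harvested from a crawl.
counts = {"sharon": 120, "shorn": 15, "sharp": 40}

def correct(word):
    # Prefer the most frequent known word within one edit of the input.
    candidates = [w for w in edits1(word) if w in counts] or [word]
    return max(candidates, key=lambda w: counts.get(w, 0))

print(correct("shaorn"))  # "sharon"
```

Both "sharon" (a transposition away) and "shorn" (a deletion away) are candidates for "shaorn"; the frequency data, not a dictionary, is what picks the right one.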


Programming Erlang (Pragmatic Bookshelf, Jul. 2007) by Unknown

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Debian, en.wikipedia.org, fault tolerance, finite state, full text search, RFC: Request For Comment, sorting algorithm

We talk in general terms about shared memory and message passing concurrency and why we strongly believe that languages with no mutable state and concurrency are ideally suited to programming multicore computers.

• Chapter 20, Programming Multicore CPUs, on page 367 is about programming multicore computers. We talk about the techniques for ensuring that an Erlang program will run efficiently on multicore computers. We introduce a number of abstractions for speeding up sequential programs on multicore computers. Finally, we perform some measurements and develop our third major program, a full-text search engine. To write this, we first implement a function called mapreduce—this is a higher-order function for parallelizing a computation over a set of processing elements.
• Appendix A, on page 390, describes the type system used to document Erlang functions.
• Appendix B, on page 396, describes how to set up Erlang on the Windows operating system (and how to configure emacs on all operating systems).
• Appendix C, on page 399, has a catalog of Erlang resources.
• Appendix D, on page 403, describes lib_chan, which is a library for programming socket-based distribution.
• Appendix E, on page 419, looks at techniques for analyzing, profiling, debugging, and tracing your code.
• Appendix F, on page 439, has one-line summaries of the most used modules in the Erlang standard libraries.

1.2 Begin Again

Once upon a time a programmer came across a book describing a funny programming language.
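The mapreduce higher-order function mentioned in the chapter list above can be sketched as follows. The book's version is Erlang and process-based; this Python analogue is illustrative only: a map function is applied to every input in parallel, the emitted (key, value) pairs are grouped by key, and a reduce function folds each group.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def mapreduce(map_fn, reduce_fn, inputs, workers=4):
    """Apply map_fn to inputs in parallel, group emitted pairs, reduce per key."""
    groups = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pairs in pool.map(map_fn, inputs):
            for key, value in pairs:
                groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the classic demonstration of the pattern:
docs = ["to be or not to be", "to search is to find"]
result = mapreduce(
    lambda doc: [(word, 1) for word in doc.split()],
    lambda word, ones: sum(ones),
    docs,
)
print(result["to"])  # 4
```

The same skeleton underlies the book's full-text search engine: the map phase emits (term, document) pairs and the reduce phase assembles the postings for each term.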


pages: 933 words: 205,691

Hadoop: The Definitive Guide by Tom White

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Amazon Web Services, bioinformatics, business intelligence, combinatorial explosion, database schema, Debian, domain-specific language, en.wikipedia.org, fault tolerance, full text search, Grace Hopper, information retrieval, Internet Archive, linked data, loose coupling, openstreetmap, recommendation engine, RFID, SETI@home, social graph, web application

Parsed page data
Page content is then parsed using a suitable parser—Nutch provides parsers for documents in many popular formats, such as HTML, PDF, Open Office and Microsoft Office, RSS, and others.

Link graph database
This database is necessary to compute link-based page ranking scores, such as PageRank. For each URL known to Nutch, it contains a list of other URLs pointing to it, and their associated anchor text (from HTML <a href="..">anchor text</a> elements). This database is called LinkDb.

Full-text search index
This is a classical inverted index, built from the collected page metadata and from the extracted plain-text content. It is implemented using the excellent Lucene library.

We briefly mentioned before that Hadoop began its life as a component in Nutch, intended to improve its scalability and to address clear performance bottlenecks caused by a centralized data processing model. Nutch was also the first public proof-of-concept application ported to the framework that would later become Hadoop, and the effort required to port Nutch algorithms and data structures to Hadoop proved to be surprisingly small.
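The inverted index at the heart of the full-text search component can be sketched in a few lines. This toy version (not Nutch or Lucene code) only shows the core data structure: for each term, the set of document IDs containing it, with multi-term queries answered by intersecting postings.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Documents containing all query terms (AND semantics)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Hadoop began its life as a component in Nutch",
    2: "Lucene implements a full text search index",
    3: "Nutch uses Lucene for its search index",
}
index = build_index(docs)
print(sorted(search(index, "lucene", "index")))  # [2, 3]
```

A production index like Lucene's adds tokenization, stemming, compressed postings lists, and relevance scoring on top of this basic shape.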


pages: 602 words: 207,965

Practical Ext JS Projects With Gears by Frank Zammetti

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Albert Einstein, create, read, update, delete, database schema, en.wikipedia.org, Firefox, full text search, Gordon Gekko, loose coupling, Ronald Reagan, web application

The Create Index and Drop Index functions will first require that you figure out how to retrieve the list of indexes for a table and then present a Window to enter the index details in the case of Create Index, or a list of existing indexes to choose from in the case of Drop Index. None of this is especially hard, and it would make for a good exercise (hint: getting the list of indexes is a slight modification to the query that retrieves the list of tables in a database).

• The SQLite engine Gears uses has a full-text search capability, and it would be nice if there were a Text Search tool, similar to the Query tool, where that could be used.
• Provide the ability to add a new record from the Browse tab of the Table Details Window, as well as the ability to duplicate, edit, or delete the selected record.
• Allow more than 20 fields to be added in the Create Table Window. You can implement this however you choose; one way would be to dynamically add a new row to the form any time you detect that all existing rows have been populated.
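SQLite's full-text search capability works through virtual tables queried with MATCH. The book targets Gears from JavaScript; as a rough sketch of the same idea, here is the equivalent in Python's sqlite3 module, assuming the bundled SQLite was compiled with the FTS5 extension (true of most modern builds):

```python
import sqlite3

# In-memory database; fts5 builds an inverted index over the listed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [("groceries", "buy milk and eggs"),
     ("todo", "ship the full text search tool")],
)
# MATCH runs a full-text query against the index rather than a LIKE scan.
rows = conn.execute(
    "SELECT title FROM notes WHERE notes MATCH 'search'"
).fetchall()
print(rows)  # [('todo',)]
```

A Text Search tool like the one proposed above would essentially wrap queries of this shape behind a form, the way the Query tool wraps arbitrary SQL.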


pages: 348 words: 39,850

Data Scientists at Work by Sebastian Gutierrez

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Albert Einstein, algorithmic trading, bioinformatics, bitcoin, business intelligence, chief data officer, clean water, cloud computing, computer vision, continuous integration, correlation does not imply causation, crowdsourcing, data is the new oil, DevOps, domain-specific language, follow your passion, full text search, informal economy, information retrieval, Infrastructure as a Service, inventory management, iterative process, linked data, Mark Zuckerberg, microbiome, Moneyball by Michael Lewis explains big data, move fast and break things, natural language processing, Network effects, nuclear winter, optical character recognition, pattern recognition, Paul Graham, personalized medicine, Peter Thiel, pre–internet, quantitative hedge fund, quantitative trading / quantitative finance, recommendation engine, Renaissance Technologies, Richard Feynman, self-driving car, side project, Silicon Valley, Skype, software as a service, speech recognition, statistical model, Steve Jobs, stochastic process, technology bubble, text mining, the scientific method, web application

The first was when I was working with a digital library and realized we could dramatically improve document tagging by algorithmically recycling author-supplied labels. While authors tagged articles with keywords and phrases, the tagging was sparse and inconsistent. As a result, tag-based article retrieval offered high precision but low recall. Unfortunately, the alternative of performing full-text search on the tags provided unacceptably low precision. So we developed a system to bootstrap on author-supplied tags, improving tagging across the collection. The result was an order-of-magnitude increase in recall without sacrificing precision. The second was using entropy calculations on language models to automatically detect events in a news archive. We started by performing entity extraction on the archive to detect named entities and key phrases.
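The precision/recall trade-off described above can be made concrete with a toy computation (the numbers here are invented for illustration, not from the interview): tag lookup returns few results, nearly all relevant, while loose text matching over sparse tags returns many results, most of them irrelevant.

```python
def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved that are relevant;
    recall = fraction of relevant that were retrieved."""
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)

relevant = set(range(100))                 # articles truly on-topic
tag_hits = set(range(20))                  # tag lookup: 20 results, all relevant
text_hits = set(range(80)) | set(range(100, 300))  # text match: 280 results, 80 relevant

print(precision_recall(tag_hits, relevant))    # (1.0, 0.2)
p, r = precision_recall(text_hits, relevant)
print(round(p, 2), r)                          # 0.29 0.8
```

Bootstrapping better tags from the author-supplied ones amounts to pushing the second case's recall toward the left column's precision, rather than accepting either extreme.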