8 results

pages: 163 words: 42,402

Machine Learning for Email by Drew Conway, John Myles White


call centre, correlation does not imply causation, Debian, natural language processing, Netflix Prize, pattern recognition, recommendation engine, SpamAssassin, text mining

Does a picture uploaded to a social network contain a face in it or not? Was “The Tempest” written by William Shakespeare or Francis Bacon? In this chapter, we’re going to focus on problems with text classification that are closely related to the tools you could use to answer the last question in our list. In our exercise, however, we’re going to build a system for deciding whether an email is spam or ham. Our raw data are The SpamAssassin Public Corpus, available for free download at: Portions of this corpus are included in the code/data/ folder for this chapter and will be used throughout this chapter. At the unprocessed stage, the features are simply the contents of the raw email as plain text. This raw text provides us with our first problem. We need to transform our raw text data into a set of features that describe qualitative concepts in a quantitative way.

Note

Just as learning the words of a new language builds up an intuition for what could realistically be a word, learning about the features people have used in the past builds up an intuition for what features could reasonably be helpful in the future.

When working with text, the most important type of feature that’s been used historically is word count. If we think that the text of HTML tags is a strong indicator of whether an email is spam, then we might pick terms like “html” and “table” and count how often they occur in one type of document versus the other. To show how this approach would work with the SpamAssassin Public Corpus, we’ve gone ahead and counted the number of times the terms “html” and “table” occurred: Table 3-1 shows the results.

Table 3-1. Frequency of “spammy” words

Term     Spam     Ham
html     377      9
table    1,182    43

Figure 3-2. Frequency of terms “html” and “table” by email type

For every email in our data set, we’ve also plotted the class memberships (Figure 3-2). This plot isn’t actually very informative, because too many of the data points in our data set overlap.
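The word-count feature described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `term_counts`, the crude regular-expression tokenizer, and the two toy messages are all hypothetical, and the numbers it produces have nothing to do with the corpus counts in Table 3-1.

```python
import re
from collections import Counter

def term_counts(documents, terms):
    """Count how often each chosen term occurs across a list of document strings."""
    counts = Counter({term: 0 for term in terms})
    for text in documents:
        # Lowercase and keep runs of letters only; a deliberately crude tokenizer.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in counts:
                counts[token] += 1
    return dict(counts)

# Toy stand-ins for real messages (hypothetical data, not the corpus).
spam = ["<html><table><tr><td>Buy now</td></tr></table></html>"]
ham = ["Meeting moved to 3pm, see the table in the attached doc."]

spam_counts = term_counts(spam, ["html", "table"])
ham_counts = term_counts(ham, ["html", "table"])
print(spam_counts, ham_counts)
```

Comparing the two dictionaries term by term gives the kind of spam-versus-ham frequency table shown above.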

Because we will also make use of base rate information about emails being spam, the model will also be called a Bayes model, in homage to the 18th-century mathematician who first described conditional probabilities. Taken together, these two traits make our model a Naive Bayes classifier.

Writing Our First Bayesian Spam Classifier

As we mentioned earlier in this chapter, we will be using the SpamAssassin Public Corpus to both train and test our classifier. These data consist of labelled emails from three categories: “spam,” “easy ham,” and “hard ham.” As you’d expect, hard ham is more difficult to distinguish from spam than the easy stuff. For instance, hard ham messages often include HTML tags. Recall that one way we mentioned to easily identify spam was by these tags. To more accurately classify hard ham, we will have to include information from many more text features.
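To make the role of the base rate concrete, here is a minimal sketch of Bayes' rule for two classes. The function name `posterior_spam` and all of the probabilities are made up for illustration; they are not estimates from the corpus.

```python
def posterior_spam(p_features_given_spam, p_features_given_ham, p_spam):
    """Bayes' rule for two classes: P(spam | features)."""
    p_ham = 1.0 - p_spam
    spam_score = p_features_given_spam * p_spam  # likelihood times base rate
    ham_score = p_features_given_ham * p_ham
    return spam_score / (spam_score + ham_score)

# Same (hypothetical) likelihoods, two different base rates for spam.
even_prior = posterior_spam(0.3, 0.1, p_spam=0.5)
low_prior = posterior_spam(0.3, 0.1, p_spam=0.2)
print(even_prior, low_prior)
```

With identical evidence, lowering the assumed share of spam lowers the posterior, which is why the base rate matters alongside the per-feature probabilities.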


pages: 371 words: 78,103

Webbots, Spiders, and Screen Scrapers by Michael Schrenk


Amazon Web Services, corporate governance, fault tolerance, Firefox, new economy, pre–internet, SpamAssassin, Turing test, web application

+OK 2398 octets Return-Path: <> Delivered-To: Received: (qmail 73301 invoked from network); 19 Feb 2006 20:55:31 −0000 Received: from by (qmail-ldap-1.03) with compressed QMQP; 19 Feb 2006 20:55:31 −0000 Delivered-To: CLUSTERHOST Received: (qmail 50923 invoked from network); 19 Feb 2006 20:55:31 −0000 Received: by simscan 1.1.0 ppid: 50907, pid: 50912, t: 2.8647s scanners: attach: 1.1.0 clamav: 0.86.1/m:34/d:1107 spam: 3.0.4 Received: from (envelope-sender <>) by (qmail-ldap-1.03) with SMTP for <>; 19 Feb 2006 20:55:28 −0000 Received: (qmail 7734 invoked by uid 60001); 19 Feb 2006 20:55:26 −0000 Message-ID: <> Date: Sun, 19 Feb 2006 12:55:26 −0800 (PST) From: mike schrenk <> Subject: Hey, Can you read this email? To: mike schrenk <> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581" Content-Transfer-Encoding: 8bit X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on X-Spam-Level: X-Spam-Status: No, score=0.9 required=17.0 tests=HTML_00_10,HTML_MESSAGE, HTML_SHORT_LENGTH autolearn=no version=3.0.4 --0-349883719-1140382526=:7581 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit This is an email sent from my Yahoo! email account. --0-349883719-1140382526=:7581 Content-Type: text/html; charset=iso-8859-1 Content-Transfer-Encoding: 8bit This is an email sent from my Yahoo!

Listing 15-6: A raw email message read from the server using the RETR POP3 command

As you can see, even a short email message has a lot of overhead. Most of the returned information has little to do with the actual text of a message. For example, the email message retrieved in Listing 15-6 doesn't appear until over halfway down the listing. The rest of the text returned by the mail server consists of headers, which tell the mail client the path the message took, which services touched it (like SpamAssassin), how to display or handle the message, to whom to send replies, and so forth. These headers include some familiar information such as the subject header, the to and from values, and the MIME version. You can easily parse this information with the return_between() function found in the LIB_parse library (see Chapter 4), as shown in Listing 15-7.

    $ret_path   = return_between($raw_message, "Return-Path: ", "\n", EXCL);
    $deliver_to = return_between($raw_message, "Delivered-To: ", "\n", EXCL);
    $date       = return_between($raw_message, "Date: ", "\n", EXCL);
    $from       = return_between($raw_message, "From: ", "\n", EXCL);
    $subject    = return_between($raw_message, "Subject: ", "\n", EXCL);

Listing 15-7: Parsing header values

The header values in Listing 15-7 are delimited by their names and a \n (newline) character.
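For readers who want to try the same idea outside PHP, the following Python sketch mimics the exclusive form of the delimiter-based extraction described above. The Python `return_between` here is a hypothetical analogue written for this example, not the LIB_parse function, and its behavior when a delimiter is missing (returning an empty string) is an assumption. The sample message and its address are invented. For comparison, the last lines use the standard library's `email.message_from_string`, which parses headers properly.

```python
import email

def return_between(text, start, stop):
    """Return the text between the first `start` and the next `stop`, exclusive.

    Hypothetical analogue of the PHP helper; returns "" if a delimiter is missing.
    """
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    end = text.find(stop, begin)
    if end == -1:
        return ""
    return text[begin:end]

# Invented sample message, loosely shaped like the listing above.
raw_message = (
    "Return-Path: <sender@example.com>\n"
    "Date: Sun, 19 Feb 2006 12:55:26 -0800 (PST)\n"
    "Subject: Hey, Can you read this email?\n"
    "\n"
    "This is an email body.\n"
)

subject = return_between(raw_message, "Subject: ", "\n")
date = return_between(raw_message, "Date: ", "\n")

# The standard library parser gives the same subject without string searching.
parsed = email.message_from_string(raw_message)
print(subject, "|", date, "|", parsed["Subject"])
```

Plain substring search is simple, but it matches the first occurrence anywhere in the message, so a name such as "To: " would also match inside "Delivered-To: ". A real header parser avoids that ambiguity.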


pages: 241 words: 43,073

Puppet 3 Beginner's Guide by John Arundel


cloud computing, Debian, job automation, job satisfaction, Lao Tzu, Network effects, SpamAssassin

He's been a consultant in the past, but he's now an employee for a provincial government agency for which he manages the infrastructure (servers, workstations, network, security, virtualization, SAN/NAS, PBX). He's a big fan of open-source software and its underlying philosophy. He's worked with Debian, Ubuntu, and SUSE, but what he knows best is RHEL-based distributions. He's known for his contributions to the MailScanner project (he has been a technical reviewer for the MailScanner book), but he also gave time to different open-source projects, such as mondorescue, OTRS, SpamAssassin, pfSense, and a few others. I thank my lover, Lysanne, who accepted allowing me some free time slots for this review even with a 2-year-old and a 6-month-old to take care of. The presence of these 3 human beings in my life is simply invaluable. I must also thank my friend Sébastien, whose generosity is only matched by his knowledge and kindness. I would never have reached that high in my career if it wasn't for him.


pages: 183 words: 49,460

Start Small, Stay Small: A Developer's Guide to Launching a Startup by Rob Walling


8-hour work day, inventory management, Lean Startup, Network effects, Paul Graham, rolodex, side project, Silicon Valley, software as a service, SpamAssassin, Superbowl ad, web application

I’ve offered an email course on Becoming a Programmer that’s been wildly successful in building a relationship with an audience with whom I would otherwise have had a single contact point and then been out of their minds forever.

Nuts and Bolts

Hopefully by now, you’re on board with the idea of building your mailing list. So what’s next? The first step is to get email list management software in place: you need a form on your website to collect emails, somewhere to store them, the ability to add unsubscribe links to the bottom of your emails, run SpamAssassin on outgoing emails to ensure they won’t get caught in spam filters, send both HTML and text emails, track clicks and other metrics, send sequential emails when someone signs up, guard against being blacklisted by people clicking “spam” in their mail client…and on and on… Needless to say, you won’t be building a list management system yourself, nor will you be hosting it yourself. This is the #1 mistake I’ve seen software entrepreneurs make with regard to list management: trying to host the software themselves.


pages: 579 words: 76,657

Data Science from Scratch: First Principles with Python by Joel Grus


correlation does not imply causation, natural language processing, Netflix Prize, p-value, Paul Graham, recommendation engine, SpamAssassin, statistical model

Our function will return a list of triplets containing each word, the probability of seeing that word in a spam message, and the probability of seeing that word in a nonspam message:

    import math  # used by spam_probability below

    def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
        """turn the word_counts into a list of triplets
        w, p(w | spam) and p(w | ~spam)"""
        return [(w,
                 (spam + k) / (total_spams + 2 * k),
                 (non_spam + k) / (total_non_spams + 2 * k))
                for w, (spam, non_spam) in counts.items()]

The last piece is to use these word probabilities (and our Naive Bayes assumptions) to assign probabilities to messages:

    def spam_probability(word_probs, message):
        # tokenize() is defined earlier in the chapter (not shown in this excerpt)
        message_words = tokenize(message)
        log_prob_if_spam = log_prob_if_not_spam = 0.0

        # iterate through each word in our vocabulary
        for word, prob_if_spam, prob_if_not_spam in word_probs:

            # if *word* appears in the message,
            # add the log probability of seeing it
            if word in message_words:
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_not_spam += math.log(prob_if_not_spam)

            # if *word* doesn't appear in the message
            # add the log probability of _not_ seeing it
            # which is log(1 - probability of seeing it)
            else:
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)

        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_not_spam = math.exp(log_prob_if_not_spam)
        return prob_if_spam / (prob_if_spam + prob_if_not_spam)

We can put this all together into our Naive Bayes Classifier:

    class NaiveBayesClassifier:

        def __init__(self, k=0.5):
            self.k = k
            self.word_probs = []

        def train(self, training_set):

            # count spam and non-spam messages
            num_spams = len([is_spam
                             for message, is_spam in training_set
                             if is_spam])
            num_non_spams = len(training_set) - num_spams

            # run training data through our "pipeline"
            # count_words() is defined earlier in the chapter (not shown in this excerpt)
            word_counts = count_words(training_set)
            self.word_probs = word_probabilities(word_counts,
                                                 num_spams,
                                                 num_non_spams,
                                                 self.k)

        def classify(self, message):
            return spam_probability(self.word_probs, message)

Testing Our Model

A good (if somewhat old) data set is the SpamAssassin public corpus. We’ll look at the files prefixed with 20021010. (On Windows, you might need a program like 7-Zip to decompress and extract them.) After extracting the data (to, say, C:\spam) you should have three folders: spam, easy_ham, and hard_ham. Each folder contains many emails, each contained in a single file. To keep things really simple, we’ll just look at the subject lines of each email.
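One possible way to collect the subject lines from the three extracted folders is sketched below. The helper names `subject_of` and `load_subjects`, the latin-1 decoding, and the rule that only the folder named spam is labelled as spam are assumptions made for this sketch rather than the book's own loading code.

```python
import os

def subject_of(path):
    """Return the text of the first Subject header in an email file, or None."""
    # latin-1 never fails to decode, which is convenient for old, messy mail.
    with open(path, "r", encoding="latin-1") as f:
        for line in f:
            if line.startswith("Subject:"):
                return line[len("Subject:"):].strip()
            if line.strip() == "":
                break  # a blank line ends the headers; stop before the body
    return None

def load_subjects(root, folders=("spam", "easy_ham", "hard_ham")):
    """Return (subject, is_spam) pairs, using the folder name as the label."""
    data = []
    for folder in folders:
        folder_path = os.path.join(root, folder)
        if not os.path.isdir(folder_path):
            continue  # tolerate a missing folder
        for name in sorted(os.listdir(folder_path)):
            subject = subject_of(os.path.join(folder_path, name))
            if subject is not None:
                data.append((subject, folder == "spam"))
    return data
```

Calling `load_subjects` on the extraction directory would yield a list of labelled subject lines in the shape that a `train` method taking (message, is_spam) pairs expects.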


pages: 377 words: 110,427

The Boy Who Could Change the World: The Writings of Aaron Swartz by Aaron Swartz, Lawrence Lessig


affirmative action, Alfred Russel Wallace, Benjamin Mako Hill, bitcoin, Bonfire of the Vanities, Brewster Kahle, Cass Sunstein, deliberate practice, Donald Trump, failed state, fear of failure, Firefox, full employment, Howard Zinn, index card, invisible hand, John Gruber, Lean Startup, More Guns, Less Crime, post scarcity, Richard Feynman, Richard Stallman, Ronald Reagan, school vouchers, semantic web, single-payer health, SpamAssassin, SPARQL, telemarketer, The Bell Curve by Richard Herrnstein and Charles Murray, the scientific method, Toyota Production System, unbiased observer, wage slave, Washington Consensus, web application, WikiLeaks, working poor

All the content here is plain old web pages, served up by Apache. Tinderbox uses a similar system, drawing from your database of notes to produce a bunch of static pages. My book collection pages are done this way. Radio UserLand statically generates the pages on your local computer and then “upstreams” them to your website. Finally, while researching Webmake, the Perl CMS that generates pages like Jmason’s Weblog and SpamAssassin, I found a good bit of terminology for this. Some websites, the documentation explains, are fried up for the user every time. But others are baked once and served up again and again. Why bake your pages instead of frying? Well, as you might guess, it’s healthier, but at the expense of not tasting quite as good. Baked pages are easy to serve. You can almost always switch servers and software and they’ll still work.


pages: 525 words: 149,886

Higher-Order Perl: A Guide to Program Transformation by Mark Jason Dominus


Defenestration of Prague, Isaac Newton, P = NP, Paul Graham, slashdot, SpamAssassin

The problem occurs because the input is being read character-by-character; when the buffer contains “a\n\n”, the terminator pattern succeeds, and the record is split, even though more reading would have generated a longer match. The same bug causes a more serious problem in a different example. Suppose we’re reading an email header and we’d like the iterator to generate logical fields instead of physical lines. Suppose the email header is as follows: Delivered-To: Received: from localhost [] by with SpamAssassin (2.55; Mon, 11 Aug 2003 16:22:12 -0400 From: "Doris Bower" <> To: Subject: LoseWeight Now with Pphentermine,Aadipex,Bontriil,PrescribedOnline,shipped to Your Door fltynzlfoybv kie There are five fields here; the second one, with the Received tag, consists of three physical lines. Lines that begin with whitespace are continuations of the previous line.
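The continuation rule in the last sentence can be illustrated with a short Python sketch. It is not the iterator-based Perl approach the excerpt discusses: it assumes the whole header is already in memory, which sidesteps the character-by-character buffering problem entirely. The function name `logical_fields` and the sample header (including its addresses and host name) are invented for the example.

```python
def logical_fields(header_text):
    """Join continuation lines (those starting with whitespace) onto the previous field."""
    fields = []
    for line in header_text.splitlines():
        if line[:1] in (" ", "\t") and fields:
            # Continuation: fold it into the previous logical field.
            fields[-1] += " " + line.strip()
        elif line:
            # A non-empty line that starts in column one begins a new field.
            fields.append(line)
    return fields

# Invented header: the Received field spans three physical lines.
header = (
    "Delivered-To: someone@example.com\n"
    "Received: from localhost\n"
    "\tby mail.example.com with SpamAssassin;\n"
    "\tMon, 11 Aug 2003 16:22:12 -0400\n"
    "Subject: hello\n"
)

fields = logical_fields(header)
for field in fields:
    print(field)
```

Because the decision about a line depends on whether the following line starts with whitespace, a streaming reader has to look one line ahead before it can emit a field, which is the difficulty the text describes.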


pages: 678 words: 159,840

The Debian Administrator's Handbook, Debian Wheezy From Discovery to Mastery by Raphaël Hertzog, Roland Mas


bash_history, Debian, distributed generation, failed state, Firefox, GnuPG, Google Chrome, Jono Bacon, NP-complete, QWERTY keyboard, RFC: Request For Comment, Richard Stallman, Skype, SpamAssassin, Valgrind, web application, x509 certificate, zero day, Zimmermann PGP

A milter (short for mail filter) is a filtering program specially designed to interface with email servers. A milter uses a standard application programming interface (API) that provides much better performance than filters external to the email servers. Milters were initially introduced by Sendmail, but Postfix soon followed suit.

QUICK LOOK A milter for SpamAssassin

The spamass-milter package provides a milter based on SpamAssassin, the famous unsolicited email detector. It can be used to flag messages as probable spam (by adding an extra header) and/or to reject the messages altogether if their “spamminess” score goes beyond a given threshold.

Once the clamav-milter package is installed, the milter should be reconfigured to run on a TCP port rather than on the default named socket. This can be achieved with dpkg-reconfigure clamav-milter.