SpamAssassin

14 results


pages: 163 words: 42,402

Machine Learning for Email by Drew Conway, John Myles White

call centre, correlation does not imply causation, Debian, natural language processing, Netflix Prize, pattern recognition, recommendation engine, SpamAssassin, text mining

Sender: spamassassin-devel-admin@example.sourceforge.net Errors-To: spamassassin-devel-admin@example.sourceforge.net X-Beenthere: spamassassin-devel@example.sourceforge.net X-Mailman-Version: 2.0.9-sf.net Precedence: bulk List-Help: <mailto:spamassassin-devel-request@example.sourceforge.net?subject=help> List-Post: <mailto:spamassassin-devel@example.sourceforge.net> List-Subscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=subscribe> List-Id: SpamAssassin Developers <spamassassin-devel.example.sourceforge.net> List-Unsubscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=unsubscribe> List-Archive: <http://www.geocrawler.com/redir-sf.php3?list=spamassassin-devel> X-Original-Date: Thu, 22 Aug 2002 16:19:48 +0200 Date: Thu, 22 Aug 2002 16:19:48 +0200 Hello, have you seen and discussed this article and his approach?

Received: from usw-sf-list1-b.sourceforge.net ([10.3.1.13] helo=usw-sf-list1.sourceforge.net) by usw-sf-list2.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsof-00042r-00; Thu, 22 Aug 2002 07:20:05 -0700 Received: from vivi.uptime.at ([62.116.87.11] helo=mail.uptime.at) by usw-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsoM-0000Ge-00 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 07:19:47 -0700 Received: from [192.168.0.4] (chello062178142216.4.14.vie.surfer.at [62.178.142.216]) (authenticated bits=0) by mail.uptime.at (8.12.5/8.12.5) with ESMTP id g7MEI7Vp022036 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 16:18:07 +0200 From: David H=?ISO-8859-1?B?9g==?=hn <dh@uptime.at> To: <spamassassin-devel@example.sourceforge.net> Message-Id: <B98ABFA4.1F87%dh@uptime.at> MIME-Version: 1.0 X-Trusted: YES X-From-Laptop: YES Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailscanner: Nothing found, baby Subject: [SAdev] Interesting approach to Spam handling.. Sender: spamassassin-devel-admin@example.sourceforge.net Errors-To: spamassassin-devel-admin@example.sourceforge.net X-Beenthere: spamassassin-devel@example.sourceforge.net X-Mailman-Version: 2.0.9-sf.net Precedence: bulk List-Help: <mailto:spamassassin-devel-request@example.sourceforge.net?

Thank you http://www.paulgraham.com/spam.html -- “Hell, there are no rules here-- we’re trying to accomplish something.” -- Thomas Alva Edison ------------------------------------------------------- This sf.net email is sponsored by: OSDN - Tired of that same old cell phone? Get a new here for FREE! https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390 _______________________________________________ Spamassassin-devel mailing list Spamassassin-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/spamassassin-devel Since we are focusing on only the email message body, we need to extract this text from the message files. If you explore some of the message files contained in this exercise, you will notice that the email message always begins after the first full line break in the email file. In Example 3-1, we see that the sentence, “Hello, have you seen and discussed this article and his approach?”
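The rule described above (the body starts after the first blank line) can be sketched in a few lines. This is a minimal Python illustration, not the book's own code; the function name `extract_body` and the shortened sample message are assumptions made for the example.

```python
def extract_body(raw_message: str) -> str:
    """Return everything after the first blank line of a raw email."""
    # Normalize CRLF line endings so the blank-line search works for both styles.
    text = raw_message.replace("\r\n", "\n")
    # partition() splits at the first occurrence only, which is the
    # header/body boundary; later blank lines stay inside the body.
    _header, separator, body = text.partition("\n\n")
    # If no blank line was found, treat the message as having no body.
    return body if separator else ""


# Hypothetical, shortened message based on the example above.
sample = (
    "From: sender@example.com\n"
    "Subject: [SAdev] Interesting approach to Spam handling..\n"
    "\n"
    "Hello, have you seen and discussed this article and his approach?\n"
)

print(extract_body(sample))
```

Splitting only at the first blank line matters because message bodies routinely contain blank lines of their own.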


pages: 451 words: 103,606

Machine Learning for Hackers by Drew Conway, John Myles White

call centre, centre right, correlation does not imply causation, Debian, Erdős number, Nate Silver, natural language processing, Netflix Prize, p-value, pattern recognition, Paul Erdős, recommendation engine, social graph, SpamAssassin, statistical model, text mining, the scientific method, traveling salesman

Sender: spamassassin-devel-admin@example.sourceforge.net Errors-To: spamassassin-devel-admin@example.sourceforge.net X-Beenthere: spamassassin-devel@example.sourceforge.net X-Mailman-Version: 2.0.9-sf.net Precedence: bulk List-Help: <mailto:spamassassin-devel-request@example.sourceforge.net?subject=help> List-Post: <mailto:spamassassin-devel@example.sourceforge.net> List-Subscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=subscribe> List-Id: SpamAssassin Developers <spamassassin-devel.example.sourceforge.net> List-Unsubscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=unsubscribe> List-Archive: <http://www.geocrawler.com/redir-sf.php3?list=spamassassin-devel> X-Original-Date: Thu, 22 Aug 2002 16:19:48 +0200 Date: Thu, 22 Aug 2002 16:19:48 +0200 Hello, have you seen and discussed this article and his approach?

Received: from usw-sf-list1-b.sourceforge.net ([10.3.1.13] helo=usw-sf-list1.sourceforge.net) by usw-sf-list2.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsof-00042r-00; Thu, 22 Aug 2002 07:20:05 -0700 Received: from vivi.uptime.at ([62.116.87.11] helo=mail.uptime.at) by usw-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsoM-0000Ge-00 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 07:19:47 -0700 Received: from [192.168.0.4] (chello062178142216.4.14.vie.surfer.at [62.178.142.216]) (authenticated bits=0) by mail.uptime.at (8.12.5/8.12.5) with ESMTP id g7MEI7Vp022036 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 16:18:07 +0200 From: David H=?ISO-8859-1?B?9g==?=hn <dh@uptime.at> To: <spamassassin-devel@example.sourceforge.net> Message-Id: <B98ABFA4.1F87%dh@uptime.at> MIME-Version: 1.0 X-Trusted: YES X-From-Laptop: YES Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailscanner: Nothing found, baby Subject: [SAdev] Interesting approach to Spam handling.. Sender: spamassassin-devel-admin@example.sourceforge.net Errors-To: spamassassin-devel-admin@example.sourceforge.net X-Beenthere: spamassassin-devel@example.sourceforge.net X-Mailman-Version: 2.0.9-sf.net Precedence: bulk List-Help: <mailto:spamassassin-devel-request@example.sourceforge.net?

Thank you http://www.paulgraham.com/spam.html -- “Hell, there are no rules here-- we’re trying to accomplish something.” -- Thomas Alva Edison ------------------------------------------------------- This sf.net email is sponsored by: OSDN - Tired of that same old cell phone? Get a new here for FREE! https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390 _______________________________________________ Spamassassin-devel mailing list Spamassassin-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/spamassassin-devel Note The “null line” separating the header from the body of an email is part of the protocol definition. For reference, see RFC822: http://tools.ietf.org/html/rfc822. As is always the case, the first thing to do is to load in the libraries we will use for this exercise. For text classification, we will be using the tm package, which stands for text mining.


Ubuntu 15.04 Server with systemd: Administration and Reference by Richard Petersen

Amazon Web Services, bash_history, cloud computing, Debian, Firefox, Mark Shuttleworth, MITM: man-in-the-middle, RFC: Request For Comment, SpamAssassin, web application

SpamAssassin rule files are located at /usr/share/spamassassin. The files contain rules for running tests such as detecting a fake HELO in the header. Configuration files for SpamAssassin are located at /etc/spamassassin. The local.cf file lists system-wide SpamAssassin options such as how to rewrite headers. The init.pre file holds spam system configurations. Server options, such as enabling SpamAssassin, are listed in the /etc/default/spamassassin file. Users can set their own SpamAssassin options in their .spamassassin/user_prefs file. Common options include required_score, which sets a threshold for classifying a message as spam; numerous whitelist and blacklist options that accept and reject messages from certain users and domains; and tagging options that either rewrite headers or just add spam labels. Check the Mail::SpamAssassin::Conf man page for details.

The Courier-IMAP server (http://courier-mta.org) is a small, fast IMAP server that provides extensive authentication support including LDAP and PAM (Universe repository). Spam: SpamAssassin With SpamAssassin, you can filter sent and received e-mail for spam. The filter examines both headers and content, drawing on rules designed to detect common spam messages. When it detects one, it tags the message as spam so that a mail client can discard it. SpamAssassin will also report spam messages to spam detection databases. The version of SpamAssassin distributed for Linux is the open source version developed by the Apache project, located at http://spamassassin.apache.org. There you can find detailed documentation, FAQs, mailing lists, and even a listing of the tests that SpamAssassin performs. Note: For the Dovecot IMAP server, you can use the dovecot-antispam plugin to implement spam detection. SpamAssassin rule files are located at /usr/share/spamassassin.

First, a message is filtered using external filters such as opendkim or postfix-policyd-spf-python (Postfix will use both), then Amavisd-new has the message scanned by ClamAV for viruses, followed by an analysis by SpamAssassin to see if it is spam. Only then does Amavisd-new allow the message to be placed in the inbox. To implement mail filtering, be sure you have installed amavisd-new, spamassassin, and clamav, along with the external filters. sudo apt-get install amavisd-new spamassassin clamav-daemon sudo apt-get install opendkim postfix-policyd-spf-python Ubuntu also recommends that you install supporting applications such as pyzor, razor, and the extraction utilities if you have not already done so (arj, cabextract, cpio, lha, nomarch, pax, rar, unrar, unzip, zip). Add the clamav user to the amavis group, and the amavis user to the clamav group, to allow amavis to use clamav to scan files. sudo adduser clamav amavis sudo adduser amavis clamav Enable spamassassin by editing the spamassassin configuration file, /etc/default/spamassassin, and setting the ENABLED entry to 1.


pages: 678 words: 159,840

The Debian Administrator's Handbook, Debian Wheezy From Discovery to Mastery by Raphaël Hertzog, Roland Mas

bash_history, Debian, distributed generation, do-ocracy, en.wikipedia.org, failed state, Firefox, GnuPG, Google Chrome, Jono Bacon, MITM: man-in-the-middle, NP-complete, QWERTY keyboard, RFC: Request For Comment, Richard Stallman, Skype, SpamAssassin, Valgrind, web application, zero day, Zimmermann PGP

With this command, it is possible to go back to an older version of a package (if for instance you know that it works well), provided that it is still available in one of the sources referenced by the sources.list file. Otherwise the snapshot.debian.org archive can come to the rescue (see sidebar GOING FURTHER Old package versions: snapshot.debian.org). Example 6.3. Installation of the unstable version of spamassassin # apt-get install spamassassin/unstable GOING FURTHER The cache of .deb files APT keeps a copy of each downloaded .deb file in the directory /var/cache/apt/archives/. In case of frequent updates, this directory can quickly take a lot of disk space with several versions of each package; you should regularly sort through them. Two commands can be used: apt-get clean entirely empties the directory; apt-get autoclean only removes packages which cannot be downloaded (because they have disappeared from the Debian mirror) and are therefore clearly useless (the configuration parameter APT::Clean-Installed can prevent the removal of .deb files that are currently installed). 6.2.3.

A milter (short for mail filter) is a filtering program specially designed to interface with email servers. A milter uses a standard application programming interface (API) that provides much better performance than filters external to the email servers. Milters were initially introduced by Sendmail, but Postfix soon followed suit. QUICK LOOK A milter for Spamassassin The spamass-milter package provides a milter based on SpamAssassin, the famous unsolicited email detector. It can be used to flag messages as probable spams (by adding an extra header) and/or to reject the messages altogether if their “spamminess” score goes beyond a given threshold. Once the clamav-milter package is installed, the milter should be reconfigured to run on a TCP port rather than on the default named socket.

Stable Updates Stable updates are not security sensitive but are deemed important enough to be pushed to users before the next stable point release. This repository will typically contain fixes for critical bugs which could not be fixed before release or which have been introduced by subsequent updates. Depending on the urgency, it can also contain updates for packages that have to evolve over time… like spamassassin's spam detection rules, clamav's virus database, or the daylight-saving rules of all timezones (tzdata). In practice, this repository is a subset of the proposed-updates repository, carefully selected by the Stable Release Managers. 6.1.2.3. Proposed Updates Once published, the Stable distribution is only updated about once every 2 months. The proposed-updates repository is where the expected updates are prepared (under the supervision of the Stable Release Managers).


Hands-On Machine Learning With Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron

Amazon Mechanical Turk, Bayesian statistics, centre right, combinatorial explosion, constrained optimization, correlation coefficient, crowdsourcing, en.wikipedia.org, iterative process, Netflix Prize, NP-complete, optical character recognition, P = NP, p-value, pattern recognition, performance metric, recommendation engine, self-driving car, SpamAssassin, speech recognition, statistical model

Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion. Tackle the Titanic dataset. A great place to start is on Kaggle. Build a spam classifier (a more challenging exercise): Download examples of spam and ham from Apache SpamAssassin’s public datasets. Unzip the datasets and familiarize yourself with the data format. Split the datasets into a training set and a test set. Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
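The four-word example above can be checked with a short sketch. This is an illustrative Python version under simple assumptions (whitespace tokenization, a fixed vocabulary list); the names `VOCABULARY`, `to_presence_vector`, and `to_count_vector` are made up for this example and are not from the book.

```python
from collections import Counter

# The toy vocabulary from the example above, in a fixed order.
VOCABULARY = ["Hello", "how", "are", "you"]


def to_presence_vector(email_text, vocabulary=VOCABULARY):
    """Return 1 for each vocabulary word present in the email, else 0."""
    words = set(email_text.split())
    return [1 if word in words else 0 for word in vocabulary]


def to_count_vector(email_text, vocabulary=VOCABULARY):
    """Return the number of occurrences of each vocabulary word."""
    counts = Counter(email_text.split())
    # Counter returns 0 for words it has never seen.
    return [counts[word] for word in vocabulary]


email = "Hello you Hello Hello you"
print(to_presence_vector(email))  # [1, 0, 0, 1]
print(to_count_vector(email))     # [3, 0, 0, 2]
```

A real pipeline would use a sparse representation, since an actual vocabulary has many thousands of words and most entries are zero for any one email.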


pages: 371 words: 78,103

Webbots, Spiders, and Screen Scrapers by Michael Schrenk

Amazon Web Services, corporate governance, fault tolerance, Firefox, Marc Andreessen, new economy, pre–internet, SpamAssassin, The Hackers Conference, Turing test, web application

+OK 2398 octets Return-Path: <returnpath@server.com> Delivered-To: me@server.com Received: (qmail 73301 invoked from network); 19 Feb 2006 20:55:31 −0000 Received: from mail2.server.net by mail1.server.net (qmail-ldap-1.03) with compressed QMQP; 19 Feb 2006 20:55:31 −0000 Delivered-To: CLUSTERHOST mail2.server.net me@server.com Received: (qmail 50923 invoked from network); 19 Feb 2006 20:55:31 −0000 Received: by simscan 1.1.0 ppid: 50907, pid: 50912, t: 2.8647s scanners: attach: 1.1.0 clamav: 0.86.1/m:34/d:1107 spam: 3.0.4 Received: from web30515.mail.mud.server.com (envelope-sender <sender@server.com>) by mail2.server.net (qmail-ldap-1.03) with SMTP for <me@server.com>; 19 Feb 2006 20:55:28 −0000 Received: (qmail 7734 invoked by uid 60001); 19 Feb 2006 20:55:26 −0000 Message-ID: <20060219205526.7732.qmail@web30515.mail.mud.server.com> Date: Sun, 19 Feb 2006 12:55:26 −0800 (PST) From: mike schrenk <sender@server.com> Subject: Hey, Can you read this email? To: mike schrenk <me@server.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581" Content-Transfer-Encoding: 8bit X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail2.server.com X-Spam-Level: X-Spam-Status: No, score=0.9 required=17.0 tests=HTML_00_10,HTML_MESSAGE, HTML_SHORT_LENGTH autolearn=no version=3.0.4 --0-349883719-1140382526=:7581 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit This is an email sent from my Yahoo! email account. --0-349883719-1140382526=:7581 Content-Type: text/html; charset=iso-8859-1 Content-Transfer-Encoding: 8bit This is an email sent from my Yahoo!

Listing 15-6: A raw email message read from the server using the RETR POP3 command As you can see, even a short email message has a lot of overhead. Most of the returned information has little to do with the actual text of a message. For example, the email message retrieved in Listing 15-6 doesn't appear until over halfway down the listing. The rest of the text returned by the mail server consists of headers, which tell the mail client the path the message took, which services touched it (like SpamAssassin), how to display or handle the message, to whom to send replies, and so forth. These headers include some familiar information such as the subject header, the to and from values, and the MIME version. You can easily parse this information with the return_between() function found in the LIB_parse library (see Chapter 4), as shown in Listing 15-7. $ret_path = return_between($raw_message, "Return-Path: ", "\n", EXCL ); $deliver_to = return_between($raw_message, "Delivered-To: ", "\n", EXCL ); $date = return_between($raw_message, "Date: ", "\n", EXCL ); $from = return_between($raw_message, "From: ", "\n", EXCL ); $subject = return_between($raw_message, "Subject: ", "\n", EXCL ); Listing 15-7: Parsing header values The header values in Listing 15-7 are delimited by their names and a \n (newline) character.
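For comparison, the same header values can be pulled out with a standard-library parser instead of string searches. The following is a Python sketch, not the book's PHP approach; the shortened `raw_message` is an assumption based on Listing 15-6, and it presumes the POP3 status line (`+OK 2398 octets`) has already been removed.

```python
from email import message_from_string

# Shortened, hypothetical version of the message in Listing 15-6,
# with the POP3 "+OK ..." status line already stripped.
raw_message = (
    "Return-Path: <returnpath@server.com>\n"
    "Delivered-To: me@server.com\n"
    "Date: Sun, 19 Feb 2006 12:55:26 -0800 (PST)\n"
    "From: mike schrenk <sender@server.com>\n"
    "Subject: Hey, Can you read this email?\n"
    "\n"
    "This is an email sent from my Yahoo! email account.\n"
)

message = message_from_string(raw_message)

# Header lookup on a Message object is case-insensitive and returns
# None when the header is absent.
wanted = ("Return-Path", "Delivered-To", "Date", "From", "Subject")
headers = {name: message[name] for name in wanted}

for name, value in headers.items():
    print(f"{name}: {value}")
```

A parser of this kind also copes with headers that are folded across several physical lines, which a search for the first newline after the header name would cut short.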


pages: 241 words: 43,073

Puppet 3 Beginner's Guide by John Arundel

cloud computing, Debian, DevOps, job automation, job satisfaction, Lao Tzu, Larry Wall, Network effects, SpamAssassin

He's been a consultant in the past, but he's now an employee for a provincial government agency for which he manages the infrastructure (servers, workstations, network, security, virtualization, SAN/NAS, PBX). He's a big fan of open-source software and its underlying philosophy. He's worked with Debian, Ubuntu, and SUSE, but what he knows best is RHEL-based distributions. He's known for his contributions to the MailScanner project (he has been a technical reviewer for the MailScanner book), but he also gave time to different open-source projects, such as mondorescue, OTRS, SpamAssassin, pfSense, and a few others. I thank my lover, Lysanne, who accepted allowing me some free time slots for this review even with a 2-year-old and a 6-month-old to take care of. The presence of these 3 human beings in my life is simply invaluable. I must also thank my friend Sébastien, whose generosity is only matched by his knowledge and kindness. I would never have reached that high in my career if it wasn't for him.


Pulling Strings With Puppet: Configuration Management Made Easy by James Turnbull

Debian, en.wikipedia.org, Kickstarter, revision control, Ruby on Rails, source of truth, SpamAssassin

We’re going to put these classes into the os directory and also import them into our site.pp file (again, we’ve already seen this in Listing 4-1).

    import "os/*"

In Listing 4-7, you can see both these classes.

Listing 4-7. The Operating System Classes

    class fedora {
      yumrepo { "testing.com-repo":
        baseurl  => "http://repos.testing.com/fedora/$lsbdistrelease/",
        descr    => "Testing.com's YUM repository",
        enabled  => 1,
        gpgcheck => 0,
      }
    }

    class debian {
      $disableservices = ["hplip", "avahi-daemon", "rsync", "spamassassin"]

      service { $disableservices:
        enable => false,
        ensure => stopped,
      }
    }

In Listing 4-7, we’ve created two classes; the first is fedora, which loads whenever a node returns Fedora as the value of the $operatingsystem fact. It uses the yumrepo type to set up a yum repository for our environment. We’ve also used another fact, $lsbdistrelease, which returns the LSB release number to select the correct repository for the Fedora release we have installed.


Producing Open Source Software: How to Run a Successful Free Software Project by Karl Fogel

active measures, AGPL, barriers to entry, Benjamin Mako Hill, collaborative editing, continuous integration, corporate governance, Debian, Donald Knuth, en.wikipedia.org, experimental subject, Firefox, GnuPG, Hacker Ethic, Internet Archive, iterative process, Kickstarter, natural language processing, patent troll, peer-to-peer, pull request, revision control, Richard Stallman, selection bias, slashdot, software as a service, software patent, SpamAssassin, web application, zero-sum game

There is not space here for detailed instructions on setting up spam filters. You will have to consult your mailing list software's documentation for that (see the section called “Mailing List / Message Forum Software” later in this chapter). List software often comes with some built-in spam prevention features, but you may want to add some third-party filters. I've had good experiences with SpamAssassin (spamassassin.apache.org) and SpamProbe (spamprobe.sourceforge.net), but this is not a comment on the many other open source spam filters out there, some of which are apparently also quite good. I just happen to have used those two myself and been satisfied with them. Moderation. For mails that aren't automatically allowed by virtue of being from a list subscriber, and which make it through the spam filtering software, if any, the last stage is moderation: the mail is routed to a special holding area, where a human examines it and confirms or rejects it.


Practical OCaml by Joshua B. Smith

cellular automata, Debian, domain-specific language, general-purpose programming language, Grace Hopper, hiring and firing, John Conway, Paul Graham, slashdot, SpamAssassin, text mining, Turing complete, type inference, web application, Y2K

The only real issue with the OCamlMakefile is that you need more than one for multiple results (which is a very small price to pay for such a useful utility). Running It You need a very large corpus of email to train a Bayesian classifier like this one. Luckily, the “spamassassin” developers have released a public corpus of email that can be used to develop antispam utilities. (A command-line client for this module is presented at the end of this chapter.) This corpus, broken up into ham and spam of varying stripes, can be downloaded from http://spamassassin.apache.org/publiccorpus/. Running the code on these corpuses gives results not nearly as good as Paul Graham says he got, but they are still very good. When tested on the files that made up the training corpus, the results are quite effective, especially with regard to false positives (nonspam mail was not flagged as spam).


pages: 1,331 words: 163,200

Hands-On Machine Learning With Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron

Amazon Mechanical Turk, Anton Chekhov, combinatorial explosion, computer vision, constrained optimization, correlation coefficient, crowdsourcing, don't repeat yourself, Elon Musk, en.wikipedia.org, friendly AI, ImageNet competition, information retrieval, iterative process, John von Neumann, Kickstarter, natural language processing, Netflix Prize, NP-complete, optical character recognition, P = NP, p-value, pattern recognition, pull request, recommendation engine, self-driving car, sentiment analysis, SpamAssassin, speech recognition, stochastic process

Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion. Tackle the Titanic dataset. A great place to start is on Kaggle. Build a spam classifier (a more challenging exercise): Download examples of spam and ham from Apache SpamAssassin’s public datasets. Unzip the datasets and familiarize yourself with the data format. Split the datasets into a training set and a test set. Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.


pages: 579 words: 76,657

Data Science from Scratch: First Principles with Python by Joel Grus

correlation does not imply causation, natural language processing, Netflix Prize, p-value, Paul Graham, recommendation engine, SpamAssassin, statistical model

Our function will return a list of triplets containing each word, the probability of seeing that word in a spam message, and the probability of seeing that word in a nonspam message:

    def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
        """turn the word_counts into a list of triplets
        w, p(w | spam) and p(w | ~spam)"""
        return [(w,
                 (spam + k) / (total_spams + 2 * k),
                 (non_spam + k) / (total_non_spams + 2 * k))
                for w, (spam, non_spam) in counts.items()]

The last piece is to use these word probabilities (and our Naive Bayes assumptions) to assign probabilities to messages:

    import math  # needed for math.log and math.exp below

    def spam_probability(word_probs, message):
        message_words = tokenize(message)
        log_prob_if_spam = log_prob_if_not_spam = 0.0

        # iterate through each word in our vocabulary
        for word, prob_if_spam, prob_if_not_spam in word_probs:

            # if *word* appears in the message,
            # add the log probability of seeing it
            if word in message_words:
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_not_spam += math.log(prob_if_not_spam)

            # if *word* doesn't appear in the message
            # add the log probability of _not_ seeing it
            # which is log(1 - probability of seeing it)
            else:
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)

        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_not_spam = math.exp(log_prob_if_not_spam)
        return prob_if_spam / (prob_if_spam + prob_if_not_spam)

We can put this all together into our Naive Bayes Classifier:

    class NaiveBayesClassifier:

        def __init__(self, k=0.5):
            self.k = k
            self.word_probs = []

        def train(self, training_set):

            # count spam and non-spam messages
            num_spams = len([is_spam
                             for message, is_spam in training_set
                             if is_spam])
            num_non_spams = len(training_set) - num_spams

            # run training data through our "pipeline"
            word_counts = count_words(training_set)
            self.word_probs = word_probabilities(word_counts,
                                                 num_spams,
                                                 num_non_spams,
                                                 self.k)

        def classify(self, message):
            return spam_probability(self.word_probs, message)

Testing Our Model

A good (if somewhat old) data set is the SpamAssassin public corpus. We’ll look at the files prefixed with 20021010. (On Windows, you might need a program like 7-Zip to decompress and extract them.) After extracting the data (to, say, C:\spam) you should have three folders: spam, easy_ham, and hard_ham. Each folder contains many emails, each contained in a single file. To keep things really simple, we’ll just look at the subject lines of each email.
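One way to pull the subject line out of a single message is sketched below. This is an assumption about how the step might look, not the book's code: the helper name `subject_line` is hypothetical, and stopping at the first blank line (so that a "Subject:" inside the body is ignored) is a choice made for this example.

```python
def subject_line(lines):
    """Return the subject of one email given its lines, or None if absent."""
    for line in lines:
        if line.startswith("Subject:"):
            # Drop the header name and surrounding whitespace.
            return line[len("Subject:"):].strip()
        if line.strip() == "":
            # A blank line ends the headers; stop before reading the body.
            break
    return None


# Hypothetical message, split into lines as if read from a corpus file.
example = [
    "From: someone@example.com\n",
    "Subject: Interesting approach to Spam handling..\n",
    "\n",
    "Subject: this line is part of the body\n",
]
print(subject_line(example))
```

Because the function accepts any iterable of lines, an open file object from one of the corpus folders can be passed to it directly.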


pages: 377 words: 110,427

The Boy Who Could Change the World: The Writings of Aaron Swartz by Aaron Swartz, Lawrence Lessig

affirmative action, Alfred Russel Wallace, American Legislative Exchange Council, Benjamin Mako Hill, bitcoin, Bonfire of the Vanities, Brewster Kahle, Cass Sunstein, deliberate practice, Donald Knuth, Donald Trump, failed state, fear of failure, Firefox, full employment, Howard Zinn, index card, invisible hand, Joan Didion, John Gruber, Lean Startup, More Guns, Less Crime, peer-to-peer, post scarcity, Richard Feynman, Richard Stallman, Ronald Reagan, school vouchers, semantic web, single-payer health, SpamAssassin, SPARQL, telemarketer, The Bell Curve by Richard Herrnstein and Charles Murray, the scientific method, Toyota Production System, unbiased observer, wage slave, Washington Consensus, web application, WikiLeaks, working poor, zero-sum game

All the content here is plain old web pages, served up by Apache. Tinderbox uses a similar system, drawing from your database of notes to produce a bunch of static pages. My book collection pages are done this way. Radio UserLand statically generates the pages on your local computer and then “upstreams” them to your website. Finally, while researching Webmake, the Perl CMS that generates pages like Jmason’s Weblog and SpamAssassin, I found a good bit of terminology for this. Some websites, the documentation explains, are fried up for the user every time. But others are baked once and served up again and again. Why bake your pages instead of frying? Well, as you might guess, it’s healthier, but at the expense of not tasting quite as good. Baked pages are easy to serve. You can almost always switch servers and software and they’ll still work.


pages: 525 words: 149,886

Higher-Order Perl: A Guide to Program Transformation by Mark Jason Dominus

always be closing, Defenestration of Prague, Donald Knuth, Isaac Newton, Larry Wall, P = NP, Paul Graham, Perl 6, slashdot, SpamAssassin

The problem occurs because the input is being read character-by-character; when the buffer contains “a\n\n”, the terminator pattern succeeds, and the record is split, even though more reading would have generated a longer match. The same bug causes a more serious problem in a different example. Suppose we’re reading an email header and we’d like the iterator to generate logical fields instead of physical lines. Suppose the email header is as follows: Delivered-To: mjd-filter-deliver2@plover.com Received: from localhost [127.0.0.1] by plover.com with SpamAssassin (2.55 1.174.2.19-2003-05-19-exp); Mon, 11 Aug 2003 16:22:12 -0400 From: "Doris Bower" <yij447mrx@yahoo.com.hk> To: webmaster@plover.com Subject: LoseWeight Now with Pphentermine,Aadipex,Bontriil,PrescribedOnline,shipped to Your Door fltynzlfoybv kie There are five fields here; the second one, with the Received tag, consists of three physical lines. Lines that begin with whitespace are continuations of the previous line.
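The continuation rule can be illustrated with a small sketch. It is written in Python rather than the book's Perl, it works on a complete header string rather than on an iterator with a buffer, and the physical line breaks and shortened Subject in `sample_header` are assumptions, since the excerpt above shows the header flattened onto one line.

```python
def logical_fields(header_text):
    """Join continuation lines (those starting with whitespace) to the
    field they continue, returning one string per logical field."""
    fields = []
    for line in header_text.splitlines():
        if line[:1] in (" ", "\t") and fields:
            # Continuation: append to the previous logical field.
            fields[-1] += " " + line.strip()
        else:
            fields.append(line)
    return fields


# Hypothetical line breaks: the Received field spans three physical lines.
sample_header = (
    "Delivered-To: mjd-filter-deliver2@plover.com\n"
    "Received: from localhost [127.0.0.1] by plover.com\n"
    "\twith SpamAssassin (2.55 1.174.2.19-2003-05-19-exp);\n"
    "\tMon, 11 Aug 2003 16:22:12 -0400\n"
    'From: "Doris Bower" <yij447mrx@yahoo.com.hk>\n'
    "To: webmaster@plover.com\n"
    "Subject: LoseWeight Now\n"
)

for field in logical_fields(sample_header):
    print(field)
```

Having the whole header in memory sidesteps the difficulty the passage describes: a reader that consumes input incrementally cannot know a field has ended until it has seen the start of the next line.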