SpamAssassin

pages: 163 words: 42,402

Machine Learning for Email by Drew Conway, John Myles White

call centre, correlation does not imply causation, data science, Debian, natural language processing, Netflix Prize, pattern recognition, recommendation engine, SpamAssassin, text mining

=hn <dh@uptime.at> To: <spamassassin-devel@example.sourceforge.net> Message-Id: <B98ABFA4.1F87%dh@uptime.at> MIME-Version: 1.0 X-Trusted: YES X-From-Laptop: YES Content-Type: text/plain; charset="US-ASCII” Content-Transfer-Encoding: 7bit X-Mailscanner: Nothing found, baby Subject: [SAdev] Interesting approach to Spam handling.. Sender: spamassassin-devel-admin@example.sourceforge.net Errors-To: spamassassin-devel-admin@example.sourceforge.net X-Beenthere: spamassassin-devel@example.sourceforge.net X-Mailman-Version: 2.0.9-sf.net Precedence: bulk List-Help: <mailto:spamassassin-devel-request@example.sourceforge.net?subject=help> List-Post: <mailto:spamassassin-devel@example.sourceforge.net> List-Subscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=subscribe> List-Id: SpamAssassin Developers <spamassassin-devel.example.sourceforge.net> List-Unsubscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?

…

Received: from usw-sf-list1-b.sourceforge.net ([10.3.1.13] helo=usw-sf-list1.sourceforge.net) by usw-sf-list2.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsof-00042r-00; Thu, 22 Aug 2002 07:20:05 -0700 Received: from vivi.uptime.at ([62.116.87.11] helo=mail.uptime.at) by usw-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsoM-0000Ge-00 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 07:19:47 -0700 Received: from [192.168.0.4] (chello062178142216.4.14.vie.surfer.at [62.178.142.216]) (authenticated bits=0) by mail.uptime.at (8.12.5/8.12.5) with ESMTP id g7MEI7Vp022036 for <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 16:18:07 +0200 From: David H=?ISO-8859-1?B?9g==?=hn <dh@uptime.at> To: <spamassassin-devel@example.sourceforge.net> Message-Id: <B98ABFA4.1F87%dh@uptime.at> MIME-Version: 1.0 X-Trusted: YES X-From-Laptop: YES Content-Type: text/plain; charset="US-ASCII” Content-Transfer-Encoding: 7bit X-Mailscanner: Nothing found, baby Subject: [SAdev] Interesting approach to Spam handling..

…

subject=subscribe> List-Id: SpamAssassin Developers <spamassassin-devel.example.sourceforge.net> List-Unsubscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>, <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=unsubscribe> List-Archive: <http://www.geocrawler.com/redir-sf.php3?list=spamassassin-devel> X-Original-Date: Thu, 22 Aug 2002 16:19:48 +0200 Date: Thu, 22 Aug 2002 16:19:48 +0200 Hello, have you seen and discussed this article and his approach? Thank you http://www.paulgraham.com/spam.html -- “Hell, there are no rules here-- we’re trying to accomplish something.” -- Thomas Alva Edison ------------------------------------------------------- This sf.net email is sponsored by: OSDN - Tired of that same old cell phone?

pages: 451 words: 103,606

Machine Learning for Hackers by Drew Conway, John Myles White

call centre, centre right, correlation does not imply causation, data science, Debian, Erdős number, Nate Silver, natural language processing, Netflix Prize, off-by-one error, p-value, pattern recognition, Paul Erdős, recommendation engine, social graph, SpamAssassin, statistical model, text mining, the scientific method, traveling salesman

…

Ubuntu 15.04 Server with systemd: Administration and Reference by Richard Petersen

Amazon Web Services, bash_history, cloud computing, Debian, Firefox, lock screen, Mark Shuttleworth, MITM: man-in-the-middle, OpenAI, operational security, RFC: Request For Comment, SpamAssassin, web application

There you can find detailed documentation, FAQs, mailing lists, and even a listing of the tests that SpamAssassin performs. Note: For the Dovecot IMAP server, you can use the dovecot-antispam plugin to implement spam detection. SpamAssassin rule files are located at /usr/share/spamassassin. The files contain rules for running tests such as detecting the fake hello in the header. Configuration files for SpamAssassin are located at /etc/spamassassin. The local.cf file lists system-wide SpamAssassin options such as how to rewrite headers. The init.pre file holds spam system configurations. Server options, such as enabling SpamAssassin, are listed in the /etc/default/spamassassin file. Users can set their own SpamAssassin options in their .spamassassin/user_prefs file.

…

Add the clamav user to the amavis group to allow amavis to use clamav to scan files. sudo adduser clamav amavis sudo adduser amavis clamav Enable spamassassin by editing the spamassassin configuration file, /etc/default/spamassassin, and setting the ENABLED entry to 1. ENABLED=1 Then start spamassassin. sudo service spamassassin start You can then configure Amavisd-new using files in the /etc/amavis/conf.d directory. To activate virus detection and spamassassin, edit the /etc/amavis/conf.d/15_content_filter_mode file and uncomment the lines for virus detection and spamassassin as indicated by the comments. Ubuntu also recommends that you disable the bounce response for spam emails by setting the final_spam_destiny option in the 20_debian_defaults file to D_DISCARD instead of D_BOUNCE.

…

The Courier-IMAP server (http://courier-mta.org) is a small, fast IMAP server that provides extensive authentication support including LDAP and PAM (Universe repository). Spam: SpamAssassin With SpamAssassin, you can filter sent and received e-mail for spam. The filter examines both headers and content, drawing on rules designed to detect common spam messages. When they are detected, it then tags the message as spam, so that a mail client can then discard it. SpamAssassin will also report spam messages to spam detection databases. The version of SpamAssassin distributed for Linux is the open source version developed by the Apache project, located at http://spamassassin.apache.org. There you can find detailed documentation, FAQs, mailing lists, and even a listing of the tests that SpamAssassin performs.

pages: 678 words: 159,840

The Debian Administrator's Handbook, Debian Wheezy From Discovery to Mastery by Raphaël Hertzog, Roland Mas

bash_history, Debian, distributed generation, do-ocracy, en.wikipedia.org, end-to-end encryption, failed state, Firefox, Free Software Foundation, GnuPG, Google Chrome, Jono Bacon, MITM: man-in-the-middle, Neal Stephenson, NP-complete, precautionary principle, QWERTY keyboard, RFC: Request For Comment, Richard Stallman, Skype, SpamAssassin, SQL injection, Valgrind, web application, zero day, Zimmermann PGP

With this command, it is possible to go back to an older version of a package (if for instance you know that it works well), provided that it is still available in one of the sources referenced by the sources.list file. Otherwise the snapshot.debian.org archive can come to the rescue (see sidebar GOING FURTHER Old package versions: snapshot.debian.org). Example 6.3. Installation of the unstable version of spamassassin # apt-get install spamassassin/unstable GOING FURTHER The cache of .deb files APT keeps a copy of each downloaded .deb file in the directory /var/cache/apt/archives/. In case of frequent updates, this directory can quickly take a lot of disk space with several versions of each package; you should regularly sort through them.

…

A milter (short for mail filter) is a filtering program specially designed to interface with email servers. A milter uses a standard application programming interface (API) that provides much better performance than filters external to the email servers. Milters were initially introduced by Sendmail, but Postfix soon followed suit. QUICK LOOK A milter for Spamassassin The spamass-milter package provides a milter based on SpamAssassin, the famous unsolicited email detector. It can be used to flag messages as probable spams (by adding an extra header) and/or to reject the messages altogether if their “spamminess” score goes beyond a given threshold. Once the clamav-milter package is installed, the milter should be reconfigured to run on a TCP port rather than on the default named socket.

…

Stable Updates Stable updates are not security sensitive but are deemed important enough to be pushed to users before the next stable point release. This repository will typically contain fixes for critical bugs which could not be fixed before release or which have been introduced by subsequent updates. Depending on the urgency, it can also contain updates for packages that have to evolve over time… like spamassassin's spam detection rules, clamav's virus database, or the daylight-saving rules of all timezones (tzdata). In practice, this repository is a subset of the proposed-updates repository, carefully selected by the Stable Release Managers. 6.1.2.3. Proposed Updates Once published, the Stable distribution is only updated about once every 2 months.

Hands-On Machine Learning With Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron

AlphaGo, Amazon Mechanical Turk, Bayesian statistics, centre right, combinatorial explosion, constrained optimization, correlation coefficient, crowdsourcing, data science, deep learning, DeepMind, duck typing, en.wikipedia.org, Geoffrey Hinton, iterative process, Netflix Prize, NP-complete, optical character recognition, P = NP, p-value, pattern recognition, performance metric, recommendation engine, self-driving car, SpamAssassin, speech recognition, statistical model

You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion. Tackle the Titanic dataset. A great place to start is on Kaggle. Build a spam classifier (a more challenging exercise): Download examples of spam and ham from Apache SpamAssassin’s public datasets. Unzip the datasets and familiarize yourself with the data format. Split the datasets into a training set and a test set. Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word.
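The presence/absence vector described in the exercise can be sketched as follows. This is a minimal illustration using only the Python standard library rather than the Scikit-Learn pipeline the exercise asks for; the function names `build_vocabulary` and `to_sparse_vector` and the simple word pattern are assumptions made for this example.

```python
import re

# Hypothetical tokenizer pattern: lowercase letters, digits and apostrophes.
WORD_PATTERN = re.compile(r"[a-z0-9']+")

def build_vocabulary(emails):
    """Map each distinct word seen in the training emails to a column index."""
    words = sorted({w for text in emails for w in WORD_PATTERN.findall(text.lower())})
    return {word: index for index, word in enumerate(words)}

def to_sparse_vector(text, vocabulary):
    """Return the sorted column indices of the vocabulary words present in text.

    Storing only the indices of the non-zero entries is one simple way to
    represent a sparse binary (presence/absence) vector.
    """
    present = set(WORD_PATTERN.findall(text.lower()))
    return sorted(vocabulary[w] for w in present if w in vocabulary)

vocabulary = build_vocabulary(["Buy cheap pills", "Meeting at noon"])
print(to_sparse_vector("cheap cheap pills now", vocabulary))
```

Words that never appeared in the training emails (here, "now") are simply ignored, and repeated words count once, since the vector records presence rather than frequency.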

pages: 371 words: 78,103

Webbots, Spiders, and Screen Scrapers by Michael Schrenk

Amazon Web Services, corporate governance, digital rights, fault tolerance, Firefox, machine readable, Marc Andreessen, new economy, pre–internet, SpamAssassin, The Hackers Conference, Turing test, web application

+OK 2398 octets Return-Path: <returnpath@server.com> Delivered-To: me@server.com Received: (qmail 73301 invoked from network); 19 Feb 2006 20:55:31 -0000 Received: from mail2.server.net by mail1.server.net (qmail-ldap-1.03) with compressed QMQP; 19 Feb 2006 20:55:31 -0000 Delivered-To: CLUSTERHOST mail2.server.net me@server.com Received: (qmail 50923 invoked from network); 19 Feb 2006 20:55:31 -0000 Received: by simscan 1.1.0 ppid: 50907, pid: 50912, t: 2.8647s scanners: attach: 1.1.0 clamav: 0.86.1/m:34/d:1107 spam: 3.0.4 Received: from web30515.mail.mud.server.com (envelope-sender <sender@server.com>) by mail2.server.net (qmail-ldap-1.03) with SMTP for <me@server.com>; 19 Feb 2006 20:55:28 -0000 Received: (qmail 7734 invoked by uid 60001); 19 Feb 2006 20:55:26 -0000 Message-ID: <20060219205526.7732.qmail@web30515.mail.mud.server.com> Date: Sun, 19 Feb 2006 12:55:26 -0800 (PST) From: mike schrenk <sender@server.com> Subject: Hey, Can you read this email? To: mike schrenk <me@server.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581" Content-Transfer-Encoding: 8bit X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail2.server.com X-Spam-Level: X-Spam-Status: No, score=0.9 required=17.0 tests=HTML_00_10,HTML_MESSAGE, HTML_SHORT_LENGTH autolearn=no version=3.0.4 --0-349883719-1140382526=:7581 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit This is an email sent from my Yahoo!

…

Most of the returned information has little to do with the actual text of a message. For example, the email message retrieved in Listing 15-6 doesn't appear until over halfway down the listing. The rest of the text returned by the mail server consists of headers, which tell the mail client the path the message took, which services touched it (like SpamAssassin), how to display or handle the message, to whom to send replies, and so forth. These headers include some familiar information such as the subject header, the to and from values, and the MIME version. You can easily parse this information with the return_between() function found in the LIB_parse library (see Chapter 4), as shown in Listing 15-7.
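To illustrate the same idea of pulling familiar fields out of the headers, here is a short sketch in Python. It does not use return_between() or LIB_parse; it swaps in the `email` module from Python's standard library. The sample message is abbreviated from the listing above, the body line is a placeholder, and the helper name `summarize_headers` is an assumption for this example.

```python
from email import message_from_string

# Abbreviated sample; header values are taken from the listing above,
# the body text is a placeholder.
RAW_MESSAGE = (
    "Return-Path: <returnpath@server.com>\n"
    "From: mike schrenk <sender@server.com>\n"
    "Subject: Hey, Can you read this email?\n"
    "To: mike schrenk <me@server.com>\n"
    "MIME-Version: 1.0\n"
    "X-Spam-Status: No, score=0.9 required=17.0\n"
    "\n"
    "(message body)\n"
)

def summarize_headers(raw_message):
    """Parse a raw message and return a few commonly used header values."""
    message = message_from_string(raw_message)
    wanted = ("Subject", "From", "To", "MIME-Version", "X-Spam-Status")
    # message[name] returns None when a header is absent.
    return {name: message[name] for name in wanted}

for name, value in summarize_headers(RAW_MESSAGE).items():
    print(f"{name}: {value}")
```

Note that a raw POP3 response starts with a status line such as "+OK 2398 octets", which is not part of the message and would need to be removed before parsing.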

Pulling Strings With Puppet: Configuration Management Made Easy by James Turnbull

Debian, en.wikipedia.org, Kickstarter, revision control, Ruby on Rails, single source of truth, source of truth, SpamAssassin

We’re going to put these classes into the os directory and also import them into our site.pp file (again, we’ve already seen this in Listing 4-1).

    import "os/*"

In Listing 4-7, you can see both these classes.

Listing 4-7. The Operating System Classes

    class fedora {
        yumrepo { "testing.com-repo":
            baseurl  => "http://repos.testing.com/fedora/$lsbdistrelease/",
            descr    => "Testing.com's YUM repository",
            enabled  => 1,
            gpgcheck => 0,
        }
    }

    class debian {
        $disableservices = ["hplip", "avahi-daemon", "rsync", "spamassassin"]

        service { $disableservices:
            enable => false,
            ensure => stopped,
        }
    }

In Listing 4-7, we’ve created two classes; the first is fedora, which loads whenever a node returns Fedora as the value of the $operatingsystem fact. It uses the yumrepo type to set up a yum repository for our environment. We’ve also used another fact, $lsbdistrelease, which returns the LSB release number to select the correct repository for the Fedora release we have installed.

pages: 241 words: 43,073

Puppet 3 Beginner's Guide by John Arundel

business logic, cloud computing, Debian, DevOps, job automation, job satisfaction, Lao Tzu, Larry Wall, Network effects, SpamAssassin

He's worked with Debian, Ubuntu, and SUSE, but what he knows best is RHEL-based distributions. He's known for his contributions to the MailScanner project (he has been a technical reviewer for the MailScanner book), but he also gave time to different open-source projects, such as mondorescue, OTRS, SpamAssassin, pfSense, and a few others. I thank my lover, Lysanne, who accepted allowing me some free time slots for this review even with a 2-year-old and a 6-month-old to take care of. The presence of these 3 human beings in my life is simply invaluable. I must also thank my friend Sébastien, whose generosity is only matched by his knowledge and kindness.

Producing Open Source Software: How to Run a Successful Free Software Project by Karl Fogel

active measures, AGPL, barriers to entry, Benjamin Mako Hill, collaborative editing, continuous integration, Contributor License Agreement, corporate governance, Debian, Donald Knuth, en.wikipedia.org, experimental subject, Firefox, Free Software Foundation, GnuPG, Hacker Ethic, Hacker News, intentional community, Internet Archive, iterative process, Kickstarter, natural language processing, off-by-one error, patent troll, peer-to-peer, pull request, revision control, Richard Stallman, selection bias, slashdot, software as a service, software patent, SpamAssassin, the Cathedral and the Bazaar, Wayback Machine, web application, zero-sum game

You will have to consult your mailing list software's documentation for that (see the section called “Mailing List / Message Forum Software” later in this chapter). List software often comes with some built-in spam prevention features, but you may want to add some third-party filters. I've had good experiences with SpamAssassin (spamassassin.apache.org) and SpamProbe (spamprobe.sourceforge.net), but this is not a comment on the many other open source spam filters out there, some of which are apparently also quite good. I just happen to have used those two myself and been satisfied with them. Moderation. For mails that aren't automatically allowed by virtue of being from a list subscriber, and which make it through the spam filtering software, if any, the last stage is moderation: the mail is routed to a special holding area, where a human examines it and confirms or rejects it.

Practical OCaml by Joshua B. Smith

cellular automata, Debian, domain-specific language, duck typing, Free Software Foundation, functional programming, general-purpose programming language, Grace Hopper, higher-order functions, hiring and firing, John Conway, Paul Graham, slashdot, SpamAssassin, text mining, Turing complete, type inference, web application, Y2K

The only real issue with the OCamlMakefile is that you need more than one for multiple results (which is a very small price to pay for such a useful utility). Running It You need a very large corpus of email to train a Bayesian classifier like this one. Luckily, the “spamassassin” developers have released a public corpus of email that can be used to develop antispam utilities. (A command-line client for this module is presented at the end of this chapter.) This corpus, broken up into ham and spam of varying stripes, can be downloaded from http://spamassassin.apache.org/publiccorpus/. Running the code on these corpuses gives results not nearly as good as Paul Graham says he got, but they are still very good. When tested on the files that made up the training corpus, the results are quite effective, especially with regard to false positives (nonspam mail was not flagged as spam).

pages: 1,331 words: 163,200

Hands-On Machine Learning With Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron

AlphaGo, Amazon Mechanical Turk, Anton Chekhov, backpropagation, combinatorial explosion, computer vision, constrained optimization, correlation coefficient, crowdsourcing, data science, deep learning, DeepMind, don't repeat yourself, duck typing, Elon Musk, en.wikipedia.org, friendly AI, Geoffrey Hinton, ImageNet competition, information retrieval, iterative process, John von Neumann, Kickstarter, machine translation, natural language processing, Netflix Prize, NP-complete, OpenAI, optical character recognition, P = NP, p-value, pattern recognition, pull request, recommendation engine, self-driving car, sentiment analysis, SpamAssassin, speech recognition, stochastic process

pages: 579 words: 76,657

Data Science from Scratch: First Principles with Python by Joel Grus

backpropagation, confounding variable, correlation does not imply causation, data science, deep learning, Hacker News, higher-order functions, natural language processing, Netflix Prize, p-value, Paul Graham, recommendation engine, SpamAssassin, statistical model

Our function will return a list of triplets containing each word, the probability of seeing that word in a spam message, and the probability of seeing that word in a nonspam message:

    def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
        """turn the word_counts into a list of triplets
        w, p(w | spam) and p(w | ~spam)"""
        return [(w,
                 (spam + k) / (total_spams + 2 * k),
                 (non_spam + k) / (total_non_spams + 2 * k))
                for w, (spam, non_spam) in counts.items()]

The last piece is to use these word probabilities (and our Naive Bayes assumptions) to assign probabilities to messages:

    import math

    # tokenize() and count_words() are not shown in this excerpt

    def spam_probability(word_probs, message):
        message_words = tokenize(message)
        log_prob_if_spam = log_prob_if_not_spam = 0.0

        # iterate through each word in our vocabulary
        for word, prob_if_spam, prob_if_not_spam in word_probs:

            # if *word* appears in the message,
            # add the log probability of seeing it
            if word in message_words:
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_not_spam += math.log(prob_if_not_spam)

            # if *word* doesn't appear in the message
            # add the log probability of _not_ seeing it
            # which is log(1 - probability of seeing it)
            else:
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)

        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_not_spam = math.exp(log_prob_if_not_spam)
        return prob_if_spam / (prob_if_spam + prob_if_not_spam)

We can put this all together into our Naive Bayes Classifier:

    class NaiveBayesClassifier:

        def __init__(self, k=0.5):
            self.k = k
            self.word_probs = []

        def train(self, training_set):
            # count spam and non-spam messages
            num_spams = len([is_spam
                             for message, is_spam in training_set
                             if is_spam])
            num_non_spams = len(training_set) - num_spams

            # run training data through our "pipeline"
            word_counts = count_words(training_set)
            self.word_probs = word_probabilities(word_counts,
                                                 num_spams,
                                                 num_non_spams,
                                                 self.k)

        def classify(self, message):
            return spam_probability(self.word_probs, message)

Testing Our Model

A good (if somewhat old) data set is the SpamAssassin public corpus. We’ll look at the files prefixed with 20021010. (On Windows, you might need a program like 7-Zip to decompress and extract them.) After extracting the data (to, say, C:\spam) you should have three folders: spam, easy_ham, and hard_ham. Each folder contains many emails, each contained in a single file.
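One way to turn that folder layout into labeled training data is sketched below. It assumes the three folder names given above sit directly under a root directory, and it keeps each file's full text; the function name `load_corpus` and the choice of latin-1 decoding (which accepts any byte sequence) are assumptions for this example, not necessarily how the book goes on to read the files.

```python
from pathlib import Path

def load_corpus(root):
    """Return a list of (text, is_spam) pairs, one per file under root.

    Files in the 'spam' folder are labeled True; files in 'easy_ham'
    and 'hard_ham' are labeled False. Missing folders are skipped.
    """
    data = []
    for folder in ("spam", "easy_ham", "hard_ham"):
        directory = Path(root) / folder
        if not directory.is_dir():
            continue
        for path in sorted(directory.iterdir()):
            if path.is_file():
                # latin-1 never fails to decode, which suits messy email data
                text = path.read_text(encoding="latin-1")
                data.append((text, folder == "spam"))
    return data
```

The resulting list has the (message, is_spam) shape that the classifier's train method iterates over, so it can be split into training and test portions before use.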

pages: 377 words: 110,427

The Boy Who Could Change the World: The Writings of Aaron Swartz by Aaron Swartz, Lawrence Lessig

Aaron Swartz, affirmative action, Alfred Russel Wallace, American Legislative Exchange Council, Benjamin Mako Hill, bitcoin, Bonfire of the Vanities, Brewster Kahle, Cass Sunstein, deliberate practice, do what you love, Donald Knuth, Donald Trump, failed state, fear of failure, Firefox, Free Software Foundation, full employment, functional programming, Hacker News, Howard Zinn, index card, invisible hand, Joan Didion, John Gruber, Lean Startup, low interest rates, More Guns, Less Crime, peer-to-peer, post scarcity, power law, Richard Feynman, Richard Stallman, Ronald Reagan, school vouchers, semantic web, single-payer health, SpamAssassin, SPARQL, telemarketer, The Bell Curve by Richard Herrnstein and Charles Murray, the scientific method, Toyota Production System, unbiased observer, wage slave, Washington Consensus, web application, WikiLeaks, working poor, zero-sum game

Tinderbox uses a similar system, drawing from your database of notes to produce a bunch of static pages. My book collection pages are done this way. Radio UserLand statically generates the pages on your local computer and then “upstreams” them to your website. Finally, while researching Webmake, the Perl CMS that generates pages like Jmason’s Weblog and SpamAssassin, I found a good bit of terminology for this. Some websites, the documentation explains, are fried up for the user every time. But others are baked once and served up again and again. Why bake your pages instead of frying? Well, as you might guess, it’s healthier, but at the expense of not tasting quite as good.

pages: 525 words: 149,886

Higher-Order Perl: A Guide to Program Transformation by Mark Jason Dominus

always be closing, Defenestration of Prague, digital rights, Donald Knuth, functional programming, higher-order functions, Isaac Newton, Larry Wall, P = NP, Paul Graham, Perl 6, slashdot, SpamAssassin

The same bug causes a more serious problem in a different example. Suppose we’re reading an email header and we’d like the iterator to generate logical fields instead of physical lines. Suppose the email header is as follows: Delivered-To: mjd-filter-deliver2@plover.com Received: from localhost [127.0.0.1] by plover.com with SpamAssassin (2.55 1.174.2.19-2003-05-19-exp); Mon, 11 Aug 2003 16:22:12 -0400 From: "Doris Bower" <yij447mrx@yahoo.com.hk> To: webmaster@plover.com Subject: LoseWeight Now with Pphentermine,Aadipex,Bontriil,PrescribedOnline,shipped to Your Door fltynzlfoybv kie There are five fields here; the second one, with the Received tag, consists of three physical lines.
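The difference between physical lines and logical fields can be sketched with a small Python generator. This is not the book's Perl iterator; it only illustrates the folding rule that a physical line beginning with whitespace continues the previous field. The name `logical_fields` is an assumption, and because the excerpt shows the header flattened, the exact points where the Received field is folded in the sample are also assumed.

```python
def logical_fields(lines):
    """Yield one string per logical header field.

    A physical line that begins with a space or tab is treated as a
    continuation of the previous field; a blank line ends the header.
    """
    current = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            break  # end of the header section
        if line[0] in " \t" and current is not None:
            current += " " + line.strip()
        else:
            if current is not None:
                yield current
            current = line
    if current is not None:
        yield current

# Sample header; the fold positions and the shortened subject are assumptions.
header_lines = [
    "Delivered-To: mjd-filter-deliver2@plover.com",
    "Received: from localhost [127.0.0.1] by plover.com",
    "\twith SpamAssassin (2.55 1.174.2.19-2003-05-19-exp);",
    "\tMon, 11 Aug 2003 16:22:12 -0400",
    'From: "Doris Bower" <yij447mrx@yahoo.com.hk>',
    "To: webmaster@plover.com",
    "Subject: LoseWeight Now",
]

for field in logical_fields(header_lines):
    print(field)
```

Seven physical lines go in and five logical fields come out, with the three Received lines joined into a single field.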