Bug 7287 - Statistical classifier plugin for SA
Summary: Statistical classifier plugin for SA
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins (show other bugs)
Version: 4.0.0
Hardware: PC Linux
: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-07 17:07 UTC by Sarang Shrivastava
Modified: 2018-08-25 22:37 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description Sarang Shrivastava 2016-01-07 17:07:37 UTC
Project Goals:

    1) To build a statistical classifier plugin for SA with multiple models encompassed underneath

    2) Implementation and Comparative study between the RBF( Radial Basis Function) and MLP (Multi Layer Perceptron) model of neural networks and evaluating their learning curve.    

    3) Exploring the possibility to learn from result symbols of known messages.

    4) Implementation of Kohonen networks or SOMP (Self Organizing Feature Map) in order to start the auto learning phase and in turn evaluating its learning curve.

    5) Implementation of SVM model for spam filtering which are especially popular in text classification problems where very high-dimensional spaces are the norm 

	6) Implementation of Hidden-Markov model in order to tackle the fact that the naive approach of automatical learning based on a messages’ overall score doesn't work well in general.
 
    
Availability:
    
    I have my end semester examinations till 2nd of may so after that I would be available throughout the summers. Regarding the work hours I would be able to dedicate about 8 hours a day on an average during my summer vacations, and about 5-6 hours after my college starts after the summer break ends( Tentatively on 20th July) . I'll make sure that the I am able to compensate for the loss of working hours in the summer break itself. If my proposal gets accepted, I am certain that I would finish my project well within the time frame.

Biography:

    ->Hello ,myself Sarang Shrivastava a 3rd year undergrad from Computer Science and Engineering at MNNIT Allahabad,India.
 
    ->Contribution to the open source community is something that is not new to me.Last summer I was working with the open source community of IIT Bombay named FOSSEE.I worked on the open source project named FreeEDA (previously known as Oscad) which is an open source EDA tool for circuit design, simulation, analysis and PCB design. It is an integrated tool built using open source software    such as KiCad (http://www.kicad-pcb.org), Ngspice (http://ngspice.sourceforge.net/) and Scilab (http://www.scilab.org/).The whole of the source code was in C++ with some part of it being implemented in python. Shell scripting was used in order to communicate between C++ and python.The build system used was Cmake.
    
    -> Regarding machine learning I have implemented the KNN algorithm for the recognition of Hand Written Digits with an efficiency of about 92%. Feature selection was the main part of it.Comparing the testing images with the training data with the parameters being center of mass of the images, mean squared distance between the images,scaling down of the image from 26X26 to 5X5 and then applying the same algorithm on it.I implemented this in python.

    -> I have also applied neural networks in one of my project titled "Recognition of Hand Gestures" which was completely in c++. Initially it involved extracting of the hand gesture from a continuous stream of video and then in turn applying various segmentation and filtering algorithms like erosion,dilation etc using OpenCV. And then once the gesture is extracted then this gesture is fed as an input to the neural nets which were trained previously on a set labeled data.
    
    ->Completed a course on Coursera titled "Machine learning" which was offered   by Stanford University and instructed by Andrew Ng.
      https://www.coursera.org/course/ml 

    ->Language proficiencies - c (advanced) ,c++ (advanced),Java (medium), lua (medium) ,Python(medium), MySQL(medium).

    ->Build systems - Cmake
            
    ->Should my proposal be accepted, I will completely devote my summer to completing this project and making SA more efficient and  fool proof.

Project Proposal:

    1)Brief Discription of existing support !
        Talking about statistics the only support available with SA is of Naive bayes. As a result of which in case of repeated spam patterns addtional overhead of computation is required everytime. 

	So presently SA uses Bayesian classifier together with some additional DNS filters to check for Spam. Firstly the present SA doesn't use any of the neural nets , SVM and HMM models that I have listed in my proposal.Secondly the new words that are not present in the Bayes database, to them SA assigns a very high probability to it. But there are chances that together with some garbage there can be a meaning full message along with it.

	Well the models that I have listed in my proposal seems to be better on paper because of the following reasons:-

	1) NB is not so scalable algorithm for large number of emails and where most of the words are random nouns.
	2) NB is a good approach to classify emails but it doesn't do well at all in front of good unsupervised learning because size   of the email doesn't really determine its importance, even small e mails can be spam free.

	For example,

	 a) we are considering you for a job ..
	 b) and an urgent job is posted for you ..

	Now sentence number one might be from a possible employer with whom you have applied whereas sentence two is 90% from those spammers which send random job requirements to people
	The second one falls more into the category of a spam, So if the user classifies it as a spam, then the weightage of the combine "urgent job" will be more than the rest of the features, But in case of Bayesian filtering each and ebery features are considered independent. The grammar and language features are not included with the Bayesian filtering but with the neural nets a lot of things can come into action.

	The problems that usually arise with probability models are:-

		a)Bayes Poisoning:
		Bayes Poisoning is a technique used by spammers to attempt to degrade the effectiveness of spam filters that rely on bayesian spam filtering. In other words, Spammers are intelligent enough to know that the filters use Bayes theorem as an integral part and they along with regular Spam words simply use legitimate Ham words to decrease the Spam probability of the mail and thereby escape the spam filter. 
		
		b)Learning Rate:
		The learning rate of spam classifier using Naïve Bayes as machine learning algorithim is low as it depends on probability model to learn

	So the idea is to build a statistical classifier plugin that includes various unsupervised and supervised machine learning algorithms with the focus being on neural networks , SVM, HMM .Theoritically, if we go with these method, they require almost zero overhead computation and can learn automatically instead of writing meta rules unlike the traditional methods. The user can activate any number of models of this particular plugin and the overall filtering would be done on the combine scores of the active models of this plugins. This is in accordance to the fact the some algorithms work better than the others in different situations.So the user should be given a choice to choose personally what all models does he wants to activate because classification of mails as spam or ham is a personal choice that varies from person to person.

For the users who are completely unaware of these models, they would be given a choice to choose from a combination of models with the model that performs well in general turned on by default.
                  
  
    2)What needs to be done. Features to be included.
        
		a) Well according to the plan,various models for neural network, SVMs , HMMs will be implemented and thoroughly tested against the SA dataset and the one giving the best result would be turned on by default for new users in the statistical classifier plugin.The contenders are RBF ( radial basis function ) ,MLP (multi layer perceptron ) , SOMP (Self organizing feature maps or Kohonen network) ANN, SVM ( support vector machines) , HMM ( Hidden markov models) and Bayesian Networks( If time permits ) .This would be followed by evaluating their learning curve by varying the different set of input features and parameters

Eg: (no of layers, learning rate , momentum ,epochs, Number of neurons in the hidden layer  for the neural network models ).

    3)How would it be done ?
        
	A rough draft to implement neural networks would look like this. A similar approach would be followed for the other algorithms also.

        -> A function to Feed the pattern through the network and return the results
        -> A function to initialize neuron weights
        -> A function that performs the feed forward operation
        -> For testing purpose a function which returns the accuracy and mean squared error rate for the model

        Parameters that would affect the overall learning curve of the neural networks

        ->Epochs
        ->Learning rate
        ->Number of neurons in the hidden layer
        
        ANN models to be used and implemented:-

        ->Multilayer perceptron (MLP) would be trained with the gradient descent and Levenberg-Marquardt methods 

        ->Kohonen’s self-organizing map (SOM) with learning vector quantization (LVQ)

        ->Radial basis function neural network (RBF).
        
        Discussion about the various learning algorithms and the different activation function to be used will be crucial.

        The initial implementation would have a single hidden layer with one neuron in the outut layer.The activation function would be linear for the neuron of the output layer and a hyperbolic tangent activation function for the neurons of the intermediate layer.The initial weights would be initialized between +1 and -1. These all parameters would be further changed and finally the best amongst the various permutations available would be chosen on the basis of comparision between their learning curve.

		For the backend I'll initially go with redis (the Bayes implementation does support this for something to use as research) , if everything works out well then I'll stick to it but in case any problem arises then I'll switch to SQL. But better would be to go with redis as it highly scalable with larger installations.


    4)Other ideas that can further enhance the functionalities
        
		->Implementation of Bayesian Networks and then comparing the results of all the above implemented ANNs and finding a possibility to replace naive bayes with Bayesian networks.
		->Possibility to incorporate KNN in this statistical module.


Project Success Criteria:

    1)First mid term evaluation
        ->Implementation of RBF, MLP models of ANN and comparision between them.
        ->The possibility to learn from result symbols of known messages would be checked.
		->Evaluating the learning curve of different ANN models by varying the different parameters such as learning rate, momentum etc  and incorporporation of the Kohonen( Self organizinf feature maps) network.
        
    
    2)Final evaluation
		->Implementation of SVMs to tackle the type of spams in which the final score depends more on the frequency of words.
		->Markov chains implementation to regulate autolearning based on symbols’ combinations.
		->Add ability to learn from the incoming messages.
		->Evaluate the quality of implemented Markov chains and SVMs.       
		->Group of various models implemented to be used together in order to beat the overall score produced by the naive bayes plugin 
		->Incorporation of the model for the default stage in the classifier plugin that performs the best out of the above models.
    
Roadmap:
        
        1) Community Bonding Period ( April 27th - May 24th )

        ->Familiarize myself more with the source code, and learn more about the team's coding practices and development processes
        ->Discuss and finalize the basic structure of the neural nets and brainstorming about the different input parameters and models of ANNs to be used.    
		->Discussions with the SA community regarding the various types of spams that the present SA can handle.
		->Thoroughly going through the various research papers already published to classify mails as spams by using various machine learning algorithms.

        2) First Coding Period ( May 25th - June 14th)
        
        ->Doing Research work related to MLP and RBF neural networks so as to determine theoritically which is better and why for the purpose of spam filtering.
		->Brain stroming with the mentors and SA community about the various input features and parameters that can have a huge impact on the overall performance of the listed neural nets models.
        ->Discussion with the mentors regarding the possibility to learn from result symbols of known messages.
		->Exploring the possibility to use kohonen networks( SOMP ) in order to classify mails as Spams and Hams and thus in turn working out a rough sketch for all the various ANN models to be used        
        ->Looking up for Beta testers who can test the neural nets implementation.

        3) Second Coding Period ( June 15th - July 3rd)
        
        ->Implementation of both the Multi Layer perceptron model and the Radial basis function models of Artificial Neural nets.
        ->Comparsion between the learning curves of both the models by varying the different parameters and discussions with the mentors regarding the possibility to improve the overall results of the neural net models.
		->Implementation of the Kohonen networks along with evaluating it's learning curve.
        ->Brain storming about partial grouping of models with the SA community in order to acheive better results out of the models implemented till now.
		->Working on any issues that arise after the beta testing of the implemented models of neural nets.
		->Documentation of work done till now        

        4) Third Coding Phase ( July 4th - July 25th)
        
        ->Discussion and implementation of the Hidden Markov Models and SVMs models in order to enable auto learning part of this module.
        ->Adjustment of the generated meta rules according to the evidence of rules.
        ->Add ability to learn from the incoming messages;
		->Evaluation of the quality of algorithms implemented till now based on their learning curves and discussions with the SA community of further improving them;
		->Determine when to start autolearning.

        5) Fourth Coding Phase (July 26th - August 17th)
         
		->Discussion about the possibillity to incorporate Bayesian neural networks into the module and ways for implementing it.
		->Exploring the possibilty to use KNNs to classify emails as spams and hams and thus further looking out for ways to implement it.           
        ->Exploration of further ideas that would help in the efficient filtering of spams.
		->Final grouping of implemented models till now in order to provide users flexibility to choose from a set of collective models that perform differently in different circumstances. 
		

        6) Fifth Coding Phase ( August 18th - August 26th)
            
        ->Testing and code clean up.
        ->Documentation of the whole code and filing of bugs with SA bugzilla that arise after the beta testing phase.


Profit for SA:
        
		->->Development of a Statistical classifier plugin that can switch between the underneath implemented algorithms depending upon the need of various types of users.

        ->A potential reason to pick artificial neural networks (ANN) over Naive Bayes is the possibility of correlations between input variables. Naive Bayes assumes that all input variables are independent. If that assumption is not correct, then it can impact the accuracy of the Naive Bayes classifier. An ANN with appropriate network structure can handle the correlation/dependence between input variables.Therefore introducing ANN would ultimately increase the possibility of correctly classifying the mails as spam or ham.

        ->In the case of repeated spam patterns ANN do quite well as compared to the naive bayes classifier. So the ANN has an upper hand over the naive bayes classifier in this context.

        ->A Naive Bayes classifier is a simple model that describes particular class of Bayesian network - where all of the features are class-conditionally independent. Because of this, there are certain problems that Naive Bayes cannot solve and hence can be tackled by Bayesian networks.
Comment 1 Kevin A. McGrail 2018-08-25 22:37:32 UTC
More information was added to this idea for GSOC 2018 at https://issues.apache.org/jira/browse/COMDEV-268?jql=text%20~%20%22GSOC%22