Since the last version, Avital and I have made many enhancements to the application's capabilities.
Three main enhancements were added: more algorithms, a Single writer vs. Page writers view, and a cross-correlation view that compares all the algorithms together.
1) We added four more classification algorithms to make the application smarter and more versatile:
1. KNN – a non-parametric method used for classification. The input consists of the k closest
training examples in the feature space, and an object is classified by a majority vote of its neighbors: it is assigned to the class most common among its k nearest neighbors. k-NN is a type of instance-based learning (our notes are the instances), or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
2. Multinomial Logistic Regression – a model used to predict the probabilities of the
different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. The multinomial logistic model assumes that the data are case specific; that is, each independent variable has a single value for each case – in our setting, each writer.
• Reading machine learning articles, we learned about the weaknesses of the KNN algorithm and the strength of SVM and Logistic Regression on large-scale training sets.
These points led us to develop new integrated algorithms:
3. Combo 1 – SVM runs first to filter objects, then KNN operates on the remaining notes
4. Combo 2 – Logistic Regression runs first to filter objects, then KNN operates on the remaining notes
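A minimal sketch of the two-stage "Combo" idea, assuming scikit-learn. This shows the Combo 2 variant (Logistic Regression as the filter); Combo 1 would swap in an SVM with probability estimates. The function name `combo_predict` and the `top_k` cutoff are our illustration, not the application's actual code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def combo_predict(X_train, y_train, X_test, top_k=2, n_neighbors=3):
    # Stage 1: Logistic Regression ranks candidate writers for each test note.
    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = lr.predict_proba(X_test)  # shape (n_test, n_classes)
    preds = []
    for i, row in enumerate(proba):
        # Keep only the top_k most likely writers for this note.
        candidates = lr.classes_[np.argsort(row)[-top_k:]]
        mask = np.isin(y_train, candidates)
        # Stage 2: KNN votes among the remaining training notes only.
        knn = KNeighborsClassifier(n_neighbors=min(n_neighbors, int(mask.sum())))
        knn.fit(X_train[mask], y_train[mask])
        preds.append(knn.predict(X_test[i:i + 1])[0])
    return np.array(preds)
```

The filter shrinks the neighbor search to a few plausible writers, which is where KNN is strongest, while the linear model carries the large-scale discrimination.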
2) Using these algorithms, we can classify notes to their true writers with up to 85% success, depending on the dataset size.
Our application can also classify an entire page to discover how many writers used it and who they are.
This is useful for large-scale analysis projects and for seeing the big picture.
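The Page writers view can be summarized as aggregating per-note predictions over a page. A small sketch under that assumption; `analyze_page` is a hypothetical helper, not the application's function name.

```python
from collections import Counter

def analyze_page(note_predictions):
    """Given per-note writer predictions for one page, report how many
    distinct writers used the page and who they are, most frequent first."""
    counts = Counter(note_predictions)
    writers = [w for w, _ in counts.most_common()]
    return len(writers), writers
```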
3) In addition, we realized that a researcher has no way of knowing in advance which algorithm will achieve the best scores.
This led us to develop a cross-correlation tool that compares all the algorithms together, relative to the optimal one.
Now each researcher can use the classifier with more confidence, a basic recognition of the methods we use, and a high-level understanding of them.
As we can see above, Combo 1 and Combo 2 get better scores than the others; we expected this because of the integrated algorithms they operate.
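The "relative to the optimal" comparison can be sketched as normalizing each algorithm's accuracy by the best one; the function name and the sample numbers below are illustrative, not the tool's actual output.

```python
def relative_scores(accuracies):
    """Map {algorithm name: accuracy} to scores relative to the best
    algorithm, so the optimal one reads 1.0 and the rest are fractions."""
    best = max(accuracies.values())
    return {name: acc / best for name, acc in accuracies.items()}
```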
Moreover, we added a send function to the HTML page, which can serve as an infrastructure for Active Learning techniques [read more about the concept].
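One common Active Learning strategy such an infrastructure could feed is uncertainty sampling: send for labeling the notes the classifier is least confident about. A minimal sketch, assuming the classifier exposes class probabilities; `select_for_labeling` is our hypothetical name.

```python
import numpy as np

def select_for_labeling(proba, n=1):
    """Uncertainty sampling: given per-note class probabilities
    (shape n_notes x n_classes), return the indices of the n notes
    whose top predicted probability is lowest."""
    confidence = proba.max(axis=1)
    return np.argsort(confidence)[:n]
```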
Application Plug-ins View
The software design is still the same, except for a new script – “RMtmpsAll.pyw” – which clears all notes from the end-point folder when needed and contains functions for future enhancements, such as “TruncatedSVD” and so on.
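The clearing step could look like the sketch below; we have not reproduced “RMtmpsAll.pyw” itself, and the folder handling (files only, directories kept) is our assumption.

```python
import os

def clear_endpoint(folder):
    """Delete every file in the end-point folder, keeping subdirectories.
    Returns how many files were removed."""
    removed = 0
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed
```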
Choosing an algorithm:
• There is no default algorithm
• Once an algorithm is picked, its name appears in the “Algorithm” field
Choosing Single writer vs. Page writers view:
• The default view is the Single writer view
• Once a view is picked, its name appears on the right side of the “Learn&Predict” button label
• For deeper information and understanding, go to the source code – there is good documentation to start with
• Machine learning can be learned through “Coursera” online courses, Google, the Python scikit-learn library, the TensorFlow framework, university courses, and so on
More basic information can be found in our version 1 description
o Improvements You Can Do
▪ SQLite end-point dynamic configuration – at this point, it is hardcoded
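One way this improvement could go: read the SQLite end-point from a config file instead of hardcoding it. A minimal sketch; the config filename and the `sqlite_path` key are assumptions, not part of the current application.

```python
import json
import sqlite3

def connect_from_config(config_path):
    """Open the SQLite database whose path is stored in a JSON config
    file, e.g. {"sqlite_path": "notes.db"}, instead of a hardcoded path."""
    with open(config_path) as f:
        cfg = json.load(f)
    return sqlite3.connect(cfg["sqlite_path"])
```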
By Itay Guy and Avital Vovnoboy
Contact me with any questions
Name: Itay Guy
Department: Computer Science RBD Lab