Data mining

It is a process of finding models, interesting trends or patterns in large databases in order to guide decisions about future activities. – Fayyad

Model: a summary; it makes statements about any point in the full measurement space.

Pattern: a structure or relationship that applies to a small part of the data, or of the space in which the data exists.

4 relationship patterns:

  1. Classes
  2. Clusters
  3. Association
  4. Sequential patterns

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining).

Decision support system: identifying multiple groups in the data, which a decision support system can then use to obtain more accurate prediction results.

6 tasks:

a. Classification

b. Estimation

c. Prediction

d. Association rules: simple correlation between two or more items

e. Clustering: 

f. Description and visualization

DECISION TREE: start with simple questions

COMBINATION:

LONG TERM MEMORY (PROCESSING)

TANAGRA is free DATA MINING software for academic and research purposes. It offers several data mining methods from the areas of exploratory data analysis, statistical learning, machine learning and databases.

http://eric.univ-lyon2.fr/%7Ericco/tanagra/en/tanagra.html

http://data-mining-tutorials.blogspot.in/2013/12/tanagra-version-1450.html

EDM (Enterprise Data mining) :

 

 

Top 10 open source data mining tools


Data remains as raw text until it is mined and the information contained within it is harnessed. Mining data to make sense out of it has applications in varied fields of industry and academia. In this article, we explore the best open source tools that can aid us in data mining.

Data mining, also known as knowledge discovery from databases, is a process of mining and analysing enormous amounts of data and extracting information from it. Data mining can quickly answer business questions that would otherwise have consumed a lot of time. Some of its applications include market segmentation (identifying the characteristics of customers who buy a certain product from a certain brand), fraud detection (identifying transaction patterns that could indicate online fraud), and market basket and trend analysis (finding which products or services are frequently purchased together). This article focuses on the various open source options available and their significance in different contexts.

A brief look at mining tasks
For those who are new to data mining, let’s take a brief look at some of the common mining tasks.
Pre-processing: This involves all the preliminary tasks that can help in getting started with any of the actual mining tasks. Pre-processing could be removing anomalies and noise from the data that’s about to be mined, filling in missing values, normalising the data or compressing data using techniques like generalisation and aggregation.
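
As a rough illustration of two of the pre-processing steps just mentioned (filling in missing values and normalising), here is a minimal sketch in plain Python. The column of ages is hypothetical, and mean imputation with min-max scaling is only one of several reasonable choices; none of this comes from the tools reviewed in this article.

```python
def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalise(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, None, 40, 60]          # hypothetical column with one missing value
filled = fill_missing(ages)        # the gap becomes the mean of 20, 40 and 60
scaled = min_max_normalise(filled)
print(filled)
print(scaled)
```

In practice the imputation strategy (mean, median, a model-based estimate) should be chosen per attribute; the mean is used here only because it is the simplest to read.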

Clustering: This is partitioning a huge set of data into related sub-classes.
Classification: This is tagging or classifying data items into different user-defined categories.
Outlier analysis helps in identifying those data elements which are deviant or distant from the rest of the elements in a dataset. This can help in anomaly detection.

Associative analysis helps in bringing out hidden relationships among data items in a large data set. This can help in predicting the occurrence of a particular item in a transaction or an event whenever some other item is present. You can think of this as a conditional probability.
Regression is used to predict values of a dependent variable by constructing a model or a mathematical function out of independent variables.

Summarisation helps in coming up with a compact description for the whole data set.
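
The remark above that associative analysis can be read as a conditional probability can be made concrete with a small sketch. The baskets below are hypothetical, and the function names `support` and `confidence` are chosen for illustration; they follow the usual definitions of these measures rather than the API of any particular tool.

```python
transactions = [  # hypothetical market baskets
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Confidence of the rule "bread -> butter": of the baskets containing
# bread, which fraction also contains butter?
conf = confidence({"bread"}, {"butter"}, transactions)
print(conf)
```

Three of the four baskets contain bread and two of those also contain butter, so the estimated conditional probability is two thirds.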
Data mining is a combination of various techniques like pattern recognition, statistics, machine learning, etc. While there is a good amount of intersection between machine learning and data mining, as both go hand in hand and machine learning algorithms are used for mining data, we will restrict ourselves in this article to only those tools specialised for data mining.

Weka
Weka is free and open source software written in Java, licensed under the GNU GPL and available for Linux, Mac OS X and Windows. It comprises a collection of machine learning algorithms for data mining, and packages tools for data pre-processing, classification, regression, clustering, association rules and visualisation. It can be accessed through the Weka Explorer, the Experimenter, the Knowledge Flow and a simple CLI. Explorer is a user-friendly graphical interface for two-dimensional visualisation of mined data. It lets you import raw data from various file formats, and supports well known algorithms for different mining actions like filtering, clustering, classification and attribute selection. However, when dealing with large data sets, it is best to use the CLI, as Explorer tries to load the whole data set into main memory, causing performance issues. Weka also provides a Java API for use in applications and can connect to databases using JDBC. Weka has proved to be an ideal choice for educational and research purposes, as well as for rapid prototyping.

RapidMiner
RapidMiner is available in both FOSS and commercial editions and is a leading predictive analytics platform. Gartner, the US research and advisory firm, recognised RapidMiner and Knime as leaders in its 2016 Magic Quadrant for advanced analytics platforms. RapidMiner helps enterprises embed predictive analysis in their business processes through a user-friendly, rich library of data science and machine learning algorithms, delivered in all-in-one environments like RapidMiner Studio. Besides standard data mining features like data cleansing, filtering and clustering, the software also offers built-in templates, repeatable workflows, a professional visualisation environment, and seamless integration of languages like Python and R into workflows, which aids rapid prototyping. The tool is also compatible with Weka scripts. RapidMiner is used for business/commercial applications, research and education.

Orange
Python users playing around with data science might be familiar with Orange. It is a Python library that powers Python scripts with its rich compilation of mining and machine learning algorithms for data pre-processing, classification, modelling, regression, clustering and other miscellaneous functions. Orange also comes with a visual programming environment, and its workbench consists of tools for importing data, and for dragging and dropping widgets and links that connect the widgets to complete a workflow. The visual programming comes with an easy-to-use UI, with plenty of online tutorials for assistance. Due to the ease of programming and integration in Python, Orange can be a great starting point for novices and experts to plunge into data mining.

Knime
Knime is one of the leading open source analytics, integration and reporting platforms, and is available both as free software and in a commercial version. Written in Java and built upon Eclipse, it is accessed through a GUI that provides options to create the data flow and conduct data pre-processing, collection, analysis, modelling and reporting. A Gartner survey reveals that customers are happy with the platform’s flexibility, openness and smooth integration with other software like Weka and R. Despite the small size of the company, Knime has a large user base and an active community. It uses Eclipse’s extension mechanism to add plugins for required functionality like text and image mining. This software is ideal for enterprise use.

DataMelt
DataMelt or DMelt does much more than just data mining. It is a computational platform, offering statistics, numeric and symbolic computations, scientific visualisation, etc. To avoid digressing from our topic, I’ll restrict myself to covering only its data mining capabilities. DMelt provides data mining features like linear regression, curve fitting, cluster analysis, neural networks, fuzzy algorithms, analytic calculations and interactive visualisations using 2D/3D plots and histograms. One can play around with its IDE (integrated development environment), or its functions can be called from applications using its Java API. Both community and commercial editions of DMelt are available on Linux, Mac OS, Windows and Android platforms. DMelt is a successor to the jHepWork and SCaVis programs, which some people working in data analysis might be familiar with. This software is well suited for students, engineers and scientists.

Apache Mahout
Mahout is primarily a library of machine learning algorithms that can help in clustering, classification and frequent pattern mining. It can be used in a distributed mode that helps easy integration with Hadoop. Mahout is currently being used by some of the giants in the tech industry like Adobe, AOL, Drupal and Twitter, and it has also made an impact in research and academics. It can be a great choice for anyone looking for easy integration with Hadoop and to mine huge volumes of data.

ELKI
ELKI is open source software written in Java and licensed under AGPLv3. This software focuses especially on cluster analysis and outlier detection with a compilation of numerous algorithms from both these domains. The software is accessed through a GUI that displays the results once the selected algorithm is run. ELKI’s design goals are performance, scalability, completeness, extensibility and a modular design to welcome contributions. ELKI currently doesn’t offer professional support and the software is optimised for use in science and research. Hence, this option works best for those in research.

MOA
Massive Online Analysis (MOA), as the name suggests, is primarily data stream mining software that is well suited for applications that need to handle volumes of real-time data streams at a high speed. MOA is distributed under GNU GPL, and can be used via the command line, GUI or Java API. It is a rich compilation of machine learning algorithms and has proved to be a great choice during the design of real-time applications. Stream mining algorithms typically require faster computations without storing all of the datasets in the memory and have to get the work done within a limited time. MOA is well suited for these requirements. Weka and MOA can be closely linked to each other and either of the classifiers can be called from the other one. For those looking to analyse and mine information from real-time data, MOA can be the best choice.

KEEL
KEEL (Knowledge Extraction based on Evolutionary Learning) is a Java based open source tool distributed under GPLv3. It is powered by a well-organised GUI that lets you manage (import, export, edit and visualise) data in different file formats, and experiment with the data (through its data pre-processing, statistical libraries and some standard data mining and evolutionary learning algorithms). Since KEEL is based on Java, a JVM has to be installed on the system to run its GUI and do data mining experiments. You may visit http://keel.es/ for the complete list of supported algorithms. KEEL is ideal for research and educational purposes. It serves as a useful aid for teachers.

Rattle
Rattle, expanded to ‘R Analytical Tool To Learn Easily’, has been developed using the R statistical programming language. The software can run on Linux, Mac OS and Windows, and features statistics, clustering, modelling and visualisation with the computing power of R. Rattle is currently being used in business, commercial enterprises and for teaching purposes in Australian and American universities.

The tools discussed so far are not the only ones available; the list keeps growing. While I have covered only tools meant exclusively for mining data, there are other machine learning, NLP and data analytics tools that could aid in mining, like scikit-learn, NLTK, GraphLab, Neural Designer, Pandas and SPMF, which readers could explore.

Tanagra website statistics for 2017

 
The year 2017 ends, 2018 begins. I wish you all a very happy year 2018.

A small statistical report on the website for 2017. All sites (Tanagra, course materials, e-books, tutorials) have been visited 222,293 times this year, i.e. 609 visits per day.

Since February 1st, 2008, the date on which I installed the Google Analytics counter, there have been 2,33,371 visits (644 daily visits).

Who are you? The majority of visits come from France and the Maghreb. Next comes a large share from other French-speaking countries, notably because some pages are exclusively in French. Among non-francophone countries, we mainly see the United States, India, the UK, Germany, …

39 new course materials and tutorials were posted online this year: 18 in French language, 21 in English.

The pages containing course materials about Data Science and Programming (R and Python) are the most popular ones. This is not really surprising.

Happy New Year 2018 to all.

Ricco.
Slideshow: Website statistics for 2017

 

Tuesday, January 2, 2018

Sparse data file format

 
The data to be processed with machine learning algorithms are increasing in size, especially when we need to process unstructured data. Data preparation (e.g. the use of a bag-of-words representation in text mining) leads to large data tables where, often, the number of columns (descriptors) is higher than the number of rows (observations), with the particularity that the table contains many zero values. In this context, storing all these zero values in the data file is wasteful. A lossless data compression strategy must be implemented, and it must remain simple so that the file is still readable with a text editor.

In this tutorial, we describe the use of the sparse data file format handled by Tanagra (from version 1.4.4). It is based on the file format processed by well-known machine learning libraries (svmlight, libsvm, libcvm). We show its use in a text categorization process applied to the Reuters database, well known in data mining. We will observe that this kind of sparse format dramatically reduces the data file size.
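
To give an idea of what such a sparse file looks like, here is a minimal sketch of the svmlight/libsvm convention, in which each row is written as a label followed by `index:value` pairs for the non-zero cells only. The helper names and the sample row are hypothetical, and the exact variant read by Tanagra may differ in its details; the tutorial PDF is the reference.

```python
def to_sparse_line(label, row):
    """Encode one dense row as 'label index:value ...', skipping zeros.

    Indices are 1-based, as in the svmlight/libsvm convention.
    """
    pairs = [f"{i}:{v}" for i, v in enumerate(row, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def from_sparse_line(line, n_columns):
    """Decode a sparse line back into (label, dense row)."""
    label, *pairs = line.split()
    row = [0] * n_columns
    for pair in pairs:
        index, value = pair.split(":")
        row[int(index) - 1] = float(value)
    return label, row


dense = [0, 0, 3, 0, 0, 0, 1, 0]   # hypothetical bag-of-words counts
line = to_sparse_line("+1", dense)
print(line)
print(from_sparse_line(line, len(dense)))
```

Only two of the eight cells are written out here; on a real term-document table with thousands of mostly empty columns the saving is proportionally much larger.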

Keywords: sparse dataset, dense dataset, attribute-value table, support vector machine, svm, libsvm, c-svc, logistic regression, tr-irls, scoring, roc curve, auc, area under curve
Components: VIEW DATASET, CONT TO DISC, UNIVARIATE DISCRETE STAT, SELECT FIRST EXAMPLES, C-SVC, SCORING, ROC CURVE
Tutorial: en_Tanagra_Sparse_File_Format.pdf
Dataset: reuters.data.zip
References:
T. Joachims, “SVMlight: Support Vector Machine“.
UCI Repository,  “Reuters-21578 Text Categorization Collection“.

 

Friday, December 29, 2017

Configuration of a multilayer perceptron

 
The multilayer perceptron is one of the most popular neural network approaches for supervised learning, and it is very effective provided we know how to determine the number of neurons in the hidden layers.

In this tutorial, we will try to explain the role of neurons in the hidden layer of the multilayer perceptron (when we have one hidden layer). Using an artificial toy dataset, we show the behavior of the classifier when we modify the number of neurons.

We work with Tanagra in a first step. Then, we use R (nnet package) to write a program that automatically determines the right number of neurons in the hidden layer.
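
To illustrate why the number of hidden neurons matters, here is a small sketch that is not taken from the tutorial: a network with two hidden neurons and hand-set (not learned) weights that computes the XOR function, which no single linear separator can represent. Each hidden neuron draws one straight line in the input space, and the output neuron combines the two half-planes. The weights and the step activation are assumptions chosen for readability.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0


def mlp_xor(x1, x2):
    # Hidden neuron 1 fires when at least one input is on (x1 OR x2).
    h1 = step(x1 + x2 - 0.5)
    # Hidden neuron 2 fires only when both inputs are on (x1 AND x2).
    h2 = step(x1 + x2 - 1.5)
    # The output fires for "OR but not AND", i.e. exclusive or.
    return step(h1 - h2 - 0.5)


outputs = [mlp_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

With a single hidden neuron only one line is available and the four points cannot be separated correctly, which is the kind of behavior the tutorial explores on its own artificial dataset.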

Keywords: neural network, perceptron, multilayer perceptron, MLP
Components: MULTILAYER PERCEPTRON, FORMULA
Tutorial: Configuration of a MLP
Dataset: artificial2d.zip
References:
Tanagra Tutorials, “Single layer and multilayer perceptron (slides)“, September 2014.
Tanagra Tutorials, “Multilayer perceptron – Software comparison“, November 2008.

 

Wednesday, October 25, 2017

CDF and PPF in Excel, R and Python

 
 How to compute the cumulative distribution functions and the percent point functions of various commonly used distributions in Excel, R and Python.

I use Excel (in conjunction with Tanagra or Sipina), R and Python for the practical classes of my courses about data mining and statistics at the University. Often, I ask students to perform hypothesis tests or to calculate confidence intervals, etc.

Since we work on computers, it is obviously out of the question to use statistical tables to obtain the quantiles or p-values of the commonly used distributions. In this tutorial, I present the main functions for the normal distribution, Student’s t-distribution, the chi-squared distribution and the Fisher-Snedecor distribution. I realized that students sometimes find it difficult to match what they read in statistical tables with the corresponding functions in software, which they have trouble identifying. It is also an opportunity for us to verify the equivalences between the functions provided by Excel, R (stats package) and Python (scipy package). Whew! At least on the few illustrative examples given in our document, the results are consistent.
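
As a quick illustration of the CDF/PPF pair, here is a minimal sketch for the standard normal distribution only. It uses `statistics.NormalDist` from the Python standard library rather than the scipy package used in the tutorial, so that it runs without any installation; the observed statistic of 2.5 is a made-up value.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# CDF: probability that a standard normal variable is at most 1.96.
p = std_normal.cdf(1.96)

# PPF (quantile function): the value z such that P(Z <= z) = 0.975.
q = std_normal.inv_cdf(0.975)

# Two-sided p-value for a hypothetical observed z statistic of 2.5.
z_obs = 2.5
p_value = 2 * (1 - std_normal.cdf(abs(z_obs)))

print(round(p, 4), round(q, 4), round(p_value, 4))
```

The CDF and the PPF are inverse functions of each other, which is why reading a table "forwards" gives a probability and reading it "backwards" gives a quantile. The t, chi-squared and Fisher-Snedecor distributions are not available in the standard library and need a package such as scipy.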

Keywords: excel, r, stats package, python, scipy package, p-value, quantile, cdf, cumulative distribution function, ppf, percent point function, quantile function
Tutorial: CDF and PPF

 

Wednesday, October 18, 2017

The “compiler” package for R

 
It is widely agreed that R is not a fast language, notably because it is an interpreted language. To overcome this issue, some solutions exist which allow us to compile functions written in R. The gains in computation time can be considerable, but they depend on our ability to write code that can benefit from these tools.

In this tutorial, we study the efficiency of Luke Tierney’s “compiler” package, which is provided in the base distribution of R. We program two standard data analysis treatments, each (1) with and (2) without loops: the scaling of the variables in a data frame, and the calculation of a correlation matrix by matrix product. We compare the efficiency of the non-compiled and compiled versions of these functions.

We observe that the gain from compilation is dramatic for the version with loops, but negligible for the second variant. We also note that, in the R 3.4.2 version used, it is not necessary to explicitly compile functions containing loops, because a JIT (just-in-time compilation) mechanism already ensures maximal performance for our code.

Keywords: package compiler, cmpfun, byte code, package rbenchmark, benchmark, JIT, just in time
Tutorial: en_Tanagra_R_compiler_package.pdf
Program: compilation_r.zip
References :
Luke Tierney, “A Byte Code Compiler for R“, Department of Statistics and Actuarial Science, University of Iowa, March 30, 2012.
Package ‘compiler’ – “Byte Code Compiler”

 

Monday, October 9, 2017

Regression analysis in Python

 
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

In this tutorial, we will try to identify the possibilities of StatsModels by conducting a case study in multiple linear regression. We will discuss: the estimation of model parameters using the ordinary least squares method, the implementation of some statistical tests, the checking of the model assumptions by analyzing the residuals, the detection of outliers and influential points, the analysis of multicollinearity, and the calculation of the prediction interval for a new instance.
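
For readers who want to see what the ordinary least squares estimation boils down to, here is a minimal sketch in plain Python. It is deliberately reduced to a single predictor and does not use StatsModels; the data are hypothetical and the function name `ols_fit` is chosen for illustration.

```python
def ols_fit(x, y):
    """Least-squares intercept and slope for the model y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical predictor
y = [2.1, 3.9, 6.2, 7.8, 10.0]   # hypothetical response

a, b = ols_fit(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Coefficient of determination: R^2 = 1 - SSE / SST.
sse = sum(r ** 2 for r in residuals)
mean_y = sum(y) / len(y)
sst = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - sse / sst
print(a, b, r_squared)
```

The residuals computed here are the starting point for the diagnostics listed above (assumption checks, outliers, influential points); with several predictors the same idea is expressed with matrices, which is what a library handles for us.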

Keywords: regression, statsmodels, pandas, matplotlib
Tutorial: en_Tanagra_Python_StatsModels.pdf
Dataset and program: en_python_statsmodels.zip
References:
StatsModels: Statistics in Python

 

Thursday, October 5, 2017

Document classification in Python

 
The aim of text categorization is to assign documents to predefined categories as accurately as possible. We are within the supervised learning framework, with a categorical target attribute, often binary. The originality lies in the nature of the input attribute, which is a textual document. It is not possible to apply predictive methods directly; it is necessary to go through a data preparation phase.

In this tutorial, we describe a text categorization process in Python, mainly using the text mining capabilities of the scikit-learn package, which also provides the data mining methods (logistic regression). We want to classify SMS messages as “spam” (malicious) or “ham” (legitimate). We use the “SMS Spam Collection v.1” dataset.
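
The data preparation phase mentioned above usually means turning each message into a row of term counts (the bag-of-words representation). Here is a minimal sketch in plain Python; it does not use scikit-learn, the two messages are invented rather than taken from the SMS Spam Collection, and the tokenizer is a deliberately crude assumption.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


messages = [  # hypothetical SMS texts
    "Win a free prize now",
    "Are you free for lunch",
]
vocab, X = bag_of_words(messages)
print(vocab)
print(X)
```

The resulting table has one column per distinct term, which is why dimensionality reduction and variable selection appear among the keywords of this tutorial.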

Keywords: text mining, document categorization, corpus, bag of words, f1-score, recall, precision, dimensionality reduction, variable selection, logistic regression, scikit learn, python
Tutorial: Spam identification
Dataset: Corpus and Python program
References:
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A., “Contributions to the Study of SMS Spam Filtering: New Collection and Results”, in Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG’11), Mountain View, CA, USA, 2011.

 

Thursday, September 28, 2017

SVM: Support Vector Machine in R and Python

 
This tutorial completes the course material devoted to the Support Vector Machine approach (SVM).

It highlights two important dimensions of the method: the position of the support points and the definition of the decision boundaries in the representation space when we construct a linear separator; and the difficulty of determining the “best” values of the parameters for a given problem.

We will use R (“e1071” package) and Python (“scikit-learn” package).

Keywords: svm, e1071 package, R software, Python software, scikit-learn package, sklearn
Tutorial: SVM – Support Vector Machine
Dataset and programs: svm_r_python.zip
References:
Tanagra Tutorial, “Support Vector Machine“, May 2017.
Tanagra Tutorial, “Implementing SVM on large dataset“, July 2009.

 

Monday, September 11, 2017

Association rule learning with ARS

 
SIPINA is known for its decision tree induction algorithms. In fact, the distribution includes two other tools that are little known to the public: REGRESS, which specializes in multiple linear regression and which we described in one of our tutorials; and an association rule extraction tool, called simply Association Rule Software (ARS).

In this tutorial, I describe the use of the ARS tool. Its main advantage is its integration with the Excel spreadsheet. We launch the software from Excel using the “sipina.xla” add-in, and we can easily retrieve the mined rules in the spreadsheet. We can then explore them using Excel’s data handling capabilities. The ability to filter and sort rules according to different criteria is a great help in detecting interesting rules. This is a very important aspect because the profusion of rules can quickly confuse the data miner.

Keywords: ARS, association rule software, excel spreadsheet, filtering and sorting rules, interestingness measures
Components: ASSOCIATION RULE SOFTWARE
Tutorial: en_Tanagra_Association_Sipina.pdf
Dataset: market_basket.zip
References:
Tanagra Tutorial, “Association rule learning (slides)“, August 2014.

 

Friday, August 25, 2017

Linear classifiers

 
In this tutorial, we study the behavior of 5 linear classifiers on artificial data. Linear models are often the baseline approaches in supervised learning. Indeed, based on a simple linear combination of predictive variables, they have the advantage of simplicity: the reading of the influence of each descriptor is relatively easy (signs and values of the coefficients); learning techniques are often (not always) fast, even on very large databases. We are interested in: (1) the naive bayes classifier; (2) the linear discriminant analysis; (3) the logistic regression; (4) the perceptron (single-layer perceptron); (5) the support vector machine (linear SVM).
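
As an illustration of one of the five methods, here is a minimal sketch of the single-layer perceptron learning rule in plain Python. The experiment in the tutorial is written in R; this sketch is not a port of it, the four training points are hypothetical, and the function names are chosen for illustration.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Rosenblatt's rule: update the weights only on misclassified points.

    Labels must be +1 or -1. Returns (weights, bias).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # wrong side of, or exactly on, the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # a full pass without mistakes: stop
            break
    return w, b


def predict(w, b, x):
    """Class given by the sign of the linear combination w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


X = [(2.0, 1.0), (3.0, 2.0), (-1.0, -1.0), (-2.0, -3.0)]  # hypothetical points
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

The learned weights can be read exactly as described above: their signs and magnitudes show how each descriptor pushes the decision towards one class or the other.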

The experiment was conducted in R. The source code accompanies this document. My idea, besides the theme of linear classifiers that concerns us, is also to describe the different stages of setting up an experiment for the comparison of learning techniques. In addition, we also show the results provided by the linear approaches implemented in various tools such as Tanagra, Knime, Orange, Weka and RapidMiner.

Keywords: linear classifier, naive bayes, linear discriminant analysis, logistic regression, perceptron, neural network, linear svm, support vector machine, decision tree, rpart, random forest, k-nn, nearest neighbors, e1071 package, nnet package, rf package, class package
Components : NAIVE BAYES CONTINUOUS, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, MULTILAYER PERCEPTRON, SVM
Tutorial: en_Tanagra_Linear_Classifier.pdf
Programs and dataset: linear_classifier.zip
References:
Wikipedia, “Linear Classifier“.

 

Friday, August 18, 2017

Discriminant analysis and linear regression

 
Linear discriminant analysis and linear regression are both supervised learning techniques. But, the first one is related to classification problems i.e. the target attribute is categorical; the second one is used for regression problems i.e. the target attribute is continuous (numeric).

However, there are strong connections between these approaches when we deal with a binary target attribute. From a practical example, we describe the connections between the two approaches in this case. We detail the formulas for obtaining the coefficients of discriminant analysis from those of linear regression.

We perform the calculations under Tanagra and R.

Keywords: linear discriminant analysis, predictive discriminant analysis, multiple linear regression, wilks’ lambda, mahalanobis distance, score function, linear classifier, sas, proc discrim, proc stepdisc
Components: LINEAR DISCRIMINANT ANALYSIS, MULTIPLE LINEAR REGRESSION
Tutorial: en_Tanagra_LDA_and_Regression.pdf
Programs and dataset: lda_regression.zip
References:
C.J. Huberty, S. Olejnik, « Applied MANOVA and Discriminant Analysis », Wiley, 2006.
R. Tomassone, M. Danzart, J.J. Daudin, J.P. Masson, « Discrimination et Classement », Masson, 1988.

 

Friday, August 11, 2017

Gradient boosting with R and Python

 
This tutorial follows the course material devoted to “Gradient Boosting”, to which we refer constantly in this document. It also complements the course materials and tutorials on the Bagging, Random Forest and Boosting approaches (see References).

The outline is simple: after importing the data, which have been split in advance into two data files (learning and testing), we build predictive models and evaluate them. The test error rate is used to compare the performance of the various classifiers.

The question of parameters, particularly sensitive in the context of gradient boosting, is studied. Indeed, there are many parameters, and their influence on the behavior of the classifier is considerable. Unfortunately, even if we can guess which directions to explore to improve the quality of the models (more or less regularization), accurately identifying the parameters to modify and setting the right values is difficult, especially because the parameters can interact with each other. Here, more than for other machine learning methods, the trial and error strategy is very important.
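
To make the role of two of these parameters concrete (the number of trees and the learning rate), here is a small sketch of gradient boosting for squared loss with one-split regression stumps, in plain Python. It is not the code of the tutorial, which relies on R and Python packages and on classification data; the toy data and all function names are assumptions chosen for illustration.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep the right side non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def gradient_boost(x, y, n_trees, learning_rate):
    """Boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return base, stumps, predictions


def mse(y, predictions):
    """Mean squared error between targets and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


x = [1, 2, 3, 4, 5, 6, 7, 8]                       # hypothetical toy data
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

for n_trees, rate in [(5, 0.1), (50, 0.1), (5, 0.5)]:
    _, _, fitted = gradient_boost(x, y, n_trees, rate)
    print(n_trees, rate, round(mse(y, fitted), 4))
```

The printed errors are training errors, so they only show how the two parameters trade off against each other; choosing values that generalize requires a test set, as in the protocol described above.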

We use R and Python with their appropriate packages.

Keywords: gradient boosting, R software, decision tree, adabag package, rpart, xgboost, gbm, mboost, Python, scikit-learn package, gridsearchcv, boosting, random forest
Tutorial: Gradient boosting
Programs and datasets: gradient_boosting.zip
References:
Tanagra tutorial, “Gradient boosting – Slides“, June 2016.
Tanagra tutorial, “Bagging, Random Forest, Boosting – Slides“, December 2015.
Tanagra tutorial, “Random Forest and Boosting with R and Python“, December 2015.

 

Friday, August 4, 2017

Statistical analysis with Gnumeric

 
The spreadsheet is a valuable tool for the data scientist. This is what the annual KDnuggets polls have shown in recent years, where the Excel spreadsheet is always well placed. In France, this popularity is largely confirmed by its almost systematic presence in job postings related to data processing (statistics, data mining, data science, big data/data analytics, etc.). Excel is specifically referred to, but this success must be viewed as an acknowledgment of the skills and capabilities of spreadsheet tools in general.

This tutorial is devoted to the Gnumeric Spreadsheet 1.12.12. It has interesting features: the setup and installation programs are small because it is not part of an office suite; it is fast and lightweight; it is dedicated to numerical computation and natively incorporates a “statistics” menu with the common statistical procedures (parametric tests, non-parametric tests, regression, principal component analysis, etc.); and it seems more accurate than some popular spreadsheet programs. These last two points caught my attention and convinced me to study it in more detail. In the following, we give a quick overview of Gnumeric’s statistical procedures. Where possible, we compare the results with those of Tanagra 1.4.50.

Keywords: gnumeric, spreadsheet, descriptive statistics, principal component analysis, pca, multiple linear regression, wilcoxon signed rank test, welch test unequal variance, mann and whitney, analysis of variance, anova
Tanagra components:  MORE UNIVARIATE CONT STAT, PRINCIPAL COMPONENT ANALYSIS, MULTIPLE LINEAR REGRESSION, WILCOXON SIGNED RANKS TEST, T-TEST UNEQUAL VARIANCE, MANN-WHITNEY COMPARISON, ONE-WAY ANOVA
Tutorial: en_Tanagra_Gnumeric.pdf
Dataset : credit_approval.zip
References :
Gnumeric, “The Gnumeric Manual, version 1.12”.
