Data Mining

Hello TECHBUDDYS! Today the topic I want to bring to the table is “DATA MINING”. Well, shall we begin…? Let’s get to it.

First things first. What is DATA MINING?

Let’s dig into it, shall we…?

Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. It is an interdisciplinary subfield of computer science whose overall goal is to extract information (with intelligent methods) from a data set and transform it into a comprehensible structure for further use. Data mining is the analysis step of the “knowledge discovery in databases” process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

The term “data mining” is in fact a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a buzzword, frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data Mining: Practical Machine Learning Tools and Techniques with Java (which covers mostly machine learning material) was originally to be named simply Practical Machine Learning, and the term data mining was added only for marketing reasons. Often the more general terms (large-scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection and preparation nor the interpretation and reporting of results are part of the data mining step, but they do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

Process

The knowledge discovery in databases (KDD) process is commonly defined with the stages:

  1. Selection
  2. Pre-processing
  3. Transformation
  4. Data mining
  5. Interpretation/evaluation.

Many variations on this theme exist, however, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.

Polls conducted in 2002, 2004, 2007, and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA; however, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models, and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.

Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source of data is a data mart or data warehouse. Pre-processing is essential for analyzing multivariate data sets before mining. The target set is then cleaned: data cleaning removes the observations containing noise and those with missing data.
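
For instance, here is a minimal cleaning sketch in Python with pandas; the toy columns and the plausibility bound on age are assumptions made purely for illustration:

```python
# A minimal cleaning sketch; the column names and bounds are made up.
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 51, None, 29, 430],   # 430 is an implausible outlier ("noise")
    "income": [48000, 62000, 55000, None, 51000],
})

# Drop observations with missing data.
df = df.dropna()

# Remove noisy observations: here, a simple plausibility bound on age.
df = df[df["age"].between(0, 120)]

print(df)
```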

Data mining

Data mining involves six common classes of tasks:

  • Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
  • Association rule learning (dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
  • Clustering – is the task of discovering groups and structures in the data that are in some way or another “similar”, without using known structures in the data (see the sketch after this list).
  • Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as “legitimate” or as “spam”.
  • Regression – attempts to find a function that models the data with the least error; that is, it estimates relationships among data or data sets.
  • Summarization – providing a more compact representation of the data set, including visualization and report generation.
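
To make the contrast between the unsupervised and supervised tasks concrete, here is a toy scikit-learn sketch: clustering ignores the known labels, while classification learns from them. The two-blob data set and all parameters are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # known labels, used only by the classifier

# Clustering: discover groups without using known structure (y is ignored).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))

# Classification: generalize known structure (y) to new data.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("predicted class for an unseen record:", clf.predict([[4.8, 5.2]]))
```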

Overview of the PL/SQL Sample Programs

Classification

The classification programs demonstrate various preprocessing techniques and perform the following steps (a rough Python analogy is sketched after the list of programs below):

  • Build a classification model using training data
  • Display model details and settings
  • Test the model by applying the model on the test data
  • Compute test metrics, such as confusion matrix, lift, and ROC
  • Apply the model on the scoring data
  • Present apply results
  • Present ranked apply results, influenced by a cost matrix

  • dmnbdemo.sql illustrates Naive Bayes.
  • dmdtdemo.sql illustrates Decision Tree.
  • dmsvcdem.sql illustrates SVM classification.
  • dmglcdem.sql illustrates GLM classification (binary logistic regression).

The dmdtxvlddemo.sql program demonstrates cross-validation techniques for decision-tree-based classification. With minor modifications, this program can be used to perform cross-validation using other models and algorithms.
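
As a rough analogy (not the Oracle API itself), the same build/test/apply workflow might look like this in Python with scikit-learn; the synthetic data, the Naive Bayes choice, and the toy cost matrix are all assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-ins for the training, test, and scoring data.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build a classification model using training data.
model = GaussianNB().fit(X_train, y_train)

# Test the model and compute test metrics (confusion matrix, ROC AUC).
print(confusion_matrix(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Apply the model and rank results under a toy cost matrix:
# costs[actual, predicted]; false negatives cost 5x false positives here.
costs = np.array([[0, 1],
                  [5, 0]])
expected_cost = model.predict_proba(X_test) @ costs
print("cost-aware predictions:", expected_cost.argmin(axis=1)[:10])
```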

Regression

dmsvrdem.sql uses different test metrics, but otherwise performs most of the same steps used in the classification programs. Selected attributes of the input data are preprocessed (normalized).

NOTE: dmsvrdem.sql illustrates the new Automatic Data Preparation feature.

dmglrdem.sql illustrates GLM regression (multivariate linear regression).
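
A minimal regression sketch along the same lines, with normalization as the preprocessing step; the synthetic data and the use of RMSE as the test metric are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, 200)

# Normalize the inputs, then fit a support vector regression model.
reg = make_pipeline(StandardScaler(), SVR()).fit(X[:150], y[:150])

pred = reg.predict(X[150:])
print("RMSE:", mean_squared_error(y[150:], pred) ** 0.5)
```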

Anomaly detection

dmsvodem.sql illustrates one-class SVM.
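
For intuition, here is a one-class SVM sketch in scikit-learn, trained on “normal” records only; the data and the nu parameter are invented for illustration:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (100, 2))     # "typical" records
odd = np.array([[6.0, 6.0]])            # an obvious outlier

ocsvm = OneClassSVM(nu=0.05).fit(normal)  # train on normal data only
print(ocsvm.predict(np.vstack([normal[:3], odd])))  # +1 = inlier, -1 = outlier
```
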
Association

dmardemo.sql builds an association model and presents frequent itemsets and association rules as output.
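
To see what frequent itemsets and rules mean, here is a tiny market-basket sketch that computes pair support and confidence by hand; the transactions are made up, and real programs use Apriori-style algorithms instead:

```python
from collections import Counter
from itertools import combinations

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk", "butter"}]

pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):
        pair_counts[pair] += 1

n = len(baskets)
for (a, c), cnt in pair_counts.items():
    support = cnt / n                            # fraction of all baskets
    confidence = cnt / sum(a in b for b in baskets)  # rule a -> c
    print(f"{a} -> {c}: support={support:.2f}, confidence={confidence:.2f}")
```
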
Clustering

dmkmdemo.sql (k-Means) and dmocdemo.sql (O-Cluster) build clustering models and present cluster details, such as rules, centroid, and histogram for each cluster, as output. The models are scored, and the probabilities associated with each cluster are returned as output. Selected attributes of the input data are preprocessed.

NOTE: dmkmdemo.sql illustrates the new Automatic Data Preparation feature.
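
A k-Means sketch in the same spirit, reporting a centroid per cluster and scoring a new record; note that scikit-learn’s KMeans returns hard assignments rather than the per-cluster probabilities the Oracle demo describes, and the data set is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(6, 1, (60, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)              # one centroid per cluster
print("assignment for a new record:", km.predict([[5.5, 6.2]]))
```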

Feature extraction

dmnmdemo.sql builds a feature extraction model and presents model details as the output. The model is scored, and each feature ID is associated with a probability. Selected attributes of the input data are preprocessed (normalized).
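
As an analogy, here is a feature-extraction sketch using non-negative matrix factorization in scikit-learn; the input matrix is invented, and the extracted feature values here are weights, not probabilities:

```python
import numpy as np
from sklearn.decomposition import NMF

# A small non-negative matrix standing in for preprocessed input data.
X = np.abs(np.random.default_rng(0).normal(1, 0.3, (20, 6)))

nmf = NMF(n_components=3, random_state=0, max_iter=500)
W = nmf.fit_transform(X)        # per-record feature values
print("feature loadings:\n", nmf.components_)  # model details: one row per feature
print("features for first record:", W[0])
```
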
Attribute importance

dmaidemo.sql builds an attribute importance model and presents a list of important attributes as the output of model details. Selected attributes of the input data are preprocessed (binned).
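
Finally, an attribute-importance sketch that ranks inputs by mutual information with the target; this is only an analogy, since Oracle’s attribute importance uses a different (MDL-based) algorithm:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data with a few informative attributes among six.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           random_state=0)
scores = mutual_info_classif(X, y, random_state=0)
ranking = sorted(enumerate(scores), key=lambda t: -t[1])
print("attributes ranked by importance:", ranking)
```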

Well, that’s all for today. See you guyz in the next discussion. Until then take care everyone and don’t forget to smile. HAVE A GREAT DAY!!!
