Intelligent Modeling Introduction

01 Data source

1. Local data file

Intelligent modeling supports TXT, CSV and other data files.


After selecting a file, you can configure the loading parameters of the data file.


Next, you can define the variable type, date format, and selection status.

Variable types can be detected automatically or configured by importing a data dictionary. The format of the data dictionary is as follows:

Name         Type         DateFormat   Used    Importance
PassengerId  Identity                  TRUE    0
Survived     Binary                    TRUE    0
Pclass       Categorical               TRUE    0
Name         Text                      FALSE   0
Sex          Binary                    TRUE    0
Age          Numerical                 TRUE    0
SibSp        Categorical               TRUE    0

2. Database

In the data source window, you can define two data source connections: JDBC and ODBC.


JDBC Datasource

ODBC Datasource


Next, you can use the configured data source to edit the SQL statement for data loading.


02 Data Exploration

1. Basic characteristics

After importing the data, the basic characteristics of the data are displayed:

The target variable is Survived (it must be set by the user); there are 12 variables and 891 records.

The type of each variable and the recommended selection status are parsed automatically.


The variable types of intelligent modeling are as follows:

Variable type          Description
Numerical variable     Variable with real number values
Single value variable  Variable containing only one category (excluding missing values)
Binary variable        Variable with only two categories (excluding missing values)
Count variable         Variable with natural number (non-negative integer) values
Categorical variable   Variable with more than two categories (excluding missing values)
ID                     Unique identifier
Time and date          Date, time or datetime variable
Long text              Variable longer than 128 bytes with a large number of distinct values

The target variable of intelligent modeling can be a binary, numerical, count, or categorical variable.


2. Statistics of discrete variables

Discrete variables include single value variables, binary variables and categorical variables.

Missing rate: the percentage of missing values among all records.
Potential: the number of distinct categories the discrete variable can take (its cardinality).
The pie chart shows the proportion of each category (an illustrative sketch of these statistics follows).
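As a rough illustration only (not the tool's internal implementation), the same statistics can be computed with pandas; the file name titanic_train.csv and the column name Pclass are assumptions for the example:

import pandas as pd

# Illustrative sketch: the file and the column "Pclass" are assumed
df = pd.read_csv("titanic_train.csv")
col = df["Pclass"]

missing_rate = col.isna().mean()                 # missing rate
potential = col.nunique(dropna=True)             # potential: number of distinct categories
proportions = col.value_counts(normalize=True)   # proportions shown by the pie chart

print(missing_rate, potential, proportions, sep="\n")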


Target variable is binary variable: frequency distribution table of grouped target

In the frequency distribution table of the grouped target, samples are grouped by classification value; for each group the number of samples, the number of positive samples, the positive sample rate, and the odds (occurrence ratio) are shown.

The positive sample of a binary target variable is the classification value with the smaller number of samples. As can be seen from the figure on the right, in this example a positive sample is a record whose target variable value is 1.

Pie chart of target variable
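As a hedged pandas sketch of such a grouped frequency table (not the tool's implementation), assuming the Titanic columns Pclass and Survived:

import pandas as pd

# Illustrative sketch: "Pclass" and "Survived" are assumed column names
df = pd.read_csv("titanic_train.csv")

grouped = df.groupby("Pclass")["Survived"].agg(count="size", positives="sum")
grouped["positive_rate"] = grouped["positives"] / grouped["count"]
# odds (occurrence ratio): positive samples divided by negative samples in each group
grouped["odds"] = grouped["positives"] / (grouped["count"] - grouped["positives"])
print(grouped)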


Target variable is binary variable: odds wrt all graph of grouped target

The odds wrt all graph shows the odds for each group of samples together with the overall odds. Classifications with fewer than 100 samples are not drawn.


Target variable is numerical variable: statistics of grouped target, statistics of grouped target graph

Grouped target statistics group the samples by categorical value and compute statistics for each group, including frequency, mean, standard deviation, median, minimum, maximum, and Z-STAT.

The grouped target statistics graph, drawn as box plots, shows the distribution of each group of samples more intuitively; the box plot also marks outliers.


3. Continuous variable statistics

Continuous variables include numerical variables, count variables, and time and date variables.

Descriptive statistics show the basic statistical information of the data.

The frequency distribution diagram includes a frequency histogram, a normal distribution curve, and a box plot.


Target variable is binary variable: descriptive statistics of grouped target

Descriptive statistics of the grouped target split the samples by target variable value, compute statistics for each group, and draw the corresponding box plots.


Target variable is binary variable: frequency distributions of grouped target

Frequency distributions of grouped target: within each interval, samples are grouped by target variable value and the frequencies are shown in different colors.


Target variable is a numerical variable: target variable correlation coefficient

Pearson correlation coefficient: describes the linear correlation between two continuous variables.

Spearman rank correlation coefficient: describes the rank (monotonic) correlation between two continuous variables.

The greater the absolute value of the correlation coefficient, the greater the correlation between the two variables.

Above is the correlation coefficient between basement area and house price. It can be seen that there is a strong correlation between the two.
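As an external sketch of these two coefficients (not the tool's code), SciPy can compute them directly; the file name house_train.csv and the column names TotalBsmtSF and SalePrice are assumptions standing in for basement area and house price:

import pandas as pd
from scipy import stats

# Illustrative sketch: file and column names are assumed
df = pd.read_csv("house_train.csv")
data = df[["TotalBsmtSF", "SalePrice"]].dropna()

pearson_r, _ = stats.pearsonr(data["TotalBsmtSF"], data["SalePrice"])    # linear correlation
spearman_r, _ = stats.spearmanr(data["TotalBsmtSF"], data["SalePrice"])  # rank correlation
print(pearson_r, spearman_r)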


Target variable is a numerical variable: single factor scatter plot

The single factor scatter plot intuitively shows the relationship between the current variable (basement area) and the target variable (house price). The yellow line is the regression line.


4. Data exploration report

The data exploration report can be exported to an Excel file. Sheet1 contains the basic information of the variables:


Sheet2 contains the correlation between each variable and the target variable.


5. Data quality report

The data quality report can be exported to a PDF file. Part of its content is shown below:


03 Preprocessing

1. Automatic preprocessing

The preprocessing of intelligent modeling is integrated into the modeling process and runs automatically with one click.


2. Preprocessing report

After modeling, you can export the model report, which describes the preprocessing actions. Part of its content is shown below:


3. Preprocessing process

(1) Check variable value ranges

The value range of every variable is checked and recorded. If the test data contains a category not seen in the training data, or a value outside the recorded range, it is handled accordingly.

(2) Time and date variable processing

All time and date variables are checked and several commonly used derived variables are created. The relationships among date variables are also checked, and multi-date linkage derived variables are created.

(3) Missing value information extraction

If the data contains missing values, the missing value pattern is extracted and recorded, and the missingness behavior is turned into derived variables for later use.

(4) Missing value filling

If the data contains missing values, they are filled with either a simple method or a personalized intelligent algorithm.

(5) Noise reduction of categorical variables

Noise that may exist in categorical variables, such as rare categories, abnormal categories, and suspected misclassifications, is handled with targeted processing.

(6) Convert categorical variables to numeric variables

Categorical variables are converted to numeric variables that can be computed on. The main methods are dummy variables and smoothing, chosen intelligently by the algorithm.

(7) Rectify deviation

For models that assume normality, highly skewed variables are mathematically transformed to bring the skewness back toward 0, satisfying the model assumption.

(8) Exception handling

Detect and identify possible outliers, and deal with them accordingly.

(9) Variable selection

Useless variables are removed to reduce time cost and model complexity.

(10) Standardization / normalization

Data is standardized or normalized to eliminate differences in scale, which benefits the optimization of neural networks and other models.

(11) Sample balancing

For binary classification data, if the ratio of positive to negative samples is seriously imbalanced, the data is balanced to a specified ratio and intelligent resampling modeling is carried out.
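All of the steps above run automatically inside intelligent modeling. Purely as a rough external analogy of steps (4), (6) and (10), a scikit-learn pipeline might look like the sketch below; the column names are assumptions and none of this reflects the tool's actual algorithms:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("titanic_train.csv")
numeric_cols = ["Age", "Fare"]          # assumed column names
categorical_cols = ["Sex", "Embarked"]  # assumed column names

numeric_pipe = Pipeline([
    ("fill", SimpleImputer(strategy="median")),   # simple missing value filling
    ("scale", StandardScaler()),                  # standardization
])
categorical_pipe = Pipeline([
    ("fill", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # dummy variables
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
X = preprocess.fit_transform(df)
print(X.shape)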


4. Manual preprocessing

Variable selection

Some irrelevant variables can be removed based on their type, for example ID and long text variables, or single value variables without missing values.

Variables can also be filtered by importance so that only the more important ones are retained. Variable importance can be imported from the data dictionary or obtained through modeling.


Derived variables

The number of family members is obtained by adding the variables "SibSp" and "Parch". It can be seen that the survival rate is higher when the family size is 1-3.

Add derived variable family

Variable family statistics


Numerical variables can be discretized into categorical variables. Taking age as an example, it is split at 0, 8, 18, 35 and 60 to generate a derived variable, on which statistics are then made.

Add derived variable AgeArea

Variable AgeArea statistics

It can be seen that the survival rate of the 0-8 age group is the highest, the young and the middle-aged differ little, and the survival rate of the elderly is the lowest.
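For readers who want to reproduce these two derived variables outside the designer, a small pandas sketch (the Titanic column names are assumed; the bin labels are arbitrary):

import pandas as pd

df = pd.read_csv("titanic_train.csv")

# family = SibSp + Parch
df["family"] = df["SibSp"] + df["Parch"]

# AgeArea: discretize Age at 0, 8, 18, 35 and 60
df["AgeArea"] = pd.cut(df["Age"], bins=[0, 8, 18, 35, 60, 120],
                       labels=["0-8", "8-18", "18-35", "35-60", "60+"])

# survival rate per group, similar to the statistics shown above
print(df.groupby("family")["Survived"].mean())
print(df.groupby("AgeArea", observed=True)["Survived"].mean())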


Preprocessing options

In the model options, you can choose whether to preprocess the data and whether to fill missing values intelligently.

If the data has already been preprocessed, you can turn off data preprocessing.

Intelligent filling fills missing values better, but it consumes more hardware resources and time, so it is not recommended when the amount of data is large. If unchecked, missing values are filled with the simple method.


04 Modeling

1. Modeling process

Traditional tools usually require professionals with a statistical background to repeatedly select algorithms and adjust model parameters before arriving at the expected model. The modeling process is as follows:


2. Intelligent modeling

Intelligent modeling requires no statistical knowledge: modeling runs with one click, and the optimization of the model combination and model parameters is carried out internally.


3. Professional modeling

Intelligent modeling exposes model parameters for professional users who are proficient with the models. Here are the general model options:


Intelligent modeling supports the binary classification algorithms shown in the figure; for each model you can set whether it is used and the number of sampling times. On the right, you can set parameter values for each model. Ordinary users can ignore these settings.


Similarly, you can set whether to use regression and multi-classification models, along with their respective parameters.

Detailed documentation of each model parameter: 《Model building》


05 Model performance

1. Model performance

Classification model: evaluation index

Intelligent modeling provides three commonly used evaluation indexes for classification models:

Evaluation Index  Description
GINI              The Gini index equals 2 * AUC - 1 and characterizes the model's ability to separate positive and negative samples.
AUC               The area under the ROC curve. The higher the AUC, the better the model.
KS                Measures the model's ability to separate positive and negative samples; the larger the KS value, the stronger the separation.
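As an illustrative computation of the three indexes (not the tool's implementation), using scikit-learn with placeholder labels and scores:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # placeholder labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])  # placeholder scores

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1                        # GINI = 2 * AUC - 1
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)                    # KS: maximum gap between TPR and FPR
print(auc, gini, ks)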

Classification model: ROC curve

The ROC curve plots the true positive rate against 1 minus the true negative rate (the false positive rate). It can be regarded as a visual display of the model's decision performance over all possible thresholds.


Classification model: Lift

Lift measures how many times better the model identifies positive samples than random selection; it is the ratio of the observed positive rate (confidence) to the expected positive rate (expected confidence).

Lift is particularly suitable for targeted marketing and other scenarios.
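A minimal sketch of a decile lift calculation, assuming lift is taken as the positive rate in each score decile divided by the overall positive rate; the data is synthetic and this is not the tool's implementation:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
score = rng.random(1000)                    # placeholder predicted probabilities
y = (rng.random(1000) < score).astype(int)  # synthetic labels

df = pd.DataFrame({"y": y, "score": score})
df["decile"] = pd.qcut(df["score"], 10, labels=False)   # 0 = lowest scores, 9 = highest

overall_rate = df["y"].mean()               # expected confidence
lift = df.groupby("decile")["y"].mean() / overall_rate
print(lift.sort_index(ascending=False))     # top deciles should show lift > 1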


Classification model: Recall

The recall graph shows how well the model finds positive samples, and is mainly used when the data is imbalanced. The cumulative recall rate is the ratio of the cumulative number of positive samples found up to each group to the total number of positive samples.


Classification model: Accuracy table

Threshold: the value used to separate positive and negative predictions.
Accuracy: the ratio of correctly predicted samples to all samples.
Precision: the ratio of true positives among samples predicted as positive.
Recall: the ratio of correctly predicted positive samples to all positive samples.
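An illustrative sketch of applying a threshold and computing these rates with scikit-learn; the arrays are placeholders:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.4, 0.1, 0.7, 0.9])

threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)

print(accuracy_score(y_true, y_pred))   # correct predictions / all samples
print(precision_score(y_true, y_pred))  # true positives / predicted positives
print(recall_score(y_true, y_pred))     # true positives / all actual positives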


Multiclassification model

When the target variable is a categorical variable, the model performance of each classification can be viewed by switching prediction values.


Regression model: true response values and transformed response values

The performance of a regression model can be evaluated on true values or on transformed values (the values after preprocessing). True values are more intuitive to read, while transformed values give a more accurate evaluation of model performance.


Regression model: evaluation index

Intelligent modeling provides six commonly used evaluation indexes for regression models:

Evaluation Index  Description
R²                1 minus the ratio of the sum of squared errors between the predicted and observed values to the sum of squared differences between the observed values and their mean.
MSE               The average of the squared deviations of the predicted values from the true values.
RMSE              The square root of MSE; it has the same order of magnitude as the true values.
GINI              The Gini coefficient of the predictions, a rank-based measure of how well the predicted values order the observed values.
MAE               The average of the absolute deviations of the predicted values from the true values.
MAPE              The average of the absolute percentage deviations of the predicted values from the true values.
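Illustrative NumPy formulas for most of these indexes (the regression GINI is tool-specific and omitted here); the arrays are placeholders:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.4])

mse = np.mean((y_pred - y_true) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_pred - y_true))
mape = np.mean(np.abs((y_pred - y_true) / y_true))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mse, rmse, mae, mape, r2)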

Regression model: residual chart

The residual is the difference between the observed value and the predicted value. The residual chart is a scatter chart with the residual as the vertical axis and any numerical variable as the horizontal axis. The yellow line is three times RMSE.

You can adjust the horizontal axis variable and the value range of the horizontal and vertical axis for further viewing.
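A minimal synthetic sketch of the residual logic described above, flagging points whose absolute residual exceeds three times the RMSE:

import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(100, 20, 200)
y_pred = y_true + rng.normal(0, 5, 200)   # small prediction errors
y_pred[10] += 60                          # one conspicuous error

residual = y_true - y_pred
rmse = np.sqrt(np.mean(residual ** 2))
print(np.where(np.abs(residual) > 3 * rmse)[0])   # indexes of points beyond 3 * RMSE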


Regression model: result comparison chart

The horizontal axis of the result comparison chart is the randomly ordered samples, and the vertical axis shows the corresponding observed and predicted values.
Blue is the observed value and red is the predicted value.


2. Model presentation

The model presentation lists the finally selected model combination and the parameter values of each model. The selected model parameters can be copied to the model options with a button, so they can be optimized further.

The final classification model and parameters of Titanic model

The final regression model and parameters of house price model


3. Variable importance

After modeling, the importance of each variable can be obtained. From the returned importance of the Titanic model, we can see that age (children first) and ticket price (higher class) are the most important factors for survival.

The role of variable importance:
1. Refer to the importance of variables and reprocess the data accordingly.
2. Combine important variables interactively to generate derived variables, such as distance / time = speed, speed * time = distance, and so on.
3. Refer to the importance of variables and make targeted suggestions to customers.

06 Prediction

1. Batch prediction

After you create the model, you can use test data for prediction.

For the binary classification model, the first column is the probability that the target variable is a positive sample.

Taking Titanic as an example, the survival probability of passenger No. 624 is predicted to be 32.984%.


For regression model, the first column is the predicted value of the target variable.

Taking the house price as an example, the price of house 1461 is predicted to be 120644.118.


When the target variable is a categorical variable, the probabilities of each target classification value (summing to 1) are displayed after prediction. For example, for the first record, the target value 2 has the highest probability, 97.402%.


Generally, the prediction data does not contain the target variable.

When the target variable is included in the prediction data, model performance can be calculated from the prediction results to evaluate the model.


2. Single prediction

In single prediction, you can drag to modify variable values and view the prediction result in real time.

The variables are arranged in descending order of importance, and the top variables usually have more influence on the prediction result. It can be seen that the survival rate of younger females is very high.


For the house price prediction model, we can see that when the basement area is dragged from 334 to 5642 (with other variables unchanged), the predicted house price increases greatly.


07 Integration solution

1. esProc External library

The esProc external library provides interface functions for intelligent modeling, which can be called from SPL. The SPL script for modeling:

  A B
1 =file("titanic_train.csv").cursor@cqt() /Create training data cursor
2 =ym_env() /Initialize environment
3 =ym_model(A2,A1) /Loading data
4 =ym_target(A3, "Survived") /Set target variable
5 =ym_build_model(A3) /Execute modeling
6 =ym_save_pcf(A5,"titanic.pcf") /Save model file
7 =ym_json(A5) /Export model information as JSON string
8 =ym_importance(A5) /Get variable importance
9 =ym_present(A5) /Get model presentation
10 =ym_performance(A5) /Get model performance
11 >ym_close(A2) /Close
A7 (model information as a JSON string):
Value
{"Importance":{"PassengerId":0,"Pclass":0,"Sex":0,"Age":0.433191…

A8 (variable importance):
Name          Importance
PassengerId   0.0
Pclass        0.0

A9 (model presentation):
name          value   properties
XGBClass…     0.815   [[max_delt...
XGBClass…     0.777   [[max_delt...

A10 (model performance):
Name    Value
GINI    0.617
AUC     0.808

For details, please refer to: 《SPL realizes automatic modeling and prediction》


After a model is created here (or with a model created by the intelligent modeling designer), the intelligent modeling external library can be called through SPL for prediction. The SPL script for prediction:

  A B
1 =ym_env() /Initialize environment
2 =ym_load_pcf("titanic.pcf") /Loading model file
3 =file("titanic_test.csv").import@cqt() /Loading prediction data
4 =ym_predict(A2,A3) /Execute prediction, return predicted result object
5 =ym_result(A4) /Get predicted result sequence table
6 =ym_json(A4) /When the prediction data contains at least 20 records, evaluate model performance on it and export the result as a JSON string
7 >ym_close(A1) /Close
A5 (predicted result sequence table):
PassengerId   Survived   Pclass   Name        Sex
624           0          3        Hansen,…    male
625           0          3        Bowen, …    male
626           0          1        Sutton, …   male
627           0          2        Kirkland…   male

A6 (model performance JSON):
Value
{"Model-Performance":"{\"GINI\":0.8369670542635659,\"AUC\":0.9184835271317829,\"KS\":0.6867732558139534,\"ROC-Data\":[\"{\\\"1-specificity\\\":\\\"0.0\\\",\\\"sensitivity\\\":\\\"0.020833333333333332\\\"}\",\"{\\\"1-…

2. Integration architecture

There are two ways to create a model:

  1. Use the intelligent modeling designer to create a model file.
  2. Call the esProc external library to create a model through SPL.

The program data source here means using an SPL program as the data source.