Intelligent modeling supports TXT, CSV and other data files.
After selecting a file, you can define the parameter configuration of the data file.
Next, you can define the variable type, date format, and selection status.
Variable types can be detected automatically or configured by importing a data dictionary. The data dictionary format is as follows:
Name | Type | DateFormat | Used | Importance |
---|---|---|---|---|
PassengerId | Identity | | TRUE | 0 |
Survived | Binary | | TRUE | 0 |
Pclass | Categorical | | TRUE | 0 |
Name | Text | | FALSE | 0 |
Sex | Binary | | TRUE | 0 |
Age | Numerical | | TRUE | 0 |
SibSp | Categorical | | TRUE | 0 |
… | … | … | … | … |
In the data source window, you can define two types of data source connection: JDBC and ODBC.
Next, you can use the configured data source to edit the SQL statement for data loading.
After importing the data, the basic characteristics of the data are displayed:
The target variable is Survived (it must be set by the user); the data set has 12 variables and 891 records.
The type of each variable and its recommended selection status are detected automatically.
The variable types of intelligent modeling are as follows:
Variable type | Description |
---|---|
Numerical variable | Variable with real-number values |
Single value variable | Variable containing only one category (excluding missing values) |
Binary variable | Variable with only two categories (excluding missing values) |
Count variable | Variable with natural-number values |
Categorical variable | Variable with more than two categories (excluding missing values) |
ID | Unique identifier |
Time and date | Date, time or datetime variable |
Long text | Variable whose values are longer than 128 bytes and that has a large number of categories |
The target variables of intelligent modeling support binary variables, numerical variables, count variables and categorical variables.
Discrete variables include single value variables, binary variables and categorical variables.
Missing rate: the percentage of missing values in all data.
Potential (cardinality): the number of distinct values that a discrete variable can take.
The pie chart shows the proportion of each category.
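The two statistics above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the function names are hypothetical, the `pclass` values are made up, and treating `None` and the empty string as "missing" is an assumption.

```python
def missing_rate(values):
    """Fraction of entries that are missing (assumed: None or empty string)."""
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)


def potential(values):
    """Number of distinct non-missing values of a discrete variable."""
    return len({v for v in values if v is not None and v != ""})


# Made-up sample values, for illustration only.
pclass = [3, 1, 3, None, 2, 3, "", 1]
print(missing_rate(pclass))  # 2 missing out of 8 values
print(potential(pclass))     # distinct values 1, 2 and 3
```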
In the frequency distribution table of grouped target, samples are grouped by category value, and each group's sample count, positive sample count, positive sample rate and odds (occurrence ratio) are shown.
For a binary target variable, the positive sample is the category with fewer samples. As the figure on the right shows, the positive sample in this example is a record whose target variable value is 1.
Pie chart of target variable
The odds wrt all graph shows the odds of each group of samples alongside the total odds. Categories with fewer than 100 samples are not drawn.
Grouped target statistics group the samples by category value and report statistics for each group: frequency, average, standard deviation, median, minimum, maximum and Z-STAT.
The statistical graph of grouped target uses a box plot to show the distribution of each group of samples more intuitively. A box plot can also be used to mark outliers.
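As a rough illustration of the grouped frequency table, the sketch below computes the count, positive count, positive rate and odds per category. It assumes that odds means the number of positive samples divided by the number of negative samples; the function name and the sample values are made up for the example.

```python
from collections import defaultdict


def grouped_target_table(categories, targets, positive=1):
    """Per-category count, positive count, positive rate and odds."""
    counts = defaultdict(lambda: [0, 0])  # category -> [n, n_positive]
    for c, t in zip(categories, targets):
        counts[c][0] += 1
        if t == positive:
            counts[c][1] += 1
    table = {}
    for c, (n, pos) in counts.items():
        neg = n - pos
        table[c] = {
            "count": n,
            "positive": pos,
            "positive_rate": pos / n,
            # Assumption: odds = positives / negatives.
            "odds": pos / neg if neg else float("inf"),
        }
    return table


# Made-up sample values, for illustration only.
sex = ["male", "female", "female", "male", "male", "female", "male", "male"]
survived = [0, 1, 1, 0, 1, 0, 0, 0]
table = grouped_target_table(sex, survived)
total_odds = sum(survived) / (len(survived) - sum(survived))
for category, row in table.items():
    print(category, row)
print("total odds:", total_odds)
```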
Continuous variables include numerical variables, count variables and time date variables.
Descriptive statistics show the basic statistical information of the data.
The frequency distribution diagram includes a frequency distribution histogram, a normal distribution curve and a box plot.
Descriptive statistics of grouped target group the samples by target variable value, compute statistics for each group, and draw the corresponding box plot.
Frequency distributions of grouped target: the samples in each interval are grouped by target variable value, and the frequencies are displayed in different colors.
Pearson correlation coefficient: used to describe the linear correlation between two continuous variables.
Spearman rank correlation coefficient: used to describe the rank correlation between two continuous variables.
The greater the absolute value of the correlation coefficient, the greater the correlation between the two variables.
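The difference between the two coefficients can be seen in a small Python sketch. It uses the textbook definitions (Spearman is Pearson applied to average ranks); the function names and the sample numbers are illustrative and do not come from the tool.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up values: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y))   # below 1 because the relation is not linear
print(spearman(x, y))  # equals 1 up to rounding because the order is preserved
```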
Above is the correlation coefficient between basement area and house price. It can be seen that there is a strong correlation between the two.
The single factor scatter plot intuitively shows the joint distribution of the current variable (basement area) and the target variable (house price). The yellow line is the regression line.
The data exploration report can be exported to an Excel file. Sheet1 contains the basic information of the variables:
Sheet2 contains the correlation between each variable and the target variable.
The data quality report can be exported to a PDF file. Some of its contents are as follows:
Preprocessing is integrated into the modeling process of intelligent modeling and runs automatically with one click.
After modeling, you can export the model report, which describes the preprocessing actions. Some of its contents are as follows:
Check and record the value range of all variables. If the test data contains a category that does not appear in the training data, or a value outside the recorded range, it needs to be handled accordingly.
Check all time and date variables and create several commonly used derived variables. Check the correlation between time and date variables, and create derived variables that link multiple dates.
If the data contains missing values, the missing value pattern is extracted and recorded, and the behavior of the missing values is converted into derived variables.
If the data contains missing values, they are filled in with either a simple method or a personalized intelligent algorithm.
Noise that may exist in categorical variables, such as very rare categories, abnormal categories and suspected misclassifications, is handled in a targeted way.
Categorical variables are converted to numerical variables that models can operate on. The main methods are dummy variables and smoothing; the algorithm chooses between them automatically.
For models that assume normality, highly skewed variables are transformed mathematically to bring their skewness close to 0 so that the model assumption is satisfied.
Detect and identify possible outliers, and handle them accordingly.
Useless variables are removed to reduce the time cost and the complexity of the model.
Data is standardized or normalized to eliminate differences in scale, which benefits the optimization of neural networks and other models.
For binary data, if the proportion of positive and negative samples is seriously unbalanced, the samples are rebalanced to the specified proportion and the model is built with intelligent resampling.
Remove some irrelevant variables according to variable type, for example ID and long text variables and single value variables without missing values.
Filter variables by importance so that only the more important variables are retained. Variable importance can be imported from the data dictionary or obtained through modeling.
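Two of the preprocessing steps listed above, dummy-variable encoding and standardization / normalization, can be sketched in plain Python. This is a minimal illustration of the general techniques, not the algorithm the tool runs internally; the function names and sample values are hypothetical.

```python
import statistics


def dummy_encode(values):
    """One 0/1 column per category; returns (categories, encoded rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def standardize(values):
    """Z-score: subtract the mean, divide by the population std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Made-up sample values, for illustration only.
categories, encoded = dummy_encode([3, 1, 2, 3])
print(categories, encoded)
print(standardize([10, 20, 30]))
print(min_max([10, 20, 30]))
```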
The number of family members is obtained by adding the values of the variables "SibSp" and "Parch". It can be seen that the survival rate is higher for passengers with 1-3 family members.
Add derived variable family
Variable family statistics
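The derived variable can be reproduced outside the tool with a short Python sketch. The records below are made up for illustration (they are not the real Titanic data), and the helper names are hypothetical.

```python
from collections import defaultdict

# Made-up records, for illustration only.
passengers = [
    {"SibSp": 1, "Parch": 0, "Survived": 0},
    {"SibSp": 1, "Parch": 0, "Survived": 1},
    {"SibSp": 0, "Parch": 0, "Survived": 0},
    {"SibSp": 0, "Parch": 0, "Survived": 0},
    {"SibSp": 1, "Parch": 2, "Survived": 1},
    {"SibSp": 0, "Parch": 0, "Survived": 1},
]


def add_family(rows):
    """Add the derived variable Family = SibSp + Parch to every record."""
    for row in rows:
        row["Family"] = row["SibSp"] + row["Parch"]
    return rows


def survival_rate_by(rows, key):
    """Survival rate for each distinct value of the given variable."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        survived[row[key]] += row["Survived"]
    return {k: survived[k] / totals[k] for k in sorted(totals)}


rates = survival_rate_by(add_family(passengers), "Family")
print(rates)
```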
Numerical variables can be discretized and converted into categorical variables. Taking age as an example, it is divided into age groups at the boundaries 0, 8, 18, 35 and 60, generating a derived variable on which statistics are computed.
Add derived variable AgeArea
Variable AgeArea statistics
It can be seen that the survival rate of the 0-8 age group is the highest, the difference between the young and the middle-aged is small, and the survival rate of the old is the lowest.
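A minimal Python sketch of this kind of discretization is shown below. The boundaries 0, 8, 18, 35 and 60 come from the text; labeling each group by its lower boundary and treating the intervals as closed on the left are assumptions, as is the function name.

```python
from bisect import bisect_right

# Boundaries from the text; each group is labeled by its lower boundary.
AGE_BREAKS = [0, 8, 18, 35, 60]


def age_area(age):
    """Map an age to its group label; missing or negative ages give None."""
    if age is None or age < 0:
        return None
    # Assumption: intervals are [0, 8), [8, 18), [18, 35), [35, 60), [60, +inf).
    return AGE_BREAKS[bisect_right(AGE_BREAKS, age) - 1]


# Made-up sample ages, for illustration only.
ages = [5, 8, 34.5, 70, None]
print([age_area(a) for a in ages])
```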
In the model options, you can define whether to preprocess data and whether to fill it intelligently.
If the data has been preprocessed, you can cancel the data preprocessing.
Intelligent filling fills missing values more accurately, but it consumes more hardware resources and time, so it is not recommended when the amount of data is large. If it is unchecked, a simple filling method is used.
With traditional tools, professionals with a statistical background usually have to select algorithms and adjust model parameters repeatedly before they get the expected model. The modeling process is as follows:
Intelligent modeling tools require no statistical knowledge: modeling is done with one click, and the optimization of the model combination and model parameters is handled internally.
Intelligent modeling also exposes model parameters to professional users who are proficient in the models. Here are the general options for the model:
Intelligent modeling supports the binary classification algorithm models shown in the graph; you can set whether each model is used and its sampling times. On the right, you can set parameter values for each model. Ordinary users can ignore these settings.
Similarly, you can set whether to use regression models and multi-classification models, and their respective parameters.
Detailed documentation of each model parameter: 《Model building》
Intelligent modeling provides three commonly used evaluation indexes for classification model:
Evaluation Index | Description |
---|---|
GINI | The Gini index is numerically equal to 2 × AUC - 1. It characterizes the model's ability to distinguish positive and negative samples. |
AUC | AUC is equal to the area under ROC curve. The higher AUC is, the better the model is. |
KS | KS value is used to measure the ability of the model to distinguish positive and negative samples. The larger the KS value is, the stronger the ability of the model to distinguish positive and negative samples is. |
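
The relationship between the three indexes can be checked with a small Python sketch. It uses common textbook definitions (AUC as the fraction of positive-negative pairs ranked correctly, KS as the largest gap between the true positive rate and the false positive rate); the function names and sample scores are made up, and the tool's own computation may differ in detail.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini index derived from AUC: 2 * AUC - 1."""
    return 2 * auc(labels, scores) - 1


def ks(labels, scores):
    """Largest gap between true and false positive rate over all thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t) / n_pos
        fpr = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= t) / n_neg
        best = max(best, tpr - fpr)
    return best


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(auc(labels, scores), gini(labels, scores), ks(labels, scores))
```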
The ROC curve plots the true positive rate against the false positive rate (1 - true negative rate). It can be regarded as a visual display of a given model's performance across all possible decision thresholds.
Lift refers to the multiple of improvement obtained by using association rules. It is the ratio of confidence to expected confidence.
Lift is particularly suitable for targeted marketing and other scenarios.
The recall graph shows how well the model finds positive samples and is mainly used when the data is imbalanced. The cumulative recall rate of each group is the ratio of the cumulative number of positive samples to the total number of positive samples.
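The cumulative recall rate, together with one common way of computing lift for a scoring model, can be sketched as follows. The samples are sorted by predicted score and cut into equal groups; lift is taken here as the positive rate among the samples selected so far divided by the overall positive rate, which is an assumption about how the chart is built. The function name and data are made up, and the sketch assumes there are at least as many samples as groups.

```python
def cumulative_gain_table(labels, scores, groups=5):
    """Cumulative recall and lift after each group of top-scored samples."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    total_pos = sum(ranked)
    base_rate = total_pos / len(ranked)
    size = len(ranked) / groups
    rows = []
    for g in range(1, groups + 1):
        top = ranked[: round(g * size)]
        cum_pos = sum(top)
        rows.append({
            "group": g,
            "cumulative_recall": cum_pos / total_pos,
            # Assumed definition: positive rate so far / overall positive rate.
            "lift": (cum_pos / len(top)) / base_rate,
        })
    return rows


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
rows = cumulative_gain_table(labels, scores)
for row in rows:
    print(row)
```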
Threshold: the value used to separate positive from negative predictions.
Accuracy: the ratio of correctly predicted samples to all samples.
Precision: the ratio of correctly predicted positive samples to all samples predicted as positive.
Recall: the ratio of correctly predicted positive samples to all positive samples.
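How the threshold drives the other three values can be seen in a short Python sketch. It assumes that a sample is predicted positive when its score is greater than or equal to the threshold; the function name and data are illustrative.

```python
def threshold_metrics(labels, scores, threshold=0.5):
    """Accuracy, precision and recall at a given probability threshold."""
    # Assumption: score >= threshold means "predicted positive".
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    tn = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 0)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
low = threshold_metrics(labels, scores, threshold=0.5)
high = threshold_metrics(labels, scores, threshold=0.75)
print(low)
print(high)
```

Raising the threshold in this example trades recall for precision, which is why the threshold is usually chosen with the business scenario in mind.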
When the target variable is a categorical variable, the model performance of each classification can be viewed by switching prediction values.
The performance of a regression model can be viewed as true value performance or transformed value performance (the data values after preprocessing). The true value is more intuitive, while the transformed value evaluates model performance more accurately.
Intelligent modeling provides six commonly used evaluation indexes of regression model:
Evaluation Index | Description |
---|---|
R² | R² is 1 minus the ratio of the sum of squared errors between the predicted and observed values to the sum of squared differences between the observed values and their mean. |
MSE | The mean of the squared deviations of the predicted values from the true values. |
RMSE | The square root of MSE. The order of magnitude is the same as the true value. |
GINI | The average of the absolute value of the deviation between the predicted value and the true value. |
MAE | The average of the absolute value of the deviation between the predicted value and the true value. |
MAPE | The average of the absolute value of the deviation between the predicted value and the true value, expressed as a proportion of the true value. |
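
Most of these indexes can be computed directly from the observed and predicted values. The Python sketch below uses the standard textbook formulas for R², MSE, RMSE, MAE and MAPE; GINI is left out because its regression formula is not given here. The function name and sample values are made up, and MAPE assumes that no true value is zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """R2, MSE, RMSE, MAE and MAPE from observed and predicted values."""
    n = len(y_true)
    mean = sum(y_true) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean) ** 2 for t in y_true)
    mse = sse / n
    return {
        "R2": 1 - sse / sst,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        # Assumption: every true value is non-zero.
        "MAPE": sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n,
    }


# Made-up observed and predicted values, for illustration only.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 380]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```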
The residual is the difference between the observed value and the predicted value. The residual chart is a scatter chart with the residual as the vertical axis and any numerical variable as the horizontal axis. The yellow line is three times RMSE.
You can adjust the horizontal axis variable and the value range of the horizontal and vertical axis for further viewing.
The horizontal axis of the result comparison chart represents the randomly distributed samples, and the vertical axis shows the corresponding observed and predicted values.
Blue is the observed value and red is the predicted value.
The model presentation lists the final selected model combinations and the parameter values of each model. The selected model parameters can be copied to the model options with a button so that they can be optimized further.
The final classification model and parameters of Titanic model
The final regression model and parameters of house price model
After modeling, the importance information of each variable can be obtained. From the returned importance of Titanic model, we can see that age (children first) and ticket price (higher class) are the most important factors for survival.
The role of variable importance | |
---|---|
1 | Refer to the importance of variables and reprocess the data accordingly. |
2 | The important variables are used interactively to generate the derived variables, such as distance / time = speed, speed * time = distance and so on. |
3 | Refer to the importance of variables and make targeted suggestions to customers. |
After you create the model, you can use test data for prediction.
For the binary classification model, the first column is the probability that the target variable is a positive sample.
Taking Titanic as an example, the survival probability of passenger No. 624 is predicted to be 32.984%.
For regression model, the first column is the predicted value of the target variable.
Taking the house price as an example, the price of house 1461 is predicted to be 120644.118.
When the target variable is a categorical variable, the probability of each target category (the probabilities sum to 1) is displayed after prediction. For example, for the first record, the target value 2 has the highest probability, 97.402%.
Generally, the prediction data does not contain the target variable.
When the target variable is included in the prediction data, the performance of the model can be calculated from the prediction result to evaluate the model.
In a single prediction, you can drag a variable's value to modify it and view the prediction result in real time.
The variables are arranged in descending order of importance, and the top variables usually have more influence on the prediction result. It can be seen that the survival rate of younger females is very high.
For the house price prediction model, we can see that when the basement area is dragged from 334 to 5642 (with the other variables unchanged), the house price increases greatly.
The esProc external library provides interface functions for intelligent modeling that can be called from SPL. The SPL for modeling:
  | A | B |
---|---|---|
1 | =file("titanic_train.csv").cursor@cqt() | /Create training data cursor |
2 | =ym_env() | /Initialize environment |
3 | =ym_model(A2,A1) | /Loading data |
4 | =ym_target(A3, "Survived") | /Set target variable |
5 | =ym_build_model(A3) | /Execute modeling |
6 | =ym_save_pcf(A5,"titanic.pcf") | /Save model file |
7 | =ym_json(A5) | /Export model information as JSON string |
8 | =ym_importance(A5) | /Get variable importance |
9 | =ym_present(A5) | /Get model presentation |
10 | =ym_performance(A5) | /Get model performance |
11 | >ym_close(A2) | /Close |
Value |
---|
{"Importance":{"PassengerId":0,"Pclass":0,"Sex":0,"Age":0.433191… |
Name | Importance |
---|---|
PassengerId | 0.0 |
Pclass | 0.0 |
… | … |
name | value | properties |
---|---|---|
XGBClass… | 0.815 | [[max_delt... |
XGBClass… | 0.777 | [[max_delt... |
… | … | … |
Name | Value |
---|---|
GINI | 0.617 |
AUC | 0.808 |
… | … |
For details, please refer to: 《SPL realizes automatic modeling and prediction》
After a model has been created, whether through SPL or with the intelligent modeling designer, the external library of intelligent modeling can be called through SPL for prediction. The SPL for prediction:
  | A | B |
---|---|---|
1 | =ym_env() | /Initialize environment |
2 | =ym_load_pcf("titanic.pcf") | /Loading model file |
3 | =file("titanic_test.csv").import@cqt() | /Loading prediction data |
4 | =ym_predict(A2,A3) | /Execute prediction, return predicted result object |
5 | =ym_result(A4) | /Get predicted result sequence table |
6 | =ym_json(A4) | /When the prediction data contains at least 20 records, export the model performance evaluated on the prediction data as JSON |
7 | >ym_close(A1) | /Close |
PassengerId | Survived | Pclass | Name | Sex | … |
---|---|---|---|---|---|
624 | 0 | 3 | Hansen,… | male | … |
625 | 0 | 3 | Bowen, … | male | … |
626 | 0 | 1 | Sutton, … | male | … |
627 | 0 | 2 | Kirkland… | male | … |
… | … | … | … | … | … |
Value |
---|
{"Model-Performance":"{\"GINI\":0.8369670542635659,\"AUC\":0.9184835271317829,\"KS\":0.6867732558139534,\"ROC-Data\":[\"{\\\"1-specificity\\\":\\\"0.0\\\",\\\"sensitivity\\\":\\\"0.020833333333333332\\\"}\",\"{\\\"1-… |
There are two ways to create a model:
The program data source here refers to the SPL program as the data source.