Preparing modeling data means integrating raw, unconsolidated data into the data sets, and therefore the variables, that subsequent modeling requires. Data preparation is not strictly sequential: it may be repeated as the needs of the model change. Its main tasks are selecting the variables and the amount of data, standardizing the relevant variables according to the model's requirements, and handling missing values and outliers. Data preparation can be one of the most time-consuming steps in data mining and may account for more than half of the overall effort. Its main workflow consists of selecting, cleaning, constructing, integrating, and formatting data, and then preparing the final data set.

In the model-building stage, the first step is to choose the variables used in modeling. Variable selection is critical: if too many variables are selected and many of them are irrelevant, the model is weakened; conversely, if too few are selected, the model cannot fully reflect how the various attributes affect the dependent variable, which also reduces its effectiveness. The second step is to analyze the correlation between each independent variable and the dependent variable, as well as the correlations among the independent variables. An independent variable that shows no correlation with the dependent variable is irrelevant and can be eliminated directly. If the independent variables are strongly correlated with one another, modeling them directly is prone to overfitting; in that case, principal component analysis can be used to reduce dimensionality and produce new composite indicators with little mutual correlation, or similar variables can be eliminated.
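The correlation screening described above can be sketched in a few lines of Python. This is only an illustration of the "eliminate similar variables" option, not of principal component analysis; the function names `pearson` and `screen_variables`, the two thresholds, and the toy data are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def screen_variables(features, target, min_target_corr=0.3, max_pair_corr=0.9):
    """Return the names of the independent variables kept after two checks.

    features: dict mapping variable name -> list of values
    target:   list of values of the dependent variable
    Both thresholds are illustrative assumptions, not recommended values.
    """
    # Check 1: drop variables with almost no linear relation to the target.
    relevance = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    candidates = [name for name, r in relevance.items() if r >= min_target_corr]

    # Check 2: visit variables from most to least relevant and skip any
    # variable that is strongly correlated with one already kept.
    candidates.sort(key=lambda name: relevance[name], reverse=True)
    kept = []
    for name in candidates:
        if all(abs(pearson(features[name], features[k])) < max_pair_corr for k in kept):
            kept.append(name)
    return kept

# Toy data (hypothetical): x2 nearly duplicates x1, noise is unrelated to the target.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "x1": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "x2": [11.0, 19.0, 32.0, 39.0, 51.0, 60.0],
    "noise": [2.0, 1.0, 3.0, 3.0, 1.0, 2.0],
}
print(screen_variables(features, target))  # prints ['x1']
```

Pearson correlation only captures linear relationships, so a variable dropped by the first check may still matter in a nonlinear model; the thresholds would normally be chosen with the data and the model in mind.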
Finally, one or more data mining techniques are selected, and their parameters are adjusted iteratively during modeling to make the model perform as well as possible. It is often necessary to model the data with several data mining algorithms and then compare the trained results in terms of accuracy and effectiveness in order to select the most suitable algorithm.
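As a rough sketch of this comparison step, the Python below evaluates several candidate models on the same data with k-fold cross-validation and selects the one with the highest mean accuracy. The two algorithms (k-nearest neighbours and nearest centroid), the candidate values of `k`, the synthetic data, and all function names are assumptions chosen to keep the example self-contained; in practice a library such as scikit-learn would usually be used instead.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def centroid_predict(train, x):
    """Assign x to the class whose mean feature vector is closest."""
    by_class = {}
    for feats, label in train:
        by_class.setdefault(label, []).append(feats)

    def squared_distance(label):
        rows = by_class[label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        return sum((a - b) ** 2 for a, b in zip(centroid, x))

    return min(by_class, key=squared_distance)

def cv_accuracy(data, predict, folds=5):
    """Mean accuracy over `folds` train/test splits of the same data."""
    scores = []
    for i in range(folds):
        test = data[i::folds]  # every folds-th row, starting at row i
        train = [row for j, row in enumerate(data) if j % folds != i]
        hits = sum(predict(train, x) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / folds

# Synthetic two-class data (hypothetical): 50 points around each class center.
rng = random.Random(0)
data = [
    ((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)
    for label, center in (("A", 0.0), ("B", 3.0))
    for _ in range(50)
]
rng.shuffle(data)

# Candidates: one algorithm without parameters, one with three settings of k.
candidates = {"nearest centroid": centroid_predict}
for k in (1, 5, 15):
    candidates[f"kNN (k={k})"] = lambda train, x, k=k: knn_predict(train, x, k)

results = {name: cv_accuracy(data, fn) for name, fn in candidates.items()}
best = max(results, key=results.get)
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
print("selected:", best)
```

Scoring every candidate on the same folds keeps the comparison fair. Accuracy is only one possible criterion; other measures of effectiveness could be substituted in `cv_accuracy`.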