Data mining steps or phases can vary.
The exact number of steps in a data mining project varies with the practitioner, the scope of the problem, and how the steps are grouped and named. Regardless, the following steps are typical.
Defining the problem:
In my opinion, this is one of the most important steps, even though it may have little to do with the technical aspects of data mining.
Identifying business goals: What business problem are you trying to solve? Customer Acquisition? Retention? Reduce maintenance costs or operational costs? Look at some of the data mining examples to get an idea.
Identifying data mining goals: How do the selected business goals translate into specific data mining project goals? The answer to this question will show which data sets may be needed and what those data sets must contain.
Identifying required data
Once step 1 is completed, gather the required data and make sure you understand it. Are all attributes understood? What is the quality of the records and attributes? Inspect the data visually and do spot checks. This will give you an idea of how much data preparation and pre-processing may be required.
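As a rough illustration of such a spot check, the sketch below counts missing values, distinct values, and exact duplicate rows. The records, the attribute names, and the helper functions `profile` and `count_duplicates` are all hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical sample records; None or "" marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "region": "West"},
    {"customer_id": 2, "age": None, "region": "East"},
    {"customer_id": 3, "age": 51, "region": ""},
    {"customer_id": 3, "age": 51, "region": ""},  # exact duplicate of the row above
    {"customer_id": 4, "age": 29, "region": "West"},
]

def profile(rows):
    """Return, per attribute, the missing count and the number of distinct values."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v not in (None, "")]
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary

def count_duplicates(rows):
    """Count rows that repeat an earlier row exactly."""
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in seen.values())

for column, stats in profile(records).items():
    print(column, stats)
print("duplicate rows:", count_duplicates(records))
```

A high missing count or an unexpected number of distinct values on an attribute is a hint that the next step will need more cleansing effort.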
Preparing and pre-processing
This is where the grunt work starts. Select the required data from the overall collection, then cleanse and format it as needed. You may realize that you only need part of the available data for the project you or your organization scoped out in step 1. You may also need to integrate multiple data sources to prepare the final data set. Some of these sources may even be external, used to fill in attributes that your own data lacks.
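The selection, cleansing, and integration described here might look like the following minimal sketch. The customer records, the external income lookup, and the `prepare` function are made up for illustration; a real project would typically use a dedicated tool for this.

```python
# Hypothetical internal records (raw text fields) and an external lookup table.
customers = [
    {"customer_id": "1", "zip": " 94105 ", "spend": "120.50"},
    {"customer_id": "2", "zip": "10001", "spend": ""},
    {"customer_id": "3", "zip": "60601", "spend": "80"},
]
external_income_by_zip = {"94105": 95000, "60601": 72000}

def prepare(rows, income_lookup):
    """Drop incomplete records, fix formats, and enrich from an external source."""
    prepared = []
    for row in rows:
        spend_text = row["spend"].strip()
        if not spend_text:
            continue  # select only records that have the attribute we need
        zip_code = row["zip"].strip()  # cleanse stray whitespace
        prepared.append({
            "customer_id": int(row["customer_id"]),  # format text as numbers
            "spend": float(spend_text),
            # Integrate the external source; None when there is no match.
            "median_income": income_lookup.get(zip_code),
        })
    return prepared

for row in prepare(customers, external_income_by_zip):
    print(row)
```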
Modeling
The actual mining part of data mining starts with this step. Select appropriate algorithms for the task, along with the necessary parameters. Look at the data mining techniques article to get an idea of the algorithms. By this time, you will have selected a tool or tools to enhance your productivity. Using those tools, build the model and assess the initial results. Given that the end goal of data mining is prediction, the results may sometimes invalidate prior assumptions if the predictions fall outside the prior hypothesis. Modeling itself may comprise multiple steps for describing the data, as mentioned in the data mining techniques article.
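To make "build the model and assess initial results" concrete, here is a deliberately tiny sketch: a one-attribute threshold rule compared against a naive baseline. The data, the churn scenario, and the functions `accuracy` and `fit_stump` are assumptions for this example, not a recommended algorithm.

```python
# Hypothetical data: (monthly_support_calls, churned) pairs.
data = [(0, 0), (1, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1), (1, 1)]

def accuracy(predict, rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def fit_stump(rows):
    """Pick the threshold t that maximizes accuracy of the rule 'x >= t means 1'."""
    candidates = sorted({x for x, _ in rows})
    return max(candidates, key=lambda t: accuracy(lambda x: int(x >= t), rows))

threshold = fit_stump(data)
model_accuracy = accuracy(lambda x: int(x >= threshold), data)
baseline_accuracy = accuracy(lambda x: 0, data)  # always predict "no churn"

print("threshold:", threshold)
print("model accuracy:", model_accuracy)
print("baseline accuracy:", baseline_accuracy)
```

Both accuracies here are measured on the same data used to fit the rule, so they are only an initial assessment. Comparing against a baseline is a quick way to see whether the model has learned anything at all.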
Training and testing
Evaluate preliminary results and test the model on different sample data sets and review the results. Do these results across different samples correlate? Are there any inconsistencies? Keep iterating until you are satisfied with the consistency of the results.
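One common way to check consistency across samples is k-fold cross-validation, sketched below with a simple threshold model. The data and the functions `accuracy`, `fit`, and `cross_validate` are hypothetical, and the folds are formed by taking every k-th row rather than by random shuffling, to keep the example deterministic.

```python
from statistics import mean, pstdev

# Hypothetical labeled data: (feature, label) pairs, with one noisy record.
data = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1),
        (1, 1), (2, 0), (3, 1), (4, 1), (0, 0), (5, 1)]

def accuracy(threshold, rows):
    """Accuracy of the rule 'feature >= threshold means label 1'."""
    return sum(int(x >= threshold) == y for x, y in rows) / len(rows)

def fit(rows):
    """Choose the threshold with the best accuracy on the training rows."""
    return max(sorted({x for x, _ in rows}), key=lambda t: accuracy(t, rows))

def cross_validate(rows, k=3):
    """Train on k-1 folds, test on the held-out fold, and repeat for each fold."""
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        scores.append(accuracy(fit(train), test))
    return scores

scores = cross_validate(data)
print("per-fold accuracy:", scores)
print("mean:", mean(scores), "spread:", pstdev(scores))
```

A large spread between folds is the kind of inconsistency worth investigating before moving on.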
Verify and deploy
Verify the final model and plan for deployment. Think about the visualizations needed to tell the story. Remember that data mining is as much about storytelling as it is about modeling. Report the findings and operationalize the process.
I hope this article has shed some light on the steps of data mining. As I mentioned earlier, you will find that practitioners and the literature may identify as few as 3 to 4 steps or as many as 8, depending on the level of aggregation. As an example, the book Data Mining For Dummies identifies a different number of steps even though the scope is the same.
Ramesh Dontha is Managing Partner at Digital Transformation Pro, a management consulting and training organization focusing on Big Data, Data Strategy, Data Analytics, Data Governance/Quality and related Data management practices. For more than 15 years, Ramesh has put together successful strategies and implementation plans to meet/exceed business objectives and deliver business value. His personal passion is to demystify the intricacies of data related technologies and latest technology trends and make them applicable to business strategies and objectives. Ramesh can either be reached on LinkedIn or Twitter (@rkdontha1) or via email: rkdontha AT DigitalTransformationPro.com