This is meant to be a simple introduction to the CRISP-DM framework, one of many lifecycles used for artificial intelligence and machine learning projects. Numerous other sources cover it in more depth.
The CRISP-DM framework, the CRoss-Industry Standard Process for Data Mining, was created in 1996. The process consists of six major phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
The sequence of the phases is not strict, and moving back to an earlier phase is often necessary. The outer circle of the CRISP-DM diagram represents the cyclic nature of data mining itself. A data mining process continues after the initial problem is solved, often leading to more focused business questions or refinement of the data.
CRISP-DM Phase 1 – Business Understanding: Business understanding consists of four main steps: understanding the business requirements, analyzing supporting information, converting the requirements into a data mining problem, and preparing a preliminary plan. The business question needs to be both specific and measurable. It can then be turned into a machine learning question. For example, the business question “What customers should we target for a new product?” can be turned into the machine learning question “Would this customer buy the product or not?”. It is important to weigh the cost of creating a data mining solution against the business value of answering the question. As with all business projects, proper planning is essential and should cover risks, goals, dependencies, tools and techniques, and project duration.
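The cost-versus-value check described above can be sketched as simple arithmetic. The function name, its parameters, and every number below are hypothetical and chosen only for illustration; a real estimate would come from the business stakeholders.

```python
def net_project_value(customers, uplift_rate, profit_per_sale, project_cost):
    """Expected extra profit from better targeting, minus the cost of the solution.

    All inputs are illustrative assumptions, not figures from a real project.
    """
    expected_extra_sales = customers * uplift_rate
    return expected_extra_sales * profit_per_sale - project_cost


# Hypothetical scenario: 100,000 customers, 2% more of them buy when
# targeted well, each sale earns 40.0 in profit, the project costs 60,000.
value = net_project_value(
    customers=100_000,
    uplift_rate=0.02,
    profit_per_sale=40.0,
    project_cost=60_000.0,
)
print(f"Estimated net value: {value:.2f}")
```

A positive result suggests the question may be worth pursuing; a negative one is a signal to narrow the question or reduce the scope before moving on.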
CRISP-DM Phase 2 – Data Understanding: Data understanding has three primary steps: data collection, data properties, and data quality. The data collection step entails listing the data sources and the data to extract from them, analyzing the data for additional requirements, and determining whether any additional data source is needed. The data properties step covers the metadata of the data, the size of the data set, its key features, and the relationships between data elements, including correlations between them. The data quality step involves determining whether any data elements are missing and, if so, whether they can be removed or substituted.
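As a minimal sketch of the data properties and data quality steps, the following counts missing values per column and measures the correlation between two columns. It uses only the Python standard library; the `records` data and the helper names are hypothetical, and in practice a library such as pandas would usually do this work.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "purchases": 3},
    {"age": 45, "income": None, "purchases": 5},
    {"age": 29, "income": 41000, "purchases": 2},
    {"age": None, "income": 78000, "purchases": 8},
    {"age": 52, "income": 91000, "purchases": 9},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {key: 0 for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def column_correlation(rows, a, b):
    """Correlate two columns, using only rows where both values are present."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


print(missing_counts(records))
print(round(column_correlation(records, "income", "purchases"), 3))
```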
CRISP-DM Phase 3 – Data Preparation: Data preparation includes selecting the final data set and preparing the data. The selection should keep in mind constraints such as total size, which columns to include and exclude, which records to select, and the data types of the elements. Preparing the data may involve cleaning, transforming, merging data sets, normalizing, or formatting. The number of records matters when the data set is small: rather than dropping records with missing elements, those elements can be filled in with default values or estimated using statistical methods. It may be useful to revisit the data understanding phase after this phase is completed.
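Two of the preparation steps mentioned above, filling in missing elements with a statistical method and normalizing, might look like the following sketch. Mean imputation and min-max scaling are only one possible choice for each step, and the `ages` column is hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column with one missing value.
ages = [34, 45, 29, None, 52]
filled = impute_mean(ages)
scaled = min_max_normalize(filled)
print(filled)
print([round(v, 3) for v in scaled])
```

Mean imputation keeps every record, which helps when the data set is small, but it also reduces the variance of the column, so it is worth checking the data properties again afterwards.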
CRISP-DM Phase 4 – Modeling: Modeling, arguably the most fun phase, consists of three main steps: model selection and creation, creating a model testing plan, and parameter testing and tuning. This phase is tied to Phase 3 because model selection influences data preparation and vice versa. Further testing may reveal that the data doesn’t fit the chosen type of modeling algorithm well, in which case Phase 3 must be revisited. The first step is to choose a modeling algorithm and the tools needed to run it. For testing, the data is generally split into a training set and a test set. The split can vary depending on the data set and algorithm; a common split is 70% training and 30% test. An evaluation criterion should also be chosen at this point. The actual training can involve tuning hyperparameters to adjust the accuracy of the model or the speed of training.
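The split and tuning steps above can be illustrated with a small standard-library sketch. It assumes a toy one-feature data set and a k-nearest-neighbors classifier, both chosen here only for illustration, and it tunes the hyperparameter `k` on a validation split carved out of the training data so that the test set stays untouched until the end.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among the k nearest points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, evaluation, k):
    """Fraction of evaluation rows that k-NN labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return correct / len(evaluation)


# Hypothetical data: (feature, label) pairs, label 1 when the feature is large.
data = [(x, 1 if x >= 50 else 0) for x in range(0, 100, 5)]

# 70% training, 30% test.
train, test = train_test_split(data)

# Tune k on a validation split taken from the training data only.
fit, validation = train_test_split(train, seed=7)
best_k = max([1, 3, 5], key=lambda k: accuracy(fit, validation, k))

test_accuracy = accuracy(fit, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```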
CRISP-DM Phase 5 – Evaluation: Evaluation is where you assess how the model performs relative to the business goals defined in Phase 1 and decide whether it should be deployed. Evaluation depends on the evaluation criteria you outlined in the modeling phase. It is important to keep in mind business considerations such as the cost of false positives and false negatives, execution speed, and operating cost. Review the steps taken throughout the process to verify that all criteria are met. Finally, determine whether the model should be deployed.
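To make the cost of false positives and false negatives concrete, the sketch below counts the four outcome types for a binary classifier and weights the errors by cost. The labels, predictions, and cost figures are hypothetical; the real costs have to come from the business understanding phase.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts


def error_cost(counts, cost_fp, cost_fn):
    """Total cost of the errors, given a cost per false positive and negative."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn


# Hypothetical test-set labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])

# Assumed costs: a missed buyer (false negative) is ten times as
# expensive as a wasted offer (false positive).
total_cost = error_cost(counts, cost_fp=5.0, cost_fn=50.0)

print(counts, precision, recall, total_cost)
```

Two models with the same accuracy can have very different error costs, which is why the deployment decision should rest on the business criteria rather than on accuracy alone.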
CRISP-DM Phase 6 – Deployment: There are four steps in deployment: planning the deployment, maintenance and monitoring, the final report, and the project review. First you need to determine where the model will be deployed. On AWS, for example, there are many options, including Amazon EC2, Amazon Elastic Container Service (Amazon ECS, formerly Amazon EC2 Container Service), and AWS Lambda. Then decide how the model will be deployed and managed. Again using AWS as the example, options include AWS CodeDeploy, AWS CloudFormation, AWS OpsWorks, and AWS Elastic Beanstalk. As with all well-architected systems, monitoring system health is important. Examples on AWS include Amazon CloudWatch, AWS CloudTrail, and AWS Elastic Beanstalk. A final report is delivered to stakeholders, highlighting the processes used, whether the project goals were met, and any findings, and explaining the model used and the reasoning behind choosing it. The project review assesses what went wrong and what went right, and determines whether any parts of the process can be reused.