What is Data Mining?
Simply put, data mining is the process of sifting through large data sets to identify and describe patterns and to discover relationships, with the intent of predicting future trends based on those patterns and relationships.
Why is data mining relevant now? Haven’t we been ‘mining’ data from time immemorial?
Yes and no. Data has always been analyzed to identify patterns and predict outcomes, but the volume of data that organizations deal with has exploded in recent times with the advent of big data. These large data sets make it almost impossible to identify multi-dimensional patterns using traditional techniques or tools. Data mining in its modern form, helped by newer tools and faster processing, automates the discovery of patterns, the establishment of relationships, and the building of predictive models, which makes the whole process efficient.
What are some of the specific benefits of data mining?
The broad benefits of identifying hidden patterns, establishing the resulting relationships, and building predictive models can be applied to many functions and contexts in organizations.
Specifically, customer-focused functions can mine customer data to acquire new customers, retain existing ones, and cross-sell to them. Other examples are to improve lead conversion rates and to build models that predict future sales or inform new products and services.
Financial sector companies can build fraud-detection and risk-mitigation models. Energy and manufacturing companies can build proactive maintenance and quality detection models. Retailers can build stock placement and replenishment models for their stores and assess the effectiveness of promotions and coupons. Pharmaceutical companies can mine large data sets of chemical compounds to identify agents for the treatment of diseases. A detailed review of data modelling examples is here.
What skills are needed for data mining?
Data mining sits at the intersection of statistics (the analysis of numerical data), artificial intelligence / machine learning (software and systems that perceive and learn like humans, based on algorithms), and databases. In terms of technical skills, this translates into competency in Python, R, and SQL, among others. In my opinion, a successful data miner should also have business context and knowledge, along with so-called soft skills (teamwork, business acumen, communication, etc.), in addition to these technical skills.
Why? Remember that data mining is a tool whose sole purpose is to achieve a business objective (increasing revenues or reducing costs) by accelerating predictive capabilities. Technical skill alone will not accomplish that objective without some business context. The following article from KDnuggets supports my point: data mining job advertisements mentioned terms such as team skills, business acumen, and analytics very frequently. The same article also has SQL, Python, and R at the top of its list of technical skills.
Another data point is from Meta Brown’s book “Data Mining For Dummies”, where she states:
“A data miner’s discoveries have value only if a decision maker is willing to act on them. As a data miner, your impact will be only as great as your ability to persuade someone — a client, an executive, a government bureaucrat — of the truth and relevance of the information you have to share. This means you’ve got to learn to tell a good story — not just any story, but one that honestly conveys the facts and their implications in a way that is compelling for your decision maker.”
What are some relevant data mining techniques?
As described above, data mining consists of ‘describing’ existing data sets and ‘predicting’ future behaviors. Some commonly used techniques are association learning, anomaly detection, cluster detection, classification, and regression. For a much more detailed description, please read this article.
The techniques explained below cover both of these categories.
Anomaly detection: Identifies data that deviates from typical patterns. Example: tax returns that deviate abnormally from typical returns.
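To make the idea concrete, here is a minimal sketch of one simple way to flag anomalies: a z-score test using only the Python standard library. The deduction amounts, the `flag_anomalies` name, and the threshold of 2.0 are all made up for illustration; real systems generally use more robust methods.

```python
from statistics import mean, stdev

def flag_anomalies(values, threshold=2.0):
    """Return the values that lie more than `threshold` sample standard
    deviations away from the mean (a basic z-score test)."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical deduction amounts claimed on a small set of tax returns.
deductions = [1200, 1350, 1100, 1280, 1190, 1310, 9800, 1250]
anomalies = flag_anomalies(deductions)
print(anomalies)  # only the 9800 return is flagged
```

In a sample this small, a single extreme value inflates the standard deviation itself, which is why the sketch uses a fairly low threshold.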
Association learning: Identifies data points that are closely associated with one another. Examples: Netflix movie recommendations, Amazon cross-selling opportunities, retail store coupons printed at cash registers, etc.
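As a rough illustration, association between items is often measured with "support" (how often two items appear together) and "confidence" (how often the second item appears given the first). The sketch below computes both for item pairs. The shopping baskets, the `pair_rules` name, and the 0.4 support cut-off are assumptions for this example, not taken from any real recommender.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) tuples for
    every item pair whose support is at least `min_support`."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # confidence(a -> b) = P(b in basket | a in basket)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]
rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "milk -> butter" with high confidence is the kind of signal behind a coupon printed at the cash register.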
Cluster detection: A type of pattern recognition that identifies distinct clusters or sub-categories in the data. Examples: product quality issues at a certain plant, food quality issues from a certain farm or country, etc.
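One well-known clustering method is k-means, which repeatedly assigns each point to its nearest centroid and then moves each centroid to the mean of its points. The sketch below is a bare-bones version under simplifying assumptions: the measurements are invented, the `k_means` helper is my own naming, and the centroids are naively initialised to the first k points, which real implementations avoid.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Very small k-means: returns (centroids, clusters)."""
    centroids = list(points[:k])  # naive initialisation (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical quality measurements (two readings per product).
points = [(1.0, 1.2), (5.0, 5.1), (1.1, 0.9), (0.9, 1.0), (5.2, 4.9), (4.8, 5.0)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

Here the products split into two groups of three, which in practice might prompt a closer look at what the second group has in common (for example, a shared plant).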
Classification: Working from already classified data (cases), data mining techniques can assign new cases to these predetermined classes and identify a predictive pattern based on the behavior of prior cases. Examples: how patients will respond to a particular medical treatment, or how different classes of customers (race/gender/region) respond to a mail campaign, based on prior data points.
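A simple way to see classification in action is the k-nearest-neighbours approach: a new case gets the label that is most common among the most similar prior cases. The following is only a sketch. The patient data, the feature choice (age and dosage), and the `knn_classify` name are hypothetical, and the features are left unscaled for brevity.

```python
from collections import Counter
from math import dist

def knn_classify(labelled_cases, new_case, k=3):
    """Label `new_case` by majority vote among its k nearest prior cases."""
    neighbours = sorted(labelled_cases, key=lambda case: dist(case[0], new_case))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical prior cases: (age, dosage in mg) and the observed outcome.
prior_cases = [
    ((25, 50), "responded"),
    ((30, 55), "responded"),
    ((35, 60), "responded"),
    ((60, 20), "no response"),
    ((65, 25), "no response"),
    ((70, 15), "no response"),
]
prediction = knn_classify(prior_cases, (32, 52))
print(prediction)  # responded
```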
Regression: Regression is about forecasting values based on existing data points. The models can be as simple as linear regression (just extending the previous ‘line’) or as complicated as models of interactions among multiple variables, such as logistic regression, decision trees, and neural nets. It’s not my intent to go into more detail on these techniques in this article. Some examples: social networks like Facebook or LinkedIn can predict future user engagement based on past engagement such as tagging, liking, friend requests, comments, etc. These regression-based predictive models can continue to be adjusted to weigh things differently based on how predictions differ from observed behavior.
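For the simplest case, "extending the line" can be sketched with ordinary least squares on a single variable. The weekly engagement numbers below are invented, and `fit_line` is just an illustrative helper, not how any social network actually models engagement.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical number of likes a user gave in each of five weeks.
weeks = [1, 2, 3, 4, 5]
likes = [12, 15, 19, 22, 27]
slope, intercept = fit_line(weeks, likes)

# Extend the fitted line one week into the future.
forecast_week_6 = slope * 6 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast_week_6:.1f}")
```

Refitting the line as each new week of data arrives is a very basic form of the adjustment described above.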
I hope this gives you an overview of data mining and where it can be applied. Please feel free to comment and/or share with your network.