Four Great Tips for Data Mining

Data Mining has become a critical analytical tool in the hands of management, marketers, digital advertisers, and anyone interested in exploring hidden but precious patterns in large data sets. It is much like mining diamonds from the earth.

It is used to extract knowledge and actionable information from collected data and to extrapolate patterns from it. It is employed in a vast range of fields – from analyzing healthcare data to assess a population’s susceptibility to a pandemic to market basket analysis, from fraud detection to web scraping, from corporate budgeting to detecting the political and social mood of the populace.

By applying the latest analytics tools, we can get data-backed insights into phenomena that were previously simply ‘hunches,’ and work towards improving or fixing current processes or building a new system.

With data mining, a digital marketer can find groups of people with similar preferences to target them more effectively and accurately. Similarly, an online gaming company can identify new trends and demands of its potential users on social media and make necessary updates to its games.

As with any tool that gives a business user deeper insights, you may be drawn to overuse data mining. But it must be used judiciously, and not to twist facts to suit your needs. Therefore, to maximize your chances of success, focus on projects aligned with measurable and identifiable business goals – like enhancing customer engagement and loyalty, identifying cross-selling opportunities, or detecting digital advertising fraud.

The following are the four most important tips for knowledge discovery:

1. Collect as much data as possible from multiple sources

Data from a single source, or only from internal sources, is valuable and essential, but it will not reveal much about, say, a customer. If you limit yourself to your internal data, you may miss out on a vast amount of user activity beyond the limits of your system. If you can trace your customers’ digital footprints – their social media activities, browser searches, reviews, etc. – then analytics can reveal actionable information that results in better engagement and increased revenues. You can model customer behavior and profiles and find innovative ways to influence them, such as identifying and engaging a user’s influential associates. With the right mix of communication, you can build a favorable public perception in each target group. Choose your sources with care, though: creating a training set from the wrong sources may skew your model and make it unreliable.

2. Use a clear sampling strategy

Sampling is the art of choosing a limited number of items from a broad universe so that the sample is likely to represent that universe. Many powerful analytics tools and applications have failed because they were not given a proper sample of data to work with: it was too small, too limited in attributes, or too biased. A clear, concise, and unambiguous sampling strategy is the key to successful data mining. For example, to unearth digital advertising fraud, you would scrape data from known fake websites that have fake visitors but real ad revenues. To find the general mood of young people on climate change, you would mine data from popular social media platforms. Always set aside a “holdout sample” and do not use it in training the model. It is a reference sample used to test the model’s predictive performance and efficacy after it has been trained. Testing against it helps you catch wrong predictions and bias before the model is put to use.
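To make the holdout idea concrete, here is a minimal sketch in Python. It assumes the data is a simple list of records and that a random 80/20 split is appropriate. The function name `split_holdout`, the 20% fraction, and the fixed seed are illustrative choices, not a prescribed method.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=42):
    """Shuffle a copy of the records and set part of it aside as a holdout sample."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout]         # never shown to the model during training
    training = shuffled[n_holdout:]
    return training, holdout


# Illustrative data: 100 record IDs standing in for real rows.
records = list(range(100))
training, holdout = split_holdout(records)
```

The model would be trained on `training` only and scored once on `holdout`. If your data has rare but important groups, a plain random split like this one may under-represent them, so a stratified split may be a better fit.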

3. Make sure to use ‘throwaway modeling’

The first step in designing a model is identifying the independent variables, or predictors, from the overall set available. After gathering preliminary requirements, it is best to use throwaway modeling, or rapid modeling, to create a model for the initial presentation to the client. It will help everyone visualize the system and come up with more detailed requirements and expected outcomes. Throwaway modeling requires you to “throw in” all the data, test all alternative models, and find the best fit using a systematic selection process. It helps reduce the number of ‘reset’ and ‘redesign’ decisions at a later stage, when they are more complex and time-consuming. If you skip this step, you may miss some important relationships and patterns in your data, which can be detrimental to the project’s overall success and expected outcomes. A well-structured and intelligently built throwaway model improves overall productivity.
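As a rough illustration of “throw everything in and pick the best fit,” the sketch below fits one quick single-predictor linear model per candidate variable and ranks the candidates by their error on a holdout sample. This is a deliberately simplified stand-in for a real model search: the helper names (`fit_line`, `mse`, `rank_predictors`) and the data are made up for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:                      # constant predictor: fall back to the mean
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared error of a fitted (a, b) line on the given data."""
    a, b = model
    return mean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def rank_predictors(train, holdout, target):
    """Fit one quick model per candidate predictor and rank by holdout error."""
    results = []
    for name in train:
        if name == target:
            continue
        model = fit_line(train[name], train[target])
        results.append((mse(model, holdout[name], holdout[target]), name))
    return sorted(results)            # lowest holdout error first


# Made-up data: sales depend on ad spend, not on the day of the month.
train = {
    "ad_spend": [1, 2, 3, 4, 5, 6],
    "day_of_month": [3, 17, 9, 28, 1, 12],
    "sales": [12, 14, 16, 18, 20, 22],
}
holdout = {
    "ad_spend": [7, 8],
    "day_of_month": [5, 21],
    "sales": [24, 26],
}
ranking = rank_predictors(train, holdout, "sales")
best_error, best_predictor = ranking[0]
```

Here `best_predictor` comes out as `ad_spend`. The point of a throwaway pass like this is speed: it shows which variables deserve attention before you invest in a more careful model.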

4. Refresh your models regularly

The real world is dynamic, and the data that represents it is accurate only for a short period of time. To remain relevant, your predictive model must regularly refresh its data and remodel itself to accommodate changing realities. Even the best predictive models do not fit real-world data in perpetuity! Therefore, the refresh rate for models – the frequency at which the data is updated or added and the model is recalibrated – needs to be in sync with the system it models. The model can be updated on a regular schedule, such as weekly or daily, or based on certain triggers linked with important events. Depending on the system, you may need to factor in only the new data, or the calculations may have to be redone from scratch every time. Therefore, the correct update frequency is critical not only for the model’s validity but also for its efficiency.
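The two ideas in this tip, scheduled or triggered refreshes and incremental updates, can be sketched as follows. The function `needs_refresh`, the class `RunningMean`, and the weekly interval are hypothetical names and values chosen for illustration. A running mean stands in for any statistic that can absorb new data without a full recalculation.

```python
from datetime import datetime, timedelta


def needs_refresh(last_refreshed, now, max_age, event_flags=()):
    """Return True if the model is older than max_age or any trigger event fired."""
    too_old = now - last_refreshed >= max_age
    triggered = any(event_flags)
    return too_old or triggered


class RunningMean:
    """A statistic that factors in only the new data instead of recomputing."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, batch):
        for x in batch:
            self.count += 1
            self.value += (x - self.value) / self.count  # incremental mean update
        return self.value


last = datetime(2024, 1, 1)
weekly = timedelta(days=7)

fresh = needs_refresh(last, datetime(2024, 1, 3), weekly)        # 2 days old
stale = needs_refresh(last, datetime(2024, 1, 9), weekly)        # 8 days old
forced = needs_refresh(last, datetime(2024, 1, 3), weekly, event_flags=[True])

rm = RunningMean()
rm.update([10, 20])           # initial data
updated = rm.update([30])     # only the new observation is processed
```

Not every model can be updated incrementally like this. Where it cannot, each refresh means retraining on the full data set, which is why the refresh frequency affects efficiency as well as validity.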


In the end, it is all about people. Make sure that the insights you present are meaningful to them. When communicating your conclusions and predictions to the organization, choose the details, terms, and visuals for your presentation – pictures, graphs, patterns, word clouds, etc. – so that non-technical users can understand them. Excessive jargon and mathematical equations may be too complicated for others to understand and use in decision and policy making. Your model must have real applications in the real world. A theoretical model with no practical use may not have any takers at all. Lay out clear execution paths for all intended users, because something that cannot be demonstrated doesn’t exist!
