This article provides an overview of propensity models and similar mathematical models. We will explain them in three simple steps: self-education, formal citation, and an example from advertising, the three-body problem. We will also illustrate how these models are used in HCL's Lab21 projects, which are large-scale data science efforts, and how we as a team review the current literature before any new undertaking.
First of all, self-education on basic concepts is necessary for any new project. If the customer's domain is highly specialized, such as a STEM field, this will take up a significant portion of the project time. Wiki-type resources are crowd-sourced, so you should never use them as formal references; however, they can help you gain a high-level understanding of the terminology and jargon of the domain you are working in. Blog posts from academic researchers are usually better sources, but please do not blindly use these as formal references either, as they can be difficult to cite properly. Meta-research, or research on the best way to research, provides a framework for avoiding many common pitfalls when quickly iterating on and prototyping a propensity problem.
Formal citations and references are necessary to avoid miscommunication between team members and potential product users. Science journals and publications are obviously great resources, but so are business journals, social science journals, Google Scholar, The Economist, HighWire Press (Stanford), and many more that are available for free. While academic papers have many great qualities as references, business journals can be particularly helpful when deciding on the practical implementation of your solution, and may give additional guidance on the end user's interaction with the product, whether that is a website, an interactive report, or a database. Please refer to the Oracle documentation on implementing some classification propensity models.
A recent research area at Lab21 was advertising-based systems and search engines. One of the most common issues in advertising is the 'Three Body Problem', which is distinct from both the science fiction novel by Cixin Liu and the classical three-body problem of celestial mechanics. The Three Body Problem in search engine advertising can be explained by looking at the three groups affected by the search engine and its revenue system. First, search engine advertising has to make the search engine company money. Second, it has to keep search users happy: include too many ads and they may become frustrated and stop using your search engine. Third, it has to generate ROI for the advertisers who are paying to have their ads seen on the results page. Every domain, whether advertising or medical science, has real-life problems that cannot be understood without copious research on your data.
Propensity modeling will require a great deal of feature engineering. Please refer to the R tutorial link for an in-depth overview of feature engineering for propensity modeling. Here at Lab21 we usually try to do as much feature engineering in the database as we can. It is very convenient to have a single script that creates all the variables you want to use; when you pass that new dataset around, everyone can start working on a small portion of it. Be prepared to break your data into small chunks and put them back together once the common trends and relationships between the features are discovered. If you look at the winning teams in data science competitions, such as BellKor's Pragmatic Chaos, who won the now-famous 2009 Netflix Prize, many of them used methods similar to BellKor's "straightforward statistical linear models with a lot of data conditioning". If you are looking for quality code and ideas for feature engineering, data science competition platforms such as Kaggle make much of their code publicly available and often have great examples and explanations of which feature engineering techniques are useful for the particular statistical properties of your data.
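As a minimal sketch of that "one script that creates all the variables" approach, here is how per-customer features for a propensity model might be derived with pandas. The table, column names, and RFM-style features (recency, frequency, monetary) are hypothetical illustrations, not taken from any Lab21 project; in practice the same aggregations could be pushed into the database as SQL.

```python
import pandas as pd

# Hypothetical transaction log; the data and column names are invented
# purely for illustration.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [20.0, 35.0, 5.0, 12.0, 8.0, 100.0],
    "days_ago":    [3, 40, 1, 2, 90, 10],
})

# One script derives all candidate features, so the whole team can work
# from the same engineered dataset.
features = tx.groupby("customer_id").agg(
    frequency=("amount", "size"),    # number of purchases
    monetary=("amount", "sum"),      # total spend
    recency=("days_ago", "min"),     # days since most recent purchase
).reset_index()

print(features)
```

The resulting `features` table, one row per customer, is the kind of dataset that can be "passed around" so each team member can model or validate a portion of it.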
We regularly prototype and explore datasets for issues such as missing or null values with tools that allow very quick iteration (less than a week). These tools include h2o.ai, Keras, Shogun(.NET), Alteryx, AutoML (Auto-Sklearn), Trifacta, RapidMiner, Rattle GUI, Orange, QlikView, KNIME, Plotly, Datawrapper, Data Science Studio (DSS), and BigML. They let you quickly assess your data's cleanliness, completeness, and types. Since propensity modeling is often a constantly evolving solution to a particular business problem, early communication between all parties on subjects such as data cleanliness is paramount to a successful business outcome. There are many more tools like these, and their number seems to be growing rapidly. Staying on top of the current best solutions will let you choose appropriate time-saving tools for your specific business problem and focus on the more important aspects of the work, instead of struggling to visualize large datasets appropriately.
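The kind of quick cleanliness-and-completeness assessment those tools provide can also be sketched in a few lines of pandas. The dataset and column names below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative dataset with deliberate gaps; names are hypothetical.
df = pd.DataFrame({
    "age":     [25, np.nan, 40, 33],
    "segment": ["a", "b", None, "a"],
    "spend":   [10.0, 20.0, 30.0, np.nan],
})

# One-screen report: type, missing-value count, and completeness per column,
# useful for an early data-cleanliness conversation with all parties.
report = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "complete_pct": (1 - df.isna().mean()) * 100,
})
print(report)
```

A report like this, produced in the first day of a project, anchors the early conversation about data cleanliness before any modeling begins.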
Remember that propensity modeling, or any kind of mathematical modeling of a business scenario, is a team sport. If you cannot communicate thoughts and findings quickly and efficiently to the rest of your team, you will get lost in a sea of highly specialized domain knowledge and special circumstances. Having a prepared plan for each step of the project, from understanding common business practices, research citations, data cleaning, initial solution prototyping, model building, and application development through to solution delivery, will save valuable time and resources. We hope some of these technologies and links are useful to you. Please leave links to any other great tools for propensity or classification models that you know of and that we didn't cover.