Introduction
Clickstream data are the web logs that describe how a user navigates through a website's pages during a particular task. A clickstream captures all the actions a user takes in a specific session. Clickstream data has several advantages (25% improvement in ad impressions, 15% improvement in click-through rate, and exhaustive insights). The data is also far more reliable, because spam and anomalies are much easier to identify and filter out.
Anomaly detection is crucial to the success of any organization.
Business need
Quick detection of anomalies is critical to the success of an organization's digital transformation culture. With anomaly pattern detection, companies can identify or even predict abnormal patterns in unbounded data streams. Some use cases of anomaly pattern detection are:
- Positive buying behavior for retailers
- Fraud detection for financial service providers
- Potential threat identification and reduction for telco companies
Anomaly pattern detection requires recognizing, in real time, when the behavior of the data becomes unusual and differs from normal behavior.
Real-time clickstream anomaly detection can catch unusual behavior early enough to take preventive action and avoid costly repairs. Businesses that identify anomalies in real time can therefore significantly improve response times and reduce the associated costs.
Solution approaches
A clickstream anomaly detection system has three parts:
- Part 1 trains a model in batch on historical data
- Part 2 uses what the model has learned to identify anomalies in a real-time data stream
- Part 3 takes appropriate action
A few machine learning algorithms that are useful for clickstream analysis are:
- Association rule learning
- Clustering
- Markov chain
Fig 1: Website Page Navigation by the User in Each Session
Association rule learning derives association rules from historical transactions. Anomalous (outlier) transactions are then detected by checking them against the mined rules. For example, suppose an association rule {A,B} -> {C} with a high confidence value of 85% is derived from historical data. Transactions that contain A and B but not C are then treated as abnormal, because C is expected to appear but has not actually been seen there.
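The {A,B} -> {C} example can be sketched in a few lines of Python. This is a minimal illustration, not a full rule-mining algorithm: the session data, the function names `confidence` and `violates_rule`, and the idea of treating each session as a set of visited pages are all assumptions made for the example.

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent: the share of transactions
    containing the antecedent that also contain the consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)


def violates_rule(transaction, antecedent, consequent):
    """True when the antecedent is present but the expected consequent is not."""
    items = frozenset(transaction)
    return frozenset(antecedent) <= items and not frozenset(consequent) <= items


# Hypothetical historical sessions, each reduced to the set of pages visited.
history = [frozenset(s) for s in (
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"},
    {"A", "B", "C", "D"}, {"A", "B", "C"}, {"A", "B"},
    {"A", "D"}, {"B", "C"}, {"C", "D"},
)]

# Six of the seven sessions containing both A and B also contain C.
rule_confidence = confidence(history, {"A", "B"}, {"C"})

# A new session with A and B but no C breaks the rule and is flagged.
suspicious = violates_rule({"A", "B", "D"}, {"A", "B"}, {"C"})
```

In practice the rules would come from a mining algorithm such as Apriori run over the historical data, and only rules above a chosen confidence threshold would be used for flagging.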
Clustering is a class of unsupervised learning that groups activities based on their behavior and structure without any prior knowledge about the data.
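One common way to turn clustering into an anomaly detector is to group historical sessions and then flag new sessions that lie far from every cluster center. The sketch below uses a small hand-written k-means for that purpose. The session features (pages viewed, minutes on site), the starting centroids, and the distance threshold of 10 are hypothetical values chosen only for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def distance_to_nearest(point, centroids):
    """Distance from a session to the closest cluster center."""
    return min(math.dist(point, c) for c in centroids)


# Hypothetical session features: (pages viewed, minutes on site).
sessions = [(3, 2), (4, 3), (5, 2), (4, 2),
            (20, 30), (22, 28), (21, 31), (19, 29)]
centroids = kmeans(sessions, centroids=[sessions[0], sessions[4]])

THRESHOLD = 10.0  # assumed cut-off; would be tuned on real data
typical_is_anomaly = distance_to_nearest((4, 3), centroids) > THRESHOLD
unusual_is_anomaly = distance_to_nearest((60, 1), centroids) > THRESHOLD
```

A session with 60 page views in one minute sits far from both groups of historical sessions, so it is flagged, while a session resembling the short-visit group is not.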
A Markov chain can be represented graphically as an activity/state transition diagram annotated with the corresponding transition probabilities (Fig 2). We can use a Markov chain for clickstream analysis: it gives us activity/state transition probabilities and activity clustering information, both of which can be used to build an anomaly detection system.
Fig 2: Markov Chain Graphical Representation of Clickstream Data
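The Markov chain approach can be sketched as follows: estimate page-to-page transition probabilities from historical sessions, then score a new session by how likely its transitions are. This is a minimal sketch under stated assumptions: the page names, the `floor` probability assigned to unseen transitions, and the threshold of -3.0 are hypothetical, and a first-order chain (the next page depends only on the current page) is assumed.

```python
import math
from collections import defaultdict


def fit_transitions(sessions):
    """Batch step: estimate P(next page | current page) from historical sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    probs = {}
    for page, nexts in counts.items():
        total = sum(nexts.values())
        probs[page] = {nxt: n / total for nxt, n in nexts.items()}
    return probs


def avg_log_likelihood(session, probs, floor=1e-6):
    """Scoring step: mean log-probability per transition; unseen transitions get `floor`."""
    steps = list(zip(session, session[1:]))
    if not steps:
        return 0.0
    total = sum(math.log(probs.get(a, {}).get(b, floor)) for a, b in steps)
    return total / len(steps)


def is_anomalous(session, probs, threshold=-3.0):
    """Flag sessions whose transitions are, on average, very unlikely."""
    return avg_log_likelihood(session, probs) < threshold


# Hypothetical historical sessions, each an ordered list of pages.
history = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "cart", "checkout"],
    ["home", "search", "product", "search", "product", "cart"],
]
probs = fit_transitions(history)

normal_flag = is_anomalous(["home", "search", "product", "cart", "checkout"], probs)
odd_flag = is_anomalous(["home", "checkout", "cart", "home"], probs)
```

Here `fit_transitions` plays the role of the batch training part of the system, `is_anomalous` the real-time detection part, and the returned flag is what an alerting or preventive action would be triggered from.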
We can use one of these machine learning algorithms, or a combination of two, to detect anomalies in the system and raise an alert, which can then be used to trigger preventive action.