March 18, 2016

MongoDB in Big Data Analytics Applications

A wide variety of NoSQL databases is available in the market, both open source and commercial. Choosing the right one was crucial to the success of our call center analytics project. In this blog, we present MongoDB, an open source and widely used database. We show how MongoDB’s features map to a call center analytics application that required near real time processing capabilities, and we compare it with how the solution would fare if implemented on an RDBMS.

Optimizing customer care operations leads to better customer satisfaction and huge cost savings. U.S. call centers receive 45.4 billion inbound calls per year, and each call costs $5.90 on average. For the average customer contact center, a 1% improvement in First Call Resolution (FCR) rate would result in a $276,000 reduction in annual operational costs. Apart from FCR, improvements in the following metrics have a direct impact on customer service:

  1. Improved Response Time
  2. Adherence to Schedule
  3. Forecasting Accuracy
  4. Self-service accessibility

We applied analytics to data collected from many channels (such as individual phone calls, chat, and web interactions) to improve the metrics mentioned above. The data was initially stored in an RDBMS. The solution is effective only if it delivers predictions in near real time. Given the large amount of data we dealt with, the RDBMS exhibited the following limitations:

  1. Rigid Schema Structures: Dynamic schemas are not supported, whereas we needed flexibility in how data is stored.
  2. Poor Performance: The system should support millions of simultaneous user requests and handle a large volume of data in near real time, whereas relational databases tend to perform more slowly as data volume increases.
  3. Non Scalable: As the data volume grows, these databases tend to perform poorly. We needed a system that supports horizontal scaling with ease as the data volume increases.

Hence, we looked for a data store with the following characteristics:

  1. Highly available and distributed data storage
  2. Faster data access irrespective of data volume
  3. Supports dynamic schemas
  4. Helps in storing multi-channel/server data into a single place to get a unified view of the data
  5. Helps in running analytic applications on top of the data easily
  6. Provides a good aggregation framework
  7. Offers a strong connector to Hadoop MapReduce for batch processing
  8. May or may not support ACID transactions

After thoroughly analyzing the existing NoSQL databases, we narrowed our choice down to MongoDB. Other factors supporting this choice include its data representation, preprocessing, analysis, and aggregation capabilities, which are important in any data mining or predictive modeling project. The following diagram depicts the architecture.

Fig. 1: Architecture Diagram Depicting MongoDB Usage

In the following sections, we will walk through MongoDB’s features that are best suited for call center analysis:

  1. Dynamic Schema Structures
  2. Performance Characteristics
  3. Horizontal Scaling

I. DYNAMIC SCHEMA STRUCTURES

Data Modeling

Data Model

In the relational database, data on user profiles, user call history, user chat history, etc. was stored in separate tables. Migrating these tables ‘as is’ to MongoDB would require the application to perform complex joins to fetch related data. This has limitations: as the volume of data increases, query performance degrades, and we would not be able to achieve the near real time prediction we expected.

MongoDB supports a flexible schema. This means that collections do not enforce an upfront schema structure on stored data. To support the data models our application needed, we used two approaches:

  1. Embedded Documents: Related data is embedded in a single structure/document. For example, a user’s call records are stored along with the user profile. Although redundant data may then have to be updated in multiple documents, this provides faster access to the related fields. This is also referred to as denormalizing the data.
  2. Referenced Documents: Every time a user interacts with the system, logs are created. These logs grow large very quickly. Embedding such logs along with the user information would lead to an explosion of data, and we would no longer get the performance benefits offered by MongoDB. For such scenarios, referencing the user info in each log is sufficient. This is also referred to as normalizing the data.
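
To make the two approaches concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for MongoDB documents. The field names (calls, duration_sec, user_id and so on) and the sample values are assumptions made for illustration, not the schema we actually used.

```python
# Embedded (denormalized): call records live inside the user profile document.
user_embedded = {
    "_id": "u1001",
    "name": "Jane Doe",
    "plan": "premium",
    "calls": [
        {"topic": "billing", "duration_sec": 310, "resolved": True},
        {"topic": "outage", "duration_sec": 125, "resolved": False},
    ],
}

# Referenced (normalized): each log entry stores only the user's _id.
users = {"u1001": {"_id": "u1001", "name": "Jane Doe", "plan": "premium"}}
interaction_logs = [
    {"user_id": "u1001", "channel": "chat", "event": "session_start"},
    {"user_id": "u1001", "channel": "web", "event": "faq_view"},
]


def unresolved_topics(user_doc):
    """Answer the question from one embedded document; no join is needed."""
    return [c["topic"] for c in user_doc["calls"] if not c["resolved"]]


def logs_with_user(logs, users_by_id):
    """Resolve the user reference in application code (a manual join)."""
    return [
        {**log, "user_name": users_by_id[log["user_id"]]["name"]}
        for log in logs
    ]
```

The embedded form answers profile-plus-calls questions in a single read, while the referenced form keeps each log entry small at the cost of a second lookup.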

Additionally, MongoDB supports almost all the data types found in SQL databases, which helped us keep the data types used in the relational database. Care has to be taken while doing data modeling, because each application has its own demands in terms of read, write, and update performance. With a properly designed data model, one can take full advantage of MongoDB.

Query Model

Unlike relational databases, MongoDB does not provide SQL-like querying, but it has a rich query language of its own. One can interact with the database using the Mongo shell. We used the shell extensively for data analysis and exploration studies, including:

  1. Aggregations: In a call center application, at any given point in time we wanted to know how many users had called and on what topics. For such aggregations, MongoDB provides aggregation pipelines and the MapReduce processing paradigm.
  2. Text Search: Whenever a user called the call center, we wanted to find their topic of interest by mining the comments and details sections. For such queries, MongoDB has good support for text search, including regular expressions.
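
As a rough sketch of these two kinds of queries, the snippet below writes a calls-per-topic count as an aggregation pipeline (using the $group and $sort stages) and then emulates it, along with a regex match on comments, in plain Python so that it runs without a server. The sample records and field names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical call records; field names are assumptions for illustration.
calls = [
    {"user_id": "u1", "topic": "billing", "comments": "Customer disputes a late fee"},
    {"user_id": "u2", "topic": "outage", "comments": "No internet since morning"},
    {"user_id": "u3", "topic": "billing", "comments": "Refund requested for double charge"},
]

# The count expressed as a MongoDB aggregation pipeline ($group, then $sort).
pipeline = [
    {"$group": {"_id": "$topic", "calls": {"$sum": 1}}},
    {"$sort": {"calls": -1}},
]


def calls_per_topic(records):
    """Plain-Python equivalent of the pipeline above."""
    counts = Counter(r["topic"] for r in records)
    return [{"_id": topic, "calls": n} for topic, n in counts.most_common()]


def search_comments(records, pattern):
    """Case-insensitive regex match on comments, similar in spirit to a
    {"comments": {"$regex": pattern, "$options": "i"}} query filter."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [r["user_id"] for r in records if rx.search(r["comments"])]
```

Against a live database, the pipeline list would be passed to the collection's aggregate method instead of being emulated.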

II. PERFORMANCE CHARACTERISTICS

Performance

The system should support millions of simultaneous user requests. We needed to ensure that there were no delays in data processing and that MongoDB’s performance would not deteriorate as data volume grew. To improve performance, we looked at indexing, sharding, and capped collections, and at running intensive tasks on large datasets on a Hadoop cluster. Here are more details.

Indexing

Performance is paramount when choosing a database for an application. Once the volume of data is large, reads and writes tend to get slower. MongoDB supports a variety of index types to speed up these operations. We used single field, compound, and multikey indexes. Choosing the right type of index plays a crucial part in performance, as a poorly chosen index can hamper reads and writes.
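
The difference between the three index types can be sketched conceptually in plain Python. This is only an illustration of what each index maps from and to, under assumed field names; it is not how MongoDB implements its indexes internally.

```python
from collections import defaultdict

# Hypothetical call documents; field names are assumptions for illustration.
calls = [
    {"_id": 1, "user_id": "u1", "topic": "billing", "tags": ["refund", "late-fee"]},
    {"_id": 2, "user_id": "u2", "topic": "outage", "tags": ["internet"]},
    {"_id": 3, "user_id": "u1", "topic": "outage", "tags": ["internet", "refund"]},
]


def build_index(docs, key_fn):
    """Map every key produced by key_fn to the _ids of the matching documents."""
    index = defaultdict(list)
    for doc in docs:
        for key in key_fn(doc):
            index[key].append(doc["_id"])
    return index


# Single field index: one key per document.
by_user = build_index(calls, lambda d: [d["user_id"]])
# Compound index: the key is a tuple of several fields.
by_user_topic = build_index(calls, lambda d: [(d["user_id"], d["topic"])])
# Multikey index: one key per element of an array field.
by_tag = build_index(calls, lambda d: d["tags"])
```

Each lookup such as by_tag["refund"] avoids scanning every document, but every insert has to update every index, which is one reason an unnecessary index slows down writes.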

Sharding

Sharding is a technique for storing data across nodes. When the data cannot fit on a single machine, distributing it horizontally across servers (scaling out based on shard keys) improves query performance and eases data management. In our setup, we dealt with data sizes of around 50-60GB, which sit comfortably on a single server. Hence, we did not explore sharding.
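
Although we did not shard our data, the idea of routing documents by a shard key can be sketched as follows. This is a deliberately simplified model that hashes the key and takes it modulo the number of shards; a real MongoDB cluster assigns ranges of key values to shards and rebalances them, so treat the function names and the routing rule here as assumptions.

```python
import hashlib


def shard_for(shard_key, num_shards):
    """Pick a shard from a stable hash of the shard key (simplified routing)."""
    digest = hashlib.md5(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(docs, key_field, num_shards):
    """Place each document on the shard chosen by its shard key value."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc[key_field], num_shards)].append(doc)
    return shards
```

Because the hash is stable, all documents with the same shard key value land on the same shard, so a query that includes the key needs to visit only one server.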

Capped Collections

As mentioned earlier, we collected user logs, which grow large in no time. For analytic processing, we wanted to retain only recent logs and discard the oldest ones, to improve query performance and limit the data size. For such needs, one can use capped collections, which provide a way to fix the size of a collection upfront.
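
The retain-recent, discard-oldest behaviour can be sketched with a bounded queue. The CappedLog class below is a hypothetical stand-in that caps by document count; an actual capped collection is created with a maximum size in bytes and, optionally, a maximum number of documents.

```python
from collections import deque


class CappedLog:
    """Fixed-size log store: once full, each insert evicts the oldest entry."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)

    def insert(self, doc):
        # deque(maxlen=...) silently drops the oldest item when it is full.
        self._docs.append(doc)

    def find(self):
        # Entries come back in insertion order, oldest surviving entry first.
        return list(self._docs)
```

Since the size is fixed upfront, no separate cleanup job is needed to purge old logs.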

Hadoop MapReduce

Certain machine learning algorithms or complex data mining operations can take advantage of the native MapReduce supported by MongoDB. However, a MongoDB MapReduce job is a single-threaded operation (meaning that even if the server has 64 cores, the job runs on a single core), so running it on large datasets slows down the computation. To boost the performance of batch processing on large datasets, one can take advantage of Hadoop MapReduce.
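
For readers new to the paradigm, here is a minimal single-process sketch of the map, group, and reduce phases, totalling call duration per topic. The record fields and function names are assumptions for illustration; the snippet shows the programming model only and does not use MongoDB's or Hadoop's own MapReduce APIs.

```python
from collections import defaultdict

# Hypothetical call records; field names are assumptions for illustration.
calls = [
    {"topic": "billing", "duration_sec": 310},
    {"topic": "outage", "duration_sec": 125},
    {"topic": "billing", "duration_sec": 190},
]


def map_call(call):
    """Map phase: emit (key, value) pairs for one input record."""
    yield call["topic"], call["duration_sec"]


def reduce_durations(values):
    """Reduce phase: combine all values emitted for one key."""
    return sum(values)


def map_reduce(records, mapper, reducer):
    """Run the mapper, group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}
```

Here everything runs in one loop on one core. A framework like Hadoop runs the same map and reduce functions over partitions of the data on many machines, which is where the speedup for large batch jobs comes from.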

III. HORIZONTAL SCALING

MongoDB achieves horizontal scaling through sharding (see Sharding in the Performance Characteristics section). With sharding, one can add more machines to support data growth and to achieve high read/write throughput.

Conclusion

In this blog, we showed how we used MongoDB’s features effectively in a call center analytics application. We were able to improve the metrics mentioned in the first section, which have a direct effect on call center efficiency.

References

  1. Big Data in your call center: Managing the numbers – A TechRepublic blog. Bluewolf infographic
  2. http://www.nosql-database.org/
  3. www.mongodb.com
  4. https://docs.mongodb.org/manual/indexes/
  5. https://github.com/mongodb/mongo-hadoop