Hadoop is a great foundation for a data lake. It provides a highly scalable distributed file system (HDFS) that opens the floodgates for storing a massive variety and volume of information that was not possible before, or was possible only with prohibitively expensive database management systems.
Business managers are enamored, for good reason, with the idea of having fast and direct access to all the data they could ever want, without having to depend on IT or wait for a slow enterprise data warehouse process to provide it.
But how do you actually achieve this? After all, Hadoop is a file system, not a relational database with handy ODBC/JDBC connections (more on that later). It has no mechanism for easily discovering tables or how they are related. HDFS is just a set of files without any defined relationships between them, or even a user-friendly way of understanding what is in each file. And remember, the volume of files grows constantly as the data lake is fed.
In general, using HDFS requires the wizardry of sophisticated, technical users. They work at the command line with HDFS shell commands and/or a variety of tools to move around and inspect file contents, judging their usefulness. This isn't your average user's Windows Explorer or Mac Finder interface. Can you say - Got Linux? Know shell?
Then there is the less obvious but probably bigger issue of data quality and validity. To understand this, reflect on the problem of Excel spreadsheet propagation in most organizations. It's so bad that no one can determine the single version of truth or the ultimate origin of the data. Now consider that with HDFS, the same problem applies to all data files, not just Excel, as they are continuously consumed and re-populated.
The answer to these challenges is to apply some degree of structure and access control that strikes a balance between agility and usability, based on use cases and user roles.
Applying structure can take many forms. One is to support a series of data management systems with defined and stable schemas that serve specific business use cases and requirements. For example, HDFS can be used to supply data to relational databases, NoSQL (document-oriented) databases, in-memory databases, key-value stores, columnar databases, etc. In this scenario, HDFS can be thought of as the universal data staging area that feeds these dependent systems, so that they can provide the predictable structure and user access mechanisms needed for repeatable business processes.
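The staging pattern above can be sketched in a few lines. This is a minimal, hypothetical illustration: the CSV content stands in for a file landed in an HDFS staging area, and SQLite stands in for the downstream relational system with its defined, stable schema.

```python
import csv
import io
import sqlite3

# A few records as they might land in the staging area
# (the field names and values here are hypothetical).
staged_csv = io.StringIO(
    "order_id,customer,amount\n"
    "1001,acme,250.00\n"
    "1002,globex,99.50\n"
)

# The dependent relational store, with a defined and stable schema.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer TEXT NOT NULL, amount REAL NOT NULL)"
)

# The staging-to-structured step: validate types and load each record.
for row in csv.DictReader(staged_csv):
    db.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (int(row["order_id"]), row["customer"], float(row["amount"])),
    )
db.commit()

total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 349.5
```

The point is the division of labor: the lake holds everything as-is, while the dependent system enforces schema and types at load time so downstream business processes can rely on them.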
Another form of applied structure is the inverse of pushing data out: examining what is in HDFS more closely. The idea is to provide tools that make it easy for the average user to find what they need, and then let them decide how to subscribe to or consume it. This is achieved by using automated processing to examine all data coming into HDFS and creating derived metadata. The examination includes determining the characteristics (profiling) of the data, indexing it for search, and exposing this metadata through a search interface. User-applied metadata comes in the form of users appending comments, rankings, and tags as they use the data. Collectively, the social influence of rankings, comments, and tags helps people qualify the value and validity of data.
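To make the two kinds of metadata concrete, here is a small sketch of profiling an incoming file and combining the derived profile with user-applied tags and ratings in a catalog entry. The file content, path, and catalog structure are all hypothetical, chosen only to illustrate the idea.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of a file arriving in HDFS.
incoming = io.StringIO(
    "id,region,revenue\n"
    "1,EMEA,1200\n"
    "2,APAC,\n"
    "3,EMEA,870\n"
)

rows = list(csv.DictReader(incoming))

# Derived metadata: profile each column automatically on arrival.
profile = {}
for col in rows[0]:
    values = [r[col] for r in rows]
    profile[col] = {
        "non_null": sum(1 for v in values if v != ""),
        "distinct": len({v for v in values if v != ""}),
    }

# User-applied metadata accumulates as people consume the data.
catalog_entry = {
    "path": "/landing/sales/monthly.csv",  # hypothetical path
    "profile": profile,
    "tags": Counter(),
    "ratings": [],
}
catalog_entry["tags"].update(["sales", "emea", "monthly"])
catalog_entry["ratings"].append(4)

print(profile["revenue"])  # {'non_null': 2, 'distinct': 2}
```

A search interface over entries like this lets a user spot, for instance, that the revenue column has missing values before deciding whether to consume the file.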
The topic of applying access controls is sensitive: everyone wants access to data, but they want to limit what the other person can see. The reality is that not all data can simply be thrown into HDFS without strict access controls. These must be applied in the form of conventional group/role access restrictions at the HDFS file and directory level. Additional control can also be applied in the dependent data management systems and in the form of masking and encryption.
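Masking at the field level can be sketched as follows. This is a minimal illustration, not a production design: the roles, field policies, and record shape are all invented for the example.

```python
# Hypothetical role-based field policies: "clear" passes through,
# "masked" obscures, "denied" drops the field from the view.
FIELD_ACCESS = {
    "analyst": {"customer_id": "masked", "region": "clear", "ssn": "denied"},
    "admin": {"customer_id": "clear", "region": "clear", "ssn": "clear"},
}

def mask(value: str) -> str:
    """Show only the last four characters, as with card numbers."""
    return "*" * (len(value) - 4) + value[-4:]

def view(record: dict, role: str) -> dict:
    """Return the record as the given role is allowed to see it."""
    out = {}
    for field, value in record.items():
        policy = FIELD_ACCESS[role].get(field, "denied")
        if policy == "clear":
            out[field] = value
        elif policy == "masked":
            out[field] = mask(value)
        # "denied" fields are omitted from the view entirely
    return out

record = {"customer_id": "CUST-448812", "region": "EMEA", "ssn": "123-45-6789"}
print(view(record, "analyst"))  # {'customer_id': '*******8812', 'region': 'EMEA'}
```

In practice this kind of policy would live in the dependent data management system or a governance layer, layered on top of the coarser group/role restrictions at the HDFS file and directory level.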
When you consider applying forms of structure and access control, an interesting observation comes into focus. We are essentially applying well-known, proven data management and architecture principles to this new combination of technology, known as the big data lake. It brings to mind the idiom 'everything old is new again'.
By extension, the answer to the question of how business analysts can 'fish' in the big data lake is really nothing new. They can use the data lake effectively only by first applying a well-thought-out architecture and management approach that fully considers a range of users and use cases. The range of users must include data scientists (the new power users), business analysts, and general business consumers. The range of use cases must include data discovery, exploration, experimentation, and migration to production deployment.