Entering the world of ‘big data’ presents challenges to our conventional data warehouse setups. Handling structured and unstructured datasets in the same environment can be tricky to manage, particularly when you want to marry the two for analysis. And what if we want a more flexible platform where we don’t have to adhere to strict schemas that are predefined at load time? Often, we might not have a particular use for our data right now, but that may change in the future.
The data lake offers a central repository where we can store data in its raw, unmodeled and native format. Essentially, we defer the ETL process until the data is actually needed. This can allow for real-time analysis, as we don’t have to wait for ETL processes to refresh. As business users, we generally don’t know all the possible questions we might want to ask of our data from the outset, and traditional data warehouses are often built for specific business applications. What if our initial data warehouse design doesn’t incorporate all the information we need to answer our questions? It can prove costly to go back and restructure our data warehouses to cater for the ever-changing needs of a business. With a data lake, we can apply schemas on the fly, at read time, and analysts can slice and dice the data as needed.
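To make the idea of applying a schema on the fly a little more concrete, here is a minimal sketch in Python. It is only an illustration: the sample records, the `read_with_schema` helper and the field names are all made up for this example, and a real data lake would typically use a query engine rather than hand-written code. The point is simply that the raw data stays untouched and each analyst decides, at read time, which fields to pull out and how to type them.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (JSON lines).
# Note that the records do not all share the same fields.
RAW_EVENTS = [
    '{"user": "ann", "amount": "12.50", "channel": "web"}',
    '{"user": "bob", "amount": "3", "device": "ios"}',
    '{"user": "cat"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    `schema` maps a field name to a function that casts the raw value.
    Fields missing from a record come back as None; fields not named in
    the schema are ignored for this particular read.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field_name, cast in schema.items():
            value = record.get(field_name)
            row[field_name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two analysts, two different questions, the same raw data.
spend_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
channel_view = read_with_schema(RAW_EVENTS, {"user": str, "channel": str})

for row in spend_view:
    print(row)
```

Neither view required the raw records to be reshaped or reloaded, which is the flexibility a predefined warehouse schema gives up.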
The issue with this is that the onus lies on the end user to understand the different datasets in the lake. A data lake therefore cannot be successfully implemented without comprehensive metadata.
But is a data lake always the best option? Data lakes have become a bit of a buzzword in the industry, and it’s easy to get carried away with the concept. A data lake isn’t automatically the answer where a data warehousing project has failed. The project might have failed not because a data warehouse was insufficient for the job, but because it was poorly designed. In an environment that is strictly transactional, a data lake is most likely overkill. Likewise, where data is generally structured and the variety of data isn’t changing all that much, we probably don’t need a data lake either.
Data lakes are still a fairly new concept, and there is not yet a comprehensive body of best practice surrounding their implementation. If you’re embarking on a data lake project, proceed with caution. Ensure that objectives are clear at the outset and that comprehensive metadata is maintained; otherwise, a data lake can quickly turn into a data swamp.
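As a rough illustration of what maintaining metadata can mean in practice, the sketch below shows a tiny in-memory catalog in Python. Everything in it is hypothetical (the `DatasetEntry` and `Catalog` classes, the paths and the dataset names), and real projects would normally rely on a dedicated catalog tool. The idea it demonstrates is a simple rule: nothing goes into the lake without an owner and a description, so that end users can later find and understand it.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Descriptive metadata for one dataset in the lake (hypothetical shape)."""
    path: str
    owner: str
    description: str
    fields: dict = field(default_factory=dict)  # field name -> short meaning


class Catalog:
    """A minimal in-memory metadata catalog."""

    def __init__(self):
        self._entries = {}

    def register(self, name, entry):
        # Refuse undocumented datasets: this is the guard against a "swamp".
        if not entry.owner.strip() or not entry.description.strip():
            raise ValueError(f"dataset '{name}' needs an owner and a description")
        self._entries[name] = entry

    def search(self, keyword):
        """Return the names of datasets whose name or description mentions keyword."""
        keyword = keyword.lower()
        return sorted(
            name
            for name, entry in self._entries.items()
            if keyword in name.lower() or keyword in entry.description.lower()
        )


catalog = Catalog()
catalog.register(
    "web_clicks",
    DatasetEntry(
        path="/lake/raw/web_clicks/",  # made-up path
        owner="marketing-analytics",
        description="Raw clickstream events from the public website",
        fields={"user": "visitor id", "channel": "traffic source"},
    ),
)

try:
    catalog.register("mystery_dump", DatasetEntry(path="/lake/raw/tmp/", owner="", description=""))
    rejected = False
except ValueError:
    rejected = True

print(catalog.search("clickstream"))
```

A catalog like this costs little to maintain at load time, and it is what lets an analyst discover months later that a dataset exists and what its fields mean.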