When a business enters the domain of data management, it is easy to get lost in a flurry of brochures, demos and promises about the future. In the first article in our two-part series, entitled ‘Data Warehouse, Data Lake, Data Mart, Data Hub: A Definition of Terms’, we defined the terms and differences in the market so that businesses can better understand the possibilities of Data Warehouses, Data Marts, Data Lakes and Data Hubs.
Data Volume and Location
Data Warehouses (DWH) typically serve the entire organization and may combine several Data Marts within the DWH to serve individual business units or departments (see Data Marts below for more information). A Data Warehouse usually handles a very large volume of data and stores highly transformed, structured data from disparate sources within a structured environment.
Suitable For: Large volumes of data, integration of data sources, and data sources that do not change often. Typically used by IT, MIS, data scientists and business analysts.
Advantages: Can handle storage and manipulation of a great deal of data coming from various types of data sources. Consists of curated, pristine-quality data meant to serve as a single version of truth for the organization.
Limitations: Can typically only be used by trained professionals who have programming skills and SQL knowledge, and/or who can execute sophisticated queries and use data extraction, transformation and loading (ETL) tools or techniques.
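To make the ETL step more concrete, here is a minimal sketch of how data from two differently shaped sources might be transformed into one structured table. It uses Python's built-in sqlite3 module as a stand-in for a warehouse; the source names, field names and the `sales` table are all assumptions made for illustration, not part of any particular product.

```python
import sqlite3

# Hypothetical extracts from two source systems with different layouts.
crm_rows = [{"customer": " Acme Corp ", "amount_usd": "1200.50"}]
erp_rows = [{"cust_name": "Globex", "total_cents": 99900}]


def transform(crm, erp):
    """Map both layouts onto one (customer, amount_usd, source) shape."""
    for row in crm:
        yield (row["customer"].strip(), float(row["amount_usd"]), "crm")
    for row in erp:
        yield (row["cust_name"].strip(), row["total_cents"] / 100, "erp")


def load(conn, rows):
    """Create the structured target table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT NOT NULL, amount_usd REAL NOT NULL, source TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
load(conn, transform(crm_rows, erp_rows))
warehouse_rows = conn.execute(
    "SELECT customer, amount_usd, source FROM sales ORDER BY customer"
).fetchall()
```

In a real Data Warehouse the transformation rules would be far more extensive, which is one reason the skills listed under Limitations are needed.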
A Data Mart is usually a subset of a Data Warehouse. For some organizations, it functions as a staging repository that makes data from specific sources accessible to users in a specific department or business unit, so they can find and analyze data more easily, with better scalability and performance.
Suitable For: Use by business units, departments or specific roles within the organization that have a need to analyze and report and require high quality data and good performance.
Advantages: Can provide secured access to data required by certain team members and business units.
Limitations: Requires appropriate structuring and data access provided by IT or data scientists to enable the business user to easily access and use the data. Other limitations of the Data Mart are similar to those of the Data Warehouse.
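The idea of a Data Mart as a department-specific subset of the warehouse can be sketched as follows. This is a simplified illustration that models the mart as a SQL view; in practice a mart is often a separate physical store. The `dwh_sales` table and `mart_marketing` view are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table covering every department.
conn.execute(
    "CREATE TABLE dwh_sales (department TEXT, region TEXT, amount_usd REAL)"
)
conn.executemany(
    "INSERT INTO dwh_sales VALUES (?, ?, ?)",
    [
        ("marketing", "EMEA", 500.0),
        ("marketing", "APAC", 250.0),
        ("finance", "EMEA", 900.0),
    ],
)

# The "mart": a pre-aggregated slice of the warehouse for one department.
conn.execute(
    """
    CREATE VIEW mart_marketing AS
    SELECT region, SUM(amount_usd) AS total_usd
    FROM dwh_sales
    WHERE department = 'marketing'
    GROUP BY region
    """
)

mart_rows = conn.execute(
    "SELECT region, total_usd FROM mart_marketing ORDER BY region"
).fetchall()
```

Users of the marketing mart see only their own, already summarized data, which is how a mart can offer both secured access and simpler analysis.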
A Data Lake can accommodate a very large volume of data and acts as a repository for operational data in near-source format. With appropriate tools for analysis and data preparation, the data within a Data Lake can be used by Citizen Data Scientists, Data Scientists, MIS, power users and IT.
Suitable For: Large volumes of data, integration of data sources, used by IT, MIS, data scientists and Citizen Data Scientists.
Advantages: Can handle storage and manipulation of a great deal of data coming from various types of data sources. Does not require data sources to be transformed into a specific structure and, in this respect, is more flexible than a Data Warehouse.
Limitations: Requires IT, Data Scientists or Citizen Data Scientists to use appropriate analytical tools and self-serve data prep tools to review, analyze and report on operational source data in its original or semi-transformed form. Requires appropriate structuring and data access provided by IT or data scientists so that business users can easily access and use the raw operational data rather than highly transformed and aggregated data.
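The contrast with a warehouse can be illustrated with a small sketch: the lake stores each payload exactly as the source produced it, and structure is applied only when the data is read (an approach often called schema on read). The source names, payload formats and the `ingest` and `read_events` helpers are hypothetical and exist only for this example.

```python
import json

# The "lake": raw payloads kept exactly as the sources emitted them.
lake = []


def ingest(source, payload):
    """Store the payload untouched, tagged only with its origin."""
    lake.append({"source": source, "raw": payload})


ingest("web", '{"user": "u1", "event": "click", "ts": "2024-01-01T10:00:00"}')
ingest("pos", "u2;purchase;19.99")  # a different, delimited format


def read_events(entries):
    """Apply structure only at read time, one parser per source format."""
    for entry in entries:
        if entry["source"] == "web":
            record = json.loads(entry["raw"])
            yield {"user": record["user"], "event": record["event"]}
        elif entry["source"] == "pos":
            user, event, _amount = entry["raw"].split(";")
            yield {"user": user, "event": event}


events = list(read_events(lake))
```

Adding a new source here means adding one more parser at read time, with no change to how existing data is stored. The trade-off is that every consumer needs the tools and skills to interpret the raw formats.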
A Data Hub is used to process, transform and govern data and may be used for large volumes of data. It acts as a bridge between data sources and provides a layer of data governance and data transformation between them.
Suitable For: Large volumes of data, organizations that require good data governance and integration of data sources, use by IT, MIS, data scientists and business analysts.
Advantages: Can handle governance and data quality of a great deal of data coming from various types of data sources.
Limitations: Requires IT and ETL experts to maintain integration, quality and governance across applications and data stores throughout the enterprise.
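As a rough illustration of the bridging role, the sketch below models a hub that checks each incoming record against a governance rule, applies a small transformation, and passes clean records on to every connected store. The `DataHub` class, its rule (every record needs an id and an owner) and the target lists are assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataHub:
    """Toy hub: validates records from sources, then fans them out to targets."""

    targets: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def publish(self, record):
        # Hypothetical governance rule: every record needs an id and an owner.
        if record.get("id") is None or not record.get("owner"):
            self.rejected.append(record)
            return False
        # Hypothetical transformation: normalize the owner name to lower case.
        clean = {**record, "owner": record["owner"].lower()}
        for target in self.targets:
            target.append(clean)
        return True


warehouse, lake = [], []  # stand-ins for two downstream data stores
hub = DataHub(targets=[warehouse, lake])
hub.publish({"id": 1, "owner": "Finance"})
hub.publish({"id": 2})  # fails the rule, so it is set aside for review
```

Because the rule lives in one place, every downstream store receives data that has passed the same checks, which is the governance benefit described above.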
Intended Use of Data
Data Warehouse: Curated, highly transformed data with pristine quality. Created by IT and Data Warehouse professionals, and used by IT and analysts to build reports, visualizations, business intelligence and analytics objects. Data stored within the Data Warehouse is intended to serve as one source of data and ‘truth’ across the enterprise.
Data Mart: Curated, highly transformed data with pristine quality. Created by IT and Data Warehouse professionals, and used by IT and analysts to build reports, visualizations, business intelligence and analytics objects. Typically intended for business units, departments or specific groups of users to contain and manage their applications and domain.
Data Lake: Created and maintained by IT staff and data engineers. Gives access to raw operational data in original or semi-transformed form. Used by Citizen Data Scientists, Data Scientists, Analysts and MIS team members.
Data Hub: Created, maintained and used by IT, data engineers, and integration teams to ensure data governance and quality across various data sources and repositories.
Budget, Timeline and Required Skills
Because a Data Warehouse typically houses structured, curated data from many sources across the enterprise, the cost and timeline for implementation will typically be greater than for a smaller, departmental initiative. The enterprise must have the skills and tools to perform data extraction, transformation and loading (ETL) and, because business data, data source formats and applications change quickly, the Data Warehouse is likely to undergo continuous upgrades and changes.
A Data Mart is a smaller initiative for a business unit or group of users within the organization, and can be built more quickly. The cost and timeline for deployment will depend on the number of data sources, the required data transformation and structuring, and issues related to the changing nature of the data sources. It will require expert maintenance.
A Data Lake typically houses operational data in original or near-source format from many sources across the enterprise, so the cost and timeline for implementation will be greater, as it is intended to cover all applications and data sources across the organization. The enterprise must have the skills and tools to perform data extraction, transformation and loading (ETL). Because business data and formats change quickly, the Data Lake environment is likely to undergo continuous upgrades and changes, but these changes are simpler than in a Data Warehouse or Data Mart environment: because data is stored in near-source formats, it is easier to add new applications or data sources to the Data Lake.
The cost and length of time for development and deployment of a Data Hub will depend on the number of data sources and the degree of transformation required for the quality and governance layers. Other factors that will affect the schedule and cost include established policies, data volume, required scalability and performance, data security and governance, and the integration of multiple data sources. The team will require specialized skills to create and sustain the Data Hub for use by those in the business community.
No matter which option the business chooses as the best fit, when an enterprise undertakes a project of this import and scale, it will need to plan its approach in one of the following ways:
- Dedicate a skilled team of IT staff and data engineers to create and maintain the solution and accommodate upgrades and changes.
- Engage an IT consultant to help the business select the right solution, develop the structure and overlay reporting or analytical tools to provide the best access and data governance, and then hand the solution off to the business to maintain.
- Leverage a joint project team, composed of internal IT staff and/or analysts and an IT consulting partner, to take the project from concept to execution and work together to maintain and upgrade the solution as and when required.