Data integration (DI) job development is at the core of every enterprise application. Quality job development leads to a better return on investment and higher-quality deliverables, enhancing the overall business outcome. It is therefore of utmost importance to follow a well-defined process for every new ETL job or project. In this article we share key considerations for developing DI jobs on projects that leverage data integration tools such as Talend, Informatica, SSIS, and Pentaho.
Sound practices are the stepping stone to high-quality design and development. Practice guidelines help developers produce well-managed, low-defect code. They cover everything required for quality deliverables: company policy, expert guidance, past experience, reviewers' comments, tool-specific guidelines, and so on. The following points each address a different aspect of such a practice.
- Create a foundational architecture. Formulating a best-in-class architecture is the first and most important step of every ETL project, before any design work begins. Once the business requirements are understood and analyzed, plan an architecture that covers every aspect of those requirements. When the high-level architecture or flow diagram is available, break it into smaller, usable pieces. This supports efficient design planning, test-case planning, deployment planning, release planning, documentation, and more.
- Provide repository-driven storage. Almost all tools provide a repository for storing project metadata. Jobs that are tightly coupled with the repository minimize manual errors, because the metadata is built in rather than re-entered by hand. Repository-driven job design also promotes reusability and reduces complexity in job development.
- Establish a parameter-driven approach. All jobs must be tested across various environments and platforms. During job development, any value that changes between platforms or environments should be placed in a parameter file so it can be modified easily before running the job.
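A minimal sketch of the idea, assuming one JSON parameter file per environment (the file layout and keys such as `db_host` are illustrative, not from any specific tool):

```python
import json
from pathlib import Path

def load_parameters(env: str, param_dir: str = "params") -> dict:
    """Load environment-specific job parameters from a JSON file.

    The naming convention (params/dev.json, params/prod.json, ...) is a
    hypothetical example; the point is that nothing environment-specific
    is hard-coded inside the job itself.
    """
    param_file = Path(param_dir) / f"{env}.json"
    with param_file.open() as f:
        return json.load(f)
```

Switching the job from development to production then means changing only the `env` argument, never the job code.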
- Develop documentation and commenting. Every component and every piece of custom code in a job must be properly documented and commented. This enables reviewers and other stakeholders to understand the job in detail.
- Implement efficient variable management. Every ETL tool provides different types of variables, each suited to different conditions. Taking advantage of these features can optimize both the performance and the usability of a job. Variable types commonly used for different scenarios include:
- Context group
- Routine variables
- Job level context
- Global variables
- System variables
- Hashmap variables
- Environmental variables
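The exact variable types above are tool-specific, but the common thread is scoping: a more specific scope overrides a more general one. A tool-agnostic sketch of that precedence, assuming environment variables beat job-level context values, which beat global defaults (the names and defaults are hypothetical):

```python
import os

# Illustrative global defaults; a real tool stores these in its repository.
GLOBAL_DEFAULTS = {"batch_size": "1000", "retry_count": "3"}

def resolve_variable(name: str, job_context: dict) -> str:
    """Return the value of `name`, checking the most specific scope first."""
    if name.upper() in os.environ:        # system/environment variable wins
        return os.environ[name.upper()]
    if name in job_context:               # then the job-level context
        return job_context[name]
    return GLOBAL_DEFAULTS[name]          # finally the global default
```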
- Perform code review and analysis. Once development is done, a review team must examine each job against best practices, requirements, design considerations, failure scenarios, and null-pointer handling, and provide comments. The developer then incorporates this feedback to make the job more stable.
- Execute error logging and auditing. On every execution, a job's errors and console trace must be written to a log file. This file may contain audit messages, error stack traces, printable data, and so on. In addition, execution and audit information, such as source record count, target insert count, reject count, error count, source error count, and target reject count, should be written to an audit table.
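A minimal sketch of this pattern, assuming the audit record is held in a dataclass (a real job would persist it to an audit table) and `load_row` stands in for whatever target-load step the job uses:

```python
import logging
from dataclasses import dataclass, asdict

@dataclass
class JobAudit:
    """Hypothetical audit record mirroring the counts described above."""
    job_name: str
    source_count: int = 0
    insert_count: int = 0
    reject_count: int = 0
    error_count: int = 0

def run_with_audit(job_name: str, rows, load_row) -> JobAudit:
    """Load rows one by one, counting successes and rejects for the audit."""
    logging.basicConfig(filename=f"{job_name}.log", level=logging.INFO)
    audit = JobAudit(job_name=job_name)
    for row in rows:
        audit.source_count += 1
        try:
            load_row(row)
            audit.insert_count += 1
        except Exception as exc:
            audit.reject_count += 1
            audit.error_count += 1
            logging.error("row rejected: %r (%s)", row, exc)
    logging.info("audit: %s", asdict(audit))
    return audit
```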
- Devise failure scenarios. During job development, developers usually focus on the best-case scenario, so responses to failure scenarios often go unimplemented, which can be problematic. It is always good to handle every possible failure scenario while developing a DI job.
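One common failure scenario is a transient source-system error. A sketch of handling it explicitly, assuming `fetch` is any zero-argument extract callable (other scenarios, such as missing files, null keys, or late-arriving data, deserve the same deliberate treatment):

```python
import time

def extract_with_retry(fetch, max_attempts: int = 3, backoff_s: float = 0.1):
    """Retry a flaky extract step instead of failing on the first error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the job flow
            time.sleep(backoff_s * attempt)  # simple linear backoff
```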
- Accomplish higher test coverage. Test cases should be developed as soon as requirements are complete and shared with the development team. This gives developers greater visibility, lets them plan the job design more effectively, and makes the resulting jobs more robust and more likely to succeed.
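For the transformation logic inside a job, this can be as simple as writing assertion-based cases from the requirements before development starts. A sketch with a hypothetical phone-normalization transform:

```python
def normalize_phone(raw: str) -> str:
    """Hypothetical transform: strip formatting, keep digits, prefix '+'."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return "+" + digits

def test_normalize_phone():
    # Cases derived directly from the requirements document, up front:
    assert normalize_phone("(555) 123-4567") == "+5551234567"
    assert normalize_phone("+1 555.123.4567") == "+15551234567"
```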
- Leverage reusable components. With the enterprise versions of DI tools, reusable components can be created and shared across multiple jobs. Reusable components include job templates, joblets, metadata connections, SQL queries, and more. Having them available in the repository accelerates development of new requirements and makes estimates more predictable.
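In code, the same idea is a parameterized step that any job can drop in, roughly what a joblet provides in a tool like Talend. A sketch of a reusable lookup-enrichment step (all names illustrative):

```python
from typing import Callable, Iterable, Iterator

def make_enricher(lookup: dict, key_field: str, out_field: str) -> Callable:
    """Build a reusable enrichment step, akin to a joblet or job template.

    The returned generator function can be reused by any job that needs
    the same lookup logic, configured only by its parameters.
    """
    def enrich(rows: Iterable[dict]) -> Iterator[dict]:
        for row in rows:
            row[out_field] = lookup.get(row.get(key_field))
            yield row
    return enrich
```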
- Achieve higher control of job flow and execution. ETL jobs should be designed so that execution exits in a controlled way rather than aborting with a raw stack trace. Flags, lock files, and similar mechanisms can be used to gain this control over job flow.
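The lock-file idea can be sketched as follows, using an atomic create-if-absent so that a second instance of the same job exits cleanly instead of running concurrently (the lock path is an assumption; real schedulers often provide equivalent built-in controls):

```python
import os

def acquire_lock(lock_path: str) -> bool:
    """Create a lock file atomically; return False if a run is in progress.

    O_CREAT | O_EXCL fails if the file already exists, so two concurrent
    job instances cannot both pass this check.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())  # record owner PID
        os.close(fd)
        return True
    except FileExistsError:
        return False

def release_lock(lock_path: str) -> None:
    """Remove the lock file once the job finishes (success or failure)."""
    os.remove(lock_path)
```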
Implementing these practices when developing DI jobs will make a project vastly more successful. Other benefits of having these key drivers in place within the organization are:
- Reduced defect-fixing cycles
- Shorter go-live duration
- Improved development efficiency
- Consistent implementation of best practices
Together, these enable a greater return on investment and quality deliverables that improve business outcomes.