Writing data integration and data migration scenarios


Basic stuff

Any data integration or data migration process can be described as extract-transform-load (ETL) or extract-load-transform (ELT). With that in mind, writing a scenario becomes a process of splitting the task into extracts, loads and transformations. For example, copying the employee table from one database to another is an extract (read employee from the source) followed by a load (write employee into the destination). The scenario glues these pieces together and adds purpose and logic.

The ETL engine which powers Toolsverse products uses an XML-based language to create scenarios. Before reading this page, please take a look at the ETL scenario language specification. XML was a natural choice because it enforces a structure: loads follow extracts, and transformations follow extracts and loads. That said, you can choose to stream data, so that an extract and a load are combined in one logical operation, and you can run things in parallel, in which case the actual order of extracts and loads is not guaranteed.
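
To make that structure concrete, here is a rough skeleton of a scenario. It uses only the elements that appear in the examples further down this page; the language specification covers the full set of elements:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
     <name>...</name>
     <script>...</script>
     <driver>auto</driver>
     <sources>
          <!-- one or more <source> elements describing what to extract -->
     </sources>
     <destinations>
          <!-- one or more <destination> elements describing where to load -->
     </destinations>
</scenario>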

The ETL engine makes it possible to concentrate on the task at hand without reinventing the wheel. It hides the complexity of data integration, so in most cases the same techniques and language artifacts are used to work with any SQL and non-SQL data sources. It is, however, possible to use the full power of the target database and do things like direct data load, anonymous SQL blocks, etc.
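
For instance, the two flavors of <source> definition used in the examples below differ only in whether an explicit extract query is given; the employee table name here is just a placeholder:

<!-- database source: the extract is driven by a SQL statement -->
<source>
     <name>employee</name>
     <extract>
          <sql>select * from employee</sql>
     </extract>
</source>

<!-- file-based source: no SQL, so the engine reads the data from the source file instead -->
<source>
     <name>employee</name>
</source>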

Please take a look at the ETL scenario examples.

Simple data migration scenario

Let's take a look at a simple data migration scenario:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
     <name>Migrate data</name>
     <script>migrate_date</script>
     <driver>auto</driver>
     <sources>
          <source>
               <name>employee</name>
               <extract>
                    <sql>select * from employee</sql>
               </extract>
          </source>
 
          <source>
               <name>emp_resume</name>
               <extract>
                    <sql>select * from emp_resume</sql>
               </extract>
          </source>
     </sources>
     <destinations>
          <destination>
               <name>employee</name>
               <metadata>true</metadata>
          </destination>
 
          <destination>
               <name>emp_resume</name>
               <metadata>true</metadata>
          </destination>
     </destinations>
</scenario>

In this example we:

  1. Extract all data from the employee and emp_resume tables in the source database
  2. Load the data into the destination, which can be a database or files
  3. If the destination tables don't exist, create them on the fly (the <metadata>true</metadata> flag)

What we don't do:

  1. Specify what kind of source and destination we are working with
  2. Specify how to connect to the source and destination
  3. Specify how to create the destination tables if they don't exist

Reading data from file-based data sources

In the previous example we created a simple data migration scenario which reads data from a database using SQL and loads it into another database or a file-based data source.

Reading data from file-based data sources is as easy as writing:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
     <name>Migrate data</name>
     <script>migrate_date</script>
     <driver>auto</driver>
     <sources>
          <source>
               <name>employee</name>
          </source>
 
          <source>
               <name>emp_resume</name>
          </source>
     </sources>
     <destinations>
          <destination>
               <name>employee</name>
               <metadata>true</metadata>
          </destination>
 
          <destination>
               <name>emp_resume</name>
               <metadata>true</metadata>
          </destination>
     </destinations>
</scenario>

In this example there is no SQL, so the engine assumes it needs to read the data from files. It doesn't matter what format the files are in, as long as there is a suitable connector.

The ETL engine natively supports XML, JSON, CSV, Excel and many other formats.

Things to do

Typically a full-blown data integration process includes at least an extract and a load. When everything is said and done, the data from the source have been moved to the destination. In some cases it makes sense to split the process into stages which are performed in different time frames. For example you can:

  1. Extract data and store them in some intermediate format, for example XML.
  2. Load data sometime later.

You can control which step to perform by setting the action on the ETL config:

   // First run: extract only. Data is read from the sources and stored in an
   // intermediate format (for example XML).
   EtlConfig etlConfig = new EtlConfig();
 
   etlConfig.setAction(EtlConfig.EXTRACT);
 
   EtlProcess etlProcess = new EtlProcess(EtlProcess.EtlMode.EMBEDDED);
 
   // 'engine' refers to the ETL engine instance; its creation is not shown here
   EtlResponse response = engine.loadConfigAndExecute(etlConfig,
                    "test_etl_config.xml", etlProcess);

Sometime later a second run, with the action set to LOAD, picks up the previously extracted data and loads it into the destination:

   EtlConfig etlConfig = new EtlConfig();
 
   etlConfig.setAction(EtlConfig.LOAD);
 
   EtlProcess etlProcess = new EtlProcess(EtlProcess.EtlMode.EMBEDDED);
 
   EtlResponse response = engine.loadConfigAndExecute(etlConfig,
                    "test_etl_config.xml", etlProcess);

Using scripting languages

Using SQL