Writing data integration and data migration scenarios

From Toolsverse Knowledge Base

Basic stuff

Any data integration or data migration process can be described as extract-transform-load (ETL) or extract-load-transform (ELT). With that in mind, writing scenarios becomes a process of splitting the task into extracts, loads, and transformations. A scenario glues them all together and adds purpose and logic.

The ETL engine which powers Toolsverse products uses an XML-based language to create scenarios. Before reading this page, please take a look at the ETL scenario language specification. XML is a natural choice because it enforces a structure: loads follow extracts, and transformations follow extracts and loads. That said, you can choose to stream data, so that extract and load are combined in one logical operation, or run them in parallel, so that the actual order of extracts and loads is not guaranteed.
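
For example, streaming and parallel execution are controlled by attributes which you will also see in the examples below. The following fragment is a minimal sketch, not a complete working scenario (the scenario name and script name are made up for illustration):

<?xml version="1.0" encoding="UTF-8"?>
<scenario parallel="true">
     <name>Stream in parallel</name>
     <script>stream_in_parallel</script>
     <driver>auto</driver>
     <sources>
          <source>
               <name>employee</name>
          </source>
     </sources>
     <destinations>
          <destination>
               <name>employee</name>
               <load stream="true"/>
          </destination>
     </destinations>
</scenario>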

The ETL engine makes it possible to concentrate on the task at hand without reinventing the wheel. It hides the complexity of data integration, so in most cases the same techniques can be applied when working with any SQL or non-SQL data source. At the same time, it is possible to use the full power of the target database and do things like direct data loads, parallel extracts, etc. (see the Oracle example below).

Please take a look at the ETL scenario examples.

Simple data migration scenario

Let's take a look at a simple data migration scenario:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
     <name>Migrate data</name>
     <script>migrate_data</script>
     <driver>auto</driver>
     <sources>
          <source>
               <name>employee</name>
               <extract>
                    <sql>select * from employee</sql>
               </extract>
          </source>
 
          <source>
               <name>emp_resume</name>
               <extract>
                    <sql>select * from emp_resume</sql>
               </extract>
          </source>
     </sources>
     <destinations>
          <destination>
               <name>employee</name>
               <metadata>true</metadata>
          </destination>
 
          <destination>
               <name>emp_resume</name>
               <metadata>true</metadata>
          </destination>
     </destinations>
</scenario>

In this example we:

  1. Extract all data from the employee and emp_resume tables in the source database
  2. Load the data into the destination, which can be a database or files
  3. Create the destination tables on the fly if they don't exist (the <metadata>true</metadata> flag)

What we don't do:

  1. Specify what kind of source and destination we are working with
  2. Specify how to connect to the source and destination
  3. Specify how to create the destination tables if they don't exist

These details are resolved by the engine from the connections defined outside the scenario, for example in the ETL configuration file loaded by the Java API example at the end of this page.

Reading data from file-based data sources

In the previous example we created a simple data migration scenario which reads data from a database using SQL and loads it into another database or a file-based data source.

Reading data from file-based data sources is as easy as writing:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
     <name>Migrate data</name>
     <script>migrate_data</script>
     <driver>auto</driver>
     <sources>
          <source>
               <name>employee</name>
          </source>
 
          <source>
               <name>emp_resume</name>
          </source>
     </sources>
     <destinations>
          <destination>
               <name>employee</name>
               <metadata>true</metadata>
          </destination>
 
          <destination>
               <name>emp_resume</name>
               <metadata>true</metadata>
          </destination>
     </destinations>
</scenario>

In this example there is no SQL, so the engine assumes it needs to read data from files. It doesn't matter what format the files are in, as long as there is a suitable connector.

The ETL engine natively supports XML, JSON, CSV, Excel, and many other formats.
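
When the format cannot be inferred, or when a file requires specific settings, a reader connector can be attached to the extract explicitly. The fragment below is a sketch which reuses the Excel connector class from the dataset transformation example later on this page; the file and sheet names are made up for illustration:

<source noconnection="true">
     <name>employee</name>
     <extract>
          <reader
               class="com.toolsverse.etl.connector.excel.ExcelXlsxConnector"
               filename="{app.root.data}\demo\employee.xlsx" sheetname="employee"/>
     </extract>
</source>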

Using transformations

If we assume that any source or destination has a dataset behind it, adding a transformation is basically attaching it to that source or destination. Transformations can be chained together.

There are three types of transformations:

  1. Column transformations: validation, add or exclude column, change column type, calculate column value
  2. Dataset transformations: set operations such as join, de-duplication, pivot, etc.
  3. Dimension transformations: add dimension and extract dimension

Transformations are event-based and can be performed (see the task sketch after this list):

  • Before extract or load
  • During extract or load (inline transformation)
  • After extract or load
  • After success or failure
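
A task's scope attribute selects one of these events. The fragment below is a sketch based on the union task from the dataset transformation example further down this page; scope="after" requests the "after extract" event:

<source noconnection="true">
     <name>first_and_second</name>
     <tasks>
          <task noconnection="true" scope="after">
               <name>union</name>
               <class>com.toolsverse.etl.core.task.common.Union</class>
               <variables>
                    <DRIVING value="DATA_SET_ONLY1"/>
                    <UNION value="DATA_SET_ONLY2"/>
                    <TYPE value="union all"/>
               </variables>
          </task>
     </tasks>
</source>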

Example of column transformations:

<?xml version="1.0" encoding="UTF-8"?>
<scenario parallel="true">
     <name>Column level transformations using JavaScript</name>
     <script>column_transformations</script>
     <description>Extract data, change column name and value, add columns, remove columns</description>
     <driver name="com.toolsverse.etl.driver.GenericJdbcDriver" parent="com.toolsverse.etl.driver.mysql.MySqlDriver"/>
     <execute/>
     <sources>
          <source>
               <name>employee</name>
               <extract>
                    <sql>select * from employee</sql>
               </extract>
          </source>
     </sources>
     <destinations>
          <destination>
               <name>employee</name>
               <objectname>employee_date</objectname>
               <metadata use="true"/>
               <load stream="true"/>
               <variables>
                    <HIREDATE nativetype="int" sqltype="2"
                         code="if ({HIREDATE} != null) {value = new Date({HIREDATE}.getTime()).getFullYear();}"
                         field="HIREDATE" lang="JavaScript" name="hire_year"/>
                    <SALARY exclude="true" field="SALARY"/>
                    <BONUS exclude="true" field="BONUS"/>
 
                    <COMPENSATION add="true" nativetype="int" sqltype="2" name="compensation" lang="JavaScript"
                    code="value = parseInt(dataSet.getFieldValue(currentRow, 'SALARY') + dataSet.getFieldValue(currentRow, 'BONUS'));" />
 
                    <test add="true" nativetype="varchar(100)"
                         sqltype="12" value="abc"/>
                    <test2 add="true" nativetype="varchar(100)"
                         sqltype="12" value="aaa"/>
                    <MIDINIT exclude="true" field="MIDINIT"/>
               </variables>
          </destination>
     </destinations>
</scenario>

Example of dataset transformations:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
     <name>Diff datasets</name>
     <script>diff</script>
     <allow>EXTRACT_LOAD</allow>
     <sources>
          <source noconnection="true">
               <name>data_set_1</name>
               <extract>
                    <reader
                         class="com.toolsverse.etl.connector.excel.ExcelXlsxConnector"
                         filename="{app.root.data}\demo\matrix.xlsx" sheetname="norm"/>
               </extract>
          </source>
          <source noconnection="true">
               <name>data_set_2</name>
               <extract>
                    <reader
                         class="com.toolsverse.etl.connector.excel.ExcelXlsxConnector"
                         filename="{app.root.data}\demo\matrix.xlsx" sheetname="norm2"/>
               </extract>
          </source>
          <source noconnection="true">
               <name>data_set_only1</name>
               <tasks>
                    <task noconnection="true">
                         <name>minus</name>
                         <class>com.toolsverse.etl.core.task.common.Minus</class>
                         <variables>
                              <DRIVING value="data_set_1"/>
                              <MINUS value="data_set_2"/>
                         </variables>
                    </task>
                    <task noconnection="true">
                         <name>add_column</name>
                         <class>com.toolsverse.etl.core.task.common.Pivot</class>
                         <variables>
                              <FIELDS value="dataset='first only'; first=first; last=last; course=course; year=year; address=address;"/>
                         </variables>
                    </task>
               </tasks>
          </source>
          <source noconnection="true">
               <name>data_set_only2</name>
               <tasks>
                    <task noconnection="true">
                         <name>minus</name>
                         <class>com.toolsverse.etl.core.task.common.Minus</class>
                         <variables>
                              <DRIVING value="data_set_2"/>
                              <MINUS value="data_set_1"/>
                         </variables>
                    </task>
                    <task noconnection="true">
                         <name>add_column</name>
                         <class>com.toolsverse.etl.core.task.common.Pivot</class>
                         <variables>
                              <FIELDS value="dataset='second only'; first=first; last=last; course=course; year=year; address=address;"/>
                         </variables>
                    </task>
               </tasks>
          </source>
 
          <source noconnection="true">
               <name>data_set_12</name>
               <tasks>
                    <task noconnection="true">
                         <name>intersect</name>
                         <class>com.toolsverse.etl.core.task.common.Intersect</class>
                         <variables>
                              <DRIVING value="data_set_2"/>
                              <INTERSECT value="data_set_1"/>
                         </variables>
                    </task>
                    <task noconnection="true">
                         <name>add_column</name>
                         <class>com.toolsverse.etl.core.task.common.Pivot</class>
                         <variables>
                              <FIELDS value="dataset='both'; first=first; last=last; course=course; year=year; address=address;"/>
                         </variables>
                    </task>
               </tasks>
          </source>
 
 
          <source noconnection="true">
               <name>first_and_second</name>
               <tasks>
                    <task noconnection="true" scope="after">
                         <name>union</name>
                         <class>com.toolsverse.etl.core.task.common.Union</class>
                         <variables>
                              <DRIVING value="DATA_SET_ONLY1"/>
                              <UNION value="DATA_SET_ONLY2"/>
                              <TYPE value="union all"/>
                         </variables>
                    </task>
               </tasks>
          </source>
 
          <source noconnection="true">
               <name>first_second_both</name>
               <tasks>
                    <task noconnection="true" scope="after">
                         <name>union</name>
                         <class>com.toolsverse.etl.core.task.common.Union</class>
                         <variables>
                              <DRIVING value="FIRST_AND_SECOND"/>
                              <UNION value="DATA_SET_12"/>
                              <TYPE value="union all"/>
                         </variables>
                    </task>
               </tasks>
          </source>
 
     </sources>
     <destinations>
          <destination noconnection="true">
               <name>RESULT</name>
               <source>FIRST_SECOND_BOTH</source>
               <load>
                    <driver name="com.toolsverse.etl.driver.GenericFileDriver"/>
                    <writer
                         class="com.toolsverse.etl.connector.excel.ExcelXlsxConnector"
                         filename="{app.root.data}\demo\matrix-output.xlsx" sheetname="output"/>
               </load>
          </destination>
     </destinations>
</scenario>

Using database-specific features

There are multiple techniques available to developers. In the example below we are using Oracle SQL*Plus and SQL*Loader to extract and load data.

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
     <name>Oracle Extract using sql*plus and Load using sql*loader</name>
     <script>oracle_sql_loader</script>
     <driver>com.toolsverse.etl.driver.oracle.OracleDriver</driver>
     <allow>EXTRACT_LOAD</allow>
     <variables>
        <NAME label="Property Name:" value = "name" />
     </variables>
     <tasks>
          <!-- drop the staging table using the destination connection; ignore the exception if the table doesn't exist -->
          <task>
               <name>drop_staging_table</name>
               <connection>dest</connection>
               <class>com.toolsverse.etl.core.task.common.SqlTask</class>
               <sql>
                   DROP TABLE config_property_stage;
               </sql>
               <onexception action="ignore"/>
          </task>
 
          <task>
               <name>create_staging_table</name>
               <connection>dest</connection>
               <class>com.toolsverse.etl.core.task.common.SqlTask</class>
               <sql>
                   create TABLE config_property_stage
                   (
                     CONFIG_PROPERTY_NUM	NUMBER(18),
                     NAME	VARCHAR2(255),
                     DESCRIPTION	VARCHAR2(255)
                   );
               </sql>
          </task>
 
          <task>
               <name>config_property_extract</name>
               <connection>source</connection>
               <class>com.toolsverse.etl.core.task.oracle.OracleExtractTask</class>
               <sql>
                   select CONFIG_PROPERTY_NUM || '|' ||
                          NAME || '|' ||
                          DESCRIPTION
                   from config_property
                   where upper(name) like upper('%{NAME}%')
               </sql>
          </task>
 
          <task>
               <name>load</name>
               <connection>dest</connection>
               <class>com.toolsverse.etl.core.task.oracle.OracleLoadTask</class>
               <sql>
                       LOAD DATA
                       INFILE '{DATA_FOLDER}config_property_extract.dat'
                       APPEND INTO TABLE config_property_stage
                       FIELDS TERMINATED BY "|"
                       trailing nullcols
                       (
                          CONFIG_PROPERTY_NUM,
                          NAME,
                          DESCRIPTION
                       )
               </sql>
          </task>
 
     </tasks>
 
     <destinations>
          <destination>
               <name>cfg_props</name>
               <metadata>true</metadata>
               <cursor table="config_property_stage" onfinish="drop" />
          </destination>
     </destinations>
</scenario>

Splitting the process into steps

Typically, a full-blown data integration process includes at least an extract and a load. When everything is said and done, data from the source are moved to the destination. In some cases it makes sense to split the process into steps which can be performed in different time frames. For example, you can:

  1. Extract data and store them in some intermediate format, for example XML.
  2. Load the data sometime later.

Example:

1. Extract

   // Configure the run to perform only the extract step
   EtlConfig etlConfig = new EtlConfig();

   etlConfig.setAction(EtlConfig.EXTRACT);

   // Run the engine embedded in the current JVM process
   EtlProcess etlProcess = new EtlProcess(EtlProcess.EtlMode.EMBEDDED);

   // "engine" is assumed to be an ETL engine instance created elsewhere;
   // the configuration file defines the connections and scenario to run
   EtlResponse response = engine.loadConfigAndExecute(etlConfig,
                    "test_etl_config.xml", etlProcess);

2. Load

   // Same as the extract step, but now perform only the load
   EtlConfig etlConfig = new EtlConfig();

   etlConfig.setAction(EtlConfig.LOAD);

   EtlProcess etlProcess = new EtlProcess(EtlProcess.EtlMode.EMBEDDED);

   // Reuses the same configuration file as the extract step above
   EtlResponse response = engine.loadConfigAndExecute(etlConfig,
                    "test_etl_config.xml", etlProcess);