ETL scenario language

From Toolsverse Knowledge Base
Revision as of 18:49, 27 July 2014 by Maxlmus (Talk | contribs)

Jump to: navigation, search

ETL scenarios must be written in special XML-based language.

Language

  • ETL scenario must be a well formed XML. Check out official XML specification
  • For special characters which are not allowed in the well formed XML use correct XML substitutions
  • All node and attribute names are case sensitive
  • Values generally are not case sensitive except class names
  • Class names must be fully qualified. For example com.toolsverse.etl.driver.oracle.OracleDriver
  • Order of nodes in the node list matters. For examples if there are multiple <task> nodes under <tasks> they will be executed in order
  • For comments use
    <!-- text -->

Rules

  • If attribute has multiple possible values (for example parallel="false|true") the pipe ("|") character is used as a delimiter
  • First value in the pipe delimited string is a default value. For example in the pair "false|true" "false" is a default.
  • Most nodes and attributes are not required. The required nodes and attributes specifically marked using bold-italic font
  • If parent node is not required (for example <task>) but child is (for example <name>) it just means that you can skip entire section (for example <task>) but if it is there certain nodes and attributes must be there as well (for example <name>)
  • If node (or attribute) is not required but it is present and there is no value, the default value will be used. If there is no default value the node or attribute will be skipped
  • Some nodes have short and long forms, for example <driver>class_name</driver> vs <driver attributes />. You can use either short or long form but not both
  • Some nodes and attributes have multiple synonyms, for example <tasks|beforetasks> (node) and sql|code (attribute)
  • Attribute value "int" means that attribute value should be an integer
  • Attribute value "sql" means that attribute value should be SQL
  • Attribute value "code" means that attribute value should be code in one of the supported scripting languages (for example JavaScript) or SQL
  • For boolean attributes "false|true" it is also possible to use (case insensitive) yes|no, y,|n, 1|0

ETL scenario parser

If there is a syntax error parser generates an exception. In most cases (but not all) parser is able to identify position of the error.

  • If ETL scenario is not a well formed XML parser generates an exception pointing out to the part of the XML which is not well formed
  • Parser always checks one error at the time, meaning if there are multiple errors it generates exception for the first found error
  • Parser checks all required nodes and attributes and if one of them is missing it generates an exception pointing out to the missing node or attribute
  • Parser checks all nodes and attributes defined by specification and if there is something it doesn't recognize (for example <dest> instead of <destination>) it generates an exception pointing out to the wrong node or attribute
  • In some cases (for example attributes and child nodes of the <variable_name> node) basically any attribute or parent node is allowed (including not defined by specification). In this case parser let it go and does not generate an exception
  • If wrong attribute value is used (for example varcharsize="abc" instead of varcharsize="123") or there is an attribute but there is no value (for example varcharsize="") the default value is used instead. If there is no default value the attribute is skipped
  • Is some cases, when node or attribute value is a class name parser will try to load class immediately. If class does not exist parser generates appropriate exception
  • If short form of the node is used (for example <driver>name</driver> instead of <driver attributes /> the default values for the attributes are used

Nodes and attributes

<?xml version="1.0" encoding="UTF-8"?>
<scenario parallel="false|true" requiresource="false|true" requiredest="false|true" notemp="false|true" >
   <name>scenario name</name>
   <description>scenario description</description>
   <script>script name</script>
   <driver>auto|driver_class_name</driver>
   <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper" />
   <source_driver>auto|driver_class_name</source_driver>
   <onsave action="skip|save"/>
   <onpopulate action="skip|save"/>
   <onpersist action="skip|save"/>
   <onexecute action="|commit" />
   <allow>LOAD|EXTRACT|EXTRACT_LOAD</allow>
   <function>function_class_name<function>
   <metadata types=""/>
   <codegen>code_generator_class_name</codegen>
   <connection>connection name</connection>

   <variables>
      <variable_name name="variable name" value="value" label="label" sql="sql" type="type" global="true|false" attributes>
         <variable_nodes/>
      </variable_name>
   </variables>

   <execute>
      <scenario name="inner_scenario_name" action="LOAD|EXTRACT|EXTRACT_LOAD" parallel="false|true" loop="code" count="int" variable="var_name" pattern="pattern" field="field_name" looplang="SQL|JavaScript|Others" loop_connection="connection name" condition="code" conditionlang="SQL|JavaScript|Others" condition_connection="connection name" />
   </execute>

   <tasks|beforetasks>
      <task noconnection="false|true" commit="false|true">
         <name>task name</name>
         <tablename>table name</tablename>
         <class>task_class_name</class>
         <connection>task connection name</connection>
         <driver>auto|driver_class_name</driver>
         <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>
         <sql|code|cmd>code</sql|code|cmd>
         <using>using</using>
         <onexception action="continue|ignore|raise" mask="mask" />
         <variables>
            <variable_name name="variable name" value="value" attributes />
         </variables>
      </task>
   </tasks|beforetasks>

   <sources>
      <source independent="false|true" noconnection="false|true" empty="false|true" mandatory="false|true" encode="false|true" enabled="true|false" parallel="false|true" condition="code" conditionlang="SQL|JavaScript|Others" condition_connection="connection name" keyfield="coma_delimited_key_columns_to_update" updatekey="false|true">
         <name>source name</name>
         <objectname|tablename>object or table name</objectname|tablename>
         <onpersist action="save|skip"/>
         <onpopulate action="save|skip"/>
         <source>linked source name</source>
         <extract keyname="key name" keyfield="key fields">

            <writer class="writer_class_name" attributes />
            <writer>writer_class_name</writer>

            <reader class="writer_class_name" attributes />
            <reader>reader_class_name</reader>

            <driver>auto|driver_class_name</driver>
            <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>

            <connection>source connection name</connection>
            <sql>sql</sql>
            <using>using</using>

            <statement>empty|prepared|callable</statement>
            <indexes>coma_delimited_list_of_output_ param_indexes</indexes>

            <onexception action="continue|ignore|raise|ignoreparseerror" mask="mask" />

            <variables>
               <variable_name name="variable name" value="value" attributes />
            </variables>
         </extract>

         <tasks>
            <task noconnection="false|true" commit="false|true" scope="after|pre|inline|before_etl">
               <name>task name</name>
               <tablename>table name</tablename>
               <class>task_class_name</class>
               <connection>task connection name</connection>
               <driver>auto|driver_class_name</driver>
               <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>
               <sql|code|cmd>code</sql|code|cmd>
               <using>using</using>
               <onexception action="continue|ignore|raise" mask="mask" />
               <variables>
                  <variable_name name="variable name" value="value" attributes />
               </variables>
            </task>
         </tasks>

      </source>
   </sources>

   <destinations>
      <destination encode="false|true" empty="false|true" tolerate="false|true" noconnection="false|true" type="regular|procedure|function|wait|table" enabled="true|false" parallel="false|true" scope="global|single" condition="code" conditionlang="SQL|JavaScript|Others" condition_connection="connection name"> source="false|true"
         <name>destination name</name>
         <source>linked source name</source>
         <objectname|tablename>object or table name</objectname|tablename>
         <cursor table="table name" sql="sql" type="regular|temp|temporary" onfinish="keep|drop" />
         <metadata>true|false</metadata>
         <metadata indexes="false|true" suffix="suffix"/>
         <load stream="false|true" key="keys" action="insert|update|delete|merge|conditional" condition="code" bind="false|true" window="number">
            <connection>destination connection name</connection>

            <driver>auto|driver_class_name</driver>
            <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>

            <condition>sql</condition>
            <then>sql</then>
            <else>sql</else>
            <after>sql</after>
            <sql>sql</sql>

            <onexception action="continue|ignore|merge|raise" mask="mask" key="keys" savepoint="false|true" />

            <reader class="writer_class_name" attributes />
            <reader>reader_class_name</reader>

            <writer class="writer_class_name" attributes />
            <writer>writer_class_name</writer>

            <variables>
               <variable_name name="variable name" tablename="table name" function="function" class="function_class_name" field="field" value="value" sql|code="sql|code" lang="SQL|JavaScript|Others" type="sql" linked="linked var name" label="label" param="params" scope="before|after|runtime" destination="linked destination name" tolerate="false|true" include="true|false" exclude="false|true" add="false|true" global="false|true" declare="sql" sqltype="sqltype" nativetype="nativetype" attributes />
            </variables>

            <cache>cache_class_name</cache>
         </load>

         <tasks>
            <task noconnection="false|true" commit="false|true" scope="after|pre|inline|before_etl" >
               <name>task name</name>
               <tablename>table name</tablename>
               <class>task_class_name</class>
               <connection>task connection name</connection>
               <driver>auto|driver_class_name</driver>
               <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>
               <sql|code|cmd>code</sql|code|cmd>
               <using>using</using>
               <onexception action="continue|ignore|raise" mask="mask" />
               <variables>
                  <variable_name name="variable name" value="value" attributes />
               </variables>
            </task>
         </tasks>

      </destination>
   </destinations>

   <aftertasks>
      <task noconnection="false|true" commit="false|true">
         <name>task name</name>
         <tablename>table name</tablename>
         <class>task_class_name</class>
         <connection>task connection name</connection>
         <driver>auto|driver_class_name</driver>
         <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>
         <sql|code|cmd>code</sql|code|cmd>
         <using>using</using>
         <onexception action="continue|ignore|raise" mask="mask" />
         <variables>
            <variable_name name="variable name" value="value" attributes />
         </variables>
      </task>
   </aftertasks>

</scenario>