ETL scenario language

From Toolsverse Knowledge Base
Jump to: navigation, search

ETL scenarios must be written in XML-based language.

Language

  • ETL scenario must be a well formed XML. Check out official XML specification
  • For special characters which are not allowed in the well formed XML use correct XML substitutions
  • All node and attribute names are case sensitive
  • Values generally are not case sensitive except class names
  • Class names must be fully qualified. For example com.toolsverse.etl.driver.oracle.OracleDriver
  • Order of nodes in the node list matters. For examples if there are multiple <task> nodes under <tasks> they will be executed in order
  • For comments use
    <!-- text -->

Language Artifacts

  • scenario - root node in XML. Used to define global attributes such as scenario name, description, script name, etc. Includes other nodes. Variables under scenario used to define global input parameters which can be used to parameterize scenario.
  • execute - root node for inner scenarios.
  • scenario - inner scenario under <execute>. Inner scenario is a sort of sub program executed as a part of owner's scenario.
  • sources - root node for all sources.
  • source - "extract from" object. Can include transformations and tasks which performed on the "extract from" object. Variables under source used to set dataset variables.
  • destinations - root node for all destinations.
  • destination - "load into" object. Can include transformations and tasks which performed on the "load into" object. Variables under destination used to define destination columns.
  • tasks - root node for tasks.
  • task - transformation performed on the source and destination or common task such as operation system command, etc. Variables under task used to define task parameters.
  • beforetasks - root node for tasks executed before first extract
  • aftertasks - root node for tasks executed after last load
  • variables - root node for scenario, source, destination or task variables
  • VARIABLE (UPPER-CASED variable name) - scenario's, source's, destination's or task's variable

Rules

  • If attribute has multiple possible values (for example parallel="false|true") the pipe ("|") character is used as a delimiter
  • First value in the pipe delimited string is a default value. For example in the pair "false|true" "false" is a default.
  • Most nodes and attributes are not required. The required nodes and attributes specifically marked using bold-italic font
  • If parent node is not required (for example <task>) but child is (for example <name>) it just means that you can skip entire section (for example <task>) but if it is there certain nodes and attributes must be there as well (for example <name>)
  • If node (or attribute) is not required but it is present and there is no value, the default value will be used. If there is no default value the node or attribute will be skipped
  • Some nodes have short and long forms, for example <driver>class_name</driver> vs <driver attributes />. You can use either short or long form but not both
  • Some nodes and attributes have multiple synonyms, for example <tasks|beforetasks> (node) and sql|code (attribute)
  • Attribute value "int" means that attribute value should be an integer
  • Attribute value "sql" means that attribute value should be SQL
  • Attribute value "code" means that attribute value should be code in one of the supported scripting languages (for example JavaScript) or SQL
  • For boolean attributes "false|true" it is also possible to use (case insensitive) yes|no, y,|n, 1|0
  • empty|something or |something for node or attribute value means that there is a default value which is rarely getting changed so it is better to skip this node or attribute
  • attributes means any valid XML attribute or list of attributes

ETL scenario parser

If there is a syntax error parser generates an exception. In most cases (but not all) parser is able to identify position of the error.

  • If ETL scenario is not a well formed XML parser generates an exception pointing out to the part of the XML which is not well formed
  • Parser always checks one error at the time, meaning if there are multiple errors it generates exception for the first found error
  • Parser checks all required nodes and attributes and if one of them is missing it generates an exception pointing out to the missing node or attribute
  • Parser checks all nodes and attributes defined by specification and if there is something it doesn't recognize (for example <dest> instead of <destination>) it generates an exception pointing out to the wrong node or attribute
  • In some cases (for example attributes and child nodes of the <variable_name> node) basically any attribute or parent node is allowed (including not defined by specification). In this case parser let it go and does not generate an exception
  • If wrong attribute value is used (for example varcharsize="abc" instead of varcharsize="123") or there is an attribute but there is no value (for example varcharsize="") the default value is used instead. If there is no default value the attribute is skipped
  • Is some cases, when node or attribute value is a class name parser will try to load class immediately. If class does not exist parser generates appropriate exception
  • If short form of the node is used (for example <driver>name</driver> instead of <driver attributes /> the default values for the attributes are used

Nodes and attributes

<?xml version="1.0" encoding="UTF-8"?>
<scenario parallel="false|true" requiresource="false|true" requiredest="false|true" notemp="false|true" >
   <name>scenario name</name>
   <description>scenario description</description>
   <script>script name</script>
   <driver>auto|driver_class_name</driver>
   <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper" />
   <source_driver>auto|driver_class_name</source_driver>
   <onsave action="skip|save"/>
   <onpopulate action="skip|save"/>
   <onpersist action="skip|save"/>
   <onexecute action="|commit" />
   <allow>LOAD|EXTRACT|EXTRACT_LOAD</allow>
   <function>function_class_name<function>
   <metadata types=""/>
   <codegen>code_generator_class_name</codegen>
   <connection>connection name</connection>

   <variables>
      <variable_name name="variable name" value="value" label="label" sql="sql" type="type" global="true|false" attributes>
         <variable_nodes/>
      </variable_name>
   </variables>

   <execute>
      <scenario name="inner_scenario_name" action="LOAD|EXTRACT|EXTRACT_LOAD" parallel="false|true" loop="code" count="int" variable="var_name" pattern="pattern" field="field_name" looplang="SQL|JavaScript|Others" loop_connection="connection name" condition="code" conditionlang="SQL|JavaScript|Others" condition_connection="connection name" />
   </execute>

   <tasks|beforetasks>
      <task noconnection="false|true" commit="false|true">
         <name>task name</name>
         <tablename>table name</tablename>
         <class>task_class_name</class>
         <connection>task connection name</connection>
         <driver>auto|driver_class_name</driver>
         <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>
         <sql|code|cmd>code</sql|code|cmd>
         <using>using</using>
         <onexception action="continue|ignore|raise" mask="mask" />
         <variables>
            <variable_name name="variable name" value="value" attributes />
         </variables>
      </task>
   </tasks|beforetasks>

   <sources>
      <source independent="false|true" noconnection="false|true" empty="false|true" mandatory="false|true" encode="false|true" enabled="true|false" parallel="false|true" condition="code" conditionlang="SQL|JavaScript|Others" condition_connection="connection name" keyfield="coma_delimited_key_columns_to_update" updatekey="false|true">
         <name>source name</name>
         <objectname|tablename>object or table name</objectname|tablename>
         <onpersist action="save|skip"/>
         <onpopulate action="save|skip"/>
         <source>linked source name</source>
         <extract keyname="key name" keyfield="key fields">

            <writer class="writer_class_name" attributes />
            <writer>writer_class_name</writer>

            <reader class="writer_class_name" attributes />
            <reader>reader_class_name</reader>

            <driver>auto|driver_class_name</driver>
            <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>

            <connection>source connection name</connection>
            <sql>sql</sql>
            <using>using</using>

            <statement>empty|prepared|callable</statement>
            <indexes>coma_delimited_list_of_output_ param_indexes</indexes>

            <onexception action="continue|ignore|raise|ignoreparseerror" mask="mask" />

            <variables>
               <variable_name name="variable name" value="value" attributes />
            </variables>
         </extract>

         <tasks>
            <task noconnection="false|true" commit="false|true" scope="after|pre|inline|before_etl">
               <name>task name</name>
               <tablename>table name</tablename>
               <class>task_class_name</class>
               <connection>task connection name</connection>
               <driver>auto|driver_class_name</driver>
               <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>
               <sql|code|cmd>code</sql|code|cmd>
               <using>using</using>
               <onexception action="continue|ignore|raise" mask="mask" />
               <variables>
                  <variable_name name="variable name" value="value" attributes />
               </variables>
            </task>
         </tasks>

      </source>
   </sources>

   <destinations>
      <destination encode="false|true" empty="false|true" tolerate="false|true" noconnection="false|true" type="regular|procedure|function|wait|table" enabled="true|false" parallel="false|true" scope="global|single" condition="code" conditionlang="SQL|JavaScript|Others" condition_connection="connection name"> source="false|true"
         <name>destination name</name>
         <source>linked source name</source>
         <objectname|tablename>object or table name</objectname|tablename>
         <cursor table="table name" sql="sql" type="regular|temp|temporary" onfinish="keep|drop" />
         <metadata>true|false</metadata>
         <metadata indexes="false|true" suffix="suffix"/>
         <load stream="false|true" key="keys" action="insert|update|delete|merge|conditional" condition="code" bind="false|true" window="int">
            <connection>destination connection name</connection>

            <driver>auto|driver_class_name</driver>
            <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>

            <condition>sql</condition>
            <then>sql</then>
            <else>sql</else>
            <after>sql</after>
            <sql>sql</sql>

            <onexception action="continue|ignore|merge|raise" mask="mask" key="keys" savepoint="false|true" />

            <reader class="writer_class_name" attributes />
            <reader>reader_class_name</reader>

            <writer class="writer_class_name" attributes />
            <writer>writer_class_name</writer>

            <variables>
               <variable_name name="variable name" tablename="table name" function="function" class="function_class_name" field="field" value="value" sql|code="sql|code" lang="SQL|JavaScript|Others" type="sql" linked="linked var name" label="label" param="params" scope="before|after|runtime" destination="linked destination name" tolerate="false|true" include="true|false" exclude="false|true" add="false|true" global="false|true" declare="sql" sqltype="sqltype" nativetype="nativetype" attributes />
            </variables>

            <cache>cache_class_name</cache>
         </load>

         <tasks>
            <task noconnection="false|true" commit="false|true" scope="after|pre|inline|before_etl" >
               <name>task name</name>
               <tablename>table name</tablename>
               <class>task_class_name</class>
               <connection>task connection name</connection>
               <driver>auto|driver_class_name</driver>
               <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>
               <sql|code|cmd>code</sql|code|cmd>
               <using>using</using>
               <onexception action="continue|ignore|raise" mask="mask" />
               <variables>
                  <variable_name name="variable name" value="value" attributes />
               </variables>
            </task>
         </tasks>

      </destination>
   </destinations>

   <aftertasks>
      <task noconnection="false|true" commit="false|true">
         <name>task name</name>
         <tablename>table name</tablename>
         <class>task_class_name</class>
         <connection>task connection name</connection>
         <driver>auto|driver_class_name</driver>
         <driver name="auto|driver_class_name" parent="parent_driver_class_name" literalsize="int" varcharsize="int" charsize="int" precision="int" scale="int" lineslimit="int" initsql="sql" case="empty|lower|upper"/>
         <sql|code|cmd>code</sql|code|cmd>
         <using>using</using>
         <onexception action="continue|ignore|raise" mask="mask" />
         <variables>
            <variable_name name="variable name" value="value" attributes />
         </variables>
      </task>
   </aftertasks>

</scenario>