bisdw
-
A simple ETL tool

Version 0.2.0
2013-10-06

1 Introduction

Bisdw is a simple ETL tool that is target to extract data from different source. It can use different embedded ways to do the extraction from different kind of databases, files, etc. In addition Bisdw support FTP based transfer of the result of the extraction process. Bisdw use the open source project Scriptellato enable ETL functionality, but can be extended in a simple way.
Bisdw runs as a daemon process. By configuration Bisdw will run different ETL jobs based on scheduling definition that are defined for each job.

2 Configuration

The main configuration file is the bidw.xml. Additional common properties are set in the properties.xml fle.

2.1 bisdw.xml configuration

Bisdw is configured by defining ETLs jobs in the in the bisdw.xml file.
1<?xml version="1.0" encoding="UTF-8"?> 
2<bisdw xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 	xsi:noNamespaceSchemaLocation="../src/main/resources/bisdw.xsd"> 	
3  <etljob> 		
4    <name>read_DB</name> 		
5    <desc>Database</desc> 		
6    <schedule>* */30 * * * ?</schedule>
7	
8    <etlconfig name="read_db"> 			
9      <desc>read db</desc> 			
10      <class>com.ingby.socbox.bisdw.etlprovider.ETLScriptella</class> 			
11      <order>10</order>
12      <property> 				
13        <key>configFile</key> 				
14        <value>read_db.xml</value> 			
15      </property> 	
16	</etlconfig>
17      
18    <etlconfig name="send_ftp">
19      <desc>send_ftp</desc>  
20      <class>com.ingby.socbox.bisdw.etlprovider.FTPSend</class>  
21      <order>20</order>
22      <property>  
23        <key>hostname</key> 
24        <value>172.25.1.203</value> 
25      </property>      
26      <property>    
27        <key>username</key> 
28        <value>ftpuser</value> 
29      </property> 
30      <property>     
31        <key>password</key>      
32        <value>abcxyz</value>   
33      </property>  
34      <property>    
35        <key>localDir</key>        
36        <value>/tmp/shipment</value>
37      </property>
38      <property>  
39        <key>remoteDir</key>    
40        <value>ship</value> 
41      </property> 
42    </etlconfig> 	
43  </etljob>
44</bisdw> 
Each <etljob> can include one to many <etlconfig>. A <etljob> will execute all etlconfig’s part of the job according to the cron expression in the <schedule> tag, but in the sequence order specified in the >order> tag. The etlconfig’s will be executed in the order of the lowest order first. This is important if there is a data dependencies between the different etlconfig’s. By using the order tag a sequence of execution is quarantined.
Each etlconfig has a class specification that define the java class to be executed by the etljob. A list of properties can be specified that is specific by the specified class.

2.1.1 Scheduling

Each etljob can have multiple schedule tags, but at least one. The scheduling can have two different formats, interval or cron based.

2.1.1.1 Interval scheduling

The simple format describe a interval execution that are repeated forever. The format is just a number and a indicator defining the granularity in seconds (S), minutes (M) or hours (H). 10M specify that the service should be executed every ten minutes.
1<schedule>10M</schedule>		

2.1.1.2 Cron based scheduling

The second format is more advanced and follow the cron specification of Quartz, see http://www.quartz-scheduler.org. With this format its possible to define scheduling expression like “0 15 10 ? * MON-FRI” which would schedule the service at 10:15am every Monday, Tuesday, Wednesday, Thursday and Friday. For more cron examples please visit http://www.quartz-scheduler.org/documentation/quartz-2.1.x/tutorials/crontrigger
1<schedule>0 15 10 ? * MON-FRI</schedule>		

2.2 Etlconfig configuration

2.2.1 ETLScriptella configuration

The ETLScriptella takes one property from the etlconfig definition in bisdw.xml. The property is the configFile and specify the scriptella script file to read and execute. For more information about Scriptella configuration please visit Scriptella documentation.
Below listing show a simple example of a scriptella configuration that select from a mysql database and create an xml file based on the result and write it to directory /tmp/shipment. Each file will be written to a unique file name using the etl.date.today macro.
1<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">  
2<etl>    
3  <description>Example</description>   
4  <properties>
5    filedepo=/tmp/shipment/ 
6  </properties>    
7 <connection id="mydb" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/bisdwtest" user="testdb" password="testdb" />
89  <connection id="outfile" driver="text" url="$filedepo/shipment_${etl.date.today(’yyyyMMdd-HHmmss’)}.xml" >    
10    null_string=   
11  </connection>
1213  <!--Writing header -->   
14  <script connection-id="outfile">     
15    &lt;shipments&gt;   
16  </script>
17  
18  <!-- Run query against db -->   
19  <query connection-id="mydb">   
20    <!-- Select all shipments with pckp_dt TODAY and inbnd_ind = O from shipmentdb-->    
21    SELECT ship_id, no_pce, tot_wght FROM shipment;
22    
23    <!-- For each row execute a script and write xml structure to out file -->    
24    <script connection-id="outfile">   
25    <![CDATA[  
26      <shipment>    
27      <id>$ship_id</id> 
28      <weight>$tot_wght</weight> 
29      <pieces>$no_pce</pieces>   
30      </shipment>]]>    
31    </script>
32  </query>   
3334  <!--Writing footer -->   
35  <script connection-id="outfile">      
36    &lt;/shipments&gt;  
37  </script>    
38</etl> 
39 

2.2.2 FTPSend configuration

The FTPSend class enable transfer over FTP to a remote server. This can typical be used as part of the of a etljob where files are created by a scriptella script that is after creation is transfered to a FTP server.
The FTPSend configuration takes a number of properties to control the connection.
As and example see the listing in the 2.1 on page 1↑ that read any file located in the /tmp/shipment directory.

2.3 properties.xml

The properties.xml include properties used by the core of Bisdw. The properties xml has a simple structure of key/value pair:
1<properties>
2	<property> 
3		<key>akey</key> 
4		<value>avalue</value>   
5	</property>
6</properties>
The following properties are currently used by core Bisdw:

3 Building Bisdw

To build Bisdw from source is simple. Check out the Bisdw trunk from gforge.ingby.com:
$ svn checkout --username anonymous http://gforge.ingby.com/svn/bischeck/bisdw/trunk bisdw
To build a Bisdw distribution run from the directory where you checked out the Bisdw code:
$ ant dist
This will create a compressed tar file in the target directory, named bisdw-x.y.z.tgz where x.y.z is the version number. Different versions of Bisdw can be checked out from the tags directory located in http://gforge.ingby.com/svn/bischeck/bisdw/tags

3.1 Jar customization

To support custom jar files please place them in the directory customlib. This would typical be jdbc drivers, etc.

3.2 Developing with Bisdw

Its simple to develop your own ETL config implementation. To develop your own you must follow the interface ETLInf.

4 Installation

The latest binary version of Bisdw is available on http://gforge.ingby.com/gf/project/bischeck/frs/.
Download the distribution file and follow the steps below to install. Make sure you have root privileges doing this.
# tar xzvf bisdw-x.y.z.tgz 
# cd bisdw-x.y.z
# chmod 755 install 
# ./install -u #Get usage
# ./install    #Install default
# service bisdw start    #Redhat/Centos
# /etc/init.d/bischeck start #Debian/Ubuntu
To get full list of available options to the install script use -u. By default the install script will install Bisdw in directory /opt/socbox/addons/bisdw, referred to as $BISDW and with the ownership of the user id bisdw. Make sure that the user exist before running install.
The last commands start the bisdw daemon with the effective user id of the user id set during install, default user bisdw. The installation will configure bisdwd to start automatically in run level 3, 4 and 5.
The process id of the java process running bisdw in daemon mode is located in a file, default in /var/tmp/bisdw.pid. This file is used by the bisdwd script to stop the java program running Bisdw and make sure that only one instance of Bisdw is started on the server.

4.1 Getting started

In the $BISDW/etc directory there are examples of all the configuration files.

4.2 Logging

Bisdw use log4j for log management. The log4j configuration is described in the log4.properties file located in the resources directory of the Bisdw installation. By default Bisdw writes log information at level INFO to file /var/tmp/bischeck.log.

5 Run Bisdw from the command line

The normal way to run Bisdw is as a daemon using the init.d script bisdwd, but is also possible to start Bisdw in continues running mode by executing:
$ bisdw Execute -d
Running in this way have limitations since the execution will not automatically be placed as a background process and the effective user id will be the user starting the process which may not have all permissions according to the installation. Neither will pid files be updated correctly. For production system always use the init.d script.
$ sudo /etc/init.d/bisdwd start
or
# service bisdwd start
For testing purpose it can be good to just run Bisdw once and make sure that every thing is executing as expected. This is done by executing:
To show the pid file used for the Bischeck daemon running:
$ bischeck ConfigurationManager -p
This command is used in the init script bischeckd to retrieve the current pid.

5.1 Validating configuration files

To validate if the xml configuration files are correct the following command will return 0 if correct. Use $? to see return status.
$ bischeck ConfigurationManager -v; echo $?

6 Releases

6.1 Release 0.2.0 - 2013-10-06

This is the first major version.

7 System requirements

Bisdw should run on any operating system that supports Java 6. The installation script and init scripts are supported on Redhat and Debian equivalent Linux distributions. Running on none Linux operating system has not been tested.
The following jar packages are distributed as part of the Bisdw distribution. All these packages have their own open source licenses.
All files are distributed as part of Bisdw are located in the lib directory.

8 Bisdw license

Bisdw is licensed under GNU license version 2. For more info please visit http://www.gnu.org/licenses/gpl-2.0.html

9 Bug reports and feature requests

Please submit bug reports and feature requests on gforge.ingby.com

10 Credits

Thanks to all people who has developed all the great software that Bisdw depends on.