Simple Extract, Transform and Load library for .NET
Rhino ETL
Rhino ETL is an extract, transform and load utility that enables you to move data from many different sources, transform them however you like and then load it into a different destination source. If you have worked with ETLs in the past you may be familiar with technologies such as: DTS(Data Transformation Services) or SSIS(SQL Server Integration Services) in the Microsoft world.
Resources
Hello World Rhino ETL
Patrik Lindström wrote an Hello World application. Check his blog post, and complete source files at GitHub for more details on how to quickly get started.
Paul Barriere
- Write Etl jobs in pure C#
- Presentation about Rhino ETL in the Dallas C# user group:
Ayende Rahien
- Status Report
- Turning Transformations to FizzBuzz tests
- First Code Drop
- Joins, Distinct & Engine
- Aggregates
- Targets
- Web Services Source
- FileHelpers integration
- Writing to files
- Importing Data into MS CRM
- Performance Joins – Why you should always have a profiler
- Algorithms, Joins and Performance
- Rhino Etl 2.0
- Command Line Interface
Coding Instinct (Torkel Ödegaard)
Learn
Rhino Etl processes can be created by either writing C# classes or through a DSL. The basic idea behind each process is a Pipeline. Each process has input rows and output rows. I will demonstrate using the DSL. Let’s assume for a moment that you have 5 records in a database as depicted in the picture above. You would like to get those records out of the database, format them and finally into a pipe delimited text file.
PROCESS 1: EXTRACT DATA FROM THE DATABASE
The input command below will execute the SQL statement and return all rows into the pipeline.
input "SourceDatabase", Command = """
SELECT
FirstName, LastName, LoyaltyNumber
FROM LoyaltyTable
"""
It’s important to note here that each step in the pipeline does not process or gather all of the rows before it passes it to the next step in the pipeline. As each row is emitted from one process and it is sent to the next. This can dramatically reduce the memory footprint required to run the complete process. The next step is to transform the rows from the first process.
PROCESS 2: TRANSFORM THE DATA
This operation uses simple boo syntax (python-like) to manipulate each row. You can modify existing columns or create new ones just by setting their value.
operation transform_names:
for row in rows:
row.Name = row.FirstName + " " + row.LastName
row.Loyalty = row.LoyaltyNumber.Substring(0,3) + "-" + row.LoyaltyNumber.Substring(3,row.LoyaltyNumber.Length-3)
yield row
This operation will take each row and create a new field called Name with the First and Last names combined together. Along with parsing out the Loyalty the way we’d like it.
PROCESS 3: EXPORT THE DATA TO A TEXT FILE
In order to export the data we must define what format should be exported. For text files Rhino Etl comes with the FileHelpers library that allows you to define the format by simply creating a class. Since we’re demonstrating this using the DSL here is how the class would look for our export file.
import FileHelpers
[DelimitedRecord("|")]
public class LoyaltyRow:
public Name as string
public Loyalty as string
Then we need a simple operation to write that file to disk.
operation export_file:
engine = Rhino.Etl.Core.Files.FluentFile.For[of LoyaltyRow]()
file = engine.To("""DatabaseExport.txt""")
for row in rows:
record = row.ToObject[of LoyaltyRow]()
file.Write(record)
yield row
file.Dispose()
COMPLETE FILE
Those are the major pieces of an etl script. Here is what the complete file looks like.
import FileHelpers
[DelimitedRecord("|")]
public class LoyaltyRow:
public Name as string
public Loyalty as string
operation transform_names:
for row in rows:
row.Name = row.FirstName + " " + row.LastName
row.Loyalty = row.LoyaltyNumber.Substring(0,3) + "-" + row.LoyaltyNumber.Substring(3,row.LoyaltyNumber.Length-3)
yield row
operation export_file:
engine = Rhino.Etl.Core.Files.FluentFile.For[of LoyaltyRow]()
file = engine.To("""DatabaseExport.txt""")
for row in rows:
record = row.ToObject[of LoyaltyRow]()
file.Write(record)
yield row
file.Dispose()
process Test:
input "SourceDatabase", Command = """
SELECT
FirstName, LastName, LoyaltyNumber
FROM LoyaltyTable
"""
transform_names()
export_file()
CONFIGURATION
Since we are referencing a database connection string named “SourceDatabase” it must pull that information from a configuration file. Here is a very simple one. You need to specify the name, connectionString along with the providerName to indicate which ADO.NET provider to use.
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<connectionStrings>
<add name="SourceDatabase"
connectionString="Data Source=.\sqlexpress;Initial Catalog=rhino-etl-examples;Integrated Security=SSPI;Timeout=300;"
providerName="System.Data.SqlClient.SqlConnection, System.Data, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"/>
</connectionStrings>
</configuration>
EXECUTION
Once you have your two files (Example.boo for the ETL and Example.config for the configuration) you can execute it on the command line using Rhino.Etl.Cmd.
Rhino.Etl.Cmd -f:Example.boo -c:Example.config -v
Dsl – Chaining Input From Three Files
This is an example of importing 3 text files into a Microsoft Sql Server. All three are the exact same format. Two files were created to perform the operation. The Configuration.config for the connection string and Changed.boo which contains the operation.
Configuration.config
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<connectionStrings>
<add name="ConnectionName"
connectionString="Data Source=YourServer;Initial Catalog=YourDatabase;Integrated Security=SSPI;Timeout=300;"
providerName="System.Data.SqlClient.SqlConnection, System.Data, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"/>
</connectionStrings>
</configuration>
Changed.boo
import FileHelpers
[FixedLengthRecord]
class NcoaRow:
[FieldFixedLength(4)] public Trne as string
[FieldFixedLength(11)] public CustomerId as string
[FieldFixedLength(25)] public CustomerName as string
[FieldFixedLength(25)] public CustomerAddress1 as string
[FieldFixedLength(25)] public CustomerAddress2 as string
[FieldFixedLength(20)] public CustomerCity as string
[FieldFixedLength(2)] public CustomerState as string
[FieldFixedLength(9)] public CustomerZip as string
[FieldFixedLength(10)] public CustomerPhone as string
[FieldFixedLength(5)] public SignUpStore as string
[FieldFixedLength(1)] public TypeCode as string
[FieldFixedLength(20)] public FirstName as string
[FieldFixedLength(20)] public LastName as string
[FieldFixedLength(50)] public EmailAddress as string
[FieldFixedLength(1)] public Gender as string
[FieldFixedLength(1)] public HomeOwnerStatus as string
[FieldFixedLength(10)] public DateOfBirth as string
[FieldFixedLength(10)] public DateUpdated as string
[FieldFixedLength(1)] public Language as string
[FieldFixedLength(10)] public DateOther as string
operation import_undeliv:
for row in rows:
yield row
file = Rhino.Etl.Core.Files.FluentFile(typeof(NcoaRow)).From("""YourUndelivPath""")
for row in file:
yield Row.FromObject(row)
operation import_changed:
for row in rows:
yield row
file = Rhino.Etl.Core.Files.FluentFile(typeof(NcoaRow)).From("""YourChangedPath""")
for row in file:
yield Row.FromObject(row)
operation import_unchanged:
for row in rows:
yield row
file = Rhino.Etl.Core.Files.FluentFile(typeof(NcoaRow)).From("""YourUnchangedPath""")
for row in file:
yield Row.FromObject(row)
process ChangedBulkImport:
import_undeliv()
import_changed()
import_unchanged()
sqlBulkInsert "ConnectionName", "TableName", TableLock = true, KeepIdentity = true, BatchSize = 10000, UseTransaction = false :
map "Trne", "Trne", string
map "CustomerId", "CustomerId", string
map "CustomerName", "CustomerName", string
map "CustomerAddress1", "CustomerAddress1", string
map "CustomerAddress2", "CustomerAddress2", string
map "CustomerCity", "CustomerCity", string
map "CustomerState", "CustomerState", string
map "CustomerZip", "CustomerZip", string
map "CustomerPhone", "CustomerPhone", string
map "SignUpStore", "SignUpStore", string
map "TypeCode", "TypeCode", string
map "FirstName", "FirstName", string
map "LastName", "LastName", string
map "EmailAddress", "EmailAddress", string
map "Gender", "Gender", string
map "HomeOwnerStatus", "HomeOwnerStatus", string
map "DateOfBirth", "DateOfBirth", string
map "DateUpdated", "DateUpdated", string
map "Language", "Language", string
map "DateOther", "DateOther", string
Executed using
Rhino.Etl.Cmd -c:Configuration.config -f:Changed.boo