Database Table Lookup Example in Talend
Written on: 22nd Sep 2014
Update 1: 22nd Sep 2014 (find the updates at the end of the post) — Sending duplicate rows to a flat file after adding a CRC column.
Update 2: 24th Sep 2014 (find the updates at the end of the post) — Will the lookup table execute the first time?
Scenario:
There is a flat file with a few rows (let's say 6). Insert these 6 rows into a database table. Later, you receive the same flat file with newly added rows (let's say 1 row was added). Now, when you run the job, only the new rows should be inserted into the database table, and the existing rows should be rejected to a flat file by looking up the same table.
Find a quick snapshot of the scenario below:
Sample Data file:
largecities.csv
Content in it :
City;Population(x1000);LandArea(km2);PopulationDensity(people/km2)
Beijing;10233;1418;7620
Moscow;10452;1081;9644
Seoul;10422;605;17215
Tokyo;8731;617;14151
Jakarta;8490;664;12738
New York;8310;789;10452
Hyderabad;4000;543;12345
Observe the data first: there is no unique identification of the rows by a (primary) key.
Step 1 :
- Create metadata for the CSV file (Repository -> Metadata -> File delimited).
- Drag and drop the created .csv metadata onto the design area.
- Drag and drop the tAddCRCRow component (Palette -> Data Quality -> tAddCRCRow).
- Connect these two components with a Main flow.
- As there is no unique way to identify the rows, we add this component; i.e., it makes each row uniquely identifiable.
- Check "Use in CRC" in the Implication properties of the component.
- NOTE: If you already have a unique identifier in your data file, you need not add this component.
Step 2:
- Drag and drop a tMap component and a tPostgreSQLOutput component.
- Connect the CRC component to tMap, then tMap to tPostgreSQLOutput.
- Give all the connection details in the tPostgreSQLOutput component (for example: localhost, 5432, postgres, postgres, public as the schema, student as the database, largecities as the table name).
- Set Action on table to "Drop table if exists and create" and Action on data to "Insert" for the first run; later, change Action on table to "Create table if does not exist".
- Open the tMap component to map the rows from source to target (i.e., from the file metadata to the created table's metadata).
- In the schema editor of the opened tMap component, copy all the rows from the left and paste them on the right side (this is quicker than typing manually), then click Auto map! (or drag each column from the left and release it on the corresponding row on the right).
- Convert CRC from Long to String in the table metadata using the expression String.valueOf(row2.CRC).
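tAddCRCRow's exact algorithm is internal to Talend, but the idea can be sketched in plain Java (this is an illustrative assumption, not Talend's implementation): compute a CRC32 checksum over the concatenated field values of a row, then convert it to a String, just as the tMap expression above does.

```java
import java.util.zip.CRC32;

public class RowCrc {
    // Hypothetical sketch: concatenate the row's fields with a separator
    // and compute a CRC32 checksum, similar in spirit to tAddCRCRow.
    static String rowCrc(String... fields) {
        CRC32 crc = new CRC32();
        crc.update(String.join(";", fields).getBytes());
        // tAddCRCRow emits the CRC as a Long; converting it to a String
        // here mirrors the tMap expression String.valueOf(row2.CRC).
        return String.valueOf(crc.getValue());
    }

    public static void main(String[] args) {
        // The same input always yields the same CRC; any changed field changes it.
        System.out.println(rowCrc("Beijing", "10233", "1418", "7620"));
    }
}
```

The important property for this job is determinism: identical rows always get identical CRCs, so the CRC can serve as a surrogate key for the lookup.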
Step 3 :
- Save the job and run it. Up to here, this is just a normal table load.
Step 4:
Now the lookup scenario starts here...
Let's assume the source file is updated with new rows and you would like to insert only the added rows into the table. (Remember, the file contains the old records plus the new records, and you need to insert only the newly added records.)
That means you have to look up the existing rows (i.e., the table): if a row already exists, reject it; otherwise, insert it into the table.
- Drag and drop the tPostgreSQLInput component onto the canvas.
- Give all the connection details.
- Table Name : “largecities” .
- Query : "SELECT * FROM largecities" (Should be enclosed in double quotes).
- Connect it to tMap component.
- Drag and drop “tFileOutputDelimited” on the canvas and connect “reject” from tMap component.
- The scenario you can find in Image-1 (Go back to Image-1 and observe how the connections were given).
Now, open the tMap component.
- Observe that there are 2 rows on the left side (row2 and row3): one is the flat file and the other is the lookup table.
- On the right side there are 2 outputs (output and reject): one for the table output and one for the rejects.
Left side of tMap:
- Drag CRC from row2 and drop it on CRC of row3; you will get a connection drawn in purple.
- For CRC in row3, give the expression String.valueOf(row2.CRC).
- In the row3 header there are 3 icons; click the first one, and it will show the default property/value settings.
- Change the Join Model from Left Outer Join to Inner Join.
Right side of tMap:
output:
- Set "Catch lookup inner join reject" to true, and in the expression editor write:
!(String.valueOf(row2.CRC).equals(String.valueOf(row3.CRC)))
- In the above expression, we are comparing the string values of the CRC (key) of row2 and row3. If the expression is true, the new rows are passed to the table; if it is false, the data is sent to the delimited file.
- As soon as you write the above expression, you will get two join lines colored orange.
reject:
- Copy and paste the metadata of row2 into the reject metadata (you can also type it manually). Map the row2 metadata to the reject metadata.
- Click Apply and then OK.
Save the job, and let's test it.
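Outside of Talend, the join-and-reject logic above boils down to a set membership test. Here is a minimal Java sketch (the class, method, and field layout are my own, not Talend's): collect the CRCs already in the table into a set; each incoming row whose CRC is absent goes to the insert list, and each row whose CRC is already present goes to the reject list.

```java
import java.util.*;

public class LookupReject {
    // Split incoming rows into "insert" (CRC not in the table) and
    // "reject" (CRC already present), mirroring the tMap inner-join reject.
    // Each row is a String[] whose last element is the CRC column.
    static Map<String, List<String>> split(Set<String> existingCrcs, List<String[]> rows) {
        Map<String, List<String>> out = new HashMap<>();
        out.put("insert", new ArrayList<>());
        out.put("reject", new ArrayList<>());
        for (String[] row : rows) {
            String crc = row[row.length - 1];
            if (existingCrcs.contains(crc)) {
                out.get("reject").add(row[0]);   // existing row -> reject file
            } else {
                out.get("insert").add(row[0]);   // new row -> table insert
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>(Arrays.asList("3199468552"));
        List<String[]> incoming = Arrays.asList(
            new String[]{"Beijing", "3199468552"},
            new String[]{"Hyderabad", "1096786685"});
        System.out.println(split(existing, incoming));
    }
}
```

With the Test-2 data below, the six old CRCs would be in the set, so Hyderabad, Bellampalli, and Warangal land in the insert list and the six old cities in the reject list.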
Test-1:
Data: largecities.csv
Beijing;10233;7620;3199468552
Moscow;10452;9644;1970518573
Seoul;10422;17215;2044616669
Tokyo;8731;14151;2119623448
Jakarta;8490;12738;555965170
New York;8310;10452;2301332887
PostgreSQL table output: largecities
City Population_x1000_ LandArea_km2_ PopulationDensity_people_km2_ CRC
Beijing 10233 1418 7620 3199468552
Moscow 10452 1081 9644 1970518573
Seoul 10422 605 17215 2044616669
Tokyo 8731 617 14151 2119623448
Jakarta 8490 664 12738 555965170
New York 8310 789 10452 2301332887
reject: reject.csv
As we are processing the file for the first time, no data (i.e., no rows) goes into this file as rejections.
Test-2:
Let's say the file is updated with 3 new records; the sample is:
City;Population(x1000);LandArea(km2);PopulationDensity(people/km2)
Beijing;10233;1418;7620
Moscow;10452;1081;9644
Seoul;10422;605;17215
Tokyo;8731;617;14151
Jakarta;8490;664;12738
New York;8310;789;10452
Hyderabad;6767;899;2345
Bellampalli;6767;234;78788
Warangal;78787;3445;90909
PostgreSQL table output: largecities
City Population_x1000_ LandArea_km2_ PopulationDensity_people_km2_ CRC
Beijing 10233 1418 7620 3199468552
Moscow 10452 1081 9644 1970518573
Seoul 10422 605 17215 2044616669
Tokyo 8731 617 14151 2119623448
Jakarta 8490 664 12738 555965170
New York 8310 789 10452 2301332887
Hyderabad 6767 899 2345 1096786685
Bellampalli 6767 234 78788 1693815431
Warangal 78787 3445 90909 3569283860
reject: reject.csv
Beijing;10233;7620;3199468552
Moscow;10452;9644;1970518573
Seoul;10422;17215;2044616669
Tokyo;8731;14151;2119623448
Jakarta;8490;12738;555965170
New York;8310;10452;2301332887
Update 1: (Eliminating duplicate rows from the source file.)
When we have duplicate rows in the source file, the duplicate rows do not get inserted into the table.
To handle this scenario, drag and drop a tUniqRow component between the tAddCRCRow and tMap components.
It will list the columns coming from the tAddCRCRow flow. Check the CRC field as the key attribute.
Let's take one more flat file to store the duplicate rows. For example, if a row appears 3 times, the job has to insert only 1 row, and the remaining 2 duplicate rows have to go into a file.
Find the image below to understand this better.
Sample data for this design:
City;Population(x1000);LandArea(km2);PopulationDensity(people/km2)
Beijing;10233;1418;7620
Moscow;10452;1081;9644
Seoul;10422;605;17215
Tokyo;8731;617;14151
Msoft;3333;7777;8989
Jakarta;8490;664;12738
New York;8310;789;10452
Hyderabad;6767;899;2345
Bellampalli;6767;234;78788
Msoft;3333;7777;8989
Warangal;78787;3445;90909
Sadakar;1000;2222;3333
Raju;8888;9999;1111
Raju;8888;9999;1111
Msoft;3333;7777;8989
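What tUniqRow does with CRC checked as the key can be sketched in plain Java (the names here are mine, for illustration): keep the first occurrence of each key and divert every later occurrence to the duplicates file.

```java
import java.util.*;

public class UniqRows {
    // First occurrence of a row goes to "unique"; every repeat goes to
    // "duplicate", like tUniqRow with CRC checked as the key attribute.
    static Map<String, List<String>> dedup(List<String> rows) {
        Set<String> seen = new LinkedHashSet<>();
        Map<String, List<String>> out = new HashMap<>();
        out.put("unique", new ArrayList<>());
        out.put("duplicate", new ArrayList<>());
        for (String row : rows) {
            if (seen.add(row)) {          // add() returns false if already seen
                out.get("unique").add(row);
            } else {
                out.get("duplicate").add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedup(Arrays.asList("Msoft", "Raju", "Msoft", "Raju", "Msoft")));
    }
}
```

For the sample data above, Msoft (3 occurrences) contributes 2 duplicates and Raju (2 occurrences) contributes 1, so 3 rows in total go to the duplicates file.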
Update 2: 24th Sep 2014
Will the lookup table execute the first time?
Let us assume you have designed the job as shown in Update 1 and are running it.
When you run the job for the 1st time, the lookup step will give you an error, because the table has not yet been created in the database.
My finding here is that we have to run the job the 1st time without the lookup, so that it creates the table in the database with its data, and then apply the lookup. (I am not sure whether this is the right way or not.)
My questions in this scenario are:
1) Does Talend allow creating DDL from metadata before we run the job?
2) Why sequential execution? Does it not support parallel execution of components like Pentaho Kettle ETL does?