Database Table Lookup Example in Talend
Written on: 22nd Sep 2014
Update 1: 22nd Sep 2014 (find the updates at the end of the post) — Sending duplicate rows to a flat file after adding a CRC column.
Update 2: 24th Sep 2014 (find the updates at the end of the post) — Will the lookup table execute the first time?
Scenario:
There is a flat file with a few rows (let's say 6). Insert these 6 rows into a database table. Later, you receive the same flat file with newly added rows (let's say 1 row was added). Now, when you run the job, only the new rows should be inserted into the database table, and the existing rows should be rejected to a flat file by looking up the same table.
Find a quick snapshot of the scenario below:
Sample Data file:
largecities.csv
Content in it :
City;Population(x1000);LandArea(km2);PopulationDensity(people/km2)
Beijing;10233;1418;7620
Moscow;10452;1081;9644
Seoul;10422;605;17215
Tokyo;8731;617;14151
Jakarta;8490;664;12738
New York;8310;789;10452
Hyderabad;4000;543;12345
Observe the data first: there is no unique identification of the rows by a (primary) key.
Step 1 :
- Create metadata for the CSV file (Repository -> Metadata -> File delimited).
- Drag and drop the created .csv metadata onto the design area.
- Drag and drop the tAddCRCRow component (Palette -> Data Quality -> tAddCRCRow).
- Connect these two components with a Main flow.
- As there is no unique way to identify the rows, we add this component; i.e., it makes each row uniquely identifiable.
- Check "Use in CRC" in the Implication properties of the component.
- NOTE: If you already have a unique identifier in your data file, you need not add this component.
Step 2:
- Drag and drop a tMap component and a tPostgreSQLOutput component.
- Connect the CRC component to tMap, then tMap to tPostgreSQLOutput.
- Give all the connection details in the tPostgreSQLOutput component (for example: localhost, 5432, postgres, postgres, public as the schema, student as the database, largecities as the table name).
- Set Action on table to "Drop table if exists and create" and Action on data to "Insert" for the first run; later, change Action on table to "Create table if does not exist".
- Open the tMap component to map the rows from source to target (i.e., from the file metadata to the created table's metadata).
- In the schema editor of the opened tMap component, copy all the rows from the left and paste them on the right side (this is quicker than typing manually), then click Auto map! (or drag each column from the left and release it on the corresponding row on the right).
- Convert CRC from Long to String in the table metadata using the expression String.valueOf(row2.CRC).
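tAddCRCRow's exact algorithm is internal to Talend, but the idea can be sketched in plain Java (this is an illustrative assumption, not Talend's implementation): compute a CRC32 checksum over the concatenated field values of a row, then convert it to a String, just as the tMap expression above does.

```java
import java.util.zip.CRC32;

public class RowCrc {
    // Hypothetical sketch: concatenate the row's fields with a separator
    // and compute a CRC32 checksum, similar in spirit to tAddCRCRow.
    static String rowCrc(String... fields) {
        CRC32 crc = new CRC32();
        crc.update(String.join(";", fields).getBytes());
        // tAddCRCRow emits the CRC as a Long; converting it to a String
        // here mirrors the tMap expression String.valueOf(row2.CRC).
        return String.valueOf(crc.getValue());
    }

    public static void main(String[] args) {
        // The same input always yields the same CRC; any changed field changes it.
        System.out.println(rowCrc("Beijing", "10233", "1418", "7620"));
    }
}
```

The important property for this job is determinism: identical rows always get identical CRCs, so the CRC can serve as a surrogate key for the lookup.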
Step 3 :
- Save the job and run it. Up to here, this is just a normal table load.
Step 4:
Now the lookup scenario starts here...
Let's assume the source file is updated with new rows and you would like to insert only the added rows into the table. (Remember, the file contains the old records plus the new records, and you need to insert only the newly added records.)
That means you have to look up the existing rows (i.e., the table): if a row already exists, reject it; otherwise, insert it into the table.
- Drag and drop the tPostgreSQLInput component onto the canvas.
- Give all the connection details.
- Table Name : “largecities” .
- Query : "SELECT * FROM largecities" (Should be enclosed in double quotes).
- Connect it to tMap component.
- Drag and drop “tFileOutputDelimited” on the canvas and connect “reject” from tMap component.
- The scenario you can find in Image-1 (Go back to Image-1 and observe how the connections were given).
Now, open the tMap component.
- Observe that there are 2 rows on the left side (row2 and row3): one is the flat file and the other is the lookup table.
- On the right side there are 2 outputs (output and reject): one for the table output and one for the rejects.
Left side of tMap:
- Drag CRC from row2 and drop it on CRC of row3; you will get a connection drawn in purple.
- For CRC in row3, give the expression String.valueOf(row2.CRC).
- In the row3 header there are 3 icons; click the first one, and it will show the default property/value settings.
- Change the Join Model from Left Outer Join to Inner Join.
Right side of tMap:
output:
- Set "Catch lookup inner join reject" to true, and in the expression editor write:
!(String.valueOf(row2.CRC).equals(String.valueOf(row3.CRC)))
- In the above expression, we are comparing the string values of the CRC (key) of row2 and row3. If the expression is true, the new rows are passed to the table; if it is false, the data is sent to the delimited file.
- As soon as you write the above expression, you will get two join lines colored orange.
reject:
- Copy and paste the metadata of row2 into the reject metadata (you can also type it manually). Map the row2 metadata to the reject metadata.
- Click Apply and then OK.
Save the job, and let's test it.
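Outside of Talend, the join-and-reject logic above boils down to a set membership test. Here is a minimal Java sketch (the class, method, and field layout are my own, not Talend's): collect the CRCs already in the table into a set; each incoming row whose CRC is absent goes to the insert list, and each row whose CRC is already present goes to the reject list.

```java
import java.util.*;

public class LookupReject {
    // Split incoming rows into "insert" (CRC not in the table) and
    // "reject" (CRC already present), mirroring the tMap inner-join reject.
    // Each row is a String[] whose last element is the CRC column.
    static Map<String, List<String>> split(Set<String> existingCrcs, List<String[]> rows) {
        Map<String, List<String>> out = new HashMap<>();
        out.put("insert", new ArrayList<>());
        out.put("reject", new ArrayList<>());
        for (String[] row : rows) {
            String crc = row[row.length - 1];
            if (existingCrcs.contains(crc)) {
                out.get("reject").add(row[0]);   // existing row -> reject file
            } else {
                out.get("insert").add(row[0]);   // new row -> table insert
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>(Arrays.asList("3199468552"));
        List<String[]> incoming = Arrays.asList(
            new String[]{"Beijing", "3199468552"},
            new String[]{"Hyderabad", "1096786685"});
        System.out.println(split(existing, incoming));
    }
}
```

With the Test-2 data below, the six old CRCs would be in the set, so Hyderabad, Bellampalli, and Warangal land in the insert list and the six old cities in the reject list.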
Test-1:
Data: largecities.csv
Beijing;10233;7620;3199468552
Moscow;10452;9644;1970518573
Seoul;10422;17215;2044616669
Tokyo;8731;14151;2119623448
Jakarta;8490;12738;555965170
New York;8310;10452;2301332887
PostgreSQL table output: largecities
City Population_x1000_ LandArea_km2_ PopulationDensity_people_km2_ CRC
Beijing 10233 1418 7620 3199468552
Moscow 10452 1081 9644 1970518573
Seoul 10422 605 17215 2044616669
Tokyo 8731 617 14151 2119623448
Jakarta 8490 664 12738 555965170
New York 8310 789 10452 2301332887
reject: reject.csv
As we are processing the file for the first time, no data (i.e., no rows) goes into this file as rejections.
Test-2:
Let's say the file is updated with 3 new records; the sample is:
City;Population(x1000);LandArea(km2);PopulationDensity(people/km2)
Beijing;10233;1418;7620
Moscow;10452;1081;9644
Seoul;10422;605;17215
Tokyo;8731;617;14151
Jakarta;8490;664;12738
New York;8310;789;10452
Hyderabad;6767;899;2345
Bellampalli;6767;234;78788
Warangal;78787;3445;90909
PostgreSQL table output: largecities
City Population_x1000_ LandArea_km2_ PopulationDensity_people_km2_ CRC
Beijing 10233 1418 7620 3199468552
Moscow 10452 1081 9644 1970518573
Seoul 10422 605 17215 2044616669
Tokyo 8731 617 14151 2119623448
Jakarta 8490 664 12738 555965170
New York 8310 789 10452 2301332887
Hyderabad 6767 899 2345 1096786685
Bellampalli 6767 234 78788 1693815431
Warangal 78787 3445 90909 3569283860
reject: reject.csv
Beijing;10233;7620;3199468552
Moscow;10452;9644;1970518573
Seoul;10422;17215;2044616669
Tokyo;8731;14151;2119623448
Jakarta;8490;12738;555965170
New York;8310;10452;2301332887
Update 1: (Eliminating duplicate rows from the source file.)
When we have duplicate rows in the source file, the duplicate rows do not get inserted into the table.
To handle this scenario, drag and drop a tUniqRow component between the tAddCRCRow and tMap components.
It will list the columns coming from the tAddCRCRow flow. Check the CRC field as the key attribute.
Let's take one more flat file to store the duplicate rows. For example, if a row appears 3 times, the job has to insert only 1 row, and the remaining 2 duplicate rows have to go into a file.
Find the image below to understand this better.
Sample data for this design:
City;Population(x1000);LandArea(km2);PopulationDensity(people/km2)
Beijing;10233;1418;7620
Moscow;10452;1081;9644
Seoul;10422;605;17215
Tokyo;8731;617;14151
Msoft;3333;7777;8989
Jakarta;8490;664;12738
New York;8310;789;10452
Hyderabad;6767;899;2345
Bellampalli;6767;234;78788
Msoft;3333;7777;8989
Warangal;78787;3445;90909
Sadakar;1000;2222;3333
Raju;8888;9999;1111
Raju;8888;9999;1111
Msoft;3333;7777;8989
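What tUniqRow does with CRC checked as the key can be sketched in plain Java (the names here are mine, for illustration): keep the first occurrence of each key and divert every later occurrence to the duplicates file.

```java
import java.util.*;

public class UniqRows {
    // First occurrence of a row goes to "unique"; every repeat goes to
    // "duplicate", like tUniqRow with CRC checked as the key attribute.
    static Map<String, List<String>> dedup(List<String> rows) {
        Set<String> seen = new LinkedHashSet<>();
        Map<String, List<String>> out = new HashMap<>();
        out.put("unique", new ArrayList<>());
        out.put("duplicate", new ArrayList<>());
        for (String row : rows) {
            if (seen.add(row)) {          // add() returns false if already seen
                out.get("unique").add(row);
            } else {
                out.get("duplicate").add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedup(Arrays.asList("Msoft", "Raju", "Msoft", "Raju", "Msoft")));
    }
}
```

For the sample data above, Msoft (3 occurrences) contributes 2 duplicates and Raju (2 occurrences) contributes 1, so 3 rows in total go to the duplicates file.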
Update 2: 24th Sep 2014
Will the lookup table execute the first time?
Let us assume you have designed the job as shown in Update 1 and are running it.
When you run the job for the 1st time, the lookup step will give you an error, because the table has not yet been created in the database.
My finding here is that we have to run the job the 1st time without the lookup, so that it creates the table in the database with its data, and then apply the lookup. (I am not sure whether this is the right way or not.)
My questions in this scenario are:
1) Does Talend allow creating DDL from metadata before we run the job?
2) Why sequential execution? Does it not support parallel execution of components like Pentaho Kettle ETL does?