Friday, 7 February 2014

Convert Excel file to CSV file using Java Code in Talend ETL - Why we need to convert Excel file using java code instead components do it ???

Hi Guys,

Aim of this post :
Converting excel file(.xlsx and/or .xls) to CSV file(.csv) using Java Code in Talend Job.

Why to convert Excel to CSV using java code in Talend ETL when you can directly input ExcelInput to CSVoutput?

B'z some times when you dump data taking from
2007 Excel work book to CSV file  OR
2007 Excel work book to Table output (Any database table : Postgres or MySQL or SQL)

the data in the few columns in the work book will fall into infinite loop and causes to give "Stack Over flow error "

What I found with Excel Work book 2007 to cause this "Stack over flow " Exception
Categories of fields in Excel 
- Few fields could be General
- Few  fields could be Date
- Few fields could be Formulas
- Few fields could be Number

Though we Tick the check box "Read excel2007 file format(xlsx)" some times we may not able to dump the all rows of data to the target .

In my case I have an Excel work book of 2007 with 2000 rows and 10+ fields where few fields like date1, date2 having date type in the excel work book and ETL was not allowed me to dump all the data to the target instead giving me "Stack Over Flow" Exception..

I thought it of Memory problem but it was not and then converted Excel 2007 work book to CSV file and dumped the data to Target.

I've directly mapped ExcelComponent Input to CSV component Output but it also given me the same " Stack Over Flow" Exception and

the above situation leads me to convert Excel Workbook to CSV file.... I've tried to

Software setup for this example
ETL
1) TOS_DI-r96646-V5.1.3
2) 2007 Excel Work book with data


Code Execution in Eclipse (Testing Externally)
1) Eclipse version :
     Version: Juno Service Release 2
     Build id: 20130225-0426
2) Code compiled/Executed on JDK 1.7 in eclipse

I've taken code from this site

http://bigsnowball.com/content/convert-excel-xlsx-and-xls-csv?language=ar

You can download the same code with Eclipse project here(.zipfile)
NOTE:
The code uses library files of Apache POI - the Java API for Microsoft Document
You need to download the all the jar files of POI and need to add them to your eclipse project
Download jar files from here poi-bin-3.9-20121203.zip


ETL job design and code implementation of the same


1) Drag and drop "tLibraryLoad" component from Palette to Design area.
2) Add all the required jar files of Apache POI - the Java API for Microsoft Document
3) How to add multiple jar files - You can learn from here(Click this link)
4) Drag and drop "tJava" component.
5) Connect "tLibrayLoad" component to "tJava" component.
    a) Right click on "tLibraryLoad'
    b) Click on Trigger then drop On sub job ok to "tJava" component.

6) Write below java code in "tJava" component Code area ,which will convert the input Excel work book(.xlsx or .xls) to CSV file(.csv).

class CSVUtils {
  
   /*<strong>
     * Convert Excel spreadsheet to CSV.
   * Works for <strong>Excel 97-2003 Workbook (<em>.xls)</strong> and <strong>Excel Workbook (</em>.xlsx)</strong>.
  * Does not work for the <strong>XML Spreadsheet 2003 (*.xml)</strong> format produced by BIRT.
     * @param fileName
  * @throws InvalidFormatException
   * @throws IOException
  */
public void convertExcelToCSV(String fileName) throws InvalidFormatException, IOException {

        BufferedWriter output = new BufferedWriter(new FileWriter(fileName.substring(0, fileName.lastIndexOf(".")) + ".csv"));

        InputStream is = new FileInputStream(fileName);

       Workbook wb = WorkbookFactory.create(is);

     Sheet sheet = wb.getSheetAt(0);

       // hopefully the first row is a header and has a full compliment of
        // cells, else you'll have to pass in a max (yuck)
        int maxColumns = sheet.getRow(0).getLastCellNum();

        for (Row row : sheet) {

           // row.getFirstCellNum() and row.getLastCellNum() don't return the
            // first and last index when the first or last column is blank
         int minCol = 0; // row.getFirstCellNum()
           int maxCol = maxColumns; // row.getLastCellNum()
          
           for (int i = minCol; i < maxCol; i++) {

                Cell cell = row.getCell(i);
                String buf = "";
             if (i > 0) {
                    buf = "|";
               }

             if (cell == null) {
                    output.write(buf);
                 //System.out.print(buf);
               } else {

                  String v = null;

                  switch (cell.getCellType()) {
                  case Cell.CELL_TYPE_STRING:
                        v = cell.getRichStringCellValue().getString();
                     break;
                 case Cell.CELL_TYPE_NUMERIC:
                       if (DateUtil.isCellDateFormatted(cell)) {
                          
                          // SimpleDateFormat datetemp = new SimpleDateFormat("yyyy-MM-dd");
                           //v = datetemp.format(cell.getDateCellValue());
                          
                          v = cell.getDateCellValue().toString();
                        } else {
                           //v = String.valueOf(cell.getNumericCellValue());
                            v = new HSSFDataFormatter().formatCellValue( cell );
                        }
                      break;
                 case Cell.CELL_TYPE_BOOLEAN:
                       v = String.valueOf(cell.getBooleanCellValue());
                        break;
                 case Cell.CELL_TYPE_FORMULA:
                       v = cell.getCellFormula();
                      
                      switch(cell.getCachedFormulaResultType()) {
                       case Cell.CELL_TYPE_NUMERIC:
                           v =   String.valueOf(cell.getNumericCellValue());
                           //v = new HSSFDataFormatter().formatCellValue( cell );
                           break;
                       case Cell.CELL_TYPE_STRING:
                           v =   String.valueOf(cell.getRichStringCellValue());
                           break;
                   }
                     //v = String.valueOf(cell.getDateCellValue());
                     break;
                 default:
                   }

                 if (v != null) {
                       buf = buf + toCSV(v);
                  }
                  output.write(buf);
                 //System.out.print(buf);
               }
          }

         output.write("\n");
          //System.out.println();
        }
      is.close();
        output.close();

   }
   /*</strong>
     * Escape the given value for output to a CSV file.
    * Assumes the value does not have a double quote wrapper.
  * @return
  */
    public String toCSV(String value) {
    
        String v = null;
       boolean doWrap = true;
       
       if (value != null) {
          
           v = value;
        
           if (v.contains("\"")) {
             v = v.replace("\"", "\"\""); // escape embedded double quotes
               doWrap = true;
         }
         
           if (v.contains(",") || v.contains("\n")) {
             doWrap = true;
         }
         
           if (doWrap) {
              v = "\"" + v + "\""; // wrap with double quotes to hide the comma
            }
      }
     
       return v;
     
   }

}

CSVUtils csv = new CSVUtils();
csv.convertExcelToCSV("C:/Users/sadakar002/Desktop/TestBySadha/FilesTemplates/Sadakar.xlsx");


7) You need to import all the packages to run the above code.
You can do that by Clicking " Advanced settings" of  tJava Component and writing below code at "Import" area.

//import java.util.List;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.text.SimpleDateFormat;

import org.apache.poi.hssf.usermodel.HSSFDataFormatter;


8) Save and Run the job
9) Find the converted file in the same location where you are referring your Excel Work book.

NOTE:
Note that when you are taking code from Eclipse console to talend you need to skip few code lines like static key word from the methods.





Add multiple jar files using tLibraryLoad component in Talend ETL

Hi guys,
As part of my project I've explored tLibraryLoad component usage.

I needed to add multiple jar files using tLibraryLoad component and initially I used as many tLibraryLoad components for as many jar files and tired of adding them.

Then after googling and seeking help from expertise I've done the below.. This might be useful for Talend developers.

Aim of the post:
Adding multiple jar files using tLibraryLoad component.
Environment
TOS_DI-r96646-V5.1.3



Steps:
    1) Drag and drop tLibraryLoad component from Palette section to ETL design area.
    2) Click on the component to View it’s basic settings.
    3) Whatever the files you want to use in your job , you need to add all those jar files to the lib folder of Talend installed location.
·         For instance : D:\Talend\TOS_DI-r96646-V5.1.3\TOS_DI-r96646-V5.1.3\lib\java
·         Restart your talend
    4) When you use tLibraryLoad component, initially you must add a default jar file.(For eg: from the drop down select jdbc14.jar file even though there is no use of this jar file with in your job).
   5)   Next, click on Advanced settings and then keep on clicking + Symbol for the number of jar files you want to add.
   6)  Give the jar file names in double quotes.
a.       For instance : "poi-examples-3.9-20121203.jar"
b.      "poi-excelant-3.9-20121203.jar"
c.       "poi-ooxml-3.9-20121203.jar"
d.      And etc.
(These jar files are taken from "Apache POI - for Microsoft Excel)

    7)  Test whether the jar files are coming or not when you export the job.
Answer is yes, when you export the job containing tLibraryLoad, it’ll export all the jars added in it to lib folder of export.

A quick understanding image is given below.

 

Monday, 3 February 2014

Error Time in Talend 5.1.3 Open Studio - "Execution failed :generationManager is not initialized by the perljet/javajet!"

Hello every one,
It has given me a learning experience with Talend 5.1.3...

Yesterday(3rd feb 2014) every thing(Talend project jobs) working fine .. I have developed few jobs and executed successfully..

Today I have  executed the same jobs again and I received an error saying
 "Execution failed :generationManager is not initialized by the perljet/javajet!"

I had done a small googling and tried the tip from here

What I did to work properly..
1) What ever the newly created jobs(I've 2 jobs for testing purpose for a functionality)  created yesterday , I deactivated them.
2) Deleted org folder from \.JETEmitters\src folder and then tried to execute the job. I got null pointer exception.
3) Restored the org folder and executed required job.. it is working now..


What I understand from this error is :
What ever the jobs we create, we need to maintain all the information used by those jobs..
In my case I've created a job for testing purpose and taken a metadata file... later I removed metadata file and kept the job as it is.. so when I execute the other talend jobs.. the project was looking for all the resources available for the remaining jobs as well. and caused for the error.



Routine example in Talend ETL - Convert strings to dates

Hello every one..

In this example you will input different formats of strings and convert them to standard date format...
What ever the format you provide in the form of strings the below code converts into a single date format...
For example if your date is dd-MM-yyyy or dd/MM-yyyy which are strings in your meta data such as .csv or .xls or .xlsx ... by using this code the format of the string is changed to Date with a single formatted output. 

 Coding credit goes to Vaibhav ji(Data warehouse architect - one of my colleagues)...

 From the Repository, right click on Code -> Create routine and give the name of it..
Write below code in your routine. Routine name and class name should be equal.

package routines;

import java.util.Date;
import java.text.ParseException;
import java.text.SimpleDateFormat;


public class DateLogics{
    private static Date get_date = null;

    /**
     * @param get_date the get_date to set
     */
    public static void setGet_date(String get_dates) {
        if ((StringHandling.INDEX(get_dates,"-")==2))
        {
            try {
                get_date =(Date) new SimpleDateFormat("dd-MM-yyyy").parse(get_dates);
            } catch (ParseException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
        if ((StringHandling.INDEX(get_dates,".")==2))
        {
            try {
                get_date =(Date) new SimpleDateFormat("dd.MM.yyyy").parse(get_dates);
            } catch (ParseException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
        if ((StringHandling.INDEX(get_dates,"/")==2))
        {
            try {
                get_date =(Date) new SimpleDateFormat("dd/MM/yyyy").parse(get_dates);
            } catch (ParseException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
      
    }

    /**
     * @return the get_date
     */
    public static Date getGet_date() {
        return get_date;
    }
   
}

Now, create a job and drag and drop a tJava component and write below code..
String date_input = "10.10.2014";
System.out.println(StringHandling.LEN(date_input));
DateLogics.setGet_date(date_input);
System.out.println(DateLogics.getGet_date());


NOTE:
You need to provide as many if conditions for the different kinds of date formats as strings so that you can generalize the code ..

version tested: TOS : 5.1.3

Thank you.