Tuesday, October 20, 2015

Data integration with Knime and the R statistical software

I am testing the Knime software to create data pipelines. I started by installing the following extensions:
  •   KNIME Connectors for Common Databases    
  •   KNIME Interactive R Statistics Integration    

Database operations

I tried chaining the Database Row Filter node after the Database Selector node (containing an SQL statement of the form "select * from table"). But the query took ages because my source table is rather large. I replaced the SQL statement in the Database Row Filter node with a statement of the form "select * from table where code = 999". This time the query ran much faster.
Unlike dplyr, which builds up a single SQL query from the group_by(), select() and filter() verbs before executing it, Knime seems to execute the SQL queries one after the other.
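
To illustrate the dplyr behaviour I am comparing against, here is a minimal sketch. It assumes the dplyr and dbplyr packages are installed, and it uses a simulated table with hypothetical columns (code, year, value) instead of my real source table, so no database connection is needed.

```r
library(dplyr)
library(dbplyr)

# Simulated remote table (hypothetical columns); nothing is sent to a database.
source_table <- lazy_frame(code = 999L, year = 2015L, value = 1.0,
                           con = simulate_dbi())

# The verbs are only recorded here, no query is executed.
query <- source_table %>%
  filter(code == 999) %>%
  select(year, value)

# Print the single SQL statement that dplyr would send.
show_query(query)

# Keep the generated SQL as plain text for inspection.
sql_text <- as.character(sql_render(query))
```

The filter ends up as a WHERE clause in the generated statement, so the database would only return the matching rows.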

Interaction with the R statistical program

Then I pushed the data to R. The input data frame is called knime.in. One issue is that most character vectors are transformed into factors. This was causing various errors: max(year) returned an error, and various merge operations were failing. I had to tell R to change all those column types back to character or numeric.
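
A minimal sketch of that conversion in base R is shown below. The data frame here is only a stand-in for knime.in with hypothetical columns (code, year, country), so the snippet can be run outside Knime.

```r
# Stand-in for the data frame that Knime passes to R as knime.in
# (hypothetical columns; strings arrive as factors).
knime.in <- data.frame(
  code = c("999", "999", "100"),
  year = c("2012", "2014", "2013"),
  country = c("FR", "DE", "FR"),
  stringsAsFactors = TRUE
)

df <- knime.in

# Convert every factor column back to character.
is_factor <- vapply(df, is.factor, logical(1))
df[is_factor] <- lapply(df[is_factor], as.character)

# Convert the columns that really hold numbers.
# Always go through character: as.integer() on a factor returns level codes.
df$year <- as.integer(df$year)

# max() now works, whereas it fails on a factor.
latest_year <- max(df$year)
```

Going through as.character() first matters, because converting a factor directly to a number gives the internal level codes rather than the original values.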

I wanted to filter the data before plotting it, but I needed to filter on 2 columns and didn't know how to implement this in Knime. A Google search returned a forum thread. The Rule-based Row Filter node seems to work.
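
For comparison, the same kind of two-column condition could also be written on the R side. This is only a sketch with a stand-in data frame and hypothetical columns (code, year, value), not the rule I used in Knime.

```r
# Stand-in for knime.in (hypothetical columns and values).
knime.in <- data.frame(
  code = c(999, 999, 100, 999),
  year = c(2009, 2012, 2013, 2015),
  value = c(1.5, 2.0, 3.2, 4.1)
)

# Keep rows that satisfy both conditions at once.
filtered <- knime.in[knime.in$code == 999 & knime.in$year >= 2010, ]

# In an R Snippet node, the result would be assigned to knime.out
# to pass it on to the next node.
```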

In the workflow above, I used the R View node to display a plot generated with ggplot.

Workflows are a nice way to display data integration steps and are probably easy to explain to others. Node configuration is rather straightforward once you have found the right node in the repository. I haven't figured out yet how to use input forms and flow variables.

I don't know how easy it is to maintain functional workflows in the long term.
