
Tuesday, October 20, 2015

Data integration with Knime and the R statistical software

I am testing the Knime software to create data pipelines. I started by installing the following extensions:
  •   KNIME Connectors for Common Databases    
  •   KNIME Interactive R Statistics Integration    

Database operations


I tried chaining a Database Row Filter node after a Database Table Selector node containing an SQL statement of the form "select * from table". The query was taking ages because my source table is rather large. I then replaced the SQL statement in the Database Row Filter node with a statement of the form "select * from table where code = 999". This time the query ran much faster.
Unlike dplyr, which builds up a single SQL query from the group_by(), select() and filter() verbs before executing it, Knime seems to execute all the intermediate SQL queries one after the other.
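A minimal sketch of that dplyr behaviour (assuming a DBI connection called con and a table "mytable" with columns code and year, all hypothetical names):

    library(dplyr)
    remote_tbl <- tbl(con, "mytable")   # lazy table reference, nothing executed yet
    query <- remote_tbl %>%
      filter(code == 999) %>%           # becomes a WHERE clause
      select(code, year)                # becomes the SELECT list
    show_query(query)                   # prints the single combined SQL statement
    result <- collect(query)            # only now is the query sent to the database

The database receives one query with the WHERE clause pushed down, instead of first returning the whole table.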

Interaction with the R statistical program


Then I pushed the data to R. The input data frame is called knime.in. One issue is that most character vectors are transformed into factors, which was causing various errors: max(year) returned an error, and several merge operations were failing. I had to tell R to change those column types back to character or numeric.
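A minimal sketch of that conversion inside an R node (the column name year is an assumption):

    # knime.in is the input table provided by the KNIME R integration
    df <- knime.in
    # turn every factor column back into a character vector
    df[] <- lapply(df, function(x) if (is.factor(x)) as.character(x) else x)
    df$year <- as.numeric(df$year)   # year should be numeric again
    max(df$year)                     # now works as expected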

I wanted to filter the data before plotting it, but I needed to filter on two columns and didn't know how to implement this in Knime. A Google search returned this forum thread. The Rule-based Row Filter node seems to work.
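In that node, a rule along the lines of $code$ = 999 AND $year$ > 2000 => TRUE (the column names here are hypothetical) matches rows satisfying both conditions, which the node can then include or exclude.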




In the workflow above, I used an R View node to display a plot generated with ggplot.
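A minimal sketch of such an R View node script (the column names year and value are assumptions):

    library(ggplot2)
    df <- knime.in
    df$year <- as.numeric(as.character(df$year))
    # printing the ggplot object is what draws it in the node's view
    print(ggplot(df, aes(x = year, y = value)) + geom_line())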

Workflows are a nice way to display data integration steps and are probably easy to explain to others. Node configuration is rather straightforward once you have found the right node in the repository. I haven't figured out yet how to use input forms and flow variables.

I don't know how easy it is to maintain functional workflows over the long term.

Wednesday, February 11, 2015

Big scientist

Hilary Mason:
Big data is data that cannot be held on one node.
[...] Some people spread the idea that big data will tell you what to do. [...] This is bullshit; it concerns me that this idea is gaining steam outside of the tech community.
Neha Kothari:
LinkedIn's Hadoop cluster contains information on all the clicks made by users. 1,000 employees have access to the cluster and run queries on the data with Pig.
Women in data science

Tuesday, November 18, 2014

Data manipulation with dplyr

Dplyr is a package for data manipulation developed by Hadley Wickham and Romain Francois for the R statistical software.

  • Introduction to dplyr
  • A tutorial from João Neto (dplyr.Rmd) gives examples of tools for grouped operations (see the sketch after this list): 
    • n(): number of observations in the current group
    • n_distinct(x): count the number of unique values in x.
    • first(x), last(x) and nth(x, n) - these work similarly to x[1], x[length(x)], and x[n] but give you more control of the result if the value isn’t present.
    • min(), max(), mean(), sum(), sd(), median(), and IQR()
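A minimal sketch of how these helpers combine in a grouped summarise (using the built-in mtcars data set just for illustration):

    library(dplyr)
    mtcars %>%
      group_by(cyl) %>%
      summarise(n_cars   = n(),               # observations in each group
                n_gears  = n_distinct(gear),  # distinct gear values per group
                mean_mpg = mean(mpg),
                sd_mpg   = sd(mpg))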

Non-standard evaluation

dplyr uses non-standard evaluation. To use standard evaluation, a workaround has to be found. See this Stack Overflow question.
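A minimal sketch of such a workaround when the column name arrives as a string (the data frame some_data and column year are hypothetical; recent dplyr versions expose the .data pronoun for this, while the versions current at the time of writing relied on underscore verbs such as filter_()):

    library(dplyr)
    col <- "year"                    # column name held in a character variable
    some_data %>%
      filter(.data[[col]] > 2000)    # standard evaluation via the .data pronoun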