Showing posts with label R. Show all posts

Friday, March 25, 2016

Estimating panel data models with the R package plm

Panel data, also called longitudinal data, concerns individuals observed through time. It is said to have both a cross-section and a time-series dimension. The R package plm provides panel data estimators for econometricians and is documented in a detailed vignette.

Default settings of the plm() function

By default, the plm() function assumes that the individual and time indexes are in the first two columns. If this is not the case, an index argument has to specify the names of those two variables in the dataset. For example, the argument index = c("country","year") would specify that the individual index is in the column country and the time index is in the column year.

The plm() function's default settings estimate a "Oneway (individual) effect Within Model". "Oneway (individual) effect" is a model specification in which each individual i has a constant, unobserved effect \alpha_i. "Within Model" is an estimation method which gives the same slope coefficients as the Least Squares Dummy Variable (LSDV) estimation.
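As a rough illustration of why the within and LSDV estimations agree, here is a minimal sketch in base R only (it does not call plm itself). The panel is simulated, and the column names country, year, x and y as well as the effect sizes are made up for the example.

```r
# Simulated panel: 4 individuals observed over 5 years (made-up data)
set.seed(1)
panel <- data.frame(
  country = rep(c("A", "B", "C", "D"), each = 5),
  year    = rep(2001:2005, times = 4)
)
alpha <- c(1, -2, 0.5, 3)                       # unobserved individual effects
panel$x <- rnorm(nrow(panel))
panel$y <- 2 * panel$x + rep(alpha, each = 5) + rnorm(nrow(panel), sd = 0.1)

# LSDV: ordinary least squares with one dummy variable per individual
lsdv <- lm(y ~ x + factor(country), data = panel)

# Within: subtract the individual means, then run OLS without intercept
panel$x_dm <- panel$x - ave(panel$x, panel$country)
panel$y_dm <- panel$y - ave(panel$y, panel$country)
within <- lm(y_dm ~ x_dm - 1, data = panel)

beta_lsdv   <- unname(coef(lsdv)["x"])
beta_within <- unname(coef(within)["x_dm"])
print(c(lsdv = beta_lsdv, within = beta_within))  # the two slopes coincide
```

With plm installed, the corresponding call would presumably look like plm(y ~ x, data = panel, index = c("country", "year")), which uses the default settings described above.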

Tuesday, October 20, 2015

Data integration with Knime and the R statistical software

I am testing the Knime software to create data pipelines. I started by installing the following extensions:
  •   KNIME Connectors for Common Databases    
  •   KNIME Interactive R Statistics Integration    

Database operations


I tried chaining the Database Row Filter node after the Database Selector node (containing an SQL statement of the form "select * from table"). But the query was taking ages because my source table is rather large. I replaced the SQL statement in the Database Row Filter node by a statement of the form "select * from table where code = 999". This time the query was much faster.
Unlike dplyr, which updates the SQL query (based on the group_by(), select() and filter() verbs) before executing one final SQL query, Knime seems to execute all SQL queries one after the other.

Interaction with the R statistical program


Then I pushed the data to R. The input data frame is called knime.in. One issue is that most character vectors are transformed into factors. This was causing various errors: max(year) returned an error, and various merge operations were failing. I had to tell R to change all those column types back to character or numeric.
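The conversion back from factors can be sketched as follows. This is only an illustration on a small stand-in data frame (knime_in_demo and its columns are made up here); inside a Knime R node the same steps would be applied to knime.in.

```r
# Stand-in for the data frame received from Knime, with factor columns
knime_in_demo <- data.frame(
  year    = factor(c("2001", "2003", "2002")),
  country = factor(c("FR", "DE", "FR"))
)

# max() is not defined for unordered factors, hence the error in the post
max_on_factor <- try(max(knime_in_demo$year), silent = TRUE)

# Convert every factor column back to character
is_fct <- vapply(knime_in_demo, is.factor, logical(1))
knime_in_demo[is_fct] <- lapply(knime_in_demo[is_fct], as.character)

# Columns holding numbers then need an explicit conversion to numeric
knime_in_demo$year <- as.numeric(knime_in_demo$year)
max_year <- max(knime_in_demo$year)
print(max_year)
```

Converting through character first matters: calling as.numeric() directly on a factor returns the internal level codes, not the original values.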

I wanted to use a filter before using a plot. But I needed to filter on 2 columns. I didn't know how to implement this in Knime. A Google search returned this forum. Rule based row filter seems to work.




In the workflow above, I used R View to display  a plot generated with ggplot.

Workflows are a nice way to display data integration steps and are probably easy to explain to others. Node configuration is rather straightforward, once you have found the right node in the repository. I haven't figured out yet how to use input forms and flow variables.

I don't know how easy it is to maintain functional workflows in the long term.

Monday, September 14, 2015

Programming a test harness

I would like to build a test harness around programs. Automated tests should increase my confidence in the reproducibility of their outcome.
"Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead." — Martin Fowler. Quoted here.

Where to store test data

While trying to find out where to place test data, this answer taught me to distinguish between unit tests, which are meant to test each function individually on small mock data, and integration tests, which would be based on a larger, real dataset.
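A unit test on small mock data could look like the sketch below. It uses base R's stopifnot() rather than testthat, so that it runs without extra packages, and compute_share() is a hypothetical function invented for the example.

```r
# Hypothetical function under test: share of each value in the total
compute_share <- function(x) {
  x / sum(x, na.rm = TRUE)
}

# Small mock data, written by hand so that the expected result is obvious
mock <- c(1, 1, 2)

# Unit tests: they stop with an error if the behaviour of the function changes
stopifnot(isTRUE(all.equal(compute_share(mock), c(0.25, 0.25, 0.5))))
stopifnot(isTRUE(all.equal(sum(compute_share(mock)), 1)))
```

In a testthat file the same checks would be written with expect_equal() inside a test_that() block; an integration test would instead run the whole program on a larger, real dataset.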

Testthat

In a commit called "Don't attach dplyr backends", Hadley Wickham removed direct function calls from loaded packages. Probably to ensure that packages are not loaded directly, he changed function calls to a form of packagename::function().

The author of the testthat R package wrote that autotest
"[...] promotes a workflow where the only way you test your code is through tests. Instead of modify-save-source-check you just modify and save, then watch the automated test output for problems."

Debian Continuous Integration

ci.debian.net

"How often are test suites executed?
The test suite for a source package will be executed:

  • when any package in the dependency chain of its binary packages changes;
  • when the package itself changes;
  • when 1 month is passed since the test suite was run for the last time."

Online Continuous Integration


Wednesday, June 17, 2015

Rstudio tips - Key bindings to program and explore data with R


Rstudio is an editor for the R statistical programming language which can be installed on Windows, Mac and Linux. See my post explaining R setup under Debian.

Edit code

  • TAB auto complete object names
  • F1 on a function name shows the help page of that function
  • F2 on a function name jumps to the code where that function was created. I found this key so useful that I decided to create this blog post.
  • CTRL+W  close a tab
  • CTRL+F find and replace text
  • CTRL+SHIFT+F find in all files in a directory (like grep), then click on results lines to jump in the files

Explore data

In the environment window, click on a data frame to view it then click on filter to filter the data frame according to various criteria.

Create pdf or html reports

When editing a markdown .Rmd document, the pdf or html report can be generated with CTRL+SHIFT+K.

Create a package

The R packages book by Hadley Wickham explains how to create R packages. Useful short-cuts when working with packages:
  • CTRL+SHIFT+B build the package
  • CTRL+SHIFT+D generate documentation
  • CTRL+SHIFT+T run devtools::test()
The documentation step can be set to run automatically with the package building under build / configure build tools / generate documentation with Roxygen / configure.

Vim mode

Vim mode can be activated under Tools / Global options / Code. Enter command mode with ":" and ask for ":help".  I use primarily the following keys:
  • jklhw$ggG navigate text
  • iaoA enter edit mode to insert text
  • Escape return to navigation mode
  • v select text
  • ypP copy selected text and paste
  • d delete 
  • /nN search 

Wednesday, April 29, 2015

How to display dplyr's SQL query

dplyr verbs can be chained to query a database without writing SQL queries. dplyr uses lazy evaluation, meaning that database queries are prepared and only executed when asked by a specific verb such as collect(). I was wondering if it is possible to display the SQL query generated by dplyr?

Indeed dplyr::explain() displays the SQL query generated by dplyr. I have copied a reproducible example below based on the dplyr database vignette.
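A minimal sketch of such an example is given here. It is not the original listing: it assumes that the dplyr, dbplyr, DBI and RSQLite packages are installed, and it copies the built-in cars dataset into an in-memory SQLite database.

```r
library(dplyr)

# In-memory SQLite database holding a copy of the built-in cars dataset
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
cars_db <- copy_to(con, cars, "cars")

# Nothing is sent to the database yet: the query is only being prepared
query <- cars_db %>%
  filter(speed > 20) %>%
  select(speed, dist)

show_query(query)  # displays the generated SQL
explain(query)     # displays the SQL and the database's query plan

# collect() executes the query and returns a local data frame
result <- collect(query)
DBI::dbDisconnect(con)
```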
 

Wednesday, April 08, 2015

Ipython notebook and R

I chose to use python 3. Several of the shell commands below have a "3" suffix in Debian testing as of April 2015: ipython3, pip3.

Install programs

I installed ipython-3-notebook (in Debian Jessie) from the synaptic package manager.

In order to install the R module, I installed pip for Python 3 in the synaptic package manager. pip is the Python package installer; it downloads modules from the Python Package Index (PyPI). Then I used pip3 to install rpy2
sudo pip3 install rpy2
There is a blog post on how to avoid using sudo to install pip modules.

Install statsmodels, a module for statistical modelling and econometrics in Python. Maybe I should have installed python-statsmodels as a Debian package instead? But it seems to be linked to Python 2.x instead of Python 3 (it had a dependency on python 2.7-dev). Therefore I installed statsmodels with pip3, using the --user flag mentioned above to install it as a user-only module.
pip3 install --user statsmodels
The installation took several minutes on my system. It seemed to be installing a number of dependencies. Many warnings about variables defined but not used were returned but the installation kept running. The final message was:
Successfully installed statsmodels numpy scipy pandas patsy python-dateutil pytz
Cleaning up...

Starting the Ipython notebook

Move to a directory where the notebooks will be stored, then start an IPython notebook server
cd python
ipython3 notebook

Shortcuts

See also the IPython Notebook shortcuts. Useful shortcuts are ESCAPE to go into navigation mode and ENTER to enter edit mode. It seems one can use the vim navigation keys j and k to move up and down cells. Pressing the "d" key twice deletes a cell. CTRL+ENTER runs the cell in place, SHIFT+ENTER runs the cell and jumps to the next one, and ALT+ENTER runs the cell and inserts a new cell below.

Run R commands in the Ipython notebook


Load an ipython extension that deals with R commands
%load_ext rpy2.ipython
 Display a standard R dataset
%R head(cars)
%R plot(cars)
Use data from the python statsmodels module based on this page.
import statsmodels.datasets as sd
data = sd.longley.load_pandas()
Print column names of the dataset
print(data.endog_name)
print(data.exog_name)
Print a dataset as an html table by simply giving its name in the cell. For example this data frame contains exogenous variables:
data.exog
Python can pass variables to R with the following command:
totemp = data.endog
gnp = data.exog['GNP']
%R -i totemp,gnp
Estimate a linear model with R
%%R
fit <- lm(totemp ~ gnp)  # Least-squares regression
print(fit$coefficients)  # Display the coefficients of the fit.
plot(gnp, totemp)  # Plot the data points.
abline(fit)  # And plot the linear regression.
Plot the datapoints and linear regression with the ggplot2 package
%%R
library(ggplot2)
ggplot(data = NULL, aes(x =gnp, y = totemp)) +
    geom_point() +
    geom_abline( aes(intercept=coef(fit)[1], slope=coef(fit)[2]))

Monday, March 02, 2015

Panel cross section dependence tests in STATA and R


STATA example 

Using the Grunfeld investment data:

        use "http://fmwww.bc.edu/ec-p/data/Greene2000/TBL15-1.dta"
        xtset firm year
        xtreg i f c,fe
        xtcsd, pesaran


Output of the xtcsd command only:
Pesaran's test of cross sectional independence =     1.098, Pr = 0.2722

R example

Using the same data: 
library(foreign) # To import STATA .dta files
library(plm) # Provides pcdtest()
grunfeld <- read.dta("http://fmwww.bc.edu/ec-p/data/Greene2000/TBL15-1.dta")

pcdtest(i ~ f + c, data=grunfeld, model = "within", effect = "individual", index = c("firm","year"))
Output of the pcdtest command:
    Pesaran CD test for cross-sectional dependence in panels

data:  formula
z = 1.0979, p-value = 0.2722
alternative hypothesis: cross-sectional dependence

Thursday, February 26, 2015

Installing STATA on Debian GNU-LINUX


I needed to install STATA to collaborate with a colleague at work. The computer guy gave me the software on a disk, with an installation guide. Here are the commands I entered following those instructions:

Create a directory for Stata
# mkdir /usr/local/stata13
# ln -s /usr/local/stata13/ /usr/local/stata
Install Stata
# cd /usr/local/stata13
# /media/paul/Stata/install
Stata 13 installation
---------------------

  1.  uncompressing files
  2.  extracting files
  3.  setting permissions

Done.  The next step is to run the license installer.  Type:

        ./stinit
If the licensed software is Stata/IC 13, you will be able to run Stata/IC by typing
        xstata              (Run windowed version of Stata/IC)
        stata               (Run console  version of Stata/IC)

Run the license installer
./stinit
There follow some questions about user name and affiliation. "The two lines, jointly, should not be longer than 67 characters."
Then comes the message:
Stata is initialized.
You should now, as superuser, verify that you can enter Stata by typing

        # ./stata
or
    # ./xstata

I added this to my .bashrc so that stata and xstata can be used as a command directly:
 export PATH=$PATH:/usr/local/stata

Both commands "stata" and "xstata" work as a normal user now.

There is an error message when running xstata:
'Failed to load module "canberra-gtk-module"'
But this was not a problem at the start.

GNOME application launcher


I added STATA to the GNOME application launcher, by typing "application" in the launcher, then "main menu", "new menu".

R to Stata

I use R most of the time for data analysis and will export csv files to STATA.
R command to export csv files:
write.csv(dtf, "filename.csv", row.names = FALSE, na = ".")
STATA command to import csv files:
insheet using "filename.csv", delimiter(",")
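To see what the na = "." argument does on the R side, here is a small sketch with a made-up data frame written to a temporary file. Stata uses "." for missing numeric values, which is presumably why the export writes NA that way.

```r
# Made-up data frame with one missing value
dtf <- data.frame(country = c("FR", "DE"), volume = c(12.5, NA))

# Export without row names, writing missing values as "."
csv_file <- tempfile(fileext = ".csv")
write.csv(dtf, csv_file, row.names = FALSE, na = ".")

# Inspect the raw text that STATA will read
csv_lines <- readLines(csv_file)
print(csv_lines)
```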


Monday, January 26, 2015

Including R data and plots in a Latex document with knitr


knitr default typesetting:
"The chunk option out.width is set to '\\maxwidth' by default if the output format is LaTeX."

Monday, January 19, 2015

Read part of an Excel sheet into an R data.frame

Documentation of the read.xlsx function: 
read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,  startRow=NULL, endRow=NULL, colIndex=NULL,  as.data.frame=TRUE, header=TRUE, colClasses=NA,  keepFormulas=FALSE, encoding="unknown", ...)
Returns a data.frame.
Example of use:
dtf <- read.xlsx(file = filename,
    sheetName = sheetname,
    rowIndex = 2:10, 
    colIndex = 5:20)

Friday, January 16, 2015

Building an R package

A package can be seen as a coherent group of functions available for future projects. Building your own package enables you to reuse and share your statistical procedures. Function parameters and examples can be documented with Roxygen to facilitate digging back into the code later on. I created my second package  based on instructions from Hadley. My package structure is composed of the following folders:
  • R/   contains R code
  • test/  contains tests
  • inst/  contains files that will be exported with the package
  • docs/  contains .Rmd documents illustrating code development steps and data analysis.
  • data/    contains data sets exported with the package
  • data-raw/    contains raw dataset and R code to extract raw data from disk and from web resources.

Code

Create a directory containing a package skeleton
devtools::create("packagename")
RStudio has a menu build / configure build tools where devtools package functions and document generation can be linked to keyboard shortcuts:
  • document CTRL + SHIFT + D 
  • build and reload CTRL + SHIFT + B
devtools::load_all() or Cmd + Shift + L, reloads all code in the package.
Add packages to the list of required packages
devtools::use_package("dplyr")
devtools::use_package("ggplot2", "suggests")

Data

For data I followed his recommendations in r-pkgs/data.rmd
devtools::use_data(mtcars)
devtools::use_data_raw() # Creates a data-raw/ folder and adds it to .Rbuildignore

Tests

Example of testing for the devtools package

Bash command to build and check a package

Bash command to build a package directory:
R CMD build packagename
Bash command to check a package tarball:

R CMD check packagename_version.tar.gz
 An error log (good luck for understanding it) is visible at:
packagename.Rcheck/00check.log
Generate the documentation and check for documentation specific errors
R CMD Rd2pdf tradeflows --no-clean
The --no-clean option keeps, among other files, a temporary LaTeX file which can be inspected under:
packagename.Rcheck/packagename-manual.tex

Alternatively the build and check procedure can be run in RStudio as explained above.

Tuesday, November 18, 2014

Data manipulation with dplyr

Dplyr is a package for data manipulation developed by Hadley Wickham and Romain Francois for the R statistical software.

  • Introduction to dplyr
  • A Tutorial from João Neto (dplyr.Rmd) gives examples of tools for grouped operations: 
    • n(): number of observations in the current group
    • n_distinct(x): count the number of unique values in x.
    • first(x), last(x) and nth(x, n) - these work similarly to x[1], x[length(x)], and x[n] but give you more control of the result if the value isn’t present.
    • min(), max(), mean(), sum(), sd(), median(), and IQR()

Non standard evaluation

dplyr uses non standard evaluation. To use standard evaluation a work around has to be found. See Stackoverflow question.

Thursday, October 30, 2014

R, packages and Rstudio install on Debian wheezy


See also my previous post on Debian GNU-Linux installation on a Lenovo T400.

R install

I used the Synaptic package manager to add the R repository for Debian from a nearby mirror, under : settings / repositories / other software / add.
Add this APT line:
deb http://cran.univ-paris1.fr/bin/linux/debian/ wheezy-cran3/

There was an error:
W: GPG error: http://cran.univ-paris1.fr wheezy-cran3/ Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 06F90DE5381BA480
After looking at several forums, and this stackoverflow question, I installed debian-keyring and added the key with the commands:
gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
gpg -a --export 06F90DE5381BA480 |sudo apt-key add -
I could then install R version 3 from the synaptic package manager.

Rstudio

I downloaded R-studio and installed it. There was a missing dependency for libjpeg62. I installed that package from Synaptic. Then ran the dpkg command to install rstudio.
dpkg -i rstudio-0.98.507-i386.deb

Tools

Then I installed Git in order to clone my R project from an online repository.
git clone  project_repository_url

Packages

Within Rstudio, I installed a few packages:
install.packages(c("plyr", "reshape2", "ggplot2"))
install.packages(c("xtable", "markdown", "devtools"))

devtools

The devtools package requires a libcurl dev Debian package. You can install it at the shell prompt:
$ sudo apt-get install libcurl4-gnutls-dev
Back at the R prompt
install.packages("devtools")
Other dependencies might be needed, the RStudio page on devtools recommends installing the Debian package r-base-dev.

dplyr

The dplyr package required the latest version of the Rcpp package, which was not available on my CRAN mirror. I installed it from source (based on this message):
install.packages("Rcpp", type = "source")
install.packages("dplyr")

xlsx

The xlsx package installation complained:
configure: error: Cannot compile a simple JNI program. See config.log for details.
Make sure you have Java Development Kit installed and correctly registered in R.
If in doubt, re-run "R CMD javareconf" as root.


The xlsx package required the latest version of Java 7 (inspired by this post). I installed openjdk-7 from the synaptic package manager. Then ran

update-alternatives --config java  # Choose java 7 as the default
R CMD javareconf
Then
install.packages("xlsx") # worked

RMySQL

MySQL client and server are installed on my system.
While installing RMySQL, I struggled with a configuration error:
  could not find the MySQL installation include and/or library
  directories.  Manually specify the location of the MySQL
  libraries and the header files and re-run R CMD INSTALL.
This post has an answer (thanks!):
sudo apt-get install libdbd-mysql  libmysqlclient-dev
That fixes the issue!
I can connect to the database
library(RMySQL)
mychannel <- dbConnect(MySQL(), user="paul", host="localhost",
                       password="***", dbname="dbname")

R packages which are better installed from the Debian package manager

Some packages, such as ‘minqa’, ‘SparseM’ and ‘car’ return an error when one tries to install them from the R prompt. They can only be installed from the Debian package manager, where they have names starting with "r-cran": "r-cran-car", "r-cran-sparsem", "r-cran-minqa".

Ready to work!


Wednesday, April 23, 2014

R commands

See also why use R and the RSS feed of posts labelled R.

R code in this post is garbage; this shows the limits of Blogger for displaying R code which contains the assignment operator <-. One solution is to paste html from knitr documents (http://www.r-bloggers.com/three-ways-to-format-r-code-for-blogger/).


R mailing list: Use <- for assignment.

Set operations


x = letters[1:3]
y = letters[3:5]
union(x, y)
## [1] "a" "b" "c" "d" "e"
intersect(x, y)
## [1] "c"
setdiff(x, y)
## [1] "a" "b"
setdiff(y, x)
## [1] "d" "e"
setequal(x, y)
## [1] FALSE

Information about your R system

sessionInfo()
installed.packages()

Handling files

getwd()
list.files(tempdir()) 
dir.create("blabla")
read.csv("data.csv")

Lists

Given a list structure x, unlist simplifies it to produce a vector which contains all the atomic components which occur in x.
l1 <- list(a="a", b=2, c=pi+2i)
unlist(l1) # a character vector
x <- 1

S3 methods

x<-1



List all available methods for a class:

methods(class="lm")

 One liners

Remove all objects in the workspace except one :

rm(list=ls()[!ls()=="object_to_keep"])

knitr

Those 2 commands are different.
Sets the options for chunk, within a knitr chunk inside the .Rmd document

opts_chunk$set(fig.width=10)
 Sets the options for knitr outside the .Rmd document

opts_knit$set()

dplyr

pipes
cars %>%
  group_by(speed) %>%
  print %>%
  summarise(numberofcars = n(),
            min = min(dist),
            mean = mean(dist),
            max = max(dist))

group_by() creates a tbl_df object, which is a wrapper around a data.frame to enable some functionalities. Note that print() returns its input tbl_df invisibly, so print() can be used inside the pipe without stopping the workflow.


 plyr (I replaced it with dplyr)

progress bar

l_ply(1:100000, identity, .progress = "win")
Rename items in a dataframe with revalue

sawnwood$item <- revalue(sawnwood$item,
    c("Sawnwood (C)" = "Sawnwood Coniferous",
   "Sawnwood (NC)" = "Sawnwood Non Coniferous"))
Rename column names by their names



rename(mtcars, c("disp" = "displacement"))

Plotting with ggplot2


Friday, April 11, 2014

Creating PDF reports with R on Ubuntu

Texi2pdf

Texi2pdf is a function from the tools package that compiles LaTeX files into PDFs.

Using the R command:
> texi2pdf("docs/rapports/draft/template2.tex")
Prompted the error:
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  :
  Running 'texi2dvi' on 'docs/rapports/draft/template2.tex' failed.
Messages:
sh: 1: /usr/bin/texi2dvi: not found

Installing texlive and texinfo fixed this error.
sudo apt-get install texinfo
sudo apt-get install texlive
For info the source of the texi2dvi bash script was mentioned by this blogger.

Accents

There was an issue with accents not rendered.
Loading this package fixes it:
 \usepackage[utf8]{inputenc}

devtools

 opts_knit$get() showed me options that don't exist any-more in the current version of the knitr package. I wanted to install the latest version of knitr.
I needed the package devtools.
But I couldn't install devtools because of this message
Cannot find curl-config
As explained in this mail, installing the package  "libcurl4-gnutls-dev" fixes this. I could then install the package devtools and load it.

To install the latest version of knitr:
library(devtools)
install_github("yihui/knitr")

xtable

The xtable gallery explains how to do longtable and tables in landscape format. It also demonstrates how to rotate column names and how to print a table of linear model coefficients.

Thursday, March 13, 2014

Regular Expression


Rstudio REGEX
I wanted to replace the # characters at the end of a line, so that those lines don't appear in the code navigator. $ indicates the end of a line in a regular expression.
Replaced #######$ by ####### # .
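The same replacement can be tried outside the RStudio search dialog with base R's sub(). This is only a sketch on two made-up lines of code, to show that the $ anchor restricts the match to the end of the line.

```r
code_lines <- c("## Load data #######",
                "x <- 1 ####### not at the end")

# Only a run of seven # located at the end of a line is replaced
fixed_lines <- sub("#######$", "####### #", code_lines)
print(fixed_lines)
```

The first line becomes "## Load data ####### #" while the second one is left untouched, because its # characters are not at the end of the line.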


Friday, January 24, 2014

Add a table of content to HTML files generated from R Markdown

Update November 2014

With the new version of knitr and Rmarkdown, the custom function is not necessary anymore. One can add a YAML header at the beginning of an Rmd file:
---
title: "Development scrap concerning the input data"
output:
  html_document:
    toc: true
---



Old content from January 2014

Knitr creator Yihui explained in a comment on this forum how to add a table of content to a Rmd file using the knit2html() function:
library(knitr)
knit2html('docs/clean.example.Rmd', options = c('toc', markdown::markdownHTMLOptions(TRUE)))
I followed the RSTUDIO advice on how to customize markdown rendering.
A .Rprofile at the root of my project directory with the following content does the trick:
options(rstudio.markdownToHTML =
  function(inputFile, outputFile) {     
    require(markdown)
    htmlOptions <- markdownHTMLOptions(defaults=TRUE)
    htmlOptions <- c(htmlOptions, "toc")
    markdownToHTML(inputFile, outputFile, options = htmlOptions)
  }
)
It works: I can now use the RStudio button "knit html" or the shortcut CTRL+SHIFT+H and get an html file that includes a table of contents!

Tuesday, January 21, 2014

R commands

A list of commonly used R commands.

Remove all objects from the workspace:
rm(list=ls())

Yihui Xie wrote that "setwd() is bad, dirty, ugly." Use relative paths instead.
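A relative path can be built with file.path(), as in the sketch below. The directory and file names are made up, and the path is understood relative to the project root, where R is assumed to be started.

```r
# Build a path relative to the project root instead of calling setwd()
data_path <- file.path("data", "input.csv")
print(data_path)

# The file would then be read with read.csv(data_path)
```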

 

Testthat library

Run all tests in a directory:
test_dir("tests")

Wednesday, January 15, 2014

Install R Packages

I recently participated in a training on the use of R to extract data from permanent forestry plots. This is how to install the R packages used in that training:
install.packages(c("doBy", "reshape2", "ggplot2", "GISTools", "lattice", "gstat", "knitr", "raster", "xtable", "rgdal"))
Somehow it didn't install all dependencies for ggplot2, I needed to run:
install.packages('ggplot2', dep = TRUE)

Wednesday, December 11, 2013

Data Visualisation Tools

A list of Data visualisation tools I've tried.

Desktop tools
  • R with ggplot2 package
  • Excel with pivot tables and charts
Web tools:
Sample websites:
Data storage tools: