Using the `scan()` function in R to read weirdly formatted data

I am writing some code to parse a weird data format, using scan() to suck in everything first. Basically, it’s CSV-style lines, but some lines have a different number of fields and serve different purposes: imagine CTD data interspersed with system messages, where the type of each line is identified by its very first field. Something like:

GPS,20150727T120000,-10.1,12.2
MESSAGE,20150727T120005,Begin descent
CTD,20150727120100,1,25,35
CTD,20150727120200,10,20,34
CTD,20150727120400,100,10,33
MESSAGE,20150727T121000,Begin ascent
CTD,20150727121500,100,10,33
CTD,20150727121600,90,12,33.5
etc ...

Anyway, when I was just reading in the CTD lines, everything was fine, but when I started trying to parse the MESSAGE lines, I found that scan() was doing something unexpected with the spaces in the message field, producing a character vector like:

"GPS,20150727T120000,-10.1,12.2"
"MESSAGE,20150727T120005,Begin"
"descent"
"CTD,20150727120100,1,25,35"
...

Basically, scan() was treating the space between “Begin” and “descent” as a delimiter (as well as the newlines), because by default it splits on any whitespace.

After much attempting to interpret the man page and trying different things, I discovered that

scan(con, character(), sep='\n')

would read each line whole, as a single element of a character vector, which is what I wanted.
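As a minimal sketch of how the rest of the parsing might go once each line arrives intact: the sample lines are taken from the example above, while the variable names and the choice to pull out fields 3 to 5 of the CTD lines are illustrative assumptions, not the original code.

```r
# Sample lines from the post, supplied via text= so the example is self-contained
raw <- c("GPS,20150727T120000,-10.1,12.2",
         "MESSAGE,20150727T120005,Begin descent",
         "CTD,20150727120100,1,25,35",
         "CTD,20150727120200,10,20,34")

# sep = "\n" makes the newline the only delimiter, so spaces stay inside a line
lines <- scan(text = raw, what = character(), sep = "\n", quiet = TRUE)

# Split each line on commas and use the first field to identify the line type
fields <- strsplit(lines, ",", fixed = TRUE)
type <- vapply(fields, function(f) f[1], character(1))

# Collect the numeric columns of the CTD lines into a matrix (one row per line)
ctd <- do.call(rbind, lapply(fields[type == "CTD"], function(f) as.numeric(f[3:5])))

# Keep the free-text part of each MESSAGE line
messages <- vapply(fields[type == "MESSAGE"], function(f) f[3], character(1))
```

If all you need is one string per line, readLines(con) is another way to get the same character vector.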

Forking and syncing branches with git and github

When forking a repo on GitHub, it was not entirely clear to me how to sync branches other than master (e.g. to make a pull request). The following eventually seemed to work:

Set upstream remotes

First, you need to make sure that your fork is set up to track the original repo as upstream (from here):

List the current remotes:

$ git remote -v
# origin  https://github.com/YOUR_USERNAME/YOUR_FORK.git (fetch)
# origin  https://github.com/YOUR_USERNAME/YOUR_FORK.git (push)

Specify a new remote upstream repository that will be synced with the fork.

$ git remote add upstream https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git

Verify the new upstream repository you’ve specified for your fork.

$ git remote -v
# origin    https://github.com/YOUR_USERNAME/YOUR_FORK.git (fetch)
# origin    https://github.com/YOUR_USERNAME/YOUR_FORK.git (push)
# upstream  https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git (fetch)
# upstream  https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git (push)

Syncing a fork

Now you’re ready to sync changes! See here for more details on syncing a “main” branch:

Fetch the branches:

$ git fetch upstream
# remote: Counting objects: 75, done.
# remote: Compressing objects: 100% (53/53), done.
# remote: Total 62 (delta 27), reused 44 (delta 9)
# Unpacking objects: 100% (62/62), done.
# From https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY
#  * [new branch]      master     -> upstream/master

Check out your fork’s local master branch.

$ git checkout master
# Switched to branch 'master'

Merge the changes from upstream/master into your local master branch. This brings your fork’s master branch into sync with the upstream repository, without losing your local changes.

$ git merge upstream/master
# Updating a422352..5fdff0f
# Fast-forward
#  README                    |    9 -------
#  README.md                 |    7 ++++++
#  2 files changed, 7 insertions(+), 9 deletions(-)
#  delete mode 100644 README
#  create mode 100644 README.md

Syncing an upstream branch

To sync upstream changes from a different branch, do the following (from here):

git fetch upstream                                       # make sure you have all the upstream changes
git checkout -b newbranch --no-track upstream/newbranch  # create a local copy of the new branch but don't track upstream
git push -u origin newbranch                             # push the branch to your fork and track origin/newbranch

Converting Latex to Markdown

I’m applying for a job, which requires me to submit a plain text version of my resumé. As I maintain my CV as a LaTeX document, I wanted to find a simple way to convert it to Markdown so that it will look good when copied and pasted into a web form.

I use pandoc all the time for document conversion, but I found that because of some heavy layout tweaks to make my CV look good (I’m not using a style file), the Markdown produced using

pandoc cv.tex -o cv.md

is pretty gross.

After a bit of googling, I found out about the htlatex utility (found here, and included with TeX Live), which does a fantastic job of converting LaTeX to HTML:

htlatex cv.tex "xhtml, mathml, charset=utf-8" " -cunihtf -utf8"

Then, use pandoc to convert from HTML to Markdown with:

pandoc cv.html -o cv.md

This leaves a few small things to clean up with further scripting (such as stray /s), but altogether it gives a nice-looking Markdown file.

Switching from Matlab to R: Part 1

Introduction

I was thinking recently about how best to help someone transitioning
from Matlab(TM) to R, and did my best to recall what sorts of things I
struggled with when I made the switch. Though I resisted for quite a
while, when I finally committed to making the change I recall that it
mostly happened in a matter of weeks. It helped that my thesis
supervisor exclusively used R, and we were working on code for a paper
together at the time, but in the end I found that the switch was
easier than I had anticipated.

Tips

  1. Don’t be afraid of the assignment operator <-. It means exactly the
    same thing as = does in matlab, as in
a <- 1:10 # in matlab a=1:10;

except that it makes more logical sense.

The only places you should use = are in equality comparisons like a == b
(as in matlab), or for specifying argument values in a function call
(see number 5).

  2. Vectors are truly 1 dimensional. This is different from matlab, where
    an Nx1 and a 1xN vector are different shapes and adding them does
    not give the element-wise sum. In R they would be just two vectors
    of length N, which add element by element. The transpose in R is
    done with t(); applied to a plain vector (class numeric) it returns
    a 1xN matrix, but you never need it just to make two vectors line up.

  3. Array indices use square brackets, like

a[1:5] <- 2 # assign the value 2 to the first 5 indices of a

This is one of the things that drove me crazy about matlab, that it
used () for indices as well as function arguments. It makes mixed
array indexing and function calls very confusing to look at and
interpret.
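A short sketch of the last two points, runnable at the R prompt (the variable names are arbitrary):

```r
a <- 1:10
b <- 10:1

length(a)     # 10
dim(a)        # NULL: a plain vector has no rows or columns

s <- a + b    # two length-10 vectors add element by element

a[1:5] <- 2L  # square brackets index; parentheses are reserved for calls
sum(a[1:5])   # 10
```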

  4. By default arithmetic operations are done element-wise. If you have
    two MxN matrices (say A and B), and you do C <- A*B, every
    element in C is the product of the corresponding elements in A and
    B. No need to do the .* stuff as in matlab. To get matrix
    multiplication, you use the %*% operator.
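To see the difference between the two operators, here is a small example with two 2x2 matrices (the values are arbitrary):

```r
A <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
B <- matrix(5:8, nrow = 2)   # rows are (5, 7) and (6, 8)

C_elem <- A * B     # element-wise product, like A .* B in matlab
C_mat  <- A %*% B   # matrix product, like A * B in matlab

C_elem
C_mat
```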

  5. Function arguments are named, so the order isn’t super
    important. If you don’t name them, then you have to give them in
    the order they appear (do ?function to see the help page). For
    example if a function took arguments like:

foo <- function(a, b, c, type, bar) {
# function code here
}

You could call it with something like:

junk <- foo(1, 2, bar = "whatever")

where a and b are given the values of 1 and 2, and c and type
are left unspecified. This would be equivalent:

junk <- foo(a = 1, b = 2, bar = "whatever")

You could also do:

junk <- foo(bar = "whatever", a = 1, b = 2)
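To try these calls yourself, foo() needs a body. The one below is purely hypothetical: it just reports what it received, using missing() to show that c and type were never supplied.

```r
# Hypothetical body for foo(), only here to make the three calls runnable
foo <- function(a, b, c, type, bar) {
  # missing() reports whether the caller supplied an argument
  list(a = a, b = b, bar = bar,
       c_missing = missing(c), type_missing = missing(type))
}

r1 <- foo(1, 2, bar = "whatever")
r2 <- foo(a = 1, b = 2, bar = "whatever")
r3 <- foo(bar = "whatever", a = 1, b = 2)

identical(r1, r2) && identical(r2, r3)   # TRUE: all three calls are equivalent
r1$c_missing                             # TRUE
```

Leaving c and type out only works because this body never uses their values; R raises an error as soon as a missing argument without a default is evaluated.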

  6. No semicolons needed (except where you’d like to have more than one
    operation per line, like a <- 1; b <- 2).

  7. In R, the equivalent to a matlab structure is called a
    “list”. Instead of separating the levels with a ., it is
    generally done with a $. So pulling a value out of a nested list
    could look something like:

a <- junk$stuff$whatever

Use the str() command to look at the structure of a list object.
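Here is a small made-up list to experiment with (the names junk, stuff and whatever just follow the example above):

```r
# Build a nested list, much like a nested matlab struct
junk <- list(stuff = list(whatever = 1:3, other = "text"))

a <- junk$stuff$whatever       # drill down one level at a time with $
junk$stuff$new <- TRUE         # new fields can be added on the fly
junk[["stuff"]][["other"]]     # [[ ]] with a name does the same job as $

str(junk)                      # compact overview of the whole structure
```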

  8. Most functions that return more than just a single value will
    return them in a list. Unlike matlab, there isn’t a simple way of
    returning separate values to separate variables, like
    [a, b] = foo('bar'). For example, using the histogram function:

a <- rnorm(1000)
h <- hist(a)

str(h)
## List of 6
## $ breaks : num [1:16] -4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 ...
## $ counts : int [1:15] 1 1 3 24 47 80 147 186 206 134 ...
## $ density : num [1:15] 0.002 0.002 0.006 0.048 0.094 0.16 0.294 0.372 0.412 0.268 ...
## $ mids : num [1:15] -3.75 -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 ...
## $ xname : chr "a"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"

If I wanted to extract something from that I could use

b <- h$breaks

If you really only want one thing out of the list, you could do
something like

b <- hist(a, plot = FALSE)$breaks
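The same pattern works for your own functions. A minimal sketch, where minmax() is a hypothetical function invented for this example:

```r
# Hypothetical function that returns two results bundled in one list
minmax <- function(x) {
  list(lo = min(x), hi = max(x))
}

res <- minmax(c(3, 1, 4, 1, 5))
lo <- res$lo   # 1
hi <- res$hi   # 5
```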

  9. You can use dots (.) in variable and function names, but I don’t
    recommend it. Often a function with a . in its name applies a
    “generic” operation to a specific class. For example, the
    plot() function is a straightforward way of plotting data, much
    like in matlab. However, there exist lots of variants of plot for
    different classes, which are usually named
    plot.class(). E.g. for the histogram object I created above, if I
    want to plot it, I can just do

h2 <- hist(a, plot = FALSE, breaks = 100)
plot(h2, main = "A plot with more breaks")

and it will plot it as a histogram: the generic plot() hands the object
to the class-specific function plot.histogram(), which also accepts the
arguments appropriate to histograms.
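You can watch the same mechanism at work with a class of your own. In this sketch the class name "cast" and its print method are made up for illustration:

```r
# A method is just a function named generic.class
print.cast <- function(x, ...) {
  cat("cast with", length(x$depth), "samples\n")
  invisible(x)
}

# Tag a plain list with the (hypothetical) class "cast"
obj <- structure(list(depth = c(1, 10, 100)), class = "cast")

print(obj)   # the generic print() picks print.cast() because of the class
```

This is also why dots in ordinary function names can be confusing: a name like print.cast looks like a method whether or not you meant it to be one.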

Thoughts on topics for future editions of matlab2R

  • plotting, including:
      • points, lines, styles, etc
      • “image”-style plots, contours, filled contours, colormaps, etc
  • POSIX times vs Matlab datenum
  • … suggestions in comments?