Using the `scan()` function in R to read weirdly formatted data

I am writing some code to parse a weird data format, and using scan() to suck in everything first. Basically, it’s csv-style lines, but some lines have a different number of fields and are for different things: imagine CTD data interspersed with system messages, where each line is identified by its very first field. Something like:

MESSAGE, 20150727T120005,Begin descent
MESSAGE,20150727T121000,Begin ascent
etc ...

Anyway, when I was just reading in the CTD fields, everything was fine, but when I started trying to parse the MESSAGE fields, I found that scan() was doing something unexpected with the spaces in the message field, producing a character vector like:

"MESSAGE, 20150727T120005,Begin" 

Basically, scan() was treating the space between “Begin” and “descent” as a delimiter (as well as the carriage returns).

Anyway, after many attempts at interpreting the man page, and trying different things, I discovered that

scan(con, character(), sep='\n')

would suck in the entire line as a character vector, which is what I wanted.
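Once the whole line is in hand, strsplit() can pick up where scan() left off and split on commas without mangling spaces. A minimal self-contained sketch (a textConnection stands in for the real file here):

```r
# Read whole lines by using only newline as the separator, then split
# on commas ourselves, so spaces inside fields survive intact
con <- textConnection("MESSAGE, 20150727T120005,Begin descent
MESSAGE,20150727T121000,Begin ascent")
lines <- scan(con, character(), sep = '\n')
close(con)
fields <- strsplit(lines, ',')  # a list with one character vector per line
fields[[1]][3]                  # "Begin descent", space and all
```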


Using the R “apply” family with oce objects


In the oce package, the various different data formats are stored in consistently structured objects. In this post, I’ll explore a way to access elements of multiple oce objects using R’s lapply(), from the apply family of functions.

Example with a ctd object

The objects always contain three fields (or “slots”): metadata, data, and processingLog. The layout of the object can be visualized using the str() command, like:

str(ctd)

which produces something like:

Formal class 'ctd' [package "oce"] with 3 slots
  ..@ metadata     :List of 26
  .. ..$ header                  : chr [1:42] "* Sea-Bird SBE 25 Data File:"
  .. ..$ type                    : chr "SBE"
  .. ..$ conductivityUnit        : chr "ratio"
  .. ..$ temperatureUnit         : chr "IPTS-68"
  .. ..$ systemUploadTime        : POSIXct[1:1], format: "2003-10-15 11:38:38"
  .. ..$ station                 : chr "Stn 2"
  .. ..$ date                    : POSIXct[1:1], format: "2003-10-15 11:38:38"
  .. ..$ startTime               : POSIXct[1:1], format: "2003-10-15 11:38:38"
  .. ..$ latitude                : num 44.7
  .. ..$ longitude               : num -63.6
  ..@ data         :List of 9
  .. ..$ scan         : int [1:181] 130 131 132 133 134 135 136 137 138 139 ...
  .. ..$ time         : num [1:181] 129 130 131 132 133 134 135 136 137 138 ...
  .. ..$ pressure     : num [1:181] 1.48 1.67 2.05 2.24 2.62 ...
  .. ..$ depth        : num [1:181] 1.47 1.66 2.04 2.23 2.6 ...
  .. ..$ temperature  : num [1:181] 14.2 14.2 14.2 14.2 14.2 ...
  .. ..$ salinity     : num [1:181] 29.9 29.9 29.9 29.9 29.9 ...
  .. ..$ temperature68: num [1:181] 14.2 14.2 14.2 14.2 14.2 ...
  ..@ processingLog:List of 2
  .. ..$ time : POSIXct[1:5], format: "2015-08-18 19:22:36" "2015-08-18 19:22:36" ...
  .. ..$ value: chr [1:5] "create 'ctd' object" "ctdAddColumn(x = res, column = swSigmaTheta(res@data$salinity,     res@data$temperature, res@data$pressure), name = "sigmaThet"| __truncated__ "read.ctd.sbe(file = file, processingLog = processingLog)" "converted temperature from IPTS-69 to ITS-90" ...

(where I’ve trimmed a few lines out just to make it shorter).

For a single object, there are several ways to access the information contained in the object. The first (and generally recommended) way is to use the [[ accessor — for example if you wanted the temperature values from a ctd object you would do

T <- ctd[['temperature']]

Another way is to access the element directly, by using the slot and list syntax, like:

T <- ctd@data$temperature

The disadvantage to the latter is that it requires knowledge of exactly where the desired field is in the object structure, and is brittle to downstream changes in the oce source.

Working with multiple objects

Especially with CTD data, it is common to have to work with a number of individual ctd objects, usually representing different casts. One way of organizing such objects, particularly if they share a common instrument, ship, or experiment, is to collect them into a list.

For example, we could loop through a directory of individual cast files (or extract multiple casts from one file using ctdFindProfiles()), and append each one to a list like:

files <- dir(pattern='\\.cnv$')
casts <- list()
for (ifile in seq_along(files)) {
    casts[[ifile]] <- read.oce(files[ifile])
}

If we summarize the new casts list, we can see that it’s filled with ctd objects:

str(casts, 1) # the "1" means just go one level deep
List of 5
 $ :Formal class 'ctd' [package "oce"] with 3 slots
 $ :Formal class 'ctd' [package "oce"] with 3 slots
 $ :Formal class 'ctd' [package "oce"] with 3 slots
 $ :Formal class 'ctd' [package "oce"] with 3 slots
 $ :Formal class 'ctd' [package "oce"] with 3 slots

Extracting fields from multiple objects at once

Say we want to extract all the temperature measurements from each object in our new list. How could we do it?

The brute force approach would be to loop through the list elements, and append the temperature field to a vector, maybe something like:

T_all <- NULL
for (i in seq_along(casts)) {
    T_all <- c(T_all, casts[[i]][['temperature']])
}

But in R, there’s a more elegant way — lapply()!

T_all <- unlist(lapply(casts, function(x) x[['temperature']]))
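The same idiom works for any per-object quantity. Here is a self-contained illustration, with plain lists standing in for ctd objects (the [[ indexing works the same way), plus sapply() for the case where each cast yields a single value:

```r
# Two fake "casts", each just a list carrying a temperature vector,
# standing in for real ctd objects
casts <- list(list(temperature = c(14.2, 14.1)),
              list(temperature = c(13.9, 13.8, 13.7)))

# lapply() returns a list of vectors; unlist() flattens them into one
T_all <- unlist(lapply(casts, function(x) x[['temperature']]))

# sapply() simplifies to a vector when each cast yields one value,
# e.g. the maximum temperature per cast
Tmax <- sapply(casts, function(x) max(x[['temperature']]))
```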

Forking and syncing branches with git and github

When forking a repository on GitHub, it was not entirely clear to me how to sync branches other than master (e.g. to make a pull request). The following eventually seemed to work:

Set upstream remotes

First, you need to make sure that your fork is set up to track the original repo as upstream (from here):

List the current remotes:

$ git remote -v
# origin (fetch)
# origin (push)

Specify a new remote upstream repository that will be synced with the fork.

$ git remote add upstream

Verify the new upstream repository you’ve specified for your fork.

$ git remote -v
# origin (fetch)
# origin (push)
# upstream (fetch)
# upstream (push)

Syncing a fork

Now you’re ready to sync changes! See here for more details on syncing a “main” branch:

Fetch the branches:

$ git fetch upstream
# remote: Counting objects: 75, done.
# remote: Compressing objects: 100% (53/53), done.
# remote: Total 62 (delta 27), reused 44 (delta 9)
# Unpacking objects: 100% (62/62), done.
# From
#  * [new branch]      master     -> upstream/master

Check out your fork’s local master branch.

$ git checkout master
# Switched to branch 'master'

Merge the changes from upstream/master into your local master branch. This brings your fork’s master branch into sync with the upstream repository, without losing your local changes.

$ git merge upstream/master
# Updating a422352..5fdff0f
# Fast-forward
#  README                    |    9 -------
#                 |    7 ++++++
#  2 files changed, 7 insertions(+), 9 deletions(-)
#  delete mode 100644 README
#  create mode 100644

Syncing an upstream branch

To sync upstream changes from a different branch, do the following (from here):

git fetch upstream                            # make sure you have all the upstream changes
git checkout --no-track upstream/newbranch    # grab the new branch but don't track it
git branch --set-upstream-to=origin/newbranch # set the upstream repository to your origin
git push                                      # push your new branch up to origin

Turning off Auctex fontification so that columns can align

I love Emacs. I use it for everything, and particularly love it for doing tables in LaTeX because I can easily align everything so that it looks sensible, and rectangle mode makes it easy to move columns around if desired.

That being said, Auctex by default applies fontification to math-mode super- and subscripts, which causes the horizontal alignment of characters to be off (essentially it is no longer a fixed-width font). To turn this off, do:

M-x customize-variable font-latex-fontify-script

and set the value to nil.


Colormap tests


Dan Kelley’s oce package now has a branch testing some new functions for creating “colormaps”: the design being that there is a way to map levels (say topographic height, or velocity, etc) to a specific set of colors. Development work on this has been ongoing in the colorize branch of the oce repo on Github. See Dan’s blog post for more information.

Many of the standard plotting commands that oce uses (such as imagep() and drawPalette()) already mostly take advantage of the idea of a colormap, but recent use cases showed that there was much room for improvement. In particular, the connection between a chosen color scheme and the range of values it represents was previously up to the user to get right. This was most commonly done with the rescale() function, which turns out not to be an ideal solution when the number of color levels is small.


Create a colormap for use in an imagep() plot of the adp dataset:

library(oce)  # I have built this from the `colorize` branch commit 365d7700f5be33e5
t <- adp[["time"]]
z <- adp[["distance"]]
p <- adp[["pressure"]]
u <- adp[["v"]][, , 1]
par(mar = c(3, 3, 1, 1))
pcol <- Colormap(p)
plot(t, p, bg = pcol$zcol, pch = 21)

(figure: plot of p vs t, points colored by the pressure colormap)

## now for an imagep
ucol <- Colormap(u, col = oceColors9B)
imagep(t, z, u, colormap = ucol, filledContour = TRUE)

(figure: imagep() of u using the ucol colormap)

Converting Latex to Markdown

I’m applying for a job which requires me to submit a plain-text version of my resumé. As I maintain my CV as a LaTeX document, I wanted to find a simple way to convert it to Markdown format so that it will look good when cut and pasted into the web browser.

I use pandoc all the time for document conversion, but I found that because of some heavy layout tweaks to make my CV look good (I’m not using a style file), the markdown produced using

pandoc cv.tex -o

is pretty gross.

After a bit of googling, I found out about the htlatex utility (included with TeX Live), which does a fantastic job of converting LaTeX to HTML:

htlatex cv.tex "xhtml, mathml, charset=utf-8" " -cunihtf -utf8"

Then, use pandoc to convert from HTML to Markdown with:

pandoc cv.html -o

This leaves a few small things to clean up with further scripting (such as stray /s), but altogether it produces a nice-looking Markdown file.

Anti-aliasing and “image” plots


Frequently I make plots using the oce1 function imagep(), which at its core uses the R-base function image(). R has several different graphics devices to choose from, and as each of them has a different scheme for tasks such as anti-aliasing, they can produce different results depending on the type of plot being created and the type of file it gets written to. This is especially apparent when using the filledContour type of plot. Frequently, I find that the default devices for making such plots in R produce undesirable artifacts, such as white lines in an image plot. The example below illustrates this effect using the adp data set:

imagep(adp[["v"]][, , 1], filledContour = TRUE)

(figure: filled-contour imagep() plot showing white-line artifacts)

In this post I’ll explore some options for making plots without such artifacts.

PDF devices

It is common for anti-alias effects like the white lines shown above to show up in figures created using the pdf() device. As PDF is essentially a vector graphics format, there is nothing to be done in R to correct the problem. Typically the anti-aliasing is handled by the PDF viewer, and is therefore not native to the file. It is often possible to disable anti-aliasing in many of the most popular viewers (e.g. I use Skim and Preview on OSX), but this has the unfortunate side effect of removing anti-aliasing from all aspects of the figure, including the fonts and axes labels, etc.

For this reason, when producing image plots, I almost always default to using a PNG device instead of a PDF. PNG works perfectly well with pdflatex, and has no artifacts due to image compression (such as in JPGs). The only issue remaining is how to ensure that the image plot itself does not suffer from anti-aliasing effects, while retaining the smoothing of fonts, lines, and points to make a beautiful plot.

PNG devices

For PNG devices, there are several options for the “type” of device, each of which will produce slightly different output. From the help page for png(), the arguments are:

png(filename = "Rplot%03d.png",
    width = 480, height = 480, units = "px", pointsize = 12,
    bg = "white", res = NA, ...,
    type = c("cairo", "cairo-png", "Xlib", "quartz"), antialias)

where the type argument is described as:

type: character string, one of ‘"Xlib"’ or ‘"quartz"’ (some OS X
builds) or ‘"cairo"’. The latter will only be available if
the system was compiled with support for cairo - otherwise
‘"Xlib"’ will be used. The default is set by
‘getOption("bitmapType")’ - the ‘out of the box’ default is
‘"quartz"’ or ‘"cairo"’ where available, otherwise ‘"Xlib"’.
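As the help text notes, the out-of-the-box default comes from the bitmapType option, so one way to avoid device surprises is to check and set that option once per session (a sketch; which types are actually available depends on how your R was built):

```r
# See which bitmap device type png() will use by default on this build
getOption("bitmapType")

# Prefer cairo for all subsequent png() calls in this session
options(bitmapType = "cairo")
```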

Let’s try some examples of each of the type options to see the difference.

types <- c("cairo", "cairo-png", "Xlib", "quartz")
for (itype in seq_along(types)) {
    png(paste("typeExample-", types[itype], ".png", sep = ""),
        type = types[itype], width = 300, height = 300)
    imagep(adp[["v"]][, , 1], filledContour = TRUE, main = types[itype])
    dev.off()
}

each of which produces the following:

(figures: the typeExample-*.png output for each of the four device types)

Note that it is the default quartz type that produces the issues through anti-aliasing. This can be turned off by specifying antialias='none' (see the description of the antialias argument in ?png for more details):

png("quartzNoAntialias.png", type = "quartz", antialias = "none")
imagep(adp[["v"]][, , 1], filledContour = TRUE, main = "quartz with antialias=none")
dev.off()
## pdf
## 2


This “fixes” the problem for the image plot, but leaves the fonts and axis lines un-antialiased.


Based on the above, the best option for producing image-style plots without anti-aliasing artifacts is to use the type='cairo' option for the png device (note that by default Cairo devices use the Helvetica font family, whereas Quartz devices use Arial).

png("cairoDevice.png", type = "cairo", antialias = "none", family = "Arial")
imagep(adp[["v"]][, , 1], filledContour = TRUE, main = "A Cairo device png")
dev.off()
## pdf
## 2


  1. For hints on installing the oce package check out the blog post here