Kinematics 1a – Velocity offsets


I have been working mostly with spectroscopic data from SDSS-IV [MaNGA](http://www.sdss.org/dr14/manga/) since its first public release in 2016. The first few releases include little in the way of value-added data beyond calibrated spectra and some metadata taken mostly from the [NASA-Sloan Atlas](http://nsatlas.org/). In particular no kinematic measurements are provided, and these are essential to most analyses. In the next several posts I’m going to write about estimating and using the first moment of the line of sight velocity distribution (LOSVD), that is the velocity offset from the system velocity. Keep in mind that estimating the moments of the LOSVD was the original objective of Cappellari and Emsellem’s `ppxf`, and anyone professionally engaged with the data should be using it or a similar tool that has passed peer review. But since this is a hobby, I developed my own tools. All software needed to replicate the examples posted here is available in my github repository.

The first step in any data analysis is to acquire some data, and SDSS makes this really easy. The base URL for MaNGA data in DR14, which is still the current release at the time of writing, is https://data.sdss.org/sas/dr14/manga/spectro/redux/v2_1_2/, and you drill down two levels deeper to download individual files. An observation is uniquely identified by a plate and IFU number, which are combined to make a plateifu id. Objects, on the other hand, are assigned a unique mangaid, which might be associated with multiple plateifus if observations are repeated. I’m going to work with the observation having plateifu 8942-12702 in the next several posts, which is in the directory 8942/stack. Calibrated data products come in two basic flavors: ā€œrow stacked spectraā€ (RSS) files and data cubes. Both types are available with either linear or logarithmically spaced wavelength bins — since convolution is an essential step in most analyses I work with the logarithmically spaced files, which have equal velocity widths of \( c \Delta \log_{10}(\lambda) \ln(10) \approx 69 \) km/sec per wavelength bin. The files I’ll work with are [manga-8942-12702-LOGCUBE.fits.gz](https://data.sdss.org/sas/dr14/manga/spectro/redux/v2_1_2/8942/stack/manga-8942-12702-LOGCUBE.fits.gz) and [manga-8942-12702-LOGRSS.fits.gz](https://data.sdss.org/sas/dr14/manga/spectro/redux/v2_1_2/8942/stack/manga-8942-12702-LOGRSS.fits.gz). This particular target is the nice barred spiral UGC 4300 (image credit SDSS):

![UGC 4300](http://skyserver.sdss.org/dr14/SkyServerWS/ImgCutout/getjpeg?TaskName=Skyserver.Chart.Image&ra=124.003321406132&dec=27.0758846629965&scale=0.4&width=512&height=512&opt=GL&query=&Grid=on&Label=on)
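
Incidentally, the \( \approx 69 \) km/sec figure follows directly from the logarithmic wavelength spacing, which for the LOG files is the usual SDSS value of \( \Delta \log_{10}(\lambda) = 10^{-4} \) per bin. A one-line check in R:

```{R}
## velocity width of one logarithmic wavelength bin:
## c * Delta log10(lambda) * ln(10), with Delta log10(lambda) = 1e-4
ckms <- 299792.458
dlogl <- 1e-4
ckms * dlogl * log(10)    # about 69 km/sec
```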

Browser based downloads are fine to get a file or two, but for bulk access it’s easiest to use a tool like `wget` or `rsync`; both methods are documented on the SDSS data access pages. A small shell script helps automate the process:

“`{bash}
#! /bin/bash
wget -nv -nc -r -nH -nd -P. \
-i $1 \
-B http://data.sdss.org/sas/dr14/manga/spectro/redux/v2_1_2/

“`

This reads a text file, passed as the script’s only argument, listing the remainder of the path to each file to be downloaded, one per line; it fetches the files from the Science Archive Server and stores them in the current working directory.
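
For example, to fetch the two files used in this post the input file would contain just their paths relative to the base URL. A minimal usage sketch, assuming the script above has been saved as `getmanga.sh` (the name is arbitrary):

```{bash}
# create the download list: paths relative to the base URL, one per line
cat > files.txt <<EOF
8942/stack/manga-8942-12702-LOGCUBE.fits.gz
8942/stack/manga-8942-12702-LOGRSS.fits.gz
EOF

# fetch both files into the current directory
./getmanga.sh files.txt
```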

Next we need to get the data into R’s workspace. For this I use the R package [FITSio](https://cran.r-project.org/package=FITSio), which is the only package I’ve found for reading FITS files. While not as complete or polished as `astropy`’s FITS module, it’s fine for reading binary data tables. We also need some metadata from the MaNGA catalog, which is described [here](http://www.sdss.org/dr14/manga/manga-data/catalogs/). I have created a flat data table from the catalog and stored it in the R data file [drpcat.rda](https://github.com/mlpeck/vrot_stanmodels/blob/master/drpcat.rda) containing the R object `drpcat`. With my source files loaded into the R session the command to read a data cube is

“`{R}
gdat.cube <- readcube("manga-8942-12702-LOGCUBE.fits", drpcat) ``` For any reader unfamiliar with R syntax the symbol `<-` is the assignment operator (don't ask), so this line is just assigning the value returned by `readcube()` to the variable `gdat.cube`, which... ```{R} > str(gdat.cube)
List of 13
$ meta :List of 10
..$ mangaid : chr "1-218280"
..$ plateifu: chr "8942-12702"
..$ ra : num 124
..$ dec : num 27.1
..$ l : num 196
..$ b : num 29.6
..$ ebv : num 0.0367
..$ z : num 0.0255
..$ zdist : num 0.0254
..$ cpix : num [1:2] 38 38
$ xpos : num [1:74] 0.00514 0.005 0.00486 0.00472 0.00458 …
$ ypos : num [1:74] -0.00514 -0.005 -0.00486 -0.00472 -0.00458 …
$ lambda: num [1:4563(1d)] 3622 3622 3623 3624 3625 …
$ ra.f : num [1:74, 1:74] 124 124 124 124 124 …
$ dec.f : num [1:74, 1:74] 27.1 27.1 27.1 27.1 27.1 …
$ flux : num [1:74, 1:74, 1:4563] NA NA NA NA NA NA NA NA NA NA …
$ ivar : num [1:74, 1:74, 1:4563] NA NA NA NA NA NA NA NA NA NA …
$ snr : num [1:74, 1:74] NA NA NA NA NA NA NA NA NA NA …
$ gimg : num [1:74, 1:74] 0 0 0 0 0 0 0 0 0 0 …
$ rimg : num [1:74, 1:74] 0 0 0 0 0 0 0 0 0 0 …
$ iimg : num [1:74, 1:74] 0 0 0 0 0 0 0 0 0 0 …
$ zimg : num [1:74, 1:74] 0 0 0 0 0 0 0 0 0 0 …
“`

The result is a large list containing another list with some metadata and a number of numerical arrays, some taken directly from the FITS file and some processed in various ways. I correct the fluxes and inverse variances for galactic foreground extinction and also check the `mask` array for spaxels that should be masked; these are set to `NA` in the `flux` and `ivar` arrays. For this observation, which used the largest (127 fiber) IFU, the spectra are stored in 74 x 74 x 4563 arrays, the last dimension being the number of wavelength bins. The 74 x 74 spaxel field is somewhat larger than the IFU footprint, so only some spaxels have data — in this example just 3,049 out of 5,476, or about 56%.
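
That fraction is easy to check from the `snr` array since, as the `str()` output above suggests, spaxels outside the footprint are `NA` there as well:

```{R}
## count spaxels with data (masked/empty spaxels are NA)
good <- !is.na(gdat.cube$snr)
sum(good)     # 3049 spaxels with data
mean(good)    # about 0.56
```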

RSS files are a little more complicated to work with. They contain one spectrum for each exposure and each fiber, and for every observation the IFU is dithered to 3 distinct positions, so if there are Nexp total exposures there will be Nexp/3 exposures for each fiber/pointing combination. In the RSS FITS files the flux and other data are stored in 4563 x (Nfiber x Nexp) image arrays, one spectrum per fiber per exposure, but to use the data we need to identify the distinct fiber/pointing combinations and combine the data. This is complicated by the fact that it isn’t entirely clear how the rows are stacked — this varies between data sets, although the first 3Nfiber spectra always appear to comprise one complete set of 3 pointings. A further complication is that because of differential refraction the position on the sky seen by each fiber varies with wavelength and with exposure. I chose to gloss over that by taking the mean position over the wavelength range as the fiber position for that exposure. I then do a k-nearest neighbor lookup to find the Nexp-1 neighbors of the first 3Nfiber exposures. Then, with a bit of rearranging of data, I end up storing the flux and inverse variance measurements in 3Nfiber x Nexp/3 x 4563 arrays.

Now this isn’t what I really want: what I want is to combine the exposures for each fiber/pointing combination to get a single spectrum with inverse variance estimates. I call this a stacked data set, and I wrote separate functions to read and manipulate the RSS files and to stack the data. These could be combined of course; I really have no use for the RSS data itself. In any case the current workflow goes like this:

“`{R}
> gdat.rss <- readrss("manga-8942-12702-LOGRSS.fits", drpcat) > str(gdat.rss)
List of 7
$ meta :List of 10
..$ mangaid : chr "1-218280"
..$ plateifu: chr "8942-12702"
..$ nexp : num 9
..$ ra : num 124
..$ dec : num 27.1
..$ l : num 196
..$ b : num 29.6
..$ ebv : num 0.0367
..$ z : num 0.0255
..$ zdist : num 0.0254
$ lambda: num [1:4563] 3622 3622 3623 3624 3625 …
$ flux : num [1:381, 1:3, 1:4563] 1.913 0.861 2.618 -2.61 5.185 …
$ ivar : num [1:381, 1:3, 1:4563] 0.124 0.161 0.14 0.177 0.119 …
$ snr : num [1:381, 1:3] 2.28 2.2 2.42 2.71 2.76 …
$ xpos : num [1:381, 1:3] 6.67 7.89 5.37 2.78 4.13 …
$ ypos : num [1:381, 1:3] -12.6 -10.2 -10.4 -10.5 -12.7 …
> gdat.stack <- stackrss(gdat.rss)
> str(gdat.stack)
List of 10
$ meta :List of 10
..$ mangaid : chr "1-218280"
..$ plateifu: chr "8942-12702"
..$ nexp : num 9
..$ ra : num 124
..$ dec : num 27.1
..$ l : num 196
..$ b : num 29.6
..$ ebv : num 0.0367
..$ z : num 0.0255
..$ zdist : num 0.0254
$ lambda: num [1:4563] 3622 3622 3623 3624 3625 …
$ flux : num [1:381, 1, 1:4563] -0.538 2.972 1.319 -0.642 1.778 …
$ ivar : num [1:381, 1, 1:4563] 0.396 0.383 0.403 0.436 0.388 …
$ snr : num [1:381, 1] 3.93 3.89 4.18 4.7 4.81 …
$ xpos : num [1:381] 6.67 7.88 5.36 2.78 4.12 …
$ ypos : num [1:381] -12.6 -10.2 -10.4 -10.5 -12.7 …
$ ra.f : num [1:381] 124 124 124 124 124 …
$ dec.f : num [1:381] 27.1 27.1 27.1 27.1 27.1 …
$ dz : NULL
“`
Now, I work almost exclusively with these stacked derivatives of the RSS files. The main reason is necessity: the Bayesian models I work with, especially for star formation history modeling, are extremely computationally expensive. They can take tens of minutes per spectrum on a state of the art desktop PC, even with a smaller number of samples than I’d like and some other compromises. At, say, 3 per hour, analyzing the ∼3,000 spectra in a data cube would take about 42 days of CPU time, while the stacked RSS files have at most 381 spectra, making a single IFU’s analysis at most a 5 day project. Eliminating the compromises could increase execution time by factors of several, making even the latter analysis difficult for any reasonable size sample of galaxies.

There are some principled reasons as well to prefer the RSS files. The data cubes are generated from the RSS data via an interpolation scheme which can’t add to the information content but does introduce spatial covariance into any derived data. This should properly be modeled, but in practice usually isn’t — the MaNGA science team recommends inflating variance estimates by some factor. For many analyses it’s necessary to bin spaxels, and again the spatial covariance should properly be accounted for. The fiber spectra on the other hand can be treated as independent samples. The one significant disadvantage of working with the RSS data is that new visualization tools are needed. I’ve developed a number of such tools, and continue to work on them.
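
To spell out the binning point: if a binned spectrum is the sum of \( N \) spaxel spectra \( x_i \), then

\[ \mathrm{Var}\left( \sum_{i=1}^{N} x_i \right) = \sum_{i=1}^{N} \mathrm{Var}(x_i) + \sum_{i \neq j} \mathrm{Cov}(x_i, x_j) , \]

and for cube spaxels the covariance terms are positive, so treating the spaxels as independent underestimates the variance of the binned spectrum. For fiber spectra the cross terms are, to a good approximation, absent.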

This post is longer than I originally intended, so I will save results for the next one and then move on to discussing one use of velocity data.

SDSS

As I mentioned in the first post I make extensive use of spectroscopic data products from the Sloan Digital Sky Survey (SDSS), including the single fiber galaxy spectra from the legacy survey and more recently "Integral Field Unit" (IFU) spectra from SDSS-IV MaNGA. For completeness here are links to recent data release and technical overview papers, and to the acknowledgements page. I hope links will suffice.

DR12 release paper (the final SDSS-III data release): Alam et al. 2015

DR13 release: Albareti et al. 2016

DR14 release: Abolfathi et al. 2017

MaNGA papers

General overview: Bundy et al. 2015

Sample and observing strategy: Law et al. 2015 and Wake et al. 2017

Data products: Law et al. 2016

Acknowledgments

SDSS-III acknowledgment

SDSS-IV acknowledgment


Getting started

I have a long standing interest in the star formation histories of galaxies, or perhaps more precisely what it’s possible to learn about the star formation histories of unresolved galaxies from spectroscopic snapshots. This has been a topic of great interest to astronomers for as long as they’ve understood the true nature of galaxies, yet somewhat surprisingly this is still a relatively immature field — evidence for that is that the epistemic question (what is it possible to learn?) still has no definite answer. And that’s a good thing because it means there’s still room for creative exploration.

Why am I starting a blog and why now? This is a hobby for me. An eccentric hobby to be sure, but it keeps me writing software and intellectually stimulated. I got actively interested in this around the time of SDSS data release 9 (so, late 2012) when I belatedly realized massive amounts of spectroscopic data were available to anyone with an internet connection. A bit of searching turned up the other main ingredient needed to analyze spectroscopic data in the form of libraries of “simple stellar population” models. I’ve mostly used various iterations of the MILES SSP models, but I’ve also worked with subsets of the BC03 and CB07 models, and one or two others. I’ll get around to some comparative analyses of these sometime later.

My modeling efforts started out fairly simple — at the most basic level fitting spectra involves non-negative least squares, for which there is a very effective algorithm. This was already known from the work of Cappellari and Emsellem (2004) [1] and my optimization code still only provides a subset of the capabilities of Cappellari’s ppxf. However I had something of an epiphany a few years ago when I discovered it’s possible to do what Conroy calls non-parametric star formation history modeling with fully Bayesian techniques. Now this is no huge scientific breakthrough — it was mostly a lucky find of the right sampling strategy and implementation in the probabilistic modeling language Stan, but it’s a useful technical advance, and to the best of my knowledge I am the first to do this. All other Bayesian SED modeling codes that I’m aware of use a parametric functional form for specifying star formation histories.
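
To make the non-negative least squares step concrete, here is a bare-bones sketch: given a matrix whose columns are SSP template spectra on the same wavelength grid as the data, find the non-negative combination that best fits the observed fluxes. This uses the CRAN package `nnls` (an implementation of the Lawson-Hanson algorithm); the template matrix `ssp_lib` and the mock spectrum are stand-ins for illustration, not objects from my actual code:

```{R}
library(nnls)

## toy stand-ins: K template spectra sampled at nl wavelength points,
## and a mock "observed" spectrum built from two of them plus noise
nl <- 500
K  <- 10
ssp_lib <- matrix(abs(rnorm(nl * K)), nrow = nl, ncol = K)
true_w  <- c(2, 1, rep(0, K - 2))
flux    <- drop(ssp_lib %*% true_w) + rnorm(nl, sd = 0.1)

## non-negative least squares fit: flux ~ ssp_lib %*% weights, weights >= 0
fit <- nnls(ssp_lib, flux)
fit$x                          # fitted non-negative template weights
model_flux <- drop(ssp_lib %*% fit$x)
```

In practice the fit would also be weighted by the inverse variances (for example by scaling both `ssp_lib` and `flux` by the square root of `ivar`), but the structure of the problem is the same.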

One of the good things about doing something like this as an outsider hobbyist is there’s no pressure to present or publish results. The not so good thing about it is that without ready access to professional conferences and journals it’s difficult to communicate results. Writing papers seems like an unproductive exercise[2] — they’re tedious to write and equally tedious to read. Also, I never seem to reach a definite end state. I’m always modifying code, getting new ideas about what’s interesting to look at, and trying different visualization tools.

Around the same time I was starting this activity I discovered the original Galaxy Zoo forum and Galaxy Zoo itself. I’ve posted in the forum and the various versions of Talk for some time and even contribute clicks now and then. I was also one of the last “citizen scientists” (I really don’t like that term) to throw in the towel on the failed Galaxy Zoo Quench project. Posting on GZ Talk can be a frustrating experience. Posts quickly get lost in the noise and it’s difficult to write even mildly technical content because of software limitations including a very limited Markdown implementation in the current version. Pictures or graphical content also must be uploaded to a separate server and hotlinked, which adds extra steps and increases the risk of link rot. The engagement of the GZ science team with the volunteer community seems to be haphazard at best, and only a small percentage of the volunteers appear to have the technical background to understand what I’m writing.

So, I decided to start another blog. This may go largely unread too, but that’s OK. Consider it a diary that I’ve left out in the open.

My current plan is to post an occasional case study in the form of a markdown notebook. I intend to make these fully reproducible, which entails among other things making all code and data publicly available. Like probably most recreational programmers I’m not very good at it. Right now my code is scattered over several source files, with no documentation other than an occasional comment, and no longer used functions mixed in with current code. There are potential readability issues: to mention one example it’s completely legal and fairly common practice in R to use periods in variable and function names. I’ve been trying to get away from doing that by using underscores instead, but I’m not fully consistent about it. In any case I will eventually get the code cleaned up, properly version controlled, and placed on Github.

Most of my current code other than the Stan models is written in R. Why R? I discovered it many years ago while working on a data analysis task completely unrelated to galaxies. From hazy memory R was at about version 1.4.x at the time, which would make it around 2002. After 16 years of experience and authoring several R packages I’m reasonably proficient at it. If I were starting from scratch though, particularly for the kinds of modeling I plan to discuss here, I’d probably use Python. I do plan a Python port, but it may be more than a few months’ project.

While I’m mostly interested in star formation history modeling I plan to look at other things as well. Most likely I will write first about kinematics and disk galaxy rotation curves, since I sorta offered to “write something up” on the subject. I will probably delve into related topics in Bayesian statistics too. I may occasionally say something about current literature when I see a paper on arxiv that’s either interesting or conspicuously bad. I may consider inviting additional contributors. I promise never to write anything about voor-whatevers.


[1] The use of quadratic programming techniques for stellar population modeling actually predates this paper by over 3 decades (Lasker 1970, Faber 1972).

[2] I’ve written one anyway. If you’re really interested it’s archived here. If I actually wanted to publish this I would get rid of most of the workflow discussion and focus on the Bayesian modeling aspect. I’d also make sure to publish software simultaneously. I might choose a different case study. Also, all of my code has changed since then and I’ve disproved some of my speculations. For example priors don’t appear to be very influential. The choice of SSP library on the other hand…