I have a long-standing interest in the star formation histories of galaxies, or perhaps more precisely, in what it’s possible to learn about the star formation histories of unresolved galaxies from spectroscopic snapshots. This has been a topic of great interest to astronomers for as long as they’ve understood the true nature of galaxies, yet somewhat surprisingly it is still a relatively immature field: the epistemic question (what is it possible to learn?) still has no definite answer. And that’s a good thing, because it means there’s still room for creative exploration.
Why am I starting a blog and why now? This is a hobby for me. An eccentric hobby to be sure, but it keeps me writing software and intellectually stimulated. I got actively interested in this around the time of SDSS data release 9 (so, late 2012) when I belatedly realized massive amounts of spectroscopic data were available to anyone with an internet connection. A bit of searching turned up the other main ingredient needed to analyze spectroscopic data in the form of libraries of “simple stellar population” models. I’ve mostly used various iterations of the MILES SSP models, but I’ve also worked with subsets of the BC03 and CB07 models, and one or two others. I’ll get around to some comparative analyses of these sometime later.
My modeling efforts started out fairly simple. At the most basic level, fitting spectra involves non-negative least squares, for which there is a very effective algorithm. This was already known from the work of Cappellari and Emsellem (2004) [1], and my optimization code still only provides a subset of the capabilities of Cappellari’s pPXF. However, I had something of an epiphany a few years ago when I discovered it’s possible to do what Conroy calls non-parametric star formation history modeling with fully Bayesian techniques. Now, this is no huge scientific breakthrough. It was mostly a lucky find of the right sampling strategy and implementation in the probabilistic modeling language Stan, but it’s a useful technical advance, and to the best of my knowledge I am the first to do this. All other Bayesian SED modeling codes that I’m aware of use a parametric functional form for specifying star formation histories.
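To make the non-negative least squares step concrete, here is a minimal sketch in Python. It is an illustration only, not the R code behind this blog and not pPXF. The three "template spectra" are made-up numbers standing in for SSP models, the names `nnls_coordinate_descent` and `mix` are hypothetical, and the solver is a simple cyclic coordinate descent chosen for brevity rather than a production NNLS algorithm. The idea is just that an observed spectrum is modeled as a weighted sum of templates, with every weight constrained to be zero or positive.

```python
def nnls_coordinate_descent(templates, spectrum, n_sweeps=1000, tol=1e-12):
    """Find weights w >= 0 minimizing ||sum_j w[j] * templates[j] - spectrum||^2.

    Hypothetical helper for illustration: cyclic coordinate descent with each
    weight clipped at zero. Not an active-set solver and not tuned for speed.
    """
    weights = [0.0] * len(templates)
    residual = list(spectrum)  # spectrum minus current model (model starts at 0)
    norms = [sum(v * v for v in t) for t in templates]

    for _ in range(n_sweeps):
        largest_change = 0.0
        for j, template in enumerate(templates):
            if norms[j] == 0.0:
                continue  # an all-zero template cannot contribute
            # Best value of weights[j] with the others held fixed, clipped at 0.
            step = sum(t * r for t, r in zip(template, residual)) / norms[j]
            new_weight = max(0.0, weights[j] + step)
            change = new_weight - weights[j]
            if change != 0.0:
                residual = [r - change * t for r, t in zip(residual, template)]
                weights[j] = new_weight
                largest_change = max(largest_change, abs(change))
        if largest_change < tol:
            break  # no weight moved appreciably in this sweep
    return weights


# Made-up "template spectra" on a 5-point wavelength grid (blue to red).
young = [4.0, 2.0, 1.0, 0.5, 0.25]
intermediate = [1.0, 2.0, 2.0, 1.0, 0.5]
old = [0.25, 0.5, 1.0, 2.0, 4.0]
templates = [young, intermediate, old]


def mix(coefficients):
    """Linear combination of the templates with the given coefficients."""
    return [sum(c * t[i] for c, t in zip(coefficients, templates))
            for i in range(len(young))]


# Case 1: a mock spectrum built from non-negative weights is recovered.
true_weights = [0.5, 0.0, 2.0]
recovered = nnls_coordinate_descent(templates, mix(true_weights))

# Case 2: a mix with a negative coefficient; the fit keeps all weights >= 0.
constrained = nnls_coordinate_descent(templates, mix([1.0, -0.3, 1.0]))

print("recovered:  ", [round(w, 4) for w in recovered])
print("constrained:", [round(w, 4) for w in constrained])
```

In the second case an unconstrained least squares fit would return the negative coefficient, which has no physical meaning as a stellar population contribution; the constrained fit instead sets that weight to zero and adjusts the others. For real spectra with hundreds of templates, a library routine such as SciPy's `scipy.optimize.nnls` is likely a better choice than this toy loop.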
One of the good things about doing something like this as an outsider hobbyist is there’s no pressure to present or publish results. The not-so-good thing about it is that without ready access to professional conferences and journals it’s difficult to communicate results. Writing papers seems like an unproductive exercise[2]: they’re tedious to write and equally tedious to read. Also, I never seem to reach a definite end state. I’m always modifying code, getting new ideas about what’s interesting to look at, and trying different visualization tools.
Around the same time I was starting this activity I discovered the original Galaxy Zoo forum and Galaxy Zoo itself. I’ve posted in the forum and the various versions of Talk for some time and even contribute clicks now and then. I was also one of the last “citizen scientists” (I really don’t like that term) to throw in the towel on the failed Galaxy Zoo Quench project. Posting on GZ Talk can be a frustrating experience. Posts quickly get lost in the noise, and it’s difficult to write even mildly technical content because of software limitations, including a very limited Markdown implementation in the current version. Pictures and other graphical content must also be uploaded to a separate server and hotlinked, which adds extra steps and increases the risk of link rot. The engagement of the GZ science team with the volunteer community seems to be haphazard at best, and only a small percentage of the volunteers appear to have the technical background to understand what I’m writing.
So, I decided to start another blog. This may go largely unread too, but that’s OK. Consider it a diary that I’ve left out in the open.
My current plan is to post an occasional case study in the form of a Markdown notebook. I intend to make these fully reproducible, which entails among other things making all code and data publicly available. Like probably most recreational programmers, I’m not very good at the housekeeping this requires. Right now my code is scattered over several source files, with no documentation other than an occasional comment, and with no-longer-used functions mixed in with current code. There are potential readability issues: to mention one example, it’s completely legal and fairly common practice in R to use periods in variable and function names. I’ve been trying to get away from doing that by using underscores instead, but I’m not fully consistent about it. In any case I will eventually get the code cleaned up, properly version controlled, and placed on GitHub.
Most of my current code other than the Stan models is written in R. Why R? I discovered it many years ago while working on a data analysis task completely unrelated to galaxies. From hazy memory R was in about version 1.4.x at the time, which would make it around 2002. After 16 years of experience and authoring several R packages, I’m reasonably proficient at it. If I were starting from scratch though, particularly for the kinds of modeling I plan to discuss here, I’d probably use Python. I do plan a Python port, but it may be more than a few months’ project.
While I’m mostly interested in star formation history modeling, I plan to look at other things as well. Most likely I will write first about kinematics and disk galaxy rotation curves, since I sorta offered to “write something up” on the subject. I will probably delve into related topics in Bayesian statistics too. I may occasionally say something about current literature when I see a paper on arXiv that’s either interesting or conspicuously bad. I may consider inviting additional contributors. I promise never to write anything about voor-whatevers.
[1] The use of quadratic programming techniques for stellar population modeling actually predates this paper by over 3 decades (Lasker 1970, Faber 1972).
[2] I’ve written one anyway. If you’re really interested, it’s archived here. If I actually wanted to publish this I would get rid of most of the workflow discussion and focus on the Bayesian modeling aspect. I’d also make sure to publish software simultaneously. I might choose a different case study. Also, all of my code has changed since then and I’ve disproved some of my speculations. For example, priors don’t appear to be very influential. The choice of SSP library on the other hand…