Current Opinions of Joel Berendzen (mid 2022)

“Follow the data!” was the dictum my advisor,
Hans Frauenfelder,
gave me more than 30 years ago as I was thinking about what to do post-PhD.
At the time Hans said that, biology wasn’t especially data-rich, but now
the situation
has changed. For example, the Webb Space Telescope
is expected to produce around 200 TB of data per year. A
premier biological sequence observatory, the Broad Institute, has been
producing that much data per month for the last few years.
Moreover, the number of labs with sequencers in them is much larger, and growing faster, than
the number of labs with telescopes. It’s not just sequencers, either; there are sizeable
data flows from protein crystallography at synchrotron beamlines worldwide, and there’s
about to be huge streams coming out of microscopy-driven projects such as the Human
Brain Mapping Initiative. Biology has quietly become the most data-intensive science
of the Age of Big Data.
Here’s an essay on creating the Theory of Biology
by building bridges among data, signatures, models, and applications.
Much of my recent work has been on genomic sequences. Here are some
thoughts on bioinformatics and bioinformaticians.
I create software to analyze data, and I try to write for the future as well as
to solve particular problems today. If I do my part well, my efforts will serve as a
model and example, not just a means to an end. Here are my thoughts on
writing scalable software.
I have published
roughly 50 papers with over 13,000 citations that
explore the interrelations among sequences, structures, gene-family trees, dynamics,
and hydration.
Here is an overview of some of my code repositories and other places where I’ve contributed:
- azulejo combines guilt-by-profiling (genome synteny)
and guilt-by-association (phylogeny) to create pangenomic collections of gene families and
tile phylogenetic space with supertrees of proxy genes. Uses out-of-memory external merges
and a novel “peatmer” algorithm to achieve linear scaling with a small memory footprint to
accommodate large numbers of input genomes.
- click_loguru: Most scientific code needs a CLI and
benefits from logging to a file. This repository combines those two needs and is the starting
point for most of my active codes.
- pytest-datadir-mgr: Most scientific code needs
to test using data files too large to be kept in the repository. The code in
this repository makes downloading input data and saving intermediate results an easier task.
- pybio Gentoo Overlay: Computational biologists need a
development distro, and for years mine has been Gentoo
because of the large number of
biology-related packages and because it's a source-code distribution. (For production and
container use, I like Clear Linux for its performance and update
properties.) This private repo contains another 100 or so packages that I find useful on top of
the 200 in the main tree and the 300 in the
Science overlay.
- aakbar: This Amino Acid K-mer calculator can be used to
calculate signature peptides by phylogenetic or other means. Its output can be used in Sequedex
or other signature methods. Its input can be raw proteomes, but it’s better with sets of proxy
genes from azulejo.
- alphabetsoup: Parallel data wrangling of input sequences,
including alphabet checking and removal of some ugly but common artifacts.
- Sequedex is R&D100 award-winning software that uses scalable signature methods
to classify short DNA sequences by where they come from and what they do. Sequedex is mostly used
in metagenomics and surveillance for emergent infectious diseases. The Sequedex open-source repository
is here.
- SOLVE is R&D100 award-winning software that helps automate the phasing of
X-ray crystal structures of proteins. It calculates a statistic that acts very much like autofocus
on a camera. SOLVE is closed-source.
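As a rough illustration of what amino-acid k-mer counting with alphabet checking involves (the ideas behind the aakbar and alphabetsoup entries above), here is a minimal Python sketch. It is not code from either repository: the function names, the choice of the 20 standard one-letter codes as the allowed alphabet, and the trailing-`*` cleanup are all assumptions made for the example.

```python
from collections import Counter

# Assumption: only the 20 standard amino-acid one-letter codes are valid.
# Real tools may also accept ambiguity codes such as X, B, or Z.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(seq: str) -> str:
    """Upper-case a protein sequence and strip a trailing stop symbol '*'."""
    return seq.strip().upper().rstrip("*")


def count_kmers(sequences, k=3):
    """Count k-mers across sequences, skipping windows with non-standard letters."""
    counts = Counter()
    for raw in sequences:
        seq = clean_sequence(raw)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Keep the window only if every letter is in the allowed alphabet.
            if STANDARD_AMINO_ACIDS.issuperset(kmer):
                counts[kmer] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy input: one clean sequence, one with an ambiguous 'X'.
    toy = ["MKVLAAGIV*", "mkvlxagiv"]
    for kmer, n in count_kmers(toy, k=3).most_common():
        print(kmer, n)
```

This holds every count in memory, which is fine for a toy input. It makes no attempt at the external merges or linear scaling described for azulejo.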
You can comment on this page or reach me on Twitter.