**Our project** aims to develop mathematical theory and improved algorithms for 3D molecular structure determination using cryo-EM. Our methods use a combination of tools from different areas of mathematics, statistics, and computer science, such as computerized tomography, convex optimization and semidefinite programming, random matrix theory, signal and image processing, linear and nonlinear dimensionality reduction, randomized algorithms in numerical linear algebra, and representation theory.

**Acknowledgement:** This project would not have been possible without the support of Award Number R01GM090200 from the National Institute of General Medical Sciences (NIGMS), the Simons Foundation, and the Moore Foundation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIGMS or the NIH.

**What is cryo-EM and SPR?** "Three dimensional electron microscopy" is the name commonly given to methods in which the 3D structures of macromolecular complexes are obtained from sets of 2D projection images taken in an electron microscope. The most widespread and general of these methods is single-particle reconstruction (SPR). In SPR, the 3D structure is determined from images of randomly oriented and positioned, ideally identical macromolecular "particles''. The SPR method has been applied to images of negatively stained specimens, and to images obtained from frozen-hydrated, unstained specimens. In the latter technique, called cryo-EM, the sample of macromolecules is rapidly frozen in a thin (~100 nm) layer of vitreous ice, and maintained at liquid nitrogen temperature throughout the imaging process. SPR from cryo-EM images is an entirely general imaging method that does not require crystallization, and can capture molecules in their native states. Recent technological advancements have revolutionized the cryo-EM field, enabling near atomic resolution reconstructions of molecules of size at least ~175kDa. Furthermore, cryo-EM has the potential to analyze compositionally and conformationally heterogeneous mixtures and, consequently, can be used to determine the structures of complexes in different functional states. Cryo-EM was selected by the journal Nature Methods as Method of the Year 2015 for its newfound ability to solve protein structures at near-atomic resolution.

**What is the mathematical problem of SPR using cryo-EM?** The cryo-EM reconstruction problem is to find the three-dimensional structure of a molecule given samples of its two-dimensional projection images at unknown random directions. The intensity of pixels in a given projection image corresponds to line integrals of the scattering density induced by the molecule along the path of the imaging electrons. The highly intense electron beam destroys the molecule and it is therefore impractical to take projection images of the same molecule at known different directions as in the case of classical computerized tomography. In other words, a single molecule can be imaged only once. All molecules are assumed to have the exact same structure; they differ only by their spatial orientation and position. Thus, every image is a projection of the same molecule but an unknown random orientation. The cryo-EM problem is an inverse problem stated as follows: find the 3D scattering density given a collection of 2D noisy projection images whose orientations (and positions) are unknown.

**There are many software packages for 3D Electron Microscopy. What makes this one different?** Indeed, there are many excellent software packages for molecular structure determination from cryo-EM images. Most existing algorithms for SPR use an iterative refinement procedure. As the starting point for the refinement process, however, some sort of ab initio estimate of the 3D structure must be made. Our package provides tools for estimating an initial model directly from the cryo-EM images, in a completely data-driven manner that does make any assumption about the molecular structure or the underlying distribution of the viewing directions.

**What are the highlight features of this software toolbox?** We believe that experienced clients of current 2D/3D image analysis toolboxes would appreciate the following main features of our toolbox:

- 3D Structural variability: Fast and accurate estimation of the covariance matrix of the 3D molecular structures from their 2D tomographic noisy images and tools for clustering and non-linear dimensionality reduction for continuous variability. Our method is based on a new technique for PCA from noisy linearly reduced measurements, and diffusion maps for continuous variability.
- Steerable PCA and image restoration: Fast and accurate Principal Component Analysis for computing the eigen-images and eigenvalues of a set of 2D raw images and their in-plane rotations. Those who use MSA (Multivariate Statistical Analysis) for compression and de-noising of their raw 2D images should definitely try our Fourier-Bessel Steerable PCA algorithm as a viable alternative.
- Class Averaging: An algorithmic pipeline that finds for every 2D raw image in the data set its nearest neighbors in terms of similar viewing directions. Our class averaging procedure is fast (nearly linear running time in the number of images) and succeeds at remarkably low levels of signal-to-noise ratio. The algorithm uses steerable PCA, Wiener filtering, rotational invariant representation of 2D images using bispectrum, fast randomized SVD, fast randomized nearest neighbors search, fast 2D alignment of images, and dimension reduction using vector diffusion maps. Try our class averaging procedure as an alternative to the MSA classification algorithm or the reference-free alignment procedure.
- Angular Reconstitution from Common Lines: The Fourier Projection Slice Theorem implies that the orientations of three projection images can be determined from their common lines, which is the foundation of the angular reconstitution method (van Heel 1987; Vainshtein and Goncharov 1986). Our toolbox provide several common-line based algorithms that utilize the information in the common lines between all pairs of images simultaneously, leading to an assignment of orientations that is as consistent as possible with the common lines. Our approaches use convex optimization and semidefinite relaxation, spectral methods and Bayesian approaches, and work for both uniform and non-uniform viewing angle distributions.

**Future Extensions:** We plan to add more functionality to our toolbox in the future. Here is a short list of additional capabilities that we are planning to include:

- Improved classification algorithm for heterogeneous datasets.
- Motion correction algorithms to deal with beam induced motion.
- Further improvements to our current ab-initio modeling algorithms.