Scaling biomarker discovery: more data, deeper insights with Sapient Bioanalytics


Biomarker discovery, once limited by the technical challenges of measuring complex biology consistently across large human populations, is being redefined. Innovations in instrumentation and software now enable the capture of thousands of molecules per sample with unprecedented speed and resolution. The question now becomes: how do we translate these growing datasets into insights that inform better drug development?

Sapient Bioanalytics (MA, USA), a leading lab for multi-omics data generation and insight delivery, was founded to tackle this challenge. In this interview, Jeramie Watrous, Co-Founder and Head of Analytical R&D, and Tao Long, Co-Founder and Head of Data Science at Sapient, discuss bridging the gap from large-scale biomarker data to deeper insights that drive discovery, translational research and therapeutic innovation.

Meet the interviewees

Jeramie Watrous
Co-Founder & Head of Analytical R&D
Sapient Bioanalytics

Dr Watrous is an Analytical Chemist and Engineer with more than a decade of experience using mass spectrometry for biological measurement and associated informatics. As a postdoctoral researcher and subsequently an Assistant Professor of Medicine at the University of California San Diego (CA, USA), he helped lead the development of fully automated, high-throughput mass spectrometry-based workflows for global small molecule biomarker analysis – the next-generation technology that became the basis for Sapient, which he co-founded in 2021. At Sapient, Jeramie oversees the generation, optimization and standardization of Sapient’s mass spectrometry systems and other innovative instrumentation and lab automation infrastructure.

Tao Long
Co-Founder & Head of Data Science
Sapient Bioanalytics

Dr Long is a Senior Bioinformatician and Computational Biologist with over a decade of experience in the handling and analysis of multi-dimensional complex human datasets, including genomics, metagenomics, proteomics and metabolomics. Prior to co-founding Sapient, she was Head of Bioinformatics at the University of California San Diego, and was previously an Assistant Professor at Sanford Burnham Prebys Medical Discovery Institute and a Bioinformatics Scientist at Human Longevity, Inc. (both CA, USA). At Sapient, Tao oversees data science, bioinformatics and pharmacological modeling capabilities and is responsible for maintaining and expanding Sapient’s proprietary longitudinal human biology databases.

Interview questions
  1. Can you share how Sapient generates biomarker data, and highlight some difficult-to-measure biomarkers you’ve worked with?
  2. What are some of the biggest challenges in generating high-quality biomarker data at scale?
  3. Once the data is generated, what’s the process for turning that raw information into insights or action?
  4. Can you walk us through an example where the data you generated led to a key scientific discovery?
  5. When collaborating with external partners who may have different research priorities, how do you ensure biomarker data generation aligns with both scientific objectives and practical applications?

Can you share how Sapient generates biomarker data, and highlight some difficult-to-measure biomarkers you’ve worked with?

Jeramie: At Sapient, we generate biomarker data entirely in-house using our high-throughput multi-omics platform that integrates proteomics, metabolomics, lipidomics and cytokine profiling. Our approach centers on what we call ‘next generation’ mass spectrometry (MS), which allows us to greatly expand the breadth and depth of measures we can make in a single complex biosample – and to also make these measures at scale, across thousands of samples at a time. We focus particularly on dynamic biomarkers – proteins, metabolites and lipids – that read out both genetic and non-genetic factors of health, disease and drug response, which can change over time.

Our technologies leverage both internal and external hardware innovations to enable MS to run faster and detect more molecules. For example, we’ve developed a proprietary rapid liquid chromatography system, or rLC, which enables us to capture many diverse small molecule compound classes in a single run, comprising more than 15,000 metabolites and lipids per biosample. For proteomics, we use nanoflow separation to achieve the resolution needed to detect up to 12,000 proteins in cells and tissue. We pair this with state-of-the-art MS to achieve the speed and sensitivity needed to rapidly deliver robust and reproducible large-scale omics datasets.

It is important to note that these systems are embedded in a larger pipeline that includes everything from automated sample preparation on the front end to AI-enabled data extraction and quality control on the back end. It’s not just about generating more data; it’s about making sure that the data is robust and analysis-ready for discovery. We leverage proprietary software, cloud computing and AI to process large-scale spectral files, remove noise and align peaks so that statistically powered, meaningful patterns can be uncovered in these vast datasets and translated into actionable insights.
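
To make the quality-control idea concrete, here is a minimal, generic sketch of one common approach in large-scale metabolomics – filtering out features that are unstable across pooled QC injections. The function name, threshold and data layout are illustrative assumptions, not Sapient’s proprietary software:

```python
import pandas as pd

def filter_unstable_features(intensities: pd.DataFrame,
                             is_pooled_qc: pd.Series,
                             max_cv: float = 0.30) -> pd.DataFrame:
    """Keep only features whose coefficient of variation (CV) across pooled QC
    injections is at or below max_cv; noisy features are dropped before analysis."""
    qc = intensities.loc[is_pooled_qc]    # rows = injections, columns = features
    cv = qc.std() / qc.mean()             # per-feature CV across the QC injections
    return intensities.loc[:, cv <= max_cv]

# Toy example: 4 injections (first 2 are pooled QCs) x 2 features
data = pd.DataFrame({"feat_1": [10.1, 9.9, 10.0, 10.2],
                     "feat_2": [5.0, 20.0, 3.0, 18.0]})
qc_mask = pd.Series([True, True, False, False])
print(filter_unstable_features(data, qc_mask).columns.tolist())  # ['feat_1']
```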

In terms of difficult-to-measure biomarkers, you could say this is an area we specialize in because of our ability to rapidly develop bespoke methods for even the most complex sample types. On the proteomics front, for example, one particularly challenging target was a protein localized to circulating exosomes that originated from a specific tissue bed. Not only did we have to develop a method to capture exosomes from the biosample and figure out how to detect the target as it was expressed at very low levels, but we also needed to determine the ratio of exosomes that originated from the target tissue bed relative to the total number of exosomes. This allowed the target signal to be normalized across different patient samples.
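
As a rough illustration of that normalization step (the marker choices, field names and arithmetic below are assumptions made for the sketch, not Sapient’s actual method), the target signal can be scaled by the fraction of exosomes attributable to the tissue of interest:

```python
from dataclasses import dataclass

@dataclass
class ExosomeMeasurement:
    target_intensity: float         # MS signal for the low-abundance target protein
    tissue_marker_intensity: float  # signal for a hypothetical tissue-specific exosome marker
    pan_exosome_intensity: float    # signal for a marker present on all exosomes

def normalized_target_signal(m: ExosomeMeasurement) -> float:
    """Scale the target signal by the fraction of exosomes that originate from
    the target tissue bed, making values comparable across patient samples."""
    tissue_fraction = m.tissue_marker_intensity / m.pan_exosome_intensity
    return m.target_intensity / tissue_fraction

samples = {
    "patient_A": ExosomeMeasurement(1.2e4, 3.0e5, 1.5e6),
    "patient_B": ExosomeMeasurement(0.8e4, 1.0e5, 1.6e6),
}
for name, m in samples.items():
    print(name, round(normalized_target_signal(m)))
```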

What are some of the biggest challenges in generating high-quality biomarker data at scale?

Jeramie: Consistency is key. Minimizing experimental variance and doing your best to eliminate any inherent bias in the experimental design helps empower the downstream statistical analysis, which in turn produces more reliable associations between phenotypes and chemotypes. Operating at population scale, where we are measuring tens of thousands of molecules in tens of thousands of samples as part of metabolomics and/or proteomics studies, requires extensive operational and engineering controls to be in place in addition to highly trained personnel.

Two of the best ways we have found to maintain analytical consistency over large studies are (1) extensive incorporation of automation in the sample preparation pipeline, and (2) acceleration of the actual data collection process. Incorporation of automated liquid and sample handling throughout the sample preparation process is great for limiting variance, particularly with modern experimental designs using smaller and smaller amounts of biosample, which require much more rapid and accurate liquid transfer. While perhaps less obvious, accelerating the rate of data collection on the instrument side also helps to reduce variance, since samples are subjected to fewer long-term sources of variance, such as solvent and consumable batches or instrument cleaning cycles.

Lastly, while consistency is key, data depth is also extremely important for generating high-quality datasets. Sapient employs state-of-the-art equipment at every step of the sample preparation and data collection process to maximize the number of detected chemical features while maintaining low variance. For sample preparation, Sapient has developed a suite of proprietary methods for metabolomics and proteomics, which allow us to study a wide array of chemical species from almost any sample type. For data collection, we employ highly sensitive ion mobility-capable mass spectrometers coupled with high-throughput chromatography, allowing for exceptional data depth. Altogether, this allows Sapient to collect deep datasets at population scale at a level of quality that gives high confidence in any resulting statistical hit.


Once the data is generated, what’s the process for turning that raw information into insights or action?

Jeramie: As mentioned earlier, data processing is a critical bridge between data generation and insight delivery. For metabolomics, we use Sapient’s proprietary software suite, which enables peak extraction and alignment across thousands of samples, as well as a metabolite identification pipeline that leverages our comprehensive, in-house standards library to identify the known molecules captured. For proteomics, we’ve generated Sapient’s proprietary tissue-specific protein references and leverage the latest AI-based tools for spectral matching, false discovery rate (FDR) estimation, protein group quantification and intensity normalization. From there, the analysis-ready data is handed off to Tao’s data science team for deeper interpretation.
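
For readers unfamiliar with library-based identification, the core matching logic can be sketched as below; the m/z values, retention times, names and tolerances are hypothetical, and real pipelines such as Sapient’s also rely on MS/MS spectral evidence:

```python
import pandas as pd

# Hypothetical feature table and standards library (illustrative values only)
features = pd.DataFrame({"mz": [180.0634, 132.1019], "rt_min": [2.31, 5.80]})
library = pd.DataFrame({"name": ["metab_A", "metab_B"],
                        "mz": [180.0633, 132.1018],
                        "rt_min": [2.30, 5.75]})

def annotate(features: pd.DataFrame, library: pd.DataFrame,
             mz_ppm: float = 10.0, rt_tol_min: float = 0.2) -> list:
    """Assign a library name to each detected feature whose m/z (within a ppm
    tolerance) and retention time (within a minute tolerance) match a standard."""
    names = []
    for _, f in features.iterrows():
        hit = library[(abs(library["mz"] - f["mz"]) / f["mz"] * 1e6 <= mz_ppm)
                      & (abs(library["rt_min"] - f["rt_min"]) <= rt_tol_min)]
        names.append(hit["name"].iloc[0] if len(hit) else None)
    return names

print(annotate(features, library))  # ['metab_A', 'metab_B']
```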

Tao: Turning this processed multi-omics data into actionable insights relies on three foundational components: talent, infrastructure and tools. When it comes to talent, you really need a cross-functional team that brings complementary expertise in biology, statistics, machine learning (ML) and data science, plus a strong understanding of laboratory processes. This allows us to interpret complex data in a meaningful way and address the real biological question being asked. At Sapient, we’ve built a diverse biocomputational team that contributes all of these different perspectives, elevating the quality of insights we can deliver.

In terms of infrastructure, you have to consider the sheer volume of data that multi-omics can generate – sometimes multiple terabytes to be analyzed in a single study. We leverage both on-premises and cloud-based infrastructure to process this data in parallel, enabling much faster computation.

And finally, you need tools that enable your team to extract maximum value out of these large-scale datasets. We use a mix of best-in-class software and algorithms alongside internally developed tools tailored to address specific challenges. For example, we’ve built a proprietary workflow to correct for batch effects, a major issue in multi-omics caused by variability over time, variability across instruments, or sample handling discrepancies. Our tools remove these unwanted variations while preserving the true biological signals.
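
One simple, widely used form of batch correction (shown here only as a generic sketch, not Sapient’s proprietary workflow) is to median-center each feature within each batch, removing batch-level shifts while preserving within-batch biological variation:

```python
import pandas as pd

def median_center_batches(log_intensities: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Per-feature batch correction: shift each batch so its median matches the
    overall per-feature median, leaving within-batch (biological) variation intact."""
    overall_median = log_intensities.median()
    corrected = log_intensities.copy()
    for b in batch.unique():
        rows = (batch == b).to_numpy()
        corrected.loc[rows] = (log_intensities.loc[rows]
                               - log_intensities.loc[rows].median()
                               + overall_median)
    return corrected

# Toy example: two batches with an obvious batch shift in log2 intensity
values = pd.DataFrame({"metab_1": [5.0, 5.2, 6.0, 6.1]})
batches = pd.Series(["A", "A", "B", "B"])
print(median_center_batches(values, batches)["metab_1"].tolist())  # shift removed
```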

When it comes to extracting relevant biomarkers from among tens of thousands of molecules per sample, finding meaningful clinical signals can sometimes be like finding a needle in a haystack. We use statistical and ML methods to prioritize the molecules most strongly linked to biological or clinical outcomes, whether that’s a single biomarker or a multi-biomarker signature that can be developed into a panel.
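
As a hedged example of how a sparse multi-biomarker panel might be pulled out of thousands of candidate molecules – here using off-the-shelf L1-penalized logistic regression on simulated data, not Sapient’s actual methods:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_samples, n_features = 300, 5000                  # many molecules, few truly informative
X = rng.normal(size=(n_samples, n_features))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=n_samples) > 0).astype(int)

# The L1 penalty drives most coefficients to zero, leaving a sparse candidate panel.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_.ravel()
panel = np.flatnonzero(coefs)                      # indices of molecules retained in the panel
print("candidate panel size:", panel.size)
```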

These combined capabilities allow us to reliably identify robust, clinically relevant biomarkers that can drive forward the development of diagnostics, therapeutics and personalized medicines.

Can you walk us through an example where the data you generated led to a key scientific discovery?

Tao: We recently published a paper describing the application of our rLC-MS platform for non-targeted metabolomics in more than 62,000 human plasma samples from nearly 7,000 individuals. Several exciting discoveries were derived from this large-scale analysis, including the development of an ML-based metabolic aging clock model.

As we all know, two people of the same chronological age can have vastly different health profiles based on environmental, lifestyle and disease factors that may accelerate their biological age. Because our platform captures dynamic biomarkers, we asked: can our data be used to predict individual biological aging rates? We were able to train a model using a selection of key metabolites in our dataset, and through validating the model across different disease states, we found that it could accurately predict accelerated aging for individuals with chronic disorders. And, most importantly, it showed dynamic ‘reversal’ of accelerated aging following interventions like organ transplantation, offering novel insight into biological aging mechanisms as well as treatment response.
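
The basic mechanics of such a clock can be sketched on simulated data: fit a penalized regression of chronological age on metabolite levels, then treat the gap between predicted and chronological age as an age-acceleration estimate. The model choice, simulated values and variable names below are assumptions for illustration, not the published Sapient model:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_metabolites = 500, 200
X = rng.normal(size=(n_samples, n_metabolites))            # simulated metabolite levels
age = 40 + 2.0 * X[:, :10].sum(axis=1) + rng.normal(scale=3.0, size=n_samples)

X_train, X_test, age_train, age_test = train_test_split(X, age, random_state=0)
clock = ElasticNetCV(cv=5, random_state=0).fit(X_train, age_train)

predicted_age = clock.predict(X_test)
age_acceleration = predicted_age - age_test    # > 0 suggests a biologically 'older' profile
print("mean absolute error (years):", np.abs(age_acceleration).mean().round(2))
```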

We’ve also seen exciting discovery potential using our MS platform to analyze the intricate protein dynamics and tumor-associated antigens that drive tumor biology. As an example, we analyzed high-grade serous carcinoma tumor samples alongside normal adjacent tissue samples using our proteomics method to identify proteins differentially expressed between tumor and normal tissue. Not only did we identify several known and emerging oncological drug targets, but we also saw hundreds of other differentially expressed proteins in the tumors that may represent novel targets.
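
A minimal sketch of that kind of tumor-versus-normal comparison (generic per-protein t-tests with Benjamini–Hochberg correction on simulated log2 intensities, not Sapient’s actual analysis) might look like this:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def differential_proteins(tumor, normal, fdr=0.05, min_abs_log2fc=1.0):
    """Compare log2 protein intensities between tumor and adjacent normal tissue;
    return indices of proteins passing FDR and fold-change thresholds."""
    log2fc = tumor.mean(axis=0) - normal.mean(axis=0)
    pvals = stats.ttest_ind(tumor, normal, axis=0).pvalue
    reject, qvals, _, _ = multipletests(pvals, alpha=fdr, method="fdr_bh")
    hits = np.where(reject & (np.abs(log2fc) >= min_abs_log2fc))[0]
    return hits, log2fc[hits], qvals[hits]

# Simulated example: 20 tumor vs 20 normal samples, 1,000 proteins, 10 upregulated in tumor
rng = np.random.default_rng(2)
tumor = rng.normal(size=(20, 1000)); tumor[:, :10] += 2.0
normal = rng.normal(size=(20, 1000))
hits, fold_changes, qvals = differential_proteins(tumor, normal)
print("differentially expressed proteins:", hits)
```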

When collaborating with external partners who may have different research priorities, how do you ensure biomarker data generation aligns with both scientific objectives and practical applications?

Jeramie: We see it as our responsibility to align everyone around a shared path to success. Our customers often come to us with a problem and a goal, and it’s our job to help them get from A to B. They may not always be familiar with the types of experiments that can be run, or the limitations of different approaches. We can add a lot of unique, niche expertise and guidance to help shape the study design in a way that maximizes outcomes. By working collaboratively and transparently, we help ensure projects are set up for success from the start. That alignment has been key to consistently delivering results, especially on more complex or challenging initiatives.

Tao: From the data analysis side, the same principle applies. We work closely with partners to fully understand their underlying research or clinical questions so we can apply the most appropriate methods. But we also take a broader view, offering insights into next steps, potential follow-up analyses or additional questions they may not have considered.

We are uniquely equipped to generate everything from large-scale, non-targeted omics data for early discovery, all the way down to single, quantitative measures for a specific molecule for clinical applications. Our biocomputational team, along with the wet lab and quality assurance teams, can work with these datasets to support our clients’ scientific objectives and practical applications at any phase, from novel biomarker identification to deep biomarker characterization for translation, to full deployment of quantitative assays in clinical trials.

Our goal is to deliver not just data, but meaningful, actionable insights that move their science forward.



The opinions expressed in this interview are those of the interviewees and do not necessarily reflect the views of Bioanalysis Zone or Taylor & Francis Group.