Dr. Serafim Batzoglou, Chief Data Officer at Seer – Interview Series


Serafim Batzoglou is Chief Data Officer at Seer. Prior to joining Seer, Serafim served as Chief Data Officer at Insitro, leading machine learning and data science in their approach to drug discovery. Prior to Insitro, he served as VP of Applied and Computational Biology at Illumina, leading research and technology development of AI and molecular assays for making genomic data more interpretable in human health.

What initially attracted you to the field of genomics?

I became interested in the area of computational biology at the start of my PhD in computer science at MIT, when I took a class on the topic taught by Bonnie Berger, who became my PhD advisor, and David Gifford. The Human Genome Project was picking up pace during my PhD. Eric Lander, who was heading the Genome Center at MIT, became my PhD co-advisor and involved me in the project. Motivated by the Human Genome Project, I worked on whole-genome assembly and comparative genomics of human and mouse DNA.

I then moved to Stanford University as faculty in the Computer Science department, where I spent 15 years and was privileged to have advised about 30 extremely talented PhD students and many postdoctoral researchers and undergraduates. My team’s focus has been the application of algorithms, machine learning and software tool building to the analysis of large-scale genomic and biomolecular data. I left Stanford in 2016 to lead a research and technology development team at Illumina. Since then, I have enjoyed leading R&D teams in industry. I find that teamwork, the business aspect, and a more direct impact on society are characteristic of industry compared to academia. I have worked at innovative companies throughout my career: DNAnexus, which I co-founded in 2009, Illumina, insitro and now Seer. Computation and machine learning are essential across the technology chain in biotech, from technology development, to data acquisition, to biological data interpretation and translation to human health.

Over the past 20 years, sequencing the human genome has become vastly cheaper and faster. This has led to dramatic growth in the genome sequencing market and broader adoption across the life sciences industry. We are now at the cusp of having population-scale genomic, multi-omic and phenotypic data of sufficient size to meaningfully revolutionize healthcare, including prevention, diagnosis, treatment and drug discovery. We can increasingly discover the molecular underpinnings of disease for individuals through computational analysis of genomic data, and patients have the chance to receive treatments that are personalized and targeted, especially in the areas of cancer and rare genetic disease. Beyond the obvious uses in medicine, machine learning coupled with genomic information allows us to gain insights into other areas of our lives, such as our genealogy and nutrition. The next several years will see the adoption of personalized, data-driven healthcare, first for select groups of people, such as rare disease patients, and increasingly for the broad public.

Prior to your current role you were Chief Data Officer at Insitro, leading machine learning and data science in their approach to drug discovery. What were some of your key takeaways from that period about how machine learning can be used to accelerate drug discovery?

The conventional drug discovery and development “trial-and-error” paradigm is plagued with inefficiencies and extremely lengthy timelines. Bringing one drug to market can take upwards of $1 billion and over a decade. By incorporating machine learning into these efforts, we can dramatically reduce costs and timelines at several steps along the way. One step is target identification, where a gene or set of genes that modulate a disease phenotype or revert a diseased cellular state to a healthier state can be identified through large-scale genetic and chemical perturbations and phenotypic readouts such as imaging and functional genomics. Another step is compound identification and optimization, where a small molecule or other modality can be designed by machine learning-driven in silico prediction as well as in vitro screening, and desired properties of a drug such as solubility, permeability, specificity and non-toxicity can be optimized. The hardest, and perhaps most important, aspect is translation to humans. Here, the choice of the right model (induced pluripotent stem cell-derived lines, versus primary patient cell lines and tissue samples, versus animal models) for the right disease poses an incredibly important set of tradeoffs that ultimately determine the ability of the resulting data plus machine learning to translate to patients.

Seer Bio is pioneering new ways to decode the secrets of the proteome to improve human health. For readers who are unfamiliar with this term, what is the proteome?

The proteome is the changing set of proteins produced or modified by an organism over time and in response to environment, nutrition and health state. Proteomics is the study of the proteome within a given cell type or tissue sample. The genome of a human or other organism is static: with the important exception of somatic mutations, the genome at birth is the genome one has for their entire life, copied exactly in each cell of the body. The proteome is dynamic and changes on time spans of years, days and even minutes. As such, proteomes are vastly closer to phenotype, and ultimately to health status, than genomes are, and consequently more informative for monitoring health and understanding disease.

At Seer, we have developed a new way to access the proteome that provides deeper insights into proteins and proteoforms in complex samples such as plasma, a highly accessible sample that unfortunately has to date posed a great challenge for conventional mass spectrometry proteomics.

What is Seer’s Proteograph™ platform and how does it offer a new view of the proteome?

Seer’s Proteograph platform leverages a library of proprietary engineered nanoparticles, powered by a simple, rapid, and automated workflow, enabling deep and scalable interrogation of the proteome.

The Proteograph platform shines in interrogating plasma and other complex samples that exhibit a large dynamic range (many orders of magnitude difference in the abundance of the various proteins in the sample), where conventional mass spectrometry methods are unable to detect the low-abundance part of the proteome. Seer’s nanoparticles are engineered with tunable physicochemical properties that gather proteins across the dynamic range in an unbiased manner. In typical plasma samples, our technology enables detection of 5x to 8x more proteins than when processing neat plasma without the Proteograph. As a result, from sample prep to instrumentation to data analysis, our Proteograph Product Suite helps scientists find proteome disease signatures that might otherwise be undetectable. We like to say that at Seer, we are opening up a new gateway to the proteome.

Furthermore, we are allowing scientists to easily perform large-scale proteogenomic studies. Proteogenomics combines genomic data with proteomic data to identify and quantify protein variants, link genomic variants with protein abundance levels, and ultimately link the genome and the proteome to phenotype and disease, starting to disentangle the causal and downstream genetic pathways associated with disease.

Can you discuss some of the machine learning technology that is currently used at Seer Bio?

Seer is leveraging machine learning at all steps from technology development to downstream data analysis. These steps include: (1) design of our proprietary nanoparticles, where machine learning helps us determine which physicochemical properties and combinations of nanoparticles will work with specific product lines and assays; (2) detection and quantification of peptides, proteins, variants and proteoforms from the readout data produced by the MS instruments; (3) downstream proteomic and proteogenomic analyses in large-scale population cohorts.

Last year, we published a paper in Advanced Materials combining proteomics methods, nanoengineering and machine learning to improve our understanding of the mechanisms of protein corona formation. This paper uncovered nano-bio interactions and is informing Seer in the creation of improved future nanoparticles and products.

Beyond nanoparticle development, we have been developing novel algorithms to identify variant peptides and post-translational modifications (PTMs). We recently developed a method for detection of protein quantitative trait loci (pQTLs) that is robust to protein variants, a known confounder for affinity-based proteomics. We are extending this work to directly identify these peptides from the raw spectra using deep learning-based de novo sequencing methods, allowing search without inflating the size of spectral libraries.
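
To make the pQTL concept concrete, here is a minimal sketch of the standard statistical test underlying QTL detection: regress protein abundance on genotype dosage at a variant and test the slope. The simulated data and the additive dosage encoding are illustrative assumptions only, not Seer’s variant-robust method.

```python
# Minimal pQTL sketch (illustrative assumptions, not Seer's pipeline):
# regress protein abundance on genotype dosage and test the slope.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 500                                  # individuals
dosage = rng.integers(0, 3, size=n)      # 0/1/2 copies of the alternate allele
# Simulated abundance: a true additive genetic effect plus noise.
abundance = 0.4 * dosage + rng.normal(0.0, 1.0, size=n)

# The p-value of the regression slope is the evidence that this
# locus is a pQTL for the protein.
fit = stats.linregress(dosage, abundance)
print(f"beta = {fit.slope:.3f}, p = {fit.pvalue:.2e}")
```

A variant-robust method additionally has to ensure that the abundance estimate itself is not biased by variant peptides that alter detection; the regression step stays the same.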

Our team is also developing methods that enable scientists without deep expertise in machine learning to optimally tune and utilize machine learning models in their discovery work. This is done via a Seer ML framework based on an AutoML tool, which enables efficient hyperparameter tuning via Bayesian optimization.
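
Seer’s framework is internal, but the general technique is easy to illustrate. The sketch below runs a Bayesian-style hyperparameter search with the open-source Optuna library, whose default TPE sampler is a form of Bayesian optimization; the random-forest task and the search ranges are arbitrary assumptions for the example.

```python
# Illustrative Bayesian-style hyperparameter tuning with Optuna
# (an open-source tool; not Seer's internal framework).
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

def objective(trial):
    # Optuna proposes hyperparameters from these ranges using its
    # TPE sampler, refining proposals as trials accumulate.
    n_estimators = trial.suggest_int("n_estimators", 50, 400)
    max_depth = trial.suggest_int("max_depth", 2, 16)
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```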

Finally, we are developing methods to reduce batch effects and improve the quantitative accuracy of the mass spec readout by modeling the measured quantitative values so as to maximize expected metrics, such as the correlation of intensity values across peptides within a protein group.
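
To make that metric concrete, here is a toy sketch: score a batch correction by the mean pairwise correlation of peptide log-intensities within a protein group. The simulated data, the additive per-peptide batch shifts and the median-centering correction are all illustrative assumptions, not Seer’s modeling approach.

```python
# Toy sketch (assumptions throughout, not Seer's method): score a batch
# correction by within-protein-group peptide intensity correlation.
import numpy as np

rng = np.random.default_rng(1)

# 4 peptides from one protein group across 40 runs; runs 20..39 acquire
# peptide-specific additive shifts in log space (a batch effect).
true_protein = rng.normal(0.0, 1.0, size=40)        # per-run protein level
log_intensity = true_protein + rng.normal(0.0, 0.3, size=(4, 40))
log_intensity[:, 20:] += rng.normal(0.0, 1.5, size=(4, 1))
batch = np.array([0] * 20 + [1] * 20)

def mean_pairwise_corr(x):
    # Mean correlation between peptide profiles (upper triangle only).
    c = np.corrcoef(x)
    return c[np.triu_indices_from(c, k=1)].mean()

# A deliberately simple correction: median-center each peptide within
# each batch, removing additive per-peptide batch shifts.
corrected = log_intensity.copy()
for b in np.unique(batch):
    cols = batch == b
    corrected[:, cols] -= np.median(corrected[:, cols], axis=1, keepdims=True)

print("correlation before:", round(mean_pairwise_corr(log_intensity), 3))
print("correlation after: ", round(mean_pairwise_corr(corrected), 3))
```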

Hallucinations are a common issue with LLMs. What are some of the solutions to prevent or mitigate them?

LLMs are generative methods that are given a large corpus and are trained to generate similar text. They capture the underlying statistical properties of the text they are trained on, from simple local properties such as how often certain combinations of words (or tokens) are found together, to higher-level properties that emulate understanding of context and meaning.

However, LLMs are not primarily trained to be correct. Reinforcement learning with human feedback (RLHF) and other techniques help train them for desirable properties, including correctness, but are not fully successful. Given a prompt, LLMs will generate text that most closely resembles the statistical properties of the training data. Often, this text is also correct. For example, if asked “when was Alexander the Great born,” the correct answer is 356 BC (or BCE), and an LLM is likely to give that answer because within the training data Alexander the Great’s birth frequently appears with this value. However, when asked “when was Empress Reginella born,” a fictional character not present in the training corpus, the LLM is likely to hallucinate and create a story of her birth. Similarly, when asked a question for which the LLM cannot retrieve a correct answer (either because the right answer does not exist, or for other statistical reasons), it is likely to hallucinate and answer as if it knows. This makes hallucinations an obvious problem for serious applications, such as “how can such and such cancer be treated.”

There are no perfect solutions yet for hallucinations; they are endemic to the design of LLMs. One partial solution is proper prompting, such as asking the LLM to “think carefully, step by step,” and so forth. This increases the likelihood that the LLM will not concoct stories. A more sophisticated approach being developed is the use of knowledge graphs. Knowledge graphs provide structured data: entities in a knowledge graph are connected to other entities in a predefined, logical way. Constructing a knowledge graph for a given domain is of course a challenging task, but doable with a combination of automated and statistical methods and curation. With a built-in knowledge graph, LLMs can cross-check the statements they generate against the structured set of known facts, and can be constrained to not generate a statement that contradicts or is not supported by the knowledge graph.
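
As a toy illustration of that cross-check (not a production system), suppose generated claims have already been parsed into (subject, relation, object) triples; the tiny graph and the claims below, reusing the examples above, are assumptions for illustration.

```python
# Toy cross-check of LLM claims against a knowledge graph. Assumes
# claims are already extracted as (subject, relation, object) triples;
# real systems need far richer graphs and automated claim extraction.

# The knowledge graph: a structured set of known facts.
kg = {
    ("Alexander the Great", "born_in", "356 BC"),
    ("Alexander the Great", "tutored_by", "Aristotle"),
}

def check_claim(triple):
    """Classify a generated triple against the graph."""
    subject, relation, _ = triple
    if triple in kg:
        return "supported"
    # Same subject and relation but a different object contradicts the
    # graph; anything else is merely unsupported by it.
    if any(s == subject and r == relation for s, r, _ in kg):
        return "contradicted"
    return "unsupported"

for claim in [
    ("Alexander the Great", "born_in", "356 BC"),   # supported
    ("Alexander the Great", "born_in", "412 BC"),   # contradicted
    ("Empress Reginella", "born_in", "301 AD"),     # unsupported
]:
    print(claim, "->", check_claim(claim))
```

A generator constrained by such a check would suppress or revise the contradicted and unsupported statements rather than emit them.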

Because of the fundamental issue of hallucinations, and arguably because of their lack of sufficient reasoning and judgment abilities, LLMs are today powerful for retrieving, connecting and distilling information, but cannot replace human experts in serious applications such as medical diagnosis or legal advice. Still, they can tremendously increase the efficiency and capability of human experts in these domains.

Can you share your vision for a future where biology is steered by data rather than hypotheses?

The traditional hypothesis-driven approach, in which researchers find patterns, develop hypotheses, perform experiments or studies to test them, and then refine theories based on the data, is being supplanted by a new paradigm based on data-driven modeling.

In this emerging paradigm, researchers start with hypothesis-free, large-scale data generation. Then, they train a machine learning model such as an LLM with the objective of accurately reconstructing occluded data, or of strong regression or classification performance on diverse downstream tasks. Once the machine learning model can accurately predict the data, achieving fidelity comparable to the similarity between experimental replicates, researchers can interrogate the model to extract insight about the biological system and discern the underlying biological principles.
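
As a cartoon of the “reconstruct occluded data” objective, the sketch below trains a tiny transformer to fill in masked positions of a sequence; the random token “sequences,” model size and masking rate are illustrative assumptions, not a real biomolecular model.

```python
# Cartoon of masked reconstruction: occlude random positions in a
# sequence and train a model to predict the original tokens there.
import torch
import torch.nn as nn

vocab, mask_id = 20, 0                    # e.g. an amino-acid alphabet + [MASK]
model = nn.Sequential(
    nn.Embedding(vocab + 1, 64),          # +1 embedding slot for the mask token
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(64, vocab),                 # logits over the real tokens
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    seqs = torch.randint(1, vocab + 1, (32, 50))   # toy batch of sequences
    mask = torch.rand(seqs.shape) < 0.15           # occlude 15% of positions
    inputs = seqs.masked_fill(mask, mask_id)
    logits = model(inputs)                         # (batch, seq_len, vocab)
    # Loss only on occluded positions; tokens 1..vocab map to classes 0..vocab-1.
    loss = loss_fn(logits[mask], seqs[mask] - 1)
    opt.zero_grad()
    loss.backward()
    opt.step()
```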

LLMs are proving to be especially good at modeling biomolecular data, and are poised to fuel a shift from hypothesis-driven to data-driven biological discovery. This shift will become increasingly pronounced over the next 10 years and will allow accurate modeling of biomolecular systems at a granularity that goes well beyond human capacity.

What is the potential impact on disease diagnosis and drug discovery?

I believe LLMs and generative AI will lead to significant changes in the life sciences industry. One area that will benefit greatly from LLMs is clinical diagnosis, especially for rare, difficult-to-diagnose diseases and cancer subtypes. There are tremendous amounts of comprehensive patient information that we can tap into, from genomic profiles, treatment responses, medical records and family history, to drive accurate and timely diagnosis. If we can find a way to compile all these data such that they are easily accessible, and not siloed by individual health organizations, we can dramatically improve diagnostic precision. This is not to imply that machine learning models, including LLMs, will be able to operate autonomously in diagnosis. Because of their technical limitations, in the foreseeable future they will not be autonomous, but will instead augment human experts. They will be powerful tools that help the physician provide exquisitely informed assessments and diagnoses in a fraction of the time needed to date, and properly document and communicate those diagnoses to the patient as well as to the entire network of health providers connected through the machine learning system.

The industry is already leveraging machine learning for drug discovery and development, touting its ability to reduce costs and timelines compared to the traditional paradigm. LLMs further add to the available toolbox, providing excellent frameworks for modeling large-scale biomolecular data including genomes, proteomes, functional genomic and epigenomic data, single-cell data, and more. In the foreseeable future, foundation LLMs will undoubtedly connect across all these data modalities and across large cohorts of individuals whose genomic, proteomic and health information has been collected. Such LLMs will aid in the generation of promising drug targets, identify likely pockets of activity of proteins associated with biological function and disease, and suggest pathways and more complex cellular functions that can be modulated in a specific way with small molecules or other drug modalities. We can also tap into LLMs to identify drug responders and non-responders based on genetic susceptibility, or to repurpose drugs for other disease indications. Many of the existing innovative AI-based drug discovery companies are undoubtedly already starting to think and develop in this direction, and we should expect to see the formation of additional companies as well as public efforts aimed at the deployment of LLMs in human health and drug discovery.

Thank you for the detailed interview; readers who wish to learn more should visit Seer.
